Samsung PIM Architecture and Instructions

2024-02-08 22:15:12 +01:00
parent c895d71f74
commit af4e559006
7 changed files with 134 additions and 11 deletions
--- a/src/acronyms.tex
+++ b/src/acronyms.tex
@@ -211,6 +211,22 @@
    short = SRF,
    long = scalar register file,
 }
+\DeclareAcronym{fp16}{
+    short = FP16,
+    long = 16-bit floating-point,
+}
+\DeclareAcronym{fp32}{
+    short = FP32,
+    long = 32-bit floating-point,
+}
+\DeclareAcronym{relu}{
+    short = ReLU,
+    long = rectified linear unit,
+}
+\DeclareAcronym{aam}{
+    short = AAM,
+    long = address aligned mode,
+}
 \DeclareAcronym{tlm}{
    short = TLM,
    long = transaction-level modeling,
--- a/src/chapters/dram.tex
+++ b/src/chapters/dram.tex
@@ -111,7 +111,7 @@ For example, compared to a conventional \ac{ddr4} \ac{dram}, this tight integrat

 One memory stack supports up to 8 independent memory channels, each of which containing up to 16 banks, which are divided into 4 bank groups.
 The command, address and data bus operate at \ac{ddr}, i.e., they transfer two words per interface clock cycle $t_{CK}$.
-With a $t_{CK}$ of $\qty{1}{\giga\hertz}$ \aca{hbm} achieves a pin transfer rate of $\qty{2}{\giga T \per\second}$, resulting in $\qty[per-mode = symbol]{256}{\giga\byte\per\second}$ for the 1024-bit wide data bus of each stack.
+With a $t_{CK}$ of $\qty{1}{\giga\hertz}$, \aca{hbm} achieves a pin transfer rate of $\qty{2}{\giga T \per\second}$, which gives $\qty[per-mode=symbol]{16}{\giga\byte\per\second}$ per \ac{pch} and a total of $\qty[per-mode = symbol]{256}{\giga\byte\per\second}$ for the 1024-bit wide data bus of each stack.
 A single data transfer is performed with either a \ac{bl} of 2 or 4, depending on the \ac{pch} configuration.
 In \ac{pch} mode, the data bus is split in half (i.e., 64-bit) to enable independent data transmission, further increasing parallelism while sharing a common command and address bus between the two \acp{pch}.
 Thus, accessing \aca{hbm} in \ac{pch} mode transmits a $\qty{256}{\bit}=\qty{32}{\byte}$ burst with a \ac{bl} of 4 over the $\qty{64}{\bit}$ wide data bus.
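+
+The bandwidth figures above can be retraced with a short calculation, assuming a 64-bit data bus per \ac{pch} and 16 \acp{pch} per stack (8 channels with two \acp{pch} each):
+\[
+	\qty{2}{\giga T \per\second} \cdot \qty{64}{\bit} = \qty[per-mode=symbol]{128}{\giga\bit\per\second} = \qty[per-mode=symbol]{16}{\giga\byte\per\second}
+\]
+\[
+	16 \cdot \qty[per-mode=symbol]{16}{\giga\byte\per\second} = \qty[per-mode=symbol]{256}{\giga\byte\per\second}
+\]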
@@ -123,7 +123,7 @@ In the center of the die, the \acp{tsv} connect to the next die above or the pre
 \begin{figure}
 	\centering
 	\includegraphics[width=0.8\linewidth]{images/hbm}
-	\caption[\aca{hbm} memory die architecture]{\aca{hbm} memory die architecture \cite{lee2021}}
+	\caption[\aca{hbm} memory die architecture]{\aca{hbm} memory die architecture \cite{lee2021}.}
 	\label{img:hbm}
 \end{figure}

--- a/src/chapters/pim.tex
+++ b/src/chapters/pim.tex
@@ -42,14 +42,14 @@ Secondly, \ac{pim} comes with the further limitation that it can only accelerate
 \label{sec:pim_architectures}

 Many different \ac{pim} architectures have been proposed by research in the past, and more recently real implementations have been presented by hardware vendors.
-These proposals differ largely in the positioning of the processing operation applied, ranging from analogue distribution of capacitor charges at the \ac{subarray} level to additional processing units at the global \ac{io} level. 
+These proposals differ largely in the positioning of the processing operation applied, ranging from analogue distribution of capacitor charges at the \ac{subarray} level to additional processing units at the global \ac{io} level.
 In essence, these placements of the approaches can be summarized as follows \cite{sudarshan2022}:

 \begin{enumerate}
-\item Inside the memory \ac{subarray}.
-\item In the \ac{psa} region near a \ac{subarray}.
-\item Outside the bank in its peripheral region.
-\item In the \ac{io} region of the memory.
+	\item Inside the memory \ac{subarray}.
+	\item In the \ac{psa} region near a \ac{subarray}.
+	\item Outside the bank in its peripheral region.
+	\item In the \ac{io} region of the memory.
 \end{enumerate}

 Each of these approaches comes with different advantages and disadvantages.
@@ -109,7 +109,7 @@ To make full use of the output buffering, the matrix rows are interleaved in an
 \begin{figure}
 	\centering
 	\input{images/hynix}
-	\caption[Newton memory layout for a \ac{gemv} operation]{Newton memory layout for a \ac{gemv} operation \cite{he2020}}
+	\caption[Newton memory layout for a \ac{gemv} operation]{Newton memory layout for a \ac{gemv} operation \cite{he2020}.}
 	\label{img:hynix}
 \end{figure}

@@ -125,6 +125,8 @@ As a result, Newton promises a $\qtyrange{10}{54}{\times}$ speedup compared to a
 \subsection{FIMDRAM/HBM-PIM}
 \label{sec:pim_fim}

+\subsubsection{Architecture}
+
 One year after SK Hynix, the major \ac{dram} manufacturer Samsung announced its own \ac{pim} \ac{dram} implementation, called \ac{fimdram} or \ac{hbm}-\ac{pim}.
 As the name suggests, it is based on the \aca{hbm} memory standard, and it integrates 16-wide \ac{simd} engines directly into the memory banks, exploiting bank-level parallelism, while retaining the highly optimized \acp{subarray} \cite{kwon2021}.
 A major difference from Newton \ac{pim} is that \ac{hbm}-\ac{pim} does not require any changes to components of modern processors, such as the memory controller, i.e. it is agnostic to existing \aca{hbm} platforms.
@@ -134,17 +136,110 @@ Fortunately, as discussed in Section \ref{sec:hbm}, the architecture of \ac{hbm}
 At the heart of the \ac{hbm}-\ac{pim} are the \ac{pim} execution units, which are shared by two banks of a \ac{pch}.
 They include 16 16-bit wide \ac{simd} \acp{fpu}, \acp{crf}, \acp{grf} and \acp{srf} \cite{lee2021}.
 This general architecture is shown in detail in Figure \ref{img:fimdram}, with (a) the placement of the \ac{pim} units between the memory banks of a \ac{dram} die, with (b) a bank coupled to its \ac{pim} unit, and (c) the data path around a \ac{fpu} within the \ac{pim} unit.
+
 \begin{figure}
 	\centering
 	\includegraphics[width=\linewidth]{images/fimdram}
-	\caption[Architecture of \ac{hbm}-\ac{pim}]{Architecture of \ac{hbm}-\ac{pim} \cite{lee2021}}
+	\caption[Architecture of \ac{hbm}-\ac{pim}]{Architecture of \ac{hbm}-\ac{pim} \cite{lee2021}.}
 	\label{img:fimdram}
 \end{figure}
+
 As can be seen in (c), the input data to the \ac{fpu} can either come directly from the memory bank, from a \ac{grf}/\ac{srf} or from the result bus of a previous computation.
 The 16-wide \ac{simd} units correspond to the 256-bit prefetch architecture of \aca{hbm}, where 16 16-bit floating-point operands are passed directly from the \acp{psa} to the \acp{fpu} from a single memory access.
 As all \ac{pim} units operate in parallel, with 16 banks per \ac{pch}, a singular memory access loads a total of $\qty{256}{\bit}*\qty{16}{banks}=\qty{4096}{\bit}$ into the \acp{fpu}.
 As a result, the theoretical internal bandwidth of \ac{hbm}-\ac{pim} is $\qty{16}{\times}$ higher than the connection to the external bus to the host processor.

+\Ac{hbm}-\ac{pim} defines three operating modes:
+\begin{enumerate}
+	\item \textbf{Single Bank Mode}:
+	      This is the default operating mode, where \ac{hbm}-\ac{pim} has identical behavior to normal \aca{hbm} memory.
+	      To switch to another mode, a specific sequence of \ac{act} and \ac{pre} commands must be sent by the memory controller to a specific row address.
+	\item \textbf{All-Bank Mode}:
+	      The all-bank mode is an extension of the single bank mode where the \ac{pim} execution units allow for concurrent access to half of the \ac{dram} banks at the same time.
+	      This provides $\qty{8}{\times}$ more bandwidth than the standard operation mode, which can be used for the initialization of memory regions across all banks.
+	\item \textbf{All-Bank-\ac{pim} Mode}:
+	      With another predefined \ac{dram} access sequence, the memory switches to the \ac{pim} enabled mode.
+	      In this mode, a single memory access initiates the concurrent execution of the next instruction across all processing units.
+	      In addition, the \ac{io} circuits of the \ac{dram} are completely disabled in this mode, reducing the power required during \ac{pim} operation.
+\end{enumerate}

-% differences to Hynix PIM
-% benchmark results from Samsung...
+In both all-bank mode and all-bank-\ac{pim} mode, the internal bandwidth per \ac{pch} is $\qty{8}{\times}$ higher than the external \aca{hbm} bandwidth of $\qty[per-mode=symbol]{16}{\giga\byte\per\second}$, i.e., $\qty[per-mode=symbol]{128}{\giga\byte\per\second}$ per \ac{pch} or $\qty[per-mode=symbol]{2}{\tera\byte\per\second}$ in total for 16 \acp{pch}.
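+
+As a plausibility check of these figures, and assuming that the factor of 8 stems from half of the 16 banks of a \ac{pch} being accessed concurrently, the numbers can be reproduced as follows:
+\[
+	8 \cdot \qty[per-mode=symbol]{16}{\giga\byte\per\second} = \qty[per-mode=symbol]{128}{\giga\byte\per\second}
+\]
+\[
+	16 \cdot \qty[per-mode=symbol]{128}{\giga\byte\per\second} = \qty[per-mode=symbol]{2048}{\giga\byte\per\second} \approx \qty[per-mode=symbol]{2}{\tera\byte\per\second}
+\]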
+
+\subsubsection{Processing Unit}
+
+Because \ac{hbm}-\ac{pim} targets \ac{dnn} applications, the native data type of the \acp{fpu} is \ac{fp16}.
+This choice is motivated by the significantly lower area and power requirements of \ac{fp16} \acp{fpu} compared to \ac{fp32}, as well as by the good support for \ac{fp16} in modern processor architectures.
+The \ac{simd} \ac{fpu} is implemented once as a \ac{fp16} multiplier unit, and once as a \ac{fp16} adder unit, providing support for these basic arithmetic operations.
+In addition to the \acp{fpu}, a processing unit also consists of \acp{crf}, \acp{srf} and \acp{grf}.
+The \ac{crf} acts as an instruction buffer, holding the 32 32-bit instructions to be executed by the processor when accessing memory.
+As explained earlier, the operands come either directly from the bank or from the \acp{srf} or \acp{grf}.
+Each \ac{grf} consists of 16 registers, each 256 bits wide to match the \aca{hbm} prefetch size, so that each entry can hold the data of a full memory burst.
+The \ac{grf} of a processing unit is divided into two halves (\ac{grf}-A and \ac{grf}-B), with 8 register entries allocated to each of the two banks.
+Finally, in the \acp{srf}, a 16-bit scalar value is replicated 16 times as it is fed into the 16-wide \ac{simd} \ac{fpu} as a constant summand or factor for an addition or multiplication.
+The \ac{srf} is also divided into two halves (\ac{srf}-A and \ac{srf}-M), for addition and multiplication respectively, with 8 entries each.
+This processing unit architecture is illustrated in Figure \ref{img:pcu}, along with the local bus interfaces to its even and odd banks, and the control unit that, among other things, decodes the instructions and keeps track of the program counter.
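+
+To put the register file sizes into perspective, the storage per processing unit can be estimated from the entry counts given above; these totals are derived here and are not quoted from the cited sources:
+\begin{itemize}
+	\item \ac{crf}: $32 \cdot \qty{32}{\bit} = \qty{1024}{\bit} = \qty{128}{\byte}$
+	\item \ac{grf}: $16 \cdot \qty{256}{\bit} = \qty{4096}{\bit} = \qty{512}{\byte}$
+	\item \ac{srf}: $2 \cdot 8 \cdot \qty{16}{\bit} = \qty{256}{\bit} = \qty{32}{\byte}$
+\end{itemize}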
+
+\begin{figure}
+	\centering
+	\includegraphics[width=0.8\linewidth]{images/pcu}
+	\caption[Architecture of a \ac{pim} processing unit]{Architecture of a \ac{pim} processing unit \cite{lee2021}.}
+	\label{img:pcu}
+\end{figure}
+
+\subsubsection{Instruction Set}
+
+The \ac{hbm}-\ac{pim} processing units provide a total of 9 32-bit \ac{risc} instructions, each of which falls into one of three groups: control flow instructions, arithmetic instructions and data movement instructions.
+The data layout of these three instruction groups is shown in Table \ref{tab:isa}.
+
+\begin{table}
+	\centering
+	\includegraphics[width=\linewidth]{images/isa}
+	\caption[The instruction format of the processing units]{The instruction format of the processing units \cite{lee2021}.}
+	\label{tab:isa}
+\end{table}
+
+For the control flow instructions, there is NOP, which does not perform any operation, JUMP, which jumps to an instruction at a given offset for a fixed number of iterations, and EXIT, which restores the internal state of the processing unit.
+The arithmetic instructions perform operations such as simple ADD and MUL, but also support \ac{mac} and \ac{mad} operations, which are key for accelerating \ac{dnn} applications.
+Finally, the MOV and FILL instructions are used to move data between the memory banks and the \ac{grf} and \ac{srf} register files.
+
+The DST and SRC fields specify the operand type, i.e., the register file or bank affected by the operation.
+Depending on the source or destination operand types, the instruction encodes indices for the concrete element in the register files, which are denoted in Table \ref{tab:isa} by \textit{\#} symbols.
+The special field \textit{R} for the data movement instruction type enables a \ac{relu} operation, i.e., clamping negative values to zero, while the data is moved to another location.
+Another special field \textit{A} enables the \ac{aam}, which will be explained in more detail in Section \ref{sec:instruction_ordering}.
+
+\begin{table}
+	\centering
+	\resizebox{\linewidth}{!}{%
+	\begin{tblr}{
+	  hlines,
+	  vlines,
+	  hline{2} = {-}{solid,black},
+	  hline{2} = {2}{-}{solid,black},
+	}
+	Type       & Command & Description                                       & Result (DST) & Operand (SRC0) & Operand (SRC1) & Operand (SRC2) \\
+	Control    & NOP     & no operation                                      &              &                &                &                \\
+	Control    & JUMP    & jump instruction                                  &              &                &                &                \\
+	Control    & EXIT    & exit instruction                                  &              &                &                &                \\
+	Data       & MOV     & {move data\\from bank/register\\to register}      & GRF, SRF     & GRF, BANK      &                &                \\
+	Data       & FILL    & {move data\\from bank/register\\to bank/register} & GRF, BANK    & GRF, BANK      &                &                \\
+	Arithmetic & ADD     & addition                                          & GRF          & GRF, BANK, SRF & GRF, BANK, SRF &                \\
+	Arithmetic & MUL     & multiplication                                    & GRF          & GRF, BANK      & GRF, BANK, SRF & GRF, BANK, SRF \\
+	Arithmetic & MAC     & multiply-accumulate                               & GRF-B        & GRF, BANK      & GRF, BANK, SRF & GRF, BANK, SRF \\
+	Arithmetic & MAD     & multiply-and-add                                  & GRF          & GRF, BANK      & GRF, BANK, SRF &                
+	\end{tblr}}
+	\caption[A list of the supported instructions with possible sources and destinations]{A list of the supported instructions with possible sources and destinations \cite{shin-haengkang2023}.}
+	\label{tab:instruction_set}
+\end{table}
+
+Table \ref{tab:instruction_set} gives an overview of all available instructions and defines the possible operand sources and destinations.
+Note that some operations require either a \ac{rd} or a \ac{wr} access to execute properly.
+For example, to write the resulting output vector from a \ac{grf} to the memory banks, the memory controller must issue a \ac{wr} command to write to the bank.
+Likewise, reading from the banks requires a \ac{rd} command.
+For the control instructions and for arithmetic instructions without the bank as a source operand, either a \ac{rd} or a \ac{wr} can be issued to execute the instruction.
+For the rest of this thesis, it is assumed that a \ac{rd} is issued for these instructions.
+
+\subsubsection{Instruction Ordering}
+\label{sec:instruction_ordering}
+
+\subsubsection{Memory Layout}
+
+\subsubsection{Performance and Power Efficiency Effects}
--- a/src/doc.bib
+++ b/src/doc.bib
@@ -446,6 +446,16 @@
  file = {/home/derek/Nextcloud/Verschiedenes/Zotero/storage/3J45PFD2/Seshadri und Mutlu - 2020 - In-DRAM Bulk Bitwise Execution Engine.pdf;/home/derek/Nextcloud/Verschiedenes/Zotero/storage/DTK64DHZ/1905.html}
 }

+@misc{shin-haengkang2023,
+  title = {{{PIMSimulator}}},
+  author = {{Shin-haeng Kang} and {Sanghoon Cha} and {Seungwoo Seo} and {Jin-seong Kim}},
+  year = {2023},
+  month = nov,
+  url = {https://github.com/SAITPublic/PIMSimulator},
+  urldate = {2024-02-08},
+  abstract = {Processing-In-Memory (PIM) Simulator}
+}
+
@misc{src2021,
  title = {Decadal {{Plan}} for {{Semiconductors}}},
  author = {{SRC}},
--- a/src/images/isa.pdf
+++ b/src/images/isa.pdf
--- a/src/images/pcu.pdf
+++ b/src/images/pcu.pdf
--- a/src/index.tex
+++ b/src/index.tex
@@ -22,6 +22,8 @@
 \usepackage{pgfplots}
 \usepackage{bytefield}
 \usepackage{mathdots}
+\usepackage{tabularray}
+\usepackage{makecell}

 % Configurations
 \usetikzlibrary{matrix}