FIMDRAM Instruction Ordering

2024-02-11 20:38:26 +01:00
parent af4e559006
commit b554efe3e8
4 changed files with 85 additions and 23 deletions
--- a/src/acronyms.tex
+++ b/src/acronyms.tex
@@ -227,6 +227,14 @@
    short = AAM,
    long = address aligned mode,
 }
+\DeclareAcronym{ld}{
+    short = LD,
+    long = load,
+}
+\DeclareAcronym{st}{
+    short = ST,
+    long = store,
+}
 \DeclareAcronym{tlm}{
    short = TLM,
    long = transaction-level modeling,
--- a/src/chapters/pim.tex
+++ b/src/chapters/pim.tex
@@ -125,10 +125,12 @@ As a result, Newton promises a $\qtyrange{10}{54}{\times}$ speedup compared to a
 \subsection{FIMDRAM/HBM-PIM}
 \label{sec:pim_fim}

-\subsubsection{Architecture}
-
 One year after SK Hynix, the major \ac{dram} manufacturer Samsung announced its own \ac{pim} \ac{dram} implementation, called \ac{fimdram} or \ac{hbm}-\ac{pim}.
-As the name suggests, it is based on the \aca{hbm} memory standard, and it integrates 16-wide \ac{simd} engines directly into the memory banks, exploiting bank-level parallelism, while retaining the highly optimized \acp{subarray} \cite{kwon2021}.
+As this is the \ac{pim} architecture that was implemented during the work on this thesis, it is explained in detail.
+The following subsections are mainly based on \cite{lee2021} and \cite{kwon2021}, while Section \ref{sec:memory_layout} is mainly based on \cite{kang2022}.
+
+\subsubsection{Architecture}
+As the name of \ac{hbm}-\ac{pim} suggests, it is based on the \aca{hbm} memory standard, and it integrates 16-wide \ac{simd} engines directly into the memory banks, exploiting bank-level parallelism, while retaining the highly optimized \acp{subarray} \cite{kwon2021}.
 A major difference from Newton \ac{pim} is that \ac{hbm}-\ac{pim} does not require any changes to components of modern processors, such as the memory controller, i.e. it is agnostic to existing \aca{hbm} platforms.
 Consequently, mode switching is required for \ac{hbm}-\ac{pim}, making it less useful for interleaved \ac{pim} and non-\ac{pim} traffic.
 Fortunately, as discussed in Section \ref{sec:hbm}, the architecture of \ac{hbm} allows for many independent memory channels on a single stack, making it possible to cleanly separate the memory map into a \ac{pim}-enabled region and a normal \ac{hbm} region.
@@ -167,16 +169,18 @@ Both in all-bank mode and in all-bank-\ac{pim} mode, the total \aca{hbm} bandwid

 \subsubsection{Processing Unit}

-Due to the focus on \ac{dnn} applications in \ac{hbm}-\ac{pim}, the native data type for the \acp{fpu} is \ac{fp16}, which is motivated by the significantly lower area and power requirements for \acp{fpu} compared to \ac{fp32}, as well as the good support of \ac{fp16} for modern processor architectures.
+Due to the focus on \ac{dnn} applications in \ac{hbm}-\ac{pim}, the native data type for the \acp{fpu} is \ac{fp16}, which is motivated by the significantly lower area and power requirements for \acp{fpu} compared to \ac{fp32}.
+In addition, \ac{fp16} is well supported on modern processor architectures such as ARMv8.
 The \ac{simd} \ac{fpu} is implemented once as a \ac{fp16} multiplier unit, and once as a \ac{fp16} adder unit, providing support for these basic algorithmic operations.
 In addition to the \acp{fpu}, a processing unit consists also of \acp{crf}, \acp{srf} and \acp{grf}.
-The \ac{crf} acts as an instruction buffer, holding the 32 32-bit instructions to be executed by the processor when accessing memory.
-As explained earlier, the operands come either directly from the bank or from the \acp{srf} or \acp{grf}.
-Each \ac{grf} consists of 16 256-bit registers, each with the \aca{hbm} prefetch size of 256 bits, where each entry can hold the data of a full memory burst.
+The \ac{crf} acts as an instruction buffer, holding the 32 32-bit instructions to be executed by the processor when performing a memory access.
+A program stored in the \ac{crf} is called a \textit{microkernel}.
+As explained earlier, the operands of an instruction come either directly from the bank or from the \acp{srf} or \acp{grf}.
+Each \ac{grf} consists of 16 registers, each with the \aca{hbm} prefetch size of 256 bits, where each entry can hold the data of a full memory burst.
 The \ac{grf} of a processing unit is divided into two halves (\ac{grf}-A and \ac{grf}-B), with 8 register entries allocated to each of the two banks.
 Finally, in the \acp{srf}, a 16-bit scalar value is replicated 16 times as it is fed into the 16-wide \ac{simd} \ac{fpu} as a constant summand or factor for an addition or multiplication.
 It is also divided into two halves (\ac{srf}-A and \ac{srf}-M) for addition and multiplication with 8 entries each.
-This processing unit architecture is illustrated in Figure \ref{img:pcu}, along with the local bus interfaces to its even and odd banks, and the control unit that, among other things, decodes the instructions and keeps track of the program counter.
+This processing unit architecture is illustrated in Figure \ref{img:pcu}, along with the local bus interfaces to its even and odd bank, and the control unit that decodes the instructions and keeps track of the program counter.

 \begin{figure}
 	\centering
@@ -198,10 +202,12 @@ The data layout of these three instruction groups is shown in Table \ref{tab:isa
 \end{table}

 For the control flow instructions, there is NOP, which does not perform any operation, JUMP, which performs a fixed iteration jump to an offset instruction, and EXIT, which restores the internal state of the processing unit.
+It is important to note that the JUMP instruction is a zero-cycle instruction, i.e. it is executed together with the instruction that precedes it.
 The arithmetic instructions perform operations such as simple ADD and MUL, but also support \ac{mac} and \ac{mad} operations, which are key for accelerating \ac{dnn} applications.
 Finally, the MOV and FILL instructions are used to move data between the memory banks and the \ac{grf} and \ac{srf} register files.

-The DST and SRC fields specify the operand type, i.e., the register file or bank affected by the operation.
+The DST and SRC fields specify the operand type.
+This type identifies the register file or bank affected by the operation.
 Depending on the source or destination operand types, the instruction encodes indices for the concrete element in the register files, which are denoted in Table \ref{tab:isa} by \textit{\#} symbols.
 The special field \textit{R} for the data movement instruction type enables a \ac{relu} operation, i.e., clamping negative values to zero, while the data is moved to another location.
 Another special field \textit{A} enables the \ac{aam}, which will be explained in more detail in Section \ref{sec:instruction_ordering}.
@@ -222,11 +228,11 @@ Another special field \textit{A} enabled the \ac{aam}, which will be explained i
 	Data       & MOV     & {move data\\from bank/register\\to register}      & GRF, SRF     & GRF, BANK      &                &                \\
 	Data       & FILL    & {move data\\from bank/register\\to bank/register} & GRF, BANK    & GRF, BANK      &                &                \\
 	Arithmetic & ADD     & addition                                          & GRF          & GRF, BANK, SRF & GRF, BANK, SRF &                \\
-	Arithmetic & MUL     & multiplication                                    & GRF          & GRF, BANK      & GRF, BANK, SRF & GRF, BANK, SRF \\
+	Arithmetic & MUL     & multiplication                                    & GRF          & GRF, BANK      & GRF, BANK, SRF &                \\
 	Arithmetic & MAC     & multiply-accumulate                               & GRF-B        & GRF, BANK      & GRF, BANK, SRF & GRF, BANK, SRF \\
-	Arithmetic & MAD     & multiply-and-add                                  & GRF          & GRF, BANK      & GRF, BANK, SRF &                
+	Arithmetic & MAD     & multiply-and-add                                  & GRF          & GRF, BANK      & GRF, BANK, SRF & GRF, BANK, SRF  
 	\end{tblr}}
-	\caption[A list of the supported instructions with possible sources and destinations]{A list of the supported instructions with possible sources and destinations \cite{shin-haengkang2023}.}
+	\caption[A list of the supported instructions and their possible sources and destinations]{A list of the supported instructions and their possible sources and destinations \cite{shin-haengkang2023}.}
 	\label{tab:instruction_set}
 \end{table}

@@ -234,12 +240,60 @@ The Table \ref{tab:instruction_set} gives an overview of all available instructi
 Note that some operations require either a \ac{rd} or a \ac{wr} access to execute properly.
 For example, to write the resulting output vector from a \ac{grf} to the memory banks, the memory controller must issue a \ac{wr} command to write to the bank.
 Likewise, reading from the banks requires a \ac{rd} command.
-For the control types and arithmetic instructions without the bank as a source operand, an either a \ac{rd} or a \ac{wr} can be issued to execute the instruction.
+For the control types and arithmetic instructions without the bank as a source operand, either a \ac{rd} or a \ac{wr} can be issued to execute the instruction.
 For the rest of this thesis, it is assumed that a \ac{rd} is issued for these instructions.

 \subsubsection{Instruction Ordering}
 \label{sec:instruction_ordering}

+Since the execution of an instruction in the microkernel is initiated by a memory access, the host processor must execute \ac{ld} or \ac{st} instructions in a sequence that exactly matches the loaded \ac{pim} microkernel.
+When an instruction has a bank as its specified source or destination, the addresses of these memory accesses specify the exact row and column where the data should be loaded from or stored to.
+This means that the respective memory accesses for such instructions must not be reordered, as each access must match the corresponding instruction in the microkernel.
+For example, as shown in Listing \ref{lst:reorder}, two consecutive \ac{mac} instructions with the memory bank as one operand source already specify the respective register index, but must wait for the actual memory access to get the row and column address of the bank access.
+
+\begin{listing}
+\begin{verbatim}
+MAC GRF_B #0, BANK, GRF_A #0
+MAC GRF_B #1, BANK, GRF_A #1
+\end{verbatim}
+	\caption[Exemplary sequence of \ac{mac} instructions in a microkernel]{Exemplary sequence of \ac{mac} instructions in a microkernel.}
+	\label{lst:reorder}
+\end{listing}
+
+Unfortunately, the memory controller between the host processor and the \ac{pim} memory is allowed to reorder memory accesses as long as the reordering does not introduce hazards.
+This can cause the register sources and destinations to get out of sync with the bank addresses.
+One solution to this problem would be to introduce a memory barrier after each \ac{ld} and \ac{st} instruction of the processor to prevent any reordering, so that only one memory transaction is handled by the controller at a time.
+However, this comes at a significant performance cost and results in memory bandwidth being underutilized as the host processor has to wait for every memory access to complete.
+Disabling memory controller reordering completely, on the other hand, interferes with non-\ac{pim} traffic and significantly reduces its performance.
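The desynchronization described above can be illustrated with a small Python model. This is only a sketch under stated assumptions: the names `microkernel_indices`, `execute` and `in_sync` are hypothetical, and the model assumes that the i-th access to arrive always triggers the i-th microkernel instruction, with fixed register indices encoded in the instruction.

```python
# Hypothetical model: the two MAC instructions from the listing use the
# fixed register indices 0 and 1 and expect bank columns 0 and 1.
microkernel_indices = [0, 1]


def execute(access_columns):
    """Pair each instruction with the access that triggers it.

    The i-th arriving access triggers the i-th instruction, so the result
    is a list of (register index, bank column) pairs.
    """
    return list(zip(microkernel_indices, access_columns))


def in_sync(pairs):
    """True if every register index was paired with its intended column."""
    return all(index == column for index, column in pairs)


program_order = [0, 1]                      # order issued by the host
reordered = list(reversed(program_order))   # order after controller reordering

print(execute(program_order), in_sync(execute(program_order)))
print(execute(reordered), in_sync(execute(reordered)))
```

In program order each register index meets its intended column. After the swap, register index 0 is combined with the data of column 1 and vice versa, which is the mismatch the text describes.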
+
+To avoid this overhead, Samsung has implemented the \ac{aam} for arithmetic instructions.
+In the \ac{aam}, the register indices encoded in an instruction are ignored and are instead decoded from the column and row address of the memory access itself, as demonstrated in Figure \ref{img:aam}.
+With this method, the register indices and the bank address cannot get out of sync, as they are tightly coupled, even if the memory controller reorders the accesses.
+
+\begin{figure}
+	\centering
+	\includegraphics[width=0.5\linewidth]{images/aam}
+	\caption[Exemplary calculation of the GRF-A and GRF-B index using the row and column address]{Exemplary calculation of the GRF-A and GRF-B index using the row and column address \cite{lee2021}.}
+	\label{img:aam}
+\end{figure}
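The order independence of the address aligned mode can be sketched in Python as follows. The exact bit mapping is given in the figure from the cited work and is not reproduced here. The function `aam_decode` is a hypothetical stand-in that assumes the GRF-A index is taken from the low three bits of the column address and the GRF-B index from the low three bits of the row address.

```python
import random

GRF_ENTRIES = 8  # each GRF half holds 8 entries, as stated in the text


def aam_decode(row, col):
    """Hypothetical AAM decode, not the documented bit mapping.

    Assumption: GRF-A index = low 3 bits of the column address,
                GRF-B index = low 3 bits of the row address.
    """
    return col % GRF_ENTRIES, row % GRF_ENTRIES


def run(accesses):
    """Map every (row, col) access to the register indices it selects."""
    return {(row, col): aam_decode(row, col) for row, col in accesses}


accesses = [(0, col) for col in range(8)]
shuffled = accesses[:]
random.Random(0).shuffle(shuffled)  # stands in for controller reordering

# Every access carries its own indices, so the arrival order does not matter.
print(run(accesses) == run(shuffled))
```

Because the indices are a pure function of the address, any permutation of the accesses yields the same pairing of bank data and register entries.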
+
+As a side effect, this method also makes it possible to loop over an instruction in the microkernel; without it, the indices would be fixed and every iteration would apply to the same register entry.
+At the core of a \ac{gemv} microkernel is an iterative \ac{mac} instruction, followed by a JUMP instruction that executes the \ac{mac} operation a total of eight times, as shown in Listing \ref{lst:gemv}.
+
+\begin{listing}
+\begin{verbatim}
+MAC(AAM) GRF_B, BANK, GRF_A
+JUMP -1, 7
+\end{verbatim}
+	\caption[The core of a \ac{gemv} microkernel]{The core of a \ac{gemv} microkernel.}
+	\label{lst:gemv}
+\end{listing}
+
+Since the column address of the memory access is incremented after each iteration, all entries of the \ac{grf}-A register file, where the input vector is stored, are multiplied in turn with the matrix weights that are loaded on the fly from the memory banks.
+To achieve this particular operation, where the addresses can be used to calculate the register indices, the memory layout of the weight matrix has to follow a special pattern.
+This memory layout is explained in detail in the following section.
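The looped MAC of the GEMV core can be modeled in a few lines of Python. This is a minimal sketch, not the actual hardware behavior: `gemv_core` is a hypothetical name, plain Python floats stand in for FP16 values, the GRF-A index is assumed to be the low three bits of the column address, and the accumulator is assumed to be a single GRF-B entry.

```python
LANES = 16        # SIMD width stated in the text
GRF_ENTRIES = 8   # entries per GRF half stated in the text
JUMP_COUNT = 7    # iteration count of "JUMP -1, 7"


def gemv_core(bank_row, grf_a):
    """Model "MAC(AAM) GRF_B, BANK, GRF_A" followed by "JUMP -1, 7".

    bank_row[col] is the 16-lane burst stored at column `col`,
    grf_a[i] is the 16-lane content of GRF-A entry i.
    Returns the accumulator and the number of MAC executions.
    """
    acc = [0.0] * LANES
    executions = 0
    # One initial execution plus JUMP_COUNT repetitions.
    for col in range(1 + JUMP_COUNT):
        a = grf_a[col % GRF_ENTRIES]   # assumed AAM index from the column
        burst = bank_row[col]          # weights loaded from the bank
        acc = [s + w * x for s, w, x in zip(acc, burst, a)]
        executions += 1
    return acc, executions


bank_row = [[float(col)] * LANES for col in range(8)]
grf_a = [[1.0] * LANES for _ in range(GRF_ENTRIES)]
acc, executions = gemv_core(bank_row, grf_a)
print(executions, acc[0])
```

With the JUMP counter set to 7, the MAC runs eight times in total, and each run reads a different GRF-A entry because the host increments the column address with every access.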
+ 
 \subsubsection{Memory Layout}
+\label{sec:memory_layout}

 \subsubsection{Performance and Power Efficiency Effects}
--- a/src/images/aam.pdf
+++ b/src/images/aam.pdf
--- a/src/index.tex
+++ b/src/index.tex
@@ -106,16 +106,16 @@

 % %List of Listings
 % %\renewcommand{\lstlistlistingname}{Verzeichnis der Quellcodes}
-% \begingroup
-%     \phantomsection
-%     \addcontentsline{toc}{section}{List of Listings}
-%       \setlength{\itemsep}{20pt}
-%   \setlength{\parskip}{10pt}
-%     \renewcommand{\listlistingname}{List of Listings}
-%     \listoflistings 
-% \endgroup
-% \newpage
-% \clearpage
+\begingroup
+    \phantomsection
+    \addcontentsline{toc}{section}{List of Listings}
+      \setlength{\itemsep}{20pt}
+  \setlength{\parskip}{10pt}
+    \renewcommand{\listlistingname}{List of Listings}
+    \listoflistings 
+\endgroup
+\newpage
+\clearpage

 % List of Abbreviations
 % TODO closer together and serifs for abbreviations