FIMDRAM Programming Model

2024-02-12 14:06:57 +01:00
parent b554efe3e8
commit 62dbec0a2f
3 changed files with 36 additions and 2 deletions
--- a/src/acronyms.tex
+++ b/src/acronyms.tex
@@ -235,6 +235,14 @@
    short = ST,
    long = store,
 }
+\DeclareAcronym{tf}{
+    short = TF,
+    long = TensorFlow,
+}
+\DeclareAcronym{isa}{
+    short = ISA,
+    long = instruction set architecture,
+}
 \DeclareAcronym{tlm}{
    short = TLM,
    long = transaction-level modeling,
--- a/src/chapters/pim.tex
+++ b/src/chapters/pim.tex
@@ -189,6 +189,9 @@ This processing unit architecture is illustrated in Figure \ref{img:pcu}, along
 	\label{img:pcu}
 \end{figure}

+Unlike SK Hynix's Newton architecture, \ac{hbm}-\ac{pim} requires both mode switching and loading a microkernel into the processing units before a workload can be executed.
+This makes \ac{hbm}-\ac{pim} less effective for very small workloads, for which this setup overhead is significant.
+
 \subsubsection{Instruction Set}

 The \ac{hbm}-\ac{pim} processing units provide a total of 9 32-bit \ac{risc} instructions, each of which falls into one of three groups: control flow instructions, arithmetic instructions and data movement instructions.
@@ -290,10 +293,33 @@ JUMP -1, 7
 \end{listing}

 Since the column address of the memory access is incremented after each iteration, all entries of the GRF-A register file, where the input vector is stored, are used to multiply it with the matrix weights loaded on the fly from the memory banks.
+The actual order of the memory accesses is irrelevant; the host only has to place memory barrier instructions before and after the \ac{mac} kernel to synchronize the execution again.
 To achieve this particular operation, where the addresses can be used to calculate the register indices, the memory layout of the weight matrix has to follow a special pattern.
-This memory layout is explained in detail in the following section.
+This memory layout is explained in detail in Section \ref{sec:memory_layout}.
 
+\subsubsection{Programming Model}
+
+The software stack of \ac{hbm}-\ac{pim} is split into three main parts.
+Firstly, a \ac{pim} device driver is responsible for allocating buffers in \ac{hbm} memory and marking these regions as uncacheable.
+This is necessary because the on-chip cache would otherwise act as an unwanted filter between the host processor's \ac{ld} and \ac{st} instructions and the memory accesses generated by the memory controller.
+Alternatively, it would be possible to control the cache behavior by issuing flush and invalidate instructions, but this would introduce overhead, as a flush would have to be issued between every two \ac{pim} instructions in the microkernel.
+Secondly, a \ac{pim} acceleration library implements a set of \ac{blas} operations and manages the generation, loading and execution of the microkernel on behalf of the user.
+Thirdly, at the highest level, \ac{hbm}-\ac{pim} provides an extension to the \ac{tf} framework that allows either calling the special \ac{pim} operations implemented by the acceleration library directly on the source operands, or automatically finding suitable routines that can be accelerated by \ac{pim} within regular \ac{tf} operations.
+
+The software stack is able to exploit the mutually independent \acp{pch} in parallel for a \ac{mac} operation, as described in Section \ref{sec:instruction_ordering}.
+Since \aca{hbm} memory is mainly used in conjunction with \acsp{gpu}, which do not implement sophisticated out-of-order execution, it is necessary to spawn a number of software threads to execute the eight memory accesses simultaneously.
+The necessary number of threads depends on the processor \ac{isa}, e.g., with a maximum access size of $\qty{16}{\byte}$, $\qty{256}{\byte}/\qty{16}{\byte}=\num{16}$ threads are required to access the full \aca{hbm} burst size.
+Such a group of software threads is called a thread group.
+Thus, a total of 64 thread groups running in parallel can be spawned in a \ac{hbm} configuration with four memory stacks and a total of 64 \acp{pch}.
+
 \subsubsection{Memory Layout}
 \label{sec:memory_layout}

-\subsubsection{Performance and Power Efficiency Effects}
+\begin{figure}
+	\centering
+	\includegraphics[width=0.8\linewidth]{images/memory_layout}
+	\caption[]{}
+	\label{img:memory_layout}
+\end{figure}
+
+\subsubsection{Performance and Power Efficiency Achievements}
--- a/src/images/memory_layout.pdf
+++ b/src/images/memory_layout.pdf