FIMDRAM Memory Layout

This commit is contained in:
2024-02-12 18:45:27 +01:00
parent 62dbec0a2f
commit 65522a8cfa
4 changed files with 77 additions and 5 deletions

View File

@@ -243,6 +243,10 @@
short = ISA,
long = instruction set architecture,
}
\DeclareAcronym{c}{
short = C,
long = The C Programming Language,
}
\DeclareAcronym{tlm}{
short = TLM,
long = transaction-level modeling,

View File

@@ -88,7 +88,7 @@ While copying, the data layout must be changed to store the data words continuou
UPMEM provides a \ac{sdk} that orchestrates the data movement from the main memory to the \ac{pim} banks and modifies the data layout.
Each \ac{dpu} is a multithreaded $\qty{32}{bit}$ \ac{risc} core with a full set of general purpose registers and a 14-stage pipeline.
-The \acp{dpu} execute compiled C code using a specialized compiler toolchain that provides limited support of the standard library.
+The \acp{dpu} execute compiled \acs{c} code using a specialized compiler toolchain that provides limited support for the standard library.
With a system clock of $\qty{400}{\mega\hertz}$, the internal bandwidth of a \ac{dpu} amounts to $\qty[per-mode = symbol]{800}{\mega\byte\per\second}$.
A system can integrate 128 \acp{dpu} per \ac{dimm}, with a total of 20 UPMEM \acp{dimm}.
This gives a maximum theoretical \ac{pim} bandwidth of $\qty[per-mode = symbol]{2}{\tera\byte\per\second}$ \cite{gomez-luna2022}.
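The quoted peak figure follows directly from the per-\ac{dpu} bandwidth and the \ac{dpu} count. A quick sanity check of the arithmetic (plain Python, no UPMEM \ac{sdk} involved; decimal units assumed):

```python
# Sanity check of the aggregate UPMEM PIM bandwidth quoted above.
DPU_BANDWIDTH_MB_S = 800   # internal bandwidth per DPU at 400 MHz
DPUS_PER_DIMM = 128
DIMMS_PER_SYSTEM = 20

total_dpus = DPUS_PER_DIMM * DIMMS_PER_SYSTEM   # 2560 DPUs in total
total_mb_s = total_dpus * DPU_BANDWIDTH_MB_S
total_tb_s = total_mb_s / 1_000_000             # decimal TB/s

print(total_dpus, total_tb_s)  # 2560 2.048
```

The result of roughly $\qty[per-mode = symbol]{2}{\tera\byte\per\second}$ matches the figure cited from \cite{gomez-luna2022}.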
@@ -199,7 +199,7 @@ The data layout of these three instruction groups is shown in Table \ref{tab:isa
\begin{table}
\centering
-\includegraphics[width=\linewidth]{images/isa}
+\includegraphics[width=0.9\linewidth]{images/isa}
\caption[The instruction format of the processing units]{The instruction format of the processing units \cite{lee2021}.}
\label{tab:isa}
\end{table}
@@ -276,7 +276,7 @@ With this method, the register indices and the bank address cannot get out of sy
\begin{figure}
\centering
\includegraphics[width=0.5\linewidth]{images/aam}
-\caption[Exemplary calculation of the GRF-A and GRF-B index using the row and column address]{Exemplary calculation of the GRF-A and GRF-B index using the row and column address \cite{lee2021}.}
+\caption[Exemplary calculation of the \ac{grf}-A and \ac{grf}-B index using the row and column address]{Exemplary calculation of the \ac{grf}-A and \ac{grf}-B index using the row and column address \cite{lee2021}.}
\label{img:aam}
\end{figure}
@@ -292,7 +292,7 @@ JUMP -1, 7
\label{lst:gemv}
\end{listing}
-Since the column address of the memory access is incremented after each iteration, all entries of the GRF-A register file, where the input vector is stored, are used to multiply it with the matrix weights loaded on the fly from the memory banks.
+Since the column address of the memory access is incremented after each iteration, all entries of the \ac{grf}-A register file, where the input vector is stored, are used to multiply it with the matrix weights loaded on the fly from the memory banks.
The actual order of the memory accesses is irrelevant; the host must only place memory barrier instructions before and after the \ac{mac} kernel to resynchronize the execution.
To achieve this particular operation, where the addresses can be used to calculate the register indices, the memory layout of the weight matrix has to follow a special pattern.
This memory layout is explained in detail in Section \ref{sec:memory_layout}.
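The index derivation can be modeled in a few lines. This is a deliberately simplified sketch: the fact that the \ac{grf}-A index follows the column address and the \ac{grf}-B index follows the bank address comes from the text above, but the concrete bit slices below (the low three bits of each address) are illustrative assumptions, not the documented FIMDRAM mapping from \cite{lee2021}.

```python
# Toy model of address-assisted mode (AAM): register indices are derived
# from the DRAM address instead of being encoded in the instruction.
# The bit slices used here (low 3 bits of each address field) are
# illustrative assumptions, not the real FIMDRAM bit mapping.

def grf_a_index(col_addr: int) -> int:
    """GRF-A index cycles through all 8 entries as the column address increments."""
    return col_addr & 0b111

def grf_b_index(bank_addr: int) -> int:
    """GRF-B index advances with the bank address, one partial-sum slot per row block."""
    return bank_addr & 0b111

# Eight consecutive column accesses touch all eight GRF-A entries exactly once,
# so the whole input vector held in GRF-A is consumed per matrix row.
print([grf_a_index(c) for c in range(8)])  # [0, 1, 2, 3, 4, 5, 6, 7]
```

Because the indices wrap around modulo eight, the ninth access reuses entry 0, which is exactly why the register indices and the bank address cannot get out of sync.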
@@ -315,11 +315,59 @@ Thus, a total of 64 thread groups running in parallel can be spawned in a \ac{hb
\subsubsection{Memory Layout}
\label{sec:memory_layout}
As already described in Section \ref{sec:instruction_ordering}, the use of the \ac{aam} mode requires a special memory layout so that the register indices are correctly calculated from the column and row addresses of a memory access.
To make use of all eight \ac{grf}-A registers, the input address has to increment linearly, resulting in a row-major matrix layout.
In a row-major matrix layout, the entries of a row are stored sequentially before switching to the next row, according to the \texttt{MATRIX[R][C]} \ac{c}-like array notation.
The \ac{hbm}-\ac{pim} architecture imposes certain dimensional constraints on the weight matrix and the input vector.
As all eight processing units in a \ac{pch} operate at the same time, the number of rows must be a multiple of eight to make use of the full processing bandwidth.
These blocks of matrix rows may span multiple \ac{dram} rows or even other \acp{pch}.
Furthermore, the number of columns must be chosen such that the next bank in the \ac{pch} is addressed exactly after one complete matrix row, so that all processing units operate on eight different rows, stored in eight different banks, at the same time.
This does not mean that a matrix row must be the same size as a \ac{dram} row, only that the \ac{am} of the memory controller must switch to the next bank after a complete matrix row.
Once all banks have been accessed, the mapping of the column bits can continue.
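The placement rule described above can be sketched as a small address-mapping function: consecutive matrix rows go to consecutive banks, so each of the eight processing units of a \ac{pch} sees one complete row. The function name and the flat per-bank offset formula are illustrative assumptions, not the memory controller's real address mapping.

```python
# Sketch of the weight-matrix placement: row r of a row-major matrix lands
# in bank (r mod 8), and the rows assigned to one bank stack up sequentially.
# Names and the offset formula are illustrative, not the real controller logic.
NUM_BANKS = 8  # one bank per processing unit in a pCH

def place(r: int, c: int, num_cols: int):
    """Return (bank, offset-within-bank) for element (r, c) of a row-major matrix."""
    bank = r % NUM_BANKS                        # next matrix row -> next bank
    offset = (r // NUM_BANKS) * num_cols + c    # row blocks stack inside each bank
    return bank, offset

# Rows 0..7 of a (16, 32) matrix land in banks 0..7; rows 8..15 wrap around,
# which is the point where the column-bit mapping "continues".
print([place(r, 0, 32)[0] for r in range(16)])
# [0, 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3, 4, 5, 6, 7]
```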
The input vector must also adhere to a special memory layout.
Since a vector is essentially a single-column matrix, it is always laid out sequentially in memory.
However, since all processing units must access the same input vector elements at the same time, all processing units must load the respective vector elements into their \ac{grf}-A registers during the initialization phase of the microkernel.
As there is no communication between the banks, every bank needs to have its own copy of the input vector.
Consequently, from the perspective of the linear address space, multiple copies of the input vector chunks must be interleaved in such a way that the input vector is contiguous from the perspective of each bank.
This interleaving is illustrated in Figure \ref{img:input_vector}.
\begin{figure}
\centering
\input{images/input_vector}
\caption[Input vector in linear address space, where one chunk is mapped to all banks]{Input vector in linear address space, where one chunk is mapped to all banks.}
\label{img:input_vector}
\end{figure}
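The interleaving from Figure \ref{img:input_vector} can be reproduced with a short sketch: each chunk of the vector is replicated once per bank before the next chunk follows, so every bank ends up holding a contiguous copy of the whole vector. The chunk size and bank count below are shrunk for readability; the real chunks in the figure hold 128 elements.

```python
# Sketch of the input-vector interleaving: every chunk is duplicated once
# per bank in linear address space, so each bank sees a contiguous vector.
# Chunk size and bank count are scaled down for readability (128 in the text).
NUM_BANKS = 2
CHUNK = 4

def interleave(vec):
    """Lay out `vec` in linear address space with one chunk copy per bank."""
    out = []
    for start in range(0, len(vec), CHUNK):
        chunk = vec[start:start + CHUNK]
        for _ in range(NUM_BANKS):   # one copy of the chunk per bank
            out.extend(chunk)
    return out

a = list(range(8))
print(interleave(a))  # [0, 1, 2, 3, 0, 1, 2, 3, 4, 5, 6, 7, 4, 5, 6, 7]
```

In practice the host does not write these copies one by one: the all-bank mode broadcasts a single write to all banks simultaneously, as described next.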
To initialize the input vector in this way, the host processor can use all-bank mode.
From the processor's point of view, only the first bank is initialized, but the all-bank mode ensures that the same data is written to all banks at the same time.
The following example, with a weight matrix of dimensions (128,8), an input vector of size (128), and an output vector of size (8), describes how the processing units execute a \ac{gemv} microkernel.
With processing unit index \textit{i}, iteration index \textit{j}, the input vector \textit{a}, and the weight matrix \textit{w}, the partial sum $psum[i,0:15]$ is calculated as follows:
\begin{equation}
psum[i,0:15]=\sum_{j=0}^{7}(a[j*16:j*16+15]*w[i,j*16:j*16+15])
\end{equation}
The partial sum vector $psum[0:7,0:15]$ must then be reduced by the host processor to obtain the final output vector $b[0:7]$.
This reduction step is mandatory because there is no means in the \ac{hbm}-\ac{pim} architecture to reduce the output sums of the 16-wide \ac{simd} \acp{fpu}.
In contrast, SK Hynix's Newton implements adder trees in the \ac{pim} units to reduce the partial sums directly in memory.
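The partial-sum equation and the host-side reduction can be modeled numerically: eight processing units, each with a 16-wide \ac{simd} \ac{mac}, accumulate over eight iterations, and the host then reduces the 16 lanes of every partial sum to one output element. This is a pure-Python sketch of the dataflow, not the real instruction stream; the function name is made up for illustration.

```python
# Numerical model of the GEMV microkernel: 8 units x 16-wide SIMD MAC over
# 8 iterations, followed by the mandatory host-side lane reduction.
LANES, ITERS, UNITS = 16, 8, 8   # 16-wide FPU, 8 cycles, 8 units per pCH

def gemv_microkernel(a, w):
    """a: 128 input elements; w: UNITS rows of 128 weights. Returns 8 outputs."""
    b = []
    for i in range(UNITS):
        psum = [0.0] * LANES                 # GRF-B entry of unit i
        for j in range(ITERS):               # one iteration per GRF-A entry
            for l in range(LANES):           # 16-wide SIMD MAC
                psum[l] += a[j * LANES + l] * w[i][j * LANES + l]
        b.append(sum(psum))                  # host-side reduction of the 16 lanes
    return b

a = [1.0] * 128
w = [[float(i)] * 128 for i in range(8)]     # row i holds the constant i
print(gemv_microkernel(a, w))
# [0.0, 128.0, 256.0, 384.0, 512.0, 640.0, 768.0, 896.0]
```

The final `sum(psum)` line is exactly the reduction step that the host must perform, since the \ac{hbm}-\ac{pim} \acp{fpu} cannot reduce across their 16 lanes themselves.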
The operation of this concrete \ac{gemv} microkernel is illustrated in Figure \ref{img:memory_layout}.
\begin{figure}
\centering
\includegraphics[width=0.8\linewidth]{images/memory_layout}
-\caption[]{}
+\caption[Procedure to perform a (128)*(128,8) \ac{gemv} operation]{Procedure to perform a (128)*(128,8) \ac{gemv} operation. One cell represents 16 \ac{fp16} elements forming a $\qty{32}{\byte}$ block \cite{kang2022}.}
\label{img:memory_layout}
\end{figure}
Figure \ref{img:memory_layout} shows that each processing unit is responsible for multiplying and adding one row of the matrix with the input vector in eight cycles, forming the partial sum.
This example only demonstrates the execution of the native matrix dimensions for one \ac{pch}.
To increase the number of rows in the matrix, additional iterations of this 8-cycle microkernel are simply required, fed with the memory addresses of the subsequent matrix rows.
As a side effect of the incremented bank address, this also increments the \ac{grf}-B index, raising the maximum number of matrix rows to $8*8=64$ before all eight \ac{grf}-B entries are filled with partial sums.
To increase the number of columns, new entries of the input vector must be loaded into the processing units.
Therefore, it is necessary to execute the \ac{gemv} microkernel several times with different input vector and weight matrix addresses.
In general, the more the dimensions exceed the native \ac{pim} matrix dimensions, the more often the \ac{mac} core of the \ac{gemv} microkernel must be executed.
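The tiling described above can be summarized in a small counting sketch: row blocks of up to 64 fill the eight \ac{grf}-B entries before the host must read them out, and column blocks of 128 require reloading the input vector. The tile sizes follow from the text; the loop structure and function name are illustrative assumptions.

```python
# Sketch of how the 8-cycle GEMV microkernel tiles over larger matrices.
# ROW_TILE and COL_TILE follow from the text; the counting logic itself
# is an illustrative assumption, not the real host scheduler.
ROW_TILE = 64    # 8 GRF-B entries x 8 banks before partial sums must be drained
COL_TILE = 128   # 8 GRF-A entries x 16 SIMD lanes before the vector is reloaded

def count_kernel_launches(rows: int, cols: int) -> int:
    """Number of microkernel invocations for a (rows, cols) weight matrix."""
    row_tiles = -(-rows // ROW_TILE)   # ceiling division
    col_tiles = -(-cols // COL_TILE)
    return row_tiles * col_tiles

print(count_kernel_launches(64, 128))   # 1: fits the native dimensions
print(count_kernel_launches(256, 512))  # 16: 4 row tiles x 4 col tiles
```

This reflects the general rule stated above: the further the dimensions exceed the native \ac{pim} matrix dimensions, the more often the \ac{mac} core of the microkernel must run.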
\subsubsection{Performance and Power Efficiency Achievements}

Binary file not shown.

View File

@@ -0,0 +1,20 @@
\begin{tikzpicture}
\tiny
\node[draw,outer sep=0,minimum width=1.5cm,fill=TealBlue!30] (inputchunk0) {a[0:127]};
\node[draw,outer sep=0,minimum width=1.5cm,fill=TealBlue!30,right=0 of inputchunk0] (inputchunk1) {a[0:127]};
\node[draw,outer sep=0,minimum width=1.5cm,fill=RoyalBlue!30,right=0 of inputchunk1] (inputchunk2) {a[128:255]};
\node[draw,outer sep=0,minimum width=1.5cm,fill=RoyalBlue!30,right=0 of inputchunk2] (inputchunk3) {a[128:255]};
\node[draw,outer sep=0,minimum width=1.5cm,fill=Blue!30,right=0 of inputchunk3] (inputchunk4) {a[256:383]};
\node[draw,outer sep=0,minimum width=1.5cm,fill=Blue!30,right=0 of inputchunk4] (inputchunk5) {a[256:383]};
\node[draw,outer sep=0,minimum width=1.5cm,fill=Green!30,below=0 of inputchunk0] {Bank 0};
\node[draw,outer sep=0,minimum width=1.5cm,fill=SpringGreen!30,below=0 of inputchunk1] {Bank 1};
\node[draw,outer sep=0,minimum width=1.5cm,fill=Green!30,below=0 of inputchunk2] {Bank 0};
\node[draw,outer sep=0,minimum width=1.5cm,fill=SpringGreen!30,below=0 of inputchunk3] {Bank 1};
\node[draw,outer sep=0,minimum width=1.5cm,fill=Green!30,below=0 of inputchunk4] {Bank 0};
\node[draw,outer sep=0,minimum width=1.5cm,fill=SpringGreen!30,below=0 of inputchunk5] {Bank 1};
\node[right=of inputchunk5.south east,anchor=east] (inputchunk6) {\normalsize\dots};
\end{tikzpicture}