\label{sec:memory_layout}
As already described in \cref{sec:instruction_ordering}, the use of the \ac{aam} mode requires a special memory layout so that the register indices are correctly calculated from the column and row addresses of a memory access.
To make use of all eight \ac{grf}-A registers, the input address has to increment linearly while adhering to a column-major matrix layout.
In a column-major matrix layout, the entries of a column are stored sequentially before switching to the next column, according to the \texttt{MATRIX[R][C]} \ac{c}-like array notation.
However, the concrete element type of the array is not a single \ac{fp16} number, but a vector of 16 \ac{fp16} numbers packed together.
This results in 16 \ac{fp16} matrix row elements being stored sequentially before switching to the next 16 \ac{fp16} elements in the next row of the same 16 columns, ensuring that a \ac{simd} processing unit always contains the data of only one matrix row.
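The packed layout described above can be sketched as a small address calculation. The following is a minimal illustration, not part of the \aca{fimdram} specification: the matrix is split into 16-column panels, each panel stores its 16-element row vectors back to back, and the function name \texttt{linear\_offset} is a hypothetical helper chosen for this sketch.

```python
def linear_offset(r, c, num_rows):
    """Offset (in FP16 elements) of matrix element (r, c) in the packed
    layout: within a 16-column panel, the 16 row elements of one row are
    stored sequentially before the same 16 columns of the next row; only
    after all rows of a panel does the next panel begin."""
    panel = c // 16          # which 16-column group the element belongs to
    lane = c % 16            # position inside the 16-wide SIMD word
    return panel * (num_rows * 16) + r * 16 + lane

# Each 16-wide SIMD word therefore holds data of exactly one matrix row:
offsets = [linear_offset(3, c, num_rows=128) for c in range(16)]
assert offsets == list(range(3 * 16, 3 * 16 + 16))
```

Consecutive addresses thus walk through one row's 16 elements before moving to the next row of the same columns, which is what keeps a \ac{simd} word confined to a single matrix row.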
The \aca{fimdram} architecture also imposes certain dimensional constraints on the weight matrix and the input vector.
As all eight processing units in a \ac{pch} operate at the same time, the number of rows must be a multiple of eight to make use of the full processing bandwidth.
These blocks of matrix rows may span multiple \ac{dram} rows or even other \acp{pch}.
Furthermore, the number of columns defines the number of iterations the \ac{mac} core of the microkernel has to perform.
As 16 \ac{fp16} elements are always packed together in a column-major fashion, and as the \ac{am} of the memory controller switches to the next bank after exactly one burst, the \ac{pim} units each contain 16 different matrix row elements of the same set of 16 matrix columns.
Note that this interleaving of \ac{fp16} vectors is very similar to the chunking of the weight matrix of SK Hynix's Newton architecture, as illustrated in \cref{img:hynix}.
The input vector must also adhere to a special memory layout.
Since a vector is essentially a single-column matrix, it is always laid out sequentially in memory.
To initialize the input vector in this way, the host processor can use \ac{ab} mode.
From the processor's point of view, only the first bank is initialized, but the \ac{ab} mode ensures that the same data is written to all banks at the same time.
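The broadcast semantics of the all-bank write can be modeled with a toy sketch. The bank count of eight per \ac{pch} follows the text above; the storage size, the \texttt{banks} structure, and the function name \texttt{ab\_write} are assumptions made purely for illustration.

```python
NUM_BANKS = 8                  # banks per pseudo channel, as in the text
BANK_WORDS = 1024              # toy per-bank storage size (assumption)

banks = [[0] * BANK_WORDS for _ in range(NUM_BANKS)]

def ab_write(addr, data):
    """All-bank mode sketch: a single write issued by the host is
    mirrored into the same address of every bank, so each processing
    unit later finds the input vector in its local bank."""
    for bank in banks:
        bank[addr] = data

# From the host's point of view only one bank is addressed, yet all
# eight banks end up holding the same value:
ab_write(0, 0x1234)
assert all(bank[0] == 0x1234 for bank in banks)
```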
In the following, an example with a weight matrix of dimensions (128,8), an input vector of size (128), and an output vector of size (8) is analyzed to describe how the processing units execute a \ac{gemv} microkernel.
With the processing unit \textit{i}, the number of iterations \textit{j}, the input vector \textit{a} and the weight matrix \textit{w}, the partial sum $psum[i,0:15]$ is calculated as described in \cref{eq:partial_sum}:
\begin{equation}
psum[i,0:15]=\sum_{j=0}^{7}(a[j*16:j*16+15]*w[i,j*16:j*16+15])
\label{eq:partial_sum}
\end{equation}
The partial sum vector $psum[0:7,0:15]$ must then be reduced by the host processor to obtain the final output vector $b[0:7]$.
This reduction step is mandatory because there is no means in the \aca{fimdram} architecture to reduce the output sums of the 16-wide \ac{simd} \acp{fpu}.
In contrast, SK Hynix's Newton implements adder trees in the \ac{pim} units to reduce the partial sums directly in memory.
Note that, as a consequence, the activation function commonly used in \acp{dnn}, i.e. \ac{relu} in the case of \aca{fimdram}, cannot be applied without first reducing the partial sums, since \ac{relu} is a non-linear function.
The operation of this concrete \ac{gemv} microkernel is illustrated in \cref{img:memory_layout}.
\begin{figure}
As can be seen in \cref{img:memory_layout}, a processing unit is responsible for multiplying and adding one row of the matrix with the input vector in eight cycles, forming the partial sum.
This example only demonstrates the execution of the native matrix dimensions for one \ac{pch}.
Increasing the number of rows in the matrix simply requires additional iterations of this 8-cycle microkernel, while feeding in the other memory addresses for the subsequent matrix rows.
As a side effect of the incremented matrix row address, this also results in an increment of the \ac{grf}-B index, making it possible to increase the maximum number of matrix rows to $8*8=64$ before all eight \ac{grf}-B entries are filled with partial sums, as demonstrated in \cref{lst:gemv64}.
\begin{listing}
\begin{verbatim}