Kernel chapter complete
@@ -149,7 +149,7 @@ This general architecture is shown in detail in \cref{img:fimdram}, with (a) the
As can be seen in (c), the input data to the \ac{fpu} can come directly from the memory bank, from a \ac{grf}/\ac{srf}, or from the result bus of a previous computation.
The 16-wide \ac{simd} units correspond to the 256-bit prefetch architecture of \aca{hbm}, where sixteen 16-bit floating-point operands are passed directly from the \acp{ssa} to the \acp{fpu} in a single memory access.
As all \ac{pim} units operate in parallel, with 16 banks per \ac{pch}, a singular memory access loads a total of $\qty{256}{\bit}*\qty{16}{banks}=\qty{4096}{\bit}$ into the \acp{fpu}.
As all \ac{pim} units operate in parallel, with 16 banks per \ac{pch}, a single memory access loads a total of $\qty{256}{\bit}\cdot\qty{16}{banks}=\qty{4096}{\bit}$ into the \acp{fpu}.
As a result, the theoretical internal bandwidth of \aca{fimdram} is $\qty{16}{\times}$ higher than the external bus bandwidth to the host processor.
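As a quick check of this factor, under the assumption that an external access on the \ac{pch} transfers the \qty{256}{\bit} of a single bank per column access:
\[
\frac{\qty{256}{\bit} \cdot 16}{\qty{256}{\bit}} = \frac{\qty{4096}{\bit}}{\qty{256}{\bit}} = 16
\]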
\Ac{hbm}-\ac{pim} defines three operating modes:
@@ -349,11 +349,11 @@ This interleaving is illustrated in \cref{img:input_vector}.
To initialize the input vector in this way, the host processor can use \ac{ab} mode.
From the processor's point of view, only the first bank is initialized, but the \ac{ab} mode ensures that the same data is written to all banks at the same time.
An example with a weight matrix of dimensions (128,8), an input vector of size (128), and an output vector of size (8) will be analyzed in the following to describe how the processing units execute a \ac{gemv} microkernel.
An example with a weight matrix of dimensions (128$\times$8), an input vector of size (128), and an output vector of size (8) will be analyzed in the following to describe how the processing units execute a \ac{gemv} microkernel.
With the processing unit \textit{i}, the iteration index \textit{j}, the input vector \textit{a} and the weight matrix \textit{w}, the partial sum $psum[i,0:15]$ is calculated as described in \cref{eq:partial_sum}:
\begin{equation}
psum[i,0:15]=\sum_{j=0}^{8}(a[j*16:j*16+15]*w[i,j*16:j*16+15])
psum[i,0:15]=\sum_{j=0}^{7}(a[j \cdot 16:j \cdot 16+15] \cdot w[i,j \cdot 16:j \cdot 16+15])
\label{eq:partial_sum}
\end{equation}
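Written out for the example above, and assuming that the product is taken element-wise across the 16 \ac{simd} lanes (an interpretation of the slice notation, not stated explicitly here), the sum consists of one term per 16-element block of the 128-element input vector:
\[
psum[i,0:15] = a[0:15] \cdot w[i,0:15] + a[16:31] \cdot w[i,16:31] + \dots + a[112:127] \cdot w[i,112:127]
\]
The last block ends at index 127, so eight blocks cover the input vector exactly.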
@@ -366,14 +366,14 @@ The operation of this concrete \ac{gemv} microkernel is illustrated in \cref{img
\begin{figure}
\centering
\includegraphics[width=0.8\linewidth]{images/memory_layout}
\caption[Procedure to perform a (128)x(128,8) \ac{gemv} operation]{Procedure to perform a (128)x(128,8) \ac{gemv} operation. One cell represents 16 \ac{fp16} elements forming a $\qty{32}{\byte}$ block \cite{kang2022}.}
\caption[Procedure to perform a (128)$\times$(128$\times$8) \ac{gemv} operation]{Procedure to perform a (128)$\times$(128$\times$8) \ac{gemv} operation. One cell represents 16 \ac{fp16} elements forming a $\qty{32}{\byte}$ block \cite{kang2022}.}
\label{img:memory_layout}
\end{figure}
As shown in \cref{img:memory_layout}, each processing unit multiplies one row of the matrix with the input vector and accumulates the products over eight cycles, forming the partial sum.
This example only demonstrates the execution of the native matrix dimensions for one \ac{pch}.
Increasing the number of rows in the matrix simply requires additional iterations of this 8-cycle microkernel, while feeding in the other memory addresses for the subsequent matrix rows.
As a side effect of the incremented matrix row address, this also results in an increment of the \ac{grf}-B index, making it possible to increase the maximum number of matrix rows to $8*8=64$ before all eight \ac{grf}-B entries are filled with partial sums, as demonstrated in \cref{lst:gemv64}.
Incrementing the matrix row address also increments the \ac{grf}-B index as a side effect, which raises the maximum number of matrix rows to $8 \cdot 8=64$ before all eight \ac{grf}-B entries are filled with partial sums, as demonstrated in \cref{lst:gemv64}.
\begin{listing}
\begin{verbatim}