Kernel chapter complete
@@ -3,6 +3,7 @@
% what to do better:
% implement Samsung's real mode switching and programming of the CRFs
% build an API that guarantees matching LD and ST pairs for the assembled microkernel
% implement a Linux kernel driver
% -> alignment requirements -> huge tables
% make use of Samsung's PIM in a real DNN application and measure the effects
@@ -158,9 +158,9 @@ Since different channels would only be used to increase the dimensions of the ma
\subsubsection{GEMV Microkernel}

With a working bare-metal environment, heap allocation of memory arrays, and the correct \aca{hbm} configuration for \aca{fimdram}, a \ac{gemv} microkernel can finally be assembled using the data structures provided by the \ac{pim} library.
The native matrix dimensions of (128$\times$8) have been extended to (128$\times$16), spreading the matrix over two \acp{pch} and increasing the size of the output vector to (16).
The microkernel must therefore execute on both \acp{pch}, which is ensured by implicitly addressing the corresponding \ac{pch} when generating the \ac{rd} and \ac{wr} commands for the matrix addresses.
With the (128$\times$16) weight matrix, the interleaved (128) input vector, the reserved (16) output vector of 16-wide \ac{fp16} \ac{simd} packets that holds the partial sums, and a dummy memory region for executing control instructions, the \ac{gemv} microkernel can be assembled as seen in \cref{lst:gemv_microkernel}.

\begin{listing}
\begin{verbatim}
@@ -189,4 +189,13 @@ The host processor must now exit the \ac{abp} mode and enter the \ac{sb} mode, l
\subsubsection{Benchmark Environment}

One crucial missing piece for measuring the performance gains of \aca{fimdram} in gem5 is an accurate way of counting the clock cycles of the simulated out-of-order processor.
The gem5 simulator reports this number of ticks and other statistics in a file at the end of the simulation.
However, since the boot process, the setup of the matrix operands, and the mode switching of the processing units should not be captured, more fine-grained control is necessary.
This can be achieved using the so-called M5ops.
By using special instructions that the processor model interprets, it is possible to control the recording of the statistics directly from the simulated application.
Another option is to generate memory accesses to special predefined addresses, which the processor then interprets in a certain way.
These special instructions or memory accesses for exiting the simulation, resetting the statistics, and dumping the statistics are inserted into the kernel as follows:
Before executing the microkernel of a benchmark, the simulation statistics are reset, while after execution they are explicitly dumped, so that only the execution of the microkernel is measured.
To compare the use of \aca{fimdram} with conventional matrix operations on the host processor, only the computation itself, i.e., the core, is measured, not the initialization.
This provides a fair basis for comparison and allows a number of comparative simulations to be performed.
@@ -69,7 +69,7 @@ For such a flat array, several things have to be considered:
\item The start of the array must lie on the first bank of the \ac{pch} and the end of the array must lie on the last bank of the \ac{pch}.
\end{itemize}

The software library introduces the \texttt{BankArray} data structure, which has the size of $\qty{32}{\byte}\cdot\mathrm{\#\ banks\ per\ \ac{pch}}=\qty{512}{\byte}$, holding in total 256 \ac{fp16} numbers.
To guarantee the correct placement, an alignment of $\qty{512}{\byte}$ is explicitly enforced.
While it may seem at first that the compiler implicitly enforces this alignment, this is not true for arrays consisting of smaller data types: the compiler only enforces a $\qty{2}{\byte}$ alignment for the \ac{fp16} array, since an \ac{fp16} number is $\qty{2}{\byte}$ in size.
This memory layout assumes a bank-interleaving \ac{am}, where after a complete burst the memory controller addresses the next bank of the \ac{pch}.
@@ -95,7 +95,7 @@ Following the same consideration as with the \texttt{BankArray}, the weight matr
However, when using the \ac{aam} execution mode, this is not sufficient.
As already shown in \cref{img:aam}, the \ac{grf}-A and \ac{grf}-B indices are calculated from the column and row address of the triggering memory access.
With an alignment of $\qty{512}{\byte}$, no assumptions can be made about the initial values of the \ac{grf}-A and \ac{grf}-B indices, while for the execution of a complete \ac{gemv} kernel, both indices should start at zero.
Therefore, the larger alignment of $2^6\cdot\qty{512}{\byte}=\qty{32768}{\byte}$ must be enforced for the weight matrix.

Besides the weight matrices, the input vector must adhere to an interleaved layout at the granularity of the 16-wide \ac{fp16} vector, as described in \cref{sec:memory_layout}.
The number of copies of each chunk is equal to the number of processing units in each \ac{pch}.
@@ -120,7 +120,7 @@ Since a memory request triggers the execution of all processing units in a \ac{p
From the point of view of the processor, only data in the first (even) or second (odd) bank is ever accessed.
This requires special indexing of the input vectors and matrices, since they must be accessed very sparsely.

In the case of the input vector, where one 16-wide \ac{simd} vector of \ac{fp16} elements is repeated as often as there are banks in a \ac{pch}, a burst access must occur every $\qty{32}{\byte}\cdot\mathrm{\#\ banks\ per\ \ac{pch}}=\qty{512}{\byte}$ over the entire interleaved input vector, up to a maximum of 8 times.
This way, all available \ac{grf}-A registers in a processing unit are used to hold its copy of the input vector.
To then perform the repeated \ac{mac} operation with the weight matrix as bank data, a similar logic must be applied.
Since each row of the matrix resides in its own memory bank, with an interleaving of the size of a 16-wide \ac{simd} vector of \ac{fp16} elements, one memory access must likewise be issued every $\qty{512}{\byte}$.
@@ -149,7 +149,7 @@ This general architecture is shown in detail in \cref{img:fimdram}, with (a) the

As can be seen in (c), the input data to the \ac{fpu} can come either directly from the memory bank, from a \ac{grf}/\ac{srf}, or from the result bus of a previous computation.
The 16-wide \ac{simd} units correspond to the 256-bit prefetch architecture of \aca{hbm}, where 16 16-bit floating-point operands are passed directly from the \acp{ssa} to the \acp{fpu} in a single memory access.
As all \ac{pim} units operate in parallel, with 16 banks per \ac{pch}, a single memory access loads a total of $\qty{256}{\bit}\cdot\qty{16}{banks}=\qty{4096}{\bit}$ into the \acp{fpu}.
As a result, the theoretical internal bandwidth of \aca{fimdram} is $16\times$ higher than the external bus bandwidth to the host processor.

\Ac{hbm}-\ac{pim} defines three operating modes:
@@ -349,11 +349,11 @@ This interleaving is illustrated in \cref{img:input_vector}.
To initialize the input vector in this way, the host processor can use \ac{ab} mode.
From the processor's point of view, only the first bank is initialized, but the \ac{ab} mode ensures that the same data is written to all banks at the same time.

An example with a weight matrix of dimensions (128$\times$8), an input vector of size (128), and an output vector of size (8) will be analyzed in the following to describe how the processing units execute a \ac{gemv} microkernel.
With processing unit $i$, iteration index $j$, input vector $a$, and weight matrix $w$, the partial sum $psum[i,0:15]$ is calculated as described in \cref{eq:partial_sum}:

\begin{equation}
psum[i,0:15]=\sum_{j=0}^{7}(a[j \cdot 16:j \cdot 16+15] \cdot w[i,j \cdot 16:j \cdot 16+15])
\label{eq:partial_sum}
\end{equation}
@@ -366,14 +366,14 @@ The operation of this concrete \ac{gemv} microkernel is illustrated in \cref{img
\begin{figure}
\centering
\includegraphics[width=0.8\linewidth]{images/memory_layout}
\caption[Procedure to perform a (128)$\times$(128$\times$8) \ac{gemv} operation]{Procedure to perform a (128)$\times$(128$\times$8) \ac{gemv} operation. One cell represents 16 \ac{fp16} elements forming a $\qty{32}{\byte}$ block \cite{kang2022}.}
\label{img:memory_layout}
\end{figure}

In \cref{img:memory_layout} it can be seen that a processing unit is responsible for multiplying and adding one row of the matrix with the input vector in eight cycles, forming the partial sum.
This example only demonstrates the execution of the native matrix dimensions for one \ac{pch}.
Increasing the number of rows in the matrix simply requires additional iterations of this 8-cycle microkernel, while feeding in the memory addresses of the subsequent matrix rows.
As a side effect of the incremented matrix row address, this also results in an increment of the \ac{grf}-B index, making it possible to increase the maximum number of matrix rows to $8 \cdot 8=64$ before all eight \ac{grf}-B entries are filled with partial sums, as demonstrated in \cref{lst:gemv64}.

\begin{listing}
\begin{verbatim}
@@ -43,7 +43,7 @@
$w_{4,0}$ & $w_{4,1}$ & $w_{4,2}$ & $w_{4,3}$ \\
};

\node (prod) [right=4mm of matrix] {$\times$};

\matrix (input_vector) [matrix of nodes,left delimiter=(,right delimiter=),right of=prod] {
$i_{0}$ \\
@@ -30,7 +30,7 @@

\node[above=0mm of bank2] {$\iddots$};

\node (prod) [right=of bank0] {$\times$};

\node[draw,outer sep=0,minimum width=2mm,minimum height=3cm,fill=white,right=of prod] (input) {};
\node[draw,outer sep=0,minimum width=2mm,minimum height=1cm,fill=ForestGreen!20,anchor=north] (inputchunk0) at (input.north) {};