Benchmarks and part of results

2024-03-11 11:13:24 +01:00
parent 1ee68b2e01
commit 29de1fc642
7 changed files with 192 additions and 5 deletions


@@ -123,3 +123,15 @@
short = AAM,
long = address aligned mode,
}
\DeclareAcronym{mac}{
short = MAC,
long = multiply-accumulate,
}
\DeclareAcronym{haxpy}{
short = HAXPY,
long = half precision $a \cdot x + y$,
}
\DeclareAcronym{relu}{
short = ReLU,
long = rectified linear unit,
}

plots/matrix_infinite.tex Normal file

@@ -0,0 +1,30 @@
\begin{tikzpicture}
\pgfplotstableread[col sep=comma]{plots/speedup_tables/matrix.csv}\csv
\begin{axis}[
width=5cm,
height=4cm,
ybar=1pt,
bar width = 5pt,
ymin=0,
ymax=20,
ymajorgrids,
ylabel={Relative Performance},
tick pos=left,
xtick=data,
xticklabels from table={\csv}{level},
enlarge x limits=0.25,
legend style={
at={(current bounding box.south-|current axis.south)},
anchor=north,
legend columns=-1,
draw=none,
/tikz/every even column/.append style={column sep=0.5cm}
},
]
\addplot[fill=_blue!90] table [x expr=\coordindex, y={gemv}]{\csv};
\addlegendentry{GEMV}
\addplot[fill=_orange!90] table [x expr=\coordindex, y={dnn}]{\csv};
\addlegendentry{DNN}
\end{axis}
\end{tikzpicture}


@@ -0,0 +1,5 @@
level,gemv,dnn
X1,8.725110246753701,0.5853017926410354
X2,8.926639006288317,3.6909334536122427
X3,9.010099560986427,5.380703318160134
X4,9.208111243015697,6.012517728019782


@@ -0,0 +1,4 @@
import polars as pl

# Print summary statistics of the measured speedups
print(pl.read_csv("vector.csv").describe())
print(pl.read_csv("matrix.csv").describe())


@@ -0,0 +1,5 @@
level,vadd,vmul,haxpy
X1,12.912945086743383,10.707228337727948,17.57341416054572
X2,12.657264796496554,10.41017271260676,17.530771651728568
X3,12.858101352840125,10.179728788420332,17.287022013303083
X4,12.5175927651105,10.158740110546228,17.568375657167437

plots/vector_infinite.tex Normal file

@@ -0,0 +1,33 @@
\begin{tikzpicture}
\pgfplotstableread[col sep=comma]{plots/speedup_tables/vector.csv}\csv
\begin{axis}[
width=5cm,
height=4cm,
ybar=1pt,
bar width = 5pt,
ymin=0,
ymax=20,
ymajorgrids,
ylabel={Relative Performance},
tick pos=left,
xtick=data,
xticklabels from table={\csv}{level},
enlarge x limits=0.25,
legend style={
at={(current bounding box.south-|current axis.south)},
anchor=north,
legend columns=-1,
draw=none,
/tikz/every even column/.append style={column sep=0.5cm}
},
]
\addplot[fill=_blue!90] table [x expr=\coordindex, y={vadd}]{\csv};
\addlegendentry{VADD}
\addplot[fill=_orange!90] table [x expr=\coordindex, y={vmul}]{\csv};
\addlegendentry{VMUL}
\addplot[fill=_yellow!90] table [x expr=\coordindex, y={haxpy}]{\csv};
\addlegendentry{HAXPY}
\end{axis}
\end{tikzpicture}


@@ -10,6 +10,9 @@
\usepackage[usenames,dvipsnames]{xcolor}
\usepackage{tikz}
\usepackage{mathdots}
\usepackage{tabularray}
\usepackage{pgfplotstable}
\usepackage{subfig}
\usepackage{graphicx}
% Used for displaying a sample figure. If possible, figure files should
@@ -130,9 +133,9 @@ In addition to the \acp{fpu}, a processing unit consists also of \acp{crf}, \acp
The \ac{crf} acts as an instruction buffer, holding the 32 32-bit instructions to be executed by the processor when performing a memory access.
One program that is stored in the \ac{crf} is called a \textit{microkernel}.
Each \ac{grf} consists of 16 registers, each with the \aca{hbm2} prefetch size of 256 bits, where each entry can hold the data of a full memory burst.
The \ac{grf} of a processing unit is divided into two halves (\ac{grf}-A and \ac{grf}-B), with eight register entries allocated to each of the two banks.
Finally, in the \acp{srf}, a 16-bit scalar value is replicated $\qty{16}{\times}$ as it is fed into the 16-wide \ac{simd} \ac{fpu} as a constant summand or factor for an addition or multiplication.
It is also divided into two halves (\ac{srf}-A and \ac{srf}-M) for addition and multiplication with eight entries each.
The \aca{fimdram} instruction set provides a total of nine 32-bit \ac{risc} instructions, each of which falls into one of three groups: control flow instructions (NOP, JUMP, EXIT), arithmetic instructions (ADD, MUL, MAC, MAD), and data movement instructions (MOV, FILL).
Since the execution of an instruction in the microkernel is initiated by a memory access, the host processor must execute \ac{ld} and \ac{st} instructions in a sequence that perfectly matches the loaded \ac{pim} microkernel.
@@ -168,7 +171,11 @@ Note that while the MAC instruction can iteratively add to the same destination
Instead, it is the host processor's responsibility to reduce these 16 floating-point numbers into one \ac{fp16} number.
With this implementation of the processing units, it is now possible to write a user program that controls the execution of the \ac{pim}-\acp{pu} directly in the \ac{hbm2} model.
% TODO software library...
To ease the process of using \ac{pim}, a software library is provided, which takes care of the following:
It implements the \textbf{mode switching} logic that switches between the \ac{sb}, \ac{ab} and \ac{abp} modes.
For the programming of the \textbf{microkernels}, the library provides data structures for their assembly and transfer to the \ac{pim} units.
Data structures are also provided for the layout of the input operands in a \ac{pim}-specific \textbf{memory layout}.
After mode switching and programming of the microkernel, the library implements functionality to \textbf{execute a user-defined microkernel} by issuing the necessary memory requests through the execution of \ac{ld} and \ac{st} instructions.
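The four responsibilities above could be sketched as follows; this is a purely illustrative model, and all names (`Mode`, `Microkernel`, `PimDevice`) are hypothetical, not the actual library API:

```python
# Hypothetical sketch of the described library workflow; class and method
# names are illustrative assumptions, not the real API.
from enum import Enum

class Mode(Enum):
    SB = 0   # single-bank mode
    AB = 1   # all-bank mode
    ABP = 2  # all-bank PIM mode

class Microkernel:
    """Assembles up to 32 32-bit instructions for transfer to the CRF."""
    MAX_INSTRUCTIONS = 32

    def __init__(self):
        self.instructions = []

    def emit(self, word):
        # each instruction is one 32-bit word
        assert len(self.instructions) < self.MAX_INSTRUCTIONS
        self.instructions.append(word & 0xFFFFFFFF)

class PimDevice:
    def __init__(self):
        self.mode = Mode.SB
        self.crf = []

    def switch_mode(self, mode):
        self.mode = mode

    def program(self, kernel):
        # models the transfer of the assembled microkernel into the CRF
        self.crf = list(kernel.instructions)

# usage: assemble a kernel, switch modes, program the PIM units
device = PimDevice()
kernel = Microkernel()
kernel.emit(0x0)  # e.g. a NOP placeholder
device.switch_mode(Mode.ABP)
device.program(kernel)
```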
The use of \ac{aam} requires a special memory layout so that the register indices are correctly calculated from the column and row addresses of a memory access.
The memory layout of a weight matrix used, e.g., for a \ac{gemv} operation is illustrated in \cref{img:matrix_layout}.
@@ -187,11 +194,102 @@ To guarantee the correct placement of the first matrix element at the boundary o
However, when using the \ac{aam} execution mode, this is not sufficient.
As already mentioned in \cref{sec:dram_pim}, the \ac{grf}-A and \ac{grf}-B indices are calculated from the column and row address of the triggering memory access.
With an alignment of $\qty{512}{\byte}$, no assumptions can be made about the initial value of the \ac{grf}-A and \ac{grf}-B indices, while for the execution of a complete \ac{gemv} kernel, both indices should start with zero.
Therefore, the larger alignment requirement of ${2^6 \cdot \qty{512}{\byte} = \qty{32768}{\byte}}$ must be ensured for the weight matrix.
After the operand initialization, the host processor executes the \ac{pim} microkernel by first switching to the \ac{abp} mode and then issuing the required \ac{rd} and \ac{wr} memory requests by executing \ac{ld} and \ac{st} instructions.
When executing control instructions or data movement instructions that operate only on the register files, the \ac{rd} and \ac{wr} requests must be located in a dummy region of memory where no actual data is stored, but which must be allocated beforehand.
Further, when data is read from or written to the memory banks, these memory requests are issued with the correct address for the data.
As half the banks in a \ac{pch} operate at the same time, from the viewpoint of the host processor, the data accesses occur very sparsely.
In the case of the input vector, where one 16-wide \ac{simd} vector of \ac{fp16} elements is repeated as many times as there are banks in a \ac{pch}, a burst access must occur every $\qty{32}{\byte}\cdot\mathrm{number\ of\ banks\ per\ \ac{pch}}=\qty{512}{\byte}$ over the entire interleaved input vector, at most $\qty{8}{\times}$.
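The access pattern described above can be checked with a short calculation (a 16-wide \ac{simd} vector of FP16 elements occupies $16 \cdot 2 = 32$ bytes, and the \ac{hbm2} configuration has 16 banks per \ac{pch}):

```python
# Illustrative calculation of the input-vector access pattern: one burst per
# 16-wide FP16 SIMD vector, interleaved across all banks of a pseudo channel.
SIMD_WIDTH = 16
FP16_BYTES = 2
BANKS_PER_PCH = 16

burst_bytes = SIMD_WIDTH * FP16_BYTES        # 32 B per burst access
stride = burst_bytes * BANKS_PER_PCH         # 512 B between host accesses

# at most 8 accesses sweep the interleaved input vector
addresses = [k * stride for k in range(8)]
```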
To then perform the repeated \ac{mac} operation with the weight matrix as bank data, a similar logic must be applied.
Since each row of the matrix resides in its own memory bank, with an interleaving of the size of a 16-wide \ac{simd} vector of \ac{fp16} elements, one memory access must likewise be issued every $\qty{512}{\byte}$.
As the input address of the weight matrix grows, the \ac{grf}-A and \ac{grf}-B indices are incremented in such a way that the \ac{grf}-A registers are read repeatedly to multiply the weights by the input vector, while the \ac{grf}-B registers are incremented in the outer loop to hold the results of additional matrix rows.
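A simplified model of this index progression is sketched below. The exact mapping of column and row address bits to register indices is an assumption here; the sketch only illustrates the described behavior that the \ac{grf}-A index cycles in the inner loop, the \ac{grf}-B index advances in the outer loop, and that the $2^6 \cdot \qty{512}{\byte} = \qty{32768}{\byte}$ alignment makes both indices start at zero:

```python
# Assumed (not documented) model of the AAM address-to-index mapping:
# one access every 512 B; the inner index cycles through the 8 GRF-A
# entries, the outer index through the 8 GRF-B entries.
ALIGNMENT = 2**6 * 512   # 32768 bytes

def grf_indices(byte_addr, stride=512, entries=8):
    k = byte_addr // stride
    return k % entries, (k // entries) % entries  # (GRF-A, GRF-B)

# an address aligned to 32768 B yields the index pair (0, 0)
base = 3 * ALIGNMENT
```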
Besides generating memory requests, an important task of the software library is to maintain the data coherence of the program.
The compiler may introduce invariants with respect to the value of the output vector, since it cannot see that the value of the vector has changed without the host explicitly writing to it.
As a result, the compiler may make optimizations that are not obvious to the programmer, such as reordering memory accesses, that cause the program to execute incorrectly.
To avoid this, the processor must introduce memory barriers not only between non-\ac{aam} instructions in the microkernel, but also after initializing the input operands and before reading the output vector, to ensure that all memory accesses and \ac{pim} operations have completed.
\section{Simulations}
Our simulations are based on the gem5 simulator and the DRAMSys memory simulator.
The comparison between non-\ac{pim} and \ac{pim} architectures considers a hypothetical host processor with infinite compute capacity.
In this ideal approach, memory bandwidth is the only limiting component, allowing only memory-bound effects to be considered.
This provides a lower bound on the possible speedups achieved by \ac{pim}, independent of the host architecture.
The configuration of \ac{hbm2} \ac{dram} is summarized in \cref{tab:memspec}.
\begin{table}
\centering
\begin{tblr}{
hlines,
vlines,
column{3} = {r},
row{1} = {l},
hline{2} = {2}{-}{solid,black},
}
Parameter & Description & Value \\
Number of Bank Groups & Bank Groups per \ac{pch} & 4 \\
Number of Banks & Banks per \ac{pch} & 16 \\
Number of \acp{pch} & \acp{pch} per Channel & 2 \\
Number of Channels & Total Number of Channels & 1 \\
Number of Columns & Columns per Memory Array & 128 \\
Number of Rows & Rows per Memory Array & 65536 \\
Width & Width of the Data Bus & 64
\end{tblr}
\caption{The configuration of \ac{hbm2}.}
\label{tab:memspec}
\end{table}
Our benchmarks are divided into two classes: vector benchmarks, which perform level 1 \ac{blas}-like operations, and matrix-vector benchmarks, which perform level 2 \ac{blas} operations.
Both classes of benchmarks are typically memory-bound, since little or no data is reused during the operation.
For the first class of benchmarks, two \ac{fp16} vectors are added (VADD), multiplied (VMUL), or combined in a \ac{haxpy} fashion.
The second class of benchmarks performs a \ac{gemv} matrix-vector multiplication or models a simple fully connected neural network with multiple layers, applying the \ac{relu} activation function in between.
Each benchmark is executed with variable operand dimensions, which are listed in \cref{tab:dimensions}.
\begin{table}
\centering
\begin{tblr}{
hlines,
vlines,
column{1} = {c},
column{2} = {r},
column{3} = {r},
column{4} = {r},
row{1} = {l},
hline{2} = {2}{-}{solid,black},
}
Level & Vector & \ac{gemv} & \ac{dnn} \\
X1 & 2M & (1024 $\times$ 4096) & (256 $\times$ 256) \\
X2 & 4M & (2048 $\times$ 4096) & (512 $\times$ 512) \\
X3 & 8M & (4096 $\times$ 8192) & (1024 $\times$ 1024) \\
X4 & 16M & (8192 $\times$ 8192) & (2048 $\times$ 2048)
\end{tblr}
\caption{Input operand dimensions.}
\label{tab:dimensions}
\end{table}
The focus of the benchmarks lies on the achievable performance gain of \ac{pim}.
In each simulation run, the relative performance (speedup) of \ac{pim} compared to non-\ac{pim} is analyzed.
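The reference semantics of the benchmark kernels can be sketched in plain Python (ignoring the \ac{fp16} precision and the \ac{pim} execution model of the actual kernels; the toy sizes here are not the X1-X4 dimensions):

```python
# Plain-float reference semantics of the benchmark kernels described above.
def vadd(x, y):
    return [a + b for a, b in zip(x, y)]

def vmul(x, y):
    return [a * b for a, b in zip(x, y)]

def haxpy(a, x, y):
    # a * x + y, the HAXPY operation
    return [a * xi + yi for xi, yi in zip(x, y)]

def gemv(m, x):
    # matrix-vector product, one dot product per matrix row
    return [sum(w * xi for w, xi in zip(row, x)) for row in m]

def relu(v):
    return [max(0.0, e) for e in v]

def dnn(layers, x):
    # fully connected layers with ReLU applied between them
    for i, w in enumerate(layers):
        x = gemv(w, x)
        if i < len(layers) - 1:
            x = relu(x)
    return x
```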
\section{Results}
The results in \cref{fig:speedups} show significant speedups for all vector benchmarks in all simulated operand dimensions, with the following average values: $\qty{12.7}{\times}$ for VADD, $\qty{10.4}{\times}$ for VMUL and $\qty{17.5}{\times}$ for \ac{haxpy}.
On the other hand, the achieved speedup for the matrix-vector simulations varied with the simulated operand dimensions.
The \ac{gemv} benchmark achieved a speedup in the range $\qtyrange{8.7}{9.2}{\times}$ with an average value of $\qty{9.0}{\times}$, while the fully connected neural network layers experienced a higher variance:
With a range of $\qtyrange{0.6}{6.0}{\times}$, the \ac{dnn} benchmark experienced both a slowdown and an acceleration of the inference time.
Therefore, there is a break-even point between the dimensions X1 and X2, beyond which \ac{pim} can be expected to be viable.
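The quoted averages and ranges can be reproduced directly from the measured speedups in `plots/speedup_tables/vector.csv` and `matrix.csv` (values copied from the tables in this commit):

```python
# Measured speedups for dimensions X1..X4, taken from the CSV files above.
vadd  = [12.912945086743383, 12.657264796496554, 12.858101352840125, 12.5175927651105]
vmul  = [10.707228337727948, 10.41017271260676, 10.179728788420332, 10.158740110546228]
haxpy = [17.57341416054572, 17.530771651728568, 17.287022013303083, 17.568375657167437]
gemv  = [8.725110246753701, 8.926639006288317, 9.010099560986427, 9.208111243015697]
dnn   = [0.5853017926410354, 3.6909334536122427, 5.380703318160134, 6.012517728019782]

def mean(xs):
    return sum(xs) / len(xs)
```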
\begin{figure}
\centering
\subfloat[\centering Vector Benchmarks]{{\input{plots/vector_infinite}}}
\subfloat[\centering Matrix-Vector Benchmarks]{{\input{plots/matrix_infinite}}}
\caption{Comparison between non-\ac{pim} and \ac{pim}.}
\label{fig:speedups}
\end{figure}
Comparison with Samsung...
% TODO Derek
\section{Conclusion}
% TODO Lukas/Matthias
%