FIMDRAM Conclusion

This commit is contained in:
2024-02-12 19:12:19 +01:00
parent 65522a8cfa
commit 7161e71ae1
2 changed files with 29 additions and 3 deletions


@@ -247,6 +247,10 @@
short = C,
long = The C Programming Language,
}
\DeclareAcronym{fpga}{
short = FPGA,
long = field-programmable gate array,
}
\DeclareAcronym{tlm}{
short = TLM,
long = transaction-level modeling,


@@ -358,16 +358,38 @@ The operation of this concrete \ac{gemv} microkernel is illustrated in Figure \r
\begin{figure}
\centering
\includegraphics[width=0.8\linewidth]{images/memory_layout}
\caption[Procedure to perform a $(128)\times(128,8)$ \ac{gemv} operation]{Procedure to perform a $(128)\times(128,8)$ \ac{gemv} operation. One cell represents 16 \ac{fp16} elements forming a $\qty{32}{\byte}$ block \cite{kang2022}.}
\label{img:memory_layout}
\end{figure}
Figure \ref{img:memory_layout} shows that a processing unit is responsible for multiplying one row of the matrix with the input vector and accumulating the partial sum over eight cycles.
This example only demonstrates the execution of the native matrix dimensions for one \ac{pch}.
To increase the number of rows in the matrix, additional iterations of this 8-cycle microkernel are simply required, feeding in the memory addresses of the subsequent matrix rows.
As a side effect of the incremented bank address, this also results in an increment of the \ac{grf}-B index, making it possible to increase the maximum number of matrix rows to $8 \times 8 = 64$ before all eight \ac{grf}-B entries are filled with partial sums, as demonstrated in Listing \ref{lst:gemv64}.
\begin{listing}
\begin{verbatim}
MAC(AAM) GRF_B, BANK, GRF_A
JUMP -1, 63
\end{verbatim}
\caption[The core of a \ac{mac} microkernel that utilizes the maximum number of register entries]{The core of a \ac{mac} microkernel that utilizes the maximum number of register entries.}
\label{lst:gemv64}
\end{listing}
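The per-row accumulation described above can be sketched in plain Python. This is an illustrative model, not the hardware implementation: the block size of 16 \ac{fp16} elements and the eight cycles per 128-element row follow the text, while the function name and the data values are hypothetical.

\begin{verbatim}
# Sketch of the 8-cycle GEMV microkernel for one processing unit:
# each cycle multiplies one 16-element FP16 block of a matrix row
# with the matching block of the input vector and accumulates the
# result into a partial sum (the GRF-B entry).
BLOCK = 16          # FP16 elements per 32-byte block
BLOCKS_PER_ROW = 8  # 8 cycles cover one 128-element row

def gemv_row(matrix_row, input_vector):
    """Accumulate one partial sum over eight MAC cycles."""
    partial_sum = 0.0
    for cycle in range(BLOCKS_PER_ROW):
        lo = cycle * BLOCK
        hi = lo + BLOCK
        # MAC step: multiply 16-element blocks, add to partial sum
        partial_sum += sum(m * v for m, v in
                           zip(matrix_row[lo:hi], input_vector[lo:hi]))
    return partial_sum

# 64 rows fill the eight GRF-B entries across eight banks (8 * 8 = 64)
vector = [1.0] * 128
rows = [[0.5] * 128 for _ in range(64)]
grf_b_partial_sums = [gemv_row(r, vector) for r in rows]
\end{verbatim}

For the uniform example data above, each partial sum evaluates to $128 \cdot 0.5 \cdot 1.0 = 64$.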
To increase the number of columns, new entries of the input vector must be loaded into the processing units.
Therefore, it is necessary to execute the complete \ac{gemv} microkernel several times with the different input vector chunks and weight matrix columns.
In general, the more the dimensions exceed the native \ac{pim} matrix dimensions, the more often the \ac{mac} core of the \ac{gemv} microkernel must be executed.
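This tiling can be illustrated with a small calculation. The sketch below is an assumption-laden model, not taken from the paper: it takes 64 rows (eight banks times eight \ac{grf}-B entries) and 128 columns (one input-vector chunk) as the native dimensions per \ac{pch} and counts how often the \ac{mac} core must run for larger matrices.

\begin{verbatim}
# Illustrative tile count for matrices exceeding the assumed native
# PIM dimensions: 64 rows (8 banks x 8 GRF-B entries) and 128 columns
# (one input-vector chunk) per pseudo-channel.
import math

NATIVE_ROWS = 64
NATIVE_COLS = 128

def mac_core_runs(rows, cols):
    """Number of executions of the 8-cycle MAC core per pch."""
    row_tiles = math.ceil(rows / NATIVE_ROWS)
    col_tiles = math.ceil(cols / NATIVE_COLS)
    return row_tiles * col_tiles

mac_core_runs(64, 128)   # native dimensions: 1 run
mac_core_runs(256, 512)  # 4 row tiles x 4 column tiles: 16 runs
\end{verbatim}

Note that any dimension only slightly above a native boundary already costs a full additional tile, which is why matrix shapes aligned to the \ac{pim} geometry use the hardware most efficiently.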
\subsubsection{Performance and Power Efficiency Achievements}
In addition to the theoretical bandwidth of $\qty[per-mode=symbol]{128}{\giga\byte\per\second}$ per \ac{pch} provided to the \ac{pim} units, or a total of $\qty[per-mode=symbol]{2}{\tera\byte\per\second}$ for 16 \acp{pch}, Samsung also ran experiments on a real implementation of \ac{hbm}-\ac{pim} to analyze its performance gains and power efficiency improvements.
This real system is based on a Xilinx Zynq Ultrascale+ \ac{fpga} that lies on the same silicon interposer as four \aca{hbm} stacks, each with one buffer die, four \ac{hbm}-\ac{pim} dies, and four regular \aca{hbm} dies \cite{lee2021}.
Results promise performance gains in the range of $\qtyrange{1.4}{11.2}{\times}$ for the tested microbenchmarks, with the highest gain achieved by a \ac{gemv} kernel.
Real layers of \acp{dnn} achieved a performance gain in the range of $\qtyrange{1.4}{3.5}{\times}$.
The power consumption of the \ac{hbm}-\ac{pim} dies themselves is $\qty{5.4}{\percent}$ higher than that of regular \aca{hbm}.
However, the increased processing bandwidth and the reduced power consumption on the global \ac{io} bus led to an $\qty{8.25}{\percent}$ higher energy efficiency for a \ac{gemv} kernel and a $\qtyrange{1.38}{3.2}{\times}$ higher efficiency for real \ac{dnn} layers.
In conclusion, \ac{hbm}-\ac{pim} is one of the few real \ac{pim} implementations by hardware vendors at this time and promises significant performance gains and higher power efficiency compared to regular \aca{hbm} \ac{dram}.
The following Section \ref{sec:vp} introduces the concept of virtual prototyping, which forms the basis for the subsequent implementation of the \ac{hbm}-\ac{pim} model in a simulator.