FIMDRAM Conclusion
@@ -247,6 +247,10 @@
    short = C,
    long = The C Programming Language,
}
\DeclareAcronym{fpga}{
    short = FPGA,
    long = field-programmable gate array,
}
\DeclareAcronym{tlm}{
    short = TLM,
    long = transaction-level modeling,
@@ -358,16 +358,38 @@ The operation of this concrete \ac{gemv} microkernel is illustrated in Figure \r
\begin{figure}
\centering
\includegraphics[width=0.8\linewidth]{images/memory_layout}
\caption[Procedure to perform a (128)x(128,8) \ac{gemv} operation]{Procedure to perform a (128)x(128,8) \ac{gemv} operation. One cell represents 16 \ac{fp16} elements forming a $\qty{32}{\byte}$ block \cite{kang2022}.}
\label{img:memory_layout}
\end{figure}

Figure \ref{img:memory_layout} shows that each processing unit multiplies one row of the matrix with the input vector and accumulates the products over eight cycles, forming a partial sum.
This example only demonstrates the execution of the native matrix dimensions for one \ac{pch}.
To increase the number of rows in the matrix, additional iterations of this 8-cycle microkernel are required, each supplying the memory addresses of the subsequent matrix rows.
As a side effect of the incremented bank address, the \ac{grf}-B index is incremented as well, which raises the maximum number of matrix rows to $8 \cdot 8 = 64$ before all eight \ac{grf}-B entries are filled with partial sums, as demonstrated in Listing \ref{lst:gemv64}.

\begin{listing}
\begin{verbatim}
MAC(AAM) GRF_B, BANK, GRF_A
JUMP -1, 63
\end{verbatim}
\caption[The core of a \ac{mac} microkernel that utilizes the maximum number of register entries]{The core of a \ac{mac} microkernel that utilizes the maximum number of register entries.}
\label{lst:gemv64}
\end{listing}

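As a rough check of the loop bounds in the \ac{mac} core above, the following count assumes that the two operands of \texttt{JUMP} denote a relative jump target and a repetition count; this reading is an assumption and is not spelled out by the listing itself.
Under this assumption, the \ac{mac} instruction is issued once initially and then repeated 63 more times, which corresponds to eight cycles for each of the eight \ac{grf}-B entries:
\begin{equation*}
    1 + 63 = 64 = 8 \cdot 8
\end{equation*}
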
To increase the number of columns, new entries of the input vector must be loaded into the processing units.
Therefore, it is necessary to execute the complete \ac{gemv} microkernel several times, once for each input vector chunk and the corresponding weight matrix columns.
In general, the more the dimensions exceed the native \ac{pim} matrix dimensions, the more often the \ac{mac} core of the \ac{gemv} microkernel must be executed.

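The scaling described above can be sketched as a simple estimate.
The symbols $M$ (number of matrix rows), $N$ (number of matrix columns) and $N_\mathrm{core}$ (number of executions of the 8-cycle \ac{mac} core on one \ac{pch}) are introduced here for illustration only, and the estimate assumes that partially filled blocks are padded to the native dimensions of 8 rows and 128 columns:
\begin{equation*}
    N_\mathrm{core} = \left\lceil \frac{M}{8} \right\rceil \cdot \left\lceil \frac{N}{128} \right\rceil
\end{equation*}
For example, a matrix with $M = 64$ and $N = 256$ would require $8 \cdot 2 = 16$ executions under these assumptions, compared to a single execution for the native dimensions.
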
\subsubsection{Performance and Power Efficiency Achievements}

Beyond stating the theoretical bandwidth available to the \ac{pim} units, which is $\qty[per-mode=symbol]{128}{\giga\byte\per\second}$ per \ac{pch} or a total of $\qty[per-mode=symbol]{2}{\tera\byte\per\second}$ for 16 \acp{pch}, Samsung also ran experiments on a real implementation of \ac{hbm}-\ac{pim} to analyze its performance gains and power efficiency improvements.
This real system is based on a Xilinx Zynq UltraScale+ \ac{fpga} that sits on the same silicon interposer as four \aca{hbm} stacks, each consisting of one buffer die, four \ac{hbm}-\ac{pim} dies and four normal \aca{hbm} dies \cite{lee2021}.
The results show performance gains in the range of $\qtyrange{1.4}{11.2}{\times}$ in the tested microbenchmarks, with the highest gain of $\qty{11.2}{\times}$ for a \ac{gemv} kernel.
Real layers of \acp{dnn} achieved a performance gain in the range of $\qtyrange{1.4}{3.5}{\times}$.

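The total bandwidth quoted above follows from the bandwidth of a single \ac{pch}, assuming that all 16 \acp{pch} operate concurrently and that the result is rounded to a whole number of terabytes:
\begin{equation*}
    16 \cdot \qty[per-mode=symbol]{128}{\giga\byte\per\second} = \qty[per-mode=symbol]{2048}{\giga\byte\per\second} \approx \qty[per-mode=symbol]{2}{\tera\byte\per\second}
\end{equation*}
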
The power consumption of the \ac{hbm}-\ac{pim} dies themselves is $\qty{5.4}{\percent}$ higher than that of regular \aca{hbm}.
However, the increased processing bandwidth and the reduced power consumption on the global \ac{io}-bus led to an $\qty{8.25}{\percent}$ higher energy efficiency for a \ac{gemv} kernel, and $\qtyrange{1.38}{3.2}{\times}$ higher efficiency for real \ac{dnn} layers.

In conclusion, \ac{hbm}-\ac{pim} is one of the few real \ac{pim} implementations by hardware vendors at this time and promises significant performance gains and higher power efficiency compared to regular \aca{hbm} \ac{dram}.
Section \ref{sec:vp} introduces the concept of virtual prototyping, which forms the basis for the subsequent implementation of the \ac{hbm}-\ac{pim} model in a simulator.