As already discussed in \cref{sec:intro}, PIM is a good fit for accelerating memory-bound workloads with low operational intensity.
In contrast, compute-bound workloads tend to have high data reuse and can make extensive use of the on-chip caches, and therefore do not need to utilize the full memory bandwidth.
A large number of modern \acp{dnn} layers can be expressed as a matrix-vector multiplication.
The layer inputs can be represented as a vector and the model weights can be viewed as a matrix, where the number of columns is equal to the size of the input vector and the number of rows is equal to the size of the output vector.
The pairwise multiplication of the input vector with a row of the matrix, followed by a summation of the products, yields one entry of the output vector.
Such an operation, defined in the widely used \ac{blas} library \cite{blas1979}, is also known as a \acs{gemv} routine.
A special feature of \aca{fimdram} is that it does not require any changes to components of modern processors, such as the memory controller, i.e., it is agnostic to existing \aca{hbm2} platforms.
Consequently, operating the \acp{pu} requires mode switching, which makes \aca{fimdram} less suitable for interleaved \ac{pim} and non-\ac{pim} traffic and for small batch sizes.
At the heart of \aca{fimdram} are the \acp{pu}, each of which is shared by two banks of a \ac{pch}.
The architecture of such a \ac{pu} is illustrated in \cref{fig:pu}.
\begin{figure}
	\centering
	\includegraphics{images/pu.eps}
	\caption{Architecture of a \aca{fimdram} \ac{pu}.}
	\label{fig:pu}
\end{figure}
A \ac{pu} includes 16 16-bit wide \ac{simd} \acp{fpu} as well as \acp{crf}, \acp{grf}, and \acp{srf} \cite{lee2021}.
The 16-wide \ac{simd} units match the 256-bit prefetch architecture of \aca{hbm2}: a single memory access passes 16 16-bit floating-point operands directly from the \acp{ssa} to the \acp{fpu}.
As all \acp{pu} operate in parallel and, with 16 banks per \ac{pch}, there are 8 \acp{pu} per \ac{pch}, a single memory access loads a total of $8\cdot\qty{256}{\bit}=\qty{2048}{\bit}$ into the \acp{fpu}.
As a result, the theoretical internal bandwidth of \aca{fimdram} is $8\times$ higher than the external bus bandwidth to the host processor.