As already discussed in \cref{sec:intro}, PIM is a good fit for accelerating memory-bound workloads with low operational intensity.
In contrast, compute-bound workloads tend to have high data reuse and can make extensive use of the on-chip caches, and therefore do not need to utilize the full memory bandwidth.
A large number of modern \acp{dnn} layers can be expressed as a matrix-vector multiplication.
The layer inputs can be represented as a vector and the model weights can be viewed as a matrix, where the number of columns is equal to the size of the input vector and the number of rows is equal to the size of the output vector.
The pairwise multiplication of the input vector with a row of the matrix, followed by a summation of the products, yields one entry of the output vector.
Such an operation, defined in the widely used \ac{blas} library \cite{blas1979}, is also known as a \acs{gemv} routine.
A special feature of \aca{fimdram} is that it does not require any changes to components of modern processors, such as the memory controller, i.e., it is agnostic to existing \aca{hbm2} platforms.
Consequently, operating the \acp{pu} requires mode switching, which makes \aca{fimdram} less suitable for interleaved \ac{pim} and non-\ac{pim} traffic and for small batch sizes.
At the heart of \aca{fimdram} are the \acp{pu}, each of which is shared by two banks of a \ac{pch}.
The architecture of such a \ac{pu} is illustrated in \cref{fig:pu}.
\begin{figure}
	\centering
	\includegraphics{images/pu.eps}
	\caption{Architecture of a \aca{fimdram} \ac{pu}.}
	\label{fig:pu}
\end{figure}
A \ac{pu} includes 16 16-bit wide \ac{simd} \acp{fpu} as well as \acp{crf}, \acp{grf}, and \acp{srf} \cite{lee2021}.
The 16-wide \ac{simd} units match the 256-bit prefetch architecture of \aca{hbm2}: a single memory access passes 16 16-bit floating-point operands directly from the \acp{ssa} to the \acp{fpu}.
As all \acp{pu} operate in parallel and, with 16 banks per \ac{pch}, there are 8 \acp{pu} per \ac{pch}, a single memory access loads a total of $8\cdot\qty{256}{\bit}=\qty{2048}{\bit}$ into the \acp{fpu}.
As a result, the theoretical internal bandwidth of \aca{fimdram} is $8\times$ higher than the external bus bandwidth to the host processor.