From 60ed5de838903053429f4e8700459cb29a7bed7f Mon Sep 17 00:00:00 2001
From: "christ.derek"
Date: Fri, 22 Mar 2024 08:44:35 +0000
Subject: [PATCH] Update on Overleaf.

---
 samplepaper.tex | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/samplepaper.tex b/samplepaper.tex
index 4b578fa..ac4ca69 100644
--- a/samplepaper.tex
+++ b/samplepaper.tex
@@ -147,7 +147,7 @@ One real \ac{pim} implementation of the major \ac{dram} manufacturer Samsung, ca
 A special feature of \aca{fimdram} is that it does not require any changes to components of modern processors, such as the memory controller, i.e., it is agnostic to existing \aca{hbm2} platforms.
 Consequently, for the operation of the \acp{pu}, mode switching is required for \aca{fimdram}, which makes it less useful for interleaved \ac{pim} and non-\ac{pim} traffic and small batch sizes.
 
-At the heart of \aca{fimdram} are the \acp{pu}, where one of which is shared by two banks each of a \ac{pch}.
+At the heart of \aca{fimdram} lie the \acp{pu}, each of which is shared by two banks of the same \ac{pch}.
 The architecture of such a \ac{pu} is illustrated in \cref{fig:pu}.
 
 \begin{figure}
@@ -157,7 +157,8 @@ The architecture of such a \ac{pu} is illustrated in \cref{fig:pu}.
 \label{fig:pu}
 \end{figure}
 
-A \ac{pu} includes 16 16-bit wide \ac{simd} \acp{fpu}, \acp{crf}, \acp{grf} and \acp{srf} \cite{lee2021}.
+A \ac{pu} contains two sets of \acp{fpu}, one for addition and one for multiplication, each consisting of 16 16-bit wide \ac{simd} \acp{fpu}.
+Besides the \acp{fpu}, a \ac{pu} contains a \ac{crf}, a \ac{grf} and a \ac{srf} \cite{lee2021}.
 The 16-wide \ac{simd} units correspond to the 256-bit prefetch architecture of \aca{hbm2}, where 16 16-bit floating-point operands are passed directly from the \acp{ssa} to the \acp{fpu} from a single memory access.
 As all \ac{pim} units operate in parallel, with 16 banks per \ac{pch}, a singular memory access loads a total of $\qty{256}{\bit}\cdot\qty{8}{\acp{pu}}=\qty{2048}{\bit}$ into the \acp{fpu}.
 As a result, the theoretical internal bandwidth of \aca{fimdram} is $\qty{8}{\times}$ higher than the external bus bandwidth to the host processor.
@@ -174,7 +175,7 @@ Both in \ac{ab} mode and in \ac{abp} mode, the total \aca{hbm2} bandwidth per \a
 
 Due to the focus on \ac{dnn} applications in \aca{fimdram}, the native data type for the \acp{fpu} is \ac{fp16}, which is motivated by the significantly lower area and power requirements for \acp{fpu} compared to 32-bit \ac{fp} numbers.
 The \ac{simd} \acp{fpu} of the processing units is implemented once as a \ac{fp16} multiplier unit, and once as a \ac{fp16} adder unit, providing support for these basic algorithmic operations.
-In addition to the \acp{fpu}, a processing unit consists also of \acp{crf}, \acp{srf} and \acp{grf}.
+
 The \ac{crf} acts as an instruction buffer, holding the 32 32-bit instructions to be executed by the processor when performing a memory access.
 One program that is stored in the \ac{crf} is called a \textit{microkernel}.
 Each \ac{grf} consists of 16 registers, each with the \aca{hbm2} prefetch size of 256 bits, where each entry can hold the data of a full memory burst.