Update on Overleaf.
@@ -147,7 +147,7 @@ One real \ac{pim} implementation of the major \ac{dram} manufacturer Samsung, ca
A special feature of \aca{fimdram} is that it does not require any changes to components of modern processors, such as the memory controller, i.e., it is agnostic to existing \aca{hbm2} platforms.
Consequently, \aca{fimdram} requires mode switching to operate the \acp{pu}, which makes it less suitable for interleaved \ac{pim} and non-\ac{pim} traffic and for small batch sizes.
At the heart of \aca{fimdram} are the \acp{pu}, where one of which is shared by two banks each of a \ac{pch}.
At the heart of \aca{fimdram} lie the \acp{pu}, each of which is shared by two banks of the same \ac{pch}.
The architecture of such a \ac{pu} is illustrated in \cref{fig:pu}.
\begin{figure}
@@ -157,7 +157,8 @@ The architecture of such a \ac{pu} is illustrated in \cref{fig:pu}.
\label{fig:pu}
\end{figure}
A \ac{pu} includes 16 16-bit wide \ac{simd} \acp{fpu}, \acp{crf}, \acp{grf} and \acp{srf} \cite{lee2021}.
A \ac{pu} contains two sets of \acp{fpu}, one for addition and one for multiplication, each comprising 16 16-bit wide \ac{simd} \acp{fpu}.
Besides the \acp{fpu}, a \ac{pu} contains a \ac{crf}, a \ac{grf} and a \ac{srf} \cite{lee2021}.
The 16-wide \ac{simd} units correspond to the 256-bit prefetch architecture of \aca{hbm2}, so that a single memory access passes 16 16-bit floating-point operands directly from the \acp{ssa} to the \acp{fpu}.
As all \ac{pim} units operate in parallel, with 16 banks per \ac{pch} and thus 8 \acp{pu}, a single memory access loads a total of $\qty{256}{\bit}\cdot\qty{8}{\acp{pu}}=\qty{2048}{\bit}$ into the \acp{fpu}.
As a result, the theoretical internal bandwidth of \aca{fimdram} is $\qty{8}{\times}$ higher than the external bus bandwidth to the host processor.
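As a rough worked check of this factor, the numbers above can be combined as follows.
This is only a sketch: the symbols $N_{\mathrm{PU}}$, $W_{\mathrm{int}}$ and $W_{\mathrm{ext}}$ are introduced here purely for illustration, and it is assumed that the external interface delivers one 256-bit prefetch per memory access of a \ac{pch}.
\[
    N_{\mathrm{PU}} = \frac{16\ \mathrm{banks}}{2\ \mathrm{banks\ per\ PU}} = 8
\]
\[
    W_{\mathrm{int}} = N_{\mathrm{PU}} \cdot 256\,\mathrm{bit} = 2048\,\mathrm{bit}
\]
\[
    \frac{W_{\mathrm{int}}}{W_{\mathrm{ext}}} = \frac{2048\,\mathrm{bit}}{256\,\mathrm{bit}} = 8
\]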
@@ -174,7 +175,7 @@ Both in \ac{ab} mode and in \ac{abp} mode, the total \aca{hbm2} bandwidth per \a
Due to the focus on \ac{dnn} applications in \aca{fimdram}, the native data type for the \acp{fpu} is \ac{fp16}, a choice motivated by the significantly lower area and power requirements of the \acp{fpu} compared to 32-bit \ac{fp} arithmetic.
The \ac{simd} \acp{fpu} of the processing units are implemented once as a \ac{fp16} multiplier unit and once as a \ac{fp16} adder unit, providing support for these basic arithmetic operations.
In addition to the \acp{fpu}, a processing unit also consists of \acp{crf}, \acp{srf} and \acp{grf}.
The \ac{crf} acts as an instruction buffer, holding the 32 32-bit instructions to be executed by the processor when performing a memory access.
One program that is stored in the \ac{crf} is called a \textit{microkernel}.
Each \ac{grf} consists of 16 registers, each with the \aca{hbm2} prefetch size of 256 bits, where each entry can hold the data of a full memory burst.
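For orientation, the storage capacities implied by these figures can be worked out as follows.
This is only a sketch: $C_{\mathrm{CRF}}$ and $C_{\mathrm{GRF}}$ are illustrative symbols that are not used elsewhere in this text.
\[
    C_{\mathrm{CRF}} = 32 \cdot 32\,\mathrm{bit} = 1024\,\mathrm{bit} = 128\,\mathrm{B}
\]
\[
    C_{\mathrm{GRF}} = 16 \cdot 256\,\mathrm{bit} = 4096\,\mathrm{bit} = 512\,\mathrm{B}
\]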