Implementation of the virtual machine

2024-02-14 17:32:23 +01:00
parent 89dffecdf0
commit c4f9383dad
7 changed files with 108 additions and 15 deletions


@@ -122,7 +122,7 @@ Finally, the host reads the result latches from all banks at the same time and c
Overall, Newton completes the arithmetic operations of a row in all banks in the time it takes a conventional DRAM to read a row from one bank \cite{he2020}.
As a result, Newton promises a $\qtyrange{10}{54}{\times}$ speedup compared to a theoretical non-\ac{pim} system with unlimited compute capability, whose performance is therefore limited entirely by the available memory bandwidth.
\subsection{\Acl{fimdram}}
\label{sec:pim_fim}
One year after SK Hynix, the major \ac{dram} manufacturer Samsung announced its own \ac{pim} \ac{dram} implementation, called \acf{fimdram}.
@@ -130,6 +130,7 @@ As this is the \ac{pim} architecture which was implemented during the work on th
The following subsections are mainly based on \cite{lee2021} and \cite{kwon2021}, with \cref{sec:memory_layout} mainly based on \cite{kang2022}.
\subsubsection{Architecture}
\label{sec:pim_architecture}
As the name of \aca{fimdram} suggests, it is based on the \aca{hbm} memory standard, and it integrates 16-wide \ac{simd} engines directly into the memory banks, exploiting bank-level parallelism, while retaining the highly optimized \acp{subarray} \cite{kwon2021}.
A major difference from Newton \ac{pim} is that \aca{fimdram} does not require any changes to components of modern processors, such as the memory controller, i.e. it is agnostic to existing \aca{hbm} platforms.
Since the memory controller is unaware of the \ac{pim} capability, explicit mode switching is required for \aca{fimdram}, making it less suitable for interleaved \ac{pim} and non-\ac{pim} traffic.
@@ -146,32 +147,32 @@ This general architecture is shown in detail in \cref{img:fimdram}, with (a) the
\label{img:fimdram}
\end{figure}
As can be seen in (c), the input data to the \ac{fpu} can come directly from the memory bank, from a \ac{grf}/\ac{srf}, or from the result bus of a previous computation.
The 16-wide \ac{simd} units correspond to the 256-bit prefetch architecture of \aca{hbm}, where 16 16-bit floating-point operands are passed directly from the \acp{ssa} to the \acp{fpu} with a single memory access.
As all \ac{pim} units operate in parallel, with 16 banks per \ac{pch}, a single memory access loads a total of $\qty{256}{\bit} \times 16~\text{banks} = \qty{4096}{\bit}$ into the \acp{fpu}.
As a result, the theoretical internal bandwidth of \aca{fimdram} is $\qty{16}{\times}$ higher than the external bus bandwidth to the host processor.
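As a quick plausibility check, the per-access figures quoted above can be reproduced with a few lines of arithmetic (the constants are taken from the text, not from a datasheet):

```python
# Per-access operand loading in FIMDRAM, per pseudo-channel,
# using the figures quoted in the text above.
PREFETCH_BITS = 256   # 256-bit HBM prefetch per bank access
BANKS_PER_PCH = 16    # banks operating in parallel per pseudo-channel
OPERAND_BITS = 16     # FP16 operand width

operands_per_bank = PREFETCH_BITS // OPERAND_BITS  # width of each SIMD FPU
bits_per_access = PREFETCH_BITS * BANKS_PER_PCH    # bits loaded per access

print(operands_per_bank)                  # 16 lanes per FPU
print(bits_per_access)                    # 4096 bits per access
# The external bus only sees one 256-bit prefetch at a time, hence:
print(bits_per_access // PREFETCH_BITS)   # 16x internal bandwidth
```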
\Ac{hbm}-\ac{pim} defines three operating modes:
\begin{enumerate}
\item \textbf{\Ac{sb} Mode}:
This is the default operating mode, where \aca{fimdram} has identical behavior to normal \aca{hbm} memory.
To switch to another mode, a predefined sequence of \ac{act} and \ac{pre} commands must be issued by the memory controller to a specific row address.
\item \textbf{\Ac{ab} Mode}:
The \ac{ab} mode is an extension of the \ac{sb} mode in which the \ac{pim} execution units allow concurrent access to half of the \ac{dram} banks.
This provides $\qty{8}{\times}$ more bandwidth than the standard operating mode, which can be used, for example, to initialize memory regions across all banks.
\item \textbf{\Ac{abp} Mode}:
With another predefined \ac{dram} access sequence, the memory switches to the \ac{pim}-enabled mode.
In this mode, a single memory access initiates the concurrent execution of the next instruction across all processing units.
In addition, the \ac{io} circuits of the \ac{dram} are completely disabled in this mode, reducing the power required during \ac{pim} operation.
\end{enumerate}
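The three modes and the switches between them can be summarized as a small state machine. This is a behavioral sketch only: the real transitions are triggered by the predefined \ac{act}/\ac{pre} command sequences described above, and the token names and the reachable transitions here are assumptions for illustration.

```python
from enum import Enum

class Mode(Enum):
    SB = "single-bank"       # default: behaves like normal HBM memory
    AB = "all-bank"          # concurrent access to half the banks
    AB_PIM = "all-bank-pim"  # each access triggers a PIM instruction

# Assumed transition graph; the token strings stand in for the
# predefined ACT/PRE command sequences and are made up.
TRANSITIONS = {
    (Mode.SB, "enter_ab"): Mode.AB,
    (Mode.AB, "enter_ab_pim"): Mode.AB_PIM,
    (Mode.AB_PIM, "exit"): Mode.SB,
}

def apply_sequence(mode: Mode, token: str) -> Mode:
    """Return the new mode, or the old one if the sequence is invalid."""
    return TRANSITIONS.get((mode, token), mode)

print(apply_sequence(Mode.SB, "enter_ab"))  # Mode.AB
```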
Both in \ac{ab} mode and in \ac{abp} mode, the \aca{hbm} bandwidth per \ac{pch} increases $\qty{8}{\times}$, from $\qty[per-mode=symbol]{16}{\giga\byte\per\second}$ to $\qty[per-mode=symbol]{128}{\giga\byte\per\second}$, for a total of $\qty[per-mode=symbol]{2}{\tera\byte\per\second}$ across 16 \acp{pch}.
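The aggregate numbers follow directly from the per-\ac{pch} figures; a back-of-the-envelope sketch using the values from the text:

```python
# Aggregate bandwidth in AB/AB-PIM mode, from the figures in the text
# (a plausibility calculation, not a measured value).
PER_PCH_GBPS = 16  # standard per-pseudo-channel bandwidth in GB/s
SPEEDUP = 8        # 8 banks accessed concurrently
N_PCH = 16         # pseudo-channels per device

ab_per_pch_gbps = PER_PCH_GBPS * SPEEDUP  # 128 GB/s per pseudo-channel
total_gbps = ab_per_pch_gbps * N_PCH      # 2048 GB/s = 2 TB/s in total

print(ab_per_pch_gbps, total_gbps)  # 128 2048
```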
\subsubsection{Processing Unit}
Due to the focus on \ac{dnn} applications in \aca{fimdram}, the native data type for the \acp{fpu} is \ac{fp16}, which is motivated by the significantly lower area and power requirements of \ac{fp16} \acp{fpu} compared to \ac{fp32} ones.
In addition, \ac{fp16} is well-supported on modern processor architectures such as ARMv8, which not only include \ac{fp16} \acp{fpu} themselves, but also support \ac{simd} operations using special vector registers.
The \ac{simd} \ac{fpu} of the processing units is implemented both as a \ac{fp16} multiplier unit and as a \ac{fp16} adder unit, providing support for these basic arithmetic operations.
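The precision trade-off behind this choice can be made tangible in pure Python: the `struct` format `e` rounds through IEEE 754 half precision, the same format the \acp{fpu} operate on (the helper below is an illustration of the number format, not of the \aca{fimdram} hardware).

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float through IEEE 754 half precision (FP16)."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

# FP16 has a 10-bit fraction: above 2048 the spacing between
# representable values is already 2, so small addends are lost.
print(to_fp16(2048.0 + 1.0))  # 2048.0 -- the +1 is rounded away
print(to_fp16(0.1))           # 0.0999755859375, nearest FP16 value
```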
In addition to the \acp{fpu}, a processing unit also contains \acp{crf}, \acp{srf} and \acp{grf}.
The \ac{crf} acts as an instruction buffer, holding the 32 32-bit instructions that are executed by the processing unit when a memory access is performed.
One program that is stored in the \ac{crf} is called a \textit{microkernel}.
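This execution model, where each memory access in \ac{pim} mode advances the microkernel by one instruction, can be sketched as a toy model of the \ac{crf} (the class, mnemonic names, and wrap-around behavior are illustrative assumptions, not the real instruction set):

```python
# Toy model of the CRF as a 32-entry instruction buffer: each memory
# access executed in PIM mode triggers the next microkernel instruction.
CRF_SIZE = 32  # 32 instructions of 32 bits each, per the text

class Crf:
    def __init__(self, microkernel):
        assert len(microkernel) <= CRF_SIZE, "microkernel too large"
        self.microkernel = list(microkernel)
        self.pc = 0  # index of the next instruction to execute

    def on_memory_access(self):
        """Return the instruction triggered by one memory access."""
        insn = self.microkernel[self.pc]
        self.pc = (self.pc + 1) % len(self.microkernel)  # assumed wrap
        return insn

crf = Crf(["MUL", "ADD", "MOV"])
print([crf.on_memory_access() for _ in range(4)])
# ['MUL', 'ADD', 'MOV', 'MUL']
```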
@@ -340,8 +341,8 @@ This interleaving is illustrated in \cref{img:input_vector}.
\label{img:input_vector}
\end{figure}
To initialize the input vector in this way, the host processor can use \ac{ab} mode.
From the processor's point of view, only the first bank is initialized, but the \ac{ab} mode ensures that the same data is written to all banks at the same time.
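This broadcast behavior can be captured in a short behavioral model: the host issues a single write addressed at the first bank, and the hardware mirrors it to all banks of the \ac{pch} (bank count and toy row size are taken from the text; the function is a sketch, not real timing or addressing):

```python
# Behavioral sketch of AB-mode initialization: one host write is
# replicated to every bank of the pseudo-channel.
N_BANKS = 16

banks = [[0] * 8 for _ in range(N_BANKS)]  # toy: 8 words per bank

def ab_mode_write(addr: int, value: int) -> None:
    """Host writes 'bank 0' once; AB mode replicates to every bank."""
    for bank in banks:
        bank[addr] = value

ab_mode_write(3, 0xAB)
print(all(bank[3] == 0xAB for bank in banks))  # True
```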
An example with a weight matrix of dimensions $(128, 8)$, an input vector of size $128$, and an output vector of size $8$ is analyzed in the following to describe how the processing units execute a \ac{gemv} microkernel.
With the processing unit \textit{i}, the number of iterations \textit{j}, the input vector \textit{a} and the weight matrix \textit{w}, the partial sum $\mathit{psum}[i, 0{:}15]$ is calculated as follows: