Files
pimsys-paper/samplepaper.tex
2024-03-08 15:37:02 +01:00

203 lines
16 KiB
TeX

% This is samplepaper.tex, a sample chapter demonstrating the
% LLNCS macro package for Springer Computer Science proceedings;
% Version 2.20 of 2017/10/04
%
\documentclass[runningheads]{llncs}
%
\usepackage{siunitx}
\usepackage[nameinlink,capitalize,noabbrev]{cleveref}
\usepackage{acro}
\usepackage[usenames,dvipsnames]{xcolor}
\usepackage{tikz}
\usepackage{mathdots}
\usepackage{graphicx}
% Used for displaying a sample figure. If possible, figure files should
% be included in EPS format.
%
% If you use the hyperref package, please uncomment the following line
% to display URLs in blue roman font according to Springer's eBook style:
% \renewcommand\UrlFont{\color{blue}\rmfamily}
\sisetup{per-mode = symbol}
\usetikzlibrary{positioning}
\definecolor{_darkblue}{RGB}{68, 114, 196}
\definecolor{_blue}{RGB}{91, 155, 213}
\definecolor{_green}{RGB}{112, 173, 71}
\definecolor{_orange}{RGB}{237, 125, 49}
\definecolor{_yellow}{RGB}{255, 192, 0}
\input{acronyms}
\begin{document}
%
\title{Contribution Title\thanks{Supported by organization x.}}
%
%\titlerunning{Abbreviated paper title}
% If the paper title is too long for the running head, you can set
% an abbreviated paper title here
%
\author{%
Derek Christ\inst{1}\orcidID{0000-1111-2222-3333} \and
Lukas Steiner\inst{2,3}\orcidID{1111-2222-3333-4444} \and
Matthias Jung\inst{3}\orcidID{2222-3333-4444-5555} \and
Norbert Wehn\inst{3}\orcidID{2222-3333-4444-5555}
}
%
\authorrunning{D. Christ et al.}
% First names are abbreviated in the running head.
% If there are more than two authors, 'et al.' is used.
%
\institute{Princeton University, Princeton NJ 08544, USA \and
Springer Heidelberg, Tiergartenstr. 17, 69121 Heidelberg, Germany
\email{lncs@springer.com}\\
\url{http://www.springer.com/gp/computer-science/lncs} \and
ABC Institute, Rupert-Karls-University Heidelberg, Heidelberg, Germany\\
\email{\{abc,lncs\}@uni-heidelberg.de}}
%
\maketitle
%
\begin{abstract}
The abstract should briefly summarize the contents of the paper in
15--250 words.
\keywords{First keyword \and Second keyword \and Another keyword.}
\end{abstract}
%
%
%
\section{Introduction}
\label{sec:intro}
% TODO Lukas/Matthias
Contributions:
\begin{itemize}
\item First time Full System Simulation of SAMSUNG-PIM
\item VP consisting of gem5 and DRAMSys
\item Experimental verification of the VP
\end{itemize}
%
\section{Related Work}
Onur Ramulator
Samsung DRAMSim2
% TODO Derek/Lukas
\section{Background DRAM-PIM}
\label{sec:dram_pim}
% TODO Derek
Many types of \acp{dnn} used for language and speech processing, such as \acp{rnn}, \acp{mlp} and some layers of \acp{cnn}, are severely limited by the memory bandwidth that the DRAM can provide, making them \textit{memory-bound} \cite{he2020}.
As already discussed in \cref{sec:intro}, PIM is a good fit for accelerating memory-bound workloads with low operational intensity.
In contrast, compute-bound workloads tend to have high data reuse and can make extensive use of the on-chip cache, and therefore do not need to utilize the full memory bandwidth.
Many layers of modern \acp{dnn} can be expressed as a matrix-vector multiplication.
The layer inputs can be represented as a vector and the model weights can be viewed as a matrix, where the number of columns is equal to the size of the input vector and the number of rows is equal to the size of the output vector.
Pairwise multiplication of the input vector and a row of the matrix is used to calculate an entry of the output vector.
Such an operation, defined in the widely used \ac{blas} library \cite{blas1979}, is also known as a \acs{gemv} routine.
Because each matrix element is used exactly once in the calculation of the output vector, there is no data reuse of the matrix.
Further, as the weight matrices tend to be too large to fit in the on-chip cache, such a \ac{gemv} operation is deeply memory-bound \cite{he2020}.
As a result, such an operation is a good fit for \ac{pim}.
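As an illustration of why the matrix exhibits no reuse, the \ac{gemv} access pattern can be sketched in a few lines of Python (a hypothetical sketch for exposition, not part of the \ac{blas} API):

```python
# Illustrative sketch of the GEMV access pattern: every matrix element
# W[r][c] is read exactly once, so there is no data reuse of W, while
# the input vector x is reused for every row.
def gemv(W, x):
    rows, cols = len(W), len(x)
    y = [0.0] * rows
    for r in range(rows):
        for c in range(cols):
            y[r] += W[r][c] * x[c]  # W[r][c] is touched exactly once
    return y
```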
Many different \ac{pim} architectures have been proposed in research in the past, and more recently real implementations have been presented by hardware vendors.
These proposals differ mainly in where the processing is applied, ranging from the analog distribution of capacitor charges at the \ac{dram}'s subarray level to additional processing units at the global I/O level.
Each of these approaches comes with different advantages and disadvantages.
In short, the closer the processing is to the \ac{dram}'s subarray, the higher the energy efficiency and the achievable processing bandwidth.
On the other hand, the integration of the \ac{pim} units inside the bank becomes more difficult, as it is limited by area and power constraints \cite{sudarshan2022}.
One real \ac{pim} implementation, called \acf{fimdram}, was presented by the major \ac{dram} manufacturer Samsung in 2021 \cite{kwon2021,lee2021}.
\Ac{fimdram} is based on the \ac{hbm2} memory standard, and it integrates 16-wide \ac{simd} engines directly into the memory banks, exploiting bank-level parallelism, while preserving the highly optimized memory subarray \cite{kwon2021}.
A special feature of \aca{fimdram} is that it does not require any changes to components of modern processors, such as the memory controller, i.e., it is agnostic to existing \aca{hbm2} platforms.
Consequently, for the operation of the \acp{pu}, mode switching is required for \aca{fimdram}, which makes it less useful for interleaved \ac{pim} and non-\ac{pim} traffic and small batch sizes.
At the heart of \aca{fimdram} lie the \ac{pim} execution units, which are shared by two banks each of a \ac{pch}.
They include 16 16-bit wide \ac{simd} \acp{fpu}, \acp{crf}, \acp{grf} and \acp{srf} \cite{lee2021}.
The 16-wide \ac{simd} units correspond to the 256-bit prefetch architecture of \aca{hbm2}, where 16 16-bit floating-point operands are passed directly from the \acp{ssa} to the \acp{fpu} in a single memory access.
As all \ac{pim} units operate in parallel, with 16 banks per \ac{pch}, a single memory access loads a total of $\qty{256}{\bit} \cdot 8 = \qty{2048}{\bit}$ into the \acp{fpu} of the 8 \acp{pu}.
As a result, the theoretical internal bandwidth of \aca{fimdram} is $8\times$ higher than the external bus bandwidth to the host processor.
\Ac{fimdram} defines three operating modes:
The default \textbf{\ac{sb} mode}, where \aca{fimdram} has identical behavior to normal \aca{hbm2} memory.
To switch to another mode, a specific sequence of \ac{act} and \ac{pre} commands must be sent by the memory controller to specific row addresses.
The \textbf{\ac{ab} mode} is an extension to the \ac{sb} mode where the \ac{pim} execution units allow for concurrent access to half of the \ac{dram} banks at the same time.
This provides $8\times$ more bandwidth than the standard operation mode, which can be used, e.g., for the initialization of memory regions across all banks.
With another predefined \ac{dram} access sequence, the memory switches to the \textbf{\ac{abp} mode}.
In this mode, a single memory access initiates the concurrent execution of the next instruction across all processing units.
In addition, the I/O circuits of the \ac{dram} are completely disabled in this mode, reducing the power required during \ac{pim} operation.
In both the \ac{ab} mode and the \ac{abp} mode, the internal bandwidth is $8\times$ higher than the external \aca{hbm2} bandwidth of $\qty{16}{\giga\byte\per\second}$ per \ac{pch}, i.e., $\qty{128}{\giga\byte\per\second}$ per \ac{pch} or $\qty{2}{\tera\byte\per\second}$ in total for 16 \acp{pch}.
Due to the focus on \ac{dnn} applications in \aca{fimdram}, the native data type for the \acp{fpu} is \ac{fp16}, which is motivated by the significantly lower area and power requirements for \acp{fpu} compared to 32-bit \ac{fp} numbers.
The \ac{simd} \ac{fpu} of a processing unit is implemented once as a \ac{fp16} multiplier unit and once as a \ac{fp16} adder unit, providing support for these basic arithmetic operations.
In addition to the \acp{fpu}, a processing unit also contains \acp{crf}, \acp{srf} and \acp{grf}.
The \ac{crf} acts as an instruction buffer, holding the 32 32-bit instructions that are executed by the processing unit as memory accesses are performed.
One program that is stored in the \ac{crf} is called a \textit{microkernel}.
Each \ac{grf} consists of 16 registers, each with the \aca{hbm2} prefetch size of 256 bits, where each entry can hold the data of a full memory burst.
The \ac{grf} of a processing unit is divided into two halves (\ac{grf}-A and \ac{grf}-B), with 8 register entries allocated to each of the two banks.
Finally, in the \ac{srf}, a 16-bit scalar value is replicated 16 times and fed into the 16-wide \ac{simd} \ac{fpu} as a constant summand or factor for an addition or multiplication.
It is also divided into two halves (\ac{srf}-A and \ac{srf}-M) for addition and multiplication, with 8 entries each.
The \aca{fimdram} instruction set provides a total of 9 32-bit \ac{risc} instructions, each of which falls into one of three groups: control flow instructions (NOP, JUMP, EXIT), arithmetic instructions (ADD, MUL, MAC, MAD) and data movement instructions (MOV, FILL).
Since the execution of an instruction in the microkernel is initiated by a memory access, the host processor must execute \ac{ld} or \ac{st} instructions in a sequence that exactly matches the loaded \ac{pim} microkernel.
When an instruction executes directly on data that is provided by a memory bank, the addresses of these memory accesses specify the exact row and column where the data should be loaded from or stored to.
This means that the order of the respective memory accesses for such instructions is important and must not be reordered by the processor or memory controller, as it must match the corresponding instruction in the microkernel.
One solution to this problem would be to introduce memory barriers between the individual \ac{ld} and \ac{st} instructions of the processor to prevent any reordering; however, this comes at a significant performance cost and leaves the memory bandwidth underutilized.
To avoid this overhead, Samsung has introduced the \ac{aam} mode for arithmetic instructions.
In the \ac{aam} mode, the register indices encoded in an instruction are ignored and instead decoded from the column and row address of the memory access itself.
With this method, the register indices and the bank address cannot get out of sync, as they are tightly coupled, even if the memory controller reorders the accesses.
\section{Virtual Prototype}
% TODO Derek
To build a virtual prototype of \aca{fimdram}, an accurate \ac{hbm2} model is needed into which the additional \ac{pim} \acp{pu} are integrated.
For this purpose, the cycle-accurate \ac{dram} simulator DRAMSys \cite{steiner2022a} has been used, and its \ac{hbm2} model has been extended to incorporate the \acp{pu} into the \acp{pch} of the \ac{pim}-enabled channels.
The \aca{fimdram} model itself does not need to model any timing behavior:
it is essentially untimed, since it is synchronized with the operation of the \ac{dram} model of DRAMSys.
While \aca{fimdram} operates in the default \ac{sb} mode, it behaves exactly like a normal \aca{hbm2} memory.
Only when the host initiates a mode switch of one of the \ac{pim}-enabled \acp{pch} do the processing units become active.
When entering \ac{ab} mode, the \ac{dram} model ignores the specific bank address of incoming \ac{wr} commands and internally performs the write operation for either all even or all odd banks of the \ac{pch}, depending on the parity of the original bank index.
After the transition to the \ac{ab} mode, the \ac{dram} can further transition to the \ac{abp} mode, which allows the execution of instructions in the processing units.
The \ac{abp} mode is similar to the \ac{ab} mode in that it also ignores the concrete bank address except for its parity; additionally, it passes the column and row address and, in the case of a read, also the fetched bank data to the processing units.
In the case of a write access, the output of the processing unit is written directly into the corresponding bank, ignoring the actual data of the transaction object coming from the host processor.
This is equivalent to the real \aca{fimdram} implementation, where the global I/O bus of the memory is not actually driven, and all data movement is done internally in the banks.
The model's internal state of a processing unit consists of the \ac{grf} register files \ac{grf}-A and \ac{grf}-B, the \ac{srf} register files \ac{srf}-A and \ac{srf}-M, the program counter, and a jump counter that keeps track of the current iteration of a JUMP instruction.
Depending on whether a \ac{rd} or \ac{wr} command is received from the \ac{dram} model, the control flow is dispatched into one of two functions that execute an instruction in the \ac{crf} and increment the program counter of the corresponding \ac{pim} unit.
Both functions calculate the register indices used by the \ac{aam} execution mode and then use a branch table that dispatches to the handler of the current instruction.
In the case of the data movement instructions MOV and FILL, a simple move operation is performed that loads the value of one register or the bank data and assigns it to the destination register.
The arithmetic instructions fetch the operand data from their respective sources, perform the operation, and write back the result by modifying the internal state of the \ac{pu}.
Note that while the MAC instruction can iteratively accumulate into the same destination register, it does not reduce the 16-wide \ac{fp16} vector itself in any way.
Instead, it is the host processor's responsibility to reduce these 16 floating-point numbers into one \ac{fp16} number.
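This division of labor, where the \ac{pu} only maintains a 16-wide accumulator and the host performs the final reduction, can be sketched as follows (the function names are illustrative, not part of any API):

```python
# Illustrative sketch of the MAC semantics: the processing unit keeps a
# 16-wide accumulator that is never reduced on the device; the host is
# responsible for summing the 16 partial results into one scalar.
SIMD_WIDTH = 16

def pu_mac(acc, a, b):
    # element-wise multiply-accumulate, as the 16-wide SIMD FPU would do
    return [acc[i] + a[i] * b[i] for i in range(SIMD_WIDTH)]

def host_reduce(acc):
    # final reduction performed by the host processor
    return sum(acc)
```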
With this implementation of the processing units, it is now possible to write a user program that controls the execution of the \ac{pim}-\acp{pu} directly in the \ac{hbm2} model.
% TODO software library...
The use of \ac{aam} requires a special memory layout so that the register indices are correctly derived from the column and row addresses of a memory access.
The memory layout of a weight matrix used, e.g., for a \ac{gemv} operation is illustrated in \cref{img:matrix_layout}.
\begin{figure}
\centering
\resizebox{0.8\linewidth}{!}{\input{images/matrix_layout}}
\caption{Mapping of the weight matrix onto the memory banks.}
\label{img:matrix_layout}
\end{figure}
To make use of all eight \ac{grf}-A registers, the input address has to increment linearly while adhering to a column-major matrix layout.
In a column-major matrix layout, the entries of a column are stored sequentially before switching to the next column, according to the \texttt{MATRIX[R][C]} C-like array notation.
However, the concrete element type of the array is not a single \ac{fp16} value, but a vector of 16 \ac{fp16} values packed together.
This results in 16 \ac{fp16} matrix row elements being stored sequentially before switching to the next 16 \ac{fp16} elements in the next row of the same 16 columns, ensuring that a \ac{simd} processing unit always contains the data of only one matrix row.
To guarantee the correct placement of the first matrix element at the boundary of the first bank of the \ac{pch}, an alignment for the matrix data structure of $\qty{512}{\byte}$ would need to be explicitly enforced.
However, when using the \ac{aam} execution mode, this is not sufficient.
As already mentioned in \cref{sec:dram_pim}, the \ac{grf}-A and \ac{grf}-B indices are calculated from the column and row address of the triggering memory access.
With an alignment of $\qty{512}{\byte}$, no assumptions can be made about the initial value of the \ac{grf}-A and \ac{grf}-B indices, while for the execution of a complete \ac{gemv} kernel, both indices should start with zero.
Therefore, the larger alignment requirement of $2^6 \cdot \qty{512}{\byte} = \qty{32768}{\byte}$ must be ensured for the weight matrix.
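The packed placement and the resulting alignment requirement can be summarized in a short sketch; the helper functions below are hypothetical and merely mirror the layout rules stated above:

```python
# Hypothetical helpers mirroring the layout rules above: 16 FP16 row
# elements are packed into one 256-bit vector, packs are stored
# column-major over 16-wide column groups, and the matrix base must be
# aligned to 2**6 * 512 B = 32768 B so that both GRF indices start at
# zero.
PACK = 16               # FP16 elements per 256-bit vector
PACK_BYTES = PACK * 2   # 32 B per packed vector
ALIGNMENT = 2**6 * 512  # 32768 B

def packed_offset(row, col_group, num_rows):
    # column-major over packs: 'col_group' counts groups of 16 matrix
    # columns; within a group, the row index varies fastest
    return (col_group * num_rows + row) * PACK_BYTES

def is_aligned(base_address):
    return base_address % ALIGNMENT == 0
```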
\section{Results}
% TODO Derek
\section{Conclusion}
% TODO Lukas/Matthias
%
\bibliographystyle{IEEEtran} % TODO change style?
\bibliography{references.bib}
\end{document}