% This is samplepaper.tex, a sample chapter demonstrating the
% LLNCS macro package for Springer Computer Science proceedings;
% Version 2.20 of 2017/10/04
%
\documentclass[runningheads]{llncs}
%
\usepackage{graphicx}
\usepackage{siunitx}
\usepackage[nameinlink,capitalize,noabbrev]{cleveref}
\usepackage{acro}
% Used for displaying a sample figure. If possible, figure files should
% be included in EPS format.
%
% If you use the hyperref package, please uncomment the following line
% to display URLs in blue roman font according to Springer's eBook style:
% \renewcommand\UrlFont{\color{blue}\rmfamily}

\sisetup{per-mode = symbol}
\input{acronyms}

\begin{document}
%
\title{Contribution Title\thanks{Supported by organization x.}}
%
%\titlerunning{Abbreviated paper title}
% If the paper title is too long for the running head, you can set
% an abbreviated paper title here
%
\author{%
Derek Christ\inst{1}\orcidID{0000-1111-2222-3333} \and
Lukas Steiner\inst{2,3}\orcidID{1111-2222-3333-4444} \and
Matthias Jung\inst{3}\orcidID{2222-3333-4444-5555} \and
Norbert Wehn\inst{3}\orcidID{2222-3333-4444-5555}
}
%
\authorrunning{F. Author et al.}
% First names are abbreviated in the running head.
% If there are more than two authors, 'et al.' is used.
%
\institute{Princeton University, Princeton NJ 08544, USA \and
Springer Heidelberg, Tiergartenstr. 17, 69121 Heidelberg, Germany
\email{lncs@springer.com}\\
\url{http://www.springer.com/gp/computer-science/lncs} \and
ABC Institute, Rupert-Karls-University Heidelberg, Heidelberg, Germany\\
\email{\{abc,lncs\}@uni-heidelberg.de}}
%
\maketitle
%
\begin{abstract}
The abstract should briefly summarize the contents of the paper in
15--250 words.

\keywords{First keyword \and Second keyword \and Another keyword.}
\end{abstract}
%
%
%
\section{Introduction}
\label{sec:intro}
% TODO Lukas/Matthias
Contributions:
\begin{itemize}
\item First full-system simulation of SAMSUNG-PIM
\item VP consisting of gem5 and DRAMSys
\item Experimental verification of the VP
\end{itemize}
%
\section{Related Work}
Onur Ramulator
Samsung DRAMSim2
% TODO Derek/Lukas
\section{Background DRAM-PIM}
% TODO Derek
Many types of \acp{dnn} used for language and speech processing, such as \acp{rnn}, \acp{mlp} and some layers of \acp{cnn}, are severely limited by the memory bandwidth that the DRAM can provide, making them \textit{memory-bound} \cite{he2020}.
As already discussed in \cref{sec:intro}, PIM is a good fit for accelerating memory-bound workloads with low operational intensity.
In contrast, compute-bound workloads tend to have high data reuse, can make extensive use of the on-chip cache, and therefore do not need to utilize the full memory bandwidth.

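Following the roofline model, a workload that performs $W$ arithmetic operations while transferring $Q$ bytes to and from memory is memory-bound whenever its operational intensity falls below the machine balance, i.e.,
\begin{equation*}
    I = \frac{W}{Q} < \frac{P_\mathrm{peak}}{B_\mathrm{mem}},
\end{equation*}
where $P_\mathrm{peak}$ denotes the peak compute throughput and $B_\mathrm{mem}$ the peak memory bandwidth.
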
Many layers of modern \acp{dnn} can be expressed as a matrix-vector multiplication.
The layer inputs can be represented as a vector and the model weights can be viewed as a matrix, where the number of columns is equal to the size of the input vector and the number of rows is equal to the size of the output vector.
A pairwise multiplication of the input vector and a row of the matrix is used to calculate an entry of the output vector.
Such an operation, defined in the widely used \ac{blas} library \cite{blas1979}, is also known as a \acs{gemv} routine.
Because each matrix element is used exactly once in the calculation of the output vector, there is no data reuse of the matrix.
Further, as the weight matrices tend to be too large to fit in the on-chip cache, such a \ac{gemv} operation is deeply memory-bound \cite{he2020}.
As a result, such an operation is a good fit for \ac{pim}.

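Written out, the \ac{gemv} operation $y = Ax$ with an $m \times n$ weight matrix $A$ computes
\begin{equation*}
    y_i = \sum_{j=1}^{n} A_{ij} \, x_j, \qquad i = 1, \dots, m,
\end{equation*}
so every weight $A_{ij}$ is read exactly once, while each input element $x_j$ is reused $m$ times.
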
Many different \ac{pim} architectures have been proposed in research over the past years, and more recently real implementations have been presented by hardware vendors.
These proposals differ largely in where the processing logic is placed, ranging from the analog distribution of capacitor charges at the \ac{dram}'s subarray level to additional processing units at the global I/O level.
Each of these approaches comes with different advantages and disadvantages.
In short, the closer the processing is to the \ac{dram}'s subarray, the higher the energy efficiency and the achievable processing bandwidth.
On the other hand, the integration of \ac{pim} units inside the bank becomes more difficult, as area and power constraints limit the integration \cite{sudarshan2022}.

One real \ac{pim} implementation by the major \ac{dram} manufacturer Samsung, called \acf{fimdram}, was presented in 2021 \cite{kwon2021,lee2021}.
\Ac{fimdram} is based on the \ac{hbm2} memory standard, and it integrates 16-wide \ac{simd} engines directly into the memory banks, exploiting bank-level parallelism while preserving the highly optimized memory subarray \cite{kwon2021}.
A special feature of \aca{fimdram} is that it does not require any changes to components of modern processors, such as the memory controller, i.e., it is agnostic to existing \aca{hbm2} platforms.
Consequently, operating the \acp{pu} requires explicit mode switching, which makes \aca{fimdram} less suitable for interleaved \ac{pim} and non-\ac{pim} traffic and for small batch sizes.

At the heart of \aca{fimdram} lie the \ac{pim} execution units, each of which is shared by two banks of a \ac{pch}.
They include 16 16-bit wide \ac{simd} \acp{fpu} as well as \acp{crf}, \acp{grf} and \acp{srf} \cite{lee2021}.
The 16-wide \ac{simd} units correspond to the 256-bit prefetch architecture of \aca{hbm2}, where 16 16-bit floating-point operands are passed directly from the \acp{ssa} to the \acp{fpu} with a single memory access.
As all \ac{pim} units operate in parallel, with 16 banks and thus 8 \acp{pu} per \ac{pch}, a single memory access loads a total of $\qty{256}{\bit} \cdot 8 = \qty{2048}{\bit}$ into the \acp{fpu}.
As a result, the theoretical internal bandwidth of \aca{fimdram} is $8\times$ higher than the external bus bandwidth to the host processor.

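The factor follows directly from the architecture: per \ac{pch}, the external bus delivers one 256-bit prefetch per access, whereas internally all 8 \acp{pu} consume one 256-bit operand vector each, i.e.,
\begin{equation*}
    \frac{B_\mathrm{internal}}{B_\mathrm{external}} = \frac{8 \cdot \qty{256}{\bit}}{\qty{256}{\bit}} = 8.
\end{equation*}
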
\Ac{fimdram} defines three operating modes:
The default \textbf{\ac{sb} mode}, in which \aca{fimdram} behaves identically to normal \aca{hbm2} memory.
To switch to another mode, a specific sequence of \ac{act} and \ac{pre} commands must be sent by the memory controller to specific row addresses.
The \textbf{\ac{ab} mode} is an extension of the \ac{sb} mode in which the \ac{pim} execution units allow concurrent access to half of the \ac{dram} banks.
This provides $8\times$ more bandwidth than the standard operating mode, which can be used for the initialization of memory regions across all banks.
With another predefined \ac{dram} access sequence, the memory switches to the \textbf{\ac{abp} mode}.
In this mode, a single memory access initiates the concurrent execution of the next instruction across all processing units.
In addition, the I/O circuits of the \ac{dram} are completely disabled in this mode, reducing the power required during \ac{pim} operation.
Both in \ac{ab} mode and in \ac{abp} mode, the internal bandwidth is $8\times$ higher than the external \aca{hbm2} bandwidth of \qty{16}{\giga\byte\per\second} per \ac{pch}, i.e., \qty{128}{\giga\byte\per\second} per \ac{pch} or \qty{2}{\tera\byte\per\second} in total for 16 \acp{pch}.

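The aggregate internal bandwidth follows as
\begin{equation*}
    \qty{16}{\giga\byte\per\second} \cdot 8 = \qty{128}{\giga\byte\per\second} \text{ per \ac{pch}}, \qquad \qty{128}{\giga\byte\per\second} \cdot 16 = \qty{2048}{\giga\byte\per\second} \approx \qty{2}{\tera\byte\per\second}.
\end{equation*}
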
\section{VP}
% TODO Derek
\section{Results}
% TODO Derek
\section{Conclusion}
% TODO Lukas/Matthias
%

\bibliographystyle{IEEEtran} % TODO change style?
\bibliography{references}

\end{document}