% This is samplepaper.tex, a sample chapter demonstrating the
% LLNCS macro package for Springer Computer Science proceedings;
% Version 2.20 of 2017/10/04
%
\documentclass[runningheads]{llncs}
%
\usepackage{graphicx}
\usepackage{siunitx}
\usepackage[nameinlink,capitalize,noabbrev]{cleveref}
\usepackage{acro}
% Used for displaying a sample figure. If possible, figure files should
% be included in EPS format.
%
% If you use the hyperref package, please uncomment the following line
% to display URLs in blue roman font according to Springer's eBook style:
% \renewcommand\UrlFont{\color{blue}\rmfamily}

\sisetup{per-mode = symbol}
\input{acronyms}

\begin{document}
%
\title{Contribution Title\thanks{Supported by organization x.}}
%
%\titlerunning{Abbreviated paper title}
% If the paper title is too long for the running head, you can set
% an abbreviated paper title here
%
\author{%
Derek Christ\inst{1}\orcidID{0000-1111-2222-3333} \and
Lukas Steiner\inst{2,3}\orcidID{1111-2222-3333-4444} \and
Matthias Jung\inst{3}\orcidID{2222-3333-4444-5555} \and
Norbert Wehn\inst{3}\orcidID{2222-3333-4444-5555}
}
%
\authorrunning{F. Author et al.}
% First names are abbreviated in the running head.
% If there are more than two authors, 'et al.' is used.
%
\institute{Princeton University, Princeton NJ 08544, USA \and
Springer Heidelberg, Tiergartenstr. 17, 69121 Heidelberg, Germany
\email{lncs@springer.com}\\
\url{http://www.springer.com/gp/computer-science/lncs} \and
ABC Institute, Rupert-Karls-University Heidelberg, Heidelberg, Germany\\
\email{\{abc,lncs\}@uni-heidelberg.de}}
%
\maketitle
%
\begin{abstract}
The abstract should briefly summarize the contents of the paper in
15--250 words.

\keywords{First keyword \and Second keyword \and Another keyword.}
\end{abstract}
%
%
%
\section{Introduction}
\label{sec:intro}
% TODO Lukas/Matthias
Contributions:
\begin{itemize}
\item First full-system simulation of SAMSUNG-PIM
\item VP consisting of gem5 and DRAMSys
\item Experimental verification of the VP
\end{itemize}
%
\section{Related Work}
Onur Ramulator
Samsung DRAMSim2
% TODO Derek/Lukas
\section{Background DRAM-PIM}
% TODO Derek
Many types of \acp{dnn} used for language and speech processing, such as \acp{rnn}, \acp{mlp} and some layers of \acp{cnn}, are severely limited by the memory bandwidth that the DRAM can provide, making them \textit{memory-bound} \cite{he2020}.
As already discussed in \cref{sec:intro}, PIM is a good fit for accelerating memory-bound workloads with low operational intensity.
In contrast, compute-bound workloads tend to have high data reuse, can make extensive use of the on-chip cache, and therefore do not need to utilize the full memory bandwidth.

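Following the roofline model, a workload that performs $W$ arithmetic operations while transferring $Q$ bytes to and from memory is memory-bound whenever its operational intensity falls below the machine balance, i.e.,
\begin{equation*}
    I = \frac{W}{Q} < \frac{P_\mathrm{peak}}{B_\mathrm{mem}},
\end{equation*}
where $P_\mathrm{peak}$ denotes the peak compute throughput and $B_\mathrm{mem}$ the peak memory bandwidth.
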
Many layers of modern \acp{dnn} can be expressed as a matrix-vector multiplication.
The layer inputs can be represented as a vector and the model weights can be viewed as a matrix, where the number of columns is equal to the size of the input vector and the number of rows is equal to the size of the output vector.
A pairwise multiplication of the input vector and a row of the matrix is used to calculate an entry of the output vector.
Such an operation, defined in the widely used \ac{blas} library \cite{blas1979}, is also known as a \acs{gemv} routine.
Because each matrix element is used exactly once in the calculation of the output vector, there is no data reuse of the matrix.
Further, as the weight matrices tend to be too large to fit in the on-chip cache, such a \ac{gemv} operation is deeply memory-bound \cite{he2020}.
As a result, such an operation is a good fit for \ac{pim}.

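Written out, the \ac{gemv} operation $y = Ax$ with an $m \times n$ weight matrix $A$ computes
\begin{equation*}
    y_i = \sum_{j=1}^{n} A_{ij} \, x_j, \qquad i = 1, \dots, m,
\end{equation*}
so every weight $A_{ij}$ is read exactly once, while each input element $x_j$ is reused $m$ times.
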
Many different \ac{pim} architectures have been proposed in research over the past years, and more recently real implementations have been presented by hardware vendors.
These proposals differ largely in where the processing logic is placed, ranging from the analog distribution of capacitor charges at the \ac{dram}'s subarray level to additional processing units at the global I/O level.
Each of these approaches comes with different advantages and disadvantages.
In short, the closer the processing is to the \ac{dram}'s subarray, the higher the energy efficiency and the achievable processing bandwidth.
On the other hand, the integration of \ac{pim} units inside the bank becomes more difficult, as area and power constraints limit the integration \cite{sudarshan2022}.

One real \ac{pim} implementation by the major \ac{dram} manufacturer Samsung, called \acf{fimdram}, was presented in 2021 \cite{kwon2021,lee2021}.
\Ac{fimdram} is based on the \ac{hbm2} memory standard, and it integrates 16-wide \ac{simd} engines directly into the memory banks, exploiting bank-level parallelism while preserving the highly optimized memory subarray \cite{kwon2021}.
A special feature of \aca{fimdram} is that it does not require any changes to components of modern processors, such as the memory controller, i.e., it is agnostic to existing \aca{hbm2} platforms.
Consequently, operating the \acp{pu} requires explicit mode switching, which makes \aca{fimdram} less suitable for interleaved \ac{pim} and non-\ac{pim} traffic and for small batch sizes.

At the heart of \aca{fimdram} lie the \ac{pim} execution units, each of which is shared by two banks of a \ac{pch}.
They include 16 16-bit wide \ac{simd} \acp{fpu} as well as \acp{crf}, \acp{grf} and \acp{srf} \cite{lee2021}.
The 16-wide \ac{simd} units correspond to the 256-bit prefetch architecture of \aca{hbm2}, where 16 16-bit floating-point operands are passed directly from the \acp{ssa} to the \acp{fpu} with a single memory access.
As all \ac{pim} units operate in parallel, with 16 banks and thus 8 \acp{pu} per \ac{pch}, a single memory access loads a total of $\qty{256}{\bit} \cdot 8 = \qty{2048}{\bit}$ into the \acp{fpu}.
As a result, the theoretical internal bandwidth of \aca{fimdram} is $8\times$ higher than the external bus bandwidth to the host processor.

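The factor follows directly from the architecture: per \ac{pch}, the external bus delivers one 256-bit prefetch per access, whereas internally all 8 \acp{pu} consume one 256-bit operand vector each, i.e.,
\begin{equation*}
    \frac{B_\mathrm{internal}}{B_\mathrm{external}} = \frac{8 \cdot \qty{256}{\bit}}{\qty{256}{\bit}} = 8.
\end{equation*}
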
\Ac{fimdram} defines three operating modes:
The default \textbf{\ac{sb} mode}, in which \aca{fimdram} behaves identically to normal \aca{hbm2} memory.
To switch to another mode, a specific sequence of \ac{act} and \ac{pre} commands must be sent by the memory controller to specific row addresses.
The \textbf{\ac{ab} mode} is an extension of the \ac{sb} mode in which the \ac{pim} execution units allow concurrent access to half of the \ac{dram} banks.
This provides $8\times$ more bandwidth than the standard operating mode, which can be used for the initialization of memory regions across all banks.
With another predefined \ac{dram} access sequence, the memory switches to the \textbf{\ac{abp} mode}.
In this mode, a single memory access initiates the concurrent execution of the next instruction across all processing units.
In addition, the I/O circuits of the \ac{dram} are completely disabled in this mode, reducing the power required during \ac{pim} operation.
Both in \ac{ab} mode and in \ac{abp} mode, the internal bandwidth is $8\times$ higher than the external \aca{hbm2} bandwidth of \qty{16}{\giga\byte\per\second} per \ac{pch}, i.e., \qty{128}{\giga\byte\per\second} per \ac{pch} or \qty{2}{\tera\byte\per\second} in total for 16 \acp{pch}.

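The aggregate internal bandwidth follows as
\begin{equation*}
    \qty{16}{\giga\byte\per\second} \cdot 8 = \qty{128}{\giga\byte\per\second} \text{ per \ac{pch}}, \qquad \qty{128}{\giga\byte\per\second} \cdot 16 = \qty{2048}{\giga\byte\per\second} \approx \qty{2}{\tera\byte\per\second}.
\end{equation*}
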
\section{VP}
% TODO Derek
\section{Results}
% TODO Derek
\section{Conclusion}
% TODO Lukas/Matthias
%

\bibliographystyle{IEEEtran} % TODO change style?
\bibliography{references}

\end{document}