First part of DRAM basics
@@ -34,6 +34,10 @@
short = DRAM,
long = dynamic random-access memory,
}
\DeclareAcronym{ram}{
short = RAM,
long = random-access memory,
}
\DeclareAcronym{hbm}{
short = HBM,
long = High Bandwidth Memory,
@@ -42,6 +46,38 @@
short = PIM,
long = processing-in-memory,
}
\DeclareAcronym{subarray}{
short = SA,
long = subarray,
}
\DeclareAcronym{lwl}{
short = LWL,
long = local wordline,
}
\DeclareAcronym{lbl}{
short = LBL,
long = local bitline,
}
\DeclareAcronym{mwl}{
short = MWL,
long = master wordline,
}
\DeclareAcronym{mbl}{
short = MBL,
long = master bitline,
}
\DeclareAcronym{psa}{
short = PSA,
long = primary sense amplifier,
}
\DeclareAcronym{ssa}{
short = SSA,
long = secondary sense amplifier,
}
\DeclareAcronym{csl}{
short = CSL,
long = column select line,
}
\DeclareAcronym{tlm}{
short = TLM,
long = transaction level modeling,
}

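These declarations use the \texttt{acro} package's \verb|\DeclareAcronym| interface. A minimal sketch of how such acronyms are then referenced in the text (assuming \texttt{acro} is loaded in the preamble):

```latex
% Preamble: load acro and declare an acronym.
\usepackage{acro}
\DeclareAcronym{dram}{
    short = DRAM,
    long  = dynamic random-access memory,
}

% Body: \ac expands to the long form with the short form in
% parentheses on first use, and to the short form afterwards.
\ac{dram}   % first use: dynamic random-access memory (DRAM)
\ac{dram}   % later uses: DRAM
\acp{dram}  % plural form: DRAMs
\Ac{dram}   % capitalized variant, for sentence beginnings
```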
@@ -1,2 +1,43 @@
\section{DRAM Architecture}
\label{sec:dram}

This section introduces the basics of modern DRAM architecture and provides the background necessary to understand the theory behind various \ac{pim} integrations.
In particular, the architecture of \ac{hbm} will be discussed, since it is the \ac{dram} technology on which the \ac{pim} architecture implemented in this thesis is based.

\subsection{DRAM Basics}
\label{sec:dram_basics}

A \ac{dram} is a special type of \ac{ram} that uses a single transistor-capacitor pair as a memory cell to encode exactly one bit \cite{jacob2008}.
Since a capacitor holds electrical charge, this is a volatile type of storage: the stored charge leaks away over time and the bit value it represents eventually vanishes.
To circumvent this, regular \textit{refresh} operations are required, which read and rewrite the stored value; this is what makes the storage method \textit{dynamic}.
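To give a sense of scale, here is an illustrative calculation; the figures (a $\qty{64}{\milli\second}$ retention period spread over $8192$ refresh commands) are common DDRx specification values and are not taken from the sources cited in this section:

```latex
% Average refresh command interval, assuming a 64 ms retention
% period and 8192 refresh commands per period (common DDRx values):
\begin{equation*}
    t_{REFI} = \frac{\qty{64}{\milli\second}}{8192}
             \approx \qty{7.8}{\micro\second}
\end{equation*}
```

Under these assumptions, the memory controller must issue a refresh command roughly every $\qty{7.8}{\micro\second}$, during which the affected bank is unavailable.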
A typical \ac{dram} device consists of several banks, which are themselves composed of a set of \textit{memory arrays}, which in turn are composed of multiple \acp{subarray}.
Banks operate independently of each other, while the memory arrays of each bank operate in lockstep to form the per-device data word, with the number of data bits equal to the number of memory arrays per bank.
The \acp{subarray} are grid-like structures composed of \acp{lwl} and \acp{lbl}, with a storage cell at each intersection point.
The \ac{lwl} is connected to the transistor's gate, switching it on and off, while the \ac{lbl} is used to access the stored value.
Global \acp{mwl} and \acp{mbl} span over all \acp{subarray}, forming complete \textit{rows} and \textit{columns} of a memory array.

Because the charge stored in each cell is very small, so-called \acp{psa} are needed to amplify the stored voltage of each cell while it is connected to the shared \ac{lbl} \cite{jacob2008}, as illustrated in Figure~\ref{img:psa}.

\begin{figure}[!ht]
\centering
\includegraphics{images/psa}
\caption[\ac{psa} of an open bitline architecture]{\ac{psa} of an open bitline architecture \cite{jacob2008, jung2017a}}
\label{img:psa}
\end{figure}

However, before a value can be read, the \ac{psa} needs to \textit{precharge} its bitline to a halfway voltage $\frac{V_{DD}}{2}$ between 0 and $V_{DD}$.
When the capacitor is then connected to the bitline, it pushes the voltage level marginally in one direction, enough for the \ac{psa} to detect the voltage difference to an adjacent bitline in another \ac{subarray} and amplify the voltage level all the way to high or low.
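The size of this perturbation can be sketched with a simple charge-sharing estimate; the symbols $C_{cell}$ and $C_{BL}$ for the cell and bitline capacitances are illustrative and not taken from the cited sources:

```latex
% Charge conservation between the cell capacitance C_cell (charged to
% V_cell, i.e. 0 or V_DD) and the bitline capacitance C_BL (precharged
% to V_DD/2) gives the voltage swing seen by the PSA:
\begin{equation*}
    \Delta V_{BL} = \frac{C_{cell}}{C_{cell} + C_{BL}}
                    \left( V_{cell} - \frac{V_{DD}}{2} \right)
\end{equation*}
```

Since $C_{BL}$ is typically much larger than $C_{cell}$, $\Delta V_{BL}$ amounts to only a small fraction of $V_{DD}$, which is why the \ac{psa} must amplify it to a full logic level.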

The process of loading the stored value into the \ac{psa} is done for all columns at the same time and is called \textit{row activation}.
Once a row is activated, it is referred to as \textit{open}.
% \ac{csl}

\begin{figure}[!ht]
\centering
\includegraphics{images/bank}
\caption[\ac{dram} bank architecture]{\ac{dram} bank architecture \cite{jung2017a}}
\label{img:bank}
\end{figure}

\subsection{High Bandwidth Memory}
\label{sec:hbm}

@@ -24,12 +24,12 @@ In addition, Moore's Law is slowing down as further device scaling approaches ph
The exponential growth in compute energy will eventually be constrained by market dynamics, flattening the energy curve and making it impossible to meet future computing demands.
Radical improvements in energy efficiency are therefore required to avoid such a scenario.

In recent years, domain-specific accelerators such as \acp{gpu} or \acp{tpu} have become very popular, as they provide orders of magnitude higher performance and energy efficiency for \ac{ai} applications than general-purpose processors \cite{kwon2021}.
However, research must also take into account off-chip memory: moving data between the computation unit and the \ac{dram} is very costly, as fetching operands consumes more power than performing the computation itself.
While performing a double precision floating point operation on a $\qty{28}{\nano\meter}$ technology might consume an energy of about $\qty{20}{\pico\joule}$, fetching the operands from \ac{dram} consumes almost 3 orders of magnitude more energy at about $\qty{16}{\nano\joule}$ \cite{dally2010}.
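The quoted gap can be verified directly from the two numbers:

```latex
% Ratio of DRAM fetch energy to compute energy:
\begin{equation*}
    \frac{\qty{16}{\nano\joule}}{\qty{20}{\pico\joule}}
    = \frac{\qty{16000}{\pico\joule}}{\qty{20}{\pico\joule}}
    = 800 \approx 10^{2.9}
\end{equation*}
```

that is, close to three orders of magnitude.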

Furthermore, many types of \ac{dnn} used for language and speech processing, such as \acp{rnn}, \acp{mlp} and some layers of \acp{cnn}, are severely limited by the memory bandwidth that the \ac{dram} can provide, making them \textit{memory-bound} \cite{he2020}.
In contrast, compute-intensive workloads, such as visual processing, are referred to as \textit{compute-bound}.

\begin{figure}[!ht]
\centering
@@ -43,10 +43,10 @@ However, recent \ac{ai} technologies require even greater bandwidth than \ac{hbm

All things considered, to meet the need for more energy-efficient computing systems, which are increasingly becoming memory-bound, new approaches to computing are required.
This has led researchers to reconsider past \ac{pim} architectures and advance them further \cite{lee2021}.
\Ac{pim} integrates computational logic into the \ac{dram} itself to exploit minimal data movement cost and extensive internal data parallelism \cite{sudarshan2022}, making it a good fit for memory-bound problems.

This work analyzes various \ac{pim} architectures, identifies the challenges of integrating them into state-of-the-art \acp{dram}, examines the changes required in the way applications lay out their data in memory, and explores a \ac{pim} implementation from one of the leading \ac{dram} vendors.
The remainder of this work is structured as follows:
Section \ref{sec:dram} gives a brief overview of the architecture of \acp{dram}, in particular that of \ac{hbm}.
Section \ref{sec:pim} presents various types of \ac{pim} architectures, with some concrete examples discussed in detail.
Section \ref{sec:vp} is an introduction to virtual prototyping and system-level hardware simulation.

@@ -47,7 +47,7 @@
archiveprefix = {arxiv},
langid = {english},
keywords = {read},
file = {/home/derek/Nextcloud/Verschiedenes/Zotero/storage/UFED59VX/Chen et al. - 2023 - SimplePIM A Software Framework for Productive and.pdf}
}

@misc{dally2010,
BIN src/images/bank.pdf Normal file
BIN src/images/psa.pdf Normal file
@@ -28,8 +28,8 @@
40 70
100 70
}
node[above,sloped,pos=0.25,scale=0.8] {\textit{memory-bound}}
node[above,pos=0.75,scale=0.8] {\textit{compute-bound}};

\addplot [very thick, dashed, BrickRed]
table {
