\section{DRAM Architecture}
\label{sec:dram}
This section introduces the basics of modern DRAM architecture and provides the background necessary to understand the theory behind various \ac{pim} integrations.
In particular, the architecture of \ac{hbm} will be discussed, since it is the \ac{dram} technology on which the \ac{pim} architecture implemented in this thesis is based.
\subsection{DRAM Basics}
\label{sec:dram_basics}
A \ac{dram} is a special type of \ac{ram} that uses a \ac{1t1c} cell as a memory cell to store a single bit of data \cite{jacob2008}.
Because a capacitor holds electrical charge, it is a volatile form of storage, and the bit value it represents will eventually vanish over time as the stored charge is leaked.
To circumvent this, regular \textit{refresh} operations are required, involving reading and rewriting the stored value, making this storage method \textit{dynamic}.
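As a rough illustration of what \textit{dynamic} storage implies for the memory controller, the following sketch derives the average interval between refresh commands. The $\qty{64}{\milli\second}$ retention window and the 8192 commands per window are typical DDR-class textbook values used here for illustration, not figures taken from this chapter:

```python
# Back-of-the-envelope refresh scheduling with typical DDR-class
# parameters (illustrative values, not from a specific datasheet).
RETENTION_MS = 64    # every cell must be refreshed within this window
REFRESH_CMDS = 8192  # refresh commands issued per retention window

# Average interval between two refresh commands (often called tREFI).
t_refi_us = RETENTION_MS * 1000 / REFRESH_CMDS
print(f"tREFI = {t_refi_us} us")  # prints "tREFI = 7.8125 us"
```

With these assumed parameters, the controller must interleave a refresh command into the normal request stream roughly every $\qty{7.8}{\micro\second}$.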
A typical \ac{dram} device consists of several \textit{banks}, which are themselves composed of a set of \textit{memory arrays}.
The banks can be controlled independently of each other, while the memory arrays of each bank operate in lockstep mode to form the per-device data word, with the number of data bits equal to the number of memory arrays per bank.
Memory arrays, in turn, are composed of multiple \acp{subarray}.
\Acp{subarray} are grid-like structures composed of \acp{lwl} and \acp{lbl}, with a storage cell at each intersection point.
The \ac{lwl} is connected to the transistor's gate, switching it on and off, while the \ac{lbl} is used to access the stored value.
Global \acp{mwl} and \acp{mbl} span over all \acp{subarray}, forming complete \textit{rows} and \textit{columns} of a memory array.
Because the charge stored in each cell is very small, so-called \acp{psa} are needed to amplify the voltage of each cell while it is connected to the shared \ac{lbl} \cite{jacob2008}; their basic structure is illustrated in \cref{img:psa}.
\begin{figure}
\centering
\includegraphics[width=\linewidth]{images/psa}
\caption[\ac{psa} of an open bitline architecture]{\ac{psa} of an open bitline architecture \cite{jacob2008,jung2017a}.}
\label{img:psa}
\end{figure}
However, before a value can be read, the \ac{psa} needs to \textit{precharge} its bitline to a halfway voltage $\frac{V_{DD}}{2}$ between 0 and $V_{DD}$.
When the selected wordline is then activated, the charge from the capacitor flows to the bitline and pushes the voltage level slightly in one direction.
The \ac{psa} compares the changed voltage level with an adjacent bitline in another \ac{subarray} and amplifies that difference all the way to a high or low level.
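This sensing step can be summarized by the standard charge-sharing relation found in the DRAM literature; the symbols $C_{cell}$ (cell capacitance), $C_{BL}$ (bitline capacitance), and $V_{cell}$ (voltage stored on the cell) are generic textbook notation, not quantities defined in this chapter:

```latex
% Bitline voltage perturbation caused by charge sharing between the
% cell capacitor and the precharged bitline during row activation:
\begin{equation*}
  \Delta V_{BL} = \left(V_{cell} - \frac{V_{DD}}{2}\right)
                  \cdot \frac{C_{cell}}{C_{cell} + C_{BL}}
\end{equation*}
```

Since $C_{BL}$ is typically much larger than $C_{cell}$, the resulting $\Delta V_{BL}$ is small, which is precisely why the differential amplification of the \ac{psa} is required.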
The process of loading the stored values into the \acp{psa} is done for all columns of a row at once and is called \textit{row activation}.
Once a row is activated, it can be read from or written to with a certain access granularity determined by the \ac{bl} of the memory.
To perform such a burst access, the \acp{csl} of a set of \acp{psa} must be enabled, connecting them to the more powerful \acp{ssa} that drive the actual bank \ac{io}.
Depending on the \ac{we} signal, the \acp{ssa} either sense and amplify the logic value of the \acp{psa}, or they overwrite it using the \textit{write drivers}.
\Cref{img:bank} summarizes the architecture of a single \ac{dram} device, consisting of a number of banks, as discussed so far.
\begin{figure}
\centering
\includegraphics[width=\linewidth]{images/bank}
\caption[Architecture of a single DRAM device]{Architecture of a single DRAM device \cite{jung2017a}.}
\label{img:bank}
\end{figure}
Since a single \ac{dram} device has only a small data width (e.g., 8 bits in the case of an x8 \ac{dram}), several devices operate in lockstep mode to form the wider \textit{data bus} of the \textit{memory channel} \cite{jung2017a}.
One kind of \ac{dram} subsystem places these sets of devices on a special \ac{pcb} called \ac{dimm}.
A \ac{dimm} may also consist of several independent \textit{ranks}, which are complete sets of \ac{dram} devices connected to the same data bus, but accessed in an interleaved manner.
Besides the data bus, the channel also comprises the \textit{command bus} and the \textit{address bus}.
Over the command bus, the commands necessary to control the memory are issued by the \textit{memory controller}, which sits between the \ac{dram} and the \ac{mpsoc}.
For example, to read data, the memory controller may first issue a \ac{pre} command to precharge the bitlines in a certain bank, followed by an \iac{act} command to load the contents of a row into the \acp{psa}, and finally a \ac{rd} command to move the data from the \acp{psa} to the \acp{ssa}, from where it is driven onto the data bus.
The value on the address bus determines the row, column, bank and rank used during the respective commands, while it is the responsibility of the memory controller to translate the \ac{mpsoc}-side address to the respective components in a process called \ac{am}.
\Ac{am} ensures that the number of \textit{row misses}, i.e., the need for precharging and activating another row, is minimized.
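The command sequencing described above can be sketched as a minimal, hypothetical open-page scheduling routine; the function and the command tuples are illustrative and do not model a real controller:

```python
# Minimal open-page scheduling sketch (hypothetical, for illustration):
# a read is a "row hit" if the target row is already latched in the
# bank's sense amplifiers; otherwise PRE and/or ACT must precede RD.
def schedule_read(open_rows, bank, row):
    """Return the command sequence for reading from (bank, row)."""
    cmds = []
    if open_rows.get(bank) != row:        # row miss
        if bank in open_rows:             # a different row is open
            cmds.append(("PRE", bank))    # precharge the bitlines
        cmds.append(("ACT", bank, row))   # activate the target row
        open_rows[bank] = row
    cmds.append(("RD", bank, row))        # burst data towards the SSAs
    return cmds

open_rows = {}
print(schedule_read(open_rows, 0, 7))  # [('ACT', 0, 7), ('RD', 0, 7)]
print(schedule_read(open_rows, 0, 7))  # [('RD', 0, 7)]  (row hit)
print(schedule_read(open_rows, 0, 9))  # [('PRE', 0), ('ACT', 0, 9), ('RD', 0, 9)]
```

The sketch makes the cost asymmetry explicit: a row hit needs a single command, while a row miss to an already open bank needs three, which is exactly what a good \ac{am} scheme tries to avoid.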
One particularly common \ac{am} scheme is called \textit{Bank Interleaving} \cite{jung2017a}, which is illustrated using an exemplary mapping in \cref{img:bank_interleaving}.
Under the assumption of a sequentially increasing address access pattern, this scheme maps the lowest bits of an address to the column bits of a row to exploit the already activated row as much as possible.
After that, instead of addressing the next row of the current bank directly, the mapping switches to another bank to take advantage of \textit{bank parallelism}.
Because banks can be controlled independently, one bank can be outputting the next data burst while another is concurrently precharging or activating a new row.
\begin{figure}
\centering
\begin{bytefield}[bitwidth=4mm,bitheight=5mm]{32}
\bitheader[endianness=big]{0,2,3,12,13,16,17,31} \\
\bitbox{15}{Row}
\bitbox{4}{Bank}
\bitbox{10}{Column}
\bitbox{3}[bgcolor=verylightgray]{}
\end{bytefield}
\caption[Exemplary address mapping scheme]{Exemplary address mapping scheme for an input address of size 32.}
\label{img:bank_interleaving}
\end{figure}
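The exemplary mapping of \cref{img:bank_interleaving} can be expressed as a small, hypothetical address decoder. The bit positions follow the figure (10 column bits, 4 bank bits, 15 row bits), and the unlabeled low 3 bits are assumed to be a byte offset within the burst:

```python
# Hypothetical decoder for the exemplary 32-bit bank-interleaving
# mapping: bits 2..0 offset, 12..3 column, 16..13 bank, 31..17 row.
def decode(addr):
    offset = addr & 0x7             # 3 offset bits
    column = (addr >> 3) & 0x3FF    # 10 column bits
    bank   = (addr >> 13) & 0xF     # 4 bank bits
    row    = (addr >> 17) & 0x7FFF  # 15 row bits
    return row, bank, column, offset

# Sequential addresses first walk all columns of the open row; the
# first address past the last column lands in the next bank, row 0.
print(decode(0))                 # (0, 0, 0, 0)
print(decode((0x3FF << 3) + 8))  # (0, 1, 0, 0) -> bank switch, same row index
```

Only after all banks have been visited does the row index increment, which matches the intent of exploiting bank parallelism before paying for a row miss.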
In addition to \ac{dimm}-based \ac{dram}, which is mainly used in desktop workstations, there are alternative \ac{dram} subsystems.
One of these is device-based \ac{dram}, where the memory devices are directly soldered onto the same \ac{pcb} as the \ac{mpsoc}.
Another type is 2.5D-integrated \ac{dram}, where multiple memory dies are stacked on top of each other and connected to the \ac{mpsoc} by a silicon interposer \cite{jung2017a}.
Such a 2.5D-integrated type used in \acp{gpu} and \acp{tpu} is \ac{hbm}, which will be introduced in greater detail in the following section.
\subsection{\Acf{hbm}}
\label{sec:hbm}
\Aca{hbm} is a \ac{dram} standard that was defined by \ac{jedec} in 2016 as the successor to the previous \ac{hbm} standard \cite{jedec2015a}.
What differentiates \ac{hbm} from other types of memory is its \ac{sip} approach.
Several \ac{dram} dies are stacked on top of each other and connected with \acp{tsv} to form a cube of memory dies consisting of many layers and a buffer die at the bottom, as shown in \cref{img:sip}.
\begin{figure}
\centering
\includegraphics[width=0.8\linewidth]{images/sip}
\caption[Cross-section view of an \ac{hbm} \ac{sip}]{Cross-section view of an \ac{hbm} \ac{sip} \cite{lee2021}.}
\label{img:sip}
\end{figure}
Such a cube is then placed onto a common silicon interposer that connects it to its host processor.
This packaging brings the memory closer to the \ac{mpsoc}, which reduces the latency, minimizes the bus capacitance and, most importantly, allows for a very wide memory interface.
For example, compared to a conventional \ac{ddr4} \ac{dram}, this tight integration enables $\qtyrange[range-units=single]{10}{13}{\times}$ more \ac{io} connections to the \ac{mpsoc} and $\qtyrange[range-units=single]{2}{2.4}{\times}$ lower energy per bit-transfer \cite{lee2021}.
One memory stack supports up to 8 independent memory channels, each of which contains up to 16 banks, divided into 4 bank groups.
The command, address and data bus operate at \ac{ddr}, i.e., they transfer two words per interface clock cycle $t_{CK}$.
With an interface clock frequency of $\qty{1}{\giga\hertz}$, \aca{hbm} achieves a pin transfer rate of $\qty{2}{\giga T \per\second}$, which gives $\qty[per-mode=symbol]{16}{\giga\byte\per\second}$ per \ac{pch} and a total of $\qty[per-mode = symbol]{256}{\giga\byte\per\second}$ for the 1024-bit wide data bus of each stack.
A single data transfer is performed with either a \ac{bl} of 2 or 4, depending on the \ac{pch} configuration.
In \ac{pch} mode, the data bus is split in half (i.e., 64-bit) to enable independent data transmission, further increasing parallelism while sharing a common command and address bus between the two \acp{pch}.
Thus, accessing \aca{hbm} in \ac{pch} mode transmits a $\qty{256}{\bit}=\qty{32}{\byte}$ burst with a \ac{bl} of 4 over the $\qty{64}{\bit}$ wide data bus.
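As a sanity check, the bandwidth and burst figures quoted above can be recomputed directly from the interface parameters given in the text:

```python
# Recompute the per-stack HBM figures from the text: 1 GHz interface
# clock, DDR signaling, 8 channels with two 64-bit pseudo channels each.
clock_hz    = 1e9
transfers   = 2 * clock_hz               # DDR: 2 transfers per cycle
pch_bits    = 64                         # pseudo channel data bus width
pch_gbs     = transfers * pch_bits / 8 / 1e9  # GB/s per pseudo channel
stack_gbs   = pch_gbs * 2 * 8            # 2 pCHs x 8 channels (1024 bit)
burst_bytes = pch_bits * 4 // 8          # BL of 4 on the 64-bit bus

print(pch_gbs, stack_gbs, burst_bytes)   # prints "16.0 256.0 32"
```

The numbers reproduce the $\qty[per-mode=symbol]{16}{\giga\byte\per\second}$ per \ac{pch}, $\qty[per-mode=symbol]{256}{\giga\byte\per\second}$ per stack, and the $\qty{32}{\byte}$ burst size stated above.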
\cref{img:hbm} illustrates the internal architecture of a single memory die.
It consists of 2 independent channels, each with 2 \acp{pch} of 4 bank groups with 4 banks each, resulting in 16 banks per \ac{pch}.
In the center of the die, the \acp{tsv} connect to the next die above or the previous die below.
\begin{figure}
\centering
\includegraphics[width=0.8\linewidth]{images/hbm}
\caption[\aca{hbm} memory die architecture]{\aca{hbm} memory die architecture \cite{lee2021}.}
\label{img:hbm}
\end{figure}
Although \aca{hbm} provides a large amount of bandwidth, many modern \ac{dnn} applications are still memory-bound.
While one approach would be to further increase the bandwidth by integrating more stacks on the silicon interposer, other constraints such as thermal limits or the limited number of \ac{io} connections on the interposer may make this impractical \cite{lee2021}.
Another approach could be \acf{pim}: Using \ac{hbm}'s 2.5D architecture, it is possible to incorporate additional compute units directly into the memory stacks, increasing the achievable parallel bandwidth and reducing the burden of transferring all the data to the host processor to perform operations on it.