diff --git a/src/acronyms.tex b/src/acronyms.tex
index 5698422..2dba8d1 100644
--- a/src/acronyms.tex
+++ b/src/acronyms.tex
@@ -139,6 +139,10 @@ short = JEDEC,
 long = Joint Electron Device Engineering Council,
 }
+\DeclareAcronym{lpddr}{
+  short = LPDDR,
+  long = Low-Power Double Data Rate,
+}
 \DeclareAcronym{ddr4}{
 short = DDR4,
 long = Double Data Rate 4,
diff --git a/src/chapters/dram.tex b/src/chapters/dram.tex
index fd200be..9855283 100644
--- a/src/chapters/dram.tex
+++ b/src/chapters/dram.tex
@@ -2,7 +2,7 @@ \label{sec:dram}
 This section introduces the basics of modern DRAM architecture and provides the background necessary to understand the theory behind various \ac{pim} integrations.
-In particular, the architecture of \ac{hbm} will be discussed, since it is the \ac{dram} technology on which the \ac{pim} architecture implemented in this thesis is based.
+In particular, the architecture of \aca{hbm} will be discussed, since it is the \ac{dram} technology on which the \ac{pim} architecture implemented in this thesis is based.
 \subsection{DRAM Basics}
 \label{sec:dram_basics}
@@ -17,7 +17,7 @@ Memory arrays, in turn, are composed of multiple \acp{subarray}.
 The \ac{lwl} is connected to the transistor's gate, switching it on and off, while the \ac{lbl} is used to access the stored value.
 Global \acp{mwl} and \acp{mbl} span over all \acp{subarray}, forming complete \textit{rows} and \textit{columns} of a memory array.
-Because the charge stored in each cell is very small, so-called \acp{psa} are needed to amplify the voltage of each cell while it is being connected to the shared \ac{lbl} \cite{jacob2008}, basic structure of which is illustrated in \cref{img:psa}.
+Because the charge stored in each cell is very small, so-called \acp{psa} are needed to amplify the voltage of each cell while it is being connected to the shared \ac{lbl} \cite{jacob2008}, the basic structure of which is illustrated in \cref{img:psa}.
 \begin{figure}
 \centering
@@ -28,11 +28,11 @@ Because the charge stored in each cell is very small, so-called \acp{psa} are ne
 However, before a value can be read, the \ac{psa} needs to \textit{precharge} its bitline to a halfway voltage $\frac{V_{DD}}{2}$ between 0 and $V_{DD}$.
 When the selected wordline is then activated, the charge from the capacitor flows to the bitline and pushes the voltage level slightly in one direction.
-The \ac{psa} compares the changed voltage level with an adjacent bitline in another \ac{subarray} and amplifies that difference all the way to a high or low level.
+The \ac{psa} compares the changed voltage level with an adjacent bitline in another \ac{subarray} and amplifies that difference all the way to a full high or low level.
 The process of loading the stored values into the \acp{psa} is done for all columns of a row at once and is called \textit{row activation}.
 Once a row is activated, it can be read from or written to with a certain access granularity determined by the \ac{bl} of the memory.
-To perform such a burst access, the \acp{csl} of a set of \acp{psa} must be enabled, connecting them to the more powerful \acp{ssa} that drive the actual bank \ac{io}.
+To perform such a burst access, the \acp{csl} of a set of \acp{psa} are enabled, connecting them to the more powerful \acp{ssa} that drive the actual bank \ac{io}.
 Depending on the \ac{we} signal, the \acp{ssa} either sense and amplify the logic value of the \acp{psa}, or they overwrite it using the \textit{write drivers}.
 The \cref{img:bank} summarizes the basic architecture of a single storage device consisting of a number of banks that has been discussed so far.
@@ -43,19 +43,19 @@ The \cref{img:bank} summarizes the basic architecture of a single storage device
 \label{img:bank}
 \end{figure}
-Since a single \ac{dram} device has only a small width, for example in the case of x8 \ac{dram} a width of 8, several devices operate in lockstep mode to form the wider \textit{data bus} of the \textit{memory channel} \cite{jung2017a}.
-One kind of \ac{dram} subsystem places these sets of devices on a special \ac{pcb} called \ac{dimm}.
+Since a single \ac{dram} device has only a small bit-width, for example 8 bits in the case of an x8 \ac{dram}, several devices operate in lockstep mode to form the wider \textit{data bus} of the \textit{memory channel} \cite{jung2017a}.
+One kind of \ac{dram} subsystem places these sets of devices on a special \ac{pcb} called a \ac{dimm}.
 A \ac{dimm} may also consist of several independent \textit{ranks}, which are complete sets of \ac{dram} devices connected to the same data bus, but accessed in an interleaved manner.
 Besides the data bus, the channel consists also of the \textit{command bus} and the \textit{address bus}.
 Over the command bus, the commands necessary to control memory are issued by the \textit{memory controller}, that sits in between the \ac{dram} and the \ac{mpsoc}.
-For example, to read data, the memory controller may first issue a \ac{pre} command to precharge the bitlines in a certain bank, followed by an \iac{act} command to load the contents of a row into the \acp{psa}, and finally a \ac{rd} command to move the data from the \acp{psa} to the \acp{ssa} where it can further be exposed to the data bus.
+For example, to read data, the memory controller may first issue a \ac{pre} command to precharge the bitlines in a certain bank, followed by an \iac{act} command to load the contents of a row into the \acp{psa}, and finally a \ac{rd} command to move the data from the \acp{psa} to the \acp{ssa}, where it can then be driven onto the data bus.
 The value on the address bus determines the row, column, bank and rank used during the respective commands, while it is the responsibility of the memory controller to translate the \ac{mpsoc}-side address to the respective components in a process called \ac{am}.
 The \ac{am} ensures that the number of \textit{row misses}, i.e., the need for precharging and activating another row, is minimized.
 % One particularly common \ac{am} scheme is called \textit{Bank Interleaving} \cite{jung2017a}, which maps the lower address bits to the columns, followed by the ranks and banks, and the highest bits to the rows.
 One particularly common \ac{am} scheme is called \textit{Bank Interleaving} \cite{jung2017a}, which is illustrated using an exemplary mapping in \cref{img:bank_interleaving}.
 Under the assumption of a sequentially increasing address access pattern, this scheme maps the lowest bits of an address to the column bits of a row to exploit the already activated row as much as possible.
-After that, instead of addressing the next row of the current bank directly, the mapping switches to another bank to take advantage of \textit{bank parallelism}.
+After that, instead of addressing the next row of the same bank directly, the mapping switches to another bank to take advantage of \textit{bank parallelism}.
 Because banks can be controlled independently, one bank can be outputting the next data burst while another is concurrently precharging or activating a new row.
 \begin{figure}
 \centering
@@ -89,8 +89,8 @@ Because banks can be controlled independently, one bank can be outputting the ne
 % Besides \ac{dimm}-based \ac{dram}, which is mainly used in desktop workstations, there are also \ac{dram} subsystems such as device-based \ac{dram}, where the memory devices are soldered directly on the same \ac{pcb} as the \ac{mpsoc}, or 2.5D-integrated \ac{dram}, where several memory dies are stacked on top of each other and connected to the \ac{mpsoc} by a silicon interposer \cite{jung2017a}.
 In addition to \ac{dimm}-based \ac{dram}, which is mainly used in desktop workstations, there are alternative \ac{dram} subsystems.
-One of these is device-based \ac{dram}, where the memory devices are directly soldered onto the same \ac{pcb} as the \ac{mpsoc}.
-Another type is 2.5D-integrated \ac{dram}, where multiple memory dies are stacked on top of each other and connected to the \ac{mpsoc} by a silicon interposer \cite{jung2017a}.
+One of these is device-based \ac{dram}, commonly used for \ac{lpddr}, where the memory devices are soldered directly onto the same \ac{pcb} as the \ac{mpsoc}.
+Another type is 2.5D-integrated \ac{dram}, where multiple memory dies are stacked on top of each other and connected to the \ac{mpsoc} via a silicon interposer \cite{jung2017a}.
 Such a 2.5D-integrated type used in \acp{gpu} and \acp{tpu} is \ac{hbm}, which will be introduced in greater detail in the following section.
 \subsection{\Acl{hbm}}
@@ -98,16 +98,16 @@ Such a 2.5D-integrated type used in \acp{gpu} and \acp{tpu} is \ac{hbm}, which w
 \Aca{hbm} is a \ac{dram} standard that has been defined by \ac{jedec} in 2016 as a successor of the previous \ac{hbm} standard \cite{jedec2015a}.
 What differentiates \ac{hbm} from other types of memory is its \ac{sip} approach.
-Several \ac{dram} dies are stacked on top of each other and connected with \acp{tsv} to form a cube of memory dies consisting of many layers and a buffer die at the bottom, as shown in \cref{img:sip}.
+Several \ac{dram} dies are stacked on top of each other and connected with \acp{tsv} to form a cube of memory dies consisting of many die layers and a buffer die at the bottom, as shown in \cref{img:sip}.
 \begin{figure}
 \centering
 \includegraphics[width=0.8\linewidth]{images/sip}
 \caption[Cross-section view of an \ac{hbm} \ac{sip}]{Cross-section view of a \ac{hbm} \ac{sip} \cite{lee2021}.}
 \label{img:sip}
 \end{figure}
-Such a cube is then placed onto a common silicon interposer that connects it to its host processor.
-This packaging brings the memory closer to the \ac{mpsoc}, which reduces the latency, minimizes the bus capacitance and, most importantly, allows for a very wide memory interface.
-For example, compared to a conventional \ac{ddr4} \ac{dram}, this tight integration enables $\qtyrange[range-units=single]{10}{13}{\times}$ more \ac{io} connections to the \ac{mpsoc} and $\qtyrange[range-units=single]{2}{2.4}{\times}$ lower energy per bit-transfer \cite{lee2021}.
+Such a cube is then placed onto a common silicon interposer that connects the \ac{dram} to its host processor.
+This packaging brings the memory closer to the \ac{mpsoc}, which reduces the latency, minimizes the bus capacitance and, most importantly, allows for an extraordinarily wide memory interface.
+For example, compared to a conventional \ac{ddr4} \ac{dram}, this tight integration enables $\qtyrange[range-units=single]{10}{13}{\times}$ more \ac{io} connections to the \ac{mpsoc} and $\qtyrange[range-units=single]{2}{2.4}{\times}$ lower energy per bit-transfer \cite{lee2021}.
 One memory stack supports up to 8 independent memory channels, each of which containing up to 16 banks, which are divided into 4 bank groups.
 The command, address and data bus operate at \ac{ddr}, i.e., they transfer two words per interface clock cycle $t_{CK}$.
@@ -118,7 +118,7 @@ Thus, accessing \aca{hbm} in \ac{pch} mode transmits a $\qty{256}{\bit}=\qty{32}
 \cref{img:hbm} illustrates the internal architecture of a single memory die.
 It consists of 2 independent channels, each with 2 \acp{pch} of 4 bank groups with 4 banks each, resulting in 16 banks per \ac{pch}.
-In the center of the die, the \acp{tsv} connect to the next die above or the previous die below.
+In the center of the die, the \acp{tsv} connect the die to the dies directly above and below it.
 \begin{figure}
 \centering
@@ -130,4 +130,4 @@ In the center of the die, the \acp{tsv} connect to the next die above or the pre
 % still, bandwidth requirements of new AI applications are not met by HBM2:waq
 Although \aca{hbm} provides a high amount of bandwidth, many modern \acp{dnn} applications reside in the memory-bounded limitations.
 While one approach would be to further increase the bandwidth by integrating more stacks on the silicon interposer, other constraints such as thermal limits or the limited number of \ac{io} connections on the interposer may make this impractical \cite{lee2021}.
-Another approach could be \acf{pim}: Using \ac{hbm}'s 2.5D architecture, it is possible to incorporate additional compute units directly into the memory stacks, increasing the achievable parallel bandwidth and reducing the burden of transferring all the data to the host processor to perform operations on it.
+Another approach could be \acf{pim}: Using \ac{hbm}'s 2.5D architecture, it is possible to incorporate additional compute units directly into the memory stacks, increasing the achievable parallel bandwidth and reducing the burden of transferring all the data to the host processor for processing.
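Note on the Bank Interleaving paragraph in the dram.tex hunk: the scheme (lowest address bits to the columns, then the banks, highest bits to the rows) can be sketched in a few lines of Python. This is purely illustrative and not part of the patch; the field widths `COLUMN_BITS`, `BANK_BITS`, `ROW_BITS` and the helper `map_address` are assumed names and example values, not taken from the thesis or any real controller (rank bits are omitted for brevity).

```python
# Hypothetical bank-interleaved address mapping (illustrative field widths).
# Sequential addresses first sweep the columns of one open row, then rotate
# across banks before a new row of the same bank must be activated.

COLUMN_BITS = 10  # 1024 columns per row (assumption)
BANK_BITS = 4     # 16 banks             (assumption)
ROW_BITS = 14     # 16384 rows           (assumption)

def map_address(addr: int) -> dict:
    """Split a flat word address into (row, bank, column) fields."""
    column = addr & ((1 << COLUMN_BITS) - 1)
    bank = (addr >> COLUMN_BITS) & ((1 << BANK_BITS) - 1)
    row = (addr >> (COLUMN_BITS + BANK_BITS)) & ((1 << ROW_BITS) - 1)
    return {"row": row, "bank": bank, "column": column}

# The first address past a full row lands in bank 1, column 0 of the SAME
# row index -- exploiting bank parallelism instead of causing a row miss.
first_after_row = map_address(1 << COLUMN_BITS)
```

The point of the sketch is the bit ordering: because the bank bits sit directly above the column bits, a sequential access pattern only returns to an already-used bank (and thus needs a new activation) after touching all other banks once.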