Start of HBM-PIM

This commit is contained in:
2024-02-08 15:41:50 +01:00
parent 607bbae8d4
commit c895d71f74
4 changed files with 53 additions and 2 deletions


@@ -143,6 +143,10 @@
short = DDR4,
long = Double Data Rate 4,
}
\DeclareAcronym{gddr6}{
short = GDDR6,
long = Graphics Double Data Rate 6,
}
\DeclareAcronym{tsv}{
short = TSV,
long = through-silicon via,
@@ -183,6 +187,30 @@
short = SDK,
long = software development kit,
}
\DeclareAcronym{fimdram}{
short = FIMDRAM,
long = Function-In-Memory DRAM,
}
\DeclareAcronym{simd}{
short = SIMD,
long = single-instruction multiple-data,
}
\DeclareAcronym{fpu}{
short = FPU,
long = floating-point unit,
}
\DeclareAcronym{crf}{
short = CRF,
long = command register file,
}
\DeclareAcronym{grf}{
short = GRF,
long = general register file,
}
\DeclareAcronym{srf}{
short = SRF,
long = scalar register file,
}
\DeclareAcronym{tlm}{
short = TLM,
long = transaction-level modeling,

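For context, the declarations above are consumed by the acro package's expansion macros in the running text; a minimal usage sketch (assuming `\usepackage{acro}` in the preamble):

```latex
% first use expands to the long form with the short form in parentheses,
% subsequent uses print only the short form:
\ac{fimdram}   % -> Function-In-Memory DRAM (FIMDRAM); later uses: FIMDRAM
\acp{fpu}      % -> plural form: floating-point units (FPUs)
\acs{simd}     % -> always the short form: SIMD
```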

@@ -107,7 +107,7 @@ Several \ac{dram} dies are stacked on top of each other and connected with \acp{
\end{figure}
Such a cube is then placed onto a common silicon interposer that connects it to its host processor.
This packaging brings the memory closer to the \ac{mpsoc}, which reduces the latency, minimizes the bus capacitance and, most importantly, allows for a very wide memory interface.
For example, compared to a conventional \acs{ddr4} \ac{dram}, this tight integration enables $\qtyrange[range-units=single]{10}{13}{\times}$ more \ac{io} connections to the \ac{mpsoc} and $\qtyrange[range-units=single]{2}{2.4}{\times}$ lower energy per bit-transfer \cite{lee2021}.
For example, compared to a conventional \ac{ddr4} \ac{dram}, this tight integration enables $\qtyrange[range-units=single]{10}{13}{\times}$ more \ac{io} connections to the \ac{mpsoc} and $\qtyrange[range-units=single]{2}{2.4}{\times}$ lower energy per bit-transfer \cite{lee2021}.
One memory stack supports up to 8 independent memory channels, each of which contains up to 16 banks divided into 4 bank groups.
The command, address and data bus operate at \ac{ddr}, i.e., they transfer two words per interface clock cycle $t_{CK}$.
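With \ac{ddr} signalling, the peak bandwidth of one stack follows directly from these figures; a back-of-the-envelope estimate, assuming a 128-bit data bus per channel and an interface clock of $\qty{1}{\giga\hertz}$ (both values are illustrative assumptions, not taken from the text above):

```latex
\begin{equation*}
  \underbrace{8}_{\text{channels}}
  \times \underbrace{\qty{128}{\bit}}_{\text{bus width}}
  \times \underbrace{2}_{\text{transfers per } t_{CK}}
  \times \qty{1}{\giga\hertz}
  = \qty[per-mode=symbol]{2048}{\giga\bit\per\second}
  = \qty[per-mode=symbol]{256}{\giga\byte\per\second}
\end{equation*}
```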


@@ -96,7 +96,7 @@ This gives a maximum theoretical \ac{pim} bandwidth of $\qty[per-mode = symbol]{
\subsection{Newton AiM}
\label{sec:pim_newton}
In the year 2020, the major \ac{dram} manufacturer SK Hynix announced its own \ac{pim} technology in GDDR memory called Newton \cite{he2020}.
In the year 2020, the major \ac{dram} manufacturer SK Hynix announced its own \ac{pim} technology in \ac{gddr6} memory called Newton \cite{he2020}.
In contrast to UPMEM, Newton integrates only small \ac{mac} units and buffers into the bank region to avoid the area and power overhead of a fully programmable processor core.
To communicate with the processing units, Newton introduces its own \ac{dram} commands, allowing fully interleaved \ac{pim} and non-\ac{pim} traffic as no mode switching is required.
Another advantage of this approach is that there is no kernel startup delay to initialize the \ac{pim} operation, which would be a significant overhead for small batches of \ac{pim} operations.
@@ -124,4 +124,27 @@ As a result, Newton promises a $\qtyrange{10}{54}{\times}$ speedup compared to a
\subsection{FIMDRAM/HBM-PIM}
\label{sec:pim_fim}
One year after SK Hynix, the major \ac{dram} manufacturer Samsung announced its own \ac{pim} \ac{dram} implementation, called \ac{fimdram} or \ac{hbm}-\ac{pim}.
As the name suggests, it is based on the \aca{hbm} memory standard, and it integrates 16-wide \ac{simd} engines directly into the memory banks, exploiting bank-level parallelism, while retaining the highly optimized \acp{subarray} \cite{kwon2021}.
A major difference from Newton \ac{pim} is that \ac{hbm}-\ac{pim} does not require any changes to components of modern processors, such as the memory controller, i.e., it is agnostic to existing \aca{hbm} platforms.
Consequently, mode switching is required for \ac{hbm}-\ac{pim}, making it less useful for interleaved \ac{pim} and non-\ac{pim} traffic.
Fortunately, as discussed in Section \ref{sec:hbm}, the architecture of \ac{hbm} allows for many independent memory channels on a single stack, making it possible to cleanly separate the memory map into a \ac{pim}-enabled region and a normal \ac{hbm} region.
At the heart of the \ac{hbm}-\ac{pim} are the \ac{pim} execution units, which are shared by two banks of a \ac{pch}.
They include 16 16-bit wide \ac{simd} \acp{fpu}, \acp{crf}, \acp{grf} and \acp{srf} \cite{lee2021}.
This general architecture is shown in detail in Figure \ref{img:fimdram}: (a) the placement of the \ac{pim} units between the memory banks of a \ac{dram} die, (b) a bank coupled to its \ac{pim} unit, and (c) the data path around an \ac{fpu} within the \ac{pim} unit.
\begin{figure}
\centering
\includegraphics[width=\linewidth]{images/fimdram}
\caption[Architecture of \ac{hbm}-\ac{pim}]{Architecture of \ac{hbm}-\ac{pim} \cite{lee2021}}
\label{img:fimdram}
\end{figure}
As can be seen in (c), the input data to an \ac{fpu} can come directly from the memory bank, from a \ac{grf}/\ac{srf}, or from the result bus of a previous computation.
The 16-wide \ac{simd} units correspond to the 256-bit prefetch architecture of \aca{hbm}, where 16 16-bit floating-point operands are passed directly from the \acp{psa} to the \acp{fpu} in a single memory access.
As all \ac{pim} units operate in parallel, with 16 banks per \ac{pch}, a single memory access loads a total of $\qty{256}{\bit} \times \qty{16}{banks} = \qty{4096}{\bit}$ into the \acp{fpu}.
As a result, the theoretical internal bandwidth of \ac{hbm}-\ac{pim} is $\qty{16}{\times}$ higher than that of the external bus to the host processor.
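The $\qty{16}{\times}$ factor can be checked directly from the per-access widths: one external access transfers the 256-bit prefetch of a single bank, while internally all 16 banks of a \ac{pch} feed their \acp{fpu} simultaneously.

```latex
\begin{equation*}
  \frac{\text{internal bits per access}}{\text{external bits per access}}
  = \frac{16 \times \qty{256}{\bit}}{\qty{256}{\bit}}
  = \frac{\qty{4096}{\bit}}{\qty{256}{\bit}}
  = 16
\end{equation*}
```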
% differences to the Hynix PIM
% benchmark results from Samsung...

BIN
src/images/fimdram.pdf Normal file

Binary file not shown.