Start of HBM-PIM

This commit is contained in:
2024-02-08 15:41:50 +01:00
parent 607bbae8d4
commit c895d71f74
4 changed files with 53 additions and 2 deletions


@@ -143,6 +143,10 @@
short = DDR4,
long = Double Data Rate 4,
}
\DeclareAcronym{gddr6}{
short = GDDR6,
long = Graphics Double Data Rate 6,
}
\DeclareAcronym{tsv}{
short = TSV,
long = through-silicon via,
@@ -183,6 +187,30 @@
short = SDK,
long = software development kit,
}
\DeclareAcronym{fimdram}{
short = FIMDRAM,
long = Function-In-Memory DRAM,
}
\DeclareAcronym{simd}{
short = SIMD,
long = single-instruction multiple-data,
}
\DeclareAcronym{fpu}{
short = FPU,
long = floating-point unit,
}
\DeclareAcronym{crf}{
short = CRF,
long = command register file,
}
\DeclareAcronym{grf}{
short = GRF,
long = general register file,
}
\DeclareAcronym{srf}{
short = SRF,
long = scalar register file,
}
\DeclareAcronym{tlm}{
short = TLM,
long = transaction-level modeling,

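For context, the declarations above are consumed by the acro package's expansion macros in the running text; a minimal usage sketch (assuming `\usepackage{acro}` in the preamble):

```latex
% first use expands to the long form with the short form in parentheses,
% subsequent uses print only the short form:
\ac{fimdram}   % -> Function-In-Memory DRAM (FIMDRAM); later uses: FIMDRAM
\acp{fpu}      % -> plural form: floating-point units (FPUs)
\acs{simd}     % -> always the short form: SIMD
```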

@@ -107,7 +107,7 @@ Several \ac{dram} dies are stacked on top of each other and connected with \acp{
\end{figure}
Such a cube is then placed onto a common silicon interposer that connects it to its host processor.
This packaging brings the memory closer to the \ac{mpsoc}, which reduces the latency, minimizes the bus capacitance and, most importantly, allows for a very wide memory interface.
For example, compared to a conventional \acs{ddr4} \ac{dram}, this tight integration enables $\qtyrange[range-units=single]{10}{13}{\times}$ more \ac{io} connections to the \ac{mpsoc} and $\qtyrange[range-units=single]{2}{2.4}{\times}$ lower energy per bit-transfer \cite{lee2021}.
For example, compared to a conventional \ac{ddr4} \ac{dram}, this tight integration enables $\qtyrange[range-units=single]{10}{13}{\times}$ more \ac{io} connections to the \ac{mpsoc} and $\qtyrange[range-units=single]{2}{2.4}{\times}$ lower energy per bit-transfer \cite{lee2021}.
One memory stack supports up to 8 independent memory channels, each of which contains up to 16 banks divided into 4 bank groups.
The command, address and data bus operate at \ac{ddr}, i.e., they transfer two words per interface clock cycle $t_{CK}$.
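With \ac{ddr} signalling, the peak bandwidth of one stack follows directly from these figures; a back-of-the-envelope estimate, assuming a 128-bit data bus per channel and an interface clock of $\qty{1}{\giga\hertz}$ (both values are illustrative assumptions, not taken from the text above):

```latex
\begin{equation*}
  \underbrace{8}_{\text{channels}}
  \times \underbrace{\qty{128}{\bit}}_{\text{bus width}}
  \times \underbrace{2}_{\text{transfers per } t_{CK}}
  \times \qty{1}{\giga\hertz}
  = \qty[per-mode=symbol]{2048}{\giga\bit\per\second}
  = \qty[per-mode=symbol]{256}{\giga\byte\per\second}
\end{equation*}
```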


@@ -96,7 +96,7 @@ This gives a maximum theoretical \ac{pim} bandwidth of $\qty[per-mode = symbol]{
\subsection{Newton AiM}
\label{sec:pim_newton}
In the year 2020, the major \ac{dram} manufacturer SK Hynix announced its own \ac{pim} technology in GDDR memory called Newton \cite{he2020}.
In the year 2020, the major \ac{dram} manufacturer SK Hynix announced its own \ac{pim} technology in \ac{gddr6} memory called Newton \cite{he2020}.
In contrast to UPMEM, Newton integrates only small \ac{mac} units and buffers into the bank region to avoid the area and power overhead of a fully programmable processor core.
To communicate with the processing units, Newton introduces its own \ac{dram} commands, allowing fully interleaved \ac{pim} and non-\ac{pim} traffic as no mode switching is required.
Another advantage of this approach is that there is no kernel startup delay to initialize the \ac{pim} operation, which would be a significant overhead for small batches of \ac{pim} operations.
@@ -124,4 +124,27 @@ As a result, Newton promises a $\qtyrange{10}{54}{\times}$ speedup compared to a
\subsection{FIMDRAM/HBM-PIM}
\label{sec:pim_fim}
One year after SK Hynix, the major \ac{dram} manufacturer Samsung announced its own \ac{pim} \ac{dram} implementation, called \ac{fimdram} or \ac{hbm}-\ac{pim}.
As the name suggests, it is based on the \aca{hbm} memory standard, and it integrates 16-wide \ac{simd} engines directly into the memory banks, exploiting bank-level parallelism, while retaining the highly optimized \acp{subarray} \cite{kwon2021}.
A major difference from Newton \ac{pim} is that \ac{hbm}-\ac{pim} does not require any changes to components of modern processors, such as the memory controller, i.e., it is agnostic to existing \aca{hbm} platforms.
Consequently, mode switching is required for \ac{hbm}-\ac{pim}, making it less useful for interleaved \ac{pim} and non-\ac{pim} traffic.
Fortunately, as discussed in Section \ref{sec:hbm}, the architecture of \ac{hbm} allows for many independent memory channels on a single stack, making it possible to cleanly separate the memory map into a \ac{pim}-enabled region and a normal \ac{hbm} region.
At the heart of the \ac{hbm}-\ac{pim} are the \ac{pim} execution units, which are shared by two banks of a \ac{pch}.
They include 16 16-bit wide \ac{simd} \acp{fpu}, \acp{crf}, \acp{grf} and \acp{srf} \cite{lee2021}.
This general architecture is shown in detail in Figure \ref{img:fimdram}: (a) the placement of the \ac{pim} units between the memory banks of a \ac{dram} die, (b) a bank coupled to its \ac{pim} unit, and (c) the data path around an \ac{fpu} within the \ac{pim} unit.
\begin{figure}
\centering
\includegraphics[width=\linewidth]{images/fimdram}
\caption[Architecture of \ac{hbm}-\ac{pim}]{Architecture of \ac{hbm}-\ac{pim} \cite{lee2021}}
\label{img:fimdram}
\end{figure}
As can be seen in (c), the input data to an \ac{fpu} can come directly from the memory bank, from a \ac{grf}/\ac{srf}, or from the result bus of a previous computation.
The 16-wide \ac{simd} units correspond to the 256-bit prefetch architecture of \aca{hbm}, where 16 16-bit floating-point operands are passed directly from the \acp{psa} to the \acp{fpu} in a single memory access.
As all \ac{pim} units operate in parallel, with 16 banks per \ac{pch}, a single memory access loads a total of $\qty{256}{\bit} \times \qty{16}{banks} = \qty{4096}{\bit}$ into the \acp{fpu}.
As a result, the theoretical internal bandwidth of \ac{hbm}-\ac{pim} is $\qty{16}{\times}$ higher than that of the external bus to the host processor.
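The $\qty{16}{\times}$ factor can be checked directly from the per-access widths: one external access transfers the 256-bit prefetch of a single bank, while internally all 16 banks of a \ac{pch} feed their \acp{fpu} simultaneously.

```latex
\begin{equation*}
  \frac{\text{internal bits per access}}{\text{external bits per access}}
  = \frac{16 \times \qty{256}{\bit}}{\qty{256}{\bit}}
  = \frac{\qty{4096}{\bit}}{\qty{256}{\bit}}
  = 16
\end{equation*}
```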
% differences to the Hynix PIM
% benchmark results from Samsung...

BIN
src/images/fimdram.pdf Normal file

Binary file not shown.