    short = DDR4,
    long = Double Data Rate 4,
}
\DeclareAcronym{gddr6}{
    short = GDDR6,
    long = Graphics Double Data Rate 6,
}
\DeclareAcronym{tsv}{
    short = TSV,
    long = through-silicon via,
}
\DeclareAcronym{sdk}{
    short = SDK,
    long = software development kit,
}
\DeclareAcronym{fimdram}{
    short = FIMDRAM,
    long = Function-In-Memory DRAM,
}
\DeclareAcronym{simd}{
    short = SIMD,
    long = single-instruction multiple-data,
}
\DeclareAcronym{fpu}{
    short = FPU,
    long = floating-point unit,
}
\DeclareAcronym{crf}{
    short = CRF,
    long = command register file,
}
\DeclareAcronym{grf}{
    short = GRF,
    long = general register file,
}
\DeclareAcronym{srf}{
    short = SRF,
    long = scalar register file,
}
\DeclareAcronym{tlm}{
    short = TLM,
    long = transaction-level modeling,
}
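As a minimal sketch of how such declarations are consumed in the running text (the standalone preamble below is only illustrative scaffolding; the thesis has its own preamble), the acro package expands the long form on first use and the short form afterwards:

```latex
\documentclass{article}
\usepackage{acro}
\DeclareAcronym{tsv}{
    short = TSV,
    long  = through-silicon via,
}
\begin{document}
% First use prints the long form with the short form in parentheses,
% here pluralized: "through-silicon vias (TSVs)"
Stacked dies are connected with \acp{tsv}.
% Subsequent uses print only the short form: "TSV"
A \ac{tsv} is a vertical interconnect through the die.
\end{document}
```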
Several \ac{dram} dies are stacked on top of each other and connected with \acp{tsv}.
\end{figure}
Such a cube is then placed onto a common silicon interposer that connects it to its host processor.
This packaging brings the memory closer to the \ac{mpsoc}, which reduces latency, minimizes bus capacitance and, most importantly, allows for a very wide memory interface.
For example, compared to a conventional \ac{ddr4} \ac{dram}, this tight integration enables $\qtyrange[range-units=single]{10}{13}{\times}$ more \ac{io} connections to the \ac{mpsoc} and $\qtyrange[range-units=single]{2}{2.4}{\times}$ lower energy per bit transfer \cite{lee2021}.

One memory stack supports up to 8 independent memory channels, each of which contains up to 16 banks, divided into 4 bank groups.
The command, address and data buses operate at \ac{ddr}, i.e., they transfer two words per interface clock cycle $t_{CK}$.
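The effect of \ac{ddr} signaling on the per-pin data rate follows directly from the two transfers per clock cycle; the interface clock value below is an assumed example, not a figure from the text:

```latex
% Per-pin data rate under DDR signaling (two transfers per cycle).
% t_CK = 1 ns is an assumed example clock, i.e., f_CK = 1 GHz.
\begin{align*}
    t_{CK} &= \qty{1}{\nano\second}\\
    \text{per-pin data rate} &= \frac{2~\text{bit}}{t_{CK}}
        = \qty{2}{\giga\bit\per\second}
\end{align*}
```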
\subsection{Newton AiM}
\label{sec:pim_newton}

In the year 2020, the major \ac{dram} manufacturer SK Hynix announced its own \ac{pim} technology in \ac{gddr6} memory called Newton \cite{he2020}.
In contrast to UPMEM, Newton integrates only small \ac{mac} units and buffers into the bank region to avoid the area and power overhead of a fully programmable processor core.
To communicate with the processing units, Newton introduces its own \ac{dram} commands, allowing fully interleaved \ac{pim} and non-\ac{pim} traffic, as no mode switching is required.
Another advantage of this approach is that there is no kernel startup delay to initialize a \ac{pim} operation, which would be a significant overhead for small batches of \ac{pim} operations.
As a result, Newton promises a $\qtyrange{10}{54}{\times}$ speedup compared to a GPU \cite{he2020}.

\subsection{FIMDRAM/HBM-PIM}
\label{sec:pim_fim}

One year after SK Hynix, the major \ac{dram} manufacturer Samsung announced its own \ac{pim} \ac{dram} implementation, called \ac{fimdram} or \ac{hbm}-\ac{pim}.
As the name suggests, it is based on the \aca{hbm} memory standard, and it integrates 16-wide \ac{simd} engines directly into the memory banks, exploiting bank-level parallelism while retaining the highly optimized \acp{subarray} \cite{kwon2021}.
A major difference from Newton is that \ac{hbm}-\ac{pim} does not require any changes to components of modern processors, such as the memory controller, i.e., it is agnostic to existing \aca{hbm} platforms.
Consequently, mode switching is required for \ac{hbm}-\ac{pim}, making it less suitable for interleaved \ac{pim} and non-\ac{pim} traffic.
Fortunately, as discussed in Section \ref{sec:hbm}, the architecture of \ac{hbm} allows for many independent memory channels on a single stack, making it possible to cleanly separate the memory map into a \ac{pim}-enabled region and a normal \ac{hbm} region.

At the heart of the \ac{hbm}-\ac{pim} are the \ac{pim} execution units, which are shared by two banks of a \ac{pch}.
They include 16 16-bit wide \ac{simd} \acp{fpu}, as well as \acp{crf}, \acp{grf} and \acp{srf} \cite{lee2021}.
This general architecture is shown in detail in Figure \ref{img:fimdram}: (a) the placement of the \ac{pim} units between the memory banks of a \ac{dram} die, (b) a bank coupled to its \ac{pim} unit, and (c) the data path around an \ac{fpu} within the \ac{pim} unit.
\begin{figure}
    \centering
    \includegraphics[width=\linewidth]{images/fimdram}
    \caption[Architecture of \ac{hbm}-\ac{pim}]{Architecture of \ac{hbm}-\ac{pim} \cite{lee2021}}
    \label{img:fimdram}
\end{figure}
As can be seen in (c), the input data to an \ac{fpu} can come directly from the memory bank, from a \ac{grf}/\ac{srf}, or from the result bus of a previous computation.
The 16-wide \ac{simd} units correspond to the 256-bit prefetch architecture of \aca{hbm}, where 16 16-bit floating-point operands are passed directly from the \acp{psa} to the \acp{fpu} in a single memory access.
As all \ac{pim} units operate in parallel, with 16 banks per \ac{pch}, a single memory access loads a total of $16 \times \qty{256}{\bit} = \qty{4096}{\bit}$ into the \acp{fpu}.
As a result, the theoretical internal bandwidth of \ac{hbm}-\ac{pim} is $\qty{16}{\times}$ higher than that of the external bus to the host processor.
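The $16\times$ internal bandwidth figure can be verified with a short calculation using only the per-access quantities stated above:

```latex
% Internal vs. external data volume per memory access:
% each bank delivers a 256-bit prefetch, 16 banks operate in parallel,
% while the external interface moves one 256-bit access at a time.
\begin{align*}
    \text{internal per access} &= 16 \times \qty{256}{\bit} = \qty{4096}{\bit}\\
    \text{external per access} &= \qty{256}{\bit}\\
    \frac{\text{internal}}{\text{external}} &= \frac{4096}{256} = 16
\end{align*}
```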
% differences compared to the Hynix PIM
% benchmark results from Samsung ...