Samsung PIM Architecture and Instructions

This commit is contained in:
2024-02-08 22:15:12 +01:00
parent c895d71f74
commit af4e559006
7 changed files with 134 additions and 11 deletions

View File

@@ -211,6 +211,22 @@
short = SRF,
long = scalar register file,
}
\DeclareAcronym{fp16}{
short = FP16,
long = 16-bit floating-point,
}
\DeclareAcronym{fp32}{
short = FP32,
long = 32-bit floating-point,
}
\DeclareAcronym{relu}{
short = ReLU,
long = rectified linear unit,
}
\DeclareAcronym{aam}{
short = AAM,
long = address aligned mode,
}
\DeclareAcronym{tlm}{
short = TLM,
long = transaction-level modeling,

View File

@@ -111,7 +111,7 @@ For example, compared to a conventional \ac{ddr4} \ac{dram}, this tight integrat
One memory stack supports up to 8 independent memory channels, each of which contains up to 16 banks divided into 4 bank groups.
The command, address and data bus operate at \ac{ddr}, i.e., they transfer two words per interface clock cycle $t_{CK}$.
With a $t_{CK}$ of $\qty{1}{\giga\hertz}$, \aca{hbm} achieves a pin transfer rate of $\qty{2}{\giga T \per\second}$, which gives $\qty[per-mode=symbol]{16}{\giga\byte\per\second}$ per \ac{pch} and a total of $\qty[per-mode = symbol]{256}{\giga\byte\per\second}$ for the 1024-bit wide data bus of each stack.
A single data transfer is performed with either a \ac{bl} of 2 or 4, depending on the \ac{pch} configuration.
In \ac{pch} mode, the data bus is split in half (i.e., 64-bit) to enable independent data transmission, further increasing parallelism while sharing a common command and address bus between the two \acp{pch}.
Thus, accessing \aca{hbm} in \ac{pch} mode transmits a $\qty{256}{\bit}=\qty{32}{\byte}$ burst with a \ac{bl} of 4 over the $\qty{64}{\bit}$ wide data bus.
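As a compact restatement of the figures above (no new numbers, just the arithmetic spelled out), the bandwidths follow directly from the pin rate and bus widths:

```latex
\begin{align*}
  \text{per stack:} \quad & \qty{2}{\giga T \per\second} \times \qty{1024}{\bit} = \qty[per-mode=symbol]{256}{\giga\byte\per\second} \\
  \text{per \ac{pch}:} \quad & \qty{2}{\giga T \per\second} \times \qty{64}{\bit} = \qty[per-mode=symbol]{16}{\giga\byte\per\second} \\
  \text{per burst:} \quad & \qty{64}{\bit} \times 4\,(\text{\ac{bl}}) = \qty{256}{\bit} = \qty{32}{\byte}
\end{align*}
```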
@@ -123,7 +123,7 @@ In the center of the die, the \acp{tsv} connect to the next die above or the pre
\begin{figure}
\centering
\includegraphics[width=0.8\linewidth]{images/hbm}
\caption[\aca{hbm} memory die architecture]{\aca{hbm} memory die architecture \cite{lee2021}.}
\label{img:hbm}
\end{figure}

View File

@@ -42,14 +42,14 @@ Secondly, \ac{pim} comes with the further limitation that it can only accelerate
\label{sec:pim_architectures}
Many different \ac{pim} architectures have been proposed by research in the past, and more recently, real implementations have been presented by hardware vendors.
These proposals differ largely in the positioning of the applied processing operation, ranging from analogue distribution of capacitor charges at the \ac{subarray} level to additional processing units at the global \ac{io} level.
In essence, the placements of these approaches can be summarized as follows \cite{sudarshan2022}:
\begin{enumerate}
\item Inside the memory \ac{subarray}.
\item In the \ac{psa} region near a \ac{subarray}.
\item Outside the bank in its peripheral region.
\item In the \ac{io} region of the memory.
\end{enumerate}
Each of these approaches comes with different advantages and disadvantages.
@@ -109,7 +109,7 @@ To make full use of the output buffering, the matrix rows are interleaved in an
\begin{figure}
\centering
\input{images/hynix}
\caption[Newton memory layout for a \ac{gemv} operation]{Newton memory layout for a \ac{gemv} operation \cite{he2020}.}
\label{img:hynix}
\end{figure}
@@ -125,6 +125,8 @@ As a result, Newton promises a $\qtyrange{10}{54}{\times}$ speedup compared to a
\subsection{FIMDRAM/HBM-PIM}
\label{sec:pim_fim}
\subsubsection{Architecture}
One year after SK Hynix, the major \ac{dram} manufacturer Samsung announced its own \ac{pim} \ac{dram} implementation, called \ac{fimdram} or \ac{hbm}-\ac{pim}.
As the name suggests, it is based on the \aca{hbm} memory standard, and it integrates 16-wide \ac{simd} engines directly into the memory banks, exploiting bank-level parallelism, while retaining the highly optimized \acp{subarray} \cite{kwon2021}.
A major difference from Newton \ac{pim} is that \ac{hbm}-\ac{pim} does not require any changes to components of modern processors, such as the memory controller, i.e., it is agnostic to existing \aca{hbm} platforms.
@@ -134,17 +136,110 @@ Fortunately, as discussed in Section \ref{sec:hbm}, the architecture of \ac{hbm}
At the heart of the \ac{hbm}-\ac{pim} are the \ac{pim} execution units, which are shared by two banks of a \ac{pch}.
They include 16 16-bit wide \ac{simd} \acp{fpu}, \acp{crf}, \acp{grf} and \acp{srf} \cite{lee2021}.
This general architecture is shown in detail in Figure \ref{img:fimdram}: (a) the placement of the \ac{pim} units between the memory banks of a \ac{dram} die, (b) a bank coupled to its \ac{pim} unit, and (c) the data path around an \ac{fpu} within the \ac{pim} unit.
\begin{figure}
\centering
\includegraphics[width=\linewidth]{images/fimdram}
\caption[Architecture of \ac{hbm}-\ac{pim}]{Architecture of \ac{hbm}-\ac{pim} \cite{lee2021}.}
\label{img:fimdram}
\end{figure}
As can be seen in (c), the input data to the \ac{fpu} can either come directly from the memory bank, from a \ac{grf}/\ac{srf}, or from the result bus of a previous computation.
The 16-wide \ac{simd} units correspond to the 256-bit prefetch architecture of \aca{hbm}, where 16 16-bit floating-point operands are passed directly from the \acp{psa} to the \acp{fpu} from a single memory access.
As all \ac{pim} units operate in parallel, with 16 banks per \ac{pch}, a single memory access loads a total of $16 \times \qty{256}{\bit} = \qty{4096}{\bit}$ into the \acp{fpu}.
As a result, the theoretical internal bandwidth of \ac{hbm}-\ac{pim} is $\qty{16}{\times}$ higher than the connection to the external bus to the host processor.
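The $\qty{16}{\times}$ figure can be checked with a small back-of-the-envelope sketch; the constants are taken from the values stated above, nothing else is assumed:

```python
# Internal vs. external bandwidth of HBM-PIM per pseudo channel (PCH).
PREFETCH_BITS = 256  # bits delivered per bank access (BL 4 on the 64-bit bus)
BANKS_PER_PCH = 16   # all PIM units of a PCH operate in parallel

internal_bits_per_access = PREFETCH_BITS * BANKS_PER_PCH  # 4096 bits
external_bits_per_access = PREFETCH_BITS                  # one burst reaches the host

ratio = internal_bits_per_access // external_bits_per_access
print(ratio)  # -> 16
```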
\Ac{hbm}-\ac{pim} defines three operating modes:
\begin{enumerate}
\item \textbf{Single Bank Mode}:
This is the default operating mode, where \ac{hbm}-\ac{pim} has identical behavior to normal \aca{hbm} memory.
To switch to another mode, a specific sequence of \ac{act} and \ac{pre} commands must be sent by the memory controller to a specific row address.
\item \textbf{All-Bank Mode}:
The all-bank mode is an extension of the single bank mode in which the \ac{pim} execution units allow concurrent access to half of the \ac{dram} banks.
This provides $\qty{8}{\times}$ more bandwidth than the standard operation mode, which can be used for the initialization of memory regions across all banks.
\item \textbf{All-Bank-\ac{pim} Mode}:
With another predefined \ac{dram} access sequence, the memory switches to the \ac{pim} enabled mode.
In this mode, a single memory access initiates the concurrent execution of the next instruction across all processing units.
In addition, the \ac{io} circuits of the \ac{dram} are completely disabled in this mode, reducing the power required during \ac{pim} operation.
\end{enumerate}
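The mode switching described above could be modelled as a small state machine. The concrete \ac{act}/\ac{pre} unlock sequences use vendor-defined row addresses that are not given at this level of detail, so the addresses below are purely hypothetical placeholders:

```python
# Hypothetical sketch of the HBM-PIM operating-mode switching.
# The real unlock sequences target vendor-defined row addresses; the
# values below are invented placeholders, not the actual ones.
SB, AB, AB_PIM = "single-bank", "all-bank", "all-bank-PIM"

class PimMode:
    UNLOCK_AB = [("ACT", 0x17FF), ("PRE", 0x17FF)]      # hypothetical
    UNLOCK_AB_PIM = [("ACT", 0x1FFF), ("PRE", 0x1FFF)]  # hypothetical

    def __init__(self):
        self.mode = SB  # default: behaves like plain HBM
        self.history = []

    def command(self, cmd, row):
        self.history.append((cmd, row))
        # A matching tail of the command history triggers a mode switch.
        if self.history[-2:] == self.UNLOCK_AB:
            self.mode = AB
        elif self.history[-2:] == self.UNLOCK_AB_PIM:
            self.mode = AB_PIM

mem = PimMode()
mem.command("ACT", 0x1FFF)
mem.command("PRE", 0x1FFF)
print(mem.mode)  # -> all-bank-PIM
```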
% differences from Hynix PIM
% Samsung's benchmark results...
In both all-bank mode and all-bank-\ac{pim} mode, the bandwidth per \ac{pch} is $\qty{8}{\times}$ higher than the standard $\qty[per-mode=symbol]{16}{\giga\byte\per\second}$, i.e., $\qty[per-mode=symbol]{128}{\giga\byte\per\second}$ per \ac{pch} and a total of $\qty[per-mode=symbol]{2}{\tera\byte\per\second}$ across 16 \acp{pch}.
\subsubsection{Processing Unit}
Due to the focus of \ac{hbm}-\ac{pim} on \ac{dnn} applications, the native data type of the \acp{fpu} is \ac{fp16}, motivated by the significantly lower area and power requirements of \ac{fp16} \acp{fpu} compared to \ac{fp32}, as well as the good support for \ac{fp16} in modern processor architectures.
The \ac{simd} \ac{fpu} is implemented once as a \ac{fp16} multiplier unit and once as a \ac{fp16} adder unit, providing support for these basic arithmetic operations.
In addition to the \acp{fpu}, a processing unit also contains \acp{crf}, \acp{srf} and \acp{grf}.
The \ac{crf} acts as an instruction buffer, holding the 32 32-bit instructions to be executed by the processing unit when accessing memory.
As explained earlier, the operands come either directly from the bank or from the \acp{srf} or \acp{grf}.
Each \ac{grf} consists of 16 256-bit registers, matching the \aca{hbm} prefetch size, so that each entry can hold the data of a full memory burst.
The \ac{grf} of a processing unit is divided into two halves (\ac{grf}-A and \ac{grf}-B), with 8 register entries allocated to each of the two banks.
Finally, in the \acp{srf}, a 16-bit scalar value is replicated 16 times and fed into the 16-wide \ac{simd} \ac{fpu} as a constant summand or factor for an addition or multiplication.
The \ac{srf} is also divided into two halves (\ac{srf}-A and \ac{srf}-M), for addition and multiplication respectively, with 8 entries each.
This processing unit architecture is illustrated in Figure \ref{img:pcu}, along with the local bus interfaces to its even and odd banks, and the control unit that, among other things, decodes the instructions and keeps track of the program counter.
\begin{figure}
\centering
\includegraphics[width=0.8\linewidth]{images/pcu}
\caption[Architecture of a \ac{pim} processing unit]{Architecture of a \ac{pim} processing unit \cite{lee2021}.}
\label{img:pcu}
\end{figure}
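The \ac{srf} broadcast described above amounts to replicating one \ac{fp16} scalar across all 16 \ac{simd} lanes; a minimal sketch, with plain Python floats standing in for \ac{fp16} values:

```python
# Sketch of SRF scalar broadcast into the 16-wide SIMD FPU.
SIMD_WIDTH = 16

def srf_broadcast(scalar):
    """Replicate a scalar 16 times, as the SRF feeds the SIMD FPU."""
    return [scalar] * SIMD_WIDTH

def simd_mul(vec, srf_scalar):
    """Lane-wise multiply of a 16-wide operand with a broadcast scalar."""
    return [v * s for v, s in zip(vec, srf_broadcast(srf_scalar))]

bank_burst = [float(i) for i in range(SIMD_WIDTH)]  # one 256-bit burst
print(simd_mul(bank_burst, 2.0))  # each lane doubled
```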
\subsubsection{Instruction Set}
The \ac{hbm}-\ac{pim} processing units provide a total of 9 32-bit \ac{risc} instructions, each of which falls into one of three groups: control flow instructions, arithmetic instructions and data movement instructions.
The data layout of these three instruction groups is shown in Table \ref{tab:isa}.
\begin{table}
\centering
\includegraphics[width=\linewidth]{images/isa}
\caption[The instruction format of the processing units]{The instruction format of the processing units \cite{lee2021}.}
\label{tab:isa}
\end{table}
For the control flow instructions, there is NOP, which does not perform any operation; JUMP, which jumps by an instruction offset for a fixed number of iterations; and EXIT, which restores the internal state of the processing unit.
The arithmetic instructions perform operations such as simple ADD and MUL, but also support \ac{mac} and \ac{mad} operations, which are key for accelerating \ac{dnn} applications.
Finally, the MOV and FILL instructions are used to move data between the memory banks and the \ac{grf} and \ac{srf} register files.
The DST and SRC fields specify the operand type, i.e., the register file or bank affected by the operation.
Depending on the source or destination operand types, the instruction encodes indices for the concrete elements in the register files, which are denoted in Table \ref{tab:isa} by \textit{\#} symbols.
The special field \textit{R} for the data movement instruction type enables a \ac{relu} operation, i.e., clamping negative values to zero, while the data is moved to another location.
Another special field, \textit{A}, enables the \ac{aam}, which will be explained in more detail in Section \ref{sec:instruction_ordering}.
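How such a 32-bit instruction word might be unpacked can be sketched as follows. The exact bit positions are only given in the cited instruction-format figure, so the field widths and opcode numbering used here are assumptions for illustration, not the real encoding:

```python
# Hypothetical decoder for a 32-bit PIM instruction word.
# Field widths and opcode values are assumed; the actual layout is
# defined in the instruction-format figure of the Samsung paper.
OPCODES = {0: "NOP", 1: "JUMP", 2: "EXIT", 3: "MOV", 4: "FILL",
           5: "ADD", 6: "MUL", 7: "MAC", 8: "MAD"}

def decode(word):
    opcode = (word >> 28) & 0xF  # assumed 4-bit opcode
    aam    = (word >> 27) & 0x1  # 'A' flag: address aligned mode
    relu   = (word >> 26) & 0x1  # 'R' flag: ReLU on data movement
    dst    = (word >> 23) & 0x7  # assumed 3-bit operand-type fields
    src0   = (word >> 20) & 0x7
    return {"op": OPCODES[opcode], "aam": bool(aam), "relu": bool(relu),
            "dst": dst, "src0": src0}

word = (3 << 28) | (1 << 26)  # MOV with the ReLU flag set
print(decode(word)["op"], decode(word)["relu"])  # -> MOV True
```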
\begin{table}
\centering
\resizebox{\linewidth}{!}{%
\begin{tblr}{
hlines,
vlines,
hline{2} = {-}{solid,black},
hline{2} = {2}{-}{solid,black},
}
Type & Command & Description & Result (DST) & Operand (SRC0) & Operand (SRC1) & Operand (SRC2) \\
Control & NOP & no operation & & & & \\
Control & JUMP & jump instruction & & & & \\
Control & EXIT & exit instruction & & & & \\
Data & MOV & {move data\\from bank/register\\to register} & GRF, SRF & GRF, BANK & & \\
Data & FILL & {move data\\from bank/register\\to bank/register} & GRF, BANK & GRF, BANK & & \\
Arithmetic & ADD & addition & GRF & GRF, BANK, SRF & GRF, BANK, SRF & \\
Arithmetic & MUL & multiplication & GRF & GRF, BANK & GRF, BANK, SRF & \\
Arithmetic & MAC & multiply-accumulate & GRF-B & GRF, BANK & GRF, BANK, SRF & GRF, BANK, SRF \\
Arithmetic & MAD & multiply-and-add & GRF & GRF, BANK & GRF, BANK, SRF & GRF, BANK, SRF
\end{tblr}}
\caption[A list of the supported instructions with possible sources and destinations]{A list of the supported instructions with possible sources and destinations \cite{shin-haengkang2023}.}
\label{tab:instruction_set}
\end{table}
Table \ref{tab:instruction_set} gives an overview of all available instructions and defines the possible operand sources and destinations.
Note that some operations require either a \ac{rd} or a \ac{wr} access to execute properly.
For example, to write the resulting output vector from a \ac{grf} to the memory banks, the memory controller must issue a \ac{wr} command to the bank.
Likewise, reading from the banks requires a \ac{rd} command.
For control instructions and for arithmetic instructions without the bank as a source operand, either a \ac{rd} or a \ac{wr} can be issued to execute the instruction.
In the rest of this thesis, it is assumed that a \ac{rd} is issued for these instructions.
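The per-lane semantics of the arithmetic instructions, with MAC accumulating into \ac{grf}-B as noted above, can be sketched like this (plain Python floats in place of \ac{fp16} hardware):

```python
# Lane-wise sketch of the arithmetic instruction semantics.
WIDTH = 16  # SIMD lanes

def op_add(a, b):
    return [x + y for x, y in zip(a, b)]

def op_mul(a, b):
    return [x * y for x, y in zip(a, b)]

def op_mac(grf_b, a, b):
    """MAC: grf_b += a * b, lane-wise; the result stays in GRF-B."""
    return [acc + x * y for acc, x, y in zip(grf_b, a, b)]

def op_mad(a, b, c):
    """MAD: a * b + c, lane-wise, with an explicit third operand."""
    return [x * y + z for x, y, z in zip(a, b, c)]

a = [1.0] * WIDTH
b = [2.0] * WIDTH
acc = [0.5] * WIDTH
print(op_mac(acc, a, b)[0])  # -> 2.5
```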
\subsubsection{Instruction Ordering}
\label{sec:instruction_ordering}
\subsubsection{Memory Layout}
\subsubsection{Performance and Power Efficiency Effects}

View File

@@ -446,6 +446,16 @@
file = {/home/derek/Nextcloud/Verschiedenes/Zotero/storage/3J45PFD2/Seshadri und Mutlu - 2020 - In-DRAM Bulk Bitwise Execution Engine.pdf;/home/derek/Nextcloud/Verschiedenes/Zotero/storage/DTK64DHZ/1905.html}
}
@misc{shin-haengkang2023,
title = {{{PIMSimulator}}},
author = {{Shin-haeng Kang} and {Sanghoon Cha} and {Seungwoo Seo} and {Jin-seong Kim}},
year = {2023},
month = nov,
url = {https://github.com/SAITPublic/PIMSimulator},
urldate = {2024-02-08},
abstract = {Processing-In-Memory (PIM) Simulator}
}
@misc{src2021,
title = {Decadal {{Plan}} for {{Semiconductors}}},
author = {{SRC}},

BIN
src/images/isa.pdf Normal file

Binary file not shown.

BIN
src/images/pcu.pdf Normal file

Binary file not shown.

View File

@@ -22,6 +22,8 @@
\usepackage{pgfplots}
\usepackage{bytefield}
\usepackage{mathdots}
\usepackage{tabularray}
\usepackage{makecell}
% Configurations
\usetikzlibrary{matrix}