Samsung PIM Architecture and Instructions
@@ -211,6 +211,22 @@
short = SRF,
long = scalar register file,
}
\DeclareAcronym{fp16}{
short = FP16,
long = 16-bit floating-point,
}
\DeclareAcronym{fp32}{
short = FP32,
long = 32-bit floating-point,
}
\DeclareAcronym{relu}{
short = ReLU,
long = rectified linear unit,
}
\DeclareAcronym{aam}{
short = AAM,
long = address aligned mode,
}
\DeclareAcronym{tlm}{
short = TLM,
long = transaction-level modeling,
@@ -111,7 +111,7 @@ For example, compared to a conventional \ac{ddr4} \ac{dram}, this tight integrat

One memory stack supports up to 8 independent memory channels, each containing up to 16 banks, which are divided into 4 bank groups.
The command, address and data bus operate at \ac{ddr}, i.e., they transfer two words per interface clock cycle $t_{CK}$.
With an interface clock frequency of $\qty{1}{\giga\hertz}$, i.e., $t_{CK} = \qty{1}{\nano\second}$, \aca{hbm} achieves a pin transfer rate of $\qty{2}{\giga T \per\second}$, which gives $\qty[per-mode=symbol]{16}{\giga\byte\per\second}$ per \ac{pch} and a total of $\qty[per-mode = symbol]{256}{\giga\byte\per\second}$ for the 1024-bit wide data bus of each stack.
A single data transfer is performed with a \ac{bl} of either 2 or 4, depending on the \ac{pch} configuration.
In \ac{pch} mode, the data bus of a channel is split into two 64-bit halves to enable independent data transmission, further increasing parallelism while sharing a common command and address bus between the two \acp{pch}.
Thus, accessing \aca{hbm} in \ac{pch} mode transmits a $\qty{256}{\bit}=\qty{32}{\byte}$ burst with a \ac{bl} of 4 over the $\qty{64}{\bit}$ wide data bus.
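The bandwidth and burst figures above can be retraced with a few lines of Python. This is only an illustrative sketch: the function names and parameters are hypothetical and not part of any \aca{hbm} tooling, and the values are the ones quoted in the text (a clock of 1 GHz, two transfers per clock cycle, 64-bit and 1024-bit bus widths).

```python
def peak_bandwidth_gb_per_s(bus_width_bits, clock_ghz, transfers_per_cycle=2):
    """Peak bandwidth in GB/s of a bus with the given width and clock.

    transfers_per_cycle=2 models double data rate (DDR).
    """
    giga_transfers_per_s = clock_ghz * transfers_per_cycle
    return bus_width_bits * giga_transfers_per_s / 8  # bits -> bytes


def burst_size_bytes(bus_width_bits, burst_length):
    """Number of bytes moved by one burst over the given bus width."""
    return bus_width_bits * burst_length // 8


per_pch = peak_bandwidth_gb_per_s(64, 1.0)      # one pseudo channel
per_stack = peak_bandwidth_gb_per_s(1024, 1.0)  # full 1024-bit stack interface
burst = burst_size_bytes(64, 4)                 # burst length 4 in pch mode
```

With these inputs, the sketch yields 16 GB/s per \ac{pch}, 256 GB/s per stack and a 32-byte burst, matching the numbers stated above.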
@@ -123,7 +123,7 @@ In the center of the die, the \acp{tsv} connect to the next die above or the pre
\begin{figure}
\centering
\includegraphics[width=0.8\linewidth]{images/hbm}
\caption[\aca{hbm} memory die architecture]{\aca{hbm} memory die architecture \cite{lee2021}.}
\label{img:hbm}
\end{figure}

@@ -42,14 +42,14 @@ Secondly, \ac{pim} comes with the further limitation that it can only accelerate
\label{sec:pim_architectures}

Many different \ac{pim} architectures have been proposed in research, and more recently real implementations have been presented by hardware vendors.
These proposals differ largely in where the processing operation is placed, ranging from analogue distribution of capacitor charges at the \ac{subarray} level to additional processing units at the global \ac{io} level.
In essence, the possible placements can be summarized as follows \cite{sudarshan2022}:

\begin{enumerate}
\item Inside the memory \ac{subarray}.
\item In the \ac{psa} region near a \ac{subarray}.
\item Outside the bank in its peripheral region.
\item In the \ac{io} region of the memory.
\end{enumerate}

Each of these approaches comes with different advantages and disadvantages.
@@ -109,7 +109,7 @@ To make full use of the output buffering, the matrix rows are interleaved in an
\begin{figure}
\centering
\input{images/hynix}
\caption[Newton memory layout for a \ac{gemv} operation]{Newton memory layout for a \ac{gemv} operation \cite{he2020}.}
\label{img:hynix}
\end{figure}

@@ -125,6 +125,8 @@ As a result, Newton promises a $\qtyrange{10}{54}{\times}$ speedup compared to a
\subsection{FIMDRAM/HBM-PIM}
\label{sec:pim_fim}

\subsubsection{Architecture}

One year after SK Hynix, the major \ac{dram} manufacturer Samsung announced its own \ac{pim} \ac{dram} implementation, called \ac{fimdram} or \ac{hbm}-\ac{pim}.
As the name suggests, it is based on the \aca{hbm} memory standard, and it integrates 16-wide \ac{simd} engines directly into the memory banks, exploiting bank-level parallelism, while retaining the highly optimized \acp{subarray} \cite{kwon2021}.
A major difference from Newton \ac{pim} is that \ac{hbm}-\ac{pim} does not require any changes to components of modern processors, such as the memory controller, i.e., it is agnostic to existing \aca{hbm} platforms.
@@ -134,17 +136,110 @@ Fortunately, as discussed in Section \ref{sec:hbm}, the architecture of \ac{hbm}
At the heart of the \ac{hbm}-\ac{pim} are the \ac{pim} execution units, each of which is shared by two banks of a \ac{pch}.
They include 16 16-bit wide \ac{simd} \acp{fpu}, \acp{crf}, \acp{grf} and \acp{srf} \cite{lee2021}.
This general architecture is shown in detail in Figure \ref{img:fimdram}, with (a) the placement of the \ac{pim} units between the memory banks of a \ac{dram} die, (b) a bank coupled to its \ac{pim} unit, and (c) the data path in and around a \ac{fpu} within the \ac{pim} unit.

\begin{figure}
\centering
\includegraphics[width=\linewidth]{images/fimdram}
\caption[Architecture of \ac{hbm}-\ac{pim}]{Architecture of \ac{hbm}-\ac{pim} \cite{lee2021}.}
\label{img:fimdram}
\end{figure}

As can be seen in (c), the input data to the \ac{fpu} can come directly from the memory bank, from a \ac{grf}/\ac{srf}, or from the result bus of a previous computation.
The 16-wide \ac{simd} units correspond to the 256-bit prefetch architecture of \aca{hbm}, where 16 16-bit floating-point operands are passed directly from the \acp{psa} to the \acp{fpu} in a single memory access.
As all \ac{pim} units operate in parallel and each unit is shared by two of the 16 banks of a \ac{pch}, a single memory access loads a total of $\qty{256}{\bit} \cdot 8 = \qty{2048}{\bit}$ into the \acp{fpu}.
As a result, the theoretical internal bandwidth of \ac{hbm}-\ac{pim} is $\qty{8}{\times}$ higher than that of the external bus to the host processor.

\Ac{hbm}-\ac{pim} defines three operating modes:
\begin{enumerate}
\item \textbf{Single-Bank Mode}:
This is the default operating mode, in which \ac{hbm}-\ac{pim} behaves identically to normal \aca{hbm} memory.
To switch to another mode, the memory controller must send a specific sequence of \ac{act} and \ac{pre} commands to a specific row address.
\item \textbf{All-Bank Mode}:
The all-bank mode is an extension of the single-bank mode in which the \ac{pim} execution units allow concurrent access to half of the \ac{dram} banks.
This provides $\qty{8}{\times}$ more bandwidth than the standard operating mode, which can be used to initialize memory regions across all banks.
\item \textbf{All-Bank-\ac{pim} Mode}:
With another predefined \ac{dram} access sequence, the memory switches to the \ac{pim}-enabled mode.
In this mode, a single memory access initiates the concurrent execution of the next instruction across all processing units.
In addition, the \ac{io} circuits of the \ac{dram} are completely disabled in this mode, reducing the power required during \ac{pim} operation.
\end{enumerate}

% differences from Hynix PIM
% benchmark results from Samsung...
In both all-bank mode and all-bank-\ac{pim} mode, the internal bandwidth per \ac{pch} is $\qty{8}{\times}$ higher than the external \aca{hbm} bandwidth of $\qty[per-mode=symbol]{16}{\giga\byte\per\second}$, namely $\qty[per-mode=symbol]{128}{\giga\byte\per\second}$, or $\qty[per-mode=symbol]{2}{\tera\byte\per\second}$ in total for 16 \acp{pch}.

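As a cross-check of these figures, the arithmetic can be written out explicitly. This is only a sketch based on the values quoted in this section, with 8 banks of a \ac{pch} accessed concurrently; the symbols $B_{\mathrm{ext}}$, $B_{\mathrm{int}}$ and $B_{\mathrm{total}}$ are introduced here purely for illustration.
\[
B_{\mathrm{int}} = 8 \cdot B_{\mathrm{ext}} = 8 \cdot \qty[per-mode=symbol]{16}{\giga\byte\per\second} = \qty[per-mode=symbol]{128}{\giga\byte\per\second}
\]
\[
B_{\mathrm{total}} = 16 \cdot B_{\mathrm{int}} = \qty[per-mode=symbol]{2048}{\giga\byte\per\second} \approx \qty[per-mode=symbol]{2}{\tera\byte\per\second}
\]
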
\subsubsection{Processing Unit}

Due to the focus on \ac{dnn} applications in \ac{hbm}-\ac{pim}, the native data type for the \acp{fpu} is \ac{fp16}, which is motivated by the significantly lower area and power requirements of the \acp{fpu} compared to \ac{fp32}, as well as the good support for \ac{fp16} in modern processor architectures.
The \ac{simd} \ac{fpu} is implemented once as a \ac{fp16} multiplier unit and once as a \ac{fp16} adder unit, providing support for these basic arithmetic operations.
In addition to the \acp{fpu}, a processing unit also contains \acp{crf}, \acp{srf} and \acp{grf}.
The \ac{crf} acts as an instruction buffer, holding the 32 32-bit instructions that the processing unit executes when memory is accessed.
As explained earlier, the operands come either directly from the bank or from the \acp{srf} or \acp{grf}.
Each \ac{grf} consists of 16 registers, each of which is 256 bits wide to match the \aca{hbm} prefetch size, so that each entry can hold the data of a full memory burst.
The \ac{grf} of a processing unit is divided into two halves (\ac{grf}-A and \ac{grf}-B), with 8 register entries allocated to each of the two banks.
Finally, in the \acp{srf}, a 16-bit scalar value is replicated 16 times as it is fed into the 16-wide \ac{simd} \ac{fpu} as a constant summand or factor for an addition or multiplication.
The \ac{srf} is also divided into two halves (\ac{srf}-A and \ac{srf}-M) for addition and multiplication, with 8 entries each.
This processing unit architecture is illustrated in Figure \ref{img:pcu}, along with the local bus interfaces to its even and odd banks, and the control unit that, among other things, decodes the instructions and keeps track of the program counter.

\begin{figure}
\centering
\includegraphics[width=0.8\linewidth]{images/pcu}
\caption[Architecture of a \ac{pim} processing unit]{Architecture of a \ac{pim} processing unit \cite{lee2021}.}
\label{img:pcu}
\end{figure}

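The replication of an \ac{srf} scalar across the 16 \ac{simd} lanes can be illustrated with a small software model. This is a minimal sketch and not the hardware implementation: all names (\texttt{to\_fp16}, \texttt{broadcast}, \texttt{simd\_mul}) are hypothetical, and rounding each result to \ac{fp16} via Python's \texttt{struct} half-precision format is an assumption made purely for illustration.

```python
import struct

LANES = 16  # width of the SIMD FPU as described in the text


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def broadcast(scalar):
    """Replicate one 16-bit SRF scalar across all SIMD lanes."""
    return [to_fp16(scalar)] * LANES


def simd_mul(a, b):
    """Lane-wise multiplication, rounding each product to FP16 (assumption)."""
    return [to_fp16(x * y) for x, y in zip(a, b)]


# One 256-bit GRF entry holds 16 FP16 values; scale it by an SRF-M constant.
grf_entry = [to_fp16(float(i)) for i in range(LANES)]
scaled = simd_mul(grf_entry, broadcast(0.5))
```

Here a single scalar serves as the second operand for all 16 lanes, which is the role the text assigns to the \ac{srf}.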
\subsubsection{Instruction Set}

The \ac{hbm}-\ac{pim} processing units provide a total of 9 32-bit \ac{risc} instructions, each of which falls into one of three groups: control flow instructions, arithmetic instructions and data movement instructions.
The data layout of these three instruction groups is shown in Table \ref{tab:isa}.

\begin{table}
\centering
\includegraphics[width=\linewidth]{images/isa}
\caption[The instruction format of the processing units]{The instruction format of the processing units \cite{lee2021}.}
\label{tab:isa}
\end{table}

The control flow instructions are NOP, which does not perform any operation, JUMP, which jumps to an instruction at a given offset for a fixed number of iterations, and EXIT, which restores the internal state of the processing unit.
The arithmetic instructions perform operations such as simple ADD and MUL, but also support \ac{mac} and \ac{mad} operations, which are key for accelerating \ac{dnn} applications.
Finally, the MOV and FILL instructions are used to move data between the memory banks and the \ac{grf} and \ac{srf} register files.

The DST and SRC fields specify the operand type, i.e., the register file or bank affected by the operation.
Depending on the source or destination operand types, the instruction encodes indices for the concrete element in the register files, which are denoted in Table \ref{tab:isa} by \textit{\#} symbols.
The special field \textit{R} for the data movement instruction type enables a \ac{relu} operation, i.e., clamping negative values to zero, while the data is moved to another location.
Another special field \textit{A} enables the \ac{aam}, which will be explained in more detail in Section \ref{sec:instruction_ordering}.

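The effect of the \textit{R} field can be sketched as a data move that optionally clamps negative lanes to zero. This is a simplified software model under stated assumptions: the function \texttt{mov} and its \texttt{relu} parameter are hypothetical names, and the actual bit-level encoding of the field is not modelled.

```python
LANES = 16  # one register entry holds 16 FP16 lanes


def mov(src, relu=False):
    """Copy a 16-lane register entry, optionally applying ReLU on the way.

    The `relu` flag stands in for the R field of the data movement
    instruction format (illustrative assumption).
    """
    if len(src) != LANES:
        raise ValueError("expected one 16-lane register entry")
    if not relu:
        return list(src)
    # ReLU: negative values are clamped to zero, all others pass through.
    return [x if x > 0.0 else 0.0 for x in src]


entry = [-2.0, -0.5, 0.0, 1.5] * 4
moved = mov(entry, relu=True)
plain = mov(entry)
```

Folding the activation into the move means that no separate instruction is needed to apply the \ac{relu} to a result vector.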
\begin{table}
\centering
\resizebox{\linewidth}{!}{%
\begin{tblr}{
hlines,
vlines,
hline{2} = {-}{solid,black},
hline{2} = {2}{-}{solid,black},
}
Type & Command & Description & Result (DST) & Operand (SRC0) & Operand (SRC1) & Operand (SRC2) \\
Control & NOP & no operation & & & & \\
Control & JUMP & jump instruction & & & & \\
Control & EXIT & exit instruction & & & & \\
Data & MOV & {move data\\from bank/register\\to register} & GRF, SRF & GRF, BANK & & \\
Data & FILL & {move data\\from bank/register\\to bank/register} & GRF, BANK & GRF, BANK & & \\
Arithmetic & ADD & addition & GRF & GRF, BANK, SRF & GRF, BANK, SRF & \\
Arithmetic & MUL & multiplication & GRF & GRF, BANK & GRF, BANK, SRF & GRF, BANK, SRF \\
Arithmetic & MAC & multiply-accumulate & GRF-B & GRF, BANK & GRF, BANK, SRF & GRF, BANK, SRF \\
Arithmetic & MAD & multiply-and-add & GRF & GRF, BANK & GRF, BANK, SRF &
\end{tblr}}
\caption[A list of the supported instructions with possible sources and destinations]{A list of the supported instructions with possible sources and destinations \cite{shin-haengkang2023}.}
\label{tab:instruction_set}
\end{table}

Table \ref{tab:instruction_set} gives an overview of all available instructions and defines the possible operand sources and destinations.
Note that some operations require either a \ac{rd} or a \ac{wr} access to execute properly.
For example, to write the resulting output vector from a \ac{grf} to the memory banks, the memory controller must issue a \ac{wr} command to write to the bank.
Likewise, reading from the banks requires a \ac{rd} command.
For control instructions and arithmetic instructions that do not use the bank as a source operand, either a \ac{rd} or a \ac{wr} can be issued to execute the instruction.
For the rest of this thesis, it is assumed that a \ac{rd} is issued for these instructions.

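The rule for choosing the triggering memory command can be summarized in a small helper. This is a hedged sketch of the convention described above and not an excerpt from any simulator: the function name \texttt{required\_command} and the string operand labels are hypothetical, and the case where the bank is both source and destination is treated as unspecified here because the text does not cover it.

```python
def required_command(dst=None, sources=()):
    """Return the DRAM command ("RD" or "WR") assumed to trigger an instruction.

    dst and sources use the operand labels of the instruction table
    ("GRF", "SRF", "BANK"); control instructions pass no operands.
    """
    bank_is_source = "BANK" in sources
    if dst == "BANK" and bank_is_source:
        # Not covered by the text; flagged instead of guessed.
        raise ValueError("bank-to-bank transfer is not specified in this sketch")
    if dst == "BANK":
        return "WR"  # writing a result to the bank needs a write access
    # Reading from the bank needs RD; for all remaining instructions the
    # thesis assumes that an RD is issued as well.
    return "RD"


store_result = required_command(dst="BANK", sources=("GRF",))
load_operand = required_command(dst="GRF", sources=("BANK", "SRF"))
control = required_command()
```

Only instructions that write to the bank yield a \ac{wr}; every other instruction is triggered by a \ac{rd}, following the assumption made for the rest of this thesis.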
\subsubsection{Instruction Ordering}
\label{sec:instruction_ordering}

\subsubsection{Memory Layout}

\subsubsection{Performance and Power Efficiency Effects}

10
src/doc.bib
@@ -446,6 +446,16 @@
file = {/home/derek/Nextcloud/Verschiedenes/Zotero/storage/3J45PFD2/Seshadri und Mutlu - 2020 - In-DRAM Bulk Bitwise Execution Engine.pdf;/home/derek/Nextcloud/Verschiedenes/Zotero/storage/DTK64DHZ/1905.html}
}

@misc{shin-haengkang2023,
title = {{{PIMSimulator}}},
author = {{Shin-haeng Kang} and {Sanghoon Cha} and {Seungwoo Seo} and {Jin-seong Kim}},
year = {2023},
month = nov,
url = {https://github.com/SAITPublic/PIMSimulator},
urldate = {2024-02-08},
abstract = {Processing-In-Memory (PIM) Simulator}
}

@misc{src2021,
title = {Decadal {{Plan}} for {{Semiconductors}}},
author = {{SRC}},
BIN
src/images/isa.pdf
Normal file
Binary file not shown.
BIN
src/images/pcu.pdf
Normal file
Binary file not shown.
@@ -22,6 +22,8 @@
\usepackage{pgfplots}
\usepackage{bytefield}
\usepackage{mathdots}
\usepackage{tabularray}
\usepackage{makecell}

% Configurations
\usetikzlibrary{matrix}