Newton
@@ -75,7 +75,6 @@ Because banks can be controlled independently, one bank can be outputting the ne
|
||||
% \draw [decorate,decoration={brace,mirror}] (0,0) -- (1,0);
|
||||
% \end{tikzpicture}
|
||||
|
||||
\definecolor{verylightgray}{gray}{0.85}
|
||||
\begin{bytefield}[bitwidth=4mm,bitheight=5mm]{32}
|
||||
\bitheader[endianness=big]{0,2,3,12,13,16,17,31} \\
|
||||
\bitbox{15}{Row}
|
||||
|
||||
@@ -34,6 +34,9 @@ Such an operation, defined in the widely used \ac{blas} library \cite{blas1979},
|
||||
Because each matrix element is used exactly once in the calculation of the output vector, there is no data reuse of the matrix.
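To make the missing data reuse explicit, the operation can be written out element-wise. The dimensions $m$ and $n$ and the index names are symbols assumed here for illustration; $w$, $i$ and $o$ denote the matrix, the input vector and the output vector:
\[
    o_{j} = \sum_{k=0}^{n-1} w_{j,k} \cdot i_{k}, \qquad j = 0, \dots, m-1
\]
Each weight $w_{j,k}$ appears in exactly one product, so the $m \cdot n$ multiply-accumulate steps require $m \cdot n$ distinct matrix elements, whereas each input element $i_{k}$ is reused $m$ times.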
|
||||
Further, as the weight matrices tend to be too large to fit on the on-chip cache, such a \ac{gemv} operation is deeply memory-bound \cite{he2020}.
|
||||
As a result, such an operation is a good fit for \ac{pim}.
|
||||
In contrast, a \acs{gemm} \ac{blas} routine, i.e., the multiplication of two matrices, is not such a good candidate for \ac{pim} for two reasons.
|
||||
Firstly, \ac{gemm} sees significant data reuse of both matrices as they are repeatedly accessed column-wise or row-wise, rendering the on-chip cache more efficient.
|
||||
Secondly, \ac{pim} comes with the further limitation that it can only accelerate two-input-one-output operations, where one operand is significantly larger than the other, as the computation of \ac{pim} can only be close to one of the operands, resulting in extensive data movement of the other operand \cite{he2020}.
|
||||
|
||||
\subsection{PIM Architectures}
|
||||
\label{sec:pim_architectures}
|
||||
@@ -51,7 +54,9 @@ In essence, these placements of the approaches can be summarized as follows \cit
|
||||
|
||||
Each of these approaches comes with different advantages and disadvantages.
|
||||
In short, the closer the processing is to the memory \acs{subarray}, the higher the energy efficiency and the achievable processing bandwidth.
|
||||
On the other hand, the integration of the \ac{pim} units becomes more difficult as area and power constraints limit the integration \cite{sudarshan2022}.
|
||||
Only when the compute units are placed within the bank region can the full bank parallelism be used to retrieve and process data concurrently.
|
||||
Outside the bank region, the data retrieval is limited by the narrow memory bus.
|
||||
On the other hand, the integration of the \ac{pim} units inside the bank becomes more difficult as area and power constraints limit the integration \cite{sudarshan2022}.
|
||||
|
||||
Processing inside the \ac{subarray} has the highest achievable level of parallelism, with the number of operand bits equal to the size of the row.
|
||||
It also requires the least amount of energy to load the data from the \acs{subarray} into the \acp{psa} to perform operations on it.
|
||||
@@ -78,7 +83,7 @@ In the following, three \ac{pim} approaches that place the compute units at the
|
||||
The first publicly available real-world \ac{pim} architecture has been designed and built by the company UPMEM \cite{gomez-luna2022}.
|
||||
UPMEM combines regular DDR4 \ac{dimm} based \ac{dram} with a set of \ac{pim}-enabled UPMEM \acp{dimm} consisting of several \ac{pim} chips.
|
||||
In each \ac{pim} chip, there are 8 \acp{dpu}, each of which has exclusive access to a $\qty{64}{\mega\byte}$ memory bank, a $\qty{24}{\kilo\byte}$ instruction memory and a $\qty{64}{\kilo\byte}$ scratchpad memory.
|
||||
The host processor can access the memory banks to copy input data from main memory and retrieve results.
|
||||
The host processor can access the \ac{dpu} memory banks to copy input data from main memory and retrieve results.
|
||||
While copying, the data layout must be changed to store the data words contiguously in a \ac{pim} bank, in contrast to the horizontal \ac{dram} mapping used in \acp{dimm}, where a data word is split across multiple devices.
|
||||
UPMEM provides a \ac{sdk} that orchestrates the data movement from the main memory to the \ac{pim} banks and modifies the data layout.
|
||||
|
||||
@@ -86,12 +91,36 @@ Each \ac{dpu} is a multithreaded $\qty{32}{bit}$ \ac{risc} core with a full set
|
||||
The \acp{dpu} execute compiled C code using a specialized compiler toolchain that provides limited support of the standard library.
|
||||
With a system clock of $\qty{400}{\mega\hertz}$, the internal bandwidth of a \ac{dpu} amounts to $\qty[per-mode = symbol]{800}{\mega\byte\per\second}$.
|
||||
A system can integrate 128 \acp{dpu} per \ac{dimm}, with a total of 20 UPMEM \acp{dimm}.
|
||||
This gives a maximum \ac{pim} bandwidth of $\qty[per-mode = symbol]{2}{\tera\byte\per\second}$ \cite{gomez-luna2022}.
|
||||
This gives a maximum theoretical \ac{pim} bandwidth of $\qty[per-mode = symbol]{2}{\tera\byte\per\second}$ \cite{gomez-luna2022}.
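As a rough plausibility check of this figure, the stated numbers can be multiplied out, assuming that all \acp{dpu} stream from their banks concurrently at the full internal bandwidth:
\[
    20 \cdot 128 \cdot \qty[per-mode = symbol]{800}{\mega\byte\per\second} = \qty[per-mode = symbol]{2048000}{\mega\byte\per\second} \approx \qty[per-mode = symbol]{2}{\tera\byte\per\second}
\]
Per \ac{dpu}, $\qty[per-mode = symbol]{800}{\mega\byte\per\second}$ at $\qty{400}{\mega\hertz}$ corresponds to $\qty{2}{\byte}$ per clock cycle.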
|
||||
|
||||
\subsection{Newton AiM}
|
||||
\label{sec:pim_newton}
|
||||
|
||||
% gddr (device-based)
|
||||
In the year 2020, the major \ac{dram} manufacturer SK Hynix announced its own \ac{pim} technology in GDDR memory called Newton \cite{he2020}.
|
||||
In contrast to UPMEM, Newton integrates only small \ac{mac} units and buffers into the bank region to avoid the area and power overhead of a fully programmable processor core.
|
||||
To communicate with the processing units, Newton introduces its own \ac{dram} commands, allowing fully interleaved \ac{pim} and non-\ac{pim} traffic as no mode switching is required.
|
||||
Another advantage of this approach is that no kernel startup delay is needed to initialize the \ac{pim} operation, which would be a significant overhead for small batches of \ac{pim} operations.
|
||||
On the downside, this extension to the \ac{jedec} standard is not a drop-in solution, as the memory controller, and consequently the host processor, must be specifically adapted.
|
||||
In addition to the \ac{mac} units, Newton also introduces a shared global buffer in the \ac{io} region of the memory to broadcast the same input vector to all banks.
|
||||
The broadcast input vector is then multiplied by a matrix row by performing a column access to the \ac{dram} bank, producing $\qty{32}{\byte}$ of temporary products, i.e., 16 16-bit floating-point values.
|
||||
These temporary products are then reduced to a single output vector element by the adder tree in the bank.
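This column access can be sketched as follows, where the row index $r$, the column offset $c$ and the partial products $p_{k}$ are names assumed here for illustration and $i$ denotes the broadcast input vector:
\[
    p_{k} = w_{r,c+k} \cdot i_{c+k}, \quad k = 0, \dots, 15, \qquad o_{r} \leftarrow o_{r} + \sum_{k=0}^{15} p_{k}
\]
The 16 products of $\qty{16}{bit}$ each account for the $16 \cdot \qty{2}{\byte} = \qty{32}{\byte}$ of one column access.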
|
||||
To make full use of the output buffering, the matrix rows are interleaved in an unusually wide data layout, corresponding to the row size of the \ac{dram}.
|
||||
|
||||
\begin{figure}
|
||||
\centering
|
||||
\input{images/hynix}
|
||||
\caption[Newton memory layout for a \ac{gemv} operation]{Newton memory layout for a \ac{gemv} operation \cite{he2020}}
|
||||
\label{img:hynix}
|
||||
\end{figure}
|
||||
|
||||
As illustrated in Figure \ref{img:hynix}, a matrix row is distributed across all banks and partitioned into separate chunks, filling the complete \ac{dram} row.
|
||||
This ensures that the input vector is fully used and never refetched: all matrix rows of a corresponding chunk are multiplied by the input vector chunk before moving to the next chunk.
|
||||
If this is done repeatedly, the temporary results will be accumulated in the output vector.
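Under the assumption that the input vector of length $n$ is split into $C$ equally sized chunks (both symbols are introduced here for illustration only), this accumulation amounts to reordering the sum of each output element by chunk:
\[
    o_{j} = \sum_{c=0}^{C-1} \; \sum_{k=c \cdot n/C}^{(c+1) \cdot n/C - 1} w_{j,k} \cdot i_{k}
\]
The outer sum corresponds to the chunks processed one after another, so each input vector chunk has to be loaded only once.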
|
||||
Since all the banks are operating on the same input vector at the same time, a single Newton \ac{dram} command will perform the arithmetic operations for all the banks in the memory.
|
||||
Finally, the host reads the result latches from all banks at the same time and concatenates them to form the complete output vector.
|
||||
|
||||
Overall, Newton completes the arithmetic operations of a row in all banks in the time it takes a conventional DRAM to read a row from one bank \cite{he2020}.
|
||||
As a result, Newton promises a $\qtyrange{10}{54}{\times}$ speedup compared to a theoretical non-\ac{pim} system with infinite computation, which is completely limited by the available memory bandwidth.
|
||||
|
||||
\subsection{FIMDRAM/HBM-PIM}
|
||||
\label{sec:pim_fim}
|
||||
|
||||
@@ -35,7 +35,7 @@
|
||||
\draw[red!60,thick] (inode2.east) to (onode1.west);
|
||||
\draw[red!60,thick] (inode3.east) to (onode1.west);
|
||||
|
||||
\matrix (matrix) [matrix of nodes,left delimiter=(,right delimiter=),right of=onode2,node distance=3.5cm] {
|
||||
\matrix (matrix) [matrix of nodes,left delimiter=(,right delimiter=),right=1.5cm of onode2] {
|
||||
$w_{0,0}$ & $w_{0,1}$ & $w_{0,2}$ & $w_{0,3}$ \\
|
||||
$w_{1,0}$ & $w_{1,1}$ & $w_{1,2}$ & $w_{1,3}$ \\
|
||||
$w_{2,0}$ & $w_{2,1}$ & $w_{2,2}$ & $w_{2,3}$ \\
|
||||
@@ -43,7 +43,7 @@
|
||||
$w_{4,0}$ & $w_{4,1}$ & $w_{4,2}$ & $w_{4,3}$ \\
|
||||
};
|
||||
|
||||
\node (prod) [right of=matrix,node distance=2.6cm] {$*$};
|
||||
\node (prod) [right=4mm of matrix] {$*$};
|
||||
|
||||
\matrix (input_vector) [matrix of nodes,left delimiter=(,right delimiter=),right of=prod] {
|
||||
$i_{0}$ \\
|
||||
@@ -52,9 +52,9 @@
|
||||
$i_{3}$ \\
|
||||
};
|
||||
|
||||
\node (eq) [right of=input_vector,node distance=1.1cm] {$=$};
|
||||
\node (eq) [right=4mm of input_vector] {$=$};
|
||||
|
||||
\matrix (output_vector) [matrix of nodes,left delimiter=(,right delimiter=),right of=eq,node distance=1.1cm] {
|
||||
\matrix (output_vector) [matrix of nodes,left delimiter=(,right delimiter=),right=4mm of eq] {
|
||||
$o_{0}$ \\
|
||||
$o_{1}$ \\
|
||||
$o_{2}$ \\
|
||||
|
||||
58
src/images/hynix.tex
Normal file
@@ -0,0 +1,58 @@
|
||||
\begin{tikzpicture}
|
||||
\pgfdeclarelayer{background layer}
|
||||
\pgfdeclarelayer{foreground layer}
|
||||
\pgfsetlayers{background layer,main,foreground layer}
|
||||
|
||||
\begin{pgfonlayer}{foreground layer}
|
||||
\node[draw,outer sep=0,minimum size=3cm,fill=white] (bank0) {Bank 0};
|
||||
% \node[draw,outer sep=0,minimum width=3cm,minimum height=2mm,fill=white,anchor=north west] (matrix_row0) at (bank0.north west) {};
|
||||
\node[draw,outer sep=0,minimum width=1cm,minimum height=2mm,fill=ForestGreen!20,anchor=north west] (bank0chunk0) at (bank0.north west) {};
|
||||
\node[draw,outer sep=0,minimum width=1cm,minimum height=2mm,fill=ForestGreen!30,anchor=north west] (bank0chunk1) at (bank0chunk0.north east) {};
|
||||
\node[draw,outer sep=0,minimum width=1cm,minimum height=2mm,fill=ForestGreen!40,anchor=north west] (bank0chunk2) at (bank0chunk1.north east) {};
|
||||
% \node[draw,outer sep=0,minimum width=1cm,minimum height=2mm,fill=ForestGreen!50,anchor=north west] (bank0chunk3) at (bank0chunk2.north east) {};
|
||||
\end{pgfonlayer}
|
||||
|
||||
\node[draw,outer sep=0,minimum size=3cm,fill=white,above right=1.5mm of bank0.south west] (bank1) {};
|
||||
% \node[draw,outer sep=0,minimum width=3cm,minimum height=2mm,fill=white,anchor=north west] (matrix_row1) at (bank1.north west) {};
|
||||
\node[draw,outer sep=0,minimum width=1cm,minimum height=2mm,fill=ForestGreen!20,anchor=north west] (bank1chunk0) at (bank1.north west) {};
|
||||
\node[draw,outer sep=0,minimum width=1cm,minimum height=2mm,fill=ForestGreen!30,anchor=north west] (bank1chunk1) at (bank1chunk0.north east) {};
|
||||
\node[draw,outer sep=0,minimum width=1cm,minimum height=2mm,fill=ForestGreen!40,anchor=north west] (bank1chunk2) at (bank1chunk1.north east) {};
|
||||
% \node[draw,outer sep=0,minimum width=1cm,minimum height=2mm,fill=ForestGreen!50,anchor=north west] (bank1chunk3) at (bank1chunk2.north east) {};
|
||||
|
||||
\begin{pgfonlayer}{background layer}
|
||||
\node[draw,outer sep=0,minimum size=3cm,fill=white,above right=1.5mm of bank1.south west] (bank2) {};
|
||||
% \node[draw,outer sep=0,minimum width=3cm,minimum height=2mm,fill=white,anchor=north west] (matrix_row2) at (bank2.north west) {};
|
||||
\node[draw,outer sep=0,minimum width=1cm,minimum height=2mm,fill=ForestGreen!20,anchor=north west] (bank2chunk0) at (bank2.north west) {};
|
||||
\node[draw,outer sep=0,minimum width=1cm,minimum height=2mm,fill=ForestGreen!30,anchor=north west] (bank2chunk1) at (bank2chunk0.north east) {};
|
||||
\node[draw,outer sep=0,minimum width=1cm,minimum height=2mm,fill=ForestGreen!40,anchor=north west] (bank2chunk2) at (bank2chunk1.north east) {};
|
||||
% \node[draw,outer sep=0,minimum width=1cm,minimum height=2mm,fill=ForestGreen!50,anchor=north west] (bank2chunk3) at (bank2chunk2.north east) {};
|
||||
\end{pgfonlayer}
|
||||
|
||||
\node[above=0mm of bank2] {$\iddots$};
|
||||
|
||||
\node (prod) [right=of bank0] {$*$};
|
||||
|
||||
\node[draw,outer sep=0,minimum width=2mm,minimum height=3cm,fill=white,right=of prod] (input) {};
|
||||
\node[draw,outer sep=0,minimum width=2mm,minimum height=1cm,fill=ForestGreen!20,anchor=north] (inputchunk0) at (input.north) {};
|
||||
\node[draw,outer sep=0,minimum width=2mm,minimum height=1cm,fill=ForestGreen!30,anchor=north] (inputchunk1) at (inputchunk0.south) {};
|
||||
\node[draw,outer sep=0,minimum width=2mm,minimum height=1cm,fill=ForestGreen!40,anchor=north] (inputchunk2) at (inputchunk1.south) {};
|
||||
% \node[draw,outer sep=0,minimum width=2mm,minimum height=1cm,fill=ForestGreen!50,anchor=north] (inputchunk3) at (inputchunk2.south) {};
|
||||
|
||||
\node (equal) [right=of input] {$=$};
|
||||
|
||||
\begin{pgfonlayer}{foreground layer}
|
||||
\node[draw,outer sep=0,minimum width=2mm,minimum height=3cm,fill=white,right=of equal] (output) {};
|
||||
\node[draw,outer sep=0,minimum width=2mm,minimum height=2mm,fill=Apricot!50,anchor=north] at (output.north) {};
|
||||
\end{pgfonlayer}
|
||||
|
||||
\node[draw,outer sep=0,minimum width=2mm,minimum height=3cm,fill=white,above right=1.5mm of output.south west] (output1) {};
|
||||
\node[draw,outer sep=0,minimum width=2mm,minimum height=2mm,fill=Apricot!50,anchor=north] at (output1.north) {};
|
||||
|
||||
\begin{pgfonlayer}{background layer}
|
||||
\node[draw,outer sep=0,minimum width=2mm,minimum height=3cm,fill=white,above right=1.5mm of output1.south west] (output2) {};
|
||||
\node[draw,outer sep=0,minimum width=2mm,minimum height=2mm,fill=Apricot!50,anchor=north] at (output2.north) {};
|
||||
\end{pgfonlayer}
|
||||
|
||||
\node[above right=0mm of output2] {$\iddots$};
|
||||
|
||||
\end{tikzpicture}
|
||||
@@ -21,11 +21,13 @@
|
||||
\usepackage[square,numbers]{natbib}
|
||||
\usepackage{pgfplots}
|
||||
\usepackage{bytefield}
|
||||
\usepackage{mathdots}
|
||||
|
||||
% Configurations
|
||||
\usetikzlibrary{matrix}
|
||||
\usetikzlibrary{automata}
|
||||
\usetikzlibrary{fit}
|
||||
\usetikzlibrary{positioning}
|
||||
\setlength\textheight{24cm}
|
||||
\setkomafont{paragraph}{\footnotesize}
|
||||
\numberwithin{table}{section}
|
||||
@@ -34,6 +36,9 @@
|
||||
\numberwithin{figure}{section}
|
||||
\sisetup{group-separator = {,}, group-minimum-digits = 4}
|
||||
|
||||
% Custom colors
|
||||
\definecolor{verylightgray}{gray}{0.85}
|
||||
|
||||
% Penalties
|
||||
\clubpenalty = 10000
|
||||
\widowpenalty = 10000
|
||||
|
||||