Newton

2024-02-07 22:37:15 +01:00
parent 5f52c3ae9f
commit 607bbae8d4
5 changed files with 100 additions and 9 deletions
--- a/src/chapters/dram.tex
+++ b/src/chapters/dram.tex
@@ -75,7 +75,6 @@ Because banks can be controlled independently, one bank can be outputting the ne
 	% 	\draw [decorate,decoration={brace,mirror}] (0,0) -- (1,0);
 	% \end{tikzpicture}

-	\definecolor{verylightgray}{gray}{0.85}
 	\begin{bytefield}[bitwidth=4mm,bitheight=5mm]{32}
 		\bitheader[endianness=big]{0,2,3,12,13,16,17,31} \\
 		\bitbox{15}{Row}
--- a/src/chapters/pim.tex
+++ b/src/chapters/pim.tex
@@ -34,6 +34,9 @@ Such an operation, defined in the widely used \ac{blas} library \cite{blas1979},
 Because each matrix element is used exactly once in the calculation of the output vector, there is no data reuse of the matrix.
 Further, as the weight matrices tend to be too large to fit on the on-chip cache, such a \ac{gemv} operation is deeply memory-bound \cite{he2020}.
 As a result, such an operation is a good fit for \ac{pim}.
+In contrast, a \acs{gemm} \ac{blas} routine, i.e., the multiplication of two matrices, is not such a good candidate for \ac{pim} for two reasons.
+Firstly, \ac{gemm} sees significant data reuse of both matrices as they are repeatedly accessed column-wise or row-wise, rendering the on-chip cache more efficient.
+Secondly, \ac{pim} comes with the further limitation that it can only accelerate two-input-one-output operations in which one operand is significantly larger than the other: because the computation can only take place close to one of the operands, the other operand has to be moved extensively \cite{he2020}.

 \subsection{PIM Architectures}
 \label{sec:pim_architectures}
@@ -51,7 +54,9 @@ In essence, these placements of the approaches can be summarized as follows \cit

 Each of these approaches comes with different advantages and disadvantages.
 In short, the closer the processing is to the memory \acs{subarray}, the higher the energy efficiency and the achievable processing bandwidth.
-On the other hand, the integration of the \ac{pim} units becomes more difficult as area and power constraints limit the integration \cite{sudarshan2022}.
+Only when the compute units are placed within the bank region can the full bank parallelism be used to retrieve and process data concurrently.
+Outside the bank region, the data retrieval is limited by the narrow memory bus.
+On the other hand, the integration of the \ac{pim} units inside the bank becomes more difficult as area and power constraints limit the integration \cite{sudarshan2022}.

 Processing inside the \ac{subarray} has the highest achievable level of parallelism, with the number of operand bits equal to the size of the row.
 It also requires the least amount of energy to load the data from the \acs{subarray} into the \acp{psa} to perform operations on it.
@@ -78,7 +83,7 @@ In the following, three \ac{pim} approaches that place the compute units at the
 The first publicly available real-world \ac{pim} architecture has been designed and built by the company UPMEM \cite{gomez-luna2022}.
 UPMEM combines regular DDR4 \ac{dimm} based \ac{dram} with a set of \ac{pim}-enabled UPMEM \acp{dimm} consisting of several \ac{pim} chips.
 In each \ac{pim} chip, there are 8 \acp{dpu}, each of which has exclusive access to a $\qty{64}{\mega\byte}$ memory bank, a $\qty{24}{\kilo\byte}$ instruction memory and a $\qty{64}{\kilo\byte}$ scratchpad memory.
-The host processor can access the memory banks to copy input data from main memory and retrieve results.
+The host processor can access the \ac{dpu} memory banks to copy input data from main memory and retrieve results.
 While copying, the data layout must be changed to store the data words contiguously in a \ac{pim} bank, in contrast to the horizontal \ac{dram} mapping used in \acp{dimm}, where a data word is split across multiple devices.
 UPMEM provides a \ac{sdk} that orchestrates the data movement from the main memory to the \ac{pim} banks and modifies the data layout.

@@ -86,12 +91,36 @@ Each \ac{dpu} is a multithreaded $\qty{32}{bit}$ \ac{risc} core with a full set
 The \acp{dpu} execute compiled C code using a specialized compiler toolchain that provides limited support of the standard library.
 With a system clock of $\qty{400}{\mega\hertz}$, the internal bandwidth of a \ac{dpu} amounts to $\qty[per-mode = symbol]{800}{\mega\byte\per\second}$.
 A system can integrate 128 \acp{dpu} per \ac{dimm}, with a total of 20 UPMEM \acp{dimm}.
-This gives a maximum \ac{pim} bandwidth of $\qty[per-mode = symbol]{2}{\tera\byte\per\second}$ \cite{gomez-luna2022}.
+This gives a maximum theoretical \ac{pim} bandwidth of $\qty[per-mode = symbol]{2}{\tera\byte\per\second}$ \cite{gomez-luna2022}.
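+As a rough plausibility check of this figure, assuming that every \ac{dpu} sustains its peak internal bandwidth at the same time, the aggregate bandwidth works out to
+\[
+	128 \times 20 \times \qty[per-mode = symbol]{800}{\mega\byte\per\second} = \qty[per-mode = symbol]{2.048}{\tera\byte\per\second} \approx \qty[per-mode = symbol]{2}{\tera\byte\per\second}.
+\]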

 \subsection{Newton AiM}
 \label{sec:pim_newton}

-% gddr (device-based)
+In 2020, the major \ac{dram} manufacturer SK Hynix announced its own \ac{pim} technology for GDDR memory, called Newton \cite{he2020}.
+In contrast to UPMEM, Newton integrates only small \ac{mac} units and buffers into the bank region to avoid the area and power overhead of a fully programmable processor core.
+To communicate with the processing units, Newton introduces its own \ac{dram} commands, allowing fully interleaved \ac{pim} and non-\ac{pim} traffic as no mode switching is required.
+Another advantage of this approach is that no kernel startup delay is needed to initialize the \ac{pim} operation, which would otherwise be a significant overhead for small batches of \ac{pim} operations.
+On the downside, this extension to the \ac{jedec} standard is not a drop-in solution, as the memory controller, and consequently the host processor, must be specifically adapted.
+In addition to the \ac{mac} units, Newton also introduces a shared global buffer in the \ac{io} region of the memory to broadcast the same input vector to all banks.
+The broadcast input vector is then multiplied by a matrix row by performing a column access to the \ac{dram} bank, producing $\qty{32}{\byte}$ wide temporary products consisting of 16 16-bit floating-point values.
+These temporary products are then reduced to a single output vector element by the adder tree in the bank.
+To make full use of the output buffering, the matrix rows are interleaved in an unusually wide data layout, corresponding to the row size of the \ac{dram}.
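+The multiply-and-reduce step described above can be sketched as follows, with $w_{r,j}$ and $i_j$ as assumed names for the 16 matrix and input vector elements covered by one column access and $s_r$ as an assumed name for the resulting partial sum:
+\[
+	s_r = \sum_{j=0}^{15} w_{r,j} \cdot i_j ,
+\]
+where the width of one column access follows from $16 \times \qty{16}{bit} = \qty{256}{bit} = \qty{32}{\byte}$.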
+
+\begin{figure}
+	\centering
+	\input{images/hynix}
+	\caption[Newton memory layout for a \ac{gemv} operation]{Newton memory layout for a \ac{gemv} operation \cite{he2020}}
+	\label{img:hynix}
+\end{figure}
+
+As illustrated in Figure~\ref{img:hynix}, a matrix row is distributed across all banks and partitioned into separate chunks, filling the complete \ac{dram} row.
+This ensures that the input vector is fully used and never refetched: all matrix rows of a corresponding chunk are multiplied by the input vector chunk before moving on to the next chunk.
+If this is done repeatedly, the temporary results will be accumulated in the output vector.
+Since all the banks are operating on the same input vector at the same time, a single Newton \ac{dram} command will perform the arithmetic operations for all the banks in the memory.
+Finally, the host reads the result latches from all banks at the same time and concatenates them to form the complete output vector.
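+With $w_{r,j}$, $i_j$ and $o_r$ as assumed names for the matrix, input vector and output vector elements, and $C_k$ as a name introduced here only for illustration to denote the column indices belonging to chunk $k$, this chunk-wise accumulation can be sketched as
+\[
+	o_r = \sum_{k} \sum_{j \in C_k} w_{r,j} \cdot i_j ,
+\]
+where the inner sum is the contribution of a single chunk and the outer sum corresponds to the repeated accumulation into the output vector.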
+
+Overall, Newton completes the arithmetic operations of a row in all banks in the time it takes a conventional \ac{dram} to read a row from one bank \cite{he2020}.
+As a result, Newton promises a $\qtyrange{10}{54}{\times}$ speedup compared to a theoretical non-\ac{pim} system with infinite computation, which is completely limited by the available memory bandwidth.

 \subsection{FIMDRAM/HBM-PIM}
 \label{sec:pim_fim}
--- a/src/images/dnn.tex
+++ b/src/images/dnn.tex
@@ -35,7 +35,7 @@
 	\draw[red!60,thick] (inode2.east) to (onode1.west);
 	\draw[red!60,thick] (inode3.east) to (onode1.west);

-	\matrix (matrix) [matrix of nodes,left delimiter=(,right delimiter=),right of=onode2,node distance=3.5cm] {
+	\matrix (matrix) [matrix of nodes,left delimiter=(,right delimiter=),right=1.5cm of onode2] {
 		$w_{0,0}$ & $w_{0,1}$ & $w_{0,2}$ & $w_{0,3}$ \\
 		$w_{1,0}$ & $w_{1,1}$ & $w_{1,2}$ & $w_{1,3}$ \\
 		$w_{2,0}$ & $w_{2,1}$ & $w_{2,2}$ & $w_{2,3}$ \\
@@ -43,7 +43,7 @@
 		$w_{4,0}$ & $w_{4,1}$ & $w_{4,2}$ & $w_{4,3}$ \\
 	};

-	\node (prod) [right of=matrix,node distance=2.6cm] {$*$};
+	\node (prod) [right=4mm of matrix] {$*$};

 	\matrix (input_vector) [matrix of nodes,left delimiter=(,right delimiter=),right of=prod] {
 		$i_{0}$ \\
@@ -52,9 +52,9 @@
 		$i_{3}$ \\
 	};

-	\node (eq) [right of=input_vector,node distance=1.1cm] {$=$};
+	\node (eq) [right=4mm of input_vector] {$=$};

-	\matrix (output_vector) [matrix of nodes,left delimiter=(,right delimiter=),right of=eq,node distance=1.1cm] {
+	\matrix (output_vector) [matrix of nodes,left delimiter=(,right delimiter=),right=4mm of eq] {
 		$o_{0}$ \\
 		$o_{1}$ \\
 		$o_{2}$ \\
--- a/src/images/hynix.tex
+++ b/src/images/hynix.tex
@@ -0,0 +1,58 @@
+\begin{tikzpicture}
+\pgfdeclarelayer{background layer}
+\pgfdeclarelayer{foreground layer}
+\pgfsetlayers{background layer,main,foreground layer}
+
+\begin{pgfonlayer}{foreground layer}
+\node[draw,outer sep=0,minimum size=3cm,fill=white] (bank0) {Bank 0};
+% \node[draw,outer sep=0,minimum width=3cm,minimum height=2mm,fill=white,anchor=north west] (matrix_row0) at (bank0.north west) {};
+\node[draw,outer sep=0,minimum width=1cm,minimum height=2mm,fill=ForestGreen!20,anchor=north west] (bank0chunk0) at (bank0.north west) {};
+\node[draw,outer sep=0,minimum width=1cm,minimum height=2mm,fill=ForestGreen!30,anchor=north west] (bank0chunk1) at (bank0chunk0.north east) {};
+\node[draw,outer sep=0,minimum width=1cm,minimum height=2mm,fill=ForestGreen!40,anchor=north west] (bank0chunk2) at (bank0chunk1.north east) {};
+% \node[draw,outer sep=0,minimum width=1cm,minimum height=2mm,fill=ForestGreen!50,anchor=north west] (bank0chunk3) at (bank0chunk2.north east) {};
+\end{pgfonlayer}
+
+\node[draw,outer sep=0,minimum size=3cm,fill=white,above right=1.5mm of bank0.south west] (bank1) {};
+% \node[draw,outer sep=0,minimum width=3cm,minimum height=2mm,fill=white,anchor=north west] (matrix_row1) at (bank1.north west) {};
+\node[draw,outer sep=0,minimum width=1cm,minimum height=2mm,fill=ForestGreen!20,anchor=north west] (bank1chunk0) at (bank1.north west) {};
+\node[draw,outer sep=0,minimum width=1cm,minimum height=2mm,fill=ForestGreen!30,anchor=north west] (bank1chunk1) at (bank1chunk0.north east) {};
+\node[draw,outer sep=0,minimum width=1cm,minimum height=2mm,fill=ForestGreen!40,anchor=north west] (bank1chunk2) at (bank1chunk1.north east) {};
+% \node[draw,outer sep=0,minimum width=1cm,minimum height=2mm,fill=ForestGreen!50,anchor=north west] (bank1chunk3) at (bank1chunk2.north east) {};
+
+\begin{pgfonlayer}{background layer}
+\node[draw,outer sep=0,minimum size=3cm,fill=white,above right=1.5mm of bank1.south west] (bank2) {};
+% \node[draw,outer sep=0,minimum width=3cm,minimum height=2mm,fill=white,anchor=north west] (matrix_row2) at (bank2.north west) {};
+\node[draw,outer sep=0,minimum width=1cm,minimum height=2mm,fill=ForestGreen!20,anchor=north west] (bank2chunk0) at (bank2.north west) {};
+\node[draw,outer sep=0,minimum width=1cm,minimum height=2mm,fill=ForestGreen!30,anchor=north west] (bank2chunk1) at (bank2chunk0.north east) {};
+\node[draw,outer sep=0,minimum width=1cm,minimum height=2mm,fill=ForestGreen!40,anchor=north west] (bank2chunk2) at (bank2chunk1.north east) {};
+% \node[draw,outer sep=0,minimum width=1cm,minimum height=2mm,fill=ForestGreen!50,anchor=north west] (bank2chunk3) at (bank2chunk2.north east) {};
+\end{pgfonlayer}
+
+\node[above=0mm of bank2] {$\iddots$};
+
+\node (prod) [right=of bank0] {$*$};
+
+\node[draw,outer sep=0,minimum width=2mm,minimum height=3cm,fill=white,right=of prod] (input) {};
+\node[draw,outer sep=0,minimum width=2mm,minimum height=1cm,fill=ForestGreen!20,anchor=north] (inputchunk0) at (input.north) {};
+\node[draw,outer sep=0,minimum width=2mm,minimum height=1cm,fill=ForestGreen!30,anchor=north] (inputchunk1) at (inputchunk0.south) {};
+\node[draw,outer sep=0,minimum width=2mm,minimum height=1cm,fill=ForestGreen!40,anchor=north] (inputchunk2) at (inputchunk1.south) {};
+% \node[draw,outer sep=0,minimum width=2mm,minimum height=1cm,fill=ForestGreen!50,anchor=north] (inputchunk3) at (inputchunk2.south) {};
+
+\node (equal) [right=of input] {$=$};
+
+\begin{pgfonlayer}{foreground layer}
+\node[draw,outer sep=0,minimum width=2mm,minimum height=3cm,fill=white,right=of equal] (output) {};
+\node[draw,outer sep=0,minimum width=2mm,minimum height=2mm,fill=Apricot!50,anchor=north] at (output.north) {};
+\end{pgfonlayer}
+
+\node[draw,outer sep=0,minimum width=2mm,minimum height=3cm,fill=white,above right=1.5mm of output.south west] (output1) {};
+\node[draw,outer sep=0,minimum width=2mm,minimum height=2mm,fill=Apricot!50,anchor=north] at (output1.north) {};
+
+\begin{pgfonlayer}{background layer}
+\node[draw,outer sep=0,minimum width=2mm,minimum height=3cm,fill=white,above right=1.5mm of output1.south west] (output2) {};
+\node[draw,outer sep=0,minimum width=2mm,minimum height=2mm,fill=Apricot!50,anchor=north] at (output2.north) {};
+\end{pgfonlayer}
+
+\node[above right=0mm of output2] {$\iddots$};
+
+\end{tikzpicture}
--- a/src/index.tex
+++ b/src/index.tex
@@ -21,11 +21,13 @@
 \usepackage[square,numbers]{natbib}
 \usepackage{pgfplots}
 \usepackage{bytefield}
+\usepackage{mathdots}

 % Configurations
 \usetikzlibrary{matrix}
 \usetikzlibrary{automata}
 \usetikzlibrary{fit}
+\usetikzlibrary{positioning}
 \setlength\textheight{24cm}
 \setkomafont{paragraph}{\footnotesize}
 \numberwithin{table}{section}
@@ -34,6 +36,9 @@
 \numberwithin{figure}{section}
 \sisetup{group-separator = {,}, group-minimum-digits = 4}

+% Custom colors
+\definecolor{verylightgray}{gray}{0.85}
+
 % Penalties
 \clubpenalty = 10000
 \widowpenalty = 10000