Overall, Newton completes the arithmetic operations of a row in all banks in the time it takes a conventional DRAM to read a row from one bank \cite{he2020}.
As a result, Newton promises a $\qtyrange{10}{54}{\times}$ speedup compared to a theoretical non-\ac{pim} system with infinite computation, which is completely limited by the available memory bandwidth.
\subsection{\Acf{fimdram}}
\label{sec:pim_fim}
One year after SK Hynix, the major \ac{dram} manufacturer Samsung announced its own \ac{pim} \ac{dram} implementation, called \acf{fimdram}.
As this is the \ac{pim} architecture which was implemented during the work on this thesis, it will be explained in great detail.
The following subsections are mainly based on \cite{lee2021} and \cite{kwon2021}, with \cref{sec:memory_layout} mainly based on \cite{kang2022}.
\subsubsection{Architecture}
As the name of \aca{fimdram} suggests, it is based on the \aca{hbm} memory standard, and it integrates 16-wide \ac{simd} engines directly into the memory banks, exploiting bank-level parallelism, while retaining the highly optimized \acp{subarray} \cite{kwon2021}.
A major difference from Newton \ac{pim} is that \aca{fimdram} does not require any changes to components of modern processors, such as the memory controller, i.e. it is agnostic to existing \aca{hbm} platforms.
Consequently, mode switching is required for \aca{fimdram}, making it less useful for interleaved \ac{pim} and non-\ac{pim} traffic.
Fortunately, as discussed in \cref{sec:hbm}, the architecture of \ac{hbm} allows for many independent memory channels on a single stack, making it possible to cleanly separate the memory map into a \ac{pim}-enabled region and a normal \ac{hbm} region.
At the heart of the \aca{fimdram} are the \ac{pim} execution units, which are shared by two banks of a \ac{pch}.
They include 16 16-bit wide \ac{simd} \acp{fpu}, \acp{crf}, \acp{grf} and \acp{srf} \cite{lee2021}.
This general architecture is shown in detail in \cref{img:fimdram}, with (a) the placement of the \ac{pim} units between the memory banks of a \ac{dram} die, (b) a bank coupled to its \ac{pim} unit, and (c) the data path in and around an \ac{fpu} within the \ac{pim} unit.
\begin{figure}
\centering
\includegraphics[width=\linewidth]{images/fimdram}
\caption[Architecture of \aca{fimdram}]{Architecture of \aca{fimdram} \cite{lee2021}.}
\label{img:fimdram}
\end{figure}
As can be seen in (c), the input data to the \ac{fpu} can come either directly from the memory bank, from a \ac{grf}/\ac{srf}, or from the result bus of a previous computation.
The 16-wide \ac{simd} units correspond to the 256-bit prefetch architecture of \aca{hbm}, where 16 16-bit floating-point operands are passed directly from the \acp{ssa} to the \acp{fpu} from a single memory access.
As all \ac{pim} units operate in parallel, with 16 banks per \ac{pch}, a single memory access loads a total of $\qty{256}{\bit} \cdot \qty{16}{banks} = \qty{4096}{\bit}$ into the \acp{fpu}.
As a result, the theoretical internal bandwidth of \aca{fimdram} is $\qty{16}{\times}$ higher than that of the external bus to the host processor.
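A short back-of-the-envelope sketch illustrates this arithmetic; the prefetch width and bank count are taken from the text above, and only the per-access ratio is computed:

```python
# Back-of-the-envelope check of the internal bandwidth claim
# (256-bit prefetch per bank, 16 banks per pCH, as stated above).
PREFETCH_BITS = 256   # bits delivered by one bank per memory access
BANKS_PER_PCH = 16    # banks whose PIM units operate in parallel

internal_bits_per_access = PREFETCH_BITS * BANKS_PER_PCH
external_bits_per_access = PREFETCH_BITS  # the host only sees one burst

print(internal_bits_per_access)                              # 4096
print(internal_bits_per_access // external_bits_per_access)  # 16
```

The same ratio carries over to sustained bandwidth, since every access fires all banks internally while only one burst crosses the external bus.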
\ac{fimdram} defines three operating modes:
\begin{enumerate}
\item \textbf{Single Bank Mode}:
This is the default operating mode, where \aca{fimdram} has identical behavior to normal \aca{hbm} memory.
To switch to another mode, a specific sequence of \ac{act} and \ac{pre} commands must be sent by the memory controller to a specific row address.
\item \textbf{All-Bank Mode}:
The all-bank mode is an extension of the single bank mode in which the \ac{pim} execution units allow concurrent access to half of the \ac{dram} banks.
\end{enumerate}
\subsubsection{Processing Unit}
Due to the focus on \ac{dnn} applications in \aca{fimdram}, the native data type for the \acp{fpu} is \ac{fp16}, which is motivated by the significantly lower area and power requirements for \acp{fpu} compared to \ac{fp32}.
In addition, \ac{fp16} is well supported on modern processor architectures such as ARMv8.
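As an illustrative aside (not part of the \aca{fimdram} software stack), the rounding behavior of \ac{fp16} can be observed with Python's standard library, whose \texttt{struct} module supports the IEEE 754 half-precision format:

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float through IEEE 754 half precision (FP16)."""
    # Format character "e" packs/unpacks a 16-bit half-precision float.
    return struct.unpack("e", struct.pack("e", x))[0]

# FP16 has only a 10-bit mantissa, so many decimal values are rounded:
print(to_fp16(0.1))                        # ~0.0999755859375, not exactly 0.1
print(to_fp16(2048.0) == to_fp16(2049.0))  # True: integer gaps grow past 2048
```

This limited precision is the price paid for the roughly halved storage and datapath width compared to \ac{fp32}.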
The \ac{simd} \ac{fpu} is implemented once as an \ac{fp16} multiplier unit and once as an \ac{fp16} adder unit, providing support for these basic arithmetic operations.
In addition to the \acp{fpu}, a processing unit also contains \acp{crf}, \acp{srf} and \acp{grf}.
This processing unit architecture is illustrated in \cref{img:pcu}.

\begin{figure}
\centering
\includegraphics[width=\linewidth]{images/pcu}
\caption[Processing unit of \aca{fimdram}]{Processing unit of \aca{fimdram} \cite{lee2021}.}
\label{img:pcu}
\end{figure}
In contrast to SK Hynix's Newton architecture, \aca{fimdram} requires both mode switching and the loading of a microkernel into the processing units before a workload can be executed.
This makes \aca{fimdram} less effective for very small workloads, as the overhead of the mode switching and initialization is significant.
\subsubsection{Instruction Set}
The \aca{fimdram} processing units provide a total of nine 32-bit \ac{risc} instructions, each of which falls into one of three groups: control flow instructions, arithmetic instructions and data movement instructions.
The data layout of these three instruction groups is shown in \cref{tab:isa}.
\begin{table}
\end{table}

This memory layout is explained in detail in \cref{sec:memory_layout}.
\subsubsection{Programming Model}
The software stack of \aca{fimdram} is split into three main parts.
Firstly, a \ac{pim} device driver is responsible for allocating buffers in \ac{hbm} memory and setting these regions as uncacheable.
It does this because the on-chip cache would add unwanted filtering between the host processor's \ac{ld} and \ac{st} instructions and the generation of memory accesses by the memory controller.
Alternatively, it would be possible to control cache behavior by issuing flush and invalidate instructions, but this would introduce overhead, as a flush would have to be issued after every single \ac{pim} instruction in the microkernel.
Secondly, a \ac{pim} acceleration library implements a set of \ac{blas} operations and manages the generation, loading and execution of the microkernel on behalf of the user.
At the highest level, \aca{fimdram} provides an extension to the \ac{tf} framework that allows either calling the special \ac{pim} operations implemented by the acceleration library directly on the source operands, or automatically finding suitable routines that can be accelerated by \ac{pim} during normal \ac{tf} operation.
The software stack is able to concurrently exploit the independent parallelism of \acp{pch} for a \ac{mac} operation as described in \cref{sec:instruction_ordering}.
Since \aca{hbm} memory is mainly used in conjunction with \acp{gpu}, which do not implement sophisticated out-of-order execution, it is necessary to spawn a number of software threads to execute the eight memory accesses simultaneously.
To make use of all eight \ac{grf}-A registers, the input address has to increment linearly, resulting in a row-major matrix layout.
In a row-major matrix layout, the entries of a row are stored sequentially before switching to the next row, following the \ac{c}-like array notation \texttt{MATRIX[R][C]}.
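A minimal sketch makes the row-major convention concrete; the dimensions are chosen arbitrarily for illustration:

```python
# Row-major order: MATRIX[r][c] is stored at flat offset r*C + c,
# so all entries of row r are sequential before row r+1 begins.
R, C = 2, 4  # illustrative dimensions only

def row_major_offset(r: int, c: int) -> int:
    """Flat element index of MATRIX[r][c] in row-major order."""
    return r * C + c

offsets = [row_major_offset(r, c) for r in range(R) for c in range(C)]
print(offsets)  # [0, 1, 2, 3, 4, 5, 6, 7]: each row is contiguous
```

Because the flat offsets increase linearly along a row, a linearly incrementing input address walks through one matrix row at a time, exactly as the \ac{grf}-A usage above requires.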
The \aca{fimdram} architecture imposes certain dimensional constraints on the weight matrix and the input vector.
As all eight processing units in a \ac{pch} operate at the same time, the number of rows must be a multiple of eight to make use of the full processing bandwidth.
These matrix row blocks may span multiple \ac{dram} rows or even other \acp{pch}.
Furthermore, the number of columns must be chosen such that, after exactly one matrix row, the next bank in the \ac{pch} is addressed, so that all processing units operate on eight different rows, stored in eight different banks, at the same time.
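These constraints can be summarized as a small validity check; note that the byte size of one matrix row per bank is a hypothetical placeholder here, as the real value depends on the address mapping of the device:

```python
# Sketch of the dimensional constraints described above. FP16 operands
# are 2 bytes wide; ROW_BYTES_PER_BANK is an assumed interleave size.
FP16_BYTES = 2
UNITS_PER_PCH = 8           # processing units running in lock-step
ROW_BYTES_PER_BANK = 2048   # assumption: bytes of one matrix row per bank

def dims_valid(rows: int, cols: int) -> bool:
    rows_ok = rows % UNITS_PER_PCH == 0                 # full processing bandwidth
    cols_ok = cols * FP16_BYTES == ROW_BYTES_PER_BANK   # next bank after one row
    return rows_ok and cols_ok

print(dims_valid(4096, 1024))  # True under these assumptions
print(dims_valid(4100, 1024))  # False: rows not a multiple of eight
```

Matrices that violate these constraints have to be padded or tiled by software before they can be processed at full bandwidth.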
\begin{equation}
psum[i,0:15]=\sum_{j=0}^{7}(a[j*16:j*16+15]*w[i,j*16:j*16+15])
\end{equation}
The partial sum vector $psum[0:7,0:15]$ must then be reduced by the host processor to obtain the final output vector $b[0:7]$.
This reduction step is mandatory because there is no means in the \aca{fimdram} architecture to reduce the output sums of the 16-wide \ac{simd} \acp{fpu}.
In contrast, SK Hynix's Newton implements adder trees in the \ac{pim} units to reduce the partial sums directly in memory.
The operation of this concrete \ac{gemv} microkernel is illustrated in \cref{img:memory_layout}.
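The interplay of the in-memory partial sums and the host-side reduction can be summarized with a purely functional sketch, using plain Python floats instead of \ac{fp16} and ignoring all timing and banking details:

```python
import random

LANES = 16    # width of the SIMD FPU
BLOCKS = 8    # GRF-A register blocks per matrix row
ROWS = 8      # one matrix row per processing unit/bank
COLS = LANES * BLOCKS

random.seed(42)
a = [random.uniform(-1, 1) for _ in range(COLS)]                  # input vector
w = [[random.uniform(-1, 1) for _ in range(COLS)] for _ in range(ROWS)]

# In-memory phase: every unit i accumulates a 16-wide partial sum.
psum = [[0.0] * LANES for _ in range(ROWS)]
for i in range(ROWS):
    for j in range(BLOCKS):
        for k in range(LANES):
            psum[i][k] += a[j * LANES + k] * w[i][j * LANES + k]

# Host-side phase: collapse each 16-wide partial sum to one scalar,
# since there is no in-memory adder tree for this reduction.
b = [sum(row) for row in psum]

# Cross-check against a direct host-side GEMV.
ref = [sum(a[c] * w[i][c] for c in range(COLS)) for i in range(ROWS)]
print(all(abs(x - y) < 1e-9 for x, y in zip(b, ref)))  # True
```

The cross-check confirms that the host-side reduction merely re-associates the dot products; the arithmetic itself is entirely carried out by the 16-wide units.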
\subsubsection{Performance and Power Efficiency Achievements}
In addition to the theoretical bandwidth of $\qty[per-mode=symbol]{128}{\giga\byte\per\second}$ per \ac{pch}, or a total of $\qty[per-mode=symbol]{2}{\tera\byte\per\second}$ for 16 \acp{pch}, provided to the \ac{pim} units, Samsung also ran experiments on a real implementation of \aca{fimdram} to analyze its performance gains and power efficiency improvements.
This real system is based on a Xilinx Zynq UltraScale+ \ac{fpga} that lies on the same silicon interposer as four \aca{hbm} stacks, each comprising one buffer die, four \aca{fimdram} dies and four normal \aca{hbm} dies \cite{lee2021}.
Results promise performance gains in the range of $\qtyrange{1.4}{11.2}{\times}$ in the tested microbenchmarks, with the highest gain of $\qty{11.2}{\times}$ for a \ac{gemv} kernel.
Real layers of \acp{dnn} achieved a performance gain in the range of $\qtyrange{1.4}{3.5}{\times}$.
The power consumption of the \aca{fimdram} dies themselves is $\qty{5.4}{\percent}$ higher than that of regular \aca{hbm}.
However, the increased processing bandwidth and the reduced power consumption on the global \ac{io}-bus led to an $\qty{8.25}{\percent}$ higher energy efficiency for a \ac{gemv} kernel, and a $\qtyrange{1.38}{3.2}{\times}$ higher efficiency for real \ac{dnn} layers.
In conclusion, \aca{fimdram} is one of the few real \ac{pim} implementations by hardware vendors at this time and promises significant performance gains and higher power efficiency compared to regular \aca{hbm} \ac{dram}.
The following \cref{sec:vp} introduces the concept of virtual prototyping, which forms the basis for the subsequent implementation of the \aca{fimdram} model in a simulator.