Overall, Newton completes the arithmetic operations of a row in all banks in the time it takes a conventional DRAM to read a row from one bank \cite{he2020}.
As a result, Newton promises a $\qtyrange{10}{54}{\times}$ speedup compared to a theoretical non-\ac{pim} system with infinite computation, which is completely limited by the available memory bandwidth.
\subsection{\Acf{fimdram}}
\label{sec:pim_fim}
One year after SK Hynix, the major \ac{dram} manufacturer Samsung announced its own \ac{pim} \ac{dram} implementation, called \acf{fimdram}.
As this is the \ac{pim} architecture which was implemented during the work on this thesis, it will be explained in great detail.
The following subsections are mainly based on \cite{lee2021} and \cite{kwon2021}, with \cref{sec:memory_layout} mainly based on \cite{kang2022}.
\subsubsection{Architecture}
As the name of \aca{fimdram} suggests, it is based on the \aca{hbm} memory standard, and it integrates 16-wide \ac{simd} engines directly into the memory banks, exploiting bank-level parallelism, while retaining the highly optimized \acp{subarray} \cite{kwon2021}.
A major difference from Newton \ac{pim} is that \aca{fimdram} does not require any changes to components of modern processors, such as the memory controller, i.e. it is agnostic to existing \aca{hbm} platforms.
Consequently, mode switching is required for \aca{fimdram}, making it less useful for interleaved \ac{pim} and non-\ac{pim} traffic.
Fortunately, as discussed in \cref{sec:hbm}, the architecture of \ac{hbm} allows for many independent memory channels on a single stack, making it possible to cleanly separate the memory map into a \ac{pim}-enabled region and a normal \ac{hbm} region.
At the heart of the \aca{fimdram} are the \ac{pim} execution units, which are shared by two banks of a \ac{pch}.
They include 16 16-bit wide \ac{simd} \acp{fpu}, \acp{crf}, \acp{grf} and \acp{srf} \cite{lee2021}.
This general architecture is shown in detail in \cref{img:fimdram}, with (a) the placement of the \ac{pim} units between the memory banks of a \ac{dram} die, (b) a bank coupled to its \ac{pim} unit, and (c) the data path in and around an \ac{fpu} within the \ac{pim} unit.
\begin{figure}
\centering
\includegraphics[width=\linewidth]{images/fimdram}
\caption[Architecture of \aca{fimdram}]{Architecture of \aca{fimdram} \cite{lee2021}.}
\label{img:fimdram}
\end{figure}
As can be seen in (c), the input data to the \ac{fpu} can come either directly from the memory bank, from a \ac{grf}/\ac{srf}, or from the result bus of a previous computation.
The 16-wide \ac{simd} units correspond to the 256-bit prefetch architecture of \aca{hbm}, where 16 16-bit floating-point operands are passed directly from the \acp{ssa} to the \acp{fpu} from a single memory access.
As all \ac{pim} units operate in parallel, with 16 banks per \ac{pch}, a single memory access loads a total of $\qty{256}{\bit} \cdot \qty{16}{banks} = \qty{4096}{\bit}$ into the \acp{fpu}.
As a result, the theoretical internal bandwidth of \aca{fimdram} is $\qty{16}{\times}$ higher than that of the external bus to the host processor.
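A short back-of-the-envelope sketch illustrates this arithmetic; the prefetch width and bank count are taken from the text above, and only the per-access ratio is computed:

```python
# Back-of-the-envelope check of the internal bandwidth claim
# (256-bit prefetch per bank, 16 banks per pCH, as stated above).
PREFETCH_BITS = 256   # bits delivered by one bank per memory access
BANKS_PER_PCH = 16    # banks whose PIM units operate in parallel

internal_bits_per_access = PREFETCH_BITS * BANKS_PER_PCH
external_bits_per_access = PREFETCH_BITS  # the host only sees one burst

print(internal_bits_per_access)                              # 4096
print(internal_bits_per_access // external_bits_per_access)  # 16
```

The same ratio carries over to sustained bandwidth, since every access fires all banks internally while only one burst crosses the external bus.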
\ac{fimdram} defines three operating modes:
\begin{enumerate}
\item \textbf{Single Bank Mode}:
This is the default operating mode, where \aca{fimdram} has identical behavior to normal \aca{hbm} memory.
To switch to another mode, a specific sequence of \ac{act} and \ac{pre} commands must be sent by the memory controller to a specific row address.
\item \textbf{All-Bank Mode}:
The all-bank mode is an extension of the single bank mode in which the \ac{pim} execution units allow concurrent access to half of the \ac{dram} banks.
\end{enumerate}
\subsubsection{Processing Unit}
Due to the focus on \ac{dnn} applications in \aca{fimdram}, the native data type for the \acp{fpu} is \ac{fp16}, which is motivated by the significantly lower area and power requirements for \acp{fpu} compared to \ac{fp32}.
In addition, \ac{fp16} is well supported on modern processor architectures such as ARMv8.
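As an illustrative aside (not part of the \aca{fimdram} software stack), the rounding behavior of \ac{fp16} can be observed with Python's standard library, whose \texttt{struct} module supports the IEEE 754 half-precision format:

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float through IEEE 754 half precision (FP16)."""
    # Format character "e" packs/unpacks a 16-bit half-precision float.
    return struct.unpack("e", struct.pack("e", x))[0]

# FP16 has only a 10-bit mantissa, so many decimal values are rounded:
print(to_fp16(0.1))                        # ~0.0999755859375, not exactly 0.1
print(to_fp16(2048.0) == to_fp16(2049.0))  # True: integer gaps grow past 2048
```

This limited precision is the price paid for the roughly halved storage and datapath width compared to \ac{fp32}.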
The \ac{simd} \ac{fpu} is implemented once as an \ac{fp16} multiplier unit and once as an \ac{fp16} adder unit, providing support for these basic arithmetic operations.
In addition to the \acp{fpu}, a processing unit also contains \acp{crf}, \acp{srf} and \acp{grf}.
This processing unit architecture is illustrated in \cref{img:pcu}.

\begin{figure}
\centering
\includegraphics[width=\linewidth]{images/pcu}
\caption[Processing unit of \aca{fimdram}]{Processing unit of \aca{fimdram} \cite{lee2021}.}
\label{img:pcu}
\end{figure}
In contrast to SK Hynix's Newton architecture, \aca{fimdram} requires both mode switching and the loading of a microkernel into the processing units before a workload can be executed.
This makes \aca{fimdram} less effective for very small workloads, as the overhead of the mode switching and initialization is significant.
\subsubsection{Instruction Set}
The \aca{fimdram} processing units provide a total of nine 32-bit \ac{risc} instructions, each of which falls into one of three groups: control flow instructions, arithmetic instructions and data movement instructions.
The data layout of these three instruction groups is shown in \cref{tab:isa}.
\begin{table}
\end{table}

This memory layout is explained in detail in \cref{sec:memory_layout}.
\subsubsection{Programming Model}
The software stack of \aca{fimdram} is split into three main parts.
Firstly, a \ac{pim} device driver is responsible for allocating buffers in \ac{hbm} memory and setting these regions as uncacheable.
It does this because the on-chip cache would add unwanted filtering between the host processor's \ac{ld} and \ac{st} instructions and the generation of memory accesses by the memory controller.
Alternatively, it would be possible to control cache behavior by issuing flush and invalidate instructions, but this would introduce overhead, as a flush would have to be issued after every single \ac{pim} instruction in the microkernel.
Secondly, a \ac{pim} acceleration library implements a set of \ac{blas} operations and manages the generation, loading and execution of the microkernel on behalf of the user.
At the highest level, \aca{fimdram} provides an extension to the \ac{tf} framework that allows either calling the special \ac{pim} operations implemented by the acceleration library directly on the source operands, or automatically finding suitable routines that can be accelerated by \ac{pim} during normal \ac{tf} operation.
The software stack is able to concurrently exploit the independent parallelism of \acp{pch} for a \ac{mac} operation as described in \cref{sec:instruction_ordering}.
Since \aca{hbm} memory is mainly used in conjunction with \acp{gpu}, which do not implement sophisticated out-of-order execution, it is necessary to spawn a number of software threads to execute the eight memory accesses simultaneously.
To make use of all eight \ac{grf}-A registers, the input address has to increment linearly, resulting in a row-major matrix layout.
In a row-major matrix layout, the entries of a row are stored sequentially before switching to the next row, following the \ac{c}-like array notation \texttt{MATRIX[R][C]}.
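A minimal sketch makes the row-major convention concrete; the dimensions are chosen arbitrarily for illustration:

```python
# Row-major order: MATRIX[r][c] is stored at flat offset r*C + c,
# so all entries of row r are sequential before row r+1 begins.
R, C = 2, 4  # illustrative dimensions only

def row_major_offset(r: int, c: int) -> int:
    """Flat element index of MATRIX[r][c] in row-major order."""
    return r * C + c

offsets = [row_major_offset(r, c) for r in range(R) for c in range(C)]
print(offsets)  # [0, 1, 2, 3, 4, 5, 6, 7]: each row is contiguous
```

Because the flat offsets increase linearly along a row, a linearly incrementing input address walks through one matrix row at a time, exactly as the \ac{grf}-A usage above requires.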
The \aca{fimdram} architecture imposes certain dimensional constraints on the weight matrix and the input vector.
As all eight processing units in a \ac{pch} operate at the same time, the number of rows must be a multiple of eight to make use of the full processing bandwidth.
These matrix row blocks may span multiple \ac{dram} rows or even other \acp{pch}.
Furthermore, the number of columns must be chosen such that, after exactly one matrix row, the next bank in the \ac{pch} is addressed, so that all processing units operate on eight different rows, stored in eight different banks, at the same time.
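These constraints can be summarized as a small validity check; note that the byte size of one matrix row per bank is a hypothetical placeholder here, as the real value depends on the address mapping of the device:

```python
# Sketch of the dimensional constraints described above. FP16 operands
# are 2 bytes wide; ROW_BYTES_PER_BANK is an assumed interleave size.
FP16_BYTES = 2
UNITS_PER_PCH = 8           # processing units running in lock-step
ROW_BYTES_PER_BANK = 2048   # assumption: bytes of one matrix row per bank

def dims_valid(rows: int, cols: int) -> bool:
    rows_ok = rows % UNITS_PER_PCH == 0                 # full processing bandwidth
    cols_ok = cols * FP16_BYTES == ROW_BYTES_PER_BANK   # next bank after one row
    return rows_ok and cols_ok

print(dims_valid(4096, 1024))  # True under these assumptions
print(dims_valid(4100, 1024))  # False: rows not a multiple of eight
```

Matrices that violate these constraints have to be padded or tiled by software before they can be processed at full bandwidth.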
\begin{equation}
psum[i,0:15]=\sum_{j=0}^{7}(a[j*16:j*16+15]*w[i,j*16:j*16+15])
\end{equation}
The partial sum vector $psum[0:7,0:15]$ must then be reduced by the host processor to obtain the final output vector $b[0:7]$.
This reduction step is mandatory because there is no means in the \aca{fimdram} architecture to reduce the output sums of the 16-wide \ac{simd} \acp{fpu}.
In contrast, SK Hynix's Newton implements adder trees in the \ac{pim} units to reduce the partial sums directly in memory.
The operation of this concrete \ac{gemv} microkernel is illustrated in \cref{img:memory_layout}.
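The interplay of the in-memory partial sums and the host-side reduction can be summarized with a purely functional sketch, using plain Python floats instead of \ac{fp16} and ignoring all timing and banking details:

```python
import random

LANES = 16    # width of the SIMD FPU
BLOCKS = 8    # GRF-A register blocks per matrix row
ROWS = 8      # one matrix row per processing unit/bank
COLS = LANES * BLOCKS

random.seed(42)
a = [random.uniform(-1, 1) for _ in range(COLS)]                  # input vector
w = [[random.uniform(-1, 1) for _ in range(COLS)] for _ in range(ROWS)]

# In-memory phase: every unit i accumulates a 16-wide partial sum.
psum = [[0.0] * LANES for _ in range(ROWS)]
for i in range(ROWS):
    for j in range(BLOCKS):
        for k in range(LANES):
            psum[i][k] += a[j * LANES + k] * w[i][j * LANES + k]

# Host-side phase: collapse each 16-wide partial sum to one scalar,
# since there is no in-memory adder tree for this reduction.
b = [sum(row) for row in psum]

# Cross-check against a direct host-side GEMV.
ref = [sum(a[c] * w[i][c] for c in range(COLS)) for i in range(ROWS)]
print(all(abs(x - y) < 1e-9 for x, y in zip(b, ref)))  # True
```

The cross-check confirms that the host-side reduction merely re-associates the dot products; the arithmetic itself is entirely carried out by the 16-wide units.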
\subsubsection{Performance and Power Efficiency Achievements}
In addition to the theoretical bandwidth of $\qty[per-mode=symbol]{128}{\giga\byte\per\second}$ per \ac{pch}, or a total of $\qty[per-mode=symbol]{2}{\tera\byte\per\second}$ for 16 \acp{pch}, provided to the \ac{pim} units, Samsung also ran experiments on a real implementation of \aca{fimdram} to analyze its performance gains and power efficiency improvements.
This real system is based on a Xilinx Zynq UltraScale+ \ac{fpga} that lies on the same silicon interposer as four \aca{hbm} stacks, each comprising one buffer die, four \aca{fimdram} dies and four normal \aca{hbm} dies \cite{lee2021}.
Results promise performance gains in the range of $\qtyrange{1.4}{11.2}{\times}$ in the tested microbenchmarks, with the highest gain of $\qty{11.2}{\times}$ for a \ac{gemv} kernel.
Real layers of \acp{dnn} achieved a performance gain in the range of $\qtyrange{1.4}{3.5}{\times}$.
The power consumption of the \aca{fimdram} dies themselves is $\qty{5.4}{\percent}$ higher than that of regular \aca{hbm}.
However, the increased processing bandwidth and the reduced power consumption on the global \ac{io}-bus led to an $\qty{8.25}{\percent}$ higher energy efficiency for a \ac{gemv} kernel, and a $\qtyrange{1.38}{3.2}{\times}$ higher efficiency for real \ac{dnn} layers.
In conclusion, \aca{fimdram} is one of the few real \ac{pim} implementations by hardware vendors at this time and promises significant performance gains and higher power efficiency compared to regular \aca{hbm} \ac{dram}.
The following \cref{sec:vp} introduces the concept of virtual prototyping, which forms the basis for the subsequent implementation of the \aca{fimdram} model in a simulator.