\section{Processing-in-Memory}
\label{sec:pim}
In the conventional von Neumann architecture, compute is completely separated from memory.
Memory-intensive workloads operate on a large data set, have poor spatial and temporal locality, and exhibit low operational intensity.
As a consequence, the data movement between the memory and compute forms the so-called von Neumann bottleneck \cite{zou2021}.
In the past, this bottleneck was obfuscated using latency hiding techniques such as out-of-order execution, branch prediction, and multiple layers of cache \cite{radojkovic2021}.
However, new memory-intensive applications, including \acp{dnn}, have led researchers to reconsider \ac{pim} as a new approach to meet future processing demands.
First proposals for \ac{pim} date back to the 1970s and were hindered by the limitations of existing memory systems, but are now experiencing a renaissance \cite{radojkovic2021,ghose2019a}.
In the following, the workloads suitable for \ac{pim} will be discussed in more detail, followed by an overview of the different types of \ac{pim} implementations.
Finally, a number of concrete implementation examples are presented.
\subsection{Applicable Workloads}
\label{sec:pim_workloads}
As already discussed in \cref{sec:introduction}, \ac{pim} is a good fit for accelerating memory-bound workloads with low operational intensity.
In contrast, compute-bound workloads tend to have high data reuse and can make extensive use of the on-chip cache, and therefore do not need to utilize the full memory bandwidth.
For problems like this, \ac{pim} is only of limited use.
Many layers of modern \acp{dnn} can be expressed as a matrix-vector multiplication.
The layer inputs can be represented as a vector and the model weights can be viewed as a matrix, where the number of columns is equal to the size of the input vector and the number of rows is equal to the size of the output vector.
Pairwise multiplication of the input vector and a row of the matrix can be used to calculate an entry of the output vector.
This process is illustrated in \cref{img:dnn} where one \ac{dnn} layer is processed.
\begin{figure}
\centering
\input{images/dnn}
\caption[A fully connected \ac{dnn} layer.]{A fully connected \ac{dnn} layer \cite{he2020}.}
\label{img:dnn}
\end{figure}
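This row-wise computation can be sketched in a few lines of Python (an illustrative NumPy sketch with arbitrary dimensions, not part of any \ac{pim} implementation):

```python
import numpy as np

# A fully connected layer as a matrix-vector product: the weights W
# form an (out_size x in_size) matrix, the layer input x a vector.
in_size, out_size = 4, 3
rng = np.random.default_rng(0)
W = rng.standard_normal((out_size, in_size))
x = rng.standard_normal(in_size)

# Each output element is the pairwise product of the input vector
# with one matrix row, summed up.
y = np.array([np.sum(W[i] * x) for i in range(out_size)])

assert np.allclose(y, W @ x)  # equivalent to the BLAS GEMV routine
```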
Such an operation, defined in the widely used \ac{blas} library \cite{blas1979}, is also known as a \acs{gemv} routine.
Because each matrix element is used exactly once in the calculation of the output vector, there is no data reuse of the matrix.
Further, as the weight matrices tend to be too large to fit into the on-chip cache, such a \ac{gemv} operation is deeply memory-bound \cite{he2020}.
As a result, such an operation is a good fit for \ac{pim}.
In contrast, a \acs{gemm} \ac{blas} routine, i.e., the multiplication of two matrices, is not such a good candidate for \ac{pim} for two reasons:
Firstly, \ac{gemm} sees significant data reuse of both matrices as they are repeatedly accessed column-wise or row-wise, rendering the on-chip cache more efficient.
Secondly, \ac{pim} comes with the further limitation that it can only accelerate two-input-one-output operations, where one operand is significantly larger than the other, as the computation of \ac{pim} can only be close to one of the operands, resulting in extensive data movement of the other operand \cite{he2020}.
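The difference in data reuse between \ac{gemv} and \ac{gemm} can be made concrete with a back-of-the-envelope calculation of the operational intensity (a sketch assuming \ac{fp16} operands of two bytes and counting only the mandatory memory traffic):

```python
def gemv_intensity(n, bytes_per_elem=2):
    # y = W x: 2*n*n FLOPs (multiply + add); the matrix (n*n elements)
    # is streamed exactly once and never reused.
    flops = 2 * n * n
    traffic = n * n * bytes_per_elem  # dominated by the weight matrix
    return flops / traffic

def gemm_intensity(n, bytes_per_elem=2):
    # C = A B with n x n matrices: 2*n^3 FLOPs, but at minimum only
    # 3*n^2 elements cross the memory bus (A and B read, C written).
    flops = 2 * n**3
    traffic = 3 * n * n * bytes_per_elem
    return flops / traffic

# GEMV intensity stays constant (memory-bound), while GEMM intensity
# grows linearly with n and becomes compute-bound for large matrices.
assert gemv_intensity(1024) == 1.0
assert gemm_intensity(1024) > 300
```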
\subsection{PIM Architectures}
\label{sec:pim_architectures}
Many different \ac{pim} architectures have been proposed by research in the past, and more recently real implementations have been presented by hardware vendors.
These proposals differ largely in where the processing logic is placed, ranging from the analog distribution of capacitor charges at the \ac{subarray} level to additional processing units at the global \ac{io} level.
In essence, the placement options can be summarized as follows \cite{sudarshan2022}:
\begin{enumerate}
\item Inside the memory \ac{subarray}.
\item In the \ac{psa} region near a \ac{subarray}.
\item Outside the bank in its peripheral region.
\item In the \ac{io} region of the memory.
\end{enumerate}
Each of these approaches comes with different advantages and disadvantages.
In short, the closer the processing is to the memory \acs{subarray}, the higher the energy efficiency and the achievable processing bandwidth.
Only when the compute units are placed within the bank region can the full bank parallelism be used to retrieve and process data concurrently.
Outside the bank region, the data retrieval is limited by the narrow memory bus.
On the other hand, integrating the \ac{pim} units inside the bank becomes more difficult, as area and power constraints are much tighter there \cite{sudarshan2022}.
Processing \textbf{inside the \ac{subarray}} has the highest achievable level of parallelism, with the number of operand bits equal to the size of the row.
It also requires the least amount of energy to load the data from the \acs{subarray} into the \acp{psa} to perform operations on it.
The downside of this approach is the need to modify the highly optimized \ac{subarray} architecture.
An example of such an approach is Ambit \cite{seshadri2020}.
Ambit provides a mechanism to activate multiple rows within a \ac{subarray} at once and perform bulk bitwise operations such as AND, OR and NOT on the row data.
Far fewer, but still challenging, constraints are placed on the integration of compute units \textbf{in the region of the \acp{psa}}.
The approach presented in \cite{sudarshan2022a} consists of a two-stage design that integrates current mirror-based analog units near the \ac{subarray} that enable \ac{mac} operations used in \ac{dnn} applications.
The integration of compute units \textbf{in the \ac{io} region of the bank} allows for area-intensive operations such as ADD, \ac{mac} or \ac{mad}.
This leaves the highly optimized \ac{subarray} and \ac{psa} regions untouched and only reduces the memory density per die to make room for the additional compute units.
However, the achievable level of parallelism is lower than in the other approaches and is defined by the prefetch architecture, i.e., the maximum burst size of the memory banks.
Placing the compute units \textbf{in the \ac{io} region of the \ac{dram}} has the fewest physical limitations and allows for complex accelerators that implement a complete \ac{isa}.
The downside is that bank parallelism cannot be exploited to perform multiple computations simultaneously at the bank level.
Also, the energy required to move data to the \ac{io} boundary of the \ac{dram} is much higher than in the other approaches.
In the following, three \ac{pim} approaches that place the compute units at the bank \ac{io} boundary are presented in more detail.
\subsection{UPMEM}
\label{sec:pim_upmem}
The first publicly available real-world \ac{pim} architecture has been designed and built by the company UPMEM \cite{gomez-luna2022}.
UPMEM combines regular DDR4 \ac{dimm} based \ac{dram} with a set of \ac{pim}-enabled UPMEM \acp{dimm} consisting of several \ac{pim} chips.
In each \ac{pim} chip, there are 8 \acp{dpu}, each of which has exclusive access to a $\qty{64}{\mega\byte}$ memory bank, a $\qty{24}{\kilo\byte}$ instruction memory and a $\qty{64}{\kilo\byte}$ scratchpad memory.
The host processor can access the \ac{dpu} memory banks to copy input data from main memory and retrieve results.
While copying, the data layout must be changed to store the data words contiguously in a \ac{pim} bank, in contrast to the horizontal \ac{dram} mapping used in \ac{dimm} modules, where a data word is split across multiple devices.
UPMEM provides an \ac{sdk} that orchestrates the data movement from main memory to the \ac{pim} banks and adapts the data layout without requiring special attention from the developer.
Each \ac{dpu} is a multithreaded $\qty{32}{bit}$ \ac{risc} core with a full set of general purpose registers and a 14-stage pipeline.
The \acp{dpu} execute compiled \acs{c} code using a specialized compiler toolchain that provides limited support of the standard library.
With a system clock of $\qty{400}{\mega\hertz}$, the internal bandwidth of a \ac{dpu} amounts to $\qty[per-mode = symbol]{800}{\mega\byte\per\second}$.
A system can integrate 128 \acp{dpu} per \ac{dimm}, with a total of 20 UPMEM \acp{dimm}, which gives a maximum theoretical \ac{pim} bandwidth of $\qty[per-mode = symbol]{2}{\tera\byte\per\second}$ \cite{gomez-luna2022}.
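The quoted aggregate bandwidth follows directly from these figures (a simple sanity check of the numbers above):

```python
# Per-DPU internal bandwidth at 400 MHz and the system configuration
# stated above (128 DPUs per DIMM, 20 UPMEM DIMMs).
dpu_bandwidth_mb_s = 800
dpus_per_dimm = 128
dimms = 20

total_dpus = dpus_per_dimm * dimms                     # 2560 DPUs
total_bw_tb_s = total_dpus * dpu_bandwidth_mb_s / 1e6  # MB/s -> TB/s

assert total_dpus == 2560
assert abs(total_bw_tb_s - 2.048) < 1e-9  # ~2 TB/s theoretical PIM bandwidth
```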
\subsection{Newton AiM}
\label{sec:pim_newton}
In the year 2020, the major \ac{dram} manufacturer SK Hynix announced its own \ac{pim} technology using \ac{gddr6} memory called Newton \cite{he2020}.
In contrast to UPMEM, Newton integrates only small \ac{mac} units and buffers into the bank region to avoid the area and power overhead of a fully programmable processor core.
To communicate with the processing units, Newton introduces its own \ac{dram} commands, allowing fully interleaved \ac{pim} and non-\ac{pim} traffic as no mode switching is required.
Another advantage of this approach is that there is no kernel startup delay required to initialize the \ac{pim} operation, which would be a significant overhead for small batches of \ac{pim} operations.
On the downside, this extension to the \ac{jedec} standard is not a drop-in solution, as the memory controller, and consequently the host processor, must be specifically adapted.
In addition to the \ac{mac} units, Newton also introduces a shared global buffer in the \ac{io} region of the memory to broadcast the same input vector to all banks.
The broadcasted input vector is then multiplied by a matrix row by performing a column access to the \ac{dram} bank, producing $\qty{32}{\byte}$ wide temporary products of 16 16-bit floating point values.
These temporary products are then reduced to a single output vector element by the adder tree in the bank.
To make full use of the output buffering, the matrix rows are interleaved in an unusually wide data layout, corresponding to the row size of the \ac{dram}.
\begin{figure}
\centering
\input{images/hynix}
\caption[Newton memory layout for a \ac{gemv} operation.]{Newton memory layout for a \ac{gemv} operation \cite{he2020}.}
\label{img:hynix}
\end{figure}
As illustrated in \cref{img:hynix}, a matrix row is distributed across all banks and partitioned into separate chunks, filling the complete \ac{dram} row.
This ensures that the input vector is fully used and never refetched: all matrix rows of a corresponding chunk are multiplied by the input vector chunk before moving to the next chunk.
If this is done repeatedly, the temporary results will be accumulated in the output vector.
Since all the banks are operating on the same input vector at the same time, a single Newton \ac{dram} command will perform the arithmetic operations for all the banks in the memory.
Finally, the host reads the result latches from all banks at the same time and concatenates them to form the complete output vector.
Overall, Newton completes the arithmetic operations of a row in all banks in the time it takes a conventional DRAM to read a row from one bank \cite{he2020}.
As a result, Newton promises a $\qtyrange{10}{54}{\times}$ speedup compared to a theoretical non-\ac{pim} system with infinite compute capability, whose performance is limited solely by the available memory bandwidth.
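The chunk-wise reuse of the input vector described above can be modeled functionally (a simplified Python sketch of the data flow, not of the hardware; the bank parallelism and the adder tree are emulated sequentially here):

```python
import numpy as np

def newton_gemv(W, x, chunk=16):
    # Each input-vector chunk is broadcast once via the global buffer
    # and multiplied with the matching chunk of every matrix row before
    # the next chunk is fetched; partial results accumulate per row.
    rows, cols = W.shape
    out = np.zeros(rows)
    for c in range(0, cols, chunk):
        x_chunk = x[c:c + chunk]       # broadcast to all banks once
        for r in range(rows):          # in hardware: all banks in parallel
            # the per-bank adder tree reduces the temporary products
            # to a single output-vector element contribution
            out[r] += np.sum(W[r, c:c + chunk] * x_chunk)
    return out

rng = np.random.default_rng(1)
W = rng.standard_normal((8, 64))
x = rng.standard_normal(64)
assert np.allclose(newton_gemv(W, x), W @ x)
```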
\subsection{\Acl{fimdram}}
\label{sec:pim_fim}
One year after SK Hynix, the major \ac{dram} manufacturer Samsung announced its own \ac{pim} \ac{dram} implementation, called \acf{fimdram}.
As this is the \ac{pim} architecture which was implemented during the work on this thesis, it will be explained in great detail.
The following subsections are mainly based on \cite{lee2021} and \cite{kwon2021}, with the \cref{sec:memory_layout} being mainly based on \cite{kang2022}.
\subsubsection{Architecture}
\label{sec:pim_architecture}
As the name of \aca{fimdram} suggests, it is based on the \aca{hbm} memory standard, and it integrates 16-wide \ac{simd} engines directly into the memory banks, exploiting bank-level parallelism, while preserving the highly optimized \acp{subarray} \cite{kwon2021}.
A major difference from Newton \ac{pim} is that \aca{fimdram} does not require any changes to components of modern processors, such as the memory controller, i.e., it is agnostic to existing \aca{hbm} platforms.
Consequently, mode switching is required for \aca{fimdram}, making it less useful for interleaved \ac{pim} and non-\ac{pim} traffic and small batch sizes.
Fortunately, as discussed in \cref{sec:hbm}, the architecture of \ac{hbm} allows for many independent memory channels on a single stack, making it possible to cleanly separate the memory into a \ac{pim}-enabled region and a normal \ac{hbm} region.
At the heart of the \aca{fimdram} are the \ac{pim} execution units, which are shared by two banks each of a \ac{pch}.
They include 16 16-bit wide \ac{simd} \acp{fpu}, \acp{crf}, \acp{grf} and \acp{srf} \cite{lee2021}.
This general architecture is shown in detail in \cref{img:fimdram}, with (a) the placement of the \ac{pim} units between the memory banks of a \ac{dram} die, (b) a bank coupled to its \ac{pim} unit, and (c) the data path around a \ac{fpu} within the \ac{pim} unit.
\begin{figure}
\centering
\includegraphics[width=\linewidth]{images/fimdram}
\caption[Architecture of \aca{fimdram}.]{Architecture of \aca{fimdram} \cite{lee2021}.}
\label{img:fimdram}
\end{figure}
As can be seen in (c), the input data to the \ac{fpu} can come either directly from the memory bank, from a \ac{grf}/\ac{srf}, or from the result bus of a previous computation.
The 16-wide \ac{simd} units correspond to the 256-bit prefetch architecture of \aca{hbm}, where 16 16-bit floating-point operands are passed directly from the \acp{ssa} to the \acp{fpu} from a single memory access.
As all \ac{pim} units operate in parallel, with 16 banks per \ac{pch}, a single memory access loads a total of $\qty{256}{\bit}\cdot\qty{8}{processing\ units}=\qty{2048}{\bit}$ into the \acp{fpu}.
As a result, the theoretical internal bandwidth of \aca{fimdram} is $\qty{8}{\times}$ higher than the external bus bandwidth to the host processor.
\Ac{hbm}-\ac{pim} defines three operating modes:
\begin{enumerate}
\item \textbf{\Ac{sb} Mode}:
This is the default operating mode, where \aca{fimdram} has identical behavior to normal \aca{hbm} memory.
To switch to another mode, a specific sequence of \ac{act} and \ac{pre} commands must be sent by the memory controller to a specific row address.
\item \textbf{\Ac{ab} Mode}:
The \ac{ab} mode is an extension of the \ac{sb} mode in which the \ac{pim} execution units allow concurrent access to half of the \ac{dram} banks.
This provides $\qty{8}{\times}$ more bandwidth than the standard operation mode, which can be used for the initialization of memory regions across all banks.
\item \textbf{\Ac{abp} Mode}:
With another predefined \ac{dram} access sequence, the memory switches to the \ac{pim} enabled mode.
In this mode, a single memory access initiates the concurrent execution of the next instruction across all processing units.
In addition, the \ac{io} circuits of the \ac{dram} are completely disabled in this mode, reducing the power required during \ac{pim} operation.
\end{enumerate}
In both \ac{ab} mode and \ac{abp} mode, the bandwidth per \ac{pch} is $\qty{8}{\times}$ higher than the standard \aca{hbm} bandwidth of $\qty[per-mode=symbol]{16}{\giga\byte\per\second}$, amounting to $\qty[per-mode=symbol]{128}{\giga\byte\per\second}$ per \ac{pch}, or $\qty[per-mode=symbol]{2}{\tera\byte\per\second}$ in total for 16 \acp{pch}.
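These bandwidth figures can be verified with a short calculation:

```python
pch_bandwidth_gb_s = 16   # standard HBM bandwidth per pseudo channel
parallel_units = 8        # processing units operating concurrently
pchs = 16

ab_bandwidth_per_pch = pch_bandwidth_gb_s * parallel_units  # 128 GB/s
total_tb_s = ab_bandwidth_per_pch * pchs / 1000             # ~2 TB/s

assert ab_bandwidth_per_pch == 128
assert abs(total_tb_s - 2.048) < 1e-9
```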
\subsubsection{Processing Unit}
Due to the focus of \aca{fimdram} on \ac{dnn} applications, the native data type of the \acp{fpu} is \ac{fp16}, which is motivated by the significantly lower area and power requirements of \ac{fp16} \acp{fpu} compared to \ac{fp32}.
In addition, \ac{fp16} is well-supported on modern processor architectures such as ARMv8, which not only include \ac{fp16} \acp{fpu} themselves, but also support \ac{simd} operations using special vector registers.
The \ac{simd} \ac{fpu} of a processing unit is implemented once as a \ac{fp16} multiplier unit and once as a \ac{fp16} adder unit, providing support for these basic arithmetic operations.
In addition to the \acp{fpu}, a processing unit also contains \acp{crf}, \acp{srf} and \acp{grf}.
The \ac{crf} acts as an instruction buffer, holding the 32 32-bit instructions to be executed by the processor when performing a memory access.
One program that is stored in the \ac{crf} is called a \textit{microkernel}.
As explained earlier, the operands of an instruction come either directly from the bank or from the \acp{srf} or \acp{grf}.
Each \ac{grf} consists of 16 registers, each with the \aca{hbm} prefetch size of 256 bits, where each entry can hold the data of a full memory burst.
The \ac{grf} of a processing unit is divided into two halves (\ac{grf}-A and \ac{grf}-B), with 8 register entries allocated to each of the two banks.
Finally, in the \acp{srf}, a 16-bit scalar value is replicated 16 times and fed into the 16-wide \ac{simd} \ac{fpu} as a constant summand or factor for an addition or multiplication.
The \ac{srf} is also divided into two halves (\ac{srf}-A for addition and \ac{srf}-M for multiplication) with 8 entries each.
This processing unit architecture is illustrated in \cref{img:pcu}, along with the local bus interfaces to its even and odd bank, and the control unit that decodes the instructions and keeps track of the program counter.
\begin{figure}
\centering
\includegraphics[width=0.8\linewidth]{images/pcu}
\caption[Architecture of a \ac{pim} processing unit.]{Architecture of a \ac{pim} processing unit \cite{lee2021}.}
\label{img:pcu}
\end{figure}
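A simplified software model may help to summarize the dimensions of the register files (the sizes follow the description above; the class structure itself is purely illustrative):

```python
import numpy as np

SIMD_WIDTH = 16   # 16 fp16 lanes per SIMD FPU
GRF_ENTRIES = 8   # entries per half (GRF-A and GRF-B)

class ProcessingUnitRegs:
    def __init__(self):
        self.crf = [None] * 32  # instruction buffer: 32 x 32-bit words
        # GRF: two halves of 8 entries, each 256 bits (16 x fp16),
        # i.e., one full HBM memory burst per entry.
        self.grf_a = np.zeros((GRF_ENTRIES, SIMD_WIDTH), dtype=np.float16)
        self.grf_b = np.zeros((GRF_ENTRIES, SIMD_WIDTH), dtype=np.float16)
        # SRF: scalar registers for additions (A) and multiplications (M).
        self.srf_a = np.zeros(GRF_ENTRIES, dtype=np.float16)
        self.srf_m = np.zeros(GRF_ENTRIES, dtype=np.float16)

    def srf_broadcast(self, half, idx):
        # A 16-bit scalar is replicated 16 times when fed into the FPU.
        scalar = self.srf_a[idx] if half == "A" else self.srf_m[idx]
        return np.full(SIMD_WIDTH, scalar, dtype=np.float16)

regs = ProcessingUnitRegs()
regs.srf_m[0] = np.float16(2.0)
assert (regs.srf_broadcast("M", 0) == np.float16(2.0)).all()
assert regs.grf_a.shape == (8, 16)  # 8 entries x 256 bit
```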
To emphasize the architectural differences, unlike SK Hynix's Newton architecture, \aca{fimdram} requires both mode switching and loading a microkernel into the processing units before a workload can be executed.
This makes \aca{fimdram} less effective for very small workloads, as the overhead of the mode switching and initialization would be significant.
\subsubsection{Instruction Set}
The \aca{fimdram} processing units provide a total of 9 32-bit \ac{risc} instructions, each of which falls into one of three groups: control flow instructions, arithmetic instructions and data movement instructions.
The data layout of these three instruction groups is shown in \cref{tab:isa}.
\begin{table}
\centering
\includegraphics[width=\linewidth]{images/isa}
\caption[The instruction format of the processing units.]{The instruction format of the processing units \cite{lee2021}.}
\label{tab:isa}
\end{table}
For the control flow instructions, there is NOP, which performs no operation; JUMP, which performs a fixed-iteration jump to an instruction at a given offset; and EXIT, which restores the internal state of the processing unit.
It is important to note that the JUMP instruction is a zero-cycle instruction, i.e., it is executed together with the instruction that precedes it.
The arithmetic instructions perform operations such as simple ADD and MUL, but also support \ac{mac} and \ac{mad} operations, which are key for accelerating \ac{dnn} applications.
Finally, the MOV and FILL instructions are used to move data between the memory banks and the \ac{grf} and \ac{srf} register files.
The DST and SRC fields specify the operand type, i.e., the register file or bank affected by the operation.
Depending on the source or destination operand types, the instruction encodes indices for the concrete element in the register files, which are denoted in \cref{tab:isa} by \textit{\#} symbols.
The special field \textit{R} for the data movement instruction type enables a \ac{relu} operation, i.e., the clamping of negative values to zero, while the data is moved to another location.
Another special field \textit{A} enables the \ac{aam}, which will be explained in more detail in \cref{sec:instruction_ordering}.
\begin{table}
\centering
\resizebox{\linewidth}{!}{%
\begin{tblr}{
hlines,
vlines,
hline{2} = {-}{solid,black},
hline{2} = {2}{-}{solid,black},
}
Type & Command & Description & Result (DST) & Operand (SRC0) & Operand (SRC1) & Operand (SRC2) \\
Control & NOP & no operation & & & & \\
Control & JUMP & jump instruction & & & & \\
Control & EXIT & exit instruction & & & & \\
Data & MOV & {move data\\from bank/register\\to register} & GRF, SRF & GRF, BANK & & \\
Data & FILL & {move data\\from bank/register\\to bank/register} & GRF, BANK & GRF, BANK & & \\
Arithmetic & ADD & addition & GRF & GRF, BANK, SRF & GRF, BANK, SRF & \\
Arithmetic & MUL & multiplication & GRF & GRF, BANK & GRF, BANK, SRF & \\
Arithmetic & MAC & multiply-accumulate & GRF-B & GRF, BANK & GRF, BANK, SRF & GRF, BANK, SRF \\
Arithmetic & MAD & multiply-and-add & GRF & GRF, BANK & GRF, BANK, SRF & GRF, BANK, SRF
\end{tblr}}
\caption[A list of all supported \ac{pim} instructions and their possible sources and destinations.]{A list of all supported \ac{pim} instructions and their possible sources and destinations \cite{shin-haengkang2023}.}
\label{tab:instruction_set}
\end{table}
\Cref{tab:instruction_set} gives an overview of all available instructions and defines the possible operand sources and destinations.
Note that some operations specifically require either a \ac{rd} or a \ac{wr} access to execute properly.
For example, to write the resulting output vector from a \ac{grf} to the memory banks, the memory controller must issue a \ac{wr} command to write to the bank.
Likewise, reading from the banks requires a \ac{rd} command.
For the control instructions and for arithmetic instructions without the bank as a source operand, either a \ac{rd} or a \ac{wr} can be issued to execute the instruction.
For the rest of this thesis, it is assumed that a \ac{rd} is issued for these instructions.
\subsubsection{Instruction Ordering}
\label{sec:instruction_ordering}
Since the execution of an instruction in the microkernel is initiated by a memory access, the host processor must execute its \ac{ld} and \ac{st} instructions in a sequence that exactly matches the loaded \ac{pim} microkernel.
When an instruction has a bank as its specified source or destination, the addresses of these memory accesses specify the exact row and column where the data should be loaded from or stored to.
This means that the order of the respective memory accesses for such instructions is important and must not be reordered, as it must match the corresponding instruction in the microkernel.
For example, as shown in \cref{lst:reorder}, two consecutive \ac{mac} instructions with the memory bank as one operand source already specify the respective register indices, but must wait for their actual memory accesses to receive the row and column address of the bank access.
\begin{listing}
\begin{verbatim}
MAC GRF_B #0, BANK, GRF_A #0
MAC GRF_B #1, BANK, GRF_A #1
\end{verbatim}
\caption[Exemplary sequence of \ac{mac} instructions in a microkernel.]{Exemplary sequence of \ac{mac} instructions in a microkernel.}
\label{lst:reorder}
\end{listing}
Unfortunately, the memory controller between the host processor and the \ac{pim} memory is allowed to reorder memory fetches as long as they do not introduce hazards.
This causes the register sources and destinations to be out of sync with the bank addresses.
One solution to this problem would be to place a memory barrier between consecutive \ac{ld} and \ac{st} instructions on the processor to prevent any reordering, as the controller then handles only one memory transaction at a time.
However, this comes at a significant performance cost and leaves the memory bandwidth underutilized, because the host processor has to wait for every memory access to complete.
Disabling memory controller reordering completely, on the other hand, interferes with non-\ac{pim} traffic and significantly reduces its performance.
To avoid this overhead, Samsung has introduced the \ac{aam} mode for arithmetic instructions.
In the \ac{aam} mode, the register indices of an instruction are ignored and instead decoded from the column and row address of the memory access itself, as demonstrated in \cref{img:aam}.
With this method, the register indices and the bank address cannot get out of sync, as they are tightly coupled, even if the memory controller reorders the accesses.
\begin{figure}
\centering
\includegraphics[width=0.5\linewidth]{images/aam}
\caption[Exemplary calculation of the \ac{grf}-A and \ac{grf}-B index using the row and column address.]{Exemplary calculation of the \ac{grf}-A and \ac{grf}-B index using the row and column address \cite{lee2021}.}
\label{img:aam}
\end{figure}
As a side effect, this method also allows an instruction in the microkernel to be looped, as otherwise the register indices are fixed and would apply to the same register entry in each iteration.
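The effect of \ac{aam} can be illustrated with a hypothetical decoder (the concrete address bits used for the index calculation are an assumption for illustration only; the actual mapping is defined by the hardware, cf. \cref{img:aam}):

```python
def decode_aam_indices(row_addr, col_addr):
    # Hypothetical AAM decoding: the register indices are derived from
    # the memory address instead of the instruction word. The chosen
    # bit positions are an illustrative assumption, not the real layout.
    grf_a_idx = col_addr % 8          # e.g., low column-address bits
    grf_b_idx = (col_addr // 8) % 8   # e.g., next column-address bits
    return grf_a_idx, grf_b_idx

# Consecutive column accesses walk linearly through the GRF-A entries,
# so even reordered accesses still address the registers they belong to.
indices = [decode_aam_indices(0, c)[0] for c in range(8)]
assert indices == [0, 1, 2, 3, 4, 5, 6, 7]
```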
At the core of a \ac{gemv} microkernel is an iterative \ac{mac} instruction, followed by a JUMP instruction that executes the \ac{mac} operation a total of eight times, as shown in \cref{lst:gemv}.
\begin{listing}
\begin{verbatim}
MAC(AAM) GRF_B, BANK, GRF_A
JUMP -1, 7
\end{verbatim}
\caption[The core of a \ac{gemv} microkernel.]{The core of a \ac{gemv} microkernel.}
\label{lst:gemv}
\end{listing}
Since the column address of the memory access is incremented after each iteration, all entries of the \ac{grf}-A register file, where the input vector is stored, are used to multiply it with the matrix weights loaded on the fly from the memory banks.
The actual order of the memory accesses is irrelevant; the host only has to place memory barrier instructions before and after the \ac{mac} kernel to resynchronize execution.
To achieve this particular operation, where the addresses are used to calculate the register indices, the memory layout of the weight matrix has to follow a special pattern.
This memory layout is explained in detail in \cref{sec:memory_layout}.
\subsubsection{Programming Model}
The software stack of \aca{fimdram} is split into three main parts.
Firstly, a \ac{pim} device driver is responsible for allocating buffers in \ac{hbm} memory and setting these regions as non-cacheable.
It does this because the on-chip cache would otherwise act as an unwanted filter between the host processor's \ac{ld} and \ac{st} instructions and the generation of memory accesses by the memory controller.
Alternatively, it would be possible to control the cache behavior by issuing flush and invalidate instructions, but this would introduce overhead, as a flush would have to be issued for each and every \ac{pim} instruction in the microkernel.
Secondly, a \ac{pim} acceleration library implements a set of \ac{blas} operations and manages the generation, loading and execution of the microkernel on behalf of the user.
At the highest level, \aca{fimdram} provides an extension to the \ac{tf} framework that allows for either calling the special \ac{pim} operations implemented by the accelerator library directly on the source operands, or for automatically finding suitable routines that can be accelerated by \ac{pim} in the normal \ac{tf} operation.
The software stack is able to concurrently exploit the independent parallelism of \acp{pch} for a \ac{mac} operation as described in \cref{sec:instruction_ordering}.
Since \aca{hbm} memory is mainly used in conjunction with \acp{gpu}, which do not implement sophisticated out-of-order execution, it is necessary to spawn a number of software threads to execute the eight memory accesses simultaneously.
The necessary number of threads depends on the processor \ac{isa}, e.g., with a maximum access size of $\qty{16}{\byte}$, $\qty{256}{\byte}/\qty{16}{\byte}=\num{16}$ threads are required to access the full \aca{hbm} burst size.
Such a group of software threads is called a thread group.
Thus, a total of 64 thread groups running in parallel can be spawned in a \ac{hbm} configuration with four memory stacks and a total of 64 \acp{pch}.
\subsubsection{Memory Layout}
\label{sec:memory_layout}
As already described in \cref{sec:instruction_ordering}, the use of the \ac{aam} mode requires a special memory layout so that the register indices are correctly calculated from the column and row addresses of a memory access.
To make use of all eight \ac{grf}-A registers, the input address has to increment linearly while adhering to a column-major matrix layout.
In a column-major matrix layout, the entries of a column are stored sequentially before switching to the next column, according to the \texttt{MATRIX[R][C]} \ac{c}-like array notation.
However, the concrete element type of the array is not a single \ac{fp16} number, but a vector of 16 \ac{fp16} numbers packed together.
This results in 16 \ac{fp16} matrix row elements being stored sequentially before switching to the next 16 \ac{fp16} elements in the next row of the same 16 columns, ensuring that a \ac{simd} processing unit always contains the data of only one matrix row.
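The described packing can be sketched as follows (an illustrative Python model of the linear layout; the bank interleaving performed by the address mapping is ignored here):

```python
import numpy as np

def pack_weights(W, vec=16):
    # Pack an (R x C) matrix so that 16 fp16 elements of one matrix row
    # are stored contiguously, followed by the same 16 columns of the
    # next row (column-major over packed 16-element vectors).
    R, C = W.shape
    assert C % vec == 0
    out = []
    for c in range(0, C, vec):   # iterate over 16-column blocks
        for r in range(R):       # rows within a block are sequential
            out.append(W[r, c:c + vec])
    return np.concatenate(out)

W = np.arange(4 * 32, dtype=np.float16).reshape(4, 32)
flat = pack_weights(W)
# The first 16 values are row 0, columns 0..15; the next 16 belong to
# row 1 of the same 16 columns, as described above.
assert (flat[:16] == W[0, :16]).all()
assert (flat[16:32] == W[1, :16]).all()
```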
The \aca{fimdram} architecture also imposes certain dimensional constraints on the weight matrix and the input vector.
As all eight processing units in a \ac{pch} operate at the same time, the number of rows must be a multiple of eight to make use of the full processing bandwidth.
These matrix row blocks may span multiple \ac{dram} rows or even other \acp{pch}.
% Furthermore, the number of columns must be set so that exactly after one matrix row, the next bank in the \ac{pch} is addressed, so that all the processing units operate on eight different rows, stored in eight different banks, at the same time.
% This does not mean that a matrix row must be the same size as a \ac{dram} row, only that the \ac{am} of the memory controller must switch to the next bank after a complete matrix row.
% Once all banks have been accessed, the mapping of the column bits can continue.
Furthermore, the number of columns defines the number of iterations the \ac{mac} core of the microkernel has to perform.
As 16 \ac{fp16} elements are always packed together in a column-major fashion, and as the \ac{am} of the memory controller switches to the next bank after exactly one burst, the \ac{pim} units each hold 16 different matrix row elements of the same set of 16 matrix columns.
\Cref{img:matrix_layout} gives a complete overview of the layout of the weight matrix in the linear address space and its mapping onto the memory banks.
Note that the interleaving of \ac{fp16} vectors is very similar to the chunking of the weight matrix in SK Hynix's Newton architecture, as illustrated in \cref{img:hynix}.
The input vector must also adhere to a special memory layout.
Since a vector is essentially a single-column matrix, it is always laid out sequentially in memory.
However, because all processing units must access the same input vector elements at the same time, all processing units must load the respective vector elements into their \ac{grf}-A registers during the initialization phase of the microkernel.
As there is no communication between the banks, every bank needs to have its own copy of the input vector.
Consequently, from the perspective of the linear address space, multiple copies of the input vector chunks must be interleaved in such a way that the input vector is contiguous from the perspective of each bank.
This interleaving is illustrated in \cref{img:input_vector}.
\begin{figure}
\centering
\input{images/input_vector}
\caption{Input vector in linear address space, where one chunk is mapped to all banks.}
\label{img:input_vector}
\end{figure}
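The replication can be modeled as follows (an illustrative sketch; the number of banks and the chunk size are arbitrary here):

```python
import numpy as np

def replicate_input(x, banks, chunk=16):
    # Each chunk of the input vector is duplicated once per bank, so
    # that from the perspective of every single bank the input vector
    # appears contiguous.
    out = []
    for c in range(0, len(x), chunk):
        for _ in range(banks):
            out.append(x[c:c + chunk])
    return np.concatenate(out)

x = np.arange(32, dtype=np.float16)
lin = replicate_input(x, banks=4, chunk=16)
# Linear address space: chunk 0 repeated for banks 0..3, then chunk 1.
assert (lin[:16] == x[:16]).all()
assert (lin[16:32] == x[:16]).all()
assert (lin[64:80] == x[16:]).all()
```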
To initialize the input vector in this way, the host processor can use \ac{ab} mode.
From the processor's point of view, only the first bank is initialized, but the \ac{ab} mode ensures that the same data is written to all banks at the same time.
In the following, an example with a weight matrix of dimensions (128$\times$8), an input vector of size (128), and an output vector of size (8) is analyzed to describe how the processing units execute a \ac{gemv} microkernel.
With the processing unit \textit{i}, the number of iterations \textit{j}, the input vector \textit{a} and the weight matrix \textit{w}, the partial sum $psum[i,0:15]$ is calculated as described in \cref{eq:partial_sum}:
\begin{equation}
psum[i,0:15]=\sum_{j=0}^{7}(a[j \cdot 16:j \cdot 16+15] \cdot w[i,j \cdot 16:j \cdot 16+15])
\label{eq:partial_sum}
\end{equation}
The partial sum vector $psum[0:7,0:15]$ must then be reduced by the host processor to obtain the final output vector $b[0:7]$.
This reduction step is mandatory because the \aca{fimdram} architecture provides no means of reducing the output sums of the 16-wide \ac{simd} \acp{fpu}.
In contrast, SK Hynix's Newton implements adder trees in the \ac{pim} units to reduce the partial sums directly in memory.
Note that, as a consequence, the activation function often used in \acp{dnn}, i.e., \ac{relu} in the case of \aca{fimdram}, cannot be applied without first reducing the partial sums, since \ac{relu} is a non-linear function.
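The computation in \cref{eq:partial_sum} together with the host-side reduction can be emulated with a minimal NumPy sketch; for simplicity the sketch accumulates in fp32, whereas the actual \acp{fpu} operate on \ac{fp16} data:

```python
import numpy as np

def pim_gemv(w, a):
    """Emulate the (128x8)x(128) example: unit i accumulates a 16-wide
    partial sum over 8 iterations, which the host then reduces."""
    psum = np.zeros((8, 16), dtype=np.float32)
    for i in range(8):      # processing unit, one per matrix row
        for j in range(8):  # microkernel iteration (column chunk)
            psum[i] += (a[j * 16:(j + 1) * 16].astype(np.float32)
                        * w[i, j * 16:(j + 1) * 16].astype(np.float32))
    return psum.sum(axis=1)  # host-side reduction -> b[0:7]

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 128)).astype(np.float16)
a = rng.standard_normal(128).astype(np.float16)
b = pim_gemv(w, a)
# The reduced partial sums match an ordinary matrix-vector product.
assert np.allclose(b, w.astype(np.float32) @ a.astype(np.float32), atol=1e-4)
```

The final \texttt{sum(axis=1)} is exactly the step that must run on the host, since the 16 \ac{simd} lanes of a \ac{pim} unit cannot be reduced in memory.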
The operation of this concrete \ac{gemv} microkernel is illustrated in \cref{img:memory_layout}.
\begin{figure}
\centering
\includegraphics[width=0.8\linewidth]{images/memory_layout}
\caption[Procedure to perform a (128$\times$8)$\times$(128) \ac{gemv} operation.]{Procedure to perform a (128$\times$8)$\times$(128) \ac{gemv} operation. One cell represents 16 \ac{fp16} elements, forming a $\qty{32}{\byte}$ block \cite{kang2022}.}
\label{img:memory_layout}
\end{figure}
\Cref{img:memory_layout} shows that a processing unit is responsible for multiplying one row of the matrix with the input vector over eight cycles, forming the partial sum.
This example only demonstrates the execution of the native matrix dimensions for one \ac{pch}.
Increasing the number of rows in the matrix requires additional iterations of this 8-cycle microkernel, while feeding in the other memory addresses for the subsequent matrix rows.
However, the additional matrix rows must be stored as a separate matrix after the first 8-row matrix block, forming an array of separate 8-row matrices.
As a side effect of the incremented matrix row address, the \ac{grf}-B index is incremented as well.
This makes it possible to increase the maximum number of matrix rows to $8 \cdot 8=64$ before all eight \ac{grf}-B entries are filled with partial sums, as demonstrated in \cref{lst:gemv64}.
\begin{listing}
\begin{verbatim}
MAC(AAM) GRF_B, BANK, GRF_A
JUMP -1, 63
\end{verbatim}
\caption[The core of a \ac{mac} microkernel that utilizes the maximum number of register entries.]{The core of a \ac{mac} microkernel that utilizes the maximum number of register entries.}
\label{lst:gemv64}
\end{listing}
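The effect of this coupled row-address and \ac{grf}-B indexing can be sketched as follows; the row-to-unit mapping (unit $i$ holding rows $i$, $i+8$, and so on) is an assumption made for illustration, matching the storage of the additional rows as separate 8-row matrix blocks:

```python
import numpy as np

UNITS, GRF_B_ENTRIES, CHUNKS = 8, 8, 8  # 64 rows, 128 columns in total

def gemv64(w, a):
    """Emulate the MAC(AAM)/JUMP loop: the incremented row address
    selects the next GRF-B entry, so each unit accumulates partial
    sums for 8 matrix rows (64 rows in total)."""
    grf_b = np.zeros((UNITS, GRF_B_ENTRIES, 16), dtype=np.float32)
    for blk in range(GRF_B_ENTRIES):  # successive 8-row matrix blocks
        for i in range(UNITS):        # processing units (in parallel)
            row = blk * UNITS + i     # assumed row-to-unit mapping
            for j in range(CHUNKS):
                grf_b[i, blk] += (a[j * 16:(j + 1) * 16].astype(np.float32)
                                  * w[row, j * 16:(j + 1) * 16].astype(np.float32))
    return grf_b.sum(axis=2)          # host reduction, one sum per row

rng = np.random.default_rng(1)
w = rng.standard_normal((64, 128)).astype(np.float16)
a = rng.standard_normal(128).astype(np.float16)
psums = gemv64(w, a)
b = np.array([psums[r % 8, r // 8] for r in range(64)])
assert np.allclose(b, w.astype(np.float32) @ a.astype(np.float32), atol=1e-4)
```

After the 64th iteration all \ac{grf}-B entries are occupied, so larger matrices require the partial sums to be read out before the microkernel can be restarted.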
A further increase in the total number of rows can be achieved by distributing the weight matrix over multiple \acp{pch} and running the microkernel multiple times, concatenating the output vectors on the host at the end.
To increase the number of columns, new entries of the input vector must be loaded into the processing units.
Therefore, it is necessary to execute the entire \ac{gemv} microkernel several times with different input vector chunks and weight matrix columns, and merge the resulting output vectors by adding them on the host.
In general, the more the dimensions exceed the native \ac{pim} matrix dimensions, the more often the \ac{mac} core of the \ac{gemv} microkernel must be executed.
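This tiling scheme can be sketched as follows, with hypothetical helper names and a plain matrix-vector product standing in for one native microkernel run: extra rows become further runs whose outputs are concatenated, extra columns become runs over new input-vector chunks whose outputs are added on the host.

```python
import numpy as np

TILE_ROWS, TILE_COLS = 8, 128  # native pCH GEMV dimensions in the example

def native_gemv(tile, chunk):
    """Stand-in for one (8x128)x(128) microkernel run on a pCH."""
    return tile.astype(np.float32) @ chunk.astype(np.float32)

def tiled_gemv(w, a):
    """Map an arbitrary (M x K) GEMV onto the native PIM dimensions."""
    M, K = w.shape
    b = np.zeros(M, dtype=np.float32)
    for r in range(0, M, TILE_ROWS):      # row tiles -> concatenated outputs
        for c in range(0, K, TILE_COLS):  # column tiles -> summed on the host
            b[r:r + TILE_ROWS] += native_gemv(w[r:r + TILE_ROWS, c:c + TILE_COLS],
                                              a[c:c + TILE_COLS])
    return b

rng = np.random.default_rng(2)
w = rng.standard_normal((32, 512)).astype(np.float16)
a = rng.standard_normal(512).astype(np.float16)
out = tiled_gemv(w, a)
assert np.allclose(out, w.astype(np.float32) @ a.astype(np.float32), atol=1e-3)
```

The number of microkernel invocations grows with the product of the row and column tile counts, which makes the fixed per-run overhead increasingly relevant for large matrices.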
\subsubsection{Performance and Power Efficiency Effects}
\label{sec:fimdram_performance}
In addition to the theoretical bandwidth of $\qty[per-mode=symbol]{128}{\giga\byte\per\second}$ per \ac{pch} that is provided to the \ac{pim} units, or a total of $\qty[per-mode=symbol]{2}{\tera\byte\per\second}$ for 16 \acp{pch}, Samsung also ran experiments on a real implementation of \aca{fimdram} to analyze its performance gains and power efficiency improvements.
This real system is based on a Xilinx Zynq UltraScale+ \ac{fpga} that is integrated onto the same silicon interposer as four \aca{hbm} stacks, each consisting of one buffer die, four \aca{fimdram} dies, and four regular \aca{hbm} dies \cite{lee2021}.
Results promise performance gains in the range of $\qtyrange{1.4}{11.2}{\times}$ in the tested microbenchmarks, with the highest gain of $\qty{11.2}{\times}$ for a \ac{gemv} kernel.
Real layers of \acp{dnn} achieved a performance gain in the range of $\qtyrange{1.4}{3.5}{\times}$.
The power consumption of the \aca{fimdram} dies themselves is $\qty{5.4}{\percent}$ higher than that of regular \aca{hbm}.
However, the increased processing bandwidth and the reduced power consumption on the global \ac{io}-bus led to an $\qty{8.25}{\percent}$ higher energy efficiency for a \ac{gemv} kernel, and a $\qtyrange{1.38}{3.2}{\times}$ higher efficiency for real \ac{dnn} layers.
In conclusion, \aca{fimdram} is one of the few real \ac{pim} implementations by hardware vendors at this time and promises significant performance gains and higher power efficiency compared to regular \aca{hbm} \ac{dram}.
\Cref{sec:vp} introduces the concept of virtual prototyping, which forms the basis for the subsequent implementation of the \aca{fimdram} model in a simulator.
\begin{landscape}
\begin{figure}
\input{images/matrix_layout}
\caption[Mapping of the weight matrix onto the memory banks and its layout in the linear address space.]{Mapping of the weight matrix onto the memory banks and its layout in the linear address space.}
\label{img:matrix_layout}
\end{figure}
\end{landscape}