\section{Processing-in-Memory}
\label{sec:pim}

% General overview here...
% has been discussed since the 1970s...
% renewed momentum through DNNs...

\subsection{Applicable Workloads}
\label{sec:pim_workloads}

As already discussed in Section~\ref{sec:introduction}, \ac{pim} is a good fit for accelerating memory-bound workloads with low operational intensity.
In contrast, compute-bound workloads tend to have high data reuse and can make extensive use of the on-chip cache, and therefore do not need to utilize the full memory bandwidth.
For such problems, \ac{pim} is only of limited use.

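This distinction can be made concrete with a back-of-the-envelope operational-intensity calculation in the spirit of the roofline model, comparing a matrix-vector product against a matrix-matrix product. The layer width and the 4-byte element size in the following Python sketch are illustrative assumptions, not measurements:

```python
def intensity_matvec(n, bytes_per_elem=4):
    # y = W x: 2*n*n FLOPs (one multiply and one add per matrix element),
    # while the n*n matrix plus two n-vectors are each moved exactly once.
    flops = 2 * n * n
    bytes_moved = (n * n + 2 * n) * bytes_per_elem
    return flops / bytes_moved

def intensity_matmul(n, bytes_per_elem=4):
    # C = A B: 2*n^3 FLOPs over only 3*n^2 elements -> high data reuse.
    flops = 2 * n ** 3
    bytes_moved = 3 * n * n * bytes_per_elem
    return flops / bytes_moved

n = 4096
print(f"matrix-vector: {intensity_matvec(n):.2f} FLOP/byte")  # ~0.5 -> memory-bound
print(f"matrix-matrix: {intensity_matmul(n):.1f} FLOP/byte")  # ~682.7 -> compute-bound
```

With an operational intensity far below typical machine balance points of tens of FLOPs per byte, the matrix-vector product saturates the memory bandwidth long before the compute units.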
Many layers of modern \acp{dnn} can be expressed as a matrix-vector multiplication.
The layer inputs can be represented as a vector, and the model weights can be viewed as a matrix whose number of columns equals the size of the input vector and whose number of rows equals the size of the output vector.
Pairwise multiplication of the input vector with a row of the matrix, followed by summation of the products, yields one entry of the output vector.
This process is illustrated in Figure~\ref{img:dnn}, where one \ac{dnn} layer is processed.

\begin{figure}
\centering
\input{images/dnn}
\caption[A fully connected \ac{dnn} layer]{A fully connected \ac{dnn} layer \cite{he2020}.}
\label{img:dnn}
\end{figure}

Such an operation, defined in the widely used \ac{blas} library \cite{blas1979}, is also known as a \acs{gemv} routine.
Because each matrix element is used exactly once in the calculation of the output vector, there is no data reuse of the matrix.
Further, as the weight matrices tend to be too large to fit in the on-chip cache, such a \ac{gemv} operation is deeply memory-bound \cite{he2020}.
As a result, such an operation is a good fit for \ac{pim}.

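The access pattern behind this argument can be illustrated in a few lines of plain Python (a didactic sketch, not the actual \ac{blas} interface): every element of the weight matrix is read exactly once, so a cache cannot amortize its transfer cost.

```python
def gemv(W, x):
    """Compute y = W x. Each W[i][j] is touched exactly once, so the
    matrix has no temporal reuse and the kernel is memory-bound."""
    y = [0.0] * len(W)
    for i, row in enumerate(W):        # one output entry per matrix row
        acc = 0.0
        for w_ij, x_j in zip(row, x):  # pairwise multiply, then sum
            acc += w_ij * x_j
        y[i] = acc
    return y

# A toy layer with two inputs and two outputs.
W = [[1.0, 2.0],
     [3.0, 4.0]]
x = [10.0, 1.0]
print(gemv(W, x))  # [12.0, 34.0]
```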
\subsection{PIM Architectures}
\label{sec:pim_architectures}

Many different \ac{pim} architectures have been proposed by research in the past, and more recently real implementations have been presented by hardware vendors.
These proposals differ mainly in where the processing is placed, ranging from analogue distribution of capacitor charges at the \ac{subarray} level to additional processing units at the global \ac{io} level.
In essence, the placement options can be summarized as follows \cite{sudarshan2022}:

\begin{enumerate}
\item Inside the memory \ac{subarray}.
\item In the \ac{psa} region near a \ac{subarray}.
\item Outside the bank in its peripheral region.
\item In the \ac{io} region of the memory.
\end{enumerate}

Each of these approaches comes with different advantages and disadvantages.
In short, the closer the processing is to the memory \acs{subarray}, the higher the energy efficiency and the achievable processing bandwidth.
On the other hand, the integration of the \ac{pim} units becomes more difficult, as area and power constraints become tighter \cite{sudarshan2022}.

Processing inside the \ac{subarray} offers the highest achievable level of parallelism, with the number of operand bits equal to the size of the row.
It also requires the least amount of energy, as the data only needs to be moved from the \acs{subarray} into the \acp{psa} to perform operations on it.
The downside of this approach is the need to modify the highly optimized \ac{subarray} architecture.
An example of such an approach is Ambit \cite{seshadri2020}.
Ambit provides a mechanism to activate multiple rows within a \ac{subarray} at once and perform bulk bitwise operations such as AND, OR, and NOT on the row data.

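Functionally, Ambit's simultaneous activation of three rows resolves each bit position to the majority value of the three cells; AND and OR then follow by pre-initializing a control row to all zeros or all ones \cite{seshadri2020}. The following Python sketch models only this bit-level behaviour, not the underlying charge-sharing circuit:

```python
def triple_row_activate(a, b, c):
    # Charge sharing across three simultaneously activated cells drives
    # each bitline to the majority value MAJ(a, b, c).
    return [int(x + y + z >= 2) for x, y, z in zip(a, b, c)]

def bulk_and(a, b):
    zeros = [0] * len(a)                     # control row set to all 0s
    return triple_row_activate(a, b, zeros)  # MAJ(a, b, 0) = a AND b

def bulk_or(a, b):
    ones = [1] * len(a)                      # control row set to all 1s
    return triple_row_activate(a, b, ones)   # MAJ(a, b, 1) = a OR b

row_a = [1, 0, 1, 1]
row_b = [1, 1, 0, 1]
print(bulk_and(row_a, row_b))  # [1, 0, 0, 1]
print(bulk_or(row_a, row_b))   # [1, 1, 1, 1]
```

In the real device the operation applies to an entire \ac{dram} row at once, which is what makes the operations "bulk" bitwise.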
Far fewer, but still challenging, constraints are placed on the integration of compute units in the region of the \acp{psa}.
The authors of \cite{sudarshan2022a} present a two-stage design that integrates current-mirror-based analogue units near the \ac{subarray}, enabling the \ac{mac} operations used in \ac{dnn} applications.

The integration of compute units in the \ac{io} region of the bank makes area-intensive operations such as ADD, \ac{mac}, or \ac{mad} possible.
This leaves the highly optimized \ac{subarray} and \ac{psa} regions as they are, and only reduces the memory density per die to make room for the additional compute units.
However, the achievable level of parallelism is lower than in the other approaches and is defined by the prefetch architecture, i.e., the maximum burst size of the memory banks.

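Concretely, the per-bank SIMD width of such a design follows directly from the prefetch width. The numbers in this sketch (a 256-bit per-bank prefetch and 16-bit operands) are illustrative assumptions rather than the parameters of any specific product:

```python
def simd_lanes(prefetch_bits, operand_bits):
    # A bank-level PIM unit can only operate on the data that one burst
    # delivers, so its parallelism is prefetch width / operand width.
    return prefetch_bits // operand_bits

# Illustrative: 256-bit per-bank prefetch, 16-bit operands.
print(simd_lanes(256, 16))  # 16 parallel 16-bit lanes per bank
```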
Placing the compute units in the \ac{io} region of the \ac{dram} has the fewest physical limitations and makes complex accelerators possible.
The downside is that bank parallelism cannot be exploited to perform multiple computations simultaneously at the bank level.
Also, the energy required to move data to the \ac{io} boundary of the \ac{dram} is much higher than in the other approaches.

In the following, three \ac{pim} approaches that place the compute units at the bank \ac{io} boundary are highlighted in more detail.

\subsection{UPMEM}
\label{sec:pim_upmem}

\subsection{Newton AiM}
\label{sec:pim_newton}

% gddr (device-based)

\subsection{FIMDRAM/HBM-PIM}
\label{sec:pim_fim}

% differences to Hynix PIM