\section{Processing-in-Memory}
\label{sec:pim}

% General overview here...
% has been discussed since the 1970s...
% renewed momentum through DNNs...

\subsection{Applicable Workloads}
\label{sec:pim_workloads}

As already discussed in Section~\ref{sec:introduction}, \ac{pim} is a good fit for accelerating memory-bound workloads with low operational intensity.
In contrast, compute-bound workloads tend to have high data reuse and can make extensive use of the on-chip cache, and therefore do not need to utilize the full memory bandwidth.
For such problems, \ac{pim} is only of limited use.

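This distinction can be made concrete with a back-of-the-envelope operational-intensity calculation in the spirit of the roofline model, comparing a matrix-vector product against a matrix-matrix product. The layer width and the 4-byte element size in the following Python sketch are illustrative assumptions, not measurements:

```python
def intensity_matvec(n, bytes_per_elem=4):
    # y = W x: 2*n*n FLOPs (one multiply and one add per matrix element),
    # while the n*n matrix plus two n-vectors are each moved exactly once.
    flops = 2 * n * n
    bytes_moved = (n * n + 2 * n) * bytes_per_elem
    return flops / bytes_moved

def intensity_matmul(n, bytes_per_elem=4):
    # C = A B: 2*n^3 FLOPs over only 3*n^2 elements -> high data reuse.
    flops = 2 * n ** 3
    bytes_moved = 3 * n * n * bytes_per_elem
    return flops / bytes_moved

n = 4096
print(f"matrix-vector: {intensity_matvec(n):.2f} FLOP/byte")  # ~0.5 -> memory-bound
print(f"matrix-matrix: {intensity_matmul(n):.1f} FLOP/byte")  # ~682.7 -> compute-bound
```

With an operational intensity far below typical machine balance points of tens of FLOPs per byte, the matrix-vector product saturates the memory bandwidth long before the compute units.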
Many layers of modern \acp{dnn} can be expressed as a matrix-vector multiplication.
The layer inputs can be represented as a vector, and the model weights can be viewed as a matrix whose number of columns equals the size of the input vector and whose number of rows equals the size of the output vector.
Pairwise multiplication of the input vector with a row of the matrix, followed by summation of the products, yields one entry of the output vector.
This process is illustrated in Figure~\ref{img:dnn}, where one \ac{dnn} layer is processed.

\begin{figure}
\centering
\input{images/dnn}
\caption[A fully connected \ac{dnn} layer]{A fully connected \ac{dnn} layer \cite{he2020}.}
\label{img:dnn}
\end{figure}

Such an operation, defined in the widely used \ac{blas} library \cite{blas1979}, is also known as a \acs{gemv} routine.
Because each matrix element is used exactly once in the calculation of the output vector, there is no data reuse of the matrix.
Further, as the weight matrices tend to be too large to fit in the on-chip cache, such a \ac{gemv} operation is deeply memory-bound \cite{he2020}.
As a result, such an operation is a good fit for \ac{pim}.

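The access pattern behind this argument can be illustrated in a few lines of plain Python (a didactic sketch, not the actual \ac{blas} interface): every element of the weight matrix is read exactly once, so a cache cannot amortize its transfer cost.

```python
def gemv(W, x):
    """Compute y = W x. Each W[i][j] is touched exactly once, so the
    matrix has no temporal reuse and the kernel is memory-bound."""
    y = [0.0] * len(W)
    for i, row in enumerate(W):        # one output entry per matrix row
        acc = 0.0
        for w_ij, x_j in zip(row, x):  # pairwise multiply, then sum
            acc += w_ij * x_j
        y[i] = acc
    return y

# A toy layer with two inputs and two outputs.
W = [[1.0, 2.0],
     [3.0, 4.0]]
x = [10.0, 1.0]
print(gemv(W, x))  # [12.0, 34.0]
```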
\subsection{PIM Architectures}
\label{sec:pim_architectures}

Many different \ac{pim} architectures have been proposed by research in the past, and more recently real implementations have been presented by hardware vendors.
These proposals differ mainly in where the processing is placed, ranging from analogue distribution of capacitor charges at the \ac{subarray} level to additional processing units at the global \ac{io} level.
In essence, the placement options can be summarized as follows \cite{sudarshan2022}:

\begin{enumerate}
\item Inside the memory \ac{subarray}.
\item In the \ac{psa} region near a \ac{subarray}.
\item Outside the bank in its peripheral region.
\item In the \ac{io} region of the memory.
\end{enumerate}

Each of these approaches comes with different advantages and disadvantages.
In short, the closer the processing is to the memory \acs{subarray}, the higher the energy efficiency and the achievable processing bandwidth.
On the other hand, the integration of the \ac{pim} units becomes more difficult, as area and power constraints become tighter \cite{sudarshan2022}.

Processing inside the \ac{subarray} offers the highest achievable level of parallelism, with the number of operand bits equal to the size of the row.
It also requires the least amount of energy, as the data only needs to be moved from the \acs{subarray} into the \acp{psa} to perform operations on it.
The downside of this approach is the need to modify the highly optimized \ac{subarray} architecture.
An example of such an approach is Ambit \cite{seshadri2020}.
Ambit provides a mechanism to activate multiple rows within a \ac{subarray} at once and perform bulk bitwise operations such as AND, OR, and NOT on the row data.

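Functionally, Ambit's simultaneous activation of three rows resolves each bit position to the majority value of the three cells; AND and OR then follow by pre-initializing a control row to all zeros or all ones \cite{seshadri2020}. The following Python sketch models only this bit-level behaviour, not the underlying charge-sharing circuit:

```python
def triple_row_activate(a, b, c):
    # Charge sharing across three simultaneously activated cells drives
    # each bitline to the majority value MAJ(a, b, c).
    return [int(x + y + z >= 2) for x, y, z in zip(a, b, c)]

def bulk_and(a, b):
    zeros = [0] * len(a)                     # control row set to all 0s
    return triple_row_activate(a, b, zeros)  # MAJ(a, b, 0) = a AND b

def bulk_or(a, b):
    ones = [1] * len(a)                      # control row set to all 1s
    return triple_row_activate(a, b, ones)   # MAJ(a, b, 1) = a OR b

row_a = [1, 0, 1, 1]
row_b = [1, 1, 0, 1]
print(bulk_and(row_a, row_b))  # [1, 0, 0, 1]
print(bulk_or(row_a, row_b))   # [1, 1, 1, 1]
```

In the real device the operation applies to an entire \ac{dram} row at once, which is what makes the operations "bulk" bitwise.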
Far fewer, but still challenging, constraints are placed on the integration of compute units in the region of the \acp{psa}.
The authors of \cite{sudarshan2022a} present a two-stage design that integrates current-mirror-based analogue units near the \ac{subarray}, enabling the \ac{mac} operations used in \ac{dnn} applications.

The integration of compute units in the \ac{io} region of the bank makes area-intensive operations such as ADD, \ac{mac}, or \ac{mad} possible.
This leaves the highly optimized \ac{subarray} and \ac{psa} regions as they are, and only reduces the memory density per die to make room for the additional compute units.
However, the achievable level of parallelism is lower than in the other approaches and is defined by the prefetch architecture, i.e., the maximum burst size of the memory banks.

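Concretely, the per-bank SIMD width of such a design follows directly from the prefetch width. The numbers in this sketch (a 256-bit per-bank prefetch and 16-bit operands) are illustrative assumptions rather than the parameters of any specific product:

```python
def simd_lanes(prefetch_bits, operand_bits):
    # A bank-level PIM unit can only operate on the data that one burst
    # delivers, so its parallelism is prefetch width / operand width.
    return prefetch_bits // operand_bits

# Illustrative: 256-bit per-bank prefetch, 16-bit operands.
print(simd_lanes(256, 16))  # 16 parallel 16-bit lanes per bank
```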
Placing the compute units in the \ac{io} region of the \ac{dram} has the fewest physical limitations and makes complex accelerators possible.
The downside is that bank parallelism cannot be exploited to perform multiple computations simultaneously at the bank level.
Also, the energy required to move data to the \ac{io} boundary of the \ac{dram} is much higher than in the other approaches.

In the following, three \ac{pim} approaches that place the compute units at the bank \ac{io} boundary are highlighted in more detail.

\subsection{UPMEM}
\label{sec:pim_upmem}

\subsection{Newton AiM}
\label{sec:pim_newton}

% gddr (device-based)

\subsection{FIMDRAM/HBM-PIM}
\label{sec:pim_fim}

% differences to Hynix PIM