More PIM overview

This commit is contained in:
2024-02-05 19:11:23 +01:00
parent 9bf055ba97
commit 35d971a298
2 changed files with 29 additions and 4 deletions

@@ -163,6 +163,14 @@
short = GEMM,
long = matrix matrix multiply,
}
\DeclareAcronym{mac}{
short = MAC,
long = multiply-accumulate,
}
\DeclareAcronym{mad}{
short = MAD,
long = multiply-add,
}
\DeclareAcronym{tlm}{
short = TLM,
long = transaction-level modeling,

@@ -8,7 +8,7 @@
\subsection{Applicable Workloads}
\label{sec:pim_workloads}
-As already discussed in Section \ref{sec:introduction}, \ac{pim} is a good fit for accelerating memory-bound workloads.
+As already discussed in Section \ref{sec:introduction}, \ac{pim} is a good fit for accelerating memory-bound workloads with low operational intensity.
In contrast, compute-bound workloads tend to have high data reuse and can make extensive use of the on-chip caches, and therefore do not need to utilize the full memory bandwidth.
For such problems, \ac{pim} is of limited use.
@@ -44,11 +44,28 @@ In essence, these placements of the approaches can be summarised as follows \cit
\end{enumerate}
Each of these approaches comes with different advantages and disadvantages.
-In short, the nearer the processing happens to the memory \acs{subarray}, the higher is the achievable processing bandwidth.
-But also, the integration of the \ac{pim} units becomes more difficult as area and power constraints restrict the integration.
+In short, the nearer the processing happens to the memory \acs{subarray}, the higher are the achievable energy efficiency and processing bandwidth.
+On the other hand, the integration of the \ac{pim} units becomes more difficult, as area and power constraints restrict it \cite{sudarshan2022}.
% short overview of the PIM categories (paper from the chair)
In the following, these three placement approaches are discussed in more detail.
Processing inside the \ac{subarray} offers the highest achievable level of parallelism, with the number of operand bits equal to the size of a row.
It also requires the least amount of energy to load the data from the \acs{subarray} into the \acp{psa} to perform operations on it.
The downside of this approach is the need to alter the highly density-optimized \ac{subarray} architecture.
One example of such an approach is Ambit \cite{seshadri2020}.
Ambit provides a mechanism to activate multiple rows within a \ac{subarray} at once and perform bulk bitwise operations such as AND, OR and NOT on the row data.
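The bulk bitwise operations enabled by such multi-row activation can be sketched in software: activating three rows simultaneously yields the bitwise majority of their contents, and AND/OR fall out by fixing one row to all zeros or all ones. The following is a minimal Python model of that idea (the function names and the 8-bit row width are illustrative assumptions, not Ambit's actual interface):

```python
def maj(a: int, b: int, c: int) -> int:
    # Bitwise 3-input majority: the primitive produced when three
    # rows are activated simultaneously (simplified software model).
    return (a & b) | (b & c) | (a & c)

ONES = 0xFF  # control row pre-initialised to all ones (8-bit rows assumed)

def bulk_and(a: int, b: int) -> int:
    # AND emerges from a majority with an all-zero control row.
    return maj(a, b, 0)

def bulk_or(a: int, b: int) -> int:
    # OR emerges from a majority with an all-one control row.
    return maj(a, b, ONES)

print(bin(bulk_and(0b11001100, 0b10101010)))  # 0b10001000
print(bin(bulk_or(0b11001100, 0b10101010)))   # 0b11101110
```

Note that every bit position is processed in parallel, which is where the row-wide parallelism of in-\acs{subarray} processing comes from.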
Far fewer but still challenging constraints are posed on the integration of compute units in the region of the \acp{psa}.
The authors of \cite{sudarshan2022a} introduce a two-stage design that integrates current-mirror-based analog units near the \ac{subarray}, enabling the \ac{mac} operations used in \ac{dnn} applications.
The integration of compute units at the \ac{io} region of the bank makes area-intensive operations such as ADD, \ac{mac} or \ac{mad} possible.
This leaves the highly optimized \ac{subarray} and \ac{psa} regions as-is and only reduces the memory density per die to make space for the additional compute units.
However, the achievable level of parallelism is lower than in the other approaches and is defined by the prefetch architecture, i.e., the maximum burst size of the memory banks.
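Why the prefetch width caps the parallelism can be illustrated with a toy Python model: each column access delivers one burst of words to the bank \ac{io} region, so a lane-wise operation can touch at most that many elements per step. The burst size of eight words and all names below are assumptions, loosely mirroring a DDR4-style 8n prefetch:

```python
BURST_WORDS = 8  # assumed prefetch depth: words delivered per column access

def bank_io_mac(acc, a_burst, b_burst):
    # One compute step at the bank I/O region: a burst of operands
    # arrives and a MAC is applied lane-wise, so at most BURST_WORDS
    # operations execute in parallel.
    assert len(acc) == len(a_burst) == len(b_burst) == BURST_WORDS
    return [p + a * b for p, a, b in zip(acc, a_burst, b_burst)]

# A 32-element dot product therefore needs 32 / BURST_WORDS = 4 bursts.
a = list(range(32))
b = [2] * 32
acc = [0] * BURST_WORDS
for i in range(0, 32, BURST_WORDS):
    acc = bank_io_mac(acc, a[i:i + BURST_WORDS], b[i:i + BURST_WORDS])
print(sum(acc))  # 992
```

Contrast this with the in-\acs{subarray} case, where the lane count equals the row width rather than the burst size.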
Placing the compute units at the \ac{io} region of the \ac{dram} has the fewest physical restrictions and makes complex accelerators possible.
On the downside, the bank parallelism cannot be exploited to perform multiple computations concurrently at the bank level.
Also, the energy required to move data to the \ac{io} boundary of the \ac{dram} is far higher than in the other approaches.
In the following, three \ac{pim} approaches that place the compute units at the bank \ac{io} boundary are highlighted in more detail.
\subsection{UPMEM}
\label{sec:pim_upmem}