From 35d971a2984f9941903efc4726255289851c1bae Mon Sep 17 00:00:00 2001 From: Derek Christ Date: Mon, 5 Feb 2024 19:11:23 +0100 Subject: [PATCH] More PIM overview --- src/acronyms.tex | 8 ++++++++ src/chapters/pim.tex | 25 +++++++++++++++++++++---- 2 files changed, 29 insertions(+), 4 deletions(-) diff --git a/src/acronyms.tex b/src/acronyms.tex index f4e845c..1292e0f 100644 --- a/src/acronyms.tex +++ b/src/acronyms.tex @@ -163,6 +163,14 @@ short = GEMM, long = matrix matrix multiply, } +\DeclareAcronym{mac}{ + short = MAC, + long = multiply-accumulate, +} +\DeclareAcronym{mad}{ + short = MAD, + long = multiply-add, +} \DeclareAcronym{tlm}{ short = TLM, long = transaction-level modeling, diff --git a/src/chapters/pim.tex b/src/chapters/pim.tex index de3e0e2..7ab4b40 100644 --- a/src/chapters/pim.tex +++ b/src/chapters/pim.tex @@ -8,7 +8,7 @@ \subsection{Applicable Workloads} \label{sec:pim_workloads} -As already discussed in Section \ref{sec:introduction}, \ac{pim} is a good fit for accelerating memory-bound workloads. +As already discussed in Section \ref{sec:introduction}, \ac{pim} is a good fit for accelerating memory-bound workloads with low operational intensity. In contrast, compute-bound workloads tend to have high data reuse and can make excessive use of the on-chip cache and therefore do not need to utilize the full memory bandwidth. For problems like this, \ac{pim} is only of limited use. @@ -44,11 +44,28 @@ In essence, these placements of the approaches can be summarised as follows \cit \end{enumerate} Each of these approaches come with different advantages and disadvantages. -In short, the nearer the processing happens to the memory \acs{subarray}, the higher is the achievable processing bandwidth. -But also, the integration of the \ac{pim} units becomes more difficult as area and power constraints restrict the integration. 
+In short, the closer the processing happens to the memory \acs{subarray}, the higher the energy efficiency and the achievable processing bandwidth. +On the other hand, the integration of the \ac{pim} units becomes more difficult because of tighter area and power constraints \cite{sudarshan2022}. % kurzer overview über die kategorien von PIM (paper vom lehrstuhl) -In the following, three \ac{pim} approaches are highlighted in more detail. +Processing inside the \ac{subarray} offers the highest achievable level of parallelism, with the number of operand bits equal to the size of a row. +It also requires the least amount of energy, as the data only has to be loaded from the \acs{subarray} into the \acp{psa} to perform operations on it. +The downside of this approach is the need to alter the \ac{subarray} architecture, which is highly optimized for density. +One example of such an approach is Ambit \cite{seshadri2020}. +Ambit provides a mechanism to activate multiple rows within a \ac{subarray} at once and to perform bulk bitwise operations such as AND, OR and NOT on the row data. + +The integration of compute units in the region of the \acp{psa} is subject to far fewer, but still challenging, constraints. +The work in \cite{sudarshan2022a} introduces a two-stage design that integrates current-mirror-based analog units near the \ac{subarray}, enabling the \ac{mac} operations used in \ac{dnn} applications. + +The integration of compute units at the \ac{io} region of the bank makes area-intensive operations such as ADD, \ac{mac} or \ac{mad} possible. +This leaves the highly optimized \ac{subarray} and \ac{psa} regions as-is and only reduces the memory density per die to make space for the additional compute units. +However, the achievable level of parallelism is lower than in the other approaches and is defined by the prefetch architecture, i.e., the maximum burst size of the memory banks. 
+ +Placing the compute units at the \ac{io} region of the \ac{dram} has the fewest physical restrictions and makes complex accelerators possible. +On the downside, the bank parallelism cannot be exploited to perform multiple computations concurrently on a bank-wise level. +Also, the energy required to move data to the \ac{io} boundary of the \ac{dram} is far higher than in the other approaches. + +In the following, three \ac{pim} approaches that place the compute units at the bank \ac{io} boundary are highlighted in more detail. \subsection{UPMEM} \label{sec:pim_upmem}