More PIM overview

This commit is contained in:
2024-02-05 19:11:23 +01:00
parent 9bf055ba97
commit 35d971a298
2 changed files with 29 additions and 4 deletions

@@ -163,6 +163,14 @@
short = GEMM,
long = matrix matrix multiply,
}
\DeclareAcronym{mac}{
short = MAC,
long = multiply-accumulate,
}
\DeclareAcronym{mad}{
short = MAD,
long = multiply-add,
}
\DeclareAcronym{tlm}{
short = TLM,
long = transaction-level modeling,

@@ -8,7 +8,7 @@
\subsection{Applicable Workloads}
\label{sec:pim_workloads}
-As already discussed in Section \ref{sec:introduction}, \ac{pim} is a good fit for accelerating memory-bound workloads.
+As already discussed in Section \ref{sec:introduction}, \ac{pim} is a good fit for accelerating memory-bound workloads with low operational intensity.
In contrast, compute-bound workloads tend to have high data reuse and can make extensive use of the on-chip caches, and therefore do not need to utilize the full memory bandwidth.
For such problems, \ac{pim} is of limited use.
@@ -44,11 +44,28 @@ In essence, these placements of the approaches can be summarised as follows \cit
\end{enumerate}
Each of these approaches comes with different advantages and disadvantages.
-In short, the nearer the processing happens to the memory \acs{subarray}, the higher is the achievable processing bandwidth.
-But also, the integration of the \ac{pim} units becomes more difficult as area and power constraints restrict the integration.
+In short, the nearer the processing happens to the memory \acs{subarray}, the higher are the achievable energy efficiency and processing bandwidth.
+On the other hand, the integration of the \ac{pim} units becomes more difficult, as area and power constraints restrict it \cite{sudarshan2022}.
% short overview of the PIM categories (paper from the chair)
In the following, these three placement approaches are discussed in more detail.
Processing inside the \ac{subarray} offers the highest achievable level of parallelism, with the number of operand bits equal to the size of a row.
It also requires the least amount of energy to load the data from the \acs{subarray} into the \acp{psa} to perform operations on it.
The downside of this approach is the need to alter the highly density-optimized \ac{subarray} architecture.
One example of such an approach is Ambit \cite{seshadri2020}.
Ambit provides a mechanism to activate multiple rows within a \ac{subarray} at once and perform bulk bitwise operations such as AND, OR and NOT on the row data.
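The bulk bitwise operations enabled by such multi-row activation can be sketched in software: activating three rows simultaneously yields the bitwise majority of their contents, and AND/OR fall out by fixing one row to all zeros or all ones. The following is a minimal Python model of that idea (the function names and the 8-bit row width are illustrative assumptions, not Ambit's actual interface):

```python
def maj(a: int, b: int, c: int) -> int:
    # Bitwise 3-input majority: the primitive produced when three
    # rows are activated simultaneously (simplified software model).
    return (a & b) | (b & c) | (a & c)

ONES = 0xFF  # control row pre-initialised to all ones (8-bit rows assumed)

def bulk_and(a: int, b: int) -> int:
    # AND emerges from a majority with an all-zero control row.
    return maj(a, b, 0)

def bulk_or(a: int, b: int) -> int:
    # OR emerges from a majority with an all-one control row.
    return maj(a, b, ONES)

print(bin(bulk_and(0b11001100, 0b10101010)))  # 0b10001000
print(bin(bulk_or(0b11001100, 0b10101010)))   # 0b11101110
```

Note that every bit position is processed in parallel, which is where the row-wide parallelism of in-\acs{subarray} processing comes from.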
Far fewer but still challenging constraints are posed on the integration of compute units in the region of the \acp{psa}.
The authors of \cite{sudarshan2022a} introduce a two-stage design that integrates current-mirror-based analog units near the \ac{subarray}, enabling the \ac{mac} operations used in \ac{dnn} applications.
The integration of compute units at the \ac{io} region of the bank makes area-intensive operations such as ADD, \ac{mac} or \ac{mad} possible.
This leaves the highly optimized \ac{subarray} and \ac{psa} regions as-is and only reduces the memory density per die to make space for the additional compute units.
However, the achievable level of parallelism is lower than in the other approaches and is defined by the prefetch architecture, i.e., the maximum burst size of the memory banks.
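Why the prefetch width caps the parallelism can be illustrated with a toy Python model: each column access delivers one burst of words to the bank \ac{io} region, so a lane-wise operation can touch at most that many elements per step. The burst size of eight words and all names below are assumptions, loosely mirroring a DDR4-style 8n prefetch:

```python
BURST_WORDS = 8  # assumed prefetch depth: words delivered per column access

def bank_io_mac(acc, a_burst, b_burst):
    # One compute step at the bank I/O region: a burst of operands
    # arrives and a MAC is applied lane-wise, so at most BURST_WORDS
    # operations execute in parallel.
    assert len(acc) == len(a_burst) == len(b_burst) == BURST_WORDS
    return [p + a * b for p, a, b in zip(acc, a_burst, b_burst)]

# A 32-element dot product therefore needs 32 / BURST_WORDS = 4 bursts.
a = list(range(32))
b = [2] * 32
acc = [0] * BURST_WORDS
for i in range(0, 32, BURST_WORDS):
    acc = bank_io_mac(acc, a[i:i + BURST_WORDS], b[i:i + BURST_WORDS])
print(sum(acc))  # 992
```

Contrast this with the in-\acs{subarray} case, where the lane count equals the row width rather than the burst size.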
Placing the compute units at the \ac{io} region of the \ac{dram} has the fewest physical restrictions and makes complex accelerators possible.
On the downside, the bank parallelism cannot be exploited to perform multiple computations concurrently at the bank level.
Also, the energy required to move data to the \ac{io} boundary of the \ac{dram} is far higher than in the other approaches.
In the following, three \ac{pim} approaches that place the compute units at the bank \ac{io} boundary are highlighted in more detail.
\subsection{UPMEM}
\label{sec:pim_upmem}