UPMEM

2024-02-06 22:25:41 +01:00
parent 0c33c99c61
commit 5f52c3ae9f
2 changed files with 25 additions and 0 deletions
--- a/src/acronyms.tex
+++ b/src/acronyms.tex
@@ -171,6 +171,18 @@
    short = MAD,
    long = multiply-add,
 }
+\DeclareAcronym{dpu}{
+    short = DPU,
+    long = DRAM Processing Unit,
+}
+\DeclareAcronym{risc}{
+    short = RISC,
+    long = reduced instruction set computer,
+}
+\DeclareAcronym{sdk}{
+    short = SDK,
+    long = software development kit,
+}
 \DeclareAcronym{tlm}{
    short = TLM,
    long = transaction-level modeling,
--- a/src/chapters/pim.tex
+++ b/src/chapters/pim.tex
@@ -75,6 +75,19 @@ In the following, three \ac{pim} approaches that place the compute units at the
 \subsection{UPMEM}
 \label{sec:pim_upmem}

+The first publicly available real-world \ac{pim} architecture has been designed and built by the company UPMEM \cite{gomez-luna2022}.
+An UPMEM system combines regular DDR4 \ac{dimm}-based \ac{dram} with a set of \ac{pim}-enabled UPMEM \acp{dimm}, each consisting of several \ac{pim} chips.
+Each \ac{pim} chip contains eight \acp{dpu}, each of which has exclusive access to a $\qty{64}{\mega\byte}$ memory bank, a $\qty{24}{\kilo\byte}$ instruction memory and a $\qty{64}{\kilo\byte}$ scratchpad memory.
+The host processor can access the memory banks to copy input data from main memory and retrieve results.
+While copying, the data layout must be changed so that each data word is stored contiguously in a single \ac{pim} bank, in contrast to the horizontal \ac{dram} mapping used in regular \acp{dimm}, where a data word is split across multiple devices.
+UPMEM provides a \ac{sdk} that orchestrates the data movement from the main memory to the \ac{pim} banks and modifies the data layout.
+
+Each \ac{dpu} is a multithreaded $\qty{32}{bit}$ \ac{risc} core with a full set of general-purpose registers and a 14-stage pipeline.
+The \acp{dpu} execute compiled C code, built with a specialized compiler toolchain that provides only limited support for the standard library.
+With a system clock of $\qty{400}{\mega\hertz}$, the internal bandwidth between a \ac{dpu} and its memory bank amounts to $\qty[per-mode = symbol]{800}{\mega\byte\per\second}$.
+A system can integrate up to 20 UPMEM \acp{dimm} with 128 \acp{dpu} each.
+This gives a maximum aggregate \ac{pim} bandwidth of about $\qty[per-mode = symbol]{2}{\tera\byte\per\second}$ \cite{gomez-luna2022}.
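+As a rough check of this figure, assuming that all \acp{dpu} stream from their memory banks concurrently at the peak per-\ac{dpu} rate stated above, the aggregate bandwidth can be estimated as
+\[
+    128 \cdot 20 \cdot \qty[per-mode = symbol]{800}{\mega\byte\per\second}
+    = 2560 \cdot \qty[per-mode = symbol]{800}{\mega\byte\per\second}
+    = \qty[per-mode = symbol]{2.048}{\tera\byte\per\second}
+    \approx \qty[per-mode = symbol]{2}{\tera\byte\per\second}.
+\]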
+
 \subsection{Newton AiM}
 \label{sec:pim_newton}