diff --git a/src/acronyms.tex b/src/acronyms.tex
index 1292e0f..0c2e0e4 100644
--- a/src/acronyms.tex
+++ b/src/acronyms.tex
@@ -171,6 +171,18 @@
     short = MAD,
     long = multiply-add,
 }
+\DeclareAcronym{dpu}{
+    short = DPU,
+    long = DRAM processing unit,
+}
+\DeclareAcronym{risc}{
+    short = RISC,
+    long = reduced instruction set computer,
+}
+\DeclareAcronym{sdk}{
+    short = SDK,
+    long = software development kit,
+}
 \DeclareAcronym{tlm}{
     short = TLM,
     long = transaction-level modeling,
diff --git a/src/chapters/pim.tex b/src/chapters/pim.tex
index bf36189..8b572f6 100644
--- a/src/chapters/pim.tex
+++ b/src/chapters/pim.tex
@@ -75,6 +75,19 @@ In the following, three \ac{pim} approaches that place the compute units at the
 \subsection{UPMEM}
 \label{sec:pim_upmem}
 
+The first publicly available real-world \ac{pim} architecture was designed and built by the company UPMEM \cite{gomez-luna2022}.
+UPMEM combines regular DDR4 \ac{dimm}-based \ac{dram} with a set of \ac{pim}-enabled UPMEM \acp{dimm}, each consisting of several \ac{pim} chips.
+Each \ac{pim} chip contains 8 \acp{dpu}, each of which has exclusive access to a $\qty{64}{\mega\byte}$ memory bank, a $\qty{24}{\kilo\byte}$ instruction memory and a $\qty{64}{\kilo\byte}$ scratchpad memory.
+The host processor can access the memory banks to copy input data from main memory and to retrieve results.
+While copying, the data layout must be changed so that each data word is stored contiguously in a single \ac{pim} bank, in contrast to the horizontal \ac{dram} mapping used in regular \acp{dimm}, where a data word is split across multiple devices.
+UPMEM provides a \ac{sdk} that orchestrates the data movement from the main memory to the \ac{pim} banks and modifies the data layout.
+
+Each \ac{dpu} is a multithreaded $\qty{32}{\bit}$ \ac{risc} core with a full set of general-purpose registers and a 14-stage pipeline.
+The \acp{dpu} execute C code compiled with a specialized toolchain that provides limited support for the standard library.
+With a system clock of $\qty{400}{\mega\hertz}$, the internal bandwidth of a \ac{dpu} amounts to $\qty[per-mode = symbol]{800}{\mega\byte\per\second}$.
+A system can integrate up to 20 UPMEM \acp{dimm} with 128 \acp{dpu} each, for a total of 2560 \acp{dpu}.
+Multiplying the number of \acp{dpu} by the per-\ac{dpu} bandwidth gives a maximum aggregate \ac{pim} bandwidth of approximately $\qty[per-mode = symbol]{2}{\tera\byte\per\second}$ \cite{gomez-luna2022}.
+
 \subsection{Newton AiM}
 \label{sec:pim_newton}