Update on Overleaf.

This commit is contained in:
2024-03-26 09:47:58 +00:00
committed by node
parent 407848ada7
commit 86c6a326ee
2 changed files with 77 additions and 62 deletions

View File

@@ -35,6 +35,10 @@
 alt = FIMDRAM,
 long = Func\-tion-In-Memory DRAM,
 }
+\DeclareAcronym{hbm}{
+short = HBM,
+long = High Bandwidth Memory,
+}
 \DeclareAcronym{hbm2}{
 short = HBM2,
 long = High Bandwidth Memory 2,
@@ -51,10 +55,10 @@
 short = FPU,
 long = floating-point unit,
 }
-\DeclareAcronym{fp}{
-short = FP,
-long = floating-point,
-}
+% \DeclareAcronym{fp}{
+% short = FP,
+% long = floating-point,
+% }
 \DeclareAcronym{crf}{
 short = CRF,
 long = command register file,
@@ -123,10 +127,10 @@
 short = AAM,
 long = address aligned mode,
 }
-\DeclareAcronym{mac}{
-short = MAC,
-long = multiply-accumulate,
-}
+% \DeclareAcronym{mac}{
+% short = MAC,
+% long = multiply-accumulate,
+% }
 \DeclareAcronym{haxpy}{
 short = HAXPY,
 long = half precision $a \cdot x + y$,
@@ -135,14 +139,14 @@
 short = ReLU,
 long = rectified linear unit,
 }
-\DeclareAcronym{gpu}{
-short = GPU,
-long = graphics processing unit,
-}
-\DeclareAcronym{fpga}{
-short = FPGA,
-long = field-programmable gate array,
-}
+% \DeclareAcronym{gpu}{
+% short = GPU,
+% long = graphics processing unit,
+% }
+% \DeclareAcronym{fpga}{
+% short = FPGA,
+% long = field-programmable gate array,
+% }
 \DeclareAcronym{edp}{
 short = EDP,
 long = energy-delay product,
@@ -151,3 +155,15 @@
 short = HMC,
 long = Hybrid Memory Cube,
 }
+\DeclareAcronym{llm}{
+short = LLM,
+long = large language model,
+}
+\DeclareAcronym{dpu}{
+short = DPU,
+long = DRAM processing unit,
+}
+% \DeclareAcronym{vp}{
+% short = VP,
+% long = virtual prototype,
+% }

View File

@@ -71,9 +71,9 @@
 \maketitle
 %
 \begin{abstract}
-Data-driven applications are increasingly central to our information technology society, propelled by AI techniques reshaping various sectors of our economy and society. Despite their transformative potential, these applications demand immense data processing, leading to significant energy consumption primarily in communication and data storage rather than computation. The concept of \ac{pim} offers a solution by processing data within memory, reducing energy overheads associated with data transfer. PIM has been an enduring idea, with recent advancements in DRAM test chips integrating PIM functionality, indicating potential market adoption.
+Data-driven applications are increasingly central to our information technology society, propelled by AI techniques reshaping various sectors of our economy and society. Despite their transformative potential, these applications demand immense data processing, leading to significant energy consumption primarily in communication and data storage rather than computation. The concept of \ac{pim} offers a solution by processing data within memory, reducing energy overheads associated with data transfer. \Ac{pim} has been an enduring idea, with recent advancements in DRAM test chips integrating \ac{pim} functionality, indicating potential market adoption.
-This paper introduces a virtual prototype of Samsung's PIM-HBM architecture, leveraging open-source tools like gem5 and DRAMSys, along with a custom Rust software library facilitating easy utilization of PIM functionality. Key contributions include the first full-system simulation of HBM-PIM, experimental validation of the virtual platform with benchmarks, and the development of a Rust library enabling PIM functionality at the software level.
+This paper introduces a virtual prototype of Samsung's PIM-HBM architecture, leveraging open-source tools like gem5 and DRAMSys, along with a custom Rust software library facilitating easy utilization of \ac{pim} functionality. Key contributions include the first full-system simulation of PIM-HBM, experimental validation of the virtual platform with benchmarks, and the development of a Rust library enabling \ac{pim} functionality at the software level.
 TODO: Benchmark results
 \keywords{DRAM \and PIM \and Virtual Platforms}
 \end{abstract}
@@ -83,44 +83,45 @@ TODO: Benchmark results
 \section{Introduction}
 \label{sec:intro}
 % TODO Matthias
-Data-driven applications are increasingly becoming the focal point of our information technology society, with AI techniques fundamentally altering various sectors of our society and economy. A common characteristic of these applications is the vast amount of data they require to be captured, stored, and processed. Consequently, many of these applications, e.\,g. large language models (LLM) or other artificial intelligence workloads are bound by the memory performance.
+Data-driven applications are increasingly becoming the focal point of our information technology society, with AI techniques fundamentally altering various sectors of our society and economy. A common characteristic of these applications is the vast amount of data they require to be captured, stored, and processed. Consequently, many of these applications, e.\,g., \acp{llm} or other artificial intelligence workloads, are bound by memory performance.
 Furthermore, a significant portion of energy is consumed by communication and data storage rather than computation. As demonstrated by Jouppi et al.~\cite{jouhyu_21}, in a 7nm process, a 32-bit floating-point multiplication requires \qty{1.31}{\pico\joule}, whereas a 64-bit DRAM memory access demands \qty{1300}{\pico\joule}. This energy is expended in transferring data from memory through the network on chip, arbiters, and various levels of caches. Hence, it would be considerably more energy-efficient to process data where it resides, particularly within the memory itself. In other words, rather than transmitting data to computational units, the computational instructions should be sent to the memory housing the data.
-This concept, known as \ac{pim}, has been around for many years. For instance, Stone already proposed it in the 1970s~\cite{sto_70}. Since then, similar to the field of artificial intelligence, this idea has experienced \enquote{summer} and \enquote{winter} periods in research over the past decades. However, recently, different companies have developed DRAM test chips with integrated PIM functionality, showing promising potential for entry into the commodity market.
+This concept, known as \ac{pim}, has been around for many years. For instance, Stone already proposed it in the 1970s~\cite{sto_70}. Since then, similar to the field of artificial intelligence, this idea has experienced \enquote{summer} and \enquote{winter} periods in research over the past decades. However, recently, different companies have developed DRAM test chips with integrated \ac{pim} functionality, showing promising potential for entry into the commodity market.
-For instance, UPMEM introduced the first publicly available real-world PIM architecture~\cite{gomhaj_21}. UPMEM integrates standard DDR4 DIMM-based DRAM with a series of PIM-enabled UPMEM DIMMs containing multiple PIM chips. Each PIM chip houses eight DRAM processing units (DPUs), each with dedicated access to a 64 MiB memory bank, a 24 KiB instruction memory, and a 64 KiB scratchpad memory. These DPUs function as multithreaded 32-bit reduced instruction set computer (RISC) cores, featuring a complete set of general-purpose registers and a 14-stage pipeline~\cite{gomhaj_21}. In 2020, SK Hynix, a leading DRAM manufacturer, unveiled its PIM technology, named Newton, utilizing Graphics Double Data Rate 6 (GDDR6) memory~\cite{he2020}. Unlike UPMEM, Newton integrates small MAC units and buffers into the bank area to mitigate the space and power overhead of a fully programmable processor core. Following SK Hynix's lead, Samsung, another major DRAM manufacturer, announced its own PIM DRAM implementation named Function-In-Memory DRAM (FIMDRAM or PIM-HBM) one year later~\cite{lee2021}.
+For instance, UPMEM introduced the first publicly available real-world \ac{pim} architecture~\cite{gomhaj_21}. UPMEM integrates standard DDR4 DIMM-based DRAM with a series of PIM-enabled UPMEM DIMMs containing multiple \ac{pim} chips. Each \ac{pim} chip houses eight \acp{dpu}, each with dedicated access to a 64 MiB memory bank, a 24 KiB instruction memory, and a 64 KiB scratchpad memory. These \acp{dpu} function as multithreaded 32-bit \ac{risc} cores, featuring a complete set of general-purpose registers and a 14-stage pipeline~\cite{gomhaj_21}.
+In 2020, SK Hynix, a leading DRAM manufacturer, unveiled its \ac{pim} technology, named Newton, utilizing \ac{hbm}~\cite{he2020}. Unlike UPMEM, Newton integrates small MAC units and buffers into the bank area to mitigate the space and power overhead of a fully programmable processor core. Following SK Hynix's lead, Samsung, another major DRAM manufacturer, announced its own \ac{pim} DRAM implementation named \ac{fimdram} one year later~\cite{lee2021}.
 With these new architectures on the horizon, it becomes crucial for system-level designers to assess whether these promising developments can enhance their applications. Furthermore, these emerging hardware architectures necessitate new software paradigms. It remains unclear whether libraries, compilers, or operating systems will effectively manage these new devices at the software level. Therefore, it is imperative to establish comprehensive virtual platforms for these devices, enabling real applications to be tested within a realistic architectural and software platform context.
-This paper introduces a virtual prototype of Samsung's PIM-HBM, developed using open-source tools such as gem5~\cite{lowahm_20} and the memory simulator \mbox{DRAMSys~\cite{stejun_20}}. Additionally, the virtual prototype is accompanied by a custom Rust software library, simplifying the utilization of PIM functionality at the software level.
+This paper introduces a virtual prototype of Samsung's \ac{fimdram}, developed using open-source tools such as gem5~\cite{lowahm_20} and the memory simulator \mbox{DRAMSys~\cite{stejun_20}}. Additionally, the virtual prototype is accompanied by a custom Rust software library, simplifying the utilization of \ac{pim} functionality at the software level.
 In summary, this paper makes the following contributions:
 \begin{itemize}
-\item We propose, to the best of our knowledge, for the first time full system simulation of HBM-PIM with a virtual plattform consisting of gem5 and DRAMSys
+\item We propose, to the best of our knowledge, the first full-system simulation of \ac{fimdram} with a virtual platform consisting of gem5 and DRAMSys
-\item We provide an experimantal verification of VP with benchmarks
+\item We provide an experimental verification of the virtual platform with benchmarks
-\item We propose a modern Rust library to provide the PIM functionality up to the software level
+\item We propose a modern Rust library to provide the \ac{pim} functionality up to the software level
 \end{itemize}
-The paper is structured as follows. Section 2 shows the related work in the area of PIM-Simulation. Section 3 gives a brief background on the relative PIM-Architectures, whereas Section 4 explains the proposed PIM Virtual Platform. The Sections 5 and 6 show experimental simulation setup and the results, which are compared with already published results from PIM vendors. The paper is finally concluded in Section 7.
+The paper is structured as follows. Section 2 presents the related work in the area of \ac{pim} simulation. Section 3 gives a brief background on the relevant \ac{pim} architectures, whereas Section 4 explains the proposed \ac{pim} virtual platform. Sections 5 and 6 present the experimental simulation setup and the results, which are compared with already published results from \ac{pim} vendors. The paper is concluded in Section 7.
 %
 \section{Related Work}
 Several virtual prototypes of \ac{pim} architectures have been the object of research in the past.
 The authors of \cite{singh2019} and \cite{kim2016a} used Ramulator-PIM, which is based on the processor simulator ZSim \cite{sanchez2013} and the DRAM simulator Ramulator \cite{kim2016a}, to build high-level performance and energy estimation frameworks.
 C. Yu et al. \cite{yu2021} introduced MultiPIM, a high-level \ac{pim} simulator capable of simulating parallel \ac{pim} cores, which is also based on Ramulator and ZSim.
-However, these three publications focus mainly on \ac{hmc}, which has seen only limited adoption.
+However, these three publications focus primarily on \ac{hmc} DRAM, which has seen limited adoption.
 With PIMSim \cite{xu2019}, the authors provide a configurable \ac{pim} simulation framework that enables a full-system simulation of user-specified \ac{pim} logic cores.
 The authors of DP-Sim \cite{zhou2021} present a full-stack infrastructure for \ac{pim}, based on a front-end that generates \ac{pim} instructions by instrumenting a host application and executing them in a \ac{pim}-enabled memory model.
 Similarly, Sim\textsuperscript{2}PIM \cite{santos2021,forlin2022} uses instrumentation to simulate only the \ac{pim} side of a host application.
 The MPU-Sim \cite{xie2022} simulator focuses on general-purpose near-bank processing units based on 3D DRAM technology, while neglecting the data transfers between the host CPU and the \ac{pim} devices.
-These instrumentation approaches are less accurate when it comes to integration with the host processor because they focus on simulating the \ac{pim} modules.
+These instrumentation approaches are less accurate when it comes to integration with the host processor because they primarily focus on simulating the \ac{pim} units.
 A slightly different approach is taken by PiMulator \cite{mosanu2022}, which does not simulate but emulates \ac{pim} implementations such as RowClone \cite{seshadri2013} or Ambit \cite{seshadri2020} by implementing a soft-model in an FPGA.
 Besides research \ac{pim} architectures, there are also virtual prototypes of industry architectures.
 Very recently, the authors of \cite{hyun2024} introduced uPIMulator, a cycle-accurate simulator that models UPMEM's real-world general-purpose \ac{pim} architecture.
-To analyze the potential performance and power impact of Newton, SK Hynix developed a virtual prototype based on the DRAMSim2 \cite{rosenfeld2011} cycle-accurate memory simulator, which models an \ac{hbm2} memory and the extended Newton DRAM protocol. However, DRAMSym2 is more than 10 years old and several orders of magnitude slower than DRAMSys~\cite{steiner2022a}.
+To analyze the potential performance and power impact of Newton, SK Hynix developed a virtual prototype based on the DRAMSim2 \cite{rosenfeld2011} cycle-accurate memory simulator, which models a \ac{hbm2} memory and the extended Newton DRAM protocol. However, DRAMSim2 is more than 10 years old and several orders of magnitude slower than DRAMSys~\cite{steiner2022a}.
-The simulated system is compared to two different non-\ac{pim} systems: an ideal non-\ac{pim} host with infinite compute bandwidth and a \ac{gpu} model of a high-end Titan-V graphics card using a cycle-accurate \ac{gpu} simulator.
+The simulated system is compared to two different non-\ac{pim} systems: an ideal non-\ac{pim} host with infinite compute bandwidth and a GPU model of a high-end Titan-V graphics card using a cycle-accurate GPU simulator.
-SK Hynix finds that Newton achieves a \qty{54}{\times} speedup over the Titan-V \ac{gpu} model and a speedup of \qty{10}{\times} for the ideal non-\ac{pim} case, setting a lower bound on the acceleration for every possible non-\ac{pim} architecture.
+SK Hynix finds that Newton achieves a \qty{54}{\times} speedup over the Titan-V GPU model and a speedup of \qty{10}{\times} for the ideal non-\ac{pim} case, setting a lower bound on the acceleration for every possible non-\ac{pim} architecture.
-With PIMSimulator~\cite{shin-haengkang2023}, Samsung provides a virtual prototype of \ac{fimdram} also based on DRAMSim2.
+With PIMSimulator~\cite{shin-haengkang2023}, Samsung provides a virtual prototype of \ac{fimdram}, also based on DRAMSim2.
 PIMSimulator offers two simulation modes: it can either accept pre-recorded memory traces or generate very simplified memory traffic using a minimal host processor model that essentially executes only the \ac{pim}-related program regions.
 However, neither approach accurately models a complete system consisting of a host processor running a real compiled binary and the memory system that integrates \ac{fimdram}.
 As a result, only limited conclusions can be made about the performance impact of \ac{fimdram} and the changes that are required in the application code to support the new architecture.
@@ -129,10 +130,10 @@ In Samsung's findings, the simulated \ac{fimdram} system provides a speedup in t
 \section{Background DRAM-PIM}
 \label{sec:dram_pim}
 Many types of \acp{dnn} used for language and speech processing, such as \acp{rnn}, \acp{mlp} and some layers of \acp{cnn}, are severely limited by the memory bandwidth that the DRAM can provide, making them \textit{memory-bound} \cite{he2020}.
-As already discussed in \cref{sec:intro}, PIM is a good fit for accelerating memory-bound workloads with low operational intensity.
+\Ac{pim} is a good fit for accelerating memory-bound workloads with low operational intensity.
 In contrast, compute-bound workloads tend to have high data reuse and can make extensive use of the on-chip cache and therefore do not need to utilize the full memory bandwidth.
-A large number of modern \acp{dnn} layers can be expressed as a matrix-vector multiplication.
+A large number of modern \ac{dnn} layers can be expressed as a matrix-vector multiplication.
 The layer inputs can be represented as a vector and the model weights can be viewed as a matrix, where the number of columns is equal to the size of the input vector and the number of rows is equal to the size of the output vector.
 Pairwise multiplication of the input vector and a row of the matrix is used to calculate an entry of the output vector.
 Such an operation, defined in the widely used \ac{blas} library \cite{blas1979}, is also known as a \acs{gemv} routine.
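The row-wise multiply-accumulate described in this hunk can be sketched in Rust, the language of the paper's accompanying library. This is a plain scalar illustration of a GEMV routine, not the PIM-accelerated implementation; the function name and signature are our own.

```rust
/// Scalar GEMV sketch: y = A * x, with `a` stored row-major as
/// `rows` x `cols` entries. Each output entry is the pairwise
/// multiply-accumulate of the input vector with one matrix row,
/// as described in the text. Illustrative only, not the PIM path.
fn gemv(a: &[f32], x: &[f32], rows: usize, cols: usize) -> Vec<f32> {
    assert_eq!(a.len(), rows * cols);
    assert_eq!(x.len(), cols);
    (0..rows)
        .map(|r| {
            a[r * cols..(r + 1) * cols]
                .iter()
                .zip(x)
                .map(|(aij, xj)| aij * xj)
                .sum()
        })
        .collect()
}
```

On FIMDRAM, the per-row accumulation would instead be spread across the 16-wide SIMD FPUs of the in-bank processing units.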
@@ -146,12 +147,12 @@ Each of these approaches comes with different advantages and disadvantages.
 In short, the closer the processing is to the DRAM's subarray, the higher the energy efficiency and the achievable processing bandwidth.
 On the other hand, the integration of the \ac{pim} units inside the bank becomes more difficult as area and power constraints limit the integration \cite{sudarshan2022}.
-One real \ac{pim} implementation of the major DRAM manufacturer Samsung, called \acf{fimdram}, has been presented in 2021 \cite{kwon2021,lee2021}.
+One real \ac{pim} implementation of the DRAM manufacturer Samsung, called \acf{fimdram}, was presented in 2021 \cite{kwon2021,lee2021}.
 \Ac{fimdram} is based on the \ac{hbm2} memory standard, and it integrates 16-wide \ac{simd} engines directly into the memory banks, exploiting bank-level parallelism, while preserving the highly optimized memory subarray \cite{kwon2021}.
-A special feature of \aca{fimdram} is that it does not require any changes to components of modern processors, such as the memory controller, i.e., it is agnostic to existing \aca{hbm2} platforms.
+A special feature of \ac{fimdram} is that it does not require any changes to components of modern processors, such as the memory controller, i.e., it is agnostic to existing \ac{hbm2} platforms.
-Consequently, for the operation of the \acp{pu}, mode switching is required for \aca{fimdram}, which makes it less useful for interleaved \ac{pim} and non-\ac{pim} traffic and small batch sizes.
+Consequently, for the operation of the \acp{pu}, mode switching is required for \ac{fimdram}, which makes it less useful for interleaved \ac{pim} and non-\ac{pim} traffic and small batch sizes.
-At the heart of \aca{fimdram} lie the \acp{pu}, where one of which is shared by two banks of the same \ac{pch}.
+At the heart of \ac{fimdram} lie the \acp{pu}, each of which is shared by two banks of the same \ac{pch}.
 The architecture of such a \ac{pu} is illustrated in \cref{fig:pu}.
 \begin{figure}
@@ -163,30 +164,30 @@ The architecture of such a \ac{pu} is illustrated in \cref{fig:pu}.
A \ac{pu} contains two sets of \ac{simd} \acp{fpu}, one for addition and one for multiplication, where each set contains 16 16-bit wide \acp{fpu} each. A \ac{pu} contains two sets of \ac{simd} \acp{fpu}, one for addition and one for multiplication, where each set contains 16 16-bit wide \acp{fpu} each.
Besides the \acp{fpu}, a \ac{pu} contains a \ac{crf}, a \ac{grf} and a \ac{srf} \cite{lee2021}. Besides the \acp{fpu}, a \ac{pu} contains a \ac{crf}, a \ac{grf} and a \ac{srf} \cite{lee2021}.
The 16-wide \ac{simd} units correspond to the 256-bit prefetch architecture of \aca{hbm2}, where 16 16-bit floating-point operands are passed directly from the \acp{ssa} to the \acp{fpu} from a single memory access. The 16-wide \ac{simd} units correspond to the 256-bit prefetch architecture of \ac{hbm2}, where 16 16-bit floating-point operands are passed directly from the \acp{ssa} to the \acp{fpu} from a single memory access.
As all \ac{pim} units operate in parallel, with 16 banks per \ac{pch}, a singular memory access loads a total of $\qty{256}{\bit}\cdot\qty{8}{\acp{pu}}=\qty{2048}{\bit}$ into the \acp{fpu}. As all \ac{pim} units operate in parallel, with 16 banks per \ac{pch}, a singular memory access loads a total of $\qty{256}{\bit}\cdot\qty{8}{\acp{pu}}=\qty{2048}{\bit}$ into the \acp{fpu}.
As a result, the theoretical internal bandwidth of \aca{fimdram} is $\qty{8}{\times}$ higher than the external bus bandwidth to the host processor. As a result, the theoretical internal bandwidth of \ac{fimdram} is $\qty{8}{\times}$ higher than the external bus bandwidth to the host processor.
\Ac{fimdram} defines three operating modes: \Ac{fimdram} defines three operating modes:
The default \textbf{\ac{sb} mode}, where \aca{fimdram} has identical behavior to normal \aca{hbm2} memory. The default \textbf{\ac{sb} mode}, where \ac{fimdram} has identical behavior to normal \ac{hbm2} memory.
To switch to another mode, a specific sequence of \ac{act} and \ac{pre} commands must be sent by the memory controller to specific row addresses. To switch to another mode, a specific sequence of \ac{act} and \ac{pre} commands must be sent by the memory controller to specific row addresses.
The \textbf{\ac{ab} mode} is an extension to the \ac{sb} mode where the \ac{pim} execution units allow for concurrent access to half of the DRAM banks at the same time. The \textbf{\ac{ab} mode} is an extension to the \ac{sb} mode where the \ac{pim} execution units allow for concurrent access to half of the DRAM banks at the same time.
This provides $\qty{8}{\times}$ more bandwidth than the standard operation mode, which can be used for the initialization of memory regions across all banks. This provides $\qty{8}{\times}$ more bandwidth than the standard operation mode, which can be used for the initialization of memory regions across all banks.
With another predefined DRAM access sequence, the memory switches to the \textbf{\ac{abp} mode}. With another predefined DRAM access sequence, the memory switches to the \textbf{\ac{abp} mode}.
In this mode, a single memory access initiates the concurrent execution of the next instruction across all processing units. In this mode, a single memory access initiates the concurrent execution of the next instruction across all processing units.
In addition, the I/O circuits of the DRAM are completely disabled in this mode, reducing the power required during \ac{pim} operation. In addition, the I/O circuits of the DRAM for the data bus are completely disabled in this mode, reducing the power required during \ac{pim} operation.
Both in \ac{ab} mode and in \ac{abp} mode, the total \aca{hbm2} bandwidth per \ac{pch} of $\qty{16}{\giga\byte\per\second}$ is $\qty{8}{\times}$ higher with $\qty{128}{\giga\byte\per\second}$ or in total $\qty{2}{\tera\byte\per\second}$ for 16 \acp{pch}. Both in \ac{ab} mode and in \ac{abp} mode, the total \ac{hbm2} bandwidth per \ac{pch} of $\qty{16}{\giga\byte\per\second}$ is $\qty{8}{\times}$ higher with $\qty{128}{\giga\byte\per\second}$ or in total $\qty{2}{\tera\byte\per\second}$ for 16 \acp{pch}.
Due to the focus on \ac{dnn} applications in \aca{fimdram}, the native data type for the \acp{fpu} is \ac{fp16}, which is motivated by the significantly lower area and power requirements for \acp{fpu} compared to 32-bit \ac{fp} numbers. Due to the focus on \ac{dnn} applications in \ac{fimdram}, the native data type for the \acp{fpu} are \ac{fp16} numbers, which is motivated by the significantly lower area and power requirements for \acp{fpu} compared to 32-bit floating-point numbers.
The \ac{simd} \acp{fpu} of the processing units is implemented once as a \ac{fp16} multiplier unit, and once as a \ac{fp16} adder unit, providing support for these basic algorithmic operations. The \ac{simd} \acp{fpu} of the processing units is implemented once as a \ac{fp16} multiplier unit, and once as a \ac{fp16} adder unit, providing support for these basic algorithmic operations.
The \ac{crf} acts as an instruction buffer, holding the 32 32-bit instructions to be executed by the processor when performing a memory access.
A program that is stored in the \ac{crf} is called a \textit{microkernel}.
Each \ac{grf} consists of 16 registers, each with the \ac{hbm2} prefetch size of 256 bits, where each entry can hold the data of a full memory burst.
The \ac{grf} of a processing unit is divided into two halves (\ac{grf}-A and \ac{grf}-B), with eight register entries allocated to each of the two banks.
Finally, in the \acp{srf}, a 16-bit scalar value is replicated $\qty{16}{\times}$ as it is fed into the 16-wide \ac{simd} \ac{fpu} as a constant summand or factor for an addition or multiplication.
It is also divided into two halves (\ac{srf}-A and \ac{srf}-M) for addition and multiplication with eight entries each.
The \ac{fimdram} instruction set provides a total of nine 32-bit \ac{risc} instructions, each of which falls into one of three groups: control flow instructions (NOP, JUMP, EXIT), arithmetic instructions (ADD, MUL, MAC, MAD), and data movement instructions (MOV, FILL).
Since the execution of an instruction in the microkernel is initiated by a memory access, the host processor must execute \ac{ld} or \ac{st} instructions in a sequence that perfectly matches the loaded \ac{pim} microkernel.
When an instruction executes directly on data that is provided by a memory bank, the addresses of these memory accesses specify the exact row and column where the data should be loaded from or stored to.
This means that the order of the respective memory accesses for such instructions is critical, as a reordering by the memory controller would desynchronize the accesses from the microkernel's instruction sequence.
One solution to this problem would be to introduce memory barriers between each \ac{ld} and \ac{st} instruction of the processor to prevent any reordering; however, this comes at a significant performance cost and leaves the memory bandwidth underutilized.
To avoid this overhead, Samsung has introduced the \ac{aam} mode for arithmetic instructions.
In \ac{aam} mode, the register indices encoded in an instruction are ignored and instead decoded from the column and row address of the memory access itself.
With this method, the register indices and the bank addresses cannot get out of sync, as they are tightly coupled, even if the memory controller reorders the accesses.
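To illustrate, an \ac{aam}-style index derivation might look as follows; the concrete bit mapping is a hypothetical assumption for illustration and not the documented \ac{fimdram} encoding:

```python
def aam_register_indices(row_addr: int, col_addr: int) -> tuple:
    """Derive GRF register indices from the address of the triggering
    memory access (hypothetical bit mapping, for illustration only).

    In AAM mode the indices encoded in the instruction are ignored;
    instead they are taken from the row/column address, so the memory
    controller can reorder accesses without desynchronizing register
    usage from bank addresses."""
    grf_a_idx = col_addr % 8          # one of the 8 GRF-A entries
    grf_b_idx = (col_addr // 8) % 8   # one of the 8 GRF-B entries
    return grf_a_idx, grf_b_idx

# Reordering the accesses does not change which registers each access uses,
# because the indices depend only on the address itself:
addrs = [(0, c) for c in range(16)]
in_order = [aam_register_indices(r, c) for r, c in addrs]
reordered = [aam_register_indices(r, c) for r, c in reversed(addrs)]
assert set(in_order) == set(reordered)
```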
\section{PIM Virtual Platform}
To build a virtual prototype of \ac{fimdram}, an accurate \ac{hbm2} model is needed into which the additional \ac{pim}-\acp{pu} are integrated.
For this, the cycle-accurate DRAM simulator DRAMSys \cite{steiner2022a} was used and its \ac{hbm2} model was extended to include the \acp{pu} in the \acp{pch} of the \ac{pim}-enabled channels.
The \ac{fimdram} model itself does not need to model any timing behavior:
its submodel is essentially untimed, since it is already synchronized with the operation of the DRAM model of DRAMSys.
To achieve a full-system simulation, detailed processor and cache models are required in addition to the \ac{pim}-enabled memory system.
For this, the gem5 simulator was used, which generates memory requests by executing the instructions of a compiled workload binary.
While \ac{fimdram} operates in the default \ac{sb} mode, it behaves exactly like a normal \ac{hbm2} memory.
Only when the host initiates a mode switch of one of the \ac{pim}-enabled \acp{pch} do the processing units become active.
When entering \ac{ab} mode, the DRAM model ignores the specific bank address of incoming \ac{wr} commands and internally performs the write operation for either all even or all odd banks of the \ac{pch}, depending on the parity of the original bank index.
After the transition to the \ac{ab} mode, the DRAM can further transition to the \ac{abp} mode, which allows the execution of instructions in the processing units.
The \ac{abp} mode is similar to the \ac{ab} mode in that it also ignores the concrete bank address except for its parity, while additionally passing the column and row address and, in the case of a read, also the respective fetched bank data to the processing units.
In the case of a write access, the output of the processing unit is written directly into the corresponding bank, ignoring the actual data of the transaction object coming from the host processor.
This is equivalent to the real \ac{fimdram} implementation, where the global I/O bus of the memory is not actually driven, and all data movement is done internally in the banks.
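The bank-parity behavior described above can be sketched as follows; the bank count per \ac{pch} and the data layout are illustrative assumptions:

```python
def ab_mode_write(banks, bank_addr, row, col, data):
    """All-bank (AB) mode write: the concrete bank address is ignored
    except for its parity; the write is applied to all even or all odd
    banks of the pseudo channel (simplified model, 16 banks assumed).
    Each bank is modeled as a dict mapping (row, col) to burst data."""
    parity = bank_addr % 2
    for idx in range(parity, len(banks), 2):
        banks[idx][(row, col)] = data

banks = [dict() for _ in range(16)]
ab_mode_write(banks, bank_addr=5, row=3, col=7, data=b"\xab" * 32)
# An odd bank address writes all odd banks and leaves even banks untouched:
assert all((3, 7) in banks[i] for i in range(1, 16, 2))
assert all((3, 7) not in banks[i] for i in range(0, 16, 2))
```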
The model's internal state of a processing unit consists of the \ac{grf} register files \ac{grf}-A and \ac{grf}-B, the \ac{srf} register files \ac{srf}-A and \ac{srf}-M, the program counter, and a jump counter that keeps track of the current iteration of a JUMP instruction.
Depending on a \ac{rd} or \ac{wr} command received from the DRAM model, the control flow is dispatched into one of two functions that execute an instruction in the \ac{crf} and increment the program counter of the corresponding \ac{pim} unit.
Both functions calculate the register indices used by the \ac{aam} execution mode followed by a branch table that dispatches to the handler of the current instruction.
In case of the data movement instructions MOV and FILL, a simple move operation is performed that loads the value of a register or the bank data and assigns it to the destination register.
The arithmetic instructions fetch the operand data from their respective sources, perform the operation, and write back the result by modifying the internal state of the \ac{pu}.
Note that while the MAC instruction can iteratively add to the same destination register, it does not reduce the 16-wide \ac{fp16} vector itself in any way.
Instead, it is the host processor's responsibility to reduce these 16 floating-point numbers into one \ac{fp16} number.
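A heavily simplified sketch of a \ac{pu}'s state and its MAC handler, including the host-side reduction, could look like this; the handler structure and the use of plain floats instead of \ac{fp16} arithmetic are simplifying assumptions:

```python
class ProcessingUnit:
    """Simplified PIM PU model: eight 16-wide GRF-A/GRF-B registers,
    a MAC handler, and a program counter advanced on every triggering
    memory access. Plain Python floats stand in for FP16 lanes."""
    SIMD = 16

    def __init__(self):
        self.grf_a = [[0.0] * self.SIMD for _ in range(8)]
        self.grf_b = [[0.0] * self.SIMD for _ in range(8)]
        self.pc = 0

    def mac(self, dst_b, src_a, bank_data):
        # grf_b[dst_b] += grf_a[src_a] * bank_data, element-wise.
        # No reduction across the 16 lanes happens inside the PU.
        acc = self.grf_b[dst_b]
        for lane in range(self.SIMD):
            acc[lane] += self.grf_a[src_a][lane] * bank_data[lane]
        self.pc += 1

def host_reduce(simd_vector):
    # The host reduces the 16 partial sums into one scalar.
    return sum(simd_vector)

pu = ProcessingUnit()
pu.grf_a[0] = [1.0] * 16
pu.mac(dst_b=0, src_a=0, bank_data=[2.0] * 16)
pu.mac(dst_b=0, src_a=0, bank_data=[3.0] * 16)
assert host_reduce(pu.grf_b[0]) == 80.0   # 16 lanes * (2 + 3)
```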
With this implementation of \ac{fimdram}, it is now possible to write a user program that controls the execution of the \ac{pim}-\acp{pu} directly in the \ac{hbm2} model.
However, correctly placing the input data in the DRAM and arbitrating its execution is a non-trivial task.
When executing control instructions or data movement instructions that operate only on registers, the software library issues memory accesses solely to trigger the execution of the respective microkernel instruction.
Further, when data is read from or written to the memory banks, these memory requests are issued with the correct address for the data.
Since half of the banks in a \ac{pch} operate at the same time, the data accesses appear very sparse from the viewpoint of the host processor.
In the case of the input vector, where one 16-wide \ac{simd} vector of \ac{fp16} elements is repeated as often as there are banks in a \ac{pch}, a burst access must occur every $\qty{32}{\byte} \cdot \text{number of banks per pCH} = \qty{512}{\byte}$, up to $\qty{8}{\times}$ over the entire interleaved input vector.
To then perform the repeated MAC operation with the weight matrix as bank data, a similar logic must be applied.
Since each row of the matrix resides on its own memory bank, with an interleaving of the size of a 16-wide \ac{simd} vector of \ac{fp16} elements, one memory access must likewise be issued every $\qty{512}{\byte}$.
As the input address of the weight matrix grows, the \ac{grf}-A and \ac{grf}-B indices are incremented in such a way that the \ac{grf}-A registers are read repeatedly to multiply the weights by the input vector, while the \ac{grf}-B registers are incremented in the outer loop to hold the results of additional matrix rows.
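The sparse host-side access pattern described above can be sketched as a simple address generator; the bank count and base address are illustrative assumptions:

```python
BURST_BYTES = 32          # one 16-wide FP16 SIMD vector
BANKS_PER_PCH = 16        # assumed bank count per pseudo channel
STRIDE = BURST_BYTES * BANKS_PER_PCH   # 512 B between host accesses

def host_access_addresses(base, num_accesses):
    """Addresses the host must touch to step the microkernel over an
    interleaved operand: one burst access every STRIDE bytes, since the
    intermediate bursts are consumed by the other banks of the pCH."""
    return [base + i * STRIDE for i in range(num_accesses)]

# For the input vector, at most 8 such accesses are needed:
addrs = host_access_addresses(base=0x1000, num_accesses=8)
assert addrs[1] - addrs[0] == 512
assert len(addrs) == 8
```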
Besides generating memory requests, an important task of the software library is to maintain the data coherence of the program.
The compiler may introduce invariants with respect to the value of the output vector, since it does not see that the value of the vector has changed without the host explicitly writing to it.
As a result, the compiler may make optimizations that are not obvious to the programmer, such as reordering memory accesses, that cause the program to execute incorrectly.
To avoid this, not only between non-\ac{aam} instructions in the microkernel, but also after initializing the input operands and before reading the output vector, memory barriers must be introduced to ensure that all memory accesses and \ac{pim} operations are completed.
\section{Simulations}
Our simulations are based on the gem5 simulator and the DRAMSys memory simulator.
The comparison between non-\ac{pim} and \ac{pim} architectures considers a hypothetical host processor with infinite compute capacity.
In this ideal approach, memory bandwidth is the only limiting component, allowing only memory-bound effects to be considered.
This provides a lower bound on the possible speedups achieved by \ac{pim}, independent of the host architecture.
The configuration of \ac{hbm2} DRAM is summarized in \cref{tab:memspec}.
\begin{table}
Our benchmarks are divided into two classes: vector benchmarks, which perform element-wise vector operations, and matrix-vector benchmarks.
Both classes of benchmarks are typically memory-bound, since little or no data is reused during the operation.
For the first class of benchmarks, two \ac{fp16} vectors are added (VADD), multiplied (VMUL), or combined in a \ac{haxpy} fashion.
The second class of benchmarks performs a \ac{gemv} matrix-vector multiplication or models a simple fully connected neural network with multiple layers, applying the \ac{relu} activation function in between.
Each benchmark is executed with variable operand dimensions, which are listed in \cref{tab:dimensions}.
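Ignoring data placement, the computations performed by these benchmarks can be written down directly as reference kernels; \ac{fp16} rounding is ignored here:

```python
def vadd(x, y):
    # VADD: element-wise vector addition
    return [a + b for a, b in zip(x, y)]

def vmul(x, y):
    # VMUL: element-wise vector multiplication
    return [a * b for a, b in zip(x, y)]

def haxpy(alpha, x, y):
    # HAXPY: a * x + y, element-wise
    return [alpha * a + b for a, b in zip(x, y)]

def gemv(matrix, x):
    # GEMV: matrix-vector product, one dot product per matrix row
    return [sum(w * v for w, v in zip(row, x)) for row in matrix]

def relu(x):
    return [max(0.0, v) for v in x]

# A toy fully connected layer: GEMV followed by ReLU.
out = relu(gemv([[1.0, -2.0], [3.0, 4.0]], [1.0, 1.0]))
assert out == [0.0, 7.0]
```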
\begin{table}
The results in \cref{fig:speedups} show significant speedups for all vector benchmarks.
On the other hand, the achieved speedup for the matrix-vector simulations varied with the simulated operand dimensions.
The \ac{gemv} benchmark achieved a speedup in the range $\qtyrange{8.7}{9.2}{\times}$ with an average value of $\qty{9.0}{\times}$, while the fully connected neural network layers experienced a higher variance:
With a range of $\qtyrange{0.6}{6.0}{\times}$, the \ac{dnn} benchmark experienced both a slowdown and an acceleration of the inference time.
Therefore, there is a break-even point between dimensions X1 and X2 where \ac{pim} can be expected to become viable.
\begin{figure}
\centering
\end{figure}
Besides its own virtual prototype, Samsung used a real hardware accelerator platform for its analyses, which is based on a Xilinx Zynq UltraScale+ FPGA and uses real manufactured \ac{fimdram} memory packages.
Similar to the previous simulations, Samsung used different input dimensions for both its \ac{gemv} and its vector ADD microbenchmarks, which are equivalent to the dimensions used here.
The performed ADD microbenchmark of Samsung shows an average speedup of around $\qty{1.6}{\times}$ for the real system and \qty{2.6}{\times} for the virtual prototype.
Compared to this paper, where the speedup is approximately $\qty{12.7}{\times}$, this result is almost an order of magnitude lower.
The \ac{gemv} microbenchmark, on the other hand, shows a more closely matching result.
\Cref{fig:wallclock_time} shows the simulation runtimes of the various workloads on the host system.
With \ac{pim} enabled, the runtime drops by about an order of magnitude for some workloads, indicating the reduced simulation effort on gem5's complex processor model, as only new memory requests are issued by the model during operation of \ac{pim}.
Therefore, exploring the effectiveness of different \ac{pim}-enabled workloads may be less time-consuming than for traditional workloads due to the reduced simulation complexity.
\section{Conclusion}
% TODO Lukas/Matthias