In the conventional von Neumann architecture, compute is completely separated from memory.
Memory-intensive workloads operate on large data sets, have poor spatial and temporal locality, and exhibit low operational density.
As a consequence, the data movement between memory and compute forms the so-called von Neumann bottleneck \cite{zou2021}.
In the past, this bottleneck was masked by latency-hiding techniques such as out-of-order execution, branch prediction, and multiple layers of cache \cite{radojkovic2021}.
However, new memory-intensive applications, including \acp{dnn}, have led researchers to reconsider \ac{pim} as a new approach to meet future processing demands.
The first proposals for \ac{pim} date back to the 1970s and were hindered by the limitations of the memory systems of the time, but the concept is now experiencing a renaissance \cite{radojkovic2021,ghose2019a}.

In the following, the workloads suitable for \ac{pim} will be discussed in more detail, followed by an overview of the different types of \ac{pim} implementations.
Finally, a number of concrete implementation examples are presented.

\subsection{Applicable Workloads}
\label{sec:pim_workloads}
Because each matrix element is used exactly once in the calculation of the output vector, there is no data reuse of the matrix.
Further, as the weight matrices tend to be too large to fit in the on-chip cache, such a \ac{gemv} operation is heavily memory-bound \cite{he2020}.
As a result, such an operation is a good fit for \ac{pim}.
In contrast, a \acs{gemm} \ac{blas} routine, i.e., the multiplication of two matrices, is not such a good candidate for \ac{pim}, for two reasons:
Firstly, \ac{gemm} sees significant data reuse of both matrices, as they are repeatedly accessed column-wise or row-wise, rendering the on-chip cache more effective.
Secondly, \ac{pim} can only accelerate two-input, one-output operations in which one operand is significantly larger than the other, because the \ac{pim} computation can only be placed close to one of the operands; the other operand still incurs extensive data movement \cite{he2020}.
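
The memory-boundedness argument can be made concrete with a back-of-the-envelope arithmetic-intensity estimate (a sketch assuming square $n \times n$ operands, counting touched data in elements rather than bytes and ignoring caching effects):

```latex
% Arithmetic intensity: floating-point operations per memory element touched.
\[
  I_{\text{\acs{gemv}}}
    = \frac{2n^{2}}{n^{2} + 2n}
    \;\xrightarrow{\;n \to \infty\;}\; 2,
  \qquad
  I_{\text{\acs{gemm}}}
    = \frac{2n^{3}}{3n^{2}}
    = \frac{2n}{3}.
\]
```

The intensity of \ac{gemv} is bounded by a small constant, so its runtime is dictated by memory bandwidth, whereas the intensity of \ac{gemm} grows with $n$, which is why on-chip caching pays off for the latter.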

\label{sec:pim_architectures}

Many different \ac{pim} architectures have been proposed by researchers in the past, and more recently, hardware vendors have presented real implementations.
These proposals differ mainly in where the processing logic is placed, ranging from the analog redistribution of capacitor charges at the \ac{subarray} level to additional processing units at the global \ac{io} level.
In essence, the placement of these approaches can be summarized as follows \cite{sudarshan2022}:
\begin{enumerate}

Only when the compute units are placed within the bank region can the full bank parallelism be exploited.
Outside the bank region, the data retrieval is limited by the narrow memory bus.
On the other hand, integrating the \ac{pim} units inside the bank is more difficult, as area and power constraints limit the design \cite{sudarshan2022}.

Processing \textbf{inside the \ac{subarray}} has the highest achievable level of parallelism, with the number of operand bits equal to the size of the row.
It also requires the least amount of energy to load the data from the \acs{subarray} into the \acp{psa} and perform operations on it.
The downside of this approach is the need to modify the highly optimized \ac{subarray} architecture.
An example of such an approach is Ambit \cite{seshadri2020}.
Ambit provides a mechanism to activate multiple rows within a \ac{subarray} at once and perform bulk bitwise operations such as AND, OR, and NOT on the row data.
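
The logic behind Ambit's bulk bitwise operations can be sketched in a few lines: simultaneously activating three rows lets the bitlines settle to the bitwise majority of the three row values, and AND/OR then follow by pre-setting the third row to all zeros or all ones. This is a behavioral sketch of the mechanism described in \cite{seshadri2020}, not a model of the circuit:

```python
# Behavioral model of Ambit-style triple-row activation (TRA):
# activating rows A, B and C together yields the bitwise majority.
def maj(a: int, b: int, c: int) -> int:
    return (a & b) | (a & c) | (b & c)

def ambit_and(a: int, b: int) -> int:
    return maj(a, b, 0)           # control row pre-set to all zeros

def ambit_or(a: int, b: int, width: int = 8) -> int:
    ones = (1 << width) - 1       # control row pre-set to all ones
    return maj(a, b, ones)

row_a, row_b = 0b11001010, 0b10100110
assert ambit_and(row_a, row_b) == row_a & row_b
assert ambit_or(row_a, row_b) == row_a | row_b
```

NOT is realized separately in Ambit via a dual-contact cell; it is omitted here.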

Far fewer, but still challenging, constraints are placed on the integration of compute units \textbf{in the region of the \acp{psa}}.
The approach presented in \cite{sudarshan2022a} consists of a two-stage design that integrates current mirror-based analog units near the \ac{subarray}, enabling the \ac{mac} operations used in \ac{dnn} applications.

The integration of compute units \textbf{in the \ac{io} region of the bank} allows for area-intensive operations such as ADD, \ac{mac}, or \ac{mad}.
This leaves the highly optimized \ac{subarray} and \ac{psa} regions as they are, and only reduces the memory density per die to make room for the additional compute units.
However, the achievable level of parallelism is lower than in the other approaches and is defined by the prefetch architecture, i.e., the maximum burst size of the memory banks.

Placing the compute units \textbf{in the \ac{io} region of the \ac{dram}} has the fewest physical limitations and allows for complex accelerators, implementing a complete \ac{isa}.
The downside is that bank parallelism cannot be exploited to perform multiple computations simultaneously at the bank level.
Also, the energy required to move data to the \ac{io} boundary of the \ac{dram} is much higher than in the other approaches.

In the following, three \ac{pim} approaches that place the compute units at the bank \ac{io} boundary are presented in more detail.

\subsection{UPMEM}
\label{sec:pim_upmem}

UPMEM combines regular DDR4 \ac{dimm}-based \ac{dram} with a set of \ac{pim}-enabled chips.
In each \ac{pim} chip, there are 8 \acp{dpu}, each of which has exclusive access to a $\qty{64}{\mega\byte}$ memory bank, a $\qty{24}{\kilo\byte}$ instruction memory and a $\qty{64}{\kilo\byte}$ scratchpad memory.
The host processor can access the \ac{dpu} memory banks to copy input data from main memory and to retrieve results.
While copying, the data layout must be changed so that the data words are stored contiguously in a \ac{pim} bank, in contrast to the horizontal \ac{dram} mapping used in \ac{dimm} modules, where a data word is split across multiple devices.
UPMEM provides a \ac{sdk} that orchestrates the data movement from the main memory to the \ac{pim} banks and modifies the data layout without requiring special attention from the developer.
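
The required layout change can be illustrated as follows: in the horizontal \ac{dimm} mapping, byte $i$ of every 64-bit word lives on device $i$, while a \ac{dpu} needs its words whole inside one bank. This is a simplified sketch; the real \ac{sdk} transfer routines and chip geometry differ:

```python
def stripe_across_devices(words: list, n_devices: int = 8) -> list:
    """Horizontal DIMM mapping: byte i of every 64-bit word goes to device i."""
    devices = [bytearray() for _ in range(n_devices)]
    for word in words:                     # each word is 8 bytes (64 bit)
        for i, byte in enumerate(word):
            devices[i].append(byte)
    return devices

def gather_for_pim(devices: list, word_idx: int) -> bytes:
    """Rebuild one contiguous word for a PIM bank from the striped layout."""
    return bytes(dev[word_idx] for dev in devices)

words = [bytes(range(w * 8, w * 8 + 8)) for w in range(4)]
devices = stripe_across_devices(words)
# Device 0 holds byte 0 of every word; a PIM bank needs words whole again.
assert devices[0] == bytearray([0, 8, 16, 24])
assert gather_for_pim(devices, 2) == words[2]
```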

Each \ac{dpu} is a multithreaded $\qty{32}{bit}$ \ac{risc} core with a full set of general purpose registers and a 14-stage pipeline.
The \acp{dpu} execute compiled \acs{c} code using a specialized compiler toolchain that provides limited support for the standard library.
With a system clock of $\qty{400}{\mega\hertz}$, the internal bandwidth of a \ac{dpu} amounts to $\qty[per-mode = symbol]{800}{\mega\byte\per\second}$.
A system can integrate 128 \acp{dpu} per \ac{dimm}, with a total of 20 UPMEM \acp{dimm}, which gives a maximum theoretical \ac{pim} bandwidth of $\qty[per-mode = symbol]{2}{\tera\byte\per\second}$ \cite{gomez-luna2022}.
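
The aggregate bandwidth figure follows directly from the per-\ac{dpu} numbers (decimal units assumed):

```python
dpus_per_dimm = 128
dimms = 20
bw_per_dpu_mb_s = 800       # internal bandwidth per DPU, from the text

total_dpus = dpus_per_dimm * dimms
total_bw_tb_s = total_dpus * bw_per_dpu_mb_s / 1_000_000  # MB/s -> TB/s

assert total_dpus == 2560
assert abs(total_bw_tb_s - 2.048) < 1e-9   # ~2 TB/s aggregate
```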

\subsection{Newton AiM}
\label{sec:pim_newton}

In the year 2020, the major \ac{dram} manufacturer SK Hynix announced Newton, its own \ac{pim} technology based on \ac{gddr6} memory \cite{he2020}.
In contrast to UPMEM, Newton integrates only small \ac{mac} units and buffers into the bank region to avoid the area and power overhead of a fully programmable processor core.
To communicate with the processing units, Newton introduces its own \ac{dram} commands, allowing fully interleaved \ac{pim} and non-\ac{pim} traffic, as no mode switching is required.
Another advantage of this approach is that there is no kernel startup delay required to initialize the \ac{pim} operation, which would be a significant overhead for small batches of \ac{pim} operations.
On the downside, this extension to the \ac{jedec} standard is not a drop-in solution, as the memory controller, and consequently the host processor, must be specifically adapted.
In addition to the \ac{mac} units, Newton also introduces a shared global buffer in the \ac{io} region of the memory to broadcast the same input vector to all banks.
The broadcast input vector is then multiplied by a matrix row by performing a column access to the \ac{dram} bank, producing $\qty{32}{\byte}$ of temporary products, i.e., 16 16-bit floating-point values.
These temporary products are then reduced to a single output vector element by the adder tree in the bank.
To make full use of the output buffering, the matrix rows are interleaved in an unusually wide data layout, corresponding to the row size of the \ac{dram}.
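
Newton's \ac{gemv} scheme (broadcast the input vector, multiply it against locally stored matrix rows, and reduce with a per-bank adder tree) can be sketched functionally. This is a numerical sketch in double precision that ignores fp16 rounding and the actual command sequencing:

```python
def newton_gemv(matrix: list, vector: list, n_banks: int = 16) -> list:
    """Each bank owns an interleaved slice of the matrix rows; the input
    vector is broadcast to all banks via the shared global buffer."""
    out = [0.0] * len(matrix)
    for bank in range(n_banks):
        for row_idx in range(bank, len(matrix), n_banks):
            row = matrix[row_idx]
            # One column access yields a batch of temporary products
            # (16 fp16 values at a time in the real hardware) ...
            products = [w * x for w, x in zip(row, vector)]
            # ... which the in-bank adder tree reduces to one output element.
            out[row_idx] = sum(products)
    return out

m = [[1.0, 2.0], [3.0, 4.0]]
v = [10.0, 1.0]
assert newton_gemv(m, v) == [12.0, 34.0]
```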

The following subsections are mainly based on \cite{lee2021} and \cite{kwon2021}.

\subsubsection{Architecture}
\label{sec:pim_architecture}

As the name of \aca{fimdram} suggests, it is based on the \aca{hbm} memory standard, and it integrates 16-wide \ac{simd} engines directly into the memory banks, exploiting bank-level parallelism while preserving the highly optimized \acp{subarray} \cite{kwon2021}.
A major difference from Newton is that \aca{fimdram} does not require any changes to components of modern processors, such as the memory controller, i.e., it is agnostic to existing \aca{hbm} platforms.
Consequently, mode switching is required for \aca{fimdram}, making it less useful for interleaved \ac{pim} and non-\ac{pim} traffic and for small batch sizes.
Fortunately, as discussed in \cref{sec:hbm}, the architecture of \ac{hbm} allows for many independent memory channels on a single stack, making it possible to cleanly separate the memory into a \ac{pim}-enabled region and a normal \ac{hbm} region.

At the heart of \aca{fimdram} are the \ac{pim} execution units, each of which is shared by two banks of a \ac{pch}.
They include 16 16-bit wide \ac{simd} \acp{fpu}, \acp{crf}, \acp{grf} and \acp{srf} \cite{lee2021}.
This general architecture is shown in detail in \cref{img:fimdram}, with (a) the placement of the \ac{pim} units between the memory banks of a \ac{dram} die, (b) a bank coupled to its \ac{pim} unit, and (c) the data path in and around an \ac{fpu} within the \ac{pim} unit.

As can be seen in (c), the input data to an \ac{fpu} can come directly from the memory bank, from a \ac{grf}/\ac{srf}, or from the result bus of a previous computation.
The 16-wide \ac{simd} units correspond to the 256-bit prefetch architecture of \aca{hbm}, where 16 16-bit floating-point operands are passed directly from the \acp{ssa} to the \acp{fpu} in a single memory access.
As all \ac{pim} units operate in parallel, with 16 banks per \ac{pch}, a single memory access loads a total of $\qty{256}{\bit}\cdot\qty{8}{processing\ units}=\qty{2048}{\bit}$ into the \acp{fpu}.
As a result, the theoretical internal bandwidth of \aca{fimdram} is $\qty{8}{\times}$ higher than the external bus bandwidth to the host processor.
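
The parallelism arithmetic above can be checked directly, assuming 16 banks per \ac{pch} shared pairwise by 8 \ac{pim} units:

```python
prefetch_bits = 256                   # HBM prefetch per memory access
operand_bits = 16                     # one fp16 operand
banks_per_pch = 16
units_per_pch = banks_per_pch // 2    # one PIM unit per bank pair

simd_width = prefetch_bits // operand_bits
bits_per_access = prefetch_bits * units_per_pch

assert simd_width == 16               # 16-wide SIMD matches the prefetch
assert bits_per_access == 2048        # loaded into the FPUs per access
assert bits_per_access // prefetch_bits == 8   # 8x the external bus width
```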

\Ac{hbm}-\ac{pim} defines three operating modes:
\begin{enumerate}

This processing unit architecture is illustrated in \cref{img:pcu}.
\end{figure}

To emphasize the architectural differences: unlike SK Hynix's Newton architecture, \aca{fimdram} requires both mode switching and loading a microkernel into the processing units before a workload can be executed.
This makes \aca{fimdram} less effective for very small workloads, as the overhead of mode switching and initialization would be significant.

\subsubsection{Instruction Set}

The data layout of these three instruction groups is shown in \cref{tab:isa}.
\end{table}

For the control flow instructions, there are NOP, which does not perform any operation; JUMP, which performs a fixed-iteration jump to an offset instruction; and EXIT, which restores the internal state of the processing unit.
It is important to note that the JUMP instruction is a zero-cycle instruction, i.e., it is executed together with the instruction that precedes it.
The arithmetic instructions perform operations such as simple ADD and MUL, but also support \ac{mac} and \ac{mad} operations, which are key for accelerating \ac{dnn} applications.
Finally, the MOV and FILL instructions are used to move data between the memory banks and the \ac{grf} and \ac{srf} register files.

The DST and SRC fields specify the operand type, i.e., the register file or bank affected by the operation.
Depending on the source or destination operand types, the instruction encodes indices for the concrete element in the register files, denoted in \cref{tab:isa} by \textit{\#} symbols.
The special field \textit{R} for the data movement instruction type enables a \ac{relu} operation, i.e., the clamping of negative values to zero, while the data is moved to another location.
Another special field, \textit{A}, enables the \ac{aam}, which will be explained in more detail in \cref{sec:instruction_ordering}.

\begin{table}
\centering
\end{table}

\Cref{tab:instruction_set} gives an overview of all available instructions and defines the possible operand sources and destinations.
It is important to note that some operations specifically require either a \ac{rd} or a \ac{wr} access to execute properly.
For example, to write the resulting output vector from a \ac{grf} to the memory banks, the memory controller must issue a \ac{wr} command to write to the bank.
Likewise, reading from the banks requires a \ac{rd} command.
For the control-type instructions and the arithmetic instructions without the bank as a source operand, either a \ac{rd} or a \ac{wr} can be issued to execute the instruction.
For the rest of this thesis, it is assumed that a \ac{rd} is issued for these instructions.

\subsubsection{Instruction Ordering}
\label{sec:instruction_ordering}

Since the execution of an instruction in the microkernel is initiated by a memory access, the host processor must execute \ac{ld} or \ac{st} instructions in a sequence that exactly matches the loaded \ac{pim} microkernel.
When an instruction has a bank as its specified source or destination, the address of the corresponding memory access specifies the exact row and column where the data should be loaded from or stored to.
This means that the respective memory accesses for such instructions must not be reordered, as each access must match its corresponding instruction in the microkernel.
For example, as shown in \cref{lst:reorder}, two consecutive \ac{mac} instructions with the memory bank as one operand source already specify the respective register index, but must wait for their actual memory accesses to receive the row and column address of the bank access.

\begin{listing}
\begin{verbatim}
MAC GRF_B #0, BANK, GRF_A #0
MAC GRF_B #1, BANK, GRF_A #1
\end{verbatim}
\label{lst:reorder}
\end{listing}
Unfortunately, the memory controller between the host processor and the \ac{pim} memory is allowed to reorder memory fetches as long as they do not introduce hazards.
|
|
|
|
Unfortunately, the memory controller between the host processor and the \ac{pim} memory is allowed to reorder memory fetches as long as they do not introduce hazards.
|
|
|
|
This causes the register sources and destinations to be out of sync with the bank addresses.
|
|
|
|
This causes the register sources and destinations to be out of sync with the bank addresses.
|
|
|
|
One solution to this problem would be to introduce memory barriers between each \ac{ld} and \ac{st} instruction of the processor, to prevent any reordering, as only one memory transaction is handled by the controller at a time.
|
|
|
|
One solution to this problem would be to introduce memory barriers between each \ac{ld} and \ac{st} instruction of the processor, to prevent any reordering, as only one memory transaction is handled by the controller at a time.
|
|
|
|
However, this comes at a significant performance cost and results in memory bandwidth being underutilized as the host processor has to wait for every memory access to complete.
|
|
|
|
However, this comes at a significant performance cost and results in memory bandwidth being underutilized because the host processor has to wait for every memory access to complete.
|
|
|
|
Disabling memory controller reordering completely, on the other hand, interferes with non-\ac{pim} traffic and significantly reduces its performance.
|
|
|
|
Disabling memory controller reordering completely, on the other hand, interferes with non-\ac{pim} traffic and significantly reduces its performance.
|
|
|
|
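The ordering hazard described above can be made concrete with a small toy model. This is purely illustrative (the lists of indices are invented, not real controller traces): if register indices follow the instruction stream while bank addresses follow the controller's reordered access stream, the two sequences drift apart.

```python
# Toy illustration of the ordering hazard: the host issues PIM MAC
# instructions in program order, but the memory controller may reorder
# the corresponding bank accesses.
program_order = [0, 1, 2, 3]      # register index per MAC instruction
reordered_access = [0, 2, 1, 3]   # controller-chosen access order

# Pair each instruction's register index with the address it actually
# receives; any mismatch means registers and bank addresses are out of sync.
pairs = list(zip(program_order, reordered_access))
out_of_sync = [(r, a) for r, a in pairs if r != a]
print(out_of_sync)  # [(1, 2), (2, 1)]
```

The two mismatched pairs correspond to the swapped accesses, which is exactly the situation the memory barriers discussed above would prevent.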

To eliminate this overhead, Samsung has introduced the \ac{aam} mode for arithmetic instructions.
In \ac{aam} mode, the register indices of an instruction are ignored and instead decoded from the column and row address of the memory access itself, as demonstrated in \cref{img:aam}.
With this method, the register indices and the bank address cannot get out of sync, even if the memory controller reorders the accesses, as the two are tightly coupled.
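A minimal sketch of this coupling, assuming a register file with eight entries and an index derived from the column address (the helper name and the modulo mapping are illustrative, not Samsung's actual bit layout):

```python
GRF_ENTRIES = 8  # assumed number of entries per GRF register file

def decode_aam_indices(col_addr):
    """AAM-style decoding: derive register indices from the access address
    instead of the instruction encoding."""
    grf_a_idx = col_addr % GRF_ENTRIES  # source operand register
    grf_b_idx = col_addr % GRF_ENTRIES  # accumulator register
    return grf_a_idx, grf_b_idx

# Even if the controller reorders accesses, each access carries its own
# address, so the decoded indices cannot go out of sync with it.
for col in [3, 0, 2, 1]:  # arbitrary (reordered) access order
    print(decode_aam_indices(col))
```

Because the index is a pure function of the address, the order in which accesses arrive is irrelevant, which is the point of \ac{aam} mode.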
@@ -295,7 +294,7 @@ JUMP -1, 7
|
|
|
|
|
|
|
|
|
|
|
|

Since the column address of the memory access is incremented after each iteration, all entries of the \ac{grf}-A register file, where the input vector is stored, are used to multiply the input vector with the matrix weights loaded on the fly from the memory banks.
The actual order of the memory accesses is irrelevant; the host only has to place memory barrier instructions before and after the \ac{mac} kernel to resynchronize the execution.
To achieve this particular operation, where the addresses are used to calculate the register indices, the memory layout of the weight matrix has to follow a special pattern.
This memory layout is explained in detail in \cref{sec:memory_layout}.
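The iteration scheme above can be sketched as a toy loop (all values invented for illustration): the column address increments each step, so successive bank reads are paired with successive \ac{grf}-A entries and accumulated into \ac{grf}-B.

```python
grf_a = [1.0, 2.0, 3.0, 4.0]     # toy input-vector registers (GRF-A)
bank = [10.0, 20.0, 30.0, 40.0]  # toy weights streamed from the bank
grf_b = [0.0] * len(grf_a)       # accumulator registers (GRF-B)

for col in range(len(bank)):      # column address increments per iteration
    idx = col % len(grf_a)        # AAM: index decoded from the address
    grf_b[idx] += grf_a[idx] * bank[col]

print(grf_b)  # [10.0, 40.0, 90.0, 160.0]
```

Every \ac{grf}-A entry is consumed exactly once per pass over the columns, matching the description of the kernel.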

\subsubsection{Programming Model}

Firstly, a \ac{pim} device driver is responsible for allocating buffers in \ac{hbm} memory and mapping them as uncacheable.
It does this because the on-chip cache would add an unwanted filtering between the host processor's \ac{ld} and \ac{st} instructions and the generation of memory accesses by the memory controller.
Alternatively, it would be possible to control cache behavior by issuing flush and invalidate instructions, but this would introduce an overhead, as a flush would have to be issued between every pair of \ac{pim} instructions in the microkernel.
Secondly, a \ac{pim} acceleration library implements a set of \ac{blas} operations and manages the generation, loading, and execution of the microkernel on behalf of the user.
At the highest level, \aca{fimdram} provides an extension to the \ac{tf} framework that allows for either calling the special \ac{pim} operations implemented by the accelerator library directly on the source operands, or for automatically finding suitable routines that can be accelerated by \ac{pim} during normal \ac{tf} operation.
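The layering of driver, acceleration library, and framework extension can be sketched as a purely illustrative pseudo-API (all class and method names are invented for this sketch and do not reflect Samsung's actual interfaces):

```python
class FakePimDriver:
    """Stand-in for the device driver layer: hands out PIM buffers."""
    def alloc_uncacheable(self, nbytes):
        # In the real stack this memory would be mapped uncacheable;
        # a plain bytearray stands in for it here.
        return bytearray(nbytes)

class FakePimBlas:
    """Stand-in for the acceleration library: exposes BLAS-style ops."""
    def __init__(self, driver):
        self.driver = driver

    def gemv(self, matrix, vec):
        # Host-side reference of what the offloaded microkernel computes.
        return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

# A framework extension would dispatch suitable ops to this library.
blas = FakePimBlas(FakePimDriver())
print(blas.gemv([[1, 2], [3, 4]], [1, 1]))  # [3, 7]
```

The point of the sketch is only the division of responsibility: allocation below, \ac{blas} semantics in the middle, framework dispatch on top.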

The software stack is able to concurrently exploit the independent parallelism of the \acp{pch} for a \ac{mac} operation, as described in \cref{sec:instruction_ordering}.
Since \aca{hbm} memory is mainly used in conjunction with \acp{gpu}, which do not implement sophisticated out-of-order execution, it is necessary to spawn a number of software threads to issue the eight memory accesses simultaneously.
The necessary number of threads depends on the processor \ac{isa}; e.g., with a maximum access size of $\qty{16}{\byte}$, $\qty{256}{\byte}/\qty{16}{\byte}=\num{16}$ threads are required to access the full \aca{hbm} burst size.
Such a group of software threads is called a thread group.
Thus, a total of 64 thread groups running in parallel can be spawned in an \ac{hbm} configuration with four memory stacks and a total of 64 \acp{pch}.
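The thread-count arithmetic above amounts to the following small calculation (constants taken directly from the text):

```python
BURST_SIZE = 256       # HBM burst size in bytes
MAX_ACCESS_SIZE = 16   # per-thread access size in bytes (ISA dependent)
NUM_PCH = 64           # pseudo-channels across four memory stacks

# One thread group must cover a full burst with individual accesses.
threads_per_group = BURST_SIZE // MAX_ACCESS_SIZE
# One thread group per pseudo-channel can run in parallel.
num_thread_groups = NUM_PCH

print(threads_per_group)   # 16
print(num_thread_groups)   # 64
```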

The input vector must also adhere to a special memory layout.
Since a vector is essentially a single-column matrix, it is always laid out sequentially in memory.
However, because all processing units must access the same input vector elements at the same time, each processing unit must load the respective vector elements into its \ac{grf}-A registers during the initialization phase of the microkernel.
As there is no communication between the banks, every bank needs to have its own copy of the input vector.
Consequently, from the perspective of the linear address space, multiple copies of the input vector chunks must be interleaved in such a way that the input vector is contiguous from the perspective of each bank.
This interleaving is illustrated in \cref{img:input_vector}.
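A minimal sketch of this replication, assuming consecutive chunks in the linear address space map to alternating banks (the helper name and toy sizes are illustrative, not the vendor layout):

```python
def interleave_input_vector(vec, num_banks, chunk):
    """Replicate each chunk of `vec` once per bank in linear address order,
    so that each bank sees one contiguous copy of the whole vector."""
    layout = []
    for i in range(0, len(vec), chunk):
        for _ in range(num_banks):           # one copy of this chunk per bank
            layout.extend(vec[i:i + chunk])
    return layout

vec = list(range(4))                          # toy 4-element input vector
print(interleave_input_vector(vec, num_banks=2, chunk=2))
# -> [0, 1, 0, 1, 2, 3, 2, 3]
```

With two banks, bank 0 receives the chunks at even positions and bank 1 those at odd positions, so each bank ends up with the contiguous vector `[0, 1, 2, 3]`.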

\[
psum[i,0:15]=\sum_{j=0}^{8}(a[j \cdot 16:j \cdot 16+15] \cdot w[i,j \cdot 16:j \cdot 16+15])
\]

The partial sum vector $psum[0:7,0:15]$ must then be reduced by the host processor to obtain the final output vector $b[0:7]$.
This reduction step is mandatory because the \aca{fimdram} architecture provides no means to reduce the output sums of the 16-wide \ac{simd} \acp{fpu}.
In contrast, SK Hynix's Newton implements adder trees in the \ac{pim} units to reduce the partial sums directly in memory.
Note that, consequently, the activation function often used in \acp{dnn}, i.e., \ac{relu} in the case of \aca{fimdram}, cannot be applied without first reducing the partial sums, since \ac{relu} is a non-linear function.
The operation of this concrete \ac{gemv} microkernel is illustrated in \cref{img:memory_layout}.
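The host-side step can be sketched as follows (a minimal sketch with toy values; the function name is ours): each output element arrives as a 16-wide lane vector, the host sums the lanes, and only then may \ac{relu} be applied.

```python
def reduce_and_activate(psum):
    """psum: one 16-wide lane vector per output row -> activated outputs."""
    out = []
    for lanes in psum:
        total = sum(lanes)           # reduce the 16 SIMD lanes on the host
        out.append(max(total, 0.0))  # ReLU must come after the reduction
    return out

psum = [[1.0] * 16, [-1.0] * 16]  # toy partial sums for two output rows
print(reduce_and_activate(psum))  # [16.0, 0.0]
```

Applying \ac{relu} lane-wise before summing would give a different (wrong) result for the second row, which is exactly why the non-linearity must wait for the reduction.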

\begin{figure}
\caption{Memory layout and operation of the \ac{gemv} microkernel.}
\label{img:memory_layout}
\end{figure}

\begin{listing}
\begin{verbatim}
...
JUMP -1, 63
\end{verbatim}
\caption{\ac{gemv} microkernel.}
\end{listing}

To increase the number of columns, new entries of the input vector must be loaded into the processing units.
Therefore, it is necessary to execute the complete \ac{gemv} microkernel several times with different input vector chunks and weight matrix columns.
In general, the more the dimensions exceed the native \ac{pim} matrix dimensions, the more often the \ac{mac} core of the \ac{gemv} microkernel must be executed.
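The resulting launch count can be sketched as a simple tiling calculation (toy dimensions; the native tile size here is an assumption for illustration, not the actual \aca{fimdram} geometry):

```python
import math

def num_kernel_launches(rows, cols, native_rows, native_cols):
    """How often the MAC core of the GEMV microkernel must be executed
    when the matrix exceeds the native PIM dimensions."""
    row_tiles = math.ceil(rows / native_rows)
    col_tiles = math.ceil(cols / native_cols)
    return row_tiles * col_tiles

# Toy numbers: a 16x256 matrix on an assumed native 8x128 tile -> 2*2 runs.
print(num_kernel_launches(16, 256, 8, 128))  # 4
```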

\subsubsection{Performance and Power Efficiency Effects}

In addition to the theoretical bandwidth of $\qty[per-mode=symbol]{128}{\giga\byte\per\second}$ provided to the \ac{pim} units, or a total of $\qty[per-mode=symbol]{2}{\tera\byte\per\second}$ for 16 \acp{pch}, Samsung also ran experiments on a real implementation of \aca{fimdram} to analyze its performance gains and power efficiency improvements.
This real system is based on a Xilinx Zynq Ultrascale+ \ac{fpga} that is integrated onto the same silicon interposer as four \aca{hbm} stacks, each consisting of one buffer die, four \aca{fimdram} dies, and four normal \aca{hbm} dies \cite{lee2021}.
Results promise performance gains in the range of $\qtyrange{1.4}{11.2}{\times}$ in the tested microbenchmarks, with the highest gain of $\qty{11.2}{\times}$ for a \ac{gemv} kernel.
Real layers of \acp{dnn} achieved a performance gain in the range of $\qtyrange{1.4}{3.5}{\times}$.
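As a sanity check, the two bandwidth figures quoted above are consistent with each other:

```python
per_pch_gbps = 128   # GB/s of internal bandwidth per pseudo-channel
num_pch = 16

total_gbps = per_pch_gbps * num_pch
print(total_gbps)    # 2048, i.e. roughly the quoted 2 TB/s
```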