\section{Simulation Results}
\label{sec:results}
This section explores the potential performance improvement of \aca{fimdram} across different system configurations and workloads.
After a brief introduction of the simulated system architecture, an estimated theoretical performance gain is calculated.
This is followed by a discussion of the measurement accuracy and suggestions for improving the measurement environment.
Furthermore, the workload-specific variations of the system parameters are explored.
A set of simulations is then run based on these parameters, and the resulting performance improvements are analyzed.
% Finally, a comparison between the execution time of the initialization of the operands and the microkernel execution time is performed to estimate the setup overhead of \aca{fimdram}.

\subsection{System Architecture}
The memory configuration used in the simulations has already been partially introduced in \cref{sec:memory_configuration}.
Each \ac{pim}-enabled \ac{pch} contains 8 processing units, each of which is connected to 2 memory banks.
A processing unit operates at the same frequency as an \aca{hbm} \ac{dram} device, $\qty{250}{\mega\hertz}$.
The external clocking of the memory bus itself is $\qty{4}{\times}$ higher, at a frequency of $\qty{1}{\giga\hertz}$; the data, address, and command buses of \aca{hbm} operate at \ac{ddr} \cite{lee2021}.
Thus, with both the 16-wide \ac{fp} adder and the 16-wide \ac{fp} multiplier, a single processing unit achieves a throughput of $\num{2} \cdot \qty{16}{FLOP} \cdot \qty{250}{\mega\hertz}=\qty{8}{\giga FLOPS}$.
In total, the 16 processing units in a memory channel provide a throughput of $\num{16}\cdot\qty{8}{\giga FLOPS}=\qty{128}{\giga FLOPS}$.
To compare this throughput with the vector processing unit of a real processor, a very simplified estimate can be made based on the ARM NEON architecture, which holds 8 \ac{fp16} numbers in a single $\qty{128}{\bit}$ vector register \cite{arm2020}.
Assuming a single processor core runs at a frequency of $\qty{3}{\giga\hertz}$, the vector processing unit can achieve a maximum throughput of $\qty{8}{FLOP} \cdot \qty{3}{\giga\hertz}=\qty{24}{\giga FLOPS}$, which is about $\qty{5}{\times}$ less than the \aca{fimdram} throughput of a single memory channel.
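The throughput figures above can be reproduced with a short back-of-the-envelope calculation; the following sketch only re-derives the numbers stated in the text and is not a performance model:

```python
# Sanity check of the stated throughput figures (values taken from the text).

pu_freq_hz = 250e6        # processing unit frequency (250 MHz)
flops_per_cycle = 2 * 16  # 16-wide FP adder + 16-wide FP multiplier

pu_gflops = pu_freq_hz * flops_per_cycle / 1e9  # per processing unit
channel_gflops = 16 * pu_gflops                 # 16 units per memory channel

neon_gflops = 8 * 3e9 / 1e9  # 8 FP16 lanes per 128-bit register at 3 GHz

print(pu_gflops, channel_gflops, channel_gflops / neon_gflops)
# → 8.0 128.0 5.333...
```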

% some implementation details
% hbm size, channel...
% operating at ...MHz
% theoretical bandwidth and FLOPS...
% very simple comparison to ARM FLOPS/cycle -> ratio in the best case

\subsection{Accuracy and Comparability}
When interpreting the following simulation results, it is important to note that the system configuration does not strictly reflect a system on which a real \ac{dnn} inference would be performed.
Firstly, implementing the workloads on a bare-metal kernel simplifies the execution environment of the processor, since no other processes interact with it in any way.
The workload process is never preempted, so the effect of an interruption during the critical \ac{pim} microkernel execution cannot be analyzed.
Secondly, for performance reasons, a \ac{dnn} inference is typically run not on a \ac{cpu} but on \acp{gpu} or \acp{tpu}.
These accelerators may exhibit significantly different execution behavior: a \ac{gpu} may aggressively accelerate the \ac{dnn} inference by performing many operations in parallel, and a \ac{tpu} may use specialized architectures such as systolic arrays to accelerate matrix-vector operations.
These differences would also be reflected in the memory access pattern and may be subject to other effects that alter the behavior of \aca{fimdram}.
Furthermore, since the mode switching of \aca{fimdram} is not measured in the simulations, the setup overhead is limited to the required layout conversions of the input operands.
The high overhead of a \ac{pim} operation on a small data set may therefore be underrepresented.
Nevertheless, the simulations performed provide an informative insight into the effectiveness of \aca{fimdram} and its suitability for various workloads.

% bare-metal is the best case; Linux would be a more realistic test environment
% the setup-time overhead cannot be measured properly
% inference on a CPU is atypical; a GPU model would be more suitable

\subsection{Objectives}
The simulations aim to address several research objectives.
As already discussed in \cref{sec:pim}, \ac{pim} aims to accelerate memory-bound problems such as \ac{gemv} and may show only a small performance gain, or even a slowdown, for compute-bound problems such as \ac{gemm}.
The potential speedup of \aca{fimdram} should therefore be analyzed by performing the simulations on a variety of workloads.
For these workloads, the dimensions of the input operands may play an important role in how effective \ac{pim} is.
Small dimensions suffer from a high impact of the setup overhead, while for large dimensions this effect may be less significant.
The performance gains for different operand dimensions should be analyzed, possibly identifying a break-even point at which \ac{pim} becomes viable.
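The break-even intuition can be sketched with a minimal toy model, assuming a fixed setup cost (the operand layout conversion) plus a linear per-element cost for PIM, and only a linear per-element cost for the CPU; all parameters below are hypothetical:

```python
# Toy break-even model (all cost parameters hypothetical, illustration only).

def break_even_elements(t_setup, t_pim_per_elem, t_cpu_per_elem):
    """Smallest problem size n for which PIM is at least as fast as the CPU.

    Solves t_setup + n * t_pim_per_elem <= n * t_cpu_per_elem for n.
    Returns None if PIM's per-element cost is not lower, i.e. it never
    amortizes the setup overhead.
    """
    if t_cpu_per_elem <= t_pim_per_elem:
        return None
    return t_setup / (t_cpu_per_elem - t_pim_per_elem)

# Hypothetical costs in arbitrary time units:
n = break_even_elements(t_setup=1000.0, t_pim_per_elem=1.0, t_cpu_per_elem=3.0)
print(n)  # → 500.0 (PIM viable above this vector length)
```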

Specifically, bulk vector additions and multiplications are executed, as well as level 1 \ac{blas} \ac{haxpy} operations.
To model the inference of a \ac{dnn}, a single \ac{gemv} operation is performed first, followed by a simple model of a sequence of multiple \ac{dnn} layers, including the necessary processing steps between the \ac{gemv} routines.
Namely, after the reduction step of the output vector, an activation function, in this case \ac{relu}, is applied before the vector is passed as input to the next layer.
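The modeled layer sequence can be sketched as follows; this is a minimal pure-Python reference of the computation pattern (GEMV, then ReLU, then the next layer), with illustrative dimensions and weights rather than the actual simulated workload:

```python
# Minimal sketch of the modeled multi-layer inference pattern.

def gemv(matrix, vector):
    """Dense matrix-vector product: one dot product per output row."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def relu(vector):
    """Element-wise ReLU activation, applied after reducing the output."""
    return [max(0.0, v) for v in vector]

def forward(layers, x):
    """Run a sequence of DNN layers, applying ReLU between GEMV calls."""
    for weights in layers:
        x = relu(gemv(weights, x))
    return x

# Two tiny 2x2 layers as a stand-in for real model weights:
layers = [[[1.0, -1.0], [0.5, 0.5]], [[2.0, 0.0], [-1.0, 1.0]]]
print(forward(layers, [1.0, 2.0]))  # → [0.0, 1.5]
```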

% When performing inference of multiple \ac{dnn} layers, an activation function is typically applied to the output of each layer.
% \Aca{fimdram} provides a \ac{relu} operation that can be applied while moving the newly interleaved input vector into the \ac{grf}-A registers.
% The performance gain of applying this operation in memory instead of on the host processor after reducing the partial sums of the output vector can be investigated.
% Furthermore, the concrete number of processing units in a \ac{pch} is a compromise against the loss of usable memory area.
% Using the flexible simulation model, it is possible to analyze the impact of the shared processing unit architecture compared to a hypothetical solution where each bank is connected to its own processing unit.

To evaluate the analysis objectives, each workload in this set of simulations is performed in four different configurations:
Two configurations use a generic ARM processor running at a frequency of $\qty{3}{\giga\hertz}$, once with \ac{pim} enabled and once performing the operations only on the processor; these represent a realistic system.
In addition, two configurations with the same ARM processor, but running at a nearly infinite frequency, are simulated.
While these configurations do not reflect a real system, they are used to address the previously mentioned concerns about the meaningfulness of performing the simulations on a \ac{cpu}.
With infinite computational power, the simulation is guaranteed to be limited only by the memory system, eliminating the computation latencies introduced by the \ac{cpu}.
This allows an exaggerated evaluation of the performance gains of \ac{pim} in an optimal environment, where only the effect of memory boundedness can be observed.

% different kernels
% shared pim units (-> half the rows / half the performance, to be verified)
% sweep of matrix dimensions rows/columns, break even point
% ReLU in DRAM vs on host

% comparison with normal clock and infinite compute (always 4 simulations, or 5 with real hardware)

\subsection{Simulation Results}

\subsubsection{Vector Operations}

% Vector ADD and Vector MUL
% vector-scalar ADD and vector-scalar MUL (HCAL) (will probably be skipped)
% Vector HAXPY x*a+y

% plots, e.g. VADD and VMUL side by side for different dimensions at one frequency
% other frequency in the next plot
% then HAXPY

The first set of benchmarks analyzes the speedup of \aca{fimdram} for various vector operations, namely an element-wise vector add operation (VADD), an element-wise vector multiply operation (VMUL), and a \ac{haxpy} operation.
Such vector operations have a low operational density and are particularly memory-bound, because there is no data reuse at all and two input operands must be loaded for each operation.
As a result, the on-chip cache does not accelerate such workloads, because all operand data must be fetched from memory anyway.
The workloads adhere to the following calculation patterns:

\begin{itemize}
\item VADD: $z = x + y$
\item VMUL: $z = x \cdot y$
\item \ac{haxpy}: $z = a \cdot x + y$
\end{itemize}
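The three kernel patterns above can be written as short reference implementations; this is a pure-Python sketch (the simulated workloads operate on \ac{fp16} data, which is not modeled here):

```python
# Reference implementations of the three vector benchmark kernels.
# Each output element requires loading two operands for one or two FLOPs,
# which is why the operational density is so low.

def vadd(x, y):
    """Element-wise vector addition: z = x + y."""
    return [a + b for a, b in zip(x, y)]

def vmul(x, y):
    """Element-wise vector multiplication: z = x * y."""
    return [a * b for a, b in zip(x, y)]

def haxpy(a, x, y):
    """Level 1 BLAS AXPY (half precision in the benchmark): z = a*x + y."""
    return [a * xi + yi for xi, yi in zip(x, y)]

print(vadd([1.0, 2.0], [3.0, 4.0]))        # → [4.0, 6.0]
print(vmul([1.0, 2.0], [3.0, 4.0]))        # → [3.0, 8.0]
print(haxpy(2.0, [1.0, 2.0], [3.0, 4.0]))  # → [5.0, 8.0]
```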

Each workload is run with different input vector dimensions to examine the effect of setup overhead and potentially identify a break-even point at which \ac{pim} becomes viable.
\Cref{tab:dimensions_vector} lists the specific vector dimensions for the following benchmarks.
The levels X1--X4 denote increasing dimensions, with each step doubling in size.

\begin{table}
\centering
\begin{tblr}{
cell{2}{2} = {r},
cell{3}{2} = {r},
cell{4}{2} = {r},
cell{5}{2} = {r},
hlines,
vlines,
hline{2} = {-}{solid,black},
hline{2} = {2}{-}{solid,black},
}
Level & Dimensions \\
X1 & (256 $\times$ 1) \\
X2 & (512 $\times$ 1) \\
X3 & (1024 $\times$ 1) \\
X4 & (2048 $\times$ 1)
\end{tblr}
\caption{List of the input vector dimensions for the vector benchmarks.}
\label{tab:dimensions_vector}
\end{table}

The benchmarks analyze the relative number of processor ticks, where the speedup is calculated as follows:
\begin{equation}
S = \frac{\textrm{\# of ticks in non-\ac{pim} mode}}{\textrm{\# of ticks in \ac{pim} mode}}
\end{equation}
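As a worked example of this metric (with hypothetical tick counts), a speedup greater than one means \ac{pim} is faster:

```python
# Speedup as the ratio of processor ticks; values below are hypothetical.
def speedup(ticks_non_pim, ticks_pim):
    """S > 1 means the PIM-enabled run finished in fewer ticks."""
    return ticks_non_pim / ticks_pim

print(speedup(3000, 1000))  # → 3.0
```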

\begin{figure}
\centering
\input{plots/vector_normal}
\caption{Comparison between non-\ac{pim} and \ac{pim} for the vector benchmarks running at a \ac{cpu} frequency of $\qty{3}{\giga\hertz}$.}
\label{fig:vector_normal}
\end{figure}

\begin{figure}
\centering
\input{plots/vector_infinite}
\caption{Comparison between non-\ac{pim} and \ac{pim} for the vector benchmarks running at a nearly infinite \ac{cpu} frequency.}
\label{fig:vector_infinite}
\end{figure}

\subsubsection{Neural Network Layers}

% GEMV
% Samsung 7.4x-8.9x
% "inference" with multiple layers
% ReLU comparison

% GEMM with heavily interleaved matrices (probably not)

\begin{figure}
\centering
\input{plots/matrix_normal}
\caption{Comparison between non-\ac{pim} and \ac{pim} for the neural network layer benchmarks running at a \ac{cpu} frequency of $\qty{3}{\giga\hertz}$.}
\label{fig:matrix_normal}
\end{figure}

\begin{figure}
\centering
\input{plots/matrix_infinite}
\caption{Comparison between non-\ac{pim} and \ac{pim} for the neural network layer benchmarks running at a nearly infinite \ac{cpu} frequency.}
\label{fig:matrix_infinite}
\end{figure}

\subsubsection{Comparison to Real Hardware}

% \subsubsection{Initialization Overhead}
% estimate the conversion of the operands relative to the runtime

% \subsubsection{Shared Processing Units}
% shared processing units vs. one per bank
% GEMV