\section{Simulation Results}
\label{sec:results}

This section explores the potential performance improvement of \aca{fimdram} across different system configurations and workloads.
After a brief introduction of the simulated system architecture, an estimate of the theoretical performance gain is derived.
This is followed by a discussion of the measurement accuracy and suggestions for improving the measurement environment.
Furthermore, the variations of the system parameters for each workload are explored.
A set of simulations is then run based on these parameters, and the resulting performance improvements are analyzed.
% Finally, a comparison between the execution time of the initialization of the operands and the microkernel execution time is performed to estimate the setup overhead of \aca{fimdram}.
\subsection{System Architecture}

The memory configuration used in the simulations has already been partially introduced in \cref{sec:memory_configuration}.
Each \ac{pim}-enabled \ac{pch} contains 8 processing units, each of which is connected to 2 memory banks.
A processing unit operates at the same frequency as the \aca{hbm} \ac{dram} device, namely $\qty{250}{\mega\hertz}$.
The external clocking of the memory bus itself is $\qty{4}{\times}$ higher with a frequency of $\qty{1}{\giga\hertz}$, as the data, address, and command buses of \aca{hbm} operate at \ac{ddr} \cite{lee2021}.
Thus, with both the 16-wide \ac{fp} adder and the 16-wide \ac{fp} multiplier, a single processing unit achieves a throughput of $\num{2} \cdot \qty{16}{FLOP} \cdot \qty{250}{\mega\hertz}=\qty{8}{\giga FLOPS}$.
In total, the 16 processing units in a memory channel provide a throughput of $\num{16}\cdot\qty{8}{\giga FLOPS}=\qty{128}{\giga FLOPS}$.
To put this throughput in relation to the vector processing unit of a real processor, a very simplified assumption can be made based on the ARM NEON architecture, which holds 8 \ac{fp16} numbers in a single $\qty{128}{\bit}$ vector register \cite{arm2020}.
Assuming the single processor core runs at a frequency of $\qty{3}{\giga\hertz}$, the vector processing unit can achieve a maximum throughput of $\qty{8}{FLOP} \cdot \qty{3}{\giga\hertz}=\qty{24}{\giga FLOPS}$, which is about $\qty{5}{\times}$ less than the \aca{fimdram} throughput of a single memory channel.

% some implementation details
% hbm size, channel...
% operating at ...MHz
% theoretical bandwidth and FLOPS...
% very simple comparison to ARM FLOPS/cycle -> ratio in the optimal case
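As a cross-check, the peak-throughput arithmetic above can be retraced in a few lines of Python; the constants are exactly the figures stated in the text, nothing more.

```python
# Sketch retracing the peak-throughput arithmetic of this subsection.
unit_freq_hz = 250e6        # processing-unit clock, same as the DRAM core clock
lanes = 16                  # 16-wide FP16 adder and 16-wide FP16 multiplier
flop_per_cycle = 2 * lanes  # adder and multiplier operate in parallel

unit_gflops = flop_per_cycle * unit_freq_hz / 1e9
channel_gflops = 16 * unit_gflops  # 16 processing units per memory channel

# Simplified NEON baseline: 8 FP16 values per 128-bit register at 3 GHz.
neon_gflops = 8 * 3e9 / 1e9

print(unit_gflops)                   # 8.0
print(channel_gflops)                # 128.0
print(channel_gflops / neon_gflops)  # ~5.3
```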
\subsection{Accuracy and Comparability}

When interpreting the following simulation results, it is important to note that the system configuration does not strictly reflect a system on which a real \ac{dnn} inference would be performed.
Firstly, implementing the workloads on a bare-metal kernel simplifies the execution environment of the processor, since no other processes interact with it in any way.
The workload process is never preempted, so the effect of an interruption during the critical \ac{pim} microkernel execution cannot be analyzed.
Secondly, for performance reasons, a \ac{dnn} inference is typically not run on a \ac{cpu} but on \acp{gpu} or \acp{tpu}.
These accelerators may have significantly different execution behavior: a \ac{gpu} may aggressively accelerate the \ac{dnn} inference by performing many operations in parallel, and a \ac{tpu} may use a specialized architecture, such as a systolic array, to accelerate matrix-vector operations.
These differences would also be reflected in the memory access pattern and may be subject to other effects that alter the behavior of \aca{fimdram}.
Furthermore, since the mode switching of \aca{fimdram} is not measured in the simulations, the setup overhead is limited to the required layout conversions of the input operands.
The high overhead of a \ac{pim} operation on a small data set may therefore be underrepresented.
Nevertheless, the simulations performed provide an informative insight into the effectiveness of \aca{fimdram} and its suitability for various workloads.

% bare metal is the optimal case, Linux would be a more realistic test environment
% the setup-time overhead cannot be measured properly
% inference on a CPU is atypical, a GPU model would be more suitable
\subsection{Objectives}

Through the simulations, the research aims to address several objectives.
As already discussed in \cref{sec:pim}, \ac{pim} aims to accelerate memory-bound problems such as \ac{gemv} and may only show a small performance gain, or even a slowdown, for compute-bound problems such as \ac{gemm}.
The potential speedup of \aca{fimdram} is therefore analyzed by performing the simulations on a variety of workloads.
For these workloads, the dimensions of the input operands may play an important role in how effective \ac{pim} is.
Small dimensions suffer from a high impact of the setup overhead, while for large dimensions this effect may be less significant.
The performance gains for different operand dimensions are analyzed, possibly identifying a break-even point at which \ac{pim} becomes viable.

Specifically, bulk vector additions and multiplications are executed, as well as level 1 \ac{blas} \ac{haxpy} operations.
To model the inference of a \ac{dnn}, a single \ac{gemv} operation is performed first, followed by a simple model of a sequence of multiple \ac{dnn} layers, including the necessary processing steps between the \ac{gemv} routines.
Namely, after the reduction step of the output vector, an activation function, in this case \ac{relu}, is applied before the vector is passed as input to the next layer.

% When performing inference of multiple \ac{dnn} layers, an activation function is typically applied to the output of each layer.
% \Aca{fimdram} provides a \ac{relu} operation that can be applied while moving the newly interleaved input vector into the \ac{grf}-A registers.
% The performance gain of applying this operation in memory instead of on the host processor after reducing the partial sums of the output vector can be investigated.
% Furthermore, the concrete number of processing units in a \ac{pch} is a compromise against the loss of usable memory area.
% Using the flexible simulation model, it is possible to analyze the impact of the shared processing unit architecture compared to a hypothetical solution where each bank is connected to its own processing unit.

To evaluate the analysis objectives, each of these simulation workloads is performed in four different configurations:
Two configurations use a generic ARM processor running at a frequency of $\qty{3}{\giga\hertz}$, once with \ac{pim} enabled and once performing the operations only on the processor, which should yield a realistic setup.
In addition, two configurations with the same ARM processor but a nearly infinite clock frequency are simulated.
While these configurations do not reflect a real system, they are used to address the previously mentioned concerns about the meaningfulness of performing the simulations on a \ac{cpu}.
With infinite computational power, the simulation is guaranteed to be limited only by the memory system, minimizing the computation latencies introduced by the \ac{cpu}.
This allows an exaggerated evaluation of the performance gains of \ac{pim} in an optimal environment, where only the effect on memory-boundedness can be observed.

% different kernels
% shared pim units (-> half the rows / half the performance, to be verified)
% sweep of matrix dimensions rows/columns, break-even point
% ReLU in DRAM vs on host

% comparison with normal clock and infinite compute (always 4 simulations, or 5 with real hardware)
\subsection{Simulation Results}

\subsubsection{Vector Operations}

% vector ADD and vector MUL
% vector-scalar ADD and vector-scalar MUL (HCAL) (will probably be skipped)
% vector HAXPY x*a+y

% plots, e.g. VADD and VMUL side by side for different dimensions at one frequency
% next plot: the other frequency
% then HAXPY
The first set of benchmarks analyzes the speedup of \aca{fimdram} for various vector operations, namely an element-wise vector add operation (VADD), an element-wise vector multiply operation (VMUL), and a \ac{haxpy} operation.
Such vector operations have a low operational density and are particularly memory-bound because there is no data reuse at all and two input operands must be loaded for each operation.
As a result, the on-chip cache does not accelerate such workloads, because all operand data must be fetched from memory anyway.
The workloads adhere to the following calculation patterns:
\begin{itemize}
\item VADD: $z = x + y$
\item VMUL: $z = x \cdot y$
\item \ac{haxpy}: $z = a \cdot x + y$
\end{itemize}
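The three calculation patterns can be stated as a brief reference implementation. This plain-Python sketch is illustrative only (it is not the benchmark code itself) and mirrors what the non-\ac{pim} baseline computes element by element:

```python
# Reference semantics of the three vector workloads (illustration only,
# not the actual benchmark kernels).
def vadd(x, y):
    # element-wise vector addition: z = x + y
    return [xi + yi for xi, yi in zip(x, y)]

def vmul(x, y):
    # element-wise vector multiplication: z = x * y
    return [xi * yi for xi, yi in zip(x, y)]

def haxpy(a, x, y):
    # level 1 BLAS AXPY on half-precision data: z = a * x + y
    return [a * xi + yi for xi, yi in zip(x, y)]

print(vadd([1.0, 2.0], [3.0, 4.0]))        # [4.0, 6.0]
print(haxpy(2.0, [1.0, 2.0], [3.0, 4.0]))  # [5.0, 8.0]
```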
Each workload is run with different input vector dimensions to examine the effect of the setup overhead and to potentially identify a break-even point at which \ac{pim} becomes viable.
\Cref{tab:dimensions_vector} lists the specific vector dimensions for the following benchmarks.
The levels X1--X4 denote increasing dimensions, with each successive level doubling in size, starting at 256, which is the minimum size that can be represented in a \ac{pim} data structure.
\begin{table}
\centering
\begin{tblr}{
column{1} = {c},
cell{2}{2} = {r},
cell{3}{2} = {r},
cell{4}{2} = {r},
cell{5}{2} = {r},
hlines,
vlines,
hline{2} = {-}{solid,black},
hline{2} = {2}{-}{solid,black},
}
Level & Vector Dimensions \\
X1 & (256 $\times$ 1) \\
X2 & (512 $\times$ 1) \\
X3 & (1024 $\times$ 1) \\
X4 & (2048 $\times$ 1)
\end{tblr}
\caption{List of the input vector dimensions for the vector benchmarks.}
\label{tab:dimensions_vector}
\end{table}
The benchmarks analyze the relative number of processor ticks for \ac{pim} compared to non-\ac{pim}, where the speedup $S$ is calculated as follows:
\begin{equation}
S = \frac{\textrm{\#ticks in non-\ac{pim} mode}}{\textrm{\#ticks in \ac{pim} mode}}
\end{equation}
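The break-even behavior mentioned above can be illustrated with a toy cost model. All constants here are hypothetical placeholders with no relation to the measured tick counts: the \ac{pim} path pays a fixed setup cost plus a smaller per-element cost, so $S$ crosses 1 once the operands are large enough.

```python
# Toy break-even model with HYPOTHETICAL constants (illustration only,
# not derived from the simulation measurements).
setup_ticks_pim = 4096   # fixed PIM setup overhead (hypothetical)
ticks_per_elem_pim = 1   # per-element cost in PIM mode (hypothetical)
ticks_per_elem_cpu = 9   # per-element cost on the CPU (hypothetical)

def speedup(n):
    """S = non-PIM ticks / PIM ticks for an n-element workload."""
    return (n * ticks_per_elem_cpu) / (setup_ticks_pim + n * ticks_per_elem_pim)

# Break-even point: the n at which CPU and PIM tick counts are equal.
n_star = setup_ticks_pim / (ticks_per_elem_cpu - ticks_per_elem_pim)
print(n_star)                                # 512.0
print(speedup(256) < 1, speedup(1024) > 1)   # True True
```

With these placeholder costs, \ac{pim} only becomes viable above 512 elements; the measured benchmarks determine where this crossover actually lies.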
\begin{figure}
\centering
\input{plots/vector_normal}
\caption{Comparison between non-\ac{pim} and \ac{pim} for the vector benchmarks running at a \ac{cpu} frequency of $\qty{3}{\giga\hertz}$.}
\label{fig:vector_normal}
\end{figure}
\Cref{fig:vector_normal} shows the relative performance for the vector benchmarks running on the generic ARM-based system at a typical clock frequency.
The relative speedup of \ac{pim} is in the range of about $\qtyrange{12.8}{31.8}{\times}$, with limited variance for each benchmark between the different vector dimensions, since such vector operations essentially scale linearly with the length of the input operands for both the non-\ac{pim} and \ac{pim} approaches.
The \ac{haxpy} benchmark has the highest variance with a range of $\qtyrange{19.8}{31.8}{\times}$, which is due to the fact that each element of the one input vector must first be multiplied by the scalar on the \ac{cpu} before the addition, while in the \ac{pim} case the specialized \ac{mad} instruction is used.
As all speedup values are well above 1, it can be concluded that even the smallest representable vector size of 256 is already above the break-even point at which \ac{pim} becomes viable.
\begin{figure}
\centering
\input{plots/vector_infinite}
\caption{Comparison between non-\ac{pim} and \ac{pim} for the vector benchmarks running on the infinite compute platform.}
\label{fig:vector_infinite}
\end{figure}
In addition to the generic ARM-based system, the same benchmarks were run on the hypothetical infinite compute system, the results of which are shown in \cref{fig:vector_infinite}.
As can be seen, the achievable speedup in the completely memory-bound system is, with a range of $\qtyrange{1.7}{2.4}{\times}$, lower than in the generic system.
The variance of the speedup between the different vector dimensions is also rather small.
For the \ac{haxpy} benchmark, the smaller variance of $\qtyrange{2.0}{2.4}{\times}$ can be interpreted as follows:
The additional computation step of the scalar multiplication does not affect the non-\ac{pim} system as much as in the previous case, because it is insignificant compared to the memory fetch of the vector elements.

% vectors: essentially both scale with the length of the vector, slightly less overhead
% haxpy: the scalar multiplication slows the CPU down significantly, which is why this difference disappears at 100 GHz
\subsubsection{Neural Network Layers}

% GEMV
% Samsung 7.4x-8.9x
% "inference" with multiple layers
% ReLU comparison

% GEMM with heavily interleaved matrices (probably not)
In addition to the vector operations and the level 1 \ac{blas} routine \ac{haxpy}, the performance improvement of \ac{pim} is also investigated for the level 2 \ac{blas} routine \ac{gemv}.
Besides the regular \ac{gemv} operation, whose form is $y = A \cdot x$, several matrix-vector multiplications are chained together with the activation function \ac{relu} applied in between, modeling a simple fully connected neural network.
Each processing step for a \ac{dnn} layer can be described as $y = \textrm{ReLU}(A \cdot x)$, where the output of the operation is fed as input to the next layer.
In the simplest form, square matrix dimensions ensure that the output vector of each layer has the same dimensions as the input vector, which simplifies the chaining in the benchmark.
Again, several different dimensions of the benchmark inputs are used; the matrix dimensions for each of the two benchmarks are given in \cref{tab:dimensions_matrix}.
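The layer chaining described above can be sketched in plain Python. This is illustrative reference semantics only, not the benchmark implementation; with square weight matrices, the output of each $y = \textrm{ReLU}(A \cdot x)$ step feeds directly into the next layer:

```python
# Reference semantics of the chained DNN-layer benchmark (illustration only).
def gemv(A, x):
    # y = A * x for a row-major matrix A
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def relu(v):
    return [max(0.0, vi) for vi in v]

def dnn_chain(layers, x):
    # Each layer is a square weight matrix, so the output vector of one
    # layer has the dimensions required as input by the next.
    for A in layers:
        x = relu(gemv(A, x))
    return x

I = [[1.0, 0.0], [0.0, 1.0]]           # identity weights as a trivial example
print(dnn_chain([I, I], [1.0, -2.0]))  # [1.0, 0.0]
```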
\begin{table}
\centering
\begin{tblr}{
column{1} = {c},
cell{2}{2} = {r},
cell{3}{2} = {r},
cell{4}{2} = {r},
cell{5}{2} = {r},
cell{2}{3} = {r},
cell{3}{3} = {r},
cell{4}{3} = {r},
cell{5}{3} = {r},
hlines,
vlines,
hline{2} = {-}{solid,black},
hline{2} = {2}{-}{solid,black},
}
Level & \ac{gemv} Matrix Dimensions & \ac{dnn} Matrix Dimensions \\
X1 & (128 $\times$ 128) & (128 $\times$ 128) \\
X2 & (256 $\times$ 128) & (256 $\times$ 256) \\
X3 & (512 $\times$ 128) & (512 $\times$ 512) \\
X4 & (1024 $\times$ 128) & (1024 $\times$ 1024)
\end{tblr}
\caption{List of the matrix dimensions for the neural network benchmarks.}
\label{tab:dimensions_matrix}
\end{table}
In the \ac{gemv} benchmarks, only the number of rows is increased at each step, which means that the \ac{pim} microkernel has to perform more iterations of the \ac{mac} kernel, but does not have to load another chunk of the input vector, since it fits completely into the \ac{grf}-A registers.
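This access pattern can be modeled roughly as follows. The sketch is a simplified cost count, not the real microkernel: the input vector is fetched once (standing in for the \ac{grf}-A registers), and growing the row count only adds further \ac{mac} iterations.

```python
# Simplified access-pattern model (illustration only, not the real
# microkernel): the input vector is fetched once, and each additional
# matrix row only adds MAC iterations.
def gemv_rowwise(A, x):
    vector_loads = 1            # x is loaded into the register file once
    mac_iterations = 0
    y = []
    for row in A:               # more rows -> more MAC iterations
        acc = 0.0
        for a, xi in zip(row, x):
            acc += a * xi       # one multiply-accumulate step
            mac_iterations += 1
        y.append(acc)
    return y, vector_loads, mac_iterations

y, loads, macs = gemv_rowwise([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], [1.0, 1.0])
print(y, loads, macs)  # [3.0, 7.0, 11.0] 1 6
```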
\begin{figure}[ht]
\centering
\input{plots/matrix_normal}
\caption{Comparison between non-\ac{pim} and \ac{pim} for the \ac{gemv} benchmarks running at a \ac{cpu} frequency of $\qty{3}{\giga\hertz}$.}
\label{fig:matrix_normal}
\end{figure}
\Cref{fig:matrix_normal} shows the relative performance for the \ac{gemv} benchmarks run on the system at a normal clock speed.
The speedup for a single \ac{gemv} operation is in the range of $\qtyrange{3.5}{23.6}{\times}$, and for the simple \ac{dnn} layers in the range of $\qtyrange{3.0}{72.3}{\times}$.
Unlike in the vector benchmarks, the performance gains become drastically more significant with increasing matrix dimensions, where \ac{pim} can exploit its specialized architecture for this type of operation.
A possible explanation is that the initial overhead of executing the microkernel in the \aca{fimdram} processing units quickly becomes insignificant with increasing operand dimensions compared to the actual execution time.
Also, in all cases, the smallest representable operand dimensions already achieve a speedup above one, suggesting that the break-even point of \ac{pim}'s viability for this system lies below these dimensions.
Since the speedup approaches $\qty{100}{\times}$ in the \ac{dnn} benchmark, it can be concluded that \ac{pim} offers an immense performance advantage in this system configuration.
\begin{figure}
\centering
\input{plots/matrix_infinite}
\caption{Comparison between non-\ac{pim} and \ac{pim} for the \ac{gemv} benchmarks running on the infinite compute platform.}
\label{fig:matrix_infinite}
\end{figure}
The \ac{gemv} and \ac{dnn} benchmarks, however, show a more differentiated picture for the infinite compute approach that models the completely memory-bound system:
For smaller matrix dimensions, the usage of \ac{pim} slows the execution down to a factor of $\qty{0.21}{\times}$ for the \ac{gemv} benchmark and even $\qty{0.18}{\times}$ for the \ac{dnn} layers.
However, the speedup quickly increases with larger dimensions, reaches its break-even point at the third step, and shows a maximum speedup of $\qty{4.7}{\times}$ and $\qty{6.1}{\times}$ for the \ac{gemv} and \ac{dnn} benchmarks, respectively.
These results provide a more realistic view of \aca{fimdram}:
For workloads and accelerator systems that are truly memory-bound, performance improvements can be on the order of the simulated $\qty{6.1}{\times}$.
This result is largely in line with the numbers published by Samsung, which were already introduced in \cref{sec:fimdram_performance} and will be compared in more detail with the simulation results in the next section.
\subsubsection{Comparison to Samsung's Simulation Results}

To reiterate, Samsung used a real hardware accelerator platform for its analyses, which is based on a Xilinx Zynq UltraScale+ \ac{fpga} and uses real manufactured \aca{fimdram} memory packages.
Similarly to the above investigations, Samsung used different input dimensions for its \ac{gemv} and vector ADD microbenchmarks, which are listed in \cref{tab:samsung_dimensions}.
\begin{table}
\centering
\begin{tblr}{
cell{2}{2} = {r},
cell{3}{2} = {r},
cell{4}{2} = {r},
cell{5}{2} = {r},
cell{2}{3} = {r},
cell{3}{3} = {r},
cell{4}{3} = {r},
cell{5}{3} = {r},
hlines,
vlines,
hline{2} = {-}{solid,black},
hline{2} = {2}{-}{solid,black},
}
Level & \ac{gemv} Dimensions & ADD Dimensions \\
Level 1 & (1k $\times$ 4k) & (2M) \\
Level 2 & (2k $\times$ 4k) & (4M) \\
Level 3 & (4k $\times$ 8k) & (8M) \\
Level 4 & (8k $\times$ 8k) & (16M)
\end{tblr}
\caption{List of the operand dimensions for the microbenchmarks used by Samsung \cite{lee2021}.}
\label{tab:samsung_dimensions}
\end{table}
Each simulation is run with different batch sizes, where a higher batch size allows for better cache utilization, as multiple operations are performed on the same data set, making the workload less memory-bound and rendering \ac{pim} less effective.
None of the microbenchmarks discussed so far perform batching, so all comparisons are made against the result values for a batch size of 1, which correspond to the blue bars in \cref{fig:samsung_speedup}.
Since the Samsung \ac{fpga} platform can be assumed to be a highly optimized accelerator, the infinite compute approach is a more viable baseline for comparison than the limited \ac{cpu} approach, as both systems should operate in the memory-bound regime.
\begin{figure}
\centering
\includegraphics[width=0.8\linewidth]{plots/samsung}
\caption[Relative performance of the \ac{gemv} and ADD microbenchmark for different batch sizes on the hardware implementation of Samsung.]{Relative performance of the \ac{gemv} and ADD microbenchmark for different batch sizes on the hardware implementation of Samsung \cite{lee2021}.}
\label{fig:samsung_speedup}
\end{figure}
Samsung's ADD microbenchmark shows a small variance between the different input dimensions, with an average speedup of around $\qty{1.6}{\times}$.
On the simulated platform, the variance is also limited, with a range of $\qtyrange{1.6}{2.4}{\times}$, which corresponds well with the findings of Samsung.
The \ac{gemv} microbenchmark, on the other hand, shows a more drastic speedup with an average value of $\qty{8.3}{\times}$.
Although the dimensions used by Samsung differ from those in the simulations of this thesis, the highest achieved speedup of $\qty{6.1}{\times}$ is well within the reach of the real hardware implementation.
\subsubsection{Comparison to Real Hardware}

% TODO: check all ranges

In addition to the comparison with Samsung's real hardware implementation, the same benchmarks as in the performed simulations are run on a [...] with HBM2 [...].
As this system uses a generic \aca{hbm} \ac{dram} and not \aca{fimdram}, the measurements are only intended to serve as a rough estimate of the runtimes in the non-\ac{pim} case.
\begin{figure}
\centering
\resizebox{\linewidth}{!}{%
\input{plots/runtimes_vector}
}
\caption{}
\label{fig:runtimes_vector}
\end{figure}

\begin{figure}
\centering
% \resizebox{\linewidth}{!}{%
\input{plots/runtimes_matrix}
% }
\caption{}
\label{fig:runtimes_matrix}
\end{figure}
% \subsubsection{Initialization Overhead}
% estimate the conversion of the operands in relation to the runtime

% \subsubsection{Shared Processing Units}
% shared processing units vs. one processing unit per bank
% GEMV
|