\section{Simulation Results}
\label{sec:results}

This section explores the potential performance improvement of \aca{fimdram} across different system configurations and workloads.
After a brief introduction to the simulated system architecture, an estimated theoretical performance gain is calculated.
This is followed by a discussion of the measurement accuracy and suggestions for improving the measurement environment.
Furthermore, the variations of the system parameters for each workload are explored.
The set of simulations is then run with these parameters and the resulting performance improvements are analyzed.
Finally, the execution time of the operand initialization is compared to the microkernel execution time to estimate the setup overhead of \aca{fimdram}.

\subsection{System Architecture}

The memory configuration used in the simulations has already been partially introduced in \cref{sec:memory_configuration}.
Each \ac{pim}-enabled \ac{pch} contains 8 processing units, each of which is connected to 2 memory banks.
A processing unit operates at the same frequency as an \aca{hbm} \ac{dram} device, i.e.\ $\qty{250}{\mega\hertz}$.
The external clocking of the memory bus itself is $4\times$ higher at a frequency of $\qty{1}{\giga\hertz}$, since the data, address, and command buses of \aca{hbm} operate at \ac{ddr} \cite{lee2021}.
Thus, with both the 16-wide \ac{fp} adder and the 16-wide \ac{fp} multiplier, a single processing unit achieves a throughput of $\num{2} \cdot \qty{16}{FLOP} \cdot \qty{250}{\mega\hertz}=\qty{8}{\giga FLOPS}$.
In total, the 16 processing units in a memory channel provide a throughput of $\num{16}\cdot\qty{8}{\giga FLOPS}=\qty{128}{\giga FLOPS}$.
To compare this throughput to the vector processing unit of a real processor, a highly simplified assumption can be made based on the ARM NEON architecture, which holds 8 \ac{fp16} numbers in a single $\qty{128}{\bit}$ vector register \cite{arm2020}.
Assuming the single processor core runs at a frequency of $\qty{3}{\giga\hertz}$, the vector processing unit can achieve a maximum throughput of $\qty{8}{FLOP} \cdot \qty{3}{\giga\hertz}=\qty{24}{\giga FLOPS}$, which is about $5\times$ less than the \aca{fimdram} throughput of a single channel.

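As a sanity check on the arithmetic above, the peak-throughput figures can be restated in a few lines; the following Python sketch only reproduces the numbers from the text and introduces no new parameters:

```python
# Peak-throughput arithmetic from the text, restated as a sanity check.
pu_freq_hz = 250e6   # processing unit clock, same as the HBM DRAM device
lanes = 16           # width of both the FP adder and the FP multiplier

# Adder and multiplier together yield 2 * 16 FLOP per cycle per unit.
pu_flops = 2 * lanes * pu_freq_hz      # 8 GFLOPS per processing unit

# 16 processing units per memory channel.
channel_flops = 16 * pu_flops          # 128 GFLOPS per channel

# Simplified NEON reference: 8 FP16 lanes at an assumed 3 GHz core clock.
neon_flops = 8 * 3e9                   # 24 GFLOPS

ratio = channel_flops / neon_flops     # roughly 5.3x
```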
% some implementation details
% hbm size, channel...
% operating at ...MHz
% theoretical bandwidth and FLOPS...
% very simple comparison to ARM FLOPS/cycle -> ratio in the optimal case

\subsection{Accuracy and Comparability}

When interpreting the following simulation results, it is important to note that the system configuration does not strictly reflect a system on which a real \ac{dnn} inference would be performed.
Firstly, implementing the workloads on a bare-metal kernel simplifies the execution environment of the processor, since no other processes interact with it in any way.
The workload process is never preempted, so the effect of an interruption during the critical \ac{pim} microkernel execution cannot be analyzed.
Secondly, for performance reasons, a \ac{dnn} inference is typically not run on a \ac{cpu} but on \acp{gpu} or \acp{tpu}.
These accelerators may have significantly different execution behavior, as a \ac{gpu} may aggressively accelerate inference by performing many parallel operations, and a \ac{tpu} may use specialized structures for matrix-vector operations such as systolic arrays.
Such differences would also be reflected in the memory access pattern and may be subject to other effects that alter the behavior of \aca{fimdram}.
Furthermore, since the mode switching of \aca{fimdram} is not measured in the simulations, the setup overhead is limited to the required layout conversions of the input operands.
The high overhead of a \ac{pim} operation on a small data set may therefore be underrepresented.
Nevertheless, the simulations performed provide an informative insight into the effectiveness of \aca{fimdram} and the workloads suited to it.

% bare-metal is the optimal case, linux would be a more realistic test environment
% the setup-time overhead cannot be measured properly
% inference on a CPU is atypical, a GPU model would be more suitable

\subsection{Objectives}

Through the simulations, the research aims to address several objectives.
As already discussed in \cref{sec:pim}, \ac{pim} aims to accelerate memory-bound problems such as \ac{gemv} and may only show a small performance gain, or even a degradation, for compute-bound problems such as \ac{gemm}.
This difference should be analyzed by performing the simulations on a variety of workloads.
For these workloads, the input dimensions may play an important role in how effective \ac{pim} is.
Small dimensions suffer from a high impact of the setup overhead, while for large dimensions this effect may be less significant.
The performance gains for different operand dimensions should be analyzed, possibly finding a break-even point at which \ac{pim} becomes viable.
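The expected break-even behavior can be illustrated with a toy cost model; the constants below (fixed setup cost, per-element times) are purely hypothetical assumptions for illustration and are not taken from the simulations:

```python
# Toy break-even model: PIM pays a fixed setup cost (layout conversion)
# but processes each element faster than the host.
# All constants are ASSUMED for illustration only.
SETUP_OVERHEAD = 5000.0   # fixed layout-conversion cost (arbitrary time units)
T_HOST = 2.0              # host time per element (assumed)
T_PIM = 0.5               # PIM time per element (assumed)

def host_time(n: int) -> float:
    return T_HOST * n

def pim_time(n: int) -> float:
    return SETUP_OVERHEAD + T_PIM * n

def break_even() -> int:
    """Smallest operand size for which PIM is at least as fast as the host."""
    n = 1
    while pim_time(n) > host_time(n):
        n += 1
    return n
```

With these assumed constants the break-even point lies at $n = \lceil 5000 / 1.5 \rceil = 3334$ elements; below it the fixed setup overhead dominates, matching the qualitative expectation stated above.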
When performing inference over multiple \ac{dnn} layers, an activation function is typically applied to the output of each layer.
\Aca{fimdram} provides a \ac{relu} operation that can be applied while moving the newly interleaved input vector into the \ac{grf}-A registers.
The performance gain of applying this operation in memory, instead of on the host processor after reducing the partial sums of the output vector, can be investigated.
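Functionally, the two placements of \ac{relu} are equivalent and differ only in where the work is performed; the following Python sketch is a simplification of this idea, not the actual \aca{fimdram} microcode:

```python
# Simplified model of the two ReLU placements; NOT actual FIMDRAM microcode.
def relu(vec):
    return [x if x > 0.0 else 0.0 for x in vec]

def reduce_partial_sums(partials):
    # Reduce the per-unit partial sums into the output vector.
    return [sum(col) for col in zip(*partials)]

def load_next_input_with_relu(vec):
    # Models the fused move+ReLU into the GRF-A registers (simplified).
    return relu(vec)

partials = [[1.0, -2.0, 0.5], [-0.5, 1.0, -1.0]]

# Variant 1: the host applies ReLU after reducing the output vector.
host_result = relu(reduce_partial_sums(partials))

# Variant 2: ReLU is applied in memory while the reduced vector is
# loaded as the next layer's input.
pim_result = load_next_input_with_relu(reduce_partial_sums(partials))

assert host_result == pim_result
```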
Furthermore, the concrete number of processing units in a \ac{pch} is a compromise against the usable memory area they displace.
Using the flexible simulation model, it is possible to analyze the impact of the shared processing unit architecture compared to a hypothetical solution where each bank is connected to its own processing unit.

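The expected cost of sharing can be sketched with a simple counting argument: if one unit serves two banks, it must process the rows of both banks sequentially, roughly halving throughput compared to a per-bank unit. The following Python sketch states this assumption, which the simulations should confirm or refute; it is not a measured result:

```python
# Counting argument for shared processing units; an assumption to be
# checked against the simulations, not a measured result.
BANKS = 16  # banks per PIM-enabled PCH (8 units * 2 banks, from the text)

def rows_per_unit(rows_per_bank: int, banks_per_unit: int) -> int:
    """Rows each processing unit must handle sequentially."""
    units = BANKS // banks_per_unit
    total_rows = BANKS * rows_per_bank
    return total_rows // units
```

Under this model a shared unit (two banks per unit) handles twice the rows of a hypothetical per-bank unit, i.e.\ roughly half the performance.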
To evaluate these objectives, each set of simulations is run in four different configurations.
Two configurations use a generic ARM processor running at a frequency of $\qty{3}{\giga\hertz}$, once with \ac{pim} enabled and once performing the operations on the processor alone; together they should represent a realistic system.
In addition, two configurations with the same ARM processor but a nearly infinite clock frequency are simulated.
While these configurations do not reflect a real system, they are used to address the already mentioned concerns about the meaningfulness of performing the simulations on a \ac{cpu}.
With infinite computational power, the simulation is guaranteed to be bounded only by the memory system.
This allows an exaggerated evaluation of the performance gains of \ac{pim} in an optimal environment, where only the effect on memory boundedness can be observed.

% different kernels
% shared pim units (-> half rows / half performance, to be verified)
% sweep of matrix dimensions rows/columns, break even point
% ReLU in DRAM vs on host

% comparison with normal clock and infinite compute (always 4 simulations, or 5 with real hardware)

\subsection{Simulation Results}

\subsubsection{Workload Kernels}

% Vector ADD and Vector MUL
% Vector Scalar ADD and Vector Scalar MUL (HCAL)
% Vector HAXPY x*a+y
% GEMV
% Samsung 7.4x-8.9x
% "inference" with multiple layers
% ReLU comparison

% GEMM with heavily interleaved matrices

\subsubsection{Initialization Overhead}

% estimate operand conversion relative to runtime

\subsubsection{Shared Processing Units}

% shared processing units vs one per bank
% GEMV