\section{Simulation Results}
\label{sec:results}

This section explores the potential performance improvement of \aca{fimdram} across different system configurations and workloads.
After a brief introduction to the simulated system architecture, an estimated theoretical performance gain is calculated.
This is followed by a discussion of the measurement accuracy and suggestions for improving the measurement environment.
Furthermore, the variations of the system parameters for each workload are explored.
The set of simulations is then run with these parameters and the resulting performance improvements are analyzed.
Finally, the execution time of the operand initialization is compared to the microkernel execution time to estimate the setup overhead of \aca{fimdram}.

\subsection{System Architecture}

The memory configuration used in the simulations has already been partially introduced in \cref{sec:memory_configuration}.
Each \ac{pim}-enabled \ac{pch} contains 8 processing units, each of which is connected to 2 memory banks.
A processing unit operates at the same frequency as an \aca{hbm} \ac{dram} device, i.e.\ $\qty{250}{\mega\hertz}$.
The external clocking of the memory bus itself is $4\times$ higher at a frequency of $\qty{1}{\giga\hertz}$, since the data, address, and command buses of \aca{hbm} operate at \ac{ddr} \cite{lee2021}.
Thus, with both the 16-wide \ac{fp} adder and the 16-wide \ac{fp} multiplier, a single processing unit achieves a throughput of $\num{2} \cdot \qty{16}{FLOP} \cdot \qty{250}{\mega\hertz}=\qty{8}{\giga FLOPS}$.
In total, the 16 processing units in a memory channel provide a throughput of $\num{16}\cdot\qty{8}{\giga FLOPS}=\qty{128}{\giga FLOPS}$.
To compare this throughput to the vector processing unit of a real processor, a highly simplified assumption can be made based on the ARM NEON architecture, which holds 8 \ac{fp16} numbers in a single $\qty{128}{\bit}$ vector register \cite{arm2020}.
Assuming the single processor core runs at a frequency of $\qty{3}{\giga\hertz}$, the vector processing unit can achieve a maximum throughput of $\qty{8}{FLOP} \cdot \qty{3}{\giga\hertz}=\qty{24}{\giga FLOPS}$, which is about $5\times$ less than the \aca{fimdram} throughput of a single channel.

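As a sanity check on the arithmetic above, the peak-throughput figures can be restated in a few lines; the following Python sketch only reproduces the numbers from the text and introduces no new parameters:

```python
# Peak-throughput arithmetic from the text, restated as a sanity check.
pu_freq_hz = 250e6   # processing unit clock, same as the HBM DRAM device
lanes = 16           # width of both the FP adder and the FP multiplier

# Adder and multiplier together yield 2 * 16 FLOP per cycle per unit.
pu_flops = 2 * lanes * pu_freq_hz      # 8 GFLOPS per processing unit

# 16 processing units per memory channel.
channel_flops = 16 * pu_flops          # 128 GFLOPS per channel

# Simplified NEON reference: 8 FP16 lanes at an assumed 3 GHz core clock.
neon_flops = 8 * 3e9                   # 24 GFLOPS

ratio = channel_flops / neon_flops     # roughly 5.3x
```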
% some implementation details
% hbm size, channel...
% operating at ...MHz
% theoretical bandwidth and FLOPS...
% very simple comparison to ARM FLOPS/cycle -> ratio in the optimal case

\subsection{Accuracy and Comparability}

When interpreting the following simulation results, it is important to note that the system configuration does not strictly reflect a system on which a real \ac{dnn} inference would be performed.
Firstly, implementing the workloads on a bare-metal kernel simplifies the execution environment of the processor, since no other processes interact with it in any way.
The workload process is never preempted, so the effect of an interruption during the critical \ac{pim} microkernel execution cannot be analyzed.
Secondly, for performance reasons, a \ac{dnn} inference is typically not run on a \ac{cpu} but on \acp{gpu} or \acp{tpu}.
These accelerators may have significantly different execution behavior, as a \ac{gpu} may aggressively accelerate inference by performing many parallel operations, and a \ac{tpu} may use specialized structures for matrix-vector operations such as systolic arrays.
Such differences would also be reflected in the memory access pattern and may be subject to other effects that alter the behavior of \aca{fimdram}.
Furthermore, since the mode switching of \aca{fimdram} is not measured in the simulations, the setup overhead is limited to the required layout conversions of the input operands.
The high overhead of a \ac{pim} operation on a small data set may therefore be underrepresented.
Nevertheless, the simulations performed provide an informative insight into the effectiveness of \aca{fimdram} and the workloads suited to it.

% bare-metal is the optimal case, linux would be a more realistic test environment
% the setup-time overhead cannot be measured properly
% inference on a CPU is atypical, a GPU model would be more suitable

\subsection{Objectives}

Through the simulations, the research aims to address several objectives.
As already discussed in \cref{sec:pim}, \ac{pim} aims to accelerate memory-bound problems such as \ac{gemv} and may only show a small performance gain, or even a degradation, for compute-bound problems such as \ac{gemm}.
This difference should be analyzed by performing the simulations on a variety of workloads.
For these workloads, the input dimensions may play an important role in how effective \ac{pim} is.
Small dimensions suffer from a high impact of the setup overhead, while for large dimensions this effect may be less significant.
The performance gains for different operand dimensions should be analyzed, possibly finding a break-even point at which \ac{pim} becomes viable.
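The expected break-even behavior can be illustrated with a toy cost model; the constants below (fixed setup cost, per-element times) are purely hypothetical assumptions for illustration and are not taken from the simulations:

```python
# Toy break-even model: PIM pays a fixed setup cost (layout conversion)
# but processes each element faster than the host.
# All constants are ASSUMED for illustration only.
SETUP_OVERHEAD = 5000.0   # fixed layout-conversion cost (arbitrary time units)
T_HOST = 2.0              # host time per element (assumed)
T_PIM = 0.5               # PIM time per element (assumed)

def host_time(n: int) -> float:
    return T_HOST * n

def pim_time(n: int) -> float:
    return SETUP_OVERHEAD + T_PIM * n

def break_even() -> int:
    """Smallest operand size for which PIM is at least as fast as the host."""
    n = 1
    while pim_time(n) > host_time(n):
        n += 1
    return n
```

With these assumed constants the break-even point lies at $n = \lceil 5000 / 1.5 \rceil = 3334$ elements; below it the fixed setup overhead dominates, matching the qualitative expectation stated above.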
When performing inference over multiple \ac{dnn} layers, an activation function is typically applied to the output of each layer.
\Aca{fimdram} provides a \ac{relu} operation that can be applied while moving the newly interleaved input vector into the \ac{grf}-A registers.
The performance gain of applying this operation in memory, instead of on the host processor after reducing the partial sums of the output vector, can be investigated.
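Functionally, the two placements of \ac{relu} are equivalent and differ only in where the work is performed; the following Python sketch is a simplification of this idea, not the actual \aca{fimdram} microcode:

```python
# Simplified model of the two ReLU placements; NOT actual FIMDRAM microcode.
def relu(vec):
    return [x if x > 0.0 else 0.0 for x in vec]

def reduce_partial_sums(partials):
    # Reduce the per-unit partial sums into the output vector.
    return [sum(col) for col in zip(*partials)]

def load_next_input_with_relu(vec):
    # Models the fused move+ReLU into the GRF-A registers (simplified).
    return relu(vec)

partials = [[1.0, -2.0, 0.5], [-0.5, 1.0, -1.0]]

# Variant 1: the host applies ReLU after reducing the output vector.
host_result = relu(reduce_partial_sums(partials))

# Variant 2: ReLU is applied in memory while the reduced vector is
# loaded as the next layer's input.
pim_result = load_next_input_with_relu(reduce_partial_sums(partials))

assert host_result == pim_result
```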
Furthermore, the concrete number of processing units in a \ac{pch} is a compromise against the usable memory area they displace.
Using the flexible simulation model, it is possible to analyze the impact of the shared processing unit architecture compared to a hypothetical solution where each bank is connected to its own processing unit.

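The expected cost of sharing can be sketched with a simple counting argument: if one unit serves two banks, it must process the rows of both banks sequentially, roughly halving throughput compared to a per-bank unit. The following Python sketch states this assumption, which the simulations should confirm or refute; it is not a measured result:

```python
# Counting argument for shared processing units; an assumption to be
# checked against the simulations, not a measured result.
BANKS = 16  # banks per PIM-enabled PCH (8 units * 2 banks, from the text)

def rows_per_unit(rows_per_bank: int, banks_per_unit: int) -> int:
    """Rows each processing unit must handle sequentially."""
    units = BANKS // banks_per_unit
    total_rows = BANKS * rows_per_bank
    return total_rows // units
```

Under this model a shared unit (two banks per unit) handles twice the rows of a hypothetical per-bank unit, i.e.\ roughly half the performance.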
To evaluate these objectives, each set of simulations is run in four different configurations.
Two configurations use a generic ARM processor running at a frequency of $\qty{3}{\giga\hertz}$, once with \ac{pim} enabled and once performing the operations on the processor alone; together they should represent a realistic system.
In addition, two configurations with the same ARM processor but a nearly infinite clock frequency are simulated.
While these configurations do not reflect a real system, they are used to address the already mentioned concerns about the meaningfulness of performing the simulations on a \ac{cpu}.
With infinite computational power, the simulation is guaranteed to be bounded only by the memory system.
This allows an exaggerated evaluation of the performance gains of \ac{pim} in an optimal environment, where only the effect on memory boundedness can be observed.

% different kernels
% shared pim units (-> half rows / half performance, to be verified)
% sweep of matrix dimensions rows/columns, break even point
% ReLU in DRAM vs on host

% comparison with normal clock and infinite compute (always 4 simulations, or 5 with real hardware)

\subsection{Simulation Results}

\subsubsection{Workload Kernels}

% Vector ADD and Vector MUL
% Vector Scalar ADD and Vector Scalar MUL (HCAL)
% Vector HAXPY x*a+y
% GEMV
% Samsung 7.4x-8.9x
% "inference" with multiple layers
% ReLU comparison

% GEMM with heavily interleaved matrices

\subsubsection{Initialization Overhead}

% estimate operand conversion relative to runtime

\subsubsection{Shared Processing Units}

% shared processing units vs one per bank
% GEMV