Smaller refactorings in result chapter

2024-03-08 23:26:59 +01:00
parent 42ef5e7672
commit 4430c10aad
2 changed files with 12 additions and 10 deletions


@@ -8,7 +8,7 @@ A working \ac{vp} of \aca{fimdram}, in the form of a software model, has been de
This made it possible to explore the performance gain of \ac{pim} for different workloads in a simple and flexible way.
It was found that \ac{pim} can provide a speedup of up to $\qty{23.9}{\times}$ for level 1 \ac{blas} vector operations and up to $\qty{62.5}{\times}$ for level 2 \ac{blas} operations.
-While these results may not strictly represent a real-world system, an achievable upper bound of speedups of $\qty{17.6}{\times}$ and $\qty{9.0}{\times}$ could be determined using a hypothetical infinite compute system.
+While these results may not strictly represent a real-world system, an upper bound on the achievable speedup of $\qty{17.6}{\times}$ and $\qty{9.0}{\times}$ could be determined using a hypothetical infinite compute system.
This achieved speedup of $\qty{9.0}{\times}$ for the \ac{gemv} routine largely matches the approximately $\qty{8.3}{\times}$ reported for Samsung's real-world implementation of \aca{fimdram}.
In addition to the numbers presented by Samsung, the same simulation workloads were run on two real \ac{gpu} systems, both with \aca{hbm}, and their runtimes were compared.


@@ -12,11 +12,12 @@ A set of simulations is then run based on these parameters and the resulting per
The memory configuration used in the simulations has already been partially introduced in \cref{sec:memory_configuration}.
Each \ac{pim}-enabled \ac{pch} contains eight processing units, each of which is connected to two memory banks.
A processing unit operates at the same frequency as a \aca{hbm} \ac{dram} device with $\qty{250}{\mega\hertz}$.
-The external clocking of the memory bus itself is $\qty{4}{\times}$ higher with a frequency of $\qty{1}{\giga\hertz}$, the data, address and command bus of \aca{hbm} operate at \ac{ddr} \cite{lee2021}.
+The external clocking of the memory bus itself is $\qty{4}{\times}$ higher, with a frequency of $\qty{1}{\giga\hertz}$.
+The data, address, and command buses of \aca{hbm} operate at \ac{ddr} \cite{lee2021}.
Thus, with both the 16-wide \ac{fp} adder and the 16-wide \ac{fp} multiplier, a single processing unit achieves a throughput of $\num{2} \cdot \qty{16}{FLOP} \cdot \qty{250}{\mega\hertz}=\qty{8}{\giga FLOPS}$.
In total, the 16 processing units in a memory channel provide a throughput of $\num{16}\cdot\qty{8}{\giga FLOPS}=\qty{128}{\giga FLOPS}$.
To compare this throughput with the vector processing unit of a real processor, a very simplified assumption can be made based on the ARM NEON architecture, which holds eight \ac{fp16} numbers in a single $\qty{128}{\bit}$ vector register \cite{arm2020}.
-Assuming the single processor core runs at a frequency of $\qty{3}{\giga\hertz}$, the vector processing unit can achieve a maximum throughput of $\qty{8}{FLOP} \cdot \qty{3}{\giga\hertz}=\qty{24}{\giga FLOPS}$, which is about $\qty{5}{\times}$ less than the \aca{fimdram} throughput of a single memory channel.
+Assuming a single processor core runs at a frequency of $\qty{3}{\giga\hertz}$, the vector processing unit can achieve a maximum throughput of $\qty{8}{FLOP} \cdot \qty{3}{\giga\hertz}=\qty{24}{\giga FLOPS}$, which is about $\qty{5}{\times}$ less than the \aca{fimdram} throughput of a single memory channel.
The simulated ARM system also contains a two-level cache hierarchy with a cache size of $\qty{16}{\kibi\byte}$ for the L1 cache and $\qty{256}{\kibi\byte}$ for the L2 cache.
% some implementation details
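The throughput figures above can be reproduced with a short back-of-envelope calculation. The following sketch is purely illustrative and only restates the numbers from the surrounding paragraph; the variable names are not taken from the simulator.

```python
# Back-of-envelope throughput comparison (illustrative; all figures are
# taken from the text above, none are measured here).

PIM_UNIT_FREQ_HZ = 250e6       # processing unit runs at the HBM DRAM device frequency
FP_LANES = 16                  # 16-wide FP adder and 16-wide FP multiplier
OPS_PER_CYCLE = 2 * FP_LANES   # one 16-wide add and one 16-wide multiply per cycle

flops_per_unit = OPS_PER_CYCLE * PIM_UNIT_FREQ_HZ   # 8 GFLOPS per processing unit
flops_per_channel = 16 * flops_per_unit             # 16 units -> 128 GFLOPS per channel

# Very simplified ARM NEON reference: 8 fp16 values per 128-bit register at 3 GHz
neon_flops = 8 * 3e9                                # 24 GFLOPS

ratio = flops_per_channel / neon_flops              # roughly 5x in favor of PIM
print(flops_per_unit, flops_per_channel, ratio)
```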
@@ -30,8 +31,9 @@ When interpreting the following simulation results, it is important to note that
Firstly, implementing the workloads on a bare-metal kernel simplifies the execution environment of the processor, since no other processes interact with it in any way.
The process of the workloads is never preemptively interrupted and the effect of an interruption during the critical \ac{pim} microkernel execution cannot be analyzed.
Secondly, for performance reasons, a \ac{dnn} inference is not typically run on a \ac{cpu} but on \acp{gpu} or \acp{tpu}.
-These accelerators may have significantly different execution behavior, as a \ac{gpu} may aggressively accelerate the \ac{dnn} inference by performing many parallel operations, or a \ac{tpu} may use a specialized architecture of nets, such as systolic arrays, to accelerate matrix vector operations.
-Those differences would also be reflected in the memory access pattern, and may be subject to other effects that alter the behavior of \aca{fimdram}.
+These accelerators may have significantly different execution behavior.
+For example, a \ac{gpu} may aggressively accelerate \ac{dnn} inference by performing many parallel operations, and a \ac{tpu} may use a specialized architecture, such as a systolic array, to accelerate matrix-vector operations.
+Those differences would also be reflected in the memory access pattern and may be subject to other effects that change the behavior of \aca{fimdram}.
Furthermore, since the mode switching of \aca{fimdram} is not being measured in the simulations, the setup overhead is limited to the required layout conversions of the input operands.
The high overhead of a \ac{pim} operation on a small data set may be underrepresented.
Nevertheless, the simulations performed provide an informative insight into the effectiveness of \aca{fimdram} and its suitability for various workloads.
@@ -41,7 +43,7 @@ Nevertheless, the simulations performed provide an informative insight into the
% Inference auf CPU ist untypisch, GPU modell wäre geeigneter
\subsection{Objectives}
-Through the simulations, the research aims to address and find answers to several objectives.
+Through simulations, the research aims to address several objectives.
As already discussed in \cref{sec:pim}, \ac{pim} aims to accelerate memory-bound problems such as \ac{gemv} and may only show a small performance gain, or even a worsening, for compute-bound problems such as \ac{gemm}.
The potential speedup of \aca{fimdram} should be analyzed by performing the simulations on a variety of workloads.
For these workloads, the dimensions of the input operands may play an important role in how effective \ac{pim} is.
@@ -49,8 +51,8 @@ Small dimensions suffer from a high impact of the setup overhead, while for larg
The performance gains for different operand dimensions should be analyzed, possibly finding a break-even point at which \ac{pim} becomes viable.
Specifically, bulk vector additions and multiplications are executed, as well as level 1 \ac{blas} \ac{haxpy} operations.
-To model the inference of a \ac{dnn}, a singular \ac{gemv} operation is first performed, followed by a simple model of a sequence of multiple \ac{dnn} layers, including the necessary processing steps between the \ac{gemv} routines.
-Namely, after the reduction step of the output vector, an activation function, i.e. \ac{relu}, is applied before the vector is passed as input to the next layer.
+To model the inference of a \ac{dnn}, first a single \ac{gemv} operation is performed, followed by a simple model of a sequence of multiple \ac{dnn} layers, including the necessary processing steps between the \ac{gemv} routines.
+Namely, after the reduction step of the output vector, an activation function, in this case \ac{relu}, is applied before the vector is passed as input to the next layer.
% When performing inference of multiple \ac{dnn} layers, an activation function is typically applied to the output of each layer.
% \Aca{fimdram} provides a \ac{relu} operation that can be applied while moving the newly interleaved input vector into the \ac{grf}-A registers.
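The layer-sequence workload described above can be sketched on the host side as a chain of GEMV and ReLU steps. This is a pure-Python illustration with made-up toy dimensions, not the bare-metal kernel actually used in the simulations.

```python
# Minimal model of the DNN-inference workload: each layer is a GEMV
# followed by a ReLU activation, and each layer's output vector feeds
# the next layer. Illustrative sketch only; dimensions are hypothetical.

def gemv(matrix, vector):
    """y = A @ x, the level 2 BLAS routine that would be offloaded to PIM."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]

def relu(vector):
    """Activation applied after the reduction step of the output vector."""
    return [max(0.0, v) for v in vector]

def dnn_inference(layers, x):
    """Run a sequence of DNN layers: GEMV, then ReLU, then pass on."""
    for weights in layers:
        x = relu(gemv(weights, x))
    return x

# Toy example with two 2x2 layers
layers = [[[1.0, -1.0], [0.5, 0.5]], [[2.0, 0.0], [0.0, 2.0]]]
print(dnn_inference(layers, [1.0, 2.0]))  # -> [0.0, 3.0]
```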
@@ -60,10 +62,10 @@ Namely, after the reduction step of the output vector, an activation function, i
To evaluate the analysis objectives, each of these simulation workloads is performed in four different configurations:
The first two configurations use a generic ARM processor running at a frequency of $\qty{3}{\giga\hertz}$, once with \ac{pim} enabled and once performing the operations only on the processor, and thus represent a realistic system.
-However, also two configurations with the same ARM processor but with a nearly infinite frequency is performed.
+In addition, two configurations with the same ARM processor but with a nearly infinite clock frequency are performed.
While these configurations do not reflect a real system, they are used to address the previously mentioned concerns about the meaningfulness of performing the simulations on a \ac{cpu}.
With infinite computational power, the simulation is guaranteed to be limited only by the memory system, reducing the computation latencies introduced by the \ac{cpu}.
-This allows an exaggerated evaluation of the performance gains of \ac{pim} in an optimal environment, where only the effect on memory boundness can be observed.
+This allows an exaggerated evaluation of the performance gains of \ac{pim} in an optimal environment.
% different kernels
% shared pim units (-> halbe rows / halbe performance, soll überprüft werden)
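The four configurations form a simple two-by-two matrix: {PIM enabled, CPU only} crossed with {realistic 3 GHz, nearly infinite frequency}. The enumeration below is a hypothetical sketch of that parameter matrix; the names and the placeholder value for "nearly infinite" are assumptions, not the simulator's actual configuration format.

```python
# Hypothetical enumeration of the four simulation configurations per
# workload: CPU frequency x PIM mode. Names and the placeholder
# "near-infinite" frequency are illustrative assumptions.

from itertools import product

CPU_FREQS_HZ = {"realistic": 3e9, "near_infinite": 1e15}  # stand-in for "infinite"
PIM_MODES = {"pim_enabled": True, "cpu_only": False}

configs = [
    {"cpu_freq_hz": freq, "pim_enabled": pim}
    for (_, freq), (_, pim) in product(CPU_FREQS_HZ.items(), PIM_MODES.items())
]
print(len(configs))  # -> 4 configurations per workload
```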