Smaller refactorings in result chapter
@@ -8,7 +8,7 @@ A working \ac{vp} of \aca{fimdram}, in the form of a software model, has been de
 This made it possible to explore the performance gain of \ac{pim} for different workloads in a simple and flexible way.
 
 It was found that \ac{pim} can provide a speedup of up to $\qty{23.9}{\times}$ for level 1 \ac{blas} vector operations and up to $\qty{62.5}{\times}$ for level 2 \ac{blas} operations.
-While these results may not strictly represent a real-world system, an achievable upper bound of speedups of $\qty{17.6}{\times}$ and $\qty{9.0}{\times}$ could be determined using a hypothetical infinite compute system.
+While these results may not strictly represent a real-world system, an achievable speedup of $\qty{17.6}{\times}$ and $\qty{9.0}{\times}$ could be determined using a hypothetical infinite compute system.
 This achieved speedup of $\qty{9.0}{\times}$ for the \ac{gemv} routine largely matches the number of Samsung's real-world implementation of \aca{fimdram} at about $\qty{8.3}{\times}$.
 In addition to the numbers presented by Samsung, the same simulation workloads were run on two real \ac{gpu} systems, both with \aca{hbm}, and their runtimes were compared.
 
@@ -12,11 +12,12 @@ A set of simulations is then run based on these parameters and the resulting per
 The memory configuration used in the simulations has already been partially introduced in \cref{sec:memory_configuration}.
 Each \ac{pim}-enabled \ac{pch} contains eight processing units, each of which is connected to two memory banks.
 A processing unit operates at the same frequency as a \aca{hbm} \ac{dram} device with $\qty{250}{\mega\hertz}$.
-The external clocking of the memory bus itself is $\qty{4}{\times}$ higher with a frequency of $\qty{1}{\giga\hertz}$, the data, address and command bus of \aca{hbm} operate at \ac{ddr} \cite{lee2021}.
+The external clocking of the memory bus itself is $\qty{4}{\times}$ higher with a frequency of $\qty{1}{\giga\hertz}$.
+The data, address, and command buses of \aca{hbm} operate at \ac{ddr} \cite{lee2021}.
 Thus, with both the 16-wide \ac{fp} adder and the 16-wide \ac{fp} multiplier, a single processing unit achieves a throughput of $\num{2} \cdot \qty{16}{FLOP} \cdot \qty{250}{\mega\hertz}=\qty{8}{\giga FLOPS}$.
 In total, the 16 processing units in a memory channel provide a throughput of $\num{16}\cdot\qty{8}{\giga FLOPS}=\qty{128}{\giga FLOPS}$.
 To compare this throughput with the vector processing unit of a real processor, a very simplified assumption can be made based on the ARM NEON architecture, which holds eight \ac{fp16} numbers in a single $\qty{128}{\bit}$ vector register \cite{arm2020}.
-Assuming the single processor core runs at a frequency of $\qty{3}{\giga\hertz}$, the vector processing unit can achieve a maximum throughput of $\qty{8}{FLOP} \cdot \qty{3}{\giga\hertz}=\qty{24}{\giga FLOPS}$, which is about $\qty{5}{\times}$ less than the \aca{fimdram} throughput of a single memory channel.
+Assuming a single processor core runs at a frequency of $\qty{3}{\giga\hertz}$, the vector processing unit can achieve a maximum throughput of $\qty{8}{FLOP} \cdot \qty{3}{\giga\hertz}=\qty{24}{\giga FLOPS}$, which is about $\qty{5}{\times}$ less than the \aca{fimdram} throughput of a single memory channel.
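The throughput arithmetic in this hunk can be double-checked with a short illustrative script; all constants are taken from the text above, and the script itself is not part of the thesis or its simulator:

```python
# Sanity check of the quoted throughput figures (illustrative only).
pim_unit_flops = 2 * 16 * 250e6          # 16-wide adder + 16-wide multiplier at 250 MHz
pim_channel_flops = 16 * pim_unit_flops  # 16 processing units per memory channel

neon_flops = 8 * 3e9                     # 8 fp16 lanes per 128-bit register, 3 GHz core

print(pim_unit_flops / 1e9)              # 8.0   GFLOPS per processing unit
print(pim_channel_flops / 1e9)           # 128.0 GFLOPS per channel
print(pim_channel_flops / neon_flops)    # roughly 5.3, i.e. "about 5x"
```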
 The simulated ARM system also contains a two-level cache hierarchy with a cache size of $\qty{16}{\kibi\byte}$ for the L1 cache and $\qty{256}{\kibi\byte}$ for the L2 cache.
 
 % some implementation details
@@ -30,8 +31,9 @@ When interpreting the following simulation results, it is important to note that
 Firstly, implementing the workloads on a bare-metal kernel simplifies the execution environment of the processor, since no other processes interact with it in any way.
 The process of the workloads is never preemptively interrupted and the effect of an interruption during the critical \ac{pim} microkernel execution cannot be analyzed.
 Secondly, for performance reasons, a \ac{dnn} inference is not typically run on a \ac{cpu} but on \acp{gpu} or \acp{tpu}.
-These accelerators may have significantly different execution behavior, as a \ac{gpu} may aggressively accelerate the \ac{dnn} inference by performing many parallel operations, or a \ac{tpu} may use a specialized architecture of nets, such as systolic arrays, to accelerate matrix vector operations.
-Those differences would also be reflected in the memory access pattern, and may be subject to other effects that alter the behavior of \aca{fimdram}.
+These accelerators may have significantly different execution behavior.
+For example, a \ac{gpu} may aggressively accelerate \ac{dnn} inference by performing many parallel operations, or a \ac{tpu} may use a specialized architecture, such as systolic arrays, to accelerate matrix-vector operations.
+Those differences would also be reflected in the memory access pattern, and may be subject to other effects that change the behavior of \aca{fimdram}.
 Furthermore, since the mode switching of \aca{fimdram} is not being measured in the simulations, the setup overhead is limited to the required layout conversions of the input operands.
 The high overhead of a \ac{pim} operation on a small data set may be underrepresented.
 Nevertheless, the simulations performed provide an informative insight into the effectiveness of \aca{fimdram} and its suitability for various workloads.
@@ -41,7 +43,7 @@ Nevertheless, the simulations performed provide an informative insight into the
 % Inference on the CPU is atypical; a GPU model would be more suitable
 
 \subsection{Objectives}
-Through the simulations, the research aims to address and find answers to several objectives.
+Through simulations, the research aims to address several objectives.
 As already discussed in \cref{sec:pim}, \ac{pim} aims to accelerate memory-bound problems such as \ac{gemv} and may only show a small performance gain, or even a slowdown, for compute-bound problems such as \ac{gemm}.
 The potential speedup of \aca{fimdram} should be analyzed by performing the simulations on a variety of workloads.
 For these workloads, the dimensions of the input operands may play an important role in how effective \ac{pim} is.
@@ -49,8 +51,8 @@ Small dimensions suffer from a high impact of the setup overhead, while for larg
 The performance gains for different operand dimensions should be analyzed, possibly finding a break-even point at which \ac{pim} becomes viable.
 
 Specifically, bulk vector additions and multiplications are executed, as well as level 1 \ac{blas} \ac{haxpy} operations.
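For reference, \ac{haxpy} is the half-precision level 1 \ac{blas} AXPY routine, $y \leftarrow \alpha x + y$; a minimal NumPy sketch (illustrative only, not the simulated microkernel):

```python
import numpy as np

def haxpy(a, x, y):
    """Level 1 BLAS AXPY on half-precision (fp16) operands: y <- a*x + y."""
    return (np.float16(a) * x + y).astype(np.float16)

x = np.array([1.0, 2.0, 3.0], dtype=np.float16)
y = np.array([0.5, 0.5, 0.5], dtype=np.float16)
print(haxpy(2.0, x, y))  # [2.5 4.5 6.5]
```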
-To model the inference of a \ac{dnn}, a singular \ac{gemv} operation is first performed, followed by a simple model of a sequence of multiple \ac{dnn} layers, including the necessary processing steps between the \ac{gemv} routines.
+To model the inference of a \ac{dnn}, first a single \ac{gemv} operation is performed, followed by a simple model of a sequence of multiple \ac{dnn} layers, including the necessary processing steps between the \ac{gemv} routines.
-Namely, after the reduction step of the output vector, an activation function, i.e. \ac{relu}, is applied before the vector is passed as input to the next layer.
+Namely, after the reduction step of the output vector, an activation function, in this case \ac{relu}, is applied before the vector is passed as input to the next layer.
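The layer sequence described above (a \ac{gemv} per layer, then \ac{relu} on the reduced output vector before it feeds the next layer) can be sketched as follows; the layer count and dimensions are made up for illustration:

```python
import numpy as np

def relu(v):
    # Elementwise activation applied after each layer's reduction step.
    return np.maximum(v, np.float16(0))

def dnn_forward(weights, x):
    """Simple chain of DNN layers: each layer performs a GEMV, then
    applies ReLU before passing the vector to the next layer."""
    for W in weights:
        x = relu(W @ x)
    return x

rng = np.random.default_rng(0)
# Hypothetical 3-layer model with fp16 operands, as used by FIMDRAM.
weights = [rng.standard_normal((8, 8)).astype(np.float16) for _ in range(3)]
x = rng.standard_normal(8).astype(np.float16)
print(dnn_forward(weights, x).shape)  # (8,)
```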
 
 % When performing inference of multiple \ac{dnn} layers, an activation function is typically applied to the output of each layer.
 % \Aca{fimdram} provides a \ac{relu} operation that can be applied while moving the newly interleaved input vector into the \ac{grf}-A registers.
@@ -60,10 +62,10 @@ Namely, after the reduction step of the output vector, an activation function, i
 
 To evaluate the analysis objectives, each workload in this set is performed in four different configurations:
 Two of these use a generic ARM processor running at a frequency of $\qty{3}{\giga\hertz}$, once with \ac{pim} enabled and once performing the operations only on the processor, representing a realistic setup.
-However, also two configurations with the same ARM processor but with a nearly infinite frequency is performed.
+In addition, two configurations with the same ARM processor but a nearly infinite clock frequency are simulated.
 While these configurations do not reflect a real system, they are used to address the previously mentioned concerns about the meaningfulness of performing the simulations on a \ac{cpu}.
 With infinite computational power, the simulation is guaranteed to be limited only by the memory system, reducing the computation latencies introduced by the \ac{cpu}.
-This allows an exaggerated evaluation of the performance gains of \ac{pim} in an optimal environment, where only the effect on memory boundness can be observed.
+This allows an exaggerated evaluation of the performance gains of \ac{pim} in an optimal environment.
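The four configurations discussed in this hunk form a simple 2x2 grid; a sketch (the concrete value standing in for the "nearly infinite" frequency is an assumption, as the text only says nearly infinite):

```python
from itertools import product

# {realistic 3 GHz, near-infinite clock} x {PIM enabled, CPU-only}.
cpu_freqs_hz = [3e9, 1e15]  # 1e15 Hz is a placeholder for "nearly infinite"
configs = [
    {"cpu_freq_hz": f, "pim_enabled": p}
    for f, p in product(cpu_freqs_hz, [True, False])
]
print(len(configs))  # 4
```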
 
 % different kernels
 % shared pim units (-> half the rows / half the performance, to be verified)