diff --git a/src/acronyms.tex b/src/acronyms.tex
index 2dba8d1..1c7b303 100644
--- a/src/acronyms.tex
+++ b/src/acronyms.tex
@@ -22,6 +22,10 @@
 short = AI,
 long = artificial intelligence,
 }
+\DeclareAcronym{cpu}{
+short = CPU,
+long = central processing unit,
+}
 \DeclareAcronym{gpu}{
 short = GPU,
 long = graphics processing unit,
@@ -204,6 +208,10 @@
 short = FPU,
 long = floating-point unit,
 }
+\DeclareAcronym{fp}{
+short = FP,
+long = floating-point,
+}
 \DeclareAcronym{crf}{
 short = CRF,
 long = command register file,
diff --git a/src/appendix.tex b/src/appendix.tex
index f4a5669..2d858db 100644
@@ -4,4 +4,5 @@
 % etwas source code,
 % von der vm
 % einige microkernels
+% also execution of the microkernels
 % ...
diff --git a/src/chapters/implementation/kernel.tex b/src/chapters/implementation/kernel.tex
index ed7755d..9656fe5 100644
@@ -93,6 +93,7 @@
 In order to incorporate this memory allocator, it was initialized by providing a
 The allocator can then dynamically use sections of this arena to allocate the \ac{pim} data structures.
 \subsubsection{Memory Configuration}
+\label{sec:memory_configuration}
 As already discussed in \cref{sec:memory_layout} and in \cref{sec:microkernel_execution}, certain requirements are posed onto the configuration of the memory system, such as the \ac{am}.
 These configurations can be set when instantiating DRAMSys while it is being connected to the gem5 memory bus.
diff --git a/src/chapters/results.tex b/src/chapters/results.tex
index e426c0a..1ad7ab4 100644
@@ -1,4 +1,86 @@
 \section{Simulation Results}
 \label{sec:results}
-% gem5 m5ops routines/implementation in the kernel
+This section explores the potential performance improvement of \aca{fimdram} across different system configurations and workloads.
+After a brief introduction to the simulated system architecture, an estimated theoretical performance gain is calculated.
+This is followed by a discussion of the measurement accuracy and suggestions for improving the measurement environment.
+Furthermore, the variations of the system parameters for each workload are explored.
+The set of simulations is then run with these parameters, and the resulting performance improvements are analyzed.
+Finally, the execution time of the operand initialization is compared with the microkernel execution time to estimate the setup overhead of \aca{fimdram}.
+
+\subsection{System Architecture}
+The memory configuration used in the simulations has already been partially introduced in \cref{sec:memory_configuration}.
+Each \ac{pim}-enabled \ac{pch} contains 8 processing units, each of which is connected to 2 memory banks.
+A processing unit operates at the same frequency as the \aca{hbm} \ac{dram} device, namely $\qty{250}{\mega\hertz}$.
+The external clock of the memory bus itself is $\num{4}\times$ higher at a frequency of $\qty{1}{\giga\hertz}$; the data, address, and command buses of \aca{hbm} operate at \ac{ddr} \cite{lee2021}.
+Thus, with both the 16-wide \ac{fp} adder and the 16-wide \ac{fp} multiplier, a single processing unit achieves a throughput of $\num{2} \cdot \qty{16}{FLOP} \cdot \qty{250}{\mega\hertz}=\qty{8}{\giga FLOPS}$.
+In total, the 16 processing units in a memory channel provide a throughput of $\num{16}\cdot\qty{8}{\giga FLOPS}=\qty{128}{\giga FLOPS}$.
+To compare this throughput to the vector processing unit of a real processor, a highly simplified assumption can be made based on the ARM NEON architecture, which holds 8 \ac{fp16} numbers in a single $\qty{128}{\bit}$ vector register \cite{arm2020}.
+Assuming the single processor core runs at a frequency of $\qty{3}{\giga\hertz}$, the vector processing unit can achieve a maximum throughput of $\qty{8}{FLOP} \cdot \qty{3}{\giga\hertz}=\qty{24}{\giga FLOPS}$, which is about $\num{5}\times$ less than the \aca{fimdram} throughput of a single channel.
+
+% some implementation details
+% hbm size, channel...
+% operating at ...MHz
+% theoretical bandwidth and FLOPS...
+% very simple comparison to ARM FLOPS/cycle -> ratio in the optimal case
+
+\subsection{Accuracy and Comparability}
+When interpreting the following simulation results, it is important to note that the system configuration does not strictly reflect a system on which a real \ac{dnn} inference would be performed.
+Firstly, implementing the workloads on a bare-metal kernel simplifies the execution environment of the processor, since no other processes interact with it in any way.
+The workload process is never preempted, so the effect of an interruption during the critical \ac{pim} microkernel execution cannot be analyzed.
+Secondly, for performance reasons, a \ac{dnn} inference is typically not run on a \ac{cpu} but on \acp{gpu} or \acp{tpu}.
+These accelerators may have significantly different execution behavior: a \ac{gpu} may aggressively accelerate inference by performing many operations in parallel, and a \ac{tpu} may use specialized structures for matrix-vector operations, such as systolic arrays.
+Such differences would also be reflected in the memory access pattern and may be subject to other effects that alter the behavior of \aca{fimdram}.
+Furthermore, since the mode switching of \aca{fimdram} is not measured in the simulations, the setup overhead is limited to the required layout conversions of the input operands.
+The high overhead of a \ac{pim} operation on a small data set may therefore be underrepresented.
+Nevertheless, the simulations performed provide an informative insight into the effectiveness of \aca{fimdram} and the workloads suited to it.
+
+% bare-metal is the optimal case, linux would be a more realistic test environment
+% the setup-time overhead cannot be measured properly
+% inference on a CPU is atypical, a GPU model would be more suitable
+
+\subsection{Objectives}
+The simulations aim to address several objectives.
+As already discussed in \cref{sec:pim}, \ac{pim} aims to accelerate memory-bound problems such as \ac{gemv} and may only show a small performance gain, or even a slowdown, for compute-bound problems such as \ac{gemm}.
+This difference should be analyzed by performing the simulations on a variety of workloads.
+For these workloads, the input dimensions may play an important role in how effective \ac{pim} is.
+Small dimensions suffer strongly from the setup overhead, while for large dimensions this effect may be less significant.
+The performance gains for different operand dimensions should be analyzed, possibly finding a break-even point at which \ac{pim} becomes viable.
+When performing inference of multiple \ac{dnn} layers, an activation function is typically applied to the output of each layer.
+\Aca{fimdram} provides a \ac{relu} operation that can be applied while moving the newly interleaved input vector into the \ac{grf}-A registers.
+The performance gain of applying this operation in memory, instead of on the host processor after reducing the partial sums of the output vector, can be investigated.
+Furthermore, the concrete number of processing units in a \ac{pch} is a compromise against the amount of usable memory area that is removed.
+Using the flexible simulation model, it is possible to analyze the impact of the shared processing unit architecture compared to a hypothetical solution where each bank is connected to its own processing unit.
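The contrast between memory-bound \ac{gemv} and compute-bound \ac{gemm} drawn in the Objectives above can be illustrated with a rough arithmetic-intensity estimate. The following Python sketch is purely illustrative and not part of the thesis code; it assumes fp16 (2-byte) operands and a minimal-traffic model in which every operand element crosses the memory bus exactly once:

```python
# Illustrative sketch (hypothetical, not thesis code): arithmetic intensity
# in FLOP per byte of mandatory memory traffic, assuming fp16 operands.
FP16_BYTES = 2

def gemv_intensity(m: int, n: int) -> float:
    """y = A @ x with A of shape (m, n): 2*m*n FLOPs."""
    flops = 2 * m * n
    bytes_moved = FP16_BYTES * (m * n + n + m)  # read A and x, write y
    return flops / bytes_moved

def gemm_intensity(m: int, k: int, n: int) -> float:
    """C = A @ B with shapes (m, k) x (k, n): 2*m*k*n FLOPs."""
    flops = 2 * m * k * n
    bytes_moved = FP16_BYTES * (m * k + k * n + m * n)  # read A and B, write C
    return flops / bytes_moved

# GEMV stays capped near 1 FLOP/byte regardless of size, so it remains
# memory-bound; square GEMM intensity grows roughly linearly with the
# dimension, so large GEMMs become compute-bound.
print(gemv_intensity(4096, 4096))        # ~1.0
print(gemm_intensity(1024, 1024, 1024))  # ~341
```

Under this simplified model, no amount of extra compute helps GEMV once the memory bandwidth is saturated, which is exactly the regime \ac{pim} targets.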
+
+To evaluate these objectives, each simulation is run in four different configurations.
+Two configurations use a generic ARM processor running at a frequency of $\qty{3}{\giga\hertz}$, once with \ac{pim} enabled and once performing the operations only on the processor; these are intended to represent a realistic system.
+In addition, two configurations with the same ARM processor but with a nearly infinite clock frequency are simulated.
+While these configurations do not reflect a real system, they are used to address the previously mentioned concerns about the meaningfulness of performing the simulations on a \ac{cpu}.
+With infinite computational power, the simulation is guaranteed to be bounded only by the memory system.
+This allows an exaggerated evaluation of the performance gains of \ac{pim} in an optimal environment, where only the effect on memory boundedness can be observed.
+
+% different kernels
+% shared pim units (-> half the rows / half the performance, to be verified)
+% sweep of matrix dimensions rows/columns, break-even point
+% ReLU in DRAM vs on host
+
+% comparison with normal clock and infinite compute (always 4 simulations, or 5 with real hardware)
+
+\subsection{Simulation Results}
+\subsubsection{Workload Kernels}
+% vector ADD and vector MUL
+% vector-scalar ADD and vector-scalar MUL
+% GEMV
+  % Samsung 7.4x-8.9x
+% "inference" with multiple layers
+  % ReLU comparison
+
+% GEMM with heavily interleaved matrices
+
+\subsubsection{Initialization Overhead}
+% estimate the conversion time of the operands relative to the runtime
+
+\subsubsection{Shared Processing Units}
+% shared processing units vs. one processing unit per bank
+% GEMV
+
diff --git a/src/doc.bib b/src/doc.bib
index 13677fd..f6158d0 100644
@@ -24,10 +24,21 @@
 author = {{ARM}},
 date = {2015-03-24},
 url = {https://developer.arm.com/documentation/den0024/latest/},
+urldate = {2024-01-08},
 langid = {english},
 file = {/home/derek/Nextcloud/Verschiedenes/Zotero/storage/KGNI52X5/2015 - ARM Cortex-A Series Programmer’s Guide for ARMv8-A.pdf}
 }
+
+@article{arm2020,
+title = {Neon {{Programmer Guide}} for {{Armv8-A}}: {{Coding}} for {{Neon}}},
+author = {{ARM}},
+date = {2020-07-05},
+url = {https://developer.arm.com/documentation/102159/latest/},
+urldate = {2024-02-21},
+langid = {english},
+file = {/home/derek/Nextcloud/Verschiedenes/Zotero/storage/QQI2QA25/2020 - Neon Programmer Guide for Armv8-A Coding for Neon.pdf}
+}
+
 @online{blas1979,
 title = {{{BLAS}} ({{Basic Linear Algebra Subprograms}})},
 author = {{BLAS}},
@@ -92,7 +103,9 @@
 @article{
 title = {Bare-Metal {{Boot Code}} for {{ARMv8-A Processors}}},
 author = {Gao, William},
-date = {2017},
+date = {2017-03-31},
+url = {https://developer.arm.com/documentation/dai0527/latest/},
+urldate = {2024-01-08},
 langid = {english},
 file = {/home/derek/Nextcloud/Verschiedenes/Zotero/storage/FAN7NPUM/Gao - Bare-metal Boot Code for ARMv8-A Processors.pdf}
 }
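As a cross-check of the System Architecture subsection in results.tex, its peak-throughput estimate can be reproduced numerically. The following Python sketch is illustrative only; every parameter (16-wide \ac{fp} units at 250 MHz, 16 processing units per channel, the simplified 8-lane NEON reference core at 3 GHz) is taken from the text above, and the variable names are hypothetical:

```python
# Sketch reproducing the theoretical throughput figures from the text.
SIMD_WIDTH = 16        # 16-wide fp16 adder and 16-wide fp16 multiplier
PU_CLOCK_HZ = 250e6    # processing unit clock, 250 MHz
PUS_PER_CHANNEL = 16   # 16 processing units per memory channel

# one add and one multiply per cycle -> 2 * 16 FLOP per cycle per unit
pu_flops = 2 * SIMD_WIDTH * PU_CLOCK_HZ      # per processing unit
channel_flops = PUS_PER_CHANNEL * pu_flops   # per memory channel

# simplified NEON reference: 8 fp16 lanes in a 128-bit register at 3 GHz
neon_flops = 8 * 3e9

print(pu_flops / 1e9)              # 8.0  GFLOPS
print(channel_flops / 1e9)         # 128.0 GFLOPS
print(channel_flops / neon_flops)  # ~5.3, the "about 5x" from the text
```

The exact ratio is 128/24 = 16/3, i.e. about 5.3, which matches the "about 5 times" claim in the text.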