diff --git a/src/acronyms.tex b/src/acronyms.tex index 1c7b303..a8f3620 100644 --- a/src/acronyms.tex +++ b/src/acronyms.tex @@ -348,3 +348,7 @@ short = HugeTLB, long = huge page table, } +\DeclareAcronym{haxpy}{ + short = HAXPY, + long = half precision $a \cdot x + y$, +} diff --git a/src/chapters/implementation/kernel.tex b/src/chapters/implementation/kernel.tex index 9656fe5..456597e 100644 @@ -66,7 +66,7 @@ Each granule size has a different maximum amount of page table nesting, with up As can be seen, when using the complete 4-level page lookup process, nine bits of the virtual address are used per level to index into the corresponding page table. In cases where the input address is restricted to a maximum of 42 bits, the level 0 table can be omitted and translation can start with the level 1 table. In each table, an entry either points to the physical address of the next-level page table, or alternatively can directly point to the base address of a memory block, completing the address translation early. -While regular operating systems may use the complete $\qty{4}{\kilo\byte}$ lookup procedure for maximum flexibility, in the controlled bare-metal case, where there is only one application, this may not be necessary. +While regular operating systems may use the complete $\qty{4}{\kilo\byte}$ lookup procedure for maximum flexibility, in the controlled bare-metal case, where there is only one application, this is not necessary. For this reason, the developed kernel makes use of the first level page table and maps the complete \ac{dram} memory region using $\qty{1}{\giga\byte}$ memory blocks. In addition to the base pointer, each entry in the page table also holds certain attributes on how the memory region should be treated.
To enable the mapping of the boot memory and \ac{io} devices such as \ac{uart}, the first memory blocks are marked with a non-cacheable attribute, followed by the normal \ac{dram} region, which is cacheable, and finally the \aca{fimdram} region, which is set to non-cacheable again. diff --git a/src/chapters/results.tex b/src/chapters/results.tex index 2746520..6203539 100644 --- a/src/chapters/results.tex +++ b/src/chapters/results.tex @@ -2,11 +2,11 @@ \label{sec:results} This section explores the potential performance improvement of \aca{fimdram} across different system configurations and workloads. -After a brief introduction to the simulated system architecture, an estimated theoretical performance gain is calculated. +After a brief introduction of the simulated system architecture, an estimated theoretical performance gain is calculated. This is followed by a discussion of the measurement accuracy and suggestions for improving the measurement environment. Furthermore, the variations of the system parameters for each workload will be explored. -The set of simulations is then run based on these parameters and the resulting performance improvements are analyzed. -Finally, a comparison between the execution time of the initialization of the operands and the microkernel execution time is performed to estimate the setup overhead of \aca{fimdram}. +A set of simulations is then run based on these parameters and the resulting performance improvements are analyzed. +% Finally, a comparison between the execution time of the initialization of the operands and the microkernel execution time is performed to estimate the setup overhead of \aca{fimdram}. \subsection{System Architecture} The memory configuration used in the simulations has already been partially introduced in \cref{sec:memory_configuration}. 
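The translation scheme described in the kernel.tex hunk above (4 KiB granule, 9 index bits per lookup level, 12-bit page offset, level 1 block entries covering 1 GiB) can be sketched as follows. This is an illustrative model only, not the kernel's actual page-table code; the function and constant names are made up for the example.

```python
# Illustrative sketch of the AArch64 4 KiB-granule lookup described above.
# Bits [11:0] of the virtual address form the page offset; each of the four
# lookup levels consumes the next 9 bits (levels 3..0 at shifts 12..39).
LEVEL_SHIFTS = {0: 39, 1: 30, 2: 21, 3: 12}

def table_indices(va):
    """Return the 9-bit table index used at each of the four lookup levels."""
    return {level: (va >> shift) & 0x1FF for level, shift in LEVEL_SHIFTS.items()}

def block_offset_1g(va):
    """Offset within a 1 GiB level 1 block mapping (low 30 bits pass through)."""
    return va & ((1 << 30) - 1)
```

With a level 1 block entry, only the level 1 index selects the entry and the remaining 30 address bits are used directly as the offset into the 1 GiB block, which is why the kernel can cover all of \ac{dram} with a single first-level table.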
@@ -15,8 +15,8 @@ A processing unit operates at the same frequency as a \aca{hbm} \ac{dram} device The external clocking of the memory bus itself is $\qty{4}{\times}$ higher with a frequency of $\qty{1}{\giga\hertz}$; the data, address and command buses of \aca{hbm} operate at \ac{ddr} \cite{lee2021}. Thus, with both the 16-wide \ac{fp} adder and the 16-wide \ac{fp} multiplier, a single processing unit achieves a throughput of $\num{2} \cdot \qty{16}{FLOP} \cdot \qty{250}{\mega\hertz}=\qty{8}{\giga FLOPS}$. In total, the 16 processing units in a memory channel provide a throughput of $\num{16}\cdot\qty{8}{\giga FLOPS}=\qty{128}{\giga FLOPS}$. -To compare this throughput to the vector processing unit of a real processor, a highly simplified assumption can be made based on the ARM NEON architecture that holds 8 \ac{fp16} numbers in a single $\qty{128}{\bit}$ vector register \cite{arm2020}. -Assuming the single processor core runs at a frequency of $\qty{3}{\giga\hertz}$, the vector processing unit can achieve a maximum throughput of $\qty{8}{FLOP} \cdot \qty{3}{\giga\hertz}=\qty{24}{\giga FLOPS}$, which is about $\qty{5}{\times}$ less than the \aca{fimdram} throughput of a single channel. +To compare this throughput with the vector processing unit of a real processor, a highly simplified assumption can be made based on the ARM NEON architecture, whose $\qty{128}{\bit}$ vector registers each hold 8 \ac{fp16} numbers \cite{arm2020}. +Assuming the single processor core runs at a frequency of $\qty{3}{\giga\hertz}$, the vector processing unit can achieve a maximum throughput of $\qty{8}{FLOP} \cdot \qty{3}{\giga\hertz}=\qty{24}{\giga FLOPS}$, which is about $\qty{5}{\times}$ less than the \aca{fimdram} throughput of a single memory channel. % some implementation details % hbm size, channel...
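As a quick sanity check, the throughput figures quoted in the hunk above can be recomputed. All numbers come from the text; the script itself is only illustrative and its variable names are not taken from the simulation code.

```python
# Sanity check of the throughput figures from the text.
FLOP_PER_CYCLE = 2 * 16        # 16-wide FP adder plus 16-wide FP multiplier
PU_FREQ_HZ = 250e6             # processing unit clock frequency
pu_flops = FLOP_PER_CYCLE * PU_FREQ_HZ       # 8 GFLOPS per processing unit
channel_flops = 16 * pu_flops                # 16 units -> 128 GFLOPS per channel

NEON_FP16_LANES = 8            # fp16 values in one 128-bit NEON vector register
CPU_FREQ_HZ = 3e9              # assumed single-core frequency
neon_flops = NEON_FP16_LANES * CPU_FREQ_HZ   # 24 GFLOPS

ratio = channel_flops / neon_flops           # roughly 5.3x in favor of FIMDRAM
```

The exact ratio is $128/24 \approx 5.33$, consistent with the "about $\qty{5}{\times}$" figure in the text.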
@@ -29,11 +29,11 @@ When interpreting the following simulation results, it is important to note that Firstly, implementing the workloads on a bare-metal kernel simplifies the execution environment of the processor, since no other processes interact with it in any way. The process of the workloads is never preemptively interrupted and the effect of an interruption during the critical \ac{pim} microkernel execution cannot be analyzed. Secondly, for performance reasons, a \ac{dnn} inference is not typically run on a \ac{cpu} but on \acp{gpu} or \acp{tpu}. -These accelerators may have significantly different execution behavior, as a \ac{gpu} may aggressively accelerate inference by performing many parallel operations, or a \ac{tpu} may use specialized nets for matrix vector operations such as systolic arrays. -Such differences would also reflect themselves in the memory access pattern, and may be subject to other effects that alter the behavior of \aca{fimdram}. +These accelerators may have significantly different execution behavior, as a \ac{gpu} may aggressively accelerate the \ac{dnn} inference by performing many parallel operations, or a \ac{tpu} may use specialized structures such as systolic arrays to accelerate matrix-vector operations. +These differences would also be reflected in the memory access pattern, and may be subject to other effects that alter the behavior of \aca{fimdram}. Furthermore, since the mode switching of \aca{fimdram} is not being measured in the simulations, the setup overhead is limited to the required layout conversions of the input operands. The high overhead of a \ac{pim} operation on a small data set may be underrepresented. -Nevertheless, the simulations performed provide an informative insight on the effectiveness of \aca{fimdram} and its suited workloads. +Nevertheless, the simulations performed provide an informative insight into the effectiveness of \aca{fimdram} and its suitability for various workloads.
% bare-metal is the optimal case, linux would be a more realistic test environment % the setup-time overhead cannot be measured properly @@ -41,23 +41,28 @@ Nevertheless, the simulations performed provide an informative insight on the ef \subsection{Objectives} Through the simulations, the research aims to address several objectives. -As already discussed in \cref{sec:pim}, \ac{pim} aims to accellerate memory-bound problems such as \ac{gemv} and may only show a small performance gain, or even a worsening, for compute-bound problems such as \ac{gemm}. -This difference should be analyzed by performing the simulations on various different workloads. -For these workloads, the input dimensions may play an important role in how effective \ac{pim} is. +As already discussed in \cref{sec:pim}, \ac{pim} aims to accelerate memory-bound problems such as \ac{gemv} and may only show a small performance gain, or even a worsening, for compute-bound problems such as \ac{gemm}. +The potential speedup of \aca{fimdram} should be analyzed by performing the simulations on a variety of workloads. +For these workloads, the dimensions of the input operands may play an important role in how effective \ac{pim} is. Small dimensions suffer from a high impact of the setup overhead, while for large dimensions this effect may be less significant. The performance gains for different operand dimensions should be analyzed, possibly finding a break-even point at which \ac{pim} becomes viable. -When performing inference of multiple \ac{dnn} layers, an activation function is typically applied to the output of each layer. -\Aca{fimdram} provides a \ac{relu} operation that can be applied while moving the newly interleaved input vector into the \ac{grf}-A registers. -The performance gain of applying this operation in memory instead of on the host processor after reducing the partial sums of the output vector can be investigated.
-Furthermore, the concrete number of processing units in a \ac{pch} is in the compromise of the removal of the usable memory area. -Using the flexible simulation model, it is possible to analyze the impact of the shared processing unit architecture compared to a hypothetical solution where each bank is connected to its own processing unit. -To evaluate these objectives, the set of simulations is each employed in four different configurations: +Specifically, bulk vector additions and multiplications are executed, as well as level 1 \ac{blas} \ac{haxpy} operations. +To model the inference of a \ac{dnn}, a single \ac{gemv} operation is first performed, followed by a simple model of a sequence of multiple \ac{dnn} layers, including the necessary processing steps between the \ac{gemv} routines. +Namely, after the reduction step of the output vector, an activation function, in this case \ac{relu}, is applied before the vector is passed as input to the next layer. + +% When performing inference of multiple \ac{dnn} layers, an activation function is typically applied to the output of each layer. % \Aca{fimdram} provides a \ac{relu} operation that can be applied while moving the newly interleaved input vector into the \ac{grf}-A registers. % The performance gain of applying this operation in memory instead of on the host processor after reducing the partial sums of the output vector can be investigated. % Furthermore, the concrete number of processing units in a \ac{pch} is in the compromise of the removal of the usable memory area. % Using the flexible simulation model, it is possible to analyze the impact of the shared processing unit architecture compared to a hypothetical solution where each bank is connected to its own processing unit.
+ +To evaluate the analysis objectives, each of these simulation workloads is performed in four different configurations: With the two configurations of a generic ARM processor running at a frequency of $\qty{3}{\giga\hertz}$, once with \ac{pim} enabled and once performing the operations only on the processor, a realistic configuration should be achieved. However, two further configurations with the same ARM processor but with a nearly infinite frequency are performed. -While these configurations do not reflect a real system, they are used to address the already mentioned concerns about the meaningfulness of performing the simulations on a \ac{cpu}. -With infinite computational power, the simulation is guaranteed to be bounded only by the memory system. -This allows an exaggerated evaluation of the performance gains of \ac{pim} in an optimal environment, where only the effect on memory boundedness can be observed. +While these configurations do not reflect a real system, they are used to address the previously mentioned concerns about the meaningfulness of performing the simulations on a \ac{cpu}. +With infinite computational power, the simulation is guaranteed to be limited only by the memory system, eliminating the computation latencies introduced by the \ac{cpu}. +This allows an exaggerated evaluation of the performance gains of \ac{pim} in an optimal environment, where only the effect on memory boundedness can be observed.
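The simulated workloads described above (level 1 \ac{blas} \ac{haxpy}, and a chain of \ac{dnn} layers each consisting of a \ac{gemv} followed by a \ac{relu} activation) can be sketched as simple reference models. These are illustrative pure-Python sketches of the mathematical operations only, not the bare-metal microkernel implementations; all function names are made up for the example.

```python
# Illustrative reference models of the simulated workloads.

def haxpy(a, x, y):
    """Level 1 BLAS axpy (a*x + y), element-wise over two vectors."""
    return [a * xi + yi for xi, yi in zip(x, y)]

def gemv(W, v):
    """Matrix-vector product, computed row by row."""
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in W]

def dnn_layers(weights, v):
    """Chain of DNN layers: each layer applies a GEMV, then ReLU,
    before passing the vector as input to the next layer."""
    for W in weights:
        v = [max(o, 0.0) for o in gemv(W, v)]
    return v
```

In the actual \aca{fimdram} measurements, the reduction of the partial sums and the layout conversions around each \ac{gemv} are part of the measured work; these models only capture the arithmetic being accelerated.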
% different kernels % shared pim units (-> half the rows / half the performance, to be verified) @@ -78,10 +83,10 @@ This allows an exaggerated evaluation of the performance gains of \ac{pim} in an % GEMM with heavily interleaved matrices -\subsubsection{Initialization Overhead} +% \subsubsection{Initialization Overhead} % estimate the conversion of the operands relative to the runtime -\subsubsection{Shared Processing Units} +% \subsubsection{Shared Processing Units} % shared processing units vs one per bank % GEMV diff --git a/src/images/dnn.tex b/src/images/dnn.tex index 3c610ed..68e2629 100644 --- a/src/images/dnn.tex +++ b/src/images/dnn.tex @@ -10,30 +10,30 @@ \node[circle,draw=black,fill=ForestGreen!50,minimum size=5mm] (onode3) [below of=onode2] {$o_3$}; \node[circle,draw=black,fill=ForestGreen!60,minimum size=5mm] (onode4) [below of=onode3] {$o_4$}; - \draw (inode0.east) to (onode0.west); - \draw (inode1.east) to (onode0.west); - \draw (inode2.east) to (onode0.west); - \draw (inode3.east) to (onode0.west); + \draw[black!75] (inode0.east) to (onode0.west); + \draw[black!75] (inode1.east) to (onode0.west); + \draw[black!75] (inode2.east) to (onode0.west); + \draw[black!75] (inode3.east) to (onode0.west); - \draw (inode0.east) to (onode2.west); - \draw (inode1.east) to (onode2.west); - \draw (inode2.east) to (onode2.west); - \draw (inode3.east) to (onode2.west); + \draw[black!75] (inode0.east) to (onode2.west); + \draw[black!75] (inode1.east) to (onode2.west); + \draw[black!75] (inode2.east) to (onode2.west); + \draw[black!75] (inode3.east) to (onode2.west); - \draw (inode0.east) to (onode3.west); - \draw (inode1.east) to (onode3.west); - \draw (inode2.east) to (onode3.west); - \draw (inode3.east) to (onode3.west); + \draw[black!75] (inode0.east) to (onode3.west); + \draw[black!75] (inode1.east) to (onode3.west); + \draw[black!75] (inode2.east) to (onode3.west); + \draw[black!75] (inode3.east) to (onode3.west); - \draw (inode0.east) to (onode4.west); - 
\draw (inode1.east) to (onode4.west); - \draw (inode2.east) to (onode4.west); - \draw (inode3.east) to (onode4.west); + \draw[black!75] (inode0.east) to (onode4.west); + \draw[black!75] (inode1.east) to (onode4.west); + \draw[black!75] (inode2.east) to (onode4.west); + \draw[black!75] (inode3.east) to (onode4.west); - \draw[red!60,thick] (inode0.east) to (onode1.west); - \draw[red!60,thick] (inode1.east) to (onode1.west); - \draw[red!60,thick] (inode2.east) to (onode1.west); - \draw[red!60,thick] (inode3.east) to (onode1.west); + \draw[red!60,thick] (inode0.east) -- (onode1.west) node [midway,above,sloped] {$w_{1,0}$}; + \draw[red!60,thick] (inode1.east) -- (onode1.west);%node [midway,above,sloped] {\tiny $w_{1,1}$}; + \draw[red!60,thick] (inode2.east) -- (onode1.west);%node [midway,above,sloped] {\tiny $w_{1,2}$}; + \draw[red!60,thick] (inode3.east) -- (onode1.west);%node [midway,below,sloped] {\tiny $w_{1,3}$}; \matrix (matrix) [matrix of nodes,left delimiter=(,right delimiter=),right=1.5cm of onode2] { $w_{0,0}$ & $w_{0,1}$ & $w_{0,2}$ & $w_{0,3}$ \\