diff --git a/src/abstract.tex b/src/abstract.tex index 555e5f2..77c2548 100644 --- a/src/abstract.tex +++ b/src/abstract.tex @@ -3,10 +3,10 @@ In our increasingly data-oriented world, machine learning applications such as \acp*{llm} for natural language processing are becoming more and more popular. An important component of these new systems are \acp*{dnn}. -To accelerate such \acsp*{dnn}, specialized processors such as \acp*{gpu} or \acp*{tpu} are mainly used, which can perform the required arithmetic operations more efficiently than \acp*{cpu}. +To accelerate the training and inference of such \acsp*{dnn}, specialized hardware accelerators such as \acp*{gpu} or \acp*{tpu} are mainly used, which can perform the required arithmetic operations more efficiently than a \ac*{cpu}. However, it turns out that the achievable performance of \acsp*{dnn} is less and less limited by the available computing power and more and more by the finite memory bandwidth of \acp*{dram}. A possible solution to this problem is the use of \ac*{pim}, which offloads some of the data processing directly into memory. -In this thesis, the real-world \acs*{pim} implementation \acl*{fimdram} of the major memory manufacturer Samsung is analyzed with the help of a newly developed software model and using the gem5 simulation platform and the DRAMSys memory simulator. +In this thesis, one of the first real-world \acs*{pim} implementations called \acl*{fimdram} of the memory manufacturer Samsung is analyzed with the help of a newly developed virtual prototype and using the gem5 simulation platform and the DRAMSys memory simulator. \vspace{1.0cm} @@ -14,9 +14,9 @@ In this thesis, the real-world \acs*{pim} implementation \acl*{fimdram} of the m In unserer zunehmend datenorientierten Welt gewinnen Anwendungen des maschinellen Lernens wie \acp*{llm} zur Verarbeitung von natürlicher Sprache an immer größerer Bedeutung. Eine wichtige Komponente dieser neuen Systeme sind \acp*{dnn}. 
-Zur Beschleunigung solcher \acsp*{dnn} werden vorwiegend spezialisierte Prozessoren wie \acp*{gpu} oder \acp*{tpu} eingesetzt, die die erforderten Rechenoperationen effizienter ausführen können als \acp*{cpu}. -Es zeigt sich jedoch, dass die Leistung von \acsp*{dnn} zunehmend weniger durch die erreichbare Rechenleistung als vielmehr durch die endliche Speicherbandbreite des \acp*{dram} begrenzt wird. +Zur Beschleunigung des Trainings und der Inferenz solcher \acsp*{dnn} werden hauptsächlich spezialisierte Hardware-Beschleuniger wie \acp*{gpu} oder \acp*{tpu} eingesetzt, die die erforderlichen Rechenoperationen effizienter ausführen können als eine \ac*{cpu}. +Es zeigt sich jedoch, dass die erreichbare Performanz von \acsp*{dnn} zunehmend weniger durch die verfügbare Rechenleistung als vielmehr durch die endliche Speicherbandbreite von \acp*{dram} begrenzt wird. Eine mögliche Lösung für dieses Problem ist die Verwendung von \ac*{pim}, das einen Teil der Datenverarbeitung direkt in den Speicher verlagert. -In dieser Arbeit wird die reale \acs*{pim}-Implementierung \acl*{fimdram} des großen Speicherherstellers Samsung mithilfe eines neu entwickelten Softwaremodells und unter Verwendung der Simulationsplattform gem5 und des Speichersimulators DRAMSys analysiert. +In dieser Arbeit wird eine der ersten realen \acs*{pim}-Implementierungen namens \acl*{fimdram} des Speicherherstellers Samsung mithilfe eines neu entwickelten virtuellen Prototyps und unter Verwendung der Simulationsplattform gem5 und des Speichersimulators DRAMSys analysiert. 
\end{abstract} diff --git a/src/appendix.tex b/src/appendix.tex index b4533d0..093b0df 100644 --- a/src/appendix.tex +++ b/src/appendix.tex @@ -227,7 +227,7 @@ EXIT \begin{listing}[H] \begin{minted}[linenos]{rust} -pub fn execute( +pub fn execute( matrix: &Matrix, input_vector: &Vector, output_partial_sum_vector: &mut SVector, diff --git a/src/chapters/conclusion.tex b/src/chapters/conclusion.tex index a741024..54f4ec8 100644 --- a/src/chapters/conclusion.tex +++ b/src/chapters/conclusion.tex @@ -32,8 +32,8 @@ A special compiler extension would be able to generate the necessary \ac{ld} and This extension could also make use of so-called non-temporal instructions, which bypass the cache hierarchy on a per-instruction basis instead of preallocating the entire \ac{pim}-enabled memory as non-cacheable. In addition to the performance comparison, further research should also model and compare the power efficiency gain of \ac{pim} to the non-\ac{pim} case. -Since \ac{pim} not only provides a shorter computation time per operation, but also does not actually drive the memory data bus during operation, it promises good improvements in this area. -However, this would require a detailed performance model of both \aca{hbm} and \aca{fimdram}. +Since \ac{pim} not only provides a shorter computation time per operation, but also does not actually transfer data out of the \ac{dram} and therefore does not need to drive the data bus during operation, it promises good improvements in this area. +However, such research would require a detailed power model of both \aca{hbm} and \aca{fimdram}. In conclusion, \ac{pim} is a promising approach to address the future processing and power efficiency needs of \ac{ai} and possibly other applications. Research needs to consider not only the architecture itself, but also the integration of \ac{pim} into applications at the software level. 
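The appendix hunk above touches the signature of the Rust `execute` kernel (`matrix`, `input_vector`, `output_partial_sum_vector`). As a rough host-side sketch only — the `Matrix` type, the two-partial-sums-per-row split, and the `reduce` helper are illustrative assumptions, not the thesis library's actual definitions — the partial-sum GEMV pattern implied by that signature could look like:

```rust
// Hypothetical stand-ins for the thesis library's operand types (not its real API).
struct Matrix {
    rows: usize,
    cols: usize,
    data: Vec<f32>, // row-major storage
}
type Vector = Vec<f32>;

// Each PIM unit would produce one partial sum per row segment; here the split
// into two segments per row is emulated on the host for illustration.
fn execute(matrix: &Matrix, input_vector: &Vector, output_partial_sums: &mut Vec<f32>) {
    let half = matrix.cols / 2;
    for r in 0..matrix.rows {
        let row = &matrix.data[r * matrix.cols..(r + 1) * matrix.cols];
        // Two partial dot products per row, reduced by the host afterwards.
        let p0: f32 = row[..half].iter().zip(&input_vector[..half]).map(|(a, b)| a * b).sum();
        let p1: f32 = row[half..].iter().zip(&input_vector[half..]).map(|(a, b)| a * b).sum();
        output_partial_sums.push(p0);
        output_partial_sums.push(p1);
    }
}

// Host-side reduction step: sum the two partials belonging to each output row.
fn reduce(partials: &[f32]) -> Vec<f32> {
    partials.chunks(2).map(|c| c.iter().sum::<f32>()).collect()
}

fn main() {
    // 2x2 identity times [3, 5]: the per-row partials reduce back to [3, 5].
    let m = Matrix { rows: 2, cols: 2, data: vec![1.0, 0.0, 0.0, 1.0] };
    let v: Vector = vec![3.0, 5.0];
    let mut partials = Vec::new();
    execute(&m, &v, &mut partials);
    println!("{:?}", reduce(&partials));
}
```

The split-then-reduce shape mirrors why an activation function can only be applied after the reduction step, as the results chapter notes.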
diff --git a/src/chapters/introduction.tex b/src/chapters/introduction.tex index 254a055..38f910e 100644 --- a/src/chapters/introduction.tex +++ b/src/chapters/introduction.tex @@ -3,7 +3,7 @@ Emerging applications such as \acp{llm} and especially ChatGPT are revolutionizing modern computing and are changing the way we interact with computing systems. A key component of these models are \acp{dnn}, which are a type of machine learning model inspired by the structure of the human brain: -Composed of multiple layers of interconnected nodes that mimic a network of neurons, \acp{dnn} are used to perform various tasks such as image recognition or natural language and speech processing. +composed of multiple layers of interconnected nodes that mimic a network of neurons, \acp{dnn} are used to perform various tasks such as image recognition or natural language and speech processing. Consequently, \acp{dnn} make it possible to tackle many new classes of problems that were previously beyond the reach of conventional algorithms. However, the ever-increasing use of these technologies poses new challenges on hardware architectures, as the energy required to train and run these models reaches unprecedented levels. @@ -26,7 +26,7 @@ It is therefore required to achieve radical improvements in the energy efficienc In recent years, domain-specific accelerators, such as \acp{gpu} or \acp{tpu} have become very popular, as they provide orders of magnitude higher performance and energy efficiency for the training and inference of \ac{ai} applications than general-purpose processors \cite{kwon2021}. 
However, research must also take into account the off-chip memory~-~moving data between the processor and the \ac{dram} is very costly, as fetching the operands consumes more power than performing the computation on them: -While performing a double precision floating point operation on a $\qty{28}{\nano\meter}$ technology might consume an energy of about $\qty{20}{\pico\joule}$, fetching the operands from \ac{dram} consumes almost three orders of magnitude more energy at about $\qty{16}{\nano\joule}$ \cite{dally2010}. +while performing a double precision floating point operation on a $\qty{28}{\nano\meter}$ technology may consume an energy of about $\qty{20}{\pico\joule}$, fetching the operands from \ac{dram} consumes almost three orders of magnitude more energy at about $\qty{16}{\nano\joule}$ \cite{dally2010}. Furthermore, many types of \acp{dnn} used for language and speech processing, such as \acp{rnn}, \acp{mlp} and some layers of \acp{cnn}, are severely limited by the memory bandwidth that the \ac{dram} can provide, making them \textit{memory-bound} \cite{he2020}. In contrast, compute-intensive workloads, such as visual processing, are referred to as \textit{compute-bound}. @@ -44,11 +44,11 @@ However, recent \ac{ai} technologies require even greater bandwidths than \ac{hb Overall, new approaches to computing are needed to meet the demand for more performant and energy-efficient computing systems. This has led researchers to reconsider past \ac{pim} architectures and advance them further \cite{lee2021}. -\Ac{pim} integrates computational logic into the memory itself, to exploit the minimal data movement cost and the extensive internal data parallelism \cite{sudarshan2022}, making well-suited for memory-bound problems. +\Ac{pim} integrates computational logic into the memory itself, to exploit the minimal data movement cost and the extensive internal data parallelism \cite{sudarshan2022}, making it well-suited for memory-bound problems. 
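The cited energy figures — about $\qty{20}{\pico\joule}$ per double-precision operation versus about $\qty{16}{\nano\joule}$ per \ac{dram} operand fetch — imply a ratio of 800, which is the "almost three orders of magnitude" the text refers to. A minimal sketch of that arithmetic, with the constants taken directly from the quoted figures:

```rust
// Energy figures as quoted from Dally for a 28 nm technology node.
const OP_ENERGY_PJ: f64 = 20.0;        // one double-precision floating point op
const FETCH_ENERGY_PJ: f64 = 16_000.0; // one DRAM operand fetch (16 nJ)

fn fetch_to_compute_ratio() -> f64 {
    FETCH_ENERGY_PJ / OP_ENERGY_PJ
}

fn main() {
    // 800x, i.e. almost three orders of magnitude.
    println!("fetching costs {}x the compute energy", fetch_to_compute_ratio());
}
```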
This thesis analyzes various classes of \ac{pim} architectures and identifies the challenges of integrating them into state-of-the-art \acp{dram}. In particular, the real-world \ac{pim} implementation of the major \ac{dram} manufacturer Samsung, \ac{fimdram}, is discussed in great detail. -The special memory layout required for the data structures of the input and output operands is analyzed so that the integrated \ac{pim} processing units can properly execute the specific arithmetic algorithms. +The special memory layout required for the data structures of the input and output operands is analyzed so that the integrated \ac{pim} processing units can properly execute the specific arithmetic operations. Furthermore, a \ac{vp} of \aca{fimdram} is developed and integrated into the \aca{hbm} model of the memory simulator DRAMSys. To be able to make use of the \ac{pim} model, a software library is implemented that takes care of the communication between the host processor and the \ac{pim} processing units, provides data structures to be used for the operand data, and defines functions to execute a programmed \ac{pim} kernel directly in memory. Finally, the gem5 simulation platform is used to build various user programs that make use of the software support library and implement a number of workloads that are accelerated using \ac{pim}. diff --git a/src/chapters/results.tex b/src/chapters/results.tex index b1ab403..58634fd 100644 --- a/src/chapters/results.tex +++ b/src/chapters/results.tex @@ -62,7 +62,7 @@ Namely, after the reduction step of the output vector, an activation function, i To evaluate the analysis objectives, this set of simulation workloads is each performed in four different configurations: With the two configurations of a generic ARM processor running at a frequency of $\qty{3}{\giga\hertz}$, once with \ac{pim} enabled and once performing the operations only on the processor, a realistic configuration should be achieved. 
-However, also two configurations with the same ARM processor but with a nearly infinite clock frequency is performed. +In addition, two configurations with the same ARM processor but with a nearly infinite clock frequency are used. While these configurations do not reflect a real system, they are used to address the previously mentioned concerns about the meaningfulness of performing the simulations on a \ac{cpu}. With infinite computational power, the simulation is guaranteed to be limited only by the memory system, reducing the computation latencies introduced by the \ac{cpu}. This allows an exaggerated evaluation of the performance gains of \ac{pim} in an optimal environment. @@ -96,7 +96,7 @@ The workloads adhere to the following calculation patterns: \end{itemize} Each workload is run with four different input vector dimensions to examine the effect of setup overhead and potentially identify a break-even point at which \ac{pim} becomes viable. -\Cref{tab:dimensions_vector} lists the specific vector dimensions for the following benchmarks. +\Cref{tab:dimensions_vector} lists the specific vector dimensions for the following benchmarks, where each element is a \ac{fp16} number. The levels X1-X4 denote the increasing dimensions, with each successive level doubling in size, starting at 2097152. To accurately evaluate the performance gain of \ac{pim}, it is important that the size of the input operand is significantly larger than the cache size of the simulated system, so that the cache does not filter the memory accesses to the \ac{dram}. In the case of the smallest dimension level, the effective data size of the input operands is $2^{21} \cdot 2 \cdot \qty{2}{\byte}=\qty{8}{\mebi\byte}$, which is much larger than the last-level cache of $\qty{256}{\kibi\byte}$. 
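The cache-bypass argument above rests on a small size calculation: at dimension level X1, two \ac{fp16} input operands of $2^{21}$ elements each occupy $\qty{8}{\mebi\byte}$, far exceeding the $\qty{256}{\kibi\byte}$ last-level cache. A sketch of this check, with the constants taken from the text:

```rust
// Dimension level X1: 2^21 FP16 elements per operand, two input operands.
const ELEMS: usize = 1 << 21;
const FP16_BYTES: usize = 2;
const OPERANDS: usize = 2;
const LLC_BYTES: usize = 256 * 1024; // simulated last-level cache, 256 KiB

fn input_working_set_bytes() -> usize {
    ELEMS * OPERANDS * FP16_BYTES
}

fn main() {
    // 8 MiB working set, 32x the LLC, so DRAM accesses are not filtered away.
    let mib = input_working_set_bytes() / (1024 * 1024);
    println!("{mib} MiB vs {} KiB LLC", LLC_BYTES / 1024);
}
```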
@@ -221,17 +221,17 @@ Since the speedup for \ac{gemv} approaches $\qty{63}{\times}$ and for \ac{dnn} $ \label{fig:matrix_infinite} \end{figure} -For the infinite compute approach, the \ac{gemv} and \ac{dnn} benchmarks however show a more differentiated view: +For the infinite compute approach, however, the \ac{gemv} and \ac{dnn} benchmarks show a more differentiated view, as seen in \cref{fig:matrix_infinite}: While the \ac{gemv} benchmark plateaus at around $\qty{9}{\times}$ for all matrix sizes, the usage of \ac{pim} slows the execution down to a factor of $\qty{0.56}{\times}$ for the \ac{dnn} benchmark. However, the speedup quickly increases with larger matrix dimensions, reaches its break-even point at the second step, and shows a maximum speedup of $\qty{9.2}{\times}$ and $\qty{6.0}{\times}$ for the \ac{gemv} and \ac{dnn} benchmarks, respectively. These results provide a more realistic view of \aca{fimdram}: -For workloads and accelerator systems that are truly memory-bound, performance improvements can be on the order of the simulated $\qty{9}{\times}$. +For workloads and accelerator systems that are truly memory-bound, performance improvements can be in the order of the simulated $\qty{9}{\times}$. This result is largely in line with the numbers published by Samsung, which were already introduced in \cref{sec:fimdram_performance} and will be compared in more detail with the simulation results in the next section. \subsubsection{Comparison to Samsung's Simulation Results} To reiterate, Samsung used a real hardware accelerator platform for its analyses, which is based on a Xilinx Zynq Ultrascale+ \ac{fpga} and uses real manufactured \aca{fimdram} memory packages. -Similarly to the previous investigations, Samsung used for its microbenchmarks different input dimensions for both its \ac{gemv} and vector ADD workloads, which are listed in \cref{tab:samsung_dimensions}. 
+Similar to the previous investigations, Samsung used different input dimensions for both its \ac{gemv} and vector ADD microbenchmarks, which are listed in \cref{tab:samsung_dimensions}. \begin{table} \centering @@ -274,13 +274,13 @@ Since the Samsung \ac{fpga} platform can be assumed to be a highly optimized acc \label{fig:samsung_speedup} \end{figure} -The performed ADD microbenchmark of Samsung show a small variance between the different input dimensions with an average speedup value of around $\qty{1.6}{\times}$. +Samsung's ADD microbenchmark shows little variance between the different input dimensions, with an average speedup of around $\qty{1.6}{\times}$. Compared to the simulated platform, the variance is also limited, but the speedup is approximately $\qty{12.7}{\times}$, which is almost an order of magnitude higher than the findings of Samsung. This may be a surprising result, since such vector operations are inherently memory-bound and should be a prime candidate for the use of \ac{pim}. Samsung explains its low value of $\qty{1.6}{\times}$ by the fact that after eight \ac{rd} accesses, the processor has to introduce a memory barrier instruction, resulting in a severe performance hit \cite{lee2021}. However, this memory barrier has also been implemented in the VADD kernel of the simulations, which still show a significant performance gain. -The \ac{gemv} microbenchmark on the other hand shows a more matching result with an average speedup value of $\qty{8.3}{\times}$ for Samsung's implementation, while the simulation of this thesis achieved an average speedup of $\qty{9.0}{\times}$ which is well within the reach of the real hardware implementation. 
+The \ac{gemv} microbenchmark, on the other hand, shows a closer match, with an average speedup of $\qty{8.3}{\times}$ for Samsung's implementation, while the simulation of this thesis achieved an average speedup of $\qty{9.0}{\times}$, which is well within the reach of the real hardware implementation. In summary, the results for the VADD workload show some deviation from the real-world implementation of the system, while the \ac{gemv} workload shows a result that is consistent with it. \subsubsection{Comparison to Real Hardware} @@ -288,14 +288,14 @@ In summary, the results for the VADD workload show some deviation from the real- In addition to comparing Samsung's real hardware implementation, the same benchmarks of the simulations performed are run on two real \ac{gpu} systems, here referred to as Vega and Tesla. The former system is the consumer \ac{gpu} \textit{Radeon RX Vega 56} from AMD, while the latter is the \textit{Tesla V100} \ac{gpu} from NVIDIA, specifically tailored for \ac{hpc}. This Tesla \ac{gpu} is only one of the in total 16 \acp{gpu} that are part of the NVIDIA DGX-2 \ac{ai} workstation. -Both \acp{gpu} make use of \aca{hbm} and are therefore well suited to classify the simulation results and get an overview of the workload runtimes on a real system. +Both \acp{gpu} make use of \aca{hbm} and are therefore well suited to put the simulation results into context and gain an impression of the workload runtimes on a real system. Since both systems use generic \aca{hbm} \ac{dram} and not \aca{fimdram}, the measurements should only be used as a rough estimate of the runtimes in a non-\ac{pim} case. -The Vega \ac{gpu} integrates $\qty{8}{\gibi\byte}$ of \aca{hbm} memory using two stacks, achieving a complete bus width of $\qty{2048}{\bit}$ with a total of 16 memory channels. +The Vega \ac{gpu} integrates $\qty{8}{\gibi\byte}$ of \aca{hbm} memory using two stacks, achieving a total bus width of $\qty{2048}{\bit}$ across 16 memory channels. 
For the theoretical performance of \ac{fp16} operations, Vega achieves a value of $\qty{21.09}{\tera FLOPS}$ \cite{vega2017}. -While being specifically tailored for \ac{ai} applications, the Tesla \ac{gpu} integrates $\qty{32}{\gibi\byte}$ of \aca{hbm} using four stacks, resulting in a total bus width of $\qty{4096}{\bit}$ and 32 independent memory channels. -The theoretical performance of \ac{fp16} operations is only slightly higher than the of the Vega \ac{gpu} with a value of $\qty{28.26}{\peta FLOPS}$ \cite{tesla2018}. +While being specifically tailored to \ac{ai} applications, the Tesla \ac{gpu} integrates $\qty{32}{\gibi\byte}$ of \aca{hbm} using four stacks, resulting in a total bus width of $\qty{4096}{\bit}$ and 32 independent memory channels. +The theoretical performance of \ac{fp16} operations is only slightly higher than that of the Vega \ac{gpu} with a value of $\qty{28.26}{\tera FLOPS}$ \cite{tesla2018}. \begin{figure} \centering @@ -318,7 +318,7 @@ The theoretical performance of \ac{fp16} operations is only slightly higher than A comparison between all investigated systems for the vector benchmarks is shown in \cref{fig:runtimes_vector}. As can be seen, for both the generic ARM system and the infinite compute system, the usage of \ac{pim} reduces the total runtime significantly. However, when comparing the two \ac{gpu} systems with the infinite compute approach, it can be seen that the runtime of the \acp{gpu} is actually lower, even though the infinite compute approach should be an optimal memory-bound system. -It is important to note that while the simulation considered only one memory channel, the real \acp{gpu} could distribute all operations over all available channels, 16 channels in the case of Vega and 32 channels in the case of Tesla. 
+It is important to note that while the simulation considered only one memory channel, the real \acp{gpu} could distribute all operations over all available channels, i.e., 16 channels in the case of Vega and 32 channels in the case of Tesla. This results in significantly higher overall memory bandwidth for the \ac{gpu} systems. It should also be noted that the \aca{hbm} memory of the Tesla \ac{gpu} is clocked at a slightly higher frequency of $\qty{876}{\mega\hertz}$ than the Vega \ac{gpu} with a frequency of $\qty{800}{\mega\hertz}$ \cite{vega2017,tesla2018}. In addition, while the infinite compute system does not use any computing time, it may still need to stall and wait for memory requests to complete. diff --git a/src/plots/matrix_infinite.tex b/src/plots/matrix_infinite.tex index 91f42c7..57f1570 100644 --- a/src/plots/matrix_infinite.tex +++ b/src/plots/matrix_infinite.tex @@ -7,7 +7,7 @@ ymin=0, ymax=10, ymajorgrids, - ylabel={Relative Performance}, + ylabel={Speedup}, tick pos=left, xtick=data, xticklabels from table={\csv}{level}, diff --git a/src/plots/matrix_normal.tex b/src/plots/matrix_normal.tex index f80e32e..7152b42 100644 --- a/src/plots/matrix_normal.tex +++ b/src/plots/matrix_normal.tex @@ -7,7 +7,7 @@ ymin=0, ymax=80, ymajorgrids, - ylabel={Relative Performance}, + ylabel={Speedup}, tick pos=left, xtick=data, xticklabels from table={\csv}{level}, diff --git a/src/plots/vector_infinite.tex b/src/plots/vector_infinite.tex index 91baebe..1b59fba 100644 --- a/src/plots/vector_infinite.tex +++ b/src/plots/vector_infinite.tex @@ -7,7 +7,7 @@ ymin=0, ymax=25, ymajorgrids, - ylabel={Relative Performance}, + ylabel={Speedup}, tick pos=left, xtick=data, xticklabels from table={\csv}{level}, diff --git a/src/plots/vector_normal.tex b/src/plots/vector_normal.tex index a643618..66a97fb 100644 --- a/src/plots/vector_normal.tex +++ b/src/plots/vector_normal.tex @@ -7,7 +7,7 @@ ymin=0, ymax=25, ymajorgrids, - ylabel={Relative Performance}, + ylabel={Speedup}, 
tick pos=left, xtick=data, xticklabels from table={\csv}{level},
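The channel-count argument in the results chapter can be made concrete with the bus widths and clocks quoted there (2048-bit at $\qty{800}{\mega\hertz}$ for Vega, 4096-bit at $\qty{876}{\mega\hertz}$ for Tesla). As a sketch only — it assumes standard double-data-rate signaling, i.e. two transfers per clock, which is not stated explicitly in the text — the theoretical peak bandwidths work out as follows:

```rust
// Peak HBM bandwidth in GB/s, assuming double-data-rate signaling
// (two transfers per clock cycle) across the full bus width.
fn peak_bw_gb_s(bus_width_bits: u64, clock_mhz: u64) -> f64 {
    (bus_width_bits as f64 / 8.0) * 2.0 * (clock_mhz as f64 * 1e6) / 1e9
}

fn main() {
    let vega = peak_bw_gb_s(2048, 800);  // ~409.6 GB/s over 16 channels
    let tesla = peak_bw_gb_s(4096, 876); // ~897.0 GB/s over 32 channels
    println!("Vega: {vega} GB/s, Tesla: {tesla} GB/s");
}
```

Either figure dwarfs the single simulated channel, which is consistent with the GPUs undercutting even the infinite compute configuration's runtime.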