diff --git a/src/abstract.tex b/src/abstract.tex
index e7560c7..555e5f2 100644
--- a/src/abstract.tex
+++ b/src/abstract.tex
@@ -1,7 +1,7 @@
 \begin{abstract}
 \section*{Abstract}
-In our increasingly data-oriented world, machine learning applications such as \acp*{llm} for natural language processing are becoming more and more important.
+In our increasingly data-oriented world, machine learning applications such as \acp*{llm} for natural language processing are becoming more and more popular.
 An important component of these new systems are \acp*{dnn}.
 To accelerate such \acsp*{dnn}, specialized processors such as \acp*{gpu} or \acp*{tpu} are mainly used, which can perform the required arithmetic operations more efficiently than \acp*{cpu}.
 However, it turns out that the achievable performance of \acsp*{dnn} is less and less limited by the available computing power and more and more by the finite memory bandwidth of \acp*{dram}.
diff --git a/src/chapters/introduction.tex b/src/chapters/introduction.tex
index aacbc81..85436ed 100644
--- a/src/chapters/introduction.tex
+++ b/src/chapters/introduction.tex
@@ -1,17 +1,17 @@
 \section{Introduction}
 \label{sec:introduction}
-Emerging applications such as \acp{llm} revolutionize modern computing and fundamentally change how we interact with computing systems.
-A key component of these models is the use of \acp{dnn}, which are a type of machine learning model inspired by the structure of the human brain - composed of multiple layers of interconnected nodes that mimic a network of neurons, \acp{dnn} are utilized to perform various tasks such as image recognition or natural language and speech processing.
+Emerging applications such as \acp{llm}, most prominently ChatGPT, are revolutionizing modern computing and are changing the way we interact with computing systems.
+A key component of these models is the \ac{dnn}, a type of machine learning model inspired by the structure of the human brain:
+Composed of multiple layers of interconnected nodes that mimic a network of neurons, \acp{dnn} are used to perform various tasks such as image recognition or natural language and speech processing.
 Consequently, \acp{dnn} make it possible to tackle many new classes of problems that were previously beyond the reach of conventional algorithms.

 However, the ever-increasing use of these technologies poses new challenges for hardware architectures, as the energy required to train and run these models reaches unprecedented levels.
 Recently published numbers approximate that the development and training of Meta's LLaMA model over a period of about five months consumed around $\qty{2638}{\mega\watt\hour}$ of electrical energy and caused a total emission of $\qty{1015}{tCO_2eq}$ \cite{touvron2023}.
 As these numbers are expected to increase in the future, it is clear that the energy footprint of current deployment of \ac{ai} applications is not sustainable \cite{blott2023}.
-
-In a more general view, the energy demand of computing for new applications continues to grow exponentially, doubling about every two years, while the world's energy production only grows linearly, at about $\qty{2}{\percent}$ per year \cite{src2021}.
-This dramatic increase in energy consumption is due to the fact that while the energy efficiency of compute processor units has continued to improve, the ever-increasing demand for computing however is outpacing this progress.
+In a more general view, the energy demand of computing for new applications continues to grow exponentially, doubling about every two years, while the world's energy production only grows linearly, at about $\qty{2}{\percent}$ per year \cite{src2021}, as shown in \cref{plt:enery_chart}.
+This drastic increase in energy consumption is due to the fact that although the energy efficiency of computing units has continuously improved, the ever-increasing demand for computing power outpaces this progress.
 In addition, Moore's Law is slowing down as further device scaling approaches physical limits.

 \begin{figure}[!ht]
@@ -21,15 +21,16 @@ In addition, Moore's Law is slowing down as further device scaling approaches ph
 \label{plt:enery_chart}
 \end{figure}

-The exponential grow in compute energy will eventually be constrained by market dynamics, flattening the energy curve and making it impossible to meet future computing demands.
-It is therefore required to achieve radical improvements in energy efficiency in order to avoid such a scenario.
+The exponential increase in compute energy will eventually be constrained by market dynamics, flattening the energy curve and making it impossible to meet future computing demands.
+Radical improvements in the energy efficiency of computing systems are therefore required to avoid such a scenario.

-In recent years, domain-specific accelerators, such as \acp{gpu} or \acp{tpu} have become very popular, as they provide orders of magnitude higher performance and energy efficiency for \ac{ai} applications than general-purpose processors \cite{kwon2021}.
-However, research must also take into account off-chip memory - moving data between the computation unit and the \ac{dram} is very costly, as fetching operands consumes more power than performing the computation on them itself.
+In recent years, domain-specific accelerators, such as \acp{gpu} or \acp{tpu}, have become very popular, as they provide orders of magnitude higher performance and energy efficiency for the training and inference of \ac{ai} applications than general-purpose processors \cite{kwon2021}.
+However, research must also take into account the off-chip memory, since moving data between the processor and the \ac{dram} is very costly: fetching the operands consumes more power than performing the computation on them.
 While performing a double precision floating point operation on a $\qty{28}{\nano\meter}$ technology might consume an energy of about $\qty{20}{\pico\joule}$, fetching the operands from \ac{dram} consumes almost 3 orders of magnitude more energy at about $\qty{16}{\nano\joule}$ \cite{dally2010}.
 Furthermore, many types of \acp{dnn} used for language and speech processing, such as \acp{rnn}, \acp{mlp} and some layers of \acp{cnn}, are severely limited by the memory bandwidth that the \ac{dram} can provide, making them \textit{memory-bound} \cite{he2020}.
 In contrast, compute-intensive workloads, such as visual processing, are referred to as \textit{compute-bound}.
+\Cref{plt:roofline} shows the so-called roofline model for two generations of OpenAI's GPT models, highlighting the compute-bound and memory-bound regions.

 \begin{figure}[!ht]
 \centering
@@ -39,17 +40,23 @@ In contrast, compute-intensive workloads, such as visual processing, are referre
 \end{figure}

 In the past, specialized types of \ac{dram} such as \ac{hbm} have been able to meet the high bandwidth requirements.
-However, recent \ac{ai} technologies require even greater bandwidth than \ac{hbm} can provide \cite{kwon2021}.
+However, recent \ac{ai} technologies require even greater bandwidths than \ac{hbm} can provide \cite{kwon2021}.

-All things considered, to meet the need for more energy-efficient computing systems, which are increasingly becoming memory-bound, new approaches to computing are required.
+Overall, new approaches to computing are needed to meet the demand for more performant and energy-efficient computing systems.
 This has led researchers to reconsider past \ac{pim} architectures and advance them further \cite{lee2021}.
-\Ac{pim} integrates computational logic into the \ac{dram} itself, to exploit minimal data movement cost and extensive internal data parallelism \cite{sudarshan2022}, making it a good fit for memory-bound problems.
+\Ac{pim} integrates computational logic into the memory itself to exploit the minimal data movement cost and the extensive internal data parallelism \cite{sudarshan2022}, making it well-suited for memory-bound problems.
+
+This thesis analyzes various classes of \ac{pim} architectures and identifies the challenges of integrating them into state-of-the-art \acp{dram}.
+In particular, \ac{fimdram}, the real-world \ac{pim} implementation of the major \ac{dram} manufacturer Samsung, is discussed in great detail.
+The special memory layout required for the input and output operands is analyzed, since the integrated \ac{pim} processing units can only execute their arithmetic algorithms correctly if the data is arranged accordingly.
+Furthermore, a \ac{vp} of \aca{fimdram} is developed and integrated into the \aca{hbm} model of the memory simulator DRAMSys.
+To make use of the \ac{pim} model, a software library is implemented that handles the communication between the host processor and the \ac{pim} processing units, provides data structures for the operand data, and defines functions to execute a programmed \ac{pim} kernel directly in memory.
+Finally, the gem5 simulation platform is used to build various user programs that use the software support library and implement a number of workloads that are accelerated with \ac{pim}.
-This work analyzes various \ac{pim} architectures, identifies the challenges of integrating them into state-of-the-art \acp{dram}, examines the changes required in the way applications lay out their data in memory and explores a \ac{pim} implementation from one of the leading \ac{dram} vendors.
 The remainder of this work is structured as follows:
-\cref{sec:dram} gives a brief overview of the architecture of \acp{dram}, in detail that of \ac{hbm}.
+\Cref{sec:dram} gives a brief overview of the architecture of \acp{dram}, in particular that of \aca{hbm}.
 In \cref{sec:pim} various types of \ac{pim} architectures are presented, with some concrete examples discussed in detail.
-\cref{sec:vp} is an introduction to virtual prototyping and system-level hardware simulation.
-After explaining the necessary prerequisites, \cref{sec:implementation} implements a concrete \ac{pim} architecture in software and provides a development library that applications can use to take advantage of in-memory processing.
-The \cref{sec:results} demonstrates the possible performance enhancement of \ac{pim} by simulating a typical neural network inference.
-Finally, \cref{sec:conclusion} concludes the findings and identifies future improvements in \ac{pim} architectures.
+\Cref{sec:vp} is an introduction to virtual prototyping and system-level hardware simulation.
+After explaining the necessary prerequisites, \cref{sec:implementation} describes the implementation of the \ac{pim} model in software and presents a development library that applications can use to take advantage of the in-memory processing.
+\Cref{sec:results} demonstrates the possible performance enhancements of \ac{pim} by simulating various microbenchmarks and a simplified neural network inference.
+Finally, \cref{sec:conclusion} summarizes the findings and identifies future improvements in the \ac{pim} model and software stack.
diff --git a/src/plots/energy_chart.tex b/src/plots/energy_chart.tex
index 12da59d..2a9b57d 100644
--- a/src/plots/energy_chart.tex
+++ b/src/plots/energy_chart.tex
@@ -17,6 +17,7 @@
 xtick={2010,2020,2030,2040,2050},
 ytick={1e16,1e18,1e20,1e22},
 xticklabel style={/pgf/number format/1000 sep=},
+xlabel={Year},
 ylabel={Compute Energy in $\si{\joule\per Year}$},
 xmin=2010,
 xmax=2050,
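The roofline model referenced in the introduction admits a compact formulation that makes the memory-bound/compute-bound distinction precise. A minimal LaTeX sketch under assumed notation ($P_{\text{peak}}$ for peak compute throughput, $\beta$ for \ac{dram} bandwidth, $I$ for operational intensity; these symbols are illustrative and not taken from the thesis):

```latex
% Attainable performance P of a kernel with operational intensity I
% (FLOP per byte moved to/from DRAM), assuming a peak compute
% throughput P_peak (FLOP/s) and a peak memory bandwidth beta (B/s):
\begin{equation*}
    P(I) = \min\bigl(P_{\text{peak}},\; \beta \cdot I\bigr)
\end{equation*}
% Below the ridge point I_r = P_peak / beta the kernel is
% memory-bound (P = beta * I); above it, compute-bound (P = P_peak).
```

Under this formulation, \acp{rnn} and \acp{mlp} fall in the low-intensity region where performance scales with $\beta$ alone, which is exactly the regime \ac{pim} targets by raising the effective internal bandwidth.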