\begin{abstract}
\section*{Abstract}

In our increasingly data-oriented world, machine learning applications such as \acp*{llm} for natural language processing are becoming more and more popular.
\Acp*{dnn} are an important component of these new systems.
To accelerate such \acsp*{dnn}, specialized processors such as \acp*{gpu} or \acp*{tpu} are typically used, as they can perform the required arithmetic operations more efficiently than \acp*{cpu}.
However, it turns out that the achievable performance of \acsp*{dnn} is limited less and less by the available computing power and more and more by the finite memory bandwidth of \acp*{dram}.
\section{Introduction}
\label{sec:introduction}

Emerging applications such as \acp{llm}, and especially ChatGPT, are revolutionizing modern computing and changing the way we interact with computing systems.
A key component of these models is the \ac{dnn}, a type of machine learning model inspired by the structure of the human brain:
Composed of multiple layers of interconnected nodes that mimic a network of neurons, \acp{dnn} are used to perform various tasks such as image recognition or natural language and speech processing.
Consequently, \acp{dnn} make it possible to tackle many new classes of problems that were previously beyond the reach of conventional algorithms.

However, the ever-increasing use of these technologies poses new challenges for hardware architectures, as the energy required to train and run these models reaches unprecedented levels.
Recently published numbers approximate that the development and training of Meta's LLaMA model over a period of about five months consumed around $\qty{2638}{\mega\watt\hour}$ of electrical energy and caused a total emission of $\qty{1015}{tCO_2eq}$ \cite{touvron2023}.
As these numbers are expected to increase in the future, it is clear that the energy footprint of the current deployment of \ac{ai} applications is not sustainable \cite{blott2023}.

In a more general view, the energy demand of computing for new applications continues to grow exponentially, doubling about every two years, while the world's energy production grows only linearly, at about $\qty{2}{\percent}$ per year \cite{src2021}, as shown in \cref{plt:enery_chart}.
This drastic increase in energy consumption is due to the fact that, although the energy efficiency of computing units has continuously improved, the ever-increasing demand for computing power outpaces this progress.
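The gap between the two growth rates can be illustrated with a short sketch; the starting values below are hypothetical placeholders chosen for illustration, not figures from the cited report:

```python
# Sketch of the two growth models described above (illustrative values only).
def compute_energy(years, e0=1.0):
    """Energy demand of computing: doubling about every two years."""
    return e0 * 2 ** (years / 2)

def produced_energy(years, p0=1.0):
    """World energy production: linear growth at ~2% of today's output per year."""
    return p0 * (1 + 0.02 * years)

# Even if production starts a millionfold above the demand of computing
# (a hypothetical head start), the exponential curve catches up in decades:
crossover = next(t for t in range(200)
                 if compute_energy(t) >= produced_energy(t, p0=1e6))
print(crossover)  # → 42
```

The effective annual growth implied by a two-year doubling is about $\qty{41}{\percent}$, which is why even a large head start erodes within a few decades.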

In addition, Moore's Law is slowing down as further device scaling approaches physical limits.

\begin{figure}[!ht]
\label{plt:enery_chart}
\end{figure}

The exponential increase in compute energy will eventually be constrained by market dynamics, flattening the energy curve and making it impossible to meet future computing demands.
It is therefore necessary to achieve radical improvements in the energy efficiency of computing systems in order to avoid such a scenario.

In recent years, domain-specific accelerators such as \acp{gpu} or \acp{tpu} have become very popular, as they provide orders of magnitude higher performance and energy efficiency for the training and inference of \ac{ai} applications than general-purpose processors \cite{kwon2021}.
However, research must also take the off-chip memory into account: moving data between the processor and the \ac{dram} is very costly, as fetching the operands consumes far more energy than the computation itself.
While performing a double precision floating point operation in a $\qty{28}{\nano\meter}$ technology might consume an energy of about $\qty{20}{\pico\joule}$, fetching the operands from \ac{dram} consumes almost three orders of magnitude more energy, at about $\qty{16}{\nano\joule}$ \cite{dally2010}.
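A quick back-of-the-envelope check of these two figures makes the gap explicit:

```python
# Energy figures cited above for a 28 nm technology (order-of-magnitude values).
op_energy = 20e-12    # double-precision floating point operation: ~20 pJ
fetch_energy = 16e-9  # fetching the operands from DRAM: ~16 nJ

ratio = fetch_energy / op_energy
print(round(ratio))  # → 800, i.e. almost three orders of magnitude
```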

Furthermore, many types of \acp{dnn} used for language and speech processing, such as \acp{rnn}, \acp{mlp} and some layers of \acp{cnn}, are severely limited by the memory bandwidth that the \ac{dram} can provide, making them \textit{memory-bound} \cite{he2020}.
In contrast, compute-intensive workloads, such as visual processing, are referred to as \textit{compute-bound}.
\Cref{plt:roofline} shows the so-called roofline model for two generations of OpenAI's GPT models, highlighting the compute-bound and memory-bound regions.
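The roofline model itself is simple enough to sketch in a few lines; the peak compute rate and memory bandwidth below are hypothetical placeholder values, not the parameters of any real accelerator:

```python
PEAK_FLOPS = 10e12  # hypothetical compute roof: 10 TFLOP/s
BANDWIDTH = 1e12    # hypothetical DRAM bandwidth: 1 TB/s

def attainable_flops(intensity):
    """Attainable FLOP/s for a kernel with `intensity` FLOP per byte:
    the minimum of the compute roof and the bandwidth-limited slope."""
    return min(PEAK_FLOPS, BANDWIDTH * intensity)

# The ridge point PEAK_FLOPS / BANDWIDTH (10 FLOP/byte here) separates the
# regions: kernels below it are memory-bound, kernels above it compute-bound.
print(attainable_flops(2))   # memory-bound: limited to 2e12 FLOP/s
print(attainable_flops(50))  # compute-bound: capped at the 1e13 FLOP/s roof
```

Memory-bound kernels such as the \ac{dnn} layers mentioned above sit left of the ridge point, where only more bandwidth, not more compute, raises attainable performance.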

\begin{figure}[!ht]
\centering
\end{figure}

In the past, specialized types of \ac{dram} such as \ac{hbm} have been able to meet the high bandwidth requirements.
However, recent \ac{ai} technologies require even greater bandwidth than \ac{hbm} can provide \cite{kwon2021}.

Overall, new approaches to computing are needed to meet the demand for higher-performance, more energy-efficient computing systems, which are increasingly becoming memory-bound.
This has led researchers to reconsider past \ac{pim} architectures and advance them further \cite{lee2021}.
\Ac{pim} integrates computational logic into the memory itself to exploit the minimal data movement cost and the extensive internal data parallelism \cite{sudarshan2022}, making it well-suited for memory-bound problems.

This thesis analyzes various classes of \ac{pim} architectures and identifies the challenges of integrating them into state-of-the-art \acp{dram}.
In particular, the real-world \ac{pim} implementation of the major \ac{dram} manufacturer Samsung, \ac{fimdram}, is discussed in great detail.
The special memory layout required for the input and output operands is analyzed, which allows the integrated \ac{pim} processing units to correctly execute the specific arithmetic algorithms.
Furthermore, a \ac{vp} of \aca{fimdram} is developed and integrated into the \aca{hbm} model of the memory simulator DRAMSys.
To make use of the \ac{pim} model, a software library is implemented that takes care of the communication between the host processor and the \ac{pim} processing units, provides data structures for the operand data, and defines functions to execute a programmed \ac{pim} kernel directly in memory.
Finally, the gem5 simulation platform is used to build various user programs that make use of the software support library and implement a number of workloads that are accelerated using \ac{pim}.

The remainder of this work is structured as follows:
\Cref{sec:dram} gives a brief overview of the architecture of \acp{dram}, in particular that of \aca{hbm}.
In \cref{sec:pim}, various types of \ac{pim} architectures are presented, with some concrete examples discussed in detail.
\Cref{sec:vp} is an introduction to virtual prototyping and system-level hardware simulation.
After explaining the necessary prerequisites, \cref{sec:implementation} implements the \ac{pim} model in software and provides a development library that applications can use to take advantage of the in-memory processing.
\Cref{sec:results} demonstrates the possible performance enhancements of \ac{pim} by simulating various microbenchmarks and a simplified neural network inference.
Finally, \cref{sec:conclusion} summarizes the findings and identifies future improvements in the \ac{pim} model and software stack.
xtick={2010,2020,2030,2040,2050},
ytick={1e16,1e18,1e20,1e22},
xticklabel style={/pgf/number format/1000 sep=},
xlabel={Year},
ylabel={Compute Energy in $\si{\joule\per Year}$},
xmin=2010,
xmax=2050,