Conclusion, first version
@@ -5,8 +5,8 @@ In our increasingly data-oriented world, machine learning applications such as \
|
|||||||
An important component of these new systems are \acp*{dnn}.
|
An important component of these new systems are \acp*{dnn}.
|
||||||
Specialized processors such as \acp*{gpu} or \acp*{tpu} were used in the past to accelerate the operation of such \acsp*{dnn}.
|
Specialized processors such as \acp*{gpu} or \acp*{tpu} were used in the past to accelerate the operation of such \acsp*{dnn}.
|
||||||
However, it has become apparent that the performance of \acsp*{dnn} is increasingly limited less by the computing power provided, but rather by the limited memory bandwidth of the \acp*{dram}.
|
However, it has become apparent that the performance of \acsp*{dnn} is increasingly limited less by the computing power provided, but rather by the limited memory bandwidth of the \acp*{dram}.
|
||||||
One possible solution to this problem is the use of \ac*{pim}, i.e. the processing of data directly in memory.
|
One possible solution to this problem is the use of \ac*{pim}, i.e., the processing of data directly in memory.
|
||||||
This paper examines which applications are suitable for the use of \acs*{pim} and what effects on performance can be expected.
|
This paper examines which applications are suitable for the use of \acs*{pim} and what effects on the performance can be expected.
|
||||||
|
|
||||||
\vspace{1.0cm}
|
\vspace{1.0cm}
|
||||||
|
|
||||||
|
|||||||
@@ -332,3 +332,7 @@
|
|||||||
short = LLFF,
|
short = LLFF,
|
||||||
long = Linked List First Fit,
|
long = Linked List First Fit,
|
||||||
}
|
}
|
||||||
|
\DeclareAcronym{hugetlb}{
|
||||||
|
short = HugeTLB,
|
||||||
|
long = huge page table,
|
||||||
|
}
|
||||||
|
|||||||
@@ -1,11 +1,34 @@
|
|||||||
\section{Conclusion and Future Work}
|
\section{Conclusion and Future Work}
|
||||||
\label{sec:conclusion}
|
\label{sec:conclusion}
|
||||||
|
|
||||||
|
In this thesis, the applicability of \ac{pim} was explored with regard to the \ac{dnn} algorithms that are in high demand for \ac{ai} applications.
|
||||||
|
A general overview of different types of \ac{pim} implementations was given, with some concrete implementations highlighted in detail.
|
||||||
|
The \ac{pim} implementation of the major \ac{dram} vendor Samsung, \ac{fimdram}/\aca{fimdram}, was specifically discussed and analyzed.
|
||||||
|
A working \ac{vp} of \aca{fimdram}, in the form of a software model, was developed, making it possible to explore the performance gain of \ac{pim} for a variety of applications in an easy and flexible way.
|
||||||
|
It was found that ... (TODO: insert results here).
|
||||||
|
|
||||||
|
However, there is still room for improvement in both the software model and the comparison methodology, which will be the subject of future work.
|
||||||
|
First, the developed software library and the implemented model are not yet a drop-in replacement for the real \aca{fimdram} implementation, because they rely on a custom communication protocol between the host processor and the \ac{pim} processing units for mode switching and for transferring the microkernels.
|
||||||
|
For this, more detailed information is required from Samsung, as the exact interface of \aca{fimdram} is not described in the published papers \cite{kwon2021,lee2021,kang2022}.
|
||||||
|
To ease the currently error-prone microkernel development process, the software library could help the developer by providing building blocks that assemble the microkernel and simultaneously generate the necessary \ac{ld} and \ac{st} instructions to execute the kernel.
|
||||||
|
In addition, the current bare-metal deployment of the software cannot realistically be used to accelerate real-world \ac{dnn} applications.
|
||||||
|
Instead, \aca{fimdram} should be usable on a Linux system, which would require the integration of the software support library into a Linux device driver.
|
||||||
|
To take into account the special alignment requirements of the \ac{pim} data structures, this device driver must also carefully consider the virtual address translation of the Linux kernel, possibly making use of so-called \acp{hugetlb}, as the alignment requirements exceed the default page size of $\qty{4}{\kilo\byte}$.
|
||||||
|
|
||||||
|
For a better evaluation of the performance gains of \aca{fimdram}, it should be evaluated with real-world \ac{dnn} applications.
|
||||||
|
Effects such as the initialization overhead of \aca{fimdram} can only be evaluated in such an environment.
|
||||||
|
Furthermore, the integration of \aca{fimdram} should be extended to \acp{gpu} or \acp{tpu}, so that the comparison also covers the deployment of real \ac{dnn} applications.
|
||||||
|
|
||||||
|
In conclusion, \ac{pim} is a promising approach to address the future processing needs of \ac{ai} and possibly other applications.
|
||||||
|
Not only does the architecture itself have to be considered, but also the integration of \ac{pim} into the applications at the software level.
|
||||||
|
By overcoming these challenges, \ac{pim} could be part of the solution to increase the performance and energy efficiency of future computing platforms.
|
||||||
|
|
||||||
% what to do better:
|
% what to do better:
|
||||||
% implement Samsung's real mode switching and programming of CRFs
|
% implement Samsung's real mode switching and programming of CRFs
|
||||||
% build an api that guarantees matching LD and ST for the assembled microkernel
|
% build an api that guarantees matching LD and ST for the assembled microkernel
|
||||||
% implement linux kernel driver
|
% implement linux kernel driver
|
||||||
% -> alignment requirements -> huge tables
|
% -> alignment requirements -> huge tables
|
||||||
|
|
||||||
% make use of Samsung's PIM in a real DNN application and measure the effects
|
% make use of Samsung's PIM in a real DNN application and measure the effects
|
||||||
% compare with SIMD insts in ARM
|
% compare with SIMD insts in ARM
|
||||||
% compare with real TPUs and GPU platforms
|
% compare with real TPUs and GPU platforms
|
||||||
|
|||||||
@@ -6,12 +6,12 @@ A key component of these models is the use of \acp{dnn}, which are a type of mac
|
|||||||
Consequently, \acp{dnn} make it possible to tackle many new classes of problems that were previously beyond the reach of conventional algorithms.
|
Consequently, \acp{dnn} make it possible to tackle many new classes of problems that were previously beyond the reach of conventional algorithms.
|
||||||
|
|
||||||
However, the ever-increasing use of these technologies poses new challenges for hardware architectures, as the energy required to train and run these models reaches unprecedented levels.
|
However, the ever-increasing use of these technologies poses new challenges for hardware architectures, as the energy required to train and run these models reaches unprecedented levels.
|
||||||
Recently published numbers approximate that the development and training of Meta's LLaMA model over a period of about 5 months consumed around $\qty{2638}{\mega\watt\hour}$ of electrical energy and caused a total emission of $\qty{1015}{tCO_2eq}$ \cite{touvron2023}.
|
Recently published numbers approximate that the development and training of Meta's LLaMA model over a period of about five months consumed around $\qty{2638}{\mega\watt\hour}$ of electrical energy and caused a total emission of $\qty{1015}{tCO_2eq}$ \cite{touvron2023}.
|
||||||
As these numbers are expected to increase in the future, it is clear that the energy footprint of the current deployment of \ac{ai} applications is not sustainable \cite{blott2023}.
|
As these numbers are expected to increase in the future, it is clear that the energy footprint of the current deployment of \ac{ai} applications is not sustainable \cite{blott2023}.
|
||||||
|
|
||||||
|
|
||||||
In a more general view, the energy demand of computing for new applications continues to grow exponentially, doubling about every two years, while the world's energy production only grows linearly, at about $\qty{2}{\percent}$ per year \cite{src2021}.
|
In a more general view, the energy demand of computing for new applications continues to grow exponentially, doubling about every two years, while the world's energy production only grows linearly, at about $\qty{2}{\percent}$ per year \cite{src2021}.
|
||||||
This dramatic increase in energy consumption is due to the fact that while the energy efficiency of compute processor units has continued to improve, the ever-increasing demand for computing is outpacing this progress.
|
This dramatic increase in energy consumption arises because, although the energy efficiency of compute processor units has continued to improve, the ever-increasing demand for computing is outpacing this progress.
|
||||||
In addition, Moore's Law is slowing down as further device scaling approaches physical limits.
|
In addition, Moore's Law is slowing down as further device scaling approaches physical limits.
|
||||||
|
|
||||||
\begin{figure}[!ht]
|
\begin{figure}[!ht]
|
||||||
@@ -28,7 +28,7 @@ In recent years, domain-specific accelerators, such as \acp{gpu} or \acp{tpu} ha
|
|||||||
However, research must also take into account off-chip memory: moving data between the computation unit and the \ac{dram} is very costly, as fetching operands consumes more power than performing the computation on them.
|
However, research must also take into account off-chip memory: moving data between the computation unit and the \ac{dram} is very costly, as fetching operands consumes more power than performing the computation on them.
|
||||||
While performing a double-precision floating-point operation on a $\qty{28}{\nano\meter}$ technology might consume an energy of about $\qty{20}{\pico\joule}$, fetching the operands from \ac{dram} consumes almost three orders of magnitude more energy at about $\qty{16}{\nano\joule}$ \cite{dally2010}.
|
While performing a double-precision floating-point operation on a $\qty{28}{\nano\meter}$ technology might consume an energy of about $\qty{20}{\pico\joule}$, fetching the operands from \ac{dram} consumes almost three orders of magnitude more energy at about $\qty{16}{\nano\joule}$ \cite{dally2010}.
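As a rough check of the stated ratio, using only the two energy figures quoted above and treating them as order-of-magnitude estimates rather than exact values:
\[
  \frac{\qty{16}{\nano\joule}}{\qty{20}{\pico\joule}}
  = \frac{\qty{16000}{\pico\joule}}{\qty{20}{\pico\joule}}
  = 800 \approx 10^{2.9}
\]
Under these figures, fetching the operands from \ac{dram} costs roughly as much energy as 800 such floating-point operations, which is consistent with the ``almost three orders of magnitude'' stated above.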
|
||||||
|
|
||||||
Furthermore, many types of \ac{dnn} used for language and speech processing, such as \acp{rnn}, \acp{mlp} and some layers of \acp{cnn}, are severely limited by the memory bandwidth that the \ac{dram} can provide, making them \textit{memory-bounded} \cite{he2020}.
|
Furthermore, many types of \acp{dnn} used for language and speech processing, such as \acp{rnn}, \acp{mlp} and some layers of \acp{cnn}, are severely limited by the memory bandwidth that the \ac{dram} can provide, making them \textit{memory-bounded} \cite{he2020}.
|
||||||
In contrast, compute-intensive workloads, such as visual processing, are referred to as \textit{compute-bounded}.
|
In contrast, compute-intensive workloads, such as visual processing, are referred to as \textit{compute-bounded}.
|
||||||
|
|
||||||
\begin{figure}[!ht]
|
\begin{figure}[!ht]
|
||||||
@@ -51,5 +51,5 @@ The remainder of this work is structured as follows:
|
|||||||
In \cref{sec:pim} various types of \ac{pim} architectures are presented, with some concrete examples discussed in detail.
|
In \cref{sec:pim} various types of \ac{pim} architectures are presented, with some concrete examples discussed in detail.
|
||||||
\Cref{sec:vp} is an introduction to virtual prototyping and system-level hardware simulation.
|
\Cref{sec:vp} is an introduction to virtual prototyping and system-level hardware simulation.
|
||||||
After explaining the necessary prerequisites, \cref{sec:implementation} implements a concrete \ac{pim} architecture in software and provides a development library that applications can use to take advantage of in-memory processing.
|
After explaining the necessary prerequisites, \cref{sec:implementation} implements a concrete \ac{pim} architecture in software and provides a development library that applications can use to take advantage of in-memory processing.
|
||||||
\Cref{sec:results} demonstrates the possible performance enhancement of \ac{pim} by simulating a typical neural-network inference.
|
\Cref{sec:results} demonstrates the possible performance enhancement of \ac{pim} by simulating a typical neural network inference.
|
||||||
Finally, \cref{sec:conclusion} summarizes the findings and identifies future improvements in \ac{pim} architectures.
|
Finally, \cref{sec:conclusion} summarizes the findings and identifies future improvements in \ac{pim} architectures.
|
||||||
|
|||||||