Conclusion, first version

2024-02-17 10:58:47 +01:00
parent 0bdbd2ddc9
commit 779461b515
4 changed files with 33 additions and 6 deletions

@@ -5,8 +5,8 @@ In our increasingly data-oriented world, machine learning applications such as \
 An important component of these new systems are \acp*{dnn}.
 Specialized processors such as \acp*{gpu} or \acp*{tpu} were used in the past to accelerate the operation of such \acsp*{dnn}.
 However, it has become apparent that the performance of \acsp*{dnn} is increasingly limited less by the computing power provided, but rather by the limited memory bandwidth of the \acp*{dram}.
-One possible solution to this problem is the use of \ac*{pim}, i.e. the processing of data directly in memory.
-This paper examines which applications are suitable for the use of \acs*{pim} and what effects on performance can be expected.
+One possible solution to this problem is the use of \ac*{pim}, i.e., the processing of data directly in memory.
+This paper examines which applications are suitable for the use of \acs*{pim} and what effects on the performance can be expected.
 \vspace{1.0cm}

@@ -332,3 +332,7 @@
 short = LLFF,
 long = Linked List First Fit,
 }
+\DeclareAcronym{hugetlb}{
+short = HugeTLB,
+long = huge page table,
+}

@@ -1,11 +1,34 @@
 \section{Conclusion and Future Work}
 \label{sec:conclusion}
+In this thesis, the applicability of \ac{pim} was explored, with a focus on the \ac{dnn} algorithms that are in high demand for \ac{ai} applications.
+A general overview of different types of \ac{pim} implementations was given, with some concrete implementations highlighted in detail.
+The \ac{pim} implementation of the major \ac{dram} vendor Samsung, \ac{fimdram}/\aca{fimdram}, was specifically discussed and analyzed.
+A working \ac{vp} of \aca{fimdram}, in the form of a software model, was developed, making it possible to explore the performance gain of \ac{pim} for various applications in an easy and flexible way.
+It was found that ... (TODO: insert results here).
+However, there is still room for improvement in both the software model and the comparison methodology, which will be the subject of future work.
+First, the developed software library and the implemented model are not yet a drop-in replacement for the real \aca{fimdram} implementation, because they rely on a custom communication protocol between the host processor and the \ac{pim} processing units for mode switching and for transferring the microkernels.
+To close this gap, more detailed information is required from Samsung, as the exact interface of \aca{fimdram} is not described in the published papers \cite{kwon2021,lee2021,kang2022}.
+To ease the currently error-prone microkernel development process, the software library could provide building blocks that assemble the microkernel and simultaneously generate the matching \ac{ld} and \ac{st} instructions needed to execute it.
+In addition, the current bare-metal deployment of the software cannot realistically be used to accelerate real-world \ac{dnn} applications.
+Instead, \aca{fimdram} should be usable on a Linux system, which would require integrating the software support library into a Linux device driver.
+To satisfy the special alignment requirements of the \ac{pim} data structures, this device driver must also carefully consider the virtual address translation of the Linux kernel, possibly making use of so-called \acp{hugetlb}, as the alignment requirements exceed the default page size of $\qty{4}{\kilo\byte}$.
+To better assess the performance gains of \aca{fimdram}, it should be evaluated with real-world \ac{dnn} applications.
+Effects such as the initialization overhead of \aca{fimdram} can only be measured in such an environment.
+Furthermore, \aca{fimdram} should be integrated with \acp{gpu} or \acp{tpu}, so that the comparison can cover the platforms on which real \ac{dnn} applications are deployed.
+In conclusion, \ac{pim} is a promising approach to address the future processing needs of \ac{ai} and possibly other applications.
+Not only does the architecture itself have to be considered, but also the integration of \ac{pim} into applications at the software level.
+By overcoming these challenges, \ac{pim} could become part of the solution for increasing the performance and energy efficiency of future computing platforms.
 % what to do better:
 % implement Samsung's real mode switching and programming of crfs
 % build an api that guarantees matching LD and ST for the assembled microkernel
 % implement linux kernel driver
 % -> alignment requirements -> huge tables
 % make use of Samsung pim in a real dnn application and measure the effects
 % compare with SIMD insts in ARM
 % compare with real TPUs and GPU platforms

@@ -6,12 +6,12 @@ A key component of these models is the use of \acp{dnn}, which are a type of mac
 Consequently, \acp{dnn} make it possible to tackle many new classes of problems that were previously beyond the reach of conventional algorithms.
 However, the ever-increasing use of these technologies poses new challenges for hardware architectures, as the energy required to train and run these models reaches unprecedented levels.
-Recently published numbers approximate that the development and training of Meta's LLaMA model over a period of about 5 months consumed around $\qty{2638}{\mega\watt\hour}$ of electrical energy and caused a total emission of $\qty{1015}{tCO_2eq}$ \cite{touvron2023}.
+Recently published numbers approximate that the development and training of Meta's LLaMA model over a period of about five months consumed around $\qty{2638}{\mega\watt\hour}$ of electrical energy and caused a total emission of $\qty{1015}{tCO_2eq}$ \cite{touvron2023}.
 As these numbers are expected to increase in the future, it is clear that the energy footprint of current deployment of \ac{ai} applications is not sustainable \cite{blott2023}.
 In a more general view, the energy demand of computing for new applications continues to grow exponentially, doubling about every two years, while the world's energy production only grows linearly, at about $\qty{2}{\percent}$ per year \cite{src2021}.
-This dramatic increase in energy consumption is due to the fact that while the energy efficiency of compute processor units has continued to improve, the ever-increasing demand for computing is outpacing this progress.
+This dramatic increase in energy consumption is due to the fact that while the energy efficiency of compute processor units has continued to improve, the ever-increasing demand for computing however is outpacing this progress.
 In addition, Moore's Law is slowing down as further device scaling approaches physical limits.
 \begin{figure}[!ht]
@@ -28,7 +28,7 @@ In recent years, domain-specific accelerators, such as \acp{gpu} or \acp{tpu} ha
 However, research must also take into account off-chip memory - moving data between the computation unit and the \ac{dram} is very costly, as fetching operands consumes more power than performing the computation on them itself.
 While performing a double precision floating point operation on a $\qty{28}{\nano\meter}$ technology might consume an energy of about $\qty{20}{\pico\joule}$, fetching the operands from \ac{dram} consumes almost 3 orders of magnitude more energy at about $\qty{16}{\nano\joule}$ \cite{dally2010}.
-Furthermore, many types of \ac{dnn} used for language and speech processing, such as \acp{rnn}, \acp{mlp} and some layers of \acp{cnn}, are severely limited by the memory bandwidth that the \ac{dram} can provide, making them \textit{memory-bounded} \cite{he2020}.
+Furthermore, many types of \acp{dnn} used for language and speech processing, such as \acp{rnn}, \acp{mlp} and some layers of \acp{cnn}, are severely limited by the memory bandwidth that the \ac{dram} can provide, making them \textit{memory-bounded} \cite{he2020}.
 In contrast, compute-intensive workloads, such as visual processing, are referred to as \textit{compute-bounded}.
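As a quick sanity check of the ``almost 3 orders of magnitude'' figure, the ratio can be worked out from the two energy values quoted above from \cite{dally2010}. This is only a back-of-the-envelope sketch; the symbols $E_\mathrm{DRAM}$ and $E_\mathrm{FLOP}$ are illustrative names that do not appear in the original text, and it assumes both values refer to the same operation and its operands:
\[
    \frac{E_\mathrm{DRAM}}{E_\mathrm{FLOP}}
    = \frac{\qty{16}{\nano\joule}}{\qty{20}{\pico\joule}}
    = \frac{\qty{16000}{\pico\joule}}{\qty{20}{\pico\joule}}
    = 800 \approx 10^{2.9}
\]
Under these assumptions, fetching the operands costs roughly 800 times the energy of the computation itself, which is consistent with the stated order of magnitude.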
 \begin{figure}[!ht]
@@ -51,5 +51,5 @@ The remainder of this work is structured as follows:
 In \cref{sec:pim} various types of \ac{pim} architectures are presented, with some concrete examples discussed in detail.
 \cref{sec:vp} is an introduction to virtual prototyping and system-level hardware simulation.
 After explaining the necessary prerequisites, \cref{sec:implementation} implements a concrete \ac{pim} architecture in software and provides a development library that applications can use to take advantage of in-memory processing.
-The \cref{sec:results} demonstrates the possible performance enhancement of \ac{pim} by simulating a typical neural-network inference.
+The \cref{sec:results} demonstrates the possible performance enhancement of \ac{pim} by simulating a typical neural network inference.
 Finally, \cref{sec:conclusion} concludes the findings and identifies future improvements in \ac{pim} architectures.