Complete the result chapter and conclusion

This commit is contained in:
2024-03-08 18:48:22 +01:00
parent 4074a60f43
commit f0956a6246
9 changed files with 82 additions and 53 deletions


@@ -4,27 +4,38 @@
In this thesis, the applicability of \ac{pim} was explored, taking into account the highly demanded \ac{dnn} algorithms for \ac{ai} applications.
A general overview of different types of \ac{pim} implementations was given, with some concrete implementations highlighted in detail.
The \ac{pim} implementation of the major \ac{dram} vendor Samsung, \ac{fimdram}/\aca{fimdram}, was specifically discussed and analyzed.
A working \ac{vp} of \aca{fimdram}, in the form of a software model, has been developed, as well as a software support library to enable the use of the \aca{fimdram} processing units from a user application.
This made it possible to explore the performance gain of \ac{pim} for different workloads in a simple and flexible way.
It was found that \ac{pim} can provide a speedup of up to $\qty{23.9}{\times}$ for level 1 \ac{blas} vector operations and up to $\qty{62.5}{\times}$ for level 2 \ac{blas} operations.
While these results may not strictly represent a real-world system, an achievable upper bound of speedups of $\qty{17.6}{\times}$ and $\qty{9.0}{\times}$ could be determined using a hypothetical infinite compute system.
The achieved speedup of $\qty{9.0}{\times}$ for the \ac{gemv} routine closely matches the figure of about $\qty{8.3}{\times}$ reported for Samsung's real-world implementation of \aca{fimdram}.
In addition to the numbers presented by Samsung, the same simulation workloads were run on two real \ac{gpu} systems, both with \aca{hbm}, and their runtimes were compared.
However, there is still room for improvement in the software model and the comparison methodology, which will be the subject of future work.
Firstly, the developed software library and the implemented model are not yet a drop-in replacement for the real \aca{fimdram} implementation due to the custom communication protocol between the host processor and the \ac{pim} processing units, which is used to implement the mode switching and the transfer of the microkernels.
For this, more detailed information is required from Samsung, as the exact interface of \aca{fimdram} is not described in the published papers \cite{kwon2021}, \cite{lee2021} and \cite{kang2022}.
To ease the currently error-prone microkernel development process, the software library could help the developer by providing building blocks that assemble the microkernel and simultaneously generate the necessary \ac{ld} and \ac{st} instructions to execute the kernel.
The current bare-metal deployment of the user application cannot realistically be used to accelerate complex real-world \ac{dnn} applications.
Instead, it should be possible to use \aca{fimdram} on a Linux system, which would require the integration of the software support library into a Linux device driver.
To take into account the special alignment requirements of the \ac{pim} data structures, this device driver must also carefully consider the virtual address translation of the Linux kernel, possibly making use of so-called \acp{hugetlb}, as the alignment requirements exceed the default page size of $\qty{4}{\kibi\byte}$.
For a better evaluation of the performance gains of \aca{fimdram}, it should then be compared with real-world \ac{dnn} applications.
Effects such as the initialization overhead of \aca{fimdram} can only be realistically evaluated in such an environment.
Furthermore, the support software implementation for \aca{fimdram} should be extended to execute on the provided \ac{gpu} model of gem5, so that the comparison can be extended to the deployment of real \ac{dnn} applications.
This would provide a considerably better basis for analyzing the effects of \ac{pim} on real applications running on representative hardware models.
Further research could also investigate whether the library-based approach of leveraging \ac{pim} could be replaced by a compiler-based approach.
A special compiler extension would be able to generate the necessary \ac{ld} and \ac{st} instructions by analyzing the data types of the operands and the provided arithmetic operation.
This extension could also make use of so-called non-temporal instructions, which bypass the cache hierarchy on a per-instruction basis instead of preallocating the entire \ac{pim}-enabled memory as non-cacheable.
In addition to the performance comparison, further research should also model and compare the power efficiency gain of \ac{pim} to the non-\ac{pim} case.
Since \ac{pim} not only provides a shorter computation time per operation, but also does not actually drive the memory data bus during operation, it promises good improvements in this area.
However, this would require a detailed performance model of both \aca{hbm} and \aca{fimdram}.
In conclusion, \ac{pim} is a promising approach to address the future processing and power efficiency needs of \ac{ai} and possibly other applications.
Research needs to consider not only the architecture itself, but also the integration of \ac{pim} into applications at the software level.
By overcoming these challenges, \ac{pim} could be part of the solution to increase the performance and energy efficiency of future computing platforms.
% what to do better:


@@ -128,6 +128,6 @@ In the center of the die, the \acp{tsv} connect the die to the next die above it
\end{figure}
% still, bandwidth requirements of new AI applications are not met by HBM2
Although \aca{hbm} provides a high amount of bandwidth, many modern \ac{dnn} applications remain memory-bound.
While one approach would be to further increase the bandwidth by integrating more stacks on the silicon interposer, other constraints such as thermal limits or the limited number of \ac{io} connections on the interposer may make this impractical \cite{lee2021}.
Another approach could be \acf{pim}: Using \ac{hbm}'s 2.5D architecture, it is possible to incorporate additional compute units directly into the memory stacks, increasing the achievable parallel bandwidth and reducing the burden of transferring all the data to the host processor for performing operations on it.


@@ -6,7 +6,8 @@ To implement \aca{fimdram} in \aca{hbm}, the \ac{dram} model of DRAMSys has to b
They also need to be provided with the burst data from the \acp{ssa} as well as the burst address to calculate the register indices in the \ac{aam} operation mode.
However, no changes are required in the frontend or backend of DRAMSys, and, as already described in \cref{sec:pim_fim}, no changes are required in the memory controller either.
In addition, since a single \ac{dram} \ac{rd} or \ac{wr} command triggers the execution of a single microkernel instruction, the processing unit is fully synchronized with the read and write operations of the \ac{dram}.
As a result, the \aca{fimdram} model itself does not need to model any timing behavior:
Its submodel is essentially untimed, since it is already synchronized with the operation of the \ac{dram} model of DRAMSys.
This leads to a significantly simplified model, since the internal pipeline stages of \aca{fimdram} do not need to be modeled, but only the externally visible functional behavior of a processing unit.
While \aca{fimdram} operates in the default \ac{sb} mode, it behaves exactly like a normal \aca{hbm} memory.
@@ -21,7 +22,7 @@ With more information from Samsung on how the actual mechanism is implemented, t
When entering \ac{ab} mode, the \ac{dram} model ignores the specific bank address of incoming \ac{wr} commands and internally performs the write operation for either all even or all odd banks of the \ac{pch}, depending on the parity of the original bank index.
This mode can be used by the host to initialize the input vector chunk interleaving as described in \cref{sec:memory_layout}, or to initialize the \ac{crf} of the processing unit with the microkernel, which should be the same for all operating banks.
After the transition to the \ac{ab} mode, the \ac{dram} can further transition to the \ac{abp} mode, which allows the execution of instructions in the processing units.
The \ac{abp} mode is similar to the \ac{ab} mode in that it also ignores the concrete bank address except for its parity, while additionally passing the column and row address and, in the case of a read, also the respective fetched bank data to the processing units.
In the case of a write access, the output of the processing unit is written directly into the corresponding bank, ignoring the actual data of the transaction object coming from the host processor.
This is equivalent to the real \aca{fimdram} implementation, where the global \ac{io} bus of the memory is not actually driven, and all data movement is done internally in the banks.


@@ -28,8 +28,8 @@ In recent years, domain-specific accelerators, such as \acp{gpu} or \acp{tpu} ha
However, research must also take into account off-chip memory: moving data between the computation unit and the \ac{dram} is very costly, as fetching operands consumes more power than performing the computation on them itself.
While performing a double precision floating point operation on a $\qty{28}{\nano\meter}$ technology might consume an energy of about $\qty{20}{\pico\joule}$, fetching the operands from \ac{dram} consumes almost 3 orders of magnitude more energy at about $\qty{16}{\nano\joule}$ \cite{dally2010}.
Furthermore, many types of \acp{dnn} used for language and speech processing, such as \acp{rnn}, \acp{mlp} and some layers of \acp{cnn}, are severely limited by the memory bandwidth that the \ac{dram} can provide, making them \textit{memory-bound} \cite{he2020}.
In contrast, compute-intensive workloads, such as visual processing, are referred to as \textit{compute-bound}.
\begin{figure}[!ht]
\centering
@@ -41,9 +41,9 @@ In contrast, compute-intensive workloads, such as visual processing, are referre
In the past, specialized types of \ac{dram} such as \ac{hbm} have been able to meet the high bandwidth requirements.
However, recent \ac{ai} technologies require even greater bandwidth than \ac{hbm} can provide \cite{kwon2021}.
All things considered, to meet the need for more energy-efficient computing systems, which are increasingly becoming memory-bound, new approaches to computing are required.
This has led researchers to reconsider past \ac{pim} architectures and advance them further \cite{lee2021}.
\Ac{pim} integrates computational logic into the \ac{dram} itself, to exploit minimal data movement cost and extensive internal data parallelism \cite{sudarshan2022}, making it a good fit for memory-bound problems.
This work analyzes various \ac{pim} architectures, identifies the challenges of integrating them into state-of-the-art \acp{dram}, examines the changes required in the way applications lay out their data in memory and explores a \ac{pim} implementation from one of the leading \ac{dram} vendors.
The remainder of this work is structured as follows:


@@ -151,7 +151,7 @@ The 16-wide \ac{simd} units correspond to the 256-bit prefetch architecture of \
As all \ac{pim} units operate in parallel, with 16 banks per \ac{pch}, a singular memory access loads a total of $\qty{256}{\bit}\cdot\qty{8}{processing\ units}=\qty{2048}{\bit}$ into the \acp{fpu}.
As a result, the theoretical internal bandwidth of \aca{fimdram} is $\qty{8}{\times}$ higher than the external bus bandwidth to the host processor.
\Aca{fimdram} defines three operating modes:
\begin{enumerate}
\item \textbf{\Ac{sb} Mode}:
This is the default operating mode, where \aca{fimdram} has identical behavior to normal \aca{hbm} memory.
@@ -165,13 +165,13 @@ As a result, the theoretical internal bandwidth of \aca{fimdram} is $\qty{8}{\ti
In addition, the \ac{io} circuits of the \ac{dram} are completely disabled in this mode, reducing the power required during \ac{pim} operation.
\end{enumerate}
Both in \ac{ab} mode and in \ac{abp} mode, the internal bandwidth is $\qty{8}{\times}$ higher than the external per-\ac{pch} bandwidth of $\qty[per-mode=symbol]{16}{\giga\byte\per\second}$, reaching $\qty[per-mode=symbol]{128}{\giga\byte\per\second}$ per \ac{pch}, or $\qty[per-mode=symbol]{2}{\tera\byte\per\second}$ in total for 16 \acp{pch}.
\subsubsection{Processing Unit}
Due to the focus on \ac{dnn} applications in \aca{fimdram}, the native data type for the \acp{fpu} is \ac{fp16}, which is motivated by the significantly lower area and power requirements for \acp{fpu} compared to \ac{fp32}.
In addition, \ac{fp16} is well-supported on modern processor architectures such as ARMv8, which not only include \ac{fp16} \acp{fpu} themselves, but also support \ac{simd} operations using special vector registers.
The \ac{simd} \acp{fpu} of the processing units are implemented once as a \ac{fp16} multiplier unit and once as a \ac{fp16} adder unit, providing support for these basic arithmetic operations.
In addition to the \acp{fpu}, a processing unit also consists of \acp{crf}, \acp{srf} and \acp{grf}.
The \ac{crf} acts as an instruction buffer, holding the 32 32-bit instructions to be executed by the processor when performing a memory access.
One program that is stored in the \ac{crf} is called a \textit{microkernel}.
@@ -406,7 +406,7 @@ The following \cref{sec:vp} introduces the concept of virtual prototyping, which
\begin{landscape}
\begin{figure}
\input{images/matrix_layout}
\caption{Mapping of the weight matrix onto the memory banks and its layout in the linear address space.}
\label{img:matrix_layout}
\end{figure}
\end{landscape}


@@ -83,7 +83,7 @@ This allows an exaggerated evaluation of the performance gains of \ac{pim} in an
% then HAXPY
The first set of benchmarks analyzes the speedup of \aca{fimdram} for various vector operations, namely an element-wise vector add operation (VADD), an element-wise vector multiply operation (VMUL), and a \ac{haxpy} operation.
Such vector operations have a low operational density and are particularly memory-bound because there is no data reuse at all and two input operands must be loaded for each operation.
As a result, the on-chip cache does not accelerate such workloads because all operand data must be fetched from memory anyway.
The workloads adhere to the following calculation patterns:
@@ -148,7 +148,7 @@ As all speedup values are well above 1, it can be concluded that even the smalle
\end{figure}
In addition to the generic ARM-based system, the same benchmarks were run on the hypothetical infinite compute system, the results of which are shown in \cref{fig:vector_infinite}.
As can be seen, the achievable speedup in the completely memory-bound system is, at $\qtyrange{10.2}{17.6}{\times}$, lower than in the generic system.
This is expected as the system becomes completely memory-bound and no longer relies on the relatively slow ARM processor. This is expected as the system becomes completely memory-bound and no longer relies on the relatively slow ARM processor.
The variance in speedup between different vector dimensions is also fairly low.
% For the \ac{haxpy} benchmark, the smaller variance of $\qtyrange{2.0}{2.4}{\times}$ can be interpreted as follows:
@@ -263,7 +263,7 @@ Therefore, the simulations can be directly compared to gain a good understanding
Each of Samsung's benchmarks is run with different batch sizes, where a larger batch size allows for better cache utilization as multiple operations are performed on the same data set, making the workload less memory-bound and therefore \ac{pim} less effective.
All the microbenchmarks discussed so far do not perform batching, so all comparisons are made against the results for the batch size of 1, which correspond to the blue bars in \cref{fig:samsung_speedup}.
Since the Samsung \ac{fpga} platform can be assumed to be a highly optimized accelerator, the infinite compute approach would be a more viable baseline for comparison than the \ac{cpu} approach, as both systems should be operating in the memory-bound region.
\begin{figure}
\centering
@@ -284,9 +284,16 @@ In summary, the results for the VADD workload show some deviation from the real-
\subsubsection{Comparison to Real Hardware}
In addition to the comparison with Samsung's real hardware implementation, the same benchmarks as in the simulations were run on two real \ac{gpu} systems, here referred to as Vega and Tesla.
The former system is the consumer \ac{gpu} \textit{Radeon RX Vega 56} from AMD, while the latter is the \textit{Tesla V100} \ac{gpu} from NVIDIA, specifically tailored for \ac{hpc}.
This Tesla \ac{gpu} is only one of a total of 16 \acp{gpu} that are part of the NVIDIA DGX-2 \ac{ai} workstation.
Both \acp{gpu} make use of \aca{hbm} and are therefore well suited to classify the simulation results and get an overview of the workload runtimes on a real system.
Since both systems use generic \aca{hbm} \ac{dram} and not \aca{fimdram}, the measurements should only be used as a rough estimate of the runtimes in a non-\ac{pim} case.
The Vega \ac{gpu} integrates $\qty{8}{\gibi\byte}$ of \aca{hbm} memory using two stacks, achieving a total bus width of $\qty{2048}{\bit}$ with a total of 16 memory channels.
For the theoretical performance of \ac{fp16} operations, Vega achieves a value of $\qty{21.09}{\tera FLOPS}$ \cite{vega2017}.
While being specifically tailored for \ac{ai} applications, the Tesla \ac{gpu} integrates $\qty{32}{\gibi\byte}$ of \aca{hbm} using four stacks, resulting in a total bus width of $\qty{4096}{\bit}$ and 32 independent memory channels.
The theoretical performance of \ac{fp16} operations is only slightly higher than that of the Vega \ac{gpu} with a value of $\qty{28.26}{\tera FLOPS}$ \cite{tesla2018}.
\begin{figure}
\centering
@@ -306,6 +313,16 @@ As both systems are using generic \aca{hbm} \ac{dram} and not \aca{fimdram}, the
\label{fig:runtimes_matrix}
\end{figure}
A comparison between all investigated systems for the vector benchmarks is shown in \cref{fig:runtimes_vector}.
As can be seen, for both the generic ARM system and the infinite compute system, the use of \ac{pim} reduces the total runtime significantly.
However, when comparing the two \ac{gpu} systems with the infinite compute approach, it can be seen that the runtime of the \acp{gpu} is actually lower, even though the infinite compute approach should be an optimal memory-bound system.
It is important to note that the simulation considered only a single memory channel, whereas the real \acp{gpu} can distribute all operations over every available channel: 16 channels in the case of Vega and 32 in the case of Tesla.
This results in a significantly higher overall memory bandwidth for the \ac{gpu} systems.
It should also be noted that the \aca{hbm} memory of the Tesla \ac{gpu} is clocked slightly higher, at $\qty{876}{\mega\hertz}$, than that of the Vega \ac{gpu} at $\qty{800}{\mega\hertz}$ \cite{vega2017,tesla2018}.
In addition, while the infinite compute system spends no time on computation, it may still have to stall and wait for memory requests to complete.
This is especially true for the explicitly inserted memory barriers in the \ac{pim} kernels.
Taken together, these factors explain the faster execution times of the \ac{gpu} systems compared to the infinite compute system.
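The channel argument can also be made quantitative with a simple scaling estimate: if a workload parallelized perfectly over all memory channels, the single-channel runtime would divide by the channel count. The following Python sketch illustrates this ideal lower bound; the cycle count used is a hypothetical example value, not one of the measured results:

```python
# Ideal multi-channel scaling: a perfectly parallel workload finishes in
# single-channel time divided by the channel count. Real systems fall
# short of this bound (synchronization, uneven partitioning), so it only
# indicates whether the measured GPU runtimes are plausible.

def ideal_multichannel_cycles(single_channel_cycles: int, n_channels: int) -> float:
    """Lower bound on runtime when work is split over n_channels."""
    return single_channel_cycles / n_channels

single_channel = 100_000_000  # hypothetical single-channel cycle count
for name, channels in (("Vega", 16), ("Tesla", 32)):
    bound = ideal_multichannel_cycles(single_channel, channels)
    print(f"{name}: >= {bound:,.0f} cycles")
```

Scaled this way, the single-channel simulation results move into the same order of magnitude as the measured \ac{gpu} runtimes, which supports the explanation given above.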
% \subsubsection{Initialization Overhead} % \subsubsection{Initialization Overhead}
% estimate the conversion of the operands in relation to the runtime

level,vadd,vmul,haxpy,gemv,dnn
X1,17282586,17180880,17121019,16984363,91489038
X2,31633105,31633649,31802257,26425737,151112206
X3,60059785,60065489,60021288,86860818,142148495
X4,116919805,116812209,116847802,166443969,89307502

workload,level,vega,tesla
VADD,X1,69572650,17282586
VADD,X2,123217536,31633105
VADD,X3,207693503,60059785
VADD,X4,378089165,116919805
VMUL,X1,67408281,17180880
VMUL,X2,103994272,31633649
VMUL,X3,182162140,60065489
VMUL,X4,350280326,116812209
HAXPY,X1,69791189,17121019
HAXPY,X2,123543145,31802257
HAXPY,X3,207947543,60021288
HAXPY,X4,377434890,116847802
GEMV,X1,750246152,16984363
GEMV,X2,648714601,26425737
GEMV,X3,2454455479,86860818
GEMV,X4,4968984949,166443969
DNN,X1,231093065,91489038
DNN,X2,431703456,151112206
DNN,X3,877622611,142148495
DNN,X4,2175751385,89307502