Insert new simulation results

2024-03-07 15:33:19 +01:00
parent 123c7e0b25
commit 4074a60f43
23 changed files with 214 additions and 184 deletions


@@ -352,3 +352,7 @@
short = HAXPY,
long = half precision $a \cdot x + y$,
}
\DeclareAcronym{hpc}{
short = HPC,
long = high-performance computing,
}


@@ -3,8 +3,6 @@
\subsection{Simulation Results}
\begin{table}[H]
\csvreader[
head to column names,


@@ -13,7 +13,7 @@ For this, more detailed information is required from Samsung, as the exact inter
To ease the currently error-prone microkernel development process, the software library could help the developer by providing building blocks that assemble the microkernel and simultaneously generate the necessary \ac{ld} and \ac{st} instructions to execute the kernel.
In addition, the current bare-metal deployment of the software cannot realistically be used to accelerate real-world \ac{dnn} applications.
Instead, \aca{fimdram} should be usable on a Linux system, which would require the integration of the software support library into a Linux device driver.
To take into account the special alignment requirements of the \ac{pim} data structures, this device driver must also carefully consider the virtual address translation of the Linux kernel, possibly making use of so-called \acp{hugetlb}, as the alignment requirements exceed the default page size of $\qty{4}{\kibi\byte}$.
For a better evaluation of the performance gains of \aca{fimdram}, it should also be compared with real-world \ac{dnn} applications.
Effects such as the initialization overhead of \aca{fimdram} can only be evaluated in such an environment.


@@ -53,25 +53,25 @@ In the attributes of the page table, each mapped block of address space can be a
While most of the \ac{dram} area should be a normal, cacheable memory region, the \ac{pim} region should be marked as non-cacheable memory for reasons explained in \cref{sec:microkernel_execution}.
Furthermore, special memory-mapped devices such as the \ac{uart}, which is used to print logging messages to the \ac{stdout}, must be marked as a non-cacheable device region, as otherwise the log messages may be held in the cache and not be written until the cache line is eventually flushed.
In the AArch64 execution mode, the operating system can choose from three different granule sizes for the translation tables: $\qty{4}{\kibi\byte}$, $\qty{16}{\kibi\byte}$ and $\qty{64}{\kibi\byte}$.
Each granule size has a different maximum amount of page table nesting, with up to a 4-level look-up for the $\qty{4}{\kibi\byte}$ configuration, as shown in \cref{img:pagetable_granule}.
\begin{figure}
\centering
\includegraphics[width=\linewidth]{images/pagetable_granule}
\caption[The distinct page table levels for the $\qty{4}{\kibi\byte}$ granule.]{The distinct page table levels for the $\qty{4}{\kibi\byte}$ granule \cite{arm2015}.}
\label{img:pagetable_granule}
\end{figure}
As can be seen, when using the complete 4-level page lookup process, nine bits of the virtual address are used per level to index into the corresponding page table.
In cases where the input address is restricted to a maximum of 42 bits, the level 0 table can be omitted and translation can start with the level 1 table.
In each table, an entry either points to the physical address of the next-level page table, or alternatively can directly point to the base address of a memory block, completing the address translation prematurely.
While regular operating systems may use the complete 4-level lookup procedure of the $\qty{4}{\kibi\byte}$ granule for maximum flexibility, this is not necessary in the controlled bare-metal case, where there is only one application.
For this reason, the developed kernel makes use of the first-level page table and maps the complete \ac{dram} memory region using $\qty{1}{\gibi\byte}$ memory blocks.
In addition to the base pointer, each entry in the page table also holds certain attributes on how the memory region should be treated.
To enable the mapping of the boot memory and \ac{io} devices such as the \ac{uart}, the first memory blocks are marked with a non-cacheable attribute, followed by the normal \ac{dram} region, which is cacheable, and finally the \aca{fimdram} region, which is set to non-cacheable again.
After setting up the page tables, initializing the \ac{tcr} to enable the $\qty{4}{\kibi\byte}$ granule, and assigning the \ac{ttbr}, which holds the base pointer to the first-level page table, the \ac{mmu} can be enabled, and the boot code can finally dispatch to the \texttt{main} function of the application.
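The translation geometry described above can be sketched numerically. The following Python fragment is only an illustrative model of the $\qty{4}{\kibi\byte}$ granule (not part of the kernel's code base), showing why each level consumes nine index bits and why a handful of level-1 block entries suffice to map a multi-gigabyte DRAM region; the 4 GiB region size is an assumed example value:

```python
# Illustrative model of the AArch64 4 KiB-granule translation geometry.
GRANULE = 4 * 1024                      # bytes per page / per table
ENTRY_SIZE = 8                          # each table entry is a 64-bit descriptor
ENTRIES = GRANULE // ENTRY_SIZE         # 512 entries per table
INDEX_BITS = ENTRIES.bit_length() - 1   # -> 9 index bits per level

def table_indices(vaddr):
    """Split a 48-bit virtual address into the four 9-bit level indices."""
    l0 = (vaddr >> 39) & 0x1FF
    l1 = (vaddr >> 30) & 0x1FF
    l2 = (vaddr >> 21) & 0x1FF
    l3 = (vaddr >> 12) & 0x1FF
    return l0, l1, l2, l3

# A level-1 block entry maps 2^30 bytes = 1 GiB, so a (hypothetical)
# 4 GiB DRAM region needs only four level-1 entries when block mapping
# is used instead of the full 4-level walk.
BLOCK_1G = 1 << 30
dram_size = 4 << 30
level1_entries = dram_size // BLOCK_1G  # = 4
```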
\subsubsection{Bare-Metal Utilities}
% Heap Allocator (linked list allocator?...)


@@ -82,7 +82,7 @@ In the following, three \ac{pim} approaches that place the compute units at the
The first publicly available real-world \ac{pim} architecture has been designed and built by the company UPMEM \cite{gomez-luna2022}.
UPMEM combines regular DDR4 \ac{dimm} based \ac{dram} with a set of \ac{pim}-enabled UPMEM \acp{dimm} consisting of several \ac{pim} chips.
In each \ac{pim} chip, there are 8 \acp{dpu}, each of which has exclusive access to a $\qty{64}{\mebi\byte}$ memory bank, a $\qty{24}{\kibi\byte}$ instruction memory and a $\qty{64}{\kibi\byte}$ scratchpad memory.
The host processor can access the \ac{dpu} memory banks to copy input data from main memory and retrieve results.
While copying, the data layout must be changed to store the data words contiguously in a \ac{pim} bank, in contrast to the horizontal \ac{dram} mapping used in \ac{dimm} modules, where a data word is split across multiple devices.
UPMEM provides a \ac{sdk} that orchestrates the data movement from the main memory to the \ac{pim} banks and modifies the data layout without special attention from the developer.


@@ -17,6 +17,7 @@ Thus, with both the 16-wide \ac{fp} adder and the 16-wide \ac{fp} multiplier, a
In total, the 16 processing units in a memory channel provide a throughput of $\num{16}\cdot\qty{8}{\giga FLOPS}=\qty{128}{\giga FLOPS}$.
To compare this throughput with the vector processing unit of a real processor, a very simplified assumption can be made based on the ARM NEON architecture, which holds 8 \ac{fp16} numbers in a single $\qty{128}{\bit}$ vector register \cite{arm2020}.
Assuming the single processor core runs at a frequency of $\qty{3}{\giga\hertz}$, the vector processing unit can achieve a maximum throughput of $\qty{8}{FLOP} \cdot \qty{3}{\giga\hertz}=\qty{24}{\giga FLOPS}$, which is about $\qty{5}{\times}$ less than the \aca{fimdram} throughput of a single memory channel.
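Written out, the quoted factor follows directly from the ratio of the two peak-throughput figures:
\[
\frac{\qty{128}{\giga FLOPS}}{\qty{24}{\giga FLOPS}} = \frac{128}{24} \approx 5.3
\]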
The simulated ARM system also contains a two-level cache hierarchy with a cache size of $\qty{16}{\kibi\byte}$ for the L1 cache and $\qty{256}{\kibi\byte}$ for the L2 cache.
% some implementation details
% hbm size, channel...
@@ -92,9 +93,11 @@ The workloads adhere to the following calculation patterns:
\item \ac{haxpy}: $z = a \cdot x + y$
\end{itemize}
Each workload is run with four different input vector dimensions to examine the effect of setup overhead and potentially identify a break-even point at which \ac{pim} becomes viable.
\Cref{tab:dimensions_vector} lists the specific vector dimensions for the following benchmarks.
The levels X1-X4 denote the increasing dimensions, with each successive level doubling in size, starting at $2^{21} = 2097152$ elements.
To accurately evaluate the performance gain of \ac{pim}, it is important that the size of the input operand is significantly larger than the cache size of the simulated system, so that the cache does not filter the memory accesses to the \ac{dram}.
In the case of the smallest dimension level, the effective data size of the input operands is $2^{21} \cdot 2 \cdot \qty{2}{\byte}=\qty{8}{\mebi\byte}$, which is much larger than the last-level cache of $\qty{256}{\kibi\byte}$.
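The same arithmetic extends to all four dimension levels; the following small sketch (assuming, as in the text, two fp16 input vectors per benchmark) makes the cache-size margin explicit:

```python
# Effective input-operand size per dimension level (two fp16 input vectors).
FP16_BYTES = 2
LEVELS = {f"X{i}": 2 ** (20 + i) for i in range(1, 5)}  # X1=2^21 ... X4=2^24

sizes_mib = {lvl: n * 2 * FP16_BYTES // 2**20 for lvl, n in LEVELS.items()}
# -> X1: 8 MiB, X2: 16 MiB, X3: 32 MiB, X4: 64 MiB,
# all far above the 256 KiB last-level cache of the simulated system.
```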
\begin{table}
\centering
@@ -110,10 +113,10 @@ The levels X1-X4 denote the increasing dimensions, with each successive level do
hline{2} = {2}{-}{solid,black},
}
Level & Vector Dimensions \\
X1 & $2^{21}$ (2M) \\
X2 & $2^{22}$ (4M) \\
X3 & $2^{23}$ (8M) \\
X4 & $2^{24}$ (16M)
\end{tblr}
\caption{List of the input vector dimensions for the vector benchmarks.}
\label{tab:dimensions_vector}
@@ -131,10 +134,11 @@ S = \frac{\textrm{\#ticks in non-\ac{pim} mode}}{\textrm{\#ticks in \ac{pim} mod
\label{fig:vector_normal}
\end{figure}
\Cref{fig:vector_normal} shows the relative performance for the vector benchmarks, running on the generic ARM-based system at a typical clock frequency of $\qty{3}{\giga\hertz}$.
The relative speedup of \ac{pim} is in the range of about $\qtyrange{13.6}{23.9}{\times}$ with very small variance between the different vector dimensions, because such vector operations essentially scale linearly with the length of the input operands for both the non-\ac{pim} and \ac{pim} approaches.
The \ac{haxpy} benchmark shows the highest speedup of the three benchmarks, reaching up to $\qty{23.9}{\times}$, compared to VADD and VMUL.
This is because in the non-\ac{pim} system, each value of one input vector must first be multiplied by a scalar on the \ac{cpu} before the addition can take place, while in the \ac{pim} case the specialized \ac{mad} instruction performs both operations in a single instruction.
As all speedup values are well above 1, it can be concluded that even the smallest benchmarked vector size is already well above the break-even point at which \ac{pim} becomes viable.
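The difference between the two execution schemes can be illustrated with a minimal scalar model; this is only a sketch of the computation pattern, not the actual benchmark or microkernel code:

```python
def haxpy_two_pass(a, x, y):
    # Non-PIM view: the scalar multiply and the add are two separate steps.
    ax = [a * xi for xi in x]                       # extra multiply pass on the CPU
    return [axi + yi for axi, yi in zip(ax, y)]

def haxpy_mad(a, x, y):
    # PIM view: the MAD instruction fuses multiply and add into one operation.
    return [a * xi + yi for xi, yi in zip(x, y)]
```

Both variants compute the same $z = a \cdot x + y$; the difference lies only in the number of passes over the data, which is what the \ac{mad} instruction eliminates.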
\begin{figure}
\centering
@@ -144,13 +148,11 @@ As all speedup values are well above 1, it can be concluded that even the smalle
\end{figure}
In addition to the generic ARM-based system, the same benchmarks were run on the hypothetical infinite compute system, the results of which are shown in \cref{fig:vector_infinite}.
As can be seen, the achievable speedup in the completely memory-bounded system is, with a range of $\qtyrange{10.2}{17.6}{\times}$, lower than in the generic system.
This is expected, as the system becomes completely memory-bound and no longer relies on the relatively slow ARM processor.
The variance in speedup between the different vector dimensions is also fairly low.
% For the \ac{haxpy} benchmark, the smaller variance of $\qtyrange{2.0}{2.4}{\times}$ can be interpreted as follows:
% The additional computation step of the scalar multiplication does not affect the non-\ac{pim} system as much as in the previous case, because this is insignificant compared to the memory fetch of the vector elements.
% vectors: essentially both scale with the length of the vector, minimally less overhead
% haxpy: the scalar multiplication makes the CPU considerably slower, which is why this difference also disappears at 100GHz
\subsubsection{Neural Network Layers}
% GEMV
@@ -160,10 +162,10 @@ The additional computation step of the scalar multiplication does not affect the
% GEMM with strongly interleaved matrices (rather not)
In addition to the simple vector operations and the level 1 \ac{blas} routine \ac{haxpy}, the performance improvement of \ac{pim} is also investigated for the level 2 \ac{blas} routine \ac{gemv}.
Besides the benchmark for the regular \ac{gemv} operation, whose form is $y = A \cdot x$, several matrix-vector multiplications are chained together with the activation function \ac{relu} applied in between, modeling a simple fully connected neural network in the \ac{dnn} benchmark.
Each processing step for a \ac{dnn} layer can be described as $y = \textrm{ReLU}(A \cdot x)$, where the output of the operation is fed as input to the next layer.
In the simplest form, quadratic matrix dimensions ensure that the output vector of each layer has the same dimensions as the input vector, which simplifies the chaining of the outputs as inputs.
Again, several different dimensions of the benchmark inputs are used; the matrix dimensions for each of the two benchmarks are given in \cref{tab:dimensions_matrix}.
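The layer chaining described above can be sketched in a few lines of plain Python (an illustrative model, not the benchmark implementation), with each layer computing $y = \textrm{ReLU}(A \cdot x)$:

```python
def matvec(A, x):
    # y = A . x for a matrix given as a list of rows.
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in A]

def relu(v):
    return [max(0.0, vi) for vi in v]

def dnn_forward(layers, x):
    # Chain y = ReLU(A . x): the output of each layer feeds the next one.
    for A in layers:
        x = relu(matvec(A, x))
    return x
```

With square matrices, the vector keeps its dimension across layers, which is exactly what makes this chaining trivial.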
\begin{table}
@@ -184,16 +186,17 @@ Again, several different dimensions of the benchmark inputs are used, whose matr
hline{2} = {2}{-}{solid,black},
}
Level & \ac{gemv} Matrix Dimensions & \ac{dnn} Matrix Dimensions \\
X1 & (1024 $\times$ 4096) & (256 $\times$ 256) \\
X2 & (2048 $\times$ 4096) & (512 $\times$ 512) \\
X3 & (4096 $\times$ 8192) & (1024 $\times$ 1024) \\
X4 & (8192 $\times$ 8192) & (2048 $\times$ 2048)
\end{tblr}
\caption{List of the matrix dimensions for the neural network benchmarks.}
\label{tab:dimensions_matrix}
\end{table}
% In the \ac{gemv} benchmarks, only the number of rows is increased at each step, which means that the \ac{pim} microkernel has to perform more iterations of the \ac{mac} kernel, but does not have to load another chunk of the input vector, since it fits completely into the \ac{grf}-A registers.
For the various \ac{gemv} benchmarks, both the number of rows and the number of columns are increased with each step, which means that the \ac{pim} microkernel not only has to perform more iterations of the \ac{mac} kernel to produce the partial sum, but also has to load different chunks of the input vector, since it does not fit completely into the \acs{grf}-A registers.
\begin{figure}[ht]
\centering
@@ -202,12 +205,12 @@ In the \ac{gemv} benchmarks, only the number of rows is increased at each step,
\label{fig:matrix_normal}
\end{figure}
\Cref{fig:matrix_normal} shows the relative performance for the \ac{gemv} benchmarks that are run on the generic ARM system.
The speedup for a single \ac{gemv} operation is in the range of $\qtyrange{56.0}{62.5}{\times}$ and for the simple \ac{dnn} layers in the range of $\qtyrange{10.7}{49.3}{\times}$.
Unlike the vector benchmarks, the performance gains, especially for the \ac{dnn} benchmark, become drastically more significant with increasing matrix dimensions, where \ac{pim} can take advantage of its specialized architecture for this type of operation.
A possible explanation is that the initial overhead of executing the microkernel in the \aca{fimdram} processing units quickly becomes insignificant with increasing operand dimensions compared to the actual execution time.
Also, in all cases, the smallest representable operand dimensions already achieve a speedup of over one, suggesting that the break-even point of \ac{pim}'s viability for this system is below these dimensions.
Since the speedup for \ac{gemv} approaches $\qty{63}{\times}$ and for \ac{dnn} $\qty{50}{\times}$, it can be concluded that \ac{pim} offers an immense performance advantage in this system configuration.
\begin{figure}
\centering
@@ -216,21 +219,22 @@ Since the speedup approaches $\qty{100}{\times}$ in the \ac{dnn} benchmark, it c
\label{fig:matrix_infinite}
\end{figure}
For the infinite compute approach, however, the \ac{gemv} and \ac{dnn} benchmarks show a more differentiated view:
While the \ac{gemv} benchmark plateaus at around $\qty{9}{\times}$ for all matrix sizes, the usage of \ac{pim} slows the execution down to a factor of $\qty{0.56}{\times}$ for the \ac{dnn} benchmark.
However, the speedup quickly increases with the larger matrix dimensions, reaches its break-even point at the second step, and shows a maximum speedup of $\qty{9.2}{\times}$ and $\qty{6.0}{\times}$ for the \ac{gemv} and \ac{dnn} benchmarks respectively.
These results provide a more realistic view of \aca{fimdram}:
For workloads and accelerator systems that are truly memory-bound, performance improvements can be on the order of the simulated $\qty{9}{\times}$.
This result is largely in line with the numbers published by Samsung, which were already introduced in \cref{sec:fimdram_performance} and will be compared in more detail with the simulation results in the next section.
\subsubsection{Comparison to Samsung's Simulation Results}
To reiterate, Samsung used a real hardware accelerator platform for its analyses, which is based on a Xilinx Zynq Ultrascale+ \ac{fpga} and uses real manufactured \aca{fimdram} memory packages.
Similarly to the previous investigations, Samsung used for its microbenchmarks different input dimensions for both its \ac{gemv} and vector ADD workloads, which are listed in \cref{tab:samsung_dimensions}.
\begin{table}
\centering
\begin{tblr}{
row{1} = {c},
cell{2}{2} = {r},
cell{3}{2} = {r},
cell{4}{2} = {r},
@@ -254,9 +258,12 @@ Level 4 & (8k $\times$ 8k) & (16M)
\label{tab:samsung_dimensions}
\end{table}
As can be seen, the dimensions for the \ac{gemv} benchmark and the vector add operations, which correspond to the VADD benchmark of this thesis, match the dimensions used in the previously discussed simulations.
Therefore, the simulations can be directly compared to gain a good understanding of how accurate they are in comparison to the real system manufactured by Samsung.
Each of Samsung's benchmarks is run with different batch sizes, where a larger batch size allows for better cache utilization as multiple operations are performed on the same data set, making the workload less memory-bound and therefore \ac{pim} less effective.
All the microbenchmarks discussed so far do not perform batching, so all comparisons are made against the results for the batch size of 1, which correspond to the blue bars in \cref{fig:samsung_speedup}.
Since the Samsung \ac{fpga} platform can be assumed to be a highly optimized accelerator, the infinite compute approach would be a more viable baseline for comparison than the \ac{cpu} approach, as both systems should be operating in the memory-bounded region.
\begin{figure}
\centering
@@ -266,22 +273,27 @@ Since the Samsung \ac{fpga} platform can be assumed to be a highly optimized acc
\end{figure} \end{figure}
The performed ADD microbenchmark of Samsung show a small variance between the different input dimensions with an average speedup value of around $\qty{1.6}{\times}$. The performed ADD microbenchmark of Samsung show a small variance between the different input dimensions with an average speedup value of around $\qty{1.6}{\times}$.
When compared to the simulated platform, the variance is also limited with a range of $\qtyrange{1.6}{2.4}{\times}$, which corresponds well with the findings of Samsung. When compared to the simulated platform, the variance is also limited, with a value of around $\qty{12.7}{\times}$, which is almost an order of magnitude higher than the findings of Samsung.
The \ac{gemv} microbenchmark on the other hand shows a more drastic speedup with an average value of $\qty{8.3}{\times}$. This may be a surprising result, since such vector operations are inherently memory-bound and should be a prime candidate for the use of \ac{pim}.
Although the dimensions used by Samsung are different from the simulations of this thesis, the highest achieved speedup of $\qty{6.1}{\times}$ is well within the reach of the real hardware implementation. Samsung explains its low value of $\qty{1.6}{\times}$ by the fact that after 8 \ac{rd} accesses, the processor has to introduce a memory barrier instruction, resulting in a severe performance hit \cite{lee2021}.
However, this memory barrier has also been implemented in the VADD kernel of the simulations, which still show a significant performance gain.
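The cost of the barrier pattern described by Samsung can be approximated with a simple throughput model; the per-read and per-barrier latencies below are illustrative assumptions, not measured values:

```python
# Simple throughput model for the barrier penalty described above:
# after every 8 RD accesses, a memory barrier stalls the host.
# read_ns and barrier_ns are illustrative assumptions, not measurements.

READS_PER_BARRIER = 8

def effective_read_efficiency(read_ns=10.0, barrier_ns=50.0):
    """Fraction of the ideal read throughput that survives when a
    barrier of cost barrier_ns is inserted after every batch of reads."""
    ideal = READS_PER_BARRIER * read_ns
    return ideal / (ideal + barrier_ns)

print(round(effective_read_efficiency(), 3))  # 0.615
```

Even under such a model, the achievable throughput drops well below the ideal but does not collapse entirely, which is consistent with the simulations still showing a gain despite the barrier.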
The \ac{gemv} microbenchmark, on the other hand, shows a closer match, with an average speedup value of $\qty{8.3}{\times}$ for Samsung's implementation, while the simulations of this thesis achieved an average speedup of $\qty{9.0}{\times}$, which is well within the reach of the real hardware implementation.
In summary, the results for the VADD workload show some deviation from the real-world implementation of the system, while the \ac{gemv} workload shows a result that is consistent with it.
\subsubsection{Comparison to Real Hardware} \subsubsection{Comparison to Real Hardware}
TODO: check all ranges
In addition to the comparison of Samsung's real hardware implementation, the same benchmarks of the performed simulations are run on a [...] with HBM2 [...]. In addition to the comparison with Samsung's real hardware implementation, the same benchmarks from the simulations are run on two real \ac{gpu} systems, here referred to as Vega and Tesla.
As this system is using a generic \aca{hbm} \ac{dram} and not \aca{fimdram}, the measurements are only intended to serve as a vague estimation of the runtimes in a non-\ac{pim} case. The former system is the consumer \ac{gpu} \textit{Radeon RX Vega 56} from AMD, while the latter is the \textit{Tesla V100} \ac{gpu} from Nvidia, specifically tailored for \ac{hpc}.
Both \acp{gpu} make use of \aca{hbm} and are therefore well suited to put the simulation results into context and to give an overview of the workload runtimes on a real system.
As this system is using a generic \aca{hbm} \ac{dram} and not \aca{fimdram}, the measurements are only intended to serve as a vague estimation of the runtimes in a non-\ac{pim} case. As both systems use generic \aca{hbm} \ac{dram} and not \aca{fimdram}, the measurements are only intended to serve as a rough estimate of the runtimes in the non-\ac{pim} case.
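To place simulated cycle counts next to measured \ac{gpu} wall-clock times, the cycle counts have to be converted to time; the clock frequency used here is an illustrative assumption, not the actual clock of the simulation model:

```python
# Convert simulated cycle counts to wall-clock time so they can be
# plotted next to measured GPU runtimes. SIM_CLOCK_HZ is an
# illustrative assumption, not the actual clock of the simulation model.

SIM_CLOCK_HZ = 300e6  # assumed 300 MHz simulation clock

def cycles_to_ms(cycles, clock_hz=SIM_CLOCK_HZ):
    """Translate a cycle count into milliseconds at the given clock."""
    return cycles / clock_hz * 1e3

# Example with one of the VADD cycle counts from the runtime CSVs:
print(round(cycles_to_ms(911_403_240), 2))  # 3038.01
```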
\begin{figure} \begin{figure}
\centering \centering
\resizebox{\linewidth}{!}{% \resizebox{\linewidth}{!}{%
\input{plots/runtimes_vector} \input{plots/runtimes_vector}
} }
\caption{} \caption{Runtimes of all investigated systems for the vector benchmarks.}
\label{fig:runtimes_vector} \label{fig:runtimes_vector}
\end{figure} \end{figure}
@@ -290,7 +302,7 @@ As this system is using a generic \aca{hbm} \ac{dram} and not \aca{fimdram}, the
% \resizebox{\linewidth}{!}{% % \resizebox{\linewidth}{!}{%
\input{plots/runtimes_matrix} \input{plots/runtimes_matrix}
% } % }
\caption{} \caption{Runtimes of all investigated systems for the matrix benchmarks.}
\label{fig:runtimes_matrix} \label{fig:runtimes_matrix}
\end{figure} \end{figure}

View File

@@ -12,8 +12,8 @@
editor = {Cuesta, Carlos E. and Garlan, David and Pérez, Jennifer}, editor = {Cuesta, Carlos E. and Garlan, David and Pérez, Jennifer},
date = {2018}, date = {2018},
pages = {115--130}, pages = {115--130},
publisher = {{Springer International Publishing}}, publisher = {Springer International Publishing},
location = {{Cham}}, location = {Cham},
abstract = {Continuous software engineering aims at orchestrating engineering knowledge from various disciplines in order to deal with the rapid changes within the ecosystems of which software-based systems are part of. The literature claims that one means to ensure these prompt responses is to incorporate virtual prototypes of the system as early as possible in the development process, such that requirements and architecture decisions are verified early and continuously by means of simulations. Despite the maturity of practices for designing and assessing architectures, as well as for virtual prototyping, it is still not clear how to jointly consider the practices from these disciplines within development processes, in order to address the dynamics imposed by continuous software engineering. In this regard, we discuss in this paper how to orchestrate architecture drivers and design specification techniques with virtual prototypes, to address the demands of continuous software engineering in development processes. Our proposals are based on experiences from research and industry projects in various domains such as automotive, agriculture, construction, and medical devices.}, abstract = {Continuous software engineering aims at orchestrating engineering knowledge from various disciplines in order to deal with the rapid changes within the ecosystems of which software-based systems are part of. The literature claims that one means to ensure these prompt responses is to incorporate virtual prototypes of the system as early as possible in the development process, such that requirements and architecture decisions are verified early and continuously by means of simulations. Despite the maturity of practices for designing and assessing architectures, as well as for virtual prototyping, it is still not clear how to jointly consider the practices from these disciplines within development processes, in order to address the dynamics imposed by continuous software engineering. 
In this regard, we discuss in this paper how to orchestrate architecture drivers and design specification techniques with virtual prototypes, to address the demands of continuous software engineering in development processes. Our proposals are based on experiences from research and industry projects in various domains such as automotive, agriculture, construction, and medical devices.},
isbn = {978-3-030-00761-4}, isbn = {978-3-030-00761-4},
file = {/home/derek/Nextcloud/Verschiedenes/Zotero/storage/KGD8N29E/Antonino et al. - 2018 - Enabling Continuous Software Engineering for Embed.pdf} file = {/home/derek/Nextcloud/Verschiedenes/Zotero/storage/KGD8N29E/Antonino et al. - 2018 - Enabling Continuous Software Engineering for Embed.pdf}
@@ -53,7 +53,7 @@
date = {2023-11-10}, date = {2023-11-10},
url = {https://dvcon-europe.org/wp-content/uploads/sites/14/2023/12/Keynote-Pervasive-and-Sustainable-AI-with-Adaptive.pdf}, url = {https://dvcon-europe.org/wp-content/uploads/sites/14/2023/12/Keynote-Pervasive-and-Sustainable-AI-with-Adaptive.pdf},
urldate = {2024-01-23}, urldate = {2024-01-23},
venue = {{DVCon 2023}} venue = {DVCon 2023}
} }
@online{chen2023, @online{chen2023,
@@ -168,8 +168,8 @@
author = {He, Mingxuan and Song, Choungki and Kim, Ilkon and Jeong, Chunseok and Kim, Seho and Park, Il and Thottethodi, Mithuna and Vijaykumar, T. N.}, author = {He, Mingxuan and Song, Choungki and Kim, Ilkon and Jeong, Chunseok and Kim, Seho and Park, Il and Thottethodi, Mithuna and Vijaykumar, T. N.},
date = {2020-10}, date = {2020-10},
pages = {372--385}, pages = {372--385},
publisher = {{IEEE}}, publisher = {IEEE},
location = {{Athens, Greece}}, location = {Athens, Greece},
doi = {10.1109/MICRO50266.2020.00040}, doi = {10.1109/MICRO50266.2020.00040},
url = {https://ieeexplore.ieee.org/document/9251855/}, url = {https://ieeexplore.ieee.org/document/9251855/},
urldate = {2024-01-09}, urldate = {2024-01-09},
@@ -191,8 +191,8 @@
shorttitle = {Memory Systems}, shorttitle = {Memory Systems},
author = {Jacob, Bruce and Ng, Spencer W. and Wang, David T. and Wang, David and Rodriguez, Samuel}, author = {Jacob, Bruce and Ng, Spencer W. and Wang, David T. and Wang, David and Rodriguez, Samuel},
date = {2008}, date = {2008},
publisher = {{Elsevier/Morgan Kaufmann}}, publisher = {Elsevier/Morgan Kaufmann},
location = {{Amsterdam Heidelberg}}, location = {Amsterdam Heidelberg},
isbn = {978-0-12-379751-3}, isbn = {978-0-12-379751-3},
langid = {english}, langid = {english},
pagetotal = {982}, pagetotal = {982},
@@ -219,8 +219,8 @@
author = {Jouppi, Norman P. and Young, Cliff and Patil, Nishant and Patterson, David and Agrawal, Gaurav and Bajwa, Raminder and Bates, Sarah and Bhatia, Suresh and Boden, Nan and Borchers, Al and Boyle, Rick and Cantin, Pierre-luc and Chao, Clifford and Clark, Chris and Coriell, Jeremy and Daley, Mike and Dau, Matt and Dean, Jeffrey and Gelb, Ben and Ghaemmaghami, Tara Vazir and Gottipati, Rajendra and Gulland, William and Hagmann, Robert and Ho, C. Richard and Hogberg, Doug and Hu, John and Hundt, Robert and Hurt, Dan and Ibarz, Julian and Jaffey, Aaron and Jaworski, Alek and Kaplan, Alexander and Khaitan, Harshit and Killebrew, Daniel and Koch, Andy and Kumar, Naveen and Lacy, Steve and Laudon, James and Law, James and Le, Diemthu and Leary, Chris and Liu, Zhuyuan and Lucke, Kyle and Lundin, Alan and MacKean, Gordon and Maggiore, Adriana and Mahony, Maire and Miller, Kieran and Nagarajan, Rahul and Narayanaswami, Ravi and Ni, Ray and Nix, Kathy and Norrie, Thomas and Omernick, Mark and Penukonda, Narayana and Phelps, Andy and Ross, Jonathan and Ross, Matt and Salek, Amir and Samadiani, Emad and Severn, Chris and Sizikov, Gregory and Snelham, Matthew and Souter, Jed and Steinberg, Dan and Swing, Andy and Tan, Mercedes and Thorson, Gregory and Tian, Bo and Toma, Horia and Tuttle, Erick and Vasudevan, Vijay and Walter, Richard and Wang, Walter and Wilcox, Eric and Yoon, Doe Hyun}, author = {Jouppi, Norman P. and Young, Cliff and Patil, Nishant and Patterson, David and Agrawal, Gaurav and Bajwa, Raminder and Bates, Sarah and Bhatia, Suresh and Boden, Nan and Borchers, Al and Boyle, Rick and Cantin, Pierre-luc and Chao, Clifford and Clark, Chris and Coriell, Jeremy and Daley, Mike and Dau, Matt and Dean, Jeffrey and Gelb, Ben and Ghaemmaghami, Tara Vazir and Gottipati, Rajendra and Gulland, William and Hagmann, Robert and Ho, C. 
Richard and Hogberg, Doug and Hu, John and Hundt, Robert and Hurt, Dan and Ibarz, Julian and Jaffey, Aaron and Jaworski, Alek and Kaplan, Alexander and Khaitan, Harshit and Killebrew, Daniel and Koch, Andy and Kumar, Naveen and Lacy, Steve and Laudon, James and Law, James and Le, Diemthu and Leary, Chris and Liu, Zhuyuan and Lucke, Kyle and Lundin, Alan and MacKean, Gordon and Maggiore, Adriana and Mahony, Maire and Miller, Kieran and Nagarajan, Rahul and Narayanaswami, Ravi and Ni, Ray and Nix, Kathy and Norrie, Thomas and Omernick, Mark and Penukonda, Narayana and Phelps, Andy and Ross, Jonathan and Ross, Matt and Salek, Amir and Samadiani, Emad and Severn, Chris and Sizikov, Gregory and Snelham, Matthew and Souter, Jed and Steinberg, Dan and Swing, Andy and Tan, Mercedes and Thorson, Gregory and Tian, Bo and Toma, Horia and Tuttle, Erick and Vasudevan, Vijay and Walter, Richard and Wang, Walter and Wilcox, Eric and Yoon, Doe Hyun},
date = {2017-06-24}, date = {2017-06-24},
pages = {1--12}, pages = {1--12},
publisher = {{ACM}}, publisher = {ACM},
location = {{Toronto ON Canada}}, location = {Toronto ON Canada},
doi = {10.1145/3079856.3080246}, doi = {10.1145/3079856.3080246},
url = {https://dl.acm.org/doi/10.1145/3079856.3080246}, url = {https://dl.acm.org/doi/10.1145/3079856.3080246},
urldate = {2024-01-22}, urldate = {2024-01-22},
@@ -235,7 +235,7 @@
author = {Jung, Matthias}, author = {Jung, Matthias},
date = {2017}, date = {2017},
series = {Forschungsberichte {{Mikroelektronik}}}, series = {Forschungsberichte {{Mikroelektronik}}},
publisher = {{Technische Universität Kaiserslautern}}, publisher = {Technische Universität Kaiserslautern},
isbn = {978-3-95974-051-7}, isbn = {978-3-95974-051-7},
file = {/home/derek/Nextcloud/Verschiedenes/Zotero/storage/Y9YSTV6C/Jung - 2017 - System-level Modeling, Analysis and Optimization o.pdf} file = {/home/derek/Nextcloud/Verschiedenes/Zotero/storage/Y9YSTV6C/Jung - 2017 - System-level Modeling, Analysis and Optimization o.pdf}
} }
@@ -263,8 +263,8 @@
author = {Kal, Hongju and Yoo, Chanyoung and Ro, Won Woo}, author = {Kal, Hongju and Yoo, Chanyoung and Ro, Won Woo},
date = {2023-10-28}, date = {2023-10-28},
pages = {815--827}, pages = {815--827},
publisher = {{ACM}}, publisher = {ACM},
location = {{Toronto ON Canada}}, location = {Toronto ON Canada},
doi = {10.1145/3613424.3614314}, doi = {10.1145/3613424.3614314},
url = {https://dl.acm.org/doi/10.1145/3613424.3614314}, url = {https://dl.acm.org/doi/10.1145/3613424.3614314},
urldate = {2024-01-08}, urldate = {2024-01-08},
@@ -282,8 +282,8 @@
author = {Kang, Shinhaeng and Lee, Sukhan and Kim, Byeongho and Kim, Hweesoo and Sohn, Kyomin and Kim, Nam Sung and Lee, Eojin}, author = {Kang, Shinhaeng and Lee, Sukhan and Kim, Byeongho and Kim, Hweesoo and Sohn, Kyomin and Kim, Nam Sung and Lee, Eojin},
date = {2022-02-13}, date = {2022-02-13},
pages = {146--152}, pages = {146--152},
publisher = {{ACM}}, publisher = {ACM},
location = {{Virtual Event USA}}, location = {Virtual Event USA},
doi = {10.1145/3490422.3502355}, doi = {10.1145/3490422.3502355},
url = {https://dl.acm.org/doi/10.1145/3490422.3502355}, url = {https://dl.acm.org/doi/10.1145/3490422.3502355},
urldate = {2024-01-08}, urldate = {2024-01-08},
@@ -301,8 +301,8 @@
author = {Kwon, Young-Cheon and Lee, Suk Han and Lee, Jaehoon and Kwon, Sang-Hyuk and Ryu, Je Min and Son, Jong-Pil and Seongil, O and Yu, Hak-Soo and Lee, Haesuk and Kim, Soo Young and Cho, Youngmin and Kim, Jin Guk and Choi, Jongyoon and Shin, Hyun-Sung and Kim, Jin and Phuah, BengSeng and Kim, HyoungMin and Song, Myeong Jun and Choi, Ahn and Kim, Daeho and Kim, SooYoung and Kim, Eun-Bong and Wang, David and Kang, Shinhaeng and Ro, Yuhwan and Seo, Seungwoo and Song, JoonHo and Youn, Jaeyoun and Sohn, Kyomin and Kim, Nam Sung}, author = {Kwon, Young-Cheon and Lee, Suk Han and Lee, Jaehoon and Kwon, Sang-Hyuk and Ryu, Je Min and Son, Jong-Pil and Seongil, O and Yu, Hak-Soo and Lee, Haesuk and Kim, Soo Young and Cho, Youngmin and Kim, Jin Guk and Choi, Jongyoon and Shin, Hyun-Sung and Kim, Jin and Phuah, BengSeng and Kim, HyoungMin and Song, Myeong Jun and Choi, Ahn and Kim, Daeho and Kim, SooYoung and Kim, Eun-Bong and Wang, David and Kang, Shinhaeng and Ro, Yuhwan and Seo, Seungwoo and Song, JoonHo and Youn, Jaeyoun and Sohn, Kyomin and Kim, Nam Sung},
date = {2021-02-13}, date = {2021-02-13},
pages = {350--352}, pages = {350--352},
publisher = {{IEEE}}, publisher = {IEEE},
location = {{San Francisco, CA, USA}}, location = {San Francisco, CA, USA},
doi = {10.1109/ISSCC42613.2021.9365862}, doi = {10.1109/ISSCC42613.2021.9365862},
url = {https://ieeexplore.ieee.org/document/9365862/}, url = {https://ieeexplore.ieee.org/document/9365862/},
urldate = {2024-01-08}, urldate = {2024-01-08},
@@ -319,8 +319,8 @@
author = {Kwon, Yongkee and Vladimir, Kornijcuk and Kim, Nahsung and Shin, Woojae and Won, Jongsoon and Lee, Minkyu and Joo, Hyunha and Choi, Haerang and Kim, Guhyun and An, Byeongju and Kim, Jeongbin and Lee, Jaewook and Kim, Ilkon and Park, Jaehan and Park, Chanwook and Song, Yosub and Yang, Byeongsu and Lee, Hyungdeok and Kim, Seho and Kwon, Daehan and Lee, Seongju and Kim, Kyuyoung and Oh, Sanghoon and Park, Joonhong and Hong, Gimoon and Ka, Dongyoon and Hwang, Kyudong and Park, Jeongje and Kang, Kyeongpil and Kim, Jungyeon and Jeon, Junyeol and Lee, Myeongjun and Shin, Minyoung and Shin, Minhwan and Cha, Jaekyung and Jung, Changson and Chang, Kijoon and Jeong, Chunseok and Lim, Euicheol and Park, Il and Chun, Junhyun and Hynix, Sk}, author = {Kwon, Yongkee and Vladimir, Kornijcuk and Kim, Nahsung and Shin, Woojae and Won, Jongsoon and Lee, Minkyu and Joo, Hyunha and Choi, Haerang and Kim, Guhyun and An, Byeongju and Kim, Jeongbin and Lee, Jaewook and Kim, Ilkon and Park, Jaehan and Park, Chanwook and Song, Yosub and Yang, Byeongsu and Lee, Hyungdeok and Kim, Seho and Kwon, Daehan and Lee, Seongju and Kim, Kyuyoung and Oh, Sanghoon and Park, Joonhong and Hong, Gimoon and Ka, Dongyoon and Hwang, Kyudong and Park, Jeongje and Kang, Kyeongpil and Kim, Jungyeon and Jeon, Junyeol and Lee, Myeongjun and Shin, Minyoung and Shin, Minhwan and Cha, Jaekyung and Jung, Changson and Chang, Kijoon and Jeong, Chunseok and Lim, Euicheol and Park, Il and Chun, Junhyun and Hynix, Sk},
date = {2022-08-21}, date = {2022-08-21},
pages = {1--25}, pages = {1--25},
publisher = {{IEEE}}, publisher = {IEEE},
location = {{Cupertino, CA, USA}}, location = {Cupertino, CA, USA},
doi = {10.1109/HCS55958.2022.9895629}, doi = {10.1109/HCS55958.2022.9895629},
url = {https://ieeexplore.ieee.org/document/9895629/}, url = {https://ieeexplore.ieee.org/document/9895629/},
urldate = {2024-01-22}, urldate = {2024-01-22},
@@ -336,8 +336,8 @@
author = {Kwon, Woosuk and Li, Zhuohan and Zhuang, Siyuan and Sheng, Ying and Zheng, Lianmin and Yu, Cody Hao and Gonzalez, Joseph and Zhang, Hao and Stoica, Ion}, author = {Kwon, Woosuk and Li, Zhuohan and Zhuang, Siyuan and Sheng, Ying and Zheng, Lianmin and Yu, Cody Hao and Gonzalez, Joseph and Zhang, Hao and Stoica, Ion},
date = {2023-10-23}, date = {2023-10-23},
pages = {611--626}, pages = {611--626},
publisher = {{ACM}}, publisher = {ACM},
location = {{Koblenz Germany}}, location = {Koblenz Germany},
doi = {10.1145/3600006.3613165}, doi = {10.1145/3600006.3613165},
url = {https://dl.acm.org/doi/10.1145/3600006.3613165}, url = {https://dl.acm.org/doi/10.1145/3600006.3613165},
urldate = {2024-01-12}, urldate = {2024-01-12},
@@ -354,8 +354,8 @@
author = {Lee, Sukhan and Kang, Shin-haeng and Lee, Jaehoon and Kim, Hyeonsu and Lee, Eojin and Seo, Seungwoo and Yoon, Hosang and Lee, Seungwon and Lim, Kyounghwan and Shin, Hyunsung and Kim, Jinhyun and Seongil, O and Iyer, Anand and Wang, David and Sohn, Kyomin and Kim, Nam Sung}, author = {Lee, Sukhan and Kang, Shin-haeng and Lee, Jaehoon and Kim, Hyeonsu and Lee, Eojin and Seo, Seungwoo and Yoon, Hosang and Lee, Seungwon and Lim, Kyounghwan and Shin, Hyunsung and Kim, Jinhyun and Seongil, O and Iyer, Anand and Wang, David and Sohn, Kyomin and Kim, Nam Sung},
date = {2021-06}, date = {2021-06},
pages = {43--56}, pages = {43--56},
publisher = {{IEEE}}, publisher = {IEEE},
location = {{Valencia, Spain}}, location = {Valencia, Spain},
doi = {10.1109/ISCA52012.2021.00013}, doi = {10.1109/ISCA52012.2021.00013},
url = {https://ieeexplore.ieee.org/document/9499894/}, url = {https://ieeexplore.ieee.org/document/9499894/},
urldate = {2024-01-08}, urldate = {2024-01-08},
@@ -426,7 +426,7 @@
title = {Neural Networks and Deep Learning}, title = {Neural Networks and Deep Learning},
author = {Nielsen, Michael A.}, author = {Nielsen, Michael A.},
date = {2015}, date = {2015},
publisher = {{Determination Press}}, publisher = {Determination Press},
url = {http://neuralnetworksanddeeplearning.com/}, url = {http://neuralnetworksanddeeplearning.com/},
file = {/home/derek/Nextcloud/Verschiedenes/Zotero/storage/E6FRVMZ3/Nielsen - 2015 - Neural networks and deep learning.pdf} file = {/home/derek/Nextcloud/Verschiedenes/Zotero/storage/E6FRVMZ3/Nielsen - 2015 - Neural networks and deep learning.pdf}
} }
@@ -460,7 +460,7 @@
shorttitle = {Processing in {{Memory}}}, shorttitle = {Processing in {{Memory}}},
author = {Radojković, Petar and Carpenter, Paul and Esmaili-Dokht, Pouya and Cimadomo, Rémy and Charles, Henri-Pierre and Sebastian, Abu and Amato, Paolo}, author = {Radojković, Petar and Carpenter, Paul and Esmaili-Dokht, Pouya and Cimadomo, Rémy and Charles, Henri-Pierre and Sebastian, Abu and Amato, Paolo},
date = {2021-07-29}, date = {2021-07-29},
institution = {{Zenodo}}, institution = {Zenodo},
doi = {10.5281/ZENODO.4767489}, doi = {10.5281/ZENODO.4767489},
url = {https://zenodo.org/record/4767489}, url = {https://zenodo.org/record/4767489},
urldate = {2024-02-06}, urldate = {2024-02-06},
@@ -493,8 +493,8 @@
author = {Samajdar, Ananda and Joseph, Jan Moritz and Zhu, Yuhao and Whatmough, Paul and Mattina, Matthew and Krishna, Tushar}, author = {Samajdar, Ananda and Joseph, Jan Moritz and Zhu, Yuhao and Whatmough, Paul and Mattina, Matthew and Krishna, Tushar},
date = {2020-08}, date = {2020-08},
pages = {58--68}, pages = {58--68},
publisher = {{IEEE}}, publisher = {IEEE},
location = {{Boston, MA, USA}}, location = {Boston, MA, USA},
doi = {10.1109/ISPASS48437.2020.00016}, doi = {10.1109/ISPASS48437.2020.00016},
url = {https://ieeexplore.ieee.org/document/9238602/}, url = {https://ieeexplore.ieee.org/document/9238602/},
urldate = {2024-02-14}, urldate = {2024-02-14},
@@ -512,8 +512,8 @@
author = {Seshadri, Vivek and Kim, Yoongu and Fallin, Chris and Lee, Donghyuk and Ausavarungnirun, Rachata and Pekhimenko, Gennady and Luo, Yixin and Mutlu, Onur and Gibbons, Phillip B. and Kozuch, Michael A. and Mowry, Todd C.}, author = {Seshadri, Vivek and Kim, Yoongu and Fallin, Chris and Lee, Donghyuk and Ausavarungnirun, Rachata and Pekhimenko, Gennady and Luo, Yixin and Mutlu, Onur and Gibbons, Phillip B. and Kozuch, Michael A. and Mowry, Todd C.},
date = {2013-12-07}, date = {2013-12-07},
pages = {185--197}, pages = {185--197},
publisher = {{ACM}}, publisher = {ACM},
location = {{Davis California}}, location = {Davis California},
doi = {10.1145/2540708.2540725}, doi = {10.1145/2540708.2540725},
url = {https://dl.acm.org/doi/10.1145/2540708.2540725}, url = {https://dl.acm.org/doi/10.1145/2540708.2540725},
urldate = {2024-02-05}, urldate = {2024-02-05},
@@ -584,8 +584,8 @@
date = {2022}, date = {2022},
volume = {13511}, volume = {13511},
pages = {362--379}, pages = {362--379},
publisher = {{Springer International Publishing}}, publisher = {Springer International Publishing},
location = {{Cham}}, location = {Cham},
doi = {10.1007/978-3-031-15074-6_23}, doi = {10.1007/978-3-031-15074-6_23},
url = {https://link.springer.com/10.1007/978-3-031-15074-6_23}, url = {https://link.springer.com/10.1007/978-3-031-15074-6_23},
urldate = {2024-01-21}, urldate = {2024-01-21},
@@ -615,8 +615,8 @@
@book{systemc2023, @book{systemc2023,
title = {1666-2023 - {{IEEE Standard}} for {{Standard SystemC Language Reference Manual}}}, title = {1666-2023 - {{IEEE Standard}} for {{Standard SystemC Language Reference Manual}}},
date = {2023}, date = {2023},
publisher = {{IEEE}}, publisher = {IEEE},
location = {{New York}}, location = {New York},
abstract = {SystemC® is defined in this standard. SystemC is an ISO standard C++ class library for system and hardware design for use by designers and architects who need to address complex systems that are a hybrid between hardware and software. This standard provides a precise and complete definition of the SystemC class library so that a SystemC implementation can be developed with reference to this standard alone. The primary audiences for this standard are the implementors of the SystemC class library, the implementors of tools supporting the class library, and the users of the class library}, abstract = {SystemC® is defined in this standard. SystemC is an ISO standard C++ class library for system and hardware design for use by designers and architects who need to address complex systems that are a hybrid between hardware and software. This standard provides a precise and complete definition of the SystemC class library so that a SystemC implementation can be developed with reference to this standard alone. The primary audiences for this standard are the implementors of the SystemC class library, the implementors of tools supporting the class library, and the users of the class library},
isbn = {978-1-5044-9867-8}, isbn = {978-1-5044-9867-8},
langid = {english}, langid = {english},
@@ -624,6 +624,14 @@
file = {/home/derek/Nextcloud/Verschiedenes/Zotero/storage/46IIZIMH/2023 - 1666-2023 - IEEE Standard for Standard SystemC Lan.pdf} file = {/home/derek/Nextcloud/Verschiedenes/Zotero/storage/46IIZIMH/2023 - 1666-2023 - IEEE Standard for Standard SystemC Lan.pdf}
} }
@online{tesla2018,
title = {{{NVIDIA Tesla V100 PCIe}} 32 {{GB Specs}}},
author = {{techpowerup.com}},
date = {2018},
url = {https://www.techpowerup.com/gpu-specs/tesla-v100-pcie-32-gb.c3184},
urldate = {2024-03-07}
}
@online{touvron2023, @online{touvron2023,
title = {{{LLaMA}}: {{Open}} and {{Efficient Foundation Language Models}}}, title = {{{LLaMA}}: {{Open}} and {{Efficient Foundation Language Models}}},
shorttitle = {{{LLaMA}}}, shorttitle = {{{LLaMA}}},
@@ -639,6 +647,14 @@
file = {/home/derek/Nextcloud/Verschiedenes/Zotero/storage/MGQYNDPQ/Touvron et al. - 2023 - LLaMA Open and Efficient Foundation Language Mode.pdf;/home/derek/Nextcloud/Verschiedenes/Zotero/storage/YDAT8K7L/2302.html} file = {/home/derek/Nextcloud/Verschiedenes/Zotero/storage/MGQYNDPQ/Touvron et al. - 2023 - LLaMA Open and Efficient Foundation Language Mode.pdf;/home/derek/Nextcloud/Verschiedenes/Zotero/storage/YDAT8K7L/2302.html}
} }
@online{vega2017,
title = {{{AMD Radeon RX Vega}} 56 {{Specs}}},
author = {{techpowerup.com}},
date = {2017},
url = {https://www.techpowerup.com/gpu-specs/radeon-rx-vega-56.c2993},
urldate = {2024-03-07}
}
@article{zou2021, @article{zou2021,
title = {Breaking the von {{Neumann}} Bottleneck: Architecture-Level Processing-in-Memory Technology}, title = {Breaking the von {{Neumann}} Bottleneck: Architecture-Level Processing-in-Memory Technology},
shorttitle = {Breaking the von {{Neumann}} Bottleneck}, shorttitle = {Breaking the von {{Neumann}} Bottleneck},

View File

@@ -85,13 +85,13 @@
\setcounter{page}{1} \setcounter{page}{1}
% Chapters % Chapters
% \include{chapters/introduction} \include{chapters/introduction}
% \include{chapters/dram} \include{chapters/dram}
% \include{chapters/pim} \include{chapters/pim}
% \include{chapters/vp} \include{chapters/vp}
% \include{chapters/implementation} \include{chapters/implementation}
\include{chapters/results} \include{chapters/results}
% \include{chapters/conclusion} \include{chapters/conclusion}
% Appendix % Appendix
\appendix \appendix

View File

@@ -4,7 +4,7 @@
width=0.9\textwidth, width=0.9\textwidth,
ybar=1pt, ybar=1pt,
bar width = 15pt, bar width = 15pt,
ymin=0.1, ymin=0,
ymax=10, ymax=10,
ymajorgrids, ymajorgrids,
ylabel={Relative Performance}, ylabel={Relative Performance},

View File

@@ -4,8 +4,8 @@
width=0.9\textwidth, width=0.9\textwidth,
ybar=1pt, ybar=1pt,
bar width = 15pt, bar width = 15pt,
ymin=0.1, ymin=0,
ymax=75, ymax=80,
ymajorgrids, ymajorgrids,
ylabel={Relative Performance}, ylabel={Relative Performance},
tick pos=left, tick pos=left,

View File

@@ -1,5 +1,5 @@
level,vadd,vmul,haxpy,gemv,dnn level,vadd,vmul,haxpy,gemv,dnn
X1,911446480,911416480,954454480,951904860,536177760 X1,911403240,911388240,954411240,907312430,301697880
X2,1822806480,1822776480,1908822480,1814530860,738329760 X2,1822763240,1822748240,1908779240,1769985430,504205880
X3,3645526480,3645496480,3817558480,6990944860,1547139760 X3,3645483240,3645468240,3817515240,6946352430,1312770880
X4,7290966480,7290936480,7635030480,13892610860,4782339760 X4,7290923240,7290908240,7634987240,13848065430,4547969880
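The relative-performance values plotted later can, in principle, be derived from such runtime CSVs by dividing a baseline's cycle count by the \ac{pim} cycle count; the file contents below are hypothetical round numbers, not the thesis data:

```python
import csv
import io

# Hypothetical runtime CSVs in the same shape as the files above; the
# numbers are made-up round values, not the thesis measurements.
baseline_csv = """level,gemv
X1,9073124300
X2,17699854300
"""
pim_csv = """level,gemv
X1,907312430
X2,1769985430
"""

def read_runtimes(text):
    """Map workload level -> cycle count for one column of a runtime CSV."""
    return {row["level"]: float(row["gemv"])
            for row in csv.DictReader(io.StringIO(text))}

baseline = read_runtimes(baseline_csv)
pim = read_runtimes(pim_csv)

# Relative performance (speedup) = baseline cycles / PIM cycles.
speedup = {level: baseline[level] / pim[level] for level in baseline}
print(speedup)  # {'X1': 10.0, 'X2': 10.0}
```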

View File

@@ -1,5 +1,5 @@
level,vadd,vmul,haxpy,gemv,dnn level,vadd,vmul,haxpy,gemv,dnn
X1,1475510346,1475512344,1543044078,1377734886,933823908 X1,1475478045,1475481042,1542998124,1300650714,514715103
X2,2950962084,2950964082,3085992252,2601142920,1220409702 X2,2950928118,2950925121,3085945965,2524123683,801373158
X3,5901852240,5901848244,6171893928,9942655392,2367353610 X3,5901817275,5901817941,6171846975,9865572552,1948383000
X4,11803639878,11803641876,12343693950,19731271644,6955629408 X4,11803603914,11803603914,12343646997,19654106886,6536710746

View File

@@ -49,9 +49,9 @@
\addplot[fill=_orange!90] table [x expr=\coordindex, y={gemv}]{\hbmpim}; \addplot[fill=_orange!90] table [x expr=\coordindex, y={gemv}]{\hbmpim};
\addlegendentry{PIM ARM} \addlegendentry{PIM ARM}
\addplot[fill=_yellow!90] table [x expr=\coordindex, y={gemv}]{\hbminf}; \addplot[fill=_yellow!90] table [x expr=\coordindex, y={gemv}]{\hbminf};
\addlegendentry{Non-PIM Inf} \addlegendentry{Non-PIM Inf.}
\addplot[fill=_green!90] table [x expr=\coordindex, y={gemv}]{\piminf}; \addplot[fill=_green!90] table [x expr=\coordindex, y={gemv}]{\piminf};
\addlegendentry{PIM Inf} \addlegendentry{PIM Inf.}
\addplot[fill=_darkblue!90] table [x expr=\coordindex, y={gemv}]{\vega}; \addplot[fill=_darkblue!90] table [x expr=\coordindex, y={gemv}]{\vega};
\addlegendentry{Vega} \addlegendentry{Vega}
\addplot[fill=violet!90] table [x expr=\coordindex, y={gemv}]{\tesla}; \addplot[fill=violet!90] table [x expr=\coordindex, y={gemv}]{\tesla};

View File

@@ -62,9 +62,9 @@
\addplot[fill=_orange!90] table [x expr=\coordindex, y={vmul}]{\hbmpim}; \addplot[fill=_orange!90] table [x expr=\coordindex, y={vmul}]{\hbmpim};
\addlegendentry{PIM ARM} \addlegendentry{PIM ARM}
\addplot[fill=_yellow!90] table [x expr=\coordindex, y={vmul}]{\hbminf}; \addplot[fill=_yellow!90] table [x expr=\coordindex, y={vmul}]{\hbminf};
\addlegendentry{Non-PIM Inf} \addlegendentry{Non-PIM Inf.}
\addplot[fill=_green!90] table [x expr=\coordindex, y={vmul}]{\piminf}; \addplot[fill=_green!90] table [x expr=\coordindex, y={vmul}]{\piminf};
\addlegendentry{PIM Inf} \addlegendentry{PIM Inf.}
\addplot[fill=_darkblue!90] table [x expr=\coordindex, y={vmul}]{\vega}; \addplot[fill=_darkblue!90] table [x expr=\coordindex, y={vmul}]{\vega};
\addlegendentry{Vega} \addlegendentry{Vega}
\addplot[fill=violet!90] table [x expr=\coordindex, y={vmul}]{\tesla}; \addplot[fill=violet!90] table [x expr=\coordindex, y={vmul}]{\tesla};

Binary file not shown.

BIN
src/plots/samsung_old.pdf Normal file

Binary file not shown.

View File

@@ -1,5 +1,5 @@
level,gemv,dnn level,gemv,dnn
X1,8.316378361593825,0.3293391169376365 X1,8.725110246753701,0.5853017926410354
X2,8.707496426927674,2.520540889480061 X2,8.926639006288317,3.6909334536122427
X3,8.952627753954278,4.565606038073768 X3,9.010099560986427,5.380703318160134
X4,9.178586247394538,5.717860064798073 X4,9.208111243015697,6.012517728019782

View File

@@ -1,5 +1,5 @@
level,gemv,dnn level,gemv,dnn
X1,52.86562483836241,5.894128466670185 X1,55.99875110667106,10.693445843962344
X2,58.14176507187079,17.599027693570402 X2,59.9158597463229,26.801526765137798
X3,61.79055586904123,36.290342473171975 X3,62.27334503373079,44.09403760041019
X4,62.23855088820042,46.32048031346181 X4,62.48290809788771,49.2890243396736

View File

@@ -1,5 +1,5 @@
 level,vadd,vmul,haxpy
-X1,12.912332482758615,10.706896577073085,17.57261802574388
-X2,12.656964545133722,10.410011429377231,17.530374532261376
-X3,12.857948841452387,10.179649930700249,17.28682620992881
-X4,12.517518527941442,10.158700762676236,17.568276160961705
+X1,12.912945086743383,10.707228337727948,17.57341416054572
+X2,12.657264796496554,10.41017271260676,17.530771651728568
+X3,12.858101352840125,10.179728788420332,17.287022013303083
+X4,12.5175927651105,10.158740110546228,17.568375657167437

View File

@@ -1,5 +1,5 @@
 level,vadd,vmul,haxpy
-X1,14.459220593631812,13.63052633194372,23.727849658355645
-X2,14.66245011706494,13.551652528382078,23.822458866951166
-X3,14.484534634672588,13.652770583844921,23.665607976080565
-X4,14.143299569834605,13.81101374775393,23.87690729103017
+X1,14.45953713326856,13.630815500508477,23.72855632778462
+X2,14.662618885927062,13.55183145055479,23.822816186932165
+X3,14.484620447521396,13.652840684263325,23.665788014778187
+X4,14.143342662573861,13.811058165857734,23.876998114465763

View File

@@ -1,21 +1,21 @@
 workload,level,hbm,pim
-VADD,X1,11768899990,455723240
-VADD,X2,23071196990,911403240
-VADD,X3,46873992980,1822763240
-VADD,X4,91264808000,3645483240
-VMUL,X1,9758441990,455708240
-VMUL,X2,18975123990,911388240
-VMUL,X3,37109877990,1822748240
-VMUL,X4,74066441980,3645468240
-HAXPY,X1,16772264000,477227240
-HAXPY,X2,33462372990,954411240
-HAXPY,X3,65993469990,1908779240
-HAXPY,X4,134134323970,3817515240
-GEMV,X1,7916400980,475952430
-GEMV,X2,15800020980,907265430
-GEMV,X3,62587326980,3495472430
-GEMV,X4,127514526980,6946305430
-DNN,X1,176584310,268088880
-DNN,X2,1860990350,369164880
-DNN,X3,7063630630,773569880
-DNN,X4,27344749530,2391169880
+VADD,X1,11768899990,911403240
+VADD,X2,23071196990,1822763240
+VADD,X3,46873992980,3645483240
+VADD,X4,91264808000,7290923240
+VMUL,X1,9758441990,911388240
+VMUL,X2,18975123990,1822748240
+VMUL,X3,37109877990,3645468240
+VMUL,X4,74066441980,7290908240
+HAXPY,X1,16772264000,954411240
+HAXPY,X2,33462372990,1908779240
+HAXPY,X3,65993469990,3817515240
+HAXPY,X4,134134323970,7634987240
+GEMV,X1,7916400980,907312430
+GEMV,X2,15800020980,1769985430
+GEMV,X3,62587326980,6946352430
+GEMV,X4,127514526980,13848065430
+DNN,X1,176584310,301697880
+DNN,X2,1860990350,504205880
+DNN,X3,7063630630,1312770880
+DNN,X4,27344749530,4547969880

View File

@@ -1,21 +1,21 @@
 workload,level,hbm,pim
-VADD,X1,21334729581,737755173
-VADD,X2,43268334354,1475481042
-VADD,X3,85485583179,2950926120
-VADD,X4,166942414809,5901819939
-VMUL,X1,20112009858,737756172
-VMUL,X2,39990439863,1475482041
-VMUL,X3,80576580096,2950924122
-VMUL,X4,163020260223,5901820938
-HAXPY,X1,36613117899,771522039
-HAXPY,X2,73515923487,1542996126
-HAXPY,X3,146061622170,3085946964
-HAXPY,X4,294729236073,6171846975
-GEMV,X1,72834815610,688867443
-GEMV,X2,151235040573,1300571460
-GEMV,X3,614362203486,4971327696
-GEMV,X4,1228045754304,9865635822
-DNN,X1,5504078079,466911954
-DNN,X2,21478024143,610204851
-DNN,X3,85912073262,1183676805
-DNN,X4,322188095061,3477814704
+VADD,X1,21334729581,1475478045
+VADD,X2,43268334354,2950928118
+VADD,X3,85485583179,5901817275
+VADD,X4,166942414809,11803603914
+VMUL,X1,20112009858,1475481042
+VMUL,X2,39990439863,2950925121
+VMUL,X3,80576580096,5901817941
+VMUL,X4,163020260223,11803603914
+HAXPY,X1,36613117899,1542998124
+HAXPY,X2,73515923487,3085945965
+HAXPY,X3,146061622170,6171846975
+HAXPY,X4,294729236073,12343646997
+GEMV,X1,72834815610,1300650714
+GEMV,X2,151235040573,2524123683
+GEMV,X3,614362203486,9865572552
+GEMV,X4,1228045754304,19654106886
+DNN,X1,5504078079,514715103
+DNN,X2,21478024143,801373158
+DNN,X3,85912073262,1948383000
+DNN,X4,322188095061,6536710746

View File

@@ -1,21 +1,21 @@
 workload,level,vega,tesla
-VADD,X1,69572650,TODO
-VADD,X2,123217536,TODO
-VADD,X3,207693503,TODO
-VADD,X4,378089165,TODO
-VMUL,X1,67408281,TODO
-VMUL,X2,103994272,TODO
-VMUL,X3,182162140,TODO
-VMUL,X4,350280326,TODO
-HAXPY,X1,69791189,TODO
-HAXPY,X2,123543145,TODO
-HAXPY,X3,207947543,TODO
-HAXPY,X4,377434890,TODO
-GEMV,X1,750246152,TODO
-GEMV,X2,648714601,TODO
-GEMV,X3,2454455479,TODO
-GEMV,X4,4968984949,TODO
-DNN,X1,231093065,TODO
-DNN,X2,431703456,TODO
-DNN,X3,877622611,TODO
-DNN,X4,2175751385,TODO
+VADD,X1,69572650,69572650
+VADD,X2,123217536,123217536
+VADD,X3,207693503,207693503
+VADD,X4,378089165,378089165
+VMUL,X1,67408281,67408281
+VMUL,X2,103994272,103994272
+VMUL,X3,182162140,182162140
+VMUL,X4,350280326,350280326
+HAXPY,X1,69791189,69791189
+HAXPY,X2,123543145,123543145
+HAXPY,X3,207947543,207947543
+HAXPY,X4,377434890,377434890
+GEMV,X1,750246152,750246152
+GEMV,X2,648714601,648714601
+GEMV,X3,2454455479,2454455479
+GEMV,X4,4968984949,4968984949
+DNN,X1,231093065,231093065
+DNN,X2,431703456,431703456
+DNN,X3,877622611,877622611
+DNN,X4,2175751385,2175751385