Vector simulations

This commit is contained in:
2024-03-01 15:34:40 +01:00
parent ee2405aaa9
commit 47796cdae5
9 changed files with 115 additions and 31 deletions

View File

@@ -94,7 +94,7 @@ The workloads adhere to the following calculation patterns:
Each workload is run with different input vector dimensions to examine the effect of setup overhead and potentially identify a break-even point at which \ac{pim} becomes viable.
\Cref{tab:dimensions_vector} lists the specific vector dimensions for the following benchmarks.
The levels X1-X4 denote the increasing dimensions, with each successive level doubling in size, starting at 256, which is the minimum size that can be represented in a \ac{pim} data structure.
\begin{table}
\centering
@@ -108,7 +108,7 @@ The levels X1-X4 denote the increasing dimensions, with each step doubling in si
hline{2} = {-}{solid,black},
hline{2} = {2}{-}{solid,black},
}
Level & Vector Dimensions \\
X1 & (256 $\times$ 1) \\
X2 & (512 $\times$ 1) \\
X3 & (1024 $\times$ 1) \\
@@ -118,9 +118,9 @@ X4 & (2048 $\times$ 1)
\label{tab:dimensions_vector}
\end{table}
The benchmarks analyze the relative number of processor ticks for \ac{pim} compared to non-\ac{pim}, where the speedup $S$ is calculated as follows:
\begin{equation}
S = \frac{\textrm{\#ticks in non-\ac{pim} mode}}{\textrm{\#ticks in \ac{pim} mode}}
\end{equation}
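The metric can be sketched as a one-line helper; the tick counts in the example are purely illustrative placeholders, not measured values:

```python
def speedup(ticks_non_pim: int, ticks_pim: int) -> float:
    """Speedup S: ratio of processor ticks in non-PIM mode to ticks in PIM mode.
    Values above 1 mean the PIM offload is faster than the CPU-only baseline."""
    return ticks_non_pim / ticks_pim

# Hypothetical tick counts, for illustration only:
print(speedup(64_000, 5_000))  # 12.8
```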
\begin{figure}
@@ -130,13 +130,27 @@ S = \frac{\textrm{# of ticks in non-\ac{pim} mode}}{# of ticks in \ac{pim} mode}
\label{fig:vector_normal}
\end{figure}
\Cref{fig:vector_normal} shows the relative performance for the vector benchmarks, running on the generic ARM-based system at a typical clock frequency.
The relative speedup of \ac{pim} is in the range of about $\qtyrange{12.8}{31.8}{\times}$ with limited variance for each benchmark between the different vector dimensions, since such vector operations essentially scale linearly with the length of the input operands for both the non-\ac{pim} and \ac{pim} approaches.
The \ac{haxpy} benchmark shows the highest variance with a range of $\qtyrange{19.8}{31.8}{\times}$, because each element of one input vector must first be multiplied by a scalar on the \ac{cpu} before the addition operation, while in the \ac{pim} case the specialized \ac{mad} instruction performs both operations in a single step.
As all speedup values are well above 1, it can be concluded that even the smallest representable vector size of 256 is already above the break-even point at which \ac{pim} becomes viable.
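The difference between the two execution paths can be sketched in plain Python; the function names are illustrative, and the listing only mirrors the data flow described above (scale-then-add on the \ac{cpu} versus a fused multiply-add):

```python
def haxpy_two_step(a, x, y):
    # Non-PIM path: first scale x on the CPU in a separate pass...
    scaled = [a * xi for xi in x]
    # ...then perform the element-wise addition as a second pass.
    return [si + yi for si, yi in zip(scaled, y)]

def haxpy_fused(a, x, y):
    # PIM path: a single multiply-add (MAD) per element pair.
    return [a * xi + yi for xi, yi in zip(x, y)]

x, y = [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]
assert haxpy_two_step(2.0, x, y) == haxpy_fused(2.0, x, y) == [6.0, 9.0, 12.0]
```

Both variants compute the same result $y \leftarrow a \cdot x + y$; the point is that the fused form touches each element pair only once.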
\begin{figure}
\centering
\input{plots/vector_infinite}
\caption{Comparison between non-\ac{pim} and \ac{pim} for the vector benchmarks running on the infinite compute platform.}
\label{fig:vector_infinite}
\end{figure}
In addition to the generic ARM-based system, the same benchmarks were run on the hypothetical infinite compute system, the results of which are shown in \cref{fig:vector_infinite}.
As can be seen, the achievable speedup in the completely memory-bound system, at $\qtyrange{1.7}{2.4}{\times}$, is lower than in the generic system.
The variance of the speedup between the different vector dimensions is also rather small.
For the \ac{haxpy} benchmark, the smaller variance of $\qtyrange{2.0}{2.4}{\times}$ can be interpreted as follows:
the additional computation step of the scalar multiplication does not slow down the non-\ac{pim} system as much as in the previous case, because it is insignificant compared to the memory fetches of the vector elements.
% vectors: essentially both scale with the length of the vector, marginally less overhead
% haxpy: the scalar multiplication slows the CPU down considerably, which is why this difference disappears at 100GHz
\subsubsection{Neural Network Layers}
% GEMV
% Samsung 7.4x-8.9x
@@ -145,20 +159,61 @@ S = \frac{\textrm{# of ticks in non-\ac{pim} mode}}{# of ticks in \ac{pim} mode}
% GEMM with heavily interleaved matrices (probably not)
In addition to the vector operations and the level 1 \ac{blas} routine \ac{haxpy}, the performance improvement of \ac{pim} is also analyzed for the level 2 \ac{blas} routine \ac{gemv}.
Besides the regular \ac{gemv} operation, whose form is $y = A \cdot x$, several matrix-vector multiplications are chained together with the activation function \ac{relu} applied in between, modeling a simple fully connected neural network.
Each processing step for a \ac{dnn} layer can be described as $y = \textrm{ReLU}(A \cdot x)$, where the output of the operation is fed as input to the next layer.
In the simplest form, square matrix dimensions ensure that the output vector has the same dimensions as the input vector, which simplifies chaining in the benchmark.
Again, several different input dimensions are used; the matrix dimensions for each of the two benchmarks are given in \cref{tab:dimensions_matrix}.
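The layer chaining described above can be sketched as follows; this is plain Python with illustrative names and a toy 2$\times$2 weight matrix, not the benchmark implementation:

```python
def relu(v):
    return [max(0.0, vi) for vi in v]

def matvec(A, x):
    # GEMV: y = A . x, with A given as a list of rows.
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in A]

def dnn_layers(weights, x):
    # Each layer computes y = ReLU(A . x); the output feeds the next layer.
    # Square matrices keep the vector length constant across layers.
    for A in weights:
        x = relu(matvec(A, x))
    return x

A = [[1.0, -1.0], [0.5, 0.5]]          # one illustrative 2x2 layer
print(dnn_layers([A, A], [1.0, 2.0]))  # [0.0, 0.75]
```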
\begin{table}
\centering
\begin{tblr}{
cell{2}{2} = {r},
cell{3}{2} = {r},
cell{4}{2} = {r},
cell{5}{2} = {r},
cell{2}{3} = {r},
cell{3}{3} = {r},
cell{4}{3} = {r},
cell{5}{3} = {r},
hlines,
vlines,
hline{2} = {-}{solid,black},
hline{2} = {2}{-}{solid,black},
}
Level & \ac{gemv} Matrix Dimensions & \ac{dnn} Matrix Dimensions \\
X1 & (128 $\times$ 128) & (128 $\times$ 128) \\
X2 & (256 $\times$ 128) & (256 $\times$ 256) \\
X3 & (512 $\times$ 128) & (512 $\times$ 512) \\
X4 & (1024 $\times$ 128) & (1024 $\times$ 1024)
\end{tblr}
\caption{List of the matrix dimensions for the neural network benchmarks.}
\label{tab:dimensions_matrix}
\end{table}
\begin{figure}
\centering
\input{plots/matrix_normal}
\caption{Comparison between non-\ac{pim} and \ac{pim} for the neural network benchmarks running on the generic ARM-based system.}
\label{fig:matrix_normal}
\end{figure}
\begin{figure}
\centering
\input{plots/matrix_infinite}
\caption{Comparison between non-\ac{pim} and \ac{pim} for the neural network benchmarks running on the infinite compute platform.}
\label{fig:matrix_infinite}
\end{figure}
\subsubsection{Comparison to Samsung's Simulation Results}
\begin{figure}
\centering
\includegraphics[width=0.8\linewidth]{plots/samsung}
\caption{Speedup reported in Samsung's simulation results.}
\label{fig:samsung_speedup}
\end{figure}
\subsubsection{Comparison to Real Hardware}
% \subsubsection{Initialization Overhead}

View File

@@ -128,6 +128,22 @@
file = {/home/derek/Nextcloud/Verschiedenes/Zotero/storage/T3PBGTZZ/Ghose et al. - 2019 - Processing-in-memory A workload-driven perspectiv.pdf}
}
@online{giannoula2024,
title = {Accelerating {{Graph Neural Networks}} on {{Real Processing-In-Memory Systems}}},
author = {Giannoula, Christina and Yang, Peiming and Vega, Ivan Fernandez and Yang, Jiacheng and Li, Yu Xin and Luna, Juan Gomez and Sadrosadati, Mohammad and Mutlu, Onur and Pekhimenko, Gennady},
date = {2024-02-26},
eprint = {2402.16731},
eprinttype = {arxiv},
eprintclass = {cs},
url = {http://arxiv.org/abs/2402.16731},
urldate = {2024-02-29},
abstract = {Graph Neural Networks (GNNs) are emerging ML models to analyze graph-structure data. Graph Neural Network (GNN) execution involves both compute-intensive and memoryintensive kernels, the latter dominates the total time, being significantly bottlenecked by data movement between memory and processors. Processing-In-Memory (PIM) systems can alleviate this data movement bottleneck by placing simple processors near or inside to memory arrays. In this work, we introduce PyGim, an efficient ML framework that accelerates GNNs on real PIM systems. We propose intelligent parallelization techniques for memory-intensive kernels of GNNs tailored for real PIM systems, and develop handy Python API for them. We provide hybrid GNN execution, in which the compute-intensive and memory-intensive kernels are executed in processor-centric and memory-centric computing systems, respectively, to match their algorithmic nature. We extensively evaluate PyGim on a real-world PIM system with 1992 PIM cores using emerging GNN models, and demonstrate that it outperforms its state-of-the-art CPU counterpart on Intel Xeon by on average 3.04×, and achieves higher resource utilization than CPU and GPU systems. Our work provides useful recommendations for software, system and hardware designers. PyGim will be open-sourced to enable the widespread use of PIM systems in GNNs.},
langid = {english},
pubstate = {preprint},
keywords = {Computer Science - Distributed Parallel and Cluster Computing,Computer Science - Hardware Architecture,Computer Science - Machine Learning,Computer Science - Performance},
file = {/home/derek/Nextcloud/Verschiedenes/Zotero/storage/WFEPGE5V/Giannoula et al. - 2024 - Accelerating Graph Neural Networks on Real Process.pdf}
}
@online{gomez-luna2022,
title = {Benchmarking a {{New Paradigm}}: {{An Experimental Analysis}} of a {{Real Processing-in-Memory Architecture}}},
shorttitle = {Benchmarking a {{New Paradigm}}},
@@ -415,6 +431,13 @@
file = {/home/derek/Nextcloud/Verschiedenes/Zotero/storage/E6FRVMZ3/Nielsen - 2015 - Neural networks and deep learning.pdf}
}
@article{oliveira,
title = {{{PUMA}}: {{Efficient}} and {{Low-Cost Memory Allocation}} and {{Alignment Support}} for {{Processing-Using-Memory Architectures}}},
author = {Oliveira, Geraldo F and Esposito, Emanuele G and Gómez-Luna, Juan and Mutlu, Onur},
langid = {english},
file = {/home/derek/Nextcloud/Verschiedenes/Zotero/storage/RY2GICEL/Oliveira et al. - PUMA Efficient and Low-Cost Memory Allocation and .pdf}
}
@online{oliveira2023,
title = {{{DaPPA}}: {{A Data-Parallel Framework}} for {{Processing-in-Memory Architectures}}},
shorttitle = {{{DaPPA}}},

View File

@@ -5,11 +5,12 @@
width=0.9\textwidth,
ybar=1pt,
bar width = 15pt,
ymin=0.1,
ymax=100,
ymode=log,
log origin=infty,
ymajorgrids,
ylabel={Relative Performance},
tick pos=left,
xtick=data,
xticklabels from table={\gemv}{level},
@@ -26,6 +27,6 @@
\addlegendentry{GEMV}
\addplot[fill=_orange!90] table [x expr=\coordindex, y={speedup}]{\gemvlayers};
\addlegendentry{DNN}
\end{axis}
\end{tikzpicture}

View File

@@ -5,11 +5,13 @@
width=0.9\textwidth,
ybar=1pt,
bar width = 15pt,
ymin=0.1,
ymax=100,
ymode=log,
log origin=infty,
% minor y tick num = 5,
ymajorgrids,
ylabel={Relative Performance},
tick pos=left,
xtick=data,
xticklabels from table={\gemv}{level},
@@ -26,6 +28,6 @@
\addlegendentry{GEMV}
\addplot[fill=_orange!90] table [x expr=\coordindex, y={speedup}]{\gemvlayers};
\addlegendentry{DNN}
\end{axis}
\end{tikzpicture}

BIN
src/plots/samsung.pdf Normal file

Binary file not shown.

View File

@@ -1,5 +1,5 @@
workload,level,frequency,speedup
gemv_layers,X1,100GHz,0.17890250001597863
gemv_layers,X2,100GHz,0.6097840333112959
gemv_layers,X3,100GHz,3.9637284525723304
gemv_layers,X4,100GHz,6.088778065749799

View File

@@ -1,5 +1,5 @@
workload,level,frequency,speedup
gemv_layers,X1,3GHz,2.992752194063702
gemv_layers,X2,3GHz,11.246371082010572
gemv_layers,X3,3GHz,34.94598413478715
gemv_layers,X4,3GHz,72.33604077371677

View File

@@ -3,14 +3,15 @@
\pgfplotstableread[col sep=comma]{plots/tables/vmul_100GHz.csv}\vmul
\pgfplotstableread[col sep=comma]{plots/tables/haxpy_100GHz.csv}\haxpy
\begin{axis}[
width=0.8\textwidth,
ybar=1pt,
bar width = 15pt,
ymin=0,
ymax=5,
% ymode=log,
% log origin=infty,
ymajorgrids,
ylabel={Relative Performance},
tick pos=left,
xtick=data,
xticklabels from table={\vadd}{level},

View File

@@ -3,14 +3,16 @@
\pgfplotstableread[col sep=comma]{plots/tables/vmul_3GHz.csv}\vmul
\pgfplotstableread[col sep=comma]{plots/tables/haxpy_3GHz.csv}\haxpy
\begin{axis}[
width=0.8\textwidth,
ybar=1pt,
bar width = 15pt,
ymin=0,
ymax=35,
% ymode=log,
% log origin=infty,
% minor y tick num = 5,
ymajorgrids,
ylabel={Relative Performance},
tick pos=left,
xtick=data,
xticklabels from table={\vadd}{level},