Vector simulations

This commit is contained in:
2024-03-01 15:34:40 +01:00
parent ee2405aaa9
commit 47796cdae5
9 changed files with 115 additions and 31 deletions

View File

@@ -94,7 +94,7 @@ The workloads adhere to the following calculation patterns:
Each workload is run with different input vector dimensions to examine the effect of setup overhead and potentially identify a break-even point at which \ac{pim} becomes viable.
\Cref{tab:dimensions_vector} lists the specific vector dimensions for the following benchmarks.
The levels X1-X4 denote the increasing dimensions, with each successive level doubling in size, starting at 256, which is the minimum size that can be represented in a \ac{pim} data structure.
\begin{table}
\centering
@@ -108,7 +108,7 @@ The levels X1-X4 denote the increasing dimensions, with each step doubling in si
hline{2} = {-}{solid,black},
hline{2} = {2}{-}{solid,black},
}
Level & Vector Dimensions \\
X1 & (256 $\times$ 1) \\
X2 & (512 $\times$ 1) \\
X3 & (1024 $\times$ 1) \\
@@ -118,9 +118,9 @@ X4 & (2048 $\times$ 1)
\label{tab:dimensions_vector}
\end{table}
The benchmarks analyze the relative number of processor ticks for \ac{pim} compared to non-\ac{pim}, where the speedup $S$ is calculated as follows:
\begin{equation}
S = \frac{\textrm{\#ticks in non-\ac{pim} mode}}{\textrm{\#ticks in \ac{pim} mode}}
\end{equation}
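The metric can be sketched as a one-line helper; the tick counts in the example are purely illustrative placeholders, not measured values:

```python
def speedup(ticks_non_pim: int, ticks_pim: int) -> float:
    """Speedup S: ratio of processor ticks in non-PIM mode to ticks in PIM mode.
    Values above 1 mean the PIM offload is faster than the CPU-only baseline."""
    return ticks_non_pim / ticks_pim

# Hypothetical tick counts, for illustration only:
print(speedup(64_000, 5_000))  # 12.8
```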
\begin{figure}
@@ -130,13 +130,27 @@ S = \frac{\textrm{# of ticks in non-\ac{pim} mode}}{# of ticks in \ac{pim} mode}
\label{fig:vector_normal}
\end{figure}
\Cref{fig:vector_normal} shows the relative performance for the vector benchmarks, running on the generic ARM-based system at a typical clock frequency.
The relative speedup of \ac{pim} is in the range of about $\qtyrange{12.8}{31.8}{\times}$ with limited variance for each benchmark between the different vector dimensions, since such vector operations essentially scale linearly with the length of the input operands for both the non-\ac{pim} and \ac{pim} approaches.
The \ac{haxpy} benchmark shows the highest variance with a range of $\qtyrange{19.8}{31.8}{\times}$, because each element of one input vector must first be multiplied by a scalar on the \ac{cpu} before the addition operation, while in the \ac{pim} case the specialized \ac{mad} instruction performs both operations in a single step.
As all speedup values are well above 1, it can be concluded that even the smallest representable vector size of 256 is already above the break-even point at which \ac{pim} becomes viable.
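The difference between the two execution paths can be sketched in plain Python; the function names are illustrative, and the listing only mirrors the data flow described above (scale-then-add on the \ac{cpu} versus a fused multiply-add):

```python
def haxpy_two_step(a, x, y):
    # Non-PIM path: first scale x on the CPU in a separate pass...
    scaled = [a * xi for xi in x]
    # ...then perform the element-wise addition as a second pass.
    return [si + yi for si, yi in zip(scaled, y)]

def haxpy_fused(a, x, y):
    # PIM path: a single multiply-add (MAD) per element pair.
    return [a * xi + yi for xi, yi in zip(x, y)]

x, y = [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]
assert haxpy_two_step(2.0, x, y) == haxpy_fused(2.0, x, y) == [6.0, 9.0, 12.0]
```

Both variants compute the same result $y \leftarrow a \cdot x + y$; the point is that the fused form touches each element pair only once.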
\begin{figure}
\centering
\input{plots/vector_infinite}
\caption{Comparison between non-\ac{pim} and \ac{pim} for the vector benchmarks running on the infinite compute platform.}
\label{fig:vector_infinite}
\end{figure}
In addition to the generic ARM-based system, the same benchmarks were run on the hypothetical infinite compute system, the results of which are shown in \cref{fig:vector_infinite}.
As can be seen, the achievable speedup in the completely memory-bound system, at $\qtyrange{1.7}{2.4}{\times}$, is lower than in the generic system.
The variance of the speedup between the different vector dimensions is also rather small.
For the \ac{haxpy} benchmark, the smaller variance of $\qtyrange{2.0}{2.4}{\times}$ can be interpreted as follows:
the additional computation step of the scalar multiplication does not slow down the non-\ac{pim} system as much as in the previous case, because it is insignificant compared to the memory fetches of the vector elements.
% vectors: essentially both scale with the length of the vector, marginally less overhead
% haxpy: the scalar multiplication slows the CPU down considerably, which is why this difference disappears at 100GHz
\subsubsection{Neural Network Layers}
% GEMV
% Samsung 7.4x-8.9x
@@ -145,20 +159,61 @@ S = \frac{\textrm{# of ticks in non-\ac{pim} mode}}{# of ticks in \ac{pim} mode}
% GEMM with heavily interleaved matrices (probably not)
In addition to the vector operations and the level 1 \ac{blas} routine \ac{haxpy}, the performance improvement of \ac{pim} is also analyzed for the level 2 \ac{blas} routine \ac{gemv}.
Besides the regular \ac{gemv} operation, whose form is $y = A \cdot x$, several matrix-vector multiplications are chained together with the activation function \ac{relu} applied in between, modeling a simple fully connected neural network.
Each processing step for a \ac{dnn} layer can be described as $y = \textrm{ReLU}(A \cdot x)$, where the output of the operation is fed as input to the next layer.
In the simplest form, square matrix dimensions ensure that the output vector has the same dimensions as the input vector, which simplifies chaining in the benchmark.
Again, several different input dimensions are used; the matrix dimensions for each of the two benchmarks are given in \cref{tab:dimensions_matrix}.
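The layer chaining described above can be sketched as follows; this is plain Python with illustrative names and a toy 2$\times$2 weight matrix, not the benchmark implementation:

```python
def relu(v):
    return [max(0.0, vi) for vi in v]

def matvec(A, x):
    # GEMV: y = A . x, with A given as a list of rows.
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in A]

def dnn_layers(weights, x):
    # Each layer computes y = ReLU(A . x); the output feeds the next layer.
    # Square matrices keep the vector length constant across layers.
    for A in weights:
        x = relu(matvec(A, x))
    return x

A = [[1.0, -1.0], [0.5, 0.5]]          # one illustrative 2x2 layer
print(dnn_layers([A, A], [1.0, 2.0]))  # [0.0, 0.75]
```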
\begin{table}
\centering
\begin{tblr}{
cell{2}{2} = {r},
cell{3}{2} = {r},
cell{4}{2} = {r},
cell{5}{2} = {r},
cell{2}{3} = {r},
cell{3}{3} = {r},
cell{4}{3} = {r},
cell{5}{3} = {r},
hlines,
vlines,
hline{2} = {-}{solid,black},
hline{2} = {2}{-}{solid,black},
}
Level & \ac{gemv} Matrix Dimensions & \ac{dnn} Matrix Dimensions \\
X1 & (128 $\times$ 128) & (128 $\times$ 128) \\
X2 & (256 $\times$ 128) & (256 $\times$ 256) \\
X3 & (512 $\times$ 128) & (512 $\times$ 512) \\
X4 & (1024 $\times$ 128) & (1024 $\times$ 1024)
\end{tblr}
\caption{List of the matrix dimensions for the neural network benchmarks.}
\label{tab:dimensions_matrix}
\end{table}
\begin{figure}
\centering
\input{plots/matrix_normal}
\caption{Comparison between non-\ac{pim} and \ac{pim} for the neural network benchmarks running on the generic ARM-based system.}
\label{fig:matrix_normal}
\end{figure}
\begin{figure}
\centering
\input{plots/matrix_infinite}
\caption{Comparison between non-\ac{pim} and \ac{pim} for the neural network benchmarks running on the infinite compute platform.}
\label{fig:matrix_infinite}
\end{figure}
\subsubsection{Comparison to Samsung's Simulation Results}
\begin{figure}
\centering
\includegraphics[width=0.8\linewidth]{plots/samsung}
\caption{Speedup reported in Samsung's simulation results.}
\label{fig:samsung_speedup}
\end{figure}
\subsubsection{Comparison to Real Hardware}
% \subsubsection{Initialization Overhead}

View File

@@ -128,6 +128,22 @@
file = {/home/derek/Nextcloud/Verschiedenes/Zotero/storage/T3PBGTZZ/Ghose et al. - 2019 - Processing-in-memory A workload-driven perspectiv.pdf}
}
@online{giannoula2024,
title = {Accelerating {{Graph Neural Networks}} on {{Real Processing-In-Memory Systems}}},
author = {Giannoula, Christina and Yang, Peiming and Vega, Ivan Fernandez and Yang, Jiacheng and Li, Yu Xin and Luna, Juan Gomez and Sadrosadati, Mohammad and Mutlu, Onur and Pekhimenko, Gennady},
date = {2024-02-26},
eprint = {2402.16731},
eprinttype = {arxiv},
eprintclass = {cs},
url = {http://arxiv.org/abs/2402.16731},
urldate = {2024-02-29},
abstract = {Graph Neural Networks (GNNs) are emerging ML models to analyze graph-structure data. Graph Neural Network (GNN) execution involves both compute-intensive and memoryintensive kernels, the latter dominates the total time, being significantly bottlenecked by data movement between memory and processors. Processing-In-Memory (PIM) systems can alleviate this data movement bottleneck by placing simple processors near or inside to memory arrays. In this work, we introduce PyGim, an efficient ML framework that accelerates GNNs on real PIM systems. We propose intelligent parallelization techniques for memory-intensive kernels of GNNs tailored for real PIM systems, and develop handy Python API for them. We provide hybrid GNN execution, in which the compute-intensive and memory-intensive kernels are executed in processor-centric and memory-centric computing systems, respectively, to match their algorithmic nature. We extensively evaluate PyGim on a real-world PIM system with 1992 PIM cores using emerging GNN models, and demonstrate that it outperforms its state-of-the-art CPU counterpart on Intel Xeon by on average 3.04×, and achieves higher resource utilization than CPU and GPU systems. Our work provides useful recommendations for software, system and hardware designers. PyGim will be open-sourced to enable the widespread use of PIM systems in GNNs.},
langid = {english},
pubstate = {preprint},
keywords = {Computer Science - Distributed Parallel and Cluster Computing,Computer Science - Hardware Architecture,Computer Science - Machine Learning,Computer Science - Performance},
file = {/home/derek/Nextcloud/Verschiedenes/Zotero/storage/WFEPGE5V/Giannoula et al. - 2024 - Accelerating Graph Neural Networks on Real Process.pdf}
}
@online{gomez-luna2022,
title = {Benchmarking a {{New Paradigm}}: {{An Experimental Analysis}} of a {{Real Processing-in-Memory Architecture}}},
shorttitle = {Benchmarking a {{New Paradigm}}},
@@ -415,6 +431,13 @@
file = {/home/derek/Nextcloud/Verschiedenes/Zotero/storage/E6FRVMZ3/Nielsen - 2015 - Neural networks and deep learning.pdf}
}
@article{oliveira,
title = {{{PUMA}}: {{Efficient}} and {{Low-Cost Memory Allocation}} and {{Alignment Support}} for {{Processing-Using-Memory Architectures}}},
author = {Oliveira, Geraldo F and Esposito, Emanuele G and Gómez-Luna, Juan and Mutlu, Onur},
langid = {english},
file = {/home/derek/Nextcloud/Verschiedenes/Zotero/storage/RY2GICEL/Oliveira et al. - PUMA Efficient and Low-Cost Memory Allocation and .pdf}
}
@online{oliveira2023,
title = {{{DaPPA}}: {{A Data-Parallel Framework}} for {{Processing-in-Memory Architectures}}},
shorttitle = {{{DaPPA}}},

View File

@@ -5,11 +5,12 @@
width=0.9\textwidth,
ybar=1pt,
bar width = 15pt,
ymin=0.1,
ymax=100,
ymode=log,
log origin=infty,
ymajorgrids,
ylabel={Relative Performance},
tick pos=left,
xtick=data,
xticklabels from table={\gemv}{level},
@@ -26,6 +27,6 @@
\addlegendentry{GEMV}
\addplot[fill=_orange!90] table [x expr=\coordindex, y={speedup}]{\gemvlayers};
\addlegendentry{DNN}
\end{axis}
\end{tikzpicture}

View File

@@ -5,11 +5,13 @@
width=0.9\textwidth,
ybar=1pt,
bar width = 15pt,
ymin=0.1,
ymax=100,
ymode=log,
log origin=infty,
% minor y tick num = 5,
ymajorgrids,
ylabel={Relative Performance},
tick pos=left,
xtick=data,
xticklabels from table={\gemv}{level},
@@ -26,6 +28,6 @@
\addlegendentry{GEMV}
\addplot[fill=_orange!90] table [x expr=\coordindex, y={speedup}]{\gemvlayers};
\addlegendentry{DNN}
\end{axis}
\end{tikzpicture}

BIN
src/plots/samsung.pdf Normal file

Binary file not shown.

View File

@@ -1,5 +1,5 @@
workload,level,frequency,speedup
gemv_layers,X1,100GHz,0.17890250001597863
gemv_layers,X2,100GHz,0.6097840333112959
gemv_layers,X3,100GHz,3.9637284525723304
gemv_layers,X4,100GHz,6.088778065749799

View File

@@ -1,5 +1,5 @@
workload,level,frequency,speedup
gemv_layers,X1,3GHz,2.992752194063702
gemv_layers,X2,3GHz,11.246371082010572
gemv_layers,X3,3GHz,34.94598413478715
gemv_layers,X4,3GHz,72.33604077371677

View File

@@ -3,14 +3,15 @@
\pgfplotstableread[col sep=comma]{plots/tables/vmul_100GHz.csv}\vmul
\pgfplotstableread[col sep=comma]{plots/tables/haxpy_100GHz.csv}\haxpy
\begin{axis}[
width=0.8\textwidth,
ybar=1pt,
bar width = 15pt,
ymin=0,
ymax=5,
% ymode=log,
% log origin=infty,
ymajorgrids,
ylabel={Relative Performance},
tick pos=left,
xtick=data,
xticklabels from table={\vadd}{level},

View File

@@ -3,14 +3,16 @@
\pgfplotstableread[col sep=comma]{plots/tables/vmul_3GHz.csv}\vmul
\pgfplotstableread[col sep=comma]{plots/tables/haxpy_3GHz.csv}\haxpy
\begin{axis}[
width=0.8\textwidth,
ybar=1pt,
bar width = 15pt,
ymin=0,
ymax=35,
% ymode=log,
% log origin=infty,
% minor y tick num = 5,
ymajorgrids,
ylabel={Relative Performance},
tick pos=left,
xtick=data,
xticklabels from table={\vadd}{level},