From 30db51a8de7ccbd5ab465a64ec5577b7482493ce Mon Sep 17 00:00:00 2001 From: Derek Christ Date: Fri, 1 Mar 2024 19:44:43 +0100 Subject: [PATCH] Samsung comparison --- src/chapters/pim.tex | 1 + src/chapters/results.tex | 69 ++++++++++++++++++++++++++++++++++++---- 2 files changed, 64 insertions(+), 6 deletions(-) diff --git a/src/chapters/pim.tex b/src/chapters/pim.tex index ffb6e19..b9f5165 100644 --- a/src/chapters/pim.tex +++ b/src/chapters/pim.tex @@ -390,6 +390,7 @@ Therefore, it is necessary to execute the entire \ac{gemv} microkernel several t In general, the more the dimensions exceed the native \ac{pim} matrix dimensions, the more often the \ac{mac} core of the \ac{gemv} microkernel must be executed. \subsubsection{Performance and Power Efficiency Effects} +\label{sec:fimdram_performance} In addition to the theoretical bandwidth that is provided to the \ac{pim} units of $\qty[per-mode=symbol]{128}{\giga\byte\per\second}$ or a total of $\qty[per-mode=symbol]{2}{\tera\byte\per\second}$ for 16 \acp{pch}, Samsung also ran experiments on a real implementation of \aca{fimdram} to analyze its performance gains and power efficiency improvements. This real system is based on a Xilinx Zynq Ultrascale+ \ac{fpga} that is integrated onto the same silicon interposer as four \aca{hbm} stacks, with each consisting of one buffer die, four \aca{fimdram} dies and four normal \aca{hbm} dies \cite{lee2021}. diff --git a/src/chapters/results.tex b/src/chapters/results.tex index 6bd803b..ba05b75 100644 --- a/src/chapters/results.tex +++ b/src/chapters/results.tex @@ -159,10 +159,10 @@ The additional computation step of the scalar multiplication does not affect the % GEMM mit stark interleavten matrizen (eher nicht) -In addition to the vector operations and the level 1 \ac{blas} routine \ac{haxpy}, the performance improvement of \ac{pim} is also analyzed for the level 2 \ac{blas} routine \ac{gemv}. 
+In addition to the vector operations and the level 1 \ac{blas} routine \ac{haxpy}, the performance improvement of \ac{pim} is also investigated for the level 2 \ac{blas} routine \ac{gemv}.
 Besides the regular \ac{gemv} operation, whose form is $y = A \cdot x$, several matrix-vector multiplications are chained together with the activation function \ac{relu} applied in between, modeling a simple fully connected neural network.
 Each processing step for a \ac{dnn} layer can be described as $y = \textrm{ReLU}(A \cdot x)$, where the output of the operation is fed as input to the next layer.
-In the simplest form, quadratic matrix dimensions ensure that the output vector has the same dimensions as the input vector, which simplifies chaining in the benchmark.
+In the simplest form, square matrix dimensions ensure that the output vector of each layer has the same dimensions as the input vector, which simplifies the chaining in the benchmark.
 Again, several different dimensions of the benchmark inputs are used, whose matrix dimensions for each of the two benchmarks are given in \cref{tab:dimensions_matrix}.
 
 \begin{table}
@@ -191,31 +191,88 @@ X4 & (1024 $\times$ 128) & (1024 $\times$ 1024)
 \label{tab:dimensions_matrix}
 \end{table}
 
-\begin{figure}
+In the \ac{gemv} benchmarks, only the number of rows is increased at each step, which means that the \ac{pim} microkernel has to perform more iterations of the \ac{mac} kernel, but does not have to load another chunk of the input vector, since it fits completely into the \ac{grf}-A registers.
+
+\begin{figure}[ht]
 \centering
 \input{plots/matrix_normal}
-\caption{Normal}
+\caption{Comparison between non-\ac{pim} and \ac{pim} for the \ac{gemv} benchmarks running at a \ac{cpu} frequency of $\qty{3}{\giga\hertz}$.}
 \label{fig:matrix_normal}
 \end{figure}
+\Cref{fig:matrix_normal} shows the relative performance for the \ac{gemv} benchmarks that are run on the system at a normal clock speed.
+The speedup for a single \ac{gemv} operation is in the range of $\qtyrange{3.5}{23.6}{\times}$, and for the simple \ac{dnn} layers in the range of $\qtyrange{3.0}{72.3}{\times}$.
+Unlike in the vector benchmarks, the performance gains become drastically more significant with increasing matrix dimensions, as \ac{pim} can exploit its specialized architecture for this type of operation.
+A possible explanation is that the initial overhead of executing the microkernel in the \aca{fimdram} processing units quickly becomes insignificant compared to the actual execution time as the operand dimensions grow.
+Also, in all cases, the smallest representable operand dimensions already achieve a speedup greater than $\qty{1}{\times}$, suggesting that the break-even point of \ac{pim}'s viability for this system lies below these dimensions.
+Since the speedup approaches $\qty{100}{\times}$ in the \ac{dnn} benchmark, it can be concluded that \ac{pim} offers an immense performance advantage in this system configuration.
+
 \begin{figure}
 \centering
 \input{plots/matrix_infinite}
-\caption{Infinite Compute}
+\caption{Comparison between non-\ac{pim} and \ac{pim} for the \ac{gemv} benchmarks running on the infinite compute platform.}
 \label{fig:matrix_infinite}
 \end{figure}
+The \ac{gemv} and \ac{dnn} benchmarks, however, show a more nuanced picture for the infinite compute approach, which models the completely memory-bound system:
+For smaller matrix dimensions, the use of \ac{pim} slows down execution, to a relative performance as low as $\qty{0.21}{\times}$ for the \ac{gemv} benchmark and even $\qty{0.18}{\times}$ for the \ac{dnn} layers.
+However, the speedup quickly increases with larger dimensions, reaching its break-even point at the third step and a maximum of $\qty{4.7}{\times}$ and $\qty{6.1}{\times}$ for the \ac{gemv} and \ac{dnn} benchmarks, respectively.
+These results provide a more realistic view of \aca{fimdram}:
+For workloads and accelerator systems that are truly memory-bound, performance improvements can be on the order of the simulated $\qty{6.1}{\times}$.
+This result is largely in line with the numbers published by Samsung, which were already introduced in \cref{sec:fimdram_performance} and will be compared in more detail with the simulation results in the next section.
+
 \subsubsection{Comparison to Samsung's Simulation Results}
+To reiterate, Samsung used a real hardware accelerator platform for its analyses, which is based on a Xilinx Zynq Ultrascale+ \ac{fpga} and uses real, manufactured \aca{fimdram} memory packages.
+Similarly to the investigations above, Samsung used different input dimensions for its \ac{gemv} and vector ADD microbenchmarks, which are listed in \cref{tab:samsung_dimensions}.
+
+\begin{table}
+\centering
+\begin{tblr}{
+  cell{2}{2} = {r},
+  cell{3}{2} = {r},
+  cell{4}{2} = {r},
+  cell{5}{2} = {r},
+  cell{2}{3} = {r},
+  cell{3}{3} = {r},
+  cell{4}{3} = {r},
+  cell{5}{3} = {r},
+  hlines,
+  vlines,
+  hline{2} = {-}{solid,black},
+  hline{2} = {2}{-}{solid,black},
+}
+Level & \ac{gemv} Dimensions & ADD Dimensions \\
+Level 1 & (1k $\times$ 4k) & (2M) \\
+Level 2 & (2k $\times$ 4k) & (4M) \\
+Level 3 & (4k $\times$ 8k) & (8M) \\
+Level 4 & (8k $\times$ 8k) & (16M)
+\end{tblr}
+\caption{List of the operand dimensions for the microbenchmarks used by Samsung \cite{lee2021}.}
+\label{tab:samsung_dimensions}
+\end{table}
+
+Each measurement is run with different batch sizes, where a higher batch size allows for better cache utilization, as multiple operations are performed on the same data set, making the workload less memory-bound and rendering \ac{pim} less effective.
+None of the microbenchmarks discussed so far perform batching, so all comparisons are made on the results for a batch size of 1, which correspond to the blue bars in \cref{fig:samsung_speedup}.
+Since the Samsung \ac{fpga} platform can be assumed to be a highly optimized accelerator, the infinite compute approach is a more viable baseline for comparison than the limited \ac{cpu} approach, as both systems should operate in the memory-bound region.
+
 \begin{figure}
 \centering
 \includegraphics[width=0.8\linewidth]{plots/samsung}
-\caption{Samsung}
+\caption{Relative performance of the \ac{gemv} and ADD microbenchmarks for different batch sizes \cite{lee2021}.}
 \label{fig:samsung_speedup}
 \end{figure}
+Samsung's ADD microbenchmark shows little variance between the different input dimensions, with an average speedup of around $\qty{1.6}{\times}$.
+On the simulated platform, the variance is similarly limited, with speedups of $\qtyrange{1.6}{2.4}{\times}$, which corresponds well with Samsung's findings.
+The \ac{gemv} microbenchmark, on the other hand, shows a much more drastic speedup, with an average value of $\qty{8.3}{\times}$.
+Although the dimensions used by Samsung differ from those simulated in this thesis, the highest simulated speedup of $\qty{6.1}{\times}$ is well within the range achieved by the real hardware implementation.
+
 \subsubsection{Comparison to Real Hardware}
+In addition to the comparison with Samsung's real hardware implementation, the same benchmarks as in the performed simulations are run on a [...] with HBM2 [...].
+As this system uses generic \aca{hbm} \ac{dram} and not \aca{fimdram}, the measurements are only intended to serve as a rough estimate of the runtimes in the non-\ac{pim} case.
+
 % \subsubsection{Initialization Overhead}
 % conversion der operanden im verhältnis zur laufzeit abschätzen
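The chained \ac{dnn}-layer workload described in this patch ($y = \textrm{ReLU}(A \cdot x)$, with each layer's output fed into the next) can be sketched in a few lines of NumPy. This is only an illustrative sketch of the benchmark's reference semantics, not the thesis's actual benchmark code; the function names and dimensions are chosen here for clarity.

```python
import numpy as np

def relu(v):
    # ReLU activation applied between the chained GEMV layers.
    return np.maximum(v, 0.0)

def dnn_layers(matrices, x):
    # Each layer computes y = ReLU(A @ x); the output vector of one
    # layer is fed as the input vector of the next.  Square matrices
    # keep the vector dimension constant, which makes chaining trivial.
    for A in matrices:
        x = relu(A @ x)
    return x

# Illustrative use: three square 128x128 layers, as in the smaller
# benchmark steps (dimensions here are assumptions for the sketch).
rng = np.random.default_rng(0)
layers = [rng.standard_normal((128, 128)) for _ in range(3)]
y = dnn_layers(layers, rng.standard_normal(128))
assert y.shape == (128,) and (y >= 0).all()
```

Note that with square matrices the chaining adds no data-layout work between layers, which is exactly why the benchmark uses them.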