More related work and comparison to Samsung's results

This commit is contained in:
2024-03-19 11:33:49 +01:00
parent c7c3847718
commit fb92d25fa8
2 changed files with 26 additions and 7 deletions


@@ -135,3 +135,11 @@
short = ReLU,
long = rectified linear unit,
}
\DeclareAcronym{gpu}{
short = GPU,
long = graphics processing unit,
}
\DeclareAcronym{fpga}{
short = FPGA,
long = field-programmable gate array,
}


@@ -100,17 +100,19 @@ In summary this paper makes the following contributions:
The paper is structured as follows ...
%
\section{Related Work}
% TODO Derek
% Onur Ramulator ?
To analyze the potential performance and power impact of Newton, SK Hynix developed a virtual prototype based on the DRAMSim2 \cite{rosenfeld2011} cycle-accurate memory simulator, which models an \ac{hbm2} memory and the extended Newton \ac{dram} protocol.
The simulated system is compared to two different non-\ac{pim} systems: an ideal non-\ac{pim} host with infinite compute bandwidth and a \ac{gpu} model of a high-end Titan-V graphics card using a cycle-accurate \ac{gpu} simulator.
SK Hynix finds that Newton achieves a \qty{54}{\times} speedup over the Titan-V \ac{gpu} model and a \qty{10}{\times} speedup over the ideal non-\ac{pim} host; since the ideal host upper-bounds the performance of any non-\ac{pim} system, the latter figure is a lower bound on the acceleration over every possible non-\ac{pim} architecture.
With the \textbf{PIMSimulator} \cite{shin-haengkang2023}, Samsung provides a virtual prototype of \ac{fimdram}, also based on the DRAMSim2 \cite{rosenfeld2011} cycle-accurate memory simulator.
PIMSimulator offers two simulation modes: it can either accept pre-recorded memory traces or generate very simplified memory traffic using a minimal host processor model that essentially executes only the \ac{pim}-related program regions.
However, neither approach accurately models a complete system consisting of a host processor running a real compiled binary and the memory system that integrates \ac{fimdram}.
As a result, only limited conclusions can be made about the performance impact of \ac{fimdram} and the changes that are required in the application code to support the new architecture.
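The trace-driven mode described above can be sketched as follows; the trace format and the memory-model interface here are hypothetical placeholders for illustration, not PIMSimulator's actual API:

```python
# Illustrative sketch of trace-driven memory simulation: pre-recorded
# requests are replayed into a memory model in issue order. The request
# format below is an assumption, not PIMSimulator's real trace format.
from dataclasses import dataclass

@dataclass
class MemRequest:
    cycle: int       # issue cycle recorded in the trace
    is_write: bool   # request direction
    address: int     # target memory address

def replay_trace(trace, memory_model):
    """Feed each recorded request to the memory model in cycle order."""
    for req in sorted(trace, key=lambda r: r.cycle):
        memory_model(req)

# A toy 'memory model' that just counts reads and writes.
stats = {"reads": 0, "writes": 0}
def count(req):
    stats["writes" if req.is_write else "reads"] += 1

replay_trace([MemRequest(0, False, 0x1000), MemRequest(3, True, 0x2000)], count)
print(stats)  # {'reads': 1, 'writes': 1}
```

Because the host side is reduced to a recorded request stream (or a minimal processor model), interactions such as cache effects and instruction overlap on a real compiled binary are not captured, which is the limitation noted above.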
In Samsung's findings, the simulated \ac{fimdram} system provides a speedup in the range of \qtyrange{2.1}{2.6}{\times} depending on the simulated workload with an average speedup of \qty{2.5}{\times} compared to standard \ac{hbm2} memory.
\section{Background DRAM-PIM}
\label{sec:dram_pim}
% TODO Derek
Many types of \acp{dnn} used for language and speech processing, such as \acp{rnn}, \acp{mlp} and some layers of \acp{cnn}, are severely limited by the memory bandwidth that the DRAM can provide, making them \textit{memory-bound} \cite{he2020}.
As already discussed in \cref{sec:intro}, PIM is a good fit for accelerating memory-bound workloads with low operational intensity.
In contrast, compute-bound workloads tend to have high data reuse, can make extensive use of the on-chip caches, and therefore do not need to utilize the full memory bandwidth.
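The memory-bound versus compute-bound distinction can be made concrete with a roofline-style calculation; the peak-compute and bandwidth numbers below are illustrative assumptions, not measured values for any system discussed here:

```python
# Roofline-style check of whether a kernel is memory- or compute-bound.
# Both peak figures are assumed, HBM2-class placeholder values.
PEAK_FLOPS = 14e12   # assumed peak compute throughput, FLOP/s
PEAK_BW = 900e9      # assumed DRAM bandwidth, B/s

def attainable_flops(op_intensity):
    """Attainable FLOP/s for a kernel with the given operational
    intensity (FLOP per byte moved to/from DRAM)."""
    return min(PEAK_FLOPS, PEAK_BW * op_intensity)

# GEMV y = A @ x with an n x n float32 matrix: ~2*n^2 FLOP while loading
# ~4*n^2 bytes, and every matrix element is used exactly once (no reuse),
# so the operational intensity is ~0.5 FLOP/B -- firmly memory-bound.
gemv_oi = 2 / 4
print(attainable_flops(gemv_oi) / PEAK_FLOPS)  # small fraction of peak
```

Under these assumed numbers the \ac{gemv} kernel reaches only a few percent of peak compute, which is why raising effective memory bandwidth via \ac{pim} pays off for such kernels.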
@@ -171,7 +173,6 @@ With this method, the register indices and the bank address cannot get out of sy
\section{VP}
% TODO Derek
To build a virtual prototype of \aca{fimdram}, an accurate \ac{hbm2} model is needed into which the additional \ac{pim} \acp{pu} are integrated.
For this, the cycle-accurate \ac{dram} simulator DRAMSys \cite{steiner2022a} has been used, and its \ac{hbm2} model has been extended to incorporate the \acp{pu} into the \acp{pch} of the \ac{pim}-activated channels.
The \aca{fimdram} model itself does not need to model any timing behavior:
@@ -310,14 +311,24 @@ Therefore, there is a break-even point between dimensions X1 and X2 where \ac{pi
\label{fig:speedups}
\end{figure}
% Comparison with Samsung's results ...
Besides its own virtual prototype, Samsung used a real hardware accelerator platform for its analyses, based on a Xilinx Zynq Ultrascale+ \ac{fpga} with real manufactured \ac{fimdram} memory packages.
Similar to the previous simulations, Samsung used varying input dimensions for both its \ac{gemv} and its vector ADD microbenchmarks, consistent with the dimensions used in this paper.
Samsung's ADD microbenchmark shows an average speedup of around \qty{1.6}{\times} on the real system and \qty{2.6}{\times} on the virtual prototype.
Compared to this paper, where the speedup is approximately \qty{12.7}{\times}, this result is almost an order of magnitude lower.
Samsung attributes this low value to the memory barrier instructions the processor has to insert, which cause a severe performance hit.
However, this memory barrier has also been implemented in our VADD kernel, which still shows a significant performance gain.
The \ac{gemv} microbenchmark, on the other hand, shows a closer match: an average speedup of \qty{8.3}{\times} on Samsung's real system and \qty{2.6}{\times} on their virtual prototype, while this paper achieves an average speedup of \qty{9.0}{\times}, well within reach of the real hardware implementation.
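The size of these gaps can be checked with simple arithmetic on the speedup figures quoted above:

```python
# Speedup figures quoted in the text (ratios vs. standard HBM2).
samsung_add_real = 1.6   # Samsung, real FPGA-based platform, ADD
our_vadd = 12.7          # this paper's virtual prototype, VADD
samsung_gemv_real = 8.3  # Samsung, real platform, GEMV
our_gemv = 9.0           # this paper's virtual prototype, GEMV

add_gap = our_vadd / samsung_add_real    # ~7.9x: near an order of magnitude
gemv_gap = our_gemv / samsung_gemv_real  # ~1.08x: close agreement
print(f"ADD gap: {add_gap:.1f}x, GEMV gap: {gemv_gap:.2f}x")
```

The contrast between the two ratios supports the argument above: the ADD results diverge strongly, while the \ac{gemv} results are nearly identical.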
% TODO Derek
\section{Conclusion}
% TODO Lukas/Matthias
%
% TODO partially duplicate entries!
\bibliographystyle{IEEEtran}
\bibliography{references.bib}
\end{document}