diff --git a/acronyms.tex b/acronyms.tex
index 621f9da..81e84a8 100644
--- a/acronyms.tex
+++ b/acronyms.tex
@@ -135,3 +135,11 @@
  short = ReLU,
  long = rectified linear unit,
 }
+\DeclareAcronym{gpu}{
+ short = GPU,
+ long = graphics processing unit,
+}
+\DeclareAcronym{fpga}{
+ short = FPGA,
+ long = field-programmable gate array,
+}
diff --git a/samplepaper.tex b/samplepaper.tex
index 6ecb5d6..a279966 100644
--- a/samplepaper.tex
+++ b/samplepaper.tex
@@ -100,17 +100,19 @@ In summary this paper makes the following contributions:
 The paper is structured as follows ...
 %
 \section{Related Work}
-% TODO Derek
-Onur Ramulator
+% Onur Ramulator ?
+To analyze the potential performance and power impact of Newton, SK Hynix developed a virtual prototype based on the DRAMSim2 \cite{rosenfeld2011} cycle-accurate memory simulator, which models an \ac{hbm2} memory and the extended Newton \ac{dram} protocol.
+The simulated system is compared to two different non-\ac{pim} systems: an ideal non-\ac{pim} host with infinite compute bandwidth and a \ac{gpu} model of a high-end Titan V graphics card using a cycle-accurate \ac{gpu} simulator.
+SK Hynix finds that Newton achieves a \qty{54}{\times} speedup over the Titan V \ac{gpu} model and a \qty{10}{\times} speedup over the ideal non-\ac{pim} host, the latter setting a lower bound on the acceleration over every possible non-\ac{pim} architecture.
 
-With the \textbf{PIMSimulator} \cite{shin-haengkang2023}, Samsung provides a virtual prototype of \ac{fimdram} based on the DRAMSim2 \cite{rosenfeld2011} cycle-accurate memory simulator.
+With the \textbf{PIMSimulator} \cite{shin-haengkang2023}, Samsung provides a virtual prototype of \ac{fimdram}, also based on DRAMSim2.
 PIMSimulator offers two simulation modes: it can either accept pre-recorded memory traces or generate very simplified memory traffic using a minimal host processor model that essentially executes only the \ac{pim}-related program regions.
 However, neither approach accurately models a complete system consisting of a host processor running a real compiled binary and the memory system that integrates \ac{fimdram}.
 As a result, only limited conclusions can be made about the performance impact of \ac{fimdram} and the changes that are required in the application code to support the new architecture.
+In Samsung's findings, the simulated \ac{fimdram} system provides a speedup in the range of \qtyrange{2.1}{2.6}{\times} depending on the simulated workload, with an average speedup of \qty{2.5}{\times} compared to standard \ac{hbm2} memory.
 
 \section{Background DRAM-PIM}
 \label{sec:dram_pim}
-% TODO Derek
 Many types of \acp{dnn} used for language and speech processing, such as \acp{rnn}, \acp{mlp} and some layers of \acp{cnn}, are severely limited by the memory bandwidth that the DRAM can provide, making them \textit{memory-bound} \cite{he2020}.
 As already discussed in \cref{sec:intro}, PIM is a good fit for accelerating memory-bound workloads with low operational intensity.
 In contrast, compute-bound workloads tend to have high data reuse and can make excessive use of the on-chip cache and therefore do not need to utilize the full memory bandwidth.
@@ -171,7 +173,6 @@ With this method, the register indices and the bank address cannot get out of sy
 
 
 \section{VP}
-% TODO Derek
 To build a virtual prototype of \aca{fimdram}, an accurate \ac{hbm2} model is needed, where the additional \ac{pim}-\acp{pu} are integrated.
 For this the cycle-accurate \ac{dram} simulator DRAMSys \cite{steiner2022a} has been used and its \ac{hbm2} model extended to incorporate the \acp{pu} into the \acp{pch} of the \ac{pim}-activated channels.
 The \aca{fimdram} model itself does not need to model any timing behavior:
@@ -310,14 +311,24 @@ Therefore, there is a break-even point between dimensions X1 and X2 where \ac{pi
 \label{fig:speedups}
 \end{figure}
 
-Vergleich mit Samsung...
+Besides its own virtual prototype, Samsung used a real hardware accelerator platform for its analyses, which is based on a Xilinx Zynq UltraScale+ \ac{fpga} and uses real manufactured \ac{fimdram} memory packages.
+Similar to the previous simulations, Samsung used varying input dimensions for both its \ac{gemv} and its vector ADD microbenchmarks, consistent with the dimensions used in the simulations.
+
+Samsung's ADD microbenchmark shows an average speedup of around \qty{1.6}{\times} for the real system and \qty{2.6}{\times} for the virtual prototype.
+Compared to this paper, where the speedup is approximately \qty{12.7}{\times}, this result is almost an order of magnitude lower.
+Samsung attributes this low value to the memory barrier instructions that the processor has to insert, which result in a severe performance hit.
+However, this memory barrier has also been implemented in our VADD kernel, which still shows a significant performance gain.
+
+The \ac{gemv} microbenchmark, on the other hand, shows a closer match, with an average speedup of \qty{8.3}{\times} for Samsung's real system and \qty{2.6}{\times} for their virtual prototype, while this paper achieves an average speedup of \qty{9.0}{\times}, well within reach of the real hardware implementation.
+
 % TODO Derek
 
 \section{Conclusion}
 % TODO Lukas/Matthias
 %
-\bibliographystyle{IEEEtran} % TODO change style?
+% TODO partially duplicate entries!
+\bibliographystyle{IEEEtran}
 \bibliography{references.bib}
 
 \end{document}