More related work and comparison to Samsung's results

This commit is contained in:
2024-03-19 11:33:49 +01:00
parent c7c3847718
commit fb92d25fa8
2 changed files with 26 additions and 7 deletions


@@ -135,3 +135,11 @@
short = ReLU,
long = rectified linear unit,
}
\DeclareAcronym{gpu}{
short = GPU,
long = graphics processing unit,
}
\DeclareAcronym{fpga}{
short = FPGA,
long = field-programmable gate array,
}


@@ -100,17 +100,19 @@ In summary this paper makes the following contributions:
The paper is structured as follows ...
%
\section{Related Work}
% TODO Derek
% Onur Ramulator ?
To analyze the potential performance and power impact of Newton, SK Hynix developed a virtual prototype based on the DRAMSim2 \cite{rosenfeld2011} cycle-accurate memory simulator, which models an \ac{hbm2} memory and the extended Newton \ac{dram} protocol.
The simulated system is compared to two different non-\ac{pim} systems: an ideal non-\ac{pim} host with infinite compute bandwidth and a \ac{gpu} model of a high-end Titan-V graphics card using a cycle-accurate \ac{gpu} simulator.
SK Hynix finds that Newton achieves a \qty{54}{\times} speedup over the Titan-V \ac{gpu} model and a \qty{10}{\times} speedup over the ideal non-\ac{pim} host; since the ideal host upper-bounds the performance of any non-\ac{pim} system, the latter figure is a lower bound on the acceleration over every possible non-\ac{pim} architecture.
With the \textbf{PIMSimulator} \cite{shin-haengkang2023}, Samsung provides a virtual prototype of \ac{fimdram}, also based on the DRAMSim2 \cite{rosenfeld2011} cycle-accurate memory simulator.
PIMSimulator offers two simulation modes: it can either accept pre-recorded memory traces or generate very simplified memory traffic using a minimal host processor model that essentially executes only the \ac{pim}-related program regions.
However, neither approach accurately models a complete system consisting of a host processor running a real compiled binary and the memory system that integrates \ac{fimdram}.
As a result, only limited conclusions can be made about the performance impact of \ac{fimdram} and the changes that are required in the application code to support the new architecture.
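The trace-driven mode described above can be sketched as follows; the trace format and the memory-model interface here are hypothetical placeholders for illustration, not PIMSimulator's actual API:

```python
# Illustrative sketch of trace-driven memory simulation: pre-recorded
# requests are replayed into a memory model in issue order. The request
# format below is an assumption, not PIMSimulator's real trace format.
from dataclasses import dataclass

@dataclass
class MemRequest:
    cycle: int       # issue cycle recorded in the trace
    is_write: bool   # request direction
    address: int     # target memory address

def replay_trace(trace, memory_model):
    """Feed each recorded request to the memory model in cycle order."""
    for req in sorted(trace, key=lambda r: r.cycle):
        memory_model(req)

# A toy 'memory model' that just counts reads and writes.
stats = {"reads": 0, "writes": 0}
def count(req):
    stats["writes" if req.is_write else "reads"] += 1

replay_trace([MemRequest(0, False, 0x1000), MemRequest(3, True, 0x2000)], count)
print(stats)  # {'reads': 1, 'writes': 1}
```

Because the host side is reduced to a recorded request stream (or a minimal processor model), interactions such as cache effects and instruction overlap on a real compiled binary are not captured, which is the limitation noted above.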
In Samsung's findings, the simulated \ac{fimdram} system provides a speedup in the range of \qtyrange{2.1}{2.6}{\times} depending on the simulated workload with an average speedup of \qty{2.5}{\times} compared to standard \ac{hbm2} memory.
\section{Background DRAM-PIM}
\label{sec:dram_pim}
% TODO Derek
Many types of \acp{dnn} used for language and speech processing, such as \acp{rnn}, \acp{mlp} and some layers of \acp{cnn}, are severely limited by the memory bandwidth that the DRAM can provide, making them \textit{memory-bound} \cite{he2020}.
As already discussed in \cref{sec:intro}, PIM is a good fit for accelerating memory-bound workloads with low operational intensity.
In contrast, compute-bound workloads tend to have high data reuse, can make extensive use of the on-chip caches, and therefore do not need to utilize the full memory bandwidth.
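The memory-bound versus compute-bound distinction can be made concrete with a roofline-style calculation; the peak-compute and bandwidth numbers below are illustrative assumptions, not measured values for any system discussed here:

```python
# Roofline-style check of whether a kernel is memory- or compute-bound.
# Both peak figures are assumed, HBM2-class placeholder values.
PEAK_FLOPS = 14e12   # assumed peak compute throughput, FLOP/s
PEAK_BW = 900e9      # assumed DRAM bandwidth, B/s

def attainable_flops(op_intensity):
    """Attainable FLOP/s for a kernel with the given operational
    intensity (FLOP per byte moved to/from DRAM)."""
    return min(PEAK_FLOPS, PEAK_BW * op_intensity)

# GEMV y = A @ x with an n x n float32 matrix: ~2*n^2 FLOP while loading
# ~4*n^2 bytes, and every matrix element is used exactly once (no reuse),
# so the operational intensity is ~0.5 FLOP/B -- firmly memory-bound.
gemv_oi = 2 / 4
print(attainable_flops(gemv_oi) / PEAK_FLOPS)  # small fraction of peak
```

Under these assumed numbers the \ac{gemv} kernel reaches only a few percent of peak compute, which is why raising effective memory bandwidth via \ac{pim} pays off for such kernels.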
@@ -171,7 +173,6 @@ With this method, the register indices and the bank address cannot get out of sy
\section{VP}
% TODO Derek
To build a virtual prototype of \aca{fimdram}, an accurate \ac{hbm2} model is needed into which the additional \ac{pim} \acp{pu} are integrated.
For this, the cycle-accurate \ac{dram} simulator DRAMSys \cite{steiner2022a} has been used, and its \ac{hbm2} model has been extended to incorporate the \acp{pu} into the \acp{pch} of the \ac{pim}-activated channels.
The \aca{fimdram} model itself does not need to model any timing behavior:
@@ -310,14 +311,24 @@ Therefore, there is a break-even point between dimensions X1 and X2 where \ac{pi
\label{fig:speedups}
\end{figure}
% Comparison with Samsung's results ...
Besides its own virtual prototype, Samsung used a real hardware accelerator platform for its analyses, based on a Xilinx Zynq Ultrascale+ \ac{fpga} with real manufactured \ac{fimdram} memory packages.
Similar to the previous simulations, Samsung used varying input dimensions for both its \ac{gemv} and its vector ADD microbenchmarks, consistent with the dimensions used in this paper.
Samsung's ADD microbenchmark shows an average speedup of around \qty{1.6}{\times} on the real system and \qty{2.6}{\times} on the virtual prototype.
Compared to this paper, where the speedup is approximately \qty{12.7}{\times}, this result is almost an order of magnitude lower.
Samsung attributes this low value to the memory barrier instructions the processor has to insert, which cause a severe performance hit.
However, this memory barrier has also been implemented in our VADD kernel, which still shows a significant performance gain.
The \ac{gemv} microbenchmark, on the other hand, shows a closer match: an average speedup of \qty{8.3}{\times} on Samsung's real system and \qty{2.6}{\times} on their virtual prototype, while this paper achieves an average speedup of \qty{9.0}{\times}, well within reach of the real hardware implementation.
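The size of these gaps can be checked with simple arithmetic on the speedup figures quoted above:

```python
# Speedup figures quoted in the text (ratios vs. standard HBM2).
samsung_add_real = 1.6   # Samsung, real FPGA-based platform, ADD
our_vadd = 12.7          # this paper's virtual prototype, VADD
samsung_gemv_real = 8.3  # Samsung, real platform, GEMV
our_gemv = 9.0           # this paper's virtual prototype, GEMV

add_gap = our_vadd / samsung_add_real    # ~7.9x: near an order of magnitude
gemv_gap = our_gemv / samsung_gemv_real  # ~1.08x: close agreement
print(f"ADD gap: {add_gap:.1f}x, GEMV gap: {gemv_gap:.2f}x")
```

The contrast between the two ratios supports the argument above: the ADD results diverge strongly, while the \ac{gemv} results are nearly identical.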
% TODO Derek
\section{Conclusion}
% TODO Lukas/Matthias
%
% TODO partially duplicate entries!
\bibliographystyle{IEEEtran}
\bibliography{references.bib}
\end{document}