diff --git a/samplepaper.tex b/samplepaper.tex
index 56a5290..e9bb258 100644
--- a/samplepaper.tex
+++ b/samplepaper.tex
@@ -298,7 +298,7 @@ A self-written kernel provides full control for implementing a minimal example u
 \section{Simulations}
 Our simulations are based on the gem5 simulator and the DRAMSys memory simulator.
-The comparison between non-\ac{pim} and \ac{pim} architectures considers a hypothetical host processor with infinite compute capacity.
+The comparison between non-\ac{pim} and \ac{pim} architectures considers a hypothetical ARM host processor with infinite compute capacity.
 In this ideal approach, memory bandwidth is the only limiting component, allowing only memory-bound effects to be considered.
 This provides a lower bound on the possible speedups \ac{pim} can achieve, independent of the host architecture, as the memory bound can only become less significant.
 The configuration of \ac{hbm2} DRAM is summarized in \cref{tab:memspec}.
@@ -372,13 +372,16 @@ Therefore, there is a break-even point between dimensions X1 and X2 where \ac{pi
 \label{fig:speedups}
 \end{figure}
-Besides it's own virtual prototype, Samsung used a real hardware accelerator platform for its analyses, which is based on a Xilinx Zynq Ultrascale+ FPGA and uses real manufactured \ac{fimdram} memory packages.
+Besides it's own virtual prototype, Samsung used a real hardware accelerator platform for its analyses, which is based on a high-end processor
+with 60 compute units and uses real manufactured \ac{fimdram} memory packages.
 Similar to the previous simulations, Samsung has used different input dimensions for its microbenchmarks for both its \ac{gemv} and its vector ADD workloads, which are equivalent.
 The performed ADD microbenchmark of Samsung shows an average speedup of around $\qty{1.6}{\times}$ for the real system and \qty{2.6}{\times} for the virtual prototype.
 Compared to this paper, where the speedup is approximately $\qty{12.7}{\times}$, this result is almost an order of magnitude lower.
-Samsung explains the low speedup by the fact the processor has to introduce memory barrier instructions, resulting in a severe performance hit.
-However, this memory barrier has also been implemented in our VADD kernel, which still shows a significant performance gain.
+Samsung explains the low speedup by the fact the processor has to introduce memory barrier instructions, resulting in a severe performance degradation.
+However, this memory barrier has also been implemented in our VADD kernel.
+One possible explanation for the deviation could be architectural differences between the simulated ARM-based system and Samsung's GPU-based system.
+The simulated platform can speculatively execute instructions, which may result in better utilization of memory bandwidth.
 The \ac{gemv} microbenchmark on the other hand shows a more matching result with an average speedup value of $\qty{8.3}{\times}$ for Samsung's real system and \qty{2.6}{\times} for their virtual prototype, while this paper achieved an average speedup of $\qty{9.0}{\times}$, which is well within the reach of the real hardware implementation.