Update on Overleaf.

2024-04-05 08:49:37 +00:00
parent 1e98b3b48e
commit 92682a55d6
1 changed files with 7 additions and 4 deletions
--- a/samplepaper.tex
+++ b/samplepaper.tex
@@ -298,7 +298,7 @@ A self-written kernel provides full control for implementing a minimal example u

 \section{Simulations}
 Our simulations are based on the gem5 simulator and the DRAMSys memory simulator.
-The comparison between non-\ac{pim} and \ac{pim} architectures considers a hypothetical host processor with infinite compute capacity.
+The comparison between non-\ac{pim} and \ac{pim} architectures considers a hypothetical ARM host processor with infinite compute capacity.
 In this ideal approach, memory bandwidth is the only limiting component, allowing only memory-bound effects to be considered.
 This provides a lower bound on the possible speedups \ac{pim} can achieve, independent of the host architecture, as the memory bound can only become less significant.
 The configuration of \ac{hbm2} DRAM is summarized in \cref{tab:memspec}.
@@ -372,13 +372,16 @@ Therefore, there is a break-even point between dimensions X1 and X2 where \ac{pi
    \label{fig:speedups}
 \end{figure}

-Besides it's own virtual prototype, Samsung used a real hardware accelerator platform for its analyses, which is based on a Xilinx Zynq Ultrascale+ FPGA and uses real manufactured \ac{fimdram} memory packages.
+Besides its own virtual prototype, Samsung used a real hardware accelerator platform for its analyses, which is based on a high-end processor
+with 60 compute units and uses real manufactured \ac{fimdram} memory packages.
 Similar to the previous simulations, Samsung has used different input dimensions for its microbenchmarks for both its \ac{gemv} and its vector ADD workloads, which are equivalent.

 The performed ADD microbenchmark of Samsung shows an average speedup of around $\qty{1.6}{\times}$ for the real system and \qty{2.6}{\times} for the virtual prototype.
 Compared to this paper, where the speedup is approximately $\qty{12.7}{\times}$, this result is almost an order of magnitude lower.
-Samsung explains the low speedup by the fact the processor has to introduce memory barrier instructions, resulting in a severe performance hit.
-However, this memory barrier has also been implemented in our VADD kernel, which still shows a significant performance gain.
+Samsung explains the low speedup by the fact that the processor has to introduce memory barrier instructions, resulting in a severe performance degradation.
+However, this memory barrier has also been implemented in our VADD kernel.
+One possible explanation for the deviation could be architectural differences between the simulated ARM-based system and Samsung's GPU-based system.
+The simulated platform can speculatively execute instructions, which may result in better utilization of memory bandwidth.

 The \ac{gemv} microbenchmark on the other hand shows a more matching result with an average speedup value of $\qty{8.3}{\times}$ for Samsung's real system and \qty{2.6}{\times} for their virtual prototype, while this paper achieved an average speedup of $\qty{9.0}{\times}$, which is well within the reach of the real hardware implementation.