\section{Simulations}
Our simulations are based on the gem5 simulator and the DRAMSys memory simulator.
The comparison between non-\ac{pim} and \ac{pim} architectures considers a hypothetical ARM host processor with infinite compute capacity.
In this ideal approach, memory bandwidth is the only limiting component, allowing only memory-bound effects to be considered.
This provides a lower bound on the possible speedups \ac{pim} can achieve, independent of the host architecture, as the memory bound can only become less significant.
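Under this idealized model, a kernel's execution time reduces to the data volume it moves divided by the available bandwidth, so the lower bound can be sketched as follows (the symbols $V$ for memory traffic and $B$ for bandwidth are our notation, introduced here for illustration):
\[
S \;\ge\; \frac{t_{\mathrm{host}}}{t_{\mathrm{PIM}}}
  = \frac{V_{\mathrm{host}} / B_{\mathrm{host}}}{V_{\mathrm{PIM}} / B_{\mathrm{PIM}}},
\]
since any compute time on a real host only increases $t_{\mathrm{host}}$ and therefore the achievable speedup.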
The configuration of \ac{hbm2} DRAM is summarized in \cref{tab:memspec}.
\label{fig:speedups}
\end{figure}
Besides its own virtual prototype, Samsung used a real hardware accelerator platform for its analyses, which is based on a high-end processor with 60 compute units and uses real manufactured \ac{fimdram} memory packages.
Similar to the previous simulations, Samsung used different input dimensions for its \ac{gemv} and vector ADD microbenchmarks, which are equivalent to the workloads considered here.
Samsung's ADD microbenchmark shows an average speedup of around $\qty{1.6}{\times}$ for the real system and $\qty{2.6}{\times}$ for the virtual prototype.
Compared to this paper, where the speedup is approximately $\qty{12.7}{\times}$, the real-system result is almost an order of magnitude lower.
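As a quick sanity check on the order-of-magnitude claim (the arithmetic, not the measurements, is ours): $12.7 / 1.6 \approx 7.9$, i.e. the real system falls short of the simulated speedup by roughly a factor of eight.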
Samsung attributes the low speedup to the fact that the processor has to insert memory barrier instructions, resulting in severe performance degradation.
However, this memory barrier has also been implemented in our VADD kernel.
One possible explanation for the deviation could be architectural differences between the simulated ARM-based system and Samsung's GPU-based system.
The simulated platform can speculatively execute instructions, which may result in better utilization of memory bandwidth.
The \ac{gemv} microbenchmark, on the other hand, shows a closer match: Samsung reports an average speedup of $\qty{8.3}{\times}$ for the real system and $\qty{2.6}{\times}$ for the virtual prototype, while this paper achieves an average speedup of $\qty{9.0}{\times}$, well within reach of the real hardware implementation.