Update on Overleaf.
@@ -298,7 +298,7 @@ A self-written kernel provides full control for implementing a minimal example u
\section{Simulations}
Our simulations are based on the gem5 simulator and the DRAMSys memory simulator.
-The comparison between non-\ac{pim} and \ac{pim} architectures considers a hypothetical host processor with infinite compute capacity.
+The comparison between non-\ac{pim} and \ac{pim} architectures considers a hypothetical ARM host processor with infinite compute capacity.
In this ideal approach, memory bandwidth is the only limiting component, allowing only memory-bound effects to be considered.
This provides a lower bound on the possible speedups \ac{pim} can achieve, independent of the host architecture, as the memory bound can only become less significant.
The configuration of \ac{hbm2} DRAM is summarized in \cref{tab:memspec}.
@@ -372,13 +372,16 @@ Therefore, there is a break-even point between dimensions X1 and X2 where \ac{pi
\label{fig:speedups}
\end{figure}

-Besides its own virtual prototype, Samsung used a real hardware accelerator platform for its analyses, which is based on a Xilinx Zynq Ultrascale+ FPGA and uses real manufactured \ac{fimdram} memory packages.
+Besides its own virtual prototype, Samsung used a real hardware accelerator platform for its analyses, which is based on a high-end processor
+with 60 compute units and uses real manufactured \ac{fimdram} memory packages.
Similar to the previous simulations, Samsung used different input dimensions for its microbenchmarks for both its \ac{gemv} and its vector ADD workloads, which are equivalent.

Samsung's ADD microbenchmark shows an average speedup of around $\qty{1.6}{\times}$ for the real system and \qty{2.6}{\times} for the virtual prototype.
Compared to this paper, where the speedup is approximately $\qty{12.7}{\times}$, this result is almost an order of magnitude lower.
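As a rough check of this comparison, using only the rounded average speedups quoted above, the ratios work out to
\[
  \frac{12.7}{1.6} \approx 7.9, \qquad \frac{12.7}{2.6} \approx 4.9,
\]
that is, roughly a factor of eight relative to Samsung's real system and a factor of five relative to its virtual prototype.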
-Samsung explains the low speedup by the fact that the processor has to introduce memory barrier instructions, resulting in a severe performance hit.
-However, this memory barrier has also been implemented in our VADD kernel, which still shows a significant performance gain.
+Samsung explains the low speedup by the fact that the processor has to introduce memory barrier instructions, resulting in a severe performance degradation.
+However, this memory barrier has also been implemented in our VADD kernel.
+One possible explanation for the deviation could be architectural differences between the simulated ARM-based system and Samsung's GPU-based system.
+The simulated platform can speculatively execute instructions, which may result in better utilization of memory bandwidth.

The \ac{gemv} microbenchmark, on the other hand, shows closer agreement: Samsung reports an average speedup of $\qty{8.3}{\times}$ for its real system and \qty{2.6}{\times} for its virtual prototype, while this paper achieves an average speedup of $\qty{9.0}{\times}$, which is comparable to the result of the real hardware implementation.
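Taking the rounded average values quoted above at face value, the agreement for \ac{gemv} can be expressed as the ratio
\[
  \frac{9.0}{8.3} \approx 1.08,
\]
so the simulated speedup is about 8\,\% above the one reported for Samsung's real system.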