
\section{Simulation Results}
\label{sec:simulation_results}

This section evaluates the accuracy of the new simulation frontend.
After a short discussion of the general expectations regarding accuracy and the considerations to be made, the simulation results are presented.
The presentation is structured into two parts:
First, the simulation statistics of numerous benchmarks are compared against the gem5 simulator~\cite{Binkert2011}, which uses detailed processor models and can be considered a reference.
Second, the new simulation frontend is compared against the memory access trace generator tool of the Ramulator DRAM simulator~\cite{Ghose2019}.

\subsection{Accuracy}
Generating memory access traces using dynamic binary instrumentation as a faster alternative to simulating detailed processor models introduces several inaccuracies, some of which are enumerated below.

The most important aspect to consider is that DBI can only instrument the target application and does not take the operating system that the application is running on into account.
This includes the inability to observe the execution of kernel routines that the target application invokes directly through system calls, as well as the preemptive scheduling of other programs that run on the system at the same time.

Another aspect to consider is the fetching of the instructions themselves:
In a real system, the binary executable of the target application is placed in DRAM along with its data and is fetched into the instruction cache during execution.
Since DBI cannot observe the fetching of those instructions, the new simulator frontend cannot model this memory traffic.

\subsection{Comparison to the gem5 Simulator}

First, the micro-benchmark suite TheBandwidthBenchmark~\cite{TheBandwidthBenchmark}, which contains various streaming kernels, is used to compare the gem5 full-system simulation and the gem5 syscall-emulation simulation with the newly developed frontend.
In both cases, the simulation setup consists of a two-level cache hierarchy with the following parameters:

\begin{table}[!ht]
\caption{Cache parameters.}
\begin{center}
\begin{tabular}{|c|c|c|c|c|c|c|}
 \hline
 Cache & Size & Associativity & Line size [B] & MSHRs & MSHR targets & WB entries\\
 \hline
 \hline
 L1 & 32 KiB & 8 & 64 & 4 & 20 & 8\\
 \hline
 L2 & 256 KiB & 4 & 64 & 20 & 12 & 8\\
 \hline

\end{tabular}
\end{center}
\label{tab:cache_parameters}
\end{table}
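
As a worked example of what these parameters imply, the number of sets of each cache follows from its size, its associativity, and its line size.
This is only a sketch: it assumes that the line size in Table~\ref{tab:cache_parameters} is given in bytes and that 1~KiB equals 1024~bytes.
\[
  \mathrm{sets} = \frac{\mathrm{size}}{\mathrm{associativity} \cdot \mathrm{line\ size}}
\]
\[
  \mathrm{sets}_{\mathrm{L1}} = \frac{32 \cdot 1024\,\mathrm{B}}{8 \cdot 64\,\mathrm{B}} = 64,
  \qquad
  \mathrm{sets}_{\mathrm{L2}} = \frac{256 \cdot 1024\,\mathrm{B}}{4 \cdot 64\,\mathrm{B}} = 1024
\]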

In this configuration, every processor core has its own L1 data cache (in the case of gem5, also an L1 instruction cache), whereas the L2 cache is shared between all cores.
The gem5 simulator uses four ARM CPU core models (TimingSimpleCPU, an in-order core model) at 1~GHz, whereas the DynamoRIO traces are obtained using a QEMU~\cite{Qemu} ARM virtual machine that is configured to use four cores as well.
The DRAM subsystem is varied between a single-channel DDR3 memory (1600~MT/s) and a single-channel DDR4 memory (2400~MT/s).
% Explain the DRAMSys configuration here!
To match the configuration used in gem5, the memory controller in DRAMSys is set to use a \revabbr{first-ready - first-come, first-served}{FR-FCFS} scheduling policy, a \revabbr{first-in, first-out}{FIFO} response queue policy, and a row-rank-bank-column-channel address mapping (explained in more detail in Appendix~\ref{sec:address_mappings}).
The trace player operates at the same clock frequency as the gem5 core models.
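
To put the bandwidth figures into perspective, the theoretical peak bandwidth of both memory configurations can be estimated from the transfer rate and the data bus width.
The following sketch assumes a 64-bit (8-byte) data bus per channel, which is typical for DDR3 and DDR4 modules but is not stated above; the symbols $B_{\mathrm{peak}}$, $f_{\mathrm{T}}$, and $w$ are introduced here for illustration only.
\[
  B_{\mathrm{peak}} = f_{\mathrm{T}} \cdot w
\]
\[
  B_{\mathrm{peak,DDR3}} = 1600 \cdot 10^{6}\,\mathrm{s^{-1}} \cdot 8\,\mathrm{B} = 12.8\,\mathrm{GB/s},
  \qquad
  B_{\mathrm{peak,DDR4}} = 2400 \cdot 10^{6}\,\mathrm{s^{-1}} \cdot 8\,\mathrm{B} = 19.2\,\mathrm{GB/s}
\]
Under the same assumption, the data bus utilization can roughly be read as the ratio of the average bandwidth to this peak value.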

The micro-benchmarks themselves are multi-threaded and use all four cores.
Their access patterns are as follows:

\begin{table}[!ht]
\caption{Access patterns of the micro-benchmark kernels\cite{TheBandwidthBenchmark}.}
\begin{center}
\begin{tabular}{|c|c|c|}
 \hline
 Benchmark kernel & Description & Access pattern \\
 \hline
 \hline
 INIT & Initialize an array & a = s (store, write allocate) \\
 \hline
 SUM & Vector reduction & s += a (load)\\
 \hline
 COPY & Memory copy & a = b (load, store, write allocate)\\
 \hline
 UPDATE & Update vector & a = a * scalar (load, store)\\
 \hline
 TRIAD & Stream triad & a = b + c * scalar (load, store, write allocate)\\
 \hline
 DAXPY & Daxpy & a = a + b * scalar (load, store)\\
 \hline
 STRIAD & Schönauer triad & a = b + c * d (load, store, write allocate)\\
 \hline
 SDAXPY & Schönauer triad without write allocate & a = a + b * c (load, store)\\
 \hline

\end{tabular}
\end{center}
\label{tab:benchmark_description}
\end{table}
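
The access patterns in Table~\ref{tab:benchmark_description} translate into an expected amount of DRAM traffic.
As a rough sketch, assume that each array holds $N$ elements of $w$ bytes, that the arrays are much larger than the L2 cache, and that a store to an array that is not loaded beforehand triggers a write allocate, so that the cache line is read from memory before it is written.
With $n_{\mathrm{ld}}$ loaded arrays, $n_{\mathrm{st}}$ stored arrays, and $n_{\mathrm{wa}}$ write-allocated arrays (symbols introduced here for illustration only), the traffic of one sweep over the arrays is approximately
\[
  V_{\mathrm{read}} = (n_{\mathrm{ld}} + n_{\mathrm{wa}}) \cdot N \cdot w,
  \qquad
  V_{\mathrm{write}} = n_{\mathrm{st}} \cdot N \cdot w
\]
For COPY ($n_{\mathrm{ld}} = 1$, $n_{\mathrm{st}} = 1$, $n_{\mathrm{wa}} = 1$) this gives $V_{\mathrm{read}} = 2Nw$ and $V_{\mathrm{write}} = Nw$, whereas UPDATE ($n_{\mathrm{ld}} = 1$, $n_{\mathrm{st}} = 1$, $n_{\mathrm{wa}} = 0$) gives $V_{\mathrm{read}} = V_{\mathrm{write}} = Nw$.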

In the following, the simulation results of the new simulation frontend, the gem5 full-system simulation, and the gem5 syscall-emulation simulation are presented.

\begin{table}[!ht]
\caption[Results for bandwidth and bytes read/written with DDR3-1600.]{Results for bandwidth and bytes read/written with DDR3-1600. FS denotes gem5 full-system, SE denotes gem5 syscall-emulation, DS denotes DRAMSys.}
\begin{center}
\begin{tabular}{|c|c|c|c|c|c|c|c|c|c|}
 \hline
 \multirow{2}*{Benchmark} & \multicolumn{3}{|c|}{Avg. Bandwidth [GB/s]} & \multicolumn{3}{|c|}{Bytes Read [MB]} & \multicolumn{3}{|c|}{Bytes Written [MB]} \\
 \cline{2-10}
 & FS & SE & DS & FS & SE & DS & FS & SE & DS\\
 \hline
 \hline
 COPY & 2.031 & 2.698& 4 & 238.3 & 268.8& 7 & 140.3 & 134.3 & 10\\
 \hline
 DAXPY & 2.070 & 2.627& 4 & 238.2 & 268.9 & 7 & 140.2 & 134.4 & 10\\
 \hline
 INIT & 2.028 & 2.629& 4 & 141.9 & 172.9 & 7 & 140.1 & 134.4 & 10\\
 \hline
 SDAXPY & 2.101 & 2.755& 4 & 335.1 & 364.8 & 7 & 140.4 & 134.4 & 10\\
 \hline
 STRIAD & 2.228 & 2.613& 4 & 431.6& 460.9 & 7 & 140.4 & 134.4 & 10\\
 \hline
 SUM & 1.393 & 1.969& 4 & 142.0 & 172.9 & 7 & 44.1 & 38.5 & 10\\
 \hline
 TRIAD & 2.162 & 2.725& 4 & 335.1 & 364.9 & 7 & 140.4 & 134.4 & 10\\
 \hline
 UPDATE & 1.938 & 2.528& 4 & 142.0& 172.8 & 7 & 140.1 & 134.3 & 10\\
 \hline

\end{tabular}
\end{center}
\label{tab:benchmark_bandwidth_ddr3}
\end{table}

\begin{table}[!ht]
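\caption[Results for bandwidth and bytes read/written with DDR4-2400.]{Results for bandwidth and bytes read/written with DDR4-2400. FS denotes gem5 full-system, SE denotes gem5 syscall-emulation, DS denotes DRAMSys.}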
\begin{center}
\begin{tabular}{|c|c|c|c|c|c|c|c|c|c|}
 \hline
 \multirow{2}*{Benchmark} & \multicolumn{3}{|c|}{Avg. Bandwidth [GB/s]} & \multicolumn{3}{|c|}{Bytes Read [MB]} & \multicolumn{3}{|c|}{Bytes Written [MB]} \\
 \cline{2-10}
 & FS & SE & DS & FS & SE & DS & FS & SE & DS\\
 \hline
 \hline
 COPY & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10\\
 \hline
 DAXPY & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10\\
 \hline
 INIT & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10\\
 \hline
 SDAXPY & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10\\
 \hline
 STRIAD & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10\\
 \hline
 SUM & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10\\
 \hline
 TRIAD & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10\\
 \hline
 UPDATE & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10\\
 \hline

\end{tabular}
\end{center}
\label{tab:benchmark_bandwidth_ddr4}
\end{table}

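Tables~\ref{tab:benchmark_bandwidth_ddr3} and~\ref{tab:benchmark_bandwidth_ddr4} list the average bandwidth and the number of bytes read and written for the DDR3-1600 and the DDR4-2400 configuration, respectively.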

\begin{table}[!ht]
\caption{Results for memory access latency and data bus utilization with DDR3-1600.}
\begin{center}
\begin{tabular}{|c|c|c|c|c|c|c|}
 \hline
 \multirow{2}*{Benchmark} & \multicolumn{3}{|c|}{Avg. Access Latency [ns]} & \multicolumn{3}{|c|}{Data Bus Utilization [\%]} \\
 \cline{2-7}
 & FS & SE & DS & FS & SE & DS\\
 \hline
 \hline
 COPY & 2 & 3 & 4 & 5 & 6 & 7\\
 \hline
 DAXPY & 2 & 3 & 4 & 5 & 6 & 7\\
 \hline
 INIT & 2 & 3 & 4 & 5 & 6 & 7\\
 \hline
 SDAXPY & 2 & 3 & 4 & 5 & 6 & 7\\
 \hline
 STRIAD & 2 & 3 & 4 & 5 & 6 & 7\\
 \hline
 SUM & 2 & 3 & 4 & 5 & 6 & 7\\
 \hline
 TRIAD & 2 & 3 & 4 & 5 & 6 & 7\\
 \hline
 UPDATE & 2 & 3 & 4 & 5 & 6 & 7\\
 \hline

\end{tabular}
\end{center}
\label{tab:benchmark_access_ddr3}
\end{table}

\begin{table}[!ht]
\caption{Results for memory access latency and data bus utilization with DDR4-2400.}
\begin{center}
\begin{tabular}{|c|c|c|c|c|c|c|}
 \hline
 \multirow{2}*{Benchmark} & \multicolumn{3}{|c|}{Avg. Access Latency [ns]} & \multicolumn{3}{|c|}{Data Bus Utilization [\%]} \\
 \cline{2-7}
 & FS & SE & DS & FS & SE & DS\\
 \hline
 \hline
 COPY & 2 & 3 & 4 & 5 & 6 & 7\\
 \hline
 DAXPY & 2 & 3 & 4 & 5 & 6 & 7\\
 \hline
 INIT & 2 & 3 & 4 & 5 & 6 & 7\\
 \hline
 SDAXPY & 2 & 3 & 4 & 5 & 6 & 7\\
 \hline
 STRIAD & 2 & 3 & 4 & 5 & 6 & 7\\
 \hline
 SUM & 2 & 3 & 4 & 5 & 6 & 7\\
 \hline
 TRIAD & 2 & 3 & 4 & 5 & 6 & 7\\
 \hline
 UPDATE & 2 & 3 & 4 & 5 & 6 & 7\\
 \hline

\end{tabular}
\end{center}
\label{tab:benchmark_access_ddr4}
\end{table}

\begin{table}[!ht]
\caption{Results for last-level cache (L2) statistics with DDR3-1600.}
\begin{center}
\begin{tabular}{|c|c|c|c|c|c|c|c|c|c|}
 \hline
 \multirow{2}*{Benchmark} & \multicolumn{3}{|c|}{Hits} & \multicolumn{3}{|c|}{Misses} & \multicolumn{3}{|c|}{Miss Rate [\%]} \\
 \cline{2-10}
 & FS & SE & DS & FS & SE & DS & FS & SE & DS\\
 \hline
 \hline
 COPY & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10\\
 \hline
 DAXPY & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10\\
 \hline
 INIT & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10\\
 \hline
 SDAXPY & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10\\
 \hline
 STRIAD & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10\\
 \hline
 SUM & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10\\
 \hline
 TRIAD & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10\\
 \hline
 UPDATE & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10\\
 \hline

\end{tabular}
\end{center}
\label{tab:benchmark_cache_ddr3}
\end{table}

\begin{table}[!ht]
\caption{Results for last-level cache (L2) statistics with DDR4-2400.}
\begin{center}
\begin{tabular}{|c|c|c|c|c|c|c|c|c|c|}
 \hline
 \multirow{2}*{Benchmark} & \multicolumn{3}{|c|}{Hits} & \multicolumn{3}{|c|}{Misses} & \multicolumn{3}{|c|}{Miss Rate [\%]} \\
 \cline{2-10}
 & FS & SE & DS & FS & SE & DS & FS & SE & DS\\
 \hline
 \hline
 COPY & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10\\
 \hline
 DAXPY & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10\\
 \hline
 INIT & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10\\
 \hline
 SDAXPY & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10\\
 \hline
 STRIAD & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10\\
 \hline
 SUM & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10\\
 \hline
 TRIAD & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10\\
 \hline
 UPDATE & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10\\
 \hline

\end{tabular}
\end{center}
\label{tab:benchmark_cache_ddr4}
\end{table}

% \subsubsection{New simulation frontend}
%
% \subsubsection{gem5 full-system mode}
%
% \subsubsection{gem5 syscall-emulation mode}


\subsection{Comparison to Ramulator}

\subsection{Simulation Runtime}