\section{Simulation Results}
\label{sec:simulation_results}

This section evaluates the accuracy of the new simulation front-end.
After a short discussion of the general expectations regarding accuracy and the considerations that have to be made, the simulation results are presented.
The presentation is structured into two parts:
First, simulation statistics of numerous benchmarks are compared against the gem5 simulator \cite{Binkert2011}, which uses detailed processor models and can be considered a reference.
Second, the new simulation front-end is compared against the memory access trace generator tool of the Ramulator DRAM simulator \cite{Ghose2019}.

\subsection{Accuracy}

Generating memory access traces using dynamic binary instrumentation as a faster alternative to simulating detailed processor models introduces several inaccuracies, some of which are enumerated below.

The most important aspect to consider is that DBI can only instrument the target application; it cannot take into account the operating system the application is running on.
This includes the inability to observe the execution of kernel routines that are directly invoked by the application through system calls, but also the preemptive scheduling of other programs that are running on the system at the same time.

The fetching of the instructions themselves should also be considered:
In a real system the binary executable of the target application is placed in the DRAM, along with its data, and is loaded into the instruction cache during execution.
Since the DBI tool cannot observe the fetching of those instructions, the new simulator front-end cannot model this memory traffic.

\subsection{Comparison to the gem5 Simulator}

First, the micro-benchmark suite TheBandwidthBenchmark \cite{TheBandwidthBenchmark}, consisting of various streaming kernels, is used to compare the gem5 full-system simulation as well as the gem5 syscall-emulation simulation with the newly developed front-end.

The gem5 syscall-emulation mode does not simulate a whole operating system; rather, it utilizes the host system's Linux kernel and therefore only simulates the binary application.
In contrast, the gem5 full-system simulation boots into a complete Linux system including all processes that may run in the background.
Therefore, syscall-emulation is conceptually closer to the DynamoRIO approach than full-system simulation.

The simulation setup in both cases consists of a two-level cache hierarchy with the following parameters:

\begin{table}[!ht]
\caption{Cache parameters used in simulations.}
\begin{center}
\begin{tabular}{|c|c|c|c|c|c|c|}
\hline
Cache & Size & Associativity & Line size & MSHRs & MSHR targets & WB entries\\
\hline
\hline
L1 & 32 kiB & 8 & 64 B & 4 & 20 & 8\\
\hline
L2 & 256 kiB & 4 & 64 B & 20 & 12 & 8\\
\hline
\end{tabular}
\end{center}
\label{tab:cache_parameters}
\end{table}

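As a quick sanity check on these parameters, the number of sets in each cache follows directly from capacity, associativity, and line size (a minimal sketch; the function name is illustrative):

```python
def n_sets(size_bytes: int, associativity: int, line_bytes: int) -> int:
    """Number of cache sets = capacity / (ways * line size)."""
    return size_bytes // (associativity * line_bytes)

# Parameters from Table of cache parameters above
l1_sets = n_sets(32 * 1024, 8, 64)    # L1: 64 sets
l2_sets = n_sets(256 * 1024, 4, 64)   # L2: 1024 sets
```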
In this configuration, every processor core has its own L1 data cache (in the case of gem5 also an L1 instruction cache), whereas the L2 cache is shared between all cores.
The gem5 simulator uses four ARM CPU core models (TimingSimpleCPU, an in-order core model) at \textit{1000 MHz}, whereas the DynamoRIO traces are obtained using a QEMU \cite{Qemu} ARM virtual machine, configured to use four cores as well.
The DRAM subsystem is varied between a single-channel DDR3 memory (1600 MT/s) and a single-channel DDR4 memory (2400 MT/s).
% Explain the DRAMSys configuration here!
To match the gem5 configuration, the memory controller in DRAMSys is set to use a \revabbr{first-ready - first-come, first-served}{FR-FCFS} scheduling policy, a \revabbr{first-in, first-out}{FIFO} response queue policy, and a row-rank-bank-column-channel address mapping (explained in more detail in Appendix \ref{sec:address_mappings}).
The trace player operates at the same clock frequency as the gem5 core models.

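To illustrate how such a row-rank-bank-column-channel mapping slices a physical address, the following sketch decodes the individual fields starting at the least significant bits, with the row occupying the remaining high bits. The field widths used here are placeholders chosen for the example, not the widths of the actual configuration (those are listed in Appendix \ref{sec:address_mappings}):

```python
def decode_address(addr, low_to_high):
    """Split a physical address into named bit fields, starting at the LSB.
    `low_to_high` is a list of (name, bits); the remaining high-order
    bits form the row."""
    fields = {}
    for name, bits in low_to_high:
        fields[name] = addr & ((1 << bits) - 1)
        addr >>= bits
    fields["row"] = addr
    return fields

# Hypothetical widths: 0 channel bits (single channel), 10 column bits,
# 2 bank bits, 0 rank bits -- the row takes the remaining high bits.
layout = [("channel", 0), ("column", 10), ("bank", 2), ("rank", 0)]
fields = decode_address(0x1A2B3C, layout)
```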

It is important to configure the CPI value of the new trace player to a sensible value in order to approximate the delay between two consecutive memory accesses.
For the simulations, the CPI value that gem5 SE reports in its statistics is used.
This CPI amounts to approximately \textit{10} if only computational instructions are considered and load and store operations are ignored, since the latter are affected by the latency of the memory subsystem.

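As a sketch of this approximation (the function name and interface are illustrative, not the DRAMSys API), the delay inserted between two consecutive memory accesses follows from the number of intervening computational instructions, the configured CPI, and the trace player's clock frequency:

```python
def inter_access_delay_ns(compute_instrs: int, cpi: float = 10.0,
                          clock_mhz: float = 1000.0) -> float:
    """Delay between two memory accesses: instructions * CPI cycles,
    converted to nanoseconds at the given clock frequency."""
    cycles = compute_instrs * cpi
    return cycles * 1000.0 / clock_mhz  # 1000/MHz = ns per cycle

# At 1000 MHz one cycle is 1 ns, so 5 compute instructions at CPI 10
# correspond to a delay of 50 ns.
```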

The micro-benchmarks themselves are multi-threaded and make use of all available cores.
Furthermore, the compiler optimization level is set to \texttt{-Ofast} for all benchmarks.
Their access patterns are as follows:

\begin{table}[!ht]
\caption{Access patterns of the micro-benchmark kernels \cite{TheBandwidthBenchmark}.}
\begin{center}
\begin{tabular}{|c|c|c|}
\hline
Kernel & Description & Access Pattern \\
\hline
\hline
INIT & Initialize an array & a = s (store, write allocate) \\
\hline
SUM & Vector reduction & s += a (load)\\
\hline
COPY & Memory copy & a = b (load, store, write allocate)\\
\hline
UPDATE & Update vector & a = a * scalar (load, store)\\
\hline
TRIAD & Stream triad & a = b + c * scalar (load, store, write allocate)\\
\hline
DAXPY & Daxpy & a = a + b * scalar (load, store)\\
\hline
STRIAD & Schönauer triad & a = b + c * d (load, store, write allocate)\\
\hline
SDAXPY & Schönauer daxpy & a = a + b * c (load, store)\\
\hline
\end{tabular}
\end{center}
\label{tab:benchmark_description}
\end{table}

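To make the access patterns concrete, two of the table's kernels can be written out as simple loops (sketched here in Python for brevity; the actual benchmarks are compiled, multi-threaded C code):

```python
def triad(a, b, c, scalar):
    """Stream triad: a = b + c * scalar.
    Two loads (b, c) and one store (a) per element; writing `a` without
    reading it first also triggers a write allocate in the cache."""
    for i in range(len(a)):
        a[i] = b[i] + c[i] * scalar

def sdaxpy(a, b, c):
    """a = a + b * c.
    Three loads and one store per element; no write allocate, because
    `a` is read before it is written."""
    for i in range(len(a)):
        a[i] = a[i] + b[i] * c[i]

a = [0.0, 0.0]
triad(a, [1.0, 2.0], [3.0, 4.0], 2.0)   # a becomes [7.0, 10.0]

d = [1.0, 2.0]
sdaxpy(d, [2.0, 3.0], [4.0, 5.0])       # d becomes [9.0, 17.0]
```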
In the following, the simulation results of the new simulation front-end, the gem5 full-system simulation, and the gem5 syscall-emulation are presented.

\begin{table}[!ht]
\caption[Results for bandwidth and bytes read/written with DDR4-2400.]{Results for bandwidth and bytes read/written with DDR4-2400. FS denotes gem5 full-system, SE denotes gem5 syscall-emulation, DS denotes DRAMSys.}
\begin{center}
\begin{tabular}{|c|c|c|c|c|c|c|c|c|c|}
\hline
\multirow{2}*{Benchmark} & \multicolumn{3}{|c|}{Avg. Bandwidth [GB/s]} & \multicolumn{3}{|c|}{Bytes Read [MB]} & \multicolumn{3}{|c|}{Bytes Written [MB]} \\
\cline{2-10}
& FS & SE & DS & FS & SE & DS & FS & SE & DS\\
\hline
\hline
COPY & 2.201 & 2.794 & 2.130 & 238.4 & 268.8 & 307.8 & 140.2 & 134.3 & 134.4 \\
\hline
DAXPY & 2.157 & 2.721 & 1.600 & 238.2 & 268.8 & 302.0 & 140.2 & 134.4 & 134.4 \\
\hline
INIT & 2.058 & 2.737 & 2.040 & 141.9 & 172.6 & 216.1 & 140.0 & 134.1 & 134.4 \\
\hline
SDAXPY & 2.239 & 2.813 & 2.080 & 335.1 & 364.8 & 403.0 & 140.3 & 134.4 & 134.4 \\
\hline
STRIAD & 2.246 & 2.803 & 2.350 & 335.1 & 460.9 & 494.4 & 140.4 & 134.4 & 134.4 \\
\hline
SUM & 1.429 & 1.982 & 1.110 & 142.0 & 172.7 & 189.1 & 44.0 & 38.4 & 38.5 \\
\hline
TRIAD & 2.246 & 2.853 & 2.110 & 335.1 & 364.9 & 402.6 & 140.4 & 134.4 & 134.4 \\
\hline
UPDATE & 1.995 & 2.611 & 1.430 & 142.0 & 172.7 & 220.0 & 140.1 & 134.2 & 134.4 \\
\hline
\end{tabular}
\end{center}
\label{tab:benchmark_gem5_bandwidth_ddr4}
\end{table}

Table \ref{tab:benchmark_gem5_bandwidth_ddr4} lists three key parameters for the DDR4-2400 configuration, specifically the average memory bandwidth and the number of bytes that have been read or written.
The results show that all parameters of DRAMSys correlate well with the gem5 statistics.
While for the average bandwidth the DynamoRIO results are on average 31.0\% lower compared to gem5 SE, this deviation is only 11.1\% for gem5 FS.
The numbers for the total amount of bytes read deviate by 35.5\% in comparison to gem5 FS and only by 14.6\% in comparison to gem5 SE.
The amount of bytes written, on the other hand, shows a very small deviation of 5.2\% for gem5 FS and only 0.07\% for gem5 SE.
Therefore, it can be stated that almost the same number of bytes was written back to the DRAM due to cache write-backs.

Those numbers are also illustrated in Figure \ref{fig:benchmark_gem5_bandwidth_ddr4}.

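The averaging behind these percentages is not spelled out above; the reported 31.0\% and 11.1\% are reproduced by a signed mean relative deviation over all eight kernels (a sketch, assuming that metric; variable names are illustrative), using the bandwidth figures from Table \ref{tab:benchmark_gem5_bandwidth_ddr4}:

```python
def mean_rel_dev_percent(reference, measured):
    """Signed mean relative deviation of `measured` vs. `reference`, in percent."""
    devs = [(m - r) / r for r, m in zip(reference, measured)]
    return 100.0 * sum(devs) / len(devs)

# Avg. bandwidth [GB/s] per kernel (COPY, DAXPY, INIT, SDAXPY, STRIAD, SUM, TRIAD, UPDATE)
fs = [2.201, 2.157, 2.058, 2.239, 2.246, 1.429, 2.246, 1.995]
se = [2.794, 2.721, 2.737, 2.813, 2.803, 1.982, 2.853, 2.611]
ds = [2.130, 1.600, 2.040, 2.080, 2.350, 1.110, 2.110, 1.430]

dev_se = mean_rel_dev_percent(se, ds)  # about -31.0 (DS below SE)
dev_fs = mean_rel_dev_percent(fs, ds)  # about -11.1 (DS below FS)
```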
\begin{figure}[!ht]
\begin{center}
\begin{tikzpicture}
\begin{axis}[
width=\textwidth-0.5cm,
ybar=1pt,
bar width = 8pt,
ymin=0,
ymajorgrids,
yminorgrids,
ylabel={Avg. Bandwidth [GB/s]},
symbolic x coords = {COPY, DAXPY, INIT, SDAXPY, STRIAD, SUM, TRIAD, UPDATE},
legend style={
at={(current bounding box.south-|current axis.south)},
anchor=north,
legend columns=-1,
draw=none,
/tikz/every even column/.append style={column sep=0.5cm}
},
x tick label style={/pgf/number format/1000 sep=},
x tick label style={rotate=90,anchor=east},
enlargelimits=0.075,
]
\addplot
coordinates {(COPY,2.201) (DAXPY,2.157) (INIT,2.058) (SDAXPY,2.239) (STRIAD,2.246) (SUM,1.429) (TRIAD,2.246) (UPDATE,1.995)};
\addplot
coordinates {(COPY,2.794) (DAXPY,2.721) (INIT,2.737) (SDAXPY,2.813) (STRIAD,2.803) (SUM,1.982) (TRIAD,2.853) (UPDATE,2.611)};
\addplot
coordinates {(COPY,2.130) (DAXPY,1.600) (INIT,2.040) (SDAXPY,2.080) (STRIAD,2.350) (SUM,1.110) (TRIAD,2.110) (UPDATE,1.430)};
\legend{gem5 FS,gem5 SE,DRAMSys}
\end{axis}
\end{tikzpicture}
\end{center}
\caption{Average Bandwidth with DDR4-2400.}
\label{fig:benchmark_gem5_bandwidth_ddr4}
\end{figure}

\begin{table}[!ht]
\caption[Results for bandwidth and bytes read/written with DDR3-1600.]{Results for bandwidth and bytes read/written with DDR3-1600. FS denotes gem5 full-system, SE denotes gem5 syscall-emulation, DS denotes DRAMSys.}
\begin{center}
\begin{tabular}{|c|c|c|c|c|c|c|c|c|c|}
\hline
\multirow{2}*{Benchmark} & \multicolumn{3}{|c|}{Avg. Bandwidth [GB/s]} & \multicolumn{3}{|c|}{Bytes Read [MB]} & \multicolumn{3}{|c|}{Bytes Written [MB]} \\
\cline{2-10}
& FS & SE & DS & FS & SE & DS & FS & SE & DS\\
\hline
\hline
COPY & 2.031 & 2.698 & 2.160 & 238.3 & 268.8 & 310.1 & 140.3 & 134.3 & 134.4\\
\hline
DAXPY & 2.070 & 2.627 & 1.610 & 238.2 & 268.9 & 301.9 & 140.2 & 134.4 & 134.4\\
\hline
INIT & 2.028 & 2.629 & 2.070 & 141.9 & 172.9 & 216.0 & 140.1 & 134.4 & 134.4\\
\hline
SDAXPY & 2.101 & 2.755 & 2.110 & 335.1 & 364.8 & 404.0 & 140.4 & 134.4 & 134.4\\
\hline
STRIAD & 2.228 & 2.613 & 2.370 & 431.6 & 460.9 & 494.7 & 140.4 & 134.4 & 134.4\\
\hline
SUM & 1.393 & 1.969 & 1.120 & 142.0 & 172.9 & 189.1 & 44.1 & 38.5 & 38.5\\
\hline
TRIAD & 2.162 & 2.725 & 2.140 & 335.1 & 364.9 & 403.8 & 140.4 & 134.4 & 134.4\\
\hline
UPDATE & 1.938 & 2.528 & 1.430 & 142.0 & 172.8 & 220.0 & 140.1 & 134.3 & 134.4\\
\hline
\end{tabular}
\end{center}
\label{tab:benchmark_gem5_bandwidth_ddr3}
\end{table}

\begin{figure}[!ht]
\begin{center}
\begin{tikzpicture}
\begin{axis}[
width=\textwidth-0.5cm,
ybar=1pt,
bar width = 8pt,
ymin=0,
ymajorgrids,
yminorgrids,
ylabel={Avg. Bandwidth [GB/s]},
symbolic x coords = {COPY, DAXPY, INIT, SDAXPY, STRIAD, SUM, TRIAD, UPDATE},
legend style={
at={(current bounding box.south-|current axis.south)},
anchor=north,
legend columns=-1,
draw=none,
/tikz/every even column/.append style={column sep=0.5cm}
},
x tick label style={/pgf/number format/1000 sep=},
x tick label style={rotate=90,anchor=east},
enlargelimits=0.075,
]
\addplot
coordinates {(COPY,2.031) (DAXPY,2.070) (INIT,2.028) (SDAXPY,2.101) (STRIAD,2.276) (SUM,1.393) (TRIAD,2.162) (UPDATE,1.938)};
\addplot
coordinates {(COPY,2.698) (DAXPY,2.627) (INIT,2.629) (SDAXPY,2.755) (STRIAD,2.613) (SUM,1.969) (TRIAD,2.725) (UPDATE,2.528)};
\addplot
coordinates {(COPY,2.160) (DAXPY,1.610) (INIT,2.070) (SDAXPY,2.110) (STRIAD,2.370) (SUM,1.120) (TRIAD,2.140) (UPDATE,1.430)};
\legend{gem5 FS,gem5 SE,DRAMSys}
\end{axis}
\end{tikzpicture}
\end{center}
\caption{Average Bandwidth with DDR3-1600.}
\label{fig:benchmark_gem5_bandwidth_ddr3}
\end{figure}

Table \ref{tab:benchmark_gem5_bandwidth_ddr3} and Figure \ref{fig:benchmark_gem5_bandwidth_ddr3} show the same key parameters for the DDR3 configuration.
Here, the absolute deviations in the average memory bandwidth amount to 27.5\% and 7.0\% for gem5 SE and gem5 FS, respectively.
The differences in the amount of bytes read amount to 31.6\% for gem5 FS and to 14.7\% for gem5 SE.
Here too, the bytes written only show small deviations of 5.2\% for gem5 FS and 0.02\% for gem5 SE.

It has to be noted that the average memory bandwidth of the new trace player is highly influenced by the configured CPI value.
To match a real system, this value therefore has to be chosen carefully in order to achieve accurate bandwidth results.


% Latency and simulation time

Another important metric in the evaluation of a memory subsystem is the average response latency of a memory access.
In Figure \ref{fig:latency_ddr4}, the average DRAM latencies are illustrated for the DDR4-2400 configuration.

While the latencies reported by DRAMSys are always higher for the respective benchmark, the deviation averages to 36.0\% in comparison to gem5 SE and to 24.9\% in comparison to gem5 FS.

Those numbers are listed in greater detail in Table \ref{tab:benchmark_gem5_access_ddr3} for the DDR3-1600 and in Table \ref{tab:benchmark_gem5_access_ddr4} for the DDR4-2400 configuration.
These tables also provide information about the simulation time of the different benchmarks.

\begin{figure}[!ht]
\begin{center}
\begin{tikzpicture}
\begin{axis}[
width=\textwidth-0.5cm,
ybar=1pt,
bar width = 8pt,
ymin=0,
ymajorgrids,
yminorgrids,
ylabel={Avg. Latency [ns]},
symbolic x coords = {COPY, DAXPY, INIT, SDAXPY, STRIAD, SUM, TRIAD, UPDATE},
legend style={
at={(current bounding box.south-|current axis.south)},
anchor=north,
legend columns=-1,
draw=none,
/tikz/every even column/.append style={column sep=0.5cm}
},
x tick label style={/pgf/number format/1000 sep=},
x tick label style={rotate=90,anchor=east},
enlargelimits=0.075,
]
\addplot
coordinates {(COPY,32.5) (DAXPY,31.4) (INIT,36.0) (SDAXPY,32.7) (STRIAD,34.5) (SUM,24.1) (TRIAD,34.5) (UPDATE,33.0)};
\addplot
coordinates {(COPY,29.8) (DAXPY,29.5) (INIT,34.8) (SDAXPY,26.4) (STRIAD,29.1) (SUM,27.0) (TRIAD,26.7) (UPDATE,34.2)};
\addplot
coordinates {(COPY,43.4) (DAXPY,38.8) (INIT,39.5) (SDAXPY,40.4) (STRIAD,40.1) (SUM,37.1) (TRIAD,40.4) (UPDATE,40.5)};
\legend{gem5 FS,gem5 SE,DRAMSys}
\end{axis}
\end{tikzpicture}
\end{center}
\caption{Average response latency with DDR4-2400.}
\label{fig:latency_ddr4}
\end{figure}

\subsection{Comparison to Ramulator}

In order to evaluate the new simulation front-end against a simulator that uses a similar approach, the benchmarks are compared with Ramulator in this section.
Ramulator's approach is also based on DBI; more specifically, it uses the Intel Pin tool to create a memory access trace of a running application.
Here, the cache filtering takes place when the trace is created instead of while the trace is played back by the simulator.
This means that the simulation of the cache cannot take into account the feedback from the DRAM system, and therefore the latencies of the cache are neglected.
Ramulator also uses the count of computational instructions to approximate the delay between two memory accesses.
Since Ramulator uses a CPI value of \textit{4} by default, this is also the value that DRAMSys is configured with.

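The difference can be illustrated with a minimal trace-time cache filter (a schematic sketch, not Ramulator's actual implementation): only addresses that miss in the modeled cache are emitted to the trace, so no timing feedback from the DRAM side is available at this point.

```python
def filter_trace(addresses, n_sets=64, line_bytes=64):
    """Emit only the addresses that miss in a direct-mapped cache model.
    Hits are dropped from the trace; no DRAM feedback is involved."""
    resident = [None] * n_sets          # tag store: one line per set
    misses = []
    for addr in addresses:
        line_addr = addr // line_bytes
        s = line_addr % n_sets
        if resident[s] != line_addr:    # miss: record it and install the line
            resident[s] = line_addr
            misses.append(addr)
    return misses

# Two cache lines touched twice each: only the first access of each misses.
trace = filter_trace([0, 64, 0, 64])   # -> [0, 64]
```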

The cache configuration remains the same as in the gem5 simulations, and the simulation is again performed with a DDR3-1600 and a DDR4-2400 configuration.
However, the address mapping is changed to a row-bank-rank-column-channel mapping with only one rank and one channel.
The exact configuration is listed in Section \ref{sec:address_mappings}.

In contrast to the previous simulations, the benchmarks are now single-threaded.

\begin{table}[!ht]
\caption{Results for average bandwidth and average latency with DDR3-1600.}
\begin{center}
\begin{tabular}{|c|c|c|c|c|}
\hline
\multirow{2}*{Benchmark} & \multicolumn{2}{|c|}{Avg. Bandwidth [GB/s]} & \multicolumn{2}{|c|}{Avg. Latency [ns]} \\
\cline{2-5}
& Ramulator & DRAMSys & Ramulator & DRAMSys \\
\hline
\hline
COPY & 3.053 & 2.930 & 66.7 & 74.4 \\
\hline
DAXPY & 3.049 & 2.940 & 66.7 & 54.9 \\
\hline
INIT & 3.063 & 2.760 & 66.6 & 63.8 \\
\hline
SDAXPY & 3.047 & 2.840 & 66.3 & 60.9 \\
\hline
STRIAD & 3.058 & 3.180 & 66.6 & 63.8 \\
\hline
SUM & 3.039 & 2.650 & 66.6 & 56.3 \\
\hline
TRIAD & 3.057 & 3.310 & 66.9 & 57.4 \\
\hline
UPDATE & 3.064 & 2.480 & 66.5 & 64.1 \\
\hline
\end{tabular}
\end{center}
\label{tab:benchmark_ramulator_bandwidth_ddr3}
\end{table}

\begin{table}[!ht]
\caption{Results for average bandwidth and average latency with DDR4-2400.}
\begin{center}
\begin{tabular}{|c|c|c|c|c|}
\hline
\multirow{2}*{Benchmark} & \multicolumn{2}{|c|}{Avg. Bandwidth [GB/s]} & \multicolumn{2}{|c|}{Avg. Latency [ns]} \\
\cline{2-5}
& Ramulator & DRAMSys & Ramulator & DRAMSys \\
\hline
\hline
COPY & 3.462 & 3.740 & 74.9 & 46.7 \\
\hline
DAXPY & 3.454 & 3.240 & 74.9 & 43.0 \\
\hline
INIT & 3.480 & 3.340 & 74.5 & 48.9 \\
\hline
SDAXPY & 3.475 & 3.430 & 74.1 & 39.5 \\
\hline
STRIAD & 3.490 & 3.830 & 74.1 & 41.7 \\
\hline
SUM & 3.496 & 3.040 & 73.7 & 38.0 \\
\hline
TRIAD & 3.468 & 4.210 & 75.1 & 35.8 \\
\hline
UPDATE & 3.478 & 3.130 & 74.6 & 43.6 \\
\hline
\end{tabular}
\end{center}
\label{tab:benchmark_ramulator_bandwidth_ddr4}
\end{table}

Tables \ref{tab:benchmark_ramulator_bandwidth_ddr3} and \ref{tab:benchmark_ramulator_bandwidth_ddr4} show that the average memory bandwidth of Ramulator matches well with the results of DRAMSys.
On average, the absolute deviation is about 19.1\% for the DDR4 simulation, whereas it only amounts to about 10.0\% for the DDR3 configuration.
The differences in the average access latency amount to 41.5\% and 3.6\% for the DDR4 and DDR3 simulations, respectively.

One noticeable aspect is that with Ramulator the latencies are higher with DDR4 than with DDR3, whereas with DRAMSys the opposite is the case.
A possible explanation is that Ramulator, as already mentioned, cannot take the feedback from the memory system into account during cache filtering, which can cause such deviations.

\subsection{Simulation Runtime Analysis}

The last topic for comparison is the speed increase (i.e., the reduction in \textit{wall clock time}) achieved by using the new simulation front-end instead of a detailed processor simulation.

For this, DRAMSys is again compared with gem5 SE and FS.
A comparison with Ramulator would not be meaningful, because the cache filtering takes place at different times: while with Ramulator the trace generation takes longer than with DynamoRIO, the simulation itself is faster.
The database recording feature of DRAMSys is also disabled for these measurements, since the additional file system accesses for this functionality severely degrade the simulator's performance.

Figure \ref{fig:runtimes} presents the runtimes of the various benchmarks and simulators.

\begin{figure}[!ht]
\begin{center}
\begin{tikzpicture}
\begin{axis}[
width=\textwidth-0.5cm,
ybar=1pt,
bar width = 8pt,
ymin=0,
ymajorgrids,
yminorgrids,
ylabel={Runtime [s]},
symbolic x coords = {COPY, DAXPY, INIT, SDAXPY, STRIAD, SUM, TRIAD, UPDATE},
legend style={
at={(current bounding box.south-|current axis.south)},
anchor=north,
legend columns=-1,
draw=none,
/tikz/every even column/.append style={column sep=0.5cm}
},
x tick label style={/pgf/number format/1000 sep=},
x tick label style={rotate=90,anchor=east},
enlargelimits=0.075,
]
\addplot
coordinates {(COPY,265.07) (DAXPY,301.15) (INIT,216.9) (SDAXPY,338.08) (STRIAD,352.47) (SUM,213.43) (TRIAD,315.63) (UPDATE,262.51)};
\addplot
coordinates {(COPY,129.4) (DAXPY,149.87) (INIT,97.77) (SDAXPY,180.52) (STRIAD,195.25) (SUM,88.57) (TRIAD,166.9) (UPDATE,122.3)};
\addplot
coordinates {(COPY,73.096731) (DAXPY,80.801838) (INIT,54.796846) (SDAXPY,97.89146) (STRIAD,113.816785) (SUM,37.074149) (TRIAD,92.063386) (UPDATE,58.63603)};
\legend{gem5 FS,gem5 SE,DRAMSys}
\end{axis}
\end{tikzpicture}
\end{center}
\caption{Runtimes for the utilized benchmarks with DDR4-2400.}
\label{fig:runtimes}
\end{figure}

As expected, DRAMSys outperforms the gem5 full-system and syscall-emulation simulators in every case.
On average, DRAMSys is 47.0\% faster than gem5 SE and 73.7\% faster than gem5 FS, with a maximum runtime reduction of 82.6\% for the \texttt{SUM} benchmark.
While gem5 SE only simulates the target application using the detailed processor model, gem5 FS additionally has to simulate the complete operating system kernel and the applications that run concurrently in the background.