Start of simulation chapter

2022-06-28 15:54:15 +02:00
parent 681cda3d8d
commit 6d781f5cd8
3 changed files with 282 additions and 4 deletions
--- a/Bachelorarbeit.kilepr
+++ b/Bachelorarbeit.kilepr
@@ -92,9 +92,9 @@ mode=LaTeX
 [item:inc/8.future_work.tex]
 archive=true
-encoding=
+encoding=UTF-8
-highlight=
+highlight=LaTeX
-mode=
+mode=LaTeX
 [item:inc/appendix.tex]
 archive=true
--- a/doc.bib
+++ b/doc.bib
@@ -188,4 +188,16 @@
  doi       = {10.1109/SAMOS.2016.7818336},
 }
@Article{Qemu,
  journal = {A generic and open source machine emulator and virtualizer},
  title   = {Q{E}{M}{U}},
  note    = {https://www.qemu.org/. Accessed: 2022-06-28},
 }
@Article{TheBandwidthBenchmark,
  author = {Erlangen National High Performance Computing Center},
  title  = {The {B}andwidth {B}enchmark},
  note   = {https://github.com/RRZE-HPC/TheBandwidthBenchmark. Accessed: 2022-06-28},
 }
@Comment{jabref-meta: databaseType:bibtex;}
--- a/inc/7.simulation_results.tex
+++ b/inc/7.simulation_results.tex
@@ -19,4 +19,270 @@ Since the DBI cannot observe the fetching of those instructions, the new simulat
 \subsection{Comparison to the gem5 Simulator}
-At first, the micro-benchmark suite TheBandwithBenchmark\cite{} will be used to compare the gem5 full-system simulation as well as the gem5 syscall-emulation simulation modes with the newly developed frontend.
+First, the micro-benchmark suite TheBandwidthBenchmark\cite{TheBandwidthBenchmark}, which contains various streaming kernels, is used to compare the gem5 full-system and the gem5 syscall-emulation simulation modes with the newly developed frontend.
 In both cases, the simulation setup consists of a two-level cache hierarchy with the following parameters:
 \begin{table}[!ht]
 \caption{Cache parameters.}
 \begin{center}
 \begin{tabular}{|c|c|c|c|c|c|c|}
 \hline
 Cache & Size & Associativity & Line size [B] & MSHRs & MSHR targets & WB entries\\
 \hline
 \hline
 L1 & 32 KiB & 8 & 64 & 4 & 20 & 8\\
 \hline
 L2 & 256 KiB & 4 & 64 & 20 & 12 & 8\\
 \hline
 \end{tabular}
 \end{center}
 \label{tab:cache_parameters}
 \end{table}
 In this configuration, every processor core has its own L1 data cache (in the case of gem5 also an L1 instruction cache), whereas the L2 cache is shared between all cores.
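 For reference, the number of sets of each cache follows directly from the parameters in Table~\ref{tab:cache_parameters}. The following is only a sketch under the assumption that the line size is given in bytes; the symbol $N_\mathrm{sets}$ is introduced here for illustration:
 \[ N_\mathrm{sets} = \frac{\mathrm{size}}{\mathrm{associativity} \cdot \mathrm{line\ size}} \]
 \[ N_\mathrm{sets,L1} = \frac{32\,\mathrm{KiB}}{8 \cdot 64\,\mathrm{B}} = 64, \qquad N_\mathrm{sets,L2} = \frac{256\,\mathrm{KiB}}{4 \cdot 64\,\mathrm{B}} = 1024 \]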
 The gem5 simulator uses four ARM CPU core models (TimingSimpleCPU, an in-order core model) at 1 GHz, whereas the DynamoRIO traces are obtained using a QEMU\cite{Qemu} ARM virtual machine, configured to use four cores as well.
 The DRAM subsystem will be varied between a single-channel DDR3 memory (1600 MT/s) and a single-channel DDR4 memory (2400 MT/s).
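 As a rough point of reference, the theoretical peak bandwidth of the two memory configurations can be estimated from the transfer rate. This estimate assumes a 64-bit wide data bus per channel, which is an assumption made here and is not taken from the configuration described above:
 \[ B_\mathrm{peak,DDR3} = 1600\,\mathrm{MT/s} \cdot 8\,\mathrm{B} = 12.8\,\mathrm{GB/s} \]
 \[ B_\mathrm{peak,DDR4} = 2400\,\mathrm{MT/s} \cdot 8\,\mathrm{B} = 19.2\,\mathrm{GB/s} \]
 Under the same assumption, the data bus utilization corresponds approximately to the ratio of the average bandwidth to this peak value.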
 % Explain the DRAMSys configuration here!
 To match the gem5 configuration, the memory controller in DRAMSys is set to use a \revabbr{first-ready - first-come, first-served}{FR-FCFS} scheduling policy, a \revabbr{first-in, first-out}{FIFO} response queue policy, and a row-rank-bank-column-channel address mapping.
 The trace player operates at the same clock frequency as the gem5 core models.
 The micro-benchmarks themselves are multi-threaded and use all four cores.
 Their access patterns are as follows:
 \begin{table}[!ht]
 \caption{Access patterns of the micro-benchmark kernels\cite{TheBandwidthBenchmark}.}
 \begin{center}
 \begin{tabular}{|c|c|c|}
 \hline
 Benchmark kernel & Description & Access pattern \\
 \hline
 \hline
 INIT & Initialize an array & a = s (store, write allocate) \\
 \hline
 SUM & Vector reduction & s += a (load)\\
 \hline
 COPY & Memory copy & a = b (load, store, write allocate)\\
 \hline
 UPDATE & Update vector & a = a * scalar (load, store)\\
 \hline
 TRIAD & Stream triad & a = b + c * scalar (load, store, write allocate)\\
 \hline
 DAXPY & Daxpy & a = a + b * scalar (load, store)\\
 \hline
 STRIAD & Schönauer triad & a = b + c * d (load, store, write allocate)\\
 \hline
 SDAXPY & Schönauer triad without write allocate & a = a + b * c (load, store)\\
 \hline
 \end{tabular}
 \end{center}
 \label{tab:benchmark_description}
 \end{table}
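 The access patterns in Table~\ref{tab:benchmark_description} suggest a simple estimate of the DRAM traffic caused by each kernel. As a sketch, assume that every array occupies $V$ bytes, that the arrays are much larger than the L2 cache, and that $n_\mathrm{ld}$, $n_\mathrm{wa}$ and $n_\mathrm{st}$ denote the number of arrays that are loaded, write-allocated and stored by a kernel (these symbols are introduced here for illustration only):
 \[ V_\mathrm{read} \approx (n_\mathrm{ld} + n_\mathrm{wa}) \cdot V, \qquad V_\mathrm{write} \approx n_\mathrm{st} \cdot V \]
 For COPY ($n_\mathrm{ld} = 1$, $n_\mathrm{wa} = 1$, $n_\mathrm{st} = 1$) this gives $V_\mathrm{read} \approx 2V$ and $V_\mathrm{write} \approx V$, whereas SUM ($n_\mathrm{ld} = 1$, $n_\mathrm{wa} = 0$, $n_\mathrm{st} = 0$) gives $V_\mathrm{read} \approx V$ and $V_\mathrm{write} \approx 0$. Traffic caused by the operating system or by the benchmark setup is not covered by this estimate.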
 In the following, the simulation results of the new simulation frontend, the gem5 full-system simulation and the gem5 syscall-emulation simulation are presented.
 \begin{table}[!ht]
 \caption{Results for bandwidth and bytes read/written with DDR3-1600. FS denotes gem5 full-system, SE denotes gem5 syscall-emulation, DS denotes DRAMSys.}
 \begin{center}
 \begin{tabular}{|c|c|c|c|c|c|c|c|c|c|}
 \hline
 \multirow{2}*{Benchmark} & \multicolumn{3}{|c|}{Avg. Bandwidth [GB/s]} & \multicolumn{3}{|c|}{Bytes Read [MB]} & \multicolumn{3}{|c|}{Bytes Written [MB]} \\
 \cline{2-10}
 & FS & SE & DS & FS & SE & DS & FS & SE & DS\\
 \hline
 \hline
 COPY & 2.031 & 2.698& 4 & 238.3 & 268.8& 7 & 140.3 & 134.3 & 10\\
 \hline
 DAXPY & 2.070 & 2.627& 4 & 238.2 & 268.9 & 7 & 140.2 & 134.4 & 10\\
 \hline
 INIT & 2.028 & 2.629& 4 & 141.9 & 172.9 & 7 & 140.1 & 134.4 & 10\\
 \hline
 SDAXPY & 2.101 & 2.755& 4 & 335.1 & 364.8 & 7 & 140.4 & 134.4 & 10\\
 \hline
 STRIAD & 2.228 & 2.613& 4 & 431.6& 460.9 & 7 & 140.4 & 134.4 & 10\\
 \hline
 SUM & 1.393 & 1.969& 4 & 142.0 & 172.9 & 7 & 44.1 & 38.5 & 10\\
 \hline
 TRIAD & 2.162 & 2.725& 4 & 335.1 & 364.9 & 7 & 140.4 & 134.4 & 10\\
 \hline
 UPDATE & 1.938 & 2.528& 4 & 142.0& 172.8 & 7 & 140.1 & 134.3 & 10\\
 \hline
 \end{tabular}
 \end{center}
 \label{tab:benchmark_bandwidth_ddr3}
 \end{table}
 \begin{table}[!ht]
 \caption{Results for bandwidth and bytes read/written with DDR4-2400.}
 \begin{center}
 \begin{tabular}{|c|c|c|c|c|c|c|c|c|c|}
 \hline
 \multirow{2}*{Benchmark} & \multicolumn{3}{|c|}{Avg. Bandwidth [GB/s]} & \multicolumn{3}{|c|}{Bytes Read [MB]} & \multicolumn{3}{|c|}{Bytes Written [MB]} \\
 \cline{2-10}
 & FS & SE & DS & FS & SE & DS & FS & SE & DS\\
 \hline
 \hline
 COPY & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10\\
 \hline
 DAXPY & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10\\
 \hline
 INIT & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10\\
 \hline
 SDAXPY & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10\\
 \hline
 STRIAD & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10\\
 \hline
 SUM & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10\\
 \hline
 TRIAD & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10\\
 \hline
 UPDATE & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10\\
 \hline
 \end{tabular}
 \end{center}
 \label{tab:benchmark_bandwidth_ddr4}
 \end{table}
 Tables \ref{tab:benchmark_bandwidth_ddr3} and \ref{tab:benchmark_bandwidth_ddr4}
 \begin{table}[!ht]
 \caption{Results for memory access latency and data bus utilization with DDR3-1600.}
 \begin{center}
 \begin{tabular}{|c|c|c|c|c|c|c|}
 \hline
 \multirow{2}*{Benchmark} & \multicolumn{3}{|c|}{Avg. Access Latency [ns]} & \multicolumn{3}{|c|}{Data Bus Utilization [\%]} \\
 \cline{2-7}
 & FS & SE & DS & FS & SE & DS\\
 \hline
 \hline
 COPY & 2 & 3 & 4 & 5 & 6 & 7\\
 \hline
 DAXPY & 2 & 3 & 4 & 5 & 6 & 7\\
 \hline
 INIT & 2 & 3 & 4 & 5 & 6 & 7\\
 \hline
 SDAXPY & 2 & 3 & 4 & 5 & 6 & 7\\
 \hline
 STRIAD & 2 & 3 & 4 & 5 & 6 & 7\\
 \hline
 SUM & 2 & 3 & 4 & 5 & 6 & 7\\
 \hline
 TRIAD & 2 & 3 & 4 & 5 & 6 & 7\\
 \hline
 UPDATE & 2 & 3 & 4 & 5 & 6 & 7\\
 \hline
 \end{tabular}
 \end{center}
 \label{tab:benchmark_access_ddr3}
 \end{table}
 \begin{table}[!ht]
 \caption{Results for memory access latency and data bus utilization with DDR4-2400.}
 \begin{center}
 \begin{tabular}{|c|c|c|c|c|c|c|}
 \hline
 \multirow{2}*{Benchmark} & \multicolumn{3}{|c|}{Avg. Access Latency [ns]} & \multicolumn{3}{|c|}{Data Bus Utilization [\%]} \\
 \cline{2-7}
 & FS & SE & DS & FS & SE & DS\\
 \hline
 \hline
 COPY & 2 & 3 & 4 & 5 & 6 & 7\\
 \hline
 DAXPY & 2 & 3 & 4 & 5 & 6 & 7\\
 \hline
 INIT & 2 & 3 & 4 & 5 & 6 & 7\\
 \hline
 SDAXPY & 2 & 3 & 4 & 5 & 6 & 7\\
 \hline
 STRIAD & 2 & 3 & 4 & 5 & 6 & 7\\
 \hline
 SUM & 2 & 3 & 4 & 5 & 6 & 7\\
 \hline
 TRIAD & 2 & 3 & 4 & 5 & 6 & 7\\
 \hline
 UPDATE & 2 & 3 & 4 & 5 & 6 & 7\\
 \hline
 \end{tabular}
 \end{center}
 \label{tab:benchmark_access_ddr4}
 \end{table}
 \begin{table}[!ht]
 \caption{Results for last-level cache (L2) statistics with DDR3-1600.}
 \begin{center}
 \begin{tabular}{|c|c|c|c|c|c|c|c|c|c|}
 \hline
 \multirow{2}*{Benchmark} & \multicolumn{3}{|c|}{Hits} & \multicolumn{3}{|c|}{Misses} & \multicolumn{3}{|c|}{Miss Rate [\%]} \\
 \cline{2-10}
 & FS & SE & DS & FS & SE & DS & FS & SE & DS\\
 \hline
 \hline
 COPY & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10\\
 \hline
 DAXPY & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10\\
 \hline
 INIT & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10\\
 \hline
 SDAXPY & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10\\
 \hline
 STRIAD & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10\\
 \hline
 SUM & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10\\
 \hline
 TRIAD & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10\\
 \hline
 UPDATE & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10\\
 \hline
 \end{tabular}
 \end{center}
 \label{tab:benchmark_cache_ddr3}
 \end{table}
 \begin{table}[!ht]
 \caption{Results for last-level cache (L2) statistics with DDR4-2400.}
 \begin{center}
 \begin{tabular}{|c|c|c|c|c|c|c|c|c|c|}
 \hline
 \multirow{2}*{Benchmark} & \multicolumn{3}{|c|}{Hits} & \multicolumn{3}{|c|}{Misses} & \multicolumn{3}{|c|}{Miss Rate [\%]} \\
 \cline{2-10}
 & FS & SE & DS & FS & SE & DS & FS & SE & DS\\
 \hline
 \hline
 COPY & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10\\
 \hline
 DAXPY & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10\\
 \hline
 INIT & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10\\
 \hline
 SDAXPY & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10\\
 \hline
 STRIAD & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10\\
 \hline
 SUM & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10\\
 \hline
 TRIAD & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10\\
 \hline
 UPDATE & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10\\
 \hline
 \end{tabular}
 \end{center}
 \label{tab:benchmark_cache_ddr4}
 \end{table}
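 The miss rate in Tables~\ref{tab:benchmark_cache_ddr3} and~\ref{tab:benchmark_cache_ddr4} is assumed here to follow the usual definition; the exact accounting of accesses may differ between the individual simulators:
 \[ \mathrm{Miss\ Rate} = \frac{\mathrm{Misses}}{\mathrm{Hits} + \mathrm{Misses}} \cdot 100\,\% \]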
 % \subsubsection{New simulation frontend}
 %
 % \subsubsection{gem5 full-system mode}
 %
 % \subsubsection{gem5 syscall-emulation mode}
 \subsection{Comparison to Ramulator}
 \subsection{Simulation Runtime}