Simulation chapter

2022-07-16 18:05:29 +02:00
parent f542b2c034
commit 6324ae1d3d
4 changed files with 82 additions and 40 deletions
--- a/inc/1.introduction.tex
+++ b/inc/1.introduction.tex
@@ -2,7 +2,7 @@
 \label{sec:introduction}
 % maybe also explain why DRAMs are increasingly being used
 Today's computing systems accompany us in almost all areas of life in the form of smart devices, computers, or game consoles.
-With the increasing performance requirements on these devices, not only faster processors are needed, but also high-performance memory systems, namely dynamic random access memories, which are supposed to deliver a lot of bandwidth at low latency.
+With the increasing performance requirements on these devices, not only faster processors are needed, but also high-performance memory systems, namely dynamic random access memories, which are supposed to deliver a lot of bandwidth at a low latency.
 While these storage systems are very complex and offer a lot of room for configuration, e.g., the \revabbr{dynamic random-access memory}{DRAM} standard, the memory controller configuration or the address mapping, there are different requirements for the very different applications\cite{Gomony2012}.
 Consequently, system designers are entrusted with the complex task of finding the most effective configurations that match the performance and power constraints with good optimizations applied for the specific use case.

--- a/inc/7.simulation_results.tex
+++ b/inc/7.simulation_results.tex
@@ -128,7 +128,7 @@ In the following, the simulation results of the new simulation frontend, the gem
 Listed in Table \ref{tab:benchmark_gem5_bandwidth_ddr4} are three key parameters, specifically the average memory bandwidth and the number of bytes that has been read or written for the DDR4-2400 configuration.
 The results show that all parameters of DRAMSys correlate well with the gem5 statistics.
 While the average bandwidth reported for the DynamoRIO traces is on average 31.0\% lower than that of gem5 SE, this deviation is only 11.1\% for gem5 FS.
-The numbers for the total amount of bytes read result in a deviation of 35.5\% in comparision to gem5 FS and only to 14.6\% to gem5 SE.
+The total number of bytes read deviates by 35.5\% in comparison to gem5 FS and by only 14.6\% in comparison to gem5 SE.
 The amount of bytes written, on the other hand, shows a very small deviation of 5.2\% for gem5 FS and only 0.07\% for gem5 SE.
 Therefore, it can be stated that almost the same number of bytes were written back to DRAM due to cache write-backs.

@@ -254,6 +254,14 @@ So to match a real system, this value has to be chosen wisely to achieve good si

 % Latency und simulation time

+Another important metric in the evaluation of a memory subsystem is the average response latency of a memory access.
+In Figure \ref{fig:latency_ddr4}, the average latencies of the DRAM are illustrated for the DDR4-2400 configuration.
+
+The latencies reported by DRAMSys are always higher for the respective benchmark; on average, the deviation is 36.0\% in comparison to gem5 SE and 24.9\% in comparison to gem5 FS.
+
+Those numbers can be looked up in greater detail in Table \ref{tab:benchmark_gem5_access_ddr3} for the DDR3-1600 and in Table \ref{tab:benchmark_gem5_access_ddr4} for the DDR4-2400 configuration.
+These tables also provide information about the simulation time of the different benchmarks.
+
 \begin{figure}[!ht]
 \begin{center}
 \begin{tikzpicture}
@@ -288,40 +296,51 @@ So to match a real system, this value has to be chosen wisely to achieve good si
 \end{axis}
 \end{tikzpicture}
 \end{center}
-\caption{Average latency with DDR4-2400.}
+\caption{Average response latency with DDR4-2400.}
+\label{fig:latency_ddr4}
 \end{figure}

-
 \subsection{Comparison to Ramulator}

-Still need to re-describe the configuration with MHz and so on.
+In order to evaluate the new simulation frontend against a simulator that uses a similar approach, the benchmarks are now compared with Ramulator.
+Ramulator's approach is also based on DBI: more specifically, it uses the Intel Pin-Tool to create a memory access trace of any application.
+Cache filtering takes place when the trace is created instead of while the trace is simulated by Ramulator.
+This means that the simulation of the cache cannot take into account the feedback from the DRAM system and the latencies of the cache are neglected.
+Ramulator also uses the number of computational instructions to approximate the delay between two memory accesses.
+Since Ramulator uses a CPI value of \textit{4} by default, this is also the value that DRAMSys is configured with.
+
+The cache configuration remains the same as in the gem5 simulations, and the simulation is also performed again with a DDR3-1600 and DDR4-2400 configuration.
+However, the address mapping has changed, namely to a row-bank-rank-column-channel address mapping with only one rank and one channel.
+The exact configuration is listed in Section \ref{sec:address_mappings}.
+
+In contrast to the previous simulations, the benchmarks are now single-threaded.

 \begin{table}[!ht]
 \caption{Results for average bandwidth and average latency with DDR3-1600.}
 \begin{center}
-\begin{tabular}{|c|c|c|c|c|c|c|}
+\begin{tabular}{|c|c|c|c|c|}
 \hline
- \multirow{2}*{Benchmark} & \multicolumn{2}{|c|}{Avg. Bandwidth [GB/s]} & \multicolumn{2}{|c|}{Bytes Read [MB]} & \multicolumn{2}{|c|}{Bytes Written [MB]} \\
- \cline{2-7}
- & Ramulator & DRAMSys & Ramulator & DRAMSys & Ramulator & DRAMSys\\
+ \multirow{2}*{Benchmark} & \multicolumn{2}{|c|}{Avg. Bandwidth [GB/s]} & \multicolumn{2}{|c|}{Avg. Latency [ns]} \\
+ \cline{2-5}
+ & Ramulator & DRAMSys & Ramulator & DRAMSys \\
 \hline
 \hline

- COPY & 3.053 & 2.93 & 60.2 & 420.3 & 26.4 & 210.1 \\
+ COPY & 3.053 & 2.930 & 66.7 & 74.4 \\
 \hline
- DAXPY & 3.049 & 2.94 & 60.3 & 420.2 & 26.5 & 210.1 \\
+ DAXPY & 3.049 & 2.940 & 66.7 & 54.9 \\
 \hline
- INIT & 3.063 & 2.76 & 60.9 & 271.6 & 26.8 & 210.1 \\
+ INIT & 3.063 & 2.760 & 66.6 & 63.8 \\
 \hline
- SDAXPY & 3.047 & 2.84 & 60.6 & 570.1 & 26.9 & 210.1 \\
+ SDAXPY & 3.047 & 2.840 & 66.3 & 60.9 \\
 \hline
- STRIAD & 3.058 & 3.18 & 60.7 & 720.4 & 26.7 & 210.1 \\
+ STRIAD & 3.058 & 3.180 & 66.6 & 63.8 \\
 \hline
- SUM & 3.039 & 2.65 & 61.4 & 270.1 & 27.2 & 60.1 \\
+ SUM & 3.039 & 2.650 & 66.6 & 56.3 \\
 \hline
- TRIAD & 3.057 & 3.31 & 60.6 & 570.1 & 26.7 & 210.1 \\
+ TRIAD & 3.057 & 3.310 & 66.9 & 57.4 \\
 \hline
- UPDATE & 3.064 & 2.48 & 61.0 & 271.6 & 26.7 & 210.1 \\
+ UPDATE & 3.064 & 2.480 & 66.5 & 64.1 \\
 \hline

 \end{tabular}
@@ -332,38 +351,53 @@ Still need to re-describe the configuration with MHz and so on.
 \begin{table}[!ht]
 \caption{Results for average bandwidth and average latency with DDR4-2400.}
 \begin{center}
-\begin{tabular}{|c|c|c|c|c|c|c|}
+\begin{tabular}{|c|c|c|c|c|}
 \hline
- \multirow{2}*{Benchmark} & \multicolumn{2}{|c|}{Avg. Bandwidth [GB/s]} & \multicolumn{2}{|c|}{Bytes Read [MB]} & \multicolumn{2}{|c|}{Bytes Written [MB]} \\
- \cline{2-7}
- & Ramulator & DRAMSys & Ramulator & DRAMSys & Ramulator & DRAMSys\\
+ \multirow{2}*{Benchmark} & \multicolumn{2}{|c|}{Avg. Bandwidth [GB/s]} & \multicolumn{2}{|c|}{Avg. Latency [ns]} \\
+ \cline{2-5}
+ & Ramulator & DRAMSys & Ramulator & DRAMSys \\
 \hline
 \hline

- COPY & 3.462 & 3.740 & 60.2 & 269.0 & 26.4 & 134.4 \\
+ COPY & 3.462 & 3.740 & 74.9 & 46.7 \\
 \hline
- DAXPY & 3.454 & 3.240 & 60.3 & 268.9 & 26.5 & 134.4 \\
+ DAXPY & 3.454 & 3.240 & 74.9 & 43.0 \\
 \hline
- INIT & 3.480 & 3.340 & 60.9 & 173.8 & 26.8 & 134.4 \\
+ INIT & 3.480 & 3.340 & 74.5 & 48.9 \\
 \hline
- SDAXPY & 3.475 & 3.430 & 60.6 & 364.9 & 26.9 & 134.4 \\
+ SDAXPY & 3.475 & 3.430 & 74.1 & 39.5 \\
 \hline
- STRIAD & 3.490 & 3.830 & 60.7 & 461.0 & 26.7 & 134.4 \\
+ STRIAD & 3.490 & 3.830 & 74.1 & 41.7 \\
 \hline
- SUM & 3.496 & 3.040 & 61.4 & 172.9 & 27.2 & 38.4 \\
+ SUM & 3.496 & 3.040 & 73.7 & 38.0 \\
 \hline
- TRIAD & 3.468 & 4.210 & 60.6 & 364.9 & 26.7 & 134.4 \\
+ TRIAD & 3.468 & 4.210 & 75.1 & 35.8 \\
 \hline
- UPDATE & 3.478 & 3.130 & 61.0 & 173.9 & 26.7 & 134.4 \\
+ UPDATE & 3.478 & 3.130 & 74.6 & 43.6 \\
 \hline

 \end{tabular}
 \end{center}
-\label{tab:benchmark_ramulator_bandwidth_ddr3}
+\label{tab:benchmark_ramulator_bandwidth_ddr4}
 \end{table}

+In Tables \ref{tab:benchmark_ramulator_bandwidth_ddr3} and \ref{tab:benchmark_ramulator_bandwidth_ddr4}, it can be seen that the average memory bandwidth of Ramulator matches well with the results of DRAMSys.
+On average, the absolute deviation is about 19.1\% for the DDR4 simulation, whereas it only amounts to about 10.0\% for the DDR3 configuration.
+The differences in the average access latency amount to 41.5\% and 3.6\% for the DDR4 and DDR3 simulations, respectively.
+
+One noticeable aspect is that with Ramulator, the latencies are greater with DDR4 than with DDR3, whereas with DRAMSys the opposite is the case.
+A possible explanation could be that, as mentioned before, Ramulator cannot take the feedback from the memory system into account in the cache filtering, and therefore deviations may occur.
+
 \subsection{Simulation Runtime Analysis}

+The final comparison analyzes the speed increase (i.e., the reduction in \textit{wall clock time}) achieved by using the new simulation frontend instead of a detailed processor simulation.
+
+For this, DRAMSys is again compared with gem5 SE and FS.
+A comparison with Ramulator would not be meaningful, because the cache filtering takes place at different times: while with Ramulator the trace generation takes longer than with DynamoRIO, the simulation is faster.
+The database recording feature of DRAMSys is also disabled for these measurements, since the additional file system accesses for this functionality severely degrade the simulator's performance.
+
+Figure \ref{fig:runtimes} presents the runtimes of the various benchmarks and simulators.
+
 \begin{figure}[!ht]
 \begin{center}
 \begin{tikzpicture}
@@ -398,5 +432,11 @@ Still need to re-describe the configuration with MHz and so on.
 \end{axis}
 \end{tikzpicture}
 \end{center}
-\caption{ddr4.}
+\caption{Runtimes for the utilized benchmarks with DDR4-2400.}
+\label{fig:runtimes}
 \end{figure}
+
+As expected, DRAMSys outperforms the gem5 full-system and syscall-emulation simulators in every case.
+On average, DRAMSys is 47.0\% faster than gem5 SE and 73.7\% faster than gem5 FS, with a maximum speedup of 82.6\% for the benchmark \texttt{SUM}.
+While gem5 SE only simulates the target application using the detailed processor model, gem5 FS has to simulate the complete operating system kernel and the applications that run concurrently in the background.
+
--- a/inc/8.future_work.tex
+++ b/inc/8.future_work.tex
@@ -1,17 +1,18 @@
 \section{Future Work}
 \label{sec:future_work}

-Due to the complexity of possible memory sub-system configurations, simulation is an indispensable part of the development process of today's systems.
+Due to the complexity of possible memory subsystem configurations, simulation is an indispensable part of the development process of today's systems.
 It not only has a high impact on the development cost but also significantly reduces the time-to-market and enables the rapid release of new products.
 However, the accurate simulation of a specific application takes a large period of time because of the detailed processor core models.
-On the other hand, fixed or relative time memory traces allow faster simulation at the expense of accuracy, which makes it often unsuitable.
+On the other hand, fixed or relative time memory traces allow faster simulation at the expense of accuracy, which often makes them unsuitable.
 To fill this gap, this thesis introduced a new simulation frontend for DRAMSys that is fast and makes only a few compromises on accuracy.

 In conclusion, the newly developed instrumentation tool provides a flexible way of generating traces for arbitrary multi-threaded applications.
 The mature DRAMSys simulator framework can then be used to explore the design space and vary numerous configuration parameters of the DRAM subsystem to find a well-suited set of options.

-It was shown that in comparison to the well-established full-system simulation framework gem5, only small deviations have to be accepted.
+It was shown that in comparison to the well-established full-system simulation framework gem5, only some deviations have to be accepted.
 Also, the Pin-Tool based memory access tracing of the Ramulator DRAM simulator was compared to the new frontend. %(briefly summarize the results here)
+Although Ramulator takes a slightly different approach to trace generation than this thesis, a very good correlation in the results could be demonstrated.
 A noteworthy advantage of the newly developed tool is its support for all hardware architectures that DynamoRIO provides (currently IA-32, x86-64, ARM, and AArch64) in contrast to the supported architectures of Pin (IA-32 and x86-64).

 Still, there is room for improvement.
@@ -36,10 +37,10 @@ Support for this would have to be added to the core and cache models as well as

 The recorded number of computational instructions between each memory access, which is used to estimate the time between those accesses, is multiplied by the clock period of the trace player.
 However, this is a vast simplification of the real timing behavior of a processor.
-In the future, the DynamoRIO tool could decode those computational instructions and create a better estimate of the execution time of those instructions, based on statistical estimates that have been published before\cite{Abel19a}\cite{Fog2022}.
+In the future, the DynamoRIO tool could decode those computational instructions and create a better estimate of the execution time of those instructions, based on statistical estimates that have been published before \cite{Abel19a, Fog2022}.

 One significant improvement that still could be applied is the consideration of dependencies between the memory accesses.
-Similarily to the elastic trace player of gem5\cite{Jagtap2016}, which captures data load and store dependencies by instrumenting a detailed out-of-order processor model, the DynamoRIO tool could create a dependency graph of the memory accesses using the decoded instructions.
+Similarly to the elastic trace player of gem5 \cite{Jagtap2016}, which captures data load and store dependencies by instrumenting a detailed out-of-order processor model, the DynamoRIO tool could create a dependency graph of the memory accesses using the decoded instructions.
 By using this technique, it is possible to also model out-of-order behavior of modern processors and make the simulation more accurate, whereas the current implementation is entirely in-order.

-These mentioned potential improvements could make the new simulation frontend for dramsys even more accurate.
+The potential improvements mentioned above could make the new simulation frontend for DRAMSys even more accurate.
--- a/inc/appendix.tex
+++ b/inc/appendix.tex
@@ -72,6 +72,7 @@
 \end{center}
 \end{table}

+\newpage
 \subsection{Simulation Results}
 \label{sec:appendix_sim_results}

@@ -109,7 +110,7 @@
 \end{table}

 \begin{table}[!ht]
-\caption{Results for memory access latency and data bus utilization with DDR3-1600.}
+\caption{Results for the total simulation time and the average response latency with DDR3-1600.}
 \begin{center}
 \begin{tabular}{|c|c|c|c|c|c|c|}
 \hline
@@ -138,11 +139,11 @@

 \end{tabular}
 \end{center}
-\label{tab:benchmark_access_ddr3}
+\label{tab:benchmark_gem5_access_ddr3}
 \end{table}

 \begin{table}[!ht]
-\caption{Results for memory access latency and data bus utilization with DDR4-2400.}
+\caption{Results for the total simulation time and the average response latency with DDR4-2400.}
 \begin{center}
 \begin{tabular}{|c|c|c|c|c|c|c|}
 \hline