Lukas' second improvements
BIN
img/traceanalyzer.png
Normal file
Binary file not shown. Size: 197 KiB
@@ -14,7 +14,7 @@
|
|||||||
\node [style=cache 2] (11) at (14, -6.25) {Cache};
|
\node [style=cache 2] (11) at (14, -6.25) {Cache};
|
||||||
\node [style=page frame] (12) at (3, -11) {Page Frame Number};
|
\node [style=page frame] (12) at (3, -11) {Page Frame Number};
|
||||||
\node [style=page frame] (13) at (3, -13) {Tag: Page Frame Number};
|
\node [style=page frame] (13) at (3, -13) {Tag: Page Frame Number};
|
||||||
\node [style=cache data] (14) at (10, -16.5) {Cache Data};
|
\node [style=cache data] (14) at (10, -15.225) {Cache Data};
|
||||||
\node [style=none] (15) at (1.5, -5) {};
|
\node [style=none] (15) at (1.5, -5) {};
|
||||||
\node [style=none] (16) at (6, -1) {};
|
\node [style=none] (16) at (6, -1) {};
|
||||||
\node [style=none] (17) at (6, -3) {};
|
\node [style=none] (17) at (6, -3) {};
|
||||||
@@ -31,7 +31,7 @@
|
|||||||
\node [style=none] (28) at (13.5, -3) {};
|
\node [style=none] (28) at (13.5, -3) {};
|
||||||
\node [style=none] (29) at (10.5, -4.5) {};
|
\node [style=none] (29) at (10.5, -4.5) {};
|
||||||
\node [style=none] (30) at (11.25, -2.5) {Cache Index};
|
\node [style=none] (30) at (11.25, -2.5) {Cache Index};
|
||||||
\node [style=none] (31) at (14, -16) {};
|
\node [style=none] (31) at (14, -14.725) {};
|
||||||
\node [style=none] (32) at (13, -9) {};
|
\node [style=none] (32) at (13, -9) {};
|
||||||
\node [style=none] (33) at (13, -13) {};
|
\node [style=none] (33) at (13, -13) {};
|
||||||
\node [style=none] (34) at (12, 0) {};
|
\node [style=none] (34) at (12, 0) {};
|
||||||
|
|||||||
@@ -51,12 +51,13 @@ Microelectronic Systems Design Research Group \\[3ex]
|
|||||||
The performance of today's computing systems depends in particular on the memory system utilized.
|
The performance of today's computing systems depends in particular on the memory system utilized.
|
||||||
With the increasing usage of DRAMs, also in mobile and embedded systems, it is important to select a memory configuration that fits the application well to provide high performance.
|
With the increasing usage of DRAMs, also in mobile and embedded systems, it is important to select a memory configuration that fits the application well to provide high performance.
|
||||||
However, this is a complex task within the system design due to the overwhelming number of possible configurations and their advantages and disadvantages.
|
However, this is a complex task within the system design due to the overwhelming number of possible configurations and their advantages and disadvantages.
|
||||||
In particular, bandwidth and latency requirements of the application must be satisfied.
|
In particular, bandwidth and latency requirements must be satisfied.
|
||||||
Consequently, to verify these requirements, simulations of the system are essential to evaluate whether the components and configuration parameters used are suitable for the application.
|
Consequently, to verify these requirements, simulations of the system are essential to evaluate whether the configuration parameters used are suitable for the application.
|
||||||
Such a simulation can be accomplished with the DRAM simulation environment DRAMSys.
|
Such a simulation can be accomplished with the DRAM simulation environment DRAMSys.
|
||||||
A simulation with DRAMSys requires a realistic stimuli for the memory system that matches the application's behavior, which can be created by the time-consuming simulation of the application using processor core models.
|
A simulation requires realistic stimuli for the memory system that match the application's behavior, which can be created by the time-consuming simulation of the application using processor core models.
|
||||||
To overcome this drawback of very long simulation time, a faster method of creating stimuli for DRAMSys is developed in this thesis.
|
To overcome this drawback of very long simulation time, a faster method of creating stimuli for DRAMSys is developed in this thesis.
|
||||||
In this method, access patterns are created by analyzing the application's behavior using dynamic binary instrumentation while it is running on real hardware.
|
In this method, access patterns are created by analyzing the application's behavior using dynamic binary instrumentation while it is running on real hardware.
|
||||||
|
With our approach, simulations run 73\% faster than with gem5 FS while losing only 7\% in accuracy with respect to the bandwidth.
|
||||||
|
|
||||||
\vspace{1.0cm}
|
\vspace{1.0cm}
|
||||||
|
|
||||||
@@ -65,11 +66,12 @@ In this method, access patterns are created by analyzing the application's behav
|
|||||||
Die Leistung heutiger Rechensysteme hängt insbesondere von dem eingesetzen Speichersystem ab.
|
Die Leistung heutiger Rechensysteme hängt insbesondere von dem eingesetzen Speichersystem ab.
|
||||||
Mit der zunehmenden Verbreitung von DRAMs auch in mobilen und eingebetteten Systemen ist es wichtig, eine Speicherkonfiguration zu wählen, welche gut zur Anwendung passt, um eine hohe Leistungsfähigkeit zu erzielen.
|
Mit der zunehmenden Verbreitung von DRAMs auch in mobilen und eingebetteten Systemen ist es wichtig, eine Speicherkonfiguration zu wählen, welche gut zur Anwendung passt, um eine hohe Leistungsfähigkeit zu erzielen.
|
||||||
Dies ist jedoch aufgrund der überwältigenden Anzahl möglicher Konfigurationen und ihrer Vor- und Nachteile eine komplexe Aufgabe innerhalb des Systemdesigns.
|
Dies ist jedoch aufgrund der überwältigenden Anzahl möglicher Konfigurationen und ihrer Vor- und Nachteile eine komplexe Aufgabe innerhalb des Systemdesigns.
|
||||||
Insbesondere die Anforderungen der Anwendung an Bandbreite und Latenzen müssen erfüllt werden.
|
Insbesondere Anforderungen an Bandbreite und Latenzen müssen erfüllt werden.
|
||||||
Folglich sind zur Überprüfung dieser Anforderungen Simulationen des Systems unerlässlich, um zu bewerten, ob die verwendeten Komponenten und Konfigurationsparameter für die Anwendung geeignet sind.
|
Folglich sind zur Überprüfung dieser Anforderungen Simulationen des Systems unerlässlich, um zu bewerten, ob die verwendeten Konfigurationsparameter für die Anwendung geeignet sind.
|
||||||
Solch eine Simulation kann mit der DRAM Simulationsumgebung DRAMSys durchgeführt werden.
|
Solch eine Simulation kann mit der DRAM Simulationsumgebung DRAMSys durchgeführt werden.
|
||||||
Eine Simulation mit DRAMSys erfordert realitätsnahe Stimuli für das Speichersystem, das dem Verhalten der Anwendung entspricht, welches mit einer zeitaufwändigen Simulation der Anwendung mit Prozessorkernmodellen erstellt werden kann.
|
Eine Simulation erfordert realitätsnahe Stimuli für das Speichersystem, das dem Verhalten der Anwendung entspricht, welches mit einer zeitaufwändigen Simulation der Anwendung mit Prozessorkernmodellen erstellt werden kann.
|
||||||
Um diesen Nachteil der sehr langen Simulationszeit zu überwinden, wird in dieser Arbeit eine neue Methode zur Erstellung von Stimuli für DRAMSys entwickelt.
|
Um diesen Nachteil der sehr langen Simulationszeit zu überwinden, wird in dieser Arbeit eine neue Methode zur Erstellung von Stimuli für DRAMSys entwickelt.
|
||||||
Bei dieser Methode werden Zugriffsmuster durch die Analyse des Verhaltens der Anwendung mittels Instrumentierung erstellt, während sie auf echter Hardware ausgeführt wird.
|
Bei dieser Methode werden Zugriffsmuster durch die Analyse des Verhaltens der Anwendung mittels Instrumentierung erstellt, während sie auf echter Hardware ausgeführt wird.
|
||||||
|
Mit unserem Ansatz sind wir in der Lage, die Simulationen im Vergleich zu gem5 FS um 73\% zu beschleunigen, während wir in Bezug auf die Bandbreite nur 7\% an Genauigkeit verlieren.
|
||||||
|
|
||||||
\end{abstract}
|
\end{abstract}
|
||||||
|
|||||||
@@ -71,10 +71,14 @@ There are three main policies:
|
|||||||
\end{figure}
|
\end{figure}
|
||||||
|
|
||||||
Figure \ref{fig:associativity} illustrates four different organizations for a cache of eight cache lines.
|
Figure \ref{fig:associativity} illustrates four different organizations for a cache of eight cache lines.
|
||||||
In all three cases, the least significant portion of the physical address of the referenced data, the \textit{index}, determines the set in which the data is to store.
|
As an example, a data block with the address \texttt{0x40} may be placed in the second set for the direct-mapped, two-way associative and four-way associative cache configurations.
|
||||||
|
However, in the latter two configurations, the cache can choose the horizontal placement of the block within the set.
|
||||||
|
For the fully associative cache, every cache line is a valid placement as it consists of only one set.
|
||||||
|
|
||||||
|
In each cache configuration, the least significant portion of the physical address of the referenced data, the \textit{index}, determines the set in which the data is to be stored.
|
||||||
However, several entries in the DRAM map to the same set, so the remaining most significant portion of the address is used as a \textit{tag} and is stored next to the actual data in the cache line.
|
However, several entries in the DRAM map to the same set, so the remaining most significant portion of the address is used as a \textit{tag} and is stored next to the actual data in the cache line.
|
||||||
After an entry is fetched from the cache, the tag is used to determine if the entry actually corresponds to the referenced data.
|
After an entry is fetched from the cache, the tag is used to determine if the entry actually corresponds to the referenced data.
|
||||||
An example subdivision of the address in the index, tag and byte offset is shown in Figure \ref{fig:address_mapping}.
|
An exemplary subdivision of the address into the index, tag and byte offset is shown in Figure \ref{fig:address_mapping}.
|
||||||
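The placement rule described above can be sketched in a few lines of Python. This is a minimal illustration, not code from the thesis: the function name `split_address` and the 64-byte cache line size are assumptions; only the eight cache lines and the address `0x40` come from the text. Under that assumed line size, the sketch places `0x40` in the second set (index 1) for the direct-mapped, two-way and four-way configurations, and in the single set of the fully associative cache.

```python
def split_address(addr, line_size, num_lines, ways):
    """Split a physical address into (tag, index, byte offset).

    Hypothetical helper for illustration only. line_size and the
    resulting number of sets are assumed to be powers of two.
    """
    num_sets = num_lines // ways                # sets = lines / associativity
    offset_bits = line_size.bit_length() - 1    # log2(line_size)
    index_bits = num_sets.bit_length() - 1      # log2(num_sets), 0 if one set
    offset = addr & (line_size - 1)             # lowest bits: byte within line
    index = (addr >> offset_bits) & (num_sets - 1)  # next bits: set number
    tag = addr >> (offset_bits + index_bits)    # remaining upper bits
    return tag, index, offset


# Eight cache lines as in the text; the 64-byte line size is an assumption.
for ways in (1, 2, 4, 8):
    tag, index, offset = split_address(0x40, line_size=64, num_lines=8, ways=ways)
    print(f"{ways}-way: tag={tag} index={index} offset={offset}")
```

With higher associativity, fewer address bits select the set and more bits end up in the tag, which is why the fully associative case has no index bits at all.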
|
|
||||||
\begin{figure}[!ht]
|
\begin{figure}[!ht]
|
||||||
\begin{center}
|
\begin{center}
|
||||||
|
|||||||
@@ -60,12 +60,14 @@ DRAMSys also provides the so-called \textit{Trace Analyzer}, a graphical tool th
|
|||||||
An exemplary trace database, visualized in the Trace Analyzer, is shown in Figure \ref{fig:traceanalyzer}.
|
An exemplary trace database, visualized in the Trace Analyzer, is shown in Figure \ref{fig:traceanalyzer}.
|
||||||
Furthermore, the Trace Analyzer is capable of calculating numerous metrics and creating plots of interesting characteristics.
|
Furthermore, the Trace Analyzer is capable of calculating numerous metrics and creating plots of interesting characteristics.
|
||||||
|
|
||||||
|
\begin{landscape}
|
||||||
\begin{figure}
|
\begin{figure}
|
||||||
\begin{center}
|
\begin{center}
|
||||||
\includegraphics[width=\linewidth]{img/traceanalyzer.pdf}
|
\includegraphics[width=\linewidth]{img/traceanalyzer.png}
|
||||||
\caption[Exemplary visualization of a trace database in the Trace Analyzer.]{Exemplary visualization of a trace database in the Trace Analyzer. The used DRAM consists of one rank and eight bank groups with two banks each.}
|
\caption[Exemplary visualization of a trace database in the Trace Analyzer.]{Exemplary visualization of a trace database in the Trace Analyzer. The used DRAM consists of one rank and eight bank groups with two banks each.}
|
||||||
\label{fig:traceanalyzer}
|
\label{fig:traceanalyzer}
|
||||||
\end{center}
|
\end{center}
|
||||||
\end{figure}
|
\end{figure}
|
||||||
|
\end{landscape}
|
||||||
|
|
||||||
In Section \ref{sec:implementation} of this thesis, a new simulation frontend for DRAMSys will be developed.
|
In Section \ref{sec:implementation} of this thesis, a new simulation frontend for DRAMSys will be developed.
|
||||||
|
|||||||
@@ -25,7 +25,7 @@ The gem5 syscall-emulation does not simulate a whole operating system, rather it
|
|||||||
In contrast, the gem5 full-system simulation boots into a complete Linux system including all processes, that may run in the background.
|
In contrast, the gem5 full-system simulation boots into a complete Linux system including all processes, that may run in the background.
|
||||||
Therefore, syscall-emulation is conceptually closer to the DynamoRIO approach than full-system simulation.
|
Therefore, syscall-emulation is conceptually closer to the DynamoRIO approach than full-system simulation.
|
||||||
|
|
||||||
The simulation setup consists in both cases of a two-level cache hierarchy with the following parameters:
|
In both cases, the simulation setup consists of a two-level cache hierarchy with the following parameters:
|
||||||
|
|
||||||
\begin{table}[!ht]
|
\begin{table}[!ht]
|
||||||
\caption{Cache parameters used in simulations.}
|
\caption{Cache parameters used in simulations.}
|
||||||
@@ -90,7 +90,7 @@ Their access patterns are as followed:
|
|||||||
\label{tab:benchmark_description}
|
\label{tab:benchmark_description}
|
||||||
\end{table}
|
\end{table}
|
||||||
|
|
||||||
In the following, the simulation results of the new simulation frontend, the gem5 full-system emulation and the gem5 syscall-emulation will now be presented.
|
In the following, the simulation results of the new simulation frontend, the gem5 full-system simulation and the gem5 syscall-emulation are presented.
|
||||||
|
|
||||||
\begin{table}[!ht]
|
\begin{table}[!ht]
|
||||||
\caption{Results for bandwidth and bytes read/written with DDR4-2400. \textit{FS} denotes gem5 full-system, \textit{SE} denotes gem5 syscall-emulation, \textit{DS} denotes DRAMSys.}
|
\caption{Results for bandwidth and bytes read/written with DDR4-2400. \textit{FS} denotes gem5 full-system, \textit{SE} denotes gem5 syscall-emulation, \textit{DS} denotes DRAMSys.}
|
||||||
@@ -248,8 +248,8 @@ Here, the absolute deviations in the average memory bandwidth amount to 27.5\% a
|
|||||||
The differences for the amount of bytes read amount to 31.6\% for gem5 FS and to 14.7\% for gem5 SE.
|
The differences for the amount of bytes read amount to 31.6\% for gem5 FS and to 14.7\% for gem5 SE.
|
||||||
Also here, the bytes written only show small deviations of 5.2\% for gem5 FS and 0.02\% for gem5 SE.
|
Also here, the bytes written only show small deviations of 5.2\% for gem5 FS and 0.02\% for gem5 SE.
|
||||||
|
|
||||||
It has to be noted that the average memory bandwidth for the new trace player is highly influenced by the configured CPI value.
|
% It has to be noted that the average memory bandwidth for the new trace player is highly influenced by the configured CPI value.
|
||||||
So to match a real system, this value has to be chosen wisely to achieve good simulation results for the memory bandwidth.
|
% So to match a real system, this value has to be chosen wisely to achieve good simulation results for the memory bandwidth.
|
||||||
|
|
||||||
|
|
||||||
% Latency und simulation time
|
% Latency und simulation time
|
||||||
@@ -440,4 +440,5 @@ Figure \ref{fig:runtimes} presents the runtimes of the various benchmarks and si
|
|||||||
As expected, DRAMSys outperforms the gem5 full-system and syscall-emulation simulators in every case.
|
As expected, DRAMSys outperforms the gem5 full-system and syscall-emulation simulators in every case.
|
||||||
On average, DRAMSys is 47.0\% faster than gem5 SE and 73.7\% faster than gem5 FS, with a maximum speedup of 82.6\% for the benchmark \texttt{SUM}.
|
On average, DRAMSys is 47.0\% faster than gem5 SE and 73.7\% faster than gem5 FS, with a maximum speedup of 82.6\% for the benchmark \texttt{SUM}.
|
||||||
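One way to read the "X\% faster" figures is as a relative reduction in runtime. The text does not state the formula, so this reading is an assumption, and the runtimes in the sketch below are hypothetical placeholders, not measured values from the thesis.

```python
def runtime_reduction(t_reference, t_new):
    """Relative runtime reduction in percent (assumed meaning of 'X% faster')."""
    return (t_reference - t_new) / t_reference * 100.0


def speedup_factor(reduction_percent):
    """Convert a runtime reduction in percent into a speedup factor."""
    return 1.0 / (1.0 - reduction_percent / 100.0)


# Hypothetical runtimes in seconds, chosen only to illustrate the arithmetic.
t_gem5_fs = 100.0
t_dramsys = 26.3

reduction = runtime_reduction(t_gem5_fs, t_dramsys)
print(f"reduction: {reduction:.1f} %, speedup: {speedup_factor(reduction):.2f}x")
```

Under this reading, a runtime reduction of 73.7\% corresponds to a speedup factor of roughly 3.8.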
While gem5 SE only simulates the target application using the detailed processor model, gem5 FS has to simulate the complete operating system kernel and applications, that run in the background concurrently.
|
While gem5 SE only simulates the target application using the detailed processor model, gem5 FS has to simulate the complete operating system kernel and applications, that run in the background concurrently.
|
||||||
This explains the large runtime differences between these two simulation modes.
|
However, the bootup process of the operating system was not included in the simulations.
|
||||||
|
These conceptual differences explain the large runtime deviations between the two simulation modes.
|
||||||
|
|||||||
@@ -2,7 +2,7 @@
|
|||||||
\label{sec:future_work}
|
\label{sec:future_work}
|
||||||
|
|
||||||
Due to the complexity of possible memory subsystem configurations, simulation is an indispensable part of the development process of today's systems.
|
Due to the complexity of possible memory subsystem configurations, simulation is an indispensable part of the development process of today's systems.
|
||||||
It not only has an high impact on the development cost but also significantly reduces the time-to-market and enables the rapid release of new products.
|
It not only has a high impact on the development cost but also significantly reduces the time-to-market and enables the rapid release of new products.
|
||||||
However, the accurate simulation of a specific application takes a large period of time because of the detailed processor core models.
|
However, the accurate simulation of a specific application takes a large period of time because of the detailed processor core models.
|
||||||
On the other hand, fixed or relative time memory traces allow faster simulation at the expense of accuracy, which makes them often unsuitable.
|
On the other hand, fixed or relative time memory traces allow faster simulation at the expense of accuracy, which makes them often unsuitable.
|
||||||
To fill this gap, this thesis introduced a new simulation frontend for DRAMSys, which accelerates the process while making only few compromises on accuracy.
|
To fill this gap, this thesis introduced a new simulation frontend for DRAMSys, which accelerates the process while making only few compromises on accuracy.
|
||||||
@@ -29,7 +29,7 @@ This deviation could be prevented by recording used processor cores on the initi
|
|||||||
|
|
||||||
Another inaccuracy can be caused by the hyperthreading of some of today's processors:
|
Another inaccuracy can be caused by the hyperthreading of some of today's processors:
|
||||||
While hyperthreading enables the parallel processing of two pipelines in a processor core, those threads do share the same first level cache.
|
While hyperthreading enables the parallel processing of two pipelines in a processor core, those threads do share the same first level cache.
|
||||||
Currently, this is not taken into account and every application thread gets its own first level cache assigned.
|
Currently, this is not taken into account, and each application thread is assigned its own first level cache.
|
||||||
|
|
||||||
Further room for improvement offers the consideration of the special prefetch and instructions the architectures provide.
|
Further room for improvement offers the consideration of the special prefetch and instructions the architectures provide.
|
||||||
DynamoRIO already offers an interface to catch those instructions without much effort.
|
DynamoRIO already offers an interface to catch those instructions without much effort.
|
||||||
|
|||||||