Some fixes

2022-07-21 15:24:26 +02:00
parent 60e8894527
commit 98add62119
4 changed files with 17 additions and 12 deletions

@@ -49,12 +49,14 @@ Microelectronic Systems Design Research Group \\[3ex]
\textbf{Abstract} \textbf{Abstract}
The performance of today's computing systems depends in particular on the memory system utilized. The performance of today's computing systems depends in particular on the memory system utilized.
With the increasing usage of DRAMs, also in mobile and embedded systems, it is important to select a memory configuration that fits the application well to provide a high performance. With the increasing usage of DRAMs, also in mobile and embedded systems, it is important to select a memory configuration that fits the application well to provide high performance.
However, since this is a complex task within the system design due to the an overwhelming number of possible configurations and their advantages and disadvantages, simulations of the system are essential to evaluate whether the components and configuration parameters used are appropriate for the application. However, this is a complex task within the system design due to the overwhelming number of possible configurations and their advantages and disadvantages.
In particular, bandwidth and latency requirements of the application must be satisfied.
Consequently, to verify these requirements, simulations of the system are essential to evaluate whether the components and configuration parameters used are suitable for the application.
Such a simulation can be accomplished with the DRAM simulation environment DRAMSys. Such a simulation can be accomplished with the DRAM simulation environment DRAMSys.
A simulation with DRAMSys requires a realistic stimuli for the memory system that matches the application's behavior, which can be created by the time-consuming simulation of the application using processor core models. A simulation with DRAMSys requires a realistic stimuli for the memory system that matches the application's behavior, which can be created by the time-consuming simulation of the application using processor core models.
To overcome this drawback of very long simulation time, a faster method of creating stimuli for DRAMSys is developed in this thesis. To overcome this drawback of very long simulation time, a faster method of creating stimuli for DRAMSys is developed in this thesis.
In this method, access patterns are created by analyzing the application's behavior while it is running on real hardware using dynamic binary instrumentation. In this method, access patterns are created by analyzing the application's behavior using dynamic binary instrumentation while it is running on real hardware.
\vspace{1.0cm} \vspace{1.0cm}
@@ -62,7 +64,9 @@ In this method, access patterns are created by analyzing the application's behav
Die Leistung heutiger Rechensysteme hängt insbesondere von dem eingesetzen Speichersystem ab. Die Leistung heutiger Rechensysteme hängt insbesondere von dem eingesetzen Speichersystem ab.
Mit der zunehmenden Verbreitung von DRAMs auch in mobilen und eingebetteten Systemen ist es wichtig, eine Speicherkonfiguration zu wählen, welche gut zur Anwendung passt, um eine hohe Leistungsfähigkeit zu erzielen. Mit der zunehmenden Verbreitung von DRAMs auch in mobilen und eingebetteten Systemen ist es wichtig, eine Speicherkonfiguration zu wählen, welche gut zur Anwendung passt, um eine hohe Leistungsfähigkeit zu erzielen.
Da dies jedoch aufgrund der überwältigenden Anzahl möglicher Konfigurationen und ihrer Vor- und Nachteile eine komplexe Aufgabenstellung innerhalb des Systemdesigns ist, ist eine Simulation des Systems unabdingbar, um zu bewerten, ob die verwendeten Komponenten und Konfigurationsparameter für die Anwendung geeignet sind. Dies ist jedoch aufgrund der überwältigenden Anzahl möglicher Konfigurationen und ihrer Vor- und Nachteile eine komplexe Aufgabe innerhalb des Systemdesigns.
Insbesondere die Anforderungen der Anwendung an Bandbreite und Latenzen müssen erfüllt werden.
Folglich sind zur Überprüfung dieser Anforderungen Simulationen des Systems unerlässlich, um zu bewerten, ob die verwendeten Komponenten und Konfigurationsparameter für die Anwendung geeignet sind.
Solch eine Simulation kann mit der DRAM Simulationsumgebung DRAMSys durchgeführt werden. Solch eine Simulation kann mit der DRAM Simulationsumgebung DRAMSys durchgeführt werden.
Eine Simulation mit DRAMSys erfordert realitätsnahe Stimuli für das Speichersystem, das dem Verhalten der Anwendung entspricht, welches mit einer zeitaufwändigen Simulation der Anwendung mit Prozessorkernmodellen erstellt werden kann. Eine Simulation mit DRAMSys erfordert realitätsnahe Stimuli für das Speichersystem, das dem Verhalten der Anwendung entspricht, welches mit einer zeitaufwändigen Simulation der Anwendung mit Prozessorkernmodellen erstellt werden kann.
Um diesen Nachteil der sehr langen Simulationszeit zu überwinden, wird in dieser Arbeit eine neue Methode zur Erstellung von Stimuli für DRAMSys entwickelt. Um diesen Nachteil der sehr langen Simulationszeit zu überwinden, wird in dieser Arbeit eine neue Methode zur Erstellung von Stimuli für DRAMSys entwickelt.

@@ -25,7 +25,7 @@ Moreover, there is the event queue type \texttt{sc\_event\_queue}, which makes i
The concepts presented are used in Section \ref{sec:implementation}, where the implementation of various SystemC modules will be discussed. The concepts presented are used in Section \ref{sec:implementation}, where the implementation of various SystemC modules will be discussed.
SystemC supports a number of abstraction levels for modeling systems, namely \textit{cycle-accurate}, the most accurate but also the slowest abstraction, \textit{approximateley-timed} and \textit{loosley-timed}. SystemC supports a number of abstraction levels for modeling systems, namely \textit{cycle-accurate}, the most accurate but also the slowest abstraction, \textit{approximately-timed} and \textit{loosley-timed}.
The latter two abstraction levels belog to \revabbr{transaction level modeling}{TLM}, which will be discussed in the next Section \ref{sec:tlm}. The latter two abstraction levels belog to \revabbr{transaction level modeling}{TLM}, which will be discussed in the next Section \ref{sec:tlm}.
Another level of abstraction, \textit{untimed}, will not be the subject of this thesis. Another level of abstraction, \textit{untimed}, will not be the subject of this thesis.
@@ -52,7 +52,7 @@ GPs are passed as references, so they do not need to be copied between modules.
\end{center} \end{center}
\end{figure} \end{figure}
SystemC defines two coding styles for the use of TLM, called \revabbr{loosley-timed}{LT} and \revabbr{approximateley-timed}{AT}. SystemC defines two coding styles for the use of TLM, called \revabbr{loosley-timed}{LT} and \revabbr{approximately-timed}{AT}.
In the LT coding style, a transaction is blocking, meaning that the transaction will be modeled by only one function call. In the LT coding style, a transaction is blocking, meaning that the transaction will be modeled by only one function call.
This comes at the cost of limited temporal accuracy, as only the start and end times of the transaction are modeled, and the initiator must wait until the transaction is complete before making the next request. This comes at the cost of limited temporal accuracy, as only the start and end times of the transaction are modeled, and the initiator must wait until the transaction is complete before making the next request.
However, the fast simulation time, especially when the so-called concept of \textit{temporal decoupling} with \textit{timing quantums} is used, makes it possible to use this coding style for rapid software development; LT is suitable for developing drivers for a simulated hardware component. However, the fast simulation time, especially when the so-called concept of \textit{temporal decoupling} with \textit{timing quantums} is used, makes it possible to use this coding style for rapid software development; LT is suitable for developing drivers for a simulated hardware component.

@@ -35,7 +35,7 @@ The processor can now perform operations on this data and use its end result wit
Programs have a tendency to reference data that is nearby in the memory space of already referenced data. Programs have a tendency to reference data that is nearby in the memory space of already referenced data.
This tendency, spatial locality, arises because related data is often clustered together, for example in arrays or structures. This tendency, spatial locality, arises because related data is often clustered together, for example in arrays or structures.
When calculations are performed on those arrays, sequential access patterns can be observed as one element is processed after the other. When calculations are performed on those arrays, sequential access patterns can be observed as one element is processed after the other.
Spatial locality can be exploited by organizing blocks of data in so called \textit{cache blocks} or \textit{cache lines}, which are larger than a single data word. Spatial locality can be exploited by organizing blocks of data in so-called \textit{cache blocks} or \textit{cache lines}, which are larger than a single data word.
This is a passive form of making use of spatial locality, as referenced data will also cause nearby words to be loaded into the same cache line, making them available for further accesses. This is a passive form of making use of spatial locality, as referenced data will also cause nearby words to be loaded into the same cache line, making them available for further accesses.
An active form of exploiting spatial locality is the use of \textit{prefetching}. An active form of exploiting spatial locality is the use of \textit{prefetching}.
@@ -125,7 +125,7 @@ In case of a \textit{write-through} cache, the underlying memory is updated imme
Because the DRAM provides a significantly lower bandwidth than the cache, this comes at a performance penalty. Because the DRAM provides a significantly lower bandwidth than the cache, this comes at a performance penalty.
To mitigate the problem, a write buffer can be used, which allows the processor to make further progress while the data is written. To mitigate the problem, a write buffer can be used, which allows the processor to make further progress while the data is written.
An alternative is a so called \textit{write-back} cache. An alternative is a so-called \textit{write-back} cache.
Instead of writing the updated value immediately to the underlying memory, it will be written back when the corresponding cache line is evicted. Instead of writing the updated value immediately to the underlying memory, it will be written back when the corresponding cache line is evicted.
To identify if a cache line has to be written back, a so-called \textit{dirty-bit} is used; it denotes if the value has been updated while it has been in the cache. To identify if a cache line has to be written back, a so-called \textit{dirty-bit} is used; it denotes if the value has been updated while it has been in the cache.
If this is the case, it must be written back to ensure consistency, otherwise it is not necessary. If this is the case, it must be written back to ensure consistency, otherwise it is not necessary.
@@ -149,7 +149,7 @@ Figure \ref{fig:virtual_address} shows an exemplary division of a virtual addres
\end{figure} \end{figure}
Before a process can access a specific region in memory, the kernel has to translate the virtual page number into a physical page number. Before a process can access a specific region in memory, the kernel has to translate the virtual page number into a physical page number.
For conversions, so called \textit{page tables} are used to look up the physical page number. For conversions, so-called \textit{page tables} are used to look up the physical page number.
Page tables are usually multiple levels deep (e.g. 4-levels on x86), so a single conversion can cause a number of memory accesses, which is expensive. Page tables are usually multiple levels deep (e.g. 4-levels on x86), so a single conversion can cause a number of memory accesses, which is expensive.
To improve performance, a \revabbr{translation lookaside buffer}{TLB} is used, which acts like a cache on its own for physical page numbers. To improve performance, a \revabbr{translation lookaside buffer}{TLB} is used, which acts like a cache on its own for physical page numbers.
@@ -203,7 +203,7 @@ An architecture of an MSHR file is illustrated in Figure \ref{fig:mshr_file}.
\begin{figure} \begin{figure}
\begin{center} \begin{center}
\tikzfig{img/mshr_file} \tikzfig{img/mshr_file}
\caption[Miss Holding Status Register File \cite{Jahre2007}.]{Miss Holding Status Register File \cite{Jahre2007}. V refers to a valid bit.} \caption[Miss Status Holding Register File \cite{Jahre2007}.]{Miss Status Holding Register File \cite{Jahre2007}. \textit{V} refers to a valid bit.}
\label{fig:mshr_file} \label{fig:mshr_file}
\end{center} \end{center}
\end{figure} \end{figure}

@@ -93,7 +93,7 @@ Their access patterns are as followed:
In the following, the simulation results of the new simulation frontend, the gem5 full-system emulation and the gem5 syscall-emulation will now be presented. In the following, the simulation results of the new simulation frontend, the gem5 full-system emulation and the gem5 syscall-emulation will now be presented.
\begin{table}[!ht] \begin{table}[!ht]
\caption{Results for bandwidth and bytes read/written with DDR4-2400.} \caption{Results for bandwidth and bytes read/written with DDR4-2400. \textit{FS} denotes gem5 full-system, \textit{SE} denotes gem5 syscall-emulation, \textit{DS} denotes DRAMSys.}
\begin{center} \begin{center}
\begin{tabular}{|c|c|c|c|c|c|c|c|c|c|} \begin{tabular}{|c|c|c|c|c|c|c|c|c|c|}
\hline \hline
@@ -174,7 +174,7 @@ Those numbers are also illustrated in Figure \ref{fig:benchmark_gem5_bandwidth_d
\end{figure} \end{figure}
\begin{table}[!ht] \begin{table}[!ht]
\caption[Results for bandwidth and bytes read/written with DDR3-1600.]{Results for bandwidth and bytes read/written with DDR3-1600. FS denotes gem5 full-system, SE denotes gem5 syscall-emulation, DS denotes DRAMSys.} \caption[Results for bandwidth and bytes read/written with DDR3-1600.]{Results for bandwidth and bytes read/written with DDR3-1600.}
\begin{center} \begin{center}
\begin{tabular}{|c|c|c|c|c|c|c|c|c|c|} \begin{tabular}{|c|c|c|c|c|c|c|c|c|c|}
\hline \hline
@@ -440,3 +440,4 @@ Figure \ref{fig:runtimes} presents the runtimes of the various benchmarks and si
As expected, DRAMSys outperforms the gem5 full-system and syscall-emulation simulators in every case. As expected, DRAMSys outperforms the gem5 full-system and syscall-emulation simulators in every case.
On average, DRAMSys is 47.0\% faster than gem5 SE and 73.7\% faster than gem5 FS, with a maximum speedup of 82.6\% for the benchmark \texttt{SUM}. On average, DRAMSys is 47.0\% faster than gem5 SE and 73.7\% faster than gem5 FS, with a maximum speedup of 82.6\% for the benchmark \texttt{SUM}.
While gem5 SE only simulates the target application using the detailed processor model, gem5 FS has to simulate the complete operating system kernel and applications, that run in the background concurrently. While gem5 SE only simulates the target application using the detailed processor model, gem5 FS has to simulate the complete operating system kernel and applications, that run in the background concurrently.
This explains the large runtime differences between these two simulation modes.