diff --git a/inc/0.titlepage.tex b/inc/0.titlepage.tex
index 99981dc..a534f66 100644
--- a/inc/0.titlepage.tex
+++ b/inc/0.titlepage.tex
@@ -49,12 +49,14 @@ Microelectronic Systems Design Research Group \\[3ex]
 \textbf{Abstract}
 The performance of today's computing systems depends in particular on the memory system utilized.
-With the increasing usage of DRAMs, also in mobile and embedded systems, it is important to select a memory configuration that fits the application well to provide a high performance.
-However, since this is a complex task within the system design due to the an overwhelming number of possible configurations and their advantages and disadvantages, simulations of the system are essential to evaluate whether the components and configuration parameters used are appropriate for the application.
+With the increasing usage of DRAMs, also in mobile and embedded systems, it is important to select a memory configuration that fits the application well to provide high performance.
+However, this is a complex task within the system design due to the overwhelming number of possible configurations and their advantages and disadvantages.
+In particular, bandwidth and latency requirements of the application must be satisfied.
+Consequently, to verify these requirements, simulations of the system are essential to evaluate whether the components and configuration parameters used are suitable for the application.
 Such a simulation can be accomplished with the DRAM simulation environment DRAMSys.
 A simulation with DRAMSys requires a realistic stimuli for the memory system that matches the application's behavior, which can be created by the time-consuming simulation of the application using processor core models.
 To overcome this drawback of very long simulation time, a faster method of creating stimuli for DRAMSys is developed in this thesis.
-In this method, access patterns are created by analyzing the application's behavior while it is running on real hardware using dynamic binary instrumentation.
+In this method, access patterns are created by analyzing the application's behavior using dynamic binary instrumentation while it is running on real hardware.
 \vspace{1.0cm}
@@ -62,7 +64,9 @@ In this method, access patterns are created by analyzing the application's behav
 Die Leistung heutiger Rechensysteme hängt insbesondere von dem eingesetzen Speichersystem ab.
 Mit der zunehmenden Verbreitung von DRAMs auch in mobilen und eingebetteten Systemen ist es wichtig, eine Speicherkonfiguration zu wählen, welche gut zur Anwendung passt, um eine hohe Leistungsfähigkeit zu erzielen.
-Da dies jedoch aufgrund der überwältigenden Anzahl möglicher Konfigurationen und ihrer Vor- und Nachteile eine komplexe Aufgabenstellung innerhalb des Systemdesigns ist, ist eine Simulation des Systems unabdingbar, um zu bewerten, ob die verwendeten Komponenten und Konfigurationsparameter für die Anwendung geeignet sind.
+Dies ist jedoch aufgrund der überwältigenden Anzahl möglicher Konfigurationen und ihrer Vor- und Nachteile eine komplexe Aufgabe innerhalb des Systemdesigns.
+Insbesondere die Anforderungen der Anwendung an Bandbreite und Latenzen müssen erfüllt werden.
+Folglich sind zur Überprüfung dieser Anforderungen Simulationen des Systems unerlässlich, um zu bewerten, ob die verwendeten Komponenten und Konfigurationsparameter für die Anwendung geeignet sind.
 Solch eine Simulation kann mit der DRAM Simulationsumgebung DRAMSys durchgeführt werden.
 Eine Simulation mit DRAMSys erfordert realitätsnahe Stimuli für das Speichersystem, das dem Verhalten der Anwendung entspricht, welches mit einer zeitaufwändigen Simulation der Anwendung mit Prozessorkernmodellen erstellt werden kann.
 Um diesen Nachteil der sehr langen Simulationszeit zu überwinden, wird in dieser Arbeit eine neue Methode zur Erstellung von Stimuli für DRAMSys entwickelt.
diff --git a/inc/3.systemc.tex b/inc/3.systemc.tex
index 22e1b4c..5f7232d 100644
--- a/inc/3.systemc.tex
+++ b/inc/3.systemc.tex
@@ -25,7 +25,7 @@ Moreover, there is the event queue type \texttt{sc\_event\_queue}, which makes i
 The concepts presented are used in Section \ref{sec:implementation}, where the implementation of various SystemC modules will be discussed.
-SystemC supports a number of abstraction levels for modeling systems, namely \textit{cycle-accurate}, the most accurate but also the slowest abstraction, \textit{approximateley-timed} and \textit{loosley-timed}.
+SystemC supports a number of abstraction levels for modeling systems, namely \textit{cycle-accurate}, the most accurate but also the slowest abstraction, \textit{approximately-timed} and \textit{loosley-timed}.
 The latter two abstraction levels belog to \revabbr{transaction level modeling}{TLM}, which will be discussed in the next Section \ref{sec:tlm}.
 Another level of abstraction, \textit{untimed}, will not be the subject of this thesis.
@@ -52,7 +52,7 @@ GPs are passed as references, so they do not need to be copied between modules.
 \end{center}
 \end{figure}
-SystemC defines two coding styles for the use of TLM, called \revabbr{loosley-timed}{LT} and \revabbr{approximateley-timed}{AT}.
+SystemC defines two coding styles for the use of TLM, called \revabbr{loosley-timed}{LT} and \revabbr{approximately-timed}{AT}.
 In the LT coding style, a transaction is blocking, meaning that the transaction will be modeled by only one function call.
 This comes at the cost of limited temporal accuracy, as only the start and end times of the transaction are modeled, and the initiator must wait until the transaction is complete before making the next request.
 However, the fast simulation time, especially when the so-called concept of \textit{temporal decoupling} with \textit{timing quantums} is used, makes it possible to use this coding style for rapid software development; LT is suitable for developing drivers for a simulated hardware component.
diff --git a/inc/4.caches.tex b/inc/4.caches.tex
index 9f89bb3..1109ab8 100644
--- a/inc/4.caches.tex
+++ b/inc/4.caches.tex
@@ -35,7 +35,7 @@ The processor can now perform operations on this data and use its end result wit
 Programs have a tendency to reference data that is nearby in the memory space of already referenced data.
 This tendency, spatial locality, arises because related data is often clustered together, for example in arrays or structures.
 When calculations are performed on those arrays, sequential access patterns can be observed as one element is processed after the other.
-Spatial locality can be exploited by organizing blocks of data in so called \textit{cache blocks} or \textit{cache lines}, which are larger than a single data word.
+Spatial locality can be exploited by organizing blocks of data in so-called \textit{cache blocks} or \textit{cache lines}, which are larger than a single data word.
 This is a passive form of making use of spatial locality, as referenced data will also cause nearby words to be loaded into the same cache line, making them available for further accesses.
 An active form of exploiting spatial locality is the use of \textit{prefetching}.
@@ -125,7 +125,7 @@ In case of a \textit{write-through} cache, the underlying memory is updated imme
 Because the DRAM provides a significantly lower bandwidth than the cache, this comes at a performance penalty.
 To mitigate the problem, a write buffer can be used, which allows the processor to make further progress while the data is written.
-An alternative is a so called \textit{write-back} cache.
+An alternative is a so-called \textit{write-back} cache.
 Instead of writing the updated value immediately to the underlying memory, it will be written back when the corresponding cache line is evicted.
 To identify if a cache line has to be written back, a so-called \textit{dirty-bit} is used; it denotes if the value has been updated while it has been in the cache.
 If this is the case, it must be written back to ensure consistency, otherwise it is not necessary.
@@ -149,7 +149,7 @@ Figure \ref{fig:virtual_address} shows an exemplary division of a virtual addres
 \end{figure}
 Before a process can access a specific region in memory, the kernel has to translate the virtual page number into a physical page number.
-For conversions, so called \textit{page tables} are used to look up the physical page number.
+For conversions, so-called \textit{page tables} are used to look up the physical page number.
 Page tables are usually multiple levels deep (e.g. 4-levels on x86), so a single conversion can cause a number of memory accesses, which is expensive.
 To improve performance, a \revabbr{translation lookaside buffer}{TLB} is used, which acts like a cache on its own for physical page numbers.
@@ -203,7 +203,7 @@ An architecture of an MSHR file is illustrated in Figure \ref{fig:mshr_file}.
 \begin{figure}
 \begin{center}
 \tikzfig{img/mshr_file}
-\caption[Miss Holding Status Register File \cite{Jahre2007}.]{Miss Holding Status Register File \cite{Jahre2007}. V refers to a valid bit.}
+\caption[Miss Status Holding Register File \cite{Jahre2007}.]{Miss Status Holding Register File \cite{Jahre2007}. \textit{V} refers to a valid bit.}
 \label{fig:mshr_file}
 \end{center}
 \end{figure}
diff --git a/inc/7.simulation_results.tex b/inc/7.simulation_results.tex
index d98a4f2..efca169 100644
--- a/inc/7.simulation_results.tex
+++ b/inc/7.simulation_results.tex
@@ -93,7 +93,7 @@ Their access patterns are as followed:
 In the following, the simulation results of the new simulation frontend, the gem5 full-system emulation and the gem5 syscall-emulation will now be presented.
 \begin{table}[!ht]
-\caption{Results for bandwidth and bytes read/written with DDR4-2400.}
+\caption{Results for bandwidth and bytes read/written with DDR4-2400. \textit{FS} denotes gem5 full-system, \textit{SE} denotes gem5 syscall-emulation, \textit{DS} denotes DRAMSys.}
 \begin{center}
 \begin{tabular}{|c|c|c|c|c|c|c|c|c|c|}
 \hline
@@ -174,7 +174,7 @@ Those numbers are also illustrated in Figure \ref{fig:benchmark_gem5_bandwidth_d
 \end{figure}
 \begin{table}[!ht]
-\caption[Results for bandwidth and bytes read/written with DDR3-1600.]{Results for bandwidth and bytes read/written with DDR3-1600. FS denotes gem5 full-system, SE denotes gem5 syscall-emulation, DS denotes DRAMSys.}
+\caption[Results for bandwidth and bytes read/written with DDR3-1600.]{Results for bandwidth and bytes read/written with DDR3-1600.}
 \begin{center}
 \begin{tabular}{|c|c|c|c|c|c|c|c|c|c|}
 \hline
@@ -440,3 +440,4 @@ Figure \ref{fig:runtimes} presents the runtimes of the various benchmarks and si
 As expected, DRAMSys outperforms the gem5 full-system and syscall-emulation simulators in every case.
 On average, DRAMSys is 47.0\% faster than gem5 SE and 73.7\% faster than gem5 FS, with a maximum speedup of 82.6\% for the benchmark \texttt{SUM}.
 While gem5 SE only simulates the target application using the detailed processor model, gem5 FS has to simulate the complete operating system kernel and applications, that run in the background concurrently.
+This explains the large runtime differences between these two simulation modes.