Fixes from Niklas, Johannes, Hendrik

This commit is contained in:
2022-08-05 10:54:38 +02:00
parent 98add62119
commit 27ec50fab7
8 changed files with 48 additions and 53 deletions

View File

@@ -2,8 +2,8 @@
\label{sec:introduction}
% maybe also explain why DRAMs are used more and more
Today's computing systems accompany us in almost all areas of life in the form of smart devices, computers, or game consoles.
With the increasing performance requirements on these devices, not only faster processors are needed, but also high-performance memory systems, namely \revabbr{dynamic random-access memories}{DRAMs}, which are supposed to deliver a lot of bandwidth at a low latency.
While these storage systems are very complex and offer a lot of room for configuration, e.g., the DRAM standard, the memory controller configuration or the address mapping, there are different requirements for the very different applications \cite{Gomony2012}.
Consequently, system designers are entrusted with the complex task of finding the most effective configurations that match the performance and power constraints, with suitable optimizations applied for the specific use case.
\input{img/thesis.tikzstyles}
@@ -15,11 +15,11 @@ Consequently, system designers are entrusted with the complex task of finding th
\end{center}
\end{figure}
For the exploration of the design space of these configurations, it is impractical to use real systems as they are too cost-intensive and not modifiable, and therefore not suitable for rapid prototyping.
To overcome this limitation, it is important to simulate the memory system using a simulation framework with sufficient accuracy.
Such a simulation framework is DRAMSys \cite{Steiner2020, Jung2017}, which is based on SystemC \revabbr{transaction level modeling}{TLM} and enables the fast simulation of numerous DRAM standards and controller configurations with cycle-accuracy.
Stimuli for the memory system can either be generated using a prerecorded trace file with timestamps, a traffic generator that acts as a state machine and initiates different request patterns, or a detailed processor model of the gem5 \cite{Binkert2011} simulation framework.
However, the two former methods lack accuracy, whereas the latter may provide sufficient precision but is a very time-consuming effort.
To fill this gap of fast but accurate traffic generation, a new simulation frontend for DRAMSys is developed and presented in this thesis.
@@ -35,4 +35,4 @@ Section \ref{sec:systemc} presents the modeling language SystemC, on which the d
After that, Section \ref{sec:caches} gives a short overview of modern cache architectures and their high-level implementations.
Section \ref{sec:dramsys} introduces the DRAMSys simulation framework and its basic functionalities.
Section \ref{sec:implementation} explains the implementation of the cache model, the processor model and the instrumentation tool in detail.
In Section \ref{sec:simulation_results}, the accuracy of the new framework is compared against the gem5 and Ramulator \cite{Kim2016} simulators, while Section \ref{sec:future_work} outlines future improvements.

View File

@@ -17,7 +17,7 @@ However, those context switches result in a significant performance penalty as t
DBI tools can either invoke the target application themselves or attach to the application's process dynamically.
The former method allows instrumentation of even the early startup stage of the application whereas the latter method might be used if the application has to be first brought into a certain state or the process cannot be restarted due to reliability reasons.
Some DBI tools also allow the DBI framework to be integrated directly into the application's source code.
While this eliminates the flexibility of observing applications that are only available in binary form, it enables control over the DBI tool using its application interface.
With this method, it is possible to precisely instrument only a specific code region of interest and otherwise disable the tool for performance reasons.
In all cases, the instrumentation tool executes in the same process and address space as the target application.
@@ -53,7 +53,7 @@ A basic block is a sequence of instructions extracted from the target applicatio
In the code cache, the instrumentation instructions will directly be inserted.
To be able to execute the modified code, basic blocks in the code cache are extended by two \textit{exit stubs}, ensuring that at the end the control is transferred back to DynamoRIO via a context switch.
From there, the application's and processor's state is saved and the next basic block is copied into the code cache, modified and executed after restoring the previously saved state.
Basic blocks that are already located in the code cache are directly executed without copying, however, a context switch is still needed to determine the next basic block to execute.
To reduce this overhead and avoid a context switch, DynamoRIO can \textit{link} two basic blocks together that were targeted by a direct branch, i.e., branches whose target address will not change during runtime.
@@ -71,13 +71,13 @@ The application code is loaded by the dispatcher, modified by the basic block bu
\begin{figure}
\begin{center}
\tikzfig{img/dynamorio}
\caption[DynamoRIO runtime code manipulation layer.]{DynamoRIO runtime code manipulation layer \cite{Bruening2004}.}
\label{fig:dynamorio}
\end{center}
\end{figure}
As mentioned in Section \ref{sec:dbi}, it is important for a DBI tool to operate transparently.
DynamoRIO takes a number of measures to achieve this goal, some of which are now explained \cite{Bruening2004}.
As sharing libraries with the target application can cause transparency issues, especially when using non-reentrant routines or routines that alter static state such as error codes, DynamoRIO directly interfaces with the system using system calls and even avoids using the C standard library (e.g., \textit{glibc} on Linux).
The same should also apply to user-written instrumentation clients (introduced in more detail in Section \ref{sec:dynamorio_client}), but the direct usage of system calls is discouraged as it bypasses DynamoRIO's internal monitoring for changes that affect the process's address space.
Instead, DynamoRIO provides a cross-platform API for generic routines such as file system operations and memory allocation.
@@ -103,7 +103,7 @@ DynamoRIO provides a programming interface to develop external so-called \textit
Clients are user-written instrumentation tools and make it possible to dynamically modify the basic blocks, either to alter the application behavior or to insert observational instructions.
A DynamoRIO client is compiled into a shared library and passed to the \textit{drrun} utility using a command line option.
Clients implement a number of hook functions that will be called by DynamoRIO for certain events such as the creation of a basic block or of a trace.
Generally, there are two classes of hooks: those that execute on basic block creation, which instrument all of the application code, and those that execute on trace generation, which are only interested in frequently executed code.
It is important to note that the hooks for basic block and trace generation are not called every time the corresponding code sequence is executed, but only when these basic blocks are generated and placed into the code cache.
Therefore, the required instructions have to be inserted into the basic block instruction stream at this stage, rather than implementing the observational or manipulative behavior in the hook function itself.
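The distinction between the creation-time hook and the instrumentation it plants can be illustrated with a toy model in plain C++ (this is not the DynamoRIO API; all names and the counter-based "instrumentation" are hypothetical):

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Toy model: the "hook" runs only when a basic block first enters the
// code cache, so any observation must be planted into the cached copy
// itself, which then runs on every subsequent execution.
struct CodeCache {
    std::map<uint64_t, int> cache;   // block address -> planted counter
    int hookCalls = 0;               // how often the creation hook ran

    void execute(uint64_t blockAddr) {
        if (cache.find(blockAddr) == cache.end()) {
            ++hookCalls;             // creation hook: called once per block
            cache[blockAddr] = 0;    // "insert" instrumentation here
        }
        ++cache[blockAddr];          // instrumentation runs on every execution
    }
};
```

Executing the same block twice triggers the hook once but the planted counter twice, mirroring the behavior described above.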

View File

@@ -3,7 +3,7 @@
This section covers the basics of virtual prototyping, SystemC and transaction level modeling.
\revabbr{Virtual prototypes}{VPs} are software models of physical hardware systems that can be used for software development before the actual hardware is available.
They make it easier to test the product as VPs provide visibility and controllability across the entire system and therefore reduce the time-to-market and development cost \cite{Antonino2018}.
SystemC is a C++ class library with an event-driven simulation kernel, used for developing complex system models (i.e., VPs) in a high-level language.
@@ -25,9 +25,8 @@ Moreover, there is the event queue type \texttt{sc\_event\_queue}, which makes i
The concepts presented are used in Section \ref{sec:implementation}, where the implementation of various SystemC modules will be discussed.
SystemC supports a number of abstraction levels for modeling systems, namely \textit{cycle-accurate}, the most accurate but also the slowest abstraction, \textit{untimed}, \textit{approximately-timed} and \textit{loosely-timed}.
The latter two abstraction levels belong to \revabbr{transaction level modeling}{TLM}, which will be discussed in the next Section \ref{sec:tlm}.
\subsection{Transaction Level Modeling}
\label{sec:tlm}
@@ -54,10 +53,10 @@ GPs are passed as references, so they do not need to be copied between modules.
SystemC defines two coding styles for the use of TLM, called \revabbr{loosely-timed}{LT} and \revabbr{approximately-timed}{AT}.
In the LT coding style, a transaction is blocking, meaning that the transaction will be modeled by only one function call.
This comes at the cost of limited temporal accuracy, as only the start and end times of the transaction are modeled, and the initiator must wait until the transaction is completed before making the next request.
However, the fast simulation time, especially when the so-called concept of \textit{temporal decoupling} with \textit{time quanta} is used, makes it possible to use this coding style for rapid software development; LT is suitable for developing drivers for a simulated hardware component.
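The one-function-call nature of the LT style can be sketched in plain C++ (this is not actual SystemC/TLM code; the payload layout and the 40 ns latency are arbitrary example values):

```cpp
#include <cassert>
#include <cstdint>

// Loosely-timed style in miniature: the whole transaction is one blocking
// call, and timing is only annotated by adding the target's latency to a
// delay value that the initiator accumulates.
struct Payload { uint64_t addr; uint64_t data; bool isRead; };

struct Memory {
    uint64_t storage[16] = {};
    // Mimics a blocking transport: request and response in a single call.
    void b_transport(Payload& p, uint64_t& delayNs) {
        if (p.isRead) p.data = storage[p.addr % 16];
        else          storage[p.addr % 16] = p.data;
        delayNs += 40; // start-to-end latency of the whole transaction
    }
};
```

A write followed by a read of the same address completes in two calls, with the accumulated delay as the only timing information.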
The AT coding style is non-blocking and can therefore be used to model with a higher timing accuracy than LT.
This high accuracy makes it possible to use AT for hardware-level design space exploration.
With AT, a special protocol is used that uses a four-phase handshake:
\texttt{BEGIN\_REQ},
@@ -98,7 +97,7 @@ However, since the initiator is blocked due to backpressure during this period,
Another form of this shortcut is the combination with the return path of the forward transport function call.
Here, the return path is used to directly send the \texttt{BEGIN\_REQ} phase, without invoking the backward transport function altogether, reducing the required number of transport calls to only two.
The last shortcut that can be made is the so-called \textit{early completion}.
When the target receives the \texttt{BEGIN\_REQ} phase, it can already place the requested data into the payload and pass \texttt{TLM\_COMPLETED} as the return value back to the initiator.
This notifies that the whole transaction is already completed at this point, so no further transport calls are required.
Note that this form of early completion is very similar to the LT coding style, where a transaction is also modeled using only one function call.
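As a rough illustration (plain C++, not TLM code), the base-protocol phase order with and without the early-completion shortcut could be checked like this:

```cpp
#include <cassert>
#include <vector>

// Illustrative check of the four-phase handshake order described above.
// With early completion, the target answers the first phase with
// TLM_COMPLETED, so only BEGIN_REQ is ever observed on the socket.
enum Phase { BEGIN_REQ, END_REQ, BEGIN_RESP, END_RESP };

bool validSequence(const std::vector<Phase>& seq, bool earlyCompleted) {
    if (earlyCompleted)
        return seq == std::vector<Phase>{BEGIN_REQ};
    return seq == std::vector<Phase>{BEGIN_REQ, END_REQ, BEGIN_RESP, END_RESP};
}
```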
@@ -108,7 +107,7 @@ Here, \texttt{TLM\_COMPLETED} is returned during the backward transport call of
SystemC also supports additional user-defined phases through its \texttt{DECLARE\_EXTENDED\_\\PHASE()} macro for special cases.
In contrast to the TLM-LT protocol, TLM-AT allows modeling the pipelining of transactions; multiple transactions can be processed simultaneously by one target.
The responses also do not need to be in the same order as the initiator has sent them; they can be \textit{out-of-order}.
The TLM-AT coding style is the protocol used to implement the processor model and the cache model in Section \ref{sec:implementation} of this thesis.
Some of the earlier described shortcuts are taken advantage of throughout those models.

View File

@@ -2,13 +2,13 @@
\label{sec:caches}
In this section, the necessity and functionality of caches in modern computing systems is explained as well as the required considerations resulting from virtual memory addressing.
A special focus is also placed on non-blocking caches.
The theory is based on the chapters \textit{``An Overview of Cache Principles''} and \textit{``Logical Organization''} of \cite{Jacob2008} and on \cite{Jahre2007}.
With the advancement of faster multi-core processors, the performance gap to the main memory, commonly referred to as the \textit{memory wall}, is increasing.
Therefore, caches, whose goal is to decrease the latency and increase the bandwidth of a memory access, play an important role when it comes to the overall performance of computing systems.
Caches are faster than DRAM, but only provide a small capacity, as their area cost is a lot higher.
For this reason, at least the \textit{working set}, the data that the currently running application is working on, should be stored in the cache to improve performance.
The two most important heuristics that make this possible will be explained in Section \ref{sec:caches_locality_principles}.
@@ -46,13 +46,13 @@ Here, the program causes the cache to fetch more than one cache line from the un
This section revolves around the question of where to store the retrieved data in the cache.
Because the cache is much smaller than the DRAM, only a subset of the memory can be held in the cache at a time.
The cache line into which a block is placed is determined by the \textit{placement policy}.
There are three main policies:
\begin{itemize}
\item
In \textit{direct-mapped caches} the cache is divided into multiple sets with a single cache line in each set.
For each address there is only one cache line in which the data can be placed.
\item
In a \textit{fully associative cache} there is only one large set, containing all available cache lines.
Referenced data has no restriction in which cache line it can be placed.
@@ -84,12 +84,12 @@ An example subdivision of the address in the index, tag and byte offset is shown
\end{center}
\end{figure}
Direct-mapped caches have the advantage that only one tag has to be compared with the address.
However, every time new data is referenced that maps to the same set, the existing cache line needs to be evicted.
This leads to an overall lower cache hit rate compared to the other two policies.
In a fully associative cache, a memory reference can be placed anywhere; consequently, the tags of all cache lines have to be fetched and compared with the address.
Although this policy has the highest potential cache hit rate, the area cost due to additional comparators and the high power consumption due to the lookup process make it infeasible for many systems.
The hybrid approach of set-associative caches offers a trade-off between both policies.
The term \textit{associativity} denotes the number of cache lines that are contained in a set.
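As an illustration of how an address is split for a set-associative cache, consider the following sketch; the 32 KiB, 4-way, 64-byte-line geometry is an arbitrary example, not a configuration taken from this thesis:

```cpp
#include <cassert>
#include <cstdint>

// Decompose a physical address into byte offset, set index and tag
// for a set-associative cache (example geometry, powers of two).
constexpr uint64_t kLineSize = 64;                    // bytes per cache line
constexpr uint64_t kNumLines = 32 * 1024 / kLineSize; // 512 lines total
constexpr uint64_t kWays     = 4;                     // associativity
constexpr uint64_t kNumSets  = kNumLines / kWays;     // 128 sets

uint64_t byteOffset(uint64_t addr) { return addr % kLineSize; }
uint64_t setIndex(uint64_t addr)   { return (addr / kLineSize) % kNumSets; }
uint64_t tagBits(uint64_t addr)    { return addr / kLineSize / kNumSets; }
```

Two addresses that differ by a multiple of `kLineSize * kNumSets` map to the same set and are distinguished only by their tags, which is exactly why the tag comparison described above is needed.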
@@ -108,8 +108,8 @@ To determine which cache line in the corresponding set is evicted, there are sev
An LRU algorithm is expensive to implement, as a counter value for every cache line of a set has to be updated every time the set is accessed.
\item
An alternative is a \revabbr{pseudo LRU}{PLRU} policy, where an extra bit is set to 1 every time a cache line is accessed.
When the extra bit of every cache line in a set is set to 1, they are reset to 0.
In case of contention, the first cache line whose extra bit is 0 is evicted, which indicates that the last usage was likely some time ago.
\item
In the \revabbr{least frequently used}{LFU} policy, every time a cache line is accessed, a counter value is increased.
The cache line with the lowest value, the least frequently used one, is chosen to be evicted.
@@ -162,7 +162,7 @@ Such a cache is called \textit{virtually indexed} and \textit{physically tagged}
\begin{figure}
\begin{center}
\tikzfig{img/virtual_address_conversion}
\caption[Virtually indexed, physically tagged cache.]{Virtually indexed, physically tagged cache \cite{Jacob2008}. ASID refers to address-space identifier.}
\label{fig:virtual_address_conversion}
\end{center}
\end{figure}
@@ -182,7 +182,7 @@ Therefore, it is important to guarantee \textit{cache coherency}.
One of the solutions for cache coherency is the use of a so-called snooping protocol.
A cache will snoop the cache coherence bus to examine if it already has a copy of the requested data.
Snooping packets are then used to update or invalidate other copies of the data.
Snooping protocols are complex and it is difficult to formally verify that they in fact guarantee cache coherence.
For this reason, they are not further discussed in this thesis.
\subsection{Non-Blocking Caches}
@@ -198,7 +198,7 @@ An MSHR entry always corresponds to one cache line that is currently being fetch
There are two variants of cache misses: \textit{primary misses} are misses that lead to another occupation of an MSHR, whereas \textit{secondary misses} are added to an existing MSHR entry and therefore cannot cause the cache to block.
This is the case when the same cache line is accessed again.
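The classification into primary and secondary misses can be sketched as follows (an illustrative model in plain C++, not an actual MSHR implementation; the capacity of two entries is arbitrary):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Minimal MSHR file: a miss to a line already being fetched is merged
// into the existing entry (secondary miss); otherwise a new entry is
// allocated (primary miss) until the file is full and the cache blocks.
struct MshrFile {
    enum Result { Primary, Secondary, Block };

    std::vector<uint64_t> pendingLines; // one entry per in-flight cache line
    std::size_t capacity;

    Result onMiss(uint64_t lineAddr) {
        for (uint64_t l : pendingLines)
            if (l == lineAddr) return Secondary;       // merge into entry
        if (pendingLines.size() >= capacity) return Block; // no free MSHR
        pendingLines.push_back(lineAddr);              // allocate new entry
        return Primary;
    }
};
```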
A possible architecture of an MSHR file is illustrated in Figure \ref{fig:mshr_file}.
\begin{figure}
\begin{center}

View File

@@ -4,7 +4,7 @@
DRAMSys is an open-source design space exploration framework, capable of simulating the latest \revabbr{Joint Electron Device Engineering Council}{JEDEC} DRAM standards.
It is optimized to achieve high simulation speeds and utilizes the TLM-AT coding style while still achieving cycle-accurate results \cite{Steiner2020}.
DRAMSys is composed of an arbitration and mapping unit (also called arbiter) and independent channel controllers, each driving one DRAM channel.
The general architecture of DRAMSys is illustrated in Figure \ref{fig:dramsys}.
\begin{figure}[!ht]
@@ -35,7 +35,7 @@ The channel controller is the most important module of the DRAM simulation, cons
New incoming requests are placed into the scheduler.
The purpose of the scheduler is to group transactions by their corresponding memory bank and reorder the payloads according to a predefined policy.
Available policies are, for example, the \textit{first-in, first-out} or the \textit{first-ready, first-come, first-served} policy.
The former policy does not reorder payloads and therefore optimizes for a short response latency, whereas the latter policy reorders payloads and optimizes for a high memory bandwidth.
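The difference between the two policies can be sketched in plain C++ (illustrative only, not the DRAMSys scheduler; "ready" is reduced here to a row-buffer hit against a hypothetical open row):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Request { uint64_t row; int arrival; };

// First-in, first-out: simply serve the oldest request.
int pickFifo(const std::vector<Request>& q) {
    int best = -1;
    for (std::size_t i = 0; i < q.size(); ++i)
        if (best < 0 || q[i].arrival < q[best].arrival)
            best = static_cast<int>(i);
    return best;
}

// First-ready, first-come, first-served: prefer row-buffer hits,
// break ties by arrival order; degrade to FIFO if nothing is ready.
int pickFrFcfs(const std::vector<Request>& q, uint64_t openRow) {
    int best = -1;
    for (std::size_t i = 0; i < q.size(); ++i) {
        if (best < 0) { best = static_cast<int>(i); continue; }
        bool ready     = q[i].row == openRow;
        bool bestReady = q[best].row == openRow;
        if (ready && !bestReady) best = static_cast<int>(i);
        else if (ready == bestReady && q[i].arrival < q[best].arrival)
            best = static_cast<int>(i);
    }
    return best;
}
```

With an open row 7 and requests to rows 5, 7, 7 in arrival order, FIFO serves the oldest request first, whereas FR-FCFS prefers the older of the two row hits.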
A bank machine, whose responsibility is to manage the state of its corresponding memory bank, then fetches the next transaction from the scheduler.
There are also a number of available policies for the bank machines, each of which determines in which state the bank is held after a memory request is completed.
@@ -52,12 +52,12 @@ The selected command is then sent to the DRAM by the controller.
The last important module to mention is the response queue.
A completed DRAM transaction is enqueued into the response queue by the controller to send the responses back to the initiators.
In the response queue, transactions can either be returned to the initiator according to the scheme \textit{first-in, first-out} or be reordered in the queue.
A reordering might be necessary to support initiators that cannot handle \textit{out-of-order} responses.
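A minimal sketch of such a reordering response queue (illustrative, not the DRAMSys implementation; transaction ids are hypothetical) could look like this:

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Requests are remembered in issue order; completed transactions are
// only released once all older requests have completed as well, so the
// initiator always observes in-order responses.
struct InOrderResponseQueue {
    std::deque<int> issueOrder;  // transaction ids in request order
    std::vector<int> completed;  // ids finished by the DRAM

    void issue(int id) { issueOrder.push_back(id); }
    void complete(int id) { completed.push_back(id); }

    // Release responses while the oldest outstanding request is done.
    std::vector<int> drain() {
        std::vector<int> released;
        bool progress = true;
        while (progress && !issueOrder.empty()) {
            progress = false;
            for (std::size_t i = 0; i < completed.size(); ++i) {
                if (completed[i] == issueOrder.front()) {
                    released.push_back(completed[i]);
                    completed.erase(completed.begin() + i);
                    issueOrder.pop_front();
                    progress = true;
                    break;
                }
            }
        }
        return released;
    }
};
```

If transaction 2 completes before transaction 1, its response is held back until transaction 1 has completed, too.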
% Possibly TA (Trace Analyzer), in case images are used?
DRAMSys also provides the so-called \textit{Trace Analyzer}, a graphical tool that visualizes database files created by DRAMSys.
% It makes visible the \texttt{REQ} and \texttt{RESP} phases between the initiator and the arbiter, the occupation of the command bus and data bus as well as representations of the different phases in the DRAM banks.
An exemplary trace database, visualized in the Trace Analyzer, is shown in Figure \ref{fig:traceanalyzer}.
Furthermore, the Trace Analyzer is capable of calculating numerous metrics and creating plots of interesting characteristics.
\begin{figure}

View File

@@ -3,7 +3,7 @@
In this section, the developed components for the new simulator frontend, which enable the tracing of an arbitrary application in real-time, as well as the replay of the recorded traces in DRAMSys, will be introduced.
To summarize which components are necessary to implement the new simulation frontend, they are briefly listed below:
\begin{itemize}
\item A DynamoRIO client that traces memory accesses from a running application.
@@ -18,7 +18,7 @@ The last part will concentrate on the special architecture of the new trace play
\subsection{Analysis Tool}
\label{sec:analysis_tool}
As described in Section \ref{sec:dynamorio}, the dynamic binary instrumentation tool DynamoRIO will be used to trace the memory accesses while the target application is running.
Instead of writing a DynamoRIO client from the ground up, the DrMemtrace framework, which comes bundled with DynamoRIO, is used.
DrCacheSim is a DynamoRIO client that builds on top of the DrMemtrace framework, which gathers memory and instruction access traces from the target application and forwards them to one or multiple analysis tools.
@@ -65,7 +65,7 @@ In the case of online tracing, DrCacheSim consists of two separate processes:
\end{itemize}
The \revabbr{inter-process communication}{IPC} between the two processes is achieved through a \textit{named\ pipe}.
Figure \ref{fig:drcachesim} illustrates the structure of the online tracing mechanism.
A \texttt{memref\_t} can either represent an instruction, a data reference or a metadata event such as a timestamp or a CPU identifier.
Besides the type, the \revabbr{process identifier}{PID} and \revabbr{thread identifier}{TID} of the initiating process and thread are included in every record.
@@ -93,7 +93,7 @@ Lastly, the analysis tool inserts a timestamp into the trace for every received
The use of this timestamp will be further explained in Section \ref{sec:dbiplayer_functionality}.
Listing \ref{list:memtrace} presents an exemplary memory trace.
Lines consisting of a number between two angle brackets represent a timestamp, whereas lines for memory references consist of the instruction count, a character denoting a read or write, the size, and the physical address of the access.
Also, comments, which are ignored by the trace player, can be added by starting the line with a number sign.
\begin{listing}
\begin{textcode}
@@ -131,9 +131,8 @@ This section covers the general architecture of the \textit{DbiPlayer}, the new
For every recorded thread, a traffic initiator thread, a so-called \textit{DbiThreadPlayer}, is spawned, which is a standalone initiator for memory transactions.
Because those threads need to be synchronized to approximate real thread interleaving, they need to communicate with each other.
The detailed mechanism behind this synchronization will be further explained in Section \ref{sec:dbiplayer_functionality}.
This communication, however, brings up the necessity to containerize the thread players into a single module that can directly be connected to DRAMSys.
With the old DRAMSys interface for trace players, this was not easily realizable, so a new generic initiator interface was developed that allows components to be connected to DRAMSys whose internal architecture can be arbitrary.
This new interface will be further discussed in Section \ref{sec:traceplayer_interface}.
For the \textit{DbiPlayer}, an additional interconnect module will bundle up all \\ \texttt{simple\_initiator\_sockets} into a single \texttt{multi\_passthrough\_initiator\_socket}.
@@ -199,12 +198,12 @@ It has a configurable size, associativity, cache line size, MSHR buffer depth, w
To understand how the cache model works, a hypothetical request from the CPU will be assumed to explain the internal processing of the transaction in detail:
When the transaction arrives, it will be placed in the PEQ of the cache, from where, after the specified amount of delay has elapsed, the handler for the \texttt{BEGIN\_REQ} phase is called.
The handler verifies that the cache buffers are not full\footnote{Otherwise the cache will apply backpressure on the CPU and postpone the handling of the transaction.} and checks if the requested data is stored in the cache.
If this is the case (i.e., a cache hit), the cache model immediately sends an \texttt{END\_REQ} and, when the target socket is not currently occupied with a response, accesses the cache and sends the \texttt{BEGIN\_RESP} phase to the processor.
During a cache access, the content of the cache line is copied into the transaction in case of a read request, or the cache line is updated with the new value in case of a write request.
Further, in both cases the timestamp of the last access is updated to the current simulation time.
The processor then finalizes the transaction with the \texttt{END\_RESP} phase, the target backpressure of the cache is cleared, and the postponed request from the CPU (if it exists) is placed into the PEQ once again.
If, on the other hand, the requested data is not in the cache (i.e., a cache miss), it is first checked whether there is already an existing MSHR entry for the corresponding cache line.
If this is the case\footnote{And if the target list of the MSHR entry is not full. Otherwise the transaction will be postponed.}, the transaction is appended to it as an additional target.
@@ -240,7 +239,6 @@ The rough internal structure of the cache model is shown again in Figure \ref{fi
Note that the current implementation does not utilize a snooping protocol.
Therefore, cache coherency is not guaranteed, and memory shared between multiple processor cores will lead to incorrect results, as the values are not synchronized between the caches.
However, this is not expected to drastically affect the simulation results for applications with few shared resources.
The implementation of a snooping protocol is a candidate for future improvements.
\subsection{Trace Player Interface}
\label{sec:traceplayer_interface}
@@ -267,7 +265,7 @@ While the implementation of the \textit{MultiCoupler} component is trivial as it
In order to understand why this buffering is needed, consider the scenario where the L3 cache applies backpressure to one L2 cache.
The L2 cache is not allowed to send further requests due to the exclusion rule.
But since the target socket of the L3 cache is occupied, this also applies to all other L2 caches.
This information, however, is not propagated to the other caches, leading to an incorrect behavior if not addressed, as the other caches will send further requests.
To solve this problem, the MultiSimpleCoupler only forwards requests to the L3 cache when it is able to accept them.
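This buffering behavior can be sketched, again SystemC-free and with purely illustrative names, as a coupler that holds back requests while the L3 target socket is occupied and forwards the next one only once the socket is freed:

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <optional>

// Simplified sketch of the MultiSimpleCoupler buffering described above.
struct Payload { int initiator_id; std::uint64_t address; };

class MultiSimpleCoupler {
public:
    // Called by an L2 cache. The request is buffered and only forwarded
    // when the L3 target socket is free, so backpressure from the L3 cache
    // never silently blocks the other L2 caches.
    void request(const Payload& p) {
        pending_.push_back(p);
        tryForward();
    }

    // Called when the L3 cache has finished the current request and its
    // target socket becomes free again.
    void targetFreed() {
        target_busy_ = false;
        tryForward();
    }

    std::size_t buffered() const { return pending_.size(); }
    std::optional<Payload> lastForwarded() const { return forwarded_; }

private:
    void tryForward() {
        if (target_busy_ || pending_.empty()) return;
        forwarded_ = pending_.front();   // would call the L3 target socket here
        pending_.pop_front();
        target_busy_ = true;
    }

    std::deque<Payload> pending_;
    bool target_busy_ = false;
    std::optional<Payload> forwarded_;
};
```

Because each L2 cache only ever talks to the coupler, the exclusion rule is satisfied from its point of view even while the L3 cache applies backpressure.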

\section{Simulation Results}
\label{sec:simulation_results}
This section evaluates the accuracy of the new simulation frontend.
After a short discussion of the general expectations regarding the accuracy and the considerations to be made, the simulation results are presented.
The presentation is structured into two parts:
First, simulation statistics of numerous benchmarks are compared against the gem5 simulator \cite{Binkert2011}, which uses detailed processor models and can be considered a reference.
Generating memory access traces using dynamic binary instrumentation as a faster alternative to the simulation of detailed processor models introduces several inaccuracies, some of which will now be enumerated.
The most important aspect to consider is that DBI can only instrument the target application and cannot take into account the operating system the application is running on.
That includes the inability to observe the execution of kernel routines that are directly invoked by the application through system calls, but also the preemptive scheduling of other programs that are running on the system at the same time.
The fetching of the instructions themselves should also be considered:
In a real system the binary executable of the target application is placed in the DRAM, along with its data, and is loaded into the instruction cache while executing.

\section{Conclusion and Future Work}
\label{sec:future_work}
Due to the complexity of possible memory subsystem configurations, simulation is an indispensable part of the development process of today's systems.
It not only has a high impact on the development cost but also significantly reduces the time-to-market and enables the rapid release of new products.
However, accurately simulating a specific application takes a long time because of the detailed processor core models.
On the other hand, fixed- or relative-time memory traces allow faster simulation at the expense of accuracy, which often makes them unsuitable.
To fill this gap, this thesis introduced a new simulation frontend for DRAMSys, which speeds up the process while making only few compromises on accuracy.
In conclusion, the newly developed instrumentation tool provides a flexible way of generating traces for arbitrary multi-threaded applications.
The mature DRAMSys simulator framework can then be used to explore the design space and vary numerous configuration parameters of the DRAM subsystem to find a well-suited set of options.
It was shown that, in comparison to the well-established full-system simulation framework gem5, only some deviations have to be accepted.
Also, the Pin-Tool-based memory access tracing of the Ramulator DRAM simulator was compared to the new frontend.
Although Ramulator takes a slightly different approach to trace generation than this thesis, a very good correlation in the results could be demonstrated.
A noteworthy advantage of the newly developed tool is its support for all hardware architectures that DynamoRIO provides (currently IA-32, x86-64, ARM, and AArch64) in contrast to the supported architectures of Pin (IA-32 and x86-64).
Although this can be a complex task, it could be implemented in future work.
A less impactful inaccuracy results from the scheduling of the application's threads in the new simplified core models.
While an application can spawn an arbitrary number of threads, the platform may not be able to process them all in parallel.
Currently, the new trace player does not take this into account and runs all threads in parallel.
This deviation could be prevented by recording the processor cores used on the original system and using this information to better match the scheduling.
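As a minimal sketch of this proposed improvement (not part of the current implementation), the trace player could map the recorded threads onto the number of cores of the original system, e.g. round-robin, and time-multiplex threads that share a core:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: assign each recorded thread to a core of the
// original system; threads mapped to the same core would then be
// time-multiplexed instead of simulated fully in parallel.
std::vector<std::size_t> assignThreadsToCores(std::size_t num_threads,
                                              std::size_t num_cores) {
    std::vector<std::size_t> core_of_thread(num_threads);
    for (std::size_t t = 0; t < num_threads; ++t)
        core_of_thread[t] = t % num_cores;  // simple round-robin mapping
    return core_of_thread;
}
```

A real system's scheduler is of course more dynamic than this static mapping, but even such a coarse approximation would avoid the unrealistic fully parallel execution of arbitrarily many threads.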
One significant improvement that still could be applied is the consideration of dependencies between the memory accesses.
Similarly to the elastic trace player of gem5 \cite{Jagtap2016}, which captures data load and store dependencies by instrumenting a detailed out-of-order processor model, the DynamoRIO tool could create a dependency graph of the memory accesses using the decoded instructions.
By using this technique, it is possible to also model out-of-order behavior of modern processors and make the simulation more accurate, whereas the current implementation is entirely in-order.
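One way such a dependency graph could be built, sketched here under the assumption that the decoded instructions expose their read and written registers (the names and structure are purely illustrative), is to add an edge from the last producer of every register an instruction reads:

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical decoded instruction: the registers it reads and writes.
struct Instr {
    std::vector<std::string> reads;
    std::vector<std::string> writes;
};

// Returns, for each instruction index, the indices of the instructions it
// depends on (i.e., the last writers of its source registers).
std::vector<std::vector<std::size_t>>
buildDependencyGraph(const std::vector<Instr>& trace) {
    std::vector<std::vector<std::size_t>> deps(trace.size());
    std::map<std::string, std::size_t> last_writer;
    for (std::size_t i = 0; i < trace.size(); ++i) {
        for (const auto& reg : trace[i].reads) {
            auto it = last_writer.find(reg);
            if (it != last_writer.end())
                deps[i].push_back(it->second);  // data dependency edge
        }
        for (const auto& reg : trace[i].writes)
            last_writer[reg] = i;               // this instr is the new producer
    }
    return deps;
}
```

A trace player could then issue any memory access whose dependencies have completed, approximating out-of-order execution instead of replaying the trace strictly in order.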
These mentioned potential improvements could make the new simulation frontend for DRAMSys even more accurate.