Last fixes

2022-08-15 08:18:13 +02:00
parent faf2842687
commit 9ec2f6f1eb
7 changed files with 30 additions and 30 deletions

BIN
doc.pdf Normal file



@@ -10,13 +10,13 @@ The explained topics are mainly based on the chapters \textit{``DynamoRIO''}, \t
\revabbr{Dynamic binary instrumentation}{DBI} is a method to analyze, profile, manipulate and optimize the behavior of a binary application while it is executed.
This is achieved by injecting additional instructions into the instruction trace of the target application, which either accumulate statistics or intervene in the instruction trace.
-In comparison, debuggers use special breakpoint instructions (e.g. INT3 on x86 or BKPT on ARM) that are injected at specific places in the code, raising a debug exception when reaching it.
+In comparison, debuggers use special breakpoint instructions (e.g., INT3 on x86 or BKPT on ARM) that are injected at specific places in the code, raising a debug exception when reached.
At each such exception, a context switch to the operating system kernel is performed.
However, those context switches incur a significant performance penalty, as the processor state has to be saved and restored afterwards, making this approach less efficient than DBI.
DBI tools can either invoke the target application themselves or attach to the application's process dynamically.
The former method allows instrumentation even of the early startup stage of the application, whereas the latter method might be used if the application first has to be brought into a certain state or the process cannot be restarted for reliability reasons.
-Some DBI tools also allow to directly implement the DBI framework into the applications source code.
+Some DBI tools also allow the DBI framework to be integrated directly into the application's source code.
While this eliminates the flexibility of observing applications that are only available in binary form, it enables control over the DBI tool through its application programming interface.
With this method, it is possible to precisely instrument only a specific code region of interest and otherwise disable the tool for performance reasons.
@@ -87,7 +87,7 @@ Client code should also not alter the application stack in any way, as some spec
Alternatively, DynamoRIO provides a separate stack that should be used instead to store temporary data.
To remain undetected, it is also required for DynamoRIO to protect its own memory from malicious reads or writes from the application.
Those accesses should, as in the native case, raise an exception, just as if unallocated data were accessed.
-However, as these memory regions are actually allocated, DynamoRIO has to produce those execption itself to remain transparent.
+However, as these memory regions are actually allocated, DynamoRIO has to produce those exceptions itself to remain transparent.
When the application branches to a dynamically calculated address, DynamoRIO has to translate this address to the corresponding address of the basic block in the code cache.
But also in the backward case, whenever a code cache address is exposed to the application, it has to be converted back to the corresponding address in the mapped address region of the binary executable.
@@ -110,9 +110,9 @@ So the required instructions have to be inserted into the basic block instructio
Table \ref{tab:dynamorio_api} lists some of the most important hooks that a client can implement.
\begin{table}
-\caption{Client routines that get called by DynamoRIO \cite{Bruening2003}.}
+\caption[Client routines that are called by DynamoRIO.]{Client routines that are called by DynamoRIO \cite{Bruening2003}.}
\begin{center}
-\begin{tabular}{|p{0.55\linewidth} | p{0.35\linewidth}|}
+\begin{tabular}{|p{0.6\linewidth} | p{0.4\linewidth}|}
\hline
Client Routine & Description\\
\hline


@@ -46,7 +46,7 @@ GPs are passed as references, so they do not need to be copied between modules.
\begin{figure}
\begin{center}
\tikzfig{img/tlm}
-\caption[Forward and backward path between TLM sockets \cite{Menard2017}.]{Forward and backward path between TLM sockets \cite{Menard2017}. $\blacksquare$ denotes an initiator socket, $\square$ denotes a target socket.}
+\caption[Forward and backward path between TLM sockets.]{Forward and backward path between TLM sockets \cite{Menard2017}. $\blacksquare$ denotes an initiator socket, $\square$ denotes a target socket.}
\label{fig:tlm}
\end{center}
\end{figure}
@@ -67,8 +67,8 @@ With AT, a special protocol is used that uses a four-phase handshake:
When an initiator requests data from a target, it starts the transaction with the \texttt{BEGIN\_REQ} phase by calling its \texttt{nb\_transport\_fw()} method.
This method in turn calls the receiving module's target socket and the target module then enqueues the payload into its \revabbr{payload event queue}{PEQ}.
The PEQ pretends it has received the payload after the delay that the initiator specified in its call to the transport method.
-If the target is not yet ready to accept a new request, it defers its \texttt{END\_REQ} phase until it is ready.
-During this time, the initiator is blocked from sending further requests either to this or other modules as the target applies \textit{backpressure} on the initiator.
+If the target is not yet ready to accept the new request, it defers its \texttt{END\_REQ} phase until it is ready.
+During this time, the initiator is blocked from sending further requests to this module as the target applies \textit{backpressure} on the initiator.
This concept is called the \textit{exclusion rule}.
Otherwise, the target directly sends the \texttt{END\_REQ} phase back to the initiator.
@@ -94,8 +94,8 @@ Analogously, it is also possible for the initiator to directly respond with the
Besides this, it is also possible for the target to directly respond with the \texttt{BEGIN\_RESP} phase after it has received the \texttt{BEGIN\_REQ} phase and therefore skip the \texttt{END\_REQ}.
The initiator has to react accordingly and must detect that the \texttt{END\_REQ} has been skipped.
However, since the initiator is blocked due to backpressure during this period, this shortcut should only be used if the response is ready to send after a short delay.
-Another form of this shortcut is the combination with return path of the forward transport function call.
-Here, the return path is used to directly send the \texttt{BEGIN\_REQ} phase, without invoking the backward transport function altogether, reducing the required number of transport calls to only two.
+Another form of this shortcut is the combination with the return path of the forward transport function call.
+Here, the return path is used to directly send the \texttt{BEGIN\_RESP} phase, without invoking the backward transport function at all, reducing the required number of transport calls to only two.
The last shortcut that can be made is the so-called \textit{early completion}.
When the target receives the \texttt{BEGIN\_REQ} phase, it can already place the requested data into the payload and pass \texttt{TLM\_COMPLETED} as the return value back to the initiator.


@@ -21,7 +21,7 @@ Finally, the advantage of non-blocking caches is the topic of Section \ref{sec:c
\label{sec:caches_locality_principles}
Access patterns of a typical application are not random.
-They tend to repeat in time or are located in the near surrounding of previous accesses.
+They tend to repeat themselves in time or are located in the near vicinity of previous accesses.
Those two heuristics are called \textit{temporal locality} and \textit{spatial locality}.
\subsubsection{Temporal Locality}
@@ -65,7 +65,7 @@ There are three main policies:
\begin{figure}
\begin{center}
\tikzfig{img/associativity}
-\caption{Four organizations for a cache of eight blocks \cite{Jacob2008}.}
+\caption[Four organizations for a cache of eight blocks.]{Four organizations for a cache of eight blocks \cite{Jacob2008}.}
\label{fig:associativity}
\end{center}
\end{figure}
@@ -154,11 +154,11 @@ Figure \ref{fig:virtual_address} shows an exemplary division of a virtual addres
Before a process can access a specific region in memory, the kernel has to translate the virtual page number into a physical page number.
For conversions, so-called \textit{page tables} are used to look up the physical page number.
-Page tables are usually multiple levels deep (e.g. 4-levels on x86), so a single conversion can cause a number of memory accesses, which is expensive.
+Page tables are usually multiple levels deep (e.g., four levels on x86), so a single conversion can cause a number of memory accesses, which is expensive.
To improve performance, a \revabbr{translation lookaside buffer}{TLB} is used, which acts like a cache on its own for physical page numbers.
However, as long as the physical address is not present, the data cache cannot look up its entries as the index is not known yet.
-So the cache has to wait for the TLB or even multiple memory accesses if the physical page number is not stored in it.
+So the cache has to wait for the TLB, or even for multiple memory accesses, in case the physical page number is not stored in it.
To circumvent this problem, the cache can be indexed by the virtual address, which makes it possible to parallelize both procedures.
Such a cache is called \textit{virtually indexed} and \textit{physically tagged} and is illustrated in Figure \ref{fig:virtual_address_conversion}.
@@ -207,7 +207,7 @@ A possible architecture of an MSHR file is illustrated in Figure \ref{fig:mshr_f
\begin{figure}
\begin{center}
\tikzfig{img/mshr_file}
-\caption[Miss Status Holding Register File \cite{Jahre2007}.]{Miss Status Holding Register File \cite{Jahre2007}. \textit{V} refers to a valid bit.}
+\caption[Miss Status Holding Register File.]{Miss Status Holding Register File \cite{Jahre2007}. \textit{V} refers to a valid bit.}
\label{fig:mshr_file}
\end{center}
\end{figure}


@@ -10,7 +10,7 @@ The general architecture of DRAMSys is illustrated in Figure \ref{fig:dramsys}.
\begin{figure}[!ht]
\begin{center}
\includegraphics{img/dramsys.pdf}
-\caption{Structure of DRAMSys \cite{Jung2017}.}
+\caption[Structure of DRAMSys.]{Structure of DRAMSys \cite{Jung2017}.}
\label{fig:dramsys}
\end{center}
\end{figure}
@@ -22,7 +22,7 @@ To support a variety of DRAM standards in a robust and error-free manner, DRAMSy
Using this language, all timing dependencies between DRAM commands of a standard can be defined.
From this formal description, the source code of internal timing checkers is generated, which ensure compliance to the specific standard \cite{Jung2017a}.
-Since a single memory access can result in the issuance of multiple commands (e.g. a precharge (\texttt{PRE}), an activate (\texttt{ACT}), a read (\texttt{RD}) or a write (\texttt{WR})), the four-phase handshake of the TLM-AT protocol is not sufficient to model the communication between the DRAM controller and the DRAM device.
+Since a single memory access can result in the issuance of multiple commands (e.g., a precharge (\texttt{PRE}), an activate (\texttt{ACT}), a read (\texttt{RD}) or a write (\texttt{WR})), the four-phase handshake of the TLM-AT protocol is not sufficient to model the communication between the DRAM controller and the DRAM device.
Therefore, a custom TLM protocol called DRAM-AT is used as the communication protocol between the channel controller and the DRAM device \cite{Steiner2020}.
This custom protocol introduces a \texttt{BEGIN} and \texttt{END} phase for every available DRAM command.
Which commands can be issued depends on the DRAM standard used.
@@ -60,6 +60,8 @@ DRAMSys also provides the so-called \textit{Trace Analyzer}, a graphical tool th
An exemplary trace database, visualized in the Trace Analyzer, is shown in Figure \ref{fig:traceanalyzer}.
Furthermore, the Trace Analyzer is capable of calculating numerous metrics and creating plots of interesting characteristics.
+In Section \ref{sec:implementation} of this thesis, a new simulation frontend for DRAMSys will be developed.
\begin{landscape}
\begin{figure}
\begin{center}
@@ -69,5 +71,3 @@ Furthermore, the Trace Analyzer is capable of calculating numerous metrics and c
\end{center}
\end{figure}
\end{landscape}
-In Section \ref{sec:implementation} of this thesis, a new simulation frontend for DRAMSys will be developed.


@@ -74,15 +74,15 @@ For data references, the address and size of the desired access is provided as w
In offline mode, DrCacheSim stores the current mapping of all binary executables and shared libraries in a separate file, so that it is possible to decode and disassemble the traced instructions even after the application has exited.
As mentioned earlier, instruction decoding is not natively supported for online tracing, but to work around the problem, the analyzer can examine the memory map of the client-side process and read the encoded instructions from there.
-Using command line options, it is also possible to instruct DrCachSim to trace only a portion of an application, rather than everyting from start to finish.
+Using command line options, it is also possible to instruct DrCacheSim to trace only a portion of an application, rather than everything from start to finish.
This region of interest can be specified by the number of instructions after which the tracing should start or stop.
All analysis tools implement the common \texttt{analysis\_tool\_t} interface as this enables the analyzer to forward a received record to multiple tools in a polymorphic manner.
In particular, the \texttt{process\_memref\_t()} method of any tool is called for every incoming record.
Virtual functions, such as \texttt{initialize()} and \texttt{print\_results()}, which are called by the analyzer in appropriate places, should also be implemented.
-It is possible for a analysis tool to implement parallel processing of the received \texttt{memref\_t} types by splitting up the trace into \textit{shards}.
-However, in this thesis the sequential processing of a single sorted and interleaved trace was used because of missing support for parallel processing for the online execution model.
+It is possible for an analysis tool to implement parallel processing of the received \texttt{memref\_t} types by splitting up the trace into \textit{shards}.
+However, in this thesis the sequential processing of a single sorted and interleaved trace is used because the online execution model lacks support for parallel processing.
The newly developed DRAMTracer tool creates a separate trace file for every application thread.
Since it is not known a priori how many threads an application will spawn, the tool listens for records with new TIDs that it has not yet registered.
@@ -140,7 +140,7 @@ So the \textit{DbiPlayer} is a hierarchical module that consists of a more compl
As the memory accesses are directly extracted from the executed instructions, simply sending a transaction to the DRAM subsystem for every data reference would completely neglect the caches of today's processors.
Therefore, a cache model is also required, whose implementation will be explained in more detail in Section \ref{sec:cache_implementation}.
-Many modern cache hierarchies are composed of 3 cache levels: 2 caches for every processor core, the L1 and L2 cache, and one cache that is shared across all cores, the L3 cache.
+Many modern cache hierarchies are composed of three cache levels: two caches for every processor core, the L1 and L2 cache, and one cache that is shared across all cores, the L3 cache.
This cache hierarchy is also reflected in the \textit{DbiPlayer} shown in Figure \ref{fig:dbiplayer_with_caches}, but simpler hierarchies, such as an L1 cache for every processor core and one shared L2 cache, are configurable as well.
In order to connect the different SystemC socket types, one additional interconnect is required which is explained in more detail in Section \ref{sec:interconnect}.
@@ -171,7 +171,7 @@ The instruction count field of the trace is used to approximate the delay betwee
Additionally, this count can be multiplied by an approximation of the average \revabbr{cycles per instruction}{CPI} value.
While this does not take into account the type of the instructions executed, it is still a simple approximation that can be used to model the system more accurately.
-The individual initator threads should run by themselves without paying attention to the others; rather, they require synchronization to ensure the simulated system replicates the real running application as closely as possible.
+The individual initiator threads should not run by themselves without paying attention to the others; rather, they require synchronization to ensure that the simulated system replicates the real running application as closely as possible.
The analysis tool appends timestamps into the memory access traces.
When such a timestamp is reached, it is used either to pause the execution of a thread if the global time has not yet advanced that far, or to advance the global time when the thread is allowed to continue.
Note that the term global time in this context does not correspond to the SystemC simulation time, but denotes a loose time variable that only the \textit{DbiPlayer} uses to schedule its threads.
@@ -211,7 +211,7 @@ If not, a cache line is evicted to make space for the new cache line that will b
The cache model implements the widely used \textit{least recently used} (LRU) replacement policy, so the cache line whose last access lies furthest in the past is chosen to be evicted.
When an eviction is not possible, the transaction will be postponed.
An eviction is not possible when the selected cache line is allocated but not yet filled with requested data from the underlying cache, the cache line is currently present in the MSHR queue, or a hit for this cache line is yet to be handled.
-When the \texttt{dirty} flag of the old cache line is set, it has to be placed into the write buffer and written back to the memory.
+When the \textit{dirty} flag of the old cache line is set, it has to be placed into the write buffer and written back to the memory.
The newly evicted cache line is now \textit{allocated}, but not \textit{valid}.
Then, the transaction is put into an MSHR entry and the \texttt{END\_REQ} phase is sent back to the processor.
@@ -280,7 +280,7 @@ Until the L3 cache responds with the \texttt{END\_REQ} phase, the interconnect d
When the \texttt{END\_REQ} phase is received, the next transaction from this request buffer is sent to the L3 cache.
After some time, the L3 cache will respond with the requested cache lines.
During this \texttt{BEGIN\_RESP} phase, the L2 cache that requested this line is looked up using the routing table and the payload is sent back to it.
-Until the L2 cache responds with an \texttt{END\_RESP}, the exclusion rule has to be honored also here: when a new response from the L3 cache is received, it has to be buffered in another internal data structure until the corresponding target socket binding is clear again.
+Until the L2 cache responds with an \texttt{END\_RESP}, the exclusion rule also has to be honored here: when a new response from the L3 cache is received, it has to be buffered in another internal data structure until the corresponding target socket binding is clear again.
Once the L2 cache sends out the \texttt{END\_RESP} phase, the interconnect will forward the \texttt{END\_RESP} to the L3 cache, and initiate new response transactions in case the response buffer is not empty.
In conclusion, this special interconnect module with a multi-target socket and a simple-initiator socket ensures that the exclusion rule is respected in both directions.


@@ -61,7 +61,7 @@ Furthermore, the compiler optimizations are set to \texttt{-Ofast} for all bench
Their access patterns are as follows:
\begin{table}[!ht]
-\caption{Access patterns of the micro-benchmark kernels \cite{TheBandwidthBenchmark}.}
+\caption[Access patterns of the micro-benchmark kernels.]{Access patterns of the micro-benchmark kernels \cite{TheBandwidthBenchmark}.}
\begin{center}
\begin{tabular}{|c|c|c|}
\hline
@@ -127,7 +127,7 @@ In the following, the simulation results of the new simulation frontend, the gem
Listed in Table \ref{tab:benchmark_gem5_bandwidth_ddr4} are three key parameters, specifically the average memory bandwidth and the number of bytes that have been read or written for the DDR4-2400 configuration.
The results show that all parameters of DRAMSys correlate well with the gem5 statistics.
-While for the average bandwidth the DynamoRIO results are on average 31.0\% slower compared to gem5 SE, this deviation is only 11.1\% for gem5 FS.
+While for the average bandwidth the DynamoRIO results are on average 31.0\% lower compared to gem5 SE, this deviation is only 11.1\% for gem5 FS.
The total number of bytes read deviates by 35.5\% from the gem5 FS results and by only 14.6\% from gem5 SE.
The number of bytes written, on the other hand, shows a very small deviation of 5.2\% for gem5 FS and only 0.07\% for gem5 SE.
Therefore, it can be stated that almost the same number of bytes were written back to the DRAM due to cache write-backs.
@@ -300,7 +300,7 @@ These tables also provide information about the simulation time of the different
\label{fig:latency_ddr4}
\end{figure}
-In order to compare not only the total average bandwidth, but also its behavior over time, all benchmarks were run consecutively on gem5 SE amd DRAMSys and plotted as a bandwidth-time diagram in Figure \ref{fig:data_bus_util}.
+In order to compare not only the total average bandwidth, but also its behavior over time, all benchmarks were run consecutively on gem5 SE and DRAMSys and plotted as a bandwidth-time diagram in Figure \ref{fig:data_bus_util}.
\begin{figure}[!ht]
\begin{center}
@@ -336,7 +336,7 @@ In order to compare not only the total average bandwidth, but also its behavior
\label{fig:data_bus_util}
\end{figure}
-Similar to the previous comparisons, the bandwidth on average of DRAMSys is marginally lower than gem5.
+Similar to the previous comparisons, the average bandwidth of DRAMSys is marginally lower than that of gem5.
Furthermore, an increased fluctuation around a bandwidth value can be observed.
However, the overall time behavior is the same:
The highest bandwidth value is reached at the beginning of the simulation and the bandwidth drops to a low plateau at the end of the simulation.