\section{Implementation}
\label{sec:implementation}

This section introduces the new components that enable the tracing of an arbitrary application in real time, as well as the replay of the resulting traces in DRAMSys.

First, the DynamoRIO analyzer tool that produces the memory access traces, and its place in the DrCacheSim framework, is explained.
The new trace player for DRAMSys then receives special attention, as does the cache model that is used to model the cache filtering of a real system.
The last part concentrates on the architecture of the new trace player interface and the challenges its internal interconnect solves.

\subsection{Analysis Tool}
\label{sec:analysis_tool}

As described in section \ref{sec:dynamorio}, the dynamic binary instrumentation tool DynamoRIO is used to trace the memory accesses while the target application is running.
Instead of writing a DynamoRIO client from the ground up, the DrCacheSim framework is used.

DrCacheSim is a DynamoRIO client that gathers memory and instruction access traces and forwards them to an analyzer tool.
It is purely observational and does not modify the behavior of the application.

Optionally, DrCacheSim converts the virtual addresses of the memory accesses into physical addresses, which is an important step for simulating a real memory system.
The physical address conversion only works on Linux and, on modern kernel versions, requires root privileges (or alternatively the CAP\_SYS\_ADMIN capability).
The analyzer tool can either run alongside DrCacheSim (online) or operate on an internal trace format (offline).
As of writing this thesis, the offline tracing mode does not yet support the physical address conversion, so the online mode has to be used.

For online tracing, DrCacheSim consists of two separate processes:

\begin{itemize}
\item
A client-side process (the DynamoRIO client) that injects observational instructions into the application's code cache.
For every instruction or memory access, a data packet of the type \texttt{memref\_t} is generated.
\item
An analyzer-side process that connects to the client and processes the \texttt{memref\_t} data packets.
The analyzer side can host multiple analysis tools that operate on this stream of records.
\end{itemize}

The \revabbr{inter-process communication}{IPC} between the two parts is achieved through a \textit{named pipe}.
Figure \ref{fig:drcachesim} illustrates the structure of the individual parts.

\input{img/thesis.tikzstyles}
\begin{figure}[!ht]
\begin{center}
\tikzfig{img/drcachesim}
\caption{Structure of the DrCacheSim online tracing.}
\label{fig:drcachesim}
\end{center}
\end{figure}
A \texttt{memref\_t} can represent an instruction, a data reference, or a metadata event such as a timestamp or a CPU identifier.
Besides the type, the \revabbr{process identifier}{PID} and \revabbr{thread identifier}{TID} of the initiating process and thread are included in every record.
For an instruction marker, the size of the instruction as well as its virtual address in the memory map is provided.
For data references, the address and size of the access are provided, as well as the \revabbr{program counter}{PC} from which it was initiated.
In offline mode, DrCacheSim stores the current mapping of all binary executables and shared libraries in a separate file, so that instructions can still be decoded after the application has exited.
In online tracing, the analyzer instead has to inspect the memory of the client-side process for this purpose.

Analysis tools implement the \texttt{analysis\_tool\_t} interface, which enables the analyzer to forward a received record to multiple tools in a polymorphic manner.
In particular, the \texttt{process\_memref\_t()} method of every tool is called for each incoming record.
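This polymorphic dispatch can be sketched in a few lines. The following is a deliberately reduced, self-contained illustration: the record struct, the interface, and the example counter tool are hypothetical stand-ins for the DrCacheSim types, not the actual framework code.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Reduced stand-in for DrCacheSim's record type (illustrative only).
struct memref_t {
    bool is_write;
    uint64_t addr;
};

// Reduced stand-in for the analysis-tool interface: the analyzer calls
// process_memref() for every incoming record.
class analysis_tool_t {
public:
    virtual ~analysis_tool_t() = default;
    virtual bool process_memref(const memref_t &ref) = 0;
};

// Hypothetical example tool: counts read and write references.
class counter_tool_t : public analysis_tool_t {
public:
    bool process_memref(const memref_t &ref) override {
        (ref.is_write ? writes : reads)++;
        return true;
    }
    int reads = 0, writes = 0;
};

// The analyzer forwards each record to all registered tools polymorphically.
inline void dispatch(const std::vector<memref_t> &records,
                     std::vector<analysis_tool_t *> &tools) {
    for (const auto &r : records)
        for (auto *t : tools)
            t->process_memref(r);
}
```

Because every tool only sees the shared interface, new analyses can be added without touching the analyzer's dispatch loop.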
The newly developed DRAMTracer tool creates a separate trace file for every thread of the application.
As it is not known in advance how many threads an application will spawn, the tool listens for records with TIDs that it has not yet registered.
For every data reference, a new entry is appended to the corresponding trace file, containing the size and the physical address of the access, whether it was a read or a write, and a count of the (computational) instructions that have been executed since the last reference.
This instruction count is used to approximate the delay between the memory accesses when the trace is replayed by DRAMSys.

\begin{listing}
\begin{textcode}
# instruction count,read/write,data size,data address
# <timestamp>
<13295366593324052>
4,r,8,1774ef30
0,r,8,1774ef38
1,w,8,1774ef28
2,w,8,1774ee88
0,r,8,17744728
1,r,8,238c3fb0
\end{textcode}
\caption{Example of a memory access trace with a timestamp.}
\label{list:memtrace}
\end{listing}
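An entry of this trace format can be parsed with standard C++ stream facilities. The following sketch is purely illustrative; the struct and function names are hypothetical and not taken from the actual DbiPlayer sources.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical representation of one trace entry.
struct TraceEntry {
    uint64_t instructionCount; // instructions executed since the last reference
    bool isWrite;              // 'w' = write, 'r' = read
    uint64_t size;             // access size in bytes
    uint64_t address;          // physical address (hexadecimal in the trace)
};

// Parses a line of the form "<icount>,<r|w>,<size>,<hex address>".
inline TraceEntry parseTraceLine(const std::string &line) {
    std::istringstream in(line);
    std::string icount, rw, size, addr;
    std::getline(in, icount, ',');
    std::getline(in, rw, ',');
    std::getline(in, size, ',');
    std::getline(in, addr, ',');
    return TraceEntry{std::stoull(icount), rw == "w", std::stoull(size),
                      std::stoull(addr, nullptr, 16)};
}
```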
As of writing this thesis, the DrCacheSim framework does not define an application binary interface for analysis tools.
Therefore, the DRAMTracer tool cannot be loaded as a shared library; instead, the DynamoRIO source code has to be modified to integrate the tool.

In addition, to be able to decode the instructions during online tracing, a set of patches had to be applied to DynamoRIO.

\subsection{Trace Player Architecture}
\label{sec:dbiplayer_architecture}

This section covers the general architecture of the DbiPlayer, the new trace player for DRAMSys that replays the captured trace files.

For every recorded thread, a so-called DbiThreadPlayer is spawned, which acts as a standalone initiator of transactions.
Because these threads need to be synchronized to approximate the real behavior, they have to communicate with each other.
The detailed mechanism behind this synchronization is explained in section \ref{sec:dbiplayer_functionality}.
This communication, however, makes it necessary to containerize the thread players into a single module that can be connected directly to DRAMSys.
With the old DRAMSys interface for trace players this was not easily realizable, so a new generic initiator interface was developed that allows components with an arbitrary internal architecture to be connected to DRAMSys.
This new interface is discussed further in section \ref{sec:traceplayer_interface}.

For the DbiPlayer, an additional interconnect module bundles all \\ \texttt{simple\_initiator\_sockets} into a single \texttt{multi\_passthrough\_initiator\_socket}, as presented in figure \ref{fig:dbiplayer_without_caches}.

\begin{figure}
\begin{center}
\tikzfig{img/without_caching}
\caption{Architecture of the DbiPlayer without caches.}
\label{fig:dbiplayer_without_caches}
\end{center}
\end{figure}

As the memory accesses are extracted directly from the executed instructions, simply sending a transaction to the DRAM subsystem for every data reference would completely neglect the caches of today's processors.
Therefore, a cache model is also required, whose implementation is explained in more detail in section \ref{sec:cache_implementation}.
Modern cache hierarchies consist of three cache levels: two private caches per processor core, the L1 and L2 caches, and one cache that is shared across all cores, the L3 cache.
% (possibly add a literature reference here)
This hierarchy is also reflected in the DbiPlayer, as shown in Figure \ref{fig:dbiplayer_with_caches}.

\begin{landscape}
\begin{figure}
\begin{center}
\tikzfig{img/with_caching}
\caption{Architecture of the DbiPlayer with caches.}
\label{fig:dbiplayer_with_caches}
\end{center}
\end{figure}
\end{landscape}

\subsection{Trace Player Functionality}
\label{sec:dbiplayer_functionality}

With the overall architecture of the initiator introduced, this section explains the internal functionality of the DbiPlayer and its threads.
As mentioned previously, the threads cannot run independently; they require synchronization to ensure that the simulated system replicates the real running application as closely as possible.
The analysis tool inserts timestamps into the memory access traces, which are used to pause the execution of a thread when the global time has not yet advanced that far, or to advance the global time when the thread is allowed to run.
Note that the term global time in this context does not refer to the SystemC simulation time; it denotes a loose time variable that the DbiPlayer uses to schedule its threads.

A set of rules determines whether a thread is allowed to make progress beyond a timestamp that lies further than the current global time:
\begin{enumerate}
\item The main thread at the start of the program is always allowed to run.
\item A thread does not go to sleep when doing so would produce a deadlock, which is the case when it is the only thread currently running.
\item When a previously running thread exits and all other threads are sleeping, they are woken up.
\item As a fallback, when all threads are currently waiting, one thread is woken up.
\end{enumerate}

Those rules ensure that at least one thread is always running and that the simulation does not come to a premature halt.
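The sleep and wake-up rules can be condensed into two small decision functions. The sketch below is a simplified model with hypothetical names; the actual implementation additionally has to interact with the SystemC kernel and the per-thread trace positions.

```cpp
#include <cstdint>

// Simplified scheduler state (hypothetical names, illustrative only).
struct SchedulerState {
    int runningThreads;  // threads currently making progress
    int sleepingThreads; // threads waiting for the global time to advance
};

// Rule 2: a thread may not go to sleep when it is the only one running,
// because that would deadlock the simulation.
inline bool maySleep(const SchedulerState &s) {
    return s.runningThreads > 1;
}

// Rules 3 and 4: when no thread is running anymore, a sleeping thread
// must be woken up so that the simulation cannot halt prematurely.
inline bool mustWakeOne(const SchedulerState &s) {
    return s.runningThreads == 0 && s.sleepingThreads > 0;
}
```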
Each running thread iterates through its trace file and initiates transactions to the specified physical addresses.
The instruction count field is used to approximate the delay between the memory accesses:
the value is multiplied by the trace player clock period, and the next transaction is delayed by the result.
While this does not take the type of the executed instructions into account, it is a simple approximation that can be made without further information.
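As a worked example of this approximation, assuming a hypothetical trace player clock period of 1\,ns, an entry with an instruction count of 4 delays the next transaction by 4\,ns. The function name and the picosecond granularity are chosen for illustration only.

```cpp
#include <cstdint>

// Delay approximation: instruction count times clock period. Picoseconds
// are used here to stay integral; the real player takes the clock period
// from its configuration.
inline uint64_t delayPs(uint64_t instructionCount, uint64_t clockPeriodPs) {
    return instructionCount * clockPeriodPs;
}
```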
\subsection{Non-Blocking Cache}
\label{sec:cache_implementation}

This section gives an overview of the cache model that is used by the new trace player.
It is implemented as a non-blocking cache that, as explained in section \ref{sec:caches_non_blocking_caches}, can accept new requests even while multiple cache misses are being handled.

The cache inherits from the \texttt{sc\_module} base class and has a target socket to accept requests from the processor or a higher-level cache, as well as an initiator socket to send requests to a lower-level cache or to the DRAM subsystem.
It has a configurable size, associativity, cache line size, MSHR buffer depth, write buffer depth, and target depth per MSHR entry.
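From the size, associativity, and cache line size, the set index and tag of an address follow from the usual power-of-two decomposition. The following sketch, with hypothetical names, illustrates this for an exemplary 32\,KiB, 8-way cache with 64-byte lines; the concrete parameter values are assumptions for illustration.

```cpp
#include <cstdint>

// Geometry derived from the configurable parameters.
// numSets = size / (associativity * lineSize); all values are powers of two.
struct CacheGeometry {
    uint64_t lineSize;
    uint64_t numSets;
};

// Set index: the address bits directly above the line offset.
inline uint64_t setIndex(const CacheGeometry &g, uint64_t addr) {
    return (addr / g.lineSize) % g.numSets;
}

// Tag: the remaining upper address bits.
inline uint64_t tag(const CacheGeometry &g, uint64_t addr) {
    return addr / (g.lineSize * g.numSets);
}
```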
To understand how the cache model works, a hypothetical request from the CPU is assumed in the following, and the internal processing of the transaction is explained in detail:

When the transaction arrives, it is placed in the PEQ of the cache, from where the handler for the \texttt{BEGIN\_REQ} phase is called.
The handler verifies that the cache buffers are not full\footnote{Otherwise the cache applies back pressure on the CPU and postpones the handling of the transaction.} and checks whether the requested data is stored in the cache.
If so (i.e. a cache hit), the cache model immediately sends an \texttt{END\_REQ} and, when the target socket is not currently occupied with a response, accesses the cache\footnote{In case of a read transaction, the content of the cache line is copied into the transaction; in case of a write transaction, the cache line is updated with the new value.} and sends the \texttt{BEGIN\_RESP} phase to the processor.
The processor then finalizes the transaction with \texttt{END\_RESP}, the target back pressure of the cache is cleared, and the postponed request from the CPU (if one exists) is placed into the PEQ again.

On the other hand, when the requested data is not in the cache (i.e. a cache miss), it is first checked whether an MSHR entry already exists for the corresponding cache line.
If so\footnote{And if the target list of the MSHR entry is not full; otherwise, the transaction is postponed.}, the transaction is appended to it as an additional target.
If not, a cache line is evicted\footnote{When an eviction is not possible, the transaction is postponed.} to make space for the new cache line that will be fetched from the underlying memory.
When the \texttt{dirty} flag of the old cache line is set, it has to be placed into the write buffer and written back to the memory.
The new cache line is now \textit{allocated}, but not yet \textit{valid}.
Finally, the transaction is put into an MSHR entry and the \texttt{END\_REQ} phase is sent back to the processor.
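The MSHR lookup described above can be summarized as follows. This is a simplified model with hypothetical names; the real cache additionally manages the eviction, the TLM phases, and the timing.

```cpp
#include <cstdint>
#include <map>
#include <vector>

// Simplified MSHR: one entry per missed cache line, each holding a bounded
// list of waiting transactions (targets). Names are illustrative only.
struct Mshr {
    std::map<uint64_t, std::vector<int>> entries; // line address -> target ids
    std::size_t maxEntries;
    std::size_t maxTargets;

    // Returns true when the transaction could be registered; false means
    // the cache has to postpone it (back pressure).
    bool registerMiss(uint64_t lineAddr, int transactionId) {
        auto it = entries.find(lineAddr);
        if (it != entries.end()) {
            if (it->second.size() >= maxTargets)
                return false;                    // target list full -> postpone
            it->second.push_back(transactionId); // merge with the existing miss
            return true;
        }
        if (entries.size() >= maxEntries)
            return false;                        // MSHR full -> postpone
        entries[lineAddr] = {transactionId};     // new miss, triggers a fetch
        return true;
    }
};
```

Merging secondary misses into an existing entry is what makes the cache non-blocking: later requests to the same line do not issue a second fetch.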
To process the entries in the MSHR and in the write buffer, the \texttt{processMshrQueue()} and \texttt{processWriteBuffer()} methods are called at appropriate times.
In the former, a not yet issued MSHR entry is selected, for which a new fetch transaction is generated and sent to the underlying memory.
Note that special care has to be taken when the requested cache line is also present in the write buffer:
to ensure consistency, no new request is sent to the DRAM; instead, the value is snooped out of the write buffer.
In the latter, the processing of the write-back buffer, a not yet issued entry is selected and a new write transaction is sent to the memory.\footnote{Both \texttt{processMshrQueue()} and \texttt{processWriteBuffer()} also have to ensure that no back pressure is currently applied onto the cache from the memory side.}

Incoming transactions from the memory side are accepted with an \texttt{END\_RESP} and, in the case of a fetch transaction, used to update the cache contents and possibly to prepare a new response transaction for the processor, as described before.

This example works analogously with another cache as the requesting module, or with another cache as the target module for fetch or write-back accesses.

Note that the current implementation does not employ a snooping protocol.
Therefore, cache coherency is not guaranteed, and memory shared between multiple processor cores will lead to incorrect results, as the values are not synchronized between the caches.
However, it is expected that this does not drastically affect the simulation results for applications with few shared resources.
The implementation of a snooping protocol is a candidate for future improvements.

\subsection{A New Trace Player Interface}
\label{sec:traceplayer_interface}