\section{Implementation}
In this section, the newly developed components that enable tracing an arbitrary application in real time, as well as replaying those traces in DRAMSys, are introduced.
First, the DynamoRIO analysis tool that produces the memory access traces, and its place in the DrCacheSim framework, is explained.
Furthermore, the trace player for DRAMSys receives special focus, as does the cache model that is required to reproduce the cache filtering of a real system.
The last part concentrates on the architecture of the new trace player and the challenges that its internal interconnect solves.
\subsection{Analysis Tool}
\label{sec:analysis_tool}
As described in section \ref{sec:dynamorio}, the dynamic binary instrumentation tool DynamoRIO will be used to trace the memory accesses while the target application is running.
Instead of writing a DynamoRIO client from the ground up, the DrCacheSim framework is used.
DrCacheSim is a DynamoRIO client that gathers memory and instruction access traces and forwards them to an analyzer tool.
It is a purely observational client, as it does not modify the behavior of the application.
Optionally, DrCacheSim converts the addresses of the memory accesses from virtual addresses into physical addresses, which is an important step for simulating a real memory system.
The physical address conversion only works on Linux and requires root privileges (or alternatively the CAP\_SYS\_ADMIN capability) for modern kernel versions.
The analyzer tool can either run alongside DrCacheSim (online) or operate on an internal trace format (offline).
As of writing this thesis, the offline tracing mode does not yet support the physical address conversion, so the online mode has to be used.
In the case of online tracing, DrCacheSim consists of two separate processes:
\begin{itemize}
\item
A client side, which injects observational instructions into the application's code cache.
For every instruction or memory access, a data packet of the type \texttt{memref\_t} is generated.
\item
An analyzer side, which connects to the client and processes the \texttt{memref\_t} data packets.
It can contain multiple analysis tools that operate on this stream of records.
\end{itemize}
The \revabbr{inter-process communication}{IPC} between the two parts is achieved through a \textit{named\ pipe}.
Figure \ref{fig:drcachesim} illustrates the structure of the individual parts.
\input{img/thesis.tikzstyles}
\begin{figure}
\begin{center}
\tikzfig{img/drcachesim}
\caption{Structure of the DrCacheSim online tracing.}
\label{fig:drcachesim}
\end{center}
\end{figure}
A \texttt{memref\_t} can either represent an instruction, a data reference or a metadata event such as a timestamp or a CPU identifier.
Besides the type, the \revabbr{process identifier}{PID} and the \revabbr{thread identifier}{TID} are included in every record so that records can be associated with their origin.
For an instruction marker, the size of the instruction as well as the virtual address of the instruction in the memory map is provided.
DrCacheSim stores the current mapping of all binary executables and shared libraries in a separate file, so that it is possible to decode the recorded instructions even after the application has exited.
For data references, the address and size of the desired access are provided, as well as the \revabbr{program counter}{PC} from which the access was initiated.
Analysis tools implement the \texttt{analysis\_tool\_t} interface, which enables the analyzer to forward a received record to multiple tools in a polymorphic manner.
In particular, the \texttt{process\_memref\_t()} method of a tool is called for every incoming record.
The newly developed DRAMTracer tool creates a separate trace file for every thread of the application.
As it is not known in advance how many threads an application will spawn, the tool listens for records with TIDs that it has not registered yet.
For every data reference, a new entry in the corresponding trace file is made which contains the size and the address of the access, whether it was a read or write, and also a count of (computational) instructions that have been executed since the last reference.
This instruction count is used to approximate the delay between the memory accesses when the trace is replayed by DRAMSys, as described in section \ref{sec:dbiplayer_functionality}.
\begin{listing}
\begin{textcode}
# instruction count,read/write,data size,data address
# <timestamp>
<13295366593324052>
4,r,8,1774ef30
0,r,8,1774ef38
1,w,8,1774ef28
2,w,8,1774ee88
0,r,8,17744728
1,r,8,238c3fb0
\end{textcode}
\caption{Example of a memory access trace with a timestamp.}
\label{list:memtrace}
\end{listing}
As of writing this thesis, no application binary interface for analysis tools is defined in the DrCacheSim framework.
Therefore, the DRAMTracer tool cannot be loaded as a shared library; instead, the DynamoRIO source code has to be modified to integrate the tool.
\subsection{DbiPlayer Architecture}
\label{sec:dbiplayer_architecture}
This section covers the general architecture of the DbiPlayer, the new trace player for DRAMSys that replays the captured trace files.
For every recorded thread, a so-called DbiThreadPlayer is spawned, which acts as a standalone initiator of transactions.
Because those threads need to be synchronized to approximate the real behavior, they have to communicate with each other.
The detailed mechanism behind this synchronization will be further explained in section \ref{sec:dbiplayer_functionality}.
This communication, however, makes it necessary to encapsulate the thread players in a single module that can be connected directly to DRAMSys.
To achieve this, a new generic initiator interface was developed that makes it possible to connect components to DRAMSys whose internal architecture can be arbitrary.
In the case of the DbiPlayer, an additional interconnect module will bundle up all \texttt{simple\_initiator\_sockets} to a single \texttt{multi\_passthrough\_initiator\_socket} as presented in Figure \ref{fig:dbiplayer_without_caches}.
\begin{figure}
\begin{center}
\tikzfig{img/without_caching}
\caption{Architecture of the DbiPlayer without caches.}
\label{fig:dbiplayer_without_caches}
\end{center}
\end{figure}
As the memory accesses are extracted directly from the executed instructions, simply sending a transaction to the DRAM subsystem for every data reference would completely neglect the caches of today's processors.
Therefore, a cache model is required as well; its implementation is explained in more detail in section \ref{sec:cache}.
Modern cache hierarchies consist of three cache levels: two private caches per processor core, the L1 and L2 caches, and one cache shared across all cores, the L3 cache.
This hierarchy is also reflected in the DbiPlayer as shown in Figure \ref{fig:dbiplayer_with_caches}.
\begin{landscape}
\begin{figure}
\begin{center}
\tikzfig{img/with_caching}
\caption{Architecture of the DbiPlayer with caches.}
\label{fig:dbiplayer_with_caches}
\end{center}
\end{figure}
\end{landscape}
\subsection{DbiPlayer Functionality}
\label{sec:dbiplayer_functionality}
With the overall architecture of the initiator introduced, this section explains the internal functionality of the DbiPlayer and its threads.
As mentioned previously, the threads cannot run on their own; rather, they require synchronization to ensure that the simulated system replicates the real running application as closely as possible.
The analysis tool appends timestamps to the memory access traces, which are used to pause the execution of a thread when the global time has not yet advanced that far, or to advance the global time when the thread is allowed to run.
Note that the term global time in this context does not correspond to the SystemC simulation time but denotes a loose time variable that the DbiPlayer uses to schedule its threads.
A set of rules determines whether a thread is allowed to make progress beyond a timestamp that lies ahead of the current global time:
\begin{enumerate}
\item The main thread at the start of the program is always allowed to run.
\item Threads do not go to sleep when doing so would produce a deadlock, which is the case when they are the only thread currently running.
\item When a previously running thread exits and all other threads are sleeping, one of them will be woken up.
\item As a fallback, when all threads are currently waiting, one thread will be woken up.
\end{enumerate}
Those rules ensure that at least one thread is always running and the simulation does not come to a premature halt.