\section{Implementation}
\label{sec:implementation}

This section introduces the newly developed components that enable tracing an arbitrary application at runtime and replaying the resulting traces in DRAMSys.

First, the DynamoRIO analyzer tool that produces the memory access traces and its place in the DrCacheSim framework are explained. After that, the new trace player for DRAMSys receives special focus, as does the mandatory cache model that reproduces the cache filtering of a real system. The last part concentrates on the architecture of the new trace player interface and the challenges that the internal interconnect solves.

\subsection{Analysis Tool}
\label{sec:analysis_tool}

As described in section \ref{sec:dynamorio}, the dynamic binary instrumentation tool DynamoRIO is used to trace the memory accesses while the target application is running. Instead of writing a DynamoRIO client from the ground up, the DrCacheSim framework is used.

DrCacheSim is a DynamoRIO client that gathers memory and instruction access traces and forwards them to an analyzer tool. It is purely observational and does not modify the behavior of the application.

Optionally, DrCacheSim converts the virtual addresses of the memory accesses into physical addresses, which is an important step for simulating a real memory system: since the virtual address space is unique for every running process and has to be translated into physical addresses by the operating system kernel to access the real memory, the physical addresses should be traced instead of the virtual ones.

Note that the physical addresses do not directly correspond to the internal addresses of the memory subsystem: the physical memory is mapped at a specific address in the physical address space, so this address offset has to be considered. On Linux systems, this mapping can be obtained by reading the contents of the virtual file \texttt{/proc/iomem}. The trace player then needs to subtract this offset, as explained in section \ref{sec:dbiplayer_functionality}.

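To make the offset lookup concrete, the following sketch extracts the start address of the first \texttt{System RAM} region from \texttt{/proc/iomem}-style content. It is only an illustration of the parsing idea, not code from the actual trace player; the function name and the convention of returning 0 when no region is found are assumptions.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

// Extract the start address of the first "System RAM" region from
// /proc/iomem-style content. Each line has the form
//   "00001000-0009ffff : System RAM".
// Returns 0 when no RAM region is found (illustrative convention).
inline std::uint64_t findRamOffset(const std::string& iomemContent) {
    std::istringstream in(iomemContent);
    std::string line;
    while (std::getline(in, line)) {
        if (line.find("System RAM") == std::string::npos)
            continue;
        // The region start is the hexadecimal number before the first '-'.
        const auto dash = line.find('-');
        if (dash == std::string::npos)
            continue;
        return std::stoull(line.substr(0, dash), nullptr, 16);
    }
    return 0;
}
```

In the real player this offset is read once at startup and subtracted from every traced physical address.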
The physical address conversion only works on Linux and, in modern kernel versions, requires root privileges (or alternatively the \texttt{CAP\_SYS\_ADMIN} capability).

The analyzer tool can either run alongside DrCacheSim (online) or operate on an internal trace format (offline). Offline tracing has the additional advantage that the executed instructions can be disassembled afterwards: for this, the mapping of the executable binaries and shared libraries is stored alongside the trace, enabling the decoding of the instructions from the traced program counter values. As of writing this thesis, the offline tracing mode has only recently gained support for the physical address conversion; since this support is still limited, the online mode is used throughout this thesis.

In the case of online tracing, DrCacheSim consists of two separate processes:
\begin{itemize}
\item A client-side process (the DynamoRIO client), which injects observational instructions into the application's code cache. For every instruction or memory access, a data packet of the type \texttt{memref\_t} is generated.
\item An analyzer-side process, which connects to the client and processes the \texttt{memref\_t} data packets. The analyzer side can contain many analysis tools that operate on this stream of records.
\end{itemize}

The \revabbr{inter-process communication}{IPC} between the two parts is achieved through a \textit{named pipe}. Figure \ref{fig:drcachesim} illustrates the structure of the individual parts.

\input{img/thesis.tikzstyles}
\begin{figure}[!ht]
\begin{center}
\tikzfig{img/drcachesim}
\caption{Structure of the DrCacheSim online tracing.}
\label{fig:drcachesim}
\end{center}
\end{figure}

A \texttt{memref\_t} can represent an instruction, a data reference, or a metadata event such as a timestamp or a CPU identifier. Besides the type, the \revabbr{process identifier}{PID} and \revabbr{thread identifier}{TID} of the initiating process and thread are included in every record. For an instruction marker, the size of the instruction as well as its virtual address in the memory map is provided. For data references, the address and size of the desired access are provided, as well as the \revabbr{program counter}{PC} from which the access was initiated. In offline mode, DrCacheSim stores the current mapping of all binary executables and shared libraries in a separate file, so that the instructions can be decoded even after the application has exited. In the case of online tracing, the analyzer has to inspect the memory of the client-side process for this purpose.

Analysis tools implement the \texttt{analysis\_tool\_t} interface, which enables the analyzer to forward a received record to multiple tools in a polymorphic manner. In particular, the \texttt{process\_memref\_t()} method of every tool is called for each incoming record.

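The dispatch pattern behind this interface can be sketched in plain C++. The types below are simplified stand-ins, not the actual DrCacheSim declarations: a common virtual interface lets the analyzer offer every record to every registered tool without knowing their concrete types.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct MemRef {  // simplified stand-in for memref_t
    bool isWrite;
    std::uint64_t addr;
};

class AnalysisTool {  // simplified stand-in for analysis_tool_t
public:
    virtual ~AnalysisTool() = default;
    virtual bool processMemref(const MemRef& ref) = 0;
};

// Example tool: counts the write accesses it sees.
class WriteCounter : public AnalysisTool {
public:
    bool processMemref(const MemRef& ref) override {
        if (ref.isWrite)
            ++writes;
        return true;
    }
    int writes = 0;
};

// The analyzer loop: every incoming record is forwarded to all tools.
inline void dispatch(const std::vector<MemRef>& records,
                     const std::vector<AnalysisTool*>& tools) {
    for (const auto& ref : records)
        for (auto* tool : tools)
            tool->processMemref(ref);
}
```

The DRAMTracer tool described next plugs into exactly this kind of record stream.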
The newly developed DRAMTracer tool creates a separate trace file for every application thread. As it is not known in advance how many threads an application will spawn, the tool listens for records with TIDs that it has not registered yet. For every data reference, a new entry is appended to the corresponding trace file, containing the size and the physical address of the access, whether it was a read or a write, and the number of (computational) instructions that have been executed since the last reference. This instruction count is used to approximate the delay between the memory accesses when the trace is replayed by DRAMSys.

\begin{listing}
\begin{textcode}
# instruction count,read/write,data size,data address
# <timestamp>
<13300116157764414>
3,r,8,1190cf3f0
9,w,16,1190cf270
2,r,8,10200be48
0,w,16,1190cf280
1,w,16,1190cf290
2,w,16,1190cf2a0
1,w,16,1190cf2b0
0,w,16,1190cf2c0
\end{textcode}
\caption[Example of a memory access trace with a timestamp.]{Example of a memory access trace with a timestamp. For each thread, a separate trace file is generated.}
\label{list:memtrace}
\end{listing}

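A single entry of this format can be parsed with a few string operations. The following sketch shows the idea; the struct and function names are illustrative and not taken from the DbiPlayer sources.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

// One trace entry: "<instruction count>,<r|w>,<size>,<hex address>".
struct TraceEntry {
    std::uint64_t instrCount;
    bool isWrite;
    std::uint64_t size;
    std::uint64_t address;  // physical address, hexadecimal in the file
};

inline TraceEntry parseTraceLine(const std::string& line) {
    TraceEntry e{};
    std::istringstream in(line);
    std::string field;
    std::getline(in, field, ',');
    e.instrCount = std::stoull(field);       // decimal
    std::getline(in, field, ',');
    e.isWrite = (field == "w");              // "r" or "w"
    std::getline(in, field, ',');
    e.size = std::stoull(field);             // decimal
    std::getline(in, field, ',');
    e.address = std::stoull(field, nullptr, 16);  // hexadecimal
    return e;
}
```
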
As of writing this thesis, no application binary interface for analysis tools is defined in the DrCacheSim framework. It is therefore not possible to load the DRAMTracer tool as a shared library; instead, the DynamoRIO source code has to be modified to integrate the tool.

Additionally, a set of patches had to be applied to DynamoRIO to be able to decode the instructions during online tracing.

\subsection{Trace Player Architecture}
\label{sec:dbiplayer_architecture}

This section covers the general architecture of the \textit{DbiPlayer}, the new trace player for DRAMSys that replays the captured trace files.

For every recorded thread, a so-called \textit{DbiThreadPlayer} is spawned, which is a standalone initiator of transactions. Because these threads need to be synchronized to approximate the real behavior, they have to communicate with each other; the detailed mechanism behind this synchronization is explained in section \ref{sec:dbiplayer_functionality}. This communication, however, makes it necessary to containerize the thread players into a single module that can be connected directly to DRAMSys. With the old DRAMSys interface for trace players this was not easily realizable, so a new generic initiator interface was developed that makes it possible to connect components with an arbitrary internal architecture to DRAMSys. This new interface is discussed further in section \ref{sec:traceplayer_interface}.

For the \textit{DbiPlayer}, an additional interconnect module bundles all \texttt{simple\_initiator\_socket}s into a single \texttt{multi\_passthrough\_initiator\_socket}, as presented in figure \ref{fig:dbiplayer_without_caches}.

\begin{figure}
\begin{center}
\tikzfig{img/without_caching}
\caption{Architecture of the \textit{DbiPlayer} without caches.}
\label{fig:dbiplayer_without_caches}
\end{center}
\end{figure}

As the memory accesses are extracted directly from the executed instructions, simply sending a transaction to the DRAM subsystem for every data reference would neglect the caches of today's processors completely. Therefore, a cache model is also required; its implementation is explained in more detail in section \ref{sec:cache_implementation}. Many modern cache hierarchies are composed of three cache levels: two private caches per processor core, the L1 and the L2 cache, and one cache that is shared across all cores, the L3 cache. This hierarchy is also reflected in the \textit{DbiPlayer}, as shown in figure \ref{fig:dbiplayer_with_caches}.

\begin{landscape}
\begin{figure}
\begin{center}
\tikzfig{img/with_caching}
\caption{Architecture of the \textit{DbiPlayer} with caches.}
\label{fig:dbiplayer_with_caches}
\end{center}
\end{figure}
\end{landscape}

\subsection{Trace Player Functionality}
\label{sec:dbiplayer_functionality}

With the overall architecture of the initiator introduced, this section explains the internal functionality of the \textit{DbiPlayer} and its threads.

The threads of the \textit{DbiPlayer} are specialized initiator modules that inherit from the more generic \texttt{TrafficInitiatorThread} class. Each \texttt{TrafficInitiatorThread} contains a \texttt{sendNextPayloadThread()} \texttt{SC\_THREAD} that in turn calls the virtual method \texttt{sendNextPayload()}, implemented in the \texttt{DbiThreadPlayer}, each time the \texttt{sc\_event\_queue} \texttt{sendNextPayloadEvent} is notified.

Each \texttt{DbiThreadPlayer} iterates through its trace file and stores the entries in an internal buffer. In \texttt{sendNextPayload()}, a new generic payload object is then created from the next entry of this buffer. The address of the payload is calculated from the physical address stored in the trace file entry: as previously discussed, the trace player has to account for the offset at which the RAM is placed in the physical memory map and subtract this offset from the physical address. The instruction count field of the trace is used to approximate the delay between two consecutive memory accesses: the count is multiplied by the trace player clock period and a constant, and the initiation of the next transaction is deferred by the resulting value. While this does not take the types of the executed instructions into account, it is a simple approximation that can be made.

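The two per-entry computations can be sketched as follows. The function names, the picosecond unit, and the scaling constant are illustrative assumptions, not the actual DbiPlayer code.

```cpp
#include <cassert>
#include <cstdint>

// Subtract the offset at which the RAM is mapped in the physical
// address space to obtain the DRAM-internal address.
inline std::uint64_t toDramAddress(std::uint64_t physicalAddr,
                                   std::uint64_t ramOffset) {
    return physicalAddr - ramOffset;
}

// Approximate the delay before the next transaction: instruction
// count times the player clock period, scaled by a constant factor.
inline std::uint64_t delayPs(std::uint64_t instrCount,
                             std::uint64_t clockPeriodPs,
                             std::uint64_t scale) {
    return instrCount * clockPeriodPs * scale;
}
```
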
As mentioned previously, the threads cannot run entirely on their own; they require synchronization to ensure that the simulated system replicates the real application as closely as possible. The analysis tool appends timestamps to the memory access traces, which are used either to pause the execution of a thread when the global time has not yet advanced that far, or to advance the global time when the thread is allowed to run. Note that the term global time in this context does not correspond to the SystemC simulation time but denotes a loose time variable that the \textit{DbiPlayer} uses to schedule its threads.

A set of rules determines whether a thread is allowed to make progress beyond a timestamp that lies further than the current global time:
\begin{enumerate}
\item The main thread at the start of the program is always allowed to run.
\item Threads do not go to sleep when this would produce a deadlock, which is the case when they are the only thread currently running.
\item When a previously running thread exits and all other threads are sleeping, they are woken up.
\item As a fallback, when all threads are currently waiting, one thread is woken up.
\end{enumerate}

These rules reconstruct the thread interleaving of the application as it ran while being traced. The two latter rules ensure that at least one thread is always running, so that the simulation does not come to a premature halt.

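The first two rules amount to a small sleep/wake decision that can be sketched in isolation. The struct and field names are illustrative assumptions; rules 3 and 4 are event-driven and therefore omitted here.

```cpp
#include <cassert>

// Minimal state needed for the sleep decision (illustrative names).
struct PlayerState {
    int runningThreads;   // threads currently allowed to run
    bool isMainAtStart;   // rule 1: main thread at program start
};

// A thread that reaches a timestamp beyond the global time normally
// sleeps; this returns true when it must stay awake instead.
inline bool mustStayAwake(const PlayerState& s) {
    if (s.isMainAtStart)
        return true;               // rule 1: main thread always runs
    return s.runningThreads <= 1;  // rule 2: sole runner never sleeps
}
```
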
\subsection{Non-Blocking Cache}
\label{sec:cache_implementation}

This section gives an overview of the cache model used by the new trace player. It is implemented as a non-blocking cache that, as explained in section \ref{sec:caches_non_blocking_caches}, can accept new requests even while multiple cache misses are being handled.

The cache inherits from the \texttt{sc\_module} base class and consists of a target socket, to accept requests from the processor or a higher-level cache, as well as an initiator socket, to send requests to a lower-level cache or to the DRAM subsystem. It has a configurable size, associativity, cache line size, MSHR buffer depth, write buffer depth, and target depth per MSHR entry.

To understand how the cache model works, a hypothetical request from the CPU is assumed in the following, and the internal processing of the transaction is explained in detail:

When the transaction arrives, it is placed in the PEQ of the cache, from where, after the specified delay, the handler for the \texttt{BEGIN\_REQ} phase is called. The handler verifies that the cache buffers are not full\footnote{Otherwise the cache applies backpressure on the CPU and postpones the handling of the transaction.} and checks whether the requested data is stored in the cache. If this is the case (i.e., a cache hit), the cache model immediately sends an \texttt{END\_REQ} and, when the target socket is not currently occupied with a response, accesses the cache and sends the \texttt{BEGIN\_RESP} phase to the processor. During a cache access, the content of the cache line is copied into the transaction in the case of a read request, or the cache line is updated with the new value in the case of a write request. Furthermore, in both cases the timestamp of the last access is updated to the current simulation time. The processor then finalizes the transaction with the \texttt{END\_RESP} phase, the target backpressure of the cache is cleared, and the postponed request from the CPU (if it exists) is placed into the PEQ once again.

If, on the other hand, the requested data is not in the cache (i.e., a cache miss), it is first checked whether an MSHR entry already exists for the corresponding cache line. If this is the case\footnote{And if the target list of the MSHR entry is not full. Otherwise the transaction is postponed.}, the transaction is appended to it as an additional target. If not, a cache line is evicted to make space for the new cache line that will be fetched from the underlying memory. The cache model implements the LRU replacement policy, so the cache line with the oldest last access time is chosen for eviction. When an eviction is not possible, the transaction is postponed. An eviction is not possible when the selected cache line is allocated but not yet filled with the requested data from the underlying cache, when the cache line is currently present in the MSHR queue, or when a hit for this cache line is yet to be handled. When the \texttt{dirty} flag of the old cache line is set, it has to be placed into the write buffer and written back to memory. The newly evicted cache line is now \textit{allocated}, but not \textit{valid}. Then, the transaction is put into an MSHR entry and the \texttt{END\_REQ} phase is sent back to the processor.

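The victim selection described above, including the cases in which eviction is not possible, can be sketched as follows. The field and function names are illustrative; the single \texttt{pinned} flag stands in for "present in the MSHR queue or referenced by a pending hit".

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One cache line within a set (illustrative field names).
struct CacheLine {
    bool allocated;
    bool valid;
    bool pinned;  // stands in for "in MSHR queue or pending hit"
    std::uint64_t lastAccess;
};

// Returns the index of the evictable line with the oldest access time,
// or -1 when no line in the set may currently be evicted (in which
// case the transaction is postponed).
inline int selectLruVictim(const std::vector<CacheLine>& set) {
    int victim = -1;
    for (int i = 0; i < static_cast<int>(set.size()); ++i) {
        const CacheLine& l = set[i];
        if (l.pinned || (l.allocated && !l.valid))
            continue;  // eviction not possible for this line
        if (victim < 0 || l.lastAccess < set[victim].lastAccess)
            victim = i;
    }
    return victim;
}
```
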
To process the entries in the MSHR and in the write buffer, the \texttt{processMshrQueue()} and \texttt{processWriteBuffer()} methods are called at appropriate times. In the former, a not yet issued MSHR entry is selected, for which a new fetch transaction is generated and sent to the underlying memory. Note that special care has to be taken when the requested cache line is also present in the write buffer: to ensure consistency, no new request is sent to the DRAM; instead, the value is snooped out of the write buffer. In the latter, the processing of the write-back buffer, a not yet issued entry is selected and a new write transaction is sent to the memory.\footnote{Both \texttt{processMshrQueue()} and \texttt{processWriteBuffer()} also need to ensure that no backpressure is currently applied onto the cache from the memory side.}

Incoming transactions from the memory side are accepted with an \texttt{END\_RESP} and, in the case of a fetch transaction, used to update the cache contents and possibly to prepare a new response transaction for the processor, as described before.

This example works analogously with another cache as the requesting module, or with another cache as the target module for fetch or write-back accesses.

The rough internal structure of the cache model is shown in figure \ref{fig:cache}.

\begin{figure}
\begin{center}
\tikzfig{img/cache}
\caption[Internal architecture of the cache model.]{Internal architecture of the cache model. \textit{V} stands for \textit{valid}, \textit{D} for \textit{dirty}, \textit{A} for \textit{allocated}, \textit{T} for \textit{tag}, \textit{AT} for \textit{access time}, \textit{I} for \textit{issued} and \textit{Idx} for \textit{index}. In the cache line array, adjacent lines with the same addressing index are colored in the same gray shade. The size of such a group is the \textit{associativity}.}
\label{fig:cache}
\end{center}
\end{figure}

Note that the current implementation does not employ a snooping protocol. Cache coherency is therefore not guaranteed, and memory shared between multiple processor cores will lead to incorrect results, as the values are not synchronized between the caches. However, this is not expected to drastically affect the simulation results for applications with few shared resources. The implementation of a snooping protocol is a candidate for future improvements.

\subsection{Trace Player Interface}
\label{sec:traceplayer_interface}

Previously, initiators connected to DRAMSys could only represent a single thread. This, however, conflicted with the goal of developing a trace player module that internally consists of multiple threads that communicate with each other and initiate transactions to DRAMSys independently.

To be able to couple such hierarchical initiator modules with DRAMSys, a new trace player interface was developed: the \texttt{TrafficInitiatorIF} is a generic interface that every module connecting to DRAMSys needs to implement. It requires the \texttt{bindTargetSocket()} method to be implemented, so that top-level initiators can be coupled regardless of the initiator socket type used (e.g. \texttt{simple\_initiator\_socket} or \texttt{multi\_passthrough\_initiator\_socket}).

When coupling a \texttt{multi\_passthrough\_initiator\_socket} to a \texttt{multi\_passthrough\_target\_socket}, the SystemC \texttt{bind()} method has to be called multiple times, once for each thread. Because of this, a wrapper module also has to override the \texttt{getNumberOfThreads()} method of the new interface, and this number is used in \texttt{bindTargetSocket()} to bind the target socket the correct number of times.

This makes it possible to treat all initiator modules, whether they are simple threads or more complex wrapper modules, polymorphically through this interface and to connect them to DRAMSys with the provided bind method, abstracting away the concrete type of initiator socket used.

With the new trace player interface, a top-level initiator can thus either be a single thread, as in previous versions, or a more complex hierarchical module with many internal threads.

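The shape of this interface can be sketched in plain C++ without the SystemC/TLM socket machinery. The method names follow the text; the socket stand-in and the class bodies are illustrative assumptions.

```cpp
#include <cassert>

struct TargetSocket {  // stand-in for a DRAMSys target socket
    int boundCount = 0;
};

// Sketch of the generic initiator interface: the top level only sees
// these two methods and stays agnostic of the concrete socket type.
class TrafficInitiatorIF {
public:
    virtual ~TrafficInitiatorIF() = default;
    virtual unsigned getNumberOfThreads() const = 0;
    virtual void bindTargetSocket(TargetSocket& target) = 0;
};

// A wrapper with many internal threads binds once per thread, as a
// multi-passthrough socket would require.
class MultiThreadPlayer : public TrafficInitiatorIF {
public:
    explicit MultiThreadPlayer(unsigned threads) : threads_(threads) {}
    unsigned getNumberOfThreads() const override { return threads_; }
    void bindTargetSocket(TargetSocket& target) override {
        for (unsigned i = 0; i < getNumberOfThreads(); ++i)
            ++target.boundCount;  // stands in for socket.bind(target)
    }
private:
    unsigned threads_;
};
```

A single-threaded initiator would simply return 1 from \texttt{getNumberOfThreads()} and bind once.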
\subsection{Interconnect}
\label{sec:interconnect}

As already seen in figure \ref{fig:dbiplayer_with_caches}, interconnect modules are needed to connect the caches with each other. While the implementation of the \textit{MultiCoupler} component is trivial, as it only passes the transactions from its \texttt{multi\_passthrough\_target\_socket} to its \texttt{multi\_passthrough\_initiator\_socket}, the \textit{MultiSimpleCoupler} is more complex because it has to buffer transactions internally.

To understand why this buffering is needed, consider a scenario in which the L3 cache applies backpressure to one L2 cache. Due to the exclusion rule, this L2 cache is not allowed to send further requests. But since the target socket of the L3 cache is occupied, the same restriction applies to all other L2 caches. This information, however, is not propagated to the other caches, which would lead to incorrect behavior if not addressed, as the other caches would continue to send requests.

To solve this problem, the MultiSimpleCoupler only forwards requests to the L3 cache when it is able to accept them. If this is not the case, the request is buffered internally and forwarded when an earlier request is completed with the \texttt{END\_REQ} phase.

To illustrate this further, a simple example can be assumed: one L2 cache needs to request a cache line from the underlying L3 cache. The MultiSimpleCoupler receives the \texttt{BEGIN\_REQ} phase and places the transaction into its PEQ. From there, a hash table used as an internal routing table is updated so that the response can later be sent back through the correct multi-socket binding. As the L3 cache is currently not applying backpressure onto the interconnect, the transaction can be forwarded to the L3 cache with the \texttt{BEGIN\_REQ} phase. Until the L3 cache responds with the \texttt{END\_REQ} phase, the interconnect defers any new request from any L2 cache and buffers the payload objects in an internal data structure. When the \texttt{END\_REQ} phase is received, the next transaction from this request buffer is sent to the L3 cache. After some time, the L3 cache will respond with the requested cache lines. During this \texttt{BEGIN\_RESP} phase, the L2 cache that requested the line is looked up in the routing table and the payload is sent back to it. Until the L2 cache responds with an \texttt{END\_RESP}, the exclusion rule has to be honored here as well: when a new response from the L3 cache is received, it has to be buffered in another internal data structure until the corresponding target socket binding is clear again. Once the L2 cache sends out the \texttt{END\_RESP} phase, the interconnect forwards the \texttt{END\_RESP} to the L3 cache and initiates new response transactions in case the response buffer is not empty.

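The request-side bookkeeping of this walkthrough can be sketched as follows: a routing table remembers which L2 binding issued each payload, and a request buffer holds transactions while the L3 side is occupied. All names are illustrative, and the symmetric response-side buffering is omitted for brevity.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <unordered_map>

struct Coupler {
    std::unordered_map<std::uint64_t, int> routing;  // payload id -> L2 binding
    std::deque<std::uint64_t> requestBuffer;         // deferred payload ids
    bool l3Busy = false;

    // BEGIN_REQ from an L2 cache: record the origin, then forward or
    // buffer. Returns true when the request was forwarded immediately.
    bool beginReq(std::uint64_t payloadId, int l2Binding) {
        routing[payloadId] = l2Binding;
        if (l3Busy) {
            requestBuffer.push_back(payloadId);
            return false;
        }
        l3Busy = true;  // exclusion rule: one outstanding request
        return true;
    }

    // END_REQ from the L3 cache: forward the next buffered request,
    // or mark the L3 side as free again.
    void endReq() {
        if (requestBuffer.empty())
            l3Busy = false;
        else
            requestBuffer.pop_front();  // this request now occupies L3
    }

    // BEGIN_RESP: look up which L2 binding receives the response.
    int respTarget(std::uint64_t payloadId) const {
        return routing.at(payloadId);
    }
};
```
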
In conclusion, this special interconnect module with a multi-target socket and a simple initiator socket ensures that the exclusion rule is respected in both directions.