\section{Implementation}
\label{sec:implementation}

This section introduces the new components that enable the tracing of an arbitrary application in real time, as well as the replay of the resulting traces in DRAMSys.

First, the DynamoRIO analyzer tool that produces the memory access traces, and its place in the DrCacheSim framework, is explained.
The new trace player for DRAMSys then receives special attention, as does the cache model that is used to model the cache filtering of a real system.
The last part concentrates on the architecture of the new trace player interface and the challenges its internal interconnect solves.

\subsection{Analysis Tool}
\label{sec:analysis_tool}

As described in section \ref{sec:dynamorio}, the dynamic binary instrumentation tool DynamoRIO is used to trace the memory accesses while the target application is running.
Instead of writing a DynamoRIO client from the ground up, the DrCacheSim framework is used.

DrCacheSim is a DynamoRIO client that gathers memory and instruction access traces and forwards them to an analyzer tool.
It is purely observational and does not modify the behavior of the application.

Optionally, DrCacheSim converts the virtual addresses of the memory accesses into physical addresses, which is an important step for simulating a real memory system.
The physical address conversion only works on Linux and, on modern kernel versions, requires root privileges (or alternatively the CAP\_SYS\_ADMIN capability).
The analyzer tool can either run alongside DrCacheSim (online) or operate on an internal trace format (offline).
As of writing this thesis, the offline tracing mode does not yet support the physical address conversion, so the online mode has to be used.

For online tracing, DrCacheSim consists of two separate processes:

\begin{itemize}
\item
A client-side process (the DynamoRIO client) that injects observational instructions into the application's code cache.
For every instruction or memory access, a data packet of the type \texttt{memref\_t} is generated.
\item
An analyzer-side process that connects to the client and processes the \texttt{memref\_t} data packets.
The analyzer side can host multiple analysis tools that operate on this stream of records.
\end{itemize}

The \revabbr{inter-process communication}{IPC} between the two parts is achieved through a \textit{named pipe}.
Figure \ref{fig:drcachesim} illustrates the structure of the individual parts.

\input{img/thesis.tikzstyles}
\begin{figure}[!ht]
\begin{center}
\tikzfig{img/drcachesim}
\caption{Structure of the DrCacheSim online tracing.}
\label{fig:drcachesim}
\end{center}
\end{figure}
A \texttt{memref\_t} can represent an instruction, a data reference, or a metadata event such as a timestamp or a CPU identifier.
Besides the type, the \revabbr{process identifier}{PID} and \revabbr{thread identifier}{TID} of the initiating process and thread are included in every record.
For an instruction marker, the size of the instruction as well as its virtual address in the memory map is provided.
For data references, the address and size of the access are provided, as well as the \revabbr{program counter}{PC} from which it was initiated.
In offline mode, DrCacheSim stores the current mapping of all binary executables and shared libraries in a separate file, so that instructions can still be decoded after the application has exited.
In online tracing, the analyzer instead has to inspect the memory of the client-side process for this purpose.

Analysis tools implement the \texttt{analysis\_tool\_t} interface, which enables the analyzer to forward a received record to multiple tools in a polymorphic manner.
In particular, the \texttt{process\_memref\_t()} method of every tool is called for each incoming record.
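This polymorphic dispatch can be sketched in a few lines. The following is a deliberately reduced, self-contained illustration: the record struct, the interface, and the example counter tool are hypothetical stand-ins for the DrCacheSim types, not the actual framework code.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Reduced stand-in for DrCacheSim's record type (illustrative only).
struct memref_t {
    bool is_write;
    uint64_t addr;
};

// Reduced stand-in for the analysis-tool interface: the analyzer calls
// process_memref() for every incoming record.
class analysis_tool_t {
public:
    virtual ~analysis_tool_t() = default;
    virtual bool process_memref(const memref_t &ref) = 0;
};

// Hypothetical example tool: counts read and write references.
class counter_tool_t : public analysis_tool_t {
public:
    bool process_memref(const memref_t &ref) override {
        (ref.is_write ? writes : reads)++;
        return true;
    }
    int reads = 0, writes = 0;
};

// The analyzer forwards each record to all registered tools polymorphically.
inline void dispatch(const std::vector<memref_t> &records,
                     std::vector<analysis_tool_t *> &tools) {
    for (const auto &r : records)
        for (auto *t : tools)
            t->process_memref(r);
}
```

Because every tool only sees the shared interface, new analyses can be added without touching the analyzer's dispatch loop.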
The newly developed DRAMTracer tool creates a separate trace file for every thread of the application.
As it is not known in advance how many threads an application will spawn, the tool listens for records with TIDs that it has not yet registered.
For every data reference, a new entry is appended to the corresponding trace file, containing the size and the physical address of the access, whether it was a read or a write, and a count of the (computational) instructions that have been executed since the last reference.
This instruction count is used to approximate the delay between the memory accesses when the trace is replayed by DRAMSys.

\begin{listing}
\begin{textcode}
# instruction count,read/write,data size,data address
# <timestamp>
<13295366593324052>
4,r,8,1774ef30
0,r,8,1774ef38
1,w,8,1774ef28
2,w,8,1774ee88
0,r,8,17744728
1,r,8,238c3fb0
\end{textcode}
\caption{Example of a memory access trace with a timestamp.}
\label{list:memtrace}
\end{listing}
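An entry of this trace format can be parsed with standard C++ stream facilities. The following sketch is purely illustrative; the struct and function names are hypothetical and not taken from the actual DbiPlayer sources.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical representation of one trace entry.
struct TraceEntry {
    uint64_t instructionCount; // instructions executed since the last reference
    bool isWrite;              // 'w' = write, 'r' = read
    uint64_t size;             // access size in bytes
    uint64_t address;          // physical address (hexadecimal in the trace)
};

// Parses a line of the form "<icount>,<r|w>,<size>,<hex address>".
inline TraceEntry parseTraceLine(const std::string &line) {
    std::istringstream in(line);
    std::string icount, rw, size, addr;
    std::getline(in, icount, ',');
    std::getline(in, rw, ',');
    std::getline(in, size, ',');
    std::getline(in, addr, ',');
    return TraceEntry{std::stoull(icount), rw == "w", std::stoull(size),
                      std::stoull(addr, nullptr, 16)};
}
```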
As of writing this thesis, the DrCacheSim framework does not define an application binary interface for analysis tools.
Therefore, the DRAMTracer tool cannot be loaded as a shared library; instead, the DynamoRIO source code has to be modified to integrate the tool.

In addition, to be able to decode the instructions during online tracing, a set of patches had to be applied to DynamoRIO.

\subsection{Trace Player Architecture}
\label{sec:dbiplayer_architecture}

This section covers the general architecture of the DbiPlayer, the new trace player for DRAMSys that replays the captured trace files.

For every recorded thread, a so-called DbiThreadPlayer is spawned, which acts as a standalone initiator of transactions.
Because these threads need to be synchronized to approximate the real behavior, they have to communicate with each other.
The detailed mechanism behind this synchronization is explained in section \ref{sec:dbiplayer_functionality}.
This communication, however, makes it necessary to containerize the thread players into a single module that can be connected directly to DRAMSys.
With the old DRAMSys interface for trace players this was not easily realizable, so a new generic initiator interface was developed that allows components with an arbitrary internal architecture to be connected to DRAMSys.
This new interface is discussed further in section \ref{sec:traceplayer_interface}.

For the DbiPlayer, an additional interconnect module bundles all \\ \texttt{simple\_initiator\_sockets} into a single \texttt{multi\_passthrough\_initiator\_socket}, as presented in figure \ref{fig:dbiplayer_without_caches}.

\begin{figure}
\begin{center}
\tikzfig{img/without_caching}
\caption{Architecture of the DbiPlayer without caches.}
\label{fig:dbiplayer_without_caches}
\end{center}
\end{figure}

As the memory accesses are extracted directly from the executed instructions, simply sending a transaction to the DRAM subsystem for every data reference would completely neglect the caches of today's processors.
Therefore, a cache model is also required, whose implementation is explained in more detail in section \ref{sec:cache_implementation}.
Modern cache hierarchies consist of three cache levels: two private caches per processor core, the L1 and L2 caches, and one cache that is shared across all cores, the L3 cache.
% (possibly add a literature reference here)
This hierarchy is also reflected in the DbiPlayer, as shown in Figure \ref{fig:dbiplayer_with_caches}.

\begin{landscape}
\begin{figure}
\begin{center}
\tikzfig{img/with_caching}
\caption{Architecture of the DbiPlayer with caches.}
\label{fig:dbiplayer_with_caches}
\end{center}
\end{figure}
\end{landscape}

\subsection{Trace Player Functionality}
\label{sec:dbiplayer_functionality}

With the overall architecture of the initiator introduced, this section explains the internal functionality of the DbiPlayer and its threads.
As mentioned previously, the threads cannot run independently; they require synchronization to ensure that the simulated system replicates the real running application as closely as possible.
The analysis tool inserts timestamps into the memory access traces, which are used to pause the execution of a thread when the global time has not yet advanced that far, or to advance the global time when the thread is allowed to run.
Note that the term global time in this context does not refer to the SystemC simulation time; it denotes a loose time variable that the DbiPlayer uses to schedule its threads.

A set of rules determines whether a thread is allowed to make progress beyond a timestamp that lies further than the current global time:
\begin{enumerate}
\item The main thread at the start of the program is always allowed to run.
\item A thread does not go to sleep when doing so would produce a deadlock, which is the case when it is the only thread currently running.
\item When a previously running thread exits and all other threads are sleeping, they are woken up.
\item As a fallback, when all threads are currently waiting, one thread is woken up.
\end{enumerate}

Those rules ensure that at least one thread is always running and that the simulation does not come to a premature halt.
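The sleep and wake-up rules can be condensed into two small decision functions. The sketch below is a simplified model with hypothetical names; the actual implementation additionally has to interact with the SystemC kernel and the per-thread trace positions.

```cpp
#include <cstdint>

// Simplified scheduler state (hypothetical names, illustrative only).
struct SchedulerState {
    int runningThreads;  // threads currently making progress
    int sleepingThreads; // threads waiting for the global time to advance
};

// Rule 2: a thread may not go to sleep when it is the only one running,
// because that would deadlock the simulation.
inline bool maySleep(const SchedulerState &s) {
    return s.runningThreads > 1;
}

// Rules 3 and 4: when no thread is running anymore, a sleeping thread
// must be woken up so that the simulation cannot halt prematurely.
inline bool mustWakeOne(const SchedulerState &s) {
    return s.runningThreads == 0 && s.sleepingThreads > 0;
}
```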
Each running thread iterates through its trace file and initiates transactions to the specified physical addresses.
The instruction count field is used to approximate the delay between the memory accesses:
the value is multiplied by the trace player clock period, and the next transaction is delayed by the result.
While this does not take the type of the executed instructions into account, it is a simple approximation that can be made without further information.
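As a worked example of this approximation, assuming a hypothetical trace player clock period of 1\,ns, an entry with an instruction count of 4 delays the next transaction by 4\,ns. The function name and the picosecond granularity are chosen for illustration only.

```cpp
#include <cstdint>

// Delay approximation: instruction count times clock period. Picoseconds
// are used here to stay integral; the real player takes the clock period
// from its configuration.
inline uint64_t delayPs(uint64_t instructionCount, uint64_t clockPeriodPs) {
    return instructionCount * clockPeriodPs;
}
```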
\subsection{Non-Blocking Cache}
\label{sec:cache_implementation}

This section gives an overview of the cache model that is used by the new trace player.
It is implemented as a non-blocking cache that, as explained in section \ref{sec:caches_non_blocking_caches}, can accept new requests even while multiple cache misses are being handled.

The cache inherits from the \texttt{sc\_module} base class and has a target socket to accept requests from the processor or a higher-level cache, as well as an initiator socket to send requests to a lower-level cache or to the DRAM subsystem.
It has a configurable size, associativity, cache line size, MSHR buffer depth, write buffer depth, and target depth per MSHR entry.
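From the size, associativity, and cache line size, the set index and tag of an address follow from the usual power-of-two decomposition. The following sketch, with hypothetical names, illustrates this for an exemplary 32\,KiB, 8-way cache with 64-byte lines; the concrete parameter values are assumptions for illustration.

```cpp
#include <cstdint>

// Geometry derived from the configurable parameters.
// numSets = size / (associativity * lineSize); all values are powers of two.
struct CacheGeometry {
    uint64_t lineSize;
    uint64_t numSets;
};

// Set index: the address bits directly above the line offset.
inline uint64_t setIndex(const CacheGeometry &g, uint64_t addr) {
    return (addr / g.lineSize) % g.numSets;
}

// Tag: the remaining upper address bits.
inline uint64_t tag(const CacheGeometry &g, uint64_t addr) {
    return addr / (g.lineSize * g.numSets);
}
```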
To understand how the cache model works, a hypothetical request from the CPU is assumed in the following, and the internal processing of the transaction is explained in detail:

When the transaction arrives, it is placed in the PEQ of the cache, from where the handler for the \texttt{BEGIN\_REQ} phase is called.
The handler verifies that the cache buffers are not full\footnote{Otherwise the cache applies back pressure on the CPU and postpones the handling of the transaction.} and checks whether the requested data is stored in the cache.
If so (i.e. a cache hit), the cache model immediately sends an \texttt{END\_REQ} and, when the target socket is not currently occupied with a response, accesses the cache\footnote{In case of a read transaction, the content of the cache line is copied into the transaction; in case of a write transaction, the cache line is updated with the new value.} and sends the \texttt{BEGIN\_RESP} phase to the processor.
The processor then finalizes the transaction with \texttt{END\_RESP}, the target back pressure of the cache is cleared, and the postponed request from the CPU (if one exists) is placed into the PEQ again.

On the other hand, when the requested data is not in the cache (i.e. a cache miss), it is first checked whether an MSHR entry already exists for the corresponding cache line.
If so\footnote{And if the target list of the MSHR entry is not full; otherwise, the transaction is postponed.}, the transaction is appended to it as an additional target.
If not, a cache line is evicted\footnote{When an eviction is not possible, the transaction is postponed.} to make space for the new cache line that will be fetched from the underlying memory.
When the \texttt{dirty} flag of the old cache line is set, it has to be placed into the write buffer and written back to the memory.
The new cache line is now \textit{allocated}, but not yet \textit{valid}.
Finally, the transaction is put into an MSHR entry and the \texttt{END\_REQ} phase is sent back to the processor.
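The MSHR lookup described above can be summarized as follows. This is a simplified model with hypothetical names; the real cache additionally manages the eviction, the TLM phases, and the timing.

```cpp
#include <cstdint>
#include <map>
#include <vector>

// Simplified MSHR: one entry per missed cache line, each holding a bounded
// list of waiting transactions (targets). Names are illustrative only.
struct Mshr {
    std::map<uint64_t, std::vector<int>> entries; // line address -> target ids
    std::size_t maxEntries;
    std::size_t maxTargets;

    // Returns true when the transaction could be registered; false means
    // the cache has to postpone it (back pressure).
    bool registerMiss(uint64_t lineAddr, int transactionId) {
        auto it = entries.find(lineAddr);
        if (it != entries.end()) {
            if (it->second.size() >= maxTargets)
                return false;                    // target list full -> postpone
            it->second.push_back(transactionId); // merge with the existing miss
            return true;
        }
        if (entries.size() >= maxEntries)
            return false;                        // MSHR full -> postpone
        entries[lineAddr] = {transactionId};     // new miss, triggers a fetch
        return true;
    }
};
```

Merging secondary misses into an existing entry is what makes the cache non-blocking: later requests to the same line do not issue a second fetch.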
To process the entries in the MSHR and in the write buffer, the \texttt{processMshrQueue()} and \texttt{processWriteBuffer()} methods are called at appropriate times.
In the former, a not yet issued MSHR entry is selected, for which a new fetch transaction is generated and sent to the underlying memory.
Note that special care has to be taken when the requested cache line is also present in the write buffer:
to ensure consistency, no new request is sent to the DRAM; instead, the value is snooped out of the write buffer.
In the latter, the processing of the write-back buffer, a not yet issued entry is selected and a new write transaction is sent to the memory.\footnote{Both \texttt{processMshrQueue()} and \texttt{processWriteBuffer()} also have to ensure that no back pressure is currently applied onto the cache from the memory side.}

Incoming transactions from the memory side are accepted with an \texttt{END\_RESP} and, in the case of a fetch transaction, used to update the cache contents and possibly to prepare a new response transaction for the processor, as described before.

This example works analogously with another cache as the requesting module, or with another cache as the target module for fetch or write-back accesses.

Note that the current implementation does not employ a snooping protocol.
Therefore, cache coherency is not guaranteed, and memory shared between multiple processor cores will lead to incorrect results, as the values are not synchronized between the caches.
However, it is expected that this does not drastically affect the simulation results for applications with few shared resources.
The implementation of a snooping protocol is a candidate for future improvements.

\subsection{A New Trace Player Interface}
\label{sec:traceplayer_interface}