Cache coherency

2022-05-27 20:12:28 +02:00
parent dc21e1191b
commit 207e1c8c1c
5 changed files with 85 additions and 22 deletions


@@ -60,18 +60,42 @@ encoding=UTF-8
highlight=LaTeX
mode=LaTeX
[item:inc/3.systemc.tex]
archive=true
encoding=UTF-8
highlight=LaTeX
mode=LaTeX
[item:inc/4.caches.tex]
archive=true
encoding=UTF-8
highlight=LaTeX
mode=LaTeX
[item:inc/5.dramsys.tex]
archive=true
encoding=UTF-8
highlight=LaTeX
mode=LaTeX
[item:inc/6.implementation.tex]
archive=true
encoding=UTF-8
highlight=LaTeX
mode=LaTeX
[item:inc/7.simulation_results.tex]
archive=true
encoding=
highlight=
mode=
[item:inc/8.future_work.tex]
archive=true
encoding=
highlight=
mode=
[item:inc/appendix.tex]
archive=true
encoding=UTF-8


@@ -100,7 +100,7 @@
title = {The Gem5 Simulator},
year = {2011},
issn = {0163-5964},
month = aug,
number = {2},
volume = {39},
abstract = {The gem5 simulation infrastructure is the merger of the best aspects of the M5 [4] and GEMS [9] simulators. M5 provides a highly configurable simulation framework, multiple ISAs, and diverse CPU models. GEMS complements these features with a detailed and flexible memory system, including support for multiple cache coherence protocols and interconnect models. Currently, gem5 supports most commercial ISAs (ARM, ALPHA, MIPS, Power, SPARC, and x86), including booting Linux on three of them (ARM, ALPHA, and x86). The project is the result of the combined efforts of many academic and industrial institutions, including AMD, ARM, HP, MIPS, Princeton, MIT, and the Universities of Michigan, Texas, and Wisconsin. Over the past ten years, M5 and GEMS have been used in hundreds of publications and have been downloaded tens of thousands of times. The high level of collaboration on the gem5 project, combined with the previous success of the component parts and a liberal BSD-like license, make gem5 a valuable full-system simulation tool.},


@@ -6,14 +6,15 @@ A special focus will also be placed on non-blocking caches.
The theory will be based on the chapters \textit{An Overview of Cache Principles} and \textit{Logical Organization} of \cite{Jacob2008} and on \cite{Jahre2007}.
With the advancement of faster multi-core processors, the performance difference to the main \revabbr{dynamic random-access memory}{DRAM} is increasing, commonly referred to as the \textit{memory wall}.
Therefore caches, whose goal is to decrease the latency and increase the bandwidth of a memory access, play an important role when it comes to the overall performance of computing systems.
Caches are faster than DRAM, but only provide a small capacity, as the per-bit cost is larger.
For this reason, at least the \textit{working set}, the data that the currently running application is working on, should be stored in the cache to improve performance.
The two most important heuristics that make this possible will be explained in section \ref{sec:caches_locality_principles}.
After that, the typical structure of a cache will be discussed in \ref{sec:caches_logical_organization}.
Replacement policies will be explained in \ref{sec:replacement_policies} and write policies in \ref{sec:write_policies}, followed by the considerations to make when it comes to virtual addressing in section \ref{sec:caches_virtual_addressing}.
Section \ref{sec:caches_coherency} gives a short introduction to cache coherency and snooping.
Finally, the advantage of non-blocking caches is the topic of section \ref{sec:caches_non_blocking_caches}.
\subsection{Locality Principles}
@@ -157,7 +158,22 @@ Such a cache is called \textit{virtually indexed} and \textit{physically tagged}
The result from the TLB, the physical page number, needs to be compared to the tag that is stored in the cache.
When the tag and the physical page number match, the cache entry is valid for this virtual address.
Note that when the cache index is completely contained in the page offset, another problem called \textit{aliasing} can be resolved, which will not be discussed further in this thesis.
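The address split for a virtually indexed, physically tagged cache can be sketched as follows. This is an illustrative example, not code from the thesis; the parameters (4 KiB pages, 64-byte lines, 64 sets) are assumptions chosen so that the set index lies completely inside the page offset and is therefore identical for the virtual and the physical address:

```cpp
#include <cstdint>

// Assumed parameters: 4 KiB pages (12 offset bits), 64-byte cache lines
// (6 block-offset bits) and 64 sets (6 index bits). Since 6 + 6 = 12, the
// index bits fall entirely within the page offset.
constexpr unsigned kBlockOffsetBits = 6;
constexpr unsigned kIndexBits = 6;

// The set index is taken from the untranslated virtual address...
uint64_t cache_index(uint64_t vaddr) {
    return (vaddr >> kBlockOffsetBits) & ((1u << kIndexBits) - 1);
}

// ...while the tag is compared against the translated physical address
// (the physical page number delivered by the TLB).
uint64_t cache_tag(uint64_t paddr) {
    return paddr >> (kBlockOffsetBits + kIndexBits);
}
```

Because the index depends only on page-offset bits, two addresses that share a page offset always map to the same set, which is what allows the index lookup to start before the TLB translation finishes.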
\subsection{Cache Coherency}
\label{sec:caches_coherency}
In multi-core environments, caches become a distributed system.
As every core uses its own set of caches and possibly shares a cache at the last level with the other cores, a new problem arises.
Should two or more cores operate on the same shared data, multiple copies of the data will be placed in the private caches, and it must be guaranteed that all cores agree on the actual value of the data at any point in time.
Divergent views of the same data must be regarded as errors.
Therefore, it is important to guarantee \textit{cache coherency}.
One of the solutions for cache coherency is the use of a so-called snooping protocol.
A cache will snoop the cache coherence bus to examine if it already holds a copy of the requested data.
Snooping packets are then used to update or invalidate the other copies of the data.
Snooping protocols can be very complex, and it is hard to formally verify that they in fact guarantee cache coherence.
For this reason, they will not be discussed further in this thesis.
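The invalidation variant of snooping can be illustrated with a minimal sketch. All names here are hypothetical and the model is deliberately simplistic (two cores, a value map instead of real cache lines, no states beyond valid/invalid); it only shows the core idea that a write broadcasts an invalidation which the other caches observe on the bus:

```cpp
#include <array>
#include <cstdint>
#include <unordered_map>

struct SnoopBus;

// A toy private cache: a map from block address to the cached value.
struct Cache {
    std::unordered_map<uint64_t, uint64_t> lines;
    SnoopBus *bus = nullptr;
    int id = 0;

    bool has(uint64_t addr) const { return lines.count(addr) != 0; }
    void write(uint64_t addr, uint64_t value);
    // Snoop handler: drop our copy when another cache writes the block.
    void snoop_invalidate(uint64_t addr) { lines.erase(addr); }
};

struct SnoopBus {
    std::array<Cache *, 2> caches{};  // two cores are enough for the sketch
    void broadcast_invalidate(int writer, uint64_t addr) {
        for (Cache *c : caches)
            if (c && c->id != writer) c->snoop_invalidate(addr);
    }
};

void Cache::write(uint64_t addr, uint64_t value) {
    bus->broadcast_invalidate(id, addr);  // others must not keep stale data
    lines[addr] = value;                  // our copy is now the only valid one
}
```

A later read by the other core misses and must fetch the up-to-date value, which is exactly the agreement on a single value per datum that coherency demands.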
\subsection{Non-blocking Caches}
\label{sec:caches_non_blocking_caches}
@@ -182,3 +198,5 @@ An architecture of a MSHR file is illustrated in figure \ref{fig:mshr_file}.
\label{fig:mshr_file}
\end{center}
\end{figure}
When the data for a cache miss is returned from the underlying memory, the cache will be updated, all targets of the MSHR entry will be served with the value, and the MSHR entry will eventually be deallocated.
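The lifecycle described above can be sketched as a small MSHR file. This is an assumed structure for illustration, not the thesis implementation: each entry tracks one outstanding miss and collects the targets waiting for the same block, so that further misses to that block merge into the entry instead of issuing a second memory request:

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <vector>

class MshrFile {
  public:
    // A target is whatever wants the data once it arrives (e.g. a callback).
    using Target = std::function<void(uint64_t /*data*/)>;

    // Returns true if a new request has to be sent to the underlying memory.
    bool on_miss(uint64_t block_addr, Target target) {
        auto it = entries_.find(block_addr);
        if (it != entries_.end()) {
            it->second.push_back(std::move(target));  // merge into the entry
            return false;
        }
        entries_[block_addr].push_back(std::move(target));
        return true;  // first miss to this block: forward it downstream
    }

    // Called when the underlying memory returns the data for a block.
    void on_fill(uint64_t block_addr, uint64_t data) {
        auto it = entries_.find(block_addr);
        if (it == entries_.end()) return;
        for (Target &t : it->second) t(data);  // serve every waiting target
        entries_.erase(it);                    // deallocate the MSHR entry
    }

    std::size_t outstanding() const { return entries_.size(); }

  private:
    std::unordered_map<uint64_t, std::vector<Target>> entries_;
};
```

A real MSHR file would bound the number of entries and targets per entry and stall the cache when they are exhausted; the sketch omits this.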


@@ -2,7 +2,7 @@
\label{sec:dramsys}
DRAMSys is an open-source design space exploration framework, capable of simulating the latest \revabbr{Joint Electron Device Engineering Council}{JEDEC} DRAM standards.
It is optimized to achieve high simulation speeds and utilizes the TLM-AT coding style while still achieving cycle-accurate results \cite{Steiner2020}.
DRAMSys is composed of an arbitration \& mapping unit (also called arbiter) and independent channel controllers with a DRAM device each.
The general architecture of DRAMSys is illustrated in figure \ref{fig:dramsys}.
@@ -16,7 +16,7 @@ The general architecture of DRAMSys is illustrated in figure \ref{fig:dramsys}.
\end{figure}
Several initiators can be connected to the arbiter, sending requests to the DRAM subsystem.
An initiator can either be a sophisticated processor model like the gem5 out-of-order processor model \cite{Binkert2011} or a trace player that simply replays a trace file containing a sequence of memory requests and timestamps.
To support a large variety of DRAM standards robustly and error-free, DRAMSys uses a formal domain-specific language based on Petri nets called DRAMml.
This language describes a standard's timing dependencies between all DRAM commands and compiles to the source code of the internal timing checkers that ensure compliance to the specific standard \cite{Jung2017a}.


@@ -1,11 +1,12 @@
\section{Implementation}
\label{sec:implementation}
In this section, the newly developed components that enable the tracing of an arbitrary application in real time, as well as the replay of those traces in DRAMSys, will be introduced.
First, the DynamoRIO analyzer tool that produces the memory access traces and its place in the DrCacheSim framework will be explained.
Furthermore, the trace player for DRAMSys will receive special focus, as well as the mandatory cache model that is used to model the cache filtering of a real system.
The last part will concentrate on the special architecture of the new trace player interface and the challenges the internal interconnection solves.
\subsection{Analysis Tool}
\label{sec:analysis_tool}
@@ -14,21 +15,21 @@ As described in section \ref{sec:dynamorio} the dynamic binary instrumentation t
Instead of writing a DynamoRIO client from the ground up, the DrCacheSim framework is used.
DrCacheSim is a DynamoRIO client that gathers memory and instruction access traces and forwards them to an analyzer tool.
It is a purely observational client and does not modify the behavior of the application.
Optionally, DrCacheSim converts the addresses of the memory accesses from virtual addresses into physical addresses, which is an important step for simulating a real memory system.
The physical address conversion only works on Linux and requires root privileges (or alternatively the CAP\_SYS\_ADMIN capability) in modern kernel versions.
The analyzer tool can either be running alongside DrCacheSim (online) or operate on an internal trace format (offline).
As of writing this thesis, the offline tracing mode does not yet support the physical address conversion, so the online mode has to be used.
In the case of online tracing, DrCacheSim consists of two separate processes:
\begin{itemize}
\item
A client-side process (the DynamoRIO client) which injects observational instructions into the application's code cache.
For every instruction or memory access, a data packet of the type \texttt{memref\_t} is generated.
\item
An analyzer-side process which connects to the client and processes the \texttt{memref\_t} data packets.
The analyzer side can contain many analysis tools that operate on this stream of records.
\end{itemize}
@@ -45,13 +46,14 @@ Figure \ref{fig:drcachesim} illustrates the structure of the individual parts.
\end{figure}
A \texttt{memref\_t} can either represent an instruction, a data reference or a metadata event such as a timestamp or a CPU identifier.
Besides the type, the \revabbr{process identifier}{PID} and \revabbr{thread identifier}{TID} of the initiating process and thread are included in every record.
For an instruction marker, the size of the instruction as well as the virtual address of the instruction in the memory map is provided.
For data references, the address and size of the desired access are provided, as well as the \revabbr{program counter}{PC} from which it was initiated.
In offline mode, DrCacheSim stores the current mapping of all binary executables and shared libraries in a separate file, so that it is possible to decode named instructions even after the application has exited.
In the case of online tracing, the analyzer has to inspect the memory of the client-side process for this.
Analysis tools implement the \texttt{analysis\_tool\_t} interface, as this enables the analyzer to forward a received record to multiple tools in a polymorphic manner.
In particular, the \texttt{process\_memref\_t()} method of each tool is called for every incoming record.
The newly developed DRAMTracer tool creates a separate trace file for every thread of the application.
As it is not known how many threads an application will spawn, the tool will listen for records with new TIDs that it has not registered yet.
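The per-thread demultiplexing can be sketched as follows. This is an illustrative stand-in, not the actual DRAMTracer code: records carry a TID, and the first record with an unknown TID lazily registers a new trace stream for that thread (a string stream stands in for the real trace file here):

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <sstream>
#include <string>

class PerThreadTraces {
  public:
    void on_record(uint64_t tid, const std::string &line) {
        // try_emplace creates a fresh stream the first time a TID is seen
        // and is a no-op for TIDs that are already registered.
        auto [it, inserted] = streams_.try_emplace(tid);
        (void)inserted;
        it->second << line << '\n';
    }
    std::size_t thread_count() const { return streams_.size(); }
    std::string trace_of(uint64_t tid) { return streams_[tid].str(); }

  private:
    std::map<uint64_t, std::ostringstream> streams_;
};
```

The real tool would open a file per TID instead, but the lazy-registration pattern is the same.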
@@ -77,6 +79,8 @@ This instruction count is used to approximate the delay between the memory acces
As of writing this thesis, there is no application binary interface for analysis tools defined in the DrCacheSim framework.
Therefore, it is not possible to load the DRAMTracer tool as a shared library; instead, the DynamoRIO source code has to be modified to integrate the tool.
Also, to be able to decode the instructions in online tracing mode, a set of patches had to be applied to DynamoRIO.
\subsection{Trace Player Architecture}
\label{sec:dbiplayer_architecture}
@@ -86,8 +90,10 @@ For every recorded thread, a new so-called DbiThreadPlayer is spawned, which is
Because those threads need to be synchronized to approximate the real behavior, they need to communicate with each other.
The detailed mechanism behind this synchronization will be further explained in section \ref{sec:dbiplayer_functionality}.
This communication, however, brings up the necessity to containerize the thread players into a single module that can directly be connected to DRAMSys.
With the old DRAMSys interface for trace players this was not easily realizable, so a new generic initiator interface was developed that makes it possible to connect components with an arbitrary internal architecture to DRAMSys.
This new interface will be further discussed in section \ref{sec:traceplayer_interface}.
For the DbiPlayer, an additional interconnect module will bundle up all \\ \texttt{simple\_initiator\_sockets} to a single \texttt{multi\_passthrough\_initiator\_socket}, as presented in figure \ref{fig:dbiplayer_without_caches}.
\begin{figure}
\begin{center}
@@ -97,8 +103,8 @@ In the case of the DbiPlayer, an additional interconnect module will bundle up a
\end{center}
\end{figure}
As the memory accesses are directly extracted from the executed instructions, simply sending a transaction to the DRAM subsystem for every data reference would neglect the caches of today's processors completely.
Therefore, a cache model is also required, whose implementation will be explained in more detail in section \ref{sec:cache_implementation}.
Modern cache hierarchies are composed of three cache levels: two private caches for every processor core, the L1 and L2 caches, and one cache that is shared across all cores, the L3 cache.
% (possibly add a literature reference here)
This hierarchy is also reflected in the DbiPlayer, as shown in figure \ref{fig:dbiplayer_with_caches}.
@@ -118,7 +124,7 @@ This hierarchy is also reflected in the DbiPlayer as shown in Figure \ref{fig:db
With the overall architecture of the initiator introduced, this section explains the internal functionality of the DbiPlayer and its threads.
As mentioned previously, the threads cannot run by themselves; rather, they require synchronization to ensure the simulated system replicates the real running application as closely as possible.
The analysis tool appends timestamps to the memory access traces, which are used to pause the execution of a thread when the global time has not yet advanced this far, or to advance the global time when the thread is allowed to run.
Note that the term global time in this context does not correspond to the SystemC simulation time but denotes a loose time variable that the DbiPlayer uses to schedule its threads.
A set of rules determines whether a thread is allowed to make progress beyond a timestamp that is further than the current global time:
@@ -131,4 +137,19 @@ A set of rules determine if a thread is allowed to make progress beyond a timest
Those rules ensure that at least one thread is always running and the simulation does not come to a premature halt.
Each running thread iterates through its trace file and initiates transactions to the specified physical addresses.
The instruction count field is used to approximate the delay between the memory accesses:
the value is multiplied by the trace player clock period, and the next transaction is delayed by the result.
While this does not take the type of the executed instructions into account, it is a simple first approximation.
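The delay approximation above amounts to one multiplication per trace entry. The following sketch uses hypothetical names (the trace entry layout and the picosecond unit are assumptions, not the thesis code):

```cpp
#include <cstdint>

// One entry of a per-thread trace: the target address and the number of
// instructions executed since the previous memory access.
struct TraceEntry {
    uint64_t phys_addr;
    uint64_t instr_count;
};

// Delay before issuing the next transaction: instruction count times the
// trace player's clock period. A period of 400 ps would model a 2.5 GHz
// player clock.
uint64_t next_delay_ps(const TraceEntry &e, uint64_t clk_period_ps) {
    return e.instr_count * clk_period_ps;
}
```

Every instruction is thus assumed to take exactly one player clock cycle, regardless of its type, which is precisely the simplification the text describes.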
\subsection{Non-Blocking Cache}
\label{sec:cache_implementation}
This section gives an overview of the cache model that is used to model the cache filtering of a real system.
Note that the current implementation does not use a snooping protocol.
Therefore, no cache coherency is guaranteed, and memory shared between multiple processor cores will lead to incorrect results, as the values are not synchronized between the caches.
However, it is expected that this will not drastically affect the simulation results.
\subsection{A New Trace Player Interface}
\label{sec:traceplayer_interface}