Apply Lukas' corrections

2022-07-13 11:27:04 +02:00
parent d890c4cc79
commit a9e7132ed7
12 changed files with 280 additions and 224 deletions

\section{Implementation}
\label{sec:implementation}
This section introduces the components developed in this thesis for the new simulator frontend, which enable the tracing of an arbitrary application in real time as well as the replay of the recorded traces in DRAMSys.
The components necessary to implement the new simulation frontend are briefly listed below:
\begin{itemize}
\item A DynamoRIO client that traces memory accesses from a running application.
\item A simplified core model that replays those traces by sending transactions to DRAMSys.
\item A cache model that simulates the cache filtering of memory requests of the processor.
\end{itemize}
The following sections will first explain the DynamoRIO analysis tool that generates the memory access traces and its place in the DrMemtrace framework.
Furthermore, the new trace player for DRAMSys will receive special attention, as well as the mandatory cache model that is used to model the cache filtering in a real system.
The last part will concentrate on the special architecture of the new trace player interface and the challenges that the internal interconnect solves.
\subsection{Analysis Tool}
\label{sec:analysis_tool}
As described in Section \ref{sec:dynamorio}, the dynamic binary instrumentation tool DynamoRIO will be used to trace the memory accesses while the target application is running.
Instead of writing a DynamoRIO client from the ground up, the DrMemtrace framework, which comes bundled with DynamoRIO, is used.
DrCacheSim is a DynamoRIO client that builds on top of the DrMemtrace framework; it gathers memory and instruction access traces from the target application and forwards them to one or multiple analysis tools.
In addition, so-called marker records are sent to the analysis tools when certain events occur, which are used to transmit meta information such as the CPU core used, kernel events or timestamps.
These markers are also essential for a processor simulation, for example to reconstruct the thread interleaving, as it is intended for the new simulator frontend.
DrCacheSim is a purely observational client and does not alter the behavior of the application.
It should be noted that in most systems the physical addresses do not directly represent the addresses that the memory subsystem perceives.
The physical memory is mapped at a specific address region in the physical address space, so an address offset also has to be considered.
On Linux systems, this mapping can be obtained by investigating the contents of the virtual file \texttt{/proc/iomem}, which is provided by the kernel.
The trace player then subtracts this offset, as will be explained in more detail in Section \ref{sec:dbiplayer_functionality}.
The physical address conversion only works on Linux and, in modern kernel versions, requires root privileges (or alternatively the CAP\_SYS\_ADMIN capability).
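The offset handling described above can be sketched as follows. This is a simplified illustration, not the thesis's actual implementation: it assumes a minimal \texttt{/proc/iomem}-style line format and takes the first "System RAM" region as the base of physical memory.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

// Parse the start address of the first "System RAM" region from
// /proc/iomem-style content (simplified, assumed line format:
// "<start>-<end> : System RAM" with hexadecimal bounds).
std::uint64_t ramBaseFromIomem(const std::string& iomem) {
    std::istringstream in(iomem);
    std::string line;
    while (std::getline(in, line)) {
        if (line.find("System RAM") == std::string::npos) continue;
        return std::stoull(line.substr(0, line.find('-')), nullptr, 16);
    }
    return 0; // no RAM region found
}

// The trace player subtracts this base so that the memory subsystem
// model sees addresses relative to the start of the RAM region.
std::uint64_t toDramAddress(std::uint64_t physical, std::uint64_t ramBase) {
    return physical - ramBase;
}
```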
DrCacheSim provides two different operation modes for an analysis tool:
The analysis tool can either run alongside DrCacheSim (online) or run after the target application has exited and operate on an internal trace format (offline).
Offline tracing has the additional advantage of being able to disassemble the executed instructions afterwards.
For this, the mapping of the executable binaries and shared libraries is stored alongside with the trace, enabling the decoding of the instructions from the traced program counter values.
The instruction decoding is currently not natively supported by the online execution model, but this feature received limited attention in the development of the new frontend.
As of writing this thesis, the offline tracing mode has only recently gained support for the physical address conversion.
Nevertheless, the online execution model will be used throughout this thesis as the physical address support is still limited for offline tracing.
\input{img/thesis.tikzstyles}
\begin{figure}
\end{center}
\end{figure}
In the case of online tracing, DrCacheSim consists of two separate processes:
\begin{itemize}
\item
A client-side process (the DynamoRIO client) which injects observational instructions into the application's code cache.
\end{itemize}
The \revabbr{inter-process communication}{IPC} between the two processes is achieved through a \textit{named\ pipe}.
Figure \ref{fig:drcachesim} illustrates the structure of the online tracing mechanism.
A \texttt{memref\_t} can either represent an instruction, a data reference or a metadata event such as a timestamp or a CPU identifier.
Besides the type, the \revabbr{process identifier}{PID} and \revabbr{thread identifier}{TID} of the initiating process and thread are included in every record.
For an instruction marker, the size of the instruction as well as the address of the instruction in the virtual address space of the application is provided.
For data references, the address and size of the desired access are provided as well as the \revabbr{program counter}{PC} from which the access was initiated.
In offline mode, DrCacheSim stores the current mapping of all binary executables and shared libraries in a separate file, so that it is possible to decode and disassemble the traced instructions even after the application has exited.
As mentioned earlier, instruction decoding is not natively supported for online tracing, but to work around the problem, the analyzer can examine the memory map of the client-side process and read the encoded instructions from there.
All analysis tools implement the common \texttt{analysis\_tool\_t} interface as this enables the analyzer to forward a received record to multiple tools in a polymorphic manner.
In particular, the \texttt{process\_memref\_t()} method of any tool is called for every incoming record.
Virtual functions, such as \texttt{initialize()} and \texttt{print\_results()}, which are called by the analyzer in appropriate places, should also be implemented.
It is possible for an analysis tool to implement parallel processing of the received \texttt{memref\_t} types by splitting up the trace into \textit{shards}.
However, in this thesis the sequential processing of a single sorted and interleaved trace was used because the online execution model lacks support for parallel processing.
The newly developed DRAMTracer tool creates a separate trace file for every application thread.
Since it is not known a priori how many threads an application will spawn, the tool will listen for records with new TIDs that it did not register yet.
For every data reference, a new entry in the corresponding trace file is created which contains the size and the physical address of the access, whether it was a read or write, and also a count of (computational) instructions that have been executed since the last data reference.
To compute the instruction count, a counter is incremented for every registered instruction record and reset again for any data reference.
This instruction count is used together with the clock period to approximate the delay between two memory accesses when the trace is replayed by DRAMSys.
Lastly, the analysis tool inserts a timestamp into the trace for every received timestamp marker.
The use of this timestamp will be further explained in Section \ref{sec:dbiplayer_functionality}.
Listing \ref{list:memtrace} presents an exemplary memory trace.
Lines consisting of a number between two angle brackets represent a timestamp, whereas lines for memory references consist of the instruction count, a character denoting a read or write, the size and the physical address of the access.
Also, comments, which are ignored by the trace player, can be added by starting the line with a number sign.
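A minimal parser for this line format might look as follows. This is an illustrative sketch, not the \textit{DbiPlayer}'s actual parser; the exact field order and number bases of the trace format are assumptions derived from the description above.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

// One parsed record of the (simplified, assumed) trace line format:
// comments start with '#', timestamps are a number between angle
// brackets, and memory references carry the instruction count, an
// 'R'/'W' flag, the access size and the physical address.
struct TraceRecord {
    enum class Kind { Comment, Timestamp, MemRef } kind;
    std::uint64_t timestamp = 0;   // valid for Kind::Timestamp
    std::uint64_t instrCount = 0;  // valid for Kind::MemRef
    bool isWrite = false;
    std::uint64_t size = 0;
    std::uint64_t address = 0;
};

TraceRecord parseTraceLine(const std::string& line) {
    TraceRecord rec{};
    if (!line.empty() && line[0] == '#') {
        rec.kind = TraceRecord::Kind::Comment;
    } else if (!line.empty() && line[0] == '<') {
        rec.kind = TraceRecord::Kind::Timestamp;
        rec.timestamp = std::stoull(line.substr(1), nullptr, 10);
    } else {
        rec.kind = TraceRecord::Kind::MemRef;
        char rw = 0;
        std::istringstream in(line);
        // assumed order: count, R/W flag, size (decimal), address (hex)
        in >> rec.instrCount >> rw >> rec.size >> std::hex >> rec.address;
        rec.isWrite = (rw == 'W');
    }
    return rec;
}
```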
\begin{listing}
\begin{textcode}
For every recorded thread, a traffic initiator thread, a so-called \textit{DbiThreadPlayer}, is spawned, which is a standalone initiator for memory transactions.
Because those threads need to be synchronized to approximate real thread interleaving, they need to communicate among each other.
The detailed mechanism behind this synchronization will be further explained in Section \ref{sec:dbiplayer_functionality}.
This communication, however, brings up the necessity to containerize the thread players into a single module that can directly be connected to DRAMSys.
With the old DRAMSys interface for trace players this was not easily realizable, so a new generic initiator interface was developed that allows components with an arbitrary internal architecture to be connected to DRAMSys.
This new interface will be further discussed in Section \ref{sec:traceplayer_interface}.
For the \textit{DbiPlayer}, an additional interconnect module will bundle up all \\ \texttt{simple\_initiator\_sockets} into a single \texttt{multi\_passthrough\_initiator\_socket}.
The \textit{DbiPlayer} is thus a hierarchical module that consists of a more complex architecture with multiple traffic initiators, as illustrated in Figure \ref{fig:dbiplayer_without_caches}.
\begin{figure}
\begin{center}
\end{figure}
As the memory accesses are directly extracted from the executed instructions, simply sending a transaction to the DRAM subsystem for every data reference would completely neglect the caches of today's processors.
Therefore, a cache model is also required, whose implementation will be explained in more detail in Section \ref{sec:cache_implementation}.
Many modern cache hierarchies are composed of three cache levels: two caches for every processor core, the L1 and L2 cache, and one cache that is shared across all cores, the L3 cache.
This cache hierarchy is also reflected in the \textit{DbiPlayer} as shown in Figure \ref{fig:dbiplayer_with_caches}, but simpler hierarchies, such as an L1 cache for every processor core and one shared L2 cache, are configurable as well.
In order to connect the different SystemC socket types, one additional interconnect is required, which is explained in more detail in Section \ref{sec:interconnect}.
\begin{landscape}
\begin{figure}
With the overall architecture of the main initiator module introduced, this section explains the internal functionality of the \textit{DbiPlayer} and its threads.
The threads of the \textit{DbiPlayer} are specialized initiator modules that inherit from the more generic \texttt{TrafficInitiatorThread} class.
Each \texttt{TrafficInitiatorThread} consists of a \texttt{sendNextPayloadThread()} \texttt{SC\_THREAD}, which in turn calls the virtual method \texttt{sendNextPayload()} each time the \texttt{sc\_event\_queue} \texttt{sendNextPayloadEvent} is notified.
\texttt{sendNextPayload()} is implemented in the \texttt{DbiThreadPlayer}.
Each \texttt{DbiThreadPlayer} iterates through the lines of its trace file and stores the entries in an internal buffer.
In \texttt{sendNextPayload()}, a new generic payload object is created from the next entry of this buffer.
The address of the payload is calculated from the physical address stored in the trace file entry.
As previously discussed, the trace player now needs to account for the offset at which the RAM was placed in the physical memory map and subtract this offset from the physical address.
The instruction count field of the trace is used to approximate the delay between two consecutive memory accesses: the count is multiplied by the trace player clock period to defer the initiation of the next transaction by the resulting value.
Additionally, this count can be multiplied by an approximation of the \revabbr{clocks per instruction}{CPI} value.
While this does not take into account the type of the instructions executed, it is still a simple approximation that can be used to model the system more accurately.
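As a worked example of this approximation, the deferral could be computed like this (a sketch under assumed names and units; the thesis does not specify the time resolution used):

```cpp
#include <cassert>
#include <cstdint>

// Approximate the deferral before the next transaction: instruction
// count times the trace player clock period, scaled by an assumed
// constant CPI value. Times are in picoseconds here purely for
// illustration.
std::uint64_t nextPayloadDelayPs(std::uint64_t instrCount,
                                 std::uint64_t clockPeriodPs,
                                 double cpi) {
    return static_cast<std::uint64_t>(instrCount * clockPeriodPs * cpi);
}
```

With, say, 10 instructions since the last data reference, a 500 ps clock and a CPI of 1.5, the next transaction would be deferred by 7500 ps.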
The individual initiator threads should run by themselves without paying attention to the others; rather, they require synchronization to ensure the simulated system replicates the real running application as closely as possible.
The analysis tool appends timestamps into the memory access traces.
When such a timestamp is reached, it will be used to pause the execution of a thread if the global time has not yet reached this far, or to advance the global time when the thread is allowed to continue.
Note that the term global time in this context does not correspond to the SystemC simulation time, but denotes a loose time variable that only the \textit{DbiPlayer} uses to schedule its threads.
A set of rules determines if a thread is allowed to make progress beyond a timestamp that is greater than the current global time:
\begin{enumerate}
\item The main thread at the start of the program is always allowed to run.
\item Threads do not suspend themselves when doing so would produce a deadlock. This is the case when they are the only thread currently running.
\item When a previously running thread exits and all other threads are suspended, then they will be resumed.
\item As a fallback, when currently all threads are suspended, one thread will be resumed.
\end{enumerate}
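The deadlock-avoidance rules above can be condensed into two small predicates. This is a sketch with hypothetical names, not the \textit{DbiPlayer}'s actual scheduler code:

```cpp
#include <cassert>
#include <cstddef>

// Rule 2: a thread that reached a timestamp ahead of the global time
// may only suspend itself if at least one other thread keeps running;
// otherwise suspending would deadlock the whole player.
bool maySuspend(std::size_t runningThreads) {
    return runningThreads > 1;
}

// Rules 3 and 4: when the last running thread exits or all threads are
// suspended, at least one suspended thread has to be resumed so the
// simulation makes progress.
bool mustResumeOne(std::size_t runningThreads, std::size_t suspendedThreads) {
    return runningThreads == 0 && suspendedThreads > 0;
}
```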
\subsection{Non-Blocking Cache}
\label{sec:cache_implementation}
This section gives an overview of the cache model that is used by the new trace player.
It is implemented as a non-blocking cache that, as explained in Section \ref{sec:caches_non_blocking_caches}, can accept new requests even when multiple cache misses are being handled.
The cache inherits from the \texttt{sc\_module} base class and consists of a target socket to accept requests from the processor or a higher-level cache as well as an initiator socket to send requests to a lower-level cache or to the DRAM subsystem.
It has a configurable size, associativity, cache line size, MSHR buffer depth, write buffer depth and target depth for one MSHR entry.
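From these configurable parameters the usual set-associative geometry follows. The helper below is an illustrative sketch (assuming power-of-two parameters), not the cache model's actual code:

```cpp
#include <cassert>
#include <cstdint>

// Geometry derived from the configurable parameters listed above
// (all sizes in bytes).
struct CacheGeometry {
    std::uint64_t lineSize;
    std::uint64_t numSets;
};

CacheGeometry makeGeometry(std::uint64_t cacheSize,
                           std::uint64_t associativity,
                           std::uint64_t lineSize) {
    // number of sets = total size / (ways per set * bytes per line)
    return {lineSize, cacheSize / (associativity * lineSize)};
}

// Set index and tag of an address, used to locate the matching ways.
std::uint64_t setIndex(const CacheGeometry& g, std::uint64_t addr) {
    return (addr / g.lineSize) % g.numSets;
}
std::uint64_t tagOf(const CacheGeometry& g, std::uint64_t addr) {
    return addr / (g.lineSize * g.numSets);
}
```

For example, a 32 KiB, 8-way cache with 64-byte lines has 64 sets.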
To understand how the cache model works, a hypothetical request from the CPU will be assumed to explain the internal processing of the transaction in detail:
When the transaction arrives, it will be placed in the PEQ of the cache from where, after the specified amount of delay has elapsed, the handler for the \texttt{BEGIN\_REQ} phase is called.
The handler verifies that the cache buffers are not full\footnote{Otherwise the cache will apply backpressure on the CPU and postpone the handling of the transaction.} and checks if the requested data is stored in the cache.
If it is the case (i.e., a cache hit), the cache model immediately sends an \texttt{END\_REQ} and, when the target socket is not currently occupied with a response, accesses the cache and sends the \texttt{BEGIN\_RESP} phase to the processor.
During a cache access, the content of the cache line is copied into the transaction in case of a read request, or the cache line is updated with the new value in case of a write request.
Further, in both cases the timestamp of the last access is updated to the current simulation time.
The processor then finalizes the transaction with the \texttt{END\_RESP} phase, the target backpressure of the cache is cleared, and the postponed request from the CPU (if it exists) is placed into the PEQ once again.
If, on the other hand, the requested data is not in the cache (i.e., a cache miss), first it will be checked if there is already an existing MSHR entry for the corresponding cache line.
If this is the case\footnote{And if the target list of the MSHR entry is not full. Otherwise the transaction will be postponed.}, the transaction is appended to it as an additional target.
If not, a cache line is evicted to make space for the new cache line that will be fetched from the underlying memory.
The cache model implements the LRU replacement policy, so the cache line whose last access lies furthest in the past is chosen to be evicted.
When an eviction is not possible, the transaction will be postponed.
An eviction is not possible when the selected cache line is allocated but not yet filled with requested data from the underlying cache, the cache line is currently present in the MSHR queue, or a hit for this cache line is yet to be handled.
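The victim choice under these constraints can be sketched as follows. This is an illustration with hypothetical names; the `evictable` flag folds together the three conditions above (pending fill, pending MSHR entry, unhandled hit):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One way of a cache set: its last access time (a SystemC-time-like
// counter) and whether it may currently be evicted.
struct Way {
    std::uint64_t lastAccess;
    bool evictable;
};

// LRU victim selection among the evictable ways of a set.
// Returns the index of the chosen way, or -1 when no eviction is
// currently possible (the transaction is then postponed).
int selectVictim(const std::vector<Way>& set) {
    int victim = -1;
    for (std::size_t i = 0; i < set.size(); ++i) {
        if (!set[i].evictable) continue;
        if (victim < 0 ||
            set[i].lastAccess < set[static_cast<std::size_t>(victim)].lastAccess)
            victim = static_cast<int>(i);
    }
    return victim;
}
```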
When the \texttt{dirty} flag of the old cache line is set, it has to be placed into the write buffer and written back to the memory.
The newly evicted cache line is now \textit{allocated}, but not \textit{valid}.
Then, the transaction is put into an MSHR entry and the \texttt{END\_REQ} phase is sent back to the processor.
To process the entries in the MSHR and in the write buffer, the \texttt{processMshrQueue()} and \texttt{processWriteBuffer()} methods are called at appropriate times.
In the former, a not yet issued MSHR entry is selected for which a new fetch transaction is generated and sent to the underlying memory.
Note that special care has to be taken when the requested cache line is also present in the write buffer:
To ensure consistency, no new request is sent to the DRAM and instead the value is snooped out of the write buffer.
Since the cache line in the write buffer is now allocated again in the cache, the entry in the write buffer can be removed to prevent an unnecessary write-back.
In the latter, the processing of the write back buffer, a not yet issued entry is selected and a new write transaction is sent to the memory.\footnote{Both \texttt{processMshrQueue()} and \texttt{processWriteBuffer()} also need to ensure that currently no backpressure is applied onto the cache from the memory.}
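The write-buffer snoop described above might look like this in isolation (a sketch; the map from line address to line data is a simplification of the real buffer entries):

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Simplified write buffer: line address -> line data.
using WriteBuffer = std::unordered_map<std::uint64_t, std::uint64_t>;

// Before a fetch is issued for an MSHR entry, the write buffer is
// checked for the requested line. On a match, the value is snooped out
// of the buffer and the entry is dropped, avoiding both the memory
// fetch and the now-unnecessary write-back.
bool snoopWriteBuffer(WriteBuffer& wb, std::uint64_t lineAddr,
                      std::uint64_t& data) {
    auto it = wb.find(lineAddr);
    if (it == wb.end()) return false; // fetch must go to memory
    data = it->second;                // line is served from the buffer
    wb.erase(it);                     // write-back no longer needed
    return true;
}
```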
Incoming transactions from the memory side are accepted with an \texttt{END\_RESP} and, in the case of a fetch transaction, used to update the cache contents and possibly prepare a new response transaction for the processor as described before.
This example works analogously with another cache as the requesting module or another cache as the target module for fetch or write-back accesses.
The rough internal structure of the cache model is shown again in Figure \ref{fig:cache}.
\begin{figure}
\begin{center}
\subsection{Trace Player Interface}
\label{sec:traceplayer_interface}
Previously, initiators could only represent one thread when they were connected to DRAMSys.
This, however, conflicted with the goal to develop a trace player module that is internally composed of multiple threads, which communicate with each other and initiate transactions to DRAMSys independently.
To be able to couple such hierarchical initiator modules with DRAMSys, a new trace player interface was developed.
The \texttt{TrafficInitiatorIF} is a generic interface that every module that connects to DRAMSys needs to implement.
It requires implementing the \texttt{bindTargetSocket()} method so that top-level initiators can be coupled regardless of the initiator socket type used (e.g., \texttt{simple\_initiator\_socket} or \texttt{multi\_passthrough\_initiator\_socket}).
When coupling a \texttt{multi\_passthrough\_initiator\_socket} to a \texttt{multi\_passthrough\_\\target\_socket}, the SystemC \texttt{bind()} method has to be called multiple times, once for each thread.
Because of this, a wrapper module also has to override the \\ \texttt{getNumberOfThreads()} method of the new interface and use this number to bind the target socket in \texttt{bindTargetSocket()} the correct number of times.
This makes it possible to polymorphically treat all initiator modules as this interface, whether they are simple threads or more complex wrapper modules, and connect them to DRAMSys with the provided bind method, abstracting away the concrete type of initiator socket used.
With the new trace player interface, a top-level initiator can either be a single thread, like in previous versions, or a more complex hierarchical module with many internal threads.
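The interface idea can be sketched without SystemC as follows. This is not the thesis's actual class: the real \texttt{TrafficInitiatorIF} works on TLM sockets, and the callback here merely stands in for one \texttt{bind()} call on the target socket.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// SystemC-free sketch of the generic initiator interface: every module
// reports its thread count and binds the target socket itself, so the
// caller never needs to know the concrete initiator socket type.
class TrafficInitiatorIF {
public:
    virtual ~TrafficInitiatorIF() = default;
    virtual std::size_t getNumberOfThreads() const = 0;
    // bindOnce stands in for a single bind() call on the target socket.
    virtual void bindTargetSocket(const std::function<void()>& bindOnce) = 0;
};

// A hierarchical wrapper with several internal threads has to bind the
// target socket once per thread.
class WrapperModule : public TrafficInitiatorIF {
public:
    explicit WrapperModule(std::size_t threads) : threads_(threads) {}
    std::size_t getNumberOfThreads() const override { return threads_; }
    void bindTargetSocket(const std::function<void()>& bindOnce) override {
        for (std::size_t i = 0; i < threads_; ++i) bindOnce();
    }
private:
    std::size_t threads_;
};
```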
\subsection{Interconnect}
\label{sec:interconnect}
As already seen in Figure \ref{fig:dbiplayer_with_caches}, interconnection modules are needed to connect the caches to each other.
While the implementation of the \textit{MultiCoupler} component is trivial as it only passes the transactions from its so-called \texttt{multi\_passthrough\_target\_socket} to its \texttt{multi\_passthrough\_initiator\_socket}, the \textit{MultiSimpleCoupler} is more complex because it has to internally buffer transactions.
In order to understand why this buffering is needed, consider the scenario where the L3 cache applies backpressure to one L2 cache.
The L2 cache is not allowed to send further requests due to the exclusion rule.
But since the target socket of the L3 cache is occupied, this also applies to all other L2 caches.
This information, however, is not propagated to the other caches, leading to incorrect behavior if not addressed, as the other caches will send further requests.
To solve this problem, the MultiSimpleCoupler only forwards requests to the L3 cache when it is able to accept them.
If this is not the case, the request is internally buffered and forwarded when an earlier request is being completed with the \texttt{END\_REQ} phase.
% Example
For illustrating this further, a simple example can be assumed: one L2 cache needs to request a cache line from the underlying L3 cache.
The MultiSimpleCoupler receives the \texttt{BEGIN\_REQ} phase and places it into its PEQ.
From there, a hash table used as an internal routing table is updated to be able to send the response back through the correct multi-socket binding afterwards.
As the L3 cache is currently not applying backpressure onto the interconnect, it can forward the transaction with the \texttt{BEGIN\_REQ} phase to the L3 cache.
Until the L3 cache responds with the \texttt{END\_REQ} phase, the interconnect defers any new request from any L2 cache and buffers the payload objects in an internal data structure.
When the \texttt{END\_REQ} phase is received, the next transaction from this request buffer is sent to the L3 cache.
After some time, the L3 cache will respond with the requested cache lines.
During this \texttt{BEGIN\_RESP} phase, the L2 cache that requested this line is looked up using the routing table and the payload is sent back to it.
Until the L2 cache responds with an \texttt{END\_RESP}, the exclusion rule has to be honored also here: when a new response from the L3 cache is received, it has to be buffered in another internal data structure until the corresponding target socket binding is clear again.
Once the L2 cache sends out the \texttt{END\_RESP} phase, the interconnect will forward the \texttt{END\_RESP} to the L3 cache, and initiate new response transactions in case the response buffer is not empty.
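The request path of this example can be condensed into a small model. This is a sketch with hypothetical names, not the \textit{MultiSimpleCoupler}'s actual code: payloads are plain integers and the routing table maps each payload to its originating L2 socket index.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <unordered_map>
#include <vector>

// Sketch of the coupler's request path: requests from the L2 side are
// forwarded only while the L3 side is free; otherwise they wait in an
// internal buffer, and a routing table remembers which L2 socket each
// payload came from so the response can be routed back later.
class RequestCoupler {
public:
    // BEGIN_REQ from an L2 cache; returns the payloads actually
    // forwarded to the L3 side (empty while the L3 side is busy).
    std::vector<int> beginReq(int payloadId, int sourceSocket) {
        route_[payloadId] = sourceSocket;
        pending_.push_back(payloadId);
        return drain();
    }
    // END_REQ from the L3 cache frees the downstream path again.
    std::vector<int> endReq() {
        busy_ = false;
        return drain();
    }
    int lookupSource(int payloadId) const { return route_.at(payloadId); }
private:
    std::vector<int> drain() {
        std::vector<int> sent;
        if (!busy_ && !pending_.empty()) {
            sent.push_back(pending_.front()); // one request in flight at a time
            pending_.pop_front();
            busy_ = true;
        }
        return sent;
    }
    std::deque<int> pending_;
    std::unordered_map<int, int> route_; // payload -> L2 socket index
    bool busy_ = false;
};
```

The response path works symmetrically with a second buffer, as described above.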
In conclusion, this special interconnect module with a multi-target socket and a simple-initiator socket ensures that the exclusion rule is respected in both directions.