Improvements to SystemC and DRAMSys chapters

2022-07-13 15:19:42 +02:00
parent a9e7132ed7
commit 49785fad0f
2 changed files with 79 additions and 75 deletions

@@ -3,48 +3,48 @@
This section covers the basics of virtual prototyping, SystemC and transaction level modeling.
\revabbr{Virtual prototypes}{VPs} are software models of physical hardware systems, that can be used for software development before the actual hardware is available.
\revabbr{Virtual prototypes}{VPs} are software models of physical hardware systems that can be used for software development before the actual hardware is available.
They make it easier to test the product, as VPs provide visibility and controllability across the entire system, and therefore reduce the time-to-market and development cost \cite{Antonino2018}.
SystemC is a C++ class library with an event-driven simulation kernel, used for developing complex system models (i.e. VPs) in a high-level language.
It is defined under the IEEE 1666-2011 standard \cite{IEEE2012} and provided as an open-source library by Accellera.
SystemC is a C++ class library with an event-driven simulation kernel, used for developing complex system models (i.e., VPs) in a high-level language.
It is defined under the \textit{IEEE 1666-2011} standard \cite{IEEE2012} and is provided as an open-source library by Accellera.
All SystemC modules inherit from the \texttt{sc\_module} base class.
Those modules can hierarchically be composed of other modules or implement their functionality directly.
Ports are then used to connect modules with each other, creating the structure of the simulation.
There are two ways to implement a process in a module:
% \begin{itemize}
% \item
An \texttt{SC\_METHOD} are sensitive to \texttt{sc\_event}s or other signals.
They can be executed multiple times.
% \item
Ports are used to connect modules with each other, creating the system structure of the simulation.
There are two options to implement a process in a module:
\begin{itemize}
\item
An \texttt{SC\_METHOD} is sensitive to \texttt{sc\_event}s or other signals.
Methods are executed multiple times, each time they are triggered by their sensitivity list.
\item
An \texttt{SC\_THREAD} is started at the beginning of the simulation and should not terminate.
Instead, threads should contain infinite loops and should call explicitly \texttt{wait()} to wait a specific time or on events.
% \end{itemize}
Moreover, there is \texttt{sc\_event\_queue} which makes it possible to queue multiple pending events, whereas an \texttt{sc\_event} ignores further notifications until it is waited on.
Instead, threads should contain infinite loops and should explicitly call \texttt{wait()} to wait a specific time or for events.
\end{itemize}
Moreover, there is the event queue type \texttt{sc\_event\_queue}, which makes it possible to queue multiple pending events, whereas an \texttt{sc\_event} ignores further notifications until it is waited on.
Those concepts being introduced will become important in Section \ref{sec:implementation} where the implementation of several SystemC modules will be discussed.
The concepts presented are used in Section \ref{sec:implementation}, where the implementation of various SystemC modules will be discussed.
SystemC supports numerous abstraction levels for modeling systems, namely \textit{cycle-accurate}, which is the most accurate abstraction but also the slowest, \textit{approximately-timed} and \textit{loosely-timed}.
SystemC supports a number of abstraction levels for modeling systems, namely \textit{cycle-accurate}, the most accurate but also the slowest abstraction, \textit{approximately-timed} and \textit{loosely-timed}.
The latter two abstraction levels belong to \revabbr{transaction level modeling}{TLM}, which will be discussed in Section \ref{sec:tlm}.
One further abstraction level, \textit{untimed}, will not be topic of this thesis.
Another level of abstraction, \textit{untimed}, will not be the subject of this thesis.
\subsection{Transaction Level Modeling}
\label{sec:tlm}
TLM abstracts the modeling of the communication between modules using so-called transactions, which are transferred through function calls \cite{Menard2017}.
In contrast to pin and cycle accurate models, this greatly reduces the simulation overhead at the cost of reduced accuracy.
TLM abstracts the modeling of the communication between modules using so-called transactions, which are data packets transferred by function calls \cite{Menard2017}.
In comparison to pin and cycle accurate modeling, this greatly reduces the simulation overhead at the cost of reduced accuracy.
Modules communicate with each other through \textit{initiator} sockets and \textit{target} sockets.
A processor, for example, sends requests to a memory using its initiator socket, whereas the memory responds trough its target socket.
Interconnect modules, which can be used to model a bus, use both sockets to communicate with both initiator and the target modules.
For example, a processor sends requests to a memory through its initiator socket, while the memory responds through its target socket.
Interconnection modules, which can be used to model a bus, use both socket types to communicate with both the initiator and the target modules.
This concept is illustrated in Figure \ref{fig:tlm}.
The transaction object itself is a \revabbr{generic payload}{GP}, which consists of the target address, whether the transaction is a read or write command, status information and other transaction parameters as well as the actual data to transfer.
GPs are passed along as references, avoiding the need to copy them between the modules.
The transaction object itself is a \revabbr{generic payload}{GP}, which consists of the target address, whether the transaction is a read or write command, status information and other transaction parameters, the actual data to transfer as well as user-defined payload extensions.
GPs are passed as references, so they do not need to be copied between modules.
\input{img/thesis.tikzstyles}
\begin{figure}[!ht]
\begin{figure}
\begin{center}
\tikzfig{img/tlm}
\caption[Forward and backward path between TLM sockets\cite{Menard2017}.]{Forward and backward path between TLM sockets\cite{Menard2017}. $\blacksquare$ denotes an initiator socket, $\square$ denotes a target socket.}
@@ -53,26 +53,27 @@ GPs are passed along as references, avoiding the need to copy them between the m
\end{figure}
SystemC defines two coding styles for the use of TLM, called \revabbr{loosely-timed}{LT} and \revabbr{approximately-timed}{AT}.
In the LT coding style, a transaction is blocking, meaning it will be modeled by only one function call.
This comes at the cost of limited timing accuracy as only the beginning and the end of the transaction are modeled as timing points and the initiator has to wait for the transaction to complete until it can make the next request.
However, the fast simulation time, especially when \textit{temporal decoupling} with \textit{quantums} is used, makes it possible to use this coding style for rapid software development, like developing drivers for a simulated hardware component.
For such a task the timing accuracy is sufficient.
In the LT coding style, a transaction is blocking, meaning that the transaction will be modeled by only one function call.
This comes at the cost of limited temporal accuracy, as only the start and end times of the transaction are modeled, and the initiator must wait until the transaction is complete before making the next request.
However, the fast simulation time, especially when the so-called concept of \textit{temporal decoupling} with \textit{timing quantums} is used, makes it possible to use this coding style for rapid software development; LT is suitable for developing drivers for a simulated hardware component.
The AT coding style is non-blocking and therefore can be used to model with a higher timing accuracy than LT.
This high accuracy makes it possible to use AT to conduct design space exploration on a hardware level.
With AT, a special protocol is used that consists of a four-phase handshake:
This high accuracy makes it possible to use AT for hardware-level design space exploration.
With AT, a special protocol based on a four-phase handshake is used:
\texttt{BEGIN\_REQ},
\texttt{END\_REQ},
\texttt{BEGIN\_RESP} and
\texttt{END\_RESP}.
When an initiator requests certain data from a target, it starts the transaction with the \texttt{BEGIN\_REQ} phase using its \texttt{nb\_transport\_fw()} method.
The target now enqueues the payload into its \revabbr{payload event queue}{PEQ} and pretends it has received the payload after the delay the initiator has specified.
When the target is not yet ready to accept a new request, it defers its \texttt{END\_REQ} phase until it is.
When an initiator requests data from a target, it starts the transaction with the \texttt{BEGIN\_REQ} phase by calling its \texttt{nb\_transport\_fw()} method.
This method in turn calls the receiving module's target socket and the target module then enqueues the payload into its \revabbr{payload event queue}{PEQ}.
The PEQ pretends it has received the payload after the delay that the initiator has specified with its call to the transport method.
If the target is not yet ready to accept a new request, it defers its \texttt{END\_REQ} phase until it is ready.
During this time, the initiator is blocked from sending further requests either to this or other modules as the target applies \textit{backpressure} on the initiator.
This concept is called the \textit{exclusion rule}.
Otherwise, the target directly responds to the initiator with the \texttt{END\_REQ} phase.
The target now prepares the response and sends the \texttt{BEGIN\_RESP} phase through its \texttt{nb\_transport\_bw()} method when the data is available.
The target then prepares the response and sends the \texttt{BEGIN\_RESP} phase through its \texttt{nb\_transport\_bw()} method when the data is available.
The initiator can now also apply backpressure to the target by deferring its \texttt{END\_RESP} phase.
When the \texttt{END\_RESP} phase is received by the target, the transaction is completed.
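The order of the four phases described above can be sketched in plain C++. This is only an illustration and does not use the SystemC library: the enum \texttt{Phase} and the helpers \texttt{next\_phase} and \texttt{sent\_by\_initiator} are hypothetical names standing in for the real \texttt{tlm\_phase} type and the socket calls.

```cpp
#include <optional>

// Plain C++ stand-in for the four base protocol phases
// (hypothetical type, not the SystemC tlm_phase class).
enum class Phase { BeginReq, EndReq, BeginResp, EndResp };

// Returns the phase that follows `current` in the full four-phase handshake,
// or std::nullopt once END_RESP has completed the transaction.
std::optional<Phase> next_phase(Phase current) {
    switch (current) {
        case Phase::BeginReq:  return Phase::EndReq;
        case Phase::EndReq:    return Phase::BeginResp;
        case Phase::BeginResp: return Phase::EndResp;
        case Phase::EndResp:   return std::nullopt;
    }
    return std::nullopt;
}

// True if the phase travels on the forward path (initiator to target),
// false if it travels on the backward path (target to initiator).
bool sent_by_initiator(Phase p) {
    return p == Phase::BeginReq || p == Phase::EndResp;
}
```

In this sketch, \texttt{BEGIN\_REQ} and \texttt{END\_RESP} correspond to calls of \texttt{nb\_transport\_fw()}, while \texttt{END\_REQ} and \texttt{BEGIN\_RESP} correspond to calls of \texttt{nb\_transport\_bw()}.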
% hier komplexeres handshake beispiell
@@ -87,25 +88,27 @@ Figure \ref{fig:tlm_at} shows an exemplary handshake sequence diagram of three d
\end{figure}
SystemC defines various special cases and shortcuts that can be used throughout the protocol.
Both in the \texttt{BEGIN\_REQ} phase as well as in the \texttt{BEGIN\_RESP} phase, it is possible for the target to skip the \texttt{END\_REQ} phase or for the initiator to skip the \texttt{END\_RESP} phase respectively using the return value of the forward or backward transport function call.
For this the return type \texttt{tlm\_sync\_enum} has to be set to \texttt{TLM\_UPDATED} instead of \texttt{TLM\_ACCEPTED} in the normal case.
In the \texttt{BEGIN\_REQ} phase, it is possible for the target to directly send the \texttt{END\_REQ} phase using the return value of the forward transport function call.
This requires setting the return type \texttt{tlm\_sync\_enum} to \texttt{TLM\_UPDATED} instead of \texttt{TLM\_ACCEPTED} in the normal case.
Analogously, it is also possible for the initiator to directly respond with the \texttt{END\_RESP} phase using the return value during the \texttt{BEGIN\_RESP} phase.
Besides using the return path to skip the \texttt{END\_REQ} phase, it is possible for the target to directly respond with the \texttt{BEGIN\_RESP} phase.
Besides this, it is also possible for the target to directly respond with the \texttt{BEGIN\_RESP} phase after it has received the \texttt{BEGIN\_REQ} phase and therefore skip the \texttt{END\_REQ}.
The initiator has to react accordingly and must detect that the \texttt{END\_REQ} has been skipped.
However, since the initiator is blocked due to backpressure, this shortcut should only be used if the response is ready to send after a very short delay.
Another form of this shortcut, is the combination with the return path of the forward transport function call.
Here, the return path is used to directly send the \texttt{BEGIN\_RESP} phase, without invoking the backward transport function, reducing the required number of transport calls to only two.
However, since the initiator is blocked due to backpressure during this period, this shortcut should only be used if the response is ready to send after a short delay.
Another form of this shortcut is the combination with the return path of the forward transport function call.
Here, the return path is used to directly send the \texttt{BEGIN\_RESP} phase, without invoking the backward transport function altogether, reducing the required number of transport calls to only two.
The last shortcut that can be made is the so-called \textit{early completion}.
When the target receives the \texttt{BEGIN\_REQ} phase, it can already place the requested data into the payload and pass \texttt{TLM\_COMPLETED} as the return value back to the initiator.
This notifies the initiator that the whole transaction is already completed at this point, so no further transport calls are required.
Note that this form of early completion is very similar to the LT coding style, where a transaction is modeled using only one function call.
Early completion can also be used by the initiator to skip the \texttt{END\_REQ} phase.
Here, \texttt{TLM\_COMPLETED} is returned during the backward transport call and thus, the target experiences no backpressure from the initiator.
Note that this form of early completion is very similar to the LT coding style, where a transaction is also modeled using only one function call.
Early completion can also be used by the initiator to skip the \texttt{END\_RESP} phase.
Here, \texttt{TLM\_COMPLETED} is returned during the backward transport call of the \texttt{BEGIN\_RESP} phase.
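How an initiator might interpret the return value of its forward transport call in the \texttt{BEGIN\_REQ} phase can be sketched as follows. The sketch is plain C++ without SystemC: \texttt{SyncStatus} mirrors the three values of \texttt{tlm\_sync\_enum}, while \texttt{ReturnedPhase}, \texttt{NextStep} and \texttt{after\_begin\_req} are hypothetical names chosen for illustration.

```cpp
// Stand-ins for TLM_ACCEPTED, TLM_UPDATED and TLM_COMPLETED.
enum class SyncStatus { Accepted, Updated, Completed };

// Phase found in the phase argument after the forward call has returned.
enum class ReturnedPhase { BeginReq, EndReq, BeginResp };

// What the initiator has to do next (hypothetical classification).
enum class NextStep { AwaitEndReq, AwaitBeginResp, ProcessResponse, Done };

NextStep after_begin_req(SyncStatus status, ReturnedPhase phase) {
    switch (status) {
        case SyncStatus::Completed:
            // Early completion: the payload already holds the response.
            return NextStep::Done;
        case SyncStatus::Updated:
            // END_REQ was skipped, the response arrives on the return path.
            if (phase == ReturnedPhase::BeginResp) return NextStep::ProcessResponse;
            // END_REQ was sent on the return path.
            if (phase == ReturnedPhase::EndReq) return NextStep::AwaitBeginResp;
            break;
        case SyncStatus::Accepted:
            break;
    }
    // Normal case: END_REQ will arrive later on the backward path.
    return NextStep::AwaitEndReq;
}
```

The combination of an updated status with an unchanged phase does not occur in the base protocol; the sketch simply treats it like the normal case.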
SystemC also supports additional user-defined phases through its \texttt{DECLARE\_EXTENDED\_\\PHASE()} macro.
In contrast to the TLM-LT protocol, TLM-AT allows model pipelining of transactions; multiple transactions can be processed simultaneously by one target.
The responses also do not need to be in the same order as the initiator has sent them: they can be \textit{out of order}.
SystemC also supports additional user-defined phases through its \texttt{DECLARE\_EXTENDED\_\\PHASE()} macro for special cases.
The TLM-AT coding style is the used protocol to implement the processor model and the cache model in Section \ref{sec:implementation} of this thesis.
Also, some of the earlier described shortcuts are taken advantage of throughout those models.
In contrast to the TLM-LT protocol, TLM-AT allows modeling the pipelining of transactions; multiple transactions can be processed simultaneously by one target.
The responses also do not need to be in the same order as the initiator has sent them; they can be \textit{out of order}.
The TLM-AT coding style is the protocol used to implement the processor model and the cache model in Section \ref{sec:implementation} of this thesis.
Some of the earlier described shortcuts are taken advantage of throughout those models.

@@ -2,10 +2,10 @@
\label{sec:dramsys}
DRAMSys is an open-source design space exploration framework, capable of simulating the latest \revabbr{Joint Electron Device Engineering Council}{JEDEC} DRAM standards.
It is optimized to achieve high simulation speeds and utilizes the TLM-AT coding style while still achieving cycle accurate results \cite{Steiner2020}.
It is optimized to achieve high simulation speeds and utilizes the TLM-AT coding style while still achieving cycle-accurate results \cite{Steiner2020}.
DRAMSys is composed of an arbitration \& mapping unit (also called arbiter) and independent channel controllers, each driving one DRAM device.
The general architecture of DRAMSys is illustrated in figure \ref{fig:dramsys}.
DRAMSys is composed of an arbitration \& mapping unit (also called arbiter) and independent channel controllers, each driving one DRAM channel.
The general architecture of DRAMSys is illustrated in Figure \ref{fig:dramsys}.
\begin{figure}[!ht]
\begin{center}
@@ -15,51 +15,52 @@ The general architecture of DRAMSys is illustrated in figure \ref{fig:dramsys}.
\end{center}
\end{figure}
% doch noch über interne funktionen schreiben
Several initiators can be connected to DRAMSys at the same time, sending requests independently to the DRAM subsystem.
An initiator can either be a sophisticated processor model like the gem5 out of order processor model \cite{Binkert2011} or a trace player that simply replays a trace file containing a sequence of memory requests and timestamps.
Multiple initiators can be connected to DRAMSys simultaneously and send requests to the DRAM subsystem independently.
An initiator can either be a sophisticated processor model like the gem5 out-of-order processor model \cite{Binkert2011} or a simpler trace player that replays a trace file containing a sequence of memory requests with timestamps.
To support a large variety of DRAM standards robustly and error-free, DRAMSys uses a formal domain specific language based on Petri nets called DRAMml.
This language includes a standards timing dependencies between all DRAM commands and compiles to source code of the internal timing checkers that ensure compliance to the specific standard \cite{Jung2017a}.
To support a variety of DRAM standards in a robust and error-free manner, DRAMSys uses a formal domain-specific language based on \textit{Petri nets} called \textit{DRAMml}.
Using this language, all timing dependencies between DRAM commands of a standard can be defined.
From this formal description, the source code of internal timing checkers is generated, which ensure compliance with the specific standard \cite{Jung2017a}.
Since a single memory access can result in the issuance of multiple commands (e.g. a precharge (\texttt{PRE}), an activate (\texttt{ACT}), a read (\texttt{RD}) or a write (\texttt{WR})), the four phase handshake of the TLM-AT protocol is not sufficient.
Since a single memory access can result in the issuance of multiple commands (e.g. a precharge (\texttt{PRE}), an activate (\texttt{ACT}), a read (\texttt{RD}) or a write (\texttt{WR})), the four-phase handshake of the TLM-AT protocol is not sufficient to model the communication between the DRAM controller and the DRAM device.
Therefore, a custom TLM protocol called DRAM-AT is used as the communication protocol between the channel controller and the DRAM device \cite{Steiner2020}.
This custom protocol introduces a \texttt{BEGIN} and \texttt{END} phase for every available DRAM command.
Which commands can be issued depends on the used DRAM standard.
Which commands can be issued depends on the DRAM standard used.
Some of the internal modules and their functionality will now be explained.
The task of the \textit{arbiter} is to accept the incoming transactions from the various initiators and decode the address according to the configured address mapping.
From there the transactions are passed to the channel controller.
Some of the internal modules of DRAMSys and their functionalities will now be explained.
The task of the \textit{arbiter} is to accept the incoming transactions from the various initiators and to decode the address according to the configured address mapping.
From there, the transactions are passed to the channel controller.
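The address decoding performed by the arbiter can be sketched as a simple split of the address into bit fields. The field order and widths below are purely hypothetical and are not taken from DRAMSys, where the mapping is configurable; \texttt{DecodedAddress}, \texttt{take\_bits} and \texttt{decode} are illustrative names.

```cpp
#include <cstdint>

struct DecodedAddress {
    std::uint64_t channel;
    std::uint64_t row;
    std::uint64_t bank;
    std::uint64_t column;
};

// Hypothetical mapping, from least to most significant bit:
// 10 column bits, 1 channel bit, 4 bank bits, remaining bits select the row.
constexpr unsigned kColumnBits = 10;
constexpr unsigned kChannelBits = 1;
constexpr unsigned kBankBits = 4;

// Extracts the lowest `count` bits and shifts them out of `address`.
std::uint64_t take_bits(std::uint64_t& address, unsigned count) {
    const std::uint64_t field = address & ((std::uint64_t{1} << count) - 1);
    address >>= count;
    return field;
}

DecodedAddress decode(std::uint64_t address) {
    DecodedAddress d{};
    d.column = take_bits(address, kColumnBits);
    d.channel = take_bits(address, kChannelBits);
    d.bank = take_bits(address, kBankBits);
    d.row = address;  // everything above the bank bits
    return d;
}
```

With such a mapping, the channel field selects the channel controller to which the arbiter forwards the transaction.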
The channel controller is the centerpiece of the DRAM simulation, consisting of a \textit{scheduler}, \textit{bank machines}, \textit{refresh managers}, \textit{power down managers}, a \textit{response queue} and a \textit{command multiplexer}.
New incoming requests get placed into the scheduler.
The purpose of the scheduler is to group transactions by their accessed memory bank and reorder the payloads according to a predefined policy.
The channel controller is the most important module of the DRAM simulation, consisting of a \textit{scheduler}, \textit{bank machines}, \textit{refresh managers}, \textit{power down managers}, a \textit{response queue} and a \textit{command multiplexer}.
New incoming requests are placed into the scheduler.
The purpose of the scheduler is to group transactions by their corresponding memory bank and reorder the payloads according to a predefined policy.
Available policies are, for example, the \textit{first-in, first-out} or the \textit{first-ready - first-come, first-served} policy.
The former policy does not reorder payloads and therefore optimizes for a short response latency, whereas the latter policy does reorder payloads and optimizes for a high memory bandwidth.
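The difference between the two policies can be sketched in a few lines of plain C++. This is a simplified illustration and not the DRAMSys implementation: \texttt{Request}, \texttt{pick\_fifo} and \texttt{pick\_fr\_fcfs} are hypothetical names, and the queue is assumed to be ordered by arrival time.

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// A pending request; its position in the queue reflects its arrival order.
struct Request {
    unsigned bank;
    unsigned row;
};

// First-in, first-out: always select the oldest request.
std::optional<std::size_t> pick_fifo(const std::vector<Request>& queue) {
    if (queue.empty()) return std::nullopt;
    return std::size_t{0};
}

// First-ready, first-come, first-served: select the oldest request that hits
// the currently open row of its bank; fall back to the oldest request.
// open_rows[b] holds the open row of bank b, or std::nullopt if b is precharged.
std::optional<std::size_t> pick_fr_fcfs(
    const std::vector<Request>& queue,
    const std::vector<std::optional<unsigned>>& open_rows) {
    for (std::size_t i = 0; i < queue.size(); ++i) {
        const Request& r = queue[i];
        if (r.bank < open_rows.size() && open_rows[r.bank] == r.row) return i;
    }
    return pick_fifo(queue);
}
```

Serving row hits first avoids additional precharge and activate commands, which is why this policy favors bandwidth over the latency of individual requests.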
A bank machine, whose responsibility is to manage the state of its corresponding memory bank, then fetches the next transaction from the scheduler.
There are also a number of available policies for the bank machines, each of which determine in which state the bank is being held after a completed memory request.
There are also a number of available policies for the bank machines, each of which determines the state in which the bank is held after a memory request is completed.
With the fetched transaction, the bank machine then selects the command that it needs to send to its memory bank to enforce its policy.
With the fetched transaction, the bank machine then selects the command that it needs to send to its memory bank.
However, the selected command cannot be sent instantaneously to the DRAM, as complex timing constraints need to be satisfied before the issuance of a specific command.
To obey those timing constraints, the bank machine verifies through the so-called \textit{timing checker} that the selected command is allowed to be sent to the memory.
The bank machine then tries to enqueue the command, so that the controller can send it out.
To meet these timing requirements, the bank machine uses the so-called \textit{timing checker} to check whether the selected command may be sent to memory.
The bank machine then tries to enqueue the command, so that the controller can send it to the DRAM.
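The command selection of a bank machine can be sketched as a function of the bank state and the fetched transaction. The sketch is a simplification that ignores timing constraints and page policies; \texttt{BankState}, \texttt{Access} and \texttt{next\_command} are hypothetical names and do not stem from DRAMSys.

```cpp
#include <optional>

enum class Command { PRE, ACT, RD, WR };

// State of one memory bank: the open row, or std::nullopt if precharged.
struct BankState {
    std::optional<unsigned> open_row;
};

// The part of a fetched transaction that matters for command selection.
struct Access {
    unsigned row;
    bool is_write;
};

// Selects the next command needed to serve `access` on `bank`.
Command next_command(const BankState& bank, const Access& access) {
    if (!bank.open_row) return Command::ACT;                 // bank is precharged
    if (*bank.open_row != access.row) return Command::PRE;   // row miss: close the row first
    return access.is_write ? Command::WR : Command::RD;      // row hit
}
```

A single memory access may therefore require up to three commands in this sketch: a precharge, an activate and finally the read or write.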
The task of the command multiplexer is to select one command out of all commands that have been enqueued by the bank machines, the refresh managers or the power down managers.
The command multiplexer also has a set of configurable policies, that determine the commands individual priorities.
The selected command then will be sent out to the DRAM by the controller.
The command multiplexer also has a set of configurable policies that determine the individual priorities of the commands.
The selected command is then sent to the DRAM by the controller.
The last important module to mention is the response queue.
The completed DRAM transactions are enqueued into the response queue to send the responses back to the initiators.
In the response queue, the responses can either be passed to the initiator using the \textit{first-in, first-out} scheme, or firstly be reordered in the queue itself.
A completed DRAM transaction is enqueued into the response queue by the controller to send the responses back to the initiators.
In the response queue, transactions can either be returned to the initiator according to the scheme \textit{first-in, first-out} or be reordered in the queue.
A reordering might be necessary to be able to support initiators that cannot handle \textit{out-of-order} responses.
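Such a reordering can be sketched as a small buffer that holds back completed transactions until all older ones have completed. The class \texttt{InOrderResponseQueue} is a hypothetical illustration, not the DRAMSys response queue, and it assumes that the requests carry consecutive identifiers starting at zero in the order in which they were issued.

```cpp
#include <cstdint>
#include <set>
#include <vector>

class InOrderResponseQueue {
public:
    // Registers the completion of transaction `id` and returns the identifiers
    // of all responses that can now be sent, in the original request order.
    std::vector<std::uint64_t> complete(std::uint64_t id) {
        completed_.insert(id);
        std::vector<std::uint64_t> ready;
        // Release responses as long as the next expected one is available.
        while (completed_.erase(next_id_) == 1) {
            ready.push_back(next_id_);
            ++next_id_;
        }
        return ready;
    }

private:
    std::set<std::uint64_t> completed_;  // completed but not yet released
    std::uint64_t next_id_ = 0;          // oldest response not yet released
};
```

If transaction 1 completes before transaction 0, the first call returns nothing, and the completion of transaction 0 releases both responses at once.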
% Evtl TA falls Bilder genutzt werden?
DRAMSys also provides the so-called \textit{Trace Analyzer}, a graphical tool that visualizes database files created by DRAMSys.
It shows the \texttt{REQ} and \texttt{RESP} phases between the initiator and the arbiter, the occupation of the command bus and data bus as well as representations of the different phases in the DRAM banks.
It makes visible the \texttt{REQ} and \texttt{RESP} phases between the initiator and the arbiter, the occupation of the command bus and data bus as well as representations of the different phases in the DRAM banks.
An example trace database, visualized in the Trace Analyzer, is shown in Figure \ref{fig:traceanalyzer}.
Furthermore, the Trace Analyzer is capable of calculating numerous metrics and creating plots of interesting characteristics.
\begin{figure}%[!ht]
\begin{figure}
\begin{center}
\includegraphics[width=\linewidth]{img/traceanalyzer.pdf}
\caption[Exemplary visualization of a trace database in the Trace Analyzer.]{Exemplary visualization of a trace database in the Trace Analyzer. The used DRAM consists of one rank and eight bank groups with two banks each.}
@@ -67,4 +68,4 @@ Furthermore, the Trace Analyzer is capable of calculating numerous metrics and c
\end{center}
\end{figure}
In Section \ref{sec:implementation} of this thesis the new special traffic generator for DRAMSys will be developed.
In Section \ref{sec:implementation} of this thesis, a new simulation frontend for DRAMSys will be developed.