\section{Future Work}
\label{sec:future_work}

Due to the complexity of possible memory sub-system configurations, simulation is an indispensable part of the development process of today's systems.
It not only has a high impact on the development cost but also significantly reduces the time-to-market and enables the rapid release of new products.
However, accurately simulating a specific application takes a long time because of the detailed processor core models.
On the other hand, fixed- or relative-time memory traces allow faster simulation at the expense of accuracy, which often makes them unsuitable.
To fill this gap, this thesis introduced a new simulation frontend for DRAMSys that is fast and makes only few compromises on accuracy.

In conclusion, the newly developed instrumentation tool provides a flexible way of generating traces for arbitrary multi-threaded applications.
The mature DRAMSys simulator framework can then be used to explore the design space and vary numerous configuration parameters of the DRAM subsystem to find a well-suited set of options.

It was shown that, in comparison to the well-established full-system simulation framework gem5, only small deviations have to be accepted.
Also, the Pin-tool-based memory access tracing of the Ramulator DRAM simulator was compared to the new frontend. % TODO: briefly summarize the results here
A noteworthy advantage of the newly developed tool is its support for all hardware architectures that DynamoRIO provides (currently IA-32, x86-64, ARM, and AArch64) in contrast to the supported architectures of Pin (IA-32 and x86-64).

Still, there is room for improvement.
To improve the simulation runtime, a binary trace format could be used instead of the text-based format.
Such a binary format should improve the performance of both tracing and parsing.

As mentioned in Section~\ref{sec:cache_implementation}, the cache models do not yet guarantee cache coherency due to the lack of a snooping protocol.
Although implementing such a protocol can be a complex task, it could be added in future work.

A less impactful inaccuracy results from the scheduling of the application's threads in the new simplified core models.
While an application can spawn an arbitrary number of threads, the platform may not be able to process them all in parallel.
Currently, the new trace player does not take this into account and runs all threads in parallel.
This deviation could be prevented by recording which processor cores were used on the system where the trace was captured and using this information to better match the scheduling.

Another inaccuracy can be caused by the hyperthreading of some of today's processors:
While hyperthreading enables a processor core to execute two threads in parallel, those threads share the same first-level cache.
Currently, this is not taken into account and every application thread is assigned its own first-level cache.

Further room for improvement lies in considering the special instructions, such as prefetch instructions, that the architectures provide.
DynamoRIO already offers an interface to catch those instructions without much effort.
Support for this would have to be added to the core and cache models as well as to the memory trace format.

The recorded number of computational instructions between two consecutive memory accesses, which is used to estimate the time between those accesses, is multiplied by the clock period of the trace player.
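Written as a formula, with symbols that are introduced here only for illustration, the estimated delay $\Delta t_i$ between the $i$-th memory access and the next one is
\begin{equation}
  \Delta t_i = n_i \cdot T_{\mathrm{clk}},
\end{equation}
where $n_i$ denotes the recorded number of computational instructions between these two accesses and $T_{\mathrm{clk}}$ the clock period of the trace player.
This effectively assumes that every computational instruction takes exactly one clock cycle of the trace player.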
However, this is a vast simplification of the real timing behavior of a processor.
In the future, the DynamoRIO tool could decode those computational instructions and derive a better estimate of their execution time, based on previously published statistical estimates~\cite{Abel19a,Fog2022}.

One significant improvement that could still be applied is the consideration of dependencies between the memory accesses.
Similarly to the elastic trace player of gem5~\cite{Jagtap2016}, which captures data load and store dependencies by instrumenting a detailed out-of-order processor model, the DynamoRIO tool could create a dependency graph of the memory accesses using the decoded instructions.
With this technique, the out-of-order behavior of modern processors could also be modeled, making the simulation more accurate, whereas the current implementation is entirely in-order.

The potential improvements mentioned above could make the new simulation frontend for DRAMSys even more accurate.