\section{Conclusion and Future Work}
\label{sec:future_work}

Due to the complexity of possible memory subsystem configurations, simulation is an indispensable part of the development process of today's systems.
It not only has a high impact on the development cost but also significantly reduces the time-to-market and enables the rapid release of new products.
However, the accurate simulation of a specific application takes a long time because of the detailed processor core models.
On the other hand, fixed or relative time memory traces allow faster simulation at the expense of accuracy, which often makes them unsuitable.
To fill this gap, this thesis introduced a new simulation frontend for DRAMSys, which speeds up the process while making only a few compromises on accuracy.

In conclusion, the newly developed instrumentation tool provides a flexible way of generating traces for arbitrary multi-threaded applications.
The mature DRAMSys simulator framework can then be used to explore the design space and vary numerous configuration parameters of the DRAM subsystem to find a well-suited set of options.

It was shown that, compared to the well-established full-system simulation framework gem5, only some deviations have to be accepted.
Also, the Pin-tool-based memory access tracing of the Ramulator DRAM simulator was compared to the new frontend. % (briefly summarize the results here)
Although Ramulator takes a slightly different approach to trace generation than this thesis, a very good correlation in the results could be demonstrated.
A noteworthy advantage of the newly developed tool is its support for all hardware architectures that DynamoRIO provides (currently IA-32, x86-64, ARM, and AArch64), in contrast to the supported architectures of Pin (IA-32 and x86-64).

Still, there is room for improvement.
To improve the simulation runtime, a binary trace format could be used instead of the text-based format.
Such a binary format should improve the performance of both tracing and parsing.

As mentioned in Section~\ref{sec:cache_implementation}, the cache models do not yet guarantee cache coherency due to the lack of a snooping protocol.
Although implementing such a protocol can be a complex task, it could be addressed in future work.

A less impactful inaccuracy results from the scheduling of the application's threads in the new simplified core models.
While an application can spawn an arbitrary number of threads, the platform may not be able to process them all in parallel.
Currently, the new trace player does not take this into account and runs all threads in parallel.
This deviation could be prevented by recording which processor cores are used on the initial system and using this information to better match the scheduling.

Another inaccuracy can be caused by the hyperthreading of some of today's processors:
While hyperthreading enables two threads to be processed in parallel in one processor core, those threads share the same first-level cache.
Currently, this is not taken into account, and each application thread is assigned its own first-level cache.

Further room for improvement lies in considering the special instructions, such as prefetch instructions, that the architectures provide.
DynamoRIO already offers an interface to catch those instructions without much effort.
Support for this would have to be added to the core and cache models as well as the memory trace format.

The recorded number of computational instructions between two consecutive memory accesses, which is used to estimate the time between those accesses, is multiplied by the clock period of the trace player.
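Written as a formula, this estimate can be sketched as follows, where the symbols $\Delta t_i$, $n_i$, and $t_{\mathrm{clk}}$ are notation assumed here purely for illustration:
\[
  \Delta t_i = n_i \cdot t_{\mathrm{clk}}
\]
Here, $n_i$ denotes the recorded number of computational instructions between memory access $i$ and the following access, $t_{\mathrm{clk}}$ the clock period of the trace player, and $\Delta t_i$ the resulting estimated time between the two accesses.
With the hypothetical values $n_i = 10$ and $t_{\mathrm{clk}} = 0.5\,\mathrm{ns}$, the estimate is $\Delta t_i = 5\,\mathrm{ns}$, which amounts to assuming that every computational instruction takes exactly one clock cycle.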
However, this is a vast simplification of the real timing behavior of a processor.
In the future, the DynamoRIO tool could decode those computational instructions and create a better estimate of their execution time, based on statistical estimates that have been published before \cite{Abel19a, Fog2022}.

One significant improvement that could still be made is the consideration of dependencies between the memory accesses.
Similarly to the elastic trace player of gem5 \cite{Jagtap2016}, which captures data load and store dependencies by instrumenting a detailed out-of-order processor model, the DynamoRIO tool could create a dependency graph of the memory accesses using the decoded instructions.
By using this technique, it would also be possible to model the out-of-order behavior of modern processors and make the simulation more accurate, whereas the current implementation is entirely in-order.