
\section{Conclusion and Future Work}
\label{sec:future_work}

Due to the complexity of possible memory subsystem configurations, simulation is an indispensable part of the development process of today's systems.
It not only lowers the development cost but also significantly reduces the time-to-market and enables the rapid release of new products.
However, the accurate simulation of a specific application takes a long time because of the detailed processor core models.
On the other hand, fixed or relative time memory traces allow faster simulation at the expense of accuracy, which often makes them unsuitable.
To fill this gap, this thesis introduced a new simulation frontend for DRAMSys, which speeds up the process while making only a few compromises on accuracy.

In conclusion, the newly developed instrumentation tool provides a flexible way of generating traces for arbitrary multi-threaded applications.
The mature DRAMSys simulator framework can then be used to explore the design space and vary numerous configuration parameters of the DRAM subsystem to find a well-suited set of options.

It was shown that in comparison to the well-established full-system simulation framework gem5, only some deviations have to be accepted.
In addition, the Pin-tool-based memory access tracing of the Ramulator DRAM simulator was compared to the new frontend. % (briefly summarize the results here)
Although Ramulator takes a slightly different approach to trace generation than this thesis, a very good correlation in the results could be demonstrated.
A noteworthy advantage of the newly developed tool is its support for all hardware architectures that DynamoRIO provides (currently IA-32, x86-64, ARM, and AArch64) in contrast to the supported architectures of Pin (IA-32 and x86-64).

Still, there is room for improvement.
To improve the simulation runtime, a binary trace format could be used instead of the text-based format.
Such a format should improve performance during both tracing and parsing.

As mentioned in Section~\ref{sec:cache_implementation}, the cache models do not yet guarantee cache coherency due to the lack of a snooping protocol.
Although implementing such a protocol is a complex task, it could be added in future work.

A less impactful inaccuracy results from the scheduling of the application's threads in the new simplified core models.
While an application can spawn an arbitrary number of threads, the platform may not be able to process them all in parallel.
Currently, the new trace player does not take this into account and runs all threads in parallel.
This deviation could be prevented by recording which processor cores were used on the original system and using this information to better match the scheduling.

Another inaccuracy can be caused by the hyperthreading of some of today's processors:
While hyperthreading allows two hardware threads to execute concurrently on one processor core, those threads share the same first-level cache.
Currently, this is not taken into account, and each application thread is assigned its own first-level cache.

Further room for improvement lies in considering the special instructions the architectures provide, such as prefetch instructions.
DynamoRIO already offers an interface to detect those instructions without much effort.
Support for this would have to be added to the core and cache models as well as the memory trace format.

The recorded number of computational instructions between each memory access, which is used to estimate the time between those accesses, is multiplied by the clock period of the trace player.
However, this is a vast simplification of the real timing behavior of a processor.
In the future, the DynamoRIO tool could decode those computational instructions and create a better estimate of the execution time of those instructions, based on statistical estimates that have been published before \cite{Abel19a, Fog2022}.
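
The difference between the two timing models can be sketched as follows. The symbols are introduced here purely for illustration and are not part of the implementation. Let $n_k$ denote the recorded number of computational instructions preceding memory access $k$ and $T_\mathrm{clk}$ the clock period of the trace player. The current model then delays access $k$ by
\begin{equation}
	\Delta t_k = n_k \cdot T_\mathrm{clk},
\end{equation}
which implicitly assumes a cost of one clock cycle per instruction. A refined model could assign each decoded instruction $i$ an estimated cost $c_{k,i}$ in clock cycles, for example taken from published instruction tables:
\begin{equation}
	\Delta t_k = T_\mathrm{clk} \cdot \sum_{i=1}^{n_k} c_{k,i}.
\end{equation}
For $c_{k,i} = 1$, this reduces to the current model. Note that a plain sum still ignores the overlap of instructions in a pipelined or superscalar core, so it should be regarded as a rough estimate only.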

One significant improvement that could still be applied is the consideration of dependencies between the memory accesses.
Similarly to the elastic trace player of gem5 \cite{Jagtap2016}, which captures data load and store dependencies by instrumenting a detailed out-of-order processor model, the DynamoRIO tool could create a dependency graph of the memory accesses using the decoded instructions.
By using this technique, it is possible to also model out-of-order behavior of modern processors and make the simulation more accurate, whereas the current implementation is entirely in-order.
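
One possible way to express such a dependency-aware replay is given below. It is an illustrative assumption and not the exact formulation used by gem5. Let $D(k)$ be the set of earlier memory accesses on which access $k$ depends, $t_i^\mathrm{done}$ the time at which access $i$ completes, and $\Delta t_k$ the estimated computation time preceding access $k$. An access with at least one dependency would then be issued at
\begin{equation}
	t_k^\mathrm{issue} = \max_{i \in D(k)} t_i^\mathrm{done} + \Delta t_k.
\end{equation}
Accesses whose dependencies are already satisfied could be issued independently of the program order, which is what would allow out-of-order behavior to emerge in the trace player.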