Caches begin

This commit is contained in:
2022-05-21 19:59:44 +02:00
parent 0c166fce8d
commit 696b2b05d2
6 changed files with 191 additions and 1 deletions


@@ -60,6 +60,12 @@ encoding=UTF-8
highlight=LaTeX
mode=LaTeX
[item:inc/4.caches.tex]
archive=true
encoding=UTF-8
highlight=LaTeX
mode=LaTeX
[item:inc/6.implementation.tex]
archive=true
encoding=UTF-8

doc.bib

@@ -30,4 +30,17 @@
url = {http://doi.acm.org/10.1145/3297858.3304062},
}
@Book{Jacob2008,
author = {B. Jacob and S. W. Ng and D. T. Wang},
publisher = {Morgan Kaufmann},
title = {Memory Systems: Cache, DRAM, Disk},
year = {2008},
}
@Article{Jahre2007,
author = {Jahre, M. and Natvig, L.},
title = {Performance Effects of a Cache Miss Handling Architecture in a Multi-core Processor},
year = {2007},
}
@Comment{jabref-meta: databaseType:bibtex;}


@@ -164,6 +164,10 @@
\newpage
\clearpage
\input{inc/4.caches}
\newpage
\clearpage
\input{inc/6.implementation}
\newpage
\clearpage

img/address.tikz Normal file

@@ -0,0 +1,37 @@
\begin{tikzpicture}
\begin{pgfonlayer}{nodelayer}
\node [style=none] (0) at (0, 0) {};
\node [style=none] (2) at (13, 0) {};
\node [style=none] (3) at (13, -1) {};
\node [style=none] (4) at (19, 0) {};
\node [style=none] (5) at (19, -1) {};
\node [style=none] (6) at (24, 0) {};
\node [style=none] (7) at (24, -1) {};
\node [style=none] (9) at (0, -1) {};
\node [style=none] (10) at (6.5, -0.5) {Tag};
\node [style=none] (11) at (16, -0.5) {Index};
\node [style=none] (12) at (21.5, -0.5) {Byte Offset};
\node [style=none] (13) at (0.5, 0.5) {31};
\node [style=none] (14) at (23.5, 0.5) {0};
\node [style=none] (15) at (19.5, 0.5) {3};
\node [style=none] (16) at (18.5, 0.5) {4};
\node [style=none] (17) at (13.5, 0.5) {13};
\node [style=none] (18) at (12.5, 0.5) {14};
\node [style=none] (19) at (6.5, 0.5) {\dots};
\node [style=none] (20) at (16, 0.5) {\dots};
\node [style=none] (21) at (21.5, 0.5) {\dots};
\end{pgfonlayer}
\begin{pgfonlayer}{edgelayer}
\draw (3.center)
to (9.center)
to [in=270, out=90] (0.center)
to (2.center)
to (4.center)
to (6.center)
to (7.center)
to (5.center)
to cycle;
\draw (2.center) to (3.center);
\draw (4.center) to (5.center);
\end{pgfonlayer}
\end{tikzpicture}


@@ -11,7 +11,7 @@ It is mainly based on the chapters \textit{DynamoRIO} and \textit{Code Cache}
This is achieved through the injection of additional instructions into the instruction trace of the target application.
Debuggers, on the other hand, use special breakpoint instructions (e.g. INT3 on x86 or BKPT on ARM) that get injected at specific places in the code, raising a debug exception when reached.
- At those exceptions a context switch to the operating system kernel will be performed, however, those context switches result in a significant performance penalty as the processor state has to be saved and restored afterwards. (TODO: add a literature reference here)
+ At those exceptions a context switch to the operating system kernel will be performed, however, those context switches result in a significant performance penalty as the processor state has to be saved and restored afterwards.
Because the instrumentation tool runs in the same process as the application, it is important that it operates transparently, meaning that it will not affect the application behavior in unintended ways.
This is a special challenge as the dynamic instrumentation is not allowed to use the same memory routines or input/output buffering as the target application \cite{Bruening2003}.

inc/4.caches.tex Normal file

@@ -0,0 +1,130 @@
\section{Caches}
\label{sec:caches}
In this section, the necessity and functionality of caches in modern computing systems will be explained, as well as the considerations that arise from virtual memory addressing.
A special focus will also be placed on non-blocking caches.
The theory will be based on the chapters \textit{An Overview of Cache Principles} and \textit{Logical Organization} of \cite{Jacob2008} and on \cite{Jahre2007}.
With the advancement of ever faster multi-core processors, the performance gap between the processor and the main \revabbr{dynamic random-access memory}{DRAM} keeps growing, which is commonly referred to as the \textit{memory wall}.
Caches, whose goal is to decrease the latency and increase the bandwidth of accesses to the memory subsystem, therefore play an important role in the overall performance of computing systems.
Caches are faster than DRAM, but only provide a small capacity, as their per-bit cost is higher.
For this reason, at least the \textit{working set}, the data that the currently running application is working on, should fit in the cache.
The two most important heuristics that make this possible will be explained in section \ref{sec:caches_locality_principles}.
After that, the typical structure of a cache will be discussed in section \ref{sec:caches_logical_organization}, followed by the replacement policies in section \ref{sec:replacement_policies} and the write policies in section \ref{sec:write_policies}.
Section \ref{sec:caches_virtual_addressing} then covers the considerations that arise from virtual addressing.
Finally, the advantage of non-blocking caches is the topic of section \ref{sec:caches_non_blocking_caches}.
\subsection{Locality Principles}
\label{sec:caches_locality_principles}
Access patterns of a typical application are not random.
They tend to repeat themselves in time or lie in the near surroundings of previous accesses.
These two tendencies are called \textit{temporal locality} and \textit{spatial locality}.
\subsubsection{Temporal Locality}
Referenced data is likely to be referenced again by the application in the future.
This is the most important characteristic that makes it possible for a cache to optimize the access latency.
When new data is referenced, it is fetched from the main memory and kept in the cache.
Operations using this data can then perform their calculations and reuse intermediate results by only accessing the cache, exploiting this tendency of the application.
\subsubsection{Spatial Locality}
Programs have a tendency to reference data that is nearby already referenced data in the memory space.
This is because related data is often clustered together, for example in arrays or structures.
When calculations are performed on those arrays, sequential access patterns can be observed as one element is processed after the other.
This tendency can be exploited by organizing blocks of data in so-called \textit{cache blocks} or \textit{cache lines}, which are larger than a single data word.
This is a passive form of making use of spatial locality, as referenced data will also cause nearby words to be loaded into the same cache line, making them available for further accesses.
An active form of exploiting spatial locality is the use of \textit{prefetching}.
Here, the cache is caused to fetch more than one cache line from the underlying memory system in anticipation of future accesses.
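The passive effect of cache lines on sequential array accesses can be sketched as follows; the line size and the function name are illustrative assumptions, not taken from the text:

```python
LINE_SIZE = 64  # assumed cache line size in bytes (a common value)

def line_number(address: int) -> int:
    """Cache line that a byte address falls into."""
    return address // LINE_SIZE

# Sequential accesses to a 4-byte-element array starting at 0x1000:
addresses = [0x1000 + 4 * i for i in range(32)]
lines = sorted(set(line_number(a) for a in addresses))

# 32 elements * 4 bytes = 128 bytes span exactly two 64-byte lines,
# so only two fetches from memory are needed; all remaining accesses
# are spatial-locality hits on already-loaded lines.
print(lines)  # → [64, 65]
```

The first access to each line misses and loads the whole line; the other 30 accesses hit without touching DRAM.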
\subsection{Logical Organization}
\label{sec:caches_logical_organization}
This section concerns the question of where fetched data is stored in the cache.
Because the cache is much smaller than the DRAM, only a subset of the memory can be held in the cache at a time.
Into which cache line a block of memory is placed is determined by the \textit{placement policy}.
There are three main policies:
\begin{itemize}
\item
In \textit{direct-mapped caches} the cache is divided into multiple sets with a single cache line in each set.
For every address there is exactly one cache line in which the data can be placed.
\item
In a \textit{fully associative cache} there is only one large set, containing all available cache lines.
Referenced data can be placed in any cache line without restriction.
\item
\textit{Set-associative caches} are a hybrid form of the former two: There are multiple sets containing several cache lines each.
The address determines the corresponding set, within which the data can be placed in any of the cache lines.
\end{itemize}
In all three cases, the least significant portion of the physical address above the byte offset, the \textit{index}, determines the set in which the data is stored.
However, several addresses in the DRAM map to the same set, so the remaining most significant portion of the address is used as a \textit{tag} and is stored next to the actual data in the cache line.
After an entry is fetched from the cache, the tag is used to determine whether the entry actually corresponds to the referenced data.
An example subdivision of the address into tag, index and byte offset is shown in figure \ref{fig:address_mapping}.
\input{img/thesis.tikzstyles}
\begin{figure}[!ht]
\begin{center}
\tikzfig{img/address}
\caption{Example address mapping for the tag, index and byte offset.}
\label{fig:address_mapping}
\end{center}
\end{figure}
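For the bit layout of figure \ref{fig:address_mapping} (bits 0--3 as byte offset, bits 4--13 as index, the remaining bits as tag), the decomposition can be sketched with plain bit operations; the constants and the function name are ours:

```python
OFFSET_BITS = 4   # bits 0-3: byte offset within a 16-byte cache line
INDEX_BITS = 10   # bits 4-13: selects one of 1024 sets
# the remaining bits 14-31 of a 32-bit address form the tag

def split_address(addr: int) -> tuple:
    """Split a physical address into (tag, index, byte offset)."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

tag, index, offset = split_address(0x12345678)
print(hex(tag), hex(index), hex(offset))  # → 0x48d1 0x167 0x8
```

The cache compares `tag` against the tag stored in set `index` to decide whether the lookup is a hit.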
Direct-mapped caches have the advantage that only one tag has to be compared with the address.
However, every time new data is referenced that maps to an already occupied set, the resident cache line has to be evicted.
This leads to an overall lower cache hit rate than with the other two policies.
In a fully associative cache, a memory reference can be placed anywhere; consequently, all cache lines have to be fetched and their tags compared against the address.
Although this policy has the highest potential cache hit rate, the space consumption of the comparators and the power consumption of the lookup process make it infeasible for many systems.
The hybrid approach of set-associative caches offers a trade-off between both policies.
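The effect of conflict misses in a direct-mapped cache can be made concrete with a small simulation; the parameters (16-byte lines, 16 KiB capacity), the LRU replacement within a set, and the function name are illustrative assumptions:

```python
def count_hits(addresses, num_sets, ways, line_size=16):
    """Simulate a cache and count hits; LRU replacement within each set."""
    sets = [[] for _ in range(num_sets)]  # each set: tags in LRU order
    hits = 0
    for addr in addresses:
        line = addr // line_size
        index, tag = line % num_sets, line // num_sets
        s = sets[index]
        if tag in s:
            hits += 1
            s.remove(tag)   # re-append below: most recently used
        elif len(s) == ways:
            s.pop(0)        # evict the least recently used line
        s.append(tag)
    return hits

# Two addresses whose lines map to the same set, accessed alternately:
trace = [0x0000, 0x4000] * 8
direct = count_hits(trace, num_sets=1024, ways=1)  # direct-mapped
two_way = count_hits(trace, num_sets=512, ways=2)  # same capacity, 2-way
print(direct, two_way)  # → 0 14
```

With the same total capacity, the direct-mapped cache thrashes (every access evicts the other line), while the two-way set-associative cache keeps both lines resident after the two initial misses.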
\subsection{Replacement Policies}
\label{sec:replacement_policies}
In case of contention, cache lines have to be evicted.
Which cache line of the corresponding set gets evicted is determined by one of several replacement policies:
\begin{itemize}
\item
The random policy selects a cache line of a set at random.
\item
The \revabbr{least recently used}{LRU} policy selects the cache line whose last usage is the longest time ago.
An LRU algorithm is expensive to implement, as a counter value for every cache line of the set has to be updated every time the set is accessed.
\item
An alternative is a \revabbr{Pseudo-LRU}{PLRU} policy, where one extra bit per cache line is set to 1 every time the line is accessed.
When the extra bits of all cache lines in a set are set to 1, they get reset to 0.
In case of contention, the first cache line whose extra bit is 0 will be evicted, as a cleared bit indicates that the last usage was some time ago.
\item
In the \revabbr{least frequently used}{LFU} policy, every time a cache line is accessed, a counter value will be increased.
The cache line with the lowest value, the least frequently used one, will be chosen to be evicted.
\item
The \revabbr{first in first out}{FIFO} policy evicts the cache lines in the same order they were placed.
\end{itemize}
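The one-bit PLRU scheme described above can be sketched for a single set; this is our interpretation of the description (after a reset, the bit of the just-accessed line is kept at 1 so it is not the next eviction candidate), and the class name is illustrative:

```python
class PLRUSet:
    """One cache set using the one-bit PLRU policy sketched above."""
    def __init__(self, ways: int):
        self.tags = [None] * ways
        self.bits = [0] * ways

    def _touch(self, way: int):
        self.bits[way] = 1
        if all(self.bits):                   # all extra bits set:
            self.bits = [0] * len(self.bits)
            self.bits[way] = 1               # reset, keeping current access

    def access(self, tag) -> bool:
        """Return True on a hit, False on a miss (with fill/eviction)."""
        if tag in self.tags:
            self._touch(self.tags.index(tag))
            return True
        if None in self.tags:                # fill an empty way first
            victim = self.tags.index(None)
        else:                                # evict first line with bit 0
            victim = self.bits.index(0)
        self.tags[victim] = tag
        self._touch(victim)
        return False

s = PLRUSet(ways=2)
s.access("A"); s.access("B")   # two misses fill the set
print(s.access("A"))           # → True (hit)
```

Compared to true LRU, only one bit per line has to be maintained instead of a counter, at the cost of occasionally evicting a recently used line.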
\subsection{Write Policies}
\label{sec:write_policies}
To maintain consistency with the underlying memory subsystem, special care has to be taken when a write access occurs.
In case of a \textit{write-through} cache, the underlying memory is updated immediately, meaning the updated value is directly written to the DRAM as well.
Because the DRAM provides a significantly lower bandwidth than the cache, this comes at a performance penalty.
To mitigate the problem, a write buffer can be used, which allows the processor to make further progress while the data is being written.
An alternative is a so-called \textit{write-back} cache.
Instead of writing the updated value immediately to the underlying memory, it is written back when the corresponding cache line is evicted.
Here too, a write buffer can be used to place the actual write-back requests into a queue.
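The difference in DRAM traffic between the two policies can be sketched by counting the writes that reach the DRAM; the event format and the function name are made up for illustration:

```python
def dram_writes(trace, policy: str) -> int:
    """Count writes reaching DRAM for a stream of (op, line) events."""
    dirty = set()       # lines modified but not yet written back
    writes = 0
    for op, line in trace:
        if op == "write":
            if policy == "write-through":
                writes += 1         # every store goes straight to DRAM
            else:                   # write-back: only mark the line dirty
                dirty.add(line)
        elif op == "evict" and policy == "write-back" and line in dirty:
            writes += 1             # dirty line written back on eviction
            dirty.discard(line)
    return writes

# Ten stores to the same cache line, then its eviction:
trace = [("write", 0)] * 10 + [("evict", 0)]
print(dram_writes(trace, "write-through"))  # → 10
print(dram_writes(trace, "write-back"))     # → 1
```

Repeated writes to the same line thus reach the DRAM only once under write-back, which is exactly the bandwidth saving motivating that policy.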
\subsection{Virtual Addressing}
\label{sec:caches_virtual_addressing}
\subsection{Non-blocking Caches}
\label{sec:caches_non_blocking_caches}