\section{Caches}
\label{sec:caches}
In this section, the necessity and functionality of caches in modern computing systems are explained, as well as the required considerations resulting from virtual memory addressing.
A special focus is placed on non-blocking caches.
The theory is based on the chapters \textit{``An Overview of Cache Principles''} and \textit{``Logical Organization''} of \cite{Jacob2008} and on \cite{Jahre2007}.
With the advancement of faster multi-core processors, the performance difference to the main \revabbr{dynamic random-access memory}{DRAM} is increasing, commonly referred to as the \textit{memory wall}.
Therefore, caches, whose goal is to decrease the latency and increase the bandwidth of a memory access, play an important role in the overall performance of computing systems.
Caches are faster than DRAM, but only provide a small capacity, as their area cost per bit is significantly higher.
For this reason, at least the \textit{working set}, the data that the currently running application is working on, should be stored in the cache to improve performance.
The two most important heuristics that make this possible will be explained in Section \ref{sec:caches_locality_principles}.
After that, the typical structure of a cache will be discussed in Section \ref{sec:caches_logical_organization}.
Replacement policies will be explained in Section \ref{sec:replacement_policies} and write policies in Section \ref{sec:write_policies}, followed by the considerations required for virtual addressing in Section \ref{sec:caches_virtual_addressing}.
Section \ref{sec:caches_coherency} gives a short introduction to cache coherency and snooping.
Finally, the advantage of non-blocking caches is the topic of Section \ref{sec:caches_non_blocking_caches}.
\subsection{Locality Principles}
\label{sec:caches_locality_principles}
Access patterns of a typical application are not random.
They tend to repeat in time or occur in the immediate vicinity of previous accesses.
Those two heuristics are called \textit{temporal locality} and \textit{spatial locality}.
\subsubsection{Temporal Locality}
Temporal locality is the observation that referenced data is likely to be referenced again in the near future.
Taking advantage of this is the main idea behind a cache: when new data is referenced, it is read from the main memory and buffered in the cache.
The processor can then perform operations on this data and work with the results without needing to access the main memory again.
\subsubsection{Spatial Locality}
Programs have a tendency to reference data that is located near already referenced data in the memory address space.
This tendency, spatial locality, arises because related data is often clustered together, for example in arrays or structures.
When calculations are performed on those arrays, sequential access patterns can be observed as one element is processed after the other.
Spatial locality can be exploited by organizing blocks of data in so-called \textit{cache blocks} or \textit{cache lines}, which are larger than a single data word.
This is a passive way of making use of spatial locality: a reference to one word also causes nearby words to be loaded into the same cache line, making them available for subsequent accesses.
An active form of exploiting spatial locality is the use of \textit{prefetching}.
Here, the program causes the cache to fetch more than one cache line from the underlying memory.
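The effect of cache-line granularity on sequential accesses can be illustrated with a minimal sketch; the 64-byte line size and 8-byte word accesses are assumed example parameters, not values from this thesis:

```python
# Minimal sketch of spatial locality: with 64-byte cache lines (an assumed
# parameter), sequential 8-byte accesses miss once per line and then hit.
LINE_SIZE = 64
cached_lines = set()

def access(addr: int) -> str:
    """Record an access; a miss loads the whole enclosing cache line."""
    line = addr // LINE_SIZE
    if line in cached_lines:
        return "hit"
    cached_lines.add(line)
    return "miss"

# 16 sequential 8-byte word accesses covering two cache lines.
results = [access(a) for a in range(0, 128, 8)]
```

Only the first access to each line misses; the remaining seven words of the line are served from the cache.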
\subsection{Logical Organization}
\label{sec:caches_logical_organization}
This section revolves around the question of where to store the retrieved data in the cache.
Because the cache is much smaller than the DRAM, only a subset of the memory can be held in the cache at a time.
The cache line into which a block of memory is placed is determined by the \textit{placement policy}.
There are three main policies:
\begin{itemize}
In \textit{direct-mapped caches}, the cache is divided into multiple sets with a single cache line in each set.
For every address, there is only one cache line in which the data can be placed.
\item
In a \textit{fully associative cache} there is only one large set, containing all available cache lines.
There is no restriction on the cache line in which referenced data can be placed.
\item
\textit{Set-associative caches} are a hybrid form of the former two: there are multiple sets containing several cache lines each.
The address determines the corresponding set, in which the data can be placed in any of the cache lines.
\end{itemize}
\input{img/thesis.tikzstyles}
\begin{figure}
\begin{center}
\tikzfig{img/associativity}
\caption{Four organizations for a cache of eight blocks \cite{Jacob2008}.}
\label{fig:associativity}
\end{center}
\end{figure}
Figure \ref{fig:associativity} illustrates four different organizations for a cache of eight cache lines.
In all three cases, the least significant portion of the physical address of the referenced data, the \textit{index}, determines the set in which the data is to be stored.
However, several entries in the DRAM map to the same set, so the remaining most significant portion of the address is used as a \textit{tag} and is stored next to the actual data in the cache line.
After an entry is fetched from the cache, the tag is used to determine if the entry actually corresponds to the referenced data.
An example subdivision of the address into the index, tag and byte offset is shown in Figure \ref{fig:address_mapping}.
\begin{figure}[!ht]
\begin{center}
\tikzfig{img/address}
\caption{Subdivision of an address into tag, index and byte offset.}
\label{fig:address_mapping}
\end{center}
\end{figure}
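The subdivision can be made concrete with a small sketch. The line size and number of sets below are hypothetical example parameters (both powers of two), chosen only to show how the bit fields are extracted:

```python
# Decompose a physical address into tag, index and byte offset.
# LINE_SIZE and NUM_SETS are assumed example parameters, both powers of two.
LINE_SIZE = 64    # 64-byte cache lines -> 6 offset bits
NUM_SETS = 128    # 128 sets            -> 7 index bits

OFFSET_BITS = LINE_SIZE.bit_length() - 1
INDEX_BITS = NUM_SETS.bit_length() - 1

def split_address(addr: int) -> tuple:
    """Return (tag, index, offset): the offset selects the byte within the
    line, the index selects the set, and the remaining high bits form the tag."""
    offset = addr & (LINE_SIZE - 1)
    index = (addr >> OFFSET_BITS) & (NUM_SETS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset
```

Because the three fields partition the address bits, the original address can always be reassembled from them.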
Direct-mapped caches have the advantage that only one tag has to be compared with the address.
However, every time newly referenced data is placed into an occupied set, the resident cache line needs to be evicted.
This leads to an overall lower cache hit rate compared to the other two policies.
In a fully associative cache, a memory reference can be placed anywhere; consequently, the tags of all cache lines have to be fetched and compared.
Although this policy has the highest potential cache hit rate, the area cost of the additional comparators and the high power consumption of the lookup process make it infeasible for many systems.
The hybrid approach of set-associative caches offers a trade-off between both policies.
The term \textit{associativity} denotes the number of cache lines that are contained in a set.
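The restricted lookup of a set-associative cache can be sketched as follows; the 4-set, 2-way geometry is a hypothetical example, not a configuration from this thesis:

```python
# Hypothetical 4-set, 2-way set-associative tag store: only the ways of the
# indexed set are searched, and placement is restricted to that set.
NUM_SETS, NUM_WAYS = 4, 2
tag_store = [[None] * NUM_WAYS for _ in range(NUM_SETS)]

def lookup(index: int, tag: int) -> bool:
    """A hit requires a matching tag in one of the ways of the indexed set."""
    return tag in tag_store[index]

def place(index: int, way: int, tag: int) -> None:
    """Place a tag into one way of the set selected by the index."""
    tag_store[index][way] = tag
```

With associativity 2, only two tag comparators per lookup are needed, instead of one per cache line as in the fully associative case.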
\subsection{Replacement Policies}
\label{sec:replacement_policies}
In case of contention, cache lines have to be evicted.
To determine which cache line of the corresponding set is evicted, there are several replacement policies:
\begin{itemize}
\item
The random policy selects a cache line of a set at random.
\item
The \revabbr{least recently used}{LRU} policy selects the cache line whose last use lies furthest in the past.
An LRU algorithm is expensive to implement, as a counter value for every cache line of a set has to be updated every time the set is accessed.
\item
An alternative is a \revabbr{pseudo LRU}{PLRU} policy, where an extra bit is set to 1 every time a cache line is accessed.
Once the extra bit of every cache line in a set is set to 1, all of them are reset to 0.
\end{itemize}
\subsection{Write Policies}
\label{sec:write_policies}
To maintain consistency with the underlying memory subsystem, special care has to be taken when a write access occurs.
In case of a \textit{write-through} cache, the underlying memory is updated immediately, meaning the updated value will also directly be written into the DRAM.
Because the DRAM provides a significantly lower bandwidth than the cache, this comes at a performance penalty.
To mitigate the problem, a write buffer can be used, which allows the processor to make further progress while the data is written.
An alternative is a so-called \textit{write-back} cache.
Instead of writing the updated value immediately to the underlying memory, it will be written back when the corresponding cache line is evicted.
To identify whether a cache line has to be written back, a so-called \textit{dirty-bit} is used; it denotes whether the value has been updated while it has been in the cache.
If this is the case, it must be written back to ensure consistency; otherwise, the write-back is not necessary.
Here too, a write buffer can be used to place the actual write-back requests into a queue.
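The role of the dirty-bit can be summarized in a minimal sketch; the class and function names are illustrative, not the design used in this thesis:

```python
# Simplified write-back behaviour of a single cache line with a dirty-bit.
class CacheLine:
    def __init__(self):
        self.valid = False
        self.dirty = False
        self.addr = None
        self.data = None

def write(line, addr, data):
    """A write updates only the cache and sets the dirty-bit."""
    line.valid, line.dirty = True, True
    line.addr, line.data = addr, data

def evict(line, memory):
    """On eviction, the line is written back only if it was modified."""
    if line.valid and line.dirty:
        memory[line.addr] = line.data
    line.valid = line.dirty = False
```

The DRAM (here the `memory` dictionary) is only touched on the eviction of a dirty line, not on every write, which is exactly the bandwidth saving of the write-back policy.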
\subsection{Virtual Addressing}
\label{sec:caches_virtual_addressing}
Figure \ref{fig:virtual_address} shows an exemplary division of a virtual address into the virtual page number and the page offset.
Before a process can access a specific region in memory, the kernel has to translate the virtual page number into a physical page number.
For conversions, so-called \textit{page tables} are used to look up the physical page number.
Page tables are usually multiple levels deep (e.g. 4 levels on x86), so a single conversion can cause several memory accesses, which is expensive.
To improve performance, a \revabbr{translation lookaside buffer}{TLB} is used, which acts as a cache of its own for physical page numbers.
However, as long as the physical address is not present, the data cache cannot look up its entries, as the index is not known yet.
So the cache has to wait for the TLB, or even for multiple memory accesses if the physical page number is not stored in it.
To circumvent this problem, the cache can be indexed by the virtual address, which makes it possible to parallelize both procedures.
Such a cache is called \textit{virtually indexed} and \textit{physically tagged} and is illustrated in Figure \ref{fig:virtual_address_conversion}.
% Is the illustration from the book correct? Should the cache index really extend beyond the page offset?
\begin{figure}
\begin{center}
\tikzfig{img/virtual_address_conversion}
\caption[Virtually indexed, physically tagged cache \cite{Jacob2008}.]{Virtually indexed, physically tagged cache \cite{Jacob2008}. ASID refers to address-space identifier.}
\label{fig:virtual_address_conversion}
\end{center}
\end{figure}
The result from the TLB, the physical page number, needs to be compared to the tag that is stored in the cache.
When the tag and the physical page number match, the cache entry is valid for this virtual address.
Note that when the cache index is completely contained in the page offset, a further problem called \textit{aliasing} is avoided; it will not be discussed further in this thesis.
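The parallel lookup can be sketched numerically. The page size, line size and set count below are assumed example values, chosen so that the index bits lie entirely within the page offset and are therefore identical in the virtual and the physical address:

```python
# VIPT sketch: the index comes from the virtual address, the tag from the
# physical page number. All parameters are assumed example values.
PAGE_OFFSET_BITS = 12   # 4 KiB pages
OFFSET_BITS = 6         # 64-byte cache lines
INDEX_BITS = 6          # 64 sets; 6 + 6 <= 12, index fits in the page offset

def cache_index(vaddr: int) -> int:
    """Available before translation, since it only uses page-offset bits."""
    return (vaddr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)

def is_hit(stored_tag: int, ppn: int) -> bool:
    """The stored tag is compared with the TLB's physical page number."""
    return stored_tag == ppn
```

The cache can thus select the set while the TLB translates the page number, and only the final tag comparison has to wait for the translation.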
\subsection{Cache Coherency}
\label{sec:caches_coherency}
In multi-core environments, caches become a distributed system.
As every core uses its own set of caches and possibly shares a last-level cache with the other cores, a new problem arises.
If two or more cores operate on the same shared data, multiple copies of the data will be placed in the private caches, and it must be guaranteed that all cores agree on the actual value of the data at any point in time.
Different perceptions of the same data are to be considered errors.
Therefore, it is important to guarantee \textit{cache coherency}.
One of the solutions for cache coherency is the use of a so-called snooping protocol.
A cache snoops the cache coherence bus to examine whether it already holds a copy of the requested data.
Snooping packets are then used to update or invalidate other copies of the data.
Snooping protocols are complex, and it is difficult to formally verify that they in fact guarantee cache coherence.
For this reason, they are not further discussed in this thesis.
\subsection{Non-Blocking Caches}
\label{sec:caches_non_blocking_caches}
In blocking caches, cache misses require the processor to stall until the data is fetched from the underlying memory.
As this is a major slowdown, non-blocking caches try to solve this problem, making it possible for the processor to make further progress while waiting for the value.
Similarly to the write buffer previously discussed in Section \ref{sec:write_policies}, a new buffer is introduced: the \revabbr{miss status hold register}{MSHR}.
The number of MSHRs corresponds to the number of misses the cache can handle concurrently; when all available MSHRs are occupied and a further miss occurs, the cache will block.
An MSHR entry always corresponds to one cache line that is currently being fetched from the underlying memory subsystem.
There are two variants of cache misses: \textit{primary misses} are misses that lead to the occupation of a new MSHR, whereas \textit{secondary misses} are added to an existing MSHR entry and therefore cannot cause the cache to block.
This is the case when the same cache line is accessed again while its miss is still outstanding.
An architecture of an MSHR file is illustrated in Figure \ref{fig:mshr_file}.
\begin{figure}
\begin{center}
\tikzfig{img/mshr_file}
\caption[Miss Holding Status Register File \cite{Jahre2007}.]{Miss Holding Status Register File \cite{Jahre2007}. V refers to a valid bit.}
\label{fig:mshr_file}
\end{center}
\end{figure}
When the data for a cache miss is returned from the underlying memory, the cache is updated, all targets of the MSHR entry are served with the value, and the MSHR entry is eventually deallocated.
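The MSHR mechanics described above can be summarized in a small sketch; it is a deliberately simplified, hypothetical model, not the design used in this thesis:

```python
# Simplified MSHR file: primary misses occupy a new entry, secondary misses
# attach a target to an existing one, and the cache blocks when all MSHRs
# are occupied.
class MSHRFile:
    def __init__(self, num_mshrs: int):
        self.num_mshrs = num_mshrs
        self.entries = {}   # cache-line address -> list of miss targets

    def handle_miss(self, line_addr: int, target: str) -> str:
        if line_addr in self.entries:
            self.entries[line_addr].append(target)
            return "secondary"      # same cache line already in flight
        if len(self.entries) == self.num_mshrs:
            return "block"          # all MSHRs occupied: the cache stalls
        self.entries[line_addr] = [target]
        return "primary"            # a new MSHR is allocated

    def fill(self, line_addr: int) -> list:
        """Data returned from memory: serve all targets, free the entry."""
        return self.entries.pop(line_addr)
```

With a single MSHR, a second miss to the same line attaches as a secondary miss, while a miss to a different line blocks until the outstanding fill deallocates the entry.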