\section{Caches}
\label{sec:caches}
In this section, the necessity and functionality of caches in modern computing systems are explained, as well as the considerations required by virtual memory addressing.
A special focus is also placed on non-blocking caches.
The theory is based on the chapters \textit{``An Overview of Cache Principles''} and \textit{``Logical Organization''} of \cite{Jacob2008} and on \cite{Jahre2007}.
With the advancement of faster multi-core processors, the performance gap between the processor and the main memory keeps widening, a problem commonly referred to as the \textit{memory wall}.
Therefore, caches, whose goal is to decrease the latency and increase the bandwidth of a memory access, play an important role when it comes to the overall performance of computing systems.
Caches are faster than DRAM, but only provide a small capacity, as their area cost is much higher.
For this reason, at least the \textit{working set}, the data that the currently running application is working on, should be stored in the cache to improve performance.
The two most important heuristics that make this possible will be explained in Section \ref{sec:caches_locality_principles}.
After that, the typical structure of a cache will be discussed in Section \ref{sec:caches_logical_organization}.
Replacement policies will be explained in Section \ref{sec:replacement_policies} and write policies in Section \ref{sec:write_policies}, followed by the considerations required for virtual addressing in Section \ref{sec:caches_virtual_addressing}.
Section \ref{sec:caches_coherency} gives a short introduction on cache coherency and snooping.
Finally, the advantage of non-blocking caches is the topic of Section \ref{sec:caches_non_blocking_caches}.
\subsection{Locality Principles}
\label{sec:caches_locality_principles}
Access patterns of a typical application are not random.
They tend to repeat themselves in time or are located close to previous accesses.
Those two heuristics are called \textit{temporal locality} and \textit{spatial locality}.
\subsubsection{Temporal Locality}
Temporal locality is the concept of referenced data being likely to be referenced again in the near future.
Taking advantage of this is the main idea behind a cache: when new data is referenced, it will be read from the main memory and buffered in the cache.
The processor can now perform operations on this data and use its end result without needing to access the main memory.
\subsubsection{Spatial Locality}
Programs have a tendency to reference data that is nearby in the memory space of already referenced data.
This tendency, spatial locality, arises because related data is often clustered together, for example in arrays or structures.
When calculations are performed on those arrays, sequential access patterns can be observed as one element is processed after the other.
Spatial locality can be exploited by organizing blocks of data in so-called \textit{cache blocks} or \textit{cache lines}, which are larger than a single data word.
This is a passive form of making use of spatial locality, as referenced data will also cause nearby words to be loaded into the same cache line, making them available for further accesses.
An active form of exploiting spatial locality is the use of \textit{prefetching}.
Here, the program causes the cache to fetch more than one cache line from the underlying memory system.
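Both locality principles can be illustrated with a minimal Python sketch (parameters such as the 64-byte line size and word size are illustrative, not taken from a specific system): a sequential array walk hits in the cache for every word of a line except the first, because the miss on the first word fetches the whole line.

```python
# Minimal sketch: hit rate of a direct-mapped cache for a sequential
# array walk. Line size, set count and word size are illustrative.
LINE_SIZE = 64     # bytes per cache line
NUM_SETS = 256     # direct-mapped: one line per set

def sequential_hit_rate(num_words, word_size=4):
    cache = {}                      # set index -> stored tag
    hits = 0
    for i in range(num_words):
        addr = i * word_size
        block = addr // LINE_SIZE   # block number of the access
        index = block % NUM_SETS
        tag = block // NUM_SETS
        if cache.get(index) == tag:
            hits += 1               # spatial locality: same line as before
        else:
            cache[index] = tag      # miss fetches the whole line
    return hits / num_words

print(sequential_hit_rate(4096))   # 15 of every 16 words hit: 0.9375
```

With 4-byte words and 64-byte lines, one miss serves the following 15 accesses, which is exactly the passive exploitation of spatial locality described above.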
\subsection{Logical Organization}
\label{sec:caches_logical_organization}
This section addresses the question of where to store retrieved data in the cache.
Because the cache is much smaller than the DRAM, only a subset of the memory can be held in the cache at a time.
The cache line into which a block is placed is determined by the \textit{placement policy}.
There are three main policies:
\begin{itemize}
\item
In \textit{direct-mapped caches} the cache is divided into multiple sets with a single cache line in each set.
For each address there is exactly one cache line in which the data can be placed.
\item
In a \textit{fully associative cache} there is only one large set, containing all available cache lines.
Referenced data has no restriction in which cache line it can be placed.
\item
\textit{Set-associative caches} are a hybrid form of the former two: there are multiple sets containing several cache lines each.
The address determines the corresponding set, in which the data can be placed in any of the cache lines.
\end{itemize}
\input{img/thesis.tikzstyles}
\begin{figure}
\begin{center}
\tikzfig{img/associativity}
\caption[Four organizations for a cache of eight blocks.]{Four organizations for a cache of eight blocks \cite{Jacob2008}.}
\label{fig:associativity}
\end{center}
\end{figure}
Figure \ref{fig:associativity} illustrates four different organizations for a cache of eight cache lines.
As an example, a data block with the address \texttt{0x40} may be placed in the second set for the direct-mapped, two-way associative and four-way associative cache configurations.
However, in the latter two configurations, the cache can choose the horizontal placement of the block within the set.
For the fully associative cache, every cache line is a valid placement as it consists of only one set.
In each cache configuration, the least significant portion of the physical address of the referenced data, the \textit{index}, determines the set in which the data is to be stored.
However, several entries in the DRAM map to the same set, so the remaining most significant portion of the address is used as a \textit{tag} and is stored next to the actual data in the cache line.
After an entry is fetched from the cache, the tag is used to determine if the entry actually corresponds to the referenced data.
An exemplary subdivision of the address into the tag, index and byte offset is shown in Figure \ref{fig:address_mapping}.
\begin{figure}[!ht]
\begin{center}
\tikzfig{img/address}
\caption{Exemplary address mapping for the tag, index and byte offset.}
\label{fig:address_mapping}
\end{center}
\end{figure}
Direct-mapped caches have the advantage that only one tag has to be compared with the address.
However, every time new data is referenced that maps to the same set, the resident cache line needs to be evicted.
This leads to an overall lower cache hit rate compared to the other two policies.
In a fully associative cache, a memory reference can be placed anywhere; consequently, every cache line has to be fetched and its tag compared with the address.
Although this policy has the highest potential cache hit rate, the area cost of the additional comparators and the high power consumption of the lookup process make it infeasible for many systems.
The hybrid approach of set-associative caches offers a trade-off between both policies.
The term \textit{associativity} denotes the number of cache lines that are contained in a set.
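The subdivision of the address can be sketched with a few lines of Python; the chosen sizes (64-byte lines, 128 sets) are illustrative and not tied to a particular processor:

```python
# Sketch of the tag/index/offset subdivision of an address,
# assuming 64-byte cache lines and 128 sets (illustrative values).
LINE_SIZE = 64    # -> 6 byte-offset bits
NUM_SETS = 128    # -> 7 index bits

def split_address(addr):
    offset = addr % LINE_SIZE                 # byte within the line
    index = (addr // LINE_SIZE) % NUM_SETS    # selects the set
    tag = addr // (LINE_SIZE * NUM_SETS)      # stored next to the data
    return tag, index, offset

print(split_address(0x40))   # (0, 1, 0): block at 0x40 maps to set 1
```

This matches the example from Figure \ref{fig:associativity}: the block with address \texttt{0x40} is the first byte of block 1, so its index selects the second set.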
\subsection{Replacement Policies}
\label{sec:replacement_policies}
In case of contention, cache lines have to be evicted.
To determine which cache line in the corresponding set is evicted, there are several replacement policies:
\begin{itemize}
\item
The random policy selects a cache line of a set at random.
\item
The \revabbr{least recently used}{LRU} policy selects the cache line whose last usage is the longest time ago.
An LRU algorithm is expensive to implement, as a counter value for every cache line of a set has to be updated every time the set is accessed.
\item
An alternative is a \revabbr{pseudo LRU}{PLRU} policy, where an extra bit is set to 1 every time a cache line is accessed.
When the extra bit of every cache line in a set is set to 1, they are reset to 0.
In case of contention, the first cache line whose extra bit is 0 is evicted, which indicates that the last usage was likely some time ago.
\item
In the \revabbr{least frequently used}{LFU} policy, a counter value is incremented every time a cache line is accessed.
The cache line with the lowest value, the least frequently used one, will be chosen to be evicted.
\item
The \revabbr{first in first out}{FIFO} policy evicts cache lines in the same order in which they were placed.
\end{itemize}
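The LRU policy can be sketched compactly in Python with an ordered dictionary that models a single set; the class name and the four-way capacity are illustrative:

```python
from collections import OrderedDict

# Minimal sketch of an LRU-managed cache set: on access an entry moves
# to the most-recently-used position; on contention the least recently
# used entry is evicted. The default models a four-way associative set.
class LRUSet:
    def __init__(self, ways=4):
        self.ways = ways
        self.lines = OrderedDict()   # tag -> data, ordered by recency

    def access(self, tag, fetch):
        if tag in self.lines:
            self.lines.move_to_end(tag)      # hit: mark most recent
            return self.lines[tag]
        if len(self.lines) >= self.ways:
            self.lines.popitem(last=False)   # evict least recently used
        self.lines[tag] = fetch(tag)         # miss: fetch and insert
        return self.lines[tag]

s = LRUSet(ways=2)
s.access(1, lambda t: "a")
s.access(2, lambda t: "b")
s.access(1, lambda t: "a")   # re-reference tag 1
s.access(3, lambda t: "c")   # evicts tag 2, the least recently used
print(list(s.lines))         # [1, 3]
```

A hardware implementation cannot afford a full recency order per set, which is exactly why the PLRU approximation with one extra bit per cache line is attractive.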
\subsection{Write Policies}
\label{sec:write_policies}
To maintain consistency with the underlying memory subsystem, special care has to be taken when a write access occurs.
In case of a \textit{write-through} cache, the underlying memory is updated immediately, meaning the updated value will also directly be written into the DRAM.
Because the DRAM provides a significantly lower bandwidth than the cache, this comes at a performance penalty.
To mitigate the problem, a write buffer can be used, which allows the processor to make further progress while the data is written.
An alternative is a so-called \textit{write-back} cache.
Instead of writing the updated value immediately to the underlying memory, it will be written back when the corresponding cache line is evicted.
To identify if a cache line has to be written back, a so-called \textit{dirty-bit} is used; it denotes if the value has been updated while it has been in the cache.
If this is the case, it must be written back to ensure consistency, otherwise it is not necessary.
Here as well, a write buffer can be used to place the actual write-back requests into a queue.
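The role of the dirty bit in a write-back cache can be sketched as follows (the class and the dictionary standing in for the DRAM are illustrative):

```python
# Sketch of the write-back policy: a write only sets the dirty bit;
# the DRAM (modelled as a dict) is updated when a dirty line is evicted.
class WriteBackLine:
    def __init__(self, tag, data):
        self.tag, self.data, self.dirty = tag, data, False

def evict(line, dram):
    if line.dirty:                   # value was updated while cached
        dram[line.tag] = line.data   # write back to keep consistency
    # otherwise the DRAM copy is still current: no write-back needed

dram = {7: "old"}
line = WriteBackLine(7, "old")
line.data, line.dirty = "new", True   # processor write hits the cache
evict(line, dram)
print(dram[7])   # "new"
```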
\subsection{Virtual Addressing}
\label{sec:caches_virtual_addressing}
Operating systems use virtual addressing to isolate the memory spaces of user-space programs from each other, giving each process its own virtual address space.
\textit{Virtual addresses} are composed of a \textit{virtual page number} and a \textit{page offset}.
The virtual page number is the actual part that is virtual, the page offset is the same for the virtual and the physical address.
Figure \ref{fig:virtual_address} shows an exemplary division of a virtual address into its components.
\begin{figure}[!ht]
\begin{center}
\tikzfig{img/virtual_address}
\caption{Exemplary division of the virtual address into a virtual page number and page offset.}
\label{fig:virtual_address}
\end{center}
\end{figure}
Before a process can access a specific region in memory, the virtual page number has to be translated into a physical page number.
For this translation, so-called \textit{page tables}, maintained by the kernel, are used to look up the physical page number.
Page tables are usually multiple levels deep (e.g., four levels on x86-64), so a single translation can cause several memory accesses, which is expensive.
To improve performance, a \revabbr{translation lookaside buffer}{TLB} is used, which acts like a cache on its own for physical page numbers.
However, as long as the physical address is not available, the data cache cannot look up its entries, as the index is not known yet.
So the cache has to wait for the TLB or even multiple memory accesses in case the physical page number is not stored in it.
To circumvent this problem, the cache can be indexed by the virtual address, which makes it possible to parallelize both procedures.
Such a cache is called \textit{virtually indexed} and \textit{physically tagged} and is illustrated in Figure \ref{fig:virtual_address_conversion}.
% Is the illustration from the book correct? Should the cache index really extend beyond the page offset?
\begin{figure}
\begin{center}
\tikzfig{img/virtual_address_conversion}
\caption[Virtually indexed, physically tagged cache.]{Virtually indexed, physically tagged cache \cite{Jacob2008}. ASID refers to address-space identifier.}
\label{fig:virtual_address_conversion}
\end{center}
\end{figure}
The result from the TLB, which is the physical page number, needs to be compared to the tag that is stored in the cache.
When the tag and the physical page number match, then the cache entry is valid for this virtual address.
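The parallel lookup can be sketched in Python; the page and cache sizes are chosen purely for illustration (with 4\,KiB pages, 64-byte lines and 64 sets, the index bits lie entirely within the page offset, so no translated bits are needed for indexing):

```python
# Sketch of a virtually indexed, physically tagged lookup: the set
# index comes from the page-offset bits of the virtual address, so the
# cache can be read while the TLB translates the virtual page number.
PAGE_SIZE = 4096
LINE_SIZE = 64
NUM_SETS = 64     # 64 sets * 64 B = 4 KiB: index fits in the offset

def vipt_lookup(vaddr, tlb, cache):
    page_offset = vaddr % PAGE_SIZE
    index = page_offset // LINE_SIZE     # no translation needed
    vpn = vaddr // PAGE_SIZE
    ppn = tlb[vpn]                       # TLB lookup, in parallel ...
    tag, data = cache[index]             # ... with the cache read
    return data if tag == ppn else None  # compare the physical tag

tlb = {0x3: 0x1a}                        # virtual page 3 -> frame 0x1a
cache = {1: (0x1a, "value")}             # set 1 holds tag 0x1a
print(vipt_lookup(0x3040, tlb, cache))   # "value"
```

The final comparison of the stored tag with the physical page number delivered by the TLB is exactly the validity check described above.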
\subsection{Cache Coherency}
\label{sec:caches_coherency}
In multi-core environments, caches become a distributed system.
As every core uses its own set of caches and possibly shares a last-level cache with the other cores, a new problem arises.
If two or more cores operate on the same shared data, multiple copies of the data will be placed in the private caches and it must be guaranteed that all cores agree on the actual value the data has at any point in time.
Diverging views of the same data must be considered errors.
Therefore, it is important to guarantee \textit{cache coherency}.
One of the solutions for cache coherency is the use of a so-called snooping protocol.
A cache snoops the cache coherence bus to determine whether it holds a copy of data requested by another core.
Snooping packets are then used to update or invalidate other copies of the data.
Snooping protocols are complex and it is difficult to formally verify that they in fact guarantee cache coherence.
For this reason, they are not further discussed in this thesis.
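The basic effect of an invalidation-based snooping protocol can still be sketched in a few lines; the dictionary-per-core model is a strong simplification of a real protocol:

```python
# Sketch of write-invalidate snooping: when one core writes, an
# invalidate packet is broadcast on the coherence bus and every other
# cache that holds a copy drops it, so no core keeps a stale value.
def broadcast_invalidate(caches, writer, block):
    for core, cache in caches.items():
        if core != writer and block in cache:
            del cache[block]    # other copies become invalid

caches = {0: {0x40: 1}, 1: {0x40: 1}}   # both cores cache block 0x40
caches[0][0x40] = 2                     # core 0 writes its copy
broadcast_invalidate(caches, writer=0, block=0x40)
print(0x40 in caches[1])                # False: stale copy is gone
```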
\subsection{Non-Blocking Caches}
\label{sec:caches_non_blocking_caches}
In blocking caches, cache misses require the processor to stall until the data is fetched from the underlying memory.
As this is a major slowdown, non-blocking caches try to solve this problem, making it possible for the processor to make further progress while waiting on the value.
Similarly to the write buffer, previously discussed in Section \ref{sec:write_policies}, a new buffer will be introduced: the \revabbr{miss status hold register}{MSHR}.
The number of MSHRs corresponds to the number of misses the cache can handle concurrently; when all available MSHRs are occupied and a further miss occurs, the cache will block.
An MSHR entry always corresponds to one cache line that is currently being fetched from the underlying memory subsystem.
There are two variants of cache misses: \textit{primary misses} occupy a further MSHR, whereas \textit{secondary misses} are added to an existing MSHR entry and therefore cannot cause the cache to block.
The latter occurs when a cache line that is already being fetched is accessed again.
A possible architecture of an MSHR file is illustrated in Figure \ref{fig:mshr_file}.
\begin{figure}
\begin{center}
\tikzfig{img/mshr_file}
\caption[Miss Status Holding Register File.]{Miss Status Holding Register File \cite{Jahre2007}. \textit{V} refers to a valid bit.}
\label{fig:mshr_file}
\end{center}
\end{figure}
When the data for a cache miss is returned from the underlying memory, the cache is updated, all targets of the MSHR entry are served with the value and the MSHR entry is deallocated.
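The distinction between primary and secondary misses and the blocking condition can be summarized in a short sketch (the MSHR count and the string return values are illustrative):

```python
# Sketch of MSHR allocation: a primary miss occupies a new MSHR; a
# secondary miss to a line that is already in flight is added as a
# further target of the existing entry; when all MSHRs are occupied,
# a further primary miss forces the cache to block.
NUM_MSHRS = 4

def handle_miss(mshrs, block, target):
    if block in mshrs:              # secondary miss: line already in flight
        mshrs[block].append(target)
        return "secondary"
    if len(mshrs) >= NUM_MSHRS:     # no free MSHR: cache must block
        return "block"
    mshrs[block] = [target]         # primary miss: allocate a new MSHR
    return "primary"

mshrs = {}
print(handle_miss(mshrs, 0x40, "load r1"))   # primary
print(handle_miss(mshrs, 0x40, "load r2"))   # secondary
```

When the fetched line arrives, all targets queued in `mshrs[block]` are served and the entry is removed, freeing the MSHR for the next primary miss.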