diff --git a/src/chapters/dram.tex b/src/chapters/dram.tex index 3ac8cc4..a4c99a4 100644 --- a/src/chapters/dram.tex +++ b/src/chapters/dram.tex @@ -17,7 +17,7 @@ Memory arrays, in turn, are composed of multiple \acp{subarray}. The \ac{lwl} is connected to the transistor's gate, switching it on and off, while the \ac{lbl} is used to access the stored value. Global \acp{mwl} and \acp{mbl} span over all \acp{subarray}, forming complete \textit{rows} and \textit{columns} of a memory array. -Because the charge stored in each cell is very small, so-called \acp{psa} are needed to amplify the voltage of each cell while it is being connected to the shared \ac{lbl} \cite{jacob2008}, basic structure of which is illustrated in Figure \ref{img:psa}. +Because the charge stored in each cell is very small, so-called \acp{psa} are needed to amplify the voltage of each cell while it is being connected to the shared \ac{lbl} \cite{jacob2008}, the basic structure of which is illustrated in \cref{img:psa}. \begin{figure} \centering @@ -34,7 +34,7 @@ The process of loading the stored values into the \acp{psa} is done for all colu Once a row is activated, it can be read from or written to with a certain access granularity determined by the \ac{bl} of the memory. To perform such a burst access, the \acp{csl} of a set of \acp{psa} must be enabled, connecting them to the more powerful \acp{ssa} that drive the actual bank \ac{io}. Depending on the \ac{we} signal, the \acp{ssa} either sense and amplify the logic value of the \acp{psa}, or they overwrite it using the \textit{write drivers}. -The Figure \ref{img:bank} summarizes the basic architecture of a single storage device consisting of a number of banks that has been discussed so far. +\cref{img:bank} summarizes the basic architecture discussed so far: a single storage device consisting of a number of banks. 
\begin{figure} \centering @@ -53,7 +53,7 @@ For example, to read data, the memory controller may first issue a \ac{pre} comm The value on the address bus determines the row, column, bank and rank used during the respective commands, while it is the responsibility of the memory controller to translate the \ac{mpsoc}-side address to the respective components in a process called \ac{am}. \Ac{am} ensures that the number of \textit{row misses}, i.e., the need for precharging and activating another row, is minimized. % One particularly common \ac{am} scheme is called \textit{Bank Interleaving} \cite{jung2017a}, which maps the lower address bits to the columns, followed by the ranks and banks, and the highest bits to the rows. -One particularly common \ac{am} scheme is called \textit{Bank Interleaving} \cite{jung2017a}, which is illustrated using an exemplary mapping in Figure \ref{img:bank_interleaving}. +One particularly common \ac{am} scheme is called \textit{Bank Interleaving} \cite{jung2017a}, which is illustrated using an exemplary mapping in \cref{img:bank_interleaving}. Under the assumption of a sequentially increasing address access pattern, this scheme maps the lowest bits of an address to the column bits of a row to exploit the already activated row as much as possible. After that, instead of addressing the next row of the current bank directly, the mapping switches to another bank to take advantage of \textit{bank parallelism}. Because banks can be controlled independently, one bank can be outputting the next data burst while another is concurrently precharging or activating a new row. @@ -98,7 +98,7 @@ Such a 2.5D-integrated type used in \acp{gpu} and \acp{tpu} is \ac{hbm}, which w \Aca{hbm} is a \ac{dram} standard that has been defined by \ac{jedec} in 2016 as a successor of the previous \ac{hbm} standard \cite{jedec2015a}. What differentiates \ac{hbm} from other types of memory is its \ac{sip} approach. 
-Several \ac{dram} dies are stacked on top of each other and connected with \acp{tsv} to form a cube of memory dies consisting of many layers and a buffer die at the bottom, as shown in Figure \ref{img:sip}. +Several \ac{dram} dies are stacked on top of each other and connected with \acp{tsv} to form a cube of memory dies consisting of many layers and a buffer die at the bottom, as shown in \cref{img:sip}. \begin{figure} \centering \includegraphics[width=0.8\linewidth]{images/sip} @@ -116,7 +116,7 @@ A single data transfer is performed with either a \ac{bl} of 2 or 4, depending o In \ac{pch} mode, the data bus is split in half (i.e., 64-bit) to enable independent data transmission, further increasing parallelism while sharing a common command and address bus between the two \acp{pch}. Thus, accessing \aca{hbm} in \ac{pch} mode transmits a $\qty{256}{\bit}=\qty{32}{\byte}$ burst with a \ac{bl} of 4 over the $\qty{64}{\bit}$ wide data bus. -Figure \ref{img:hbm} illustrates the internal architecture of a single memory die. +\cref{img:hbm} illustrates the internal architecture of a single memory die. It consists of 2 independent channels, each with 2 \acp{pch} of 4 bank groups with 4 banks each, resulting in 16 banks per \ac{pch}. In the center of the die, the \acp{tsv} connect to the next die above or the previous die below. diff --git a/src/chapters/introduction.tex b/src/chapters/introduction.tex index 43135d1..964f084 100644 --- a/src/chapters/introduction.tex +++ b/src/chapters/introduction.tex @@ -47,9 +47,9 @@ This has led researchers to reconsider past \ac{pim} architectures and advance t This work analyzes various \ac{pim} architectures, identifies the challenges of integrating them into state-of-the-art \acp{dram}, examines the changes required in the way applications lay out their data in memory and explores a \ac{pim} implementation from one of the leading \ac{dram} vendors. 
The remainder of this work is structured as follows: -Section \ref{sec:dram} gives a brief overview of the architecture of \acp{dram}, in detail that of \ac{hbm}. -In section \ref{sec:pim} various types of \ac{pim} architectures are presented, with some concrete examples discussed in detail. -Section \ref{sec:vp} is an introduction to virtual prototyping and system-level hardware simulation. -After explaining the necessary prerequisites, section \ref{sec:implementation} implements a concrete \ac{pim} architecture in software and provides a development library that applications can use to take advantage of in-memory processing. -The section \ref{sec:results} demonstrates the possible performance enhancement of \ac{pim} by simulating a typical neural-network inference. -Finally, section \ref{sec:conclusion} concludes the findings and identifies future improvements in \ac{pim} architectures. +\cref{sec:dram} gives a brief overview of the architecture of \acp{dram}, with particular focus on \ac{hbm}. +In \cref{sec:pim}, various types of \ac{pim} architectures are presented, with some concrete examples discussed in detail. +\cref{sec:vp} is an introduction to virtual prototyping and system-level hardware simulation. +After explaining the necessary prerequisites, \cref{sec:implementation} implements a concrete \ac{pim} architecture in software and provides a development library that applications can use to take advantage of in-memory processing. +\cref{sec:results} demonstrates the possible performance enhancement of \ac{pim} by simulating a typical neural-network inference. +Finally, \cref{sec:conclusion} summarizes the findings and identifies future improvements in \ac{pim} architectures. diff --git a/src/chapters/pim.tex b/src/chapters/pim.tex index d16727e..0c44570 100644 --- a/src/chapters/pim.tex +++ b/src/chapters/pim.tex @@ -14,14 +14,14 @@ Finally, a number of concrete examples are presented. 
\subsection{Applicable Workloads} \label{sec:pim_workloads} -As already discussed in Section \ref{sec:introduction}, \ac{pim} is a good fit for accelerating memory-bound workloads with low operational intensity. +As already discussed in \cref{sec:introduction}, \ac{pim} is a good fit for accelerating memory-bound workloads with low operational intensity. In contrast, compute-bound workloads tend to have high data reuse and can make excessive use of the on-chip cache and therefore do not need to utilize the full memory bandwidth. For problems like this, \ac{pim} is only of limited use. Many layers of modern \acp{dnn} can be expressed as a matrix-vector multiplication. The layer inputs can be represented as a vector and the model weights can be viewed as a matrix, where the number of columns is equal to the size of the input vector and the number of rows is equal to the size of the output vector. Pairwise multiplication of the input vector and a row of the matrix can be used to calculate an entry of the output vector. -This process is illustrated in Figure \ref{img:dnn} where one \ac{dnn} layer is processed. +This process is illustrated in \cref{img:dnn}, where one \ac{dnn} layer is processed. \begin{figure} \centering @@ -113,7 +113,7 @@ To make full use of the output buffering, the matrix rows are interleaved in an \label{img:hynix} \end{figure} -As illustrated in Figure \ref{img:hynix}, a matrix row is distributed across all banks and partitioned into separate chunks, filling the complete \ac{dram} row. +As illustrated in \cref{img:hynix}, a matrix row is distributed across all banks and partitioned into separate chunks, filling the complete \ac{dram} row. This is to ensure that the input vector is fully used and never refetched - all matrix rows of a corresponding chunk are multiplied by the input vector chunk before moving to the next chunk. If this is done repeatedly, the temporary results will be accumulated in the output vector. 
Since all the banks are operating on the same input vector at the same time, a single Newton \ac{dram} command will perform the arithmetic operations for all the banks in the memory. @@ -127,17 +127,17 @@ As a result, Newton promises a $\qtyrange{10}{54}{\times}$ speedup compared to a One year after SK Hynix, the major \ac{dram} manufacturer Samsung announced its own \ac{pim} \ac{dram} implementation, called \ac{fimdram} or \ac{hbm}-\ac{pim}. As this is the \ac{pim} architecture which was implemented during the work on this thesis, it will be explained in great detail. -The following subsections are mainly based on \cite{lee2021} and \cite{kwon2021}, with the subsection \ref{sec:memory_layout} being mainly based on \cite{kang2022}. +The following subsections are based mainly on \cite{lee2021} and \cite{kwon2021}, with \cref{sec:memory_layout} drawing primarily on \cite{kang2022}. \subsubsection{Architecture} As the name of \ac{hbm}-\ac{pim} suggests, it is based on the \aca{hbm} memory standard, and it integrates 16-wide \ac{simd} engines directly into the memory banks, exploiting bank-level parallelism, while retaining the highly optimized \acp{subarray} \cite{kwon2021}. A major difference from Newton \ac{pim} is that \ac{hbm}-\ac{pim} does not require any changes to components of modern processors, such as the memory controller, i.e. it is agnostic to existing \aca{hbm} platforms. Consequently, mode switching is required for \ac{hbm}-\ac{pim}, making it less useful for interleaved \ac{pim} and non-\ac{pim} traffic. -Fortunately, as discussed in Section \ref{sec:hbm}, the architecture of \ac{hbm} allows for many independent memory channels on a single stack, making it possible to cleanly separate the memory map into a \ac{pim}-enabled region and a normal \ac{hbm} region. 
+Fortunately, as discussed in \cref{sec:hbm}, the architecture of \ac{hbm} allows for many independent memory channels on a single stack, making it possible to cleanly separate the memory map into a \ac{pim}-enabled region and a normal \ac{hbm} region. At the heart of the \ac{hbm}-\ac{pim} are the \ac{pim} execution units, which are shared by two banks of a \ac{pch}. They include 16 16-bit wide \ac{simd} \acp{fpu}, \acp{crf}, \acp{grf} and \acp{srf} \cite{lee2021}. -This general architecture is shown in detail in Figure \ref{img:fimdram}, with (a) the placement of the \ac{pim} units between the memory banks of a \ac{dram} die, with (b) a bank coupled to its \ac{pim} unit, and (c) the data path in around a \ac{fpu} within the \ac{pim} unit. +This general architecture is shown in detail in \cref{img:fimdram}, with (a) the placement of the \ac{pim} units between the memory banks of a \ac{dram} die, (b) a bank coupled to its \ac{pim} unit, and (c) the data path in and around an \ac{fpu} within the \ac{pim} unit. \begin{figure} \centering @@ -180,7 +180,7 @@ Each \ac{grf} consists of 16 registers, each with the \aca{hbm} prefetch size of The \ac{grf} of a processing unit is divided into two halves (\ac{grf}-A and \ac{grf}-B), with 8 register entries allocated to each of the two banks. Finally, in the \acp{srf}, a 16-bit scalar value is replicated 16 times as it is fed into the 16-wide \ac{simd} \ac{fpu} as a constant summand or factor for an addition or multiplication. It is also divided into two halves (\ac{srf}-A and \ac{srf}-M) for addition and multiplication with 8 entries each. -This processing unit architecture is illustrated in Figure \ref{img:pcu}, along with the local bus interfaces to its even and odd bank, and the control unit that decodes the instructions and keeps track of the program counter. 
+This processing unit architecture is illustrated in \cref{img:pcu}, along with the local bus interfaces to its even and odd bank, and the control unit that decodes the instructions and keeps track of the program counter. \begin{figure} \centering @@ -189,13 +189,13 @@ This processing unit architecture is illustrated in Figure \ref{img:pcu}, along \label{img:pcu} \end{figure} -Unlike SK Hynix's Newton architecture, \ac{hbm}-\ac{pim} requires both mode switching and loading a microkernel into the processing units before a workload can be executed. -This makes \ac{hbm}-\ac{pim} less effective for very small workloads as the overhead is significant. +In contrast to SK Hynix's Newton architecture, \ac{hbm}-\ac{pim} requires both mode switching and loading a microkernel into the processing units before a workload can be executed. +This makes \ac{hbm}-\ac{pim} less effective for very small workloads, as the overhead of mode switching and initialization is significant. \subsubsection{Instruction Set} The \ac{hbm}-\ac{pim} processing units provide a total of 9 32-bit \ac{risc} instructions, each of which falls into one of three groups: control flow instructions, arithmetic instructions and data movement instructions. -The data layout of these three instruction groups is shown in Table \ref{tab:isa}. +The data layout of these three instruction groups is shown in \cref{tab:isa}. \begin{table} \centering @@ -211,9 +211,9 @@ Finally, the MOV and FILL instructions are used to move data between the memory The DST and SRC fields specify the operand type. That is, the register file or bank affected by the operation. -Depending on the source or destination operand types, the instruction encodes indices for the concrete element in the register files, which are denoted in the Table \ref{tab:isa} by \textit{\#} symbols. 
+Depending on the source or destination operand types, the instruction encodes indices for the concrete element in the register files, which are denoted in \cref{tab:isa} by \textit{\#} symbols. The special field \textit{R} for the data movement instruction type enables a \ac{relu} operation, i.e., clamping negative values to zero, while the data is moved to another location. -Another special field \textit{A} enabled the \ac{aam}, which will be explained in more detail in Section \ref{sec:instruction_ordering}. +Another special field \textit{A} enables the \ac{aam}, which will be explained in more detail in \cref{sec:instruction_ordering}. \begin{table} \centering @@ -235,11 +235,11 @@ Another special field \textit{A} enabled the \ac{aam}, which will be explained i Arithmetic & MAC & multiply-accumulate & GRF-B & GRF, BANK & GRF, BANK, SRF & GRF, BANK, SRF \\ Arithmetic & MAD & multiply-and-add & GRF & GRF, BANK & GRF, BANK, SRF & GRF, BANK, SRF \end{tblr}} - \caption[A list of the supported instructions their possible sources and destinations]{A list of the supported instructions their possible sources and destinations \cite{shin-haengkang2023}.} + \caption[A list of all supported \ac{pim} instructions and their possible sources and destinations]{A list of all supported \ac{pim} instructions and their possible sources and destinations \cite{shin-haengkang2023}.} \label{tab:instruction_set} \end{table} -The Table \ref{tab:instruction_set} gives an overview of all available instructions and defines the possible operand sources and destinations. +\cref{tab:instruction_set} gives an overview of all available instructions and defines the possible operand sources and destinations. It is to note, that some operations do require either a \ac{rd} or a \ac{wr} access to execute properly. For example, to write the resulting output vector from a \ac{grf} to the memory banks, the memory controller must issue a \ac{wr} command to write to the bank. 
Likewise, reading from the banks, requires a \ac{rd} command. @@ -252,7 +252,7 @@ The rest of this thesis, it is assumed, that a \ac{rd} is issued for these instr Since the execution of an instruction in the microkernel is initiated by a memory access, the host processor must execute \ac{ld} or \ac{st} store instructions in a sequence that perfectly matches the loaded \ac{pim} microkernel. When an instruction has a bank as its specified source or destination, the addresses of these memory accesses specify the exact row and column where the data should be loaded from or stored to. This means that the order of the respective memory accesses for such instructions must not be reordered, as it must match the corresponding instruction in the microkernel. -For example, as shown in Listing \ref{lst:reorder}, two consecutive \ac{mac} instructions with the memory bank as of the one operand source already specify the respective register index, but must wait for the actual memory access to get the row and column address of the bank access. +For example, as shown in \cref{lst:reorder}, two consecutive \ac{mac} instructions with the memory bank as one operand source already specify the respective register index, but must wait for the actual memory access to get the row and column address of the bank access. \begin{listing} \begin{verbatim} @@ -270,7 +270,7 @@ However, this comes at a significant performance cost and results in memory band Disabling memory controller reordering completely, on the other hand, interferes with non-\ac{pim} traffic and significantly reduces its performance. To solve this overhead, Samsung has implemented the \ac{aam} mode for arithmetic instructions. -In the \ac{aam} mode, the register indices of an instruction are ignored and decoded from the column and row address of the memory access itself, as demonstrated in Figure \ref{img:aam}. 
+In the \ac{aam} mode, the register indices of an instruction are ignored and instead decoded from the column and row address of the memory access itself, as demonstrated in \cref{img:aam}. With this method, the register indices and the bank address cannot get out of sync, as they are tightly coupled, even if the memory controller reorders the order of the accesses. \begin{figure} \centering @@ -281,7 +281,7 @@ With this method, the register indices and the bank address cannot get out of sy \end{figure} As a side effect, this method also allows looping of an instruction in the microkernel, as otherwise the indices are always fixed and would therefore apply to the same register entry each time. -At the core of a \ac{gemv} microkernel is an iterative \ac{mac} instruction, followed by a JUMP instruction that executes the \ac{mac} operation a total of eight times, as shown in Listing \ref{lst:gemv}. +At the core of a \ac{gemv} microkernel is an iterative \ac{mac} instruction, followed by a JUMP instruction that executes the \ac{mac} operation a total of eight times, as shown in \cref{lst:gemv}. \begin{listing} \begin{verbatim} @@ -295,7 +295,7 @@ JUMP -1, 7 Since the column address of the memory access is incremented after each iteration, all entries of the \ac{grf}-A register file, where the input vector is stored, are used to multiply it with the matrix weights loaded on the fly from the memory banks. The actual order of the memory accesses is irrelevant, only before and after the \ac{mac} kernel the host must place memory barrier instructions to synchronize the execution again. To achieve this particular operation, where the addresses can be used to calculate the register indices, the memory layout of the weight matrix has to follow a special pattern. -This memory layout is explained in detail in Section \ref{sec:memory_layout}. +This memory layout is explained in detail in \cref{sec:memory_layout}. 
\subsubsection{Programming Model} @@ -306,7 +306,7 @@ Alternatively, it would be possible to control cache behavior by issuing flush a Secondly, a \ac{pim} acceleration library implements a set of \ac{blas} operations and manages the generation, loading and execution of the microkernel on behalf of the user. At the highest level, \ac{hbm}-\ac{pim} provides an extension to the \ac{tf} framework that allows either calling the special \ac{pim} operations implemented by the accelerator library directly on the source operands, or automatically finding suitable routines that can be accelerated by \ac{pim} in the normal \ac{tf} operation. -The software stack is able to concurrently exploit the independent parallelism of \acp{pch} for a \ac{mac} operation as described in section \ref{sec:instruction_ordering}. +The software stack is able to concurrently exploit the independent parallelism of \acp{pch} for a \ac{mac} operation as described in \cref{sec:instruction_ordering}. Since \aca{hbm} memory is mainly used in conjunction with \acs{gpu}, which do not implement sophisticated out-of-order execution, it is necessary to spawn a number of software threads to execute the eight memory accesses simultaneously. The necessary number of threads depends on the processor \ac{isa}, e.g., with a maximum access size of $\qty{16}{\byte}$, $\qty{256}{\byte}/\qty{16}{\byte}=\num{16}$ threads are required to access the full \aca{hbm} burst size. Such a group of software threads is called a thread group. @@ -315,7 +315,7 @@ Thus, a total of 64 thread groups running in parallel can be spawned in a \ac{hb \subsubsection{Memory Layout} \label{sec:memory_layout} -As already described in Section \ref{sec:instruction_ordering}, the use of the \ac{aam} mode requires a special memory layout so that the register indices are correctly calculated from the column and row addresses of a memory access. 
+As already described in \cref{sec:instruction_ordering}, the use of the \ac{aam} mode requires a special memory layout so that the register indices are correctly calculated from the column and row addresses of a memory access. To make use of all eight \ac{grf}-A registers, the input address has to increment linearly, resulting in a row-major matrix layout. In a row-major matrix layout, the entries of a row are stored sequentially before switching to the next row, according to the \texttt{MATRIX[R][C]} \ac{c}-like array notation. @@ -331,7 +331,7 @@ Since a vector is essentially a single-column matrix, it is always laid out sequ However, since all processing units must access the same input vector elements at the same time, all processing units must load the respective vector elements into their \ac{grf}-A registers during the initialization phase of the microkernel. As there is no communication between the banks, every bank needs to have its own copy of the input vector. Consequently, from the perspective of the linear address space, multiple copies chunks of the input vector must be interleaved in such a way that the input vector is continuous from the perspective of each bank. -This interleaving is illustrated in Figure \ref{img:input_vector}. +This interleaving is illustrated in \cref{img:input_vector}. \begin{figure} \centering @@ -353,7 +353,7 @@ psum[i,0:15]=\sum_{j=0}^{8}(a[j*16:j*16+15]*w[i,j*16:j*16+15]) The partial sum vector $psum[0:7,0:15]$ must then be reduced by the host processor to obtain the final output vector $b[0:7]$. This reduction step is mandatory because there is no means in the \ac{hbm}-\ac{pim} architecture to reduce the output sums of the 16-wide \ac{simd} \acp{fpu}. In contrast, SK Hynix's Newton implements adder trees in the \ac{pim} units to reduce the partial sums directly in memory. -The operation of this concrete \ac{gemv} microkernel is illustrated in Figure \ref{img:memory_layout}. 
+The operation of this concrete \ac{gemv} microkernel is illustrated in \cref{img:memory_layout}. \begin{figure} \centering @@ -362,10 +362,10 @@ The operation of this concrete \ac{gemv} microkernel is illustrated in Figure \r \label{img:memory_layout} \end{figure} -In the Figure \ref{img:memory_layout} it can be seen that a processing unit is responsible for multiplying and adding one row of the matrix with the input vector in eight cycles, forming the partial sum. +\cref{img:memory_layout} shows that a processing unit is responsible for multiplying and adding one row of the matrix with the input vector in eight cycles, forming the partial sum. This example only demonstrates the execution of the native matrix dimensions for one \ac{pch}. To increase the number of rows in the matrix, simply additional iterations of this 8-cycle microkernel are required, while feeding in the other memory addresses for the subsequent matrix rows. -As a side effect of the incremented bank address, this also results in an increment of the \ac{grf}-B index, making it possible to increase the maximum number of matrix rows to $8*8=64$ before all eight \ac{grf}-B entries are filled with partial sums, as demonstrated in Listing \ref{lst:gemv64}. +As a side effect of the incremented bank address, this also results in an increment of the \ac{grf}-B index, making it possible to increase the maximum number of matrix rows to $8 \cdot 8 = 64$ before all eight \ac{grf}-B entries are filled with partial sums, as demonstrated in \cref{lst:gemv64}. \begin{listing} \begin{verbatim} @@ -391,5 +391,5 @@ The power consumption of the \ac{hbm}-\ac{pim} dies itself is with $\qty{5.4}{\p However, the increased processing bandwidth and the reduced power consumption on the global \ac{io}-bus led to a $\qty{8.25}{\percent}$ higher energy efficiency for a \ac{gemv} kernel, and $\qtyrange{1.38}{3.2}{\times}$ higher efficiency for real \ac{dnn} layers. 
In conclusion, \ac{hbm}-\ac{pim} is one of the few real \ac{pim} implementations by hardware vendors at this time and promises significant performance gains and higher power efficiency compared to regular \aca{hbm} \ac{dram}. -The following Section \ref{sec:vp} introduces the concept of virtual prototyping, which is the basis for the following implementation of the \ac{hbm}-\ac{pim} model in a simulator. +\cref{sec:vp} introduces the concept of virtual prototyping, which is the basis for the subsequent implementation of the \ac{hbm}-\ac{pim} model in a simulator. diff --git a/src/chapters/vp.tex b/src/chapters/vp.tex index bce1783..f87a09c 100644 --- a/src/chapters/vp.tex +++ b/src/chapters/vp.tex @@ -1,5 +1,5 @@ \section{Virtual Prototypes and System-Level Modeling} \label{sec:vp} -% DRAMSys -% also gem5 +\subsection{The gem5 Simulator} +\subsection{DRAMSys} diff --git a/src/index.tex b/src/index.tex index 7192061..f4cf165 100644 --- a/src/index.tex +++ b/src/index.tex @@ -14,11 +14,12 @@ \usepackage{subfig} \usepackage{url} \usepackage[hidelinks]{hyperref} +\usepackage[nameinlink,capitalize,noabbrev]{cleveref} \usepackage{acro} \usepackage{lipsum} \usepackage{siunitx} \usepackage{url} -\usepackage[urldate=long]{biblatex} +\usepackage[urldate=long,sorting=none]{biblatex} \usepackage{pgfplots} \usepackage{bytefield} \usepackage{mathdots}
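The \ac{gemv} dataflow described in the pim.tex hunks above — eight 16-wide \ac{mac} iterations per matrix row, with the host reducing the resulting partial sums — can be sketched numerically as follows. This is a minimal Python model of the arithmetic only, not vendor code; the function names are illustrative, and \texttt{float32} stands in for the hardware's 16-bit floats.

```python
import numpy as np

SIMD_WIDTH = 16   # 16-wide SIMD FPU per processing unit
GRF_ENTRIES = 8   # 8 GRF-A entries hold a 128-element input vector chunk

def gemv_partial_sums(w, a):
    """Model of the 8-cycle MAC microkernel: each matrix row is multiplied
    element-wise with the input vector in 8 iterations of 16 lanes each,
    accumulating one 16-wide partial sum (one GRF-B entry) per row."""
    rows = w.shape[0]
    psum = np.zeros((rows, SIMD_WIDTH), dtype=np.float32)
    for i in range(rows):
        for j in range(GRF_ENTRIES):           # one MAC per GRF-A entry
            lo, hi = j * SIMD_WIDTH, (j + 1) * SIMD_WIDTH
            psum[i] += a[lo:hi] * w[i, lo:hi]  # element-wise MAC
    return psum

def host_reduce(psum):
    """Host-side reduction of the 16-wide partial sums to the output vector;
    HBM-PIM has no in-memory adder tree, unlike SK Hynix's Newton."""
    return psum.sum(axis=1)

# One PCH-native tile: 8 matrix rows, 128-element input vector.
a = np.arange(128, dtype=np.float32)
w = np.ones((8, 128), dtype=np.float32)
b = host_reduce(gemv_partial_sums(w, a))  # equals w @ a
```

The two-stage shape mirrors the text: the in-memory part only ever produces the $psum[0:7,0:15]$ array, and the final dot products exist only after the host reduction.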