Start of kernel implementation

2024-02-16 15:20:28 +01:00
parent df8ef883b3
commit c04c3fa829
7 changed files with 95 additions and 11 deletions


@@ -300,3 +300,31 @@
short = MMU,
long = memory management unit,
}
\DeclareAcronym{eflash}{
short = eFlash,
long = embedded flash,
}
\DeclareAcronym{elf}{
short = ELF,
long = Executable and Linkable Format,
}
\DeclareAcronym{uart}{
short = UART,
long = Universal Asynchronous Receiver-Transmitter,
}
\DeclareAcronym{stdout}{
short = stdout,
long = standard output,
}
\DeclareAcronym{tcr}{
short = TCR,
long = Translation Control Register,
}
\DeclareAcronym{ttbr}{
short = TTBR,
long = Translation Table Base Register,
}
\DeclareAcronym{tcp}{
short = TCP,
long = Transmission Control Protocol,
}


@@ -1,7 +1,7 @@
\subsection{Application Kernel}
\label{sec:kernel}
With both the \aca{fimdram} model implemented in DRAMSys and the software support library, it is now possible to write an application that runs on gem5 and leverages \ac{pim} to accelerate workloads.
When it comes to gem5, there are three different approaches to model a system:
\begin{itemize}
\item
@@ -10,8 +10,8 @@ In this mode, the application is simulated in isolation, while forwarding system
This mode has the lowest level of accuracy because many components of the memory system, such as page table walking and the \ac{tlb}, are implemented using very simplified models.
\item
Simulate the entire system in \textbf{full system} mode, booting a full Linux kernel and running the application to be benchmarked as a user space program.
This mode is the most accurate, as it closely resembles the real deployment of an application.
It also provides a complete enough environment to develop device drivers, without the need for the real system.
\item
Finally, run gem5 in full system mode, but boot a custom kernel in a \textbf{bare-metal} environment.
This approach is the most flexible, as the user has full control over the hardware configuration as well as the operating system.
@@ -19,23 +19,77 @@ The user application does not have to run in user space, but can run in a privil
\end{itemize}
While the system call emulation mode is the simplest option, it has been discarded due to its lack of accuracy and inability to execute privileged instructions.
On the one hand, the full system mode, which boots a Linux kernel, provides the necessary capability to implement the application.
However, due to the complexity of booting the entire kernel, which renders rapid prototyping unfeasible, and the need to write a Linux device driver to execute privileged instructions and control the non-cacheable memory regions, it was decided to favor the bare-metal option.
Here, the self-written kernel has full control over the complete system, which is an advantage when implementing a minimal example utilizing \aca{fimdram}.
On the other hand, some setup is required, such as initializing the page tables so that the \ac{mmu} of the processor can be enabled and programmed to mark memory regions as cacheable and non-cacheable.
% python config
Running a gem5 simulation requires writing a Python script that sets up all system components and connects them.
Recently, gem5 deprecated a commonly used prebuilt script called \texttt{fs.py} in favor of its new standard library, which provides useful abstractions over common system components, making it easier to build complex systems in a flexible way without having to dive into great detail.
This standard library greatly simplifies the process of building a system with, for example, an accurate timing or out-of-order processor, a multi-level cache hierarchy, a memory crossbar, and a \ac{dram} model.
However, as of writing this thesis, gem5 does not provide a board abstraction suitable for bare-metal workloads.
Therefore, it was necessary to modify the ARM board provided for full system Linux simulations and simplify it so that no disk image is required, i.e., the board only boots the provided operating system kernel.
\subsubsection{Boot Code}
% linker script
% start assembly script
At startup on an ARM processor, the reset handler cannot directly dispatch to the \texttt{main} function of the application.
Instead, certain initialization steps are required, such as setting the stack pointer and, equally important, enabling the on-chip caches by setting up the page tables and enabling the \ac{mmu}.
Fortunately, ARM provides a comprehensive document \cite{gao2017} that explains all the necessary bare-metal setup steps for an ARMv8 processor in the AArch64 execution mode and provides useful examples of the boot code that require only minimal modification.
While executing the boot code, however, the processor cannot yet correctly access the \ac{dram}, as the \ac{mmu} is not set up.
To solve this problem, the ARM board of gem5 provides a small boot memory component, often implemented as \ac{eflash} in real systems, from which the boot code instructions can be fetched and which supports the native access width of the processor.
During the initialization phase, gem5 ensures that the boot code \texttt{.init} section is copied into the boot memory, as instructed by the header of the \ac{elf} file generated by the linker script.
The linker script also maps the \texttt{.text}, the \texttt{.data}, the \texttt{.rodata} and the \texttt{.bss} sections into the \ac{dram} region.
Furthermore, it reserves space for the stack on the \ac{dram} and maps two special \aca{fimdram} regions:
First, the config region, where the processor writes the \ac{json} messages that switch the execution mode of the \ac{pim} units or transfer the microkernel.
Second, a large \ac{pim} region where all allocated arrays, vectors, and matrices are placed for the processing units to operate on.
This segmentation of the \ac{dram} region is important because otherwise no memory access would be possible during \ac{ab} or \ac{abp} mode to fetch instruction data or store stack variables.
Consequently, the default memory region and the \ac{pim} memory region are located on different \acp{pch} to guarantee this independence from each other.
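The separation described above can be sketched as a set of region constants together with an overlap check; note that all base addresses and sizes below are illustrative assumptions, not the values of the actual linker script.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative memory map; every base address and size here is an
 * assumed placeholder, not the value used in the real linker script. */
#define DRAM_BASE        0x80000000ULL  /* normal, cacheable DRAM       */
#define DRAM_SIZE        0x40000000ULL  /* .text/.data/.bss and stack   */
#define PIM_CONFIG_BASE  0xC0000000ULL  /* JSON config messages         */
#define PIM_CONFIG_SIZE  0x00010000ULL
#define PIM_DATA_BASE    0xC0010000ULL  /* arrays, vectors, matrices    */
#define PIM_DATA_SIZE    0x3FFF0000ULL

/* The regions must not overlap, so that instruction fetches and stack
 * accesses never touch the PIM-enabled pseudo channels in AB/ABP mode. */
static int overlaps(uint64_t a, uint64_t asz, uint64_t b, uint64_t bsz) {
    return a < b + bsz && b < a + asz;
}
```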
\subsubsection{Cache Management}
% ARM page tables
% cache management
In order to enable the on-chip caches, and therefore be able to use the \ac{dram}, the page tables have to be set up, which the \ac{mmu} then uses to map addresses between the virtual memory space and the physical memory space.
To simplify the virtual-physical translation, the \ac{dram} address space should only be mapped as a block at a certain offset in the virtual address space.
In the attributes of the page table, each mapped block of address space can be assigned a special cache policy, such as cacheable and non-cacheable.
While most of the \ac{dram} area should be a normal, cacheable memory region, the \ac{pim} region should be marked as a non-cacheable memory for reasons explained in \cref{sec:microkernel_execution}.
Furthermore, special memory-mapped devices such as the \ac{uart}, which is used to print logging messages to the \ac{stdout}, must be marked as a non-cacheable device region, as otherwise the log messages may get held in the cache and not be written until the cache line is eventually flushed.
In the AArch64 execution mode, the operating system can choose from three different granule sizes for the translation tables: $\qty{4}{\kilo\byte}$, $\qty{16}{\kilo\byte}$ and $\qty{64}{\kilo\byte}$.
Each granule size has a different maximum page table nesting depth, with up to a 4-level look-up for the $\qty{4}{\kilo\byte}$ configuration, as shown in \cref{img:pagetable_granule}.
\begin{figure}
\centering
\includegraphics[width=\linewidth]{images/pagetable_granule}
\caption[The distinct page table levels for the $\qty{4}{\kilo\byte}$ granule]{The distinct page table levels for the $\qty{4}{\kilo\byte}$ granule \cite{arm2015}.}
\label{img:pagetable_granule}
\end{figure}
As can be seen, when using the complete 4-level page lookup process, nine bits of the virtual address are used per level to index into the corresponding page table.
In cases where the input address is restricted to a maximum of 42 bits, the level 0 table can be omitted and translation can start with the level 1 table.
In each table, an entry either points to the physical address of the next level page table, or can directly point to the base address of a memory block, terminating the address translation early.
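The index extraction for the $\qty{4}{\kilo\byte}$ granule can be illustrated with a short helper; this is a sketch of the translation indexing, not code taken from the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* For the 4 KiB granule, each translation level consumes nine bits of
 * the virtual address: L0 = VA[47:39], L1 = VA[38:30], L2 = VA[29:21],
 * L3 = VA[20:12]; the remaining VA[11:0] is the offset into the page. */
static uint64_t table_index(uint64_t va, int level) {
    int shift = 39 - 9 * level;   /* 39, 30, 21, 12 for levels 0..3 */
    return (va >> shift) & 0x1FF; /* nine bits index a 512-entry table */
}

static uint64_t page_offset(uint64_t va) {
    return va & 0xFFF;            /* low twelve bits: page offset */
}
```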
While regular operating systems may use the complete $\qty{4}{\kilo\byte}$ lookup process for maximum flexibility, in the controlled bare-metal case, where there is only one application, this may not be necessary.
For this reason, the developed kernel makes use of the first level page table and maps the complete \ac{dram} memory region using $\qty{1}{\giga\byte}$ memory blocks.
In addition to the base pointer, each entry in the page table also holds attributes that determine how the memory region is treated.
To enable the mapping of the boot memory and \ac{io} devices such as \ac{uart}, the first memory blocks are marked with a non-cacheable attribute, followed by the normal \ac{dram} region, which is cacheable, and finally the \aca{fimdram} region, which is set to non-cacheable again.
After setting up the page tables, configuring the \ac{tcr} to select the $\qty{4}{\kilo\byte}$ granule, and initializing the \ac{ttbr}, which holds the base pointer to the first level page table, the \ac{mmu} can be enabled, and the boot code can finally dispatch to the \texttt{main} function of the application.
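Building such a level-1 block entry can be sketched as follows; the bit positions follow the VMSAv8-64 long descriptor format, while the attribute indices (e.g. which MAIR slot means cacheable) are assumptions that depend on how the kernel programs the MAIR register.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of an ARMv8-A level-1 block descriptor for 1 GiB blocks.
 * The MAIR attribute index passed in (e.g. 0 = device/non-cacheable,
 * 1 = normal cacheable) is an assumed convention, not the kernel's. */
#define DESC_VALID_BLOCK  0x1ULL        /* bits[1:0] = 0b01: valid block */
#define DESC_AF           (1ULL << 10)  /* access flag                   */
#define DESC_SH_INNER     (3ULL << 8)   /* inner shareable               */
#define DESC_ATTRIDX(i)   ((uint64_t)(i) << 2) /* MAIR index, bits[4:2]  */

static uint64_t block_descriptor(uint64_t phys_1gb_base, unsigned attridx) {
    /* The output address of a 1 GiB block occupies bits [47:30]. */
    return (phys_1gb_base & 0x0000FFFFC0000000ULL)
         | DESC_ATTRIDX(attridx) | DESC_SH_INNER | DESC_AF
         | DESC_VALID_BLOCK;
}
```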
\subsubsection{Bare-Metal Utilities}
% Heap Allocator (linked list allocator?...)
% uart
When running an application in a bare-metal environment, the standard library of the programming language supports only very limited features and does not provide the \ac{io} and memory management routines that one expects when running an application on top of an operating system.
For example, it is not possible to use \ac{io} functions such as \texttt{printf} to print log messages to \ac{stdout}.
Instead, the kernel itself must define what it interprets as \ac{stdout} and redirect the formatted strings to the custom implementation.
In the ARM board of gem5, a \ac{uart} device is mapped into the memory map by default, to which the kernel can write messages.
The \ac{uart} device model in gem5 then redirects the written messages either to an output file on the host machine or to a \ac{tcp} port, where a client can then redirect the written content to the \ac{stdout} of the host.
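The redirection of \ac{stdout} can be sketched as writing bytes to a memory-mapped \ac{uart} data register; a PL011-style layout, in which writing a byte to the data register transmits it, is assumed here, and the register address is passed in explicitly so that the sketch stays testable.

```c
#include <stdint.h>

/* Minimal sketch of routing character output to a memory-mapped UART.
 * A PL011-style data register at offset 0 is assumed; in the kernel,
 * uart_dr would be the fixed physical address of that register, mapped
 * as a non-cacheable device region. */
static void uart_putc(volatile uint32_t *uart_dr, char c) {
    *uart_dr = (uint32_t)(unsigned char)c;  /* transmit one character */
}

static void uart_puts(volatile uint32_t *uart_dr, const char *s) {
    while (*s)
        uart_putc(uart_dr, *s++);
}
```

A kernel-level \texttt{printf} would then format into a buffer and hand the result to \texttt{uart\_puts}.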
Further, the bare-metal environment does not support any heap allocation without the kernel explicitly implementing it.
During development of the custom kernel, it was found that the stack is not suitable for storing the large \ac{pim} arrays for two reasons:
First, the \ac{pim} arrays become very large for high matrix dimensions and may not fit in the preallocated stack region.
Second, and most importantly, because the stack resides in the normal, cacheable \ac{dram} region, it cannot be used to store the \ac{pim}-enabled data structures.
As an alternative, it would be possible to also preallocate the \ac{pim} data structures in the \ac{pim} \ac{dram} region by instructing the linker to place these structures in a special section of the \ac{elf} file and mapping that section to the \ac{pim}-enabled \acp{pch}.
However, this approach is very inflexible, as the exact dimensions of the matrices would have to be known at compile time.
To solve this problem, a commonly available memory allocator based on \ac{llff} has been used to allocate dynamically sized \ac{pim} arrays at runtime.
To incorporate this memory allocator, it is initialized with a preallocated memory arena, which is mapped to the \ac{pim} region of the \ac{dram}.
The allocator can then dynamically use sections of this arena to allocate the \ac{pim} data structures.
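The idea behind a linked-list first-fit allocator over a fixed arena can be sketched as follows; this is a deliberately minimal illustration, not the allocator actually used, and it omits freeing and coalescing entirely.

```c
#include <stddef.h>
#include <stdint.h>

/* Minimal sketch of a linked-list first-fit allocator over a fixed
 * arena.  No free() or coalescing is shown; a real allocator is more
 * complete. */
typedef struct block {
    size_t size;          /* usable bytes in this free block */
    struct block *next;   /* next free block in the list     */
} block_t;

static block_t *free_list;

/* Initialize the allocator with a preallocated arena, e.g. one that
 * the linker places in the PIM region of the DRAM. */
static void arena_init(void *arena, size_t size) {
    free_list = (block_t *)arena;
    free_list->size = size - sizeof(block_t);
    free_list->next = NULL;
}

/* First fit: walk the free list, take the first block that is large
 * enough, and split off the remainder when it is worth keeping. */
static void *arena_alloc(size_t size) {
    size = (size + 15) & ~(size_t)15;          /* 16-byte granularity */
    for (block_t **cur = &free_list; *cur; cur = &(*cur)->next) {
        block_t *b = *cur;
        if (b->size < size)
            continue;
        if (b->size >= size + sizeof(block_t) + 16) {
            /* Split: hand out the front, keep the rest on the list. */
            block_t *rest = (block_t *)((uint8_t *)(b + 1) + size);
            rest->size = b->size - size - sizeof(block_t);
            rest->next = b->next;
            *cur = rest;
        } else {
            *cur = b->next;                    /* use the whole block */
        }
        return b + 1;
    }
    return NULL;                               /* arena exhausted */
}
```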
\subsubsection{Memory Configuration}
% address mapping


@@ -111,6 +111,7 @@ With the introduced data structures used for addition, scalar multiplication and
The implementation of the \aca{fimdram} execution model is explained in the following section.
\subsubsection{Microkernel Execution}
\label{sec:microkernel_execution}
The host processor executes the \ac{pim} microkernel by first switching to the \ac{abp} mode and then issuing the required \ac{rd} and \ac{wr} memory requests by executing \ac{ld} and \ac{st} instructions.
When executing control instructions or data movement instructions that operate only on the register files, the \ac{rd} and \ac{wr} requests must be located in a dummy region of memory where no actual data is stored, but which must be allocated beforehand.
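On the host side, issuing such \ac{rd} requests amounts to plain loads from the dummy region, sketched below; the base address and the one-load-per-instruction pacing are assumptions for illustration, and the loads return a value only so that the compiler cannot elide them.

```c
#include <stdint.h>

/* Sketch of pacing PIM microkernel execution with plain loads.  In ABP
 * mode, each RD request the host issues lets the processing units
 * consume one microkernel instruction; for control or register-only
 * instructions the read target carries no payload and merely has to
 * fall into a preallocated dummy region.  The pointer passed in is a
 * stand-in for that region's base address. */
static uint64_t trigger_reads(volatile uint64_t *dummy, int count) {
    uint64_t sink = 0;
    for (int i = 0; i < count; i++)
        sink ^= dummy[i];   /* each load issues one RD request */
    return sink;            /* returned so the loads are not elided */
}
```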


@@ -200,7 +200,7 @@ The data layout of these three instruction groups is shown in \cref{tab:isa}.
\begin{table}
\centering
\includegraphics[width=\linewidth]{images/isa}
\caption[The instruction format of the processing units]{The instruction format of the processing units \cite{lee2021}.}
\label{tab:isa}
\end{table}


@@ -23,6 +23,7 @@
title = {{ARM Cortex-A Series Programmer's Guide for ARMv8-A}},
author = {{ARM}},
date = {2015-03-24},
url = {https://developer.arm.com/documentation/den0024/latest/},
langid = {english},
file = {/home/derek/Nextcloud/Verschiedenes/Zotero/storage/KGNI52X5/2015 - ARM Cortex-A Series Programmers Guide for ARMv8-A.pdf}
}

Binary file not shown.

Binary file not shown.