Start of kernel implementation

2024-02-16 15:20:28 +01:00
parent df8ef883b3
commit c04c3fa829
7 changed files with 95 additions and 11 deletions


@@ -300,3 +300,31 @@
short = MMU,
long = memory management unit,
}
\DeclareAcronym{eflash}{
short = eFlash,
long = embedded flash,
}
\DeclareAcronym{elf}{
short = ELF,
long = Executable and Linkable Format,
}
\DeclareAcronym{uart}{
short = UART,
long = Universal Asynchronous Receiver-Transmitter,
}
\DeclareAcronym{stdout}{
short = stdout,
long = standard output,
}
\DeclareAcronym{tcr}{
short = TCR,
long = Translation Control Register,
}
\DeclareAcronym{ttbr}{
short = TTBR,
long = Translation Table Base Register,
}
\DeclareAcronym{tcp}{
short = TCP,
long = Transmission Control Protocol,
}


@@ -1,7 +1,7 @@
\subsection{Application Kernel}
\label{sec:kernel}
With both the \aca{fimdram} model implemented in DRAMSys and the software support library, it is now possible to write an application that runs on gem5 and leverages \ac{pim} to accelerate workloads.
When it comes to gem5, there are three different approaches to model a system:
\begin{itemize}
\item
@@ -10,8 +10,8 @@ In this mode, the application is simulated in isolation, while forwarding system
This mode has the lowest level of accuracy because many components of the memory system, such as page table walking and the \ac{tlb}, are implemented using very simplified models.
\item
Simulate the entire system in \textbf{full system} mode, booting a full Linux kernel and running the application to be benchmarked as a user space program.
This mode is the most accurate, as it closely resembles the real deployment of an application.
It also provides a complete enough environment to develop device drivers, without the need for the real system.
\item
Finally, run gem5 in full system mode, but boot a custom kernel in a \textbf{bare-metal} environment.
This approach is the most flexible, as the user has full control over the hardware configuration as well as the operating system.
@@ -19,23 +19,77 @@ The user application does not have to run in user space, but can run in a privil
\end{itemize}
While the system call emulation mode is the simplest option, it has been discarded due to its lack of accuracy and inability to execute privileged instructions.
On the one hand, the full system mode, which boots a Linux kernel, provides the necessary capability to implement the application.
However, due to the complexity of booting the entire kernel, which renders rapid prototyping unfeasible, and the need to write a Linux device driver to execute privileged instructions and control the non-cacheable memory regions, it was decided to favor the bare-metal option.
Here, the self-written kernel has full control over the complete system, which is an advantage when implementing a minimal example utilizing \aca{fimdram}.
On the other hand, some setup is required, such as initializing the page tables so that the \ac{mmu} of the processor can be enabled and programmed to mark memory regions as cacheable and non-cacheable.
% python config
Running a gem5 simulation requires writing a Python script that sets up all system components and connects them.
Recently, gem5 deprecated a commonly used prebuilt script called \texttt{fs.py} in favor of its new standard library, which provides useful abstractions over common system components, making it easier to build complex systems in a flexible way without having to dive into great detail.
This standard library greatly simplifies the process of building a system with, for example, an accurate timing or out-of-order processor, a multi-level cache hierarchy, a memory crossbar, and a \ac{dram} model.
However, as of writing this thesis, gem5 does not provide a board abstraction suitable for bare-metal workloads.
Therefore, it was necessary to modify the ARM board provided for full system Linux simulations and simplify it so that no disk image is required, i.e., the board only boots the provided operating system kernel.
\subsubsection{Boot Code}
% linker script
% start assembly script
At startup on an ARM processor, the reset handler cannot directly dispatch to the \texttt{main} function of the application.
Instead, certain initialization steps are required, such as setting the stack pointer and, equally important, enabling the on-chip caches by setting up the page tables and enabling the \ac{mmu}.
Fortunately, ARM provides a comprehensive document \cite{gao2017} that explains all the necessary bare-metal setup steps for an ARMv8 processor in the AArch64 execution mode and provides useful examples of the boot code that require only minimal modification.
While executing the boot code, however, the processor cannot yet correctly access the \ac{dram}, as the \ac{mmu} is not set up.
To solve this problem, the ARM board of gem5 provides a small boot memory component, often implemented as \ac{eflash} in real systems, from which the boot code instructions can be fetched and which supports the native access width of the processor.
During the initialization phase, gem5 ensures that the boot code \texttt{.init} section is copied into the boot memory, as instructed by the header of the \ac{elf} file generated by the linker script.
The linker script also maps the \texttt{.text}, the \texttt{.data}, the \texttt{.rodata} and the \texttt{.bss} sections into the \ac{dram} region.
Furthermore, it reserves space for the stack on the \ac{dram} and maps two special \aca{fimdram} regions:
First, the config region, where the processor writes the \ac{json} messages that switch the execution mode of the \ac{pim} units or transfer the microkernel.
Second, a large \ac{pim} region where all allocated arrays, vectors, and matrices are placed for the processing units to operate on.
This segmentation of the \ac{dram} region is important because otherwise no memory access would be possible during \ac{ab} or \ac{abp} mode to fetch instruction data or store stack variables.
Consequently, the default memory region and the \ac{pim} memory region are located on different \acp{pch} to guarantee this independence from each other.
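The separation described above can be sketched as a set of region constants together with an overlap check; note that all base addresses and sizes below are illustrative assumptions, not the values of the actual linker script.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative memory map; every base address and size here is an
 * assumed placeholder, not the value used in the real linker script. */
#define DRAM_BASE        0x80000000ULL  /* normal, cacheable DRAM       */
#define DRAM_SIZE        0x40000000ULL  /* .text/.data/.bss and stack   */
#define PIM_CONFIG_BASE  0xC0000000ULL  /* JSON config messages         */
#define PIM_CONFIG_SIZE  0x00010000ULL
#define PIM_DATA_BASE    0xC0010000ULL  /* arrays, vectors, matrices    */
#define PIM_DATA_SIZE    0x3FFF0000ULL

/* The regions must not overlap, so that instruction fetches and stack
 * accesses never touch the PIM-enabled pseudo channels in AB/ABP mode. */
static int overlaps(uint64_t a, uint64_t asz, uint64_t b, uint64_t bsz) {
    return a < b + bsz && b < a + asz;
}
```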
\subsubsection{Cache Management}
% ARM page tables
% cache management
In order to enable the on-chip caches, and therefore be able to use the \ac{dram}, the page tables have to be set up, which the \ac{mmu} then uses to map addresses between the virtual memory space and the physical memory space.
To simplify the virtual-physical translation, the \ac{dram} address space should only be mapped as a block at a certain offset in the virtual address space.
In the attributes of the page table, each mapped block of address space can be assigned a special cache policy, such as cacheable and non-cacheable.
While most of the \ac{dram} area should be a normal, cacheable memory region, the \ac{pim} region should be marked as a non-cacheable memory for reasons explained in \cref{sec:microkernel_execution}.
Furthermore, special memory-mapped devices such as the \ac{uart}, which is used to print logging messages to the \ac{stdout}, must be marked as a non-cacheable device region, as otherwise the log messages may get held in the cache and not be written until the cache line is eventually flushed.
In the AArch64 execution mode, the operating system can choose from three different granule sizes for the translation tables: $\qty{4}{\kilo\byte}$, $\qty{16}{\kilo\byte}$ and $\qty{64}{\kilo\byte}$.
Each granule size has a different maximum page table nesting depth, with up to a 4-level look-up for the $\qty{4}{\kilo\byte}$ configuration, as shown in \cref{img:pagetable_granule}.
\begin{figure}
\centering
\includegraphics[width=\linewidth]{images/pagetable_granule}
\caption[The distinct page table levels for the $\qty{4}{\kilo\byte}$ granule]{The distinct page table levels for the $\qty{4}{\kilo\byte}$ granule \cite{arm2015}.}
\label{img:pagetable_granule}
\end{figure}
As can be seen, when using the complete 4-level page lookup process, nine bits of the virtual address are used per level to index into the corresponding page table.
In cases where the input address is restricted to a maximum of 42 bits, the level 0 table can be omitted and translation can start with the level 1 table.
In each table, an entry either points to the physical address of the next level page table, or can directly point to the base address of a memory block, terminating the address translation early.
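The index extraction for the $\qty{4}{\kilo\byte}$ granule can be illustrated with a short helper; this is a sketch of the translation indexing, not code taken from the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* For the 4 KiB granule, each translation level consumes nine bits of
 * the virtual address: L0 = VA[47:39], L1 = VA[38:30], L2 = VA[29:21],
 * L3 = VA[20:12]; the remaining VA[11:0] is the offset into the page. */
static uint64_t table_index(uint64_t va, int level) {
    int shift = 39 - 9 * level;   /* 39, 30, 21, 12 for levels 0..3 */
    return (va >> shift) & 0x1FF; /* nine bits index a 512-entry table */
}

static uint64_t page_offset(uint64_t va) {
    return va & 0xFFF;            /* low twelve bits: page offset */
}
```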
While regular operating systems may use the complete $\qty{4}{\kilo\byte}$ lookup process for maximum flexibility, in the controlled bare-metal case, where there is only one application, this may not be necessary.
For this reason, the developed kernel makes use of the first level page table and maps the complete \ac{dram} memory region using $\qty{1}{\giga\byte}$ memory blocks.
In addition to the base pointer, each entry in the page table also holds attributes that determine how the memory region is treated.
To enable the mapping of the boot memory and \ac{io} devices such as \ac{uart}, the first memory blocks are marked with a non-cacheable attribute, followed by the normal \ac{dram} region, which is cacheable, and finally the \aca{fimdram} region, which is set to non-cacheable again.
After setting up the page tables, configuring the \ac{tcr} to select the $\qty{4}{\kilo\byte}$ granule, and initializing the \ac{ttbr}, which holds the base pointer to the first level page table, the \ac{mmu} can be enabled, and the boot code can finally dispatch to the \texttt{main} function of the application.
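Building such a level-1 block entry can be sketched as follows; the bit positions follow the VMSAv8-64 long descriptor format, while the attribute indices (e.g. which MAIR slot means cacheable) are assumptions that depend on how the kernel programs the MAIR register.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of an ARMv8-A level-1 block descriptor for 1 GiB blocks.
 * The MAIR attribute index passed in (e.g. 0 = device/non-cacheable,
 * 1 = normal cacheable) is an assumed convention, not the kernel's. */
#define DESC_VALID_BLOCK  0x1ULL        /* bits[1:0] = 0b01: valid block */
#define DESC_AF           (1ULL << 10)  /* access flag                   */
#define DESC_SH_INNER     (3ULL << 8)   /* inner shareable               */
#define DESC_ATTRIDX(i)   ((uint64_t)(i) << 2) /* MAIR index, bits[4:2]  */

static uint64_t block_descriptor(uint64_t phys_1gb_base, unsigned attridx) {
    /* The output address of a 1 GiB block occupies bits [47:30]. */
    return (phys_1gb_base & 0x0000FFFFC0000000ULL)
         | DESC_ATTRIDX(attridx) | DESC_SH_INNER | DESC_AF
         | DESC_VALID_BLOCK;
}
```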
\subsubsection{Bare-Metal Utilities}
% Heap Allocator (linked list allocator?...)
% uart
When running an application in a bare-metal environment, the standard library of the programming language supports only very limited features and does not provide the \ac{io} and memory management routines that one expects when running an application on top of an operating system.
For example, it is not possible to use \ac{io} functions such as \texttt{printf} to print log messages to \ac{stdout}.
Instead, the kernel itself must define what it interprets as \ac{stdout} and redirect the formatted strings to the custom implementation.
In the ARM board of gem5, a \ac{uart} device is mapped into the memory map by default, to which the kernel can write messages.
The \ac{uart} device model in gem5 then redirects the written messages either to an output file on the host machine or to a \ac{tcp} port, where a client can then redirect the written content to the \ac{stdout} of the host.
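The redirection of \ac{stdout} can be sketched as writing bytes to a memory-mapped \ac{uart} data register; a PL011-style layout, in which writing a byte to the data register transmits it, is assumed here, and the register address is passed in explicitly so that the sketch stays testable.

```c
#include <stdint.h>

/* Minimal sketch of routing character output to a memory-mapped UART.
 * A PL011-style data register at offset 0 is assumed; in the kernel,
 * uart_dr would be the fixed physical address of that register, mapped
 * as a non-cacheable device region. */
static void uart_putc(volatile uint32_t *uart_dr, char c) {
    *uart_dr = (uint32_t)(unsigned char)c;  /* transmit one character */
}

static void uart_puts(volatile uint32_t *uart_dr, const char *s) {
    while (*s)
        uart_putc(uart_dr, *s++);
}
```

A kernel-level \texttt{printf} would then format into a buffer and hand the result to \texttt{uart\_puts}.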
Further, the bare-metal environment does not support any heap allocation without the kernel explicitly implementing it.
During development of the custom kernel, it was found that the stack is not suitable for storing the large \ac{pim} arrays for two reasons:
First, the \ac{pim} arrays become very large for high matrix dimensions and may not fit in the preallocated stack region.
Second, and most importantly, because the stack resides in the normal, cacheable \ac{dram} region, it cannot be used to store the \ac{pim}-enabled data structures.
As an alternative, it would be possible to also preallocate the \ac{pim} data structures in the \ac{pim} \ac{dram} region by instructing the linker to place these structures in a special section of the \ac{elf} file and mapping that section to the \ac{pim}-enabled \acp{pch}.
However, this approach is very inflexible, as the exact dimensions of the matrices would have to be known at compile time.
To solve this problem, a commonly available memory allocator based on \ac{llff} has been used to allocate dynamically sized \ac{pim} arrays at runtime.
To incorporate this memory allocator, it is initialized with a preallocated memory arena, which is mapped to the \ac{pim} region of the \ac{dram}.
The allocator can then dynamically use sections of this arena to allocate the \ac{pim} data structures.
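The idea behind a linked-list first-fit allocator over a fixed arena can be sketched as follows; this is a deliberately minimal illustration, not the allocator actually used, and it omits freeing and coalescing entirely.

```c
#include <stddef.h>
#include <stdint.h>

/* Minimal sketch of a linked-list first-fit allocator over a fixed
 * arena.  No free() or coalescing is shown; a real allocator is more
 * complete. */
typedef struct block {
    size_t size;          /* usable bytes in this free block */
    struct block *next;   /* next free block in the list     */
} block_t;

static block_t *free_list;

/* Initialize the allocator with a preallocated arena, e.g. one that
 * the linker places in the PIM region of the DRAM. */
static void arena_init(void *arena, size_t size) {
    free_list = (block_t *)arena;
    free_list->size = size - sizeof(block_t);
    free_list->next = NULL;
}

/* First fit: walk the free list, take the first block that is large
 * enough, and split off the remainder when it is worth keeping. */
static void *arena_alloc(size_t size) {
    size = (size + 15) & ~(size_t)15;          /* 16-byte granularity */
    for (block_t **cur = &free_list; *cur; cur = &(*cur)->next) {
        block_t *b = *cur;
        if (b->size < size)
            continue;
        if (b->size >= size + sizeof(block_t) + 16) {
            /* Split: hand out the front, keep the rest on the list. */
            block_t *rest = (block_t *)((uint8_t *)(b + 1) + size);
            rest->size = b->size - size - sizeof(block_t);
            rest->next = b->next;
            *cur = rest;
        } else {
            *cur = b->next;                    /* use the whole block */
        }
        return b + 1;
    }
    return NULL;                               /* arena exhausted */
}
```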
\subsubsection{Memory Configuration}
% address mapping


@@ -111,6 +111,7 @@ With the introduced data structures used for addition, scalar multiplication and
The implementation of the \aca{fimdram} execution model is explained in the following section.
\subsubsection{Microkernel Execution}
\label{sec:microkernel_execution}
The host processor executes the \ac{pim} microkernel by first switching to the \ac{abp} mode and then issuing the required \ac{rd} and \ac{wr} memory requests by executing \ac{ld} and \ac{st} instructions.
When executing control instructions or data movement instructions that operate only on the register files, the \ac{rd} and \ac{wr} requests must be located in a dummy region of memory where no actual data is stored, but which must be allocated beforehand.
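On the host side, issuing such \ac{rd} requests amounts to plain loads from the dummy region, sketched below; the base address and the one-load-per-instruction pacing are assumptions for illustration, and the loads return a value only so that the compiler cannot elide them.

```c
#include <stdint.h>

/* Sketch of pacing PIM microkernel execution with plain loads.  In ABP
 * mode, each RD request the host issues lets the processing units
 * consume one microkernel instruction; for control or register-only
 * instructions the read target carries no payload and merely has to
 * fall into a preallocated dummy region.  The pointer passed in is a
 * stand-in for that region's base address. */
static uint64_t trigger_reads(volatile uint64_t *dummy, int count) {
    uint64_t sink = 0;
    for (int i = 0; i < count; i++)
        sink ^= dummy[i];   /* each load issues one RD request */
    return sink;            /* returned so the loads are not elided */
}
```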


@@ -200,7 +200,7 @@ The data layout of these three instruction groups is shown in \cref{tab:isa}.
\begin{table}
\centering
\includegraphics[width=\linewidth]{images/isa}
\caption[The instruction format of the processing units]{The instruction format of the processing units \cite{lee2021}.}
\label{tab:isa}
\end{table}


@@ -23,6 +23,7 @@
title = {{ARM Cortex-A Series Programmer's Guide for ARMv8-A}},
author = {{ARM}},
date = {2015-03-24},
url = {https://developer.arm.com/documentation/den0024/latest/},
langid = {english},
file = {/home/derek/Nextcloud/Verschiedenes/Zotero/storage/KGNI52X5/2015 - ARM Cortex-A Series Programmers Guide for ARMv8-A.pdf}
}

Binary file not shown.

Binary file not shown.