\DeclareAcronym{mmu}{
  short = MMU,
  long = memory management unit,
}
\DeclareAcronym{eflash}{
  short = eFlash,
  long = embedded flash,
}
\DeclareAcronym{elf}{
  short = ELF,
  long = Executable and Linkable Format,
}
\DeclareAcronym{uart}{
  short = UART,
  long = Universal Asynchronous Receiver-Transmitter,
}
\DeclareAcronym{stdout}{
  short = stdout,
  long = standard output,
}
\DeclareAcronym{tcr}{
  short = TCR,
  long = Translation Control Register,
}
\DeclareAcronym{ttbr}{
  short = TTBR,
  long = Translation Table Base Register,
}
\DeclareAcronym{tcp}{
  short = TCP,
  long = Transmission Control Protocol,
}

\subsection{Application Kernel}
\label{sec:kernel}

With both the \aca{fimdram} model implemented in DRAMSys and the software support library, it is now possible to write an application that runs on gem5 and leverages \ac{pim} to accelerate workloads.
When it comes to gem5, there are three different approaches to modeling a system:
\begin{itemize}
\item
Simulate only the user application in \textbf{system call emulation} mode.
In this mode, the application is simulated in isolation, while forwarding system calls to the host.
This mode has the lowest level of accuracy because many components of the memory system, such as page table walking and the \ac{tlb}, are implemented using very simplified models.
\item
Simulate the entire system in \textbf{full system} mode, booting a full Linux kernel and running the application to be benchmarked as a user space program.
This mode is the most accurate, as it closely resembles the real deployment of an application.
It also provides a complete enough environment to develop device drivers, without the need for the real system.
\item
Finally, run gem5 in full system mode, but boot a custom kernel in a \textbf{bare-metal} environment.
This approach is the most flexible, as the user has full control over the hardware configuration as well as the operating system.
The user application does not have to run in user space, but can run in a privileged mode.
\end{itemize}

While the system call emulation mode is the simplest option, it was discarded due to its lack of accuracy and its inability to execute privileged instructions.
The full system mode, which boots a Linux kernel, does provide the necessary capability to implement the application.
However, due to the complexity of booting the entire kernel, which renders rapid prototyping unfeasible, and the need to write a Linux device driver to execute privileged instructions and control the non-cacheable memory regions, it was decided to favor the bare-metal option.
Here, the self-written kernel has full control over the complete system, which is an advantage when implementing a minimal example utilizing \aca{fimdram}.
On the other hand, some setup is required, such as initializing the page tables so that the \ac{mmu} of the processor can be enabled and programmed to mark memory regions as cacheable and non-cacheable.

% python config
Running a gem5 simulation requires writing a Python script that sets up all system components and connects them.
Recently, gem5 deprecated a commonly used prebuilt script called \texttt{fs.py} in favor of its new standard library, which provides useful abstractions over common system components, making it easier to build complex systems in a flexible way without having to dive into great detail.
This standard library greatly simplifies the process of building a system with, for example, an accurate timing or out-of-order processor, a multi-level cache hierarchy, a memory crossbar, and a \ac{dram} model.
However, as of writing this thesis, gem5 does not provide a board abstraction suitable for bare-metal workloads.
Therefore, it was necessary to modify the ARM board provided for full system Linux simulations and simplify it so that no disk image is required, i.e., the board only boots the provided operating system kernel.

\subsubsection{Boot Code}
% linker script
% start assembly script
At startup on an ARM processor, the reset handler cannot directly dispatch to the \texttt{main} function of the application.
Instead, certain initialization steps are required, such as setting the stack pointer and, equally important, enabling the on-chip caches by setting up the page tables and enabling the \ac{mmu}.
Fortunately, ARM provides a comprehensive document \cite{gao2017} that explains all the necessary bare-metal setup steps for an ARMv8 processor in the AArch64 execution mode and provides useful examples of the boot code that require only minimal modification.
While executing the boot code, however, the processor cannot correctly access the \ac{dram} yet, as the \ac{mmu} is not set up.
To solve this problem, the ARM board of gem5 provides a small boot memory component, often implemented as \ac{eflash} in real systems, from which the boot code instructions can be fetched and which supports the native access width of the processor.
During the initialization phase, gem5 ensures that the \texttt{.init} section of the boot code is copied into the boot memory, as instructed by the header of the \ac{elf} file generated by the linker script.

The linker script also maps the \texttt{.text}, \texttt{.data}, \texttt{.rodata}, and \texttt{.bss} sections into the \ac{dram} region.
Furthermore, it reserves space for the stack in the \ac{dram} and maps two special \aca{fimdram} regions:
First, the config region, where the processor writes the \ac{json} messages that switch the execution mode of the \ac{pim} units or transfer the microkernel.
Second, a large \ac{pim} region where all allocated arrays, vectors, and matrices are placed for the processing units to operate on.
This segmentation of the \ac{dram} region is important because otherwise no memory access would be possible during \ac{ab} or \ac{abp} mode to fetch instruction data or store stack variables.
Consequently, the default memory region and the \ac{pim} memory region are located on different \acp{pch} to guarantee this independence from each other.
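
A linker script implementing this segmentation could be sketched as follows; all region names, base addresses, and sizes are illustrative placeholders rather than the actual values used by the kernel.

```ld
/* Hypothetical memory layout: boot memory, normal DRAM,
 * and the two FIMDRAM regions (config + PIM data). */
MEMORY
{
  BOOTROM (rx)  : ORIGIN = 0x00000000, LENGTH = 64K
  DRAM    (rwx) : ORIGIN = 0x80000000, LENGTH = 512M
  PIMCFG  (rw)  : ORIGIN = 0xA0000000, LENGTH = 1M
  PIMDATA (rw)  : ORIGIN = 0xC0000000, LENGTH = 256M
}

SECTIONS
{
  .init   : { *(.init) }           > BOOTROM  /* boot code fetched from eFlash */
  .text   : { *(.text*) }          > DRAM
  .rodata : { *(.rodata*) }        > DRAM
  .data   : { *(.data*) }          > DRAM
  .bss    : { *(.bss*) *(COMMON) } > DRAM

  /* Reserved, not loaded from the ELF file: */
  .stack (NOLOAD)      : { . += 0x10000; __stack_top = .; } > DRAM
  .pim_config (NOLOAD) : { *(.pim_config) }                 > PIMCFG
  .pim_data (NOLOAD)   : { *(.pim_data) }                   > PIMDATA
}
```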

\subsubsection{Cache Management}
% ARM page tables
% cache management
In order to enable the on-chip caches and therefore be able to use the \ac{dram}, the page tables have to be set up, which are then used by the \ac{mmu} to map addresses between the virtual memory space and the physical memory space.
To simplify the virtual-to-physical translation, the \ac{dram} address space should only be mapped as a block at a certain offset in the virtual address space.
In the attributes of the page table, each mapped block of address space can be assigned a special cache policy, such as cacheable and non-cacheable.
While most of the \ac{dram} area should be a normal, cacheable memory region, the \ac{pim} region should be marked as non-cacheable memory for reasons explained in \cref{sec:microkernel_execution}.
Furthermore, special memory-mapped devices such as the \ac{uart}, which is used to print logging messages to \ac{stdout}, must be marked as a non-cacheable device region, as otherwise the log messages may be held in the cache and not written until the cache line is eventually flushed.

In the AArch64 execution mode, the operating system can choose from three different granule sizes for the translation tables: $\qty{4}{\kilo\byte}$, $\qty{16}{\kilo\byte}$, and $\qty{64}{\kilo\byte}$.
Each granule size has a different maximum amount of page table nesting, with up to a 4-level look-up for the $\qty{4}{\kilo\byte}$ configuration, as shown in \cref{img:pagetable_granule}.

\begin{figure}
\centering
\includegraphics[width=\linewidth]{images/pagetable_granule}
\caption[The distinct page table levels for the $\qty{4}{\kilo\byte}$ granule]{The distinct page table levels for the $\qty{4}{\kilo\byte}$ granule \cite{arm2015}.}
\label{img:pagetable_granule}
\end{figure}

As can be seen, when using the complete 4-level page lookup process, nine bits of the virtual address are used per level to index into the corresponding page table.
In cases where the input address is restricted to a maximum of 39 bits, the level 0 table can be omitted and translation can start with the level 1 table.
In each table, an entry either points to the physical address of the next-level page table, or alternatively can point directly to the base address of a memory block, terminating the address translation early.
While regular operating systems may use the complete 4-level lookup process for maximum flexibility, in the controlled bare-metal case, where there is only one application, this may not be necessary.
For this reason, the developed kernel makes use of only the first-level page table and maps the complete \ac{dram} memory region using $\qty{1}{\giga\byte}$ memory blocks.
In addition to the base pointer, each entry in the page table also holds certain attributes describing how the memory region should be treated.
To enable the mapping of the boot memory and \ac{io} devices such as the \ac{uart}, the first memory blocks are marked with a non-cacheable attribute, followed by the normal \ac{dram} region, which is cacheable, and finally the \aca{fimdram} region, which is set to non-cacheable again.
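
As a sketch of this first-level mapping, the following C snippet encodes AArch64 level 1 block descriptors for $\qty{1}{\giga\byte}$ regions; the two attribute index values are illustrative assumptions that depend on how the MAIR register is programmed, and further attribute fields are omitted for brevity.

```c
#include <stdint.h>

/* Hypothetical MAIR slots: index 0 = device/non-cacheable, index 1 = normal cacheable. */
#define ATTR_IDX_DEVICE 0u
#define ATTR_IDX_NORMAL 1u

/* Build a level 1 block descriptor (AArch64, 4 KB granule):
 * bits[1:0] = 0b01 mark a valid block entry, bits[4:2] select the
 * MAIR attribute index, and bit[10] is the access flag (AF). */
static uint64_t l1_block_desc(uint64_t phys_1gib_base, unsigned attr_idx)
{
    return (phys_1gib_base & ~((1ULL << 30) - 1)) /* 1 GiB-aligned base */
         | ((uint64_t)attr_idx << 2)              /* AttrIndx[2:0]      */
         | (1ULL << 10)                           /* AF: no access fault */
         | 0x1ULL;                                /* valid block entry   */
}

/* Level 1 table: 512 entries, each mapping 1 GiB of address space. */
static uint64_t l1_table[512] __attribute__((aligned(4096)));

static void map_1gib(uint64_t base, unsigned attr_idx)
{
    l1_table[(base >> 30) & 0x1FF] = l1_block_desc(base, attr_idx);
}
```

The boot memory and \ac{io} blocks would then be mapped with the device index, the normal \ac{dram} blocks with the cacheable index, and the \aca{fimdram} blocks with the device index again.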

After setting up the page tables, setting the \ac{tcr} to select the $\qty{4}{\kilo\byte}$ granule, and initializing the \ac{ttbr}, which holds the base pointer to the first-level page table, the \ac{mmu} can be enabled, and the boot code can finally dispatch to the \texttt{main} function of the application.
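
Assuming a 39-bit input address space, the \ac{tcr} value could be composed as in the following sketch; only the size and granule fields are shown, while real boot code would also program the walk cacheability and shareability attributes.

```c
#include <stdint.h>

/* Compose a minimal TCR_EL1 value: T0SZ (bits [5:0]) limits the input
 * address to 64 - T0SZ bits, and TG0 (bits [15:14]) selects the TTBR0
 * granule, where 0b00 means 4 KB. All other fields are left at zero
 * here, which a real kernel would have to configure as well. */
static uint64_t make_tcr(unsigned va_bits)
{
    uint64_t t0sz = 64u - va_bits; /* e.g. 39-bit VA -> T0SZ = 25 */
    uint64_t tg0  = 0;             /* 0b00: 4 KB granule for TTBR0 */
    return (t0sz & 0x3F) | (tg0 << 14);
}
```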

\subsubsection{Bare-Metal Utilities}
% Heap Allocator (linked list allocator?...)
% uart

When running an application in a bare-metal environment, the standard library of the programming language supports only very limited features and does not provide the \ac{io} and memory management routines that one expects when running an application on top of an operating system.
For example, it is not possible to use \ac{io} functions such as \texttt{printf} to print log messages to \ac{stdout}.
Instead, the kernel itself must define what it interprets as \ac{stdout} and redirect the formatted strings to a custom implementation.
In the ARM board of gem5, a \ac{uart} device is mapped into the memory map by default, to which the kernel can write messages.
The \ac{uart} device model in gem5 then redirects the written messages either to an output file on the host machine or to a \ac{tcp} port, where a client can then forward the written content to the \ac{stdout} of the host.
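
Such a redirection boils down to a routine like the following sketch; the data register is passed as a parameter for illustration, and on the real board it would be a fixed, board-specific \ac{uart} address, which is an assumption here rather than the actual value.

```c
#include <stdint.h>

/* Write a string to a memory-mapped UART data register, byte by byte.
 * Each volatile store becomes one device write; the device model in
 * the simulator forwards the byte to a file or a TCP port. On real
 * hardware the pointer would be the UART base address. */
static void uart_puts(volatile uint8_t *uart_dr, const char *s)
{
    while (*s != '\0') {
        *uart_dr = (uint8_t)*s++;
    }
}
```

A \texttt{printf}-style frontend can then format into a buffer and hand the result to this routine.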

Further, the bare-metal environment does not support any heap allocation without the kernel explicitly implementing it.
During development of the custom kernel, it was found that the stack is not suitable for storing the large \ac{pim} arrays for two reasons:
First, the \ac{pim} arrays become very large for high matrix dimensions and may not fit in the preallocated stack region.
Second, and most importantly, because the stack resides in the normal, cacheable \ac{dram} region, it cannot be used to store the \ac{pim}-enabled data structures.
As an alternative, it would be possible to also preallocate the \ac{pim} data structures in the \ac{pim} \ac{dram} region by instructing the linker to place these structures in a special section of the \ac{elf} file and mapping that section to the \ac{pim}-enabled \acp{pch}.
However, this approach is very inflexible, as the exact dimensions of the matrices would have to be known at compile time.
To solve this problem, a commonly available memory allocator based on \ac{llff} was used to allocate dynamically sized \ac{pim} arrays at runtime.
The allocator is initialized with a preallocated memory arena, which is mapped to the \ac{pim} region of the \ac{dram}.
The allocator can then dynamically hand out sections of this arena for the \ac{pim} data structures.
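
The core of such an allocator can be sketched as follows; this is a minimal first-fit arena allocator in the spirit of \ac{llff}, not the actual allocator used by the kernel, and freeing and coalescing are omitted for brevity.

```c
#include <stddef.h>
#include <stdint.h>

/* Free blocks form a singly linked list inside the arena itself. */
typedef struct block {
    size_t size;          /* usable bytes following this header */
    struct block *next;   /* next free block */
} block_t;

static block_t *free_list;

/* Initialize the allocator with one arena, e.g. the PIM DRAM region. */
static void arena_init(void *arena, size_t size)
{
    free_list = (block_t *)arena;
    free_list->size = size - sizeof(block_t);
    free_list->next = NULL;
}

/* First fit: walk the free list and split the first block that is large enough. */
static void *arena_alloc(size_t size)
{
    size = (size + 15u) & ~(size_t)15u;          /* keep 16-byte alignment */
    for (block_t **cur = &free_list; *cur != NULL; cur = &(*cur)->next) {
        block_t *b = *cur;
        if (b->size < size)
            continue;
        if (b->size >= size + sizeof(block_t) + 16) {
            /* Split: the remainder stays on the free list. */
            block_t *rest = (block_t *)((uint8_t *)(b + 1) + size);
            rest->size = b->size - size - sizeof(block_t);
            rest->next = b->next;
            *cur = rest;
        } else {
            *cur = b->next;                      /* hand out the whole block */
        }
        b->size = size;
        return b + 1;                            /* memory right after the header */
    }
    return NULL;                                 /* arena exhausted */
}
```

Initializing the arena with the \ac{pim} region then makes every allocation land on the \ac{pim}-enabled \acp{pch}.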

\subsubsection{Memory Configuration}
% address mapping

With the introduced data structures used for addition, scalar multiplication, and matrix multiplication, the implementation of the \aca{fimdram} execution model is explained in the following section.

\subsubsection{Microkernel Execution}
\label{sec:microkernel_execution}

The host processor executes the \ac{pim} microkernel by first switching to the \ac{abp} mode and then issuing the required \ac{rd} and \ac{wr} memory requests by executing \ac{ld} and \ac{st} instructions.
When executing control instructions or data movement instructions that operate only on the register files, the \ac{rd} and \ac{wr} requests must target a dummy region of memory where no actual data is stored, but which must be allocated beforehand.
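
The triggering reads could be sketched as follows; the use of plain volatile loads over a dummy region is an assumption for illustration, as the exact access sequence depends on the \aca{fimdram} command mapping.

```c
#include <stdint.h>

/* Issue RD requests by performing volatile loads from a dummy region
 * mapped to the PIM PCHs: the data read is irrelevant, only the memory
 * traffic triggers the processing units in ABP mode. The region must be
 * non-cacheable so that every load actually reaches the memory. */
static void trigger_pim_reads(volatile uint64_t *dummy_region, int n_requests)
{
    for (int i = 0; i < n_requests; i++) {
        (void)dummy_region[i];   /* each load generates one RD command */
    }
}
```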

The data layout of these three instruction groups is shown in \cref{tab:isa}.

\begin{table}
\centering
\includegraphics[width=\linewidth]{images/isa}
\caption[The instruction format of the processing units]{The instruction format of the processing units \cite{lee2021}.}
\label{tab:isa}
\end{table}

@online{arm2015,
  title = {{ARM Cortex-A Series Programmer's Guide for ARMv8-A}},
  author = {{ARM}},
  date = {2015-03-24},
  url = {https://developer.arm.com/documentation/den0024/latest/},
  langid = {english},
}