diff --git a/src/acronyms.tex b/src/acronyms.tex index d194ff4..68512cd 100644 --- a/src/acronyms.tex +++ b/src/acronyms.tex @@ -300,3 +300,31 @@ short = MMU, long = memory management unit, } +\DeclareAcronym{eflash}{ + short = eFlash, + long = embedded flash, +} +\DeclareAcronym{elf}{ + short = ELF, + long = Executable and Linkable Format, +} +\DeclareAcronym{uart}{ + short = UART, + long = Universal Asynchronous Receiver-Transmitter, +} +\DeclareAcronym{stdout}{ + short = stdout, + long = standard output, +} +\DeclareAcronym{tcr}{ + short = TCR, + long = Translation Control Register, +} +\DeclareAcronym{ttbr}{ + short = TTBR, + long = Translation Table Base Register, +} +\DeclareAcronym{tcp}{ + short = TCP, + long = Transmission Control Protocol, +} diff --git a/src/chapters/implementation/kernel.tex b/src/chapters/implementation/kernel.tex index 7d4a616..3593ea9 100644 --- a/src/chapters/implementation/kernel.tex +++ b/src/chapters/implementation/kernel.tex @@ -1,7 +1,7 @@ \subsection{Application Kernel} \label{sec:kernel} -With both the \aca{fimdram} model in DRAMSys and the software support library, it is now possible to write an application that runs on gem5 and leverages \ac{pim} to accelerate workloads. +With both the \aca{fimdram} model implemented in DRAMSys and the software support library, it is now possible to write an application that runs on gem5 and leverages \ac{pim} to accelerate workloads. When it comes to gem5, there are three different approaches to model a system: \begin{itemize} \item @@ -10,8 +10,8 @@ In this mode, the application is simulated in isolation, while forwarding system This mode has the lowest level of accuracy because many components of the memory system are implemented using a very simplified model, such as page table walking and the \ac{tlb}. \item Simulate the entire system in \textbf{full system} mode, booting a full Linux kernel and running the application to be benchmarked as a user space program. 
-This mode is the most accurate, as it closely resembles a real deployment of an application. -It also provides a complete enough environment to develop device drivers, without the need for a real system. +This mode is the most accurate, as it closely resembles the real-world deployment of an application. +It also provides an environment complete enough to develop device drivers without the need for real hardware. \item Finally, run gem5 in full system mode, but boot a custom kernel in a \textbf{bare-metal} environment. This approach is the most flexible, as the user has full control over the hardware configuration as well as the operating system. @@ -19,23 +19,77 @@ The user application does not have to run in user space, but can run in a privil \end{itemize} While the system call emulation mode is the simplest option, it has been discarded due to its lack of accuracy and inability to execute privileged instructions. -The full system mode, which boots a Linux kernel, on the one hand provides the necessary capability to implement the application, but due to the complexity overhead and the need to write a Linux device driver to execute privileged instructions and control the non-cacheable memory regions, it was decided to favor of the bare-metal option. +The full system mode, which boots a Linux kernel, does provide the necessary capabilities to implement the application. +However, due to the complexity of booting the entire kernel, which makes rapid prototyping infeasible, and the need to write a Linux device driver to execute privileged instructions and control the non-cacheable memory regions, the bare-metal option was favored. Here, the self-written kernel has full control over the complete system which is an advantage when implementing a minimal example utilizing \aca{fimdram}.
On the other hand, some setup is required, such as initializing the page tables so that the \ac{mmu} of the processor can be enabled and programmed to mark memory regions as cacheable and non-cacheable. -% python config +Running a gem5 simulation requires writing a Python script that sets up and connects all system components. +Recently, gem5 deprecated a commonly used prebuilt script called \texttt{fs.py} in favor of its new standard library, which provides useful abstractions over common system components, making it easier to build complex systems in a flexible way without having to specify every detail. +This standard library greatly simplifies the process of building a system with, for example, an accurate timing or out-of-order processor, a multi-level cache hierarchy, a memory crossbar, and a \ac{dram} model. +However, as of the writing of this thesis, gem5 does not provide a board abstraction suitable for bare-metal workloads. +Therefore, it was necessary to take the ARM board provided for full-system Linux simulations and simplify it so that no disk image is required, i.e., the board only boots the provided operating system kernel. \subsubsection{Boot Code} -% linker script -% start assembly script +At startup on an ARM processor, the reset handler cannot directly dispatch to the \texttt{main} function of the application. +Instead, certain initialization steps are required, such as setting the stack pointer and, equally importantly, enabling the on-chip caches by setting up the page tables and enabling the \ac{mmu}. +Fortunately, ARM provides a comprehensive document \cite{gao2017} that explains all the necessary bare-metal setup steps for an ARMv8 processor in the AArch64 execution mode and provides useful examples of boot code that require only minimal modification. +While executing the boot code, however, the processor cannot correctly access the \ac{dram} yet, as the \ac{mmu} is not set up.
+To solve this problem, the ARM board of gem5 provides a small boot memory component, often implemented as \ac{eflash} in real systems, from which the boot code instructions can be fetched and which supports the native access width of the processor. +During the initialization phase, gem5 ensures that the \texttt{.init} section of the boot code is copied into the boot memory, as instructed by the header of the \ac{elf} file generated with the linker script. + +The linker script also maps the \texttt{.text}, \texttt{.data}, \texttt{.rodata}, and \texttt{.bss} sections into the \ac{dram} region. +Furthermore, it reserves space for the stack in the \ac{dram} and maps two special \aca{fimdram} regions: +First, the config region, where the processor writes the \ac{json} messages that switch the execution mode of the \ac{pim} units or transfer the microkernel. +Second, a large \ac{pim} region where all allocated arrays, vectors, and matrices are placed for the processing units to operate on. +This segmentation of the \ac{dram} region is important because otherwise no memory access would be possible during \ac{ab} or \ac{abp} mode to fetch instruction data or store stack variables. +Consequently, the default memory region and the \ac{pim} memory region are located on different \acp{pch} to guarantee their independence from each other. \subsubsection{Cache Management} -% ARM page tables -% cache management +In order to enable the on-chip caches and thus be able to use the \ac{dram}, the page tables have to be set up, which are then used by the \ac{mmu} to translate addresses between the virtual and the physical address space. +To simplify the virtual-to-physical translation, the \ac{dram} address space is simply mapped as a single block at a fixed offset in the virtual address space. +Through the page table attributes, each mapped block of address space can be assigned a cache policy, such as cacheable or non-cacheable.
+While most of the \ac{dram} area should be a normal, cacheable memory region, the \ac{pim} region should be marked as non-cacheable for reasons explained in \cref{sec:microkernel_execution}. +Furthermore, special memory-mapped devices such as the \ac{uart}, which is used to print logging messages to \ac{stdout}, must be marked as a non-cacheable device region, as otherwise the log messages may be held in the cache and not be written out until the cache line is eventually flushed. + +In the AArch64 execution mode, the operating system can choose from three different granule sizes for the translation tables: $\qty{4}{\kilo\byte}$, $\qty{16}{\kilo\byte}$ and $\qty{64}{\kilo\byte}$. +Each granule size has a different maximum depth of page table nesting, with up to a 4-level look-up for the $\qty{4}{\kilo\byte}$ configuration, as shown in \cref{img:pagetable_granule}. + +\begin{figure} + \centering + \includegraphics[width=\linewidth]{images/pagetable_granule} + \caption[The distinct page table levels for the $\qty{4}{\kilo\byte}$ granule]{The distinct page table levels for the $\qty{4}{\kilo\byte}$ granule \cite{arm2015}.} + \label{img:pagetable_granule} +\end{figure} + +As can be seen, when using the complete 4-level lookup process, nine bits of the virtual address are used per level to index into the corresponding page table. +In cases where the input address is restricted to a maximum of 39 bits, the level 0 table can be omitted and translation can start with the level 1 table. +In each table, an entry either points to the physical address of the next-level page table, or can directly point to the base address of a memory block, completing the address translation early. +While regular operating systems may use the complete 4-level lookup process for maximum flexibility, in the controlled bare-metal case, where there is only one application, this is not necessary.
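The per-level index extraction described above can be illustrated with a short sketch. Python is used here purely for illustration and is not part of the kernel; the bit positions follow the ARMv8-A 4-level, 4 KiB-granule translation scheme:

```python
# Illustrative sketch: extracting the per-level translation table indices
# from a 48-bit virtual address for the 4 KiB granule. Each level consumes
# nine index bits on top of the 12-bit page offset.
def table_indices(vaddr: int) -> dict:
    return {
        "offset": vaddr & 0xFFF,      # bits [11:0]: byte offset within the 4 KiB page
        "l3": (vaddr >> 12) & 0x1FF,  # bits [20:12]: level 3 index
        "l2": (vaddr >> 21) & 0x1FF,  # bits [29:21]: level 2 index
        "l1": (vaddr >> 30) & 0x1FF,  # bits [38:30]: level 1 index (1 GiB per entry)
        "l0": (vaddr >> 39) & 0x1FF,  # bits [47:39]: level 0 index
    }
```

Since each level 1 entry spans $2^{30}$ bytes, a kernel that maps the \ac{dram} with 1 GiB blocks only ever consults the level 1 index, which is why a single first-level table suffices in this controlled bare-metal setting.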
+For this reason, the developed kernel uses only the first-level page table and maps the complete \ac{dram} memory region using $\qty{1}{\giga\byte}$ memory blocks. +In addition to the base pointer, each entry in the page table also holds attributes that determine how the memory region should be treated. +To enable the mapping of the boot memory and \ac{io} devices such as the \ac{uart}, the first memory blocks are marked with a non-cacheable attribute, followed by the normal \ac{dram} region, which is cacheable, and finally the \aca{fimdram} region, which is set to non-cacheable again. + +After setting up the page tables, configuring the \ac{tcr} to select the $\qty{4}{\kilo\byte}$ granule, and initializing the \ac{ttbr}, which holds the base pointer to the first-level page table, the \ac{mmu} can be enabled, and the boot code can finally dispatch to the \texttt{main} function of the application. \subsubsection{Bare-Metal Utilities} % Heap Allocator (linked list allocator?...) -% uart + +When running an application in a bare-metal environment, the standard library of the programming language supports only very limited features and does not provide the \ac{io} and memory management routines that one expects when running on top of an operating system. +For example, it is not possible to use \ac{io} functions such as \texttt{printf} to print log messages to \ac{stdout}. +Instead, the kernel itself must define what it interprets as \ac{stdout} and redirect the formatted strings to a custom implementation. +In the ARM board of gem5, a \ac{uart} device is mapped into the memory map by default, to which the kernel can write messages. +The \ac{uart} device model in gem5 then redirects the written messages either to an output file on the host machine or to a \ac{tcp} port, from which a client can forward the content to the \ac{stdout} of the host.
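To give an impression of the client side of this \ac{tcp} forwarding, the following is a minimal sketch in the spirit of gem5's \texttt{m5term} utility. The host name and default port are assumptions; gem5 reports the actual \ac{uart} port when the simulation starts:

```python
# Hedged sketch (not the actual gem5 tooling): connect to the simulated
# UART's TCP port and forward everything received to the host's stdout.
import socket
import sys

def forward_uart(host: str = "localhost", port: int = 3456) -> None:
    with socket.create_connection((host, port)) as conn:
        while True:
            data = conn.recv(4096)
            if not data:  # the simulator closed the connection
                break
            sys.stdout.write(data.decode(errors="replace"))
            sys.stdout.flush()
```

In practice, writing to an output file is convenient for batch benchmarking runs, while the \ac{tcp} client is useful for following the kernel log interactively.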
+ +Furthermore, the bare-metal environment does not support heap allocation unless the kernel explicitly implements it. +During the development of the custom kernel, it was found that the stack is not suitable for storing the large \ac{pim} arrays for two reasons: +First, the \ac{pim} arrays become very large for high matrix dimensions and may not fit in the preallocated stack region. +Second, and most importantly, the stack resides in the normal, cacheable \ac{dram} region and can therefore not be used to store the \ac{pim}-enabled data structures. +As an alternative, the \ac{pim} data structures could also be statically preallocated in the \ac{pim} \ac{dram} region by instructing the linker to place them in a special section of the \ac{elf} file and mapping that section to the \ac{pim}-enabled \acp{pch}. +However, this approach is very inflexible, as the exact dimensions of the matrices would have to be known at compile time. +To solve this problem, an existing, commonly available memory allocator based on \ac{llff} is used to allocate dynamically sized \ac{pim} arrays at runtime. +The allocator is initialized with a preallocated memory arena, which is mapped to the \ac{pim} region of the \ac{dram}. +It can then dynamically use sections of this arena to allocate the \ac{pim} data structures. \subsubsection{Memory Configuration} % address mapping diff --git a/src/chapters/implementation/library.tex b/src/chapters/implementation/library.tex index 274320d..162129e 100644 --- a/src/chapters/implementation/library.tex +++ b/src/chapters/implementation/library.tex @@ -111,6 +111,7 @@ With the introduced data structures used for addition, scalar multiplication and The implementation of the \aca{fimdram} execution model is explained in the following section.
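The first-fit arena allocation described in the kernel subsection above can be sketched as follows. This is an illustrative model only; the actual kernel uses an existing \ac{llff} implementation, and all names here are hypothetical:

```python
# Illustrative model of a first-fit free-list allocator over a fixed
# arena, as used for the PIM region (simplified: no alignment handling,
# no coalescing of freed blocks).
class ArenaAllocator:
    def __init__(self, base: int, size: int) -> None:
        # Initially, a single free block spans the whole arena.
        self.free_list = [(base, size)]  # (start address, size) pairs

    def alloc(self, size: int) -> int:
        # First fit: scan the free list and take the first block that is
        # large enough, splitting off the unused remainder.
        for i, (start, block) in enumerate(self.free_list):
            if block >= size:
                if block > size:
                    self.free_list[i] = (start + size, block - size)
                else:
                    del self.free_list[i]
                return start
        raise MemoryError("PIM arena exhausted")

    def free(self, start: int, size: int) -> None:
        # Simplified: the block is returned without merging neighbors.
        self.free_list.append((start, size))
```

In the kernel, the arena base corresponds to the start of the \ac{pim} region, so every address handed out by the allocator automatically lands on the \ac{pim}-enabled \acp{pch}.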
\subsubsection{Microkernel Execution} +\label{sec:microkernel_execution} The host processor executes the \ac{pim} microkernel by first switching to the \ac{abp} mode and then issuing the required \ac{rd} and \ac{wr} memory requests by executing \ac{ld} and \ac{st} instructions. When executing control instructions or data movement instructions that operate only on the register files, the \ac{rd} and \ac{wr} requests must be located in a dummy region of memory where no actual data is stored, but which must be allocated beforehand. diff --git a/src/chapters/pim.tex b/src/chapters/pim.tex index c23819c..7b1141d 100644 --- a/src/chapters/pim.tex +++ b/src/chapters/pim.tex @@ -200,7 +200,7 @@ The data layout of these three instruction groups is shown in \cref{tab:isa}. \begin{table} \centering - \includegraphics[width=0.9\linewidth]{images/isa} + \includegraphics[width=\linewidth]{images/isa} \caption[The instruction format of the processing units]{The instruction format of the processing units \cite{lee2021}.} \label{tab:isa} \end{table} diff --git a/src/doc.bib b/src/doc.bib index d302793..13677fd 100644 --- a/src/doc.bib +++ b/src/doc.bib @@ -23,6 +23,7 @@ title = {{{ARM Cortex-A Series Programmer}}’s {{Guide}} for {{ARMv8-A}}}, author = {{ARM}}, date = {2015-03-24}, + url = {https://developer.arm.com/documentation/den0024/latest/}, langid = {english}, file = {/home/derek/Nextcloud/Verschiedenes/Zotero/storage/KGNI52X5/2015 - ARM Cortex-A Series Programmer’s Guide for ARMv8-A.pdf} } diff --git a/src/images/isa.pdf b/src/images/isa.pdf index e0bb2a7..05bd092 100644 Binary files a/src/images/isa.pdf and b/src/images/isa.pdf differ diff --git a/src/images/pagetable_granule.pdf b/src/images/pagetable_granule.pdf new file mode 100644 index 0000000..9cab7f0 Binary files /dev/null and b/src/images/pagetable_granule.pdf differ