Fixes in kernel chapter and conclusion

This commit is contained in:
2024-02-17 16:18:19 +01:00
parent cf0b8c3984
commit e597305a58
3 changed files with 29 additions and 28 deletions


@@ -8,14 +8,14 @@ A working \ac{vp} of \aca{fimdram}, in the form of a software model, was develop
It was found that, ... (TODO: results here).
However, there is still room for improvement in the software model or the comparison methodology, which will be the subject of future work.
Firstly, the developed software library and the implemented model are not yet a drop-in replacement for the real \aca{fimdram} implementation due to the custom communication protocol between the host processor and the \ac{pim} processing units, used to implement the mode switching and the transfer of the microkernels.
For this, more detailed information is required from Samsung, as the exact interface of \aca{fimdram} is not described in the published papers \cite{kwon2021}, \cite{lee2021} and \cite{kang2022}.
To ease the currently error-prone microkernel development process, the software library could help the developer by providing building blocks that assemble the microkernel and simultaneously generate the necessary \ac{ld} and \ac{st} instructions to execute the kernel.
In addition, the current bare-metal deployment of the software cannot realistically be used to accelerate real-world \ac{dnn} applications.
Instead, it should be possible to use \aca{fimdram} on a Linux system, which would require integrating the software support library into a Linux device driver.
To take into account the special alignment requirements of the \ac{pim} data structures, this device driver must also carefully consider the virtual address translation of the Linux kernel, possibly making use of so-called \acp{hugetlb}, as the alignment requirements exceed the default page size of $\qty{4}{\kilo\byte}$.
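Such a driver shim could satisfy an alignment requirement beyond the default page size as sketched below. This is a minimal user-space illustration only: the 2 MiB value is a hypothetical stand-in for the actual \aca{fimdram} alignment, and a real driver would reserve physically contiguous, hugepage-backed memory instead.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical alignment requirement exceeding the 4 KiB page size;
 * the real FIMDRAM constraint would follow from its address mapping. */
#define PIM_ALIGN (2UL * 1024 * 1024) /* 2 MiB, one common huge-page size */

/* Allocate a buffer whose start address satisfies the PIM alignment.
 * A real Linux driver would instead reserve physically contiguous,
 * hugepage-backed memory (e.g. via hugetlbfs); this user-space sketch
 * only demonstrates the alignment arithmetic. */
static void *pim_alloc(size_t bytes)
{
    void *p = NULL;
    if (posix_memalign(&p, PIM_ALIGN, bytes) != 0)
        return NULL;
    return p;
}
```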
For a better evaluation of the performance gains of \aca{fimdram}, it should also be compared with real-world \ac{dnn} applications.
Effects such as the initialization overhead of \aca{fimdram} can only be evaluated in such an environment.
Furthermore, the integration of \aca{fimdram} should be extended to \acp{gpu} or \acp{tpu}, so that the comparison can cover the deployment of real \ac{dnn} applications.


@@ -11,11 +11,11 @@ This mode has the lowest level of accuracy because many components of the memory
\item
Simulate the entire system in \textbf{full system} mode, booting a full Linux kernel and running the application to be benchmarked as a user space program.
This mode is the most accurate, as it closely resembles the real deployment of an application.
It also provides a complete enough environment to develop device drivers, without the need for the real hardware.
\item
Finally, run gem5 in full system mode, but boot a custom kernel in a \textbf{bare-metal} environment.
This approach is the most flexible, as the user has full control over the hardware configuration as well as the operating system.
The user application does not have to run in user space, but can run in a privileged mode, making it easy to implement low-level routines without having to write a device driver and its user space interface.
\end{itemize}
While the system call emulation mode is the simplest option, it has been discarded due to its lack of accuracy and inability to execute privileged instructions.
@@ -25,34 +25,35 @@ Here, the self-written kernel has full control over the complete system which is
On the other hand, some setup is required, such as initializing the page tables so that the \ac{mmu} of the processor can be enabled and programmed to mark memory regions as cacheable and non-cacheable.
Running a gem5 simulation requires writing a Python script that sets up all system components and connects them.
Recently, gem5 deprecated a commonly used prebuilt script called \texttt{fs.py} in favor of its new standard library.
This standard library provides useful abstractions over common system components, making it easier to build complex systems in a flexible way without having to dive into great detail.
It also greatly simplifies the process of building a system with, for example, an accurate timing or out-of-order processor, a multi-level cache hierarchy, a memory crossbar, and a \ac{dram} model.
However, as of writing this thesis, gem5 does not provide a board abstraction suitable for bare-metal workloads.
Therefore, it was necessary to modify the provided ARM board for full system Linux simulations and simplify it in such a way, so that no disk image is required, i.e., the board only boots the provided kernel file.
\subsubsection{Boot Code}
At startup on an ARM processor, the reset handler cannot directly dispatch to the \texttt{main} function in the application.
Instead, certain initialization steps are required, such as setting up the stack pointer and, equally important, enabling the on-chip caches by initializing the page tables and enabling the \ac{mmu}.
Fortunately, ARM provides a comprehensive document \cite{gao2017} that explains all the necessary setup steps for a bare-metal ARMv8 processor in the AArch64 execution mode and provides useful examples of the necessary boot code, which require only minimal modification.
While executing the boot code, however, the processor cannot yet correctly access the \ac{dram}, as the \ac{mmu} is not set up.
To solve this problem, the ARM board of gem5 provides a small boot memory component, often implemented as \ac{eflash} in real systems, from which the boot code instructions can be fetched and which supports the native access width of the processor.
During the initialization phase, gem5 ensures that the boot code in the \texttt{.init} section of the executable is copied into the boot memory, as instructed by the header of the \ac{elf} file, generated by the linker script.
The linker script also maps the \texttt{.text}, the \texttt{.data}, the \texttt{.rodata} and the \texttt{.bss} sections into the \ac{dram} region.
Furthermore, it reserves space for the stack on the \ac{dram} and sets up two special \aca{fimdram} regions:
Firstly, the config region, where the processor writes the \ac{json} messages that switch the execution mode of the \ac{pim} units or transfer the microkernel.
Secondly, a large \ac{pim} region where all allocated arrays, vectors, and matrices are placed for the processing units to operate on.
This segmentation of the \ac{dram} region is important because otherwise no memory access would be possible during \ac{ab} or \ac{abp} mode to fetch instruction data or load and store stack variables.
Consequently, the default memory region and the \ac{pim} memory region are located on different \acp{pch} to guarantee this independence from each other.
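The segmentation described above can be made concrete with a small helper that decides whether an address belongs to one of the \aca{fimdram} regions. The base addresses and sizes below are hypothetical placeholders; the actual values depend on the board configuration and linker script.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical memory map mirroring the segmentation described above;
 * the actual base addresses depend on the gem5 board configuration. */
#define PIM_CFG_BASE 0xC0000000UL   /* config region for JSON messages */
#define PIM_CFG_SIZE 0x00100000UL
#define PIM_BASE     0xC0100000UL   /* PIM operand region */
#define PIM_SIZE     0x3FF00000UL

/* Returns true if the address lies in one of the two PIM regions,
 * i.e. it must be kept out of the default (cacheable) DRAM mapping. */
static bool is_pim_address(uint64_t addr)
{
    if (addr >= PIM_CFG_BASE && addr < PIM_CFG_BASE + PIM_CFG_SIZE)
        return true;
    return addr >= PIM_BASE && addr < PIM_BASE + PIM_SIZE;
}
```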
\subsubsection{Cache Management}
In order to enable the on-chip caches and therefore be able to use the \ac{dram}, the page tables have to be set up, which are then used by the \ac{mmu} to map addresses between the virtual memory space and the physical memory space.
To simplify the virtual-physical translation, the \ac{dram} address space should only be mapped as a block at a certain offset in the virtual address space.
In the attributes of the page table, each mapped block of address space can be assigned a special cache policy, such as cacheable and non-cacheable.
While most of the \ac{dram} area should be a normal, cacheable memory region, the \ac{pim} region should be marked as a non-cacheable memory for reasons explained in \cref{sec:microkernel_execution}.
Furthermore, special memory-mapped devices such as the \ac{uart}, which is used to print logging messages to the \ac{stdout}, must be marked as a non-cacheable device region, as otherwise the log messages may get held in the cache and not be written until the cache line is eventually flushed.
In the AArch64 execution mode, the operating system can choose from three different granule sizes for the translation tables: $\qty{4}{\kilo\byte}$, $\qty{16}{\kilo\byte}$ and $\qty{64}{\kilo\byte}$.
Each granule size has a different maximum amount of page table nesting, with up to a 4-level look-up for the $\qty{4}{\kilo\byte}$ configuration, as shown in \cref{img:pagetable_granule}.
\begin{figure}
@@ -65,30 +66,30 @@ Each granule size has a different maximum amount of page table nesting, with up
As can be seen, when using the complete 4-level page lookup process, nine bits of the virtual address are used per level to index into the corresponding page table.
In cases where the input address is restricted to a maximum of 42 bits, the level 0 table can be omitted and translation can start with the level 1 table.
In each table, an entry either points to the physical address of the next level page table, or alternatively can directly point to the base address of a memory block, completing the address translation prematurely.
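The index extraction described above can be sketched as a short helper for the $\qty{4}{\kilo\byte}$ granule, where each level consumes nine index bits (bits 47--39, 38--30, 29--21, 20--12) above the 12-bit page offset:

```c
#include <stdint.h>

/* Extract the translation-table index for a given lookup level from a
 * 48-bit virtual address, assuming the 4 KiB granule: nine index bits
 * per level plus a 12-bit page offset. */
static unsigned va_index(uint64_t va, int level)   /* level 0..3 */
{
    int shift = 12 + 9 * (3 - level);  /* level 3 starts at bit 12 */
    return (va >> shift) & 0x1FF;      /* nine bits per level */
}

static unsigned page_offset(uint64_t va)
{
    return va & 0xFFF;                 /* byte offset within a 4 KiB page */
}
```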
While regular operating systems may use the complete $\qty{4}{\kilo\byte}$ lookup procedure for maximum flexibility, in the controlled bare-metal case, where there is only one application, this may not be necessary.
For this reason, the developed kernel makes use of the first level page table and maps the complete \ac{dram} memory region using $\qty{1}{\giga\byte}$ memory blocks.
In addition to the base pointer, each entry in the page table also holds attributes describing how the memory region should be treated.
To enable the mapping of the boot memory and \ac{io} devices such as \ac{uart}, the first memory blocks are marked with a non-cacheable attribute, followed by the normal \ac{dram} region, which is cacheable, and finally the \aca{fimdram} region, which is set to non-cacheable again.
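A level-1 block entry of this kind can be sketched as follows, following the ARMv8-A stage-1 descriptor format (bits [1:0] = 0b01 mark a block entry, AttrIndx in bits [4:2] selects a memory attribute from \texttt{MAIR\_EL1}, and the access flag in bit 10 must be set). The MAIR index values are illustrative, not the kernel's actual convention:

```c
#include <stdint.h>

/* Sketch of a level-1 block descriptor for a 1 GiB mapping. */
#define DESC_BLOCK   0x1ULL          /* valid + block (not table) entry */
#define DESC_AF      (1ULL << 10)    /* access flag */
#define ATTR_IDX(i)  ((uint64_t)(i) << 2)

/* MAIR indices are a convention of the kernel's own MAIR_EL1 setup;
 * these particular values are illustrative. */
enum { MAIR_DEVICE = 0, MAIR_NORMAL = 1, MAIR_NONCACHE = 2 };

static uint64_t l1_block_entry(uint64_t phys_base, unsigned mair_idx)
{
    /* The output address of a 1 GiB block must be 1 GiB aligned,
     * so the low 30 bits of phys_base are masked off. */
    return (phys_base & ~((1ULL << 30) - 1))
           | ATTR_IDX(mair_idx) | DESC_AF | DESC_BLOCK;
}
```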
After setting up the page tables, initializing the \ac{tcr} to select the $\qty{4}{\kilo\byte}$ granule, and assigning the \ac{ttbr}, which holds the base pointer to the first level page table, the \ac{mmu} can be enabled, and the boot code can finally dispatch to the \texttt{main} function of the application.
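The relevant \ac{tcr} fields can be sketched as below: \texttt{T0SZ} (bits [5:0]) sets the virtual address width to $64 - \texttt{T0SZ}$ bits, where a value of 25 yields a 39-bit space whose table walk starts directly at a level 1 table, and \texttt{TG0} (bits [15:14]) = 0b00 selects the $\qty{4}{\kilo\byte}$ granule. Other fields (shareability and cacheability of the table walks) are omitted here for brevity, so this is not a complete register value:

```c
#include <stdint.h>

/* Sketch of a TCR_EL1 value selecting the 4 KiB granule and a 39-bit
 * virtual address space (walk starts at the level 1 table). */
#define TCR_T0SZ(n)  ((uint64_t)(n) & 0x3F)  /* bits [5:0] */
#define TCR_TG0_4K   (0ULL << 14)            /* bits [15:14] = 0b00 */

static uint64_t make_tcr(void)
{
    return TCR_T0SZ(25) | TCR_TG0_4K;  /* 64 - 25 = 39-bit VA space */
}
```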
\subsubsection{Bare-Metal Utilities}
% Heap Allocator (linked list allocator?...)
In a bare-metal environment, the standard library of the programming language offers only minimal features and lacks the \ac{io} and memory management functions typically provided when running an application on top of an operating system.
For example, it is not possible to use \ac{io} functions such as \texttt{printf} to print log messages to \ac{stdout}.
Instead, the kernel must explicitly define the interpretation of \ac{stdout} and redirect the formatted strings to the custom implementation.
In the ARM board of gem5, a \ac{uart} device is mapped by default into the memory map, to which the kernel can write messages.
The \ac{uart} device model in gem5 then redirects the written messages either to an output file on the host machine or to a \ac{tcp} port, where a client can then redirect the written content to the \ac{stdout} of the host.
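The redirection boils down to storing each character to the memory-mapped \ac{uart} data register. In the sketch below the register address is passed as a pointer so the routine can be exercised without the device; on the real board it would be the fixed physical address of the \ac{uart} data register:

```c
#include <stdint.h>

/* Minimal sketch of redirecting stdout to a memory-mapped UART: each
 * character is stored to the UART data register, and the simulated
 * device forwards it to a file or TCP port on the host. */
static void uart_puts(volatile uint8_t *uart_dr, const char *s)
{
    while (*s)
        *uart_dr = (uint8_t)*s++;  /* one store per character */
}
```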
Further, the bare-metal environment does not support any heap allocation without the kernel explicitly implementing it.
During development of the custom kernel, it was found that the stack is not suitable for storing the large \ac{pim} arrays for two reasons:
Firstly, the \ac{pim} arrays become very large with high matrix dimensions and may not fit in the preallocated stack region.
Secondly, and most importantly, because the stack resides in the normal cacheable \ac{dram} region, it cannot be used to store the \ac{pim}-enabled data structures.
As an alternative, it would be possible to preallocate all \ac{pim} data structures in the \ac{pim} \ac{dram} region by instructing the linker to place these structures in a special section of the \ac{elf} file and mapping that section to the \ac{pim}-enabled \acp{pch}.
However, this approach is very inflexible, as the exact dimensions of the matrices would have to be known at compile time.
To solve this problem, a commonly available memory allocator based on \ac{llff} has been used, making it possible to allocate dynamically sized \ac{pim} arrays at runtime.
In order to incorporate this memory allocator, it was initialized by providing a preallocated memory arena, which is mapped to the \ac{pim} region of the \ac{dram}.
The allocator can then dynamically use sections of this arena to allocate the \ac{pim} data structures.
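A first-fit free-list allocator of this kind can be sketched as below. This is an illustrative miniature in the spirit of \ac{llff}, not the actual allocator used in the kernel: blocks are carved from a caller-provided arena, allocation scans the free list for the first fitting block, and freed blocks are prepended without coalescing.

```c
#include <stddef.h>
#include <stdint.h>

/* Minimal first-fit free-list allocator over a fixed arena. */
typedef struct block {
    size_t size;          /* payload bytes */
    struct block *next;   /* next free block */
} block_t;

static block_t *free_list;

/* Hand the allocator its arena, e.g. memory mapped to the PIM region. */
void pim_heap_init(void *arena, size_t size)
{
    free_list = (block_t *)arena;
    free_list->size = size - sizeof(block_t);
    free_list->next = NULL;
}

void *pim_malloc(size_t size)
{
    size = (size + 15) & ~(size_t)15;            /* 16-byte alignment */
    for (block_t **p = &free_list; *p; p = &(*p)->next) {
        block_t *b = *p;
        if (b->size < size)
            continue;                            /* first fit: too small */
        if (b->size >= size + sizeof(block_t) + 16) {
            /* split: carve the tail off as a new free block */
            block_t *rest = (block_t *)((uint8_t *)(b + 1) + size);
            rest->size = b->size - size - sizeof(block_t);
            rest->next = b->next;
            b->size = size;
            *p = rest;
        } else {
            *p = b->next;                        /* use whole block */
        }
        return b + 1;                            /* payload after header */
    }
    return NULL;                                 /* arena exhausted */
}

void pim_free(void *ptr)
{
    if (!ptr)
        return;
    block_t *b = (block_t *)ptr - 1;
    b->next = free_list;                         /* no coalescing */
    free_list = b;
}
```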
\subsubsection{Memory Configuration}
@@ -96,7 +97,7 @@ The allocator can then dynamically use sections of this arena to allocate the \a
As already discussed in \cref{sec:memory_layout} and in \cref{sec:microkernel_execution}, certain requirements are imposed on the configuration of the memory system, such as the \ac{am}.
These configurations can be set when instantiating DRAMSys while it is being connected to the gem5 memory bus.
In \aca{hbm}, the burst size of a memory access is exactly $\qty{32}{\byte}$, which therefore defines the lowest five bits of any valid memory address.
Since $\log_2(32)=5$, the lowest five bits of an address must be zero, as this is the smallest granularity with which the \ac{dram} can be accessed.
The next highest bits should already switch between the different memory banks, as these are coupled with the different processing units.
Following from the 16-wide \ac{fp16} vectors, one of which is $\qty{32}{\byte}$ in size, and the column-major matrix format, subsequent vectors in the linear address space should be spread across all banks so that the processing units can concurrently perform the \ac{mac} operation.
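This mapping can be illustrated with a small helper that derives the bank index from an address: the low five bits select the byte within one $\qty{32}{\byte}$ burst, and the bits directly above them select the bank, so consecutive $\qty{32}{\byte}$ \ac{fp16} vectors land on successive banks. The number of bank bits here is an assumption; the actual mapping is defined by the DRAMSys configuration.

```c
#include <stdint.h>

#define BURST_BITS 5                 /* log2(32-byte burst) */
#define BANK_BITS  4                 /* hypothetical: 16 banks */

/* Bank addressed by a given physical address under this mapping. */
static unsigned bank_of(uint64_t addr)
{
    return (addr >> BURST_BITS) & ((1u << BANK_BITS) - 1);
}
```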
@@ -194,7 +195,7 @@ The gem5 simulator reports this number of ticks and other statistics in a file a
However, since the boot process, the setup of the matrix operands, and the mode switching of the processing units should not be captured, a more fine-grained control is necessary.
This can be achieved using the so-called M5ops.
By using special instructions that the processor model interprets, it is possible to control the recording of the statistics directly from the simulated application.
Another option is to generate special memory accesses at predefined addresses, which the processor then interprets in a certain way.
These special instructions or memory accesses for exiting the simulation, resetting the statistics, and dumping the statistics are then inserted into the kernel as follows:
Before executing the microkernel of a benchmark, the simulation statistics are reset, while after execution they are explicitly dumped, measuring only the execution of the microkernel.
To compare the use of \aca{fimdram} with conventional matrix operations on the host processor, only the computation itself, i.e., the core, is measured, not the initialization.
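The instrumentation pattern above can be sketched as follows. In a real gem5 workload, \texttt{m5\_reset\_stats} and \texttt{m5\_dump\_stats} from the m5 utility library emit the special M5op instructions; the stub definitions below are host-side stand-ins that merely record the call order so the pattern can be shown outside the simulator.

```c
#include <stdint.h>
#include <string.h>

static char trace[64];  /* records the call order for illustration */

/* Stand-ins for the real M5ops provided by gem5's m5 library. */
static void m5_reset_stats(uint64_t delay, uint64_t period)
{
    (void)delay; (void)period;
    strcat(trace, "reset;");
}

static void m5_dump_stats(uint64_t delay, uint64_t period)
{
    (void)delay; (void)period;
    strcat(trace, "dump;");
}

static void run_microkernel(void) { strcat(trace, "kernel;"); }

/* Only the microkernel execution lands in the dumped statistics. */
static void benchmark(void)
{
    /* ... mode switching and operand setup, excluded from measurement ... */
    m5_reset_stats(0, 0);
    run_microkernel();
    m5_dump_stats(0, 0);
}
```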


@@ -147,6 +147,6 @@ Finally, another memory barrier must synchronize the memory operations, otherwis
During the development of this cache management approach, it became apparent that the cache may not be sufficiently controllable by the user program.
The compiler may introduce additional stack variables and memory accesses that are not obvious to the developer, rendering the explicit generation of \ac{rd} and \ac{wr} commands nearly impossible.
Therefore, these critical sections would have to be written in an assembly language to have the necessary control over the processor.
However, other user programs running in the background at the same time would also interfere with the cache in an uncontrollable manner, making this approach very difficult.
With these utility routines for executing the \ac{pim} microkernel in place, all tools are now available to build an application that makes proper use of \aca{fimdram} for accelerating \ac{dnn} applications.