\subsection{Application Kernel}
\label{sec:kernel}

With both the \aca{fimdram} model implemented in DRAMSys and the software support library, it is now possible to write an application that runs on gem5 and leverages \ac{pim} to accelerate workloads.
In gem5, there are three different approaches to modeling a system:
\begin{itemize}
\item
Run the user-space application in \textbf{system call emulation} mode.
In this mode, the application is simulated in isolation, while system calls are forwarded to the host operating system.
This mode has the lowest level of accuracy because many components of the memory system, such as page table walking and the \ac{tlb}, are implemented using highly simplified models.
\item
Simulate the entire system in \textbf{full system} mode, booting a full Linux kernel and running the application to be benchmarked as a user-space program.
This mode is the most accurate, as it closely resembles the real deployment of an application.
It also provides an environment complete enough to develop device drivers without needing the real system.
\item
Finally, run gem5 in full system mode, but boot a custom kernel in a \textbf{bare-metal} environment.
This approach is the most flexible, as the user has full control over the hardware configuration as well as the operating system.
The user application does not have to run in user space, but can run in a privileged mode, making it easy to implement low-level routines without having to write a device driver and its user-space interface.
\end{itemize}

While the system call emulation mode is the simplest option, it was discarded due to its lack of accuracy and its inability to execute privileged instructions.
The full system mode with a booted Linux kernel, on the one hand, provides the capabilities necessary to implement the application.
However, the complexity of booting the entire kernel renders rapid prototyping unfeasible, and executing privileged instructions and controlling the non-cacheable memory regions would require writing a Linux device driver, so it was decided to favor the bare-metal option.
Here, the self-written kernel has full control over the complete system, which is an advantage when implementing a minimal example utilizing \aca{fimdram}.
On the other hand, some setup is required, such as initializing the page tables so that the \ac{mmu} of the processor can be enabled and programmed to mark memory regions as cacheable or non-cacheable.

Running a gem5 simulation requires writing a Python script that sets up all system components and connects them.
Recently, gem5 deprecated a commonly used prebuilt script called \texttt{fs.py} in favor of its new standard library, which provides useful abstractions over common system components, making it easier to build complex systems in a flexible way without having to dive into great detail.
This standard library greatly simplifies the process of building a system with, for example, an accurate timing or out-of-order processor, a multi-level cache hierarchy, a memory crossbar, and a \ac{dram} model.
However, as of writing this thesis, gem5 does not provide a board abstraction suitable for bare-metal workloads.
Therefore, it was necessary to modify the provided ARM board for full system Linux simulations and simplify it such that no disk image is required, i.e., the board boots only the provided operating system kernel.

\subsubsection{Boot Code}
At startup on an ARM processor, the reset handler cannot directly dispatch to the \texttt{main} function of the application.
Instead, certain initialization steps are required, such as setting the stack pointer and, equally important, enabling the on-chip caches by setting up the page tables and enabling the \ac{mmu}.
Fortunately, ARM provides a comprehensive document \cite{gao2017} that explains all the necessary bare-metal setup steps for an ARMv8 processor in the AArch64 execution mode and provides useful examples of the boot code that require only minimal modification.
While executing the boot code, however, the processor cannot correctly access the \ac{dram} yet, as the \ac{mmu} is not set up.
To solve this problem, the ARM board of gem5 provides a small boot memory component, often implemented as \ac{eflash} in real systems, from which the boot code instructions can be fetched and which supports the native access width of the processor.
During the initialization phase, gem5 ensures that the boot code \texttt{.init} section is copied into the boot memory, as instructed by the header of the \ac{elf} file generated by the linker script.

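The following sketch illustrates the shape of such an entry point, assuming GCC on AArch64; the symbol \texttt{\_stack\_top} and the two helper routines are hypothetical names standing in for the steps described above, not the exact boot code of this thesis:
\begin{verbatim}
/* Entry point placed in .init by the linker script -- a minimal sketch.
 * _stack_top, setup_pagetables() and enable_mmu() are illustrative names. */
__attribute__((naked, section(".init")))
void _start(void)
{
    __asm__ volatile(
        "adrp x0, _stack_top          \n" /* stack top from linker script */
        "add  x0, x0, :lo12:_stack_top\n"
        "mov  sp, x0                  \n"
        "bl   setup_pagetables        \n" /* write the translation tables */
        "bl   enable_mmu              \n" /* program TCR/TTBR/SCTLR       */
        "bl   main                    \n"
        "1: wfe                       \n" /* park if main ever returns    */
        "b    1b                      \n");
}
\end{verbatim}
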
The linker script also maps the \texttt{.text}, \texttt{.data}, \texttt{.rodata} and \texttt{.bss} sections into the \ac{dram} region.
Furthermore, it reserves space for the stack in the \ac{dram} and maps two special \aca{fimdram} regions:
First, the config region, where the processor writes the \ac{json} messages that switch the execution mode of the \ac{pim} units or transfer the microkernel.
Second, a large \ac{pim} region where all allocated arrays, vectors, and matrices are placed for the processing units to operate on.
This segmentation of the \ac{dram} region is important because otherwise no memory access would be possible during \ac{ab} or \ac{abp} mode to fetch instruction data or store stack variables.
Consequently, the default memory region and the \ac{pim} memory region are located on different \acp{pch} to guarantee this independence from each other.

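To illustrate how the config region is used, consider the following sketch; the base address and the message format are placeholders for illustration, not the actual values used by the support library:
\begin{verbatim}
#include <stdio.h>

/* Hypothetical address; the real one comes from the linker script. */
#define FIM_CONFIG_BASE ((volatile char *)0x80000000UL)

/* Switch the execution mode of the PIM units by writing a JSON message
 * into the config region (message format assumed for illustration). */
void fim_set_mode(const char *mode) /* e.g. "SB", "AB" or "ABP" */
{
    char msg[64];
    int n = snprintf(msg, sizeof msg, "{\"mode\": \"%s\"}", mode);
    for (int i = 0; i <= n; i++)        /* include the terminating NUL */
        FIM_CONFIG_BASE[i] = msg[i];
}
\end{verbatim}
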
\subsubsection{Cache Management}
In order to enable the on-chip caches and therefore be able to use the \ac{dram}, the page tables have to be set up, which will then be used by the \ac{mmu} to map addresses between the virtual and the physical memory space.
To simplify the virtual-to-physical translation, the \ac{dram} address space should simply be mapped as one contiguous block at a fixed offset in the virtual address space.
In the page table attributes, each mapped block of address space can be assigned a cache policy, such as cacheable or non-cacheable.
While most of the \ac{dram} area should be a normal, cacheable memory region, the \ac{pim} region should be marked as non-cacheable memory for reasons explained in \cref{sec:microkernel_execution}.
Furthermore, special memory-mapped devices such as the \ac{uart}, which is used to print logging messages to \ac{stdout}, must be marked as a non-cacheable device region, as otherwise the log messages may be held in the cache and not be written until the cache line is eventually flushed.

In the AArch64 execution mode, the operating system can choose from three different granule sizes for the translation tables: $\qty{4}{\kilo\byte}$, $\qty{16}{\kilo\byte}$ and $\qty{64}{\kilo\byte}$.
Each granule size has a different maximum depth of page table nesting, with up to a 4-level look-up for the $\qty{4}{\kilo\byte}$ configuration, as shown in \cref{img:pagetable_granule}.

\begin{figure}
\centering
\includegraphics[width=\linewidth]{images/pagetable_granule}
\caption[The distinct page table levels for the $\qty{4}{\kilo\byte}$ granule]{The distinct page table levels for the $\qty{4}{\kilo\byte}$ granule \cite{arm2015}.}
\label{img:pagetable_granule}
\end{figure}

As can be seen, when using the complete 4-level page lookup process, nine bits of the virtual address are used per level to index into the corresponding page table.
In cases where the input address is restricted to a maximum of 42 bits, the level 0 table can be omitted and translation can start with the level 1 table.
In each table, an entry either points to the physical address of the next-level page table or can directly point to the base address of a memory block, terminating the address translation early.
While regular operating systems may use the complete $\qty{4}{\kilo\byte}$ lookup process for maximum flexibility, this is not necessary in the controlled bare-metal case, where there is only one application.
For this reason, the developed kernel makes use of the first-level page table only and maps the complete \ac{dram} memory region using $\qty{1}{\giga\byte}$ memory blocks.
In addition to the base pointer, each entry in the page table also holds attributes describing how the memory region should be treated.
To enable the mapping of the boot memory and \ac{io} devices such as the \ac{uart}, the first memory blocks are marked with a non-cacheable attribute, followed by the normal \ac{dram} region, which is cacheable, and finally the \aca{fimdram} region, which is set to non-cacheable again.

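A sketch of how such level 1 block entries can be constructed is shown below; the descriptor bit positions follow the ARMv8-A architecture, while the attribute indices are assumptions for illustration, and further fields such as shareability are omitted:
\begin{verbatim}
#include <stdint.h>

/* Level 1 translation table: 512 entries, each mapping a 1 GiB block. */
#define DESC_BLOCK  0x1UL                /* bits[1:0] = 0b01: block entry */
#define DESC_AF     (1UL << 10)          /* access flag                   */
#define ATTR_IDX(i) ((uint64_t)(i) << 2) /* MAIR_EL1 index (bits[4:2])    */

#define ATTR_DEVICE 0 /* assumed MAIR layout: 0 = device/non-cacheable */
#define ATTR_NORMAL 1 /*                      1 = normal cacheable     */

uint64_t l1_table[512] __attribute__((aligned(4096)));

/* Identity-map one 1 GiB block with the given memory attribute. */
void map_1g_block(uint64_t addr, unsigned attr_idx)
{
    l1_table[(addr >> 30) & 0x1FF] =
        (addr & ~((1UL << 30) - 1)) | ATTR_IDX(attr_idx)
                                    | DESC_AF | DESC_BLOCK;
}
\end{verbatim}
The boot memory and device blocks would then be mapped with \texttt{ATTR\_DEVICE}, the normal \ac{dram} blocks with \texttt{ATTR\_NORMAL}, and the \aca{fimdram} blocks as non-cacheable again.
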
After setting up the page tables, setting the \ac{tcr} to select the $\qty{4}{\kilo\byte}$ granule, and initializing the \ac{ttbr}, which holds the base pointer to the first-level page table, the \ac{mmu} can be enabled, and the boot code can finally dispatch to the \texttt{main} function of the application.

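The final enable sequence could look like the following sketch; the \texttt{tcr} and \texttt{mair} arguments are placeholders whose concrete values encode the chosen granule, address size, and attribute layout:
\begin{verbatim}
#include <stdint.h>

extern uint64_t l1_table[512]; /* the level 1 table built above */

void enable_mmu(uint64_t tcr, uint64_t mair)
{
    __asm__ volatile("msr ttbr0_el1, %0" :: "r"(l1_table));
    __asm__ volatile("msr tcr_el1, %0"   :: "r"(tcr));
    __asm__ volatile("msr mair_el1, %0"  :: "r"(mair));
    __asm__ volatile("isb");

    uint64_t sctlr;
    __asm__ volatile("mrs %0, sctlr_el1" : "=r"(sctlr));
    sctlr |= (1UL << 0) | (1UL << 2) | (1UL << 12); /* M, C, I bits:   */
    __asm__ volatile("msr sctlr_el1, %0" :: "r"(sctlr)); /* MMU+caches */
    __asm__ volatile("isb");
}
\end{verbatim}
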
\subsubsection{Bare-Metal Utilities}
When running an application in a bare-metal environment, the standard library of the programming language supports only very limited features and does not provide the \ac{io} and memory management routines that one expects when running an application on top of an operating system.
For example, it is not possible to use \ac{io} functions such as \texttt{printf} to print log messages to \ac{stdout}.
Instead, the kernel itself must define what it interprets as \ac{stdout} and redirect the formatted strings to its custom implementation.
In the ARM board of gem5, a \ac{uart} device is mapped into the memory map by default, to which the kernel can write messages.
The \ac{uart} device model in gem5 then redirects the written messages either to an output file on the host machine or to a \ac{tcp} port, where a client can forward the written content to the \ac{stdout} of the host.

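A minimal sketch of such a \ac{uart} backend follows; the base address is a placeholder for the one found in the board's memory map, and only the data register of a PL011-style \ac{uart} is used:
\begin{verbatim}
#include <stdint.h>

#define UART0_BASE 0x1C090000UL /* placeholder; see the board's map  */
#define UART_DR (*(volatile uint32_t *)(UART0_BASE + 0x00))

/* The kernel's notion of stdout: push each character into the UART. */
void uart_putc(char c) { UART_DR = (uint32_t)c; }

void uart_puts(const char *s)
{
    while (*s)
        uart_putc(*s++);
}
\end{verbatim}
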
Further, the bare-metal environment does not support any heap allocation without the kernel explicitly implementing it.
During development of the custom kernel, it was found that the stack is not suitable for storing the large \ac{pim} arrays, for two reasons:
First, the \ac{pim} arrays become very large for high matrix dimensions and may not fit in the preallocated stack region.
Second, and most importantly, because the stack resides in the normal, cacheable \ac{dram} region, it cannot be used to store the \ac{pim}-enabled data structures.
As an alternative, it would be possible to also preallocate the \ac{pim} data structures in the \ac{pim} \ac{dram} region by instructing the linker to place these structures in a special section of the \ac{elf} file and mapping that section to the \ac{pim}-enabled \acp{pch}.
However, this approach is very inflexible, as the exact dimensions of the matrices would have to be known at compile time.
To solve this problem, an off-the-shelf memory allocator based on \ac{llff} is used to allocate dynamically sized \ac{pim} arrays at runtime.
To incorporate this memory allocator, it is initialized with a preallocated memory arena, which is mapped to the \ac{pim} region of the \ac{dram}.
The allocator can then dynamically hand out sections of this arena for the \ac{pim} data structures.

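The resulting allocation flow could look like the following sketch, where the \texttt{llff\_*} functions and the arena constants are illustrative stand-ins for the actual allocator interface:
\begin{verbatim}
#include <stddef.h>
#include <stdint.h>

/* Arena inside the non-cacheable PIM region; placeholder values. */
#define PIM_ARENA_BASE ((void *)0xC0000000UL)
#define PIM_ARENA_SIZE (64UL << 20) /* 64 MiB */

extern void  llff_init(void *arena, size_t size); /* hypothetical API */
extern void *llff_alloc(size_t size);

void pim_heap_init(void)
{
    llff_init(PIM_ARENA_BASE, PIM_ARENA_SIZE);
}

/* Example: a dynamically sized FP16 weight matrix in the PIM region. */
uint16_t *alloc_weights(size_t rows, size_t cols)
{
    return llff_alloc(rows * cols * sizeof(uint16_t));
}
\end{verbatim}
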
\subsubsection{Memory Configuration}

As already discussed in \cref{sec:memory_layout} and \cref{sec:microkernel_execution}, the configuration of the memory system must meet certain requirements, in particular regarding the \ac{am}.
These configurations can be set when instantiating DRAMSys while it is being connected to the gem5 memory bus.

In \aca{hbm}, the burst size of a memory access is exactly $\qty{32}{\byte}$, which therefore constrains the lowest five bits of any valid memory address:
Since $\log_2(32)=5$ and the burst is the smallest granularity with which the \ac{dram} can be accessed, the first five bits of an address must be zero.
The next higher bits should already switch between the different memory banks, as these are coupled with the different processing units.
Following from the 16-wide \ac{fp16} vectors, one of which is $\qty{32}{\byte}$ in size, and the column-major matrix format, subsequent vectors in the linear address space should be spread across all banks so that the processing units can concurrently perform the \ac{mac} operation.
As a result, the \ac{am} is structured in such a way that the lowest bits of an address are mapped to a portion of the column bits, followed by all the various bank bits.
These are then followed by the remaining column bits and, finally, the row bits.
The simplified \ac{am} following this scheme is shown in \cref{img:hbm2_am}.

\begin{figure}
\centering
\begin{bytefield}[bitwidth=4mm,bitheight=5mm]{31}
\bitheader[endianness=big]{0,2,3,4,5,9,10,14,15,30} \\
\bitbox{16}{Row}
\bitbox{5}{Column}
\bitbox{5}{Bank}
\bitbox{2}{C}
\bitbox{3}[bgcolor=verylightgray]{}
\end{bytefield}

\caption[Simplified \aca{hbm} address mapping with a split column mapping]{Simplified \aca{hbm} address mapping with a split column mapping.}
\label{img:hbm2_am}
\end{figure}

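As a worked example of this scheme, reading the bit fields of \cref{img:hbm2_am}: a 16-wide \ac{fp16} vector $i$ occupies bytes $32i$ to $32i+31$, so its base address $a = 32i$ decomposes as
\begin{equation*}
a = \underbrace{a[30{:}15]}_{\text{Row}}\;\underbrace{a[14{:}10]}_{\text{Column (high)}}\;\underbrace{a[9{:}5]}_{\text{Bank}}\;\underbrace{a[4{:}3]}_{\text{Column (low)}}\;\underbrace{a[2{:}0]}_{\text{Byte offset}}.
\end{equation*}
Since $a = 32i$ leaves the five lowest bits at zero, vector $i$ lands in bank $i \bmod 32$, so 32 consecutive vectors cover all banks of the two \acp{pch} before the column bits advance.
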
In addition to the \ac{am}, the \aca{hbm} system can be configured in terms of stack count, stack height, bank grouping, and memory array dimensions.
The concrete values for these parameters are listed in \cref{tab:memspec}.

\begin{table}
\centering
% \resizebox{\linewidth}{!}{%
\begin{tblr}{
hlines,
vlines,
cell{2-8}{3} = {r},
}
Parameter & Description & Value \\
Number of Bank Groups & Bank Groups per \ac{pch} & 4 \\
Number of Banks & Banks per \ac{pch} & 16 \\
Number of \acp{pch} & \acp{pch} per Channel & 2 \\
Number of Channels & Total Number of Channels & 1 \\
Number of Columns & Columns per Memory Array & 128 \\
Number of Rows & Rows per Memory Array & 65536 \\
Width & Width of the Data Bus in Bits & 64
\end{tblr}
% }
\caption[A list of the used configuration parameters of \aca{hbm}]{A list of the used configuration parameters of \aca{hbm}.}
\label{tab:memspec}
\end{table}

As only one channel is simulated, the simulation does not take into account other memory stacks or memory dies in a stack.
Since different channels would only be used to increase the matrix dimensions beyond those used in this thesis, and the channels are completely independent of each other, this does not change the timing behavior of the simulation.

\subsubsection{GEMV Microkernel}

With a working bare-metal environment, heap allocation of memory arrays, and the correct \aca{hbm} configuration for \aca{fimdram}, a \ac{gemv} microkernel can finally be assembled using the data structures provided by the \ac{pim} library.
The native matrix dimensions of (128$\times$8) have been extended to (128$\times$16), spreading the matrix over two \acp{pch} and increasing the size of the output vector to (16).
The microkernel must therefore execute on both \acp{pch}, which is ensured by implicitly addressing the corresponding \ac{pch} when generating the \ac{rd} and \ac{wr} commands for the matrix addresses.
With the (128$\times$16) weight matrix, the interleaved (128) input vector, the reserved (16) output vector of 16-wide \ac{fp16} \ac{simd} packets that holds the partial sums, and a dummy memory region for executing control instructions, the \ac{gemv} microkernel can be assembled as seen in \cref{lst:gemv_microkernel}.

\begin{listing}
\begin{verbatim}
MOV GRF_A #0, BANK
MOV GRF_A #1, BANK
MOV GRF_A #2, BANK
MOV GRF_A #3, BANK
MOV GRF_A #4, BANK
MOV GRF_A #5, BANK
MOV GRF_A #6, BANK
MOV GRF_A #7, BANK
MAC(AAM) GRF_B, BANK, GRF_A
JUMP -1, 7
FILL BANK, GRF_B #0
EXIT
\end{verbatim}
\caption[A complete \ac{gemv} microkernel]{A complete \ac{gemv} microkernel.}
\label{lst:gemv_microkernel}
\end{listing}

First, the input vector is loaded into all eight \ac{grf}-A registers, followed by the \ac{mac} core, which iteratively multiplies chunks of a matrix row with the input vector chunks and accumulates the results in the first \ac{grf}-B register.
Then, the FILL instruction writes the computed partial sum into the memory bank, followed by an EXIT instruction that resets the processing units to a defined state.
Note that even though the microkernel consists of only 12 instructions, the host processor has to send a total of 36 memory requests to the memory:
On the one hand, the JUMP instruction is not executed itself, but repeats the previous \ac{mac} instruction 7 times; on the other hand, the memory requests have to be sent to both \acp{pch}, which effectively executes the microkernel twice, yielding $(11 + 7) \cdot 2 = 36$ requests.
The host processor must now exit the \ac{abp} mode and enter the \ac{sb} mode, load the partial sum vector from memory, reduce it, and possibly prepare it for the next \ac{dnn} layer in the same way as the input vector was prepared.

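Putting the pieces together, the host-side flow around this microkernel could look like the following sketch; the \texttt{pim\_*} and \texttt{fim\_set\_mode} helpers are illustrative names, not the actual interface of the support library:
\begin{verbatim}
#include <stdint.h>

/* Hypothetical library interface, for illustration only. */
extern void pim_write_matrix(const uint16_t *w, int rows, int cols);
extern void pim_write_input(const uint16_t *x, int n);
extern void pim_run_microkernel(void);
extern void pim_read_partial_sums(uint16_t *y, int n);
extern void fim_set_mode(const char *mode);

void gemv_fimdram(const uint16_t *w, const uint16_t *x, uint16_t *y)
{
    pim_write_matrix(w, 128, 16); /* weights spread across both PCHs */
    pim_write_input(x, 128);      /* interleaved input vector        */

    fim_set_mode("ABP");          /* enable all-bank PIM execution   */
    pim_run_microkernel();        /* issue the RD/WR command stream  */
    fim_set_mode("SB");           /* back to single-bank mode        */

    pim_read_partial_sums(y, 16); /* load sums for the reduction     */
}
\end{verbatim}
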
\subsubsection{Benchmark Environment}

One crucial missing piece for measuring the performance gains of \aca{fimdram} in gem5 is an accurate way of counting the clock cycles of the simulated out-of-order processor.
The gem5 simulator reports the number of simulated ticks and other statistics in a file at the end of the simulation.
However, since the boot process, the setup of the matrix operands, and the mode switching of the processing units should not be captured, more fine-grained control is necessary.
This can be achieved using the so-called M5ops.
By using special instructions that the processor model interprets, it is possible to control the recording of the statistics directly from the simulated application.
Another option is to generate memory accesses at special predefined addresses, which the processor then interprets in a certain way.
These special instructions or memory accesses for exiting the simulation, resetting the statistics, and dumping the statistics are then inserted into the kernel as follows:
Before executing the microkernel of a benchmark, the simulation statistics are reset, while after execution they are explicitly dumped, measuring only the execution of the microkernel.
To compare the use of \aca{fimdram} with conventional matrix operations on the host processor, only the computation itself, i.e., the core loop, is measured, not the initialization.
This provides a fair basis for comparison and allows a number of comparative simulations to be performed.

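With gem5's M5op C interface, this measurement pattern reduces to a few calls around the benchmarked region; \texttt{run\_microkernel()} is a placeholder for the code under test:
\begin{verbatim}
#include <gem5/m5ops.h> /* provided by gem5; link against libm5 */

extern void run_microkernel(void); /* placeholder for the benchmark */

void benchmark(void)
{
    m5_reset_stats(0, 0); /* delay = 0, period = 0: reset immediately */
    run_microkernel();    /* only this region is captured             */
    m5_dump_stats(0, 0);  /* write the statistics                     */
    m5_exit(0);           /* terminate the simulation                 */
}
\end{verbatim}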