Start of kernel

2024-02-15 21:09:14 +01:00
parent 1e993eeb28
commit df8ef883b3
7 changed files with 54 additions and 7 deletions

@@ -292,3 +292,11 @@
short = DSB,
long = Data Synchronization Barrier,
}
\DeclareAcronym{tlb}{
short = TLB,
long = translation lookaside buffer,
}
\DeclareAcronym{mmu}{
short = MMU,
long = memory management unit,
}

@@ -51,7 +51,7 @@ Besides the data bus, the channel consists also of the \textit{command bus} and
Over the command bus, the commands necessary to control memory are issued by the \textit{memory controller}, which sits between the \ac{dram} and the \ac{mpsoc}.
For example, to read data, the memory controller may first issue a \ac{pre} command to precharge the bitlines in a certain bank, followed by an \iac{act} command to load the contents of a row into the \acp{psa}, and finally a \ac{rd} command to move the data from the \acp{psa} to the \acp{ssa}, from where it can be exposed to the data bus.
The value on the address bus determines the row, column, bank and rank used during the respective commands, while it is the responsibility of the memory controller to translate the \ac{mpsoc}-side address to the respective components in a process called \ac{am}.
The \ac{am} ensures that the number of \textit{row misses}, i.e., the need for precharging and activating another row, is minimized.
% One particularly common \ac{am} scheme is called \textit{Bank Interleaving} \cite{jung2017a}, which maps the lower address bits to the columns, followed by the ranks and banks, and the highest bits to the rows.
One particularly common \ac{am} scheme is called \textit{Bank Interleaving} \cite{jung2017a}, which is illustrated using an exemplary mapping in \cref{img:bank_interleaving}.
Under the assumption of a sequentially increasing address access pattern, this scheme maps the lowest bits of an address to the column bits of a row to exploit the already activated row as much as possible.
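The effect of such a mapping on row misses can be illustrated with a small Python sketch.
The field widths and the exact order of the bank and rank bits are assumptions chosen for illustration only and do not describe the mapping of any particular device; the helper names are hypothetical as well.

```python
from typing import Iterable, NamedTuple

# Hypothetical field widths in bits, listed from least to most significant.
# Columns occupy the low bits, rows the high bits (bank-interleaving style).
FIELDS = (("byte", 5), ("column", 5), ("bank", 4), ("rank", 1), ("row", 14))


class DramAddress(NamedTuple):
    byte: int
    column: int
    bank: int
    rank: int
    row: int


def decode(address: int) -> DramAddress:
    """Split a physical address into DRAM coordinates, lowest field first."""
    values = {}
    for name, width in FIELDS:
        values[name] = address & ((1 << width) - 1)
        address >>= width
    return DramAddress(**values)


def row_misses(addresses: Iterable[int]) -> int:
    """Count accesses that need a new row to be activated in their bank.

    The first access to a bank also counts, since no row is open yet.
    """
    open_rows = {}
    misses = 0
    for address in addresses:
        a = decode(address)
        key = (a.rank, a.bank)
        if open_rows.get(key) != a.row:
            misses += 1
            open_rows[key] = a.row
    return misses
```

Under these assumed widths, a sequential stream of 32-byte accesses stays within one row until all of its columns have been visited, whereas a stream that alternates between two rows of the same bank misses on every access.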

@@ -1,10 +1,48 @@
\subsection{Application Kernel}
\label{sec:kernel}
With both the \aca{fimdram} model in DRAMSys and the software support library, it is now possible to write an application that runs on gem5 and leverages \ac{pim} to accelerate workloads.
When it comes to gem5, there are three different approaches to model a system:
\begin{itemize}
\item
Run the user-space application in \textbf{system call emulation} mode.
In this mode, the application is simulated in isolation, while forwarding system calls to the host operating system.
This mode has the lowest level of accuracy because many components of the memory system, such as page table walking and the \ac{tlb}, are implemented using very simplified models.
\item
Simulate the entire system in \textbf{full system} mode, booting a full Linux kernel and running the application to be benchmarked as a user-space program.
This mode is the most accurate, as it closely resembles a real deployment of an application.
It also provides a complete enough environment to develop device drivers, without the need for a real system.
\item
Finally, run gem5 in full system mode, but boot a custom kernel in a \textbf{bare-metal} environment.
This approach is the most flexible, as the user has full control over the hardware configuration as well as the operating system.
The user application does not have to run in user space, but can run in a privileged mode, making it easy to implement low-level routines without having to write a device driver with its user space interface.
\end{itemize}
While the system call emulation mode is the simplest option, it has been discarded due to its lack of accuracy and its inability to execute privileged instructions.
The full system mode, which boots a Linux kernel, provides the necessary capabilities to implement the application.
However, due to the added complexity and the need to write a Linux device driver to execute privileged instructions and control the non-cacheable memory regions, the bare-metal option was favored.
Here, the self-written kernel has full control over the complete system, which is an advantage when implementing a minimal example utilizing \aca{fimdram}.
On the other hand, some setup is required, such as initializing the page tables so that the \ac{mmu} of the processor can be enabled and programmed to mark memory regions as cacheable or non-cacheable.
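As an illustration of what marking a region as cacheable or non-cacheable involves, the following Python sketch computes the bit patterns of ARMv8-A stage-1 block descriptors.
It assumes a 4 KiB translation granule with 1 GiB level-1 block entries; the assignment of the attribute slots and the function names are hypothetical choices for this example and do not reflect the actual kernel code.

```python
# Memory attribute encodings for MAIR_EL1 as defined by the ARMv8-A architecture.
ATTR_NORMAL_WB = 0xFF  # normal memory, inner/outer write-back cacheable
ATTR_NORMAL_NC = 0x44  # normal memory, inner/outer non-cacheable

# Hypothetical assignment of MAIR_EL1 attribute slots for this sketch.
IDX_CACHEABLE = 0
IDX_NON_CACHEABLE = 1

BLOCK_SIZE = 1 << 30                          # a level-1 block maps 1 GiB
OUTPUT_ADDR_MASK = ((1 << 48) - 1) & ~(BLOCK_SIZE - 1)

DESC_BLOCK = 0b01                             # bits [1:0]: valid block entry
DESC_SH_INNER = 0b11 << 8                     # bits [9:8]: inner shareable
DESC_AF = 1 << 10                             # bit 10: access flag


def mair_value(attrs: dict) -> int:
    """Pack 8-bit attribute encodings into a MAIR_EL1 value by slot index."""
    value = 0
    for index, attr in attrs.items():
        value |= (attr & 0xFF) << (8 * index)
    return value


def block_descriptor(phys_addr: int, attr_index: int) -> int:
    """Build a level-1 block descriptor that selects a MAIR_EL1 slot."""
    if phys_addr % BLOCK_SIZE != 0:
        raise ValueError("block address must be 1 GiB aligned")
    return (
        (phys_addr & OUTPUT_ADDR_MASK)
        | DESC_AF
        | DESC_SH_INNER
        | (attr_index << 2)                   # bits [4:2]: AttrIndx
        | DESC_BLOCK
    )
```

The only difference between a cacheable and a non-cacheable mapping in this sketch is the attribute index stored in the descriptor, which selects the corresponding slot of the attribute register.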
% python config
% bare metal vs linux
\subsubsection{Boot Code}
% linker script
% start assembly script
\subsubsection{Cache Management}
% ARM page tables
% cache management
\subsubsection{Bare-Metal Utilities}
% Heap Allocator (linked list allocator?...)
% uart
\subsubsection{Memory Configuration}
% address mapping
% concrete numbers for mcconfig
\subsubsection{GEMV Microkernel}
% heap allocation
\subsubsection{Benchmark Environment}
% m5ops

@@ -11,13 +11,13 @@ Such a \ac{pim} library must include the following essential features to fully i
\item It should provide data structures to assemble \textbf{microkernels} and functions to transfer the microkernels to the \acp{crf} of the processing units.
\item To meet the \textbf{memory layout} requirements of the inputs and outputs of an algorithm, it should provide data structures to represent vectors and matrices according to the special layout constraints.
\item After switching the mode to \ac{abp}, the library should provide functionality to \textbf{execute a user-defined microkernel} by issuing the necessary memory requests through the execution of \ac{ld} and \ac{st} instructions.
\item For platforms where it is not possible to mark the \ac{pim} memory region as non-cacheable, the library should provide the necessary \textbf{cache management} operations to bypass the cache filtering and to generate the correct number of \ac{rd} and \ac{wr} \ac{dram} commands.
\end{itemize}
As already discussed in \cref{sec:vm}, for simplicity and debuggability reasons, the host processor communicates with the \ac{pim} model in the \ac{dram} using a \ac{json}-based protocol.
To achieve this, a small shared library that defines the communication data structures as well as routines to serialize and deserialize them is linked by both the \ac{pim} support library and the \ac{pim} model in DRAMSys.
A predefined memory region is then used to differentiate these communication messages from the regular memory traffic.
Ideally, this memory region is also set as non-cacheable, so that the messages do not get stuck in the on-chip cache.
Alternatively, the software library must ensure that the cache is flushed after the \ac{json} message is written to the memory region.
With the mode setting implemented, the shared library also provides type definitions to represent the \ac{pim} instructions in memory and to transfer entire microkernels consisting of 32 instructions to the processing units.
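The principle of separating such messages from regular traffic by their target address can be sketched in Python as follows.
The region bounds, the message fields and all function names are hypothetical and serve only as an illustration; they do not reflect the actual shared library or the DRAMSys model.

```python
import json
from typing import Optional

# Hypothetical bounds of the predefined communication region.
CONFIG_BASE = 0xC000_0000
CONFIG_SIZE = 0x1000


def serialize(message: dict) -> bytes:
    """Host side: encode a message as compact UTF-8 JSON."""
    payload = json.dumps(message, separators=(",", ":")).encode("utf-8")
    if len(payload) > CONFIG_SIZE:
        raise ValueError("message does not fit into the communication region")
    return payload


def is_config_access(address: int) -> bool:
    """Return True if the address lies inside the communication region."""
    return CONFIG_BASE <= address < CONFIG_BASE + CONFIG_SIZE


def handle_write(address: int, data: bytes) -> Optional[dict]:
    """Model side: decode messages, ignore regular memory traffic."""
    if not is_config_access(address):
        return None  # regular write, handled by the normal DRAM model
    return json.loads(data.decode("utf-8"))
```

A write to the communication region is decoded into a message, while any other write is passed on as ordinary memory traffic:

```python
message = {"bank_mode": "all_bank_pim"}  # hypothetical message format
assert handle_write(CONFIG_BASE, serialize(message)) == message
assert handle_write(0x1000, bytes(32)) is None
```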

@@ -301,7 +301,7 @@ This memory layout is explained in detail in \cref{sec:memory_layout}.
\subsubsection{Programming Model}
The software stack of \aca{fimdram} is split into three main parts.
Firstly, a \ac{pim} device driver is responsible for allocating buffers in \ac{hbm} memory and setting these regions as non-cacheable.
It does this because the on-chip cache would add an unwanted filtering between the host processor's \ac{ld} and \ac{st} instructions and the generation of memory accesses by the memory controller.
Alternatively, it would be possible to control cache behavior by issuing flush and invalidate instructions, but this would introduce an overhead as the flush would have to be issued between each and every \ac{pim} instruction in the microkernel.
Secondly, a \ac{pim} acceleration library implements a set of \ac{blas} operations and manages the generation, loading and execution of the microkernel on behalf of the user.

@@ -1,4 +1,4 @@
\section{System-Level Modeling}
\label{sec:vp}
To evaluate the impact of \ac{pim} on the performance and power consumption of various applications, it is essential to perform simulations.
@@ -6,6 +6,7 @@ Such simulations allow investigating critical factors such as the \ac{pim} micro
It may even allow for the identification of potential improvements to the \ac{pim} architecture.
In addition, the suitability of different applications for \ac{pim} can be evaluated, as well as the influence of the specific memory layout requirements on the application software.
\subsection{Virtual Prototypes}
To perform such simulations, it is necessary to use a simulation model, commonly referred to as a \ac{vp}.
\Acp{vp} act as executable software models of a physical hardware system, allowing the architecture of the system to be completely simulated in software.
This in turn enables the software development and the identification of potential platform-specific software bugs without the need for the actual hardware implementation \cite{antonino2018}.

@@ -22,7 +22,7 @@
@article{arm2015,
title = {{{ARM Cortex-A Series Programmer}}s {{Guide}} for {{ARMv8-A}}},
author = {{ARM}},
date = {2015-03-24},
langid = {english},
file = {/home/derek/Nextcloud/Verschiedenes/Zotero/storage/KGNI52X5/2015 - ARM Cortex-A Series Programmers Guide for ARMv8-A.pdf}
}