Fixes in kernel chapter and conclusion
@@ -8,14 +8,14 @@ A working \ac{vp} of \aca{fimdram}, in the form of a software model, was develop
It was found that, ... (TODO: results here).

However, there is still room for improvement in the software model and the comparison methodology, which will be the subject of future work.
Firstly, the developed software library and the implemented model are not yet a drop-in replacement for the real \aca{fimdram} implementation due to the custom communication protocol between the host processor and the \ac{pim} processing units, used to implement the mode switching and the transfer of the microkernels.
For this, more detailed information is required from Samsung, as the exact interface of \aca{fimdram} is not described in the published papers \cite{kwon2021}, \cite{lee2021} and \cite{kang2022}.
To ease the currently error-prone microkernel development process, the software library could help the developer by providing building blocks that assemble the microkernel and simultaneously generate the necessary \ac{ld} and \ac{st} instructions to execute the kernel.
In addition, the current bare-metal deployment of the software cannot realistically be used to accelerate real-world \ac{dnn} applications.
Instead, \aca{fimdram} should be usable on a Linux system, which would require integrating the software support library into a Linux device driver.
To take into account the special alignment requirements of the \ac{pim} data structures, this device driver must also carefully consider the virtual address translation of the Linux kernel, possibly making use of so-called \acp{hugetlb}, as the alignment requirements exceed the default page size of $\qty{4}{\kilo\byte}$.

For a better evaluation of the performance gains of \aca{fimdram}, it should also be compared with real-world \ac{dnn} applications.
Effects such as the initialization overhead of \aca{fimdram} can only be evaluated in such an environment.
Furthermore, the integration of \aca{fimdram} should be extended to \acp{gpu} or \acp{tpu}, so that the comparison can be extended to the deployment of real \ac{dnn} applications.
@@ -11,11 +11,11 @@ This mode has the lowest level of accuracy because many components of the memory
\item
Simulate the entire system in \textbf{full system} mode, booting a full Linux kernel and running the application to be benchmarked as a user space program.
This mode is the most accurate, as it closely resembles the real deployment of an application.
It also provides a complete enough environment to develop device drivers, without the need for the real hardware.
\item
Finally, run gem5 in full system mode, but boot a custom kernel in a \textbf{bare-metal} environment.
This approach is the most flexible, as the user has full control over the hardware configuration as well as the operating system.
The user application does not have to run in user space, but can run in a privileged mode, making it easy to implement low-level routines without having to write a device driver and its user space interface.
\end{itemize}

While the system call emulation mode is the simplest option, it was discarded due to its lack of accuracy and its inability to execute privileged instructions.
@@ -25,34 +25,35 @@ Here, the self-written kernel has full control over the complete system which is
On the other hand, some setup is required, such as initializing the page tables so that the \ac{mmu} of the processor can be enabled and programmed to mark memory regions as cacheable and non-cacheable.

Running a gem5 simulation requires writing a Python script that sets up all system components and connects them.
Recently, gem5 deprecated a commonly used prebuilt script called \texttt{fs.py} in favor of its new standard library.
This standard library provides useful abstractions over common system components, making it easier to build complex systems in a flexible way without having to dive into great detail.
It also greatly simplifies the process of building a system with, for example, an accurate timing or out-of-order processor, a multi-level cache hierarchy, a memory crossbar, and a \ac{dram} model.
However, as of writing this thesis, gem5 does not provide a board abstraction suitable for bare-metal workloads.
Therefore, it was necessary to modify the provided ARM board for full system Linux simulations and simplify it so that no disk image is required, i.e., the board only boots the provided kernel file.

\subsubsection{Boot Code}
At startup on an ARM processor, the reset handler cannot directly dispatch to the \texttt{main} function in the application.
Instead, certain initialization steps are required, such as setting up the stack pointer and, equally important, enabling the on-chip caches by initializing the page tables and enabling the \ac{mmu}.
Fortunately, ARM provides a comprehensive document \cite{gao2017} that explains all the necessary setup steps for a bare-metal ARMv8 processor in the AArch64 execution mode and provides useful examples of the necessary boot code, which require only minimal modification.
While executing the boot code, however, the processor cannot yet correctly access the \ac{dram}, as the \ac{mmu} is not set up.
To solve this problem, the ARM board of gem5 provides a small boot memory component, often implemented as \ac{eflash} in real systems, from which the boot code instructions can be fetched and which supports the native access width of the processor.
During the initialization phase, gem5 ensures that the boot code in the \texttt{.init} section of the executable is copied into the boot memory, as instructed by the header of the \ac{elf} file generated by the linker script.

The linker script also maps the \texttt{.text}, the \texttt{.data}, the \texttt{.rodata} and the \texttt{.bss} sections into the \ac{dram} region.
Furthermore, it reserves space for the stack on the \ac{dram} and sets up two special \aca{fimdram} regions:
Firstly, the config region, where the processor writes the \ac{json} messages that switch the execution mode of the \ac{pim} units or transfer the microkernel.
Secondly, a large \ac{pim} region where all allocated arrays, vectors, and matrices are placed for the processing units to operate on.
This segmentation of the \ac{dram} region is important because otherwise no memory access would be possible during \ac{ab} or \ac{abp} mode to fetch instruction data or load and store stack variables.
Consequently, the default memory region and the \ac{pim} memory region are located on different \acp{pch} to guarantee this independence from each other.

\subsubsection{Cache Management}
In order to enable the on-chip caches and therefore be able to use the \ac{dram}, the page tables have to be set up, which are then used by the \ac{mmu} to map addresses between the virtual memory space and the physical memory space.
To simplify the virtual-physical translation, the \ac{dram} address space should only be mapped as a block at a certain offset in the virtual address space.
In the attributes of the page table, each mapped block of address space can be assigned a special cache policy, such as cacheable and non-cacheable.
While most of the \ac{dram} area should be a normal, cacheable memory region, the \ac{pim} region should be marked as non-cacheable memory for reasons explained in \cref{sec:microkernel_execution}.
Furthermore, special memory-mapped devices such as the \ac{uart}, which is used to print logging messages to the \ac{stdout}, must be marked as a non-cacheable device region, as otherwise the log messages may get held in the cache and not be written until the cache line is eventually flushed.

In the AArch64 execution mode, the operating system can choose from three different granule sizes for the translation tables: $\qty{4}{\kilo\byte}$, $\qty{16}{\kilo\byte}$ and $\qty{64}{\kilo\byte}$.
Each granule size has a different maximum amount of page table nesting, with up to a 4-level look-up for the $\qty{4}{\kilo\byte}$ configuration, as shown in \cref{img:pagetable_granule}.

\begin{figure}
@@ -65,30 +66,30 @@ Each granule size has a different maximum amount of page table nesting, with up
As can be seen, when using the complete 4-level page lookup process, nine bits of the virtual address are used per level to index into the corresponding page table.
In cases where the input address is restricted to a maximum of 42 bits, the level 0 table can be omitted and translation can start with the level 1 table.
In each table, an entry either points to the physical address of the next level page table, or alternatively can directly point to the base address of a memory block, completing the address translation prematurely.
While regular operating systems may use the complete $\qty{4}{\kilo\byte}$ lookup procedure for maximum flexibility, in the controlled bare-metal case, where there is only one application, this may not be necessary.
For this reason, the developed kernel makes use of the first level page table and maps the complete \ac{dram} memory region using $\qty{1}{\giga\byte}$ memory blocks.
In addition to the base pointer, each entry in the page table also holds certain attributes on how the memory region should be treated.
To enable the mapping of the boot memory and \ac{io} devices such as the \ac{uart}, the first memory blocks are marked with a non-cacheable attribute, followed by the normal \ac{dram} region, which is cacheable, and finally the \aca{fimdram} region, which is set to non-cacheable again.

After setting up the page tables, initializing the \ac{tcr} to enable the $\qty{4}{\kilo\byte}$ granule size, and assigning the \ac{ttbr}, which holds the base pointer to the first level page table, the \ac{mmu} can be enabled, and the boot code can finally dispatch to the \texttt{main} function of the application.

\subsubsection{Bare-Metal Utilities}
% Heap Allocator (linked list allocator?...)

In a bare-metal environment, the standard library of the programming language offers only minimal features and lacks the \ac{io} and memory management functions typically provided when running an application on top of an operating system.
For example, it is not possible to use \ac{io} functions such as \texttt{printf} to print log messages to \ac{stdout}.
Instead, the kernel must explicitly define what it interprets as \ac{stdout} and redirect the formatted strings to a custom implementation.
In the ARM board of gem5, a \ac{uart} device is mapped by default into the memory map, to which the kernel can write messages.
The \ac{uart} device model in gem5 then redirects the written messages either to an output file on the host machine or to a \ac{tcp} port, where a client can then redirect the written content to the \ac{stdout} of the host.

Further, the bare-metal environment does not support any heap allocation without the kernel explicitly implementing it.
During development of the custom kernel, it was found that the stack is not suitable for storing the large \ac{pim} arrays for two reasons:
Firstly, the \ac{pim} arrays become very large with high matrix dimensions and may not fit in the preallocated stack region.
Secondly, and most importantly, because the stack resides in the normal cacheable \ac{dram} region, it cannot be used to store the \ac{pim}-enabled data structures.
As an alternative, it would be possible to preallocate all \ac{pim} data structures in the \ac{pim} \ac{dram} region by instructing the linker to place these structures in a special section of the \ac{elf} file and mapping that section to the \ac{pim}-enabled \acp{pch}.
However, this approach is very inflexible, as the exact dimensions of the matrices would have to be known at compile time.
To solve this problem, a commonly available memory allocator based on \ac{llff} has been used to allocate dynamically sized \ac{pim} arrays at runtime.
In order to incorporate this memory allocator, it was initialized with a preallocated memory arena, which is mapped to the \ac{pim} region of the \ac{dram}.
The allocator can then dynamically use sections of this arena to allocate the \ac{pim} data structures.

\subsubsection{Memory Configuration}
@@ -96,7 +97,7 @@ The allocator can then dynamically use sections of this arena to allocate the \a
As already discussed in \cref{sec:memory_layout} and in \cref{sec:microkernel_execution}, certain requirements are posed onto the configuration of the memory system, such as the \ac{am}.
These configurations can be set when DRAMSys is instantiated and connected to the gem5 memory bus.

In \aca{hbm}, the burst size of a memory access is exactly $\qty{32}{\byte}$, which therefore defines the lowest five bits of any valid memory address.
Since $\log_2(32)=5$, the lowest five bits of an address must be zero, as this is the smallest granularity with which the \ac{dram} can be accessed.
The next highest bits should already switch between the different memory banks, as these are coupled with the different processing units.
Following from the 16-wide \ac{fp16} vectors, one of which is $\qty{32}{\byte}$ in size, and the column-major matrix format, subsequent vectors in the linear address space should be spread across all banks so that the processing units can concurrently perform the \ac{mac} operation.
@@ -194,7 +195,7 @@ The gem5 simulator reports this number of ticks and other statistics in a file a
However, since the boot process, the setup of the matrix operands, and the mode switching of the processing units should not be captured, a more fine-grained control is necessary.
This can be achieved using the so-called M5ops.
By using special instructions that the processor model interprets, it is possible to control the recording of the statistics directly from the simulated application.
Another option is to generate special memory accesses at predefined addresses, which the processor then interprets in a certain way.
These special instructions or memory accesses for exiting the simulation, resetting the statistics, and dumping the statistics are then inserted into the kernel as follows:
Before executing the microkernel of a benchmark, the simulation statistics are reset, while after execution they are explicitly dumped, measuring only the execution of the microkernel.
To compare the use of \aca{fimdram} with conventional matrix operations on the host processor, only the computation itself, i.e., the core computation, is measured, not the initialization.
@@ -147,6 +147,6 @@ Finally, another memory barrier must synchronize the memory operations, otherwis
During the development of this cache management approach, it became apparent that the cache may not be sufficiently controllable by the user program.
The compiler may introduce additional stack variables and memory accesses that are not obvious to the developer, rendering the explicit generation of \ac{rd} and \ac{wr} commands nearly impossible.
Therefore, these critical sections would have to be written in assembly language to have the necessary control over the processor.
However, other user programs running in the background at the same time would also interfere with the cache in an uncontrollable manner, making this approach very difficult.

By providing these utility routines for executing the \ac{pim} microkernel, all tools are now available to build an application that makes proper use of \aca{fimdram} for accelerating \ac{dnn} applications.