Kernel chapter complete

2024-02-16 19:38:11 +01:00
parent 5055b0eb8a
commit db35568157
6 changed files with 24 additions and 14 deletions

@@ -158,9 +158,9 @@ Since different channels would only be used to increase the dimensions of the ma
\subsubsection{GEMV Microkernel}
With a working bare-metal environment, heap allocation of memory arrays, and the correct \aca{hbm} configuration for \aca{fimdram}, a \ac{gemv} microkernel can finally be assembled using the data structures provided by the \ac{pim} library.
The native matrix dimensions of (128,8) have been extended to (128,16), spreading the matrix over two \acp{pch} and increasing the size of the output vector to (16).
As described in \cref{sec:memory_layout}, the microkernel must therefore execute on both \acp{pch}, which is ensured because when generating the \ac{rd} and \ac{wr} commands for the matrix addresses, the respective \ac{pch} is implicitly addressed.
With the (128,16) weight matrix, the interleaved (128) input vector, the reserved (16) output vector of 16-wide \ac{fp16} \ac{simd} packets that holds the partial sums and a dummy memory region for executing control instructions, the \ac{gemv} microkernel can be assembled as seen in \cref{lst:gemv_microkernel}.
The native matrix dimensions of (128$\times$8) have been extended to (128$\times$16), spreading the matrix over two \acp{pch} and increasing the size of the output vector to (16).
The microkernel must therefore execute on both \acp{pch}, which is ensured by implicitly addressing the corresponding \ac{pch} when generating the \ac{rd} and \ac{wr} commands for the matrix addresses.
With the (128$\times$16) weight matrix, the interleaved (128) input vector, the reserved (16) output vector of 16-wide \ac{fp16} \ac{simd} packets that holds the partial sums, and a dummy memory region for executing control instructions, the \ac{gemv} microkernel can be assembled as seen in \cref{lst:gemv_microkernel}.
\begin{listing}
\begin{verbatim}
@@ -189,4 +189,13 @@ The host processor must now exit the \ac{abp} mode and enter the \ac{sb} mode, l
\subsubsection{Benchmark Environment}
One crucial missing piece to measure the performance gains of \aca{fimdram} in gem5 is an accurate way of counting the clock cycles of the simulated out-of-order processor.
One crucial missing piece for measuring the performance gains of \aca{fimdram} in gem5 is an accurate way of counting the clock cycles of the simulated out-of-order processor.
The gem5 simulator reports the number of elapsed ticks and other statistics in a file at the end of the simulation.
However, since the boot process, the setup of the matrix operands, and the mode switching of the processing units should not be captured, more fine-grained control is necessary.
This can be achieved using the so-called M5ops.
By using special instructions that the processor model interprets, it is possible to control the recording of the statistics directly from the simulated application.
Another option is to generate memory accesses to special predefined addresses, which the processor model then interprets as the corresponding operations.
These special instructions or memory accesses for exiting the simulation, resetting the statistics, and dumping the statistics are then inserted into the kernel as follows:
Before the microkernel of a benchmark is executed, the simulation statistics are reset, and after execution they are explicitly dumped, so that only the execution of the microkernel is measured.
To compare the use of \aca{fimdram} with conventional matrix operations on the host processor, only the computation itself, i.e., the core of the benchmark, is measured, not the initialization.
This provides a fair basis for comparison and allows a number of comparative simulations to be performed.
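The ordering of reset, microkernel, and dump described above can be illustrated with a minimal host-side sketch. It is not the kernel code of this work: the wrappers \texttt{stats\_reset}, \texttt{stats\_dump}, and \texttt{sim\_exit} are hypothetical names that only record the call order, and the plain floating-point \ac{gemv} stands in for the actual microkernel. In a gem5 build, such wrappers would presumably forward to the M5ops declared in \texttt{gem5/m5ops.h}.

```c
#include <assert.h>
#include <stdint.h>

#define ROWS 128 /* length of the input vector */
#define COLS 16  /* length of the output vector */
#define MAX_EVENTS 8

/* Events recorded by the stand-in wrappers below. */
enum sim_event { EV_RESET_STATS, EV_KERNEL, EV_DUMP_STATS, EV_EXIT };

enum sim_event event_log[MAX_EVENTS];
int event_count = 0;

static void record(enum sim_event ev) {
    assert(event_count < MAX_EVENTS);
    event_log[event_count++] = ev;
}

/* Hypothetical wrappers (assumption, not from the source). In a gem5
 * build they would forward to m5_reset_stats(0, 0), m5_dump_stats(0, 0)
 * and m5_exit(0); here they only log the call so the sketch runs on a
 * host machine. */
static void stats_reset(void) { record(EV_RESET_STATS); }
static void stats_dump(void)  { record(EV_DUMP_STATS); }
static void sim_exit(void)    { record(EV_EXIT); }

/* Stand-in for the microkernel: y[j] = sum over i of w[i][j] * x[i]. */
static void gemv_reference(float w[ROWS][COLS], const float x[ROWS],
                           float y[COLS]) {
    record(EV_KERNEL);
    for (int j = 0; j < COLS; j++) {
        float acc = 0.0f;
        for (int i = 0; i < ROWS; i++) {
            acc += w[i][j] * x[i];
        }
        y[j] = acc;
    }
}

/* Runs one benchmark and returns the first output element. */
float run_benchmark(void) {
    static float w[ROWS][COLS];
    static float x[ROWS];
    float y[COLS];

    /* Operand setup: happens before the reset, so it is not measured. */
    for (int i = 0; i < ROWS; i++) {
        x[i] = 1.0f;
        for (int j = 0; j < COLS; j++) {
            w[i][j] = 1.0f;
        }
    }

    stats_reset();           /* discard everything recorded so far */
    gemv_reference(w, x, y); /* the only region that is measured */
    stats_dump();            /* write out statistics for this region */
    sim_exit();              /* end the simulation */

    return y[0];
}
```

Because the operand setup precedes the reset and the dump immediately follows the kernel, the dumped statistics cover the kernel region only, which is the property the comparison between \aca{fimdram} and the host processor relies on.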