Kernel chapter complete

2024-02-16 19:38:11 +01:00
parent 5055b0eb8a
commit db35568157
6 changed files with 24 additions and 14 deletions
--- a/src/chapters/implementation/kernel.tex
+++ b/src/chapters/implementation/kernel.tex
@@ -158,9 +158,9 @@ Since different channels would only be used to increase the dimensions of the ma
 \subsubsection{GEMV Microkernel}

 With a working bare-metal environment, heap allocation of memory arrays, and the correct \aca{hbm} configuration for \aca{fimdram}, a \ac{gemv} microkernel can finally be assembled using the data structures provided by the \ac{pim} library.
-The native matrix dimensions of (128,8) have been extended to (128,16), spreading the matrix over two \acp{pch} and increasing the size of the output vector to (16).
-As described in \cref{sec:memory_layout}, the microkernel must therefore execute on both \acp{pch}, which is ensured because when generating the \ac{rd} and \ac{wr} commands for the matrix addresses, the respective \ac{pch} is implicitly addressed.
-With the (128,16) weight matrix, the interleaved (128) input vector, the reserved (16) output vector of 16-wide \ac{fp16} \ac{simd} packets that holds the partial sums and a dummy memory region for executing control instructions, the \ac{gemv} microkernel can be assembled as seen in \cref{lst:gemv_microkernel}.
+The native matrix dimensions of (128$\times$8) have been extended to (128$\times$16), spreading the matrix over two \acp{pch} and increasing the size of the output vector to (16).
+The microkernel must therefore execute on both \acp{pch}, which is ensured by implicitly addressing the corresponding \ac{pch} when generating the \ac{rd} and \ac{wr} commands for the matrix addresses.
+With the (128$\times$16) weight matrix, the interleaved (128) input vector, the reserved (16) output vector of 16-wide \ac{fp16} \ac{simd} packets that holds the partial sums, and a dummy memory region for executing control instructions, the \ac{gemv} microkernel can be assembled as shown in \cref{lst:gemv_microkernel}.

 \begin{listing}
 \begin{verbatim}
@@ -189,4 +189,13 @@ The host processor must now exit the \ac{abp} mode and enter the \ac{sb} mode, l

 \subsubsection{Benchmark Environment}

-One crucial missing piece to measure the performance gains of \aca{fimdram} in gem5 is an accurate way of counting the clock cycles of the simulated out-of-order processor.
+One crucial missing piece for measuring the performance gains of \aca{fimdram} in gem5 is an accurate way of counting the clock cycles of the simulated out-of-order processor.
+The gem5 simulator reports the number of simulated ticks and other statistics in a file at the end of the simulation.
+However, since the boot process, the setup of the matrix operands, and the mode switching of the processing units should not be captured, finer-grained control is necessary.
+This can be achieved using the so-called M5ops.
+By using special instructions that the processor model interprets, it is possible to control the recording of the statistics directly from the simulated application.
+Another option is to generate memory accesses to special predefined addresses, which the processor model then interprets as the corresponding operations.
+The special instructions or memory accesses for exiting the simulation, resetting the statistics, and dumping the statistics are inserted into the kernel as follows:
+The simulation statistics are reset before the microkernel of a benchmark is executed and explicitly dumped afterwards, so that only the execution of the microkernel is measured.
+To compare the use of \aca{fimdram} with conventional matrix operations on the host processor, only the core computation itself is measured, not the initialization.
+This provides a fair basis for comparison and allows a number of comparative simulations to be performed.
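
The reset/dump bracketing described in the benchmark paragraph can be sketched as follows. This is a minimal illustration under stated assumptions, not the kernel from this commit: the names `m5_reset_stats`, `m5_dump_stats`, and `m5_exit` mirror the C interface of gem5's m5ops library, but here they are host-side stand-ins that only log the call order so that the sketch runs outside the simulator. `record`, `gemv_reference`, and `run_benchmark` are hypothetical helpers, and `float` is used in place of FP16.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Dimensions taken from the text: (128 x 16) matrix, (128) input, (16) output. */
#define ROWS 128
#define COLS 16

/* Assumption: host-side stand-ins that only log the call order.
 * In a gem5 build, the three m5_* functions would instead be provided
 * by gem5's m5ops library rather than defined here. */
enum event { EV_RESET_STATS, EV_KERNEL, EV_DUMP_STATS, EV_EXIT };

#define MAX_EVENTS 16
static enum event event_log[MAX_EVENTS];
static size_t event_count = 0;

static void record(enum event e) {
    assert(event_count < MAX_EVENTS);
    event_log[event_count++] = e;
}

static void m5_reset_stats(uint64_t ns_delay, uint64_t ns_period) {
    (void)ns_delay; (void)ns_period;
    record(EV_RESET_STATS);
}

static void m5_dump_stats(uint64_t ns_delay, uint64_t ns_period) {
    (void)ns_delay; (void)ns_period;
    record(EV_DUMP_STATS);
}

static void m5_exit(uint64_t ns_delay) {
    (void)ns_delay;
    record(EV_EXIT);
}

/* Hypothetical host-only GEMV used as the measured "core" computation.
 * w is stored row-major as ROWS x COLS; y[j] = sum_i w[i][j] * x[i]. */
static void gemv_reference(const float *w, const float *x, float *y) {
    for (size_t j = 0; j < COLS; j++) {
        float acc = 0.0f;
        for (size_t i = 0; i < ROWS; i++) {
            acc += w[i * COLS + j] * x[i];
        }
        y[j] = acc;
    }
    record(EV_KERNEL);
}

/* Operand setup happens before this call, so it is not part of the
 * statistics window opened by the reset and closed by the dump. */
static void run_benchmark(const float *w, const float *x, float *y) {
    m5_reset_stats(0, 0);    /* discard boot and setup statistics */
    gemv_reference(w, x, y); /* the only region that is measured  */
    m5_dump_stats(0, 0);     /* write the statistics for the region */
    m5_exit(0);              /* end the simulation */
}
```

Keeping the operand initialization outside `run_benchmark` is what restricts the dumped statistics to the computation itself, for both the PIM and the host-only variant.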