GEMV kernel

2024-02-16 17:37:21 +01:00
parent 21c2489766
commit 5055b0eb8a
1 changed files with 32 additions and 2 deletions
--- a/src/chapters/implementation/kernel.tex
+++ b/src/chapters/implementation/kernel.tex
@@ -156,7 +156,37 @@ As only one channel is simulated, the simulation does not take into account othe
 Since different channels would only be used to increase the dimensions of the matrices further than it is done in this thesis, and the channels are completely independent of each other, this does not change the timing behavior of the simulation.
 \subsubsection{GEMV Microkernel}
-% heap allocation
+
 With a working bare-metal environment, heap allocation of memory arrays, and the correct \aca{hbm} configuration for \aca{fimdram}, a \ac{gemv} microkernel can finally be assembled using the data structures provided by the \ac{pim} library.
 The native matrix dimensions of (128,8) have been extended to (128,16), spreading the matrix over two \acp{pch} and increasing the size of the output vector to (16).
 As described in \cref{sec:memory_layout}, the microkernel must therefore execute on both \acp{pch}. This happens implicitly, because the matrix addresses used to generate the \ac{rd} and \ac{wr} commands already select the respective \ac{pch}.
 With the (128,16) weight matrix, the interleaved (128) input vector, the reserved (16) output vector of 16-wide \ac{fp16} \ac{simd} packets that holds the partial sums and a dummy memory region for executing control instructions, the \ac{gemv} microkernel can be assembled as seen in \cref{lst:gemv_microkernel}.
 \begin{listing}
 \begin{verbatim}
 MOV GRF_A #0, BANK
 MOV GRF_A #1, BANK
 MOV GRF_A #2, BANK
 MOV GRF_A #3, BANK
 MOV GRF_A #4, BANK
 MOV GRF_A #5, BANK
 MOV GRF_A #6, BANK
 MOV GRF_A #7, BANK
 MAC(AAM) GRF_B, BANK, GRF_A
 JUMP -1, 7
 FILL BANK, GRF_B #0
 EXIT
 \end{verbatim}
 	\caption[A complete \ac{gemv} microkernel]{A complete \ac{gemv} microkernel.}
 	\label{lst:gemv_microkernel}
 \end{listing}
 First, the input vector is loaded into all eight \ac{grf}-A registers. The \ac{mac} instruction is then executed repeatedly, each time multiplying a chunk of a matrix row with the corresponding input vector chunk and accumulating the result in the first \ac{grf}-B register.
 Then, the FILL instruction writes the computed partial sum into the memory bank, followed by an EXIT instruction that resets the processing units to a defined state.
 Note that even though the microkernel consists of only 12 instructions, the host processor has to send a total of 36 memory requests to the memory.
 This is, on the one hand, because the JUMP instruction is not executed itself but repeats the previous instruction 7 times, and, on the other hand, because the memory requests have to be sent to both \acp{pch}, which effectively executes the microkernel twice.
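 The count of 36 can be retraced as follows, under the assumption that every executed instruction, including EXIT, is triggered by exactly one memory request and that JUMP itself consumes none. Per \ac{pch}, the terms are the eight MOV instructions, the MAC instruction executed once plus its seven repetitions, the FILL instruction, and the EXIT instruction:
 \[
 	2 \cdot \bigl( 8 + (1 + 7) + 1 + 1 \bigr) = 2 \cdot 18 = 36
 \]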
 The host processor must now exit the \ac{abp} mode and enter the \ac{sb} mode, load the partial sum vector from memory, reduce it, and possibly prepare it for the next \ac{dnn} layer in the same way as the input vector was prepared.
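 The split between the microkernel and the host-side reduction can be sketched as follows. The symbols are introduced here for illustration only, and the sketch assumes that one processing unit handles one matrix row: $\mathbf{w}_i$ and $\mathbf{x}_i$ denote the $i$-th 16-wide \ac{fp16} chunk of that row and of the input vector, and $\odot$ is the element-wise product. The \ac{mac} executions then leave the partial sum $\mathbf{p}$ in the first \ac{grf}-B register, and the reduction on the host collapses its 16 lanes into one output element $y$:
 \[
 	\mathbf{p} = \sum_{i=0}^{7} \mathbf{w}_i \odot \mathbf{x}_i, \qquad y = \sum_{k=0}^{15} p_k
 \]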
 \subsubsection{Benchmark Environment}
-% m5ops
+
 One crucial missing piece for measuring the performance gains of \aca{fimdram} in gem5 is an accurate way of counting the clock cycles of the simulated out-of-order processor.