Kernel chapter complete
@@ -158,9 +158,9 @@ Since different channels would only be used to increase the dimensions of the ma
\subsubsection{GEMV Microkernel}
With a working bare-metal environment, heap allocation of memory arrays, and the correct \aca{hbm} configuration for \aca{fimdram}, a \ac{gemv} microkernel can finally be assembled using the data structures provided by the \ac{pim} library.
The native matrix dimensions of (128,8) have been extended to (128,16), spreading the matrix over two \acp{pch} and increasing the size of the output vector to (16).
As described in \cref{sec:memory_layout}, the microkernel must therefore execute on both \acp{pch}, which is ensured because when generating the \ac{rd} and \ac{wr} commands for the matrix addresses, the respective \ac{pch} is implicitly addressed.
With the (128,16) weight matrix, the interleaved (128) input vector, the reserved (16) output vector of 16-wide \ac{fp16} \ac{simd} packets that holds the partial sums and a dummy memory region for executing control instructions, the \ac{gemv} microkernel can be assembled as seen in \cref{lst:gemv_microkernel}.
The native matrix dimensions of (128$\times$8) have been extended to (128$\times$16), spreading the matrix over two \acp{pch} and increasing the size of the output vector to (16).
The microkernel must therefore execute on both \acp{pch}, which is ensured by implicitly addressing the corresponding \ac{pch} when generating the \ac{rd} and \ac{wr} commands for the matrix addresses.
With the (128$\times$16) weight matrix, the interleaved (128) input vector, the reserved (16) output vector of 16-wide \ac{fp16} \ac{simd} packets that holds the partial sums, and a dummy memory region for executing control instructions, the \ac{gemv} microkernel can be assembled as shown in \cref{lst:gemv_microkernel}.
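For orientation, the operation computed by the microkernel can be written out under two assumptions that are not fixed by the text: the (128$\times$16) notation is read as 128 inputs by 16 outputs, and each of the 16 output entries is a packet of 16 lane-wise partial sums that still requires a final reduction over the lanes.
The symbols $w_{i,j}$, $x_i$, $y_j$, and $p_{j,l}$ are introduced here purely for illustration:
\begin{equation}
	y_j = \sum_{i=0}^{127} w_{i,j}\, x_i = \sum_{l=0}^{15} p_{j,l}, \qquad j = 0, \dots, 15,
\end{equation}
where $p_{j,l}$ denotes the partial sum accumulated in \ac{simd} lane $l$ for output element $j$.
Which of the 128 products contribute to a given lane depends on the interleaving of the input vector and is left open in this sketch.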
\begin{listing}
\begin{verbatim}
@@ -189,4 +189,13 @@ The host processor must now exit the \ac{abp} mode and enter the \ac{sb} mode, l
\subsubsection{Benchmark Environment}
One crucial missing piece to measure the performance gains of \aca{fimdram} in gem5 is an accurate way of counting the clock cycles of the simulated out-of-order processor.
One crucial missing piece for measuring the performance gains of \aca{fimdram} in gem5 is an accurate way of counting the clock cycles of the simulated out-of-order processor.
The gem5 simulator reports the number of elapsed ticks and other statistics in a file at the end of the simulation.
However, since the boot process, the setup of the matrix operands, and the mode switching of the processing units should not be captured, finer-grained control is necessary.
This can be achieved using the so-called M5ops.
By using special instructions that the processor model interprets, it is possible to control the recording of the statistics directly from the simulated application.
Another option is to generate memory accesses at special predefined addresses, which the processor model then treats as M5ops rather than as ordinary memory accesses.
These special instructions or memory accesses for exiting the simulation, resetting the statistics, and dumping the statistics are then inserted into the kernel as follows:
Before executing the microkernel of a benchmark, the simulation statistics are reset; after execution, they are explicitly dumped, so that only the execution of the microkernel is measured.
To compare the use of \aca{fimdram} with conventional matrix operations on the host processor, only the computation itself, i.e., the core computation, is measured, not the initialization.
This provides a fair basis for comparison and allows a number of comparative simulations to be performed.
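The reset-and-dump bracketing described above can be sketched in C as follows.
This is a minimal illustration, not the kernel code of this work: \texttt{m5\_reset\_stats} and \texttt{m5\_dump\_stats} are the names and signatures declared in gem5's \texttt{m5ops.h}, whereas \texttt{measure\_kernel}, \texttt{kernel\_fn}, \texttt{example\_kernel}, the \texttt{USE\_GEM5\_M5OPS} switch, and the counting stand-ins are hypothetical helpers introduced here so that the sketch also builds on a host without the simulator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#ifdef USE_GEM5_M5OPS
/* Real M5ops from gem5's utility library (header shipped with gem5). */
#include <gem5/m5ops.h>
#else
/* Host-side stand-ins (illustrative only): count the calls so that the
 * bracketing can be checked without a simulator. */
static unsigned reset_calls = 0;
static unsigned dump_calls = 0;

static void m5_reset_stats(uint64_t ns_delay, uint64_t ns_period)
{
    (void)ns_delay;
    (void)ns_period;
    reset_calls++;
}

static void m5_dump_stats(uint64_t ns_delay, uint64_t ns_period)
{
    (void)ns_delay;
    (void)ns_period;
    dump_calls++;
}
#endif

/* Hypothetical kernel signature: one opaque argument, no return value. */
typedef void (*kernel_fn)(void *arg);

/* Reset the statistics, run the kernel, then dump the statistics, so that
 * the dumped interval covers the kernel only (not boot or operand setup). */
static void measure_kernel(kernel_fn kernel, void *arg)
{
    assert(kernel != NULL);
    m5_reset_stats(0, 0);   /* delay 0, period 0: act once, immediately */
    kernel(arg);
    m5_dump_stats(0, 0);
}

/* Placeholder for the microkernel: counts its own invocations. */
static void example_kernel(void *arg)
{
    unsigned *runs = arg;
    (*runs)++;
}
```

Wrapping the \aca{fimdram} microkernel and the host-only reference computation in the same helper would keep the measured interval identical for both variants, which is the property the comparison relies on.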