Clarify layout of additional matrix rows

2024-02-21 16:35:12 +01:00
parent 2e6187c25a
commit 96311f2308
1 changed files with 4 additions and 3 deletions
--- a/src/chapters/pim.tex
+++ b/src/chapters/pim.tex
@@ -371,9 +371,9 @@ The operation of this concrete \ac{gemv} microkernel is illustrated in \cref{img

 As \cref{img:memory_layout} shows, a processing unit multiplies one row of the matrix with the input vector and accumulates the products over eight cycles, forming the partial sum.
 This example only demonstrates the execution of the native matrix dimensions for one \ac{pch}.
-Increasing the number of rows in the matrix simply requires additional iterations of this 8-cycle microkernel, while feeding in the other memory addresses for the subsequent matrix rows.
+Increasing the number of rows in the matrix requires additional iterations of this 8-cycle microkernel, each supplied with the memory addresses of the subsequent matrix rows.
+However, the additional matrix rows must be stored as a separate matrix after the first 8-row matrix block, forming an array of separate 8-row matrices.
 As a side effect of the incremented matrix row address, this also results in an increment of the \ac{grf}-B index, making it possible to increase the maximum number of matrix rows to $8 \cdot 8=64$ before all eight \ac{grf}-B entries are filled with partial sums, as demonstrated in \cref{lst:gemv64}.
-
 \begin{listing}
 \begin{verbatim}
 MAC(AAM) GRF_B, BANK, GRF_A
@@ -382,9 +382,10 @@ JUMP -1, 63
 	\caption[The core of a \ac{mac} microkernel that utilizes the maximum number of register entries]{The core of a \ac{mac} microkernel that utilizes the maximum number of register entries.}
 	\label{lst:gemv64}
 \end{listing}
+A further increase in the total number of rows can be achieved by distributing the weight matrix over multiple \acp{pch} and running the microkernel multiple times, concatenating the output vectors on the host at the end.

 To increase the number of columns, new entries of the input vector must be loaded into the processing units.
-Therefore, it is necessary to execute the complete \ac{gemv} microkernel several times with different input vector chunks and weight matrix columns.
+Therefore, it is necessary to execute the entire \ac{gemv} microkernel several times with different input vector chunks and weight matrix columns, and merge the resulting output vectors by adding them on the host.
 In general, the more the dimensions exceed the native \ac{pim} matrix dimensions, the more often the \ac{mac} core of the \ac{gemv} microkernel must be executed.

 \subsubsection{Performance and Power Efficiency Effects}