Clarify layout of additional matrix rows
@@ -371,9 +371,9 @@ The operation of this concrete \ac{gemv} microkernel is illustrated in \cref{img
As \cref{img:memory_layout} shows, each processing unit multiplies one row of the matrix with the input vector and accumulates the products over eight cycles, forming a partial sum.
This example demonstrates execution only for the native matrix dimensions of a single \ac{pch}.
Increasing the number of rows in the matrix simply requires additional iterations of this 8-cycle microkernel, while feeding in the other memory addresses for the subsequent matrix rows.
Increasing the number of rows in the matrix requires additional iterations of this 8-cycle microkernel, while feeding in the other memory addresses for the subsequent matrix rows.
However, the additional matrix rows must be stored as a separate matrix after the first 8-row matrix block, forming an array of separate 8-row matrices.
As a side effect of the incremented matrix row address, this also results in an increment of the \ac{grf}-B index, making it possible to increase the maximum number of matrix rows to $8 \cdot 8=64$ before all eight \ac{grf}-B entries are filled with partial sums, as demonstrated in \cref{lst:gemv64}.
\begin{listing}
\begin{verbatim}
MAC(AAM) GRF_B, BANK, GRF_A
@@ -382,9 +382,10 @@ JUMP -1, 63
\caption[The core of a \ac{mac} microkernel that utilizes the maximum number of register entries]{The core of a \ac{mac} microkernel that utilizes the maximum number of register entries.}
\label{lst:gemv64}
\end{listing}
A further increase in the total number of rows can be achieved by distributing the weight matrix over multiple \acp{pch} and running the microkernel multiple times, concatenating the output vectors on the host at the end.
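As a sketch of this row-wise distribution, assume that the weight matrix $W$ is split into $P$ row blocks $W_1, \dots, W_P$, one per \ac{pch}, and that $x$ denotes the input vector.
The symbols $P$, $W_p$ and $y_p$ are introduced here only for illustration.
Each \ac{pch} then produces its own slice $y_p = W_p\,x$ of the output vector, and the host only has to concatenate these slices:
\begin{equation*}
    y = W x =
    \left[\begin{array}{c} W_1 \\ \vdots \\ W_P \end{array}\right] x =
    \left[\begin{array}{c} W_1\,x \\ \vdots \\ W_P\,x \end{array}\right] =
    \left[\begin{array}{c} y_1 \\ \vdots \\ y_P \end{array}\right].
\end{equation*}
No arithmetic is required on the host in this case, because the row blocks do not share any output elements.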
To increase the number of columns, new entries of the input vector must be loaded into the processing units.
Therefore, it is necessary to execute the complete \ac{gemv} microkernel several times with different input vector chunks and weight matrix columns.
Therefore, it is necessary to execute the entire \ac{gemv} microkernel several times with different input vector chunks and weight matrix columns, and merge the resulting output vectors by adding them on the host.
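This merging step can be written down explicitly.
As an illustration, assume that the weight matrix $W$ is split column-wise into $K$ blocks $W^{(1)}, \dots, W^{(K)}$ and the input vector $x$ into the matching chunks $x^{(1)}, \dots, x^{(K)}$; the symbols $K$, $W^{(k)}$ and $x^{(k)}$ are not used elsewhere and serve only this sketch.
The output vector is then the sum of the partial results:
\begin{equation*}
    y = W x =
    \left[\begin{array}{ccc} W^{(1)} & \cdots & W^{(K)} \end{array}\right]
    \left[\begin{array}{c} x^{(1)} \\ \vdots \\ x^{(K)} \end{array}\right] =
    \sum_{k=1}^{K} W^{(k)}\,x^{(k)}.
\end{equation*}
Each summand $W^{(k)}\,x^{(k)}$ corresponds to one complete execution of the microkernel, and the sum over $k$ is the element-wise addition performed on the host.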
In general, the more the dimensions exceed the native \ac{pim} matrix dimensions, the more often the \ac{mac} core of the \ac{gemv} microkernel must be executed.
\subsubsection{Performance and Power Efficiency Effects}