Software Library chapter complete
    short = AB-PIM,
    long = All-Bank-PIM,
}

\DeclareAcronym{dsb}{
    short = DSB,
    long = Data Synchronization Barrier,
}

As already discussed in \cref{sec:memory_layout}, the weight matrix must be laid out in a column-major fashion, grouped into vectors of 16 \ac{fp16} elements.
To avoid reinventing numerous routines for initializing and manipulating matrices, the publicly available open-source linear algebra library nalgebra \cite{nalgebra} is used.
To achieve the packed \ac{fp16} layout, a special \ac{simd} data type abstraction is used, which also accounts for the changed dimensions of the matrix.
Following the same consideration as with the \texttt{BankArray}, the weight matrix must be aligned to a $\qty{512}{\byte}$ boundary to ensure that the first matrix element is placed on the first bank of the \ac{pch}.
However, when using the \ac{aam} execution mode, this is not sufficient.
As already shown in \cref{img:aam}, the \ac{grf}-A and \ac{grf}-B indices are calculated from the column and row address of the triggering memory access.
With an alignment of only $\qty{512}{\byte}$, no assumptions can be made about the initial values of the \ac{grf}-A and \ac{grf}-B indices, whereas for the execution of a complete \ac{gemv} kernel, both indices must start at zero.
Therefore, the larger alignment requirement of $2^6 \cdot \qty{512}{\byte} = \qty{32768}{\byte}$ must be enforced for the weight matrix.

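The alignment requirement above can be met with a plain aligned heap allocation. The following sketch is an illustration only, not the library's actual allocator; the matrix dimensions are hypothetical:

```rust
use std::alloc::{alloc, dealloc, Layout};

// Alignment derived above: 2^6 * 512 B = 32768 B, so that both GRF
// indices computed from the triggering address start at zero.
const WEIGHT_ALIGN: usize = 64 * 512;

fn main() {
    // Hypothetical 512 x 512 FP16 weight matrix (2 bytes per element),
    // allocated at the 32 KiB boundary required for AAM execution.
    let size = 512 * 512 * 2;
    let layout = Layout::from_size_align(size, WEIGHT_ALIGN).unwrap();
    let ptr = unsafe { alloc(layout) };
    assert!(!ptr.is_null());
    // The low 15 address bits are zero: the first element lands on the
    // first bank of the PCH, with GRF-A and GRF-B indices at zero.
    assert_eq!(ptr as usize % WEIGHT_ALIGN, 0);
    unsafe { dealloc(ptr, layout) };
}
```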
Besides the weight matrices, the input vector must adhere to an interleaved layout at the granularity of the 16-wide \ac{fp16} vector, as described in \cref{sec:memory_layout}.
The number of copies of each chunk is equal to the number of processing units in each \ac{pch}.
While it would be possible to use the \ac{ab} mode of \aca{fimdram} for this, the interleaving is done in software for the purpose of debuggability, since the initialization step cannot be modeled accurately anyway due to the \ac{json}-based communication protocol.
The alignment requirement of $\qty{512}{\byte}$ is sufficient for the input vector to ensure that it resides at the boundary of the first bank in the respective \ac{pch}.

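The software interleaving step can be sketched as follows. This is a simplified illustration, not the library's code: the number of processing units per \ac{pch} is assumed to be 8 here, and \ac{fp16} values are carried as raw \texttt{u16} bit patterns:

```rust
const SIMD_WIDTH: usize = 16; // 16 fp16 elements = 32 bytes
const PUS_PER_PCH: usize = 8; // assumed processing units per PCH

/// Repeat every 16-element chunk of the input vector once per
/// processing unit, producing the interleaved layout described above.
fn interleave_input(input: &[u16]) -> Vec<u16> {
    assert_eq!(input.len() % SIMD_WIDTH, 0);
    let mut out = Vec::with_capacity(input.len() * PUS_PER_PCH);
    for chunk in input.chunks(SIMD_WIDTH) {
        for _ in 0..PUS_PER_PCH {
            out.extend_from_slice(chunk);
        }
    }
    out
}

fn main() {
    let input: Vec<u16> = (0..32).collect();
    let interleaved = interleave_input(&input);
    // Each 16-element chunk now appears once per processing unit.
    assert_eq!(interleaved.len(), input.len() * PUS_PER_PCH);
    assert_eq!(&interleaved[0..16], &interleaved[16..32]);
}
```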
In addition to the input vector, the output of the \ac{gemv} kernel is not a flat vector, but a 16-column matrix that must be reduced by the host after the \ac{pim} operation.
Therefore, before the operation, the output matrix must be allocated as a vector consisting of one \ac{simd} \ac{fp16} vector per matrix row, while also adhering to the $\qty{512}{\byte}$ alignment.
The bank interleaving of the \ac{am} leads to the correct, sequential representation in linear address space after the \ac{mac} results are written from the \ac{grf} register files to the memory banks.
The host can then simply read the result from the pre-allocated output vector and reduce it to prepare the values for the next \ac{dnn} layer.

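The host-side reduction then collapses each 16-element row into one scalar. A minimal sketch, with \texttt{f32} standing in for \ac{fp16} arithmetic:

```rust
/// Host-side reduction: sum each 16-element row of the PIM output
/// matrix into one element of the final result vector.
fn reduce_output(rows: &[[f32; 16]]) -> Vec<f32> {
    rows.iter().map(|row| row.iter().sum()).collect()
}

fn main() {
    let rows = [[0.5f32; 16], [1.0f32; 16]];
    // Each row of partial MAC results collapses into one scalar.
    assert_eq!(reduce_output(&rows), vec![8.0, 16.0]);
}
```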
The implementation of the \aca{fimdram} execution model is explained in the following section.

\subsubsection{Microkernel Execution}
% - microkernel execution
% - cache management
The host processor executes the \ac{pim} microkernel by first switching to the \ac{abp} mode and then issuing the required \ac{rd} and \ac{wr} memory requests by executing \ac{ld} and \ac{st} instructions.
When executing control instructions or data-movement instructions that operate only on the register files, the \ac{rd} and \ac{wr} requests must target a dummy region of memory where no actual data is stored, but which must be allocated beforehand.
When data is read from or written to the memory banks, however, these memory requests are issued with the correct address of the data.
Since a single memory request triggers the execution of all processing units in a \ac{pch}, the \ac{ld} and \ac{st} instructions do not cover the complete input and output data from the processor's perspective:
from its point of view, only data in the first (even) or second (odd) bank is ever accessed.
This requires special indexing of the input vectors and matrices, since they must be accessed very sparsely.

In the case of the input vector, where one 16-wide \ac{simd} vector of \ac{fp16} elements is repeated as often as there are banks in a \ac{pch}, a burst access must occur every $\qty{32}{\byte} \cdot \text{\#banks per \ac{pch}} = \qty{512}{\byte}$ over the entire interleaved input vector, at most 8 times.
This way, all available \ac{grf}-A registers in a processing unit are used to hold its copy of the input vector.
To then perform the repeated \ac{mac} operation with the weight matrix as bank data, a similar logic must be applied.
Since each row of the matrix resides on its own memory bank, with an interleaving at the size of a 16-wide \ac{simd} vector of \ac{fp16} elements, one memory access must likewise be issued every $\qty{512}{\byte}$.
As the input address of the weight matrix grows, the \ac{grf}-A and \ac{grf}-B indices are incremented in such a way that the \ac{grf}-A registers are read repeatedly to multiply the weights by the input vector, while the \ac{grf}-B registers are incremented in the outer loop to hold the results of additional matrix rows.

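The trigger-address sequence for filling the \ac{grf}-A registers can be sketched as a simple strided loop. The function name and parameters are hypothetical; the constants follow the $\qty{512}{\byte}$ stride and the 8 \ac{grf}-A registers stated above:

```rust
const TRIGGER_STRIDE: usize = 512; // 32 B x #banks per PCH
const GRF_A_REGS: usize = 8;       // GRF-A registers per processing unit

/// Trigger addresses for filling the GRF-A registers: one burst
/// access every 512 B over the interleaved input vector, capped at
/// the number of available GRF-A registers.
fn grf_a_fill_addresses(base: usize, vector_bytes: usize) -> Vec<usize> {
    (0..vector_bytes / TRIGGER_STRIDE)
        .take(GRF_A_REGS)
        .map(|i| base + i * TRIGGER_STRIDE)
        .collect()
}

fn main() {
    // A 2 KiB interleaved input vector needs four trigger accesses.
    assert_eq!(grf_a_fill_addresses(0x1000, 2048),
               vec![0x1000, 0x1200, 0x1400, 0x1600]);
}
```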
Besides generating memory requests, an important task of the software library is to maintain the data coherence of the program.
The compiler may introduce invariants with respect to the value of the output vector, since it does not see that the value of the vector has changed without the host explicitly writing to it.
As a result, the compiler may apply optimizations that are not obvious to the programmer, such as reordering memory accesses, causing the program to execute incorrectly.
To avoid this, the processor must introduce memory barriers not only between non-\ac{aam} instructions in the microkernel, but also after initializing the input operands and before reading the output vector, to ensure that all memory accesses and \ac{pim} operations have completed.
On an ARM processor, such a memory barrier instruction is called \ac{dsb} \cite{arm2015}.

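In portable code, the same intent can be sketched with a volatile load guarded by a barrier. This is an illustration only: on the real ARM target the \texttt{dsb} stub below would be an inline \texttt{dsb sy} instruction, while here a compiler fence merely keeps the compiler from reordering the surrounding accesses:

```rust
use std::sync::atomic::{compiler_fence, Ordering};

/// Stand-in for the ARM DSB barrier (illustration only). On the real
/// target this would emit an inline `dsb sy` instruction.
fn dsb() {
    compiler_fence(Ordering::SeqCst);
}

/// Read one element of the PIM output via a volatile load, so the
/// compiler cannot assume the value is unchanged since the last host
/// write and must actually issue the access.
fn read_output(slot: &u16) -> u16 {
    dsb(); // all PIM-triggering accesses must have completed first
    unsafe { std::ptr::read_volatile(slot) }
}

fn main() {
    let result_slot: u16 = 0x3C00; // fp16 bit pattern of 1.0
    assert_eq!(read_output(&result_slot), 0x3C00);
}
```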
Until now, the \ac{pim} memory region was assumed to be non-cacheable, bypassing the on-chip cache in such a way that each \ac{ld} and \ac{st} instruction generates a \ac{rd} and \ac{wr} memory request, respectively.
Where this is not possible, however, the software library has to control the fetching, cleaning, flushing and pre-allocating of cache lines using special data cache instructions.
Before executing a \ac{ld} instruction, the host processor must first ensure that the cache line associated with the load address is not valid, as otherwise the stored value would be fetched from the cache and no \ac{rd} command would be sent to the memory.
For this, an invalidate instruction must be used, followed by a memory barrier instruction to ensure that the cache operation completes before the \ac{ld}.
Note that the invalidate instruction must not write back the stored data, as this would cause an unwanted \ac{wr}.
After the \ac{ld} is executed, another memory barrier must follow.
Executing a \ac{st} instruction, on the other hand, is slightly more complex.
Similar to the \ac{rd} command, the corresponding cache line must be in a defined state, namely the valid state.
However, this cannot be achieved by fetching the cache line first, as this could cause an unwanted \ac{rd} memory access.
Instead, ARM provides a special zero-preload instruction that initializes the cache line with zeros without triggering a memory access.
This instruction is normally used to optimize the performance of a program through explicit cache management, but it proved to be crucial for \ac{pim} kernel execution.
After another memory barrier, the processor can execute the \ac{st} instruction.
Once the cache line has been marked dirty by writing to it, it must also be flushed explicitly afterward, as otherwise the flush and the subsequent \ac{wr} command would not be issued until the cache line is evicted at some later time.
Finally, another memory barrier must synchronize the memory operations, as otherwise the flushed cache line could remain in the write buffer of the cache for a considerable amount of time.

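The ordering constraints of the \ac{st} path above can be summarized in code. In this sketch the ARM cache operations (zero preload, barrier, clean) are replaced by logging stubs, so only the required sequence is shown, not a real implementation:

```rust
/// Cache-maintenance sequence around a PIM st, with the ARM operations
/// replaced by logging stubs so the required ordering can be checked.
fn pim_store_sequence() -> Vec<&'static str> {
    vec![
        "zero-preload", // make the line valid with zeros, without a RD
        "dsb",          // ensure the preload completed before the store
        "st",           // the store that issues the WR to the PIM region
        "clean",        // flush the now-dirty line explicitly
        "dsb",          // keep the line from lingering in the write buffer
    ]
}

fn main() {
    assert_eq!(pim_store_sequence(),
               vec!["zero-preload", "dsb", "st", "clean", "dsb"]);
}
```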
By providing these utility routines for executing the \ac{pim} microkernel, all tools are now available to build an application that makes proper use of \aca{fimdram} to accelerate \ac{dnn} applications.