Fixes until Library chapter

This commit is contained in:
2024-02-17 15:47:38 +01:00
parent a49d409d4c
commit cf0b8c3984
3 changed files with 40 additions and 37 deletions

View File

@@ -3,7 +3,7 @@
The implementation of the \aca{fimdram} model is divided into three distinct parts:
Firstly, the processing units in the \acp{pch} of \aca{hbm} are integrated into the \ac{dram} model of DRAMSys.
Secondly, a software library that uses the \ac{pim} implementation provides an \ac{api} to take advantage of in-memory processing from a user application.
Finally, the software library is used in a gem5-based bare-metal kernel to perform \ac{pim} operations.
\input{chapters/implementation/vm}

View File

@@ -1,28 +1,27 @@
\subsection{Software Library} \label{sec:library}
With the implementation of the \ac{pim} processing units, a crucial missing piece for the simulation of \aca{fimdram} is software support to make effective use of the new architecture.
While it is possible to shift the responsibility for interacting with \aca{fimdram} to the application developer, it is preferable to provide a sophisticated software library that offers users an easy-to-use \ac{api} for interfacing with the \ac{pim} implementation.
Such a \ac{pim} library must include the following essential features to fully interact with the processing units in memory:
\begin{itemize}
\item It must support the \textbf{mode setting} required to switch between \ac{sb}, \ac{ab} and \ac{abp} mode.
\item It should provide data structures to assemble \textbf{microkernels} and functions to transfer the microkernels to the \acp{crf} of the processing units.
\item To meet the \textbf{memory layout} requirements of the inputs and outputs of an algorithm, it should provide data structures that represent vectors and matrices according to the special layout constraints.
\item After switching the mode to \ac{abp}, the library should provide functionality to \textbf{execute a user-defined microkernel} by issuing the necessary memory requests through the execution of \ac{ld} and \ac{st} instructions.
\item For platforms where it is not possible to mark the \ac{pim} memory region as non-cacheable, the library should provide the necessary \textbf{cache management} operations to bypass the cache filtering and to generate the right amount of \ac{rd} and \ac{wr} \ac{dram} commands.
\end{itemize}
As already discussed in \cref{sec:vm}, for simplicity and debuggability reasons, the host processor communicates with the \ac{pim} model in the \ac{dram} using a \ac{json}-based protocol.
To achieve this, a small shared library that defines the communication data structures as well as routines to serialize and deserialize them is linked by both the \ac{pim} support library and the \ac{pim} model in DRAMSys.
A predefined memory region is then used to differentiate these communication messages from the regular memory traffic.
Ideally, this memory region is also set as non-cacheable, so that the messages do not get stuck in the on-chip cache.
Alternatively, the software library must ensure that the cache is flushed after the \ac{json} message is written to the memory region.
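Such a message could, for instance, be serialized as a plain string. The following sketch uses a hypothetical message schema, since the actual data structures of the shared communication library are not shown here:

```rust
// Hypothetical host-to-DRAM message; the real schema of the shared
// library is an assumption for illustration purposes only.
#[derive(Debug, PartialEq)]
pub enum HostMessage {
    SwitchMode { mode: &'static str },
}

// Serialize a message into its raw JSON string representation,
// which the host then writes to the predefined memory region.
pub fn serialize(msg: &HostMessage) -> String {
    match msg {
        HostMessage::SwitchMode { mode } => {
            format!("{{\"type\":\"switch_mode\",\"mode\":\"{}\"}}", mode)
        }
    }
}
```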
With the mode setting implemented, the shared library also provides type definitions to represent the \ac{pim} instructions in memory and to transfer entire microkernels consisting of 32 instructions to the processing units.
An instruction is simply represented by one of 9 different \texttt{enum} variants, each holding its necessary fields, such as the source or destination register files, as shown in \cref{lst:instruction_enums}.
\begin{listing}
\begin{minipage}[t,c]{0.45\linewidth}
\begin{minted}{rust}
@@ -55,7 +54,6 @@ enum File {
\caption[The \texttt{enum} definitions of the instructions and register files]{The \texttt{enum} definitions of the instructions and register files.}
\label{lst:instruction_enums}
\end{listing}
A microkernel is then simply an array of 32 instructions.
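A minimal sketch of such a representation, using illustrative variant names and fields rather than the actual library definitions (only a subset of the 9 variants is shown):

```rust
// Illustrative register-file and instruction variants; names and
// fields are assumptions, not the library's actual definitions.
#[derive(Clone, Copy)]
pub enum File { GrfA, GrfB, SrfA, SrfM, Bank }

#[derive(Clone, Copy)]
pub enum Instruction {
    Nop,
    Exit,
    Jump { offset: i8, count: u8 },
    Mov { dst: File, src: File },
    Mac { dst: File, src0: File, src1: File },
}

// A microkernel is a fixed-size array of 32 instructions.
pub type Microkernel = [Instruction; 32];

pub fn empty_kernel() -> Microkernel {
    [Instruction::Nop; 32]
}
```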
\subsubsection{Data Structures}
@@ -69,15 +67,15 @@ For such a flat array, several things have to be considered:
\item The start of the array must lie on the first bank of the \ac{pch} and the end of the array must lie on the last bank of the \ac{pch}.
\end{itemize}
The software library introduces the \texttt{BankArray} data structure, which has a size of $\qty{32}{\byte}\cdot\mathrm{number\ of\ banks\ per\ \ac{pch}}=\qty{512}{\byte}$, holding a total of 256 \ac{fp16} numbers.
To guarantee the correct placement at the boundary of the first bank, an alignment of $\qty{512}{\byte}$ is explicitly enforced.
While it may seem at first that the compiler implicitly enforces this alignment, this is not true for arrays consisting of smaller data types: the compiler only enforces a $\qty{2}{\byte}$ alignment for the \ac{fp16} array, since a \ac{fp16} number is $\qty{2}{\byte}$ in size.
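In Rust, this explicit alignment can be enforced with a \texttt{repr} attribute. A minimal sketch, with \texttt{u16} standing in for the \ac{fp16} type, which is not built into the language:

```rust
// Sketch of the BankArray layout; u16 stands in for an FP16 value.
// The repr attribute forces the 512 B placement the PCH requires.
#[repr(C, align(512))]
pub struct BankArray {
    pub data: [u16; 256], // 16 banks * 32 B per burst = 512 B
}
```

Note that a bare `[u16; 256]` would only be 2-byte aligned, which is exactly the pitfall described above.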
This memory layout assumes a bank interleaving \ac{am}, where after a complete burst the memory controller addresses the next bank of the \ac{pch}.
To support arrays larger than $\qty{512}{\byte}$, the \texttt{BankArray} can also be instantiated multiple times in a larger \texttt{ComputeArray}.
This \texttt{ComputeArray} inherits the alignment requirements of the \texttt{BankArray}, so that it does not need to be explicitly aligned.
Also, arrays smaller than $\qty{512}{\byte}$ are possible by simply not filling the entire array with values.
However, the \texttt{BankArray} may not be smaller than this minimum size, as it must at least span all banks of one \ac{pch} to reserve these memory regions, so that the compiler does not place other data there, as it would be overwritten with invalid data during a \ac{pim} operation.
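The composition can be sketched the same way; a hypothetical \texttt{BankArray} is re-declared here for self-containment, and no extra alignment attribute is needed on the outer type:

```rust
// Hypothetical BankArray as before; u16 stands in for FP16.
#[repr(C, align(512))]
pub struct BankArray {
    pub data: [u16; 256],
}

// A larger array built from multiple BankArrays. Its alignment is
// the maximum alignment of its fields, so the 512 B requirement
// carries over without an explicit repr(align) attribute.
pub struct ComputeArray<const N: usize> {
    pub banks: [BankArray; N],
}
```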
This \texttt{ComputeArray} and \texttt{BankArray} layout is illustrated in \cref{img:compute_array}.
\begin{figure}
@@ -87,11 +85,11 @@ This \texttt{ComputeArray} and \texttt{BankArray} layout is illustrated in \cref
\label{img:compute_array}
\end{figure}
To leverage \aca{fimdram} to accelerate \ac{dnn} applications, however, the library must also support data structures that represent matrices and vectors with the required memory layout.
As already discussed in \cref{sec:memory_layout}, the weight matrix must be laid out in a column-major fashion, grouped in vectors of 16 \ac{fp16} elements.
To avoid reinventing numerous routines for initializing and manipulating matrices, the publicly available open-source linear algebra library nalgebra \cite{nalgebra} is used.
In order to achieve the packed \ac{fp16} layout, a special \ac{simd} data type abstraction is used, while taking into account the changed dimensions of the matrix.
Following the same consideration as with the \texttt{BankArray}, the weight matrix must be aligned to a $\qty{512}{\byte}$ boundary to ensure that the first matrix element is placed at the boundary of the first bank of the \ac{pch}.
However, when using the \ac{aam} execution mode, this is not sufficient.
As already shown in \cref{img:aam}, the \ac{grf}-A and \ac{grf}-B indices are calculated from the column and row address of the triggering memory access.
With an alignment of $\qty{512}{\byte}$, no assumptions can be made about the initial value of the \ac{grf}-A and \ac{grf}-B indices, while for the execution of a complete \ac{gemv} kernel, both indices should start at zero.
@@ -99,13 +97,13 @@ Therefore, the larger alignment requirement of $2^6\cdot\qty{512}{\byte}=\qty{32
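This stricter alignment requirement can be captured in a few constants; the assumption that the combined register indices span $2^6$ combinations is taken from the derivation above:

```rust
// One pass over all banks of a PCH covers 512 B.
pub const BANK_STRIDE: usize = 512;
// Assumption from the text: the combined GRF-A/GRF-B index bits
// span 2^6 combinations, so they only reset at this boundary.
pub const INDEX_COMBINATIONS: usize = 1 << 6;
pub const AAM_ALIGN: usize = INDEX_COMBINATIONS * BANK_STRIDE;

// True if a base address guarantees that both register indices
// start at zero for an AAM kernel execution.
pub fn aam_aligned(addr: usize) -> bool {
    addr % AAM_ALIGN == 0
}
```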
Besides the weight matrices, the input vector must adhere to an interleaved layout at the granularity of the 16-wide \ac{fp16} vector, as described in \cref{sec:memory_layout}.
The number of copies of each chunk is equal to the number of processing units in each \ac{pch}.
While it would be possible to use the \ac{ab} mode of \aca{fimdram} to initialize the input vector, the interleaving is done in software for the purpose of debuggability, since the initialization step cannot be modeled accurately anyway due to the \ac{json}-based communication protocol.
Here, the alignment requirement of $\qty{512}{\byte}$ is sufficient for the input vector to ensure it resides at the boundary of the first bank in the respective \ac{pch}.
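The software interleaving can be sketched as follows; the number of processing units per \ac{pch} is left as a parameter, and the consecutive placement of the copies is an assumption based on the bank-interleaved address mapping:

```rust
// Replicate each 16-wide chunk of the input vector once per
// processing unit, placing the copies consecutively so that the
// bank-interleaved address mapping distributes them over the banks.
// (u16 stands in for FP16.)
pub fn interleave_input(chunks: &[[u16; 16]], units_per_pch: usize) -> Vec<[u16; 16]> {
    let mut out = Vec::with_capacity(chunks.len() * units_per_pch);
    for chunk in chunks {
        for _ in 0..units_per_pch {
            out.push(*chunk);
        }
    }
    out
}
```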
Besides the input vector, the output of the \ac{gemv} kernel must also be considered: it is not a flat vector, but a 16-column matrix that must be reduced by the host after the \ac{pim} operation.
Therefore, before the operation, the output matrix must be allocated as a vector consisting of one \ac{simd} \ac{fp16} vector for every matrix row, while also adhering to the $\qty{512}{\byte}$ alignment.
The bank interleaving of the \ac{am} leads to the correct, sequential representation in linear address space after the \ac{mac} results are written from the \ac{grf} register files to the memory banks.
The host can then simply read the result from the preallocated output vector and reduce the results to prepare them for the next \ac{dnn} layer.
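The host-side reduction amounts to a row-wise sum over the 16 partial results; a sketch using \texttt{f32} in place of \ac{fp16}:

```rust
// Reduce the 16-column output matrix to the final GEMV result:
// each row holds 16 partial sums, one per GRF lane, which the
// host adds up. (f32 stands in for FP16 here.)
pub fn reduce_output(rows: &[[f32; 16]]) -> Vec<f32> {
    rows.iter().map(|row| row.iter().sum()).collect()
}
```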
With the introduced data structures used for addition, scalar multiplication and \ac{gemv} kernels, the software library must also support the execution of the programmed \ac{pim} microkernels.
The implementation of the \aca{fimdram} execution model is explained in the following section.
@@ -118,9 +116,9 @@ When executing control instructions or data movement instructions that operate o
Further, when data is read from or written to the memory banks, these memory requests are issued with the correct address for the data.
Since a memory request triggers the execution of all processing units in a \ac{pch}, the \ac{ld} and \ac{st} instructions may not cover the complete input and output data from the processors' perspective:
From the point of view of the processor, only data in the first (even) or second (odd) bank is ever accessed.
This requires special indexing of the input vectors and matrices, since they must be accessed very sparsely.
In the case of the input vector, where one 16-wide \ac{simd} vector of \ac{fp16} elements is repeated as often as there are banks in a \ac{pch}, a burst access must occur every $\qty{32}{\byte}\cdot\mathrm{number\ of\ banks\ per\ \ac{pch}}=\qty{512}{\byte}$, up to 8 times over the entire interleaved input vector.
This way, all available \ac{grf}-A registers in a processing unit are used to hold its copy of the input vector.
To then perform the repeated \ac{mac} operation with the weight matrix as bank data, a similar logic must be applied.
Since each row of the matrix resides on its own memory bank, with an interleaving of the size of a 16-wide \ac{simd} vector of \ac{fp16} elements, one memory access must likewise be issued every $\qty{512}{\byte}$.
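The resulting sparse access pattern can be sketched as a simple address generator; the base address and access count used in the test are purely illustrative:

```rust
// Generate the addresses of the triggering LD/ST accesses: one
// burst every 512 B (one pass over all banks of the PCH), repeated
// up to 8 times to fill all GRF-A entries of a processing unit.
pub fn trigger_addresses(base: usize, accesses: usize) -> Vec<usize> {
    (0..accesses).map(|i| base + i * 512).collect()
}
```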
@@ -133,7 +131,7 @@ To avoid this, not only between non-\ac{aam} instructions in the microkernel, th
On an ARM processor, such a memory barrier instruction is called \ac{dsb} \cite{arm2015}.
Until now, the \ac{pim} memory region was assumed to be in a non-cacheable memory region, bypassing the on-chip cache in such a way that each \ac{ld} and \ac{st} instruction generates a \ac{rd} and \ac{wr} memory request, respectively.
However, where this is not possible, the software library has to control the fetching, cleaning, flushing and preallocating of cache lines using special data cache instructions.
Before executing a \ac{ld} instruction, the host processor must first make sure that the cache line associated with the load address is not valid, as otherwise the stored value will be fetched from the cache and no \ac{rd} command will be sent to the memory.
For this, an invalidate instruction must be used, followed by a memory barrier instruction to ensure that the cache operation completes before the \ac{ld}.
Note that the invalidate instruction may not flush the stored data, as this would cause an unwanted \ac{wr}.
@@ -142,9 +140,13 @@ Executing a \ac{st} instruction, on the other hand, is slightly more complex.
Similar to the \ac{rd} command, the corresponding cache line must be in a defined state, namely the valid state.
However, this cannot be achieved by fetching the cache line first, as this could cause an unwanted \ac{rd} memory access.
Instead, ARM provides a special zero preload instruction that initializes the cache line with zeros without triggering a memory access.
This is normally an instruction used for optimizing the performance of a program using explicit cache management, but it proved to be crucial for \ac{pim} kernel execution.
After another memory barrier, the processor can now execute the \ac{st} instruction.
Once the cache line is marked as dirty by writing to it, it must also be flushed explicitly afterward, as otherwise the flush and the subsequent \ac{wr} command would not be issued until the cache line is evicted at a later time.
Finally, another memory barrier must synchronize the memory operations, as otherwise the flushed cache line could be stuck in the write buffer of the cache for a considerable amount of time.
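The required orderings for \ac{ld} and \ac{st} can be summarized as operation sequences encoded in data; the operation names are descriptive placeholders, not actual ARM mnemonics:

```rust
// The cache-management orderings described above, encoded as data.
#[derive(Debug, PartialEq)]
pub enum CacheOp {
    InvalidateLine, // drop the line without writing it back
    ZeroPreload,    // allocate the line as zeros, no memory access
    Barrier,        // memory barrier, e.g. ARM DSB
    Load,           // the triggering LD instruction
    Store,          // the triggering ST instruction
    FlushLine,      // clean the dirty line, forcing the WR command
}

// LD: invalidate first so the RD really reaches memory.
pub fn pim_load_sequence() -> Vec<CacheOp> {
    use CacheOp::*;
    vec![InvalidateLine, Barrier, Load]
}

// ST: zero-preload to validate the line without a RD, then flush
// explicitly so the WR is not deferred until eviction.
pub fn pim_store_sequence() -> Vec<CacheOp> {
    use CacheOp::*;
    vec![ZeroPreload, Barrier, Store, FlushLine, Barrier]
}
```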
During the development of this cache management approach, it became apparent that the cache may not be sufficiently controllable by the user program.
The compiler may introduce additional stack variables and memory accesses that are not obvious to the developer, rendering the explicit generation of \ac{rd} and \ac{wr} commands nearly impossible.
Therefore, these critical sections would have to be written in assembly language to have the necessary control over the processor.
However, other user programs running in the background at the same time would also make this approach very difficult.
By providing these utility routines for executing the \ac{pim} microkernels, all tools are now available to build an application that makes proper use of \aca{fimdram} for accelerating \ac{dnn} applications.

View File

@@ -2,17 +2,18 @@
\label{sec:vm}
\subsubsection{Integration}
To implement \aca{fimdram} in \aca{hbm}, the \ac{dram} model of DRAMSys has to be extended to incorporate the processing units in the \acp{pch} of the \ac{pim}-activated channels.
They also need to be provided with the burst data from the \acp{ssa} as well as the burst address to calculate the register indices in the \ac{aam} operation mode.
However, as already described in \cref{sec:pim_fim}, no changes are required in the frontend or backend of DRAMSys, nor in the memory controller.
In addition, since a single \ac{dram} \ac{rd} or \ac{wr} command triggers the execution of a single microkernel instruction, the processing unit is fully synchronized with the read and write operations of the \ac{dram}.
As a result, the \aca{fimdram} model itself does not need to model any timing behavior: its submodel is essentially untimed, since it is already synchronized with the operation of the \ac{dram} model of DRAMSys.
This leads to a significantly simplified model, since the internal pipeline stages of \aca{fimdram} do not need to be modeled, but only the functional behavior of a processing unit to the outside.
While \aca{fimdram} operates in the default \ac{sb} mode, it behaves exactly like a normal \aca{hbm} memory.
Only when the host initiates a mode switch of one of the \ac{pim}-enabled \acp{pch} do the processing units become active.
As already described in \cref{sec:pim_architecture}, \aca{fimdram} expects certain sequences of \ac{act} and \ac{pre} commands to initiate a mode transition.
Unfortunately, Samsung did not specify this mechanism in any more detail than that, so the actual implementation of the mode switching in the \aca{fimdram} model has been simplified to a \ac{json}-based communication protocol, to achieve maximum flexibility and debugging ability from a development perspective.
In this mechanism, the host processor builds \ac{json} messages at runtime and writes their raw serialized string representation to a predefined location in memory.
The \ac{dram} model then inspects incoming \ac{wr} commands in this memory region and deserializes the content of these memory accesses to reconstruct the message of the host.
As a downside of this method, the actual mode switching cannot be simulated with accurate timing, as a \ac{json} message might be composed of more than one memory packet.
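The reassembly of a message from multiple \ac{wr} packets can be sketched as follows, assuming, hypothetically, that the host terminates the serialized string with a NUL byte:

```rust
// Accumulates write bursts in the communication memory region
// until a full message has arrived. The NUL terminator is an
// assumed convention, not a documented part of the protocol.
pub struct MessageBuffer {
    buf: Vec<u8>,
}

impl MessageBuffer {
    pub fn new() -> Self {
        Self { buf: Vec::new() }
    }

    // Feed one burst of write data; returns the complete message
    // string once the terminator is seen, otherwise None.
    pub fn push(&mut self, burst: &[u8]) -> Option<String> {
        for &b in burst {
            if b == 0 {
                let msg = String::from_utf8_lossy(&self.buf).into_owned();
                self.buf.clear();
                return Some(msg);
            }
            self.buf.push(b);
        }
        None
    }
}
```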
With more information from Samsung on how the actual mechanism is implemented, this implementation can be trivially switched over to it at a later date.
@@ -22,42 +23,42 @@ This mode can be used by the host to initialize the input vector chunk interleav
After the transition to the \ac{ab} mode, the \ac{dram} can further transition to the \ac{ab}-\ac{pim} mode, which allows the execution of instructions in the processing units.
The \ac{abp} mode is similar to the \ac{ab} mode in that it also ignores the concrete bank address except for its parity, while additionally passing the column and row address and, in the case of a read, also the respective fetched bank data to the processing units.
In the case of a write access, the output of the processing unit is written directly into the corresponding bank, ignoring the actual data of the transaction object coming from the host processor.
This is equivalent to the real \aca{fimdram} implementation, where the global \ac{io} bus of the memory is not actually driven, and all data movement is done internally in the banks.
\subsubsection{Implementation}
So far, only the additional infrastructure in the \ac{dram} model of DRAMSys for the integration of the processing units has been described.
What follows now is the implementation of the processing units themselves.
The internal state of a processing unit consists of the \ac{grf} register files \ac{grf}-A and \ac{grf}-B, the \ac{srf} register files \ac{srf}-A and \ac{srf}-M, the program counter, and a jump counter that keeps track of the current iteration of a JUMP instruction.
As a simplification of the model, the \acp{crf} are not stored in each \ac{pim} unit, but are stored once globally for each \ac{pch}.
Functionally, this does not change the behavior of the system, assuming that each processing unit is programmed with the same microkernel, which is the case for all the programs examined in this thesis.
Depending on a \ac{rd} or \ac{wr} command, either the method \mint{rust}{fn execute_read(address: u64, bank_data: &[u8])} or the method \mint{rust}{fn execute_write() -> [u8; 32]} is called on the instance of a \ac{pim} unit.
The most important difference between these two methods lies in their signatures.
While the former takes the address and the bank data to be read as input, the latter only outputs the bank data of the size of a full burst to be written into the respective bank.
However, both methods execute an instruction in the \ac{crf} and increment the program counter of the corresponding \ac{pim} unit.
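The resulting interface of a processing unit can be sketched as follows; apart from the two method signatures taken from the model, all field names and register file sizes are illustrative assumptions:

\begin{minted}{rust}
// Illustrative skeleton of a processing unit. Only the two method
// signatures are taken from the model; the field names and the register
// file sizes are assumptions for illustration.
struct PimUnit {
    grf_a: [[u16; 16]; 8], // GRF-A: eight 16-wide FP16 vectors (raw bits)
    grf_b: [[u16; 16]; 8], // GRF-B
    srf_a: [u16; 8],       // SRF-A: scalar addends
    srf_m: [u16; 8],       // SRF-M: scalar multiplicands
    pc: usize,             // program counter into the shared CRF
    jump_counter: u32,     // iteration count of the current JUMP
}

impl PimUnit {
    // RD command: execute the current CRF instruction, with the fetched
    // bank data available as a source operand.
    fn execute_read(&mut self, address: u64, bank_data: &[u8]) {
        // 1. derive the AAM register indices from the row and column address
        // 2. dispatch on the opcode (EXIT, MOV, FILL, ADD, MUL, MAC, MAD)
        // 3. increment the program counter and resolve a pending JUMP
        let _ = (address, bank_data);
        self.pc += 1;
    }

    // WR command: only FILL may write; the returned burst is written
    // directly into the corresponding bank.
    fn execute_write(&mut self) -> [u8; 32] {
        self.pc += 1;
        [0u8; 32] // placeholder for the serialized source register
    }
}
\end{minted}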
The \texttt{execute\_read} method begins by calculating the register indices used by the \ac{aam} execution mode, followed by a branch table that dispatches to the handler of the current instruction.
In the case of the EXIT control instruction, the internal state of the processing unit is reset to its default configuration.
The data movement instructions MOV and FILL both only perform a simple move operation that loads the value of one register or the bank data and assigns it to the destination register.
A more complex implementation is required for the four arithmetic instructions ADD, MUL, \ac{mac}, and \ac{mad}:
Depending on the \ac{aam} flag set in the instruction format, as seen in \cref{tab:isa}, either the indices set by the instruction itself are used, or the ones previously calculated from the row and column address of the memory access.
In the case of the simple ADD and MUL instructions, the operand data is then fetched from their respective sources.
The \ac{mac} and \ac{mad} instructions differ in that they require a total of three input operands, one of which may be the destination register.
In the first step, the multiplication of the first two input operands is performed in the same way as in MUL.
Then, this temporary product is added to the third source register as in ADD.
Finally, this sum is written to the destination register.
Note that while the \ac{mac} instruction can iteratively add to the same destination register, it does not reduce the 16-wide \ac{fp16} vector itself in any way.
As already seen in \cref{sec:memory_layout}, the host processor is responsible for reducing these 16 floating-point numbers into a single \ac{fp16} number.
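The element-wise \ac{mac} step and the subsequent host-side reduction can be illustrated as follows, with \texttt{f32} standing in for the \ac{fp16} arithmetic of the real hardware:

\begin{minted}{rust}
// Element-wise MAC over a 16-wide vector: dst[i] += src0[i] * src1[i].
// f32 stands in for FP16 here, purely as an illustration.
fn mac(dst: &mut [f32; 16], src0: &[f32; 16], src1: &[f32; 16]) {
    for i in 0..16 {
        // temporary product as in MUL, then accumulation as in ADD
        dst[i] += src0[i] * src1[i];
    }
}

// The processing unit never reduces across the 16 lanes; the host
// performs the final reduction to a single number.
fn reduce(acc: &[f32; 16]) -> f32 {
    acc.iter().sum()
}
\end{minted}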
After the execution of one instruction, the program counter is incremented.
One special instruction, the JUMP instruction, is processed at the end of an execution step:
The JUMP instruction is a zero-cycle instruction, i.e., it is not executed by being triggered with a \ac{rd} command like a regular instruction.
Instead, the jump offset and iteration count are resolved statically at the end of a regular instruction.
Depending on the jump counter of the processing unit, the counter is either initialized with the jump count specified in the instruction, or it is decremented by one.
If the new jump counter has not yet reached zero, the jump to the offset instruction is performed.
Otherwise, execution continues as is.
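This resolution of a pending JUMP can be sketched as follows; the function signature and the backward direction of the jump are assumptions for illustration:

\begin{minted}{rust}
// Resolve a zero-cycle JUMP at the end of a regular instruction.
// A jump counter of zero marks the first encounter of the JUMP;
// the signature and backward-jump direction are illustrative assumptions.
fn resolve_jump(pc: &mut usize, jump_counter: &mut u32, offset: usize, count: u32) {
    if *jump_counter == 0 {
        *jump_counter = count; // first encounter: initialize the counter
    } else {
        *jump_counter -= 1;    // later encounters: decrement
    }
    if *jump_counter > 0 {
        *pc -= offset;         // loop back by the jump offset
    }
    // otherwise fall through to the next instruction
}
\end{minted}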
This implementation only works for non-nested JUMP instructions, as each level of nesting would require its own jump counter.
From the information provided by Samsung, it is not clear whether nested JUMP instructions are implemented in \aca{fimdram}.
However, none of the microkernels examined in this thesis use nested JUMPs.
As already seen in \cref{tab:instruction_set}, only the FILL instruction supports writing to the memory bank.
Therefore, it is the only instruction implemented in the \texttt{execute\_write} method.