Apply Lukas' remarks
@@ -157,7 +157,7 @@
 }
 \DeclareAcronym{tsv}{
 short = TSV,
-long = trough-silicon via,
+long = through-silicon via,
 }
 \DeclareAcronym{pch}{
 short = pCH,
@@ -97,7 +97,7 @@ Such a 2.5D-integrated type used in \acp{gpu} and \acp{tpu} is \ac{hbm}, which w
 \label{sec:hbm}
 
 \Aca{hbm} is a \ac{dram} standard that was defined by \ac{jedec} in 2016 as a successor of the previous \ac{hbm} standard \cite{jedec2015a}.
-What differentiates \ac{hbm} from other types of memory is its \ac{sip} approach.
+What differentiates \ac{hbm} from other types of \ac{dram} is its \ac{sip} approach.
 Several \ac{dram} dies are stacked on top of each other and connected with \acp{tsv} to form a cube of memory dies consisting of many die layers and a buffer die at the bottom, as shown in \cref{img:sip}.
 \begin{figure}
 \centering
@@ -106,14 +106,15 @@ Several \ac{dram} dies are stacked on top of each other and connected with \acp{
 \label{img:sip}
 \end{figure}
 Such a cube is then placed onto a common silicon interposer that connects the \ac{dram} to its host processor.
-This packaging brings the memory closer to the \ac{mpsoc}, which reduces the latency, minimizes the bus capacitance and, most importantly, allows for an extraordinary wide memory interface.
+This packaging brings the memory closer to the \ac{mpsoc}, which allows for an exceptionally wide memory interface and a minimized bus capacitance.
 For example, compared to a conventional \ac{ddr4} \ac{dram}, this tight integration enables $\qtyrange[range-units=single]{10}{13}{\times}$ more \ac{io} connections to the \ac{mpsoc} and a $\qtyrange[range-units=single]{2}{2.4}{\times}$ lower energy per bit transfer \cite{lee2021}.
 
 One memory stack supports up to 8 independent memory channels, each of which contains up to 16 banks divided into 4 bank groups.
 The command, address and data bus operate at \ac{ddr}, i.e., they transfer two words per interface clock cycle $t_{CK}$.
-With a $t_{CK}$ of $\qty{1}{\giga\hertz}$, \aca{hbm} achieves a pin transfer rate of $\qty{2}{\giga T \per\second}$, which gives $\qty[per-mode=symbol]{16}{\giga\byte\per\second}$ per \ac{pch} and a total of $\qty[per-mode = symbol]{256}{\giga\byte\per\second}$ for the 1024-bit wide data bus of each stack.
-A single data transfer is performed with either a \ac{bl} of 2 or 4, depending on the \ac{pch} configuration.
+The \aca{hbm} standard defines two modes of operation: in legacy mode, the data bus operates as is.
+In \ac{pch} mode, the data bus is split in half (i.e., 64-bit) to enable independent data transmission, further increasing parallelism while sharing a common command and address bus between the two \acp{pch}.
+With a $t_{CK}$ of $\qty{1}{\giga\hertz}$, \aca{hbm} achieves a pin transfer rate of $\qty{2}{\giga T \per\second}$, which results in $\qty[per-mode=symbol]{16}{\giga\byte\per\second}$ per \ac{pch} and a total of $\qty[per-mode = symbol]{256}{\giga\byte\per\second}$ for the 1024-bit wide data bus of each stack.
+A single data transfer is performed with either a \ac{bl} of 2 in legacy mode or 4 in \ac{pch} mode.
 Thus, accessing \aca{hbm} in \ac{pch} mode transmits a $\qty{256}{\bit}=\qty{32}{\byte}$ burst with a \ac{bl} of 4 over the $\qty{64}{\bit}$ wide data bus.
 
 \cref{img:hbm} illustrates the internal architecture of a single memory die.
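A quick sanity check of the bandwidth and burst figures in this hunk (a sketch, assuming only the parameters stated in the text: a 1 GHz interface clock, DDR transfers, a 64-bit pCH data bus, and 16 pCHs per stack):

```python
# Sanity check for the HBM bandwidth figures quoted above.
# Assumptions taken from the text: 1 GHz interface clock, DDR
# (2 transfers per cycle), 64-bit pCH data bus, 16 pCHs per stack.

t_ck_hz = 1e9                  # interface clock frequency
transfer_rate = 2 * t_ck_hz    # DDR: 2 transfers per clock -> 2 GT/s per pin

pch_bus_bits = 64              # pCH-mode data bus width
pch_bw = transfer_rate * pch_bus_bits / 8    # bytes per second per pCH

stack_bw = pch_bw * 16         # 16 pCHs share the 1024-bit stack bus

burst_bytes = pch_bus_bits * 4 // 8          # BL = 4 over the 64-bit bus

print(pch_bw / 1e9, stack_bw / 1e9, burst_bytes)  # -> 16.0 256.0 32
```

The numbers reproduce the 16 GB/s per pCH, 256 GB/s per stack, and 32-byte burst stated in the diff.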
@@ -128,6 +129,7 @@ In the center of the die, the \acp{tsv} connect the die to the next die above it
 \end{figure}
 
 % still, bandwidth requirements of new AI applications are not met by HBM2
-Although \aca{hbm} provides a high amount of bandwidth, many modern \acp{dnn} applications reside in the memory-bound limitations.
+Even though \aca{hbm} provides a very high memory bandwidth, many modern \ac{dnn} applications are still memory-bound.
 While one approach would be to further increase the bandwidth by integrating more stacks on the silicon interposer, other constraints such as thermal limits or the limited number of \ac{io} connections on the interposer may make this impractical \cite{lee2021}.
-Another approach could be \acf{pim}: Using \ac{hbm}'s 2.5D architecture, it is possible to incorporate additional compute units directly into the memory stacks, increasing the achievable parallel bandwidth and reducing the burden of transferring all the data to the host processor for performing operations on it.
+Another approach could be \acf{pim}:
+By integrating additional compute units directly into the memory stacks, the achievable parallel bandwidth is further increased and the burden of transferring all operand data to and from the host processor is reduced.
@@ -84,7 +84,7 @@ The \ac{uart} device model in gem5 then redirects the written messages either to
 
 Further, the bare-metal environment does not support any heap allocation without the kernel explicitly implementing it.
 During development of the custom kernel, it was found that the stack is not suitable for storing the large \ac{pim} arrays for two reasons:
-Firstly, as the \ac{pim} arrays become very large with high matrix dimensions and may not fit in the preallocated stack region.
+Firstly, the \ac{pim} arrays become very large at high matrix dimensions and may not fit into the pre-allocated region of the stack.
 Secondly, and most importantly, because the stack resides in the normal cacheable \ac{dram} region, it cannot be used to store the \ac{pim}-enabled data structures.
 As an alternative, it would be possible to preallocate all \ac{pim} data structures in the \ac{pim} \ac{dram} region by instructing the linker to place these structures in a special section of the \ac{elf} file and mapping that section to the \ac{pim}-enabled \acp{pch}.
 However, this approach is very inflexible, as the exact dimensions of the matrices would have to be known at compile time.
@@ -86,7 +86,7 @@ This \texttt{ComputeArray} and \texttt{BankArray} layout is illustrated in \cref
 \end{figure}
 
 To leverage \aca{fimdram} to accelerate \ac{dnn} applications, however, the library must also support data structures to represent matrices and vectors with the required memory layout.
-As already discussed in \cref{sec:memory_layout}, the weight matrix must be laid out in a column-major fashion, grouped in vectors of 16 \ac{fp16} elements.
+As already discussed in \cref{sec:memory_layout}, the weight matrix must be laid out in a column-major fashion, grouped into vectors of 16 \ac{fp16} elements each.
 To avoid reinventing numerous routines for initializing and manipulating matrices, the publicly available open-source linear algebra library nalgebra \cite{nalgebra} is used.
 In order to achieve the packed \ac{fp16} layout, a special \ac{simd} data type abstraction is used, taking into account the changed dimensions of the matrix.
 Following the same consideration as with the \texttt{BankArray}, the weight matrix must be aligned to a $\qty{512}{\byte}$ boundary to ensure that the first matrix element is placed at the boundary of the first bank of the \ac{pch}.
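The layout rules in this hunk can be sketched as follows. This is an illustrative Python sketch, not the thesis' Rust/nalgebra implementation; the column-padding rule and the helper names are assumptions based only on the stated 16-element grouping and 512-byte alignment:

```python
# Illustrative sketch of the packed FP16 weight-matrix layout:
# column-major order, grouped into vectors of 16 FP16 elements,
# base address aligned to a 512-byte boundary. Hypothetical helpers,
# not the actual nalgebra-based data structures.

FP16_BYTES = 2
GROUP = 16          # SIMD width: 16 FP16 elements per packed vector
ALIGN = 512         # required base alignment in bytes

def align_up(addr, align=ALIGN):
    """Round addr up to the next multiple of align."""
    return (addr + align - 1) // align * align

def element_offset(row, col, rows):
    """Byte offset of element (row, col) in a column-major matrix
    whose columns are padded to a multiple of GROUP elements
    (assumed padding rule)."""
    padded_rows = (rows + GROUP - 1) // GROUP * GROUP
    return (col * padded_rows + row) * FP16_BYTES

base = align_up(0x1234)   # e.g. place the matrix at the next boundary
print(hex(base), element_offset(17, 3, rows=100))  # -> 0x1400 706
```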
@@ -3,7 +3,7 @@
 
 \subsubsection{Integration}
 To implement \aca{fimdram} in \aca{hbm}, the \ac{dram} model of DRAMSys has to be extended to incorporate the processing units in the \acp{pch} of the \ac{pim}-activated channels.
-They also need to be provided it with the burst data from the \acp{ssa} as well as the burst address to calculate the register indices in the \ac{aam} operation mode.
+They must also receive the burst data from the \acp{ssa} and the burst address to calculate the register indices for the \ac{aam} mode.
 However, no changes are required in the frontend or backend of DRAMSys; as already described in \cref{sec:pim_fim}, the memory controller also remains unchanged.
 In addition, since a single \ac{dram} \ac{rd} or \ac{wr} command triggers the execution of a single microkernel instruction, the processing unit is fully synchronized with the read and write operations of the \ac{dram}.
 As a result, the \aca{fimdram} model itself does not need to model any timing behavior:
@@ -12,7 +12,7 @@ This leads to a significantly simplified model, since the internal pipeline stag
 
 While \aca{fimdram} operates in the default \ac{sb} mode, it behaves exactly like a normal \aca{hbm} memory.
 Only when the host initiates a mode switch of one of the \ac{pim}-enabled \acp{pch} do the processing units become active.
-As already described in \cref{sec:pim_architecture}, \aca{fimdram} expects certain sequences of \ac{act} and \ac{pre} sequences to initiate a mode transition.
+As already described in \cref{sec:pim_architecture}, \aca{fimdram} expects certain sequences of \ac{act} and \ac{pre} commands to initiate a mode transition.
 Unfortunately, Samsung did not specify this mechanism in any more detail, so the actual implementation of the mode switching in the \aca{fimdram} model has been simplified to a \ac{json}-based communication protocol to achieve maximum flexibility and debuggability from a development perspective.
 In this mechanism, the host processor builds \ac{json} messages at runtime and writes their raw serialized string representations to a predefined location in memory.
 The \ac{dram} model then inspects incoming \ac{wr} commands to this memory region and deserializes the content of these memory accesses to reconstruct the message of the host.
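The JSON handshake described in this hunk can be sketched minimally. The message fields (`cmd`, `pch`, `mode`) and the mailbox address are hypothetical placeholders; the actual schema of the DRAMSys model is not given in the text:

```python
# Minimal sketch of the JSON-based mode-switch handshake: the host
# serializes a message and writes the raw bytes to a predefined memory
# region; the DRAM model deserializes WR data arriving in that region.
# Field names and MAILBOX_ADDR are assumed for illustration only.
import json

MAILBOX_ADDR = 0x0  # placeholder for the predefined memory location

def build_mode_switch(pch_id, target_mode):
    """Host side: build and serialize the mode-switch message."""
    msg = {"cmd": "mode_switch", "pch": pch_id, "mode": target_mode}
    return json.dumps(msg).encode()   # raw serialized string representation

def parse_mode_switch(raw):
    """DRAM-model side: reconstruct the host's message from WR data."""
    return json.loads(raw.decode())

payload = build_mode_switch(pch_id=2, target_mode="AB")
assert parse_mode_switch(payload)["mode"] == "AB"
```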
@@ -29,8 +29,8 @@ This is equivalent to the real \aca{fimdram} implementation, where the global \a
 
 \subsubsection{Implementation}
 So far, only the additional infrastructure in the \ac{dram} model of DRAMSys for the integration of the processing units has been described.
-Now follows the implementation of the processing units themselves.
-The internal state of a processing unit consists of the \ac{grf} register files \ac{grf}-A and \ac{grf}-B, the \ac{srf} register files \ac{srf}-A and \ac{srf}-M, the program counter, and a jump counter that keeps track of the current iteration of a JUMP instruction.
+The next step is the implementation of the processing units themselves.
+A processing unit's internal state consists of the \ac{grf} register files \ac{grf}-A and \ac{grf}-B, the \ac{srf} register files \ac{srf}-A and \ac{srf}-M, the program counter, and a jump counter that keeps track of the current iteration of a JUMP instruction.
 As a simplification of the model, the \acp{crf} are not stored in each \ac{pim} unit, but are stored once globally for each \ac{pch}.
 Functionally, this does not change the behavior of the system, assuming that each processing unit is programmed with the same microkernel, which is the case for all the programs examined in this thesis.
 
@@ -40,8 +40,8 @@ While the former takes the address and the bank data to be read as input, the la
 However, both methods execute an instruction in the \ac{crf} and increment the program counter of the corresponding \ac{pim} unit.
 The \texttt{execute\_read} method begins with calculating the register indices used by the \ac{aam} execution mode, followed by a branch table that dispatches to the handler of the current instruction.
 In case of the EXIT control instruction, the internal state of the processing unit is reset to its default configuration.
-The data movement instructions MOV and FILL both only perform a simple move operation that loads to value of one register or the bank data and assigns it to the destination register.
-A more complex implementation require the four arithmetic instructions ADD, MUL, \ac{mac} and \ac{mad}:
+The MOV and FILL data movement instructions both perform a simple move operation that loads the value of a register or the bank data and assigns it to the destination register.
+The four arithmetic instructions ADD, MUL, \ac{mac}, and \ac{mad} require a more complex implementation:
 Depending on the \ac{aam} flag set in the instruction format, as seen in \cref{tab:isa}, either the indices set by the instruction itself are used, or the ones previously calculated from the row and column address of the memory access.
 In the case of the simple ADD and MUL instructions, the operand data is then fetched from their respective sources.
 The \ac{mac} and \ac{mad} instructions differ in that they require a total of three input operands, one of which may be the destination register.
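The dispatch flow in this hunk can be sketched roughly as a simplified model. The register-file sizes, the AAM index calculation, and the instruction encoding below are illustrative assumptions, not the documented FIMDRAM format or the thesis' actual \texttt{execute\_read} implementation:

```python
# Simplified sketch of a processing unit's execute_read dispatch:
# compute the AAM register index from the access address, then branch
# on the opcode. Register-file sizes and the index formula are
# illustrative assumptions, not the documented FIMDRAM encoding.

class PimUnit:
    def __init__(self):
        self.grf_a = [0.0] * 8     # assumed GRF-A size
        self.grf_b = [0.0] * 8     # assumed GRF-B size
        self.pc = 0                # program counter

    def execute_read(self, crf, row, col, bank_data):
        inst = crf[self.pc]
        aam_idx = col % 8          # assumed AAM index calculation
        op = inst["op"]
        if op == "EXIT":
            self.__init__()        # reset to the default configuration
            return
        if op == "FILL":           # bank data -> destination register
            self.grf_a[aam_idx] = bank_data
        elif op == "MAC":          # grf_b += bank_data * grf_a (AAM-indexed)
            self.grf_b[aam_idx] += bank_data * self.grf_a[aam_idx]
        self.pc += 1

unit = PimUnit()
crf = [{"op": "FILL"}, {"op": "MAC"}]
unit.execute_read(crf, row=0, col=3, bank_data=2.0)  # FILL: grf_a[3] = 2.0
unit.execute_read(crf, row=0, col=3, bank_data=4.0)  # MAC: grf_b[3] += 8.0
print(unit.grf_b[3])  # -> 8.0
```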
@@ -1,8 +1,8 @@
 \section{Processing-in-Memory}
 \label{sec:pim}
 
-In the conventional von Neumann architecture, compute is completely separated from memory.
-Memory-intensive workloads operate on a large data set, have poor spatial and temporal locality, and low operational density.
+In the conventional von Neumann architecture, computation is completely separated from memory.
+Memory-intensive workloads operate on a large set of data, have poor temporal locality, and exhibit a low density of operations.
 As a consequence, the data movement between memory and compute forms the so-called von Neumann bottleneck \cite{zou2021}.
 In the past, this bottleneck was obfuscated using latency hiding techniques such as out-of-order execution, branch prediction, and multiple layers of cache \cite{radojkovic2021}.
 However, new memory-intensive applications, including \acp{dnn}, have led researchers to reconsider \ac{pim} as a new approach to meet future processing demands.
@@ -31,8 +31,8 @@ This process is illustrated in \cref{img:dnn} where one \ac{dnn} layer is proces
 \end{figure}
 
 Such an operation, defined in the widely used \ac{blas} library \cite{blas1979}, is also known as a \acs{gemv} routine.
-Because one matrix element is only used exactly once in the calculation the output vector, there is no data reuse of the matrix.
-Further, as the weight matrices tend to be too large to fit on the on-chip cache, such a \ac{gemv} operation is deeply memory-bound \cite{he2020}.
+Since a matrix element is used only once in the calculation of the output vector, there is no data reuse of the matrix.
+Moreover, since the weight matrices tend to be too large to fit into the on-chip cache, such a \ac{gemv} operation is deeply memory-bound \cite{he2020}.
 As a result, such an operation is a good fit for \ac{pim}.
 In contrast, a \acs{gemm} \ac{blas} routine, i.e., the multiplication of two matrices, is not such a good candidate for \ac{pim} for two reasons:
 Firstly, \ac{gemm} sees significant data reuse of both matrices as they are repeatedly accessed column-wise or row-wise, rendering the on-chip cache more efficient.
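The GEMV-versus-GEMM argument in this hunk can be made quantitative with a small arithmetic-intensity comparison (a sketch using the standard FLOP-count conventions; the sizes are example values, not from the thesis):

```python
# Why GEMV is memory-bound while GEMM is not: compare FLOPs per data
# element touched (arithmetic intensity). 2*m*n and 2*m*n*k are the
# standard FLOP counts; the matrix sizes below are example values.

def gemv_intensity(m, n):
    flops = 2 * m * n               # one multiply-add per matrix element
    elems = m * n                   # each weight is read exactly once
    return flops / elems            # constant: no data reuse of the matrix

def gemm_intensity(m, n, k):
    flops = 2 * m * n * k
    elems = m * k + k * n + m * n   # A, B and C each touched
    return flops / elems            # grows with the sizes -> caches pay off

print(gemv_intensity(4096, 4096))        # -> 2.0, independent of size
print(gemm_intensity(4096, 4096, 4096))  # far higher (about 2731 here)
```

GEMV stays at a constant 2 FLOPs per element no matter how large the matrix gets, which is why the on-chip cache cannot help and the operation is a good fit for PIM.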
@@ -82,7 +82,7 @@ In the following, three \ac{pim} approaches that place the compute units at the
 
 The first publicly available real-world \ac{pim} architecture has been designed and built by the company UPMEM \cite{gomez-luna2022}.
 UPMEM combines regular DDR4 \ac{dimm} based \ac{dram} with a set of \ac{pim}-enabled UPMEM \acp{dimm} consisting of several \ac{pim} chips.
-In each \ac{pim} chip, there are of 8 \acp{dpu}, each of which has exclusive access to a $\qty{64}{\mebi\byte}$ memory bank, a $\qty{24}{\kibi\byte}$ instruction memory and a $\qty{64}{\kibi\byte}$ scratchpad memory.
+In each \ac{pim} chip, there are eight \acp{dpu}, each of which has exclusive access to a $\qty{64}{\mebi\byte}$ memory bank, a $\qty{24}{\kibi\byte}$ instruction memory and a $\qty{64}{\kibi\byte}$ scratchpad memory.
 The host processor can access the \ac{dpu} memory banks to copy input data from main memory and retrieve results.
 While copying, the data layout must be changed to store the data words contiguously in a \ac{pim} bank, in contrast to the horizontal \ac{dram} mapping used in \ac{dimm} modules, where a data word is split across multiple devices.
 UPMEM provides a \ac{sdk} that orchestrates the data movement from the main memory to the \ac{pim} banks and modifies the data layout without requiring special attention from the developer.
@@ -90,7 +90,7 @@ UPMEM provides a \ac{sdk} that orchestrates the data movement from the main memo
 Each \ac{dpu} is a multithreaded $\qty{32}{bit}$ \ac{risc} core with a full set of general purpose registers and a 14-stage pipeline.
 The \acp{dpu} execute compiled \acs{c} code using a specialized compiler toolchain that provides limited support of the standard library.
 With a system clock of $\qty{400}{\mega\hertz}$, the internal bandwidth of a \ac{dpu} amounts to $\qty[per-mode = symbol]{800}{\mega\byte\per\second}$.
-A system can integrate 128 \acp{dpu} per \ac{dimm}, with a total of 20 UPMEM \acp{dimm}, which gives a maximum theoretical \ac{pim} bandwidth of $\qty[per-mode = symbol]{2}{\tera\byte\per\second}$ \cite{gomez-luna2022}.
+A system can integrate 128 \acp{dpu} per \ac{dimm}, with a total of 20 UPMEM \acp{dimm}, which results in a maximum theoretical \ac{pim} bandwidth of $\qty[per-mode = symbol]{2}{\tera\byte\per\second}$ \cite{gomez-luna2022}.
 
 \subsection{Newton AiM}
 \label{sec:pim_newton}
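A quick check of the aggregate UPMEM bandwidth claim in this hunk, using only the figures stated in the text (800 MB/s per DPU, 128 DPUs per DIMM, 20 DIMMs; the quoted 2 TB/s is this product rounded down):

```python
# Sanity check for the UPMEM bandwidth figures quoted above.
dpu_bw = 800e6            # bytes/s internal bandwidth per DPU
dpus = 128 * 20           # DPUs per DIMM times DIMMs per system

total_bw = dpu_bw * dpus  # aggregate theoretical PIM bandwidth
print(total_bw / 1e12)    # -> 2.048, i.e. roughly 2 TB/s
```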
@@ -137,7 +137,7 @@ Fortunately, as discussed in \cref{sec:hbm}, the architecture of \ac{hbm} allows
 
 At the heart of \aca{fimdram} are the \ac{pim} execution units, each of which is shared by two banks of a \ac{pch}.
 They include 16 16-bit wide \ac{simd} \acp{fpu}, \acp{crf}, \acp{grf} and \acp{srf} \cite{lee2021}.
-This general architecture is shown in detail in \cref{img:fimdram}, with (a) the placement of the \ac{pim} units between the memory banks of a \ac{dram} die, with (b) a bank coupled to its \ac{pim} unit, and (c) the data path in around a \ac{fpu} within the \ac{pim} unit.
+This general architecture is shown in detail in \cref{img:fimdram}, with (a) the placement of the \ac{pim} units between the memory banks of a \ac{dram} die, (b) a bank coupled to its \ac{pim} unit, and (c) the data path of the inputs, outputs, and temporary results within the \ac{pim} unit.
 
 \begin{figure}
 \centering
@@ -162,7 +162,7 @@ As a result, the theoretical internal bandwidth of \aca{fimdram} is $\qty{8}{\ti
 \item \textbf{\Ac{abp} Mode}:
 With another predefined \ac{dram} access sequence, the memory switches to the \ac{pim}-enabled mode.
 In this mode, a single memory access initiates the concurrent execution of the next instruction across all processing units.
-In addition, the \ac{io} circuits of the \ac{dram} are completely disabled in this mode, reducing the power required during \ac{pim} operation.
+In addition, the \ac{io} circuits of the \ac{dram} for the data bus are completely disabled in this mode, which reduces the power consumption during \ac{pim} operation.
 \end{enumerate}
 
 In both \ac{ab} and \ac{abp} mode, the internal bandwidth per \ac{pch} is $\qty{8}{\times}$ higher than the \aca{hbm} bandwidth of $\qty[per-mode=symbol]{16}{\giga\byte\per\second}$, i.e., $\qty[per-mode=symbol]{128}{\giga\byte\per\second}$ per \ac{pch} or $\qty[per-mode=symbol]{2}{\tera\byte\per\second}$ in total for 16 \acp{pch}.
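The $\qty{8}{\times}$ figure in this hunk can be written out explicitly (a worked form of the numbers already stated in the text, nothing new assumed):

```latex
\begin{align*}
B^{\text{PIM}}_{\text{pCH}} &= 8 \times \qty[per-mode=symbol]{16}{\giga\byte\per\second}
  = \qty[per-mode=symbol]{128}{\giga\byte\per\second} \\
B^{\text{PIM}}_{\text{total}} &= 16 \times \qty[per-mode=symbol]{128}{\giga\byte\per\second}
  = \qty[per-mode=symbol]{2048}{\giga\byte\per\second}
  \approx \qty[per-mode=symbol]{2}{\tera\byte\per\second}
\end{align*}
```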
@@ -239,8 +239,8 @@ Another special field \textit{A} enables the \ac{aam}, which will be explained i
 \label{tab:instruction_set}
 \end{table}
 
-The \cref{tab:instruction_set} gives an overview of all available instructions and defines the possible operand sources and destinations.
-It is to note, that some operations do require specifically either a \ac{rd} or a \ac{wr} access to execute properly.
+\Cref{tab:instruction_set} gives an overview of all available instructions and defines the possible operand sources and destinations.
+Note that some operations specifically require either a \ac{rd} or a \ac{wr} access to execute properly.
 For example, to write the resulting output vector from a \ac{grf} to the memory banks, the memory controller must issue a \ac{wr} command to write to the bank.
 Likewise, reading from the banks requires a \ac{rd} command.
 For control-type and arithmetic instructions without the bank as a source operand, either a \ac{rd} or a \ac{wr} can be issued to execute the instruction.
@@ -252,7 +252,8 @@ For the rest of this thesis, it is assumed, that a \ac{rd} is issued for these i
Since the execution of an instruction in the microkernel is initiated by a memory access, the host processor must execute \ac{ld} or \ac{st} instructions in a sequence that perfectly matches the loaded \ac{pim} microkernel.
When an instruction has a bank as its specified source or destination, the addresses of these memory accesses specify the exact row and column where the data should be loaded from or stored to.
This means that the order of the respective memory accesses for such instructions is important: they must not be reordered, as each access must match the corresponding instruction in the microkernel.
For example, as shown in \cref{lst:reorder}, two consecutive \ac{mac} instructions with the memory bank as one of the operand sources already specify the respective register index.
However, they must wait for their actual memory access to get the row and column address of the bank access.

\begin{listing}
\begin{verbatim}
@@ -269,7 +270,7 @@ One solution to this problem would be to introduce memory barriers between each
However, this comes at a significant performance cost and results in memory bandwidth being underutilized because the host processor has to wait for every memory access to complete.
Disabling memory controller reordering completely, on the other hand, interferes with non-\ac{pim} traffic and significantly reduces its performance.

To solve this overhead, Samsung has introduced the \acf{aam} mode for arithmetic instructions.
In the \ac{aam} mode, the register indices of an instruction are ignored and instead decoded from the column and row address of the memory access itself, as demonstrated in \cref{img:aam}.
With this method, the register indices and the bank address cannot get out of sync, as they are tightly coupled, even if the memory controller reorders the accesses.

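The address-to-index coupling of the \ac{aam} mode can be pictured with a minimal sketch. The exact bit fields are not specified here, so the positions and widths below are illustrative assumptions, not the actual \ac{fimdram} encoding:

```python
# Hypothetical sketch of AAM-style index decoding. The chosen bit
# positions (col[2:0] and col[5:3]) are assumptions for illustration,
# not the real FIMDRAM field layout.

def decode_aam_indices(row_addr: int, col_addr: int) -> dict:
    """Derive register indices from the address of the triggering access.

    Because the indices travel with the access itself, a reordered
    access still carries its own, consistent operand indices.
    """
    grf_a_idx = col_addr & 0x7          # assumed: col[2:0] -> GRF-A index
    grf_b_idx = (col_addr >> 3) & 0x7   # assumed: col[5:3] -> GRF-B index
    return {"grf_a": grf_a_idx, "grf_b": grf_b_idx, "row": row_addr}
```

Whatever order the memory controller issues the accesses in, each access decodes to the operand indices that belong to it, which is the property the \ac{aam} mode relies on.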
@@ -328,7 +329,7 @@ Those matrix row blocks possibly span over multiple \ac{dram} rows or even other
% This does not mean that a matrix row must be the same size as a \ac{dram} row, only that the \ac{am} of the memory controller must switch to the next bank after a complete matrix row.
% Once all banks have been accessed, the mapping of the column bits can continue.
Furthermore, the number of columns defines the number of iterations the \ac{mac} core of the microkernel has to perform.
As 16 \ac{fp16} elements each are packed together in a column-major fashion, and the memory controller's \ac{am} switches to the next bank after exactly one burst, the \ac{pim} units each contain 16 different matrix row elements of the same set of 16 matrix columns.
\Cref{img:matrix_layout} gives a complete overview of the layout of the weight matrix in the linear address space and its mapping onto the memory banks.
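The burst-wise bank interleaving can be sketched as a simple address-mapping function. This is a simplified model under stated assumptions: the 16-element burst comes from the text, while the bank count and the plain round-robin rotation are assumptions of the sketch:

```python
# Illustrative model of the described interleaving: consecutive
# 16-element (one burst) chunks of the linear address space rotate
# round-robin across the banks. NUM_BANKS is an assumed value.

BURST_ELEMS = 16   # 16 fp16 elements per burst (from the text)
NUM_BANKS = 16     # assumed bank count for this sketch

def element_location(linear_idx: int) -> tuple[int, int]:
    """Map a linear fp16 element index to (bank, offset within bank)."""
    burst = linear_idx // BURST_ELEMS      # which burst the element is in
    lane = linear_idx % BURST_ELEMS        # position inside the burst
    bank = burst % NUM_BANKS               # AM switches bank every burst
    local_burst = burst // NUM_BANKS       # bursts already stored per bank
    return bank, local_burst * BURST_ELEMS + lane
```

Under this mapping, elements 0 to 15 land in bank 0, elements 16 to 31 in bank 1, and so on, so each \ac{pim} unit ends up with its own slice of matrix rows for the same set of columns.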
Note that the interleaving of \ac{fp16} vectors is very similar to the chunking of the weight matrix of SK Hynix's Newton architecture, as illustrated in \cref{img:hynix}.

@@ -336,7 +337,7 @@ The input vector must adhere also a special memory layout.
Since a vector is essentially a single-column matrix, it is always laid out sequentially in memory.
However, because all processing units must access the same input vector elements at the same time, they must load the respective vector elements into their \ac{grf}-A registers during the initialization phase of the microkernel.
As there is no communication between the banks, every bank needs to have its own copy of the input vector.
Consequently, from the perspective of the linear address space, multiple copies of chunks of the input vector must be interleaved in such a way that the input vector is continuous from the perspective of each bank.
This interleaving is illustrated in \cref{img:input_vector}.

\begin{figure}
@@ -397,7 +398,7 @@ This real system is based on a Xilinx Zynq Ultrascale+ \ac{fpga} that is integra
Results promise performance gains in the range of $\qtyrange{1.4}{11.2}{\times}$ in the tested microbenchmarks, with the highest gain of $\qty{11.2}{\times}$ for a \ac{gemv} kernel.
Real layers of \acp{dnn} achieved a performance gain in the range of $\qtyrange{1.4}{3.5}{\times}$.

The \aca{fimdram} dies themselves consume $\qty{5.4}{\percent}$ more power than regular \aca{hbm} dies.
However, the increased processing bandwidth and the reduced power consumption on the global \ac{io}-bus led to an $\qty{8.25}{\percent}$ higher energy efficiency for a \ac{gemv} kernel, and a $\qtyrange{1.38}{3.2}{\times}$ higher efficiency for real \ac{dnn} layers.

In conclusion, \aca{fimdram} is one of the few real \ac{pim} implementations by hardware vendors at this time and promises significant performance gains and higher power efficiency compared to regular \aca{hbm} \ac{dram}.