Implementation of the virtual machine
@@ -5,5 +5,6 @@ bundle = "https://data1.fullyjustified.net/tlextras-2022.0r0.tar"
[[output]]
name = "doc"
type = "pdf"
shell_escape = true
preamble = ""
postamble = ""
src/abstract.tex (new file, 12 additions)
@@ -0,0 +1,12 @@
\begin{abstract}
\section*{Abstract}



\vspace{1.0cm}

\section*{Zusammenfassung}



\end{abstract}
@@ -272,3 +272,19 @@
short = API,
long = application programming interface,
}
\DeclareAcronym{json}{
short = JSON,
long = JavaScript Object Notation,
}
\DeclareAcronym{sb}{
short = SB,
long = Single-Bank,
}
\DeclareAcronym{ab}{
short = AB,
long = All-Bank,
}
\DeclareAcronym{abp}{
short = AB-PIM,
long = All-Bank-PIM,
}
@@ -93,7 +93,7 @@ One of these is device-based \ac{dram}, where the memory devices are directly so
Another type is 2.5D-integrated \ac{dram}, where multiple memory dies are stacked on top of each other and connected to the \ac{mpsoc} by a silicon interposer \cite{jung2017a}.
Such a 2.5D-integrated type used in \acp{gpu} and \acp{tpu} is \ac{hbm}, which will be introduced in greater detail in the following section.

\subsection{\Acl{hbm}}
\label{sec:hbm}

\Aca{hbm} is a \ac{dram} standard that was defined by \ac{jedec} in 2016 as a successor of the previous \ac{hbm} standard \cite{jedec2015a}.
@@ -1,4 +1,65 @@
\subsection{Virtual Machine}
\label{sec:vm}

To implement \aca{fimdram} in \aca{hbm}, the \ac{dram} model of DRAMSys has to be extended to incorporate the processing units in the \acp{pch} of the \ac{pim}-activated channels and to provide them with the burst data from the \acp{ssa} as well as the burst address to calculate the register indices in the \ac{aam} operation mode.
However, no changes are required in the frontend or backend of DRAMSys and, as already described in \cref{sec:pim_fim}, the memory controller also remains unchanged.
In addition, since a single \ac{dram} \ac{rd} or \ac{wr} command triggers the execution of a single microkernel instruction, the processing unit is fully synchronized with the read and write operations of the \ac{dram}.
As a result, the \aca{fimdram} model itself does not need to model any timing behavior: its submodel is essentially untimed, since it is already synchronized with the operation of the \ac{dram} model of DRAMSys.
This leads to a significantly simplified model, since the internal pipeline stages of \aca{fimdram} do not need to be modeled, only the externally visible functional behavior of a processing unit.

While \aca{fimdram} is in the default \ac{sb} mode, it behaves exactly like normal \aca{hbm} memory.
Only when the host initiates a mode switch of one of the \ac{pim}-enabled \acp{pch} do the processing units become active.
As already described in \cref{sec:pim_architecture}, \aca{fimdram} expects certain sequences of \ac{act} and \ac{pre} commands to initiate a mode transition.
Unfortunately, Samsung did not specify this mechanism in further detail, so the actual implementation of the mode switching in the \aca{fimdram} model has been simplified to a \ac{json}-based communication protocol to achieve maximum flexibility and debuggability from a development perspective.
In this mechanism, the host processor builds \ac{json} messages at runtime and writes their raw serialized string representation to a predefined location in memory.
The \ac{dram} model then inspects incoming \ac{wr} commands in this memory region and deserializes the content of these memory accesses to reconstruct the host's message.
As a downside of this method, the actual mode switching cannot be simulated with accurate timing, as a \ac{json} message might be composed of more than one memory packet.
With more information from Samsung on how the actual mechanism is implemented, this implementation can be trivially switched over to it at a later date.
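Since no concrete wire format is prescribed, the reassembly of a message that spans multiple write packets can be sketched as follows. This is a minimal illustration, not the actual DRAMSys code: the type name, the zero-padding handling, and the brace-balance completeness check are assumptions (brace counting would, for instance, miscount braces inside string literals).

\begin{minted}{rust}
/// Hypothetical buffer that reassembles a serialized JSON message from
/// the payloads of WR commands targeting the predefined memory region.
struct MessageBuffer {
    bytes: Vec<u8>,
}

impl MessageBuffer {
    fn new() -> Self {
        Self { bytes: Vec::new() }
    }

    /// Called for every WR command hitting the mailbox region.
    /// Returns the reconstructed message once it is complete.
    fn push_packet(&mut self, payload: &[u8]) -> Option<String> {
        // Ignore zero padding at the end of a burst.
        self.bytes
            .extend(payload.iter().copied().take_while(|&b| b != 0));
        // Assume a message is complete when all braces are balanced.
        let mut depth = 0i32;
        for &b in &self.bytes {
            match b {
                b'{' => depth += 1,
                b'}' => depth -= 1,
                _ => {}
            }
        }
        if depth == 0 && self.bytes.first() == Some(&b'{') {
            String::from_utf8(std::mem::take(&mut self.bytes)).ok()
        } else {
            None
        }
    }
}
\end{minted}

A message split across two write packets is then returned only once the final packet has arrived.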

When entering \ac{ab} mode, the \ac{dram} model ignores the specific bank address of incoming \ac{wr} commands and internally performs the write operation for either all even or all odd banks of the \ac{pch}, depending on the parity of the original bank index.
This mode can be used by the host to initialize the input vector chunk interleaving as described in \cref{sec:memory_layout}, or to initialize the \ac{crf} of the processing unit with the microkernel, which should be the same for all operating banks.
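The parity-based write fan-out described above can be sketched as follows; the bank representation and function signature are simplifying assumptions, not the model's actual types.

\begin{minted}{rust}
const BANKS_PER_PCH: usize = 16;

/// Sketch of an AB-mode write: only the parity of the bank address is
/// used, and the write is mirrored to all banks of that parity.
fn ab_mode_write(
    banks: &mut [Vec<u8>; BANKS_PER_PCH],
    bank_addr: usize,
    offset: usize,
    data: &[u8],
) {
    let parity = bank_addr % 2;
    for (i, bank) in banks.iter_mut().enumerate() {
        // Mirror the write to every even or every odd bank.
        if i % 2 == parity {
            bank[offset..offset + data.len()].copy_from_slice(data);
        }
    }
}
\end{minted}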

After the transition to \ac{ab} mode, the \ac{dram} can further transition to the \ac{ab}-\ac{pim} mode, which allows the execution of instructions in the processing units.
The \ac{abp} mode is similar to the \ac{ab} mode in that it also ignores the concrete bank address except for its parity, while additionally passing the column and row address and, in the case of a read, also the respective fetched bank data to the processing units.
In the case of a write access, the output of the processing unit is written directly into the corresponding bank, ignoring the actual data of the transaction object.
This is equivalent to the real \aca{fimdram} implementation, where the global \ac{io} bus of the memory is not actually driven, and all data movement is done internally in the banks.

So far, only the additional infrastructure in the \ac{dram} model of DRAMSys and the integration of the processing units have been described.
The implementation of the processing units themselves follows now.
The internal state of a processing unit consists of the \ac{grf} register files \ac{grf}-A and \ac{grf}-B, the \ac{srf} register files \ac{srf}-A and \ac{srf}-M, the program counter, and a jump counter that keeps track of the current iteration of a JUMP instruction.
As a simplification of the model, the \acp{crf} are not stored in each \ac{pim} unit, but once globally for each \ac{pch}.
Functionally, this does not change the behavior of the system, assuming that each processing unit is programmed with the same microkernel, which is the case for all the programs examined in this thesis.

Depending on a \ac{rd} or \ac{wr} command, either the method \mint{rust}{execute_read(address: u64, bank_data: &[u8])} or the method \mint{rust}{execute_write() -> [u8; 32]} is called on the instance of a \ac{pim} unit.
The most important difference between these two methods is their signatures.
While the former takes the address and the bank data to be read as input, the latter only outputs the bank data of the size of a full burst to be written into the respective bank.
However, both methods execute an instruction in the \ac{crf} and increment the program counter of the corresponding \ac{pim} unit.
The \texttt{execute\_read} method begins by calculating the register indices used in \ac{aam} mode, followed by a branch table that dispatches to the handler of the current instruction.
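The overall structure of this dispatch can be sketched as follows. Only the shape (index calculation, branch table, program counter increment) mirrors the description above; the opcode set follows \cref{tab:isa}, but the address-bit mapping to \ac{aam} indices, the field names, and the register-file sizes are illustrative assumptions, and the arithmetic handlers are omitted.

\begin{minted}{rust}
#[derive(Clone, Copy)]
enum Opcode { Exit, Mov, Fill, Add, Mul, Mac, Mad, Jump }

struct PimUnit {
    grf_a: [[f32; 16]; 8],
    pc: usize,
}

impl PimUnit {
    /// EXIT: reset the internal state to its default configuration.
    fn reset(&mut self) {
        self.grf_a = [[0.0; 16]; 8];
        self.pc = 0;
    }

    fn execute_read(&mut self, address: u64, _bank_data: &[u8], crf: &[Opcode]) {
        // Hypothetical AAM index derivation from address bits; the real
        // mapping of row/column bits to register indices is not shown here.
        let aam_index = ((address >> 5) & 0x7) as usize;
        match crf[self.pc] {
            Opcode::Exit => {
                self.reset();
                return;
            }
            Opcode::Mov | Opcode::Fill => { /* simple move handlers */ }
            Opcode::Add | Opcode::Mul | Opcode::Mac | Opcode::Mad => {
                // AAM-selected destination register (handlers omitted).
                let _dst = &mut self.grf_a[aam_index];
            }
            Opcode::Jump => { /* zero-cycle, resolved at end of step */ }
        }
        self.pc += 1;
    }
}
\end{minted}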
In the case of the EXIT control instruction, the internal state of the processing unit is reset to its default configuration.
The data movement instructions MOV and FILL both perform only a simple move operation that loads the value of one register or the bank data and assigns it to the destination register.
A more complex implementation is required for the four arithmetic instructions ADD, MUL, MAC and MAD:
Depending on the \ac{aam} flag set in the instruction format, as seen in \cref{tab:isa}, either the indices set by the instruction itself are used, or the ones previously calculated from the row and column address of the memory access.
In the case of the simple ADD and MUL instructions, the operand data is then fetched from the respective sources.
The MAC and MAD instructions differ in the sense that they require a total of three input operands, one of which may be the destination register in the case of MAC.
In the first step, the multiplication of the first two input operands is performed in the same way as in MUL.
Then, this temporary product is added to the third source register as in ADD.
Finally, this sum is written to the destination register.
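Element-wise, the three MAC steps described above amount to a fused multiply-add per lane. A minimal sketch, with the caveat that Rust has no stable native 16-bit float type, so \texttt{f32} stands in for the 16-wide \ac{fp16} vectors:

\begin{minted}{rust}
/// MAC sketch: multiply the first two operands (as in MUL), add the
/// third (as in ADD), and write the sum to the destination register.
/// f32 is a stand-in for the FP16 lanes of the real hardware.
fn mac(dst: &mut [f32; 16], src0: &[f32; 16], src1: &[f32; 16], src2: &[f32; 16]) {
    for i in 0..16 {
        dst[i] = src0[i] * src1[i] + src2[i];
    }
}
\end{minted}

Note that this operates on all 16 lanes independently; as stated above, no horizontal reduction of the vector takes place inside the unit.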
Note that while the MAC instruction can iteratively accumulate into the same destination register, it does not reduce the 16-wide \ac{fp16} vector itself in any way.
As already seen in \cref{sec:memory_layout}, the host processor is responsible for reducing these 16 floating-point numbers to one.
After the execution of one instruction, the program counter is incremented.
One special instruction, the JUMP instruction, is processed at the end of an execution step.
The JUMP instruction is a zero-cycle instruction, i.e. it is not triggered by a \ac{rd} command like a regular instruction.
Instead, the jump offset and iteration count are resolved statically at the end of a regular instruction.
Depending on the jump counter of the processing unit, the counter is either initialized with the jump count specified in the instruction, or it is decremented by one.
If the new jump counter has not reached zero, the jump to the target instruction at the given offset is performed.
Otherwise, execution continues sequentially.
This implementation only works for non-nested JUMP instructions, as each level of nesting would require its own jump counter.
From the information provided by Samsung, it is not clear whether nested JUMP instructions are implemented in \aca{fimdram}.
However, none of the microkernels examined in this thesis use nested jumps.
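The jump handling described above can be condensed into a small state machine; the field and parameter names are illustrative, but the logic follows the text: initialize the counter on first encounter, decrement on subsequent ones, and take the backward jump until the counter reaches zero.

\begin{minted}{rust}
/// Jump state of a processing unit; `None` means no JUMP loop is active.
struct JumpState {
    jump_counter: Option<u32>,
}

/// Resolve a JUMP at the end of a regular instruction and return the
/// next program counter value. `offset` is the backward jump distance.
fn resolve_jump(state: &mut JumpState, pc: usize, offset: usize, count: u32) -> usize {
    let counter = match state.jump_counter {
        None => count,    // first encounter: initialize with the jump count
        Some(c) => c - 1, // subsequent encounters: decrement by one
    };
    if counter > 0 {
        state.jump_counter = Some(counter);
        pc - offset // jump not yet exhausted: take the backward jump
    } else {
        state.jump_counter = None; // loop finished: fall through
        pc + 1
    }
}
\end{minted}

A single `Option<u32>` suffices precisely because jumps are assumed to be non-nested, matching the limitation stated above.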

As already seen in \cref{tab:instruction_set}, only the FILL instruction supports writing to the memory bank.
Therefore, it is the only instruction implemented in the \texttt{execute\_write} method.
It is semantically identical to the implementation in \texttt{execute\_read}, except that the moved data is not assigned to a register file, but returned by the method to be written into the memory bank.
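A sketch of this write path, under the same simplifications as before (\texttt{f32} lanes standing in for \ac{fp16}, and a placeholder truncation instead of a real \ac{fp16} bit conversion):

\begin{minted}{rust}
/// FILL on the write path: instead of assigning to a register file,
/// the selected register contents are returned as the 32-byte burst
/// to be written into the bank. The per-lane conversion below is a
/// placeholder, not a real FP16 encoding.
fn execute_write_fill(src: &[f32; 16]) -> [u8; 32] {
    let mut burst = [0u8; 32];
    for (i, &v) in src.iter().enumerate() {
        let half = (v as u16).to_le_bytes(); // placeholder 16-bit lane
        burst[2 * i..2 * i + 2].copy_from_slice(&half);
    }
    burst
}
\end{minted}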

With this implementation of the processing units, it is now possible to write a user program that controls the execution of \ac{pim} operations directly in the memory model of DRAMSys.
The next section introduces the support library that interacts with the \ac{pim} units at a low level and allows the user to take advantage of \aca{fimdram}.
@@ -122,7 +122,7 @@ Finally, the host reads the result latches from all banks at the same time and c
Overall, Newton completes the arithmetic operations of a row in all banks in the time it takes a conventional DRAM to read a row from one bank \cite{he2020}.
As a result, Newton promises a $\qtyrange{10}{54}{\times}$ speedup compared to a theoretical non-\ac{pim} system with infinite computation, which is completely limited by the available memory bandwidth.

\subsection{\Acl{fimdram}}
\label{sec:pim_fim}

One year after SK Hynix, the major \ac{dram} manufacturer Samsung announced its own \ac{pim} \ac{dram} implementation, called \acf{fimdram}.
@@ -130,6 +130,7 @@ As this is the \ac{pim} architecture which was implemented during the work on th
The following subsections are mainly based on \cite{lee2021} and \cite{kwon2021}, with \cref{sec:memory_layout} being mainly based on \cite{kang2022}.

\subsubsection{Architecture}
\label{sec:pim_architecture}
As the name of \aca{fimdram} suggests, it is based on the \aca{hbm} memory standard, and it integrates 16-wide \ac{simd} engines directly into the memory banks, exploiting bank-level parallelism while retaining the highly optimized \acp{subarray} \cite{kwon2021}.
A major difference from Newton \ac{pim} is that \aca{fimdram} does not require any changes to components of modern processors, such as the memory controller, i.e. it is agnostic to existing \aca{hbm} platforms.
Consequently, mode switching is required for \aca{fimdram}, making it less useful for interleaved \ac{pim} and non-\ac{pim} traffic.
@@ -149,29 +150,29 @@ This general architecture is shown in detail in \cref{img:fimdram}, with (a) the
As can be seen in (c), the input data to the \ac{fpu} can either come directly from the memory bank, from a \ac{grf}/\ac{srf}, or from the result bus of a previous computation.
The 16-wide \ac{simd} units correspond to the 256-bit prefetch architecture of \aca{hbm}, where 16 16-bit floating-point operands are passed directly from the \acp{ssa} to the \acp{fpu} in a single memory access.
As all \ac{pim} units operate in parallel, with 16 banks per \ac{pch}, a single memory access loads a total of $16 \times \qty{256}{\bit} = \qty{4096}{\bit}$ into the \acp{fpu}.
As a result, the theoretical internal bandwidth of \aca{fimdram} is $\qty{16}{\times}$ higher than the external bus bandwidth to the host processor.

\Ac{hbm}-\ac{pim} defines three operating modes:
\begin{enumerate}
\item \textbf{\Ac{sb} Mode}:
This is the default operating mode, where \aca{fimdram} has identical behavior to normal \aca{hbm} memory.
To switch to another mode, a specific sequence of \ac{act} and \ac{pre} commands must be sent by the memory controller to a specific row address.
\item \textbf{\Ac{ab} Mode}:
The \ac{ab} mode is an extension of the \ac{sb} mode where the \ac{pim} execution units allow for concurrent access to half of the \ac{dram} banks at the same time.
This provides $\qty{8}{\times}$ more bandwidth than the standard operation mode, which can be used for the initialization of memory regions across all banks.
\item \textbf{\Ac{abp} Mode}:
With another predefined \ac{dram} access sequence, the memory switches to the \ac{pim}-enabled mode.
In this mode, a single memory access initiates the concurrent execution of the next instruction across all processing units.
In addition, the \ac{io} circuits of the \ac{dram} are completely disabled in this mode, reducing the power required during \ac{pim} operation.
\end{enumerate}

In both \ac{ab} mode and \ac{ab}-\ac{pim} mode, the \aca{hbm} bandwidth per \ac{pch} increases $\qty{8}{\times}$ from $\qty[per-mode=symbol]{16}{\giga\byte\per\second}$ to $\qty[per-mode=symbol]{128}{\giga\byte\per\second}$, or $\qty[per-mode=symbol]{2}{\tera\byte\per\second}$ in total for 16 \acp{pch}.

\subsubsection{Processing Unit}

Due to the focus on \ac{dnn} applications in \aca{fimdram}, the native data type for the \acp{fpu} is \ac{fp16}, which is motivated by the significantly lower area and power requirements for \acp{fpu} compared to \ac{fp32}.
In addition, \ac{fp16} is well-supported on modern processor architectures such as ARMv8, which not only include \ac{fp16} \acp{fpu} themselves, but also support \ac{simd} operations using special vector registers.
The \ac{simd} \ac{fpu} of the processing units is implemented once as a \ac{fp16} multiplier unit, and once as a \ac{fp16} adder unit, providing support for these basic arithmetic operations.
In addition to the \acp{fpu}, a processing unit also consists of \acp{crf}, \acp{srf} and \acp{grf}.
The \ac{crf} acts as an instruction buffer, holding the 32 32-bit instructions to be executed by the processor when performing a memory access.
One program that is stored in the \ac{crf} is called a \textit{microkernel}.
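The bandwidth figures quoted above can be cross-checked with a few lines of arithmetic; the constants are taken from the text, and the function names are purely illustrative.

\begin{minted}{rust}
const BANKS_PER_PCH: u64 = 16;
const PREFETCH_BITS: u64 = 256;

/// Bits delivered to the FPUs by one all-bank memory access.
fn internal_bits_per_access() -> u64 {
    BANKS_PER_PCH * PREFETCH_BITS
}

/// Aggregate PIM bandwidth in GB/s for `pchs` pseudo channels,
/// given the per-pseudo-channel baseline and the 8x mode speedup.
fn total_pim_bandwidth_gb_s(per_pch_gb_s: u64, speedup: u64, pchs: u64) -> u64 {
    per_pch_gb_s * speedup * pchs
}
\end{minted}

This reproduces the 4096 bit per access and the aggregate 2 TB/s (2048 GB/s) stated in the text.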
@@ -340,8 +341,8 @@ This interleaving is illustrated in \cref{img:input_vector}.
\label{img:input_vector}
\end{figure}

To initialize the input vector in this way, the host processor can use \ac{ab} mode.
From the processor's point of view, only the first bank is initialized, but the \ac{ab} mode ensures that the same data is written to all banks at the same time.

An example with a weight matrix of dimensions (128,8), an input vector of size (128), and an output vector of size (8) will be analyzed in the following to describe how the processing units execute a \ac{gemv} microkernel.
With the processing unit \textit{i}, the number of iterations \textit{j}, the input vector \textit{a} and the weight matrix \textit{w}, the partial sum $psum[i,0:15]$ is calculated as follows:
@@ -25,6 +25,7 @@
\usepackage{mathdots}
\usepackage{tabularray}
\usepackage{makecell}
\usepackage{minted}

% Configurations
\usetikzlibrary{matrix}
@@ -63,6 +64,7 @@
% Title page
\include{titlepage}
\include{statement}
\include{abstract}

% Table of contents
\tableofcontents