From a49d409d4cbafa74b86172c4ca566e5c3725786c Mon Sep 17 00:00:00 2001 From: Derek Christ Date: Sat, 17 Feb 2024 13:02:14 +0100 Subject: [PATCH] Fixes in PIM and VP chapter --- src/chapters/conclusion.tex | 2 +- src/chapters/implementation.tex | 4 +- src/chapters/implementation/kernel.tex | 12 ++-- src/chapters/implementation/vm.tex | 4 +- src/chapters/pim.tex | 81 +++++++++++++------------- src/chapters/vp.tex | 10 ++-- 6 files changed, 56 insertions(+), 57 deletions(-) diff --git a/src/chapters/conclusion.tex b/src/chapters/conclusion.tex index fa9dca2..f3c2399 100644 --- a/src/chapters/conclusion.tex +++ b/src/chapters/conclusion.tex @@ -8,7 +8,7 @@ A working \ac{vp} of \aca{fimdram}, in the form of a software model, was develop It was found that, ... (TODO: hier Ergebnisse). However, there is still room for improvement in the software model or the comparison methodology, which will be the subject of future work. -First, the developed software library and the implemented model are not yet a drop-in replacement for the real \aca{fimdram} implementation due to the custom communication protocol between the host processor and the \ac{pim} processing units to implement the mode-switching and transferring of the microkernels. +Firstly, the developed software library and the implemented model are not yet a drop-in replacement for the real \aca{fimdram} implementation due to the custom communication protocol between the host processor and the \ac{pim} processing units to implement the mode switching and transferring of the microkernels. For this, more detailed information is required from Samsung, as the exact interface of \aca{fimdram} is not described in the published papers \cite{kwon2021}, \cite{lee2021} and \cite{kang2022}. 
To ease the currently error-prone microkernel development process, the software library could help the developer by providing building blocks that assemble the microkernel and simultaneously generate the necessary \ac{ld} and \ac{st} instructions to execute the kernel. In addition, the current bare-metal deployment of the software cannot realistically be used to accelerate real-world \ac{dnn} applications. diff --git a/src/chapters/implementation.tex b/src/chapters/implementation.tex index e312224..b856407 100644 --- a/src/chapters/implementation.tex +++ b/src/chapters/implementation.tex @@ -2,8 +2,8 @@ \label{sec:implementation} The implementation of the \aca{fimdram} model is divided into three distinct parts: -First, the processing units in the \acp{pch} of \aca{hbm} are integrated into the \ac{dram} model of DRAMSys. -Second, a software library that uses the \ac{pim} implementation provides a \ac{api} to take advantage of in-memory processing from a user application. +Firstly, the processing units in the \acp{pch} of \aca{hbm} are integrated into the \ac{dram} model of DRAMSys. +Secondly, a software library that uses the \ac{pim} implementation provides an \ac{api} to take advantage of in-memory processing from a user application. Finally, the software library is used in a gem5-based bare-metal kernel to perform \ac{pim} operations. \input{chapters/implementation/vm} diff --git a/src/chapters/implementation/kernel.tex b/src/chapters/implementation/kernel.tex index 375fb05..5287650 100644 --- a/src/chapters/implementation/kernel.tex +++ b/src/chapters/implementation/kernel.tex @@ -28,7 +28,7 @@ Running a gem5 simulation requires writing a Python script, that sets up all sys Recently, gem5 deprecated a commonly used prebuilt script called \texttt{fs.py} in favor of its new standard library, which provides useful abstractions over common system components, making it easier to build complex systems in a flexible way without having to dive into great detail.
This standard library greatly simplifies the process of building a system with, for example, an accurate timing or out-of-order processor, a multi-level cache hierarchy, a memory crossbar, and a \ac{dram} model. However, as of writing this thesis, gem5 does not provide a board abstraction suitable for bare-metal workloads. -Therefore, it was necessary to modify the provided ARM board for full system Linux simulations and simplify it in such a way, so that no disk image is required, i.e. the board only boots the provided operating system kernel. +Therefore, it was necessary to modify the provided ARM board for full system Linux simulations and simplify it such that no disk image is required, i.e., the board only boots the provided operating system kernel. \subsubsection{Boot Code} At startup on an ARM processor, the reset handler cannot directly dispatch the \texttt{main} function to the application. @@ -40,8 +40,8 @@ During the initialization phase, gem5 ensures that the boot code \texttt{.init} The linker script also maps the \texttt{.text}, the \texttt{.data}, the \texttt{.rodata} and the \texttt{.bss} sections into the \ac{dram} region. Furthermore, it reserves space for the stack on the \ac{dram} and maps two special \aca{fimdram} regions: -First, the config region, where the processor writes the \ac{json} messages that switch the execution mode of the \ac{pim} units or transfer the microkernel. -Second, a large \ac{pim} region where all allocated arrays, vectors, and matrices are placed for the processing units to operate on. +Firstly, the config region, where the processor writes the \ac{json} messages that switch the execution mode of the \ac{pim} units or transfer the microkernel. +Secondly, a large \ac{pim} region where all allocated arrays, vectors, and matrices are placed for the processing units to operate on.
This segmentation of the \ac{dram} region is important because otherwise no memory access would be possible during \ac{ab} or \ac{abp} mode to fetch instruction data or store stack variables. Consequently, the default memory region and the \ac{pim} memory region are located on different \acp{pch} to guarantee this independence from each other. @@ -83,7 +83,7 @@ The \ac{uart} device model in gem5 then redirects the written messages either to Further, the bare-metal environment does not support any heap allocation without the kernel explicitly implementing it. During development of the custom kernel, it was found that the stack is not suitable for storing the large \ac{pim} arrays for two reasons: -First, as the \ac{pim} arrays become very large with high matrix dimension numbers and may not fit in the preallocated stack region. +Firstly, the \ac{pim} arrays become very large for high matrix dimensions and may not fit in the preallocated stack region. Secondly, and most importantly, because the stack resides in the normal, cacheable \ac{dram} region, it cannot be used to store the \ac{pim}-enabled data structures. As an alternative, it would be possible to also preallocate the \ac{pim} data structures in the \ac{pim} \ac{dram} region by instructing the linker to place these structures in a special section of the \ac{elf} file, and mapping that section to the \ac{pim}-enabled \acp{pch}. However, this approach is very inflexible, as the exact dimensions of the matrices would have to be known at compile time. @@ -181,7 +181,7 @@ EXIT \label{lst:gemv_microkernel} \end{listing} -First, the input vector is loaded into all eight \ac{grf}-A registers, followed by the \ac{mac} core, which iteratively multiplies chunks of a matrix row with the input vector chunks and stores them in the first \ac{grf}-B register.
+Firstly, the input vector is loaded into all eight \ac{grf}-A registers, followed by the \ac{mac} core, which iteratively multiplies chunks of a matrix row with the input vector chunks and stores them in the first \ac{grf}-B register. Then, the FILL instruction writes the computed partial sum into the memory bank, followed by an EXIT instruction that resets the processing units to a defined state. Note that even though the microkernel consists of only 12 instructions, the host processor has to send a total of 36 memory requests to the memory. This is because, on the one hand, the JUMP instruction is not executed itself but repeats the previous instruction 7 times, and, on the other hand, the memory requests have to be sent to both \acp{pch}, which effectively executes the microkernel twice. @@ -197,5 +197,5 @@ By using special instructions that the processor model interprets, it is possibl Another option is to generate memory accesses at special predefined addresses, which the processor then interprets in a certain way. These special instructions or memory accesses for exiting the simulation, resetting the statistics, and dumping the statistics are then inserted into the kernel as follows: Before executing the microkernel of a benchmark, the simulation statistics are reset, while after execution they are explicitly dumped, measuring only the execution of the microkernel. -To compare the use of \aca{fimdram} with conventional matrix operations on the host processor, only the computation itself, i.e. the core, is measured, not the initialization. +To compare the use of \aca{fimdram} with conventional matrix operations on the host processor, only the computation itself, i.e., the core, is measured, not the initialization. This provides a fair basis for comparison and allows a number of comparative simulations to be performed.
diff --git a/src/chapters/implementation/vm.tex b/src/chapters/implementation/vm.tex index d22cfd5..4ae3afe 100644 --- a/src/chapters/implementation/vm.tex +++ b/src/chapters/implementation/vm.tex @@ -14,7 +14,7 @@ As already described in \cref{sec:pim_architecture}, \aca{fimdram} expects certa Unfortunately, Samsung did not specify this mechanism in any more detail than that, so the actual implementation of the mode switching in the \aca{fimdram} model has been simplified to a \ac{json}-based communication protocol, to achieve maximum flexibility and debuggability from a development perspective. In this mechanism, the host processor builds \ac{json} messages at runtime and writes their raw serialized string representation to a pre-defined location in memory. The \ac{dram} model then inspects incoming \ac{wr} commands in this memory region and deserializes the content of these memory accesses to reconstruct the message of the host. -As a downside of this method, the actual mode switching can not be simulated with accurate timing, as a \ac{json} message might be composed of more than one memory packet. +As a downside of this method, the actual mode switching cannot be simulated with accurate timing, as a \ac{json} message might be composed of more than one memory packet. With more information from Samsung on how the actual mechanism is implemented, this implementation can be trivially switched over to it at a later date. When entering \ac{ab} mode, the \ac{dram} model ignores the specific bank address of incoming \ac{wr} commands and internally performs the write operation for either all even or all odd banks of the \ac{pch}, depending on the parity of the original bank index. @@ -50,7 +50,7 @@ Note that while the MAC instruction can iteratively add to the same destination As already seen in \cref{sec:memory_layout}, the host processor is responsible for reducing these 16 floating point numbers to one.
After the execution of one instruction, the program counter is incremented. One special instruction, the JUMP instruction, is processed at the end of an execution step. -The JUMP instruction is a zero-cycle instruction, i.e. it is not actually executed normally by triggering it with a \ac{rd} command. +The JUMP instruction is a zero-cycle instruction, i.e., unlike regular instructions, it is not executed by triggering it with a \ac{rd} command. Instead, the jump offset and iteration count are resolved statically at the end of a regular instruction. Depending on the jump counter of the processing unit, the counter is either initialized with the jump count specified in the instruction, or it is decremented by one. If the new jump counter has not reached zero, the jump to the offset instruction will be performed. diff --git a/src/chapters/pim.tex b/src/chapters/pim.tex index 69fafe5..c01dad3 100644 --- a/src/chapters/pim.tex +++ b/src/chapters/pim.tex @@ -4,12 +4,12 @@ In the conventional von Neumann architecture, compute is completely separated from memory. Memory-intensive workloads operate on a large data set, have poor spatial and temporal locality, and low operational density. As a consequence, the data movement between the memory and compute forms the so-called von Neumann bottleneck \cite{zou2021}. -In the past, this bottleneck was hidden using latency hiding techniques such as out-of-order execution, branch prediction, and multiple layers of cache \cite{radojkovic2021}. +In the past, this bottleneck was mitigated using latency-hiding techniques such as out-of-order execution, branch prediction, and multiple layers of cache \cite{radojkovic2021}. However, new memory-intensive applications, including \acp{dnn}, have led researchers to reconsider \ac{pim} as a new approach to meet future processing demands.
-First proposals for \ac{pim} date back to the 1970s, were hindered by the limitations of existing memory systems, but are now experiencing a renaissance \cite{radojkovic2021,ghose2019a}. +First proposals for \ac{pim} date back to the 1970s and were hindered by the limitations of existing memory systems, but are now experiencing a renaissance \cite{radojkovic2021,ghose2019a}. In the following, the workloads suitable for \ac{pim} will be discussed in more detail, followed by an overview of the different types of \ac{pim} implementations. -Finally, a number of concrete examples are presented. +Finally, a number of concrete implementation examples are presented. \subsection{Applicable Workloads} \label{sec:pim_workloads} @@ -34,7 +34,7 @@ Such an operation, defined in the widely used \ac{blas} library \cite{blas1979}, Because one matrix element is used exactly once in the calculation of the output vector, there is no data reuse of the matrix. Further, as the weight matrices tend to be too large to fit in the on-chip cache, such a \ac{gemv} operation is deeply memory-bound \cite{he2020}. As a result, such an operation is a good fit for \ac{pim}. -In contrast, a \acs{gemm} \ac{blas} routine, i.e., the multiplication of two matrices, is not such a good candidate for \ac{pim} for two reasons. +In contrast, a \acs{gemm} \ac{blas} routine, i.e., the multiplication of two matrices, is not such a good candidate for \ac{pim} for two reasons: Firstly, \ac{gemm} sees significant data reuse of both matrices as they are repeatedly accessed column-wise or row-wise, rendering the on-chip cache more efficient. Secondly, \ac{pim} comes with the further limitation that it can only accelerate two-input-one-output operations, where one operand is significantly larger than the other, as the computation of \ac{pim} can only be close to one of the operands, resulting in extensive data movement of the other operand \cite{he2020}.
@@ -42,7 +42,7 @@ Secondly, \ac{pim} comes with the further limitation that it can only accelerate \label{sec:pim_architectures} Many different \ac{pim} architectures have been proposed by researchers in the past, and more recently, real implementations have been presented by hardware vendors. -These proposals differ largely in the positioning of the processing operation applied, ranging from analogue distribution of capacitor charges at the \ac{subarray} level to additional processing units at the global \ac{io} level. +These proposals differ largely in the positioning of the processing operation applied, ranging from the analog distribution of capacitor charges at the \ac{subarray} level to additional processing units at the global \ac{io} level. In essence, these placements of the approaches can be summarized as follows \cite{sudarshan2022}: \begin{enumerate} @@ -58,24 +58,24 @@ Only when the compute units are placed within the bank region, the full bank par Outside the bank region, the data retrieval is limited by the narrow memory bus. On the other hand, the integration of the \ac{pim} units inside the bank becomes more difficult as area and power constraints limit the integration \cite{sudarshan2022}. -Processing inside the \ac{subarray} has the highest achievable level of parallelism, with the number of operand bits equal to the size of the row. +Processing \textbf{inside the \ac{subarray}} has the highest achievable level of parallelism, with the number of operand bits equal to the size of the row. It also requires the least amount of energy to load the data from the \acs{subarray} into the \acp{psa} to perform operations on it. The downside of this approach is the need to modify the highly optimized \ac{subarray} architecture. An example of such an approach is Ambit \cite{seshadri2020}. Ambit provides a mechanism to activate multiple rows within a \ac{subarray} at once and perform bulk bitwise operations such as AND, OR and NOT on the row data.
-Far fewer, but still challenging, constraints are placed on the integration of compute units in the region of the \acp{psa}. -\cite{sudarshan2022a} presents a two-stage design that integrates current mirror-based analogue units near the \ac{subarray} that enable \ac{mac} operations used in \ac{dnn} applications possible. +Far fewer, but still challenging, constraints are placed on the integration of compute units \textbf{in the region of the \acp{psa}}. +The approach presented in \cite{sudarshan2022a} consists of a two-stage design that integrates current-mirror-based analog units near the \ac{subarray} that enable \ac{mac} operations used in \ac{dnn} applications. -The integration of compute units in the \ac{io} region of the bank allows for area intensive operations such as ADD, \ac{mac} or \ac{mad} possible. +The integration of compute units \textbf{in the \ac{io} region of the bank} allows for area-intensive operations such as ADD, \ac{mac} or \ac{mad}. This leaves the highly optimized \ac{subarray} and \ac{psa} regions as they are, and only reduces the memory density per die to make room for the additional compute units. However, the achievable level of parallelism is lower than in the other approaches and is defined by the prefetch architecture, i.e., the maximum burst size of the memory banks. -Placing the compute units in the \ac{io} region of the \ac{dram} has the fewest physical limitations and allows for complex accelerators possible. +Placing the compute units \textbf{in the \ac{io} region of the \ac{dram}} has the fewest physical limitations and allows for complex accelerators, implementing a complete \ac{isa}. The downside is that bank parallelism cannot be exploited to perform multiple computations simultaneously at the bank level. Also, the energy required to move data to the \ac{io} boundary of the \ac{dram} is much higher than in the other approaches.
-In the following, three \ac{pim} approaches that place the compute units at the bank \ac{io} boundary are highlighted in more detail. +In the following, three \ac{pim} approaches that place the compute units at the bank \ac{io} boundary are presented in more detail. \subsection{UPMEM} \label{sec:pim_upmem} @@ -85,24 +85,23 @@ UPMEM combines regular DDR4 \ac{dimm} based \ac{dram} with a set of \ac{pim}-ena In each \ac{pim} chip, there are 8 \acp{dpu}, each of which has exclusive access to a $\qty{64}{\mega\byte}$ memory bank, a $\qty{24}{\kilo\byte}$ instruction memory and a $\qty{64}{\kilo\byte}$ scratchpad memory. The host processor can access the \ac{dpu} memory banks to copy input data from main memory and retrieve results. While copying, the data layout must be changed to store the data words contiguously in a \ac{pim} bank, in contrast to the horizontal \ac{dram} mapping used in \ac{dimm} modules, where a data word is split across multiple devices. -UPMEM provides a \ac{sdk} that orchestrates the data movement from the main memory to the \ac{pim} banks and modifies the data layout. +UPMEM provides an \ac{sdk} that orchestrates the data movement from the main memory to the \ac{pim} banks and modifies the data layout without requiring special attention from the developer. Each \ac{dpu} is a multithreaded $\qty{32}{bit}$ \ac{risc} core with a full set of general purpose registers and a 14-stage pipeline. The \acp{dpu} execute compiled \acs{c} code using a specialized compiler toolchain that provides limited support for the standard library. With a system clock of $\qty{400}{\mega\hertz}$, the internal bandwidth of a \ac{dpu} amounts to $\qty[per-mode = symbol]{800}{\mega\byte\per\second}$. -A system can integrate 128 \acp{dpu} per \ac{dimm}, with a total of 20 UPMEM \acp{dimm}. -This gives a maximum theoretical \ac{pim} bandwidth of $\qty[per-mode = symbol]{2}{\tera\byte\per\second}$ \cite{gomez-luna2022}.
+A system can integrate 128 \acp{dpu} per \ac{dimm}, with a total of 20 UPMEM \acp{dimm}, which gives a maximum theoretical \ac{pim} bandwidth of $\qty[per-mode = symbol]{2}{\tera\byte\per\second}$ \cite{gomez-luna2022}. \subsection{Newton AiM} \label{sec:pim_newton} -In the year 2020, the major \ac{dram} manufacturer SK Hynix announced its own \ac{pim} technology in \ac{gddr6} memory called Newton \cite{he2020}. +In the year 2020, the major \ac{dram} manufacturer SK Hynix announced Newton, its own \ac{pim} technology based on \ac{gddr6} memory \cite{he2020}. In contrast to UPMEM, Newton integrates only small \ac{mac} units and buffers into the bank region to avoid the area and power overhead of a fully programmable processor core. To communicate with the processing units, Newton introduces its own \ac{dram} commands, allowing fully interleaved \ac{pim} and non-\ac{pim} traffic as no mode switching is required. -Another advantage of this approach is that there is no kernel startup delay used to initialize the \ac{pim} operation, which would be a significant overhead for small batches of \ac{pim} operations. +Another advantage of this approach is that there is no kernel startup delay required to initialize the \ac{pim} operation, which would be a significant overhead for small batches of \ac{pim} operations. On the downside, this extension to the \ac{jedec} standard is not a drop-in solution, as the memory controller, and consequently the host processor, must be specifically adapted. In addition to the \ac{mac} units, Newton also introduces a shared global buffer in the \ac{io} region of the memory to broadcast the same input vector to all banks. -The broadcasted input vector is then multiplied by a matrix row by doing a column access to the \ac{dram} bank, producing a $\qty{32}{\byte}$ wide temporary products of 16 16-bit floating point values.
+The broadcasted input vector is then multiplied by a matrix row by performing a column access to the \ac{dram} bank, producing $\qty{32}{\byte}$ wide temporary products of 16 16-bit floating point values. These temporary products are then reduced to a single output vector element by the adder tree in the bank. To make full use of the output buffering, the matrix rows are interleaved in an unusually wide data layout, corresponding to the row size of the \ac{dram}. @@ -131,12 +130,12 @@ The following subsections are mainly based on \cite{lee2021} and \cite{kwon2021} \subsubsection{Architecture} \label{sec:pim_architecture} -As the name of \aca{fimdram} suggests, it is based on the \aca{hbm} memory standard, and it integrates 16-wide \ac{simd} engines directly into the memory banks, exploiting bank-level parallelism, while retaining the highly optimized \acp{subarray} \cite{kwon2021}. -A major difference from Newton \ac{pim} is that \aca{fimdram} does not require any changes to components of modern processors, such as the memory controller, i.e. it is agnostic to existing \aca{hbm} platforms. -Consequently, mode switching is required for \aca{fimdram}, making it less useful for interleaved \ac{pim} and non-\ac{pim} traffic. -Fortunately, as discussed in \cref{sec:hbm}, the architecture of \ac{hbm} allows for many independent memory channels on a single stack, making it possible to cleanly separate the memory map into a \ac{pim}-enabled region and a normal \ac{hbm} region. +As the name of \aca{fimdram} suggests, it is based on the \aca{hbm} memory standard, and it integrates 16-wide \ac{simd} engines directly into the memory banks, exploiting bank-level parallelism, while preserving the highly optimized \acp{subarray} \cite{kwon2021}. +A major difference from Newton \ac{pim} is that \aca{fimdram} does not require any changes to components of modern processors, such as the memory controller, i.e., it is agnostic to existing \aca{hbm} platforms. 
+Consequently, mode switching is required for \aca{fimdram}, making it less useful for interleaved \ac{pim} and non-\ac{pim} traffic and small batch sizes. +Fortunately, as discussed in \cref{sec:hbm}, the architecture of \ac{hbm} allows for many independent memory channels on a single stack, making it possible to cleanly separate the memory into a \ac{pim}-enabled region and a normal \ac{hbm} region. -At the heart of the \aca{fimdram} are the \ac{pim} execution units, which are shared by two banks of a \ac{pch}. +At the heart of the \aca{fimdram} are the \ac{pim} execution units, each of which is shared by two banks of a \ac{pch}. They include 16 16-bit wide \ac{simd} \acp{fpu}, \acp{crf}, \acp{grf} and \acp{srf} \cite{lee2021}. This general architecture is shown in detail in \cref{img:fimdram}, with (a) the placement of the \ac{pim} units between the memory banks of a \ac{dram} die, (b) a bank coupled to its \ac{pim} unit, and (c) the data path in and around an \ac{fpu} within the \ac{pim} unit. @@ -149,8 +148,8 @@ This general architecture is shown in detail in \cref{img:fimdram}, with (a) the As can be seen in (c), the input data to the \ac{fpu} can either come directly from the memory bank, from a \ac{grf}/\ac{srf} or from the result bus of a previous computation. The 16-wide \ac{simd} units correspond to the 256-bit prefetch architecture of \aca{hbm}, where 16 16-bit floating-point operands are passed directly from the \acp{ssa} to the \acp{fpu} by a single memory access. -As all \ac{pim} units operate in parallel, with 16 banks per \ac{pch}, a singular memory access loads a total of $\qty{256}{\bit}\cdot\qty{16}{banks}=\qty{4096}{\bit}$ into the \acp{fpu}. -As a result, the theoretical internal bandwidth of \aca{fimdram} is $\qty{16}{\times}$ higher than the external bus bandwidth to the host processor.
+As all \ac{pim} units operate in parallel, with 16 banks per \ac{pch}, a single memory access loads a total of $\qty{256}{\bit}\cdot\qty{8}{processing\ units}=\qty{2048}{\bit}$ into the \acp{fpu}. +As a result, the theoretical internal bandwidth of \aca{fimdram} is $\qty{8}{\times}$ higher than the external bus bandwidth to the host processor. \Ac{hbm}-\ac{pim} defines three operating modes: \begin{enumerate} @@ -191,7 +190,7 @@ This processing unit architecture is illustrated in \cref{img:pcu}, along with t \end{figure} To emphasize the architectural differences, unlike SK Hynix's Newton architecture, \aca{fimdram} requires both mode switching and loading a microkernel into the processing units before a workload can be executed. -This makes \aca{fimdram} less effective for very small workloads, as the overhead of the mode switching and initialization is significant. +This makes \aca{fimdram} less effective for very small workloads, as the overhead of the mode switching and initialization would be significant. \subsubsection{Instruction Set} @@ -206,15 +205,15 @@ The data layout of these three instruction groups is shown in \cref{tab:isa}. \end{table} For the control flow instructions, there is NOP, which does not perform any operation, JUMP, which performs a fixed iteration jump to an offset instruction, and EXIT, which restores the internal state of the processing unit. -It is important to note that the JUMP instruction is a zero-cycle instruction, i.e. it is executed together with the instruction that precedes it. +It is important to note that the JUMP instruction is a zero-cycle instruction, i.e., it is executed together with the instruction that precedes it. The arithmetic instructions perform operations such as simple ADD and MUL, but also support \ac{mac} and \ac{mad} operations, which are key for accelerating \ac{dnn} applications.
Finally, the MOV and FILL instructions are used to move data between the memory banks and the \ac{grf} and \ac{srf} register files. The DST and SRC fields specify the operand type, that is, the register file or bank affected by the operation. Depending on the source or destination operand types, the instruction encodes indices for the concrete element in the register files, which are denoted in \cref{tab:isa} by \textit{\#} symbols. -The special field \textit{R} for the data movement instruction type enables a \ac{relu} operation, i.e., clamping negative values to zero, while the data is moved to another location. -Another special field \textit{A} enabled the \ac{aam}, which will be explained in more detail in \cref{sec:instruction_ordering}. +The special field \textit{R} for the data movement instruction type enables a \ac{relu} operation, i.e., the clamping of negative values to zero, while the data is moved to another location. +Another special field \textit{A} enables the \ac{aam}, which will be explained in more detail in \cref{sec:instruction_ordering}. \begin{table} \centering @@ -241,19 +240,19 @@ Another special field \textit{A} enabled the \ac{aam}, which will be explained i \end{table} \Cref{tab:instruction_set} gives an overview of all available instructions and defines the possible operand sources and destinations. -It is to note, that some operations do require either a \ac{rd} or a \ac{wr} access to execute properly. +Note that some operations specifically require either a \ac{rd} or a \ac{wr} access to execute properly. For example, to write the resulting output vector from a \ac{grf} to the memory banks, the memory controller must issue a \ac{wr} command to write to the bank. Likewise, reading from the banks requires a \ac{rd} command. For the control types and arithmetic instructions without the bank as a source operand, either a \ac{rd} or a \ac{wr} can be issued to execute the instruction.
-The rest of this thesis, it is assumed, that a \ac{rd} is issued for these instructions. +For the rest of this thesis, it is assumed that a \ac{rd} is issued for these instructions. \subsubsection{Instruction Ordering} \label{sec:instruction_ordering} Since the execution of an instruction in the microkernel is initiated by a memory access, the host processor must execute \ac{ld} or \ac{st} instructions in a sequence that perfectly matches the loaded \ac{pim} microkernel. When an instruction has a bank as its specified source or destination, the addresses of these memory accesses specify the exact row and column where the data should be loaded from or stored to. -This means that the order of the respective memory accesses for such instructions must not be reordered, as it must match the corresponding instruction in the microkernel. -For example, as shown in \cref{lst:reorder}, two consecutive \ac{mac} instructions with the memory bank as of the one operand source already specify the respective register index, but must wait for the actual memory access to get the row and column address of the bank access. +This means that the respective memory accesses for such instructions must not be reordered, as their order must match the corresponding instructions in the microkernel. +For example, as shown in \cref{lst:reorder}, two consecutive \ac{mac} instructions with the memory bank as one operand source already specify the respective register index, but must wait for their actual memory access to receive the row and column address of the bank access. \begin{listing} \begin{verbatim}
One solution to this problem would be to introduce memory barriers between each \ac{ld} and \ac{st} instruction of the processor, to prevent any reordering, as only one memory transaction is handled by the controller at a time.
-However, this comes at a significant performance cost and results in memory bandwidth being underutilized as the host processor has to wait for every memory access to complete.
+However, this comes at a significant performance cost and results in memory bandwidth being underutilized because the host processor has to wait for every memory access to complete.
Disabling memory controller reordering completely, on the other hand, interferes with non-\ac{pim} traffic and significantly reduces its performance.

-To solve this overhead, Samsung has implemented the \ac{aam} mode for arithmetic instructions.
+To avoid this overhead, Samsung has introduced the \ac{aam} mode for arithmetic instructions.
-In the \ac{aam} mode, the register indices of an instruction are ignored and decoded from the column and row address of the memory access itself, as demonstrated in \cref{img:aam}.
+In the \ac{aam} mode, the register indices of an instruction are ignored and instead decoded from the column and row address of the memory access itself, as demonstrated in \cref{img:aam}.
-With this method, the register indices and the bank address cannot get out of sync, as they are tightly coupled, even if the memory controller reorders the order of the accesses.
+With this method, the register indices and the bank address cannot get out of sync, as they are tightly coupled, even if the memory controller reorders the accesses.
@@ -295,7 +294,7 @@ JUMP -1, 7

Since the column address of the memory access is incremented after each iteration, all entries of the \ac{grf}-A register file, where the input vector is stored, are used to multiply it with the matrix weights loaded on the fly from the memory banks.
-The actual order of the memory accesses is irrelevant, only before and after the \ac{mac} kernel the host must place memory barrier instructions to synchronize the execution again.
+The actual order of the memory accesses is irrelevant; only before and after the \ac{mac} kernel must the host place memory barrier instructions to synchronize the execution again.
-To achieve this particular operation, where the addresses can be used to calculate the register indices, the memory layout of the weight matrix has to follow a special pattern.
+To achieve this particular operation, where the addresses are used to calculate the register indices, the memory layout of the weight matrix has to follow a special pattern.
This memory layout is explained in detail in \cref{sec:memory_layout}.

\subsubsection{Programming Model}
@@ -305,10 +304,10 @@ Firstly, a \ac{pim} device driver is responsible for allocating buffers in \ac{h
-It does this because the on-chip cache would add an unwanted filtering between the host processors \ac{ld} and \ac{st} instructions and the generation of memory accesses by the memory controller.
+It does this because the on-chip cache would add unwanted filtering between the host processor's \ac{ld} and \ac{st} instructions and the generation of memory accesses by the memory controller.
Alternatively, it would be possible to control cache behavior by issuing flush and invalidate instructions, but this would introduce an overhead as the flush would have to be issued between each and every \ac{pim} instruction in the microkernel.
Secondly, a \ac{pim} acceleration library implements a set of \ac{blas} operations and manages the generation, loading and execution of the microkernel on behalf of the user.
-At the highest level, \aca{fimdram} provides an extension to the \ac{tf} framework that allows either calling the special \ac{pim} operations implemented by the accelerator library directly on the source operands, or automatically finding suitable routines that can be accelerated by \ac{pim} in the normal \ac{tf} operation.
+At the highest level, \aca{fimdram} provides an extension to the \ac{tf} framework that allows for either calling the special \ac{pim} operations implemented by the accelerator library directly on the source operands, or for automatically finding suitable routines that can be accelerated by \ac{pim} in the normal \ac{tf} operation.

The software stack is able to concurrently exploit the independent parallelism of \acp{pch} for a \ac{mac} operation as described in \cref{sec:instruction_ordering}.
-Since \aca{hbm} memory is mainly used in conjunction with \acs{gpu}, which do not implement sophisticated out-of-order execution, it is necessary to spawn a number of software threads to execute the eight memory accesses simultaneously.
+Since \aca{hbm} memory is mainly used in conjunction with \acp{gpu}, which do not implement sophisticated out-of-order execution, it is necessary to spawn a number of software threads to execute the eight memory accesses simultaneously.
The necessary number of threads depends on the processor \ac{isa}, e.g., with a maximum access size of $\qty{16}{\byte}$, $\qty{256}{\byte}/\qty{16}{\byte}=\num{16}$ threads are required to access the full \aca{hbm} burst size.
Such a group of software threads is called a thread group.
Thus, a total of 64 thread groups running in parallel can be spawned in a \ac{hbm} configuration with four memory stacks and a total of 64 \acp{pch}.
@@ -334,7 +333,7 @@ Note, that this interleaving of \ac{fp16} vectors is very similar to the chunkin

-The input vector must adhere also a special memory layout.
+The input vector must also adhere to a special memory layout.
Since a vector is essentially a single-column matrix, it is always laid out sequentially in memory.
-However, since all processing units must access the same input vector elements at the same time, all processing units must load the respective vector elements into their \ac{grf}-A registers during the initialization phase of the microkernel.
+However, because all processing units must access the same input vector elements at the same time, all processing units must load the respective vector elements into their \ac{grf}-A registers during the initialization phase of the microkernel.
As there is no communication between the banks, every bank needs to have its own copy of the input vector.
-Consequently, from the perspective of the linear address space, multiple copies chunks of the input vector must be interleaved in such a way that the input vector is continuous from the perspective of each bank.
+Consequently, from the perspective of the linear address space, the chunks of the input vector copies must be interleaved in such a way that the input vector is continuous from the perspective of each bank.
This interleaving is illustrated in \cref{img:input_vector}.
@@ -360,7 +359,7 @@ psum[i,0:15]=\sum_{j=0}^{8}(a[j \cdot 16:j \cdot 16+15] \cdot w[i,j \cdot 16:j \

The partial sum vector $psum[0:7,0:15]$ must then be reduced by the host processor to obtain the final output vector $b[0:7]$.
This reduction step is mandatory because there is no means in the \aca{fimdram} architecture to reduce the output sums of the 16-wide \ac{simd} \acp{fpu}.
In contrast, SK Hynix's Newton implements adder trees in the \ac{pim} units to reduce the partial sums directly in memory.
-Note that consequently the activation function often used in \acp{dnn}, i.e. \ac{relu} in the case of \aca{fimdram}, cannot be applied without first reducing the partial sums, since the \ac{relu} operation is a non-linear function.
+Note that, consequently, the activation function often used in \acp{dnn}, i.e., \ac{relu} in the case of \aca{fimdram}, cannot be applied without first reducing the partial sums, since the \ac{relu} operation is a non-linear function.
The operation of this concrete \ac{gemv} microkernel is illustrated in \cref{img:memory_layout}.

\begin{figure}
@@ -385,13 +384,13 @@ JUMP -1, 63
\end{listing}

To increase the number of columns, new entries of the input vector must be loaded into the processing units.
-Therefore, it is necessary to execute the complete \ac{gemv} microkernel several times the different input vector chunks and weight matrix columns.
+Therefore, it is necessary to execute the complete \ac{gemv} microkernel several times with different input vector chunks and weight matrix columns.
In general, the more the dimensions exceed the native \ac{pim} matrix dimensions, the more often the \ac{mac} core of the \ac{gemv} microkernel must be executed.
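The host-side reduction step can be sketched as follows; this is a plain Python illustration of the arithmetic, not part of the \aca{fimdram} software stack:

```python
# Illustrative sketch: reduce the partial sums psum[0:7,0:15]
# produced by the 16-wide SIMD FPUs to the output vector b[0:7].
# FIMDRAM cannot reduce across SIMD lanes in memory, so the host
# performs this step.

def reduce_partial_sums(psum):
    """Sum the 16 SIMD lanes of each row to one output element."""
    return [sum(row) for row in psum]

def relu(vector):
    """ReLU may only be applied after the reduction, since it is
    a non-linear function."""
    return [max(0.0, x) for x in vector]

# 8 output rows, each holding a 16-element partial-sum vector.
psum = [[0.5] * 16 for _ in range(8)]
b = relu(reduce_partial_sums(psum))
assert b == [8.0] * 8
```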
\subsubsection{Performance and Power Efficiency Effects}

-In addition to the theoretical bandwidth that is provided to the \ac{pim} units of $\qty[per-mode=symbol]{128}{\giga\byte\per\second}$ or a total of $\qty[per-mode=symbol]{2}{\tera\byte\per\second}$ for 16 \acp{pch}, Samsung also ran experiments on a real implementation of \aca{fimdram} to analyze its performance gains and power efficiency improvements.
+In addition to the theoretical bandwidth of $\qty[per-mode=symbol]{128}{\giga\byte\per\second}$ that is provided to the \ac{pim} units, or a total of $\qty[per-mode=symbol]{2}{\tera\byte\per\second}$ for 16 \acp{pch}, Samsung also ran experiments on a real implementation of \aca{fimdram} to analyze its performance gains and power efficiency improvements.
-This real system is based on a Xilinx Zynq Ultrascale+ \ac{fpga} that lies on the same silicon interposer as four \aca{hbm} stacks with each one buffer die, four \aca{fimdram} dies and four normal \aca{hbm} dies \cite{lee2021}.
+This real system is based on a Xilinx Zynq UltraScale+ \ac{fpga} that is integrated onto the same silicon interposer as four \aca{hbm} stacks, each consisting of one buffer die, four \aca{fimdram} dies and four normal \aca{hbm} dies \cite{lee2021}.
Results promise performance gains in the range of $\qtyrange{1.4}{11.2}{\times}$ in the tested microbenchmarks, with the highest gain of $\qty{11.2}{\times}$ for a \ac{gemv} kernel.
Real layers of \acp{dnn} achieved a performance gain in the range of $\qtyrange{1.4}{3.5}{\times}$.
diff --git a/src/chapters/vp.tex b/src/chapters/vp.tex
index ec7bc67..405a365 100644
--- a/src/chapters/vp.tex
+++ b/src/chapters/vp.tex
@@ -8,7 +8,7 @@ In addition, the suitability of different applications for \ac{pim} can be evalu
\subsection{Virtual Prototypes}

To perform such simulations, it is necessary to use a simulation model, commonly referred to as a \ac{vp}.
-\Acp{vp} act as executable software models of a physical hardware system, allowing the architecture of the system to be completely simulated in software.
+A \ac{vp} acts as an executable software model of a physical hardware system, allowing the architecture of the system to be completely simulated in software.
This in turn enables the software development and the identification of potential platform-specific software bugs without the need for the actual hardware implementation \cite{antonino2018}.
\Acp{vp} provide full visibility and control over the entire simulated system, helping to identify bottlenecks and potential specification errors in the design.
-They also allow the exploration of the design space, for example, in the case of \aca{fimdram}, this includes the variation of the ratio of \ac{pim} units to the number of memory banks and the effect on the performance of the \ac{pim} microkernel.
+They also allow the exploration of the design space; for example, in the case of \aca{fimdram}, this includes varying the ratio of \ac{pim} units to the number of memory banks and observing the effect on the performance of the \ac{pim} microkernel.
@@ -24,8 +24,8 @@ Two different \ac{vp} simulation frameworks used in the implementation of the \a
The gem5 simulator is an open-source computer architecture simulation platform used for system-level architecture research \cite{lowe-power2020}.
This powerful platform allows the measurement of various statistics, including runtime, memory bandwidth, and internal processor metrics across different hardware configurations.
-The gem5 simulator runs a user application and simulates it with it's sophisticated processor models with accurate timing.
-It consists of a simulator core and parameterized models for many components, including out-of-order processors, bus systems, and \ac{dram}.
+The gem5 simulator runs a user application and simulates it using its sophisticated processor models with accurate timing.
+It consists of a simulator core and parameterized models for many components, including out-of-order processors, bus systems, and \acp{dram}.
As a result, gem5 provides a comprehensive framework for simulating and analyzing complex computer systems.

Two different modes can be used with gem5: full system simulation and system call emulation.
@@ -40,7 +40,7 @@ An example of such an external model is the \ac{dram} simulator DRAMSys, which i DRAMSys is an open-source framework for design space exploration and provides the ability to simulate the latest \ac{jedec} \ac{dram} standards \cite{steiner2022a}. The framework is optimized for high simulation speed and uses the \ac{at} coding style, while ensuring cycle-accurate results. -\Cref{img:dramsys} provides an overview of the internal architecture of DRAMSys, which consists of a frontend, a backend and the memory models. +\Cref{img:dramsys} provides an overview of the internal architecture of DRAMSys, which consists of the frontend, the backend and the memory models. \begin{figure} \centering @@ -54,7 +54,7 @@ Each independent channel controller is responsible for controlling a single DRAM The scheduler, located within a channel controller, has the ability to reorder incoming requests to optimize for specific metrics. In conjunction with the response queue, requests can be completed out-of-order, improving overall system performance based on a specific metric. -At the frontend of DRAMSys, a variety of initiators can be connected, including traffic generators that generate random accesses, as well as sophisticated processor model such as gem5. +At the frontend of DRAMSys, a variety of initiators can be connected, including traffic generators that generate random accesses, as well as sophisticated processor models such as gem5. In cases where such a processor model is used to execute a user application, DRAMSys uses its internal memory model to store and retrieve the requested data, rather than ignoring the contents of the request. DRAMSys provides support for the latest \ac{jedec} \ac{dram} standards, including \aca{hbm}.
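The scheduler's reordering can be illustrated with a small sketch of an FR-FCFS-like policy, which prefers requests that hit the currently open row and otherwise serves the oldest request. This is a simplified illustration of the general idea, not DRAMSys's actual scheduler implementation:

```python
# Simplified FR-FCFS-style scheduling sketch (illustrative only):
# requests to the currently open row are served first to avoid
# costly precharge/activate cycles; otherwise the oldest request
# in the queue is served.

def next_request(queue, open_row):
    """Pick the next request from the queue (ordered oldest-first)."""
    for request in queue:
        if request["row"] == open_row:
            return request  # row hit: reordered ahead of older misses
    return queue[0]         # no hit: fall back to the oldest request

queue = [{"id": 0, "row": 5}, {"id": 1, "row": 2}]
assert next_request(queue, open_row=2)["id"] == 1  # row hit wins
assert next_request(queue, open_row=7)["id"] == 0  # oldest wins
```

Completing requests out of order in this way is what makes the response queue necessary: responses must be matched back to their initiators regardless of the order in which the memory served them.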