diff --git a/src/chapters/conclusion.tex b/src/chapters/conclusion.tex
index ad419c6..a42b42a 100644
--- a/src/chapters/conclusion.tex
+++ b/src/chapters/conclusion.tex
@@ -13,7 +13,8 @@ This achieved speedup of $\qty{9.0}{\times}$ for the \ac{gemv} routine largely m
 In addition to the numbers presented by Samsung, the same simulation workloads were run on two real \ac{gpu} systems, both with \aca{hbm}, and their runtimes were compared.
 However, there is still room for improvement in the software model and the comparison methodology, which will be the subject of future work.
 
-Firstly, the developed software library and the implemented model are not yet a drop-in replacement for the real \aca{fimdram} implementation due to the custom communication protocol between the host processor and the \ac{pim} processing units, which is used to implement the mode switching and the transfer of the microkernels.
+Firstly, the developed software library and the implemented model are not yet a drop-in replacement for the real \aca{fimdram} implementation due to the custom communication protocol between the host processor and the \ac{pim} processing units.
+This protocol is used to implement mode switching and to transfer the microkernels.
 For this, more detailed information is required from Samsung, as the exact interface of \aca{fimdram} is not described in the published papers \cite{kwon2021}, \cite{lee2021} and \cite{kang2022}.
 To ease the currently error-prone microkernel development process, the software library could help the developer by providing building blocks that assemble the microkernel and simultaneously generate the necessary \ac{ld} and \ac{st} instructions to execute the kernel.
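The building-block idea proposed in the hunk above could be sketched roughly as follows. This is a hypothetical illustration only: the class and method names (`MicrokernelBuilder`, `mac`, `jump`, …) are invented and are not part of the actual PIM library; the sketch merely shows how recording each microkernel instruction could simultaneously derive the host-side memory accesses that step the processing units through the kernel.

```python
# Hypothetical sketch: all names are invented, not part of the real library.
# Each recorded microkernel instruction also yields the host memory accesses
# (modeled here as ("LD", address) pairs) needed to trigger its execution.

class MicrokernelBuilder:
    def __init__(self):
        self.instructions = []   # the assembled microkernel program
        self.host_accesses = []  # host-side accesses that drive execution

    def mac(self, bank_addr):
        # MAC with the bank as source operand: triggered by one host read
        self.instructions.append(("MAC", bank_addr))
        self.host_accesses.append(("LD", bank_addr))

    def fill(self, bank_addr):
        # FILL writes the partial sum back into the bank
        self.instructions.append(("FILL", bank_addr))
        self.host_accesses.append(("LD", bank_addr))

    def jump(self, offset, count):
        # JUMP is not executed by its own memory access; it repeats the
        # previous instruction, so the host repeats that access `count` times
        prev = self.host_accesses[-1]
        self.instructions.append(("JUMP", offset, count))
        self.host_accesses.extend([prev] * count)

    def build(self):
        return self.instructions, self.host_accesses
```

Such a builder would make the required LD/ST sequence a by-product of kernel assembly, removing the error-prone manual bookkeeping described above.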
diff --git a/src/chapters/dram.tex b/src/chapters/dram.tex
index 776a947..be40de4 100644
--- a/src/chapters/dram.tex
+++ b/src/chapters/dram.tex
@@ -34,7 +34,7 @@ The process of loading the stored values into the \acp{psa} is done for all colu
 Once a row is activated, it can be read from or written to with a certain access granularity determined by the \ac{bl} of the memory.
 To perform such a burst access, the \acp{csl} of a set of \acp{psa} are enabled, connecting them to the more powerful \acp{ssa} that drive the actual bank \ac{io}.
 Depending on the \ac{we} signal, the \acp{ssa} either sense and amplify the logic value of the \acp{psa}, or they overwrite it using the \textit{write drivers}.
-The \cref{img:bank} summarizes the basic architecture of a single storage device consisting of a number of banks that has been discussed so far.
+\Cref{img:bank} summarizes the basic architecture of a single storage device consisting of a number of banks that has been discussed so far.
 
 \begin{figure}
     \centering
@@ -44,7 +44,7 @@ The \cref{img:bank} summarizes the basic architecture of a single storage device
 \end{figure}
 
 Since a single \ac{dram} device has only a small bit-width, for example in the case of x8 \ac{dram} a width of 8, several devices operate in lockstep mode to form the wider \textit{data bus} of the \textit{memory channel} \cite{jung2017a}.
-One kind of \ac{dram} subsystem places these sets of devices on a special \ac{pcb} is called \ac{dimm}.
+One kind of \ac{dram} subsystem that places these sets of devices on a special \ac{pcb} is called \ac{dimm}.
 A \ac{dimm} may also consist of several independent \textit{ranks}, which are complete sets of \ac{dram} devices connected to the same data bus, but accessed in an interleaved manner.
 Besides the data bus, the channel consists also of the \textit{command bus} and the \textit{address bus}.
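The row/bank/column hierarchy and burst granularity discussed in the hunk above can be illustrated with a toy address decoder. The field widths below (3 column-offset bits for a burst of 8 words, 10 column bits, 2 bank bits, 14 row bits) are made-up values for illustration, not those of any real device or of the mapping used in the thesis.

```python
# Illustrative only: split a flat physical address into DRAM coordinates.
# The bit widths are invented toy values, listed least-significant first.

FIELDS = [
    ("column_offset", 3),   # position within one burst (BL8: 8 words)
    ("column", 10),         # column address of the burst
    ("bank", 2),            # bank within the device
    ("row", 14),            # row that must be activated first
]

def decode(addr):
    coords = {}
    for name, width in FIELDS:           # peel fields off the low bits
        coords[name] = addr & ((1 << width) - 1)
        addr >>= width
    return coords
```

Reading any word in a row first requires activating that row; consecutive addresses that differ only in `column_offset` fall into the same burst access.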
@@ -109,7 +109,7 @@ Such a cube is then placed onto a common silicon interposer that connects the \a
 This packaging brings the memory closer to the \ac{mpsoc}, which allows for an exceptionally wide memory interface and a minimized bus capacitance.
 For example, compared to a conventional \ac{ddr4} \ac{dram}, this tight integration enables $\qtyrange[range-units=single]{10}{13}{\times}$ more \ac{io} connections to the \ac{mpsoc} and a $\qtyrange[range-units=single]{2}{2.4}{\times}$ lower energy per bit-transfer \cite{lee2021}.
 
-One memory stack supports up to 8 independent memory channels, each of which containing up to 16 banks, which are divided into 4 bank groups.
+A memory stack supports up to 8 independent memory channels, each containing up to 16 banks divided into 4 bank groups.
 The command, address and data bus operate at \ac{ddr}, i.e., they transfer two words per interface clock cycle $t_{CK}$.
 The \aca{hbm} standard defines two modes of operation~-~in legacy mode, the data bus operates as is.
 In \ac{pch} mode, the data bus is split in half (i.e., 64-bit) to allow independent data transfer, further increasing parallelism, while sharing a common command and address bus between the two \acp{pch}.

diff --git a/src/chapters/implementation.tex b/src/chapters/implementation.tex
index 67419cb..212f66a 100644
--- a/src/chapters/implementation.tex
+++ b/src/chapters/implementation.tex
@@ -1,7 +1,7 @@
 \section{Implementation}
 \label{sec:implementation}
 
-The implementation of the \aca{fimdram} model is divided into three distinct parts:
+The implementation of the \aca{fimdram} model is divided into three parts:
 Firstly, the processing units in the \acp{pch} of \aca{hbm} are integrated into the \ac{dram} model of DRAMSys.
 Secondly, a software library that uses the \ac{pim} implementation provides an \ac{api} to take advantage of in-memory processing from a user application.
 Finally, the software library is used in a gem5-based bare-metal kernel to perform \ac{pim} operations.

diff --git a/src/chapters/implementation/kernel.tex b/src/chapters/implementation/kernel.tex
index 8b465f4..c476ebc 100644
--- a/src/chapters/implementation/kernel.tex
+++ b/src/chapters/implementation/kernel.tex
@@ -19,12 +19,12 @@ The user application does not have to run in user space, but can run in a privil
 \end{itemize}
 
 While the system call emulation mode is the simplest option, it has been discarded due to its lack of accuracy and inability to execute privileged instructions.
-The full system mode, which boots a Linux kernel, on the one hand provides the necessary capability to implement the application.
-However, due to the complexity of booting the entire kernel, which renders rapid prototyping unfeasible, and the need to write a Linux device driver to execute privileged instructions and control the non-cacheable memory regions, it was decided to favor the bare-metal option.
+On the one hand, the full system mode, which boots a Linux kernel, provides the necessary capabilities for the implementation of the application.
+However, due to the complexity of booting the entire kernel, which makes rapid prototyping nearly impossible, and the need to write a Linux device driver to execute privileged instructions and control the non-cacheable memory regions, it was decided to favor the bare-metal option.
 Here, the self-written kernel has full control over the complete system, which is an advantage when implementing a minimal example utilizing \aca{fimdram}.
 On the other hand, some setup is required, such as initializing the page tables so that the \ac{mmu} of the processor can be enabled and programmed to mark memory regions as cacheable and non-cacheable.
 
-Running a gem5 simulation requires writing a Python script, that sets up all system components and connects them.
+Running a gem5 simulation requires writing a Python script that sets up all system components and connects them.
 Recently, gem5 deprecated a commonly used prebuilt script called \texttt{fs.py} in favor of its new standard library.
 This standard library provides useful abstractions over common system components, making it easier to build complex systems in a flexible way without having to dive into great detail.
 It also greatly simplifies the process of building a system with, for example, an accurate timing or out-of-order processor, a multi-level cache hierarchy, a memory crossbar, and a \ac{dram} model.
@@ -162,7 +162,7 @@ Since different channels would only be used to increase the dimensions of the ma
 With a working bare-metal environment, heap allocation of memory arrays, and the correct \aca{hbm} configuration for \aca{fimdram}, a \ac{gemv} microkernel can finally be assembled using the data structures provided by the \ac{pim} library.
 The native matrix dimensions of (128$\times$8) have been extended to (128$\times$16), spreading the matrix over two \acp{pch} and increasing the size of the output vector to (16).
 The microkernel must therefore execute on both \acp{pch}, which is ensured by implicitly addressing the corresponding \ac{pch} when generating the \ac{rd} and \ac{wr} commands for the matrix addresses.
-With the (128$\times$16) weight matrix, the interleaved (128) input vector, the reserved (16) output vector of 16-wide \ac{fp16} \ac{simd} packets that holds the partial sums and a dummy memory region for executing control instructions, the \ac{gemv} microkernel can be assembled as seen in \cref{lst:gemv_microkernel}.
+With the (128$\times$16) weight matrix, the interleaved (128) input vector, the reserved (16) output vector of 16-wide \ac{fp16} \ac{simd} packets holding the partial sums, and a dummy memory region for executing control instructions, the \ac{gemv} microkernel can be assembled as seen in \cref{lst:gemv_microkernel}.
 \begin{listing}
 \begin{verbatim}
@@ -186,7 +186,7 @@ EXIT
 Firstly, the input vector is loaded into all eight \ac{grf}-A registers, followed by the \ac{mac} core, which iteratively multiplies chunks of a matrix row with the input vector chunks and stores them in the first \ac{grf}-B register.
 Then, the FILL instruction writes the computed partial sum into the memory bank, followed by an EXIT instruction that resets the processing units to a defined state.
 Note that even though the microkernel consists of only 12 instructions, the host processor has to send in total 36 memory requests to the memory.
-On the one hand because of the JUMP instruction, which is not executed itself, but repeats the previous instruction 7 times, and on the other hand because the memory requests have to be sent to both \ac{pch} which effectively executes the microkernel twice.
+This is partly because of the JUMP instruction, which is not executed itself but repeats the previous instruction seven times, and partly because the memory requests have to be sent to both \acp{pch}, effectively executing the microkernel twice.
 The host processor must now exit the \ac{abp} mode and enter the \ac{sb} mode, load the partial sum vector from memory, reduce it, and possibly prepare it for the next \ac{dnn} layer in the same way as the input vector was prepared.
 
 \subsubsection{Benchmark Environment}

diff --git a/src/chapters/implementation/vm.tex b/src/chapters/implementation/vm.tex
index 64b75cb..71d34aa 100644
--- a/src/chapters/implementation/vm.tex
+++ b/src/chapters/implementation/vm.tex
@@ -4,7 +4,7 @@
 \subsubsection{Integration}
 
 To implement \aca{fimdram} in \aca{hbm}, the \ac{dram} model of DRAMSys has to be extended to incorporate the processing units in the \acp{pch} of the \ac{pim}-activated channels.
 They must also receive the burst data from the \acp{ssa} and the burst address to calculate the register indices for the \ac{aam} mode.
-However, no changes are required in the frontend or backend of DRAMSys, as already described in \cref{sec:pim_fim} no changes are required in the memory controller.
+However, as already described in \cref{sec:pim_fim}, there is no need to modify either the frontend or the backend of DRAMSys, since the memory controller remains unchanged.
 In addition, since a single \ac{dram} \ac{rd} or \ac{wr} command triggers the execution of a single microkernel instruction, the processing unit is fully synchronized with the read and write operations of the \ac{dram}.
 As a result, the \aca{fimdram} model itself does not need to model any timing behavior:
 Its submodel is essentially untimed, since it is already synchronized with the operation of the \ac{dram} model of DRAMSys.
@@ -23,12 +23,14 @@ When entering \ac{ab} mode, the \ac{dram} model ignores the specific bank addres
 This mode can be used by the host to initialize the input vector chunk interleaving as described in \cref{sec:memory_layout}, or to initialize the \ac{crf} of the processing unit with the microkernel, which should be the same for all operating banks.
 
 After the transition to the \ac{ab} mode, the \ac{dram} can further transition to the \ac{abp} mode, which allows the execution of instructions in the processing units.
-The \ac{abp} mode is similar to the \ac{ab} mode in that it also ignores the concrete bank address except for its parity, while additionally passing the column and row address and, in the case of a read, also the respective fetched bank data to the processing units.
-In the case of a write access, the output of the processing unit is written directly into the corresponding bank, ignoring the actual data of the transaction object coming from the host processor.
+The \ac{abp} mode is similar to the \ac{ab} mode in that it also ignores the bank address except for its parity.
+In addition, it passes the column and row address and, in the case of a read, the fetched bank data to the processing units.
+With a write access, the output of the processing unit is written directly to the corresponding bank.
+The actual data of the transaction object coming from the host processor is ignored.
 This is equivalent to the real \aca{fimdram} implementation, where the global \ac{io} bus of the memory is not actually driven, and all data movement is done internally in the banks.
 
 \subsubsection{Implementation}
 
-So far, only the additional infrastructure in the \ac{dram} model of DRAMSys for the integration of the processing units have been described.
+So far, only the additional infrastructure in the \ac{dram} model of DRAMSys for the integration of the processing units has been described.
 The next step is the implementation of the processing units themselves.
 A processing unit's internal state consists of the \ac{grf} register files \ac{grf}-A and \ac{grf}-B, the \ac{srf} register files \ac{srf}-A and \ac{srf}-M, the program counter, and a jump counter that keeps track of the current iteration of a JUMP instruction.
 As a simplification of the model, the \acp{crf} are not stored in each \ac{pim} unit, but are stored once globally for each \ac{pch}.

diff --git a/src/chapters/pim.tex b/src/chapters/pim.tex
index a502265..6ffefa6 100644
--- a/src/chapters/pim.tex
+++ b/src/chapters/pim.tex
@@ -125,8 +125,8 @@ As a result, Newton promises a $\qtyrange{10}{54}{\times}$ speedup compared to a
 \label{sec:pim_fim}
 
 One year after SK Hynix, the major \ac{dram} manufacturer Samsung announced its own \ac{pim} \ac{dram} implementation, called \acf{fimdram}.
-As this is the \ac{pim} architecture which was implemented during the work on this thesis, it will be explained in great detail.
-The following subsections are mainly based on \cite{lee2021} and \cite{kwon2021}, with the \cref{sec:memory_layout} being mainly based on \cite{kang2022}.
+As this is the \ac{pim} architecture that is implemented as a \ac{vp} in this thesis, it will be explained in great detail.
+The following subsections are mainly based on \cite{lee2021} and \cite{kwon2021}, with \cref{sec:memory_layout} being mainly based on \cite{kang2022}.
 
 \subsubsection{Architecture}
 \label{sec:pim_architecture}
@@ -194,7 +194,7 @@ This makes \aca{fimdram} less effective for very small workloads, as the overhea
 
 \subsubsection{Instruction Set}
 
-The \aca{fimdram} processing units provide a total of 9 32-bit \ac{risc} instructions, each of which falls into one of three groups: control flow instructions, arithmetic instructions and data movement instructions.
+The \aca{fimdram} processing units provide a total of nine 32-bit \ac{risc} instructions, each of which falls into one of three groups: control flow instructions, arithmetic instructions and data movement instructions.
 The data layout of these three instruction groups is shown in \cref{tab:isa}.
 
 \begin{table}
@@ -211,7 +211,7 @@ Finally, the MOV and FILL instructions are used to move data between the memory
 The DST and SRC fields specify the operand type.
 That is, the register file or bank affected by the operation.
-Depending on the source or destination operand types, the instruction encodes indices for the concrete element in the register files, which are denoted in the \cref{tab:isa} by \textit{\#} symbols.
+Depending on the source or destination operand types, the instruction encodes indices for the concrete element in the register files, which are denoted in \cref{tab:isa} by \textit{\#} symbols.
 The special field \textit{R} for the data movement instruction type enables a \ac{relu} operation, i.e., the clamping of negative values to zero, while the data is moved to another location.
 Another special field \textit{A} enables the \ac{aam}, which will be explained in more detail in \cref{sec:instruction_ordering}.
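The field structure described in this hunk (opcode, DST/SRC operand types, register indices, the R flag for ReLU and the A flag for AAM) can be illustrated with a toy encoder. The exact bit layout is defined in \cref{tab:isa} and is not reproduced here; every bit position and width in this sketch is an invented assumption for illustration only.

```python
# Toy encoder for a FimDRAM-style data-movement instruction word.
# All bit positions and opcode values below are invented placeholders;
# the real layout is given in the thesis's ISA table.

from dataclasses import dataclass

OPCODES = {"MOV": 0b0100, "FILL": 0b0101}          # assumed values
OPERANDS = {"BANK": 0b00, "GRF_A": 0b01, "GRF_B": 0b10, "SRF": 0b11}

@dataclass
class MoveInstr:
    op: str
    dst: str
    src: str
    dst_idx: int = 0     # "#" index into the destination register file
    relu: bool = False   # R field: clamp negative values to zero on the move
    aam: bool = False    # A field: take register indices from the column address

    def encode(self) -> int:
        word = OPCODES[self.op] << 28          # opcode in the top bits
        word |= OPERANDS[self.dst] << 26       # DST operand type
        word |= OPERANDS[self.src] << 24       # SRC operand type
        word |= (self.dst_idx & 0x7) << 21     # element index ("#")
        word |= int(self.relu) << 20           # R flag
        word |= int(self.aam) << 19            # A flag
        return word
```

With AAM enabled, the encoded index field would be ignored and derived from the access address instead, which is what makes the JUMP-based loops over register files possible.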
@@ -249,7 +249,7 @@ For the rest of this thesis, it is assumed, that a \ac{rd} is issued for these i
 \subsubsection{Instruction Ordering}
 \label{sec:instruction_ordering}
 
-Since the execution of an instruction in the microkernel is initiated by a memory access, the host processor must execute \ac{ld} or \ac{st} store instructions in a sequence that perfectly matches the loaded \ac{pim} microkernel.
+Since the execution of an instruction in the microkernel is initiated by a memory access, the host processor must execute \ac{ld} or \ac{st} instructions in a sequence that perfectly matches the loaded \ac{pim} microkernel.
 When an instruction has a bank as its specified source or destination, the addresses of these memory accesses specify the exact row and column where the data should be loaded from or stored to.
 This means that the order of the respective memory accesses for such instructions is important and must not be reordered, as it must match the corresponding instruction in the microkernel.
 For example, as shown in \cref{lst:reorder}, two consecutive \ac{mac} instructions with the memory bank as one of the operand sources already specify the respective register index.
@@ -371,7 +371,7 @@ The operation of this concrete \ac{gemv} microkernel is illustrated in \cref{img
 \label{img:memory_layout}
 \end{figure}
 
-In the \cref{img:memory_layout} it can be seen that a processing unit is responsible for multiplying and adding one row of the matrix with the input vector in eight cycles, forming the partial sum.
+In \cref{img:memory_layout}, it can be seen that one processing unit is responsible for multiplying and adding one row of the matrix with the input vector in eight cycles, forming the partial sum.
 This example only demonstrates the execution of the native matrix dimensions for one \ac{pch}.
 Increasing the number of rows in the matrix requires additional iterations of this 8-cycle microkernel, while feeding in the other memory addresses for the subsequent matrix rows.
 However, the additional matrix rows must be stored as a separate matrix after the first 8-row matrix block, forming an array of separate 8-row matrices.

diff --git a/src/doc.bib b/src/doc.bib
index 570df17..2a36022 100644
--- a/src/doc.bib
+++ b/src/doc.bib
@@ -1,4 +1,4 @@
-@article{2021,
+@misc{2021,
   title = {Changing {{Exception}} Level and {{Security}} State in an Embedded Image},
   date = {2021},
   langid = {english},
@@ -19,7 +19,7 @@
   file = {/home/derek/Nextcloud/Verschiedenes/Zotero/storage/KGD8N29E/Antonino et al. - 2018 - Enabling Continuous Software Engineering for Embed.pdf}
 }
 
-@article{arm2015,
+@misc{arm2015,
   title = {{{ARM Cortex-A Series Programmer}}’s {{Guide}} for {{ARMv8-A}}},
   author = {{ARM}},
   date = {2015-03-24},
@@ -29,7 +29,7 @@
   file = {/home/derek/Nextcloud/Verschiedenes/Zotero/storage/KGNI52X5/2015 - ARM Cortex-A Series Programmer’s Guide for ARMv8-A.pdf}
 }
 
-@article{arm2020,
+@misc{arm2020,
   title = {Neon {{Programmer Guide}} for {{Armv8-A Coding}} for {{Neon}}},
   author = {{ARM}},
   date = {2020-07-05},
@@ -100,7 +100,7 @@
   file = {/home/derek/Nextcloud/Verschiedenes/Zotero/storage/DQ9B36IG/Gabbay et al. - 2022 - Deep Neural Network Memory Performance and Through.pdf}
 }
 
-@article{gao2017,
+@misc{gao2017,
   title = {Bare-Metal {{Boot Code}} for {{ARMv8-A Processors}}},
   author = {Gao, William},
   date = {2017-03-31},
@@ -419,7 +419,8 @@
 @online{nalgebra,
   title = {Linear Algebra Library for the {{Rust}} Programming Language},
-  url = {https://nalgebra.org/}
+  url = {https://nalgebra.org/},
+  urldate = {2024-01-08}
 }
 
 @book{nielsen2015,
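The GEMV scheme from the pim.tex hunks above — one processing unit multiply-accumulating one 128-element matrix row against the input vector in eight cycles of 16-wide SIMD chunks, with a host-side reduction of the resulting packet — can be summarized by a small functional sketch. This is a behavioral model only, using Python floats rather than FP16, and `pim_gemv` is an invented name, not a function from the thesis's library.

```python
# Functional sketch of the native-dimension GEMV for one pCH: eight MAC
# cycles of 16-wide chunks per matrix row, then a host-side reduction of
# the 16-wide partial-sum packet. Plain floats stand in for FP16 SIMD.

def pim_gemv(matrix, vector):
    # matrix: list of 128-element rows; vector: 128 elements
    results = []
    for row in matrix:                       # one processing unit per row
        partial = [0.0] * 16                 # GRF-B accumulator packet
        for cycle in range(8):               # eight MAC cycles per row
            base = 16 * cycle
            for lane in range(16):           # 16-wide SIMD in hardware
                partial[lane] += row[base + lane] * vector[base + lane]
        results.append(sum(partial))         # host-side reduction
    return results
```

Extending the matrix beyond eight rows, as described above, would simply repeat this 8-cycle loop for each additional 8-row block.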