From 42ef5e76729251f326760fa3d7a25e821e059a5c Mon Sep 17 00:00:00 2001 From: Derek Christ Date: Fri, 8 Mar 2024 23:09:53 +0100 Subject: [PATCH] Write out some numbers --- src/chapters/dram.tex | 6 +++--- src/chapters/implementation/library.tex | 4 ++-- src/chapters/introduction.tex | 2 +- src/chapters/pim.tex | 4 ++-- src/chapters/results.tex | 6 +++--- 5 files changed, 11 insertions(+), 11 deletions(-) diff --git a/src/chapters/dram.tex b/src/chapters/dram.tex index be40de4..9a9155f 100644 --- a/src/chapters/dram.tex +++ b/src/chapters/dram.tex @@ -109,16 +109,16 @@ Such a cube is then placed onto a common silicon interposer that connects the \a This packaging brings the memory closer to the \ac{mpsoc}, which allows for an exceptionally wide memory interface and a minimized bus capacitance. For example, compared to a conventional \ac{ddr4} \ac{dram}, this tight integration enables $\qtyrange[range-units=single]{10}{13}{\times}$ more \ac{io} connections to the \ac{mpsoc} and a $\qtyrange[range-units=single]{2}{2.4}{\times}$ lower energy per bit-transfer \cite{lee2021}. -A memory stack supports up to 8 independent memory channels, each containing up to 16 banks divided into 4 bank groups. +A memory stack supports up to eight independent memory channels, each containing up to 16 banks divided into four bank groups. The command, address, and data buses operate at \ac{ddr}, i.e., they transfer two words per interface clock cycle $t_{CK}$. The \aca{hbm} standard defines two modes of operation~-~in legacy mode, the data bus operates as is. In \ac{pch} mode, the data bus is split in half (i.e., 64-bit) to allow independent data transfer, further increasing parallelism, while sharing a common command and address bus between the two \acp{pch}.
With a $t_{CK}$ of $\qty{1}{\giga\hertz}$, \aca{hbm} achieves a pin transfer rate of $\qty{2}{\giga T \per\second}$, which results in $\qty[per-mode=symbol]{16}{\giga\byte\per\second}$ per \ac{pch} and a total of $\qty[per-mode = symbol]{256}{\giga\byte\per\second}$ for the 1024-bit wide data bus of each stack. A single data transfer is performed with either a \ac{bl} of 2 in legacy mode or 4 in \ac{pch} mode. -Thus, accessing \aca{hbm} in \ac{pch} mode transmits a $\qty{256}{\bit}=\qty{32}{\byte}$ burst with a \ac{bl} of 4 over the $\qty{64}{\bit}$ wide data bus. +Thus, accessing \aca{hbm} in \ac{pch} mode transmits a $\qty{256}{\bit}=\qty{32}{\byte}$ burst with a \ac{bl} of four over the $\qty{64}{\bit}$ wide data bus. \cref{img:hbm} illustrates the internal architecture of a single memory die. -It consists of 2 independent channels, each with 2 \acp{pch} of 4 bank groups with 4 banks each, resulting in 16 banks per \ac{pch}. +It consists of two independent channels, each with two \acp{pch} of four bank groups with four banks each, resulting in 16 banks per \ac{pch}. In the center of the die, the \acp{tsv} connect the die to the next die above it and the previous die below it. \begin{figure} diff --git a/src/chapters/implementation/library.tex b/src/chapters/implementation/library.tex index b509235..d63b5dc 100644 --- a/src/chapters/implementation/library.tex +++ b/src/chapters/implementation/library.tex @@ -21,7 +21,7 @@ Ideally, this memory region is also set as non-cacheable, so that the messages d Alternatively, the software library must ensure that the cache is flushed after the \ac{json} message is written to the memory region. With the mode setting implemented, the shared library also provides type definitions to represent the \ac{pim} instructions in memory and to transfer entire microkernels consisting of 32 instructions to the processing units. 
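As a sanity check, the \aca{hbm} bandwidth and burst arithmetic quoted in the paragraph above can be reproduced with a short Rust sketch. This is purely illustrative back-of-the-envelope code; the function name and constants are assumptions mirroring the text, not part of the thesis's library or simulator.

```rust
/// DDR bandwidth in bytes per second: two transfers per interface clock
/// cycle over a bus of the given width (a rough helper, not simulator code).
fn ddr_bandwidth_bytes_per_s(clock_hz: u64, bus_bits: u64) -> u64 {
    2 * clock_hz * bus_bits / 8
}

fn main() {
    let t_ck_hz = 1_000_000_000; // 1 GHz interface clock

    // 64-bit pseudo channel: 16 GB/s
    assert_eq!(ddr_bandwidth_bytes_per_s(t_ck_hz, 64), 16_000_000_000);
    // 1024-bit stack-wide data bus: 256 GB/s
    assert_eq!(ddr_bandwidth_bytes_per_s(t_ck_hz, 1024), 256_000_000_000);

    // PCH mode: a burst length of 4 over the 64-bit bus yields a 32-byte burst
    let burst_bytes = 4 * 64 / 8;
    assert_eq!(burst_bytes, 32);
}
```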
-An instruction is simply represented by one of 9 different \texttt{enum} variants, each holding its necessary fields, such as the source or destination register files, as shown in \cref{lst:instruction_enums}. +An instruction is simply represented by one of nine different \texttt{enum} variants, each holding its necessary fields, such as the source or destination register files, as shown in \cref{lst:instruction_enums}. \begin{listing} \begin{minipage}[t,c]{0.45\linewidth} \begin{minted}{rust} @@ -118,7 +118,7 @@ Since a memory request triggers the execution of all processing units in a \ac{p From the point of view of the processor, only data in the first (even) or second (odd) bank is ever accessed. This requires a special indexing of the input vectors and matrices, since they must be accessed very sparsely. -In the case of the input vector, where one 16-wide \ac{simd} vector of \ac{fp16} elements is repeated as often as there are banks in a \ac{pch}, a burst access must occur every $\qty{32}{\byte}\cdot\mathrm{number\ of\ banks\ per\ \ac{pch}}=\qty{512}{\byte}$, over the entire interleaved input vector for a maximum of 8 times. +In the case of the input vector, where one 16-wide \ac{simd} vector of \ac{fp16} elements is repeated as often as there are banks in a \ac{pch}, a burst access must occur every $\qty{32}{\byte}\cdot\mathrm{number\ of\ banks\ per\ \ac{pch}}=\qty{512}{\byte}$, over the entire interleaved input vector for a maximum of eight times. This way, all available \ac{grf}-A registers in a processing unit are used to hold its copy of the input vector. To then perform the repeated \ac{mac} operation with the weight matrix as bank data, a similar logic must be applied. Since each row of the matrix resides on its own memory bank, with an interleaving of the size of a 16-wide \ac{simd} vector of \ac{fp16} elements, one memory access must likewise be issued every $\qty{512}{\byte}$.
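The sparse access pattern for the interleaved input vector described above can be sketched in Rust. The helper below is hypothetical (its name and the host-side perspective are assumptions, not part of the shared library); only the constants follow the text.

```rust
/// Byte offsets at which the host issues a burst access so that every
/// processing unit receives one 32-byte segment of the interleaved input
/// vector (hypothetical helper, not taken from the actual shared library).
fn input_vector_offsets() -> Vec<usize> {
    const BURST_BYTES: usize = 32;   // one 16-wide SIMD vector of FP16
    const BANKS_PER_PCH: usize = 16; // banks sharing one pseudo channel
    const GRF_A_ENTRIES: usize = 8;  // GRF-A registers filled per unit
    let stride = BURST_BYTES * BANKS_PER_PCH; // 32 B * 16 = 512 B
    (0..GRF_A_ENTRIES).map(|i| i * stride).collect()
}

fn main() {
    let offsets = input_vector_offsets();
    assert_eq!(offsets.len(), 8);             // at most eight burst accesses
    assert_eq!(offsets[1] - offsets[0], 512); // one access every 512 bytes
    assert_eq!(*offsets.last().unwrap(), 3584);
}
```

The same stride of $\qty{512}{\byte}$ applies to the weight-matrix accesses, since each matrix row is interleaved across the banks in the same way.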
diff --git a/src/chapters/introduction.tex b/src/chapters/introduction.tex index 85436ed..449e6e4 100644 --- a/src/chapters/introduction.tex +++ b/src/chapters/introduction.tex @@ -26,7 +26,7 @@ It is therefore required to achieve radical improvements in the energy efficienc In recent years, domain-specific accelerators, such as \acp{gpu} or \acp{tpu}, have become very popular, as they provide orders of magnitude higher performance and energy efficiency for the training and inference of \ac{ai} applications than general-purpose processors \cite{kwon2021}. However, research must also take into account the off-chip memory~-~moving data between the processor and the \ac{dram} is very costly, as fetching the operands consumes more power than performing the computation on them: -While performing a double-precision floating-point operation on a $\qty{28}{\nano\meter}$ technology might consume an energy of about $\qty{20}{\pico\joule}$, fetching the operands from \ac{dram} consumes almost 3 orders of magnitude more energy at about $\qty{16}{\nano\joule}$ \cite{dally2010}. +While performing a double-precision floating-point operation on a $\qty{28}{\nano\meter}$ technology might consume an energy of about $\qty{20}{\pico\joule}$, fetching the operands from \ac{dram} consumes almost three orders of magnitude more energy at about $\qty{16}{\nano\joule}$ \cite{dally2010}. Furthermore, many types of \acp{dnn} used for language and speech processing, such as \acp{rnn}, \acp{mlp} and some layers of \acp{cnn}, are severely limited by the memory bandwidth that the \ac{dram} can provide, making them \textit{memory-bound} \cite{he2020}. In contrast, compute-intensive workloads, such as visual processing, are referred to as \textit{compute-bound}.
diff --git a/src/chapters/pim.tex b/src/chapters/pim.tex index 6ffefa6..5361a12 100644 --- a/src/chapters/pim.tex +++ b/src/chapters/pim.tex @@ -177,9 +177,9 @@ The \ac{crf} acts as an instruction buffer, holding the 32 32-bit instructions t One program that is stored in the \ac{crf} is called a \textit{microkernel}. As explained earlier, the operands of an instruction come either directly from the bank or from the \acp{srf} or \acp{grf}. Each \ac{grf} consists of 16 registers, each with the \aca{hbm} prefetch size of 256 bits, so that each entry can hold the data of a full memory burst. -The \ac{grf} of a processing unit is divided into two halves (\ac{grf}-A and \ac{grf}-B), with 8 register entries allocated to each of the two banks. +The \ac{grf} of a processing unit is divided into two halves (\ac{grf}-A and \ac{grf}-B), with eight register entries allocated to each of the two banks. Finally, in the \acp{srf}, a 16-bit scalar value is replicated 16 times as it is fed into the 16-wide \ac{simd} \ac{fpu} as a constant summand or factor for an addition or multiplication. -The \ac{srf} is also divided into two halves (\ac{srf}-A and \ac{srf}-M) for addition and multiplication, with 8 entries each. +The \ac{srf} is also divided into two halves (\ac{srf}-A and \ac{srf}-M) for addition and multiplication, with eight entries each. This processing unit architecture is illustrated in \cref{img:pcu}, along with the local bus interfaces to its even and odd bank, and the control unit that decodes the instructions and keeps track of the program counter. \begin{figure} diff --git a/src/chapters/results.tex b/src/chapters/results.tex index ff94cd8..489909b 100644 --- a/src/chapters/results.tex +++ b/src/chapters/results.tex @@ -10,12 +10,12 @@ A set of simulations is then run based on these parameters and the resulting per \subsection{System Architecture} The memory configuration used in the simulations has already been partially introduced in \cref{sec:memory_configuration}.
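The register-file capacities described in the pim chapter above can be summarized in a small Rust sketch. The type and field names here are purely illustrative assumptions, not taken from the \aca{fimdram} specification; only the sizes follow the text.

```rust
// Illustrative layout of one processing unit's register files; the names
// are assumptions, only the entry counts and widths mirror the text.
// FP16 values are modeled as raw u16 bit patterns.
#[allow(dead_code)]
struct ProcessingUnitRegs {
    grf_a: [[u16; 16]; 8], // GRF half A: 8 entries of 256 bit (16 FP16 lanes)
    grf_b: [[u16; 16]; 8], // GRF half B: 8 entries for the other bank
    srf_a: [u16; 8],       // scalar summands for addition (SRF-A)
    srf_m: [u16; 8],       // scalar factors for multiplication (SRF-M)
    crf: [u32; 32],        // microkernel buffer: 32 instructions of 32 bit
}

fn main() {
    // Each GRF half holds 8 entries * 32 bytes = 256 bytes.
    assert_eq!(std::mem::size_of::<[[u16; 16]; 8]>(), 256);
    // Total: 2 * 256 B (GRF) + 2 * 16 B (SRF) + 128 B (CRF) = 672 bytes.
    assert_eq!(std::mem::size_of::<ProcessingUnitRegs>(), 672);
}
```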
-Each \ac{pim}-enabled \ac{pch} contains 8 processing units, each of which is connected to 2 memory banks. +Each \ac{pim}-enabled \ac{pch} contains eight processing units, each of which is connected to two memory banks. A processing unit operates at the same frequency as a \aca{hbm} \ac{dram} device with $\qty{250}{\mega\hertz}$. The external clocking of the memory bus itself is $\qty{4}{\times}$ higher with a frequency of $\qty{1}{\giga\hertz}$; the data, address, and command buses of \aca{hbm} operate at \ac{ddr} \cite{lee2021}. Thus, with both the 16-wide \ac{fp} adder and the 16-wide \ac{fp} multiplier, a single processing unit achieves a throughput of $\num{2} \cdot \qty{16}{FLOP} \cdot \qty{250}{\mega\hertz}=\qty{8}{\giga FLOPS}$. In total, the 16 processing units in a memory channel provide a throughput of $\num{16}\cdot\qty{8}{\giga FLOPS}=\qty{128}{\giga FLOPS}$. -To compare this throughput with the vector processing unit of a real processor, a very simplified assumption can be made based on the ARM NEON architecture, which holds 8 \ac{fp16} numbers in a single $\qty{128}{\bit}$ vector register \cite{arm2020}. +To compare this throughput with the vector processing unit of a real processor, a very simplified assumption can be made based on the ARM NEON architecture, which holds eight \ac{fp16} numbers in a single $\qty{128}{\bit}$ vector register \cite{arm2020}. Assuming the single processor core runs at a frequency of $\qty{3}{\giga\hertz}$, the vector processing unit can achieve a maximum throughput of $\qty{8}{FLOP} \cdot \qty{3}{\giga\hertz}=\qty{24}{\giga FLOPS}$, which is about $\qty{5}{\times}$ less than the \aca{fimdram} throughput of a single memory channel. The simulated ARM system also contains a two-level cache hierarchy with a cache size of $\qty{16}{\kibi\byte}$ for the L1 cache and $\qty{256}{\kibi\byte}$ for the L2 cache.
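The peak-throughput arithmetic above can be checked with a short Rust sketch. This is a simplifying peak estimate in the same spirit as the text; the function name is an illustrative assumption, and none of this is simulator code.

```rust
/// Peak throughput in FLOPS: number of FPUs times SIMD width times clock
/// frequency (a simplified peak estimate, not a measured value).
fn peak_flops(fpus: u64, simd_width: u64, clock_hz: u64) -> u64 {
    fpus * simd_width * clock_hz
}

fn main() {
    // One processing unit: FP adder + FP multiplier, 16 lanes, 250 MHz
    let pu = peak_flops(2, 16, 250_000_000);
    assert_eq!(pu, 8_000_000_000); // 8 GFLOPS

    // 16 processing units per memory channel
    assert_eq!(16 * pu, 128_000_000_000); // 128 GFLOPS

    // NEON comparison: one FPU with eight FP16 lanes at 3 GHz
    let neon = peak_flops(1, 8, 3_000_000_000);
    assert_eq!(neon, 24_000_000_000); // 24 GFLOPS

    // The channel outperforms the assumed NEON unit by roughly 5x
    assert_eq!((16 * pu) / neon, 5);
}
```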
@@ -275,7 +275,7 @@ Since the Samsung \ac{fpga} platform can be assumed to be a highly optimized acc Samsung's ADD microbenchmark shows a small variance between the different input dimensions, with an average speedup of around $\qty{1.6}{\times}$. The simulated platform likewise shows only a small variance, but with an average speedup of around $\qty{12.7}{\times}$, which is almost an order of magnitude higher than the findings of Samsung. This may be a surprising result, since such vector operations are inherently memory-bound and should be a prime candidate for the use of \ac{pim}. -Samsung explains its low value of $\qty{1.6}{\times}$ by the fact that after 8 \ac{rd} accesses, the processor has to introduce a memory barrier instruction, resulting in a severe performance hit \cite{lee2021}. +Samsung explains its low value of $\qty{1.6}{\times}$ by the fact that after eight \ac{rd} accesses, the processor has to introduce a memory barrier instruction, resulting in a severe performance hit \cite{lee2021}. However, this memory barrier has also been implemented in the VADD kernel of the simulations, which still show a significant performance gain. The \ac{gemv} microbenchmark, on the other hand, shows a more closely matching result, with an average speedup of $\qty{8.3}{\times}$ for Samsung's implementation, while the simulation of this thesis achieved an average speedup of $\qty{9.0}{\times}$, which is well within the reach of the real hardware implementation.