From 42ef5e76729251f326760fa3d7a25e821e059a5c Mon Sep 17 00:00:00 2001 From: Derek Christ Date: Fri, 8 Mar 2024 23:09:53 +0100 Subject: [PATCH] Write out some numbers --- src/chapters/dram.tex | 6 +++--- src/chapters/implementation/library.tex | 4 ++-- src/chapters/introduction.tex | 2 +- src/chapters/pim.tex | 4 ++-- src/chapters/results.tex | 6 +++--- 5 files changed, 11 insertions(+), 11 deletions(-) diff --git a/src/chapters/dram.tex b/src/chapters/dram.tex index be40de4..9a9155f 100644 --- a/src/chapters/dram.tex +++ b/src/chapters/dram.tex @@ -109,16 +109,16 @@ Such a cube is then placed onto a common silicon interposer that connects the \a This packaging brings the memory closer to the \ac{mpsoc}, which allows for an exceptionally wide memory interface and a minimized bus capacitance. For example, compared to a conventional \ac{ddr4} \ac{dram}, this tight integration enables $\qtyrange[range-units=single]{10}{13}{\times}$ more \ac{io} connections to the \ac{mpsoc} and a $\qtyrange[range-units=single]{2}{2.4}{\times}$ lower energy per bit-transfer \cite{lee2021}. -A memory stack supports up to 8 independent memory channels, each containing up to 16 banks divided into 4 bank groups. +A memory stack supports up to eight independent memory channels, each containing up to 16 banks divided into four bank groups. The command, address, and data buses operate at \ac{ddr}, i.e., they transfer two words per interface clock cycle $t_{CK}$. The \aca{hbm} standard defines two modes of operation~-~in legacy mode, the data bus operates as is. In \ac{pch} mode, the data bus is split in half (i.e., 64-bit) to allow independent data transfer, further increasing parallelism, while sharing a common command and address bus between the two \acp{pch}.
With a $t_{CK}$ of $\qty{1}{\giga\hertz}$, \aca{hbm} achieves a pin transfer rate of $\qty{2}{\giga T \per\second}$, which results in $\qty[per-mode=symbol]{16}{\giga\byte\per\second}$ per \ac{pch} and a total of $\qty[per-mode = symbol]{256}{\giga\byte\per\second}$ for the 1024-bit wide data bus of each stack. A single data transfer is performed with either a \ac{bl} of 2 in legacy mode or 4 in \ac{pch} mode. -Thus, accessing \aca{hbm} in \ac{pch} mode transmits a $\qty{256}{\bit}=\qty{32}{\byte}$ burst with a \ac{bl} of 4 over the $\qty{64}{\bit}$ wide data bus. +Thus, accessing \aca{hbm} in \ac{pch} mode transmits a $\qty{256}{\bit}=\qty{32}{\byte}$ burst with a \ac{bl} of four over the $\qty{64}{\bit}$ wide data bus. \cref{img:hbm} illustrates the internal architecture of a single memory die. -It consists of 2 independent channels, each with 2 \acp{pch} of 4 bank groups with 4 banks each, resulting in 16 banks per \ac{pch}. +It consists of two independent channels, each with two \acp{pch} of four bank groups with four banks each, resulting in 16 banks per \ac{pch}. In the center of the die, the \acp{tsv} connect the die to the next die above it and the previous die below it. \begin{figure} diff --git a/src/chapters/implementation/library.tex b/src/chapters/implementation/library.tex index b509235..d63b5dc 100644 --- a/src/chapters/implementation/library.tex +++ b/src/chapters/implementation/library.tex @@ -21,7 +21,7 @@ Ideally, this memory region is also set as non-cacheable, so that the messages d Alternatively, the software library must ensure that the cache is flushed after the \ac{json} message is written to the memory region. With the mode setting implemented, the shared library also provides type definitions to represent the \ac{pim} instructions in memory and to transfer entire microkernels consisting of 32 instructions to the processing units. 
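As a sanity check, the \aca{hbm} bandwidth and burst arithmetic quoted in the paragraph above can be reproduced with a short Rust sketch. This is purely illustrative back-of-the-envelope code; the function name and constants are assumptions mirroring the text, not part of the thesis's library or simulator.

```rust
/// DDR bandwidth in bytes per second: two transfers per interface clock
/// cycle over a bus of the given width (a rough helper, not simulator code).
fn ddr_bandwidth_bytes_per_s(clock_hz: u64, bus_bits: u64) -> u64 {
    2 * clock_hz * bus_bits / 8
}

fn main() {
    let t_ck_hz = 1_000_000_000; // 1 GHz interface clock

    // 64-bit pseudo channel: 16 GB/s
    assert_eq!(ddr_bandwidth_bytes_per_s(t_ck_hz, 64), 16_000_000_000);
    // 1024-bit stack-wide data bus: 256 GB/s
    assert_eq!(ddr_bandwidth_bytes_per_s(t_ck_hz, 1024), 256_000_000_000);

    // PCH mode: a burst length of 4 over the 64-bit bus yields a 32-byte burst
    let burst_bytes = 4 * 64 / 8;
    assert_eq!(burst_bytes, 32);
}
```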
-An instruction is simply represented by one of 9 different \texttt{enum} variants, each holding its necessary fields, such as the source or destination register files, as shown in \cref{lst:instruction_enums}. +An instruction is simply represented by one of nine different \texttt{enum} variants, each holding its necessary fields, such as the source or destination register files, as shown in \cref{lst:instruction_enums}. \begin{listing} \begin{minipage}[t,c]{0.45\linewidth} \begin{minted}{rust} @@ -118,7 +118,7 @@ Since a memory request triggers the execution of all processing units in a \ac{p From the point of view of the processor, only data in the first (even) or second (odd) bank is ever accessed. This requires a special indexing of the input vectors and matrices, since they must be accessed very sparsely. -In the case of the input vector, where one 16-wide \ac{simd} vector of \ac{fp16} elements is repeated as often as there are banks in a \ac{pch}, a burst access must occur every $\qty{32}{\byte}\cdot\mathrm{number\ of\ banks\ per\ \ac{pch}}=\qty{512}{\byte}$, over the entire interleaved input vector for a maximum of 8 times. +In the case of the input vector, where one 16-wide \ac{simd} vector of \ac{fp16} elements is repeated as often as there are banks in a \ac{pch}, a burst access must occur every $\qty{32}{\byte}\cdot\mathrm{number\ of\ banks\ per\ \ac{pch}}=\qty{512}{\byte}$, over the entire interleaved input vector for a maximum of eight times. This way, all available \ac{grf}-A registers in a processing unit are used to hold its copy of the input vector. To then perform the repeated \ac{mac} operation with the weight matrix as bank data, a similar logic must be applied. Since each row of the matrix resides on its own memory bank, with an interleaving of the size of a 16-wide \ac{simd} vector of \ac{fp16} elements, one memory access must likewise be issued every $\qty{512}{\byte}$.
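The sparse access pattern for the interleaved input vector described above can be sketched in Rust. The helper below is hypothetical (its name and the host-side perspective are assumptions, not part of the shared library); only the constants follow the text.

```rust
/// Byte offsets at which the host issues a burst access so that every
/// processing unit receives one 32-byte segment of the interleaved input
/// vector (hypothetical helper, not taken from the actual shared library).
fn input_vector_offsets() -> Vec<usize> {
    const BURST_BYTES: usize = 32;   // one 16-wide SIMD vector of FP16
    const BANKS_PER_PCH: usize = 16; // banks sharing one pseudo channel
    const GRF_A_ENTRIES: usize = 8;  // GRF-A registers filled per unit
    let stride = BURST_BYTES * BANKS_PER_PCH; // 32 B * 16 = 512 B
    (0..GRF_A_ENTRIES).map(|i| i * stride).collect()
}

fn main() {
    let offsets = input_vector_offsets();
    assert_eq!(offsets.len(), 8);             // at most eight burst accesses
    assert_eq!(offsets[1] - offsets[0], 512); // one access every 512 bytes
    assert_eq!(*offsets.last().unwrap(), 3584);
}
```

The same stride of $\qty{512}{\byte}$ applies to the weight-matrix accesses, since each matrix row is interleaved across the banks in the same way.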
diff --git a/src/chapters/introduction.tex b/src/chapters/introduction.tex index 85436ed..449e6e4 100644 --- a/src/chapters/introduction.tex +++ b/src/chapters/introduction.tex @@ -26,7 +26,7 @@ It is therefore required to achieve radical improvements in the energy efficienc In recent years, domain-specific accelerators, such as \acp{gpu} or \acp{tpu}, have become very popular, as they provide orders of magnitude higher performance and energy efficiency for the training and inference of \ac{ai} applications than general-purpose processors \cite{kwon2021}. However, research must also take into account the off-chip memory~-~moving data between the processor and the \ac{dram} is very costly, as fetching the operands consumes more power than performing the computation on them: -While performing a double-precision floating-point operation on a $\qty{28}{\nano\meter}$ technology might consume an energy of about $\qty{20}{\pico\joule}$, fetching the operands from \ac{dram} consumes almost 3 orders of magnitude more energy at about $\qty{16}{\nano\joule}$ \cite{dally2010}. +While performing a double-precision floating-point operation on a $\qty{28}{\nano\meter}$ technology might consume an energy of about $\qty{20}{\pico\joule}$, fetching the operands from \ac{dram} consumes almost three orders of magnitude more energy at about $\qty{16}{\nano\joule}$ \cite{dally2010}. Furthermore, many types of \acp{dnn} used for language and speech processing, such as \acp{rnn}, \acp{mlp} and some layers of \acp{cnn}, are severely limited by the memory bandwidth that the \ac{dram} can provide, making them \textit{memory-bound} \cite{he2020}. In contrast, compute-intensive workloads, such as visual processing, are referred to as \textit{compute-bound}.
diff --git a/src/chapters/pim.tex b/src/chapters/pim.tex index 6ffefa6..5361a12 100644 --- a/src/chapters/pim.tex +++ b/src/chapters/pim.tex @@ -177,9 +177,9 @@ The \ac{crf} acts as an instruction buffer, holding the 32 32-bit instructions t One program that is stored in the \ac{crf} is called a \textit{microkernel}. As explained earlier, the operands of an instruction come either directly from the bank or from the \acp{srf} or \acp{grf}. Each \ac{grf} consists of 16 registers, each with the \aca{hbm} prefetch size of 256 bits, so that each entry can hold the data of a full memory burst. -The \ac{grf} of a processing unit is divided into two halves (\ac{grf}-A and \ac{grf}-B), with 8 register entries allocated to each of the two banks. +The \ac{grf} of a processing unit is divided into two halves (\ac{grf}-A and \ac{grf}-B), with eight register entries allocated to each of the two banks. Finally, in the \acp{srf}, a 16-bit scalar value is replicated 16 times as it is fed into the 16-wide \ac{simd} \ac{fpu} as a constant summand or factor for an addition or multiplication. -The \ac{srf} is also divided into two halves (\ac{srf}-A and \ac{srf}-M) for addition and multiplication, with 8 entries each. +The \ac{srf} is also divided into two halves (\ac{srf}-A and \ac{srf}-M) for addition and multiplication, with eight entries each. This processing unit architecture is illustrated in \cref{img:pcu}, along with the local bus interfaces to its even and odd bank, and the control unit that decodes the instructions and keeps track of the program counter. \begin{figure} diff --git a/src/chapters/results.tex b/src/chapters/results.tex index ff94cd8..489909b 100644 --- a/src/chapters/results.tex +++ b/src/chapters/results.tex @@ -10,12 +10,12 @@ A set of simulations is then run based on these parameters and the resulting per \subsection{System Architecture} The memory configuration used in the simulations has already been partially introduced in \cref{sec:memory_configuration}.
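The register-file capacities described in the pim chapter above can be summarized in a small Rust sketch. The type and field names here are purely illustrative assumptions, not taken from the \aca{fimdram} specification; only the sizes follow the text.

```rust
// Illustrative layout of one processing unit's register files; the names
// are assumptions, only the entry counts and widths mirror the text.
// FP16 values are modeled as raw u16 bit patterns.
#[allow(dead_code)]
struct ProcessingUnitRegs {
    grf_a: [[u16; 16]; 8], // GRF half A: 8 entries of 256 bit (16 FP16 lanes)
    grf_b: [[u16; 16]; 8], // GRF half B: 8 entries for the other bank
    srf_a: [u16; 8],       // scalar summands for addition (SRF-A)
    srf_m: [u16; 8],       // scalar factors for multiplication (SRF-M)
    crf: [u32; 32],        // microkernel buffer: 32 instructions of 32 bit
}

fn main() {
    // Each GRF half holds 8 entries * 32 bytes = 256 bytes.
    assert_eq!(std::mem::size_of::<[[u16; 16]; 8]>(), 256);
    // Total: 2 * 256 B (GRF) + 2 * 16 B (SRF) + 128 B (CRF) = 672 bytes.
    assert_eq!(std::mem::size_of::<ProcessingUnitRegs>(), 672);
}
```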
-Each \ac{pim}-enabled \ac{pch} contains 8 processing units, each of which is connected to 2 memory banks. +Each \ac{pim}-enabled \ac{pch} contains eight processing units, each of which is connected to two memory banks. A processing unit operates at the same frequency as a \aca{hbm} \ac{dram} device with $\qty{250}{\mega\hertz}$. The external clocking of the memory bus itself is $\qty{4}{\times}$ higher with a frequency of $\qty{1}{\giga\hertz}$; the data, address, and command buses of \aca{hbm} operate at \ac{ddr} \cite{lee2021}. Thus, with both the 16-wide \ac{fp} adder and the 16-wide \ac{fp} multiplier, a single processing unit achieves a throughput of $\num{2} \cdot \qty{16}{FLOP} \cdot \qty{250}{\mega\hertz}=\qty{8}{\giga FLOPS}$. In total, the 16 processing units in a memory channel provide a throughput of $\num{16}\cdot\qty{8}{\giga FLOPS}=\qty{128}{\giga FLOPS}$. -To compare this throughput with the vector processing unit of a real processor, a very simplified assumption can be made based on the ARM NEON architecture, which holds 8 \ac{fp16} numbers in a single $\qty{128}{\bit}$ vector register \cite{arm2020}. +To compare this throughput with the vector processing unit of a real processor, a very simplified assumption can be made based on the ARM NEON architecture, which holds eight \ac{fp16} numbers in a single $\qty{128}{\bit}$ vector register \cite{arm2020}. Assuming the single processor core runs at a frequency of $\qty{3}{\giga\hertz}$, the vector processing unit can achieve a maximum throughput of $\qty{8}{FLOP} \cdot \qty{3}{\giga\hertz}=\qty{24}{\giga FLOPS}$, which is about $\qty{5}{\times}$ less than the \aca{fimdram} throughput of a single memory channel. The simulated ARM system also contains a two-level cache hierarchy with a cache size of $\qty{16}{\kibi\byte}$ for the L1 cache and $\qty{256}{\kibi\byte}$ for the L2 cache.
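The peak-throughput arithmetic above can be checked with a short Rust sketch. This is a simplifying peak estimate in the same spirit as the text; the function name is an illustrative assumption, and none of this is simulator code.

```rust
/// Peak throughput in FLOPS: number of FPUs times SIMD width times clock
/// frequency (a simplified peak estimate, not a measured value).
fn peak_flops(fpus: u64, simd_width: u64, clock_hz: u64) -> u64 {
    fpus * simd_width * clock_hz
}

fn main() {
    // One processing unit: FP adder + FP multiplier, 16 lanes, 250 MHz
    let pu = peak_flops(2, 16, 250_000_000);
    assert_eq!(pu, 8_000_000_000); // 8 GFLOPS

    // 16 processing units per memory channel
    assert_eq!(16 * pu, 128_000_000_000); // 128 GFLOPS

    // NEON comparison: one FPU with eight FP16 lanes at 3 GHz
    let neon = peak_flops(1, 8, 3_000_000_000);
    assert_eq!(neon, 24_000_000_000); // 24 GFLOPS

    // The channel outperforms the assumed NEON unit by roughly 5x
    assert_eq!((16 * pu) / neon, 5);
}
```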
@@ -275,7 +275,7 @@ Since the Samsung \ac{fpga} platform can be assumed to be a highly optimized acc Samsung's ADD microbenchmark shows a small variance between the different input dimensions, with an average speedup of around $\qty{1.6}{\times}$. The simulated platform likewise shows only a small variance, but with an average speedup of around $\qty{12.7}{\times}$, which is almost an order of magnitude higher than the findings of Samsung. This may be a surprising result, since such vector operations are inherently memory-bound and should be a prime candidate for the use of \ac{pim}. -Samsung explains its low value of $\qty{1.6}{\times}$ by the fact that after 8 \ac{rd} accesses, the processor has to introduce a memory barrier instruction, resulting in a severe performance hit \cite{lee2021}. +Samsung explains its low value of $\qty{1.6}{\times}$ by the fact that after eight \ac{rd} accesses, the processor has to introduce a memory barrier instruction, resulting in a severe performance hit \cite{lee2021}. However, this memory barrier has also been implemented in the VADD kernel of the simulations, which still show a significant performance gain. The \ac{gemv} microbenchmark, on the other hand, shows a more closely matching result, with an average speedup of $\qty{8.3}{\times}$ for Samsung's implementation, while the simulation of this thesis achieved an average speedup of $\qty{9.0}{\times}$, which is well within the reach of the real hardware implementation.