From e1f7033883207418c0fe1c97a86e03f046cf6f1f Mon Sep 17 00:00:00 2001 From: Derek Christ Date: Sat, 9 Mar 2024 19:34:41 +0100 Subject: [PATCH] Smaller fixes --- src/chapters/results.tex | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/src/chapters/results.tex b/src/chapters/results.tex index 29ab964..b1ab403 100644 --- a/src/chapters/results.tex +++ b/src/chapters/results.tex @@ -12,12 +12,12 @@ A set of simulations is then run based on these parameters and the resulting per The memory configuration used in the simulations has already been partially introduced in \cref{sec:memory_configuration}. Each \ac{pim}-enabled \ac{pch} contains eight processing units, each of which is connected to two memory banks. A processing unit operates at the same frequency as a \aca{hbm} \ac{dram} device with $\qty{250}{\mega\hertz}$. -The external clocking of the memory bus itself is $\qty{4}{\times}$ higher with a frequency of $\qty{1}{\giga\hertz}$ +The external clocking of the memory bus itself is $\qty{4}{\times}$ higher with a frequency of $\qty{1}{\giga\hertz}$. The data, address and command bus of \aca{hbm} operate at \ac{ddr} \cite{lee2021}. -Thus, with both the 16-wide \ac{fp} adder and the 16-wide \ac{fp} multiplier, a single processing unit achieves a throughput of $\num{2} \cdot \qty{16}{FLOP} \cdot \qty{250}{\mega\hertz}=\qty{8}{\giga FLOPS}$. +Thus, with both the 16-wide \ac{fp} adder and the 16-wide \ac{fp} multiplier, a single processing unit achieves a throughput of ${\num{2} \cdot \qty{16}{FLOP} \cdot \qty{250}{\mega\hertz}=\qty{8}{\giga FLOPS}}$. In total, the 16 processing units in a memory channel provide a throughput of $\num{16}\cdot\qty{8}{\giga FLOPS}=\qty{128}{\giga FLOPS}$. To compare this throughput with the vector processing unit of a real processor, a very simplified assumption can be made based on the ARM NEON architecture, which holds eight \ac{fp16} numbers in a single $\qty{128}{\bit}$ vector register \cite{arm2020}. 
-Assuming a single processor core runs at a frequency of $\qty{3}{\giga\hertz}$, the vector processing unit can achieve a maximum throughput of $\qty{8}{FLOP} \cdot \qty{3}{\giga\hertz}=\qty{24}{\giga FLOPS}$, which is about $\qty{5}{\times}$ less than the \aca{fimdram} throughput of a single memory channel. +Assuming a single processor core runs at a frequency of $\qty{3}{\giga\hertz}$, the vector processing unit can achieve a maximum throughput of ${\qty{8}{FLOP} \cdot \qty{3}{\giga\hertz}=\qty{24}{\giga FLOPS}}$, which is about $\qty{5}{\times}$ less than the \aca{fimdram} throughput of a single memory channel. The simulated ARM system also contains a two-level cache hierarchy with a cache size of $\qty{16}{\kibi\byte}$ for the L1 cache and $\qty{256}{\kibi\byte}$ for the L2 cache. % some implementation details @@ -74,7 +74,7 @@ This allows an exaggerated evaluation of the performance gains of \ac{pim} in an % comparison with normal clock and infinite compute (immer 4 simulationen, bzw. 5 mit echter hardware) -\subsection{Simulation Results} +\subsection{Microbenchmarks} \subsubsection{Vector Operations} % Vector ADD und Vector MUL % Vector Skalar ADD und Vector Skalar MUL (HCAL) (wird wohl übersprungen) @@ -260,7 +260,7 @@ Level 4 & (8k $\times$ 8k) & (16M) \label{tab:samsung_dimensions} \end{table} -As can be seen, the dimensions for the \ac{gemv} benchmark and the vector add operations, which corresponds to the VADD benchmark of this thesis, match the dimensions used in the previously discussed simulations. +As can be seen, the dimensions for the \ac{gemv} benchmark and the vector add operations, which correspond to the VADD benchmark of this thesis, are the same as those used in the previously discussed simulations. Therefore, the simulations can be directly compared to gain a good understanding of how accurate they are in comparison to the real system manufactured by Samsung. 
Each of Samsung's benchmarks is run with different batch sizes, where a larger batch size allows for better cache utilization as multiple operations are performed on the same data set, making the workload less memory-bound and therefore \ac{pim} less effective. @@ -275,7 +275,7 @@ Since the Samsung \ac{fpga} platform can be assumed to be a highly optimized acc \end{figure} The performed ADD microbenchmark of Samsung show a small variance between the different input dimensions with an average speedup value of around $\qty{1.6}{\times}$. -When compared to the simulated platform, the variance is also limited with a value of around $\qty{12.7}{\times}$, which almost an order of magnitude higher than the findings of Samsung. +Compared to the simulated platform, the variance is also limited, but the speedup is approximately $\qty{12.7}{\times}$, which is almost an order of magnitude higher than the findings of Samsung. This may be a surprising result, since such vector operations are inherently memory-bound and should be a prime candidate for the use of \ac{pim}. Samsung explains its low value of $\qty{1.6}{\times}$ by the fact that after eight \ac{rd} accesses, the processor has to introduce a memory barrier instruction, resulting in a severe performance hit \cite{lee2021}. However, this memory barrier has also been implemented in the VADD kernel of the simulations, which still show a significant performance gain.
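The peak-throughput figures quoted in the patched text above can be cross-checked with a short sketch. This is an illustrative calculation only, not code from the thesis; the variable names are invented, and the figure of 16 processing units per memory channel is taken directly from the text.

```python
# Sketch (not from the thesis): reproduce the peak-throughput arithmetic
# for the FIMDRAM configuration described in the patched section.

PU_FREQ_GHZ = 0.25        # processing unit runs at 250 MHz (HBM DRAM frequency)
FLOPS_PER_CYCLE = 2 * 16  # 16-wide FP adder + 16-wide FP multiplier per cycle
PUS_PER_CHANNEL = 16      # 16 processing units per memory channel, per the text

pu_gflops = FLOPS_PER_CYCLE * PU_FREQ_GHZ       # 2 * 16 FLOP * 250 MHz = 8 GFLOPS
channel_gflops = PUS_PER_CHANNEL * pu_gflops    # 16 * 8 GFLOPS = 128 GFLOPS

# Simplified ARM NEON comparison: 8 FP16 lanes in one 128-bit register,
# one vector operation per cycle at an assumed 3 GHz core clock.
neon_gflops = 8 * 3.0                           # 24 GFLOPS

ratio = channel_gflops / neon_gflops            # roughly 5x, matching the text

print(pu_gflops, channel_gflops, neon_gflops, round(ratio, 1))
```

Under these simplified assumptions the single-channel FIMDRAM throughput exceeds the NEON estimate by a factor of about 5.3, consistent with the "about 5× less" claim in the text.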