diff --git a/src/acronyms.tex b/src/acronyms.tex
index 2dba8d1..1c7b303 100644
--- a/src/acronyms.tex
+++ b/src/acronyms.tex
@@ -22,6 +22,10 @@
 short = AI,
 long = artificial intelligence,
 }
+\DeclareAcronym{cpu}{
+short = CPU,
+long = central processing unit,
+}
 \DeclareAcronym{gpu}{
 short = GPU,
 long = graphics processing unit,
@@ -204,6 +208,10 @@
 short = FPU,
 long = floating-point unit,
 }
+\DeclareAcronym{fp}{
+short = FP,
+long = floating-point,
+}
 \DeclareAcronym{crf}{
 short = CRF,
 long = command register file,
diff --git a/src/appendix.tex b/src/appendix.tex
index f4a5669..2d858db 100644
@@ -4,4 +4,5 @@
 % etwas source code,
 % von der vm
 % einige microkernels
+% also execution of the microkernels
 % ...
diff --git a/src/chapters/implementation/kernel.tex b/src/chapters/implementation/kernel.tex
index ed7755d..9656fe5 100644
@@ -93,6 +93,7 @@
 In order to incorporate this memory allocator, it was initialized by providing a
 The allocator can then dynamically use sections of this arena to allocate the \ac{pim} data structures.
 \subsubsection{Memory Configuration}
+\label{sec:memory_configuration}
 As already discussed in \cref{sec:memory_layout} and in \cref{sec:microkernel_execution}, certain requirements are posed onto the configuration of the memory system, such as the \ac{am}.
 These configurations can be set when instantiating DRAMSys while it is being connected to the gem5 memory bus.
diff --git a/src/chapters/results.tex b/src/chapters/results.tex
index e426c0a..1ad7ab4 100644
@@ -1,4 +1,86 @@
 \section{Simulation Results}
 \label{sec:results}
-% gem5 m5ops routines/implementation in the kernel
+This section explores the potential performance improvement of \aca{fimdram} across different system configurations and workloads.
+After a brief introduction to the simulated system architecture, an estimated theoretical performance gain is calculated.
+This is followed by a discussion of the measurement accuracy and suggestions for improving the measurement environment.
+Furthermore, the variations of the system parameters for each workload are explored.
+The set of simulations is then run with these parameters, and the resulting performance improvements are analyzed.
+Finally, the execution time of the operand initialization is compared with the microkernel execution time to estimate the setup overhead of \aca{fimdram}.
+
+\subsection{System Architecture}
+The memory configuration used in the simulations has already been partially introduced in \cref{sec:memory_configuration}.
+Each \ac{pim}-enabled \ac{pch} contains 8 processing units, each of which is connected to 2 memory banks.
+A processing unit operates at the same frequency as the \aca{hbm} \ac{dram} device, namely $\qty{250}{\mega\hertz}$.
+The external clock of the memory bus itself is $\num{4}\times$ higher at a frequency of $\qty{1}{\giga\hertz}$; the data, address, and command buses of \aca{hbm} operate at \ac{ddr} \cite{lee2021}.
+Thus, with both the 16-wide \ac{fp} adder and the 16-wide \ac{fp} multiplier, a single processing unit achieves a throughput of $\num{2} \cdot \qty{16}{FLOP} \cdot \qty{250}{\mega\hertz}=\qty{8}{\giga FLOPS}$.
+In total, the 16 processing units in a memory channel provide a throughput of $\num{16}\cdot\qty{8}{\giga FLOPS}=\qty{128}{\giga FLOPS}$.
+To compare this throughput to the vector processing unit of a real processor, a highly simplified assumption can be made based on the ARM NEON architecture, which holds 8 \ac{fp16} numbers in a single $\qty{128}{\bit}$ vector register \cite{arm2020}.
+Assuming the single processor core runs at a frequency of $\qty{3}{\giga\hertz}$, the vector processing unit can achieve a maximum throughput of $\qty{8}{FLOP} \cdot \qty{3}{\giga\hertz}=\qty{24}{\giga FLOPS}$, which is about $\num{5}\times$ less than the \aca{fimdram} throughput of a single channel.
+
+% some implementation details
+% hbm size, channel...
+% operating at ...MHz
+% theoretical bandwidth and FLOPS...
+% very simple comparison to ARM FLOPS/cycle -> ratio in the optimal case
+
+\subsection{Accuracy and Comparability}
+When interpreting the following simulation results, it is important to note that the system configuration does not strictly reflect a system on which a real \ac{dnn} inference would be performed.
+Firstly, implementing the workloads on a bare-metal kernel simplifies the execution environment of the processor, since no other processes interact with it in any way.
+The workload process is never preempted, so the effect of an interruption during the critical \ac{pim} microkernel execution cannot be analyzed.
+Secondly, for performance reasons, a \ac{dnn} inference is typically not run on a \ac{cpu} but on \acp{gpu} or \acp{tpu}.
+These accelerators may have significantly different execution behavior: a \ac{gpu} may aggressively accelerate inference by performing many operations in parallel, and a \ac{tpu} may use specialized structures for matrix-vector operations, such as systolic arrays.
+Such differences would also be reflected in the memory access pattern and may be subject to other effects that alter the behavior of \aca{fimdram}.
+Furthermore, since the mode switching of \aca{fimdram} is not measured in the simulations, the setup overhead is limited to the required layout conversions of the input operands.
+The high overhead of a \ac{pim} operation on a small data set may therefore be underrepresented.
+Nevertheless, the simulations performed provide an informative insight into the effectiveness of \aca{fimdram} and the workloads suited to it.
+
+% bare-metal is the optimal case, linux would be a more realistic test environment
+% the setup-time overhead cannot be measured properly
+% inference on a CPU is atypical, a GPU model would be more suitable
+
+\subsection{Objectives}
+The simulations aim to address several objectives.
+As already discussed in \cref{sec:pim}, \ac{pim} aims to accelerate memory-bound problems such as \ac{gemv} and may only show a small performance gain, or even a slowdown, for compute-bound problems such as \ac{gemm}.
+This difference should be analyzed by performing the simulations on a variety of workloads.
+For these workloads, the input dimensions may play an important role in how effective \ac{pim} is.
+Small dimensions suffer strongly from the setup overhead, while for large dimensions this effect may be less significant.
+The performance gains for different operand dimensions should be analyzed, possibly finding a break-even point at which \ac{pim} becomes viable.
+When performing inference of multiple \ac{dnn} layers, an activation function is typically applied to the output of each layer.
+\Aca{fimdram} provides a \ac{relu} operation that can be applied while moving the newly interleaved input vector into the \ac{grf}-A registers.
+The performance gain of applying this operation in memory, instead of on the host processor after reducing the partial sums of the output vector, can be investigated.
+Furthermore, the concrete number of processing units in a \ac{pch} is a compromise against the amount of usable memory area that is removed.
+Using the flexible simulation model, it is possible to analyze the impact of the shared processing unit architecture compared to a hypothetical solution where each bank is connected to its own processing unit.
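The contrast between memory-bound \ac{gemv} and compute-bound \ac{gemm} drawn in the Objectives above can be illustrated with a rough arithmetic-intensity estimate. The following Python sketch is purely illustrative and not part of the thesis code; it assumes fp16 (2-byte) operands and a minimal-traffic model in which every operand element crosses the memory bus exactly once:

```python
# Illustrative sketch (hypothetical, not thesis code): arithmetic intensity
# in FLOP per byte of mandatory memory traffic, assuming fp16 operands.
FP16_BYTES = 2

def gemv_intensity(m: int, n: int) -> float:
    """y = A @ x with A of shape (m, n): 2*m*n FLOPs."""
    flops = 2 * m * n
    bytes_moved = FP16_BYTES * (m * n + n + m)  # read A and x, write y
    return flops / bytes_moved

def gemm_intensity(m: int, k: int, n: int) -> float:
    """C = A @ B with shapes (m, k) x (k, n): 2*m*k*n FLOPs."""
    flops = 2 * m * k * n
    bytes_moved = FP16_BYTES * (m * k + k * n + m * n)  # read A and B, write C
    return flops / bytes_moved

# GEMV stays capped near 1 FLOP/byte regardless of size, so it remains
# memory-bound; square GEMM intensity grows roughly linearly with the
# dimension, so large GEMMs become compute-bound.
print(gemv_intensity(4096, 4096))        # ~1.0
print(gemm_intensity(1024, 1024, 1024))  # ~341
```

Under this simplified model, no amount of extra compute helps GEMV once the memory bandwidth is saturated, which is exactly the regime \ac{pim} targets.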
+
+To evaluate these objectives, each simulation is run in four different configurations.
+Two configurations use a generic ARM processor running at a frequency of $\qty{3}{\giga\hertz}$, once with \ac{pim} enabled and once performing the operations only on the processor; these are intended to represent a realistic system.
+In addition, two configurations with the same ARM processor but with a nearly infinite clock frequency are simulated.
+While these configurations do not reflect a real system, they are used to address the previously mentioned concerns about the meaningfulness of performing the simulations on a \ac{cpu}.
+With infinite computational power, the simulation is guaranteed to be bounded only by the memory system.
+This allows an exaggerated evaluation of the performance gains of \ac{pim} in an optimal environment, where only the effect on memory boundedness can be observed.
+
+% different kernels
+% shared pim units (-> half the rows / half the performance, to be verified)
+% sweep of matrix dimensions rows/columns, break-even point
+% ReLU in DRAM vs on host
+
+% comparison with normal clock and infinite compute (always 4 simulations, or 5 with real hardware)
+
+\subsection{Simulation Results}
+\subsubsection{Workload Kernels}
+% vector ADD and vector MUL
+% vector-scalar ADD and vector-scalar MUL
+% GEMV
+  % Samsung 7.4x-8.9x
+% "inference" with multiple layers
+  % ReLU comparison
+
+% GEMM with heavily interleaved matrices
+
+\subsubsection{Initialization Overhead}
+% estimate the conversion time of the operands relative to the runtime
+
+\subsubsection{Shared Processing Units}
+% shared processing units vs. one processing unit per bank
+% GEMV
+
diff --git a/src/doc.bib b/src/doc.bib
index 13677fd..f6158d0 100644
@@ -24,10 +24,21 @@
 author = {{ARM}},
 date = {2015-03-24},
 url = {https://developer.arm.com/documentation/den0024/latest/},
+urldate = {2024-01-08},
 langid = {english},
 file = {/home/derek/Nextcloud/Verschiedenes/Zotero/storage/KGNI52X5/2015 - ARM Cortex-A Series Programmer’s Guide for ARMv8-A.pdf}
 }
+
+@article{arm2020,
+title = {Neon {{Programmer Guide}} for {{Armv8-A}}: {{Coding}} for {{Neon}}},
+author = {{ARM}},
+date = {2020-07-05},
+url = {https://developer.arm.com/documentation/102159/latest/},
+urldate = {2024-02-21},
+langid = {english},
+file = {/home/derek/Nextcloud/Verschiedenes/Zotero/storage/QQI2QA25/2020 - Neon Programmer Guide for Armv8-A Coding for Neon.pdf}
+}
+
 @online{blas1979,
 title = {{{BLAS}} ({{Basic Linear Algebra Subprograms}})},
 author = {{BLAS}},
@@ -92,7 +103,9 @@
 @article{
 title = {Bare-Metal {{Boot Code}} for {{ARMv8-A Processors}}},
 author = {Gao, William},
-date = {2017},
+date = {2017-03-31},
+url = {https://developer.arm.com/documentation/dai0527/latest/},
+urldate = {2024-01-08},
 langid = {english},
 file = {/home/derek/Nextcloud/Verschiedenes/Zotero/storage/FAN7NPUM/Gao - Bare-metal Boot Code for ARMv8-A Processors.pdf}
 }
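As a cross-check of the System Architecture subsection in results.tex, its peak-throughput estimate can be reproduced numerically. The following Python sketch is illustrative only; every parameter (16-wide \ac{fp} units at 250 MHz, 16 processing units per channel, the simplified 8-lane NEON reference core at 3 GHz) is taken from the text above, and the variable names are hypothetical:

```python
# Sketch reproducing the theoretical throughput figures from the text.
SIMD_WIDTH = 16        # 16-wide fp16 adder and 16-wide fp16 multiplier
PU_CLOCK_HZ = 250e6    # processing unit clock, 250 MHz
PUS_PER_CHANNEL = 16   # 16 processing units per memory channel

# one add and one multiply per cycle -> 2 * 16 FLOP per cycle per unit
pu_flops = 2 * SIMD_WIDTH * PU_CLOCK_HZ      # per processing unit
channel_flops = PUS_PER_CHANNEL * pu_flops   # per memory channel

# simplified NEON reference: 8 fp16 lanes in a 128-bit register at 3 GHz
neon_flops = 8 * 3e9

print(pu_flops / 1e9)              # 8.0  GFLOPS
print(channel_flops / 1e9)         # 128.0 GFLOPS
print(channel_flops / neon_flops)  # ~5.3, the "about 5x" from the text
```

The exact ratio is 128/24 = 16/3, i.e. about 5.3, which matches the "about 5 times" claim in the text.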