Smaller fixes
@@ -12,12 +12,12 @@ A set of simulations is then run based on these parameters and the resulting per
The memory configuration used in the simulations has already been partially introduced in \cref{sec:memory_configuration}.
Each \ac{pim}-enabled \ac{pch} contains eight processing units, each of which is connected to two memory banks.
A processing unit operates at the same frequency as a \aca{hbm} \ac{dram} device, namely $\qty{250}{\mega\hertz}$.
The external clocking of the memory bus itself is $\qty{4}{\times}$ higher, with a frequency of $\qty{1}{\giga\hertz}$.
The data, address, and command buses of \aca{hbm} operate at \ac{ddr} \cite{lee2021}.
Thus, with both the 16-wide \ac{fp} adder and the 16-wide \ac{fp} multiplier, a single processing unit achieves a throughput of ${\num{2} \cdot \qty{16}{FLOP} \cdot \qty{250}{\mega\hertz}=\qty{8}{\giga FLOPS}}$.
In total, the 16 processing units in a memory channel provide a throughput of $\num{16}\cdot\qty{8}{\giga FLOPS}=\qty{128}{\giga FLOPS}$.
To compare this throughput with the vector processing unit of a real processor, a very simplified assumption can be made based on the ARM NEON architecture, which holds eight \ac{fp16} numbers in a single $\qty{128}{\bit}$ vector register \cite{arm2020}.
Assuming a single processor core runs at a frequency of $\qty{3}{\giga\hertz}$, the vector processing unit can achieve a maximum throughput of ${\qty{8}{FLOP} \cdot \qty{3}{\giga\hertz}=\qty{24}{\giga FLOPS}}$, which is about $\qty{5}{\times}$ less than the \aca{fimdram} throughput of a single memory channel.
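The throughput figures above can be verified with a short back-of-the-envelope script; the numbers are taken from the text, the variable names are illustrative, and the $\qty{3}{\giga\hertz}$ NEON core clock follows the simplifying assumption stated above:

```python
# Sanity check of the throughput figures quoted in the text.

PU_FREQ_HZ = 250e6        # processing unit clock (250 MHz)
FLOP_PER_CYCLE = 2 * 16   # 16-wide FP adder plus 16-wide FP multiplier
PUS_PER_CHANNEL = 16      # processing units per memory channel

pu_gflops = FLOP_PER_CYCLE * PU_FREQ_HZ / 1e9
channel_gflops = PUS_PER_CHANNEL * pu_gflops

# Simplified ARM NEON reference: a 128-bit register holds eight FP16
# values, assuming one vector operation per cycle at a 3 GHz core clock.
NEON_LANES = 128 // 16
neon_gflops = NEON_LANES * 3e9 / 1e9

print(pu_gflops)       # 8.0 GFLOPS per processing unit
print(channel_gflops)  # 128.0 GFLOPS per memory channel
print(neon_gflops)     # 24.0 GFLOPS for the NEON unit
print(channel_gflops / neon_gflops)  # ratio of about 5.3
```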
The simulated ARM system also contains a two-level cache hierarchy with a cache size of $\qty{16}{\kibi\byte}$ for the L1 cache and $\qty{256}{\kibi\byte}$ for the L2 cache.

% some implementation details
@@ -74,7 +74,7 @@ This allows an exaggerated evaluation of the performance gains of \ac{pim} in an
% comparison with normal clock and infinite compute (always 4 simulations, or 5 including real hardware)

\subsection{Microbenchmarks}
\subsubsection{Vector Operations}
% Vector ADD and Vector MUL
% Vector scalar ADD and vector scalar MUL (HCAL) (will probably be skipped)
@@ -260,7 +260,7 @@ Level 4 & (8k $\times$ 8k) & (16M)
\label{tab:samsung_dimensions}
\end{table}

As can be seen, the dimensions for the \ac{gemv} benchmark and the vector add operation, which corresponds to the VADD benchmark of this thesis, are the same as those used in the previously discussed simulations.
Therefore, the simulations can be directly compared with the real system manufactured by Samsung to assess how accurately they reproduce its behavior.

Each of Samsung's benchmarks is run with different batch sizes, where a larger batch size allows for better cache utilization, as multiple operations are performed on the same data set, making the workload less memory-bound and therefore \ac{pim} less effective.
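This batch-size effect can be illustrated with a simple roofline-style estimate. The bandwidth and peak-compute figures below are hypothetical placeholders rather than Samsung's parameters; the point is only that reusing operands across a batch raises the number of FLOPs performed per byte fetched from memory:

```python
# Roofline-style sketch (hypothetical numbers): a larger batch reuses the
# same operands for more operations, raising arithmetic intensity and
# shifting the kernel from memory-bound towards compute-bound.

PEAK_GFLOPS = 128.0  # peak compute, e.g. one PIM-enabled channel
PEAK_GBPS = 32.0     # hypothetical memory bandwidth

RIDGE = PEAK_GFLOPS / PEAK_GBPS  # intensity where compute becomes the limit

def arithmetic_intensity(batch_size, flop_per_elem=1, bytes_per_elem=2):
    # Each element is fetched once and reused for every item in the batch.
    return batch_size * flop_per_elem / bytes_per_elem

for batch in (1, 4, 16):
    ai = arithmetic_intensity(batch)
    regime = "memory-bound" if ai < RIDGE else "compute-bound"
    print(f"batch={batch}: intensity={ai} FLOP/byte -> {regime}")
```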
@@ -275,7 +275,7 @@ Since the Samsung \ac{fpga} platform can be assumed to be a highly optimized acc
\end{figure}

Samsung's ADD microbenchmark shows only a small variance between the different input dimensions, with an average speedup of around $\qty{1.6}{\times}$.
On the simulated platform, the variance is similarly limited, but the speedup is approximately $\qty{12.7}{\times}$, which is almost an order of magnitude higher than the value reported by Samsung.
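A quick check of the gap between the two speedup figures:

```python
# Ratio between the simulated speedup and Samsung's measured speedup,
# both taken from the text above.
samsung_speedup = 1.6
simulated_speedup = 12.7
ratio = simulated_speedup / samsung_speedup
print(ratio)  # close to 8x, i.e. almost an order of magnitude
```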
This may be a surprising result, since such vector operations are inherently memory-bound and should be a prime candidate for the use of \ac{pim}.
Samsung explains its low value of $\qty{1.6}{\times}$ by the fact that after eight \ac{rd} accesses, the processor has to introduce a memory barrier instruction, resulting in a severe performance hit \cite{lee2021}.
However, this memory barrier has also been implemented in the VADD kernel of the simulations, which still show a significant performance gain.