Fix bug in Simulation chapter
This commit is contained in:
@@ -16,7 +16,7 @@ The external clocking of the memory bus itself is $\qty{4}{\times}$ higher with
|
||||
Thus, with both the 16-wide \ac{fp} adder and the 16-wide \ac{fp} multiplier, a single processing unit achieves a throughput of $\num{2} \cdot \qty{16}{FLOP} \cdot \qty{250}{\mega\hertz}=\qty{8}{\giga FLOPS}$.
|
||||
In total, the 16 processing units in a memory channel provide a throughput of $\num{16}\cdot\qty{8}{\giga FLOPS}=\qty{128}{\giga FLOPS}$.
|
||||
To compare this throughput to the vector processing unit of a real processor, a highly simplified assumption can be made based on the ARM NEON architecture that holds 8 \ac{fp16} numbers in a single $\qty{128}{\bit}$ vector register \cite{arm2020}.
|
||||
Assuming the single processor core runs at a frequency of $\qty{3}{\giga\hertz}$, the vector processing unit can achieve a maximum throughput of $\qty{8}{FLOP} \cdot \qty{3}{\giga\hertz}=\qty{24}{FLOPS}$, which is about $\qty{5}{\times}$ less than the \aca{fimdram} throughput of a single channel.
|
||||
Assuming the single processor core runs at a frequency of $\qty{3}{\giga\hertz}$, the vector processing unit can achieve a maximum throughput of $\qty{8}{FLOP} \cdot \qty{3}{\giga\hertz}=\qty{24}{\giga FLOPS}$, which is about $\qty{5}{\times}$ less than the \aca{fimdram} throughput of a single channel.
|
||||
|
||||
% some implementation details
|
||||
% hbm size, channel...
|
||||
|
||||
Reference in New Issue
Block a user