Write out some numbers

2024-03-08 23:09:53 +01:00
parent e999f46141
commit 42ef5e7672
5 changed files with 11 additions and 11 deletions


@@ -109,16 +109,16 @@ Such a cube is then placed onto a common silicon interposer that connects the \a
This packaging brings the memory closer to the \ac{mpsoc}, which allows for an exceptionally wide memory interface and a minimized bus capacitance.
For example, compared to a conventional \ac{ddr4} \ac{dram}, this tight integration enables $\qtyrange[range-units=single]{10}{13}{\times}$ more \ac{io} connections to the \ac{mpsoc} and a $\qtyrange[range-units=single]{2}{2.4}{\times}$ lower energy per bit transfer \cite{lee2021}.
A memory stack supports up to eight independent memory channels, each containing up to 16 banks divided into four bank groups.
The command, address, and data buses operate at \ac{ddr}, i.e., they transfer two words per interface clock cycle $t_{CK}$.
The \aca{hbm} standard defines two modes of operation~-~in legacy mode, the data bus operates as is.
In \ac{pch} mode, the data bus is split in half (i.e., 64-bit) to allow independent data transfers, further increasing parallelism, while sharing a common command and address bus between the two \acp{pch}.
With a $t_{CK}$ of $\qty{1}{\giga\hertz}$, \aca{hbm} achieves a pin transfer rate of $\qty{2}{\giga T \per\second}$, which results in $\qty[per-mode=symbol]{16}{\giga\byte\per\second}$ per \ac{pch} and a total of $\qty[per-mode=symbol]{256}{\giga\byte\per\second}$ for the 1024-bit wide data bus of each stack.
A single data transfer is performed with either a \ac{bl} of 2 in legacy mode or 4 in \ac{pch} mode.
Thus, accessing \aca{hbm} in \ac{pch} mode transmits a $\qty{256}{\bit}=\qty{32}{\byte}$ burst with a \ac{bl} of four over the $\qty{64}{\bit}$ wide data bus.
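The bandwidth figures above follow directly from the bus widths and the transfer rate; a short sketch makes the arithmetic explicit (all constants are taken from the text, not from the \aca{hbm} standard tables):

```rust
// Back-of-the-envelope check of the quoted HBM bandwidth numbers.
fn main() {
    let t_ck_hz = 1.0e9; // interface clock t_CK
    let pin_rate = 2.0 * t_ck_hz; // DDR: two transfers per cycle -> 2 GT/s
    let pch_bus_bits = 64.0; // data bus width per pseudo channel
    let stack_bus_bits = 1024.0; // full data bus width per stack
    let pch_gb_s = pch_bus_bits * pin_rate / 8.0 / 1.0e9; // -> 16 GB/s per PCH
    let stack_gb_s = stack_bus_bits * pin_rate / 8.0 / 1.0e9; // -> 256 GB/s per stack
    let burst_bytes = pch_bus_bits * 4.0 / 8.0; // BL4 burst in PCH mode -> 32 B
    println!("{pch_gb_s} GB/s per PCH, {stack_gb_s} GB/s per stack, {burst_bytes} B per burst");
}
```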
\cref{img:hbm} illustrates the internal architecture of a single memory die.
It consists of two independent channels, each with two \acp{pch} of four bank groups with four banks each, resulting in 16 banks per \ac{pch}.
In the center of the die, the \acp{tsv} connect the die to the next die above it and the previous die below it.
\begin{figure}


@@ -21,7 +21,7 @@ Ideally, this memory region is also set as non-cacheable, so that the messages d
Alternatively, the software library must ensure that the cache is flushed after the \ac{json} message is written to the memory region.
With the mode setting implemented, the shared library also provides type definitions to represent the \ac{pim} instructions in memory and to transfer entire microkernels consisting of 32 instructions to the processing units.
An instruction is simply represented by one of nine different \texttt{enum} variants, each holding its necessary fields, such as the source or destination register files, as shown in \cref{lst:instruction_enums}.
\begin{listing}
\begin{minipage}[t,c]{0.45\linewidth}
\begin{minted}{rust}
@@ -118,7 +118,7 @@ Since a memory request triggers the execution of all processing units in a \ac{p
From the point of view of the processor, only data in the first (even) or second (odd) bank is ever accessed.
This requires a special indexing of the input vectors and matrices, since they must be accessed very sparsely.
In the case of the input vector, where one 16-wide \ac{simd} vector of \ac{fp16} elements is repeated as often as there are banks in a \ac{pch}, a burst access must occur every $\qty{32}{\byte}\cdot\mathrm{number\ of\ banks\ per\ \ac{pch}}=\qty{512}{\byte}$, over the entire interleaved input vector for a maximum of eight times.
This way, all available \ac{grf}-A registers in a processing unit are used to hold its copy of the input vector.
To then perform the repeated \ac{mac} operation with the weight matrix as bank data, a similar logic must be applied.
Since each row of the matrix resides in its own memory bank, with an interleaving of the size of a 16-wide \ac{simd} vector of \ac{fp16} elements, one memory access must also be issued every $\qty{512}{\byte}$.
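The resulting sparse access pattern can be sketched in a few lines; the constants follow the figures in the text, while the names are purely illustrative:

```rust
// Sketch of the interleaved indexing described above: one 32-byte burst
// (a 16-wide fp16 SIMD vector) is fetched every 512 bytes, once per
// GRF-A entry, to distribute the input vector over all processing units.
fn main() {
    const BURST_BYTES: usize = 32; // 16 x fp16 = 256 bit = 32 B
    const BANKS_PER_PCH: usize = 16;
    const GRF_A_ENTRIES: usize = 8; // one burst per GRF-A register
    let stride = BURST_BYTES * BANKS_PER_PCH; // 512 B between bursts
    let offsets: Vec<usize> = (0..GRF_A_ENTRIES).map(|i| i * stride).collect();
    println!("{offsets:?}"); // byte offsets of the up-to-eight burst reads
}
```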


@@ -26,7 +26,7 @@ It is therefore required to achieve radical improvements in the energy efficienc
In recent years, domain-specific accelerators, such as \acp{gpu} or \acp{tpu}, have become very popular, as they provide orders of magnitude higher performance and energy efficiency for the training and inference of \ac{ai} applications than general-purpose processors \cite{kwon2021}.
However, research must also take into account the off-chip memory~-~moving data between the processor and the \ac{dram} is very costly, as fetching the operands consumes more power than performing the computation on them:
While performing a double-precision floating-point operation in a $\qty{28}{\nano\meter}$ technology might consume an energy of about $\qty{20}{\pico\joule}$, fetching the operands from \ac{dram} consumes almost three orders of magnitude more energy at about $\qty{16}{\nano\joule}$ \cite{dally2010}.
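The scale of this gap follows directly from the two quoted figures:

```rust
// Ratio between a DRAM operand fetch (~16 nJ) and a double-precision
// FLOP (~20 pJ), using the figures quoted from Dally.
fn main() {
    let flop_j = 20.0e-12;
    let dram_fetch_j = 16.0e-9;
    let ratio = dram_fetch_j / flop_j; // ~800x, i.e. almost 3 orders of magnitude
    println!("fetch/compute energy ratio: {ratio}x");
}
```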
Furthermore, many types of \acp{dnn} used for language and speech processing, such as \acp{rnn}, \acp{mlp} and some layers of \acp{cnn}, are severely limited by the memory bandwidth that the \ac{dram} can provide, making them \textit{memory-bound} \cite{he2020}.
In contrast, compute-intensive workloads, such as visual processing, are referred to as \textit{compute-bound}.


@@ -177,9 +177,9 @@ The \ac{crf} acts as an instruction buffer, holding the 32 32-bit instructions t
One program that is stored in the \ac{crf} is called a \textit{microkernel}.
As explained earlier, the operands of an instruction come either directly from the bank or from the \acp{srf} or \acp{grf}.
Each \ac{grf} consists of 16 registers, each with the \aca{hbm} prefetch size of 256 bits, where each entry can hold the data of a full memory burst.
The \ac{grf} of a processing unit is divided into two halves (\ac{grf}-A and \ac{grf}-B), with eight register entries allocated to each of the two banks.
Finally, in the \acp{srf}, a 16-bit scalar value is replicated 16 times as it is fed into the 16-wide \ac{simd} \ac{fpu} as a constant summand or factor for an addition or multiplication.
It is also divided into two halves (\ac{srf}-A and \ac{srf}-M) for addition and multiplication with eight entries each.
This processing unit architecture is illustrated in \cref{img:pcu}, along with the local bus interfaces to its even and odd bank, and the control unit that decodes the instructions and keeps track of the program counter.
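The register resources of one processing unit can be summarized in a type sketch; the sizes follow the text, while the struct and field names are illustrative and not taken from Samsung's specification:

```rust
// Illustrative model of the per-processing-unit register files.
#[allow(dead_code)]
struct ProcessingUnitRegs {
    crf: [u32; 32],        // microkernel buffer: 32 x 32-bit instructions
    grf_a: [[u16; 16]; 8], // 8 x 256-bit entries (16 x fp16 bit patterns), even bank
    grf_b: [[u16; 16]; 8], // second GRF half, odd bank
    srf_a: [u16; 8],       // scalar summands for the SIMD adder
    srf_m: [u16; 8],       // scalar factors for the SIMD multiplier
}

fn main() {
    // An SRF scalar is replicated across all 16 SIMD lanes before use.
    let splat = |s: u16| [s; 16];
    let lanes = splat(0x3C00); // fp16 bit pattern of 1.0
    assert!(lanes.iter().all(|&l| l == 0x3C00));
    // One GRF entry matches the 32-byte HBM burst size.
    println!("one GRF entry holds {} bytes", std::mem::size_of::<[u16; 16]>());
}
```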
\begin{figure}


@@ -10,12 +10,12 @@ A set of simulations is then run based on these parameters and the resulting per
\subsection{System Architecture}
The memory configuration used in the simulations has already been partially introduced in \cref{sec:memory_configuration}.
Each \ac{pim}-enabled \ac{pch} contains eight processing units, each of which is connected to two memory banks.
A processing unit operates at the same frequency as a \aca{hbm} \ac{dram} device, at $\qty{250}{\mega\hertz}$.
The external clocking of the memory bus itself is $\qty{4}{\times}$ higher with a frequency of $\qty{1}{\giga\hertz}$; the data, address, and command buses of \aca{hbm} operate at \ac{ddr} \cite{lee2021}.
Thus, with both the 16-wide \ac{fp} adder and the 16-wide \ac{fp} multiplier, a single processing unit achieves a throughput of $\num{2} \cdot \qty{16}{FLOP} \cdot \qty{250}{\mega\hertz}=\qty{8}{\giga FLOPS}$.
In total, the 16 processing units in a memory channel provide a throughput of $\num{16}\cdot\qty{8}{\giga FLOPS}=\qty{128}{\giga FLOPS}$.
To compare this throughput with the vector processing unit of a real processor, a very simplified assumption can be made based on the ARM NEON architecture, which holds eight \ac{fp16} numbers in a single $\qty{128}{\bit}$ vector register \cite{arm2020}.
Assuming the single processor core runs at a frequency of $\qty{3}{\giga\hertz}$, the vector processing unit can achieve a maximum throughput of $\qty{8}{FLOP} \cdot \qty{3}{\giga\hertz}=\qty{24}{\giga FLOPS}$, which is about $\qty{5}{\times}$ less than the \aca{fimdram} throughput of a single memory channel.
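The peak-throughput comparison above can be reproduced in a few lines, using only the frequencies and \ac{simd} widths quoted in the text:

```rust
// Reproduces the peak-throughput arithmetic from the text.
fn main() {
    let pu_clock_hz = 250.0e6; // processing unit clock
    let pu_gflops = 2.0 * 16.0 * pu_clock_hz / 1.0e9; // adder + multiplier, 16 lanes -> 8
    let channel_gflops = 16.0 * pu_gflops; // 16 processing units per channel -> 128
    let neon_gflops = 8.0 * 3.0e9 / 1.0e9; // 8 fp16 lanes at 3 GHz -> 24
    println!(
        "{pu_gflops} GFLOPS per unit, {channel_gflops} per channel, {:.2}x over NEON",
        channel_gflops / neon_gflops
    );
}
```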
The simulated ARM system also contains a two-level cache hierarchy with a cache size of $\qty{16}{\kibi\byte}$ for the L1 cache and $\qty{256}{\kibi\byte}$ for the L2 cache.
@@ -275,7 +275,7 @@ Since the Samsung \ac{fpga} platform can be assumed to be a highly optimized acc
Samsung's ADD microbenchmark shows a small variance between the different input dimensions, with an average speedup of around $\qty{1.6}{\times}$.
On the simulated platform, the variance is similarly limited, with an average speedup of around $\qty{12.7}{\times}$, which is almost an order of magnitude higher than Samsung's findings.
This may be a surprising result, since such vector operations are inherently memory-bound and should be a prime candidate for the use of \ac{pim}.
Samsung explains its low value of $\qty{1.6}{\times}$ by the fact that after eight \ac{rd} accesses, the processor has to introduce a memory barrier instruction, resulting in a severe performance hit \cite{lee2021}.
However, this memory barrier has also been implemented in the VADD kernel of the simulations, which still show a significant performance gain.
The \ac{gemv} microbenchmark, on the other hand, shows a closer match, with an average speedup of $\qty{8.3}{\times}$ for Samsung's implementation, while the simulation of this thesis achieved an average speedup of $\qty{9.0}{\times}$, which is well within the reach of the real hardware implementation.