diff --git a/src/acronyms.tex b/src/acronyms.tex index e0f1a22..7dce1fd 100644 --- a/src/acronyms.tex +++ b/src/acronyms.tex @@ -211,6 +211,22 @@ short = SRF, long = scalar register file, } +\DeclareAcronym{fp16}{ + short = FP16, + long = 16-bit floating-point, +} +\DeclareAcronym{fp32}{ + short = FP32, + long = 32-bit floating-point, +} +\DeclareAcronym{relu}{ + short = ReLU, + long = rectified linear unit, +} +\DeclareAcronym{aam}{ + short = AAM, + long = address aligned mode, +} \DeclareAcronym{tlm}{ short = TLM, long = transaction-level modeling, diff --git a/src/chapters/dram.tex b/src/chapters/dram.tex index 7c1e5da..3ac8cc4 100644 --- a/src/chapters/dram.tex +++ b/src/chapters/dram.tex @@ -111,7 +111,7 @@ For example, compared to a conventional \ac{ddr4} \ac{dram}, this tight integrat One memory stack supports up to 8 independent memory channels, each of which contains up to 16 banks, which are divided into 4 bank groups. The command, address and data bus operate at \ac{ddr}, i.e., they transfer two words per interface clock cycle $t_{CK}$. -With a $t_{CK}$ of $\qty{1}{\giga\hertz}$ \aca{hbm} achieves a pin transfer rate of $\qty{2}{\giga T \per\second}$, resulting in $\qty[per-mode = symbol]{256}{\giga\byte\per\second}$ for the 1024-bit wide data bus of each stack. +With an interface clock of $\qty{1}{\giga\hertz}$ (i.e., $t_{CK} = \qty{1}{\nano\second}$), \aca{hbm} achieves a pin transfer rate of $\qty{2}{\giga T \per\second}$, which gives $\qty[per-mode=symbol]{16}{\giga\byte\per\second}$ per \ac{pch} and a total of $\qty[per-mode = symbol]{256}{\giga\byte\per\second}$ for the 1024-bit wide data bus of each stack. A single data transfer is performed with either a \ac{bl} of 2 or 4, depending on the \ac{pch} configuration. In \ac{pch} mode, the data bus is split in half (i.e., 64-bit) to enable independent data transmission, further increasing parallelism while sharing a common command and address bus between the two \acp{pch}.
Thus, accessing \aca{hbm} in \ac{pch} mode transmits a $\qty{256}{\bit}=\qty{32}{\byte}$ burst with a \ac{bl} of 4 over the $\qty{64}{\bit}$ wide data bus. @@ -123,7 +123,7 @@ In the center of the die, the \acp{tsv} connect to the next die above or the pre \begin{figure} \centering \includegraphics[width=0.8\linewidth]{images/hbm} - \caption[\aca{hbm} memory die architecture]{\aca{hbm} memory die architecture \cite{lee2021}} + \caption[\aca{hbm} memory die architecture]{\aca{hbm} memory die architecture \cite{lee2021}.} \label{img:hbm} \end{figure} diff --git a/src/chapters/pim.tex b/src/chapters/pim.tex index 52e0401..44a4bf5 100644 --- a/src/chapters/pim.tex +++ b/src/chapters/pim.tex @@ -42,14 +42,14 @@ Secondly, \ac{pim} comes with the further limitation that it can only accelerate \label{sec:pim_architectures} Many different \ac{pim} architectures have been proposed by research in the past, and more recently, real implementations have been presented by hardware vendors. -These proposals differ largely in the positioning of the processing operation applied, ranging from analogue distribution of capacitor charges at the \ac{subarray} level to additional processing units at the global \ac{io} level. +These proposals differ largely in the positioning of the processing operation applied, ranging from analogue distribution of capacitor charges at the \ac{subarray} level to additional processing units at the global \ac{io} level. In essence, the placement chosen by each approach can be summarized as follows \cite{sudarshan2022}: \begin{enumerate} -\item Inside the memory \ac{subarray}. -\item In the \ac{psa} region near a \ac{subarray}. -\item Outside the bank in its peripheral region. -\item In the \ac{io} region of the memory. + \item Inside the memory \ac{subarray}. + \item In the \ac{psa} region near a \ac{subarray}. + \item Outside the bank in its peripheral region. + \item In the \ac{io} region of the memory.
\end{enumerate} Each of these approaches comes with different advantages and disadvantages. @@ -109,7 +109,7 @@ To make full use of the output buffering, the matrix rows are interleaved in an \begin{figure} \centering \input{images/hynix} - \caption[Newton memory layout for a \ac{gemv} operation]{Newton memory layout for a \ac{gemv} operation \cite{he2020}} + \caption[Newton memory layout for a \ac{gemv} operation]{Newton memory layout for a \ac{gemv} operation \cite{he2020}.} \label{img:hynix} \end{figure} @@ -125,6 +125,8 @@ As a result, Newton promises a $\qtyrange{10}{54}{\times}$ speedup compared to a \subsection{FIMDRAM/HBM-PIM} \label{sec:pim_fim} +\subsubsection{Architecture} + One year after SK Hynix, the major \ac{dram} manufacturer Samsung announced its own \ac{pim} \ac{dram} implementation, called \ac{fimdram} or \ac{hbm}-\ac{pim}. As the name suggests, it is based on the \aca{hbm} memory standard, and it integrates 16-wide \ac{simd} engines directly into the memory banks, exploiting bank-level parallelism, while retaining the highly optimized \acp{subarray} \cite{kwon2021}. A major difference from Newton \ac{pim} is that \ac{hbm}-\ac{pim} does not require any changes to components of modern processors, such as the memory controller, i.e., it is agnostic to existing \aca{hbm} platforms. @@ -134,17 +136,110 @@ Fortunately, as discussed in Section \ref{sec:hbm}, the architecture of \ac{hbm} At the heart of the \ac{hbm}-\ac{pim} are the \ac{pim} execution units, which are shared by two banks of a \ac{pch}. They include 16 16-bit wide \ac{simd} \acp{fpu}, \acp{crf}, \acp{grf} and \acp{srf} \cite{lee2021}. This general architecture is shown in detail in Figure \ref{img:fimdram}, with (a) the placement of the \ac{pim} units between the memory banks of a \ac{dram} die, (b) a bank coupled to its \ac{pim} unit, and (c) the data path in and around an \ac{fpu} within the \ac{pim} unit.
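The bandwidth figures quoted in these chapters (256 GB/s per stack, 16 GB/s per pseudo channel, and the 16x internal \ac{pim} bandwidth discussed below) follow directly from the bus widths and the \ac{ddr} transfer rate. A quick Python sanity check of this arithmetic (a hypothetical sketch, not part of the thesis sources; all constant names are invented for illustration):

```python
# Back-of-the-envelope check of the HBM / HBM-PIM bandwidth figures
# quoted in the surrounding text. Hypothetical sketch; constant names
# are invented for illustration.

T_CK_HZ = 1e9          # 1 GHz interface clock
DDR_FACTOR = 2         # DDR: two transfers per clock cycle
STACK_BUS_BITS = 1024  # data-bus width of one HBM stack
PCH_BUS_BITS = 64      # bus width per pseudo channel in PCH mode
BANKS_PER_PCH = 16
PREFETCH_BITS = 256    # bits delivered to the FPUs per memory access

transfer_rate = T_CK_HZ * DDR_FACTOR            # 2 GT/s per pin
stack_bw = transfer_rate * STACK_BUS_BITS / 8   # bytes/s per stack
pch_bw = transfer_rate * PCH_BUS_BITS / 8       # bytes/s per pseudo channel

# One access feeds the PIM units of all 16 banks of a PCH in parallel.
bits_per_access = PREFETCH_BITS * BANKS_PER_PCH
internal_speedup = bits_per_access / PREFETCH_BITS  # vs. external bus

print(stack_bw / 1e9, pch_bw / 1e9, bits_per_access, internal_speedup)
# 256.0 16.0 4096 16.0
```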
+ \begin{figure} \centering \includegraphics[width=\linewidth]{images/fimdram} - \caption[Architecture of \ac{hbm}-\ac{pim}]{Architecture of \ac{hbm}-\ac{pim} \cite{lee2021}} + \caption[Architecture of \ac{hbm}-\ac{pim}]{Architecture of \ac{hbm}-\ac{pim} \cite{lee2021}.} \label{img:fimdram} \end{figure} + As can be seen in (c), the input data to the \ac{fpu} can either come directly from the memory bank, from a \ac{grf}/\ac{srf} or from the result bus of a previous computation. The 16-wide \ac{simd} units correspond to the 256-bit prefetch architecture of \aca{hbm}, where 16 16-bit floating-point operands are passed directly from the \acp{psa} to the \acp{fpu} from a single memory access. As all \ac{pim} units operate in parallel, with 16 banks per \ac{pch}, a single memory access loads a total of $\qty{256}{\bit} \times \qty{16}{banks} = \qty{4096}{\bit}$ into the \acp{fpu}. As a result, the theoretical internal bandwidth of \ac{hbm}-\ac{pim} is $\qty{16}{\times}$ higher than that of the external bus to the host processor. +\Ac{hbm}-\ac{pim} defines three operating modes: +\begin{enumerate} + \item \textbf{Single Bank Mode}: + This is the default operating mode, where \ac{hbm}-\ac{pim} has identical behavior to normal \aca{hbm} memory. + To switch to another mode, a specific sequence of \ac{act} and \ac{pre} commands must be sent by the memory controller to a specific row address. + \item \textbf{All-Bank Mode}: + The all-bank mode is an extension of the single bank mode in which the \ac{pim} execution units allow concurrent access to half of the \ac{dram} banks. + This provides $\qty{8}{\times}$ more bandwidth than the standard operating mode, which can be used for the initialization of memory regions across all banks. + \item \textbf{All-Bank-\ac{pim} Mode}: + With another predefined \ac{dram} access sequence, the memory switches to the \ac{pim}-enabled mode.
+ In this mode, a single memory access initiates the concurrent execution of the next instruction across all processing units. + In addition, the \ac{io} circuits of the \ac{dram} are completely disabled in this mode, reducing the power required during \ac{pim} operation. +\end{enumerate} -% unterschiede zu hynix pim -% benchmark ergebnisse von samsung... +In both all-bank mode and all-bank-\ac{pim} mode, the bandwidth per \ac{pch} is $\qty{8}{\times}$ higher than the standard \aca{hbm} bandwidth of $\qty[per-mode=symbol]{16}{\giga\byte\per\second}$, i.e., $\qty[per-mode=symbol]{128}{\giga\byte\per\second}$ per \ac{pch}, or $\qty[per-mode=symbol]{2}{\tera\byte\per\second}$ in total for 16 \acp{pch}. + +\subsubsection{Processing Unit} + +Due to the focus on \ac{dnn} applications in \ac{hbm}-\ac{pim}, the native data type for the \acp{fpu} is \ac{fp16}, which is motivated by the significantly lower area and power requirements for \acp{fpu} compared to \ac{fp32}, as well as the good support for \ac{fp16} in modern processor architectures. +The \ac{simd} \ac{fpu} is implemented once as an \ac{fp16} multiplier unit and once as an \ac{fp16} adder unit, providing support for these two basic arithmetic operations. +In addition to the \acp{fpu}, a processing unit also contains \acp{crf}, \acp{srf} and \acp{grf}. +The \ac{crf} acts as an instruction buffer, holding the 32 32-bit instructions to be executed by the processing unit when memory is accessed. +As explained earlier, the operands come either directly from the bank or from the \acp{srf} or \acp{grf}. +Each \ac{grf} consists of 16 256-bit registers, matching the \aca{hbm} prefetch size of 256 bits, so that each entry can hold the data of a full memory burst. +The \ac{grf} of a processing unit is divided into two halves (\ac{grf}-A and \ac{grf}-B), with 8 register entries allocated to each of the two banks.
+Finally, in the \ac{srf}, a 16-bit scalar value is replicated 16 times as it is fed into the 16-wide \ac{simd} \ac{fpu}, serving as a constant summand or factor for an addition or multiplication. +The \ac{srf} is also divided into two halves (\ac{srf}-A and \ac{srf}-M), used for addition and multiplication respectively, with 8 entries each. +This processing unit architecture is illustrated in Figure \ref{img:pcu}, along with the local bus interfaces to its even and odd banks, and the control unit that, among other things, decodes the instructions and keeps track of the program counter. + +\begin{figure} + \centering + \includegraphics[width=0.8\linewidth]{images/pcu} + \caption[Architecture of a \ac{pim} processing unit]{Architecture of a \ac{pim} processing unit \cite{lee2021}.} + \label{img:pcu} +\end{figure} + +\subsubsection{Instruction Set} + +The \ac{hbm}-\ac{pim} processing units provide a total of nine 32-bit \ac{risc} instructions, each of which falls into one of three groups: control flow instructions, arithmetic instructions and data movement instructions. +The data layout of these three instruction groups is shown in Table \ref{tab:isa}. + +\begin{table} + \centering + \includegraphics[width=\linewidth]{images/isa} + \caption[The instruction format of the processing units]{The instruction format of the processing units \cite{lee2021}.} + \label{tab:isa} +\end{table} + +The control flow instructions are NOP, which performs no operation; JUMP, which jumps to a given instruction offset for a fixed number of iterations; and EXIT, which restores the internal state of the processing unit. +The arithmetic instructions perform operations such as simple ADD and MUL, but also support \ac{mac} and \ac{mad} operations, which are key for accelerating \ac{dnn} applications. +Finally, the MOV and FILL instructions are used to move data between the memory banks and the \ac{grf} and \ac{srf} register files.
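To make the interplay of a bank burst, the \ac{srf} scalar broadcast and the \ac{mac} instruction concrete, the following Python sketch models one 16-lane \ac{simd} \ac{fpu} functionally (a hypothetical model, not Samsung's implementation; \ac{fp16} rounding is ignored and plain Python floats are used, and the function names are invented):

```python
# Hypothetical functional model of one 16-lane SIMD FPU with SRF scalar
# broadcast. Not the actual hardware: FP16 rounding behavior is ignored.

LANES = 16  # matches the 256-bit prefetch: 16 x 16-bit operands

def srf_broadcast(scalar):
    """SRF: replicate one 16-bit scalar value across all SIMD lanes."""
    return [scalar] * LANES

def simd_mac(acc, src0, src1):
    """MAC: acc[i] += src0[i] * src1[i] in every lane (e.g. into GRF-B)."""
    return [a + x * y for a, x, y in zip(acc, src0, src1)]

burst = [float(i) for i in range(LANES)]  # one 256-bit burst from the bank
acc = [0.0] * LANES                       # accumulator register entry
acc = simd_mac(acc, burst, srf_broadcast(2.0))  # constant factor from SRF-M
print(acc[:4])
# [0.0, 2.0, 4.0, 6.0]
```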
+ +The DST and SRC fields specify the operand type, i.e., the register file or bank affected by the operation. +Depending on the source or destination operand types, the instruction encodes indices for the concrete element in the register files, which are denoted in Table \ref{tab:isa} by \textit{\#} symbols. +The special field \textit{R} for the data movement instruction type enables a \ac{relu} operation, i.e., clamping negative values to zero, while the data is moved to another location. +Another special field \textit{A} enables \ac{aam}, which will be explained in more detail in Section \ref{sec:instruction_ordering}. + +\begin{table} + \centering + \resizebox{\linewidth}{!}{% + \begin{tblr}{ + hlines, + vlines, + hline{2} = {-}{solid,black}, + hline{2} = {2}{-}{solid,black}, + } + Type & Command & Description & Result (DST) & Operand (SRC0) & Operand (SRC1) & Operand (SRC2) \\ + Control & NOP & no operation & & & & \\ + Control & JUMP & jump instruction & & & & \\ + Control & EXIT & exit instruction & & & & \\ + Data & MOV & {move data\\from bank/register\\to register} & GRF, SRF & GRF, BANK & & \\ + Data & FILL & {move data\\from bank/register\\to bank/register} & GRF, BANK & GRF, BANK & & \\ + Arithmetic & ADD & addition & GRF & GRF, BANK, SRF & GRF, BANK, SRF & \\ + Arithmetic & MUL & multiplication & GRF & GRF, BANK & GRF, BANK, SRF & \\ + Arithmetic & MAC & multiply-accumulate & GRF-B & GRF, BANK & GRF, BANK, SRF & \\ + Arithmetic & MAD & multiply-and-add & GRF & GRF, BANK & GRF, BANK, SRF & GRF, BANK, SRF + \end{tblr}} + \caption[A list of the supported instructions with possible sources and destinations]{A list of the supported instructions with possible sources and destinations \cite{shin-haengkang2023}.} + \label{tab:instruction_set} +\end{table} + +Table \ref{tab:instruction_set} gives an overview of all available instructions and defines the possible operand sources and destinations.
+Note that some operations require either a \ac{rd} or a \ac{wr} access to execute properly. +For example, to write the resulting output vector from a \ac{grf} to the memory banks, the memory controller must issue a \ac{wr} command to write to the bank. +Likewise, reading from the banks requires a \ac{rd} command. +For control instructions and for arithmetic instructions that do not use a bank as a source operand, either a \ac{rd} or a \ac{wr} can be issued to execute the instruction. +For the rest of this thesis, it is assumed that a \ac{rd} is issued for these instructions. + +\subsubsection{Instruction Ordering} +\label{sec:instruction_ordering} + +\subsubsection{Memory Layout} + +\subsubsection{Performance and Power Efficiency Effects} diff --git a/src/doc.bib b/src/doc.bib index 521131e..5874255 100644 --- a/src/doc.bib +++ b/src/doc.bib @@ -446,6 +446,16 @@ file = {/home/derek/Nextcloud/Verschiedenes/Zotero/storage/3J45PFD2/Seshadri und Mutlu - 2020 - In-DRAM Bulk Bitwise Execution Engine.pdf;/home/derek/Nextcloud/Verschiedenes/Zotero/storage/DTK64DHZ/1905.html} } +@misc{shin-haengkang2023, + title = {{{PIMSimulator}}}, + author = {Kang, Shin-haeng and Cha, Sanghoon and Seo, Seungwoo and Kim, Jin-seong}, + year = {2023}, + month = nov, + url = {https://github.com/SAITPublic/PIMSimulator}, + urldate = {2024-02-08}, + abstract = {Processing-In-Memory (PIM) Simulator} +} + @misc{src2021, title = {Decadal {{Plan}} for {{Semiconductors}}}, author = {{SRC}}, diff --git a/src/images/isa.pdf b/src/images/isa.pdf new file mode 100644 index 0000000..e0bb2a7 Binary files /dev/null and b/src/images/isa.pdf differ diff --git a/src/images/pcu.pdf b/src/images/pcu.pdf new file mode 100644 index 0000000..4a8966b Binary files /dev/null and b/src/images/pcu.pdf differ diff --git a/src/index.tex b/src/index.tex index 42da576..05a57e6 100644 --- a/src/index.tex +++ b/src/index.tex @@ -22,6 +22,8 @@ \usepackage{pgfplots} \usepackage{bytefield} \usepackage{mathdots}
+\usepackage{tabularray} +\usepackage{makecell} % Configurations \usetikzlibrary{matrix}
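The execution model described in the \ac{pim} chapter above (in all-bank-\ac{pim} mode, each \ac{rd} or \ac{wr} command triggers execution of the next buffered \ac{crf} instruction) can be sketched as a toy Python model; the class and method names here are my own invention, not part of the thesis or the hardware:

```python
# Hypothetical toy model of how a PIM unit consumes CRF instructions:
# each RD/WR memory command executes the next buffered instruction,
# advancing the program counter tracked by the control unit.

class PimUnit:
    CRF_ENTRIES = 32  # the CRF holds 32 32-bit instructions

    def __init__(self, program):
        assert len(program) <= self.CRF_ENTRIES
        self.crf = list(program)  # instruction buffer
        self.pc = 0               # program counter
        self.trace = []           # (command, instruction) pairs executed

    def memory_command(self, cmd):
        """In all-bank-PIM mode, every RD/WR executes one instruction."""
        assert cmd in ("RD", "WR")
        if self.pc < len(self.crf):
            self.trace.append((cmd, self.crf[self.pc]))
            self.pc += 1

unit = PimUnit(["MOV", "MAC", "MAC", "FILL"])
for cmd in ("RD", "RD", "RD", "WR"):  # final WR writes results to the bank
    unit.memory_command(cmd)
print(unit.trace)
# [('RD', 'MOV'), ('RD', 'MAC'), ('RD', 'MAC'), ('WR', 'FILL')]
```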