Grammatical fixes in PIM chapter
@@ -27,43 +27,42 @@ This process is illustrated in Figure \ref{img:dnn} where one \ac{dnn} layer is
Such an operation, defined in the widely used \ac{blas} library \cite{blas1979}, is also known as a \acs{gemv} routine.
Because each matrix element is used exactly once in the calculation of the output vector, there is no data reuse of the matrix.
Further, as the weight matrices tend to be too large to fit in the on-chip cache, such a \ac{gemv} operation is deeply memory-bound \cite{he2020}.
As a result, such an operation is a good fit for \ac{pim}.

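The memory-bound nature of \acs{gemv} can be made concrete with a back-of-the-envelope sketch. The following Python/NumPy snippet is illustrative only; the layer size `n` is an arbitrary assumption, not taken from the text:

```python
import numpy as np

# Rough sketch of why GEMV (y = W @ x) is memory-bound: every matrix
# element is read once and used in exactly one multiply-accumulate,
# so for float32 the arithmetic intensity stays below 0.5 FLOP/byte.

n = 4096                                  # hypothetical layer size (assumption)
W = np.random.rand(n, n).astype(np.float32)
x = np.random.rand(n).astype(np.float32)

y = W @ x                                 # the GEMV itself

flops = 2 * n * n                         # one multiply + one add per matrix element
bytes_moved = W.nbytes + x.nbytes + y.nbytes
print(f"arithmetic intensity: {flops / bytes_moved:.3f} FLOP/byte")
```

With no reuse of `W`, the intensity cannot exceed 0.5 FLOP/byte regardless of `n`, which is far below what keeps a modern core busy.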
\subsection{PIM Architectures}
\label{sec:pim_architectures}
Many different \ac{pim} architectures have been proposed by research in the past, and more recently real implementations have also been presented by hardware vendors.
These proposals differ largely in the positioning of the processing operation applied, ranging from analogue distribution of capacitor charges at the \ac{subarray} level to additional processing units at the global \ac{io} level.
In essence, the placement of these approaches can be summarised as follows \cite{sudarshan2022}:
\begin{enumerate}
\item Inside the memory \ac{subarray}.
\item In the \ac{psa} region near a \ac{subarray}.
\item Outside the bank in its peripheral region.
\item In the \ac{io} region of the memory.
\end{enumerate}

% short overview of the PIM categories (paper from the chair)
Each of these approaches comes with different advantages and disadvantages.
In short, the closer the processing is to the memory \acs{subarray}, the higher the energy efficiency and the achievable processing bandwidth.
On the other hand, the integration of the \ac{pim} units becomes more difficult, as area and power constraints limit the integration \cite{sudarshan2022}.

Processing inside the \ac{subarray} has the highest achievable level of parallelism, with the number of operand bits equal to the size of the row.
It also requires the least amount of energy, as the data only has to be loaded from the \acs{subarray} into the \acp{psa} to perform operations on it.
The downside of this approach is the need to modify the highly density-optimized \ac{subarray} architecture.
An example of such an approach is Ambit \cite{seshadri2020}.
Ambit provides a mechanism to activate multiple rows within a \ac{subarray} at once and perform bulk bitwise operations such as AND, OR and NOT on the row data.
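Conceptually, activating three rows at once makes each bitline settle to the majority value of the three cells, and AND/OR follow from pre-initializing a control row to all zeros or all ones \cite{seshadri2020}. A minimal logic-level sketch of that idea (plain Python integers stand in for DRAM rows; the 8-bit row width is an assumption for illustration, not the hardware mechanism itself):

```python
# Conceptual model of Ambit-style triple-row activation (TRA):
# activating three DRAM rows at once makes each bitline settle to the
# bitwise majority of the three cell values, which yields AND/OR when
# one of the rows is a pre-initialized control row.

def maj3(a: int, b: int, c: int) -> int:
    """Bitwise majority of three rows (modelled here as Python ints)."""
    return (a & b) | (b & c) | (a & c)

def bulk_and(a: int, b: int) -> int:
    return maj3(a, b, 0)            # control row pre-set to all zeros

def bulk_or(a: int, b: int) -> int:
    all_ones = (1 << 8) - 1         # assume 8-bit rows for this sketch
    return maj3(a, b, all_ones)     # control row pre-set to all ones

row_a, row_b = 0b1100_1010, 0b1010_0110
print(bin(bulk_and(row_a, row_b)))  # 0b10000010
print(bin(bulk_or(row_a, row_b)))   # 0b11101110
```

NOT is realized separately in Ambit via dual-contact cells, which is why the three bulk operations together form a functionally complete set.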

Far fewer, but still challenging, constraints are placed on the integration of compute units in the region of the \acp{psa}.
\cite{sudarshan2022a} presents a two-stage design that integrates current-mirror-based analogue units near the \ac{subarray}, enabling the \ac{mac} operations used in \ac{dnn} applications.
The integration of compute units in the \ac{io} region of the bank makes area-intensive operations such as ADD, \ac{mac}, or \ac{mad} possible.
This leaves the highly optimized \ac{subarray} and \ac{psa} regions as they are, and only reduces the memory density per die to make room for the additional compute units.
However, the achievable level of parallelism is lower than in the other approaches and is defined by the prefetch architecture, i.e., the maximum burst size of the memory banks.
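The cap that the prefetch architecture puts on this parallelism can be sketched with a small calculation. The prefetch width and operand width below are illustrative assumptions, not the parameters of any specific device:

```python
# Back-of-the-envelope sketch: for compute units at the bank I/O region,
# the number of operands available per burst (and hence the number of
# parallel compute lanes that can be kept busy) is bounded by the
# prefetch width of the bank. All numbers are illustrative assumptions.

prefetch_bits_per_bank = 256   # e.g. an 8n prefetch over a 32-bit-wide bank
operand_width_bits = 16        # e.g. fp16 operands for DNN inference

lanes_per_bank = prefetch_bits_per_bank // operand_width_bits
print(lanes_per_bank)          # parallel MAC lanes fed per burst
```

This is orders of magnitude below the row-wide parallelism available inside the \acs{subarray}, which is the trade-off described above.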
Placing the compute units at the \ac{io} region of the \ac{dram} has the fewest physical restrictions and allows for complex accelerators.
The downside is that bank parallelism cannot be exploited to perform multiple computations simultaneously at the bank level.
Also, the energy required to move data to the \ac{io} boundary of the \ac{dram} is much higher than in the other approaches.

In the following, three \ac{pim} approaches that place the compute units at the bank \ac{io} boundary are highlighted in more detail.
src/doc.bib
@@ -394,6 +394,41 @@
  file = {/home/derek/Nextcloud/Verschiedenes/Zotero/storage/3XHCI9KG/Oliveira et al. - 2023 - DaPPA A Data-Parallel Framework for Processing-in.pdf}
}

@inproceedings{seshadri2013,
  title = {{{RowClone}}: Fast and Energy-Efficient in-{{DRAM}} Bulk Data Copy and Initialization},
  shorttitle = {{{RowClone}}},
  booktitle = {Proceedings of the 46th {{Annual IEEE}}/{{ACM International Symposium}} on {{Microarchitecture}}},
  author = {Seshadri, Vivek and Kim, Yoongu and Fallin, Chris and Lee, Donghyuk and Ausavarungnirun, Rachata and Pekhimenko, Gennady and Luo, Yixin and Mutlu, Onur and Gibbons, Phillip B. and Kozuch, Michael A. and Mowry, Todd C.},
  year = {2013},
  month = dec,
  pages = {185--197},
  publisher = {{ACM}},
  address = {{Davis California}},
  doi = {10.1145/2540708.2540725},
  url = {https://dl.acm.org/doi/10.1145/2540708.2540725},
  urldate = {2024-02-05},
  isbn = {978-1-4503-2638-4},
  langid = {english},
  file = {/home/derek/Nextcloud/Verschiedenes/Zotero/storage/85WGY7ZW/Seshadri et al. - 2013 - RowClone fast and energy-efficient in-DRAM bulk d.pdf}
}

@misc{seshadri2020,
  title = {In-{{DRAM Bulk Bitwise Execution Engine}}},
  author = {Seshadri, Vivek and Mutlu, Onur},
  year = {2020},
  month = apr,
  number = {arXiv:1905.09822},
  eprint = {1905.09822},
  primaryclass = {cs},
  publisher = {{arXiv}},
  url = {http://arxiv.org/abs/1905.09822},
  urldate = {2024-02-05},
  abstract = {Many applications heavily use bitwise operations on large bitvectors as part of their computation. In existing systems, performing such bulk bitwise operations requires the processor to transfer a large amount of data on the memory channel, thereby consuming high latency, memory bandwidth, and energy. In this paper, we describe Ambit, a recently-proposed mechanism to perform bulk bitwise operations completely inside main memory. Ambit exploits the internal organization and analog operation of DRAM-based memory to achieve low cost, high performance, and low energy. Ambit exposes a new bulk bitwise execution model to the host processor. Evaluations show that Ambit significantly improves the performance of several applications that use bulk bitwise operations, including databases.},
  archiveprefix = {arxiv},
  keywords = {Computer Science - Hardware Architecture,Computer Science - Performance},
  file = {/home/derek/Nextcloud/Verschiedenes/Zotero/storage/3J45PFD2/Seshadri und Mutlu - 2020 - In-DRAM Bulk Bitwise Execution Engine.pdf;/home/derek/Nextcloud/Verschiedenes/Zotero/storage/DTK64DHZ/1905.html}
}

@misc{src2021,
  title = {Decadal {{Plan}} for {{Semiconductors}}},
  author = {{SRC}},