Introduction to library chapter
\subsection{Software Library}
\label{sec:library}
With the implementation of the \ac{pim} processing units, a crucial missing piece for simulating \aca{fimdram} is software support to effectively utilize the new architecture.
While it is possible to shift the responsibility for interacting with the \aca{fimdram} to the application developer, it is preferable to provide a sophisticated software library that offers users an easy-to-use \ac{api} for interfacing with the \ac{pim} implementation.
Such a \ac{pim} library must include the following essential features to fully interact with the processing units in memory:
\begin{itemize}
\item It must support the \textbf{mode-setting} required to switch between the \ac{sb}, \ac{ab}, and \ac{abp} modes.
\item It should provide data structures to build up \textbf{microkernels} and functions to upload the kernels to the \acp{crf} of the processing units.
\item To meet the special requirements for the \textbf{memory layout} of the algorithm's inputs and outputs, it should provide data structures to represent vectors and matrices according to the layout constraints.
\item After switching the mode to \ac{abp}, the library should provide functionality to \textbf{execute a user-defined microkernel} by issuing the necessary memory requests through the execution of \ac{ld} and \ac{st} instructions.
\item For platforms where it is not possible to mark the \ac{pim} memory regions as uncacheable, the library should provide the necessary \textbf{cache management} operations to bypass cache filtering and to generate the right number of \ac{rd} and \ac{wr} \ac{dram} commands.
\end{itemize}
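To make the listed requirements concrete, the following C++ sketch outlines what such a library surface could look like. All names (\texttt{Mode}, \texttt{Microkernel}, \texttt{set\_mode}) and the 32-bit instruction encoding are illustrative assumptions, not the actual \ac{api} of the implemented library.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Mode-setting: the three operation modes of the PIM-enabled memory.
enum class Mode { SB, AB, ABP };

// A single microkernel instruction word (32-bit encoding is assumed here).
struct Instruction {
    std::uint32_t word;
};

// Builder for a microkernel that is later uploaded to the CRFs
// of the processing units.
class Microkernel {
public:
    void append(Instruction insn) { insns_.push_back(insn); }
    std::size_t size() const { return insns_.size(); }
    const std::vector<Instruction>& instructions() const { return insns_; }
private:
    std::vector<Instruction> insns_;
};

// Stand-in for the mode-register write the real library would issue.
Mode g_mode = Mode::SB;
void set_mode(Mode m) { g_mode = m; }
```

A user would build a kernel instruction by instruction, upload it, switch to \ac{abp} mode, and trigger execution; the data-structure and execution aspects are discussed in the following subsections.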
% - mode setting
\subsubsection{Data Structures}
% - memory layout
% - microkernel programming
\subsubsection{Microkernel Execution}
% - microkernel execution
% - cache management
\subsection{Virtual Machine}
\label{sec:vm}
\subsubsection{Integration}
To implement \aca{fimdram} in \aca{hbm}, the \ac{dram} model of DRAMSys has to be extended to incorporate the processing units in the \acp{pch} of the \ac{pim}-activated channels and to provide it with the burst data from the \acp{ssa} as well as the burst address to calculate the register indices in the \ac{aam} operation mode.
However, as already described in \cref{sec:pim_fim}, no changes are required in the frontend or backend of DRAMSys, nor in the memory controller.
In addition, since a single \ac{dram} \ac{rd} or \ac{wr} command triggers the execution of a single microkernel instruction, the processing unit is fully synchronized with the read and write operations of the \ac{dram}.
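The register-index calculation from the burst address in \ac{aam} mode can be sketched as follows. The number of \ac{grf} entries and the choice of address bits used for the index are assumptions for illustration only; the actual mapping depends on the documented \aca{fimdram} address layout.

```cpp
#include <cstdint>

// Assumed number of entries per GRF register file (illustrative).
constexpr unsigned kGrfEntries = 8;

// In AAM mode, the register index is derived from the burst address
// delivered alongside the burst data. Here we assume the low bits of
// the address select the GRF entry; this bit slicing is hypothetical.
unsigned aam_register_index(std::uint64_t burst_address) {
    return static_cast<unsigned>(burst_address % kGrfEntries);
}
```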
In case of a write access, the output of the processing unit is written directly into the corresponding bank, ignoring the actual data of the transaction object.
This is equivalent to the real \aca{fimdram} implementation, where the global \ac{io} bus of the memory is not actually driven, and all data movement is done internally in the banks.
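This coupling between \ac{dram} commands and instruction execution can be modeled as in the following toy sketch: every \ac{rd} or \ac{wr} command advances the processing unit by exactly one microkernel instruction, and a write stores the unit's output into the bank directly, ignoring the data carried by the transaction. The names and the stand-in computation are illustrative, not the DRAMSys implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct ProcessingUnit {
    std::size_t pc = 0;        // program counter into the CRF
    std::uint32_t output = 0;  // result of the last executed instruction

    // One DRAM RD/WR command == one microkernel instruction step.
    void on_command(std::vector<std::uint32_t>& bank, std::size_t addr,
                    bool is_write, std::uint32_t /*transaction_data*/) {
        output = static_cast<std::uint32_t>(pc) + 1;  // stand-in for ALU work
        if (is_write) {
            bank[addr] = output;  // transaction data is ignored on writes
        }
        ++pc;
    }
};
```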
\subsubsection{Implementation}
So far, only the additional infrastructure in the \ac{dram} model of DRAMSys and the integration of the processing units have been described.
Now follows the implementation of the processing units themselves.
The internal state of a processing unit consists of the \ac{grf} register files \ac{grf}-A and \ac{grf}-B, the \ac{srf} register files \ac{srf}-A and \ac{srf}-M, the program counter, and a jump counter that keeps track of the current iteration of a JUMP instruction.
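Transcribed into a struct, this internal state could look as follows. The register-file dimensions (8 entries of 16 lanes per \ac{grf}, 8 entries per \ac{srf}) are assumptions for illustration, not the documented sizes.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

struct PimUnitState {
    // General-purpose register files GRF-A and GRF-B (sizes assumed).
    std::array<std::array<std::uint16_t, 16>, 8> grf_a{};
    std::array<std::array<std::uint16_t, 16>, 8> grf_b{};
    // Scalar register files SRF-A and SRF-M (sizes assumed).
    std::array<std::uint16_t, 8> srf_a{};
    std::array<std::uint16_t, 8> srf_m{};
    std::size_t pc = 0;            // program counter
    std::size_t jump_counter = 0;  // current iteration of a JUMP instruction
};
```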