\section{Conclusion and Future Work}
\label{sec:conclusion}

This thesis explored the applicability of \ac{pim} to \ac{dnn} algorithms, which are in high demand for \ac{ai} applications.
A general overview of the different types of \ac{pim} implementations was given, and selected implementations were examined in detail.
The \ac{pim} implementation of the major \ac{dram} vendor Samsung, \ac{fimdram}/\aca{fimdram}, was specifically discussed and analyzed.
A working \ac{vp} of \aca{fimdram} was developed in the form of a software model, together with a software support library that enables a user application to use the \aca{fimdram} processing units.
This made it possible to explore the performance gain of \ac{pim} for different workloads in a simple and flexible way.

It was found that \ac{pim} can provide a speedup of up to $\qty{23.9}{\times}$ for level 1 \ac{blas} vector operations and up to $\qty{62.5}{\times}$ for level 2 \ac{blas} operations.
While these results may not strictly represent a real-world system, achievable speedups of $\qty{17.6}{\times}$ and $\qty{9.0}{\times}$, respectively, were determined using a hypothetical infinite compute system.
The speedup of $\qty{9.0}{\times}$ achieved for the \ac{gemv} routine is close to the figure of about $\qty{8.3}{\times}$ reported for Samsung's real-world implementation of \aca{fimdram}.
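As a rough check of how close the two figures are, and assuming that both speedups are measured against comparable baselines (an assumption, since the reference systems differ), their ratio is
\[
  \frac{9.0}{8.3} \approx 1.08,
\]
so the simulated \ac{gemv} speedup lies about $\qty{8}{\percent}$ above the value reported by Samsung.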
In addition to the numbers presented by Samsung, the same simulation workloads were run on two real \ac{gpu} systems, both with \aca{hbm}, and their runtimes were compared.

However, there is still room for improvement in the software model and the comparison methodology, which will be the subject of future work.
Firstly, the developed software library and the implemented model are not yet a drop-in replacement for the real \aca{fimdram} implementation, because they rely on a custom communication protocol between the host processor and the \ac{pim} processing units.
This protocol is used to implement mode switching and to transfer the microkernels.
Adopting the real interface instead requires more detailed information from Samsung, as the exact interface of \aca{fimdram} is not described in the published papers \cite{kwon2021}, \cite{lee2021} and \cite{kang2022}.
To ease the currently error-prone microkernel development process, the software library could help the developer by providing building blocks that assemble the microkernel and simultaneously generate the necessary \ac{ld} and \ac{st} instructions to execute the kernel.
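Such building blocks could, for example, take the form of a builder that records each microkernel instruction together with the memory access that triggers it.
The following Python sketch illustrates the idea only: the class \texttt{KernelBuilder}, its methods, and the instruction mnemonics are hypothetical placeholders and do not correspond to the actual \aca{fimdram} instruction set or to the interface of the developed library.

```python
from dataclasses import dataclass, field


@dataclass
class KernelBuilder:
    """Hypothetical builder; the mnemonics are placeholders, not the real ISA."""

    instructions: list = field(default_factory=list)  # microkernel program
    accesses: list = field(default_factory=list)      # one host access per instruction

    def _emit(self, mnemonic, access, address):
        # Record the instruction and its triggering access in lockstep,
        # so that the two sequences cannot get out of sync.
        self.instructions.append(mnemonic)
        self.accesses.append((access, address))
        return self  # returning self allows the calls to be chained

    def load_operand(self, address):
        # Assumption: reading a bank operand is triggered by a host load (LD).
        return self._emit("LOAD_REG", "LD", address)

    def multiply_accumulate(self, address):
        # Assumption: the second operand is also fetched by a host load (LD).
        return self._emit("MAC", "LD", address)

    def write_back(self, address):
        # Assumption: writing a result to the bank is triggered by a host store (ST).
        return self._emit("STORE_BANK", "ST", address)

    def build(self):
        # Terminate the program and return it with the matching access list.
        return self.instructions + ["EXIT"], list(self.accesses)


kernel, accesses = (
    KernelBuilder()
    .load_operand(0x0000)
    .multiply_accumulate(0x0040)
    .write_back(0x0080)
    .build()
)
print(kernel)
print(accesses)
```

Because every building block emits the instruction and its memory access together, a mismatch between the microkernel and the issued \ac{ld} and \ac{st} instructions is ruled out by construction.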

The current bare-metal deployment of the user application cannot realistically be used to accelerate complex real-world \ac{dnn} applications.
Instead, \aca{fimdram} should be usable on a Linux system, which would require integrating the software support library into a Linux device driver.
To take into account the special alignment requirements of the \ac{pim} data structures, this device driver must also carefully consider the virtual address translation of the Linux kernel, possibly making use of so-called \acp{hugetlb}, as the alignment requirements exceed the default page size of $\qty{4}{\kibi\byte}$.
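The underlying reason is that address translation preserves only the offset within a page, so a virtually aligned buffer is guaranteed to be physically aligned only if the page size is a multiple of the required alignment.
The following Python sketch makes this check explicit; the alignment of $\qty{32}{\kibi\byte}$ and the list of huge page sizes are illustrative assumptions, not values taken from \aca{fimdram} or from a particular platform.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

# Assumed page sizes: the 4 KiB default plus two typical huge page sizes.
PAGE_SIZES = (4 * KIB, 2 * MIB, 1 * GIB)


def is_power_of_two(value):
    return value > 0 and value & (value - 1) == 0


def smallest_sufficient_page(alignment, page_sizes=PAGE_SIZES):
    """Return the smallest page size that preserves `alignment` across translation."""
    if not is_power_of_two(alignment):
        raise ValueError("alignment must be a power of two")
    for size in sorted(page_sizes):
        # Translation keeps the page offset, so physical alignment is
        # guaranteed when the page size is a multiple of the alignment.
        if size % alignment == 0:
            return size
    return None  # none of the given page sizes is large enough


def align_up(address, alignment):
    """Round `address` up to the next multiple of a power-of-two alignment."""
    return (address + alignment - 1) & ~(alignment - 1)


required = 32 * KIB  # hypothetical PIM alignment requirement
print(smallest_sufficient_page(required) // KIB, "KiB pages needed")
print(hex(align_up(0x1234, required)))
```

With these assumed values, the default $\qty{4}{\kibi\byte}$ pages are insufficient and $\qty{2}{\mebi\byte}$ huge pages would be selected.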

To better assess the performance gains of \aca{fimdram}, it should then be evaluated with real-world \ac{dnn} applications.
Effects such as the initialization overhead of \aca{fimdram} can only be realistically evaluated in such an environment.
Furthermore, the support software implementation for \aca{fimdram} should be extended to execute on the \ac{gpu} model provided by gem5, so that the comparison can be extended to the deployment of real \ac{dnn} applications.
This would provide a considerably better basis for analyzing the effects of \ac{pim} on real applications running on representative hardware models.

Further research could also investigate whether the library-based approach to leveraging \ac{pim} could be replaced by a compiler-based approach.
A special compiler extension could generate the necessary \ac{ld} and \ac{st} instructions by analyzing the data types of the operands and the provided arithmetic operation.
This extension could also make use of so-called non-temporal instructions, which bypass the cache hierarchy on a per-instruction basis, so that the entire \ac{pim}-enabled memory would no longer have to be marked as non-cacheable in advance.

In addition to the performance comparison, further research should also model the power efficiency of \ac{pim} and compare it to the non-\ac{pim} case.
Since \ac{pim} not only shortens the computation time per operation but also avoids transferring data out of the \ac{dram}, and therefore does not need to drive the data bus during operation, it promises notable improvements in this area.
However, such research would require a detailed power model of both \aca{hbm} and \aca{fimdram}.

In conclusion, \ac{pim} is a promising approach to address the future processing and power efficiency needs of \ac{ai} and possibly other applications.
Research needs to consider not only the architecture itself, but also the integration of \ac{pim} into applications at the software level.
If these challenges are overcome, \ac{pim} could be part of the solution for increasing the performance and energy efficiency of future computing platforms.

% what to do better:
% implement Samsung's real mode switching and programming of CRFs
% build an api that guarantees matching LD and ST for the assembled microkernel
% implement linux kernel driver
% -> alignment requirements -> huge tables

% make use of samsung pim in a real dnn application and measure the effects
% compare with SIMD insts in ARM
% compare with real TPUs and GPU platforms