
\section{Conclusion and Future Work}
\label{sec:conclusion}

In this thesis, the applicability of \ac{pim} was explored, with a focus on the \ac{dnn} algorithms that are in high demand for \ac{ai} applications.
A general overview of different types of \ac{pim} implementations was given, with some concrete implementations highlighted in detail.
The \ac{pim} implementation of the major \ac{dram} vendor Samsung, \ac{fimdram}/\aca{fimdram}, was specifically discussed and analyzed.
A working \ac{vp} of \aca{fimdram}, in the form of a software model, was developed, together with a software support library that enables the use of the \aca{fimdram} processing units from a user application.
This made it possible to explore the performance gain of \ac{pim} for different workloads in a simple and flexible way.

It was found that \ac{pim} can provide a speedup of up to $\qty{23.9}{\times}$ for level 1 \ac{blas} vector operations and up to $\qty{62.5}{\times}$ for level 2 \ac{blas} operations.
While these results may not strictly represent a real-world system, achievable speedups of $\qty{17.6}{\times}$ and $\qty{9.0}{\times}$, respectively, were determined using a hypothetical infinite compute system.
The speedup of $\qty{9.0}{\times}$ for the \ac{gemv} routine is close to the roughly $\qty{8.3}{\times}$ reported for Samsung's real-world implementation of \aca{fimdram}.
In addition to the numbers presented by Samsung, the same simulation workloads were run on two real \ac{gpu} systems, both with \aca{hbm}, and their runtimes were compared.
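
As a rough indication of how close the simulated \ac{gemv} speedup is to the value reported by Samsung, the relative deviation can be computed directly from the two numbers given above:
\[
  \frac{9.0 - 8.3}{8.3} \approx 0.084 ,
\]
that is, the simulated speedup is roughly $\qty{8}{\percent}$ higher.
Since the reported value is itself only approximate, this figure should be read as an indication rather than as a precise measure of agreement.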

However, there is still room for improvement in the software model and the comparison methodology, which will be the subject of future work.
Firstly, the developed software library and the implemented model are not yet a drop-in replacement for the real \aca{fimdram} implementation due to the custom communication protocol between the host processor and the \ac{pim} processing units.
This protocol is used to implement mode switching and to transfer the microkernels.
Replacing this custom protocol with the real interface requires more detailed information from Samsung, as the exact interface of \aca{fimdram} is not described in the published papers \cite{kwon2021,lee2021,kang2022}.
To ease the currently error-prone microkernel development process, the software library could help the developer by providing building blocks that assemble the microkernel and simultaneously generate the necessary \ac{ld} and \ac{st} instructions to execute the kernel.

The current bare-metal deployment of the user application cannot realistically be used to accelerate complex real-world \ac{dnn} applications.
Instead, \aca{fimdram} should be usable on a Linux system, which would require integrating the software support library into a Linux device driver.
To take into account the special alignment requirements of the \ac{pim} data structures, this device driver must also carefully consider the virtual address translation of the Linux kernel, possibly making use of so-called \acp{hugetlb}, as the alignment requirements exceed the default page size of $\qty{4}{\kibi\byte}$.
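
The role of larger pages can be made concrete with a short calculation.
The following sketch assumes a power-of-two alignment requirement $A$ and a huge page size of $\qty{2}{\mebi\byte}$; both are illustrative assumptions, as the actual values depend on the memory configuration and the processor architecture.
Address translation replaces only the page number and leaves the offset within a page of size $P$ (itself a power of two) unchanged, so a virtual address $v$ and the corresponding physical address $p$ satisfy
\[
  p \bmod P = v \bmod P .
\]
If $A \le P$, then $A$ divides $P$, and it follows that
\[
  p \bmod A = v \bmod A ,
\]
so an alignment established in the virtual address space carries over to the physical address space.
With $P = \qty{4}{\kibi\byte}$ this only holds for alignments of up to $\qty{4}{\kibi\byte}$, whereas with $P = \qty{2}{\mebi\byte}$ every power-of-two alignment of up to $\qty{2}{\mebi\byte}$ is preserved.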

To better assess the performance gains of \aca{fimdram}, it should then be evaluated with real-world \ac{dnn} applications.
Effects such as the initialization overhead of \aca{fimdram} can only be realistically evaluated in such an environment.
Furthermore, the support software for \aca{fimdram} should be extended to run on the \ac{gpu} model provided by gem5, so that the comparison can be extended to the deployment of real \ac{dnn} applications.
This would provide a considerably better basis for analyzing the effects of \ac{pim} on real applications running on representative hardware models.

Further research could also investigate whether the library-based approach to leveraging \ac{pim} could be replaced by a compiler-based approach.
A special compiler extension would be able to generate the necessary \ac{ld} and \ac{st} instructions by analyzing the data types of the operands and the provided arithmetic operation.
This extension could also make use of so-called non-temporal instructions, which bypass the cache hierarchy on a per-instruction basis instead of preallocating the entire \ac{pim}-enabled memory as non-cacheable.

In addition to the performance comparison, further research should also model the power efficiency of \ac{pim} and compare it to the non-\ac{pim} case.
Since \ac{pim} not only shortens the computation time per operation but also avoids transferring data out of the \ac{dram}, and therefore does not need to drive the data bus during operation, it promises considerable improvements in this area.
However, such research would require a detailed power model of both \aca{hbm} and \aca{fimdram}.
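
Why the measured speedup alone does not determine the energy gain can be seen from a simple relation.
As a sketch, let $t_\mathrm{base}$ and $t_\mathrm{PIM}$ denote the runtimes and $\overline{P}_\mathrm{base}$ and $\overline{P}_\mathrm{PIM}$ the average power consumption of the non-\ac{pim} and the \ac{pim} case; these symbols are introduced here for illustration only.
With the energy $E = \overline{P} \cdot t$ and the speedup $S = t_\mathrm{base} / t_\mathrm{PIM}$, the energy ratio becomes
\[
  \frac{E_\mathrm{base}}{E_\mathrm{PIM}}
  = \frac{\overline{P}_\mathrm{base} \cdot t_\mathrm{base}}{\overline{P}_\mathrm{PIM} \cdot t_\mathrm{PIM}}
  = \frac{\overline{P}_\mathrm{base}}{\overline{P}_\mathrm{PIM}} \cdot S .
\]
The factor $S$ is available from the performance simulations, whereas the power ratio can only be obtained from such a power model.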

In conclusion, \ac{pim} is a promising approach to address the future processing and power efficiency needs of \ac{ai} and possibly other applications.
Research needs to consider not only the architecture itself, but also the integration of \ac{pim} into applications at the software level.
By overcoming these challenges, \ac{pim} could be part of the solution to increase the performance and energy efficiency of future computing platforms.

% what to do better:
    % implement Samsung's real mode switching and programming of crfs
    % build an api that guarantees matching LD and ST for the assembled microkernel
    % implement linux kernel driver
    % -> alignment requirements -> huge tables

    % make use of Samsung pim in a real dnn application and measure the effects
    % compare with SIMD insts in ARM
    % compare with real TPUs and GPU platforms