diff --git a/src/chapters/conclusion.tex b/src/chapters/conclusion.tex index e794d55..ad419c6 100644 --- a/src/chapters/conclusion.tex +++ b/src/chapters/conclusion.tex @@ -4,27 +4,38 @@ In this thesis, the applicability of \ac{pim} was explored, taking into account the highly demanded \ac{dnn} algorithms for \ac{ai} applications. A general overview of different types of \ac{pim} implementations was given, with some concrete implementations highlighted in detail. The \ac{pim} implementation of the major \ac{dram} vendor Samsung, \ac{fimdram}/\aca{fimdram}, was specifically discussed and analyzed. -A working \ac{vp} of \aca{fimdram}, in the form of a software model, was developed, making it possible to explore the performance gain of \ac{pim} for various different applications in an easy and flexible way. -It was found that, ... (TODO: hier Ergebnisse). +A working \ac{vp} of \aca{fimdram}, in the form of a software model, has been developed, as well as a software support library to enable the use of the \aca{fimdram} processing units from a user application. +This made it possible to explore the performance gain of \ac{pim} for different workloads in a simple and flexible way. -However, there is still room for improvement in the software model or the comparison methodology, which will be the subject of future work. -Firstly, the developed software library and the implemented model are not yet a drop-in replacement for the real \aca{fimdram} implementation due to the custom communication protocol between the host processor and the \ac{pim} processing units, used to implement the mode switching and transferring of the microkernels. +It was found that \ac{pim} can provide a speedup of up to $\qty{23.9}{\times}$ for level 1 \ac{blas} vector operations and up to $\qty{62.5}{\times}$ for level 2 \ac{blas} operations. 
+While these results may not strictly represent a real-world system, achievable upper bounds of $\qty{17.6}{\times}$ and $\qty{9.0}{\times}$, respectively, could be determined using a hypothetical infinite compute system. +The achieved speedup of $\qty{9.0}{\times}$ for the \ac{gemv} routine closely matches the figure reported for Samsung's real-world implementation of \aca{fimdram}, at about $\qty{8.3}{\times}$. +In addition to the numbers presented by Samsung, the same simulation workloads were run on two real \ac{gpu} systems, both with \aca{hbm}, and their runtimes were compared. + +However, there is still room for improvement in the software model and the comparison methodology, which will be the subject of future work. +Firstly, the developed software library and the implemented model are not yet a drop-in replacement for the real \aca{fimdram} implementation due to the custom communication protocol between the host processor and the \ac{pim} processing units, which is used to implement the mode switching and the transfer of the microkernels. For this, more detailed information is required from Samsung, as the exact interface of \aca{fimdram} is not described in the published papers \cite{kwon2021}, \cite{lee2021} and \cite{kang2022}. To ease the currently error-prone microkernel development process, the software library could help the developer by providing building blocks that assemble the microkernel and simultaneously generate the necessary \ac{ld} and \ac{st} instructions to execute the kernel. -In addition, the current bare-metal deployment of the software cannot realistically be used to accelerate real-world \ac{dnn} applications. + +The current bare-metal deployment of the user application cannot realistically be used to accelerate complex real-world \ac{dnn} applications. Instead, \aca{fimdram} should be able to be used on a Linux system, which would require the integration of the software support library into a Linux device driver.
To take into account the special alignment requirements of the \ac{pim} data structures, this device driver must also carefully consider the virtual address translation of the Linux kernel, possibly making use of so-called \acp{hugetlb}, as the alignment requirements exceed the default page size of $\qty{4}{\kibi\byte}$. -For a better evaluation of the performance gains of \aca{fimdram}, it should be also compared with real-world \ac{dnn} applications. -Effects such as the initialization overhead of \aca{fimdram} can only be evaluated in such an environment. -Furthermore, the integration of \aca{fimdram} should be extended to \acp{gpu} or \acp{tpu}, so that the comparison can be extended to the deployment of the real \ac{dnn} applications. +For a better evaluation of the performance gains of \aca{fimdram}, it should then be compared with real-world \ac{dnn} applications. +Effects such as the initialization overhead of \aca{fimdram} can only be realistically evaluated in such an environment. +Furthermore, the support software implementation for \aca{fimdram} should be extended to execute on the \ac{gpu} model provided by gem5, so that the comparison can be extended to the deployment of real \ac{dnn} applications. +This would provide a considerably better basis for analyzing the effects of \ac{pim} on real applications running on representative hardware models. Further research could also investigate whether the library-based approach of leveraging \ac{pim} could be replaced by a compiler-based approach. -A special compiler extension would be able to generate the necessary \ac{ld} and \ac{st} instructions by analyzing the data types of the operands. -This extension might also make use of so-called non-temporal instructions that bypass the cache hierarchy on a per-instruction basis. +A special compiler extension would be able to generate the necessary \ac{ld} and \ac{st} instructions by analyzing the data types of the operands and the provided arithmetic operation.
+This extension could also make use of so-called non-temporal instructions, which bypass the cache hierarchy on a per-instruction basis instead of preallocating the entire \ac{pim}-enabled memory as non-cacheable. -In conclusion, \ac{pim} is a promising approach to address the future processing needs of \ac{ai} and possibly other applications. -Not only the architecture itself has to be considered, but also the integration of \ac{pim} into the applications at the software level. +In addition to the performance comparison, further research should also model and compare the power efficiency gain of \ac{pim} to the non-\ac{pim} case. +Since \ac{pim} not only provides a shorter computation time per operation, but also does not actually drive the memory data bus during operation, it promises substantial improvements in this area. +However, this would require a detailed power model of both \aca{hbm} and \aca{fimdram}. + +In conclusion, \ac{pim} is a promising approach to address the future processing and power efficiency needs of \ac{ai} and possibly other applications. +Research needs to consider not only the architecture itself, but also the integration of \ac{pim} into applications at the software level. By overcoming these challenges, \ac{pim} could be part of the solution to increase the performance and energy efficiency of future computing platforms. % what to do better: diff --git a/src/chapters/dram.tex b/src/chapters/dram.tex index 44ab5a6..cc0892e 100644 --- a/src/chapters/dram.tex +++ b/src/chapters/dram.tex @@ -128,6 +128,6 @@ In the center of the die, the \acp{tsv} connect the die to the next die above it \end{figure} % still, bandwidth requirements of new AI applications are not met by HBM2:waq -Although \aca{hbm} provides a high amount of bandwidth, many modern \acp{dnn} applications reside in the memory-bounded limitations. +Although \aca{hbm} provides a high amount of bandwidth, many modern \ac{dnn} applications remain memory-bound.
While one approach would be to further increase the bandwidth by integrating more stacks on the silicon interposer, other constraints such as thermal limits or the limited number of \ac{io} connections on the interposer may make this impractical \cite{lee2021}. Another approach could be \acf{pim}: Using \ac{hbm}'s 2.5D architecture, it is possible to incorporate additional compute units directly into the memory stacks, increasing the achievable parallel bandwidth and reducing the burden of transferring all the data to the host processor for performing operations on it. diff --git a/src/chapters/implementation/vm.tex b/src/chapters/implementation/vm.tex index 8574f51..d26fbbf 100644 --- a/src/chapters/implementation/vm.tex +++ b/src/chapters/implementation/vm.tex @@ -6,7 +6,8 @@ To implement \aca{fimdram} in \aca{hbm}, the \ac{dram} model of DRAMSys has to b They also need to be provided it with the burst data from the \acp{ssa} as well as the burst address to calculate the register indices in the \ac{aam} operation mode. However, no changes are required in the frontend or backend of DRAMSys, as already described in \cref{sec:pim_fim} no changes are required in the memory controller. In addition, since a single \ac{dram} \ac{rd} or \ac{wr} command triggers the execution of a single microkernel instruction, the processing unit is fully synchronized with the read and write operations of the \ac{dram}. -As a result, the \aca{fimdram} model itself does not need to model any timing behavior: its submodel is essentially untimed, since it is already synchronized with the operation of the \ac{dram} model of DRAMSys. +As a result, the \aca{fimdram} model itself does not need to model any timing behavior: +Its submodel is essentially untimed, since it is already synchronized with the operation of the \ac{dram} model of DRAMSys. 
This leads to a significantly simplified model, since the internal pipeline stages of \aca{fimdram} do not need to be modeled, but only the functional behavior of a processing unit to the outside. While \aca{fimdram} operates in the default \ac{sb} mode, it behaves exactly like a normal \aca{hbm} memory. @@ -21,7 +22,7 @@ With more information from Samsung on how the actual mechanism is implemented, t When entering \ac{ab} mode, the \ac{dram} model ignores the specific bank address of incoming \ac{wr} commands and internally performs the write operation for either all even or all odd banks of the \ac{pch}, depending on the parity of the original bank index. This mode can be used by the host to initialize the input vector chunk interleaving as described in \cref{sec:memory_layout}, or to initialize the \ac{crf} of the processing unit with the microkernel, which should be the same for all operating banks. -After the transition to the \ac{ab} mode, the \ac{dram} can further transition to the \ac{ab}-\ac{pim} mode, which allows the execution of instructions in the processing units. +After the transition to the \ac{ab} mode, the \ac{dram} can further transition to the \ac{abp} mode, which allows the execution of instructions in the processing units. The \ac{abp} mode is similar to the \ac{ab} mode in that it also ignores the concrete bank address except for its parity, while additionally passing the column and row address and, in the case of a read, also the respective fetched bank data to the processing units. In the case of a write access, the output of the processing unit is written directly into the corresponding bank, ignoring the actual data of the transaction object coming from the host processor. This is equivalent to the real \aca{fimdram} implementation, where the global \ac{io} bus of the memory is not actually driven, and all data movement is done internally in the banks. 
diff --git a/src/chapters/introduction.tex b/src/chapters/introduction.tex index 8692858..aacbc81 100644 --- a/src/chapters/introduction.tex +++ b/src/chapters/introduction.tex @@ -28,8 +28,8 @@ In recent years, domain-specific accelerators, such as \acp{gpu} or \acp{tpu} ha However, research must also take into account off-chip memory - moving data between the computation unit and the \ac{dram} is very costly, as fetching operands consumes more power than performing the computation on them itself. While performing a double precision floating point operation on a $\qty{28}{\nano\meter}$ technology might consume an energy of about $\qty{20}{\pico\joule}$, fetching the operands from \ac{dram} consumes almost 3 orders of magnitude more energy at about $\qty{16}{\nano\joule}$ \cite{dally2010}. -Furthermore, many types of \acp{dnn} used for language and speech processing, such as \acp{rnn}, \acp{mlp} and some layers of \acp{cnn}, are severely limited by the memory bandwidth that the \ac{dram} can provide, making them \textit{memory-bounded} \cite{he2020}. -In contrast, compute-intensive workloads, such as visual processing, are referred to as \textit{compute-bounded}. +Furthermore, many types of \acp{dnn} used for language and speech processing, such as \acp{rnn}, \acp{mlp} and some layers of \acp{cnn}, are severely limited by the memory bandwidth that the \ac{dram} can provide, making them \textit{memory-bound} \cite{he2020}. +In contrast, compute-intensive workloads, such as visual processing, are referred to as \textit{compute-bound}. \begin{figure}[!ht] \centering @@ -41,9 +41,9 @@ In contrast, compute-intensive workloads, such as visual processing, are referre In the past, specialized types of \ac{dram} such as \ac{hbm} have been able to meet the high bandwidth requirements. However, recent \ac{ai} technologies require even greater bandwidth than \ac{hbm} can provide \cite{kwon2021}. 
-All things considered, to meet the need for more energy-efficient computing systems, which are increasingly becoming memory-bounded, new approaches to computing are required. +All things considered, to meet the need for more energy-efficient computing systems, which are increasingly becoming memory-bound, new approaches to computing are required. This has led researchers to reconsider past \ac{pim} architectures and advance them further \cite{lee2021}. -\Ac{pim} integrates computational logic into the \ac{dram} itself, to exploit minimal data movement cost and extensive internal data parallelism \cite{sudarshan2022}, making it a good fit for memory-bounded problems. +\Ac{pim} integrates computational logic into the \ac{dram} itself, to exploit minimal data movement cost and extensive internal data parallelism \cite{sudarshan2022}, making it a good fit for memory-bound problems. This work analyzes various \ac{pim} architectures, identifies the challenges of integrating them into state-of-the-art \acp{dram}, examines the changes required in the way applications lay out their data in memory and explores a \ac{pim} implementation from one of the leading \ac{dram} vendors. The remainder of this work is structured as follows: diff --git a/src/chapters/pim.tex b/src/chapters/pim.tex index 6da32e2..4a46ccc 100644 --- a/src/chapters/pim.tex +++ b/src/chapters/pim.tex @@ -151,7 +151,7 @@ The 16-wide \ac{simd} units correspond to the 256-bit prefetch architecture of \ As all \ac{pim} units operate in parallel, with 16 banks per \ac{pch}, a singular memory access loads a total of $\qty{256}{\bit}\cdot\qty{8}{processing\ units}=\qty{2048}{\bit}$ into the \acp{fpu}. As a result, the theoretical internal bandwidth of \aca{fimdram} is $\qty{8}{\times}$ higher than the external bus bandwidth to the host processor. 
-\Ac{hbm}-\ac{pim} defines three operating modes: +\Aca{fimdram} defines three operating modes: \begin{enumerate} \item \textbf{\Ac{sb} Mode}: This is the default operating mode, where \aca{fimdram} has identical behavior to normal \aca{hbm} memory. @@ -165,13 +165,13 @@ As a result, the theoretical internal bandwidth of \aca{fimdram} is $\qty{8}{\ti In addition, the \ac{io} circuits of the \ac{dram} are completely disabled in this mode, reducing the power required during \ac{pim} operation. \end{enumerate} -Both in \ac{ab} mode and in \ac{ab}-\ac{pim} mode, the total \aca{hbm} bandwidth per \ac{pch} of $\qty[per-mode=symbol]{16}{\giga\byte\per\second}$ is $\qty{8}{\times}$ higher with $\qty[per-mode=symbol]{128}{\giga\byte\per\second}$ or in total $\qty[per-mode=symbol]{2}{\tera\byte\per\second}$ for 16 \acp{pch}. +Both in \ac{ab} mode and in \ac{abp} mode, the effective bandwidth is $\qty{8}{\times}$ higher than the external \aca{hbm} bandwidth per \ac{pch} of $\qty[per-mode=symbol]{16}{\giga\byte\per\second}$, reaching $\qty[per-mode=symbol]{128}{\giga\byte\per\second}$ per \ac{pch} or $\qty[per-mode=symbol]{2}{\tera\byte\per\second}$ in total for 16 \acp{pch}. \subsubsection{Processing Unit} Due to the focus on \ac{dnn} applications in \aca{fimdram}, the native data type for the \acp{fpu} is \ac{fp16}, which is motivated by the significantly lower area and power requirements for \acp{fpu} compared to \ac{fp32}. In addition, \ac{fp16} is well-supported on modern processor architectures such as ARMv8, which not only include \ac{fp16} \acp{fpu} themselves, but also support \ac{simd} operations using special vector registers. -The \ac{simd} \ac{fpu} of the processing units is implemented once as a \ac{fp16} multiplier unit, and once as a \ac{fp16} adder unit, providing support for these basic algorithmic operations. +The \ac{simd} \acp{fpu} of the processing units are implemented once as a \ac{fp16} multiplier unit and once as a \ac{fp16} adder unit, providing support for these basic arithmetic operations.
In addition to the \acp{fpu}, a processing unit consists also of \acp{crf}, \acp{srf} and \acp{grf}. The \ac{crf} acts as an instruction buffer, holding the 32 32-bit instructions to be executed by the processor when performing a memory access. One program that is stored in the \ac{crf} is called a \textit{microkernel}. @@ -406,7 +406,7 @@ The following \cref{sec:vp} introduces the concept of virtual prototyping, which \begin{landscape} \begin{figure} \input{images/matrix_layout} -\caption[Mapping of the weight matrix onto the memory banks and its layout in the linear address space.]{Mapping of the weight matrix onto the memory banks and its layout in the linear address space.} +\caption{Mapping of the weight matrix onto the memory banks and its layout in the linear address space.} \label{img:matrix_layout} \end{figure} \end{landscape} diff --git a/src/chapters/results.tex b/src/chapters/results.tex index aef4f8d..ff94cd8 100644 --- a/src/chapters/results.tex +++ b/src/chapters/results.tex @@ -83,7 +83,7 @@ This allows an exaggerated evaluation of the performance gains of \ac{pim} in an % dann HAXPY The first set of benchmarks analyzes the speedup of \aca{fimdram} for various vector operations, namely an element-wise vector add operation (VADD), an element-wise vector multiply operation (VMUL), and a \ac{haxpy} operation. -Such vector operations have a low operational density and are particularly memory-bounded because there is no data reuse at all and two input operands must be loaded for each operation. +Such vector operations have a low operational density and are particularly memory-bound because there is no data reuse at all and two input operands must be loaded for each operation. As a result, the on-chip cache does not accelerate such workloads because all operand data must be fetched from memory anyway. 
The workloads adhere to the following calculation patterns: @@ -148,7 +148,7 @@ As all speedup values are well above 1, it can be concluded that even the smalle \end{figure} In addition to the generic ARM-based system, the same benchmarks were run on the hypothetical infinite compute system, the results of which are shown in \cref{fig:vector_infinite}. -As it can be seen, the achievable speedup in the completely memory-bounded system is with a range of $\qtyrange{10.2}{17.6}{\times}$ lower than in the generic system. +As can be seen, the achievable speedup in the completely memory-bound system, ranging from $\qtyrange{10.2}{17.6}{\times}$, is lower than in the generic system. This is expected as the system becomes completely memory-bound and no longer relies on the relatively slow ARM processor. The variance in speedup between different vector dimensions is also fairly low. % For the \ac{haxpy} benchmark, the smaller variance of $\qtyrange{2.0}{2.4}{\times}$ can be interpreted as follows: @@ -263,7 +263,7 @@ Therefore, the simulations can be directly compared to gain a good understanding Each of Samsung's benchmarks is run with different batch sizes, where a larger batch size allows for better cache utilization as multiple operations are performed on the same data set, making the workload less memory-bound and therefore \ac{pim} less effective. All the microbenchmarks discussed so far do not perform batching, so all comparisons are made against the results for the batch size of 1, which correspond to the blue bars in \cref{fig:samsung_speedup}. -Since the Samsung \ac{fpga} platform can be assumed to be a highly optimized accelerator, the infinite compute approach would be a more viable baseline for comparison than the \ac{cpu} approach, as both systems should be operating in the memory-bounded region.
+Since the Samsung \ac{fpga} platform can be assumed to be a highly optimized accelerator, the infinite compute approach would be a more viable baseline for comparison than the \ac{cpu} approach, as both systems should be operating in the memory-bound region. \begin{figure} \centering @@ -284,9 +284,16 @@ In summary, the results for the VADD workload show some deviation from the real- \subsubsection{Comparison to Real Hardware} In addition to comparing Samsung's real hardware implementation, the same benchmarks of the simulations performed are run on two real \ac{gpu} systems, here referred to as Vega and Tesla. -The former system is the consumer \ac{gpu} \textit{Radeon RX Vega 56} from AMD, while the latter is the \textit{Tesla V100} \ac{gpu} from Nvidia, specifically tailored for \ac{hpc}. -Both \acp{gpu} make use of \aca{hbm} and therefore are greatly suited to classify the simulation results and get an overview of the workload runtimes on a real system. -As both systems are using generic \aca{hbm} \ac{dram} and not \aca{fimdram}, the measurements are only intended to serve as a vague estimation of the runtimes in a non-\ac{pim} case. +The former system is the consumer \ac{gpu} \textit{Radeon RX Vega 56} from AMD, while the latter is the \textit{Tesla V100} \ac{gpu} from NVIDIA, specifically tailored for \ac{hpc}. +This Tesla \ac{gpu} is one of the 16 \acp{gpu} that make up the NVIDIA DGX-2 \ac{ai} workstation. +Both \acp{gpu} make use of \aca{hbm} and are therefore well suited to classify the simulation results and get an overview of the workload runtimes on a real system. +Since both systems use generic \aca{hbm} \ac{dram} and not \aca{fimdram}, the measurements should only be used as a rough estimate of the runtimes in a non-\ac{pim} case. + +The Vega \ac{gpu} integrates $\qty{8}{\gibi\byte}$ of \aca{hbm} memory using two stacks, achieving a total bus width of $\qty{2048}{\bit}$ across 16 memory channels.
+For the theoretical performance of \ac{fp16} operations, Vega achieves a value of $\qty{21.09}{\tera FLOPS}$ \cite{vega2017}. + +Being specifically tailored for \ac{ai} applications, the Tesla \ac{gpu} integrates $\qty{32}{\gibi\byte}$ of \aca{hbm} using four stacks, resulting in a total bus width of $\qty{4096}{\bit}$ and 32 independent memory channels. +The theoretical performance of \ac{fp16} operations is only slightly higher than that of the Vega \ac{gpu}, with a value of $\qty{28.26}{\tera FLOPS}$ \cite{tesla2018}. \begin{figure} \centering @@ -306,6 +313,16 @@ As both systems are using generic \aca{hbm} \ac{dram} and not \aca{fimdram}, the \label{fig:runtimes_matrix} \end{figure} +A comparison between all investigated systems for the vector benchmarks is shown in \cref{fig:runtimes_vector}. +As can be seen, for both the generic ARM system and the infinite compute system, the use of \ac{pim} reduces the total runtime significantly. +However, when comparing the two \ac{gpu} systems with the infinite compute approach, it can be seen that the runtime of the \acp{gpu} is actually lower, even though the infinite compute approach should be an optimal memory-bound system. +It is important to note that while the simulation considered only one memory channel, the real \acp{gpu} could distribute all operations over all available channels, 16 channels in the case of Vega and 32 channels in the case of Tesla. +This results in significantly higher overall memory bandwidth for the \ac{gpu} systems. +It should also be noted that the \aca{hbm} memory of the Tesla \ac{gpu} is clocked at a slightly higher frequency of $\qty{876}{\mega\hertz}$ than the Vega \ac{gpu} with a frequency of $\qty{800}{\mega\hertz}$ \cite{vega2017,tesla2018}. +In addition, while the infinite compute system does not use any computing time, it may still need to stall and wait for memory requests to complete.
+This is especially true for the explicitly inserted memory barriers in the \ac{pim} kernels. +Taken together, these factors can explain why the \ac{gpu} systems execute faster than the infinite compute system. + % \subsubsection{Initialization Overhead} % conversion der operanden im verhältnis zur laufzeit abschätzen diff --git a/src/images/hbm.pdf b/src/images/hbm.pdf index 41d4238..1427063 100644 Binary files a/src/images/hbm.pdf and b/src/images/hbm.pdf differ diff --git a/src/plots/runtime_tables/tesla.csv b/src/plots/runtime_tables/tesla.csv index b0471fa..642be5e 100644 --- a/src/plots/runtime_tables/tesla.csv +++ b/src/plots/runtime_tables/tesla.csv @@ -1,5 +1,5 @@ level,vadd,vmul,haxpy,gemv,dnn -X1,69572650,67408281,69791189,750246152,231093065 -X2,123217536,103994272,123543145,648714601,431703456 -X3,207693503,182162140,207947543,2454455479,877622611 -X4,378089165,350280326,377434890,4968984949,2175751385 +X1,17282586,17180880,17121019,16984363,91489038 +X2,31633105,31633649,31802257,26425737,151112206 +X3,60059785,60065489,60021288,86860818,142148495 +X4,116919805,116812209,116847802,166443969,89307502 diff --git a/src/tables/torch.csv b/src/tables/torch.csv index f022923..8ade9f9 100644 --- a/src/tables/torch.csv +++ b/src/tables/torch.csv @@ -1,21 +1,21 @@ workload,level,vega,tesla -VADD,X1,69572650,69572650 -VADD,X2,123217536,123217536 -VADD,X3,207693503,207693503 -VADD,X4,378089165,378089165 -VMUL,X1,67408281,67408281 -VMUL,X2,103994272,103994272 -VMUL,X3,182162140,182162140 -VMUL,X4,350280326,350280326 -HAXPY,X1,69791189,69791189 -HAXPY,X2,123543145,123543145 -HAXPY,X3,207947543,207947543 -HAXPY,X4,377434890,377434890 -GEMV,X1,750246152,750246152 -GEMV,X2,648714601,648714601 -GEMV,X3,2454455479,2454455479 -GEMV,X4,4968984949,4968984949 -DNN,X1,231093065,231093065 -DNN,X2,431703456,431703456 -DNN,X3,877622611,877622611 -DNN,X4,2175751385,2175751385 +VADD,X1,69572650,17282586 +VADD,X2,123217536,31633105
+VADD,X3,207693503,60059785 +VADD,X4,378089165,116919805 +VMUL,X1,67408281,17180880 +VMUL,X2,103994272,31633649 +VMUL,X3,182162140,60065489 +VMUL,X4,350280326,116812209 +HAXPY,X1,69791189,17121019 +HAXPY,X2,123543145,31802257 +HAXPY,X3,207947543,60021288 +HAXPY,X4,377434890,116847802 +GEMV,X1,750246152,16984363 +GEMV,X2,648714601,26425737 +GEMV,X3,2454455479,86860818 +GEMV,X4,4968984949,166443969 +DNN,X1,231093065,91489038 +DNN,X2,431703456,151112206 +DNN,X3,877622611,142148495 +DNN,X4,2175751385,89307502