## Simulations
### Microbenchmarks

- Vector benchmarks (BLAS level 1)
  - VADD: $z = x + y$
  - VMUL: $z = x \cdot y$
  - HAXPY: $z = a \cdot x + y$
- Vector-Matrix benchmarks (BLAS level 2)
  - GEMV: $z = A \cdot x$
- Simple DNN:
  - $f(x) = z = \mathrm{ReLU}(A \cdot x)$
  - $z_{n+1} = f(z_n)$
  - 5 layers in total
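
The kernels listed above can be written down as a small reference sketch. This is plain Python for illustration only, not the benchmark code that was simulated. The function names are hypothetical, VMUL is assumed to be element-wise, and the DNN is assumed to use square weight matrices so that each layer's output can feed the next layer.

```python
def vadd(x, y):
    """VADD: z = x + y (element-wise)."""
    return [xi + yi for xi, yi in zip(x, y)]


def vmul(x, y):
    """VMUL: z = x * y (assumed element-wise)."""
    return [xi * yi for xi, yi in zip(x, y)]


def haxpy(a, x, y):
    """HAXPY: z = a * x + y with scalar a."""
    return [a * xi + yi for xi, yi in zip(x, y)]


def gemv(A, x):
    """GEMV: z = A * x, with A given as a list of rows."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in A]


def relu(v):
    """ReLU applied element-wise."""
    return [max(0.0, vi) for vi in v]


def dnn(layers, x):
    """Simple DNN: z_{n+1} = ReLU(A_n * z_n), one weight matrix per layer."""
    z = x
    for A in layers:
        z = relu(gemv(A, z))
    return z


# Usage: five layers, as in the benchmark description (toy 2 x 2 weights).
weights = [[[1.0, -1.0], [-1.0, 1.0]] for _ in range(5)]
result = dnn(weights, [1.0, 0.0])
```

Every kernel touches each operand element only once or a small constant number of times, which is why the runtime of these benchmarks is dominated by data movement.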

| Level | Vector | GEMV          | DNN           |
|-------|--------|---------------|---------------|
| X1    | (2M)   | (1024 x 4096) | (256 x 256)   |
| X2    | (4M)   | (2048 x 4096) | (512 x 512)   |
| X3    | (8M)   | (4096 x 8192) | (1024 x 1024) |
| X4    | (16M)  | (4096 x 8192) | (2048 x 2048) |

Operand Dimensions

---

## Simulations
### System Configuration



#### Two simulated systems:
- Generic ARM system
- Infinite compute system
  - unrealistic frequency of 100 GHz
  - completely memory bound
  - lower bound of possible speedup
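
Why a very high clock frequency makes the system memory bound can be illustrated with a simple roofline-style estimate. The sketch below is a simplified model, not the simulator configuration: the function name, the bandwidth, the element size, and the throughput of one operation per cycle are all hypothetical placeholder values.

```python
def runtime_estimate(ops, bytes_moved, freq_hz, ops_per_cycle, bandwidth_bytes_s):
    """Return (total, compute, memory) time in seconds.

    Simplified model: compute and memory transfers overlap perfectly,
    so the slower of the two determines the runtime.
    """
    compute = ops / (freq_hz * ops_per_cycle)
    memory = bytes_moved / bandwidth_bytes_s
    return max(compute, memory), compute, memory


# Hypothetical VADD workload: 2 Mi elements, 2 bytes each (assumption),
# two input vectors read and one output vector written.
n = 2 * 1024 * 1024
ops = n
bytes_moved = 3 * 2 * n

# Placeholder machine parameters, chosen only for illustration.
bandwidth = 16e9  # bytes per second
slow = runtime_estimate(ops, bytes_moved, 1e9, 1, bandwidth)
fast = runtime_estimate(ops, bytes_moved, 100e9, 1, bandwidth)

# At 100 GHz the compute term shrinks far below the memory term,
# so the total runtime equals the memory transfer time.
memory_bound_at_100ghz = fast[0] == fast[2]
```

In this model, raising the frequency further no longer changes the total runtime, so any remaining speedup from PIM comes from reduced data movement alone.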

#### Two real GPUs using HBM2:
- AMD RX Vega 56
- NVIDIA Tesla V100

---
layout: figure
figureUrl: /speedup_normal.svg
figureCaption: Speedups of PIM compared to non-PIM
---

## Simulations
### Speedups / Generic ARM System

---
layout: figure
figureUrl: /speedup_inf.svg
figureCaption: Speedups of PIM compared to non-PIM
---

## Simulations
### Speedups / Infinite Compute System

---
layout: figure
figureUrl: /samsung.svg
figureCaption: Speedups of Samsung for VADD and GEMV
---

## Simulations
### Speedups / Samsung

Lee et al. "Hardware Architecture and Software Stack for PIM Based on Commercial DRAM Technology: Industrial Product", 2021.

---
layout: figure
figureUrl: /runtimes_vector.svg
figureCaption: Runtimes for Vector Benchmarks
---

## Simulations
### Runtimes / Vector Benchmarks

---
layout: figure
figureUrl: /runtimes_matrix.svg
figureCaption: Runtimes for Matrix Benchmarks
---

## Simulations
### Runtimes / Matrix Benchmarks