
## Simulations
### Microbenchmarks
<hr/>

<br>

<div class="grid grid-cols-2 gap-4">
<div>

- Vector benchmarks (BLAS level 1)
    - VADD: $z = x + y$
    - VMUL: $z = x \odot y$ (element-wise)
    - HAXPY: $z = a \cdot x + y$

- Matrix-vector benchmarks (BLAS level 2)
    - GEMV: $z = A \cdot x$
    - Simple DNN:
      - $f(x) = z = \mathrm{ReLU}(A \cdot x)$
      - $z_{n+1} = f(z_n)$
      - 5 layers in total

</div>
<div>

<br>

| Level | Vector | GEMV          | DNN           |
|-------|--------|---------------|---------------|
| X1    | (2M)   | (1024 x 4096) | (256 x 256)   |
| X2    | (4M)   | (2048 x 4096) | (512 x 512)   |
| X3    | (8M)   | (4096 x 8192) | (1024 x 1024) |
| X4    | (16M)  | (4096 x 8192) | (2048 x 2048) |

Operand Dimensions

</div>
</div>

<!--
- operand data significantly larger than on-chip cache
-->
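
---

## Simulations
### Microbenchmarks / Reference Sketch
<hr/>

A minimal sketch of what each microbenchmark computes, in plain Python. It only illustrates the semantics and is not the benchmark code used in the simulations. Assumptions: the function names are illustrative, VMUL is taken to be element-wise, and the DNN reuses one square weight matrix `A` for all 5 layers.

```python
# Reference semantics only; plain Python lists stand in for the operands.

def vadd(x, y):
    return [xi + yi for xi, yi in zip(x, y)]

def vmul(x, y):
    # Element-wise product (assumed interpretation of VMUL).
    return [xi * yi for xi, yi in zip(x, y)]

def haxpy(a, x, y):
    return [a * xi + yi for xi, yi in zip(x, y)]

def gemv(A, x):
    # One dot product per matrix row.
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in A]

def dnn(A, x, layers=5):
    # z_{n+1} = ReLU(A * z_n), with the same A in every layer (assumption).
    z = x
    for _ in range(layers):
        z = [max(0.0, v) for v in gemv(A, z)]
    return z
```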

---

## Simulations
### System Configuration
<hr/>

<br>
<br>
<br>

<div class="grid grid-cols-2 gap-4">
<div>

#### Two simulated systems:

<br>

- Generic ARM system
- Infinite compute system
  - unrealistic frequency of 100 GHz
  - completely memory bound
  - lower bound of possible speedup

</div>

<div>

<br>

#### Two real GPUs using HBM2:

<br>

- AMD RX Vega 56
- NVIDIA Tesla V100

</div>
</div>

---
layout: figure
figureUrl: /speedup_normal.svg
figureCaption: Speedups of PIM compared to non-PIM
---

## Simulations
### Speedups / Generic ARM System
<hr/>

---
layout: figure
figureUrl: /speedup_inf.svg
figureCaption: Speedups of PIM compared to non-PIM
---

## Simulations
### Speedups / Infinite Compute System
<hr/>

<!--
- VADD: 12.7x
- GEMV: 9.0x
-->

---
layout: figure
figureUrl: /samsung.svg
figureCaption: Speedups compared to Samsung's results for VADD and GEMV
---

## Simulations
### Speedups / Samsung
<hr/>

<Footnotes separator>
  <Footnote>
  Lee et al., "Hardware Architecture and Software Stack for PIM Based on Commercial DRAM Technology: Industrial Product", 2021.
</Footnote>
</Footnotes>

<!--
- GEMV matches well
- VADD shows deviation

-> differences in hardware architecture
- GPU has no speculative execution
-->

---
layout: figure
figureUrl: /runtimes_vector.svg
figureCaption: Runtimes for Vector Benchmarks
---

## Simulations
### Runtimes / Vector Benchmarks
<hr/>

<!--
- Real GPUs use multiple memory channels
- Memory barriers
- Also architectural differences
-->

---
layout: figure
figureUrl: /runtimes_matrix.svg
figureCaption: Runtimes for Matrix Benchmarks
---

## Simulations
### Runtimes / Matrix Benchmarks
<hr/>