## Simulations
### Microbenchmarks
<hr/>

<br>

<div class="grid grid-cols-2 gap-4">
<div>

- Vector benchmarks (BLAS level 1)
  - VADD: $z = x + y$
  - VMUL: $z = x \odot y$ (element-wise)
  - HAXPY: $z = a \cdot x + y$

- Vector-Matrix benchmarks (BLAS level 2)
  - GEMV: $z = A \cdot x$
  - Simple DNN:
    - $f(x) = z = \mathrm{ReLU}(A \cdot x)$
    - $z_{n+1} = f(z_n)$
    - 5 layers in total

</div>
<div>

<br>

| Level | Vector | GEMV          | DNN           |
|-------|--------|---------------|---------------|
| X1    | (2M)   | (1024 x 4096) | (256 x 256)   |
| X2    | (4M)   | (2048 x 4096) | (512 x 512)   |
| X3    | (8M)   | (4096 x 8192) | (1024 x 1024) |
| X4    | (16M)  | (4096 x 8192) | (2048 x 2048) |

Operand Dimensions

</div>
</div>

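
The kernels listed above can be sketched as plain reference (non-PIM) functions. This is only an illustrative sketch, not the benchmark code used in the simulations: the function names are hypothetical, VMUL is assumed to be element-wise, the DNN is assumed to use one square weight matrix per layer, and numeric precision (e.g. half precision for HAXPY) is ignored.

```python
# Illustrative reference kernels on plain Python lists (all names are hypothetical).

def vadd(x, y):
    """VADD: z = x + y (element-wise)."""
    return [xi + yi for xi, yi in zip(x, y)]

def vmul(x, y):
    """VMUL: z = x * y (assumed element-wise)."""
    return [xi * yi for xi, yi in zip(x, y)]

def haxpy(a, x, y):
    """HAXPY: z = a * x + y with a scalar a."""
    return [a * xi + yi for xi, yi in zip(x, y)]

def gemv(A, x):
    """GEMV: z = A * x, with A given as a list of rows."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in A]

def relu(v):
    """Element-wise ReLU."""
    return [max(0.0, vi) for vi in v]

def dnn(layers, x):
    """Simple DNN: z_{n+1} = ReLU(A_n * z_n), one matrix per layer (assumption)."""
    z = x
    for A in layers:
        z = relu(gemv(A, z))
    return z

x = [1.0, 2.0]
y = [3.0, -4.0]
A = [[1.0, 0.0], [0.0, -1.0]]

print(vadd(x, y))        # [4.0, -2.0]
print(haxpy(2.0, x, y))  # [5.0, 0.0]
print(dnn([A] * 5, x))   # 5 layers, as on the slide
```
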
<!--
- operand data significantly larger than on-chip cache
-->

---

## Simulations
### System Configuration
<hr/>

<br>
<br>
<br>

<div class="grid grid-cols-2 gap-4">
<div>

#### Two simulated systems:

<br>

- Generic ARM system
- Infinite compute system
  - unrealistic frequency of 100 GHz
  - completely memory bound
  - lower bound of possible speedup

</div>

<div>

<br>

#### Two real GPUs using HBM2:

<br>

- AMD RX Vega 56
- NVIDIA Tesla V100

</div>
</div>

---
layout: figure
figureUrl: /speedup_normal.svg
figureCaption: Speedups of PIM compared to non-PIM
---

## Simulations
### Speedups / Generic ARM System
<hr/>

---
layout: figure
figureUrl: /speedup_inf.svg
figureCaption: Speedups of PIM compared to non-PIM
---

## Simulations
### Speedups / Infinite Compute System
<hr/>

<!--
- VADD: 12.7x
- GEMV: 9.0x
-->

---
layout: figure
figureUrl: /samsung.svg
figureCaption: Speedups of Samsung for VADD and GEMV
---

## Simulations
### Speedups / Samsung
<hr/>

<Footnotes separator>
<Footnote>
Lee et al. "Hardware Architecture and Software Stack for PIM Based on Commercial DRAM Technology: Industrial Product", 2021.
</Footnote>
</Footnotes>

<!--
- GEMV matches well
- ADD shows a deviation

-> differences in hardware architecture
- GPU has no speculative execution
-->

---
layout: figure
figureUrl: /runtimes_vector.svg
figureCaption: Runtimes for Vector Benchmarks
---

## Simulations
### Runtimes / Vector Benchmarks
<hr/>

<!--
- Real GPUs use multiple memory channels
- Memory barriers
- Also architectural differences
-->

---
layout: figure
figureUrl: /runtimes_matrix.svg
figureCaption: Runtimes for Matrix Benchmarks
---

## Simulations
### Runtimes / Matrix Benchmarks
<hr/>