Minor changes

2024-04-07 22:41:59 +02:00
parent 3d15758c82
commit d634f97fb2
6 changed files with 107 additions and 61 deletions


@@ -3,12 +3,13 @@
<br>
Speedups of 17.6× and 9.0× have been achieved for the hypothetical infinite compute system
- PIM can accelerate memory-bound workloads
- Special PIM-friendly memory layouts are required
<br>
-Future work:
+#### Future work:
- Implementation of Linux driver
- Comparison with complete neural networks
- Consider replacing library approach with compiler approach
- Implement a power model to analyze the power efficiency gains


@@ -38,28 +38,12 @@ figureCaption: Data structures for instructions and register files
- Provides data structures for operand data and microkernels
- Executes programmed microkernels
---
layout: figure-side
figureUrl: /bare_metal.svg
---
## Virtual Prototype
### Platform
<hr/>
<br>
<br>
- Bare-metal kernel executes on ARM processor model
- Custom page table configuration
- Non-PIM DRAM region mapped as cacheable memory
- PIM DRAM region mapped as non-cacheable memory
- Accesses generate RD and WR requests
---
## Virtual Prototype
### Platform
### GEMV Kernel
<hr/>
<br>
@@ -68,7 +52,7 @@ figureUrl: /bare_metal.svg
<div>
DRAM-side
-```asm{all|1-8|9,10|11|12|all}{lines:true,at:1}
+```asm{all|1-8|9,10|11|12}{lines:true,at:1}
MOV GRF_A #0, BANK
MOV GRF_A #1, BANK
MOV GRF_A #2, BANK
@@ -94,7 +78,7 @@ code {
Host-side
-```rust {all|7-10|12-17|19-28|30-31|all}{lines:true,maxHeight:'15em',at:1}
+```rust {all|7-10|12-17|19-28|30-31}{lines:true,maxHeight:'15em',at:1}
pub fn execute<const X16R: usize, const X16C: usize, const R: usize>(
matrix: &Matrix<X16R, X16C>,
input_vector: &Vector<X16C>,
@@ -131,4 +115,24 @@ pub fn execute<const X16R: usize, const X16C: usize, const R: usize>(
</div>
</div>
<!-- </Transform> -->
---
layout: figure-side
figureUrl: /bare_metal.svg
---
## Virtual Prototype
### Platform
<hr/>
<br>
<br>
- ARM processor model
- Bare-metal kernel
- Custom page table configuration
- Non-PIM DRAM region mapped as cacheable memory
- PIM DRAM region mapped as non-cacheable memory
<!--
- bare metal offers the most control
-->
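The address-space split described above (cacheable normal DRAM vs. non-cacheable PIM DRAM) can be sketched as a small Rust memory map. The base addresses, region sizes, and all names below are illustrative assumptions, not values from the slides.

```rust
// Hypothetical memory map for the bare-metal kernel. The concrete base
// addresses and sizes are invented for illustration.
#[derive(Debug, PartialEq)]
enum Attr {
    Cacheable,
    NonCacheable,
}

struct Region {
    base: u64,
    size: u64,
    attr: Attr,
}

static REGIONS: [Region; 2] = [
    // Normal DRAM: mapped cacheable so the CPU can use its caches.
    Region { base: 0x8000_0000, size: 0x1000_0000, attr: Attr::Cacheable },
    // PIM DRAM: mapped non-cacheable so every load/store reaches the
    // memory controller and triggers an actual RD/WR DRAM command.
    Region { base: 0x9000_0000, size: 0x1000_0000, attr: Attr::NonCacheable },
];

// Look up which attribute a given physical address would get.
fn attr_for(addr: u64) -> Option<&'static Attr> {
    REGIONS
        .iter()
        .find(|r| addr >= r.base && addr < r.base + r.size)
        .map(|r| &r.attr)
}
```

In a real page-table setup these attributes would be encoded in the descriptors (e.g. via MAIR on ARM); the lookup table above only captures the policy, not the encoding.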


@@ -18,6 +18,14 @@
</Footnote>
</Footnotes>
<!--
- compute doubles every two years
- energy production grows linearly at 2% per year
- to meet future compute demands
- -> drastic improvements in energy efficiency
-->
---
## Introduction
@@ -26,7 +34,7 @@
<br>
#### Roofline model of GPT revisions<sup>1</sup>
- AI workloads become increasingly memory-bound
<br>
@@ -39,3 +47,10 @@
Ivo Bolsens. „Scalable AI Architectures for Edge and Cloud“, 2023.
</Footnote>
</Footnotes>
<!--
- Emerging AI applications become increasingly memory-bound
- Roofline model
- Not limited by compute power but by memory
-> researchers begin to consider processing in memory to circumvent memory bottleneck
-->
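The roofline argument above can be made concrete with a small arithmetic-intensity check: a kernel is memory-bound when its FLOPs per byte moved fall below the machine balance (peak FLOP/s over peak bandwidth). The GEMV counts are generic; the machine figures in the usage note are illustrative assumptions, not measurements.

```rust
// Arithmetic intensity: floating-point operations per byte of memory traffic.
fn arithmetic_intensity(flops: f64, bytes: f64) -> f64 {
    flops / bytes
}

// A kernel is memory-bound if its intensity is below the machine balance.
fn is_memory_bound(ai: f64, peak_flops: f64, peak_bw_bytes: f64) -> bool {
    ai < peak_flops / peak_bw_bytes
}

// GEMV on an n×n f32 matrix: ~2n² FLOPs over ~4n² bytes of weight traffic,
// i.e. about 0.5 FLOP/byte regardless of n — far below typical machine
// balances, which is why GEMV sits on the memory roof.
fn gemv_ai(n: f64) -> f64 {
    arithmetic_intensity(2.0 * n * n, 4.0 * n * n)
}
```

For a hypothetical machine with 10 TFLOP/s and 1 TB/s (balance of 10 FLOP/byte), `is_memory_bound(gemv_ai(1024.0), 1.0e13, 1.0e12)` holds, matching the slide's claim that such workloads are limited by memory rather than compute.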


@@ -11,15 +11,8 @@
</div>
<!--
- Workload must be memory-bound
- memory-bound:
- fully-connected layers
- layers of recurrent neural networks (RNNs)
- not memory-bound:
- convolutional layers
- data reuse
- fully connected layers of a neural network
- For PIM to be effective, the workload must be memory-bound
-->
---
@@ -52,6 +45,10 @@ clicks: 1
</div>
</Transform>
<!--
- filter matrix is reused
-->
---
## Processing-in-Memory
@@ -67,7 +64,7 @@ clicks: 1
<div>
### Suitable candidates for PIM:
- Multilayer perceptrons (MLPs)
- Fully connected layers in multilayer perceptrons (MLPs)
- Layers in recurrent neural networks (RNNs)
</div>
@@ -130,19 +127,18 @@ To summarize...
</Footnotes>
<!--
- Architecture space of PIM:
- Inside the memory SA
- Ambit
- activate multiple rows at the same time
- bulk logic operations
- simple bulk logic
- Near SA in PSA output region
- CMOS-based logic gates in the region
- logic gates in the region
- Near a bank in its peripheral region
- computation units with control at bank output
- computation units with control
- I/O region of memory
- more traditional accelerator approach
- limited by memory bus
-->
---
@@ -171,12 +167,9 @@ To summarize...
</Footnotes>
<!--
- Real-world PIM implementation based on HBM2
- SIMD FPUs are 16-wide, i.e., there are 16 FPU units
- Three execution modes
- Single-Bank (SB)
- All-Bank (AB)
- All-Bank-PIM (AB-PIM)
- One PIM unit shared by two banks
- SIMD FPUs are 16-wide
- All-Bank mode: All PIM units operate in parallel
-->
---
@@ -201,16 +194,15 @@ To summarize...
</Footnotes>
<!--
- Control unit executes RISC instructions
- Two SIMD FPUs
- ADD
- MUL
- CRF: 32 32-bit entries (32 instructions)
- GRF: 16 256-bit entries
- SRF: 16 16-bit entries
- CRF: 32 instructions, stores the program
- GRF: 16 entries, one memory fetch
- SRF: 16 entries
- One instruction is executed when RD or WR command is issued
- Control unit executes one instruction when a RD or WR command is issued
-->
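The register files and the RD/WR-triggered execution model in the notes above can be sketched as a toy Rust model. The sizes follow the slides (CRF: 32 instructions, GRF: 16 entries of 256 bits, SRF: 16 entries); the struct, field, and method names are invented for illustration.

```rust
// Toy model of one PIM unit's register files.
struct PimUnit {
    crf: [u32; 32],      // command register file: stores the microkernel program
    grf: [[u8; 32]; 16], // general register file: 256-bit entries (one memory fetch)
    srf: [u16; 16],      // scalar register file: 16-bit entries
    pc: usize,           // index of the next CRF instruction
}

impl PimUnit {
    fn new() -> Self {
        PimUnit { crf: [0; 32], grf: [[0; 32]; 16], srf: [0; 16], pc: 0 }
    }

    // The host does not start the unit directly: each RD or WR command the
    // memory controller issues advances the microkernel by exactly one CRF
    // instruction. Returns the (opaque) instruction word that would execute.
    fn on_rd_or_wr(&mut self) -> u32 {
        let insn = self.crf[self.pc];
        self.pc = (self.pc + 1) % self.crf.len();
        insn
    }
}
```

This captures only the sequencing ("one instruction per RD/WR"), not the SIMD ADD/MUL datapaths themselves.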
---
@@ -229,6 +221,13 @@ figureCaption: Procedure to perform a (128×8)×(128) GEMV operation
</Footnote>
</Footnotes>
<!--
- Procedure of GEMV operation
- multiple cycles
- each PIM unit operates on one matrix row
- partial sum, reduced by host
-->
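The GEMV procedure in the notes above — per-row multiply-accumulate in the PIM units, with the host reducing partial sums — can be sketched functionally in Rust. The chunking into `units` partial sums per row is an illustrative assumption about how work is split; dimensions are arbitrary.

```rust
// Functional sketch of the PIM GEMV procedure: each row is processed as
// `units` partial dot products (one per simulated PIM unit), which the
// host then reduces into the final output element.
fn gemv_pim(matrix: &[Vec<f32>], x: &[f32], units: usize) -> Vec<f32> {
    matrix
        .iter()
        .map(|row| {
            // Split the row into one chunk per PIM unit (ceiling division).
            let chunk = (row.len() + units - 1) / units;
            // Each unit accumulates a partial dot product over its chunk.
            let partials: Vec<f32> = row
                .chunks(chunk)
                .zip(x.chunks(chunk))
                .map(|(a, b)| a.iter().zip(b).map(|(p, q)| p * q).sum())
                .collect();
            // The host reduces the partial sums.
            partials.iter().sum()
        })
        .collect()
}
```

The real hardware interleaves this over DRAM commands and multiple cycles; this sketch only shows the data flow (partial sums per unit, host-side reduction).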
---
layout: figure
figureUrl: /layout.svg
@@ -254,7 +253,7 @@ figureCaption: Mapping of the weight matrix onto the memory banks
<br>
<br>
-To analyze the performance gains of PIM, simulation models are needed
+Simulations are needed to analyze the performance gains of PIM
- Research should not only focus on hardware but also explore the software side
<br>


@@ -14,7 +14,7 @@
- Vector-Matrix benchmarks (BLAS level 2)
- GEMV: $z = A \cdot x$
- DNN:
- Simple DNN:
- $f(x) = z = ReLU(A \cdot x)$
- $z_{n+1} = f(z_n)$
- 5 layers in total
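The benchmark semantics above can be written out as plain Rust reference code: GEMV is $z = A \cdot x$, and the simple DNN chains `layers` applications of $f(z) = ReLU(A \cdot z)$. Using the same square matrix for every layer is an assumption made for brevity; function names are invented.

```rust
// Reference GEMV: z = A · x.
fn gemv(a: &[Vec<f32>], x: &[f32]) -> Vec<f32> {
    a.iter()
        .map(|row| row.iter().zip(x).map(|(p, q)| p * q).sum())
        .collect()
}

// Element-wise ReLU.
fn relu(v: Vec<f32>) -> Vec<f32> {
    v.into_iter().map(|e| e.max(0.0)).collect()
}

// Simple DNN benchmark: z_{n+1} = ReLU(A · z_n), repeated `layers` times
// (5 in the slides).
fn dnn(a: &[Vec<f32>], x: &[f32], layers: usize) -> Vec<f32> {
    (0..layers).fold(x.to_vec(), |z, _| relu(gemv(a, &z)))
}
```

These definitions serve only as a functional specification of what the benchmarks compute, independent of whether the kernels run on the host or in the PIM units.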
@@ -36,24 +36,44 @@ Operand Dimensions
</div>
</div>
<!--
- operand data significantly larger than on-chip cache
-->
---
## Simulations
### System Configuration
<hr/>
<br>
<br>
<br>
- Two simulated systems:
- Generic ARM system
- Infinite compute ARM system
<div class="grid grid-cols-2 gap-4">
<div>
#### Two simulated systems:
<br>
- Two real GPUs using HBM2:
- AMD RX Vega 56
- NVIDIA V100
- Generic ARM system
- Infinite compute system
- completely memory-bound
</div>
<div>
#### Two real GPUs using HBM2:
<br>
- AMD RX Vega 56
- NVIDIA V100
</div>
</div>
---
layout: figure
@@ -75,11 +95,15 @@ figureCaption: Speedups of PIM compared to non-PIM
### Speedups / Infinite Compute System
<hr/>
<!--
- VADD: 12.7x
- GEMV: 9.0x
-->
---
layout: figure
figureUrl: /samsung.svg
figureCaption: Speedups of Samsung for VADD and GEMV
figureFootnoteNumber: 1
---
## Simulations
@@ -97,6 +121,7 @@ figureFootnoteNumber: 1
- ADD shows deviation
-> differences in hardware architecture
- GPU has no speculative execution
-->
---
@@ -111,6 +136,7 @@ figureCaption: Runtimes for Vector Benchmarks
<!--
- Real GPUs use multiple memory channels
- Memory barriers
- Also architectural differences
-->