Minor changes

2024-04-07 22:41:59 +02:00
parent 3d15758c82
commit d634f97fb2
6 changed files with 107 additions and 61 deletions

View File

@@ -14,6 +14,7 @@ addons:
   - slidev-addon-citations
 biblio:
   filename: references.bib
+  record: true
 ---
 ### Master Thesis

View File

@@ -3,12 +3,13 @@
 <br>
-A speedup of 17.6× and 9.0× for the hypothetical infinite compute system has been achieved
+- PIM can accelerate memory-bound workloads
+- Special PIM-friendly memory layouts are required
 <br>
-Future work:
+#### Future work:
 - Implementation of Linux driver
 - Comparison with complete neural networks
 - Consider replacing library approach with compiler approach
 - Implement a power model to analyze the power efficiency gains

View File

@@ -38,28 +38,12 @@ figureCaption: Data structures for instructions and register files
 - Provides data structures for operand data and microkernels
 - Executes programmed microkernels
-- generate RD and WR requests
----
-layout: figure-side
-figureUrl: /bare_metal.svg
----
-## Virtual Prototype
-### Platform
-<hr/>
-<br>
-<br>
-- Bare-metal kernel executes on ARM processor model
-- Custom page table configuration
-  - Non-PIM DRAM region mapped as cacheable memory
-  - PIM DRAM region mapped as non-cacheable memory
 ---
 ## Virtual Prototype
-### Platform
+### GEMV Kernel
 <hr/>
 <br>
@@ -68,7 +52,7 @@ figureUrl: /bare_metal.svg
 <div>
 DRAM-side
-```asm{all|1-8|9,10|11|12|all}{lines:true,at:1}
+```asm{all|1-8|9,10|11|12}{lines:true,at:1}
 MOV GRF_A #0, BANK
 MOV GRF_A #1, BANK
 MOV GRF_A #2, BANK
@@ -94,7 +78,7 @@ code {
 Host-side
-```rust {all|7-10|12-17|19-28|30-31|all}{lines:true,maxHeight:'15em',at:1}
+```rust {all|7-10|12-17|19-28|30-31}{lines:true,maxHeight:'15em',at:1}
 pub fn execute<const X16R: usize, const X16C: usize, const R: usize>(
     matrix: &Matrix<X16R, X16C>,
     input_vector: &Vector<X16C>,
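An aside on the host-side signature in the hunk above: the `X16R`/`X16C` const generics appear to count operand dimensions in units of 16 SIMD lanes. A minimal sketch of such a wrapper (the naming and layout are my assumptions, not code from this commit):

```rust
// Sketch: operand type whose const generics count dimensions in units of
// 16 SIMD lanes, so Matrix<8, 8> describes a 128 x 128 matrix.
struct Matrix<const X16R: usize, const X16C: usize> {
    data: Vec<f32>, // row-major, (16 * X16R) x (16 * X16C) elements
}

impl<const X16R: usize, const X16C: usize> Matrix<X16R, X16C> {
    fn rows(&self) -> usize { 16 * X16R }
    fn cols(&self) -> usize { 16 * X16C }
}

fn main() {
    let m: Matrix<8, 8> = Matrix { data: vec![0.0; 128 * 128] };
    println!("{} x {}", m.rows(), m.cols()); // 128 x 128
}
```

Encoding the lane count in the type lets dimension mismatches fail at compile time rather than in the PIM hardware.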
@@ -131,4 +115,24 @@ pub fn execute<const X16R: usize, const X16C: usize, const R: usize>(
 </div>
 </div>
-<!-- </Transform> -->
+---
+layout: figure-side
+figureUrl: /bare_metal.svg
+---
+## Virtual Prototype
+### Platform
+<hr/>
+<br>
+<br>
+- ARM processor model
+- Bare-metal kernel
+- Custom page table configuration
+  - Non-PIM DRAM region mapped as cacheable memory
+  - PIM DRAM region mapped as non-cacheable memory
+<!--
+- bare metal offers most control
+-->

View File

@@ -18,6 +18,14 @@
 </Footnote>
 </Footnotes>
+<!--
+- compute doubles every two years
+- energy production grows linearly at 2% per year
+- to meet future compute demands
+- -> drastic improvements in energy efficiency
+-->
 ---
 ## Introduction
@@ -26,7 +34,7 @@
 <br>
-#### Roofline model of GPT revisions<sup>1</sup>
+- AI workloads become increasingly memory-bound
 <br>
@@ -39,3 +47,10 @@
 Ivo Bolsens. „Scalable AI Architectures for Edge and Cloud“, 2023.
 </Footnote>
 </Footnotes>
+<!--
+- Emerging AI applications become increasingly memory-bound
+- Roofline model
+- Not limited by compute power but by memory
+-> researchers begin to consider processing in memory to circumvent memory bottleneck
+-->
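The memory-bound claim in the speaker note above can be made concrete with a back-of-the-envelope arithmetic-intensity estimate for GEMV (a sketch of mine; the fp16 operand size is an assumption):

```rust
// Back-of-the-envelope arithmetic intensity of GEMV (z = A · x),
// assuming fp16 operands (2 bytes each).
fn gemv_arithmetic_intensity(m: usize, n: usize) -> f64 {
    let flops = 2.0 * (m * n) as f64; // one MUL + one ADD per matrix element
    let bytes = 2.0 * ((m * n) + n + m) as f64; // A read once, plus x and z
    flops / bytes
}

fn main() {
    // The intensity stays around 1 flop/byte regardless of matrix size,
    // far below the roofline knee of modern GPUs -> memory-bound.
    println!("{:.3}", gemv_arithmetic_intensity(4096, 4096));
}
```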

View File

@@ -11,15 +11,8 @@
 </div>
 <!--
-- Workload must be memory-bound
-- Such that PIM is effective, workload must be memory-bound
-- memory-bound:
-  - fully-connected layers
-  - layers of recurrent neural networks (RNNs)
-- not memory-bound:
-  - convolutional layers
-  - data reuse
+- fully connected layers of a neural network
 -->
 ---
@@ -52,6 +45,10 @@ clicks: 1
 </div>
 </Transform>
+<!--
+- filter matrix is reused
+-->
 ---
 ## Processing-in-Memory
@@ -67,7 +64,7 @@ clicks: 1
 <div>
 ### Suitable candidates for PIM:
-- Multilayer perceptrons (MLPs)
+- Fully connected layers in multilayer perceptrons (MLPs)
 - Layers in recurrent neural networks (RNNs)
 </div>
@@ -130,19 +127,18 @@ To summarize...
 </Footnotes>
 <!--
+- Architecture space of PIM:
 - Inside the memory SA
-  - Ambit
-  - activate multiple rows at the same time
-  - bulk logic operations
+  - simple bulk logic
 - Near SA in PSA output region
-  - CMOS-based logic gates in the region
+  - logic gates in the region
 - Near a bank in its peripheral region
-  - computation units with control at bank output
+  - computation units with control
 - I/O region of memory
-  - more traditional accelerator approach
+  - limited by memory bus
 -->
 ---
@@ -171,12 +167,9 @@ To summarize...
 </Footnotes>
 <!--
-- Real-world PIM implementation based on HBM2
-- SIMD FPUs are 16-wide, i.e., there are 16 FPU units
-- Three execution modes
-  - Single-Bank (SB)
-  - All-Bank (AB)
-  - All-Bank-PIM (AB-PIM)
+- One PIM unit shared by two banks
+- 16-wide SIMD FPUs
+- All-Bank mode: All PIM units operate in parallel
 -->
 ---
@@ -201,16 +194,15 @@ To summarize...
 </Footnotes>
 <!--
-- Control unit executes RISC instructions
 - Two SIMD FPUs
   - ADD
   - MUL
-- CRF: 32 32-bit entries (32 instructions)
-- GRF: 16 256-bit entries
-- SRF: 16 16-bit entries
-- One instruction is executed when RD or WR command is issued
+- CRF: 32 instructions, stores the program
+- GRF: 16 entries, one memory fetch
+- SRF: 16 entries
+- Control unit executes one instruction when RD or WR command is issued
 -->
 ---
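The register-file sizes in the note above (CRF 32×32-bit, GRF 16×256-bit, SRF 16×16-bit) line up with the 16-wide fp16 SIMD units. A small sketch of that arithmetic (the byte math is mine, not from the slides):

```rust
// Register-file capacities from the note: CRF 32 x 32-bit (instructions),
// GRF 16 x 256-bit (vector data), SRF 16 x 16-bit (scalars).
const CRF_ENTRIES: usize = 32;
const CRF_ENTRY_BITS: usize = 32;
const GRF_ENTRY_BITS: usize = 256;
const FP16_BITS: usize = 16;

// A 256-bit GRF entry holds exactly one 16-lane fp16 vector,
// matching the 16-wide SIMD FPUs.
fn grf_fp16_lanes() -> usize {
    GRF_ENTRY_BITS / FP16_BITS
}

fn main() {
    println!("{}", grf_fp16_lanes()); // 16
    println!("{}", CRF_ENTRIES * CRF_ENTRY_BITS / 8); // 128 bytes of microkernel storage
}
```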
@@ -229,6 +221,13 @@ figureCaption: Procedure to perform a (128×8)×(128) GEMV operation
 </Footnote>
 </Footnotes>
+<!--
+- Procedure of GEMV operation
+- multiple cycles
+- each PIM unit operates on one matrix row
+- partial sum, reduced by host
+-->
 ---
 layout: figure
 figureUrl: /layout.svg
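The "partial sum, reduced by host" step from the note above can be sketched as plain host code (illustrative only; the per-bank layout is an assumption, not the thesis implementation):

```rust
// Sketch: host-side reduction of per-bank partial sums. Each PIM unit
// returns one partial result vector; the host adds them element-wise
// to obtain the final GEMV output.
fn reduce_partial_sums(partials: &[Vec<f32>]) -> Vec<f32> {
    let mut out = vec![0.0f32; partials[0].len()];
    for p in partials {
        for (acc, v) in out.iter_mut().zip(p) {
            *acc += v;
        }
    }
    out
}

fn main() {
    let banks = [vec![1.0, 2.0], vec![3.0, 4.0]];
    println!("{:?}", reduce_partial_sums(&banks)); // [4.0, 6.0]
}
```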
@@ -254,7 +253,7 @@ figureCaption: Mapping of the weight matrix onto the memory banks
 <br>
 <br>
-- To analyze the performance gains of PIM, simulation models are needed
+- Simulations are needed to analyze the performance gains of PIM
 - Research should not only focus on hardware but also explore the software side
 <br>

View File

@@ -14,7 +14,7 @@
 - Vector-Matrix benchmarks (BLAS level 2)
   - GEMV: $z = A \cdot x$
-- DNN:
+- Simple DNN:
   - $f(x) = z = ReLU(A \cdot x)$
   - $z_{n+1} = f(z_n)$
   - 5 layers in total
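The benchmark definitions above ($f(x) = ReLU(A \cdot x)$ iterated over five layers) can be sketched as a scalar reference (not the thesis kernel; square f32 layers are assumptions of this sketch):

```rust
// Scalar reference for the simple DNN benchmark:
// f(x) = ReLU(A · x), with z_{n+1} = f(z_n) over five layers.
fn relu_gemv(a: &[Vec<f32>], x: &[f32]) -> Vec<f32> {
    a.iter()
        .map(|row| row.iter().zip(x).map(|(w, v)| w * v).sum::<f32>().max(0.0))
        .collect()
}

fn forward(layers: &[Vec<Vec<f32>>], input: Vec<f32>) -> Vec<f32> {
    layers.iter().fold(input, |z, a| relu_gemv(a, &z))
}

fn main() {
    // Five 2x2 identity layers leave a non-negative input unchanged.
    let identity = vec![vec![1.0, 0.0], vec![0.0, 1.0]];
    println!("{:?}", forward(&vec![identity; 5], vec![3.0, 4.0])); // [3.0, 4.0]
}
```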
@@ -36,24 +36,44 @@ Operand Dimensions
 </div>
 </div>
+<!--
+- operand data significantly larger than on-chip cache
+-->
 ---
 ## Simulations
 ### System Configuration
 <hr/>
-<br>
 <br>
 <br>
-- Two simulated systems:
-  - Generic ARM systems
-  - Infinite compute ARM system
+<div class="grid grid-cols-2 gap-4">
+<div>
+#### Two simulated systems:
 <br>
-- Two real GPUs using HBM2:
-  - AMD RX Vega 56
-  - NVIDIA V100
+- Generic ARM system
+- Infinite compute system
+  - completely memory bound
+</div>
+<div>
+#### Two real GPUs using HBM2:
+<br>
+- AMD RX Vega 56
+- NVIDIA V100
+</div>
+</div>
 ---
 layout: figure
@@ -75,11 +95,15 @@ figureCaption: Speedups of PIM compared to non-PIM
 ### Speedups / Infinite Compute System
 <hr/>
+<!--
+- VADD: 12.7x
+- GEMV: 9.0x
+-->
 ---
 layout: figure
 figureUrl: /samsung.svg
 figureCaption: Speedups of Samsung for VADD and GEMV
+figureFootnoteNumber: 1
 ---
 ## Simulations
@@ -97,6 +121,7 @@ figureFootnoteNumber: 1
 - ADD shows deviation
 -> differences in hardware architecture
+- GPU has no speculative execution
 -->
 ---
@@ -111,6 +136,7 @@ figureCaption: Runtimes for Vector Benchmarks
 <!--
 - Real GPUs use multiple memory channels
+- Memory barriers
 - Also architectural differences
 -->