Minor changes
@@ -14,6 +14,7 @@ addons:
  - slidev-addon-citations
biblio:
  filename: references.bib
record: true
---

### Master Thesis
@@ -3,12 +3,13 @@

<br>

Speedups of 17.6× and 9.0× have been achieved for the hypothetical infinite-compute system
- PIM can accelerate memory-bound workloads
- Special PIM-friendly memory layouts are required

<br>

#### Future work:
- Implementation of a Linux driver
- Comparison with complete neural networks
- Consider replacing the library approach with a compiler approach
- Implement a power model to analyze the power efficiency gains
@@ -38,28 +38,12 @@ figureCaption: Data structures for instructions and register files

- Provides data structures for operand data and microkernels
- Executes programmed microkernels
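The data structures above can be sketched roughly as follows; type names and the GRF_A/GRF_B split are assumptions based on HBM-PIM's published register-file sizes, not the actual library API:

```rust
// Illustrative sketch only: names and layout are hypothetical.

/// A 256-bit GRF entry holds 16 fp16 values (raw bit patterns) — one memory fetch.
#[derive(Clone, Copy, Default)]
struct GrfEntry([u16; 16]);

/// Register files of one PIM unit, sized as in HBM-PIM.
struct RegisterFiles {
    crf: Vec<u32>,        // up to 32 encoded instructions (the microkernel)
    grf_a: [GrfEntry; 8], // general registers, bank A side
    grf_b: [GrfEntry; 8], // general registers, bank B side
    srf: [u16; 16],       // scalar registers
}

/// A microkernel is just the program loaded into the CRF.
fn load_microkernel(rf: &mut RegisterFiles, program: &[u32]) {
    assert!(program.len() <= 32, "CRF holds at most 32 instructions");
    rf.crf = program.to_vec();
}
```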
---
layout: figure-side
figureUrl: /bare_metal.svg
---

## Virtual Prototype
### Platform
<hr/>

<br>
<br>

- Bare-metal kernel executes on ARM processor model
- Custom page table configuration
  - Non-PIM DRAM region mapped as cacheable memory
  - PIM DRAM region mapped as non-cacheable memory
    - generates RD and WR requests
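The cacheable/non-cacheable split can be sketched via the ARMv8 memory-attribute register; this is an illustrative fragment following the architectural MAIR_EL1 encoding, not the thesis' actual page-table code:

```rust
// Illustrative sketch, not the actual bare-metal kernel code.
// ARMv8 MAIR_EL1 holds eight 8-bit attribute entries; each page-table
// descriptor selects one by index.

/// Normal memory, inner/outer write-back cacheable: non-PIM DRAM.
const ATTR_NORMAL_CACHEABLE: u64 = 0xFF;
/// Normal memory, non-cacheable: PIM DRAM, so every access bypasses the
/// cache and generates a real RD/WR DRAM command.
const ATTR_NORMAL_NONCACHEABLE: u64 = 0x44;

/// Pack attributes into a MAIR_EL1 value (index 0 = non-PIM, index 1 = PIM).
fn mair_value() -> u64 {
    ATTR_NORMAL_CACHEABLE | (ATTR_NORMAL_NONCACHEABLE << 8)
}
```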
---

## Virtual Prototype
### Platform
### GEMV Kernel
<hr/>

<br>

@@ -68,7 +52,7 @@ figureUrl: /bare_metal.svg

<div>

DRAM-side
```asm{all|1-8|9,10|11|12}{lines:true,at:1}
MOV GRF_A #0, BANK
MOV GRF_A #1, BANK
MOV GRF_A #2, BANK
@@ -94,7 +78,7 @@ code {

Host-side

```rust {all|7-10|12-17|19-28|30-31}{lines:true,maxHeight:'15em',at:1}
pub fn execute<const X16R: usize, const X16C: usize, const R: usize>(
    matrix: &Matrix<X16R, X16C>,
    input_vector: &Vector<X16C>,
@@ -131,4 +115,24 @@ pub fn execute<const X16R: usize, const X16C: usize, const R: usize>(
</div>
</div>

<!-- </Transform> -->
---
layout: figure-side
figureUrl: /bare_metal.svg
---

## Virtual Prototype
### Platform
<hr/>

<br>
<br>

- ARM processor model
- Bare-metal kernel
- Custom page table configuration
  - Non-PIM DRAM region mapped as cacheable memory
  - PIM DRAM region mapped as non-cacheable memory

<!--
- bare metal offers most control
-->
@@ -18,6 +18,14 @@

</Footnote>
</Footnotes>

<!--
- compute doubles every two years
- energy production grows linearly at 2% per year

- to meet future compute demands
-> drastic improvements in energy efficiency
-->

---

## Introduction
@@ -26,7 +34,7 @@

<br>

#### Roofline model of GPT revisions<sup>1</sup>
- AI workloads become increasingly memory-bound
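The roofline argument above can be made concrete in a few lines; the numbers in the usage below are illustrative, not measurements from the slides:

```rust
// Minimal roofline model sketch.
// Attainable FLOP/s = min(peak compute, memory bandwidth × operational intensity).

/// Performance ceiling for a kernel with the given operational intensity
/// (FLOPs per byte moved from memory).
fn attainable_flops(peak_flops: f64, bandwidth_bytes_per_s: f64, intensity: f64) -> f64 {
    (bandwidth_bytes_per_s * intensity).min(peak_flops)
}

/// A kernel is memory-bound when the bandwidth ceiling sits below peak compute.
fn is_memory_bound(peak_flops: f64, bandwidth_bytes_per_s: f64, intensity: f64) -> bool {
    bandwidth_bytes_per_s * intensity < peak_flops
}
```

For example, a low-intensity kernel (0.25 FLOP/byte, typical of GEMV-style layers) on a machine with 1 TFLOP/s peak and 100 GB/s bandwidth is capped at 25 GFLOP/s, i.e. memory-bound.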

<br>

@@ -39,3 +47,10 @@

Ivo Bolsens. „Scalable AI Architectures for Edge and Cloud“, 2023.
</Footnote>
</Footnotes>

<!--
- Emerging AI applications become increasingly memory-bound
- Roofline model
- Not limited by compute power but by memory
-> researchers begin to consider processing in memory to circumvent memory bottleneck
-->
@@ -11,15 +11,8 @@

</div>

<!--
- For PIM to be effective, the workload must be memory-bound

- memory-bound:
  - fully-connected layers
  - layers of recurrent neural networks (RNNs)

- not memory-bound:
  - convolutional layers (data reuse)
-->

---
@@ -52,6 +45,10 @@ clicks: 1

</div>
</Transform>

<!--
- filter matrix is reused
-->

---

## Processing-in-Memory
@@ -67,7 +64,7 @@ clicks: 1

<div>

### Suitable candidates for PIM:
- Fully connected layers in multilayer perceptrons (MLPs)
- Layers in recurrent neural networks (RNNs)

</div>
@@ -130,19 +127,18 @@ To summarize...

</Footnotes>
<!--
- Architecture space of PIM:
  - Inside the memory SA
    - Ambit
    - activate multiple rows at the same time
    - simple bulk logic operations

  - Near SA in PSA output region
    - CMOS-based logic gates in the region

  - Near a bank in its peripheral region
    - computation units with control at bank output

  - I/O region of memory
    - more traditional accelerator approach
    - limited by memory bus
-->

---
@@ -171,12 +167,9 @@ To summarize...

</Footnotes>

<!--
- Real-world PIM implementation based on HBM2
- SIMD FPUs are 16-wide, i.e., there are 16 FPU units
- Three execution modes
  - Single-Bank (SB)
  - All-Bank (AB)
  - All-Bank-PIM (AB-PIM)
- One PIM unit shared by two banks
- All-Bank mode: All PIM units operate in parallel
-->

---
@@ -201,16 +194,15 @@ To summarize...

</Footnotes>

<!--
- Control unit executes RISC instructions
- Two SIMD FPUs
  - ADD
  - MUL

- CRF: 32 32-bit entries, stores the program (32 instructions)
- GRF: 16 256-bit entries, one memory fetch
- SRF: 16 16-bit entries

- Control unit executes one instruction when a RD or WR command is issued
-->
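The command-triggered execution model described in the notes can be sketched as follows; types and mnemonics are hypothetical, only the stepping behavior (one CRF instruction per RD/WR command) comes from the slides:

```rust
// Sketch of the PIM control unit's execution model (hypothetical types).

#[derive(Clone, Copy)]
enum Cmd {
    Rd,
    Wr,
}

struct ControlUnit {
    crf: Vec<&'static str>, // the loaded microkernel, shown as mnemonics here
    pc: usize,              // index of the next instruction
}

impl ControlUnit {
    /// Execute exactly one CRF instruction per incoming RD/WR command;
    /// returns None once the program has run to completion.
    fn on_command(&mut self, _cmd: Cmd) -> Option<&'static str> {
        let instr = self.crf.get(self.pc).copied()?;
        self.pc += 1;
        Some(instr)
    }
}
```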

---
@@ -229,6 +221,13 @@ figureCaption: Procedure to perform a (128×8)×(128) GEMV operation

</Footnote>
</Footnotes>

<!--
- Procedure of GEMV operation
  - multiple cycles
  - each PIM unit operates on one matrix row
  - partial sums, reduced by host
-->
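The final host-side step of the procedure above can be sketched like this; the 16-wide partial-sum shape mirrors the 16-lane SIMD FPUs, but the exact data layout is an assumption, not the thesis' code:

```rust
// Hedged sketch: each PIM unit leaves a 16-wide partial-sum vector per
// matrix row, and the host performs the final reduction.

/// Reduce one row's 16-lane partial sums into that row's scalar result.
fn reduce_row(partial: &[f32; 16]) -> f32 {
    partial.iter().sum()
}

/// Host-side reduction over all rows of the output vector.
fn reduce(partials: &[[f32; 16]]) -> Vec<f32> {
    partials.iter().map(reduce_row).collect()
}
```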

---
layout: figure
figureUrl: /layout.svg
@@ -254,7 +253,7 @@ figureCaption: Mapping of the weight matrix onto the memory banks

<br>
<br>

- Simulations are needed to analyze the performance gains of PIM
- Research should not only focus on hardware but also explore the software side

<br>

@@ -14,7 +14,7 @@

- Matrix-vector benchmarks (BLAS level 2)
  - GEMV: $z = A \cdot x$
  - Simple DNN:
    - $f(x) = z = ReLU(A \cdot x)$
    - $z_{n+1} = f(z_n)$
    - 5 layers in total
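The benchmark above can be sketched as plain host-side reference code (no PIM); for brevity this sketch reuses one square weight matrix across all 5 layers, which is an assumption about the benchmark's shape:

```rust
// Reference sketch of the Simple DNN benchmark: z_{n+1} = ReLU(A · z_n).

/// GEMV: z = A · x, with A stored as row-major rows.
fn gemv(a: &[Vec<f32>], x: &[f32]) -> Vec<f32> {
    a.iter()
        .map(|row| row.iter().zip(x).map(|(w, v)| w * v).sum())
        .collect()
}

/// Element-wise ReLU.
fn relu(z: Vec<f32>) -> Vec<f32> {
    z.into_iter().map(|v| v.max(0.0)).collect()
}

/// Chain `layers` applications of f(x) = ReLU(A · x).
fn forward(a: &[Vec<f32>], x: &[f32], layers: usize) -> Vec<f32> {
    (0..layers).fold(x.to_vec(), |z, _| relu(gemv(a, &z)))
}
```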
@@ -36,24 +36,44 @@ Operand Dimensions

</div>
</div>

<!--
- operand data significantly larger than on-chip cache
-->

---

## Simulations
### System Configuration
<hr/>

<br>
<br>
<br>

<div class="grid grid-cols-2 gap-4">
<div>

#### Two simulated systems:

<br>

- Generic ARM system
- Infinite compute system
  - completely memory bound

</div>

<div>

#### Two real GPUs using HBM2:

<br>

- AMD RX Vega 56
- NVIDIA V100

</div>
</div>

---
layout: figure
@@ -75,11 +95,15 @@ figureCaption: Speedups of PIM compared to non-PIM

### Speedups / Infinite Compute System
<hr/>

<!--
- VADD: 12.7x
- GEMV: 9.0x
-->

---
layout: figure
figureUrl: /samsung.svg
figureCaption: Speedups reported by Samsung for VADD and GEMV
figureFootnoteNumber: 1
---

## Simulations
@@ -97,6 +121,7 @@ figureFootnoteNumber: 1

- ADD shows deviation

-> differences in hardware architecture
- GPU has no speculative execution
-->

---
@@ -111,6 +136,7 @@ figureCaption: Runtimes for Vector Benchmarks

<!--
- Real GPUs use multiple memory channels
- Memory barriers
- Also architectural differences
-->