Minor changes
@@ -14,6 +14,7 @@ addons:
   - slidev-addon-citations
 biblio:
   filename: references.bib
+record: true
 ---
 
 ### Master Thesis
@@ -3,11 +3,12 @@
 
 <br>
 
-A speedup of 17.6× and 9.0× for the hypothetical infinite compute system has been achieved
+- PIM can accelerate memory-bound workloads
+- Special PIM-friendly memory layouts are required
 
 <br>
 
-Future work:
+#### Future work:
 - Implementation of Linux driver
 - Comparison with complete neural networks
 - Consider replacing library approach with compiler approach
@@ -38,28 +38,12 @@ figureCaption: Data structures for instructions and register files
 
 - Provides data structures for operand data and microkernels
 - Executes programmed microkernels
+- generate RD and WR requests
 
----
-layout: figure-side
-figureUrl: /bare_metal.svg
----
-
-## Virtual Prototype
-### Platform
-<hr/>
-
-<br>
-<br>
-
-- Bare-metal kernel executes on ARM processor model
-- Custom page table configuration
-- Non-PIM DRAM region mapped as cacheable memory
-- PIM DRAM region mapped as non-cacheable memory
-
 ---
 
 ## Virtual Prototype
-### Platform
+### GEMV Kernel
 <hr/>
 
 <br>
@@ -68,7 +52,7 @@ figureUrl: /bare_metal.svg
 <div>
 
 DRAM-side
-```asm{all|1-8|9,10|11|12|all}{lines:true,at:1}
+```asm{all|1-8|9,10|11|12}{lines:true,at:1}
 MOV GRF_A #0, BANK
 MOV GRF_A #1, BANK
 MOV GRF_A #2, BANK
@@ -94,7 +78,7 @@ code {
 
 Host-side
 
-```rust {all|7-10|12-17|19-28|30-31|all}{lines:true,maxHeight:'15em',at:1}
+```rust {all|7-10|12-17|19-28|30-31}{lines:true,maxHeight:'15em',at:1}
 pub fn execute<const X16R: usize, const X16C: usize, const R: usize>(
     matrix: &Matrix<X16R, X16C>,
     input_vector: &Vector<X16C>,
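For reference, the `execute` kernel in this hunk drives a GEMV, z = A · x. A plain-Rust sketch of the same math (the flat row-major slice layout and the function name are illustrative assumptions, not the deck's actual `Matrix`/`Vector` API):

```rust
/// Naive GEMV, z = A * x, with A stored row-major (rows x cols).
/// Illustrative only: the real host-side kernel offloads each row's
/// multiply-accumulate to a PIM unit instead of looping on the CPU.
fn gemv(a: &[f32], x: &[f32], rows: usize, cols: usize) -> Vec<f32> {
    assert_eq!(a.len(), rows * cols);
    assert_eq!(x.len(), cols);
    (0..rows)
        .map(|r| {
            a[r * cols..(r + 1) * cols]
                .iter()
                .zip(x)
                .map(|(aij, xj)| aij * xj)
                .sum()
        })
        .collect()
}
```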
@@ -131,4 +115,24 @@ pub fn execute<const X16R: usize, const X16C: usize, const R: usize>(
 </div>
 </div>
 
-<!-- </Transform> -->
+---
+layout: figure-side
+figureUrl: /bare_metal.svg
+---
+
+## Virtual Prototype
+### Platform
+<hr/>
+
+<br>
+<br>
+
+- ARM processor model
+- Bare-metal kernel
+- Custom page table configuration
+- Non-PIM DRAM region mapped as cacheable memory
+- PIM DRAM region mapped as non-cacheable memory
+
+<!--
+- bare metal offers most control
+-->
@@ -18,6 +18,14 @@
 </Footnote>
 </Footnotes>
 
+<!--
+- compute doubles every two years
+- energy production grows linearly at 2% per year
+
+- to meet future compute demands
+- -> drastic improvements in energy efficiency
+-->
+
 ---
 
 ## Introduction
@@ -26,7 +34,7 @@
 
 <br>
 
-#### Roofline model of GPT revisions<sup>1</sup>
+- AI workloads become increasingly memory-bound
 
 <br>
 
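The heading replaced here refers to the roofline model, which bounds attainable throughput by the lower of the compute roof and the memory roof (bandwidth × arithmetic intensity). A minimal sketch, with made-up numbers rather than figures from the slides:

```rust
/// Roofline model: performance is capped either by peak compute or
/// by memory bandwidth times arithmetic intensity (FLOPs per byte).
fn roofline_gflops(peak_gflops: f64, bw_gb_per_s: f64, flops_per_byte: f64) -> f64 {
    (bw_gb_per_s * flops_per_byte).min(peak_gflops)
}
```

For example, at 0.25 FLOP/byte on roughly 900 GB/s of HBM2, even a 14 TFLOP/s device tops out near 225 GFLOP/s: the workload is memory-bound, which is the slide's point about AI workloads.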
@@ -39,3 +47,10 @@
 Ivo Bolsens. „Scalable AI Architectures for Edge and Cloud“, 2023.
 </Footnote>
 </Footnotes>
+
+<!--
+- Emerging AI applications become increasingly memory-bound
+- Roofline model
+- Not limited by compute power but by memory
+-> researchers begin to consider processing in memory to circumvent memory bottleneck
+-->
@@ -11,15 +11,8 @@
 </div>
 
 <!--
-- Workload must be memory-bound
-- memory-bound:
-- fully-connected layers
-- layers of recurrent neural networks (RNNs)
-
-- not memory-bound:
-- convolutional layers
-- data reuse
-
+- fully connected layers of a neural network
+- Such that PIM is effective, workload must be memory-bound
 -->
 
 ---
@@ -52,6 +45,10 @@ clicks: 1
 </div>
 </Transform>
 
+<!--
+- filter matrix is reused
+-->
+
 ---
 
 ## Processing-in-Memory
@@ -67,7 +64,7 @@ clicks: 1
 <div>
 
 ### Suitable candidates for PIM:
-- Multilayer perceptrons (MLPs)
+- Fully connected layers in multilayer perceptrons (MLPs)
 - Layers in recurrent neural networks (RNNs)
 
 </div>
@@ -130,19 +127,18 @@ To summarize...
 </Footnotes>
 
 <!--
+- Architecture space of PIM:
 - Inside the memory SA
-- Ambit
-- activate multiple rows at the same time
-- bulk logic operations
+- simple bulk logic
 
 - Near SA in PSA output region
-- CMOS-based logic gates in the region
+- logic gates in the region
 
 - Near a bank in its peripheral region
-- computation units with control at bank output
+- computation units with control
 
 - I/O region of memory
-- more traditional accelerator approach
+- limited by memory bus
 -->
 
 ---
@@ -171,12 +167,9 @@ To summarize...
 </Footnotes>
 
 <!--
-- Real-world PIM implementation based on HBM2
-- SIMD FPUs are 16-wide, i.e., there are 16 FPU units
-- Three execution modes
-- Single-Bank (SB)
-- All-Bank (AB)
-- All-Bank-PIM (AB-PIM)
+- One PIM unit shared by two banks
+- 16-wide SIMD FPUs are 16-wide
+- All-Bank mode: All PIM units operate in parallel
 -->
 
 ---
@@ -201,16 +194,15 @@ To summarize...
 </Footnotes>
 
 <!--
-- Control unit executes RISC instructions
 - Two SIMD FPUs
 - ADD
 - MUL
 
-- CRF: 32 32-bit entries (32 instructions)
-- GRF: 16 256-bit entries
-- SRF: 16 16-bit entries
+- CRF: 32 instructions, stores the program
+- GRF: 16 entries, one memory fetch
+- SRF: 16 entries
 
-- One instruction is executed when RD or WR command is issued
+- Control units executes one instruction when RD or WR command is issued
 -->
 
 ---
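The notes in this hunk give the register-file geometry (CRF 32 × 32-bit, GRF 16 × 256-bit, SRF 16 × 16-bit). A simulator could model this directly; the struct below is a hypothetical sketch — the GRF_A/GRF_B split follows the assembly listing earlier in the deck, and all field names are assumptions:

```rust
/// Hypothetical model of one PIM unit's register files, sized per the
/// notes: CRF 32 x 32-bit (holds the program), GRF 16 x 256-bit
/// entries (16 raw fp16 lanes each), SRF 16 x 16-bit scalars.
#[derive(Default)]
struct PimRegisterFiles {
    crf: [u32; 32],         // Command Register File: 32-instruction program
    grf_a: [[u16; 16]; 16], // GRF_A: 16 entries x 256 bits
    grf_b: [[u16; 16]; 16], // GRF_B: same geometry
    srf: [u16; 16],         // Scalar Register File
}
```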
@@ -229,6 +221,13 @@ figureCaption: Procedure to perform a (128×8)×(128) GEMV operation
 </Footnote>
 </Footnotes>
 
+<!--
+- Procedure of GEMV operation
+- multiple cycles
+- each PIM unit operatates on one matrix row
+- partial sum, reduced by host
+-->
+
 ---
 layout: figure
 figureUrl: /layout.svg
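The added note sketches the GEMV procedure: over multiple cycles each PIM unit multiply-accumulates one matrix row, leaving a partial sum that the host reduces. A hedged sketch of that final host-side step — the 16-lane layout mirrors the 16-wide SIMD FPUs, and the names are illustrative:

```rust
/// Host-side reduction: each PIM unit leaves one 16-lane partial sum
/// per matrix row; summing the lanes yields that row's output element.
fn reduce_row_partials(partials: &[[f32; 16]]) -> Vec<f32> {
    partials.iter().map(|lanes| lanes.iter().sum()).collect()
}
```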
@@ -254,7 +253,7 @@ figureCaption: Mapping of the weight matrix onto the memory banks
 <br>
 <br>
 
-- To analyze the performance gains of PIM, simulation models are needed
+- Simulations are needed to analyze the performance gains of PIM
 - Research should not only focus on hardware but also explore the software side
 
 <br>
@@ -14,7 +14,7 @@
 
 - Vector-Matrix benchmarks (BLAS level 2)
 - GEMV: $z = A \cdot x$
-- DNN:
+- Simple DNN:
 - $f(x) = z = ReLU(A \cdot x)$
 - $z_{n+1} = f(z_n)$
 - 5 layers in total
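The benchmark DNN above stacks five identical layers, z_{n+1} = ReLU(A · z_n). A compact plain-Rust sketch of that chain — a single shared square layer matrix and f32 data are simplifying assumptions for illustration:

```rust
/// One benchmark layer: f(z) = ReLU(A * z) with a row-major n x n matrix.
fn layer(a: &[f32], z: &[f32], n: usize) -> Vec<f32> {
    (0..n)
        .map(|r| {
            let dot: f32 = a[r * n..(r + 1) * n]
                .iter()
                .zip(z)
                .map(|(w, v)| w * v)
                .sum();
            dot.max(0.0) // ReLU
        })
        .collect()
}

/// The full benchmark DNN: 5 layers in total, z_{n+1} = f(z_n).
fn dnn(a: &[f32], x: &[f32], n: usize) -> Vec<f32> {
    (0..5).fold(x.to_vec(), |z, _| layer(a, &z, n))
}
```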
@@ -36,24 +36,44 @@ Operand Dimensions
 </div>
 </div>
 
+<!--
+- operand data significantly larger than on-chip cache
+-->
+
 ---
 
 ## Simulations
 ### System Configuration
 <hr/>
 
+<br>
 <br>
 <br>
 
-- Two simulated systems:
-- Generic ARM systems
-- Infinite compute ARM system
+<div class="grid grid-cols-2 gap-4">
+<div>
+
+#### Two simulated systems:
 
 <br>
 
-- Two real GPUs using HBM2:
-- AMD RX Vega 56
-- NVIDIA V100
+- Generic ARM system
+- Infinite compute system
+- completely memory bound
+
+</div>
+
+<div>
+
+#### Two real GPUs using HBM2:
+
+<br>
+
+- AMD RX Vega 56
+- NVIDIA V100
+
+</div>
+</div>
 
 ---
 layout: figure
@@ -75,11 +95,15 @@ figureCaption: Speedups of PIM compared to non-PIM
 ### Speedups / Infinite Compute System
 <hr/>
 
+<!--
+- VADD: 12.7x
+- GEMV: 9.0x
+-->
+
 ---
 layout: figure
 figureUrl: /samsung.svg
 figureCaption: Speedups of Samsung for VADD and GEMV
-figureFootnoteNumber: 1
 ---
 
 ## Simulations
@@ -97,6 +121,7 @@ figureFootnoteNumber: 1
 - ADD shows deviation
 
 -> differences in hardware architecture
+- GPU has no speculative execution
 -->
 
 ---
@@ -111,6 +136,7 @@ figureCaption: Runtimes for Vector Benchmarks
 
 <!--
 - Real GPUs use multiple memory channels
+- Memory barriers
 - Also architectural differences
 -->
 