diff --git a/slides.md b/slides.md
index e0e3bc1..c502e36 100644
--- a/slides.md
+++ b/slides.md
@@ -14,6 +14,7 @@ addons:
   - slidev-addon-citations
 biblio:
   filename: references.bib
+record: true
 ---
 ### Master Thesis
diff --git a/slides/conclusion.md b/slides/conclusion.md
index 79a5dc7..cd7513c 100644
--- a/slides/conclusion.md
+++ b/slides/conclusion.md
@@ -3,12 +3,13 @@
-A speedup of 17.6× and 9.0× for the hypothetical infinite compute system has been achieved
+- PIM can accelerate memory-bound workloads
+- Special PIM-friendly memory layouts are required
-Future work:
+#### Future work:
 - Implementation of Linux driver
- - Comparison with complete neural networks
+ - Comparison with complete neural networks
 - Consider replacing library approach with compiler approach
 - Implement a power model to analyze the power efficiency gains
diff --git a/slides/implementation.md b/slides/implementation.md
index ded60a6..06a74ac 100644
--- a/slides/implementation.md
+++ b/slides/implementation.md
@@ -38,28 +38,12 @@ figureCaption: Data structures for instructions and register files
 - Provides data structures for operand data and microkernels
 - Executes programmed microkernels
-
----
-layout: figure-side
-figureUrl: /bare_metal.svg
----
-
-## Virtual Prototype
-### Platform
-
- -
-
-
-- Bare-metal kernel executes on ARM processor model
-- Custom page table configuration
- - Non-PIM DRAM region mapped as cacheable memory
- - PIM DRAM region mapped as non-cacheable memory
+ - generate RD and WR requests
 ---
 ## Virtual Prototype
-### Platform
+### GEMV Kernel

@@ -68,7 +52,7 @@ figureUrl: /bare_metal.svg
 DRAM-side
-```asm{all|1-8|9,10|11|12|all}{lines:true,at:1}
+```asm{all|1-8|9,10|11|12}{lines:true,at:1}
 MOV GRF_A #0, BANK
 MOV GRF_A #1, BANK
 MOV GRF_A #2, BANK
@@ -94,7 +78,7 @@ code {
 Host-side
-```rust {all|7-10|12-17|19-28|30-31|all}{lines:true,maxHeight:'15em',at:1}
+```rust {all|7-10|12-17|19-28|30-31}{lines:true,maxHeight:'15em',at:1}
 pub fn execute(
     matrix: &Matrix,
     input_vector: &Vector,
@@ -131,4 +115,24 @@ pub fn execute(
-
+---
+layout: figure-side
+figureUrl: /bare_metal.svg
+---
+
+## Virtual Prototype
+### Platform
+
+ +
+
+
+- ARM processor model
+- Bare-metal kernel
+- Custom page table configuration
+ - Non-PIM DRAM region mapped as cacheable memory
+ - PIM DRAM region mapped as non-cacheable memory
+
+
diff --git a/slides/introduction.md b/slides/introduction.md
index 07d14b2..0b4a126 100644
--- a/slides/introduction.md
+++ b/slides/introduction.md
@@ -18,6 +18,14 @@
+
+
 ---
 ## Introduction
@@ -26,7 +34,7 @@
-#### Roofline model of GPT revisions1 +- AI workloads become increasingly memory-bound
@@ -39,3 +47,10 @@ Ivo Bolsens. „Scalable AI Architectures for Edge and Cloud“, 2023.
+
+
diff --git a/slides/pim.md b/slides/pim.md
index 01f44e3..68d03aa 100644
--- a/slides/pim.md
+++ b/slides/pim.md
@@ -11,15 +11,8 @@
 ---
@@ -52,6 +45,10 @@ clicks: 1
+
+
 ---
 ## Processing-in-Memory
@@ -67,7 +64,7 @@ clicks: 1
 ### Suitable candidates for PIM:
- - Multilayer perceptrons (MLPs)
+ - Fully connected layers in multilayer perceptrons (MLPs)
 - Layers in recurrent neural networks (RNNs)
@@ -130,19 +127,18 @@ To summarize...
 ---
@@ -171,12 +167,9 @@ To summarize...
 ---
@@ -201,16 +194,15 @@ To summarize...
 ---
@@ -229,6 +221,13 @@ figureCaption: Procedure to perform a (128×8)×(128) GEMV operation
+
+
 ---
 layout: figure
 figureUrl: /layout.svg
@@ -254,7 +253,7 @@ figureCaption: Mapping of the weight matrix onto the memory banks

-- To analyze the performance gains of PIM, simulation models are needed
+- Simulations are needed to analyze the performance gains of PIM
 - Research should not only focus on hardware but also explore the software side
diff --git a/slides/simulations.md b/slides/simulations.md
index dcdbb38..a6c6779 100644
--- a/slides/simulations.md
+++ b/slides/simulations.md
@@ -14,7 +14,7 @@
 - Vector-Matrix benchmarks (BLAS level 2)
 - GEMV: $z = A \cdot x$
- - DNN:
+ - Simple DNN:
 - $f(x) = z = ReLU(A \cdot x)$
 - $z_{n+1} = f(z_n)$
 - 5 layers in total
@@ -36,24 +36,44 @@ Operand Dimensions
+
+
 ---
 ## Simulations
 ### System Configuration
+


-- Two simulated systems:
- - Generic ARM systems
- - Infinite compute ARM system
+
+
+ +#### Two simulated systems:
-- Two real GPUs using HBM2:
- - AMD RX Vega 56
- - NVIDIA V100
+- Generic ARM system
+- Infinite compute system
+ - completely memory bound
+
+
+ +
+ +#### Two real GPUs using HBM2: + +
+ +- AMD RX Vega 56 +- NVIDIA V100 + +
+
--- layout: figure @@ -75,11 +95,15 @@ figureCaption: Speedups of PIM compared to non-PIM ### Speedups / Infinite Compute System
+
+
 ---
 layout: figure
 figureUrl: /samsung.svg
 figureCaption: Speedups of Samsung for VADD and GEMV
-figureFootnoteNumber: 1
 ---
 ## Simulations
@@ -97,6 +121,7 @@ figureFootnoteNumber: 1
 - ADD shows deviation -> differences in hardware architecture
+- GPU has no speculative execution
 -->
 ---
@@ -111,6 +136,7 @@ figureCaption: Runtimes for Vector Benchmarks