Minor changes

2024-04-07 22:41:59 +02:00
parent 3d15758c82
commit d634f97fb2
6 changed files with 107 additions and 61 deletions

View File

@@ -14,6 +14,7 @@ addons:
   - slidev-addon-citations
 biblio:
   filename: references.bib
+  record: true
 ---
 ### Master Thesis

View File

@@ -3,12 +3,13 @@
 <br>
-A speedup of 17.6× and 9.0× for the hypothetical infinite compute system has been achieved
+- PIM can accelerate memory-bound workloads
+- Special PIM-friendly memory layouts are required
 <br>
-Future work:
+#### Future work:
 - Implementation of Linux driver
 - Comparison with complete neural networks
 - Consider replacing library approach with compiler approach
 - Implement a power model to analyze the power efficiency gains

View File

@@ -38,28 +38,12 @@ figureCaption: Data structures for instructions and register files
 - Provides data structures for operand data and microkernels
 - Executes programmed microkernels
-- generate RD and WR requests
----
-layout: figure-side
-figureUrl: /bare_metal.svg
----
-## Virtual Prototype
-### Platform
-<hr/>
-<br>
-<br>
-- Bare-metal kernel executes on ARM processor model
-- Custom page table configuration
-  - Non-PIM DRAM region mapped as cacheable memory
-  - PIM DRAM region mapped as non-cacheable memory
 ---
 ## Virtual Prototype
-### Platform
+### GEMV Kernel
 <hr/>
 <br>
@@ -68,7 +52,7 @@ figureUrl: /bare_metal.svg
 <div>
 DRAM-side
-```asm{all|1-8|9,10|11|12|all}{lines:true,at:1}
+```asm{all|1-8|9,10|11|12}{lines:true,at:1}
 MOV GRF_A #0, BANK
 MOV GRF_A #1, BANK
 MOV GRF_A #2, BANK
@@ -94,7 +78,7 @@ code {
 Host-side
-```rust {all|7-10|12-17|19-28|30-31|all}{lines:true,maxHeight:'15em',at:1}
+```rust {all|7-10|12-17|19-28|30-31}{lines:true,maxHeight:'15em',at:1}
 pub fn execute<const X16R: usize, const X16C: usize, const R: usize>(
     matrix: &Matrix<X16R, X16C>,
     input_vector: &Vector<X16C>,
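An aside on the host-side signature in the hunk above: the `X16R`/`X16C` const generics appear to count operand dimensions in units of 16 SIMD lanes. A minimal sketch of such a wrapper (the naming and layout are my assumptions, not code from this commit):

```rust
// Sketch: operand type whose const generics count dimensions in units of
// 16 SIMD lanes, so Matrix<8, 8> describes a 128 x 128 matrix.
struct Matrix<const X16R: usize, const X16C: usize> {
    data: Vec<f32>, // row-major, (16 * X16R) x (16 * X16C) elements
}

impl<const X16R: usize, const X16C: usize> Matrix<X16R, X16C> {
    fn rows(&self) -> usize { 16 * X16R }
    fn cols(&self) -> usize { 16 * X16C }
}

fn main() {
    let m: Matrix<8, 8> = Matrix { data: vec![0.0; 128 * 128] };
    println!("{} x {}", m.rows(), m.cols()); // 128 x 128
}
```

Encoding the lane count in the type lets dimension mismatches fail at compile time rather than in the PIM hardware.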
@@ -131,4 +115,24 @@ pub fn execute<const X16R: usize, const X16C: usize, const R: usize>(
 </div>
 </div>
-<!-- </Transform> -->
+---
+layout: figure-side
+figureUrl: /bare_metal.svg
+---
+## Virtual Prototype
+### Platform
+<hr/>
+<br>
+<br>
+- ARM processor model
+- Bare-metal kernel
+- Custom page table configuration
+  - Non-PIM DRAM region mapped as cacheable memory
+  - PIM DRAM region mapped as non-cacheable memory
+<!--
+- bare metal offers most control
+-->

View File

@@ -18,6 +18,14 @@
 </Footnote>
 </Footnotes>
+<!--
+- compute doubles every two years
+- energy production grows linearly at 2% per year
+- to meet future compute demands
+- -> drastic improvements in energy efficiency
+-->
 ---
 ## Introduction
@@ -26,7 +34,7 @@
 <br>
-#### Roofline model of GPT revisions<sup>1</sup>
+- AI workloads become increasingly memory-bound
 <br>
@@ -39,3 +47,10 @@
 Ivo Bolsens. „Scalable AI Architectures for Edge and Cloud“, 2023.
 </Footnote>
 </Footnotes>
+<!--
+- Emerging AI applications become increasingly memory-bound
+- Roofline model
+- Not limited by compute power but by memory
+-> researchers begin to consider processing in memory to circumvent memory bottleneck
+-->
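The memory-bound claim in the speaker note above can be made concrete with a back-of-the-envelope arithmetic-intensity estimate for GEMV (a sketch of mine; the fp16 operand size is an assumption):

```rust
// Back-of-the-envelope arithmetic intensity of GEMV (z = A · x),
// assuming fp16 operands (2 bytes each).
fn gemv_arithmetic_intensity(m: usize, n: usize) -> f64 {
    let flops = 2.0 * (m * n) as f64; // one MUL + one ADD per matrix element
    let bytes = 2.0 * ((m * n) + n + m) as f64; // A read once, plus x and z
    flops / bytes
}

fn main() {
    // The intensity stays around 1 flop/byte regardless of matrix size,
    // far below the roofline knee of modern GPUs -> memory-bound.
    println!("{:.3}", gemv_arithmetic_intensity(4096, 4096));
}
```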

View File

@@ -11,15 +11,8 @@
 </div>
 <!--
-- Workload must be memory-bound
-- Such that PIM is effective, workload must be memory-bound
-- memory-bound:
-  - fully-connected layers
-  - layers of recurrent neural networks (RNNs)
-- not memory-bound:
-  - convolutional layers
-  - data reuse
+- fully connected layers of a neural network
 -->
 ---
@@ -52,6 +45,10 @@ clicks: 1
 </div>
 </Transform>
+<!--
+- filter matrix is reused
+-->
 ---
 ## Processing-in-Memory
@@ -67,7 +64,7 @@ clicks: 1
 <div>
 ### Suitable candidates for PIM:
-- Multilayer perceptrons (MLPs)
+- Fully connected layers in multilayer perceptrons (MLPs)
 - Layers in recurrent neural networks (RNNs)
 </div>
@@ -130,19 +127,18 @@ To summarize...
 </Footnotes>
 <!--
+- Architecture space of PIM:
 - Inside the memory SA
-  - Ambit
-  - activate multiple rows at the same time
-  - bulk logic operations
+  - simple bulk logic
 - Near SA in PSA output region
-  - CMOS-based logic gates in the region
+  - logic gates in the region
 - Near a bank in its peripheral region
-  - computation units with control at bank output
+  - computation units with control
 - I/O region of memory
-  - more traditional accelerator approach
+  - limited by memory bus
 -->
 ---
@@ -171,12 +167,9 @@ To summarize...
 </Footnotes>
 <!--
-- Real-world PIM implementation based on HBM2
-- SIMD FPUs are 16-wide, i.e., there are 16 FPU units
-- Three execution modes
-  - Single-Bank (SB)
-  - All-Bank (AB)
-  - All-Bank-PIM (AB-PIM)
+- One PIM unit shared by two banks
+- 16-wide SIMD FPUs
+- All-Bank mode: All PIM units operate in parallel
 -->
 ---
@@ -201,16 +194,15 @@ To summarize...
 </Footnotes>
 <!--
-- Control unit executes RISC instructions
 - Two SIMD FPUs
   - ADD
   - MUL
-- CRF: 32 32-bit entries (32 instructions)
-- GRF: 16 256-bit entries
-- SRF: 16 16-bit entries
-- One instruction is executed when RD or WR command is issued
+- CRF: 32 instructions, stores the program
+- GRF: 16 entries, one memory fetch
+- SRF: 16 entries
+- Control unit executes one instruction when RD or WR command is issued
 -->
 ---
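The register-file sizes in the note above (CRF 32×32-bit, GRF 16×256-bit, SRF 16×16-bit) line up with the 16-wide fp16 SIMD units. A small sketch of that arithmetic (the byte math is mine, not from the slides):

```rust
// Register-file capacities from the note: CRF 32 x 32-bit (instructions),
// GRF 16 x 256-bit (vector data), SRF 16 x 16-bit (scalars).
const CRF_ENTRIES: usize = 32;
const CRF_ENTRY_BITS: usize = 32;
const GRF_ENTRY_BITS: usize = 256;
const FP16_BITS: usize = 16;

// A 256-bit GRF entry holds exactly one 16-lane fp16 vector,
// matching the 16-wide SIMD FPUs.
fn grf_fp16_lanes() -> usize {
    GRF_ENTRY_BITS / FP16_BITS
}

fn main() {
    println!("{}", grf_fp16_lanes()); // 16
    println!("{}", CRF_ENTRIES * CRF_ENTRY_BITS / 8); // 128 bytes of microkernel storage
}
```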
@@ -229,6 +221,13 @@ figureCaption: Procedure to perform a (128×8)×(128) GEMV operation
 </Footnote>
 </Footnotes>
+<!--
+- Procedure of GEMV operation
+- multiple cycles
+- each PIM unit operates on one matrix row
+- partial sum, reduced by host
+-->
 ---
 layout: figure
 figureUrl: /layout.svg
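The "partial sum, reduced by host" step from the note above can be sketched as plain host code (illustrative only; the per-bank layout is an assumption, not the thesis implementation):

```rust
// Sketch: host-side reduction of per-bank partial sums. Each PIM unit
// returns one partial result vector; the host adds them element-wise
// to obtain the final GEMV output.
fn reduce_partial_sums(partials: &[Vec<f32>]) -> Vec<f32> {
    let mut out = vec![0.0f32; partials[0].len()];
    for p in partials {
        for (acc, v) in out.iter_mut().zip(p) {
            *acc += v;
        }
    }
    out
}

fn main() {
    let banks = [vec![1.0, 2.0], vec![3.0, 4.0]];
    println!("{:?}", reduce_partial_sums(&banks)); // [4.0, 6.0]
}
```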
@@ -254,7 +253,7 @@ figureCaption: Mapping of the weight matrix onto the memory banks
 <br>
 <br>
-- To analyze the performance gains of PIM, simulation models are needed
+- Simulations are needed to analyze the performance gains of PIM
 - Research should not only focus on hardware but also explore the software side
 <br>

View File

@@ -14,7 +14,7 @@
 - Vector-Matrix benchmarks (BLAS level 2)
   - GEMV: $z = A \cdot x$
-- DNN:
+- Simple DNN:
   - $f(x) = z = ReLU(A \cdot x)$
   - $z_{n+1} = f(z_n)$
   - 5 layers in total
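The benchmark definitions above ($f(x) = ReLU(A \cdot x)$ iterated over five layers) can be sketched as a scalar reference (not the thesis kernel; square f32 layers are assumptions of this sketch):

```rust
// Scalar reference for the simple DNN benchmark:
// f(x) = ReLU(A · x), with z_{n+1} = f(z_n) over five layers.
fn relu_gemv(a: &[Vec<f32>], x: &[f32]) -> Vec<f32> {
    a.iter()
        .map(|row| row.iter().zip(x).map(|(w, v)| w * v).sum::<f32>().max(0.0))
        .collect()
}

fn forward(layers: &[Vec<Vec<f32>>], input: Vec<f32>) -> Vec<f32> {
    layers.iter().fold(input, |z, a| relu_gemv(a, &z))
}

fn main() {
    // Five 2x2 identity layers leave a non-negative input unchanged.
    let identity = vec![vec![1.0, 0.0], vec![0.0, 1.0]];
    println!("{:?}", forward(&vec![identity; 5], vec![3.0, 4.0])); // [3.0, 4.0]
}
```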
@@ -36,24 +36,44 @@ Operand Dimensions
 </div>
 </div>
+<!--
+- operand data significantly larger than on-chip cache
+-->
 ---
 ## Simulations
 ### System Configuration
 <hr/>
-<br>
 <br>
 <br>
-- Two simulated systems:
-  - Generic ARM systems
-  - Infinite compute ARM system
+<div class="grid grid-cols-2 gap-4">
+<div>
+#### Two simulated systems:
 <br>
-- Two real GPUs using HBM2:
-  - AMD RX Vega 56
-  - NVIDIA V100
+- Generic ARM system
+- Infinite compute system
+  - completely memory bound
+</div>
+<div>
+#### Two real GPUs using HBM2:
+<br>
+- AMD RX Vega 56
+- NVIDIA V100
+</div>
+</div>
 ---
 layout: figure
@@ -75,11 +95,15 @@ figureCaption: Speedups of PIM compared to non-PIM
 ### Speedups / Infinite Compute System
 <hr/>
+<!--
+- VADD: 12.7x
+- GEMV: 9.0x
+-->
 ---
 layout: figure
 figureUrl: /samsung.svg
 figureCaption: Speedups of Samsung for VADD and GEMV
+figureFootnoteNumber: 1
 ---
 ## Simulations
@@ -97,6 +121,7 @@ figureFootnoteNumber: 1
 - ADD shows deviation
 -> differences in hardware architecture
+- GPU has no speculative execution
 -->
 ---
@@ -111,6 +136,7 @@ figureCaption: Runtimes for Vector Benchmarks
 <!--
 - Real GPUs use multiple memory channels
+- Memory barriers
 - Also architectural differences
 -->