Minor changes

2024-04-07 22:41:59 +02:00
parent 3d15758c82
commit d634f97fb2
6 changed files with 107 additions and 61 deletions


@@ -3,12 +3,13 @@
<br>
Speedups of 17.6× and 9.0× have been achieved for the hypothetical infinite compute system
- PIM can accelerate memory-bound workloads
- Special PIM-friendly memory layouts are required
<br>
-Future work:
+#### Future work:
- Implementation of Linux driver
- Comparison with complete neural networks
- Consider replacing library approach with compiler approach
- Implement a power model to analyze the power efficiency gains


@@ -38,28 +38,12 @@ figureCaption: Data structures for instructions and register files
- Provides data structures for operand data and microkernels
- Executes programmed microkernels
---
layout: figure-side
figureUrl: /bare_metal.svg
---
## Virtual Prototype
### Platform
<hr/>
<br>
<br>
- Bare-metal kernel executes on ARM processor model
- Custom page table configuration
- Non-PIM DRAM region mapped as cacheable memory
- PIM DRAM region mapped as non-cacheable memory
- Accesses generate RD and WR requests
---
## Virtual Prototype
### Platform
### GEMV Kernel
<hr/>
<br>
@@ -68,7 +52,7 @@ figureUrl: /bare_metal.svg
<div>
DRAM-side
-```asm{all|1-8|9,10|11|12|all}{lines:true,at:1}
+```asm{all|1-8|9,10|11|12}{lines:true,at:1}
MOV GRF_A #0, BANK
MOV GRF_A #1, BANK
MOV GRF_A #2, BANK
@@ -94,7 +78,7 @@ code {
Host-side
-```rust {all|7-10|12-17|19-28|30-31|all}{lines:true,maxHeight:'15em',at:1}
+```rust {all|7-10|12-17|19-28|30-31}{lines:true,maxHeight:'15em',at:1}
pub fn execute<const X16R: usize, const X16C: usize, const R: usize>(
matrix: &Matrix<X16R, X16C>,
input_vector: &Vector<X16C>,
@@ -131,4 +115,24 @@ pub fn execute<const X16R: usize, const X16C: usize, const R: usize>(
</div>
</div>
<!-- </Transform> -->
---
layout: figure-side
figureUrl: /bare_metal.svg
---
## Virtual Prototype
### Platform
<hr/>
<br>
<br>
- ARM processor model
- Bare-metal kernel
- Custom page table configuration
- Non-PIM DRAM region mapped as cacheable memory
- PIM DRAM region mapped as non-cacheable memory
<!--
- bare metal offers the most control
-->
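The address-space split described above (cacheable normal DRAM vs. non-cacheable PIM DRAM) can be sketched as a small Rust memory map. The base addresses, region sizes, and all names below are illustrative assumptions, not values from the slides.

```rust
// Hypothetical memory map for the bare-metal kernel. The concrete base
// addresses and sizes are invented for illustration.
#[derive(Debug, PartialEq)]
enum Attr {
    Cacheable,
    NonCacheable,
}

struct Region {
    base: u64,
    size: u64,
    attr: Attr,
}

static REGIONS: [Region; 2] = [
    // Normal DRAM: mapped cacheable so the CPU can use its caches.
    Region { base: 0x8000_0000, size: 0x1000_0000, attr: Attr::Cacheable },
    // PIM DRAM: mapped non-cacheable so every load/store reaches the
    // memory controller and triggers an actual RD/WR DRAM command.
    Region { base: 0x9000_0000, size: 0x1000_0000, attr: Attr::NonCacheable },
];

// Look up which attribute a given physical address would get.
fn attr_for(addr: u64) -> Option<&'static Attr> {
    REGIONS
        .iter()
        .find(|r| addr >= r.base && addr < r.base + r.size)
        .map(|r| &r.attr)
}
```

In a real page-table setup these attributes would be encoded in the descriptors (e.g. via MAIR on ARM); the lookup table above only captures the policy, not the encoding.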


@@ -18,6 +18,14 @@
</Footnote>
</Footnotes>
<!--
- compute doubles every two years
- energy production grows linearly at 2% per year
- to meet future compute demands
- -> drastic improvements in energy efficiency
-->
---
## Introduction
@@ -26,7 +34,7 @@
<br>
#### Roofline model of GPT revisions<sup>1</sup>
- AI workloads become increasingly memory-bound
<br>
@@ -39,3 +47,10 @@
Ivo Bolsens. „Scalable AI Architectures for Edge and Cloud“, 2023.
</Footnote>
</Footnotes>
<!--
- Emerging AI applications become increasingly memory-bound
- Roofline model
- Not limited by compute power but by memory
-> researchers begin to consider processing in memory to circumvent memory bottleneck
-->
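The roofline argument above can be made concrete with a small arithmetic-intensity check: a kernel is memory-bound when its FLOPs per byte moved fall below the machine balance (peak FLOP/s over peak bandwidth). The GEMV counts are generic; the machine figures in the usage note are illustrative assumptions, not measurements.

```rust
// Arithmetic intensity: floating-point operations per byte of memory traffic.
fn arithmetic_intensity(flops: f64, bytes: f64) -> f64 {
    flops / bytes
}

// A kernel is memory-bound if its intensity is below the machine balance.
fn is_memory_bound(ai: f64, peak_flops: f64, peak_bw_bytes: f64) -> bool {
    ai < peak_flops / peak_bw_bytes
}

// GEMV on an n×n f32 matrix: ~2n² FLOPs over ~4n² bytes of weight traffic,
// i.e. about 0.5 FLOP/byte regardless of n — far below typical machine
// balances, which is why GEMV sits on the memory roof.
fn gemv_ai(n: f64) -> f64 {
    arithmetic_intensity(2.0 * n * n, 4.0 * n * n)
}
```

For a hypothetical machine with 10 TFLOP/s and 1 TB/s (balance of 10 FLOP/byte), `is_memory_bound(gemv_ai(1024.0), 1.0e13, 1.0e12)` holds, matching the slide's claim that such workloads are limited by memory rather than compute.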


@@ -11,15 +11,8 @@
</div>
<!--
- Workload must be memory-bound
- memory-bound:
- fully-connected layers
- layers of recurrent neural networks (RNNs)
- not memory-bound:
- convolutional layers
- data reuse
- fully connected layers of a neural network
- For PIM to be effective, the workload must be memory-bound
-->
---
@@ -52,6 +45,10 @@ clicks: 1
</div>
</Transform>
<!--
- filter matrix is reused
-->
---
## Processing-in-Memory
@@ -67,7 +64,7 @@ clicks: 1
<div>
### Suitable candidates for PIM:
- Multilayer perceptrons (MLPs)
- Fully connected layers in multilayer perceptrons (MLPs)
- Layers in recurrent neural networks (RNNs)
</div>
@@ -130,19 +127,18 @@ To summarize...
</Footnotes>
<!--
- Architecture space of PIM:
- Inside the memory SA
- Ambit
- activate multiple rows at the same time
- bulk logic operations
- simple bulk logic
- Near SA in PSA output region
- CMOS-based logic gates in the region
- logic gates in the region
- Near a bank in its peripheral region
- computation units with control at bank output
- computation units with control
- I/O region of memory
- more traditional accelerator approach
- limited by memory bus
-->
---
@@ -171,12 +167,9 @@ To summarize...
</Footnotes>
<!--
- Real-world PIM implementation based on HBM2
- SIMD FPUs are 16-wide, i.e., there are 16 FPU units
- Three execution modes
- Single-Bank (SB)
- All-Bank (AB)
- All-Bank-PIM (AB-PIM)
- One PIM unit shared by two banks
- SIMD FPUs are 16-wide
- All-Bank mode: All PIM units operate in parallel
-->
---
@@ -201,16 +194,15 @@ To summarize...
</Footnotes>
<!--
- Control unit executes RISC instructions
- Two SIMD FPUs
- ADD
- MUL
- CRF: 32 32-bit entries (32 instructions)
- GRF: 16 256-bit entries
- SRF: 16 16-bit entries
- CRF: 32 instructions, stores the program
- GRF: 16 entries, one memory fetch
- SRF: 16 entries
- One instruction is executed when RD or WR command is issued
- Control unit executes one instruction when a RD or WR command is issued
-->
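The register files and the RD/WR-triggered execution model in the notes above can be sketched as a toy Rust model. The sizes follow the slides (CRF: 32 instructions, GRF: 16 entries of 256 bits, SRF: 16 entries); the struct, field, and method names are invented for illustration.

```rust
// Toy model of one PIM unit's register files.
struct PimUnit {
    crf: [u32; 32],      // command register file: stores the microkernel program
    grf: [[u8; 32]; 16], // general register file: 256-bit entries (one memory fetch)
    srf: [u16; 16],      // scalar register file: 16-bit entries
    pc: usize,           // index of the next CRF instruction
}

impl PimUnit {
    fn new() -> Self {
        PimUnit { crf: [0; 32], grf: [[0; 32]; 16], srf: [0; 16], pc: 0 }
    }

    // The host does not start the unit directly: each RD or WR command the
    // memory controller issues advances the microkernel by exactly one CRF
    // instruction. Returns the (opaque) instruction word that would execute.
    fn on_rd_or_wr(&mut self) -> u32 {
        let insn = self.crf[self.pc];
        self.pc = (self.pc + 1) % self.crf.len();
        insn
    }
}
```

This captures only the sequencing ("one instruction per RD/WR"), not the SIMD ADD/MUL datapaths themselves.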
---
@@ -229,6 +221,13 @@ figureCaption: Procedure to perform a (128×8)×(128) GEMV operation
</Footnote>
</Footnotes>
<!--
- Procedure of GEMV operation
- multiple cycles
- each PIM unit operates on one matrix row
- partial sum, reduced by host
-->
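The GEMV procedure in the notes above — per-row multiply-accumulate in the PIM units, with the host reducing partial sums — can be sketched functionally in Rust. The chunking into `units` partial sums per row is an illustrative assumption about how work is split; dimensions are arbitrary.

```rust
// Functional sketch of the PIM GEMV procedure: each row is processed as
// `units` partial dot products (one per simulated PIM unit), which the
// host then reduces into the final output element.
fn gemv_pim(matrix: &[Vec<f32>], x: &[f32], units: usize) -> Vec<f32> {
    matrix
        .iter()
        .map(|row| {
            // Split the row into one chunk per PIM unit (ceiling division).
            let chunk = (row.len() + units - 1) / units;
            // Each unit accumulates a partial dot product over its chunk.
            let partials: Vec<f32> = row
                .chunks(chunk)
                .zip(x.chunks(chunk))
                .map(|(a, b)| a.iter().zip(b).map(|(p, q)| p * q).sum())
                .collect();
            // The host reduces the partial sums.
            partials.iter().sum()
        })
        .collect()
}
```

The real hardware interleaves this over DRAM commands and multiple cycles; this sketch only shows the data flow (partial sums per unit, host-side reduction).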
---
layout: figure
figureUrl: /layout.svg
@@ -254,7 +253,7 @@ figureCaption: Mapping of the weight matrix onto the memory banks
<br>
<br>
-To analyze the performance gains of PIM, simulation models are needed
+Simulations are needed to analyze the performance gains of PIM
- Research should not only focus on hardware but also explore the software side
<br>


@@ -14,7 +14,7 @@
- Vector-Matrix benchmarks (BLAS level 2)
- GEMV: $z = A \cdot x$
- DNN:
- Simple DNN:
- $f(x) = z = ReLU(A \cdot x)$
- $z_{n+1} = f(z_n)$
- 5 layers in total
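The benchmark semantics above can be written out as plain Rust reference code: GEMV is $z = A \cdot x$, and the simple DNN chains `layers` applications of $f(z) = ReLU(A \cdot z)$. Using the same square matrix for every layer is an assumption made for brevity; function names are invented.

```rust
// Reference GEMV: z = A · x.
fn gemv(a: &[Vec<f32>], x: &[f32]) -> Vec<f32> {
    a.iter()
        .map(|row| row.iter().zip(x).map(|(p, q)| p * q).sum())
        .collect()
}

// Element-wise ReLU.
fn relu(v: Vec<f32>) -> Vec<f32> {
    v.into_iter().map(|e| e.max(0.0)).collect()
}

// Simple DNN benchmark: z_{n+1} = ReLU(A · z_n), repeated `layers` times
// (5 in the slides).
fn dnn(a: &[Vec<f32>], x: &[f32], layers: usize) -> Vec<f32> {
    (0..layers).fold(x.to_vec(), |z, _| relu(gemv(a, &z)))
}
```

These definitions serve only as a functional specification of what the benchmarks compute, independent of whether the kernels run on the host or in the PIM units.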
@@ -36,24 +36,44 @@ Operand Dimensions
</div>
</div>
<!--
- operand data significantly larger than on-chip cache
-->
---
## Simulations
### System Configuration
<hr/>
<br>
<br>
<br>
- Two simulated systems:
- Generic ARM system
- Infinite compute ARM system
<div class="grid grid-cols-2 gap-4">
<div>
#### Two simulated systems:
<br>
- Two real GPUs using HBM2:
- AMD RX Vega 56
- NVIDIA V100
- Generic ARM system
- Infinite compute system
- completely memory-bound
</div>
<div>
#### Two real GPUs using HBM2:
<br>
- AMD RX Vega 56
- NVIDIA V100
</div>
</div>
---
layout: figure
@@ -75,11 +95,15 @@ figureCaption: Speedups of PIM compared to non-PIM
### Speedups / Infinite Compute System
<hr/>
<!--
- VADD: 12.7x
- GEMV: 9.0x
-->
---
layout: figure
figureUrl: /samsung.svg
figureCaption: Speedups of Samsung for VADD and GEMV
figureFootnoteNumber: 1
---
## Simulations
@@ -97,6 +121,7 @@ figureFootnoteNumber: 1
- ADD shows deviation
-> differences in hardware architecture
- GPU has no speculative execution
-->
---
@@ -111,6 +136,7 @@ figureCaption: Runtimes for Vector Benchmarks
<!--
- Real GPUs use multiple memory channels
- Memory barriers
- Also architectural differences
-->