Refactor presentation
@@ -1791,9 +1791,9 @@
inkscape:pageopacity="0"
inkscape:pagecheckerboard="0"
inkscape:deskcolor="#d1d1d1"
inkscape:zoom="2.4038052"
inkscape:cx="376.48641"
inkscape:cy="146.01849"
inkscape:zoom="0.84987348"
inkscape:cx="436.53557"
inkscape:cy="289.45485"
inkscape:window-width="2194"
inkscape:window-height="1158"
inkscape:window-x="0"
@@ -6407,20 +6407,20 @@
id="g276"
transform="matrix(1.2839206,0,0,1.2839206,-276.91223,57.78747)">
<path
d="m 326.43557,68.032148 h 61.338 v 8.120261 h -61.338 z"
d="m 303.28562,68.032148 h 84.48795 v 8.120261 h -84.48795 z"
style="fill:#f5f000;fill-opacity:1;fill-rule:evenodd;stroke:#000000;stroke-width:1.04686;stroke-linecap:butt;stroke-linejoin:miter;stroke-miterlimit:4;stroke-dasharray:none;stroke-opacity:1"
transform="matrix(1.143907,0,0,1.1485333,-51.191119,119.61054)"
transform="matrix(1.143907,0,0,1.1485333,-24.709729,119.61054)"
clip-path="url(#clipPath117-3)"
id="path276-9"
sodipodi:nodetypes="ccccc" />
<text
xml:space="preserve"
style="font-size:8px;fill:#000000"
x="357.35019"
x="370.59088"
y="205.03008"
id="text362"><tspan
sodipodi:role="line"
x="357.35019"
x="370.59088"
y="205.03008"
id="tspan363"
style="font-style:normal;font-variant:normal;font-weight:bold;font-stretch:normal;font-size:8px;font-family:'Liberation Serif';-inkscape-font-specification:'Liberation Serif Bold';text-align:center;text-anchor:middle">PIM</tspan></text>
@@ -1,9 +1,14 @@
## Conclusion and Future Work
<hr/>

- achievable speedup of 17.6 × and 9.0 × hypothetical infinite compute system
- lower bound
- linux driver implementation
- comparison with real neural network workloads
- consider replacing library approach with compiler approach
- power comparison, power models needed
<br>

Speedups of 17.6× and 9.0× have been achieved, the latter for the hypothetical infinite compute system

<br>

Future work:
- Implementation of a Linux driver
- Comparison with complete neural networks
- Consider replacing the library approach with a compiler approach
- Implement a power model to analyze the power efficiency gains

@@ -1,13 +1,23 @@
---
layout: figure
figureUrl: /dramsys.svg
figureCaption: The PIM-HBM model integrated into DRAMSys
---

## Virtual Prototype
### Processing Units
<hr/>

<br>

- Integrate DRAMSys into gem5
- Implement PIM-HBM virtual prototype in DRAM model

<br>

<div class="flex justify-center items-center">
<img src="/dramsys.svg">
</div>

<!--
- VP interprets the programmed microkernel
- not yet drop-in replacement
-->

---
layout: figure-side
figureUrl: /data_structures.svg
@@ -18,12 +28,15 @@ figureCaption: Data structures for instructions and register files
### Software Library
<hr/>

<br>
<br>
<br>

- Software support library
- Provides data structures for PIM-HBM
- Adhering special memory layout requirements
#### Software support library

<br>

- Provides data structures for operand data and microkernels
- Executes programmed microkernels

---

@@ -1,33 +1,41 @@
---
layout: figure
figureUrl: /world_energy.svg
figureCaption: Total energy of computing
figureFootnoteNumber: 1
---

## Introduction
### Energy Demand of Applications
<hr/>

<br>

- Total compute energy approaches the world's energy production

--> drastic improvements in energy efficiency needed

<div class="flex justify-center">
<img src="/world_energy.svg">
</div>

<Footnotes separator>
<Footnote :number=1>
<Footnote>
SRC. „Decadal Plan for Semiconductors“, January 2021. https://www.src.org/about/decadal-plan/.
</Footnote>
</Footnotes>

---
layout: figure
figureUrl: /gpt.svg
figureCaption: Roofline model of GPT revisions
figureFootnoteNumber: 1
---

## Introduction
### Memory-Bound Workloads
<hr/>

<br>

#### Roofline model of GPT revisions<sup>1</sup>

<br>

<div class="flex justify-center">
<img src="/gpt.svg">
</div>
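
The roofline model bounds the attainable performance of a kernel by the smaller of the compute roof and the product of arithmetic intensity and memory bandwidth. A minimal Python sketch, assuming FP16 operands (2 bytes each) and one multiply plus one add per matrix element; the function names and the machine figures in the example are hypothetical and are not taken from the GPT roofline figure.

```python
def attainable_flops(peak_flops: float, bandwidth: float, intensity: float) -> float:
    """Roofline bound: min(compute roof, intensity * memory bandwidth).

    peak_flops in FLOP/s, bandwidth in byte/s, intensity in FLOP/byte.
    """
    return min(peak_flops, intensity * bandwidth)


def gemv_intensity(rows: int, cols: int, bytes_per_elem: int = 2) -> float:
    """Arithmetic intensity of z = A * x for a (rows x cols) matrix A.

    Assumes every element of A, x and z crosses the memory interface exactly
    once; the default of 2 bytes per element corresponds to FP16.
    """
    flops = 2 * rows * cols  # one multiply and one add per matrix element
    traffic = (rows * cols + cols + rows) * bytes_per_elem  # read A and x, write z
    return flops / traffic


if __name__ == "__main__":
    # Hypothetical machine: 100 TFLOP/s compute roof, 1 TB/s memory bandwidth.
    ai = gemv_intensity(4096, 4096)
    perf = attainable_flops(100e12, 1e12, ai)
    print(f"GEMV intensity: {ai:.3f} FLOP/byte, bound: {perf / 1e12:.2f} TFLOP/s")
```

With roughly 1 FLOP/byte, a GEMV on this hypothetical machine reaches only about 1 % of the compute roof, i.e. it sits in the memory-bound region of the roofline.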

<Footnotes separator>
<Footnote :number=1>
<Footnote>
Ivo Bolsens. „Scalable AI Architectures for Edge and Cloud“, 2023.
</Footnote>
</Footnotes>

@@ -79,6 +79,10 @@ clicks: 1
</div>
</div>

<!--
To summarize...
-->

---

## Processing-in-Memory
@@ -94,8 +98,8 @@ clicks: 1
<v-clicks>

- Inside the memory subarray
- In the PSA region near a subarray
- Outside the bank in its peripheral region
- Near the subarray in the PSA output region
- Near the bank in its peripheral region
- In the I/O region of the memory

</v-clicks>
@@ -120,7 +124,7 @@ clicks: 1
<div v-click class="text-xl"> The nearer the computation is to the memory cells, the higher the achievable bandwidth! </div>

<Footnotes separator>
<Footnote :number=1>
<Footnote>
Sudarshan et al. „A Critical Assessment of DRAM-PIM Architectures - Trends, Challenges and Solutions“, 2022.
</Footnote>
</Footnotes>
@@ -141,19 +145,27 @@ clicks: 1
- more traditional accelerator approach
-->

---
layout: figure
figureUrl: /hbm-pim.svg
figureCaption: Architecture of PIM-HBM
figureFootnoteNumber: 1
---

## Processing-in-Memory
### Samsung's HBM-PIM
### Samsung's PIM-HBM
<hr/>

<br>

- Real-world PIM implementation based on HBM2
- PIM units embedded at the bank level

<br>

<div class="flex justify-center items-center">
<img src="/hbm-pim.svg">
</div>

<Footnotes separator>
<Footnote :number=1>
<Footnote>
Lee et al. „Hardware Architecture and Software Stack for PIM Based on Commercial DRAM Technology : Industrial Product“, 2021.
</Footnote>
</Footnotes>
@@ -167,19 +179,23 @@ figureFootnoteNumber: 1
- All-Bank-PIM (AB-PIM)
-->

---
layout: figure
figureUrl: /pu.svg
figureCaption: Architecture of a PIM processing unit
figureFootnoteNumber: 1
---

## Processing-in-Memory
### Samsung's HBM-PIM
### Samsung's PIM-HBM
<hr/>

<br>

- Two 16-wide 16-bit FPUs
- Register files and control unit

<div class="flex justify-center items-center">
<img src="/pu.svg">
</div>

<Footnotes separator>
<Footnote :number=1>
<Footnote>
Lee et al. „Hardware Architecture and Software Stack for PIM Based on Commercial DRAM Technology : Industrial Product“, 2021.
</Footnote>
</Footnotes>
@@ -201,15 +217,14 @@ figureFootnoteNumber: 1
layout: figure
figureUrl: /gemv.svg
figureCaption: Procedure to perform a (128×8)×(128) GEMV operation
figureFootnoteNumber: 1
---

## Processing-in-Memory
### Samsung's HBM-PIM
### Samsung's PIM-HBM
<hr/>

<Footnotes separator>
<Footnote :number=1>
<Footnote>
Lee et al. „Hardware Architecture and Software Stack for PIM Based on Commercial DRAM Technology : Industrial Product“, 2021.
</Footnote>
</Footnotes>
@@ -221,7 +236,7 @@ figureCaption: Mapping of the weight matrix onto the memory banks
---

## Processing-in-Memory
### Samsung's HBM-PIM
### Samsung's PIM-HBM
<hr/>

<!--
@@ -234,8 +249,14 @@ figureCaption: Mapping of the weight matrix onto the memory banks
### Research
<hr/>

simulation models needed
<br>
<br>
<br>
<br>

research should not only focus on hardware but also explore the software side!
- To analyze the performance gains of PIM, simulation models are needed
- Research should not only focus on hardware but also explore the software side

therefore, I am building a virtual prototype
<br>

- In the following, a virtual prototype of PIM-HBM is implemented

@@ -2,7 +2,6 @@
### Microbenchmarks
<hr/>

<br>
<br>

<div class="grid grid-cols-2 gap-4">
@@ -15,11 +14,16 @@

- Vector-Matrix benchmarks (BLAS level 2)
  - GEMV: $z = A \cdot x$
  - DNN Layer: $z = \mathrm{ReLU}(A \cdot x)$
  - DNN:
    - $f(x) = z = \mathrm{ReLU}(A \cdot x)$
    - $z_{n+1} = f(z_n)$
    - 5 layers in total

</div>
<div>

<br>

| Level | Vector | GEMV | DNN |
|-------|--------|---------------|---------------|
| X1 | (2M) | (1024 x 4096) | (256 x 256) |
@@ -27,6 +31,8 @@
| X3 | (8M) | (4096 x 8192) | (1024 x 1024) |
| X4 | (16M) | (4096 x 8192) | (2048 x 2048) |

Operand Dimensions

</div>
</div>
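
The three vector-matrix microbenchmarks can be written down as a plain-Python reference model, for example to check PIM results against a host computation. This is a hypothetical sketch, not the benchmark code used for the simulations: it uses Python floats instead of FP16, and it assumes that the DNN reuses the same square matrix A in every layer, as the single function f in $z_{n+1} = f(z_n)$ suggests.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def gemv(a: Matrix, x: Vector) -> Vector:
    """GEMV: z = A * x (one dot product per row of A)."""
    if any(len(row) != len(x) for row in a):
        raise ValueError("number of columns of A must equal len(x)")
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]


def relu(v: Vector) -> Vector:
    """Element-wise ReLU."""
    return [max(0.0, e) for e in v]


def dnn_layer(a: Matrix, x: Vector) -> Vector:
    """DNN layer: z = ReLU(A * x)."""
    return relu(gemv(a, x))


def dnn(a: Matrix, x: Vector, layers: int = 5) -> Vector:
    """DNN: z_{n+1} = f(z_n) with f(x) = ReLU(A * x), applied `layers` times.

    Assumption: the same square matrix A is used for every layer.
    """
    z = x
    for _ in range(layers):
        z = dnn_layer(a, z)
    return z


if __name__ == "__main__":
    a = [[1.0, -1.0], [0.0, 2.0]]
    x = [1.0, 1.0]
    print(gemv(a, x), dnn_layer(a, x), dnn(a, x))
```

The square DNN operand dimensions in the table (e.g. 256 x 256) are what allow the output of one layer to be fed back as the input of the next.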
@@ -36,11 +42,18 @@
### System Configuration
<hr/>

- Two system configurations:
  - ARM 3GHz
  - ARM Infinite
<br>
<br>

- TODO ... GPUs etc.
- Two simulated systems:
  - Generic ARM system
  - Infinite compute ARM system

<br>

- Two real GPUs using HBM2:
  - AMD RX Vega 56
  - NVIDIA V100

---
layout: figure
@@ -49,7 +62,7 @@ figureCaption: Speedups of PIM compared to non-PIM
---

## Simulations
### Speedups / ARM System
### Speedups / Generic ARM System
<hr/>

---
@@ -74,11 +87,18 @@ figureFootnoteNumber: 1
<hr/>

<Footnotes separator>
<Footnote :number=1>
<Footnote>
Lee et al. „Hardware Architecture and Software Stack for PIM Based on Commercial DRAM Technology : Industrial Product“, 2021.
</Footnote>
</Footnotes>

<!--
- GEMV matches well
- ADD shows a deviation

-> differences in hardware architecture
-->

---
layout: figure
figureUrl: /runtimes_vector.svg
@@ -89,6 +109,11 @@ figureCaption: Runtimes for Vector Benchmarks
### Runtimes / Vector Benchmarks
<hr/>

<!--
- Real GPUs use multiple memory channels
- Also architectural differences
-->

---
layout: figure
figureUrl: /runtimes_matrix.svg