master-thesis-presentation/slides/pim.md
2024-04-09 16:10:45 +02:00

## Processing-in-Memory
### Applicable Workloads
<hr/>
- Fully connected layers have a large weight matrix
- Weight matrix does not fit into the on-chip cache
- No data reuse in the matrix: each weight is used only once per input vector
<div class="flex justify-center">
<img src="/dnn.svg">
</div>
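---
## Processing-in-Memory
### Applicable Workloads
<hr/>

The missing data reuse can be made concrete with a rough operational-intensity estimate. The sketch below is a back-of-the-envelope model, not a measurement: it assumes the weight matrix is streamed from memory exactly once, 16-bit elements, and a hypothetical 4096 x 4096 layer.

```python
def gemv_intensity(rows: int, cols: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte of memory traffic for y = W @ x.

    Assumes W is streamed from memory exactly once (no cache reuse).
    """
    flops = 2 * rows * cols  # one multiply and one add per weight
    # Traffic: the weight matrix W, the input vector x, the output vector y.
    traffic = bytes_per_elem * (rows * cols + cols + rows)
    return flops / traffic


# Hypothetical layer size, chosen only for illustration.
intensity = gemv_intensity(4096, 4096)
print(f"{intensity:.3f} FLOPs per byte")
```

Under these assumptions the result stays just below 1 FLOP per byte, so the layer is bound by memory bandwidth rather than by compute.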
---
preload: false
clicks: 1
---
## Processing-in-Memory
### Applicable Workloads
<hr/>
- Convolutional layers have a small filter matrix
- Filter matrix fits into the on-chip cache
- Extensive data reuse: each filter weight is applied at every output position
<br>
<Transform :scale="1.4">
<div class="absolute left-175px top-1px">
<img src="/cnn_input.svg">
</div>
<div v-if="$slidev.nav.clicks === 0" class="absolute left-175px">
<img src="/cnn_filter.svg">
</div>
<div v-if="$slidev.nav.clicks === 1"
v-motion
:initial="{ x: 175, y: 0}"
:enter="{ x: 335, y: 0, transition: { duration: 5000 }}">
<img src="/cnn_filter.svg">
</div>
</Transform>
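---
## Processing-in-Memory
### Applicable Workloads
<hr/>

To put a number on the reuse in a convolution, the sketch below counts how often a single filter weight is applied while the filter slides over the input. It assumes one input channel and no padding; the 224 x 224 input and the 3 x 3 filter are illustrative values, not taken from the slides.

```python
def conv_weight_reuse(height: int, width: int, kernel: int, stride: int = 1) -> int:
    """Number of output positions, i.e. how often each filter weight is used.

    Assumes a single channel and no padding.
    """
    out_h = (height - kernel) // stride + 1
    out_w = (width - kernel) // stride + 1
    return out_h * out_w


# Illustrative sizes: every weight of the 3x3 filter is reused 49284 times.
print(conv_weight_reuse(224, 224, 3))
```

A weight in a fully connected layer, by contrast, is used once per input vector, which is why the filter benefits from the cache and the weight matrix does not.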
---
## Processing-in-Memory
### Applicable Workloads
<hr/>
<br>
<br>
<br>
<div class="grid grid-cols-2 gap-4">
<div>
### Suitable candidates for PIM:
<br>
- Fully connected layers in multilayer perceptrons (MLPs)
- Layers in recurrent neural networks (RNNs)
</div>
<div>
### Less suitable candidates for PIM:
<br>
- Convolutional neural networks (CNNs)
</div>
</div>
<!--
To summarize...
-->
---
## Processing-in-Memory
### Architectures
<hr/>
<br>
<br>
<div class="grid grid-cols-2 gap-4">
<div>
<v-clicks>
- Inside the memory subarray
- Near the subarray in the PSA output region
- Near the bank in its peripheral region
- In the I/O region of the memory
</v-clicks>
</div>
<div>
<img v-click="[0,1]" class="absolute right-80px top-150px" src="/pim_positions_0.svg">
<img v-click="[1,2]" class="absolute right-80px top-150px" src="/pim_positions_1.svg">
<img v-click="[2,3]" class="absolute right-80px top-150px" src="/pim_positions_2.svg">
<img v-click="[3,4]" class="absolute right-80px top-150px" src="/pim_positions_3.svg">
<img v-click="4" class="absolute right-80px top-150px" src="/pim_positions_4.svg">
</div>
</div>
<br>
<br>
<br>
<br>
<div v-click class="text-xl"> The nearer the computation is to the memory cells, the higher the achievable bandwidth! </div>
<Footnotes separator>
<Footnote>
Sudarshan et al. “A Critical Assessment of DRAM-PIM Architectures - Trends, Challenges and Solutions”, 2022.
</Footnote>
</Footnotes>
<!--
- Architecture space of PIM:
- Inside the memory SA
- simple bulk logic
- Near SA in PSA output region
- logic gates in the region
- Near a bank in its peripheral region
- computation units with control
- I/O region of memory
- limited by memory bus
-->
---
## Processing-in-Memory
### Samsung's PIM-HBM
<hr/>
<br>
- Real-world PIM implementation based on HBM2
- PIM units embedded at the bank level
<br>
<div class="flex justify-center items-center">
<img src="/hbm-pim.svg">
</div>
<Footnotes separator>
<Footnote>
Lee et al. “Hardware Architecture and Software Stack for PIM Based on Commercial DRAM Technology: Industrial Product”, 2021.
</Footnote>
</Footnotes>
<!--
- One PIM unit shared by two banks
- SIMD FPUs are 16-wide
- All-Bank mode: All PIM units operate in parallel
-->
---
## Processing-in-Memory
### Samsung's PIM-HBM | Processing Unit
<hr/>
<br>
- Two 16-wide 16-bit FPUs
- Register files and control unit
<br>
#### Instructions:
- Control: NOP, JUMP, EXIT
- Data: MOV (ReLU), FILL
- Arithmetic: ADD, MUL, MAC, MAD
<img class="absolute right-80px top-180px" src="/pu.svg">
<Footnotes separator>
<Footnote>
Lee et al. “Hardware Architecture and Software Stack for PIM Based on Commercial DRAM Technology: Industrial Product”, 2021.
</Footnote>
</Footnotes>
<!--
- Two SIMD FPUs
- ADD
- MUL
- CRF: 32 instructions, stores the program
- GRF: 16 entries, one memory fetch
- SRF: 16 entries
- Control unit executes one instruction whenever a RD or WR command is issued
-->
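---
## Processing-in-Memory
### Samsung's PIM-HBM | Processing Unit
<hr/>

The arithmetic instructions can be illustrated with a toy lane-wise model. This is only a sketch: the function names and operand conventions are hypothetical and not taken from Lee et al., and Python floats are used, so 16-bit rounding is not modeled.

```python
LANES = 16  # SIMD width of one FPU, as stated on the slide


def _check(*operands):
    # Every operand must provide exactly one value per SIMD lane.
    for operand in operands:
        assert len(operand) == LANES, "operand must have 16 lanes"


def simd_add(a, b):
    """ADD: lane-wise a + b."""
    _check(a, b)
    return [x + y for x, y in zip(a, b)]


def simd_mul(a, b):
    """MUL: lane-wise a * b."""
    _check(a, b)
    return [x * y for x, y in zip(a, b)]


def simd_mac(acc, a, b):
    """MAC: lane-wise acc + a * b (assumed accumulate semantics)."""
    _check(acc, a, b)
    return [c + x * y for c, x, y in zip(acc, a, b)]


acc = [0.0] * LANES
a = [float(i) for i in range(LANES)]
b = [2.0] * LANES
acc = simd_mac(acc, a, b)  # acc[i] == 2.0 * i
```

Since the control unit executes one instruction per RD or WR command, each call above would correspond to one memory command in this simplified picture.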
---
## Processing-in-Memory
### Samsung's PIM-HBM | GEMV Operation
<hr/>
<img v-click="[0,1]" class="absolute right-125px top-150px" src="/gemv_normal.svg">
<img v-click="1" class="absolute right-10px top-150px" src="/gemv_interleaved.svg">
---
## Processing-in-Memory
### Samsung's PIM-HBM | GEMV Operation
<hr/>
<img v-click="[0,1]" class="absolute right-250px top-150px" src="/gemv.svg">
<img v-click="[1,2]" class="absolute right-250px top-150px" src="/gemv_0.svg">
<img v-click="[2,3]" class="absolute right-250px top-150px" src="/gemv_1.svg">
<img v-click="[3,4]" class="absolute right-250px top-150px" src="/gemv_2.svg">
<img v-click="[4,5]" class="absolute right-250px top-150px" src="/gemv_3.svg">
<img v-click="5" class="absolute right-250px top-150px" src="/gemv_4.svg">
<Footnotes separator>
<Footnote>
Lee et al. “Hardware Architecture and Software Stack for PIM Based on Commercial DRAM Technology: Industrial Product”, 2021.
</Footnote>
</Footnotes>
<!--
- Procedure of GEMV operation
- multiple cycles
    - each PIM unit operates on one matrix row
- partial sum, reduced by host
-->
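---
## Processing-in-Memory
### Samsung's PIM-HBM | GEMV Operation
<hr/>

The GEMV procedure can be sketched functionally: each PIM unit accumulates lane-wise partial sums for one matrix row, and the host reduces them to the final result. The function names are hypothetical, the units are modeled sequentially, and the interleaved data layout is not modeled.

```python
LANES = 16  # SIMD width of one PIM unit


def pim_unit_partial_sums(row, vector):
    """One hypothetical PIM unit: lane-wise MAC over 16-element chunks of its row."""
    acc = [0.0] * LANES
    for start in range(0, len(row), LANES):  # one chunk per step
        for lane in range(LANES):
            acc[lane] += row[start + lane] * vector[start + lane]
    return acc


def gemv(matrix, vector):
    """y = matrix @ vector, with the final reduction performed by the host."""
    assert len(vector) % LANES == 0, "vector length must be a multiple of 16"
    assert all(len(row) == len(vector) for row in matrix)
    # In hardware the units would work in parallel; here they run in a loop.
    partials = [pim_unit_partial_sums(row, vector) for row in matrix]
    # Host-side reduction of the 16 partial sums of each row.
    return [sum(partial) for partial in partials]


matrix = [[1.0] * 32, [float(i) for i in range(32)]]
vector = [2.0] * 32
print(gemv(matrix, vector))  # [64.0, 992.0]
```

Leaving the reduction to the host keeps the PIM unit purely lane-wise, at the cost of transferring 16 partial sums per row instead of one value.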
---
## Processing-in-Memory
### Research
<hr/>
<br>
<br>
<br>
<br>
- To analyze the performance gains of PIM, simulations are needed
- Research should focus not only on hardware but also on programmability
<br>
- In the following, a virtual prototype of PIM-HBM is implemented