## Processing-in-Memory
### Applicable Workloads
<hr/>

- Fully connected layers have a large weight matrix
- The weight matrix does not fit into the on-chip cache
- No data reuse in the weight matrix

<div class="flex justify-center">
<img src="/dnn.svg">
</div>

---
preload: false
clicks: 1
---

## Processing-in-Memory
### Applicable Workloads
<hr/>

- Convolutional layers have a small filter matrix
- The filter matrix fits into the on-chip cache
- High data reuse of the filter matrix

<br>

<Transform :scale="1.4">
<div class="absolute left-175px top-1px">
<img src="/cnn_input.svg">
</div>
<div v-if="$slidev.nav.clicks === 0" class="absolute left-175px">
<img src="/cnn_filter.svg">
</div>
<div v-if="$slidev.nav.clicks === 1"
v-motion
:initial="{ x: 175, y: 0}"
:enter="{ x: 335, y: 0, transition: { duration: 5000 }}">
<img src="/cnn_filter.svg">
</div>
</Transform>

---

## Processing-in-Memory
### Applicable Workloads
<hr/>

<br>
<br>
<br>

<div class="grid grid-cols-2 gap-4">
<div>

### Suitable candidates for PIM:

<br>

- Fully connected layers in multilayer perceptrons (MLPs)
- Layers in recurrent neural networks (RNNs)

</div>
<div>

### Less suitable candidates for PIM:

<br>

- Convolutional neural networks (CNNs)

</div>
</div>

<!--
To summarize...
-->

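---

## Processing-in-Memory
### Applicable Workloads | Data Reuse
<hr/>

The split into suitable and less suitable candidates can be made concrete by counting how often each loaded weight is used. The following is a minimal sketch, not taken from the cited papers: the helper names `fc_layer` and `conv_layer` and the layer sizes are illustrative assumptions, as are 16-bit weights, a batch size of 1, and a stride of 1.

```python
BYTES_PER_WEIGHT = 2  # assumption: 16-bit weights


def fc_layer(n_in, n_out):
    """Return (weight bytes, MACs per weight) for a fully connected layer, batch size 1."""
    weights = n_in * n_out
    macs = n_in * n_out  # every weight takes part in exactly one MAC
    return weights * BYTES_PER_WEIGHT, macs / weights


def conv_layer(h_out, w_out, k, c_in, c_out):
    """Return (weight bytes, MACs per weight) for a k x k convolution, stride 1."""
    weights = k * k * c_in * c_out
    macs = h_out * w_out * weights  # the filter is applied at every output position
    return weights * BYTES_PER_WEIGHT, macs / weights


# Hypothetical layer sizes, chosen only for illustration.
layers = {
    "FC 4096x4096": fc_layer(4096, 4096),
    "Conv 3x3, 64->64, 56x56 output": conv_layer(56, 56, 3, 64, 64),
}
for name, (size, reuse) in layers.items():
    print(f"{name}: {size / 1024:.0f} KiB of weights, {reuse:.0f} MAC per weight")
```

With these assumed sizes, the fully connected layer streams 32 MiB of weights and uses each one once, while the convolution keeps 72 KiB of weights and reuses each one at every output position.
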
---

## Processing-in-Memory
### Architectures
<hr/>

<br>
<br>

<div class="grid grid-cols-2 gap-4">
<div>

<v-clicks>

- Inside the memory subarray
- Near the subarray in the PSA output region
- Near the bank in its peripheral region
- In the I/O region of the memory

</v-clicks>

</div>
<div>

<img v-click="[0,1]" class="absolute right-80px top-150px" src="/pim_positions_0.svg">
<img v-click="[1,2]" class="absolute right-80px top-150px" src="/pim_positions_1.svg">
<img v-click="[2,3]" class="absolute right-80px top-150px" src="/pim_positions_2.svg">
<img v-click="[3,4]" class="absolute right-80px top-150px" src="/pim_positions_3.svg">
<img v-click="4" class="absolute right-80px top-150px" src="/pim_positions_4.svg">

</div>
</div>

<br>
<br>
<br>
<br>

<div v-click class="text-xl"> The closer the computation is to the memory cells, the higher the achievable bandwidth! </div>

<Footnotes separator>
<Footnote>
Sudarshan et al. "A Critical Assessment of DRAM-PIM Architectures - Trends, Challenges and Solutions", 2022.
</Footnote>
</Footnotes>

<!--
- Architecture space of PIM:
  - Inside the memory SA
    - simple bulk logic

  - Near SA in PSA output region
    - logic gates in the region

  - Near a bank in its peripheral region
    - computation units with control

  - I/O region of memory
    - limited by memory bus
-->

---

## Processing-in-Memory
### Samsung's PIM-HBM
<hr/>


<br>

- Real-world PIM implementation based on HBM2
- PIM units embedded at the bank level

<br>


<div class="flex justify-center items-center">
<img src="/hbm-pim.svg">
</div>

<Footnotes separator>
<Footnote>
Lee et al. "Hardware Architecture and Software Stack for PIM Based on Commercial DRAM Technology: Industrial Product", 2021.
</Footnote>
</Footnotes>

<!--
- One PIM unit shared by two banks
- SIMD FPUs are 16-wide
- All-Bank mode: All PIM units operate in parallel
-->

---

## Processing-in-Memory
### Samsung's PIM-HBM | Processing Unit
<hr/>

<br>

- Two 16-wide 16-bit FPUs
- Register files and control unit

<br>

#### Instructions:
- Control: NOP, JUMP, EXIT
- Data: MOV (ReLU), FILL
- Arithmetic: ADD, MUL, MAC, MAD

<img class="absolute right-80px top-180px" src="/pu.svg">

<Footnotes separator>
<Footnote>
Lee et al. "Hardware Architecture and Software Stack for PIM Based on Commercial DRAM Technology: Industrial Product", 2021.
</Footnote>
</Footnotes>

<!--
- Two SIMD FPUs
  - ADD
  - MUL

- CRF: 32 instructions, stores the program
- GRF: 16 entries, each holding one memory fetch
- SRF: 16 entries

- Control unit executes one instruction when a RD or WR command is issued
-->

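---

## Processing-in-Memory
### Samsung's PIM-HBM | Processing Unit (Toy Model)
<hr/>

The key idea of the processing unit is that it has no clock-driven fetch loop of its own: according to the notes for the previous slide, the control unit executes one instruction each time a RD or WR command is issued. The sketch below illustrates only this trigger mechanism. The class `PimUnit`, the tuple encoding of instructions, and the operand semantics of `FILL` and `MAC` are hypothetical simplifications, not the real instruction format, and only a subset of the instructions is modelled.

```python
from dataclasses import dataclass, field

CRF_SIZE = 32  # instruction slots
GRF_SIZE = 16  # vector register entries
LANES = 16     # SIMD width of one FPU


@dataclass
class PimUnit:
    program: list  # contents of the CRF, e.g. [("FILL", 0), ("EXIT",)]
    grf: list = field(
        default_factory=lambda: [[0.0] * LANES for _ in range(GRF_SIZE)]
    )
    pc: int = 0
    done: bool = False

    def __post_init__(self):
        if len(self.program) > CRF_SIZE:
            raise ValueError("program does not fit into the CRF")

    def on_memory_command(self, bank_data):
        """Execute exactly one instruction per RD/WR command."""
        if self.done or self.pc >= len(self.program):
            self.done = True
            return
        op, *args = self.program[self.pc]
        self.pc += 1
        if op == "FILL":   # assumed semantics: grf[dst] = bank_data
            (dst,) = args
            self.grf[dst] = list(bank_data)
        elif op == "MAC":  # assumed semantics: grf[dst] += bank_data * grf[src]
            dst, src = args
            self.grf[dst] = [
                d + b * s
                for d, b, s in zip(self.grf[dst], bank_data, self.grf[src])
            ]
        elif op == "EXIT":
            self.done = True
        # "NOP" and anything unmodelled: no effect


unit = PimUnit(program=[("FILL", 0), ("MAC", 1, 0), ("MAC", 1, 0), ("EXIT",)])
for data in ([2.0] * LANES, [3.0] * LANES, [1.0] * LANES, [0.0] * LANES):
    unit.on_memory_command(data)  # each host memory access advances the program
```

In this model the host drives the program purely by issuing memory accesses, which is why no separate instruction-issue path to the memory is needed.
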
---

## Processing-in-Memory
### Samsung's PIM-HBM | GEMV Operation
<hr/>

<img v-click="[0,1]" class="absolute right-125px top-150px" src="/gemv_normal.svg">
<img v-click="1" class="absolute right-10px top-150px" src="/gemv_interleaved.svg">

---

## Processing-in-Memory
### Samsung's PIM-HBM | GEMV Operation
<hr/>

<img v-click="[0,1]" class="absolute right-250px top-150px" src="/gemv.svg">
<img v-click="[1,2]" class="absolute right-250px top-150px" src="/gemv_0.svg">
<img v-click="[2,3]" class="absolute right-250px top-150px" src="/gemv_1.svg">
<img v-click="[3,4]" class="absolute right-250px top-150px" src="/gemv_2.svg">
<img v-click="[4,5]" class="absolute right-250px top-150px" src="/gemv_3.svg">
<img v-click="5" class="absolute right-250px top-150px" src="/gemv_4.svg">

<Footnotes separator>
<Footnote>
Lee et al. "Hardware Architecture and Software Stack for PIM Based on Commercial DRAM Technology: Industrial Product", 2021.
</Footnote>
</Footnotes>

<!--
- Procedure of GEMV operation
  - multiple cycles
  - each PIM unit operates on one matrix row
  - partial sum, reduced by host
-->

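---

## Processing-in-Memory
### Samsung's PIM-HBM | GEMV Operation (Sketch)
<hr/>

The procedure in the notes for the previous slide (one matrix row per PIM unit, partial sums reduced by the host) can be sketched in plain Python. This is a functional model under assumptions, not the actual data layout or microkernel: the function names are hypothetical, and it is assumed that each unit keeps one partial sum per SIMD lane and that the host adds up the 16 lanes afterwards.

```python
LANES = 16  # SIMD width of one FPU


def pim_unit_row(row, vector):
    """Lane-wise multiply-accumulate over one matrix row; returns 16 partial sums."""
    acc = [0.0] * LANES
    for start in range(0, len(row), LANES):  # one 16-element chunk per step
        for lane in range(LANES):
            idx = start + lane
            if idx < len(row):
                acc[lane] += row[idx] * vector[idx]
    return acc


def pim_gemv(matrix, vector):
    """Compute matrix @ vector with one (modelled) PIM unit per row."""
    partials = [pim_unit_row(row, vector) for row in matrix]  # in memory
    return [sum(lanes) for lanes in partials]                 # reduction on the host


# Small example: 2 rows of 32 elements each.
matrix = [[1.0] * 32, [float(i) for i in range(32)]]
vector = [2.0] * 32
result = pim_gemv(matrix, vector)
reference = [sum(a * x for a, x in zip(row, vector)) for row in matrix]
print(result == reference)
```

Splitting the work this way keeps the bandwidth-heavy multiply-accumulate next to the banks and leaves only a short reduction per row to the host.
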
---

## Processing-in-Memory
### Research
<hr/>

<br>
<br>
<br>
<br>

- To analyze the performance gains of PIM, simulations are needed
- Research should focus not only on hardware but also on programmability

<br>

- In the following, a virtual prototype of PIM-HBM is implemented