
## Processing-in-Memory
### Applicable Workloads
<hr/>

- Fully connected layers have a large weight matrix
  - Weight matrix does not fit into the on-chip cache
  - No data reuse in the matrix

<div class="flex justify-center">
<img src="/dnn.svg">
</div>

---
preload: false
clicks: 1
---

## Processing-in-Memory
### Applicable Workloads
<hr/>

- Convolutional layers have a small filter matrix
  - Filter matrix fits into the on-chip cache
  - Extensive data reuse in the matrix

<br>

<Transform :scale="1.4">
  <div class="absolute left-175px top-1px">
    <img src="/cnn_input.svg">
  </div>
  <div v-if="$slidev.nav.clicks === 0" class="absolute left-175px">
    <img src="/cnn_filter.svg">
  </div>
  <div v-if="$slidev.nav.clicks === 1"
  v-motion
  :initial="{ x: 175, y: 0}"
  :enter="{ x: 335, y: 0, transition: { duration: 5000 }}">
    <img src="/cnn_filter.svg">
  </div>
</Transform>

---

## Processing-in-Memory
### Applicable Workloads
<hr/>

<br>
<br>
<br>

<div class="grid grid-cols-2 gap-4">
<div>

### Suitable candidates for PIM:

<br>

 - Fully connected layers in multilayer perceptrons (MLPs)
 - Layers in recurrent neural networks (RNNs)

</div>
<div>

### Less suitable candidates for PIM:

<br>

   - Convolutional layers in convolutional neural networks (CNNs)

</div>
</div>
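
The reuse argument behind this split can be made concrete with a rough count. The sketch below is only an illustration under stated assumptions: the layer sizes are hypothetical, a single input (batch size 1) is assumed, and only weight reuse is counted, not activation reuse.

```python
def fc_weight_reuse(rows, cols):
    """MACs per weight for y = W x with a rows x cols weight matrix.

    Assumes one input vector (batch size 1), so each weight is used once.
    """
    macs = rows * cols
    weights = rows * cols
    return macs / weights


def conv_weight_reuse(out_h, out_w, k, c_in, c_out):
    """MACs per weight for a k x k convolution producing an out_h x out_w map.

    Every filter weight is applied once per output position.
    """
    macs = out_h * out_w * k * k * c_in * c_out
    weights = k * k * c_in * c_out
    return macs / weights


# Hypothetical layer sizes, chosen only for illustration.
print(fc_weight_reuse(4096, 4096))           # 1.0
print(conv_weight_reuse(56, 56, 3, 64, 64))  # 3136.0
```

With one use per fetched weight, the fully connected layer is limited by memory bandwidth, whereas the convolution can serve thousands of operations from a filter that stays in the cache.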

<!--
To summarize...
-->

---

## Processing-in-Memory
### Architectures
<hr/>

<br>
<br>

<div class="grid grid-cols-2 gap-4">
<div>

<v-clicks>

- Inside the memory subarray
- Near the subarray in the PSA output region
- Near the bank in its peripheral region
- In the I/O region of the memory

</v-clicks>

</div>
<div>

<img v-click="[0,1]" class="absolute right-80px top-150px" src="/pim_positions_0.svg">
<img v-click="[1,2]" class="absolute right-80px top-150px" src="/pim_positions_1.svg">
<img v-click="[2,3]" class="absolute right-80px top-150px" src="/pim_positions_2.svg">
<img v-click="[3,4]" class="absolute right-80px top-150px" src="/pim_positions_3.svg">
<img v-click="4" class="absolute right-80px top-150px" src="/pim_positions_4.svg">

</div>
</div>

<br>
<br>
<br>
<br>

<div v-click class="text-xl"> The nearer the computation is to the memory cells, the higher the achievable bandwidth! </div>

<Footnotes separator>
  <Footnote>
  Sudarshan et al. „A Critical Assessment of DRAM-PIM Architectures - Trends, Challenges and Solutions“, 2022.
</Footnote>
</Footnotes>

<!--
- Architecture space of PIM:
- Inside the memory SA
  - simple bulk logic

- Near SA in PSA output region
  - logic gates in the region

- Near a bank in its peripheral region
  - computation units with control

- I/O region of memory
  - limited by memory bus
-->

---

## Processing-in-Memory
### Samsung's PIM-HBM
<hr/>


<br>

- Real-world PIM implementation based on HBM2
- PIM units embedded at the bank level

<br>


<div class="flex justify-center items-center">
<img src="/hbm-pim.svg">
</div>

<Footnotes separator>
  <Footnote>
  Lee et al. „Hardware Architecture and Software Stack for PIM Based on Commercial DRAM Technology : Industrial Product“, 2021.
</Footnote>
</Footnotes>

<!--
- One PIM unit shared by two banks
- SIMD FPUs are 16-wide
- All-Bank mode: All PIM units operate in parallel
-->

---

## Processing-in-Memory
### Samsung's PIM-HBM | Processing Unit
<hr/>

<br>

- Two 16-wide 16-bit FPUs
- Register files and control unit

<br>

#### Instructions:
- Control: NOP, JUMP, EXIT
- Data: MOV (ReLU), FILL
- Arithmetic: ADD, MUL, MAC, MAD

<img class="absolute right-80px top-180px" src="/pu.svg">

<Footnotes separator>
  <Footnote>
  Lee et al. „Hardware Architecture and Software Stack for PIM Based on Commercial DRAM Technology : Industrial Product“, 2021.
</Footnote>
</Footnotes>

<!--
- Two SIMD FPUs
  - ADD
  - MUL

- CRF: 32 instructions, stores the program
- GRF: 16 entries, one memory fetch
- SRF: 16 entries

- Control unit executes one instruction when a RD or WR command is issued
-->

---

## Processing-in-Memory
### Samsung's PIM-HBM | GEMV Operation
<hr/>

<img v-click="[0,1]" class="absolute right-125px top-150px" src="/gemv_normal.svg">
<img v-click="1"     class="absolute right-10px top-150px" src="/gemv_interleaved.svg">

---

## Processing-in-Memory
### Samsung's PIM-HBM | GEMV Operation
<hr/>

<img v-click="[0,1]" class="absolute right-250px top-150px" src="/gemv.svg">
<img v-click="[1,2]" class="absolute right-250px top-150px" src="/gemv_0.svg">
<img v-click="[2,3]" class="absolute right-250px top-150px" src="/gemv_1.svg">
<img v-click="[3,4]" class="absolute right-250px top-150px" src="/gemv_2.svg">
<img v-click="[4,5]" class="absolute right-250px top-150px" src="/gemv_3.svg">
<img v-click="5"     class="absolute right-250px top-150px" src="/gemv_4.svg">

<Footnotes separator>
  <Footnote>
  Lee et al. „Hardware Architecture and Software Stack for PIM Based on Commercial DRAM Technology : Industrial Product“, 2021.
</Footnote>
</Footnotes>
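
The procedure can be sketched in a few lines of Python. This is a simplified functional model, not Samsung's implementation. It assumes that each PIM unit processes one matrix row in chunks of 16 elements, keeps 16 lane-wise partial sums, and leaves the final reduction to the host. The function names `pim_unit_mac`, `host_reduce`, and `gemv` are hypothetical, and the data layout and command sequence are not modeled.

```python
LANES = 16  # SIMD width of the FPUs


def pim_unit_mac(row, x):
    """Model one PIM unit: lane-wise multiply-accumulate over one matrix row."""
    acc = [0.0] * LANES
    for start in range(0, len(row), LANES):
        # One step handles one 16-element chunk of the row and the input vector.
        for lane in range(LANES):
            acc[lane] += row[start + lane] * x[start + lane]
    return acc  # 16 partial sums, not yet a scalar


def host_reduce(partial_sums):
    """Model the host: reduce the 16 partial sums to one output element."""
    return sum(partial_sums)


def gemv(matrix, x):
    """Compute y = A x, with each row handled by a (modeled) PIM unit."""
    assert len(x) % LANES == 0, "sketch assumes a multiple of 16 columns"
    assert all(len(row) == len(x) for row in matrix)
    return [host_reduce(pim_unit_mac(row, x)) for row in matrix]


# Small example with exactly representable values.
A = [[1.0] * 32, [2.0] * 32]
x = [1.0] * 32
print(gemv(A, x))  # [32.0, 64.0]
```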

<!--
- Procedure of GEMV operation
- multiple cycles
- each PIM unit operates on one matrix row
- partial sum, reduced by host
-->

---

## Processing-in-Memory
### Research
<hr/>

<br>
<br>
<br>
<br>

- To analyze the performance gains of PIM, simulations are needed
- Research should focus not only on hardware but also on programmability

<br>

- In the following, a virtual prototype of PIM-HBM is implemented