derek/gem5

Author	SHA1	Message	Date
Matthew Poremba	3b35e73eb8	dev-amdgpu: Implement SDMA constant fill This SDMA packet is much more common starting around ROCm 5.4. Previously this was mostly used to clear page tables after an application ended and was therefore left unimplemented. It is now used for basic operation like device memsets. This patch implements constant fill as it is now necessary. Change-Id: I9b2cf076ec17f5ed07c20bb820e7db0c082bbfbc	2023-07-30 13:17:05 -05:00
Matthew Poremba	8b91ac6f8d	dev-amdgpu: Refactor MMIO interface for SDMA engines Currently the amdgpu simulated device is assumed to be a Vega10. As a result there are a few things that are hardcoded. One of those is the number of SDMAs. In order to add a newer device, such as MI100+, we need to enable a flexible number of SDMAs. In order to support a variable number of SDMAs and with the MMIO offsets of each device being potentially different, the MMIO interface for SDMAs is changed to use an SDMA class method dispatch table which forwards a 32-bit value from the MMIO packet to the MMIO functions in SDMA of the format `void method(uint32_t)`. Several changes are made to enable this: - Allow the SDMA to have a variable MMIO base and size. These are configured in python. - An SDMA class method dispatch table which contains the MMIO offset relative to the SDMA's MMIO base address. - An updated writeMMIO method to iterate over the SDMA MMIO address ranges and call the appropriate SDMA MMIO method which matches the MMIO offset. - Moved all SDMA related MMIO data bit twiddling, masking, etc. into the MMIO methods themselves instead of in the writeMMIO method in SDMAEngine. Change-Id: Ifce626f84d52f9e27e4438ba4e685e30dbf06dbc Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/70040 Maintainer: Matt Sinclair <mattdsinclair@gmail.com> Tested-by: kokoro <noreply+kokoro@google.com> Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com>	2023-04-28 00:48:35 +00:00
Matthew Poremba	39b5b5e511	dev-amdgpu: Fix address in POLL_REGMEM SDMA packet The address for the POLL_REGMEM packet should not be shifted when the mode is 1 (memory). Relevant driver code below is not shifting the address. The shift is causing a page fault due to the incorrect address. This changeset removes the shift so the correct address is translated. https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/blob/roc-4.3.x/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c#L903 Change-Id: I7a0ec3245ca14376670df24c5d3773958c08d751 Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/67877 Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com> Maintainer: Matt Sinclair <mattdsinclair@gmail.com> Tested-by: kokoro <noreply+kokoro@google.com>	2023-02-14 15:36:56 +00:00
Matthew Poremba	eee42275ee	dev-amdgpu: Writeback RLC queue MQD when unmapped Currently when RLC queues (user mode queues) are mapped, the read/write pointers of the ring buffer are set to zero. However, these queues could be unmapped and then remapped later. In that situation the read/write pointers should be the previous value before unmapping occurred. Since the read pointer gets reset to zero, the queue begins reading from the start of the ring, which usually contains older packets. There is a 99% chance those packets contain addresses which are no longer in the page tables which will cause a page fault. To fix this we update the MQD with the current read/write pointer values and then writeback the MQD to memory when the queue is unmapped. This requires adding a pointer to the MQD and the host address of the MQD where it should be written back to. The interface for registering RLC queue is also simplified. Since we need to pass the MQD anyway, we can get values from it as well. Fixes b+tree and streamcluster from rodinia (when using RLC queues). Change-Id: Ie5dad4d7d90ea240c3e9f0cddf3e844a3cd34c4f Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/65791 Tested-by: kokoro <noreply+kokoro@google.com> Maintainer: Matt Sinclair <mattdsinclair@gmail.com> Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com>	2022-12-01 21:04:05 +00:00
Matthew Poremba	33a36d35de	dev-amdgpu: Store SDMA queue type, use for ring ID Currently the SDMA queue type is guessed in the trap method by looking at which queue in the engine is processing packets. It is possible for both queues to be processing (e.g., one queue sent a DMA and is waiting then switch to another queue), triggering an assert. Instead store the queue type in the queue itself and use that type in trap to determine which ring ID to use for the interrupt packet. Change-Id: If91c458e60a03f2013c0dc42bab0b1673e3dbd84 Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/65691 Maintainer: Jason Lowe-Power <power.jg@gmail.com> Reviewed-by: Jason Lowe-Power <power.jg@gmail.com> Tested-by: kokoro <noreply+kokoro@google.com>	2022-11-18 15:30:37 +00:00
Matthew Poremba	c8d687b05c	dev-amdgpu: Fix SDMA ring buffer wrap around The current SDMA wrap around handling only considers the ring buffer location as seen by the GPU. Eventually when the end of the SDMA ring buffer is reached, the driver waits until the rptr written back to the host catches up to what the driver sees before wrapping around back to the beginning of the buffer. This writeback currently does not happen at all, causing hangs for applications with a lot of SDMA commands. This changeset first fixes the sizes of the queues, especially RLC queues, so that the wrap around occurs in the correct place. Second, we now store the rptr writeback address and the absolute (unwrapped) rptr value in each SDMA queue. The absolute rptr is what the driver sends to the device and what it expects to be written back. This was tested with an application which basically does a few hundred thousand hipMemcpy() calls in a loop. It should also fix the issue with pannotia BC in fullsystem mode. Change-Id: I53ebdcc6b02fb4eb4da435c9a509544066a97069 Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/65351 Maintainer: Jason Lowe-Power <power.jg@gmail.com> Tested-by: kokoro <noreply+kokoro@google.com> Reviewed-by: Jason Lowe-Power <power.jg@gmail.com> Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com> Maintainer: Matt Sinclair <mattdsinclair@gmail.com>	2022-11-09 04:11:35 +00:00
Matthew Poremba	752b696883	dev-amdgpu: Fix SDMA trap ring ID, context SDMA traps are used in the driver as a DMA fence. To pass a fence, the SDMA sends the driver the interrupt context from a trap packet and the ring ID which specifies which queue in the SDMA engine is passing a fence. Currently the interrupt context is using the wrong value in the packet and the ring ID is hard-coded to always be the gfx queue. This changeset uses the correct interrupt context from the SDMA packet and sets the ring ID to either 0 if the gfx queue is currently being processed or 3 if the page queue is being processed. The relevant interrupt service routine in the driver can be found at: https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/blob/roc-4.3.x/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c#L2129 Change-Id: Ie4a4a9d6ab1d3bf83bf76bb57a02a91100217b51 Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/65093 Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com> Maintainer: Matt Sinclair <mattdsinclair@gmail.com> Tested-by: kokoro <noreply+kokoro@google.com>	2022-11-01 15:34:08 +00:00
Matthew Poremba	7b16b17e61	dev-amdgpu: Chunkify SDMA copies that use device memory The current implementation of SDMA copy calls the GPU memory manager's read/write method one time passing a physical address as the source/destination. This implicitly assumes the physical addresses are contiguous which is generally not true for large allocations. This results in reading from/writing to the wrong address. This changeset fixes the problem by copying large copies in chunks of the minimum possible page size on the GPU (4kB). Each page is translated separately to ensure the correct physical address. The final copy "done" callback is only used for the last transfer. The transfers should complete in order so the copy command will not complete until all chunks have been copied. Tested and verified on an application with a large allocation (~5GB). Change-Id: I27018a963da7133f5e49dec13b0475c3637c8765 Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/64752 Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com> Maintainer: Matt Sinclair <mattdsinclair@gmail.com> Tested-by: kokoro <noreply+kokoro@google.com>	2022-10-31 14:30:24 +00:00
Matthew Poremba	a648be2338	dev-amdgpu: Add an SDMA data debug flag This debug flag is used to print spammy SDMA DPRINTFs, such as an SDMA copy printing the data of large transfers 8 bytes per line at a time. For those prints, the SDMAEngine flag will now only print the first and last qword of the transfer and the new SDMAData flag is needed for verbose data printing. This makes the SDMAEngine flag still useful for verifying copies in applications with predictable data such as square. Additionally, the memory allocation/deallocation done solely for a print statement is removed in favor of casting the data to the printed type. Change-Id: I18c1918ef9085cca4570f79881ee63d510ccc32f Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/64452 Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com> Tested-by: kokoro <noreply+kokoro@google.com> Maintainer: Matt Sinclair <mattdsinclair@gmail.com>	2022-10-13 20:17:00 +00:00
Matthew Poremba	6c935657fd	dev-amdgpu: Implement SDMA atomic packet SDMA atomic packets are used in conjunction with RLC queues in SDMA for synchronization similar to how HSA signals are used with BLIT kernels when SDMA is disabled. Implement a skeleton of the SDMA atomic packet methods as well as the atomic add64 operation. The atomic add operation appears to be the only operation used in ROCm, so this implementation is fairly complete. See: https://github.com/RadeonOpenCompute/ROCR-Runtime/blob/rocm-4.2.x/src/core/runtime/amd_blit_sdma.cpp#L880 Change-Id: I62cc337f2ffe590bdb947b48053760ee8b3a6f32 Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/63174 Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com> Maintainer: Matt Sinclair <mattdsinclair@gmail.com> Tested-by: kokoro <noreply+kokoro@google.com>	2022-09-09 04:13:49 +00:00
Matthew Poremba	9ea28bd782	dev-amdgpu: Implement SDMA RLC queue unmapping The unmap queues packet specifies all non-static queues should be unmapped which includes RLC queues in the SDMA. This functionality did not exist before and is added in this changeset. Fixes bug with rodinia_3.0/hip/bfs. Change-Id: I80ca8cf8d89559625b5870745889b0a27916635e Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/63173 Tested-by: kokoro <noreply+kokoro@google.com> Maintainer: Matt Sinclair <mattdsinclair@gmail.com> Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com>	2022-09-09 04:13:49 +00:00
Matthew Poremba	af4251f6ae	dev-amdgpu: Rework SDMA RLC queue data structure There can only ever be two RLC queues maximum. Use this information for a simpler data structure to store doorbell information. The patch changes the std::unordered_map previously used to std::array. This will also be useful in avoiding erase-while-iterating issues needed to unregister all queues at once. Change-Id: I95600e40de51cb1a992a20bcebaf7580ea4d0be8 Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/63172 Maintainer: Matt Sinclair <mattdsinclair@gmail.com> Tested-by: kokoro <noreply+kokoro@google.com> Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com>	2022-09-09 04:13:49 +00:00
Matthew Poremba	a5dfb0718d	dev-amdgpu: Add user-mode TranslationGen to SDMA RLC queues do translation using user mode addresses. To support this, add the final aperture translation needed to the SDMA engine. Change-Id: I25841e240e3b44f66d26d503ab52b54379daa49a Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/63032 Tested-by: kokoro <noreply+kokoro@google.com> Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com> Maintainer: Matt Sinclair <mattdsinclair@gmail.com>	2022-09-09 04:13:49 +00:00
Matthew Poremba	e0e2806fc4	dev-amdgpu: Add SDMA device translation helper Adding a helper function to remove duplicate code in the copy packet methods. Adds more comments on that code to explain what it is doing. This could in theory also be used in other packets in the future. Change-Id: Id0ed50c87260a2f12f53cb14e927f8c49bb99072 Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/62718 Reviewed-by: Jason Lowe-Power <power.jg@gmail.com> Tested-by: kokoro <noreply+kokoro@google.com> Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com> Maintainer: Matt Sinclair <mattdsinclair@gmail.com> Maintainer: Jason Lowe-Power <power.jg@gmail.com>	2022-09-09 04:13:49 +00:00
Matthew Poremba	3465ff1e7d	dev-amdgpu: Add callbacks for all SDMA GPUMemMgr reqs SDMA write, copy, and ptePde use GPUMemMgr to write to device memory and were dangerously not waiting for write completion which could result in data not being completely written to memory, the data buffer being freed and potentially reused in the simulator, or advancing to the next SDMA packet before the previous one is complete. This changeset adds callbacks for the corresponding "done" methods similar to what the dmaVirt methods call when reading or writing to host memory to fix this issue. Change-Id: I44ce14c13f812ea2a7a76438e12a6ed7c6e0bff0 Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/62715 Maintainer: Matt Sinclair <mattdsinclair@gmail.com> Maintainer: Jason Lowe-Power <power.jg@gmail.com> Reviewed-by: Jason Lowe-Power <power.jg@gmail.com> Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com> Tested-by: kokoro <noreply+kokoro@google.com>	2022-09-03 16:05:58 +00:00
Matthew Poremba	432329c853	dev-amdgpu: Allow device address source for SDMA COPY Now that the memory manager can DMA read from device memory, allow the linear copy SDMA packet to use device memory as a source. This is used when copying memory from device to host when SDMA engines are enabled. This improves simulation performance over using (simulated) BLIT kernels with SDMA engines disabled. Change-Id: I1f41b294022f0049d154a401c1dc885abb4f223b Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/62713 Maintainer: Jason Lowe-Power <power.jg@gmail.com> Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com> Maintainer: Matt Sinclair <mattdsinclair@gmail.com> Reviewed-by: Jason Lowe-Power <power.jg@gmail.com> Tested-by: kokoro <noreply+kokoro@google.com>	2022-09-03 16:05:58 +00:00
Matthew Poremba	1be246bbe3	dev-amdgpu: Add PM4PP, VMID, Linux definitions The PM4 packet processor is handling all non-HSA GPU packets such as packets for (un)mapping HSA queues. This commit pulls many Linux structs and defines out into their own files for clarity. Finally, it implements the VMID related functions in AMDGPU device. Change-Id: I5f0057209305404df58aff2c4cd07762d1a31690 Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/53068 Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com> Maintainer: Matt Sinclair <mattdsinclair@gmail.com> Tested-by: kokoro <noreply+kokoro@google.com>	2022-03-24 14:59:57 +00:00
Alexandru Dutu	f1772d3505	dev-amdgpu: Add SDMAEngine and GPU device methods SDMAEngine handles copies to device memory. This commit updates sdma_packets.hh style as well. Added several methods needed by SDMAEngine to GPU device including GART table, various getters, and aperture range checkers. Move the MMIO interface from GPUController to SDMAEngine. Create an SDMA MMIO and commands header with only the macros we use so that we don't need to check in multi-thousand line header files from the linux kernel. Keep SOC15 IH client ID macros as that file is small. Change-Id: I986fede90cc1bc16ee56d4e8598cf9283bde034e Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/53064 Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com> Maintainer: Matt Sinclair <mattdsinclair@gmail.com> Tested-by: kokoro <noreply+kokoro@google.com>	2022-03-24 14:59:57 +00:00
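
Commit 8b91ac6f8d describes a per-engine dispatch table that maps an MMIO offset (relative to the engine's MMIO base) to a class method of the form `void method(uint32_t)`. The following is a minimal sketch of that idea, not the gem5 implementation: the class name `SdmaSketch`, the handler names, the register offsets 0x00 and 0x04, and the use of std::unordered_map are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Hypothetical stand-in for an SDMA engine. Class, method, and offset
// names are illustrative and are not taken from the gem5 sources.
class SdmaSketch
{
  public:
    // Every MMIO handler has the shape `void method(uint32_t)`.
    using MmioFn = void (SdmaSketch::*)(uint32_t);

    SdmaSketch(uint64_t base, uint64_t size)
        : mmioBase(base), mmioSize(size)
    {
        assert(size > 0);
        // Offsets are relative to this engine's MMIO base (made-up values).
        table[0x00] = &SdmaSketch::setRbBaseLo;
        table[0x04] = &SdmaSketch::setRbBaseHi;
    }

    // Returns true if the address falls in this engine's range and a
    // handler is registered for the resulting offset.
    bool writeMMIO(uint64_t addr, uint32_t data)
    {
        if (addr < mmioBase || addr - mmioBase >= mmioSize)
            return false;
        auto it = table.find(addr - mmioBase);
        if (it == table.end())
            return false;
        (this->*(it->second))(data);
        return true;
    }

    uint64_t rbBase() const { return rbBaseVal; }

  private:
    // Masking and shifting live in the handlers, not in writeMMIO.
    void setRbBaseLo(uint32_t d)
    {
        rbBaseVal = (rbBaseVal & 0xffffffff00000000ULL) | d;
    }

    void setRbBaseHi(uint32_t d)
    {
        rbBaseVal = (rbBaseVal & 0x00000000ffffffffULL) |
                    (static_cast<uint64_t>(d) << 32);
    }

    uint64_t mmioBase;
    uint64_t mmioSize;
    uint64_t rbBaseVal = 0;
    std::unordered_map<uint64_t, MmioFn> table;
};
```

With a layout like this, a device with several engines could construct each `SdmaSketch` with its own base and size and offer every MMIO write to each engine in turn until one returns true, which is roughly the iteration the commit message describes for writeMMIO.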

18 Commits
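
Commit 7b16b17e61 describes splitting large SDMA copies into chunks of the minimum GPU page size (4kB), translating each page separately and attaching the "done" callback only to the last transfer. The following is a rough sketch of the splitting step only. The `Chunk` struct and `splitCopy` function are hypothetical names, and cutting chunks at page boundaries (so that no chunk straddles two pages) is an assumption about how the split could be done, not a description of the gem5 code.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Smallest GPU page size named in the commit message (4kB).
constexpr uint64_t kPageSize = 4096;

// One piece of a larger copy. `last` marks the chunk that would carry
// the "done" callback. This struct is hypothetical.
struct Chunk
{
    uint64_t vaddr;
    uint64_t size;
    bool last;
};

// Split [vaddr, vaddr + size) so that each chunk stays within one page
// and can therefore be translated to a physical address on its own.
std::vector<Chunk> splitCopy(uint64_t vaddr, uint64_t size)
{
    std::vector<Chunk> chunks;
    while (size > 0) {
        // Bytes remaining until the end of the current page.
        uint64_t toPageEnd = kPageSize - (vaddr % kPageSize);
        uint64_t len = size < toPageEnd ? size : toPageEnd;
        chunks.push_back({vaddr, len, size == len});
        vaddr += len;
        size -= len;
    }
    return chunks;
}
```

A caller would translate `vaddr` of each chunk independently and issue the transfers in order, passing the completion callback only for the chunk whose `last` field is true.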