During GPUFS checkpoint restore, doorbells callbacks are created based
on certain MQD attributes. These callbacks are required to create new
SDMA doorbells. If these attributes are not present in the checkpoint,
the restore hangs indefinitely waiting for ioctl calls that access these
doorbells to finish execution. This commit adds the attributes required
for checkpoint restore to proceed.
Change-Id: Id3d1b7a2627d4c50133d923096495957a233f675
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/70077
Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com>
Reviewed-by: Matthew Poremba <matthew.poremba@amd.com>
Maintainer: Matt Sinclair <mattdsinclair@gmail.com>
Tested-by: kokoro <noreply+kokoro@google.com>
Maintainer: Matthew Poremba <matthew.poremba@amd.com>
The GPUFS checkpoint restoration mechanism expects to find a
PM4MapQueues packet in the checkpoint. Since this was not being
checkpointed, the restore phase retrieved a null packet which led to a
segmentation fault. This commit adds PM4MapQueues to the checkpoint and
restores it when deserializing the checkpoint
Change-Id: Ib74a9f36fe89d740a74f94314ada41ecc363abe9
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/69298
Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com>
Maintainer: Matt Sinclair <mattdsinclair@gmail.com>
Tested-by: kokoro <noreply+kokoro@google.com>
Reviewed-by: Matthew Poremba <matthew.poremba@amd.com>
Currently when RLC queues (user mode queues) are mapped, the read/write
pointers of the ring buffer are set to zero. However, these queues could
be unmapped and then remapped later. In that situation the read/write
pointers should be the previous value before unmapping occurred. Since
the read pointer gets reset to zero, the queue begins reading from the
start of the ring, which usually contains older packets. There is a 99%
chance those packets contain addresses which are no longer in the page
tables which will cause a page fault.
To fix this we update the MQD with the current read/write pointer values
and then writeback the MQD to memory when the queue is unmapped. This
requires adding a pointer to the MQD and the host address of the MQD
where it should be written back to. The interface for registering RLC
queue is also simplified. Since we need to pass the MQD anyway, we can
get values from it as well.
Fixes b+tree and streamcluster from rodinia (when using RLC queues).
Change-Id: Ie5dad4d7d90ea240c3e9f0cddf3e844a3cd34c4f
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/65791
Tested-by: kokoro <noreply+kokoro@google.com>
Maintainer: Matt Sinclair <mattdsinclair@gmail.com>
Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com>
The current SDMA wrap around handling only considers the ring buffer
location as seen by the GPU. Eventually when the end of the SDMA ring
buffer is reached, the driver waits until the rptr written back to the
host catches up to what the driver sees before wrapping around back to
the beginning of the buffer. This writeback currently does not happen at
all, causing hangs for applications with a lot of SDMA commands.
This changeset first fixes the sizes of the queues, especially RLC
queues, so that the wrap around occurs in the correct place. Second, we
now store the rptr writeback address and the absoluate (unwrapped) rptr
value in each SDMA queue. The absolulte rptr is what the driver sends to
the device and what it expects to be written back.
This was tested with an application which basically does a few hundred
thousand hipMemcpy() calls in a loop. It should also fix the issue with
pannotia BC in fullsystem mode.
Change-Id: I53ebdcc6b02fb4eb4da435c9a509544066a97069
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/65351
Maintainer: Jason Lowe-Power <power.jg@gmail.com>
Tested-by: kokoro <noreply+kokoro@google.com>
Reviewed-by: Jason Lowe-Power <power.jg@gmail.com>
Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com>
Maintainer: Matt Sinclair <mattdsinclair@gmail.com>
The PM4 release_mem packet is used as a DMA fence in the driver. It
specifies which queue the interrupt came from by encoding the me, pipe,
and queue fields from the map_queue packet into the interrupt ring ID.
Currently these fields are incorrect because (1) the order in the
bitfield is backwards, (2) the queue constructor assigns a pointer to
the PM4MapQueue packet containing this data to the dmaBuffer which gets
deleted in short order, and (3) the order of the encoding of ring ID is
incorrect.
This change fixes these issues by (1) placing the struct vales in
correct order, (2) creating a const copy of the dmaBuffer on
construction, and (3) using the ring ID encoding expected by the driver:
https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/blob/roc-4.3.x/
drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c#L5989
Change-Id: I72c382980e57573f8a8a6879912c4139c7e2f505
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/65095
Tested-by: kokoro <noreply+kokoro@google.com>
Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com>
Maintainer: Matt Sinclair <mattdsinclair@gmail.com>
The PM4 NOP header is used to insert spaces in the PM4 ring and can
therefore be any size. This includes zero. A size of zero is denoted by
a value of 0x3fff in the NOP packet header. Currently we assume this
means the remainder of the PM4 queue up to the wptr is empty/NOPs. This
is not always true.
This changeset reworks the PM4 NOP packet to handle the value of 0x3fff
as a special value and advances the rptr by 0 bytes. This fixes issues
where there were additional packets in the queue which were being
skipped over by fast forwarding. Since those packets could be anything,
that leads to undefined behavior afterwards.
Change-Id: I3f5c3f4b7dd50f93ba503fea97454a9d41771e30
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/65094
Maintainer: Matt Sinclair <mattdsinclair@gmail.com>
Tested-by: kokoro <noreply+kokoro@google.com>
Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com>
The LDS and scratch aperture base and limits are hardcoded to some
values that are useful for SE mode. In reality, these are chosen by the
driver so we need to honor whatever values the driver passes so that
when addresses are calculated they fall into the correct aperture to
route flat instructions to those apertures.
This overwrites the default hardcoded values for LDS and scratch base
and limit using the values providing by the driver in a MAP_PROCESS
packet.
Change-Id: I0e194a26631f697819d8aaecf1bf346a7b7c7026
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/61656
Maintainer: Matt Sinclair <mattdsinclair@gmail.com>
Tested-by: kokoro <noreply+kokoro@google.com>
Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com>
These instructions are supposed to be read/writing special shader
hardware registers. Currently they are getting/setting to an SGPR. This
results in getting incorrect registers at best and clobbering an SGPR
being used by an application at worst. Furthermore, some registers need
to be set in the shader and the application will never (can never) set
them.
This patch overhauls the getreg/setreg instructions to use different
storage in the shader. The values will be updated either via setreg from
an application (e.g., mode register) or set by a PM4 MAP_PROCESS.
Change-Id: Ie5e5d552bd04dc47f5b35b5ee40a569ae345abac
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/61655
Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com>
Tested-by: kokoro <noreply+kokoro@google.com>
Maintainer: Matt Sinclair <mattdsinclair@gmail.com>
The AQL queue size is currently hardcoded to 64kB. For longer running
applications this causes the circular queue to wrap before reaching the
real end of the queue. Add the computation for queue size instead.
Previously longer applications (e.g., bc in pannotia) were hanging
around 4k kernels. With change the application launches 10k+ kernels.
Change-Id: I6c31677c1799a3c9ce28cf4e7e79efcb987e3b7f
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/59449
Maintainer: Jason Lowe-Power <power.jg@gmail.com>
Reviewed-by: Jason Lowe-Power <power.jg@gmail.com>
Maintainer: Matt Sinclair <mattdsinclair@gmail.com>
Tested-by: kokoro <noreply+kokoro@google.com>
Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com>
Add logic to collect pointers to all GPU TLBs in full system. Implement
the invalid TLBs PM4 packet. The invalidate is done functionally since
there is really no benefit to simulate it with timing and there is no
support in the TLB to do so. This allow application with much larger
data sets which may reuse device memory pages to work in gem5 without
possibly crashing due to a stale translation being leftover in the TLB.
Change-Id: Ia30cce02154d482d8f75b2280409abb8f8375c24
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/58470
Reviewed-by: Jason Lowe-Power <power.jg@gmail.com>
Maintainer: Jason Lowe-Power <power.jg@gmail.com>
Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com>
Maintainer: Matt Sinclair <mattdsinclair@gmail.com>
Tested-by: kokoro <noreply+kokoro@google.com>
The PM4 packet processor is handling all non-HSA GPU packets such
as packets for (un)mapping HSA queues. This commit pulls many
Linux structs and defines out into their own files for clarity.
Finally, it implements the VMID related functions in AMDGPU device.
Change-Id: I5f0057209305404df58aff2c4cd07762d1a31690
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/53068
Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com>
Maintainer: Matt Sinclair <mattdsinclair@gmail.com>
Tested-by: kokoro <noreply+kokoro@google.com>