mem,gpu-compute: Implement GPU TCC directed invalidate
The GPU device currently supports large BAR, which means the driver can write directly to GPU memory over the PCI bus without using SDMA or PM4 packets. The gem5 PCI interface only provides an atomic interface for BAR reads/writes, so the values cannot go through the timing-mode Ruby caches. This causes bugs because the TCC cache is allowed to keep clean data between kernels for performance reasons: if a BAR write goes directly to memory, bypassing the cache, the value in the cache is stale and must be invalidated.

In this commit a TCC invalidate is generated for all writes over PCI that go directly to GPU memory. This also invalidates the TCP along the way if necessary. It currently relies on the driver synchronization, which only allows BAR writes in between kernels; therefore, the cache line should only be in the I or V state. To handle a race condition between invalidates and launching the next kernel, the invalidates return a response, and the GPU command processor waits for all TCC invalidates to complete before launching the next kernel.

This fixes issues with stale data in nanoGPT and possibly PENNANT.

Change-Id: I8e1290f842122682c271e5508a48037055bfbcdf
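The wait-before-launch handshake described above can be sketched in isolation. This is a minimal illustrative model, not gem5 code: the names (InvL2Tracker, DeferredDispatch, tryDispatch, and so on) are hypothetical stand-ins for the Shader/GPUCommandProcessor pair, assuming a simple counter of outstanding L2 invalidates and a queue of deferred kernel dispatches.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <utility>

// Hypothetical stand-in for a deferred kernel dispatch request.
struct DeferredDispatch {
    void *raw_pkt;
    uint32_t queue_id;
    uint64_t host_pkt_addr;
};

// Sketch of the invalidate/dispatch handshake: dispatches are deferred
// while TCC (L2) invalidates are in flight, and relaunched once the
// last invalidate response arrives.
class InvL2Tracker {
  public:
    // A TCC invalidate was issued for a direct BAR write; count it.
    void invalidateIssued() { ++outstandingInvL2s; }

    // An invalidate response came back; when the count reaches zero,
    // relaunch every dispatch that was deferred in the meantime.
    void invalidateDone() {
        assert(outstandingInvL2s > 0);
        if (--outstandingInvL2s == 0) {
            while (!deferred.empty()) {
                DeferredDispatch d = deferred.front();
                deferred.pop_front();
                launch(d);
            }
        }
    }

    // Called at dispatch time: defer if invalidates are outstanding,
    // otherwise launch immediately. Returns true if launched now.
    bool tryDispatch(const DeferredDispatch &d) {
        if (outstandingInvL2s > 0) {
            deferred.push_back(d);
            return false;
        }
        launch(d);
        return true;
    }

    int launched = 0;  // exposed for the example only

  private:
    void launch(const DeferredDispatch &) { ++launched; }

    int outstandingInvL2s = 0;
    std::deque<DeferredDispatch> deferred;
};
```

The key property the real change relies on is the same: a dispatch that arrives while any invalidate is outstanding is not dropped, only delayed until the final response.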
@@ -41,6 +41,7 @@
 #include "debug/GPUKernelInfo.hh"
 #include "dev/amdgpu/amdgpu_device.hh"
 #include "gpu-compute/dispatcher.hh"
 #include "gpu-compute/shader.hh"
 #include "mem/abstract_mem.hh"
 #include "mem/packet_access.hh"
 #include "mem/se_translating_port_proxy.hh"
@@ -126,6 +127,21 @@ GPUCommandProcessor::submitDispatchPkt(void *raw_pkt, uint32_t queue_id,
     unsigned akc_alignment_granularity = 64;
     assert(!(disp_pkt->kernel_object & (akc_alignment_granularity - 1)));

    /**
     * Make sure there is not a race condition with invalidates in the L2
     * cache. The full system driver may write directly to memory using
     * large BAR while the L2 cache is allowed to keep data in the valid
     * state between kernel launches. This is a rare event but is required
     * for correctness.
     */
    if (shader()->getNumOutstandingInvL2s() > 0) {
        DPRINTF(GPUCommandProc,
                "Deferring kernel launch due to outstanding L2 invalidates\n");
        shader()->addDeferredDispatch(raw_pkt, queue_id, host_pkt_addr);

        return;
    }

    /**
     * Need to use a raw pointer for DmaVirtDevice API. This is deleted
     * in the dispatchKernelObject method.
     */
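The other half of the change, not shown in this hunk, is the device side: a BAR write that lands directly in GPU memory must first invalidate any clean copy the TCC may still hold from a previous kernel. The sketch below is a hypothetical model of that ordering, not the gem5 AMDGPUDevice API; BarWriteHandler and issueTccInvalidate are illustrative names.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <utility>

// Sketch: an atomic-mode BAR write bypasses the timing Ruby caches, so
// the (possibly stale) TCC line must be invalidated before the new
// value is committed to GPU memory.
class BarWriteHandler {
  public:
    explicit BarWriteHandler(std::function<void(uint64_t)> issueTccInvalidate)
        : issueTccInvalidate(std::move(issueTccInvalidate)) {}

    // Direct BAR write to GPU memory: invalidate first, then write.
    // The invalidate also flows through the TCP if necessary.
    void barWrite(uint64_t addr, uint8_t value) {
        issueTccInvalidate(addr);  // stale clean line must not survive
        memory[addr] = value;      // write lands directly in GPU memory
    }

    uint8_t read(uint64_t addr) const { return memory.at(addr); }

  private:
    std::function<void(uint64_t)> issueTccInvalidate;
    std::map<uint64_t, uint8_t> memory;  // stand-in for GPU memory
};
```

Because the driver only performs BAR writes between kernels, the invalidate here never races with an in-flight kernel's accesses; the only race left is with the next launch, which the command-processor deferral above closes.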