mem,gpu-compute: Implement GPU TCC directed invalidate (#1011)
The GPU device currently supports large BAR, which means the driver can write directly to GPU memory over the PCI bus without using SDMA or PM4 packets. The gem5 PCI interface only provides an atomic interface for BAR reads/writes, so these values cannot go through timing-mode Ruby caches. This causes bugs because the TCC cache is allowed to keep clean data between kernels for performance reasons: if a BAR write goes directly to memory and bypasses the cache, the value in the cache becomes stale and must be invalidated.

This commit generates a TCC invalidate for every write over PCI that goes directly to GPU memory. This also invalidates the TCP along the way if necessary. It currently relies on the driver synchronization, which only allows BAR writes in between kernels; therefore the cache line should only be in the I or V state. To handle a race condition between the invalidates and the launch of the next kernel, the invalidates return a response, and the GPU command processor waits for all TCC invalidates to complete before launching the next kernel.

This fixes issues with stale data in nanoGPT and possibly PENNANT.
```diff
@@ -420,6 +420,12 @@ AMDGPUDevice::writeFrame(PacketPtr pkt, Addr offset)
 {
     DPRINTF(AMDGPUDevice, "Wrote framebuffer address %#lx\n", offset);
 
+    for (auto& cu: CP()->shader()->cuList) {
+        auto system = CP()->shader()->gpuCmdProc.system();
+        Addr aligned_addr = offset & ~(system->cacheLineSize() - 1);
+        cu->sendInvL2(aligned_addr);
+    }
+
     Addr aperture = gpuvm.getFrameAperture(offset);
     Addr aperture_offset = offset - aperture;
```