By default, the GPU VIPER coherence protocol uses a WT L2 cache.
However, it also supports WB caches (although this is not currently
tested). Using a WB L2 cache for the GPU results in deadlocks with
atomics.
Specifically, when an atomic reaches the L2 and the line is
currently in M or W, the line must be written back before the atomic
can be performed. However, the current support has two issues:
a) It never performs the atomic operation. VIPER currently assumes
all atomics are system-scope atomics, which cannot be performed at
the L2, and this transition requires the dirty line to be written
back before the atomic is performed. However, neither the transition
nor the response path ever performs the atomic.
b) Putting the atomic action right after the write back is not
safe because the requests must be ordered when they reach memory.
Thus we have to wait until the write back is acknowledged before
it is safe to send/perform the atomic.
To fix this, this change modifies the transition in question to
put the atomic on the stalled requests buffer, which the WBAck will
check when it returns to the L2 (and thus perform the atomic, which
will result in the atomic being sent on to the directory).
This fix has been tested and verified with both the per-checkin and
nightly GPU Ruby Random tester tests (with a WB L2 cache).
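The ordering this fix enforces can be illustrated with a toy Python
model. This is only a sketch, not SLICC: the class ToyL2, its methods
and the to_memory log are hypothetical names, and the state machine is
reduced to the W, WI and I states.

```python
from collections import defaultdict, deque


class ToyL2:
    """Toy model of a WB L2 line; not the real VIPER TCC state machine."""

    def __init__(self):
        self.state = {}                     # addr -> "W", "WI" or "I"
        self.stalled = defaultdict(deque)   # per-address stalled requests
        self.to_memory = []                 # messages in the order they leave the L2

    def atomic(self, addr):
        st = self.state.get(addr, "I")
        if st == "W":
            # Dirty line: issue the writeback first and park the atomic.
            self.to_memory.append(("WB", addr))
            self.state[addr] = "WI"
            self.stalled[addr].append("Atomic")
        elif st == "WI":
            # Writeback still in flight: keep the atomic parked.
            self.stalled[addr].append("Atomic")
        else:
            # No dirty data: forward the atomic towards the directory.
            self.to_memory.append(("Atomic", addr))

    def wb_ack(self, addr):
        # The writeback is acknowledged, so replaying is now safe.
        self.state[addr] = "I"
        for req in self.stalled.pop(addr, deque()):
            if req == "Atomic":
                self.atomic(addr)


l2 = ToyL2()
l2.state[0x40] = "W"
l2.atomic(0x40)
sent_before_ack = list(l2.to_memory)   # only the writeback has left the L2
l2.wb_ack(0x40)
print(sent_before_ack)
print(l2.to_memory)
```

In this model the atomic cannot reach memory ahead of the writeback,
because it only leaves the stalled buffer once wb_ack() runs.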
Change-Id: I9a43fd985dc71297521f4b05c47288d92c314ac7
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/68978
Maintainer: Bobby Bruce <bbruce@ucdavis.edu>
Reviewed-by: Matthew Poremba <matthew.poremba@amd.com>
Tested-by: kokoro <noreply+kokoro@google.com>
By default, the GPU VIPER coherence protocol uses a WT L2 cache.
However, it also supports WB caches (although this is not currently
tested). Using a WB L2 cache for the GPU results in deadlocks with
loads.
Specifically, when a load reaches the L2 and the line is currently
in the W state, that line must be written back before the load can
be performed. However, the current L2 transition for this case does
not attempt to retry the load when the WB completes, resulting
in a deadlock. This deadlock can be replicated by running the GPU
Ruby random tester as is with a WB L2 cache instead of a WT L2
cache.
To fix this, this change modifies the transition in question to
put the load on the stalled requests buffer, which the WBAck will
check when it returns to the L2 (and thus perform the load).
This fix has been tested and verified with both the per-checkin and
nightly GPU Ruby Random tester tests (with a WB L2 cache).
Change-Id: Ieec4f61a3070cf9976b8c3ef0cdbd0cc5a1443c6
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/68977
Reviewed-by: Matthew Poremba <matthew.poremba@amd.com>
Maintainer: Bobby Bruce <bbruce@ucdavis.edu>
Tested-by: kokoro <noreply+kokoro@google.com>
Two W->WI transitions in the GPU L2 cache coherence protocol, on
events RdBlk and Atomic, do not clear the request from the request
queue upon completing the transition. The response path does not
perform this action either. This update adds the p_popRequestQueue
action to each of these transitions to remove the stale request
from the queue.
Change-Id: Ia2679fe3dd702f4df2bc114f4607ba40c18d6ff1
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/67192
Reviewed-by: Jason Lowe-Power <power.jg@gmail.com>
Maintainer: Jason Lowe-Power <power.jg@gmail.com>
Tested-by: kokoro <noreply+kokoro@google.com>
An earlier commit added support for GLC and SLC AMDGPU instruction
modifiers. These modifiers enable cache bypassing when set. The GLC/SLC
flag information was being threaded through all the way to memory and
back so that appropriate actions could be taken upon receiving a request
and the corresponding response. This commit removes the threading and
adds the bypass flag information to the TBE. Requests populate this
entry and responses access it to determine the correct set of actions to
execute.
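A minimal Python sketch of the idea follows. It is not the SLICC code:
ToyTBE, on_request and on_response are hypothetical names, the real
TBE has more fields, and the sketch assumes an L2-like cache in which
only SLC bypasses the cache.

```python
from dataclasses import dataclass


@dataclass
class ToyTBE:
    """Hypothetical per-request entry holding the bypass flags."""
    addr: int
    is_glc: bool = False
    is_slc: bool = False


tbes = {}  # addr -> ToyTBE for requests that are still outstanding


def on_request(addr, glc=False, slc=False):
    # Record the flags once, when the request allocates its TBE.
    tbes[addr] = ToyTBE(addr, glc, slc)
    return ("Request", addr)   # the outgoing message carries no GLC/SLC bits


def on_response(addr):
    # The response has no flags of its own; consult the TBE instead.
    tbe = tbes.pop(addr)
    if tbe.is_slc:
        return "forward_without_caching"
    return "fill_and_forward"


on_request(0x100, slc=True)
print(on_response(0x100))
```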
Change-Id: I20ffa6682d109270adb921de078cfd47fb4e137c
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/67191
Maintainer: Matt Sinclair <mattdsinclair@gmail.com>
Tested-by: kokoro <noreply+kokoro@google.com>
Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com>
Reviewed-by: Jason Lowe-Power <power.jg@gmail.com>
The GPU cache models do not support cache bypassing when the GLC or SLC
AMDGPU instruction modifiers are used in a load or store. This commit
adds cache bypass support by introducing new transitions in the
coherence protocol used by the GPU memory system. Now, instructions with
the GLC bit set will not cache in the L1, and instructions with the SLC
bit set will not cache in the L1 or the L2.
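The resulting policy can be summarized with a small Python helper. The
function caching_levels is hypothetical and only restates the rule
above; it says nothing about how the transitions are implemented.

```python
def caching_levels(glc: bool, slc: bool) -> set:
    """Return the cache levels allowed to keep the line (illustrative only)."""
    if slc:
        return set()        # SLC: bypass both the L1 and the L2
    if glc:
        return {"L2"}       # GLC: bypass the L1 only
    return {"L1", "L2"}     # no modifier: no bypassing requested


for glc, slc in [(False, False), (True, False), (False, True), (True, True)]:
    print(glc, slc, sorted(caching_levels(glc, slc)))
```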
Change-Id: Id29a47b0fa7e16a21a7718949db802f85e9897c3
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/66991
Reviewed-by: Jason Lowe-Power <power.jg@gmail.com>
Maintainer: Matt Sinclair <mattdsinclair@gmail.com>
Tested-by: kokoro <noreply+kokoro@google.com>
Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com>
A deadlock occurred where we got a RdBlk while in W,
which put us in WI while we wait for a writeback to complete.
This would cause the request to be stalled while the writeback
was occurring, but when the writeback completed (WBAck), we never
woke up the requests and thus never completed the RdBlk.
This commit adds a wakeup when we receive a WBAck while in WI.
Change-Id: I01edf1d7a47757b4f680baf9f33a1a6aa37e7e25
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/59352
Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com>
Maintainer: Matt Sinclair <mattdsinclair@gmail.com>
Reviewed-by: Matthew Poremba <matthew.poremba@amd.com>
Tested-by: kokoro <noreply+kokoro@google.com>
Remove the line "For use for simulation and test purposes only" in files
where AMD is the only copyright holder listed in the header. This happens
to be the case for all files where this line exists, so this change
removes the line completely from gem5.
Change-Id: I623f266b002f564301b28774f49081099cfc60fd
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/53943
Reviewed-by: Jason Lowe-Power <power.jg@gmail.com>
Maintainer: Jason Lowe-Power <power.jg@gmail.com>
Tested-by: kokoro <noreply+kokoro@google.com>
Currently, the GPU VIPER TCC protocol handles races between atomics in
the triggerQueue_in. This in_port does not check for resource
availability, which can cause the trigger queue to execute multiple
times. Although this is the expected behavior, the code for handling
atomic races decrements the atomicDoneCnt flag in the trigger queue,
which is not safe since resource contention may cause it to execute
multiple times.
To resolve this issue, this commit moves the decrementing of this
counter to a new action that is called in an event that happens only
when the race between atomics is detected.
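The hazard can be reproduced with a toy Python function. It is a
sketch under simplifying assumptions: run_in_port and its parameters
are hypothetical, and a resource stall is modeled as an evaluation of
the in_port that does not fire a transition.

```python
def run_in_port(counter, stalled_attempts, decrement_in_port):
    """Handle one trigger message; return the counter once it is consumed.

    stalled_attempts is the number of evaluations that hit a resource
    stall before the transition can finally fire (hypothetical knob).
    """
    for attempt in range(stalled_attempts + 1):
        if decrement_in_port:
            counter -= 1          # side effect on every evaluation (unsafe)
        fired = attempt == stalled_attempts
        if fired:
            if not decrement_in_port:
                counter -= 1      # side effect inside the action (runs once)
            return counter


buggy = run_in_port(counter=1, stalled_attempts=3, decrement_in_port=True)
fixed = run_in_port(counter=1, stalled_attempts=3, decrement_in_port=False)
print(buggy, fixed)
```

With the decrement in the in_port body the counter drops once per
evaluation; with the decrement in the action it drops once per message.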
Change-Id: I552fd4f34fdd9ebeec99fb7aeb4eeb7b150f577f
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/51368
Reviewed-by: Jason Lowe-Power <power.jg@gmail.com>
Reviewed-by: Matthew Poremba <matthew.poremba@amd.com>
Maintainer: Jason Lowe-Power <power.jg@gmail.com>
Tested-by: kokoro <noreply+kokoro@google.com>
In the GPU VIPER TCC, programs with mixes of atomics and data
accesses to the same address, in the same kernel, can experience
deadlock when large applications (e.g., Pannotia's graph analytics
algorithms) are running on very small GPUs (e.g., the default 4 CU GPU
configuration). In this situation, deadlocks occur due to resource
stalls interacting with the behavior of the current implementation for
handling races between atomic accesses. The specific order of events
causing this deadlock is:
1. TCC is waiting on an atomic to return from directory
2. In the meantime, it receives another atomic to the same address -- when
this happens, the TCC increments the number of atomics to this address
(numAtomics = 2) that are pending in the TBE, and does a write through of
the atomic to the directory.
3. When the first atomic returns from the Directory, it decrements the
numAtomics counter. numAtomics was at 2 though, because of step #2. So
it doesn't deallocate the TBE entry and calls Event:AtomicNotDone.
4. Another request (a LD) to the same address comes along. The LD does
z_stall since the second atomic is pending -- so the LD retries every
cycle until the deadlock counter times out (or until the second atomic
comes back).
5. The second atomic returns to the TCC. However, because there are so
many LDs pending in the cache, all doing z_stalls and retrying every cycle,
there are a lot of resource stalls. So, when the second atomic returns, it is
forced to retry its operation multiple times -- and each retry decrements
the atomicDoneCnt flag (which was added to catch a race between atomics
arriving and leaving the TCC in 7246f70bfb). As a result,
atomicDoneCnt becomes negative.
6. Since this atomicDoneCnt flag is used to determine when Event:AtomicDone
happens, and since the resource stalls caused the atomicDoneCnt flag to become
negative, we never complete the atomic. This means the pending LD can never
access the line, because it's stuck waiting for the atomic to complete.
7. Eventually the deadlock threshold is reached.
To fix this issue, this commit changes the VIPER TCC protocol from using
z_stall to using the stall_and_wait buffer method that the
Directory-level SLICC already uses. This change effectively
prevents resource stalls from dominating the TCC level, by putting
pending requests for a given address in a per-address stall buffer.
These requests are then woken up when the pending request returns.
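The difference between the two methods can be sketched with a toy
Python simulation. It is a simplification, not the Ruby implementation:
simulate and its parameters are hypothetical, every stalled LD is
re-evaluated each cycle under z_stall, and one evaluation stands in for
one use of the contended resources.

```python
from collections import defaultdict, deque


def simulate(num_loads, atomic_return_cycle, use_stall_buffer):
    """Count how often pending LDs are evaluated before they complete."""
    addr = 0x80
    request_queue = deque(f"LD{i}" for i in range(num_loads))
    stall_buffer = defaultdict(deque)   # per-address stall buffer
    evaluations = 0
    completed = []
    atomic_pending = True
    for cycle in range(atomic_return_cycle + 1):
        if cycle == atomic_return_cycle:
            atomic_pending = False
            # Wakeup: move the stalled requests back to the request queue.
            request_queue.extend(stall_buffer.pop(addr, deque()))
        for _ in range(len(request_queue)):
            req = request_queue.popleft()
            evaluations += 1
            if not atomic_pending:
                completed.append(req)
            elif use_stall_buffer:
                stall_buffer[addr].append(req)   # park until woken up
            else:
                request_queue.append(req)        # z_stall: retry next cycle
    return evaluations, completed


print(simulate(4, 10, use_stall_buffer=False))
print(simulate(4, 10, use_stall_buffer=True))
```

In this model the z_stall variant evaluates each LD once per cycle
while the atomic is outstanding, whereas the stall buffer variant
evaluates each LD once when it is parked and once when it is woken up.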
This change also makes two small changes to the Directory-level
protocol (MOESI_AMD_BASE-dir):
1. Updated the names of the wakeup actions to match the TCC wakeup actions,
to avoid confusion.
2. Changed transition(B, UnblockWriteThrough, U) to check all stall buffers,
as some requests were being placed later in the stall buffer than was
being checked. This mirrors the changes in 187c44fe44 to other Directory
transitions to resolve races between GPU and DMA requests, but for
transitions that prior workloads did not stress.
Change-Id: I60ac9830a87c125e9ac49515a7fc7731a65723c2
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/51367
Reviewed-by: Jason Lowe-Power <power.jg@gmail.com>
Reviewed-by: Matthew Poremba <matthew.poremba@amd.com>
Maintainer: Jason Lowe-Power <power.jg@gmail.com>
Tested-by: kokoro <noreply+kokoro@google.com>
The GPU VIPER TCC protocol accidentally used "TiggerMsg" instead
of "TriggerMsg" for the triggerQueue_in port. This was a benign
bug because the msg type is not used in the in_port implementation
but still makes the SLICC harder to understand, so fixing it is
worthwhile.
Change-Id: I88cbc72bac93bcc58a66f057a32f7bddf821cac9
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/44905
Reviewed-by: Jason Lowe-Power <power.jg@gmail.com>
Reviewed-by: Matthew Poremba <matthew.poremba@amd.com>
Maintainer: Jason Lowe-Power <power.jg@gmail.com>
Tested-by: kokoro <noreply+kokoro@google.com>
There is a race condition in VIPER where a second atomic issued to an
address with an atomic already pending can result in multiple trigger
messages signalling the completion of the atomic operation. The first
message was deallocating the TBE, causing the second message to
dereference a nullptr when looking up the TBE.
A counter is added to track the number of in flight AtomicDone trigger
messages. The AtomicDone is not called until the last in flight message
arrives at the trigger queue. The remaining messages call AtomicNotDone
which simply pops the message from the queue and keeps the TBE
allocated.
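The counting scheme can be sketched in Python as follows. The names
ToyAtomicTBE, enqueue_done and on_trigger are hypothetical, and the
sketch assumes each trigger message is handled exactly once.

```python
class ToyAtomicTBE:
    """Hypothetical TBE tracking in-flight AtomicDone trigger messages."""

    def __init__(self):
        self.atomic_done_cnt = 0
        self.allocated = True


def enqueue_done(tbe):
    # One more completion message is now in flight for this TBE.
    tbe.atomic_done_cnt += 1


def on_trigger(tbe):
    # Consume one message; only the last one may deallocate the TBE.
    tbe.atomic_done_cnt -= 1
    if tbe.atomic_done_cnt == 0:
        tbe.allocated = False
        return "AtomicDone"
    return "AtomicNotDone"


tbe = ToyAtomicTBE()
enqueue_done(tbe)
enqueue_done(tbe)
events = [on_trigger(tbe), on_trigger(tbe)]
print(events, tbe.allocated)
```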
Change-Id: Ie1de0436861a7c393ad6d2fb2faceb83c18d4cc3
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/39175
Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com>
Reviewed-by: Jason Lowe-Power <power.jg@gmail.com>
Maintainer: Matt Sinclair <mattdsinclair@gmail.com>
Tested-by: kokoro <noreply+kokoro@google.com>