The GPU device currently supports large BAR, which means the driver
can write directly to GPU memory over the PCI bus without using SDMA or
PM4 packets. The gem5 PCI interface only provides an atomic interface
for BAR reads/writes, which means the values cannot go through the
timing-mode Ruby caches. This causes bugs because the TCC is allowed to
keep clean data between kernels for performance reasons. If a BAR
write goes directly to memory, bypassing the cache, the value in the
cache is stale and must be invalidated.
In this commit a TCC invalidate is generated for all writes over PCI
that go directly to GPU memory. This will also invalidate the TCP along
the way if necessary. This currently relies on the driver
synchronization, which only allows BAR writes between kernels.
Therefore, the cache line should only be in the I or V state.
To handle a race condition between invalidates and launching the next
kernel, the invalidates return a response and the GPU command processor
will wait for all TCC invalidates to be complete before launching the
next kernel.
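As a rough C++ illustration of this handshake (the names here, such as
PendingInvalidates, are hypothetical, not the actual gem5 classes):
every BAR write that bypasses the caches issues a TCC invalidate, and
the command processor waits until all invalidate responses have
returned before launching the next kernel.

    #include <cassert>
    #include <cstdint>

    struct PendingInvalidates {
        uint64_t outstanding = 0;

        // A BAR write went directly to GPU memory: issue a TCC invalidate
        // (which also invalidates the TCP along the way if needed).
        void onBarWrite(uint64_t /*addr*/) { ++outstanding; }

        // The TCC acknowledged one invalidate.
        void onInvalidateAck() {
            assert(outstanding > 0);
            --outstanding;
        }

        // The command processor checks this before launching the next kernel.
        bool safeToLaunchKernel() const { return outstanding == 0; }
    };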
This fixes issues with stale data in nanoGPT and possibly PENNANT.
Change-Id: I8e1290f842122682c271e5508a48037055bfbcdf
Currently, the GPU SQC (L1I$) and TCP (L1D$) have a performance bug
where they do not behave correctly when multiple requests to the same
cache line overlap one another. The intended behavior is that if the
first request that arrives at the Ruby code for the SQC/TCP misses, it
should send a request to the GPU TCC (L2$). If any requests to the
same cache line occur while this first request is pending, they should
wait locally at the L1 in the MSHRs (TBEs) until the first request has
returned. At that point they can be serviced, and assuming the line
has not been evicted, they should hit.
For example, in the following test (on 1 GPU thread, in 1 WG):
load Arr[0]
load Arr[1]
load Arr[2]
The expected behavior (confirmed via profiling on real GPUs) is that
we should get 1 miss (Arr[0]) and 2 hits (Arr[1], Arr[2]) for such a
program.
However, the current support in the VIPER SQC/TCP code does not model
this correctly. Instead it lets all 3 concurrent requests go straight
through to the TCC instead of stopping the Arr[1] and Arr[2] requests
locally while Arr[0] is serviced. This causes all 3 requests to be
classified as misses.
To resolve this, this patch adds support into the SQC/TCP code to
prevent subsequent, concurrent requests to a pending cache line from being
sent in parallel with the original one. To do this, we add an
additional transient state (IV) to indicate that a load is pending to
this cache line. If a subsequent request of any kind to the same cache
line occurs while this load is pending, the requests are put on the
local wait buffer and woken up when the first request returns to the
SQC/TCP. Likewise, when the first load is returned to the SQC/TCP, it
transitions from IV --> V.
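The following is a minimal C++ model of this intended behavior, with
invented names (L1Model, fillReturned); the real implementation is
SLICC state-machine code. The first load to an Invalid line moves it to
IV and is sent to the TCC; later requests to the same line park in a
wait buffer and are woken when the fill returns and the line moves
IV --> V.

    #include <cstdint>
    #include <deque>
    #include <unordered_map>

    enum class State { I, IV, V };

    struct L1Line {
        State state = State::I;
        std::deque<int> waiters;      // request ids parked on this line
    };

    struct L1Model {
        std::unordered_map<uint64_t, L1Line> lines;
        int misses = 0, hits = 0;

        void load(uint64_t lineAddr, int reqId) {
            L1Line &l = lines[lineAddr];
            switch (l.state) {
            case State::I:            // first miss: send to TCC, enter IV
                ++misses;
                l.state = State::IV;
                break;
            case State::IV:           // load already pending: wait locally
                l.waiters.push_back(reqId);
                break;
            case State::V:            // line present: hit
                ++hits;
                break;
            }
        }

        // The fill returns from the TCC: IV --> V, wake and service waiters.
        void fillReturned(uint64_t lineAddr) {
            L1Line &l = lines[lineAddr];
            l.state = State::V;
            while (!l.waiters.empty()) {
                l.waiters.pop_front();
                ++hits;               // woken requests now hit
            }
        }
    };

    int main() {
        L1Model l1;
        uint64_t line = 0x1000;       // Arr[0..2] share one cache line
        l1.load(line, 0);             // miss: I --> IV, request sent to TCC
        l1.load(line, 1);             // parked in the wait buffer
        l1.load(line, 2);             // parked in the wait buffer
        l1.fillReturned(line);        // IV --> V, waiters wake up and hit
        return 0;                     // l1.misses == 1, l1.hits == 2
    }

Run on the Arr[0]/Arr[1]/Arr[2] example above, this model yields 1 miss
and 2 hits, matching the behavior profiled on real GPUs.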
As part of this support, additional transitions were also added to
account for corner cases. First, if the line is evicted by another
request that maps to the same set index while the first load is
pending, the line is immediately given to the new request; when the
pending load returns, it completes and wakes up any waiting requests to
the same line, but does not attempt to change the state of the line.
Second, GPU bypassing loads and stores must interact correctly with
pending requests: they are forced to wait if they reach the L1 after
the pending, non-bypassing load, but if they reach the L1 before the
non-bypassing load, they make sure not to change the state of the line
from IV if they return before the non-bypassing load does.
As part of this change, we also move the MSHR behavior for loads from
the GPUCoalescer into the Ruby code (like all other requests). This is
important for getting correct hit and miss counts in the statistics and
other output, since the GPUCoalescer MSHR behavior assumes that all
requests serviced out of its MSHR also miss if the original request to
that line missed.
Although the SQC does not support stores, the TCP does. Thus,
we could have applied a similar change to the GPU stores at the TCP.
However, since the TCP support assumes write-through caches and does not
attempt to allocate space in the TCP, we elected not to add this support
since it seems to run contrary to the intended behavior (i.e., the
intended behavior seems to be that writes just bypass the TCP and thus
should not need to wait for another write to the same cache line to
complete).
Additionally, making these changes introduced issues with deadlocks at
the TCC. Specifically, some Pannotia applications have accesses to the
same cache line where some of the accesses are GLC (i.e., they bypass
the GPU L1 cache) and others are non-GLC (i.e., they want to be cached
in the GPU L1 cache). The support described above handles this within a
single CU. However, the problem here is that these requests come from
different CUs and happen concurrently (seemingly because different WGs
are at different points in the kernel around the same time). This
causes a problem because our TCC support overwrites the TBE's
information about the GPU bypassing bits (SLC, GLC) every time.
Specifically, when the second (non-GLC) load reaches the TCC, it
overwrites the SLC/GLC information for the first (GLC) load. Thus, when
the first load returns from the directory/memory, it no longer has the
GLC bit set, which causes an assert failure at the TCP.
After talking with other developers, it was decided that the best way
to handle this, and to model real hardware more closely, was to move
the point at which requests are put to sleep on the wakeup buffer from
the TCC to the directory. Accordingly, this patch includes support for that
-- now when multiple loads (bypassing or non-bypassing) from different
CUs reach the directory, all but the first one will be forced to wait
there until the first one completes, then will be woken up and
performed. This required updating the WTRequestor information at the
TCC to pass the information about what CU performed the original request
for loads as well (otherwise since the TBE can be updated by multiple
pending loads, we can't tell where to send the final result to). Thus,
I changed the field to be named CURequestor instead of WTRequestor since
it is now used for more than stores. Moreover, I also updated the
directory to take this new field and the GLC information from incoming
TCC requests and then pass that information back to the TCC on the
response -- without doing this, because the TBE can be updated by
multiple pending, concurrent requests we cannot determine if this memory
request was a bypassing or non-bypassing request. Finally, these
changes introduced a lot of additional contention and protocol stalls
at the directory, so this patch converted all directory uses of z_stall
to instead put requests on the wakeup buffer (and wake them up when the
current request completes). Without this, protocol stalls cause many
applications to deadlock at the directory.
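A simplified C++ sketch of the directory-side change (hypothetical
names; the real code is SLICC): instead of z_stall-ing, a request that
finds its address busy is parked in a per-address wakeup buffer and
replayed when the in-flight request completes.

    #include <cstdint>
    #include <deque>
    #include <unordered_map>

    struct Request { int cu; bool bypassing; };

    struct DirectoryModel {
        std::unordered_map<uint64_t, bool> busy;    // request in flight?
        std::unordered_map<uint64_t, std::deque<Request>> wakeupBuffer;

        // Returns true if the request was issued, false if it was parked.
        bool receive(uint64_t addr, const Request &req) {
            if (busy[addr]) {
                // Previously: z_stall (recycle in place, can deadlock).
                // Now: park the request and wake it when the line frees up.
                wakeupBuffer[addr].push_back(req);
                return false;
            }
            busy[addr] = true;        // issue to memory
            return true;
        }

        // Called when the in-flight request for addr completes.
        void complete(uint64_t addr) {
            busy[addr] = false;
            auto &q = wakeupBuffer[addr];
            if (!q.empty()) {
                Request next = q.front();
                q.pop_front();
                receive(addr, next);  // replay the oldest parked request
            }
        }
    };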
However, this exposed another issue at the TCC: other applications
(e.g., HACC) have a mix of atomics and non-atomics to the same cache
line in the same kernel. Since the TCC transitions to the A state when
an atomic arrives, a problem occurs when, for example, the first
pending load returns to the TCC from the directory and moves the TCC
state to V while other loads are still pending at the TCC; an atomic
then arrives and moves the line to A before those loads return. This
causes invalid transition errors at the TCC when the pending loads
return, because the A state thinks they are atomics and decrements the
pending atomic count (plus the loads are never sent to the TCP as
returning loads).
This patch fixes this by changing the TCC TBEs to model the number of
pending requests, and not allowing atomics to be issued from the TCC
until all prior, pending non-atomic requests have returned.
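A small C++ sketch of the TCC TBE change, with invented field names:
the TBE counts outstanding non-atomic requests for the line, and an
atomic is only issued once that count has drained to zero.

    #include <cassert>

    struct TccTbe {
        int pendingLoads = 0;         // non-atomic requests still in flight
        bool atomicQueued = false;

        void issueLoad() { ++pendingLoads; }

        // An atomic may only be issued once all prior loads have drained;
        // otherwise the line would sit in A while loads return into it.
        bool tryIssueAtomic() {
            if (pendingLoads > 0) {
                atomicQueued = true;
                return false;
            }
            return true;
        }

        // A pending load returned from the directory. Returns true when the
        // last load drains and a queued atomic may now proceed.
        bool loadReturned() {
            assert(pendingLoads > 0);
            --pendingLoads;
            return pendingLoads == 0 && atomicQueued;
        }
    };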
Change-Id: I37f8bda9f8277f2355bca5ef3610f6b63ce93563
The GPU TCP (L1D$) Ruby SLICC code had a bug where a GPU load
that wants to bypass the L1D$ (e.g., with the GLC or SLC bit set)
but finds the line in Invalid when it arrives results in a
non-bypassing load being sent to the GPU TCC (L2$) instead of
a bypassing load.
This issue was not caught by the current nightly or weekly tests,
because those tests do not check for correctness in terms of hits
and misses in the caches. However, tests for these corner cases
expose this issue.
To fix this, this patch removes the check that the entry is valid
when deciding what to do with a bypassing GPU load -- since the
TCP Ruby code has transitions for bypassing loads in both I and V,
we can simply call the LoadBypassEvict event in both cases and the
appropriate transition will handle the bypassing load given the
cache line's current state in the TCP.
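In C++ pseudo-form (the actual code is a SLICC trigger, and these enum
names are illustrative), the corrected dispatch no longer consults the
line state for bypassing loads:

    enum class TcpState { I, V };
    enum class TcpEvent { Load, LoadBypassEvict };

    TcpEvent classify(bool bypassing, TcpState /*state*/) {
        // Before the fix, a bypassing load that found the line in I was
        // misclassified as an ordinary caching Load. The line state no
        // longer matters: both I and V have LoadBypassEvict transitions.
        return bypassing ? TcpEvent::LoadBypassEvict : TcpEvent::Load;
    }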
Change-Id: Ia224cefdf56b4318b2bcbd0bed995fc8d3b62a14
Augment Datablock and WriteMask to support an optional argument to
distinguish between return and no-return atomics. In the case of atomic
no-return requests, log entries should not be created when performing
the atomic.
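A hedged sketch of the idea with simplified stand-in types (not gem5's
actual Datablock/WriteMask interfaces): the atomic helper takes an
extra flag and only creates log entries, which feed the return path,
when the atomic returns a value.

    #include <cstdint>
    #include <vector>

    struct AtomicLog { std::vector<uint64_t> oldValues; };

    // Perform an atomic add; only return-variant atomics log the old value
    // (the log entries feed the data sent back to the requester).
    uint64_t performAtomicAdd(uint64_t &mem, uint64_t operand,
                              bool isAtomicNoReturn, AtomicLog &log) {
        uint64_t old = mem;
        if (!isAtomicNoReturn) {
            log.oldValues.push_back(old);
        }
        mem = old + operand;
        return old;
    }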
Change-Id: Ic3112834742f4058a7aa155d25ccc4c014b60199a
mem-ruby, gpu-compute: fix GPU SQC/TCP Ruby formatting
Fix several improperly indented prints and remove extraneous blank
lines in the SLICC code for the GPU SQC (L1I$) and TCP (L1D$).
This commit adds flush support to the GPU VIPER coherence protocol. The
L1 cache will now initiate a flush request if the packet it receives
is of type RubyRequestType_FLUSH. During the flush process, the L1 cache
will send a request to L2 whether it is in the V or I state. L2 will
issue a flush request to the directory if its cache line is in the
valid state, before invalidating its copy. The directory, on receiving
this request, writes the data to memory and sends an ack back to the
L2. L2 forwards this ack back to the L1, which then ends the flush by
calling the write callback.
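The sequence can be summarized with the following C++ sketch, where the
state names mirror the text but the helper is invented for illustration
(the real protocol is written in SLICC):

    enum class LineState { I, V };

    struct FlushSteps {
        bool dataWrittenToMemory = false;
        bool l1WriteCallbackDone = false;
    };

    FlushSteps flush(LineState l1State, LineState l2State) {
        FlushSteps steps;
        (void)l1State;                // L1 forwards the flush in both V and I
        if (l2State == LineState::V) {
            // L2 flushes to the directory, which writes the data to memory
            // and acks; L2 then invalidates its copy.
            steps.dataWrittenToMemory = true;
        }
        // L2 acks the L1, which completes the flush via the write callback.
        steps.l1WriteCallbackDone = true;
        return steps;
    }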
Change-Id: I9dfc0c7b71a1e9f6d5e9e6ed4977c1e6a3b5ba46
66d4a158 added support for AMD's GPU cache bypassing flags (GLC
for bypassing L1 caches, SLC for bypassing all caches). However,
it did not add a transition for the situation where the cache line
is currently I (Invalid). This commit adds this support, which
resolves an assert failure in Pannotia workloads when this situation
arises.
Change-Id: I59a62ce70c01dd8b73aacb733fb3d1d0dab2624b
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/67201
Reviewed-by: Jason Lowe-Power <power.jg@gmail.com>
Tested-by: kokoro <noreply+kokoro@google.com>
Maintainer: Jason Lowe-Power <power.jg@gmail.com>
66d4a158 added support for AMD's GPU cache bypassing flags (GLC
for bypassing L1 caches, SLC for bypassing all caches). However,
for applications that use the GLC flag but intermix GLC- and
non-GLC accesses to the same address, this previous commit
has a bug. This bug manifests when the address is currently
valid in the L1 (TCP). In this case, the previous commit chose
to evict the line before letting the bypassing access to proceed.
However, to do this the previous commit was using the inv_invDone
action as part of the process of evicting it. This action is only
intended to be called when load acquires are being performed
(i.e., when the entire L1 cache is being flash invalidated). Thus,
calling inv_invDone for a GLC (or SLC) bypassing request caused an
assert failure since the bypassing request was not performing a
load acquire.
This commit resolves this by changing the support in this case to
simply invalidate the entry in the cache.
Change-Id: Ibaa4976f8714ac93650020af1c0ce2b6732c95a2
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/67199
Reviewed-by: Jason Lowe-Power <power.jg@gmail.com>
Tested-by: kokoro <noreply+kokoro@google.com>
Maintainer: Jason Lowe-Power <power.jg@gmail.com>
The GPU cache models do not support cache bypassing when the GLC or SLC
AMDGPU instruction modifiers are used in a load or store. This commit
adds cache bypass support by introducing new transitions in the
coherence protocol used by the GPU memory system. Now, instructions with
the GLC bit set will not cache in the L1 and instructions with SLC bit
set will not cache in L1 or L2.
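The resulting caching policy can be encoded in a small C++ helper (the
function below is illustrative, not gem5 code; only the GLC/SLC flag
names come from the AMDGPU instruction modifiers):

    struct CachePolicy { bool cacheInL1; bool cacheInL2; };

    CachePolicy policyFor(bool glc, bool slc) {
        if (slc) return {false, false};   // SLC: bypass L1 and L2
        if (glc) return {false, true};    // GLC: bypass L1 only
        return {true, true};              // default: cache normally
    }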
Change-Id: Id29a47b0fa7e16a21a7718949db802f85e9897c3
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/66991
Reviewed-by: Jason Lowe-Power <power.jg@gmail.com>
Maintainer: Matt Sinclair <mattdsinclair@gmail.com>
Tested-by: kokoro <noreply+kokoro@google.com>
Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com>
Previously, the GPU SQC and TCP Ruby protocols always told the Sequencer
that the externalHit field was false. This skews statistics and
profiling, because the Sequencer uses this hit/miss information both
for profiling and for the coalescer's statistics.
To resolve this, this commit updates the GPU SQC and TCP Ruby protocols
to pass the appropriate hit/miss information into the Sequencer's
readCallback and hitCallback functions.
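As a minimal C++ sketch (with a simplified stand-in for the Sequencer
interface), externalHit = true indicates the request was serviced
beyond the L1, i.e., it missed locally:

    #include <cstdint>

    struct Stats { uint64_t hits = 0, misses = 0; };

    // Analogous to the Sequencer's readCallback(..., externalHit) path:
    // externalHit == true means the data came from beyond the L1 (a miss).
    void readCallback(Stats &s, bool externalHit) {
        if (externalHit) {
            ++s.misses;
        } else {
            ++s.hits;
        }
    }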
Change-Id: Ib74af09b66fa8866eee72d3a9ab0e8a8f2196c03
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/60652
Reviewed-by: Matthew Poremba <matthew.poremba@amd.com>
Maintainer: Matthew Poremba <matthew.poremba@amd.com>
Tested-by: kokoro <noreply+kokoro@google.com>
Remove the line "For use for simulation and test purposes only" in files
where AMD is the only copyright holder listed in the header. This happens
to be the case for all files where this line exists, removing the line
completely from gem5.
Change-Id: I623f266b002f564301b28774f49081099cfc60fd
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/53943
Reviewed-by: Jason Lowe-Power <power.jg@gmail.com>
Maintainer: Jason Lowe-Power <power.jg@gmail.com>
Tested-by: kokoro <noreply+kokoro@google.com>