Commit Graph

39 Commits

Author SHA1 Message Date
Jarvis Jia
e957a882ed gpu-compute,mem-ruby: Add RubyHitMiss flag for TCP and TCC cache
Add hit and miss prints for the TCP and TCC caches with the RubyHitMiss debug flag
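
A minimal standalone C++ sketch of the kind of per-access hit/miss print this flag enables; the CacheKind enum and ruby_hit_miss_print helper are illustrative names, not the actual gem5/SLICC code.

    #include <cstdint>
    #include <cstdio>

    // Illustrative stand-ins for the two GPU caches; not gem5 types.
    enum class CacheKind { TCP, TCC };

    // Hypothetical helper: print one line per demand access when the
    // RubyHitMiss-style debug flag is enabled.
    void ruby_hit_miss_print(bool flag_enabled, CacheKind cache,
                             uint64_t line_addr, bool hit)
    {
        if (!flag_enabled)
            return;
        std::printf("%s: addr %#llx %s\n",
                    cache == CacheKind::TCP ? "TCP" : "TCC",
                    static_cast<unsigned long long>(line_addr),
                    hit ? "hit" : "miss");
    }

    int main()
    {
        ruby_hit_miss_print(true, CacheKind::TCP, 0x1000, true);   // TCP hit
        ruby_hit_miss_print(true, CacheKind::TCC, 0x2040, false);  // TCC miss
    }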

Change-Id: I40ae3449020b917f39ac91d29fa4e1dd7c791e7b
2024-06-24 11:19:51 -07:00
Bobby R. Bruce
3138c8a8b1 gpu-compute,mem-ruby: Revert "Add RubyHitMiss flag for TCP and TCC cache" (#1254)
Reverts gem5/gem5#1226
2024-06-18 07:58:54 -07:00
Jarvis Jia
3a2bf47d57 Add default value and change Ruby address format specifier
Change-Id: I8fbaf34745e90589e610d3b9bd423937e7ebdc3d
2024-06-17 03:27:25 -05:00
Jarvis Jia
87c0d7732c Merge branch 'develop' into rubyhitmiss 2024-06-12 17:30:35 -04:00
Jarvis Jia
edfc139c40 Change black format
Change-Id: I3733b31baf187e0d3d38d971d9423a1b1afe2296

gpu-compute: add GPU RubyHitMiss for TCP and TCC

Change-Id: I4430532b901811e03d9b077b61e2eca4557b34e1

gpu-compute: Add RubyHitMiss flag for TCP and TCC cache

Change-Id: I4e5d1127c84b9eb1060ec9ba0b6638267449eda5

Remove space

Change-Id: I401f528c6f128ba0956bdbc232e8f2ae37bf648c
2024-06-12 16:04:36 -05:00
Vishnu Ramadas
42b9a9666e mem-ruby: Add instSeqNum to atomic responses from GPU L2 caches
This commit adds instSeqNum to the atomic responses in
GPU_VIPER-TCC.sm. This will be useful when debugging issues related to
GPU atomic transactions.

Change-Id: Ic05c8e1a1cb230abfca2759b51e5603304aadaa3
2024-06-11 20:35:43 -05:00
Vishnu Ramadas
943d1f1453 mem-ruby: Fix deadlock in GPU_VIPER when issuing atomic requests
When a compute unit issues several requests to the same line,
the requests wait in the L2 if it is a writeback cache. If the line is
invalid initially and the first request is atomic in nature, the L2
cache issues a request to main memory. On data return, the cache line
transitions to M but doesn't wake up the other requests, resulting in
a deadlock. This commit adds a wakeup call on data return for atomics
and fixes potential deadlocks.
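
A standalone C++ sketch of the fix described above, assuming requests to a busy line are parked per address and the data return wakes them; the StallBuffer names are illustrative, not the actual SLICC actions.

    #include <cstdint>
    #include <deque>
    #include <iostream>
    #include <unordered_map>

    // Requests to a line with an outstanding fill are parked per address;
    // the data return completes the first (atomic) request and wakes the
    // rest.
    struct Request { uint64_t line; bool atomic; };

    struct StallBuffer {
        std::unordered_map<uint64_t, std::deque<Request>> stalled;

        void stall(const Request &r) { stalled[r.line].push_back(r); }

        // Called on data return; previously the line moved to M without
        // this wakeup, so the parked requests deadlocked.
        void wakeup(uint64_t line)
        {
            auto it = stalled.find(line);
            if (it == stalled.end())
                return;
            for (const Request &r : it->second)
                std::cout << "retry " << (r.atomic ? "atomic" : "load")
                          << " to line 0x" << std::hex << r.line << "\n";
            stalled.erase(it);
        }
    };

    int main()
    {
        StallBuffer sb;
        sb.stall({0x40, false});   // later request waits behind the atomic miss
        sb.wakeup(0x40);           // data return now wakes it up
    }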

Change-Id: I8200ce6e77da7c8b4db285c0cc8b8ca0dfa7d720
2024-06-11 20:33:46 -05:00
Matthew Poremba
1d64669473 mem,gpu-compute: Implement GPU TCC directed invalidate
The GPU device currently supports large BAR which means that the driver
can write directly to GPU memory over the PCI bus without using SDMA or
PM4 packets. The gem5 PCI interface only provides an atomic interface
for BAR reads/writes, which means the values cannot go through timing
mode Ruby caches. This causes bugs as the TCC cache is allowed to keep
clean data between kernels for performance reasons. If there is a BAR
write directly to memory bypassing the cache, the value in the cache is
stale and must be invalidated.

In this commit a TCC invalidate is generated for all writes over PCI
that go directly to GPU memory. This will also invalidate TCP along the
way if necessary. This currently relies on the driver synchronization
which only allows BAR writes in between kernels. Therefore, the cache
should only be in I or V state.

To handle a race condition between invalidates and launching the next
kernel, the invalidates return a response and the GPU command processor
will wait for all TCC invalidates to be complete before launching the
next kernel.

This fixes issues with stale data in nanoGPT and possibly PENNANT.
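
A standalone C++ sketch of the ordering described above, assuming a simple outstanding-invalidate counter; the class and method names are hypothetical, not the gem5 command processor API.

    #include <cassert>
    #include <cstdint>
    #include <iostream>

    // Every BAR write that bypasses the TCC issues an invalidate; the next
    // kernel may not launch until all invalidate responses have returned.
    class KernelLaunchGate {
        int outstandingInvalidates = 0;
      public:
        void barWrite(uint64_t addr)
        {
            ++outstandingInvalidates;   // directed TCC invalidate issued
            std::cout << "invalidate TCC line for 0x" << std::hex << addr << "\n";
        }
        void invalidateAck()
        {
            assert(outstandingInvalidates > 0);
            --outstandingInvalidates;
        }
        bool canLaunchKernel() const { return outstandingInvalidates == 0; }
    };

    int main()
    {
        KernelLaunchGate gate;
        gate.barWrite(0x1000);
        std::cout << "launch ok? " << gate.canLaunchKernel() << "\n";  // 0: wait
        gate.invalidateAck();
        std::cout << "launch ok? " << gate.canLaunchKernel() << "\n";  // 1: safe
    }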

Change-Id: I8e1290f842122682c271e5508a48037055bfbcdf
2024-04-10 11:35:25 -07:00
Matt Sinclair
777ac91bb0 mem-ruby: Add categorization of bypassed atomics in TCC (#899)
Adds categorization of bypassed atomics in TCC to the TBE as either
return or no-return, which gets consumed in pa_performAtomic to
determine if atomic logs should be stored.

Reestablishes TCC bypassed atomics after #546.
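
A small C++ illustration of the return vs. no-return split recorded in the TBE; the AtomicKind enum, TBE struct, and performAtomicAdd helper are hypothetical names, not the SLICC actions.

    #include <cstdint>
    #include <iostream>

    // Only atomics that must return the old value need a log entry written
    // when the atomic is performed.
    enum class AtomicKind { Return, NoReturn };

    struct TBE { AtomicKind kind = AtomicKind::NoReturn; };

    uint32_t performAtomicAdd(uint32_t &mem, uint32_t operand, const TBE &tbe,
                              bool &logWritten)
    {
        uint32_t old = mem;
        mem += operand;
        logWritten = (tbe.kind == AtomicKind::Return);  // no-return: skip the log
        return old;
    }

    int main()
    {
        uint32_t line = 7;
        bool logged = false;
        TBE tbe{AtomicKind::Return};
        uint32_t old = performAtomicAdd(line, 5, tbe, logged);
        std::cout << "old=" << old << " new=" << line
                  << " logged=" << logged << "\n";      // old=7 new=12 logged=1
    }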

Change-Id: Ibc1fa2b795ef1c47c3893a0b1911fa7993522d38
2024-02-28 14:26:09 -06:00
Daniel Kouchekinia
de615836f0 mem-ruby: Add categorization of bypassed atomics in TCC
Adds categorization of bypassed atomics in TCC to the TBE as either return
or no-return, which gets consumed in pa_performAtomic to determine if
atomic logs should be stored.

Reestablishes TCC bypassed atomics after #546.

Change-Id: Ibc1fa2b795ef1c47c3893a0b1911fa7993522d38
2024-02-27 23:12:45 -06:00
Daniel Kouchekinia
6374697a20 mem-ruby: Add missing transition for SLC writes to VIPER TCC
Bypassed write-through requests on invalid lines in the TCC should be
written through to the directory. This transition was previously
missing.

Change-Id: I16b117c4e085ce6be0ed5297aa0129d52cd35a51
2024-02-26 13:13:06 -06:00
Matthew Poremba
6e433ed885 mem-ruby: Fixes for new AtomicWait event in VIPER TCC (#585)
The AtomicWait event was not being woken up properly due to the
numPending count in the TBE not being decremented. This patch decrements
the count when Data is returned. Since that moves to a base state, the
TBE should no longer be needed.

Additionally added a transition which stalls and waits when an AtomicWait
occurs while in the WI state so that it retries.

Change-Id: Ic8bfc700f9df3f95bea0799121898926a23d8163
2023-11-22 14:05:43 -08:00
Matt Sinclair
c3326c78e6 mem-ruby, gpu-compute: fix SQC/TCP requests to same line
Currently, the GPU SQC (L1I$) and TCP (L1D$) have a performance bug
where they do not behave correctly when multiple requests to the same
cache line overlap one another.  The intended behavior is that if the
first request that arrives at the Ruby code for the SQC/TCP misses, it
should send a request to the GPU TCC (L2$).  If any requests to the
same cache line occur while this first request is pending, they should
wait locally at the L1 in the MSHRs (TBEs) until the first request has
returned.  At that point they can be serviced, and assuming the line
has not been evicted, they should hit.

For example, in the following test (on 1 GPU thread, in 1 WG):

load Arr[0]
load Arr[1]
load Arr[2]

The expected behavior (confirmed via profiling on real GPUs) is that
we should get 1 miss (Arr[0]) and 2 hits (Arr[1], Arr[2]) for such a
program.

However, the current support in the VIPER SQC/TCP code does not model
this correctly.  Instead it lets all 3 concurrent requests go straight
through to the TCC instead of stopping the Arr[1] and Arr[2] requests
locally while Arr[0] is serviced.  This causes all 3 requests to be
classified as misses.

To resolve this, this patch adds support into the SQC/TCP code to
prevent subsequent, concurrent requests to a pending cache line from being
sent in parallel with the original one.  To do this, we add an
additional transient state (IV) to indicate that a load is pending to
this cache line.  If a subsequent request of any kind to the same cache
line occurs while this load is pending, the requests are put on the
local wait buffer and woken up when the first request returns to the
SQC/TCP.  Likewise, when the first load is returned to the SQC/TCP, it
transitions from IV --> V.
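
A toy C++ model of the IV mechanism described above, assuming a per-line wait queue; the states and names are illustrative, not the actual TCP/SQC SLICC.

    #include <cstdint>
    #include <deque>
    #include <iostream>
    #include <string>
    #include <unordered_map>

    // The first load miss moves a line from I to IV; later requests to the
    // same line are parked; the fill moves IV to V and wakes the parked
    // requests so they now hit.
    enum class State { I, IV, V };

    struct Line {
        State state = State::I;
        std::deque<std::string> waiters;
    };

    struct L1 {
        std::unordered_map<uint64_t, Line> lines;

        void load(uint64_t addr, const std::string &who)
        {
            Line &l = lines[addr];
            switch (l.state) {
              case State::I:
                l.state = State::IV;              // miss: request sent to TCC
                std::cout << who << ": miss, request sent\n";
                break;
              case State::IV:
                l.waiters.push_back(who);         // wait locally in the TBE
                std::cout << who << ": pending, parked\n";
                break;
              case State::V:
                std::cout << who << ": hit\n";
                break;
            }
        }

        void fill(uint64_t addr)                  // data returns from the TCC
        {
            Line &l = lines[addr];
            l.state = State::V;
            for (auto &w : l.waiters)
                std::cout << w << ": woken, now hits\n";
            l.waiters.clear();
        }
    };

    int main()
    {
        L1 tcp;
        tcp.load(0x40, "Arr[0]");
        tcp.load(0x40, "Arr[1]");
        tcp.load(0x40, "Arr[2]");
        tcp.fill(0x40);        // expect 1 miss, 2 hits, matching real GPUs
    }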

As part of this support, additional transitions were also added to
account for corner cases such as what happens when the line is evicted
by another request that maps to the same set index while the first load
is pending (the line is immediately given to the new request, and when
the load returns it completes, wakes up any pending requests to the same
line, but does not attempt to change the state of the line) and how GPU
bypassing loads and stores should interact with the pending requests
(they are forced to wait if they reach the L1 after the pending,
non-bypassing load; but if they reach the L1 before the non-bypassing
load then they make sure not to change the state of the line from IV if
they return before the non-bypassing load).

As part of this change, we also move the MSHR behavior from internally
in the GPUCoalescer for loads to the Ruby code (like all other
requests).  This is important to get correct hits and misses in stats
and other prints, since the GPUCoalescer MSHR behavior assumes all
requests serviced out of its MSHR also miss if the original request to
that line missed.

Although the SQC does not support stores, the TCP does.  Thus,
we could have applied a similar change to the GPU stores at the TCP.
However, since the TCP support assumes write-through caches and does not
attempt to allocate space in the TCP, we elected not to add this support
since it seems to run contrary to the intended behavior (i.e., the
intended behavior seems to be that writes just bypass the TCP and thus
should not need to wait for another write to the same cache line to
complete).

Additionally, making these changes introduced issues with deadlocks at
the TCC.  Specifically, some Pannotia applications have accesses to the
same cache line where some of the accesses are GLC (i.e., they bypass
the GPU L1 cache) and others are non-GLC (i.e., they want to be cached
in the GPU L1 cache). We already have per-CU support for this in the code above.
However, the problem here is that these requests are coming from
different CUs and happening concurrently (seemingly because different
WGs are at different points in the kernel around the same time).
This causes a problem because our support at the TCC for the TBEs
overwrites the information about the GPU bypassing bits (SLC, GLC) every
time. The problem is when the second (non-GLC) load reaches the TCC, it
overwrites the SLC/GLC information for the first (GLC) load. Thus, when
the first load returns from the directory/memory, it no longer has
the GLC bit set, which causes an assert failure at the TCP.

After talking with other developers, it was decided the best way to handle
this and attempt to model real hardware more closely was to move the
point at which requests are put to sleep on the wakeup buffer from the
TCC to the directory. Accordingly, this patch includes support for that
-- now when multiple loads (bypassing or non-bypassing) from different
CUs reach the directory, all but the first one will be forced to wait
there until the first one completes, then will be woken up and
performed.  This required updating the WTRequestor information at the
TCC to pass the information about what CU performed the original request
for loads as well (otherwise since the TBE can be updated by multiple
pending loads, we can't tell where to send the final result to).  Thus,
I changed the field to be named CURequestor instead of WTRequestor since
it is now used for more than stores.  Moreover, I also updated the
directory to take this new field and the GLC information from incoming
TCC requests and then pass that information back to the TCC on the
response -- without doing this, because the TBE can be updated by
multiple pending, concurrent requests we cannot determine if this memory
request was a bypassing or non-bypassing request.  Finally, these
changes introduced a lot of additional contention and protocol stalls at
the directory, so this patch converted all directory uses of z_stall to
instead put requests on the wakeup buffer (and wake them up when the
current request completes) instead. Without this, protocol stalls cause
many applications to deadlock at the directory.

However, this exposed another issue at the TCC: other applications
(e.g., HACC) have a mix of atomics and non-atomics to the same cache
line in the same kernel.  Since the TCC transitions to the A state when
an atomic arrives, an atomic can move the line to A while earlier loads
to the same line are still pending.  For example, the first pending load
returns to the TCC from the directory and moves the TCC state to V, but
other loads to the line are still pending at the TCC; if an atomic then
moves the line to A, invalid transition errors occur when those pending
loads return, because the A state thinks they are atomics and decrements
the pending atomic count (plus the loads are never sent to the TCP as
returning loads).  This patch fixes this by changing the TCC TBEs to
track the number of pending requests, and by not allowing atomics to be
issued from the TCC until all prior, pending non-atomic requests have
returned.
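
A minimal C++ sketch of this final fix, assuming the TCC TBE tracks a pending non-atomic count that gates atomics; the names are illustrative, not the SLICC code.

    #include <iostream>

    // The TBE counts how many non-atomic requests to the line are still
    // outstanding; an atomic is held back until the count drains to zero.
    struct TccTBE {
        int pendingLoads = 0;
        bool atomicQueued = false;
    };

    bool tryIssueAtomic(TccTBE &tbe)
    {
        if (tbe.pendingLoads > 0) {
            tbe.atomicQueued = true;   // stall: earlier loads must return first
            return false;
        }
        return true;                   // safe to move the line to A now
    }

    int main()
    {
        TccTBE tbe;
        tbe.pendingLoads = 2;
        std::cout << "atomic issued? " << tryIssueAtomic(tbe) << "\n"; // 0
        tbe.pendingLoads = 0;
        std::cout << "atomic issued? " << tryIssueAtomic(tbe) << "\n"; // 1
    }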

Change-Id: I37f8bda9f8277f2355bca5ef3610f6b63ce93563
2023-11-15 19:23:51 -06:00
BujSet
65b44e6516 mem-ruby: Fix for not creating log entries on atomic no return requests (#546)
Augmenting Datablock and WriteMask to support optional arg to
distinguish between return and no return. In the case of atomic no
return requests, log entries should not be created when performing the
atomic.

Change-Id: Ic3112834742f4058a7aa155d25ccc4c014b60199a
2023-11-14 07:54:42 -08:00
Daniel Kouchekinia
be5c03ea9f mem-ruby,configs: Add GPU GLC Atomic Resource Constraints (#120)
Added a resource constraint, AtomicALUOperation, to GLC atomics
performed in the TCC.

The resource constraint uses a new class, an ALUFreeList array (see the
sketch after the parameter list below). The class assumes the following:
- There is a fixed number of atomic ALU pipelines
- While a new cache line can be processed in each pipeline each cycle,
a cache line that is currently going through a pipeline can't be
processed again until it's finished

Two configuration parameters have been used to tune this behavior:
- tcc-num-atomic-alus corresponds to the number of atomic ALU pipelines
- atomic-alu-latency corresponds to the latency of atomic ALU pipelines
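
A standalone C++ sketch of the assumed ALU free-list behavior, parameterized by the number of ALU pipelines and the per-line latency; the class name and fields are hypothetical, not the gem5 code.

    #include <cstdint>
    #include <iostream>
    #include <vector>

    // A fixed number of atomic ALU pipelines; each can accept one new cache
    // line per cycle, but a line already in a pipeline cannot re-enter until
    // its previous pass has drained.
    class AtomicAluFreeList {
        struct Slot { bool busy = false; uint64_t line = 0; uint64_t freeAt = 0; };
        std::vector<Slot> alus;
        uint64_t latency;
      public:
        AtomicAluFreeList(int numAlus, uint64_t aluLatency)
            : alus(numAlus), latency(aluLatency) {}

        // Try to start an atomic for `line` at cycle `now`; fail if the same
        // line is still in flight or no pipeline is available.
        bool tryIssue(uint64_t line, uint64_t now)
        {
            for (auto &s : alus)
                if (s.busy && s.line == line && now < s.freeAt)
                    return false;                     // line already in a pipeline
            for (auto &s : alus) {
                if (!s.busy || now >= s.freeAt) {
                    s = {true, line, now + latency};  // occupy until latency elapses
                    return true;
                }
            }
            return false;                             // every ALU pipeline is busy
        }
    };

    int main()
    {
        AtomicAluFreeList alus(/*numAlus=*/2, /*latency=*/4);
        std::cout << alus.tryIssue(0x40, 0) << "\n";  // 1: first atomic starts
        std::cout << alus.tryIssue(0x40, 1) << "\n";  // 0: same line still busy
        std::cout << alus.tryIssue(0x80, 1) << "\n";  // 1: second ALU is free
    }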

Change-Id: I25bdde7dafc3877590bb6536efdf57b8c540a939
2023-11-14 07:48:48 -08:00
Matt Sinclair
48fde5a9c6 mem-ruby, gpu-compute: fix formatting of TCC (#536)
mem-ruby, gpu-compute: fix formatting of TCC

Fix several improperly indented prints and remove extraneous blank lines
in the SLICC code for the GPU TCC (L2 cache).
2023-11-13 15:01:30 -08:00
Daniel Kouchekinia
1204267fd8 mem-ruby: SLICC Fixes to GLC Atomics in WB L2 (#397)
Made the following changes to fix the behavior of GLC atomics in a WB
L2:
- Stored the atomic write mask in the TBE for GLC atomics on an invalid
line that bypass to the directory but have their atomics performed on
the return path.
- Replaced the !presentOrAvail() check for bypassing atomics to the
directory (which will then be performed on the return path) with a
check for an invalid line state.
- Replaced the wdb_writeDirtyBytes action used when performing atomics
with the owm_orWriteMask action, which doesn't write from the invalid
atomic request data block.
- Fixed atomic return path actions.

Change-Id: I6a406c313d2f9c88cd75bfe39187ef94ce84098f
2023-11-09 13:15:10 -08:00
Daniel Kouchekinia
4931fb0010 mem-ruby: Always pass on GPU atomics to dir in write-through TCC (#367)
Added checks to ensure that atomics are not performed in the TCC when it
is configured as a write-through cache. Also added an SLC bit overwrite to
ensure the directory performs atomics when there is a write-through TCC.

Change-Id: I4514e6c8022aeb7785f2c59871cd9acec8161ed8
2023-10-14 06:39:50 -07:00
Vishnu Ramadas
085789d00c mem-ruby: Add flush support to GPU_VIPER protocol
This commit adds flush support to the GPU VIPER coherence protocol. The
L1 cache will now initiate a flush request if the packet it receives
is of type RubyRequestType_FLUSH. During the flush process, the L1 cache
will send a request to the L2 whether it is in the V or I state. The L2
will issue a flush request to the directory if its cache line is in the
valid state, before invalidating its copy. The directory, on receiving
this request, writes the data to memory and sends an ack back to the L2.
The L2 forwards this ack back to the L1, which then ends the flush by
calling the write callback.
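
A small C++ walk-through of the flush handshake as read from this description; the FlushStep enum and transition order are an interpretation, not the actual SLICC states.

    #include <iostream>

    // One hop per protocol message in the flush described above.
    enum class FlushStep { L1Flush, L2Flush, DirWriteback, AckToL2, AckToL1, Done };

    FlushStep next(FlushStep s)
    {
        switch (s) {
          case FlushStep::L1Flush:      return FlushStep::L2Flush;      // L1 forwards flush to L2
          case FlushStep::L2Flush:      return FlushStep::DirWriteback; // valid L2 line flushed to dir
          case FlushStep::DirWriteback: return FlushStep::AckToL2;      // dir writes memory, acks
          case FlushStep::AckToL2:      return FlushStep::AckToL1;      // L2 forwards the ack
          case FlushStep::AckToL1:      return FlushStep::Done;         // L1 calls the write callback
          default:                      return FlushStep::Done;
        }
    }

    int main()
    {
        int hops = 0;
        for (FlushStep s = FlushStep::L1Flush; s != FlushStep::Done; s = next(s))
            ++hops;
        std::cout << "flush completes after " << hops << " protocol steps\n";
    }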

Change-Id: I9dfc0c7b71a1e9f6d5e9e6ed4977c1e6a3b5ba46
2023-10-02 19:05:10 -05:00
Ranganath (Bujji) Selagamsetty
f6a453362f mem: Atomic ops to same address
Augmenting the DataBlock class with a change log structure to
record the effects of atomic operations on a data block and
service these changes if the atomic operations require return
values.

Although the operations are atomic, the coalescer need not
send unique memory requests for each operation. Atomic
operations within a wavefront to the same address are now
coalesced into a single memory request. The response of this
request carries all the necessary information to provide the
requesting lanes unique values as a result of their individual
atomic operations. This helps reduce contention for request
and response queues in simulation.

Previously, only the final value of the data block after all
atomic ops to the same address was visible to the requesting
waves. This change corrects this behavior by allowing each wave
to see the effect of its individual atomic op if a return value
is necessary.
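
A standalone C++ sketch of the change-log idea, assuming atomic adds applied in lane order with each pre-op value recorded; AtomicChangeLog is a hypothetical name, not the DataBlock API.

    #include <cstdint>
    #include <iostream>
    #include <vector>

    // Coalesced atomic adds to the same word are applied in order, and the
    // value seen *before* each op is recorded so every requesting lane can
    // get its own return value rather than only the final word.
    struct AtomicChangeLog {
        std::vector<uint32_t> preOpValues;

        uint32_t applyAdd(uint32_t &word, uint32_t operand)
        {
            preOpValues.push_back(word);   // what this lane would have read
            word += operand;
            return preOpValues.back();
        }
    };

    int main()
    {
        uint32_t word = 10;
        AtomicChangeLog log;
        // Three lanes of one wavefront, coalesced into a single request.
        for (uint32_t lane = 0; lane < 3; ++lane)
            std::cout << "lane " << lane << " returns "
                      << log.applyAdd(word, 1) << "\n";   // 10, 11, 12
        std::cout << "final value " << word << "\n";       // 13
    }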

Change-Id: I639bea943afd317e45f8fa3bff7689f6b8df9395
2023-08-23 14:45:25 -05:00
Daniel Kouchekinia
984499329d mem-ruby,configs: Add GLC Atomic Latency VIPER Parameter (#110)
Added a GLC atomic latency parameter (glc-atomic-latency) used when
enqueueing response messages regarding atomics directly performed in
the TCC. This latency is added in addition to the L2 response latency
(TCC_latency). This represents the latency of performing an atomic
within the L2.

With this change, the TCC response queue will receive enqueues with
varying latencies as GLC atomic responses will have this added GLC
atomic latency while data responses will not. To accommodate this in
light of the queue having strict FIFO ordering (which would be violated
here), this change also adds an optional parameter bypassStrictFIFO to
the SLICC enqueue function which allows overriding strict FIFO
requirements for individual messages on a case-by-case basis. This
parameter is only being used in the TCC's atomic response enqueue call.
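
A minimal C++ illustration of why strict FIFO ordering is violated: a later data enqueue can become ready before an earlier GLC atomic enqueue. Here messages are simply ordered by ready time; the real change instead flags individual SLICC enqueues, and all names below are illustrative.

    #include <cstdint>
    #include <iostream>
    #include <queue>
    #include <vector>

    // Responses carry different latencies: data pays TCC_latency alone,
    // GLC atomics pay TCC_latency plus the GLC atomic latency.
    struct Msg { uint64_t readyAt; const char *what; };
    struct Later {
        bool operator()(const Msg &a, const Msg &b) const { return a.readyAt > b.readyAt; }
    };

    int main()
    {
        const uint64_t now = 100, tccLatency = 4, glcAtomicLatency = 8;
        std::priority_queue<Msg, std::vector<Msg>, Later> respQueue;

        respQueue.push({now + tccLatency + glcAtomicLatency, "GLC atomic response"});
        respQueue.push({now + tccLatency, "data response"});  // enqueued later, ready sooner

        while (!respQueue.empty()) {
            std::cout << respQueue.top().readyAt << ": " << respQueue.top().what << "\n";
            respQueue.pop();
        }
    }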

Change-Id: Iabd52cbd2c0cc385c1fb3fe7bcd0cc64bdb40aac
2023-07-23 15:57:06 -05:00
Daniel Kouchekinia
1705853b12 mem-ruby: Added support for non-system-scope atomics in VIPER (#101)
Added support for performing non-SLC-set atomics in the TCC.
Previously, all atomics were being passed on by the TCC to the
directory. With this change, atomics will only be passed on if the SLC
bit is set or if the line isn't present or available in the TCC.

If a non-SLC atomic is passed on to the directory because it is not
present in the TCC, the atomic will be performed on the return path on
the Data event. To accommodate the directory not performing the atomic
in this case, this change also passes the SLC bit on to the directory.

The previously-named "Atomic" action has been renamed to
"AtomicPassOn", with the new "Atomic" corresponding to an atomic
performed directly in the TCC.
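
A small C++ decision sketch of the routing policy described above; the routeAtomic helper and Action enum are hypothetical names, not the SLICC trigger logic.

    #include <iostream>

    // A non-SLC atomic is performed in the TCC when the line is present or
    // available; otherwise it is passed on to the directory and, for the
    // not-present case, performed on the return path.
    enum class Action { AtomicInTCC, AtomicPassOn };

    Action routeAtomic(bool slcSet, bool presentOrAvail)
    {
        if (slcSet || !presentOrAvail)
            return Action::AtomicPassOn;   // directory (or return path) handles it
        return Action::AtomicInTCC;        // performed directly in the TCC
    }

    int main()
    {
        std::cout << (routeAtomic(false, true)  == Action::AtomicInTCC)  << "\n"; // 1
        std::cout << (routeAtomic(false, false) == Action::AtomicPassOn) << "\n"; // 1
        std::cout << (routeAtomic(true,  true)  == Action::AtomicPassOn) << "\n"; // 1
    }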

Change-Id: Ibf92f71ddceb38bd1b0da70b0a786cc4c3cf2669
2023-07-20 11:48:08 -05:00
Daniel Kouchekinia
f8f5dd98bf mem-ruby: Added WIB State to VIPER TCC Cache (#67)
Added WIB (Waiting on Writethrough Ack; Will be Bypassed) state which
is transitioned to when a dirty line in the TCC is evicted in a
bypassed read. Previously, we were transitioning to invalid.

While a WI (Waiting on Writethrough Ack) state exists, transitions from
it on WBAck deallocates the TBE, which contains SLC bit information
needed to trigger the Bypass event when the read response from the
directory comes in.

Without this change, WB acknowledgements from the directory in read
bypass evicts (with the SLC bit set) were being treated as if they were
read responses, leading to an invalid transition panic.

Change-Id: I703c3fe8af0366856552bb677810cb1a8f2896de
2023-07-17 10:17:47 -07:00
Matt Sinclair
a030ff2745 mem-ruby: fix atomic deadlock with WB GPU L2 caches
By default the GPU VIPER coherence protocol uses a WT L2 cache.
However it has support for using WB caches (although this is not
tested currently).  When using a WB L2 cache for the GPU, this
results in deadlocks with atomics.

Specifically, when an atomic reaches the L2 and the line is
currently in M or W, the line must be written back before the atomic
can be performed.  However, the current support has two issues:

a) it never performs the atomic operation -- while VIPER currently
assumes all atomics are system scope atomics and thus cannot be
performed at the L2 and this transition requires the dirty line be
written back before performing the atomic, the transition never
performs the atomic nor does the response path handle it.
b) putting the atomic action right after the write back is not
safe because we need to ensure the requests are ordered when they
reach memory -- thus we have to wait until the write back is
acknowledged before it's safe to send/perform the atomic.

To fix this, this change modifies the transition in question to
put the atomic on the stalled requests buffer, which the WBAck will
check when it returns to the L2 (and thus perform the atomic, which
will result in the atomic being sent on to the directory).

This fix has been tested and verified with both the per-checkin and
nightly GPU Ruby Random tester tests (with a WB L2 cache).

Change-Id: I9a43fd985dc71297521f4b05c47288d92c314ac7
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/68978
Maintainer: Bobby Bruce <bbruce@ucdavis.edu>
Reviewed-by: Matthew Poremba <matthew.poremba@amd.com>
Tested-by: kokoro <noreply+kokoro@google.com>
2023-03-22 04:00:38 +00:00
Matt Sinclair
92d920f994 mem-ruby: fix load deadlock with WB GPU L2 caches
By default the GPU VIPER coherence protocol uses a WT L2 cache.
However it has support for using WB caches (although this is not
tested currently).  When using a WB L2 cache for the GPU, this
results in deadlocks with loads.

Specifically, when a load reaches the L2 and the line is currently
in the W state, that line must be written back before the load can
be performed.  However, the current transition for this in the L2
did not attempt to retry the load when the WB completes, resulting
in a deadlock.  This deadlock can be replicated by running the GPU
Ruby random tester as is with a WB L2 cache instead of a WT L2
cache.

To fix this, this change modifies the transition in question to
put the load on the stalled requests buffer, which the WBAck will
check when it returns to the L2 (and thus perform the load).

This fix has been tested and verified with both the per-checkin and
nightly GPU Ruby Random tester tests (with a WB L2 cache).

Change-Id: Ieec4f61a3070cf9976b8c3ef0cdbd0cc5a1443c6
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/68977
Reviewed-by: Matthew Poremba <matthew.poremba@amd.com>
Maintainer: Bobby Bruce <bbruce@ucdavis.edu>
Tested-by: kokoro <noreply+kokoro@google.com>
2023-03-22 04:00:38 +00:00
Vishnu Ramadas
c23d7bb3ee gpu-compute, mem-ruby: Add p_popRequestQueue to some transitions
Two W->WI transitions, on the RdBlk and Atomic events in the GPU L2 cache
coherence protocol, do not clear the request from the request queue upon
completing the transition. This action is not performed in the response
path. This update adds the p_popRequestQueue action to each of these
transitions to remove the stale request from the queue.

Change-Id: Ia2679fe3dd702f4df2bc114f4607ba40c18d6ff1
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/67192
Reviewed-by: Jason Lowe-Power <power.jg@gmail.com>
Maintainer: Jason Lowe-Power <power.jg@gmail.com>
Tested-by: kokoro <noreply+kokoro@google.com>
2023-01-05 23:41:00 +00:00
Vishnu Ramadas
ddf43726ef gpu-compute, mem-ruby: Update GPU cache bypassing to use TBE
An earlier commit added support for GLC and SLC AMDGPU instruction
modifiers. These modifiers enable cache bypassing when set. The GLC/SLC
flag information was being threaded through all the way to memory and
back so that appropriate actions could be taken upon receiving a request
and corresponding response. This commit removes the threading and adds
the bypass flag information to the TBE. Requests populate this
entry and responses access it to determine the correct set of actions to
execute.

Change-Id: I20ffa6682d109270adb921de078cfd47fb4e137c
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/67191
Maintainer: Matt Sinclair <mattdsinclair@gmail.com>
Tested-by: kokoro <noreply+kokoro@google.com>
Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com>
Reviewed-by: Jason Lowe-Power <power.jg@gmail.com>
2023-01-05 23:38:32 +00:00
Vishnu Ramadas
66d4a15820 gpu-compute,mem-ruby: Add support for GPU cache bypassing
The GPU cache models do not support cache bypassing when the GLC or SLC
AMDGPU instruction modifiers are used in a load or store. This commit
adds cache bypass support by introducing new transitions in the
coherence protocol used by the GPU memory system. Now, instructions with
the GLC bit set will not cache in the L1, and instructions with the SLC
bit set will not cache in the L1 or L2.
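
A tiny C++ truth-table sketch of this bypass behavior; the bypassPolicy helper is a hypothetical name, not the protocol code.

    #include <iostream>

    // GLC skips the L1 (TCP); SLC skips both the L1 and the L2 (TCC).
    struct CachePolicy { bool cacheInL1; bool cacheInL2; };

    CachePolicy bypassPolicy(bool glc, bool slc)
    {
        return { !(glc || slc),   // GLC or SLC: do not allocate in the L1
                 !slc };          // SLC only:  also skip the L2
    }

    int main()
    {
        CachePolicy p = bypassPolicy(/*glc=*/true, /*slc=*/false);
        std::cout << "GLC load -> L1? " << p.cacheInL1
                  << " L2? " << p.cacheInL2 << "\n";   // 0 1
        p = bypassPolicy(false, true);
        std::cout << "SLC load -> L1? " << p.cacheInL1
                  << " L2? " << p.cacheInL2 << "\n";   // 0 0
    }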

Change-Id: Id29a47b0fa7e16a21a7718949db802f85e9897c3
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/66991
Reviewed-by: Jason Lowe-Power <power.jg@gmail.com>
Maintainer: Matt Sinclair <mattdsinclair@gmail.com>
Tested-by: kokoro <noreply+kokoro@google.com>
Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com>
2023-01-03 21:19:24 +00:00
Kyle Roarty
f876e60bc2 mem-ruby: Fix deadlock in GPU VIPER TCC
A deadlock occurred where we got a RdBlk while in W,
which put us in WI while waiting for a writeback to complete.

This would cause the request to be stalled while the writeback
was occurring, but when the writeback completed (WBAck), we never
woke up the requests and thus never completed the RdBlk.

This commit adds a wakeup when we receive a WBAck while in WI.

Change-Id: I01edf1d7a47757b4f680baf9f33a1a6aa37e7e25
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/59352
Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com>
Maintainer: Matt Sinclair <mattdsinclair@gmail.com>
Reviewed-by: Matthew Poremba <matthew.poremba@amd.com>
Tested-by: kokoro <noreply+kokoro@google.com>
2022-06-06 18:28:52 +00:00
Matthew Poremba
9313294efe misc: Remove AMD license addition
Remove the line "For use for simulation and test purposes only" in files
where AMD is the only copyright holder listed in the header. This happens
to be the case for all files where this line exists, removing it
completely from gem5.

Change-Id: I623f266b002f564301b28774f49081099cfc60fd
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/53943
Reviewed-by: Jason Lowe-Power <power.jg@gmail.com>
Maintainer: Jason Lowe-Power <power.jg@gmail.com>
Tested-by: kokoro <noreply+kokoro@google.com>
2021-12-11 04:00:56 +00:00
Matt Sinclair
118677218d mem-ruby: fix typo in GPU VIPER TCC comment
72ee6d1a fixed a deadlock in the GPU VIPER TCC.  However, it
inadvertently added a typo to the comments explaining the change.  This
commit fixes that.

Change-Id: Ibba835aa907be33fc3dd8e576ad2901d5f8f509c
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/51687
Maintainer: Matt Sinclair <mattdsinclair@gmail.com>
Reviewed-by: Jason Lowe-Power <power.jg@gmail.com>
Tested-by: kokoro <noreply+kokoro@google.com>
2021-10-17 04:07:49 +00:00
Matt Sinclair
1120931105 mem-ruby: Move VIPER TCC decrements to action from in_port
Currently, the GPU VIPER TCC protocol handles races between atomics in
the triggerQueue_in.  This in_port does not check for resource
availability, which can cause the trigger queue to execute multiple
times.  Although this is the expected behavior, the code for handling
atomic races decrements the atomicDoneCnt flag in the trigger queue,
which is not safe since resource contention may cause it to execute
multiple times.

To resolve this issue, this commit moves the decrementing of this
counter to a new action that is called in an event that happens only
when the race between atomics is detected.

Change-Id: I552fd4f34fdd9ebeec99fb7aeb4eeb7b150f577f
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/51368
Reviewed-by: Jason Lowe-Power <power.jg@gmail.com>
Reviewed-by: Matthew Poremba <matthew.poremba@amd.com>
Maintainer: Jason Lowe-Power <power.jg@gmail.com>
Tested-by: kokoro <noreply+kokoro@google.com>
2021-10-08 22:03:13 +00:00
Matt Sinclair
72ee6d1aad mem-ruby: Update GPU VIPER TCC protocol to resolve deadlock
In the GPU VIPER TCC, programs with mixes of atomics and data
accesses to the same address, in the same kernel, can experience
deadlock when large applications (e.g., Pannotia's graph analytics
algorithms) are running on very small GPUs (e.g., the default 4 CU GPU
configuration).  In this situation, deadlocks occur due to resource
stalls interacting with the behavior of the current implementation for
handling races between atomic accesses.  The specific order of events
causing this deadlock are:

1. TCC is waiting on an atomic to return from directory

2. In the meantime it receives another atomic to the same address -- when
this happens, the TCC increments the number of atomics to this address
that are pending in the TBE (numAtomics = 2), and does a write through of
the atomic to the directory.

3. When the first atomic returns from the Directory, it decrements the
numAtomics counter.  numAtomics was at 2 though, because of step #2.  So
it doesn't deallocate the TBE entry and calls Event:AtomicNotDone.

4. Another request (an LD) to the same address comes along.  The LD does
z_stall since the second atomic is pending -- so the LD retries every
cycle until the deadlock counter times out (or until the second atomic
comes back).

5.  The second atomic returns to the TCC.  However, because there are so
many LD's pending in the cache, all doing z_stall's and retrying every cycle,
there are a lot of resource stalls.  So, when the second atomic returns, it is
forced to retry its operation multiple times -- and each time it decrements
the atomicDoneCnt flag (which was added in 7246f70bfb to catch a race between
atomics arriving and leaving the TCC).  As a result, atomicDoneCnt becomes
negative.

6.  Since this atomicDoneCnt flag is used to determine when Event:AtomicDone
happens, and since the resource stalls caused the atomicDoneCnt flag to become
negative, we never complete the atomic.  Which means the pending LD can never
access the line, because it's stuck waiting for the atomic to complete.

7.  Eventually the deadlock threshold is reached.

To fix this issue, this commit changes the VIPER TCC protocol from using
z_stall to using the stall_and_wait buffer method that the
Directory-level of the SLICC already uses.  This change effectively
prevents resource stalls from dominating the TCC level, by putting
pending requests for a given address in a per-address stall buffer.
These requests are then woken up when the pending request returns.

As part of this change, this change also makes two small changes to the
Directory-level protocol (MOESI_AMD_BASE-dir):

1.  Updated the names of the wakeup actions to match the TCC wakeup actions,
to avoid confusion.

2.  Changed transition(B, UnblockWriteThrough, U) to check all stall buffers,
as some requests were being placed later in the stall buffer than was
being checked.  This mirrors the changes in 187c44fe44 to other Directory
transitions to resolve races between GPU and DMA requests, but for
transitions prior workloads did not stress.

Change-Id: I60ac9830a87c125e9ac49515a7fc7731a65723c2
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/51367
Reviewed-by: Jason Lowe-Power <power.jg@gmail.com>
Reviewed-by: Matthew Poremba <matthew.poremba@amd.com>
Maintainer: Jason Lowe-Power <power.jg@gmail.com>
Tested-by: kokoro <noreply+kokoro@google.com>
2021-10-08 22:03:13 +00:00
Matt Sinclair
0eef1069cb ruby: fix typo in VIPER TCC triggerQueue
The GPU VIPER TCC protocol accidentally used "TiggerMsg" instead
of "TriggerMsg" for the triggerQueue_in port.  This was a benign
bug because the msg type is not used in the in_port implementation
but still makes the SLICC harder to understand, so fixing it is
worthwhile.

Change-Id: I88cbc72bac93bcc58a66f057a32f7bddf821cac9
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/44905
Reviewed-by: Jason Lowe-Power <power.jg@gmail.com>
Reviewed-by: Matthew Poremba <matthew.poremba@amd.com>
Maintainer: Jason Lowe-Power <power.jg@gmail.com>
Tested-by: kokoro <noreply+kokoro@google.com>
2021-04-28 01:25:45 +00:00
Matthew Poremba
7246f70bfb mem-ruby: Fix race related to atomics in VIPER
There is a race condition in VIPER where atomics issued to the same
address can result in multiple trigger messages signalling the
completion of the atomic operation. The first message was deallocating
the TBE, causing the second message to dereference a nullptr when
looking up the TBE.

A counter is added to track the number of in flight AtomicDone trigger
messages. The AtomicDone is not called until the last in flight message
arrives at the trigger queue. The remaining messages call AtomicNotDone
which simply pops the message from the queue and keeps the TBE
allocated.
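
A standalone C++ sketch of this counting scheme, where only the last in-flight trigger deallocates the TBE; all names are illustrative, not the SLICC code.

    #include <cassert>
    #include <iostream>

    // Each racing atomic completion enqueues a trigger message; only the
    // final in-flight message is allowed to deallocate the TBE.
    struct AtomicTBE {
        int inFlightDone = 0;
        bool allocated = true;
    };

    void atomicIssued(AtomicTBE &tbe) { ++tbe.inFlightDone; }

    // Returns true when this trigger is the last one (AtomicDone); earlier
    // triggers behave like AtomicNotDone and just pop the message.
    bool triggerArrived(AtomicTBE &tbe)
    {
        assert(tbe.inFlightDone > 0);
        if (--tbe.inFlightDone == 0) {
            tbe.allocated = false;     // safe to deallocate the TBE now
            return true;
        }
        return false;
    }

    int main()
    {
        AtomicTBE tbe;
        atomicIssued(tbe);             // first atomic to the line
        atomicIssued(tbe);             // racing atomic to the same line
        std::cout << triggerArrived(tbe) << " " << tbe.allocated << "\n"; // 0 1
        std::cout << triggerArrived(tbe) << " " << tbe.allocated << "\n"; // 1 0
    }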

Change-Id: Ie1de0436861a7c393ad6d2fb2faceb83c18d4cc3
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/39175
Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com>
Reviewed-by: Jason Lowe-Power <power.jg@gmail.com>
Maintainer: Matt Sinclair <mattdsinclair@gmail.com>
Tested-by: kokoro <noreply+kokoro@google.com>
2021-01-15 17:46:38 +00:00
Hoa Nguyen
4c42811ff3 mem-ruby: Move CacheMemory stats used in SLICC to a Stats group
This change moves some stats that are used in SLICC to a
separate Stats::Group.

In order to use stats in SLICC, new functions are added in
CacheMemory:
        - profileDemandHit()
        - profileDemandMiss()

The functions increase the corresponding stat by 1.
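
A standalone C++ mimic of that interface: the profileDemandHit()/profileDemandMiss() names come from this commit, while the surrounding class is only an illustration, not the gem5 Stats::Group code.

    #include <cstdint>
    #include <iostream>

    // A stats group owning the demand hit/miss counters; each profile
    // function bumps its counter by 1, as SLICC-generated actions would.
    class CacheMemoryStats {
        uint64_t demandHits = 0;
        uint64_t demandMisses = 0;
      public:
        void profileDemandHit()  { ++demandHits; }
        void profileDemandMiss() { ++demandMisses; }
        void dump() const
        {
            std::cout << "demand_hits     " << demandHits   << "\n"
                      << "demand_misses   " << demandMisses << "\n"
                      << "demand_accesses " << demandHits + demandMisses << "\n";
        }
    };

    int main()
    {
        CacheMemoryStats stats;
        stats.profileDemandHit();
        stats.profileDemandHit();
        stats.profileDemandMiss();
        stats.dump();
    }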

Change-Id: I52b6fefdf6579a49f626f2fca400641f90800017
Signed-off-by: Hoa Nguyen <hoanguyen@ucdavis.edu>
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/37815
Tested-by: kokoro <noreply+kokoro@google.com>
Reviewed-by: Jason Lowe-Power <power.jg@gmail.com>
Reviewed-by: Tiago Mück <tiago.muck@arm.com>
Maintainer: Jason Lowe-Power <power.jg@gmail.com>
2020-12-22 09:52:36 +00:00
Kyle Roarty
1339a1b080 mem-ruby: add cache hit/miss statistics for TCP and TCC
Change-Id: Ifa6fdbb9dd062a3684b9620eac6683c57e651a72
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/30174
Tested-by: kokoro <noreply+kokoro@google.com>
Reviewed-by: Bradford Beckmann <brad.beckmann@amd.com>
Maintainer: Bradford Beckmann <brad.beckmann@amd.com>
2020-06-20 04:20:45 +00:00
Tuan Ta
18ebe62598 mem-ruby: GCN3 and VIPER integration
This patch modifies the Coalescer and VIPER protocol to support memory
synchronization requests and write-completion responses that are
required by the upcoming GCN3 implementation.

VIPER protocol is simplified to be a solely write-through protocol.

Change-Id: Iccfa3d749a0301172a1cc567c59609bb548dace6
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/29913
Reviewed-by: Anthony Gutierrez <anthony.gutierrez@amd.com>
Reviewed-by: Jason Lowe-Power <power.jg@gmail.com>
Reviewed-by: Bradford Beckmann <brad.beckmann@amd.com>
Maintainer: Anthony Gutierrez <anthony.gutierrez@amd.com>
Maintainer: Bradford Beckmann <brad.beckmann@amd.com>
Tested-by: kokoro <noreply+kokoro@google.com>
2020-06-19 20:32:54 +00:00
Gabe Black
c08351f4d3 mem: Move ruby protocols into a directory called ruby_protocol.
Now that the gem5 protocols are split out, it would be nice to put them
in their own protocol directory. It's also confusing to have files
called *_protocol which are not in the protocol directory.

Change-Id: I7475ee111630050a2421816dfd290921baab9f71
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/20230
Reviewed-by: Gabe Black <gabeblack@google.com>
Reviewed-by: Andreas Sandberg <andreas.sandberg@arm.com>
Reviewed-by: Jason Lowe-Power <jason@lowepower.com>
Maintainer: Gabe Black <gabeblack@google.com>
Tested-by: kokoro <noreply+kokoro@google.com>
2019-08-23 21:13:07 +00:00