Commit Graph

1176 Commits

Author SHA1 Message Date
Giacomo Travaglini
36c1ea9c61 mem-ruby: Implement MakeReadUnique in CHI (#1101)
Change-Id: I64cd3c62804cca184d68287fc099534e9205f2b8
Reviewed-by: Tiago Muck <tiago.muck@arm.com>

Signed-off-by: Giacomo Travaglini <giacomo.travaglini@arm.com>
2024-05-06 08:30:59 +02:00
Nicholas Mosier
66decb2e93 mem-ruby: Fix functional reads for MESI Three-Level messages (#1045)
Fix #1044. This patch adds checks for message types (PUTX_COPY, DATA,
DATA_EXCLUSIVE) that contain data blocks but were missing from the
original `functionalRead` method in MESI Three-Level messages.

Change-Id: I0cedc314166c9cc037bf20f5b7fef5552dd1253c
2024-04-25 11:14:37 -07:00
Ivana Mitrovic
42ffa52907 mem-ruby: Implement no_alloc Far Atomics in CHI (#994)
This PR introduces a missing pice of far atomic implementation. This
pull request incorporates several changes:

- Enable 2-level and 4-level (and N-level) cache hierarchies, removing
Atomic_NoWait transactions
- Fix Unique Near policy implementation that raised abort
- Add support for alloc_on_atomic == False. Enables Far Atomics on
systems where the HNF does not allocate evicted lines at LLC (Like in
WriteUpdate).
2024-04-18 11:35:47 -07:00
Matthew Poremba
1d64669473 mem,gpu-compute: Implement GPU TCC directed invalidate
The GPU device currently supports large BAR which means that the driver
can write directly to GPU memory over the PCI bus without using SDMA or
PM4 packets. The gem5 PCI interface only provides an atomic interface
for BAR reads/writes, which means the values cannot go through timing
mode Ruby caches. This causes bugs as the TCC cache is allowed to keep
clean data between kernels for performance reasons. If there is a BAR
write directly to memory bypassing the cache, the value in the cache is
stale and must be invalidated.

In this commit a TCC invalidate is generated for all writes over PCI
that go directly to GPU memory. This will also invalidate TCP along the
way if necessary. This currently relies on the driver synchonization
which only allows BAR writes in between kernels. Therefore, the cache
should only be in I or V state.

To handle a race condition between invalidates and launching the next
kernel, the invalidates return a response and the GPU command processor
will wait for all TCC invalidates to be complete before launching the
next kernel.

This fixes issues with stale data in nanoGPT and possibly PENNANT.

Change-Id: I8e1290f842122682c271e5508a48037055bfbcdf
2024-04-10 11:35:25 -07:00
Matthew Poremba
833392e7b2 mem-ruby,gpu-compute: Allow memory reqs without inst
The GPUDynInst for sending memory requests through the CUs data port
is required but only used for DPRINTFs. Relax this constraint so that
the methods can be reused for requests such as probes generated by the
GPU device.

Change-Id: I16094e400968225596370b684d6471580888d98a
2024-04-10 11:35:24 -07:00
Víctor Soria Pardos
98358da968 mem-ruby: Implement Atomic No Alloc Policy
Add alternative implementation to far atomics when the flag alloc_on_commit
is false. The implementation fetches the data, performs the atomic and
writes back the cache line to main memory.

Co-authored-by: Fabian Schätzle <f.schaetzle@fz-juelich.de>
Change-Id: I8797fbc68448e1866a292f4afeedd3613113dddd
2024-04-06 18:51:11 +02:00
Víctor Soria Pardos
5a6a3be6da mem-ruby: Fix policy_type condition in CHI
Fix if-else condition in CHI-cache-actions to correctly
support policy_type Present Near (2)

Change-Id: Ib776d847a908a8ac7693c2d10405bc0c4a9d767d
2024-04-04 10:55:56 +02:00
Víctor Soria Pardos
7ee574b309 mem-ruby: Remove AtomicReturn_NoWait from CHI
To make Atomic transaction recursive and enable 2-level config,
remove AtomicReturn_NoWait and other level-dependent code

GitHub Issue: https://github.com/gem5/gem5/issues/882

Change-Id: Iac468cdb8a3b5914c8f05c5cedde866ce85f359a
2024-04-04 10:54:42 +02:00
Minje Jun
ffd0680a2c mem-ruby: Copyback UD_RU line when evicted in CHI protocol (#945)
This is a followed up fix to #791 mem-ruby: Fix possible dirty line loss
in CHI when ReadShared hit on UD line.
UD_RU line may have stale data since the upstream could have updated the
line, so its local cache line data is treated as invalid
(dataValid=false). But when the line is evicted, it must be written back
to downstream because the upstream may have the line in clean state
(UC). This change fixes it by performing copy back the UD_RU line while
keeping its dataValid as false.

Example error case:
- L3 was in UD_RSC and being evicted without back-invalidation. LLC (HN)
was in RU state.
- Because there's still upstream sharer, L3 sends WriteClean.
- Because the data state was unique and dirty, L3 sends CBWrData_UD_PD.
- LLC becomes UD_RU.
- When the line is evicted from LLC (LocalHN_Eviction), the line is just
dropped, causing the loss of the dirty copy

Co-authored-by: Minje Jun <minje.jun@samsung.com>
2024-04-03 08:33:22 -07:00
Matt Sinclair
777ac91bb0 mem-ruby: Add categorization of bypassed atomics in TCC (#899)
Adds categorization of bypassed atomics in TCC to the TBE as either
return or no-return, which gets consumed in pa_performAtomic to
determine if atomic logs should be stored.

Reestablishes TCC bypassed atomics after #546.

Change-Id: Ibc1fa2b795ef1c47c3893a0b1911fa7993522d38
2024-02-28 14:26:09 -06:00
Daniel Kouchekinia
de615836f0 mem-ruby: Add categorization of bypassed atomics in TCC
Adds categorization of bypassed atomics in TCC to the TBE as either return
or no-return, which gets consumed in pa_performAtomic to determine if
atomic logs should be stored.

Reestablishes TCC bypassed atomics after #546.

Change-Id: Ibc1fa2b795ef1c47c3893a0b1911fa7993522d38
2024-02-27 23:12:45 -06:00
Daniel Kouchekinia
6374697a20 mem-ruby: Add missing transition for SLC writes to VIPER TCC
Bypassed write though requests on invalid lines in the TCC should be
written though to the directory. This transition was previously
missing.

Change-Id: I16b117c4e085ce6be0ed5297aa0129d52cd35a51
2024-02-26 13:13:06 -06:00
Ivana Mitrovic
61ee36eee6 mem-ruby: Fix possible dirty line loss in CHI when ReadShared hit on UD line (#791)
In case ReadShared hit on a UD line and there's no sharers, this chage
makes the downstream passes Dirty to the requestor whenever possible
even though it doesn't deallocate the line. This will make the requestor
to SD and the downstream to UD_RSD.
In the previous implementation, loosely exclusive intermediate cache can
cause loss of dirty data. Example error condition is as below.
   
Configurations
L2 cache: Roughly inclusive to L1 without back-invalidation
- dealloc_on_* = false
- dealloc_backinv_* = false
L3 cache: Roughly exclusive to L2 without back-invalidation
- alloc_on_readshared = tue
- alloc_on_readunique = false
- dealloc_on_shared = false
- dealloc_on_unique = true
- dealloc_backinv_* = false
- is_HN = false
LLC: Same clusivity as L3 except is_HN = true
For all caches, allow_SD = true and fwd_unique_on_readshared = false
    
Example problem sequence:
1. L1 sends ReadUnique then becomes UD. L2 is UC_RU. L3 and LLC are RU.
2. L1 evicts the line to L2 by WriteBackFull (UD_PD). L2 becomes UD.
3. L2 evicts the line to L3 using WriteBackFull (UD_PD). L3 becomes UD.
4. L1 reads the line with ReadShared which misses on L2.
5. L2 reads the line with ReadShared which hits on L3. L3 becomes UD_RSC
because it doesn't deallocate the line (dataToBeInvalid=false)
6. L3 evicts the line to LLC by WriteCleanFull (UD_PD) because L3
doesn't back-invalidate and still has sharer. The local cache line is
invalidated by Deallocate_CacheBlock. L3 becomes RUSC and LLC becomes
UD_RU.
7. When UD_RU is evicted at LLC, the UD_RU line is dropped expecting the
upstream to writeback, causing loss of dirty data
2024-02-26 10:06:17 -08:00
Vishnu Ramadas
690b2b9462 gpu-compute, mem-ruby: Add comments and reformat code
Change-Id: Id2b3886dce347fdcfcad22009a42b92febc00a6c
2024-02-09 12:17:24 -06:00
Vishnu Ramadas
0e93e6142a arch-vega, gpu-compute, mem-ruby: Remove extra empty lines
Change-Id: I18770ec7e38c4a992a0ae6de95b0be49ab4426c2
2024-02-09 12:17:24 -06:00
Vishnu Ramadas
23dc98ea72 mem-ruby: Add SQC cache invalidation support to GPU VIPER
This commit adds support for cache invalidation in GPU VIPER protocol's
SQC cache. To support this, the commit also adds L1 cache invalidation
framework in the Sequencer such that the Sequencer sends out an
invalidation request for each line in the cache and declares completion
once all lines are evicted.

Change-Id: I2f52eacabb2412b16f467f994e985c378230f841
2024-02-09 12:14:57 -06:00
Minje Jun
db5c71a919 mem-ruby: Pass UD on ReadShared hit only if SD is not allowed
This commit allows CompData_SD be sent when ReadShared hits on UD line and
the local cache keeps the line, unless the request doesn't allow SD.

Change-Id: I337f24c871cc4c19c5b5fb11f9b35c0a8eb7911c
2024-02-08 18:47:44 +09:00
Minje Jun
628be390a0 mem-ruby: Fix ReadShared hit handling on UD line
In case ReadShared hit on a UD line and there's no sharers, this chage
makes the downstream respond with Unique even though it doesn't deallocate
the line. This will make the requestor to UD and the downstream to UD_RU.
In the previous implementation, loosely exclusive intermediate cache can
cause loss of dirty data. Example sequence is as below.

Configurations
L2 cache: Roughly inclusive to L1 without back-invalidation
- dealloc_on_* = false
- dealloc_backinv_* = false
L3 cache: Roughly exclusive to L2 without back-invalidation
- alloc_on_readshared = tue
- alloc_on_readunique = false
- dealloc_on_shared = false
- dealloc_on_unique = true
- dealloc_backinv_* = false
- is_HN = false
LLC: Same clusivity as L3 except is_HN = true
For all caches, allow_SD = true and fwd_unique_on_readshared = false

Example problem sequence:
1. L1 sends ReadUnique then becomes UD. L2 is UC_RU. L3 and LLC are RU.
2. L1 evicts the line to L2 by WriteBackFull (UD_PD). L2 becomes UD.
3. L2 evicts the line to L3 using WriteBackFull (UD_PD). L3 becomes UD.
4. L1 reads the line with ReadShared which misses on L2.
5. L2 reads the line with ReadShared which hits on L3. L3 becomes UD_RSC
   because it doesn't deallocate the line (dataToBeInvalid=false)
6. L3 evicts the line to LLC by WriteCleanFull (UD_PD) because L3 doesn't
   back-invalidate and still has sharer. The local cache line is
   invalidated by Deallocate_CacheBlock.
   L3 becomes RUSC and LLC becomes UD_RU.
7. When UD_RU is evicted at LLC, the UD_RU line is dropped expecting the
   upstream to writeback, causing loss of dirty data.

Change-Id: Ic9bee27f2ec8906dd5df8bd3be60e5a9a76c782f
2024-02-08 18:47:44 +09:00
Minje Jun
1b5d92ee9c mem-ruby: Revert Writeback CHI UD_RU line at local evict
This reverts commit d613d814a431525e122552a667eed653a057f2be.

Change-Id: I50e218b7debf3a2836ce12515d8fcb6c0b38df53
2024-02-08 18:47:44 +09:00
Minje Jun
e141d9e4d0 mem-ruby: Writeback CHI UD_RU line at local evict
In Ruby CHI protocol UD_RU state means the line is in UD state in
the local cache and the upstream may have it in UD or UC state.
In the previous implementation UD_RU line was just dropped without
WriteBack which can cause loss of dirty data when the upstream has it
in UC state.
This commit fixes it by performing WriteBack when evciting UD_RU line.

Change-Id: I1db9b4f95cc576e71dcef38b01de24775df514ba
2024-02-08 18:47:44 +09:00
Ivana Mitrovic
24e0d71034 arch-gcn3: Remove gcn3 (#781)
Related to issue #703 , this PR removes GCN3 related files and updates
source code, documentation, and tests to switch over to Vega is that was
not done already. Highlights are:

 - Remove all src/arch/amdgpu/gcn3 files and update Kconfigs.
 - Remove references to GCN3 and replace with Vega where applicable.
- Update the build targets in the gcn-gpu Docker. This will need to be
rebuilt but not urgently.
- Remove the GCN3 tag in testlib. Most tests seem to be using Vega
already, so that commit is small.
2024-01-25 10:14:46 -08:00
Matthew Poremba
63caa780c2 misc: Remove all references to GCN3
Replace instances of "GCN3" with Vega. Remove gfx801 and gfx803. Rename
FIJI to Vega and Carrizo to Raven.

Using misc since there is not enough room to fit all the tags.

Change-Id: Ibafc939d49a69be9068107a906e878408c7a5891
2024-01-17 11:11:06 -06:00
Nitish Arya
c2a22b03b4 mem-ruby: fix ruby startup() to reset exit event correctly (#773)
When restoring the simulate_limit_event pointer is not
restored after running the dry simulation run which ends up in
"Panic: event not found!"
In this commit we fix this issue by correctly restoring
the pointer value along with the event queue head

Change-Id: Id5ad4d2a270a6cd34eec1dc5c9b170b2b84610d4

---------

Co-authored-by: narya <nitish.arya@bsc.es>
Co-authored-by: Jason Lowe-Power <jason@lowepower.com>
2024-01-17 08:41:10 -08:00
Tiago Mück
b652ab8558 mem-ruby: fix missing txnId for prefetch requests (#734)
Internal prefetch message generation at AllocateTBE_PfRequest was
missing the expected txnId value.

Change-Id: I7d1ead24db947a15133f6ec45b27a47c70096682

Signed-off-by: Tiago Mück <tiago.muck@arm.com>
2024-01-04 07:55:11 -08:00
Giacomo Travaglini
4f5d4b9baf mem-ruby: Implement WriteUniqueZero CHI transaction (#692)
The WriteUniqueZero is an immediate write to a Snoopable address region
that does not require any data transfer (cacheline is zeroed)

Change-Id: Ia8c9b40e08a3b7d613f0b62ce0ac4b0547860871

Reviewed-by: Tiago Muck <tiago.muck@arm.com>

Signed-off-by: Giacomo Travaglini <giacomo.travaglini@arm.com>
2023-12-19 11:12:50 +00:00
Giacomo Travaglini
a008cd2611 mem-ruby: Implement a dummy StashOnceShared/Unique (#688)
Stash requests will simply be discarded by the Home Node This will
return a CompI response to the RNF

Change-Id: I9c2ce5d4d42f380d1a554933d381cf8a8590ba22

Reviewed-by: Tiago Muck <tiago.muck@arm.com>

Signed-off-by: Giacomo Travaglini <giacomo.travaglini@arm.com>
2023-12-16 14:43:45 -08:00
Jason Lowe-Power
895944fa27 mem-ruby: Fix compile error in chi-dvm-funcs (#646)
clang correctly found that the functions `inCache`, `hasBeenPrefetched`
and `inMissQueue` had the wrong signatures in the DVM funcs files. These
functions are unused, so this change just updates their signatures.

Change-Id: Id669ff661e1c6c46eaf04ea1f17cd9866a9e49ed
2023-12-03 13:39:26 -08:00
Matt Sinclair
bd2838d18e mem-ruby: update CacheMemory RubyCache debug prints (#637)
Update the RubyCache debug flag prints in CacheMemory to be more
descriptive and make clearer what is happening in a given function.
This makes it easier to determine what is happening when looking at the
RubyCache debug flags prints.

Change-Id: Ieee172b6df0d100f4b1e8fe4bba872fc9cf65854
2023-12-01 16:11:52 -06:00
anoop
fc0a043950 mem-ruby: Unused L3CacheCntrl freed (#598)
Seems like the MOESI_AMD_Base-L3Cache.sm file is unused in the VIPER
protocol. It's confusing to have it in the GPU_VIPER.slicc file.
2023-12-01 13:01:19 -08:00
Matt Sinclair
0a2f9d4b18 mem-ruby: update CacheMemory RubyCache debug prints
Update the RubyCache debug flag prints in CacheMemory to be more
descriptive and make clearer what is happening in a given function.
This makes it easier to determine what is happening when looking at the
RubyCache debug flags prints.

Change-Id: Ieee172b6df0d100f4b1e8fe4bba872fc9cf65854
2023-12-01 12:31:52 -06:00
Jason Lowe-Power
b3e7af9d79 Support for classic prefetchers in Ruby (#502)
This patch adds supports for using the "classic" prefetchers with ruby
cache controllers.

This pull request includes a few commits making the changes in this
order:
- Refactor decouples the classic cache and prefetchers interfaces
- Extras probes for later integration with ruby
- General ruby-side support
- Adds support for the CHI protocol

Commit [mem-ruby: support prefetcher in CHI
protocol](2bdb65653b)
may be used as example on how to add support for other protocols.

JIRA issues that may be related to this pull request:
    https://gem5.atlassian.net/browse/GEM5-457
    https://gem5.atlassian.net/browse/GEM5-1112
2023-11-30 10:24:29 -08:00
Bobby R. Bruce
d11c40dcac misc: Run pre-commit run --all-files
This ensures `isort` is applied to all files in the repo.

Change-Id: Ib7ced1c924ef1639542bf0d1a01c5737f6ba43e9
2023-11-29 22:06:41 -08:00
Tiago Mück
0f8c60bce5 mem-ruby: add missing state for CHI Prefetch event
RUSC state is missing for the Prefetch event.

Change-Id: If440ac0052100dba295708471a75a24cd234c011
Signed-off-by: Tiago Mück <tiago.muck@arm.com>
2023-11-28 18:30:50 -06:00
Tiago Mück
91cf58871e mem-ruby: support prefetcher in CHI protocol
Use RubyPrefetcherProxy to support prefetchers in the
CHI cache controller

L1I/L1D/L2 prefechers can now be added by specifying a non-null
prefetcher type when configuring a CHI_RNF.

Related JIRA:
https://gem5.atlassian.net/browse/GEM5-457
https://gem5.atlassian.net/browse/GEM5-1112

Additional authors:
    Tuan Ta <tuan.ta2@arm.com>

Change-Id: I41dc637969acaab058b22a8c9c3931fa137eeace
Signed-off-by: Tiago Mück <tiago.muck@arm.com>
2023-11-28 18:30:50 -06:00
Tiago Mück
94d5cc17a2 mem-ruby,mem-cache: ruby supports classic pfs
This patch adds RubyPrefetcherProxy, which provides means to inject
requests generated by the "classic" prefetchers into a SLICC prefetch
queue. It defines defines notifyPf* functions to be used by protocols
to notify a prefetcher. It also includes the probes required to
interface with the classic implementation.
AbstractController defines the accessor needed to snoop the caches.

A followup patch will add support for RubyPrefetcherProxy in the
CHI protocol.

Related JIRA:
https://gem5.atlassian.net/browse/GEM5-457
https://gem5.atlassian.net/browse/GEM5-1112

Additional authors:
    Tuan Ta <tuan.ta2@arm.com>

Change-Id: Ie908150b510f951cdd6fd0fd9c95d9760ff70fb0
Signed-off-by: Tiago Mück <tiago.muck@arm.com>
2023-11-28 18:30:50 -06:00
Gabe Black
db3a6e8e84 scons: Use Kconfig to configure gem5.
These are not yet consumed by anything, but convert all the settings
from SCons variables to Kconfig variables.

If you have existing SConsopts files which need to be converted, you
should take a look at KCONFIG.md to learn about how kconfig is used in
gem5. You should decide if any variables need to be available to C++ or
kconfig itself, and whether those are options which should be detected
automatically, or should be up to the user. Options which should be
measured automatically should still be in SConsopts files, while user
facing options should be added to new or existing Kconfig files.

Generally, make sure you're storing c++/kconfig visible options in
env['CONF'][...]. Also remove references to sticky_vars since persistent
options should now be handled with kconfig, and export_vars since
everything in env['CONF'] is now exported automatically.

Switch SCons/gem5 to use Kconfig for configuration, except EXTRAS which
is still a sticky SCons variable. This is necessary because EXTRAS also
controls what config options exist. If it came from Kconfig itself, then
there would be a circular dependency. This dependency could
theoretically be handled by reparsing the Kconfig when EXTRAS
directories were added or removed, but that would be complicated, and
isn't supported by kconfiglib. It wouldn't be worth the significant
effort it would take to add it, just to use Kconfig more purely.

Change-Id: I29ab1940b2d7b0e6635a490452d05befe5b4a2c9
2023-11-23 08:26:10 +08:00
Matthew Poremba
6e433ed885 mem-ruby: Fixes for new AtomicWait event in VIPER TCC (#585)
The AtomicWait event was not being woken up properly due to the
numPending count in the TBE not being decremented. This patch decrements
the count when Data is returned. Since that moves to a base state, the
TBE should no longer be needed.

Additionally added a transition which stalls and wait when an AtomicWait
occurs while in WI state so that it retries.

Change-Id: Ic8bfc700f9df3f95bea0799121898926a23d8163
2023-11-22 14:05:43 -08:00
Hoa Nguyen
3009e0fb57 mem-ruby: Fix typo in CHI's Send_CompI (#579)
The destination for the response is set twice.
2023-11-20 21:38:13 -08:00
Giacomo Travaglini
4ca2efac16 mem-ruby: AtomicNoReturn should check comp_anr instead of comp_wu (#545)
The comp_anr parameter is currently unused. Both parameters (comp_wu and
comp_anr) are set to false by default

Change-Id: If09567504540dbee082191d46fcd53f1363d819f

Signed-off-by: Giacomo Travaglini <giacomo.travaglini@arm.com>
2023-11-16 15:20:51 -08:00
Matt Sinclair
c3326c78e6 mem-ruby, gpu-compute: fix SQC/TCP requests to same line
Currently, the GPU SQC (L1I$) and TCP (L1D$) have a performance bug
where they do not behave correctly when multiple requests to the same
cache line overlap one another.  The intended behavior is that if the
first request that arrives at the Ruby code for the SQC/TCP misses, it
should send a request to the GPU TCC (L2$).  If any requests to the
same cache line occur while this first request is pending, they should
wait locally at the L1 in the MSHRs (TBEs) until the first request has
returned.  At that point they can be serviced, and assuming the line
has not been evicted, they should hit.

For example, in the following test (on 1 GPU thread, in 1 WG):

load Arr[0]
load Arr[1]
load Arr[2]

The expected behavior (confirmed via profiling on real GPUs) is that
we should get 1 miss (Arr[0]) and 2 hits (Arr[1], Arr[2]) for such a
program.

However, the current support in the VIPER SQC/TCP code does not model
this correctly.  Instead it lets all 3 concurrent requests go straight
through to the TCC instead of stopping the Arr[1] and Arr[2] requests
locally while Arr[0] is serviced.  This causes all 3 requests to be
classified as misses.

To resolve this, this patch adds support into the SQC/TCP code to
prevent subsequent, concurrent requests to a pending cache line from being
sent in parallel with the original one.  To do this, we add an
additional transient state (IV) to indicate that a load is pending to
this cache line.  If a subsequent request of any kind to the same cache
line occurs while this load is pending, the requests are put on the
local wait buffer and woken up when the first request returns to the
SQC/TCP.  Likewise, when the first load is returned to the SQC/TCP, it
transitions from IV --> V.

As part of this support, additional transitions were also added to
account for corner cases such as what happens when the line is evicted
by another request that maps to the same set index while the first load
is pending (the line is immediately given to the new request, and when
the load returns it completes, wakes up any pending requests to the same
line, but does not attempt to change the state of the line) and how GPU
bypassing loads and stores should interact with the pending requests
(they are forced to wait if they reach the L1 after the pending,
non-bypassing load; but if they reach the L1 before the non-bypassing
load then they make sure not to change the state of the line from IV if
they return before the non-bypassing load).

As part of this change, we also move the MSHR behavior from internally
in the GPUCoalescer for loads to the Ruby code (like all other
requests).  This is important to get correct hits and misses in stats
and other prints, since the GPUCoalescer MSHR behavior assumes all
requests serviced out of its MSHR also miss if the original request to
that line missed.

Although the SQC does not support stores, the TCP does.  Thus,
we could have applied a similar change to the GPU stores at the TCP.
However, since the TCP support assumes write-through caches and does not
attempt to allocate space in the TCP, we elected not to add this support
since it seems to run contrary to the intended behavior (i.e., the
intended behavior seems to be that writes just bypass the TCP and thus
should not need to wait for another write to the same cache line to
complete).

Additionally, making these changes introduced issues with deadlocks at
the TCC.  Specifically, some Pannotia applications have accesses to the
same cache line where some of the accesses are GLC (i.e., they bypass
the GPU L1 cache) and others are non-GLC (i.e., they want to be cached
in the GPU L1 cache). We have support already per CU in the above code.
However, the problem here is that these requests are coming from
different CUs and happening concurrently (seemingly because different
WGs are at different points in the kernel around the same time).
This causes a problem because our support at the TCC for the TBEs
overwrites the information about the GPU bypassing bits (SLC, GLC) every
time. The problem is when the second (non-GLC) load reaches the TCC, it
overwrites the SLC/GLC information for the first (GLC) load. Thus, when
the the first load returns from the directory/memory, it no longer has
the GLC bit set, which causes an assert failure at the TCP.

After talking with other developers, it was decided the best way handle
this and attempt to model real hardware more closely was to move the
point at which requests are put to sleep on the wakeup buffer from the
TCC to the directory. Accordingly, this patch includes support for that
-- now when multiple loads (bypassing or non-bypassing) from different
CUs reach the directory, all but the first one will be forced to wait
there until the first one completes, then will be woken up and
performed.  This required updating the WTRequestor information at the
TCC to pass the information about what CU performed the original request
for loads as well (otherwise since the TBE can be updated by multiple
pending loads, we can't tell where to send the final result to).  Thus,
I changed the field to be named CURequestor instead of WTRequestor since
it is now used for more than stores.  Moreover, I also updated the
directory to take this new field and the GLC information from incoming
TCC requests and then pass that information back to the TCC on the
response -- without doing this, because the TBE can be updated by
multiple pending, concurrent requests we cannot determine if this memory
request was a bypassing or non-bypassing request.  Finally, these
changes introduced a lot of additional contention and protocol stalls at
the directory, so this patch converted all directory uses of z_stall to
instead put requests on the wakeup buffer (and wake them up when the
current request completes) instead. Without this, protocol stalls cause
many applications to deadlock at the directory.

However, this exposed another issue at the TCC: other applications
(e.g., HACC) have a mix of atomics and non-atomics to the same cache
line in the same kernel.  Since the TCC transitions to the A state when
an atomic arrives. For example, after the first pending load returns to
the TCC from the directory, which causes the TCC state to become V, but
when there are still other pending loads at the TCC. This causes invalid
transition errors at the TCC when those pending loads return, because
the A state thinks they are atomics and decrements the pending atomic
count (plus the loads are never sent to the TCP as returning loads).
This patch fixes this by changing the TCC TBEs to model the number of
pending requests, and not allowing atomics to be issued from the TCC
until all prior, pending non-atomic requests have returned.

Change-Id: I37f8bda9f8277f2355bca5ef3610f6b63ce93563
2023-11-15 19:23:51 -06:00
Matt Sinclair
065ddf759f mem-ruby, gpu-compute: fix bug with GPU bypassing loads
The current GPU TCP (L1D$) Ruby SLICC code had a bug where a GPU
load that wants to bypass the L1D$ (e.g., GLC or SLC bit was set)
but the line is in Invalid when that request arrives, results in
a non-bypassing load being sent to the GPU TCC (L2$) instead of
a bypassing load.

This issue was not caught by currently nightly or weekly tests,
because the tests do not test for correctness in terms of hits
and misses in the caches.  However, tests for these corner cases
expose this issue.

To fix, this, this patch removes the check that the entry is valid
when deciding what to do with a bypassing GPU load -- since the
TCP Ruby code has transitions for bypassing loads in both I and V,
we can simply call the LoadBypassEvict event in both cases and the
appropriate transition will handle the bypassing load given the
cache line's current state in the TCP.

Change-Id: Ia224cefdf56b4318b2bcbd0bed995fc8d3b62a14
2023-11-15 19:23:51 -06:00
BujSet
4a5ec70e08 gpu-compute: Minor edits for atomic no returns and stores (#565)
Since returned data is not needed for AtomicNoReturn and Store memory
requests, the coalescer need not spend time writing in dummy data for
packets of these types.

Change-Id: Ie669e8c2a3bf44b5b0c290f62c49c5d4876a9a6a
2023-11-15 07:20:07 -08:00
BujSet
65b44e6516 mem-ruby: Fix for not creating log entries on atomic no return requests (#546)
Augmenting Datablock and WriteMask to support optional arg to
distinguish between return and no return. In the case of atomic no
return requests, log entries should not be created when performing the
atomic.

Change-Id: Ic3112834742f4058a7aa155d25ccc4c014b60199a
2023-11-14 07:54:42 -08:00
Daniel Kouchekinia
be5c03ea9f mem-ruby,configs: Add GPU GLC Atomic Resource Constraints (#120)
Added a resource constraint, AtomicALUOperation, to GLC atomics
performed in the TCC.

The resource constraint uses a new class, ALUFreeList array. The class
assumes the following:
  - There are a fixed number of atomic ALU pipelines
- While a new cache line can be processed in each pipeline each cycle,
if a cache line is currently going through a pipeline, it can't be
processed again until it's finished

Two configuration parameters have been used to tune this behavior:
- tcc-num-atomic-alus corresponds to the number of atomic ALU pipelines
- atomic-alu-latency corresponds to the latency of atomic ALU pipelines

Change-Id: I25bdde7dafc3877590bb6536efdf57b8c540a939
2023-11-14 07:48:48 -08:00
Matt Sinclair
48fde5a9c6 mem-ruby, gpu-compute: fix formatting of TCC (#536)
mem-ruby, gpu-compute: fix formatting of TCC

Fix several not properly indented prints and extraneous extra lines in
the SLICC code for the GPU TCC (L2 cache).
2023-11-13 15:01:30 -08:00
Matt Sinclair
7d0a1fb284 mem-ruby, gpu-compute: fix typo in GPU coalescer deadlock print (#535)
mem-ruby, gpu-compute: fix typo in GPU coalescer deadlock print 
 
The GPU Coalescer's deadlock print did not previously print a newline at
the end of each deadlock, which caused confusion when there were
multiple deadlocks as each deadlock print would appear to go with the
address after it. This patch fixes this issue.
2023-11-13 15:01:01 -08:00
Matt Sinclair
f312804364 mem-ruby: fix hex print in CacheMemory (#561)
Update print in CacheMemory about clearing the lock to properly print in
hex.
2023-11-13 14:34:33 -08:00
Matt Sinclair
3642bc4892 mem-ruby, gpu-compute: fix GPU SQC/TCP Ruby formatting (#538)
mem-ruby, gpu-compute: fix GPU SQC/TCP Ruby formatting

Fix several not properly indented prints and extraneous extra lines in
the SLICC code for the GPU SQC (L1I$) and TCP (L1D$).
2023-11-13 14:20:54 -08:00
Matt Sinclair
f61d709321 mem-ruby: update RubyRequest print to include GPU fields (#537)
mem-ruby: update RubyRequest print to include GPU fields

The print function used for RubyRequests did not include the GPU
specific fields (for the GLC and SLC bits, which are cache modifiers
that specify what level of the memory hierarchy a request should be
performed at). This causes confusion when the GPU Ruby SLICC code prints
out RubyRequest messages, since important fields are missing.

Thus this commit adds that support. Since these fields are already part
of the RubyRequest class, and are always 0 for non-GPU requests, it
should not affect other components beyond slightly longer prints.

Change-Id: I31c9122b82dfa2c6415ce25d225ea82cb35c7333
2023-11-13 01:12:25 -06:00
Daniel Kouchekinia
1204267fd8 mem-ruby: SLICC Fixes to GLC Atomics in WB L2 (#397)
Made the following changes to fix the behavior of GLC atomics in a WB
L2:
- Stored atomic write mask in TBE For GLC atomics on an invalid line
that bypass to the directory, but have their atomics performed on the
return path.
- Replaced !presentOrAvail() check for bypassing atomics to directory
(which will then be performed on return path), with check for invalid
line state.
- Replaced wdb_writeDirtyBytes action used when performing atomics with
owm_orWriteMask action that doesn't write from invalid atomic request
data block
   - Fixed atomic return path actions

Change-Id: I6a406c313d2f9c88cd75bfe39187ef94ce84098f
2023-11-09 13:15:10 -08:00