The GPU device currently supports large BAR, which means the driver
can write directly to GPU memory over the PCI bus without using SDMA or
PM4 packets. The gem5 PCI interface only provides an atomic interface
for BAR reads/writes, which means the values cannot go through the
timing-mode Ruby caches. This causes bugs because the TCC is allowed to
keep clean data between kernels for performance reasons. If a BAR
write goes directly to memory, bypassing the cache, the value in the
cache is stale and must be invalidated.
In this commit a TCC invalidate is generated for all writes over PCI
that go directly to GPU memory. This will also invalidate the TCP along
the way if necessary. This currently relies on the driver
synchronization, which only allows BAR writes between kernels.
Therefore, the cache line should only be in the I or V state.
To handle a race condition between invalidates and launching the next
kernel, the invalidates return a response and the GPU command processor
will wait for all TCC invalidates to be complete before launching the
next kernel.
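As a rough C++ illustration of this handshake (the names here, such as
PendingInvalidates, are hypothetical, not the actual gem5 classes):
every BAR write that bypasses the caches issues a TCC invalidate, and
the command processor waits until all invalidate responses have
returned before launching the next kernel.

    #include <cassert>
    #include <cstdint>

    struct PendingInvalidates {
        uint64_t outstanding = 0;

        // A BAR write went directly to GPU memory: issue a TCC invalidate
        // (which also invalidates the TCP along the way if needed).
        void onBarWrite(uint64_t /*addr*/) { ++outstanding; }

        // The TCC acknowledged one invalidate.
        void onInvalidateAck() {
            assert(outstanding > 0);
            --outstanding;
        }

        // The command processor checks this before launching the next kernel.
        bool safeToLaunchKernel() const { return outstanding == 0; }
    };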
This fixes issues with stale data in nanoGPT and possibly PENNANT.
Change-Id: I8e1290f842122682c271e5508a48037055bfbcdf
Currently, the GPU SQC (L1I$) and TCP (L1D$) have a performance bug
where they do not behave correctly when multiple requests to the same
cache line overlap one another. The intended behavior is that if the
first request that arrives at the Ruby code for the SQC/TCP misses, it
should send a request to the GPU TCC (L2$). If any requests to the
same cache line occur while this first request is pending, they should
wait locally at the L1 in the MSHRs (TBEs) until the first request has
returned. At that point they can be serviced, and assuming the line
has not been evicted, they should hit.
For example, in the following test (on 1 GPU thread, in 1 WG):
load Arr[0]
load Arr[1]
load Arr[2]
The expected behavior (confirmed via profiling on real GPUs) is that
we should get 1 miss (Arr[0]) and 2 hits (Arr[1], Arr[2]) for such a
program.
However, the current support in the VIPER SQC/TCP code does not model
this correctly. Instead it lets all 3 concurrent requests go straight
through to the TCC instead of stopping the Arr[1] and Arr[2] requests
locally while Arr[0] is serviced. This causes all 3 requests to be
classified as misses.
To resolve this, this patch adds support into the SQC/TCP code to
prevent subsequent, concurrent requests to a pending cache line from being
sent in parallel with the original one. To do this, we add an
additional transient state (IV) to indicate that a load is pending to
this cache line. If a subsequent request of any kind to the same cache
line occurs while this load is pending, the requests are put on the
local wait buffer and woken up when the first request returns to the
SQC/TCP. Likewise, when the first load is returned to the SQC/TCP, it
transitions from IV --> V.
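The following is a minimal C++ model of this intended behavior, with
invented names (L1Model, fillReturned); the real implementation is
SLICC state-machine code. The first load to an Invalid line moves it to
IV and is sent to the TCC; later requests to the same line park in a
wait buffer and are woken when the fill returns and the line moves
IV --> V.

    #include <cstdint>
    #include <deque>
    #include <unordered_map>

    enum class State { I, IV, V };

    struct L1Line {
        State state = State::I;
        std::deque<int> waiters;      // request ids parked on this line
    };

    struct L1Model {
        std::unordered_map<uint64_t, L1Line> lines;
        int misses = 0, hits = 0;

        void load(uint64_t lineAddr, int reqId) {
            L1Line &l = lines[lineAddr];
            switch (l.state) {
            case State::I:            // first miss: send to TCC, enter IV
                ++misses;
                l.state = State::IV;
                break;
            case State::IV:           // load already pending: wait locally
                l.waiters.push_back(reqId);
                break;
            case State::V:            // line present: hit
                ++hits;
                break;
            }
        }

        // The fill returns from the TCC: IV --> V, wake and service waiters.
        void fillReturned(uint64_t lineAddr) {
            L1Line &l = lines[lineAddr];
            l.state = State::V;
            while (!l.waiters.empty()) {
                l.waiters.pop_front();
                ++hits;               // woken requests now hit
            }
        }
    };

    int main() {
        L1Model l1;
        uint64_t line = 0x1000;       // Arr[0..2] share one cache line
        l1.load(line, 0);             // miss: I --> IV, request sent to TCC
        l1.load(line, 1);             // parked in the wait buffer
        l1.load(line, 2);             // parked in the wait buffer
        l1.fillReturned(line);        // IV --> V, waiters wake up and hit
        return 0;                     // l1.misses == 1, l1.hits == 2
    }

Run on the Arr[0]/Arr[1]/Arr[2] example above, this model yields 1 miss
and 2 hits, matching the behavior profiled on real GPUs.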
As part of this support, additional transitions were also added to
account for corner cases. First, if the line is evicted by another
request that maps to the same set index while the first load is
pending, the line is immediately given to the new request; when the
pending load returns, it completes and wakes up any waiting requests to
the same line, but does not attempt to change the state of the line.
Second, GPU bypassing loads and stores must interact correctly with
pending requests: they are forced to wait if they reach the L1 after
the pending, non-bypassing load, but if they reach the L1 before the
non-bypassing load, they make sure not to change the state of the line
from IV if they return before the non-bypassing load does.
As part of this change, we also move the MSHR behavior for loads from
the GPUCoalescer into the Ruby code (like all other requests). This is
important for getting correct hit and miss counts in the statistics and
other output, since the GPUCoalescer MSHR behavior assumes that all
requests serviced out of its MSHR also miss if the original request to
that line missed.
Although the SQC does not support stores, the TCP does. Thus,
we could have applied a similar change to the GPU stores at the TCP.
However, since the TCP support assumes write-through caches and does not
attempt to allocate space in the TCP, we elected not to add this support
since it seems to run contrary to the intended behavior (i.e., the
intended behavior seems to be that writes just bypass the TCP and thus
should not need to wait for another write to the same cache line to
complete).
Additionally, making these changes introduced issues with deadlocks at
the TCC. Specifically, some Pannotia applications have accesses to the
same cache line where some of the accesses are GLC (i.e., they bypass
the GPU L1 cache) and others are non-GLC (i.e., they want to be cached
in the GPU L1 cache). The support described above handles this within a
single CU. However, the problem here is that these requests come from
different CUs and happen concurrently (seemingly because different WGs
are at different points in the kernel around the same time). This
causes a problem because our TCC support overwrites the TBE's
information about the GPU bypassing bits (SLC, GLC) every time.
Specifically, when the second (non-GLC) load reaches the TCC, it
overwrites the SLC/GLC information for the first (GLC) load. Thus, when
the first load returns from the directory/memory, it no longer has the
GLC bit set, which causes an assert failure at the TCP.
After talking with other developers, it was decided that the best way
to handle this, and to model real hardware more closely, was to move
the point at which requests are put to sleep on the wakeup buffer from
the TCC to the directory. Accordingly, this patch includes support for that
-- now when multiple loads (bypassing or non-bypassing) from different
CUs reach the directory, all but the first one will be forced to wait
there until the first one completes, then will be woken up and
performed. This required updating the WTRequestor information at the
TCC to pass the information about what CU performed the original request
for loads as well (otherwise since the TBE can be updated by multiple
pending loads, we can't tell where to send the final result to). Thus,
I changed the field to be named CURequestor instead of WTRequestor since
it is now used for more than stores. Moreover, I also updated the
directory to take this new field and the GLC information from incoming
TCC requests and then pass that information back to the TCC on the
response -- without doing this, because the TBE can be updated by
multiple pending, concurrent requests we cannot determine if this memory
request was a bypassing or non-bypassing request. Finally, these
changes introduced a lot of additional contention and protocol stalls
at the directory, so this patch converted all directory uses of z_stall
to instead put requests on the wakeup buffer (and wake them up when the
current request completes). Without this, protocol stalls cause many
applications to deadlock at the directory.
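A simplified C++ sketch of the directory-side change (hypothetical
names; the real code is SLICC): instead of z_stall-ing, a request that
finds its address busy is parked in a per-address wakeup buffer and
replayed when the in-flight request completes.

    #include <cstdint>
    #include <deque>
    #include <unordered_map>

    struct Request { int cu; bool bypassing; };

    struct DirectoryModel {
        std::unordered_map<uint64_t, bool> busy;    // request in flight?
        std::unordered_map<uint64_t, std::deque<Request>> wakeupBuffer;

        // Returns true if the request was issued, false if it was parked.
        bool receive(uint64_t addr, const Request &req) {
            if (busy[addr]) {
                // Previously: z_stall (recycle in place, can deadlock).
                // Now: park the request and wake it when the line frees up.
                wakeupBuffer[addr].push_back(req);
                return false;
            }
            busy[addr] = true;        // issue to memory
            return true;
        }

        // Called when the in-flight request for addr completes.
        void complete(uint64_t addr) {
            busy[addr] = false;
            auto &q = wakeupBuffer[addr];
            if (!q.empty()) {
                Request next = q.front();
                q.pop_front();
                receive(addr, next);  // replay the oldest parked request
            }
        }
    };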
However, this exposed another issue at the TCC: other applications
(e.g., HACC) have a mix of atomics and non-atomics to the same cache
line in the same kernel. Since the TCC transitions to the A state when
an atomic arrives, a problem occurs when, for example, the first
pending load returns to the TCC from the directory and moves the TCC
state to V while other loads are still pending at the TCC; an atomic
then arrives and moves the line to A before those loads return. This
causes invalid transition errors at the TCC when the pending loads
return, because the A state thinks they are atomics and decrements the
pending atomic count (plus the loads are never sent to the TCP as
returning loads).
This patch fixes this by changing the TCC TBEs to model the number of
pending requests, and not allowing atomics to be issued from the TCC
until all prior, pending non-atomic requests have returned.
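A small C++ sketch of the TCC TBE change, with invented field names:
the TBE counts outstanding non-atomic requests for the line, and an
atomic is only issued once that count has drained to zero.

    #include <cassert>

    struct TccTbe {
        int pendingLoads = 0;         // non-atomic requests still in flight
        bool atomicQueued = false;

        void issueLoad() { ++pendingLoads; }

        // An atomic may only be issued once all prior loads have drained;
        // otherwise the line would sit in A while loads return into it.
        bool tryIssueAtomic() {
            if (pendingLoads > 0) {
                atomicQueued = true;
                return false;
            }
            return true;
        }

        // A pending load returned from the directory. Returns true when the
        // last load drains and a queued atomic may now proceed.
        bool loadReturned() {
            assert(pendingLoads > 0);
            --pendingLoads;
            return pendingLoads == 0 && atomicQueued;
        }
    };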
Change-Id: I37f8bda9f8277f2355bca5ef3610f6b63ce93563
The GPU TCP (L1D$) Ruby SLICC code had a bug where a GPU load
that wants to bypass the L1D$ (e.g., with the GLC or SLC bit set)
but finds the line in Invalid when it arrives results in a
non-bypassing load being sent to the GPU TCC (L2$) instead of
a bypassing load.
This issue was not caught by the current nightly or weekly tests,
because those tests do not check for correctness in terms of hits
and misses in the caches. However, tests for these corner cases
expose this issue.
To fix this, this patch removes the check that the entry is valid
when deciding what to do with a bypassing GPU load -- since the
TCP Ruby code has transitions for bypassing loads in both I and V,
we can simply call the LoadBypassEvict event in both cases and the
appropriate transition will handle the bypassing load given the
cache line's current state in the TCP.
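In C++ pseudo-form (the actual code is a SLICC trigger, and these enum
names are illustrative), the corrected dispatch no longer consults the
line state for bypassing loads:

    enum class TcpState { I, V };
    enum class TcpEvent { Load, LoadBypassEvict };

    TcpEvent classify(bool bypassing, TcpState /*state*/) {
        // Before the fix, a bypassing load that found the line in I was
        // misclassified as an ordinary caching Load. The line state no
        // longer matters: both I and V have LoadBypassEvict transitions.
        return bypassing ? TcpEvent::LoadBypassEvict : TcpEvent::Load;
    }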
Change-Id: Ia224cefdf56b4318b2bcbd0bed995fc8d3b62a14
Augment Datablock and WriteMask to support an optional argument to
distinguish between return and no-return atomics. In the case of atomic
no-return requests, log entries should not be created when performing
the atomic.
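A hedged sketch of the idea with simplified stand-in types (not gem5's
actual Datablock/WriteMask interfaces): the atomic helper takes an
extra flag and only creates log entries, which feed the return path,
when the atomic returns a value.

    #include <cstdint>
    #include <vector>

    struct AtomicLog { std::vector<uint64_t> oldValues; };

    // Perform an atomic add; only return-variant atomics log the old value
    // (the log entries feed the data sent back to the requester).
    uint64_t performAtomicAdd(uint64_t &mem, uint64_t operand,
                              bool isAtomicNoReturn, AtomicLog &log) {
        uint64_t old = mem;
        if (!isAtomicNoReturn) {
            log.oldValues.push_back(old);
        }
        mem = old + operand;
        return old;
    }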
Change-Id: Ic3112834742f4058a7aa155d25ccc4c014b60199a
mem-ruby, gpu-compute: fix GPU SQC/TCP Ruby formatting
Fix several improperly indented prints and remove extraneous blank
lines in the SLICC code for the GPU SQC (L1I$) and TCP (L1D$).
This commit adds flush support to the GPU VIPER coherence protocol. The
L1 cache will now initiate a flush request if the packet it receives
is of type RubyRequestType_FLUSH. During the flush process, the L1 cache
will send a request to L2 whether it is in the V or I state. L2 will
issue a flush request to the directory if its cache line is in the
valid state, before invalidating its copy. The directory, on receiving
this request, writes the data to memory and sends an ack back to the
L2. L2 forwards this ack back to the L1, which then ends the flush by
calling the write callback.
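The sequence can be summarized with the following C++ sketch, where the
state names mirror the text but the helper is invented for illustration
(the real protocol is written in SLICC):

    enum class LineState { I, V };

    struct FlushSteps {
        bool dataWrittenToMemory = false;
        bool l1WriteCallbackDone = false;
    };

    FlushSteps flush(LineState l1State, LineState l2State) {
        FlushSteps steps;
        (void)l1State;                // L1 forwards the flush in both V and I
        if (l2State == LineState::V) {
            // L2 flushes to the directory, which writes the data to memory
            // and acks; L2 then invalidates its copy.
            steps.dataWrittenToMemory = true;
        }
        // L2 acks the L1, which completes the flush via the write callback.
        steps.l1WriteCallbackDone = true;
        return steps;
    }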
Change-Id: I9dfc0c7b71a1e9f6d5e9e6ed4977c1e6a3b5ba46
66d4a158 added support for AMD's GPU cache bypassing flags (GLC
for bypassing L1 caches, SLC for bypassing all caches). However,
it did not add a transition for the situation where the cache line
is currently I (Invalid). This commit adds this support, which
resolves an assert failure in Pannotia workloads when this situation
arises.
Change-Id: I59a62ce70c01dd8b73aacb733fb3d1d0dab2624b
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/67201
Reviewed-by: Jason Lowe-Power <power.jg@gmail.com>
Tested-by: kokoro <noreply+kokoro@google.com>
Maintainer: Jason Lowe-Power <power.jg@gmail.com>
66d4a158 added support for AMD's GPU cache bypassing flags (GLC
for bypassing L1 caches, SLC for bypassing all caches). However,
for applications that use the GLC flag but intermix GLC- and
non-GLC accesses to the same address, this previous commit
has a bug. This bug manifests when the address is currently
valid in the L1 (TCP). In this case, the previous commit chose
to evict the line before letting the bypassing access to proceed.
However, to do this the previous commit was using the inv_invDone
action as part of the process of evicting it. This action is only
intended to be called when load acquires are being performed
(i.e., when the entire L1 cache is being flash invalidated). Thus,
calling inv_invDone for a GLC (or SLC) bypassing request caused an
assert failure since the bypassing request was not performing a
load acquire.
This commit resolves this by changing the support in this case to
simply invalidate the entry in the cache.
Change-Id: Ibaa4976f8714ac93650020af1c0ce2b6732c95a2
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/67199
Reviewed-by: Jason Lowe-Power <power.jg@gmail.com>
Tested-by: kokoro <noreply+kokoro@google.com>
Maintainer: Jason Lowe-Power <power.jg@gmail.com>
The GPU cache models do not support cache bypassing when the GLC or SLC
AMDGPU instruction modifiers are used in a load or store. This commit
adds cache bypass support by introducing new transitions in the
coherence protocol used by the GPU memory system. Now, instructions with
the GLC bit set will not cache in the L1 and instructions with SLC bit
set will not cache in L1 or L2.
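The resulting caching policy can be encoded in a small C++ helper (the
function below is illustrative, not gem5 code; only the GLC/SLC flag
names come from the AMDGPU instruction modifiers):

    struct CachePolicy { bool cacheInL1; bool cacheInL2; };

    CachePolicy policyFor(bool glc, bool slc) {
        if (slc) return {false, false};   // SLC: bypass L1 and L2
        if (glc) return {false, true};    // GLC: bypass L1 only
        return {true, true};              // default: cache normally
    }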
Change-Id: Id29a47b0fa7e16a21a7718949db802f85e9897c3
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/66991
Reviewed-by: Jason Lowe-Power <power.jg@gmail.com>
Maintainer: Matt Sinclair <mattdsinclair@gmail.com>
Tested-by: kokoro <noreply+kokoro@google.com>
Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com>
Previously, the GPU SQC and TCP Ruby protocols always told the Sequencer
that the externalHit field was false. This skews statistics and
profiling, because the Sequencer uses this hit/miss information both
for profiling and for the coalescer's statistics.
To resolve this, this commit updates the GPU SQC and TCP Ruby protocols
to pass the appropriate hit/miss information into the Sequencer's
readCallback and hitCallback functions.
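As a minimal C++ sketch (with a simplified stand-in for the Sequencer
interface), externalHit = true indicates the request was serviced
beyond the L1, i.e., it missed locally:

    #include <cstdint>

    struct Stats { uint64_t hits = 0, misses = 0; };

    // Analogous to the Sequencer's readCallback(..., externalHit) path:
    // externalHit == true means the data came from beyond the L1 (a miss).
    void readCallback(Stats &s, bool externalHit) {
        if (externalHit) {
            ++s.misses;
        } else {
            ++s.hits;
        }
    }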
Change-Id: Ib74af09b66fa8866eee72d3a9ab0e8a8f2196c03
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/60652
Reviewed-by: Matthew Poremba <matthew.poremba@amd.com>
Maintainer: Matthew Poremba <matthew.poremba@amd.com>
Tested-by: kokoro <noreply+kokoro@google.com>
Remove the line "For use for simulation and test purposes only" in files
where AMD is the only copyright holder listed in the header. This happens
to be the case for all files where this line exists, removing the line
completely from gem5.
Change-Id: I623f266b002f564301b28774f49081099cfc60fd
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/53943
Reviewed-by: Jason Lowe-Power <power.jg@gmail.com>
Maintainer: Jason Lowe-Power <power.jg@gmail.com>
Tested-by: kokoro <noreply+kokoro@google.com>