Commit Graph

14908 Commits

Author SHA1 Message Date
Tiago Mück
becba00d95 mem-cache,configs: remove extra prefetch_* params
Remove the prefetch_on_access and prefetch_on_pf_hit from BaseCache.
BasePrefetch no longer expects these params to exist in the parent.

Configurations that set these parameters using the cache object were
fixed.

Change-Id: I9ab6a545eaf930ee41ebda74e2b6b8bad0ca35a7
Signed-off-by: Tiago Mück <tiago.muck@arm.com>
2023-11-28 18:30:49 -06:00
Tiago Mück
af2ee0db30 mem-cache: decoupled prefetchers from cache
This patch decouples the prefetchers from the cache implementation
as the first step to allow using the classic prefetchers with ruby
caches. The prefetchers that need to do cache lookups can do so using
the accessor object provided when the probes are notified. This may
also facilitate connecting the same prefetcher to multiple caches.

Related JIRA:
https://gem5.atlassian.net/browse/GEM5-457
https://gem5.atlassian.net/browse/GEM5-1112

Change-Id: I4fee1a3613ae009fabf45d7b747e4582cad315ef
Signed-off-by: Tiago Mück <tiago.muck@arm.com>
2023-11-28 18:30:49 -06:00
Jason Lowe-Power
3fe5e58f28 arch-x86: Fix misc registers in mov instructions (#593)
MOV instructions 8C and 8E can be prefixed with a REX prefix to extend
the source/destination register.
However, the R bit in REX will be applied to the segment register.
The decoder file checks for valid segment registers using the
MODRM_REG field only. Later, when the register is added to the
sources/destinations of the instruction, this field is extended with
REX_R.
This will trigger an assert.

Additionally, MOV instructions for various miscellaneous registers are
also not checked for validity when taking the REX_R bit into account.

This patch checks that REX_R is not set; otherwise, UD2 will be
generated.
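
The check described above can be sketched in Python as follows. This is an illustration only, not the gem5 decoder code: the function names and the "MOV"/"UD2" return strings are hypothetical. It relies on two architectural facts: x86 has six segment registers, encoded 0 to 5 in ModRM.reg, and REX.R supplies a fourth bit on top of that 3-bit field.

```python
# Hypothetical helper names; a sketch of the check, not gem5's decoder.
NUM_SEGMENT_REGS = 6  # ES, CS, SS, DS, FS, GS are encoded as 0..5


def segment_reg_index(modrm_reg, rex_r):
    """Register index once REX.R is applied as bit 3 above ModRM.reg."""
    return (rex_r << 3) | (modrm_reg & 0x7)


def decode_mov_sreg(modrm_reg, rex_r):
    """Return "UD2" for encodings that cannot name a segment register."""
    if rex_r != 0:
        # The extended index would be >= 8, past every segment register.
        return "UD2"
    if modrm_reg >= NUM_SEGMENT_REGS:
        return "UD2"
    return "MOV"
```

Checking only `modrm_reg` would accept `modrm_reg=3` with `rex_r=1`, even though the extended index is 11; rejecting any encoding with REX.R set closes that gap.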
2023-11-28 11:14:53 -08:00
Andreas Sandberg
0c30353c59 cpu: Require BTB hit to detect branches. (#493)
In a high performance CPU there is no other way than a BTB hit
to know about a branch instruction and its type. For low-end CPUs
pre-decoding might sit in front of the BPU to provide this information.
Currently, the BPU models only low-end behavior and updates the
RAS and the indirect branch prediction even without a BTB hit.
This patch adds three things to model the correct behavior for high-end
CPUs.
1. A check before the RAS and indirect predictor whether there was
a BTB hit or not. Only for BTB hits will the BPU consult the RAS and
indirect predictor.
2. Since this check requires a BTB hit, indirect branches must
also be installed into the BTB. For returns this was already done.
3. Finally, the BTB update previously happened at squash (decode
or commit). Since this can be out-of-order, branches from
the false path can get installed without ever being retired.
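
A toy Python model of the high-end behavior described above, in which the RAS and the indirect predictor are touched only after a BTB hit. The class, its fields, and the fixed 4-byte instruction size are assumptions made for illustration; this is not the gem5 BPU interface.

```python
class ToyBPU:
    """Toy model: RAS and indirect predictor are used only on a BTB hit."""

    def __init__(self):
        self.btb = {}       # pc -> (branch kind, target)
        self.ras = []       # return address stack
        self.indirect = {}  # pc -> last seen indirect target

    def install(self, pc, kind, target):
        """Install a branch in the BTB (indirect branches included)."""
        self.btb[pc] = (kind, target)
        if kind == "indirect":
            self.indirect[pc] = target

    def predict(self, pc, inst_size=4):
        entry = self.btb.get(pc)
        if entry is None:
            # No BTB hit: the instruction is not known to be a branch,
            # so neither the RAS nor the indirect predictor is touched.
            return None
        kind, target = entry
        if kind == "call":
            self.ras.append(pc + inst_size)
            return target
        if kind == "return":
            return self.ras.pop() if self.ras else target
        if kind == "indirect":
            return self.indirect.get(pc, target)
        return target
```

A miss at a call site leaves the RAS untouched; once the call is installed, the same PC pushes its return address and the matching return pops it.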
2023-11-28 09:39:14 +00:00
Roger Chang
9a0c671cce arch-riscv: Handle the exception following the privilege mode set
Change-Id: I4867941ec286fe485e01db848b8c7357488f6cf4
2023-11-28 09:26:27 +08:00
Roger Chang
d56801c240 arch-riscv: Add misa rvs check for memory translation
Memory translation requires supervisor mode to be implemented. If
supervisor mode is not implemented, the satp CSR does not exist and
address translation should not be performed.

Change-Id: Ie6c8a1a130d0aab0647b35e0f731f6b930834176
2023-11-28 09:26:27 +08:00
Roger Chang
6fd4feb797 arch-riscv: fatal_if the process run without SU modes
Change-Id: Ifce7eec6cea10881964c29d206a92f3d10271de6
2023-11-28 09:26:27 +08:00
Roger Chang
9e738a65ea arch-riscv: Add isaExts field for CSR registers
Change-Id: Idd94af57f3a721d455ea7fb9d335fab7b16a0f7e
2023-11-28 09:26:27 +08:00
Roger Chang
0e4f82a119 arch-riscv: define the CSR masks for each privilege modes
Change-Id: I9936d9bc816921a827b94550847d4898b3aa3292
2023-11-28 09:26:27 +08:00
Roger Chang
f745e8cf89 arch-riscv: Initial the privilege modes configuration
1. Declare the new enum type PrivilegeModes
2. Disallow setting the MISA register RVU and RVS.

Change-Id: I932d714bc70c9720a706353c557a5be76c950f81
2023-11-28 09:26:27 +08:00
Aditya K Kamath
9a0566e295 arch-x86: Fixes page fault for CLFLUSH on write-protected pages
Converts CLFLUSHOPT/WB/FLUSH operations from Write to Read operations
during address translation so that they don't trigger a page fault
when done on write-protected pages.

Change-Id: I20e89cc0cb2b288b36ba1f0ba39a2e1bf0f728af
2023-11-28 00:42:17 +00:00
Bobby R. Bruce
d94d6017b0 scons: Change to Kconfig build system (#69)
The PR contains the following changes:
- Move all of the config options (`env["CONF"]`) from SConsopt to Kconfig
files
- Update `build_opts` files to Kconfig option formats
- The Ruby Protocol files are only built if `RUBY=y`
- Remove the default-default build target
- Kconfig commands are included in the PR:
    - defconfig
    - setconfig
    - menuconfig
    - guiconfig
    - listnewconfig
    - savedefconfig
    - oldconfig
    - olddefconfig
- Add the `python3-tk` package dependency
 
Jira issue: https://gem5.atlassian.net/browse/GEM5-1211
2023-11-27 13:59:18 -08:00
Matthew Poremba
9e6a87e67a dev-amdgpu: Writeback PM4 queue rptr when empty (#597)
The GPU device keeps a local copy of each ring buffer's read pointer
(rptr) to avoid constant DMAs to/from host memory. This means it needs
to be periodically updated on the host side as the driver uses this to
determine how much space is left in the queue and may hang if it believes
the queue is full. For user-mode queues, this already happens when
queues are unmapped. For kernel mode queues (e.g., HIQ, KIQ) the rptr is
never updated, leading to a hang.

In this patch the rptr for *all* queues is reported back to the kernel
whenever the queue reaches an empty state (rptr == wptr). Additionally,
to handle PM4 queue wrap-around, the queue processing function checks if
the queue is not empty instead of rptr < wptr. This is safe because the
driver fills PM4 queues with NOP packets on initialization and when wrap
around occurs.
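
Why the emptiness test matters on wrap-around can be seen in a small Python sketch. The ring size and function names are hypothetical, and pointers are modeled as entry indices rather than byte offsets; this is not the gem5 PM4 packet processor code.

```python
QUEUE_SIZE = 8  # hypothetical ring size, in entries


def has_work_old(rptr, wptr):
    """Previous check: wrongly reports no work once wptr has wrapped."""
    return rptr < wptr


def has_work_new(rptr, wptr):
    """New check: the queue has work whenever it is not empty."""
    return rptr != wptr


def drain(rptr, wptr):
    """Consume entries until the queue is empty, wrapping rptr."""
    consumed = []
    while has_work_new(rptr, wptr):
        consumed.append(rptr)
        rptr = (rptr + 1) % QUEUE_SIZE
    return consumed
```

With `rptr=6` and `wptr=2` the writer has wrapped: the old check sees no work, while the new one drains entries 6, 7, 0, 1 and stops when `rptr == wptr`, which is also the point at which the patch reports the rptr back to the kernel.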

Change-Id: Ie13a4354f82999208a75bb1eaec70513039ff30f
2023-11-27 11:02:11 -08:00
Bobby R. Bruce
d4b7c8a26d Merge branch 'develop' into develop-kconfig
2023-11-27 09:39:08 -08:00
Matthew Poremba
cc9f81b08a arch-vega,arch-gcn3: Bugfix V_PERM_B32 and V_OR3_B32 (#599)
The V_PERM_B32 instruction is selecting the correct byte, but is
shifting it into place by bits instead of bytes. The V_OR3_B32
instruction is calling the wrong instruction implementation in the
decoder.

This patch fixes both issues plus a bonus fix for GCN3's V_PERM_B32.
(GCN3 does not have V_OR3_B32).

Change-Id: Ied66c43981bc4236f680db42a9868f760becc284
2023-11-26 23:22:01 -08:00
Bobby R. Bruce
0b2c56ef66 mem-cache: Revert "Prefetchers Improvements" (#581)
Reverts gem5/gem5#564 to fix #580.

Discussion in #581 showed there may be a fix to this but reverting for now until 
a better solution is found.
2023-11-26 18:43:21 -08:00
Bobby R. Bruce
ab1d5dc3a0 arch-arm: Fix Virtual Interrupt logic in secure mode (#584)
This PR fixes the remaining issues in the ArmISA::Interrupt class; more
specifically, it enables virtual interrupts in secure mode (when
FEAT_SEL2 is present). The previous version assumed no virtual interrupt
was possible in secure mode. We fix this assumption by replacing the
security check with the EL2Enabled helper, which closely matches the Arm
pseudocode.
2023-11-26 18:11:08 -08:00
Nitesh Narayana
35ccd7f907 arch-arm: This commit adds the mla/s indexed versions
This includes the isa and instruction implementations
of mla and mls indexed versions from ARM SVE2 ISA spec.

Change-Id: I4fbd0382f23d8611e46411f74dc991f5a211a313
2023-11-24 15:20:30 +01:00
Eduardo José Gómez Hernández
670bf6a488 arch-x86: Check REX_R for MOV misc registers
Change-Id: I08ea37ffe695df500ea84cbddd94be246f916caf
2023-11-24 13:41:24 +01:00
Eduardo José Gómez Hernández
cea169f5e7 arch-x86: Fix segment registers in instructions 8C and 8E
MOV instructions 8C and 8E can be prefixed with a REX prefix to extend
the source/destination register. However, the R bit in REX will be
applied to the segment register.  The decoder file checks for valid
segment registers using the MODRM_REG field only. Later, when the
register is added to the sources/destinations of the instruction, this
field is extended with REX_R.  This will trigger an assert.

This patch checks that REX_R is not set; otherwise, UD2 will be
generated.

Change-Id: I78a93c35116232fe37e5ec50025e721b8c633c5f
2023-11-23 10:18:17 +01:00
Roger Chang
92670e9745 fastmodel: Simplify the logic of USE_ARM_FASTMODEL setting
Change-Id: Ib00cf83ca881727987050a987a2adb1e9f9d31ef
2023-11-23 14:15:28 +08:00
Roger Chang
4d632cb73f scons: Add new config option HAVE_CAPSTONE to Kconfig
The config option HAVE_CAPSTONE was added in a previous PR [1], and
the Kconfig options should be kept in sync with it.

[1] https://github.com/gem5/gem5/pull/494

Change-Id: Id83718bc825f53d87d37d6ac930b96371209bdb3
2023-11-23 08:26:11 +08:00
Roger Chang
7b35765217 scons: Refactor the USE_SYSTEMC option
Change-Id: I2f51081e0db932b83eea9dd395551afe13d54a34
2023-11-23 08:26:11 +08:00
Roger Chang
d758df4b5c scons: Update the Kconfig build options
The CL updates the Kconfig:
1. Replace the USE_NULL_ISA with BUILD_ISA
2. The USE_XXX_ISAs depend on BUILD_ISA
3. If the BUILD_ISA is set, at least one of USE_XXX_ISAs must be set
4. Refactor the USE_KVM option

Change-Id: I2a600dea9fb671263b0191c46c5790ebbe91a7b8
2023-11-23 08:26:11 +08:00
Gabe Black
db3a6e8e84 scons: Use Kconfig to configure gem5.
These are not yet consumed by anything, but convert all the settings
from SCons variables to Kconfig variables.

If you have existing SConsopts files which need to be converted, you
should take a look at KCONFIG.md to learn about how kconfig is used in
gem5. You should decide if any variables need to be available to C++ or
kconfig itself, and whether those are options which should be detected
automatically, or should be up to the user. Options which should be
measured automatically should still be in SConsopts files, while user
facing options should be added to new or existing Kconfig files.

Generally, make sure you're storing c++/kconfig visible options in
env['CONF'][...]. Also remove references to sticky_vars since persistent
options should now be handled with kconfig, and export_vars since
everything in env['CONF'] is now exported automatically.

Switch SCons/gem5 to use Kconfig for configuration, except EXTRAS which
is still a sticky SCons variable. This is necessary because EXTRAS also
controls what config options exist. If it came from Kconfig itself, then
there would be a circular dependency. This dependency could
theoretically be handled by reparsing the Kconfig when EXTRAS
directories were added or removed, but that would be complicated, and
isn't supported by kconfiglib. It wouldn't be worth the significant
effort it would take to add it, just to use Kconfig more purely.

Change-Id: I29ab1940b2d7b0e6635a490452d05befe5b4a2c9
2023-11-23 08:26:10 +08:00
Matthew Poremba
6e433ed885 mem-ruby: Fixes for new AtomicWait event in VIPER TCC (#585)
The AtomicWait event was not being woken up properly due to the
numPending count in the TBE not being decremented. This patch decrements
the count when Data is returned. Since that moves to a base state, the
TBE should no longer be needed.

Additionally, added a transition that stalls and waits when an AtomicWait
occurs while in WI state so that it retries.

Change-Id: Ic8bfc700f9df3f95bea0799121898926a23d8163
2023-11-22 14:05:43 -08:00
Bobby R. Bruce
23a22ed95c dev-amdgpu: Add VMID map to checkpoint (#570)
When restoring checkpoints for certain applications, gem5 tries to
create new doorbells with a pre-existing queue ID and simulation crashes
shortly after. This commit adds existing IDs to the GPU device's used
VMID map so that new doorbells are aware of existing queue IDs and use a
new ID. This ensures that queue IDs are unique after checkpoint
restoration
2023-11-22 10:05:21 -08:00
Giacomo Travaglini
098feb4042 arch-arm: Fix WFI sleeping in secure mode
The CPU should not sleep with a pending virtual interrupt
if secure mode EL2 is supported (FEAT_SEL2)

Change-Id: Ib71c4a09d76a790331cf6750da45f83694946aee
Signed-off-by: Giacomo Travaglini <giacomo.travaglini@arm.com>
2023-11-21 13:39:41 +00:00
Giacomo Travaglini
b8fabc15d9 arch-arm: Revamp takeVirtualInt to take FEAT_SEL2 into account
Similarly to the physical version [1], we rewrite the
masking logic to account for FEAT_SEL2.

The interrupt table is taken from the Arm architecture reference
manual (version DDI 0487H.a, section D1.3.6, table R_BKHXL)

[1]: https://github.com/gem5/gem5/pull/430

Change-Id: Icb6eb1944d8241293b3ef3c349b20f3981bcc558
Signed-off-by: Giacomo Travaglini <giacomo.travaglini@arm.com>
2023-11-21 13:39:41 +00:00
Giacomo Travaglini
49d07578de arch-arm: Call take(Virtual)Int only when needed
There is no need to call the methods for every kind
of interrupt. A pending one should short-circuit the
remaining checks

Change-Id: I2c9eb680a7baa4644745b8cbe48183ff6f8e3102
Signed-off-by: Giacomo Travaglini <giacomo.travaglini@arm.com>
2023-11-21 13:39:41 +00:00
Giacomo Travaglini
bb323923f2 arch-arm: Simplify get/checkInterrupts with takeVirtualInt
With this patch we align virtual interrupts with respect to
the physical ones by introducing a matching takeVirtualInt
method.

Change-Id: Ib7835a21b85e4330ba9f051bc8fed691d6e1382e
Signed-off-by: Giacomo Travaglini <giacomo.travaglini@arm.com>
2023-11-21 13:39:41 +00:00
Giacomo Travaglini
3d41339366 arch-arm: Fix ISR_EL1 register read in secure mode
Virtual interrupts are enabled in secure mode as well
after the introduction of FEAT_SEL2. Replacing the
secure mode check with the EL2Enabled one

Change-Id: Id685a05d5adfa87b2a366f6be42bf344168927d4
Signed-off-by: Giacomo Travaglini <giacomo.travaglini@arm.com>
2023-11-21 13:39:41 +00:00
Giacomo Travaglini
90b711e879 arch-arm: Define an ISR type register
Change-Id: I358050a507fb76654e87165720dfb3b2ea6ca838
Signed-off-by: Giacomo Travaglini <giacomo.travaglini@arm.com>
2023-11-21 13:39:41 +00:00
Hoa Nguyen
3009e0fb57 mem-ruby: Fix typo in CHI's Send_CompI (#579)
The destination for the response is set twice.
2023-11-20 21:38:13 -08:00
Bobby R. Bruce
f26867a075 mem-cache: Revert "Prefetchers Improvements"
Reverts PR https://github.com/gem5/gem5/pull/564

Reverts commits:

* 047a494c2b
* 2abd65c270
* 38045d7a25
* 6416304e07
* 8598764a03

Change-Id: Id523acc1778c3f827637302a6465f5a9e539d6b5
2023-11-20 19:49:04 -08:00
Vishnu Ramadas
06161ded8c dev-amdgpu: Add VMID map to checkpoint
When restoring checkpoints for certain applications, gem5 tries to
create new doorbells with a pre-existing queue ID and simulation crashes
shortly after. This commit checkpoints the existing VMID map so that any
new doorbells after restoration use a unique queue ID

Change-Id: I9bf89a2769db26ceab4441634ff2da936eea6d6f
2023-11-20 21:19:17 -06:00
Bobby R. Bruce
08c0d1f27a dev: Fix std::min type mismatch in reg_bank.hh
https://github.com/gem5/gem5/pull/386 included two cases in
"src/dev/reg_bank.hh" where `std::min` was used to compare an integer
of type `size_t` and another of type `Addr`. This caused an error on my
Apple Silicon Mac as this is a comparison between an "unsigned long"
and an "unsigned long long" which (at least on my setup) was not
permitted. To fix this issue the `reg_size` was changed from `size_t` to
`Addr`, as well as the types of the values it was derived from and
the variable used to hold the return from the `std::min` calls.

Change-Id: I31e9c04a8e0327d4f6f5390bc5a743c629db4746
2023-11-20 17:33:44 -08:00
Vishnu Ramadas
d19d6fc31e dev-amdgpu: Add PM4 queue ID to GPU used VMID map
When restoring checkpoints for certain applications, gem5 tries to
create new doorbells with a pre-existing queue ID and simulation crashes
shortly after. This commit adds existing IDs to the GPU device's used
VMID map so that new doorbells are aware of existing queue IDs and use a
new ID. This ensures that queue IDs are unique after checkpoint
restoration

Change-Id: I9bf89a2769db26ceab4441634ff2da936eea6d6f
2023-11-16 17:30:00 -06:00
Jason Lowe-Power
db6a869786 mem-cache: Prefetchers Improvements (#564)
This pull request contains a set of small patches which fix some bugs in
the gem5 prefetchers, and align out-of-the-box prefetcher performance
more closely with that which a typical user would expect.

The performance patches have been tested with an out-of-the-box
(untuned) Stride prefetcher configuration against a set of SPEC 2017
SimPoints, and show a modest IPC uplift across the board, with no IPC
degradation.

The new defaults were identified as part of work on gem5 prefetchers
undertaken by Nikolaos Kyparissas while on internship at Arm.
2023-11-16 15:22:26 -08:00
Giacomo Travaglini
4ca2efac16 mem-ruby: AtomicNoReturn should check comp_anr instead of comp_wu (#545)
The comp_anr parameter is currently unused. Both parameters (comp_wu and
comp_anr) are set to false by default

Change-Id: If09567504540dbee082191d46fcd53f1363d819f

Signed-off-by: Giacomo Travaglini <giacomo.travaglini@arm.com>
2023-11-16 15:20:51 -08:00
Matthew Poremba
4965367724 mem-ruby, gpu-compute: fix SQC/TCP requests to same line (#540)
Currently, the GPU SQC (L1I$) and TCP (L1D$) have a performance bug
where they do not behave correctly when multiple requests to the same
cache line overlap one another.  The intended behavior is that if the
first request that arrives at the Ruby code for the SQC/TCP misses, it
should send a request to the GPU TCC (L2$).  If any requests to the
same cache line occur while this first request is pending, they should
wait locally at the L1 in the MSHRs (TBEs) until the first request has
returned.  At that point they can be serviced, and assuming the line
has not been evicted, they should hit.

For example, in the following test (on 1 GPU thread, in 1 WG):

load Arr[0]
load Arr[1]
load Arr[2]

The expected behavior (confirmed via profiling on real GPUs) is that
we should get 1 miss (Arr[0]) and 2 hits (Arr[1], Arr[2]) for such a
program.

However, the current support in the VIPER SQC/TCP code does not model
this correctly.  Instead it lets all 3 concurrent requests go straight
through to the TCC instead of stopping the Arr[1] and Arr[2] requests
locally while Arr[0] is serviced.  This causes all 3 requests to be
classified as misses.

To resolve this, this patch adds support into the SQC/TCP code to
prevent subsequent, concurrent requests to a pending cache line from
being sent in parallel with the original one.  To do this, we add an
additional transient state (IV) to indicate that a load is pending to
this cache line.  If a subsequent request of any kind to the same cache
line occurs while this load is pending, the requests are put on the
local wait buffer and woken up when the first request returns to the
SQC/TCP.  Likewise, when the first load is returned to the SQC/TCP, it
transitions from IV --> V.
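
The intended behavior for the three-load example can be sketched with a toy L1 model in Python. The class and its fields are hypothetical and far simpler than the VIPER SLICC code: evictions, stores, and bypassing requests are not modeled, and all three elements are assumed to map to the same cache line.

```python
class ToyL1:
    """Toy L1: loads to a line with a pending miss wait locally."""

    def __init__(self):
        self.state = {}    # line -> "IV" (load pending) or "V" (valid)
        self.waiting = {}  # line -> requests parked while the line is IV
        self.l2_requests = 0
        self.hits = 0
        self.misses = 0

    def load(self, line, req):
        st = self.state.get(line)
        if st == "V":
            self.hits += 1
        elif st == "IV":
            # A load is already pending: park the request, send nothing.
            self.waiting.setdefault(line, []).append(req)
        else:
            # First request to an invalid line: one miss, one L2 request.
            self.misses += 1
            self.l2_requests += 1
            self.state[line] = "IV"

    def fill(self, line):
        """Data returns from the L2: IV -> V, then replay parked requests."""
        self.state[line] = "V"
        for req in self.waiting.pop(line, []):
            self.load(line, req)
```

Issuing the three loads back to back and then calling `fill` yields one L2 request, one miss and two hits, matching the counts given for the example program.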

As part of this support, additional transitions were also added to
account for corner cases such as what happens when the line is evicted
by another request that maps to the same set index while the first load
is pending (the line is immediately given to the new request, and when
the load returns it completes, wakes up any pending requests to the same
line, but does not attempt to change the state of the line) and how GPU
bypassing loads and stores should interact with the pending requests
(they are forced to wait if they reach the L1 after the pending,
non-bypassing load; but if they reach the L1 before the non-bypassing
load then they make sure not to change the state of the line from IV if
they return before the non-bypassing load).

As part of this change, we also move the MSHR behavior from internally
in the GPUCoalescer for loads to the Ruby code (like all other
requests).  This is important to get correct hits and misses in stats
and other prints, since the GPUCoalescer MSHR behavior assumes all
requests serviced out of its MSHR also miss if the original request to
that line missed.

Although the SQC does not support stores, the TCP does.  Thus,
we could have applied a similar change to the GPU stores at the TCP.
However, since the TCP support assumes write-through caches and does not
attempt to allocate space in the TCP, we elected not to add this support
since it seems to run contrary to the intended behavior (i.e., the
intended behavior seems to be that writes just bypass the TCP and thus
should not need to wait for another write to the same cache line to
complete).

Additionally, making these changes introduced issues with deadlocks at
the TCC.  Specifically, some Pannotia applications have accesses to the
same cache line where some of the accesses are GLC (i.e., they bypass
the GPU L1 cache) and others are non-GLC (i.e., they want to be cached
in the GPU L1 cache). We have support already per CU in the above code.
However, the problem here is that these requests are coming from
different CUs and happening concurrently (seemingly because different
WGs are at different points in the kernel around the same time).
This causes a problem because our support at the TCC for the TBEs
overwrites the information about the GPU bypassing bits (SLC, GLC) every
time. The problem is when the second (non-GLC) load reaches the TCC, it
overwrites the SLC/GLC information for the first (GLC) load. Thus, when
the first load returns from the directory/memory, it no longer has
the GLC bit set, which causes an assert failure at the TCP.

After talking with other developers, it was decided the best way to handle
this and attempt to model real hardware more closely was to move the
point at which requests are put to sleep on the wakeup buffer from the
TCC to the directory. Accordingly, this patch includes support for that
-- now when multiple loads (bypassing or non-bypassing) from different
CUs reach the directory, all but the first one will be forced to wait
there until the first one completes, then will be woken up and
performed.  This required updating the WTRequestor information at the
TCC to pass the information about what CU performed the original request
for loads as well (otherwise since the TBE can be updated by multiple
pending loads, we can't tell where to send the final result to).  Thus,
I changed the field to be named CURequestor instead of WTRequestor since
it is now used for more than stores.  Moreover, I also updated the
directory to take this new field and the GLC information from incoming
TCC requests and then pass that information back to the TCC on the
response -- without doing this, because the TBE can be updated by
multiple pending, concurrent requests we cannot determine if this memory
request was a bypassing or non-bypassing request.  Finally, these
changes introduced a lot of additional contention and protocol stalls at
the directory, so this patch converted all directory uses of z_stall to
instead put requests on the wakeup buffer (and wake them up when the
current request completes). Without this, protocol stalls cause
many applications to deadlock at the directory.

However, this exposed another issue at the TCC: other applications
(e.g., HACC) have a mix of atomics and non-atomics to the same cache
line in the same kernel.  The TCC transitions to the A state when an
atomic arrives. For example, this can happen after the first pending
load returns to the TCC from the directory, which causes the TCC state
to become V, while there are still other pending loads at the TCC. This
causes invalid transition errors at the TCC when those pending loads
return, because
the A state thinks they are atomics and decrements the pending atomic
count (plus the loads are never sent to the TCP as returning loads).
This patch fixes this by changing the TCC TBEs to model the number of
pending requests, and not allowing atomics to be issued from the TCC
until all prior, pending non-atomic requests have returned.

Change-Id: I37f8bda9f8277f2355bca5ef3610f6b63ce93563
2023-11-16 14:24:00 -08:00
Bobby R. Bruce
bfe899e48e stdlib, resources: Update JSON data in workload (#532)
- resources field in workload now supports a dict with resources id and
version.

- Older workload JSON files are still supported, but a deprecation warning was added
2023-11-16 10:11:13 -08:00
David Schall
94879c2410 cpu: Require BTB hit to detect branches.
In a high performance CPU there is no other way than a BTB hit
to know about a branch instruction and its type. For low-end CPUs
pre-decoding might sit in front of the BPU to provide this information.
Currently, the BPU models only low-end behavior and updates the
RAS and the indirect branch prediction even without a BTB hit.
This patch adds two things to model the correct behavior for high-end
CPUs.
1. A check before the RAS and indirect predictor whether there was
a BTB hit or not. Only for BTB hits will the BPU consult the RAS and
indirect predictor.
2. Since this check requires a BTB hit, indirect branches must
also be installed into the BTB. For returns this was already done.

Change-Id: Ibef9aa890f180efe547c82f41fc71f457c988a89
Signed-off-by: David Schall <david.schall@ed.ac.uk>
2023-11-16 12:35:10 +00:00
Giacomo Travaglini
047a494c2b mem-cache: Optimize strided prefetcher address generation
This commit optimizes the address generation logic in the strided
prefetcher by introducing the following changes

(d is the degree of the prefetcher)

* Evaluate the fixed prefetch_stride only once (and not d-times)
* Replace 2d multiplications (d * prefetch_stride and distance *
prefetch_stride) with additions by updating the new base prefetch
address while looping
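
The strength reduction described above can be illustrated with two Python functions that produce the same addresses. The function names are hypothetical, and the layout (skip `distance` strides, then emit `degree` addresses) is an assumption based on the parameter descriptions in this log, not a copy of the gem5 source.

```python
def addresses_with_multiplies(pf_addr, stride, degree, distance):
    """Before: two multiplications per generated address (2d in total)."""
    out = []
    for d in range(1, degree + 1):
        out.append(pf_addr + distance * stride + d * stride)
    return out


def addresses_with_additions(pf_addr, stride, degree, distance):
    """After: one multiplication up front, then one addition per address."""
    out = []
    addr = pf_addr + distance * stride  # new base, computed once
    for _ in range(degree):
        addr += stride
        out.append(addr)
    return out
```

Both functions return the same list for any inputs; the second updates a running base address inside the loop instead of recomputing each product.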

Change-Id: I49c52333fc4c7071ac3d73443f2ae07bfcd5b8e4
Signed-off-by: Giacomo Travaglini <giacomo.travaglini@arm.com>
Reviewed-by: Richard Cooper <richard.cooper@arm.com>
Reviewed-by: Tiberiu Bucur <tiberiu.bucur@arm.com>
2023-11-16 09:48:15 +00:00
Nikolaos Kyparissas
2abd65c270 mem: added distance parameter to stride prefetcher
The Stride Prefetcher will skip this number of strides ahead of the
first identified prefetch, then generate `degree` prefetches at
`stride` intervals. A value of zero indicates no skip (i.e. start
prefetching from the next identified prefetch address).

This parameter can be used to increase the timeliness of prefetches by
starting to prefetch far enough ahead of the demand stream to cover
the memory system latency.
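
A minimal Python sketch of how `distance` and `degree` interact, assuming the semantics described above (a distance of zero starts at the next identified prefetch address). The function name and signature are hypothetical, not the gem5 API.

```python
def stride_prefetch_addresses(pf_addr, stride, degree, distance=0):
    """Skip `distance` strides, then emit `degree` strided addresses."""
    first = distance + 1
    return [pf_addr + k * stride for k in range(first, first + degree)]
```

With a 64-byte stride and a degree of 2, a distance of 0 gives offsets 64 and 128 from `pf_addr`, while a distance of 3 gives offsets 256 and 320, so the prefetches run further ahead of the demand stream.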

[Richard Cooper <richard.cooper@arm.com>:
- Added detail to commit comment and `distance` Param documentation.
- Changed `distance` Param from `Param.Int` to `Param.Unsigned`.
]

Change-Id: I6c4e744079b53a7b804d8eab93b0f07b566f0c08
Reviewed-by: Giacomo Travaglini <giacomo.travaglini@arm.com>
Signed-off-by: Richard Cooper <richard.cooper@arm.com>
2023-11-16 09:48:09 +00:00
Yu-Cheng Chang
ceabe86b31 arch-riscv: Add overrides to RISC-V Interrupts class (#568)
2023-11-15 18:36:15 -08:00
Matt Sinclair
c3326c78e6 mem-ruby, gpu-compute: fix SQC/TCP requests to same line
Currently, the GPU SQC (L1I$) and TCP (L1D$) have a performance bug
where they do not behave correctly when multiple requests to the same
cache line overlap one another.  The intended behavior is that if the
first request that arrives at the Ruby code for the SQC/TCP misses, it
should send a request to the GPU TCC (L2$).  If any requests to the
same cache line occur while this first request is pending, they should
wait locally at the L1 in the MSHRs (TBEs) until the first request has
returned.  At that point they can be serviced, and assuming the line
has not been evicted, they should hit.

For example, in the following test (on 1 GPU thread, in 1 WG):

load Arr[0]
load Arr[1]
load Arr[2]

The expected behavior (confirmed via profiling on real GPUs) is that
we should get 1 miss (Arr[0]) and 2 hits (Arr[1], Arr[2]) for such a
program.

However, the current support in the VIPER SQC/TCP code does not model
this correctly.  Instead it lets all 3 concurrent requests go straight
through to the TCC instead of stopping the Arr[1] and Arr[2] requests
locally while Arr[0] is serviced.  This causes all 3 requests to be
classified as misses.

To resolve this, this patch adds support into the SQC/TCP code to
prevent subsequent, concurrent requests to a pending cache line from being
sent in parallel with the original one.  To do this, we add an
additional transient state (IV) to indicate that a load is pending to
this cache line.  If a subsequent request of any kind to the same cache
line occurs while this load is pending, the requests are put on the
local wait buffer and woken up when the first request returns to the
SQC/TCP.  Likewise, when the first load is returned to the SQC/TCP, it
transitions from IV --> V.

As part of this support, additional transitions were also added to
account for corner cases such as what happens when the line is evicted
by another request that maps to the same set index while the first load
is pending (the line is immediately given to the new request, and when
the load returns it completes, wakes up any pending requests to the same
line, but does not attempt to change the state of the line) and how GPU
bypassing loads and stores should interact with the pending requests
(they are forced to wait if they reach the L1 after the pending,
non-bypassing load; but if they reach the L1 before the non-bypassing
load then they make sure not to change the state of the line from IV if
they return before the non-bypassing load).

As part of this change, we also move the MSHR behavior from internally
in the GPUCoalescer for loads to the Ruby code (like all other
requests).  This is important to get correct hits and misses in stats
and other prints, since the GPUCoalescer MSHR behavior assumes all
requests serviced out of its MSHR also miss if the original request to
that line missed.

Although the SQC does not support stores, the TCP does.  Thus,
we could have applied a similar change to the GPU stores at the TCP.
However, since the TCP support assumes write-through caches and does not
attempt to allocate space in the TCP, we elected not to add this support
since it seems to run contrary to the intended behavior (i.e., the
intended behavior seems to be that writes just bypass the TCP and thus
should not need to wait for another write to the same cache line to
complete).

Additionally, making these changes introduced issues with deadlocks at
the TCC.  Specifically, some Pannotia applications have accesses to the
same cache line where some of the accesses are GLC (i.e., they bypass
the GPU L1 cache) and others are non-GLC (i.e., they want to be cached
in the GPU L1 cache). We have support already per CU in the above code.
However, the problem here is that these requests are coming from
different CUs and happening concurrently (seemingly because different
WGs are at different points in the kernel around the same time).
This causes a problem because our support at the TCC for the TBEs
overwrites the information about the GPU bypassing bits (SLC, GLC) every
time. The problem is when the second (non-GLC) load reaches the TCC, it
overwrites the SLC/GLC information for the first (GLC) load. Thus, when
the first load returns from the directory/memory, it no longer has
the GLC bit set, which causes an assert failure at the TCP.

After talking with other developers, it was decided the best way to handle
this and attempt to model real hardware more closely was to move the
point at which requests are put to sleep on the wakeup buffer from the
TCC to the directory. Accordingly, this patch includes support for that
-- now when multiple loads (bypassing or non-bypassing) from different
CUs reach the directory, all but the first one will be forced to wait
there until the first one completes, then will be woken up and
performed.  This required updating the WTRequestor information at the
TCC to pass the information about what CU performed the original request
for loads as well (otherwise since the TBE can be updated by multiple
pending loads, we can't tell where to send the final result to).  Thus,
I changed the field to be named CURequestor instead of WTRequestor since
it is now used for more than stores.  Moreover, I also updated the
directory to take this new field and the GLC information from incoming
TCC requests and then pass that information back to the TCC on the
response -- without doing this, because the TBE can be updated by
multiple pending, concurrent requests we cannot determine if this memory
request was a bypassing or non-bypassing request.  Finally, these
changes introduced a lot of additional contention and protocol stalls at
the directory, so this patch converted all directory uses of z_stall to
instead put requests on the wakeup buffer (and wake them up when the
current request completes). Without this, protocol stalls cause
many applications to deadlock at the directory.

However, this exposed another issue at the TCC: other applications
(e.g., HACC) have a mix of atomics and non-atomics to the same cache
line in the same kernel.  The TCC transitions to the A state when an
atomic arrives. For example, this can happen after the first pending
load returns to the TCC from the directory, which causes the TCC state
to become V, while there are still other pending loads at the TCC. This
causes invalid transition errors at the TCC when those pending loads
return, because
the A state thinks they are atomics and decrements the pending atomic
count (plus the loads are never sent to the TCP as returning loads).
This patch fixes this by changing the TCC TBEs to model the number of
pending requests, and not allowing atomics to be issued from the TCC
until all prior, pending non-atomic requests have returned.

Change-Id: I37f8bda9f8277f2355bca5ef3610f6b63ce93563
2023-11-15 19:23:51 -06:00
Matt Sinclair
065ddf759f mem-ruby, gpu-compute: fix bug with GPU bypassing loads
The current GPU TCP (L1D$) Ruby SLICC code had a bug where a GPU
load that wants to bypass the L1D$ (e.g., GLC or SLC bit was set)
but the line is in Invalid when that request arrives, results in
a non-bypassing load being sent to the GPU TCC (L2$) instead of
a bypassing load.

This issue was not caught by the current nightly or weekly tests,
because the tests do not test for correctness in terms of hits
and misses in the caches.  However, tests for these corner cases
expose this issue.

To fix this, this patch removes the check that the entry is valid
when deciding what to do with a bypassing GPU load -- since the
TCP Ruby code has transitions for bypassing loads in both I and V,
we can simply call the LoadBypassEvict event in both cases and the
appropriate transition will handle the bypassing load given the
cache line's current state in the TCP.

Change-Id: Ia224cefdf56b4318b2bcbd0bed995fc8d3b62a14
2023-11-15 19:23:51 -06:00
hungweihsuG
83f1fe3fec dev: add debug flag in register bank. (#386)
Print extra logs for the full/partial read/write access to the registers
through the register bank. The debug flag is empty by default and would
not print anything.

Test: run unittest of dev/reg_bank.test.xml to check the behavior would
not affect the original functionality.
run gem5 with debug flags and use m5term to poke on registers.
2023-11-15 10:04:46 -08:00
wmin0
a8440f367d arch-riscv: Move fault handler addr logic to ISA (#554)
mtvec.mode is extended in new riscv proposals, such as fast interrupt.
This change moves that part from the Fault class to the ISA class for
extensibility.

Ref: https://github.com/riscv/riscv-fast-interrupt
2023-11-15 10:04:01 -08:00