Commit Graph

277 Commits

Author SHA1 Message Date
Matthew Poremba
6a9e80c54c gpu-compute: Support for MI200 GPU model (#733) 2024-01-15 08:18:34 -08:00
Matt Sinclair
ab9e61ea03 gpu-compute: WAX dependency detection (#731)
WAX Dependencies would be missed if a RAW Dependency also existed.
2024-01-05 12:57:24 -06:00
Matt Sinclair
dc85d1492c gpu-compute: Added register file cache support (#730)
The RFC is defaulted to a size of 0 which removes it completely. To use
the RFC set the --register-file-cache-size to a non-zero multiple of
two. In addition, rfc_pipe_length may be altered to increase or decrease
RFC latency benefit.
2024-01-05 12:57:06 -06:00
KaiBatley
359ac63280 gpu-compute: Added register file cache support
The RFC is defaulted to a size of 0 which removes it completely. To use
the RFC set the --register-file-cache-size to a non-zero multiple of
two. In addition, rfc_pipe_length may be altrered to increase or
decrease RFC latency benefit.

Change-Id: I6f5bf5b750eb64155fbc8c8343e9feadce5c9f79
2024-01-04 22:43:05 -06:00
KaiBatley
55fce58c19 gpu-compute: WAX dependency detection
WAX Dependencies would be missed if a RAW Dependency also existed.

Change-Id: I2a9e50b9d0540a30de9c1bf6bb544c7b9654cb29
2024-01-03 22:02:02 -06:00
Matthew Poremba
cc75281802 gpu-compute: Update code object to latest LLVM
The AMDKernelCode struct is very outdated. Most of the fields are no
longer used and have been replaced with new fields that are used.
Therefore in order to support the new fields the code object needs to be
updated. The new structure is based on the table located at
https://llvm.org/docs/AMDGPUUsage.html#code-object-v3-kernel-descriptor

Most notably this adds the new compute_pgm_rsrc3 and kernarg preload
fields which are new features in gfx90a (MI200). The accum_offset in
compute_pgm_rsrc3 and kergarg preload values are necessary to run
application which enable those features and therefore a way to check
their values is needed.

Also noteable is the removal of enable_sgpr_workgroup_id_{X,Y,Z}. These
seem to be unused in all versions of ROCm that gem5 supports and
therefore these fields can be removed. They are replaced with a reserved
field in the new code object.

Change-Id: I5542442e1e5961b05e17affad0adb5186d6d9d1a
2024-01-03 15:41:06 -06:00
Matthew Poremba
8c016ebbbc gpu-compute: Implement packed workitem ABI init
This initialization method is used in gfx90a (MI200). Rather than using
three VGPRs for X,Y,Z dimensions of the kernel, pack them into one
register with 10-bits for each dimensions.

Change-Id: I8e5b681c8287779ff9f80451d6028e862322294a
2024-01-03 10:40:34 -06:00
Matthew Poremba
5e45233484 gpu-compute: Add gfx version to HSA task entry
The version is necessary for determining the correct ABI init process.
Add it to the task queue so it is accessible when doing ABI init.

Change-Id: If77434b0f93614057b5c40fcf612d59b54e05dbb
2024-01-03 10:40:34 -06:00
Bobby R. Bruce
d11c40dcac misc: Run pre-commit run --all-files
This ensures `isort` is applied to all files in the repo.

Change-Id: Ib7ced1c924ef1639542bf0d1a01c5737f6ba43e9
2023-11-29 22:06:41 -08:00
Gabe Black
db3a6e8e84 scons: Use Kconfig to configure gem5.
These are not yet consumed by anything, but convert all the settings
from SCons variables to Kconfig variables.

If you have existing SConsopts files which need to be converted, you
should take a look at KCONFIG.md to learn about how kconfig is used in
gem5. You should decide if any variables need to be available to C++ or
kconfig itself, and whether those are options which should be detected
automatically, or should be up to the user. Options which should be
measured automatically should still be in SConsopts files, while user
facing options should be added to new or existing Kconfig files.

Generally, make sure you're storing c++/kconfig visible options in
env['CONF'][...]. Also remove references to sticky_vars since persistent
options should now be handled with kconfig, and export_vars since
everything in env['CONF'] is now exported automatically.

Switch SCons/gem5 to use Kconfig for configuration, except EXTRAS which
is still a sticky SCons variable. This is necessary because EXTRAS also
controls what config options exist. If it came from Kconfig itself, then
there would be a circular dependency. This dependency could
theoretically be handled by reparsing the Kconfig when EXTRAS
directories were added or removed, but that would be complicated, and
isn't supported by kconfiglib. It wouldn't be worth the significant
effort it would take to add it, just to use Kconfig more purely.

Change-Id: I29ab1940b2d7b0e6635a490452d05befe5b4a2c9
2023-11-23 08:26:10 +08:00
Matthew Poremba
e362310f3d gpu-compute: Update GPR allocation counts
GPR allocation is using fields in the AMD kernel code structure which
are not backwards compatible and are not populated in more recent
compiler versions. Use the granulated fields instead which is enfored to
be backwards compatible.

Change-Id: I718716226f5dbeb08369d5365d5e85b029027932
2023-11-01 14:52:39 -05:00
Matthew Poremba
f07e0e7f5d gpu-compute: Read dispatch packet with timing DMA
This fixes occasional readBlob fatals caused by the functional read of
system memory, seen often with the KVM CPU.

Change-Id: Ifccee666f62faa5b2fcf0a64a9d77c8cf95b3add
2023-11-01 14:52:39 -05:00
Matthew Poremba
d05433b3f6 gpu-compute,dev-hsa: Send vendor packet completion signal
gem5 does not currently implement any vendor-specific HSA packets.
Starting in ROCm 5.5, vendor packets appear to end with a completion
signal. Not sending this completion causes gem5 to hang. Since these
packets are not documented anywhere and need to be reverse engineered we
send the completion signal, if non-zero, and finish the packet as is the
current behavior.

Testing: HIP examples working on most recent ROCm release (5.7.1).

Change-Id: Id0841407bec564c84f590c943f0609b17e01e14c
2023-11-01 14:52:39 -05:00
Matthew Poremba
da11427ba6 gpu-compute: Update tokens for flat global/scratch (#408)
Memory instructions acquire coalescer tokens in the schedule stage.
Currently this is only done for buffer and flat instructions, but not
flat global or flat scratch. This change now acquires tokens for flat
global and flat scratch instructions. This provides back-pressure to the
CUs and helps to avoid deadlocks in Ruby.

The change also handles returning tokens for buffer, flat global, and
flat scratch instructions. This was previously only being done for
normal flat instructions leading to deadlocks in some applications when
the tokens were exhausted.

To simplify the logic, added a needsToken() method to GPUDynInst which
return if the instruction is buffer or any flat segment.

The waitcnts were also incorrect for flat global and flat scratch. We
should always decrement vmem and exp count for stores and only normal
flat instructions should decrement lgkm. Currently vmem/exp are not
decremented for flat global and flat scratch which can lead to deadlock.
This change set fixes this by always decrementing vmem/exp and lgkm only
for normal flat instructions.

Change-Id: I673f4ac6121e4b5a5e8491bc9130c6d825d95fc5
2023-10-11 09:00:10 -07:00
Matthew Poremba
9f4d334644 gpu-compute: Update tokens for flat global/scratch
Memory instructions acquire coalescer tokens in the schedule stage.
Currently this is only done for buffer and flat instructions, but not
flat global or flat scratch. This change now acquires tokens for flat
global and flat scratch instructions. This provides back-pressure to the
CUs and helps to avoid deadlocks in Ruby.

The change also handles returning tokens for buffer, flat global, and
flat scratch instructions. This was previously only being done for
normal flat instructions leading to deadlocks in some applications when
the tokens were exhausted.

To simplify the logic, added a needsToken() method to GPUDynInst which
return if the instruction is buffer or any flat segment.

The waitcnts were also incorrect for flat global and flat scratch. We
should always decrement vmem and exp count for stores and only normal
flat instructions should decrement lgkm. Currently vmem/exp are not
decremented for flat global and flat scratch which can lead to deadlock.
This change set fixes this by always decrementing vmem/exp and lgkm only
for normal flat instructions.

Change-Id: I673f4ac6121e4b5a5e8491bc9130c6d825d95fc5
2023-10-10 09:48:16 -05:00
Matthew Poremba
6a4b2bb096 dev-hsa,gpu-compute: Add timestamps to AMD HSA signals
The AMD specific HSA signal contains start/end timestamps for dispatch
packet completion signals. These are current always zero. These
timestamp values are used for profiling in the ROCr runtime.
Unfortunately, the GpuAgent::TranslateTime method in ROCr does not check
for zero values before dividing, causing applications that use profiling
to crash with SIGFPE. Profiling is used via hipEvents in the HACC
application, so these should be supported in gem5.

In order to handle writing the timestamp values, we need to DMA the
values to memory before writing the completion signal. This changes the
flow of the async completion signal write to be (1) read mailbox pointer
(2) if valid, write the mailbox data, other skip to 4 (3) write mailbox
data if pointer is valid (4) write timestamp values (5) write completion
signal. The application will process the timestamp data as soon as the
completion signal is received, so we need to ordering to ensure the DMA
for timestamps was completed.

HACC now runs to completion on GPUFS and has the same output was
hardware.

Change-Id: I09877cdff901d1402140f2c3bafea7605fa6554e
2023-10-06 13:21:40 -05:00
Matthew Poremba
2b97f17fe1 gpu-compute: Fix dynamic scratch size test
ROCm supports dynamically allocating scratch space, which resides in
framebuffer memory, to reduce the amount of memory allocated for kernels
that have not yet launched. The size of the scratch space allocated is
located in task->amdQueue.compute_tmpring_size_wavesize. This size is in
kilobytes. The AQL task contains the number of bytes requested *per work
item*, however we currently check if there is enough tmpring space by
comparing a single work item. This should instead check the size *per
wavefront*.

This causes problems in applications where multiple kernels use dynamic
scratch allocation and a later kernel requires more space than the
earlier kernel. The only application being tested that does this is
LULESH. This was resulting in the scratch space being too small,
resulting in workgroups clobbering each other's private memory leading
to some nasty bugs. It is fixed by this patch as task->amdQueue will be
re-read from the host and will contain the updated tmpring size. After
this there is enough scratch space and LULESH makes forward progress.

Change-Id: Ie9e0f92bb98fd3c3d6c2da3db9ee65352f9ae070
2023-10-04 09:38:31 -05:00
Matthew Poremba
cfa833a97d gpu-compute: Set LDS/scratch aperture base register
Starting with gfx900 (Vega) the LDS and scratch apertures can be queried
using a new s_getreg_b32 instruction. If the instruction is called with
the SH_MEM_BASES argument it returns the upper 16 bits of a 64 bit
address for the LDS and scratch apertures. The current addresses cannot
be encoded in this register, so that addresses are changed to have the
lower 48 bits be all zeros in addition to writing the bases register.

Change-Id: If20f262b2685d248afe31aa3ebb274e4f0fc0772
2023-08-31 11:01:32 -05:00
Matthew Poremba
60f071d09a gpu-compute,arch-vega: Implement flat scratch insts
Flat scratch instructions (aka private) are the 3rd and final segment of
flat instructions in gfx9 (Vega) and beyond. These are used for things
like spills/fills and thread local storage. This commit enables two
forms of flat scratch instructions: (1) flat_load/flat_store
instructions where the memory address resolves to private memory and (2)
the new scratch_load/scratch_store instructions in Vega. The first are
similar to older generation ISAs where the aperture is unknown until
address translation. The second are instructions guaranteed to go to
private memory.

Since these are very similar to flat global instructions there are
minimal changes needed:

- Ensure a flat instruction is either regular flat, global, XOR scratch
- Rename the global op_encoding methods to GlobalScratch to indicate
  they are for both and are intentionally used.
- Flat instructions in segment 1 output scratch_ in the disassembly
- Flat instruction executed as private use similar mem helpers as global
- Flat scratch cannot be an atomic

This was tested using a modified version of the 'square' application:

template <typename T>
__global__ void
scratch_square(T *C_d, T *A_d, size_t N)
{
    size_t offset = (blockIdx.x * blockDim.x + threadIdx.x);
    size_t stride = blockDim.x * gridDim.x ;

    volatile int foo; // Volatile ensures scratch / unoptimized code

    for (size_t i=offset; i<N; i+=stride) {
        foo = A_d[i];
        C_d[i] = foo * foo;
    }
}

Change-Id: Icc91a7f67836fa3e759fefe7c1c3f6851528ae7d
2023-08-26 13:40:12 -05:00
Matthew Poremba
4506188e00 gpu-compute: Fix private offset/size register indexes
According to the ABI documentation from LLVM, the *low* register of flat
scratch (maxSGPR - 4) is the offset and the high register (maxSGPR - 3)
is size. These are currently backwards, resulting in some gnarly
addresses being generated leading to page fault and/or incorrect data.

This commit fixes this by setting the order correctly.

Change-Id: I0b1d077c49c0ee2a4e59b0f6d85cdb8f17f9be61
2023-08-26 13:40:12 -05:00
Matthew Poremba
e0379f4526 gpu-compute: Fix flat scratch resource counters
Flat instructions may access memory locations in LDS (scratchpad) and
global (VRAM/framebuffer) and therefore increment both counters when
dispatched. Once the aperture is known, we decrement the counters of the
aperture that was *not* used. This is done incorrectly for scratch /
private flat instruction. Private memory is global and therefore local
memory counters should be decremented.

This commit fixes the counters by changing the global decrements to
local decrements.

Change-Id: I25890446908df72e5469e9dbaba6c984955196cf
2023-08-26 13:40:12 -05:00
Matthew Poremba
57b3d2897c gpu-compute: Use timing DMAs for GPUFS HSA signals
The functional HSA signal read was a hack left in the gpu-compute code.
In full system, this functional read is causing problems occasionally
with the translation not yet being in the page table. The error message
output by gem5 was a fatal message on the readBlob method in port proxy.
Changing this to a timing DMA fixes this problem.

This commit adds the various timing DMA functions to send and receive
response and clean up. A helper method "sendCompletionSignal" is added
to the GPUCommandProcessor because the indentation level was getting too
deep. This change applies only to FS mode. Code for SE mode is
equivalent to what it was before this commit.

Change-Id: I1bfcaa0a52731cdf9532a7fd0eb06ab2f0e09d48
2023-08-25 13:10:51 -05:00
Matthew Poremba
90a518e885 gpu-compute,arch-vega: Fix ALU-only LDS counters
There are a few LDS instructions that perform local ALU operations and
writeback which are marked as loads. These are marked as loads because
they fit in the pipeline logic better, according to a several year old
comment. In the VEGA ISA these instructions (swizzle, permute, bpermute)
are not decrementing the LDS load counter. As a result, the counter will
gradually increase over time. Since wavefront slots are persistent, this
can cause applications with a few thousand kernels to eventually hang
thinking there are not enough resources.

This changeset fixes this by decrementing the LDS load counter for these
instructions. This fix was already integrated in the GCN3 ISA in the
exact same way. This changeset moves it near a similar comment about
scheduling register file writes.

Change-Id: Ife5237a2cae7213948c32ef266f4f8f22917351c
2023-08-23 19:30:24 -05:00
Matthew Poremba
df4739929d gpu-compute: Change kernel-based exit location
The previous exit event occurs when the dispatcher sends a completion
signal for a kernel, but gem5 does some kernel-based stats updates after
the signal is sent. Therefore, if these exit events are used as a way to
dump per-kernel stats, some of the stats for the kernel that just ended
will be in the next kernel's stat dump which is misleading.

This patch moves the exit event to where the stats are updated and only
exits if the dispatcher has requested a stat dump to prevent situations
where stats are updated mid-kernel.

Change-Id: I74dc1cad5fc90382a2a80564764b3e7c9fb65521
2023-08-15 11:06:26 -05:00
KaiBatley
efa1d87add configs: fix GPU's default number of HW barrier/CU (#92)
AMD GCN3 and Vega GPUs assume a max of 16 WG/CU.  Any GPU WG with more
than 1 WF requires a hardware barrier to allow WFs in the WG to
synchronize locally.  However, currently the default gem5 GPU
configuration assumes only 4 barriers per CU, which artificially
prevents applications with > 4 WG/CU that could run simultaneously
from running simultaneously.

This fix resolves this by updating the default number of hardware barriers
per CU to 16, which mimics the support described in slide 39 here:
https://www.olcf.ornl.gov/wp-content/uploads/2019/10/
ORNL_Application_Readiness_Workshop-AMD_GPU_Basics.pdf

Change-Id: Ib7636a13359d998e676c1790f436a83ce88cbfc0
2023-07-17 10:42:40 -07:00
Jason Lowe-Power
442923c414 Add feature to output citations automatically based on configuration (#90)
This change adds a new file to m5out which is citations.bib.
This file will contain the citations to the papers which describe the
aspects of the gem5 simulator that the simulation uses. In other words,
each simulation configuration could generate a different bib file
referencing different works.

Each SimObject can now have a set of citations associated with it. After
the system is built (in `instantiate`), the citations.bib file is
created by parsing all SimObjects that have been instantiated and taking
the union of their associated citations.

This commit is not meant to add all citations, but to act as an example
for others to add more citations to gem5.

Change-Id: Icd5c46fd9ee44adbeec1fea162657f5716f7e5ef
Signed-off-by: Jason Lowe-Power <jason@lowepower.com>
2023-07-17 10:41:51 -07:00
Matthew Poremba
3756af8ed9 gpu-compute,configs: Make sim exits conditional
The unconditional exit event when a kernel completes that was added in
c644eae2dd is causing scripts that do not
ignore unknown exit events to end simulation prematurely. One such
script is the apu_se.py script used in SE mode GPU simulation. Make this
exit conditional to the parameter being set to a valid value to avoid
this problem.

Change-Id: I1d2c082291fdbcf27390913ffdffb963ec8080dd
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/72098
Reviewed-by: Jason Lowe-Power <power.jg@gmail.com>
Maintainer: Matt Sinclair <mattdsinclair@gmail.com>
Maintainer: Jason Lowe-Power <power.jg@gmail.com>
Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com>
Tested-by: kokoro <noreply+kokoro@google.com>
2023-07-07 14:12:54 +00:00
Matthew Poremba
c644eae2dd configs,gpu-compute: Kernel dispatch-based exit events
Add two kernel dispatch-based exit events that are useful for limiting
the simulation and enabling debug flags at specific GPU kernels. Since
the KVM CPU typically used with GPUFS is not deterministic, this help
with enabling debug flags when the Tick number may vary. The exit at GPU
kernel option can also limit simulation by only simulating a few hundred
kernels, for example, and exit at a determined point.

Change-Id: I81bae92a80c25fc38c41e999aa662e1417b7a20d
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/71418
Maintainer: Matt Sinclair <mattdsinclair@gmail.com>
Tested-by: kokoro <noreply+kokoro@google.com>
Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com>
2023-06-08 22:03:47 +00:00
Matthew Poremba
ebd5b3e4ae gpu-compute: Gfx version check for FS and SE mode
There is no GPU device in SE mode to get version from and no GPU driver
in FS mode to get version from, so a conditional needs to be added
depending on the mode to get the gfx version.

Change-Id: I33fdafb60d351ebc5148e2248244537fb5bebd31
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/71078
Tested-by: kokoro <noreply+kokoro@google.com>
Maintainer: Matt Sinclair <mattdsinclair@gmail.com>
Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com>
2023-06-01 00:15:02 +00:00
Matthew Poremba
6b4a1020be configs,dev-amdgpu: GPUFS MI200/gfx90a support
Add support for MI200-like device. This includes adding PCI IDs and new
MMIOs for the device, a different MAP_PROCESS packet, and a different
calculation for the number of VGPRs.

Change-Id: I0fb7b3ad928826beaa5386d52a94ba504369cb0d
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/70317
Reviewed-by: Jason Lowe-Power <power.jg@gmail.com>
Maintainer: Jason Lowe-Power <power.jg@gmail.com>
Tested-by: kokoro <noreply+kokoro@google.com>
2023-05-25 19:14:32 +00:00
Giacomo Travaglini
7b39a7f14e misc: Rename DEBUG macro into GEM5_DEBUG
The DEBUG macro is not part of any compiler standards (differently from
NDEBUG, which elides assertions).

It is only meant to differentiate gem5.debug from .fast and .opt builds.
gem5 developers have used it to insert helper code that is supposed to
aid the debugging process in case anything goes wrong.

This generic name is likely to clash with other libraries linked with
gem5.  This is the case of DRAMSim as an example.

Rather than using undef tricks, we just inject a GEM5_DEBUG macro
for gem5.debug builds.

Change-Id: Ie913ca30da615bd0075277a260bbdbc397b7ec87
Signed-off-by: Giacomo Travaglini <giacomo.travaglini@arm.com>
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/69079
Tested-by: kokoro <noreply+kokoro@google.com>
Reviewed-by: Daniel Carvalho <odanrc@yahoo.com.br>
Maintainer: Jason Lowe-Power <power.jg@gmail.com>
Reviewed-by: Jason Lowe-Power <power.jg@gmail.com>
2023-03-21 06:53:55 +00:00
Gabriel Busnot
7f4c92c910 mem,arch-arm,mem-ruby,cpu: Remove use of deprecated base port owner
Change-Id: I29214278c3dd4829c89a6f7c93214b8123912e74
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/67452
Reviewed-by: Bobby Bruce <bbruce@ucdavis.edu>
Tested-by: kokoro <noreply+kokoro@google.com>
Maintainer: Daniel Carvalho <odanrc@yahoo.com.br>
Maintainer: Bobby Bruce <bbruce@ucdavis.edu>
Reviewed-by: Daniel Carvalho <odanrc@yahoo.com.br>
2023-02-03 06:11:45 +00:00
Vishnu Ramadas
d6bbccb60a gpu-compute : Fix incorrect TLB stats when FunctionalTLB is used
When FunctionalTLB is used in SE mode, the stats tlbLatency and
tlbCycles report negative values. This patch fixes it by disabling the
updates that result in negative values when FunctionalTLB is set to true

Change-Id: I6962785fc1730b166b6d5b879e9c7618a8d6d4b3
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/67202
Reviewed-by: Matt Sinclair <mattdsinclair.wisc@gmail.com>
Maintainer: Matt Sinclair <mattdsinclair@gmail.com>
Maintainer: Matthew Poremba <matthew.poremba@amd.com>
Reviewed-by: Matthew Poremba <matthew.poremba@amd.com>
Tested-by: kokoro <noreply+kokoro@google.com>
2023-01-10 02:27:29 +00:00
Matthew Poremba
af2cecf59e gpu-compute: Fix ABI init for DispatchId
DispatchId should allocate two SGPRs instead of one. Allocating one was
causing all subsequent SGPR index values to be off by one, leading to
bad addresses for things like flat scratch and private segment. This
field is not used very often so it was not impacting most applications.

Change-Id: I17744e2d099fbc0447f400211ba7f8a42675ea06
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/66711
Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com>
Maintainer: Matt Sinclair <mattdsinclair@gmail.com>
Tested-by: kokoro <noreply+kokoro@google.com>
2022-12-16 18:16:18 +00:00
Hoa Nguyen
eac06ad681 python: Fix multiline quotes in a single line
An example case,
```python
mem_side_port = RequestPort(
    "This port sends requests and " "receives responses"
)
```

This is the residue of running the python formatter.
This is done by finding all tokens matching the regex `"\s"(?![.;"])`
and manually replacing them by empty strings.

Change-Id: Icf223bbe889e5fa5749a81ef77aa6e721f38b549
Signed-off-by: Hoa Nguyen <hoanguyen@ucdavis.edu>
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/66111
Reviewed-by: Bobby Bruce <bbruce@ucdavis.edu>
Maintainer: Bobby Bruce <bbruce@ucdavis.edu>
Tested-by: kokoro <noreply+kokoro@google.com>
2022-11-29 23:44:38 +00:00
vramadas95
dff879cf21 configs, gpu-compute: Add configurable L1 scalar latencies
Previously the scalar cache path used the same latency parameter as the
vector cache path for memory requests. This commit adds new parameters
for the scalar cache path latencies. This commit also modifies the model
to use the new latency parameter to set the memory request latency in
the scalar cache. The new paramters are '--scalar-mem-req-latency' and
'--scalar-mem-resp-latency' and are set to default values of 50 and 0
respectively

Change-Id: I7483f780f2fc0cfbc320ed1fd0c2ee3e2dfc7af2
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/65511
Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com>
Maintainer: Jason Lowe-Power <power.jg@gmail.com>
Reviewed-by: Jason Lowe-Power <power.jg@gmail.com>
Tested-by: kokoro <noreply+kokoro@google.com>
Maintainer: Matt Sinclair <mattdsinclair@gmail.com>
2022-11-12 02:23:02 +00:00
Matthew Poremba
8d63c9fc06 gpu-compute: Add granulated SGPR computation for gfx9
The granulated SGPR size is used when the number of SGPRs is unknown.
The computation for this has changed since gfx8 and is commented as a
TODO in a comment.

This changeset implements the change and also checks for an invalid SGPR
count. According to LLVM code this could happen "due to a compiler bug
or when using inline asm.":
https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/AMDGPU/
    AMDGPUAsmPrinter.cpp#L723

Change-Id: Ie487a53940b323a0002341075e0f81af4147a7d8
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/65252
Maintainer: Matt Sinclair <mattdsinclair@gmail.com>
Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com>
Tested-by: kokoro <noreply+kokoro@google.com>
2022-11-08 21:34:11 +00:00
Matthew Poremba
f6dc5c6aa4 gpu-compute: Chunkify AMDKernelCode read from device
The AMDKernelCode object can span potentially span two pages. Currently
the copy loop from device memory only translates once at the base
address.

This changeset translates one cache line at a time before copying and
has the ancillary benefit for cleaning up this code a bit.

Change-Id: I602bc12d8f8c5d3a3e57ab3f42f7dd3df58dc144
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/65251
Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com>
Tested-by: kokoro <noreply+kokoro@google.com>
Reviewed-by: Jason Lowe-Power <power.jg@gmail.com>
Maintainer: Jason Lowe-Power <power.jg@gmail.com>
2022-11-08 21:34:11 +00:00
Matthew Poremba
ec696c00b2 gpu-compute: Add missing initial reg state in WF
There are two initial scalar register fields that are not initialized in
the wavefront when a task is dispatch. This changeset adds the missing
DispatchId and PrivateSegSize fields. These fields are typically used
when an application is compiled with debug support and are typically not
used in the applications in gem5's test suite.

Change-Id: I5b5fa75e4badfd9ba7588e4cd485ebf75fd5d627
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/64191
Maintainer: Matt Sinclair <mattdsinclair@gmail.com>
Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com>
Tested-by: kokoro <noreply+kokoro@google.com>
2022-10-06 23:14:40 +00:00
Matthew Poremba
9f5c0f2822 gpu-compute: dprint instruction requesting translation
When debugging strange addresses, it is extremely useful to know *what*
instruction calculated that address. This make it much easier to follow
assembly code backwards to find the source of an incorrect address.

This change adds a DPRINTF for GPUTLB that by default prints the
disassembly when a virtual address translation is sent to the TLB.

Change-Id: I5066c064a48c5c48696863eeccd8d011245ef7b2
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/63176
Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com>
Maintainer: Matt Sinclair <mattdsinclair@gmail.com>
Tested-by: kokoro <noreply+kokoro@google.com>
2022-09-09 04:13:49 +00:00
Gabe Black
f4209bbdee misc: Remove lingering uses of TheISA::.
Change-Id: Ie55e0d79867fbc8f75a993fb456a58c84de5def4
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/62196
Reviewed-by: Giacomo Travaglini <giacomo.travaglini@arm.com>
Tested-by: kokoro <noreply+kokoro@google.com>
Maintainer: Giacomo Travaglini <giacomo.travaglini@arm.com>
2022-08-20 07:30:16 +00:00
Alexandru Dutu
c6b38909e1 gpu-compute: Adding support for LDS atomics
This changeset is adding support for LDS atomics
and implementing DS_OR_B32 instruction.

Change-Id: I84c5cf6ce0e9494726dc7299f360551cd2a485f5
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/61791
Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com>
Maintainer: Matt Sinclair <mattdsinclair@gmail.com>
Tested-by: kokoro <noreply+kokoro@google.com>
2022-08-19 16:44:31 +00:00
Bobby R. Bruce
787204c92d python: Apply Black formatter to Python files
The command executed was `black src configs tests util`.

Change-Id: I8dfaa6ab04658fea37618127d6ac19270028d771
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/47024
Maintainer: Bobby Bruce <bbruce@ucdavis.edu>
Reviewed-by: Jason Lowe-Power <power.jg@gmail.com>
Reviewed-by: Giacomo Travaglini <giacomo.travaglini@arm.com>
Tested-by: kokoro <noreply+kokoro@google.com>
2022-08-03 09:10:41 +00:00
Matthew Poremba
68115460d8 gpu-compute: Set LDS and Scratch apertures in FS
The LDS and scratch aperture base and limits are hardcoded to some
values that are useful for SE mode. In reality, these are chosen by the
driver so we need to honor whatever values the driver passes so that
when addresses are calculated they fall into the correct aperture to
route flat instructions to those apertures.

This overwrites the default hardcoded values for LDS and scratch base
and limit using the values providing by the driver in a MAP_PROCESS
packet.

Change-Id: I0e194a26631f697819d8aaecf1bf346a7b7c7026
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/61656
Maintainer: Matt Sinclair <mattdsinclair@gmail.com>
Tested-by: kokoro <noreply+kokoro@google.com>
Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com>
2022-07-28 14:10:33 +00:00
Matthew Poremba
f65f5a8981 gpu-compute,arch-vega: Overhaul HWRegs, setreg, getreg
These instructions are supposed to be read/writing special shader
hardware registers. Currently they are getting/setting to an SGPR. This
results in getting incorrect registers at best and clobbering an SGPR
being used by an application at worst. Furthermore, some registers need
to be set in the shader and the application will never (can never) set
them.

This patch overhauls the getreg/setreg instructions to use different
storage in the shader. The values will be updated either via setreg from
an application (e.g., mode register) or set by a PM4 MAP_PROCESS.

Change-Id: Ie5e5d552bd04dc47f5b35b5ee40a569ae345abac
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/61655
Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com>
Tested-by: kokoro <noreply+kokoro@google.com>
Maintainer: Matt Sinclair <mattdsinclair@gmail.com>
2022-07-28 14:10:33 +00:00
Matthew Poremba
ee75e19b8b gpu-compute: Fix dynamic scratch allocation on GPUFS
When GPU needs more scratch it requests from the runtime. In the
method to wait for response, a dmaReadVirt is called with the same
method as the callback with zero delay. This means that effectively
there is an infinite loop in the event queue if the scratch setup is not
successful on the first attempt. In the case of GPUFS, it is never
successfully instantly so a delay must be added. Without added delay,
the host CPU is never scheduled to make progress setting up more scratch
space.

The value 1e9 is choosen to match the KVM quantum and hopefully give KVM
a chance to schedule an event. For reference, the driver timeout is
200ms so this is still fairly aggressive checking of the signal response.
This value is also balanced around the GPUCommandProc DPRINTF to
prevent the print in this method from overwhelming debug output.

Change-Id: I0e0e1d75cd66f7c47815b13a4bfc3c0188e16220
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/61651
Tested-by: kokoro <noreply+kokoro@google.com>
Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com>
Maintainer: Matt Sinclair <mattdsinclair@gmail.com>
2022-07-28 14:10:33 +00:00
Gabe Black
397e66d8b6 gpu-compute: Stop passing in a default value to resize().
A default constructed element of the container is the default value to
the second resize() parameter. Having that parameter explicitly causes a
warning/error in newer versions of gcc, and is unnecessary regardless.

Change-Id: I6aee2d23f0f4382b00caf552c8e38940614c5f9a
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/60311
Reviewed-by: Jason Lowe-Power <power.jg@gmail.com>
Tested-by: kokoro <noreply+kokoro@google.com>
Maintainer: Gabe Black <gabe.black@gmail.com>
Reviewed-by: Yu-hsin Wang <yuhsingw@google.com>
2022-06-03 10:48:31 +00:00
Matthew Poremba
8fe975e57e gpu-compute: Fatal on dynamic scratch allocation in GPUFS
This is known not working in GPUFS. As a result, the simulation will
never end. Rather than simulate forever, add a fatal for now to exit
simulation until support for this functionality is added.

Change-Id: I8e45996a7eb781575e8643baea05daf87bc5f1c3
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/58472
Reviewed-by: Jason Lowe-Power <power.jg@gmail.com>
Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com>
Maintainer: Matt Sinclair <mattdsinclair@gmail.com>
Tested-by: kokoro <noreply+kokoro@google.com>
2022-04-08 17:12:32 +00:00
Matthew Poremba
e36a8dbd8a gpu-compute: Handle GPUFS system store responses
Requests in GPUFS which go to system memory will not generate the
WriteCompleteResp packets that the VIPER protocol would normally created
for device requests which go through the caches. Therefore, we need to
callback the GM pipe handleResponse to complete the access and make
forward progress.

Change-Id: Ic00c430ce420a591fe5743f758b780d93afd2a38
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/57989
Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com>
Maintainer: Matt Sinclair <mattdsinclair@gmail.com>
Tested-by: kokoro <noreply+kokoro@google.com>
2022-04-07 20:11:01 +00:00
Matthew Poremba
6feaa88e27 gpu-compute: Command processor read path from device
In full system mode, the AMDKernelCode object can reside in either the
system memory or in the dGPU device memory. Currently only reading from
the host/system memory is supported. This adds the necessary code to
read from the dGPU device memory.

Change-Id: I887fc706b3f9834db14e40f36fd29dd3d4602925
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/57710
Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com>
Maintainer: Matt Sinclair <mattdsinclair@gmail.com>
Tested-by: kokoro <noreply+kokoro@google.com>
2022-04-07 20:11:01 +00:00