306 Commits

Author SHA1 Message Date
Marco Kurzynski
a8447b7fc0 arch-vega: Pass s_memtime through smem pipe (#1350)
The Vega ISA's s_memtime instruction is used to obtain a cycle value
from the GPU. Previously, this was implemented to obtain the cycle count
when the memtime instruction reached the execute stage of the GPU
pipeline. However, from microbenchmarking we have found that this
underreports the latency for memtime instructions relative to real hardware.
Thus, we changed its behavior to go through the scalar memory pipeline
and obtain a latency value from the SQC (L1 I$). This mirrors the
suggestion of the AMD Vega ISA manual that s_memtime should be treated
like a s_load_dwordx2.

The default latency was set based on microbenchmarking.

Change-Id: I5e251dde28c06fe1c492aea4abf9f34f05784420
2024-08-26 19:47:04 -07:00
Matthew Poremba
7d46c50663 arch-vega: Swizzle multi-dword scratch requests (#1445)
Scratch memory requests that are larger than one dword are using a
different memory layout than global instructions. Rather than being
placed contiguously, each dword is interleaved 64 lanes * 4 bytes away
as described in Section 9.1.5.2. "Swizzled Buffer Addressing" in the
MI300 specification. This was verified by comparing MI300 output (which
uses scratch_ instructions) with MI200 (which uses buffer instructions).
MI300 FashionMNIST bs=1 now matches CPU reference.
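
As a rough illustration of the layout described above, the sketch below computes where each dword of a single lane's multi-dword access would land. The helper names are hypothetical and are not gem5 functions; the 64-lane wavefront and 4-byte dword come from the commit message.

```python
LANES_PER_WAVEFRONT = 64  # lane count quoted in the commit message
DWORD_BYTES = 4


def swizzled_dword_addrs(lane_addr, num_dwords):
    """Byte address of each dword of one lane's scratch access.

    Hypothetical helper: dword d sits d * (64 lanes * 4 bytes) past the
    lane's first dword instead of directly after it.
    """
    stride = LANES_PER_WAVEFRONT * DWORD_BYTES
    return [lane_addr + d * stride for d in range(num_dwords)]


def contiguous_dword_addrs(lane_addr, num_dwords):
    """Contiguous layout used by global instructions, for comparison."""
    return [lane_addr + d * DWORD_BYTES for d in range(num_dwords)]
```

Under these assumptions the dwords of an x2 or x4 access are 256 bytes apart, which is consistent with the commit's note that at least one packet per dword is needed.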

This requires several changes to the instruction implementations:
- For stores, data in the GPUDynInst can be swizzled before the data is
written to memory. This is easy to do using a helper method. This is
done in the template<int N> variant of initMemWrite. To use this, x2
stores are changed to use template<int N> rather than loading a U64. The
swizzle function is renamed to swizzleAddr to avoid confusion with
swizzleData.
- For loads, data is unswizzled in completeAcc when writing register
values. This is not as easy to implement as a helper and is thus
implemented for the three load instructions that load more than one
dword.
- Accessing swizzled data requires at least one packet per dword. A new
GPU memory helper is added to create these packets for scratch requests
specifically. This is called in the template<int N> variant of
initMemRead / initMemWrite. Loads and stores of x2 are changed to use
this variant instead of accessing a U64.

The GPUDynInst status vector restrictions are increased to allow for
swizzled x4 accesses. For simplicity this does not currently support
misaligned swizzled accesses and will panic upon seeing such a case.

Change-Id: Ic686c51e28e0af029a043d5a5b3d4069f2cb94f9
2024-08-12 06:58:48 -07:00
Matthew Poremba
84fedecafe gpu-compute: Update Requests for invalidations
The SQC and TCC invalidations share a Request pointer which they both
modify. This can cause some problems, so use a different request pointer
for each invalidate. The setContext call is also removed as the value
being assigned to it is uninitialized.

Change-Id: I82ea7aa44a4f4515c1560993caa26cc6a89355af
2024-08-07 14:37:49 -07:00
Matt Sinclair
edd73bd330 gpu-compute: fix typo in GPUMem debug print (#1412)
The GPUMem print for when a memstatus request completes accidentally put
a newline before the word "complete", causing "complete" to print on a
new line and causing confusion. This commit resolves that.
2024-08-05 12:44:13 -07:00
Matt Sinclair
ba455e2025 gpu-compute: update GPUKernelInfo print to print WG number (#1413)
Whenever a GPU kernel is launching a new WG, the GPUKernelInfo debug
flag will print that the kernel is being launched, without the context
of which WG from that kernel is being launched. This has caused some
confusion to users, who think the entire kernel is being launched
repeatedly. To resolve this confusion, update this print to make it
clear which WG is being launched when this print is enabled.
2024-08-05 12:43:41 -07:00
TiredTumblrina
9fb0b18863 gpu-compute,mem,systemc: This commit corrects typos of 'cache' (#1263)
I noticed while using the stable branch that there were a few typos of
the word 'cache' and so I've corrected a few files where I found such
typos.

Change-Id: I7c7f64812039f34fe39d0c45c4f5ce921cba06d0
2024-06-20 09:45:13 -07:00
Matthew Poremba
2b0ca93517 gpu-compute: Fix architected flat scratch
Architected flat scratch is currently written to the SRF, which is
incorrect, as the physical register number can be clobbered by another
wavefront if registers get renamed to the physical register number.

Fix this by actually architecting the register, i.e., there is a
dedicated "hardware" register in the wavefront class.

Change-Id: I94e9e463eed348b2928cae884c1c20566c00984d
2024-06-15 15:46:33 -07:00
Matthew Poremba
f91d14fe46 gpu-compute: Add MFMA stats (#1248)
Add dynamic instruction counts for MFMAs.

Change-Id: I976b01344577cf011aeb3dd648a8c0017281c4e3
2024-06-15 13:04:00 -07:00
Matthew Poremba
c1803eafac arch-vega: Architected flat scratch and scratch insts
Architected flat scratch is added in MI300, which stores the scratch base
address in dedicated registers rather than in SGPRs. These registers are
used by scratch_ instructions. These are flat instructions which
explicitly target the private memory aperture. These instructions have a
different address calculation than global_ instructions.

This change implements architected flat scratch support, fixes the
address calculation of scratch_ instructions, and implements decodings
for some scratch_ instructions. Previous flat_ instructions which happen
to access the private memory aperture have no change in address
calculation. Since scratch_ instructions are identical to flat_
instructions except for address calculation, the decodings simply reuse
existing flat_ instruction definitions.

Change-Id: I1e1d15a2fbcc7a4a678157c35608f4f22b359e21
2024-05-16 09:23:03 -07:00
Matthew Poremba
8be5ce6fc9 dev-amdgpu,configs,gpu-compute: Add gfx942 version
This is the version for MI300. For the most part, it is the same as
MI200 with the exception of architected flat scratch (not yet
implemented in gem5) and therefore a new version enum is required.

Change-Id: Id18cd7b57c4eebd467c010a3f61e3117beb8d58a
2024-05-15 12:08:41 -07:00
Matthew Poremba
cb47755e15 gpu: Consolidated fixes for v24.0 (#1103)
Includes fixes for several bugs reported via email, self found, and
internal reports. Also includes runs through Valgrind and UBsan. See
individual commits for more details.
2024-05-06 07:35:57 -07:00
Matthew Poremba
0d3d456894 gpu-compute: Invalidate Scalar cache when SQC invalidates (#1093)
The scalar cache is not being invalidated which causes stale data to be
left in the scalar cache between GPU kernels. This commit sends
invalidates to the scalar cache when the SQC is invalidated. This is a
sufficient baseline for simulation.

Since the number of invalidates might be larger than the mandatory queue
can hold and no flash invalidate mechanism exists in the VIPER protocol,
the command line option for the mandatory queue size is removed, which
is the same behavior as the SQC.

Change-Id: I1723f224711b04caa4c88beccfa8fb73ccf56572
2024-05-06 07:35:38 -07:00
Matthew Poremba
3490d5bf18 gpu-compute: Add DebugFlag for LDS
This prints what values are read/written to LDS and the previous value
on write. This is useful for debugging problems with LDS instructions.

Change-Id: I30063327bec1a1a808914a018467d5d78d5d58b4
2024-05-03 14:31:17 -07:00
Matthew Poremba
2703fb5699 gpu-compute: Fix valgrind memleak complaints
Fixes several memory leaks, mostly of small and medium severity. Fixes
mismatched new/new[] and delete/delete[] calls.

Change-Id: Iedafc409389bd94e45f330bc587d6d72d1971219
2024-05-03 14:29:31 -07:00
Matthew Poremba
0faa9510f9 arch-vega,gpu-compute: Fix misc ubsan runtime errors
Three main fixes:
 - Remove the initDynOperandInfo. UBSAN errors and exits due to things
   not being captured properly. After a few failed attempts playing with
   the capture list, just move the lambda to a new method.
 - Invalid data type size for some thread mask instructions. This might
   actually have caused silent bugs when the thread id was > 31.
 - Alignment issues with the operands.
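
The second item can be pictured with a small model. This is only a sketch of why a too-narrow type hides lanes above 31; it is not the gem5 code. In C++, shifting a 32-bit value by 32 or more is undefined behavior, and the sketch models the symptom as the bit being lost by truncation. The function name is hypothetical.

```python
def lane_bit(lane, width_bits):
    """Bit for `lane` as it would survive in a width_bits-wide mask."""
    return (1 << lane) & ((1 << width_bits) - 1)


# A lane above 31 vanishes from a 32-bit mask but is kept in a 64-bit one.
low_lane = lane_bit(5, 32)
lost_lane = lane_bit(40, 32)
kept_lane = lane_bit(40, 64)
```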

Change-Id: I0297e10df0f0ab9730b6f1bd132602cd36b5e7ac
2024-05-03 14:26:46 -07:00
Matthew Poremba
a03319bef7 arch-vega: Fix output warnings, gem5.fast (#1023)
Fix gem5.fast build not building when using gpu model.

Removes very spammy stat distribution bucket size prints when running
gpu model.
2024-04-15 13:18:27 -07:00
Matthew Poremba
01f2df4b8a gpu-compute: Fix stat bucket sizes
Change-Id: If30505515867a866c631cb117d3d22e19814a2f2
2024-04-13 15:51:41 -07:00
Matthew Poremba
1d64669473 mem,gpu-compute: Implement GPU TCC directed invalidate
The GPU device currently supports large BAR which means that the driver
can write directly to GPU memory over the PCI bus without using SDMA or
PM4 packets. The gem5 PCI interface only provides an atomic interface
for BAR reads/writes, which means the values cannot go through timing
mode Ruby caches. This causes bugs as the TCC cache is allowed to keep
clean data between kernels for performance reasons. If there is a BAR
write directly to memory bypassing the cache, the value in the cache is
stale and must be invalidated.

In this commit a TCC invalidate is generated for all writes over PCI
that go directly to GPU memory. This will also invalidate TCP along the
way if necessary. This currently relies on the driver synchronization
which only allows BAR writes in between kernels. Therefore, the cache
should only be in I or V state.

To handle a race condition between invalidates and launching the next
kernel, the invalidates return a response and the GPU command processor
will wait for all TCC invalidates to be complete before launching the
next kernel.
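
The wait-before-launch behavior described above can be sketched as a simple outstanding-request counter. The class and method names below are hypothetical and do not correspond to the GPU command processor's actual interface.

```python
class InvalidateTracker:
    """Toy model: hold kernel launches until all invalidates respond."""

    def __init__(self):
        self.outstanding = 0        # invalidates sent but not yet acked
        self.pending_launches = []  # kernels waiting on invalidates
        self.launched = []          # kernels released for execution

    def send_invalidate(self):
        self.outstanding += 1

    def invalidate_response(self):
        self.outstanding -= 1
        self._try_launch()

    def request_launch(self, kernel_id):
        self.pending_launches.append(kernel_id)
        self._try_launch()

    def _try_launch(self):
        # Launch only when no invalidate is still in flight.
        if self.outstanding == 0:
            self.launched.extend(self.pending_launches)
            self.pending_launches.clear()
```

In this model a launch requested while two invalidates are in flight is deferred until the second response arrives.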

This fixes issues with stale data in nanoGPT and possibly PENNANT.

Change-Id: I8e1290f842122682c271e5508a48037055bfbcdf
2024-04-10 11:35:25 -07:00
Matthew Poremba
833392e7b2 mem-ruby,gpu-compute: Allow memory reqs without inst
The GPUDynInst for sending memory requests through the CU's data port
is required but only used for DPRINTFs. Relax this constraint so that
the methods can be reused for requests such as probes generated by the
GPU device.

Change-Id: I16094e400968225596370b684d6471580888d98a
2024-04-10 11:35:24 -07:00
Michael Boyer
acd9d3ff94 gpu-compute: Add support for skipping GPU kernels (#940)
This commit adds two new command-line options:

--skip-until-gpu-kernel N
Skips (non-blit) GPU kernels until the target kernel is reached.
Execution continues normally from there. Blit kernels are not skipped
because they are responsible for copying the kernel code and metadata
for the non-blit kernels. Note that skipping kernels can impact
correctness; this feature is only useful if the kernel of interest has
no data-dependent behavior, or its data-dependent behavior is not based
on data generated by the skipped kernels.

--exit-after-gpu-kernel N
Ends the simulation after completing (non-blit) GPU kernel N.

This commit also renames two existing command-line options:
--debug-at-gpu-kernel -> --debug-at-gpu-task
--exit-at-gpu-kernel  -> --exit-at-gpu-task

These were renamed because they count GPU tasks, which include both
kernels launched by the application as well as blit kernels.
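
A minimal sketch of how the two new options might interact with blit kernels follows. It is a model of the behavior described above, not the gem5 implementation; the function name, the task representation, and numbering kernels from 0 are all assumptions.

```python
def plan_tasks(tasks, skip_until=None, exit_after=None):
    """Decide what happens to each GPU task.

    tasks: list of (name, is_blit) tuples in launch order.
    Returns a list of (name, action) with action "run" or "skip".
    Kernel numbers count non-blit kernels only, starting at 0
    (the starting index is an assumption of this sketch).
    """
    plan = []
    kernel_idx = 0
    for name, is_blit in tasks:
        if is_blit:
            # Blit kernels always run: they copy code and metadata.
            plan.append((name, "run"))
            continue
        if skip_until is not None and kernel_idx < skip_until:
            plan.append((name, "skip"))
        else:
            plan.append((name, "run"))
            if exit_after is not None and kernel_idx == exit_after:
                break  # simulation ends after this kernel completes
        kernel_idx += 1
    return plan
```

With `skip_until=1` and `exit_after=1`, only blit kernels and the second application kernel run in this model.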

Change-Id: If250b3fd2db05c1222e369e9e3f779c4422074bc
2024-03-21 07:46:27 -07:00
Michael Boyer
ba2f5615ba gpu-compute: Support cache line sizes >64B in GPUFS (#939)
This change fixes two issues:

1) The --cacheline_size option was setting the system cache line size
but not the Ruby cache line size, and the mismatch was causing assertion
failures.

2) The submitDispatchPkt() function accesses the kernel object in
chunks, with the chunk size equal to the cache line size. For cache line
sizes >64B (e.g. 128B), the kernel object is not guaranteed to be
aligned to a cache line and it was possible for a chunk to be partially
contained in two separate device memories, causing the memory access to
fail.
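
To illustrate the second issue, the sketch below splits an access so that no chunk crosses a cache-line boundary. This is one possible way to avoid a chunk straddling two memories and is not necessarily what the commit does; the function name is hypothetical.

```python
def split_at_line_boundaries(addr, size, line_size):
    """Split [addr, addr + size) into chunks that never cross a line.

    Returns a list of (chunk_addr, chunk_size) tuples.
    """
    chunks = []
    end = addr + size
    while addr < end:
        # First line boundary strictly above the current address.
        next_boundary = (addr // line_size + 1) * line_size
        chunk_end = min(end, next_boundary)
        chunks.append((addr, chunk_end - addr))
        addr = chunk_end
    return chunks
```

For a 256-byte object starting 64 bytes into a 128-byte line, this yields a short leading chunk, one full line, and a short trailing chunk.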

Change-Id: I8e45146901943e9c2750d32162c0f35c851e09e1

Co-authored-by: Michael Boyer <Michael.Boyer@amd.com>
2024-03-20 11:09:25 -07:00
Matthew Poremba
8722aef2e2 gpu-compute: Store accum_offset from code object in WF
The accumulation offset is needed for some instructions. In order to
access this value we need to place it somewhere instruction definitions
can access. The most logical place is in the wavefront.

This commit simply copies the value from the HSA task to the wavefront
object.

Change-Id: I44ef62ef32d2421953f096c431dd758e882245b4
2024-02-26 12:54:37 -06:00
Vishnu Ramadas
85680ea58e gpu-compute: Remove unused and redundant functions
In ComputeUnit, a previous commit added a SystemHubEvent event class to
the SQCPort. This was found to be unnecessary during the review process
and is removed in this commit. Similarly, invBuf(), which was added in
FetchUnit as part of an earlier commit, was found to be redundant. This
commit removes it.

Change-Id: I6ee8d344d29e7bfade49fb9549654b71e3c4b96f
2024-02-09 12:17:24 -06:00
Vishnu Ramadas
690b2b9462 gpu-compute, mem-ruby: Add comments and reformat code
Change-Id: Id2b3886dce347fdcfcad22009a42b92febc00a6c
2024-02-09 12:17:24 -06:00
Vishnu Ramadas
7dae25e881 configs, gpu-compute: Add parameter in shader for CUs per SQC
Change-Id: If0ae0db1b6ccc08a92f169a271b137f69f410f7b
2024-02-09 12:17:24 -06:00
Vishnu Ramadas
0e93e6142a arch-vega, gpu-compute, mem-ruby: Remove extra empty lines
Change-Id: I18770ec7e38c4a992a0ae6de95b0be49ab4426c2
2024-02-09 12:17:24 -06:00
Vishnu Ramadas
440409d807 gpu-compute: Add Icache invalidation at kernel start
Previously, the data caches were invalidated at the start of each
kernel. This commit adds support for invalidating instruction cache at
kernel launch time.

Change-Id: I32e50f63fa1442c2514d4dd8f9d7689759f503d3
2024-02-09 12:16:41 -06:00
Vishnu Ramadas
03838afce0 gpu-compute: Add support for injecting scalar memory barrier
This commit adds support for injecting a scalar memory barrier in the
GPU. The barrier will primarily be used to invalidate the entire SQC
cache. The commit also invalidates all buffers and decrements related
counters upon completion of the invalidation request.

Change-Id: Ib8e270bbeb8229a4470d606c96876ba5c87335bf
2024-02-09 12:14:57 -06:00
Matthew Poremba
63caa780c2 misc: Remove all references to GCN3
Replace instances of "GCN3" with Vega. Remove gfx801 and gfx803. Rename
FIJI to Vega and Carrizo to Raven.

Using misc since there is not enough room to fit all the tags.

Change-Id: Ibafc939d49a69be9068107a906e878408c7a5891
2024-01-17 11:11:06 -06:00
Matthew Poremba
6a9e80c54c gpu-compute: Support for MI200 GPU model (#733)
2024-01-15 08:18:34 -08:00
Matt Sinclair
ab9e61ea03 gpu-compute: WAX dependency detection (#731)
WAX Dependencies would be missed if a RAW Dependency also existed.
2024-01-05 12:57:24 -06:00
Matt Sinclair
dc85d1492c gpu-compute: Added register file cache support (#730)
The RFC is defaulted to a size of 0 which removes it completely. To use
the RFC set the --register-file-cache-size to a non-zero multiple of
two. In addition, rfc_pipe_length may be altered to increase or decrease
RFC latency benefit.
2024-01-05 12:57:06 -06:00
KaiBatley
359ac63280 gpu-compute: Added register file cache support
The RFC is defaulted to a size of 0 which removes it completely. To use
the RFC set the --register-file-cache-size to a non-zero multiple of
two. In addition, rfc_pipe_length may be altered to increase or
decrease RFC latency benefit.

Change-Id: I6f5bf5b750eb64155fbc8c8343e9feadce5c9f79
2024-01-04 22:43:05 -06:00
KaiBatley
55fce58c19 gpu-compute: WAX dependency detection
WAX Dependencies would be missed if a RAW Dependency also existed.
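
One way to picture this class of bug is a hazard check that stops after finding a RAW dependency. The sketch below checks both kinds independently. It is an illustration only, not the gem5 scoreboard logic; the names are hypothetical, and reading "WAX" as write-after-read or write-after-write is an assumption.

```python
def find_hazards(srcs, dsts, pending_reads, pending_writes):
    """Return the set of hazard kinds for one instruction.

    srcs/dsts: register names read/written by the instruction.
    pending_reads/pending_writes: registers with in-flight accesses.
    """
    hazards = set()
    # RAW: a source register has an in-flight write.
    if any(r in pending_writes for r in srcs):
        hazards.add("RAW")
    # WAX: a destination has an in-flight read or write. Checked
    # independently, so an existing RAW hazard does not hide it.
    if any(r in pending_reads or r in pending_writes for r in dsts):
        hazards.add("WAX")
    return hazards
```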

Change-Id: I2a9e50b9d0540a30de9c1bf6bb544c7b9654cb29
2024-01-03 22:02:02 -06:00
Matthew Poremba
cc75281802 gpu-compute: Update code object to latest LLVM
The AMDKernelCode struct is very outdated. Most of the fields are no
longer used and have been replaced with new fields that are used.
Therefore in order to support the new fields the code object needs to be
updated. The new structure is based on the table located at
https://llvm.org/docs/AMDGPUUsage.html#code-object-v3-kernel-descriptor

Most notably this adds the new compute_pgm_rsrc3 and kernarg preload
fields which are new features in gfx90a (MI200). The accum_offset in
compute_pgm_rsrc3 and kernarg preload values are necessary to run
applications which enable those features and therefore a way to check
their values is needed.

Also notable is the removal of enable_sgpr_workgroup_id_{X,Y,Z}. These
seem to be unused in all versions of ROCm that gem5 supports and
therefore these fields can be removed. They are replaced with a reserved
field in the new code object.

Change-Id: I5542442e1e5961b05e17affad0adb5186d6d9d1a
2024-01-03 15:41:06 -06:00
Matthew Poremba
8c016ebbbc gpu-compute: Implement packed workitem ABI init
This initialization method is used in gfx90a (MI200). Rather than using
three VGPRs for X,Y,Z dimensions of the kernel, pack them into one
register with 10 bits for each dimension.
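
The packing can be sketched as below. The 10-bit field width comes from the commit message; placing X in the lowest bits, then Y, then Z is an assumption of this sketch, and the function names are hypothetical.

```python
ID_BITS = 10
ID_MASK = (1 << ID_BITS) - 1


def pack_workitem_id(x, y, z):
    """Pack three work-item IDs into one 32-bit register value."""
    for v in (x, y, z):
        if not 0 <= v <= ID_MASK:
            raise ValueError("work-item ID does not fit in 10 bits")
    # Assumed field order: X in bits 0-9, Y in 10-19, Z in 20-29.
    return x | (y << ID_BITS) | (z << (2 * ID_BITS))


def unpack_workitem_id(packed):
    """Recover (x, y, z) from a packed register value."""
    return (packed & ID_MASK,
            (packed >> ID_BITS) & ID_MASK,
            (packed >> (2 * ID_BITS)) & ID_MASK)
```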

Change-Id: I8e5b681c8287779ff9f80451d6028e862322294a
2024-01-03 10:40:34 -06:00
Matthew Poremba
5e45233484 gpu-compute: Add gfx version to HSA task entry
The version is necessary for determining the correct ABI init process.
Add it to the task queue so it is accessible when doing ABI init.

Change-Id: If77434b0f93614057b5c40fcf612d59b54e05dbb
2024-01-03 10:40:34 -06:00
Bobby R. Bruce
d11c40dcac misc: Run pre-commit run --all-files
This ensures `isort` is applied to all files in the repo.

Change-Id: Ib7ced1c924ef1639542bf0d1a01c5737f6ba43e9
2023-11-29 22:06:41 -08:00
Gabe Black
db3a6e8e84 scons: Use Kconfig to configure gem5.
These are not yet consumed by anything, but convert all the settings
from SCons variables to Kconfig variables.

If you have existing SConsopts files which need to be converted, you
should take a look at KCONFIG.md to learn about how kconfig is used in
gem5. You should decide if any variables need to be available to C++ or
kconfig itself, and whether those are options which should be detected
automatically, or should be up to the user. Options which should be
measured automatically should still be in SConsopts files, while user
facing options should be added to new or existing Kconfig files.

Generally, make sure you're storing c++/kconfig visible options in
env['CONF'][...]. Also remove references to sticky_vars since persistent
options should now be handled with kconfig, and export_vars since
everything in env['CONF'] is now exported automatically.

Switch SCons/gem5 to use Kconfig for configuration, except EXTRAS which
is still a sticky SCons variable. This is necessary because EXTRAS also
controls what config options exist. If it came from Kconfig itself, then
there would be a circular dependency. This dependency could
theoretically be handled by reparsing the Kconfig when EXTRAS
directories were added or removed, but that would be complicated, and
isn't supported by kconfiglib. It wouldn't be worth the significant
effort it would take to add it, just to use Kconfig more purely.

Change-Id: I29ab1940b2d7b0e6635a490452d05befe5b4a2c9
2023-11-23 08:26:10 +08:00
Matthew Poremba
e362310f3d gpu-compute: Update GPR allocation counts
GPR allocation is using fields in the AMD kernel code structure which
are not backwards compatible and are not populated in more recent
compiler versions. Use the granulated fields instead, which are enforced
to be backwards compatible.

Change-Id: I718716226f5dbeb08369d5365d5e85b029027932
2023-11-01 14:52:39 -05:00
Matthew Poremba
f07e0e7f5d gpu-compute: Read dispatch packet with timing DMA
This fixes occasional readBlob fatals caused by the functional read of
system memory, seen often with the KVM CPU.

Change-Id: Ifccee666f62faa5b2fcf0a64a9d77c8cf95b3add
2023-11-01 14:52:39 -05:00
Matthew Poremba
d05433b3f6 gpu-compute,dev-hsa: Send vendor packet completion signal
gem5 does not currently implement any vendor-specific HSA packets.
Starting in ROCm 5.5, vendor packets appear to end with a completion
signal. Not sending this completion causes gem5 to hang. Since these
packets are not documented anywhere and need to be reverse engineered we
send the completion signal, if non-zero, and finish the packet as is the
current behavior.

Testing: HIP examples working on most recent ROCm release (5.7.1).

Change-Id: Id0841407bec564c84f590c943f0609b17e01e14c
2023-11-01 14:52:39 -05:00
Matthew Poremba
da11427ba6 gpu-compute: Update tokens for flat global/scratch (#408)
Memory instructions acquire coalescer tokens in the schedule stage.
Currently this is only done for buffer and flat instructions, but not
flat global or flat scratch. This change now acquires tokens for flat
global and flat scratch instructions. This provides back-pressure to the
CUs and helps to avoid deadlocks in Ruby.

The change also handles returning tokens for buffer, flat global, and
flat scratch instructions. This was previously only being done for
normal flat instructions leading to deadlocks in some applications when
the tokens were exhausted.

To simplify the logic, added a needsToken() method to GPUDynInst which
returns whether the instruction is buffer or any flat segment.

The waitcnts were also incorrect for flat global and flat scratch. We
should always decrement vmem and exp count for stores and only normal
flat instructions should decrement lgkm. Currently vmem/exp are not
decremented for flat global and flat scratch which can lead to deadlock.
This change set fixes this by always decrementing vmem/exp and lgkm only
for normal flat instructions.
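
The token handling described above can be modeled roughly as follows. This is a sketch of the back-pressure idea only, not the gem5 code: `needs_token` mirrors the needsToken() method named in the commit, while the instruction-kind strings and the `TokenPool` class are hypothetical.

```python
FLAT_SEGMENTS = {"flat", "flat_global", "flat_scratch"}


def needs_token(kind):
    """True for buffer instructions and any flat segment."""
    return kind == "buffer" or kind in FLAT_SEGMENTS


class TokenPool:
    """Toy coalescer token pool providing back-pressure."""

    def __init__(self, tokens):
        self.available = tokens

    def try_acquire(self, kind):
        # Instructions that need no token are never stalled here.
        if not needs_token(kind):
            return True
        if self.available == 0:
            return False  # stall until a token is returned
        self.available -= 1
        return True

    def release(self, kind):
        # Returning tokens for every kind that acquired one avoids
        # the token exhaustion the commit describes.
        if needs_token(kind):
            self.available += 1
```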

Change-Id: I673f4ac6121e4b5a5e8491bc9130c6d825d95fc5
2023-10-11 09:00:10 -07:00
Matthew Poremba
9f4d334644 gpu-compute: Update tokens for flat global/scratch
Memory instructions acquire coalescer tokens in the schedule stage.
Currently this is only done for buffer and flat instructions, but not
flat global or flat scratch. This change now acquires tokens for flat
global and flat scratch instructions. This provides back-pressure to the
CUs and helps to avoid deadlocks in Ruby.

The change also handles returning tokens for buffer, flat global, and
flat scratch instructions. This was previously only being done for
normal flat instructions leading to deadlocks in some applications when
the tokens were exhausted.

To simplify the logic, added a needsToken() method to GPUDynInst which
returns whether the instruction is buffer or any flat segment.

The waitcnts were also incorrect for flat global and flat scratch. We
should always decrement vmem and exp count for stores and only normal
flat instructions should decrement lgkm. Currently vmem/exp are not
decremented for flat global and flat scratch which can lead to deadlock.
This change set fixes this by always decrementing vmem/exp and lgkm only
for normal flat instructions.

Change-Id: I673f4ac6121e4b5a5e8491bc9130c6d825d95fc5
2023-10-10 09:48:16 -05:00
Matthew Poremba
6a4b2bb096 dev-hsa,gpu-compute: Add timestamps to AMD HSA signals
The AMD specific HSA signal contains start/end timestamps for dispatch
packet completion signals. These are currently always zero. These
timestamp values are used for profiling in the ROCr runtime.
Unfortunately, the GpuAgent::TranslateTime method in ROCr does not check
for zero values before dividing, causing applications that use profiling
to crash with SIGFPE. Profiling is used via hipEvents in the HACC
application, so these should be supported in gem5.

In order to handle writing the timestamp values, we need to DMA the
values to memory before writing the completion signal. This changes the
flow of the async completion signal write to be (1) read mailbox pointer
(2) if valid, write the mailbox data, otherwise skip to 4 (3) write mailbox
data if pointer is valid (4) write timestamp values (5) write completion
signal. The application will process the timestamp data as soon as the
completion signal is received, so we need ordering to ensure the DMA
for timestamps was completed.
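
The ordering constraint can be sketched as a function that returns the writes in the order they must complete. This is a simplified reading of the flow above in which the mailbox steps are merged into one; the function name and step labels are hypothetical.

```python
def completion_steps(mailbox_ptr):
    """Ordered steps for finishing a dispatch packet.

    mailbox_ptr: value read from the signal; 0 is treated as invalid
    (an assumption of this sketch).
    """
    steps = ["read mailbox pointer"]
    if mailbox_ptr != 0:
        steps.append("write mailbox data")
    # Timestamps must land before the completion signal, because the
    # application reads them as soon as it sees the signal.
    steps.append("write timestamp values")
    steps.append("write completion signal")
    return steps
```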

HACC now runs to completion on GPUFS and has the same output as
hardware.

Change-Id: I09877cdff901d1402140f2c3bafea7605fa6554e
2023-10-06 13:21:40 -05:00
Matthew Poremba
2b97f17fe1 gpu-compute: Fix dynamic scratch size test
ROCm supports dynamically allocating scratch space, which resides in
framebuffer memory, to reduce the amount of memory allocated for kernels
that have not yet launched. The size of the scratch space allocated is
located in task->amdQueue.compute_tmpring_size_wavesize. This size is in
kilobytes. The AQL task contains the number of bytes requested *per work
item*, however we currently check if there is enough tmpring space by
comparing a single work item. This should instead check the size *per
wavefront*.
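
The difference between the two checks can be sketched as below. The units (kilobytes for the tmpring size, bytes per work item in the AQL task) come from the commit message; the function names and the wavefront size used in the usage lines are assumptions.

```python
def has_enough_scratch(bytes_per_workitem, wavefront_size, tmpring_kb):
    """Per-wavefront check: all lanes' scratch must fit in the tmpring."""
    required = bytes_per_workitem * wavefront_size
    return required <= tmpring_kb * 1024


def old_check(bytes_per_workitem, tmpring_kb):
    """Per-work-item check described as the previous behavior."""
    return bytes_per_workitem <= tmpring_kb * 1024


# 64 bytes per work item and 64 lanes need 4 KB, so 2 KB is too small.
new_result = has_enough_scratch(64, 64, 2)
old_result = old_check(64, 2)
```

In this example the per-work-item comparison passes even though the wavefront as a whole does not fit, which is the situation the commit describes.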

This causes problems in applications where multiple kernels use dynamic
scratch allocation and a later kernel requires more space than the
earlier kernel. The only application being tested that does this is
LULESH. This was resulting in the scratch space being too small,
resulting in workgroups clobbering each other's private memory leading
to some nasty bugs. It is fixed by this patch as task->amdQueue will be
re-read from the host and will contain the updated tmpring size. After
this there is enough scratch space and LULESH makes forward progress.

Change-Id: Ie9e0f92bb98fd3c3d6c2da3db9ee65352f9ae070
2023-10-04 09:38:31 -05:00
Matthew Poremba
cfa833a97d gpu-compute: Set LDS/scratch aperture base register
Starting with gfx900 (Vega) the LDS and scratch apertures can be queried
using a new s_getreg_b32 instruction. If the instruction is called with
the SH_MEM_BASES argument it returns the upper 16 bits of a 64 bit
address for the LDS and scratch apertures. The current addresses cannot
be encoded in this register, so the addresses are changed to have the
lower 48 bits be all zeros in addition to writing the bases register.
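
A minimal sketch of the encoding follows. Each aperture base contributes only its upper 16 bits, which is why the lower 48 bits must be zero. Placing the scratch (private) base in the low half and the LDS (shared) base in the high half of the 32-bit value is an assumption of this sketch, and the function names are hypothetical.

```python
LOW_48_MASK = (1 << 48) - 1


def aperture_base_field(base_addr):
    """Upper 16 bits of a 64-bit aperture base address."""
    if base_addr & LOW_48_MASK:
        raise ValueError("aperture base must have its low 48 bits clear")
    return base_addr >> 48


def sh_mem_bases(private_base, shared_base):
    """Pack both 16-bit fields into one 32-bit register value.

    Field order (private low, shared high) is assumed here.
    """
    return (aperture_base_field(private_base)
            | (aperture_base_field(shared_base) << 16))
```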

Change-Id: If20f262b2685d248afe31aa3ebb274e4f0fc0772
2023-08-31 11:01:32 -05:00
Matthew Poremba
60f071d09a gpu-compute,arch-vega: Implement flat scratch insts
Flat scratch instructions (aka private) are the 3rd and final segment of
flat instructions in gfx9 (Vega) and beyond. These are used for things
like spills/fills and thread local storage. This commit enables two
forms of flat scratch instructions: (1) flat_load/flat_store
instructions where the memory address resolves to private memory and (2)
the new scratch_load/scratch_store instructions in Vega. The first are
similar to older generation ISAs where the aperture is unknown until
address translation. The second are instructions guaranteed to go to
private memory.

Since these are very similar to flat global instructions there are
minimal changes needed:

- Ensure a flat instruction is either regular flat, global, XOR scratch
- Rename the global op_encoding methods to GlobalScratch to indicate
  they are for both and are intentionally used.
- Flat instructions in segment 1 output scratch_ in the disassembly
- Flat instructions executed as private use similar mem helpers as global
- Flat scratch cannot be an atomic

This was tested using a modified version of the 'square' application:

template <typename T>
__global__ void
scratch_square(T *C_d, T *A_d, size_t N)
{
    size_t offset = (blockIdx.x * blockDim.x + threadIdx.x);
    size_t stride = blockDim.x * gridDim.x ;

    volatile int foo; // Volatile ensures scratch / unoptimized code

    for (size_t i=offset; i<N; i+=stride) {
        foo = A_d[i];
        C_d[i] = foo * foo;
    }
}

Change-Id: Icc91a7f67836fa3e759fefe7c1c3f6851528ae7d
2023-08-26 13:40:12 -05:00
Matthew Poremba
4506188e00 gpu-compute: Fix private offset/size register indexes
According to the ABI documentation from LLVM, the *low* register of flat
scratch (maxSGPR - 4) is the offset and the high register (maxSGPR - 3)
is size. These are currently backwards, resulting in some gnarly
addresses being generated leading to page fault and/or incorrect data.

This commit fixes this by setting the order correctly.

Change-Id: I0b1d077c49c0ee2a4e59b0f6d85cdb8f17f9be61
2023-08-26 13:40:12 -05:00
Matthew Poremba
e0379f4526 gpu-compute: Fix flat scratch resource counters
Flat instructions may access memory locations in LDS (scratchpad) and
global (VRAM/framebuffer) and therefore increment both counters when
dispatched. Once the aperture is known, we decrement the counters of the
aperture that was *not* used. This is done incorrectly for scratch /
private flat instructions. Private memory is global and therefore local
memory counters should be decremented.

This commit fixes the counters by changing the global decrements to
local decrements.
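
The counter handling can be modeled with a small sketch. It only illustrates the increment-both, decrement-the-unused pattern described above; the class, method, and aperture names are hypothetical.

```python
class MemCounters:
    """Toy model of outstanding global and local (LDS) requests."""

    def __init__(self):
        self.global_outstanding = 0
        self.local_outstanding = 0

    def dispatch_flat(self):
        # The aperture is unknown at dispatch, so count both.
        self.global_outstanding += 1
        self.local_outstanding += 1

    def resolve_aperture(self, aperture):
        # Drop the counter of the aperture that was NOT used.
        if aperture == "lds":
            self.global_outstanding -= 1
        elif aperture in ("global", "private"):
            # Private memory lives in global memory, so the local
            # counter is the one to release.
            self.local_outstanding -= 1
        else:
            raise ValueError("unknown aperture: " + aperture)
```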

Change-Id: I25890446908df72e5469e9dbaba6c984955196cf
2023-08-26 13:40:12 -05:00