The ROM field was originally intended as a future alternate way to load
VBIOS without the ROM being on the disk image. This code path is never
taken for the devices gem5 supports and there is no gem5 implementation.
Deprecate the rom_binary field for this reason.
Similarly, MMIO traces were only used for Vega10. Deprecate this as
Vega10 is now deprecated. The MMIO trace reader is kept as it may still
be useful in the future. It is still the primary way to handle devies
which have graphics capability. None of the devices supported by gem5
have graphics now that Vega10 is deprecated.
Invalidate requests align to system cache line size. This causes
problems if the GPU cache hierarchy's cache line size is different than
the system as the unlaigned requests never return, leading to deadlock
on deferred dispatch.
This commit uses the cache line size from the GPU memory manager and
makes the cache line size there non-optional.
Tested with multiple RubySystems where CPU side was 64B and GPU side was
128B cache lines.
Vega10 is no longer officially supported by ROCm and ROCm is starting to
use some packet types not supported. These were originally kept to allow
users to use older disk images with newer gem5. Going forward the gem5
version and gem5-resources releases will be required to be the same to
prevent lingering old configs.
As a replacement for vega10*.py, mi300.py or mi200.py should be used.
HIP examples, cookbook, and rodinia configs can be replaced with the
standard flow of building / obtaining the GPU application and running
using mi300.py or mi200.py as they do not require any input options and
therefore do not require changes to the disk image.
This commit changes metric units (e.g. kB, MB, and GB) to binary units
(KiB, MiB, GiB) in various files. This PR covers files that were missed
by a previous PR that also made these changes.
This PR changes memory and cache sizes in various parts of the gem5
codebase to use binary units (e.g. KiB) instead of metric units (e.g.
kB). This makes the codebase more consistent, as gem5 automatically
converts memory and cache sizes that are in metric units to binary
units.
This PR also adds a warning message to let users know when an
auto-conversion from base 10 to base 2 units occurs.
There were a few places in configs and in the comments of various files
where I didn't change the metric units, as I couldn't figure out where
the parameters with those units were being used.
The Vega ISA's s_memtime instruction is used to obtain a cycle value
from the GPU. Previously, this was implemented to obtain the cycle count
when the memtime instruction reached the execute stage of the GPU
pipeline. However, from microbenchmarking we have found that this under
reports the latency for memtime instructions relative to real hardware.
Thus, we changed its behavior to go through the scalar memory pipeline
and obtain a latency value from the the SQC (L1 I$). This mirrors the
suggestion of the AMD Vega ISA manual that s_memtime should be treated
like a s_load_dwordx2.
The default latency was set based on microbenchmarking.
Change-Id: I5e251dde28c06fe1c492aea4abf9f34f05784420
This commit contains the rest of the base 2 vs base 10 cache/memory
size clarifications. It also changes the warning message to use
warn(). With these changes, the warning message should now no
longer show up during a fresh compilation of gem5.
Change-Id: Ia63f841bdf045b76473437f41548fab27dc19631
This is on by default in gem5 (see src/cpu/kvm/BaseKvmCPU.py), however
the perf counters only measure host instruction counters and GPUFS is
not concerned about accuracy of KVM CPU stats. There are also a larger
set of users who have access to KVM, but do not have the paranoid level
low enough to attach performance counters.
Therefore, make the performance counters OFF by default. They can still
be enabled, but this will allow for a larger set of users to follow the
upcoming GPUFS documentation without needing to read through a
troubleshooting section after seeing a gem5 error about the KVM paranoid
level.
Change-Id: I6b465559edf3ce17e7117ada049c60bd39aecd83
GPU_VIPER.py was modified to use these options but they did not exist,
breaking GPUFS. This commit adds them to fix the issue.
Change-Id: I0095f400ea606c4e8d91a41870ef208465cef803
Add a config capable of simulating MI300X ISA (gfx942). This is similar
to the mi200.py config and uses the same scripts followed by some
tuneable parameters. This config optionally lets the user call the
runMI300GPU function with gem5 resources. This allows for something like
the following before a VIPER stdlib python is available:
```
import mi300
from gem5.resources.resource import obtain_resource
disk = obtain_resource("x86-gpu-fs-img")
kernel = obtain_resource("x86-linux-kernel-5.4.0-105-generic")
app = obtain_resource("square-gpu-test")
mi300.runMI300GPUFS("X86KvmCPU", disk, kernel, app)
```
Tested cold boot config, checkpoint create and restore, and using gem5
resources.
Change-Id: I50a13d7a3d207786b779bf7fd47a5645256b1e6a
This is the version for MI300. For the most part, it is the same as
MI200 with the exception of architected flat scratch (not yet
implemented in gem5) and therefore a new version enum is required.
Change-Id: Id18cd7b57c4eebd467c010a3f61e3117beb8d58a
The address has one too many zeros and is therefore placed in a memory
region usually used for system memory. As a result this causes failure
when trying to run a simulation with a huge amount of memory.
Change the address to be within the C000'0000h - FFFF'FFFFh X86 I/O hole
as was intended.
Change-Id: I5d03ac19ea3b2c01a8c431073c12fa1868b3df24
A user reported a bug with the SSE4.1 version of memcmp in libc. When
enabled the simulated program crashes with SIGILL. After attempting all
fixes recommended by Intel SDM and still not working, turning the bit
off instead.
Similar, the default XSAVE functionality is not completely implemented
for AVX and newer ISA extensions. Therefore, there is not much point to
claiming to support the more advanced versions of XSAVE (XSAVEOPT,
XSAVEC, XSAVES, and XGETBV with ECX=1).
Note that none of these bits are enabled for non-GPU full system
simulations (see src/arch/x86/X86ISA.py). This only impacts GPUFS
simulations.
Change-Id: I8eb7bf0f2a0a29226095e7889fec9c1e8a65f88f
Currently gem5 assumes that there is only one command processor (CP)
which contains the PM4 packet processor. Some GPU devices have multiple
CPs which the driver tests individually during POST if they are used or
not. Therefore, these additional CPs need to be supported.
This commit allows for multiple PM4 packet processors which represent
multiple CPs. Each of these processors will have its own independent
MMIO address range. To more easily support ranges, the MMIO addresses
now use AddrRange to index a PM4 packet processor instead of the
hard-coded constexpr MMIO start and size pairs.
By default only one PM4 packet processor is created, meaning the
functionality of the simulation is unchanged for devices currently
supported in gem5.
Change-Id: I977f4fd3a169ef4a78671a4fb58c8ea0e19bf52c
The top level AMDGPUDevice currently reads/writes all unknown registers
to/from a map containing the previously written value. This is intended
as a way to handle registers that are not part of the model but the
driver requires for functionality. Since this is at the top level, it
can mask changes to register values which do not go through the same
interface. For example, reading an MMIO, changing via PM4 queue, and
reading again returns the stale cached value.
This commit removes the usage of the regs map in AMDGPUDevice,
implements some important MMIOs that were previously handled by it, and
moves the unknown register handling to the NBIO aperture only. To reduce
the number of additional MMIOs to implement, the display manager in
vega10 is now disabled.
Change-Id: Iff0a599dd82d663c7e710b79c6ef6d0ad1fc44a2
gpu-compute: Add support for skipping GPU kernels
This commit adds two new command-line options:
--skip-until-gpu-kernel N
Skips (non-blit) GPU kernels until the target kernel is reached.
Execution continues normally from there. Blit kernels are not skipped
because they are responsible for copying the kernel code and metadata
for the non-blit kernels. Note that skipping kernels can impact
correctness; this feature is only useful if the kernel of interest has
no data-dependent behavior, or its data-dependent behavior is not based
on data generated by the skipped kernels.
--exit-after-gpu-kernel N
Ends the simulation after completing (non-blit) GPU kernel N.
This commit also renames two existing command-line options:
--debug-at-gpu-kernel -> --debug-at-gpu-task
--exit-at-gpu-kernel -> --exit-at-gpu-task
These were renamed because they count GPU tasks, which include both
kernels launched by the application as well as blit kernels.
Change-Id: If250b3fd2db05c1222e369e9e3f779c4422074bc
This change fixes two issues:
1) The --cacheline_size option was setting the system cache line size
but not the Ruby cache line size, and the mismatch was causing assertion
failures.
2) The submitDispatchPkt() function accesses the kernel object in
chunks, with the chunk size equal to the cache line size. For cache line
sizes >64B (e.g. 128B), the kernel object is not guaranteed to be
aligned to a cache line and it was possible for a chunk to be partially
contained in two separate device memories, causing the memory access to
fail.
Change-Id: I8e45146901943e9c2750d32162c0f35c851e09e1
Co-authored-by: Michael Boyer <Michael.Boyer@amd.com>
The RFC is defaulted to a size of 0 which removes it completely. To use
the RFC set the --register-file-cache-size to a non-zero multiple of
two. In addition, rfc_pipe_length may be altered to increase or decrease
RFC latency benefit.
The RFC is defaulted to a size of 0 which removes it completely. To use
the RFC set the --register-file-cache-size to a non-zero multiple of
two. In addition, rfc_pipe_length may be altrered to increase or
decrease RFC latency benefit.
Change-Id: I6f5bf5b750eb64155fbc8c8343e9feadce5c9f79
Add a --no-kvm-perf option to disable KVM perf counters for GPUFS
scripts. This is useful for users who have KVM enabled but configured
with more restrictive settings, which seems to be the default in newer
Linux distros.
Change-Id: I7508113d0f7c74deb21ea7b2770522885a0ec822
GPUFS+KVM simulations automatically enable AVX. This commit adds a
command line option to disable AVX if its not needed for a GPUFS
simulation.
Change-Id: Ic22592767dbdca86f3718eca9c837a8e29b6b781
The ROCm stack requires PCI express atomics. Currently the first PCI
CapabilityPtr does not point to anything, which signals to the OS
(Linux) that this is an early generation PCI device. As PCI express
atomics were introduced later, the CapabilityPtr needs to point to at
least a PCI express capability structure. This capability is defined as
0x10 in Linux. We additionally set the PCI atomic based bits and
implement device specific PCI configuration space reads and writes to
the amdgpu device.
With this commit, the output of simulation when loading the amdgpu
driver no longer outputs "PCIE atomics not supported". Further, an
application which uses PCIe atomics (PyTorch with a reduce_sum kernel)
now makes further progress.
Change-Id: I5e3866979659a2657f558941106ef65c2f4d9988
The MMIO trace contains register values for parts of the GPU that are
not modeled in gem5, such as registers related to the graphics core.
Since MI100 and MI200 do not have anything that is not modeled, the
MMIO trace is not needed, therefore it does not need to be used or
checked and the command line option goes away entirely for MI100/200.
Change-Id: I23839db32b1b072bd44c8c977899a99347fc9687
AVX is a requirement for some ROCm libraries, such as rocBLAS, which are
themselves requirements for libraries higher up the stack like PyTorch.
This patch sets the necessary CPUID bits in the GPUFS config to enable
AVX, AVX2, and various SSE features so that applications using these
libraries do not cause an illegal instruction trap.
Change-Id: Id22f543fb2a06b268271725a54075ee6a9a1f041
The unconditional exit event when a kernel completes that was added in
c644eae2dd is causing scripts that do not
ignore unknown exit events to end simulation prematurely. One such
script is the apu_se.py script used in SE mode GPU simulation. Make this
exit conditional to the parameter being set to a valid value to avoid
this problem.
Change-Id: I1d2c082291fdbcf27390913ffdffb963ec8080dd
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/72098
Reviewed-by: Jason Lowe-Power <power.jg@gmail.com>
Maintainer: Matt Sinclair <mattdsinclair@gmail.com>
Maintainer: Jason Lowe-Power <power.jg@gmail.com>
Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com>
Tested-by: kokoro <noreply+kokoro@google.com>
Different GPUFS disk images have different root partitions that Linux
needs to boot from. In particular, Ubuntu's new installer has a GRUB
partition that cannot seem to be removed. Adding this as an option
prevents needing to edit a config script to change one character each
time a different disk image is used.
Change-Id: Iac2996ea096047281891a70aa2901401ac9746fc
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/71918
Tested-by: kokoro <noreply+kokoro@google.com>
Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com>
Maintainer: Matt Sinclair <mattdsinclair@gmail.com>
Add two kernel dispatch-based exit events that are useful for limiting
the simulation and enabling debug flags at specific GPU kernels. Since
the KVM CPU typically used with GPUFS is not deterministic, this help
with enabling debug flags when the Tick number may vary. The exit at GPU
kernel option can also limit simulation by only simulating a few hundred
kernels, for example, and exit at a determined point.
Change-Id: I81bae92a80c25fc38c41e999aa662e1417b7a20d
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/71418
Maintainer: Matt Sinclair <mattdsinclair@gmail.com>
Tested-by: kokoro <noreply+kokoro@google.com>
Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com>
Currently the amdgpu simulated device is assumed to be a Vega10. As a
result there are a few things that are hardcoded. One of those is the
number of SDMAs. In order to add a newer device, such as MI100+, we need
to enable a flexible number of SDMAs.
In order to support a variable number of SDMAs and with the MMIO offsets
of each device being potentially different, the MMIO interface for SDMAs
is changed to use an SDMA class method dispatch table with forwards a
32-bit value from the MMIO packet to the MMIO functions in SDMA of the
format `void method(uint32_t)`. Several changes are made to enable this:
- Allow the SDMA to have a variable MMIO base and size. These are
configured in python.
- An SDMA class method dispatch table which contains the MMIO offset
relative to the SDMA's MMIO base address.
- An updated writeMMIO method to iterate over the SDMA MMIO address
ranges and call the appropriate SDMA MMIO method which matches the
MMIO offset.
- Moved all SDMA related MMIO data bit twiddling, masking, etc. into
the MMIO methods themselves instead of in the writeMMIO method in
SDMAEngine.
Change-Id: Ifce626f84d52f9e27e4438ba4e685e30dbf06dbc
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/70040
Maintainer: Matt Sinclair <mattdsinclair@gmail.com>
Tested-by: kokoro <noreply+kokoro@google.com>
Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com>
Previously the CPU type and memory modes were hardcoded for KVM, because
there was a deadlock bug. After some recent testing, this deadlock bug
no longer exists with the simple CPU models. Thus, changing the configs
to allow for other CPU models as a first step toward lifting the KVM
requirement from GPUFS.
Change-Id: Ib616c3ef60f173871421b55a8bb73b25ce2990b5
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/69979
Tested-by: kokoro <noreply+kokoro@google.com>
Maintainer: Matt Sinclair <mattdsinclair@gmail.com>
Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com>
This file is a required input to the simulator for GPUFS. There seems to
be confusion from several users who are not providing this input. This
usually results in the amdgpu driver failing to load, leading to the
application under test exiting along with it.
This changeset adds a simple md5 hashsum check to compare against the
known good MMIO trace located in the gem5-resources repository.
Change-Id: I59819fc795a6bc4bc6badbd4d120db1246498987
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/69978
Tested-by: kokoro <noreply+kokoro@google.com>
Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com>
Maintainer: Matt Sinclair <mattdsinclair@gmail.com>
The dmesg level is currently set to 3 which will not display errors if
the amdgpu driver fails to load. Changing to level 8 will show errors in
the gem5 terminal and is not too spammy. This will help GPUFS developers
with bug reports since we would actually be able to observe an error.
Currently if the driver fails to load, there is no way to detect it and
applications will attempt to run, usually failing on getting device
properties.
Change-Id: I56b9581c1a12a8ce329066d18d6a072d006c096d
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/69977
Tested-by: kokoro <noreply+kokoro@google.com>
Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com>
Maintainer: Matt Sinclair <mattdsinclair@gmail.com>
An example case,
```python
mem_side_port = RequestPort(
"This port sends requests and " "receives responses"
)
```
This is the residue of running the python formatter.
This is done by finding all tokens matching the regex `"\s"(?![.;"])`
and manually replacing them by empty strings.
Change-Id: Icf223bbe889e5fa5749a81ef77aa6e721f38b549
Signed-off-by: Hoa Nguyen <hoanguyen@ucdavis.edu>
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/66111
Reviewed-by: Bobby Bruce <bbruce@ucdavis.edu>
Maintainer: Bobby Bruce <bbruce@ucdavis.edu>
Tested-by: kokoro <noreply+kokoro@google.com>
The KVM CPU hangs if there are not multiple event queues when more than
one CPU is created. Since GPUFS primarily relies on the KVM CPU, support
for multiple event queues is needed. Some GPU libraries, such as AMD
Research's ATMI library, assume more than one CPU.
This changeset adds support for multiple CPUs and was tested for up to
four CPUs.
Change-Id: Ia354e02209d0fa18195f3ad44f4fb1d58e93b5ca
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/65131
Maintainer: Jason Lowe-Power <power.jg@gmail.com>
Tested-by: kokoro <noreply+kokoro@google.com>
Maintainer: Matt Sinclair <mattdsinclair@gmail.com>
Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com>
Support has been added for SDMA RLC queues which are used for host to
device and device to host "memcpy" calls. Previously the SDMA engine was
disabled which caused GPU BLIT kernels to be called. This removes the
environment variable disabling SDMAs which has two main benefits:
- It will be much easier to debug host/device transfer by using SDMA
debug flag.
- Simulation time is improved since we no longer need detailed GPU
simulation to copy data and instead are doing a simple large DMA
Change-Id: I7524245731d301b5c26394318f2156ed6b4c983a
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/62717
Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com>
Tested-by: kokoro <noreply+kokoro@google.com>
Maintainer: Matt Sinclair <mattdsinclair@gmail.com>
The environment variable HSA_ENABLE_INTERRUPT controls if Interrupt or
busy wait signals are used in the ROCm runtime. Interrupts are not being
sent in gem5 causing simulations to hang indefinitely in certain
situations. To fix this, always disable interrupts to fall back to busy
wait signals. Using interrupts is an old and simple optimization to not
waste CPU cycles, but from the perspective of simulation this is not
important. Disabling interrupt-based HSA signals therefore increases the
number of applications working within gem5.
Change-Id: I1ae21d7ee01548a4d00a8972642079b90278f9a2
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/61652
Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com>
Tested-by: kokoro <noreply+kokoro@google.com>
Reviewed-by: Jason Lowe-Power <power.jg@gmail.com>
Maintainer: Jason Lowe-Power <power.jg@gmail.com>
Maintainer: Matt Sinclair <mattdsinclair@gmail.com>