The struct fields of the SMMUEvent were not matching the SMMUv3 specs.
This was "not an issue" as events have been implicitly disabled until
now (every translation error was aborting simulation)
With generateEvent we automatically construct a SMMU event from
a translation result.
Change-Id: Iba6a08d551c0a99bb58c4118992f1d2b683f62cf
Signed-off-by: Giacomo Travaglini <giacomo.travaglini@arm.com>
A faulting translation should return additional information
(other than the fault type). This will be used by future
patches to properly populate the SMMU event record of the
event queue
As we currenlty support two faults only:
1) F_TRANSLATION
2) F_PERMISSION
We add to TranslResult the relevant fault information only:
type, class, stage and ipa
Change-Id: I0a81d5fc202e1b6135cecdcd6dfd2239c2f1ba7e
Signed-off-by: Giacomo Travaglini <giacomo.travaglini@arm.com>
Reading the Context Descriptor (CD) might require a stage2
translation. At the moment doReadCD does not check for the
return value of the translateStage2.
This means that any stage2 fault will be silently discarded
and an invalid address will be used/returned.
By returning a translation result we make sure any error
happening in the second stage of translation will be properly
flagged
Change-Id: I2ecd43f7e23080bf8222bc3addfabbd027ee8feb
Signed-off-by: Giacomo Travaglini <giacomo.travaglini@arm.com>
We don't check the fault type directly. This will improve
readability once the TranslResult class will be augmented
with extra fields
Change-Id: I5acafaabf098d6ee79e1f0c384499cc043a75a9d
Signed-off-by: Giacomo Travaglini <giacomo.travaglini@arm.com>
One of the limitations of the RegBank class is that it does not allow
you to pass a non-contiguous set of registers. Its simplest form will
just accept an initializer list of registers and it will store them in
sequence.
A more refined version [1] will optionally accept an offset value to be
passed alongside the register reference. This is not meant to be used by
the register bank to store the register at the provided offset.
It is rather used by the bank to sanity check the register sits exactly
at the provided range.
The way to work around this for a fragemented register space is to
explicitly allocate RAZ/RAO blocks as registers and to pass them to
addRegisters together with the others. (See the SysSecCtrl [2] as an
example)
This makes it a bit tedious to model a register bank with gaps between
its registers. First, the exact number and position of the gaps needs to
be extraced from a spec. These sometimes report only implemented
registers and their offset, and omit to document gaps/reserved space. So
a developer needs to manually add register offset and size to check if
all registers are contiguous. Second, these reserved register blocks
need to be instantiated in the bank adding boilerplate code and
affecting readibility.
For these reasons we add a new registration method, called
addRegisters*At*. It reuses the RegisterAdder class but this time the
offset field is really used to instruct the bank where the register
should be mapped. The method is templated and the template parameter
tells the bank which register type should be used to fill the remaining
space. We make the RegBank the owner of this filler space (registers are
generated internally within addRegistersAt).
[1]: https://github.com/gem5/gem5/blob/stable/src/dev/reg_bank.hh#L106
[2]: https://github.com/gem5/gem5/blob/stable/src/dev/arm/ssc.cc#L48
Change-Id: I614ae6e9eeb40b365ac9b6dd8b75abbfdb9cb687
Signed-off-by: Giacomo Travaglini <giacomo.travaglini@arm.com>
Update the PLIC based on the
[riscv-plic-spec](https://github.com/riscv/riscv-plic-spec) in the PR:
- Support customized PLIC hardID and privilege mode configuration
- Backward compatable with the n_contexts parameter, will generate the
config like {0,M}, {0,S}, {1,M} ...
Change-Id: Ibff736827edb7c97921e01fa27f503574a27a562
ArmSigInterruptPin don't send the interrupt to GIC. Instead it sends the
interrupt to the irq specified in Param. When using ArmSigInterruptPin,
we shouldn't ask users to provide "Platform" since it doesn't need it.
To reduce the confusion, this change removes the dependency of Platform
for ArmSigInterruptPin.
Change-Id: I0ee507ed1c08b4fa6d3e384e28732f3acb4f6892
The PCI configuration space is 256 bytes, yet because the
PCI_CONFIG_SIZE macro is 0xff, the final register allocation in the IDE
controller only allocated up to byte 255.
Change-Id: I1aef2cad9df366ee8425edb410037061eb29ae33
If the NOP count of an SDMA NOP packet goes beyond the wptr address, the
queue decode method will loop infinitely. If a packet comes in with a
bad count this causes gem5 to hang. This change advances the rptr one
dword at a time until either reaching the NOP count or when rptr == wptr
to prevent this issue.
Change-Id: Ib2c0f74a477bff27890c9c064bb4190e76e513bd
Related to issue #703 , this PR removes GCN3 related files and updates
source code, documentation, and tests to switch over to Vega is that was
not done already. Highlights are:
- Remove all src/arch/amdgpu/gcn3 files and update Kconfigs.
- Remove references to GCN3 and replace with Vega where applicable.
- Update the build targets in the gcn-gpu Docker. This will need to be
rebuilt but not urgently.
- Remove the GCN3 tag in testlib. Most tests seem to be using Vega
already, so that commit is small.
By default all SDMA queues are privileged queues, meaning the addresses
in SDMA packets use the privileged translation tables. RLC queues
(sometimes called user queues) are not necessarily privileged and might
use user translation tables. RLC queues are used more often in ROCm 6.0
exposing an issue with invalid translations with RLC queues.
This changeset checks the priv bit in the SDMA MQD when an RLC queue is
mapped. Each packet type which uses an address then checks the bit
before performing translation. Tested with daily/weekly tests with a
ROCm 6.0 disk image and tests are passing.
Change-Id: I6122fbc194e8d6f5d38e81f1b0e11646d90e0ea0
Replace instances of "GCN3" with Vega. Remove gfx801 and gfx803. Rename
FIJI to Vega and Carrizo to Raven.
Using misc since there is not enough room to fit all the tags.
Change-Id: Ibafc939d49a69be9068107a906e878408c7a5891
The PR contains the following changes:
- Move all of the config options(`env["CONF"]`) from SConsopt to Kconfig
files
- Update `build_opts` files to Kconfig option formats
- The Ruby Protocol files are only built if `RUBY=y`
- Remove the default-default build target
- Kconfig commands are included in the PR:
- defconfig
- setconfig
- meunconfig
- guiconfig
- listnewconfig
- savedefconfig
- oldconfig
- olddefconfig
- Add the `python3-tk` package dependencies
Jira issue: https://gem5.atlassian.net/browse/GEM5-1211
The GPU device keeps a local copy of each ring buffers read pointer
(rptr) to avoid constant DMAs to/from host memory. This means it needs
to be periodically updated on the host side as the driver uses this to
determine how much space is left in the queue and may hang if it believe
the queue is full. For user-mode queues, this already happens when
queues are unmapped. For kernel mode queues (e.g., HIQ, KIQ) the rptr is
never updated leading to a hang.
In this patch the rptr for *all* queues is reported back to the kernel
whenever the queue reaches an empty state (rptr == wptr). Additionally
to handle PM4 queue wrap-around, the queue processing function checks if
the queue is not empty instead of rptr < wptr. This is state because the
driver fills PM4 queues with NOP packets on initialization and when wrap
around occurs.
Change-Id: Ie13a4354f82999208a75bb1eaec70513039ff30f
These are not yet consumed by anything, but convert all the settings
from SCons variables to Kconfig variables.
If you have existing SConsopts files which need to be converted, you
should take a look at KCONFIG.md to learn about how kconfig is used in
gem5. You should decide if any variables need to be available to C++ or
kconfig itself, and whether those are options which should be detected
automatically, or should be up to the user. Options which should be
measured automatically should still be in SConsopts files, while user
facing options should be added to new or existing Kconfig files.
Generally, make sure you're storing c++/kconfig visible options in
env['CONF'][...]. Also remove references to sticky_vars since persistent
options should now be handled with kconfig, and export_vars since
everything in env['CONF'] is now exported automatically.
Switch SCons/gem5 to use Kconfig for configuration, except EXTRAS which
is still a sticky SCons variable. This is necessary because EXTRAS also
controls what config options exist. If it came from Kconfig itself, then
there would be a circular dependency. This dependency could
theoretically be handled by reparsing the Kconfig when EXTRAS
directories were added or removed, but that would be complicated, and
isn't supported by kconfiglib. It wouldn't be worth the significant
effort it would take to add it, just to use Kconfig more purely.
Change-Id: I29ab1940b2d7b0e6635a490452d05befe5b4a2c9
When restoring checkpoints for certain applications, gem5 tries to
create new doorbells with a pre-existing queue ID and simulation crashes
shortly after. This commit adds existing IDs to the GPU device's used
VMID map so that new doorbells are aware of existing queue IDs and use a
new ID. This ensures that queue IDs are unique after checkpoint
restoration
When restoring checkpoints for certain applications, gem5 tries to
create new doorbells with a pre-existing queue ID and simulation crashes
shortly after. This commit checkpoints the existing VMID map so that any
new doorbells after restoration use a unique queue ID
Change-Id: I9bf89a2769db26ceab4441634ff2da936eea6d6f
https://github.com/gem5/gem5/pull/386 included two cases in
"src/dev/reg_bank.hh" where `std:: min` was used to compare a an integer
of type `size_t` and another of type `Addr`. This cause an error on my
Apple Silicon Mac as this is a comparison between an "unsigned long"
and an "unsigned long long" which (at least on my setup) was not
permitted. To fix this issue the `reg_size` was changed from `size_t` to
`Addr`, as well as it the types of the values it was derived from and
the variable used to hold the return from the `std::min` calls.
Change-Id: I31e9c04a8e0327d4f6f5390bc5a743c629db4746
When restoring checkpoints for certain applications, gem5 tries to
create new doorbells with a pre-existing queue ID and simulation crashes
shortly after. This commit adds existing IDs to the GPU device's used
VMID map so that new doorbells are aware of existing queue IDs and use a
new ID. This ensures that queue IDs are unique after checkpoint
restoration
Change-Id: I9bf89a2769db26ceab4441634ff2da936eea6d6f
Print extra logs for the full/partial read/write access to the registers
through the register bank. The debug flag is empty by default and would
not print anything.
Test: run unittest of dev/reg_bank.test.xml to check the behavior would
not affect the original functionality.
run gem5 with debug flags and use m5term to poke on registers.
The amdgpu driver can, at *any* time, tell the device to unmap a queue
to force the queue descriptor to be written back to main memory in the
form of a memory queue descriptor (MQD). It will then immediately remap
the queue and continue writing the doorbell to the queue. It is possible
that the doorbell write occurs after the queue is unmapped but before it
is remapped. In this situation, we need to check the updated value of
the doorbell for the queue and write that to the queue after it is
mapped.
To handle this, a pending doorbell packet map is created to hold a
packet to replay when the queue is mapped. Because PCI in gem5
implements only the atomic protocol port, we cannot use the original
packet as it must respond in the same Tick. This patch fixes issues with
the doorbell maps not being cleared on unmapping to ensure the doorbell
is not found in writeDoorbell and places in the pending doorbell map.
This includes fixing the doorbell offset value in the doorbell to VMID
map which was is now multiplied by four as it is a dword address.
This was tested using tensorflow 2.0's MNIST example which was seeing
this issue consistently. With this patch it now makes progress and does
issue pending doorbell writes.
Change-Id: Ic6b401d3fe7fc46b7bcbf19a769cdea6814e7d1e
gem5 does not currently implement any vendor-specific HSA packets.
Starting in ROCm 5.5, vendor packets appear to end with a completion
signal. Not sending this completion causes gem5 to hang. Since these
packets are not documented anywhere and need to be reverse engineered we
send the completion signal, if non-zero, and finish the packet as is the
current behavior.
Testing: HIP examples working on most recent ROCm release (5.7.1).
Change-Id: Id0841407bec564c84f590c943f0609b17e01e14c
While it's plausible to define the cache_line_size as a 32-bit unsigned
int, the use of cache_line_size is way out of its original scope.
cache_line_size has been used to produce an address mask, which masking
out the offset bits from an address. For example, [1], [2], [3], and
[4]. However, since the cache_line_size is an "unsigned int", the type
of the value is not guaranteed to be 64-bit long. Subsequently, the bit
twiddling hacks in [1], [2], [3], and [4] produce 32-bit mask, i.e.,
0x00000000FFFFFFC0.
This behavior at least caused a problem in LLSC in RISC-V [5], where the
load reservation (LR) relies on the mask to produce the cache block
address. Two distinct 64-bit addresses can be mapped to the same cache
block using the above mask.
This patch explicitly defines cache_line_size as a 64-bit unsigned int
so the cache block mask can be produced correctly for 64-bit addresses.
[1]
3bdcfd6f7a/src/cpu/simple/atomic.hh (L147)
[2]
3bdcfd6f7a/src/cpu/simple/timing.hh (L224)
[3]
3bdcfd6f7a/src/cpu/o3/lsq_unit.cc (L241)
[4]
3bdcfd6f7a/src/cpu/minor/lsq.cc (L1425)
[5]
3bdcfd6f7a/src/arch/riscv/isa.cc (L787)
Earlier, GPU checkpointing was working only if a checkpoint was created
before the first kernel execution. This pull request adds support to
checkpoint in-between any two kernel calls. It does so by doing the
following.
- Adds flush support in the GPU_VIPER protocol
- Adds flush support in the GPUCoalescer
- Updates cache recorder to use the GPUCoalescer during simulation
cooldown and cache warmup times.
The ROCr runtime uses a combination of HSA signal timestamps and
hardware MMIOs to calculate profiling times. At the beginning of an
application a timestamp is read from the GPU using MMIOs. The clock
MMIOs reside in the GFX MMIO region, so a new AMDGPUGfx class is added
to handle these MMIOs.
The timestamp value is expected to be in nanoseconds, so we simply use
the gem5 tick converted to ns.
Change-Id: I7d1cba40d5042a7f7a81fd4d132402dc11b71bd4
The AMD specific HSA signal contains start/end timestamps for dispatch
packet completion signals. These are current always zero. These
timestamp values are used for profiling in the ROCr runtime.
Unfortunately, the GpuAgent::TranslateTime method in ROCr does not check
for zero values before dividing, causing applications that use profiling
to crash with SIGFPE. Profiling is used via hipEvents in the HACC
application, so these should be supported in gem5.
In order to handle writing the timestamp values, we need to DMA the
values to memory before writing the completion signal. This changes the
flow of the async completion signal write to be (1) read mailbox pointer
(2) if valid, write the mailbox data, other skip to 4 (3) write mailbox
data if pointer is valid (4) write timestamp values (5) write completion
signal. The application will process the timestamp data as soon as the
completion signal is received, so we need to ordering to ensure the DMA
for timestamps was completed.
HACC now runs to completion on GPUFS and has the same output was
hardware.
Change-Id: I09877cdff901d1402140f2c3bafea7605fa6554e
During checkpoint restoration, the unserialize() function writes rptr,
wptr, and indirect buffer rptr, wptr to PM4 queue's rptr, wptr fields.
This commit updates this to write only the relevant pointers to the
queue structure. If indirect buffers are used, then it writes only the
indirect buffer pointers to the queue. If they are not used, then it
writes rptr, wptr values to the queue.
Change-Id: Iedb25a726112e1af99cc1e7bc012de51c4ebfd45
GPUFS uses aql information from PM4 queues to initialize doorbells. This
commit adds aql information to the checkpoint so that it can be used
during restoration to correctly initialize all doorbells. Additionally,
this commit also sets the hsa queue correctly during checkpoint-restoration
Change-Id: Ief3ef6dc973f70f27255234872a12c396df05d89
It is possible to execute a GPU atomic instruction using a memory
address that is in the host memory space (e.g, HMM, __managed__,
hipHostMalloc'd address). Since these are in host memory they are passed
to the SystemHub DmaDevice. However, this currently executes as a write
packet without modifying data. This leads to hangs in applications that
use atomics for forward progress (e.g., HeteroSync).
It is not clear where these are handled on a real GPU, but they are
certianly not handled by the software stack nor driver, so they must be
handled in hardware and therefore implemented in gem5. Handling for
atomics in the SystemHub makes the most sense.
To make atomics work a few extra changes need to be made to the
SystemHub. (1) The atomic is implemented as a host memory read, followed
by calling the AtomicOpFunctor, followed by a write. This requires a
second event to handle read response, performing atomic, and issuing a
write. (2) Atomics must be serialized otherwise two atomics might return
the same value which is incorrect. This patch adds serialization logic
for all request types to the same address to handle this. (3) With the
added complexity of the SystemHub, a new debug flag explicitly for
SystemHub is added.
Testing done: The heterosync application with input "sleepMutex 10 16 4"
previously hung before this patch. It passes with the patch applied.
This application tests both (1) and (2) above, as it allocates locks
with hipHostMalloc and has multiple workgroups sending an atomic request
in the same Tick, verifying the serialization mechanism.
Change-Id: Ife84b30037d1447dd384340cfeb06fdfd472fff9
The functional HSA signal read was a hack left in the gpu-compute code.
In full system, this functional read is causing problems occasionally
with the translation not yet being in the page table. The error message
output by gem5 was a fatal message on the readBlob method in port proxy.
Changing this to a timing DMA fixes this problem.
This commit adds the various timing DMA functions to send and receive
response and clean up. A helper method "sendCompletionSignal" is added
to the GPUCommandProcessor because the indentation level was getting too
deep. This change applies only to FS mode. Code for SE mode is
equivalent to what it was before this commit.
Change-Id: I1bfcaa0a52731cdf9532a7fd0eb06ab2f0e09d48
The ROCm stack requires PCI express atomics. Currently the first PCI
CapabilityPtr does not point to anything, which signals to the OS
(Linux) that this is an early generation PCI device. As PCI express
atomics were introduced later, the CapabilityPtr needs to point to at
least a PCI express capability structure. This capability is defined as
0x10 in Linux. We additionally set the PCI atomic based bits and
implement device specific PCI configuration space reads and writes to
the amdgpu device.
With this commit, the output of simulation when loading the amdgpu
driver no longer outputs "PCIE atomics not supported". Further, an
application which uses PCIe atomics (PyTorch with a reduce_sum kernel)
now makes further progress.
Change-Id: I5e3866979659a2657f558941106ef65c2f4d9988
The capabilities for PCI express is a struct, instead of a union, like
the other capability unions. A union is used here to provide access to
the ordinal data values when reading/writing an offset while
simultaneously providing human readable field values that can be set
when writing the code.
This commit changes it to union which is likely should be. Nothing
appears to be using this union yet so it is likely an oversight.
Change-Id: I85fe7cc62914525c70fd7a5946d725ed308f8775
This SDMA packet is much more common starting around ROCm 5.4.
Previously this was mostly used to clear page tables after an
application ended and was therefore left unimplemented. It is
now used for basic operation like device memsets.
This patch implements constant fill as it is now necessary.
Change-Id: I9b2cf076ec17f5ed07c20bb820e7db0c082bbfbc
When using the new operator, delete should be called
on any allocated memory after it's use is complete.
Change-Id: Id5fcfb264b6ddc252c0a9dcafc2d3b020f7b5019
* gpu-compute: Remove use of 'std::random_shuffle'
This was deprecated in C++14 and removed in C++17. This has been
replaced with std::random. This has been implemented to ensure
reproducible results despite (pseudo)random behavior.
Change-Id: Idd52bc997547c7f8c1be88f6130adff8a37b4116
* dev-amdgpu: Add missing 'overrides'
This causes warnings/errors in some compilers.
Change-Id: I36a3548943c030d2578c2f581c8985c12eaeb0ae
* dev: Fix Linux specific includes to be portable
This allows for compilation in non-linux systems (e.g., Mac OS).
Change-Id: Ib6c9406baf42db8caaad335ebc670c1905584ea2
* tests: Add 'VEGA_X86' build target to compiler-tests.sh
Change-Id: Icbf1d60a096b1791a4718a7edf17466f854b6ae5
* tests: Add 'GCN3_X86' build target to compiler-tests.sh
Change-Id: Ie7c9c20bb090f8688e48c8619667312196a7c123
The PCI read/write functions are atomic functions in gem5, meaning they
expect a response with a latency value on the same simulation Tick. For
reads to a PCI device, the response must also include a data value read
from the device.
The AMDGPU device has a PCI BAR which mirrors the frame buffer memory.
Currently reads are done atomically, but writes are sent to a DMA device
without waiting for a write completion ACK. As a result, it is possible
that writes can be queued in the DMA device long enough that another
read for a queued address arrives. This happens very deterministically
with the AtomicSimpleCPU and causes GPUFS to break with that CPU.
This change makes writes to the frame BAR atomic the same as reads. This
avoids that problem and as a result the AtomicSimpleCPU can now load the
driver for GPUFS simulations.
Change-Id: I9a8e8b172712c78b667ebcec81a0c5d0060234db
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/71898
Maintainer: Matt Sinclair <mattdsinclair@gmail.com>
Tested-by: kokoro <noreply+kokoro@google.com>
Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com>
Maintainer: Matthew Poremba <matthew.poremba@amd.com>
Reviewed-by: Matthew Poremba <matthew.poremba@amd.com>
According to the GIC specification (IHI0069) reserved addresses in the
GIC memory map are treated as RES0. We allow to disable this behaviour
and panic instead (reserved_res0 = False, which is what we have been
doing so far) to catch development bugs (in gem5 and in the guest SW)
Change-Id: I23f98519c2f256c092a52425735b8792bae7a2c7
Signed-off-by: Giacomo Travaglini <giacomo.travaglini@arm.com>
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/71138
Reviewed-by: Richard Cooper <richard.cooper@arm.com>
Tested-by: kokoro <noreply+kokoro@google.com>
Patch https://gem5-review.googlesource.com/c/public/gem5/+/70040 added
support for a variable number of SDMA engines to support newer GPU
models. As part of this an SDMA IDs map was added to map from SDMA ID
number to the SDMA SimObject pointer. In order to get the correct
pointer in unserialize now, we need to store the ID in the checkpoint
and use that to index the new map. We can't simply assign using the loop
variable as the SDMAs might not be in order in the checkpoint and
additionally the checkpoint contains both the gfx and page offset for
the SDMA engines, so each SDMA is inserted into the SDMA offset map
(sdmaEngs) twice.
Change-Id: I08e9a8d785f467b6eebff8ab0a9336851c87258d
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/70878
Maintainer: Matt Sinclair <mattdsinclair@gmail.com>
Tested-by: kokoro <noreply+kokoro@google.com>
Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com>