1) writeBinary function binary_list can have either string or ints and it needs to be properly converted to bytes
2) packed32(x) function can have x as int or float. Incase of float it needs to be converted to int
Change-Id: I6a52aa59e1582dd6bb06b2d1c49ddaf8fe61c997
Flat scratch instructions (aka private) are the 3rd and final segment of
flat instructions in gfx9 (Vega) and beyond. These are used for things
like spills/fills and thread local storage. This commit enables two
forms of flat scratch instructions: (1) flat_load/flat_store
instructions where the memory address resolves to private memory and (2)
the new scratch_load/scratch_store instructions in Vega. The first are
similar to older generation ISAs where the aperture is unknown until
address translation. The second are instructions guaranteed to go to
private memory.
Since these are very similar to flat global instructions there are
minimal changes needed:
- Ensure a flat instruction is either regular flat, global, XOR scratch
- Rename the global op_encoding methods to GlobalScratch to indicate
they are for both and are intentionally used.
- Flat instructions in segment 1 output scratch_ in the disassembly
- Flat instruction executed as private use similar mem helpers as global
- Flat scratch cannot be an atomic
This was tested using a modified version of the 'square' application:
template <typename T>
__global__ void
scratch_square(T *C_d, T *A_d, size_t N)
{
size_t offset = (blockIdx.x * blockDim.x + threadIdx.x);
size_t stride = blockDim.x * gridDim.x ;
volatile int foo; // Volatile ensures scratch / unoptimized code
for (size_t i=offset; i<N; i+=stride) {
foo = A_d[i];
C_d[i] = foo * foo;
}
}
Change-Id: Icc91a7f67836fa3e759fefe7c1c3f6851528ae7d
According to the ABI documentation from LLVM, the *low* register of flat
scratch (maxSGPR - 4) is the offset and the high register (maxSGPR - 3)
is size. These are currently backwards, resulting in some gnarly
addresses being generated leading to page fault and/or incorrect data.
This commit fixes this by setting the order correctly.
Change-Id: I0b1d077c49c0ee2a4e59b0f6d85cdb8f17f9be61
Flat instructions may access memory locations in LDS (scratchpad) and
global (VRAM/framebuffer) and therefore increment both counters when
dispatched. Once the aperture is known, we decrement the counters of the
aperture that was *not* used. This is done incorrectly for scratch /
private flat instruction. Private memory is global and therefore local
memory counters should be decremented.
This commit fixes the counters by changing the global decrements to
local decrements.
Change-Id: I25890446908df72e5469e9dbaba6c984955196cf
The functional HSA signal read was a hack left in the gpu-compute code.
In full system, this functional read is causing problems occasionally
with the translation not yet being in the page table. The error message
output by gem5 was a fatal message on the readBlob method in port proxy.
Changing this to a timing DMA fixes this problem.
This commit adds the various timing DMA functions to send and receive
response and clean up. A helper method "sendCompletionSignal" is added
to the GPUCommandProcessor because the indentation level was getting too
deep. This change applies only to FS mode. Code for SE mode is
equivalent to what it was before this commit.
Change-Id: I1bfcaa0a52731cdf9532a7fd0eb06ab2f0e09d48
The functional HSA signal read was a hack left in the gpu-compute code.
In full system, this functional read is causing problems occasionally
with the translation not yet being in the page table. The error message
output by gem5 was a fatal message on the readBlob method in port proxy.
Changing this to a timing DMA fixes this problem.
This commit adds the various timing DMA functions to send and receive
response and clean up. A helper method "sendCompletionSignal" is added
to the GPUCommandProcessor because the indentation level was getting too
deep. This change applies only to FS mode. Code for SE mode is
equivalent to what it was before this commit.
Change-Id: I1bfcaa0a52731cdf9532a7fd0eb06ab2f0e09d48
This is an experiment. The runners were sometimes running out of memory
building gem5. The builders have more memory so should be able to
handling this. The runners have 4-cores so compilation should be faster
(note the inclusion of the `-j$(nproc)`.
configs,dev-amdgpu: Add PCI express capability info
The ROCm stack requires PCI express atomics. Currently the first PCI
CapabilityPtr does not point to anything, which signals to the OS
(Linux) that this is an early generation PCI device. As PCI express
atomics were introduced later, the CapabilityPtr needs to point to at
least a PCI express capability structure. This capability is defined as
0x10 in Linux. We additionally set the PCI atomic based bits and
implement device specific PCI configuration space reads and writes to
the amdgpu device.
The second commit, output of simulation when loading the amdgpu
driver no longer outputs "PCIE atomics not supported". Further, an
application which uses PCIe atomics (PyTorch with a reduce_sum kernel)
now makes further progress.
First commit is a minor typo fix changing PCI capability struct to
union.
In the RISC-V system, we need to VecClassReg to run RISC-V vector
instruction, and VecElemReg is not applicable because the element length
of vector can be resizable via vset\*vl\* instruction.
The change will seperate the reg_index for VecReg and VecElemReg to
ensure that have the space for VecReg when VecElemReg is not applicable.
When an Evict request is received from upstream for a shared line and
the line is no longer cached locally (or on any other upstream cache),
we need to also send an Evict downstream. In this case we need to wait
until our outgoing Evict completes before completing the Evict from
upstream in order be able to resolve race conditions with incoming
snoops. E.g.: while our outgoing Evict is pending we may receive a snoop
requesting data, but we won't be able to complete this snoop if we have
already completed all upstream Evicts and we no longer have the line.
There are a few LDS instructions that perform local ALU operations and
writeback which are marked as loads. These are marked as loads because
they fit in the pipeline logic better, according to a several year old
comment. In the VEGA ISA these instructions (swizzle, permute, bpermute)
are not decrementing the LDS load counter. As a result, the counter will
gradually increase over time. Since wavefront slots are persistent, this
can cause applications with a few thousand kernels to eventually hang
thinking there are not enough resources.
This changeset fixes this by decrementing the LDS load counter for these
instructions. This fix was already integrated in the GCN3 ISA in the
exact same way. This changeset moves it near a similar comment about
scheduling register file writes.
Change-Id: Ife5237a2cae7213948c32ef266f4f8f22917351c
The ROCm stack requires PCI express atomics. Currently the first PCI
CapabilityPtr does not point to anything, which signals to the OS
(Linux) that this is an early generation PCI device. As PCI express
atomics were introduced later, the CapabilityPtr needs to point to at
least a PCI express capability structure. This capability is defined as
0x10 in Linux. We additionally set the PCI atomic based bits and
implement device specific PCI configuration space reads and writes to
the amdgpu device.
With this commit, the output of simulation when loading the amdgpu
driver no longer outputs "PCIE atomics not supported". Further, an
application which uses PCIe atomics (PyTorch with a reduce_sum kernel)
now makes further progress.
Change-Id: I5e3866979659a2657f558941106ef65c2f4d9988
The goal is to fix this issue which appears to be affects some Apple
users: https://github.com/gem5/gem5/issues/94.
By specializing the `EXC_*` to gem5 we avoid the name conflicts plagiing
some users.
In the RISC-V system, we need to VecClassReg to run RISC-V vector
instruction, and VecElemReg is not applicable because the element
length of vector can be resizable via vset*vl* instruction.
The change will seperate the reg_index for VecReg and VecElemReg to
ensure that have the space for VecReg when VecElemReg is not
applicable.
Change-Id: I99a82dec273baeee31df89a0ee0f5e87f3ff187c
The capabilities for PCI express is a struct, instead of a union, like
the other capability unions. A union is used here to provide access to
the ordinal data values when reading/writing an offset while
simultaneously providing human readable field values that can be set
when writing the code.
This commit changes it to union which is likely should be. Nothing
appears to be using this union yet so it is likely an oversight.
Change-Id: I85fe7cc62914525c70fd7a5946d725ed308f8775
There are a few LDS instructions that perform local ALU operations and
writeback which are marked as loads. These are marked as loads because
they fit in the pipeline logic better, according to a several year old
comment. In the VEGA ISA these instructions (swizzle, permute, bpermute)
are not decrementing the LDS load counter. As a result, the counter will
gradually increase over time. Since wavefront slots are persistent, this
can cause applications with a few thousand kernels to eventually hang
thinking there are not enough resources.
This changeset fixes this by decrementing the LDS load counter for these
instructions. This fix was already integrated in the GCN3 ISA in the
exact same way. This changeset moves it near a similar comment about
scheduling register file writes.
Change-Id: Ife5237a2cae7213948c32ef266f4f8f22917351c
This is an experiment. The runners were sometimes running out of memory
building gem5. The builders have more memory to handle this. The runners
have 4-cores so compilation should be faster (note the inclusion of the
`-j$(nproc)`.
Change-Id: I964c5a778938b449502d92dec3431f8b788397e4
When an Evict request is received from upstream for a shared line
and the line is no longer cached locally (or on any other upstream
cache), we need to also send an Evict downstream. In this case we need
to wait until our outgoing Evict completes before completing the Evict
from upstream in order be able to resolve race conditions with incoming
snoops. E.g.: while our outgoing Evict is pending we may receive a
snoop requesting data, but we won't be able to complete this snoop if
we have already completed all upstream Evicts and we no longer have the
line.
Change-Id: I23ac4f0a9c4ddd81e2425376c8d1e1c7fb66d107
Signed-off-by: Tiago Mück <tiago.muck@arm.com>
1. The current SignalSinkPort and SignalSourcePort have no
ways to assign the init value of the state. Add a new constructor
for them with the param init_state
2. After the source and sink are bound, the state at both side should
be the same. Set the the state of sink to the state of source in the
bind() function.
Change-Id: Idde0a12aa0ddd0c9c599ef47059674fb12aa5d68
Compilation bug found on:
https://github.com/gem5/gem5/actions/runs/5899831222/job/16002984553
In gcc Version 8 and below the following error is received:
```
src/base/bitfield.hh: In function ‘constexpr int gem5::findLsbSet(uint64_t)’:
src/base/bitfield.hh:365:34: error: call to non-‘constexpr’ function ‘int gem5::{anonymous}::findLsbSetFallback(uint64_t)’
return findLsbSetFallback(val);
~~~~~~~~~~~~~~~~~~^~~~~
scons: *** [build/ALL/kern/linux/events.o] Error 1
```
`findLsbSet` cannot be `constexr` as it calls non-constexpr function
`findLsbSetFallback`. `findLsbSetFallback`. The problematic function is
the `count` on the std::bitset.
This patch changes this to a constexpr.
Upload the config script to make it only for riscv asmtest and replace
Resource with obtain_resourse
Change-Id: I0bab96ea352b7ce1c6838203bfa13eee795f41f9
The goal is to fix this issue which appears to be affects some Apple
users: https://github.com/gem5/gem5/issues/94.
By specializing the `EXC_*` to gem5 we avoid the name conflicts plagiing
some users.
Change-Id: I031f7110b4b4ae82677b6586903cd57b22ca2137
Compilation bug found on:
https://github.com/gem5/gem5/actions/runs/5899831222/job/16002984553
In gcc Version 8 and below the following error is received:
```
src/base/bitfield.hh: In function ‘constexpr int gem5::findLsbSet(uint64_t)’:
src/base/bitfield.hh:365:34: error: call to non-‘constexpr’ function ‘int gem5::{anonymous}::findLsbSetFallback(uint64_t)’
return findLsbSetFallback(val);
~~~~~~~~~~~~~~~~~~^~~~~
scons: *** [build/ALL/kern/linux/events.o] Error 1
```
`findLsbSet` cannot be `constexr` as it calls non-constexpr function
`findLsbSetFallback`. `findLsbSetFallback`. The problematic function is
the `count` on the std::bitset.
This patch changes this to a constexpr.
Change-Id: I48bd15d03e4615148be6c4d926a3c9c2f777dc3c
When a linux kernel changes a page property, it flushes the related
cache lines. The kernel might change the page property before flushing
the cache lines. This results in the clflush might occur in an
uncacheable region.
Currently, an uncacheable request must be a read or a write. However,
clflush request is neither of them.
This change aims to allow clflush requests to work on uncacheable
regions. Since there is no straightforward way to check if a packet is
from a clflush instruction, this change permits all Clean Invalidate
Requests, which is the type of request produced by clflush, to work on
uncacheable regions.
This fixes:
1. Most importantly: The submodule recursive update was incorrect. This
adds the recursive obtaining of submodules as a seperate explicity step.
2. Changes the `git clone` to use https.
Update the cxx_config_cc.oy port description generation to use the
port.is_source attribute.
Github Issue: https://github.com/gem5/gem5/issues/181
Change-Id: I3fa12c2fbb06083379118e57aedb8be414c0d929
When a linux kernel changes a page property, it flushes the related cache
lines. The kernel might change the page property before flushing the
cache lines. This results in the clflush might occur in an uncacheable region.
Currently, an uncacheable request must be a read or a write. However,
clflush request is neither of them.
This change aims to allow clflush requests to work on uncacheable regions.
Since there is no straightforward way to check if a packet is from a clflush
instruction, this change permits all Clean Invalidate Requests, which is
the type of request produced by clflush, to work on uncacheable regions.
Change-Id: Ib3ec01d9281d3dfe565a0ced773ed912edb32b8f
Signed-off-by: Hoa Nguyen <hn@hnpl.org>
This fixes:
1. Most importantly: The submodule recursive update was incorrect. This
adds the recursive obtaining of submodules as a seperate explicity step.
2. Changes the `git clone` to use https.
Change-Id: Iad69e44b927a5aa982b49dffa6929c52fcc7ee72
Added save and restore checkpoint tests for arm-hello, x86-hello,
x86-fs, power-hello
Added mips and sparc test but mips does not support checkpoint and there
is a bug in sparc.
Added test file to run the tests.
This changes continue-on-error to be fail-fast instead, as
continue-on-error will mark failed matrix runs as
successful, whereas fail-fast makes sure everything in the matrix runs,
but gets marked as failed if part of it fails.
In this case the function is turned into a generator with the
"yield" of the generator the return the function's execution.
Change-Id: I4b06d64c5479638712a11e3c1a2f7bd30f60d188
This changes continue-on-error to be fail-fast instead, as
continue-on-error will mark failed matrix runs as
successful, whereas fail-fast makes sure everything in the matrix
runs, but gets marked as failed if part of it fails.
Change-Id: Ie20652c229b6cce9f1c0a45958b088391e7aae97
Any instructions require vector register should check if vector is
enabled. Any instructions need vtype CSR to execute them should check
vill bit beforehead.