We should not try to check vtype when decoding the instruction.
It should be checked in vset{i}vl{i} since the register can be
modified via vset{i}vl{i}
Change-Id: I403e5c4579bc5b8e6af10f93eac20c14662e4d2d
The RM field of ModRM was printed as Reg field for several instructions.
For reference, this change fixes typos introduced by [1].
[1] https://gem5-review.googlesource.com/c/public/gem5/+/40339
Change-Id: I41eb58e6a70845c4ddd6774ccba81b8069888be5
Signed-off-by: Hoa Nguyen <hn@hnpl.org>
Flat scratch instructions (aka private) are the 3rd and final segment of
flat instructions in gfx9 (Vega) and beyond. These are used for things
like spills/fills and thread local storage. This commit enables two
forms of flat scratch instructions: (1) flat_load/flat_store
instructions where the memory address resolves to private memory and (2)
the new scratch_load/scratch_store instructions in Vega. The first are
similar to older generation ISAs where the aperture is unknown until
address translation. The second are instructions guaranteed to go to
private memory.
Since these are very similar to flat global instructions there are
minimal changes needed:
- Ensure a flat instruction is either regular flat, global, XOR scratch
- Rename the global op_encoding methods to GlobalScratch to indicate
they are for both and are intentionally used.
- Flat instructions in segment 1 output scratch_ in the disassembly
- Flat instruction executed as private use similar mem helpers as global
- Flat scratch cannot be an atomic
This was tested using a modified version of the 'square' application:
template <typename T>
__global__ void
scratch_square(T *C_d, T *A_d, size_t N)
{
size_t offset = (blockIdx.x * blockDim.x + threadIdx.x);
size_t stride = blockDim.x * gridDim.x ;
volatile int foo; // Volatile ensures scratch / unoptimized code
for (size_t i=offset; i<N; i+=stride) {
foo = A_d[i];
C_d[i] = foo * foo;
}
}
Change-Id: Icc91a7f67836fa3e759fefe7c1c3f6851528ae7d
There are a few LDS instructions that perform local ALU operations and
writeback which are marked as loads. These are marked as loads because
they fit in the pipeline logic better, according to a several year old
comment. In the VEGA ISA these instructions (swizzle, permute, bpermute)
are not decrementing the LDS load counter. As a result, the counter will
gradually increase over time. Since wavefront slots are persistent, this
can cause applications with a few thousand kernels to eventually hang
thinking there are not enough resources.
This changeset fixes this by decrementing the LDS load counter for these
instructions. This fix was already integrated in the GCN3 ISA in the
exact same way. This changeset moves it near a similar comment about
scheduling register file writes.
Change-Id: Ife5237a2cae7213948c32ef266f4f8f22917351c
On RISC-V when trap occurs the contents of PC register contains the
address of instruction that caused that trap (as opposed to the address
of instruction following it in instruction stream). Therefore this commit
does not advance the PC before reporting trap in SE mode.
Change-Id: I83f3766cff276312cefcf1b4ac6e78a6569846b9
Due to inverted logic in POWER fault handlers, unimplemented opcode and
trap faults did not report trap to GDB (if connected). This commit fixes
the problem.
While at it, I opted to use `if (! ...) { panic(...) }` rather than
`panic_if(...)`. I find it easier to understand in this case.
Change-Id: I6cd5dfd5f6546b8541d685e877afef21540d6824
The mem instructions usually executed from initiateAcc. We also need
to check vector required in those instructions
Change-Id: I97b4fec7fada432abb55ca58050615e12e00d1ca
Detected via this failing workload:
https://github.com/gem5/gem5/actions/runs/5861958237
Ir caused the following compilation error to be thrown:
```
src/arch/x86/kvm/x86_cpu.cc:1462:22: error: unused variable ‘rv’ [-Werror=unused-variable]
1462 | bool rv = isa->cpuid->doCpuid(tc, function, idx, cpuid);
| ^~
```
`rv` is unused in the .fast compilation as it's only used in the
`assert` statement immediately after.
To fix this, the `[[maybe_unused]]` annotation is used.
Detected via this failing workload:
https://github.com/gem5/gem5/actions/runs/5861958237
It caused the following compilation error to be thrown:
```
src/arch/x86/kvm/x86_cpu.cc:1462:22: error: unused variable ‘rv’ [-Werror=unused-variable]
1462 | bool rv = isa->cpuid->doCpuid(tc, function, idx, cpuid);
| ^~
```
`rv` is unused in the .fast compilation as it's only used in the
`assert` statement immediately after.
To fix this, the `[[maybe_unused]]` annotation is used
Change-Id: Ib98dd859c62f171c8eeefae93502f92a8f133776
We're seeing some occasional connection timeouts in CI, possibly
when we aggressively hit the license server, so let's add a
parameter to retry the connection a few times.
Also, print the time required to connect to the server to help
debug issues.
Change-Id: I804af28f79f893fcdca615d7bf82dd9b8686a74c
Since the O3 and Minor CPU models do not support RVV right now as the
implementation stalls the decode until vsetvl instructions are exectued,
this change calls `fatal` if RVV is not explicitly enabled.
It is possible to override this if you explicitly enable RVV in the
config file.
Change-Id: Ia801911141bb2fb2bedcff3e139bf41ba8936085
Signed-off-by: Jason Lowe-Power <jason@lowepower.com>
TODOs:
+ vcompress.vm
Change-Id: I86eceae66e90380416fd3be2c10ad616512b5eba
Co-authored-by: Yang Liu <numbksco@gmail.com>
Co-authored-by: Fan Yang <1209202421@qq.com>
Co-authored-by: Jerin Joy <joy@rivosinc.com>
arch-riscv: Add LICENCE to template files
Change-Id: I825e72bffb84cce559d2e4c1fc2246c3b05a1243
* TODOs:
+ Vector Segment Load/Store
+ Vector Fault-only-first Load
Change-Id: I2815c76404e62babab7e9466e4ea33ea87e66e75
Co-authored-by: Yang Liu <numbksco@gmail.com>
Co-authored-by: Fan Yang <1209202421@qq.com>
Co-authored-by: Jerin Joy <joy@rivosinc.com>
Change-Id: I84363164ca327151101e8a1c3d8441a66338c909
Co-authored-by: Yang Liu <numbksco@gmail.com>
Co-authored-by: Fan Yang <1209202421@qq.com>
arch-riscv: Add a todo to fix vsetvl stall on decode
Change-Id: Iafb129648fba89009345f0c0ad3710f773379bf6
This commit add regs and configs for vector extension
* Add 32 vector arch regs as spec defined and 8 internal regs for
uop-based vector implementation.
* Add default vector configs(VLEN = 256, ELEN = 64). These cannot
be changed yet, since the vector implementation has only be tested
with such configs.
* Add disassamble register name v0~v31 and vtmp0~vtmp7.
* Add CSR registers defined in RISCV Vector Spec v1.0.
* Add vector bitfields.
* Add vector operand_types and operands.
Change-Id: I7bbab1ee9e0aa804d6f15ef7b77fac22d4f7212a
Co-authored-by: Yang Liu <numbksco@gmail.com>
Co-authored-by: Fan Yang <1209202421@qq.com>
Co-authored-by: Jerin Joy <joy@rivosinc.com>
arch-riscv: enable rvv flags only for RV64
Change-Id: I6586e322dfd562b598f63a18964d17326c14d4cf
Starting with ROCm 5.4+, MI100 and MI200 make use of the translate
further bit in the page table. This bit enables mixing 4kiB and 2MiB
pages and is functionally equivalent to mixing page sizes using the
PDE.P bit for which gem5 currently has support.
With PDE.P bit set, we stop walking and the page size is equal to the
level in the page table we stopped at. For example, stopping at level
2 would be a 1GiB page, stopping at level 3 would be a 2MiB page.
This assumes most pages are 4kiB.
When the F bit is used, it is assumed most pages are 2MiB and we will
stop walking at the 3rd level of the page table unless the F bit is set.
When the F bit is set, the 2nd level PDE contains a block fragment size
representing the page size of the next PDE in the form of 2^(12+size).
If the next page has the F bit set we continue walking to the 4th level.
The block fragment size is hardcoded to 9 in the driver therefore we
assert that the block fragment size must be 0 or 9.
This enables MI200 with ROCm 5.4+ in gem5. This functionality was
determine by examining the driver source code in Linux and there is no
public documentation about this feature or why the change is made in or
around ROCm 5.4.
Change-Id: I603c0208cd9e821f7ad6eeb1d94ae15eaa146fb9
When using the new operator, delete should be called
on any allocated memory after it's use is complete.
Change-Id: Id5fcfb264b6ddc252c0a9dcafc2d3b020f7b5019
A previous change added a vop2Helper to remove 100s of lines of common
code from VOP2 instructions related to processing SDWA and DPP support.
That change inadvertently changed the type of operand source 0 from
const to non-const. The vector container operator[] does not allow
reading a scalar value such as a constant, a dword literal, etc. The
error shows up in the form of: assert(!scalar) in operand.hh.
Since the SDWA and DPP cases need to modify the source vector and
non-SDWA/DPP cases might require const, we make a non-const copy of the
const source 0 vector and place it in a temporary non-const vector. This
non-const vector is passed to the lambda function implementation of the
instruction. This prevents needing a const and non-const version of the
lambda and avoids needing to propagate the template parameters through
the various SDWA/DPP helper methods which seems like it will not work
anyways as they need to modify the vector.
As a result of this, as more VOP2 instructions are implemented using
this helper, they will need to specify the const and non-const template
parameters of the vector container needed for the instruction.
Change-Id: Ia0b3c550d7de32b830040007a110f4821e3385aa
When using the new operator, delete should be called
on any allocated memory after it's use is complete.
Change-Id: Id5fcfb264b6ddc252c0a9dcafc2d3b020f7b5019
A previous change added a vop2Helper to remove 100s of lines of common
code from VOP2 instructions related to processing SDWA and DPP support.
That change inadvertently changed the type of operand source 0 from
const to non-const. The vector container operator[] does not allow
reading a scalar value such as a constant, a dword literal, etc. The
error shows up in the form of: assert(!scalar) in operand.hh.
Since the SDWA and DPP cases need to modify the source vector and
non-SDWA/DPP cases might require const, we make a non-const copy of the
const source 0 vector and place it in a tempoary non-const vector. This
non-const vector is passed to the lambda function implementation of the
instruction. This prevents needing a const and non-const version of the
lambda and avoids needing to propagate the template parameters through
the various SDWA/DPP helper methods which seems like it will not work
anyways as they need to modify the vector.
As a result of this, as more VOP2 instructions are implemented using
this helper,they will need to specify the const and non-const template
parameters of the vector container needed for the instruction.
Change-Id: Ia0b3c550d7de32b830040007a110f4821e3385aa
The extended state CPUID function is used to set the values of the XCR0
register as well as specify the size of storage for context switching
storage for x87 and AVX+. This function is iterative and therefore
requires (1) marking it as such in the hsaSignificantIndex function (2)
setting multiple sets of 4-tuples for the default CPUID values where the
last 4-tuple ends with all zeros.
Change-Id: Ib6a43925afb1cae75f61d8acff52a3cc26ce17c8
Related to the recent changes with moving CPUID values to python, this
value is needed to enable AVX and needs a way to be exposed to python as
well in order to set the bit and the corresponding CPUID values at the
same time.
Change-Id: I3cadb0fe61ff4ebf6de903018a8d8a411bfdb4e0
Various CPUID functions will return different values depending on the
value of ECX when executing the CPUID instruction. Add support for this
in the X86 KVM CPU. A subsequent patch will add a CPUID function which
requires iterating through multiple ECX values.
Change-Id: Ib44a52be52ea632d5e2cee3fb2ca390b60a7202a
CPUID values for X86 are currently hard-coded in the C++ source file.
This makes it difficult to configure the bits if needed. Move these to
python instead. This will provide a few benefits:
1. We can enable features for certain configurations, for example AVX
can be enabled when the KVM CPU is used, but otherwise should not be
enabled as gem5 does not have full AVX support.
2. We can more accurately communicate things like cache/TLB sizes based
on the actual gem5 configuration. The CPUID values are can be used by
some libraries, e.g., MPI, to query system topology.
3. Enabling some bits breaks things in certain configurations and this
can be prevented by configuring in python. For example, enabling AVX
seems to currently be breaking SMP, meaning gem5 can only boot one CPU
in that configuration.
Change-Id: Ib3866f39c86d61374b9451e60b119a3155575884
FEAT_TLBIOS has been introduced by a recent patch [1] which
was however missing to include the outer shareable case in the
Msr disambiguation switch. Which meant the TLBIOS instructions
were decoded as normal MSR instructions, with no effect whatsoever
on the TLBs
[1]: https://gem5-review.googlesource.com/c/public/gem5/+/70567
Change-Id: I41665a4634fbe0ee8cc30dbc5d88d63103082ae9
Signed-off-by: Giacomo Travaglini <giacomo.travaglini@arm.com>
When shiftAmt is 0 for a UQRSHL instruction, the code called bits() with
incorrect arguments. This fixes a left-shift of 0 to be a NOP/mov, as
required.
Change-Id: Ic86ca40ac42bfb767a09e8c65a53cec56382a008
Co-authored-by: Marton Erdos <marton.erdos@arm.com>
* gpu-compute: Remove use of 'std::random_shuffle'
This was deprecated in C++14 and removed in C++17. This has been
replaced with std::random. This has been implemented to ensure
reproducible results despite (pseudo)random behavior.
Change-Id: Idd52bc997547c7f8c1be88f6130adff8a37b4116
* dev-amdgpu: Add missing 'overrides'
This causes warnings/errors in some compilers.
Change-Id: I36a3548943c030d2578c2f581c8985c12eaeb0ae
* dev: Fix Linux specific includes to be portable
This allows for compilation in non-linux systems (e.g., Mac OS).
Change-Id: Ib6c9406baf42db8caaad335ebc670c1905584ea2
* tests: Add 'VEGA_X86' build target to compiler-tests.sh
Change-Id: Icbf1d60a096b1791a4718a7edf17466f854b6ae5
* tests: Add 'GCN3_X86' build target to compiler-tests.sh
Change-Id: Ie7c9c20bb090f8688e48c8619667312196a7c123
Vega adds multiple new D16 instructions which load a byte or short into
the lower or upper 16 bits of a register for packed math. The decoder
table has subDecode tables for FLAT instructions which represents 32
opcodes in each subDecode table. The subDecode table for opcodes 32-63
is missing so it is added here.
The opcode for V_SWAP_B32 is also off by one- In the ISA manual this
instruction is opcode 81, the instruction before is 79, and there is no
opcode 80, so the decoder entry is swapped with the invalid decoding
below it.
Change-Id: I278fea574ea684ccc6302d5b4d0f5dd8813a88ad
Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/71899
Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com>
Maintainer: Matt Sinclair <mattdsinclair@gmail.com>
Tested-by: kokoro <noreply+kokoro@google.com>