Fixed the assertion in the CPU's translation.hh file so that it no
longer fails when the cache is clean.
I compiled this C code to `test`
```c
#include <stdio.h>
static inline void clflush(volatile void *p) {
__asm__ volatile ("clflush (%0)" : : "r"(p) : "memory");
}
int main() {
int data = 42; // Example variable
printf("Value before clflush: %d\n", data);
clflush(&data);
printf("Value after clflush: %d\n", data);
return 0;
}
```
And ran it with this command
`./build/X86/gem5.opt configs/learning_gem5/part1/two_level.py ./test`
in order to verify that it no longer fails the assertion check.
GitHub Issue: #862
Change-Id: I6004662e7c99f637ba0ddb07d205d1657708e99f
This feature has been available since Vega10 but was never implemented.
MI300 adds a few new instructions that make use of this more often
(e.g., v_mov_b64).
Change-Id: Ieeb7834462b76d77c0030f49622d0de09f90c9e4
This instruction is a simple move from accumulation register to
accumulation register. It is essentially a move with the accumulation
offset added to the register index.
Change-Id: Ic93ae72599b75c91213f56ebafe5bbd7b2867089
Flat, scratch, and global instructions essentially share the same
instruction implementation and differ only in their address
calculations. These instructions were already implemented but not added
to the decoder. This commit adds the remaining scratch instructions
which have a shared instruction implementation.
Change-Id: I8f2e9ceb221294dce1b81c45745b642f0592d985
The lines `constexpr int B_I = std::ceil(64.0f / (N * M / H));` caused
the following compilation error in clang Version 16:
```
error: constexpr variable 'B_I' must be initialized by a constant
expression
```
`std::ceil` is not a constant expression. Therefore, instances of this
expression in instructions.hh have been replaced with a
constant-expression-friendly alternative.
This was causing our compiler tests to fail:
https://github.com/gem5/gem5/actions/runs/9288296434/job/25559409142
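As an illustration of the kind of replacement described above, here is a
minimal sketch of an integer ceiling division that is usable in a constant
expression. The helper names `ceilDiv` and `blocksPerInst` and the template
parameter values are assumptions made for this example; they are not taken
from instructions.hh, and the actual fix may be written differently.

```cpp
// Hypothetical helper: ceiling of a / b for positive integers. Unlike
// std::ceil, plain integer arithmetic is valid in a constant expression.
constexpr int
ceilDiv(int a, int b)
{
    return (a + b - 1) / b;
}

// Illustrative stand-in for the B_I computation. M, N and H are assumed to
// be positive integer template parameters, so N * M / H is already an
// integer division, as in std::ceil(64.0f / (N * M / H)).
template <int M, int N, int H>
constexpr int
blocksPerInst()
{
    return ceilDiv(64, N * M / H);
}

// Evaluated entirely at compile time.
static_assert(blocksPerInst<4, 4, 1>() == 4, "64 / 16 == 4");
static_assert(blocksPerInst<32, 32, 16>() == 1, "64 / 64 == 1");
```

The result matches the floating-point form whenever `N * M / H` is a positive
integer, which is the assumption this sketch relies on.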
Change-Id: I74da1dab08b335c59bdddef6581746a94107f370
Currently, when data is downgraded by MOESI_AMD_Base-CorePair (e.g., due
to a replacement), this requires a 4-way handshake between the CorePair
and the dir. Specifically, the CorePair sends a message telling the dir
it would like to downgrade, then the dir sends an ACK back, then the
CorePair writes the data back, and finally the dir ACKs the writeback.
This is very inefficient and not representative of how modern protocols
downgrade a request. Accordingly, this commit updates the downgrade
support such that the CorePair writes back the data immediately and then
the dir ACKs it.
Thus, this approach requires only a 2-way handshake.
Change-Id: I7ebc85bb03e8ce46a8847e3240fc170120e9fcd6
Co-authored-by: Neeraj Surawar <neerajs@hyrule.cs.wisc.edu>
When the compiler tries to inline a vector construction that uses a
default-constructed ReplaceableEntry as the default value, it can
complain about the uninitialized members.
Let's provide basic initialization for the members.
Example codepath:
SignaturePathV2 constructor
-> GlobalHistoryEntry() as init_value to AssociativeSet
-> AssociativeSet initialize vector<Entry> with init_value
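The pattern in the code path above can be sketched as follows. `Entry` and
`makeEntries` are simplified, hypothetical stand-ins written for this
example; they are not gem5's ReplaceableEntry or AssociativeSet.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified stand-in for a replaceable entry (not gem5's class).
struct Entry
{
    // Default member initializers give every field a defined value, so a
    // default-constructed entry can be copied without reading
    // uninitialized memory.
    std::uint32_t set = 0;
    std::uint32_t way = 0;
    bool valid = false;
};

// Mirrors the code path above: a container filled with copies of a
// default-constructed init value.
inline std::vector<Entry>
makeEntries(std::size_t count, const Entry &init_value = Entry())
{
    return std::vector<Entry>(count, init_value);
}
```

With the initializers in place, the copy made for each vector element reads
only defined values, which is what the compiler diagnostic is about.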
This PR is doing the following:
1) Fixing memory attributes of partial translation entries (table walks)
2) Properly setting the cacheability of table walks
Fix #1168. Prevent logical instructions like AND, OR, and TEST from
having input dependencies on the previous value of the Zaps register
(ZF+AF+PF+SF) by having them set AF=0, rather than not modifying AF.
Fix #1169. Break the input dependency of 32-bit and 64-bit 'mov'
micro-ops on the prior value in the destination register. Such a
dependency is required for 8-bit and 16-bit moves, as they do not
completely overwrite the value in the destination register. However, it
is unnecessary for 32-bit moves (which implicitly zero the upper 32
bits) and 64-bit moves.
This patch implements the fix by adding a new code template field inside
the generated constructors of X86StaticInst's, called `invalidate_srcs`,
which instruction implementations like `mov` can use to conditionally
invalidate particular source registers as needed. In `mov`'s case, this
is when the data size is 32 or 64 bits.
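To make the dependency argument concrete, here is a small sketch of how the
result of a register-to-register move depends on the data size. The function
names are hypothetical and this is not the micro-op implementation; it
ignores the high-byte registers (AH, BH, ...) for brevity.

```cpp
#include <cstdint>

// True when the old destination value contributes to the result, i.e.
// when the move must keep the destination as a source operand.
constexpr bool
movNeedsOldDest(int data_size)
{
    return data_size == 1 || data_size == 2;
}

// Result of a move of data_size bytes (1, 2, 4 or 8) into a 64-bit register.
constexpr std::uint64_t
movResult(std::uint64_t old_dest, std::uint64_t src, int data_size)
{
    switch (data_size) {
      case 1: // merge into the low byte, upper bits preserved
        return (old_dest & ~0xffULL) | (src & 0xffULL);
      case 2: // merge into the low word, upper bits preserved
        return (old_dest & ~0xffffULL) | (src & 0xffffULL);
      case 4: // upper 32 bits are implicitly zeroed
        return src & 0xffffffffULL;
      default: // 8 bytes: the destination is fully overwritten
        return src;
    }
}
```

For sizes 4 and 8 the result never reads `old_dest`, which is why the
destination can be dropped from the source list in those cases.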
Change-Id: Ib2aef6be6da08752640ea3414b90efb7965be924
SDMA RLC queues do not currently remove their doorbell mapping. This can
cause issues re-registering the queue and prevents the pending doorbells
feature from working. In addition the data value of the doorbell (the
ring buffer rptr) is not saved, leading to UB when this workaround is
used.
This commit removes the doorbell mapping from the gpu device when the
SDMA engine unmaps an RLC queue and copies the next doorbell value to
the pending packet as was originally intended.
Change-Id: Ifd551450f439c065579afcf916f8ff192e7598ab
According to the Arm architecture reference manual, it is possible to
force the broadcast of the following TLBIs via the HCR_EL2.FB bit:
AArch64: TLBI VMALLE1, TLBI VAE1, TLBI ASIDE1, TLBI VAAE1, TLBI VALE1,
TLBI VAALE1, IC IALLU, TLBI RVAE1, TLBI RVAAE1, TLBI RVALE1, and TLBI
RVAALE1.
AArch32: BPIALL, TLBIALL, TLBIMVA, TLBIASID, DTLBIALL, DTLBIMVA,
DTLBIASID, ITLBIALL, ITLBIMVA, ITLBIASID, TLBIMVAA, ICIALLU, TLBIMVAL,
and TLBIMVAAL.
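The effect of the FB bit can be sketched roughly as below. The types and the
function `effectiveScope` are hypothetical and heavily simplified; the sketch
assumes that FB only affects the listed operations when they are executed at
EL1 with EL2 enabled, and it does not model how gem5 represents TLBI
operations.

```cpp
// Hypothetical, simplified model; names are illustrative, not gem5's API.
enum class TlbiScope { Local, InnerShareable };

struct ExecState
{
    int el;            // current exception level (0..3)
    bool el2_enabled;  // EL2 enabled in the current security state
    bool hcr_fb;       // value of HCR_EL2.FB
};

// Returns the scope the operation is actually performed with.
constexpr TlbiScope
effectiveScope(const ExecState &st, TlbiScope encoded)
{
    // A local operation issued from EL1 is upgraded to a broadcast when
    // HCR_EL2.FB is set; everything else keeps its encoded scope.
    return (st.el == 1 && st.el2_enabled && st.hcr_fb)
        ? TlbiScope::InnerShareable
        : encoded;
}
```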
Change-Id: Ib11aa05cd202fadfbd9221db7a2043051196ecbd
Signed-off-by: Giacomo Travaglini <giacomo.travaglini@arm.com>
When determining the cacheability of table walks,
SCTLR.C should only be used in stage1 EL1&0 translations.
Stage2 translations should rely on HCR_EL2.CD instead.
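The selection described above can be sketched as a small helper. The names
`Stage`, `WalkControls` and `walkMayBeCacheable` are assumptions made for
this example and do not correspond to the table walker's actual interface.

```cpp
// Hypothetical, simplified model of the control bits involved.
enum class Stage { Stage1, Stage2 };

struct WalkControls
{
    bool sctlr_c;  // SCTLR.C: cacheability enable for stage 1
    bool hcr_cd;   // HCR_EL2.CD: cacheability disable for stage 2
};

// Returns true when the table walk is allowed to be cacheable.
constexpr bool
walkMayBeCacheable(Stage stage, const WalkControls &ctl)
{
    // Stage 2 walks ignore SCTLR.C and are controlled by HCR_EL2.CD,
    // which has the opposite polarity (1 means "disable").
    return stage == Stage::Stage2 ? !ctl.hcr_cd : ctl.sctlr_c;
}
```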
Change-Id: I1b0830bc3fb5086f68d7a7a1560c7fed5d126d28
Signed-off-by: Giacomo Travaglini <giacomo.travaglini@arm.com>
Make table walks uncacheable if marked as uncacheable
in either inner or outer shareable domain
Change-Id: I5898a3b91b5b919e0beda6c6fe896394e3ab94df
Signed-off-by: Giacomo Travaglini <giacomo.travaglini@arm.com>
StoreThrough in VIPER, when the TCP is disabled, the GLC bit is set, or
the SLC bit is set, will bypass the TCP but will temporarily allocate a
cache entry, seemingly to handle write coalescing with valid blocks. It
does not attempt to evict a block if the set is full and the address is
invalid. This causes a panic if the set is full, as there is no spare
cache entry to use temporarily for DataBlk manipulation. However, a
cache block is not required for this.
This commit removes the use of a cache block for StoreThrough with
invalid blocks, as there is no existing data to coalesce with. It
creates no-allocate variants of the actions needed in StoreThrough and
pulls the DataBlk information from the in_msg instead. Non-invalid
blocks do not have this panic as they have a cache entry already.
Fixes issues with StoreThroughs on more aggressive architectures like
MI300.
Change-Id: Id8687eccb991e967bb5292068cbe7686e0930d7d
The new static analysis in GCC 13 finds issues with operand.hh. This
commit fixes the error so that gem5 compiles when BUILD_GPU is true.
Change-Id: I6f4b0d350f0cabb6e356de20a46e1ca65fd0da55
Those AArch64 instructions/registers were labelled as executable
from EL3 only if SCR_EL3.NS == 1. This is not valid anymore
after the introduction of FEAT_SEL2
Change-Id: Ie7b56f3fe779c3a99d4f0ef937c7c8ec0530b00e
Signed-off-by: Giacomo Travaglini <giacomo.travaglini@arm.com>
This makes it easier for TLBI instructions to share code. The common
code (in the form of tlbi* functions) closely matches the instruction
description in the Arm pseudocode.
Change-Id: If10c22fb4a7df2bcd0335e9761286ad3c458722b
Signed-off-by: Giacomo Travaglini <giacomo.travaglini@arm.com>
Bit 0 of the jump address taken from the register should be 0. Handling
the jump address incorrectly may cause an infinite run or a segmentation
fault.
gem5 issue: https://github.com/gem5/gem5/issues/981
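Clearing bit 0 of a computed jump target can be sketched as below. This
assumes semantics like those of RISC-V JALR, where the least-significant bit
of the target is set to zero; the function name `jumpTarget` is hypothetical
and is not taken from the changed code.

```cpp
#include <cstdint>

// Computes base + offset and clears bit 0 of the result, so the target is
// always at least 2-byte aligned.
constexpr std::uint64_t
jumpTarget(std::uint64_t base, std::int64_t offset)
{
    return (base + static_cast<std::uint64_t>(offset)) & ~std::uint64_t{1};
}
```

Without the mask, an odd value in the register would produce a target in the
middle of an instruction.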
This change fixes #1148.
I have only added an acknowledged return, as we don't have remote and
wrap modes, so it can only be in stream mode.
Change-Id: I1882042d873ff0e9465c9491238554c8fbb9aa76
Those were not part of the performTlbi switch and simulation was
therefore panicking when they were encountered
Change-Id: Ifbe0b89e45539df4abc147ac5970b0caf0d9dfdc
Signed-off-by: Giacomo Travaglini <giacomo.travaglini@arm.com>
This commit fixes and refactors the implementation of viota. It also
overrides the generateDisassembly function in viota's macro/micro to
correctly print out the instruction when tracing/debugging.
For example, it changes from:
viota_m vd, vd, vs2, v0.t
to:
viota_m vd, vs2, v0.t
This adds two failsafes for conditions which may cause a panic on some
machines. First, check that the host machine has the KVM XCR capability
before calling getXCRs or setXCRs. Second, ensure the x87 bit, which
must always be one, is always read as one by modifying the return value
in readMiscReg.
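The second failsafe amounts to forcing one bit in the value that is read
back. A minimal sketch follows; the constant and function names are
hypothetical, and the real change lives inside readMiscReg rather than in a
free function.

```cpp
#include <cstdint>

// Bit 0 of XCR0 is the x87 state bit, which must always be one.
constexpr std::uint64_t Xcr0X87Bit = std::uint64_t{1} << 0;

// Returns the stored XCR0 value with the x87 bit forced on, so a stored
// value of zero can never be handed to the host.
constexpr std::uint64_t
readXcr0(std::uint64_t stored)
{
    return stored | Xcr0X87Bit;
}
```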
Change-Id: I5e778acc926a47443ef6cef29fabd84eb69bb9ba
This implements some missing loads and stores that are commonly used in
applications with MFMA instructions to load 16-bit data types into
specific register locations: DS_READ_U16_D16, DS_READ_U16_D16_HI,
BUFFER_LOAD_SHORT_D16, BUFFER_LOAD_SHORT_D16_HI.
Change-Id: Ie22d81ef010328f4541553a9a674764dc16a9f4d
Add a unit test for the MXFP types (bf16, fp16, fp8, bf8). These types
are not currently operated on directly. Instead they are cast to float
values and then arithmetic is performed. As a result, the unit test
simply checks that when we convert a value from MXFP type to float and
back that the values of the MXFP type match. Exact values are used to
avoid discrepancies with rounding.
The test can be run using `scons build/VEGA_X86/unittests.opt`.
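The round-trip idea behind the test can be sketched for bf16, which is the
upper 16 bits of an IEEE fp32 value. The helpers below are hypothetical and
use plain truncation; they are not the MXFP classes or the unit test from
this commit.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Widen a bf16 bit pattern to float by placing it in the upper 16 bits.
inline float
bf16ToFloat(std::uint16_t bits)
{
    const std::uint32_t word = static_cast<std::uint32_t>(bits) << 16;
    float value;
    std::memcpy(&value, &word, sizeof(value));
    return value;
}

// Narrow a float to bf16 by truncation (no rounding in this sketch).
inline std::uint16_t
floatToBf16(float value)
{
    std::uint32_t word;
    std::memcpy(&word, &value, sizeof(word));
    return static_cast<std::uint16_t>(word >> 16);
}

// Mirrors the check described above: converting to float and back must
// reproduce the original bit pattern.
inline void
checkRoundTrip(std::uint16_t bits)
{
    assert(floatToBf16(bf16ToFloat(bits)) == bits);
}
```

Because every bf16 value is exactly representable in fp32, the round trip is
lossless, which is why exact values can be compared without a tolerance.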
Change-Id: I596e9368eb929d239dd2d917e3abd7927b15b71e
These instructions are used in some of the F16 MFMA example applications
to convert to/from floating point types.
Change-Id: I7426ea663ce11a39fe8c60c8006d8cca11cfaf07
This instruction is new in MI300 and is used in some of the example
applications used to test MFMAs.
Change-Id: I739f8ab2be6a93ee3b6bdc4120d0117724edb0d4
This adds the decodings for all of the matrix fused multiply add (MFMA)
and sparse matrix fused multiply accumulate (SMFMAC) instructions up to
and including MI300. This does not yet provide the implementation for
these instructions, however it is easier and less tedious to add them in
bulk rather than one at a time.
Change-Id: I5acd23ca8a26bdec843bead545d1f8820ad95b41
The microscaling formats (MXFP) and INT8 types require additional size
checks which are not needed for the current MFMA template. The size
check is done using a constexpr method exclusive to the MXFP type,
therefore create a special class for MXFP types. This is preferable to
attempting to shoehorn into the existing template as it helps with
readability. Similarly, INT8 requires a size check to determine the
number of elements per VGPR, but is not an MXFP type. Create a special
template for that as well.
This additionally implements all of the MFMA types which have test cases
in the amd-lab-notes repository (https://github.com/amd/amd-lab-notes/).
The implementations were tested using the applications in the
matrix-cores subfolder and achieve L2 norms equivalent or better than
MI200 hardware.
Change-Id: Ia5ae89387149928905e7bcd25302ed3d1df6af38
This class can be used to load multiple operand dwords into an array and
then select bits from the span of that array. It handles cases where the
bits span two dwords (e.g., you have four dwords for a 128-bit value and
want to select bits 35:30) and cases where multiple values < 32-bits are
packed into a single dword (e.g., two bf16 values).
This is most useful for packed arrays and instructions which have more
than two dwords. Beyond two dwords, the operator[] overload of
VectorOperand is not available requiring additional logic to select from
an operand. This helper class handles that additional logic itself.
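The bit selection described above can be sketched with a free function over
an array of dwords. `selectBits` is a hypothetical name and this is not the
helper class added by the commit; it assumes `0 <= lo <= hi < 32 * N` and a
field width of at most 32 bits.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Extracts bits hi:lo (inclusive) from a value stored as N little-endian
// dwords. The field may straddle a dword boundary.
template <std::size_t N>
constexpr std::uint32_t
selectBits(const std::array<std::uint32_t, N> &dwords, int hi, int lo)
{
    std::uint32_t result = 0;
    for (int bit = lo; bit <= hi; ++bit) {
        // Pick the dword holding this bit, then the bit within it.
        const std::uint32_t value = (dwords[bit / 32] >> (bit % 32)) & 1u;
        result |= value << (bit - lo);
    }
    return result;
}
```

For the example in the text, bits 35:30 of a 128-bit value take two bits from
the first dword and four from the second. The same function also covers the
packed case, e.g. bits 31:16 and 15:0 of one dword holding two bf16 values.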
Change-Id: I74856d0f312f7549b3b6c405ab71eb2b174c70ac
The open compute project (OCP) microscaling formats (MX) are used in the
GPU model. The specification is available at [1]. This implements a C++
version of MXFP formats with many constraints that conform to the
specification.
Arithmetic is not actually performed directly on the MXFP types.
Instead, they are converted to fp32 and the computation is performed in
fp32. For most of these types this is acceptable for the GPU model as
there are no instructions which directly perform arithmetic on them. For
example, the DOT/MFMA instructions may first convert to FP32 and then
perform arithmetic.
Change-Id: I7235722627f7f66c291792b5dbf9e3ea2f67883e
Release of MI300X simulation capability:
- Implements the required MI300X features over MI200 (currently only
architecture flat scratch).
- Make the gpu-compute model use MI200 features when MI300X / gfx942 is
configured.
- Fix up the scratch_ instructions which seem to be preferred in
debug hipcc builds over buffer_.
- Add mi300.py config similar to mi200.py. This config can optionally
use resources instead of command line args.
It appears we have been trying to read 64-bit arguments for ARM32 since
695583709b. I noticed that SYS_OPEN was
trying to read a really long string as the pathname argument and it
turned out it was reading from the wrong stack offset. With this change
I can successfully run some of the semihosting tests for ARM32.
Change-Id: Ie154052dac4211993fb6c4c99d93990123c2eacf
In BaseSemihosting::readString() we were using the len argument to
allocate a std::vector without checking whether the value makes any
sense. This resulted in a std::bad_alloc exception being raised prior to
https://github.com/gem5/gem5/pull/1142 for my semihosting tests. This
commit prevents semihosting from reading more than 64K for string
arguments which should be more than sufficient for any valid code.
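A bounded string read along these lines could look like the sketch below.
The name `readBoundedString`, the `guest_mem` parameter standing in for
simulated memory, and the choice to reject (rather than truncate) oversized
requests are all assumptions made for this example; they are not taken from
BaseSemihosting::readString().

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>

// Cap taken from the 64K limit described above.
constexpr std::size_t MaxStringLen = 64 * 1024;

// Reads len bytes starting at addr, or returns nothing when the request is
// implausibly large or falls outside the modelled memory.
inline std::optional<std::string>
readBoundedString(const std::string &guest_mem, std::size_t addr,
                  std::size_t len)
{
    // Validate the length before allocating anything.
    if (len > MaxStringLen)
        return std::nullopt;
    // Written this way to avoid overflow in addr + len.
    if (addr > guest_mem.size() || len > guest_mem.size() - addr)
        return std::nullopt;
    std::string out = guest_mem.substr(addr, len);
    assert(out.size() == len);
    return out;
}
```

Checking the length first means a garbage `len` read from the wrong stack
offset can no longer trigger a huge allocation.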
Change-Id: I059669016ee2c5721fedb914595d0494f6cfd4cd
This commit fixes the implementation of vrgather instruction based on
rvv 1.0.
In section 16.4. Vector Register Gather Instructions,
> Vector-scalar and vector-immediate forms of the register gather are
also provided. These read one element from the source vector at the
given index, and write this value to the active elements of the
destination vector register. The index value in the scalar register and
the immediate, zero-extended to XLEN bits, are treated as unsigned
integers. If XLEN > SEW, the index value is not truncated to SEW bits.
The fix zero-extends the index value in the scalar register and the
immediate.
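The index handling quoted above can be sketched as follows. The function
names are hypothetical, the source vector is modelled as a `std::array` of
VLMAX elements, and masking and tail handling are left out, so this is not
the micro-op implementation from the fix.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// The 5-bit immediate of vrgather.vi is zero-extended, so 0x1f means
// index 31 rather than -1.
constexpr std::uint64_t
zeroExtendImm5(std::uint32_t imm)
{
    return imm & 0x1fu;
}

// Element selected by vrgather.vx / vrgather.vi for one active element.
// The index is compared at full width: it is NOT truncated to SEW bits,
// and an out-of-range index yields zero.
template <typename Elem, std::size_t VLMAX>
constexpr Elem
gatherElement(const std::array<Elem, VLMAX> &vs2, std::uint64_t index)
{
    return index < VLMAX ? vs2[index] : Elem{0};
}
```

With SEW = 8, an index of 256 would wrap to 0 if it were truncated to the
element width and would wrongly select the first element; comparing the
full-width value returns zero instead.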
Architected flat scratch is added in MI300, which stores the scratch
base address in dedicated registers rather than in SGPRs. These
registers are used by scratch_ instructions. These are flat instructions
which explicitly target the private memory aperture. These instructions
have a different address calculation than global_ instructions.
This change implements architected flat scratch support, fixes the
address calculation of scratch_ instructions, and implements decodings
for some scratch_ instructions. Previous flat_ instructions which happen
to access the private memory aperture have no change in address
calculation. Since scratch_ instructions are identical to flat_
instructions except for address calculation, the decodings simply reuse
existing flat_ instruction definitions.
Change-Id: I1e1d15a2fbcc7a4a678157c35608f4f22b359e21
Add support for the following two extensions:
[Zvfh](https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#185-zvfh-vector-extension-for-half-precision-floating-point):
Vector Extension for Half-Precision Floating-Point
[Zvfhmin](https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#184-zvfhmin-vector-extension-for-minimal-half-precision-floating-point):
Vector Extension for Minimal Half-Precision Floating-Point
For instructions (`vfncvt[.rtz].x[u].f.w`) and (`vfwcvt.f.x[u].v`) which
will become defined when `SEW = 8`, a new template
`VectorFloatWideningAndNarrowingCvtDecodeBlock` is added and an 8-bit
floating-point type (`float8_t`) is defined.
The data type `float8_t` is introduced in the newer `3e` version of the
SoftFloat Package, however, the current version in use is `3d` which
does not include this definition. Despite this, `float8_t` is utilized
solely for constructing the `vfncvt[.rtz].x[u].f.w` and
`vfwcvt.f.x[u].v` instructions when `SEW = 8`. There are no operations
that directly manipulate data of the `float8_t` type.
This is the version for MI300. For the most part, it is the same as
MI200 with the exception of architected flat scratch (not yet
implemented in gem5) and therefore a new version enum is required.
Change-Id: Id18cd7b57c4eebd467c010a3f61e3117beb8d58a