This patch fixes the buggy special path comparisons in
src/kern/linux/linux.cc Linux::openSpecialFile(), which only checked for
equality of path prefixes, but not equality of the paths themselves.
This patch replaces those buggy comparisons with regular
std::string::operator== string equality comparisons.
GitHub issue: https://github.com/gem5/gem5/issues/269
This ensures that if the CI tests are running for a PR, and a new
workflow is triggered (typically by pushing/rebasing the PR) then the
older workflow is cancelled.
This ensures that if the CI tests are running for a PR, and a new
workflow is triggered (typically by pushing/rebasing the PR) then the
older workflow is cancelled.
Change-Id: Ifa172bdbdac09c5a91abb41a0162c597445e4e2e
The bug report template used escape characters. This is not necessary as
the bug report is not rendered when creating a bug report. It is
displayed to the user in plain text for them to edit.
In addition languages have been added to the code-blocks and newlines
have been added and removed where appropriate to cleanup the document.
This patch fixes the buggy special path comparisons in
src/kern/linux/linux.cc Linux::openSpecialFile(), which only checked
for equality of path prefixes, but not equality of the paths
themselves. This patch replaces those buggy comparisons with
regular std::string::operator== string equality comparisons.
GitHub issue: https://github.com/gem5/gem5/issues/269
Change-Id: I216ff8019b9a6a3e87e364c2e197d9b991959ec1
Currently the MOESI_AMD_Base-directory transition for system level
atomics sends the response message before the atomic is performed. This
was likely done because atomics are supposed to return the value of the
data *before* the atomic is performed and by simply ordering the actions
this way that was taken care of.
With the new atomic log feature, the atomic values are pulled from the
log by the coalescer on the return path. Therefore, these actions can be
reordered. In fact, it is now necessary that the atomics be performed
before sending the response so that the log is populated and copied by
the response action. This should fix#253 .
There were some weird newline characters in this file, or lack of lines.
This patch adds/removes them.
Change-Id: I6cc918788c07bbc4be5c68401ad3987be00fffc4
The bug_report.md is rendered as plain text, not markdown, when creating
a bug report. As such the escape characters are removed in this commit.
Change-Id: I524c66ae61d00b7ed59153ba9f4b2297ff50ee18
Currently the MOESI_AMD_Base-directory transition for system level
atomics sends the response message before the atomic is performed. This
was likely done because atomics are supposed to return the value of the
data *before* the atomic is performed and by simply ordering the actions
this way that was taken care of.
With the new atomic log feature, the atomic values are pulled from the
log by the coalescer on the return path. Therefore, these actions can be
reordered. However, it is now necessary that the atomics be performed
before sending the response so that the log is populated and copied by
the response action. This should fix#253 .
Change-Id: Ie7e178f93990975367de2cc3e89e5ef9c9069241
gdb was originally part of the ROCm 1.6 Dockerfile a few years ago. It
got removed when updating to ROCm 4.0. This adds it back as being able
to debug things is quite useful.
This isn't necessary. Without 'run-name' the action's default name is
'run-name'. Displaying the actor who launched the action is pointless
for scheduled tests.
Starting with gfx900 (Vega) the LDS and scratch apertures can be queried
using a new s_getreg_b32 instruction. If the instruction is called with
the SH_MEM_BASES argument it returns the upper 16 bits of a 64 bit
address for the LDS and scratch apertures. The current addresses cannot
be encoded in this register, so that addresses are changed to have the
lower 48 bits be all zeros in addition to writing the bases register.
The RM field of ModRM was printed as Reg field for several instructions.
For reference, this change fixes typos introduced by [1].
[1] https://gem5-review.googlesource.com/c/public/gem5/+/40339
Change-Id: I41eb58e6a70845c4ddd6774ccba81b8069888be5
Signed-off-by: Hoa Nguyen <hn@hnpl.org>
gdb was originally part of the ROCm 1.6 Dockerfile a few years ago. It
got removed when updating to ROCm 4.0. This adds it back as being able
to debug things is quite useful.
Change-Id: I3f8148cde79e6cc5233fa3c8c830b64817f01d3a
Starting with gfx900 (Vega) the LDS and scratch apertures can be queried
using a new s_getreg_b32 instruction. If the instruction is called with
the SH_MEM_BASES argument it returns the upper 16 bits of a 64 bit
address for the LDS and scratch apertures. The current addresses cannot
be encoded in this register, so that addresses are changed to have the
lower 48 bits be all zeros in addition to writing the bases register.
Change-Id: If20f262b2685d248afe31aa3ebb274e4f0fc0772
This isn't necessary. Without 'run-name' the action's default name is
'run-name'. Displaying the actor who launched the action is pointless
for scheduled tests.
Change-Id: I15d52959389881381ef7685efb57152c5162c89d
Augmenting the DataBlock class with a change log structure to record the
effects of atomic operations on a data block and service these changes
if the atomic operations require return values.
Although the operations are atomic, the coalescer need not send unique
memory requests for each operation. Atomic operations within a wavefront
to the same address are now coalesced into a single memory request. The
response of this request carries all the necessary information to
provide the requesting lanes unique values as a result of their
individual atomic operations. This helps reduce contention for request
and response queues in simulation.
Previously, only the final value of the datablock after all atomic ops
to the same address was visible to the requesting waves. This change
corrects this behavior by allowing each wave to see the effect of this
individual atomic op is a return value is necessary.
This is built to test the following assumptions:
1. We can trigger a GitHub action event on the changing of a
file/directory.
2. We can use GitHub actions to build a docker image.
3. We can use GitHub actions to push a docker image to a container
registry.
4. We can use GitHub's container registry.
Right now this will only build and push ubuntu-20.04_all-depenencies, as
a test.
This patch allows a local JSON file to specify a local path in the JSON
object of a Resource, through the "url" field.
Local paths can be entered with the prefix "file:" in the "url" field.
If the local path exists, then the Resource from there is copied into
the resource directory defined in the
function earlier.
This behavior is the same as using specific Resource classes (ex.
BinaryResource) and passing a local_path into the function.
But, the above class does not allow simultaneous creation of local
Resources and Workloads of those local Resources.
With this patch, someone can use a local JSON, specify the location of
local Resources and create a Workload from those Resources and test both
together.
This patch allows a local JSON file to specify a local path
in the JSON object of a Resource, through the "url" field.
Local paths can be entered with the prefix "file:" in "url".
All File URI scheme formats are supported.
This behavior is the same as using specific Resource classes
(ex. BinaryResource) and passing a local_path into the function.
But, the above infrastructure does not allow simultaneous
creation of Resources and Workloads of those Resources.
With this patch, someone can use a local JSON, specify the location
of local Resources and create a Workload from those Resources and
test both together.
Also, this patch adds pyunit tests to check the functionality
of the function used to convert the "url" field into a path.
Change-Id: I1fa3ce33a9870528efd7751d7ca24c27baf36ad4
There is conversion error in ./util/streamline/m5stats2streamline.py
script to convert gem5 stats.txt,sys, system.tasks.txt to the apc folder
required by DS-5 streamline. The fix to the bug can convert to apc
folder without error. The zipped apc folder can then be imported in
older DS-5 v5.24 for visualization (didn't work with DS-5 v5.29).
Changes:
1) writeBinary function binary_list can have either string or ints and
it needs to be properly converted to bytes
2) packed32(x) function can have x as int or float. Incase of float it
needs to be converted to int
The bug was reported and solved primarily in the issue
https://github.com/gem5/gem5/issues/145
Change-Id: I6a52aa59e1582dd6bb06b2d1c49ddaf8fe61c997
This is built to test the following assumptions:
1. We can trigger a GitHub action event on the changing of a
file/directory.
2. We can use GitHub actions to build a docker image.
3. We can use GitHub actions to push a docker image to a container
registry.
4. We can use GitHub's container registry.
Right now this will only build and push ubuntu-20.04_all-depenencies, as
a test.
Change-Id: Ie1a55c97c6eef26281456c908e1200b27da4d961
DCT must be disabled when handling a ReadUnique where the copy need to
be upgraded.
Previously we were just asserting as it was assumed DCT is only enabled
for HNFs (which can "auto-upgrade"). However DCT may also be enabled for
intermediated levels of distributed shared caches above the HNFs.
Currently we generate these stats for all defined Events in the
protocol, which may generate too many stats that are never used. Though
these don't appear in the stats.txt file, they unnecessarily increases
simulation startup time and memory footprint.
This patch limits those stats to events with the "in_trans" and/or
"out_trans" properties. SLICC compiler then checks which combinations of
event+state are possible when generating the stats.
Also the possible level of detail for inTransLatHist was reduced.
Only the number of transactions for each event+initial+final state
combinations is now accounted. Latency histograms are only defined per
event type (similarly to outTransLatHist). This significantly reduces
the final file size for generated stats.
1) writeBinary function binary_list can have either string or ints and it needs to be properly converted to bytes
2) packed32(x) function can have x as int or float. Incase of float it needs to be converted to int
3) encode lines to string using .decode() or else TypeError will be invoked during run
Change-Id: I678169f191901f02a80187418a17adbc1240c7d3
1) writeBinary function binary_list can have either string or ints and it needs to be properly converted to bytes
2) packed32(x) function can have x as int or float. Incase of float it needs to be converted to int
Change-Id: I6a52aa59e1582dd6bb06b2d1c49ddaf8fe61c997
Flat scratch instructions (aka private) are the 3rd and final segment of
flat instructions in gfx9 (Vega) and beyond. These are used for things
like spills/fills and thread local storage. This commit enables two
forms of flat scratch instructions: (1) flat_load/flat_store
instructions where the memory address resolves to private memory and (2)
the new scratch_load/scratch_store instructions in Vega. The first are
similar to older generation ISAs where the aperture is unknown until
address translation. The second are instructions guaranteed to go to
private memory.
Since these are very similar to flat global instructions there are
minimal changes needed:
- Ensure a flat instruction is either regular flat, global, XOR scratch
- Rename the global op_encoding methods to GlobalScratch to indicate
they are for both and are intentionally used.
- Flat instructions in segment 1 output scratch_ in the disassembly
- Flat instruction executed as private use similar mem helpers as global
- Flat scratch cannot be an atomic
This was tested using a modified version of the 'square' application:
template <typename T>
__global__ void
scratch_square(T *C_d, T *A_d, size_t N)
{
size_t offset = (blockIdx.x * blockDim.x + threadIdx.x);
size_t stride = blockDim.x * gridDim.x ;
volatile int foo; // Volatile ensures scratch / unoptimized code
for (size_t i=offset; i<N; i+=stride) {
foo = A_d[i];
C_d[i] = foo * foo;
}
}
Change-Id: Icc91a7f67836fa3e759fefe7c1c3f6851528ae7d
According to the ABI documentation from LLVM, the *low* register of flat
scratch (maxSGPR - 4) is the offset and the high register (maxSGPR - 3)
is size. These are currently backwards, resulting in some gnarly
addresses being generated leading to page fault and/or incorrect data.
This commit fixes this by setting the order correctly.
Change-Id: I0b1d077c49c0ee2a4e59b0f6d85cdb8f17f9be61
Flat instructions may access memory locations in LDS (scratchpad) and
global (VRAM/framebuffer) and therefore increment both counters when
dispatched. Once the aperture is known, we decrement the counters of the
aperture that was *not* used. This is done incorrectly for scratch /
private flat instruction. Private memory is global and therefore local
memory counters should be decremented.
This commit fixes the counters by changing the global decrements to
local decrements.
Change-Id: I25890446908df72e5469e9dbaba6c984955196cf
The functional HSA signal read was a hack left in the gpu-compute code.
In full system, this functional read is causing problems occasionally
with the translation not yet being in the page table. The error message
output by gem5 was a fatal message on the readBlob method in port proxy.
Changing this to a timing DMA fixes this problem.
This commit adds the various timing DMA functions to send and receive
response and clean up. A helper method "sendCompletionSignal" is added
to the GPUCommandProcessor because the indentation level was getting too
deep. This change applies only to FS mode. Code for SE mode is
equivalent to what it was before this commit.
Change-Id: I1bfcaa0a52731cdf9532a7fd0eb06ab2f0e09d48
The functional HSA signal read was a hack left in the gpu-compute code.
In full system, this functional read is causing problems occasionally
with the translation not yet being in the page table. The error message
output by gem5 was a fatal message on the readBlob method in port proxy.
Changing this to a timing DMA fixes this problem.
This commit adds the various timing DMA functions to send and receive
response and clean up. A helper method "sendCompletionSignal" is added
to the GPUCommandProcessor because the indentation level was getting too
deep. This change applies only to FS mode. Code for SE mode is
equivalent to what it was before this commit.
Change-Id: I1bfcaa0a52731cdf9532a7fd0eb06ab2f0e09d48
This updates all the jobs for our CI tests to make sure they
don't run tests on draft pull request, and only trigger when
ready for review
Change-Id: I3fe7ae373c39fc6ef594c0c71c6f10e7319553d8
This is an experiment. The runners were sometimes running out of memory
building gem5. The builders have more memory so should be able to
handling this. The runners have 4-cores so compilation should be faster
(note the inclusion of the `-j$(nproc)`.
configs,dev-amdgpu: Add PCI express capability info
The ROCm stack requires PCI express atomics. Currently the first PCI
CapabilityPtr does not point to anything, which signals to the OS
(Linux) that this is an early generation PCI device. As PCI express
atomics were introduced later, the CapabilityPtr needs to point to at
least a PCI express capability structure. This capability is defined as
0x10 in Linux. We additionally set the PCI atomic based bits and
implement device specific PCI configuration space reads and writes to
the amdgpu device.
The second commit, output of simulation when loading the amdgpu
driver no longer outputs "PCIE atomics not supported". Further, an
application which uses PCIe atomics (PyTorch with a reduce_sum kernel)
now makes further progress.
First commit is a minor typo fix changing PCI capability struct to
union.
In the RISC-V system, we need to VecClassReg to run RISC-V vector
instruction, and VecElemReg is not applicable because the element length
of vector can be resizable via vset\*vl\* instruction.
The change will seperate the reg_index for VecReg and VecElemReg to
ensure that have the space for VecReg when VecElemReg is not applicable.
When an Evict request is received from upstream for a shared line and
the line is no longer cached locally (or on any other upstream cache),
we need to also send an Evict downstream. In this case we need to wait
until our outgoing Evict completes before completing the Evict from
upstream in order be able to resolve race conditions with incoming
snoops. E.g.: while our outgoing Evict is pending we may receive a snoop
requesting data, but we won't be able to complete this snoop if we have
already completed all upstream Evicts and we no longer have the line.
There are a few LDS instructions that perform local ALU operations and
writeback which are marked as loads. These are marked as loads because
they fit in the pipeline logic better, according to a several year old
comment. In the VEGA ISA these instructions (swizzle, permute, bpermute)
are not decrementing the LDS load counter. As a result, the counter will
gradually increase over time. Since wavefront slots are persistent, this
can cause applications with a few thousand kernels to eventually hang
thinking there are not enough resources.
This changeset fixes this by decrementing the LDS load counter for these
instructions. This fix was already integrated in the GCN3 ISA in the
exact same way. This changeset moves it near a similar comment about
scheduling register file writes.
Change-Id: Ife5237a2cae7213948c32ef266f4f8f22917351c
The ROCm stack requires PCI express atomics. Currently the first PCI
CapabilityPtr does not point to anything, which signals to the OS
(Linux) that this is an early generation PCI device. As PCI express
atomics were introduced later, the CapabilityPtr needs to point to at
least a PCI express capability structure. This capability is defined as
0x10 in Linux. We additionally set the PCI atomic based bits and
implement device specific PCI configuration space reads and writes to
the amdgpu device.
With this commit, the output of simulation when loading the amdgpu
driver no longer outputs "PCIE atomics not supported". Further, an
application which uses PCIe atomics (PyTorch with a reduce_sum kernel)
now makes further progress.
Change-Id: I5e3866979659a2657f558941106ef65c2f4d9988
The goal is to fix this issue which appears to be affects some Apple
users: https://github.com/gem5/gem5/issues/94.
By specializing the `EXC_*` to gem5 we avoid the name conflicts plagiing
some users.
In the RISC-V system, we need to VecClassReg to run RISC-V vector
instruction, and VecElemReg is not applicable because the element
length of vector can be resizable via vset*vl* instruction.
The change will seperate the reg_index for VecReg and VecElemReg to
ensure that have the space for VecReg when VecElemReg is not
applicable.
Change-Id: I99a82dec273baeee31df89a0ee0f5e87f3ff187c
The capabilities for PCI express is a struct, instead of a union, like
the other capability unions. A union is used here to provide access to
the ordinal data values when reading/writing an offset while
simultaneously providing human readable field values that can be set
when writing the code.
This commit changes it to union which is likely should be. Nothing
appears to be using this union yet so it is likely an oversight.
Change-Id: I85fe7cc62914525c70fd7a5946d725ed308f8775
There are a few LDS instructions that perform local ALU operations and
writeback which are marked as loads. These are marked as loads because
they fit in the pipeline logic better, according to a several year old
comment. In the VEGA ISA these instructions (swizzle, permute, bpermute)
are not decrementing the LDS load counter. As a result, the counter will
gradually increase over time. Since wavefront slots are persistent, this
can cause applications with a few thousand kernels to eventually hang
thinking there are not enough resources.
This changeset fixes this by decrementing the LDS load counter for these
instructions. This fix was already integrated in the GCN3 ISA in the
exact same way. This changeset moves it near a similar comment about
scheduling register file writes.
Change-Id: Ife5237a2cae7213948c32ef266f4f8f22917351c