This PR updates the cache recorder to use a vector of RubyPorts for cache
cooldown and warmup instead of Sequencer or GPUCoalescer vectors (refer
to issue #403 for more details). It also removes the extra guards that
were added in #377 to prevent compile-time failures in non-GPU builds.
A previous commit added BUILD_GPU guards to the GPU coalescer models since
a related cache recorder commit added GPU support. This is no longer
needed since the cache recorder moved to using a vector of RubyPorts
instead of Sequencer/GPUCoalescer pointers. This commit removes the
BUILD_GPU guards from the Ruby coalescer models.
Change-Id: I23a7957d82524d6cd3483d22edfb35ac51796eca
Previously, the cache recorder used a vector of sequencer pointers to
access Ruby objects. A recent commit updated the cache recorder to also
maintain a vector of GPUCoalescer pointers in order for GPUs to support
flushing. This added redundant code to the cache recorder. This commit
replaces the sequencer and GPUCoalescer vectors with a vector of
RubyPort pointers so that the code does not contain redundant lines.
Change-Id: Id5da33fb870f17bb9daef816cc43c0bcd70a8706
This is a standard compare and swap but implemented on vector memory
buffer instructions (i.e., it is the same as FLAT_ATOMIC_CMPSWAP with
MUBUF's special address calculation).
This was tested using a Tensile kernel, a backend for rocBLAS, which is
used by PyTorch and TensorFlow. Prior to this patch both ML frameworks
crashed. With this patch they both make forward progress.
Change-Id: Ie76447a72d210f81624e01e1fa374e41c2c21e06
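The compare-and-swap semantics described above can be sketched as follows. This is an illustrative Python model, not the gem5 C++ implementation: the function name and the list-backed memory are assumptions, and the MUBUF address calculation is not modeled.

```python
def atomic_cmpswap(memory, addr, compare, swap):
    """Hypothetical helper showing compare-and-swap semantics.

    Writes `swap` only if the current value equals `compare`, and always
    returns the value that was in memory before the operation.
    """
    old = memory[addr]
    if old == compare:
        memory[addr] = swap
    return old


mem = [5, 7]
atomic_cmpswap(mem, 0, 5, 9)  # values match, so mem[0] is replaced
atomic_cmpswap(mem, 1, 5, 9)  # values differ, so mem[1] is kept
```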
This instruction is used by ML frameworks to prioritize certain
wavefronts. Since gem5 does not have any support for wavefront
scheduling based on priority (besides wavefront age), we ignore this
instruction and warn_once rather than calling panic. Since hardware can
override this priority anyway, we can be sure that ignoring the value
will not inhibit forward progress and cause application hangs.
Change-Id: Ic5eef14f9685dd2b316c5cf76078bb78d5bfe3cc
This adds the [pyupgrade](https://github.com/asottile/pyupgrade) hook to
pre-commit.
This hook automatically upgrades the syntax to the recommended standards
for the newer version of the language.
Memory instructions acquire coalescer tokens in the schedule stage.
Currently this is only done for buffer and flat instructions, but not
flat global or flat scratch. This change now acquires tokens for flat
global and flat scratch instructions. This provides back-pressure to the
CUs and helps to avoid deadlocks in Ruby.
The change also handles returning tokens for buffer, flat global, and
flat scratch instructions. This was previously only being done for
normal flat instructions leading to deadlocks in some applications when
the tokens were exhausted.
To simplify the logic, this change adds a needsToken() method to
GPUDynInst which returns whether the instruction is a buffer instruction
or any flat segment.
The waitcnts were also incorrect for flat global and flat scratch. We
should always decrement vmem and exp count for stores and only normal
flat instructions should decrement lgkm. Currently vmem/exp are not
decremented for flat global and flat scratch which can lead to deadlock.
This change set fixes this by always decrementing vmem/exp and lgkm only
for normal flat instructions.
Change-Id: I673f4ac6121e4b5a5e8491bc9130c6d825d95fc5
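The token handling described above can be sketched in Python. Everything here is a simplified assumption for illustration: `Inst` stands in for GPUDynInst, its field names are hypothetical, and `TokenPool` is not a gem5 class. The waitcnt changes are not modeled.

```python
from dataclasses import dataclass


@dataclass
class Inst:
    """Hypothetical stand-in for GPUDynInst; field names are assumptions."""
    is_buffer: bool = False
    is_flat: bool = False          # "normal" flat instruction
    is_flat_global: bool = False
    is_flat_scratch: bool = False

    def needs_token(self):
        # Buffer instructions and every flat segment use a coalescer token.
        return (self.is_buffer or self.is_flat
                or self.is_flat_global or self.is_flat_scratch)


class TokenPool:
    """Illustrative token pool that applies back-pressure when empty."""

    def __init__(self, count):
        self.available = count

    def try_acquire(self, inst):
        if not inst.needs_token():
            return True            # instruction is not gated by tokens
        if self.available == 0:
            return False           # stall: no tokens left
        self.available -= 1
        return True

    def release(self, inst):
        # Returning the token for every token-using instruction is what
        # keeps the pool from being drained permanently.
        if inst.needs_token():
            self.available += 1
```

Routing both acquire and release through the same predicate keeps the two paths from drifting apart, which is the kind of mismatch the commit message describes.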
Simplify the indirect predictor interface. Several of the existing
functions were merged into four clear ones, which mirror the main
direction predictor interface: 'lookup', 'update', 'squash' and
'commit'. This makes the interface much clearer, allows better
functionality isolation and makes it simpler to develop new predictor
models.
A new parameter is added to allow additional buffer space for
speculative path history.
Change-Id: I6d6b43965b2986ef959953a64c428e50bc68d38e
Signed-off-by: David Schall <david.schall@ed.ac.uk>
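A four-function interface of this shape could look like the sketch below. It is written in Python for brevity rather than gem5's C++, and the signatures, the `LastTargetPredictor` model, and the split between speculative and committed state are all assumptions made for illustration.

```python
from abc import ABC, abstractmethod


class IndirectPredictor(ABC):
    """Hypothetical four-function interface; signatures are assumptions."""

    @abstractmethod
    def lookup(self, pc): ...

    @abstractmethod
    def update(self, pc, target): ...

    @abstractmethod
    def squash(self, pc): ...

    @abstractmethod
    def commit(self, pc): ...


class LastTargetPredictor(IndirectPredictor):
    """Toy model: predicts the last target seen for a branch address."""

    def __init__(self):
        self.committed = {}
        self.speculative = {}

    def lookup(self, pc):
        # Prefer speculative state, fall back to committed state.
        return self.speculative.get(pc, self.committed.get(pc))

    def update(self, pc, target):
        self.speculative[pc] = target

    def squash(self, pc):
        # Discard speculative state for a mispredicted path.
        self.speculative.pop(pc, None)

    def commit(self, pc):
        if pc in self.speculative:
            self.committed[pc] = self.speculative.pop(pc)
```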
Now:
* The Atlas Client will attempt a connection 4 times, using an
exponential backoff approach between attempts.
* When a failure does arise a rich output is given so problems can be
easily diagnosed.
Addresses: #340
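The retry behavior can be sketched as follows. This is not the Atlas Client code: the function name, the 1-second base delay, and the use of `ConnectionError` are assumptions chosen for illustration. Only the four attempts and the exponential backoff come from the description above.

```python
import time


def connect_with_backoff(connect, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call `connect` up to `attempts` times, doubling the wait each time.

    `connect` is any zero-argument callable that raises ConnectionError on
    failure. `sleep` is injectable so the delays can be observed in tests.
    """
    errors = []
    for attempt in range(attempts):
        try:
            return connect()
        except ConnectionError as exc:
            errors.append(f"attempt {attempt + 1}: {exc}")
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    # Report every failure so the problem can be diagnosed from the output.
    raise ConnectionError(
        f"failed after {attempts} attempts:\n" + "\n".join(errors)
    )
```

With these assumed defaults the waits between attempts are 1, 2 and 4 seconds, and the final error lists each failed attempt.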
The RISC-V vector instructions still work without setRegOperand.
We should fix the register statistic issue via
https://github.com/gem5/gem5/pull/360 to avoid counting register writes
twice in the statistics.
Change-Id: Ib6a52935e00c3e557b366abfcf60450dca05614d
Earlier, GPU checkpointing was working only if a checkpoint was created
before the first kernel execution. This pull request adds support to
checkpoint in-between any two kernel calls. It does so by doing the
following.
- Adds flush support in the GPU_VIPER protocol
- Adds flush support in the GPUCoalescer
- Updates cache recorder to use the GPUCoalescer during simulation
cooldown and cache warmup times.
The new implementation matches the table in the ARM Architecture
Reference Manual (version DDI 0487J.a, section D1.3.6, table R_SXLWJ).
It takes into consideration features like FEAT_SEL2 (scr.eel2 bit) and
FEAT_VHE (hcr.e2h bit) which affect the masking of interrupts under
certain circumstances.
Signed-off-by: Giacomo Travaglini <giacomo.travaglini@arm.com>
Change-Id: I07ebd8d859651475bd32fd201eea0f4e64a7dd5f
We pay a small duplication cost, but we make the code
more readable and we enable further modifications to the
AArch64 code without forcing the same code on the AArch32
method.
Signed-off-by: Giacomo Travaglini <giacomo.travaglini@arm.com>
Change-Id: I1efa33cf19f91094fd33bd48b6a0a57d8df8f89f
Now:
* The Atlas Client will attempt a connection 4 times, using an
exponential backoff approach between attempts.
* When a failure does arise a rich output is given so problems can be
easily diagnosed.
Change-Id: I3df332277c33a040c0ed734b9f3e28f38606af44
This PR adds two commits to handle timestamps in the ROCm runtime. ROCr
uses a mix of GPU timestamp reads and HSA packet timestamps to output
profiling information for a task dispatch.
The first commit adds timestamps to the HSA completion signal indicating
when the task started and ended. This requires changing the flow of
completion signal DMAs to ensure the DMA of the timestamp values
completes before the completion signal value is written.
The second commit adds MMIOs for reading the GPU's timestamp counter. This
MMIO resides in the GFX MMIO space, so a new class is added to handle
MMIOs in that address range.
Prompted by our failing compiler tests
(https://github.com/gem5/gem5/actions/runs/6348223508), this PR:
* Adds missing overrides to `PCState`'s `set` function.
* Removes `std::binary_function` from DramPower (it was deprecated in
C++11 and removed in C++17).
Added a parameter (_disk_device) to kernel_disk_workload which allows
users to change the disk device location. get_disk_device() now returns
the parameter if one was passed; otherwise it calls a new function,
_get_default_disk_device(), which each board implements to return its
own default disk device, e.g., /dev/hda in the x86_board. The previous
way of setting a disk device still exists as a default; with the new
parameter, users can now override this default.
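The override logic can be sketched as below. This is a simplified model and not the gem5 standard library code: the class names and constructor are assumptions, and only `get_disk_device()`, `_get_default_disk_device()`, `_disk_device` and the `/dev/hda` default come from the description above.

```python
class KernelDiskWorkload:
    """Simplified stand-in for kernel_disk_workload (constructor is assumed)."""

    def __init__(self, disk_device=None):
        self._disk_device = disk_device

    def _get_default_disk_device(self):
        # Each board supplies its own default.
        raise NotImplementedError

    def get_disk_device(self):
        # A user-supplied device wins; otherwise use the board default.
        if self._disk_device is not None:
            return self._disk_device
        return self._get_default_disk_device()


class X86Board(KernelDiskWorkload):
    """Illustrative board with the x86 default named in the description."""

    def _get_default_disk_device(self):
        return "/dev/hda"
```

In this sketch, `X86Board().get_disk_device()` falls back to the board default, while `X86Board(disk_device="/dev/sda1")` returns the user's choice (the `/dev/sda1` path is just an example value).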
This comment was left in the codebase in error. The
`set_se_binary_workload` function works fine with multi-threaded
applications. This hasn't been a restriction for some time.
This is the first PR in a series of enhancements to the BPU proposed in
#358.
However, I think putting everything into one PR would be hard to review
and would make it easy to overlook mistakes I might have made.
This PR restructures the BTB:
- A new abstract BTB class is created to enable different BTB
implementations. The new BTB class gets its own parameter and stats.
- An enum is added to differentiate branch instruction types. This enum
is used to enhance statistics and BPU management.
- The existing BTB is moved into `simple_btb` as default.
- An additional function is added to store the static instruction in the
BTB. This function is used for the decoupled front-end.
- Update configs to match new BTB parameters.
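The structure in the list above can be sketched as follows, in Python for brevity rather than gem5's C++. The branch type names, method signatures, direct-mapped indexing and per-type lookup counter are all assumptions made for illustration.

```python
from abc import ABC, abstractmethod
from enum import Enum, auto


class BranchType(Enum):
    """Hypothetical branch categories; the real enum values may differ."""
    DIRECT_COND = auto()
    DIRECT_UNCOND = auto()
    INDIRECT = auto()
    RETURN = auto()


class BranchTargetBuffer(ABC):
    """Abstract BTB so that different implementations can be swapped in."""

    @abstractmethod
    def lookup(self, pc, branch_type): ...

    @abstractmethod
    def update(self, pc, target, branch_type, static_inst=None): ...


class SimpleBTB(BranchTargetBuffer):
    """Toy direct-mapped BTB with per-branch-type lookup statistics."""

    def __init__(self, num_entries):
        self.num_entries = num_entries
        self.entries = {}
        self.lookups = {t: 0 for t in BranchType}

    def lookup(self, pc, branch_type):
        self.lookups[branch_type] += 1
        entry = self.entries.get(pc % self.num_entries)
        if entry is not None and entry[0] == pc:
            return entry[1]
        return None

    def update(self, pc, target, branch_type, static_inst=None):
        # The static instruction is stored alongside the target, as a
        # decoupled front-end would need.
        self.entries[pc % self.num_entries] = (pc, target, static_inst)
```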
Modified the x86 KVM-in-SE syscall handler to flush the TLB following
each syscall, in case the page table has been modified. This is done by
reloading the value in %cr3. Doing this requires an intermediate GPR,
which we store in a new scratch buffer following the syscall code at
address `syscallDataBuf`.
GitHub issue: https://github.com/gem5/gem5/issues/409
- A new abstract BTB class is created to enable different BTB
implementations. The new BTB class gets its own parameter
and stats.
- An enum is added to differentiate branch instruction types.
This enum is used to enhance statistics and BPU management.
- The existing BTB is moved into `simple_btb` as default.
- An additional function is added to store the static instruction in
the BTB. This function is used for the decoupled front-end.
- Update configs to match new BTB parameters.
Change-Id: I99b29a19a1b57e59ea2b188ed7d62a8b79426529
Signed-off-by: David Schall <david.schall@ed.ac.uk>
This continues the effort to completely remove any artifact
which implies virtualization is only supported in
non-secure mode (NS=1).
Change-Id: I83fed1c33cc745ecdf3c5ad60f4f356f3c58aad5
Signed-off-by: Giacomo Travaglini <giacomo.travaglini@arm.com>
This info can be used during TLB invalidation
Change-Id: I81247e40b11745f0207178b52c47845ca1b92870
Signed-off-by: Giacomo Travaglini <giacomo.travaglini@arm.com>
The syscall emulation of brk() incorrectly did not ensure that newly
allocated memory was zero-initialized, which Linux guarantees and which
seems to be the expectation of glibc's malloc() and free()
implementation. This patch fixes the incorrect behavior by
zero-initializing all memory allocations via brk().
GitHub issue: https://github.com/gem5/gem5/issues/342
Change-Id: I53cf29d6f3f83285c8e813e18c06c2e9a69d7cc2
Modified the x86 KVM-in-SE syscall handler to flush the TLB following
each syscall, in case the page table has been modified. This is done
by reloading the value in %cr3. Doing this requires an intermediate
GPR, which we store in a new scratch buffer following the syscall code
at address `syscallDataBuf`.
GitHub issue: https://github.com/gem5/gem5/issues/409
Change-Id: Ibc20018c97ebb1794fa31a0c71e0857d661c7c9d
gem5::MemState::updateBrkRegion(), which is called during the syscall
emulation of brk, did not unmap deallocated heap pages when the brk
region was receding. Instead, it kept them mapped for simplicity. This
introduced a bug where subsequent expansions of the brk region reused
prior heap page mappings that were not zero-filled. This violates
the assumptions of glibc malloc, resulting in heap corruption and
crashes.
This patch fixes the bug by always unmapping pages that are deallocated
during a call to brk() that reduces the heap size. This makes the
gem5::MemState::_endBrkPoint field obsolete, so this patch removes it.
GitHub issue: https://github.com/gem5/gem5/issues/342
Change-Id: Ib2244e1aa4d2a26666ad60d231fdde2c22d2df35
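The effect of unmapping on shrink can be illustrated with a small Python model. It is not the gem5 MemState code: the `BrkRegion` class, the 4 KiB page size, and the assumption of a page-aligned start address are all hypothetical.

```python
PAGE = 4096  # assumed page size


def _round_up(addr):
    return (addr + PAGE - 1) // PAGE * PAGE


class BrkRegion:
    """Toy model of a heap that unmaps pages when brk recedes."""

    def __init__(self, start):
        self.start = start      # assumed to be page aligned
        self.brk = start
        self.pages = {}         # page base address -> page contents

    def update(self, new_brk):
        new_brk = max(new_brk, self.start)
        old_end = _round_up(self.brk)
        new_end = _round_up(new_brk)
        # Growing: map fresh, zero-filled pages.
        for base in range(old_end, new_end, PAGE):
            self.pages[base] = bytearray(PAGE)
        # Shrinking: unmap, so stale contents can never be reused.
        for base in range(new_end, old_end, PAGE):
            del self.pages[base]
        self.brk = new_brk
```

Because a shrink drops the page, a later growth always maps a new zero-filled page instead of exposing the old contents.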
The ROCr runtime uses a combination of HSA signal timestamps and
hardware MMIOs to calculate profiling times. At the beginning of an
application a timestamp is read from the GPU using MMIOs. The clock
MMIOs reside in the GFX MMIO region, so a new AMDGPUGfx class is added
to handle these MMIOs.
The timestamp value is expected to be in nanoseconds, so we simply use
the gem5 tick converted to ns.
Change-Id: I7d1cba40d5042a7f7a81fd4d132402dc11b71bd4
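The tick to nanosecond conversion is simple arithmetic, sketched below. The function name is hypothetical, and the default of 10^12 ticks per second (1 ps per tick) is an assumption about the simulation's tick rate.

```python
def ticks_to_ns(ticks, ticks_per_second=10**12):
    """Convert simulator ticks to whole nanoseconds (truncating)."""
    return ticks * 1_000_000_000 // ticks_per_second
```

Under that assumption, 1,000,000 ticks correspond to 1,000 ns.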
The AMD specific HSA signal contains start/end timestamps for dispatch
packet completion signals. These are currently always zero. These
timestamp values are used for profiling in the ROCr runtime.
Unfortunately, the GpuAgent::TranslateTime method in ROCr does not check
for zero values before dividing, causing applications that use profiling
to crash with SIGFPE. Profiling is used via hipEvents in the HACC
application, so these should be supported in gem5.
In order to handle writing the timestamp values, we need to DMA the
values to memory before writing the completion signal. This changes the
flow of the async completion signal write to be (1) read mailbox pointer
(2) if valid, write the mailbox data, otherwise skip to 4 (3) write
mailbox data if pointer is valid (4) write timestamp values (5) write
completion signal. The application will process the timestamp data as
soon as the completion signal is received, so we need this ordering to
ensure the DMA for timestamps has completed.
HACC now runs to completion on GPUFS and has the same output as
hardware.
Change-Id: I09877cdff901d1402140f2c3bafea7605fa6554e
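One reading of the ordering above is sketched below. The function name is hypothetical, treating a zero pointer as invalid is an assumption, and steps 2 and 3 are collapsed into a single conditional mailbox write. The actual gem5 code chains DMA callbacks rather than building a list.

```python
def completion_signal_steps(mailbox_ptr):
    """Return the ordered steps for an async completion signal write."""
    steps = ["read mailbox pointer"]
    if mailbox_ptr != 0:  # assumption: 0 means no mailbox
        steps.append("write mailbox data")
    # Timestamps must land in memory before the signal becomes visible.
    steps.append("write timestamp values")
    steps.append("write completion signal")
    return steps
```

The invariant the commit relies on is that the timestamp write always precedes the completion signal write, whether or not a mailbox is present.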
Added a new feature to the CHI protocol (in collaboration with @tiagormk).
Here is the Jira Ticket:
[https://gem5.atlassian.net/browse/GEM5-1326](https://gem5.atlassian.net/browse/GEM5-1326).
As described in the CHI specs, far atomic transactions enable remote
execution of Atomic Memory Operations. This pull request incorporates
several changes:
* Fix Arm ISA definition of Swap instructions. These instructions should
return an operand, so their ISA definition should be Return Operation.
* Enable AMOs in Ruby Mem Test to verify that AMOs work
* Enable near and far AMO in the Cache Controller of CHI
Three configuration parameters have been used to tune this behavior:
* policy_type: sets the atomic policy to one of those described in [our
paper](https://dl.acm.org/doi/10.1145/3579371.3589065)
* atomic_op_latency: simulates the AMO ALU operation latency
* comp_anr: configures the Atomic No return transaction to split
CompDBIDResp into two different messages DBIDResp and Comp
Ruby was recently updated to support flushes and warmup for GPUs. Since
this support uses the GPUCoalescer, non-GPU builds face a compile-time
issue. This is because GPU code is not built for non-GPU builds. This
commit adds "#if BUILD_GPU" guards around the GPU-related code in
common files like AbstractController.hh, CacheRecorder.*, RubySystem.cc,
GPUCoalescer.hh, and VIPERCoalescer.hh. This support allows GPU builds
to use flushing while non-GPU builds compile without problems.
Change-Id: If8ee4ff881fe154553289e8c00881ee1b6e3f113
ROCm supports dynamically allocating scratch space, which resides in
framebuffer memory, to reduce the amount of memory allocated for kernels
that have not yet launched. The size of the scratch space allocated is
located in task->amdQueue.compute_tmpring_size_wavesize. This size is in
kilobytes. The AQL task contains the number of bytes requested *per work
item*; however, we currently check if there is enough tmpring space by
comparing against a single work item. This should instead check the size
*per wavefront*.
This causes problems in applications where multiple kernels use dynamic
scratch allocation and a later kernel requires more space than the
earlier kernel. The only application being tested that does this is
LULESH. The scratch space ended up too small, so workgroups clobbered
each other's private memory, leading to some nasty bugs. It is fixed by
this patch as task->amdQueue will be re-read from the host and will
contain the updated tmpring size. After this there is enough scratch
space and LULESH makes forward progress.
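The per-wavefront check amounts to the arithmetic sketched below. The function and parameter names are hypothetical, and the wavefront size of 64 is an assumption. Only the units (bytes per work item, tmpring size in kilobytes) come from the description above.

```python
def needs_more_scratch(bytes_per_work_item, tmpring_wavesize_kb,
                       wavefront_size=64):
    """Return True if a wavefront needs more scratch than is allocated."""
    required = bytes_per_work_item * wavefront_size  # bytes per wavefront
    available = tmpring_wavesize_kb * 1024           # kilobytes to bytes
    return required > available
```

For example, 32 bytes per work item with a 1 KB tmpring looks fine when a single work item is compared (32 < 1024), but a 64-wide wavefront needs 2048 bytes, so the per-wavefront check correctly reports that the space is too small.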