Earlier, GPU checkpointing was working only if a checkpoint was created
before the first kernel execution. This pull request adds support to
checkpoint in-between any two kernel calls. It does so by doing the
following.
- Adds flush support in the GPU_VIPER protocol
- Adds flush support in the GPUCoalescer
- Updates cache recorder to use the GPUCoalescer during simulation
cooldown and cache warmup times.
The new implementation matches the table in the ARM Architecture
Reference Manual (version DDI 0487J.a, section D1.3.6, table R_SXLWJ)
It takes into consideration features like FEAT_SEL2 (scr.eel2 bit) and
FEAT_VHE (hcr.e2h bit) which affect the masking of interrupts under
certain circumstances
The new implementation matches the table in the ARM Architecture
Reference Manual (version DDI 0487J.a, section D1.3.6, table R_SXLWJ)
It takes into consideration features like FEAT_SEL2 (scr.eel2 bit) and
FEAT_VHE (hcr.e2h bit) which affect the masking of interrupts under
certain circumstances
Signed-off-by: Giacomo Travaglini <giacomo.travaglini@arm.com>
Change-Id: I07ebd8d859651475bd32fd201eea0f4e64a7dd5f
We pay a small duplication cost but we make the code
more readable and we enable further modifications to the
AArch64 code without forcing the same code on the AArch32
method
Signed-off-by: Giacomo Travaglini <giacomo.travaglini@arm.com>
Change-Id: I1efa33cf19f91094fd33bd48b6a0a57d8df8f89f
This `mkdir` is problematic as it doesn't create the directory
recursively. This casues errors if `dir` is `X/Y/Z` and both `Y` and `Z`
has not been created. An error will be returned (`No such file or
directory`).
This issue was fixed with: https://github.com/gem5/gem5/pull/263. The
checkpointing code already recursively creates directories as needed.
Ergo was can remove this `mkdir` statement.
This PR adds two commit to handle timestamps in the ROCm runtime. ROCr
uses a mix of GPU timestamp reads and HSA packet timestamps to output
profiling information for a task dispatch.
The first patch added timestamps to the HSA completion signal indicating
when the task started and ended and require changing the flow of
completion signal DMAs to ensure the DMA of the timestamp values
completed before writing the completion signal value.
Second commit adds MMIOs for reading the GPU's timestamp counter. This
MMIO resides in the GFX MMIO space so a new class is added to handle
MMIOs in that address range.
Current [TraceCPU
documentation](https://www.gem5.org/documentation/general_docs/cpu_models/TraceCPU)
still references the deprecated **se.py/fs.py** scripts for elastic
trace generation (script paths are also outdated).
With this PR we provide a simpler Arm based elastic trace generation
script that can
be used out of the box by a user or that can be extended as needed.
This `mkdir` is problematic as it doesn't create the directory
recursively. This casues errors if `dir` is `X/Y/Z` and both `Y` and `Z`
has not been created. An error will be returned (`No such file or
directory`).
This issue was fixed with: https://github.com/gem5/gem5/pull/263. The
checkpointing code already recursively creates directories as needed.
Ergo was can remove this `mkdir` statement.
Change-Id: Ibae38267c8ee1eba76d7834367aa1c54013365bc
This automatically formats .yaml files. By deault has the following
parameters:
* `mapping`: 4 spaces.
* `sequence`: 6 spaces.
* `offset`: 4 spaces.
* `colons`: do not align top-level colons.
* `width`: None.
Change-Id: Iee5194cd57b7b162fd7e33aa6852b64c5578a3d2
Exposed in our failing compiler tests:
https://github.com/gem5/gem5/actions/runs/6348223508, this PR:
* Adds missing overrides to `PCState`'s `set` function.
* Removes `std::binary_function` from DramPower (it was deprecated in
CPP-11 and officially removed in CPP-17).
Added a parameter (_disk_device) to kernel_disk_workload which allows
users to change the disk device location. get_disk_device() now chooses
between the parameter and, if no parameter was passed, it calls a new
function _get_default_disk_device() which is implemented by each board
and has a default disk device according to each board, eg /dev/hda in
the x86_board. The previous way of setting a disk device still exists as
a default, however, with the new function users can now override this
default
This comment was left in the codebase in error. The
`set_se_binary_workload` function works fine with multi-threaded
applications. This hasn't been a restriction for some time.
This is the first PR in a series of enhancements to the BPU proposed in
#358.
However, I think putting everything into one PR is not nice to review
and prone to oversee I might did.
This PR restructures the BTB:
- A new abstract BTB class is created to enable different BTB
implementations. The new BTB class gets its own parameter and stats.
- An enum is added to differentiate branch instruction types. This enum
is used to enhance statistics and BPU management.
- The existing BTB is moved into `simple_btb` as default.
- An additional function is added to store the static instruction in the
BTB. This function is used for the decoupled front-end.
- Update configs to match new BTB parameters.
The new script will automatically use the newly
defined O3_ARM_v7a_3_Etrace CPU to run a simple SE simulation while
generating elastic trace files.
The script is based on starter_se.py, but contains the following
limitations:
1) No L2 cache as it might affect computational delay calculations
2) Supporting SimpleMemory only with minimal memory latency
There restrictions were imported by the existing elastic trace
generation logic in the common library (collected by grepping
elastic_trace_en) [1][2][3]
Example usage:
build/ARM/gem5.opt configs/example/arm/etrace_se.py \
--inst-trace-file [INSTRUCTION TRACE] \
--data-trace-file [DATA TRACE] \
[WORKLOAD]
[1]: https://github.com/gem5/gem5/blob/stable/\
configs/common/MemConfig.py#L191
[2]: https://github.com/gem5/gem5/blob/stable/\
configs/common/MemConfig.py#L232
[3]: https://github.com/gem5/gem5/blob/stable/\
configs/common/CacheConfig.py#L130
Change-Id: I021fc84fa101113c5c2f0737d50a930bb4750f76
Signed-off-by: Giacomo Travaglini <giacomo.travaglini@arm.com>
Reviewed-by: Richard Cooper <richard.cooper@arm.com>
According to the original paper [1] the elastic trace generation process
requires a cpu with a big number of entries in the ROB, LQ and SQ, so
that there are no stalls due to resource limitation.
At the moment these numbers are copy pasted from the
CpuConfig.config_etrace method [2].
[1]: https://ieeexplore.ieee.org/document/7818336
[2]: https://github.com/gem5/gem5/blob/stable/\
configs/common/CpuConfig.py#L40
Change-Id: I00fde49e5420e420a4eddb7b49de4b74360348c9
Signed-off-by: Giacomo Travaglini <giacomo.travaglini@arm.com>
Reviewed-by: Richard Cooper <richard.cooper@arm.com>
We define a new parent (ClusterSystem) to model a system
with one or more cpu clusters within it.
The idea is to make this new base class reusable by SE
systems/scripts as well (like starter_se.py)
Change-Id: I1398d773813db565f6ad5ce62cb4c022cb12a55a
Signed-off-by: Giacomo Travaglini <giacomo.travaglini@arm.com>
Reviewed-by: Richard Cooper <richard.cooper@arm.com>
Modified the x86 KVM-in-SE syscall handler to flush the TLB following
each syscall, in case the page table has been modified. This is done by
reloading the value in %cr3. Doing this requires an intermediate GPR,
which we store in a new scratch buffer following the syscall code at
address `syscallDataBuf`.
GitHub issue: https://github.com/gem5/gem5/issues/409
- A new abstract BTB class is created to enable different BTB
implementations. The new BTB class gets its own parameter
and stats.
- An enum is added to differentiate branch instruction types.
This enum is used to enhance statistics and BPU management.
- The existing BTB is moved into `simple_btb` as default.
- An additional function is added to store the static instruction in
the BTB. This function is used for the decoupled front-end.
- Update configs to match new BTB parameters.
Change-Id: I99b29a19a1b57e59ea2b188ed7d62a8b79426529
Signed-off-by: David Schall <david.schall@ed.ac.uk>
This is still trying to completely remove any artifact
which implies virtualization is only supported in
non-secure mode (NS=1)
Change-Id: I83fed1c33cc745ecdf3c5ad60f4f356f3c58aad5
Signed-off-by: Giacomo Travaglini <giacomo.travaglini@arm.com>
This info can be used during TLB invalidation
Change-Id: I81247e40b11745f0207178b52c47845ca1b92870
Signed-off-by: Giacomo Travaglini <giacomo.travaglini@arm.com>
The syscall emulation of brk() incorrectly did not ensure that newly
allocated memory was zero-initialized, which Linux guarantees and which
seems to be the expectation of glibc's malloc() and free()
implementation. This patch fixes the incorrect behavior by zero-
initalizing all memory allocations via brk().
GitHub issue: https://github.com/gem5/gem5/issues/342
Change-Id: I53cf29d6f3f83285c8e813e18c06c2e9a69d7cc2
The PR is fixing the CHI fromSequencer helper function which is making
use of the undefined tbe entry.
This has been broken by #177
Change-Id: I52feff4b5ab2faf0aa91edd6572e3e767c88e257
1. All VMs are deployable from a single Vagrantfile (per host machine).
2. Runners within VMs are now ephemeral. They cease to exist after a job
is complete. After the VM cleans the workspace and creates a new runner.
This will reduce old data, scripts, and images causing space issues on
our VMs
3. No more 'vm_manager.sh' script. The standard `vagrant` command to
manage the VMs will work.
4. Adds Copyright notices where missing.
This patch removed the bespoke "vm_manager.sh" script in favor of a
Multi-Machine Vagrantfile.
With this the users needs to only change the variables in Vagrantfile
then use the standard `vagrant` commands to launch the VMs/Runners.
Change-Id: Ida5d2701319fd844c6a5b6fa7baf0c48b67db975
Prior to this change we were limited to a root partition with only 60GB
of space which caused issues when running larger simulations (see:
https://github.com/gem5/gem5/issues/165).
There are two factors in this issue which this patch resolves:
1. The root partition in the VM was capped at 60GB despite the virtual
machines size being capped at 128GB. This resulted in libvirt giving the
VM free space it couldn't use. To fix this `lvextend` was added to the
"provision_root.sh" script to resize the root partition to fill the
available space.
2. The virtual machine size can be set via the `machine_virtual_size`
parameter. The minimum and default value is 128GB. This wasn't exposed
previously. Now, if we required, we can increase the size of the VM/Root
partition if we require (though I believe 128GB is more than sufficient
for now).
Fixes: https://github.com/gem5/gem5/issues/165
Change-Id: I82dd500d8807ee0164f92d91515729d5fbd598e3
The "action-run.sh" action replaces inline scripting in the Vagrantfile.
The major improvement is this script runs an infinite loop and
configures the runners to be ephemeral. This means they cease to exist
after a job is complete. The script then cleans the VM workspace and the
loop restarts by configuring and setting up another runner. This means
our VMs no longer accumulate files that eventually lead to the VM
running out of space.
Change-Id: Iba6dc9a480f5805042602f120fc84bdc47a96d55
There are two places self-hosted runners can exist on GitHub:
1. At the level of the repository: In this case the runners can only be
used by that repository and runners can only be distinguished from one
another by labels.
2. At the level of the organization: In this case the runners can be
used by any repository in the organization, thus increasing their
versatility. In addition to labels, runners in the level of the
organization can be organized into groups.
While we do not use our self-hosted runners on other repositories, there
may be future use for this, so we might as well enable it now.
Change-Id: Id5e113194314336221dcdc8c2858b352afcbaf6e
Having two types of GitHub Action Runners has not yielded much benefit
and caused confusion and inefficiencies. This change simplifies things
to having just one runner with 8-cores and 16GB of memory. It is
sufficient to build gem5 and run most simulations.
Change-Id: Ic49ae5e98b02086f153f4ae2a4eedd8a535786c8
Modified the x86 KVM-in-SE syscall handler to flush the TLB following
each syscall, in case the page table has been modified. This is done
by reloading the value in %cr3. Doing this requires an intermediate
GPR, which we store in a new scratch buffer following the syscall code
at address `syscallDataBuf`.
GitHub issue: https://github.com/gem5/gem5/issues/409
Change-Id: Ibc20018c97ebb1794fa31a0c71e0857d661c7c9d
gem5::MemState::updateBrkRegion(), which is called during the syscall
emulation of brk, did not unmap deallocated heap pages when the brk
region is receding. Instead, it kept it mapped for simplicity. This
introduced a bug where subequent expansions of the brk region reused
prior heap page mappings that were not zero-filled. This violates
the assumptions of glibc malloc, resulting in heap corruption and
crashes.
This patch fixes the bug by always unmapping pages that are deallocated
during a call to brk() that reduces the heap size. This makes the
gem5::MemState::_endBrkPoint field obsolete, so this patch removes it.
GitHub issue: https://github.com/gem5/gem5/issues/342
Change-Id: Ib2244e1aa4d2a26666ad60d231fdde2c22d2df35
The ROCr runtime uses a combination of HSA signal timestamps and
hardware MMIOs to calculate profiling times. At the beginning of an
application a timestamp is read from the GPU using MMIOs. The clock
MMIOs reside in the GFX MMIO region, so a new AMDGPUGfx class is added
to handle these MMIOs.
The timestamp value is expected to be in nanoseconds, so we simply use
the gem5 tick converted to ns.
Change-Id: I7d1cba40d5042a7f7a81fd4d132402dc11b71bd4
The AMD specific HSA signal contains start/end timestamps for dispatch
packet completion signals. These are current always zero. These
timestamp values are used for profiling in the ROCr runtime.
Unfortunately, the GpuAgent::TranslateTime method in ROCr does not check
for zero values before dividing, causing applications that use profiling
to crash with SIGFPE. Profiling is used via hipEvents in the HACC
application, so these should be supported in gem5.
In order to handle writing the timestamp values, we need to DMA the
values to memory before writing the completion signal. This changes the
flow of the async completion signal write to be (1) read mailbox pointer
(2) if valid, write the mailbox data, other skip to 4 (3) write mailbox
data if pointer is valid (4) write timestamp values (5) write completion
signal. The application will process the timestamp data as soon as the
completion signal is received, so we need to ordering to ensure the DMA
for timestamps was completed.
HACC now runs to completion on GPUFS and has the same output was
hardware.
Change-Id: I09877cdff901d1402140f2c3bafea7605fa6554e
Added a new feature to CHI protocol (in collaboration with @tiagormk).
Here is the Jira Ticket
[https://gem5.atlassian.net/browse/GEM5-1326](https://gem5.atlassian.net/browse/GEM5-1326
). As described in CHI specs, far atomic transactions enable remote
execution of Atomic Memory Operations. This pull request incorporates
several changes:
* Fix Arm ISA definition of Swap instructions. These instructions should
return an operand, so their ISA definition should be Return Operation.
* Enable AMOs in Ruby Mem Test to verify that AMOs work
* Enable near and far AMO in the Cache Controler of CHI
Three configuration parameters have been used to tune this behavior:
* policy_type: sets the atomic policy to one of the described in [our
paper](https://dl.acm.org/doi/10.1145/3579371.3589065)
* atomic_op_latency: simulates the AMO ALU operation latency
* comp_anr: configures the Atomic No return transaction to split
CompDBIDResp into two different messages DBIDResp and Comp
Ruby was recently updated to support flushes and warmup for GPUs. Since
this support uses the GPUCoalescer, non-GPU builds face a compile time
issue. This is because GPU code is not built for non-GPU builds. This
commit addes "#if BUILD_GPU" guards around the GPU-related code in
common files like AbstractController.hh, CacheRecorder.*, RubySystem.cc,
GPUCoalescer.hh, and VIPERCoalescer.hh. This support allows GPU builds
to use flushing while non-GPU builds compile without problems
Change-Id: If8ee4ff881fe154553289e8c00881ee1b6e3f113
Previously, the L1, L2 number of banks and L2 latencies were not
configurable through command line arguments. This commit adds support to
configure them through the arguments '--tcp-num-banks' for number of
banks in L1, '--tcc-num-banks' for number of banks in L2, and
'--tcc-tag-access-latency', and '--tcc-data-access-latency'
Change-Id: Ie3b713ead16865fd7120e2d809ebfa56b69bc4a1