dev-amdgpu,mem-ruby: Add support to checkpoint and restore between kernels in GPUFS (#377)

Previously, GPU checkpointing worked only if the checkpoint was created
before the first kernel executed. This pull request adds support for
checkpointing between any two kernel calls. It does so with the
following changes:

- Adds flush support in the GPU_VIPER protocol
- Adds flush support in the GPUCoalescer
- Updates cache recorder to use the GPUCoalescer during simulation
  cooldown and cache warmup times.
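
To illustrate why flush support matters for checkpointing between kernels, here is a minimal toy model in Python. It is not gem5 code: `ToyWriteBackCache` and `checkpoint` are hypothetical names, and the sketch assumes a write-back cache whose dirty lines are not captured when only backing memory is snapshotted.

```python
class ToyWriteBackCache:
    """Toy write-back cache: stores stay in the cache until flushed."""

    def __init__(self, memory):
        self.memory = memory  # backing store: dict of address -> value
        self.dirty = {}       # dirty lines not yet written back

    def store(self, addr, value):
        # Write-back behavior: backing memory is not updated here.
        self.dirty[addr] = value

    def flush(self):
        # Write every dirty line back to memory, then drop it.
        self.memory.update(self.dirty)
        self.dirty.clear()


def checkpoint(memory):
    """Snapshot only the backing memory (stand-in for a checkpoint)."""
    return dict(memory)


memory = {0x100: 0}
cache = ToyWriteBackCache(memory)
cache.store(0x100, 42)       # a "kernel" writes a result

stale = checkpoint(memory)   # taken without flushing: misses the store
cache.flush()
fresh = checkpoint(memory)   # taken after flushing: sees the store

print(stale[0x100], fresh[0x100])
```

In this toy model the first snapshot still holds the old value, while the snapshot taken after `flush()` holds the kernel's result, which is the kind of gap a flush before checkpointing is meant to close.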
Matt Sinclair
2023-10-10 09:41:21 -05:00
committed by GitHub
14 changed files with 381 additions and 38 deletions

@@ -158,6 +158,16 @@ def addRunFSOptions(parser):
         help="Root partition of disk image",
     )
+    parser.add_argument(
+        "--disable-avx",
+        action="store_true",
+        default=False,
+        help="Disables AVX. AVX is used in some ROCm libraries but "
+        "does not have checkpointing support yet. If simulation either "
+        "creates a checkpoint or restores from one, then AVX needs to "
+        "be disabled for correct functionality ",
+    )
 def runGpuFSSystem(args):
     """

@@ -234,7 +234,7 @@ def makeGpuFSSystem(args):
     # If we are using KVM cpu, enable AVX. AVX is used in some ROCm libraries
     # such as rocBLAS which is used in higher level libraries like PyTorch.
     use_avx = False
-    if ObjectList.is_kvm_cpu(TestCPUClass):
+    if ObjectList.is_kvm_cpu(TestCPUClass) and not args.disable_avx:
         # AVX also requires CR4.osxsave to be 1. These must be set together
         # of KVM will error out.
         system.workload.enable_osxsave = 1
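
The changed condition can be read as a small truth table. The sketch below isolates it in a hypothetical helper, `should_enable_avx`, with a plain boolean standing in for `ObjectList.is_kvm_cpu(TestCPUClass)`. It is an illustration of the gating logic, not the code in the commit.

```python
from itertools import product


def should_enable_avx(is_kvm_cpu: bool, disable_avx: bool) -> bool:
    """AVX is enabled only for a KVM CPU when --disable-avx is not given."""
    return is_kvm_cpu and not disable_avx


# Enumerate all four combinations of the two inputs.
table = {
    (is_kvm, disable): should_enable_avx(is_kvm, disable)
    for is_kvm, disable in product([False, True], repeat=2)
}

for (is_kvm, disable), enabled in table.items():
    print(f"kvm={is_kvm!s:5} disable_avx={disable!s:5} -> avx={enabled}")
```

Only the KVM case without `--disable-avx` enables AVX, so a run that creates or restores a checkpoint with the flag set follows the non-AVX path even on a KVM CPU.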