Matt Sinclair c3326c78e6 mem-ruby, gpu-compute: fix SQC/TCP requests to same line
Currently, the GPU SQC (L1I$) and TCP (L1D$) have a performance bug
where they do not behave correctly when multiple requests to the same
cache line overlap one another.  The intended behavior is that if the
first request that arrives at the Ruby code for the SQC/TCP misses, it
should send a request to the GPU TCC (L2$).  If any requests to the
same cache line occur while this first request is pending, they should
wait locally at the L1 in the MSHRs (TBEs) until the first request has
returned.  At that point they can be serviced, and assuming the line
has not been evicted, they should hit.

For example, in the following test (on 1 GPU thread, in 1 WG):

load Arr[0]
load Arr[1]
load Arr[2]

The expected behavior (confirmed via profiling on real GPUs) is that
we should get 1 miss (Arr[0]) and 2 hits (Arr[1], Arr[2]) for such a
program.

However, the current support in the VIPER SQC/TCP code does not model
this correctly.  Instead it lets all 3 concurrent requests go straight
through to the TCC instead of stopping the Arr[1] and Arr[2] requests
locally while Arr[0] is serviced.  This causes all 3 requests to be
classified as misses.

To resolve this, this patch adds support into the SQC/TCP code to
prevent subsequent, concurrent requests to a pending cache line from being
sent in parallel with the original one.  To do this, we add an
additional transient state (IV) to indicate that a load is pending to
this cache line.  If a subsequent request of any kind to the same cache
line occurs while this load is pending, the requests are put on the
local wait buffer and woken up when the first request returns to the
SQC/TCP.  Likewise, when the first load is returned to the SQC/TCP, it
transitions from IV --> V.
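
The intended L1 behavior can be illustrated with a small Python toy model. This is only a sketch and not the SLICC code touched by this patch: the `ToyL1` class, the `merge_pending` flag, the 64-byte line size, and the 4-byte array elements are all assumptions made for illustration. It shows how parking later requests behind a line in the IV state turns the three-load test into 1 miss and 2 hits, while forwarding every request produces 3 misses.

```python
from collections import defaultdict

LINE_BYTES = 64  # assumed cache-line size for this sketch


class ToyL1:
    """Hypothetical L1 with I, IV and V line states and a wait buffer."""

    def __init__(self, merge_pending=True):
        self.merge_pending = merge_pending
        self.state = defaultdict(lambda: "I")  # line -> "I" | "IV" | "V"
        self.waiting = defaultdict(list)       # line -> parked request ids
        self.hits = 0
        self.misses = 0
        self.sent_to_l2 = []                   # lines requested from the L2

    def load(self, req_id, addr):
        line = addr // LINE_BYTES
        st = self.state[line]
        if st == "V":
            self.hits += 1
        elif st == "IV" and self.merge_pending:
            # A load to this line is already outstanding: wait locally.
            self.waiting[line].append(req_id)
        else:
            # First miss (or the old behavior): forward to the L2.
            self.misses += 1
            self.sent_to_l2.append(line)
            self.state[line] = "IV"

    def l2_response(self, line):
        # Data returned: IV --> V, then replay parked requests as hits.
        self.state[line] = "V"
        for _ in self.waiting.pop(line, []):
            self.hits += 1


def run(merge_pending):
    l1 = ToyL1(merge_pending)
    for i in range(3):          # load Arr[0], Arr[1], Arr[2]
        l1.load(i, 4 * i)       # assumed 4-byte elements, same cache line
    for line in l1.sent_to_l2:
        l1.l2_response(line)
    return l1.misses, l1.hits
```

In this sketch, `run(True)` models the patched behavior and `run(False)` models the previous behavior where every concurrent request is sent to the L2.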

As part of this support, additional transitions were also added to
account for corner cases such as what happens when the line is evicted
by another request that maps to the same set index while the first load
is pending (the line is immediately given to the new request, and when
the load returns it completes, wakes up any pending requests to the same
line, but does not attempt to change the state of the line) and how GPU
bypassing loads and stores should interact with the pending requests
(they are forced to wait if they reach the L1 after the pending,
non-bypassing load; but if they reach the L1 before the non-bypassing
load then they make sure not to change the state of the line from IV if
they return before the non-bypassing load).

As part of this change, we also move the MSHR behavior for loads out of
the GPUCoalescer and into the Ruby code (as is already done for all other
requests).  This is important to get correct hits and misses in stats
and other prints, since the GPUCoalescer MSHR behavior assumes all
requests serviced out of its MSHR also miss if the original request to
that line missed.

Although the SQC does not support stores, the TCP does.  Thus,
we could have applied a similar change to the GPU stores at the TCP.
However, since the TCP support assumes write-through caches and does not
attempt to allocate space in the TCP, we elected not to add this support
since it seems to run contrary to the intended behavior (i.e., the
intended behavior seems to be that writes just bypass the TCP and thus
should not need to wait for another write to the same cache line to
complete).

Additionally, making these changes introduced issues with deadlocks at
the TCC.  Specifically, some Pannotia applications have accesses to the
same cache line where some of the accesses are GLC (i.e., they bypass
the GPU L1 cache) and others are non-GLC (i.e., they want to be cached
in the GPU L1 cache). We have support already per CU in the above code.
However, the problem here is that these requests are coming from
different CUs and happening concurrently (seemingly because different
WGs are at different points in the kernel around the same time).
This causes a problem because our support at the TCC for the TBEs
overwrites the information about the GPU bypassing bits (SLC, GLC) every
time. When the second (non-GLC) load reaches the TCC, it
overwrites the SLC/GLC information for the first (GLC) load. Thus, when
the first load returns from the directory/memory, it no longer has
the GLC bit set, which causes an assert failure at the TCP.
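
The overwrite described above can be reduced to a few lines of Python. This is a hypothetical illustration only: `SharedTbe` and `glc_seen_by_first_response` are invented names, and the real TBE holds far more state than a single bit.

```python
class SharedTbe:
    """Hypothetical per-line TBE that keeps only the latest GLC bit."""

    def __init__(self):
        self.glc = None

    def record_request(self, glc):
        # Every incoming request overwrites the stored bypass bit.
        self.glc = glc


def glc_seen_by_first_response():
    tbe = SharedTbe()
    tbe.record_request(glc=True)    # first load: GLC (bypassing), from one CU
    tbe.record_request(glc=False)   # second load: non-GLC, from another CU
    # The response to the first load reads the bit back from the TBE.
    return tbe.glc
```

The first load was issued with GLC set, but its response reads back the bit of the second load, which is the mismatch that trips the assert at the TCP.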

After talking with other developers, it was decided the best way to handle
this and attempt to model real hardware more closely was to move the
point at which requests are put to sleep on the wakeup buffer from the
TCC to the directory. Accordingly, this patch includes support for that
-- now when multiple loads (bypassing or non-bypassing) from different
CUs reach the directory, all but the first one will be forced to wait
there until the first one completes, then will be woken up and
performed.  This required updating the WTRequestor information at the
TCC to pass the information about what CU performed the original request
for loads as well (otherwise since the TBE can be updated by multiple
pending loads, we can't tell where to send the final result to).  Thus,
I changed the field to be named CURequestor instead of WTRequestor since
it is now used for more than stores.  Moreover, I also updated the
directory to take this new field and the GLC information from incoming
TCC requests and then pass that information back to the TCC on the
response -- without doing this, because the TBE can be updated by
multiple pending, concurrent requests we cannot determine if this memory
request was a bypassing or non-bypassing request.  Finally, these
changes introduced a lot of additional contention and protocol stalls at
the directory, so this patch converted all directory uses of z_stall to
instead put requests on the wakeup buffer (and wake them up when the
current request completes). Without this, protocol stalls cause
many applications to deadlock at the directory.
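
A rough Python sketch of the directory-side behavior follows. It is a toy model under stated assumptions and not the Ruby protocol code: `ToyDirectory`, `Request`, `Response`, and the lowercase `cu_requestor` and `glc` fields are hypothetical stand-ins for the CURequestor field and GLC bit described above. The point it illustrates is that later requests to a busy line wait on a wakeup buffer, and that each response echoes the requestor and bypass bit carried by its own request rather than reading them from shared per-line state.

```python
from collections import deque, namedtuple

# Hypothetical message formats for this sketch.
Request = namedtuple("Request", "addr cu_requestor glc")
Response = namedtuple("Response", "addr cu_requestor glc")


class ToyDirectory:
    def __init__(self):
        self.in_flight = {}   # addr -> request currently being serviced
        self.wakeup = {}      # addr -> deque of waiting requests
        self.responses = []   # responses sent back toward the TCC

    def receive(self, req):
        if req.addr in self.in_flight:
            # Line is busy: wait on the wakeup buffer instead of stalling
            # the whole input queue.
            self.wakeup.setdefault(req.addr, deque()).append(req)
        else:
            self.in_flight[req.addr] = req

    def memory_done(self, addr):
        req = self.in_flight.pop(addr)
        # Echo the requestor and GLC bit carried by this request.
        self.responses.append(Response(req.addr, req.cu_requestor, req.glc))
        # Wake every waiter; the first is serviced, the rest wait again.
        for waiter in self.wakeup.pop(addr, ()):
            self.receive(waiter)
```

With a GLC load from one CU followed by a non-GLC load from another CU to the same address, the two responses in this sketch come back in order, each with its own requestor and GLC value.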

However, this exposed another issue at the TCC: other applications
(e.g., HACC) have a mix of atomics and non-atomics to the same cache
line in the same kernel.  The TCC transitions to the A state when
an atomic arrives, and this can happen while non-atomic requests are
still outstanding: for example, after the first pending load returns to
the TCC from the directory (which causes the TCC state to become V) but
while other loads are still pending at the TCC. This causes invalid
transition errors at the TCC when those pending loads return, because
the A state thinks they are atomics and decrements the pending atomic
count (plus the loads are never sent to the TCP as returning loads).
This patch fixes this by changing the TCC TBEs to model the number of
pending requests, and not allowing atomics to be issued from the TCC
until all prior, pending non-atomic requests have returned.
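
The gating rule can be sketched in Python as follows. This is an assumption-laden toy, not the TCC implementation: `ToyTccTbe` and its method names are hypothetical, and it tracks only a count of pending non-atomic requests plus a list of deferred atomics.

```python
class ToyTccTbe:
    """Hypothetical TBE that counts pending non-atomic requests."""

    def __init__(self):
        self.pending_non_atomic = 0
        self.deferred_atomics = []
        self.issued_atomics = []

    def issue_load(self):
        self.pending_non_atomic += 1

    def request_atomic(self, atomic_id):
        if self.pending_non_atomic > 0:
            # Prior non-atomic requests are still outstanding: hold back.
            self.deferred_atomics.append(atomic_id)
        else:
            self.issued_atomics.append(atomic_id)

    def load_returned(self):
        assert self.pending_non_atomic > 0, "no load is pending"
        self.pending_non_atomic -= 1
        if self.pending_non_atomic == 0:
            # All prior loads have returned: release deferred atomics.
            self.issued_atomics.extend(self.deferred_atomics)
            self.deferred_atomics.clear()
```

In this sketch an atomic that arrives while two loads are pending is issued only after both loads have returned, so a returning load never finds the line in the atomic state.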

Change-Id: I37f8bda9f8277f2355bca5ef3610f6b63ce93563
2023-11-15 19:23:51 -06:00

The gem5 Simulator

This is the repository for the gem5 simulator. It contains the full source code for the simulator and all tests and regressions.

The gem5 simulator is a modular platform for computer-system architecture research, encompassing system-level architecture as well as processor microarchitecture. It is primarily used to evaluate new hardware designs, system software changes, and compile-time and run-time system optimizations.

The main website can be found at http://www.gem5.org.

Testing status

Note: These regard tests run on the develop branch of gem5: https://github.com/gem5/gem5/tree/develop.

Daily Tests Weekly Tests Compiler Tests

Getting started

A good starting point is http://www.gem5.org/about, and for more information about building the simulator and getting started please see http://www.gem5.org/documentation and http://www.gem5.org/documentation/learning_gem5/introduction.

Building gem5

To build gem5, you will need the following software: g++ or clang, Python (gem5 links in the Python interpreter), SCons, zlib, m4, and lastly protobuf if you want trace capture and playback support. Please see http://www.gem5.org/documentation/general_docs/building for more details concerning the minimum versions of these tools.

Once you have all dependencies resolved, execute scons build/ALL/gem5.opt to build an optimized version of the gem5 binary (gem5.opt) containing all gem5 ISAs. If you only wish to compile gem5 to include a single ISA, you can replace ALL with the name of the ISA. Valid options include ARM, NULL, MIPS, POWER, RISCV, SPARC, and X86. The complete list of options can be found in the build_opts directory.

See https://www.gem5.org/documentation/general_docs/building for more information on building gem5.

The Source Tree

The main source tree includes these subdirectories:

  • build_opts: pre-made default configurations for gem5
  • build_tools: tools used internally by gem5's build process.
  • configs: example simulation configuration scripts
  • ext: less-common external packages needed to build gem5
  • include: include files for use in other programs
  • site_scons: modular components of the build system
  • src: source code of the gem5 simulator. The C++ source, Python wrappers, and Python standard library are found in this directory.
  • system: source for some optional system software for simulated systems
  • tests: regression tests
  • util: useful utility programs and files

gem5 Resources

To run full-system simulations, you may need compiled system firmware, kernel binaries and one or more disk images, depending on gem5's configuration and what type of workload you're trying to run. Many of these resources can be obtained from https://resources.gem5.org.

More information on gem5 Resources can be found at https://www.gem5.org/documentation/general_docs/gem5_resources/.

Getting Help, Reporting bugs, and Requesting Features

We provide a variety of channels for users and developers to get help, report bugs, request features, or engage in community discussions. Below are a few of the most common we recommend using.

Contributing to gem5

We hope you enjoy using gem5. When appropriate, we advise sharing your contributions to the project. https://www.gem5.org/contributing can help you get started. Additional information can be found in the CONTRIBUTING.md file.
