Currently, the GPU SQC (L1I$) and TCP (L1D$) have a performance bug where they do not behave correctly when multiple requests to the same cache line overlap one another. The intended behavior is that if the first request that arrives at the Ruby code for the SQC/TCP misses, it should send a request to the GPU TCC (L2$). If any requests to the same cache line occur while this first request is pending, they should wait locally at the L1 in the MSHRs (TBEs) until the first request has returned. At that point they can be serviced and, assuming the line has not been evicted, they should hit. For example, consider the following test (on 1 GPU thread, in 1 WG):

    load Arr[0]
    load Arr[1]
    load Arr[2]

The expected behavior (confirmed via profiling on real GPUs) is 1 miss (Arr[0]) and 2 hits (Arr[1], Arr[2]) for such a program. However, the current support in the VIPER SQC/TCP code does not model this correctly. Instead, it lets all 3 concurrent requests go straight through to the TCC rather than stopping the Arr[1] and Arr[2] requests locally while Arr[0] is serviced. This causes all 3 requests to be classified as misses.

To resolve this, this patch adds support in the SQC/TCP code to prevent subsequent, concurrent requests to a pending cache line from being sent in parallel with the original one. To do this, we add an additional transient state (IV) to indicate that a load is pending to this cache line. If a subsequent request of any kind to the same cache line occurs while this load is pending, the request is put on the local wait buffer and woken up when the first request returns to the SQC/TCP. Likewise, when the first load returns to the SQC/TCP, the line transitions from IV --> V.
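To make the intended hit/miss accounting concrete, here is a minimal Python sketch of the IV-state logic described above. The real implementation is in SLICC, not Python; the class and method names here are purely illustrative.

```python
# Toy model of the intended SQC/TCP L1 behavior (illustrative names only).
# Line states: absent = I (invalid), "IV" (load pending), "V" (valid).

class ToyL1:
    def __init__(self):
        self.state = {}        # line address -> "IV" or "V"
        self.wait_buffer = {}  # line address -> deferred requests (TBEs)
        self.hits = 0
        self.misses = 0

    def load(self, addr, line_size=64):
        line = addr // line_size
        st = self.state.get(line, "I")
        if st == "V":
            self.hits += 1                 # line present: hit
        elif st == "IV":
            # A load to this line is already pending at the L2: wait
            # locally in the wait buffer instead of issuing again.
            self.wait_buffer.setdefault(line, []).append(addr)
        else:
            self.misses += 1               # first request misses...
            self.state[line] = "IV"        # ...and marks the line pending

    def l2_response(self, addr, line_size=64):
        line = addr // line_size
        self.state[line] = "V"             # IV --> V on the returning load
        for _ in self.wait_buffer.pop(line, []):
            self.hits += 1                 # woken requests now hit

l1 = ToyL1()
for a in (0, 4, 8):        # load Arr[0], Arr[1], Arr[2] (same 64 B line)
    l1.load(a)
l1.l2_response(0)
print(l1.misses, l1.hits)  # -> 1 2
```

With the fix, only the first access issues to the TCC and the other two hit once it returns, matching the 1-miss/2-hit behavior observed on real GPUs.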
As part of this support, additional transitions were also added to account for corner cases:

- What happens when the line is evicted by another request that maps to the same set index while the first load is pending: the line is immediately given to the new request, and when the load returns it completes and wakes up any pending requests to the same line, but does not attempt to change the state of the line.
- How GPU bypassing loads and stores should interact with the pending requests: they are forced to wait if they reach the L1 after the pending, non-bypassing load; but if they reach the L1 before the non-bypassing load, they make sure not to change the state of the line from IV if they return before the non-bypassing load.

As part of this change, we also move the MSHR behavior for loads from the GPUCoalescer into the Ruby code (as is done for all other requests). This is important for getting correct hits and misses in stats and other prints, since the GPUCoalescer MSHR behavior assumes that all requests serviced out of its MSHR also miss if the original request to that line missed.

Although the SQC does not support stores, the TCP does, so we could have applied a similar change to GPU stores at the TCP. However, since the TCP support assumes write-through caches and does not attempt to allocate space in the TCP, we elected not to add this support because it runs contrary to the intended behavior: writes simply bypass the TCP and thus should not need to wait for another write to the same cache line to complete.

Additionally, making these changes introduced deadlocks at the TCC. Specifically, some Pannotia applications have accesses to the same cache line where some of the accesses are GLC (i.e., they bypass the GPU L1 cache) and others are non-GLC (i.e., they want to be cached in the GPU L1 cache). The support described above already handles this per CU.
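The eviction corner case above can be sketched as follows. This is a self-contained toy model with hypothetical names (the real transitions are in SLICC): a line in IV loses its slot to a conflicting request, and the in-flight load still completes and wakes its waiters without disturbing the new occupant's state.

```python
# Toy model of the eviction-while-pending corner case (illustrative only).

class ToyL1:
    def __init__(self, ways=1):
        self.ways = ways
        self.lines = {}        # line -> state ("IV" or "V")
        self.wait_buffer = {}  # line -> count of deferred requests

    def start_load(self, line):
        if line in self.lines:
            # Line already present/pending: wait locally.
            self.wait_buffer[line] = self.wait_buffer.get(line, 0) + 1
            return "wait"
        if len(self.lines) == self.ways:
            # Conflict miss: the pending (IV) victim loses its slot,
            # which is immediately given to the new request.
            victim = next(iter(self.lines))
            del self.lines[victim]
        self.lines[line] = "IV"
        return "issued"

    def load_returns(self, line):
        woken = self.wait_buffer.pop(line, 0)
        if self.lines.get(line) == "IV":
            self.lines[line] = "V"   # normal case: IV --> V
        # else: the line was evicted while the load was pending; the load
        # still completes and wakes waiters, but leaves the new occupant's
        # state untouched.
        return woken

l1 = ToyL1(ways=1)
l1.start_load(0)           # load to line 0: pending (IV)
l1.start_load(1)           # conflicting line evicts the IV entry
l1.load_returns(0)         # completes without touching line 1's state
print(l1.lines)            # -> {1: 'IV'}
```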
However, the problem here is that these requests come from different CUs and happen concurrently (seemingly because different WGs are at different points in the kernel around the same time). This causes a problem because our support at the TCC overwrites the TBE's information about the GPU bypassing bits (SLC, GLC) every time: when the second (non-GLC) load reaches the TCC, it overwrites the SLC/GLC information for the first (GLC) load. Thus, when the first load returns from the directory/memory, it no longer has the GLC bit set, which causes an assert failure at the TCP.

After talking with other developers, it was decided that the best way to handle this, and to model real hardware more closely, was to move the point at which requests are put to sleep on the wakeup buffer from the TCC to the directory. Accordingly, this patch includes support for that -- now when multiple loads (bypassing or non-bypassing) from different CUs reach the directory, all but the first are forced to wait there until the first completes, after which they are woken up and performed.

This required updating the WTRequestor information at the TCC to pass the information about which CU performed the original request for loads as well (otherwise, since the TBE can be updated by multiple pending loads, we cannot tell where to send the final result). Thus, I renamed the field to CURequestor instead of WTRequestor, since it is now used for more than stores. Moreover, I also updated the directory to take this new field and the GLC information from incoming TCC requests and pass that information back to the TCC on the response -- without doing this, because the TBE can be updated by multiple pending, concurrent requests, we cannot determine whether a given memory request was bypassing or non-bypassing.
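The directory-side serialization can be sketched as below: a minimal Python toy (hypothetical names; the real code is SLICC) in which concurrent loads to one line sleep on a wakeup buffer in arrival order, and each request's own (CURequestor, GLC) bits ride along with it instead of being overwritten in a shared TBE.

```python
# Toy model of directory-side serialization with per-request
# CURequestor/GLC bits (illustrative names only).
from collections import deque

class ToyDirectory:
    def __init__(self):
        self.pending = {}    # line -> deque of (cu, glc), head is in service
        self.responses = []  # (cu, glc) pairs sent back, in service order

    def request(self, line, cu, glc):
        if line in self.pending:
            # A request to this line is already in flight: sleep on the
            # wakeup buffer behind it, keeping our own CU and GLC bits.
            self.pending[line].append((cu, glc))
        else:
            self.pending[line] = deque([(cu, glc)])

    def memory_response(self, line):
        # The head request completes; echo its own CU and GLC bit back so
        # the TCC can route the response and classify it correctly.
        cu, glc = self.pending[line].popleft()
        self.responses.append((cu, glc))
        if not self.pending[line]:
            del self.pending[line]
        # Any remaining waiters are woken and serviced in arrival order.

d = ToyDirectory()
d.request(0x40, cu=0, glc=True)    # GLC load from CU0
d.request(0x40, cu=1, glc=False)   # non-GLC load from CU1: sleeps behind it
d.memory_response(0x40)            # CU0's load returns, GLC bit intact
d.memory_response(0x40)            # CU1's load returns, non-GLC
print(d.responses)                 # -> [(0, True), (1, False)]
```

Because each response carries the requester's own bits, the first (GLC) load can never come back labeled non-GLC, which is exactly the assert failure the patch avoids.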
Finally, these changes introduced a lot of additional contention and protocol stalls at the directory, so this patch converts all directory uses of z_stall to instead put requests on the wakeup buffer (waking them up when the current request completes). Without this, protocol stalls cause many applications to deadlock at the directory.

However, this exposed another issue at the TCC: other applications (e.g., HACC) have a mix of atomics and non-atomics to the same cache line in the same kernel, and the TCC transitions to the A state as soon as an atomic arrives. For example, after the first pending load returns to the TCC from the directory, the TCC state becomes V; if an atomic then arrives while other loads are still pending at the TCC, the state becomes A. This causes invalid-transition errors at the TCC when those pending loads return, because the A state treats them as atomics and decrements the pending atomic count (and the loads are never sent to the TCP as returning loads). This patch fixes this by changing the TCC TBEs to track the number of pending requests and not allowing atomics to be issued from the TCC until all prior, pending non-atomic requests have returned.

Change-Id: I37f8bda9f8277f2355bca5ef3610f6b63ce93563
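The TCC-side fix can be sketched as follows. This is a self-contained toy model with hypothetical names (the real logic is in the SLICC TCC machine): the TBE counts outstanding non-atomic requests per line, and an atomic is deferred until that count drains to zero, so a returning load is never misinterpreted as an atomic response.

```python
# Toy model of the TCC pending-request counting fix (illustrative only).
from collections import deque

class ToyTCC:
    def __init__(self):
        self.pending_loads = {}     # line -> outstanding non-atomic count
        self.deferred_atomics = {}  # line -> atomics waiting to issue
        self.issued = []            # ("load"/"atomic", line), in issue order

    def load(self, line):
        self.pending_loads[line] = self.pending_loads.get(line, 0) + 1
        self.issued.append(("load", line))

    def atomic(self, line):
        if self.pending_loads.get(line, 0) > 0:
            # Prior non-atomic requests are still outstanding: defer the
            # atomic instead of transitioning the line to A now.
            self.deferred_atomics.setdefault(line, deque()).append(line)
        else:
            self.issued.append(("atomic", line))

    def load_returns(self, line):
        self.pending_loads[line] -= 1
        if self.pending_loads[line] == 0:
            del self.pending_loads[line]
            for _ in self.deferred_atomics.pop(line, ()):
                self.issued.append(("atomic", line))  # now safe to issue

tcc = ToyTCC()
tcc.load(0)
tcc.load(0)            # two concurrent loads to one line
tcc.atomic(0)          # atomic must wait behind both
tcc.load_returns(0)
tcc.load_returns(0)    # count drains to zero -> atomic issues
print(tcc.issued[-1])  # -> ('atomic', 0)
```

Deferring the atomic this way means the line never sits in the A state while non-atomic responses are still in flight, eliminating the invalid transitions.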
The gem5 Simulator
This is the repository for the gem5 simulator. It contains the full source code for the simulator and all tests and regressions.
The gem5 simulator is a modular platform for computer-system architecture research, encompassing system-level architecture as well as processor microarchitecture. It is primarily used to evaluate new hardware designs, system software changes, and compile-time and run-time system optimizations.
The main website can be found at http://www.gem5.org.
Testing status
Note: These statuses reflect tests run on the develop branch of gem5: https://github.com/gem5/gem5/tree/develop.
Getting started
A good starting point is http://www.gem5.org/about, and for more information about building the simulator and getting started please see http://www.gem5.org/documentation and http://www.gem5.org/documentation/learning_gem5/introduction.
Building gem5
To build gem5, you will need the following software: g++ or clang, Python (gem5 links in the Python interpreter), SCons, zlib, m4, and lastly protobuf if you want trace capture and playback support. Please see http://www.gem5.org/documentation/general_docs/building for more details concerning the minimum versions of these tools.
Once you have all dependencies resolved, execute scons build/ALL/gem5.opt to build an optimized version of the gem5 binary (gem5.opt) containing all gem5 ISAs. If you only wish to compile gem5 to include a single ISA, you can replace ALL with the name of the ISA. Valid options include ARM, NULL, MIPS, POWER, RISCV, SPARC, and X86.
The complete list of options can be found in the build_opts directory.
See https://www.gem5.org/documentation/general_docs/building for more information on building gem5.
The Source Tree
The main source tree includes these subdirectories:
- build_opts: pre-made default configurations for gem5
- build_tools: tools used internally by gem5's build process.
- configs: example simulation configuration scripts
- ext: less-common external packages needed to build gem5
- include: include files for use in other programs
- site_scons: modular components of the build system
- src: source code of the gem5 simulator. The C++ source, Python wrappers, and Python standard library are found in this directory.
- system: source for some optional system software for simulated systems
- tests: regression tests
- util: useful utility programs and files
gem5 Resources
To run full-system simulations, you may need compiled system firmware, kernel binaries and one or more disk images, depending on gem5's configuration and what type of workload you're trying to run. Many of these resources can be obtained from https://resources.gem5.org.
More information on gem5 Resources can be found at https://www.gem5.org/documentation/general_docs/gem5_resources/.
Getting Help, Reporting bugs, and Requesting Features
We provide a variety of channels for users and developers to get help, report bugs, request features, or engage in community discussions. Below are a few of the most common we recommend using.
- GitHub Discussions: A GitHub Discussions page. This can be used to start discussions or ask questions. Available at https://github.com/orgs/gem5/discussions.
- GitHub Issues: A GitHub Issues page for reporting bugs or requesting features. Available at https://github.com/gem5/gem5/issues.
- Jira Issue Tracker: A Jira Issue Tracker for reporting bugs or requesting features. Available at https://gem5.atlassian.net/.
- Slack: A Slack server with a variety of channels for the gem5 community to engage in discussions. Please visit https://www.gem5.org/join-slack to join.
- gem5-users@gem5.org: A mailing list for users of gem5 to ask questions or start discussions. To join the mailing list please visit https://www.gem5.org/mailing_lists.
- gem5-dev@gem5.org: A mailing list for developers of gem5 to ask questions or start discussions. To join the mailing list please visit https://www.gem5.org/mailing_lists.
Contributing to gem5
We hope you enjoy using gem5. When appropriate, we encourage you to share your contributions with the project. https://www.gem5.org/contributing can help you get started. Additional information can be found in the CONTRIBUTING.md file.