derek/gem5

Author	SHA1	Message	Date
Nitesh Narayana	d962d2588d	arch-arm: This commit cleans .isa files This commit cleans extra new lines from .isa files from this branch Change-Id: I4087ed230aa041747038b49360c2aba3f82c0790	2023-12-06 16:03:21 +01:00
Nitesh Narayana	db8e1652e8	arch-arm: This commit uses existing template code for mla/s index This includes mla/s index version implementation using the existing template code to avoid code repetition. Change-Id: If1de84e01dec638e206c979ca832308ebc904212	2023-12-05 23:40:06 +01:00
Nitesh Narayana	35ccd7f907	arch-arm: This commit adds the mla/s indexed versions This includes the isa and instruction implementations of mla and mls indexed versions from ARM SVE2 ISA spec. Change-Id: I4fbd0382f23d8611e46411f74dc991f5a211a313	2023-11-24 15:20:30 +01:00
Jason Lowe-Power	db6a869786	mem-cache: Prefetchers Improvements (#564 ) This pull request contains a set of small patches which fix some bugs in the gem5 prefetchers, and align out-of-the-box prefetcher performance more closely with that which a typical user would expect. The performance patches have been tested with an out-of-the-box (untuned) Stride prefetcher configuration against a set of SPEC 2017 SimPoints, and show a modest IPC uplift across the board, with no IPC degradation. The new defaults were identified as part of work on gem5 prefetchers undertaken by Nikolaos Kyparissas while on internship at Arm.	2023-11-16 15:22:26 -08:00
Giacomo Travaglini	4ca2efac16	mem-ruby: AtomicNoReturn should check comp_anr instead of comp_wu (#545 ) The comp_anr parameter is currently unused. Both parameters (comp_wu and comp_anr) are set to false by default Change-Id: If09567504540dbee082191d46fcd53f1363d819f Signed-off-by: Giacomo Travaglini <giacomo.travaglini@arm.com>	2023-11-16 15:20:51 -08:00
Matthew Poremba	4965367724	mem-ruby, gpu-compute: fix SQC/TCP requests to same line (#540 ) Currently, the GPU SQC (L1I$) and TCP (L1D$) have a performance bug where they do not behave correctly when multiple requests to the same cache line overlap one another. The intended behavior is that if the first request that arrives at the Ruby code for the SQC/TCP misses, it should send a request to the GPU TCC (L2$). If any requests to the same cache line occur while this first request is pending, they should wait locally at the L1 in the MSHRs (TBEs) until the first request has returned. At that point they can be serviced, and assuming the line has not been evicted, they should hit. For example, in the following test (on 1 GPU thread, in 1 WG): load Arr[0] load Arr[1] load Arr[2] The expected behavior (confirmed via profiling on real GPUs) is that we should get 1 miss (Arr[0]) and 2 hits (Arr[1], Arr[2]) for such a program. However, the current support in the VIPER SQC/TCP code does not model this correctly. Instead it lets all 3 concurrent requests go straight through to the TCC instead of stopping the Arr[1] and Arr[2] requests locally while Arr[0] is serviced. This causes all 3 requests to be classified as misses. To resolve this, this patch adds support into the SQC/TCP code to prevent subsequent, concurrent requests to a pending cache line from being sent in parallel with the original one. To do this, we add an additional transient state (IV) to indicate that a load is pending to this cache line. If a subsequent request of any kind to the same cache line occurs while this load is pending, the requests are put on the local wait buffer and woken up when the first request returns to the SQC/TCP. Likewise, when the first load is returned to the SQC/TCP, it transitions from IV --> V. 
As part of this support, additional transitions were also added to account for corner cases such as what happens when the line is evicted by another request that maps to the same set index while the first load is pending (the line is immediately given to the new request, and when the load returns it completes, wakes up any pending requests to the same line, but does not attempt to change the state of the line) and how GPU bypassing loads and stores should interact with the pending requests (they are forced to wait if they reach the L1 after the pending, non-bypassing load; but if they reach the L1 before the non-bypassing load then they make sure not to change the state of the line from IV if they return before the non-bypassing load). As part of this change, we also move the MSHR behavior from internally in the GPUCoalescer for loads to the Ruby code (like all other requests). This is important to get correct hits and misses in stats and other prints, since the GPUCoalescer MSHR behavior assumes all requests serviced out of its MSHR also miss if the original request to that line missed. Although the SQC does not support stores, the TCP does. Thus, we could have applied a similar change to the GPU stores at the TCP. However, since the TCP support assumes write-through caches and does not attempt to allocate space in the TCP, we elected not to add this support since it seems to run contrary to the intended behavior (i.e., the intended behavior seems to be that writes just bypass the TCP and thus should not need to wait for another write to the same cache line to complete). Additionally, making these changes introduced issues with deadlocks at the TCC. Specifically, some Pannotia applications have accesses to the same cache line where some of the accesses are GLC (i.e., they bypass the GPU L1 cache) and others are non-GLC (i.e., they want to be cached in the GPU L1 cache). We have support already per CU in the above code. 
However, the problem here is that these requests are coming from different CUs and happening concurrently (seemingly because different WGs are at different points in the kernel around the same time). This causes a problem because our support at the TCC for the TBEs overwrites the information about the GPU bypassing bits (SLC, GLC) every time. The problem is when the second (non-GLC) load reaches the TCC, it overwrites the SLC/GLC information for the first (GLC) load. Thus, when the first load returns from the directory/memory, it no longer has the GLC bit set, which causes an assert failure at the TCP. After talking with other developers, it was decided the best way to handle this and attempt to model real hardware more closely was to move the point at which requests are put to sleep on the wakeup buffer from the TCC to the directory. Accordingly, this patch includes support for that -- now when multiple loads (bypassing or non-bypassing) from different CUs reach the directory, all but the first one will be forced to wait there until the first one completes, then will be woken up and performed. This required updating the WTRequestor information at the TCC to pass the information about what CU performed the original request for loads as well (otherwise since the TBE can be updated by multiple pending loads, we can't tell where to send the final result to). Thus, I changed the field to be named CURequestor instead of WTRequestor since it is now used for more than stores. Moreover, I also updated the directory to take this new field and the GLC information from incoming TCC requests and then pass that information back to the TCC on the response -- without doing this, because the TBE can be updated by multiple pending, concurrent requests we cannot determine if this memory request was a bypassing or non-bypassing request. 
Finally, these changes introduced a lot of additional contention and protocol stalls at the directory, so this patch converted all directory uses of z_stall to instead put requests on the wakeup buffer (and wake them up when the current request completes). Without this, protocol stalls cause many applications to deadlock at the directory. However, this exposed another issue at the TCC: other applications (e.g., HACC) have a mix of atomics and non-atomics to the same cache line in the same kernel. The TCC transitions to the A state when an atomic arrives, which can happen, for example, after the first pending load returns to the TCC from the directory (causing the TCC state to become V) while there are still other pending loads at the TCC. This causes invalid transition errors at the TCC when those pending loads return, because the A state thinks they are atomics and decrements the pending atomic count (plus the loads are never sent to the TCP as returning loads). This patch fixes this by changing the TCC TBEs to model the number of pending requests, and not allowing atomics to be issued from the TCC until all prior, pending non-atomic requests have returned. Change-Id: I37f8bda9f8277f2355bca5ef3610f6b63ce93563	2023-11-16 14:24:00 -08:00
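The Arr[0], Arr[1], Arr[2] example from the commit above can be sketched as a toy Python model. This is not the SLICC code: the class `ToyL1`, its method names, the 64-byte line size, and the 4-byte element size are assumptions made for illustration. Only the states I, IV and V and the idea of parking requests on a local wait buffer come from the commit message, and evictions and bypassing requests are not modelled.

```python
LINE_BYTES = 64  # assumed cache-line size, for illustration only


class ToyL1:
    """Toy model of one L1 with an I -> IV -> V load path (hypothetical)."""

    def __init__(self):
        self.state = {}       # line address -> "IV" or "V" (absent means I)
        self.waiting = {}     # line address -> requests parked while in IV
        self.hits = 0
        self.misses = 0
        self.sent_to_l2 = []  # line addresses requested from the L2

    def load(self, addr):
        line = addr - (addr % LINE_BYTES)
        st = self.state.get(line, "I")
        if st == "V":
            self.hits += 1
        elif st == "IV":
            # A load to this line is already pending: wait locally.
            self.waiting.setdefault(line, []).append(addr)
        else:
            # First miss: move to IV and send a single request to the L2.
            self.misses += 1
            self.state[line] = "IV"
            self.sent_to_l2.append(line)

    def l2_response(self, line):
        # Data returned: IV -> V, then replay the parked requests.
        self.state[line] = "V"
        for addr in self.waiting.pop(line, []):
            self.load(addr)


l1 = ToyL1()
for i in range(3):  # load Arr[0], Arr[1], Arr[2]
    l1.load(0x1000 + 4 * i)
l1.l2_response(0x1000)
print(l1.misses, l1.hits, len(l1.sent_to_l2))
```

Under these assumptions the three loads produce one miss, two hits, and a single request to the L2, which is the behaviour the commit describes as expected.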
Bobby R. Bruce	bfe899e48e	stdlib, resources: Update JSON data in workload (#532 ) - resources field in workload now supports a dict with resources id and version. - Older workload JSON are still supported but added a deprecation warning	2023-11-16 10:11:13 -08:00
Giacomo Travaglini	047a494c2b	mem-cache: Optimize strided prefetcher address generation This commit optimizes the address generation logic in the strided prefetcher by introducing the following changes (d is the degree of the prefetcher) * Evaluate the fixed prefetch_stride only once (and not d-times) * Replace 2d multiplications (d * prefetch_stride and distance * prefetch_stride) with additions by updating the new base prefetch address while looping Change-Id: I49c52333fc4c7071ac3d73443f2ae07bfcd5b8e4 Signed-off-by: Giacomo Travaglini <giacomo.travaglini@arm.com> Reviewed-by: Richard Cooper <richard.cooper@arm.com> Reviewed-by: Tiberiu Bucur <tiberiu.bucur@arm.com>	2023-11-16 09:48:15 +00:00
Nikolaos Kyparissas	2abd65c270	mem: added distance parameter to stride prefetcher The Stride Prefetcher will skip this number of strides ahead of the first identified prefetch, then generate `degree` prefetches at `stride` intervals. A value of zero indicates no skip (i.e. start prefetching from the next identified prefetch address). This parameter can be used to increase the timeliness of prefetches by starting to prefetch far enough ahead of the demand stream to cover the memory system latency. [Richard Cooper <richard.cooper@arm.com>: - Added detail to commit comment and `distance` Param documentation. - Changed `distance` Param from `Param.Int` to `Param.Unsigned`. ] Change-Id: I6c4e744079b53a7b804d8eab93b0f07b566f0c08 Reviewed-by: Giacomo Travaglini <giacomo.travaglini@arm.com> Signed-off-by: Richard Cooper <richard.cooper@arm.com>	2023-11-16 09:48:09 +00:00
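A minimal Python sketch of the address generation described in the two stride prefetcher commits above, assuming a hypothetical function `stride_prefetch_addrs` (not the gem5 C++ code). It skips `distance` strides once, then produces `degree` addresses by repeated addition instead of multiplying inside the loop.

```python
def stride_prefetch_addrs(addr, stride, degree, distance=0):
    """Return `degree` prefetch addresses, skipping `distance` strides first.

    Hypothetical helper for illustration; names and signature are assumed.
    """
    # Skip ahead once; this is the only multiplication.
    next_addr = addr + distance * stride
    addrs = []
    for _ in range(degree):
        # Advance the running base address by one stride per prefetch.
        next_addr += stride
        addrs.append(next_addr)
    return addrs


near = stride_prefetch_addrs(0x1000, 64, degree=3)
far = stride_prefetch_addrs(0x1000, 64, degree=3, distance=2)
print([hex(a) for a in near])
print([hex(a) for a in far])
```

With `distance=0` the first prefetch is the next address in the stream; a larger `distance` moves the whole window further ahead of the demand stream.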
Yu-Cheng Chang	ceabe86b31	arch-riscv: Add overrides to RISC-V Interrupts class (#568 )	2023-11-15 18:36:15 -08:00
Matt Sinclair	c3326c78e6	mem-ruby, gpu-compute: fix SQC/TCP requests to same line Currently, the GPU SQC (L1I$) and TCP (L1D$) have a performance bug where they do not behave correctly when multiple requests to the same cache line overlap one another. The intended behavior is that if the first request that arrives at the Ruby code for the SQC/TCP misses, it should send a request to the GPU TCC (L2$). If any requests to the same cache line occur while this first request is pending, they should wait locally at the L1 in the MSHRs (TBEs) until the first request has returned. At that point they can be serviced, and assuming the line has not been evicted, they should hit. For example, in the following test (on 1 GPU thread, in 1 WG): load Arr[0] load Arr[1] load Arr[2] The expected behavior (confirmed via profiling on real GPUs) is that we should get 1 miss (Arr[0]) and 2 hits (Arr[1], Arr[2]) for such a program. However, the current support in the VIPER SQC/TCP code does not model this correctly. Instead it lets all 3 concurrent requests go straight through to the TCC instead of stopping the Arr[1] and Arr[2] requests locally while Arr[0] is serviced. This causes all 3 requests to be classified as misses. To resolve this, this patch adds support into the SQC/TCP code to prevent subsequent, concurrent requests to a pending cache line from being sent in parallel with the original one. To do this, we add an additional transient state (IV) to indicate that a load is pending to this cache line. If a subsequent request of any kind to the same cache line occurs while this load is pending, the requests are put on the local wait buffer and woken up when the first request returns to the SQC/TCP. Likewise, when the first load is returned to the SQC/TCP, it transitions from IV --> V. 
As part of this support, additional transitions were also added to account for corner cases such as what happens when the line is evicted by another request that maps to the same set index while the first load is pending (the line is immediately given to the new request, and when the load returns it completes, wakes up any pending requests to the same line, but does not attempt to change the state of the line) and how GPU bypassing loads and stores should interact with the pending requests (they are forced to wait if they reach the L1 after the pending, non-bypassing load; but if they reach the L1 before the non-bypassing load then they make sure not to change the state of the line from IV if they return before the non-bypassing load). As part of this change, we also move the MSHR behavior from internally in the GPUCoalescer for loads to the Ruby code (like all other requests). This is important to get correct hits and misses in stats and other prints, since the GPUCoalescer MSHR behavior assumes all requests serviced out of its MSHR also miss if the original request to that line missed. Although the SQC does not support stores, the TCP does. Thus, we could have applied a similar change to the GPU stores at the TCP. However, since the TCP support assumes write-through caches and does not attempt to allocate space in the TCP, we elected not to add this support since it seems to run contrary to the intended behavior (i.e., the intended behavior seems to be that writes just bypass the TCP and thus should not need to wait for another write to the same cache line to complete). Additionally, making these changes introduced issues with deadlocks at the TCC. Specifically, some Pannotia applications have accesses to the same cache line where some of the accesses are GLC (i.e., they bypass the GPU L1 cache) and others are non-GLC (i.e., they want to be cached in the GPU L1 cache). We have support already per CU in the above code. 
However, the problem here is that these requests are coming from different CUs and happening concurrently (seemingly because different WGs are at different points in the kernel around the same time). This causes a problem because our support at the TCC for the TBEs overwrites the information about the GPU bypassing bits (SLC, GLC) every time. The problem is when the second (non-GLC) load reaches the TCC, it overwrites the SLC/GLC information for the first (GLC) load. Thus, when the first load returns from the directory/memory, it no longer has the GLC bit set, which causes an assert failure at the TCP. After talking with other developers, it was decided the best way to handle this and attempt to model real hardware more closely was to move the point at which requests are put to sleep on the wakeup buffer from the TCC to the directory. Accordingly, this patch includes support for that -- now when multiple loads (bypassing or non-bypassing) from different CUs reach the directory, all but the first one will be forced to wait there until the first one completes, then will be woken up and performed. This required updating the WTRequestor information at the TCC to pass the information about what CU performed the original request for loads as well (otherwise since the TBE can be updated by multiple pending loads, we can't tell where to send the final result to). Thus, I changed the field to be named CURequestor instead of WTRequestor since it is now used for more than stores. Moreover, I also updated the directory to take this new field and the GLC information from incoming TCC requests and then pass that information back to the TCC on the response -- without doing this, because the TBE can be updated by multiple pending, concurrent requests we cannot determine if this memory request was a bypassing or non-bypassing request. 
Finally, these changes introduced a lot of additional contention and protocol stalls at the directory, so this patch converted all directory uses of z_stall to instead put requests on the wakeup buffer (and wake them up when the current request completes). Without this, protocol stalls cause many applications to deadlock at the directory. However, this exposed another issue at the TCC: other applications (e.g., HACC) have a mix of atomics and non-atomics to the same cache line in the same kernel. The TCC transitions to the A state when an atomic arrives, which can happen, for example, after the first pending load returns to the TCC from the directory (causing the TCC state to become V) while there are still other pending loads at the TCC. This causes invalid transition errors at the TCC when those pending loads return, because the A state thinks they are atomics and decrements the pending atomic count (plus the loads are never sent to the TCP as returning loads). This patch fixes this by changing the TCC TBEs to model the number of pending requests, and not allowing atomics to be issued from the TCC until all prior, pending non-atomic requests have returned. Change-Id: I37f8bda9f8277f2355bca5ef3610f6b63ce93563	2023-11-15 19:23:51 -06:00
Matt Sinclair	065ddf759f	mem-ruby, gpu-compute: fix bug with GPU bypassing loads The current GPU TCP (L1D$) Ruby SLICC code had a bug where a GPU load that wants to bypass the L1D$ (e.g., because the GLC or SLC bit is set) and arrives while the line is Invalid results in a non-bypassing load being sent to the GPU TCC (L2$) instead of a bypassing load. This issue was not caught by the current nightly or weekly tests, because the tests do not test for correctness in terms of hits and misses in the caches. However, tests for these corner cases expose this issue. To fix this, this patch removes the check that the entry is valid when deciding what to do with a bypassing GPU load -- since the TCP Ruby code has transitions for bypassing loads in both I and V, we can simply call the LoadBypassEvict event in both cases and the appropriate transition will handle the bypassing load given the cache line's current state in the TCP. Change-Id: Ia224cefdf56b4318b2bcbd0bed995fc8d3b62a14	2023-11-15 19:23:51 -06:00
hungweihsuG	83f1fe3fec	dev: add debug flag in register bank. (#386 ) Print extra logs for the full/partial read/write access to the registers through the register bank. The debug flag is empty by default and would not print anything. Test: run unittest of dev/reg_bank.test.xml to check the behavior would not affect the original functionality. run gem5 with debug flags and use m5term to poke on registers.	2023-11-15 10:04:46 -08:00
wmin0	a8440f367d	arch-riscv: Move fault handler addr logic to ISA (#554 ) mtvec.mode is extended in new RISC-V proposals such as fast interrupts. This change moves that logic from the Fault class to the ISA class so that it can be extended. Ref: https://github.com/riscv/riscv-fast-interrupt	2023-11-15 10:04:01 -08:00
BujSet	4a5ec70e08	gpu-compute: Minor edits for atomic no returns and stores (#565 ) Since returned data is not needed for AtomicNoReturn and Store memory requests, the coalescer need not spend time writing in dummy data for packets of these types. Change-Id: Ie669e8c2a3bf44b5b0c290f62c49c5d4876a9a6a	2023-11-15 07:20:07 -08:00
Derek Christ	e95cab429f	configs,ext,stdlib: Update DRAMSys integration (#525 ) Recent breaking changes in the DRAMSys API require user code to be updated. These updates have been applied to the gem5 integration. Furthermore, as DRAMSys started to use CMake dependency management, it is no longer sensible to maintain two separate build systems for DRAMSys. The use of the DRAMSys integration in gem5 will therefore from now on require that CMake is installed on the target machine. Additionally, support for snapshots has been implemented in DRAMSys and coupled with gem5's checkpointing API.	2023-11-14 08:05:11 -08:00
Derek Christ	99553fdbee	systemc: Fix two bugs in gem5-to-tlm bridge (#542 ) This commit fixes a violation of the TLM2.0 protocol as well as a bug regarding back-pressure: - In the BEGIN_REQ phase, transaction objects are required to set their response status to TLM_INCOMPLETE_RESPONSE. This was not the case in the packet2payload function that converts gem5 packets to TLM2.0 generic payloads. - When the target applies back-pressure to the initiator, an assert condition was triggered as soon as the response is retried. The cause of this was an unintentional nullptr-access into a map.	2023-11-14 08:02:58 -08:00
BujSet	65b44e6516	mem-ruby: Fix for not creating log entries on atomic no return requests (#546 ) Augmenting Datablock and WriteMask to support optional arg to distinguish between return and no return. In the case of atomic no return requests, log entries should not be created when performing the atomic. Change-Id: Ic3112834742f4058a7aa155d25ccc4c014b60199a	2023-11-14 07:54:42 -08:00
Daniel Kouchekinia	be5c03ea9f	mem-ruby,configs: Add GPU GLC Atomic Resource Constraints (#120 ) Added a resource constraint, AtomicALUOperation, to GLC atomics performed in the TCC. The resource constraint uses a new class, ALUFreeList array. The class assumes the following: - There are a fixed number of atomic ALU pipelines - While a new cache line can be processed in each pipeline each cycle, if a cache line is currently going through a pipeline, it can't be processed again until it's finished Two configuration parameters have been used to tune this behavior: - tcc-num-atomic-alus corresponds to the number of atomic ALU pipelines - atomic-alu-latency corresponds to the latency of atomic ALU pipelines Change-Id: I25bdde7dafc3877590bb6536efdf57b8c540a939	2023-11-14 07:48:48 -08:00
Nikolaos Kyparissas	38045d7a25	mem-cache: Added clean eviction check for prefetchers. pkt->req->isCacheMaintenance() would not include a check for clean eviction before notifying the prefetcher, causing gem5 to crash. Change-Id: I1d082a87a3908b1ed46c5d632d45d8b09950b382 Reviewed-by: Giacomo Travaglini <giacomo.travaglini@arm.com>	2023-11-14 15:20:52 +00:00
Richard Cooper	6416304e07	mem-cache: Update default prefetch options. Update the default prefetch options to achieve out-of-the-box prefetcher performance closer to that which a typical user would expect. Configurations that set these parameters explicitly will be unaffected. The new defaults were identified as part of work on gem5 prefetchers undertaken by Nikolaos Kyparissas while on internship at Arm. Change-Id: Id63868c7c8f00ee15a0b09a6550780a45ae67e55 Reviewed-by: Giacomo Travaglini <giacomo.travaglini@arm.com>	2023-11-14 15:20:52 +00:00
Richard Cooper	8598764a03	mem-cache: Squash prefetch queue entries by block address. Prefetch queue entries were being squashed by comparing the address of each queued prefetch against the block address of the demand access. Only prefetches that happen to fall on a cache-line block boundary would be squashed. This patch converts the prefetch addresses to block addresses before comparison. Change-Id: I55ecb4919e94ad314b91c7795bba257c550b1528 Reviewed-by: Giacomo Travaglini <giacomo.travaglini@arm.com>	2023-11-14 15:20:52 +00:00
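The squashing fix above comes down to comparing block addresses on both sides. A minimal Python sketch, assuming a 64-byte block size and hypothetical helper names (`block_addr`, `squash`); it is not the gem5 queue implementation.

```python
BLK_SIZE = 64  # assumed cache-line size in bytes (must be a power of two)


def block_addr(addr, blk_size=BLK_SIZE):
    """Round an address down to the start of its cache block."""
    return addr & ~(blk_size - 1)


def squash(queue, demand_addr, blk_size=BLK_SIZE):
    """Drop queued prefetches that fall in the same block as the demand access."""
    demand_blk = block_addr(demand_addr, blk_size)
    return [pf for pf in queue if block_addr(pf, blk_size) != demand_blk]


queue = [0x1000, 0x1010, 0x1040]
remaining = squash(queue, demand_addr=0x1020)
print([hex(a) for a in remaining])
```

Comparing the raw prefetch address against the demand block address would have removed only `0x1000` here, leaving `0x1010` queued even though it targets the same block.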
Yu-Cheng Chang	f11227b4a0	systemc: Fix gcc13 systemC compilation error (#520 ) issue: https://github.com/gem5/gem5/issues/472	2023-11-14 03:54:35 -08:00
Daniel Kouchekinia	dde3d10aea	cpu: Remove SLC bit restraint for GPU tester (#552 ) This reverts gem5#133, the temporary work-around for gem5#131, allowing both SLC and GLC atomic requests to be made in the GPU tester. The underlying issues behind gem5#131 have been resolved by gem5#367 and gem5#397.	2023-11-14 03:47:34 -08:00
Matt Sinclair	48fde5a9c6	mem-ruby, gpu-compute: fix formatting of TCC (#536 ) Fix several improperly indented prints and extraneous blank lines in the SLICC code for the GPU TCC (L2 cache).	2023-11-13 15:01:30 -08:00
Matt Sinclair	7d0a1fb284	mem-ruby, gpu-compute: fix typo in GPU coalescer deadlock print (#535 ) The GPU Coalescer's deadlock print did not previously print a newline at the end of each deadlock, which caused confusion when there were multiple deadlocks as each deadlock print would appear to go with the address after it. This patch fixes this issue.	2023-11-13 15:01:01 -08:00
Matt Sinclair	75ca2c4282	gpu-compute: Fix typo with GPUTLB print (#529 ) Print was not properly ending in a newline, which caused confusion when looking at a trace with GPUTLB enabled. This fixes that.	2023-11-13 14:40:27 -08:00
Matt Sinclair	f312804364	mem-ruby: fix hex print in CacheMemory (#561 ) Update print in CacheMemory about clearing the lock to properly print in hex.	2023-11-13 14:34:33 -08:00
Matt Sinclair	3642bc4892	mem-ruby, gpu-compute: fix GPU SQC/TCP Ruby formatting (#538 ) Fix several improperly indented prints and extraneous blank lines in the SLICC code for the GPU SQC (L1I$) and TCP (L1D$).	2023-11-13 14:20:54 -08:00
Harshil Patel	50c9cbf613	stdlib, resources: Fixed deprecation warning Change-Id: I61865d9a2c08e344824a735ee5e85fb54cd489da	2023-11-13 14:09:13 -08:00
Bobby R. Bruce	b62308dfa3	base,sim: Add the SymbolType field to the Symbol object (#512 ) Symbol type is part of the info provided by an ELF object's symtab. It indicates whether a symbol is a file symbol, or a function symbol, etc. This chain of commits introduces a way to load only function symbols into gem5's symbol table. The RISC-V BootloaderKernelWorkload now loads only function symbols from the bootloader and the kernel binaries by default.	2023-11-13 08:14:05 -08:00
Bobby R. Bruce	52354662aa	arch-riscv: Fixing CMO instructions and allowing using CMO instructions in FS mode (#517 ) arch-riscv: Fix implementation of CMO extension instructions This change introduces a template for store instruction's mem access. The new template is called CacheBlockBasedStore. The reasons for not reusing the current Store's mem access template are as follows, - The CMO extension instructions operate on cache block size granularity, while regular load/store instructions operate on data of size 64 bits or fewer. - The writeMemAtomicLE/writeMemTimingLE interfaces do not allow passing nullptr as data. However, CPUs in gem5 rely on (data == NULL) to detect CACHE_BLOCK_ZERO instructions. Setting `Mem = 0;` to `uint64_t Mem;` does not solve the problem as the reference is allocated and thus, it's always true that `&Mem != NULL`. This change uses the writeMemAtomic/writeMemTiming interfaces instead. - Per CMO v1.0.1, the instructions in the spec do not generate address misaligned faults. - The CMO extension instructions do not use IMM. --- arch-riscv: Fix generateDisassembly for Store with 1 source reg Currently, store instructions are assumed to have two source registers. However, the RISC-V CMO instructions, which are Store instructions in gem5, have only one source register. This change allows printing disassembly of Store instructions with one source register. --- arch-riscv: Make Zicbom/Zicboz extensions optional in FS mode Currently, we enable Zicbom/Zicboz by default. Since those extensions might be buggy as they are not well-tested, making those extensions optional allows running simulations where the performance implications of the instructions do not matter. Effectively, by turning off the extensions, we simply remove those extensions from the device tree, so the OS would not use them. It doesn't prohibit the userspace application from using those instructions, however. 
--- arch-riscv: Add all supporting Z extensions to RISC-V isa string	2023-11-13 03:38:49 -08:00
Matt Sinclair	f61d709321	mem-ruby: update RubyRequest print to include GPU fields (#537 ) The print function used for RubyRequests did not include the GPU specific fields (for the GLC and SLC bits, which are cache modifiers that specify what level of the memory hierarchy a request should be performed at). This causes confusion when the GPU Ruby SLICC code prints out RubyRequest messages, since important fields are missing. Thus this commit adds that support. Since these fields are already part of the RubyRequest class, and are always 0 for non-GPU requests, it should not affect other components beyond slightly longer prints. Change-Id: I31c9122b82dfa2c6415ce25d225ea82cb35c7333	2023-11-13 01:12:25 -06:00
Daniel Kouchekinia	1204267fd8	mem-ruby: SLICC Fixes to GLC Atomics in WB L2 (#397 ) Made the following changes to fix the behavior of GLC atomics in a WB L2: - Stored atomic write mask in TBE For GLC atomics on an invalid line that bypass to the directory, but have their atomics performed on the return path. - Replaced !presentOrAvail() check for bypassing atomics to directory (which will then be performed on return path), with check for invalid line state. - Replaced wdb_writeDirtyBytes action used when performing atomics with owm_orWriteMask action that doesn't write from invalid atomic request data block - Fixed atomic return path actions Change-Id: I6a406c313d2f9c88cd75bfe39187ef94ce84098f	2023-11-09 13:15:10 -08:00
Matt Sinclair	86131d4323	mem-ruby, gpu-compute: update GPU L1I$ MRU info (#530 ) Previously the GPU L1 I$ (SQC) was not updating the MRU information on hits in the SQC. This commit resolves that by adding support to the appropriate Ruby transition.	2023-11-08 10:13:15 -08:00
Giacomo Travaglini	1f1e15e48f	arch-arm,kvm: Fix copy-paste error (#541 ) This was probably a copy paste error introduced by [1]. Luckily armv7 KVM mode has been superseded by the armv8 one. [1]: https://gem5-review.googlesource.com/c/public/gem5/+/52059 Change-Id: I260229c94077d856510976bda58383f0564fc15b Signed-off-by: Giacomo Travaglini <giacomo.travaglini@arm.com>	2023-11-08 08:35:02 +00:00
Zixian Cai	f97adbaac7	python: Handle unicode characters in config files (#521 ) Previously, opening a config file (such as `configs/example/hmc_hello.py`) containing non-ASCII characters caused a UnicodeDecodeError. Also switch to a more idiomatic context manager for handling files. Change-Id: Ia39cbe2c420e9c94f3a84af459b7e5f4d9718d14	2023-11-07 08:59:42 -08:00
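The general pattern behind the fix above, as a hedged Python sketch: read the file through a context manager with an explicit encoding. The helper name `read_config_source` and the choice of UTF-8 are assumptions for illustration; the commit message does not state which encoding gem5 uses.

```python
import os
import tempfile


def read_config_source(path):
    """Read a config script as text, decoding it explicitly as UTF-8 (assumed)."""
    # The context manager closes the file even if decoding raises.
    with open(path, encoding="utf-8") as f:
        return f.read()


# Round-trip a small file that contains non-ASCII characters.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "config.py")
    with open(path, "w", encoding="utf-8") as f:
        f.write("# caf\u00e9 \u00b5-benchmark\nx = 1\n")
    source = read_config_source(path)

print(source)
```

Without the explicit `encoding` argument, `open` falls back to the locale's preferred encoding, which is where a `UnicodeDecodeError` can come from on systems with an ASCII locale.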
Daniel Carvalho	10374f2f05	Fix calculation of compressed size in bytes (#534 ) An integer division in the compression:Base:getSize() was being done, which led to rounding down instead of up. Signed-off-by: Daniel R. Carvalho <odanrc@yahoo.com.br>	2023-11-07 08:58:32 -08:00
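The rounding issue described above is the usual floor-versus-ceiling division problem. A minimal Python sketch, assuming a hypothetical helper `size_bytes` that converts a size in bits to whole bytes (not the gem5 C++ code):

```python
BITS_PER_BYTE = 8


def size_bytes(size_bits):
    """Convert a size in bits to bytes, rounding up to a whole byte."""
    # Adding (divisor - 1) before the integer division gives the ceiling.
    return (size_bits + BITS_PER_BYTE - 1) // BITS_PER_BYTE


truncated = 9 // BITS_PER_BYTE  # plain integer division rounds down
rounded_up = size_bytes(9)
print(truncated, rounded_up)
```

Nine bits need two bytes of storage, so rounding down under-reports the compressed size.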
Matt Sinclair	76279fef59	mem-ruby: update RubyRequest print to include GPU fields The print function used for RubyRequests did not include the GPU specific fields (for the GLC and SLC bits, which are cache modifiers that specify what level of the memory hierarchy a request should be performed at). This causes confusion when the GPU Ruby SLICC code prints out RubyRequest messages, since important fields are missing. Thus this commit adds that support. Since these fields are already part of the RubyRequest class, and are always 0 for non-GPU requests, it should not affect other components beyond slightly longer prints. Change-Id: I31c9122b82dfa2c6415ce25d225ea82cb35c7333	2023-11-07 00:52:37 -06:00
Jason Lowe-Power	71973b386e	gpu-compute,dev-hsa: ROCm 5.5+ support (#498 ) ROCm 5.5 support including: - Vendor packet completion signals - Queue remapping race condition fix - Backwards compatible GPR allocation - Fix transient readBlob fatal reading kernel descriptor	2023-11-06 10:51:37 -08:00
Yu-Cheng Chang	e4cdd73a59	arch-riscv: Fix line length of CSRData declaration (#519 ) The length of the CSRData declaration must be less than 79 characters Change-Id: I3767b069664690d7b4498a73536880cfa491c6e5	2023-11-06 10:26:08 -08:00
Harshil Patel	42fd7ff894	stdlib, resources: Update JSON data in workload - The resources field in a workload now supports a dict with resource id and version. - Older workload JSON is still supported, but a deprecation warning has been added Change-Id: I137dbb99799a5294e84ce7d5d914f05e4cfe9e00	2023-11-03 13:54:30 -07:00
Matthew Poremba	e362310f3d	gpu-compute: Update GPR allocation counts GPR allocation is using fields in the AMD kernel code structure which are not backwards compatible and are not populated in more recent compiler versions. Use the granulated fields instead, which are enforced to be backwards compatible. Change-Id: I718716226f5dbeb08369d5365d5e85b029027932	2023-11-01 14:52:39 -05:00
Matthew Poremba	f07e0e7f5d	gpu-compute: Read dispatch packet with timing DMA This fixes occasional readBlob fatals caused by the functional read of system memory, seen often with the KVM CPU. Change-Id: Ifccee666f62faa5b2fcf0a64a9d77c8cf95b3add	2023-11-01 14:52:39 -05:00
Matthew Poremba	37da1c45f3	dev-amdgpu: Better handling for queue remapping The amdgpu driver can, at any time, tell the device to unmap a queue to force the queue descriptor to be written back to main memory in the form of a memory queue descriptor (MQD). It will then immediately remap the queue and continue writing the doorbell to the queue. It is possible that the doorbell write occurs after the queue is unmapped but before it is remapped. In this situation, we need to check the updated value of the doorbell for the queue and write that to the queue after it is mapped. To handle this, a pending doorbell packet map is created to hold a packet to replay when the queue is mapped. Because PCI in gem5 implements only the atomic protocol port, we cannot use the original packet as it must respond in the same Tick. This patch fixes issues with the doorbell maps not being cleared on unmapping, to ensure the doorbell is not found in writeDoorbell and is placed in the pending doorbell map. This includes fixing the doorbell offset value in the doorbell to VMID map, which is now multiplied by four as it is a dword address. This was tested using tensorflow 2.0's MNIST example, which was seeing this issue consistently. With this patch it now makes progress and does issue pending doorbell writes. Change-Id: Ic6b401d3fe7fc46b7bcbf19a769cdea6814e7d1e	2023-11-01 14:52:39 -05:00
Matthew Poremba	d05433b3f6	gpu-compute,dev-hsa: Send vendor packet completion signal gem5 does not currently implement any vendor-specific HSA packets. Starting in ROCm 5.5, vendor packets appear to end with a completion signal. Not sending this completion causes gem5 to hang. Since these packets are not documented anywhere and need to be reverse engineered we send the completion signal, if non-zero, and finish the packet as is the current behavior. Testing: HIP examples working on most recent ROCm release (5.7.1). Change-Id: Id0841407bec564c84f590c943f0609b17e01e14c	2023-11-01 14:52:39 -05:00
Hoa Nguyen	68287604ee	arch-riscv: Make Zicbom/Zicboz extensions optional in FS mode Currently, we enable Zicbom/Zicboz by default. Since those extensions might be buggy as they are not well-tested, making those extensions optional allows running simulations where the performance implications of the instructions do not matter. Effectively, by turning off the extensions, we simply remove those extensions from the device tree, so the OS would not use them. It doesn't prohibit userspace applications from using those instructions, however. Change-Id: Ib30e98c4c39f741dec5f7d31bd7b832391686840 Signed-off-by: Hoa Nguyen <hn@hnpl.org>	2023-10-30 21:45:13 +00:00
Hoa Nguyen	7c6fcb3838	arch-riscv: Add all supporting Z extensions to RISC-V isa string Change-Id: I809744fc546bc5c0e27380f9b75bdf99f8520583 Signed-off-by: Hoa Nguyen <hn@hnpl.org>	2023-10-30 21:45:10 +00:00
Hoa Nguyen	f615ee4cd4	arch-riscv: Fix generateDisassembly for Store with 1 source reg Currently, store instructions are assumed to have two source registers. However, we now support the RISC-V CMO instructions, which are Store instructions in gem5 but have only one source register. This change allows printing disassembly of Store instructions with one source register. Change-Id: I4dd7818c9ac8a89d5e10e77db72248942a25e938 Signed-off-by: Hoa Nguyen <hn@hnpl.org>	2023-10-30 21:44:18 +00:00
Hoa Nguyen	2521ba0664	arch-riscv: Fix implementation of CMO extension instructions This change introduces a template for store instruction's mem access. The new template is called CacheBlockBasedStore. The reasons for not reusing the current Store's mem access template are as follows, - The CMO extension instructions operate on cache block size granularity, while regular load/store instructions operate on data of size 64 bits or fewer. - The writeMemAtomicLE/writeMemTimingLE interfaces do not allow passing nullptr as data. However, CPUs in gem5 rely on (data == NULL) to detect CACHE_BLOCK_ZERO instructions. Setting `Mem = 0;` to `uint64_t Mem;` does not solve the problem as the reference is allocated and thus, it's always true that `&Mem != NULL`. This change uses the writeMemAtomic/writeMemTiming interfaces instead. - Per CMO v1.0.1, the instructions in the spec do not generate address misaligned faults. - The CMO extension instructions do not use IMM. Change-Id: I323615639a4ba882fe40a55ed32c7632e0251421 Signed-off-by: Hoa Nguyen <hn@hnpl.org>	2023-10-30 21:44:18 +00:00
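
Commit f97adbaac7 above mentions two things: a config file with non-ASCII characters failed to open with UnicodeDecodeError, and file handling moved to a context manager. The sketch below illustrates that general pattern only; it is not the code from the patch. The helper name `read_config_source`, the demo file name, and the choice of an explicit UTF-8 encoding are all assumptions made for illustration.

```python
import os
import tempfile


def read_config_source(path):
    """Hypothetical helper: return the text of a config script as UTF-8."""
    # The context manager closes the file even if read() raises.
    # Passing encoding explicitly avoids depending on the locale default.
    with open(path, "r", encoding="utf-8") as f:
        return f.read()


# Write a small script containing a non-ASCII comment, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_config.py")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("# café: a non-ASCII comment\nx = 1\n")
    source = read_config_source(demo_path)

# Compile and run the script text, as a config loader might.
namespace = {}
exec(compile(source, "demo_config.py", "exec"), namespace)
```

Under a locale whose default encoding is ASCII, omitting the `encoding` argument is one way the decode error can arise; whether the patch fixes it this way or by another route is not stated in the commit message.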

14646 Commits
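
The rounding problem described in commit 10374f2f05 (integer division rounding down instead of up) can be shown with a small sketch. It assumes the compressed size is tracked in bits and converted to bytes by dividing by 8; that assumption and both function names are illustrative, not taken from the gem5 source.

```python
def size_in_bytes_floor(size_bits):
    """Truncating conversion: under-reports any partial trailing byte."""
    return size_bits // 8


def size_in_bytes_ceil(size_bits):
    """Rounds up, so a partially used byte still counts as a whole byte."""
    # Adding (divisor - 1) before integer division gives the ceiling
    # for non-negative integers.
    return (size_bits + 7) // 8


# 9 bits need 2 bytes of storage, but truncation reports only 1.
floor_result = size_in_bytes_floor(9)
ceil_result = size_in_bytes_ceil(9)
```

Sizes that are exact multiples of 8 bits give the same result from both functions, so the difference only appears for partially filled bytes.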