derek/gem5 - gem5 - Gitea: Git with a cup of tea

derek/gem5

Author	SHA1	Message	Date
Matthew Poremba	da11427ba6	gpu-compute: Update tokens for flat global/scratch (#408 ) Memory instructions acquire coalescer tokens in the schedule stage. Currently this is only done for buffer and flat instructions, but not flat global or flat scratch. This change now acquires tokens for flat global and flat scratch instructions. This provides back-pressure to the CUs and helps to avoid deadlocks in Ruby. The change also handles returning tokens for buffer, flat global, and flat scratch instructions. This was previously only being done for normal flat instructions leading to deadlocks in some applications when the tokens were exhausted. To simplify the logic, added a needsToken() method to GPUDynInst which return if the instruction is buffer or any flat segment. The waitcnts were also incorrect for flat global and flat scratch. We should always decrement vmem and exp count for stores and only normal flat instructions should decrement lgkm. Currently vmem/exp are not decremented for flat global and flat scratch which can lead to deadlock. This change set fixes this by always decrementing vmem/exp and lgkm only for normal flat instructions. Change-Id: I673f4ac6121e4b5a5e8491bc9130c6d825d95fc5	2023-10-11 09:00:10 -07:00
Matthew Poremba	9f4d334644	gpu-compute: Update tokens for flat global/scratch Memory instructions acquire coalescer tokens in the schedule stage. Currently this is only done for buffer and flat instructions, but not flat global or flat scratch. This change now acquires tokens for flat global and flat scratch instructions. This provides back-pressure to the CUs and helps to avoid deadlocks in Ruby. The change also handles returning tokens for buffer, flat global, and flat scratch instructions. This was previously only being done for normal flat instructions leading to deadlocks in some applications when the tokens were exhausted. To simplify the logic, added a needsToken() method to GPUDynInst which return if the instruction is buffer or any flat segment. The waitcnts were also incorrect for flat global and flat scratch. We should always decrement vmem and exp count for stores and only normal flat instructions should decrement lgkm. Currently vmem/exp are not decremented for flat global and flat scratch which can lead to deadlock. This change set fixes this by always decrementing vmem/exp and lgkm only for normal flat instructions. Change-Id: I673f4ac6121e4b5a5e8491bc9130c6d825d95fc5	2023-10-10 09:48:16 -05:00
Matthew Poremba	6a4b2bb096	dev-hsa,gpu-compute: Add timestamps to AMD HSA signals The AMD specific HSA signal contains start/end timestamps for dispatch packet completion signals. These are current always zero. These timestamp values are used for profiling in the ROCr runtime. Unfortunately, the GpuAgent::TranslateTime method in ROCr does not check for zero values before dividing, causing applications that use profiling to crash with SIGFPE. Profiling is used via hipEvents in the HACC application, so these should be supported in gem5. In order to handle writing the timestamp values, we need to DMA the values to memory before writing the completion signal. This changes the flow of the async completion signal write to be (1) read mailbox pointer (2) if valid, write the mailbox data, other skip to 4 (3) write mailbox data if pointer is valid (4) write timestamp values (5) write completion signal. The application will process the timestamp data as soon as the completion signal is received, so we need to ordering to ensure the DMA for timestamps was completed. HACC now runs to completion on GPUFS and has the same output was hardware. Change-Id: I09877cdff901d1402140f2c3bafea7605fa6554e	2023-10-06 13:21:40 -05:00
Matthew Poremba	2b97f17fe1	gpu-compute: Fix dynamic scratch size test ROCm supports dynamically allocating scratch space, which resides in framebuffer memory, to reduce the amount of memory allocated for kernels that have not yet launched. The size of the scratch space allocated is located in task->amdQueue.compute_tmpring_size_wavesize. This size is in kilobytes. The AQL task contains the number of bytes requested per work item, however we currently check if there is enough tmpring space by comparing a single work item. This should instead check the size per wavefront. This causes problems in applications where multiple kernels use dynamic scratch allocation and a later kernel requires more space than the earlier kernel. The only application being tested that does this is LULESH. This was resulting in the scratch space being too small, resulting in workgroups clobbering each other's private memory leading to some nasty bugs. It is fixed by this patch as task->amdQueue will be re-read from the host and will contain the updated tmpring size. After this there is enough scratch space and LULESH makes forward progress. Change-Id: Ie9e0f92bb98fd3c3d6c2da3db9ee65352f9ae070	2023-10-04 09:38:31 -05:00
Matthew Poremba	cfa833a97d	gpu-compute: Set LDS/scratch aperture base register Starting with gfx900 (Vega) the LDS and scratch apertures can be queried using a new s_getreg_b32 instruction. If the instruction is called with the SH_MEM_BASES argument it returns the upper 16 bits of a 64 bit address for the LDS and scratch apertures. The current addresses cannot be encoded in this register, so that addresses are changed to have the lower 48 bits be all zeros in addition to writing the bases register. Change-Id: If20f262b2685d248afe31aa3ebb274e4f0fc0772	2023-08-31 11:01:32 -05:00
Matthew Poremba	60f071d09a	gpu-compute,arch-vega: Implement flat scratch insts Flat scratch instructions (aka private) are the 3rd and final segment of flat instructions in gfx9 (Vega) and beyond. These are used for things like spills/fills and thread local storage. This commit enables two forms of flat scratch instructions: (1) flat_load/flat_store instructions where the memory address resolves to private memory and (2) the new scratch_load/scratch_store instructions in Vega. The first are similar to older generation ISAs where the aperture is unknown until address translation. The second are instructions guaranteed to go to private memory. Since these are very similar to flat global instructions there are minimal changes needed: - Ensure a flat instruction is either regular flat, global, XOR scratch - Rename the global op_encoding methods to GlobalScratch to indicate they are for both and are intentionally used. - Flat instructions in segment 1 output scratch_ in the disassembly - Flat instruction executed as private use similar mem helpers as global - Flat scratch cannot be an atomic This was tested using a modified version of the 'square' application: template <typename T> __global__ void scratch_square(T C_d, T A_d, size_t N) { size_t offset = (blockIdx.x * blockDim.x + threadIdx.x); size_t stride = blockDim.x * gridDim.x ; volatile int foo; // Volatile ensures scratch / unoptimized code for (size_t i=offset; i<N; i+=stride) { foo = A_d[i]; C_d[i] = foo * foo; } } Change-Id: Icc91a7f67836fa3e759fefe7c1c3f6851528ae7d	2023-08-26 13:40:12 -05:00
Matthew Poremba	4506188e00	gpu-compute: Fix private offset/size register indexes According to the ABI documentation from LLVM, the low register of flat scratch (maxSGPR - 4) is the offset and the high register (maxSGPR - 3) is size. These are currently backwards, resulting in some gnarly addresses being generated leading to page fault and/or incorrect data. This commit fixes this by setting the order correctly. Change-Id: I0b1d077c49c0ee2a4e59b0f6d85cdb8f17f9be61	2023-08-26 13:40:12 -05:00
Matthew Poremba	e0379f4526	gpu-compute: Fix flat scratch resource counters Flat instructions may access memory locations in LDS (scratchpad) and global (VRAM/framebuffer) and therefore increment both counters when dispatched. Once the aperture is known, we decrement the counters of the aperture that was not used. This is done incorrectly for scratch / private flat instruction. Private memory is global and therefore local memory counters should be decremented. This commit fixes the counters by changing the global decrements to local decrements. Change-Id: I25890446908df72e5469e9dbaba6c984955196cf	2023-08-26 13:40:12 -05:00
Matthew Poremba	57b3d2897c	gpu-compute: Use timing DMAs for GPUFS HSA signals The functional HSA signal read was a hack left in the gpu-compute code. In full system, this functional read is causing problems occasionally with the translation not yet being in the page table. The error message output by gem5 was a fatal message on the readBlob method in port proxy. Changing this to a timing DMA fixes this problem. This commit adds the various timing DMA functions to send and receive response and clean up. A helper method "sendCompletionSignal" is added to the GPUCommandProcessor because the indentation level was getting too deep. This change applies only to FS mode. Code for SE mode is equivalent to what it was before this commit. Change-Id: I1bfcaa0a52731cdf9532a7fd0eb06ab2f0e09d48	2023-08-25 13:10:51 -05:00
Matthew Poremba	90a518e885	gpu-compute,arch-vega: Fix ALU-only LDS counters There are a few LDS instructions that perform local ALU operations and writeback which are marked as loads. These are marked as loads because they fit in the pipeline logic better, according to a several year old comment. In the VEGA ISA these instructions (swizzle, permute, bpermute) are not decrementing the LDS load counter. As a result, the counter will gradually increase over time. Since wavefront slots are persistent, this can cause applications with a few thousand kernels to eventually hang thinking there are not enough resources. This changeset fixes this by decrementing the LDS load counter for these instructions. This fix was already integrated in the GCN3 ISA in the exact same way. This changeset moves it near a similar comment about scheduling register file writes. Change-Id: Ife5237a2cae7213948c32ef266f4f8f22917351c	2023-08-23 19:30:24 -05:00
Matthew Poremba	df4739929d	gpu-compute: Change kernel-based exit location The previous exit event occurs when the dispatcher sends a completion signal for a kernel, but gem5 does some kernel-based stats updates after the signal is sent. Therefore, if these exit events are used as a way to dump per-kernel stats, some of the stats for the kernel that just ended will be in the next kernel's stat dump which is misleading. This patch moves the exit event to where the stats are updated and only exits if the dispatcher has requested a stat dump to prevent situations where stats are updated mid-kernel. Change-Id: I74dc1cad5fc90382a2a80564764b3e7c9fb65521	2023-08-15 11:06:26 -05:00
KaiBatley	efa1d87add	configs: fix GPU's default number of HW barrier/CU (#92 ) AMD GCN3 and Vega GPUs assume a max of 16 WG/CU. Any GPU WG with more than 1 WF requires a hardware barrier to allow WFs in the WG to synchronize locally. However, currently the default gem5 GPU configuration assumes only 4 barriers per CU, which artificially prevents applications with > 4 WG/CU that could run simultaneously from running simultaneously. This fix resolves this by updating the default number of hardware barriers per CU to 16, which mimics the support described in slide 39 here: https://www.olcf.ornl.gov/wp-content/uploads/2019/10/ ORNL_Application_Readiness_Workshop-AMD_GPU_Basics.pdf Change-Id: Ib7636a13359d998e676c1790f436a83ce88cbfc0	2023-07-17 10:42:40 -07:00
Jason Lowe-Power	442923c414	Add feature to output citations automatically based on configuration (#90 ) This change adds a new file to m5out which is citations.bib. This file will contain the citations to the papers which describe the aspects of the gem5 simulator that the simulation uses. In other words, each simulation configuration could generate a different bib file referencing different works. Each SimObject can now have a set of citations associated with it. After the system is built (in `instantiate`), the citations.bib file is created by parsing all SimObjects that have been instantiated and taking the union of their associated citations. This commit is not meant to add all citations, but to act as an example for others to add more citations to gem5. Change-Id: Icd5c46fd9ee44adbeec1fea162657f5716f7e5ef Signed-off-by: Jason Lowe-Power <jason@lowepower.com>	2023-07-17 10:41:51 -07:00
Matthew Poremba	3756af8ed9	gpu-compute,configs: Make sim exits conditional The unconditional exit event when a kernel completes that was added in `c644eae2dd` is causing scripts that do not ignore unknown exit events to end simulation prematurely. One such script is the apu_se.py script used in SE mode GPU simulation. Make this exit conditional to the parameter being set to a valid value to avoid this problem. Change-Id: I1d2c082291fdbcf27390913ffdffb963ec8080dd Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/72098 Reviewed-by: Jason Lowe-Power <power.jg@gmail.com> Maintainer: Matt Sinclair <mattdsinclair@gmail.com> Maintainer: Jason Lowe-Power <power.jg@gmail.com> Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com> Tested-by: kokoro <noreply+kokoro@google.com>	2023-07-07 14:12:54 +00:00
Matthew Poremba	c644eae2dd	configs,gpu-compute: Kernel dispatch-based exit events Add two kernel dispatch-based exit events that are useful for limiting the simulation and enabling debug flags at specific GPU kernels. Since the KVM CPU typically used with GPUFS is not deterministic, this help with enabling debug flags when the Tick number may vary. The exit at GPU kernel option can also limit simulation by only simulating a few hundred kernels, for example, and exit at a determined point. Change-Id: I81bae92a80c25fc38c41e999aa662e1417b7a20d Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/71418 Maintainer: Matt Sinclair <mattdsinclair@gmail.com> Tested-by: kokoro <noreply+kokoro@google.com> Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com>	2023-06-08 22:03:47 +00:00
Matthew Poremba	ebd5b3e4ae	gpu-compute: Gfx version check for FS and SE mode There is no GPU device in SE mode to get version from and no GPU driver in FS mode to get version from, so a conditional needs to be added depending on the mode to get the gfx version. Change-Id: I33fdafb60d351ebc5148e2248244537fb5bebd31 Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/71078 Tested-by: kokoro <noreply+kokoro@google.com> Maintainer: Matt Sinclair <mattdsinclair@gmail.com> Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com>	2023-06-01 00:15:02 +00:00
Matthew Poremba	6b4a1020be	configs,dev-amdgpu: GPUFS MI200/gfx90a support Add support for MI200-like device. This includes adding PCI IDs and new MMIOs for the device, a different MAP_PROCESS packet, and a different calculation for the number of VGPRs. Change-Id: I0fb7b3ad928826beaa5386d52a94ba504369cb0d Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/70317 Reviewed-by: Jason Lowe-Power <power.jg@gmail.com> Maintainer: Jason Lowe-Power <power.jg@gmail.com> Tested-by: kokoro <noreply+kokoro@google.com>	2023-05-25 19:14:32 +00:00
Giacomo Travaglini	7b39a7f14e	misc: Rename DEBUG macro into GEM5_DEBUG The DEBUG macro is not part of any compiler standards (differently from NDEBUG, which elides assertions). It is only meant to differentiate gem5.debug from .fast and .opt builds. gem5 developers have used it to insert helper code that is supposed to aid the debugging process in case anything goes wrong. This generic name is likely to clash with other libraries linked with gem5. This is the case of DRAMSim as an example. Rather than using undef tricks, we just inject a GEM5_DEBUG macro for gem5.debug builds. Change-Id: Ie913ca30da615bd0075277a260bbdbc397b7ec87 Signed-off-by: Giacomo Travaglini <giacomo.travaglini@arm.com> Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/69079 Tested-by: kokoro <noreply+kokoro@google.com> Reviewed-by: Daniel Carvalho <odanrc@yahoo.com.br> Maintainer: Jason Lowe-Power <power.jg@gmail.com> Reviewed-by: Jason Lowe-Power <power.jg@gmail.com>	2023-03-21 06:53:55 +00:00
Gabriel Busnot	7f4c92c910	mem,arch-arm,mem-ruby,cpu: Remove use of deprecated base port owner Change-Id: I29214278c3dd4829c89a6f7c93214b8123912e74 Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/67452 Reviewed-by: Bobby Bruce <bbruce@ucdavis.edu> Tested-by: kokoro <noreply+kokoro@google.com> Maintainer: Daniel Carvalho <odanrc@yahoo.com.br> Maintainer: Bobby Bruce <bbruce@ucdavis.edu> Reviewed-by: Daniel Carvalho <odanrc@yahoo.com.br>	2023-02-03 06:11:45 +00:00
Vishnu Ramadas	d6bbccb60a	gpu-compute : Fix incorrect TLB stats when FunctionalTLB is used When FunctionalTLB is used in SE mode, the stats tlbLatency and tlbCycles report negative values. This patch fixes it by disabling the updates that result in negative values when FunctionalTLB is set to true Change-Id: I6962785fc1730b166b6d5b879e9c7618a8d6d4b3 Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/67202 Reviewed-by: Matt Sinclair <mattdsinclair.wisc@gmail.com> Maintainer: Matt Sinclair <mattdsinclair@gmail.com> Maintainer: Matthew Poremba <matthew.poremba@amd.com> Reviewed-by: Matthew Poremba <matthew.poremba@amd.com> Tested-by: kokoro <noreply+kokoro@google.com>	2023-01-10 02:27:29 +00:00
Matthew Poremba	af2cecf59e	gpu-compute: Fix ABI init for DispatchId DispatchId should allocate two SGPRs instead of one. Allocating one was causing all subsequent SGPR index values to be off by one, leading to bad addresses for things like flat scratch and private segment. This field is not used very often so it was not impacting most applications. Change-Id: I17744e2d099fbc0447f400211ba7f8a42675ea06 Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/66711 Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com> Maintainer: Matt Sinclair <mattdsinclair@gmail.com> Tested-by: kokoro <noreply+kokoro@google.com>	2022-12-16 18:16:18 +00:00
Hoa Nguyen	eac06ad681	python: Fix multiline quotes in a single line An example case, ```python mem_side_port = RequestPort( "This port sends requests and " "receives responses" ) ``` This is the residue of running the python formatter. This is done by finding all tokens matching the regex `"\s"(?![.;"])` and manually replacing them by empty strings. Change-Id: Icf223bbe889e5fa5749a81ef77aa6e721f38b549 Signed-off-by: Hoa Nguyen <hoanguyen@ucdavis.edu> Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/66111 Reviewed-by: Bobby Bruce <bbruce@ucdavis.edu> Maintainer: Bobby Bruce <bbruce@ucdavis.edu> Tested-by: kokoro <noreply+kokoro@google.com>	2022-11-29 23:44:38 +00:00
vramadas95	dff879cf21	configs, gpu-compute: Add configurable L1 scalar latencies Previously the scalar cache path used the same latency parameter as the vector cache path for memory requests. This commit adds new parameters for the scalar cache path latencies. This commit also modifies the model to use the new latency parameter to set the memory request latency in the scalar cache. The new paramters are '--scalar-mem-req-latency' and '--scalar-mem-resp-latency' and are set to default values of 50 and 0 respectively Change-Id: I7483f780f2fc0cfbc320ed1fd0c2ee3e2dfc7af2 Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/65511 Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com> Maintainer: Jason Lowe-Power <power.jg@gmail.com> Reviewed-by: Jason Lowe-Power <power.jg@gmail.com> Tested-by: kokoro <noreply+kokoro@google.com> Maintainer: Matt Sinclair <mattdsinclair@gmail.com>	2022-11-12 02:23:02 +00:00
Matthew Poremba	8d63c9fc06	gpu-compute: Add granulated SGPR computation for gfx9 The granulated SGPR size is used when the number of SGPRs is unknown. The computation for this has changed since gfx8 and is commented as a TODO in a comment. This changeset implements the change and also checks for an invalid SGPR count. According to LLVM code this could happen "due to a compiler bug or when using inline asm.": https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/AMDGPU/ AMDGPUAsmPrinter.cpp#L723 Change-Id: Ie487a53940b323a0002341075e0f81af4147a7d8 Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/65252 Maintainer: Matt Sinclair <mattdsinclair@gmail.com> Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com> Tested-by: kokoro <noreply+kokoro@google.com>	2022-11-08 21:34:11 +00:00
Matthew Poremba	f6dc5c6aa4	gpu-compute: Chunkify AMDKernelCode read from device The AMDKernelCode object can span potentially span two pages. Currently the copy loop from device memory only translates once at the base address. This changeset translates one cache line at a time before copying and has the ancillary benefit for cleaning up this code a bit. Change-Id: I602bc12d8f8c5d3a3e57ab3f42f7dd3df58dc144 Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/65251 Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com> Tested-by: kokoro <noreply+kokoro@google.com> Reviewed-by: Jason Lowe-Power <power.jg@gmail.com> Maintainer: Jason Lowe-Power <power.jg@gmail.com>	2022-11-08 21:34:11 +00:00
Matthew Poremba	ec696c00b2	gpu-compute: Add missing initial reg state in WF There are two initial scalar register fields that are not initialized in the wavefront when a task is dispatch. This changeset adds the missing DispatchId and PrivateSegSize fields. These fields are typically used when an application is compiled with debug support and are typically not used in the applications in gem5's test suite. Change-Id: I5b5fa75e4badfd9ba7588e4cd485ebf75fd5d627 Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/64191 Maintainer: Matt Sinclair <mattdsinclair@gmail.com> Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com> Tested-by: kokoro <noreply+kokoro@google.com>	2022-10-06 23:14:40 +00:00
Matthew Poremba	9f5c0f2822	gpu-compute: dprint instruction requesting translation When debugging strange addresses, it is extremely useful to know what instruction calculated that address. This make it much easier to follow assembly code backwards to find the source of an incorrect address. This change adds a DPRINTF for GPUTLB that by default prints the disassembly when a virtual address translation is sent to the TLB. Change-Id: I5066c064a48c5c48696863eeccd8d011245ef7b2 Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/63176 Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com> Maintainer: Matt Sinclair <mattdsinclair@gmail.com> Tested-by: kokoro <noreply+kokoro@google.com>	2022-09-09 04:13:49 +00:00
Gabe Black	f4209bbdee	misc: Remove lingering uses of TheISA::. Change-Id: Ie55e0d79867fbc8f75a993fb456a58c84de5def4 Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/62196 Reviewed-by: Giacomo Travaglini <giacomo.travaglini@arm.com> Tested-by: kokoro <noreply+kokoro@google.com> Maintainer: Giacomo Travaglini <giacomo.travaglini@arm.com>	2022-08-20 07:30:16 +00:00
Alexandru Dutu	c6b38909e1	gpu-compute: Adding support for LDS atomics This changeset is adding support for LDS atomics and implementing DS_OR_B32 instruction. Change-Id: I84c5cf6ce0e9494726dc7299f360551cd2a485f5 Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/61791 Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com> Maintainer: Matt Sinclair <mattdsinclair@gmail.com> Tested-by: kokoro <noreply+kokoro@google.com>	2022-08-19 16:44:31 +00:00
Bobby R. Bruce	787204c92d	python: Apply Black formatter to Python files The command executed was `black src configs tests util`. Change-Id: I8dfaa6ab04658fea37618127d6ac19270028d771 Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/47024 Maintainer: Bobby Bruce <bbruce@ucdavis.edu> Reviewed-by: Jason Lowe-Power <power.jg@gmail.com> Reviewed-by: Giacomo Travaglini <giacomo.travaglini@arm.com> Tested-by: kokoro <noreply+kokoro@google.com>	2022-08-03 09:10:41 +00:00
Matthew Poremba	68115460d8	gpu-compute: Set LDS and Scratch apertures in FS The LDS and scratch aperture base and limits are hardcoded to some values that are useful for SE mode. In reality, these are chosen by the driver so we need to honor whatever values the driver passes so that when addresses are calculated they fall into the correct aperture to route flat instructions to those apertures. This overwrites the default hardcoded values for LDS and scratch base and limit using the values providing by the driver in a MAP_PROCESS packet. Change-Id: I0e194a26631f697819d8aaecf1bf346a7b7c7026 Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/61656 Maintainer: Matt Sinclair <mattdsinclair@gmail.com> Tested-by: kokoro <noreply+kokoro@google.com> Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com>	2022-07-28 14:10:33 +00:00
Matthew Poremba	f65f5a8981	gpu-compute,arch-vega: Overhaul HWRegs, setreg, getreg These instructions are supposed to be read/writing special shader hardware registers. Currently they are getting/setting to an SGPR. This results in getting incorrect registers at best and clobbering an SGPR being used by an application at worst. Furthermore, some registers need to be set in the shader and the application will never (can never) set them. This patch overhauls the getreg/setreg instructions to use different storage in the shader. The values will be updated either via setreg from an application (e.g., mode register) or set by a PM4 MAP_PROCESS. Change-Id: Ie5e5d552bd04dc47f5b35b5ee40a569ae345abac Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/61655 Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com> Tested-by: kokoro <noreply+kokoro@google.com> Maintainer: Matt Sinclair <mattdsinclair@gmail.com>	2022-07-28 14:10:33 +00:00
Matthew Poremba	ee75e19b8b	gpu-compute: Fix dynamic scratch allocation on GPUFS When GPU needs more scratch it requests from the runtime. In the method to wait for response, a dmaReadVirt is called with the same method as the callback with zero delay. This means that effectively there is an infinite loop in the event queue if the scratch setup is not successful on the first attempt. In the case of GPUFS, it is never successfully instantly so a delay must be added. Without added delay, the host CPU is never scheduled to make progress setting up more scratch space. The value 1e9 is choosen to match the KVM quantum and hopefully give KVM a chance to schedule an event. For reference, the driver timeout is 200ms so this is still fairly aggressive checking of the signal response. This value is also balanced around the GPUCommandProc DPRINTF to prevent the print in this method from overwhelming debug output. Change-Id: I0e0e1d75cd66f7c47815b13a4bfc3c0188e16220 Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/61651 Tested-by: kokoro <noreply+kokoro@google.com> Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com> Maintainer: Matt Sinclair <mattdsinclair@gmail.com>	2022-07-28 14:10:33 +00:00
Gabe Black	397e66d8b6	gpu-compute: Stop passing in a default value to resize(). A default constructed element of the container is the default value to the second resize() parameter. Having that parameter explicitly causes a warning/error in newer versions of gcc, and is unnecessary regardless. Change-Id: I6aee2d23f0f4382b00caf552c8e38940614c5f9a Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/60311 Reviewed-by: Jason Lowe-Power <power.jg@gmail.com> Tested-by: kokoro <noreply+kokoro@google.com> Maintainer: Gabe Black <gabe.black@gmail.com> Reviewed-by: Yu-hsin Wang <yuhsingw@google.com>	2022-06-03 10:48:31 +00:00
Matthew Poremba	8fe975e57e	gpu-compute: Fatal on dynamic scratch allocation in GPUFS This is known not working in GPUFS. As a result, the simulation will never end. Rather than simulate forever, add a fatal for now to exit simulation until support for this functionality is added. Change-Id: I8e45996a7eb781575e8643baea05daf87bc5f1c3 Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/58472 Reviewed-by: Jason Lowe-Power <power.jg@gmail.com> Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com> Maintainer: Matt Sinclair <mattdsinclair@gmail.com> Tested-by: kokoro <noreply+kokoro@google.com>	2022-04-08 17:12:32 +00:00
Matthew Poremba	e36a8dbd8a	gpu-compute: Handle GPUFS system store responses Requests in GPUFS which go to system memory will not generate the WriteCompleteResp packets that the VIPER protocol would normally created for device requests which go through the caches. Therefore, we need to callback the GM pipe handleResponse to complete the access and make forward progress. Change-Id: Ic00c430ce420a591fe5743f758b780d93afd2a38 Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/57989 Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com> Maintainer: Matt Sinclair <mattdsinclair@gmail.com> Tested-by: kokoro <noreply+kokoro@google.com>	2022-04-07 20:11:01 +00:00
Matthew Poremba	6feaa88e27	gpu-compute: Command processor read path from device In full system mode, the AMDKernelCode object can reside in either the system memory or in the dGPU device memory. Currently only reading from the host/system memory is supported. This adds the necessary code to read from the dGPU device memory. Change-Id: I887fc706b3f9834db14e40f36fd29dd3d4602925 Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/57710 Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com> Maintainer: Matt Sinclair <mattdsinclair@gmail.com> Tested-by: kokoro <noreply+kokoro@google.com>	2022-04-07 20:11:01 +00:00
Matthew Poremba	fcbc9afcd6	gpu-compute: Don't use emulated driver in full system The emulated driver is currently called in a few locations unconditionally. This changeset adds checks that we are not in full system before calling any emulated driver function. In full system the amdgpu driver running on the disk image handles these functions. Change-Id: Iea3546b574e29c649351c0fce9154530be89e9b1 Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/57712 Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com> Maintainer: Matt Sinclair <mattdsinclair@gmail.com> Tested-by: kokoro <noreply+kokoro@google.com>	2022-04-07 20:11:01 +00:00
Matthew Poremba	f375e79bcf	gpu-compute: Support Scalar and Vector access to system pages The amdgpu driver supports reading and writing scalar and vector memory addresses that reside in system memory. This is commonly used for things like blit kernels that perform host-to-device or device-to-host copies using GPU load/store instructions. This is done by utilizing the system hub device added in a prior changeset. Memory packets translated by the Scalar or VMEM TLBs will have the correspoding system request field set from the PTE in the TLB which can be used in the compute unit to determine if a request is for system memory or not. Another important change is to return global memory tokens for system requests. Since these do not flow through the GPU coalescer where the token is returned, the token can be returned once the request is known to be a system request. Change-Id: I35030e0b3698f10c63a397f96b81267271e3130e Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/57711 Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com> Maintainer: Matt Sinclair <mattdsinclair@gmail.com> Tested-by: kokoro <noreply+kokoro@google.com>	2022-04-07 20:11:01 +00:00
Matthew Poremba	347364ab0f	gpu-compute: Handle mailbox/wakeup signals for GPUFS The current mailbox/wakeup signal uses the SE mode proxy port to write the event value. This is not available in full system mode so instead we need to issue a DMA write to the address. The value of event_val clears the event. Change-Id: I424469076e87e690ab0bb722bac4c3e7414fb150 Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/57709 Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com> Maintainer: Matt Sinclair <mattdsinclair@gmail.com> Tested-by: kokoro <noreply+kokoro@google.com>	2022-04-07 20:11:01 +00:00
Matthew Poremba	91e8bbe299	configs,gpu-compute: Support fetch from system pages The amdgpu driver supports fetching instructions from pages which reside in system memory rather than device memory. This changeset adds support to do this by adding the system hub object added in a prior changeset to the fetch unit and issues requests to the system hub if the system bit in the memory page's PTE is set. Otherwise, the requestor ID is set to be device memory and the request is routed through the Ruby network / GPU caches to fetch the instructions. Change-Id: Ib2fb47c589fdd5e544ab6493d7dbd8f2d9d7b0e8 Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/57652 Reviewed-by: Jason Lowe-Power <power.jg@gmail.com> Maintainer: Jason Lowe-Power <power.jg@gmail.com> Tested-by: kokoro <noreply+kokoro@google.com>	2022-03-28 23:24:53 +00:00
Gabe Black	e6c0ba97db	scons: Put all config variables in an env['CONF'] sub-dict. This makes what are configuration and what are internal SCons variables explicit and separate, and makes it unnecessary to call out what variables to export to C++. These variables will also be plumbed into and out of kconfiglib in later changes. Change-Id: Iaf5e098d7404af06285c421dbdf8ef4171b3f001 Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/56892 Reviewed-by: Andreas Sandberg <andreas.sandberg@arm.com> Maintainer: Gabe Black <gabe.black@gmail.com> Tested-by: kokoro <noreply+kokoro@google.com>	2022-03-28 20:31:21 +00:00
Matthew Poremba	51648570ea	gpu-compute: Add methods to read GPU memory requestor ID These methods are called from various places to override the requestor ID of a request in order to determine which Ruby network a request should be routed on. Change-Id: Ic0270ddd7123f0457a13144e69ef9132204d4334 Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/57651 Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com> Maintainer: Matt Sinclair <mattdsinclair@gmail.com> Tested-by: kokoro <noreply+kokoro@google.com>	2022-03-25 19:51:29 +00:00
Matthew Poremba	581e451723	gpu-compute,dev-hsa: Update CP and HSAPP for full-system Make the necessary changes to connect Vega pagetable walkers for full-system mode. Previously the CP and HSA packet processor could only read AQL packets from system/host memory using proxy port. This allows for AQL to be read from device memory which is used for non-blit kernels. Change-Id: If28eb8be68173da03e15084765e77e92eda178e9 Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/53077 Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com> Maintainer: Matt Sinclair <mattdsinclair@gmail.com> Tested-by: kokoro <noreply+kokoro@google.com>	2022-03-25 19:51:29 +00:00
Matthew Poremba	539a2e2bcd	arch-vega: Add VEGA page tables and TLB Add the page table walker, page table format, TLB, TLB coalescer, and associated support in the AMDGPUDevice. This page table format used the hardware format for dGPU and is very different from APU/GCN3 which use the X86 page table format. In order to support either format for the GPU model, a common TranslationState called GpuTranslation state is created which holds the combined fields of both the APU and Vega translation state. Similarly the TlbEntry is cast at runtime by the corresponding arch files as they are the only files which touch the internals of the TlbEntry. The GPU model only checks if a TlbEntry is non-null and thus does not need to cast to peek inside the data structure. Change-Id: I4484c66239b48df5224d61caa6e968e56eea38a5 Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/51848 Reviewed-by: Jason Lowe-Power <power.jg@gmail.com> Maintainer: Jason Lowe-Power <power.jg@gmail.com> Tested-by: kokoro <noreply+kokoro@google.com>	2022-03-17 00:11:14 +00:00
Gabe Black	06117275fa	scons: Make all sticky variables automatically exported. All sticky vars are exported, but not all exported vars are sticky. The vars which are exported but not sticky are (at least in general) found with Configure() style measurement. Change-Id: Idebf17e44c2eeca745cdfdd9f42eddcfdb0cf9ed Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/56891 Maintainer: Gabe Black <gabe.black@gmail.com> Tested-by: kokoro <noreply+kokoro@google.com> Reviewed-by: Andreas Sandberg <andreas.sandberg@arm.com>	2022-03-15 00:45:30 +00:00
Matthew Poremba	45ad755511	gpu-compute: Fix default MTYPE initialization The default MTYPE initialization in the emulated GPU driver is currently doing a bitwise AND on an input integer param with other integers instead of using a bitmask. Change this to use bitset and test the bit positions corresponding to the values in the MTYPE enum that were previously being used as an operand for bitwise AND. This was causing invalid slicc transitions in some benchmarks for combinations of request type and mtype that are undefined. Change-Id: I93fee0eae1fff7141cd14c239c16d1d69925d08d Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/56367 Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com> Maintainer: Matt Sinclair <mattdsinclair@gmail.com> Tested-by: kokoro <noreply+kokoro@google.com>	2022-03-13 21:50:21 +00:00
Kyle Roarty	78451f6685	gpu-compute: Fix register checking and allocation in dyn manager This patch updates the canAllocate function to account both for the number of regions of registers that need to be allocated, and for the fact that the registers aren't one continuous chunk. The patch also consolidates the registers as much as possible when a register chunk is freed. This prevents fragmentation from making it impossible to allocate enough registers Change-Id: Ic95cfe614d247add475f7139d3703991042f8149 Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/56909 Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com> Maintainer: Matt Sinclair <mattdsinclair@gmail.com> Tested-by: kokoro <noreply+kokoro@google.com> Reviewed-by: Matthew Poremba <matthew.poremba@amd.com>	2022-02-18 18:46:33 +00:00
Kyle Roarty	4d2cbefd1e	gpu-compute: Set scratch_base, lds_base for gfx902 When updating how scratch_base and lds_base were set, gfx902 was left out. This adds in gfx902 to the case statement, allowing the apertures to be set and for simulations using gfx902 to not error out Change-Id: I0e1adbdf63f7c129186fb835e30adac9cd4b72d0 Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/54663 Reviewed-by: Matt Sinclair <mattdsinclair@gmail.com> Maintainer: Matt Sinclair <mattdsinclair@gmail.com> Reviewed-by: Matthew Poremba <matthew.poremba@amd.com> Maintainer: Matthew Poremba <matthew.poremba@amd.com> Tested-by: kokoro <noreply+kokoro@google.com>	2022-02-17 21:20:16 +00:00
Matthew Poremba	9313294efe	misc: Remove AMD license addition Remove the line "For use for simulation and test purposes only" in files were AMD is the only copyright holder listed in the header. This happens to be the case for all files where this line exists, removing it completely from gem5. Change-Id: I623f266b002f564301b28774f49081099cfc60fd Reviewed-on: https://gem5-review.googlesource.com/c/public/gem5/+/53943 Reviewed-by: Jason Lowe-Power <power.jg@gmail.com> Maintainer: Jason Lowe-Power <power.jg@gmail.com> Tested-by: kokoro <noreply+kokoro@google.com>	2021-12-11 04:00:56 +00:00

1 2 3 4 5 ...

264 Commits