mem-ruby, gpu-compute: fix SQC/TCP requests to same line (#540)

Currently, the GPU SQC (L1I$) and TCP (L1D$) have a performance bug
where they do not behave correctly when multiple requests to the same
cache line overlap one another.  The intended behavior is that if the
first request that arrives at the Ruby code for the SQC/TCP misses, it
should send a request to the GPU TCC (L2$).  If any requests to the
same cache line occur while this first request is pending, they should
wait locally at the L1 in the MSHRs (TBEs) until the first request has
returned.  At that point they can be serviced, and assuming the line
has not been evicted, they should hit.

For example, in the following test (on 1 GPU thread, in 1 WG):

load Arr[0]
load Arr[1]
load Arr[2]

The expected behavior (confirmed via profiling on real GPUs) is that
we should get 1 miss (Arr[0]) and 2 hits (Arr[1], Arr[2]) for such a
program.

However, the current support in the VIPER SQC/TCP code does not model
this correctly.  Instead it lets all 3 concurrent requests go straight
through to the TCC instead of stopping the Arr[1] and Arr[2] requests
locally while Arr[0] is serviced.  This causes all 3 requests to be
classified as misses.

To resolve this, this patch adds support to the SQC/TCP code to
prevent subsequent, concurrent requests to a pending cache line from
being sent in parallel with the original one.  To do this, we add an
additional transient state (IV) to indicate that a load is pending to
this cache line.  If a subsequent request of any kind to the same
cache line occurs while this load is pending, the request is put on
the local wait buffer and woken up when the first request returns to
the SQC/TCP.  Likewise, when the first load returns to the SQC/TCP,
the line transitions from IV --> V.
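
In SLICC terms, the core of the change is the following condensed
sketch, drawn from the SQC diff below (tag/data array annotations
omitted; the TCP changes are analogous):

  // a load that misses allocates a TBE and moves the line to IV
  transition(I, Fetch, IV) {
    t_allocateTBE;
    nS_issueRdBlkS;
    uu_profileDataMiss;
    p_popMandatoryQueue;
  }

  // later requests to the same line sleep on the per-address buffer
  transition(IV, {Fetch}) {
    st_stallAndWaitRequest;
  }

  // the returning data completes the load, wakes any sleepers, and
  // moves the line to V
  transition(IV, Data, V) {
    a_allocate;
    w_writeCache;
    l_loadDoneMiss;
    wada_wakeUpAllDependentsAddr;
    d_deallocateTBE;
    pr_popResponseQueue;
  }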

As part of this support, additional transitions were also added to
handle corner cases.  First, if the line is evicted by another request
that maps to the same set index while the first load is pending, the
line is immediately given to the new request; when the pending load
returns, it completes and wakes up any requests waiting on the same
line, but does not attempt to change the state of the line.  Second,
GPU bypassing loads and stores must interact correctly with pending
requests: if they reach the L1 after the pending, non-bypassing load,
they are forced to wait; if they reach the L1 before the non-bypassing
load, they make sure not to change the state of the line from IV if
they return before the non-bypassing load does.
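
Concretely, these corner cases correspond to transitions like the
following condensed sketch from the SQC/TCP diffs below (array
annotations omitted):

  // the line was evicted while the load was pending: complete the
  // load and wake dependents without re-allocating the line or
  // changing its state
  transition(I, Data) {
    l_loadDoneMiss;
    wada_wakeUpAllDependentsAddr;
    d_deallocateTBE;
    pr_popResponseQueue;
  }

  // a returning bypass request completes without touching the line
  // state, whether the line is in I or IV
  transition({I, IV}, Bypass) {
    rb_bypassDone;
    pr_popResponseQueue;
  }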

As part of this change, we also move the MSHR behavior for loads from
internally in the GPUCoalescer to the Ruby code (like all other
requests).  This is important for getting correct hits and misses in
the stats and other prints, since the GPUCoalescer MSHR behavior
assumed that all requests serviced out of its MSHR also miss if the
original request to that line missed.

Although the SQC does not support stores, the TCP does.  Thus, we
could have applied a similar change to the GPU stores at the TCP.
However, since the TCP support assumes write-through caches and does
not attempt to allocate space in the TCP for stores, we elected not to
add this support, as it runs contrary to the intended behavior (i.e.,
writes simply bypass the TCP and thus should not need to wait for
another write to the same cache line to complete).

Additionally, making these changes introduced deadlocks at the TCC.
Specifically, some Pannotia applications have accesses to the same
cache line where some of the accesses are GLC (i.e., they bypass the
GPU L1 cache) and others are non-GLC (i.e., they want to be cached in
the GPU L1 cache).  The changes above already handle this within a
single CU.  The problem here, however, is that these requests come
from different CUs and happen concurrently (seemingly because
different WGs are at different points in the kernel around the same
time).  This causes a problem because the TCC's TBE support overwrites
the stored GPU bypassing bits (SLC, GLC) every time a request arrives.
Thus, when the second (non-GLC) load reaches the TCC, it overwrites
the SLC/GLC information for the first (GLC) load, and when the first
load returns from the directory/memory, it no longer has the GLC bit
set, which causes an assert failure at the TCP.
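
The problematic overwrite is in the TCC's t_allocateTBE action, which
unconditionally stores the bypass bits of whichever request arrived
last (see the FYI comment in the TCC diff below):

  // a second request to the same line clobbers the first request's
  // bypass bits, so the first response comes back with the wrong
  // GLC/SLC values
  tbe.isGLCSet := in_msg.isGLCSet;
  tbe.isSLCSet := in_msg.isSLCSet;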

After talking with other developers, we decided that the best way to
handle this, and to model real hardware more closely, was to move the
point at which requests are put to sleep on the wakeup buffer from the
TCC to the directory.  Accordingly, this patch includes support for that
-- now when multiple loads (bypassing or non-bypassing) from different
CUs reach the directory, all but the first one will be forced to wait
there until the first one completes, then will be woken up and
performed.  This required updating the WTRequestor information at the
TCC to also pass along which CU performed the original request for
loads (otherwise, since the TBE can be updated by multiple pending
loads, we cannot tell where to send the final result).  Thus, the
field was renamed CURequestor instead of WTRequestor, since it is now
used for more than stores.  Moreover, we also updated the
directory to take this new field and the GLC information from incoming
TCC requests and then pass that information back to the TCC on the
response -- without doing this, because the TBE can be updated by
multiple pending, concurrent requests we cannot determine if this memory
request was a bypassing or non-bypassing request.  Finally, these
changes introduced a lot of additional contention and protocol stalls
at the directory, so this patch converts all directory uses of z_stall
to instead put requests on the wakeup buffer (and wake them up when
the current request completes).  Without this, protocol stalls cause
many applications to deadlock at the directory.
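
At the directory, the conversion follows one pattern throughout,
condensed here from the MOESI_AMD_Base-dir diff below (the full state
list is elided):

  // instead of z_stall'ing, park the conflicting request on the
  // per-address wait buffer...
  transition({B, B_M, BP /* etc. */}, {VicDirty, VicClean}) {
    st_stallAndWaitRequest;
  }

  // ...and wake everything waiting on this address when the blocking
  // request completes, so the parked requests are replayed
  action(wada_wakeUpAllDependentsAddr, "wada",
         desc="Wake up any requests waiting for this address") {
    wakeUpAllBuffers(address);
  }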

However, this exposed another issue at the TCC: other applications
(e.g., HACC) have a mix of atomics and non-atomics to the same cache
line in the same kernel, and the TCC transitions to the A state when
an atomic arrives.  For example, the first pending load may return to
the TCC from the directory and transition the line to V while other
loads to the same line are still pending; if an atomic then arrives
and moves the line to A, the remaining pending loads trigger invalid
transition errors when they return, because the A state assumes the
returning responses are atomics and decrements the pending atomic
count (plus the loads are never sent to the TCP as returning loads).
This patch fixes this by changing the TCC TBEs to track the number of
pending requests and by not allowing atomics to be issued from the TCC
until all prior, pending non-atomic requests have returned.
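
The pending-request tracking is a small reference count on the TCC
TBE, condensed here from the TCC diff below:

  // in t_allocateTBE: each pending request increments the count
  tbe.numPending := tbe.numPending + 1;

  // in dt_deallocateTBE: only the last pending request frees the TBE
  tbe.numPending := tbe.numPending - 1;
  if (tbe.numPending == 0) {
    tbe.Destination.clear();
    TBEs.deallocate(address);
    unset_tbe();
  }

  // in the request port: atomics wait until all prior non-atomic
  // requests have returned
  if (is_valid(tbe) && (tbe.numPending > 0) &&
      (tbe.numPendingDirectoryAtomics == 0)) {
    trigger(Event:AtomicWait, in_msg.addr, cache_entry, tbe);
  }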

Change-Id: I37f8bda9f8277f2355bca5ef3610f6b63ce93563
Author: Matthew Poremba
Date:   2023-11-16 14:24:00 -08:00
6 changed files with 414 additions and 128 deletions


@@ -48,6 +48,9 @@ machine(MachineType:SQC, "GPU SQC (L1 I Cache)")
{
state_declaration(State, desc="SQC Cache States", default="SQC_State_I") {
I, AccessPermission:Invalid, desc="Invalid";
// Note: currently IV in the SQC is only for pending loads to a given cache
// line. Since the SQC is read only, there are no stores.
IV, AccessPermission:Invalid, desc="Going from I to V, waiting on TCC data";
V, AccessPermission:Read_Only, desc="Valid";
}
@@ -98,6 +101,7 @@ machine(MachineType:SQC, "GPU SQC (L1 I Cache)")
void unset_tbe();
void wakeUpAllBuffers();
void wakeUpBuffers(Addr a);
void wakeUpAllBuffers(Addr a);
Cycles curCycle();
// Internal functions
@@ -270,6 +274,21 @@ machine(MachineType:SQC, "GPU SQC (L1 I Cache)")
}
}
action(t_allocateTBE, "t", desc="allocate TBE Entry") {
check_allocate(TBEs);
TBEs.allocate(address);
set_tbe(TBEs.lookup(address));
}
action(d_deallocateTBE, "d", desc="Deallocate TBE") {
TBEs.deallocate(address);
unset_tbe();
}
action(st_stallAndWaitRequest, "st", desc="Stall and wait on the address") {
stall_and_wait(mandatoryQueue_in, address);
}
action(p_popMandatoryQueue, "pm", desc="Pop Mandatory Queue") {
mandatoryQueue_in.dequeue(clockEdge());
}
@@ -278,6 +297,10 @@ machine(MachineType:SQC, "GPU SQC (L1 I Cache)")
responseToSQC_in.dequeue(clockEdge());
}
action(wada_wakeUpAllDependentsAddr, "wada", desc="Wake up any requests waiting for this address") {
wakeUpAllBuffers(address);
}
action(l_loadDoneHit, "ldh", desc="local load done (hits in SQC)") {
assert(is_valid(cache_entry));
sequencer.readCallback(address, cache_entry.DataBlk, true, MachineType:L1Cache);
@@ -313,22 +336,52 @@ machine(MachineType:SQC, "GPU SQC (L1 I Cache)")
// Transitions
// if another request arrives for the same cache line that has a pending
// load, put it on the wakeup buffer. This reduces resource contention since
// it won't retry every cycle and will instead only try again once woken
// up
transition(IV, {Fetch}) {
st_stallAndWaitRequest;
}
// transitions from base
transition({I, V}, Repl, I) {TagArrayRead, TagArrayWrite} {
transition({I, IV, V}, Repl, I) {TagArrayRead, TagArrayWrite} {
// since we're evicting something, don't bother classifying as hit/miss
ic_invCache;
}
transition(I, Data, V) {TagArrayRead, TagArrayWrite, DataArrayRead} {
// if we got a response for a load where the line is in I, then
// another request must have come in that replaced the line in question in
// the cache. Thus, complete this request without allocating the line, but
// still deallocate TBE and wakeup any dependent addresses.
transition(I, Data) {TagArrayRead, TagArrayWrite, DataArrayRead} {
// don't profile this as a hit/miss since it's a response from L2,
// so we already counted it
l_loadDoneMiss;
wada_wakeUpAllDependentsAddr;
d_deallocateTBE;
pr_popResponseQueue;
}
// if line is currently in IV, then Data is returning the data for a
// pending load, so transition to V, deallocate TBE, and wakeup any dependent
// requests so they will be replayed now that this request has returned.
transition(IV, Data, V) {TagArrayRead, TagArrayWrite, DataArrayRead} {
a_allocate;
// don't profile this as a hit/miss since it's a response from L2,
// so we already counted it
w_writeCache;
l_loadDoneMiss;
wada_wakeUpAllDependentsAddr;
d_deallocateTBE;
pr_popResponseQueue;
}
transition(I, Fetch) {TagArrayRead, TagArrayWrite} {
// if we have a load that misses, allocate TBE entry and transition to IV
// to prevent subsequent requests to same cache line from also going to TCC
// while this request is pending
transition(I, Fetch, IV) {TagArrayRead, TagArrayWrite} {
t_allocateTBE;
nS_issueRdBlkS;
uu_profileDataMiss; // since line wasn't in SQC, we missed
p_popMandatoryQueue;


@@ -61,6 +61,7 @@ machine(MachineType:TCC, "TCC Cache")
WrVicBlk, desc="L1 Write Through";
WrVicBlkBack, desc="L1 Write Through(dirty cache)";
WrVicBlkEvict, desc="L1 Write Through(dirty cache) and evict";
AtomicWait, desc="Atomic Op that must wait for pending loads";
Atomic, desc="Atomic Op";
AtomicPassOn, desc="Atomic Op Passed on to Directory";
AtomicDone, desc="AtomicOps Complete";
@@ -113,6 +114,7 @@ machine(MachineType:TCC, "TCC Cache")
bool Shared, desc="Victim hit by shared probe";
MachineID From, desc="Waiting for writeback from...";
NetDest Destination, desc="Data destination";
int numPending, desc="num pending requests";
int numPendingDirectoryAtomics, desc="number of pending atomics to be performed in directory";
int atomicDoneCnt, desc="number AtomicDones triggered";
bool isGLCSet, desc="Bypass L1 Cache";
@@ -293,11 +295,14 @@ machine(MachineType:TCC, "TCC Cache")
peek(responseFromNB_in, ResponseMsg, block_on="addr") {
TBE tbe := TBEs.lookup(in_msg.addr);
Entry cache_entry := getCacheEntry(in_msg.addr);
bool is_slc_set := false;
if (!is_invalid(tbe)) {
is_slc_set := tbe.isSLCSet;
}
/*
MOESI_AMD_Base-dir acts as the directory, and it always passes
SLC information back to L2 because of races at L2 with requests
from different CUs sending requests to same cache line in parallel.
If these requests have different GLC/SLC settings, the L2 TBE may
not have the correct GLC/SLC information for a given request.
*/
bool is_slc_set := in_msg.isSLCSet;
// Whether the SLC bit is set or not, WB acks should invoke the
// WBAck event. For cases where a read response will follow a
@@ -372,16 +377,29 @@ machine(MachineType:TCC, "TCC Cache")
} else if (in_msg.Type == CoherenceRequestType:Atomic ||
in_msg.Type == CoherenceRequestType:AtomicReturn ||
in_msg.Type == CoherenceRequestType:AtomicNoReturn) {
// If the request is system-level, if the address isn't in the cache,
// or if this cache is write-through, then send the request to the
// directory. Since non-SLC atomics won't be performed by the directory,
// TCC will perform the atomic on the return path on Event:Data.
// The action will invalidate the cache line if SLC is set and the address is
// in the cache.
if(in_msg.isSLCSet || !WB) {
trigger(Event:AtomicPassOn, in_msg.addr, cache_entry, tbe);
/*
If there are already pending requests for this line and those
requests are not atomics, we need to put this atomic to sleep and
wake it up when the loads return: we can't easily differentiate
between request types on return, and the A state assumes all
returned requests are atomics when decrementing the atomic count.
*/
if (is_valid(tbe) && (tbe.numPending > 0) &&
(tbe.numPendingDirectoryAtomics == 0)) {
trigger(Event:AtomicWait, in_msg.addr, cache_entry, tbe);
} else {
trigger(Event:Atomic, in_msg.addr, cache_entry, tbe);
// If the request is system-level, if the address isn't in the cache,
// or if this cache is write-through, then send the request to the
// directory. Since non-SLC atomics won't be performed by the directory,
// TCC will perform the atomic on the return path on Event:Data.
// The action will invalidate the cache line if SLC is set and the address is
// in the cache.
if(in_msg.isSLCSet || !WB) {
trigger(Event:AtomicPassOn, in_msg.addr, cache_entry, tbe);
} else {
trigger(Event:Atomic, in_msg.addr, cache_entry, tbe);
}
}
} else if (in_msg.Type == CoherenceRequestType:RdBlk) {
if (in_msg.isSLCSet) {
@@ -433,24 +451,35 @@ machine(MachineType:TCC, "TCC Cache")
out_msg.addr := address;
out_msg.Type := CoherenceResponseType:TDSysResp;
out_msg.Sender := machineID;
out_msg.Destination := tbe.Destination;
out_msg.DataBlk := cache_entry.DataBlk;
out_msg.MessageSize := MessageSizeType:Response_Data;
out_msg.Dirty := false;
out_msg.State := CoherenceState:Shared;
DPRINTF(RubySlicc, "%s\n", out_msg);
peek(responseFromNB_in, ResponseMsg) {
out_msg.isGLCSet := tbe.isGLCSet;
out_msg.isSLCSet := tbe.isSLCSet;
// if line state is Invalid, then we must be doing the transition(I, Data)
// so use the DataBlk from the incoming message
if ((getAccessPermission(address) == AccessPermission:NotPresent) ||
(getAccessPermission(address) == AccessPermission:Invalid)) {
out_msg.DataBlk := in_msg.DataBlk;
} else {
out_msg.DataBlk := cache_entry.DataBlk;
}
out_msg.isGLCSet := in_msg.isGLCSet;
out_msg.isSLCSet := in_msg.isSLCSet;
// reuse CURequestor field to allow multiple concurrent loads and
// track where they should go back to (since TBE can't distinguish
// destinations)
out_msg.Destination.clear();
out_msg.Destination.add(in_msg.CURequestor);
}
DPRINTF(RubySlicc, "%s\n", out_msg);
}
enqueue(unblockToNB_out, UnblockMsg, 1) {
out_msg.addr := address;
out_msg.Destination.add(mapAddressToMachine(address, MachineType:Directory));
out_msg.MessageSize := MessageSizeType:Unblock_Control;
peek(responseFromNB_in, ResponseMsg) {
out_msg.isGLCSet := tbe.isGLCSet;
out_msg.isSLCSet := tbe.isSLCSet;
out_msg.isGLCSet := in_msg.isGLCSet;
out_msg.isSLCSet := in_msg.isSLCSet;
}
DPRINTF(RubySlicc, "%s\n", out_msg);
}
@@ -462,13 +491,17 @@ machine(MachineType:TCC, "TCC Cache")
out_msg.addr := address;
out_msg.Type := CoherenceResponseType:TDSysResp;
out_msg.Sender := machineID;
out_msg.Destination := tbe.Destination;
// reuse CURequestor field to allow multiple concurrent loads and
// track where they should go back to (since TBE can't distinguish
// destinations)
out_msg.Destination.clear();
out_msg.Destination.add(in_msg.CURequestor);
out_msg.DataBlk := in_msg.DataBlk;
out_msg.MessageSize := MessageSizeType:Response_Data;
out_msg.Dirty := false;
out_msg.State := CoherenceState:Shared;
out_msg.isGLCSet := tbe.isGLCSet;
out_msg.isSLCSet := tbe.isSLCSet;
out_msg.isGLCSet := in_msg.isGLCSet;
out_msg.isSLCSet := in_msg.isSLCSet;
DPRINTF(RubySlicc, "%s\n", out_msg);
}
enqueue(unblockToNB_out, UnblockMsg, 1) {
@@ -481,19 +514,25 @@ machine(MachineType:TCC, "TCC Cache")
}
action(rd_requestData, "r", desc="Miss in L2, pass on") {
if(tbe.Destination.count()==1){
peek(coreRequestNetwork_in, CPURequestMsg) {
enqueue(requestToNB_out, CPURequestMsg, l2_request_latency) {
out_msg.addr := address;
out_msg.Type := in_msg.Type;
out_msg.Requestor := machineID;
out_msg.Destination.add(mapAddressToMachine(address, MachineType:Directory));
out_msg.Shared := false; // unneeded for this request
out_msg.MessageSize := in_msg.MessageSize;
out_msg.isGLCSet := tbe.isGLCSet;
out_msg.isSLCSet := tbe.isSLCSet;
DPRINTF(RubySlicc, "%s\n", out_msg);
}
peek(coreRequestNetwork_in, CPURequestMsg) {
DPRINTF(RubySlicc, "in_msg: %s\n", in_msg);
enqueue(requestToNB_out, CPURequestMsg, l2_request_latency) {
out_msg.addr := address;
out_msg.Type := in_msg.Type;
out_msg.Requestor := machineID;
/*
To allow multiple concurrent requests from different CUs, we pass
the origin information along to the directory, which stores it in its
TBE as appropriate before passing it back to the TCC on the return
path.
*/
out_msg.CURequestor := in_msg.Requestor;
out_msg.Destination.add(mapAddressToMachine(address, MachineType:Directory));
out_msg.Shared := false; // unneeded for this request
out_msg.MessageSize := in_msg.MessageSize;
out_msg.isGLCSet := in_msg.isGLCSet;
out_msg.isSLCSet := in_msg.isSLCSet;
DPRINTF(RubySlicc, "out_msg: %s\n", out_msg);
}
}
}
@@ -504,7 +543,7 @@ machine(MachineType:TCC, "TCC Cache")
out_msg.addr := address;
out_msg.Type := CoherenceResponseType:TDSysWBAck;
out_msg.Destination.clear();
out_msg.Destination.add(in_msg.WTRequestor);
out_msg.Destination.add(in_msg.CURequestor);
out_msg.Sender := machineID;
out_msg.MessageSize := MessageSizeType:Writeback_Control;
out_msg.instSeqNum := in_msg.instSeqNum;
@@ -562,7 +601,7 @@ machine(MachineType:TCC, "TCC Cache")
enqueue(responseToCore_out, ResponseMsg, l2_response_latency) {
out_msg.addr := address;
out_msg.Type := CoherenceResponseType:TDSysResp;
out_msg.Destination.add(in_msg.WTRequestor);
out_msg.Destination.add(in_msg.CURequestor);
out_msg.Sender := machineID;
out_msg.MessageSize := in_msg.MessageSize;
out_msg.DataBlk := cache_entry.DataBlk;
@@ -578,12 +617,12 @@ machine(MachineType:TCC, "TCC Cache")
enqueue(responseToCore_out, ResponseMsg, l2_response_latency) {
out_msg.addr := address;
out_msg.Type := CoherenceResponseType:TDSysResp;
out_msg.Destination.add(in_msg.WTRequestor);
out_msg.Destination.add(in_msg.CURequestor);
out_msg.Sender := machineID;
out_msg.MessageSize := in_msg.MessageSize;
out_msg.DataBlk := in_msg.DataBlk;
out_msg.isGLCSet := tbe.isGLCSet;
out_msg.isSLCSet := tbe.isSLCSet;
out_msg.isGLCSet := in_msg.isGLCSet;
out_msg.isSLCSet := in_msg.isSLCSet;
}
}
}
@@ -611,7 +650,10 @@ machine(MachineType:TCC, "TCC Cache")
tbe.Destination.clear();
tbe.numPendingDirectoryAtomics := 0;
tbe.atomicDoneCnt := 0;
tbe.numPending := 0;
}
// each pending request increments this count by 1
tbe.numPending := tbe.numPending + 1;
if (coreRequestNetwork_in.isReady(clockEdge())) {
peek(coreRequestNetwork_in, CPURequestMsg) {
if(in_msg.Type == CoherenceRequestType:RdBlk ||
@@ -620,6 +662,16 @@ machine(MachineType:TCC, "TCC Cache")
in_msg.Type == CoherenceRequestType:AtomicNoReturn){
tbe.Destination.add(in_msg.Requestor);
}
/*
If there are multiple concurrent requests to the same cache line, each
one will overwrite the previous one's GLC/SLC information here.
If these requests have different GLC/SLC information, this causes
a segfault. Hence, currently the support relies on the directory to
pass back the GLC/SLC information instead of relying on L2 TBE to be
correct.
This message is left here as an FYI for future developers.
*/
tbe.isGLCSet := in_msg.isGLCSet;
tbe.isSLCSet := in_msg.isSLCSet;
if(in_msg.Type == CoherenceRequestType:Atomic ||
@@ -633,9 +685,14 @@ machine(MachineType:TCC, "TCC Cache")
}
action(dt_deallocateTBE, "dt", desc="Deallocate TBE entry") {
tbe.Destination.clear();
TBEs.deallocate(address);
unset_tbe();
// since we may have multiple destinations, we can't deallocate unless we
// are the last one
tbe.numPending := tbe.numPending - 1;
if (tbe.numPending == 0) {
tbe.Destination.clear();
TBEs.deallocate(address);
unset_tbe();
}
}
action(wcb_writeCacheBlock, "wcb", desc="write data to TCC") {
@@ -672,7 +729,7 @@ machine(MachineType:TCC, "TCC Cache")
enqueue(requestToNB_out, CPURequestMsg, l2_request_latency) {
out_msg.addr := address;
out_msg.Requestor := machineID;
out_msg.WTRequestor := in_msg.Requestor;
out_msg.CURequestor := in_msg.Requestor;
out_msg.Destination.add(mapAddressToMachine(address, MachineType:Directory));
out_msg.MessageSize := MessageSizeType:Data;
out_msg.Type := CoherenceRequestType:WriteThrough;
@@ -680,6 +737,8 @@ machine(MachineType:TCC, "TCC Cache")
out_msg.DataBlk := in_msg.DataBlk;
out_msg.writeMask.orMask(in_msg.writeMask);
out_msg.instSeqNum := in_msg.instSeqNum;
out_msg.isGLCSet := in_msg.isGLCSet;
out_msg.isSLCSet := in_msg.isSLCSet;
}
}
}
@@ -688,7 +747,7 @@ machine(MachineType:TCC, "TCC Cache")
enqueue(requestToNB_out, CPURequestMsg, l2_request_latency) {
out_msg.addr := address;
out_msg.Requestor := machineID;
out_msg.WTRequestor := machineID;
out_msg.CURequestor := machineID;
out_msg.Destination.add(mapAddressToMachine(address, MachineType:Directory));
out_msg.MessageSize := MessageSizeType:Data;
out_msg.Type := CoherenceRequestType:WriteThrough;
@@ -703,13 +762,15 @@ machine(MachineType:TCC, "TCC Cache")
enqueue(requestToNB_out, CPURequestMsg, l2_request_latency) {
out_msg.addr := address;
out_msg.Requestor := machineID;
out_msg.WTRequestor := in_msg.Requestor;
out_msg.CURequestor := in_msg.Requestor;
out_msg.Destination.add(mapAddressToMachine(address, MachineType:Directory));
out_msg.MessageSize := MessageSizeType:Data;
out_msg.Type := CoherenceRequestType:WriteFlush;
out_msg.Dirty := true;
out_msg.DataBlk := cache_entry.DataBlk;
out_msg.writeMask.orMask(cache_entry.writeMask);
out_msg.isGLCSet := in_msg.isGLCSet;
out_msg.isSLCSet := in_msg.isSLCSet;
}
}
}
@@ -719,7 +780,7 @@ machine(MachineType:TCC, "TCC Cache")
enqueue(requestToNB_out, CPURequestMsg, l2_request_latency) {
out_msg.addr := address;
out_msg.Requestor := machineID;
out_msg.WTRequestor := in_msg.Requestor;
out_msg.CURequestor := in_msg.Requestor;
out_msg.Destination.add(mapAddressToMachine(address, MachineType:Directory));
out_msg.MessageSize := MessageSizeType:Data;
out_msg.Type := in_msg.Type;
@@ -768,9 +829,17 @@ machine(MachineType:TCC, "TCC Cache")
wakeUpAllBuffers(address);
}
/*
Currently z_stall is unused because it can lead to Protocol Stalls that
eventually lead to deadlock. Instead, it is recommended to use
st_stallAndWaitRequest in combination with a wakeupBuffer call (e.g.,
wada_wakeUpAllDependentsAddr) to put the pending requests to sleep rather
than causing head-of-line blocking -- wada_wakeUpAllDependentsAddr should wake
the request up once the request preventing it from completing is done.
action(z_stall, "z", desc="stall") {
// built-in
}
*/
action(inpa_incrementNumPendingDirectoryAtomics, "inpa", desc="inc num atomics") {
@@ -792,8 +861,8 @@ machine(MachineType:TCC, "TCC Cache")
out_msg.addr := address;
out_msg.Type := TriggerType:AtomicDone;
peek(responseFromNB_in, ResponseMsg) {
out_msg.isGLCSet := tbe.isGLCSet;
out_msg.isSLCSet := tbe.isSLCSet;
out_msg.isGLCSet := in_msg.isGLCSet;
out_msg.isSLCSet := in_msg.isSLCSet;
}
}
}
@@ -832,31 +901,53 @@ machine(MachineType:TCC, "TCC Cache")
// they can cause a resource stall deadlock!
transition(WI, {RdBlk, WrVicBlk, Atomic, AtomicPassOn, WrVicBlkBack}) { //TagArrayRead} {
// by putting the stalled requests in a buffer, we reduce resource contention
// since they won't try again every cycle and will instead only try again once
// woken up
// don't profile as hit or miss since it will be tried again
/*
By putting the stalled requests in a buffer, we reduce resource contention
since they won't try again every cycle and will instead only try again once
woken up.
*/
st_stallAndWaitRequest;
}
transition(WIB, {RdBlk, WrVicBlk, Atomic, WrVicBlkBack}) { //TagArrayRead} {
// by putting the stalled requests in a buffer, we reduce resource contention
// since they won't try again every cycle and will instead only try again once
// woken up
// don't profile as hit or miss since it will be tried again
/*
By putting the stalled requests in a buffer, we reduce resource contention
since they won't try again every cycle and will instead only try again once
woken up.
*/
st_stallAndWaitRequest;
}
transition(A, {RdBlk, WrVicBlk, WrVicBlkBack}) { //TagArrayRead} {
// by putting the stalled requests in a buffer, we reduce resource contention
// since they won't try again every cycle and will instead only try again once
// woken up
// don't profile as hit or miss since it will be tried again
/*
By putting the stalled requests in a buffer, we reduce resource contention
since they won't try again every cycle and will instead only try again once
woken up.
*/
st_stallAndWaitRequest;
}
transition(IV, {WrVicBlk, Atomic, AtomicPassOn, WrVicBlkBack}) { //TagArrayRead} {
// by putting the stalled requests in a buffer, we reduce resource contention
// since they won't try again every cycle and will instead only try again once
// woken up
// don't profile as hit or miss since it will be tried again
/*
By putting the stalled requests in a buffer, we reduce resource contention
since they won't try again every cycle and will instead only try again once
woken up.
*/
st_stallAndWaitRequest;
}
transition({I, IV, V}, AtomicWait) {
// don't profile as hit or miss since it will be tried again
/*
By putting the stalled requests in a buffer, we reduce resource contention
since they won't try again every cycle and will instead only try again once
woken up.
*/
st_stallAndWaitRequest;
}
transition({M, V}, RdBlk) {TagArrayRead, DataArrayRead} {
p_profileHit;
sd_sendData;
@@ -865,12 +956,15 @@ machine(MachineType:TCC, "TCC Cache")
}
transition(W, RdBlk, WI) {TagArrayRead, DataArrayRead} {
// don't profile as hit or miss since it will be tried again
t_allocateTBE;
wb_writeBack;
// need to try this request again after writing back the current entry -- to
// do so, put it with other stalled requests in a buffer to reduce resource
// contention since they won't try again every cycle and will instead only
// try again once woken up
/*
Need to try this request again after writing back the current entry -- to
do so, put it with other stalled requests in a buffer to reduce resource
contention since they won't try again every cycle and will instead only
try again once woken up.
*/
st_stallAndWaitRequest;
}
@@ -933,6 +1027,7 @@ machine(MachineType:TCC, "TCC Cache")
// Transition to be called when a read request with SLC flag arrives at entry
// in transient state. The request stalls until the pending transition is complete.
transition({WI, WIB, IV}, RdBypassEvict) {
// don't profile as hit or miss since it will be tried again
st_stallAndWaitRequest;
}
@@ -945,8 +1040,8 @@ machine(MachineType:TCC, "TCC Cache")
p_popRequestQueue;
}
transition(A, Atomic) {
p_profileMiss;
transition(A, {Atomic, AtomicWait}) {
// don't profile as hit or miss since it will be tried again
// by putting the stalled requests in a buffer, we reduce resource contention
// since they won't try again every cycle and will instead only try again once
// woken up
@@ -993,7 +1088,7 @@ machine(MachineType:TCC, "TCC Cache")
}
transition(A, AtomicPassOn) {
p_profileMiss;
// don't profile as hit or miss since it will be tried again
// by putting the stalled requests in a buffer, we reduce resource contention
// since they won't try again every cycle and will instead only try again once
// woken up
@@ -1136,9 +1231,41 @@ machine(MachineType:TCC, "TCC Cache")
ut_updateTag;
wcb_writeCacheBlock;
sdr_sendDataResponse;
pr_popResponseQueue;
wada_wakeUpAllDependentsAddr;
dt_deallocateTBE;
pr_popResponseQueue;
}
/*
Since the L2 now allows multiple loads from different CUs to proceed in
parallel to the directory, we may get Event:Data back when the line is
already in V. In this case, send the response to the appropriate TCP
and update MRU/data in TCC, but don't need to allocate line.
*/
transition(V, Data) {TagArrayRead, TagArrayWrite, DataArrayWrite} {
ut_updateTag;
wcb_writeCacheBlock;
sdr_sendDataResponse;
wada_wakeUpAllDependentsAddr;
// tracks # pending requests, so need to decrement here too
dt_deallocateTBE;
pr_popResponseQueue;
}
/*
Since the L2 now allows multiple loads from different CUs to proceed in
parallel to the directory, we may get Event:Data back when the line is
now in I because it has been evicted by an intervening request to the same
set index. In this case, send the response to the appropriate TCP without
affecting the TCC (essentially, treat it similar to a bypass request except
we also send the unblock back to the directory).
*/
transition(I, Data) {
sdr_sendDataResponse;
wada_wakeUpAllDependentsAddr;
// tracks # pending requests, so need to decrement here too
dt_deallocateTBE;
pr_popResponseQueue;
}
transition(A, Data, M) {TagArrayRead, TagArrayWrite, DataArrayWrite, AtomicALUOperation} {


@@ -1,5 +1,6 @@
/*
* Copyright (c) 2011-2015 Advanced Micro Devices, Inc.
* Copyright (c) 2023 Matthew D. Sinclair
* All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
@@ -53,10 +54,14 @@ machine(MachineType:TCP, "GPU TCP (L1 Data Cache)")
{
state_declaration(State, desc="TCP Cache States", default="TCP_State_I") {
I, AccessPermission:Invalid, desc="Invalid";
// Note: currently IV in the TCP is only for pending loads to a given cache
// line. Since the TCP is write through, stores should be allowed to pass
// through without requiring them to wait.
IV, AccessPermission:Invalid, desc="Going from I to V, waiting on TCC data";
V, AccessPermission:Read_Only, desc="Valid";
A, AccessPermission:Invalid, desc="Waiting on Atomic";
F, AccessPermission:Invalid, desc="Flushing; Waiting for Ack";
}
enumeration(Event, desc="TCP Events") {
@@ -102,6 +107,8 @@ machine(MachineType:TCP, "GPU TCP (L1 Data Cache)")
bool Dirty, desc="Is the data dirty (different than memory)?";
int NumPendingMsgs,desc="Number of acks/data messages that this processor is waiting for";
bool Shared, desc="Victim hit by shared probe";
bool isGLCSet, desc="Bypass L1 Cache";
bool isSLCSet, desc="Bypass L1 and L2 Cache";
}
structure(TBETable, external="yes") {
@@ -123,6 +130,7 @@ machine(MachineType:TCP, "GPU TCP (L1 Data Cache)")
void unset_tbe();
void wakeUpAllBuffers();
void wakeUpBuffers(Addr a);
void wakeUpAllBuffers(Addr a);
Cycles curCycle();
// Internal functions
@@ -292,10 +300,13 @@ machine(MachineType:TCP, "GPU TCP (L1 Data Cache)")
TBE tbe := TBEs.lookup(in_msg.LineAddress);
DPRINTF(RubySlicc, "%s\n", in_msg);
if (in_msg.Type == RubyRequestType:LD) {
if ((in_msg.isGLCSet || in_msg.isSLCSet) && is_valid(cache_entry)) {
// Read requests with GLC or SLC bit set should not cache in the L1.
// They need to bypass the L1 and go to the L2. If an entry exists
// in the L1, it needs to be evicted
// Read requests with GLC or SLC bit set should not cache in the L1.
// They need to bypass the L1 and go to the L2. If an entry exists in
// the L1, it needs to be evicted, and if no entry or invalid entry in
// the L1, still need to bypass. The LoadBypassEvict Event handles
// both cases in its transitions below, so call LoadBypassEvict for
// both.
if ((in_msg.isGLCSet || in_msg.isSLCSet)) {
trigger(Event:LoadBypassEvict, in_msg.LineAddress, cache_entry, tbe);
}
else {
@@ -469,6 +480,15 @@ machine(MachineType:TCP, "GPU TCP (L1 Data Cache)")
check_allocate(TBEs);
TBEs.allocate(address);
set_tbe(TBEs.lookup(address));
// pass GLC/SLC information along
if (mandatoryQueue_in.isReady(clockEdge())) {
peek(mandatoryQueue_in, RubyRequest) {
DPRINTF(RubySlicc, "Address: %p\n", address);
tbe.isGLCSet := in_msg.isGLCSet;
tbe.isSLCSet := in_msg.isSLCSet;
}
}
}
action(d_deallocateTBE, "d", desc="Deallocate TBE") {
@@ -507,6 +527,10 @@ machine(MachineType:TCP, "GPU TCP (L1 Data Cache)")
responseToTCP_in.dequeue(clockEdge());
}
action(st_stallAndWaitRequest, "st", desc="Stall and wait on the address") {
stall_and_wait(mandatoryQueue_in, address);
}
action(l_loadDoneHit, "ldh", desc="local load done (hits in TCP)") {
assert(is_valid(cache_entry));
if (use_seq_not_coal) {
@@ -525,6 +549,20 @@ machine(MachineType:TCP, "GPU TCP (L1 Data Cache)")
}
}
action(ldmi_loadDoneMissInv, "ldmi",
desc="local load done (misses in TCP and line was evicted)") {
// since the line was evicted, we can't rely on data from the cache entry,
// so use the data from the response message
peek(responseToTCP_in, ResponseMsg) {
DataBlock tmp := in_msg.DataBlk;
if (use_seq_not_coal) {
sequencer.readCallback(address, tmp, false, MachineType:L1Cache);
} else {
coalescer.readCallback(address, MachineType:L1Cache, tmp);
}
}
}
action(ad_atomicDone, "ad", desc="atomic done") {
assert(is_valid(cache_entry));
coalescer.atomicCallback(address, MachineType:L1Cache, cache_entry.DataBlk);
@@ -601,6 +639,10 @@ machine(MachineType:TCP, "GPU TCP (L1 Data Cache)")
L1cache.setMRU(address);
}
action(wada_wakeUpAllDependentsAddr, "wada", desc="Wake up any requests waiting for this address") {
wakeUpAllBuffers(address);
}
// action(zz_recycleMandatoryQueue, "\z", desc="recycle mandatory queue") {
// mandatoryQueue_in.recycle(clockEdge(), cyclesToTicks(recycle_latency));
// }
@@ -629,11 +671,19 @@ machine(MachineType:TCP, "GPU TCP (L1 Data Cache)")
// Stalling transitions do NOT check the tag array...and if they do,
// they can cause a resource stall deadlock!
transition({A}, {Load, Atomic, StoreThrough}) { //TagArrayRead} {
z_stall;
// if another request arrives for the same cache line that has a pending
// atomic or load, put it on the wakeup buffer instead of z_stall'ing it. By
// doing so we reduce resource contention since they won't try again every cycle
// and will instead only try again once woken up
transition({A, IV}, {Load, LoadBypassEvict, Atomic, Store, StoreThrough, Flush}) {
st_stallAndWaitRequest;
}
transition(I, Load) {TagArrayRead} {
// if we have a load that misses, allocate TBE entry and transition to IV
// to prevent subsequent requests to same cache line from also going to TCC
// while this request is pending
transition(I, Load, IV) {TagArrayRead} {
t_allocateTBE;
n_issueRdBlk;
uu_profileDataMiss;
p_popMandatoryQueue;
@@ -691,14 +741,38 @@ machine(MachineType:TCP, "GPU TCP (L1 Data Cache)")
p_popMandatoryQueue;
}
transition(I, TCC_Ack, V) {TagArrayRead, TagArrayWrite, DataArrayRead, DataArrayWrite} {
a_allocate;
w_writeCache;
l_loadDoneMiss;
// if we got a response for a load where the line is in I, then
// another request must have come in that replaced the line in question in
// the cache. Thus, complete this request without allocating the line, but
// still deallocate TBE and wakeup any dependent addresses.
// (Note: this assumes TCC_AckWB is what stores use)
transition(I, TCC_Ack) {TagArrayRead, TagArrayWrite} {
wada_wakeUpAllDependentsAddr;
// NOTE: Because we invalidated the cache line, the assert in l_loadDoneMiss
// will fail -- unlike atomics, which automatically go to I when the line
// returns, loads do not automatically go to I. Resolve this by passing the
// data from the message.
ldmi_loadDoneMissInv;
d_deallocateTBE;
pr_popResponseQueue;
}
transition(I, Bypass, I) {
// if line is currently in IV, then TCC_Ack is returning the data for a
// pending load, so transition to V, deallocate TBE, and wakeup any dependent
// requests so they will be replayed now that this request has returned.
transition(IV, TCC_Ack, V) {TagArrayRead, TagArrayWrite, DataArrayRead, DataArrayWrite} {
a_allocate;
w_writeCache;
wada_wakeUpAllDependentsAddr;
l_loadDoneMiss;
d_deallocateTBE;
pr_popResponseQueue;
}
// if a bypass request arrives back at the TCP, regardless of whether the line
// is in I (from the bypass request) or IV (from a subsequent non-bypassing
// load), retain the current state and complete the bypassing request.
transition({I, IV}, Bypass) {
rb_bypassDone;
pr_popResponseQueue;
}
@@ -710,12 +784,13 @@ machine(MachineType:TCP, "GPU TCP (L1 Data Cache)")
}
transition(A, TCC_Ack, I) {TagArrayRead, DataArrayRead, DataArrayWrite} {
d_deallocateTBE;
a_allocate;
w_writeCache;
ad_atomicDone;
pr_popResponseQueue;
ic_invCache;
wada_wakeUpAllDependentsAddr;
d_deallocateTBE;
pr_popResponseQueue;
}
transition(V, TCC_Ack, V) {TagArrayRead, DataArrayRead, DataArrayWrite} {
@@ -732,20 +807,22 @@ machine(MachineType:TCP, "GPU TCP (L1 Data Cache)")
ic_invCache;
}
// if a line with a pending load gets evicted, transition the line to I and
// invalidate it.
transition(IV, Repl, I) {TagArrayRead, TagArrayWrite} {
ic_invCache;
}
transition({V,I}, Flush, F) {TagArrayFlash} {
a_allocate;
sf_setFlush;
p_popMandatoryQueue;
}
transition(A, Flush) {
z_stall;
}
transition({I, V}, Evict, I) {TagArrayFlash} {
inv_invDone;
p_popMandatoryQueue;
ic_invCache;
p_popMandatoryQueue;
}
transition(A, Evict) {TagArrayFlash} {
@@ -753,8 +830,11 @@ machine(MachineType:TCP, "GPU TCP (L1 Data Cache)")
p_popMandatoryQueue;
}
// if a line is in IV and a TCC_AckWB comes back, we must have had a WT
// store followed by a load. Thus, complete the store without affecting
// TBE or line state.
// TCC_AckWB only snoops TBE
transition({V, I, A}, TCC_AckWB) {
transition({V, I, IV, A}, TCC_AckWB) {
wd_wtDone;
pr_popResponseQueue;
}


@@ -154,7 +154,7 @@ machine(MachineType:Directory, "AMD Baseline protocol")
bool Dirty, desc="Is the data dirty?";
int NumPendingAcks, desc="num acks expected";
MachineID OriginalRequestor, desc="Original Requestor";
MachineID WTRequestor, desc="WT Requestor";
MachineID CURequestor, desc="CU that initiated the request";
bool Cached, desc="data hit in Cache";
bool MemData, desc="Got MemData?",default="false";
bool wtData, desc="Got write through data?",default="false";
@@ -170,7 +170,9 @@ machine(MachineType:Directory, "AMD Baseline protocol")
uint64_t probe_id, desc="probe id for lifetime profiling";
WriteMask writeMask, desc="outstanding write through mask";
int Len, desc="Length of memory request for DMA";
bool isSLCSet, desc="Bypass L1 and L2 Cache";
// GLC is passed along because it is needed in the return path
bool isGLCSet, desc="Bypass GPU L1 Cache";
bool isSLCSet, desc="Bypass GPU L1 and L2 Cache";
}
structure(TBETable, external="yes") {
@@ -470,6 +472,7 @@ machine(MachineType:Directory, "AMD Baseline protocol")
out_msg.ForwardRequestTime := tbe.ForwardRequestTime;
out_msg.ProbeRequestStartTime := tbe.ProbeRequestStartTime;
out_msg.OriginalResponder := tbe.LastSender;
out_msg.CURequestor := tbe.CURequestor;
out_msg.L3Hit := tbe.L3Hit;
DPRINTF(RubySlicc, "%s\n", out_msg);
}
@@ -498,6 +501,9 @@ machine(MachineType:Directory, "AMD Baseline protocol")
out_msg.ProbeRequestStartTime := tbe.ProbeRequestStartTime;
out_msg.OriginalResponder := tbe.LastSender;
out_msg.L3Hit := tbe.L3Hit;
out_msg.isGLCSet := tbe.isGLCSet;
out_msg.isSLCSet := tbe.isSLCSet;
out_msg.CURequestor := tbe.CURequestor;
DPRINTF(RubySlicc, "%s\n", out_msg);
}
}
@@ -527,9 +533,11 @@ machine(MachineType:Directory, "AMD Baseline protocol")
out_msg.ForwardRequestTime := tbe.ForwardRequestTime;
out_msg.ProbeRequestStartTime := tbe.ProbeRequestStartTime;
out_msg.OriginalResponder := tbe.LastSender;
if(tbe.atomicData){
out_msg.WTRequestor := tbe.WTRequestor;
}
out_msg.isGLCSet := tbe.isGLCSet;
out_msg.isSLCSet := tbe.isSLCSet;
if(tbe.atomicData){
out_msg.CURequestor := tbe.CURequestor;
}
out_msg.L3Hit := tbe.L3Hit;
DPRINTF(RubySlicc, "%s\n", out_msg);
}
@@ -555,6 +563,8 @@ machine(MachineType:Directory, "AMD Baseline protocol")
out_msg.InitialRequestTime := tbe.InitialRequestTime;
out_msg.ForwardRequestTime := curCycle();
out_msg.ProbeRequestStartTime := tbe.ProbeRequestStartTime;
out_msg.isGLCSet := tbe.isGLCSet;
out_msg.isSLCSet := tbe.isSLCSet;
DPRINTF(RubySlicc, "%s\n", out_msg);
}
}
@@ -565,13 +575,15 @@ machine(MachineType:Directory, "AMD Baseline protocol")
out_msg.addr := address;
out_msg.Type := CoherenceResponseType:NBSysWBAck;
out_msg.Destination.add(in_msg.Requestor);
out_msg.WTRequestor := in_msg.WTRequestor;
out_msg.CURequestor := in_msg.CURequestor;
out_msg.Sender := machineID;
out_msg.MessageSize := MessageSizeType:Writeback_Control;
out_msg.InitialRequestTime := in_msg.InitialRequestTime;
out_msg.ForwardRequestTime := curCycle();
out_msg.ProbeRequestStartTime := curCycle();
out_msg.instSeqNum := in_msg.instSeqNum;
out_msg.isGLCSet := in_msg.isGLCSet;
out_msg.isSLCSet := in_msg.isSLCSet;
}
}
}
@@ -582,7 +594,7 @@ machine(MachineType:Directory, "AMD Baseline protocol")
out_msg.addr := address;
out_msg.Type := CoherenceResponseType:NBSysWBAck;
out_msg.Destination.add(tbe.OriginalRequestor);
out_msg.WTRequestor := tbe.WTRequestor;
out_msg.CURequestor := tbe.CURequestor;
out_msg.Sender := machineID;
out_msg.MessageSize := MessageSizeType:Writeback_Control;
out_msg.InitialRequestTime := tbe.InitialRequestTime;
@@ -773,6 +785,8 @@ machine(MachineType:Directory, "AMD Baseline protocol")
out_msg.MessageSize := MessageSizeType:Control;
out_msg.Destination := probe_dests;
tbe.NumPendingAcks := out_msg.Destination.count();
out_msg.isGLCSet := in_msg.isGLCSet;
out_msg.isSLCSet := in_msg.isSLCSet;
DPRINTF(RubySlicc, "%s\n", out_msg);
APPEND_TRANSITION_COMMENT(" dc: Acks remaining: ");
APPEND_TRANSITION_COMMENT(tbe.NumPendingAcks);
@@ -877,6 +891,8 @@ machine(MachineType:Directory, "AMD Baseline protocol")
out_msg.MessageSize := MessageSizeType:Control;
out_msg.Destination := probe_dests;
tbe.NumPendingAcks := out_msg.Destination.count();
out_msg.isGLCSet := in_msg.isGLCSet;
out_msg.isSLCSet := in_msg.isSLCSet;
DPRINTF(RubySlicc, "%s\n", (out_msg));
APPEND_TRANSITION_COMMENT(" sc: Acks remaining: ");
APPEND_TRANSITION_COMMENT(tbe.NumPendingAcks);
@@ -931,6 +947,8 @@ machine(MachineType:Directory, "AMD Baseline protocol")
out_msg.ReturnData := false;
out_msg.MessageSize := MessageSizeType:Control;
out_msg.Destination := probe_dests;
out_msg.isGLCSet := in_msg.isGLCSet;
out_msg.isSLCSet := in_msg.isSLCSet;
tbe.NumPendingAcks := out_msg.Destination.count();
APPEND_TRANSITION_COMMENT(" ic: Acks remaining: ");
APPEND_TRANSITION_COMMENT(tbe.NumPendingAcks);
@@ -1017,7 +1035,7 @@ machine(MachineType:Directory, "AMD Baseline protocol")
tbe.writeMask.clear();
tbe.writeMask.orMask(in_msg.writeMask);
tbe.wtData := true;
tbe.WTRequestor := in_msg.WTRequestor;
tbe.CURequestor := in_msg.CURequestor;
tbe.LastSender := in_msg.Requestor;
}
if (in_msg.Type == CoherenceRequestType:Atomic ||
@@ -1032,10 +1050,14 @@ machine(MachineType:Directory, "AMD Baseline protocol")
assert(in_msg.Type == CoherenceRequestType:AtomicNoReturn);
tbe.atomicDataNoReturn := true;
}
tbe.WTRequestor := in_msg.WTRequestor;
tbe.CURequestor := in_msg.CURequestor;
tbe.LastSender := in_msg.Requestor;
tbe.isSLCSet := in_msg.isSLCSet;
}
// GPU read requests also need to track where the requestor came from
if (in_msg.Type == CoherenceRequestType:RdBlk) {
tbe.CURequestor := in_msg.CURequestor;
}
tbe.Dirty := false;
if (in_msg.Type == CoherenceRequestType:WriteThrough) {
tbe.DataBlk.copyPartial(in_msg.DataBlk,in_msg.writeMask);
@@ -1045,6 +1067,9 @@ machine(MachineType:Directory, "AMD Baseline protocol")
tbe.NumPendingAcks := 0;
tbe.Cached := in_msg.ForceShared;
tbe.InitialRequestTime := in_msg.InitialRequestTime;
tbe.isGLCSet := in_msg.isGLCSet;
tbe.isSLCSet := in_msg.isSLCSet;
DPRINTF(RubySlicc, "t_allocateTBE in_msg: %s, tbe: %s\n", in_msg, tbe.CURequestor);
}
}
@@ -1277,11 +1302,20 @@ machine(MachineType:Directory, "AMD Baseline protocol")
}
action(wada_wakeUpAllDependentsAddr, "wada", desc="Wake up any requests waiting for this address") {
DPRINTF(RubySlicc, "wada wakeup: 0x%x\n", address);
wakeUpAllBuffers(address);
}
/*
Currently z_stall is unused because it can lead to Protocol Stalls that
eventually lead to deadlock. Instead, it is recommended to use
st_stallAndWaitRequest in combination with a wakeupBuffer call (e.g.,
wada_wakeUpAllDependentsAddr) to put the pending requests to sleep rather
than causing head-of-line blocking -- wada_wakeUpAllDependentsAddr should wake
the request up once the request preventing it from completing is done.
action(z_stall, "z", desc="...") {
}
*/
// TRANSITIONS
transition({BL, BDR_M, BDW_M, BS_M, BM_M, B_M, BP, BDR_PM, BDW_PM, BS_PM, BM_PM, B_PM, BDR_Pm, BDW_Pm, BS_Pm, BM_Pm, B_Pm, B}, {RdBlkS, RdBlkM, RdBlk, CtoD}) {
@@ -1383,19 +1417,19 @@ machine(MachineType:Directory, "AMD Baseline protocol")
d_writeDataToMemory;
al_allocateL3Block;
pr_profileL3HitMiss; //Must come after al_allocateL3Block and before dt_deallocateTBE
wad_wakeUpDependents;
wada_wakeUpAllDependentsAddr;
dt_deallocateTBE;
pr_popResponseQueue;
}
transition(BL, StaleWB, U) {L3TagArrayWrite} {
dt_deallocateTBE;
wa_wakeUpAllDependents;
wada_wakeUpAllDependentsAddr;
pr_popResponseQueue;
}
transition({B, BDR_M, BDW_M, BS_M, BM_M, B_M, BP, BDR_PM, BDW_PM, BS_PM, BM_PM, B_PM, BDR_Pm, BDW_Pm, BS_Pm, BM_Pm, B_Pm}, {VicDirty, VicClean}) {
z_stall;
st_stallAndWaitRequest;
}
transition({U, BL, BDR_M, BDW_M, BS_M, BM_M, B_M, BP, BDR_PM, BDW_PM, BS_PM, BM_PM, B_PM, BDR_Pm, BDW_Pm, BS_Pm, BM_Pm, B_Pm, B}, WBAck) {


@@ -134,14 +134,14 @@ structure(CPURequestMsg, desc="...", interface="Message") {
int Acks, default="0", desc="Acks that the dir (mem ctrl) should expect to receive";
CoherenceRequestType OriginalType, default="CoherenceRequestType_NA", desc="Type of request from core fwded through region buffer";
WriteMask writeMask, desc="Write Through Data";
MachineID WTRequestor, desc="Node who initiated the write through";
MachineID CURequestor, desc="Node who initiated the request";
int wfid, default="0", desc="wavefront id";
uint64_t instSeqNum, desc="instruction sequence number";
bool NoWriteConflict, default="true", desc="write collided with CAB entry";
int ProgramCounter, desc="PC that accesses to this block";
bool isGLCSet, default="false", desc="GLC flag value in the request";
bool isSLCSet, default="false", desc="SLC flag value in the request";
bool functionalRead(Packet *pkt) {
// Only PUTX messages contains the data block
@@ -170,6 +170,8 @@ structure(NBProbeRequestMsg, desc="...", interface="Message") {
MachineID Requestor, desc="Requestor id for 3-hop requests";
bool NoAckNeeded, default="false", desc="For short circuting acks";
int ProgramCounter, desc="PC that accesses to this block";
bool isGLCSet, default="false", desc="GLC flag value in the request";
bool isSLCSet, default="false", desc="SLC flag value in the request";
bool functionalRead(Packet *pkt) {
return false;
@@ -240,7 +242,7 @@ structure(ResponseMsg, desc="...", interface="Message") {
bool L3Hit, default="false", desc="Did memory or L3 supply the data?";
MachineID OriginalResponder, desc="Mach which wrote the data to the L3";
MachineID WTRequestor, desc="Node who started the writethrough";
MachineID CURequestor, desc="Node who started the access";
bool NotCached, default="false", desc="True when the Region buffer has already evicted the line";


@@ -526,26 +526,16 @@ GPUCoalescer::readCallback(Addr address,
fatal_if(crequest->getRubyType() != RubyRequestType_LD,
"readCallback received non-read type response\n");
// Iterate over the coalesced requests to respond to as many loads as
// possible until another request type is seen. Models MSHR for TCP.
while (crequest->getRubyType() == RubyRequestType_LD) {
hitCallback(crequest, mach, data, true, crequest->getIssueTime(),
forwardRequestTime, firstResponseTime, isRegion);
delete crequest;
coalescedTable.at(address).pop_front();
if (coalescedTable.at(address).empty()) {
break;
}
crequest = coalescedTable.at(address).front();
}
hitCallback(crequest, mach, data, true, crequest->getIssueTime(),
forwardRequestTime, firstResponseTime, isRegion);
delete crequest;
coalescedTable.at(address).pop_front();
if (coalescedTable.at(address).empty()) {
coalescedTable.erase(address);
coalescedTable.erase(address);
} else {
auto nextRequest = coalescedTable.at(address).front();
issueRequest(nextRequest);
auto nextRequest = coalescedTable.at(address).front();
issueRequest(nextRequest);
}
}