Real Cache Coherence Protocols

DASH Coherence Protocol

Overview

The DASH multiprocessor uses a two-level coherence protocol, relying on a snoopy bus to ensure cache coherence within cluster and a directory-based protocol to ensure coherence across clusters. The protocol uses a Remote Access Cache (RAC) at each cluster, which essentially consolidates memory blocks from remote clusters into a single cache on the local snoopy bus. When a request is issued for a block from a remote cluster that is not in the RAC, the request is denied but the request is also forwarded to the owner. The owner supplies the block to the RAC. Eventually, when the requestor retries, the block will be waiting in the RAC.

Because the local snoopy bus of the home cluster maintains coherence with the home memory, the directory does not keep track of the block's status in the home cluster. Instead, the directory keeps track of which remote clusters are holding the block. This is in contrast to the protocols discussed by Solihin 2008 which do not incorporate a local bus and therefore use directories which track the state of the block across all caches. The DASH directories are memory-based and use a full bit vector representation.

DASH uses a three state invalidation protocol for the caches with Dirty, Shared, and Invalid states. This is essentially the Modified-Shared-Invalid (MSI) protocol that Solihin 2008 describes. The three directory states are Dirty-Remote, Shared-Remote, and Uncached-Remote. These correspond roughly to the Exclusive/Modified, Shared, and Uncached directory states discussed in Chapter 11 of Solihin 2008. However, whereas the Exclusive/Modified state only indicates that a block could be dirty, a block in the Dirty-Remote state is always considered dirty. Furthermore, a block that is dirty in its home cluster will be in state Uncached-Remote in the directory. So the correspondence is inexact.

The DASH coherence protocol supports three different memory requests: Read, Writeback, and Read-Exclusive. Read and Read-Exclusive are equivalent to the Read and ReadX requests present in Solihin 2008. Writeback is similar to a Flush. However, a writeback is used when a cache wants to evict a dirty block. A writeback from a remote cluster ends with the block entering the Uncached-Remote state in the directory.

There are too many cases to express the protocol succinctly as a finite state machine. Instead, pseudocode for handling Read and Read-Exclusive requests from the appendix of Lenoski et al. 1990 follows. The pseudocode refers to a number of hardware subsystems that are beyond the scope of this article—the directory controller (DC), pseudo-CPU (PCPU), and reply controller (RC). Descriptions may be found in Lenoski et al. 1990.

Read

 // "Figure 7: Normal flow of read request bus transaction."
 // From Lenoski et al. 1990 (see References).
 
 if (Data held locally in shared state by processor or RAC)
   Other cache(s) supply data for fill;
 
 else if (Data held locally in dirty state by processor or RAC) {
   Dirty cache supplies data for fill and goes to shared state;
   if (Memory Home is Local)
     Writeback Data to main memory;
   else
     RAC takes data in shared-dirty state;
   }
 
 else if (Memory home is Local) {
   if (Directory entry state != Dirty-Remote)
     Memory supplies read data;
   else {
     Allocate RAC entry, mask arbitration and force retry;
     Forward Read Request to Dirty Cluster;
     PCPU on Dirty Cluster issues read request;
     Dirty cache supplies data and does to shared state;
     DC sends shared data reply to local cluster;
     Local RC gets reply and unmasks processor arbitration;
     Upon local processor read, RC supplies data and the
       RAC entry goes to shared state;
     Directory entry state = Shared-Remote;
     }
   }
 
 else /* Memory home is Remote */
   Allocate RAC entry, mask arbitration and force retry;
   Local DC sends read request to home cluster;
   if (Directory entry state != Dirty-Remote) {
     Directory entry state = Shared-Remote, update vector;
     Home DC sends reply to local RC;
     Local RC gets reply and unmasks processor arbitration;
   else {
     Home DC forwards Read Request to dirty cluster;
     PCPU on dirty cluster issues read request and DC sends
       reply to local cluster and sharing writeback to home;
     Local RC gets reply and unmasks processor arbitration;
     Home DC gets sharing writeback, writes back dirty data,
       Directory entry state = Shared-Remote, update vector;
     }
   Upon local processor read, RC supplies the data and the
     RAC entry goes to shared state;
   }

When a Read request is issued, it first goes to the snoopy bus of the local cluster. If the block is already in the local RAC or in a local processor's cache, then one of these caches supplies the data. If the block was dirty and the local cluster is the home cluster, then the owner will also have to flush to main memory and transition to the shared state. If the block was dirty and the local cluster is a remote cluster, then the owner flushes to the local RAC which becomes the new owner. In this last case, locally the block will be shared, but remotely it will be considered dirty.

If the data is not already held locally, but the local cluster is the home cluster and the block is not held in the dirty state by a remote cluster, then memory supplies the data. However, if the block is dirty in a remote cluster, the bus request is denied, prompting a retry. Meanwhile, the request is fowarded to the owning cluster. The owner flushes the block and transitions to the shared state. The directory transitions to the shared state as well.

If the data is not already held locally and the local cluster is not the home cluster, the bus request is denied, prompting a retry. Meanwhile, the request is forwarded to the home cluster. When the block is owned by a remote cluster, the request is forwarded to the owning cluster and the owner flushes the block and transitions to the shared state. The home cluster sends the data to the requestor's RAC and the directory transitions to the shared state.

ReadX

 // "Figure 9: Normal flow of read-exclusive request bus transaction."
 // From Lenoski et al. 1990 (see References).
 
 if (Data held locally in dirty state by processor or RAC)
   Dirty cache supplies Read-Exclusive fill data and
     invalidates self;
 
 else if (Memory Home is Local) {
   switch (Directory entry state) {
 
     case Uncached-Remote :
       Memory supplies data, any locally cached copies
         are invalidated;
       break;
 
     case Shared-Remote:
       RC allocates an entry in RAC with DC specified
         invalidate acknowledge count;
       Memory supplies data, any locally cached copies are
         invalidated;
       Local DC sends invalidate request to shared clusters;
       Dir. entry state = Uncached-Remote, update vector;
       Upon receipt of all acknowledges RC deallocates RAC
         entry;
       break;
 
     case Dirty-Remote :
       Allocate RAC entry, mask arbitration and force retry;
       Forward Read-Exclusive Request to dirty cluster;
       PCPU at dirty cluster issues Read-Ex request,
         Dirty cache supplies data and invalidates self;
       DC in dirty cluster sends reply to local RC;
       Local RC gets reply from dirty cluster and unmasks
         processor arbitration;
       Upon local processor re-Read-Ex, RC supplies data,
         RAC entry is deallocated and
         Dir. entry state = Uncached-Remote, update vector;
     }
   }
 else /* Memory Home is Remote */ {
   RC allocates RAC entry, masks arbitration and forces retry;
   Local DC sends Read-Exclusive request to home;
   switch (Directory entry state) {
 
     case Uncached-Remote :
       Home memory supplies data, any locally cached copies
         are invalidated, Home DC sends reply to local RC;
       Directory entry state = Dirty-Remote, update vector;
       Local RC gets Read-Ex reply with zero invalidation
         count and unmasks processor for arbitration;
       Upon local processor re-Read-Ex, RC supplies data and
         RAC entry is deallocated;
       break;
 
     case Shared-Remote :
       Home memory supplies data, any locally cached copies
         are invalidated, Home DC sends reply to local RC;
       Home DC sends invalidation requests to sharing
         clusters;
       Directory entry state = Dirty-Remote, update vector;
       Local RC gets reply with data and invalidate acknow-
         ledge count and unmasks processor for arbitration;
       Upon local processor re-Read-Ex, RC supplies data;
       Upon receipt of all acknowledge RC deallocates RAC
         entry;
       break;
 
     case Dirty-Remote :
       Home DC forwards Read-Ex request to dirty cluster;
       PCPU at dirty cluster issues Read-Ex request,
         Dirty cache supplies data and invalidates self;
       DC in dirty cluster sends reply to local RC with
         acknowledge count of one and sends Dirty Transfer
         request to home;
       Local RC gets reply and acknowledge count and unmasks
         processor for arbitration;
       Upon local processor re-Read-Ex, RC supplies data;
       Upon receipt of Dirty Transfer request, Home DC
         sends acknowledgement to local RC,
         Home Dir. entry state = Dirty-Remote, update vector;
       Upon receipt of acknowledge RC deallocates RAC entry;
     }
   }

When a ReadX request is issued, it first goes to the snoopy bus of the local cluster. If the block is being held dirty in the local RAC or in a local processor's cache, then the owner supplies the data and transitions to the invalid state.

Otherwise, if the local cluster is the home cluster and the block is in the uncached or shared state in the directory, then memory supplies the data and other local caches transistion to the invalid state. If the block is in the shared state in the directory, then the directory must also send out invalidation requests to the remote clusters that are in the sharing vector and transition the block to the uncached state in the directory. However, if the block is in the dirty state in the directory, the bus request is denied, prompting a retry. Meanwhile, the request is forwarded to the owning cluster. The owner flushes and transitions to the invalid state. The directory transitions the block to the uncached state.

Otherwise, if the local cluster is not the home cluster, the bus request is denied, prompting a retry. Meanwhile, the request is forwarded to the home cluster. If the block is in the uncached or shared state in the directory, the home cluster will send the data to the requestor's RAC and the directory transitions to the dirty state. If the block is in the shared state in the directory, the directory must also send invalidation requests to the sharing clusters. However, if the block is in the dirty state in the directory, the request is forwarded to the owning cluster. The owner sends the dirty block to the RAC of the requestor and an acknowledgement to home and transitions to the invalid state.

Handling Races

Lenoski et al. 1990 discuss two race conditions that may arise in the DASH multiprocessor. These essentially identical to two of the race conditions discussed in Chapter 11 of Solihin 2008. The first occurs when a Read from requester R is forwarded from home H to owner O, but O sends a Writeback to H before the forwarded Read arrives. This situation is diagrammed below.

 R --> Read --> H --> Read (Forward) --> O
 
                  O --> Writeback --> H

In this case, O no longer holds a valid copy of the data to send to R. Instead, O can simply deny the request by sending R a NAK (negative acknowledgement) message. R can then resend the Read request to H. Since ownership of the block returned to H while the forwarded Read was in transit, H can respond to R with the data. This is the same solution proposed by Solihin 2008. It is possible for R to starve if ownership of the block continues to be passed around, but this is unlikely to continue for long.

Another possible race occurs when the home node H replies with data (ReplyD) to a Read from requester R but an invalidation (Inv) arrives first.

 R --> Read --> H -----> ReplyD -----> R
 
     A --> ReadX --> H --> Inv --> R

Handled naively, R would invalidate the block in its cache but then write the stale data into its cache when it arrived later violating cache coherence. Instead, DASH uses the snoopy RAC to watch for invalidations of pending Reads. Any reply to a Read that was pending when an Inv of the block was observed is considered to be stale and handled by sending a NAK to R (rather than forwarding the data).

This is subtly different from the remedy proposed by Solihin 2008 which also relies on keeping track of outstanding transactions. Whereas DASH converts state replies to NAKs, Solihin 2008 suggests that the requester delay responding to the Inv until after the response to the pending Read arrives.

While Lenoski et al. 1990 express confidence that the DASH protocol handles all races correctly, they also suggest that the protocol is too complex to validate analytically. Instead, they tested it empirically against parallel programs with known results. This leaves open the possibility that races which violate cache coherence exist.

LimitLESS: Alewife Coherence Protocol

"Annotation of the state transition diagram," from Chaiken *et al.* 1991.
Transition Label	Input Message	Precondition	Directory Entry Change	Output Message(s)
1	i → RREQ	—	P = P ∪ {i}	RDATA → i
2	i → WREQ i → WREQ	P = {i} P = {}	— P = {i}	WDATA → i WDATA → i
3	i → WREQ i → WREQ	P = {k₁, …, k_n} ∧ i ∉ P P = {k₁, …, k_n} ∧ i ∈ P	P = {i}, AckCtr = n P = {i}, AckCtr = n - 1	∀k_j INV → k_j ∀k_j ≠ i INV → k_j
4	j → WREQ	P = {i}	P = {j}, AckCtr = 1	INV → i
5	j → RREQ	P = {i}	P = {j}, AckCtr = 1	INV → i
6	i → REPM	P = {i}	P = {}	—
7	j → RREQ j → WREQ j → ACKC j → REPM	— — AckCtr ≠ 1 —	— — AckCtr = AckCtr - 1 —	BUSY → j BUSY → j — —
8	j → ACKC j → UPDATE	AckCtr = 1, P = {i} P = {i}	AckCtr = 0 AckCtr = 0	WDATA → i WDATA → i
9	j → RREQ j → WREQ j → REPM

References

Daniel Lenoski, James Laudon, Kourosh Gharachorloo, Anoop Gupta, and John Hennessy (1990). "The directory-based cache coherence protocol for the DASH multiprocessor." In Proceedings of the 17th Annual International Symposium on Computer Architecture.

CSC/ECE 506 Spring 2010/ch 11 maf

Contents

Real Cache Coherence Protocols

DASH Coherence Protocol

Overview

Read

ReadX

Handling Races

LimitLESS: Alewife Coherence Protocol

References

Navigation menu

CSC/ECE 506 Spring 2010/ch 11 maf

Real Cache Coherence Protocols

DASH Coherence Protocol

Overview

Read

ReadX

Handling Races

LimitLESS: Alewife Coherence Protocol

References

Navigation menu

Search