CSC/ECE 506 Spring 2010/ch 11 maf
Real Cache Coherence Protocols
DASH Coherence Protocol
The DASH multiprocessor used a two-level coherence protocol, relying on a snoopy bus to ensure cache coherence within a cluster and a directory-based protocol to ensure coherence across clusters. The protocol uses a Remote Access Cache (RAC) at each cluster, which essentially consolidates memory blocks from remote clusters into a single cache on the local snoopy bus. When a request is issued for a block from a remote cluster that is not in the RAC, the bus request is denied, which forces the requestor to retry, and the request is forwarded to the owner. The owner supplies the block to the RAC, so when the requestor eventually retries, the block is waiting in the RAC.
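The deny, forward, and retry sequence can be sketched as a toy model. This is only an illustration under simplifying assumptions: the class name `RemoteAccessCache`, the `fetch_from_owner` callback, and the `deliver_replies` step are hypothetical and are not taken from Lenoski et al.; the real hardware masks bus arbitration rather than returning a value.

```python
class RemoteAccessCache:
    """Toy model of a cluster's RAC (hypothetical names, not DASH hardware)."""

    def __init__(self, fetch_from_owner):
        self.blocks = {}       # address -> data already delivered to the RAC
        self.pending = set()   # addresses with an outstanding remote request
        self.fetch_from_owner = fetch_from_owner  # assumed stand-in for the network

    def bus_request(self, addr):
        """Return the data on a hit, or None to tell the requestor to retry."""
        if addr in self.blocks:
            return self.blocks[addr]
        # Miss: deny the bus request and forward it to the owner (only once).
        self.pending.add(addr)
        return None

    def deliver_replies(self):
        """Model the owner's replies arriving at the RAC."""
        for addr in sorted(self.pending):
            self.blocks[addr] = self.fetch_from_owner(addr)
        self.pending.clear()


remote_memory = {0x40: "block-0x40"}
rac = RemoteAccessCache(remote_memory.__getitem__)
first = rac.bus_request(0x40)    # miss: request forwarded, requestor must retry
rac.deliver_replies()            # owner supplies the block to the RAC
second = rac.bus_request(0x40)   # the retry now hits in the RAC
```

Here `first` is `None` (the denied request) and `second` is the block supplied by the owner.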
There are too many cases to express the protocol succinctly as a finite state machine. Instead, the protocol for servicing Read and ReadX requests is summarized in the following subsections. For a more detailed elaboration, Lenoski et al. describe the protocol in pseudo-code.
Read
// "Figure 7: Normal flow of read request bus transaction."
// From Lenoski et al. (see References).
if (Data held locally in shared state by processor or RAC)
    Other cache(s) supply data for fill;
else if (Data held locally in dirty state by processor or RAC) {
    Dirty cache supplies data for fill and goes to shared state;
    if (Memory Home is Local)
        Writeback Data to main memory;
    else
        RAC takes data in shared-dirty state;
}
else if (Memory home is Local) {
    if (Directory entry state != Dirty-Remote)
        Memory supplies read data;
    else {
        Allocate RAC entry, mask arbitration and force retry;
        Forward Read Request to Dirty Cluster;
        PCPU on Dirty Cluster issues read request;
        Dirty cache supplies data and goes to shared state;
        DC sends shared data reply to local cluster;
        Local RC gets reply and unmasks processor arbitration;
        Upon local processor read, RC supplies data and the RAC entry goes to shared state;
        Directory entry state = Shared-Remote;
    }
}
else /* Memory home is Remote */ {
    Allocate RAC entry, mask arbitration and force retry;
    Local DC sends read request to home cluster;
    if (Directory entry state != Dirty-Remote) {
        Directory entry state = Shared-Remote, update vector;
        Home DC sends reply to local RC;
        Local RC gets reply and unmasks processor arbitration;
    }
    else {
        Home DC forwards Read Request to dirty cluster;
        PCPU on dirty cluster issues read request and DC sends reply to local cluster and sharing writeback to home;
        Local RC gets reply and unmasks processor arbitration;
        Home DC gets sharing writeback, writes back dirty data, Directory entry state = Shared-Remote, update vector;
    }
    Upon local processor read, RC supplies the data and the RAC entry goes to shared state;
}
When a Read request is issued, it first goes to the snoopy bus of the local cluster. If the block is already in the local RAC or in a local processor's cache, then one of these caches supplies the data. If the block was dirty and the local cluster is the home cluster, then the owner will also have to flush to main memory and transition to the shared state. If the block was dirty and the local cluster is a remote cluster, then the owner flushes to the local RAC which becomes the new owner. In this last case, locally the block will be shared, but remotely it will be considered dirty.
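The local-cluster cases in this paragraph can be written as a small decision function. This is a minimal sketch: the function name `local_read`, its arguments, and the string labels are illustrative assumptions rather than anything defined by the DASH papers.

```python
def local_read(holder_state, home_is_local):
    """Resolve a Read on the local snoopy bus (illustrative sketch).

    holder_state: "shared", "dirty", or None if no local cache or RAC has the block.
    Returns (supplier, new state of the previous holder, destination of dirty data).
    """
    if holder_state == "shared":
        return ("local cache", "shared", None)
    if holder_state == "dirty":
        if home_is_local:
            # Home cluster: the dirty data is also flushed to main memory.
            return ("local cache", "shared", "main memory")
        # Remote cluster: the RAC becomes the owner in the shared-dirty state.
        return ("local cache", "shared", "RAC (shared-dirty)")
    # Not held locally: the request must be resolved beyond the local caches.
    return (None, None, None)


home_case = local_read("dirty", home_is_local=True)
remote_case = local_read("dirty", home_is_local=False)
```

The `remote_case` result corresponds to the situation where the block looks shared inside the cluster while the directory still records it as dirty.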
If the data is not already held locally, but the local cluster is the home cluster and the block is not held in the dirty state by a remote cluster, then memory supplies the data. However, if the block is dirty in a remote cluster, the bus request is denied, prompting a retry. Meanwhile, the request is forwarded to the owning cluster. The owner flushes the block and transitions to the shared state. The directory transitions to the shared state as well.
If the data is not already held locally and the local cluster is not the home cluster, the bus request is denied, prompting a retry. Meanwhile, the request is forwarded to the home cluster. If the block is dirty in another cluster, the home forwards the request to the owning cluster; the owner sends the block to the requestor's RAC, writes it back to the home cluster, and transitions to the shared state. Otherwise, the home cluster sends the data to the requestor's RAC. In either case, the directory transitions to the shared state.
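The directory side of a Read that misses on the local bus can be sketched as follows. This is a simplified model under stated assumptions: the `DirEntry` class, the `read_miss` function, integer cluster ids, and the message labels are all hypothetical, and RAC allocation, retries, and arbitration masking are left out.

```python
from dataclasses import dataclass, field


@dataclass
class DirEntry:
    state: str = "Uncached-Remote"              # or "Shared-Remote", "Dirty-Remote"
    sharers: set = field(default_factory=set)   # remote clusters holding the block


def read_miss(entry, requester, home):
    """Handle a Read that missed on the requester's bus; return the messages sent."""
    msgs = []
    remote = requester != home
    if remote:
        msgs.append((requester, home, "read request"))
    if entry.state == "Dirty-Remote":
        (owner,) = entry.sharers                 # exactly one dirty cluster
        msgs.append((home, owner, "forwarded read request"))
        msgs.append((owner, requester, "read reply"))
        if remote:
            msgs.append((owner, home, "sharing writeback"))
        entry.state = "Shared-Remote"            # the old owner keeps a shared copy
    elif remote:
        msgs.append((home, requester, "read reply"))
        entry.state = "Shared-Remote"
    # Otherwise home memory supplies the data and the directory is unchanged.
    if remote:
        entry.sharers.add(requester)
    return msgs


entry = DirEntry("Dirty-Remote", {2})
msgs = read_miss(entry, requester=1, home=0)
```

In this example cluster 1 reads a block whose home is cluster 0 and which is dirty in cluster 2. Four messages result, and the directory ends in `Shared-Remote` with clusters 1 and 2 in the sharing vector.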
ReadX
if (Data held locally in dirty state by processor or RAC)
    Dirty cache supplies Read-Exclusive fill data and invalidates self;
else if (Memory Home is Local) {
    switch (Directory entry state) {
    case Uncached-Remote :
        Memory supplies data, any locally cached copies are invalidated;
        break;
    case Shared-Remote :
        RC allocates an entry in RAC with DC specified invalidate acknowledge count;
        Memory supplies data, any locally cached copies are invalidated;
        Local DC sends invalidate request to shared clusters;
        Dir. entry state = Uncached-Remote, update vector;
        Upon receipt of all acknowledges RC deallocates RAC entry;
        break;
    case Dirty-Remote :
        Allocate RAC entry, mask arbitration and force retry;
        Forward Read-Exclusive Request to dirty cluster;
        PCPU at dirty cluster issues Read-Ex request, Dirty cache supplies data and invalidates self;
        DC in dirty cluster sends reply to local RC;
        Local RC gets reply from dirty cluster and unmasks processor arbitration;
        Upon local processor re-Read-Ex, RC supplies data, RAC entry is deallocated and Dir. entry state = Uncached-Remote, update vector;
    }
}
else /* Memory Home is Remote */ {
    RC allocates RAC entry, masks arbitration and forces retry;
    Local DC sends Read-Exclusive request to home;
    switch (Directory entry state) {
    case Uncached-Remote :
        Home memory supplies data, any locally cached copies are invalidated, Home DC sends reply to local RC;
        Directory entry state = Dirty-Remote, update vector;
        Local RC gets Read-Ex reply with zero invalidation count and unmasks processor for arbitration;
        Upon local processor re-Read-Ex, RC supplies data and RAC entry is deallocated;
        break;
    case Shared-Remote :
        Home memory supplies data, any locally cached copies are invalidated, Home DC sends reply to local RC;
        Home DC sends invalidation requests to sharing clusters;
        Directory entry state = Dirty-Remote, update vector;
        Local RC gets reply with data and invalidate acknowledge count and unmasks processor for arbitration;
        Upon local processor re-Read-Ex, RC supplies data;
        Upon receipt of all acknowledges RC deallocates RAC entry;
        break;
    case Dirty-Remote :
        Home DC forwards Read-Ex request to dirty cluster;
        PCPU at dirty cluster issues Read-Ex request, Dirty cache supplies data and invalidates self;
        DC in dirty cluster sends reply to local RC with acknowledge count of one and sends Dirty Transfer request to home;
        Local RC gets reply and acknowledge count and unmasks processor for arbitration;
        Upon local processor re-Read-Ex, RC supplies data;
        Upon receipt of Dirty Transfer request, Home DC sends acknowledgement to local RC, Home Dir. entry state = Dirty-Remote, update vector;
        Upon receipt of acknowledge RC deallocates RAC entry;
    }
}
When a ReadX request is issued, it first goes to the snoopy bus of the local cluster. If the block is being held dirty in the local RAC or in a local processor's cache, then the owner supplies the data and transitions to the invalid state.
Otherwise, if the local cluster is the home cluster and the block is in the uncached or shared state in the directory, then memory supplies the data and other local caches transition to the invalid state. If the block is in the shared state in the directory, then the directory must also send out invalidation requests to the remote clusters that are in the sharing vector and transition the block to the uncached state in the directory. However, if the block is in the dirty state in the directory, the bus request is denied, prompting a retry. Meanwhile, the request is forwarded to the owning cluster. The owner flushes and transitions to the invalid state. The directory transitions the block to the uncached state.
Otherwise, if the local cluster is not the home cluster, the bus request is denied, prompting a retry. Meanwhile, the request is forwarded to the home cluster. If the block is in the uncached or shared state in the directory, the home cluster will send the data to the requestor's RAC and the directory transitions to the dirty state. If the block is in the shared state in the directory, the directory must also send invalidation requests to the sharing clusters. However, if the block is in the dirty state in the directory, the request is forwarded to the owning cluster. The owner sends the dirty block to the RAC of the requestor and an acknowledgement to home and transitions to the invalid state.
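The directory transitions for a ReadX that cannot be satisfied by a local dirty copy can be summarized in one function. This is a hedged sketch: the name `readx_miss`, the tuple it returns, and the integer cluster ids are illustrative assumptions, and acknowledge counting and RAC bookkeeping are omitted. Skipping the requester when invalidations are sent is also an assumption of this sketch.

```python
def readx_miss(state, sharers, requester, home):
    """Directory outcome of a ReadX (illustrative sketch).

    Returns (new directory state, new sharing vector,
             clusters sent invalidation requests, cluster supplying the data).
    """
    if state == "Dirty-Remote":
        (owner,) = sharers            # exactly one dirty cluster
        supplier = owner              # the owner supplies data and invalidates itself
        invalidate = set()
    else:
        supplier = home               # home memory supplies the data
        invalidate = set(sharers) - {requester}   # empty when Uncached-Remote
    if requester != home:
        # A remote cluster becomes the exclusive owner.
        return ("Dirty-Remote", {requester}, invalidate, supplier)
    # The home cluster holds the block, so no remote copies remain.
    return ("Uncached-Remote", set(), invalidate, supplier)


remote_writer = readx_miss("Shared-Remote", {2, 3}, requester=1, home=0)
home_writer = readx_miss("Dirty-Remote", {2}, requester=0, home=0)
```

In `remote_writer`, clusters 2 and 3 receive invalidation requests and cluster 1 becomes the dirty owner. In `home_writer`, the dirty copy in cluster 2 is supplied to the home cluster and the directory returns to the uncached state.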
References
- Daniel Lenoski, James Laudon, Kourosh Gharachorloo, Anoop Gupta, and John Hennessy (1990). "The directory-based cache coherence protocol for the DASH multiprocessor." In Proceedings of the 17th Annual International Symposium on Computer Architecture.