CSC/ECE 506 Fall 2007/wiki3 8 38
Wiki: SCI. The IEEE Scalable Coherent Interface is a superset of the SSCI protocol we have been considering in class. A lot has been written about it, but it is still difficult to comprehend. Using SSCI as a starting point, explain why additional states are necessary, and give (or cite) examples that demonstrate how they work. Ideally, this would still be an overview of the working of the protocol, referencing more detailed documentation on the Web.
Introduction
Scalable Coherent Interface (SCI)
The Scalable Coherent Interface has become a leading hardware-based approach to the cache coherence problem in shared-memory multiprocessors. SCI is a directory-based invalidation coherence protocol in which the state of a cache block is distributed among the sharers of that block. SCI thereby avoids the limits that are inherent in bus technology. The protocol has three goals, reflected in its name. Scalability means that the same mechanisms can be used in single-processor systems and in large, highly parallel multiprocessors. Coherence means that cache memories can be used efficiently, and without loss of data integrity, in distributed shared memory. Interface means a communication architecture that lets components from multiple sources be brought into a single system and interoperate smoothly.
In SCI, an interface does not wait for one signal to propagate before it begins to send the next. SCI also uses multiple links, so several transfers can take place concurrently.
Usually, a directory entry is in one of two states: ‘home’ or ‘gone’. In the ‘home’ state the block has not been cached by any processor, so memory can satisfy a request for it immediately. In the ‘gone’ state the block has been cached by at least one processor and might even have been modified; the directory then holds a pointer to the first processor in the sharing list for that block. Hence, on a request for the data, memory returns the pointer to the first processor on the sharing list rather than the data itself. The requesting processor forwards its request to the processor at the head of the sharing list and adds itself to the list as the new head.
In the SCI protocol, any coherent transaction has three phases.
Memory read – When a processor misses in its cache, it asks the block's home directory in memory for the block. If the directory state is ‘home’, main memory replies to the requesting processor with the block. If the state is ‘gone’, main memory instead returns a pointer to the head of the sharing list for that block. Memory then updates its pointer so that the requesting processor becomes the new head.
Cache read – When memory returns a pointer to the requesting processor instead of data, the processor forwards its request for the block to the head of the doubly linked sharing list. On receiving the request, the head returns the data, which might have been modified, and changes its backward pointer to the requesting processor’s node. The requesting processor is now the head of the list and holds the cache block.
Cleanup – If the miss was caused by a store, the processor must first invalidate all other cached copies and only then proceed with the store. The new head sends an invalidate request to the next processor on the list. That processor invalidates its copy and returns a pointer to the next processor on the list. The head uses this pointer to send the next invalidate request, and so on until a NULL pointer is returned. This cleanup process invalidates all other cached copies of the block.
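The three phases can be illustrated with a minimal Python sketch. It is a toy model under stated assumptions, not the IEEE SCI specification: the class names HomeEntry and CacheLine are hypothetical, all caches are kept in one dictionary instead of being distributed across nodes, messages are modeled as direct method calls, and the requesting node is assumed not to hold a copy already.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class CacheLine:
    """One node's cached copy, linked into the doubly linked sharing list."""
    data: int
    forward: Optional[int] = None   # next node, toward the tail
    backward: Optional[int] = None  # previous node, toward the head


@dataclass
class HomeEntry:
    """Home directory entry for a single block (hypothetical toy model)."""
    memory_data: int = 0
    state: str = "home"             # 'home' or 'gone'
    head: Optional[int] = None      # head of the sharing list, if any
    caches: Dict[int, CacheLine] = field(default_factory=dict)

    def read_miss(self, node: int) -> int:
        # Phase 1, memory read: memory hands back data ('home') or the old
        # head pointer ('gone'), then records the requester as the new head.
        old_head = self.head
        self.state, self.head = "gone", node
        if old_head is None:
            self.caches[node] = CacheLine(self.memory_data)
        else:
            # Phase 2, cache read: the old head supplies the data and points
            # its backward pointer at the requester.
            old = self.caches[old_head]
            old.backward = node
            self.caches[node] = CacheLine(old.data, forward=old_head)
        return self.caches[node].data

    def write_miss(self, node: int, value: int) -> int:
        """Store miss: become head, then clean up. Returns copies invalidated."""
        self.read_miss(node)
        # Phase 3, cleanup: each invalidated node returns its forward pointer;
        # the head keeps going until it gets back a null pointer.
        invalidated = 0
        nxt = self.caches[node].forward
        while nxt is not None:
            nxt = self.caches.pop(nxt).forward
            invalidated += 1
        line = self.caches[node]
        line.forward, line.data = None, value
        return invalidated

    def sharing_list(self) -> List[int]:
        order, node = [], self.head
        while node is not None:
            order.append(node)
            node = self.caches[node].forward
        return order
```

For instance, after read misses from nodes 1 and 2 the list runs from node 2 (head) to node 1, and a later store miss from node 3 invalidates both copies and leaves node 3 alone on the list.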
Simple Scalable Coherent Interface (SSCI)
In a directory-based approach, every memory block has associated directory information that keeps track of the cached copies of the block and their states. On a miss, the requester finds the directory entry, looks it up, and communicates only with the nodes that have copies (if necessary).
There are mainly two approaches, memory-based and cache-based. The full-bit-vector scheme is a memory-based one: for k processors, it maintains k presence bits and 1 dirty bit per block at the home node. Cache state is represented the same way as in bus-based designs (MSI, MESI, etc.). The directory has three states: EM (exclusive or modified), S (shared), U (unowned). Its limitation is that the number of presence bits needed grows linearly with the number of processors.
Memory-based schemes store the information about all cached copies at the home node of the block. Cache-based schemes distribute information about copies among the copies themselves. The home contains a pointer to one cached copy of the block. Each copy contains the identity of the next node that has a copy of the block. The location of the copies is therefore determined through network transactions.
Simple SCI (SSCI) retains much of the full-bit-vector protocol: MESI states in the cache and U, S, EM states in the memory directory. It replaces the presence bits with a pointer.
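The storage difference at the home node can be made concrete with a small calculation. This is only a back-of-the-envelope sketch: the function names are hypothetical, state bits and any encoding of a null pointer are ignored, and the forward and backward pointers that a cache-based scheme stores in each cache line are not counted.

```python
def full_bit_vector_bits(num_nodes: int) -> int:
    """Home-node bits per block: k presence bits plus 1 dirty bit."""
    return num_nodes + 1


def head_pointer_bits(num_nodes: int) -> int:
    """Home-node bits per block for one pointer that can name any of k nodes."""
    # (k - 1).bit_length() equals ceil(log2(k)) for k >= 2.
    return max(1, (num_nodes - 1).bit_length())


for k in (16, 64, 1024):
    print(k, full_bit_vector_bits(k), head_pointer_bits(k))
```

The bit-vector entry grows linearly with the node count, while the single pointer grows only logarithmically.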
Why are additional states necessary?
Correctness requirements (coherence and consistency) necessitate the additional states in the protocol.
On a scalable multiprocessor without coherent caches, the main-memory module determines the ordering of writes: the order in which writes become visible to all processors is the order in which they reach memory.
For example, if two processors issue read-exclusive requests for a particular word, the home will provide both requestors with the location of the dirty node, but there is no guarantee which request will reach the dirty node first. This creates the need for an additional busy state in the directory.
This can be solved in one of the following ways:
- Hold requests at the home or requestor node and serve them in the order of their arrival.
- If the block is busy, reject any further request to it; the rejected request is retried later.
- If the directory is busy, forward the request to the dirty node, which then serializes the execution of the requests.
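The second option above (reject and retry) can be sketched as follows. This is an illustrative model only: the class BusyDirectory and the "ACCEPTED" and "NACK" reply strings are hypothetical names, and message passing is again reduced to method calls.

```python
from typing import Optional


class BusyDirectory:
    """Directory entry with a transient busy state for one block (toy model)."""

    def __init__(self) -> None:
        self.owner: Optional[int] = None     # node holding the dirty copy
        self.busy_for: Optional[int] = None  # requester being served, if any

    def request_exclusive(self, node: int) -> str:
        # While one read-exclusive is in flight the entry is busy, so any
        # other request is rejected and must be retried by its sender.
        if self.busy_for is not None:
            return "NACK"
        self.busy_for = node
        return "ACCEPTED"

    def complete(self, node: int) -> None:
        # The transaction finishes: ownership moves and the busy state clears.
        if self.busy_for != node:
            raise ValueError("no transaction in flight for this node")
        self.owner = node
        self.busy_for = None
```

With two racing requesters, the first is accepted and the second receives a NACK; once the first transaction completes, the second succeeds on retry, so the directory alone decides the order.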
In non-coherent scalable multiprocessors,
- For write atomicity in an invalidation-based protocol, the current owner of a block has to wait until it receives all invalidation acks, and only then can it read or write the new value.
- For write completion, the current owner of the block needs to wait for an ack from memory.
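The waiting implied by the write-atomicity condition can be pictured as a set of outstanding acknowledgements. The sketch below is an assumption-laden illustration: PendingWrite is a hypothetical name, and the sharers are assumed to be known up front, which a linked-list protocol discovers only as it walks the list.

```python
from typing import Iterable, Set


class PendingWrite:
    """Tracks the invalidation acks a writer is still waiting for (toy model)."""

    def __init__(self, sharers: Iterable[int]) -> None:
        self.waiting_on: Set[int] = set(sharers)

    def ack(self, node: int) -> None:
        # An ack from a node that is not outstanding is simply ignored.
        self.waiting_on.discard(node)

    @property
    def may_use_block(self) -> bool:
        # The new value may be read or written only after every ack arrived.
        return not self.waiting_on
```

While waiting_on is non-empty, the owner is in an intermediate condition that is neither plain shared nor plain modified, which is one reason the full protocol needs more states than SSCI.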
Examples
1 DASH Cache Coherence Protocol
DASH (Directory Architecture for SHared memory) is a scalable shared-memory multiprocessor developed at Stanford’s Computer Systems Laboratory. The DASH protocol uses point-to-point messages sent between the processors and memories to keep caches consistent.
The DASH coherence protocol is an invalidation-based ownership protocol. A memory block can be in one of three states, as indicated by the associated directory entry: (i) uncached-remote, that is, not cached by any remote cluster; (ii) shared-remote, that is, cached in an unmodified state by one or more remote clusters; or (iii) dirty-remote, that is, cached in a modified state by a single remote cluster.
Figures: Left - Flow of a Read Request to remote memory with the directory in the dirty-remote state. Right - Flow of a Read-Exclusive Request to remote memory with the directory in the shared-remote state.
Write back request: A dirty cache line that is replaced must be written back to memory. If the home of the memory block is the local cluster, then the data is simply written back to main memory. If the home cluster is remote, then a message is sent to the remote home which updates the main memory and marks the block uncached-remote.
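The three directory states and the write-back transition described above can be sketched as a small state model. Only what the text states is modeled; the names DirState, DashEntry and write_back are hypothetical and are not taken from the DASH implementation.

```python
from dataclasses import dataclass
from enum import Enum


class DirState(Enum):
    UNCACHED_REMOTE = "uncached-remote"  # not cached by any remote cluster
    SHARED_REMOTE = "shared-remote"      # unmodified in one or more remote clusters
    DIRTY_REMOTE = "dirty-remote"        # modified in exactly one remote cluster


@dataclass
class DashEntry:
    """Home directory entry plus the memory word it guards (toy model)."""
    state: DirState = DirState.UNCACHED_REMOTE
    memory_value: int = 0


def write_back(entry: DashEntry, data: int) -> None:
    """Remote write-back: update main memory and mark the block uncached-remote."""
    # Assumption: only a dirty-remote block can be written back by a remote cluster.
    if entry.state is not DirState.DIRTY_REMOTE:
        raise ValueError("write-back expected a dirty-remote block")
    entry.memory_value = data
    entry.state = DirState.UNCACHED_REMOTE
```

A dirty-remote entry that receives a write-back of 42 ends with memory_value 42 and state uncached-remote.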
References
1 The DASH Cache Coherence Protocol
Daniel Lenoski, James Laudon, Kourosh Gharachorloo, Anoop Gupta, John Hennessy. May 1990. ACM SIGARCH Computer Architecture News, Proceedings of the 17th Annual International Symposium on Computer Architecture (ISCA '90), Volume 18, Issue 3a. Publisher: ACM Press.