CSC/ECE 506 Fall 2007/wiki3 8 38
Wiki: SCI. The IEEE Scalable Coherent Interface is a superset of the SSCI protocol we have been considering in class. A lot has been written about it, but it is still difficult to comprehend. Using SSCI as a starting point, explain why additional states are necessary, and give (or cite) examples that demonstrate how they work. Ideally, this would still be an overview of the working of the protocol, referencing more detailed documentation on the Web.
Introduction
Scalable Coherent Interface (SCI)
The scalable coherent interface has now become a chief hardware based approach to the cache coherence problem in shared memory multiprocessors. SCI is a directory based invalidate coherence protocol. The state of a cache block is distributed to the sharers of that block. Limits that are inherent in bus technology are easily avoided by SCI. The SCI protocol is to provide scalability, coherence and an interface. Scalability is to guarantee that the same mechanisms can be used in single processor systems and large highly parallel multiprocessors. Coherence is to guarantee efficient and integral use of cache memories in distributed shared memory. An interface provides a communication architecture that has multiple values to be brought into a single system and provide smooth inter-operation.
In SCI, every interface does not wait for the signal to propagate before it begins to send the next signal. Also, SCI utilizes multiple links so that, concurrently, several transfers can take place.
Usually, a directory entry is in either of the two states : ‘home’ or ‘gone’. If the state is in ‘home’, then memory can immediately satisfy requests to a block as the block has not been cached by any processor. If the state is ‘gone’, then the block has been cached by a processor and might even be modified. Now, the directory contains a pointer to the first processor in the sharing list for this particular block. Hence, on requesting the data, memory returns the pointer to the first processor on the shared list rather than the data itself. The processor asking for the block now forwards its request to the processor on the top of the shared list. Now, the requesting processor adds itself into the shared list as the new head of the list.
In the SCI protocol, any coherent transaction has three phases.
Memory read – When a processor misses in its cache, It asks for the block in the home directory in memory. If the state of the memory is ‘home’, then the main memory replies with the block to the requesting processor. If the state is ‘gone’, then the main memory returns the head of the shared list of processors for that particular block. Then, memory updates its pointer and puts the requesting processor as the new head.
Cache read – When memory returns a pointer toe the requesting processor instead of data, the processor forwards its request for the block to the head of the doubly linked list. When the cache receives, the head of the list returns the data which might have been modified. The head of the list changes its backward pointer to the requesting processor’s node. Now, the requesting processor becomes the head of the list and it has the cache block.
Cleanup – If the cache miss from the requesting processor is a store, the processor has to first invalidate all other cached copies and then only proceed with the store. The new head gives an invalidate request to the next address on the list i.e. to its next processor. This processor invalidates and gives back a pointer to the next processor on its list. The head of the list uses this new pointer and sends it an invalidate request. This goes on until a NULL pointer is returned. This is the cleanup process for invalidation of other cache copies of the block.
Simple Scalable Coherent Interface (SSCI)
In directory-based approach, Every memory block has associated directory information; it keeps track of copies of cached blocks and their states. On a miss, it finds the directory entry, looks it up, and communicates only with the nodes that have copies (if necessary).
There are mainly two approaches: Full-bit vector: For k processors, it maintains k presence bit and 1 dirty bit at the home node. Cache state is represented the same way as in bus-based designs (MSI, MESI, etc.). It has three cache states: EM (exclusive or modified), S (shared), U (unowned). Limitation is - Number of presence bits needed grows as the number of processors.
Memory-based schemes store the information about all cached copies at the home node of the block. Cache-based schemes distribute information about copies among the copies themselves. The home contains a pointer to one cached copy of the block. Each copy contains the identity of the next node that has a copy of the block. The location of the copies is therefore determined through network transactions.
Simple SCI (SSCI) retains similarity with full-bit vector protocol: MESI states in the cache; U, S, EM states in the memory directory; It replaces the presence bits with a pointer.
Why additional states are necessary?
Examples
3 DASH Cache Coherence Protocol
DASH (Directory Architecture for SHared memory) is a scalable shared-memory multiprocessor currently being developed at Stanford’s Computer Systems Laboratory. DASH protocol uses point-to-point messages sent between the processors and memories to keep caches consistent.
The DASH coherence protocol is an invalidation-based ownership protocol. A memory block can be in one of three states as indicated by the associated directory entry: (i) uncached-remote, that is not cached by any remote cluster; (ii) shared-remote, that is cached in an unmodified state by one or more remote clusters; or (iii) dirty-remote, that is cached in a modified state by a single remote cluster.
Please see in the figures below: Left - Flow of Read Request to remote memory with directory in dirty-remote state. Right - Flow of Read-Exclusive Request to remote memory with directory in shared-remote state.
Write back request: A dirty cache line that is replaced must be written back to memory. If the home of the memory block is the local cluster, then the data is simply written back to main memory. If the home cluster is remote, then a message is sent to the remote home which updates the main memory and marks the block uncached-remote.
2 Convex SPP-1000
The Convex SPP-1000 is the first commercial implementation of a new generation of scalable shared memory parallel computers with full cache coherence. It employs a hierarchical structure of processing communication and memory name-space management resources to provide a scalableNUMA environment. Ensembles of 8 HP PA-RISC7100 microprocessorsemploy an internal cross-bar switch and directory based cache coherence scheme to provide a tightly coupled SMP.Up to 16 processing ensembles are interconnected by a 4 ring network incorporating a full hardware implementation of the SCI protocol for a full system configuration of 128 processors.