CSC/ECE 506 Fall 2007/wiki3 2 aY3w
SCI. The IEEE Scalable Coherent Interface is a superset of the SSCI protocol we have been considering in class. A lot has been written about it, but it is still difficult to comprehend. Using SSCI as a starting point, explain why additional states are necessary, and give (or cite) examples that demonstrate how they work. Ideally, this would still be an overview of the working of the protocol, referencing more detailed documentation on the Web.
A Brief Overview of the SSCI Protocol
The Simple Scalable Coherence Interface (SSCI) protocol is a simplified version of the SCI protocol. While the SCI protocol has 29 stable states and many pending states, the SSCI protocol has only 3 states. Consequently, many race conditions are possible. Here we give a brief overview of the SSCI protocol, and elaborate on the possible race conditions in the final section.
As previously states, in the SSCI protocol the memory directory is in one of three states. They are:
- UNOWNED - the block is unowned by any cache.
- SHARED - the block is owned by multiple caches. Main memory is up to date.
- EXCLUSIVE/MODIFIED - the (possibly dirty) block is owned by a single cache. Main memory is possibly out of date.
A Brief Overview of the SCI Protocol
Memory Directory States
Similar to the SSCI protocol, in SCI the memory directory is one of three states:
- HOME - no remote cache contains a copy of the block.
- FRESH - one of more remote caches may have a read-only copy of the block. Main memory is up to date.
- GONE - the cache at the head of the sharing list contains an exclusive (possibly dirty) copy of the block.
Cache States
Additionally, each cache has a state, either stable or pending. Stable states have two parts:
- Where the cache entry is located in the sharing list for that block
- ONLY - singleton list
- HEAD - head of sharing list of more than one node
- MID - not head or tail of sharing list of more than one node
- TAIL - tail of sharing list of more than one node
- State of the cached block
- DIRTY - modified and writable
- CLEAN - unmodified and writable, same contents as memory
- FRESH - data is readable and writable, but memory must be informed before memory is written
- COPY - unmodified and readable
- many others
These two parts together describe the cache state, e.g. FRESH-ONLY or MID-VALID.
List Operations
SCI defines three primitive list operations for shared lists:
- List construction - adding a new node to the head of the sharing list
- Rollout - removing a node from the sharing list
- Invalidation - invalidate all other sharers. Can only be performed by the head node.
Why Additional States are Needed
The principle difference between SCI and SSCI is that SCI provides mechanisms for resolving race conditions and preserving serialization.
Example 1
Suppose some processor, say P1, issues a read request and misses on some block that is not cached by any processor. So, the block will have to be fetched from main memory (or possibly some remote cache), and the line will be put into P1's local cache. Space will have to be allocated in P1's cache, usually by evicting some other line.
In the SSCI protocol, P1 will send a READ message to the home directory. The home directory will reply with the data and set its state to EM, and its head pointer to 1. Upon receipt, P1 will put the block in the allocated line, set its state to E (from the MESI protocol), and its next/previous pointers to 0.
Consider what might happen if two processors both read miss concurrently.
- P1 sends a READ request to the home node.
- Concurrently, P2 sends a READ request to the home node.
- The home node receives the request from P1.
- The home node sets its state to EM, its head pointer to 1 (P1), and sends the data to P1.
- The home node receives the request from P2.
- The home node sends a ReplyID to P2, informing it that P1 "owns" the data.
- P2 receives the ReplyID and sends WB+Int+UpdPtr to P1.
- P1 receives the message from P2
- ERROR! P1 hasn't yet received the ReplyD from the home directory!
Due to network latency, it's possible that network messages are received "out of order," causing race conditions such as the one above.
Now let's consider the same scenario in SCI.
P1 and P2 both read miss. Both set aside space for the line to be read in, set the state of that line to a "pending" state, and issue the read request to the home node. When the home node receives the request from P1, the memory directory will be in state HOME (since the block is not cached anywhere). It will set its state to FRESH, set its head pointer to 1, and reply with the data.
Now P2's request is processed by the home node. Since the home node is in state FRESH, its data is up to date. It sets its head pointer to 2 and replies to P2 with the data, and the "old" head pointer (P1).
P2 receives the data, but the action is not complete until it receives "permission" from P1 to become the new head of the list. So, P2 remains in a pending state and sends a request to P1 to become the new head of the list. Recall that in our scenario, due to network latency, P1 still hasn't received the initial response from the home directory; P1 is still in a pending state. Under the SCI protocol, if a processor is in a pending state when it receives a list operation request (such as becoming the new head of the list), it will extend the pending list backwards. That is, P2 will be put at the head of the list, but in a "pending" state. When P1 completes its operation and becomes the "true" head of the list, it can then pass the "true head" status on to P2. In this way, read requests are atomic and we avoid the race condition sufferred under the SSCI protocol (albeit by adding more complexity).
Example 2
Suppose some block of data is being shared among several caches. Our linked list of sharers is P1-P2-P3-P4-P5. P3 needs to purge the line containing that block of data.
Under the SSCI protocol, P3 will send an UpdPtr message to the processors represented by the previous and next pointers (P2 and P4 respectively). P2 will set its next pointer to P4, and P4 will set its next pointer to P2.
Consider what might happen if P3 and P4 concurrently purge their lines of shared data.
- P3 sends UpdPtr(P3->P2) and UpdPtr(P3->P4) and purges its data
- Concurrently, P4 sends UpdPtr(P4->P3) and UpdPtr(P4->P5) and purges its data
- P2 received the UpdPtr from P3 and sets its "next" pointer to P4
- ERROR! P4 does not have the line cached!
Since P4 has already purged and is no longer in the list of sharers, we have a race condition.
Now let's examine the same scenario under the SCI protocol. In the SCI protocol, a node that wants to "roll out" is put in a pending state until it receives acknowledgment from both the next and previous nodes.
- P3 and P4 both put themselves in a pending state.
- P3 sends UpdPtr(P3->P2) and UpdPtr(P3->P4)
- Concurrently, P4 sends UpdPtr(P4->P3) and UpdPtr(P4->P5)
- Deadlock! Neither P3 nor P4 will respond to the other while in a pending state.
- Resolution: the node nearer to the tail of the list (P4) is given priority.
- P3 must wait until P4 completes its rollout, giving a shared list of P1-P2-P3-P5.
- P3 rolls out.
So, pending states are necessary to preserve a correct representation of the shared list, but even then we must deal with deadlock resolution.
Links
http://www.scizzl.com/WhatIsSCI.html
http://www.lrr.in.tum.de/~gerndt/home/Teaching/scalable_shared_memory_systems/Kapitel8.pdf