CSC/ECE 506 Fall 2007/wiki3 2 aY3w
SCI. The IEEE Scalable Coherent Interface is a superset of the SSCI protocol we have been considering in class. A lot has been written about it, but it is still difficult to comprehend. Using SSCI as a starting point, explain why additional states are necessary, and give (or cite) examples that demonstrate how they work. Ideally, this would still be an overview of the working of the protocol, referencing more detailed documentation on the Web.
A Brief Overview of the SSCI Protocol
The Simple Scalable Coherence Interface (SSCI) protocol is a simplified version of the SCI protocol. While the SCI protocol has 29 stable states and many pending states, the SSCI protocol has only 3 states. Consequently, many race conditions are possible. Here we give a brief overview of the SSCI protocol, and elaborate on some possible race conditions in the final section.
The caches themselves follow the MESI protocol, and thus are in one of those four states. Each block of cached data has a list of sharers, or processors that have that block cached. This list is represented as a doubly linked list that is distributed among the sharers themselves.
In addition to the MESI state, each cache maintains a "next" pointer and a "previous" pointer. Each pointer points to another node in the list of sharers of the cached block of data. The head of the list will have a previous pointer value of 0, and the tail of the list will have a next pointer value of 0.
Each block of data has a "home directory," located in some (possibly remote) memory location. The home directory is computed from the block's memory address. If the block is cached, the home directory contains a pointer to the head of the list of sharers, along with state information for the block.
As previously stated, in the SSCI protocol the memory directory is in one of three states. They are:
- UNOWNED - the block is unowned by any cache.
- SHARED - the block is owned by multiple caches. Main memory is up to date.
- EXCLUSIVE/MODIFIED - the (possibly dirty) block is owned by a single cache. Main memory is possibly out of date.
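The SSCI bookkeeping described above can be sketched as two small records: a cache line carrying its MESI state plus next/previous pointers, and a home-directory entry carrying one of the three states plus the head pointer. The names here are ours, chosen for illustration; they are not part of any SSCI specification.

```python
from dataclasses import dataclass

NULL = 0  # a pointer value of 0 marks the ends of the sharing list

@dataclass
class CacheLine:
    mesi_state: str = "I"   # one of M, E, S, I
    prev: int = NULL        # 0 => this node is the head of the sharing list
    next: int = NULL        # 0 => this node is the tail

@dataclass
class DirectoryEntry:
    state: str = "UNOWNED"  # UNOWNED, SHARED, or EM (exclusive/modified)
    head: int = NULL        # id of the head sharer (0 when uncached)
```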
A Brief Overview of the SCI Protocol
In this section we give a very brief overview of the SCI protocol. This is not meant to be a complete description, but to give enough information to illustrate some key differences in the SSCI and SCI protocols.
Memory Directory States
Similar to the SSCI protocol, in SCI the memory directory is one of three states:
- HOME - no remote cache contains a copy of the block.
- FRESH - one or more remote caches may have a read-only copy of the block. Main memory is up to date.
- GONE - the cache at the head of the sharing list contains an exclusive (possibly dirty) copy of the block.
Cache States
Additionally, each cache has a state, either stable or pending. Stable states have two parts:
- Where the cache entry is located in the sharing list for that block
- ONLY - singleton list
- HEAD - head of sharing list of more than one node
- MID - not head or tail of sharing list of more than one node
- TAIL - tail of sharing list of more than one node
- State of the cached block
- DIRTY - modified and writable
- CLEAN - unmodified and writable, same contents as memory
- FRESH - data is readable and writable, but memory must be informed before the block is written
- COPY - unmodified and readable
- many others
These two parts together describe the cache state, e.g. ONLY-FRESH or MID-VALID.
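The two-part naming can be illustrated with a small hypothetical helper (the real SCI standard encodes each combination as a single state value; the sets and function below are ours):

```python
# List-position part and data-state part of an SCI stable cache state.
# "many others" in the standard are omitted; VALID appears in names
# like MID-VALID even though it is not in the short list above.
LIST_POSITIONS = {"ONLY", "HEAD", "MID", "TAIL"}
DATA_STATES = {"DIRTY", "CLEAN", "FRESH", "COPY", "VALID"}

def stable_state(position: str, data: str) -> str:
    """Combine the two parts into one stable-state name, e.g. ONLY-FRESH."""
    if position not in LIST_POSITIONS or data not in DATA_STATES:
        raise ValueError("not a recognised SCI stable state")
    return f"{position}-{data}"
```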
List Operations
SCI defines three primitive list operations for shared lists:
- List construction - adding a new node to the head of the sharing list
- Rollout - removing a node from the sharing list
- Invalidation - invalidate all other sharers. Can only be performed by the head node.
The head of the list of sharers can be thought of as the owner. Only the head can modify the cached data. If some other sharer wants to modify the data, it must first rollout of the list and reattach as the head.
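The two pointer-manipulating operations can be sketched as plain doubly-linked-list updates. This is a toy, single-threaded model using our own representation (a dict of nodes keyed by id, with 0 as null), not SCI's actual message protocol:

```python
NULL = 0

def construct(nodes, head, new_id):
    """List construction: attach new_id at the head; return the new head id."""
    nodes[new_id] = {"prev": NULL, "next": head}
    if head != NULL:
        nodes[head]["prev"] = new_id
    return new_id

def rollout(nodes, head, node_id):
    """Rollout: splice node_id out of the list; return the (possibly new) head."""
    prev, nxt = nodes[node_id]["prev"], nodes[node_id]["next"]
    if prev != NULL:
        nodes[prev]["next"] = nxt
    if nxt != NULL:
        nodes[nxt]["prev"] = prev
    del nodes[node_id]
    return nxt if node_id == head else head
```

Invalidation by the head would amount to repeated rollouts of every other node; in real SCI each of these steps is a message exchange, which is exactly where the pending states of the next section come in.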
Why Additional States are Needed
The principal difference between SCI and SSCI is that SCI provides mechanisms for resolving race conditions and preserving serialization.
Example 1
Suppose some processor, say P1, issues a read request and misses on some block that is not cached by any processor. So, the block will have to be fetched from main memory (or possibly some remote cache), and the line will be put into P1's local cache. Space will have to be allocated in P1's cache, usually by evicting some other line.
In the SSCI protocol, P1 will send a READ message to the home directory. The home directory will reply with the data and set its state to EM, and its head pointer to 1. Upon receipt, P1 will put the block in the allocated line, set its state to E (from the MESI protocol), and its next/previous pointers to 0.
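A minimal sketch of how the home directory might serve such a request follows. The message names (READ, ReplyD, ReplyID) follow the text; the dictionary representation and control flow are illustrative assumptions:

```python
def home_read(directory, requester):
    """Serve a READ arriving at the home node for one block."""
    if directory["state"] == "UNOWNED":
        directory["state"] = "EM"      # exactly one owner after this reply
        directory["head"] = requester
        return ("ReplyD", requester)   # reply carries the data
    if directory["state"] == "EM":
        # the copy at the head may be dirty; redirect the requester there
        return ("ReplyID", directory["head"])
    # SHARED: memory is up to date, so the home can reply with data
    # (prepending the requester to the sharing list is omitted here)
    return ("ReplyD", requester)
```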
Consider what might happen if two processors both read miss concurrently.
- P1 sends a READ request to the home node.
- Concurrently, P2 sends a READ request to the home node.
- The home node receives the request from P1.
- The home node sets its state to EM, its head pointer to 1 (P1), and sends the data to P1.
- The home node receives the request from P2.
- The home node sends a ReplyID to P2, informing it that P1 "owns" the data.
- P2 receives the ReplyID and sends WB+Int+UpdPtr to P1.
- P1 receives the message from P2.
- ERROR! P1 hasn't yet received the ReplyD from the home directory!
Due to network latency, it's possible that network messages are received "out of order," causing race conditions such as the one above.
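The reordering hazard can be replayed in a few lines. Here P1's cache is modeled as a tiny state function; the message names follow the text, while the sticky RACE marker is our own device for flagging the error:

```python
def p1_handle(line_state, msg):
    """P1's view: what happens as each message arrives at its cache."""
    if line_state == "RACE":
        return "RACE"                  # once broken, stay broken
    if msg == "ReplyD":
        return "E"                     # block installed, MESI Exclusive
    if msg == "WB+Int+UpdPtr" and line_state == "I":
        return "RACE"                  # P1 owns nothing to write back yet
    return line_state

in_order  = ["ReplyD", "WB+Int+UpdPtr"]   # home's reply arrives first: fine
reordered = ["WB+Int+UpdPtr", "ReplyD"]   # P2's message overtakes it: race
```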
Now let's consider the same scenario in SCI.
- P1 and P2 both read miss.
- P1 and P2 set aside space for the line to be read in (possibly by evicting another line).
- P1 and P2 concurrently set the state of that line to a "pending" state, and issue a read request to the home node.
- The home node receives the request from P1. Since the block is not cached anywhere, the memory directory will be in state HOME.
- The home node sets its state to FRESH, its head pointer to 1, and replies to P1 with the data.
- The home node receives P2's READ. Since the home node is in state FRESH, its data is up to date.
- The home node sets its head pointer to 2 and replies to P2 with the data, and the "old" head pointer (P1).
- P2 receives the data, but the action is not complete until it receives "permission" from P1 to become the new head of the list.
- P2 remains in a pending state and sends a request to P1 to become the new head of the list.
- P1 receives the request from P2, but since it hasn't yet received the reply from the home directory, it is still in a pending state.
- P1 "extends" the pending list backwards. That is, P2 will be put at the head of the list, but in a "pending" state.
- P1 receives the reply from the home directory and becomes the "true" head of the list. Even though P2 is ahead of P1 in the list, it's not the "true head," but a part of the "pending list."
- P1 passes the "true head" status on to P2.
By maintaining a "pending list" and a "true head", read requests are atomic and we avoid the race condition sufferred under the SSCI protocol (albeit by adding more complexity).
Example 2
Suppose some block of data is being shared among several caches. Our linked list of sharers is P1-P2-P3-P4-P5. P3 needs to purge the line containing that block of data.
Under the SSCI protocol, P3 will send an UpdPtr message to the processors represented by its previous and next pointers (P2 and P4, respectively). P2 will set its next pointer to P4, and P4 will set its previous pointer to P2.
Consider what might happen if P3 and P4 concurrently purge their lines of shared data.
- P3 sends UpdPtr(P3->P2) and UpdPtr(P3->P4) and purges its data
- Concurrently, P4 sends UpdPtr(P4->P3) and UpdPtr(P4->P5) and purges its data
- P2 receives the UpdPtr from P3 and sets its "next" pointer to P4
- ERROR! P4 does not have the line cached!
Since P4 has already purged and is no longer in the list of sharers, we have a race condition.
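The dangling pointer is easy to reproduce. In this toy replay (our own dict representation, with 0 as the null pointer), both nodes snapshot their neighbours before either pointer update is delivered, as the concurrent purge allows:

```python
# sharing list P1-P2-P3-P4-P5
nodes = {i: {"prev": i - 1, "next": i + 1 if i < 5 else 0} for i in range(1, 6)}

# P3 and P4 each snapshot their neighbours, then purge concurrently
p3_prev, p3_next = nodes[3]["prev"], nodes[3]["next"]   # P2 and P4
p4_prev, p4_next = nodes[4]["prev"], nodes[4]["next"]   # P3 and P5
del nodes[3]
del nodes[4]

# P3's UpdPtr to its old predecessor is delivered
nodes[p3_prev]["next"] = p3_next   # P2.next = P4 -- but P4 is gone!

dangling = nodes[2]["next"] != 0 and nodes[2]["next"] not in nodes
```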
Now let's examine the same scenario under the SCI protocol. In the SCI protocol, a node that wants to "roll out" is put in a pending state until it receives acknowledgment from both the next and previous nodes.
- P3 and P4 both put themselves in a pending state.
- P3 sends UpdPtr(P3->P2) and UpdPtr(P3->P4)
- Concurrently, P4 sends UpdPtr(P4->P3) and UpdPtr(P4->P5)
- Deadlock! Neither P3 nor P4 will respond to the other while in a pending state.
- Resolution: the node nearer to the tail of the list (P4) is given priority.
- P3 must wait until P4 completes its rollout, giving a shared list of P1-P2-P3-P5.
- P3 rolls out.
So, pending states are necessary to preserve a correct representation of the shared list, but even then we must deal with deadlock resolution.
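The tie-break itself is simple enough to state as a one-line rule (the function name and list-of-ids representation are ours): of two adjacent nodes trying to roll out, the one nearer the tail goes first.

```python
def rollout_priority(order, a, b):
    """Return (first, second): the node nearer the tail rolls out first."""
    return (a, b) if order.index(a) > order.index(b) else (b, a)
```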
Links
http://www.scizzl.com/WhatIsSCI.html
http://www.lrr.in.tum.de/~gerndt/home/Teaching/scalable_shared_memory_systems/Kapitel8.pdf
http://csdl2.computer.org/persagen/DLAbsToc.jsp?resourcePath=/dl/mags/mi/&toc=comp/mags/mi/1992/01/m1toc.xml&DOI=10.1109/40.124376
http://www.hcs.ufl.edu/SCI_Internal/