Expertiza_Wiki - User contributions [en]

CSC/ECE 506 Spring 2011/ch11 BB EP

2011-04-23T20:40:41Z

Beburrou: /* Solihin 11.4 Resolution */

__TOC__

(Unless otherwise noted, the contents of this article are derived from the IEEE standardization document titled "IEEE Standard for Scalable Coherent Interface (SCI)"[[#References|[1]]].)

== Background ==
The Scalable Coherent Interface (SCI) cache coherence protocol defines a set of states for memory, a set of states for cache lines, and a set of actions that transition between these states when processors access memory.

Three sets of these attributes are defined for minimal, typical, and full applications:

* '''Minimal''' - for ‘trivial but correct’ applications that require the presence of the memory in only one cache line. It does not enable read-sharing, and is appropriate for small multi-processors or applications that don’t require significant sharing.

* '''Typical''' - enables read-sharing of a memory location and provisions for efficiency, such as DMA transfers, local caching of data, and error recovery. This set adds an additional stable memory state (FRESH) and multiple cache states. This set will be the focus of this article going forward.

* '''Full''' - adds support for pair-wise sharing, QOLB lock bit synchronization, and cleansing and washing of cache lines.

== State Diagrams ==
=== Memory States ===
The memory states define the state of the memory block from the perspective of the home directory. This 2-bit field is maintained in the memory tag by the home directory along with the pointer to the head of the sharing list (forwId). This simple state model includes three stable states and one semi-stable state.

{|border="1" cellspacing="0" cellpadding="2"
!colspan="5"| Table 1: Memory States
|----
!State
!Description
!Minimal
!Typical
!Full
|----
|HOME
|no sharing list
|Y
|Y
|Y
|----
|FRESH
|sharing-list copy is the same as memory
|
|Y
|Y
|----
|GONE
|sharing-list copy may be different from memory
|Y
|Y
|Y
|----
|WASH*
|transitional state (GONE to FRESH)
|
|
|Y
|----
|}

These states are used by the coherency model to determine the state of the cache line when memory transactions are performed by individual processors. Likewise, these memory states are impacted when these memory transactions occur.
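As a rough illustration, Table 1 can be transcribed into a small lookup structure. This is only a sketch: the names <code>MemoryState</code>, <code>SUPPORTED_BY</code>, <code>states_for</code>, and <code>fits_in_bits</code> are hypothetical, and no actual bit encoding from the standard is implied.

```python
from enum import Enum


class MemoryState(Enum):
    """Memory states from Table 1 (values are the table descriptions)."""
    HOME = "no sharing list"
    FRESH = "sharing-list copy is the same as memory"
    GONE = "sharing-list copy may be different from memory"
    WASH = "transitional state (GONE to FRESH)"


# Which option sets define each state, transcribed from Table 1.
SUPPORTED_BY = {
    MemoryState.HOME: {"minimal", "typical", "full"},
    MemoryState.FRESH: {"typical", "full"},
    MemoryState.GONE: {"minimal", "typical", "full"},
    MemoryState.WASH: {"full"},
}


def states_for(option_set):
    """Return the memory states available in the given option set."""
    return [s for s in MemoryState if option_set in SUPPORTED_BY[s]]


def fits_in_bits(n_states, bits):
    """Check that n_states distinct values can be stored in a field of `bits` bits."""
    return n_states <= 2 ** bits
```

For example, <code>states_for("minimal")</code> yields only HOME and GONE, and all four states fit in the 2-bit field.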

Below is a state diagram that depicts the transition between these memory states:

[[Image:memory_state.png|center]]

=== Cache States ===
The cache-line states are maintained by each processor's cache-coherency controller. This 7-bit field is stored in each cache line in the sharing list, along with the pointer to the next sharing-list node (forwId) and the previous sharing-list node (backId). Seven bits allow up to 128 possible values, and SCI defines twenty-nine stable states for use in the minimal, typical, and full sets.

Stable states are those cache states that exist when a memory transaction is not in process. Their names are derived from a combination of the position of the node in the sharing list, such as ONLY, HEAD, MID, and TAIL, and the state of the data for that node, such as FRESH, CLEAN, DIRTY, VALID, STALE, etc.

====States of the Typical Set====

Following are the states defined for the Typical set. (Note that other states may be supported to provide certain interoperability with nodes implementing the Full set, but are not used when all nodes support the Typical set.)

* '''ONLY_DIRTY''' - only one processor has the memory block in its cache. This block is writable, and the processor has written (or intends to write) to it. This state is set when the processor requests the block with read/write privileges, and no other processor currently caches the block.
* '''ONLY_FRESH''' - only one processor has the memory block in its cache. This block is writable, but the processor has not written to it. This state is set when the processor requests the block with read privileges, and no other processor currently caches the block.
* '''HEAD_DIRTY''' - more than one processor has the memory block in its cache. This block is writable, and the processor has written (or intends to write) to it. This state is set when the processor requests the block with read/write privileges, and another processor already caches the block.
* '''HEAD_FRESH''' - more than one processor has the memory block in its cache. This block is writable, but the processor has not written to it. This state is set when the processor requests the block with read privileges, and another processor already caches the block.
* '''MID_VALID''' - more than two processors have the memory block in their caches, and it is readable. The processor cache with this state is neither the Head nor the Tail of the sharing list.
* '''TAIL_VALID''' - two or more processors have the memory block in their caches, and it is readable. The processor cache with this state is the Tail of the sharing list.
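The naming rule described above (list position plus data state) can be sketched in a few lines. The enum and helper names here are hypothetical and the enum values are just the state names; the standard's actual 7-bit encodings are not reproduced.

```python
from enum import Enum


class CacheState(Enum):
    """Stable cache-line states of the Typical set, as listed above."""
    ONLY_DIRTY = "ONLY_DIRTY"
    ONLY_FRESH = "ONLY_FRESH"
    HEAD_DIRTY = "HEAD_DIRTY"
    HEAD_FRESH = "HEAD_FRESH"
    MID_VALID = "MID_VALID"
    TAIL_VALID = "TAIL_VALID"


def split_state(state):
    """Split a state name into (position in sharing list, state of the data)."""
    position, data = state.name.split("_", 1)
    return position, data


# A 7-bit field can hold 2**7 distinct values.
TOTAL_ENCODINGS = 2 ** 7
```

Calling <code>split_state(CacheState.HEAD_FRESH)</code> returns the pair <code>("HEAD", "FRESH")</code>, mirroring how the state names are composed.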

Below is a state diagram that depicts these states:

[[Image:cache_states.png|center]]

The request sub-actions of memory transactions that expect a response from the directory include 17 variations of reads, writes, and locks based on the number of bytes requested and the coherency requirements. For convenience, the notation on the edges of the above graph is derived from the terms used in the SCI specification to describe the memory transactions.

The first value indicates the request to the directory by a node for memory access:
* '''Fetch R''' - indicates a request for a memory block with read privileges.
* '''Fetch RW''' - indicates a request for a memory block with read/write privileges.
* '''Data Modify''' - indicates that a processor (the HEAD or the ONLY processor caching the block) writes to the cache line.

The value in the parentheses indicates the memory state at the time of the request.
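The state descriptions earlier in this section imply a simple mapping from the request type, and whether another processor already caches the block, to the requester's resulting cache state. The sketch below encodes only that mapping as this article describes it; the function name and the boolean parameter are assumptions, and the standard's full transition rules are richer than this.

```python
def resulting_state(request, other_sharers):
    """Return the requester's cache state after a fetch, per the descriptions above.

    request:       "Fetch R" or "Fetch RW"
    other_sharers: True if another processor already caches the block
    """
    table = {
        ("Fetch R", False): "ONLY_FRESH",
        ("Fetch RW", False): "ONLY_DIRTY",
        ("Fetch R", True): "HEAD_FRESH",
        ("Fetch RW", True): "HEAD_DIRTY",
    }
    return table[(request, other_sharers)]
```

In every case the requester ends up at the front of the list, either as the ONLY node or as the new HEAD.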

== Race Conditions ==
In a distributed shared memory system with caching, the emergence of race conditions is extremely likely. This is mainly due to the lack of a bus for serialization of actions. It is further compounded by the problem of network errors and congestion.

The early invalidation case from Section 11.4 in Solihin is an excellent example of a race condition that can arise in a distributed system. Recall the diagram from the text, and the cache coherence actions, as shown below.

[[Image:EarlyInValidationRace.png|400px|center]]
The circled actions are as follows:
# ''A'' sends a read request to ''Home''.
# ''Home'' replies with data (but the message gets delayed).
# ''B'' sends a write request to ''Home''.
# ''Home'' sends an invalidation to ''A'', and it arrives before the ReplyD.
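The effect of the reordering can be seen in a toy model that applies the two messages at node ''A'' in different arrival orders. This is a deliberately simplified sketch (one boolean for ''A'''s copy, hypothetical message labels <code>"ReplyD"</code> and <code>"Inv"</code>), not a model of any real protocol implementation.

```python
def run(arrival_order):
    """Apply messages at node A in arrival order; return True if A ends with a copy."""
    a_has_copy = False
    for msg in arrival_order:
        if msg == "ReplyD":
            a_has_copy = True    # A installs the block it asked for
        elif msg == "Inv":
            a_has_copy = False   # A drops the block (a no-op if nothing is cached yet)
    return a_has_copy


# Order intended by Home: data first, then the invalidation caused by B's write.
intended = run(["ReplyD", "Inv"])

# Early invalidation: the invalidation overtakes the delayed data reply.
raced = run(["Inv", "ReplyD"])
```

In the intended order ''A'' ends without a copy; in the raced order ''A'' is left holding a copy that ''B'' is now free to overwrite.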

The SCI protocol has a way to handle this race condition, as well as many others, and the following sections will discuss how the SCI protocol design can prevent this race condition.

=== Prevention in the SCI Protocol ===
Race conditions are almost non-existent in the SCI protocol, due primarily to the protocol's design. A brief discussion of how this is accomplished follows, and then a walk-through of what happens when the early invalidation condition arises under SCI.
==== Atomic Transactions ====
SCI's primary method for preventing race conditions is the use of atomic transactions. A transaction is defined as a set of sub-actions necessary to complete some requested action, such as querying memory for data or invalidating a node's cache. This usually takes the form of a request-response pair, where the initiating node sends a request to another node and waits for the response. The transaction is not complete until the response returns. Suppose, for example, that node ''A'' is in the middle of a transaction with node ''B''. Node ''C'' then tries to make a request of node ''A''. Node ''A'' will respond to node ''C'' that it is busy, telling node ''C'' to try again, as shown in the following diagram.
[[Image:AtomicBusy.png|center]]
Thus, in the Early Invalidation race, node ''A'' would be in the middle of a transaction, having requested a block of data from memory and not yet having received a response, which would prevent node ''B'' from invalidating it. Note, however, that nodes will not stay in the busy state indefinitely. Doing so could lead to deadlocks, so all transactions have timeouts that eventually cause the transaction to fail, moving the node out of the busy state and making it able to respond to requests again.
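The busy-and-retry behavior, including the timeout that releases a stuck node, can be sketched as follows. The class, the tick-based timeout, and the retry helper are all assumptions made for illustration; the standard defines its own timing and retry mechanisms.

```python
class Node:
    """A node that rejects requests while it is inside a transaction."""

    def __init__(self, name, timeout=3):
        self.name = name
        self.timeout = timeout
        self.busy_ticks = 0  # greater than zero means a transaction is open

    def begin_transaction(self):
        self.busy_ticks = self.timeout

    def complete_transaction(self):
        self.busy_ticks = 0

    def tick(self):
        """Advance time by one step; the transaction fails when the count hits zero."""
        if self.busy_ticks > 0:
            self.busy_ticks -= 1

    def handle_request(self, sender):
        return "BUSY" if self.busy_ticks > 0 else "ACCEPTED"


def request_with_retry(target, sender, max_attempts=10):
    """Retry until accepted; return the attempt number that succeeded, or None."""
    for attempt in range(1, max_attempts + 1):
        if target.handle_request(sender) == "ACCEPTED":
            return attempt
        target.tick()  # time passes between retries
    return None
```

With a timeout of three ticks and no response ever arriving, a requester is told to retry three times and is accepted on its fourth attempt, once the timeout has cleared the busy state.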

==== Head Node ====
Another method for preventing race conditions is to have the ''Head'' node of a sharing list perform many of the coherence actions. As a result, only one node performs actions such as writes and invalidations of other sharers, which decreases the possibility of concurrent actions. If another node wants to write, for instance, it must become the ''Head'' node of the sharing list for the cache line to which it wants to write, displacing the current ''Head'' node as the node that can write. All state changes in the sharing list occur at the behest of the ''Head'' node, with the exception of nodes deleting themselves from the list.

When nodes are simply sharing a read-only copy of a line, as when the memory is in the ''FRESH'' state, the ''Head'' node is somewhat irrelevant. All the sharing nodes have their own cached value of the cache line. If any node wants to write, including the ''Head'' node, it must perform an additional action in order to do so. Likewise, when a sharing list is sharing a line of data in the dirty, or ''GONE'', memory state, as long as the ''Head'' node has not written to the line, all the sharing nodes stay in the list with their cached line. However, at the point the ''Head'' node wants to write, it can write its data immediately, but it then must invalidate all other shared copies, via the forward pointers in the sharing list.
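The purge performed by a writing ''Head'' node amounts to a walk along the forward pointers of a doubly linked list. The following is a minimal sketch under that assumption; <code>Sharer</code>, <code>build_list</code>, and <code>head_write_and_purge</code> are hypothetical names, and message exchange between nodes is collapsed into direct pointer updates.

```python
class Sharer:
    """One entry in a sharing list, with forward and backward pointers."""

    def __init__(self, name):
        self.name = name
        self.forw = None   # next node, toward the tail
        self.back = None   # previous node, toward the head
        self.valid = True


def build_list(names):
    """Link the named nodes into a sharing list; the first name is the Head."""
    nodes = [Sharer(n) for n in names]
    for prev, nxt in zip(nodes, nodes[1:]):
        prev.forw, nxt.back = nxt, prev
    return nodes


def head_write_and_purge(head):
    """Invalidate every sharer after the Head by following forward pointers."""
    purged = []
    node = head.forw
    while node is not None:
        nxt = node.forw            # remember the successor before unlinking
        node.valid = False
        node.forw = node.back = None
        purged.append(node.name)
        node = nxt
    head.forw = None               # the Head is now the only node in the list
    return purged
```

After the call, the former ''Head'' is the sole remaining sharer and every other copy has been marked invalid.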

This, however, does not protect against a race condition where the memory is ''FRESH'', the ''Head'' node wants to write, and another node also wants to join the list. The Memory Access mechanism prevents this condition.

==== Memory Access ====
The final way that the SCI protocol minimizes race conditions is by changing the directory structure. In Solihin 11.4, there is a ''Home'' node which keeps track of the cache state and has a certain amount of knowledge about the system as a result. In the SCI protocol, there is still a ''Home'' node of sorts, but this node is only responsible for keeping track of which node is the current ''Head''. Such a ''Home'' node is usually the node where the memory block physically resides. When a node wants to access a block of memory, it sends a request to the ''Home'' node of that memory block, assuming it isn't already in the sharing list for that block. In the event that the data is already being used, the ''Home'' node will include in its response the address of the ''Head'' node of the sharing list. This requirement necessarily serializes the order of access, as one access request is not serviced at the ''Home'' node until the previous request is finished, and the requests are processed in FIFO order. In addition, whichever node is the ''Head'' of the sharing list will be notified by the new ''Head'' when it is demoted. All parties involved in a potential race are thus talking to each other, preventing the race from occurring in the first place.

Because of this requirement, the projected race condition from the previous section, where the ''Head'' node is in a ''FRESH'' state and wants to write while another node also wants to read, cannot occur. This is because both nodes must make their request to the ''Home'' node; the ''Head'' node must request the ability to write, and the other node must request the data from the ''Home'' node. One of these requests will make it to the ''Home'' node first, resulting in the second request being deferred to the first node to reach the ''Home'' node. Such a scenario is pictured below.
[[Image:MemoryAccess.png|center]]
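The serialization at the ''Home'' node can be sketched as a FIFO queue plus a single head pointer. This is an illustrative model only: the <code>Home</code> class and its method names are assumptions, and data transfer and memory-state updates are omitted.

```python
from collections import deque


class Home:
    """Home node that only tracks the current Head and serves requests in FIFO order."""

    def __init__(self):
        self.head = None       # name of the current Head, or None if there is no list
        self.queue = deque()   # pending requests, oldest first

    def submit(self, node):
        self.queue.append(node)

    def service_next(self):
        """Serve the oldest request: the requester becomes Head and learns the old Head."""
        requester = self.queue.popleft()
        old_head = self.head
        self.head = requester
        return requester, old_head
```

If ''A'' and ''B'' both submit requests and ''A'''s arrives first, ''A'' is served with no previous ''Head'', and ''B'' is then told that ''A'' is the node it must contact.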

==== Solihin 11.4 Resolution ====
The three previous sections discussed individual parts of how the SCI protocol reduces race conditions. Putting all three of these parts together yields the following scenario which resolves the Early Invalidation race as described in the text.
[[Image:SolihinSCI.png|center]]

# ''A'' sends a request to ''Home'' for access to the memory block. It then goes into a busy state while it waits for a response.
# ''B'' also sends a request to ''Home'' for access to the same memory block. ''A'''s request is received first.
# ''Home'' responds to ''A'' with the data, but this response gets caught in network traffic and delayed. This response is sent before ''Home'' processes the request from ''B''.
# ''Home'' responds to the request from ''B'', telling it that ''A'' is the current ''Head'' node.
# ''B'' then sends a request to ''A'' to tell it to demote itself.
# ''A'' still hasn't received the response from ''Home'', so it is still in a transaction, and it tells ''B'' that it is busy. ''B'' will have to retry the request.

It should be noted that ''B'' will be in the ''PENDING''[[#Definitions_and_Terms|[definitions]]] state, so any requests to demote it from the ''Head'' position will be delayed. This could lead to a chain of nodes in the ''PENDING'' state, but as soon as node ''A'' receives its response from ''Home'' or times out, the requests in the chain will be successively accepted and processed.
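The six steps above can be replayed as a small trace. This sketch is a simplification under stated assumptions: the function name, the log strings, and the <code>reply_delay</code> parameter (how many of ''B'''s attempts occur before ''A'''s delayed reply arrives) are all hypothetical.

```python
def resolve_early_invalidation(reply_delay):
    """Replay the scenario; return (number of attempts B needed, event log)."""
    log = []

    # Steps 1-3: Home serves A first; A stays busy until the delayed reply arrives.
    head = "A"
    a_busy = True
    log.append("Home->A: data (delayed)")

    # Step 4: Home serves B, records B as the new Head, and points B at the old Head.
    old_head, head = head, "B"
    log.append(f"Home->B: head is {old_head}")

    # Steps 5-6: B asks A to demote itself and retries while A is busy.
    attempts = 0
    while True:
        attempts += 1
        if attempts > reply_delay:
            a_busy = False  # the delayed reply (or a timeout) has ended A's transaction
        if a_busy:
            log.append("A->B: busy, retry")
        else:
            log.append("A->B: demoted")
            break
    return attempts, log
```

The point of the trace is that ''B'''s request is never applied to ''A'' while ''A'''s own transaction is open, so the invalidation cannot overtake the data reply.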

=== Other Possible Race Conditions ===
The design of the SCI protocol prevents many race conditions from occurring, as has already been shown. Here are two other conditions that may arise, with a discussion of how the protocol resolves the issue.

==== Concurrent List Deletions ====
Even though the ''Head'' node performs most of the coherence actions for a sharing list, the individual nodes are able to invalidate themselves, such as when they evict the block from their cache. This is referred to in the SCI standard as a "deletion" from the sharing list. Deletion is accomplished by having the invalidating node "lock" itself and then inform its forward and back nodes that they should now point to each other. This "locking" is essentially another ''PENDING'' state. A problem could arise, then, when two neighboring nodes try to invalidate themselves at the same time, as they would both be locked and would not respond to each other's messages. In this case, though, the protocol specifies that the node that is closest to the tail takes priority (IEEE pg156)[[#1foot|[1]]]. So, the deadlock and/or race condition is averted by ordering the deletion from the tail towards the head.
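The tail-first priority rule can be sketched by ordering the competing deletions by list position. The sharing list is modeled here as a plain Python list from head to tail, and both function names are hypothetical; the message-level locking is not modeled.

```python
def deletion_order(sharing_list, deleting):
    """Order simultaneous deletions so the node nearest the tail goes first.

    sharing_list: node names from head (index 0) to tail (last index)
    deleting:     names of nodes trying to delete themselves at the same time
    """
    position = {name: i for i, name in enumerate(sharing_list)}
    return sorted(deleting, key=lambda n: position[n], reverse=True)


def apply_deletions(sharing_list, deleting):
    """Remove the deleting nodes one at a time in tail-first order."""
    remaining = list(sharing_list)
    for name in deletion_order(sharing_list, deleting):
        remaining.remove(name)
    return remaining
```

For a list H, X, Y, T in which neighbors X and Y both try to leave, Y (closer to the tail) is removed first and X second, leaving H and T linked to each other.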

==== Simultaneous Deletion and Invalidation ====
Another potential race condition occurs when the ''Head'' node tries to purge the list of sharers at the same time as a node in the list tries to remove itself from the list. Suppose that the ''Head'' node is trying to purge node ''A'', which has node ''B'' as its next node in the list, while at the same time, node ''B'' is trying to remove itself from the list. Node ''B'' needs to tell node ''A'' to update its next pointer, but the ''Head'' node will need to access node ''A'''s next pointer to know what the next node in the list is. An illustration is shown below. The ''1'' and the ''2'' represent the messages being sent from the respective node to node ''A'' that have not yet arrived.
[[Image:PurgeDelete.png|center]]
What will happen is that one of the messages will reach node ''A'' first. While ''A'' is responding to this message, it will tell other nodes that it is busy, in a similar manner to the atomic transactions described earlier. If node ''B'''s message arrives first, the message from the ''Head'' node is rejected. Node ''A'' will have its forward pointer changed at this point, so when the ''Head'' node resends its request, all will work as expected.
The second option is more difficult. If the ''Head'' node's message arrives first, node ''A'' will be invalidated, so that when the message from node ''B'' is sent again, node ''A'' will simply inform node ''B'' that it is no longer part of the list. As well, the ''Head'' node will now have its forward pointer pointing at node ''B''. Node ''A'' has thus been successfully purged, but we arrive at an impasse, as now the ''Head'' node will try to invalidate the locked node ''B'', while node ''B'' will try to tell the ''Head'' node to change its forward pointer, with both nodes being inside a transaction. This situation is illustrated by the following figure.
[[Image:PurgeDelete2.png|center]]
The protocol does not specify how to handle this situation, but it would make sense for the protocol to allow one of the messages through, presumably the ''Head'' node's message. Node ''B'' could simply respond in the affirmative, telling the ''Head'' node what the following node is, but this would only work if node ''B'' notifies the next node of the deletion after it notifies node ''A''. Otherwise, the node following ''B'' would have its back pointer pointing to the wrong node. The protocol standard seems to suggest that this is the case (IEEE, pg162)[[#1foot|[1]]].

== Definitions and Terms ==
* ''Head Node'' - The node at the beginning of the sharing list
* ''Home Node'' - The node responsible for keeping track of the ''Head'' node for a sharing list for a memory block, usually the node at which the memory block resides.
* ''PENDING state'' - The intermediate state for a node while it is attempting to become the ''Head'' node in the sharing list. This includes the transaction involved in querying the ''Home'' node and the transaction involved in demoting the current ''Head'' node.
* ''SCI'' - Scalable Coherent Interface

== References ==
[[#1body|1.]] "IEEE Standard for Scalable Coherent Interface (SCI)," IEEE Std 1596-1992, 1993. doi: 10.1109/IEEESTD.1993.120366. URL: http://ieeexplore.ieee.org.proxied.lib.ncsu.edu/stamp/stamp.jsp?tp=&arnumber=347683&isnumber=8049

[[#2body|2.]] [http://www.scizzl.com/HowSCIcohWorks.html How SCI Coherence Works]

CSC/ECE 506 Spring 2011/ch11 BB EP

2011-04-19T03:53:53Z

Beburrou: /* Concurrent List Deletions */

__TOC__

(Unless otherwise noted, the contents of this article is derived from the IEEE standardization document titled "IEEE Standard for Scalable Coherent Interface (SCI)"[[#References|[1]]] .)

== Background ==
The Scalable Coherent Interface (SCI) cache coherence protocol defines a set of states for memory, a set of states for cache lines, and a set of actions that transition between these states when processors access memory.

Three sets of these attributes are defined for minimal, typical, and full applications:

* '''Minimal''' - for ‘trivial but correct’ applications that require the presence of the memory in only one cache line. It does not enable read-sharing, and is appropriate for small multi-processors or applications that don’t require significant sharing.

* '''Typical''' - enables read-sharing of a memory location and provisions for efficiency, such as DMA transfers, local caching of data, and error recovery. This set adds an additional stable memory state (FRESH) and multiple cache states. This set will be the focus of this article going forward.

* '''Full''' - adds support for pair-wise sharing, QOLB lock bit synchronization, and cleansing and washing of cache lines.

== State Diagrams ==
=== Memory States ===
The memory states define the state of the memory block from the perspective of the home directory. This 2-bit field is maintained in the memory tag by the home directory along with the pointer to the head of the sharing list (forwId). This simple state model includes three stable states and one semi-stable state.

{|border="1" cellspacing="0" cellpadding="2"
!colspan="5"| Table 1: Memory States
|----
!State
!Description
!Minimal
!Typical
!Full
|----
|HOME
|no sharing list
|Y
|Y
|Y
|----
|FRESH
|sharing-list copy is the same as memory
|
|Y
|Y
|----
|GONE
|sharing-list copy may be different from memory
|Y
|Y
|Y
|----
|WASH*
|transitional state (GONE to FRESH)
|
|
|Y
|----
|}

These states are used by the coherency model to determine the state of the cache line when memory transactions are performed by individual processors. Likewise, these memory states are impacted when these memory transactions occur.

Below is a state diagram that depicts the transition between these memory states:

[[Image:memory_state.png|center]]

=== Cache States ===
The cache-line states are maintained by each processors cache-coherency controller. This 7-bit field is stored in each cache line in the sharing list, along with the pointer to the next sharing-list node (forwId) and the previous sharing-list node (backID). Seven bits enable up to 128 possible values, and SCI defines twenty-nine stable-states for use in the minimal, typical, and full sets.

Stable states are those cache states that exist when a memory transaction is not in process. Their names are derived from a combination of
the position of the node in the sharing list, such as ONLY, HEAD, MID, and TAIL, and the state of the data for that node, such as FRESH, CLEAN, DIRTY, VALID, STALE, etc.

====States of the Typical Set====

Following are the states defined for the Typical set. (Note that other states may be supported to provide certain interoperability with nodes implementing the Full set, but are not used when all nodes support the Typical set.)

* '''ONLY_DIRTY''' - only one processor has the memory block in its cache. This block is writable, and the processor has written (or intends) to write to it. This state is set when the processor requests the block with read/write privileges, and no other processor currently caches the block.
* '''ONLY_FRESH''' - only one processor had the memory block in its cache. This block is writeable, but processor has not written to it. This state is set when the processor requests the block with read privileges, and no other processor currently caches the block.
* '''HEAD_DIRTY''' - more than one processor has the memory block in its cache. This block is writable, and the processor has written (or intends) to write to it. This state is set when the processor requests the block with read/write privileges, and another processors already caches the block.
* '''HEAD_FRESH''' - more than one processor had the memory block in its cache. This block is writeable, but processor has not written to it. This state is set when the processor requests the block with read privileges, and another processors already caches the block.
* '''MID_VALID''' - more than two processors have the memory in its cache, and it is readable. The processor cache with this state is neither the Head or the Tail of the of the sharing list.
* '''TAIL_VALID''' - more than two processors have the memory in its cache, and it is readable. The processor cache with this state is the Tail of the of the sharing list.

Below is a state diagram that depicts these states:

[[Image:cache_states.png|center]]

The request sub-actions of memory transactions that expect a response from the directory include 17 variations of reads, writes, and locks based on the number of bytes requested and the coherency requirements. For convenience, the notation on the edges of the above graph is derived from the terms used in the SCI specification to describe the memory transactions.

The first value indicates the request to the directory by a node for memory access:
* '''Fetch R''' - indicates a request for a memory block with read privileges.
* '''Fetch RW''' - indicates a request for a memory block with read/write privileges.
* '''Data Modify''' - indicates that a processor (the HEAD or the ONLY processor caching the block) writes to the cache line.

The value in the parenthesis indicates the memory state at the time of the request.

== Race Conditions ==
In a distributed shared memory system with caching, the emergence of race conditions is extremely likely. This is mainly due to the lack of a bus for serialization of actions. It is further compounded by the problem of network errors and congestion.

The early invalidation case from Section 11.4 in Solihin is an excellent example of a race condition that can arise in a distributed system. Recall the diagram from the text, and the cache coherence actions, as shown below.

[[Image:EarlyInValidationRace.png|400px|center]]
The circled actions are as follows:
# ''A'' sends a read request to ''Home''.
# ''Home'' replies with data (but the message gets delayed).
# ''B'' sends a write request to ''Home''.
# ''Home'' sends invalidation to ''A'', and it arrives before the ReplyD

The SCI protocol has a way to handle this race condition, as well as many others, and the following sections will discuss how the SCI protocol design can prevent this race condition.

=== Prevention in the SCI Protocol ===
Race conditions are almost non-existent in the SCI protocol, due primarily to the protocol's design. A brief discussion of how this is accomplished follows, followed by a discussion showing what happens if this condition arises in the SCI protocol.
==== Atomic Transactions ====
SCI's primary method for preventing race conditions is having atomic transactions. A transaction is defined as a set of sub-actions necessary to complete some requested action, such as querying memory for a data or invalidating a node's cache. This usually takes the form a a request-response pair, where the initiating node sends a request to another node and waits for the response. The transaction is not complete until the response returns. Suppose, for example, that node ''A'' is in the middle of a transaction with node ''B''. Node ''C'' then tries to make a request of node ''A''. Node ''A'' will respond to node ''C'' that it is busy, telling node ''C'' to try again, as shown in the following diagram.
[[Image:AtomicBusy.png|center]]
Thus, in the Early Invalidation race, node ''A'' would be in the middle of a transaction, having requested a block of data from memory and not having received a response, which would prevent node ''B'' from invalidating it. Note, however, that nodes will not stay in the busy state indefinitely. Doing so would lead to potential deadlocks, so all transactions have time outs that will eventually cause the transaction to fail, thus moving the node out of the busy state, making it able to respond to requests again.

==== Head Node ====
Another method for preventing race conditions is to have the ''Head'' node of a sharing list perform many of the coherence actions. As a result, only one node is performing actions such as writes and invalidations of other sharers. Since only one node is performing these actions, the possibility of concurrent actions is decreased. If another node wants to write, for instance, it must become the Head Node of the sharing list for the cache line to which it wants to write, displacing the current ''Head'' node as the node that can write. All state changes in the sharing list occur at the behest of the ''Head'' node, with the exception of nodes deleting themselves from the list.

When simply sharing a read-only copy of a node, as in when the memory is a ''FRESH'' state, the ''Head'' node is somewhat irrelevant. All the sharing nodes have their own cached value of the cache line. If any node wants to write, including the ''Head'' node, it must perform an additional action in order to do so. Likewise, when a sharing list is sharing a line of data in the dirty, or ''GONE'', memory state, as long as the ''Head'' node has not written to the line, then all the sharing nodes stay in the list with their cached line. However, at the point the ''Head'' node wants to write, it can write its data immediately, but it then must invalidate all other shared copies, via the forward pointers in the sharing list.

This, however, does not protect against a race condition where the memory is ''FRESH'', the ''Head'' node wants to write, and another node also wants to join the list. The Memory Access mechanism prevents this condition.

==== Memory Access ====
The final way that the SCI protocol minimizes race conditions is by changing the directory structure. In Solihin 11.4, there is a ''Home'' node which keeps track of the cache state and has a certain amount of knowledge about the system as a result. In the SCI protocol, there is still a ''Home'' node of sorts, but this node is only responsible for keeping track of who the current ''Head''. Such a ''Home'' node is usually the node where the memory block physically resides. When a node wants to access a block of memory, it sends a request to the ''Home'' node of that memory block, assuming it isn't already in the sharing list for that block. In the event that the data is already being used, the ''Home'' node will include in its response the address of the ''Head'' node of the sharing list. This requirement necessarily serializes the order of access, as one access request is not serviced at the ''Home'' node until the previous request is finished, and the requests are processed in FIFO order. As well, whichever node is the ''Head'' of the sharing list will be notified by the new ''Head'' when it is demoted. All parties involved in a potential race are thus talking to each other, preventing the race from occurring in the first place.

Because of this requirement, the projected race condition from the previous section, where the ''Head'' node is in a ''FRESH'' state and wants to write while another node also wants to read, cannot occur. This is because both nodes must make their request to the ''Home'' node; the ''Head'' node must request the ability to write, and the other node must request the data from the ''Home'' node. One of these requests will make it to the ''Home'' node first, resulting in the second request being deferred to the first node to reach to ''Home'' node. Such a scenario is pictured below.
[[Image:MemoryAccess.png|center]]

==== Solihin 11.4 Resolution ====
The three previous sections discussed individual parts of how the SCI protocol reduces race conditions. Putting all three of these parts together yields the following scenario which resolves the Early Invalidation race as described in the text.
[[Image:SolihinSCI.png|center]]

# ''A'' sends a request to ''Home'' for access to the memory block. It then goes into a busy state while it waits for a response.
# ''B'' also sends a request to ''Home'' for access to the same memory block. ''A'''s request is received first
# ''Home'' responds to ''A'' with the data, but this response gets caught in network traffic and delayed. This response is sent before ''Home'' processes the request from ''B''.
# ''Home'' responds to the request from ''B'', telling it that ''A'' is the current ''Head'' node.
# ''B'' then sends a request to ''A'' to tell it to demote itself.
# ''A'' still hasn't received the response from ''Home'', so it is still in a transaction, and it tells ''B'' that it is busy. ''B'' will have to retry the request.

It should be noted that ''B'' will be in the ''PENDING''[[#Definitions_and_Terms|[def]]] state, so any requests to demote it from the ''Head'' position will be delayed.. This could lead to a chaining of nodes in the ''PENDING'' state, but as soon as node ''A'' receives its response from ''Home'' or times out, the requests in the chain will be successively accepted and processed.

=== Other Possible Race Conditions ===
The design of the SCI protocol prevents many race conditions from occurring, as has already been shown. Here are two other conditions that may arise, with a discussion of how the protocol resolves the issue.

==== Concurrent List Deletions ====
Even though the ''Head'' node performs most of the coherence actions for a shard list, the individual nodes are able to invalidate themselves, such as when they evict the block from their cache. This is referred to in the SCI standard as a "deletion" from the sharing list. Deletion is accomplished by having the invalidating node "lock" itself and then inform its forward and back nodes that they should now point to each other. This "locking" is essentially another ''PENDING'' state. A problem could arise, then, when two neighboring nodes try to invalidate themselves at the same time, as they would both be locked and not respond to each other's message. In this case, though, the protocol specifies that the node that is closest to the tail takes priority (IEEE pg156)[[#1foot|[1]]]. So, the deadlock and/or race condition is averted by ordering the deletion from the tail towards the head.

==== Simultaneous Deletion and Invalidation ====
Another potential race condition occurs when the ''Head'' node tries to purge the list of sharers at the same time as a node in the list tries to remove itself from the list. Suppose that the ''Head'' node is trying to purge node ''A'', which has node ''B'' as its next node in the list, while at the same time, node ''B'' is trying to remove itself from the list. Node ''B'' needs to tell node ''A'' to update its next pointer, but the ''Head'' node will need to access node ''A'''s next pointer to know what the next node in the list is. An illustration is shown below. The ''1'' and the ''2'' represent the messages being sent from the respective node to node ''A'' that have not yet arrived.
[[Image:PurgeDelete.png|center]]
One of the messages will reach node ''A'' first. While ''A'' is responding to this message, it will tell other nodes that it is busy, in a manner similar to the atomic transactions described earlier. If node ''B'''s message arrives first, the message from the ''Head'' node is rejected. Node ''A'' will have its forward pointer changed at this point, so when the ''Head'' node resends its request, all will work as expected.
The second option is more difficult. If the ''Head'' node's message arrives first, node ''A'' will be invalidated, so that when the message from node ''B'' is sent again, node ''A'' will simply inform node ''B'' that it is no longer part of the list. As well, the ''Head'' node will now have its forward pointer pointing at node ''B''. Node ''A'' has thus been successfully purged, but we arrive at an impasse, as now the ''Head'' node will try to invalidate the locked node ''B'', while node ''B'' will try to tell the ''Head'' node to change its forward pointer, with both nodes being inside a transaction. This situation is illustrated by the following figure.
[[Image:PurgeDelete2.png|center]]
The protocol does not specify how to handle this situation, but it would make sense for the protocol to allow one of the messages through, presumably the ''Head'' node's message. Node ''B'' could simply respond in the affirmative, telling the ''Head'' node what the following node is, but this would only work if node ''B'' notifies the next node of the deletion after it notifies node ''A''. Otherwise, the node following ''B'' would have its back pointer pointing to the wrong node. The protocol standard seems to suggest that this is the case (IEEE, pg162)[[#1foot|[1]]].

== Definitions and Terms ==
* ''Head Node'' - The node at the beginning of the sharing list
* ''Home Node'' - The node responsible for keeping track of the ''Head'' node for a sharing list for a memory block, usually the node at which the memory block resides.
* ''PENDING state'' - The intermediate state for a node while it is attempting to become the ''Head'' node in the sharing list. This includes the transaction involved in querying the ''Home'' node and the transaction involved in demoting the current ''Head'' node.
* ''SCI'' - Scalable Coherent Interface

== References ==
[[#1body|1.]] "IEEE Standard for Scalable Coherent Interface (SCI)," IEEE Std 1596-1992, 1993. doi: 10.1109/IEEESTD.1993.120366
URL: http://ieeexplore.ieee.org.proxied.lib.ncsu.edu/stamp/stamp.jsp?tp=&arnumber=347683&isnumber=8049 
[[#2body|2.]] [http://www.scizzl.com/HowSCIcohWorks.html How SCI Coherence Works]

CSC/ECE 506 Spring 2011/ch11 BB EP

2011-04-19T03:50:53Z

Beburrou: /* Simultaneous Deletion and Invalidation */

__TOC__

(Unless otherwise noted, the contents of this article are derived from the IEEE standardization document titled "IEEE Standard for Scalable Coherent Interface (SCI)"[[#References|[1]]].)

== Background ==
The Scalable Coherent Interface (SCI) cache coherence protocol defines a set of states for memory, a set of states for cache lines, and a set of actions that transition between these states when processors access memory.

Three sets of these attributes are defined for minimal, typical, and full applications:

* '''Minimal''' - for ‘trivial but correct’ applications that require the presence of the memory in only one cache line. It does not enable read-sharing, and is appropriate for small multi-processors or applications that don’t require significant sharing.

* '''Typical''' - enables read-sharing of a memory location and provisions for efficiency, such as DMA transfers, local caching of data, and error recovery. This set adds an additional stable memory state (FRESH) and multiple cache states. This set will be the focus of this article going forward.

* '''Full''' - adds support for pair-wise sharing, QOLB lock bit synchronization, and cleansing and washing of cache lines.

== State Diagrams ==
=== Memory States ===
The memory states define the state of the memory block from the perspective of the home directory. This 2-bit field is maintained in the memory tag by the home directory along with the pointer to the head of the sharing list (forwId). This simple state model includes three stable states and one semi-stable state.

{|border="1" cellspacing="0" cellpadding="2"
!colspan="5"| Table 1: Memory States
|----
!State
!Description
!Minimal
!Typical
!Full
|----
|HOME
|no sharing list
|Y
|Y
|Y
|----
|FRESH
|sharing-list copy is the same as memory
|
|Y
|Y
|----
|GONE
|sharing-list copy may be different from memory
|Y
|Y
|Y
|----
|WASH*
|transitional state (GONE to FRESH)
|
|
|Y
|----
|}

These states are used by the coherency model to determine the state of the cache line when memory transactions are performed by individual processors. Likewise, these memory states are impacted when these memory transactions occur.
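As a rough sketch of Table 1, the memory states and the option sets that support them can be written as a lookup. The numeric encodings below are placeholders chosen only so that the values fit in a 2-bit field; they are not taken from the standard, and <code>is_supported</code> is a hypothetical helper.

```python
from enum import IntEnum


class MemoryState(IntEnum):
    # Placeholder encodings: any four distinct values fit in the 2-bit field.
    HOME = 0   # no sharing list
    FRESH = 1  # sharing-list copy is the same as memory
    GONE = 2   # sharing-list copy may be different from memory
    WASH = 3   # transitional state (GONE to FRESH)


# Which memory states each option set uses, per Table 1.
SUPPORTED = {
    "minimal": {MemoryState.HOME, MemoryState.GONE},
    "typical": {MemoryState.HOME, MemoryState.FRESH, MemoryState.GONE},
    "full": set(MemoryState),
}


def is_supported(option_set, state):
    """Return True if the given option set defines the given memory state."""
    return state in SUPPORTED[option_set]
```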

Below is a state diagram that depicts the transition between these memory states:

[[Image:memory_state.png|center]]

=== Cache States ===
The cache-line states are maintained by each processor's cache-coherency controller. This 7-bit field is stored in each cache line in the sharing list, along with the pointer to the next sharing-list node (forwId) and the previous sharing-list node (backId). Seven bits allow up to 128 possible values, and SCI defines twenty-nine stable states for use in the minimal, typical, and full sets.

Stable states are those cache states that exist when a memory transaction is not in progress. Their names are derived from a combination of the position of the node in the sharing list (ONLY, HEAD, MID, or TAIL) and the state of the data for that node (FRESH, CLEAN, DIRTY, VALID, STALE, etc.).

====States of the Typical Set====

Following are the states defined for the Typical set. (Note that other states may be supported to provide certain interoperability with nodes implementing the Full set, but are not used when all nodes support the Typical set.)

* '''ONLY_DIRTY''' - only one processor has the memory block in its cache. This block is writable, and the processor has written (or intends to write) to it. This state is set when the processor requests the block with read/write privileges, and no other processor currently caches the block.
* '''ONLY_FRESH''' - only one processor has the memory block in its cache. This block is writable, but the processor has not written to it. This state is set when the processor requests the block with read privileges, and no other processor currently caches the block.
* '''HEAD_DIRTY''' - more than one processor has the memory block in its cache. This block is writable, and the processor has written (or intends to write) to it. This state is set when the processor requests the block with read/write privileges, and another processor already caches the block.
* '''HEAD_FRESH''' - more than one processor has the memory block in its cache. This block is writable, but the processor has not written to it. This state is set when the processor requests the block with read privileges, and another processor already caches the block.
* '''MID_VALID''' - more than two processors have the memory block in their caches, and it is readable. The processor cache with this state is neither the Head nor the Tail of the sharing list.
* '''TAIL_VALID''' - more than one processor has the memory block in its cache, and it is readable. The processor cache with this state is the Tail of the sharing list.
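The naming scheme of the Typical-set states listed above can be sketched as follows. The <code>CacheTag</code> class and <code>split_state</code> helper are hypothetical names used only for illustration, and the tag is modeled as a plain record rather than as packed bits.

```python
from dataclasses import dataclass
from typing import Optional

STATE_BITS = 7
MAX_STATE_VALUES = 1 << STATE_BITS  # 7 bits give 128 possible encodings

TYPICAL_STATES = (
    "ONLY_DIRTY", "ONLY_FRESH",
    "HEAD_DIRTY", "HEAD_FRESH",
    "MID_VALID", "TAIL_VALID",
)


@dataclass
class CacheTag:
    """Per-line coherence tag: a state plus the two sharing-list pointers."""
    state: str
    forw_id: Optional[int]  # next node in the sharing list, None if there is none
    back_id: Optional[int]  # previous node in the sharing list, None if there is none


def split_state(name):
    """Split a stable-state name into (list position, data state)."""
    position, data = name.split("_", 1)
    return position, data


positions = {split_state(s)[0] for s in TYPICAL_STATES}
```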

Below is a state diagram that depicts these states:

[[Image:cache_states.png|center]]

The request sub-actions of memory transactions that expect a response from the directory include 17 variations of reads, writes, and locks based on the number of bytes requested and the coherency requirements. For convenience, the notation on the edges of the above graph is derived from the terms used in the SCI specification to describe the memory transactions.

The first value indicates the request to the directory by a node for memory access:
* '''Fetch R''' - indicates a request for a memory block with read privileges.
* '''Fetch RW''' - indicates a request for a memory block with read/write privileges.
* '''Data Modify''' - indicates that a processor (the HEAD or the ONLY processor caching the block) writes to the cache line.

The value in parentheses indicates the memory state at the time of the request.

== Race Conditions ==
In a distributed shared memory system with caching, the emergence of race conditions is extremely likely. This is mainly due to the lack of a bus for serialization of actions. It is further compounded by the problem of network errors and congestion.

The early invalidation case from Section 11.4 in Solihin is an excellent example of a race condition that can arise in a distributed system. Recall the diagram from the text, and the cache coherence actions, as shown below.

[[Image:EarlyInValidationRace.png|400px|center]]
The circled actions are as follows:
# ''A'' sends a read request to ''Home''.
# ''Home'' replies with data (but the message gets delayed).
# ''B'' sends a write request to ''Home''.
# ''Home'' sends invalidation to ''A'', and it arrives before the ReplyD.

The SCI protocol has a way to handle this race condition, as well as many others, and the following sections will discuss how the SCI protocol design can prevent this race condition.

=== Prevention in the SCI Protocol ===
Race conditions are almost non-existent in the SCI protocol, due primarily to the protocol's design. A brief discussion of how this is accomplished follows, and then a walkthrough of what happens if the early invalidation condition arises in the SCI protocol.
==== Atomic Transactions ====
SCI's primary method for preventing race conditions is having atomic transactions. A transaction is defined as a set of sub-actions necessary to complete some requested action, such as querying memory for data or invalidating a node's cache. This usually takes the form of a request-response pair, where the initiating node sends a request to another node and waits for the response. The transaction is not complete until the response returns. Suppose, for example, that node ''A'' is in the middle of a transaction with node ''B''. Node ''C'' then tries to make a request of node ''A''. Node ''A'' will respond to node ''C'' that it is busy, telling node ''C'' to try again, as shown in the following diagram.
[[Image:AtomicBusy.png|center]]
Thus, in the Early Invalidation race, node ''A'' would be in the middle of a transaction, having requested a block of data from memory and not having received a response, which would prevent node ''B'' from invalidating it. Note, however, that nodes will not stay in the busy state indefinitely. Doing so would lead to potential deadlocks, so all transactions have timeouts that will eventually cause the transaction to fail, thus moving the node out of the busy state and making it able to respond to requests again.
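The busy-and-retry behavior with a timeout can be sketched as below. This is a toy model under stated assumptions: time is counted in abstract ticks, the timeout value is arbitrary, and the class and function names are hypothetical rather than taken from the standard.

```python
class Node:
    """A node that rejects requests while it has an outstanding transaction."""

    def __init__(self, name, timeout):
        self.name = name
        self.timeout = timeout  # assumed timeout length, in abstract ticks
        self.deadline = None    # None means no transaction is in flight

    def begin_transaction(self, now):
        self.deadline = now + self.timeout

    def complete_transaction(self):
        self.deadline = None

    def handle_request(self, now):
        # A transaction that has passed its deadline fails, freeing the node.
        if self.deadline is not None and now >= self.deadline:
            self.deadline = None
        return "BUSY" if self.deadline is not None else "ACCEPTED"


def retry_until_accepted(target, start, max_tries):
    """Retry once per tick; return the tick of acceptance, or None."""
    for tick in range(start, start + max_tries):
        if target.handle_request(tick) == "ACCEPTED":
            return tick
    return None


a = Node("A", timeout=3)
a.begin_transaction(now=0)      # A is waiting on a response
first_reply = a.handle_request(now=1)
accepted_at = retry_until_accepted(a, start=1, max_tries=10)
```

In this run the requester is told that ''A'' is busy until ''A'''s transaction times out, after which the retried request is accepted.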

==== Head Node ====
Another method for preventing race conditions is to have the ''Head'' node of a sharing list perform many of the coherence actions. As a result, only one node is performing actions such as writes and invalidations of other sharers. Since only one node is performing these actions, the possibility of concurrent actions is decreased. If another node wants to write, for instance, it must become the Head Node of the sharing list for the cache line to which it wants to write, displacing the current ''Head'' node as the node that can write. All state changes in the sharing list occur at the behest of the ''Head'' node, with the exception of nodes deleting themselves from the list.

When simply sharing a read-only copy of a line, as when the memory is in the ''FRESH'' state, the ''Head'' node is somewhat irrelevant. All the sharing nodes have their own cached value of the cache line. If any node wants to write, including the ''Head'' node, it must perform an additional action in order to do so. Likewise, when a sharing list is sharing a line of data in the dirty, or ''GONE'', memory state, as long as the ''Head'' node has not written to the line, all the sharing nodes stay in the list with their cached line. However, at the point the ''Head'' node wants to write, it can write its data immediately, but it then must invalidate all other shared copies, via the forward pointers in the sharing list.

This, however, does not protect against a race condition where the memory is ''FRESH'', the ''Head'' node wants to write, and another node also wants to join the list. The Memory Access mechanism prevents this condition.

==== Memory Access ====
The final way that the SCI protocol minimizes race conditions is by changing the directory structure. In Solihin 11.4, there is a ''Home'' node which keeps track of the cache state and has a certain amount of knowledge about the system as a result. In the SCI protocol, there is still a ''Home'' node of sorts, but this node is only responsible for keeping track of who the current ''Head'' is. Such a ''Home'' node is usually the node where the memory block physically resides. When a node wants to access a block of memory, it sends a request to the ''Home'' node of that memory block, assuming it isn't already in the sharing list for that block. In the event that the data is already being used, the ''Home'' node will include in its response the address of the ''Head'' node of the sharing list. This requirement necessarily serializes the order of access, as one access request is not serviced at the ''Home'' node until the previous request is finished, and the requests are processed in FIFO order. As well, whichever node is the ''Head'' of the sharing list will be notified by the new ''Head'' when it is demoted. All parties involved in a potential race are thus talking to each other, preventing the race from occurring in the first place.

Because of this requirement, the projected race condition from the previous section, where the ''Head'' node is in a ''FRESH'' state and wants to write while another node also wants to read, cannot occur. This is because both nodes must make their request to the ''Home'' node; the ''Head'' node must request the ability to write, and the other node must request the data from the ''Home'' node. One of these requests will make it to the ''Home'' node first, resulting in the second request being deferred to the first node to reach the ''Home'' node. Such a scenario is pictured below.
[[Image:MemoryAccess.png|center]]
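The serialization at the ''Home'' node can be sketched as a FIFO queue plus a single head pointer. This is a minimal model, assuming one memory block and ignoring memory states and data transfer; the <code>Home</code> class and its method names are hypothetical.

```python
from collections import deque


class Home:
    """Tracks only the current head of the sharing list for one memory block."""

    def __init__(self):
        self.head = None       # current Head node, None when there is no sharing list
        self.queue = deque()   # pending requests, serviced in arrival order

    def submit(self, node):
        self.queue.append(node)

    def service_next(self):
        """Service one request: the requester becomes the new Head and is
        told who the previous Head was (None if it is the first sharer)."""
        requester = self.queue.popleft()
        old_head = self.head
        self.head = requester
        return requester, old_head


home = Home()
home.submit("A")   # A's request arrives first
home.submit("B")   # B's request is queued behind it
first = home.service_next()
second = home.service_next()
```

Because each request is fully serviced before the next one is dequeued, the later requester always learns the identity of the earlier one and must contact it directly.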

==== Solihin 11.4 Resolution ====
The three previous sections discussed individual parts of how the SCI protocol reduces race conditions. Putting all three of these parts together yields the following scenario which resolves the Early Invalidation race as described in the text.
[[Image:SolihinSCI.png|center]]

# ''A'' sends a request to ''Home'' for access to the memory block. It then goes into a busy state while it waits for a response.
# ''B'' also sends a request to ''Home'' for access to the same memory block. ''A'''s request is received first.
# ''Home'' responds to ''A'' with the data, but this response gets caught in network traffic and delayed. This response is sent before ''Home'' processes the request from ''B''.
# ''Home'' responds to the request from ''B'', telling it that ''A'' is the current ''Head'' node.
# ''B'' then sends a request to ''A'' to tell it to demote itself.
# ''A'' still hasn't received the response from ''Home'', so it is still in a transaction, and it tells ''B'' that it is busy. ''B'' will have to retry the request.

Note that ''B'' will be in the ''PENDING''[[#Definitions_and_Terms|[def]]] state, so any requests to demote it from the ''Head'' position will be delayed. This could lead to a chain of nodes in the ''PENDING'' state, but as soon as node ''A'' receives its response from ''Home'' or times out, the requests in the chain will be successively accepted and processed.

=== Other Possible Race Conditions ===
The design of the SCI protocol prevents many race conditions from occurring, as has already been shown. Here are two other conditions that may arise, with a discussion of how the protocol resolves the issue.

==== Concurrent List Deletions ====
Even though the ''Head'' node performs most of the coherence actions for a sharing list, individual nodes are able to invalidate themselves, such as when they evict the block from their cache. This is referred to in the SCI standard as a "deletion" from the sharing list. Deletion is accomplished by having the invalidating node "lock" itself and then inform its forward and back nodes that they should now point to each other. This "locking" is essentially another ''PENDING'' state. A problem could arise, then, when two neighboring nodes try to invalidate themselves at the same time, as they would both be locked and would not respond to each other's messages. In this case, though, the protocol specifies that the node that is closest to the tail takes priority. So, the deadlock and/or race condition is averted by ordering the deletion from the tail towards the head.

==== Simultaneous Deletion and Invalidation ====
Another potential race condition occurs when the ''Head'' node tries to purge the list of sharers at the same time as a node in the list tries to remove itself from the list. Suppose that the ''Head'' node is trying to purge node ''A'', which has node ''B'' as its next node in the list, while at the same time, node ''B'' is trying to remove itself from the list. Node ''B'' needs to tell node ''A'' to update its next pointer, but the ''Head'' node will need to access node ''A'''s next pointer to know what the next node in the list is. An illustration is shown below. The ''1'' and the ''2'' represent the messages being sent from the respective node to node ''A'' that have not yet arrived.
[[Image:PurgeDelete.png|center]]
One of the messages will reach node ''A'' first. While ''A'' is responding to this message, it will tell other nodes that it is busy, in a manner similar to the atomic transactions described earlier. If node ''B'''s message arrives first, the message from the ''Head'' node is rejected. Node ''A'' will have its forward pointer changed at this point, so when the ''Head'' node resends its request, all will work as expected.
The second option is more difficult. If the ''Head'' node's message arrives first, node ''A'' will be invalidated, so that when the message from node ''B'' is sent again, node ''A'' will simply inform node ''B'' that it is no longer part of the list. As well, the ''Head'' node will now have its forward pointer pointing at node ''B''. Node ''A'' has thus been successfully purged, but we arrive at an impasse, as now the ''Head'' node will try to invalidate the locked node ''B'', while node ''B'' will try to tell the ''Head'' node to change its forward pointer, with both nodes being inside a transaction. This situation is illustrated by the following figure.
[[Image:PurgeDelete2.png|center]]
The protocol does not specify how to handle this situation, but it would make sense for the protocol to allow one of the messages through, presumably the ''Head'' node's message. Node ''B'' could simply respond in the affirmative, telling the ''Head'' node what the following node is, but this would only work if node ''B'' notifies the next node of the deletion after it notifies node ''A''. Otherwise, the node following ''B'' would have its back pointer pointing to the wrong node. The protocol standard seems to suggest that this is the case (IEEE, pg162)[[#1foot|[1]]].

== Definitions and Terms ==
* ''Head Node'' - The node at the beginning of the sharing list
* ''Home Node'' - The node responsible for keeping track of the ''Head'' node for a sharing list for a memory block, usually the node at which the memory block resides.
* ''PENDING state'' - The intermediate state for a node while it is attempting to become the ''Head'' node in the sharing list. This includes the transaction involved in querying the ''Home'' node and the transaction involved in demoting the current ''Head'' node.
* ''SCI'' - Scalable Coherent Interface

== References ==
[[#1body|1.]] "IEEE Standard for Scalable Coherent Interface (SCI)," IEEE Std 1596-1992, 1993. doi: 10.1109/IEEESTD.1993.120366
URL: http://ieeexplore.ieee.org.proxied.lib.ncsu.edu/stamp/stamp.jsp?tp=&arnumber=347683&isnumber=8049 
[[#2body|2.]] [http://www.scizzl.com/HowSCIcohWorks.html How SCI Coherence Works]

CSC/ECE 506 Spring 2011/ch11 BB EP

2011-04-19T03:49:30Z

Beburrou: /* Simultaneous Deletion and Invalidation */

__TOC__

(Unless otherwise noted, the contents of this article are derived from the IEEE standardization document titled "IEEE Standard for Scalable Coherent Interface (SCI)"[[#References|[1]]].)

== Background ==
The Scalable Coherent Interface (SCI) cache coherence protocol defines a set of states for memory, a set of states for cache lines, and a set of actions that transition between these states when processors access memory.

Three sets of these attributes are defined for minimal, typical, and full applications:

* '''Minimal''' - for ‘trivial but correct’ applications that require the presence of the memory in only one cache line. It does not enable read-sharing, and is appropriate for small multi-processors or applications that don’t require significant sharing.

* '''Typical''' - enables read-sharing of a memory location and provisions for efficiency, such as DMA transfers, local caching of data, and error recovery. This set adds an additional stable memory state (FRESH) and multiple cache states. This set will be the focus of this article going forward.

* '''Full''' - adds support for pair-wise sharing, QOLB lock bit synchronization, and cleansing and washing of cache lines.

== State Diagrams ==
=== Memory States ===
The memory states define the state of the memory block from the perspective of the home directory. This 2-bit field is maintained in the memory tag by the home directory along with the pointer to the head of the sharing list (forwId). This simple state model includes three stable states and one semi-stable state.

{|border="1" cellspacing="0" cellpadding="2"
!colspan="5"| Table 1: Memory States
|----
!State
!Description
!Minimal
!Typical
!Full
|----
|HOME
|no sharing list
|Y
|Y
|Y
|----
|FRESH
|sharing-list copy is the same as memory
|
|Y
|Y
|----
|GONE
|sharing-list copy may be different from memory
|Y
|Y
|Y
|----
|WASH*
|transitional state (GONE to FRESH)
|
|
|Y
|----
|}

These states are used by the coherency model to determine the state of the cache line when memory transactions are performed by individual processors. Likewise, these memory states are impacted when these memory transactions occur.

Below is a state diagram that depicts the transition between these memory states:

[[Image:memory_state.png|center]]

=== Cache States ===
The cache-line states are maintained by each processor's cache-coherency controller. This 7-bit field is stored in each cache line in the sharing list, along with the pointer to the next sharing-list node (forwId) and the previous sharing-list node (backId). Seven bits allow up to 128 possible values, and SCI defines twenty-nine stable states for use in the minimal, typical, and full sets.

Stable states are those cache states that exist when a memory transaction is not in progress. Their names are derived from a combination of the position of the node in the sharing list (ONLY, HEAD, MID, or TAIL) and the state of the data for that node (FRESH, CLEAN, DIRTY, VALID, STALE, etc.).

====States of the Typical Set====

Following are the states defined for the Typical set. (Note that other states may be supported to provide certain interoperability with nodes implementing the Full set, but are not used when all nodes support the Typical set.)

* '''ONLY_DIRTY''' - only one processor has the memory block in its cache. This block is writable, and the processor has written (or intends to write) to it. This state is set when the processor requests the block with read/write privileges, and no other processor currently caches the block.
* '''ONLY_FRESH''' - only one processor has the memory block in its cache. This block is writable, but the processor has not written to it. This state is set when the processor requests the block with read privileges, and no other processor currently caches the block.
* '''HEAD_DIRTY''' - more than one processor has the memory block in its cache. This block is writable, and the processor has written (or intends to write) to it. This state is set when the processor requests the block with read/write privileges, and another processor already caches the block.
* '''HEAD_FRESH''' - more than one processor has the memory block in its cache. This block is writable, but the processor has not written to it. This state is set when the processor requests the block with read privileges, and another processor already caches the block.
* '''MID_VALID''' - more than two processors have the memory block in their caches, and it is readable. The processor cache with this state is neither the Head nor the Tail of the sharing list.
* '''TAIL_VALID''' - more than one processor has the memory block in its cache, and it is readable. The processor cache with this state is the Tail of the sharing list.

Below is a state diagram that depicts these states:

[[Image:cache_states.png|center]]

The request sub-actions of memory transactions that expect a response from the directory include 17 variations of reads, writes, and locks based on the number of bytes requested and the coherency requirements. For convenience, the notation on the edges of the above graph is derived from the terms used in the SCI specification to describe the memory transactions.

The first value indicates the request to the directory by a node for memory access:
* '''Fetch R''' - indicates a request for a memory block with read privileges.
* '''Fetch RW''' - indicates a request for a memory block with read/write privileges.
* '''Data Modify''' - indicates that a processor (the HEAD or the ONLY processor caching the block) writes to the cache line.

The value in parentheses indicates the memory state at the time of the request.

== Race Conditions ==
In a distributed shared memory system with caching, the emergence of race conditions is extremely likely. This is mainly due to the lack of a bus for serialization of actions. It is further compounded by the problem of network errors and congestion.

The early invalidation case from Section 11.4 in Solihin is an excellent example of a race condition that can arise in a distributed system. Recall the diagram from the text, and the cache coherence actions, as shown below.

[[Image:EarlyInValidationRace.png|400px|center]]
The circled actions are as follows:
# ''A'' sends a read request to ''Home''.
# ''Home'' replies with data (but the message gets delayed).
# ''B'' sends a write request to ''Home''.
# ''Home'' sends invalidation to ''A'', and it arrives before the ReplyD.

The SCI protocol has a way to handle this race condition, as well as many others, and the following sections will discuss how the SCI protocol design can prevent this race condition.

=== Prevention in the SCI Protocol ===
Race conditions are almost non-existent in the SCI protocol, due primarily to the protocol's design. A brief discussion of how this is accomplished follows, and then a walkthrough of what happens if the early invalidation condition arises in the SCI protocol.
==== Atomic Transactions ====
SCI's primary method for preventing race conditions is having atomic transactions. A transaction is defined as a set of sub-actions necessary to complete some requested action, such as querying memory for data or invalidating a node's cache. This usually takes the form of a request-response pair, where the initiating node sends a request to another node and waits for the response. The transaction is not complete until the response returns. Suppose, for example, that node ''A'' is in the middle of a transaction with node ''B''. Node ''C'' then tries to make a request of node ''A''. Node ''A'' will respond to node ''C'' that it is busy, telling node ''C'' to try again, as shown in the following diagram.
[[Image:AtomicBusy.png|center]]
Thus, in the Early Invalidation race, node ''A'' would be in the middle of a transaction, having requested a block of data from memory and not having received a response, which would prevent node ''B'' from invalidating it. Note, however, that nodes will not stay in the busy state indefinitely. Doing so would lead to potential deadlocks, so all transactions have timeouts that will eventually cause the transaction to fail, thus moving the node out of the busy state and making it able to respond to requests again.

==== Head Node ====
Another method for preventing race conditions is to have the ''Head'' node of a sharing list perform many of the coherence actions. As a result, only one node is performing actions such as writes and invalidations of other sharers. Since only one node is performing these actions, the possibility of concurrent actions is decreased. If another node wants to write, for instance, it must become the Head Node of the sharing list for the cache line to which it wants to write, displacing the current ''Head'' node as the node that can write. All state changes in the sharing list occur at the behest of the ''Head'' node, with the exception of nodes deleting themselves from the list.

When simply sharing a read-only copy of a line, as when the memory is in the ''FRESH'' state, the ''Head'' node is somewhat irrelevant. All the sharing nodes have their own cached value of the cache line. If any node wants to write, including the ''Head'' node, it must perform an additional action in order to do so. Likewise, when a sharing list is sharing a line of data in the dirty, or ''GONE'', memory state, as long as the ''Head'' node has not written to the line, all the sharing nodes stay in the list with their cached line. However, at the point the ''Head'' node wants to write, it can write its data immediately, but it then must invalidate all other shared copies, via the forward pointers in the sharing list.

This, however, does not protect against a race condition where the memory is ''FRESH'', the ''Head'' node wants to write, and another node also wants to join the list. The Memory Access mechanism prevents this condition.

==== Memory Access ====
The final way that the SCI protocol minimizes race conditions is by changing the directory structure. In Solihin 11.4, there is a ''Home'' node which keeps track of the cache state and has a certain amount of knowledge about the system as a result. In the SCI protocol, there is still a ''Home'' node of sorts, but this node is only responsible for keeping track of who the current ''Head'' is. Such a ''Home'' node is usually the node where the memory block physically resides. When a node wants to access a block of memory, it sends a request to the ''Home'' node of that memory block, assuming it isn't already in the sharing list for that block. In the event that the data is already being used, the ''Home'' node will include in its response the address of the ''Head'' node of the sharing list. This requirement necessarily serializes the order of access, as one access request is not serviced at the ''Home'' node until the previous request is finished, and the requests are processed in FIFO order. As well, whichever node is the ''Head'' of the sharing list will be notified by the new ''Head'' when it is demoted. All parties involved in a potential race are thus talking to each other, preventing the race from occurring in the first place.

Because of this requirement, the race condition anticipated in the previous section, where the ''Head'' node is in a ''FRESH'' state and wants to write while another node wants to read, cannot occur. Both nodes must make their request to the ''Home'' node: the ''Head'' node must request the ability to write, and the other node must request the data. Whichever request reaches the ''Home'' node first is serviced first, and the second requester is then referred to the node whose request arrived first. Such a scenario is pictured below.
[[Image:MemoryAccess.png|center]]
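The serializing role of the ''Home'' node can be sketched as follows. This is a simplified model under stated assumptions: the <code>HomeNode</code> class, its method names, and the request kinds <code>"R"</code> and <code>"RW"</code> are hypothetical, and the memory-state updates are reduced to the ''HOME'', ''FRESH'', and ''GONE'' states from Table 1.

```python
from collections import deque
from typing import Deque, List, Optional, Tuple


class HomeNode:
    """Hypothetical Home directory entry: a memory state plus a pointer to the Head."""

    def __init__(self) -> None:
        self.mem_state = "HOME"               # no sharing list yet
        self.head: Optional[str] = None       # forwId: current Head of the list
        self.queue: Deque[Tuple[str, str]] = deque()

    def submit(self, node: str, kind: str) -> None:
        """Queue a request; kind is 'R' (read) or 'RW' (read/write)."""
        self.queue.append((node, kind))

    def service_all(self) -> List[Tuple[str, Optional[str]]]:
        """Service requests one at a time in FIFO order.

        Each response names the previous Head, or None when no list existed.
        """
        responses = []
        while self.queue:
            node, kind = self.queue.popleft()
            old_head = self.head
            self.head = node                  # the requester becomes the new Head
            if kind == "RW":
                self.mem_state = "GONE"
            elif self.mem_state == "HOME":
                self.mem_state = "FRESH"
            responses.append((node, old_head))
        return responses
```

In this sketch the second requester always learns the identity of the first, which is what forces the two racing nodes to talk to each other.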

==== Solihin 11.4 Resolution ====
The three previous sections discussed the individual mechanisms by which the SCI protocol reduces race conditions. Putting all three together yields the following scenario, which resolves the Early Invalidation race as described in the text.
[[Image:SolihinSCI.png|center]]

# ''A'' sends a request to ''Home'' for access to the memory block. It then goes into a busy state while it waits for a response.
# ''B'' also sends a request to ''Home'' for access to the same memory block. ''A'''s request is received first.
# ''Home'' responds to ''A'' with the data, but this response gets caught in network traffic and delayed. This response is sent before ''Home'' processes the request from ''B''.
# ''Home'' responds to the request from ''B'', telling it that ''A'' is the current ''Head'' node.
# ''B'' then sends a request to ''A'' to tell it to demote itself.
# ''A'' still hasn't received the response from ''Home'', so it is still in a transaction, and it tells ''B'' that it is busy. ''B'' will have to retry the request.

It should be noted that ''B'' will be in the ''PENDING''[[#Definitions_and_Terms|[def]]] state, so any requests to demote it from the ''Head'' position will be delayed. This could lead to a chain of nodes in the ''PENDING'' state, but as soon as node ''A'' receives its response from ''Home'' or times out, the requests in the chain will be successively accepted and processed.
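The busy-and-retry behavior in steps 5 and 6 can be sketched as a small simulation. The <code>Node</code> class, the <code>"BUSY"</code> and <code>"OK"</code> replies, and the <code>response_arrives_at</code> parameter are assumptions made for illustration, not message names from the SCI standard.

```python
class Node:
    """Hypothetical node that rejects requests while its own transaction is open."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.busy = False
        self.is_head = False

    def start_request(self) -> None:
        self.busy = True                 # request sent to Home, awaiting response

    def receive_home_response(self) -> None:
        self.busy = False                # the transaction with Home completes
        self.is_head = True

    def handle_demote(self) -> str:
        if self.busy:
            return "BUSY"                # the requester must retry later
        self.is_head = False
        return "OK"


def demote_with_retry(target: Node, response_arrives_at: int, max_tries: int = 10) -> int:
    """Retry the demote request until the target accepts; return the attempt count."""
    for attempt in range(1, max_tries + 1):
        if attempt == response_arrives_at:
            target.receive_home_response()   # the delayed response finally arrives
        if target.handle_demote() == "OK":
            return attempt
    raise TimeoutError("target stayed busy; a real transaction would time out")
```

For example, if ''A'''s delayed response arrives just before ''B'''s third attempt, <code>demote_with_retry(a, 3)</code> returns 3 and ''A'' is no longer the ''Head''.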

=== Other Possible Race Conditions ===
The design of the SCI protocol prevents many race conditions from occurring, as has already been shown. Here are two other conditions that may arise, with a discussion of how the protocol resolves the issue.

==== Concurrent List Deletions ====
Even though the ''Head'' node performs most of the coherence actions for a sharing list, the individual nodes are able to invalidate themselves, such as when they evict the block from their cache. This is referred to in the SCI standard as a "deletion" from the sharing list. Deletion is accomplished by having the invalidating node "lock" itself and then inform its forward and back nodes that they should now point to each other. This "locking" is essentially another ''PENDING'' state. A problem could arise, then, when two neighboring nodes try to invalidate themselves at the same time, as they would both be locked and would not respond to each other's messages. In this case, though, the protocol specifies that the node closest to the tail takes priority. The deadlock or race condition is thus averted by ordering the deletions from the tail towards the head.
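The tail-first ordering can be sketched with a doubly linked sharing list. The function <code>delete_concurrently</code> and its arguments are hypothetical, and the sketch resolves the contention by sorting the leaving nodes rather than by exchanging messages, so it models only the resulting order and not the locking itself.

```python
from typing import Dict, List, Optional, Set


def delete_concurrently(order: List[str], leaving: Set[str]) -> List[str]:
    """Remove `leaving` nodes from a sharing list, tail-most first.

    `order` lists node names from Head to Tail. Returns the surviving list.
    """
    forw: Dict[str, Optional[str]] = {}
    back: Dict[str, Optional[str]] = {}
    for i, name in enumerate(order):
        back[name] = order[i - 1] if i > 0 else None
        forw[name] = order[i + 1] if i + 1 < len(order) else None

    # Assumed priority rule: the leaving node closest to the tail proceeds first.
    for name in sorted(leaving, key=order.index, reverse=True):
        prev_node, next_node = back[name], forw[name]
        if prev_node is not None:
            forw[prev_node] = next_node   # back neighbor now points past this node
        if next_node is not None:
            back[next_node] = prev_node   # forward neighbor now points past this node
        del forw[name], back[name]

    # Walk the forward pointers from the surviving head to rebuild the list.
    survivors = [n for n in order if n in forw]
    result: List[str] = []
    cur = survivors[0] if survivors else None
    while cur is not None:
        result.append(cur)
        cur = forw[cur]
    return result
```

Processing the tail-most node first means each deleting node always finds an unlocked forward neighbor, so the pointer updates never wait on each other in this model.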

==== Simultaneous Deletion and Invalidation ====
Another potential race condition occurs when the ''Head'' node tries to purge the list of sharers at the same time as a node in the list tries to remove itself from the list. Suppose that the ''Head'' node is trying to purge node ''A'', which has node ''B'' as its next node in the list, while at the same time, node ''B'' is trying to remove itself from the list. Node ''B'' needs to tell node ''A'' to update its next pointer, but the ''Head'' node will need to access node ''A'''s next pointer to know what the next node in the list is. An illustration is shown below. The ''1'' and the ''2'' represent the messages being sent from the respective node to node ''A'' that have not yet arrived.
[[Image:PurgeDelete.png|center]]
One of the two messages will reach node ''A'' first. While ''A'' is responding to this message, it will tell other nodes that it is busy, in a similar manner to the atomic transactions described above. If node ''B'''s message arrives first, the message from the ''Head'' node is rejected. Node ''A'' will have its forward pointer changed at this point, so when the ''Head'' node resends its request, all will work as expected.
The second case is more difficult. If the ''Head'' node's message arrives first, node ''A'' will be invalidated, so that when the message from node ''B'' is sent again, node ''A'' will simply inform node ''B'' that it is no longer part of the list. In addition, the ''Head'' node will now have its forward pointer pointing at node ''B''. Node ''A'' has thus been successfully purged, but we arrive at an impasse: the ''Head'' node will now try to invalidate the locked node ''B'', while node ''B'' will try to tell the ''Head'' node to change its forward pointer, with both nodes being inside a transaction. This situation is illustrated by the following figure.
[[Image:PurgeDelete2.png|center]]
The protocol does not specify how to handle this situation, but it would make sense for the protocol to allow one of the messages through, presumably the ''Head'' node's message. Node ''B'' could simply respond in the affirmative, telling the ''Head'' node what the following node is, but this would only work if node ''B'' notifies the next node of the deletion after it notifies node ''A''. Otherwise, the node following ''B'' would have its back pointer pointing to the wrong node. The protocol standard seems to suggest that this is the case (IEEE, p. 162).

== Definitions and Terms ==
* ''Head Node'' - The node at the beginning of the sharing list.
* ''Home Node'' - The node responsible for keeping track of the ''Head'' node for a sharing list for a memory block, usually the node at which the memory block resides.
* ''PENDING state'' - The intermediate state for a node while it is attempting to become the ''Head'' node in the sharing list. This includes the transaction involved in querying the ''Home'' node and the transaction involved in demoting the current ''Head'' node.
* ''SCI'' - Scalable Coherent Interface

== References ==
[[#1body|1.]] "IEEE Standard for Scalable Coherent Interface (SCI)," IEEE Std 1596-1992, pp. i, 1993. doi: 10.1109/IEEESTD.1993.120366
URL: http://ieeexplore.ieee.org.proxied.lib.ncsu.edu/stamp/stamp.jsp?tp=&arnumber=347683&isnumber=8049 
[[#2body|2.]] [http://www.scizzl.com/HowSCIcohWorks.html How SCI Coherence Works]

File:PurgeDelete2.png

2011-04-19T03:30:10Z

Beburrou:

File:PurgeDelete.png

2011-04-19T03:17:22Z

Beburrou:

CSC/ECE 506 Spring 2011/ch11 BB EP

2011-04-19T02:06:03Z

Beburrou: /* Concurrent List Deletions */


CSC/ECE 506 Spring 2011/ch11 BB EP

2011-04-19T02:05:55Z

Beburrou: /* Simultaneous Deletion and Invalidation */

__TOC__

(Unless otherwise noted, the contents of this article is derived from the IEEE standardization document titled "IEEE Standard for Scalable Coherent Interface (SCI)"[[#References|[1]]] .)

== Background ==
The Scalable Coherent Interface (SCI) cache coherence protocol defines a set of states for memory, a set of states for cache lines, and a set of actions that transition between these states when processors access memory.

Three sets of these attributes are defined for minimal, typical, and full applications:

* '''Minimal''' - for ‘trivial but correct’ applications that require the presence of the memory in only one cache line. It does not enable read-sharing, and is appropriate for small multi-processors or applications that don’t require significant sharing.

* '''Typical''' - enables read-sharing of a memory location and provisions for efficiency, such as DMA transfers, local caching of data, and error recovery. This set adds an additional stable memory state (FRESH) and multiple cache states. This set will be the focus of this article going forward.

* '''Full''' - adds support for pair-wise sharing, QOLB lock bit synchronization, and cleansing and washing of cache lines.

== State Diagrams ==
=== Memory States ===
The memory states define the state of the memory block from the perspective of the home directory. This 2-bit field is maintained in the memory tag by the home directory along with the pointer to the head of the sharing list (forwId). This simple state model includes three stable states and one semi-stable state.

{|border="1" cellspacing="0" cellpadding="2"
!colspan="5"| Table 1: Memory States
|----
!State
!Description
!Minimal
!Typical
!Full
|----
|HOME
|no sharing list
|Y
|Y
|Y
|----
|FRESH
|sharing-list copy is the same as memory
|
|Y
|Y
|----
|GONE
|sharing-list copy may be different from memory
|Y
|Y
|Y
|----
|WASH*
|transitional state (GONE to FRESH)
|
|
|Y
|----
|}

These states are used by the coherency model to determine the state of the cache line when memory transactions are performed by individual processors. Likewise, these memory states are impacted when these memory transactions occur.

Below is a state diagram that depicts the transition between these memory states:

[[Image:memory_state.png|center]]

=== Cache States ===
The cache-line states are maintained by each processors cache-coherency controller. This 7-bit field is stored in each cache line in the sharing list, along with the pointer to the next sharing-list node (forwId) and the previous sharing-list node (backID). Seven bits enable up to 128 possible values, and SCI defines twenty-nine stable-states for use in the minimal, typical, and full sets.

Stable states are those cache states that exist when a memory transaction is not in process. Their names are derived from a combination of
the position of the node in the sharing list, such as ONLY, HEAD, MID, and TAIL, and the state of the data for that node, such as FRESH, CLEAN, DIRTY, VALID, STALE, etc.

====States of the Typical Set====

Following are the states defined for the Typical set. (Note that other states may be supported to provide certain interoperability with nodes implementing the Full set, but are not used when all nodes support the Typical set.)

* '''ONLY_DIRTY''' - only one processor has the memory block in its cache. This block is writable, and the processor has written (or intends) to write to it. This state is set when the processor requests the block with read/write privileges, and no other processor currently caches the block.
* '''ONLY_FRESH''' - only one processor had the memory block in its cache. This block is writeable, but processor has not written to it. This state is set when the processor requests the block with read privileges, and no other processor currently caches the block.
* '''HEAD_DIRTY''' - more than one processor has the memory block in its cache. This block is writable, and the processor has written (or intends) to write to it. This state is set when the processor requests the block with read/write privileges, and another processors already caches the block.
* '''HEAD_FRESH''' - more than one processor had the memory block in its cache. This block is writeable, but processor has not written to it. This state is set when the processor requests the block with read privileges, and another processors already caches the block.
* '''MID_VALID''' - more than two processors have the memory in its cache, and it is readable. The processor cache with this state is neither the Head or the Tail of the of the sharing list.
* '''TAIL_VALID''' - more than two processors have the memory in its cache, and it is readable. The processor cache with this state is the Tail of the of the sharing list.

Below is a state diagram that depicts these states:

[[Image:cache_states.png|center]]

The request sub-actions of memory transactions that expect a response from the directory include 17 variations of reads, writes, and locks based on the number of bytes requested and the coherency requirements. For convenience, the notation on the edges of the above graph is derived from the terms used in the SCI specification to describe the memory transactions.

The first value indicates the request to the directory by a node for memory access:
* '''Fetch R''' - indicates a request for a memory block with read privileges.
* '''Fetch RW''' - indicates a request for a memory block with read/write privileges.
* '''Data Modify''' - indicates that a processor (the HEAD or the ONLY processor caching the block) writes to the cache line.

The value in the parenthesis indicates the memory state at the time of the request.

== Race Conditions ==
In a distributed shared memory system with caching, the emergence of race conditions is extremely likely. This is mainly due to the lack of a bus for serialization of actions. It is further compounded by the problem of network errors and congestion.

The early invalidation case from Section 11.4 in Solihin is an excellent example of a race condition that can arise in a distributed system. Recall the diagram from the text, and the cache coherence actions, as shown below.

[[Image:EarlyInValidationRace.png|400px|center]]
The circled actions are as follows:
# ''A'' sends a read request to ''Home''.
# ''Home'' replies with data (but the message gets delayed).
# ''B'' sends a write request to ''Home''.
# ''Home'' sends invalidation to ''A'', and it arrives before the ReplyD

The SCI protocol has a way to handle this race condition, as well as many others, and the following sections will discuss how the SCI protocol design can prevent this race condition.

=== Prevention in the SCI Protocol ===
Race conditions are almost non-existent in the SCI protocol, due primarily to the protocol's design. A brief discussion of how this is accomplished follows, followed by a discussion showing what happens if this condition arises in the SCI protocol.
==== Atomic Transactions ====
SCI's primary method for preventing race conditions is having atomic transactions. A transaction is defined as a set of sub-actions necessary to complete some requested action, such as querying memory for a data or invalidating a node's cache. This usually takes the form a a request-response pair, where the initiating node sends a request to another node and waits for the response. The transaction is not complete until the response returns. Suppose, for example, that node ''A'' is in the middle of a transaction with node ''B''. Node ''C'' then tries to make a request of node ''A''. Node ''A'' will respond to node ''C'' that it is busy, telling node ''C'' to try again, as shown in the following diagram.
[[Image:AtomicBusy.png|center]]
Thus, in the Early Invalidation race, node ''A'' would be in the middle of a transaction, having requested a block of data from memory and not having received a response, which would prevent node ''B'' from invalidating it. Note, however, that nodes will not stay in the busy state indefinitely. Doing so would lead to potential deadlocks, so all transactions have time outs that will eventually cause the transaction to fail, thus moving the node out of the busy state, making it able to respond to requests again.

==== Head Node ====
Another method for preventing race conditions is to have the ''Head'' node of a sharing list perform many of the coherence actions. As a result, only one node is performing actions such as writes and invalidations of other sharers. Since only one node is performing these actions, the possibility of concurrent actions is decreased. If another node wants to write, for instance, it must become the Head Node of the sharing list for the cache line to which it wants to write, displacing the current ''Head'' node as the node that can write. All state changes in the sharing list occur at the behest of the ''Head'' node, with the exception of nodes deleting themselves from the list.

When simply sharing a read-only copy of a node, as in when the memory is a ''FRESH'' state, the ''Head'' node is somewhat irrelevant. All the sharing nodes have their own cached value of the cache line. If any node wants to write, including the ''Head'' node, it must perform an additional action in order to do so. Likewise, when a sharing list is sharing a line of data in the dirty, or ''GONE'', memory state, as long as the ''Head'' node has not written to the line, then all the sharing nodes stay in the list with their cached line. However, at the point the ''Head'' node wants to write, it can write its data immediately, but it then must invalidate all other shared copies, via the forward pointers in the sharing list.

This, however, does not protect against a race condition where the memory is ''FRESH'', the ''Head'' node wants to write, and another node also wants to join the list. The Memory Access mechanism prevents this condition.

==== Memory Access ====
The final way that the SCI protocol minimizes race conditions is by changing the directory structure. In Solihin 11.4, there is a ''Home'' node which keeps track of the cache state and has a certain amount of knowledge about the system as a result. In the SCI protocol, there is still a ''Home'' node of sorts, but this node is only responsible for keeping track of which node is the current ''Head''. Such a ''Home'' node is usually the node where the memory block physically resides. When a node wants to access a block of memory, it sends a request to the ''Home'' node of that memory block, assuming it isn't already in the sharing list for that block. In the event that the data is already being used, the ''Home'' node will include in its response the address of the ''Head'' node of the sharing list. This requirement necessarily serializes the order of access, as one access request is not serviced at the ''Home'' node until the previous request is finished, and the requests are processed in FIFO order. In addition, whichever node is the ''Head'' of the sharing list will be notified by the new ''Head'' when it is demoted. All parties involved in a potential race are thus talking to each other, preventing the race from occurring in the first place.

Because of this requirement, the projected race condition from the previous section, where the ''Head'' node is in a ''FRESH'' state and wants to write while another node also wants to read, cannot occur. This is because both nodes must make their request to the ''Home'' node; the ''Head'' node must request the ability to write, and the other node must request the data from the ''Home'' node. One of these requests will make it to the ''Home'' node first, resulting in the second request being deferred to the first node to reach the ''Home'' node. Such a scenario is pictured below.
[[Image:MemoryAccess.png|center]]
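The serialization performed at the ''Home'' node can be illustrated with a small Python sketch. It assumes, for illustration only, that ''Home'' keeps a single pointer to the current ''Head'' and a FIFO queue of pending requests; the <code>Home</code> class and its method names are hypothetical and do not come from the standard.

```python
from collections import deque


class Home:
    """Hypothetical Home node: stores only the identity of the current Head."""

    def __init__(self):
        self.head = None      # pointer to the current Head (None: no sharing list)
        self.queue = deque()  # pending requests, serviced strictly in FIFO order

    def submit(self, requester):
        self.queue.append(requester)

    def service_all(self):
        """Service requests one at a time; return (requester, old_head) pairs."""
        responses = []
        while self.queue:
            requester = self.queue.popleft()
            old_head = self.head
            self.head = requester  # the requester becomes the new Head
            responses.append((requester, old_head))
        return responses
```

Because each request is fully serviced before the next one is dequeued, the second requester always learns the identity of the first and must negotiate with it directly, which is the deferral described above.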

==== Solihin 11.4 Resolution ====
The three previous sections discussed individual parts of how the SCI protocol reduces race conditions. Putting all three of these parts together yields the following scenario which resolves the Early Invalidation race as described in the text.
[[Image:SolihinSCI.png|center]]

# ''A'' sends a request to ''Home'' for access to the memory block. It then goes into a busy state while it waits for a response.
# ''B'' also sends a request to ''Home'' for access to the same memory block. ''A'''s request is received first.
# ''Home'' responds to ''A'' with the data, but this response gets caught in network traffic and delayed. This response is sent before ''Home'' processes the request from ''B''.
# ''Home'' responds to the request from ''B'', telling it that ''A'' is the current ''Head'' node.
# ''B'' then sends a request to ''A'' to tell it to demote itself.
# ''A'' still hasn't received the response from ''Home'', so it is still in a transaction, and it tells ''B'' that it is busy. ''B'' will have to retry the request.

Note that ''B'' will be in the ''PENDING''[[#Definitions_and_Terms|[def]]] state, so any requests to demote it from the ''Head'' position will be delayed. This could lead to a chain of nodes in the ''PENDING'' state, but as soon as node ''A'' receives its response from ''Home'' or times out, the requests in the chain will be successively accepted and processed.
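The six steps above, plus the eventual retry, can be replayed as a short Python trace. This is a hand-written walk-through rather than a protocol implementation: the message strings, the <code>a_busy</code> flag, and the function name are all assumptions made for this sketch.

```python
def resolve_early_invalidation():
    """Replay the scenario step by step; return (message trace, final Head)."""
    trace = []
    head = None  # Home's pointer to the current Head

    # 1. A requests the block and enters the busy state.
    a_busy = True
    trace.append("A->Home: request")
    # 2. B requests the same block; A's request reached Home first.
    trace.append("B->Home: request")
    # 3. Home services A; the data reply is delayed in the network.
    head = "A"
    trace.append("Home->A: data (delayed)")
    # 4. Home services B and names A as the current Head.
    old_head, head = head, "B"
    trace.append(f"Home->B: head is {old_head}")
    # 5-6. B asks A to demote itself, but A is still in its transaction.
    trace.append("B->A: demote")
    trace.append("A->B: BUSY" if a_busy else "A->B: ACK")
    # The delayed reply arrives and ends A's transaction.
    a_busy = False
    trace.append("A: data received")
    # B retries and is now accepted.
    trace.append("B->A: demote")
    trace.append("A->B: BUSY" if a_busy else "A->B: ACK")
    return trace, head
```

The point of the trace is the ordering: the busy response to ''B'' always precedes the arrival of ''A'''s data, so ''A'' is never demoted or invalidated before it holds the block.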

=== Other Possible Race Conditions ===
The design of the SCI protocol prevents many race conditions from occurring, as has already been shown. Here are two other conditions that may arise, with a discussion of how the protocol resolves the issue.

==== Concurrent List Deletions ====

==== Simultaneous Deletion and Invalidation ====

== Definitions and Terms ==
* ''Head Node'' - The node at the beginning of the sharing list
* ''Home Node'' - The node responsible for keeping track of the ''Head'' node for a sharing list for a memory block, usually the node at which the memory block resides.
* ''PENDING state'' - The intermediate state for a node while it is attempting to become the ''Head'' node in the sharing list. This includes the transaction involved in querying the ''Home'' node and the transaction involved in demoting the current ''Head'' node.
* ''SCI'' - Scalable Coherent Interface

== References ==
[[#1body|1.]] "IEEE Standard for Scalable Coherent Interface (SCI).," IEEE Std 1596-1992 , vol., no., pp.i, 1993. doi: 10.1109/IEEESTD.1993.120366
URL: http://ieeexplore.ieee.org.proxied.lib.ncsu.edu/stamp/stamp.jsp?tp=&arnumber=347683&isnumber=8049 
[[#2body|2.]] [http://www.scizzl.com/HowSCIcohWorks.html How SCI Coherence Works]

CSC/ECE 506 Spring 2011/ch11 BB EP

2011-04-18T23:07:23Z

Beburrou: /* Simultaneous Deletion and Invalidation */

__TOC__

== History ==

== Background ==
The Scalable Coherent Interface (SCI) cache coherence protocol defines a set of states for memory, a set of states for cache lines, and a set of actions that transition between these states when processors access memory.

Three sets of these attributes are defined for minimal, typical, and full applications:

* '''Minimal''' - for ‘trivial but correct’ applications that require the presence of the memory in only one cache line. It does not enable read-sharing, and is appropriate for small multi-processors or applications that don’t require significant sharing.

* '''Typical''' - enables read-sharing of a memory location and provisions for efficiency, such as DMA transfers, local caching of data, and error recovery. This set adds an additional stable memory state (FRESH) and multiple cache states. This set will be the focus of this article going forward.

* '''Full''' - adds support for pair-wise sharing, QOLB lock bit synchronization, and cleansing and washing of cache lines.

== State Diagrams ==
=== Memory States ===
The memory states define the state of the memory block from the perspective of the home directory. This 2-bit field is maintained in the memory tag by the home directory along with the pointer to the head of the sharing list (forwId). This simple state model includes three stable states and one semi-stable state.

{|border="1" cellspacing="0" cellpadding="2"
!colspan="5"| Table 1: Memory States
|----
!State
!Description
!Minimal
!Typical
!Full
|----
|HOME
|no sharing list
|Y
|Y
|Y
|----
|FRESH
|sharing-list copy is the same as memory
|
|Y
|Y
|----
|GONE
|sharing-list copy may be different from memory
|Y
|Y
|Y
|----
|WASH*
|transitional state (GONE to FRESH)
|
|
|Y
|----
|}

=== Cache States ===
The cache-line states are maintained by each processor's cache-coherency controller. This 7-bit field is stored in each cache line in the sharing list, along with the pointer to the next sharing-list node (forwId) and the previous sharing-list node (backId). Seven bits enable up to 128 possible values, and SCI defines twenty-nine stable states for use in the minimal, typical, and full sets.

Stable states are those cache states that exist when a memory transaction is not in progress. Their names are derived from a combination of the position of the node in the sharing list, such as ONLY, HEAD, MID, and TAIL, and the state of the data for that node, such as FRESH, CLEAN, DIRTY, VALID, or STALE.

[[Image:cache_states.png|center]]

== Race Conditions ==
In a distributed shared memory system with caching, the emergence of race conditions is extremely likely. This is mainly due to the lack of a bus for serialization of actions. It is further compounded by the problem of network errors and congestion.

The early invalidation case from Section 11.4 in Solihin is an excellent example of a race condition that can arise in a distributed system. Recall the diagram from the text, and the cache coherence actions, as shown below.

[[Image:EarlyInValidationRace.png|400px|center]]
The circled actions are as follows:
# ''A'' sends a read request to ''Home''.
# ''Home'' replies with data (but the message gets delayed).
# ''B'' sends a write request to ''Home''.
# ''Home'' sends an invalidation to ''A'', and it arrives before the ReplyD.

The SCI protocol has a way to handle this race condition, as well as many others, and the following sections will discuss how the SCI protocol design can prevent this race condition.

=== Prevention in the SCI Protocol ===
Race conditions are almost non-existent in the SCI protocol, due primarily to the protocol's design. A brief discussion of how this is accomplished follows, and then a walk-through of what happens if this condition arises in the SCI protocol.
==== Atomic Transactions ====
SCI's primary method for preventing race conditions is having atomic transactions. A transaction is defined as a set of sub-actions necessary to complete some requested action, such as querying memory for data or invalidating a node's cache. This usually takes the form of a request-response pair, where the initiating node sends a request to another node and waits for the response. The transaction is not complete until the response returns. Suppose, for example, that node ''A'' is in the middle of a transaction with node ''B''. Node ''C'' then tries to make a request of node ''A''. Node ''A'' will respond to node ''C'' that it is busy, telling node ''C'' to try again, as shown in the following diagram.
[[Image:AtomicBusy.png|center]]
Thus, in the Early Invalidation race, node ''A'' would be in the middle of a transaction, having requested a block of data from memory and not having received a response, which would prevent node ''B'' from invalidating it. Note, however, that nodes will not stay in the busy state indefinitely. Doing so would lead to potential deadlocks, so all transactions have timeouts that will eventually cause the transaction to fail, moving the node out of the busy state and making it able to respond to requests again.

==== Head Node ====
Another method for preventing race conditions is to have the ''Head'' node of a sharing list perform many of the coherence actions. As a result, only one node performs actions such as writes and invalidations of other sharers, which reduces the possibility of concurrent actions. If another node wants to write, for instance, it must become the ''Head'' node of the sharing list for the cache line to which it wants to write, displacing the current ''Head'' node as the node that can write. All state changes in the sharing list occur at the behest of the ''Head'' node, with the exception of nodes deleting themselves from the list.

When simply sharing a read-only copy of a cache line, as when the memory is in the ''FRESH'' state, the ''Head'' node is somewhat irrelevant. All the sharing nodes have their own cached copy of the line. If any node wants to write, including the ''Head'' node, it must perform an additional action in order to do so. Likewise, when a sharing list is sharing a line of data in the dirty, or ''GONE'', memory state, all the sharing nodes stay in the list with their cached line as long as the ''Head'' node has not written to the line. However, when the ''Head'' node wants to write, it can write its data immediately, but it must then invalidate all other shared copies via the forward pointers in the sharing list.

This, however, does not protect against a race condition where the memory is ''FRESH'', the ''Head'' node wants to write, and another node also wants to join the list. The Memory Access mechanism prevents this condition.

==== Memory Access ====
The final way that the SCI protocol minimizes race conditions is by changing the directory structure. In Solihin 11.4, there is a ''Home'' node which keeps track of the cache state and has a certain amount of knowledge about the system as a result. In the SCI protocol, there is still a ''Home'' node of sorts, but this node is only responsible for keeping track of which node is the current ''Head''. Such a ''Home'' node is usually the node where the memory block physically resides. When a node wants to access a block of memory, it sends a request to the ''Home'' node of that memory block, assuming it isn't already in the sharing list for that block. In the event that the data is already being used, the ''Home'' node will include in its response the address of the ''Head'' node of the sharing list. This requirement necessarily serializes the order of access, as one access request is not serviced at the ''Home'' node until the previous request is finished, and the requests are processed in FIFO order. In addition, whichever node is the ''Head'' of the sharing list will be notified by the new ''Head'' when it is demoted. All parties involved in a potential race are thus talking to each other, preventing the race from occurring in the first place.

Because of this requirement, the projected race condition from the previous section, where the ''Head'' node is in a ''FRESH'' state and wants to write while another node also wants to read, cannot occur. This is because both nodes must make their request to the ''Home'' node; the ''Head'' node must request the ability to write, and the other node must request the data from the ''Home'' node. One of these requests will make it to the ''Home'' node first, resulting in the second request being deferred to the first node to reach the ''Home'' node. Such a scenario is pictured below.
[[Image:MemoryAccess.png|center]]

==== Solihin 11.4 Resolution ====
The three previous sections discussed individual parts of how the SCI protocol reduces race conditions. Putting all three of these parts together yields the following scenario which resolves the Early Invalidation race as described in the text.
[[Image:SolihinSCI.png|center]]

# ''A'' sends a request to ''Home'' for access to the memory block. It then goes into a busy state while it waits for a response.
# ''B'' also sends a request to ''Home'' for access to the same memory block. ''A'''s request is received first.
# ''Home'' responds to ''A'' with the data, but this response gets caught in network traffic and delayed. This response is sent before ''Home'' processes the request from ''B''.
# ''Home'' responds to the request from ''B'', telling it that ''A'' is the current ''Head'' node.
# ''B'' then sends a request to ''A'' to tell it to demote itself.
# ''A'' still hasn't received the response from ''Home'', so it is still in a transaction, and it tells ''B'' that it is busy. ''B'' will have to retry the request.

Note that ''B'' will be in the ''PENDING''[[#Definitions_and_Terms|[def]]] state, so any requests to demote it from the ''Head'' position will be delayed. This could lead to a chain of nodes in the ''PENDING'' state, but as soon as node ''A'' receives its response from ''Home'' or times out, the requests in the chain will be successively accepted and processed.

=== Other Possible Race Conditions ===
The design of the SCI protocol prevents many race conditions from occurring, as has already been shown. Here are two other conditions that may arise, with a discussion of how the protocol resolves the issue.

==== Concurrent List Deletions ====

==== Simultaneous Deletion and Invalidation ====
Even though the ''Head'' node performs most of the coherence actions for a sharing list, the individual nodes are able to invalidate themselves, such as when they evict the block from their cache. This is referred to in the SCI standard as a "deletion" from the sharing list. Deletion is accomplished by having the invalidating node "lock" itself and then inform its forward and back nodes that they should now point to each other. This "locking" is essentially another ''PENDING'' state. A problem could arise, then, when two neighboring nodes try to invalidate themselves at the same time, as they would both be locked and not respond to each other's message. In this case, though, the protocol specifies that the node that is closest to the tail takes priority. So, the deadlock and/or race condition is averted by ordering the deletion from the tail towards the head.
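The tail-first ordering can be shown with a minimal Python sketch. It models the sharing list as a plain Python list ordered from head to tail, which is a simplification of the forward and back pointers; the function name and node labels are hypothetical.

```python
def delete_concurrently(sharing_list, leaving):
    """Resolve simultaneous deletions by giving priority to the tail side.

    sharing_list: node names ordered from head to tail.
    leaving: set of nodes that try to delete themselves at the same time.
    Returns (order in which deletions proceed, remaining list).
    """
    remaining = list(sharing_list)
    # The node closest to the tail goes first, then the next closest, and so on.
    order = sorted(leaving, key=remaining.index, reverse=True)
    for node in order:
        # Unlinking a node makes its two neighbours point at each other.
        remaining.remove(node)
    return order, remaining
```

For a list <code>["H", "M1", "M2", "T"]</code> in which the neighbours <code>M1</code> and <code>M2</code> both leave, the sketch lets <code>M2</code> proceed first, so neither node waits indefinitely on the other.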

== Definitions and Terms ==
* ''Head Node'' - The node at the beginning of the sharing list
* ''Home Node'' - The node responsible for keeping track of the ''Head'' node for a sharing list for a memory block, usually the node at which the memory block resides.
* ''PENDING state'' - The intermediate state for a node while it is attempting to become the ''Head'' node in the sharing list. This includes the transaction involved in querying the ''Home'' node and the transaction involved in demoting the current ''Head'' node.
* ''SCI'' - Scalable Coherent Interface

== References ==
[[#1body|1.]] "IEEE Standard for Scalable Coherent Interface (SCI).," IEEE Std 1596-1992 , vol., no., pp.i, 1993. doi: 10.1109/IEEESTD.1993.120366
URL: http://ieeexplore.ieee.org.proxied.lib.ncsu.edu/stamp/stamp.jsp?tp=&arnumber=347683&isnumber=8049 
[[#2body|2.]] [http://www.scizzl.com/HowSCIcohWorks.html How SCI Coherence Works]

CSC/ECE 506 Spring 2011/ch11 BB EP

2011-04-18T23:06:18Z

Beburrou: /* Simultaneous Deletion and Invalidation */

__TOC__

== History ==

== Background ==
The Scalable Coherent Interface (SCI) cache coherence protocol defines a set of states for memory, a set of states for cache lines, and a set of actions that transition between these states when processors access memory.

Three sets of these attributes are defined for minimal, typical, and full applications:

* '''Minimal''' - for ‘trivial but correct’ applications that require the presence of the memory in only one cache line. It does not enable read-sharing, and is appropriate for small multi-processors or applications that don’t require significant sharing.

* '''Typical''' - enables read-sharing of a memory location and provisions for efficiency, such as DMA transfers, local caching of data, and error recovery. This set adds an additional stable memory state (FRESH) and multiple cache states. This set will be the focus of this article going forward.

* '''Full''' - adds support for pair-wise sharing, QOLB lock bit synchronization, and cleansing and washing of cache lines.

== State Diagrams ==
=== Memory States ===
The memory states define the state of the memory block from the perspective of the home directory. This 2-bit field is maintained in the memory tag by the home directory along with the pointer to the head of the sharing list (forwId). This simple state model includes three stable states and one semi-stable state.

{|border="1" cellspacing="0" cellpadding="2"
!colspan="5"| Table 1: Memory States
|----
!State
!Description
!Minimal
!Typical
!Full
|----
|HOME
|no sharing list
|Y
|Y
|Y
|----
|FRESH
|sharing-list copy is the same as memory
|
|Y
|Y
|----
|GONE
|sharing-list copy may be different from memory
|Y
|Y
|Y
|----
|WASH*
|transitional state (GONE to FRESH)
|
|
|Y
|----
|}

=== Cache States ===
The cache-line states are maintained by each processor's cache-coherency controller. This 7-bit field is stored in each cache line in the sharing list, along with the pointer to the next sharing-list node (forwId) and the previous sharing-list node (backId). Seven bits enable up to 128 possible values, and SCI defines twenty-nine stable states for use in the minimal, typical, and full sets.

Stable states are those cache states that exist when a memory transaction is not in progress. Their names are derived from a combination of the position of the node in the sharing list, such as ONLY, HEAD, MID, and TAIL, and the state of the data for that node, such as FRESH, CLEAN, DIRTY, VALID, or STALE.

[[Image:cache_states.png|center]]

== Race Conditions ==
In a distributed shared memory system with caching, the emergence of race conditions is extremely likely. This is mainly due to the lack of a bus for serialization of actions. It is further compounded by the problem of network errors and congestion.

The early invalidation case from Section 11.4 in Solihin is an excellent example of a race condition that can arise in a distributed system. Recall the diagram from the text, and the cache coherence actions, as shown below.

[[Image:EarlyInValidationRace.png|400px|center]]
The circled actions are as follows:
# ''A'' sends a read request to ''Home''.
# ''Home'' replies with data (but the message gets delayed).
# ''B'' sends a write request to ''Home''.
# ''Home'' sends an invalidation to ''A'', and it arrives before the ReplyD.

The SCI protocol has a way to handle this race condition, as well as many others, and the following sections will discuss how the SCI protocol design can prevent this race condition.

=== Prevention in the SCI Protocol ===
Race conditions are almost non-existent in the SCI protocol, due primarily to the protocol's design. A brief discussion of how this is accomplished follows, and then a walk-through of what happens if this condition arises in the SCI protocol.
==== Atomic Transactions ====
SCI's primary method for preventing race conditions is having atomic transactions. A transaction is defined as a set of sub-actions necessary to complete some requested action, such as querying memory for data or invalidating a node's cache. This usually takes the form of a request-response pair, where the initiating node sends a request to another node and waits for the response. The transaction is not complete until the response returns. Suppose, for example, that node ''A'' is in the middle of a transaction with node ''B''. Node ''C'' then tries to make a request of node ''A''. Node ''A'' will respond to node ''C'' that it is busy, telling node ''C'' to try again, as shown in the following diagram.
[[Image:AtomicBusy.png|center]]
Thus, in the Early Invalidation race, node ''A'' would be in the middle of a transaction, having requested a block of data from memory and not having received a response, which would prevent node ''B'' from invalidating it. Note, however, that nodes will not stay in the busy state indefinitely. Doing so would lead to potential deadlocks, so all transactions have timeouts that will eventually cause the transaction to fail, moving the node out of the busy state and making it able to respond to requests again.

==== Head Node ====
Another method for preventing race conditions is to have the ''Head'' node of a sharing list perform many of the coherence actions. As a result, only one node performs actions such as writes and invalidations of other sharers, which reduces the possibility of concurrent actions. If another node wants to write, for instance, it must become the ''Head'' node of the sharing list for the cache line to which it wants to write, displacing the current ''Head'' node as the node that can write. All state changes in the sharing list occur at the behest of the ''Head'' node, with the exception of nodes deleting themselves from the list.

When simply sharing a read-only copy of a cache line, as when the memory is in the ''FRESH'' state, the ''Head'' node is somewhat irrelevant. All the sharing nodes have their own cached copy of the line. If any node wants to write, including the ''Head'' node, it must perform an additional action in order to do so. Likewise, when a sharing list is sharing a line of data in the dirty, or ''GONE'', memory state, all the sharing nodes stay in the list with their cached line as long as the ''Head'' node has not written to the line. However, when the ''Head'' node wants to write, it can write its data immediately, but it must then invalidate all other shared copies via the forward pointers in the sharing list.

This, however, does not protect against a race condition where the memory is ''FRESH'', the ''Head'' node wants to write, and another node also wants to join the list. The Memory Access mechanism prevents this condition.

==== Memory Access ====
The final way that the SCI protocol minimizes race conditions is by changing the directory structure. In Solihin 11.4, there is a ''Home'' node which keeps track of the cache state and has a certain amount of knowledge about the system as a result. In the SCI protocol, there is still a ''Home'' node of sorts, but this node is only responsible for keeping track of which node is the current ''Head''. Such a ''Home'' node is usually the node where the memory block physically resides. When a node wants to access a block of memory, it sends a request to the ''Home'' node of that memory block, assuming it isn't already in the sharing list for that block. In the event that the data is already being used, the ''Home'' node will include in its response the address of the ''Head'' node of the sharing list. This requirement necessarily serializes the order of access, as one access request is not serviced at the ''Home'' node until the previous request is finished, and the requests are processed in FIFO order. In addition, whichever node is the ''Head'' of the sharing list will be notified by the new ''Head'' when it is demoted. All parties involved in a potential race are thus talking to each other, preventing the race from occurring in the first place.

Because of this requirement, the projected race condition from the previous section, where the ''Head'' node is in a ''FRESH'' state and wants to write while another node also wants to read, cannot occur. This is because both nodes must make their request to the ''Home'' node; the ''Head'' node must request the ability to write, and the other node must request the data from the ''Home'' node. One of these requests will make it to the ''Home'' node first, resulting in the second request being deferred to the first node to reach the ''Home'' node. Such a scenario is pictured below.
[[Image:MemoryAccess.png|center]]

==== Solihin 11.4 Resolution ====
The three previous sections discussed individual parts of how the SCI protocol reduces race conditions. Putting all three of these parts together yields the following scenario which resolves the Early Invalidation race as described in the text.
[[Image:SolihinSCI.png|center]]

# ''A'' sends a request to ''Home'' for access to the memory block. It then goes into a busy state while it waits for a response.
# ''B'' also sends a request to ''Home'' for access to the same memory block. ''A'''s request is received first.
# ''Home'' responds to ''A'' with the data, but this response gets caught in network traffic and delayed. This response is sent before ''Home'' processes the request from ''B''.
# ''Home'' responds to the request from ''B'', telling it that ''A'' is the current ''Head'' node.
# ''B'' then sends a request to ''A'' to tell it to demote itself.
# ''A'' still hasn't received the response from ''Home'', so it is still in a transaction, and it tells ''B'' that it is busy. ''B'' will have to retry the request.

Note that ''B'' will be in the ''PENDING''[[#Definitions_and_Terms|[def]]] state, so any requests to demote it from the ''Head'' position will be delayed. This could lead to a chain of nodes in the ''PENDING'' state, but as soon as node ''A'' receives its response from ''Home'' or times out, the requests in the chain will be successively accepted and processed.

=== Other Possible Race Conditions ===
The design of the SCI protocol prevents many race conditions from occurring, as has already been shown. Here are two other conditions that may arise, with a discussion of how the protocol resolves the issue.

==== Concurrent List Deletions ====

==== Simultaneous Deletion and Invalidation ====

== Definitions and Terms ==
* ''Head Node'' - The node at the beginning of the sharing list
* ''Home Node'' - The node responsible for keeping track of the ''Head'' node for a sharing list for a memory block, usually the node at which the memory block resides.
* ''PENDING state'' - The intermediate state for a node while it is attempting to become the ''Head'' node in the sharing list. This includes the transaction involved in querying the ''Home'' node and the transaction involved in demoting the current ''Head'' node.
* ''SCI'' - Scalable Coherent Interface

== References ==
[[#1body|1.]] "IEEE Standard for Scalable Coherent Interface (SCI).," IEEE Std 1596-1992 , vol., no., pp.i, 1993. doi: 10.1109/IEEESTD.1993.120366
URL: http://ieeexplore.ieee.org.proxied.lib.ncsu.edu/stamp/stamp.jsp?tp=&arnumber=347683&isnumber=8049 
[[#2body|2.]] [http://www.scizzl.com/HowSCIcohWorks.html How SCI Coherence Works]

CSC/ECE 506 Spring 2011/ch11 BB EP

2011-04-18T23:05:48Z

Beburrou: /* Solihin 11.4 Resolution */

__TOC__

== History ==

== Background ==
The Scalable Coherent Interface (SCI) cache coherence protocol defines a set of states for memory, a set of states for cache lines, and a set of actions that transition between these states when processors access memory.

Three sets of these attributes are defined for minimal, typical, and full applications:

* '''Minimal''' - for ‘trivial but correct’ applications that require the presence of the memory in only one cache line. It does not enable read-sharing, and is appropriate for small multi-processors or applications that don’t require significant sharing.

* '''Typical''' - enables read-sharing of a memory location and provisions for efficiency, such as DMA transfers, local caching of data, and error recovery. This set adds an additional stable memory state (FRESH) and multiple cache states. This set will be the focus of this article going forward.

* '''Full''' - adds support for pair-wise sharing, QOLB lock bit synchronization, and cleansing and washing of cache lines.

== State Diagrams ==
=== Memory States ===
The memory states define the state of the memory block from the perspective of the home directory. This 2-bit field is maintained in the memory tag by the home directory along with the pointer to the head of the sharing list (forwId). This simple state model includes three stable states and one semi-stable state.

{|border="1" cellspacing="0" cellpadding="2"
!colspan="5"| Table 1: Memory States
|----
!State
!Description
!Minimal
!Typical
!Full
|----
|HOME
|no sharing list
|Y
|Y
|Y
|----
|FRESH
|sharing-list copy is the same as memory
|
|Y
|Y
|----
|GONE
|sharing-list copy may be different from memory
|Y
|Y
|Y
|----
|WASH*
|transitional state (GONE to FRESH)
|
|
|Y
|----
|}

=== Cache States ===
The cache-line states are maintained by each processor's cache-coherency controller. This 7-bit field is stored in each cache line in the sharing list, along with the pointer to the next sharing-list node (forwId) and the previous sharing-list node (backId). Seven bits enable up to 128 possible values, and SCI defines twenty-nine stable states for use in the minimal, typical, and full sets.

Stable states are those cache states that exist when a memory transaction is not in progress. Their names are derived from a combination of the position of the node in the sharing list, such as ONLY, HEAD, MID, and TAIL, and the state of the data for that node, such as FRESH, CLEAN, DIRTY, VALID, or STALE.

[[Image:cache_states.png|center]]

== Race Conditions ==
In a distributed shared memory system with caching, the emergence of race conditions is extremely likely. This is mainly due to the lack of a bus for serialization of actions. It is further compounded by the problem of network errors and congestion.

The early invalidation case from Section 11.4 in Solihin is an excellent example of a race condition that can arise in a distributed system. Recall the diagram from the text, and the cache coherence actions, as shown below.

[[Image:EarlyInValidationRace.png|400px|center]]
The circled actions are as follows:
# ''A'' sends a read request to ''Home''.
# ''Home'' replies with data (but the message gets delayed).
# ''B'' sends a write request to ''Home''.
# ''Home'' sends an invalidation to ''A'', and it arrives before the ReplyD.

The SCI protocol has a way to handle this race condition, as well as many others, and the following sections will discuss how the SCI protocol design can prevent this race condition.

=== Prevention in the SCI Protocol ===
Race conditions are almost non-existent in the SCI protocol, primarily because of the protocol's design. The subsections below describe how this is accomplished and then show what happens when the early invalidation scenario arises under SCI.
==== Atomic Transactions ====
SCI's primary method for preventing race conditions is the use of atomic transactions. A transaction is defined as a set of sub-actions necessary to complete some requested action, such as querying memory for data or invalidating a node's cache. This usually takes the form of a request-response pair, where the initiating node sends a request to another node and waits for the response. The transaction is not complete until the response returns. Suppose, for example, that node ''A'' is in the middle of a transaction with node ''B''. Node ''C'' then tries to make a request of node ''A''. Node ''A'' will respond to node ''C'' that it is busy, telling node ''C'' to try again, as shown in the following diagram.
[[Image:AtomicBusy.png|center]]
Thus, in the Early Invalidation race, node ''A'' would be in the middle of a transaction, having requested a block of data from memory and not yet having received a response, which would prevent node ''B'' from invalidating it. Note, however, that nodes will not stay in the busy state indefinitely. Doing so could lead to deadlocks, so all transactions have timeouts that eventually cause the transaction to fail, moving the node out of the busy state and making it able to respond to requests again.
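
The busy-and-retry behaviour described above can be sketched in Python as follows. This is a simplified model under stated assumptions: `Node`, `request_with_retry`, the `"BUSY"`/`"DONE"` strings, and the `events` hook are hypothetical and do not correspond to message names in the standard.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.busy = False  # True while this node awaits a response

    def start_transaction(self):
        self.busy = True

    def finish_transaction(self):
        # Called when the response arrives or when the transaction times out.
        self.busy = False

    def handle_request(self):
        """Reject requests while mid-transaction, otherwise accept them."""
        return "BUSY" if self.busy else "DONE"


def request_with_retry(target, events, max_attempts=5):
    """Retry until accepted; `events` maps an attempt number to a callback run first."""
    for attempt in range(1, max_attempts + 1):
        if attempt in events:
            events[attempt]()
        if target.handle_request() == "DONE":
            return attempt
    return None  # gave up; the target never left the busy state
```

For instance, if node ''A'' starts a transaction and its response arrives just before the third attempt by node ''C'', `request_with_retry(a, {3: a.finish_transaction})` returns 3.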

==== Head Node ====
Another method for preventing race conditions is to have the ''Head'' node of a sharing list perform many of the coherence actions. As a result, only one node performs actions such as writes and invalidations of other sharers, which reduces the possibility of concurrent actions. If another node wants to write, for instance, it must become the ''Head'' node of the sharing list for the cache line to which it wants to write, displacing the current ''Head'' node as the node that can write. All state changes in the sharing list occur at the behest of the ''Head'' node, with the exception of nodes deleting themselves from the list.

When nodes are simply sharing a read-only copy of a line, as when the memory is in the ''FRESH'' state, the ''Head'' node is somewhat irrelevant. All the sharing nodes have their own cached value of the cache line. If any node wants to write, including the ''Head'' node, it must perform an additional action in order to do so. Likewise, when a sharing list is sharing a line of data in the dirty, or ''GONE'', memory state, all the sharing nodes stay in the list with their cached line as long as the ''Head'' node has not written to the line. However, at the point the ''Head'' node wants to write, it can write its data immediately, but it must then invalidate all other shared copies via the forward pointers in the sharing list.

This, however, does not protect against a race condition where the memory is ''FRESH'', the ''Head'' node wants to write, and another node also wants to join the list. The Memory Access mechanism prevents this condition.
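
The head-driven invalidation described above, in which the ''Head'' node walks the forward pointers of the sharing list, can be sketched as follows. `Sharer`, `build_list`, and `head_write_and_purge` are hypothetical names, and the sketch ignores the request-response messaging that a real implementation would need for each step.

```python
class Sharer:
    def __init__(self, name):
        self.name = name
        self.valid = True
        self.forw = None  # next node toward the tail
        self.back = None  # previous node toward the head


def build_list(names):
    """Link nodes head-to-tail and return the head."""
    nodes = [Sharer(n) for n in names]
    for prev, nxt in zip(nodes, nodes[1:]):
        prev.forw, nxt.back = nxt, prev
    return nodes[0]


def head_write_and_purge(head):
    """After the head writes, invalidate every other sharer via forward pointers."""
    purged = []
    node = head.forw
    while node is not None:
        nxt = node.forw
        node.valid = False
        node.forw = node.back = None
        purged.append(node.name)
        node = nxt
    head.forw = None  # the head is now the only node in the list
    return purged
```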

==== Memory Access ====
The final way that the SCI protocol minimizes race conditions is by changing the directory structure. In Solihin 11.4, there is a ''Home'' node which keeps track of the cache state and has a certain amount of knowledge about the system as a result. In the SCI protocol, there is still a ''Home'' node of sorts, but this node is only responsible for keeping track of which node is the current ''Head''. Such a ''Home'' node is usually the node where the memory block physically resides. When a node wants to access a block of memory, it sends a request to the ''Home'' node of that memory block, assuming it isn't already in the sharing list for that block. In the event that the data is already being used, the ''Home'' node will include in its response the address of the ''Head'' node of the sharing list. This requirement necessarily serializes the order of access, as one access request is not serviced at the ''Home'' node until the previous request is finished, and the requests are processed in FIFO order. In addition, whichever node is the ''Head'' of the sharing list will be notified by the new ''Head'' when it is demoted. All parties involved in a potential race are thus talking to each other, preventing the race from occurring in the first place.

Because of this requirement, the projected race condition from the previous section, where the ''Head'' node is in a ''FRESH'' state and wants to write while another node also wants to read, cannot occur. This is because both nodes must make their request to the ''Home'' node; the ''Head'' node must request the ability to write, and the other node must request the data from the ''Home'' node. One of these requests will reach the ''Home'' node first, and the second request will be redirected to the node whose request reached the ''Home'' node first. Such a scenario is pictured below.
[[Image:MemoryAccess.png|center]]
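
The serialization performed by the ''Home'' node can be sketched in Python as follows. This is a simplified model for a single memory block; the `Home` class and its methods are hypothetical, and the sketch assumes that every requester becomes the new ''Head'' once its request is served.

```python
from collections import deque


class Home:
    """Hypothetical home directory that tracks only the current head of one block."""

    def __init__(self):
        self.head = None      # no sharing list yet
        self.queue = deque()  # requests are served in arrival (FIFO) order

    def enqueue(self, requester):
        self.queue.append(requester)

    def service_all(self):
        """Serve queued requests one at a time; each requester becomes the new head."""
        replies = []
        while self.queue:
            requester = self.queue.popleft()
            old_head = self.head
            self.head = requester
            # old_head is None when memory itself supplies the data.
            replies.append((requester, old_head))
        return replies
```

If requests from ''A'' and then ''B'' are queued, the replies are `("A", None)` followed by `("B", "A")`: the later requester is told which node to contact rather than receiving a conflicting answer from memory.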

==== Solihin 11.4 Resolution ====
The three previous sections discussed the individual ways in which the SCI protocol reduces race conditions. Putting all three together yields the following scenario, which resolves the Early Invalidation race as described in the text.
[[Image:SolihinSCI.png|center]]

# ''A'' sends a request to ''Home'' for access to the memory block. It then goes into a busy state while it waits for a response.
# ''B'' also sends a request to ''Home'' for access to the same memory block. ''A'''s request is received first.
# ''Home'' responds to ''A'' with the data, but this response gets caught in network traffic and delayed. This response is sent before ''Home'' processes the request from ''B''.
# ''Home'' responds to the request from ''B'', telling it that ''A'' is the current ''Head'' node.
# ''B'' then sends a request to ''A'' to tell it to demote itself.
# ''A'' still hasn't received the response from ''Home'', so it is still in a transaction, and it tells ''B'' that it is busy. ''B'' will have to retry the request.

It should be noted that ''B'' will be in the ''PENDING''[[#Definitions_and_Terms|[def]]] state, so any requests to demote it from the ''Head'' position will be delayed. This could lead to a chain of nodes in the ''PENDING'' state, but as soon as node ''A'' receives its response from ''Home'' or times out, the requests in the chain will be successively accepted and processed.
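
The six steps above can be replayed in a small Python sketch. It is a deliberately coarse model: `replay_resolution` and its `reply_arrives_after_attempt` parameter are hypothetical, and the moment at which the delayed reply reaches ''A'' is an assumption of the sketch rather than something the protocol fixes.

```python
def replay_resolution(reply_arrives_after_attempt=1):
    """Replay the steps above; A's delayed reply lands after the given attempt by B."""
    log = []
    a_busy = True               # step 1: A sends its request and goes busy
    head = "A"                  # steps 2-3: Home serves A first; the reply is delayed
    old_head, head = head, "B"  # step 4: Home tells B that A is the current head
    log.append(f"Home->B: current head is {old_head}")
    attempt = 0
    while True:                 # step 5: B asks A to demote itself
        attempt += 1
        if a_busy:
            log.append(f"attempt {attempt}: A busy, B retries")  # step 6
            if attempt >= reply_arrives_after_attempt:
                a_busy = False  # the delayed data finally reaches A
            continue
        log.append(f"attempt {attempt}: A demoted, B is head")
        return log, head
```

However long the reply is delayed, ''A'' is never invalidated before it has received its data, so the stale-copy outcome of the original race does not arise in this model.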

=== Other Possible Race Conditions ===
The design of the SCI protocol prevents many race conditions from occurring, as has already been shown. Here are two other conditions that may arise, with a discussion of how the protocol resolves the issue.

==== Concurrent List Deletions ====

==== Simultaneous Deletion and Invalidation ====
Even though the ''Head'' node performs most of the coherence actions for a sharing list, the individual nodes are able to invalidate themselves, such as when they evict the block from their cache. This is referred to in the SCI standard as a "deletion" from the sharing list. Deletion is accomplished by having the invalidating node "lock" itself and then inform its forward and back nodes that they should now point to each other. This "locking" is essentially another busy state. A problem could arise, then, when two neighboring nodes try to invalidate themselves at the same time, as they would both be locked and would not respond to each other's messages. In this case, though, the protocol specifies that the node closest to the tail takes priority. The deadlock or race condition is thus averted by ordering the deletions from the tail towards the head.
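
The tail-first ordering of deletions can be sketched as follows. The sketch serializes the deletions directly instead of modelling the lock and retry messages, and `Entry`, `link`, `unlink`, and `delete_concurrently` are hypothetical names introduced for illustration.

```python
class Entry:
    def __init__(self, name):
        self.name = name
        self.forw = None  # toward the tail
        self.back = None  # toward the head


def link(names):
    """Build a doubly linked sharing list and return its entries, head first."""
    entries = [Entry(n) for n in names]
    for prev, nxt in zip(entries, entries[1:]):
        prev.forw, nxt.back = nxt, prev
    return entries


def unlink(entry):
    """Make the neighbors point at each other, removing `entry` from the list."""
    if entry.back is not None:
        entry.back.forw = entry.forw
    if entry.forw is not None:
        entry.forw.back = entry.back
    entry.forw = entry.back = None


def delete_concurrently(entries, leaving):
    """Resolve simultaneous deletions by letting the node nearest the tail go first."""
    order = sorted(leaving, key=entries.index, reverse=True)
    for entry in order:
        unlink(entry)
    return [e.name for e in order]
```

With a list H, A, B, T in which A and B leave at the same time, B (closer to the tail) is removed first, then A, leaving H and T pointing at each other.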

== Definitions and Terms ==
* ''Head Node'' - The node at the beginning of the sharing list.
* ''Home Node'' - The node responsible for keeping track of the ''Head'' node for a sharing list for a memory block, usually the node at which the memory block resides.
* ''PENDING state'' - The intermediate state for a node while it is attempting to become the ''Head'' node in the sharing list. This includes the transaction involved in querying the ''Home'' node and the transaction involved in demoting the current ''Head'' node.
* ''SCI'' - Scalable Coherent Interface

== References ==
[[#1body|1.]] "IEEE Standard for Scalable Coherent Interface (SCI)," IEEE Std 1596-1992, 1993. doi: 10.1109/IEEESTD.1993.120366
URL: http://ieeexplore.ieee.org.proxied.lib.ncsu.edu/stamp/stamp.jsp?tp=&arnumber=347683&isnumber=8049 
[[#2body|2.]] [http://www.scizzl.com/HowSCIcohWorks.html How SCI Coherence Works]

CSC/ECE 506 Spring 2011/ch11 BB EP

2011-04-18T22:53:17Z

Beburrou: /* Solihin 11.4 Resolution */

__TOC__

== History ==

== Background ==
The Scalable Coherent Interface (SCI) cache coherence protocol defines a set of states for memory, a set of states for cache lines, and a set of actions that transition between these states when processors access memory.

Three sets of these attributes are defined for minimal, typical, and full applications:

* '''Minimal''' - for ‘trivial but correct’ applications that require the presence of the memory in only one cache line. It does not enable read-sharing, and is appropriate for small multi-processors or applications that don’t require significant sharing.

* '''Typical''' - enables read-sharing of a memory location and provisions for efficiency, such as DMA transfers, local caching of data, and error recovery. This set adds an additional stable memory state (FRESH) and multiple cache states. This set will be the focus of this article going forward.

* '''Full''' - adds support for pair-wise sharing, QOLB lock bit synchronization, and cleansing and washing of cache lines.

== State Diagrams ==
=== Memory States ===
The memory states define the state of the memory block from the perspective of the home directory. This 2-bit field is maintained in the memory tag by the home directory along with the pointer to the head of the sharing list (forwId). This simple state model includes three stable states and one semi-stable state.

{|border="1" cellspacing="0" cellpadding="2"
!colspan="5"| Table 1: Memory States
|----
!State
!Description
!Minimal
!Typical
!Full
|----
|HOME
|no sharing list
|Y
|Y
|Y
|----
|FRESH
|sharing-list copy is the same as memory
|
|Y
|Y
|----
|GONE
|sharing-list copy may be different from memory
|Y
|Y
|Y
|----
|WASH*
|transitional state (GONE to FRESH)
|
|
|Y
|----
|}

=== Cache States ===
The cache-line states are maintained by each processors cache-coherency controller. This 7-bit field is stored in each cache line in the sharing list, along with the pointer to the next sharing-list node (forwId) and the previous sharing-list node (backID). Seven bits enable up to 128 possible values, and SCI defines twenty-nine stable-states for use in the minimal, typical, and full sets.

Stable states are those cache states that exist when a memory transaction is not in process. Their names are derived from a combination of
the position of the node in the sharing list, such as ONLY, HEAD, MID, and TAIL, and the state of the data for that node, such as FRESH, CLEAN, DIRTY, VALID, STALE, etc.

[[Image:cache_states.png|center]]

== Race Conditions ==
In a distributed shared memory system with caching, the emergence of race conditions is extremely likely. This is mainly due to the lack of a bus for serialization of actions. It is further compounded by the problem of network errors and congestion.

The early invalidation case from Section 11.4 in Solihin is an excellent example of a race condition that can arise in a distributed system. Recall the diagram from the text, and the cache coherence actions, as shown below.

[[Image:EarlyInValidationRace.png|400px|center]]
The circled actions are as follows:
# ''A'' sends a read request to ''Home''.
# ''Home'' replies with data (but the message gets delayed).
# ''B'' sends a write request to ''Home''.
# ''Home'' sends invalidation to ''A'', and it arrives before the ReplyD

The SCI protocol has a way to handle this race condition, as well as many others, and the following sections will discuss how the SCI protocol design can prevent this race condition.

=== Prevention in the SCI Protocol ===
Race conditions are almost non-existent in the SCI protocol, due primarily to the protocol's design. A brief discussion of how this is accomplished follows, followed by a discussion showing what happens if this condition arises in the SCI protocol.
==== Atomic Transactions ====
SCI's primary method for preventing race conditions is having atomic transactions. A transaction is defined as a set of sub-actions necessary to complete some requested action, such as querying memory for a data or invalidating a node's cache. This usually takes the form a a request-response pair, where the initiating node sends a request to another node and waits for the response. The transaction is not complete until the response returns. Suppose, for example, that node ''A'' is in the middle of a transaction with node ''B''. Node ''C'' then tries to make a request of node ''A''. Node ''A'' will respond to node ''C'' that it is busy, telling node ''C'' to try again, as shown in the following diagram.
[[Image:AtomicBusy.png|center]]
Thus, in the Early Invalidation race, node ''A'' would be in the middle of a transaction, having requested a block of data from memory and not having received a response, which would prevent node ''B'' from invalidating it. Note, however, that nodes will not stay in the busy state indefinitely. Doing so would lead to potential deadlocks, so all transactions have time outs that will eventually cause the transaction to fail, thus moving the node out of the busy state, making it able to respond to requests again.

==== Head Node ====
Another method for preventing race conditions is to have the ''Head'' node of a sharing list perform many of the coherence actions. As a result, only one node is performing actions such as writes and invalidations of other sharers. Since only one node is performing these actions, the possibility of concurrent actions is decreased. If another node wants to write, for instance, it must become the Head Node of the sharing list for the cache line to which it wants to write, displacing the current ''Head'' node as the node that can write. All state changes in the sharing list occur at the behest of the ''Head'' node, with the exception of nodes deleting themselves from the list.

When simply sharing a read-only copy of a node, as in when the memory is a ''FRESH'' state, the ''Head'' node is somewhat irrelevant. All the sharing nodes have their own cached value of the cache line. If any node wants to write, including the ''Head'' node, it must perform an additional action in order to do so. Likewise, when a sharing list is sharing a line of data in the dirty, or ''GONE'', memory state, as long as the ''Head'' node has not written to the line, then all the sharing nodes stay in the list with their cached line. However, at the point the ''Head'' node wants to write, it can write its data immediately, but it then must invalidate all other shared copies, via the forward pointers in the sharing list.

This, however, does not protect against a race condition where the memory is ''FRESH'', the ''Head'' node wants to write, and another node also wants to join the list. The Memory Access mechanism prevents this condition.

==== Memory Access ====
The final way that the SCI protocol minimizes race conditions is by changing the directory structure. In Solihin 11.4, there is a ''Home'' node which keeps track of the cache state and has a certain amount of knowledge about the system as a result. In the SCI protocol, there is still a ''Home'' node of sorts, but this node is only responsible for keeping track of who the current ''Head''. Such a ''Home'' node is usually the node where the memory block physically resides. When a node wants to access a block of memory, it sends a request to the ''Home'' node of that memory block, assuming it isn't already in the sharing list for that block. In the event that the data is already being used, the ''Home'' node will include in its response the address of the ''Head'' node of the sharing list. This requirement necessarily serializes the order of access, as one access request is not serviced at the ''Home'' node until the previous request is finished, and the requests are processed in FIFO order. As well, whichever node is the ''Head'' of the sharing list will be notified by the new ''Head'' when it is demoted. All parties involved in a potential race are thus talking to each other, preventing the race from occurring in the first place.

Because of this requirement, the projected race condition from the previous section, where the ''Head'' node is in a ''FRESH'' state and wants to write while another node also wants to read, cannot occur. This is because both nodes must make their request to the ''Home'' node; the ''Head'' node must request the ability to write, and the other node must request the data from the ''Home'' node. One of these requests will make it to the ''Home'' node first, resulting in the second request being deferred to the first node to reach to ''Home'' node. Such a scenario is pictured below.
[[Image:MemoryAccess.png|center]]

==== Solihin 11.4 Resolution ====
The three previous sections discussed individual parts of how the SCI protocol reduces race conditions. Putting all three of these parts together yields the following scenario which resolves the Early Invalidation race as described in the text.
[[Image:SolihinSCI.png|center]]

# ''A'' sends a request to ''Home'' for access to the memory block. It then goes into a busy state while it waits for a response.
# ''B'' also sends a request to ''Home'' for access to the same memory block. ''A'''s request is received first
# ''Home'' responds to ''A'' with the data, but this response gets caught in network traffic and delayed. This response is sent before ''Home'' processes the request from ''B''.
# ''Home'' responds to the request from ''B'', telling it that ''A'' is the current ''Head'' node.
# ''B'' then sends a request to ''A'' to tell it to demote itself.
# ''A'' still hasn't received the response from ''Home'', so it is still in a transaction, and it tells ''B'' that it is busy. ''B'' will have to retry the request.

It should be noted that ''B'' will be in the ''PENDING''[http://pg-server.csc.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Spring_2011/ch11_BB_EP#Definitions_and_Terms [def] ] state, so any requests to demote it from the ''Head'' position will be delayed.. This could lead to a chaining of nodes in the ''PENDING'' state, but as soon as node ''A'' receives its response from ''Home'' or times out, the requests in the chain will be successively accepted and processed.

=== Other Possible Race Conditions ===
The design of the SCI protocol prevents many race conditions from occurring, as has already been shown. Here are two other conditions that may arise, with a discussion of how the protocol resolves the issue.

==== Concurrent List Deletions ====

==== Simultaneous Deletion and Invalidation ====
Even though the ''Head'' node performs most of the coherence actions for a shard list, the individual nodes are able to invalidate themselves, such as when they evict the block from their cache. This is referred to in the SCI standard as a "deletion" from the sharing list. Deletion is accomplished by having the invalidating node "lock" itself and then inform its forward and back nodes that they should now point to each other. This "locking" is essentially another busy state. A problem could arise, then, when two neighboring nodes try to invalidate themselves at the same time, as they would both be locked and not respond to each other's message. In this case, though, the protocol specifies that the node that is closest to the tail takes priority. So, the deadlock and/or race condition is averted by ordering the deletion from the tail towards the head.

== Definitions and Terms ==
* ''Head Node'' - The node at the beginning of the sharing list
* ''Home Node'' - The node responsible for keeping track of the ''Head'' node for a sharing list for a memory block, usually the node at which the memory block resides.
* ''PENDING state'' - The intermediate state for a node while it is attempting to become the ''Head'' node in the sharing list. This includes the transaction involved in querying the ''Home'' node and the transaction involved in demoting the current ''Head'' node.
* ''SCI'' - Scalable Coherent Interface

== References ==
[[#1body|1.]] "IEEE Standard for Scalable Coherent Interface (SCI)," IEEE Std 1596-1992, pp. i, 1993. doi: 10.1109/IEEESTD.1993.120366
URL: http://ieeexplore.ieee.org.proxied.lib.ncsu.edu/stamp/stamp.jsp?tp=&arnumber=347683&isnumber=8049 
[[#2body|2.]] [http://www.scizzl.com/HowSCIcohWorks.html How SCI Coherence Works]

CSC/ECE 506 Spring 2011/ch11 BB EP

2011-04-18T22:51:55Z

Beburrou: /* Definitions and Terms */

__TOC__

== History ==

== Background ==
The Scalable Coherent Interface (SCI) cache coherence protocol defines a set of states for memory, a set of states for cache lines, and a set of actions that transition between these states when processors access memory.

Three sets of these attributes are defined for minimal, typical, and full applications:

* '''Minimal''' - for ‘trivial but correct’ applications that require the presence of the memory in only one cache line. It does not enable read-sharing, and is appropriate for small multi-processors or applications that don’t require significant sharing.

* '''Typical''' - enables read-sharing of a memory location and provisions for efficiency, such as DMA transfers, local caching of data, and error recovery. This set adds an additional stable memory state (FRESH) and multiple cache states. This set will be the focus of this article going forward.

* '''Full''' - adds support for pair-wise sharing, QOLB lock bit synchronization, and cleansing and washing of cache lines.

== State Diagrams ==
=== Memory States ===
The memory states define the state of the memory block from the perspective of the home directory. This 2-bit field is maintained in the memory tag by the home directory along with the pointer to the head of the sharing list (forwId). This simple state model includes three stable states and one semi-stable state.

{|border="1" cellspacing="0" cellpadding="2"
!colspan="5"| Table 1: Memory States
|----
!State
!Description
!Minimal
!Typical
!Full
|----
|HOME
|no sharing list
|Y
|Y
|Y
|----
|FRESH
|sharing-list copy is the same as memory
|
|Y
|Y
|----
|GONE
|sharing-list copy may be different from memory
|Y
|Y
|Y
|----
|WASH*
|transitional state (GONE to FRESH)
|
|
|Y
|----
|}
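
As a rough illustration of Table 1, the sketch below records which memory states each option set uses. The names <code>MemoryState</code>, <code>SUPPORTED</code>, and <code>states_for</code> are hypothetical, and the enum values are the table descriptions rather than the 2-bit encodings from the standard, which are not reproduced here.

```python
from enum import Enum


class MemoryState(Enum):
    # Values are the descriptions from Table 1, not the standard's bit encodings.
    HOME = "no sharing list"
    FRESH = "sharing-list copy is the same as memory"
    GONE = "sharing-list copy may be different from memory"
    WASH = "transitional state (GONE to FRESH)"


# Which states each option set uses, transcribed from the Y marks in Table 1.
SUPPORTED = {
    "minimal": {MemoryState.HOME, MemoryState.GONE},
    "typical": {MemoryState.HOME, MemoryState.FRESH, MemoryState.GONE},
    "full": set(MemoryState),
}


def states_for(option_set):
    """Return the memory states available in the given option set."""
    return SUPPORTED[option_set]
```

For example, <code>states_for("minimal")</code> omits FRESH, which matches the statement that the minimal set does not enable read-sharing.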

=== Cache States ===
The cache-line states are maintained by each processor's cache-coherency controller. This 7-bit field is stored in each cache line in the sharing list, along with the pointer to the next sharing-list node (forwId) and the previous sharing-list node (backId). Seven bits enable up to 128 possible values, and SCI defines twenty-nine stable states for use in the minimal, typical, and full sets.

Stable states are those cache states that exist when a memory transaction is not in progress. Their names are derived from a combination of the position of the node in the sharing list, such as ONLY, HEAD, MID, and TAIL, and the state of the data for that node, such as FRESH, CLEAN, DIRTY, VALID, STALE, etc.

[[Image:cache_states.png|center]]

== Race Conditions ==
In a distributed shared memory system with caching, the emergence of race conditions is extremely likely. This is mainly due to the lack of a bus for serialization of actions. It is further compounded by the problem of network errors and congestion.

The early invalidation case from Section 11.4 in Solihin is an excellent example of a race condition that can arise in a distributed system. Recall the diagram from the text, and the cache coherence actions, as shown below.

[[Image:EarlyInValidationRace.png|400px|center]]
The circled actions are as follows:
# ''A'' sends a read request to ''Home''.
# ''Home'' replies with data (but the message gets delayed).
# ''B'' sends a write request to ''Home''.
# ''Home'' sends an invalidation to ''A'', and it arrives before the ReplyD.

The SCI protocol has a way to handle this race condition, as well as many others, and the following sections will discuss how the SCI protocol design can prevent this race condition.

=== Prevention in the SCI Protocol ===
Race conditions are almost non-existent in the SCI protocol, due primarily to the protocol's design. A brief discussion of how this is accomplished follows, and then a walkthrough of what happens when the early invalidation condition arises in the SCI protocol.
==== Atomic Transactions ====
SCI's primary method for preventing race conditions is having atomic transactions. A transaction is defined as a set of sub-actions necessary to complete some requested action, such as querying memory for data or invalidating a node's cache. This usually takes the form of a request-response pair, where the initiating node sends a request to another node and waits for the response. The transaction is not complete until the response returns. Suppose, for example, that node ''A'' is in the middle of a transaction with node ''B''. Node ''C'' then tries to make a request of node ''A''. Node ''A'' will respond to node ''C'' that it is busy, telling node ''C'' to try again, as shown in the following diagram.
[[Image:AtomicBusy.png|center]]
Thus, in the Early Invalidation race, node ''A'' would be in the middle of a transaction, having requested a block of data from memory and not having received a response, which would prevent node ''B'' from invalidating it. Note, however, that nodes will not stay in the busy state indefinitely. Doing so would lead to potential deadlocks, so all transactions have timeouts that will eventually cause the transaction to fail, thus moving the node out of the busy state and making it able to respond to requests again.
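
The busy-and-retry behaviour described above can be sketched roughly as follows. This is a simplified model under stated assumptions, not the mechanism defined in the standard: the class <code>Node</code>, the function <code>send_with_retry</code>, and the string replies <code>"BUSY"</code> and <code>"DONE"</code> are all hypothetical, and the end of a transaction (response or timeout) is modelled by a plain method call.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical model of a node; not taken from the SCI standard."""
    name: str
    busy: bool = False                      # True while our own request is outstanding
    log: list = field(default_factory=list)

    def begin_transaction(self):
        self.busy = True

    def end_transaction(self):
        # Stands in for "the response arrived" or "the transaction timed out".
        self.busy = False

    def handle_request(self, sender, action):
        if self.busy:
            return "BUSY"                   # the sender has to retry later
        self.log.append((sender, action))
        return "DONE"


def send_with_retry(target, sender, action, max_attempts, on_retry=None):
    """Retry until the target accepts; return the attempt count, or None."""
    for attempt in range(1, max_attempts + 1):
        if target.handle_request(sender, action) == "DONE":
            return attempt
        if on_retry is not None:
            on_retry(attempt)               # hook to let simulated time advance
    return None
```

In this model a request sent to a node that is still waiting on its own response is simply bounced, and it succeeds on the first retry after <code>end_transaction</code> is called.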

==== Head Node ====
Another method for preventing race conditions is to have the ''Head'' node of a sharing list perform many of the coherence actions. As a result, only one node is performing actions such as writes and invalidations of other sharers. Since only one node is performing these actions, the possibility of concurrent actions is decreased. If another node wants to write, for instance, it must become the Head Node of the sharing list for the cache line to which it wants to write, displacing the current ''Head'' node as the node that can write. All state changes in the sharing list occur at the behest of the ''Head'' node, with the exception of nodes deleting themselves from the list.

When simply sharing a read-only copy of a line, as when the memory is in a ''FRESH'' state, the ''Head'' node is somewhat irrelevant. All the sharing nodes have their own cached value of the cache line. If any node wants to write, including the ''Head'' node, it must perform an additional action in order to do so. Likewise, when a sharing list is sharing a line of data in the dirty, or ''GONE'', memory state, as long as the ''Head'' node has not written to the line, then all the sharing nodes stay in the list with their cached line. However, at the point the ''Head'' node wants to write, it can write its data immediately, but it then must invalidate all other shared copies, via the forward pointers in the sharing list.

This, however, does not protect against a race condition where the memory is ''FRESH'', the ''Head'' node wants to write, and another node also wants to join the list. The Memory Access mechanism prevents this condition.

==== Memory Access ====
The final way that the SCI protocol minimizes race conditions is by changing the directory structure. In Solihin 11.4, there is a ''Home'' node which keeps track of the cache state and has a certain amount of knowledge about the system as a result. In the SCI protocol, there is still a ''Home'' node of sorts, but this node is only responsible for keeping track of which node is the current ''Head''. Such a ''Home'' node is usually the node where the memory block physically resides. When a node wants to access a block of memory, it sends a request to the ''Home'' node of that memory block, assuming it isn't already in the sharing list for that block. In the event that the data is already being used, the ''Home'' node will include in its response the address of the ''Head'' node of the sharing list. This requirement necessarily serializes the order of access, as one access request is not serviced at the ''Home'' node until the previous request is finished, and the requests are processed in FIFO order. In addition, whichever node is the ''Head'' of the sharing list will be notified by the new ''Head'' when it is demoted. All parties involved in a potential race are thus talking to each other, preventing the race from occurring in the first place.

Because of this requirement, the projected race condition from the previous section, where the ''Head'' node is in a ''FRESH'' state and wants to write while another node also wants to read, cannot occur. This is because both nodes must make their request to the ''Home'' node; the ''Head'' node must request the ability to write, and the other node must request the data from the ''Home'' node. One of these requests will make it to the ''Home'' node first, resulting in the second request being deferred to the first node to reach the ''Home'' node. Such a scenario is pictured below.
[[Image:MemoryAccess.png|center]]
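
The serialization at the ''Home'' node can be sketched as a small model. It assumes, as the text above describes, that ''Home'' records only the current ''Head'' and serves requests in arrival order; the class <code>HomeNode</code> and its methods are hypothetical names, and data transfer is left out entirely.

```python
from collections import deque


class HomeNode:
    """Hypothetical Home node that records only the current Head."""

    def __init__(self):
        self.head = None        # pointer to the current Head of the sharing list
        self.queue = deque()    # pending requests, in arrival (FIFO) order

    def receive(self, requester):
        self.queue.append(requester)

    def process_next(self):
        """Serve one queued request; return (requester, previous_head)."""
        requester = self.queue.popleft()
        previous_head = self.head
        self.head = requester   # the requester becomes the new Head
        return requester, previous_head
```

If ''A'' and ''B'' both send requests and ''A'' arrives first, <code>process_next()</code> answers ''A'' with no previous ''Head'', then answers ''B'' with ''A'' as the node it must contact, which is the deferral described above.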

==== Solihin 11.4 Resolution ====
The three previous sections discussed individual parts of how the SCI protocol reduces race conditions. Putting all three of these parts together yields the following scenario which resolves the Early Invalidation race as described in the text.
[[Image:SolihinSCI.png|center]]

# ''A'' sends a request to ''Home'' for access to the memory block. It then goes into a busy state while it waits for a response.
# ''B'' also sends a request to ''Home'' for access to the same memory block. ''A'''s request is received first.
# ''Home'' responds to ''A'' with the data, but this response gets caught in network traffic and delayed. This response is sent before ''Home'' processes the request from ''B''.
# ''Home'' responds to the request from ''B'', telling it that ''A'' is the current ''Head'' node.
# ''B'' then sends a request to ''A'' to tell it to demote itself.
# ''A'' still hasn't received the response from ''Home'', so it is still in a transaction, and it tells ''B'' that it is busy. ''B'' will have to retry the request.

It should be noted that ''B'' will be in the ''PENDING'' state, so any requests to demote it from the ''Head'' position will be delayed. This could lead to a chaining of nodes in the ''PENDING'' state, but as soon as node ''A'' receives its response from ''Home'' or times out, the requests in the chain will be successively accepted and processed.

=== Other Possible Race Conditions ===
The design of the SCI protocol prevents many race conditions from occurring, as has already been shown. Here are two other conditions that may arise, with a discussion of how the protocol resolves the issue.

==== Concurrent List Deletions ====

==== Simultaneous Deletion and Invalidation ====
Even though the ''Head'' node performs most of the coherence actions for a sharing list, the individual nodes are able to invalidate themselves, such as when they evict the block from their cache. This is referred to in the SCI standard as a "deletion" from the sharing list. Deletion is accomplished by having the invalidating node "lock" itself and then inform its forward and back nodes that they should now point to each other. This "locking" is essentially another busy state. A problem could arise, then, when two neighboring nodes try to invalidate themselves at the same time, as they would both be locked and not respond to each other's message. In this case, though, the protocol specifies that the node that is closest to the tail takes priority. So, the deadlock and/or race condition is averted by ordering the deletion from the tail towards the head.
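
The deletion rule above can be illustrated with a doubly linked list. This is only a sketch: the names <code>Sharer</code>, <code>build_list</code>, <code>delete</code>, <code>delete_concurrently</code>, and <code>names_from</code> are hypothetical, the locking handshake is not modelled, and the tail-first priority is imitated simply by sorting the deleting nodes by their position in the list.

```python
class Sharer:
    """Hypothetical sharing-list entry with back and forward pointers."""

    def __init__(self, name):
        self.name = name
        self.back = None    # neighbour toward the Head
        self.forw = None    # neighbour toward the Tail


def build_list(names):
    nodes = [Sharer(n) for n in names]
    for left, right in zip(nodes, nodes[1:]):
        left.forw, right.back = right, left
    return nodes


def delete(node):
    """Unlink a node: its two neighbours are told to point at each other."""
    if node.back is not None:
        node.back.forw = node.forw
    if node.forw is not None:
        node.forw.back = node.back
    node.back = node.forw = None


def delete_concurrently(nodes_to_delete, order):
    """Resolve simultaneous deletions by letting the node nearest the Tail go first."""
    position = {node: i for i, node in enumerate(order)}
    for node in sorted(nodes_to_delete, key=lambda n: position[n], reverse=True):
        delete(node)


def names_from(head):
    out = []
    while head is not None:
        out.append(head.name)
        head = head.forw
    return out
```

With a list H, M1, M2, T in which M1 and M2 delete themselves at the same time, M2 is unlinked first because it is closer to the tail, then M1, leaving H pointing directly at T.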

== Definitions and Terms ==
* ''Head Node'' - The node at the beginning of the sharing list
* ''Home Node'' - The node responsible for keeping track of the ''Head'' node for a sharing list for a memory block, usually the node at which the memory block resides.
* ''PENDING state'' - The intermediate state for a node while it is attempting to become the ''Head'' node in the sharing list. This includes the transaction involved in querying the ''Home'' node and the transaction involved in demoting the current ''Head'' node.
* ''SCI'' - Scalable Coherent Interface

== References ==
[[#1body|1.]] "IEEE Standard for Scalable Coherent Interface (SCI)," IEEE Std 1596-1992, pp. i, 1993. doi: 10.1109/IEEESTD.1993.120366
URL: http://ieeexplore.ieee.org.proxied.lib.ncsu.edu/stamp/stamp.jsp?tp=&arnumber=347683&isnumber=8049 
[[#2body|2.]] [http://www.scizzl.com/HowSCIcohWorks.html How SCI Coherence Works]

CSC/ECE 506 Spring 2011/ch11 BB EP

2011-04-18T22:50:07Z

Beburrou: /* Solihin 11.4 Resolution */

__TOC__

== History ==

== Background ==
The Scalable Coherent Interface (SCI) cache coherence protocol defines a set of states for memory, a set of states for cache lines, and a set of actions that transition between these states when processors access memory.

Three sets of these attributes are defined for minimal, typical, and full applications:

* '''Minimal''' - for ‘trivial but correct’ applications that require the presence of the memory in only one cache line. It does not enable read-sharing, and is appropriate for small multi-processors or applications that don’t require significant sharing.

* '''Typical''' - enables read-sharing of a memory location and provisions for efficiency, such as DMA transfers, local caching of data, and error recovery. This set adds an additional stable memory state (FRESH) and multiple cache states. This set will be the focus of this article going forward.

* '''Full''' - adds support for pair-wise sharing, QOLB lock bit synchronization, and cleansing and washing of cache lines.

== State Diagrams ==
=== Memory States ===
The memory states define the state of the memory block from the perspective of the home directory. This 2-bit field is maintained in the memory tag by the home directory along with the pointer to the head of the sharing list (forwId). This simple state model includes three stable states and one semi-stable state.

{|border="1" cellspacing="0" cellpadding="2"
!colspan="5"| Table 1: Memory States
|----
!State
!Description
!Minimal
!Typical
!Full
|----
|HOME
|no sharing list
|Y
|Y
|Y
|----
|FRESH
|sharing-list copy is the same as memory
|
|Y
|Y
|----
|GONE
|sharing-list copy may be different from memory
|Y
|Y
|Y
|----
|WASH*
|transitional state (GONE to FRESH)
|
|
|Y
|----
|}

=== Cache States ===
The cache-line states are maintained by each processor's cache-coherency controller. This 7-bit field is stored in each cache line in the sharing list, along with the pointer to the next sharing-list node (forwId) and the previous sharing-list node (backId). Seven bits enable up to 128 possible values, and SCI defines twenty-nine stable states for use in the minimal, typical, and full sets.

Stable states are those cache states that exist when a memory transaction is not in progress. Their names are derived from a combination of the position of the node in the sharing list, such as ONLY, HEAD, MID, and TAIL, and the state of the data for that node, such as FRESH, CLEAN, DIRTY, VALID, STALE, etc.

[[Image:cache_states.png|center]]

== Race Conditions ==
In a distributed shared memory system with caching, the emergence of race conditions is extremely likely. This is mainly due to the lack of a bus for serialization of actions. It is further compounded by the problem of network errors and congestion.

The early invalidation case from Section 11.4 in Solihin is an excellent example of a race condition that can arise in a distributed system. Recall the diagram from the text, and the cache coherence actions, as shown below.

[[Image:EarlyInValidationRace.png|400px|center]]
The circled actions are as follows:
# ''A'' sends a read request to ''Home''.
# ''Home'' replies with data (but the message gets delayed).
# ''B'' sends a write request to ''Home''.
# ''Home'' sends an invalidation to ''A'', and it arrives before the ReplyD.

The SCI protocol has a way to handle this race condition, as well as many others, and the following sections will discuss how the SCI protocol design can prevent this race condition.

=== Prevention in the SCI Protocol ===
Race conditions are almost non-existent in the SCI protocol, due primarily to the protocol's design. A brief discussion of how this is accomplished follows, and then a walkthrough of what happens when the early invalidation condition arises in the SCI protocol.
==== Atomic Transactions ====
SCI's primary method for preventing race conditions is having atomic transactions. A transaction is defined as a set of sub-actions necessary to complete some requested action, such as querying memory for data or invalidating a node's cache. This usually takes the form of a request-response pair, where the initiating node sends a request to another node and waits for the response. The transaction is not complete until the response returns. Suppose, for example, that node ''A'' is in the middle of a transaction with node ''B''. Node ''C'' then tries to make a request of node ''A''. Node ''A'' will respond to node ''C'' that it is busy, telling node ''C'' to try again, as shown in the following diagram.
[[Image:AtomicBusy.png|center]]
Thus, in the Early Invalidation race, node ''A'' would be in the middle of a transaction, having requested a block of data from memory and not having received a response, which would prevent node ''B'' from invalidating it. Note, however, that nodes will not stay in the busy state indefinitely. Doing so would lead to potential deadlocks, so all transactions have timeouts that will eventually cause the transaction to fail, thus moving the node out of the busy state and making it able to respond to requests again.

==== Head Node ====
Another method for preventing race conditions is to have the ''Head'' node of a sharing list perform many of the coherence actions. As a result, only one node is performing actions such as writes and invalidations of other sharers. Since only one node is performing these actions, the possibility of concurrent actions is decreased. If another node wants to write, for instance, it must become the Head Node of the sharing list for the cache line to which it wants to write, displacing the current ''Head'' node as the node that can write. All state changes in the sharing list occur at the behest of the ''Head'' node, with the exception of nodes deleting themselves from the list.

When simply sharing a read-only copy of a line, as when the memory is in a ''FRESH'' state, the ''Head'' node is somewhat irrelevant. All the sharing nodes have their own cached value of the cache line. If any node wants to write, including the ''Head'' node, it must perform an additional action in order to do so. Likewise, when a sharing list is sharing a line of data in the dirty, or ''GONE'', memory state, as long as the ''Head'' node has not written to the line, then all the sharing nodes stay in the list with their cached line. However, at the point the ''Head'' node wants to write, it can write its data immediately, but it then must invalidate all other shared copies, via the forward pointers in the sharing list.

This, however, does not protect against a race condition where the memory is ''FRESH'', the ''Head'' node wants to write, and another node also wants to join the list. The Memory Access mechanism prevents this condition.

==== Memory Access ====
The final way that the SCI protocol minimizes race conditions is by changing the directory structure. In Solihin 11.4, there is a ''Home'' node which keeps track of the cache state and has a certain amount of knowledge about the system as a result. In the SCI protocol, there is still a ''Home'' node of sorts, but this node is only responsible for keeping track of which node is the current ''Head''. Such a ''Home'' node is usually the node where the memory block physically resides. When a node wants to access a block of memory, it sends a request to the ''Home'' node of that memory block, assuming it isn't already in the sharing list for that block. In the event that the data is already being used, the ''Home'' node will include in its response the address of the ''Head'' node of the sharing list. This requirement necessarily serializes the order of access, as one access request is not serviced at the ''Home'' node until the previous request is finished, and the requests are processed in FIFO order. In addition, whichever node is the ''Head'' of the sharing list will be notified by the new ''Head'' when it is demoted. All parties involved in a potential race are thus talking to each other, preventing the race from occurring in the first place.

Because of this requirement, the projected race condition from the previous section, where the ''Head'' node is in a ''FRESH'' state and wants to write while another node also wants to read, cannot occur. This is because both nodes must make their request to the ''Home'' node; the ''Head'' node must request the ability to write, and the other node must request the data from the ''Home'' node. One of these requests will make it to the ''Home'' node first, resulting in the second request being deferred to the first node to reach the ''Home'' node. Such a scenario is pictured below.
[[Image:MemoryAccess.png|center]]

==== Solihin 11.4 Resolution ====
The three previous sections discussed individual parts of how the SCI protocol reduces race conditions. Putting all three of these parts together yields the following scenario which resolves the Early Invalidation race as described in the text.
[[Image:SolihinSCI.png|center]]

# ''A'' sends a request to ''Home'' for access to the memory block. It then goes into a busy state while it waits for a response.
# ''B'' also sends a request to ''Home'' for access to the same memory block. ''A'''s request is received first.
# ''Home'' responds to ''A'' with the data, but this response gets caught in network traffic and delayed. This response is sent before ''Home'' processes the request from ''B''.
# ''Home'' responds to the request from ''B'', telling it that ''A'' is the current ''Head'' node.
# ''B'' then sends a request to ''A'' to tell it to demote itself.
# ''A'' still hasn't received the response from ''Home'', so it is still in a transaction, and it tells ''B'' that it is busy. ''B'' will have to retry the request.

It should be noted that ''B'' will be in the ''PENDING'' state, so any requests to demote it from the ''Head'' position will be delayed. This could lead to a chaining of nodes in the ''PENDING'' state, but as soon as node ''A'' receives its response from ''Home'' or times out, the requests in the chain will be successively accepted and processed.

=== Other Possible Race Conditions ===
The design of the SCI protocol prevents many race conditions from occurring, as has already been shown. Here are two other conditions that may arise, with a discussion of how the protocol resolves the issue.

==== Concurrent List Deletions ====

==== Simultaneous Deletion and Invalidation ====
Even though the ''Head'' node performs most of the coherence actions for a sharing list, the individual nodes are able to invalidate themselves, such as when they evict the block from their cache. This is referred to in the SCI standard as a "deletion" from the sharing list. Deletion is accomplished by having the invalidating node "lock" itself and then inform its forward and back nodes that they should now point to each other. This "locking" is essentially another busy state. A problem could arise, then, when two neighboring nodes try to invalidate themselves at the same time, as they would both be locked and not respond to each other's message. In this case, though, the protocol specifies that the node that is closest to the tail takes priority. So, the deadlock and/or race condition is averted by ordering the deletion from the tail towards the head.

== Definitions and Terms ==
* ''Head Node'' - The node at the beginning of the sharing list
* ''Home Node'' - The node responsible for keeping track of the ''Head'' node for a sharing list for a memory block, usually the node at which the memory block resides.
* ''SCI'' - Scalable Coherent Interface

== References ==
[[#1body|1.]] "IEEE Standard for Scalable Coherent Interface (SCI)," IEEE Std 1596-1992, pp. i, 1993. doi: 10.1109/IEEESTD.1993.120366
URL: http://ieeexplore.ieee.org.proxied.lib.ncsu.edu/stamp/stamp.jsp?tp=&arnumber=347683&isnumber=8049 
[[#2body|2.]] [http://www.scizzl.com/HowSCIcohWorks.html How SCI Coherence Works]

CSC/ECE 506 Spring 2011/ch11 BB EP

2011-04-18T22:49:28Z

Beburrou: /* Solihin 11.4 Resolution */

__TOC__

== History ==

== Background ==
The Scalable Coherent Interface (SCI) cache coherence protocol defines a set of states for memory, a set of states for cache lines, and a set of actions that transition between these states when processors access memory.

Three sets of these attributes are defined for minimal, typical, and full applications:

* '''Minimal''' - for ‘trivial but correct’ applications that require the presence of the memory in only one cache line. It does not enable read-sharing, and is appropriate for small multi-processors or applications that don’t require significant sharing.

* '''Typical''' - enables read-sharing of a memory location and provisions for efficiency, such as DMA transfers, local caching of data, and error recovery. This set adds an additional stable memory state (FRESH) and multiple cache states. This set will be the focus of this article going forward.

* '''Full''' - adds support for pair-wise sharing, QOLB lock bit synchronization, and cleansing and washing of cache lines.

== State Diagrams ==
=== Memory States ===
The memory states define the state of the memory block from the perspective of the home directory. This 2-bit field is maintained in the memory tag by the home directory along with the pointer to the head of the sharing list (forwId). This simple state model includes three stable states and one semi-stable state.

{|border="1" cellspacing="0" cellpadding="2"
!colspan="5"| Table 1: Memory States
|----
!State
!Description
!Minimal
!Typical
!Full
|----
|HOME
|no sharing list
|Y
|Y
|Y
|----
|FRESH
|sharing-list copy is the same as memory
|
|Y
|Y
|----
|GONE
|sharing-list copy may be different from memory
|Y
|Y
|Y
|----
|WASH*
|transitional state (GONE to FRESH)
|
|
|Y
|----
|}

=== Cache States ===
The cache-line states are maintained by each processor's cache-coherency controller. This 7-bit field is stored in each cache line in the sharing list, along with the pointer to the next sharing-list node (forwId) and the previous sharing-list node (backId). Seven bits enable up to 128 possible values, and SCI defines twenty-nine stable states for use in the minimal, typical, and full sets.

Stable states are those cache states that exist when a memory transaction is not in progress. Their names are derived from a combination of the position of the node in the sharing list, such as ONLY, HEAD, MID, and TAIL, and the state of the data for that node, such as FRESH, CLEAN, DIRTY, VALID, STALE, etc.

[[Image:cache_states.png|center]]

== Race Conditions ==
In a distributed shared memory system with caching, the emergence of race conditions is extremely likely. This is mainly due to the lack of a bus for serialization of actions. It is further compounded by the problem of network errors and congestion.

The early invalidation case from Section 11.4 in Solihin is an excellent example of a race condition that can arise in a distributed system. Recall the diagram from the text, and the cache coherence actions, as shown below.

[[Image:EarlyInValidationRace.png|400px|center]]
The circled actions are as follows:
# ''A'' sends a read request to ''Home''.
# ''Home'' replies with data (but the message gets delayed).
# ''B'' sends a write request to ''Home''.
# ''Home'' sends an invalidation to ''A'', and it arrives before the ReplyD.

The SCI protocol has a way to handle this race condition, as well as many others, and the following sections will discuss how the SCI protocol design can prevent this race condition.

=== Prevention in the SCI Protocol ===
Race conditions are almost non-existent in the SCI protocol, due primarily to the protocol's design. A brief discussion of how this is accomplished follows, and then a walkthrough of what happens when the early invalidation condition arises in the SCI protocol.
==== Atomic Transactions ====
SCI's primary method for preventing race conditions is having atomic transactions. A transaction is defined as a set of sub-actions necessary to complete some requested action, such as querying memory for data or invalidating a node's cache. This usually takes the form of a request-response pair, where the initiating node sends a request to another node and waits for the response. The transaction is not complete until the response returns. Suppose, for example, that node ''A'' is in the middle of a transaction with node ''B''. Node ''C'' then tries to make a request of node ''A''. Node ''A'' will respond to node ''C'' that it is busy, telling node ''C'' to try again, as shown in the following diagram.
[[Image:AtomicBusy.png|center]]
Thus, in the Early Invalidation race, node ''A'' would be in the middle of a transaction, having requested a block of data from memory and not having received a response, which would prevent node ''B'' from invalidating it. Note, however, that nodes will not stay in the busy state indefinitely. Doing so would lead to potential deadlocks, so all transactions have timeouts that will eventually cause the transaction to fail, thus moving the node out of the busy state and making it able to respond to requests again.

==== Head Node ====
Another method for preventing race conditions is to have the ''Head'' node of a sharing list perform many of the coherence actions. As a result, only one node is performing actions such as writes and invalidations of other sharers. Since only one node is performing these actions, the possibility of concurrent actions is decreased. If another node wants to write, for instance, it must become the Head Node of the sharing list for the cache line to which it wants to write, displacing the current ''Head'' node as the node that can write. All state changes in the sharing list occur at the behest of the ''Head'' node, with the exception of nodes deleting themselves from the list.

When simply sharing a read-only copy of a line, as when the memory is in a ''FRESH'' state, the ''Head'' node is somewhat irrelevant. All the sharing nodes have their own cached value of the cache line. If any node wants to write, including the ''Head'' node, it must perform an additional action in order to do so. Likewise, when a sharing list is sharing a line of data in the dirty, or ''GONE'', memory state, as long as the ''Head'' node has not written to the line, then all the sharing nodes stay in the list with their cached line. However, at the point the ''Head'' node wants to write, it can write its data immediately, but it then must invalidate all other shared copies, via the forward pointers in the sharing list.

This, however, does not protect against a race condition where the memory is ''FRESH'', the ''Head'' node wants to write, and another node also wants to join the list. The Memory Access mechanism prevents this condition.

==== Memory Access ====
The final way that the SCI protocol minimizes race conditions is by changing the directory structure. In Solihin 11.4, there is a ''Home'' node which keeps track of the cache state and has a certain amount of knowledge about the system as a result. In the SCI protocol, there is still a ''Home'' node of sorts, but this node is only responsible for keeping track of who the current ''Head'' is. Such a ''Home'' node is usually the node where the memory block physically resides. When a node wants to access a block of memory, it sends a request to the ''Home'' node of that memory block, assuming it isn't already in the sharing list for that block. In the event that the data is already being used, the ''Home'' node will include in its response the address of the ''Head'' node of the sharing list. This requirement necessarily serializes the order of access, as one access request is not serviced at the ''Home'' node until the previous request is finished, and the requests are processed in FIFO order. In addition, whichever node is the ''Head'' of the sharing list will be notified by the new ''Head'' when it is demoted. All parties involved in a potential race are thus talking to each other, preventing the race from occurring in the first place.

Because of this requirement, the projected race condition from the previous section, where the ''Head'' node is in a ''FRESH'' state and wants to write while another node also wants to read, cannot occur. This is because both nodes must make their request to the ''Home'' node; the ''Head'' node must request the ability to write, and the other node must request the data from the ''Home'' node. One of these requests will make it to the ''Home'' node first, resulting in the second request being deferred to the first node to reach the ''Home'' node. Such a scenario is pictured below.
[[Image:MemoryAccess.png|center]]
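
The serialization performed by the ''Home'' node can be illustrated with a small sketch. It assumes a simplified model in which every request comes from a node joining the list and becomes the new ''Head''; the class <code>Home</code> and its methods <code>submit</code> and <code>service_next</code> are hypothetical names, not part of the SCI standard.

```python
from collections import deque


class Home:
    """Home node that tracks only the current Head of the sharing list."""

    def __init__(self):
        self.head = None      # no sharing list yet
        self.queue = deque()  # requests are serviced in arrival (FIFO) order

    def submit(self, requester):
        """Record an incoming access request."""
        self.queue.append(requester)

    def service_next(self):
        """Service the oldest request.

        The requester becomes the new Head and is told who the old Head
        was (None means the data comes straight from memory).
        """
        requester = self.queue.popleft()
        old_head = self.head
        self.head = requester
        return requester, old_head
```

If ''A'' and then ''B'' submit requests, the first call to <code>service_next</code> returns ''A'' with no previous ''Head'', and the second returns ''B'' together with ''A'', so ''B'' knows which node it must contact next.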

==== Solihin 11.4 Resolution ====
The three previous sections discussed individual parts of how the SCI protocol reduces race conditions. Putting all three of these parts together yields the following scenario which resolves the Early Invalidation race as described in the text.
[[Image:SolihinSCI.png|center]]

# ''A'' sends a request to ''Home'' for access to the memory block. It then goes into a busy state while it waits for a response.
# ''B'' also sends a request to ''Home'' for access to the same memory block. ''A'''s request is received first.
# ''Home'' responds to ''A'' with the data, but this response gets caught in network traffic and delayed. This response is sent before ''Home'' processes the request from ''B''.
# ''Home'' responds to the request from ''B'', telling it that ''A'' is the current ''Head'' node.
# ''B'' then sends a request to ''A'' to tell it to demote itself.
# ''A'' still hasn't received the response from ''Home'', so it is still in a transaction, and it tells ''B'' that it is busy. ''B'' will have to retry the request.

It should be noted that ''B'' will be in the ''PENDING'' state, as described above, so any requests to demote it from the ''Head'' position will be delayed. This could lead to a chaining of nodes in the ''PENDING'' state, but as soon as node ''A'' receives its response from ''Home'' or times out, the requests in the chain will be successively accepted and processed.
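
The six steps above can be replayed in a few lines. This is only a sketch under simplifying assumptions: a single boolean <code>busy</code> flag stands in for any in-transaction state, messages are modeled as direct method calls, and the names <code>Node</code>, <code>handle_demote</code>, and <code>replay</code> are hypothetical.

```python
class Node:
    """A processor node with a single outstanding-transaction flag."""

    def __init__(self, name):
        self.name = name
        self.busy = False  # True while a request is still unanswered

    def handle_demote(self):
        """Answer a demotion request from a would-be Head."""
        return "BUSY" if self.busy else "OK"


def replay():
    """Replay the resolution of the Early Invalidation race."""
    a, b = Node("A"), Node("B")
    log = []

    # Step 1: A sends its request to Home and waits for the response.
    a.busy = True
    # Steps 2-3: Home services A first and records it as Head;
    # the reply to A is delayed in the network.
    head = a
    # Step 4: Home services B, tells it that A is Head, and records B as Head.
    b.busy = True
    old_head, head = head, b
    # Steps 5-6: B asks A to demote itself; A is still in its transaction.
    log.append(old_head.handle_demote())
    # The delayed reply finally reaches A (or A times out), so B's retry succeeds.
    a.busy = False
    log.append(old_head.handle_demote())
    b.busy = False
    return log, head.name
```

The first demotion attempt is refused and the retry is accepted, which is the behavior that keeps the invalidation from overtaking the delayed data reply.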

=== Other Possible Race Conditions ===
The design of the SCI protocol prevents many race conditions from occurring, as has already been shown. Here are two other conditions that may arise, with a discussion of how the protocol resolves the issue.

==== Concurrent List Deletions ====

==== Simultaneous Deletion and Invalidation ====
Even though the ''Head'' node performs most of the coherence actions for a sharing list, the individual nodes are able to invalidate themselves, such as when they evict the block from their cache. This is referred to in the SCI standard as a "deletion" from the sharing list. Deletion is accomplished by having the invalidating node "lock" itself and then inform its forward and back nodes that they should now point to each other. This "locking" is essentially another busy state. A problem could arise, then, when two neighboring nodes try to invalidate themselves at the same time, as they would both be locked and not respond to each other's message. In this case, though, the protocol specifies that the node that is closest to the tail takes priority. So, the deadlock and/or race condition is averted by ordering the deletion from the tail towards the head.
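
The tail-first ordering can be sketched as follows. The function <code>delete_concurrently</code> and its dictionary-based pointers are assumptions made for illustration; the sketch only shows the order in which competing deletions are granted and how the neighbors are relinked, not the locking messages themselves.

```python
def delete_concurrently(order, leaving):
    """Remove the nodes in `leaving` from a sharing list.

    `order` lists the nodes from head to tail. Competing deletions are
    granted from the tail toward the head, as described in the text.
    Returns (grant sequence, surviving nodes, forward pointers).
    """
    forw = {n: nxt for n, nxt in zip(order, order[1:] + [None])}
    back = {n: prv for n, prv in zip(order, [None] + order[:-1])}
    position = {n: i for i, n in enumerate(order)}

    # The node closest to the tail takes priority.
    sequence = sorted(leaving, key=position.get, reverse=True)
    for n in sequence:
        f, b = forw.pop(n), back.pop(n)
        if b is not None:
            forw[b] = f   # back neighbor now points past the deleted node
        if f is not None:
            back[f] = b   # forward neighbor now points back past it
    survivors = [n for n in order if n in forw]
    return sequence, survivors, forw
```

For a list ''H'', ''M1'', ''M2'', ''T'' in which ''M1'' and ''M2'' leave at the same time, ''M2'' is removed first and ''H'' ends up pointing directly at ''T''.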

== Definitions and Terms ==
* ''Head Node'' - The node at the beginning of the sharing list
* ''Home Node'' - The node responsible for keeping track of the ''Head'' node for a sharing list for a memory block, usually the node at which the memory block resides.
* ''SCI'' - Scalable Coherent Interface

== References ==
[[#1body|1.]] "IEEE Standard for Scalable Coherent Interface (SCI).," IEEE Std 1596-1992 , vol., no., pp.i, 1993. doi: 10.1109/IEEESTD.1993.120366
URL: http://ieeexplore.ieee.org.proxied.lib.ncsu.edu/stamp/stamp.jsp?tp=&arnumber=347683&isnumber=8049 
[[#2body|2.]] [http://www.scizzl.com/HowSCIcohWorks.html How SCI Coherence Works]

CSC/ECE 506 Spring 2011/ch11 BB EP

2011-04-18T22:47:50Z

Beburrou: /* Atomic Transactions */

__TOC__

== History ==

== Background ==
The Scalable Coherent Interface (SCI) cache coherence protocol defines a set of states for memory, a set of states for cache lines, and a set of actions that transition between these states when processors access memory.

Three sets of these attributes are defined for minimal, typical, and full applications:

* '''Minimal''' - for ‘trivial but correct’ applications that require the presence of the memory in only one cache line. It does not enable read-sharing, and is appropriate for small multi-processors or applications that don’t require significant sharing.

* '''Typical''' - enables read-sharing of a memory location and provisions for efficiency, such as DMA transfers, local caching of data, and error recovery. This set adds an additional stable memory state (FRESH) and multiple cache states. This set will be the focus of this article going forward.

* '''Full''' - adds support for pair-wise sharing, QOLB lock bit synchronization, and cleansing and washing of cache lines.

== State Diagrams ==
=== Memory States ===
The memory states define the state of the memory block from the perspective of the home directory. This 2-bit field is maintained in the memory tag by the home directory along with the pointer to the head of the sharing list (forwId). This simple state model includes three stable states and one semi-stable state.

{|border="1" cellspacing="0" cellpadding="2"
!colspan="5"| Table 1: Memory States
|----
!State
!Description
!Minimal
!Typical
!Full
|----
|HOME
|no sharing list
|Y
|Y
|Y
|----
|FRESH
|sharing-list copy is the same as memory
|
|Y
|Y
|----
|GONE
|sharing-list copy may be different from memory
|Y
|Y
|Y
|----
|WASH*
|transitional state (GONE to FRESH)
|
|
|Y
|----
|}
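
Table 1 can also be expressed as a small lookup structure. In this sketch the numeric encodings, the 16-bit pointer width, and the names <code>MemState</code>, <code>OPTION_SETS</code>, <code>states_for</code>, <code>pack_tag</code>, and <code>unpack_tag</code> are all assumptions; the text above only states that the state is a 2-bit field stored alongside the forwId pointer.

```python
from enum import IntEnum


class MemState(IntEnum):
    """Memory states from Table 1 (numeric values are illustrative only)."""
    HOME = 0
    FRESH = 1
    GONE = 2
    WASH = 3


# Which states each option set supports, taken from Table 1.
OPTION_SETS = {
    "minimal": {MemState.HOME, MemState.GONE},
    "typical": {MemState.HOME, MemState.FRESH, MemState.GONE},
    "full": {MemState.HOME, MemState.FRESH, MemState.GONE, MemState.WASH},
}


def states_for(option_set):
    """Return the supported memory states for an option set, in enum order."""
    return sorted(OPTION_SETS[option_set])


def pack_tag(state, forw_id, id_bits=16):
    """Pack the 2-bit state and the head pointer into one integer.

    The layout and the default pointer width are hypothetical.
    """
    if not 0 <= forw_id < (1 << id_bits):
        raise ValueError("forw_id does not fit in the pointer field")
    return (int(state) << id_bits) | forw_id


def unpack_tag(tag, id_bits=16):
    """Inverse of pack_tag: recover (state, forw_id)."""
    return MemState(tag >> id_bits), tag & ((1 << id_bits) - 1)
```

All four states fit in two bits, which is consistent with the 2-bit field described above.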

=== Cache States ===
The cache-line states are maintained by each processor's cache-coherency controller. This 7-bit field is stored in each cache line in the sharing list, along with the pointer to the next sharing-list node (forwId) and the previous sharing-list node (backId). Seven bits allow up to 128 possible values, and SCI defines twenty-nine stable states for use in the minimal, typical, and full sets.

Stable states are those cache states that exist when a memory transaction is not in progress. Their names are derived from a combination of the position of the node in the sharing list, such as ONLY, HEAD, MID, and TAIL, and the state of the data for that node, such as FRESH, CLEAN, DIRTY, VALID, or STALE.

[[Image:cache_states.png|center]]

== Race Conditions ==
In a distributed shared memory system with caching, the emergence of race conditions is extremely likely. This is mainly due to the lack of a bus for serialization of actions. It is further compounded by the problem of network errors and congestion.

The early invalidation case from Section 11.4 in Solihin is an excellent example of a race condition that can arise in a distributed system. Recall the diagram from the text, and the cache coherence actions, as shown below.

[[Image:EarlyInValidationRace.png|400px|center]]
The circled actions are as follows:
# ''A'' sends a read request to ''Home''.
# ''Home'' replies with data (but the message gets delayed).
# ''B'' sends a write request to ''Home''.
# ''Home'' sends an invalidation to ''A'', and it arrives before the ReplyD.

The SCI protocol has a way to handle this race condition, as well as many others, and the following sections will discuss how the SCI protocol design can prevent this race condition.

=== Prevention in the SCI Protocol ===
Race conditions are almost non-existent in the SCI protocol, due primarily to the protocol's design. The following sections briefly describe how this is accomplished and then show what happens when the Early Invalidation condition arises in the SCI protocol.
==== Atomic Transactions ====
SCI's primary method for preventing race conditions is the use of atomic transactions. A transaction is defined as a set of sub-actions necessary to complete some requested action, such as querying memory for data or invalidating a node's cache. This usually takes the form of a request-response pair, where the initiating node sends a request to another node and waits for the response. The transaction is not complete until the response returns. Suppose, for example, that node ''A'' is in the middle of a transaction with node ''B''. Node ''C'' then tries to make a request of node ''A''. Node ''A'' will respond to node ''C'' that it is busy, telling node ''C'' to try again, as shown in the following diagram.
[[Image:AtomicBusy.png|center]]
Thus, in the Early Invalidation race, node ''A'' would be in the middle of a transaction, having requested a block of data from memory and not yet having received a response, which would prevent node ''B'' from invalidating it. Note, however, that nodes will not stay in the busy state indefinitely. Doing so could lead to deadlock, so all transactions have timeouts that eventually cause the transaction to fail, moving the node out of the busy state and making it able to respond to requests again.

==== Head Node ====
Another method for preventing race conditions is to have the ''Head'' node of a sharing list perform many of the coherence actions. As a result, only one node is performing actions such as writes and invalidations of other sharers. Since only one node is performing these actions, the possibility of concurrent actions is decreased. If another node wants to write, for instance, it must become the Head Node of the sharing list for the cache line to which it wants to write, displacing the current ''Head'' node as the node that can write. All state changes in the sharing list occur at the behest of the ''Head'' node, with the exception of nodes deleting themselves from the list.

When nodes simply share a read-only copy of a line, as when the memory is in the ''FRESH'' state, the ''Head'' node is somewhat irrelevant. All the sharing nodes have their own cached copy of the line. If any node wants to write, including the ''Head'' node, it must perform an additional action in order to do so. Likewise, when a sharing list is sharing a line of data in the dirty, or ''GONE'', memory state, all the sharing nodes stay in the list with their cached line as long as the ''Head'' node has not written to the line. When the ''Head'' node wants to write, however, it can write its data immediately, but it must then invalidate all other shared copies via the forward pointers in the sharing list.

This, however, does not protect against a race condition where the memory is ''FRESH'', the ''Head'' node wants to write, and another node also wants to join the list. The Memory Access mechanism prevents this condition.

==== Memory Access ====
The final way that the SCI protocol minimizes race conditions is by changing the directory structure. In Solihin 11.4, there is a ''Home'' node which keeps track of the cache state and has a certain amount of knowledge about the system as a result. In the SCI protocol, there is still a ''Home'' node of sorts, but this node is only responsible for keeping track of who the current ''Head'' is. Such a ''Home'' node is usually the node where the memory block physically resides. When a node wants to access a block of memory, it sends a request to the ''Home'' node of that memory block, assuming it isn't already in the sharing list for that block. In the event that the data is already being used, the ''Home'' node will include in its response the address of the ''Head'' node of the sharing list. This requirement necessarily serializes the order of access, as one access request is not serviced at the ''Home'' node until the previous request is finished, and the requests are processed in FIFO order. In addition, whichever node is the ''Head'' of the sharing list will be notified by the new ''Head'' when it is demoted. All parties involved in a potential race are thus talking to each other, preventing the race from occurring in the first place.

Because of this requirement, the projected race condition from the previous section, where the ''Head'' node is in a ''FRESH'' state and wants to write while another node also wants to read, cannot occur. This is because both nodes must make their request to the ''Home'' node; the ''Head'' node must request the ability to write, and the other node must request the data from the ''Home'' node. One of these requests will make it to the ''Home'' node first, resulting in the second request being deferred to the first node to reach the ''Home'' node. Such a scenario is pictured below.
[[Image:MemoryAccess.png|center]]

==== Solihin 11.4 Resolution ====
The three previous sections discussed individual parts of how the SCI protocol reduces race conditions. Putting all three of these parts together yields the following scenario which resolves the Early Invalidation race as described in the text.
[[Image:SolihinSCI.png|center]]

# ''A'' sends a request to ''Home'' for access to the memory block. It then goes into a busy state while it waits for a response.
# ''B'' also sends a request to ''Home'' for access to the same memory block. ''A'''s request is received first.
# ''Home'' responds to ''A'' with the data, but this response gets caught in network traffic and delayed. This response is sent before ''Home'' processes the request from ''B''.
# ''Home'' responds to the request from ''B'', telling it that ''A'' is the current ''Head'' node.
# ''B'' then sends a request to ''A'' to tell it to demote itself.
# ''A'' still hasn't received the response from ''Home'', so it is still in a transaction, and it tells ''B'' that it is busy. ''B'' will have to retry the request.

It should be noted that ''B'' is also in the busy state, so any requests made of it will also be rejected. This could lead to a chaining of nodes in the busy state, but as soon as node ''A'' receives its response from ''Home'' or times out, the requests in the chain will be successively accepted and processed.

=== Other Possible Race Conditions ===
The design of the SCI protocol prevents many race conditions from occurring, as has already been shown. Here are two other conditions that may arise, with a discussion of how the protocol resolves the issue.

==== Concurrent List Deletions ====

==== Simultaneous Deletion and Invalidation ====
Even though the ''Head'' node performs most of the coherence actions for a sharing list, the individual nodes are able to invalidate themselves, such as when they evict the block from their cache. This is referred to in the SCI standard as a "deletion" from the sharing list. Deletion is accomplished by having the invalidating node "lock" itself and then inform its forward and back nodes that they should now point to each other. This "locking" is essentially another busy state. A problem could arise, then, when two neighboring nodes try to invalidate themselves at the same time, as they would both be locked and not respond to each other's message. In this case, though, the protocol specifies that the node that is closest to the tail takes priority. So, the deadlock and/or race condition is averted by ordering the deletion from the tail towards the head.

== Definitions and Terms ==
* ''Head Node'' - The node at the beginning of the sharing list
* ''Home Node'' - The node responsible for keeping track of the ''Head'' node for a sharing list for a memory block, usually the node at which the memory block resides.
* ''SCI'' - Scalable Coherent Interface

== References ==
[[#1body|1.]] "IEEE Standard for Scalable Coherent Interface (SCI).," IEEE Std 1596-1992 , vol., no., pp.i, 1993. doi: 10.1109/IEEESTD.1993.120366
URL: http://ieeexplore.ieee.org.proxied.lib.ncsu.edu/stamp/stamp.jsp?tp=&arnumber=347683&isnumber=8049 
[[#2body|2.]] [http://www.scizzl.com/HowSCIcohWorks.html How SCI Coherence Works]

CSC/ECE 506 Spring 2011/ch11 BB EP

2011-04-18T22:40:18Z

Beburrou: /* Simultaneous Deletion and Invalidation */

__TOC__

== History ==

== Background ==
The Scalable Coherent Interface (SCI) cache coherence protocol defines a set of states for memory, a set of states for cache lines, and a set of actions that transition between these states when processors access memory.

Three sets of these attributes are defined for minimal, typical, and full applications:

* '''Minimal''' - for ‘trivial but correct’ applications that require the presence of the memory in only one cache line. It does not enable read-sharing, and is appropriate for small multi-processors or applications that don’t require significant sharing.

* '''Typical''' - enables read-sharing of a memory location and provisions for efficiency, such as DMA transfers, local caching of data, and error recovery. This set adds an additional stable memory state (FRESH) and multiple cache states. This set will be the focus of this article going forward.

* '''Full''' - adds support for pair-wise sharing, QOLB lock bit synchronization, and cleansing and washing of cache lines.

== State Diagrams ==
=== Memory States ===
The memory states define the state of the memory block from the perspective of the home directory. This 2-bit field is maintained in the memory tag by the home directory along with the pointer to the head of the sharing list (forwId). This simple state model includes three stable states and one semi-stable state.

{|border="1" cellspacing="0" cellpadding="2"
!colspan="5"| Table 1: Memory States
|----
!State
!Description
!Minimal
!Typical
!Full
|----
|HOME
|no sharing list
|Y
|Y
|Y
|----
|FRESH
|sharing-list copy is the same as memory
|
|Y
|Y
|----
|GONE
|sharing-list copy may be different from memory
|Y
|Y
|Y
|----
|WASH*
|transitional state (GONE to FRESH)
|
|
|Y
|----
|}

=== Cache States ===
The cache-line states are maintained by each processor's cache-coherency controller. This 7-bit field is stored in each cache line in the sharing list, along with the pointer to the next sharing-list node (forwId) and the previous sharing-list node (backId). Seven bits allow up to 128 possible values, and SCI defines twenty-nine stable states for use in the minimal, typical, and full sets.

Stable states are those cache states that exist when a memory transaction is not in progress. Their names are derived from a combination of the position of the node in the sharing list, such as ONLY, HEAD, MID, and TAIL, and the state of the data for that node, such as FRESH, CLEAN, DIRTY, VALID, or STALE.

[[Image:cache_states.png|center]]

== Race Conditions ==
In a distributed shared memory system with caching, the emergence of race conditions is extremely likely. This is mainly due to the lack of a bus for serialization of actions. It is further compounded by the problem of network errors and congestion.

The early invalidation case from Section 11.4 in Solihin is an excellent example of a race condition that can arise in a distributed system. Recall the diagram from the text, and the cache coherence actions, as shown below.

[[Image:EarlyInValidationRace.png|400px|center]]
The circled actions are as follows:
# ''A'' sends a read request to ''Home''.
# ''Home'' replies with data (but the message gets delayed).
# ''B'' sends a write request to ''Home''.
# ''Home'' sends an invalidation to ''A'', and it arrives before the ReplyD.

The SCI protocol has a way to handle this race condition, as well as many others, and the following sections will discuss how the SCI protocol design can prevent this race condition.

=== Prevention in the SCI Protocol ===
Race conditions are almost non-existent in the SCI protocol, due primarily to the protocol's design. The following sections briefly describe how this is accomplished and then show what happens when the Early Invalidation condition arises in the SCI protocol.
==== Atomic Transactions ====
SCI's primary method for preventing race conditions is having atomic transactions. A transaction is defined as a set of sub-actions necessary to complete some requested action, such as reading from memory or writing to a variable. Suppose, for example, that node ''A'' is in the middle of a transaction with node ''B''. Node ''C'' then tries to make a request of node ''A''. Node ''A'' will respond to node ''C'' that it is busy, telling node ''C'' to try again, as shown in the following diagram.
[[Image:AtomicBusy.png|center]]
Thus, in the Early Invalidation race, node ''A'' would be in the middle of a transaction, which would prevent node ''B'' from invalidating it. Note, however, that nodes will not stay in the busy state indefinitely. Doing so could lead to deadlock, so all transactions have timeouts that eventually cause the transaction to fail, moving the node out of the busy state and making it able to respond to requests again.

==== Head Node ====
Another method for preventing race conditions is to have the ''Head'' node of a sharing list perform many of the coherence actions. As a result, only one node is performing actions such as writes and invalidations of other sharers. Since only one node is performing these actions, the possibility of concurrent actions is decreased. If another node wants to write, for instance, it must become the Head Node of the sharing list for the cache line to which it wants to write, displacing the current ''Head'' node as the node that can write. All state changes in the sharing list occur at the behest of the ''Head'' node, with the exception of nodes deleting themselves from the list.

When nodes simply share a read-only copy of a line, as when the memory is in the ''FRESH'' state, the ''Head'' node is somewhat irrelevant. All the sharing nodes have their own cached copy of the line. If any node wants to write, including the ''Head'' node, it must perform an additional action in order to do so. Likewise, when a sharing list is sharing a line of data in the dirty, or ''GONE'', memory state, all the sharing nodes stay in the list with their cached line as long as the ''Head'' node has not written to the line. When the ''Head'' node wants to write, however, it can write its data immediately, but it must then invalidate all other shared copies via the forward pointers in the sharing list.

This, however, does not protect against a race condition where the memory is ''FRESH'', the ''Head'' node wants to write, and another node also wants to join the list. The Memory Access mechanism prevents this condition.

==== Memory Access ====
The final way that the SCI protocol minimizes race conditions is by changing the directory structure. In Solihin 11.4, there is a ''Home'' node which keeps track of the cache state and has a certain amount of knowledge about the system as a result. In the SCI protocol, there is still a ''Home'' node of sorts, but this node is only responsible for keeping track of who the current ''Head'' is. Such a ''Home'' node is usually the node where the memory block physically resides. When a node wants to access a block of memory, it sends a request to the ''Home'' node of that memory block, assuming it isn't already in the sharing list for that block. In the event that the data is already being used, the ''Home'' node will include in its response the address of the ''Head'' node of the sharing list. This requirement necessarily serializes the order of access, as one access request is not serviced at the ''Home'' node until the previous request is finished, and the requests are processed in FIFO order. In addition, whichever node is the ''Head'' of the sharing list will be notified by the new ''Head'' when it is demoted. All parties involved in a potential race are thus talking to each other, preventing the race from occurring in the first place.

Because of this requirement, the projected race condition from the previous section, where the ''Head'' node is in a ''FRESH'' state and wants to write while another node also wants to read, cannot occur. This is because both nodes must make their request to the ''Home'' node; the ''Head'' node must request the ability to write, and the other node must request the data from the ''Home'' node. One of these requests will make it to the ''Home'' node first, resulting in the second request being deferred to the first node to reach the ''Home'' node. Such a scenario is pictured below.
[[Image:MemoryAccess.png|center]]

==== Solihin 11.4 Resolution ====
The three previous sections discussed individual parts of how the SCI protocol reduces race conditions. Putting all three of these parts together yields the following scenario which resolves the Early Invalidation race as described in the text.
[[Image:SolihinSCI.png|center]]

# ''A'' sends a request to ''Home'' for access to the memory block. It then goes into a busy state while it waits for a response.
# ''B'' also sends a request to ''Home'' for access to the same memory block. ''A'''s request is received first.
# ''Home'' responds to ''A'' with the data, but this response gets caught in network traffic and delayed. This response is sent before ''Home'' processes the request from ''B''.
# ''Home'' responds to the request from ''B'', telling it that ''A'' is the current ''Head'' node.
# ''B'' then sends a request to ''A'' to tell it to demote itself.
# ''A'' still hasn't received the response from ''Home'', so it is still in a transaction, and it tells ''B'' that it is busy. ''B'' will have to retry the request.

It should be noted that ''B'' is also in the busy state, so any requests made of it will also be rejected. This could lead to a chaining of nodes in the busy state, but as soon as node ''A'' receives its response from ''Home'' or times out, the requests in the chain will be successively accepted and processed.

=== Other Possible Race Conditions ===
The design of the SCI protocol prevents many race conditions from occurring, as has already been shown. Here are two other conditions that may arise, with a discussion of how the protocol resolves the issue.

==== Concurrent List Deletions ====

==== Simultaneous Deletion and Invalidation ====
Even though the ''Head'' node performs most of the coherence actions for a sharing list, the individual nodes are able to invalidate themselves, such as when they evict the block from their cache. This is referred to in the SCI standard as a "deletion" from the sharing list. Deletion is accomplished by having the invalidating node "lock" itself and then inform its forward and back nodes that they should now point to each other. This "locking" is essentially another busy state. A problem could arise, then, when two neighboring nodes try to invalidate themselves at the same time, as they would both be locked and not respond to each other's message. In this case, though, the protocol specifies that the node that is closest to the tail takes priority. So, the deadlock and/or race condition is averted by ordering the deletion from the tail towards the head.

== Definitions and Terms ==
* ''Head Node'' - The node at the beginning of the sharing list
* ''Home Node'' - The node responsible for keeping track of the ''Head'' node for a sharing list for a memory block, usually the node at which the memory block resides.
* ''SCI'' - Scalable Coherent Interface

== References ==
[[#1body|1.]] "IEEE Standard for Scalable Coherent Interface (SCI).," IEEE Std 1596-1992 , vol., no., pp.i, 1993. doi: 10.1109/IEEESTD.1993.120366
URL: http://ieeexplore.ieee.org.proxied.lib.ncsu.edu/stamp/stamp.jsp?tp=&arnumber=347683&isnumber=8049 
[[#2body|2.]] [http://www.scizzl.com/HowSCIcohWorks.html How SCI Coherence Works]

CSC/ECE 506 Spring 2011/ch11 BB EP

2011-04-18T22:35:38Z

Beburrou: /* Other Possible Race Conditions */

__TOC__

== History ==

== Background ==
The Scalable Coherent Interface (SCI) cache coherence protocol defines a set of states for memory, a set of states for cache lines, and a set of actions that transition between these states when processors access memory.

Three sets of these attributes are defined for minimal, typical, and full applications:

* '''Minimal''' - for ‘trivial but correct’ applications that require the presence of the memory in only one cache line. It does not enable read-sharing, and is appropriate for small multi-processors or applications that don’t require significant sharing.

* '''Typical''' - enables read-sharing of a memory location and provisions for efficiency, such as DMA transfers, local caching of data, and error recovery. This set adds an additional stable memory state (FRESH) and multiple cache states. This set will be the focus of this article going forward.

* '''Full''' - adds support for pair-wise sharing, QOLB lock bit synchronization, and cleansing and washing of cache lines.

== State Diagrams ==
=== Memory States ===
The memory states define the state of the memory block from the perspective of the home directory. This 2-bit field is maintained in the memory tag by the home directory along with the pointer to the head of the sharing list (forwId). This simple state model includes three stable states and one semi-stable state.

{|border="1" cellspacing="0" cellpadding="2"
!colspan="5"| Table 1: Memory States
|----
!State
!Description
!Minimal
!Typical
!Full
|----
|HOME
|no sharing list
|Y
|Y
|Y
|----
|FRESH
|sharing-list copy is the same as memory
|
|Y
|Y
|----
|GONE
|sharing-list copy may be different from memory
|Y
|Y
|Y
|----
|WASH*
|transitional state (GONE to FRESH)
|
|
|Y
|----
|}

=== Cache States ===
The cache-line states are maintained by each processor's cache-coherency controller. This 7-bit field is stored in each cache line in the sharing list, along with the pointer to the next sharing-list node (forwId) and the previous sharing-list node (backId). Seven bits allow up to 128 possible values, and SCI defines twenty-nine stable states for use in the minimal, typical, and full sets.

Stable states are those cache states that exist when a memory transaction is not in progress. Their names are derived from a combination of the position of the node in the sharing list, such as ONLY, HEAD, MID, and TAIL, and the state of the data for that node, such as FRESH, CLEAN, DIRTY, VALID, or STALE.

[[Image:cache_states.png|center]]

== Race Conditions ==
In a distributed shared memory system with caching, the emergence of race conditions is extremely likely. This is mainly due to the lack of a bus for serialization of actions. It is further compounded by the problem of network errors and congestion.

The early invalidation case from Section 11.4 in Solihin is an excellent example of a race condition that can arise in a distributed system. Recall the diagram from the text, and the cache coherence actions, as shown below.

[[Image:EarlyInValidationRace.png|400px|center]]
The circled actions are as follows:
# ''A'' sends a read request to ''Home''.
# ''Home'' replies with data (but the message gets delayed).
# ''B'' sends a write request to ''Home''.
# ''Home'' sends an invalidation to ''A'', and it arrives before the delayed ReplyD.

The SCI protocol has a way to handle this race condition, as well as many others. The following sections discuss how the design of the SCI protocol prevents it.

=== Prevention in the SCI Protocol ===
Race conditions are almost non-existent in the SCI protocol, primarily because of the protocol's design. The following subsections describe the mechanisms involved and then show what happens when the Early Invalidation scenario arises in the SCI protocol.
==== Atomic Transactions ====
SCI's primary method for preventing race conditions is having atomic transactions. A transaction is defined as a set of sub-actions necessary to complete some requested action, such as reading from memory or writing to a variable. Suppose, for example, that node ''A'' is in the middle of a transaction with node ''B''. Node ''C'' then tries to make a request of node ''A''. Node ''A'' will respond to node ''C'' that it is busy, telling node ''C'' to try again, as shown in the following diagram.
[[Image:AtomicBusy.png|center]]
Thus, in the Early Invalidation race, node ''A'' would be in the middle of a transaction, which would prevent node ''B'' from invalidating it. Note, however, that nodes will not stay in the busy state indefinitely. Doing so could lead to deadlocks, so all transactions have timeouts that eventually cause the transaction to fail. This moves the node out of the busy state and makes it able to respond to requests again.
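
The busy-and-retry behavior described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the mechanism defined in the standard: the <code>Node</code> class, the <code>request_with_retry</code> helper, and the use of abstract integer "ticks" for the timeout are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Optional

BUSY = "BUSY"
OK = "OK"


@dataclass
class Node:
    """Hypothetical model of a node that rejects requests while in a transaction."""
    name: str
    timeout: int = 5                  # assumed timeout, in abstract ticks
    busy_since: Optional[int] = None  # tick at which the current transaction began

    def begin_transaction(self, now: int) -> None:
        self.busy_since = now

    def end_transaction(self) -> None:
        self.busy_since = None

    def handle_request(self, now: int) -> str:
        # A transaction that has outlived its timeout is treated as failed,
        # which releases the node from the busy state.
        if self.busy_since is not None and now - self.busy_since >= self.timeout:
            self.busy_since = None
        return BUSY if self.busy_since is not None else OK


def request_with_retry(target: Node, start: int, max_tries: int) -> Optional[int]:
    """Retry once per tick; return the tick at which the target accepted, or None."""
    for attempt in range(max_tries):
        now = start + attempt
        if target.handle_request(now) == OK:
            return now
    return None
```

In this sketch a requester simply retries once per tick. A real implementation would choose its own retry policy, which the example does not attempt to model.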

==== Head Node ====
Another method for preventing race conditions is to have the ''Head'' node of a sharing list perform many of the coherence actions. As a result, only one node is performing actions such as writes and invalidations of other sharers. Since only one node is performing these actions, the possibility of concurrent actions is decreased. If another node wants to write, for instance, it must become the Head Node of the sharing list for the cache line to which it wants to write, displacing the current ''Head'' node as the node that can write. All state changes in the sharing list occur at the behest of the ''Head'' node, with the exception of nodes deleting themselves from the list.

When the nodes are simply sharing a read-only copy of a line, as when the memory is in the ''FRESH'' state, the ''Head'' node is somewhat irrelevant. All the sharing nodes have their own cached copy of the line. If any node wants to write, including the ''Head'' node, it must perform an additional action in order to do so. Likewise, when a sharing list shares a line in the dirty, or ''GONE'', memory state, all the sharing nodes stay in the list with their cached line as long as the ''Head'' node has not written to it. Once the ''Head'' node wants to write, however, it can write its data immediately, but it must then invalidate all other shared copies via the forward pointers in the sharing list.
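
As a rough illustration of the ''GONE'' case, the sketch below models a sharing list as a dictionary of cache lines linked by forward and backward pointers, and has the ''Head'' write and then walk the forward pointers to invalidate the other sharers. The <code>CacheLine</code> class, the <code>head_write</code> function, and the simplified state labels are assumptions made for this example. The real protocol performs each invalidation as a message exchange, which is not modeled here.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class CacheLine:
    """Hypothetical, simplified cache-line record for one node."""
    state: str                     # simplified label, e.g. "HEAD", "MID", "TAIL"
    data: int
    forw_id: Optional[str] = None  # next node in the sharing list
    back_id: Optional[str] = None  # previous node in the sharing list


def head_write(lines: Dict[str, CacheLine], head: str, value: int) -> List[str]:
    """Write at the head, then invalidate every other sharer in list order."""
    if lines[head].back_id is not None:
        raise ValueError("only the head of the sharing list may write")
    lines[head].data = value       # the head writes immediately
    invalidated: List[str] = []
    nxt = lines[head].forw_id
    while nxt is not None:         # follow the forward pointers down the list
        line = lines[nxt]
        following = line.forw_id
        line.state = "INVALID"
        line.forw_id = None
        line.back_id = None
        invalidated.append(nxt)
        nxt = following
    lines[head].forw_id = None     # the head is now the only sharer
    lines[head].state = "ONLY_DIRTY"
    return invalidated
```

Because only the node without a backward pointer is allowed to run this routine, two sharers cannot both start invalidating the list at the same time, which is the point the section makes about concentrating actions in the ''Head'' node.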

This, however, does not protect against a race condition where the memory is ''FRESH'', the ''Head'' node wants to write, and another node also wants to join the list. The Memory Access mechanism prevents this condition.

==== Memory Access ====
The final way that the SCI protocol minimizes race conditions is by changing the directory structure. In Solihin 11.4, there is a ''Home'' node which keeps track of the cache state and has a certain amount of knowledge about the system as a result. In the SCI protocol, there is still a ''Home'' node of sorts, but this node is only responsible for keeping track of who the current ''Head'' is. Such a ''Home'' node is usually the node where the memory block physically resides. When a node wants to access a block of memory, it sends a request to the ''Home'' node of that memory block, assuming it isn't already in the sharing list for that block. In the event that the data is already being used, the ''Home'' node will include in its response the address of the ''Head'' node of the sharing list. This requirement necessarily serializes the order of access, as one access request is not serviced at the ''Home'' node until the previous request is finished, and the requests are processed in FIFO order. In addition, whichever node is the ''Head'' of the sharing list will be notified by the new ''Head'' when it is demoted. All parties involved in a potential race are thus talking to each other, preventing the race from occurring in the first place.

Because of this requirement, the projected race condition from the previous section, where the ''Head'' node is in a ''FRESH'' state and wants to write while another node also wants to read, cannot occur. This is because both nodes must make their requests to the ''Home'' node; the ''Head'' node must request the ability to write, and the other node must request the data from the ''Home'' node. One of these requests will reach the ''Home'' node first, so the second request is deferred to the node whose request reached the ''Home'' node first. Such a scenario is pictured below.
[[Image:MemoryAccess.png|center]]
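
The serializing role of the ''Home'' node can be sketched as a FIFO queue plus a single head pointer. The example below is a minimal model that handles only read-style attach requests. The <code>HomeDirectory</code> class and its method names are hypothetical, and the only state change it models is the move from HOME to FRESH when the first sharer attaches.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Optional, Tuple


@dataclass
class HomeDirectory:
    """Hypothetical model of the memory tag kept by the home node for one block."""
    state: str = "HOME"                 # HOME means there is no sharing list
    forw_id: Optional[str] = None       # pointer to the current head
    pending: Deque[str] = field(default_factory=deque)

    def submit(self, requester: str) -> None:
        # Requests are queued in arrival order.
        self.pending.append(requester)

    def service_next(self) -> Tuple[str, Optional[str]]:
        """Service the oldest request; return (new head, previous head or None)."""
        requester = self.pending.popleft()
        old_head = self.forw_id
        self.forw_id = requester        # the requester becomes the new head
        if self.state == "HOME":
            self.state = "FRESH"        # a sharing list now exists
        return requester, old_head
```

Because each response carries the identity of the previous head, the second requester learns that it must contact the first one, so the two potential racers are forced to communicate with each other.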

==== Solihin 11.4 Resolution ====
The three previous sections each discussed one part of how the SCI protocol reduces race conditions. Putting all three parts together yields the following scenario, which resolves the Early Invalidation race as described in the text.
[[Image:SolihinSCI.png|center]]

# ''A'' sends a request to ''Home'' for access to the memory block. It then goes into a busy state while it waits for a response.
# ''B'' also sends a request to ''Home'' for access to the same memory block. ''A'''s request is received first.
# ''Home'' responds to ''A'' with the data, but this response gets caught in network traffic and delayed. This response is sent before ''Home'' processes the request from ''B''.
# ''Home'' responds to the request from ''B'', telling it that ''A'' is the current ''Head'' node.
# ''B'' then sends a request to ''A'' to tell it to demote itself.
# ''A'' still hasn't received the response from ''Home'', so it is still in a transaction, and it tells ''B'' that it is busy. ''B'' will have to retry the request.

It should be noted that ''B'' is also in the busy state, so any requests made of it will also be rejected. This could lead to a chaining of nodes in the busy state, but as soon as node ''A'' receives its response from ''Home'' or times out, the requests in the chain will be successively accepted and processed.
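
The six steps above can be traced with a small simulation. This is a sketch under simplifying assumptions: time is counted in abstract ticks, ''B'' retries once per tick, and the function name <code>simulate_early_invalidation</code> and its <code>reply_delay</code> parameter are invented for the example.

```python
from typing import List, Tuple


def simulate_early_invalidation(reply_delay: int) -> Tuple[str, List[str]]:
    """Trace the scenario; reply_delay is the tick at which Home's reply reaches A.

    Returns the final head of the sharing list and the event log.
    """
    log: List[str] = []
    a_busy = True                        # step 1: A has an outstanding request
    log.append("A->Home: request (A busy)")
    head = "A"                           # steps 2-3: Home services A first
    log.append("Home->A: data (delayed)")
    log.append("Home->B: head is A")     # step 4
    tick = 0
    while True:                          # steps 5-6: B retries until A is free
        if tick >= reply_delay:
            a_busy = False               # the delayed reply has finally arrived
        if a_busy:
            log.append(f"t={tick} B->A: demote -> BUSY, retry")
        else:
            log.append(f"t={tick} B->A: demote -> accepted")
            head = "B"
            break
        tick += 1
    return head, log
```

The invalidation-before-data ordering of the original race never appears in the log, because ''A'' refuses ''B'''s request until its own transaction has completed.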

=== Other Possible Race Conditions ===
The design of the SCI protocol prevents many race conditions from occurring, as has already been shown. Here are two other conditions that may arise, with a discussion of how the protocol resolves the issue.

==== Concurrent List Deletions ====

==== Simultaneous Deletion and Invalidation ====

== Definitions and Terms ==
* ''Head Node'' - The node at the beginning of the sharing list
* ''Home Node'' - The node responsible for keeping track of the ''Head'' node for a sharing list for a memory block, usually the node at which the memory block resides.
* ''SCI'' - Scalable Coherent Interface

== References ==
[[#1body|1.]] "IEEE Standard for Scalable Coherent Interface (SCI)," IEEE Std 1596-1992, 1993. doi: 10.1109/IEEESTD.1993.120366
URL: http://ieeexplore.ieee.org.proxied.lib.ncsu.edu/stamp/stamp.jsp?tp=&arnumber=347683&isnumber=8049 
[[#2body|2.]] [http://www.scizzl.com/HowSCIcohWorks.html How SCI Coherence Works]

CSC/ECE 506 Spring 2011/ch11 BB EP

2011-04-18T20:21:50Z

Beburrou: /* Summary */

__TOC__

== History ==

== Background ==
The Scalable Coherent Interface (SCI) cache coherence protocol defines a set of states for memory, a set of states for cache lines, and a set of actions that transition between these states when processors access memory.

Three sets of these attributes are defined for minimal, typical, and full applications:

* '''Minimal''' - for ‘trivial but correct’ applications that require the presence of the memory in only one cache line. It does not enable read-sharing, and is appropriate for small multi-processors or applications that don’t require significant sharing.

* '''Typical''' - enables read-sharing of a memory location and provisions for efficiency, such as DMA transfers, local caching of data, and error recovery. This set adds an additional stable memory state (FRESH) and multiple cache states. This set will be the focus of this article going forward.

* '''Full''' - adds support for pair-wise sharing, QOLB lock bit synchronization, and cleansing and washing of cache lines.

== State Diagrams ==
=== Memory States ===
The memory states define the state of the memory block from the perspective of the home directory. This 2-bit field is maintained in the memory tag by the home directory along with the pointer to the head of the sharing list (forwId). This simple state model includes three stable states and one semi-stable state.

{|border="1" cellspacing="0" cellpadding="2"
!colspan="5"| Table 1: Memory States
|----
!State
!Description
!Minimal
!Typical
!Full
|----
|HOME
|no sharing list
|Y
|Y
|Y
|----
|FRESH
|sharing-list copy is the same as memory
|
|Y
|Y
|----
|GONE
|sharing-list copy may be different from memory
|Y
|Y
|Y
|----
|WASH*
|transitional state (GONE to FRESH)
|
|
|Y
|----
|}

=== Cache States ===
The cache-line states are maintained by each processor's cache-coherency controller. This 7-bit field is stored in each cache line in the sharing list, along with the pointer to the next sharing-list node (forwId) and the previous sharing-list node (backId). Seven bits allow up to 128 possible values, and SCI defines twenty-nine stable states for use in the minimal, typical, and full sets.

Stable states are those cache states that exist when a memory transaction is not in progress. Their names are derived from a combination of the position of the node in the sharing list, such as ONLY, HEAD, MID, and TAIL, and the state of the data for that node, such as FRESH, CLEAN, DIRTY, VALID, STALE, etc.

[[Image:cache_states.png|center]]

== Race Conditions ==
In a distributed shared memory system with caching, the emergence of race conditions is extremely likely. This is mainly due to the lack of a bus for serialization of actions. It is further compounded by the problem of network errors and congestion.

The early invalidation case from Section 11.4 in Solihin is an excellent example of a race condition that can arise in a distributed system. Recall the diagram from the text, and the cache coherence actions, as shown below.

[[Image:EarlyInValidationRace.png|400px|center]]
The circled actions are as follows:
# ''A'' sends a read request to ''Home''.
# ''Home'' replies with data (but the message gets delayed).
# ''B'' sends a write request to ''Home''.
# ''Home'' sends an invalidation to ''A'', and it arrives before the delayed ReplyD.

The SCI protocol has a way to handle this race condition, as well as many others, and the following sections will discuss how the SCI protocol design can prevent this race condition.

=== Prevention in the SCI Protocol ===
Race conditions are almost non-existent in the SCI protocol, due primarily to the protocol's design. A brief discussion of how this is accomplished follows, followed by a discussion showing what happens if this condition arises in the SCI protocol.
==== Atomic Transactions ====
SCI's primary method for preventing race conditions is having atomic transactions. A transaction is defined as a set of sub-actions necessary to complete some requested action, such as reading from memory or writing to a variable. Suppose, for example, that node ''A'' is in the middle of a transaction with node ''B''. Node ''C'' then tries to make a request of node ''A''. Node ''A'' will respond to node ''C'' that it is busy, telling node ''C'' to try again, as shown in the following diagram.
[[Image:AtomicBusy.png|center]]
Thus, in the Early Invalidation race, node ''A'' would be in the middle of a transaction, which would prevent node ''B'' from invalidating it. Note, however, that nodes will not stay in the busy state indefinitely. Doing so could lead to deadlocks, so all transactions have timeouts that eventually cause the transaction to fail. This moves the node out of the busy state and makes it able to respond to requests again.

==== Head Node ====
Another method for preventing race conditions is to have the ''Head'' node of a sharing list perform many of the coherence actions. As a result, only one node is performing actions such as writes and invalidations of other sharers. Since only one node is performing these actions, the possibility of concurrent actions is decreased. If another node wants to write, for instance, it must become the Head Node of the sharing list for the cache line to which it wants to write, displacing the current ''Head'' node as the node that can write. All state changes in the sharing list occur at the behest of the ''Head'' node, with the exception of nodes deleting themselves from the list.

When the nodes are simply sharing a read-only copy of a line, as when the memory is in the ''FRESH'' state, the ''Head'' node is somewhat irrelevant. All the sharing nodes have their own cached copy of the line. If any node wants to write, including the ''Head'' node, it must perform an additional action in order to do so. Likewise, when a sharing list shares a line in the dirty, or ''GONE'', memory state, all the sharing nodes stay in the list with their cached line as long as the ''Head'' node has not written to it. Once the ''Head'' node wants to write, however, it can write its data immediately, but it must then invalidate all other shared copies via the forward pointers in the sharing list.

This, however, does not protect against a race condition where the memory is ''FRESH'', the ''Head'' node wants to write, and another node also wants to join the list. The Memory Access mechanism prevents this condition.

==== Memory Access ====
The final way that the SCI protocol minimizes race conditions is by changing the directory structure. In Solihin 11.4, there is a ''Home'' node which keeps track of the cache state and has a certain amount of knowledge about the system as a result. In the SCI protocol, there is still a ''Home'' node of sorts, but this node is only responsible for keeping track of who the current ''Head'' is. Such a ''Home'' node is usually the node where the memory block physically resides. When a node wants to access a block of memory, it sends a request to the ''Home'' node of that memory block, assuming it isn't already in the sharing list for that block. In the event that the data is already being used, the ''Home'' node will include in its response the address of the ''Head'' node of the sharing list. This requirement necessarily serializes the order of access, as one access request is not serviced at the ''Home'' node until the previous request is finished, and the requests are processed in FIFO order. In addition, whichever node is the ''Head'' of the sharing list will be notified by the new ''Head'' when it is demoted. All parties involved in a potential race are thus talking to each other, preventing the race from occurring in the first place.

Because of this requirement, the projected race condition from the previous section, where the ''Head'' node is in a ''FRESH'' state and wants to write while another node also wants to read, cannot occur. This is because both nodes must make their requests to the ''Home'' node; the ''Head'' node must request the ability to write, and the other node must request the data from the ''Home'' node. One of these requests will reach the ''Home'' node first, so the second request is deferred to the node whose request reached the ''Home'' node first. Such a scenario is pictured below.
[[Image:MemoryAccess.png|center]]

==== Solihin 11.4 Resolution ====
The three previous sections discussed individual parts of how the SCI protocol reduces race conditions. Putting all three of these parts together yields the following scenario which resolves the Early Invalidation race as described in the text.
[[Image:SolihinSCI.png|center]]

# ''A'' sends a request to ''Home'' for access to the memory block. It then goes into a busy state while it waits for a response.
# ''B'' also sends a request to ''Home'' for access to the same memory block. ''A'''s request is received first.
# ''Home'' responds to ''A'' with the data, but this response gets caught in network traffic and delayed. This response is sent before ''Home'' processes the request from ''B''.
# ''Home'' responds to the request from ''B'', telling it that ''A'' is the current ''Head'' node.
# ''B'' then sends a request to ''A'' to tell it to demote itself.
# ''A'' still hasn't received the response from ''Home'', so it is still in a transaction, and it tells ''B'' that it is busy. ''B'' will have to retry the request.

It should be noted that ''B'' is also in the busy state, so any requests made of it will also be rejected. This could lead to a chaining of nodes in the busy state, but as soon as node ''A'' receives its response from ''Home'' or times out, the requests in the chain will be successively accepted and processed.

=== Other Possible Race Conditions ===
==== Concurrent List Deletions ====

==== Simultaneous Deletion and Invalidation ====

== Definitions and Terms ==
* ''Head Node'' - The node at the beginning of the sharing list
* ''Home Node'' - The node responsible for keeping track of the ''Head'' node for a sharing list for a memory block, usually the node at which the memory block resides.
* ''SCI'' - Scalable Coherent Interface

== References ==
[[#1body|1.]] "IEEE Standard for Scalable Coherent Interface (SCI)," IEEE Std 1596-1992, 1993. doi: 10.1109/IEEESTD.1993.120366
URL: http://ieeexplore.ieee.org.proxied.lib.ncsu.edu/stamp/stamp.jsp?tp=&arnumber=347683&isnumber=8049 
[[#2body|2.]] [http://www.scizzl.com/HowSCIcohWorks.html How SCI Coherence Works]

CSC/ECE 506 Spring 2011/ch11 BB EP

2011-04-18T20:07:27Z

Beburrou: /* References */

__TOC__

== History ==

== Background ==
The Scalable Coherent Interface (SCI) cache coherence protocol defines a set of states for memory, a set of states for cache lines, and a set of actions that transition between these states when processors access memory.

Three sets of these attributes are defined for minimal, typical, and full applications:

* '''Minimal''' - for ‘trivial but correct’ applications that require the presence of the memory in only one cache line. It does not enable read-sharing, and is appropriate for small multi-processors or applications that don’t require significant sharing.

* '''Typical''' - enables read-sharing of a memory location and provisions for efficiency, such as DMA transfers, local caching of data, and error recovery. This set adds an additional stable memory state (FRESH) and multiple cache states. This set will be the focus of this article going forward.

* '''Full''' - adds support for pair-wise sharing, QOLB lock bit synchronization, and cleansing and washing of cache lines.

== State Diagrams ==
=== Memory States ===
The memory states define the state of the memory block from the perspective of the home directory. This 2-bit field is maintained in the memory tag by the home directory along with the pointer to the head of the sharing list (forwId). This simple state model includes three stable states and one semi-stable state.

{|border="1" cellspacing="0" cellpadding="2"
!colspan="5"| Table 1: Memory States
|----
!State
!Description
!Minimal
!Typical
!Full
|----
|HOME
|no sharing list
|Y
|Y
|Y
|----
|FRESH
|sharing-list copy is the same as memory
|
|Y
|Y
|----
|GONE
|sharing-list copy may be different from memory
|Y
|Y
|Y
|----
|WASH*
|transitional state (GONE to FRESH)
|
|
|Y
|----
|}

=== Cache States ===
The cache-line states are maintained by each processor's cache-coherency controller. This 7-bit field is stored in each cache line in the sharing list, along with the pointer to the next sharing-list node (forwId) and the previous sharing-list node (backId). Seven bits allow up to 128 possible values, and SCI defines twenty-nine stable states for use in the minimal, typical, and full sets.

Stable states are those cache states that exist when a memory transaction is not in progress. Their names are derived from a combination of the position of the node in the sharing list, such as ONLY, HEAD, MID, and TAIL, and the state of the data for that node, such as FRESH, CLEAN, DIRTY, VALID, STALE, etc.

[[Image:cache_states.png|center]]

== Race Conditions ==
In a distributed shared memory system with caching, the emergence of race conditions is extremely likely. This is mainly due to the lack of a bus for serialization of actions. It is further compounded by the problem of network errors and congestion.

The early invalidation case from Section 11.4 in Solihin is an excellent example of a race condition that can arise in a distributed system. Recall the diagram from the text, and the cache coherence actions, as shown below.

[[Image:EarlyInValidationRace.png|400px|center]]
The circled actions are as follows:
# ''A'' sends a read request to ''Home''.
# ''Home'' replies with data (but the message gets delayed).
# ''B'' sends a write request to ''Home''.
# ''Home'' sends an invalidation to ''A'', and it arrives before the delayed ReplyD.

The SCI protocol has a way to handle this race condition, as well as many others, and the following sections will discuss how the SCI protocol design can prevent this race condition.

=== Prevention in the SCI Protocol ===
Race conditions are almost non-existent in the SCI protocol, due primarily to the protocol's design. A brief discussion of how this is accomplished follows, followed by a discussion showing what happens if this condition arises in the SCI protocol.
==== Atomic Transactions ====
SCI's primary method for preventing race conditions is having atomic transactions. A transaction is defined as a set of sub-actions necessary to complete some requested action, such as reading from memory or writing to a variable. Suppose, for example, that node ''A'' is in the middle of a transaction with node ''B''. Node ''C'' then tries to make a request of node ''A''. Node ''A'' will respond to node ''C'' that it is busy, telling node ''C'' to try again, as shown in the following diagram.
[[Image:AtomicBusy.png|center]]
Thus, in the Early Invalidation race, node ''A'' would be in the middle of a transaction, which would prevent node ''B'' from invalidating it. Note, however, that nodes will not stay in the busy state indefinitely. Doing so could lead to deadlocks, so all transactions have timeouts that eventually cause the transaction to fail. This moves the node out of the busy state and makes it able to respond to requests again.

==== Head Node ====
Another method for preventing race conditions is to have the ''Head'' node of a sharing list perform many of the coherence actions. As a result, only one node is performing actions such as writes and invalidations of other sharers. Since only one node is performing these actions, the possibility of concurrent actions is decreased. If another node wants to write, for instance, it must become the Head Node of the sharing list for the cache line to which it wants to write, displacing the current ''Head'' node as the node that can write. All state changes in the sharing list occur at the behest of the ''Head'' node, with the exception of nodes deleting themselves from the list.

When the nodes are simply sharing a read-only copy of a line, as when the memory is in the ''FRESH'' state, the ''Head'' node is somewhat irrelevant. All the sharing nodes have their own cached copy of the line. If any node wants to write, including the ''Head'' node, it must perform an additional action in order to do so. Likewise, when a sharing list shares a line in the dirty, or ''GONE'', memory state, all the sharing nodes stay in the list with their cached line as long as the ''Head'' node has not written to it. Once the ''Head'' node wants to write, however, it can write its data immediately, but it must then invalidate all other shared copies via the forward pointers in the sharing list.

This, however, does not protect against a race condition where the memory is ''FRESH'', the ''Head'' node wants to write, and another node also wants to join the list. The Memory Access mechanism prevents this condition.

==== Memory Access ====
The final way that the SCI protocol minimizes race conditions is by changing the directory structure. In Solihin 11.4, there is a ''Home'' node which keeps track of the cache state and has a certain amount of knowledge about the system as a result. In the SCI protocol, there is still a ''Home'' node of sorts, but this node is only responsible for keeping track of who the current ''Head'' is. Such a ''Home'' node is usually the node where the memory block physically resides. When a node wants to access a block of memory, it sends a request to the ''Home'' node of that memory block, assuming it isn't already in the sharing list for that block. In the event that the data is already being used, the ''Home'' node will include in its response the address of the ''Head'' node of the sharing list. This requirement necessarily serializes the order of access, as one access request is not serviced at the ''Home'' node until the previous request is finished, and the requests are processed in FIFO order. In addition, whichever node is the ''Head'' of the sharing list will be notified by the new ''Head'' when it is demoted. All parties involved in a potential race are thus talking to each other, preventing the race from occurring in the first place.

Because of this requirement, the projected race condition from the previous section, where the ''Head'' node is in a ''FRESH'' state and wants to write while another node also wants to read, cannot occur. This is because both nodes must make their requests to the ''Home'' node; the ''Head'' node must request the ability to write, and the other node must request the data from the ''Home'' node. One of these requests will reach the ''Home'' node first, so the second request is deferred to the node whose request reached the ''Home'' node first. Such a scenario is pictured below.
[[Image:MemoryAccess.png|center]]

==== Solihin 11.4 Resolution ====
The three previous sections discussed individual parts of how the SCI protocol reduces race conditions. Putting all three of these parts together yields the following scenario which resolves the Early Invalidation race as described in the text.
[[Image:SolihinSCI.png|center]]

# ''A'' sends a request to ''Home'' for access to the memory block. It then goes into a busy state while it waits for a response.
# ''B'' also sends a request to ''Home'' for access to the same memory block. ''A'''s request is received first.
# ''Home'' responds to ''A'' with the data, but this response gets caught in network traffic and delayed. This response is sent before ''Home'' processes the request from ''B''.
# ''Home'' responds to the request from ''B'', telling it that ''A'' is the current ''Head'' node.
# ''B'' then sends a request to ''A'' to tell it to demote itself.
# ''A'' still hasn't received the response from ''Home'', so it is still in a transaction, and it tells ''B'' that it is busy. ''B'' will have to retry the request.

It should be noted that ''B'' is also in the busy state, so any requests made of it will also be rejected. This could lead to a chaining of nodes in the busy state, but as soon as node ''A'' receives its response from ''Home'' or times out, the requests in the chain will be successively accepted and processed.

=== Other Possible Race Conditions ===
==== Concurrent List Deletions ====

==== Simultaneous Deletion and Invalidation ====

== Summary ==

== Definitions and Terms ==
* ''Head Node'' - The node at the beginning of the sharing list
* ''Home Node'' - The node responsible for keeping track of the ''Head'' node for a sharing list for a memory block, usually the node at which the memory block resides.
* ''SCI'' - Scalable Coherent Interface

== References ==
[[#1body|1.]] "IEEE Standard for Scalable Coherent Interface (SCI)," IEEE Std 1596-1992, 1993. doi: 10.1109/IEEESTD.1993.120366
URL: http://ieeexplore.ieee.org.proxied.lib.ncsu.edu/stamp/stamp.jsp?tp=&arnumber=347683&isnumber=8049 
[[#2body|2.]] [http://www.scizzl.com/HowSCIcohWorks.html How SCI Coherence Works]

CSC/ECE 506 Spring 2011/ch11 BB EP

2011-04-18T19:26:30Z

Beburrou: /* Definitions and Terms */

__TOC__

== History ==

== Background ==
The Scalable Coherent Interface (SCI) cache coherence protocol defines a set of states for memory, a set of states for cache lines, and a set of actions that transition between these states when processors access memory.

Three sets of these attributes are defined for minimal, typical, and full applications:

* '''Minimal''' - for ‘trivial but correct’ applications that require the presence of the memory in only one cache line. It does not enable read-sharing, and is appropriate for small multi-processors or applications that don’t require significant sharing.

* '''Typical''' - enables read-sharing of a memory location and provisions for efficiency, such as DMA transfers, local caching of data, and error recovery. This set adds an additional stable memory state (FRESH) and multiple cache states. This set will be the focus of this article going forward.

* '''Full''' - adds support for pair-wise sharing, QOLB lock bit synchronization, and cleansing and washing of cache lines.

== State Diagrams ==
=== Memory States ===
The memory states define the state of the memory block from the perspective of the home directory. This 2-bit field is maintained in the memory tag by the home directory along with the pointer to the head of the sharing list (forwId). This simple state model includes three stable states and one semi-stable state.

{|border="1" cellspacing="0" cellpadding="2"
!colspan="5"| Table 1: Memory States
|----
!State
!Description
!Minimal
!Typical
!Full
|----
|HOME
|no sharing list
|Y
|Y
|Y
|----
|FRESH
|sharing-list copy is the same as memory
|
|Y
|Y
|----
|GONE
|sharing-list copy may be different from memory
|Y
|Y
|Y
|----
|WASH*
|transitional state (GONE to FRESH)
|
|
|Y
|----
|}

=== Cache States ===
The cache-line states are maintained by each processor's cache-coherency controller. This 7-bit field is stored in each cache line in the sharing list, along with the pointer to the next sharing-list node (forwId) and the previous sharing-list node (backId). Seven bits allow up to 128 possible values, and SCI defines twenty-nine stable states for use in the minimal, typical, and full sets.

Stable states are those cache states that exist when a memory transaction is not in progress. Their names are derived from a combination of the position of the node in the sharing list, such as ONLY, HEAD, MID, and TAIL, and the state of the data for that node, such as FRESH, CLEAN, DIRTY, VALID, STALE, etc.

[[Image:cache_states.png|center]]

== Race Conditions ==
In a distributed shared memory system with caching, the emergence of race conditions is extremely likely. This is mainly due to the lack of a bus for serialization of actions. It is further compounded by the problem of network errors and congestion.

The early invalidation case from Section 11.4 in Solihin is an excellent example of a race condition that can arise in a distributed system. Recall the diagram from the text, and the cache coherence actions, as shown below.

[[Image:EarlyInValidationRace.png|400px|center]]
The circled actions are as follows:
# ''A'' sends a read request to ''Home''.
# ''Home'' replies with data (but the message gets delayed).
# ''B'' sends a write request to ''Home''.
# ''Home'' sends an invalidation to ''A'', and it arrives before the delayed ReplyD from step 2.

The SCI protocol can handle this race condition, as well as many others. The following sections discuss how the design of the protocol prevents it.

=== Prevention in the SCI Protocol ===
Race conditions are almost non-existent in the SCI protocol, primarily because of the protocol's design. The subsections below describe how this is accomplished and then show what happens when the early invalidation scenario arises under SCI.
==== Atomic Transactions ====
SCI's primary method for preventing race conditions is having atomic transactions. A transaction is defined as a set of sub-actions necessary to complete some requested action, such as reading from memory or writing to a variable. Suppose, for example, that node ''A'' is in the middle of a transaction with node ''B''. Node ''C'' then tries to make a request of node ''A''. Node ''A'' will respond to node ''C'' that it is busy, telling node ''C'' to try again, as shown in the following diagram.
[[Image:AtomicBusy.png|center]]
Thus, in the Early Invalidation race, node ''A'' would be in the middle of a transaction, which would prevent node ''B'' from invalidating it. Nodes do not stay in the busy state indefinitely, however, since doing so could lead to deadlock. Every transaction has a timeout that eventually causes it to fail, which moves the node out of the busy state and lets it respond to requests again.
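
The busy-and-retry behaviour can be sketched as follows. This is a simplified Python model under stated assumptions: time is counted in abstract ticks, the timeout value is arbitrary, and the names <code>Node</code>, <code>handle_request</code>, and <code>request_with_retry</code> are hypothetical rather than part of SCI.

```python
class Node:
    def __init__(self, name, timeout=3):
        self.name = name
        self.timeout = timeout   # hypothetical: ticks before a transaction is abandoned
        self.busy_ticks = None   # None means the node is idle

    def begin_transaction(self):
        self.busy_ticks = 0

    def end_transaction(self):
        self.busy_ticks = None

    def tick(self):
        # Advance time; a transaction that lasts too long fails and frees the node.
        if self.busy_ticks is not None:
            self.busy_ticks += 1
            if self.busy_ticks >= self.timeout:
                self.end_transaction()

    def handle_request(self, sender):
        # A node in the middle of a transaction rejects all other requests.
        return "BUSY" if self.busy_ticks is not None else "ACK"


def request_with_retry(target, sender, max_attempts=10):
    """Retry until the target accepts; return the attempt number, or None."""
    for attempt in range(1, max_attempts + 1):
        if target.handle_request(sender) == "ACK":
            return attempt
        target.tick()  # one tick passes before the next retry
    return None
```

In this model a requester that contacts an idle node succeeds on the first attempt, while a requester that contacts a busy node keeps retrying until the transaction completes or times out.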

==== Head Node ====
Another method for preventing race conditions is to have the ''Head'' node of a sharing list perform many of the coherence actions. Because only one node performs actions such as writes and invalidations of other sharers, the possibility of concurrent actions is reduced. If another node wants to write, for instance, it must become the ''Head'' node of the sharing list for the cache line it wants to write, displacing the current ''Head'' node as the node that can write. All state changes in the sharing list occur at the behest of the ''Head'' node, with the exception of nodes deleting themselves from the list.

When the nodes are simply sharing a read-only copy of a line, as when the memory is in the ''FRESH'' state, the ''Head'' node is somewhat irrelevant. All the sharing nodes have their own cached copy of the line. If any node wants to write, including the ''Head'' node, it must perform an additional action in order to do so. Likewise, when a sharing list shares a line in the dirty, or ''GONE'', memory state, all the sharing nodes stay in the list with their cached line as long as the ''Head'' node has not written to it. Once the ''Head'' node wants to write, it can write its data immediately, but it must then invalidate all other shared copies via the forward pointers in the sharing list.

This, however, does not protect against a race condition where the memory is ''FRESH'', the ''Head'' node wants to write, and another node also wants to join the list. The Memory Access mechanism prevents this condition.
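
The head-driven invalidation described in this section can be sketched as a walk along the forward pointers of a doubly linked list. The Python below is a simplified model only: the names <code>Sharer</code>, <code>build_list</code>, and <code>head_write</code> are hypothetical, and real SCI invalidation involves message exchanges and intermediate states that are omitted here.

```python
class Sharer:
    def __init__(self, name):
        self.name = name
        self.valid = True
        self.forw = None  # next node, toward the tail
        self.back = None  # previous node, toward the head


def build_list(names):
    """Link the named nodes into a sharing list and return its head."""
    nodes = [Sharer(n) for n in names]
    for prev, nxt in zip(nodes, nodes[1:]):
        prev.forw, nxt.back = nxt, prev
    return nodes[0]


def head_write(head):
    """The head writes, then invalidates every other sharer in list order."""
    invalidated = []
    node = head.forw
    while node is not None:
        nxt = node.forw                # remember the successor before unlinking
        node.valid = False
        node.forw = node.back = None
        invalidated.append(node.name)
        node = nxt
    head.forw = None                   # the head is now the only node in the list
    return invalidated
```

Calling <code>head_write</code> on the head of a three-node list leaves the head as the only valid copy and returns the names of the nodes it invalidated, in list order.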

==== Memory Access ====
The final way that the SCI protocol minimizes race conditions is by changing the directory structure. In Solihin 11.4, there is a ''Home'' node that keeps track of the cache state and therefore has a certain amount of knowledge about the system. In the SCI protocol, there is still a ''Home'' node of sorts, but this node is only responsible for keeping track of which node is the current ''Head''. Such a ''Home'' node is usually the node where the memory block physically resides. When a node wants to access a block of memory, it sends a request to the ''Home'' node of that memory block, assuming it isn't already in the sharing list for that block. If the data is already in use, the ''Home'' node includes in its response the address of the ''Head'' node of the sharing list. This requirement necessarily serializes the order of access: requests are processed in FIFO order, and one access request is not serviced at the ''Home'' node until the previous request is finished. In addition, whichever node is the ''Head'' of the sharing list is notified by the new ''Head'' when it is demoted. All parties involved in a potential race are thus talking to each other, which prevents the race from occurring in the first place.

Because of this requirement, the projected race condition from the previous section, where the ''Head'' node is in a ''FRESH'' state and wants to write while another node also wants to read, cannot occur. Both nodes must make their requests to the ''Home'' node: the ''Head'' node must request the ability to write, and the other node must request the data. One of these requests reaches the ''Home'' node first, and the second requester is then referred to the node whose request arrived first. Such a scenario is pictured below.
[[Image:MemoryAccess.png|center]]
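
The serialization at the ''Home'' node can be sketched with a FIFO queue. The Python below is a minimal model, assuming that each serviced requester becomes the new ''Head'' and is told the identity of the previous one. The class <code>Home</code> and its methods are hypothetical names used only for illustration.

```python
from collections import deque


class Home:
    def __init__(self):
        self.head = None      # id of the current sharing-list head, if any
        self.queue = deque()  # pending requests, serviced in FIFO order

    def submit(self, requester):
        self.queue.append(requester)

    def service_next(self):
        """Service one request and return (requester, previous head)."""
        requester = self.queue.popleft()
        old_head = self.head
        self.head = requester  # the requester becomes the new head
        return requester, old_head
```

Because requests are serviced one at a time, two nodes that race to reach ''Home'' always see a definite order: the second requester learns the identity of the first and must contact it directly.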

==== Solihin 11.4 Resolution ====
The three previous sections discussed individual parts of how the SCI protocol reduces race conditions. Putting all three of these parts together yields the following scenario which resolves the Early Invalidation race as described in the text.
[[Image:SolihinSCI.png|center]]

# ''A'' sends a request to ''Home'' for access to the memory block. It then goes into a busy state while it waits for a response.
# ''B'' also sends a request to ''Home'' for access to the same memory block. ''A'''s request is received first.
# ''Home'' responds to ''A'' with the data, but this response gets caught in network traffic and delayed. This response is sent before ''Home'' processes the request from ''B''.
# ''Home'' responds to the request from ''B'', telling it that ''A'' is the current ''Head'' node.
# ''B'' then sends a request to ''A'' to tell it to demote itself.
# ''A'' still hasn't received the response from ''Home'', so it is still in a transaction, and it tells ''B'' that it is busy. ''B'' will have to retry the request.

It should be noted that ''B'' is also in the busy state, so any requests made of it will also be rejected. This could lead to a chaining of nodes in the busy state, but as soon as node ''A'' receives its response from ''Home'' or times out, the requests in the chain will be successively accepted and processed.
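
The six steps above can be replayed as a short message trace. The Python sketch below is only an illustration of the ordering: the function <code>resolve_early_invalidation</code> and its <code>reply_delay</code> parameter (the number of attempts by ''B'' that occur before the delayed reply reaches ''A'') are assumptions, and timeouts are not modelled.

```python
def resolve_early_invalidation(reply_delay=2):
    """Return (message trace, number of attempts B needed to reach A)."""
    trace = []
    trace.append("A->Home: request")         # step 1: A becomes busy
    a_busy = True
    trace.append("B->Home: request")         # step 2: arrives after A's request
    trace.append("Home->A: data (delayed)")  # step 3: reply is stuck in the network
    trace.append("Home->B: head is A")       # step 4
    attempts = 0
    while True:
        attempts += 1
        trace.append("B->A: demote")         # step 5
        if attempts > reply_delay:
            a_busy = False                   # the delayed reply has now arrived
        if a_busy:
            trace.append("A->B: busy")       # step 6: B must retry
        else:
            trace.append("A->B: ok")
            return trace, attempts
```

With the default delay, ''B'' is rejected twice and succeeds on its third attempt, so the invalidation can never overtake the data reply at ''A''.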

=== Other Possible Race Conditions ===
==== Concurrent List Deletions ====

==== Simultaneous Deletion and Invalidation ====

== Summary ==

== Definitions and Terms ==
* ''Head Node'' - The node at the beginning of the sharing list
* ''Home Node'' - The node responsible for keeping track of the ''Head'' node for a sharing list for a memory block, usually the node at which the memory block resides.
* ''SCI'' - Scalable Coherent Interface

== References ==
[[#1body|1.]] "IEEE Standard for Scalable Coherent Interface (SCI)," IEEE Std 1596-1992, 1993. doi: 10.1109/IEEESTD.1993.120366
URL: http://ieeexplore.ieee.org.proxied.lib.ncsu.edu/stamp/stamp.jsp?tp=&arnumber=347683&isnumber=8049

CSC/ECE 506 Spring 2011/ch11 BB EP

2011-04-18T19:23:45Z

Beburrou: /* References */

__TOC__

== History ==

== Background ==
The Scalable Coherent Interface (SCI) cache coherence protocol defines a set of states for memory, a set of states for cache lines, and a set of actions that transition between these states when processors access memory.

Three sets of these attributes are defined for minimal, typical, and full applications:

* '''Minimal''' - for ‘trivial but correct’ applications that require the presence of the memory in only one cache line. It does not enable read-sharing, and is appropriate for small multi-processors or applications that don’t require significant sharing.

* '''Typical''' - enables read-sharing of a memory location and provisions for efficiency, such as DMA transfers, local caching of data, and error recovery. This set adds an additional stable memory state (FRESH) and multiple cache states. This set will be the focus of this article going forward.

* '''Full''' - adds support for pair-wise sharing, QOLB lock bit synchronization, and cleansing and washing of cache lines.

== State Diagrams ==
=== Memory States ===
The memory states define the state of the memory block from the perspective of the home directory. This 2-bit field is maintained in the memory tag by the home directory along with the pointer to the head of the sharing list (forwId). This simple state model includes three stable states and one semi-stable state.

{|border="1" cellspacing="0" cellpadding="2"
!colspan="5"| Table 1: Memory States
|----
!State
!Description
!Minimal
!Typical
!Full
|----
|HOME
|no sharing list
|Y
|Y
|Y
|----
|FRESH
|sharing-list copy is the same as memory
|
|Y
|Y
|----
|GONE
|sharing-list copy may be different from memory
|Y
|Y
|Y
|----
|WASH*
|transitional state (GONE to FRESH)
|
|
|Y
|----
|}

=== Cache States ===
The cache-line states are maintained by each processor's cache-coherency controller. This 7-bit field is stored in each cache line in the sharing list, along with the pointer to the next sharing-list node (forwId) and the previous sharing-list node (backId). Seven bits allow up to 128 possible values, and SCI defines twenty-nine stable states for use in the minimal, typical, and full sets.

Stable states are those cache states that exist when no memory transaction is in progress. Their names are derived from a combination of:
* the position of the node in the sharing list, such as ONLY, HEAD, MID, and TAIL
* the state of the data for that node, such as FRESH, CLEAN, DIRTY, VALID, STALE, etc.

[[Image:cache_states.png|center]]

== Race Conditions ==
In a distributed shared memory system with caching, the emergence of race conditions is extremely likely. This is mainly due to the lack of a bus for serialization of actions. It is further compounded by the problem of network errors and congestion.

The early invalidation case from Section 11.4 in Solihin is an excellent example of a race condition that can arise in a distributed system. Recall the diagram from the text, and the cache coherence actions, as shown below.

[[Image:EarlyInValidationRace.png|400px|center]]
The circled actions are as follows:
# ''A'' sends a read request to ''Home''.
# ''Home'' replies with data (but the message gets delayed).
# ''B'' sends a write request to ''Home''.
# ''Home'' sends an invalidation to ''A'', and it arrives before the delayed ReplyD from step 2.

The SCI protocol has a way to handle this race condition, as well as many others, and the following sections will discuss how the SCI protocol design can prevent this race condition.

=== Prevention in the SCI Protocol ===
Race conditions are almost non-existent in the SCI protocol, due primarily to the protocol's design. A brief discussion of how this is accomplished follows, followed by a discussion showing what happens if this condition arises in the SCI protocol.
==== Atomic Transactions ====
SCI's primary method for preventing race conditions is having atomic transactions. A transaction is defined as a set of sub-actions necessary to complete some requested action, such as reading from memory or writing to a variable. Suppose, for example, that node ''A'' is in the middle of a transaction with node ''B''. Node ''C'' then tries to make a request of node ''A''. Node ''A'' will respond to node ''C'' that it is busy, telling node ''C'' to try again, as shown in the following diagram.
[[Image:AtomicBusy.png|center]]
Thus, in the Early Invalidation race, node ''A'' would be in the middle of a transaction, which would prevent node ''B'' from invalidating it. Note, however, that nodes will not stay in the busy state indefinitely. Doing so would lead to potential deadlocks, so all transactions have time outs that will eventually cause the transaction to fail, thus moving the node out of the busy state, making it able to respond to requests again.

==== Head Node ====
Another method for preventing race conditions is to have the ''Head'' node of a sharing list perform many of the coherence actions. As a result, only one node is performing actions such as writes and invalidations of other sharers. Since only one node is performing these actions, the possibility of concurrent actions is decreased. If another node wants to write, for instance, it must become the Head Node of the sharing list for the cache line to which it wants to write, displacing the current ''Head'' node as the node that can write. All state changes in the sharing list occur at the behest of the ''Head'' node, with the exception of nodes deleting themselves from the list.

When the nodes are simply sharing a read-only copy of a line, as when the memory is in the ''FRESH'' state, the ''Head'' node is somewhat irrelevant. All the sharing nodes have their own cached copy of the line. If any node wants to write, including the ''Head'' node, it must perform an additional action in order to do so. Likewise, when a sharing list shares a line in the dirty, or ''GONE'', memory state, all the sharing nodes stay in the list with their cached line as long as the ''Head'' node has not written to it. Once the ''Head'' node wants to write, it can write its data immediately, but it must then invalidate all other shared copies via the forward pointers in the sharing list.

This, however, does not protect against a race condition where the memory is ''FRESH'', the ''Head'' node wants to write, and another node also wants to join the list. The Memory Access mechanism prevents this condition.

==== Memory Access ====
The final way that the SCI protocol minimizes race conditions is by changing the directory structure. In Solihin 11.4, there is a ''Home'' node that keeps track of the cache state and therefore has a certain amount of knowledge about the system. In the SCI protocol, there is still a ''Home'' node of sorts, but this node is only responsible for keeping track of which node is the current ''Head''. Such a ''Home'' node is usually the node where the memory block physically resides. When a node wants to access a block of memory, it sends a request to the ''Home'' node of that memory block, assuming it isn't already in the sharing list for that block. If the data is already in use, the ''Home'' node includes in its response the address of the ''Head'' node of the sharing list. This requirement necessarily serializes the order of access: requests are processed in FIFO order, and one access request is not serviced at the ''Home'' node until the previous request is finished. In addition, whichever node is the ''Head'' of the sharing list is notified by the new ''Head'' when it is demoted. All parties involved in a potential race are thus talking to each other, which prevents the race from occurring in the first place.

Because of this requirement, the projected race condition from the previous section, where the ''Head'' node is in a ''FRESH'' state and wants to write while another node also wants to read, cannot occur. Both nodes must make their requests to the ''Home'' node: the ''Head'' node must request the ability to write, and the other node must request the data. One of these requests reaches the ''Home'' node first, and the second requester is then referred to the node whose request arrived first. Such a scenario is pictured below.
[[Image:MemoryAccess.png|center]]

==== Solihin 11.4 Resolution ====
The three previous sections discussed individual parts of how the SCI protocol reduces race conditions. Putting all three of these parts together yields the following scenario which resolves the Early Invalidation race as described in the text.
[[Image:SolihinSCI.png|center]]

# ''A'' sends a request to ''Home'' for access to the memory block. It then goes into a busy state while it waits for a response.
# ''B'' also sends a request to ''Home'' for access to the same memory block. ''A'''s request is received first.
# ''Home'' responds to ''A'' with the data, but this response gets caught in network traffic and delayed. This response is sent before ''Home'' processes the request from ''B''.
# ''Home'' responds to the request from ''B'', telling it that ''A'' is the current ''Head'' node.
# ''B'' then sends a request to ''A'' to tell it to demote itself.
# ''A'' still hasn't received the response from ''Home'', so it is still in a transaction, and it tells ''B'' that it is busy. ''B'' will have to retry the request.

It should be noted that ''B'' is also in the busy state, so any requests made of it will also be rejected. This could lead to a chaining of nodes in the busy state, but as soon as node ''A'' receives its response from ''Home'' or times out, the requests in the chain will be successively accepted and processed.

=== Other Possible Race Conditions ===
==== Concurrent List Deletions ====

==== Simultaneous Deletion and Invalidation ====

== Summary ==

== Definitions and Terms ==

== References ==
[[#1body|1.]] "IEEE Standard for Scalable Coherent Interface (SCI)," IEEE Std 1596-1992, 1993. doi: 10.1109/IEEESTD.1993.120366
URL: http://ieeexplore.ieee.org.proxied.lib.ncsu.edu/stamp/stamp.jsp?tp=&arnumber=347683&isnumber=8049

CSC/ECE 506 Spring 2011/ch11 BB EP

2011-04-18T19:20:37Z

Beburrou: /* Possible Race Conditions */

__TOC__

== History ==

== Background ==
The Scalable Coherent Interface (SCI) cache coherence protocol defines a set of states for memory, a set of states for cache lines, and a set of actions that transition between these states when processors access memory.

Three sets of these attributes are defined for minimal, typical, and full applications:

* '''Minimal''' - for ‘trivial but correct’ applications that require the presence of the memory in only one cache line. It does not enable read-sharing, and is appropriate for small multi-processors or applications that don’t require significant sharing.

* '''Typical''' - enables read-sharing of a memory location and provisions for efficiency, such as DMA transfers, local caching of data, and error recovery. This set adds an additional stable memory state (FRESH) and multiple cache states. This set will be the focus of this article going forward.

* '''Full''' - adds support for pair-wise sharing, QOLB lock bit synchronization, and cleansing and washing of cache lines.

== State Diagrams ==
=== Memory States ===
The memory states define the state of the memory block from the perspective of the home directory. This 2-bit field is maintained in the memory tag by the home directory along with the pointer to the head of the sharing list (forwId). This simple state model includes three stable states and one semi-stable state.

{|border="1" cellspacing="0" cellpadding="2"
!colspan="5"| Table 1: Memory States
|----
!State
!Description
!Minimal
!Typical
!Full
|----
|HOME
|no sharing list
|Y
|Y
|Y
|----
|FRESH
|sharing-list copy is the same as memory
|
|Y
|Y
|----
|GONE
|sharing-list copy may be different from memory
|Y
|Y
|Y
|----
|WASH*
|transitional state (GONE to FRESH)
|
|
|Y
|----
|}

=== Cache States ===
The cache-line states are maintained by each processor's cache-coherency controller. This 7-bit field is stored in each cache line in the sharing list, along with the pointer to the next sharing-list node (forwId) and the previous sharing-list node (backId). Seven bits allow up to 128 possible values, and SCI defines twenty-nine stable states for use in the minimal, typical, and full sets.

Stable states are those cache states that exist when no memory transaction is in progress. Their names are derived from a combination of:
* the position of the node in the sharing list, such as ONLY, HEAD, MID, and TAIL
* the state of the data for that node, such as FRESH, CLEAN, DIRTY, VALID, STALE, etc.

[[Image:cache_states.png|center]]

== Race Conditions ==
In a distributed shared memory system with caching, the emergence of race conditions is extremely likely. This is mainly due to the lack of a bus for serialization of actions. It is further compounded by the problem of network errors and congestion.

The early invalidation case from Section 11.4 in Solihin is an excellent example of a race condition that can arise in a distributed system. Recall the diagram from the text, and the cache coherence actions, as shown below.

[[Image:EarlyInValidationRace.png|400px|center]]
The circled actions are as follows:
# ''A'' sends a read request to ''Home''.
# ''Home'' replies with data (but the message gets delayed).
# ''B'' sends a write request to ''Home''.
# ''Home'' sends an invalidation to ''A'', and it arrives before the delayed ReplyD from step 2.

The SCI protocol has a way to handle this race condition, as well as many others, and the following sections will discuss how the SCI protocol design can prevent this race condition.

=== Prevention in the SCI Protocol ===
Race conditions are almost non-existent in the SCI protocol, due primarily to the protocol's design. A brief discussion of how this is accomplished follows, followed by a discussion showing what happens if this condition arises in the SCI protocol.
==== Atomic Transactions ====
SCI's primary method for preventing race conditions is having atomic transactions. A transaction is defined as a set of sub-actions necessary to complete some requested action, such as reading from memory or writing to a variable. Suppose, for example, that node ''A'' is in the middle of a transaction with node ''B''. Node ''C'' then tries to make a request of node ''A''. Node ''A'' will respond to node ''C'' that it is busy, telling node ''C'' to try again, as shown in the following diagram.
[[Image:AtomicBusy.png|center]]
Thus, in the Early Invalidation race, node ''A'' would be in the middle of a transaction, which would prevent node ''B'' from invalidating it. Note, however, that nodes will not stay in the busy state indefinitely. Doing so would lead to potential deadlocks, so all transactions have time outs that will eventually cause the transaction to fail, thus moving the node out of the busy state, making it able to respond to requests again.

==== Head Node ====
Another method for preventing race conditions is to have the ''Head'' node of a sharing list perform many of the coherence actions. As a result, only one node is performing actions such as writes and invalidations of other sharers. Since only one node is performing these actions, the possibility of concurrent actions is decreased. If another node wants to write, for instance, it must become the Head Node of the sharing list for the cache line to which it wants to write, displacing the current ''Head'' node as the node that can write. All state changes in the sharing list occur at the behest of the ''Head'' node, with the exception of nodes deleting themselves from the list.

When the nodes are simply sharing a read-only copy of a line, as when the memory is in the ''FRESH'' state, the ''Head'' node is somewhat irrelevant. All the sharing nodes have their own cached copy of the line. If any node wants to write, including the ''Head'' node, it must perform an additional action in order to do so. Likewise, when a sharing list shares a line in the dirty, or ''GONE'', memory state, all the sharing nodes stay in the list with their cached line as long as the ''Head'' node has not written to it. Once the ''Head'' node wants to write, it can write its data immediately, but it must then invalidate all other shared copies via the forward pointers in the sharing list.

This, however, does not protect against a race condition where the memory is ''FRESH'', the ''Head'' node wants to write, and another node also wants to join the list. The Memory Access mechanism prevents this condition.

==== Memory Access ====
The final way that the SCI protocol minimizes race conditions is by changing the directory structure. In Solihin 11.4, there is a ''Home'' node that keeps track of the cache state and therefore has a certain amount of knowledge about the system. In the SCI protocol, there is still a ''Home'' node of sorts, but this node is only responsible for keeping track of which node is the current ''Head''. Such a ''Home'' node is usually the node where the memory block physically resides. When a node wants to access a block of memory, it sends a request to the ''Home'' node of that memory block, assuming it isn't already in the sharing list for that block. If the data is already in use, the ''Home'' node includes in its response the address of the ''Head'' node of the sharing list. This requirement necessarily serializes the order of access: requests are processed in FIFO order, and one access request is not serviced at the ''Home'' node until the previous request is finished. In addition, whichever node is the ''Head'' of the sharing list is notified by the new ''Head'' when it is demoted. All parties involved in a potential race are thus talking to each other, which prevents the race from occurring in the first place.

Because of this requirement, the projected race condition from the previous section, where the ''Head'' node is in a ''FRESH'' state and wants to write while another node also wants to read, cannot occur. Both nodes must make their requests to the ''Home'' node: the ''Head'' node must request the ability to write, and the other node must request the data. One of these requests reaches the ''Home'' node first, and the second requester is then referred to the node whose request arrived first. Such a scenario is pictured below.
[[Image:MemoryAccess.png|center]]

==== Solihin 11.4 Resolution ====
The three previous sections discussed individual parts of how the SCI protocol reduces race conditions. Putting all three of these parts together yields the following scenario which resolves the Early Invalidation race as described in the text.
[[Image:SolihinSCI.png|center]]

# ''A'' sends a request to ''Home'' for access to the memory block. It then goes into a busy state while it waits for a response.
# ''B'' also sends a request to ''Home'' for access to the same memory block. ''A'''s request is received first.
# ''Home'' responds to ''A'' with the data, but this response gets caught in network traffic and delayed. This response is sent before ''Home'' processes the request from ''B''.
# ''Home'' responds to the request from ''B'', telling it that ''A'' is the current ''Head'' node.
# ''B'' then sends a request to ''A'' to tell it to demote itself.
# ''A'' still hasn't received the response from ''Home'', so it is still in a transaction, and it tells ''B'' that it is busy. ''B'' will have to retry the request.

It should be noted that ''B'' is also in the busy state, so any requests made of it will also be rejected. This could lead to a chaining of nodes in the busy state, but as soon as node ''A'' receives its response from ''Home'' or times out, the requests in the chain will be successively accepted and processed.

=== Other Possible Race Conditions ===
==== Concurrent List Deletions ====

==== Simultaneous Deletion and Invalidation ====

== Summary ==

== Definitions and Terms ==

== References ==
[1] "IEEE Standard for Scalable Coherent Interface (SCI)," IEEE Std 1596-1992, 1993. doi: 10.1109/IEEESTD.1993.120366
URL: http://ieeexplore.ieee.org.proxied.lib.ncsu.edu/stamp/stamp.jsp?tp=&arnumber=347683&isnumber=8049

CSC/ECE 506 Spring 2011/ch11 BB EP

2011-04-18T19:17:35Z

Beburrou: /* Atomic Transactions */

__TOC__

== History ==

== Background ==
The Scalable Coherent Interface (SCI) cache coherence protocol defines a set of states for memory, a set of states for cache lines, and a set of actions that transition between these states when processors access memory.

Three sets of these attributes are defined for minimal, typical, and full applications:

* '''Minimal''' - for ‘trivial but correct’ applications that require the presence of the memory in only one cache line. It does not enable read-sharing, and is appropriate for small multi-processors or applications that don’t require significant sharing.

* '''Typical''' - enables read-sharing of a memory location and provisions for efficiency, such as DMA transfers, local caching of data, and error recovery. This set adds an additional stable memory state (FRESH) and multiple cache states. This set will be the focus of this article going forward.

* '''Full''' - adds support for pair-wise sharing, QOLB lock bit synchronization, and cleansing and washing of cache lines.

== State Diagrams ==
=== Memory States ===
The memory states define the state of the memory block from the perspective of the home directory. This 2-bit field is maintained in the memory tag by the home directory along with the pointer to the head of the sharing list (forwId). This simple state model includes three stable states and one semi-stable state.

{|border="1" cellspacing="0" cellpadding="2"
!colspan="5"| Table 1: Memory States
|----
!State
!Description
!Minimal
!Typical
!Full
|----
|HOME
|no sharing list
|Y
|Y
|Y
|----
|FRESH
|sharing-list copy is the same as memory
|
|Y
|Y
|----
|GONE
|sharing-list copy may be different from memory
|Y
|Y
|Y
|----
|WASH*
|transitional state (GONE to FRESH)
|
|
|Y
|----
|}

=== Cache States ===
The cache-line states are maintained by each processor's cache-coherency controller. This 7-bit field is stored in each cache line in the sharing list, along with the pointer to the next sharing-list node (forwId) and the previous sharing-list node (backId). Seven bits allow up to 128 possible values, and SCI defines twenty-nine stable states for use in the minimal, typical, and full sets.

Stable states are those cache states that exist when no memory transaction is in progress. Their names are derived from a combination of:
* the position of the node in the sharing list, such as ONLY, HEAD, MID, and TAIL
* the state of the data for that node, such as FRESH, CLEAN, DIRTY, VALID, STALE, etc.

[[Image:cache_states.png|center]]

== Race Conditions ==
In a distributed shared memory system with caching, the emergence of race conditions is extremely likely. This is mainly due to the lack of a bus for serialization of actions. It is further compounded by the problem of network errors and congestion.

The early invalidation case from Section 11.4 in Solihin is an excellent example of a race condition that can arise in a distributed system. Recall the diagram from the text, and the cache coherence actions, as shown below.

[[Image:EarlyInValidationRace.png|400px|center]]
The circled actions are as follows:
# ''A'' sends a read request to ''Home''.
# ''Home'' replies with data (but the message gets delayed).
# ''B'' sends a write request to ''Home''.
# ''Home'' sends an invalidation to ''A'', and it arrives before the delayed ReplyD from step 2.

The SCI protocol has a way to handle this race condition, as well as many others, and the following sections will discuss how the SCI protocol design can prevent this race condition.

=== Prevention in the SCI Protocol ===
Race conditions are almost non-existent in the SCI protocol, due primarily to the protocol's design. A brief discussion of how this is accomplished follows, followed by a discussion showing what happens if this condition arises in the SCI protocol.
==== Atomic Transactions ====
SCI's primary method for preventing race conditions is having atomic transactions. A transaction is defined as a set of sub-actions necessary to complete some requested action, such as reading from memory or writing to a variable. Suppose, for example, that node ''A'' is in the middle of a transaction with node ''B''. Node ''C'' then tries to make a request of node ''A''. Node ''A'' will respond to node ''C'' that it is busy, telling node ''C'' to try again, as shown in the following diagram.
[[Image:AtomicBusy.png|center]]
Thus, in the Early Invalidation race, node ''A'' would be in the middle of a transaction, which would prevent node ''B'' from invalidating it. Note, however, that nodes do not stay in the busy state indefinitely. Doing so could lead to deadlock, so every transaction has a timeout that eventually causes it to fail, moving the node out of the busy state so that it can respond to requests again.
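
The busy-and-retry behavior described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the <code>Node</code> class, its method names, and the integer "tick" clock are hypothetical and are not taken from the SCI standard.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical node that refuses requests while a transaction is open."""
    name: str
    busy_until: Optional[int] = None  # tick at which the open transaction times out

    def begin_transaction(self, now: int, timeout: int) -> None:
        # Entering a transaction puts the node in the busy state.
        self.busy_until = now + timeout

    def complete_transaction(self) -> None:
        # A normal completion leaves the busy state immediately.
        self.busy_until = None

    def handle_request(self, now: int) -> str:
        # A transaction that has outlived its timeout is abandoned.
        if self.busy_until is not None and now >= self.busy_until:
            self.busy_until = None
        # While busy, the requester is told to try again later.
        return "BUSY" if self.busy_until is not None else "ACCEPT"


a = Node("A")
a.begin_transaction(now=0, timeout=5)
replies = [a.handle_request(now=t) for t in (1, 4, 5)]
print(replies)
```

In this sketch the requests at ticks 1 and 4 are refused, and the request at tick 5 is accepted because the transaction has timed out by then.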

==== Head Node ====
Another method for preventing race conditions is to have the ''Head'' node of a sharing list perform many of the coherence actions. As a result, only one node performs actions such as writes and invalidations of other sharers, which decreases the possibility of concurrent actions. If another node wants to write, for instance, it must become the ''Head'' node of the sharing list for the cache line to which it wants to write, displacing the current ''Head'' node as the node that can write. All state changes in the sharing list occur at the behest of the ''Head'' node, with the exception of nodes deleting themselves from the list.

When the nodes are simply sharing a read-only copy of a line, as when the memory is in the ''FRESH'' state, the ''Head'' node is somewhat irrelevant. All the sharing nodes have their own cached copy of the line. If any node wants to write, including the ''Head'' node, it must perform an additional action in order to do so. Likewise, when a sharing list is sharing a line in the dirty, or ''GONE'', memory state, all the sharing nodes stay in the list with their cached line as long as the ''Head'' node has not written to the line. Once the ''Head'' node wants to write, however, it can write its data immediately, but it must then invalidate all other shared copies via the forward pointers in the sharing list.

This, however, does not protect against a race condition where the memory is ''FRESH'', the ''Head'' node wants to write, and another node also wants to join the list. The Memory Access mechanism prevents this condition.
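
The head-initiated invalidation described above, in which the ''Head'' walks the forward pointers of the sharing list, can be sketched as follows. The <code>CacheLine</code> class and <code>purge_from_head</code> function are hypothetical names chosen for illustration, and the state labels are simplified. The real protocol exchanges request and response messages at each step, which this sketch omits.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class CacheLine:
    """Hypothetical per-node cache line entry in a sharing list."""
    state: str                     # simplified label such as "HEAD_DIRTY" or "INVALID"
    forw_id: Optional[str] = None  # next node in the sharing list
    back_id: Optional[str] = None  # previous node in the sharing list


def purge_from_head(lines: Dict[str, CacheLine], head: str) -> List[str]:
    """Invalidate every sharer after the head by following forward pointers."""
    invalidated: List[str] = []
    nxt = lines[head].forw_id
    while nxt is not None:
        victim = lines[nxt]
        following = victim.forw_id  # remember the successor before unlinking
        victim.state, victim.forw_id, victim.back_id = "INVALID", None, None
        invalidated.append(nxt)
        nxt = following
    # The head is now the only node left in the list.
    lines[head].forw_id = None
    lines[head].state = "ONLY_DIRTY"
    return invalidated


lines = {
    "A": CacheLine("HEAD_DIRTY", forw_id="B"),
    "B": CacheLine("MID_VALID", forw_id="C", back_id="A"),
    "C": CacheLine("TAIL_VALID", back_id="B"),
}
order = purge_from_head(lines, "A")
print(order, lines["A"].state)
```

Because only the head walks the list, the other sharers are invalidated one at a time in list order rather than concurrently.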

==== Memory Access ====
The final way that the SCI protocol minimizes race conditions is by changing the directory structure. In Solihin 11.4, there is a ''Home'' node which keeps track of the cache state and, as a result, has a certain amount of knowledge about the system. In the SCI protocol, there is still a ''Home'' node of sorts, but this node is only responsible for keeping track of which node is the current ''Head''. Such a ''Home'' node is usually the node where the memory block physically resides. When a node wants to access a block of memory, it sends a request to the ''Home'' node of that memory block, assuming it isn't already in the sharing list for that block. If the data is already in use, the ''Home'' node includes the address of the ''Head'' node of the sharing list in its response. This requirement necessarily serializes the order of access: requests are processed in FIFO order, and one access request is not serviced at the ''Home'' node until the previous request is finished. In addition, whichever node is the ''Head'' of the sharing list is notified by the new ''Head'' when it is demoted. All parties involved in a potential race are thus talking to each other, preventing the race from occurring in the first place.

Because of this requirement, the projected race condition from the previous section, where the ''Head'' node is in a ''FRESH'' state and wants to write while another node wants to read, cannot occur. Both nodes must make their requests to the ''Home'' node: the ''Head'' node must request the ability to write, and the other node must request the data. One of these requests reaches the ''Home'' node first, so the second request is deferred to the node whose request arrived first. Such a scenario is pictured below.
[[Image:MemoryAccess.png|center]]
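
The serialization at the ''Home'' node can be illustrated with a small sketch. It assumes a hypothetical <code>HomeDirectory</code> class that queues requests and services them one at a time, and it models read requests only. The class, its fields, and the two-state memory model are simplifications made for illustration, not the standard's definition.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Optional, Tuple


@dataclass
class HomeDirectory:
    """Hypothetical Home node: tracks only the memory state and the current Head."""
    state: str = "HOME"          # "HOME" means no sharing list (simplified)
    head: Optional[str] = None   # pointer to the head of the sharing list (forwId)
    pending: Deque[str] = field(default_factory=deque)

    def submit(self, requester: str) -> None:
        # Requests are queued in arrival order.
        self.pending.append(requester)

    def service_next(self) -> Tuple[str, Optional[str]]:
        # Service exactly one request: the requester becomes the new Head
        # and is told who the previous Head was (None if there was no list).
        requester = self.pending.popleft()
        old_head = self.head
        self.head = requester
        self.state = "FRESH"
        return requester, old_head


home = HomeDirectory()
home.submit("A")
home.submit("B")
first = home.service_next()
second = home.service_next()
print(first, second)
```

The second requester learns that the first requester is the current ''Head'' and must contact it directly, which is how the two parties in a potential race end up talking to each other.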

==== Solihin 11.4 Resolution ====
The three previous sections discussed individual parts of how the SCI protocol reduces race conditions. Putting all three of these parts together yields the following scenario which resolves the Early Invalidation race as described in the text.
[[Image:SolihinSCI.png|center]]

# ''A'' sends a request to ''Home'' for access to the memory block. It then goes into a busy state while it waits for a response.
# ''B'' also sends a request to ''Home'' for access to the same memory block. ''A'''s request is received first.
# ''Home'' responds to ''A'' with the data, but this response gets caught in network traffic and delayed. This response is sent before ''Home'' processes the request from ''B''.
# ''Home'' responds to the request from ''B'', telling it that ''A'' is the current ''Head'' node.
# ''B'' then sends a request to ''A'' to tell it to demote itself.
# ''A'' still hasn't received the response from ''Home'', so it is still in a transaction, and it tells ''B'' that it is busy. ''B'' will have to retry the request.

It should be noted that ''B'' is also in the busy state, so any requests made of it will also be rejected. This could lead to a chaining of nodes in the busy state, but as soon as node ''A'' receives its response from ''Home'' or times out, the requests in the chain will be successively accepted and processed.
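
The six steps above can be traced with a short simulation. This is a sketch under simplifying assumptions: messages are modeled as direct function calls, and the names <code>request_home</code>, <code>request_demote</code>, <code>directory</code>, and <code>busy</code> are hypothetical.

```python
from typing import Dict, List, Optional

directory: Dict[str, Optional[str]] = {"head": None}  # Home's pointer to the current Head
busy: Dict[str, bool] = {"A": False, "B": False}
log: List[str] = []


def request_home(node: str) -> Optional[str]:
    """Home makes `node` the new Head and returns the previous Head, if any."""
    old_head = directory["head"]
    directory["head"] = node
    return old_head


def request_demote(target: str, sender: str) -> bool:
    """`sender` asks `target` to give up the Head position; refused while busy."""
    if busy[target]:
        log.append(f"{target} -> {sender}: busy, retry")
        return False
    log.append(f"{target} -> {sender}: demoted")
    return True


busy["A"] = True                             # step 1: A sends its request and waits
busy["B"] = True                             # step 2: B sends its request and waits
old_for_a = request_home("A")                # step 3: Home answers A (reply delayed)
old_for_b = request_home("B")                # step 4: Home tells B that A is the Head
first_try = request_demote(old_for_b, "B")   # steps 5-6: A is still in its transaction
busy["A"] = False                            # the delayed reply from Home reaches A
second_try = request_demote(old_for_b, "B")  # B's retry is now accepted
print(log)
```

The first demotion attempt is refused because ''A'' is still waiting for its reply, so the invalidation can never overtake the data. The retry succeeds once that reply has arrived.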

=== Possible Race Conditions ===
==== Concurrent List Deletions ====

==== Simultaneous Deletion and Invalidation ====

== Summary ==

== Definitions and Terms ==

== References ==
[1] "IEEE Standard for Scalable Coherent Interface (SCI)," IEEE Std 1596-1992, 1993. doi: 10.1109/IEEESTD.1993.120366
URL: http://ieeexplore.ieee.org.proxied.lib.ncsu.edu/stamp/stamp.jsp?tp=&arnumber=347683&isnumber=8049

CSC/ECE 506 Spring 2011/ch11 BB EP

2011-04-18T19:15:42Z

Beburrou: /* Head Node */

__TOC__

== History ==

== Background ==
The Scalable Coherent Interface (SCI) cache coherence protocol defines a set of states for memory, a set of states for cache lines, and a set of actions that transition between these states when processors access memory.

Three sets of these attributes are defined for minimal, typical, and full applications:

* '''Minimal''' - for ‘trivial but correct’ applications that require the presence of the memory in only one cache line. It does not enable read-sharing, and is appropriate for small multi-processors or applications that don’t require significant sharing.

* '''Typical''' - enables read-sharing of a memory location and provisions for efficiency, such as DMA transfers, local caching of data, and error recovery. This set adds an additional stable memory state (FRESH) and multiple cache states. This set will be the focus of this article going forward.

* '''Full''' - adds support for pair-wise sharing, QOLB lock bit synchronization, and cleansing and washing of cache lines.

== State Diagrams ==
=== Memory States ===
The memory states define the state of the memory block from the perspective of the home directory. This 2-bit field is maintained in the memory tag by the home directory along with the pointer to the head of the sharing list (forwId). This simple state model includes three stable states and one semi-stable state.

{|border="1" cellspacing="0" cellpadding="2"
!colspan="5"| Table 1: Memory States
|----
!State
!Description
!Minimal
!Typical
!Full
|----
|HOME
|no sharing list
|Y
|Y
|Y
|----
|FRESH
|sharing-list copy is the same as memory
|
|Y
|Y
|----
|GONE
|sharing-list copy may be different from memory
|Y
|Y
|Y
|----
|WASH*
|transitional state (GONE to FRESH)
|
|
|Y
|----
|}

=== Cache States ===
The cache-line states are maintained by each processor's cache-coherency controller. This 7-bit field is stored in each cache line in the sharing list, along with the pointer to the next sharing-list node (forwId) and the previous sharing-list node (backId). Seven bits enable up to 128 possible values, and SCI defines twenty-nine stable states for use in the minimal, typical, and full sets.

Stable states are those cache states that exist when a memory transaction is not in progress. Their names are derived from a combination of:
* the position of the node in the sharing list, such as ONLY, HEAD, MID, and TAIL
* the state of the data for that node, such as FRESH, CLEAN, DIRTY, VALID, and STALE

[[Image:cache_states.png|center]]

== Race Conditions ==
In a distributed shared memory system with caching, the emergence of race conditions is extremely likely. This is mainly due to the lack of a bus for serialization of actions. It is further compounded by the problem of network errors and congestion.

The early invalidation case from Section 11.4 in Solihin is an excellent example of a race condition that can arise in a distributed system. Recall the diagram from the text, and the cache coherence actions, as shown below.

[[Image:EarlyInValidationRace.png|400px|center]]
The circled actions are as follows:
# ''A'' sends a read request to ''Home''.
# ''Home'' replies with data (but the message gets delayed).
# ''B'' sends a write request to ''Home''.
# ''Home'' sends an invalidation to ''A'', and it arrives before the delayed data reply (ReplyD) from step 2.

The SCI protocol's design handles this race condition, as well as many others. The following sections discuss how.

=== Prevention in the SCI Protocol ===
Race conditions are almost non-existent in the SCI protocol, due primarily to the protocol's design. The following sections briefly describe how this is accomplished and then show what happens if the early invalidation condition arises in the SCI protocol.
==== Atomic Transactions ====
SCI's primary method for preventing race conditions is having atomic transactions. A transaction is defined as a set of sub-actions necessary to complete some requested action, such as reading from memory or writing to a variable. Suppose, for example, that node ''A'' is in the middle of a transaction with node ''B''. Node ''C'' then tries to make a request of node ''A''. Node ''A'' will respond to node ''C'' that it is busy, telling node ''C'' to try again, as shown in the following diagram.
[[Image:AtomicBusy.png|center]]
Thus, in the Early Invalidation race, node ''A'' would be in the middle of a transaction, which would prevent node ''B'' from invalidating it.

==== Head Node ====
Another method for preventing race conditions is to have the ''Head'' node of a sharing list perform many of the coherence actions. As a result, only one node performs actions such as writes and invalidations of other sharers, which decreases the possibility of concurrent actions. If another node wants to write, for instance, it must become the ''Head'' node of the sharing list for the cache line to which it wants to write, displacing the current ''Head'' node as the node that can write. All state changes in the sharing list occur at the behest of the ''Head'' node, with the exception of nodes deleting themselves from the list.

When the nodes are simply sharing a read-only copy of a line, as when the memory is in the ''FRESH'' state, the ''Head'' node is somewhat irrelevant. All the sharing nodes have their own cached copy of the line. If any node wants to write, including the ''Head'' node, it must perform an additional action in order to do so. Likewise, when a sharing list is sharing a line in the dirty, or ''GONE'', memory state, all the sharing nodes stay in the list with their cached line as long as the ''Head'' node has not written to the line. Once the ''Head'' node wants to write, however, it can write its data immediately, but it must then invalidate all other shared copies via the forward pointers in the sharing list.

This, however, does not protect against a race condition where the memory is ''FRESH'', the ''Head'' node wants to write, and another node also wants to join the list. The Memory Access mechanism prevents this condition.

==== Memory Access ====
The final way that the SCI protocol minimizes race conditions is by changing the directory structure. In Solihin 11.4, there is a ''Home'' node which keeps track of the cache state and, as a result, has a certain amount of knowledge about the system. In the SCI protocol, there is still a ''Home'' node of sorts, but this node is only responsible for keeping track of which node is the current ''Head''. Such a ''Home'' node is usually the node where the memory block physically resides. When a node wants to access a block of memory, it sends a request to the ''Home'' node of that memory block, assuming it isn't already in the sharing list for that block. If the data is already in use, the ''Home'' node includes the address of the ''Head'' node of the sharing list in its response. This requirement necessarily serializes the order of access: requests are processed in FIFO order, and one access request is not serviced at the ''Home'' node until the previous request is finished. In addition, whichever node is the ''Head'' of the sharing list is notified by the new ''Head'' when it is demoted. All parties involved in a potential race are thus talking to each other, preventing the race from occurring in the first place.

Because of this requirement, the projected race condition from the previous section, where the ''Head'' node is in a ''FRESH'' state and wants to write while another node wants to read, cannot occur. Both nodes must make their requests to the ''Home'' node: the ''Head'' node must request the ability to write, and the other node must request the data. One of these requests reaches the ''Home'' node first, so the second request is deferred to the node whose request arrived first. Such a scenario is pictured below.
[[Image:MemoryAccess.png|center]]

==== Solihin 11.4 Resolution ====
The three previous sections discussed individual parts of how the SCI protocol reduces race conditions. Putting all three of these parts together yields the following scenario which resolves the Early Invalidation race as described in the text.
[[Image:SolihinSCI.png|center]]

# ''A'' sends a request to ''Home'' for access to the memory block. It then goes into a busy state while it waits for a response.
# ''B'' also sends a request to ''Home'' for access to the same memory block. ''A'''s request is received first.
# ''Home'' responds to ''A'' with the data, but this response gets caught in network traffic and delayed. This response is sent before ''Home'' processes the request from ''B''.
# ''Home'' responds to the request from ''B'', telling it that ''A'' is the current ''Head'' node.
# ''B'' then sends a request to ''A'' to tell it to demote itself.
# ''A'' still hasn't received the response from ''Home'', so it is still in a transaction, and it tells ''B'' that it is busy. ''B'' will have to retry the request.

It should be noted that ''B'' is also in the busy state, so any requests made of it will also be rejected. This could lead to a chaining of nodes in the busy state, but as soon as node ''A'' receives its response from ''Home'' or times out, the requests in the chain will be successively accepted and processed.

=== Possible Race Conditions ===
==== Concurrent List Deletions ====

==== Simultaneous Deletion and Invalidation ====

== Summary ==

== Definitions and Terms ==

== References ==
[1] "IEEE Standard for Scalable Coherent Interface (SCI)," IEEE Std 1596-1992, 1993. doi: 10.1109/IEEESTD.1993.120366
URL: http://ieeexplore.ieee.org.proxied.lib.ncsu.edu/stamp/stamp.jsp?tp=&arnumber=347683&isnumber=8049

CSC/ECE 506 Spring 2011/ch11 BB EP

2011-04-18T19:11:29Z

Beburrou: /* Solihin 11.4 Resolution */

__TOC__

== History ==

== Background ==
The Scalable Coherent Interface (SCI) cache coherence protocol defines a set of states for memory, a set of states for cache lines, and a set of actions that transition between these states when processors access memory.

Three sets of these attributes are defined for minimal, typical, and full applications:

* '''Minimal''' - for ‘trivial but correct’ applications that require the presence of the memory in only one cache line. It does not enable read-sharing, and is appropriate for small multi-processors or applications that don’t require significant sharing.

* '''Typical''' - enables read-sharing of a memory location and provisions for efficiency, such as DMA transfers, local caching of data, and error recovery. This set adds an additional stable memory state (FRESH) and multiple cache states. This set will be the focus of this article going forward.

* '''Full''' - adds support for pair-wise sharing, QOLB lock bit synchronization, and cleansing and washing of cache lines.

== State Diagrams ==
=== Memory States ===
The memory states define the state of the memory block from the perspective of the home directory. This 2-bit field is maintained in the memory tag by the home directory along with the pointer to the head of the sharing list (forwId). This simple state model includes three stable states and one semi-stable state.

{|border="1" cellspacing="0" cellpadding="2"
!colspan="5"| Table 1: Memory States
|----
!State
!Description
!Minimal
!Typical
!Full
|----
|HOME
|no sharing list
|Y
|Y
|Y
|----
|FRESH
|sharing-list copy is the same as memory
|
|Y
|Y
|----
|GONE
|sharing-list copy may be different from memory
|Y
|Y
|Y
|----
|WASH*
|transitional state (GONE to FRESH)
|
|
|Y
|----
|}

=== Cache States ===
The cache-line states are maintained by each processor's cache-coherency controller. This 7-bit field is stored in each cache line in the sharing list, along with the pointer to the next sharing-list node (forwId) and the previous sharing-list node (backId). Seven bits enable up to 128 possible values, and SCI defines twenty-nine stable states for use in the minimal, typical, and full sets.

Stable states are those cache states that exist when a memory transaction is not in progress. Their names are derived from a combination of:
* the position of the node in the sharing list, such as ONLY, HEAD, MID, and TAIL
* the state of the data for that node, such as FRESH, CLEAN, DIRTY, VALID, and STALE

[[Image:cache_states.png|center]]

== Race Conditions ==
In a distributed shared memory system with caching, the emergence of race conditions is extremely likely. This is mainly due to the lack of a bus for serialization of actions. It is further compounded by the problem of network errors and congestion.

The early invalidation case from Section 11.4 in Solihin is an excellent example of a race condition that can arise in a distributed system. Recall the diagram from the text, and the cache coherence actions, as shown below.

[[Image:EarlyInValidationRace.png|400px|center]]
The circled actions are as follows:
# ''A'' sends a read request to ''Home''.
# ''Home'' replies with data (but the message gets delayed).
# ''B'' sends a write request to ''Home''.
# ''Home'' sends an invalidation to ''A'', and it arrives before the delayed data reply (ReplyD) from step 2.

The SCI protocol's design handles this race condition, as well as many others. The following sections discuss how.

=== Prevention in the SCI Protocol ===
Race conditions are almost non-existent in the SCI protocol, due primarily to the protocol's design. The following sections briefly describe how this is accomplished and then show what happens if the early invalidation condition arises in the SCI protocol.
==== Atomic Transactions ====
SCI's primary method for preventing race conditions is having atomic transactions. A transaction is defined as a set of sub-actions necessary to complete some requested action, such as reading from memory or writing to a variable. Suppose, for example, that node ''A'' is in the middle of a transaction with node ''B''. Node ''C'' then tries to make a request of node ''A''. Node ''A'' will respond to node ''C'' that it is busy, telling node ''C'' to try again, as shown in the following diagram.
[[Image:AtomicBusy.png|center]]
Thus, in the Early Invalidation race, node ''A'' would be in the middle of a transaction, which would prevent node ''B'' from invalidating it.

==== Head Node ====
Another method for preventing race conditions is to have the Head Node of a sharing list perform many of the coherence actions. As a result, only one node is performing actions such as writes and invalidations of other sharers. Since only one node is performing these actions, the possibility of concurrent actions is decreased. If another node wants to write, for instance, it must become the Head Node of the sharing list for the cache line to which it wants to write, displacing the current Head Node as the node that can write.

When the nodes are simply sharing a read-only copy of a line, as when the memory is in the ''FRESH'' state, the Head Node is somewhat irrelevant. All the sharing nodes have their own cached copy of the line. If any node wants to write, including the Head Node, it must perform an additional action in order to do so. Likewise, when a sharing list is sharing a line in the dirty, or ''GONE'', memory state, all the sharing nodes stay in the list with their cached line as long as the Head Node has not written to the line. Once the Head Node wants to write, however, it can write its data immediately, but it must then invalidate all other shared copies via the forward pointers in the sharing list.

This, however, does not protect against a race condition where the memory is ''FRESH'', the Head Node wants to write, and another node also wants to join the list. The Memory Access mechanism prevents this condition.

==== Memory Access ====
The final way that the SCI protocol minimizes race conditions is by changing the directory structure. In Solihin 11.4, there is a ''Home'' node which keeps track of the cache state and, as a result, has a certain amount of knowledge about the system. In the SCI protocol, there is still a ''Home'' node of sorts, but this node is only responsible for keeping track of which node is the current ''Head''. Such a ''Home'' node is usually the node where the memory block physically resides. When a node wants to access a block of memory, it sends a request to the ''Home'' node of that memory block, assuming it isn't already in the sharing list for that block. If the data is already in use, the ''Home'' node includes the address of the ''Head'' node of the sharing list in its response. This requirement necessarily serializes the order of access: requests are processed in FIFO order, and one access request is not serviced at the ''Home'' node until the previous request is finished. In addition, whichever node is the ''Head'' of the sharing list is notified by the new ''Head'' when it is demoted. All parties involved in a potential race are thus talking to each other, preventing the race from occurring in the first place.

Because of this requirement, the projected race condition from the previous section, where the ''Head'' node is in a ''FRESH'' state and wants to write while another node wants to read, cannot occur. Both nodes must make their requests to the ''Home'' node: the ''Head'' node must request the ability to write, and the other node must request the data. One of these requests reaches the ''Home'' node first, so the second request is deferred to the node whose request arrived first. Such a scenario is pictured below.
[[Image:MemoryAccess.png|center]]

==== Solihin 11.4 Resolution ====
The three previous sections discussed individual parts of how the SCI protocol reduces race conditions. Putting all three of these parts together yields the following scenario which resolves the Early Invalidation race as described in the text.
[[Image:SolihinSCI.png|center]]

# ''A'' sends a request to ''Home'' for access to the memory block. It then goes into a busy state while it waits for a response.
# ''B'' also sends a request to ''Home'' for access to the same memory block. ''A'''s request is received first.
# ''Home'' responds to ''A'' with the data, but this response gets caught in network traffic and delayed. This response is sent before ''Home'' processes the request from ''B''.
# ''Home'' responds to the request from ''B'', telling it that ''A'' is the current ''Head'' node.
# ''B'' then sends a request to ''A'' to tell it to demote itself.
# ''A'' still hasn't received the response from ''Home'', so it is still in a transaction, and it tells ''B'' that it is busy. ''B'' will have to retry the request.

It should be noted that ''B'' is also in the busy state, so any requests made of it will also be rejected. This could lead to a chaining of nodes in the busy state, but as soon as node ''A'' receives its response from ''Home'' or times out, the requests in the chain will be successively accepted and processed.

=== Possible Race Conditions ===
==== Concurrent List Deletions ====

==== Simultaneous Deletion and Invalidation ====

== Summary ==

== Definitions and Terms ==

== References ==
[1] "IEEE Standard for Scalable Coherent Interface (SCI)," IEEE Std 1596-1992, 1993. doi: 10.1109/IEEESTD.1993.120366
URL: http://ieeexplore.ieee.org.proxied.lib.ncsu.edu/stamp/stamp.jsp?tp=&arnumber=347683&isnumber=8049

CSC/ECE 506 Spring 2011/ch11 BB EP

2011-04-18T19:08:01Z

Beburrou: /* Possible Race Conditions */

__TOC__

== History ==

== Background ==
The Scalable Coherent Interface (SCI) cache coherence protocol defines a set of states for memory, a set of states for cache lines, and a set of actions that transition between these states when processors access memory.

Three sets of these attributes are defined for minimal, typical, and full applications:

* '''Minimal''' - for ‘trivial but correct’ applications that require the presence of the memory in only one cache line. It does not enable read-sharing, and is appropriate for small multi-processors or applications that don’t require significant sharing.

* '''Typical''' - enables read-sharing of a memory location and provisions for efficiency, such as DMA transfers, local caching of data, and error recovery. This set adds an additional stable memory state (FRESH) and multiple cache states. This set will be the focus of this article going forward.

* '''Full''' - adds support for pair-wise sharing, QOLB lock bit synchronization, and cleansing and washing of cache lines.

== State Diagrams ==
=== Memory States ===
The memory states define the state of the memory block from the perspective of the home directory. This 2-bit field is maintained in the memory tag by the home directory along with the pointer to the head of the sharing list (forwId). This simple state model includes three stable states and one semi-stable state.

{|border="1" cellspacing="0" cellpadding="2"
!colspan="5"| Table 1: Memory States
|----
!State
!Description
!Minimal
!Typical
!Full
|----
|HOME
|no sharing list
|Y
|Y
|Y
|----
|FRESH
|sharing-list copy is the same as memory
|
|Y
|Y
|----
|GONE
|sharing-list copy may be different from memory
|Y
|Y
|Y
|----
|WASH*
|transitional state (GONE to FRESH)
|
|
|Y
|----
|}

=== Cache States ===
The cache-line states are maintained by each processor's cache-coherency controller. This 7-bit field is stored in each cache line in the sharing list, along with the pointer to the next sharing-list node (forwId) and the previous sharing-list node (backId). Seven bits enable up to 128 possible values, and SCI defines twenty-nine stable states for use in the minimal, typical, and full sets.

Stable states are those cache states that exist when a memory transaction is not in progress. Their names are derived from a combination of:
* the position of the node in the sharing list, such as ONLY, HEAD, MID, and TAIL
* the state of the data for that node, such as FRESH, CLEAN, DIRTY, VALID, and STALE

[[Image:cache_states.png|center]]

== Race Conditions ==
In a distributed shared memory system with caching, the emergence of race conditions is extremely likely. This is mainly due to the lack of a bus for serialization of actions. It is further compounded by the problem of network errors and congestion.

The early invalidation case from Section 11.4 in Solihin is an excellent example of a race condition that can arise in a distributed system. Recall the diagram from the text, and the cache coherence actions, as shown below.

[[Image:EarlyInValidationRace.png|400px|center]]
The circled actions are as follows:
# ''A'' sends a read request to ''Home''.
# ''Home'' replies with data (but the message gets delayed).
# ''B'' sends a write request to ''Home''.
# ''Home'' sends an invalidation to ''A'', and it arrives before the delayed data reply (ReplyD) from step 2.

The SCI protocol's design handles this race condition, as well as many others. The following sections discuss how.

=== Prevention in the SCI Protocol ===
Race conditions are almost non-existent in the SCI protocol, due primarily to the protocol's design. The following sections briefly describe how this is accomplished and then show what happens if the early invalidation condition arises in the SCI protocol.
==== Atomic Transactions ====
SCI's primary method for preventing race conditions is having atomic transactions. A transaction is defined as a set of sub-actions necessary to complete some requested action, such as reading from memory or writing to a variable. Suppose, for example, that node ''A'' is in the middle of a transaction with node ''B''. Node ''C'' then tries to make a request of node ''A''. Node ''A'' will respond to node ''C'' that it is busy, telling node ''C'' to try again, as shown in the following diagram.
[[Image:AtomicBusy.png|center]]
Thus, in the Early Invalidation race, node ''A'' would be in the middle of a transaction, which would prevent node ''B'' from invalidating it.

==== Head Node ====
Another method for preventing race conditions is to have the Head Node of a sharing list perform many of the coherence actions. As a result, only one node is performing actions such as writes and invalidations of other sharers. Since only one node is performing these actions, the possibility of concurrent actions is decreased. If another node wants to write, for instance, it must become the Head Node of the sharing list for the cache line to which it wants to write, displacing the current Head Node as the node that can write.

When the nodes are simply sharing a read-only copy of a line, as when the memory is in the ''FRESH'' state, the Head Node is somewhat irrelevant. All the sharing nodes have their own cached copy of the line. If any node wants to write, including the Head Node, it must perform an additional action in order to do so. Likewise, when a sharing list is sharing a line in the dirty, or ''GONE'', memory state, all the sharing nodes stay in the list with their cached line as long as the Head Node has not written to the line. Once the Head Node wants to write, however, it can write its data immediately, but it must then invalidate all other shared copies via the forward pointers in the sharing list.

This, however, does not protect against a race condition where the memory is ''FRESH'', the Head Node wants to write, and another node also wants to join the list. The Memory Access mechanism prevents this condition.

==== Memory Access ====
The final way that the SCI protocol minimizes race conditions is by changing the directory structure. In Solihin 11.4, there is a ''Home'' node which keeps track of the cache state and has a certain amount of knowledge about the system as a result. In the SCI protocol, there is still a ''Home'' node of sorts, but this node is only responsible for keeping track of who the current ''Head''. Such a ''Home'' node is usually the node where the memory block physically resides. When a node wants to access a block of memory, it sends a request to the ''Home'' node of that memory block, assuming it isn't already in the sharing list for that block. In the event that the data is already being used, the ''Home'' node will include in its response the address of the ''Head'' node of the sharing list. This requirement necessarily serializes the order of access, as one access request is not serviced at the ''Home'' node until the previous request is finished, and the requests are processed in FIFO order. As well, whichever node is the ''Head'' of the sharing list will be notified by the new ''Head'' when it is demoted. All parties involved in a potential race are thus talking to each other, preventing the race from occurring in the first place.

Because of this requirement, the projected race condition from the previous section, where the ''Head'' node is in a ''FRESH'' state and wants to write while another node also wants to read, cannot occur. Both nodes must make their requests to the ''Home'' node: the ''Head'' node must request the ability to write, and the other node must request the data. One of these requests will reach the ''Home'' node first, and the second request is then deferred to the node whose request arrived first. Such a scenario is pictured below.
[[Image:MemoryAccess.png|center]]
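
The serialization at the ''Home'' node can be sketched as a FIFO queue that hands each requester the identity of the previous ''Head''. This is a simplified model under assumptions: the <code>HomeDirectory</code> class and its method names are hypothetical, and the memory states and the data itself are left out.

```python
from collections import deque


class HomeDirectory:
    """Hypothetical Home node: tracks only the current Head of one memory block."""

    def __init__(self):
        self.head_id = None      # no sharing list yet
        self.pending = deque()   # requests wait here in arrival (FIFO) order

    def submit(self, requester):
        self.pending.append(requester)

    def service_next(self):
        """Service one request; return (requester, previous Head or None)."""
        requester = self.pending.popleft()
        old_head = self.head_id
        self.head_id = requester  # assumption: the requester becomes the new Head
        return requester, old_head


home = HomeDirectory()
home.submit("A")   # A's request arrives first
home.submit("B")   # B's request arrives second
first = home.service_next()
second = home.service_next()
```

In this sketch the second requester always learns who was serviced before it, so it knows which node to contact next.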

==== Solihin 11.4 Resolution ====
The three previous sections discussed the individual ways in which the SCI protocol reduces race conditions. Putting all three together yields the following scenario, which resolves the Early Invalidation race as described in the text.
[[Image:SolihinSCI.png|center]]

# ''A'' sends a request to ''Home'' for access to the memory block. It then goes into a busy state while it waits for a response.
# ''B'' also sends a request to ''Home'' for access to the same memory block. ''A'''s request is received first.
# ''Home'' responds to ''A'' with the data, but this response gets caught in network traffic and delayed. This response is sent before ''Home'' processes the request from ''B''.
# ''Home'' responds to the request from ''B'', telling it that ''A'' is the current ''Head'' node.
# ''B'' then sends a request to ''A'' to tell it to demote itself.
# ''A'' still hasn't received the response from ''Home'', so it is still in a transaction, and it tells ''B'' that it is busy. ''B'' will have to retry the request.
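
The six steps above can be traced with a small Python sketch. It is an illustration only: the <code>Node</code> class and its method names are hypothetical, the ''Home'' node is reduced to a single head pointer, and the successful retry at the end is an assumption about what happens once the delayed response finally reaches ''A''.

```python
class Node:
    """Hypothetical cache node that is busy while its own request is outstanding."""

    def __init__(self, name):
        self.name = name
        self.busy = False
        self.is_head = False

    def send_request(self):
        self.busy = True            # in a transaction until Home's response arrives

    def receive_home_response(self):
        self.busy = False
        self.is_head = True

    def handle_demote(self):
        if self.busy:
            return "busy"           # the sender has to retry later
        self.is_head = False
        return "ok"


a, b = Node("A"), Node("B")
head_at_home = None

a.send_request()                    # step 1
b.send_request()                    # step 2 (A's request reaches Home first)
head_at_home = "A"                  # step 3: Home answers A, but the reply is delayed
old_head_for_b = head_at_home       # step 4: Home tells B that A is the Head
head_at_home = "B"                  # assumption: Home now records B as the Head
first_try = a.handle_demote()       # steps 5-6: A is still busy
a.receive_home_response()           # the delayed response finally arrives
second_try = a.handle_demote()      # B's retry
```

The invalidating request never overtakes the data reply, because ''A'' refuses to handle it until its own transaction has completed.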

=== Possible Race Conditions ===
==== Concurrent List Deletions ====

==== Simultaneous Deletion and Invalidation ====

== Summary ==

== Definitions and Terms ==

== References ==
[1] "IEEE Standard for Scalable Coherent Interface (SCI)," IEEE Std 1596-1992, 1993. doi: 10.1109/IEEESTD.1993.120366
URL: http://ieeexplore.ieee.org.proxied.lib.ncsu.edu/stamp/stamp.jsp?tp=&arnumber=347683&isnumber=8049

CSC/ECE 506 Spring 2011/ch11 BB EP

2011-04-18T04:31:02Z

Beburrou: /* Example Resolution */

__TOC__

== History ==

== Background ==
The Scalable Coherent Interface (SCI) cache coherence protocol defines a set of states for memory, a set of states for cache lines, and a set of actions that transition between these states when processors access memory.

Three sets of these attributes are defined for minimal, typical, and full applications:

* '''Minimal''' - for ‘trivial but correct’ applications that require the presence of the memory in only one cache line. It does not enable read-sharing, and is appropriate for small multi-processors or applications that don’t require significant sharing.

* '''Typical''' - enables read-sharing of a memory location and provisions for efficiency, such as DMA transfers, local caching of data, and error recovery. This set adds an additional stable memory state (FRESH) and multiple cache states. This set will be the focus of this article going forward.

* '''Full''' - adds support for pair-wise sharing, QOLB lock bit synchronization, and cleansing and washing of cache lines.

== State Diagrams ==
=== Memory States ===
The memory states define the state of the memory block from the perspective of the home directory. This 2-bit field is maintained in the memory tag by the home directory along with the pointer to the head of the sharing list (forwId). This simple state model includes three stable states and one semi-stable state.

{|border="1" cellspacing="0" cellpadding="2"
!colspan="5"| Table 1: Memory States
|----
!State
!Description
!Minimal
!Typical
!Full
|----
|HOME
|no sharing list
|Y
|Y
|Y
|----
|FRESH
|sharing-list copy is the same as memory
|
|Y
|Y
|----
|GONE
|sharing-list copy may be different from memory
|Y
|Y
|Y
|----
|WASH*
|transitional state (GONE to FRESH)
|
|
|Y
|----
|}
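
Table 1 can also be expressed as a small lookup structure. The following Python sketch is only a restatement of the table; the names <code>MemoryState</code>, <code>SUPPORTED</code>, and <code>is_supported</code> are hypothetical, and no numeric encoding of the 2-bit field is implied.

```python
from enum import Enum


class MemoryState(Enum):
    HOME = "no sharing list"
    FRESH = "sharing-list copy is the same as memory"
    GONE = "sharing-list copy may be different from memory"
    WASH = "transitional state (GONE to FRESH)"


# Which memory states each option set uses, copied from Table 1.
SUPPORTED = {
    "minimal": {MemoryState.HOME, MemoryState.GONE},
    "typical": {MemoryState.HOME, MemoryState.FRESH, MemoryState.GONE},
    "full": set(MemoryState),
}


def is_supported(option: str, state: MemoryState) -> bool:
    """Return True if the given option set uses the given memory state."""
    return state in SUPPORTED[option]


# Four states fit in the 2-bit field described above.
fits_in_two_bits = len(MemoryState) <= 2 ** 2
```

For example, <code>is_supported("minimal", MemoryState.FRESH)</code> is false, which reflects that the minimal set does not enable read-sharing.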

=== Cache States ===
The cache-line states are maintained by each processor's cache-coherency controller. This 7-bit field is stored in each cache line in the sharing list, along with the pointer to the next sharing-list node (forwId) and the previous sharing-list node (backId). Seven bits allow up to 128 possible values, and SCI defines twenty-nine stable states for use in the minimal, typical, and full sets.

Stable states are those cache states that exist when a memory transaction is not in process. Their names are derived from a combination of:
* the position of the node in the sharing list, such as ONLY, HEAD, MID, and TAIL
* the state of the data for that node, such as FRESH, CLEAN, DIRTY, VALID, STALE, etc.

[[Image:cache_states.png|center]]
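
The naming scheme and the per-line tag can be sketched as follows. This is a rough illustration: the <code>CacheTag</code> class and the <code>stable_state_name</code> helper are hypothetical, not every position and data-state combination is a state defined by SCI, and no numeric encoding of the 7-bit field is implied.

```python
from dataclasses import dataclass
from typing import Optional

STATE_BITS = 7
MAX_STATE_VALUES = 2 ** STATE_BITS   # 128 possible encodings

POSITIONS = ("ONLY", "HEAD", "MID", "TAIL")


def stable_state_name(position: str, data_state: str) -> str:
    """Compose a stable-state name from list position and data state."""
    if position not in POSITIONS:
        raise ValueError(f"unknown list position: {position}")
    return f"{position}_{data_state}"


@dataclass
class CacheTag:
    """Hypothetical tag stored with each cache line in a sharing list."""
    state: str                       # e.g. the result of stable_state_name()
    forw_id: Optional[str] = None    # next sharing-list node
    back_id: Optional[str] = None    # previous sharing-list node


tag = CacheTag(state=stable_state_name("HEAD", "FRESH"), forw_id="B")
```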

== Race Conditions ==
In a distributed shared memory system with caching, the emergence of race conditions is extremely likely. This is mainly due to the lack of a bus for serialization of actions. It is further compounded by the problem of network errors and congestion.

The early invalidation case from Section 11.4 in Solihin is an excellent example of a race condition that can arise in a distributed system. Recall the diagram from the text, and the cache coherence actions, as shown below.

[[Image:EarlyInValidationRace.png|400px|center]]
The circled actions are as follows:
# ''A'' sends a read request to ''Home''.
# ''Home'' replies with data (but the message gets delayed).
# ''B'' sends a write request to ''Home''.
# ''Home'' sends an invalidation to ''A'', and it arrives before the ReplyD.
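
The effect of this reordering on ''A'' can be shown with a deliberately naive Python model. It is a sketch under assumptions: the <code>deliver</code> function is hypothetical, ''A'''s cache line has only an invalid state <code>"I"</code> and a shared state <code>"S"</code>, and ''A'' handles each message as soon as it arrives.

```python
def deliver(messages):
    """Return the final state of A's cache line after handling messages in order."""
    state = "I"                  # A starts without a copy of the line
    for message in messages:
        if message == "ReplyD":
            state = "S"          # data arrives, A installs a shared copy
        elif message == "Inv":
            state = "I"          # invalidation removes A's copy (if any)
    return state


sent_order = ["ReplyD", "Inv"]       # the order in which Home sent the messages
arrival_order = ["Inv", "ReplyD"]    # the order in which A receives them in the race

expected_state = deliver(sent_order)
raced_state = deliver(arrival_order)
```

In the raced ordering ''A'' ends up holding a copy that it believes is valid, even though ''B'' has since been given permission to write.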

The SCI protocol has a way to handle this race condition, as well as many others, and the following sections will discuss how the SCI protocol design can prevent this race condition.

=== Prevention in the SCI Protocol ===
Race conditions are almost non-existent in the SCI protocol, due primarily to the protocol's design. The following sections briefly discuss how this is accomplished and then show what happens when this race condition arises in the SCI protocol.
==== Atomic Transactions ====
SCI's primary method for preventing race conditions is having atomic transactions. A transaction is defined as a set of sub-actions necessary to complete some requested action, such as reading from memory or writing to a variable. Suppose, for example, that node ''A'' is in the middle of a transaction with node ''B''. Node ''C'' then tries to make a request of node ''A''. Node ''A'' will respond to node ''C'' that it is busy, telling node ''C'' to try again, as shown in the following diagram.
[[Image:AtomicBusy.png|center]]
Thus, in the Early Invalidation race, node ''A'' would be in the middle of a transaction, which would prevent node ''B'' from invalidating it.
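
A minimal sketch of the busy-and-retry behavior follows. The <code>Node</code> class and the <code>send_with_retry</code> helper are hypothetical, and the point at which the first transaction ends is simulated with the <code>finish_after</code> parameter rather than by real message traffic.

```python
class Node:
    """Hypothetical node that serves one transaction partner at a time."""

    def __init__(self, name):
        self.name = name
        self.partner = None          # node this one is in a transaction with

    def begin(self, partner):
        self.partner = partner

    def end(self):
        self.partner = None

    def request(self, sender):
        if self.partner is not None and self.partner != sender:
            return "busy"            # tell the sender to try again
        return "accepted"


def send_with_retry(target, sender, finish_after):
    """Retry until accepted; the ongoing transaction ends after
    `finish_after` rejected attempts (simulated)."""
    attempts = 0
    while True:
        attempts += 1
        if target.request(sender) == "accepted":
            return attempts
        if attempts >= finish_after:
            target.end()


a = Node("A")
a.begin("B")                                   # A is in a transaction with B
attempts = send_with_retry(a, "C", finish_after=2)
```

Node ''C'' is rejected twice while ''A'' is still busy with ''B'', and its third attempt is accepted.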

==== Head Node ====
Another method for preventing race conditions is to have the Head Node of a sharing list perform many of the coherence actions. As a result, only one node is performing actions such as writes and invalidations of other sharers. Since only one node is performing these actions, the possibility of concurrent actions is decreased. If another node wants to write, for instance, it must become the Head Node of the sharing list for the cache line to which it wants to write, displacing the current Head Node as the node that can write.

When the nodes are simply sharing a read-only copy of a line, as when the memory is in the ''FRESH'' state, the Head Node is somewhat irrelevant. All the sharing nodes have their own cached copy of the line. If any node wants to write, including the Head Node, it must perform an additional action in order to do so. Likewise, when a sharing list is sharing a line of data in the dirty, or ''GONE'', memory state, all the sharing nodes stay in the list with their cached line as long as the Head Node has not written to the line. However, once the Head Node wants to write, it can write its data immediately, but it must then invalidate all other shared copies via the forward pointers in the sharing list.

This, however, does not protect against a race condition where the memory is ''FRESH'', the Head Node wants to write, and another node also wants to join the list. The Memory Access mechanism prevents this condition.

==== Memory Access ====
The final way that the SCI protocol minimizes race conditions is by changing the directory structure. In Solihin 11.4, there is a ''Home'' node which keeps track of the cache state and, as a result, has a certain amount of knowledge about the system. In the SCI protocol, there is still a ''Home'' node of sorts, but this node is only responsible for keeping track of which node is the current ''Head''. The ''Home'' node is usually the node where the memory block physically resides. When a node wants to access a block of memory and isn't already in the sharing list for that block, it sends a request to the ''Home'' node of that memory block. If the data is already being used, the ''Home'' node includes the address of the current ''Head'' node of the sharing list in its response. This requirement necessarily serializes the order of access, because an access request is not serviced at the ''Home'' node until the previous request is finished, and requests are processed in FIFO order. In addition, whichever node is the ''Head'' of the sharing list is notified by the new ''Head'' when it is demoted. All parties involved in a potential race are thus talking to each other, preventing the race from occurring in the first place.

Because of this requirement, the projected race condition from the previous section, where the ''Head'' node is in a ''FRESH'' state and wants to write while another node also wants to read, cannot occur. Both nodes must make their requests to the ''Home'' node: the ''Head'' node must request the ability to write, and the other node must request the data. One of these requests will reach the ''Home'' node first, and the second request is then deferred to the node whose request arrived first. Such a scenario is pictured below.
[[Image:MemoryAccess.png|center]]

==== Solihin 11.4 Resolution ====
The three previous sections discussed the individual ways in which the SCI protocol reduces race conditions. Putting all three together yields the following scenario, which resolves the Early Invalidation race as described in the text.
[[Image:SolihinSCI.png|center]]

# ''A'' sends a request to ''Home'' for access to the memory block. It then goes into a busy state while it waits for a response.
# ''B'' also sends a request to ''Home'' for access to the same memory block. ''A'''s request is received first.
# ''Home'' responds to ''A'' with the data, but this response gets caught in network traffic and delayed. This response is sent before ''Home'' processes the request from ''B''.
# ''Home'' responds to the request from ''B'', telling it that ''A'' is the current ''Head'' node.
# ''B'' then sends a request to ''A'' to tell it to demote itself.
# ''A'' still hasn't received the response from ''Home'', so it is still in a transaction, and it tells ''B'' that it is busy. ''B'' will have to retry the request.

=== Possible Race Conditions ===
==== Communication Delays ====

==== Concurrent List Deletions ====

==== Simultaneous Deletion and Invalidation ====

== Summary ==

== Definitions and Terms ==

== References ==
[1] "IEEE Standard for Scalable Coherent Interface (SCI)," IEEE Std 1596-1992, 1993. doi: 10.1109/IEEESTD.1993.120366
URL: http://ieeexplore.ieee.org.proxied.lib.ncsu.edu/stamp/stamp.jsp?tp=&arnumber=347683&isnumber=8049

File:SolihinSCI.png

2011-04-18T04:21:22Z

Beburrou:

CSC/ECE 506 Spring 2011/ch11 BB EP

2011-04-18T04:05:55Z

Beburrou: /* Background */

CSC/ECE 506 Spring 2011/ch11 BB EP

2011-04-18T04:03:29Z

Beburrou: /* = Example Resolution */

CSC/ECE 506 Spring 2011/ch11 BB EP

2011-04-18T04:03:18Z

Beburrou: /* Prevention in the SCI Protocol */

CSC/ECE 506 Spring 2011/ch11 BB EP

2011-04-18T04:01:55Z

Beburrou: /* Memory Access */

File:MemoryAccess.png

2011-04-18T04:00:03Z

Beburrou: Diagram of memory access race

Diagram of memory access race

CSC/ECE 506 Spring 2011/ch11 BB EP

2011-04-18T03:39:37Z

Beburrou: /* Memory Access */

__TOC__

== History ==

== Background ==

== State Diagrams ==
=== Memory States ===

=== Cache States ===
[[Image:cache_states.png|center]]

== Race Conditions ==
In a distributed shared memory system with caching, the emergence of race conditions is extremely likely. This is mainly due to the lack of a bus for serialization of actions. It is further compounded by the problem of network errors and congestion.

The early invalidation case from Section 11.4 in Solihin is an excellent example of a race condition that can arise in a distributed system. Recall the diagram from the text, and the cache coherence actions, as shown below.

[[Image:EarlyInValidationRace.png|400px|center]]
The circled actions are as follows:
# ''A'' sends a read request to ''Home''.
# ''Home'' replies with data (but the message gets delayed).
# ''B'' sends a write request to ''Home''.
# ''Home'' sends an invalidation to ''A'', and it arrives before the ReplyD.

The SCI protocol has a way to handle this race condition, as well as many others, and the following sections will discuss how the SCI protocol design can prevent this race condition.

=== Prevention in the SCI Protocol ===
Race conditions are almost non-existent in the SCI protocol, due primarily to the protocol's design. The following sections briefly discuss how this is accomplished and then show what happens when this race condition arises in the SCI protocol.
==== Atomic Transactions ====
SCI's primary method for preventing race conditions is having atomic transactions. A transaction is defined as a set of sub-actions necessary to complete some requested action, such as reading from memory or writing to a variable. Suppose, for example, that node ''A'' is in the middle of a transaction with node ''B''. Node ''C'' then tries to make a request of node ''A''. Node ''A'' will respond to node ''C'' that it is busy, telling node ''C'' to try again, as shown in the following diagram.
[[Image:AtomicBusy.png|center]]
Thus, in the Early Invalidation race, node ''A'' would be in the middle of a transaction, which would prevent node ''B'' from invalidating it.

==== Head Node ====
Another method for preventing race conditions is to have the Head Node of a sharing list perform many of the coherence actions. As a result, only one node is performing actions such as writes and invalidations of other sharers. Since only one node is performing these actions, the possibility of concurrent actions is decreased. If another node wants to write, for instance, it must become the Head Node of the sharing list for the cache line to which it wants to write, displacing the current Head Node as the node that can write.

When the nodes are simply sharing a read-only copy of a line, as when the memory is in the ''FRESH'' state, the Head Node is somewhat irrelevant. All the sharing nodes have their own cached copy of the line. If any node wants to write, including the Head Node, it must perform an additional action in order to do so. Likewise, when a sharing list is sharing a line of data in the dirty, or ''GONE'', memory state, all the sharing nodes stay in the list with their cached line as long as the Head Node has not written to the line. However, once the Head Node wants to write, it can write its data immediately, but it must then invalidate all other shared copies via the forward pointers in the sharing list.

This, however, does not protect against a race condition where the memory is ''FRESH'', the Head Node wants to write, and another node also wants to join the list. The Memory Access mechanism prevents this condition.

==== Memory Access ====
The final way that the SCI protocol minimizes race conditions is by changing the directory structure. In Solihin 11.4, there is a ''Home'' node which keeps track of the cache state and has a certain amount of knowledge about the system as a result. In the SCI protocol, there is still a ''Home'' node of sorts, but this node is only responsible for keeping track of who the current ''Head'' node is. Such a ''Home'' node is usually the node where the memory block physically resides. When a node wants to access a block of memory, it sends a request to the ''Home'' node of that memory block, assuming it isn't already in the sharing list for that block. This requirement necessarily serializes the order of access, as one access request is not serviced at the ''Home'' node until the previous request is finished, and the requests are processed in FIFO order.

=== Possible Race Conditions ===
==== Communication Delays ====

==== Concurrent List Deletions ====

==== Simultaneous Deletion and Invalidation ====

== Summary ==

== Definitions and Terms ==

== References ==

CSC/ECE 506 Spring 2011/ch11 BB EP

2011-04-18T03:05:25Z

Beburrou: /* Head Node */

__TOC__

== History ==

== Background ==

== State Diagrams ==
=== Memory ===

=== Processors ===

== Race Conditions ==
In a distributed shared memory system with caching, the emergence of race conditions is extremely likely. This is mainly due to the lack of a bus for serialization of actions. It is further compounded by the problem of network errors and congestion.

The early invalidation case from Section 11.4 in Solihin is an excellent example of a race condition that can arise in a distributed system. Recall the diagram from the text, and the cache coherence actions, as shown below.

[[Image:EarlyInValidationRace.png|400px|center]]
The circled actions are as follows:
# ''A'' sends a read request to ''Home''.
# ''Home'' replies with data (but the message gets delayed).
# ''B'' sends a write request to ''Home''.
# ''Home'' sends an invalidation to ''A'', and it arrives before the ReplyD.

The SCI protocol has a way to handle this race condition, as well as many others, and the following sections will discuss how the SCI protocol design can prevent this race condition.

=== Prevention in the SCI Protocol ===
Race conditions are almost non-existent in the SCI protocol, due primarily to the protocol's design. The following sections briefly discuss how this is accomplished and then show what happens when this race condition arises in the SCI protocol.
==== Atomic Transactions ====
SCI's primary method for preventing race conditions is having atomic transactions. A transaction is defined as a set of sub-actions necessary to complete some requested action, such as reading from memory or writing to a variable. Suppose, for example, that node ''A'' is in the middle of a transaction with node ''B''. Node ''C'' then tries to make a request of node ''A''. Node ''A'' will respond to node ''C'' that it is busy, telling node ''C'' to try again, as shown in the following diagram.
[[Image:AtomicBusy.png|center]]
Thus, in the Early Invalidation race, node ''A'' would be in the middle of a transaction, which would prevent node ''B'' from invalidating it.

==== Head Node ====
Another method for preventing race conditions is to have the Head Node of a sharing list perform many of the coherence actions. As a result, only one node is performing actions such as writes and invalidations of other sharers. Since only one node is performing these actions, the possibility of concurrent actions is decreased. If another node wants to write, for instance, it must become the Head Node of the sharing list for the cache line to which it wants to write, displacing the current Head Node as the node that can write.

When the nodes are simply sharing a read-only copy of a line, as when the memory is in the ''FRESH'' state, the Head Node is somewhat irrelevant. All the sharing nodes have their own cached copy of the line. If any node wants to write, including the Head Node, it must perform an additional action in order to do so. Likewise, when a sharing list is sharing a line of data in the dirty, or ''GONE'', memory state, all the sharing nodes stay in the list with their cached line as long as the Head Node has not written to the line. However, once the Head Node wants to write, it can write its data immediately, but it must then invalidate all other shared copies via the forward pointers in the sharing list.

This, however, does not protect against a race condition where the memory is ''FRESH'', the Head Node wants to write, and another node also wants to join the list. The Memory Access mechanism prevents this condition.

==== Memory Access ====

=== Possible Race Conditions ===
==== Communication Delays ====

==== Concurrent List Deletions ====

==== Simultaneous Deletion and Invalidation ====

== Summary ==

== Definitions and Terms ==

== References ==

CSC/ECE 506 Spring 2011/ch11 BB EP

2011-04-18T03:04:49Z

Beburrou: /* Head Node */

__TOC__

== History ==

== Background ==

== State Diagrams ==
=== Memory ===

=== Processors ===

== Race Conditions ==
In a distributed shared memory system with caching, the emergence of race conditions is extremely likely. This is mainly due to the lack of a bus for serialization of actions. It is further compounded by the problem of network errors and congestion.

The early invalidation case from Section 11.4 in Solihin is an excellent example of a race condition that can arise in a distributed system. Recall the diagram from the text, and the cache coherence actions, as shown below.

[[Image:EarlyInValidationRace.png|400px|center]]
The circled actions are as follows:
# ''A'' sends a read request to ''Home''.
# ''Home'' replies with data (but the message gets delayed).
# ''B'' sends a write request to ''Home''.
# ''Home'' sends an invalidation to ''A'', and it arrives before the ReplyD.

The SCI protocol has a way to handle this race condition, as well as many others, and the following sections will discuss how the SCI protocol design can prevent this race condition.

=== Prevention in the SCI Protocol ===
Race conditions are almost non-existent in the SCI protocol, due primarily to the protocol's design. The following sections briefly discuss how this is accomplished and then show what happens when this race condition arises in the SCI protocol.
==== Atomic Transactions ====
SCI's primary method for preventing race conditions is having atomic transactions. A transaction is defined as a set of sub-actions necessary to complete some requested action, such as reading from memory or writing to a variable. Suppose, for example, that node ''A'' is in the middle of a transaction with node ''B''. Node ''C'' then tries to make a request of node ''A''. Node ''A'' will respond to node ''C'' that it is busy, telling node ''C'' to try again, as shown in the following diagram.
[[Image:AtomicBusy.png|center]]
Thus, in the Early Invalidation race, node ''A'' would be in the middle of a transaction, which would prevent node ''B'' from invalidating it.

==== Head Node ====
Another method for preventing race conditions is to have the Head Node of a sharing list perform many of the coherence actions. As a result, only one node is performing actions such as writes and invalidations of other sharers. Since only one node is performing these actions, the possibility of concurrent actions is decreased. If another node wants to write, for instance, it must become the Head Node of the sharing list for the cache line to which it wants to write, displacing the current Head Node as the node that can write.

When the nodes are simply sharing a read-only copy of a line, as when the memory is in the ''FRESH'' state, the Head Node is somewhat irrelevant. All the sharing nodes have their own cached copy of the line. If any node wants to write, including the Head Node, it must perform an additional action in order to do so. Likewise, when a sharing list is sharing a line of data in the dirty, or ''GONE'', memory state, all the sharing nodes stay in the list with their cached line as long as the Head Node has not written to the line. However, once the Head Node wants to write, it can write its data immediately, but it must then invalidate all other shared copies via the forward pointers in the sharing list.

This, however, does not protect against a race condition where the memory is ''FRESH'', the Head Node wants to write, and another node also wants to join the list. The Memory Access mechanism prevents this condition.

==== Memory Access ====

=== Possible Race Conditions ===
==== Communication Delays ====

==== Concurrent List Deletions ====

==== Simultaneous Deletion and Invalidation ====

== Summary ==

== Definitions and Terms ==

== References ==

CSC/ECE 506 Spring 2011/ch11 BB EP

2011-04-18T03:03:58Z

Beburrou: /* Head Node */

__TOC__

== History ==

== Background ==

== State Diagrams ==
=== Memory ===

=== Processors ===

== Race Conditions ==
In a distributed shared memory system with caching, the emergence of race conditions is extremely likely. This is mainly due to the lack of a bus for serialization of actions. It is further compounded by the problem of network errors and congestion.

The early invalidation case from Section 11.4 in Solihin is an excellent example of a race condition that can arise in a distributed system. Recall the diagram from the text, and the cache coherence actions, as shown below.

[[Image:EarlyInValidationRace.png|400px|center]]
The circled actions are as follows:
# ''A'' sends a read request to ''Home''.
# ''Home'' replies with data (but the message gets delayed).
# ''B'' sends a write request to ''Home''.
# ''Home'' sends an invalidation to ''A'', and it arrives before the ReplyD.

The SCI protocol has a way to handle this race condition, as well as many others, and the following sections will discuss how the SCI protocol design can prevent this race condition.

=== Prevention in the SCI Protocol ===
Race conditions are almost non-existent in the SCI protocol, due primarily to the protocol's design. The following sections briefly discuss how this is accomplished and then show what happens when this race condition arises in the SCI protocol.
==== Atomic Transactions ====
SCI's primary method for preventing race conditions is having atomic transactions. A transaction is defined as a set of sub-actions necessary to complete some requested action, such as reading from memory or writing to a variable. Suppose, for example, that node ''A'' is in the middle of a transaction with node ''B''. Node ''C'' then tries to make a request of node ''A''. Node ''A'' will respond to node ''C'' that it is busy, telling node ''C'' to try again, as shown in the following diagram.
[[Image:AtomicBusy.png|center]]
Thus, in the Early Invalidation race, node ''A'' would be in the middle of a transaction, which would prevent node ''B'' from invalidating it.

==== Head Node ====
Another method for preventing race conditions is to have the Head Node of a sharing list perform many of the coherence actions. As a result, only one node is performing actions such as writes and invalidations of other sharers. Since only one node is performing these actions, the possibility of concurrent actions is decreased. If another node wants to write, for instance, it must become the Head Node of the sharing list for the cache line to which it wants to write, displacing the current Head Node as the node that can write.

When the nodes are simply sharing a read-only copy of a line, as when the memory is in the ''FRESH'' state, the Head Node is somewhat irrelevant. All the sharing nodes have their own cached copy of the line. If any node wants to write, including the Head Node, it must perform an additional action in order to do so. Likewise, when a sharing list is sharing a line of data in the dirty, or ''GONE'', memory state, all the sharing nodes stay in the list with their cached line as long as the Head Node has not written to the line. However, once the Head Node wants to write, it can write its data immediately, but it must then invalidate all other shared copies via the forward pointers in the sharing list.

This, however, does not protect against a race condition where the memory is ''FRESH'', the Head Node wants to write, and another node also wants to join the list. The Memory Access mechanism prevents this condition.

==== Memory Access ====

=== Possible Race Conditions ===
==== Communication Delays ====

==== Concurrent List Deletions ====

==== Simultaneous Deletion and Invalidation ====

== Summary ==

== Definitions and Terms ==

== References ==

CSC/ECE 506 Spring 2011/ch11 BB EP

2011-04-18T03:02:35Z

Beburrou: /* Head Node */

__TOC__

== History ==

== Background ==

== State Diagrams ==
=== Memory ===

=== Processors ===

== Race Conditions ==
In a distributed shared memory system with caching, the emergence of race conditions is extremely likely. This is mainly due to the lack of a bus for serialization of actions. It is further compounded by the problem of network errors and congestion.

The early invalidation case from Section 11.4 in Solihin is an excellent example of a race condition that can arise in a distributed system. Recall the diagram from the text, and the cache coherence actions, as shown below.

[[Image:EarlyInValidationRace.png|400px|center]]
The circled actions are as follows:
# ''A'' sends a read request to ''Home''.
# ''Home'' replies with data (but the message gets delayed).
# ''B'' sends a write request to ''Home''.
# ''Home'' sends an invalidation to ''A'', and it arrives before the ReplyD.

The SCI protocol has a way to handle this race condition, as well as many others, and the following sections will discuss how the SCI protocol design can prevent this race condition.

=== Prevention in the SCI Protocol ===
Race conditions are almost non-existent in the SCI protocol, due primarily to the protocol's design. The following sections briefly discuss how this is accomplished and then show what happens when this race condition arises in the SCI protocol.
==== Atomic Transactions ====
SCI's primary method for preventing race conditions is having atomic transactions. A transaction is defined as a set of sub-actions necessary to complete some requested action, such as reading from memory or writing to a variable. Suppose, for example, that node ''A'' is in the middle of a transaction with node ''B''. Node ''C'' then tries to make a request of node ''A''. Node ''A'' will respond to node ''C'' that it is busy, telling node ''C'' to try again, as shown in the following diagram.
[[Image:AtomicBusy.png|center]]
Thus, in the Early Invalidation race, node ''A'' would be in the middle of a transaction, which would prevent node ''B'' from invalidating it.

==== Head Node ====
Another method for preventing race conditions is to have the Head Node of a sharing list perform many of the coherence actions. As a result, only one node is performing actions such as writes and invalidations of other sharers. Since only one node is performing these actions, the possibility of concurrent actions is decreased. If another node wants to write, for instance, it must become the Head Node of the sharing list for the cache line to which it wants to write, displacing the current Head Node as the node that can write.

When the nodes are simply sharing a read-only copy of a line, as when the memory is in the ''FRESH'' state, the Head Node is somewhat irrelevant. All the sharing nodes have their own cached copy of the line. If any node wants to write, including the Head Node, it must perform an additional action in order to do so. Likewise, when a sharing list is sharing a line of data in the dirty, or ''GONE'', memory state, all the sharing nodes stay in the list with their cached line as long as the Head Node has not written to the line. However, once the Head Node wants to write, it can write its data immediately, but it must then invalidate all other shared copies via the forward pointers in the sharing list. This, however, does not protect against a race condition where the memory is ''FRESH'', the Head Node wants to write, and another node also wants to join the list. The Memory Access mechanism prevents this condition.

==== Memory Access ====

=== Possible Race Conditions ===
==== Communication Delays ====

==== Concurrent List Deletions ====

==== Simultaneous Deletion and Invalidation ====

== Summary ==

== Definitions and Terms ==

== References ==

CSC/ECE 506 Spring 2011/ch11 BB EP

2011-04-18T03:00:52Z

Beburrou: /* Race Conditions */ first submisson

__TOC__

== History ==

== Background ==

== State Diagrams ==
=== Memory ===

=== Processors ===

== Race Conditions ==
In a distributed shared memory system with caching, the emergence of race conditions is extremely likely. This is mainly due to the lack of a bus for serialization of actions. It is further compounded by the problem of network errors and congestion.

The early invalidation case from Section 11.4 in Solihin is an excellent example of a race condition that can arise in a distributed system. Recall the diagram from the text, and the cache coherence actions, as shown below.

[[Image:EarlyInValidationRace.png|400px|center]]
The circled actions are as follows:
# ''A'' sends a read request to ''Home''.
# ''Home'' replies with data (but the message gets delayed).
# ''B'' sends a write request to ''Home''.
# ''Home'' sends an invalidation to ''A'', and it arrives before the ReplyD.
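
The effect of the message ordering in the steps above can be shown with a toy Python model. The state names and the <code>deliver</code> function are assumptions made for illustration only; they are not part of Solihin's text or of the SCI protocol.

```python
def deliver(messages):
    """Apply messages to node A's cache line in the order they arrive."""
    state = "INVALID"
    for msg in messages:
        if msg == "ReplyD":      # data reply from Home
            state = "SHARED"
        elif msg == "Inv":       # invalidation from Home
            state = "INVALID"
    return state


in_order = deliver(["ReplyD", "Inv"])   # the invalidation arrives last
raced = deliver(["Inv", "ReplyD"])      # the invalidation overtakes the data
```

In the raced ordering node ''A'' ends up holding a copy that should already have been invalidated, which is exactly the problem the protocol has to prevent.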

The SCI protocol has a way to handle this race condition, as well as many others, and the following sections will discuss how the SCI protocol design can prevent this race condition.

=== Prevention in the SCI Protocol ===
Race conditions are almost non-existent in the SCI protocol, due primarily to the protocol's design. The following sections briefly describe how this is accomplished and then show what happens if such a condition arises in the SCI protocol.
==== Atomic Transactions ====
SCI's primary method for preventing race conditions is having atomic transactions. A transaction is defined as a set of sub-actions necessary to complete some requested action, such as reading from memory or writing to a variable. Suppose, for example, that node ''A'' is in the middle of a transaction with node ''B''. Node ''C'' then tries to make a request of node ''A''. Node ''A'' will respond to node ''C'' that it is busy, telling node ''C'' to try again, as shown in the following diagram.
[[Image:AtomicBusy.png|center]]
Thus, in the Early Invalidation race, node ''A'' would be in the middle of a transaction, which would prevent node ''B'' from invalidating it.

==== Head Node ====
Another method for preventing race conditions is to have the Head Node of a sharing list perform many of the coherence actions. As a result, only one node is performing actions such as writes and invalidations of other sharers. Since only one node is performing these actions, the possibility of concurrent actions is decreased. If another node wants to write, for instance, it must become the Head Node of the sharing list for the cache line to which it wants to write, displacing the current Head Node as the node that can write.
When a line is simply shared read-only, as when the memory is in the ''FRESH'' state, the Head Node is somewhat irrelevant. All the sharing nodes have their own cached copy of the line. Likewise, when a sharing list is sharing a line of data in the dirty or ''GONE'' memory state, all the sharing nodes stay in the list with their cached line as long as the Head Node has not written to the line. Once the Head Node wants to write, it can write its data immediately, but it must then invalidate all other shared copies via the forward pointers in the sharing list. This mechanism alone does not protect against the race in which the memory is ''FRESH'' and the Head Node wants to write while another node is trying to join the list. The Memory Access mechanism prevents this condition.

==== Memory Access ====

=== Possible Race Conditions ===
==== Communication Delays ====

==== Concurrent List Deletions ====

==== Simultaneous Deletion and Invalidation ====

== Summary ==

== Definitions and Terms ==

== References ==

File:AtomicBusy.png

2011-04-18T02:30:20Z

Beburrou: Diagram of how an atomic transaction blocks another node's request

Diagram of how an atomic transaction blocks another node's request

File:EarlyInValidationRace.png

2011-04-18T01:38:04Z

Beburrou: Early Invalidation Race Condition example from Section 11.4 of Yan Solihin's [i]Fundamentals of Parallel Computer Architecture[/i], 2008 1. A sends a read request to home. 2. Home replies with data (but the message gets delayed). 3. B sends a write reque

Early Invalidation Race Condition example from Section 11.4 of Yan Solihin's ''Fundamentals of Parallel Computer Architecture'', 2008

1. A sends a read request to home.
2. Home replies with data (but the message gets delayed).
3. B sends a write request to home.
4. Home sends invalidation to A, and it arrives before the ReplyD

CSC/ECE 506 Spring 2011/ch11 BB EP

2011-04-18T01:35:35Z

Beburrou: /* Race Conditions */

CSC/ECE 506 Spring 2011/ch11 BB EP

2011-04-17T23:59:20Z

Beburrou:

__TOC__

== History ==

== Background ==

== State Diagrams ==
=== Memory ===

=== Processors ===

== Race Conditions ==

== Summary ==

== Definitions and Terms ==

== References ==

CSC/ECE 506 Spring 2011/ch4a bm

2011-02-28T05:09:44Z

Beburrou: /* Shared Memory */

= Overview =
Many algorithms can be parallelized effectively. Some of them can even be parallelized using different parallel models. Gaussian elimination is one such algorithm. It can be implemented in the Data Parallel, Shared Memory, and Message Passing models. This article discusses implementations of Gaussian elimination in all three models, using High Performance FORTRAN (HPF), OpenMP, and MPI.

= Gaussian Elimination =
Gaussian Elimination is a common method used to solve a system of linear equations. The method was popularized by Isaac Newton and is today taught in most elementary linear algebra textbooks.[[#5foot|[5]]] The method consists of two steps: forward reduction and back substitution. The method is not strictly matrix-based, but since any system of linear equations can be represented in matrix form, we will only work with the matrix forms for convenience.

'''Forward Reduction'''

The first step is to reduce the equation matrix to [http://en.wikipedia.org/wiki/Echelon_form row-echelon form]. In this form, each row has at least one more leading zero than the row above it, and the first non-zero element of each row is 1. A couple of examples will help to illustrate row-echelon form:

<blockquote>
{| border="1" cellspacing="5" cellpadding="8" align="center"
|-
| 1
| 2
| -4
|
| 2
|-
| 0
| 1
| 3
| =
| 5
|-
| 0
| 0
| 1
|
| 1
|}
</blockquote>

<blockquote>
{| border="1" cellspacing="5" cellpadding="8" align="center"
|-
| 1
| 7
| 3
| 0
|
| -4
|-
| 0
| 0
| 1
| 10
| =
| 0.5
|-
| 0
| 0
| 0
| 1
|
| 6
|}
</blockquote>
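
Forward reduction can be sketched in Python as follows. This is a minimal serial illustration, not one of the parallel implementations discussed later; the function name <code>forward_reduce</code> and the starting system are made up for the example, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction


def forward_reduce(aug):
    """Return a row-echelon copy of an augmented matrix (a list of rows)."""
    m = [[Fraction(v) for v in row] for row in aug]
    n_rows, n_cols = len(m), len(m[0]) - 1
    r = 0                                    # next row to receive a pivot
    for c in range(n_cols):
        # find a row at or below r with a non-zero entry in column c
        pivot = next((i for i in range(r, n_rows) if m[i][c] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        lead = m[r][c]
        m[r] = [v / lead for v in m[r]]      # leading entry becomes 1
        for i in range(r + 1, n_rows):       # clear column c below the pivot
            f = m[i][c]
            m[i] = [x - f * y for x, y in zip(m[i], m[r])]
        r += 1
        if r == n_rows:
            break
    return m


echelon = forward_reduce([[1, 2, -4, 2], [2, 5, -5, 9], [1, 3, 0, 8]])
```

For this starting system the result is the first row-echelon matrix shown above.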

'''Back Substitution'''

In this step, we begin with the last row of the matrix and substitute the result into the previous row. We solve that row and substitute into the previous row, continuing like this until the system is solved. For example, we will solve this matrix:

<blockquote>
{| border="1" cellspacing="5" cellpadding="8" align="center"
|-
| 1
| 2
| -4
|
| 2
|-
| 0
| 1
| 3
| =
| 5
|-
| 0
| 0
| 1
|
| 1
|}
</blockquote>

Substitute 1 for the third element in equation 2, and subtract 3 from both sides:

<blockquote>
{| border="1" cellspacing="5" cellpadding="8" align="center"
|-
| 1
| 2
| -4
|
| 2
|-
| 0
| 1
| 0
| =
| 2
|-
| 0
| 0
| 1
|
| 1
|}
</blockquote>

Substitute 1 for the third element and 2 for the second element in equation 1, and solve:

<blockquote>
{| border="1" cellspacing="5" cellpadding="8" align="center"
|-
| 1
| 0
| 0
|
| 2
|-
| 0
| 1
| 0
| =
| 2
|-
| 0
| 0
| 1
|
| 1
|}
</blockquote>
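
The same back substitution can be written as a short Python function. This is a sketch that assumes a square system already in row-echelon form with a non-zero diagonal; the name <code>back_substitute</code> is chosen only for this example.

```python
def back_substitute(ech):
    """Solve a square row-echelon augmented matrix, last row first."""
    n = len(ech)
    x = [0] * n
    for i in range(n - 1, -1, -1):
        # contribution of the unknowns that have already been solved
        known = sum(ech[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (ech[i][n] - known) / ech[i][i]
    return x


solution = back_substitute([[1, 2, -4, 2], [0, 1, 3, 5], [0, 0, 1, 1]])
```

Applied to the matrix from the worked example, this yields the values 2, 2, and 1 for the three unknowns.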

= FORTRAN Background =
The code samples below are given in FORTRAN. FORTRAN differs from C-based languages in several ways, which are listed below. Assume that there exists an array <code>A</code>.
* Arrays are 1-based instead of 0-based, and array subscripts are specified using parentheses instead of brackets.
* All elements of an array can be set to a value by simply setting the array variable equal to a value, as in <code>A = 0</code>
* A <code>DO</code> loop is not necessary to perform the same action on a set of items in an array. Rather, one can simply specify a subset of the array on which to perform the action, as in <code>A(a:b) = A(a:b) * 2</code>
* In a multi-dimensional array, using a colon as a range of array elements in one of the dimensions will perform the operation on all elements in that dimension, as in <code>A(1, :) = 2</code>

= Parallel Implementations =

== Data Parallel ==

The following section of code implements Gaussian Elimination via data parallel with HPF. It is found in the book ''Designing and Building Parallel Programs (Online)'' by Ian Foster.[[#2foot|[2]]] 

1: subroutine gauss(n, A, X)
2: integer n
3: real A(n, n+1)
5: real X(n), Fac(n), Row(n+1)
6: integer Indx(n), Itmp(1)
7: integer i, j, k, max_indx
8: real maxval
9:
10: Indx = 0 ! Initialize mask array
11: do i = 1, n
12: Itmp = MAXLOC(ABS(A(:,i)), MASK=Indx .EQ. 0) ! find pivot
13: max_indx = Itmp(1) ! Extract pivot index
14: Indx(max_indx) = i ! Update indirection array
15: Fac = A(:,i)/A(max_indx,i) ! scale factors for column
16: Row = A(max_indx,:) ! Extract pivot row
17:
18: FORALL (j=1:n, k=i:n+1, Indx(j) .EQ. 0) ! Row Update
19: A(j,k) = A(j,k) - Fac(j) * Row(k)
20: end do
21:
22: FORALL (j=1:n)
23: A(Indx(j),:) = A(j,:) ! Row exchange
24:
25: DO j = n,1,-1 ! Back substitution
26: X(j) = A(j,n+1) / A(j,j)
27: A(1:j-1,n+1) = A(1:j-1,n+1) - A(1:j-1,j)*X(j)
28: ENDDO
29: end subroutine gauss

 

This data parallel code works by performing operations on entire rows of the equation matrix. In the first loop, the forward reduction is performed. The variable <code>i</code> is an index which keeps track of the column (and thus row) that is currently being reduced.

Instead of reducing the rows starting with row 1, we will instead work with the row that has the largest element in the pivot column that hasn't already been used. We will find the row with which to work at the beginning of each iteration. (lines 12-14)

Next, we compute a scale factor for every row by dividing its element in the current column by the pivot element, and we save a copy of the pivot row. (lines 15-16)

To finish the forward reduction, we subtract the pivot row from all remaining rows the needed number of times (so that the current column becomes 0). This part of the algorithm is extremely parallelizable. The <code>FORALL</code> statement tells the compiler that the assignment can be performed for all selected elements independently, which expresses data parallelism. (lines 18-19)

Lines 22-24 exchange rows to put the matrix in row-echelon form, and the <code>FORALL</code> statement again shows that this operation is highly parallelizable.

The last loop does the back substitution. The loop starts at the last row and works its way up to the first row.

First, the current row is solved for its unknown. (line 26)
The newly found solution is then immediately substituted into all rows above it, and the resulting product is subtracted from the right side of the matrix (the n+1 column). (line 27)
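
To make the role of the <code>Indx</code> mask easier to follow, here is a serial Python sketch that mirrors the steps of the HPF routine. It is an approximation written for this article, not Foster's code: the inner loops stand in for the <code>FORALL</code> statements, <code>None</code> replaces the value 0 as the "not yet used" marker because Python indices start at 0, and a singular system would cause a division by zero.

```python
def gauss(aug):
    """Solve an n x (n+1) augmented system using a mask of used pivot rows."""
    n = len(aug)
    a = [[float(v) for v in row] for row in aug]
    indx = [None] * n                  # indx[j]: column for which row j was pivot
    for i in range(n):
        free = [j for j in range(n) if indx[j] is None]
        p = max(free, key=lambda j: abs(a[j][i]))     # find pivot row
        indx[p] = i                                   # mark it as used
        fac = [a[j][i] / a[p][i] for j in range(n)]   # scale factors
        row = a[p][:]                                 # copy of pivot row
        for j in range(n):                            # row update
            if indx[j] is None:
                for k in range(i, n + 1):
                    a[j][k] -= fac[j] * row[k]
    u = [None] * n
    for j in range(n):                                # row exchange
        u[indx[j]] = a[j]
    x = [0.0] * n
    for j in range(n - 1, -1, -1):                    # back substitution
        x[j] = u[j][n] / u[j][j]
        for r in range(j):
            u[r][n] -= u[r][j] * x[j]
    return x


x = gauss([[1, 2, -4, 2], [2, 5, -5, 9], [1, 3, 0, 8]])
```

Note that, as in the HPF version, the pivot rows are not normalized; the division happens during back substitution instead.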

== Shared Memory ==

The following section of code implements Gaussian Elimination with a shared memory scheme, using OpenMP. It was taken from a paper by S.F.McGinn and R.E.Shaw from the University of New Brunswick, New Brunswick, Canada.[[#1foot|[1]]] 

1: do pivot = 1, (n-1)
2: !$omp parallel do private(xmult) schedule(runtime)
3: do i = (pivot+1), n
4: xmult = a(i,pivot) / a(pivot,pivot)
5: do j = (pivot+1), n
6: a(i,j) = a(i,j) - (xmult * a(pivot,j))
7: end do
8: b(i) = b(i) - (xmult * b(pivot))
9: end do
10: !$omp end parallel do
11: end do
 

As is readily seen, this code is short and simple, so we will analyze it line by line. To summarize the method, the pivots are processed serially, and for each pivot the rows below it are updated in parallel.

The pivot is the diagonal element currently being used for elimination; it identifies both the pivot row and the column being reduced.

1: do pivot = 1, (n-1)
Loop through each pivot row for forward reduction.
2: !$omp parallel do private(xmult) schedule(runtime)
Spawn parallel threads here, dividing the iterations of the following loop among them and giving each thread a private copy of <code>xmult</code>.
3: do i = (pivot+1), n
Loop through all rows below the pivot row.
4: xmult = a(i,pivot) / a(pivot,pivot)
Compute the multiplier for row <code>i</code> by dividing its element in the pivot column by the pivot element.
5: do j = (pivot+1), n
6: a(i,j) = a(i,j) - (xmult * a(pivot,j))
7: end do
Now, we loop through the remaining columns of row <code>i</code> and subtract the pivot row scaled by <code>xmult</code>, which eliminates the pivot column from row <code>i</code>.
8: b(i) = b(i) - (xmult * b(pivot))
We also update the solution column, which is in the <code>b</code> array.
9: end do
10: !$omp end parallel do
11: end do
Back substitution would be done next, but is not shown.
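
The structure of the OpenMP loop (serial pivots, parallel row updates) can be imitated in Python with a thread pool. This is only a structural sketch: the function names are invented for the example, and Python threads will not necessarily give the speedup that OpenMP threads do.

```python
from concurrent.futures import ThreadPoolExecutor


def eliminate_row(a, b, pivot, i):
    """Update row i against the pivot row (the body of the parallel loop)."""
    xmult = a[i][pivot] / a[pivot][pivot]   # local, like private(xmult)
    for j in range(pivot + 1, len(a)):
        a[i][j] -= xmult * a[pivot][j]
    b[i] -= xmult * b[pivot]


def forward_eliminate(a, b, workers=4):
    """Process pivots serially and the rows below each pivot in parallel."""
    n = len(a)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for pivot in range(n - 1):
            rows = range(pivot + 1, n)
            # list() waits for every row before the next pivot starts
            list(pool.map(lambda i: eliminate_row(a, b, pivot, i), rows))


a = [[1.0, 2.0, -4.0], [2.0, 5.0, -5.0], [1.0, 3.0, 0.0]]
b = [2.0, 9.0, 8.0]
forward_eliminate(a, b)
```

As in the FORTRAN code, the entries below the diagonal are not overwritten with zeros, because the inner loop starts at the column after the pivot; only the upper triangle and <code>b</code> are meaningful afterwards.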

== Message Passing ==
The following section of code implements Gaussian Elimination via message passing, using MPI. It was taken from a paper by S.F.McGinn and R.E.Shaw from the University of New Brunswick, New Brunswick, Canada.[[#1foot|[1]]] 

1: root = 0
2: chunk = n**2/p
3: ! main loop
4: do pivot = 1, n-1
5: ! root maintains communication
6: if (my_rank.eq.0) then
7: ! adjust the chunk size
8: if (MOD(pivot, p).eq.0) then
9: chunk = chunk - n
10: endif
11:
12: ! calculate chunk vectors
13: rem = MOD((n**2-(n*pivot)),chunk)
14: tmp = 0
15: do i = 1, p
16: tmp = tmp + chunk
17: if (tmp.le.(n**2-(n*pivot))) then
18: a_chnk_vec(i) = chunk
19: b_chnk_vec(i) = chunk / n
20: else
21: a_chnk_vec(i) = rem
22: b_chnk_vec(i) = rem / n
23: rem = 0
24: endif
25: continue
26:
27: ! calculate displacement vectors
28: a_disp_vec(1) = (pivot*n)
29: b_disp_vec(1) = pivot
30: do i = 2, p
31: a_disp_vec(i) = a_disp_vec(i-1) + a_chnk_vec(i-1)
32: b_disp_vec(i) = b_disp_vec(i-1) + b_chnk_vec(i-1)
33: continue
34:
35: ! fetch the pivot equation
36: do i = 1, n
37: pivot_eqn(i) = a(n-(i-1),pivot)
38: continue
39:
40: pivot_b = b(pivot)
41: endif ! my_rank.eq.0
42:
43: ! distribute the pivot equation
44: call MPI_BCAST(pivot_eqn, n,
45: MPI_DOUBLE_PRECISION,
46: root, MPI_COMM_WORLD, ierr)
47:
48: call MPI_BCAST(pivot_b, 1,
49: MPI_DOUBLE_PRECISION,
50: root, MPI_COMM_WORLD, ierr)
51:
52: ! distribute the chunk vector
53: call MPI_SCATTER(a_chnk_vec, 1, MPI_INTEGER,
54: chunk, 1, MPI_INTEGER,
55: root, MPI_COMM_WORLD, ierr)
56:
57: ! distribute the data
58: call MPI_SCATTERV(a, a_chnk_vec, a_disp_vec,
59: MPI_DOUBLE_PRECISION,
60: local_a, chunk,
61: MPI_DOUBLE_PRECISION,
62: root, MPI_COMM_WORLD,ierr)
63:
64: call MPI_SCATTERV(b, b_chnk_vec, b_disp_vec,
65: MPI_DOUBLE_PRECISION,
66: local_b, chunk/n,
67: MPI_DOUBLE_PRECISION,
68: root, MPI_COMM_WORLD,ierr)
69:
70: ! forward elimination
71: do j = 1, (chunk/n)
72: xmult = local_a((n-(pivot-1)),j) / pivot_eqn(pivot)
73: do i = (n-pivot), 1, -1
74: local_a(i,j) = local_a(i,j) - (xmult * pivot_eqn(n-(i-1)))
75: continue
76:
77: local_b(j) = local_b(j) - (xmult * pivot_b)
78: continue
79:
80: ! restore the data to root
81: call MPI_GATHERV(local_a, chunk,
82: MPI_DOUBLE_PRECISION,
83: a, a_chnk_vec, a_disp_vec,
84: MPI_DOUBLE_PRECISION,
85: root, MPI_COMM_WORLD, ierr)
86:
87: call MPI_GATHERV(local_b, chunk/n,
88: MPI_DOUBLE_PRECISION,
89: b, b_chnk_vec, b_disp_vec,
90: MPI_DOUBLE_PRECISION,
91: root, MPI_COMM_WORLD, ierr)
92: continue ! end of main loop
93:
94: ! backwards substitution done in parallel (not shown)
 
This code lacks some of the declarations for the variables, but most of the variables are self-explanatory. The code also attempts to do some load balancing via the <code>chunk</code> variable. <code>chunk</code> is also used to determine how much data to send, as the amount of data needed in each step gets progressively smaller. Making <code>chunk</code> smaller will therefore decrease the amount of time spent in communication, thus yielding better runtimes. The other variable of note is <code>root</code>, which refers to the root processor, the processor that controls the rest of the processors.

The code effectively begins its parallel section at line 4. In lines 5-41, the root processor sets the chunk size and sets up the data to be passed to the other processors. In lines 43-68, the root processor sends the necessary data to the other processors. The functions <code>MPI_BCAST</code>, <code>MPI_SCATTER</code>, and <code>MPI_SCATTERV</code> serve as either a "send" or a "receive", depending on which processor is executing them; on the root, they act as a send, while on all other processors, they act as a receive[[#3foot|[3]]]. In lines 70-78, each processor performs the forward elimination on its chunk of data. Finally, in lines 80-91, the data from each processor is sent back to the root processor using the <code>MPI_GATHERV</code> function, which also functions as either a "send" or a "receive", except that the root processor is now the receiver and the other processors are the senders. All of this code is executed for each pivot point in the matrix. Backwards substitution is then done sequentially on the root processor.

The key elements of Message Passing in this code example are the communication via the <code>MPI_</code> functions and the set-up work that the root processor performs by itself to prepare the data to be distributed. This code uses the MPI library to support parallelization.
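
The count and displacement vectors that <code>MPI_SCATTERV</code> and <code>MPI_GATHERV</code> expect can be illustrated with a small Python function. This sketch assumes a simple even split of the remaining rows among the processors; it does not reproduce the <code>chunk</code> adjustment used in the paper's code, and the name <code>scatterv_layout</code> is hypothetical.

```python
def scatterv_layout(first_row, n, p):
    """Split rows first_row..n-1 of an n x n matrix among p processors."""
    remaining = n - first_row
    base, extra = divmod(remaining, p)
    counts, displs = [], []
    offset = first_row * n                  # skip rows that are already reduced
    for rank in range(p):
        rows = base + (1 if rank < extra else 0)
        counts.append(rows * n)             # counts are in elements, not rows
        displs.append(offset)
        offset += rows * n
    return counts, displs


counts, displs = scatterv_layout(first_row=1, n=4, p=2)
```

With one row already reduced in a 4 x 4 matrix and two processors, the first processor receives two rows and the second receives one, and the first displacement skips the reduced row in the same way that <code>a_disp_vec(1)</code> starts at <code>pivot*n</code>.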

= Definitions =
* ''HPF'' - High Performance FORTRAN
* ''MPI'' - Message Passing Interface, an API used for supporting message passing across processes.

= References =
[[#1body|1.]] S.F.McGinn and R.E.Shaw, University of New Brunswick, [http://hpds.ee.kuas.edu.tw/download/parallel_processing/96/96present/20071212/Gaussian.pdf Parallel Gaussian Elimination Using OpenMP and MPI] 
[[#2body|2.]] Ian Foster, Argonne National Laboratory, [http://www.mcs.anl.gov/~itf/dbpp/text/node90.html Case Study: Gaussian Elimination] 
[[#3body|3.]] [http://www.mpi-forum.org/docs/mpi-11-html/mpi-report.html MPI: A Message-Passing Interface Standard] 
[[#4body|4.]] Wikipedia's [http://en.wikipedia.org/wiki/Fortran FORTRAN] page 
[[#5body|5.]] Wikipedia's [http://en.wikipedia.org/wiki/Gaussian_elimination Gaussian Elimination] page

CSC/ECE 506 Spring 2011/ch4a bm

2011-02-28T05:08:02Z

Beburrou: /* Data Parallel */

= Overview =
Many algorithms can be parallelized effectively. Some of them can even be parallelized using different parallel models. Gaussian elimination is one such algorithm. It can be implemented in the Data Parallel, Shared Memory, and Message Passing models. This article discusses implementations of Gaussian elimination in all three models, using High Performance FORTRAN (HPF), OpenMP, and MPI.

= Gaussian Elimination =
Gaussian Elimination is a common method used to solve a system of linear equations. The method was popularized by Isaac Newton and is today taught in most elementary linear algebra textbooks.[[#5foot|[5]]] The method consists of two steps: forward reduction and back substitution. The method is not strictly matrix-based, but since any system of linear equations can be represented in matrix form, we will only work with the matrix forms for convenience.

'''Forward Reduction'''

The first step is to reduce the equation matrix to [http://en.wikipedia.org/wiki/Echelon_form row-echelon form]. In this form, each row has at least one more leading zero than the row above it, and the first non-zero element of each row is 1. A couple of examples will help to illustrate row-echelon form:

<blockquote>
{| border="1" cellspacing="5" cellpadding="8" align="center"
|-
| 1
| 2
| -4
|
| 2
|-
| 0
| 1
| 3
| =
| 5
|-
| 0
| 0
| 1
|
| 1
|}
</blockquote>

<blockquote>
{| border="1" cellspacing="5" cellpadding="8" align="center"
|-
| 1
| 7
| 3
| 0
|
| -4
|-
| 0
| 0
| 1
| 10
| =
| 0.5
|-
| 0
| 0
| 0
| 1
|
| 6
|}
</blockquote>

'''Back Substitution'''

In this step, we begin with the last row of the matrix and substitute the result into the previous row. We solve that row and substitute into the previous row, continuing like this until the system is solved. For example, we will solve this matrix:

<blockquote>
{| border="1" cellspacing="5" cellpadding="8" align="center"
|-
| 1
| 2
| -4
|
| 2
|-
| 0
| 1
| 3
| =
| 5
|-
| 0
| 0
| 1
|
| 1
|}
</blockquote>

Substitute 1 for the third element in equation 2, and subtract 3 from both sides:

<blockquote>
{| border="1" cellspacing="5" cellpadding="8" align="center"
|-
| 1
| 2
| -4
|
| 2
|-
| 0
| 1
| 0
| =
| 2
|-
| 0
| 0
| 1
|
| 1
|}
</blockquote>

Substitute 1 for the third element and 2 for the second element in equation 1, and solve:

<blockquote>
{| border="1" cellspacing="5" cellpadding="8" align="center"
|-
| 1
| 0
| 0
|
| 2
|-
| 0
| 1
| 0
| =
| 2
|-
| 0
| 0
| 1
|
| 1
|}
</blockquote>

= FORTRAN Background =
The code samples below are given in FORTRAN. FORTRAN differs from C-based languages in several ways, which are listed below. Assume that there exists an array <code>A</code>.
* Arrays are 1-based instead of 0-based, and array subscripts are specified using parentheses instead of brackets.
* All elements of an array can be set to a value by simply setting the array variable equal to a value, as in <code>A = 0</code>
* A <code>DO</code> loop is not necessary to perform the same action on a set of items in an array. Rather, one can simply specify a subset of the array on which to perform the action, as in <code>A(a:b) = A(a:b) * 2</code>
* In a multi-dimensional array, using a colon as a range of array elements in one of the dimensions will perform the operation on all elements in that dimension, as in <code>A(1, :) = 2</code>

= Parallel Implementations =

== Data Parallel ==

The following section of code implements Gaussian Elimination via data parallel with HPF. It is found in the book ''Designing and Building Parallel Programs (Online)'' by Ian Foster.[[#2foot|[2]]] 

1: subroutine gauss(n, A, X)
2: integer n
3: real A(n, n+1)
5: real X(n), Fac(n), Row(n+1)
6: integer Indx(n), Itmp(1)
7: integer i, j, k, max_indx
8: real maxval
9:
10: Indx = 0 ! Initialize mask array
11: do i = 1, n
12: Itmp = MAXLOC(ABS(A(:,i)), MASK=Indx .EQ. 0) ! find pivot
13: max_indx = Itmp(1) ! Extract pivot index
14: Indx(max_indx) = i ! Update indirection array
15: Fac = A(:,i)/A(max_indx,i) ! scale factors for column
16: Row = A(max_indx,:) ! Extract pivot row
17:
18: FORALL (j=1:n, k=i:n+1, Indx(j) .EQ. 0) ! Row Update
19: A(j,k) = A(j,k) - Fac(j) * Row(k)
20: end do
21:
22: FORALL (j=1:n)
23: A(Indx(j),:) = A(j,:) ! Row exchange
24:
25: DO j = n,1,-1 ! Back substitution
26: X(j) = A(j,n+1) / A(j,j)
27: A(1:j-1,n+1) = A(1:j-1,n+1) - A(1:j-1,j)*X(j)
28: ENDDO
29: end subroutine gauss

 

This data parallel code works by performing operations on entire rows of the equation matrix. In the first loop, the forward reduction is performed. The variable <code>i</code> is an index which keeps track of the column (and thus row) that is currently being reduced.

Instead of reducing the rows starting with row 1, we will instead work with the row that has the largest element in the pivot column that hasn't already been used. We will find the row with which to work at the beginning of each iteration. (lines 12-14)

Next, we compute a scale factor for every row by dividing its element in the current column by the pivot element, and we save a copy of the pivot row. (lines 15-16)

To finish the forward reduction, we subtract the pivot row from all remaining rows the needed number of times (so that the current column becomes 0). This part of the algorithm is extremely parallelizable. The <code>FORALL</code> statement tells the compiler that the assignment can be performed for all selected elements independently, which expresses data parallelism. (lines 18-19)

Lines 22-24 exchange rows to put the matrix in row-echelon form, and the <code>FORALL</code> statement again shows that this operation is highly parallelizable.

The last loop does the back substitution. The loop starts at the last row and works its way up to the first row.

First, the current row is solved for its unknown. (line 26)
The newly found solution is then immediately substituted into all rows above it, and the resulting product is subtracted from the right side of the matrix (the n+1 column). (line 27)

== Shared Memory ==

The following section of code implements Gaussian Elimination with a shared memory scheme, using OpenMP. It was taken from a paper by S.F.McGinn and R.E.Shaw from the University of New Brunswick, New Brunswick, Canada.[[#1foot|[1]]] 

1: do pivot = 1, (n-1)
2: !$omp parallel do private(xmult) schedule(runtime)
3: do i = (pivot+1), n
4: xmult = a(i,pivot) / a(pivot,pivot)
5: do j = (pivot+1), n
6: a(i,j) = a(i,j) - (xmult * a(pivot,j))
7: end do
8: b(i) = b(i) - (xmult * b(pivot))
9: end do
10: !$omp end parallel do
11: end do
 

As is readily seen, this code is short and simple, so we will analyze it line by line. To summarize the method, the pivots are processed serially, and for each pivot the rows below it are updated in parallel.

The pivot is the diagonal element currently being used for elimination; it identifies both the pivot row and the column being reduced.

1: do pivot = 1, (n-1)
Loop through each pivot row for forward reduction.
2: !$omp parallel do private(xmult) schedule(runtime)
Spawn parallel threads here, dividing the iterations of the following loop among them and giving each thread a private copy of <code>xmult</code>.
3: do i = (pivot+1), n
Loop through all rows below the pivot row.
4: xmult = a(i,pivot) / a(pivot,pivot)
Compute the multiplier for row <code>i</code> by dividing its element in the pivot column by the pivot element.
5: do j = (pivot+1), n
6: a(i,j) = a(i,j) - (xmult * a(pivot,j))
7: end do
Now, we loop through the remaining columns of row <code>i</code> and subtract the pivot row scaled by <code>xmult</code>, which eliminates the pivot column from row <code>i</code>.
8: b(i) = b(i) - (xmult * b(pivot))
We also update the solution column, which is in the <code>b</code> array.
9: end do
10: !$omp end parallel do
11: end do
Back substitution would be done next, but is not shown.

== Message Passing ==
The following section of code implements Gaussian Elimination via message passing, using MPI. It was taken from a paper by S.F.McGinn and R.E.Shaw from the University of New Brunswick, New Brunswick, Canada.[[#1foot|[1]]] 

1: root = 0
2: chunk = n**2/p
3: ! main loop
4: do pivot = 1, n-1
5: ! root maintains communication
6: if (my_rank.eq.0) then
7: ! adjust the chunk size
8: if (MOD(pivot, p).eq.0) then
9: chunk = chunk - n
10: endif
11:
12: ! calculate chunk vectors
13: rem = MOD((n**2-(n*pivot)),chunk)
14: tmp = 0
15: do i = 1, p
16: tmp = tmp + chunk
17: if (tmp.le.(n**2-(n*pivot))) then
18: a_chnk_vec(i) = chunk
19: b_chnk_vec(i) = chunk / n
20: else
21: a_chnk_vec(i) = rem
22: b_chnk_vec(i) = rem / n
23: rem = 0
24: endif
25: continue
26:
27: ! calculate displacement vectors
28: a_disp_vec(1) = (pivot*n)
29: b_disp_vec(1) = pivot
30: do i = 2, p
31: a_disp_vec(i) = a_disp_vec(i-1) + a_chnk_vec(i-1)
32: b_disp_vec(i) = b_disp_vec(i-1) + b_chnk_vec(i-1)
33: continue
34:
35: ! fetch the pivot equation
36: do i = 1, n
37: pivot_eqn(i) = a(n-(i-1),pivot)
38: continue
39:
40: pivot_b = b(pivot)
41: endif ! my_rank.eq.0
42:
43: ! distribute the pivot equation
44: call MPI_BCAST(pivot_eqn, n,
45: MPI_DOUBLE_PRECISION,
46: root, MPI_COMM_WORLD, ierr)
47:
48: call MPI_BCAST(pivot_b, 1,
49: MPI_DOUBLE_PRECISION,
50: root, MPI_COMM_WORLD, ierr)
51:
52: ! distribute the chunk vector
53: call MPI_SCATTER(a_chnk_vec, 1, MPI_INTEGER,
54: chunk, 1, MPI_INTEGER,
55: root, MPI_COMM_WORLD, ierr)
56:
57: ! distribute the data
58: call MPI_SCATTERV(a, a_chnk_vec, a_disp_vec,
59: MPI_DOUBLE_PRECISION,
60: local_a, chunk,
61: MPI_DOUBLE_PRECISION,
62: root, MPI_COMM_WORLD,ierr)
63:
64: call MPI_SCATTERV(b, b_chnk_vec, b_disp_vec,
65: MPI_DOUBLE_PRECISION,
66: local_b, chunk/n,
67: MPI_DOUBLE_PRECISION,
68: root, MPI_COMM_WORLD,ierr)
69:
70: ! forward elimination
71: do j = 1, (chunk/n)
72: xmult = local_a((n-(pivot-1)),j) / pivot_eqn(pivot)
73: do i = (n-pivot), 1, -1
74: local_a(i,j) = local_a(i,j) - (xmult * pivot_eqn(n-(i-1)))
75: continue
76:
77: local_b(j) = local_b(j) - (xmult * pivot_b)
78: continue
79:
80: ! restore the data to root
81: call MPI_GATHERV(local_a, chunk,
82: MPI_DOUBLE_PRECISION,
83: a, a_chnk_vec, a_disp_vec,
84: MPI_DOUBLE_PRECISION,
85: root, MPI_COMM_WORLD, ierr)
86:
87: call MPI_GATHERV(local_b, chunk/n,
88: MPI_DOUBLE_PRECISION,
89: b, b_chnk_vec, b_disp_vec,
90: MPI_DOUBLE_PRECISION,
91: root, MPI_COMM_WORLD, ierr)
92: continue ! end of main loop
93:
94: ! backwards substitution done in parallel (not shown)
 
This code lacks some of the declarations for the variables, but most of the variables are self-explanatory. The code also attempts to do some load balancing via the <code>chunk</code> variable. <code>chunk</code> is also used to determine how much data to send, as the amount of data needed in each step gets progressively smaller. Making <code>chunk</code> smaller will therefore decrease the amount of time spent in communication, thus yielding better runtimes. The other variable of note is <code>root</code>, which refers to the root processor, the processor that controls the rest of the processors.

The code effectively begins its parallel section at line 4. In lines 5-41, the root processor sets the chunk size and sets up the data to be passed to the other processors. In lines 43-68, the root processor sends the necessary data to the other processors. The functions <code>MPI_BCAST</code>, <code>MPI_SCATTER</code>, and <code>MPI_SCATTERV</code> serve as either a "send" or a "receive", depending on which processor is executing them; on the root, they act as a send, while on all other processors, they act as a receive[[#3foot|[3]]]. In lines 70-78, each processor performs the forward elimination on its chunk of data. Finally, in lines 80-91, the data from each processor is sent back to the root processor using the <code>MPI_GATHERV</code> function, which also functions as either a "send" or a "receive", except that the root processor is now the receiver and the other processors are the senders. All of this code is executed for each pivot point in the matrix. Backwards substitution is then done sequentially on the root processor.

The key elements of Message Passing in this code example are the communication via the <code>MPI_</code> functions and the set-up work that the root processor performs by itself to prepare the data to be distributed. This code uses the MPI library to support parallelization.

= Definitions =
* ''HPF'' - High Performance FORTRAN
* ''MPI'' - Message Passing Interface, an API used for supporting message passing across processes.

= References =
[[#1body|1.]] S.F.McGinn and R.E.Shaw, University of New Brunswick, [http://hpds.ee.kuas.edu.tw/download/parallel_processing/96/96present/20071212/Gaussian.pdf Parallel Gaussian Elimination Using OpenMP and MPI] 
[[#2body|2.]] Ian Foster, Argonne National Laboratory, [http://www.mcs.anl.gov/~itf/dbpp/text/node90.html Case Study: Gaussian Elimination] 
[[#3body|3.]] [http://www.mpi-forum.org/docs/mpi-11-html/mpi-report.html MPI: A Message-Passing Interface Standard] 
[[#4body|4.]] Wikipedia's [http://en.wikipedia.org/wiki/Fortran FORTRAN] page 
[[#5body|5.]] Wikipedia's [http://en.wikipedia.org/wiki/Gaussian_elimination Gaussian Elimination] page

CSC/ECE 506 Spring 2011/ch4a bm

2011-02-28T04:47:34Z

Beburrou: /* Gaussian Elimination */

= Overview =
Many algorithms can be parallelized effectively. Some of them can even be parallelized using different parallel models. Gaussian elimination is one such algorithm. It can be implemented in the Data Parallel, Shared Memory, and Message Passing models. This article discusses implementations of Gaussian elimination in all three models, using High Performance FORTRAN (HPF), OpenMP, and MPI.

= Gaussian Elimination =
Gaussian Elimination is a common method used to solve a system of linear equations. The method was popularized by Isaac Newton and is today taught in most elementary linear algebra textbooks.[[#5foot|[5]]] The method consists of two steps: forward reduction and back substitution. The method is not strictly matrix-based, but since any system of linear equations can be represented in matrix form, we will only work with the matrix forms for convenience.

'''Forward Reduction'''

The first step is to reduce the equation matrix to [http://en.wikipedia.org/wiki/Echelon_form row-echelon form]. In this form, each row begins with at least one more leading zero than the row above it, and the first non-zero element of each row is 1. A couple of examples will help to illustrate row-echelon form:

<blockquote>
{| border="1" cellspacing="5" cellpadding="8" align="center"
|-
| 1
| 2
| -4
|
| 2
|-
| 0
| 1
| 3
| =
| 5
|-
| 0
| 0
| 1
|
| 1
|}
</blockquote>

<blockquote>
{| border="1" cellspacing="5" cellpadding="8" align="center"
|-
| 1
| 7
| 3
| 0
|
| -4
|-
| 0
| 0
| 1
| 10
| =
| 0.5
|-
| 0
| 0
| 0
| 1
|
| 6
|}
</blockquote>
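
As a rough illustration of forward reduction, the following sketch reduces an augmented matrix to row-echelon form. It is a plain serial version written in Python for brevity, not one of the parallel codes discussed in this article; the function name <code>forward_reduce</code> and the sample system are assumptions, chosen so that the result is the first example matrix above.

```python
def forward_reduce(aug):
    """Reduce an augmented matrix (list of rows) to row-echelon form.

    Hypothetical helper for illustration; it works on a copy and uses
    the first non-zero entry it finds as the pivot (no partial pivoting).
    """
    m = [[float(v) for v in row] for row in aug]
    rows, cols = len(m), len(m[0]) - 1  # last column is the right-hand side
    r = 0  # index of the row that receives the next pivot
    for c in range(cols):
        # Find a row at or below r with a non-zero entry in column c.
        pivot = next((i for i in range(r, rows) if m[i][c] != 0.0), None)
        if pivot is None:
            continue  # nothing to eliminate in this column
        m[r], m[pivot] = m[pivot], m[r]
        lead = m[r][c]
        m[r] = [v / lead for v in m[r]]  # make the leading element 1
        for i in range(r + 1, rows):
            factor = m[i][c]
            # Subtract a multiple of the pivot row to zero column c in row i.
            m[i] = [vi - factor * vr for vi, vr in zip(m[i], m[r])]
        r += 1
        if r == rows:
            break
    return m


# Assumed sample system with solution x = 2, y = 2, z = 1.
system = [[1, 2, -4, 2],
          [2, 5, -5, 9],
          [1, 2, -3, 3]]
echelon = forward_reduce(system)
```

Each row below the pivot is updated independently of the others, which is the property the parallel implementations later in this article exploit.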

'''Back Substitution'''

In this step, we begin with the last row of the matrix and substitute the result into the previous row. We solve that row and substitute into the previous row, continuing like this until the system is solved. For example, we will solve this matrix:

<blockquote>
{| border="1" cellspacing="5" cellpadding="8" align="center"
|-
| 1
| 2
| -4
|
| 2
|-
| 0
| 1
| 3
| =
| 5
|-
| 0
| 0
| 1
|
| 1
|}
</blockquote>

Substitute 1 for the third element in equation 2, and subtract 3 from both sides:

<blockquote>
{| border="1" cellspacing="5" cellpadding="8" align="center"
|-
| 1
| 2
| -4
|
| 2
|-
| 0
| 1
| 0
| =
| 2
|-
| 0
| 0
| 1
|
| 1
|}
</blockquote>

Substitute 1 for the third element and 2 for the second element in equation 1, and solve:

<blockquote>
{| border="1" cellspacing="5" cellpadding="8" align="center"
|-
| 1
| 0
| 0
|
| 2
|-
| 0
| 1
| 0
| =
| 2
|-
| 0
| 0
| 1
|
| 1
|}
</blockquote>
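
The back substitution just worked by hand can be sketched in a few lines. This is a serial Python illustration under the assumption that the matrix is already in row-echelon form with a non-zero diagonal; the function name <code>back_substitute</code> is hypothetical.

```python
def back_substitute(echelon):
    """Solve an n x (n+1) augmented matrix that is already upper triangular.

    Hypothetical helper for illustration; assumes every diagonal
    element is non-zero.
    """
    n = len(echelon)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # last row first
        # Contribution of the unknowns that have already been solved.
        known = sum(echelon[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (echelon[i][n] - known) / echelon[i][i]
    return x


# The matrix solved step by step above.
solution = back_substitute([[1, 2, -4, 2],
                            [0, 1, 3, 5],
                            [0, 0, 1, 1]])
# solution is [2.0, 2.0, 1.0]
```

Unlike forward reduction, each step here depends on the values found in the steps before it, so this loop offers less parallelism across rows.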

= FORTRAN Background =
The code samples below are given in FORTRAN, which differs from C-based languages in several ways that matter for reading them. In the list below, assume that there exists an array <code>A</code>.
* Arrays are 1-based instead of 0-based, and array subscripts are specified using parentheses instead of brackets.
* All elements of an array can be set to a value by simply setting the array variable equal to a value, as in <code>A = 0</code>
* A <code>DO</code> loop is not necessary to perform the same action on a set of items in an array. Rather, one can simply specify a subset of the array on which to perform the action, as in <code>A(a:b) = A(a:b) * 2</code>
* In a multi-dimensional array, using a colon as a range of array elements in one of the dimensions will perform the operation on all elements in that dimension, as in <code>A(1, :) = 2</code>
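
For readers more used to explicit loops, the array-section statements above can be read as follows. This is only a sketch in Python with made-up sample data; note that Python lists are 0-based and its ranges exclude the upper bound, whereas the FORTRAN sections are 1-based and inclusive.

```python
# FORTRAN: A(2:4) = A(2:4) * 2   (1-based, both bounds inclusive)
a = [1, 2, 3, 4, 5]
lo, hi = 2, 4
for idx in range(lo - 1, hi):  # 0-based indices 1, 2, 3
    a[idx] = a[idx] * 2
# a is now [1, 4, 6, 8, 5]

# FORTRAN: A(1, :) = 2   (every column of the first row)
grid = [[0, 0, 0],
        [0, 0, 0]]
for col in range(len(grid[0])):
    grid[0][col] = 2
# grid is now [[2, 2, 2], [0, 0, 0]]

# FORTRAN: A = 0   (every element of the array)
zeros = [7, 8, 9]
for idx in range(len(zeros)):
    zeros[idx] = 0
# zeros is now [0, 0, 0]
```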

= Parallel Implementations =

== Data Parallel ==

The following section of code implements Gaussian Elimination via data parallel with HPF. It is found in the book ''Designing and Building Parallel Programs (Online)'' by Ian Foster.[[#2foot|[2]]] 

1: subroutine gauss(n, A, X)
2: integer n
3: real A(n, n+1)
5: real X(n), Fac(n), Row(n+1)
6: integer Indx(n), Itmp(1)
7: integer i, j, k, max_indx
8: real maxval
9:
10: Indx = 0 ! Initialize mask array
11: do i = 1, n
12: Itmp = MAXLOC(ABS(A(:,i)), MASK=Indx .EQ. 0) ! find pivot
13: max_indx = Itmp(1) ! Extract pivot index
14: Indx(max_indx) = i ! Update indirection array
15: Fac = A(:,i)/A(max_indx,i) ! scale factors for column
16: Row = A(max_indx,:) ! Extract pivot row
17:
18: FORALL (j=1:n, k=i:n+1, Indx(j) .EQ. 0) ! Row Update
19: A(j,k) = A(j,k) - Fac(j) * Row(k)
20: end do
21:
22: FORALL (j=1:n)
23: A(Indx(j),:) = A(j,:) ! Row exchange
24:
25: DO j = n,1,-1 ! Back substitution
26: X(j) = A(j,n+1) / A(j,j)
27: A(1:j-1,n+1) = A(1:j-1,n+1) - A(1:j-1,j)*X(j)
28: ENDDO
29: end subroutine gauss

 

This data parallel code works by performing operations on entire rows of the equation matrix. In the first loop, the forward reduction is performed. The variable <code>i</code> is an index which keeps track of the column (and thus row) that is currently being reduced.

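Rather than always using row <code>i</code> as the pivot row, each iteration selects, from the rows that have not yet served as a pivot, the row with the largest absolute value in column <code>i</code>. The rows are not physically swapped at this point; the choice is recorded in the indirection array <code>Indx</code>, and the exchange itself is deferred to lines 22-23. (lines 12-14)

Next, we compute a scale factor for every row by dividing its element in column <code>i</code> by the pivot element, and we copy the pivot row into <code>Row</code>. (lines 15-16)

To finish the forward reduction, we subtract the pivot row, multiplied by each row's scale factor, from every row that has not yet been used as a pivot, so that the current column becomes 0 in those rows. Every element update is independent of the others, so this part of the algorithm is extremely parallelizable. (lines 18-20)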

The last loop does the back substitution. The loop starts at the last row and works its way up to the first row.

First, the current row is solved. (line 26)
Then, the newly found solution is immediately substituted into all remaining rows: its contribution is subtracted from the right side of the matrix (the n+1 column). (line 27)
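
To make the role of the mask array <code>Indx</code> concrete, the steps above can be emulated serially. The following Python sketch is not HPF and performs no parallel work; it only mirrors the order of operations in the listing. The function name <code>gauss_masked_pivot</code> and the sample system are assumptions, and the matrix is assumed to be non-singular.

```python
def gauss_masked_pivot(aug):
    """Serial emulation of the masked-pivot elimination described above.

    aug is an n x (n+1) augmented matrix. Hypothetical helper for
    illustration; assumes the system has a unique solution.
    """
    n = len(aug)
    a = [[float(v) for v in row] for row in aug]
    indx = [0] * n  # 0 means "row not yet used as a pivot"
    for i in range(n):
        # Like MAXLOC with a mask: largest |a[j][i]| among unused rows.
        unused = [j for j in range(n) if indx[j] == 0]
        p = max(unused, key=lambda j: abs(a[j][i]))
        indx[p] = i + 1  # record the (1-based) column this row pivots
        fac = [a[j][i] / a[p][i] for j in range(n)]  # scale factors
        row = a[p][:]  # copy of the pivot row
        for j in range(n):  # every (j, k) update is independent
            if indx[j] == 0:
                for k in range(i, n + 1):
                    a[j][k] -= fac[j] * row[k]
    # Deferred row exchange: the pivot row for column c moves to position c.
    ordered = [None] * n
    for j in range(n):
        ordered[indx[j] - 1] = a[j]
    # Back substitution, last row first.
    x = [0.0] * n
    for j in range(n - 1, -1, -1):
        x[j] = ordered[j][n] / ordered[j][j]
        for r in range(j):
            ordered[r][n] -= ordered[r][j] * x[j]
    return x


# Assumed sample system; its first row has a zero in column 1,
# so a different row must be chosen as the first pivot.
result = gauss_masked_pivot([[0, 2, 1, 7],
                             [1, 1, 1, 6],
                             [2, 1, -1, 1]])
# result is approximately [1.0, 2.0, 3.0]
```

Because the pivot choice is stored in <code>indx</code> instead of being applied by swapping rows, no row has to move until the elimination is complete.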

== Shared Memory ==

The following section of code implements Gaussian Elimination with a shared memory scheme, using OpenMP. It was taken from a paper by S.F.McGinn and R.E.Shaw from the University of New Brunswick, New Brunswick, Canada.[[#1foot|[1]]] 

1: do pivot = 1, (n-1)
2: !$omp parallel do private(xmult) schedule(runtime)
3: do i = (pivot+1), n
4: xmult = a(i,pivot) / a(pivot,pivot)
5: do j = (pivot+1), n
6: a(i,j) = a(i,j) - (xmult * a(pivot,j))
7: end do
8: b(i) = b(i) - (xmult * b(pivot))
9: end do
10: !$omp end parallel do
11: end do
 

This code is short and simple, so we will analyze it line by line. To summarize the method, the pivots are processed serially, and for each pivot the elimination of the rows below it is carried out in parallel.

The pivot column is the column currently being reduced; the pivot row is the row with the same index.

1: do pivot = 1, (n-1)
Loop serially over the pivots for forward reduction.
2: !$omp parallel do private(xmult) schedule(runtime)
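Spawn parallel threads here. <code>xmult</code> is made private so that each thread has its own copy, and <code>schedule(runtime)</code> leaves the choice of loop schedule to run time.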
3: do i = (pivot+1), n
Loop through all rows below the pivot row. The iterations of this loop are divided among the threads.
4: xmult = a(i,pivot) / a(pivot,pivot)
Compute the multiplier for row <code>i</code>: its element in the pivot column divided by the pivot element.
5: do j = (pivot+1), n
6: a(i,j) = a(i,j) - (xmult * a(pivot,j))
7: end do
Now, we loop through the remaining columns of row <code>i</code> and subtract the pivot row, scaled by <code>xmult</code>, from it. The element in the pivot column itself is not explicitly set to zero.
8: b(i) = b(i) - (xmult * b(pivot))
We also update the right-hand side vector, which here is <code>b</code>.
9: end do
10: !$omp end parallel do
11: end do
Back substitution is done next, but is not shown.
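
The structure of this loop nest (a serial loop over pivots with an independent update for each row below the pivot) can be sketched with a thread pool. This is a Python illustration, not OpenMP; the names <code>eliminate_row</code>, <code>forward_eliminate</code>, and <code>back_substitute</code> are assumptions, as is the back substitution step, which the paper's listing does not show. In standard CPython the threads will not make this arithmetic faster, so the sketch shows only how the work is divided.

```python
from concurrent.futures import ThreadPoolExecutor


def eliminate_row(a, b, pivot, i):
    """Update row i for the given pivot; touches only row i and b[i]."""
    xmult = a[i][pivot] / a[pivot][pivot]  # local, like private(xmult)
    for j in range(pivot + 1, len(a)):
        a[i][j] -= xmult * a[pivot][j]
    b[i] -= xmult * b[pivot]


def forward_eliminate(a, b, workers=4):
    """Forward elimination in place; assumes non-zero pivots (no pivoting)."""
    n = len(a)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for pivot in range(n - 1):  # pivots are processed serially
            rows = range(pivot + 1, n)
            # list() waits for every row update, acting as a barrier
            # before the next pivot begins.
            list(pool.map(lambda i: eliminate_row(a, b, pivot, i), rows))


def back_substitute(a, b):
    """Assumed serial back substitution; reads only the upper triangle."""
    n = len(a)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(a[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - known) / a[i][i]
    return x


# Assumed sample system with solution x = 1, y = 2, z = 3.
a = [[2.0, 1.0, -1.0],
     [1.0, 1.0, 1.0],
     [0.0, 2.0, 1.0]]
b = [1.0, 6.0, 7.0]
forward_eliminate(a, b)
x = back_substitute(a, b)
```

No locking is needed because each task writes only to its own row of <code>a</code> and its own element of <code>b</code>, while the pivot row is only read.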

== Message Passing ==
The following section of code implements Gaussian Elimination via message passing, using MPI. It was taken from a paper by S.F.McGinn and R.E.Shaw from the University of New Brunswick, New Brunswick, Canada.[[#1foot|[1]]] 

1: root = 0
2: chunk = n**2/p
3: ! main loop
4: do pivot = 1, n-1
5: ! root maintains communication
6: if (my_rank.eq.0) then
7: ! adjust the chunk size
8: if (MOD(pivot, p).eq.0) then
9: chunk = chunk - n
10: endif
11:
12: ! calculate chunk vectors
13: rem = MOD((n**2-(n*pivot)),chunk)
14: tmp = 0
15: do i = 1, p
16: tmp = tmp + chunk
17: if (tmp.le.(n**2-(n*pivot))) then
18: a_chnk_vec(i) = chunk
19: b_chnk_vec(i) = chunk / n
20: else
21: a_chnk_vec(i) = rem
22: b_chnk_vec(i) = rem / n
23: rem = 0
24: endif
25: continue
26:
27: ! calculate displacement vectors
28: a_disp_vec(1) = (pivot*n)
29: b_disp_vec(1) = pivot
30: do i = 2, p
31: a_disp_vec(i) = a_disp_vec(i-1) + a_chnk_vec(i-1)
32: b_disp_vec(i) = b_disp_vec(i-1) + b_chnk_vec(i-1)
33: continue
34:
35: ! fetch the pivot equation
36: do i = 1, n
37: pivot_eqn(i) = a(n-(i-1),pivot)
38: continue
39:
40: pivot_b = b(pivot)
41: endif ! my_rank.eq.0
42:
43: ! distribute the pivot equation
44: call MPI_BCAST(pivot_eqn, n,
45: MPI_DOUBLE_PRECISION,
46: root, MPI_COMM_WORLD, ierr)
47:
48: call MPI_BCAST(pivot_b, 1,
49: MPI_DOUBLE_PRECISION,
50: root, MPI_COMM_WORLD, ierr)
51:
52: ! distribute the chunk vector
53: call MPI_SCATTER(a_chnk_vec, 1, MPI_INTEGER,
54: chunk, 1, MPI_INTEGER,
55: root, MPI_COMM_WORLD, ierr)
56:
57: ! distribute the data
58: call MPI_SCATTERV(a, a_chnk_vec, a_disp_vec,
59: MPI_DOUBLE_PRECISION,
60: local_a, chunk,
61: MPI_DOUBLE_PRECISION,
62: root, MPI_COMM_WORLD,ierr)
63:
64: call MPI_SCATTERV(b, b_chnk_vec, b_disp_vec,
65: MPI_DOUBLE_PRECISION,
66: local_b, chunk/n,
67: MPI_DOUBLE_PRECISION,
68: root, MPI_COMM_WORLD,ierr)
69:
70: ! forward elimination
71: do j = 1, (chunk/n)
72: xmult = local_a((n-(pivot-1)),j) / pivot_eqn(pivot)
73: do i = (n-pivot), 1, -1
74: local_a(i,j) = local_a(i,j) - (xmult * pivot_eqn(n-(i-1)))
75: continue
76:
77: local_b(j) = local_b(j) - (xmult * pivot_b)
78: continue
79:
80: ! restore the data to root
81: call MPI_GATHERV(local_a, chunk,
82: MPI_DOUBLE_PRECISION,
83: a, a_chnk_vec, a_disp_vec,
84: MPI_DOUBLE_PRECISION,
85: root, MPI_COMM_WORLD, ierr)
86:
87: call MPI_GATHERV(local_b, chunk/n,
88: MPI_DOUBLE_PRECISION,
89: b, b_chnk_vec, b_disp_vec,
90: MPI_DOUBLE_PRECISION,
91: root, MPI_COMM_WORLD, ierr)
92: continue ! end of main loop
93:
94: ! backwards substitution done in parallel (not shown)
 
This code lacks some of the variable declarations, but most of the variables are self-explanatory. The code also attempts to do some load balancing via the <code>chunk</code> variable. <code>chunk</code> is also used to determine how much data to send, as the amount of data needed in each step gets progressively smaller. Making <code>chunk</code> smaller will therefore decrease the amount of time spent in communication, thus yielding better runtimes. The other variable of note is <code>root</code>, which refers to the root processor, the processor that controls the rest of the processors.

The code effectively begins its parallel section at line 4. In lines 5-41, the root processor sets the chunk size and sets up the data to be passed to the other processors. In lines 43-68, the root processor sends the necessary data to the other processors. The functions <code>MPI_BCAST</code>, <code>MPI_SCATTER</code>, and <code>MPI_SCATTERV</code> serve as either a "send" or a "receive", depending on which processor is executing them; on the root, they act as a send, while on all other processors, they act as a receive[[#3foot|[3]]]. In lines 70-78, each processor performs the forward elimination on its chunk of data. Finally, in lines 80-91, the data from each processor is sent back to the root processor using the <code>MPI_GATHERV</code> function, which also functions as either a "send" or a "receive", except that the root processor is now the receiver and the other processors are the senders. All of this code is executed for each pivot point in the matrix. Backwards substitution follows the main loop and is not shown; the comment on line 94 indicates that it is also done in parallel.

The key elements of Message Passing in this code example are the communication via the <code>MPI_</code> functions and the root processor performing some set-up of data to be passed on its own. This code is using the MPI library to support parallelization.
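
The broadcast, scatter, compute, and gather pattern described above can be imitated in a single process to show how the data moves. The following Python sketch does not use MPI: each pass of the inner loop stands in for one processor, list copies stand in for messages, and the even split in <code>split_counts</code> is a simplification of the paper's <code>chunk</code> calculation. All function names are assumptions.

```python
def split_counts(total, p):
    """Split `total` rows as evenly as possible among p ranks."""
    base, extra = divmod(total, p)
    return [base + (1 if r < extra else 0) for r in range(p)]


def worker_eliminate(local_a, local_b, pivot_row, pivot_b, pivot):
    """Work done by one rank on the rows it received."""
    for r, row in enumerate(local_a):
        xmult = row[pivot] / pivot_row[pivot]
        for j in range(pivot, len(row)):
            row[j] -= xmult * pivot_row[j]
        local_b[r] -= xmult * pivot_b
    return local_a, local_b


def forward_eliminate_scatter_gather(a, b, p):
    """Simulated message-passing forward elimination (assumes non-zero pivots)."""
    a = [[float(v) for v in row] for row in a]
    b = [float(v) for v in b]
    n = len(a)
    for pivot in range(n - 1):
        # "Broadcast": every rank gets a copy of the pivot equation.
        pivot_row, pivot_b = a[pivot][:], b[pivot]
        start = pivot + 1
        for count in split_counts(n - pivot - 1, p):
            stop = start + count
            # "Scatter": this rank receives copies of its rows.
            local_a = [row[:] for row in a[start:stop]]
            local_b = b[start:stop]
            local_a, local_b = worker_eliminate(
                local_a, local_b, pivot_row, pivot_b, pivot)
            # "Gather": the root stores the updated rows back.
            a[start:stop] = local_a
            b[start:stop] = local_b
            start = stop
    return a, b


# Assumed sample system, reduced using two simulated ranks.
upper, rhs = forward_eliminate_scatter_gather(
    [[2, 1, -1], [1, 1, 1], [0, 2, 1]], [1, 6, 7], p=2)
# upper is [[2, 1, -1], [0, 0.5, 1.5], [0, 0, -5]] and rhs is [1, 5.5, -15]
```

The number of rows to distribute shrinks by one with each pivot, so the later iterations send less data and leave some ranks with nothing to do.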

= Definitions =
* ''HPF'' - High Performance FORTRAN
* ''MPI'' - Message Passing Interface, an API used for supporting message passing across processes.

= References =
[[#1body|1.]] S.F.McGinn and R.E.Shaw, University of New Brunswick, [http://hpds.ee.kuas.edu.tw/download/parallel_processing/96/96present/20071212/Gaussian.pdf Parallel Gaussian Elimination Using OpenMP and MPI] 
[[#2body|2.]] Ian Foster, Argonne National Laboratory, [http://www.mcs.anl.gov/~itf/dbpp/text/node90.html Case Study: Gaussian Elimination] 
[[#3body|3.]] [http://www.mpi-forum.org/docs/mpi-11-html/mpi-report.html MPI: A Message-Passing Interface Standard] 
[[#4body|4.]] Wikipedia's [http://en.wikipedia.org/wiki/Fortran FORTRAN] page 
[[#5body|5.]] Wikipedia's [http://en.wikipedia.org/wiki/Gaussian_elimination Gaussian Elimination] page

CSC/ECE 506 Spring 2011/ch4a bm

2011-02-27T21:59:53Z

Beburrou: /* Definitions */ Added MPI

CSC/ECE 506 Spring 2011/ch4a bm

2011-02-27T21:57:33Z

Beburrou: /* Message Passing */