<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://wiki.expertiza.ncsu.edu/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Eapotter</id>
	<title>Expertiza_Wiki - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="https://wiki.expertiza.ncsu.edu/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Eapotter"/>
	<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=Special:Contributions/Eapotter"/>
	<updated>2026-06-02T22:32:55Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.41.0</generator>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch11_BB_EP&amp;diff=45189</id>
		<title>CSC/ECE 506 Spring 2011/ch11 BB EP</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch11_BB_EP&amp;diff=45189"/>
		<updated>2011-04-19T01:48:26Z</updated>

		<summary type="html">&lt;p&gt;Eapotter: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;__TOC__&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
(Unless otherwise noted, the content of this article is derived from the IEEE standardization document titled &amp;quot;IEEE Standard for Scalable Coherent Interface (SCI)&amp;quot;&amp;lt;sup&amp;gt;&amp;lt;span class = &amp;quot;plainlinks&amp;quot;&amp;gt;[[#References|[1]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.)&lt;br /&gt;
&lt;br /&gt;
== Background ==&lt;br /&gt;
The Scalable Coherent Interface (SCI) cache coherence protocol defines a set of states for memory, a set of states for cache lines, and a set of actions that transition between these states when processors access memory.&lt;br /&gt;
&lt;br /&gt;
Three sets of these attributes are defined for minimal, typical, and full applications:&lt;br /&gt;
&lt;br /&gt;
* '''Minimal''' - for ‘trivial but correct’ applications that require the presence of the memory in only one cache line.  It does not enable read-sharing, and is appropriate for small multi-processors or applications that don’t require significant sharing.&lt;br /&gt;
&lt;br /&gt;
* '''Typical''' - enables read-sharing of a memory location and provisions for efficiency, such as DMA transfers, local caching of data, and error recovery.  This set adds an additional stable memory state (FRESH) and multiple cache states.  This set will be the focus of this article going forward.&lt;br /&gt;
&lt;br /&gt;
* '''Full''' - adds support for pair-wise sharing, QOLB lock bit synchronization, and cleansing and washing of cache lines.&lt;br /&gt;
&lt;br /&gt;
== State Diagrams ==&lt;br /&gt;
=== Memory States ===&lt;br /&gt;
The memory states define the state of the memory block from the perspective of the home directory.  This 2-bit field is maintained in the memory tag by the home directory along with the pointer to the head of the sharing list (forwId).  This simple state model includes three stable states and one semi-stable state.&lt;br /&gt;
&lt;br /&gt;
{|border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; cellpadding=&amp;quot;2&amp;quot;&lt;br /&gt;
!colspan=&amp;quot;5&amp;quot;| Table 1: Memory States &lt;br /&gt;
|----&lt;br /&gt;
!State&lt;br /&gt;
!Description&lt;br /&gt;
!Minimal&lt;br /&gt;
!Typical&lt;br /&gt;
!Full&lt;br /&gt;
|----&lt;br /&gt;
|HOME&lt;br /&gt;
|no sharing list&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|FRESH&lt;br /&gt;
|sharing-list copy is the same as memory&lt;br /&gt;
|&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|GONE&lt;br /&gt;
|sharing-list copy may be different from memory&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|WASH*&lt;br /&gt;
|transitional state (GONE to FRESH)&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
These states are used by the coherency model to determine the state of the cache line when memory transactions are performed by individual processors.  Likewise, these memory states are impacted when these memory transactions occur.&lt;br /&gt;
&lt;br /&gt;
Below is a state diagram that depicts the transition between these memory states:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:memory_state.png|center]]&lt;br /&gt;
&lt;br /&gt;
=== Cache States ===&lt;br /&gt;
The cache-line states are maintained by each processor's cache-coherency controller.  This 7-bit field is stored in each cache line in the sharing list, along with the pointer to the next sharing-list node (forwId) and the previous sharing-list node (backId).  Seven bits enable up to 128 possible values, and SCI defines twenty-nine stable states for use in the minimal, typical, and full sets.  &lt;br /&gt;
&lt;br /&gt;
Stable states are those cache states that exist when a memory transaction is not in progress.  Their names are derived from a combination of
the position of the node in the sharing list, such as ONLY, HEAD, MID, and TAIL, and the state of the data for that node, such as FRESH, CLEAN, DIRTY, VALID, STALE, etc.&lt;br /&gt;
&lt;br /&gt;
====States of the Typical Set====&lt;br /&gt;
&lt;br /&gt;
Following are the states defined for the Typical set.  (Note that other states may be supported to provide certain interoperability with nodes implementing the Full set, but are not used when all nodes support the Typical set.)&lt;br /&gt;
&lt;br /&gt;
* '''ONLY_DIRTY''' - only one processor has the memory block in its cache.  This block is writable, and the processor has written (or intends to write) to it.  This state is set when the processor requests the block with read/write privileges, and no other processor currently caches the block.&lt;br /&gt;
* '''ONLY_FRESH''' - only one processor has the memory block in its cache.  This block is writable, but the processor has not written to it.  This state is set when the processor requests the block with read privileges, and no other processor currently caches the block.&lt;br /&gt;
* '''HEAD_DIRTY''' - more than one processor has the memory block in its cache.  This block is writable, and the processor has written (or intends to write) to it.  This state is set when the processor requests the block with read/write privileges, and another processor already caches the block.&lt;br /&gt;
* '''HEAD_FRESH''' - more than one processor has the memory block in its cache.  This block is writable, but the processor has not written to it.  This state is set when the processor requests the block with read privileges, and another processor already caches the block.&lt;br /&gt;
* '''MID_VALID''' - more than two processors have the memory block in their caches, and it is readable.  The processor cache with this state is neither the Head nor the Tail of the sharing list.&lt;br /&gt;
* '''TAIL_VALID''' - two or more processors have the memory block in their caches, and it is readable.  The processor cache with this state is the Tail of the sharing list.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Below is a state diagram that depicts these states:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:cache_states.png|center]]&lt;br /&gt;
&lt;br /&gt;
The request sub-actions of memory transactions that expect a response from the directory include 17 variations of reads, writes, and locks based on the number of bytes requested and the coherency requirements.  For convenience, the notation on the edges of the above graph is derived from the terms used in the SCI specification to describe the memory transactions. &lt;br /&gt;
&lt;br /&gt;
The first value indicates the request to the directory by a node for memory access:&lt;br /&gt;
* '''Fetch R''' - indicates a request for a memory block with read privileges.&lt;br /&gt;
* '''Fetch RW''' - indicates a request for a memory block with read/write privileges.&lt;br /&gt;
* '''Data Modify''' - indicates that a processor (the HEAD or the ONLY processor caching the block) writes to the cache line.&lt;br /&gt;
&lt;br /&gt;
The value in parentheses indicates the memory state at the time of the request.&lt;br /&gt;
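&lt;br /&gt;
The rules in the Typical-set list above can be restated as a small lookup.  The following is only a sketch built from the descriptions in this article: the names Request and new_head_state are hypothetical, and the SCI standard itself also takes the memory state into account, which this simplification ignores.&lt;br /&gt;

```python
from enum import Enum


class Request(Enum):
    FETCH_R = 'Fetch R'    # request with read privileges
    FETCH_RW = 'Fetch RW'  # request with read/write privileges


def new_head_state(request, other_sharers_exist):
    """Return the stable state of the requesting cache, per the list above.

    Hypothetical helper: it only encodes the article's prose descriptions.
    """
    # Position in the sharing list: alone, or at the head of other sharers.
    position = 'HEAD' if other_sharers_exist else 'ONLY'
    # Data state: a read/write request yields DIRTY, a read request FRESH.
    data = 'DIRTY' if request is Request.FETCH_RW else 'FRESH'
    return position + '_' + data


print(new_head_state(Request.FETCH_R, False))   # ONLY_FRESH
print(new_head_state(Request.FETCH_RW, True))   # HEAD_DIRTY
```

The requester always enters at the head position, which is why only the ONLY and HEAD states appear as results here.&lt;br /&gt;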
&lt;br /&gt;
== Race Conditions ==&lt;br /&gt;
In a distributed shared memory system with caching, the emergence of race conditions is extremely likely.  This is mainly due to the lack of a bus for serialization of actions.  It is further compounded by the problem of network errors and congestion.&lt;br /&gt;
&lt;br /&gt;
The early invalidation case from Section 11.4 in Solihin is an excellent example of a race condition that can arise in a distributed system.  Recall the diagram from the text, and the cache coherence actions, as shown below.&lt;br /&gt;
&lt;br /&gt;
[[Image:EarlyInValidationRace.png|400px|center]]&lt;br /&gt;
The circled actions are as follows:&lt;br /&gt;
# ''A'' sends a read request to ''Home''.&lt;br /&gt;
# ''Home'' replies with data (but the message gets delayed).&lt;br /&gt;
# ''B'' sends a write request to ''Home''.&lt;br /&gt;
# ''Home'' sends an invalidation to ''A'', and it arrives before the ReplyD.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The SCI protocol handles this race condition, as well as many others.  The following sections discuss how the design of the protocol prevents it.&lt;br /&gt;
&lt;br /&gt;
=== Prevention in the SCI Protocol ===&lt;br /&gt;
Race conditions are almost non-existent in the SCI protocol, due primarily to the protocol's design.  The following subsections describe the mechanisms responsible and then show what happens when the Early Invalidation race arises under SCI.&lt;br /&gt;
==== Atomic Transactions ====&lt;br /&gt;
SCI's primary method for preventing race conditions is having atomic transactions.  A transaction is defined as a set of sub-actions necessary to complete some requested action, such as querying memory for data or invalidating a node's cache.  This usually takes the form of a request-response pair, where the initiating node sends a request to another node and waits for the response.  The transaction is not complete until the response returns.  Suppose, for example, that node ''A'' is in the middle of a transaction with node ''B''.  Node ''C'' then tries to make a request of node ''A''.  Node ''A'' will respond to node ''C'' that it is busy, telling node ''C'' to try again, as shown in the following diagram.&lt;br /&gt;
[[Image:AtomicBusy.png|center]]&lt;br /&gt;
Thus, in the Early Invalidation race, node ''A'' would be in the middle of a transaction, having requested a block of data from memory and not having received a response, which would prevent node ''B'' from invalidating it.  Note, however, that nodes will not stay in the busy state indefinitely.  Doing so would lead to potential deadlocks, so all transactions have timeouts that will eventually cause the transaction to fail, thus moving the node out of the busy state and making it able to respond to requests again.&lt;br /&gt;
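&lt;br /&gt;
The busy-and-retry behaviour described above can be sketched as follows.  This is an illustration only: the Node class, its method names, and the 'BUSY' and 'OK' return values are assumptions made for this example and are not taken from the SCI standard.&lt;br /&gt;

```python
class Node:
    """Hypothetical node that rejects requests while its own transaction is open."""

    def __init__(self, name):
        self.name = name
        self.in_transaction = False

    def begin_transaction(self):
        # The node has sent a request and is waiting for the response.
        self.in_transaction = True

    def end_transaction(self):
        # Called when the response arrives or the transaction times out.
        self.in_transaction = False

    def handle_request(self, sender):
        if self.in_transaction:
            return 'BUSY'   # the sender must retry later
        return 'OK'


a = Node('A')
a.begin_transaction()            # A has an outstanding request
first = a.handle_request('C')    # C is told to retry
a.end_transaction()              # A's response arrives (or it times out)
second = a.handle_request('C')   # the retry is accepted
print(first, second)             # BUSY OK
```

A timeout and an arriving response both end the transaction in this sketch, which mirrors the point that a node never stays busy indefinitely.&lt;br /&gt;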
&lt;br /&gt;
==== Head Node ====&lt;br /&gt;
Another method for preventing race conditions is to have the ''Head'' node of a sharing list perform many of the coherence actions.  As a result, only one node is performing actions such as writes and invalidations of other sharers.  Since only one node is performing these actions, the possibility of concurrent actions is decreased.  If another node wants to write, for instance, it must become the Head Node of the sharing list for the cache line to which it wants to write, displacing the current ''Head'' node as the node that can write.  All state changes in the sharing list occur at the behest of the ''Head'' node, with the exception of nodes deleting themselves from the list.&lt;br /&gt;
&lt;br /&gt;
When simply sharing a read-only copy of a line, as when the memory is in the ''FRESH'' state, the ''Head'' node is somewhat irrelevant.  All the sharing nodes have their own cached value of the cache line.  If any node wants to write, including the ''Head'' node, it must perform an additional action in order to do so.  Likewise, when a sharing list is sharing a line of data in the dirty, or ''GONE'', memory state, as long as the ''Head'' node has not written to the line, all the sharing nodes stay in the list with their cached line.  However, at the point the ''Head'' node wants to write, it can write its data immediately, but it then must invalidate all other shared copies via the forward pointers in the sharing list.&lt;br /&gt;
&lt;br /&gt;
This, however, does not protect against a race condition where the memory is ''FRESH'', the ''Head'' node wants to write, and another node also wants to join the list.  The Memory Access mechanism prevents this condition.&lt;br /&gt;
&lt;br /&gt;
==== Memory Access ====&lt;br /&gt;
The final way that the SCI protocol minimizes race conditions is by changing the directory structure.  In Solihin 11.4, there is a ''Home'' node which keeps track of the cache state and has a certain amount of knowledge about the system as a result.  In the SCI protocol, there is still a ''Home'' node of sorts, but this node is only responsible for keeping track of which node is the current ''Head''.  Such a ''Home'' node is usually the node where the memory block physically resides.  When a node wants to access a block of memory, it sends a request to the ''Home'' node of that memory block, assuming it isn't already in the sharing list for that block.  In the event that the data is already being used, the ''Home'' node will include in its response the address of the ''Head'' node of the sharing list.  This requirement necessarily serializes the order of access, as one access request is not serviced at the ''Home'' node until the previous request is finished, and the requests are processed in FIFO order.  In addition, whichever node is the ''Head'' of the sharing list will be notified by the new ''Head'' when it is demoted.  All parties involved in a potential race are thus talking to each other, preventing the race from occurring in the first place.&lt;br /&gt;
&lt;br /&gt;
Because of this requirement, the projected race condition from the previous section, where the ''Head'' node is in a ''FRESH'' state and wants to write while another node also wants to read, cannot occur.  This is because both nodes must make their request to the ''Home'' node; the ''Head'' node must request the ability to write, and the other node must request the data from the ''Home'' node.  One of these requests will make it to the ''Home'' node first, resulting in the second request being deferred to the first node to reach the ''Home'' node.  Such a scenario is pictured below.&lt;br /&gt;
[[Image:MemoryAccess.png|center]]&lt;br /&gt;
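&lt;br /&gt;
The serialization at the ''Home'' node can be sketched with a simple FIFO queue.  This is a minimal illustration under stated assumptions: the HomeNode class and its receive and service_next methods are hypothetical names, and the sketch models only the pointer to the current ''Head'', not the memory state or the data.&lt;br /&gt;

```python
from collections import deque


class HomeNode:
    """Hypothetical directory entry that tracks only the current Head."""

    def __init__(self):
        self.head = None          # no sharing list yet
        self.queue = deque()      # requests are serviced in arrival order

    def receive(self, requester):
        self.queue.append(requester)

    def service_next(self):
        # One request is fully serviced before the next one is looked at.
        requester = self.queue.popleft()
        previous_head = self.head
        self.head = requester     # the requester becomes the new Head
        # The response names the previous Head so the requester can contact it.
        return requester, previous_head


home = HomeNode()
home.receive('A')
home.receive('B')
results = [home.service_next(), home.service_next()]
print(results)   # [('A', None), ('B', 'A')]
```

Because ''B'' is told that ''A'' was the previous ''Head'', the second requester is always directed to the first, which is the deferral described above.&lt;br /&gt;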
&lt;br /&gt;
==== Solihin 11.4 Resolution ====&lt;br /&gt;
The three previous sections discussed individual parts of how the SCI protocol reduces race conditions.  Putting all three of these parts together yields the following scenario which resolves the Early Invalidation race as described in the text.&lt;br /&gt;
[[Image:SolihinSCI.png|center]]&lt;br /&gt;
&lt;br /&gt;
# ''A'' sends a request to ''Home'' for access to the memory block.  It then goes into a busy state while it waits for a response.&lt;br /&gt;
# ''B'' also sends a request to ''Home'' for access to the same memory block.  ''A'''s request is received first.&lt;br /&gt;
# ''Home'' responds to ''A'' with the data, but this response gets caught in network traffic and delayed.  This response is sent before ''Home'' processes the request from ''B''.&lt;br /&gt;
# ''Home'' responds to the request from ''B'', telling it that ''A'' is the current ''Head'' node.&lt;br /&gt;
# ''B'' then sends a request to ''A'' to tell it to demote itself.&lt;br /&gt;
# ''A'' still hasn't received the response from ''Home'', so it is still in a transaction, and it tells ''B'' that it is busy.  ''B'' will have to retry the request.&lt;br /&gt;
&lt;br /&gt;
It should be noted that ''B'' will be in the ''PENDING''&amp;lt;sup&amp;gt;&amp;lt;span class = &amp;quot;plainlinks&amp;quot;&amp;gt;[[#Definitions_and_Terms|[def]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt; state, so any requests to demote it from the ''Head'' position will be delayed.  This could lead to a chaining of nodes in the ''PENDING'' state, but as soon as node ''A'' receives its response from ''Home'' or times out, the requests in the chain will be successively accepted and processed.&lt;br /&gt;
&lt;br /&gt;
=== Other Possible Race Conditions ===&lt;br /&gt;
The design of the SCI protocol prevents many race conditions from occurring, as has already been shown.  Here are two other conditions that may arise, with a discussion of how the protocol resolves the issue.&lt;br /&gt;
&lt;br /&gt;
==== Concurrent List Deletions ====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==== Simultaneous Deletion and Invalidation ====&lt;br /&gt;
Even though the ''Head'' node performs most of the coherence actions for a sharing list, the individual nodes are able to invalidate themselves, such as when they evict the block from their cache.  This is referred to in the SCI standard as a &amp;quot;deletion&amp;quot; from the sharing list.  Deletion is accomplished by having the invalidating node &amp;quot;lock&amp;quot; itself and then inform its forward and back nodes that they should now point to each other.  This &amp;quot;locking&amp;quot; is essentially another ''PENDING'' state.  A problem could arise, then, when two neighboring nodes try to invalidate themselves at the same time, as they would both be locked and not respond to each other's message.  In this case, though, the protocol specifies that the node that is closer to the tail takes priority.  So, the deadlock and/or race condition is averted by ordering the deletion from the tail towards the head.&lt;br /&gt;
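&lt;br /&gt;
The tail-first priority rule can be illustrated with a short sketch.  It is an assumption-laden simplification: the function deletion_order is hypothetical, the sharing list is modelled as a plain Python list ordered from ''Head'' to ''Tail'', and the locking and pointer-update messages are not modelled at all.&lt;br /&gt;

```python
def deletion_order(sharing_list, deleting):
    """Order concurrent deletions from the tail toward the head.

    sharing_list: node names ordered from Head to Tail.
    deleting: set of node names that try to delete themselves at the same time.
    """
    # The node closer to the tail takes priority, so walk the list backwards.
    order = [node for node in reversed(sharing_list) if node in deleting]
    # Once every deletion completes, the survivors keep their relative order.
    remaining = [node for node in sharing_list if node not in deleting]
    return order, remaining


order, remaining = deletion_order(['H', 'M1', 'M2', 'T'], {'M1', 'M2'})
print(order)      # ['M2', 'M1']
print(remaining)  # ['H', 'T']
```

Here the neighbors M1 and M2 both want to leave; M2 is closer to the tail and goes first, after which M1 can complete its own deletion.&lt;br /&gt;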
&lt;br /&gt;
== Definitions and Terms ==&lt;br /&gt;
* ''Head Node'' - The node at the beginning of the sharing list&lt;br /&gt;
* ''Home Node'' - The node responsible for keeping track of the ''Head'' node for a sharing list for a memory block, usually the node at which the memory block resides.&lt;br /&gt;
* ''PENDING state'' - The intermediate state for a node while it is attempting to become the ''Head'' node in the sharing list.  This includes the transaction involved in querying the ''Home'' node and the transaction involved in demoting the current ''Head'' node.&lt;br /&gt;
* ''SCI'' - Scalable Coherent Interface&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&amp;lt;span id=&amp;quot;1foot&amp;quot;&amp;gt;[[#1body|1.]]&amp;lt;/span&amp;gt; &amp;quot;IEEE Standard for Scalable Coherent Interface (SCI),&amp;quot; IEEE Std 1596-1992, 1993. doi: 10.1109/IEEESTD.1993.120366&lt;br /&gt;
URL: http://ieeexplore.ieee.org.proxied.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=347683&amp;amp;isnumber=8049 &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;2foot&amp;quot;&amp;gt;[[#2body|2.]]&amp;lt;/span&amp;gt; [http://www.scizzl.com/HowSCIcohWorks.html How SCI Coherence Works] &amp;lt;br&amp;gt;&lt;/div&gt;</summary>
		<author><name>Eapotter</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch11_BB_EP&amp;diff=45188</id>
		<title>CSC/ECE 506 Spring 2011/ch11 BB EP</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch11_BB_EP&amp;diff=45188"/>
		<updated>2011-04-19T01:48:10Z</updated>

		<summary type="html">&lt;p&gt;Eapotter: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;__TOC__&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
(Unless otherwise noted, the content of this article is derived from the IEEE standardization document titled &amp;quot;IEEE Standard for Scalable Coherent Interface (SCI)&amp;quot;&amp;lt;sup&amp;gt;&amp;lt;span class = &amp;quot;plainlinks&amp;quot;&amp;gt;[[#References|[def]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt;.)&lt;br /&gt;
&lt;br /&gt;
== Background ==&lt;br /&gt;
The Scalable Coherent Interface (SCI) cache coherence protocol defines a set of states for memory, a set of states for cache lines, and a set of actions that transition between these states when processors access memory.&lt;br /&gt;
&lt;br /&gt;
Three sets of these attributes are defined for minimal, typical, and full applications:&lt;br /&gt;
&lt;br /&gt;
* '''Minimal''' - for ‘trivial but correct’ applications that require the presence of the memory in only one cache line.  It does not enable read-sharing, and is appropriate for small multi-processors or applications that don’t require significant sharing.&lt;br /&gt;
&lt;br /&gt;
* '''Typical''' - enables read-sharing of a memory location and provisions for efficiency, such as DMA transfers, local caching of data, and error recovery.  This set adds an additional stable memory state (FRESH) and multiple cache states.  This set will be the focus of this article going forward.&lt;br /&gt;
&lt;br /&gt;
* '''Full''' - adds support for pair-wise sharing, QOLB lock bit synchronization, and cleansing and washing of cache lines.&lt;br /&gt;
&lt;br /&gt;
== State Diagrams ==&lt;br /&gt;
=== Memory States ===&lt;br /&gt;
The memory states define the state of the memory block from the perspective of the home directory.  This 2-bit field is maintained in the memory tag by the home directory along with the pointer to the head of the sharing list (forwId).  This simple state model includes three stable states and one semi-stable state.&lt;br /&gt;
&lt;br /&gt;
{|border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; cellpadding=&amp;quot;2&amp;quot;&lt;br /&gt;
!colspan=&amp;quot;5&amp;quot;| Table 1: Memory States &lt;br /&gt;
|----&lt;br /&gt;
!State&lt;br /&gt;
!Description&lt;br /&gt;
!Minimal&lt;br /&gt;
!Typical&lt;br /&gt;
!Full&lt;br /&gt;
|----&lt;br /&gt;
|HOME&lt;br /&gt;
|no sharing list&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|FRESH&lt;br /&gt;
|sharing-list copy is the same as memory&lt;br /&gt;
|&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|GONE&lt;br /&gt;
|sharing-list copy may be different from memory&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|WASH*&lt;br /&gt;
|transitional state (GONE to FRESH)&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
These states are used by the coherency model to determine the state of the cache line when memory transactions are performed by individual processors.  Likewise, these memory states are impacted when these memory transactions occur.&lt;br /&gt;
&lt;br /&gt;
Below is a state diagram that depicts the transition between these memory states:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:memory_state.png|center]]&lt;br /&gt;
&lt;br /&gt;
=== Cache States ===&lt;br /&gt;
The cache-line states are maintained by each processor's cache-coherency controller.  This 7-bit field is stored in each cache line in the sharing list, along with the pointer to the next sharing-list node (forwId) and the previous sharing-list node (backId).  Seven bits enable up to 128 possible values, and SCI defines twenty-nine stable states for use in the minimal, typical, and full sets.  &lt;br /&gt;
&lt;br /&gt;
Stable states are those cache states that exist when a memory transaction is not in progress.  Their names are derived from a combination of
the position of the node in the sharing list, such as ONLY, HEAD, MID, and TAIL, and the state of the data for that node, such as FRESH, CLEAN, DIRTY, VALID, STALE, etc.&lt;br /&gt;
&lt;br /&gt;
====States of the Typical Set====&lt;br /&gt;
&lt;br /&gt;
Following are the states defined for the Typical set.  (Note that other states may be supported to provide certain interoperability with nodes implementing the Full set, but are not used when all nodes support the Typical set.)&lt;br /&gt;
&lt;br /&gt;
* '''ONLY_DIRTY''' - only one processor has the memory block in its cache.  This block is writable, and the processor has written (or intends to write) to it.  This state is set when the processor requests the block with read/write privileges, and no other processor currently caches the block.&lt;br /&gt;
* '''ONLY_FRESH''' - only one processor has the memory block in its cache.  This block is writable, but the processor has not written to it.  This state is set when the processor requests the block with read privileges, and no other processor currently caches the block.&lt;br /&gt;
* '''HEAD_DIRTY''' - more than one processor has the memory block in its cache.  This block is writable, and the processor has written (or intends to write) to it.  This state is set when the processor requests the block with read/write privileges, and another processor already caches the block.&lt;br /&gt;
* '''HEAD_FRESH''' - more than one processor has the memory block in its cache.  This block is writable, but the processor has not written to it.  This state is set when the processor requests the block with read privileges, and another processor already caches the block.&lt;br /&gt;
* '''MID_VALID''' - more than two processors have the memory block in their caches, and it is readable.  The processor cache with this state is neither the Head nor the Tail of the sharing list.&lt;br /&gt;
* '''TAIL_VALID''' - two or more processors have the memory block in their caches, and it is readable.  The processor cache with this state is the Tail of the sharing list.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Below is a state diagram that depicts these states:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:cache_states.png|center]]&lt;br /&gt;
&lt;br /&gt;
The request sub-actions of memory transactions that expect a response from the directory include 17 variations of reads, writes, and locks based on the number of bytes requested and the coherency requirements.  For convenience, the notation on the edges of the above graph is derived from the terms used in the SCI specification to describe the memory transactions. &lt;br /&gt;
&lt;br /&gt;
The first value indicates the request to the directory by a node for memory access:&lt;br /&gt;
* '''Fetch R''' - indicates a request for a memory block with read privileges.&lt;br /&gt;
* '''Fetch RW''' - indicates a request for a memory block with read/write privileges.&lt;br /&gt;
* '''Data Modify''' - indicates that a processor (the HEAD or the ONLY processor caching the block) writes to the cache line.&lt;br /&gt;
&lt;br /&gt;
The value in parentheses indicates the memory state at the time of the request.&lt;br /&gt;
&lt;br /&gt;
== Race Conditions ==&lt;br /&gt;
In a distributed shared memory system with caching, the emergence of race conditions is extremely likely.  This is mainly due to the lack of a bus for serialization of actions.  It is further compounded by the problem of network errors and congestion.&lt;br /&gt;
&lt;br /&gt;
The early invalidation case from Section 11.4 in Solihin is an excellent example of a race condition that can arise in a distributed system.  Recall the diagram from the text, and the cache coherence actions, as shown below.&lt;br /&gt;
&lt;br /&gt;
[[Image:EarlyInValidationRace.png|400px|center]]&lt;br /&gt;
The circled actions are as follows:&lt;br /&gt;
# ''A'' sends a read request to ''Home''.&lt;br /&gt;
# ''Home'' replies with data (but the message gets delayed).&lt;br /&gt;
# ''B'' sends a write request to ''Home''.&lt;br /&gt;
# ''Home'' sends an invalidation to ''A'', and it arrives before the ReplyD.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The SCI protocol handles this race condition, as well as many others.  The following sections discuss how the design of the protocol prevents it.&lt;br /&gt;
&lt;br /&gt;
=== Prevention in the SCI Protocol ===&lt;br /&gt;
Race conditions are almost non-existent in the SCI protocol, due primarily to the protocol's design.  The following subsections describe the mechanisms responsible and then show what happens when the Early Invalidation race arises under SCI.&lt;br /&gt;
==== Atomic Transactions ====&lt;br /&gt;
SCI's primary method for preventing race conditions is having atomic transactions.  A transaction is defined as a set of sub-actions necessary to complete some requested action, such as querying memory for data or invalidating a node's cache.  This usually takes the form of a request-response pair, where the initiating node sends a request to another node and waits for the response.  The transaction is not complete until the response returns.  Suppose, for example, that node ''A'' is in the middle of a transaction with node ''B''.  Node ''C'' then tries to make a request of node ''A''.  Node ''A'' will respond to node ''C'' that it is busy, telling node ''C'' to try again, as shown in the following diagram.&lt;br /&gt;
[[Image:AtomicBusy.png|center]]&lt;br /&gt;
Thus, in the Early Invalidation race, node ''A'' would be in the middle of a transaction, having requested a block of data from memory and not having received a response, which would prevent node ''B'' from invalidating it.  Note, however, that nodes will not stay in the busy state indefinitely.  Doing so would lead to potential deadlocks, so all transactions have timeouts that will eventually cause the transaction to fail, thus moving the node out of the busy state and making it able to respond to requests again.&lt;br /&gt;
&lt;br /&gt;
==== Head Node ====&lt;br /&gt;
Another method for preventing race conditions is to have the ''Head'' node of a sharing list perform many of the coherence actions.  As a result, only one node is performing actions such as writes and invalidations of other sharers.  Since only one node is performing these actions, the possibility of concurrent actions is decreased.  If another node wants to write, for instance, it must become the Head Node of the sharing list for the cache line to which it wants to write, displacing the current ''Head'' node as the node that can write.  All state changes in the sharing list occur at the behest of the ''Head'' node, with the exception of nodes deleting themselves from the list.&lt;br /&gt;
&lt;br /&gt;
When simply sharing a read-only copy of a line, as when the memory is in the ''FRESH'' state, the ''Head'' node is somewhat irrelevant.  All the sharing nodes have their own cached value of the cache line.  If any node wants to write, including the ''Head'' node, it must perform an additional action in order to do so.  Likewise, when a sharing list is sharing a line of data in the dirty, or ''GONE'', memory state, as long as the ''Head'' node has not written to the line, all the sharing nodes stay in the list with their cached line.  However, at the point the ''Head'' node wants to write, it can write its data immediately, but it then must invalidate all other shared copies via the forward pointers in the sharing list.&lt;br /&gt;
&lt;br /&gt;
This, however, does not protect against a race condition where the memory is ''FRESH'', the ''Head'' node wants to write, and another node also wants to join the list.  The Memory Access mechanism prevents this condition.&lt;br /&gt;
&lt;br /&gt;
==== Memory Access ====&lt;br /&gt;
The final way that the SCI protocol minimizes race conditions is by changing the directory structure.  In Solihin 11.4, there is a ''Home'' node which keeps track of the cache state and has a certain amount of knowledge about the system as a result.  In the SCI protocol, there is still a ''Home'' node of sorts, but this node is only responsible for keeping track of which node is the current ''Head''.  Such a ''Home'' node is usually the node where the memory block physically resides.  When a node wants to access a block of memory, it sends a request to the ''Home'' node of that memory block, assuming it isn't already in the sharing list for that block.  In the event that the data is already being used, the ''Home'' node will include in its response the address of the ''Head'' node of the sharing list.  This requirement necessarily serializes the order of access, as one access request is not serviced at the ''Home'' node until the previous request is finished, and the requests are processed in FIFO order.  In addition, whichever node is the ''Head'' of the sharing list will be notified by the new ''Head'' when it is demoted.  All parties involved in a potential race are thus talking to each other, preventing the race from occurring in the first place.&lt;br /&gt;
&lt;br /&gt;
Because of this requirement, the projected race condition from the previous section, where the ''Head'' node is in a ''FRESH'' state and wants to write while another node also wants to read, cannot occur.  This is because both nodes must make their request to the ''Home'' node; the ''Head'' node must request the ability to write, and the other node must request the data from the ''Home'' node.  One of these requests will make it to the ''Home'' node first, resulting in the second request being deferred to the first node to reach the ''Home'' node.  Such a scenario is pictured below.&lt;br /&gt;
[[Image:MemoryAccess.png|center]]&lt;br /&gt;
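&lt;br /&gt;
The serialization at the ''Home'' node can be sketched roughly as follows.  This is a minimal illustration, not the mechanism as specified by the standard: the class name Home, its methods submit and service_next, and the use of node names as identifiers are all assumptions made for this example.  It shows only that requests are serviced in FIFO order and that each response names the previous ''Head''.&lt;br /&gt;

```python
from collections import deque


class Home:
    """Hypothetical Home node: tracks only the current Head of one block."""

    def __init__(self):
        self.head = None      # name of the current Head, or None
        self.queue = deque()  # pending requests, serviced in FIFO order

    def submit(self, requester):
        # Requests are queued in arrival order.
        self.queue.append(requester)

    def service_next(self):
        """Service one request; return (requester, previous head)."""
        requester = self.queue.popleft()
        previous = self.head
        self.head = requester  # the requester becomes the new Head
        return requester, previous
```

If ''A'' and ''B'' both submit requests and ''A'' arrives first, the response to ''B'' names ''A'' as the previous ''Head'', so ''B'' must contact ''A'' before it can proceed.&lt;br /&gt;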
&lt;br /&gt;
==== Solihin 11.4 Resolution ====&lt;br /&gt;
The three previous sections discussed individual parts of how the SCI protocol reduces race conditions.  Putting all three of these parts together yields the following scenario which resolves the Early Invalidation race as described in the text.&lt;br /&gt;
[[Image:SolihinSCI.png|center]]&lt;br /&gt;
&lt;br /&gt;
# ''A'' sends a request to ''Home'' for access to the memory block.  It then goes into a busy state while it waits for a response.&lt;br /&gt;
# ''B'' also sends a request to ''Home'' for access to the same memory block.  ''A'''s request is received first.&lt;br /&gt;
# ''Home'' responds to ''A'' with the data, but this response gets caught in network traffic and delayed.  This response is sent before ''Home'' processes the request from ''B''.&lt;br /&gt;
# ''Home'' responds to the request from ''B'', telling it that ''A'' is the current ''Head'' node.&lt;br /&gt;
# ''B'' then sends a request to ''A'' to tell it to demote itself.&lt;br /&gt;
# ''A'' still hasn't received the response from ''Home'', so it is still in a transaction, and it tells ''B'' that it is busy.  ''B'' will have to retry the request.&lt;br /&gt;
&lt;br /&gt;
It should be noted that ''B'' will be in the ''PENDING''&amp;lt;sup&amp;gt;&amp;lt;span class = &amp;quot;plainlinks&amp;quot;&amp;gt;[[#Definitions_and_Terms|[def]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt; state, so any requests to demote it from the ''Head'' position will be delayed.  This could lead to a chain of nodes in the ''PENDING'' state, but as soon as node ''A'' receives its response from ''Home'' or times out, the requests in the chain will be successively accepted and processed.&lt;br /&gt;
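&lt;br /&gt;
The six steps above can be replayed in a small simulation.  This is only a sketch under stated assumptions: the function name resolve_early_invalidation, the parameter delay_ticks (how many of ''B'''s attempts happen before the delayed response reaches ''A''), and the log strings are hypothetical and do not come from the SCI standard.&lt;br /&gt;

```python
def resolve_early_invalidation(delay_ticks):
    """Replay the six steps; return B's attempt count and a message log.

    delay_ticks is a hypothetical count of how many of B's attempts occur
    before the delayed response from Home reaches A.
    """
    log = []
    arrival = max(delay_ticks, 0) + 1  # attempt on which A is free again

    # Steps 1-3: Home services A first and makes it the Head; the response
    # to A is still in flight, so A stays busy.
    head = "A"
    a_busy = True
    log.append("Home to A: data (delayed)")

    # Step 4: Home services B and reports A as the previous Head.
    previous_head, head = head, "B"
    log.append("Home to B: head is " + previous_head)

    # Steps 5-6: B asks A to demote itself and retries while A is busy.
    attempts = 0
    while True:
        attempts += 1
        if attempts == arrival:
            a_busy = False  # the delayed response has finally arrived
        if a_busy:
            log.append("A to B: busy, retry")
            continue
        log.append("A to B: demoted")
        break
    return attempts, log
```

With delay_ticks set to 2, ''B'' is told to retry twice and succeeds on its third attempt, after the response from ''Home'' has reached ''A''.&lt;br /&gt;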
&lt;br /&gt;
=== Other Possible Race Conditions ===&lt;br /&gt;
The design of the SCI protocol prevents many race conditions from occurring, as has already been shown.  Here are two other conditions that may arise, with a discussion of how the protocol resolves the issue.&lt;br /&gt;
&lt;br /&gt;
==== Concurrent List Deletions ====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==== Simultaneous Deletion and Invalidation ====&lt;br /&gt;
Even though the ''Head'' node performs most of the coherence actions for a sharing list, the individual nodes are able to invalidate themselves, such as when they evict the block from their cache.  This is referred to in the SCI standard as a &amp;quot;deletion&amp;quot; from the sharing list.  Deletion is accomplished by having the invalidating node &amp;quot;lock&amp;quot; itself and then inform its forward and back nodes that they should now point to each other.  This &amp;quot;locking&amp;quot; is essentially another ''PENDING'' state.  A problem could arise, then, when two neighboring nodes try to delete themselves at the same time, as they would both be locked and would not respond to each other's messages.  In this case, though, the protocol specifies that the node that is closest to the tail takes priority.  So, the deadlock and/or race condition is averted by ordering the deletions from the tail towards the head.&lt;br /&gt;
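&lt;br /&gt;
The tail-first ordering can be illustrated with a doubly linked list.  The sketch below is an assumption-laden model rather than the standard's procedure: the names ListNode, link, delete, distance_to_tail, and delete_concurrently are invented here, and the locking is reduced to simply choosing the order in which two simultaneous deletions are applied.&lt;br /&gt;

```python
class ListNode:
    def __init__(self, name):
        self.name = name
        self.back = None  # neighbor toward the head
        self.forw = None  # neighbor toward the tail


def link(names):
    """Build a hypothetical sharing list, head first; return nodes by name."""
    nodes = [ListNode(n) for n in names]
    for left, right in zip(nodes, nodes[1:]):
        left.forw, right.back = right, left
    return {n.name: n for n in nodes}


def delete(node):
    """Unlink node by pointing its back and forward neighbors at each other."""
    if node.back is not None:
        node.back.forw = node.forw
    if node.forw is not None:
        node.forw.back = node.back
    node.back = node.forw = None


def distance_to_tail(node):
    """Count the forward hops from node to the tail of its list."""
    steps = 0
    while node.forw is not None:
        node, steps = node.forw, steps + 1
    return steps


def delete_concurrently(requests):
    """Apply simultaneous deletions so the node nearest the tail goes first."""
    order = sorted(requests, key=distance_to_tail)
    for node in order:
        delete(node)
    return [n.name for n in order]
```

For a list A, B, C, D in which ''B'' and ''C'' delete themselves at the same time, ''C'' is removed first and ''B'' second, leaving ''A'' and ''D'' pointing at each other.&lt;br /&gt;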
&lt;br /&gt;
== Definitions and Terms ==&lt;br /&gt;
* ''Head Node'' - The node at the beginning of the sharing list&lt;br /&gt;
* ''Home Node'' - The node responsible for keeping track of the ''Head'' node for a sharing list for a memory block, usually the node at which the memory block resides.&lt;br /&gt;
* ''PENDING state'' - The intermediate state for a node while it is attempting to become the ''Head'' node in the sharing list.  This includes the transaction involved in querying the ''Home'' node and the transaction involved in demoting the current ''Head'' node.&lt;br /&gt;
* ''SCI'' - Scalable Coherent Interface&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&amp;lt;span id=&amp;quot;1foot&amp;quot;&amp;gt;[[#1body|1.]]&amp;lt;/span&amp;gt; &amp;quot;IEEE Standard for Scalable Coherent Interface (SCI),&amp;quot; IEEE Std 1596-1992, p. i, 1993. doi: 10.1109/IEEESTD.1993.120366&lt;br /&gt;
URL: http://ieeexplore.ieee.org.proxied.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=347683&amp;amp;isnumber=8049 &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;2foot&amp;quot;&amp;gt;[[#2body|2.]]&amp;lt;/span&amp;gt; [http://www.scizzl.com/HowSCIcohWorks.html How SCI Coherence Works] &amp;lt;br&amp;gt;&lt;/div&gt;</summary>
		<author><name>Eapotter</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch11_BB_EP&amp;diff=45187</id>
		<title>CSC/ECE 506 Spring 2011/ch11 BB EP</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch11_BB_EP&amp;diff=45187"/>
		<updated>2011-04-19T01:45:57Z</updated>

		<summary type="html">&lt;p&gt;Eapotter: /* Background */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;__TOC__&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
(Unless otherwise noted, the contents of this article are derived from the IEEE standardization document titled &amp;quot;IEEE Standard for Scalable Coherent Interface (SCI)&amp;quot;.)&lt;br /&gt;
&lt;br /&gt;
== Background ==&lt;br /&gt;
The Scalable Coherent Interface (SCI) cache coherence protocol defines a set of states for memory, a set of states for cache lines, and a set of actions that transition between these states when processors access memory.&lt;br /&gt;
&lt;br /&gt;
Three sets of these attributes are defined for minimal, typical, and full applications:&lt;br /&gt;
&lt;br /&gt;
* '''Minimal''' - for ‘trivial but correct’ applications that require the presence of the memory in only one cache line.  It does not enable read-sharing, and is appropriate for small multi-processors or applications that don’t require significant sharing.&lt;br /&gt;
&lt;br /&gt;
* '''Typical''' - enables read-sharing of a memory location and provisions for efficiency, such as DMA transfers, local caching of data, and error recovery.  This set adds an additional stable memory state (FRESH) and multiple cache states.  This set will be the focus of this article going forward.&lt;br /&gt;
&lt;br /&gt;
* '''Full''' - adds support for pair-wise sharing, QOLB lock bit synchronization, and cleansing and washing of cache lines.&lt;br /&gt;
&lt;br /&gt;
== State Diagrams ==&lt;br /&gt;
=== Memory States ===&lt;br /&gt;
The memory states define the state of the memory block from the perspective of the home directory.  This 2-bit field is maintained in the memory tag by the home directory along with the pointer to the head of the sharing list (forwId).  This simple state model includes three stable states and one semi-stable state.&lt;br /&gt;
&lt;br /&gt;
{|border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; cellpadding=&amp;quot;2&amp;quot;&lt;br /&gt;
!colspan=&amp;quot;5&amp;quot;| Table 1: Memory States &lt;br /&gt;
|----&lt;br /&gt;
!State&lt;br /&gt;
!Description&lt;br /&gt;
!Minimal&lt;br /&gt;
!Typical&lt;br /&gt;
!Full&lt;br /&gt;
|----&lt;br /&gt;
|HOME&lt;br /&gt;
|no sharing list&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|FRESH&lt;br /&gt;
|sharing-list copy is the same as memory&lt;br /&gt;
|&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|GONE&lt;br /&gt;
|sharing-list copy may be different from memory&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|WASH*&lt;br /&gt;
|transitional state (GONE to FRESH)&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
These states are used by the coherency model to determine the state of the cache line when memory transactions are performed by individual processors.  Likewise, these memory states are impacted when these memory transactions occur.&lt;br /&gt;
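&lt;br /&gt;
As a rough illustration of Table 1, the memory tag can be modeled as a 2-bit state plus the forwId pointer.  The sketch below is hypothetical: the numeric encodings, the MemoryTag class, the SUPPORT mapping, and the supported_states helper are assumptions made for this example, and only the state names and the option sets come from the table.&lt;br /&gt;

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class MemoryState(IntEnum):
    # Numeric encodings are assumed for illustration; only the fact that
    # four states fit in a 2-bit field comes from the text.
    HOME = 0   # no sharing list
    FRESH = 1  # sharing-list copy is the same as memory
    GONE = 2   # sharing-list copy may be different from memory
    WASH = 3   # transitional state (GONE to FRESH)


# Which option set supports which memory state, transcribed from Table 1.
SUPPORT = {
    "minimal": {MemoryState.HOME, MemoryState.GONE},
    "typical": {MemoryState.HOME, MemoryState.FRESH, MemoryState.GONE},
    "full": set(MemoryState),
}


@dataclass
class MemoryTag:
    state: MemoryState = MemoryState.HOME
    forw_id: Optional[int] = None  # node id of the sharing-list head, if any


def supported_states(option_set):
    """Return the sorted state names available in the given option set."""
    return sorted(s.name for s in SUPPORT[option_set])
```

A freshly created MemoryTag starts in the HOME state with no head pointer, which matches the description of HOME as having no sharing list.&lt;br /&gt;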
&lt;br /&gt;
Below is a state diagram that depicts the transition between these memory states:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:memory_state.png|center]]&lt;br /&gt;
&lt;br /&gt;
=== Cache States ===&lt;br /&gt;
The cache-line states are maintained by each processor's cache-coherency controller.  This 7-bit field is stored in each cache line in the sharing list, along with the pointer to the next sharing-list node (forwId) and the previous sharing-list node (backId).  Seven bits enable up to 128 possible values, and SCI defines twenty-nine stable states for use in the minimal, typical, and full sets.&lt;br /&gt;
&lt;br /&gt;
Stable states are those cache states that exist when a memory transaction is not in progress.  Their names are derived from a combination of the position of the node in the sharing list, such as ONLY, HEAD, MID, and TAIL, and the state of the data for that node, such as FRESH, CLEAN, DIRTY, VALID, STALE, etc.&lt;br /&gt;
&lt;br /&gt;
====States of the Typical Set====&lt;br /&gt;
&lt;br /&gt;
Following are the states defined for the Typical set.  (Note that other states may be supported to provide certain interoperability with nodes implementing the Full set, but are not used when all nodes support the Typical set.)&lt;br /&gt;
&lt;br /&gt;
* '''ONLY_DIRTY''' - only one processor has the memory block in its cache.  This block is writable, and the processor has written (or intends to write) to it.  This state is set when the processor requests the block with read/write privileges, and no other processor currently caches the block.&lt;br /&gt;
* '''ONLY_FRESH''' - only one processor has the memory block in its cache.  This block is writable, but the processor has not written to it.  This state is set when the processor requests the block with read privileges, and no other processor currently caches the block.&lt;br /&gt;
* '''HEAD_DIRTY''' - more than one processor has the memory block in its cache.  This block is writable, and the processor has written (or intends to write) to it.  This state is set when the processor requests the block with read/write privileges, and another processor already caches the block.&lt;br /&gt;
* '''HEAD_FRESH''' - more than one processor has the memory block in its cache.  This block is writable, but the processor has not written to it.  This state is set when the processor requests the block with read privileges, and another processor already caches the block.&lt;br /&gt;
* '''MID_VALID''' - more than two processors have the memory block in their caches, and it is readable.  The processor cache with this state is neither the Head nor the Tail of the sharing list.&lt;br /&gt;
* '''TAIL_VALID''' - two or more processors have the memory block in their caches, and it is readable.  The processor cache with this state is the Tail of the sharing list.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Below is a state diagram that depicts these states:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:cache_states.png|center]]&lt;br /&gt;
&lt;br /&gt;
The request sub-actions of memory transactions that expect a response from the directory include 17 variations of reads, writes, and locks based on the number of bytes requested and the coherency requirements.  For convenience, the notation on the edges of the above graph is derived from the terms used in the SCI specification to describe the memory transactions. &lt;br /&gt;
&lt;br /&gt;
The first value indicates the request to the directory by a node for memory access:&lt;br /&gt;
* '''Fetch R''' - indicates a request for a memory block with read privileges.&lt;br /&gt;
* '''Fetch RW''' - indicates a request for a memory block with read/write privileges.&lt;br /&gt;
* '''Data Modify''' - indicates that a processor (the HEAD or the ONLY processor caching the block) writes to the cache line.&lt;br /&gt;
&lt;br /&gt;
The value in parentheses indicates the memory state at the time of the request.&lt;br /&gt;
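&lt;br /&gt;
The edge notation can be read as a lookup from a request and a memory state to the requester's new cache state.  The table below is a partial sketch that paraphrases the state descriptions in this article; it is not a transcription of the diagram or of the standard, and the names TRANSITIONS, edge_label, and next_cache_state are hypothetical.  Pairs that the descriptions do not settle are left out rather than guessed.&lt;br /&gt;

```python
# (request, memory state at request time) -> requester's resulting cache state.
# Entries paraphrase the state descriptions in this article; pairs not listed
# here are deliberately left out rather than guessed.
TRANSITIONS = {
    ("Fetch R", "HOME"): "ONLY_FRESH",
    ("Fetch RW", "HOME"): "ONLY_DIRTY",
    ("Fetch R", "FRESH"): "HEAD_FRESH",
    ("Fetch RW", "FRESH"): "HEAD_DIRTY",
    ("Fetch RW", "GONE"): "HEAD_DIRTY",
}


def edge_label(request, memory_state):
    """Format an edge label as the request followed by the memory state."""
    return "{} ({})".format(request, memory_state)


def next_cache_state(request, memory_state):
    """Look up the requester's new cache state, or None if not tabulated."""
    return TRANSITIONS.get((request, memory_state))
```

For example, a read request that finds the memory in the HOME state is labeled Fetch R (HOME) and leaves the requester in ONLY_FRESH.&lt;br /&gt;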
&lt;br /&gt;
== Race Conditions ==&lt;br /&gt;
In a distributed shared memory system with caching, the emergence of race conditions is extremely likely.  This is mainly due to the lack of a bus for serialization of actions.  It is further compounded by the problem of network errors and congestion.&lt;br /&gt;
&lt;br /&gt;
The early invalidation case from Section 11.4 in Solihin is an excellent example of a race condition that can arise in a distributed system.  Recall the diagram from the text, and the cache coherence actions, as shown below.&lt;br /&gt;
&lt;br /&gt;
[[Image:EarlyInValidationRace.png|400px|center]]&lt;br /&gt;
The circled actions are as follows:&lt;br /&gt;
# ''A'' sends a read request to ''Home''.&lt;br /&gt;
# ''Home'' replies with data (but the message gets delayed).&lt;br /&gt;
# ''B'' sends a write request to ''Home''.&lt;br /&gt;
# ''Home'' sends invalidation to ''A'', and it arrives before the ReplyD.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The SCI protocol has a way to handle this race condition, as well as many others, and the following sections will discuss how the SCI protocol design can prevent this race condition.&lt;br /&gt;
&lt;br /&gt;
=== Prevention in the SCI Protocol ===&lt;br /&gt;
Race conditions are almost non-existent in the SCI protocol, due primarily to the protocol's design.  The following sections briefly discuss how this is accomplished and then show what happens if the Early Invalidation condition arises in the SCI protocol.&lt;br /&gt;
==== Atomic Transactions ====&lt;br /&gt;
SCI's primary method for preventing race conditions is having atomic transactions.  A transaction is defined as a set of sub-actions necessary to complete some requested action, such as querying memory for data or invalidating a node's cache.  This usually takes the form of a request-response pair, where the initiating node sends a request to another node and waits for the response.  The transaction is not complete until the response returns.  Suppose, for example, that node ''A'' is in the middle of a transaction with node ''B''.  Node ''C'' then tries to make a request of node ''A''.  Node ''A'' will respond to node ''C'' that it is busy, telling node ''C'' to try again, as shown in the following diagram.&lt;br /&gt;
[[Image:AtomicBusy.png|center]]&lt;br /&gt;
Thus, in the Early Invalidation race, node ''A'' would be in the middle of a transaction, having requested a block of data from memory and not having received a response, which would prevent node ''B'' from invalidating it.  Note, however, that nodes will not stay in the busy state indefinitely.  Doing so would lead to potential deadlocks, so all transactions have timeouts that will eventually cause the transaction to fail, thus moving the node out of the busy state and making it able to respond to requests again.&lt;br /&gt;
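&lt;br /&gt;
The busy-and-retry behavior, including the timeout, can be sketched as follows.  This is a simplified model under assumptions: the Node class, its method names, the abstract tick as a unit of time, and the request_with_retry helper are all invented for illustration and are not defined by the SCI standard.&lt;br /&gt;

```python
class Node:
    """Hypothetical node that handles one transaction at a time."""

    def __init__(self, name, timeout_ticks=3):
        self.name = name
        self.timeout_ticks = timeout_ticks  # assumed timeout, in abstract ticks
        self.waiting_ticks = None           # None means no open transaction

    def start_transaction(self):
        # The node has sent a request and now waits for the response.
        self.waiting_ticks = 0

    def receive_response(self):
        # The response closes the transaction.
        self.waiting_ticks = None

    def tick(self):
        # Advance time; an overdue transaction fails and frees the node.
        if self.waiting_ticks is None:
            return
        self.waiting_ticks += 1
        if self.waiting_ticks == self.timeout_ticks:
            self.waiting_ticks = None

    def handle_request(self, sender):
        # A node inside a transaction refuses new requests with a busy reply.
        if self.waiting_ticks is not None:
            return "BUSY"
        return "OK"


def request_with_retry(target, sender, max_attempts):
    """Retry until the target accepts; one tick passes between attempts."""
    for attempt in range(1, max_attempts + 1):
        if target.handle_request(sender) == "OK":
            return attempt
        target.tick()
    return None
```

If node ''A'' never receives its response, a requester keeps getting a busy reply until the assumed timeout expires, after which its next attempt is accepted.&lt;br /&gt;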
&lt;br /&gt;
==== Head Node ====&lt;br /&gt;
Another method for preventing race conditions is to have the ''Head'' node of a sharing list perform many of the coherence actions.  As a result, only one node is performing actions such as writes and invalidations of other sharers.  Since only one node is performing these actions, the possibility of concurrent actions is decreased.  If another node wants to write, for instance, it must become the Head Node of the sharing list for the cache line to which it wants to write, displacing the current ''Head'' node as the node that can write.  All state changes in the sharing list occur at the behest of the ''Head'' node, with the exception of nodes deleting themselves from the list.&lt;br /&gt;
&lt;br /&gt;
When nodes simply share a read-only copy of a line, as when the memory is in the ''FRESH'' state, the ''Head'' node is somewhat irrelevant.  All the sharing nodes have their own cached copy of the line.  If any node wants to write, including the ''Head'' node, it must perform an additional action in order to do so.  Likewise, when a sharing list shares a line of data in the dirty, or ''GONE'', memory state, all the sharing nodes stay in the list with their cached line as long as the ''Head'' node has not written to the line.  However, once the ''Head'' node wants to write, it can write its data immediately, but it must then invalidate all other shared copies via the forward pointers in the sharing list.&lt;br /&gt;
&lt;br /&gt;
This, however, does not protect against a race condition where the memory is ''FRESH'', the ''Head'' node wants to write, and another node also wants to join the list.  The Memory Access mechanism prevents this condition.&lt;br /&gt;
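&lt;br /&gt;
The ''Head'' node's write in the ''GONE'' case, followed by invalidation along the forward pointers, can be sketched like this.  The names Sharer, build_list, and head_write are hypothetical, and the model ignores messaging, busy replies, and the additional action required in the ''FRESH'' case.&lt;br /&gt;

```python
class Sharer:
    def __init__(self, name):
        self.name = name
        self.forw = None   # next node toward the tail
        self.valid = True  # whether the cached copy is usable
        self.value = None


def build_list(names, value):
    """Link hypothetical sharers head-to-tail, all caching the same value."""
    nodes = [Sharer(n) for n in names]
    for node, nxt in zip(nodes, nodes[1:]):
        node.forw = nxt
    for node in nodes:
        node.value = value
    return nodes[0]


def head_write(head, new_value):
    """Head writes first, then walks the forward pointers invalidating sharers."""
    head.value = new_value
    invalidated = []
    node = head.forw
    while node is not None:
        node.valid = False
        invalidated.append(node.name)
        node = node.forw
    head.forw = None  # the head is now the only member of the list
    return invalidated
```

For a list A, B, C with ''A'' as the ''Head'', a write by ''A'' invalidates ''B'' and then ''C'', in list order, and leaves ''A'' as the only sharer.&lt;br /&gt;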
&lt;br /&gt;
==== Memory Access ====&lt;br /&gt;
The final way that the SCI protocol minimizes race conditions is by changing the directory structure.  In Solihin 11.4, there is a ''Home'' node which keeps track of the cache state and has a certain amount of knowledge about the system as a result.  In the SCI protocol, there is still a ''Home'' node of sorts, but this node is only responsible for keeping track of which node is the current ''Head''.  Such a ''Home'' node is usually the node where the memory block physically resides.  When a node wants to access a block of memory, it sends a request to the ''Home'' node of that memory block, assuming it isn't already in the sharing list for that block.  In the event that the data is already being used, the ''Home'' node will include in its response the address of the ''Head'' node of the sharing list.  This requirement necessarily serializes the order of access, as one access request is not serviced at the ''Home'' node until the previous request is finished, and the requests are processed in FIFO order.  In addition, whichever node is the ''Head'' of the sharing list will be notified by the new ''Head'' when it is demoted.  All parties involved in a potential race are thus talking to each other, preventing the race from occurring in the first place.&lt;br /&gt;
&lt;br /&gt;
Because of this requirement, the projected race condition from the previous section, where the ''Head'' node is in a ''FRESH'' state and wants to write while another node also wants to read, cannot occur.  This is because both nodes must make their request to the ''Home'' node; the ''Head'' node must request the ability to write, and the other node must request the data from the ''Home'' node.  One of these requests will make it to the ''Home'' node first, resulting in the second request being deferred to the first node to reach the ''Home'' node.  Such a scenario is pictured below.&lt;br /&gt;
[[Image:MemoryAccess.png|center]]&lt;br /&gt;
&lt;br /&gt;
==== Solihin 11.4 Resolution ====&lt;br /&gt;
The three previous sections discussed individual parts of how the SCI protocol reduces race conditions.  Putting all three of these parts together yields the following scenario which resolves the Early Invalidation race as described in the text.&lt;br /&gt;
[[Image:SolihinSCI.png|center]]&lt;br /&gt;
&lt;br /&gt;
# ''A'' sends a request to ''Home'' for access to the memory block.  It then goes into a busy state while it waits for a response.&lt;br /&gt;
# ''B'' also sends a request to ''Home'' for access to the same memory block.  ''A'''s request is received first.&lt;br /&gt;
# ''Home'' responds to ''A'' with the data, but this response gets caught in network traffic and delayed.  This response is sent before ''Home'' processes the request from ''B''.&lt;br /&gt;
# ''Home'' responds to the request from ''B'', telling it that ''A'' is the current ''Head'' node.&lt;br /&gt;
# ''B'' then sends a request to ''A'' to tell it to demote itself.&lt;br /&gt;
# ''A'' still hasn't received the response from ''Home'', so it is still in a transaction, and it tells ''B'' that it is busy.  ''B'' will have to retry the request.&lt;br /&gt;
&lt;br /&gt;
It should be noted that ''B'' will be in the ''PENDING''&amp;lt;sup&amp;gt;&amp;lt;span class = &amp;quot;plainlinks&amp;quot;&amp;gt;[[#Definitions_and_Terms|[def]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt; state, so any requests to demote it from the ''Head'' position will be delayed.  This could lead to a chain of nodes in the ''PENDING'' state, but as soon as node ''A'' receives its response from ''Home'' or times out, the requests in the chain will be successively accepted and processed.&lt;br /&gt;
&lt;br /&gt;
=== Other Possible Race Conditions ===&lt;br /&gt;
The design of the SCI protocol prevents many race conditions from occurring, as has already been shown.  Here are two other conditions that may arise, with a discussion of how the protocol resolves the issue.&lt;br /&gt;
&lt;br /&gt;
==== Concurrent List Deletions ====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==== Simultaneous Deletion and Invalidation ====&lt;br /&gt;
Even though the ''Head'' node performs most of the coherence actions for a sharing list, the individual nodes are able to invalidate themselves, such as when they evict the block from their cache.  This is referred to in the SCI standard as a &amp;quot;deletion&amp;quot; from the sharing list.  Deletion is accomplished by having the invalidating node &amp;quot;lock&amp;quot; itself and then inform its forward and back nodes that they should now point to each other.  This &amp;quot;locking&amp;quot; is essentially another ''PENDING'' state.  A problem could arise, then, when two neighboring nodes try to delete themselves at the same time, as they would both be locked and would not respond to each other's messages.  In this case, though, the protocol specifies that the node that is closest to the tail takes priority.  So, the deadlock and/or race condition is averted by ordering the deletions from the tail towards the head.&lt;br /&gt;
&lt;br /&gt;
== Definitions and Terms ==&lt;br /&gt;
* ''Head Node'' - The node at the beginning of the sharing list&lt;br /&gt;
* ''Home Node'' - The node responsible for keeping track of the ''Head'' node for a sharing list for a memory block, usually the node at which the memory block resides.&lt;br /&gt;
* ''PENDING state'' - The intermediate state for a node while it is attempting to become the ''Head'' node in the sharing list.  This includes the transaction involved in querying the ''Home'' node and the transaction involved in demoting the current ''Head'' node.&lt;br /&gt;
* ''SCI'' - Scalable Coherent Interface&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&amp;lt;span id=&amp;quot;1foot&amp;quot;&amp;gt;[[#1body|1.]]&amp;lt;/span&amp;gt; &amp;quot;IEEE Standard for Scalable Coherent Interface (SCI),&amp;quot; IEEE Std 1596-1992, p. i, 1993. doi: 10.1109/IEEESTD.1993.120366&lt;br /&gt;
URL: http://ieeexplore.ieee.org.proxied.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=347683&amp;amp;isnumber=8049 &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;2foot&amp;quot;&amp;gt;[[#2body|2.]]&amp;lt;/span&amp;gt; [http://www.scizzl.com/HowSCIcohWorks.html How SCI Coherence Works] &amp;lt;br&amp;gt;&lt;/div&gt;</summary>
		<author><name>Eapotter</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch11_BB_EP&amp;diff=45186</id>
		<title>CSC/ECE 506 Spring 2011/ch11 BB EP</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch11_BB_EP&amp;diff=45186"/>
		<updated>2011-04-19T01:40:25Z</updated>

		<summary type="html">&lt;p&gt;Eapotter: /* Memory States */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;__TOC__&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Background ==&lt;br /&gt;
The Scalable Coherent Interface (SCI) cache coherence protocol defines a set of states for memory, a set of states for cache lines, and a set of actions that transition between these states when processors access memory.&lt;br /&gt;
&lt;br /&gt;
Three sets of these attributes are defined for minimal, typical, and full applications:&lt;br /&gt;
&lt;br /&gt;
* '''Minimal''' - for ‘trivial but correct’ applications that require the presence of the memory in only one cache line.  It does not enable read-sharing, and is appropriate for small multi-processors or applications that don’t require significant sharing.&lt;br /&gt;
&lt;br /&gt;
* '''Typical''' - enables read-sharing of a memory location and provisions for efficiency, such as DMA transfers, local caching of data, and error recovery.  This set adds an additional stable memory state (FRESH) and multiple cache states.  This set will be the focus of this article going forward.&lt;br /&gt;
&lt;br /&gt;
* '''Full''' - adds support for pair-wise sharing, QOLB lock bit synchronization, and cleansing and washing of cache lines.&lt;br /&gt;
&lt;br /&gt;
== State Diagrams ==&lt;br /&gt;
=== Memory States ===&lt;br /&gt;
The memory states define the state of the memory block from the perspective of the home directory.  This 2-bit field is maintained in the memory tag by the home directory along with the pointer to the head of the sharing list (forwId).  This simple state model includes three stable states and one semi-stable state.&lt;br /&gt;
&lt;br /&gt;
{|border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; cellpadding=&amp;quot;2&amp;quot;&lt;br /&gt;
!colspan=&amp;quot;5&amp;quot;| Table 1: Memory States &lt;br /&gt;
|----&lt;br /&gt;
!State&lt;br /&gt;
!Description&lt;br /&gt;
!Minimal&lt;br /&gt;
!Typical&lt;br /&gt;
!Full&lt;br /&gt;
|----&lt;br /&gt;
|HOME&lt;br /&gt;
|no sharing list&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|FRESH&lt;br /&gt;
|sharing-list copy is the same as memory&lt;br /&gt;
|&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|GONE&lt;br /&gt;
|sharing-list copy may be different from memory&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|WASH*&lt;br /&gt;
|transitional state (GONE to FRESH)&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
These states are used by the coherency model to determine the state of the cache line when memory transactions are performed by individual processors.  Likewise, these memory states are impacted when these memory transactions occur.&lt;br /&gt;
&lt;br /&gt;
Below is a state diagram that depicts the transition between these memory states:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:memory_state.png|center]]&lt;br /&gt;
&lt;br /&gt;
=== Cache States ===&lt;br /&gt;
The cache-line states are maintained by each processor's cache-coherency controller.  This 7-bit field is stored in each cache line in the sharing list, along with the pointer to the next sharing-list node (forwId) and the previous sharing-list node (backId).  Seven bits enable up to 128 possible values, and SCI defines twenty-nine stable states for use in the minimal, typical, and full sets.&lt;br /&gt;
&lt;br /&gt;
Stable states are those cache states that exist when a memory transaction is not in progress.  Their names are derived from a combination of the position of the node in the sharing list, such as ONLY, HEAD, MID, and TAIL, and the state of the data for that node, such as FRESH, CLEAN, DIRTY, VALID, STALE, etc.&lt;br /&gt;
&lt;br /&gt;
====States of the Typical Set====&lt;br /&gt;
&lt;br /&gt;
Following are the states defined for the Typical set.  (Note that other states may be supported to provide certain interoperability with nodes implementing the Full set, but are not used when all nodes support the Typical set.)&lt;br /&gt;
&lt;br /&gt;
* '''ONLY_DIRTY''' - only one processor has the memory block in its cache.  This block is writable, and the processor has written (or intends to write) to it.  This state is set when the processor requests the block with read/write privileges, and no other processor currently caches the block.&lt;br /&gt;
* '''ONLY_FRESH''' - only one processor has the memory block in its cache.  This block is writable, but the processor has not written to it.  This state is set when the processor requests the block with read privileges, and no other processor currently caches the block.&lt;br /&gt;
* '''HEAD_DIRTY''' - more than one processor has the memory block in its cache.  This block is writable, and the processor has written (or intends to write) to it.  This state is set when the processor requests the block with read/write privileges, and another processor already caches the block.&lt;br /&gt;
* '''HEAD_FRESH''' - more than one processor has the memory block in its cache.  This block is writable, but the processor has not written to it.  This state is set when the processor requests the block with read privileges, and another processor already caches the block.&lt;br /&gt;
* '''MID_VALID''' - more than two processors have the memory block in their caches, and it is readable.  The processor cache with this state is neither the Head nor the Tail of the sharing list.&lt;br /&gt;
* '''TAIL_VALID''' - two or more processors have the memory block in their caches, and it is readable.  The processor cache with this state is the Tail of the sharing list.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Below is a state diagram that depicts these states:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:cache_states.png|center]]&lt;br /&gt;
&lt;br /&gt;
The request sub-actions of memory transactions that expect a response from the directory include 17 variations of reads, writes, and locks based on the number of bytes requested and the coherency requirements.  For convenience, the notation on the edges of the above graph is derived from the terms used in the SCI specification to describe the memory transactions. &lt;br /&gt;
&lt;br /&gt;
The first value indicates the request to the directory by a node for memory access:&lt;br /&gt;
* '''Fetch R''' - indicates a request for a memory block with read privileges.&lt;br /&gt;
* '''Fetch RW''' - indicates a request for a memory block with read/write privileges.&lt;br /&gt;
* '''Data Modify''' - indicates that a processor (the HEAD or the ONLY processor caching the block) writes to the cache line.&lt;br /&gt;
&lt;br /&gt;
The value in parentheses indicates the memory state at the time of the request.&lt;br /&gt;
&lt;br /&gt;
== Race Conditions ==&lt;br /&gt;
In a distributed shared memory system with caching, the emergence of race conditions is extremely likely.  This is mainly due to the lack of a bus for serialization of actions.  It is further compounded by the problem of network errors and congestion.&lt;br /&gt;
&lt;br /&gt;
The early invalidation case from Section 11.4 in Solihin is an excellent example of a race condition that can arise in a distributed system.  Recall the diagram from the text, and the cache coherence actions, as shown below.&lt;br /&gt;
&lt;br /&gt;
[[Image:EarlyInValidationRace.png|400px|center]]&lt;br /&gt;
The circled actions are as follows:&lt;br /&gt;
# ''A'' sends a read request to ''Home''.&lt;br /&gt;
# ''Home'' replies with data (but the message gets delayed).&lt;br /&gt;
# ''B'' sends a write request to ''Home''.&lt;br /&gt;
# ''Home'' sends invalidation to ''A'', and it arrives before the ReplyD.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The SCI protocol has a way to handle this race condition, as well as many others, and the following sections will discuss how the SCI protocol design can prevent this race condition.&lt;br /&gt;
&lt;br /&gt;
=== Prevention in the SCI Protocol ===&lt;br /&gt;
Race conditions are almost non-existent in the SCI protocol, due primarily to the protocol's design.  The following sections briefly discuss how this is accomplished and then show what happens if the Early Invalidation condition arises in the SCI protocol.&lt;br /&gt;
==== Atomic Transactions ====&lt;br /&gt;
SCI's primary method for preventing race conditions is having atomic transactions.  A transaction is defined as a set of sub-actions necessary to complete some requested action, such as querying memory for a data or invalidating a node's cache.  This usually takes the form a a request-response pair, where the initiating node sends a request to another node and waits for the response.  The transaction is not complete until the response returns.  Suppose, for example, that node ''A'' is in the middle of a transaction with node ''B''.  Node ''C'' then tries to make a request of node ''A''.  Node ''A'' will respond to node ''C'' that it is busy, telling node ''C'' to try again, as shown in the following diagram.&lt;br /&gt;
[[Image:AtomicBusy.png|center]]&lt;br /&gt;
Thus, in the Early Invalidation race, node ''A'' would be in the middle of a transaction, having requested a block of data from memory and not having received a response, which would prevent node ''B'' from invalidating it.  Note, however, that nodes will not stay in the busy state indefinitely.  Doing so could lead to deadlock, so all transactions have timeouts that eventually cause the transaction to fail, moving the node out of the busy state and making it able to respond to requests again.&lt;br /&gt;
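&lt;br /&gt;
The busy-and-retry behaviour described above can be sketched in a few lines of Python.  This is a minimal illustration under stated assumptions, not the SCI packet format: the class name Node, its method names, and the BUSY and OK reply strings are all hypothetical.&lt;br /&gt;

```python
class Node:
    """Toy node that refuses new requests while a transaction is open."""

    def __init__(self, name):
        self.name = name
        # True between sending a request and receiving its response.
        self.in_transaction = False

    def begin_transaction(self):
        self.in_transaction = True

    def end_transaction(self):
        # Called when the response arrives or the transaction times out.
        self.in_transaction = False

    def handle_request(self, sender):
        # A busy node does not act on the request; the sender must retry.
        if self.in_transaction:
            return "BUSY"
        return "OK"


a = Node("A")
c = Node("C")
a.begin_transaction()               # A is waiting on a response from B
first_reply = a.handle_request(c)   # C is told to try again
a.end_transaction()                 # response (or timeout) ends the transaction
second_reply = a.handle_request(c)  # the retry now succeeds
print(first_reply, second_reply)    # prints "BUSY OK"
```

In this sketch the timeout is modelled only as a second way of reaching end_transaction; a fuller model would attach a deadline to each open transaction.&lt;br /&gt;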
&lt;br /&gt;
==== Head Node ====&lt;br /&gt;
Another method for preventing race conditions is to have the ''Head'' node of a sharing list perform many of the coherence actions.  As a result, only one node is performing actions such as writes and invalidations of other sharers.  Since only one node is performing these actions, the possibility of concurrent actions is decreased.  If another node wants to write, for instance, it must become the Head Node of the sharing list for the cache line to which it wants to write, displacing the current ''Head'' node as the node that can write.  All state changes in the sharing list occur at the behest of the ''Head'' node, with the exception of nodes deleting themselves from the list.&lt;br /&gt;
&lt;br /&gt;
When simply sharing a read-only copy of a block, as when the memory is in the ''FRESH'' state, the ''Head'' node is somewhat irrelevant.  All the sharing nodes have their own cached value of the cache line.  If any node wants to write, including the ''Head'' node, it must perform an additional action in order to do so.  Likewise, when a sharing list is sharing a line of data in the dirty, or ''GONE'', memory state, as long as the ''Head'' node has not written to the line, all the sharing nodes stay in the list with their cached line.  However, at the point the ''Head'' node wants to write, it can write its data immediately, but it then must invalidate all other shared copies, via the forward pointers in the sharing list.&lt;br /&gt;
&lt;br /&gt;
This, however, does not protect against a race condition where the memory is ''FRESH'', the ''Head'' node wants to write, and another node also wants to join the list.  The Memory Access mechanism prevents this condition.&lt;br /&gt;
&lt;br /&gt;
==== Memory Access ====&lt;br /&gt;
The final way that the SCI protocol minimizes race conditions is by changing the directory structure.  In Solihin 11.4, there is a ''Home'' node which keeps track of the cache state and has a certain amount of knowledge about the system as a result.  In the SCI protocol, there is still a ''Home'' node of sorts, but this node is only responsible for keeping track of which node is the current ''Head''.  Such a ''Home'' node is usually the node where the memory block physically resides.  When a node wants to access a block of memory, it sends a request to the ''Home'' node of that memory block, assuming it isn't already in the sharing list for that block.  In the event that the data is already being used, the ''Home'' node will include in its response the address of the ''Head'' node of the sharing list.  This requirement necessarily serializes the order of access, as one access request is not serviced at the ''Home'' node until the previous request is finished, and the requests are processed in FIFO order.  In addition, whichever node is the ''Head'' of the sharing list will be notified by the new ''Head'' when it is demoted.  All parties involved in a potential race are thus talking to each other, preventing the race from occurring in the first place.&lt;br /&gt;
&lt;br /&gt;
Because of this requirement, the projected race condition from the previous section, where the ''Head'' node is in a ''FRESH'' state and wants to write while another node also wants to read, cannot occur.  This is because both nodes must make their request to the ''Home'' node; the ''Head'' node must request the ability to write, and the other node must request the data from the ''Home'' node.  One of these requests will make it to the ''Home'' node first, resulting in the second request being deferred to the first node to reach the ''Home'' node.  Such a scenario is pictured below.&lt;br /&gt;
[[Image:MemoryAccess.png|center]]&lt;br /&gt;
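&lt;br /&gt;
The serialization at the ''Home'' node can be sketched as a FIFO queue whose only state is the identity of the current ''Head''.  The sketch below is an assumption-laden toy, not the directory format from the standard: the class name HomeDirectory and its methods are hypothetical, and nodes are represented by plain strings.&lt;br /&gt;

```python
from collections import deque


class HomeDirectory:
    """Toy home node: tracks only the current head of the sharing list."""

    def __init__(self):
        self.head = None      # no sharing list yet
        self.queue = deque()  # pending requests, served in arrival order

    def receive(self, requester):
        self.queue.append(requester)

    def service_next(self):
        # Serve the oldest request: the requester becomes the new head and
        # is told who the previous head was (None if the block was unshared).
        requester = self.queue.popleft()
        previous_head = self.head
        self.head = requester
        return requester, previous_head


home = HomeDirectory()
home.receive("A")  # A's request arrives first
home.receive("B")
replies = [home.service_next(), home.service_next()]
print(replies)     # prints [('A', None), ('B', 'A')]
```

Because the second requester learns the identity of the first, it must go on to contact that node directly, which is what keeps all parties in a potential race talking to each other.&lt;br /&gt;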
&lt;br /&gt;
==== Solihin 11.4 Resolution ====&lt;br /&gt;
The three previous sections discussed individual parts of how the SCI protocol reduces race conditions.  Putting all three of these parts together yields the following scenario which resolves the Early Invalidation race as described in the text.&lt;br /&gt;
[[Image:SolihinSCI.png|center]]&lt;br /&gt;
&lt;br /&gt;
# ''A'' sends a request to ''Home'' for access to the memory block.  It then goes into a busy state while it waits for a response.&lt;br /&gt;
# ''B'' also sends a request to ''Home'' for access to the same memory block.  ''A'''s request is received first.&lt;br /&gt;
# ''Home'' responds to ''A'' with the data, but this response gets caught in network traffic and delayed.  This response is sent before ''Home'' processes the request from ''B''.&lt;br /&gt;
# ''Home'' responds to the request from ''B'', telling it that ''A'' is the current ''Head'' node.&lt;br /&gt;
# ''B'' then sends a request to ''A'' to tell it to demote itself.&lt;br /&gt;
# ''A'' still hasn't received the response from ''Home'', so it is still in a transaction, and it tells ''B'' that it is busy.  ''B'' will have to retry the request.&lt;br /&gt;
&lt;br /&gt;
Note that ''B'' will be in the ''PENDING''&amp;lt;sup&amp;gt;&amp;lt;span class = &amp;quot;plainlinks&amp;quot;&amp;gt;[[#Definitions_and_Terms|[def]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt; state, so any requests to demote it from the ''Head'' position will be delayed.  This could lead to a chaining of nodes in the ''PENDING'' state, but as soon as node ''A'' receives its response from ''Home'' or times out, the requests in the chain will be successively accepted and processed.&lt;br /&gt;
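&lt;br /&gt;
The six steps above can be replayed in a short Python sketch.  It is only an illustration under simplifying assumptions: nodes are plain dictionaries, the delayed data reply is modelled as arriving right after the first BUSY response, and the function name run_scenario and the log messages are hypothetical.&lt;br /&gt;

```python
def run_scenario():
    """Replay the six steps with plain dictionaries standing in for nodes."""
    log = []
    home = {"head": None}
    a = {"busy": False}

    # Step 1: A asks Home for the block and waits for the response.
    a["busy"] = True
    # Steps 2-3: Home serves A first; the data reply is delayed in the network.
    home["head"] = "A"
    # Step 4: Home serves B and names A as the current head.
    old_head = home["head"]
    home["head"] = "B"
    log.append("Home tells B that the head is " + old_head)
    # Steps 5-6: B asks A to demote itself; A is still busy, so B must retry.
    attempts = 0
    while True:
        attempts += 1
        if a["busy"]:
            log.append("A replies BUSY to B")
            # The delayed reply finally arrives (a timeout would also end the wait).
            a["busy"] = False
        else:
            log.append("A accepts the demotion request from B")
            break
    return attempts, log


attempts, log = run_scenario()
print(attempts)  # prints 2
```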
&lt;br /&gt;
=== Other Possible Race Conditions ===&lt;br /&gt;
The design of the SCI protocol prevents many race conditions from occurring, as has already been shown.  Here are two other conditions that may arise, with a discussion of how the protocol resolves the issue.&lt;br /&gt;
&lt;br /&gt;
==== Concurrent List Deletions ====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==== Simultaneous Deletion and Invalidation ====&lt;br /&gt;
Even though the ''Head'' node performs most of the coherence actions for a sharing list, the individual nodes are able to invalidate themselves, such as when they evict the block from their cache.  This is referred to in the SCI standard as a &amp;quot;deletion&amp;quot; from the sharing list.  Deletion is accomplished by having the invalidating node &amp;quot;lock&amp;quot; itself and then inform its forward and back nodes that they should now point to each other.  This &amp;quot;locking&amp;quot; is essentially another ''PENDING'' state.  A problem could arise, then, when two neighboring nodes try to invalidate themselves at the same time, as they would both be locked and not respond to each other's message.  In this case, though, the protocol specifies that the node that is closest to the tail takes priority.  So, the deadlock and/or race condition is averted by ordering the deletion from the tail towards the head.&lt;br /&gt;
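&lt;br /&gt;
The tail-first rule can be illustrated with a small Python sketch.  It models only the ordering, not the lock and pointer-update messages, and the function name deletion_order and the node labels are hypothetical.&lt;br /&gt;

```python
def deletion_order(sharing_list, deleting):
    """Return the order in which nodes leave the list, tail-most first."""
    # Position 0 is the head; larger indices are closer to the tail.
    order = sorted(deleting, key=sharing_list.index, reverse=True)
    remaining = list(sharing_list)
    for node in order:
        remaining.remove(node)  # its neighbours now point at each other
    return order, remaining


order, remaining = deletion_order(["H", "M1", "M2", "T"], ["M1", "M2"])
print(order, remaining)  # prints ['M2', 'M1'] ['H', 'T']
```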
&lt;br /&gt;
== Definitions and Terms ==&lt;br /&gt;
* ''Head Node'' - The node at the beginning of the sharing list&lt;br /&gt;
* ''Home Node'' - The node responsible for keeping track of the ''Head'' node for a sharing list for a memory block, usually the node at which the memory block resides.&lt;br /&gt;
* ''PENDING state'' - The intermediate state for a node while it is attempting to become the ''Head'' node in the sharing list.  This includes the transaction involved in querying the ''Home'' node and the transaction involved in demoting the current ''Head'' node.&lt;br /&gt;
* ''SCI'' - Scalable Coherent Interface&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&amp;lt;span id=&amp;quot;1foot&amp;quot;&amp;gt;[[#1body|1.]]&amp;lt;/span&amp;gt; &amp;quot;IEEE Standard for Scalable Coherent Interface (SCI),&amp;quot; IEEE Std 1596-1992, pp. i, 1993. doi: 10.1109/IEEESTD.1993.120366&lt;br /&gt;
URL: http://ieeexplore.ieee.org.proxied.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=347683&amp;amp;isnumber=8049 &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;2foot&amp;quot;&amp;gt;[[#2body|2.]]&amp;lt;/span&amp;gt; [http://www.scizzl.com/HowSCIcohWorks.html How SCI Coherence Works] &amp;lt;br&amp;gt;&lt;/div&gt;</summary>
		<author><name>Eapotter</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch11_BB_EP&amp;diff=45184</id>
		<title>CSC/ECE 506 Spring 2011/ch11 BB EP</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch11_BB_EP&amp;diff=45184"/>
		<updated>2011-04-19T01:38:59Z</updated>

		<summary type="html">&lt;p&gt;Eapotter: /* Memory States */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;__TOC__&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Background ==&lt;br /&gt;
The Scalable Coherent Interface (SCI) cache coherence protocol defines a set of states for memory, a set of states for cache lines, and a set of actions that transition between these states when processors access memory.&lt;br /&gt;
&lt;br /&gt;
Three sets of these attributes are defined for minimal, typical, and full applications:&lt;br /&gt;
&lt;br /&gt;
* '''Minimal''' - for ‘trivial but correct’ applications that require the presence of the memory in only one cache line.  It does not enable read-sharing, and is appropriate for small multi-processors or applications that don’t require significant sharing.&lt;br /&gt;
&lt;br /&gt;
* '''Typical''' - enables read-sharing of a memory location and provisions for efficiency, such as DMA transfers, local caching of data, and error recovery.  This set adds an additional stable memory state (FRESH) and multiple cache states.  This set will be the focus of this article going forward.&lt;br /&gt;
&lt;br /&gt;
* '''Full''' - adds support for pair-wise sharing, QOLB lock bit synchronization, and cleansing and washing of cache lines.&lt;br /&gt;
&lt;br /&gt;
== State Diagrams ==&lt;br /&gt;
=== Memory States ===&lt;br /&gt;
The memory states define the state of the memory block from the perspective of the home directory.  This 2-bit field is maintained in the memory tag by the home directory along with the pointer to the head of the sharing list (forwId).  This simple state model includes three stable states and one semi-stable state.&lt;br /&gt;
&lt;br /&gt;
{|border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; cellpadding=&amp;quot;2&amp;quot;&lt;br /&gt;
!colspan=&amp;quot;5&amp;quot;| Table 1: Memory States &lt;br /&gt;
|----&lt;br /&gt;
!State&lt;br /&gt;
!Description&lt;br /&gt;
!Minimal&lt;br /&gt;
!Typical&lt;br /&gt;
!Full&lt;br /&gt;
|----&lt;br /&gt;
|HOME&lt;br /&gt;
|no sharing list&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|FRESH&lt;br /&gt;
|sharing-list copy is the same as memory&lt;br /&gt;
|&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|GONE&lt;br /&gt;
|sharing-list copy may be different from memory&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|WASH*&lt;br /&gt;
|transitional state (GONE to FRESH)&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|}&lt;br /&gt;
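&lt;br /&gt;
Table 1 can also be written down as a small lookup structure.  The following Python sketch is illustrative only: the numeric values assigned to the states are placeholders chosen to fit a 2-bit field, not the encoding defined by the standard, and the names MemoryState, SUPPORTED_BY, and states_for are hypothetical.&lt;br /&gt;

```python
from enum import Enum


class MemoryState(Enum):
    # The numeric values are placeholders; the standard defines the real encoding.
    HOME = 0
    FRESH = 1
    GONE = 2
    WASH = 3


# Which option sets use each state, copied from Table 1.
SUPPORTED_BY = {
    MemoryState.HOME: {"minimal", "typical", "full"},
    MemoryState.FRESH: {"typical", "full"},
    MemoryState.GONE: {"minimal", "typical", "full"},
    MemoryState.WASH: {"full"},
}


def states_for(option_set):
    """List the memory states used by one option set, in table order."""
    return [s.name for s in MemoryState if option_set in SUPPORTED_BY[s]]


print(states_for("typical"))  # prints ['HOME', 'FRESH', 'GONE']
```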
&lt;br /&gt;
Below is a state diagram that depicts the transition between these states:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:memory_state.png|center]]&lt;br /&gt;
&lt;br /&gt;
=== Cache States ===&lt;br /&gt;
The cache-line states are maintained by each processor's cache-coherency controller.  This 7-bit field is stored in each cache line in the sharing list, along with the pointer to the next sharing-list node (forwId) and the previous sharing-list node (backId).  Seven bits enable up to 128 possible values, and SCI defines twenty-nine stable states for use in the minimal, typical, and full sets.&lt;br /&gt;
&lt;br /&gt;
Stable states are those cache states that exist when a memory transaction is not in process.  Their names are derived from a combination of&lt;br /&gt;
the position of the node in the sharing list, such as ONLY, HEAD, MID, and TAIL, and the state of the data for that node, such as FRESH, CLEAN, DIRTY, VALID, STALE, etc.&lt;br /&gt;
&lt;br /&gt;
====States of the Typical Set====&lt;br /&gt;
&lt;br /&gt;
Following are the states defined for the Typical set.  (Note that other states may be supported to provide certain interoperability with nodes implementing the Full set, but are not used when all nodes support the Typical set.)&lt;br /&gt;
&lt;br /&gt;
* '''ONLY_DIRTY''' - only one processor has the memory block in its cache.  This block is writable, and the processor has written (or intends to write) to it.  This state is set when the processor requests the block with read/write privileges, and no other processor currently caches the block.&lt;br /&gt;
* '''ONLY_FRESH''' - only one processor has the memory block in its cache.  This block is writable, but the processor has not written to it.  This state is set when the processor requests the block with read privileges, and no other processor currently caches the block.&lt;br /&gt;
* '''HEAD_DIRTY''' - more than one processor has the memory block in its cache.  This block is writable, and the processor has written (or intends to write) to it.  This state is set when the processor requests the block with read/write privileges, and another processor already caches the block.&lt;br /&gt;
* '''HEAD_FRESH''' - more than one processor has the memory block in its cache.  This block is writable, but the processor has not written to it.  This state is set when the processor requests the block with read privileges, and another processor already caches the block.&lt;br /&gt;
* '''MID_VALID''' - more than two processors have the memory block in their caches, and it is readable.  The processor cache with this state is neither the Head nor the Tail of the sharing list.&lt;br /&gt;
* '''TAIL_VALID''' - two or more processors have the memory block in their caches, and it is readable.  The processor cache with this state is the Tail of the sharing list.&lt;br /&gt;
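&lt;br /&gt;
The rule for the four ONLY and HEAD states listed above can be condensed into a small function.  This is a sketch of the description in this article, not code from the standard; the function name initial_cache_state and the string arguments are hypothetical.&lt;br /&gt;

```python
def initial_cache_state(request, others_cached):
    """Map a fetch to the cache state the requester ends up in.

    request is "R" (Fetch R) or "RW" (Fetch RW); others_cached says whether
    another processor already caches the block.
    """
    position = "HEAD" if others_cached else "ONLY"
    data = "DIRTY" if request == "RW" else "FRESH"
    return position + "_" + data


print(initial_cache_state("RW", False))  # prints ONLY_DIRTY
print(initial_cache_state("R", True))    # prints HEAD_FRESH
```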
&lt;br /&gt;
&lt;br /&gt;
Below is a state diagram that depicts these states:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:cache_states.png|center]]&lt;br /&gt;
&lt;br /&gt;
The request sub-actions of memory transactions that expect a response from the directory include 17 variations of reads, writes, and locks based on the number of bytes requested and the coherency requirements.  For convenience, the notation on the edges of the above graph is derived from the terms used in the SCI specification to describe the memory transactions. &lt;br /&gt;
&lt;br /&gt;
The first value indicates the request to the directory by a node for memory access:&lt;br /&gt;
* '''Fetch R''' - indicates a request for a memory block with read privileges.&lt;br /&gt;
* '''Fetch RW''' - indicates a request for a memory block with read/write privileges.&lt;br /&gt;
* '''Data Modify''' - indicates that a processor (the HEAD or the ONLY processor caching the block) writes to the cache line.&lt;br /&gt;
&lt;br /&gt;
The value in parentheses indicates the memory state at the time of the request.&lt;br /&gt;
&lt;br /&gt;
== Race Conditions ==&lt;br /&gt;
In a distributed shared memory system with caching, the emergence of race conditions is extremely likely.  This is mainly due to the lack of a bus for serialization of actions.  It is further compounded by the problem of network errors and congestion.&lt;br /&gt;
&lt;br /&gt;
The early invalidation case from Section 11.4 in Solihin is an excellent example of a race condition that can arise in a distributed system.  Recall the diagram from the text, and the cache coherence actions, as shown below.&lt;br /&gt;
&lt;br /&gt;
[[Image:EarlyInValidationRace.png|400px|center]]&lt;br /&gt;
The circled actions are as follows:&lt;br /&gt;
# ''A'' sends a read request to ''Home''.&lt;br /&gt;
# ''Home'' replies with data (but the message gets delayed).&lt;br /&gt;
# ''B'' sends a write request to ''Home''.&lt;br /&gt;
# ''Home'' sends an invalidation to ''A'', and it arrives before the ReplyD.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The SCI protocol is designed to prevent this race condition, as well as many others.  The following sections discuss how.&lt;br /&gt;
&lt;br /&gt;
=== Prevention in the SCI Protocol ===&lt;br /&gt;
Race conditions are almost non-existent in the SCI protocol, due primarily to the protocol's design.  The following subsections describe the mechanisms responsible and then show what happens when the Early Invalidation scenario arises under SCI.&lt;br /&gt;
==== Atomic Transactions ====&lt;br /&gt;
SCI's primary method for preventing race conditions is having atomic transactions.  A transaction is defined as a set of sub-actions necessary to complete some requested action, such as querying memory for data or invalidating a node's cache.  This usually takes the form of a request-response pair, where the initiating node sends a request to another node and waits for the response.  The transaction is not complete until the response returns.  Suppose, for example, that node ''A'' is in the middle of a transaction with node ''B''.  Node ''C'' then tries to make a request of node ''A''.  Node ''A'' will respond to node ''C'' that it is busy, telling node ''C'' to try again, as shown in the following diagram.&lt;br /&gt;
[[Image:AtomicBusy.png|center]]&lt;br /&gt;
Thus, in the Early Invalidation race, node ''A'' would be in the middle of a transaction, having requested a block of data from memory and not having received a response, which would prevent node ''B'' from invalidating it.  Note, however, that nodes will not stay in the busy state indefinitely.  Doing so could lead to deadlock, so all transactions have timeouts that eventually cause the transaction to fail, moving the node out of the busy state and making it able to respond to requests again.&lt;br /&gt;
&lt;br /&gt;
==== Head Node ====&lt;br /&gt;
Another method for preventing race conditions is to have the ''Head'' node of a sharing list perform many of the coherence actions.  As a result, only one node is performing actions such as writes and invalidations of other sharers.  Since only one node is performing these actions, the possibility of concurrent actions is decreased.  If another node wants to write, for instance, it must become the Head Node of the sharing list for the cache line to which it wants to write, displacing the current ''Head'' node as the node that can write.  All state changes in the sharing list occur at the behest of the ''Head'' node, with the exception of nodes deleting themselves from the list.&lt;br /&gt;
&lt;br /&gt;
When simply sharing a read-only copy of a block, as when the memory is in the ''FRESH'' state, the ''Head'' node is somewhat irrelevant.  All the sharing nodes have their own cached value of the cache line.  If any node wants to write, including the ''Head'' node, it must perform an additional action in order to do so.  Likewise, when a sharing list is sharing a line of data in the dirty, or ''GONE'', memory state, as long as the ''Head'' node has not written to the line, all the sharing nodes stay in the list with their cached line.  However, at the point the ''Head'' node wants to write, it can write its data immediately, but it then must invalidate all other shared copies, via the forward pointers in the sharing list.&lt;br /&gt;
&lt;br /&gt;
This, however, does not protect against a race condition where the memory is ''FRESH'', the ''Head'' node wants to write, and another node also wants to join the list.  The Memory Access mechanism prevents this condition.&lt;br /&gt;
&lt;br /&gt;
==== Memory Access ====&lt;br /&gt;
The final way that the SCI protocol minimizes race conditions is by changing the directory structure.  In Solihin 11.4, there is a ''Home'' node which keeps track of the cache state and has a certain amount of knowledge about the system as a result.  In the SCI protocol, there is still a ''Home'' node of sorts, but this node is only responsible for keeping track of which node is the current ''Head''.  Such a ''Home'' node is usually the node where the memory block physically resides.  When a node wants to access a block of memory, it sends a request to the ''Home'' node of that memory block, assuming it isn't already in the sharing list for that block.  In the event that the data is already being used, the ''Home'' node will include in its response the address of the ''Head'' node of the sharing list.  This requirement necessarily serializes the order of access, as one access request is not serviced at the ''Home'' node until the previous request is finished, and the requests are processed in FIFO order.  In addition, whichever node is the ''Head'' of the sharing list will be notified by the new ''Head'' when it is demoted.  All parties involved in a potential race are thus talking to each other, preventing the race from occurring in the first place.&lt;br /&gt;
&lt;br /&gt;
Because of this requirement, the projected race condition from the previous section, where the ''Head'' node is in a ''FRESH'' state and wants to write while another node also wants to read, cannot occur.  This is because both nodes must make their request to the ''Home'' node; the ''Head'' node must request the ability to write, and the other node must request the data from the ''Home'' node.  One of these requests will make it to the ''Home'' node first, resulting in the second request being deferred to the first node to reach the ''Home'' node.  Such a scenario is pictured below.&lt;br /&gt;
[[Image:MemoryAccess.png|center]]&lt;br /&gt;
&lt;br /&gt;
==== Solihin 11.4 Resolution ====&lt;br /&gt;
The three previous sections discussed individual parts of how the SCI protocol reduces race conditions.  Putting all three of these parts together yields the following scenario which resolves the Early Invalidation race as described in the text.&lt;br /&gt;
[[Image:SolihinSCI.png|center]]&lt;br /&gt;
&lt;br /&gt;
# ''A'' sends a request to ''Home'' for access to the memory block.  It then goes into a busy state while it waits for a response.&lt;br /&gt;
# ''B'' also sends a request to ''Home'' for access to the same memory block.  ''A'''s request is received first.&lt;br /&gt;
# ''Home'' responds to ''A'' with the data, but this response gets caught in network traffic and delayed.  This response is sent before ''Home'' processes the request from ''B''.&lt;br /&gt;
# ''Home'' responds to the request from ''B'', telling it that ''A'' is the current ''Head'' node.&lt;br /&gt;
# ''B'' then sends a request to ''A'' to tell it to demote itself.&lt;br /&gt;
# ''A'' still hasn't received the response from ''Home'', so it is still in a transaction, and it tells ''B'' that it is busy.  ''B'' will have to retry the request.&lt;br /&gt;
&lt;br /&gt;
Note that ''B'' will be in the ''PENDING''&amp;lt;sup&amp;gt;&amp;lt;span class = &amp;quot;plainlinks&amp;quot;&amp;gt;[[#Definitions_and_Terms|[def]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt; state, so any requests to demote it from the ''Head'' position will be delayed.  This could lead to a chaining of nodes in the ''PENDING'' state, but as soon as node ''A'' receives its response from ''Home'' or times out, the requests in the chain will be successively accepted and processed.&lt;br /&gt;
&lt;br /&gt;
=== Other Possible Race Conditions ===&lt;br /&gt;
The design of the SCI protocol prevents many race conditions from occurring, as has already been shown.  Here are two other conditions that may arise, with a discussion of how the protocol resolves the issue.&lt;br /&gt;
&lt;br /&gt;
==== Concurrent List Deletions ====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==== Simultaneous Deletion and Invalidation ====&lt;br /&gt;
Even though the ''Head'' node performs most of the coherence actions for a sharing list, the individual nodes are able to invalidate themselves, such as when they evict the block from their cache.  This is referred to in the SCI standard as a &amp;quot;deletion&amp;quot; from the sharing list.  Deletion is accomplished by having the invalidating node &amp;quot;lock&amp;quot; itself and then inform its forward and back nodes that they should now point to each other.  This &amp;quot;locking&amp;quot; is essentially another ''PENDING'' state.  A problem could arise, then, when two neighboring nodes try to invalidate themselves at the same time, as they would both be locked and not respond to each other's message.  In this case, though, the protocol specifies that the node that is closest to the tail takes priority.  So, the deadlock and/or race condition is averted by ordering the deletion from the tail towards the head.&lt;br /&gt;
&lt;br /&gt;
== Definitions and Terms ==&lt;br /&gt;
* ''Head Node'' - The node at the beginning of the sharing list&lt;br /&gt;
* ''Home Node'' - The node responsible for keeping track of the ''Head'' node for a sharing list for a memory block, usually the node at which the memory block resides.&lt;br /&gt;
* ''PENDING state'' - The intermediate state for a node while it is attempting to become the ''Head'' node in the sharing list.  This includes the transaction involved in querying the ''Home'' node and the transaction involved in demoting the current ''Head'' node.&lt;br /&gt;
* ''SCI'' - Scalable Coherent Interface&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&amp;lt;span id=&amp;quot;1foot&amp;quot;&amp;gt;[[#1body|1.]]&amp;lt;/span&amp;gt; &amp;quot;IEEE Standard for Scalable Coherent Interface (SCI),&amp;quot; IEEE Std 1596-1992, pp. i, 1993. doi: 10.1109/IEEESTD.1993.120366&lt;br /&gt;
URL: http://ieeexplore.ieee.org.proxied.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=347683&amp;amp;isnumber=8049 &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;2foot&amp;quot;&amp;gt;[[#2body|2.]]&amp;lt;/span&amp;gt; [http://www.scizzl.com/HowSCIcohWorks.html How SCI Coherence Works] &amp;lt;br&amp;gt;&lt;/div&gt;</summary>
		<author><name>Eapotter</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch11_BB_EP&amp;diff=45183</id>
		<title>CSC/ECE 506 Spring 2011/ch11 BB EP</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch11_BB_EP&amp;diff=45183"/>
		<updated>2011-04-19T01:38:27Z</updated>

		<summary type="html">&lt;p&gt;Eapotter: /* Memory States */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;__TOC__&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Background ==&lt;br /&gt;
The Scalable Coherent Interface (SCI) cache coherence protocol defines a set of states for memory, a set of states for cache lines, and a set of actions that transition between these states when processors access memory.&lt;br /&gt;
&lt;br /&gt;
Three sets of these attributes are defined for minimal, typical, and full applications:&lt;br /&gt;
&lt;br /&gt;
* '''Minimal''' - for ‘trivial but correct’ applications that require the presence of the memory in only one cache line.  It does not enable read-sharing, and is appropriate for small multi-processors or applications that don’t require significant sharing.&lt;br /&gt;
&lt;br /&gt;
* '''Typical''' - enables read-sharing of a memory location and provisions for efficiency, such as DMA transfers, local caching of data, and error recovery.  This set adds an additional stable memory state (FRESH) and multiple cache states.  This set will be the focus of this article going forward.&lt;br /&gt;
&lt;br /&gt;
* '''Full''' - adds support for pair-wise sharing, QOLB lock bit synchronization, and cleansing and washing of cache lines.&lt;br /&gt;
&lt;br /&gt;
== State Diagrams ==&lt;br /&gt;
=== Memory States ===&lt;br /&gt;
The memory states define the state of the memory block from the perspective of the home directory.  This 2-bit field is maintained in the memory tag by the home directory along with the pointer to the head of the sharing list (forwId).  This simple state model includes three stable states and one semi-stable state.&lt;br /&gt;
&lt;br /&gt;
{|border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; cellpadding=&amp;quot;2&amp;quot;&lt;br /&gt;
!colspan=&amp;quot;5&amp;quot;| Table 1: Memory States &lt;br /&gt;
|----&lt;br /&gt;
!State&lt;br /&gt;
!Description&lt;br /&gt;
!Minimal&lt;br /&gt;
!Typical&lt;br /&gt;
!Full&lt;br /&gt;
|----&lt;br /&gt;
|HOME&lt;br /&gt;
|no sharing list&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|FRESH&lt;br /&gt;
|sharing-list copy is the same as memory&lt;br /&gt;
|&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|GONE&lt;br /&gt;
|sharing-list copy may be different from memory&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|WASH*&lt;br /&gt;
|transitional state (GONE to FRESH)&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Below is a state diagram that depicts the transition between these states:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:memory_states.png|center]]&lt;br /&gt;
&lt;br /&gt;
=== Cache States ===&lt;br /&gt;
The cache-line states are maintained by each processor's cache-coherency controller.  This 7-bit field is stored in each cache line in the sharing list, along with the pointer to the next sharing-list node (forwId) and the previous sharing-list node (backId).  Seven bits enable up to 128 possible values, and SCI defines twenty-nine stable states for use in the minimal, typical, and full sets.&lt;br /&gt;
&lt;br /&gt;
Stable states are those cache states that exist when a memory transaction is not in process.  Their names are derived from a combination of&lt;br /&gt;
the position of the node in the sharing list, such as ONLY, HEAD, MID, and TAIL, and the state of the data for that node, such as FRESH, CLEAN, DIRTY, VALID, STALE, etc.&lt;br /&gt;
&lt;br /&gt;
====States of the Typical Set====&lt;br /&gt;
&lt;br /&gt;
Following are the states defined for the Typical set.  (Note that other states may be supported to provide certain interoperability with nodes implementing the Full set, but are not used when all nodes support the Typical set.)&lt;br /&gt;
&lt;br /&gt;
* '''ONLY_DIRTY''' - only one processor has the memory block in its cache.  This block is writable, and the processor has written (or intends to write) to it.  This state is set when the processor requests the block with read/write privileges, and no other processor currently caches the block.&lt;br /&gt;
* '''ONLY_FRESH''' - only one processor has the memory block in its cache.  This block is writable, but the processor has not written to it.  This state is set when the processor requests the block with read privileges, and no other processor currently caches the block.&lt;br /&gt;
* '''HEAD_DIRTY''' - more than one processor has the memory block in its cache.  This block is writable, and the processor has written (or intends to write) to it.  This state is set when the processor requests the block with read/write privileges, and another processor already caches the block.&lt;br /&gt;
* '''HEAD_FRESH''' - more than one processor has the memory block in its cache.  This block is writable, but the processor has not written to it.  This state is set when the processor requests the block with read privileges, and another processor already caches the block.&lt;br /&gt;
* '''MID_VALID''' - more than two processors have the memory block in their caches, and it is readable.  The processor cache with this state is neither the Head nor the Tail of the sharing list.&lt;br /&gt;
* '''TAIL_VALID''' - two or more processors have the memory block in their caches, and it is readable.  The processor cache with this state is the Tail of the sharing list.&lt;br /&gt;
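&lt;br /&gt;
The entry conditions in the list above can be sketched in a few lines of Python.  This is only an illustration of the descriptions given here, not the full transition table of the standard; the names CacheState and state_after_fetch are hypothetical and do not come from the SCI specification.&lt;br /&gt;

```python
from enum import Enum


class CacheState(Enum):
    """Stable cache states of the Typical set, as listed above."""
    ONLY_DIRTY = "ONLY_DIRTY"
    ONLY_FRESH = "ONLY_FRESH"
    HEAD_DIRTY = "HEAD_DIRTY"
    HEAD_FRESH = "HEAD_FRESH"
    MID_VALID = "MID_VALID"
    TAIL_VALID = "TAIL_VALID"


def state_after_fetch(read_write: bool, already_cached: bool) -> CacheState:
    """Return the state the requesting cache enters (hypothetical helper).

    read_write     -- True for a read/write request, False for read-only.
    already_cached -- True if another processor already caches the block.
    """
    if already_cached:
        # The requester joins an existing sharing list as its new Head.
        return CacheState.HEAD_DIRTY if read_write else CacheState.HEAD_FRESH
    # No sharing list exists, so the requester is the only sharer.
    return CacheState.ONLY_DIRTY if read_write else CacheState.ONLY_FRESH


print(state_after_fetch(read_write=False, already_cached=True).name)
```

The requester always enters an ONLY or HEAD state; MID_VALID and TAIL_VALID are reached only when a later requester displaces the node from the Head position.&lt;br /&gt;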
&lt;br /&gt;
&lt;br /&gt;
Below is a state diagram that depicts these states:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:cache_states.png|center]]&lt;br /&gt;
&lt;br /&gt;
The request sub-actions of memory transactions that expect a response from the directory include 17 variations of reads, writes, and locks based on the number of bytes requested and the coherency requirements.  For convenience, the notation on the edges of the above graph is derived from the terms used in the SCI specification to describe the memory transactions. &lt;br /&gt;
&lt;br /&gt;
The first value indicates the request to the directory by a node for memory access:&lt;br /&gt;
* '''Fetch R''' - indicates a request for a memory block with read privileges.&lt;br /&gt;
* '''Fetch RW''' - indicates a request for a memory block with read/write privileges.&lt;br /&gt;
* '''Data Modify''' - indicates that a processor (the HEAD or the ONLY processor caching the block) writes to the cache line.&lt;br /&gt;
&lt;br /&gt;
The value in parentheses indicates the memory state at the time of the request.&lt;br /&gt;
&lt;br /&gt;
== Race Conditions ==&lt;br /&gt;
In a distributed shared memory system with caching, the emergence of race conditions is extremely likely.  This is mainly due to the lack of a bus for serialization of actions.  It is further compounded by the problem of network errors and congestion.&lt;br /&gt;
&lt;br /&gt;
The early invalidation case from Section 11.4 in Solihin is an excellent example of a race condition that can arise in a distributed system.  Recall the diagram from the text, and the cache coherence actions, as shown below.&lt;br /&gt;
&lt;br /&gt;
[[Image:EarlyInValidationRace.png|400px|center]]&lt;br /&gt;
The circled actions are as follows:&lt;br /&gt;
# ''A'' sends a read request to ''Home''.&lt;br /&gt;
# ''Home'' replies with data (but the message gets delayed).&lt;br /&gt;
# ''B'' sends a write request to ''Home''.&lt;br /&gt;
# ''Home'' sends an invalidation to ''A'', and it arrives before the ReplyD from step 2.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The SCI protocol handles this race condition, as well as many others; the following sections discuss how its design prevents it.&lt;br /&gt;
&lt;br /&gt;
=== Prevention in the SCI Protocol ===&lt;br /&gt;
Race conditions are almost non-existent in the SCI protocol, due primarily to the protocol's design.  The following subsections describe the mechanisms responsible, and then show what happens when the Early Invalidation scenario arises under SCI.&lt;br /&gt;
==== Atomic Transactions ====&lt;br /&gt;
SCI's primary method for preventing race conditions is the use of atomic transactions.  A transaction is defined as a set of sub-actions necessary to complete some requested action, such as querying memory for data or invalidating a node's cache.  This usually takes the form of a request-response pair, where the initiating node sends a request to another node and waits for the response.  The transaction is not complete until the response returns.  Suppose, for example, that node ''A'' is in the middle of a transaction with node ''B''.  Node ''C'' then tries to make a request of node ''A''.  Node ''A'' will respond to node ''C'' that it is busy, telling node ''C'' to try again, as shown in the following diagram.&lt;br /&gt;
[[Image:AtomicBusy.png|center]]&lt;br /&gt;
Thus, in the Early Invalidation race, node ''A'' would be in the middle of a transaction, having requested a block of data from memory and not yet received a response, which would prevent node ''B'' from invalidating it.  Note, however, that nodes will not stay in the busy state indefinitely.  Doing so could lead to deadlock, so all transactions have timeouts that eventually cause the transaction to fail, moving the node out of the busy state and making it able to respond to requests again.&lt;br /&gt;
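&lt;br /&gt;
The busy-and-retry behaviour described above can be sketched as follows.  This is a minimal model under stated assumptions: the Node class, the "BUSY" and "ACCEPTED" reply strings, and the send_with_retry helper are all hypothetical names chosen for illustration, and a timeout is modelled simply as another call to end_transaction.&lt;br /&gt;

```python
class Node:
    """Hypothetical node that rejects requests while a transaction is open."""

    def __init__(self, name):
        self.name = name
        self.in_transaction = False

    def begin_transaction(self):
        self.in_transaction = True

    def end_transaction(self):
        # Called when the response arrives or the transaction times out.
        self.in_transaction = False

    def handle_request(self, sender):
        # A node in the middle of a transaction answers BUSY.
        return "BUSY" if self.in_transaction else "ACCEPTED"


def send_with_retry(target, sender, max_attempts, on_retry=None):
    """Retry until the target accepts; return the number of attempts used."""
    for attempt in range(1, max_attempts + 1):
        if target.handle_request(sender) == "ACCEPTED":
            return attempt
        if on_retry is not None:
            on_retry(attempt)
    return None  # gave up after max_attempts


a = Node("A")
c = Node("C")
a.begin_transaction()  # A is waiting on a response from B


def response_arrives(attempt):
    # Assumed for the demo: A's response arrives after C's second failed try.
    if attempt == 2:
        a.end_transaction()


attempts = send_with_retry(a, c, max_attempts=5, on_retry=response_arrives)
print(attempts)  # 3
```

Node ''C'' is refused twice while ''A'' is busy and succeeds on the third attempt, once ''A'' has left its transaction.&lt;br /&gt;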
&lt;br /&gt;
==== Head Node ====&lt;br /&gt;
Another method for preventing race conditions is to have the ''Head'' node of a sharing list perform many of the coherence actions.  As a result, only one node is performing actions such as writes and invalidations of other sharers.  Since only one node is performing these actions, the possibility of concurrent actions is decreased.  If another node wants to write, for instance, it must become the Head Node of the sharing list for the cache line to which it wants to write, displacing the current ''Head'' node as the node that can write.  All state changes in the sharing list occur at the behest of the ''Head'' node, with the exception of nodes deleting themselves from the list.&lt;br /&gt;
&lt;br /&gt;
When simply sharing a read-only copy of a line, as when the memory is in the ''FRESH'' state, the ''Head'' node is somewhat irrelevant.  All the sharing nodes have their own cached copy of the line.  If any node wants to write, including the ''Head'' node, it must perform an additional action in order to do so.  Likewise, when a sharing list is sharing a line of data in the dirty, or ''GONE'', memory state, as long as the ''Head'' node has not written to the line, all the sharing nodes stay in the list with their cached line.  However, once the ''Head'' node wants to write, it can write its data immediately, but it must then invalidate all other shared copies via the forward pointers in the sharing list.&lt;br /&gt;
&lt;br /&gt;
This, however, does not protect against a race condition where the memory is ''FRESH'', the ''Head'' node wants to write, and another node also wants to join the list.  The Memory Access mechanism prevents this condition.&lt;br /&gt;
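&lt;br /&gt;
The invalidation walk performed by the ''Head'' node when it writes can be sketched as a traversal of forward pointers.  The Sharer class and the build_list and head_write helpers are hypothetical, and the sketch ignores the messages and busy replies that a real implementation would exchange at each step.&lt;br /&gt;

```python
class Sharer:
    """Hypothetical sharing-list entry with a forward pointer."""

    def __init__(self, name):
        self.name = name
        self.forw = None   # next node toward the tail
        self.valid = True


def build_list(names):
    """Link the named nodes from head to tail and return the head."""
    nodes = [Sharer(n) for n in names]
    for left, right in zip(nodes, nodes[1:]):
        left.forw = right
    return nodes[0]


def head_write(head):
    """Head writes, then invalidates every other sharer in list order."""
    invalidated = []
    node = head.forw
    while node is not None:
        node.valid = False
        invalidated.append(node.name)
        node = node.forw
    head.forw = None       # the Head is now the only sharer
    return invalidated


head = build_list(["H", "M", "T"])
print(head_write(head))  # ['M', 'T']
```

Because only the ''Head'' runs this walk, two nodes never invalidate the same list at the same time.&lt;br /&gt;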
&lt;br /&gt;
==== Memory Access ====&lt;br /&gt;
The final way that the SCI protocol minimizes race conditions is by changing the directory structure.  In Solihin 11.4, there is a ''Home'' node which keeps track of the cache state and has a certain amount of knowledge about the system as a result.  In the SCI protocol, there is still a ''Home'' node of sorts, but this node is only responsible for keeping track of which node is the current ''Head''.  Such a ''Home'' node is usually the node where the memory block physically resides.  When a node wants to access a block of memory, it sends a request to the ''Home'' node of that memory block, assuming it isn't already in the sharing list for that block.  In the event that the data is already being used, the ''Home'' node will include in its response the address of the ''Head'' node of the sharing list.  This requirement necessarily serializes the order of access, as one access request is not serviced at the ''Home'' node until the previous request is finished, and the requests are processed in FIFO order.  In addition, whichever node is the ''Head'' of the sharing list will be notified by the new ''Head'' when it is demoted.  All parties involved in a potential race are thus talking to each other, preventing the race from occurring in the first place.&lt;br /&gt;
&lt;br /&gt;
Because of this requirement, the projected race condition from the previous section, where the ''Head'' node is in a ''FRESH'' state and wants to write while another node also wants to read, cannot occur.  This is because both nodes must make their request to the ''Home'' node; the ''Head'' node must request the ability to write, and the other node must request the data from the ''Home'' node.  One of these requests will make it to the ''Home'' node first, resulting in the second request being deferred to the first node to reach the ''Home'' node.  Such a scenario is pictured below.&lt;br /&gt;
[[Image:MemoryAccess.png|center]]&lt;br /&gt;
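&lt;br /&gt;
The serialization performed by the ''Home'' node can be sketched as a FIFO queue plus a single head pointer.  The HomeDirectory class and its method names are hypothetical; the sketch models only the hand-off of the Head position, not the data transfer or the memory-state changes.&lt;br /&gt;

```python
from collections import deque


class HomeDirectory:
    """Hypothetical home directory for one memory block."""

    def __init__(self):
        self.head = None      # pointer to the current Head (forwId)
        self.queue = deque()  # requests are serviced in FIFO order

    def request(self, node):
        self.queue.append(node)

    def service_next(self):
        """Service one request; return (requester, previous_head)."""
        requester = self.queue.popleft()
        previous_head = self.head
        self.head = requester  # the requester becomes the new Head
        return requester, previous_head


home = HomeDirectory()
home.request("A")
home.request("B")
first = home.service_next()
second = home.service_next()
print(first)   # ('A', None)
print(second)  # ('B', 'A')
```

The second requester is told who the previous ''Head'' was, so it must contact that node directly rather than race with it at the ''Home'' node.&lt;br /&gt;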
&lt;br /&gt;
==== Solihin 11.4 Resolution ====&lt;br /&gt;
The three previous sections discussed individual parts of how the SCI protocol reduces race conditions.  Putting all three of these parts together yields the following scenario which resolves the Early Invalidation race as described in the text.&lt;br /&gt;
[[Image:SolihinSCI.png|center]]&lt;br /&gt;
&lt;br /&gt;
# ''A'' sends a request to ''Home'' for access to the memory block.  It then goes into a busy state while it waits for a response.&lt;br /&gt;
# ''B'' also sends a request to ''Home'' for access to the same memory block.  ''A'''s request is received first.&lt;br /&gt;
# ''Home'' responds to ''A'' with the data, but this response gets caught in network traffic and delayed.  This response is sent before ''Home'' processes the request from ''B''.&lt;br /&gt;
# ''Home'' responds to the request from ''B'', telling it that ''A'' is the current ''Head'' node.&lt;br /&gt;
# ''B'' then sends a request to ''A'' to tell it to demote itself.&lt;br /&gt;
# ''A'' still hasn't received the response from ''Home'', so it is still in a transaction, and it tells ''B'' that it is busy.  ''B'' will have to retry the request.&lt;br /&gt;
&lt;br /&gt;
It should be noted that ''B'' will be in the ''PENDING''&amp;lt;sup&amp;gt;&amp;lt;span class = &amp;quot;plainlinks&amp;quot;&amp;gt;[[#Definitions_and_Terms|[def]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt; state, so any requests to demote it from the ''Head'' position will be delayed.  This could lead to a chain of nodes in the ''PENDING'' state, but as soon as node ''A'' receives its response from ''Home'' or times out, the requests in the chain will be successively accepted and processed.&lt;br /&gt;
&lt;br /&gt;
=== Other Possible Race Conditions ===&lt;br /&gt;
The design of the SCI protocol prevents many race conditions from occurring, as has already been shown.  Here are two other conditions that may arise, with a discussion of how the protocol resolves the issue.&lt;br /&gt;
&lt;br /&gt;
==== Concurrent List Deletions ====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==== Simultaneous Deletion and Invalidation ====&lt;br /&gt;
Even though the ''Head'' node performs most of the coherence actions for a sharing list, the individual nodes are able to invalidate themselves, such as when they evict the block from their cache.  This is referred to in the SCI standard as a &amp;quot;deletion&amp;quot; from the sharing list.  Deletion is accomplished by having the invalidating node &amp;quot;lock&amp;quot; itself and then inform its forward and back nodes that they should now point to each other.  This &amp;quot;locking&amp;quot; is essentially another ''PENDING'' state.  A problem could arise, then, when two neighboring nodes try to invalidate themselves at the same time, as they would both be locked and would not respond to each other's messages.  In this case, though, the protocol specifies that the node that is closest to the tail takes priority.  The deadlock and/or race condition is thus averted by ordering the deletions from the tail towards the head.&lt;br /&gt;
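&lt;br /&gt;
The tail-first priority rule can be sketched by sorting the competing deletions by list position.  The function names deletion_order and apply_deletions are hypothetical, and removing an entry from a Python list stands in for the pointer updates that the neighbouring nodes would perform.&lt;br /&gt;

```python
def deletion_order(sharing_list, deleting):
    """Order concurrent deletions so the node nearest the tail goes first.

    sharing_list -- node names from head to tail.
    deleting     -- names of nodes that want to leave at the same time.
    """
    position = {name: i for i, name in enumerate(sharing_list)}
    return sorted(deleting, key=lambda name: position[name], reverse=True)


def apply_deletions(sharing_list, deleting):
    """Return the sharing list after all requested deletions complete."""
    remaining = list(sharing_list)
    for name in deletion_order(sharing_list, deleting):
        # Removing the entry stands in for relinking its two neighbours.
        remaining.remove(name)
    return remaining


nodes = ["H", "M1", "M2", "T"]
print(deletion_order(nodes, ["M1", "M2"]))   # ['M2', 'M1']
print(apply_deletions(nodes, ["M1", "M2"]))  # ['H', 'T']
```

With a fixed order, one of the two neighbours always makes progress, so neither waits forever on the other.&lt;br /&gt;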
&lt;br /&gt;
== Definitions and Terms ==&lt;br /&gt;
* ''Head Node'' - The node at the beginning of the sharing list.&lt;br /&gt;
* ''Home Node'' - The node responsible for keeping track of the ''Head'' node for a sharing list for a memory block, usually the node at which the memory block resides.&lt;br /&gt;
* ''PENDING state'' - The intermediate state for a node while it is attempting to become the ''Head'' node in the sharing list.  This includes the transaction involved in querying the ''Home'' node and the transaction involved in demoting the current ''Head'' node.&lt;br /&gt;
* ''SCI'' - Scalable Coherent Interface&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&amp;lt;span id=&amp;quot;1foot&amp;quot;&amp;gt;[[#1body|1.]]&amp;lt;/span&amp;gt; &amp;quot;IEEE Standard for Scalable Coherent Interface (SCI),&amp;quot; IEEE Std 1596-1992, 1993. doi: 10.1109/IEEESTD.1993.120366&lt;br /&gt;
URL: http://ieeexplore.ieee.org.proxied.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=347683&amp;amp;isnumber=8049 &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;2foot&amp;quot;&amp;gt;[[#2body|2.]]&amp;lt;/span&amp;gt; [http://www.scizzl.com/HowSCIcohWorks.html How SCI Coherence Works] &amp;lt;br&amp;gt;&lt;/div&gt;</summary>
		<author><name>Eapotter</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch11_BB_EP&amp;diff=45182</id>
		<title>CSC/ECE 506 Spring 2011/ch11 BB EP</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch11_BB_EP&amp;diff=45182"/>
		<updated>2011-04-19T01:37:04Z</updated>

		<summary type="html">&lt;p&gt;Eapotter: /* History */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;__TOC__&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Background ==&lt;br /&gt;
The Scalable Coherent Interface (SCI) cache coherence protocol defines a set of states for memory, a set of states for cache lines, and a set of actions that transition between these states when processors access memory.&lt;br /&gt;
&lt;br /&gt;
Three sets of these attributes are defined for minimal, typical, and full applications:&lt;br /&gt;
&lt;br /&gt;
* '''Minimal''' - for ‘trivial but correct’ applications that require the presence of the memory in only one cache line.  It does not enable read-sharing, and is appropriate for small multi-processors or applications that don’t require significant sharing.&lt;br /&gt;
&lt;br /&gt;
* '''Typical''' - enables read-sharing of a memory location and provisions for efficiency, such as DMA transfers, local caching of data, and error recovery.  This set adds an additional stable memory state (FRESH) and multiple cache states.  This set will be the focus of this article going forward.&lt;br /&gt;
&lt;br /&gt;
* '''Full''' - adds support for pair-wise sharing, QOLB lock bit synchronization, and cleansing and washing of cache lines.&lt;br /&gt;
&lt;br /&gt;
== State Diagrams ==&lt;br /&gt;
=== Memory States ===&lt;br /&gt;
The memory states define the state of the memory block from the perspective of the home directory.  This 2-bit field is maintained in the memory tag by the home directory along with the pointer to the head of the sharing list (forwId).  This simple state model includes three stable states and one semi-stable state.&lt;br /&gt;
&lt;br /&gt;
{|border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; cellpadding=&amp;quot;2&amp;quot;&lt;br /&gt;
!colspan=&amp;quot;5&amp;quot;| Table 1: Memory States &lt;br /&gt;
|----&lt;br /&gt;
!State&lt;br /&gt;
!Description&lt;br /&gt;
!Minimal&lt;br /&gt;
!Typical&lt;br /&gt;
!Full&lt;br /&gt;
|----&lt;br /&gt;
|HOME&lt;br /&gt;
|no sharing list&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|FRESH&lt;br /&gt;
|sharing-list copy is the same as memory&lt;br /&gt;
|&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|GONE&lt;br /&gt;
|sharing-list copy may be different from memory&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|WASH*&lt;br /&gt;
|transitional state (GONE to FRESH)&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=== Cache States ===&lt;br /&gt;
The cache-line states are maintained by each processor's cache-coherency controller.  This 7-bit field is stored in each cache line in the sharing list, along with the pointer to the next sharing-list node (forwId) and the previous sharing-list node (backId).  Seven bits allow up to 128 possible values, of which SCI defines twenty-nine stable states for use in the minimal, typical, and full sets.  &lt;br /&gt;
&lt;br /&gt;
Stable states are those cache states that exist when no memory transaction is in progress.  Their names combine the position of the node in the sharing list (ONLY, HEAD, MID, or TAIL) with the state of the data for that node (FRESH, CLEAN, DIRTY, VALID, STALE, etc.).&lt;br /&gt;
&lt;br /&gt;
====States of the Typical Set====&lt;br /&gt;
&lt;br /&gt;
Following are the states defined for the Typical set.  (Note that other states may be supported to provide certain interoperability with nodes implementing the Full set, but are not used when all nodes support the Typical set.)&lt;br /&gt;
&lt;br /&gt;
* '''ONLY_DIRTY''' - only one processor has the memory block in its cache.  This block is writable, and the processor has written (or intends to write) to it.  This state is set when the processor requests the block with read/write privileges, and no other processor currently caches the block.&lt;br /&gt;
* '''ONLY_FRESH''' - only one processor has the memory block in its cache.  This block is writable, but the processor has not written to it.  This state is set when the processor requests the block with read privileges, and no other processor currently caches the block.&lt;br /&gt;
* '''HEAD_DIRTY''' - more than one processor has the memory block in its cache.  This block is writable, and the processor has written (or intends to write) to it.  This state is set when the processor requests the block with read/write privileges, and another processor already caches the block.&lt;br /&gt;
* '''HEAD_FRESH''' - more than one processor has the memory block in its cache.  This block is writable, but the processor has not written to it.  This state is set when the processor requests the block with read privileges, and another processor already caches the block.&lt;br /&gt;
* '''MID_VALID''' - more than two processors have the memory block in their caches, and it is readable.  The processor cache with this state is neither the Head nor the Tail of the sharing list.&lt;br /&gt;
* '''TAIL_VALID''' - two or more processors have the memory block in their caches, and it is readable.  The processor cache with this state is the Tail of the sharing list.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Below is a state diagram that depicts these states:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:cache_states.png|center]]&lt;br /&gt;
&lt;br /&gt;
The request sub-actions of memory transactions that expect a response from the directory include 17 variations of reads, writes, and locks based on the number of bytes requested and the coherency requirements.  For convenience, the notation on the edges of the above graph is derived from the terms used in the SCI specification to describe the memory transactions. &lt;br /&gt;
&lt;br /&gt;
The first value indicates the request to the directory by a node for memory access:&lt;br /&gt;
* '''Fetch R''' - indicates a request for a memory block with read privileges.&lt;br /&gt;
* '''Fetch RW''' - indicates a request for a memory block with read/write privileges.&lt;br /&gt;
* '''Data Modify''' - indicates that a processor (the HEAD or the ONLY processor caching the block) writes to the cache line.&lt;br /&gt;
&lt;br /&gt;
The value in parentheses indicates the memory state at the time of the request.&lt;br /&gt;
&lt;br /&gt;
== Race Conditions ==&lt;br /&gt;
In a distributed shared memory system with caching, the emergence of race conditions is extremely likely.  This is mainly due to the lack of a bus for serialization of actions.  It is further compounded by the problem of network errors and congestion.&lt;br /&gt;
&lt;br /&gt;
The early invalidation case from Section 11.4 in Solihin is an excellent example of a race condition that can arise in a distributed system.  Recall the diagram from the text, and the cache coherence actions, as shown below.&lt;br /&gt;
&lt;br /&gt;
[[Image:EarlyInValidationRace.png|400px|center]]&lt;br /&gt;
The circled actions are as follows:&lt;br /&gt;
# ''A'' sends a read request to ''Home''.&lt;br /&gt;
# ''Home'' replies with data (but the message gets delayed).&lt;br /&gt;
# ''B'' sends a write request to ''Home''.&lt;br /&gt;
# ''Home'' sends an invalidation to ''A'', and it arrives before the ReplyD from step 2.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The SCI protocol handles this race condition, as well as many others; the following sections discuss how its design prevents it.&lt;br /&gt;
&lt;br /&gt;
=== Prevention in the SCI Protocol ===&lt;br /&gt;
Race conditions are almost non-existent in the SCI protocol, due primarily to the protocol's design.  The following subsections describe the mechanisms responsible, and then show what happens when the Early Invalidation scenario arises under SCI.&lt;br /&gt;
==== Atomic Transactions ====&lt;br /&gt;
SCI's primary method for preventing race conditions is the use of atomic transactions.  A transaction is defined as a set of sub-actions necessary to complete some requested action, such as querying memory for data or invalidating a node's cache.  This usually takes the form of a request-response pair, where the initiating node sends a request to another node and waits for the response.  The transaction is not complete until the response returns.  Suppose, for example, that node ''A'' is in the middle of a transaction with node ''B''.  Node ''C'' then tries to make a request of node ''A''.  Node ''A'' will respond to node ''C'' that it is busy, telling node ''C'' to try again, as shown in the following diagram.&lt;br /&gt;
[[Image:AtomicBusy.png|center]]&lt;br /&gt;
Thus, in the Early Invalidation race, node ''A'' would be in the middle of a transaction, having requested a block of data from memory and not yet received a response, which would prevent node ''B'' from invalidating it.  Note, however, that nodes will not stay in the busy state indefinitely.  Doing so could lead to deadlock, so all transactions have timeouts that eventually cause the transaction to fail, moving the node out of the busy state and making it able to respond to requests again.&lt;br /&gt;
&lt;br /&gt;
==== Head Node ====&lt;br /&gt;
Another method for preventing race conditions is to have the ''Head'' node of a sharing list perform many of the coherence actions.  As a result, only one node is performing actions such as writes and invalidations of other sharers.  Since only one node is performing these actions, the possibility of concurrent actions is decreased.  If another node wants to write, for instance, it must become the Head Node of the sharing list for the cache line to which it wants to write, displacing the current ''Head'' node as the node that can write.  All state changes in the sharing list occur at the behest of the ''Head'' node, with the exception of nodes deleting themselves from the list.&lt;br /&gt;
&lt;br /&gt;
When simply sharing a read-only copy of a line, as when the memory is in the ''FRESH'' state, the ''Head'' node is somewhat irrelevant.  All the sharing nodes have their own cached copy of the line.  If any node wants to write, including the ''Head'' node, it must perform an additional action in order to do so.  Likewise, when a sharing list is sharing a line of data in the dirty, or ''GONE'', memory state, as long as the ''Head'' node has not written to the line, all the sharing nodes stay in the list with their cached line.  However, once the ''Head'' node wants to write, it can write its data immediately, but it must then invalidate all other shared copies via the forward pointers in the sharing list.&lt;br /&gt;
&lt;br /&gt;
This, however, does not protect against a race condition where the memory is ''FRESH'', the ''Head'' node wants to write, and another node also wants to join the list.  The Memory Access mechanism prevents this condition.&lt;br /&gt;
&lt;br /&gt;
==== Memory Access ====&lt;br /&gt;
The final way that the SCI protocol minimizes race conditions is by changing the directory structure.  In Solihin 11.4, there is a ''Home'' node which keeps track of the cache state and has a certain amount of knowledge about the system as a result.  In the SCI protocol, there is still a ''Home'' node of sorts, but this node is only responsible for keeping track of which node is the current ''Head''.  Such a ''Home'' node is usually the node where the memory block physically resides.  When a node wants to access a block of memory, it sends a request to the ''Home'' node of that memory block, assuming it isn't already in the sharing list for that block.  In the event that the data is already being used, the ''Home'' node will include in its response the address of the ''Head'' node of the sharing list.  This requirement necessarily serializes the order of access, as one access request is not serviced at the ''Home'' node until the previous request is finished, and the requests are processed in FIFO order.  In addition, whichever node is the ''Head'' of the sharing list will be notified by the new ''Head'' when it is demoted.  All parties involved in a potential race are thus talking to each other, preventing the race from occurring in the first place.&lt;br /&gt;
&lt;br /&gt;
Because of this requirement, the projected race condition from the previous section, where the ''Head'' node is in a ''FRESH'' state and wants to write while another node also wants to read, cannot occur.  This is because both nodes must make their request to the ''Home'' node; the ''Head'' node must request the ability to write, and the other node must request the data from the ''Home'' node.  One of these requests will make it to the ''Home'' node first, resulting in the second request being deferred to the first node to reach the ''Home'' node.  Such a scenario is pictured below.&lt;br /&gt;
[[Image:MemoryAccess.png|center]]&lt;br /&gt;
&lt;br /&gt;
==== Solihin 11.4 Resolution ====&lt;br /&gt;
The three previous sections discussed individual parts of how the SCI protocol reduces race conditions.  Putting all three of these parts together yields the following scenario which resolves the Early Invalidation race as described in the text.&lt;br /&gt;
[[Image:SolihinSCI.png|center]]&lt;br /&gt;
&lt;br /&gt;
# ''A'' sends a request to ''Home'' for access to the memory block.  It then goes into a busy state while it waits for a response.&lt;br /&gt;
# ''B'' also sends a request to ''Home'' for access to the same memory block.  ''A'''s request is received first.&lt;br /&gt;
# ''Home'' responds to ''A'' with the data, but this response gets caught in network traffic and delayed.  This response is sent before ''Home'' processes the request from ''B''.&lt;br /&gt;
# ''Home'' responds to the request from ''B'', telling it that ''A'' is the current ''Head'' node.&lt;br /&gt;
# ''B'' then sends a request to ''A'' to tell it to demote itself.&lt;br /&gt;
# ''A'' still hasn't received the response from ''Home'', so it is still in a transaction, and it tells ''B'' that it is busy.  ''B'' will have to retry the request.&lt;br /&gt;
&lt;br /&gt;
It should be noted that ''B'' will be in the ''PENDING''&amp;lt;sup&amp;gt;&amp;lt;span class = &amp;quot;plainlinks&amp;quot;&amp;gt;[[#Definitions_and_Terms|[def]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt; state, so any requests to demote it from the ''Head'' position will be delayed.  This could lead to a chain of nodes in the ''PENDING'' state, but as soon as node ''A'' receives its response from ''Home'' or times out, the requests in the chain will be successively accepted and processed.&lt;br /&gt;
&lt;br /&gt;
=== Other Possible Race Conditions ===&lt;br /&gt;
The design of the SCI protocol prevents many race conditions from occurring, as has already been shown.  Here are two other conditions that may arise, with a discussion of how the protocol resolves the issue.&lt;br /&gt;
&lt;br /&gt;
==== Concurrent List Deletions ====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==== Simultaneous Deletion and Invalidation ====&lt;br /&gt;
Even though the ''Head'' node performs most of the coherence actions for a sharing list, the individual nodes are able to invalidate themselves, such as when they evict the block from their cache.  This is referred to in the SCI standard as a &amp;quot;deletion&amp;quot; from the sharing list.  Deletion is accomplished by having the invalidating node &amp;quot;lock&amp;quot; itself and then inform its forward and back nodes that they should now point to each other.  This &amp;quot;locking&amp;quot; is essentially another ''PENDING'' state.  A problem could arise, then, when two neighboring nodes try to invalidate themselves at the same time, as they would both be locked and would not respond to each other's messages.  In this case, though, the protocol specifies that the node that is closest to the tail takes priority.  The deadlock and/or race condition is thus averted by ordering the deletions from the tail towards the head.&lt;br /&gt;
&lt;br /&gt;
== Definitions and Terms ==&lt;br /&gt;
* ''Head Node'' - The node at the beginning of the sharing list.&lt;br /&gt;
* ''Home Node'' - The node responsible for keeping track of the ''Head'' node for a sharing list for a memory block, usually the node at which the memory block resides.&lt;br /&gt;
* ''PENDING state'' - The intermediate state for a node while it is attempting to become the ''Head'' node in the sharing list.  This includes the transaction involved in querying the ''Home'' node and the transaction involved in demoting the current ''Head'' node.&lt;br /&gt;
* ''SCI'' - Scalable Coherent Interface&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&amp;lt;span id=&amp;quot;1foot&amp;quot;&amp;gt;[[#1body|1.]]&amp;lt;/span&amp;gt; &amp;quot;IEEE Standard for Scalable Coherent Interface (SCI),&amp;quot; IEEE Std 1596-1992, 1993. doi: 10.1109/IEEESTD.1993.120366&lt;br /&gt;
URL: http://ieeexplore.ieee.org.proxied.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=347683&amp;amp;isnumber=8049 &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;2foot&amp;quot;&amp;gt;[[#2body|2.]]&amp;lt;/span&amp;gt; [http://www.scizzl.com/HowSCIcohWorks.html How SCI Coherence Works] &amp;lt;br&amp;gt;&lt;/div&gt;</summary>
		<author><name>Eapotter</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=File:Memory_state.png&amp;diff=45181</id>
		<title>File:Memory state.png</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=File:Memory_state.png&amp;diff=45181"/>
		<updated>2011-04-19T01:36:37Z</updated>

		<summary type="html">&lt;p&gt;Eapotter: Memory State Diagram&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Memory State Diagram&lt;/div&gt;</summary>
		<author><name>Eapotter</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch11_BB_EP&amp;diff=45174</id>
		<title>CSC/ECE 506 Spring 2011/ch11 BB EP</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch11_BB_EP&amp;diff=45174"/>
		<updated>2011-04-19T01:03:04Z</updated>

		<summary type="html">&lt;p&gt;Eapotter: /* Memory States */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;__TOC__&lt;br /&gt;
&lt;br /&gt;
== History ==&lt;br /&gt;
&lt;br /&gt;
== Background ==&lt;br /&gt;
The Scalable Coherent Interface (SCI) cache coherence protocol defines a set of states for memory, a set of states for cache lines, and a set of actions that transition between these states when processors access memory.&lt;br /&gt;
&lt;br /&gt;
Three sets of these attributes are defined for minimal, typical, and full applications:&lt;br /&gt;
&lt;br /&gt;
* '''Minimal''' - for ‘trivial but correct’ applications that require the presence of the memory in only one cache line.  It does not enable read-sharing, and is appropriate for small multi-processors or applications that don’t require significant sharing.&lt;br /&gt;
&lt;br /&gt;
* '''Typical''' - enables read-sharing of a memory location and adds provisions for efficiency, such as DMA transfers, local caching of data, and error recovery.  This set adds a stable memory state (FRESH) and multiple cache states, and it is the focus of the rest of this article.&lt;br /&gt;
&lt;br /&gt;
* '''Full''' - adds support for pair-wise sharing, QOLB lock bit synchronization, and cleansing and washing of cache lines.&lt;br /&gt;
&lt;br /&gt;
== State Diagrams ==&lt;br /&gt;
=== Memory States ===&lt;br /&gt;
The memory states define the state of the memory block from the perspective of the home directory.  This 2-bit field is maintained in the memory tag by the home directory along with the pointer to the head of the sharing list (forwId).  This simple state model includes three stable states and one semi-stable state.&lt;br /&gt;
&lt;br /&gt;
{|border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; cellpadding=&amp;quot;2&amp;quot;&lt;br /&gt;
!colspan=&amp;quot;5&amp;quot;| Table 1: Memory States &lt;br /&gt;
|----&lt;br /&gt;
!State&lt;br /&gt;
!Description&lt;br /&gt;
!Minimal&lt;br /&gt;
!Typical&lt;br /&gt;
!Full&lt;br /&gt;
|----&lt;br /&gt;
|HOME&lt;br /&gt;
|no sharing list&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|FRESH&lt;br /&gt;
|sharing-list copy is the same as memory&lt;br /&gt;
|&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|GONE&lt;br /&gt;
|sharing-list copy may be different from memory&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|WASH*&lt;br /&gt;
|transitional state (GONE to FRESH)&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|}&lt;br /&gt;
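&lt;br /&gt;
Table 1 can be restated as a small lookup structure.  The following is a minimal Python sketch, not part of the SCI standard: the names MemoryState, SUPPORTED_IN, and states_for are hypothetical, and the 2-bit encodings are deliberately omitted because they are not given here.&lt;br /&gt;

```python
from enum import Enum


class MemoryState(Enum):
    """Memory states from Table 1; the values are the table descriptions."""
    HOME = "no sharing list"
    FRESH = "sharing-list copy is the same as memory"
    GONE = "sharing-list copy may be different from memory"
    WASH = "transitional state (GONE to FRESH)"


# Option sets in which each state is defined, transcribed from Table 1.
SUPPORTED_IN = {
    MemoryState.HOME: {"minimal", "typical", "full"},
    MemoryState.FRESH: {"typical", "full"},
    MemoryState.GONE: {"minimal", "typical", "full"},
    MemoryState.WASH: {"full"},
}


def states_for(option_set):
    """Return the memory states available in the given option set."""
    return [s for s in MemoryState if option_set in SUPPORTED_IN[s]]
```

For example, states_for(&amp;quot;typical&amp;quot;) yields HOME, FRESH, and GONE, which matches the Typical column of the table.&lt;br /&gt;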
&lt;br /&gt;
=== Cache States ===&lt;br /&gt;
The cache-line states are maintained by each processor's cache-coherency controller.  This 7-bit field is stored in each cache line in the sharing list, along with the pointer to the next sharing-list node (forwId) and the previous sharing-list node (backId).  Seven bits allow up to 128 possible values, and SCI defines twenty-nine stable states for use in the minimal, typical, and full sets.&lt;br /&gt;
&lt;br /&gt;
Stable states are those cache states that exist when a memory transaction is not in progress.  Their names combine the position of the node in the sharing list (ONLY, HEAD, MID, or TAIL) with the state of the data for that node (FRESH, CLEAN, DIRTY, VALID, STALE, etc.).&lt;br /&gt;
&lt;br /&gt;
====States of the Typical Set====&lt;br /&gt;
&lt;br /&gt;
Following are the states defined for the Typical set.  (Note that other states may be supported to provide certain interoperability with nodes implementing the Full set, but are not used when all nodes support the Typical set.)&lt;br /&gt;
&lt;br /&gt;
* '''ONLY_DIRTY''' - only one processor has the memory block in its cache.  This block is writable, and the processor has written (or intends to write) to it.  This state is set when the processor requests the block with read/write privileges, and no other processor currently caches the block.&lt;br /&gt;
* '''ONLY_FRESH''' - only one processor has the memory block in its cache.  This block is writable, but the processor has not written to it.  This state is set when the processor requests the block with read privileges, and no other processor currently caches the block.&lt;br /&gt;
* '''HEAD_DIRTY''' - more than one processor has the memory block in its cache.  This block is writable, and the processor has written (or intends to write) to it.  This state is set when the processor requests the block with read/write privileges, and another processor already caches the block.&lt;br /&gt;
* '''HEAD_FRESH''' - more than one processor has the memory block in its cache.  This block is writable, but the processor has not written to it.  This state is set when the processor requests the block with read privileges, and another processor already caches the block.&lt;br /&gt;
* '''MID_VALID''' - more than two processors have the memory block in their caches, and it is readable.  The processor cache with this state is neither the Head nor the Tail of the sharing list.&lt;br /&gt;
* '''TAIL_VALID''' - two or more processors have the memory block in their caches, and it is readable.  The processor cache with this state is the Tail of the sharing list.&lt;br /&gt;
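&lt;br /&gt;
The descriptions of the ONLY and HEAD states above follow a simple pattern: the list position depends on whether another processor already caches the block, and the data state depends on the privileges requested.  The sketch below encodes only that simplified pattern in Python.  The function name entry_state and the request labels &amp;quot;R&amp;quot; and &amp;quot;RW&amp;quot; are hypothetical, and the sketch ignores the memory state, so it should not be read as the full transition rules of the standard.&lt;br /&gt;

```python
def entry_state(request, others_cache_block):
    """Stable cache state of a requester, per the simplified descriptions.

    request: "R" for read privileges, "RW" for read/write privileges.
    others_cache_block: True if another processor already caches the block.
    """
    if request not in ("R", "RW"):
        raise ValueError("request must be 'R' or 'RW'")
    # A new requester always joins at the front of the sharing list.
    position = "HEAD" if others_cache_block else "ONLY"
    data = "DIRTY" if request == "RW" else "FRESH"
    return position + "_" + data
```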
&lt;br /&gt;
&lt;br /&gt;
Below is a state diagram that depicts these states:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:cache_states.png|center]]&lt;br /&gt;
&lt;br /&gt;
The request sub-actions of memory transactions that expect a response from the directory include 17 variations of reads, writes, and locks based on the number of bytes requested and the coherency requirements.  For convenience, the notation on the edges of the above graph is derived from the terms used in the SCI specification to describe the memory transactions. &lt;br /&gt;
&lt;br /&gt;
The first value indicates the request to the directory by a node for memory access:&lt;br /&gt;
* '''Fetch R''' - indicates a request for a memory block with read privileges.&lt;br /&gt;
* '''Fetch RW''' - indicates a request for a memory block with read/write privileges.&lt;br /&gt;
* '''Data Modify''' - indicates that a processor (the HEAD or the ONLY processor caching the block) writes to the cache line.&lt;br /&gt;
&lt;br /&gt;
The value in parentheses indicates the memory state at the time of the request.&lt;br /&gt;
&lt;br /&gt;
== Race Conditions ==&lt;br /&gt;
In a distributed shared memory system with caching, the emergence of race conditions is extremely likely.  This is mainly due to the lack of a bus for serialization of actions.  It is further compounded by the problem of network errors and congestion.&lt;br /&gt;
&lt;br /&gt;
The early invalidation case from Section 11.4 in Solihin is an excellent example of a race condition that can arise in a distributed system.  Recall the diagram from the text, and the cache coherence actions, as shown below.&lt;br /&gt;
&lt;br /&gt;
[[Image:EarlyInValidationRace.png|400px|center]]&lt;br /&gt;
The circled actions are as follows:&lt;br /&gt;
# ''A'' sends a read request to ''Home''.&lt;br /&gt;
# ''Home'' replies with data (but the message gets delayed).&lt;br /&gt;
# ''B'' sends a write request to ''Home''.&lt;br /&gt;
# ''Home'' sends an invalidation to ''A'', and it arrives before the delayed data reply (ReplyD).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The SCI protocol is designed to handle this race condition, as well as many others.  The following sections discuss how the protocol's design prevents it.&lt;br /&gt;
&lt;br /&gt;
=== Prevention in the SCI Protocol ===&lt;br /&gt;
Race conditions are almost non-existent in the SCI protocol, due primarily to the protocol's design.  A brief discussion of how this is accomplished follows, and then a walk-through of what happens if the Early Invalidation condition arises in the SCI protocol.&lt;br /&gt;
==== Atomic Transactions ====&lt;br /&gt;
SCI's primary method for preventing race conditions is having atomic transactions.  A transaction is defined as a set of sub-actions necessary to complete some requested action, such as querying memory for data or invalidating a node's cache.  This usually takes the form of a request-response pair, where the initiating node sends a request to another node and waits for the response.  The transaction is not complete until the response returns.  Suppose, for example, that node ''A'' is in the middle of a transaction with node ''B''.  Node ''C'' then tries to make a request of node ''A''.  Node ''A'' will respond to node ''C'' that it is busy, telling node ''C'' to try again, as shown in the following diagram.&lt;br /&gt;
[[Image:AtomicBusy.png|center]]&lt;br /&gt;
Thus, in the Early Invalidation race, node ''A'' would be in the middle of a transaction, having requested a block of data from memory and not having received a response, which would prevent node ''B'' from invalidating it.  Note, however, that nodes will not stay in the busy state indefinitely.  Doing so would lead to potential deadlocks, so all transactions have timeouts that will eventually cause the transaction to fail, thus moving the node out of the busy state and making it able to respond to requests again.&lt;br /&gt;
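&lt;br /&gt;
The busy-and-retry behavior can be illustrated with a toy model.  The Python sketch below is an assumption-laden illustration, not SCI's actual packet handling: the class Node, the function send_with_retry, and the string replies &amp;quot;BUSY&amp;quot; and &amp;quot;ACCEPTED&amp;quot; are all hypothetical names, and timeouts are represented only by a call to end_transaction.&lt;br /&gt;

```python
class Node:
    """Toy node that rejects requests while its own transaction is open."""

    def __init__(self, name):
        self.name = name
        self.in_transaction = False

    def begin_transaction(self):
        self.in_transaction = True

    def end_transaction(self):
        # Called when the response arrives or the transaction times out.
        self.in_transaction = False

    def handle_request(self, sender_name):
        if self.in_transaction:
            return "BUSY"        # the sender must retry later
        return "ACCEPTED"


def send_with_retry(target, sender_name, max_attempts, between_attempts=None):
    """Retry until accepted; return the number of attempts used, or None."""
    for attempt in range(1, max_attempts + 1):
        if target.handle_request(sender_name) == "ACCEPTED":
            return attempt
        if between_attempts is not None:
            between_attempts(attempt)   # lets a caller model elapsed time
    return None
```

In this model, a request to a node that never leaves its transaction returns None after max_attempts, which is why the real protocol pairs the busy response with transaction timeouts.&lt;br /&gt;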
&lt;br /&gt;
==== Head Node ====&lt;br /&gt;
Another method for preventing race conditions is to have the ''Head'' node of a sharing list perform many of the coherence actions.  As a result, only one node is performing actions such as writes and invalidations of other sharers.  Since only one node is performing these actions, the possibility of concurrent actions is decreased.  If another node wants to write, for instance, it must become the Head Node of the sharing list for the cache line to which it wants to write, displacing the current ''Head'' node as the node that can write.  All state changes in the sharing list occur at the behest of the ''Head'' node, with the exception of nodes deleting themselves from the list.&lt;br /&gt;
&lt;br /&gt;
When simply sharing a read-only copy of a line, as when the memory is in the ''FRESH'' state, the ''Head'' node is somewhat irrelevant.  All the sharing nodes have their own cached value of the cache line.  If any node wants to write, including the ''Head'' node, it must perform an additional action in order to do so.  Likewise, when a sharing list is sharing a line of data in the dirty, or ''GONE'', memory state, as long as the ''Head'' node has not written to the line, then all the sharing nodes stay in the list with their cached line.  However, at the point the ''Head'' node wants to write, it can write its data immediately, but it then must invalidate all other shared copies, via the forward pointers in the sharing list.&lt;br /&gt;
&lt;br /&gt;
This, however, does not protect against a race condition where the memory is ''FRESH'', the ''Head'' node wants to write, and another node also wants to join the list.  The Memory Access mechanism prevents this condition.&lt;br /&gt;
&lt;br /&gt;
==== Memory Access ====&lt;br /&gt;
The final way that the SCI protocol minimizes race conditions is by changing the directory structure.  In Solihin 11.4, there is a ''Home'' node which keeps track of the cache state and has a certain amount of knowledge about the system as a result.  In the SCI protocol, there is still a ''Home'' node of sorts, but this node is only responsible for keeping track of which node is the current ''Head''.  Such a ''Home'' node is usually the node where the memory block physically resides.  When a node wants to access a block of memory, it sends a request to the ''Home'' node of that memory block, assuming it isn't already in the sharing list for that block.  In the event that the data is already being used, the ''Home'' node will include in its response the address of the ''Head'' node of the sharing list.  This requirement necessarily serializes the order of access, as one access request is not serviced at the ''Home'' node until the previous request is finished, and the requests are processed in FIFO order.  As well, whichever node is the ''Head'' of the sharing list will be notified by the new ''Head'' when it is demoted.  All parties involved in a potential race are thus talking to each other, preventing the race from occurring in the first place.&lt;br /&gt;
&lt;br /&gt;
Because of this requirement, the projected race condition from the previous section, where the ''Head'' node is in a ''FRESH'' state and wants to write while another node also wants to read, cannot occur.  This is because both nodes must make their request to the ''Home'' node; the ''Head'' node must request the ability to write, and the other node must request the data from the ''Home'' node.  One of these requests will make it to the ''Home'' node first, resulting in the second request being deferred to the first node to reach the ''Home'' node.  Such a scenario is pictured below.&lt;br /&gt;
[[Image:MemoryAccess.png|center]]&lt;br /&gt;
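&lt;br /&gt;
The serialization at the ''Home'' node can be sketched as a FIFO queue in front of a single head pointer.  The Python below is a simplified illustration under stated assumptions: HomeDirectory, receive, and service_next are hypothetical names, one object models one memory block, and every serviced request is assumed to make the requester the new ''Head''.&lt;br /&gt;

```python
from collections import deque


class HomeDirectory:
    """Toy home node that tracks only the current head of one sharing list."""

    def __init__(self):
        self.head = None          # name of the current Head node, if any
        self.pending = deque()    # requests waiting in arrival (FIFO) order

    def receive(self, requester):
        self.pending.append(requester)

    def service_next(self):
        """Service the oldest request; return (requester, previous_head)."""
        requester = self.pending.popleft()
        previous_head = self.head
        self.head = requester     # the requester becomes the new Head
        return requester, previous_head
```

Whichever request arrives second is answered with the name of the first requester, so the second node must contact that node directly rather than racing it at the directory.&lt;br /&gt;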
&lt;br /&gt;
==== Solihin 11.4 Resolution ====&lt;br /&gt;
The three previous sections discussed individual parts of how the SCI protocol reduces race conditions.  Putting all three of these parts together yields the following scenario which resolves the Early Invalidation race as described in the text.&lt;br /&gt;
[[Image:SolihinSCI.png|center]]&lt;br /&gt;
&lt;br /&gt;
# ''A'' sends a request to ''Home'' for access to the memory block.  It then goes into a busy state while it waits for a response.&lt;br /&gt;
# ''B'' also sends a request to ''Home'' for access to the same memory block.  ''A'''s request is received first.&lt;br /&gt;
# ''Home'' responds to ''A'' with the data, but this response gets caught in network traffic and delayed.  This response is sent before ''Home'' processes the request from ''B''.&lt;br /&gt;
# ''Home'' responds to the request from ''B'', telling it that ''A'' is the current ''Head'' node.&lt;br /&gt;
# ''B'' then sends a request to ''A'' to tell it to demote itself.&lt;br /&gt;
# ''A'' still hasn't received the response from ''Home'', so it is still in a transaction, and it tells ''B'' that it is busy.  ''B'' will have to retry the request.&lt;br /&gt;
&lt;br /&gt;
It should be noted that ''B'' will be in the ''PENDING''&amp;lt;sup&amp;gt;&amp;lt;span class = &amp;quot;plainlinks&amp;quot;&amp;gt;[[#Definitions_and_Terms|[def]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt; state, so any requests to demote it from the ''Head'' position will be delayed.  This could lead to a chaining of nodes in the ''PENDING'' state, but as soon as node ''A'' receives its response from ''Home'' or times out, the requests in the chain will be successively accepted and processed.&lt;br /&gt;
&lt;br /&gt;
=== Other Possible Race Conditions ===&lt;br /&gt;
The design of the SCI protocol prevents many race conditions from occurring, as has already been shown.  Here are two other conditions that may arise, with a discussion of how the protocol resolves the issue.&lt;br /&gt;
&lt;br /&gt;
==== Concurrent List Deletions ====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==== Simultaneous Deletion and Invalidation ====&lt;br /&gt;
Even though the ''Head'' node performs most of the coherence actions for a sharing list, the individual nodes are able to invalidate themselves, such as when they evict the block from their cache.  This is referred to in the SCI standard as a &amp;quot;deletion&amp;quot; from the sharing list.  Deletion is accomplished by having the invalidating node &amp;quot;lock&amp;quot; itself and then inform its forward and back nodes that they should now point to each other.  This &amp;quot;locking&amp;quot; is essentially another ''PENDING'' state.  A problem could arise, then, when two neighboring nodes try to invalidate themselves at the same time, as they would both be locked and not respond to each other's message.  In this case, though, the protocol specifies that the node that is closest to the tail takes priority.  So, the deadlock and/or race condition is averted by ordering the deletion from the tail towards the head.&lt;br /&gt;
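&lt;br /&gt;
The tail-first ordering can be shown with a short Python sketch.  It models the sharing list as a plain Python list ordered from Head to Tail rather than as forwId and backId pointers, and the function name delete_in_priority_order is hypothetical; it only illustrates the ordering rule described above, not the messages exchanged.&lt;br /&gt;

```python
def delete_in_priority_order(sharing_list, deleting):
    """Remove the nodes in `deleting`, nearest-to-tail first.

    sharing_list: node names ordered from Head to Tail.
    Returns (remaining_list, order_in_which_nodes_were_removed).
    """
    remaining = list(sharing_list)
    # A larger index means closer to the tail, so sort by index descending.
    order = sorted(deleting, key=sharing_list.index, reverse=True)
    for node in order:
        remaining.remove(node)   # its neighbors now point to each other
    return remaining, order
```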
&lt;br /&gt;
== Definitions and Terms ==&lt;br /&gt;
* ''Head Node'' - The node at the beginning of the sharing list.&lt;br /&gt;
* ''Home Node'' - The node responsible for keeping track of the ''Head'' node for a sharing list for a memory block, usually the node at which the memory block resides.&lt;br /&gt;
* ''PENDING state'' - The intermediate state for a node while it is attempting to become the ''Head'' node in the sharing list.  This includes the transaction involved in querying the ''Home'' node and the transaction involved in demoting the current ''Head'' node.&lt;br /&gt;
* ''SCI'' - Scalable Coherent Interface&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&amp;lt;span id=&amp;quot;1foot&amp;quot;&amp;gt;[[#1body|1.]]&amp;lt;/span&amp;gt; &amp;quot;IEEE Standard for Scalable Coherent Interface (SCI),&amp;quot; IEEE Std 1596-1992, 1993. doi: 10.1109/IEEESTD.1993.120366&lt;br /&gt;
URL: http://ieeexplore.ieee.org.proxied.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=347683&amp;amp;isnumber=8049 &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;2foot&amp;quot;&amp;gt;[[#2body|2.]]&amp;lt;/span&amp;gt; [http://www.scizzl.com/HowSCIcohWorks.html How SCI Coherence Works] &amp;lt;br&amp;gt;&lt;/div&gt;</summary>
		<author><name>Eapotter</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch11_BB_EP&amp;diff=45173</id>
		<title>CSC/ECE 506 Spring 2011/ch11 BB EP</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch11_BB_EP&amp;diff=45173"/>
		<updated>2011-04-19T01:02:48Z</updated>

		<summary type="html">&lt;p&gt;Eapotter: /* Memory States */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;__TOC__&lt;br /&gt;
&lt;br /&gt;
== History ==&lt;br /&gt;
&lt;br /&gt;
== Background ==&lt;br /&gt;
The Scalable Coherent Interface (SCI) cache coherence protocol defines a set of states for memory, a set of states for cache lines, and a set of actions that transition between these states when processors access memory.&lt;br /&gt;
&lt;br /&gt;
Three sets of these attributes are defined for minimal, typical, and full applications:&lt;br /&gt;
&lt;br /&gt;
* '''Minimal''' - for ‘trivial but correct’ applications that require the presence of the memory in only one cache line.  It does not enable read-sharing, and is appropriate for small multi-processors or applications that don’t require significant sharing.&lt;br /&gt;
&lt;br /&gt;
* '''Typical''' - enables read-sharing of a memory location and adds provisions for efficiency, such as DMA transfers, local caching of data, and error recovery.  This set adds a stable memory state (FRESH) and multiple cache states, and it is the focus of the rest of this article.&lt;br /&gt;
&lt;br /&gt;
* '''Full''' - adds support for pair-wise sharing, QOLB lock bit synchronization, and cleansing and washing of cache lines.&lt;br /&gt;
&lt;br /&gt;
== State Diagrams ==&lt;br /&gt;
=== Memory States ===&lt;br /&gt;
The memory states define the state of the memory block from the perspective of the home directory.  This 2-bit field is maintained in the memory tag by the home directory along with the pointer to the head of the sharing list (forwId).  This simple state model includes three stable states and one semi-stable state.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{|border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; cellpadding=&amp;quot;2&amp;quot;&lt;br /&gt;
!colspan=&amp;quot;5&amp;quot;| Table 1: Memory States &lt;br /&gt;
|----&lt;br /&gt;
!State&lt;br /&gt;
!Description&lt;br /&gt;
!Minimal&lt;br /&gt;
!Typical&lt;br /&gt;
!Full&lt;br /&gt;
|----&lt;br /&gt;
|HOME&lt;br /&gt;
|no sharing list&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|FRESH&lt;br /&gt;
|sharing-list copy is the same as memory&lt;br /&gt;
|&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|GONE&lt;br /&gt;
|sharing-list copy may be different from memory&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|WASH*&lt;br /&gt;
|transitional state (GONE to FRESH)&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=== Cache States ===&lt;br /&gt;
The cache-line states are maintained by each processor's cache-coherency controller.  This 7-bit field is stored in each cache line in the sharing list, along with the pointer to the next sharing-list node (forwId) and the previous sharing-list node (backId).  Seven bits allow up to 128 possible values, and SCI defines twenty-nine stable states for use in the minimal, typical, and full sets.&lt;br /&gt;
&lt;br /&gt;
Stable states are those cache states that exist when a memory transaction is not in progress.  Their names combine the position of the node in the sharing list (ONLY, HEAD, MID, or TAIL) with the state of the data for that node (FRESH, CLEAN, DIRTY, VALID, STALE, etc.).&lt;br /&gt;
&lt;br /&gt;
====States of the Typical Set====&lt;br /&gt;
&lt;br /&gt;
Following are the states defined for the Typical set.  (Note that other states may be supported to provide certain interoperability with nodes implementing the Full set, but are not used when all nodes support the Typical set.)&lt;br /&gt;
&lt;br /&gt;
* '''ONLY_DIRTY''' - only one processor has the memory block in its cache.  This block is writable, and the processor has written (or intends to write) to it.  This state is set when the processor requests the block with read/write privileges, and no other processor currently caches the block.&lt;br /&gt;
* '''ONLY_FRESH''' - only one processor has the memory block in its cache.  This block is writable, but the processor has not written to it.  This state is set when the processor requests the block with read privileges, and no other processor currently caches the block.&lt;br /&gt;
* '''HEAD_DIRTY''' - more than one processor has the memory block in its cache.  This block is writable, and the processor has written (or intends to write) to it.  This state is set when the processor requests the block with read/write privileges, and another processor already caches the block.&lt;br /&gt;
* '''HEAD_FRESH''' - more than one processor has the memory block in its cache.  This block is writable, but the processor has not written to it.  This state is set when the processor requests the block with read privileges, and another processor already caches the block.&lt;br /&gt;
* '''MID_VALID''' - more than two processors have the memory block in their caches, and it is readable.  The processor cache with this state is neither the Head nor the Tail of the sharing list.&lt;br /&gt;
* '''TAIL_VALID''' - two or more processors have the memory block in their caches, and it is readable.  The processor cache with this state is the Tail of the sharing list.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Below is a state diagram that depicts these states:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:cache_states.png|center]]&lt;br /&gt;
&lt;br /&gt;
The request sub-actions of memory transactions that expect a response from the directory include 17 variations of reads, writes, and locks based on the number of bytes requested and the coherency requirements.  For convenience, the notation on the edges of the above graph is derived from the terms used in the SCI specification to describe the memory transactions. &lt;br /&gt;
&lt;br /&gt;
The first value indicates the request to the directory by a node for memory access:&lt;br /&gt;
* '''Fetch R''' - indicates a request for a memory block with read privileges.&lt;br /&gt;
* '''Fetch RW''' - indicates a request for a memory block with read/write privileges.&lt;br /&gt;
* '''Data Modify''' - indicates that a processor (the HEAD or the ONLY processor caching the block) writes to the cache line.&lt;br /&gt;
&lt;br /&gt;
The value in parentheses indicates the memory state at the time of the request.&lt;br /&gt;
&lt;br /&gt;
== Race Conditions ==&lt;br /&gt;
In a distributed shared memory system with caching, the emergence of race conditions is extremely likely.  This is mainly due to the lack of a bus for serialization of actions.  It is further compounded by the problem of network errors and congestion.&lt;br /&gt;
&lt;br /&gt;
The early invalidation case from Section 11.4 in Solihin is an excellent example of a race condition that can arise in a distributed system.  Recall the diagram from the text, and the cache coherence actions, as shown below.&lt;br /&gt;
&lt;br /&gt;
[[Image:EarlyInValidationRace.png|400px|center]]&lt;br /&gt;
The circled actions are as follows:&lt;br /&gt;
# ''A'' sends a read request to ''Home''.&lt;br /&gt;
# ''Home'' replies with data (but the message gets delayed).&lt;br /&gt;
# ''B'' sends a write request to ''Home''.&lt;br /&gt;
# ''Home'' sends an invalidation to ''A'', and it arrives before the delayed data reply (ReplyD).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The SCI protocol is designed to handle this race condition, as well as many others.  The following sections discuss how the protocol's design prevents it.&lt;br /&gt;
&lt;br /&gt;
=== Prevention in the SCI Protocol ===&lt;br /&gt;
Race conditions are almost non-existent in the SCI protocol, due primarily to the protocol's design.  A brief discussion of how this is accomplished follows, and then a walk-through of what happens if the Early Invalidation condition arises in the SCI protocol.&lt;br /&gt;
==== Atomic Transactions ====&lt;br /&gt;
SCI's primary method for preventing race conditions is having atomic transactions.  A transaction is defined as a set of sub-actions necessary to complete some requested action, such as querying memory for data or invalidating a node's cache.  This usually takes the form of a request-response pair, where the initiating node sends a request to another node and waits for the response.  The transaction is not complete until the response returns.  Suppose, for example, that node ''A'' is in the middle of a transaction with node ''B''.  Node ''C'' then tries to make a request of node ''A''.  Node ''A'' will respond to node ''C'' that it is busy, telling node ''C'' to try again, as shown in the following diagram.&lt;br /&gt;
[[Image:AtomicBusy.png|center]]&lt;br /&gt;
Thus, in the Early Invalidation race, node ''A'' would be in the middle of a transaction, having requested a block of data from memory and not having received a response, which would prevent node ''B'' from invalidating it.  Note, however, that nodes will not stay in the busy state indefinitely.  Doing so would lead to potential deadlocks, so all transactions have timeouts that will eventually cause the transaction to fail, thus moving the node out of the busy state and making it able to respond to requests again.&lt;br /&gt;
&lt;br /&gt;
==== Head Node ====&lt;br /&gt;
Another method for preventing race conditions is to have the ''Head'' node of a sharing list perform many of the coherence actions.  As a result, only one node is performing actions such as writes and invalidations of other sharers.  Since only one node is performing these actions, the possibility of concurrent actions is decreased.  If another node wants to write, for instance, it must become the Head Node of the sharing list for the cache line to which it wants to write, displacing the current ''Head'' node as the node that can write.  All state changes in the sharing list occur at the behest of the ''Head'' node, with the exception of nodes deleting themselves from the list.&lt;br /&gt;
&lt;br /&gt;
When simply sharing a read-only copy of a line, as when the memory is in the ''FRESH'' state, the ''Head'' node is somewhat irrelevant.  All the sharing nodes have their own cached value of the cache line.  If any node wants to write, including the ''Head'' node, it must perform an additional action in order to do so.  Likewise, when a sharing list is sharing a line of data in the dirty, or ''GONE'', memory state, as long as the ''Head'' node has not written to the line, then all the sharing nodes stay in the list with their cached line.  However, at the point the ''Head'' node wants to write, it can write its data immediately, but it then must invalidate all other shared copies, via the forward pointers in the sharing list.&lt;br /&gt;
&lt;br /&gt;
This, however, does not protect against a race condition where the memory is ''FRESH'', the ''Head'' node wants to write, and another node also wants to join the list.  The Memory Access mechanism prevents this condition.&lt;br /&gt;
&lt;br /&gt;
==== Memory Access ====&lt;br /&gt;
The final way that the SCI protocol minimizes race conditions is by changing the directory structure.  In Solihin 11.4, there is a ''Home'' node which keeps track of the cache state and has a certain amount of knowledge about the system as a result.  In the SCI protocol, there is still a ''Home'' node of sorts, but this node is only responsible for keeping track of which node is the current ''Head''.  Such a ''Home'' node is usually the node where the memory block physically resides.  When a node wants to access a block of memory, it sends a request to the ''Home'' node of that memory block, assuming it isn't already in the sharing list for that block.  In the event that the data is already being used, the ''Home'' node will include in its response the address of the ''Head'' node of the sharing list.  This requirement necessarily serializes the order of access, as one access request is not serviced at the ''Home'' node until the previous request is finished, and the requests are processed in FIFO order.  As well, whichever node is the ''Head'' of the sharing list will be notified by the new ''Head'' when it is demoted.  All parties involved in a potential race are thus talking to each other, preventing the race from occurring in the first place.&lt;br /&gt;
&lt;br /&gt;
Because of this requirement, the projected race condition from the previous section, where the ''Head'' node is in a ''FRESH'' state and wants to write while another node also wants to read, cannot occur.  This is because both nodes must make their request to the ''Home'' node; the ''Head'' node must request the ability to write, and the other node must request the data from the ''Home'' node.  One of these requests will make it to the ''Home'' node first, resulting in the second request being deferred to the first node to reach the ''Home'' node.  Such a scenario is pictured below.&lt;br /&gt;
[[Image:MemoryAccess.png|center]]&lt;br /&gt;
&lt;br /&gt;
==== Solihin 11.4 Resolution ====&lt;br /&gt;
The three previous sections discussed individual parts of how the SCI protocol reduces race conditions.  Putting all three of these parts together yields the following scenario which resolves the Early Invalidation race as described in the text.&lt;br /&gt;
[[Image:SolihinSCI.png|center]]&lt;br /&gt;
&lt;br /&gt;
# ''A'' sends a request to ''Home'' for access to the memory block.  It then goes into a busy state while it waits for a response.&lt;br /&gt;
# ''B'' also sends a request to ''Home'' for access to the same memory block.  ''A'''s request is received first.&lt;br /&gt;
# ''Home'' responds to ''A'' with the data, but this response gets caught in network traffic and delayed.  This response is sent before ''Home'' processes the request from ''B''.&lt;br /&gt;
# ''Home'' responds to the request from ''B'', telling it that ''A'' is the current ''Head'' node.&lt;br /&gt;
# ''B'' then sends a request to ''A'' to tell it to demote itself.&lt;br /&gt;
# ''A'' still hasn't received the response from ''Home'', so it is still in a transaction, and it tells ''B'' that it is busy.  ''B'' will have to retry the request.&lt;br /&gt;
&lt;br /&gt;
It should be noted that ''B'' will be in the ''PENDING''&amp;lt;sup&amp;gt;&amp;lt;span class = &amp;quot;plainlinks&amp;quot;&amp;gt;[[#Definitions_and_Terms|[def]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt; state, so any requests to demote it from the ''Head'' position will be delayed.  This could lead to a chaining of nodes in the ''PENDING'' state, but as soon as node ''A'' receives its response from ''Home'' or times out, the requests in the chain will be successively accepted and processed.&lt;br /&gt;
&lt;br /&gt;
=== Other Possible Race Conditions ===&lt;br /&gt;
The design of the SCI protocol prevents many race conditions from occurring, as has already been shown.  Here are two other conditions that may arise, with a discussion of how the protocol resolves the issue.&lt;br /&gt;
&lt;br /&gt;
==== Concurrent List Deletions ====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==== Simultaneous Deletion and Invalidation ====&lt;br /&gt;
Even though the ''Head'' node performs most of the coherence actions for a sharing list, the individual nodes are able to invalidate themselves, such as when they evict the block from their cache.  This is referred to in the SCI standard as a &amp;quot;deletion&amp;quot; from the sharing list.  Deletion is accomplished by having the invalidating node &amp;quot;lock&amp;quot; itself and then inform its forward and back nodes that they should now point to each other.  This &amp;quot;locking&amp;quot; is essentially another ''PENDING'' state.  A problem could arise, then, when two neighboring nodes try to invalidate themselves at the same time, as they would both be locked and would not respond to each other's messages.  In this case, though, the protocol specifies that the node closest to the tail takes priority.  The deadlock or race condition is thus averted by ordering the deletions from the tail towards the head.&lt;br /&gt;
&lt;br /&gt;
== Definitions and Terms ==&lt;br /&gt;
* ''Head Node'' - The node at the beginning of the sharing list&lt;br /&gt;
* ''Home Node'' - The node responsible for keeping track of the ''Head'' node for a sharing list for a memory block, usually the node at which the memory block resides.&lt;br /&gt;
* ''PENDING state'' - The intermediate state for a node while it is attempting to become the ''Head'' node in the sharing list.  This includes the transaction involved in querying the ''Home'' node and the transaction involved in demoting the current ''Head'' node.&lt;br /&gt;
* ''SCI'' - Scalable Coherent Interface&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&amp;lt;span id=&amp;quot;1foot&amp;quot;&amp;gt;[[#1body|1.]]&amp;lt;/span&amp;gt; &amp;quot;IEEE Standard for Scalable Coherent Interface (SCI),&amp;quot; IEEE Std 1596-1992, 1993. doi: 10.1109/IEEESTD.1993.120366&lt;br /&gt;
URL: http://ieeexplore.ieee.org.proxied.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=347683&amp;amp;isnumber=8049 &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;2foot&amp;quot;&amp;gt;[[#2body|2.]]&amp;lt;/span&amp;gt; [http://www.scizzl.com/HowSCIcohWorks.html How SCI Coherence Works] &amp;lt;br&amp;gt;&lt;/div&gt;</summary>
		<author><name>Eapotter</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch11_BB_EP&amp;diff=45171</id>
		<title>CSC/ECE 506 Spring 2011/ch11 BB EP</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch11_BB_EP&amp;diff=45171"/>
		<updated>2011-04-19T01:01:12Z</updated>

		<summary type="html">&lt;p&gt;Eapotter: /* States of the Typical Set */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;__TOC__&lt;br /&gt;
&lt;br /&gt;
== History ==&lt;br /&gt;
&lt;br /&gt;
== Background ==&lt;br /&gt;
The Scalable Coherent Interface (SCI) cache coherence protocol defines a set of states for memory, a set of states for cache lines, and a set of actions that transition between these states when processors access memory.&lt;br /&gt;
&lt;br /&gt;
Three sets of these attributes are defined for minimal, typical, and full applications:&lt;br /&gt;
&lt;br /&gt;
* '''Minimal''' - for ‘trivial but correct’ applications that require the presence of the memory in only one cache line.  It does not enable read-sharing, and is appropriate for small multi-processors or applications that don’t require significant sharing.&lt;br /&gt;
&lt;br /&gt;
* '''Typical''' - enables read-sharing of a memory location and provisions for efficiency, such as DMA transfers, local caching of data, and error recovery.  This set adds an additional stable memory state (FRESH) and multiple cache states.  This set will be the focus of this article going forward.&lt;br /&gt;
&lt;br /&gt;
* '''Full''' - adds support for pair-wise sharing, QOLB lock bit synchronization, and cleansing and washing of cache lines.&lt;br /&gt;
&lt;br /&gt;
== State Diagrams ==&lt;br /&gt;
=== Memory States ===&lt;br /&gt;
The memory states define the state of the memory block from the perspective of the home directory.  This 2-bit field is maintained in the memory tag by the home directory along with the pointer to the head of the sharing list (forwId).  This simple state model includes three stable states and one semi-stable state.&lt;br /&gt;
&lt;br /&gt;
{|border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; cellpadding=&amp;quot;2&amp;quot;&lt;br /&gt;
!colspan=&amp;quot;5&amp;quot;| Table 1: Memory States &lt;br /&gt;
|----&lt;br /&gt;
!State&lt;br /&gt;
!Description&lt;br /&gt;
!Minimal&lt;br /&gt;
!Typical&lt;br /&gt;
!Full&lt;br /&gt;
|----&lt;br /&gt;
|HOME&lt;br /&gt;
|no sharing list&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|FRESH&lt;br /&gt;
|sharing-list copy is the same as memory&lt;br /&gt;
|&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|GONE&lt;br /&gt;
|sharing-list copy may be different from memory&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|WASH*&lt;br /&gt;
|transitional state (GONE to FRESH)&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|}&lt;br /&gt;
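Table 1 above can also be written down as a small lookup, which makes it easy to check which memory states each option set uses.  The following is a minimal sketch only: the numeric values assigned to the states and the names MemoryState, SUPPORTED_IN, states_for, and fits_in_two_bits are assumptions made for illustration, not encodings or identifiers taken from the SCI standard.&lt;br /&gt;

```python
from enum import Enum


class MemoryState(Enum):
    # Numeric values are illustrative assumptions, not the standard's encoding.
    HOME = 0   # no sharing list
    FRESH = 1  # sharing-list copy is the same as memory
    GONE = 2   # sharing-list copy may be different from memory
    WASH = 3   # transitional state (GONE to FRESH)


# Which option sets use each state, copied from Table 1.
SUPPORTED_IN = {
    MemoryState.HOME: {"minimal", "typical", "full"},
    MemoryState.FRESH: {"typical", "full"},
    MemoryState.GONE: {"minimal", "typical", "full"},
    MemoryState.WASH: {"full"},
}


def states_for(option_set):
    """Return the memory states available to the given option set."""
    return [s for s in MemoryState if option_set in SUPPORTED_IN[s]]


def fits_in_two_bits():
    """Four states need exactly two bits, matching the 2-bit tag field."""
    return (len(MemoryState) - 1).bit_length() == 2
```

For example, states_for("minimal") yields only HOME and GONE, which reflects that the Minimal set has no read-sharing state.&lt;br /&gt;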
&lt;br /&gt;
=== Cache States ===&lt;br /&gt;
The cache-line states are maintained by each processor's cache-coherency controller.  This 7-bit field is stored in each cache line in the sharing list, along with the pointer to the next sharing-list node (forwId) and the previous sharing-list node (backId).  Seven bits enable up to 128 possible values, and SCI defines twenty-nine stable states for use in the minimal, typical, and full sets.  &lt;br /&gt;
&lt;br /&gt;
Stable states are those cache states that exist when a memory transaction is not in progress.  Their names are derived from a combination of the position of the node in the sharing list, such as ONLY, HEAD, MID, and TAIL, and the state of the data for that node, such as FRESH, CLEAN, DIRTY, VALID, STALE, etc.&lt;br /&gt;
&lt;br /&gt;
====States of the Typical Set====&lt;br /&gt;
&lt;br /&gt;
Following are the states defined for the Typical set.  (Note that other states may be supported to provide certain interoperability with nodes implementing the Full set, but are not used when all nodes support the Typical set.)&lt;br /&gt;
&lt;br /&gt;
* '''ONLY_DIRTY''' - only one processor has the memory block in its cache.  This block is writable, and the processor has written (or intends to write) to it.  This state is set when the processor requests the block with read/write privileges, and no other processor currently caches the block.&lt;br /&gt;
* '''ONLY_FRESH''' - only one processor has the memory block in its cache.  This block is writable, but the processor has not written to it.  This state is set when the processor requests the block with read privileges, and no other processor currently caches the block.&lt;br /&gt;
* '''HEAD_DIRTY''' - more than one processor has the memory block in its cache.  This block is writable, and the processor has written (or intends to write) to it.  This state is set when the processor requests the block with read/write privileges, and another processor already caches the block.&lt;br /&gt;
* '''HEAD_FRESH''' - more than one processor has the memory block in its cache.  This block is writable, but the processor has not written to it.  This state is set when the processor requests the block with read privileges, and another processor already caches the block.&lt;br /&gt;
* '''MID_VALID''' - more than two processors have the memory block in their caches, and it is readable.  The processor cache with this state is neither the Head nor the Tail of the sharing list.&lt;br /&gt;
* '''TAIL_VALID''' - two or more processors have the memory block in their caches, and it is readable.  The processor cache with this state is the Tail of the sharing list.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Below is a state diagram that depicts these states:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:cache_states.png|center]]&lt;br /&gt;
&lt;br /&gt;
The request sub-actions of memory transactions that expect a response from the directory include 17 variations of reads, writes, and locks based on the number of bytes requested and the coherency requirements.  For convenience, the notation on the edges of the above graph is derived from the terms used in the SCI specification to describe the memory transactions. &lt;br /&gt;
&lt;br /&gt;
The first value indicates the request to the directory by a node for memory access:&lt;br /&gt;
* '''Fetch R''' - indicates a request for a memory block with read privileges.&lt;br /&gt;
* '''Fetch RW''' - indicates a request for a memory block with read/write privileges.&lt;br /&gt;
* '''Data Modify''' - indicates that a processor (the HEAD or the ONLY processor caching the block) writes to the cache line.&lt;br /&gt;
&lt;br /&gt;
The value in parentheses indicates the memory state at the time of the request.&lt;br /&gt;
&lt;br /&gt;
== Race Conditions ==&lt;br /&gt;
In a distributed shared memory system with caching, the emergence of race conditions is extremely likely.  This is mainly due to the lack of a bus for serialization of actions.  It is further compounded by the problem of network errors and congestion.&lt;br /&gt;
&lt;br /&gt;
The early invalidation case from Section 11.4 in Solihin is an excellent example of a race condition that can arise in a distributed system.  Recall the diagram from the text, and the cache coherence actions, as shown below.&lt;br /&gt;
&lt;br /&gt;
[[Image:EarlyInValidationRace.png|400px|center]]&lt;br /&gt;
The circled actions are as follows:&lt;br /&gt;
# ''A'' sends a read request to ''Home''.&lt;br /&gt;
# ''Home'' replies with data (but the message gets delayed).&lt;br /&gt;
# ''B'' sends a write request to ''Home''.&lt;br /&gt;
# ''Home'' sends an invalidation to ''A'', and it arrives before the ReplyD.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The SCI protocol has a way to handle this race condition, as well as many others, and the following sections will discuss how the SCI protocol design can prevent this race condition.&lt;br /&gt;
&lt;br /&gt;
=== Prevention in the SCI Protocol ===&lt;br /&gt;
Race conditions are almost non-existent in the SCI protocol, due primarily to the protocol's design.  The following subsections briefly discuss how this is accomplished and then show what happens if this condition arises in the SCI protocol.&lt;br /&gt;
==== Atomic Transactions ====&lt;br /&gt;
SCI's primary method for preventing race conditions is having atomic transactions.  A transaction is defined as a set of sub-actions necessary to complete some requested action, such as querying memory for data or invalidating a node's cache.  This usually takes the form of a request-response pair, where the initiating node sends a request to another node and waits for the response.  The transaction is not complete until the response returns.  Suppose, for example, that node ''A'' is in the middle of a transaction with node ''B''.  Node ''C'' then tries to make a request of node ''A''.  Node ''A'' will respond to node ''C'' that it is busy, telling node ''C'' to try again, as shown in the following diagram.&lt;br /&gt;
[[Image:AtomicBusy.png|center]]&lt;br /&gt;
Thus, in the Early Invalidation race, node ''A'' would be in the middle of a transaction, having requested a block of data from memory and not having received a response, which would prevent node ''B'' from invalidating it.  Note, however, that nodes will not stay in the busy state indefinitely.  Doing so would lead to potential deadlocks, so all transactions have timeouts that will eventually cause the transaction to fail, moving the node out of the busy state and making it able to respond to requests again.&lt;br /&gt;
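The busy-and-retry behaviour described above can be sketched in a few lines.  This is a toy model under stated assumptions, not the standard's mechanism: the Node class, the &quot;BUSY&quot; and &quot;ACCEPTED&quot; reply strings, and the request_with_retry helper (including its on_busy hook, which stands in for time passing between retries) are all hypothetical names chosen for illustration.&lt;br /&gt;

```python
class Node:
    """Toy node that refuses new requests while a transaction is open."""

    def __init__(self, name):
        self.name = name
        self.in_transaction = False

    def begin_transaction(self):
        # Request sent; the node stays busy until a response or a timeout.
        self.in_transaction = True

    def end_transaction(self):
        # Called when the response arrives or when the timeout fires.
        self.in_transaction = False

    def handle_request(self, sender):
        if self.in_transaction:
            return "BUSY"      # the sender must retry later
        return "ACCEPTED"


def request_with_retry(target, sender, max_attempts, on_busy=None):
    """Retry until the target accepts; return the number of attempts used.

    Returns None if every attempt was answered with BUSY.
    """
    for attempt in range(1, max_attempts + 1):
        if target.handle_request(sender) == "ACCEPTED":
            return attempt
        if on_busy is not None:
            on_busy(attempt)   # hook standing in for time passing
    return None
```

In this sketch, if node ''A'' has called begin_transaction, a request from node ''C'' keeps receiving &quot;BUSY&quot; until end_transaction runs, which models either the awaited response or the timeout.&lt;br /&gt;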
&lt;br /&gt;
==== Head Node ====&lt;br /&gt;
Another method for preventing race conditions is to have the ''Head'' node of a sharing list perform many of the coherence actions.  As a result, only one node is performing actions such as writes and invalidations of other sharers.  Since only one node is performing these actions, the possibility of concurrent actions is decreased.  If another node wants to write, for instance, it must become the Head Node of the sharing list for the cache line to which it wants to write, displacing the current ''Head'' node as the node that can write.  All state changes in the sharing list occur at the behest of the ''Head'' node, with the exception of nodes deleting themselves from the list.&lt;br /&gt;
&lt;br /&gt;
When a line is simply shared read-only, as when the memory is in the ''FRESH'' state, the ''Head'' node is somewhat irrelevant.  All the sharing nodes have their own cached copy of the line.  If any node wants to write, including the ''Head'' node, it must perform an additional action in order to do so.  Likewise, when a sharing list is sharing a line of data in the dirty, or ''GONE'', memory state, as long as the ''Head'' node has not written to the line, all the sharing nodes stay in the list with their cached line.  However, once the ''Head'' node wants to write, it can write its data immediately, but it then must invalidate all other shared copies via the forward pointers in the sharing list.&lt;br /&gt;
&lt;br /&gt;
This, however, does not protect against a race condition where the memory is ''FRESH'', the ''Head'' node wants to write, and another node also wants to join the list.  The Memory Access mechanism prevents this condition.&lt;br /&gt;
&lt;br /&gt;
==== Memory Access ====&lt;br /&gt;
The final way that the SCI protocol minimizes race conditions is by changing the directory structure.  In Solihin 11.4, there is a ''Home'' node which keeps track of the cache state and has a certain amount of knowledge about the system as a result.  In the SCI protocol, there is still a ''Home'' node of sorts, but this node is only responsible for keeping track of which node is the current ''Head''.  Such a ''Home'' node is usually the node where the memory block physically resides.  When a node wants to access a block of memory, it sends a request to the ''Home'' node of that memory block, assuming it isn't already in the sharing list for that block.  In the event that the data is already being used, the ''Home'' node will include in its response the address of the ''Head'' node of the sharing list.  This requirement necessarily serializes the order of access, as one access request is not serviced at the ''Home'' node until the previous request is finished, and the requests are processed in FIFO order.  In addition, whichever node is the ''Head'' of the sharing list will be notified by the new ''Head'' when it is demoted.  All parties involved in a potential race are thus talking to each other, preventing the race from occurring in the first place.&lt;br /&gt;
&lt;br /&gt;
Because of this requirement, the projected race condition from the previous section, where the ''Head'' node is in a ''FRESH'' state and wants to write while another node also wants to read, cannot occur.  This is because both nodes must send their requests to the ''Home'' node: the ''Head'' node must request the ability to write, and the other node must request the data from the ''Home'' node.  One of these requests will reach the ''Home'' node first, so the second request is deferred to whichever node reached the ''Home'' node first.  Such a scenario is pictured below.&lt;br /&gt;
[[Image:MemoryAccess.png|center]]&lt;br /&gt;
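The serialization performed at the ''Home'' node can be illustrated with a short sketch.  It assumes a deliberately simplified model in which the ''Home'' node stores nothing but the identity of the current ''Head'' and services queued requests strictly in arrival order; the Home class and its receive and service_all methods are hypothetical names, and memory states and data transfer are left out.&lt;br /&gt;

```python
from collections import deque


class Home:
    """Toy Home node: it records only the current Head of the sharing list."""

    def __init__(self):
        self.head = None       # no sharing list yet
        self.queue = deque()   # requests in arrival order

    def receive(self, requester):
        self.queue.append(requester)

    def service_all(self):
        """Service queued requests one at a time, in FIFO order.

        Each reply is (requester, previous_head).  The requester becomes the
        new Head and must itself contact previous_head, if there is one.
        """
        replies = []
        while self.queue:
            requester = self.queue.popleft()
            replies.append((requester, self.head))
            self.head = requester
        return replies
```

If requests from ''A'' and then ''B'' are queued, the first reply tells ''A'' that there was no previous ''Head'', and the second tells ''B'' that ''A'' is the node it must contact, so the two requesters can never both believe they arrived first.&lt;br /&gt;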
&lt;br /&gt;
==== Solihin 11.4 Resolution ====&lt;br /&gt;
The three previous sections discussed individual parts of how the SCI protocol reduces race conditions.  Putting all three of these parts together yields the following scenario which resolves the Early Invalidation race as described in the text.&lt;br /&gt;
[[Image:SolihinSCI.png|center]]&lt;br /&gt;
&lt;br /&gt;
# ''A'' sends a request to ''Home'' for access to the memory block.  It then goes into a busy state while it waits for a response.&lt;br /&gt;
# ''B'' also sends a request to ''Home'' for access to the same memory block.  ''A'''s request is received first.&lt;br /&gt;
# ''Home'' responds to ''A'' with the data, but this response gets caught in network traffic and delayed.  This response is sent before ''Home'' processes the request from ''B''.&lt;br /&gt;
# ''Home'' responds to the request from ''B'', telling it that ''A'' is the current ''Head'' node.&lt;br /&gt;
# ''B'' then sends a request to ''A'' to tell it to demote itself.&lt;br /&gt;
# ''A'' still hasn't received the response from ''Home'', so it is still in a transaction, and it tells ''B'' that it is busy.  ''B'' will have to retry the request.&lt;br /&gt;
&lt;br /&gt;
It should be noted that ''B'' will be in the ''PENDING''&amp;lt;sup&amp;gt;&amp;lt;span class = &amp;quot;plainlinks&amp;quot;&amp;gt;[[#Definitions_and_Terms|[def]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt; state, so any requests to demote it from the ''Head'' position will be delayed.  This could lead to a chain of nodes in the ''PENDING'' state, but as soon as node ''A'' receives its response from ''Home'' or times out, the requests in the chain will be accepted and processed in succession.&lt;br /&gt;
&lt;br /&gt;
=== Other Possible Race Conditions ===&lt;br /&gt;
The design of the SCI protocol prevents many race conditions from occurring, as has already been shown.  Here are two other conditions that may arise, with a discussion of how the protocol resolves the issue.&lt;br /&gt;
&lt;br /&gt;
==== Concurrent List Deletions ====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==== Simultaneous Deletion and Invalidation ====&lt;br /&gt;
Even though the ''Head'' node performs most of the coherence actions for a sharing list, the individual nodes are able to invalidate themselves, such as when they evict the block from their cache.  This is referred to in the SCI standard as a &amp;quot;deletion&amp;quot; from the sharing list.  Deletion is accomplished by having the invalidating node &amp;quot;lock&amp;quot; itself and then inform its forward and back nodes that they should now point to each other.  This &amp;quot;locking&amp;quot; is essentially another ''PENDING'' state.  A problem could arise, then, when two neighboring nodes try to invalidate themselves at the same time, as they would both be locked and would not respond to each other's messages.  In this case, though, the protocol specifies that the node closest to the tail takes priority.  The deadlock or race condition is thus averted by ordering the deletions from the tail towards the head.&lt;br /&gt;
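The tail-first ordering rule can be sketched as follows.  The example assumes the sharing list is represented as a plain Python list ordered from ''Head'' to ''Tail''; the function name delete_in_tail_first_order is hypothetical, and the locking and message exchange between neighbors are reduced to a single list removal.&lt;br /&gt;

```python
def delete_in_tail_first_order(sharing_list, leaving):
    """Remove the nodes in `leaving`, handling the one nearest the tail first.

    sharing_list is ordered from Head to Tail.  Returns the order in which
    the deletions complete and the list that remains.
    """
    remaining = list(sharing_list)
    order = []
    # Walk from the tail towards the head so the tail-most node wins.
    for node in reversed(sharing_list):
        if node in leaving:
            remaining.remove(node)   # its neighbors now point to each other
            order.append(node)
    return order, remaining
```

With the list H, M1, M2, T and both M1 and M2 leaving at once, M2 completes first because it is nearer the tail, then M1, leaving H and T pointing at each other.&lt;br /&gt;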
&lt;br /&gt;
== Definitions and Terms ==&lt;br /&gt;
* ''Head Node'' - The node at the beginning of the sharing list&lt;br /&gt;
* ''Home Node'' - The node responsible for keeping track of the ''Head'' node for a sharing list for a memory block, usually the node at which the memory block resides.&lt;br /&gt;
* ''PENDING state'' - The intermediate state for a node while it is attempting to become the ''Head'' node in the sharing list.  This includes the transaction involved in querying the ''Home'' node and the transaction involved in demoting the current ''Head'' node.&lt;br /&gt;
* ''SCI'' - Scalable Coherent Interface&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&amp;lt;span id=&amp;quot;1foot&amp;quot;&amp;gt;[[#1body|1.]]&amp;lt;/span&amp;gt; &amp;quot;IEEE Standard for Scalable Coherent Interface (SCI),&amp;quot; IEEE Std 1596-1992, 1993. doi: 10.1109/IEEESTD.1993.120366&lt;br /&gt;
URL: http://ieeexplore.ieee.org.proxied.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=347683&amp;amp;isnumber=8049 &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;2foot&amp;quot;&amp;gt;[[#2body|2.]]&amp;lt;/span&amp;gt; [http://www.scizzl.com/HowSCIcohWorks.html How SCI Coherence Works] &amp;lt;br&amp;gt;&lt;/div&gt;</summary>
		<author><name>Eapotter</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch11_BB_EP&amp;diff=45165</id>
		<title>CSC/ECE 506 Spring 2011/ch11 BB EP</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch11_BB_EP&amp;diff=45165"/>
		<updated>2011-04-19T00:58:08Z</updated>

		<summary type="html">&lt;p&gt;Eapotter: /* States of the Typical Set */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;__TOC__&lt;br /&gt;
&lt;br /&gt;
== History ==&lt;br /&gt;
&lt;br /&gt;
== Background ==&lt;br /&gt;
The Scalable Coherent Interface (SCI) cache coherence protocol defines a set of states for memory, a set of states for cache lines, and a set of actions that transition between these states when processors access memory.&lt;br /&gt;
&lt;br /&gt;
Three sets of these attributes are defined for minimal, typical, and full applications:&lt;br /&gt;
&lt;br /&gt;
* '''Minimal''' - for ‘trivial but correct’ applications that require the presence of the memory in only one cache line.  It does not enable read-sharing, and is appropriate for small multi-processors or applications that don’t require significant sharing.&lt;br /&gt;
&lt;br /&gt;
* '''Typical''' - enables read-sharing of a memory location and provisions for efficiency, such as DMA transfers, local caching of data, and error recovery.  This set adds an additional stable memory state (FRESH) and multiple cache states.  This set will be the focus of this article going forward.&lt;br /&gt;
&lt;br /&gt;
* '''Full''' - adds support for pair-wise sharing, QOLB lock bit synchronization, and cleansing and washing of cache lines.&lt;br /&gt;
&lt;br /&gt;
== State Diagrams ==&lt;br /&gt;
=== Memory States ===&lt;br /&gt;
The memory states define the state of the memory block from the perspective of the home directory.  This 2-bit field is maintained in the memory tag by the home directory along with the pointer to the head of the sharing list (forwId).  This simple state model includes three stable states and one semi-stable state.&lt;br /&gt;
&lt;br /&gt;
{|border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; cellpadding=&amp;quot;2&amp;quot;&lt;br /&gt;
!colspan=&amp;quot;5&amp;quot;| Table 1: Memory States &lt;br /&gt;
|----&lt;br /&gt;
!State&lt;br /&gt;
!Description&lt;br /&gt;
!Minimal&lt;br /&gt;
!Typical&lt;br /&gt;
!Full&lt;br /&gt;
|----&lt;br /&gt;
|HOME&lt;br /&gt;
|no sharing list&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|FRESH&lt;br /&gt;
|sharing-list copy is the same as memory&lt;br /&gt;
|&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|GONE&lt;br /&gt;
|sharing-list copy may be different from memory&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|WASH*&lt;br /&gt;
|transitional state (GONE to FRESH)&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=== Cache States ===&lt;br /&gt;
The cache-line states are maintained by each processor's cache-coherency controller.  This 7-bit field is stored in each cache line in the sharing list, along with the pointer to the next sharing-list node (forwId) and the previous sharing-list node (backId).  Seven bits enable up to 128 possible values, and SCI defines twenty-nine stable states for use in the minimal, typical, and full sets.  &lt;br /&gt;
&lt;br /&gt;
Stable states are those cache states that exist when a memory transaction is not in progress.  Their names are derived from a combination of the position of the node in the sharing list, such as ONLY, HEAD, MID, and TAIL, and the state of the data for that node, such as FRESH, CLEAN, DIRTY, VALID, STALE, etc.&lt;br /&gt;
&lt;br /&gt;
====States of the Typical Set====&lt;br /&gt;
&lt;br /&gt;
Following are the states defined for the Typical set.  (Note that other states may be supported to provide certain interoperability with nodes implementing the Full set, but are not used when all nodes support the Typical set.)&lt;br /&gt;
&lt;br /&gt;
* '''ONLY_DIRTY''' - only one processor has the memory block in its cache.  This block is writable, and the processor has written (or intends to write) to it.  This state is set when the processor requests the block with read/write privileges, and no other processor currently caches the block.&lt;br /&gt;
* '''ONLY_FRESH''' - only one processor has the memory block in its cache.  This block is writable, but the processor has not written to it.  This state is set when the processor requests the block with read privileges, and no other processor currently caches the block.&lt;br /&gt;
* '''HEAD_DIRTY''' - more than one processor has the memory block in its cache.  This block is writable, and the processor has written (or intends to write) to it.  This state is set when the processor requests the block with read/write privileges, and another processor already caches the block.&lt;br /&gt;
* '''HEAD_FRESH''' - more than one processor has the memory block in its cache.  This block is writable, but the processor has not written to it.  This state is set when the processor requests the block with read privileges, and another processor already caches the block.&lt;br /&gt;
* '''MID_VALID''' - more than two processors have the memory block in their caches, and it is readable.  The processor cache with this state is neither the Head nor the Tail of the sharing list.&lt;br /&gt;
* '''TAIL_VALID''' - two or more processors have the memory block in their caches, and it is readable.  The processor cache with this state is the Tail of the sharing list.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Below is a state diagram that depicts these states:&lt;br /&gt;
&lt;br /&gt;
[[Image:cache_states.png|center]]&lt;br /&gt;
&lt;br /&gt;
The request sub-actions of memory transactions that expect a response from the directory include 17 variations of reads, writes, and locks based on the number of bytes requested and the coherency requirements.  For convenience, the notation on the edges of the above graph is derived from the terms used in the SCI specification to describe the memory transactions. &lt;br /&gt;
&lt;br /&gt;
The first value indicates the request to the directory by a node for memory access:&lt;br /&gt;
* '''Fetch R''' - indicates a request for a memory block with read privileges.&lt;br /&gt;
* '''Fetch RW''' - indicates a request for a memory block with read/write privileges.&lt;br /&gt;
* '''Data Modify''' - indicates that a processor (the HEAD or the ONLY processor caching the block) writes to the cache line.&lt;br /&gt;
&lt;br /&gt;
The value in parentheses indicates the memory state at the time of the request.&lt;br /&gt;
&lt;br /&gt;
== Race Conditions ==&lt;br /&gt;
In a distributed shared memory system with caching, the emergence of race conditions is extremely likely.  This is mainly due to the lack of a bus for serialization of actions.  It is further compounded by the problem of network errors and congestion.&lt;br /&gt;
&lt;br /&gt;
The early invalidation case from Section 11.4 in Solihin is an excellent example of a race condition that can arise in a distributed system.  Recall the diagram from the text, and the cache coherence actions, as shown below.&lt;br /&gt;
&lt;br /&gt;
[[Image:EarlyInValidationRace.png|400px|center]]&lt;br /&gt;
The circled actions are as follows:&lt;br /&gt;
# ''A'' sends a read request to ''Home''.&lt;br /&gt;
# ''Home'' replies with data (but the message gets delayed).&lt;br /&gt;
# ''B'' sends a write request to ''Home''.&lt;br /&gt;
# ''Home'' sends an invalidation to ''A'', and it arrives before the ReplyD.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The SCI protocol has a way to handle this race condition, as well as many others, and the following sections will discuss how the SCI protocol design can prevent this race condition.&lt;br /&gt;
&lt;br /&gt;
=== Prevention in the SCI Protocol ===&lt;br /&gt;
Race conditions are almost non-existent in the SCI protocol, due primarily to the protocol's design.  The following subsections briefly discuss how this is accomplished and then show what happens if this condition arises in the SCI protocol.&lt;br /&gt;
==== Atomic Transactions ====&lt;br /&gt;
SCI's primary method for preventing race conditions is having atomic transactions.  A transaction is defined as a set of sub-actions necessary to complete some requested action, such as querying memory for data or invalidating a node's cache.  This usually takes the form of a request-response pair, where the initiating node sends a request to another node and waits for the response.  The transaction is not complete until the response returns.  Suppose, for example, that node ''A'' is in the middle of a transaction with node ''B''.  Node ''C'' then tries to make a request of node ''A''.  Node ''A'' will respond to node ''C'' that it is busy, telling node ''C'' to try again, as shown in the following diagram.&lt;br /&gt;
[[Image:AtomicBusy.png|center]]&lt;br /&gt;
Thus, in the Early Invalidation race, node ''A'' would be in the middle of a transaction, having requested a block of data from memory and not having received a response, which would prevent node ''B'' from invalidating it.  Note, however, that nodes will not stay in the busy state indefinitely.  Doing so would lead to potential deadlocks, so all transactions have timeouts that will eventually cause the transaction to fail, moving the node out of the busy state and making it able to respond to requests again.&lt;br /&gt;
&lt;br /&gt;
==== Head Node ====&lt;br /&gt;
Another method for preventing race conditions is to have the ''Head'' node of a sharing list perform many of the coherence actions.  As a result, only one node is performing actions such as writes and invalidations of other sharers.  Since only one node is performing these actions, the possibility of concurrent actions is decreased.  If another node wants to write, for instance, it must become the Head Node of the sharing list for the cache line to which it wants to write, displacing the current ''Head'' node as the node that can write.  All state changes in the sharing list occur at the behest of the ''Head'' node, with the exception of nodes deleting themselves from the list.&lt;br /&gt;
&lt;br /&gt;
When a line is simply shared read-only, as when the memory is in the ''FRESH'' state, the ''Head'' node is somewhat irrelevant.  All the sharing nodes have their own cached copy of the line.  If any node wants to write, including the ''Head'' node, it must perform an additional action in order to do so.  Likewise, when a sharing list is sharing a line of data in the dirty, or ''GONE'', memory state, as long as the ''Head'' node has not written to the line, all the sharing nodes stay in the list with their cached line.  However, once the ''Head'' node wants to write, it can write its data immediately, but it then must invalidate all other shared copies via the forward pointers in the sharing list.&lt;br /&gt;
&lt;br /&gt;
This, however, does not protect against a race condition where the memory is ''FRESH'', the ''Head'' node wants to write, and another node also wants to join the list.  The Memory Access mechanism prevents this condition.&lt;br /&gt;
&lt;br /&gt;
==== Memory Access ====&lt;br /&gt;
The final way that the SCI protocol minimizes race conditions is by changing the directory structure.  In Solihin 11.4, there is a ''Home'' node which keeps track of the cache state and has a certain amount of knowledge about the system as a result.  In the SCI protocol, there is still a ''Home'' node of sorts, but this node is only responsible for keeping track of which node is the current ''Head''.  Such a ''Home'' node is usually the node where the memory block physically resides.  When a node wants to access a block of memory, it sends a request to the ''Home'' node of that memory block, assuming it isn't already in the sharing list for that block.  In the event that the data is already being used, the ''Home'' node will include in its response the address of the ''Head'' node of the sharing list.  This requirement necessarily serializes the order of access, as one access request is not serviced at the ''Home'' node until the previous request is finished, and the requests are processed in FIFO order.  In addition, whichever node is the ''Head'' of the sharing list will be notified by the new ''Head'' when it is demoted.  All parties involved in a potential race are thus talking to each other, preventing the race from occurring in the first place.&lt;br /&gt;
&lt;br /&gt;
Because of this requirement, the projected race condition from the previous section, where the ''Head'' node is in a ''FRESH'' state and wants to write while another node also wants to read, cannot occur.  This is because both nodes must make their requests to the ''Home'' node: the ''Head'' node must request the ability to write, and the other node must request the data.  One of these requests will reach the ''Home'' node first, and the second request is then deferred to the node whose request reached the ''Home'' node first.  Such a scenario is pictured below.&lt;br /&gt;
[[Image:MemoryAccess.png|center]]&lt;br /&gt;
&lt;br /&gt;
==== Solihin 11.4 Resolution ====&lt;br /&gt;
The three previous sections discussed individual parts of how the SCI protocol reduces race conditions.  Putting all three of these parts together yields the following scenario which resolves the Early Invalidation race as described in the text.&lt;br /&gt;
[[Image:SolihinSCI.png|center]]&lt;br /&gt;
&lt;br /&gt;
# ''A'' sends a request to ''Home'' for access to the memory block.  It then goes into a busy state while it waits for a response.&lt;br /&gt;
# ''B'' also sends a request to ''Home'' for access to the same memory block.  ''A'''s request is received first.&lt;br /&gt;
# ''Home'' responds to ''A'' with the data, but this response gets caught in network traffic and delayed.  This response is sent before ''Home'' processes the request from ''B''.&lt;br /&gt;
# ''Home'' responds to the request from ''B'', telling it that ''A'' is the current ''Head'' node.&lt;br /&gt;
# ''B'' then sends a request to ''A'' to tell it to demote itself.&lt;br /&gt;
# ''A'' still hasn't received the response from ''Home'', so it is still in a transaction, and it tells ''B'' that it is busy.  ''B'' will have to retry the request.&lt;br /&gt;
&lt;br /&gt;
It should be noted that ''B'' will be in the ''PENDING''&amp;lt;sup&amp;gt;&amp;lt;span class = &amp;quot;plainlinks&amp;quot;&amp;gt;[[#Definitions_and_Terms|[def]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt; state, so any requests to demote it from the ''Head'' position will be delayed.  This could lead to a chain of nodes in the ''PENDING'' state, but as soon as node ''A'' receives its response from ''Home'' or times out, the requests in the chain will be accepted and processed in succession.&lt;br /&gt;
&lt;br /&gt;
=== Other Possible Race Conditions ===&lt;br /&gt;
The design of the SCI protocol prevents many race conditions from occurring, as has already been shown.  Here are two other conditions that may arise, with a discussion of how the protocol resolves the issue.&lt;br /&gt;
&lt;br /&gt;
==== Concurrent List Deletions ====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==== Simultaneous Deletion and Invalidation ====&lt;br /&gt;
Even though the ''Head'' node performs most of the coherence actions for a sharing list, the individual nodes are able to invalidate themselves, such as when they evict the block from their cache.  This is referred to in the SCI standard as a &amp;quot;deletion&amp;quot; from the sharing list.  Deletion is accomplished by having the invalidating node &amp;quot;lock&amp;quot; itself and then inform its forward and back nodes that they should now point to each other.  This &amp;quot;locking&amp;quot; is essentially another ''PENDING'' state.  A problem could arise, then, when two neighboring nodes try to invalidate themselves at the same time, as they would both be locked and would not respond to each other's messages.  In this case, though, the protocol specifies that the node closest to the tail takes priority.  The deadlock and/or race condition is thus averted by ordering the deletions from the tail towards the head.&lt;br /&gt;
&lt;br /&gt;
== Definitions and Terms ==&lt;br /&gt;
* ''Head Node'' - The node at the beginning of the sharing list&lt;br /&gt;
* ''Home Node'' - The node responsible for keeping track of the ''Head'' node for a sharing list for a memory block, usually the node at which the memory block resides.&lt;br /&gt;
* ''PENDING state'' - The intermediate state for a node while it is attempting to become the ''Head'' node in the sharing list.  This includes the transaction involved in querying the ''Home'' node and the transaction involved in demoting the current ''Head'' node.&lt;br /&gt;
* ''SCI'' - Scalable Coherent Interface&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&amp;lt;span id=&amp;quot;1foot&amp;quot;&amp;gt;[[#1body|1.]]&amp;lt;/span&amp;gt; &amp;quot;IEEE Standard for Scalable Coherent Interface (SCI),&amp;quot; IEEE Std 1596-1992, p. i, 1993. doi: 10.1109/IEEESTD.1993.120366&lt;br /&gt;
URL: http://ieeexplore.ieee.org.proxied.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=347683&amp;amp;isnumber=8049 &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;2foot&amp;quot;&amp;gt;[[#2body|2.]]&amp;lt;/span&amp;gt; [http://www.scizzl.com/HowSCIcohWorks.html How SCI Coherence Works] &amp;lt;br&amp;gt;&lt;/div&gt;</summary>
		<author><name>Eapotter</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch11_BB_EP&amp;diff=45147</id>
		<title>CSC/ECE 506 Spring 2011/ch11 BB EP</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch11_BB_EP&amp;diff=45147"/>
		<updated>2011-04-19T00:23:51Z</updated>

		<summary type="html">&lt;p&gt;Eapotter: /* States of the Typical Set */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;__TOC__&lt;br /&gt;
&lt;br /&gt;
== History ==&lt;br /&gt;
&lt;br /&gt;
== Background ==&lt;br /&gt;
The Scalable Coherent Interface (SCI) cache coherence protocol defines a set of states for memory, a set of states for cache lines, and a set of actions that transition between these states when processors access memory.&lt;br /&gt;
&lt;br /&gt;
Three sets of these attributes are defined for minimal, typical, and full applications:&lt;br /&gt;
&lt;br /&gt;
* '''Minimal''' - for ‘trivial but correct’ applications that require the presence of the memory in only one cache line.  It does not enable read-sharing, and is appropriate for small multi-processors or applications that don’t require significant sharing.&lt;br /&gt;
&lt;br /&gt;
* '''Typical''' - enables read-sharing of a memory location and provisions for efficiency, such as DMA transfers, local caching of data, and error recovery.  This set adds an additional stable memory state (FRESH) and multiple cache states.  This set will be the focus of this article going forward.&lt;br /&gt;
&lt;br /&gt;
* '''Full''' - adds support for pair-wise sharing, QOLB lock bit synchronization, and cleansing and washing of cache lines.&lt;br /&gt;
&lt;br /&gt;
== State Diagrams ==&lt;br /&gt;
=== Memory States ===&lt;br /&gt;
The memory states define the state of the memory block from the perspective of the home directory.  This 2-bit field is maintained in the memory tag by the home directory along with the pointer to the head of the sharing list (forwId).  This simple state model includes three stable states and one semi-stable state.&lt;br /&gt;
&lt;br /&gt;
{|border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; cellpadding=&amp;quot;2&amp;quot;&lt;br /&gt;
!colspan=&amp;quot;5&amp;quot;| Table 1: Memory States &lt;br /&gt;
|----&lt;br /&gt;
!State&lt;br /&gt;
!Description&lt;br /&gt;
!Minimal&lt;br /&gt;
!Typical&lt;br /&gt;
!Full&lt;br /&gt;
|----&lt;br /&gt;
|HOME&lt;br /&gt;
|no sharing list&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|FRESH&lt;br /&gt;
|sharing-list copy is the same as memory&lt;br /&gt;
|&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|GONE&lt;br /&gt;
|sharing-list copy may be different from memory&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|WASH*&lt;br /&gt;
|transitional state (GONE to FRESH)&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|}&lt;br /&gt;
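&lt;br /&gt;
As a rough illustration, Table 1 can be transcribed into a small lookup structure.  The sketch below is in Python and is not taken from the SCI standard: the names MemoryState, SUPPORTED_BY, and states_for are hypothetical, and the actual 2-bit encodings are not reproduced because this article does not give them.&lt;br /&gt;

```python
from enum import Enum


class MemoryState(Enum):
    """Memory states from Table 1 (descriptions only; bit encodings are not modelled)."""
    HOME = "no sharing list"
    FRESH = "sharing-list copy is the same as memory"
    GONE = "sharing-list copy may be different from memory"
    WASH = "transitional state (GONE to FRESH)"


# Which option sets use each memory state, transcribed from Table 1.
SUPPORTED_BY = {
    MemoryState.HOME: {"minimal", "typical", "full"},
    MemoryState.FRESH: {"typical", "full"},
    MemoryState.GONE: {"minimal", "typical", "full"},
    MemoryState.WASH: {"full"},
}


def states_for(option_set):
    """Return the memory-state names available to one option set, in table order."""
    return [s.name for s in MemoryState if option_set in SUPPORTED_BY[s]]


print(states_for("minimal"))  # ['HOME', 'GONE']
print(states_for("typical"))  # ['HOME', 'FRESH', 'GONE']
```

The minimal set has no FRESH state, which matches the Background description that it does not enable read-sharing.&lt;br /&gt;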
&lt;br /&gt;
=== Cache States ===&lt;br /&gt;
The cache-line states are maintained by each processor's cache-coherency controller.  This 7-bit field is stored in each cache line in the sharing list, along with the pointer to the next sharing-list node (forwId) and the previous sharing-list node (backId).  Seven bits enable up to 128 possible values, and SCI defines twenty-nine stable states for use in the minimal, typical, and full sets.&lt;br /&gt;
&lt;br /&gt;
Stable states are those cache states that exist when a memory transaction is not in process.  Their names are derived from a combination of&lt;br /&gt;
the position of the node in the sharing list, such as ONLY, HEAD, MID, and TAIL, and the state of the data for that node, such as FRESH, CLEAN, DIRTY, VALID, STALE, etc.&lt;br /&gt;
&lt;br /&gt;
====States of the Typical Set====&lt;br /&gt;
&lt;br /&gt;
Following are the states defined for the Typical set.  (Note that other states may be supported to provide certain interoperability with nodes implementing the Full set, but are not used when all nodes support the Typical set.)&lt;br /&gt;
&lt;br /&gt;
* '''ONLY_DIRTY''' - only one processor has the memory block in its cache.  This block is writable, and the processor has written (or intends to write) to it.  This state is set when the processor requests the block with read/write privileges, and no other processor currently caches the block.&lt;br /&gt;
* '''ONLY_FRESH''' - only one processor has the memory block in its cache.  This block is writable, but the processor has not written to it.  This state is set when the processor requests the block with read privileges, and no other processor currently caches the block.&lt;br /&gt;
* '''HEAD_DIRTY''' - more than one processor has the memory block in its cache.  This block is writable, and the processor has written (or intends to write) to it.  This state is set when the processor requests the block with read/write privileges, and another processor already caches the block.&lt;br /&gt;
* '''HEAD_FRESH''' - more than one processor has the memory block in its cache.  This block is writable, but the processor has not written to it.  This state is set when the processor requests the block with read privileges, and another processor already caches the block.&lt;br /&gt;
* '''MID_VALID''' - more than two processors have the memory block in their caches, and it is readable.  The processor cache with this state is neither the Head nor the Tail of the sharing list.&lt;br /&gt;
* '''TAIL_VALID''' - more than one processor has the memory block in its cache, and it is readable.  The processor cache with this state is the Tail of the sharing list.&lt;br /&gt;
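&lt;br /&gt;
The naming convention (list position followed by data state) can be illustrated with a short Python sketch.  The helper names split_state and may_write are hypothetical, and the writability rule simply restates the descriptions in the list above rather than the full rules of the standard.&lt;br /&gt;

```python
# The six stable cache states of the Typical set, as listed above.
TYPICAL_STATES = [
    "ONLY_DIRTY", "ONLY_FRESH",
    "HEAD_DIRTY", "HEAD_FRESH",
    "MID_VALID", "TAIL_VALID",
]


def split_state(name):
    """Split a state name into (position in the sharing list, state of the data)."""
    position, data = name.split("_")
    return position, data


def may_write(name):
    """Only ONLY_* and HEAD_* lines are described above as writable."""
    position, _ = split_state(name)
    return position in ("ONLY", "HEAD")


print(2 ** 7)                     # 128 values fit in the 7-bit state field
print(split_state("HEAD_FRESH"))  # ('HEAD', 'FRESH')
print([s for s in TYPICAL_STATES if may_write(s)])
```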
&lt;br /&gt;
&lt;br /&gt;
Below is a state diagram that depicts these states:&lt;br /&gt;
&lt;br /&gt;
[[Image:cache_states.png|center]]&lt;br /&gt;
&lt;br /&gt;
The request sub-actions of memory transactions that expect a response from the directory include 17 variations of reads, writes, and locks based on the number of bytes requested and the coherency requirements.  For convenience, the notation on the edges of the above graph is derived from the terms used in the SCI specification to describe the memory transactions. &lt;br /&gt;
&lt;br /&gt;
The first value indicates the request to the directory by a node for memory access:&lt;br /&gt;
* '''Fetch R''' - indicates a request for a memory block with read privileges.&lt;br /&gt;
* '''Fetch RW''' - indicates a request for a memory block with read/write privileges.&lt;br /&gt;
* '''DATA_MODIFY''' - indicates that a processor (the HEAD or the ONLY processor caching the block) writes to the cache line.&lt;br /&gt;
&lt;br /&gt;
The value in parentheses indicates the memory state at the time of the request.&lt;br /&gt;
&lt;br /&gt;
== Race Conditions ==&lt;br /&gt;
In a distributed shared memory system with caching, the emergence of race conditions is extremely likely.  This is mainly due to the lack of a bus for serialization of actions.  It is further compounded by the problem of network errors and congestion.&lt;br /&gt;
&lt;br /&gt;
The early invalidation case from Section 11.4 in Solihin is an excellent example of a race condition that can arise in a distributed system.  Recall the diagram from the text, and the cache coherence actions, as shown below.&lt;br /&gt;
&lt;br /&gt;
[[Image:EarlyInValidationRace.png|400px|center]]&lt;br /&gt;
The circled actions are as follows:&lt;br /&gt;
# ''A'' sends a read request to ''Home''.&lt;br /&gt;
# ''Home'' replies with data (but the message gets delayed).&lt;br /&gt;
# ''B'' sends a write request to ''Home''.&lt;br /&gt;
# ''Home'' sends an invalidation to ''A'', and it arrives before the ReplyD.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The SCI protocol has a way to handle this race condition, as well as many others.  The following sections discuss how the design of the SCI protocol prevents it.&lt;br /&gt;
&lt;br /&gt;
=== Prevention in the SCI Protocol ===&lt;br /&gt;
Race conditions are almost non-existent in the SCI protocol, due primarily to the protocol's design.  A brief discussion of how this is accomplished follows, and then a walk-through of what happens when the Early Invalidation condition arises in the SCI protocol.&lt;br /&gt;
==== Atomic Transactions ====&lt;br /&gt;
SCI's primary method for preventing race conditions is the use of atomic transactions.  A transaction is defined as a set of sub-actions necessary to complete some requested action, such as querying memory for data or invalidating a node's cache.  This usually takes the form of a request-response pair, where the initiating node sends a request to another node and waits for the response.  The transaction is not complete until the response returns.  Suppose, for example, that node ''A'' is in the middle of a transaction with node ''B''.  Node ''C'' then tries to make a request of node ''A''.  Node ''A'' will respond to node ''C'' that it is busy, telling node ''C'' to try again, as shown in the following diagram.&lt;br /&gt;
[[Image:AtomicBusy.png|center]]&lt;br /&gt;
Thus, in the Early Invalidation race, node ''A'' would be in the middle of a transaction, having requested a block of data from memory and not having received a response, which would prevent node ''B'' from invalidating it.  Note, however, that nodes will not stay in the busy state indefinitely.  Doing so would lead to potential deadlocks, so all transactions have timeouts that eventually cause the transaction to fail, moving the node out of the busy state and making it able to respond to requests again.&lt;br /&gt;
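&lt;br /&gt;
The busy-and-retry behaviour can be sketched in a few lines of Python.  This is only an illustration under simplifying assumptions: the Node class, the &quot;BUSY&quot; and &quot;ACCEPTED&quot; reply strings, and the send_with_retry helper are hypothetical names, and a timeout is modelled as nothing more than the busy flag being cleared.&lt;br /&gt;

```python
class Node:
    """A node that refuses new requests while its own transaction is open."""

    def __init__(self, name):
        self.name = name
        self.busy = False

    def start_transaction(self):
        # The node has sent a request and is waiting for the response.
        self.busy = True

    def finish_transaction(self):
        # The response arrived, or the transaction timed out.
        self.busy = False

    def handle_request(self, sender):
        # A busy node does not queue the request; it tells the sender to retry.
        if self.busy:
            return "BUSY"
        return "ACCEPTED"


def send_with_retry(target, sender, max_attempts, on_retry=None):
    """Retry until the target accepts; return the attempt number, or None if all fail."""
    for attempt in range(1, max_attempts + 1):
        if target.handle_request(sender) == "ACCEPTED":
            return attempt
        if on_retry is not None:
            on_retry(attempt)
    return None


a = Node("A")
a.start_transaction()  # A is waiting for its reply from memory


def response_arrives(attempt):
    # Hypothetical schedule: A's delayed response lands after the second refusal.
    if attempt == 2:
        a.finish_transaction()


result = send_with_retry(a, "C", 5, response_arrives)
print(result)  # 3
```

In this run node ''C'' is refused twice and accepted on its third attempt, once node ''A'' has left the busy state.&lt;br /&gt;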
&lt;br /&gt;
==== Head Node ====&lt;br /&gt;
Another method for preventing race conditions is to have the ''Head'' node of a sharing list perform many of the coherence actions.  As a result, only one node is performing actions such as writes and invalidations of other sharers.  Since only one node is performing these actions, the possibility of concurrent actions is decreased.  If another node wants to write, for instance, it must become the Head Node of the sharing list for the cache line to which it wants to write, displacing the current ''Head'' node as the node that can write.  All state changes in the sharing list occur at the behest of the ''Head'' node, with the exception of nodes deleting themselves from the list.&lt;br /&gt;
&lt;br /&gt;
When the nodes are simply sharing a read-only copy of a line, as when the memory is in the ''FRESH'' state, the ''Head'' node is somewhat irrelevant.  All the sharing nodes have their own cached copy of the line.  If any node wants to write, including the ''Head'' node, it must perform an additional action in order to do so.  Likewise, when a sharing list is sharing a line of data in the dirty, or ''GONE'', memory state, all the sharing nodes stay in the list with their cached line as long as the ''Head'' node has not written to the line.  Once the ''Head'' node wants to write, however, it can write its data immediately, but it must then invalidate all other shared copies via the forward pointers in the sharing list.&lt;br /&gt;
&lt;br /&gt;
This, however, does not protect against a race condition where the memory is ''FRESH'', the ''Head'' node wants to write, and another node also wants to join the list.  The Memory Access mechanism prevents this condition.&lt;br /&gt;
&lt;br /&gt;
==== Memory Access ====&lt;br /&gt;
The final way that the SCI protocol minimizes race conditions is by changing the directory structure.  In Solihin 11.4, there is a ''Home'' node that keeps track of the cache state and, as a result, has a certain amount of knowledge about the system.  In the SCI protocol, there is still a ''Home'' node of sorts, but this node is only responsible for keeping track of which node is the current ''Head''.  Such a ''Home'' node is usually the node where the memory block physically resides.  When a node wants to access a block of memory, it sends a request to the ''Home'' node of that memory block, assuming it isn't already in the sharing list for that block.  If the data is already being used, the ''Home'' node will include in its response the address of the ''Head'' node of the sharing list.  This requirement necessarily serializes the order of access, as one access request is not serviced at the ''Home'' node until the previous request is finished, and the requests are processed in FIFO order.  In addition, whichever node is the current ''Head'' of the sharing list will be notified by the new ''Head'' when it is demoted.  All parties involved in a potential race are thus talking to each other, preventing the race from occurring in the first place.&lt;br /&gt;
&lt;br /&gt;
Because of this requirement, the projected race condition from the previous section, where the ''Head'' node is in a ''FRESH'' state and wants to write while another node also wants to read, cannot occur.  This is because both nodes must make their requests to the ''Home'' node: the ''Head'' node must request the ability to write, and the other node must request the data.  One of these requests will reach the ''Home'' node first, and the second request is then deferred to the node whose request reached the ''Home'' node first.  Such a scenario is pictured below.&lt;br /&gt;
[[Image:MemoryAccess.png|center]]&lt;br /&gt;
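&lt;br /&gt;
The serialization performed at the ''Home'' node can be sketched as a FIFO queue plus a single head pointer.  The Python below is a simplified model, not the standard's implementation: the HomeNode class and its method names are hypothetical, memory states are ignored, and only the case of a node attaching to the list is covered.&lt;br /&gt;

```python
from collections import deque


class HomeNode:
    """Tracks only the current head of the sharing list for one memory block."""

    def __init__(self):
        self.head = None         # no sharing list yet
        self.requests = deque()  # arrival order is service order (FIFO)

    def receive(self, requester):
        # Requests are queued in the order in which they arrive.
        self.requests.append(requester)

    def service_next(self):
        """Serve one request: return (requester, old_head) and record the new head."""
        requester = self.requests.popleft()
        old_head = self.head
        self.head = requester
        return requester, old_head


home = HomeNode()
home.receive("A")  # A's request reaches Home first
home.receive("B")
print(home.service_next())  # ('A', None)
print(home.service_next())  # ('B', 'A')
```

Because the second reply names ''A'' as the previous head, ''B'' must contact ''A'' directly, which is what forces the two racing nodes to talk to each other.&lt;br /&gt;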
&lt;br /&gt;
==== Solihin 11.4 Resolution ====&lt;br /&gt;
The three previous sections discussed individual parts of how the SCI protocol reduces race conditions.  Putting all three of these parts together yields the following scenario which resolves the Early Invalidation race as described in the text.&lt;br /&gt;
[[Image:SolihinSCI.png|center]]&lt;br /&gt;
&lt;br /&gt;
# ''A'' sends a request to ''Home'' for access to the memory block.  It then goes into a busy state while it waits for a response.&lt;br /&gt;
# ''B'' also sends a request to ''Home'' for access to the same memory block.  ''A'''s request is received first.&lt;br /&gt;
# ''Home'' responds to ''A'' with the data, but this response gets caught in network traffic and delayed.  This response is sent before ''Home'' processes the request from ''B''.&lt;br /&gt;
# ''Home'' responds to the request from ''B'', telling it that ''A'' is the current ''Head'' node.&lt;br /&gt;
# ''B'' then sends a request to ''A'' to tell it to demote itself.&lt;br /&gt;
# ''A'' still hasn't received the response from ''Home'', so it is still in a transaction, and it tells ''B'' that it is busy.  ''B'' will have to retry the request.&lt;br /&gt;
&lt;br /&gt;
It should be noted that ''B'' will be in the ''PENDING''&amp;lt;sup&amp;gt;&amp;lt;span class = &amp;quot;plainlinks&amp;quot;&amp;gt;[[#Definitions_and_Terms|[def]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt; state, so any requests to demote it from the ''Head'' position will be delayed.  This could lead to a chain of nodes in the ''PENDING'' state, but as soon as node ''A'' receives its response from ''Home'' or times out, the requests in the chain will be accepted and processed in succession.&lt;br /&gt;
&lt;br /&gt;
=== Other Possible Race Conditions ===&lt;br /&gt;
The design of the SCI protocol prevents many race conditions from occurring, as has already been shown.  Here are two other conditions that may arise, with a discussion of how the protocol resolves the issue.&lt;br /&gt;
&lt;br /&gt;
==== Concurrent List Deletions ====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==== Simultaneous Deletion and Invalidation ====&lt;br /&gt;
Even though the ''Head'' node performs most of the coherence actions for a sharing list, the individual nodes are able to invalidate themselves, such as when they evict the block from their cache.  This is referred to in the SCI standard as a &amp;quot;deletion&amp;quot; from the sharing list.  Deletion is accomplished by having the invalidating node &amp;quot;lock&amp;quot; itself and then inform its forward and back nodes that they should now point to each other.  This &amp;quot;locking&amp;quot; is essentially another ''PENDING'' state.  A problem could arise, then, when two neighboring nodes try to invalidate themselves at the same time, as they would both be locked and would not respond to each other's messages.  In this case, though, the protocol specifies that the node closest to the tail takes priority.  The deadlock and/or race condition is thus averted by ordering the deletions from the tail towards the head.&lt;br /&gt;
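&lt;br /&gt;
The deletion step and the tail-first priority rule can be sketched with a doubly linked list.  The Python below is a simplified, sequential model under stated assumptions: the Sharer class and the helper functions are hypothetical names, the &quot;lock&quot; and message exchange are not modelled, and the pointer that the ''Home'' node keeps to the ''Head'' is ignored.&lt;br /&gt;

```python
class Sharer:
    """One entry in the sharing list, with forward and back pointers."""

    def __init__(self, name):
        self.name = name
        self.forw = None  # next node, toward the tail
        self.back = None  # previous node, toward the head


def build_list(names):
    """Link the named nodes in order, head first."""
    nodes = [Sharer(n) for n in names]
    for left, right in zip(nodes, nodes[1:]):
        left.forw = right
        right.back = left
    return nodes


def delete(node):
    """Unlink node: its back and forward neighbours now point to each other."""
    if node.back is not None:
        node.back.forw = node.forw
    if node.forw is not None:
        node.forw.back = node.back
    node.forw = node.back = None


def distance_to_tail(node):
    """Count the forward hops from node to the tail."""
    steps = 0
    while node.forw is not None:
        node = node.forw
        steps += 1
    return steps


def delete_concurrently(nodes_to_delete):
    """Resolve simultaneous deletions by serving the node nearest the tail first."""
    order = sorted(nodes_to_delete, key=distance_to_tail)
    names = [n.name for n in order]
    for n in order:
        delete(n)
    return names


def walk(head):
    """List the node names from head to tail."""
    names = []
    while head is not None:
        names.append(head.name)
        head = head.forw
    return names


head, m1, m2, tail = build_list(["H", "M1", "M2", "T"])
print(delete_concurrently([m1, m2]))  # ['M2', 'M1']
print(walk(head))                     # ['H', 'T']
```

Here the two neighboring nodes M1 and M2 both want to leave; M2 is closer to the tail, so it is removed first, and the list is left consistent.&lt;br /&gt;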
&lt;br /&gt;
== Definitions and Terms ==&lt;br /&gt;
* ''Head Node'' - The node at the beginning of the sharing list&lt;br /&gt;
* ''Home Node'' - The node responsible for keeping track of the ''Head'' node for a sharing list for a memory block, usually the node at which the memory block resides.&lt;br /&gt;
* ''PENDING state'' - The intermediate state for a node while it is attempting to become the ''Head'' node in the sharing list.  This includes the transaction involved in querying the ''Home'' node and the transaction involved in demoting the current ''Head'' node.&lt;br /&gt;
* ''SCI'' - Scalable Coherent Interface&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&amp;lt;span id=&amp;quot;1foot&amp;quot;&amp;gt;[[#1body|1.]]&amp;lt;/span&amp;gt; &amp;quot;IEEE Standard for Scalable Coherent Interface (SCI),&amp;quot; IEEE Std 1596-1992, p. i, 1993. doi: 10.1109/IEEESTD.1993.120366&lt;br /&gt;
URL: http://ieeexplore.ieee.org.proxied.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=347683&amp;amp;isnumber=8049 &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;2foot&amp;quot;&amp;gt;[[#2body|2.]]&amp;lt;/span&amp;gt; [http://www.scizzl.com/HowSCIcohWorks.html How SCI Coherence Works] &amp;lt;br&amp;gt;&lt;/div&gt;</summary>
		<author><name>Eapotter</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch11_BB_EP&amp;diff=45144</id>
		<title>CSC/ECE 506 Spring 2011/ch11 BB EP</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch11_BB_EP&amp;diff=45144"/>
		<updated>2011-04-19T00:22:55Z</updated>

		<summary type="html">&lt;p&gt;Eapotter: /* States of the Typical Set */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;__TOC__&lt;br /&gt;
&lt;br /&gt;
== History ==&lt;br /&gt;
&lt;br /&gt;
== Background ==&lt;br /&gt;
The Scalable Coherent Interface (SCI) cache coherence protocol defines a set of states for memory, a set of states for cache lines, and a set of actions that transition between these states when processors access memory.&lt;br /&gt;
&lt;br /&gt;
Three sets of these attributes are defined for minimal, typical, and full applications:&lt;br /&gt;
&lt;br /&gt;
* '''Minimal''' - for ‘trivial but correct’ applications that require the presence of the memory in only one cache line.  It does not enable read-sharing, and is appropriate for small multi-processors or applications that don’t require significant sharing.&lt;br /&gt;
&lt;br /&gt;
* '''Typical''' - enables read-sharing of a memory location and provisions for efficiency, such as DMA transfers, local caching of data, and error recovery.  This set adds an additional stable memory state (FRESH) and multiple cache states.  This set will be the focus of this article going forward.&lt;br /&gt;
&lt;br /&gt;
* '''Full''' - adds support for pair-wise sharing, QOLB lock bit synchronization, and cleansing and washing of cache lines.&lt;br /&gt;
&lt;br /&gt;
== State Diagrams ==&lt;br /&gt;
=== Memory States ===&lt;br /&gt;
The memory states define the state of the memory block from the perspective of the home directory.  This 2-bit field is maintained in the memory tag by the home directory along with the pointer to the head of the sharing list (forwId).  This simple state model includes three stable states and one semi-stable state.&lt;br /&gt;
&lt;br /&gt;
{|border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; cellpadding=&amp;quot;2&amp;quot;&lt;br /&gt;
!colspan=&amp;quot;5&amp;quot;| Table 1: Memory States &lt;br /&gt;
|----&lt;br /&gt;
!State&lt;br /&gt;
!Description&lt;br /&gt;
!Minimal&lt;br /&gt;
!Typical&lt;br /&gt;
!Full&lt;br /&gt;
|----&lt;br /&gt;
|HOME&lt;br /&gt;
|no sharing list&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|FRESH&lt;br /&gt;
|sharing-list copy is the same as memory&lt;br /&gt;
|&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|GONE&lt;br /&gt;
|sharing-list copy may be different from memory&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|WASH*&lt;br /&gt;
|transitional state (GONE to FRESH)&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=== Cache States ===&lt;br /&gt;
The cache-line states are maintained by each processor's cache-coherency controller.  This 7-bit field is stored in each cache line in the sharing list, along with the pointer to the next sharing-list node (forwId) and the previous sharing-list node (backId).  Seven bits enable up to 128 possible values, and SCI defines twenty-nine stable states for use in the minimal, typical, and full sets.&lt;br /&gt;
&lt;br /&gt;
Stable states are those cache states that exist when a memory transaction is not in process.  Their names are derived from a combination of&lt;br /&gt;
the position of the node in the sharing list, such as ONLY, HEAD, MID, and TAIL, and the state of the data for that node, such as FRESH, CLEAN, DIRTY, VALID, STALE, etc.&lt;br /&gt;
&lt;br /&gt;
====States of the Typical Set====&lt;br /&gt;
&lt;br /&gt;
Following are the states defined for the Typical set.  (Note that other states may be supported to provide certain interoperability with nodes implementing the Full set, but are not used when all nodes support the Typical set.)&lt;br /&gt;
&lt;br /&gt;
* '''ONLY_DIRTY''' - only one processor has the memory block in its cache.  This block is writable, and the processor has written (or intends to write) to it.  This state is set when the processor requests the block with read/write privileges, and no other processor currently caches the block.&lt;br /&gt;
* '''ONLY_FRESH''' - only one processor has the memory block in its cache.  This block is writable, but the processor has not written to it.  This state is set when the processor requests the block with read privileges, and no other processor currently caches the block.&lt;br /&gt;
* '''HEAD_DIRTY''' - more than one processor has the memory block in its cache.  This block is writable, and the processor has written (or intends to write) to it.  This state is set when the processor requests the block with read/write privileges, and another processor already caches the block.&lt;br /&gt;
* '''HEAD_FRESH''' - more than one processor has the memory block in its cache.  This block is writable, but the processor has not written to it.  This state is set when the processor requests the block with read privileges, and another processor already caches the block.&lt;br /&gt;
* '''MID_VALID''' - more than two processors have the memory block in their caches, and it is readable.  The processor cache with this state is neither the Head nor the Tail of the sharing list.&lt;br /&gt;
* '''TAIL_VALID''' - more than one processor has the memory block in its cache, and it is readable.  The processor cache with this state is the Tail of the sharing list.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Below is a state diagram that depicts these states:&lt;br /&gt;
&lt;br /&gt;
[[Image:cache_states.png|center]]&lt;br /&gt;
&lt;br /&gt;
The request sub-actions of memory transactions that expect a response from the directory include 17 variations of reads, writes, and locks based on the number of bytes requested and the coherency requirements.  For convenience, the notation on the edges of the above graph is derived from the terms used in the SCI specification to describe the memory transactions. &lt;br /&gt;
&lt;br /&gt;
The first value indicates the request to the directory by a node for memory access:&lt;br /&gt;
* '''Fetch R''' - indicates a request for a memory block with read privileges.&lt;br /&gt;
* '''Fetch RW''' - indicates a request for a memory block with read/write privileges.&lt;br /&gt;
* '''DATA_MODIFY''' - indicates that a processor (the HEAD or the ONLY processor caching the block) writes to the cache line.&lt;br /&gt;
&lt;br /&gt;
The value in parentheses indicates the memory state at the time of the request.&lt;br /&gt;
&lt;br /&gt;
== Race Conditions ==&lt;br /&gt;
In a distributed shared memory system with caching, the emergence of race conditions is extremely likely.  This is mainly due to the lack of a bus for serialization of actions.  It is further compounded by the problem of network errors and congestion.&lt;br /&gt;
&lt;br /&gt;
The early invalidation case from Section 11.4 in Solihin is an excellent example of a race condition that can arise in a distributed system.  Recall the diagram from the text, and the cache coherence actions, as shown below.&lt;br /&gt;
&lt;br /&gt;
[[Image:EarlyInValidationRace.png|400px|center]]&lt;br /&gt;
The circled actions are as follows:&lt;br /&gt;
# ''A'' sends a read request to ''Home''.&lt;br /&gt;
# ''Home'' replies with data (but the message gets delayed).&lt;br /&gt;
# ''B'' sends a write request to ''Home''.&lt;br /&gt;
# ''Home'' sends an invalidation to ''A'', and it arrives before the ReplyD.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The SCI protocol has a way to handle this race condition, as well as many others.  The following sections discuss how the design of the SCI protocol prevents it.&lt;br /&gt;
&lt;br /&gt;
=== Prevention in the SCI Protocol ===&lt;br /&gt;
Race conditions are almost non-existent in the SCI protocol, due primarily to the protocol's design.  A brief discussion of how this is accomplished follows, and then a walk-through of what happens when the Early Invalidation condition arises in the SCI protocol.&lt;br /&gt;
==== Atomic Transactions ====&lt;br /&gt;
SCI's primary method for preventing race conditions is the use of atomic transactions.  A transaction is defined as a set of sub-actions necessary to complete some requested action, such as querying memory for data or invalidating a node's cache.  This usually takes the form of a request-response pair, where the initiating node sends a request to another node and waits for the response.  The transaction is not complete until the response returns.  Suppose, for example, that node ''A'' is in the middle of a transaction with node ''B''.  Node ''C'' then tries to make a request of node ''A''.  Node ''A'' will respond to node ''C'' that it is busy, telling node ''C'' to try again, as shown in the following diagram.&lt;br /&gt;
[[Image:AtomicBusy.png|center]]&lt;br /&gt;
Thus, in the Early Invalidation race, node ''A'' would be in the middle of a transaction, having requested a block of data from memory and not having received a response, which would prevent node ''B'' from invalidating it.  Note, however, that nodes will not stay in the busy state indefinitely.  Doing so would lead to potential deadlocks, so all transactions have timeouts that eventually cause the transaction to fail, moving the node out of the busy state and making it able to respond to requests again.&lt;br /&gt;
&lt;br /&gt;
==== Head Node ====&lt;br /&gt;
Another method for preventing race conditions is to have the ''Head'' node of a sharing list perform many of the coherence actions.  As a result, only one node is performing actions such as writes and invalidations of other sharers.  Since only one node is performing these actions, the possibility of concurrent actions is decreased.  If another node wants to write, for instance, it must become the Head Node of the sharing list for the cache line to which it wants to write, displacing the current ''Head'' node as the node that can write.  All state changes in the sharing list occur at the behest of the ''Head'' node, with the exception of nodes deleting themselves from the list.&lt;br /&gt;
&lt;br /&gt;
When the nodes are simply sharing a read-only copy of a block, as when the memory is in the ''FRESH'' state, the ''Head'' node is somewhat irrelevant.  All the sharing nodes have their own cached copy of the line.  If any node wants to write, including the ''Head'' node, it must perform an additional action in order to do so.  Likewise, when a sharing list is sharing a line of data in the dirty, or ''GONE'', memory state, all the sharing nodes stay in the list with their cached line as long as the ''Head'' node has not written to the line.  However, once the ''Head'' node wants to write, it can write its data immediately, but it must then invalidate all other shared copies via the forward pointers in the sharing list.&lt;br /&gt;
&lt;br /&gt;
This, however, does not protect against a race condition where the memory is ''FRESH'', the ''Head'' node wants to write, and another node also wants to join the list.  The Memory Access mechanism prevents this condition.&lt;br /&gt;
&lt;br /&gt;
==== Memory Access ====&lt;br /&gt;
The final way that the SCI protocol minimizes race conditions is by changing the directory structure.  In Solihin 11.4, there is a ''Home'' node which keeps track of the cache state and has a certain amount of knowledge about the system as a result.  In the SCI protocol, there is still a ''Home'' node of sorts, but this node is only responsible for keeping track of which node is the current ''Head''.  Such a ''Home'' node is usually the node where the memory block physically resides.  When a node wants to access a block of memory, it sends a request to the ''Home'' node of that memory block, assuming it isn't already in the sharing list for that block.  In the event that the data is already being used, the ''Home'' node will include in its response the address of the ''Head'' node of the sharing list.  This requirement necessarily serializes the order of access, as one access request is not serviced at the ''Home'' node until the previous request is finished, and the requests are processed in FIFO order.  In addition, whichever node is the ''Head'' of the sharing list will be notified by the new ''Head'' when it is demoted.  All parties involved in a potential race are thus talking to each other, preventing the race from occurring in the first place.&lt;br /&gt;
&lt;br /&gt;
Because of this requirement, the projected race condition from the previous section, where the ''Head'' node is in a ''FRESH'' state and wants to write while another node wants to read, cannot occur.  This is because both nodes must make their request to the ''Home'' node; the ''Head'' node must request the ability to write, and the other node must request the data from the ''Home'' node.  One of these requests will make it to the ''Home'' node first, resulting in the second request being deferred to the first node to reach the ''Home'' node.  Such a scenario is pictured below.&lt;br /&gt;
[[Image:MemoryAccess.png|center]]&lt;br /&gt;
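&lt;br /&gt;
The serialization at the ''Home'' node can be sketched as a FIFO queue in front of a single head pointer.  The Home class and its method names below are hypothetical, and the model ignores memory states and data transfer; it only shows that each requester learns the identity of the previous ''Head'' and becomes the new one, in arrival order.&lt;br /&gt;
```python
from collections import deque


class Home:
    # Hypothetical model: the home directory stores only the current Head.
    def __init__(self):
        self.head = None
        self.queue = deque()

    def receive(self, requester):
        # Requests are queued in arrival order.
        self.queue.append(requester)

    def service_next(self):
        # Service the oldest request: report the old Head to the requester
        # and record the requester as the new Head.
        requester = self.queue.popleft()
        old_head = self.head
        self.head = requester
        return requester, old_head
```
If requests from ''A'' and then ''B'' are queued, the first call to service_next reports that ''A'' had no predecessor, and the second tells ''B'' that ''A'' is the node it must contact.&lt;br /&gt;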
&lt;br /&gt;
==== Solihin 11.4 Resolution ====&lt;br /&gt;
The three previous sections discussed individual parts of how the SCI protocol reduces race conditions.  Putting all three of these parts together yields the following scenario which resolves the Early Invalidation race as described in the text.&lt;br /&gt;
[[Image:SolihinSCI.png|center]]&lt;br /&gt;
&lt;br /&gt;
# ''A'' sends a request to ''Home'' for access to the memory block.  It then goes into a busy state while it waits for a response.&lt;br /&gt;
# ''B'' also sends a request to ''Home'' for access to the same memory block.  ''A'''s request is received first.&lt;br /&gt;
# ''Home'' responds to ''A'' with the data, but this response gets caught in network traffic and delayed.  This response is sent before ''Home'' processes the request from ''B''.&lt;br /&gt;
# ''Home'' responds to the request from ''B'', telling it that ''A'' is the current ''Head'' node.&lt;br /&gt;
# ''B'' then sends a request to ''A'' to tell it to demote itself.&lt;br /&gt;
# ''A'' still hasn't received the response from ''Home'', so it is still in a transaction, and it tells ''B'' that it is busy.  ''B'' will have to retry the request.&lt;br /&gt;
&lt;br /&gt;
It should be noted that ''B'' will be in the ''PENDING''&amp;lt;sup&amp;gt;&amp;lt;span class = &amp;quot;plainlinks&amp;quot;&amp;gt;[[#Definitions_and_Terms|[def]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt; state, so any requests to demote it from the ''Head'' position will be delayed.  This could lead to a chain of nodes in the ''PENDING'' state, but as soon as node ''A'' receives its response from ''Home'' or times out, the requests in the chain will be successively accepted and processed.&lt;br /&gt;
&lt;br /&gt;
=== Other Possible Race Conditions ===&lt;br /&gt;
The design of the SCI protocol prevents many race conditions from occurring, as has already been shown.  Here are two other conditions that may arise, with a discussion of how the protocol resolves the issue.&lt;br /&gt;
&lt;br /&gt;
==== Concurrent List Deletions ====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==== Simultaneous Deletion and Invalidation ====&lt;br /&gt;
Even though the ''Head'' node performs most of the coherence actions for a sharing list, the individual nodes are able to invalidate themselves, such as when they evict the block from their cache.  This is referred to in the SCI standard as a &amp;quot;deletion&amp;quot; from the sharing list.  Deletion is accomplished by having the invalidating node &amp;quot;lock&amp;quot; itself and then inform its forward and back nodes that they should now point to each other.  This &amp;quot;locking&amp;quot; is essentially another ''PENDING'' state.  A problem could arise, then, when two neighboring nodes try to invalidate themselves at the same time, as they would both be locked and would not respond to each other's messages.  In this case, though, the protocol specifies that the node that is closest to the tail takes priority.  So, the deadlock and/or race condition is averted by ordering the deletion from the tail towards the head.&lt;br /&gt;
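&lt;br /&gt;
The deletion ordering can be illustrated with a small doubly linked list.  The sketch below is an assumption-laden model, not the procedure from the standard: Sharer, delete, and delete_concurrently are hypothetical names, locking is not modelled, and the two deleting nodes are assumed to be adjacent.  It shows only that letting the node nearer the tail go first leaves the remaining list correctly linked.&lt;br /&gt;
```python
class Sharer:
    # Hypothetical sharing-list entry; forw points toward the tail and
    # back points toward the head.
    def __init__(self, name):
        self.name = name
        self.forw = None
        self.back = None


def build_list(names):
    nodes = [Sharer(n) for n in names]
    for left, right in zip(nodes, nodes[1:]):
        left.forw = right
        right.back = left
    return nodes


def delete(node):
    # Tell the neighbours to point at each other, then detach.
    if node.back is not None:
        node.back.forw = node.forw
    if node.forw is not None:
        node.forw.back = node.back
    node.forw = None
    node.back = None


def delete_concurrently(first, second):
    # Assumption: the two nodes are adjacent.  The one nearer the tail
    # goes first; the other then proceeds against the updated list.
    near_tail = second if first.forw is second else first
    near_head = first if near_tail is second else second
    delete(near_tail)
    delete(near_head)
    return [near_tail.name, near_head.name]


def walk(head):
    # Collect node names from the head toward the tail.
    names = []
    while head is not None:
        names.append(head.name)
        head = head.forw
    return names
```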
&lt;br /&gt;
== Definitions and Terms ==&lt;br /&gt;
* ''Head Node'' - The node at the beginning of the sharing list.&lt;br /&gt;
* ''Home Node'' - The node responsible for keeping track of the ''Head'' node for a sharing list for a memory block, usually the node at which the memory block resides.&lt;br /&gt;
* ''PENDING state'' - The intermediate state for a node while it is attempting to become the ''Head'' node in the sharing list.  This includes the transaction involved in querying the ''Home'' node and the transaction involved in demoting the current ''Head'' node.&lt;br /&gt;
* ''SCI'' - Scalable Coherent Interface&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&amp;lt;span id=&amp;quot;1foot&amp;quot;&amp;gt;[[#1body|1.]]&amp;lt;/span&amp;gt; &amp;quot;IEEE Standard for Scalable Coherent Interface (SCI),&amp;quot; IEEE Std 1596-1992, 1993. doi: 10.1109/IEEESTD.1993.120366&lt;br /&gt;
URL: http://ieeexplore.ieee.org.proxied.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=347683&amp;amp;isnumber=8049 &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;2foot&amp;quot;&amp;gt;[[#2body|2.]]&amp;lt;/span&amp;gt; [http://www.scizzl.com/HowSCIcohWorks.html How SCI Coherence Works] &amp;lt;br&amp;gt;&lt;/div&gt;</summary>
		<author><name>Eapotter</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch11_BB_EP&amp;diff=45143</id>
		<title>CSC/ECE 506 Spring 2011/ch11 BB EP</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch11_BB_EP&amp;diff=45143"/>
		<updated>2011-04-19T00:21:53Z</updated>

		<summary type="html">&lt;p&gt;Eapotter: /* States of the Typical Set */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;__TOC__&lt;br /&gt;
&lt;br /&gt;
== History ==&lt;br /&gt;
&lt;br /&gt;
== Background ==&lt;br /&gt;
The Scalable Coherent Interface (SCI) cache coherence protocol defines a set of states for memory, a set of states for cache lines, and a set of actions that transition between these states when processors access memory.&lt;br /&gt;
&lt;br /&gt;
Three sets of these attributes are defined for minimal, typical, and full applications:&lt;br /&gt;
&lt;br /&gt;
* '''Minimal''' - for ‘trivial but correct’ applications that require the presence of the memory in only one cache line.  It does not enable read-sharing, and is appropriate for small multi-processors or applications that don’t require significant sharing.&lt;br /&gt;
&lt;br /&gt;
* '''Typical''' - enables read-sharing of a memory location and provisions for efficiency, such as DMA transfers, local caching of data, and error recovery.  This set adds an additional stable memory state (FRESH) and multiple cache states.  This set will be the focus of this article going forward.&lt;br /&gt;
&lt;br /&gt;
* '''Full''' - adds support for pair-wise sharing, QOLB lock bit synchronization, and cleansing and washing of cache lines.&lt;br /&gt;
&lt;br /&gt;
== State Diagrams ==&lt;br /&gt;
=== Memory States ===&lt;br /&gt;
The memory states define the state of the memory block from the perspective of the home directory.  This 2-bit field is maintained in the memory tag by the home directory along with the pointer to the head of the sharing list (forwId).  This simple state model includes three stable states and one semi-stable state.&lt;br /&gt;
&lt;br /&gt;
{|border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; cellpadding=&amp;quot;2&amp;quot;&lt;br /&gt;
!colspan=&amp;quot;5&amp;quot;| Table 1: Memory States &lt;br /&gt;
|----&lt;br /&gt;
!State&lt;br /&gt;
!Description&lt;br /&gt;
!Minimal&lt;br /&gt;
!Typical&lt;br /&gt;
!Full&lt;br /&gt;
|----&lt;br /&gt;
|HOME&lt;br /&gt;
|no sharing list&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|FRESH&lt;br /&gt;
|sharing-list copy is the same as memory&lt;br /&gt;
|&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|GONE&lt;br /&gt;
|sharing-list copy may be different from memory&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|WASH*&lt;br /&gt;
|transitional state (GONE to FRESH)&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|}&lt;br /&gt;
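&lt;br /&gt;
Table 1 can also be expressed as a small lookup structure.  The sketch below is illustrative only: the numeric values assigned to each state are placeholders chosen to fit a 2-bit field and are not the encodings defined by the standard, and the names MemoryState and SUPPORTED are hypothetical.&lt;br /&gt;
```python
from enum import Enum


class MemoryState(Enum):
    # Placeholder encodings; the standard defines its own values.
    HOME = 0
    FRESH = 1
    GONE = 2
    WASH = 3


# Which memory states each option set uses, following Table 1.
SUPPORTED = {
    "minimal": {MemoryState.HOME, MemoryState.GONE},
    "typical": {MemoryState.HOME, MemoryState.FRESH, MemoryState.GONE},
    "full": set(MemoryState),
}


def fits_in_two_bits():
    # Four states need exactly the values 0 through 3.
    return all(state.value in range(4) for state in MemoryState)
```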
&lt;br /&gt;
=== Cache States ===&lt;br /&gt;
The cache-line states are maintained by each processor's cache-coherency controller.  This 7-bit field is stored in each cache line in the sharing list, along with the pointer to the next sharing-list node (forwId) and the previous sharing-list node (backId).  Seven bits allow up to 128 possible values, and SCI defines twenty-nine stable states for use in the minimal, typical, and full sets.  &lt;br /&gt;
&lt;br /&gt;
Stable states are those cache states that exist when a memory transaction is not in progress.  Their names are derived from a combination of the position of the node in the sharing list, such as ONLY, HEAD, MID, and TAIL, and the state of the data for that node, such as FRESH, CLEAN, DIRTY, VALID, and STALE.&lt;br /&gt;
&lt;br /&gt;
====States of the Typical Set====&lt;br /&gt;
&lt;br /&gt;
Following are the states defined for the Typical set.  (Note that other states may be supported to provide certain interoperability with nodes implementing the Full set, but are not used when all nodes support the Typical set.)&lt;br /&gt;
&lt;br /&gt;
* '''ONLY_DIRTY''' - only one processor has the memory block in its cache.  This block is writable, and the processor has written (or intends to write) to it.  This state is set when the processor requests the block with read/write privileges and no other processor currently caches the block.&lt;br /&gt;
* '''ONLY_FRESH''' - only one processor has the memory block in its cache.  This block is writable, but the processor has not written to it.  This state is set when the processor requests the block with read privileges and no other processor currently caches the block.&lt;br /&gt;
* '''HEAD_DIRTY''' - more than one processor has the memory block in its cache.  This block is writable, and the processor has written (or intends to write) to it.  This state is set when the processor requests the block with read/write privileges and another processor already caches the block.&lt;br /&gt;
* '''HEAD_FRESH''' - more than one processor has the memory block in its cache.  This block is writable, but the processor has not written to it.  This state is set when the processor requests the block with read privileges and another processor already caches the block.&lt;br /&gt;
* '''MID_VALID''' - more than two processors have the memory block in their caches, and it is readable.  The processor cache with this state is neither the Head nor the Tail of the sharing list.&lt;br /&gt;
* '''TAIL_VALID''' - two or more processors have the memory block in their caches, and it is readable.  The processor cache with this state is the Tail of the sharing list.&lt;br /&gt;
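&lt;br /&gt;
The four ONLY and HEAD rules listed above reduce to two independent choices, which the following sketch makes explicit.  The function state_after_fetch is a hypothetical helper; it encodes only the rules as stated in this list for the node that has just fetched the block, and does not cover MID_VALID, TAIL_VALID, or the full transition rules of the standard.&lt;br /&gt;
```python
def state_after_fetch(wants_write, others_cached):
    # Position: HEAD if another processor already caches the block,
    # otherwise ONLY.
    position = "HEAD" if others_cached else "ONLY"
    # Data: DIRTY for a read/write request, FRESH for a read request.
    data = "DIRTY" if wants_write else "FRESH"
    return position + "_" + data
```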
&lt;br /&gt;
&lt;br /&gt;
Below is a state diagram that depicts these states:&lt;br /&gt;
&lt;br /&gt;
[[Image:cache_states.png|center]]&lt;br /&gt;
&lt;br /&gt;
The request sub-actions of transactions that expect a response from the directory include 17 variations of reads, writes, and locks based on the number of blocks requested and the coherency requirements.  For convenience, the notation on the edges of the above graph is derived from the terms used in the SCI specification to describe the memory transactions. &lt;br /&gt;
&lt;br /&gt;
The first value indicates the request to the directory by a node for memory access:&lt;br /&gt;
*“Fetch R” - indicates a request for a memory block with read privileges.&lt;br /&gt;
*“Fetch RW” - indicates a request for a memory block with read/write privileges.&lt;br /&gt;
*“DATA_MODIFY” - indicates that a processor (the HEAD or the ONLY processor caching the block) writes to the cache line.&lt;br /&gt;
&lt;br /&gt;
The value in parentheses indicates the memory state at the time of the request.&lt;br /&gt;
&lt;br /&gt;
== Race Conditions ==&lt;br /&gt;
In a distributed shared memory system with caching, the emergence of race conditions is extremely likely.  This is mainly due to the lack of a bus for serialization of actions.  It is further compounded by the problem of network errors and congestion.&lt;br /&gt;
&lt;br /&gt;
The early invalidation case from Section 11.4 in Solihin is an excellent example of a race condition that can arise in a distributed system.  Recall the diagram from the text, and the cache coherence actions, as shown below.&lt;br /&gt;
&lt;br /&gt;
[[Image:EarlyInValidationRace.png|400px|center]]&lt;br /&gt;
The circled actions are as follows:&lt;br /&gt;
# ''A'' sends a read request to ''Home''.&lt;br /&gt;
# ''Home'' replies with data (but the message gets delayed).&lt;br /&gt;
# ''B'' sends a write request to ''Home''.&lt;br /&gt;
# ''Home'' sends an invalidation to ''A'', and it arrives before the ReplyD.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The SCI protocol handles this race condition, as well as many others.  The following sections discuss how the design of the SCI protocol prevents it.&lt;br /&gt;
&lt;br /&gt;
=== Prevention in the SCI Protocol ===&lt;br /&gt;
Race conditions are almost non-existent in the SCI protocol, due primarily to the protocol's design.  The following sections briefly describe the mechanisms involved and then show what happens when the Early Invalidation scenario arises under SCI.&lt;br /&gt;
==== Atomic Transactions ====&lt;br /&gt;
SCI's primary method for preventing race conditions is the use of atomic transactions.  A transaction is defined as the set of sub-actions necessary to complete some requested action, such as querying memory for data or invalidating a node's cache.  This usually takes the form of a request-response pair, where the initiating node sends a request to another node and waits for the response.  The transaction is not complete until the response returns.  Suppose, for example, that node ''A'' is in the middle of a transaction with node ''B''.  Node ''C'' then tries to make a request of node ''A''.  Node ''A'' will respond to node ''C'' that it is busy, telling node ''C'' to try again, as shown in the following diagram.&lt;br /&gt;
[[Image:AtomicBusy.png|center]]&lt;br /&gt;
Thus, in the Early Invalidation race, node ''A'' would be in the middle of a transaction, having requested a block of data from memory and not having received a response, which would prevent node ''B'' from invalidating it.  Note, however, that nodes will not stay in the busy state indefinitely.  Doing so would lead to potential deadlocks, so all transactions have timeouts that will eventually cause the transaction to fail, moving the node out of the busy state so that it can respond to requests again.&lt;br /&gt;
&lt;br /&gt;
==== Head Node ====&lt;br /&gt;
Another method for preventing race conditions is to have the ''Head'' node of a sharing list perform many of the coherence actions.  As a result, only one node is performing actions such as writes and invalidations of other sharers.  Since only one node is performing these actions, the possibility of concurrent actions is decreased.  If another node wants to write, for instance, it must become the Head Node of the sharing list for the cache line to which it wants to write, displacing the current ''Head'' node as the node that can write.  All state changes in the sharing list occur at the behest of the ''Head'' node, with the exception of nodes deleting themselves from the list.&lt;br /&gt;
&lt;br /&gt;
When the nodes are simply sharing a read-only copy of a block, as when the memory is in the ''FRESH'' state, the ''Head'' node is somewhat irrelevant.  All the sharing nodes have their own cached copy of the line.  If any node wants to write, including the ''Head'' node, it must perform an additional action in order to do so.  Likewise, when a sharing list is sharing a line of data in the dirty, or ''GONE'', memory state, all the sharing nodes stay in the list with their cached line as long as the ''Head'' node has not written to the line.  However, once the ''Head'' node wants to write, it can write its data immediately, but it must then invalidate all other shared copies via the forward pointers in the sharing list.&lt;br /&gt;
&lt;br /&gt;
This, however, does not protect against a race condition where the memory is ''FRESH'', the ''Head'' node wants to write, and another node also wants to join the list.  The Memory Access mechanism prevents this condition.&lt;br /&gt;
&lt;br /&gt;
==== Memory Access ====&lt;br /&gt;
The final way that the SCI protocol minimizes race conditions is by changing the directory structure.  In Solihin 11.4, there is a ''Home'' node which keeps track of the cache state and has a certain amount of knowledge about the system as a result.  In the SCI protocol, there is still a ''Home'' node of sorts, but this node is only responsible for keeping track of which node is the current ''Head''.  Such a ''Home'' node is usually the node where the memory block physically resides.  When a node wants to access a block of memory, it sends a request to the ''Home'' node of that memory block, assuming it isn't already in the sharing list for that block.  In the event that the data is already being used, the ''Home'' node will include in its response the address of the ''Head'' node of the sharing list.  This requirement necessarily serializes the order of access, as one access request is not serviced at the ''Home'' node until the previous request is finished, and the requests are processed in FIFO order.  In addition, whichever node is the ''Head'' of the sharing list will be notified by the new ''Head'' when it is demoted.  All parties involved in a potential race are thus talking to each other, preventing the race from occurring in the first place.&lt;br /&gt;
&lt;br /&gt;
Because of this requirement, the projected race condition from the previous section, where the ''Head'' node is in a ''FRESH'' state and wants to write while another node wants to read, cannot occur.  This is because both nodes must make their request to the ''Home'' node; the ''Head'' node must request the ability to write, and the other node must request the data from the ''Home'' node.  One of these requests will make it to the ''Home'' node first, resulting in the second request being deferred to the first node to reach the ''Home'' node.  Such a scenario is pictured below.&lt;br /&gt;
[[Image:MemoryAccess.png|center]]&lt;br /&gt;
&lt;br /&gt;
==== Solihin 11.4 Resolution ====&lt;br /&gt;
The three previous sections discussed individual parts of how the SCI protocol reduces race conditions.  Putting all three of these parts together yields the following scenario which resolves the Early Invalidation race as described in the text.&lt;br /&gt;
[[Image:SolihinSCI.png|center]]&lt;br /&gt;
&lt;br /&gt;
# ''A'' sends a request to ''Home'' for access to the memory block.  It then goes into a busy state while it waits for a response.&lt;br /&gt;
# ''B'' also sends a request to ''Home'' for access to the same memory block.  ''A'''s request is received first.&lt;br /&gt;
# ''Home'' responds to ''A'' with the data, but this response gets caught in network traffic and delayed.  This response is sent before ''Home'' processes the request from ''B''.&lt;br /&gt;
# ''Home'' responds to the request from ''B'', telling it that ''A'' is the current ''Head'' node.&lt;br /&gt;
# ''B'' then sends a request to ''A'' to tell it to demote itself.&lt;br /&gt;
# ''A'' still hasn't received the response from ''Home'', so it is still in a transaction, and it tells ''B'' that it is busy.  ''B'' will have to retry the request.&lt;br /&gt;
&lt;br /&gt;
It should be noted that ''B'' will be in the ''PENDING''&amp;lt;sup&amp;gt;&amp;lt;span class = &amp;quot;plainlinks&amp;quot;&amp;gt;[[#Definitions_and_Terms|[def]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt; state, so any requests to demote it from the ''Head'' position will be delayed.  This could lead to a chain of nodes in the ''PENDING'' state, but as soon as node ''A'' receives its response from ''Home'' or times out, the requests in the chain will be successively accepted and processed.&lt;br /&gt;
&lt;br /&gt;
=== Other Possible Race Conditions ===&lt;br /&gt;
The design of the SCI protocol prevents many race conditions from occurring, as has already been shown.  Here are two other conditions that may arise, with a discussion of how the protocol resolves the issue.&lt;br /&gt;
&lt;br /&gt;
==== Concurrent List Deletions ====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==== Simultaneous Deletion and Invalidation ====&lt;br /&gt;
Even though the ''Head'' node performs most of the coherence actions for a sharing list, the individual nodes are able to invalidate themselves, such as when they evict the block from their cache.  This is referred to in the SCI standard as a &amp;quot;deletion&amp;quot; from the sharing list.  Deletion is accomplished by having the invalidating node &amp;quot;lock&amp;quot; itself and then inform its forward and back nodes that they should now point to each other.  This &amp;quot;locking&amp;quot; is essentially another ''PENDING'' state.  A problem could arise, then, when two neighboring nodes try to invalidate themselves at the same time, as they would both be locked and would not respond to each other's messages.  In this case, though, the protocol specifies that the node that is closest to the tail takes priority.  So, the deadlock and/or race condition is averted by ordering the deletion from the tail towards the head.&lt;br /&gt;
&lt;br /&gt;
== Definitions and Terms ==&lt;br /&gt;
* ''Head Node'' - The node at the beginning of the sharing list.&lt;br /&gt;
* ''Home Node'' - The node responsible for keeping track of the ''Head'' node for a sharing list for a memory block, usually the node at which the memory block resides.&lt;br /&gt;
* ''PENDING state'' - The intermediate state for a node while it is attempting to become the ''Head'' node in the sharing list.  This includes the transaction involved in querying the ''Home'' node and the transaction involved in demoting the current ''Head'' node.&lt;br /&gt;
* ''SCI'' - Scalable Coherent Interface&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&amp;lt;span id=&amp;quot;1foot&amp;quot;&amp;gt;[[#1body|1.]]&amp;lt;/span&amp;gt; &amp;quot;IEEE Standard for Scalable Coherent Interface (SCI),&amp;quot; IEEE Std 1596-1992, 1993. doi: 10.1109/IEEESTD.1993.120366&lt;br /&gt;
URL: http://ieeexplore.ieee.org.proxied.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=347683&amp;amp;isnumber=8049 &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;2foot&amp;quot;&amp;gt;[[#2body|2.]]&amp;lt;/span&amp;gt; [http://www.scizzl.com/HowSCIcohWorks.html How SCI Coherence Works] &amp;lt;br&amp;gt;&lt;/div&gt;</summary>
		<author><name>Eapotter</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch11_BB_EP&amp;diff=45142</id>
		<title>CSC/ECE 506 Spring 2011/ch11 BB EP</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch11_BB_EP&amp;diff=45142"/>
		<updated>2011-04-19T00:21:38Z</updated>

		<summary type="html">&lt;p&gt;Eapotter: /* States of the Typical Set */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;__TOC__&lt;br /&gt;
&lt;br /&gt;
== History ==&lt;br /&gt;
&lt;br /&gt;
== Background ==&lt;br /&gt;
The Scalable Coherent Interface (SCI) cache coherence protocol defines a set of states for memory, a set of states for cache lines, and a set of actions that transition between these states when processors access memory.&lt;br /&gt;
&lt;br /&gt;
Three sets of these attributes are defined for minimal, typical, and full applications:&lt;br /&gt;
&lt;br /&gt;
* '''Minimal''' - for ‘trivial but correct’ applications that require the presence of the memory in only one cache line.  It does not enable read-sharing, and is appropriate for small multi-processors or applications that don’t require significant sharing.&lt;br /&gt;
&lt;br /&gt;
* '''Typical''' - enables read-sharing of a memory location and provisions for efficiency, such as DMA transfers, local caching of data, and error recovery.  This set adds an additional stable memory state (FRESH) and multiple cache states.  This set will be the focus of this article going forward.&lt;br /&gt;
&lt;br /&gt;
* '''Full''' - adds support for pair-wise sharing, QOLB lock bit synchronization, and cleansing and washing of cache lines.&lt;br /&gt;
&lt;br /&gt;
== State Diagrams ==&lt;br /&gt;
=== Memory States ===&lt;br /&gt;
The memory states define the state of the memory block from the perspective of the home directory.  This 2-bit field is maintained in the memory tag by the home directory along with the pointer to the head of the sharing list (forwId).  This simple state model includes three stable states and one semi-stable state.&lt;br /&gt;
&lt;br /&gt;
{|border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; cellpadding=&amp;quot;2&amp;quot;&lt;br /&gt;
!colspan=&amp;quot;5&amp;quot;| Table 1: Memory States &lt;br /&gt;
|----&lt;br /&gt;
!State&lt;br /&gt;
!Description&lt;br /&gt;
!Minimal&lt;br /&gt;
!Typical&lt;br /&gt;
!Full&lt;br /&gt;
|----&lt;br /&gt;
|HOME&lt;br /&gt;
|no sharing list&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|FRESH&lt;br /&gt;
|sharing-list copy is the same as memory&lt;br /&gt;
|&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|GONE&lt;br /&gt;
|sharing-list copy may be different from memory&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|WASH*&lt;br /&gt;
|transitional state (GONE to FRESH)&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=== Cache States ===&lt;br /&gt;
The cache-line states are maintained by each processor's cache-coherency controller.  This 7-bit field is stored in each cache line in the sharing list, along with the pointer to the next sharing-list node (forwId) and the previous sharing-list node (backId).  Seven bits allow up to 128 possible values, and SCI defines twenty-nine stable states for use in the minimal, typical, and full sets.  &lt;br /&gt;
&lt;br /&gt;
Stable states are those cache states that exist when a memory transaction is not in progress.  Their names are derived from a combination of the position of the node in the sharing list, such as ONLY, HEAD, MID, and TAIL, and the state of the data for that node, such as FRESH, CLEAN, DIRTY, VALID, and STALE.&lt;br /&gt;
&lt;br /&gt;
====States of the Typical Set====&lt;br /&gt;
&lt;br /&gt;
Following are the states defined for the Typical set.  (Note that other states may be supported to provide certain interoperability with nodes implementing the Full set, but are not used when all nodes support the Typical set.)&lt;br /&gt;
&lt;br /&gt;
* '''ONLY_DIRTY''' - only one processor has the memory block in its cache.  This block is writable, and the processor has written (or intends to write) to it.  This state is set when the processor requests the block with read/write privileges and no other processor currently caches the block.&lt;br /&gt;
* '''ONLY_FRESH''' - only one processor has the memory block in its cache.  This block is writable, but the processor has not written to it.  This state is set when the processor requests the block with read privileges and no other processor currently caches the block.&lt;br /&gt;
* '''HEAD_DIRTY''' - more than one processor has the memory block in its cache.  This block is writable, and the processor has written (or intends to write) to it.  This state is set when the processor requests the block with read/write privileges and another processor already caches the block.&lt;br /&gt;
* '''HEAD_FRESH''' - more than one processor has the memory block in its cache.  This block is writable, but the processor has not written to it.  This state is set when the processor requests the block with read privileges and another processor already caches the block.&lt;br /&gt;
* '''MID_VALID''' - more than two processors have the memory block in their caches, and it is readable.  The processor cache with this state is neither the Head nor the Tail of the sharing list.&lt;br /&gt;
* '''TAIL_VALID''' - two or more processors have the memory block in their caches, and it is readable.  The processor cache with this state is the Tail of the sharing list.&lt;br /&gt;
&lt;br /&gt;
Below is a state diagram that depicts these states:&lt;br /&gt;
&lt;br /&gt;
[[Image:cache_states.png|center]]&lt;br /&gt;
&lt;br /&gt;
The request sub-actions of transactions that expect a response from the directory include 17 variations of reads, writes, and locks based on the number of blocks requested and the coherency requirements.  For convenience, the notation on the edges of the above graph is derived from the terms used in the SCI specification to describe the memory transactions. &lt;br /&gt;
&lt;br /&gt;
The first value indicates the request to the directory by a node for memory access:&lt;br /&gt;
* “Fetch R” - indicates a request for a memory block with read privileges.&lt;br /&gt;
* “Fetch RW” - indicates a request for a memory block with read/write privileges.&lt;br /&gt;
* “DATA_MODIFY” - indicates that a processor (the HEAD or the ONLY processor caching the block) writes to the cache line.&lt;br /&gt;
&lt;br /&gt;
The value in parentheses indicates the memory state at the time of the request.&lt;br /&gt;
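&lt;br /&gt;
The entry rules in the state list above can be summarized in a few lines of Python.  This is only a sketch of the four ONLY and HEAD entry cases as this article describes them; the names Request and entry_state are hypothetical and do not come from the SCI standard.&lt;br /&gt;
&lt;br /&gt;
```python
from enum import Enum


class Request(Enum):
    # Request names follow the edge labels used in this article.
    FETCH_R = "Fetch R"    # read privileges
    FETCH_RW = "Fetch RW"  # read/write privileges


def entry_state(request, others_cache_block):
    # Position prefix: ONLY when no other processor caches the block,
    # HEAD when the requester joins an existing sharing list.
    position = "HEAD" if others_cache_block else "ONLY"
    # Data suffix: DIRTY for read/write requests, FRESH for read requests.
    data = "DIRTY" if request is Request.FETCH_RW else "FRESH"
    return position + "_" + data
```
&lt;br /&gt;
For example, entry_state(Request.FETCH_R, False) yields ONLY_FRESH, matching the description of that state above.&lt;br /&gt;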
&lt;br /&gt;
== Race Conditions ==&lt;br /&gt;
In a distributed shared memory system with caching, the emergence of race conditions is extremely likely.  This is mainly due to the lack of a bus for serialization of actions.  It is further compounded by the problem of network errors and congestion.&lt;br /&gt;
&lt;br /&gt;
The early invalidation case from Section 11.4 in Solihin is an excellent example of a race condition that can arise in a distributed system.  Recall the diagram from the text, and the cache coherence actions, as shown below.&lt;br /&gt;
&lt;br /&gt;
[[Image:EarlyInValidationRace.png|400px|center]]&lt;br /&gt;
The circled actions are as follows:&lt;br /&gt;
# ''A'' sends a read request to ''Home''.&lt;br /&gt;
# ''Home'' replies with data (but the message gets delayed).&lt;br /&gt;
# ''B'' sends a write request to ''Home''.&lt;br /&gt;
# ''Home'' sends invalidation to ''A'', and it arrives before the delayed data reply (ReplyD).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The SCI protocol handles this race condition, as well as many others.  The following sections discuss how the protocol's design prevents it.&lt;br /&gt;
&lt;br /&gt;
=== Prevention in the SCI Protocol ===&lt;br /&gt;
Race conditions are almost non-existent in the SCI protocol, due primarily to the protocol's design.  A brief discussion of how this is accomplished follows, and then a walkthrough of what happens if the early invalidation condition arises in the SCI protocol.&lt;br /&gt;
==== Atomic Transactions ====&lt;br /&gt;
SCI's primary method for preventing race conditions is having atomic transactions.  A transaction is defined as a set of sub-actions necessary to complete some requested action, such as querying memory for data or invalidating a node's cache.  This usually takes the form of a request-response pair, where the initiating node sends a request to another node and waits for the response.  The transaction is not complete until the response returns.  Suppose, for example, that node ''A'' is in the middle of a transaction with node ''B''.  Node ''C'' then tries to make a request of node ''A''.  Node ''A'' will respond to node ''C'' that it is busy, telling node ''C'' to try again, as shown in the following diagram.&lt;br /&gt;
[[Image:AtomicBusy.png|center]]&lt;br /&gt;
Thus, in the Early Invalidation race, node ''A'' would be in the middle of a transaction, having requested a block of data from memory and not having received a response, which would prevent node ''B'' from invalidating it.  Note, however, that nodes will not stay in the busy state indefinitely.  Doing so would lead to potential deadlocks, so all transactions have timeouts that eventually cause a stalled transaction to fail, which moves the node out of the busy state and lets it respond to requests again.&lt;br /&gt;
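&lt;br /&gt;
The busy-and-retry behavior described above can be illustrated with a small Python sketch.  The Node class, its method names, and the tick-based timeout are all assumptions made for illustration; the standard defines its own timing and packet formats, which are not modeled here.&lt;br /&gt;
&lt;br /&gt;
```python
class Node:
    """Hypothetical node that refuses new requests while a transaction is open."""

    def __init__(self, name, timeout_ticks=3):
        self.name = name
        self.timeout_ticks = timeout_ticks
        self.ticks_waiting = None  # None means no transaction in progress

    def start_transaction(self):
        # The node has sent a request and now waits for the response.
        self.ticks_waiting = 0

    def receive_response(self):
        # The response arrived, so the transaction is complete.
        self.ticks_waiting = None

    def tick(self):
        # Advance time; a transaction that waits too long fails and frees the node.
        if self.ticks_waiting is not None:
            self.ticks_waiting += 1
            if self.ticks_waiting == self.timeout_ticks:
                self.ticks_waiting = None

    def handle_request(self, sender):
        # While a transaction is open, every incoming request is bounced.
        if self.ticks_waiting is not None:
            return "BUSY"
        return "ACCEPTED"
```
&lt;br /&gt;
A node that has called start_transaction() answers BUSY until either receive_response() is called or enough ticks pass for the timeout to fire, after which it answers ACCEPTED again.&lt;br /&gt;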
&lt;br /&gt;
==== Head Node ====&lt;br /&gt;
Another method for preventing race conditions is to have the ''Head'' node of a sharing list perform many of the coherence actions.  As a result, only one node is performing actions such as writes and invalidations of other sharers.  Since only one node is performing these actions, the possibility of concurrent actions is decreased.  If another node wants to write, for instance, it must become the Head Node of the sharing list for the cache line to which it wants to write, displacing the current ''Head'' node as the node that can write.  All state changes in the sharing list occur at the behest of the ''Head'' node, with the exception of nodes deleting themselves from the list.&lt;br /&gt;
&lt;br /&gt;
When simply sharing a read-only copy of a line, as when the memory is in the ''FRESH'' state, the ''Head'' node is somewhat irrelevant.  All the sharing nodes have their own cached value of the cache line.  If any node wants to write, including the ''Head'' node, it must perform an additional action in order to do so.  Likewise, when a sharing list is sharing a line of data in the dirty, or ''GONE'', memory state, as long as the ''Head'' node has not written to the line, then all the sharing nodes stay in the list with their cached line.  However, at the point the ''Head'' node wants to write, it can write its data immediately, but it then must invalidate all other shared copies, via the forward pointers in the sharing list.&lt;br /&gt;
&lt;br /&gt;
This, however, does not protect against a race condition where the memory is ''FRESH'', the ''Head'' node wants to write, and another node also wants to join the list.  The Memory Access mechanism prevents this condition.&lt;br /&gt;
&lt;br /&gt;
==== Memory Access ====&lt;br /&gt;
The final way that the SCI protocol minimizes race conditions is by changing the directory structure.  In Solihin 11.4, there is a ''Home'' node which keeps track of the cache state and has a certain amount of knowledge about the system as a result.  In the SCI protocol, there is still a ''Home'' node of sorts, but this node is only responsible for keeping track of which node is the current ''Head''.  Such a ''Home'' node is usually the node where the memory block physically resides.  When a node wants to access a block of memory, it sends a request to the ''Home'' node of that memory block, assuming it isn't already in the sharing list for that block.  In the event that the data is already being used, the ''Home'' node will include in its response the address of the ''Head'' node of the sharing list.  This requirement necessarily serializes the order of access, as one access request is not serviced at the ''Home'' node until the previous request is finished, and the requests are processed in FIFO order.  In addition, whichever node is the ''Head'' of the sharing list will be notified by the new ''Head'' when it is demoted.  All parties involved in a potential race are thus talking to each other, preventing the race from occurring in the first place.&lt;br /&gt;
&lt;br /&gt;
Because of this requirement, the projected race condition from the previous section, where the ''Head'' node is in a ''FRESH'' state and wants to write while another node also wants to read, cannot occur.  This is because both nodes must make their request to the ''Home'' node; the ''Head'' node must request the ability to write, and the other node must request the data from the ''Home'' node.  One of these requests will make it to the ''Home'' node first, resulting in the second request being deferred to the first node to reach the ''Home'' node.  Such a scenario is pictured below.&lt;br /&gt;
[[Image:MemoryAccess.png|center]]&lt;br /&gt;
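&lt;br /&gt;
The serialization performed by the ''Home'' node can be sketched in Python as a FIFO queue that hands each requester the identity of the previous ''Head''.  The Home class and its method names are hypothetical, and the sketch models only requests to join the sharing list, not the full set of SCI memory transactions.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import deque


class Home:
    """Hypothetical Home node: tracks only the current Head of one sharing list."""

    def __init__(self):
        self.head = None      # no sharing list yet
        self.queue = deque()  # requests are serviced in arrival (FIFO) order

    def submit(self, requester):
        # Record the request; it is not serviced until earlier ones finish.
        self.queue.append(requester)

    def service_next(self):
        # Service exactly one request before looking at the next one.
        requester = self.queue.popleft()
        previous_head = self.head
        self.head = requester
        # The response names the old Head (None if the block was uncached),
        # so the requester knows which node it must contact next.
        return requester, previous_head
```
&lt;br /&gt;
If ''A'' and then ''B'' submit requests, the first call to service_next() returns ("A", None) and the second returns ("B", "A"), so ''B'' learns that it must talk to ''A''.&lt;br /&gt;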
&lt;br /&gt;
==== Solihin 11.4 Resolution ====&lt;br /&gt;
The three previous sections discussed individual parts of how the SCI protocol reduces race conditions.  Putting all three of these parts together yields the following scenario which resolves the Early Invalidation race as described in the text.&lt;br /&gt;
[[Image:SolihinSCI.png|center]]&lt;br /&gt;
&lt;br /&gt;
# ''A'' sends a request to ''Home'' for access to the memory block.  It then goes into a busy state while it waits for a response.&lt;br /&gt;
# ''B'' also sends a request to ''Home'' for access to the same memory block.  ''A'''s request is received first.&lt;br /&gt;
# ''Home'' responds to ''A'' with the data, but this response gets caught in network traffic and delayed.  This response is sent before ''Home'' processes the request from ''B''.&lt;br /&gt;
# ''Home'' responds to the request from ''B'', telling it that ''A'' is the current ''Head'' node.&lt;br /&gt;
# ''B'' then sends a request to ''A'' to tell it to demote itself.&lt;br /&gt;
# ''A'' still hasn't received the response from ''Home'', so it is still in a transaction, and it tells ''B'' that it is busy.  ''B'' will have to retry the request.&lt;br /&gt;
&lt;br /&gt;
It should be noted that ''B'' will be in the ''PENDING''&amp;lt;sup&amp;gt;&amp;lt;span class = &amp;quot;plainlinks&amp;quot;&amp;gt;[[#Definitions_and_Terms|[def]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt; state, so any requests to demote it from the ''Head'' position will be delayed.  This could lead to a chaining of nodes in the ''PENDING'' state, but as soon as node ''A'' receives its response from ''Home'' or times out, the requests in the chain will be successively accepted and processed.&lt;br /&gt;
&lt;br /&gt;
=== Other Possible Race Conditions ===&lt;br /&gt;
The design of the SCI protocol prevents many race conditions from occurring, as has already been shown.  Here are two other conditions that may arise, with a discussion of how the protocol resolves the issue.&lt;br /&gt;
&lt;br /&gt;
==== Concurrent List Deletions ====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==== Simultaneous Deletion and Invalidation ====&lt;br /&gt;
Even though the ''Head'' node performs most of the coherence actions for a sharing list, the individual nodes are able to invalidate themselves, such as when they evict the block from their cache.  This is referred to in the SCI standard as a &amp;quot;deletion&amp;quot; from the sharing list.  Deletion is accomplished by having the invalidating node &amp;quot;lock&amp;quot; itself and then inform its forward and back nodes that they should now point to each other.  This &amp;quot;locking&amp;quot; is essentially another ''PENDING'' state.  A problem could arise, then, when two neighboring nodes try to invalidate themselves at the same time, as they would both be locked and not respond to each other's message.  In this case, though, the protocol specifies that the node that is closest to the tail takes priority.  So, the deadlock and/or race condition is averted by ordering the deletion from the tail towards the head.&lt;br /&gt;
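&lt;br /&gt;
The tail-first ordering described above can be sketched in Python on a doubly linked sharing list.  The function names and the dictionary-based forw/back pointers are assumptions for illustration; locking and message exchange are not modeled, only the order in which competing deletions are applied.&lt;br /&gt;
&lt;br /&gt;
```python
def build_links(order):
    # order lists node names from Head to Tail.
    forw = {}
    back = {}
    previous = None
    for name in order:
        back[name] = previous
        forw[name] = None
        if previous is not None:
            forw[previous] = name
        previous = name
    return forw, back


def delete_node(forw, back, node):
    # Unlink node: its neighbours are told to point at each other.
    prev_node, next_node = back[node], forw[node]
    if prev_node is not None:
        forw[prev_node] = next_node
    if next_node is not None:
        back[next_node] = prev_node
    del forw[node], back[node]


def resolve_concurrent_deletions(order, deleting):
    forw, back = build_links(order)
    # The node closest to the Tail takes priority, so apply the
    # competing deletions from the Tail towards the Head.
    sequence = [n for n in reversed(order) if n in deleting]
    for n in sequence:
        delete_node(forw, back, n)
    # Walk the surviving list from its first node to report the result.
    firsts = [n for n in forw if back[n] is None]
    remaining = []
    node = firsts[0] if firsts else None
    while node is not None:
        remaining.append(node)
        node = forw[node]
    return sequence, remaining
```
&lt;br /&gt;
With a list H, M1, M2, T in which the neighbors M1 and M2 both want to leave, M2 is removed first and M1 second, leaving H linked directly to T.&lt;br /&gt;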
&lt;br /&gt;
== Definitions and Terms ==&lt;br /&gt;
* ''Head Node'' - The node at the beginning of the sharing list&lt;br /&gt;
* ''Home Node'' - The node responsible for keeping track of the ''Head'' node for a sharing list for a memory block, usually the node at which the memory block resides.&lt;br /&gt;
* ''PENDING state'' - The intermediate state for a node while it is attempting to become the ''Head'' node in the sharing list.  This includes the transaction involved in querying the ''Home'' node and the transaction involved in demoting the current ''Head'' node.&lt;br /&gt;
* ''SCI'' - Scalable Coherent Interface&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&amp;lt;span id=&amp;quot;1foot&amp;quot;&amp;gt;[[#1body|1.]]&amp;lt;/span&amp;gt; &amp;quot;IEEE Standard for Scalable Coherent Interface (SCI),&amp;quot; IEEE Std 1596-1992, 1993. doi: 10.1109/IEEESTD.1993.120366&lt;br /&gt;
URL: http://ieeexplore.ieee.org.proxied.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=347683&amp;amp;isnumber=8049 &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;2foot&amp;quot;&amp;gt;[[#2body|2.]]&amp;lt;/span&amp;gt; [http://www.scizzl.com/HowSCIcohWorks.html How SCI Coherence Works] &amp;lt;br&amp;gt;&lt;/div&gt;</summary>
		<author><name>Eapotter</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch11_BB_EP&amp;diff=45141</id>
		<title>CSC/ECE 506 Spring 2011/ch11 BB EP</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch11_BB_EP&amp;diff=45141"/>
		<updated>2011-04-19T00:20:50Z</updated>

		<summary type="html">&lt;p&gt;Eapotter: /* States of the Typical Set */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;__TOC__&lt;br /&gt;
&lt;br /&gt;
== History ==&lt;br /&gt;
&lt;br /&gt;
== Background ==&lt;br /&gt;
The Scalable Coherent Interface (SCI) cache coherence protocol defines a set of states for memory, a set of states for cache lines, and a set of actions that transition between these states when processors access memory.&lt;br /&gt;
&lt;br /&gt;
Three sets of these attributes are defined for minimal, typical, and full applications:&lt;br /&gt;
&lt;br /&gt;
* '''Minimal''' - for ‘trivial but correct’ applications that require the presence of the memory in only one cache line.  It does not enable read-sharing, and is appropriate for small multi-processors or applications that don’t require significant sharing.&lt;br /&gt;
&lt;br /&gt;
* '''Typical''' - enables read-sharing of a memory location and adds provisions for efficiency, such as DMA transfers, local caching of data, and error recovery.  This set adds one stable memory state (FRESH) and multiple cache states.  It is the focus of the rest of this article.&lt;br /&gt;
&lt;br /&gt;
* '''Full''' - adds support for pair-wise sharing, QOLB lock bit synchronization, and cleansing and washing of cache lines.&lt;br /&gt;
&lt;br /&gt;
== State Diagrams ==&lt;br /&gt;
=== Memory States ===&lt;br /&gt;
The memory states define the state of the memory block from the perspective of the home directory.  This 2-bit field is maintained in the memory tag by the home directory along with the pointer to the head of the sharing list (forwId).  This simple state model includes three stable states and one semi-stable state.&lt;br /&gt;
&lt;br /&gt;
{|border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; cellpadding=&amp;quot;2&amp;quot;&lt;br /&gt;
!colspan=&amp;quot;5&amp;quot;| Table 1: Memory States &lt;br /&gt;
|----&lt;br /&gt;
!State&lt;br /&gt;
!Description&lt;br /&gt;
!Minimal&lt;br /&gt;
!Typical&lt;br /&gt;
!Full&lt;br /&gt;
|----&lt;br /&gt;
|HOME&lt;br /&gt;
|no sharing list&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|FRESH&lt;br /&gt;
|sharing-list copy is the same as memory&lt;br /&gt;
|&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|GONE&lt;br /&gt;
|sharing-list copy may be different from memory&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|WASH*&lt;br /&gt;
|transitional state (GONE to FRESH)&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|}&lt;br /&gt;
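&lt;br /&gt;
Table 1 can also be written down as a small Python data structure.  This is a sketch only: the numeric codes assigned to the states and the names MemoryState, SUPPORTED, and make_tag are illustrative assumptions, not encodings taken from the standard.&lt;br /&gt;
&lt;br /&gt;
```python
from enum import IntEnum


class MemoryState(IntEnum):
    # Numeric codes are illustrative assumptions; four values fit in two bits.
    HOME = 0   # no sharing list
    FRESH = 1  # sharing-list copy is the same as memory
    GONE = 2   # sharing-list copy may be different from memory
    WASH = 3   # transitional state (GONE to FRESH)


# Which states each option set uses, as listed in Table 1.
SUPPORTED = {
    "Minimal": {MemoryState.HOME, MemoryState.GONE},
    "Typical": {MemoryState.HOME, MemoryState.FRESH, MemoryState.GONE},
    "Full": set(MemoryState),
}


def make_tag(state, forw_id):
    # A memory tag pairs the 2-bit state with the pointer to the list head.
    return {"state": MemoryState(state), "forwId": forw_id}
```
&lt;br /&gt;
The difference between SUPPORTED["Typical"] and SUPPORTED["Minimal"] is exactly the FRESH state, which is the state that makes read-sharing possible.&lt;br /&gt;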
&lt;br /&gt;
=== Cache States ===&lt;br /&gt;
The cache-line states are maintained by each processor's cache-coherency controller.  This 7-bit field is stored in each cache line in the sharing list, along with the pointer to the next sharing-list node (forwId) and the previous sharing-list node (backId).  Seven bits enable up to 128 possible values, and SCI defines twenty-nine stable states for use in the minimal, typical, and full sets.&lt;br /&gt;
&lt;br /&gt;
Stable states are those cache states that exist when a memory transaction is not in progress.  Their names are derived from a combination of the position of the node in the sharing list (ONLY, HEAD, MID, or TAIL) and the state of the data for that node, such as FRESH, CLEAN, DIRTY, VALID, or STALE.&lt;br /&gt;
&lt;br /&gt;
====States of the Typical Set====&lt;br /&gt;
&lt;br /&gt;
Following are the states defined for the Typical set.  (Note that other states may be supported to provide certain interoperability with nodes implementing the Full set, but are not used when all nodes support the Typical set.)&lt;br /&gt;
&lt;br /&gt;
* '''ONLY_DIRTY''' - only one processor has the memory block in its cache.  This block is writable, and the processor has written (or intends to write) to it.  This state is set when the processor requests the block with read/write privileges, and no other processor currently caches the block.&lt;br /&gt;
* '''ONLY_FRESH''' - only one processor has the memory block in its cache.  This block is writable, but the processor has not written to it.  This state is set when the processor requests the block with read privileges, and no other processor currently caches the block.&lt;br /&gt;
* '''HEAD_DIRTY''' - more than one processor has the memory block in its cache.  This block is writable, and the processor has written (or intends to write) to it.  This state is set when the processor requests the block with read/write privileges, and another processor already caches the block.&lt;br /&gt;
* '''HEAD_FRESH''' - more than one processor has the memory block in its cache.  This block is writable, but the processor has not written to it.  This state is set when the processor requests the block with read privileges, and another processor already caches the block.&lt;br /&gt;
* '''MID_VALID''' - more than two processors have the memory block in their caches, and it is readable.  The processor cache with this state is neither the Head nor the Tail of the sharing list.&lt;br /&gt;
* '''TAIL_VALID''' - two or more processors have the memory block in their caches, and it is readable.  The processor cache with this state is the Tail of the sharing list.&lt;br /&gt;
&lt;br /&gt;
Below is a state diagram that depicts these states:&lt;br /&gt;
&lt;br /&gt;
[[Image:cache_states.png|center]]&lt;br /&gt;
&lt;br /&gt;
The request sub-actions of transactions that expect a response from the directory include 17 variations of reads, writes, and locks based on the number of blocks requested and the coherency requirements.  For convenience, the notation on the edges of the above graph is derived from the terms used in the SCI specification to describe the memory transactions. &lt;br /&gt;
&lt;br /&gt;
The first value indicates the request to the directory by a node for memory access:&lt;br /&gt;
* “Fetch R” - indicates a request for a memory block with read privileges.&lt;br /&gt;
* “Fetch RW” - indicates a request for a memory block with read/write privileges.&lt;br /&gt;
* “DATA_MODIFY” - indicates that a processor (the HEAD or the ONLY processor caching the block) writes to the cache line.&lt;br /&gt;
&lt;br /&gt;
The value in parentheses indicates the memory state at the time of the request.&lt;br /&gt;
&lt;br /&gt;
== Race Conditions ==&lt;br /&gt;
In a distributed shared memory system with caching, the emergence of race conditions is extremely likely.  This is mainly due to the lack of a bus for serialization of actions.  It is further compounded by the problem of network errors and congestion.&lt;br /&gt;
&lt;br /&gt;
The early invalidation case from Section 11.4 in Solihin is an excellent example of a race condition that can arise in a distributed system.  Recall the diagram from the text, and the cache coherence actions, as shown below.&lt;br /&gt;
&lt;br /&gt;
[[Image:EarlyInValidationRace.png|400px|center]]&lt;br /&gt;
The circled actions are as follows:&lt;br /&gt;
# ''A'' sends a read request to ''Home''.&lt;br /&gt;
# ''Home'' replies with data (but the message gets delayed).&lt;br /&gt;
# ''B'' sends a write request to ''Home''.&lt;br /&gt;
# ''Home'' sends invalidation to ''A'', and it arrives before the delayed data reply (ReplyD).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The SCI protocol handles this race condition, as well as many others.  The following sections discuss how the protocol's design prevents it.&lt;br /&gt;
&lt;br /&gt;
=== Prevention in the SCI Protocol ===&lt;br /&gt;
Race conditions are almost non-existent in the SCI protocol, due primarily to the protocol's design.  A brief discussion of how this is accomplished follows, and then a walkthrough of what happens if the early invalidation condition arises in the SCI protocol.&lt;br /&gt;
==== Atomic Transactions ====&lt;br /&gt;
SCI's primary method for preventing race conditions is having atomic transactions.  A transaction is defined as a set of sub-actions necessary to complete some requested action, such as querying memory for data or invalidating a node's cache.  This usually takes the form of a request-response pair, where the initiating node sends a request to another node and waits for the response.  The transaction is not complete until the response returns.  Suppose, for example, that node ''A'' is in the middle of a transaction with node ''B''.  Node ''C'' then tries to make a request of node ''A''.  Node ''A'' will respond to node ''C'' that it is busy, telling node ''C'' to try again, as shown in the following diagram.&lt;br /&gt;
[[Image:AtomicBusy.png|center]]&lt;br /&gt;
Thus, in the Early Invalidation race, node ''A'' would be in the middle of a transaction, having requested a block of data from memory and not having received a response, which would prevent node ''B'' from invalidating it.  Note, however, that nodes will not stay in the busy state indefinitely.  Doing so would lead to potential deadlocks, so all transactions have timeouts that eventually cause a stalled transaction to fail, which moves the node out of the busy state and lets it respond to requests again.&lt;br /&gt;
&lt;br /&gt;
==== Head Node ====&lt;br /&gt;
Another method for preventing race conditions is to have the ''Head'' node of a sharing list perform many of the coherence actions.  As a result, only one node is performing actions such as writes and invalidations of other sharers.  Since only one node is performing these actions, the possibility of concurrent actions is decreased.  If another node wants to write, for instance, it must become the Head Node of the sharing list for the cache line to which it wants to write, displacing the current ''Head'' node as the node that can write.  All state changes in the sharing list occur at the behest of the ''Head'' node, with the exception of nodes deleting themselves from the list.&lt;br /&gt;
&lt;br /&gt;
When simply sharing a read-only copy of a line, as when the memory is in the ''FRESH'' state, the ''Head'' node is somewhat irrelevant.  All the sharing nodes have their own cached value of the cache line.  If any node wants to write, including the ''Head'' node, it must perform an additional action in order to do so.  Likewise, when a sharing list is sharing a line of data in the dirty, or ''GONE'', memory state, as long as the ''Head'' node has not written to the line, then all the sharing nodes stay in the list with their cached line.  However, at the point the ''Head'' node wants to write, it can write its data immediately, but it then must invalidate all other shared copies, via the forward pointers in the sharing list.&lt;br /&gt;
&lt;br /&gt;
This, however, does not protect against a race condition where the memory is ''FRESH'', the ''Head'' node wants to write, and another node also wants to join the list.  The Memory Access mechanism prevents this condition.&lt;br /&gt;
&lt;br /&gt;
==== Memory Access ====&lt;br /&gt;
The final way that the SCI protocol minimizes race conditions is by changing the directory structure.  In Solihin 11.4, there is a ''Home'' node which keeps track of the cache state and has a certain amount of knowledge about the system as a result.  In the SCI protocol, there is still a ''Home'' node of sorts, but this node is only responsible for keeping track of which node is the current ''Head''.  Such a ''Home'' node is usually the node where the memory block physically resides.  When a node wants to access a block of memory, it sends a request to the ''Home'' node of that memory block, assuming it isn't already in the sharing list for that block.  In the event that the data is already being used, the ''Home'' node will include in its response the address of the ''Head'' node of the sharing list.  This requirement necessarily serializes the order of access, as one access request is not serviced at the ''Home'' node until the previous request is finished, and the requests are processed in FIFO order.  In addition, whichever node is the ''Head'' of the sharing list will be notified by the new ''Head'' when it is demoted.  All parties involved in a potential race are thus talking to each other, preventing the race from occurring in the first place.&lt;br /&gt;
&lt;br /&gt;
Because of this requirement, the projected race condition from the previous section, where the ''Head'' node is in a ''FRESH'' state and wants to write while another node also wants to read, cannot occur.  This is because both nodes must make their request to the ''Home'' node; the ''Head'' node must request the ability to write, and the other node must request the data from the ''Home'' node.  One of these requests will make it to the ''Home'' node first, resulting in the second request being deferred to the first node to reach the ''Home'' node.  Such a scenario is pictured below.&lt;br /&gt;
[[Image:MemoryAccess.png|center]]&lt;br /&gt;
&lt;br /&gt;
==== Solihin 11.4 Resolution ====&lt;br /&gt;
The three previous sections discussed individual parts of how the SCI protocol reduces race conditions.  Putting all three of these parts together yields the following scenario which resolves the Early Invalidation race as described in the text.&lt;br /&gt;
[[Image:SolihinSCI.png|center]]&lt;br /&gt;
&lt;br /&gt;
# ''A'' sends a request to ''Home'' for access to the memory block.  It then goes into a busy state while it waits for a response.&lt;br /&gt;
# ''B'' also sends a request to ''Home'' for access to the same memory block.  ''A'''s request is received first.&lt;br /&gt;
# ''Home'' responds to ''A'' with the data, but this response gets caught in network traffic and delayed.  This response is sent before ''Home'' processes the request from ''B''.&lt;br /&gt;
# ''Home'' responds to the request from ''B'', telling it that ''A'' is the current ''Head'' node.&lt;br /&gt;
# ''B'' then sends a request to ''A'' to tell it to demote itself.&lt;br /&gt;
# ''A'' still hasn't received the response from ''Home'', so it is still in a transaction, and it tells ''B'' that it is busy.  ''B'' will have to retry the request.&lt;br /&gt;
&lt;br /&gt;
It should be noted that ''B'' will be in the ''PENDING''&amp;lt;sup&amp;gt;&amp;lt;span class = &amp;quot;plainlinks&amp;quot;&amp;gt;[[#Definitions_and_Terms|[def]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt; state, so any requests to demote it from the ''Head'' position will be delayed.  This could lead to a chaining of nodes in the ''PENDING'' state, but as soon as node ''A'' receives its response from ''Home'' or times out, the requests in the chain will be successively accepted and processed.&lt;br /&gt;
&lt;br /&gt;
=== Other Possible Race Conditions ===&lt;br /&gt;
The design of the SCI protocol prevents many race conditions from occurring, as has already been shown.  Here are two other conditions that may arise, with a discussion of how the protocol resolves the issue.&lt;br /&gt;
&lt;br /&gt;
==== Concurrent List Deletions ====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==== Simultaneous Deletion and Invalidation ====&lt;br /&gt;
Even though the ''Head'' node performs most of the coherence actions for a sharing list, the individual nodes are able to invalidate themselves, such as when they evict the block from their cache.  This is referred to in the SCI standard as a &amp;quot;deletion&amp;quot; from the sharing list.  Deletion is accomplished by having the invalidating node &amp;quot;lock&amp;quot; itself and then inform its forward and back nodes that they should now point to each other.  This &amp;quot;locking&amp;quot; is essentially another ''PENDING'' state.  A problem could arise, then, when two neighboring nodes try to invalidate themselves at the same time, as they would both be locked and not respond to each other's message.  In this case, though, the protocol specifies that the node that is closest to the tail takes priority.  So, the deadlock and/or race condition is averted by ordering the deletion from the tail towards the head.&lt;br /&gt;
&lt;br /&gt;
== Definitions and Terms ==&lt;br /&gt;
* ''Head Node'' - The node at the beginning of the sharing list&lt;br /&gt;
* ''Home Node'' - The node responsible for keeping track of the ''Head'' node for a sharing list for a memory block, usually the node at which the memory block resides.&lt;br /&gt;
* ''PENDING state'' - The intermediate state for a node while it is attempting to become the ''Head'' node in the sharing list.  This includes the transaction involved in querying the ''Home'' node and the transaction involved in demoting the current ''Head'' node.&lt;br /&gt;
* ''SCI'' - Scalable Coherent Interface&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&amp;lt;span id=&amp;quot;1foot&amp;quot;&amp;gt;[[#1body|1.]]&amp;lt;/span&amp;gt; &amp;quot;IEEE Standard for Scalable Coherent Interface (SCI),&amp;quot; IEEE Std 1596-1992, 1993. doi: 10.1109/IEEESTD.1993.120366&lt;br /&gt;
URL: http://ieeexplore.ieee.org.proxied.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=347683&amp;amp;isnumber=8049 &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;2foot&amp;quot;&amp;gt;[[#2body|2.]]&amp;lt;/span&amp;gt; [http://www.scizzl.com/HowSCIcohWorks.html How SCI Coherence Works] &amp;lt;br&amp;gt;&lt;/div&gt;</summary>
		<author><name>Eapotter</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch11_BB_EP&amp;diff=45139</id>
		<title>CSC/ECE 506 Spring 2011/ch11 BB EP</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch11_BB_EP&amp;diff=45139"/>
		<updated>2011-04-19T00:19:58Z</updated>

		<summary type="html">&lt;p&gt;Eapotter: /* States of the Typical Set */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;__TOC__&lt;br /&gt;
&lt;br /&gt;
== History ==&lt;br /&gt;
&lt;br /&gt;
== Background ==&lt;br /&gt;
The Scalable Coherent Interface (SCI) cache coherence protocol defines a set of states for memory, a set of states for cache lines, and a set of actions that transition between these states when processors access memory.&lt;br /&gt;
&lt;br /&gt;
Three sets of these attributes are defined for minimal, typical, and full applications:&lt;br /&gt;
&lt;br /&gt;
* '''Minimal''' - for ‘trivial but correct’ applications that require the presence of the memory in only one cache line.  It does not enable read-sharing, and is appropriate for small multi-processors or applications that don’t require significant sharing.&lt;br /&gt;
&lt;br /&gt;
* '''Typical''' - enables read-sharing of a memory location and adds provisions for efficiency, such as DMA transfers, local caching of data, and error recovery.  This set adds one stable memory state (FRESH) and multiple cache states.  It is the focus of the rest of this article.&lt;br /&gt;
&lt;br /&gt;
* '''Full''' - adds support for pair-wise sharing, QOLB lock bit synchronization, and cleansing and washing of cache lines.&lt;br /&gt;
&lt;br /&gt;
== State Diagrams ==&lt;br /&gt;
=== Memory States ===&lt;br /&gt;
The memory states define the state of the memory block from the perspective of the home directory.  This 2-bit field is maintained in the memory tag by the home directory along with the pointer to the head of the sharing list (forwId).  This simple state model includes three stable states and one semi-stable state.&lt;br /&gt;
&lt;br /&gt;
{|border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; cellpadding=&amp;quot;2&amp;quot;&lt;br /&gt;
!colspan=&amp;quot;5&amp;quot;| Table 1: Memory States &lt;br /&gt;
|----&lt;br /&gt;
!State&lt;br /&gt;
!Description&lt;br /&gt;
!Minimal&lt;br /&gt;
!Typical&lt;br /&gt;
!Full&lt;br /&gt;
|----&lt;br /&gt;
|HOME&lt;br /&gt;
|no sharing list&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|FRESH&lt;br /&gt;
|sharing-list copy is the same as memory&lt;br /&gt;
|&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|GONE&lt;br /&gt;
|sharing-list copy may be different from memory&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|WASH*&lt;br /&gt;
|transitional state (GONE to FRESH)&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|}&lt;br /&gt;
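&lt;br /&gt;
Table 1 can also be written down as a small lookup structure.  The following is a minimal sketch only: the numeric encodings and the names MemoryState, SUPPORTED_BY, and fits_in_two_bits are assumptions made for illustration and are not taken from the SCI standard.  It shows that four states fit in the 2-bit field and records which option set uses which state.&lt;br /&gt;
&lt;br /&gt;
```python
from enum import Enum


class MemoryState(Enum):
    # Numeric values are illustrative assumptions, not the standard's encoding.
    HOME = 0   # no sharing list
    FRESH = 1  # sharing-list copy is the same as memory
    GONE = 2   # sharing-list copy may be different from memory
    WASH = 3   # transitional state (GONE to FRESH)


# Which memory states each option set uses, as listed in Table 1.
SUPPORTED_BY = {
    "minimal": {MemoryState.HOME, MemoryState.GONE},
    "typical": {MemoryState.HOME, MemoryState.FRESH, MemoryState.GONE},
    "full": set(MemoryState),
}


def fits_in_two_bits():
    # A 2-bit field can hold the values 0 through 3.
    return all(state.value in range(4) for state in MemoryState)
```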
&lt;br /&gt;
=== Cache States ===&lt;br /&gt;
The cache-line states are maintained by each processor's cache-coherency controller.  This 7-bit field is stored in each cache line in the sharing list, along with the pointer to the next sharing-list node (forwId) and the previous sharing-list node (backId).  Seven bits enable up to 128 possible values, and SCI defines twenty-nine stable states for use in the minimal, typical, and full sets.  &lt;br /&gt;
&lt;br /&gt;
Stable states are those cache states that exist when a memory transaction is not in progress.  Their names are derived from a combination of the position of the node in the sharing list, such as ONLY, HEAD, MID, and TAIL, and the state of the data for that node, such as FRESH, CLEAN, DIRTY, VALID, and STALE.&lt;br /&gt;
&lt;br /&gt;
====States of the Typical Set====&lt;br /&gt;
&lt;br /&gt;
Following are the states defined for the Typical set.  (Note that other states may be supported to provide certain interoperability with nodes implementing the Full set, but are not used when all nodes support the Typical set.)&lt;br /&gt;
&lt;br /&gt;
* ''ONLY_DIRTY'' - only one processor has the memory block in its cache.  The block is writable, and the processor has written (or intends to write) to it.  This state is set when the processor requests the block with read/write privileges, and no other processor currently caches the block.&lt;br /&gt;
* ''ONLY_FRESH'' - only one processor has the memory block in its cache.  The block is writable, but the processor has not written to it.  This state is set when the processor requests the block with read privileges, and no other processor currently caches the block.&lt;br /&gt;
* ''HEAD_DIRTY'' - more than one processor has the memory block in its cache.  The block is writable, and the processor has written (or intends to write) to it.  This state is set when the processor requests the block with read/write privileges, and another processor already caches the block.&lt;br /&gt;
* ''HEAD_FRESH'' - more than one processor has the memory block in its cache.  The block is writable, but the processor has not written to it.  This state is set when the processor requests the block with read privileges, and another processor already caches the block.&lt;br /&gt;
* ''MID_VALID'' - more than two processors have the memory block in their caches, and it is readable.  The processor cache with this state is neither the Head nor the Tail of the sharing list.&lt;br /&gt;
* ''TAIL_VALID'' - two or more processors have the memory block in their caches, and it is readable.  The processor cache with this state is the Tail of the sharing list.&lt;br /&gt;
&lt;br /&gt;
Below is a state diagram that depicts these states:&lt;br /&gt;
&lt;br /&gt;
[[Image:cache_states.png|center]]&lt;br /&gt;
&lt;br /&gt;
The request sub-actions of transactions that expect a response from the directory include 17 variations of reads, writes, and locks based on the number of blocks requested and the coherency requirements.  For convenience, the notation on the edges of the above graph is derived from the terms used in the SCI specification to describe the memory transactions. &lt;br /&gt;
&lt;br /&gt;
The first value indicates the request to the directory by a node for memory access:&lt;br /&gt;
* ''Fetch R'' - indicates a request for a memory block with read privileges.&lt;br /&gt;
* ''Fetch RW'' - indicates a request for a memory block with read/write privileges.&lt;br /&gt;
* ''DATA_MODIFY'' - indicates that a processor (the HEAD or the ONLY processor caching the block) writes to the cache line.&lt;br /&gt;
&lt;br /&gt;
The value in parentheses indicates the memory state at the time of the request.&lt;br /&gt;
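&lt;br /&gt;
The way a fetch request selects the requester's new cache state can be sketched in a few lines.  This is a simplified illustration that follows only the prose descriptions of the four ONLY and HEAD states above; the function name state_after_fetch and its parameters are hypothetical, and the edges of the state diagram (and the standard itself) take precedence wherever they differ.&lt;br /&gt;
&lt;br /&gt;
```python
def state_after_fetch(request, others_cache_block):
    # request: "Fetch R" (read) or "Fetch RW" (read/write).
    # others_cache_block: True when another processor already caches the block.
    if request not in ("Fetch R", "Fetch RW"):
        raise ValueError("unknown request: " + request)
    # The requester becomes the only sharer or the new head of the list.
    position = "HEAD" if others_cache_block else "ONLY"
    # Read/write privileges give a DIRTY copy, read privileges a FRESH one.
    data = "DIRTY" if request == "Fetch RW" else "FRESH"
    return position + "_" + data


print(state_after_fetch("Fetch R", False))   # ONLY_FRESH
print(state_after_fetch("Fetch RW", True))   # HEAD_DIRTY
```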
&lt;br /&gt;
== Race Conditions ==&lt;br /&gt;
In a distributed shared memory system with caching, the emergence of race conditions is extremely likely.  This is mainly due to the lack of a bus for serialization of actions.  It is further compounded by the problem of network errors and congestion.&lt;br /&gt;
&lt;br /&gt;
The early invalidation case from Section 11.4 in Solihin is an excellent example of a race condition that can arise in a distributed system.  Recall the diagram from the text, and the cache coherence actions, as shown below.&lt;br /&gt;
&lt;br /&gt;
[[Image:EarlyInValidationRace.png|400px|center]]&lt;br /&gt;
The circled actions are as follows:&lt;br /&gt;
# ''A'' sends a read request to ''Home''.&lt;br /&gt;
# ''Home'' replies with data (but the message gets delayed).&lt;br /&gt;
# ''B'' sends a write request to ''Home''.&lt;br /&gt;
# ''Home'' sends an invalidation to ''A'', and it arrives before the ReplyD message.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The SCI protocol handles this race condition, along with many others.  The following sections discuss how the protocol's design prevents it.&lt;br /&gt;
&lt;br /&gt;
=== Prevention in the SCI Protocol ===&lt;br /&gt;
Race conditions are rare in the SCI protocol, primarily because of the protocol's design.  The following sections describe the mechanisms responsible and then show what happens when the Early Invalidation scenario arises under SCI.&lt;br /&gt;
==== Atomic Transactions ====&lt;br /&gt;
SCI's primary method for preventing race conditions is having atomic transactions.  A transaction is defined as a set of sub-actions necessary to complete some requested action, such as querying memory for data or invalidating a node's cache.  This usually takes the form of a request-response pair, where the initiating node sends a request to another node and waits for the response.  The transaction is not complete until the response returns.  Suppose, for example, that node ''A'' is in the middle of a transaction with node ''B''.  Node ''C'' then tries to make a request of node ''A''.  Node ''A'' will respond to node ''C'' that it is busy, telling node ''C'' to try again, as shown in the following diagram.&lt;br /&gt;
[[Image:AtomicBusy.png|center]]&lt;br /&gt;
Thus, in the Early Invalidation race, node ''A'' would be in the middle of a transaction, having requested a block of data from memory and not having received a response, which would prevent node ''B'' from invalidating it.  Note, however, that nodes will not stay in the busy state indefinitely.  Doing so would lead to potential deadlocks, so all transactions have timeouts that will eventually cause the transaction to fail, thus moving the node out of the busy state and making it able to respond to requests again.&lt;br /&gt;
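&lt;br /&gt;
The busy-and-retry behavior described above can be modeled with a small sketch.  The class Node, its methods, and the helper invalidate_with_retry are hypothetical names chosen for illustration, and the retry loop is a simplification: it assumes the pending response (or a timeout) is delivered between attempts rather than modeling real network timing.&lt;br /&gt;
&lt;br /&gt;
```python
class Node:
    def __init__(self, name):
        self.name = name
        self.in_transaction = False  # True while waiting for a response
        self.has_line = False

    def begin_request(self):
        # The node sends a request and waits; the transaction is now open.
        self.in_transaction = True

    def receive_response(self):
        # The response arrives with the data and closes the transaction.
        self.has_line = True
        self.in_transaction = False

    def time_out(self):
        # A timeout fails the transaction so the node cannot stay busy forever.
        self.in_transaction = False

    def handle_invalidate(self):
        # A node in the middle of a transaction refuses other requests.
        if self.in_transaction:
            return "BUSY"
        self.has_line = False
        return "DONE"


def invalidate_with_retry(target, deliver_pending, max_tries=3):
    # deliver_pending stands in for whatever ends the target's transaction.
    replies = []
    for _ in range(max_tries):
        reply = target.handle_invalidate()
        replies.append(reply)
        if reply == "DONE":
            break
        deliver_pending()
    return replies


a = Node("A")
a.begin_request()
print(invalidate_with_retry(a, a.receive_response))  # ['BUSY', 'DONE']
```

In this model the invalidation can never overtake the data reply: it is refused until ''A'' has either received its data or timed out.&lt;br /&gt;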
&lt;br /&gt;
==== Head Node ====&lt;br /&gt;
Another method for preventing race conditions is to have the ''Head'' node of a sharing list perform many of the coherence actions.  As a result, only one node is performing actions such as writes and invalidations of other sharers.  Since only one node is performing these actions, the possibility of concurrent actions is decreased.  If another node wants to write, for instance, it must become the Head Node of the sharing list for the cache line to which it wants to write, displacing the current ''Head'' node as the node that can write.  All state changes in the sharing list occur at the behest of the ''Head'' node, with the exception of nodes deleting themselves from the list.&lt;br /&gt;
&lt;br /&gt;
When simply sharing a read-only copy of a line, as when the memory is in the ''FRESH'' state, the ''Head'' node is somewhat irrelevant.  All the sharing nodes have their own cached value of the cache line.  If any node wants to write, including the ''Head'' node, it must perform an additional action in order to do so.  Likewise, when a sharing list is sharing a line of data in the dirty, or ''GONE'', memory state, as long as the ''Head'' node has not written to the line, all the sharing nodes stay in the list with their cached line.  However, at the point the ''Head'' node wants to write, it can write its data immediately, but it then must invalidate all other shared copies via the forward pointers in the sharing list.&lt;br /&gt;
&lt;br /&gt;
This, however, does not protect against a race condition where the memory is ''FRESH'', the ''Head'' node wants to write, and another node also wants to join the list.  The Memory Access mechanism prevents this condition.&lt;br /&gt;
&lt;br /&gt;
==== Memory Access ====&lt;br /&gt;
The final way that the SCI protocol minimizes race conditions is by changing the directory structure.  In Solihin 11.4, there is a ''Home'' node which keeps track of the cache state and has a certain amount of knowledge about the system as a result.  In the SCI protocol, there is still a ''Home'' node of sorts, but this node is only responsible for keeping track of which node is the current ''Head''.  Such a ''Home'' node is usually the node where the memory block physically resides.  When a node wants to access a block of memory, it sends a request to the ''Home'' node of that memory block, assuming it isn't already in the sharing list for that block.  In the event that the data is already being used, the ''Home'' node will include in its response the address of the ''Head'' node of the sharing list.  This requirement necessarily serializes the order of access, as one access request is not serviced at the ''Home'' node until the previous request is finished, and the requests are processed in FIFO order.  In addition, whichever node is the ''Head'' of the sharing list will be notified by the new ''Head'' when it is demoted.  All parties involved in a potential race are thus talking to each other, preventing the race from occurring in the first place.&lt;br /&gt;
&lt;br /&gt;
Because of this requirement, the projected race condition from the previous section, where the ''Head'' node is in a ''FRESH'' state and wants to write while another node also wants to read, cannot occur.  This is because both nodes must make their request to the ''Home'' node; the ''Head'' node must request the ability to write, and the other node must request the data from the ''Home'' node.  One of these requests will make it to the ''Home'' node first, resulting in the second request being deferred to the first node to reach the ''Home'' node.  Such a scenario is pictured below.&lt;br /&gt;
[[Image:MemoryAccess.png|center]]&lt;br /&gt;
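&lt;br /&gt;
The serialization performed by the ''Home'' node can be sketched as a FIFO queue plus a single pointer to the current ''Head''.  This is an illustrative model only; the class HomeNode and its methods submit and service_next are hypothetical names, and the sketch omits the memory state and the data itself.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import deque


class HomeNode:
    def __init__(self):
        self.head = None      # the only sharing-list fact Home tracks here
        self.queue = deque()  # pending requests, oldest first

    def submit(self, requester):
        # Requests are queued in the order in which they arrive.
        self.queue.append(requester)

    def service_next(self):
        # Service the oldest request: the requester becomes the new Head
        # and is told who the previous Head was (None if there was no list).
        requester = self.queue.popleft()
        old_head = self.head
        self.head = requester
        return requester, old_head


home = HomeNode()
home.submit("A")
home.submit("B")
print(home.service_next())  # ('A', None)
print(home.service_next())  # ('B', 'A')
```

Whichever request arrives second is answered with the identity of the first requester, so the second node has to contact the first directly instead of racing with it at the ''Home'' node.&lt;br /&gt;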
&lt;br /&gt;
==== Solihin 11.4 Resolution ====&lt;br /&gt;
The three previous sections discussed individual parts of how the SCI protocol reduces race conditions.  Putting all three of these parts together yields the following scenario which resolves the Early Invalidation race as described in the text.&lt;br /&gt;
[[Image:SolihinSCI.png|center]]&lt;br /&gt;
&lt;br /&gt;
# ''A'' sends a request to ''Home'' for access to the memory block.  It then goes into a busy state while it waits for a response.&lt;br /&gt;
# ''B'' also sends a request to ''Home'' for access to the same memory block.  ''A'''s request is received first.&lt;br /&gt;
# ''Home'' responds to ''A'' with the data, but this response gets caught in network traffic and delayed.  This response is sent before ''Home'' processes the request from ''B''.&lt;br /&gt;
# ''Home'' responds to the request from ''B'', telling it that ''A'' is the current ''Head'' node.&lt;br /&gt;
# ''B'' then sends a request to ''A'' to tell it to demote itself.&lt;br /&gt;
# ''A'' still hasn't received the response from ''Home'', so it is still in a transaction, and it tells ''B'' that it is busy.  ''B'' will have to retry the request.&lt;br /&gt;
&lt;br /&gt;
It should be noted that ''B'' will be in the ''PENDING''&amp;lt;sup&amp;gt;&amp;lt;span class = &amp;quot;plainlinks&amp;quot;&amp;gt;[[#Definitions_and_Terms|[def]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt; state, so any requests to demote it from the ''Head'' position will be delayed.  This could lead to a chaining of nodes in the ''PENDING'' state, but as soon as node ''A'' receives its response from ''Home'' or times out, the requests in the chain will be successively accepted and processed.&lt;br /&gt;
&lt;br /&gt;
=== Other Possible Race Conditions ===&lt;br /&gt;
The design of the SCI protocol prevents many race conditions from occurring, as has already been shown.  Here are two other conditions that may arise, with a discussion of how the protocol resolves the issue.&lt;br /&gt;
&lt;br /&gt;
==== Concurrent List Deletions ====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==== Simultaneous Deletion and Invalidation ====&lt;br /&gt;
Even though the ''Head'' node performs most of the coherence actions for a sharing list, the individual nodes are able to invalidate themselves, such as when they evict the block from their cache.  This is referred to in the SCI standard as a &amp;quot;deletion&amp;quot; from the sharing list.  Deletion is accomplished by having the invalidating node &amp;quot;lock&amp;quot; itself and then inform its forward and back nodes that they should now point to each other.  This &amp;quot;locking&amp;quot; is essentially another ''PENDING'' state.  A problem could arise, then, when two neighboring nodes try to invalidate themselves at the same time, as they would both be locked and not respond to each other's message.  In this case, though, the protocol specifies that the node that is closest to the tail takes priority.  So, the deadlock and/or race condition is averted by ordering the deletion from the tail towards the head.&lt;br /&gt;
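&lt;br /&gt;
The deletion step and the tail-first ordering can be illustrated with a doubly linked list.  This is a sequential sketch under stated assumptions: the names Sharer, build_list, delete, delete_both, and walk are hypothetical, the lock itself is not modeled, and the priority rule is represented simply by processing the node nearer the tail first.&lt;br /&gt;
&lt;br /&gt;
```python
class Sharer:
    def __init__(self, name):
        self.name = name
        self.forw = None  # next node, toward the tail
        self.back = None  # previous node, toward the head


def build_list(names):
    # Link the nodes in order, head first, and return them as a list.
    nodes = [Sharer(name) for name in names]
    for left, right in zip(nodes, nodes[1:]):
        left.forw = right
        right.back = left
    return nodes


def delete(node):
    # Tell the back and forward neighbors to point to each other.
    if node.back is not None:
        node.back.forw = node.forw
    if node.forw is not None:
        node.forw.back = node.back
    node.forw = None
    node.back = None


def delete_both(first, second, head_to_tail):
    # The node closest to the tail takes priority, so it is deleted first.
    for node in sorted([first, second], key=head_to_tail.index, reverse=True):
        delete(node)


def walk(head):
    names = []
    node = head
    while node is not None:
        names.append(node.name)
        node = node.forw
    return names


nodes = build_list(["H", "M1", "M2", "T"])
delete_both(nodes[1], nodes[2], nodes)
print(walk(nodes[0]))  # ['H', 'T']
```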
&lt;br /&gt;
== Definitions and Terms ==&lt;br /&gt;
* ''Head Node'' - The node at the beginning of the sharing list.&lt;br /&gt;
* ''Home Node'' - The node responsible for keeping track of the ''Head'' node for a sharing list for a memory block, usually the node at which the memory block resides.&lt;br /&gt;
* ''PENDING state'' - The intermediate state for a node while it is attempting to become the ''Head'' node in the sharing list.  This includes the transaction involved in querying the ''Home'' node and the transaction involved in demoting the current ''Head'' node.&lt;br /&gt;
* ''SCI'' - Scalable Coherent Interface&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&amp;lt;span id=&amp;quot;1foot&amp;quot;&amp;gt;[[#1body|1.]]&amp;lt;/span&amp;gt; &amp;quot;IEEE Standard for Scalable Coherent Interface (SCI),&amp;quot; IEEE Std 1596-1992, 1993. doi: 10.1109/IEEESTD.1993.120366&lt;br /&gt;
URL: http://ieeexplore.ieee.org.proxied.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=347683&amp;amp;isnumber=8049 &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;2foot&amp;quot;&amp;gt;[[#2body|2.]]&amp;lt;/span&amp;gt; [http://www.scizzl.com/HowSCIcohWorks.html How SCI Coherence Works] &amp;lt;br&amp;gt;&lt;/div&gt;</summary>
		<author><name>Eapotter</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch11_BB_EP&amp;diff=45136</id>
		<title>CSC/ECE 506 Spring 2011/ch11 BB EP</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch11_BB_EP&amp;diff=45136"/>
		<updated>2011-04-19T00:18:44Z</updated>

		<summary type="html">&lt;p&gt;Eapotter: /* States of the Typical Set */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;__TOC__&lt;br /&gt;
&lt;br /&gt;
== History ==&lt;br /&gt;
&lt;br /&gt;
== Background ==&lt;br /&gt;
The Scalable Coherent Interface (SCI) cache coherence protocol defines a set of states for memory, a set of states for cache lines, and a set of actions that transition between these states when processors access memory.&lt;br /&gt;
&lt;br /&gt;
Three sets of these attributes are defined for minimal, typical, and full applications:&lt;br /&gt;
&lt;br /&gt;
* '''Minimal''' - for ‘trivial but correct’ applications that require the presence of the memory in only one cache line.  It does not enable read-sharing, and is appropriate for small multi-processors or applications that don’t require significant sharing.&lt;br /&gt;
&lt;br /&gt;
* '''Typical''' - enables read-sharing of a memory location and provisions for efficiency, such as DMA transfers, local caching of data, and error recovery.  This set adds an additional stable memory state (FRESH) and multiple cache states.  This set will be the focus of this article going forward.&lt;br /&gt;
&lt;br /&gt;
* '''Full''' - adds support for pair-wise sharing, QOLB lock bit synchronization, and cleansing and washing of cache lines.&lt;br /&gt;
&lt;br /&gt;
== State Diagrams ==&lt;br /&gt;
=== Memory States ===&lt;br /&gt;
The memory states define the state of the memory block from the perspective of the home directory.  This 2-bit field is maintained in the memory tag by the home directory along with the pointer to the head of the sharing list (forwId).  This simple state model includes three stable states and one semi-stable state.&lt;br /&gt;
&lt;br /&gt;
{|border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; cellpadding=&amp;quot;2&amp;quot;&lt;br /&gt;
!colspan=&amp;quot;5&amp;quot;| Table 1: Memory States &lt;br /&gt;
|----&lt;br /&gt;
!State&lt;br /&gt;
!Description&lt;br /&gt;
!Minimal&lt;br /&gt;
!Typical&lt;br /&gt;
!Full&lt;br /&gt;
|----&lt;br /&gt;
|HOME&lt;br /&gt;
|no sharing list&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|FRESH&lt;br /&gt;
|sharing-list copy is the same as memory&lt;br /&gt;
|&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|GONE&lt;br /&gt;
|sharing-list copy may be different from memory&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|WASH*&lt;br /&gt;
|transitional state (GONE to FRESH)&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=== Cache States ===&lt;br /&gt;
The cache-line states are maintained by each processor's cache-coherency controller.  This 7-bit field is stored in each cache line in the sharing list, along with the pointer to the next sharing-list node (forwId) and the previous sharing-list node (backId).  Seven bits enable up to 128 possible values, and SCI defines twenty-nine stable states for use in the minimal, typical, and full sets.  &lt;br /&gt;
&lt;br /&gt;
Stable states are those cache states that exist when a memory transaction is not in progress.  Their names are derived from a combination of the position of the node in the sharing list, such as ONLY, HEAD, MID, and TAIL, and the state of the data for that node, such as FRESH, CLEAN, DIRTY, VALID, and STALE.&lt;br /&gt;
&lt;br /&gt;
====States of the Typical Set====&lt;br /&gt;
&lt;br /&gt;
Following are the states defined for the Typical set.  (Note that other states may be supported to provide certain interoperability with nodes implementing the Full set, but are not used when all nodes support the Typical set.)&lt;br /&gt;
&lt;br /&gt;
* ''ONLY_DIRTY'' - only one processor has the memory block in its cache.  The block is writable, and the processor has written (or intends to write) to it.  This state is set when the processor requests the block with read/write privileges, and no other processor currently caches the block.&lt;br /&gt;
* ''ONLY_FRESH'' - only one processor has the memory block in its cache.  The block is writable, but the processor has not written to it.  This state is set when the processor requests the block with read privileges, and no other processor currently caches the block.&lt;br /&gt;
* ''HEAD_DIRTY'' - more than one processor has the memory block in its cache.  The block is writable, and the processor has written (or intends to write) to it.  This state is set when the processor requests the block with read/write privileges, and another processor already caches the block.&lt;br /&gt;
* ''HEAD_FRESH'' - more than one processor has the memory block in its cache.  The block is writable, but the processor has not written to it.  This state is set when the processor requests the block with read privileges, and another processor already caches the block.&lt;br /&gt;
* ''MID_VALID'' - more than two processors have the memory block in their caches, and it is readable.  The processor cache with this state is neither the Head nor the Tail of the sharing list.&lt;br /&gt;
* ''TAIL_VALID'' - two or more processors have the memory block in their caches, and it is readable.  The processor cache with this state is the Tail of the sharing list.&lt;br /&gt;
&lt;br /&gt;
Below is a state diagram that depicts these states:&lt;br /&gt;
&lt;br /&gt;
[[Image:cache_states.png|center]]&lt;br /&gt;
&lt;br /&gt;
The request sub-actions of transactions that expect a response from the directory include 17 variations of reads, writes, and locks based on the number of blocks requested and the coherency requirements.  For convenience, the notation on the edges of the above graph is derived from the terms used in the SCI specification to describe the memory transactions. &lt;br /&gt;
&lt;br /&gt;
The first value indicates the request to the directory by a node for memory access:&lt;br /&gt;
* ''Fetch R'' - indicates a request for a memory block with read privileges.&lt;br /&gt;
* ''Fetch RW'' - indicates a request for a memory block with read/write privileges.&lt;br /&gt;
* ''DATA_MODIFY'' - indicates that a processor (the HEAD or the ONLY processor caching the block) writes to the cache line.&lt;br /&gt;
&lt;br /&gt;
The value in parentheses indicates the memory state at the time of the request.&lt;br /&gt;
&lt;br /&gt;
== Race Conditions ==&lt;br /&gt;
In a distributed shared memory system with caching, the emergence of race conditions is extremely likely.  This is mainly due to the lack of a bus for serialization of actions.  It is further compounded by the problem of network errors and congestion.&lt;br /&gt;
&lt;br /&gt;
The early invalidation case from Section 11.4 in Solihin is an excellent example of a race condition that can arise in a distributed system.  Recall the diagram from the text, and the cache coherence actions, as shown below.&lt;br /&gt;
&lt;br /&gt;
[[Image:EarlyInValidationRace.png|400px|center]]&lt;br /&gt;
The circled actions are as follows:&lt;br /&gt;
# ''A'' sends a read request to ''Home''.&lt;br /&gt;
# ''Home'' replies with data (but the message gets delayed).&lt;br /&gt;
# ''B'' sends a write request to ''Home''.&lt;br /&gt;
# ''Home'' sends an invalidation to ''A'', and it arrives before the ReplyD message.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The SCI protocol handles this race condition, along with many others.  The following sections discuss how the protocol's design prevents it.&lt;br /&gt;
&lt;br /&gt;
=== Prevention in the SCI Protocol ===&lt;br /&gt;
Race conditions are rare in the SCI protocol, primarily because of the protocol's design.  The following sections describe the mechanisms responsible and then show what happens when the Early Invalidation scenario arises under SCI.&lt;br /&gt;
==== Atomic Transactions ====&lt;br /&gt;
SCI's primary method for preventing race conditions is having atomic transactions.  A transaction is defined as a set of sub-actions necessary to complete some requested action, such as querying memory for data or invalidating a node's cache.  This usually takes the form of a request-response pair, where the initiating node sends a request to another node and waits for the response.  The transaction is not complete until the response returns.  Suppose, for example, that node ''A'' is in the middle of a transaction with node ''B''.  Node ''C'' then tries to make a request of node ''A''.  Node ''A'' will respond to node ''C'' that it is busy, telling node ''C'' to try again, as shown in the following diagram.&lt;br /&gt;
[[Image:AtomicBusy.png|center]]&lt;br /&gt;
Thus, in the Early Invalidation race, node ''A'' would be in the middle of a transaction, having requested a block of data from memory and not having received a response, which would prevent node ''B'' from invalidating it.  Note, however, that nodes will not stay in the busy state indefinitely.  Doing so would lead to potential deadlocks, so all transactions have timeouts that will eventually cause the transaction to fail, thus moving the node out of the busy state and making it able to respond to requests again.&lt;br /&gt;
&lt;br /&gt;
==== Head Node ====&lt;br /&gt;
Another method for preventing race conditions is to have the ''Head'' node of a sharing list perform many of the coherence actions.  As a result, only one node is performing actions such as writes and invalidations of other sharers.  Since only one node is performing these actions, the possibility of concurrent actions is decreased.  If another node wants to write, for instance, it must become the Head Node of the sharing list for the cache line to which it wants to write, displacing the current ''Head'' node as the node that can write.  All state changes in the sharing list occur at the behest of the ''Head'' node, with the exception of nodes deleting themselves from the list.&lt;br /&gt;
&lt;br /&gt;
When simply sharing a read-only copy of a line, as when the memory is in the ''FRESH'' state, the ''Head'' node is somewhat irrelevant.  All the sharing nodes have their own cached value of the cache line.  If any node wants to write, including the ''Head'' node, it must perform an additional action in order to do so.  Likewise, when a sharing list is sharing a line of data in the dirty, or ''GONE'', memory state, as long as the ''Head'' node has not written to the line, all the sharing nodes stay in the list with their cached line.  However, at the point the ''Head'' node wants to write, it can write its data immediately, but it then must invalidate all other shared copies via the forward pointers in the sharing list.&lt;br /&gt;
&lt;br /&gt;
This, however, does not protect against a race condition where the memory is ''FRESH'', the ''Head'' node wants to write, and another node also wants to join the list.  The Memory Access mechanism prevents this condition.&lt;br /&gt;
&lt;br /&gt;
==== Memory Access ====&lt;br /&gt;
The final way that the SCI protocol minimizes race conditions is by changing the directory structure.  In Solihin 11.4, there is a ''Home'' node which keeps track of the cache state and has a certain amount of knowledge about the system as a result.  In the SCI protocol, there is still a ''Home'' node of sorts, but this node is only responsible for keeping track of which node is the current ''Head''.  Such a ''Home'' node is usually the node where the memory block physically resides.  When a node wants to access a block of memory, it sends a request to the ''Home'' node of that memory block, assuming it isn't already in the sharing list for that block.  In the event that the data is already being used, the ''Home'' node will include in its response the address of the ''Head'' node of the sharing list.  This requirement necessarily serializes the order of access, as one access request is not serviced at the ''Home'' node until the previous request is finished, and the requests are processed in FIFO order.  In addition, whichever node is the ''Head'' of the sharing list will be notified by the new ''Head'' when it is demoted.  All parties involved in a potential race are thus talking to each other, preventing the race from occurring in the first place.&lt;br /&gt;
&lt;br /&gt;
Because of this requirement, the projected race condition from the previous section, where the ''Head'' node is in a ''FRESH'' state and wants to write while another node also wants to read, cannot occur.  This is because both nodes must make their request to the ''Home'' node; the ''Head'' node must request the ability to write, and the other node must request the data from the ''Home'' node.  One of these requests will make it to the ''Home'' node first, resulting in the second request being deferred to the first node to reach the ''Home'' node.  Such a scenario is pictured below.&lt;br /&gt;
[[Image:MemoryAccess.png|center]]&lt;br /&gt;
&lt;br /&gt;
==== Solihin 11.4 Resolution ====&lt;br /&gt;
The three previous sections discussed individual parts of how the SCI protocol reduces race conditions.  Putting all three of these parts together yields the following scenario which resolves the Early Invalidation race as described in the text.&lt;br /&gt;
[[Image:SolihinSCI.png|center]]&lt;br /&gt;
&lt;br /&gt;
# ''A'' sends a request to ''Home'' for access to the memory block.  It then goes into a busy state while it waits for a response.&lt;br /&gt;
# ''B'' also sends a request to ''Home'' for access to the same memory block.  ''A'''s request is received first.&lt;br /&gt;
# ''Home'' responds to ''A'' with the data, but this response gets caught in network traffic and delayed.  This response is sent before ''Home'' processes the request from ''B''.&lt;br /&gt;
# ''Home'' responds to the request from ''B'', telling it that ''A'' is the current ''Head'' node.&lt;br /&gt;
# ''B'' then sends a request to ''A'' to tell it to demote itself.&lt;br /&gt;
# ''A'' still hasn't received the response from ''Home'', so it is still in a transaction, and it tells ''B'' that it is busy.  ''B'' will have to retry the request.&lt;br /&gt;
&lt;br /&gt;
It should be noted that ''B'' will be in the ''PENDING''&amp;lt;sup&amp;gt;&amp;lt;span class = &amp;quot;plainlinks&amp;quot;&amp;gt;[[#Definitions_and_Terms|[def]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt; state, so any requests to demote it from the ''Head'' position will be delayed.  This could lead to a chaining of nodes in the ''PENDING'' state, but as soon as node ''A'' receives its response from ''Home'' or times out, the requests in the chain will be successively accepted and processed.&lt;br /&gt;
&lt;br /&gt;
=== Other Possible Race Conditions ===&lt;br /&gt;
The design of the SCI protocol prevents many race conditions from occurring, as has already been shown.  Here are two other conditions that may arise, with a discussion of how the protocol resolves the issue.&lt;br /&gt;
&lt;br /&gt;
==== Concurrent List Deletions ====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==== Simultaneous Deletion and Invalidation ====&lt;br /&gt;
Even though the ''Head'' node performs most of the coherence actions for a sharing list, the individual nodes are able to invalidate themselves, such as when they evict the block from their cache.  This is referred to in the SCI standard as a &amp;quot;deletion&amp;quot; from the sharing list.  Deletion is accomplished by having the invalidating node &amp;quot;lock&amp;quot; itself and then inform its forward and back nodes that they should now point to each other.  This &amp;quot;locking&amp;quot; is essentially another ''PENDING'' state.  A problem could arise, then, when two neighboring nodes try to invalidate themselves at the same time, as they would both be locked and not respond to each other's message.  In this case, though, the protocol specifies that the node that is closest to the tail takes priority.  So, the deadlock and/or race condition is averted by ordering the deletion from the tail towards the head.&lt;br /&gt;
&lt;br /&gt;
== Definitions and Terms ==&lt;br /&gt;
* ''Head Node'' - The node at the beginning of the sharing list.&lt;br /&gt;
* ''Home Node'' - The node responsible for keeping track of the ''Head'' node for a sharing list for a memory block, usually the node at which the memory block resides.&lt;br /&gt;
* ''PENDING state'' - The intermediate state for a node while it is attempting to become the ''Head'' node in the sharing list.  This includes the transaction involved in querying the ''Home'' node and the transaction involved in demoting the current ''Head'' node.&lt;br /&gt;
* ''SCI'' - Scalable Coherent Interface&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&amp;lt;span id=&amp;quot;1foot&amp;quot;&amp;gt;[[#1body|1.]]&amp;lt;/span&amp;gt; &amp;quot;IEEE Standard for Scalable Coherent Interface (SCI),&amp;quot; IEEE Std 1596-1992, 1993. doi: 10.1109/IEEESTD.1993.120366&lt;br /&gt;
URL: http://ieeexplore.ieee.org.proxied.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=347683&amp;amp;isnumber=8049 &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;2foot&amp;quot;&amp;gt;[[#2body|2.]]&amp;lt;/span&amp;gt; [http://www.scizzl.com/HowSCIcohWorks.html How SCI Coherence Works] &amp;lt;br&amp;gt;&lt;/div&gt;</summary>
		<author><name>Eapotter</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch11_BB_EP&amp;diff=45135</id>
		<title>CSC/ECE 506 Spring 2011/ch11 BB EP</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch11_BB_EP&amp;diff=45135"/>
		<updated>2011-04-19T00:18:03Z</updated>

		<summary type="html">&lt;p&gt;Eapotter: /* States of the Typical Set */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;__TOC__&lt;br /&gt;
&lt;br /&gt;
== History ==&lt;br /&gt;
&lt;br /&gt;
== Background ==&lt;br /&gt;
The Scalable Coherent Interface (SCI) cache coherence protocol defines a set of states for memory, a set of states for cache lines, and a set of actions that transition between these states when processors access memory.&lt;br /&gt;
&lt;br /&gt;
Three sets of these attributes are defined for minimal, typical, and full applications:&lt;br /&gt;
&lt;br /&gt;
* '''Minimal''' - for ‘trivial but correct’ applications that require the presence of the memory in only one cache line.  It does not enable read-sharing, and is appropriate for small multi-processors or applications that don’t require significant sharing.&lt;br /&gt;
&lt;br /&gt;
* '''Typical''' - enables read-sharing of a memory location and adds provisions for efficiency, such as DMA transfers, local caching of data, and error recovery.  This set adds a stable memory state (FRESH) and multiple cache states.  It is the focus of the rest of this article.&lt;br /&gt;
&lt;br /&gt;
* '''Full''' - adds support for pair-wise sharing, QOLB lock bit synchronization, and cleansing and washing of cache lines.&lt;br /&gt;
&lt;br /&gt;
== State Diagrams ==&lt;br /&gt;
=== Memory States ===&lt;br /&gt;
The memory states define the state of the memory block from the perspective of the home directory.  This 2-bit field is maintained in the memory tag by the home directory along with the pointer to the head of the sharing list (forwId).  This simple state model includes three stable states and one semi-stable state.&lt;br /&gt;
&lt;br /&gt;
{|border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; cellpadding=&amp;quot;2&amp;quot;&lt;br /&gt;
!colspan=&amp;quot;5&amp;quot;| Table 1: Memory States &lt;br /&gt;
|----&lt;br /&gt;
!State&lt;br /&gt;
!Description&lt;br /&gt;
!Minimal&lt;br /&gt;
!Typical&lt;br /&gt;
!Full&lt;br /&gt;
|----&lt;br /&gt;
|HOME&lt;br /&gt;
|no sharing list&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|FRESH&lt;br /&gt;
|sharing-list copy is the same as memory&lt;br /&gt;
|&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|GONE&lt;br /&gt;
|sharing-list copy may be different from memory&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|WASH*&lt;br /&gt;
|transitional state (GONE to FRESH)&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=== Cache States ===&lt;br /&gt;
The cache-line states are maintained by each processor's cache-coherency controller.  This 7-bit field is stored in each cache line in the sharing list, along with the pointer to the next sharing-list node (forwId) and the previous sharing-list node (backId).  Seven bits allow up to 128 possible values, and SCI defines twenty-nine stable states for use in the minimal, typical, and full sets.&lt;br /&gt;
&lt;br /&gt;
Stable states are those cache states that exist when a memory transaction is not in progress.  Their names are derived from a combination of the position of the node in the sharing list, such as ONLY, HEAD, MID, and TAIL, and the state of the data for that node, such as FRESH, CLEAN, DIRTY, VALID, or STALE.&lt;br /&gt;
&lt;br /&gt;
====States of the Typical Set====&lt;br /&gt;
&lt;br /&gt;
Following are the states defined for the Typical set.  (Note that other states may be supported to provide certain interoperability with nodes implementing the Full set, but are not used when all nodes support the Typical set.)&lt;br /&gt;
&lt;br /&gt;
*“ONLY_DIRTY” - only one processor has the memory block in its cache.  This block is writable, and the processor has written (or intends to write) to it.  This state is set when the processor requests the block with read/write privileges and no other processor currently caches the block.&lt;br /&gt;
*“ONLY_FRESH” - only one processor has the memory block in its cache.  This block is writable, but the processor has not written to it.  This state is set when the processor requests the block with read privileges and no other processor currently caches the block.&lt;br /&gt;
*“HEAD_DIRTY” - more than one processor has the memory block in its cache.  This block is writable, and the processor has written (or intends to write) to it.  This state is set when the processor requests the block with read/write privileges and another processor already caches the block.&lt;br /&gt;
*“HEAD_FRESH” - more than one processor has the memory block in its cache.  This block is writable, but the processor has not written to it.  This state is set when the processor requests the block with read privileges and another processor already caches the block.&lt;br /&gt;
*“MID_VALID” - more than two processors have the memory block in their caches, and it is readable.  The processor cache with this state is neither the Head nor the Tail of the sharing list.&lt;br /&gt;
*“TAIL_VALID” - two or more processors have the memory block in their caches, and it is readable.  The processor cache with this state is the Tail of the sharing list.&lt;br /&gt;
&lt;br /&gt;
Below is a state diagram that depicts these states:&lt;br /&gt;
&lt;br /&gt;
[[Image:cache_states.png|center]]&lt;br /&gt;
&lt;br /&gt;
The request sub-actions of transactions that expect a response from the directory include 17 variations of reads, writes, and locks based on the number of blocks requested and the coherency requirements.  For convenience, the notation on the edges of the above graph is derived from the terms used in the SCI specification to describe the memory transactions. &lt;br /&gt;
&lt;br /&gt;
The first value indicates the request to the directory by a node for memory access:&lt;br /&gt;
*“Fetch R” - indicates a request for a memory block with read privileges.&lt;br /&gt;
*“Fetch RW” - indicates a request for a memory block with read/write privileges.&lt;br /&gt;
*“DATA_MODIFY” - indicates that a processor (the HEAD or the ONLY processor caching the block) writes to the cache line.&lt;br /&gt;
&lt;br /&gt;
The value in parentheses indicates the memory state at the time of the request.&lt;br /&gt;
&lt;br /&gt;
== Race Conditions ==&lt;br /&gt;
In a distributed shared memory system with caching, the emergence of race conditions is extremely likely.  This is mainly due to the lack of a bus for serialization of actions.  It is further compounded by the problem of network errors and congestion.&lt;br /&gt;
&lt;br /&gt;
The early invalidation case from Section 11.4 in Solihin is an excellent example of a race condition that can arise in a distributed system.  Recall the diagram from the text, and the cache coherence actions, as shown below.&lt;br /&gt;
&lt;br /&gt;
[[Image:EarlyInValidationRace.png|400px|center]]&lt;br /&gt;
The circled actions are as follows:&lt;br /&gt;
# ''A'' sends a read request to ''Home''.&lt;br /&gt;
# ''Home'' replies with data (but the message gets delayed).&lt;br /&gt;
# ''B'' sends a write request to ''Home''.&lt;br /&gt;
# ''Home'' sends an invalidation to ''A'', and it arrives before the data reply (ReplyD).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The SCI protocol handles this race condition, as well as many others.  The following sections discuss how the design of the SCI protocol prevents it.&lt;br /&gt;
&lt;br /&gt;
=== Prevention in the SCI Protocol ===&lt;br /&gt;
Race conditions are almost non-existent in the SCI protocol, due primarily to the protocol's design.  The following sections describe the mechanisms responsible and then show what happens when the Early Invalidation condition arises under SCI.&lt;br /&gt;
==== Atomic Transactions ====&lt;br /&gt;
SCI's primary method for preventing race conditions is the use of atomic transactions.  A transaction is defined as a set of sub-actions necessary to complete some requested action, such as querying memory for data or invalidating a node's cache.  This usually takes the form of a request-response pair, where the initiating node sends a request to another node and waits for the response.  The transaction is not complete until the response returns.  Suppose, for example, that node ''A'' is in the middle of a transaction with node ''B''.  Node ''C'' then tries to make a request of node ''A''.  Node ''A'' will respond to node ''C'' that it is busy, telling node ''C'' to try again, as shown in the following diagram.&lt;br /&gt;
[[Image:AtomicBusy.png|center]]&lt;br /&gt;
Thus, in the Early Invalidation race, node ''A'' would be in the middle of a transaction, having requested a block of data from memory and not yet received a response, which would prevent node ''B'' from invalidating it.  Note, however, that nodes do not stay in the busy state indefinitely, since doing so could lead to deadlock.  All transactions have timeouts that eventually cause the transaction to fail, moving the node out of the busy state so that it can respond to requests again.&lt;br /&gt;
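&lt;br /&gt;
The busy-and-retry behaviour can be sketched in a few lines of Python.  This is only an illustrative model under simplifying assumptions: the class and function names (Node, send_with_retry, on_busy) are hypothetical and do not come from the SCI standard, and the timeout is represented by an explicit call rather than a real timer.&lt;br /&gt;

```python
# Hypothetical model of the busy/retry behaviour; names are illustrative,
# not taken from the SCI standard.
class Node:
    def __init__(self, name):
        self.name = name
        self.in_transaction = False  # True between sending a request and getting its response

    def begin_transaction(self):
        self.in_transaction = True

    def end_transaction(self):
        # Called when the response arrives or the transaction times out.
        self.in_transaction = False

    def handle_request(self, sender):
        # A node in the middle of a transaction refuses new requests.
        if self.in_transaction:
            return "BUSY"
        return "ACCEPTED"


def send_with_retry(target, sender, max_attempts, on_busy=None):
    """Retry a request until it is accepted or max_attempts is exhausted."""
    for attempt in range(1, max_attempts + 1):
        if target.handle_request(sender) == "ACCEPTED":
            return attempt
        if on_busy is not None:
            on_busy(attempt)  # stand-in for waiting before the next attempt
    return None


a = Node("A")
c = Node("C")
a.begin_transaction()  # A is waiting on a response from B


def response_arrives(attempt):
    # Pretend B's response reaches A just after C's second attempt.
    if attempt == 2:
        a.end_transaction()


attempts_needed = send_with_retry(a, c, max_attempts=5, on_busy=response_arrives)
print(attempts_needed)
```

Node ''C'' is turned away while ''A'' is busy and succeeds once ''A'''s transaction has ended; nothing is queued at ''A'', so the retry is entirely the requester's responsibility in this model.&lt;br /&gt;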
&lt;br /&gt;
==== Head Node ====&lt;br /&gt;
Another method for preventing race conditions is to have the ''Head'' node of a sharing list perform many of the coherence actions.  As a result, only one node is performing actions such as writes and invalidations of other sharers.  Since only one node is performing these actions, the possibility of concurrent actions is decreased.  If another node wants to write, for instance, it must become the Head Node of the sharing list for the cache line to which it wants to write, displacing the current ''Head'' node as the node that can write.  All state changes in the sharing list occur at the behest of the ''Head'' node, with the exception of nodes deleting themselves from the list.&lt;br /&gt;
&lt;br /&gt;
When a block is simply shared read-only, as when the memory is in the ''FRESH'' state, the ''Head'' node is somewhat irrelevant.  All the sharing nodes have their own cached copy of the line.  If any node wants to write, including the ''Head'' node, it must perform an additional action in order to do so.  Likewise, when a sharing list shares a line in the dirty, or ''GONE'', memory state, all the sharing nodes stay in the list with their cached line as long as the ''Head'' node has not written to the line.  Once the ''Head'' node wants to write, however, it can write its data immediately, but it must then invalidate all other shared copies via the forward pointers in the sharing list.&lt;br /&gt;
&lt;br /&gt;
This, however, does not protect against a race condition where the memory is ''FRESH'', the ''Head'' node wants to write, and another node also wants to join the list.  The Memory Access mechanism prevents this condition.&lt;br /&gt;
&lt;br /&gt;
==== Memory Access ====&lt;br /&gt;
The final way that the SCI protocol minimizes race conditions is by changing the directory structure.  In Solihin 11.4, there is a ''Home'' node which keeps track of the cache state and has a certain amount of knowledge about the system as a result.  In the SCI protocol, there is still a ''Home'' node of sorts, but this node is only responsible for keeping track of which node is the current ''Head''.  Such a ''Home'' node is usually the node where the memory block physically resides.  When a node that is not already in the sharing list wants to access a block of memory, it sends a request to the ''Home'' node of that memory block.  If the data is already being used, the ''Home'' node includes in its response the address of the ''Head'' node of the sharing list.  This requirement necessarily serializes the order of access, as one access request is not serviced at the ''Home'' node until the previous request is finished, and the requests are processed in FIFO order.  In addition, whichever node is the ''Head'' of the sharing list is notified by the new ''Head'' when it is demoted.  All parties involved in a potential race are thus talking to each other, preventing the race from occurring in the first place.&lt;br /&gt;
&lt;br /&gt;
Because of this requirement, the race condition described in the previous section, where the ''Head'' node is in a ''FRESH'' state and wants to write while another node wants to read, cannot occur.  Both nodes must make their request to the ''Home'' node: the ''Head'' node must request the ability to write, and the other node must request the data.  One of these requests will reach the ''Home'' node first, and the second request is then deferred to the node whose request arrived first.  Such a scenario is pictured below.&lt;br /&gt;
[[Image:MemoryAccess.png|center]]&lt;br /&gt;
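&lt;br /&gt;
The serialization at the ''Home'' node can be illustrated with a small Python sketch.  It is a deliberately simplified model, not the SCI state machine: the Home class, its receive and service_next methods, and the rule that every serviced requester becomes the new ''Head'' are assumptions made for illustration.&lt;br /&gt;

```python
from collections import deque


class Home:
    """Hypothetical Home node: stores only the memory state and a pointer to the current Head."""

    def __init__(self):
        self.state = "HOME"   # no sharing list yet
        self.head = None
        self.queue = deque()  # requests are serviced strictly in arrival order

    def receive(self, requester, want_write):
        self.queue.append((requester, want_write))

    def service_next(self):
        requester, want_write = self.queue.popleft()
        old_head = self.head
        self.head = requester  # simplification: the requester becomes the new Head
        if want_write:
            self.state = "GONE"
        elif self.state == "HOME":
            self.state = "FRESH"
        # The reply names the previous Head so the requester can contact it directly.
        return requester, old_head, self.state


home = Home()
home.state, home.head = "FRESH", "A"  # A is the Head of a FRESH sharing list
home.receive("B", want_write=False)   # B's read request arrives first
home.receive("A", want_write=True)    # A's write request arrives second
first = home.service_next()
second = home.service_next()
print(first)
print(second)
```

Because both requests pass through one FIFO queue, ''A'' learns from the reply to its write request that ''B'' was serviced first, so the two nodes resolve the ordering between themselves rather than racing.&lt;br /&gt;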
&lt;br /&gt;
==== Solihin 11.4 Resolution ====&lt;br /&gt;
The three previous sections discussed individual parts of how the SCI protocol reduces race conditions.  Putting all three of these parts together yields the following scenario which resolves the Early Invalidation race as described in the text.&lt;br /&gt;
[[Image:SolihinSCI.png|center]]&lt;br /&gt;
&lt;br /&gt;
# ''A'' sends a request to ''Home'' for access to the memory block.  It then goes into a busy state while it waits for a response.&lt;br /&gt;
# ''B'' also sends a request to ''Home'' for access to the same memory block.  ''A'''s request is received first.&lt;br /&gt;
# ''Home'' responds to ''A'' with the data, but this response gets caught in network traffic and delayed.  This response is sent before ''Home'' processes the request from ''B''.&lt;br /&gt;
# ''Home'' responds to the request from ''B'', telling it that ''A'' is the current ''Head'' node.&lt;br /&gt;
# ''B'' then sends a request to ''A'' to tell it to demote itself.&lt;br /&gt;
# ''A'' still hasn't received the response from ''Home'', so it is still in a transaction, and it tells ''B'' that it is busy.  ''B'' will have to retry the request.&lt;br /&gt;
&lt;br /&gt;
It should be noted that ''B'' will be in the ''PENDING''&amp;lt;sup&amp;gt;&amp;lt;span class = &amp;quot;plainlinks&amp;quot;&amp;gt;[[#Definitions_and_Terms|[def]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt; state, so any requests to demote it from the ''Head'' position will be delayed.  This could lead to a chain of nodes in the ''PENDING'' state, but as soon as node ''A'' receives its response from ''Home'' or times out, the requests in the chain will be accepted and processed in succession.&lt;br /&gt;
&lt;br /&gt;
=== Other Possible Race Conditions ===&lt;br /&gt;
The design of the SCI protocol prevents many race conditions from occurring, as has already been shown.  Here are two other conditions that may arise, with a discussion of how the protocol resolves the issue.&lt;br /&gt;
&lt;br /&gt;
==== Concurrent List Deletions ====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==== Simultaneous Deletion and Invalidation ====&lt;br /&gt;
Even though the ''Head'' node performs most of the coherence actions for a sharing list, individual nodes can invalidate themselves, such as when they evict the block from their cache.  The SCI standard refers to this as a &amp;quot;deletion&amp;quot; from the sharing list.  Deletion is accomplished by having the invalidating node &amp;quot;lock&amp;quot; itself and then inform its forward and back nodes that they should now point to each other.  This &amp;quot;locking&amp;quot; is essentially another ''PENDING'' state.  A problem could arise when two neighboring nodes try to delete themselves at the same time, as both would be locked and neither would respond to the other's message.  In this case, the protocol specifies that the node closest to the tail takes priority.  The deadlock or race condition is thus averted by ordering the deletions from the tail towards the head.&lt;br /&gt;
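&lt;br /&gt;
The tail-first ordering can be sketched on a doubly linked list in Python.  This is a sequential illustration only, assuming hypothetical names (Sharer, link, unlink, delete_concurrently); a real implementation exchanges messages between nodes, whereas here the priority rule is modelled by sorting the leaving nodes so that the one nearest the tail is unlinked first.&lt;br /&gt;

```python
class Sharer:
    """Hypothetical sharing-list entry with forward (toward tail) and back (toward head) pointers."""

    def __init__(self, name):
        self.name = name
        self.forw = None
        self.back = None
        self.locked = False


def link(names):
    # Build a sharing list; the first name is the Head, the last is the Tail.
    nodes = [Sharer(n) for n in names]
    for left, right in zip(nodes, nodes[1:]):
        left.forw, right.back = right, left
    return nodes


def unlink(node):
    # Neighbours are told to point at each other, bypassing the deleted node.
    if node.back is not None:
        node.back.forw = node.forw
    if node.forw is not None:
        node.forw.back = node.back
    node.forw = node.back = None
    node.locked = False


def delete_concurrently(nodes, leaving):
    """Resolve simultaneous deletions by letting the node nearest the tail go first."""
    for node in leaving:
        node.locked = True
    order = sorted(leaving, key=nodes.index, reverse=True)  # tail-most first
    for node in order:
        unlink(node)
    return [n.name for n in order]


def walk(head):
    names = []
    while head is not None:
        names.append(head.name)
        head = head.forw
    return names


nodes = link(["H", "M1", "M2", "T"])
order = delete_concurrently(nodes, [nodes[1], nodes[2]])  # two neighbours leave at once
remaining = walk(nodes[0])
print(order, remaining)
```

With a fixed priority, one of the two locked neighbours always makes progress, and the node nearer the head then sees an already repaired forward pointer when its own deletion proceeds.&lt;br /&gt;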
&lt;br /&gt;
== Definitions and Terms ==&lt;br /&gt;
* ''Head Node'' - The node at the beginning of the sharing list&lt;br /&gt;
* ''Home Node'' - The node responsible for keeping track of the ''Head'' node for a sharing list for a memory block, usually the node at which the memory block resides.&lt;br /&gt;
* ''PENDING state'' - The intermediate state for a node while it is attempting to become the ''Head'' node in the sharing list.  This includes the transaction involved in querying the ''Home'' node and the transaction involved in demoting the current ''Head'' node.&lt;br /&gt;
* ''SCI'' - Scalable Coherent Interface&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&amp;lt;span id=&amp;quot;1foot&amp;quot;&amp;gt;[[#1body|1.]]&amp;lt;/span&amp;gt; &amp;quot;IEEE Standard for Scalable Coherent Interface (SCI),&amp;quot; IEEE Std 1596-1992, 1993. doi: 10.1109/IEEESTD.1993.120366&lt;br /&gt;
URL: http://ieeexplore.ieee.org.proxied.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=347683&amp;amp;isnumber=8049 &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;2foot&amp;quot;&amp;gt;[[#2body|2.]]&amp;lt;/span&amp;gt; [http://www.scizzl.com/HowSCIcohWorks.html How SCI Coherence Works] &amp;lt;br&amp;gt;&lt;/div&gt;</summary>
		<author><name>Eapotter</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch11_BB_EP&amp;diff=45134</id>
		<title>CSC/ECE 506 Spring 2011/ch11 BB EP</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch11_BB_EP&amp;diff=45134"/>
		<updated>2011-04-19T00:16:37Z</updated>

		<summary type="html">&lt;p&gt;Eapotter: /* Cache States */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;__TOC__&lt;br /&gt;
&lt;br /&gt;
== History ==&lt;br /&gt;
&lt;br /&gt;
== Background ==&lt;br /&gt;
The Scalable Coherent Interface (SCI) cache coherence protocol defines a set of states for memory, a set of states for cache lines, and a set of actions that transition between these states when processors access memory.&lt;br /&gt;
&lt;br /&gt;
Three sets of these attributes are defined for minimal, typical, and full applications:&lt;br /&gt;
&lt;br /&gt;
* '''Minimal''' - for ‘trivial but correct’ applications that require the presence of the memory in only one cache line.  It does not enable read-sharing, and is appropriate for small multi-processors or applications that don’t require significant sharing.&lt;br /&gt;
&lt;br /&gt;
* '''Typical''' - enables read-sharing of a memory location and adds provisions for efficiency, such as DMA transfers, local caching of data, and error recovery.  This set adds a stable memory state (FRESH) and multiple cache states.  It is the focus of the rest of this article.&lt;br /&gt;
&lt;br /&gt;
* '''Full''' - adds support for pair-wise sharing, QOLB lock bit synchronization, and cleansing and washing of cache lines.&lt;br /&gt;
&lt;br /&gt;
== State Diagrams ==&lt;br /&gt;
=== Memory States ===&lt;br /&gt;
The memory states define the state of the memory block from the perspective of the home directory.  This 2-bit field is maintained in the memory tag by the home directory along with the pointer to the head of the sharing list (forwId).  This simple state model includes three stable states and one semi-stable state.&lt;br /&gt;
&lt;br /&gt;
{|border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; cellpadding=&amp;quot;2&amp;quot;&lt;br /&gt;
!colspan=&amp;quot;5&amp;quot;| Table 1: Memory States &lt;br /&gt;
|----&lt;br /&gt;
!State&lt;br /&gt;
!Description&lt;br /&gt;
!Minimal&lt;br /&gt;
!Typical&lt;br /&gt;
!Full&lt;br /&gt;
|----&lt;br /&gt;
|HOME&lt;br /&gt;
|no sharing list&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|FRESH&lt;br /&gt;
|sharing-list copy is the same as memory&lt;br /&gt;
|&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|GONE&lt;br /&gt;
|sharing-list copy may be different from memory&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|WASH*&lt;br /&gt;
|transitional state (GONE to FRESH)&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=== Cache States ===&lt;br /&gt;
The cache-line states are maintained by each processor's cache-coherency controller.  This 7-bit field is stored in each cache line in the sharing list, along with the pointer to the next sharing-list node (forwId) and the previous sharing-list node (backId).  Seven bits allow up to 128 possible values, and SCI defines twenty-nine stable states for use in the minimal, typical, and full sets.&lt;br /&gt;
&lt;br /&gt;
Stable states are those cache states that exist when a memory transaction is not in progress.  Their names are derived from a combination of the position of the node in the sharing list, such as ONLY, HEAD, MID, and TAIL, and the state of the data for that node, such as FRESH, CLEAN, DIRTY, VALID, or STALE.&lt;br /&gt;
&lt;br /&gt;
====States of the Typical Set====&lt;br /&gt;
&lt;br /&gt;
Following are the states defined for the Typical set.  (Note that other states may be supported to provide certain interoperability with nodes implementing the Full set, but are not used when all nodes support the Typical set.)&lt;br /&gt;
&lt;br /&gt;
*“ONLY_DIRTY” - only one processor has the memory block in its cache.  This block is writable, and the processor has written (or intends to write) to it.  This state is set when the processor requests the block with read/write privileges and no other processor currently caches the block.&lt;br /&gt;
*“ONLY_FRESH” - only one processor has the memory block in its cache.  This block is writable, but the processor has not written to it.  This state is set when the processor requests the block with read privileges and no other processor currently caches the block.&lt;br /&gt;
*“HEAD_DIRTY” - more than one processor has the memory block in its cache.  This block is writable, and the processor has written (or intends to write) to it.  This state is set when the processor requests the block with read/write privileges and another processor already caches the block.&lt;br /&gt;
*“HEAD_FRESH” - more than one processor has the memory block in its cache.  This block is writable, but the processor has not written to it.  This state is set when the processor requests the block with read privileges and another processor already caches the block.&lt;br /&gt;
*“MID_VALID” - more than two processors have the memory block in their caches, and it is readable.  The processor cache with this state is neither the Head nor the Tail of the sharing list.&lt;br /&gt;
*“TAIL_VALID” - two or more processors have the memory block in their caches, and it is readable.  The processor cache with this state is the Tail of the sharing list.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:cache_states.png|center]]&lt;br /&gt;
&lt;br /&gt;
== Race Conditions ==&lt;br /&gt;
In a distributed shared memory system with caching, the emergence of race conditions is extremely likely.  This is mainly due to the lack of a bus for serialization of actions.  It is further compounded by the problem of network errors and congestion.&lt;br /&gt;
&lt;br /&gt;
The early invalidation case from Section 11.4 in Solihin is an excellent example of a race condition that can arise in a distributed system.  Recall the diagram from the text, and the cache coherence actions, as shown below.&lt;br /&gt;
&lt;br /&gt;
[[Image:EarlyInValidationRace.png|400px|center]]&lt;br /&gt;
The circled actions are as follows:&lt;br /&gt;
# ''A'' sends a read request to ''Home''.&lt;br /&gt;
# ''Home'' replies with data (but the message gets delayed).&lt;br /&gt;
# ''B'' sends a write request to ''Home''.&lt;br /&gt;
# ''Home'' sends an invalidation to ''A'', and it arrives before the data reply (ReplyD).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The SCI protocol handles this race condition, as well as many others.  The following sections discuss how the design of the SCI protocol prevents it.&lt;br /&gt;
&lt;br /&gt;
=== Prevention in the SCI Protocol ===&lt;br /&gt;
Race conditions are almost non-existent in the SCI protocol, due primarily to the protocol's design.  The following sections describe the mechanisms responsible and then show what happens when the Early Invalidation condition arises under SCI.&lt;br /&gt;
==== Atomic Transactions ====&lt;br /&gt;
SCI's primary method for preventing race conditions is the use of atomic transactions.  A transaction is defined as a set of sub-actions necessary to complete some requested action, such as querying memory for data or invalidating a node's cache.  This usually takes the form of a request-response pair, where the initiating node sends a request to another node and waits for the response.  The transaction is not complete until the response returns.  Suppose, for example, that node ''A'' is in the middle of a transaction with node ''B''.  Node ''C'' then tries to make a request of node ''A''.  Node ''A'' will respond to node ''C'' that it is busy, telling node ''C'' to try again, as shown in the following diagram.&lt;br /&gt;
[[Image:AtomicBusy.png|center]]&lt;br /&gt;
Thus, in the Early Invalidation race, node ''A'' would be in the middle of a transaction, having requested a block of data from memory and not yet received a response, which would prevent node ''B'' from invalidating it.  Note, however, that nodes do not stay in the busy state indefinitely, since doing so could lead to deadlock.  All transactions have timeouts that eventually cause the transaction to fail, moving the node out of the busy state so that it can respond to requests again.&lt;br /&gt;
&lt;br /&gt;
==== Head Node ====&lt;br /&gt;
Another method for preventing race conditions is to have the ''Head'' node of a sharing list perform many of the coherence actions.  As a result, only one node is performing actions such as writes and invalidations of other sharers.  Since only one node is performing these actions, the possibility of concurrent actions is decreased.  If another node wants to write, for instance, it must become the Head Node of the sharing list for the cache line to which it wants to write, displacing the current ''Head'' node as the node that can write.  All state changes in the sharing list occur at the behest of the ''Head'' node, with the exception of nodes deleting themselves from the list.&lt;br /&gt;
&lt;br /&gt;
When a block is simply shared read-only, as when the memory is in the ''FRESH'' state, the ''Head'' node is somewhat irrelevant.  All the sharing nodes have their own cached copy of the line.  If any node wants to write, including the ''Head'' node, it must perform an additional action in order to do so.  Likewise, when a sharing list shares a line in the dirty, or ''GONE'', memory state, all the sharing nodes stay in the list with their cached line as long as the ''Head'' node has not written to the line.  Once the ''Head'' node wants to write, however, it can write its data immediately, but it must then invalidate all other shared copies via the forward pointers in the sharing list.&lt;br /&gt;
&lt;br /&gt;
This, however, does not protect against a race condition where the memory is ''FRESH'', the ''Head'' node wants to write, and another node also wants to join the list.  The Memory Access mechanism prevents this condition.&lt;br /&gt;
&lt;br /&gt;
==== Memory Access ====&lt;br /&gt;
The final way that the SCI protocol minimizes race conditions is by changing the directory structure.  In Solihin 11.4, there is a ''Home'' node which keeps track of the cache state and has a certain amount of knowledge about the system as a result.  In the SCI protocol, there is still a ''Home'' node of sorts, but this node is only responsible for keeping track of which node is the current ''Head''.  Such a ''Home'' node is usually the node where the memory block physically resides.  When a node that is not already in the sharing list wants to access a block of memory, it sends a request to the ''Home'' node of that memory block.  If the data is already being used, the ''Home'' node includes in its response the address of the ''Head'' node of the sharing list.  This requirement necessarily serializes the order of access, as one access request is not serviced at the ''Home'' node until the previous request is finished, and the requests are processed in FIFO order.  In addition, whichever node is the ''Head'' of the sharing list is notified by the new ''Head'' when it is demoted.  All parties involved in a potential race are thus talking to each other, preventing the race from occurring in the first place.&lt;br /&gt;
&lt;br /&gt;
Because of this requirement, the race condition described in the previous section, where the ''Head'' node is in a ''FRESH'' state and wants to write while another node wants to read, cannot occur.  Both nodes must make their request to the ''Home'' node: the ''Head'' node must request the ability to write, and the other node must request the data.  One of these requests will reach the ''Home'' node first, and the second request is then deferred to the node whose request arrived first.  Such a scenario is pictured below.&lt;br /&gt;
[[Image:MemoryAccess.png|center]]&lt;br /&gt;
&lt;br /&gt;
==== Solihin 11.4 Resolution ====&lt;br /&gt;
The three previous sections discussed individual parts of how the SCI protocol reduces race conditions.  Putting all three of these parts together yields the following scenario which resolves the Early Invalidation race as described in the text.&lt;br /&gt;
[[Image:SolihinSCI.png|center]]&lt;br /&gt;
&lt;br /&gt;
# ''A'' sends a request to ''Home'' for access to the memory block.  It then goes into a busy state while it waits for a response.&lt;br /&gt;
# ''B'' also sends a request to ''Home'' for access to the same memory block.  ''A'''s request is received first.&lt;br /&gt;
# ''Home'' responds to ''A'' with the data, but this response gets caught in network traffic and delayed.  This response is sent before ''Home'' processes the request from ''B''.&lt;br /&gt;
# ''Home'' responds to the request from ''B'', telling it that ''A'' is the current ''Head'' node.&lt;br /&gt;
# ''B'' then sends a request to ''A'' to tell it to demote itself.&lt;br /&gt;
# ''A'' still hasn't received the response from ''Home'', so it is still in a transaction, and it tells ''B'' that it is busy.  ''B'' will have to retry the request.&lt;br /&gt;
&lt;br /&gt;
It should be noted that ''B'' will be in the ''PENDING''&amp;lt;sup&amp;gt;&amp;lt;span class = &amp;quot;plainlinks&amp;quot;&amp;gt;[[#Definitions_and_Terms|[def]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt; state, so any requests to demote it from the ''Head'' position will be delayed.  This could lead to a chain of nodes in the ''PENDING'' state, but as soon as node ''A'' receives its response from ''Home'' or times out, the requests in the chain will be accepted and processed in succession.&lt;br /&gt;
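&lt;br /&gt;
The busy-and-retry behaviour in the steps above can be sketched as a small simulation.  The Node class, the reply strings "BUSY" and "DONE", and the parameter failed_attempts_before_reply are all hypothetical names chosen for illustration; the real protocol exchanges packets rather than method calls.&lt;br /&gt;

```python
class Node:
    """Hypothetical cache node that rejects requests while a transaction is open."""

    def __init__(self, name):
        self.name = name
        self.busy = False       # True while waiting for a response
        self.is_head = False

    def start_request(self):
        self.busy = True

    def receive_response(self):
        self.busy = False
        self.is_head = True

    def handle_demote(self):
        """Return "BUSY" while a transaction is open, otherwise give up Head."""
        if self.busy:
            return "BUSY"
        self.is_head = False
        return "DONE"


def demote_with_retry(target, failed_attempts_before_reply, max_attempts=10):
    """Retry the demotion until the delayed reply from Home has arrived."""
    log = []
    for attempt in range(max_attempts):
        if attempt == failed_attempts_before_reply:
            target.receive_response()   # the delayed reply finally arrives
        reply = target.handle_demote()
        log.append(reply)
        if reply == "DONE":
            break
    return log


a = Node("A")
a.start_request()                       # A asks Home and goes busy
replies = demote_with_retry(a, failed_attempts_before_reply=2)
```

In this sketch ''B'' is turned away twice and succeeds on the third attempt, after ''A'' has received its data and left the busy state.&lt;br /&gt;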
&lt;br /&gt;
=== Other Possible Race Conditions ===&lt;br /&gt;
The design of the SCI protocol prevents many race conditions from occurring, as has already been shown.  Here are two other conditions that may arise, with a discussion of how the protocol resolves the issue.&lt;br /&gt;
&lt;br /&gt;
==== Concurrent List Deletions ====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==== Simultaneous Deletion and Invalidation ====&lt;br /&gt;
Even though the ''Head'' node performs most of the coherence actions for a sharing list, individual nodes are able to invalidate themselves, such as when they evict the block from their cache.  This is referred to in the SCI standard as a &amp;quot;deletion&amp;quot; from the sharing list.  Deletion is accomplished by having the invalidating node &amp;quot;lock&amp;quot; itself and then inform its forward and back nodes that they should now point to each other.  This &amp;quot;locking&amp;quot; is essentially another ''PENDING'' state.  A problem could arise, then, when two neighboring nodes try to invalidate themselves at the same time, as they would both be locked and would not respond to each other's messages.  In this case, though, the protocol specifies that the node closest to the tail takes priority.  The deadlock or race condition is thus averted by ordering the deletions from the tail towards the head.&lt;br /&gt;
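&lt;br /&gt;
The tail-first ordering can be illustrated with a doubly linked list held in two dictionaries.  This is only a sketch: the functions unlink and delete_concurrently are hypothetical, and sorting by list position stands in for the priority rule that the protocol enforces through messages between neighbours.&lt;br /&gt;

```python
def unlink(forward, backward, node):
    """Detach node so that its neighbours point to each other."""
    nxt = forward.pop(node)    # neighbour toward the Tail
    prev = backward.pop(node)  # neighbour toward the Head
    if prev is not None:
        forward[prev] = nxt
    if nxt is not None:
        backward[nxt] = prev


def delete_concurrently(order, leaving):
    """Resolve simultaneous deletions by letting the node nearest the Tail go first."""
    forward = dict(zip(order, order[1:] + [None]))
    backward = dict(zip(order, [None] + order[:-1]))
    position = {node: index for index, node in enumerate(order)}
    sequence = sorted(leaving, key=lambda node: position[node], reverse=True)
    for node in sequence:
        unlink(forward, backward, node)
    # Walk the surviving list from its Head to confirm it is still connected.
    node = next((n for n in order if n in forward and backward[n] is None), None)
    remaining = []
    while node is not None:
        remaining.append(node)
        node = forward[node]
    return sequence, remaining


sequence, remaining = delete_concurrently(["Head", "M1", "M2", "Tail"], {"M1", "M2"})
```

Here the two middle nodes leave at the same time; M2 is closer to the tail, so it is unlinked first, and the list that remains still connects the head directly to the tail.&lt;br /&gt;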
&lt;br /&gt;
== Definitions and Terms ==&lt;br /&gt;
* ''Head Node'' - The node at the beginning of the sharing list.&lt;br /&gt;
* ''Home Node'' - The node responsible for keeping track of the ''Head'' node for a sharing list for a memory block, usually the node at which the memory block resides.&lt;br /&gt;
* ''PENDING state'' - The intermediate state for a node while it is attempting to become the ''Head'' node in the sharing list.  This includes the transaction involved in querying the ''Home'' node and the transaction involved in demoting the current ''Head'' node.&lt;br /&gt;
* ''SCI'' - Scalable Coherent Interface&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&amp;lt;span id=&amp;quot;1foot&amp;quot;&amp;gt;[[#1body|1.]]&amp;lt;/span&amp;gt; &amp;quot;IEEE Standard for Scalable Coherent Interface (SCI),&amp;quot; IEEE Std 1596-1992, 1993. doi: 10.1109/IEEESTD.1993.120366&lt;br /&gt;
URL: http://ieeexplore.ieee.org.proxied.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=347683&amp;amp;isnumber=8049 &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;2foot&amp;quot;&amp;gt;[[#2body|2.]]&amp;lt;/span&amp;gt; [http://www.scizzl.com/HowSCIcohWorks.html How SCI Coherence Works] &amp;lt;br&amp;gt;&lt;/div&gt;</summary>
		<author><name>Eapotter</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch11_BB_EP&amp;diff=45133</id>
		<title>CSC/ECE 506 Spring 2011/ch11 BB EP</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch11_BB_EP&amp;diff=45133"/>
		<updated>2011-04-19T00:16:05Z</updated>

		<summary type="html">&lt;p&gt;Eapotter: /* Cache States */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;__TOC__&lt;br /&gt;
&lt;br /&gt;
== History ==&lt;br /&gt;
&lt;br /&gt;
== Background ==&lt;br /&gt;
The Scalable Coherent Interface (SCI) cache coherence protocol defines a set of states for memory, a set of states for cache lines, and a set of actions that transition between these states when processors access memory.&lt;br /&gt;
&lt;br /&gt;
Three sets of these attributes are defined for minimal, typical, and full applications:&lt;br /&gt;
&lt;br /&gt;
* '''Minimal''' - for ‘trivial but correct’ applications that require the presence of the memory in only one cache line.  It does not enable read-sharing, and is appropriate for small multi-processors or applications that don’t require significant sharing.&lt;br /&gt;
&lt;br /&gt;
* '''Typical''' - enables read-sharing of a memory location and provisions for efficiency, such as DMA transfers, local caching of data, and error recovery.  This set adds an additional stable memory state (FRESH) and multiple cache states.  This set will be the focus of this article going forward.&lt;br /&gt;
&lt;br /&gt;
* '''Full''' - adds support for pair-wise sharing, QOLB lock bit synchronization, and cleansing and washing of cache lines.&lt;br /&gt;
&lt;br /&gt;
== State Diagrams ==&lt;br /&gt;
=== Memory States ===&lt;br /&gt;
The memory states define the state of the memory block from the perspective of the home directory.  This 2-bit field is maintained in the memory tag by the home directory along with the pointer to the head of the sharing list (forwId).  This simple state model includes three stable states and one semi-stable state.&lt;br /&gt;
&lt;br /&gt;
{|border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; cellpadding=&amp;quot;2&amp;quot;&lt;br /&gt;
!colspan=&amp;quot;5&amp;quot;| Table 1: Memory States &lt;br /&gt;
|----&lt;br /&gt;
!State&lt;br /&gt;
!Description&lt;br /&gt;
!Minimal&lt;br /&gt;
!Typical&lt;br /&gt;
!Full&lt;br /&gt;
|----&lt;br /&gt;
|HOME&lt;br /&gt;
|no sharing list&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|FRESH&lt;br /&gt;
|sharing-list copy is the same as memory&lt;br /&gt;
|&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|GONE&lt;br /&gt;
|sharing-list copy may be different from memory&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|WASH*&lt;br /&gt;
|transitional state (GONE to FRESH)&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=== Cache States ===&lt;br /&gt;
The cache-line states are maintained by each processor's cache-coherency controller.  This 7-bit field is stored in each cache line in the sharing list, along with the pointer to the next sharing-list node (forwId) and the previous sharing-list node (backId).  Seven bits allow up to 128 possible values, and SCI defines twenty-nine stable states for use in the minimal, typical, and full sets.&lt;br /&gt;
&lt;br /&gt;
Stable states are those cache states that exist when a memory transaction is not in progress.  Their names are derived from a combination of the position of the node in the sharing list, such as ONLY, HEAD, MID, and TAIL, and the state of the data for that node, such as FRESH, CLEAN, DIRTY, VALID, STALE, etc.&lt;br /&gt;
&lt;br /&gt;
States of the Typical Set&lt;br /&gt;
&lt;br /&gt;
Following are the states defined for the Typical set.  (Note that other states may be supported to provide certain interoperability with nodes implementing the Full set, but are not used when all nodes support the Typical set.)&lt;br /&gt;
&lt;br /&gt;
* “ONLY_DIRTY” - only one processor has the memory block in its cache.  This block is writable, and the processor has written (or intends to write) to it.  This state is set when the processor requests the block with read/write privileges, and no other processor currently caches the block.&lt;br /&gt;
* “ONLY_FRESH” - only one processor has the memory block in its cache.  This block is writable, but the processor has not written to it.  This state is set when the processor requests the block with read privileges, and no other processor currently caches the block.&lt;br /&gt;
* “HEAD_DIRTY” - more than one processor has the memory block in its cache.  This block is writable, and the processor has written (or intends to write) to it.  This state is set when the processor requests the block with read/write privileges, and another processor already caches the block.&lt;br /&gt;
* “HEAD_FRESH” - more than one processor has the memory block in its cache.  This block is writable, but the processor has not written to it.  This state is set when the processor requests the block with read privileges, and another processor already caches the block.&lt;br /&gt;
* “MID_VALID” - more than two processors have the memory block in their caches, and it is readable.  The processor cache with this state is neither the Head nor the Tail of the sharing list.&lt;br /&gt;
* “TAIL_VALID” - more than one processor has the memory block in its cache, and it is readable.  The processor cache with this state is the Tail of the sharing list.&lt;br /&gt;
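&lt;br /&gt;
The naming scheme of the Typical set states can be made explicit by splitting each name into its list position and its data state.  The helper split_state below is a hypothetical illustration; the standard encodes these states as 7-bit values, and those encodings are not reproduced here.&lt;br /&gt;

```python
TYPICAL_STATES = [
    "ONLY_DIRTY", "ONLY_FRESH",
    "HEAD_DIRTY", "HEAD_FRESH",
    "MID_VALID", "TAIL_VALID",
]


def split_state(name):
    """Split a stable-state name into (list position, data state)."""
    position, data = name.split("_", 1)
    return position, data


# Group the data states that can occur at each position in the sharing list.
by_position = {}
for state in TYPICAL_STATES:
    position, data = split_state(state)
    by_position.setdefault(position, []).append(data)
```

Grouping the names this way shows that only the ONLY and HEAD positions carry the DIRTY or FRESH distinction, while MID and TAIL entries are simply VALID.&lt;br /&gt;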
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:cache_states.png|center]]&lt;br /&gt;
&lt;br /&gt;
== Race Conditions ==&lt;br /&gt;
In a distributed shared memory system with caching, the emergence of race conditions is extremely likely.  This is mainly due to the lack of a bus for serialization of actions.  It is further compounded by the problem of network errors and congestion.&lt;br /&gt;
&lt;br /&gt;
The early invalidation case from Section 11.4 in Solihin is an excellent example of a race condition that can arise in a distributed system.  Recall the diagram from the text, and the cache coherence actions, as shown below.&lt;br /&gt;
&lt;br /&gt;
[[Image:EarlyInValidationRace.png|400px|center]]&lt;br /&gt;
The circled actions are as follows:&lt;br /&gt;
# ''A'' sends a read request to ''Home''.&lt;br /&gt;
# ''Home'' replies with data (but the message gets delayed).&lt;br /&gt;
# ''B'' sends a write request to ''Home''.&lt;br /&gt;
# ''Home'' sends an invalidation to ''A'', and it arrives before the ReplyD.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The SCI protocol has a way to handle this race condition, as well as many others, and the following sections will discuss how the SCI protocol design can prevent this race condition.&lt;br /&gt;
&lt;br /&gt;
=== Prevention in the SCI Protocol ===&lt;br /&gt;
Race conditions are almost non-existent in the SCI protocol, due primarily to the protocol's design.  A brief discussion of how this is accomplished follows, and then a walk-through of what happens if the Early Invalidation condition arises in the SCI protocol.&lt;br /&gt;
==== Atomic Transactions ====&lt;br /&gt;
SCI's primary method for preventing race conditions is having atomic transactions.  A transaction is defined as a set of sub-actions necessary to complete some requested action, such as querying memory for data or invalidating a node's cache.  This usually takes the form of a request-response pair, where the initiating node sends a request to another node and waits for the response.  The transaction is not complete until the response returns.  Suppose, for example, that node ''A'' is in the middle of a transaction with node ''B''.  Node ''C'' then tries to make a request of node ''A''.  Node ''A'' will respond to node ''C'' that it is busy, telling node ''C'' to try again, as shown in the following diagram.&lt;br /&gt;
[[Image:AtomicBusy.png|center]]&lt;br /&gt;
Thus, in the Early Invalidation race, node ''A'' would be in the middle of a transaction, having requested a block of data from memory and not having received a response, which would prevent node ''B'' from invalidating it.  Note, however, that nodes will not stay in the busy state indefinitely.  Doing so could lead to deadlocks, so all transactions have timeouts that will eventually cause the transaction to fail, moving the node out of the busy state and making it able to respond to requests again.&lt;br /&gt;
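&lt;br /&gt;
The interaction between the busy state and the timeout can be sketched with an abstract clock.  The Node class, its timeout_ticks parameter, and the replies "BUSY" and "ACCEPTED" are assumptions made for this illustration; the actual timeout values and response codes are defined by the standard and are not given here.&lt;br /&gt;

```python
class Node:
    """Hypothetical node whose open transaction expires after a fixed number of ticks."""

    def __init__(self, timeout_ticks):
        self.timeout_ticks = timeout_ticks
        self.deadline = None      # tick at which the open transaction is abandoned

    def begin_transaction(self, now):
        self.deadline = now + self.timeout_ticks

    def finish_transaction(self):
        self.deadline = None

    def handle_request(self, now):
        """Answer "BUSY" while a transaction is open and has not timed out."""
        if self.deadline is not None:
            ticks_left = max(0, self.deadline - now)
            if ticks_left == 0:
                self.deadline = None   # timeout: the transaction fails
        return "BUSY" if self.deadline is not None else "ACCEPTED"


node_a = Node(timeout_ticks=5)
node_a.begin_transaction(now=0)
during = node_a.handle_request(now=3)   # still inside the transaction
after = node_a.handle_request(now=5)    # the timeout has expired
```

A request that arrives during the transaction is turned away, while one that arrives after the timeout is accepted even though no response ever came back, which is what keeps a lost response from blocking the node forever.&lt;br /&gt;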
&lt;br /&gt;
==== Head Node ====&lt;br /&gt;
Another method for preventing race conditions is to have the ''Head'' node of a sharing list perform many of the coherence actions.  As a result, only one node is performing actions such as writes and invalidations of other sharers.  Since only one node is performing these actions, the possibility of concurrent actions is decreased.  If another node wants to write, for instance, it must become the Head Node of the sharing list for the cache line to which it wants to write, displacing the current ''Head'' node as the node that can write.  All state changes in the sharing list occur at the behest of the ''Head'' node, with the exception of nodes deleting themselves from the list.&lt;br /&gt;
&lt;br /&gt;
When simply sharing a read-only copy of a line, as when the memory is in the ''FRESH'' state, the ''Head'' node is somewhat irrelevant.  All the sharing nodes have their own cached value of the cache line.  If any node wants to write, including the ''Head'' node, it must perform an additional action in order to do so.  Likewise, when a sharing list is sharing a line of data in the dirty, or ''GONE'', memory state, as long as the ''Head'' node has not written to the line, all the sharing nodes stay in the list with their cached line.  However, at the point the ''Head'' node wants to write, it can write its data immediately, but it then must invalidate all other shared copies, via the forward pointers in the sharing list.&lt;br /&gt;
&lt;br /&gt;
This, however, does not protect against a race condition where the memory is ''FRESH'', the ''Head'' node wants to write, and another node also wants to join the list.  The Memory Access mechanism prevents this condition.&lt;br /&gt;
&lt;br /&gt;
==== Memory Access ====&lt;br /&gt;
The final way that the SCI protocol minimizes race conditions is by changing the directory structure.  In Solihin 11.4, there is a ''Home'' node which keeps track of the cache state and has a certain amount of knowledge about the system as a result.  In the SCI protocol, there is still a ''Home'' node of sorts, but this node is only responsible for keeping track of the current ''Head'' of the sharing list.  Such a ''Home'' node is usually the node where the memory block physically resides.  When a node that is not already in the sharing list for a block wants to access that block, it sends a request to the block's ''Home'' node.  If the data is already being used, the ''Home'' node includes in its response the address of the ''Head'' node of the sharing list.  This requirement serializes the order of access: an access request is not serviced at the ''Home'' node until the previous request is finished, and the requests are processed in FIFO order.  In addition, whichever node is the ''Head'' of the sharing list is notified by the new ''Head'' when it is demoted.  All parties involved in a potential race are thus talking to each other, preventing the race from occurring in the first place.&lt;br /&gt;
&lt;br /&gt;
Because of this requirement, the projected race condition from the previous section, where the ''Head'' node is in a ''FRESH'' state and wants to write while another node also wants to read, cannot occur.  Both nodes must make their request to the ''Home'' node: the ''Head'' node must request the ability to write, and the other node must request the data.  One of these requests will reach the ''Home'' node first, and the second request is then deferred to the node whose request arrived first.  Such a scenario is pictured below.&lt;br /&gt;
[[Image:MemoryAccess.png|center]]&lt;br /&gt;
&lt;br /&gt;
==== Solihin 11.4 Resolution ====&lt;br /&gt;
The three previous sections discussed individual parts of how the SCI protocol reduces race conditions.  Putting all three of these parts together yields the following scenario which resolves the Early Invalidation race as described in the text.&lt;br /&gt;
[[Image:SolihinSCI.png|center]]&lt;br /&gt;
&lt;br /&gt;
# ''A'' sends a request to ''Home'' for access to the memory block.  It then goes into a busy state while it waits for a response.&lt;br /&gt;
# ''B'' also sends a request to ''Home'' for access to the same memory block.  ''A'''s request is received first.&lt;br /&gt;
# ''Home'' responds to ''A'' with the data, but this response gets caught in network traffic and delayed.  This response is sent before ''Home'' processes the request from ''B''.&lt;br /&gt;
# ''Home'' responds to the request from ''B'', telling it that ''A'' is the current ''Head'' node.&lt;br /&gt;
# ''B'' then sends a request to ''A'' to tell it to demote itself.&lt;br /&gt;
# ''A'' still hasn't received the response from ''Home'', so it is still in a transaction, and it tells ''B'' that it is busy.  ''B'' will have to retry the request.&lt;br /&gt;
&lt;br /&gt;
It should be noted that ''B'' will be in the ''PENDING''&amp;lt;sup&amp;gt;&amp;lt;span class = &amp;quot;plainlinks&amp;quot;&amp;gt;[[#Definitions_and_Terms|[def]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt; state, so any requests to demote it from the ''Head'' position will be delayed.  This could lead to a chain of nodes in the ''PENDING'' state, but as soon as node ''A'' receives its response from ''Home'' or times out, the requests in the chain will be accepted and processed in succession.&lt;br /&gt;
&lt;br /&gt;
=== Other Possible Race Conditions ===&lt;br /&gt;
The design of the SCI protocol prevents many race conditions from occurring, as has already been shown.  Here are two other conditions that may arise, with a discussion of how the protocol resolves the issue.&lt;br /&gt;
&lt;br /&gt;
==== Concurrent List Deletions ====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==== Simultaneous Deletion and Invalidation ====&lt;br /&gt;
Even though the ''Head'' node performs most of the coherence actions for a sharing list, individual nodes are able to invalidate themselves, such as when they evict the block from their cache.  This is referred to in the SCI standard as a &amp;quot;deletion&amp;quot; from the sharing list.  Deletion is accomplished by having the invalidating node &amp;quot;lock&amp;quot; itself and then inform its forward and back nodes that they should now point to each other.  This &amp;quot;locking&amp;quot; is essentially another ''PENDING'' state.  A problem could arise, then, when two neighboring nodes try to invalidate themselves at the same time, as they would both be locked and would not respond to each other's messages.  In this case, though, the protocol specifies that the node closest to the tail takes priority.  The deadlock or race condition is thus averted by ordering the deletions from the tail towards the head.&lt;br /&gt;
&lt;br /&gt;
== Definitions and Terms ==&lt;br /&gt;
* ''Head Node'' - The node at the beginning of the sharing list.&lt;br /&gt;
* ''Home Node'' - The node responsible for keeping track of the ''Head'' node for a sharing list for a memory block, usually the node at which the memory block resides.&lt;br /&gt;
* ''PENDING state'' - The intermediate state for a node while it is attempting to become the ''Head'' node in the sharing list.  This includes the transaction involved in querying the ''Home'' node and the transaction involved in demoting the current ''Head'' node.&lt;br /&gt;
* ''SCI'' - Scalable Coherent Interface&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&amp;lt;span id=&amp;quot;1foot&amp;quot;&amp;gt;[[#1body|1.]]&amp;lt;/span&amp;gt; &amp;quot;IEEE Standard for Scalable Coherent Interface (SCI),&amp;quot; IEEE Std 1596-1992, 1993. doi: 10.1109/IEEESTD.1993.120366&lt;br /&gt;
URL: http://ieeexplore.ieee.org.proxied.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=347683&amp;amp;isnumber=8049 &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;2foot&amp;quot;&amp;gt;[[#2body|2.]]&amp;lt;/span&amp;gt; [http://www.scizzl.com/HowSCIcohWorks.html How SCI Coherence Works] &amp;lt;br&amp;gt;&lt;/div&gt;</summary>
		<author><name>Eapotter</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch11_BB_EP&amp;diff=45132</id>
		<title>CSC/ECE 506 Spring 2011/ch11 BB EP</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch11_BB_EP&amp;diff=45132"/>
		<updated>2011-04-19T00:15:24Z</updated>

		<summary type="html">&lt;p&gt;Eapotter: /* Cache States */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;__TOC__&lt;br /&gt;
&lt;br /&gt;
== History ==&lt;br /&gt;
&lt;br /&gt;
== Background ==&lt;br /&gt;
The Scalable Coherent Interface (SCI) cache coherence protocol defines a set of states for memory, a set of states for cache lines, and a set of actions that transition between these states when processors access memory.&lt;br /&gt;
&lt;br /&gt;
Three sets of these attributes are defined for minimal, typical, and full applications:&lt;br /&gt;
&lt;br /&gt;
* '''Minimal''' - for ‘trivial but correct’ applications that require the presence of the memory in only one cache line.  It does not enable read-sharing, and is appropriate for small multi-processors or applications that don’t require significant sharing.&lt;br /&gt;
&lt;br /&gt;
* '''Typical''' - enables read-sharing of a memory location and provisions for efficiency, such as DMA transfers, local caching of data, and error recovery.  This set adds an additional stable memory state (FRESH) and multiple cache states.  This set will be the focus of this article going forward.&lt;br /&gt;
&lt;br /&gt;
* '''Full''' - adds support for pair-wise sharing, QOLB lock bit synchronization, and cleansing and washing of cache lines.&lt;br /&gt;
&lt;br /&gt;
== State Diagrams ==&lt;br /&gt;
=== Memory States ===&lt;br /&gt;
The memory states define the state of the memory block from the perspective of the home directory.  This 2-bit field is maintained in the memory tag by the home directory along with the pointer to the head of the sharing list (forwId).  This simple state model includes three stable states and one semi-stable state.&lt;br /&gt;
&lt;br /&gt;
{|border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; cellpadding=&amp;quot;2&amp;quot;&lt;br /&gt;
!colspan=&amp;quot;5&amp;quot;| Table 1: Memory States &lt;br /&gt;
|----&lt;br /&gt;
!State&lt;br /&gt;
!Description&lt;br /&gt;
!Minimal&lt;br /&gt;
!Typical&lt;br /&gt;
!Full&lt;br /&gt;
|----&lt;br /&gt;
|HOME&lt;br /&gt;
|no sharing list&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|FRESH&lt;br /&gt;
|sharing-list copy is the same as memory&lt;br /&gt;
|&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|GONE&lt;br /&gt;
|sharing-list copy may be different from memory&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|WASH*&lt;br /&gt;
|transitional state (GONE to FRESH)&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=== Cache States ===&lt;br /&gt;
The cache-line states are maintained by each processor's cache-coherency controller.  This 7-bit field is stored in each cache line in the sharing list, along with the pointer to the next sharing-list node (forwId) and the previous sharing-list node (backId).  Seven bits allow up to 128 possible values, and SCI defines twenty-nine stable states for use in the minimal, typical, and full sets.&lt;br /&gt;
&lt;br /&gt;
Stable states are those cache states that exist when a memory transaction is not in progress.  Their names are derived from a combination of the position of the node in the sharing list, such as ONLY, HEAD, MID, and TAIL, and the state of the data for that node, such as FRESH, CLEAN, DIRTY, VALID, STALE, etc.&lt;br /&gt;
&lt;br /&gt;
States of the Typical Set&lt;br /&gt;
Following are the states defined for the Typical set.  (Note that other states may be supported to provide certain interoperability with nodes implementing the Full set, but are not used when all nodes support the Typical set.)&lt;br /&gt;
&lt;br /&gt;
* “ONLY_DIRTY” - only one processor has the memory block in its cache.  This block is writable, and the processor has written (or intends to write) to it.  This state is set when the processor requests the block with read/write privileges, and no other processor currently caches the block.&lt;br /&gt;
* “ONLY_FRESH” - only one processor has the memory block in its cache.  This block is writable, but the processor has not written to it.  This state is set when the processor requests the block with read privileges, and no other processor currently caches the block.&lt;br /&gt;
* “HEAD_DIRTY” - more than one processor has the memory block in its cache.  This block is writable, and the processor has written (or intends to write) to it.  This state is set when the processor requests the block with read/write privileges, and another processor already caches the block.&lt;br /&gt;
* “HEAD_FRESH” - more than one processor has the memory block in its cache.  This block is writable, but the processor has not written to it.  This state is set when the processor requests the block with read privileges, and another processor already caches the block.&lt;br /&gt;
* “MID_VALID” - more than two processors have the memory block in their caches, and it is readable.  The processor cache with this state is neither the Head nor the Tail of the sharing list.&lt;br /&gt;
* “TAIL_VALID” - more than one processor has the memory block in its cache, and it is readable.  The processor cache with this state is the Tail of the sharing list.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:cache_states.png|center]]&lt;br /&gt;
&lt;br /&gt;
== Race Conditions ==&lt;br /&gt;
In a distributed shared memory system with caching, the emergence of race conditions is extremely likely.  This is mainly due to the lack of a bus for serialization of actions.  It is further compounded by the problem of network errors and congestion.&lt;br /&gt;
&lt;br /&gt;
The early invalidation case from Section 11.4 in Solihin is an excellent example of a race condition that can arise in a distributed system.  Recall the diagram from the text, and the cache coherence actions, as shown below.&lt;br /&gt;
&lt;br /&gt;
[[Image:EarlyInValidationRace.png|400px|center]]&lt;br /&gt;
The circled actions are as follows:&lt;br /&gt;
# ''A'' sends a read request to ''Home''.&lt;br /&gt;
# ''Home'' replies with data (but the message gets delayed).&lt;br /&gt;
# ''B'' sends a write request to ''Home''.&lt;br /&gt;
# ''Home'' sends an invalidation to ''A'', and it arrives before the ReplyD.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The SCI protocol has a way to handle this race condition, as well as many others, and the following sections will discuss how the SCI protocol design can prevent this race condition.&lt;br /&gt;
&lt;br /&gt;
=== Prevention in the SCI Protocol ===&lt;br /&gt;
Race conditions are almost non-existent in the SCI protocol, due primarily to the protocol's design.  A brief discussion of how this is accomplished follows, and then a walk-through of what happens if the Early Invalidation condition arises in the SCI protocol.&lt;br /&gt;
==== Atomic Transactions ====&lt;br /&gt;
SCI's primary method for preventing race conditions is having atomic transactions.  A transaction is defined as a set of sub-actions necessary to complete some requested action, such as querying memory for data or invalidating a node's cache.  This usually takes the form of a request-response pair, where the initiating node sends a request to another node and waits for the response.  The transaction is not complete until the response returns.  Suppose, for example, that node ''A'' is in the middle of a transaction with node ''B''.  Node ''C'' then tries to make a request of node ''A''.  Node ''A'' will respond to node ''C'' that it is busy, telling node ''C'' to try again, as shown in the following diagram.&lt;br /&gt;
[[Image:AtomicBusy.png|center]]&lt;br /&gt;
Thus, in the Early Invalidation race, node ''A'' would be in the middle of a transaction, having requested a block of data from memory and not having received a response, which would prevent node ''B'' from invalidating it.  Note, however, that nodes will not stay in the busy state indefinitely.  Doing so could lead to deadlocks, so all transactions have timeouts that will eventually cause the transaction to fail, moving the node out of the busy state and making it able to respond to requests again.&lt;br /&gt;
&lt;br /&gt;
==== Head Node ====&lt;br /&gt;
Another method for preventing race conditions is to have the ''Head'' node of a sharing list perform many of the coherence actions.  As a result, only one node is performing actions such as writes and invalidations of other sharers.  Since only one node is performing these actions, the possibility of concurrent actions is decreased.  If another node wants to write, for instance, it must become the Head Node of the sharing list for the cache line to which it wants to write, displacing the current ''Head'' node as the node that can write.  All state changes in the sharing list occur at the behest of the ''Head'' node, with the exception of nodes deleting themselves from the list.&lt;br /&gt;
&lt;br /&gt;
When simply sharing a read-only copy of a line, as when the memory is in the ''FRESH'' state, the ''Head'' node is somewhat irrelevant.  All the sharing nodes have their own cached value of the cache line.  If any node wants to write, including the ''Head'' node, it must perform an additional action in order to do so.  Likewise, when a sharing list is sharing a line of data in the dirty, or ''GONE'', memory state, as long as the ''Head'' node has not written to the line, all the sharing nodes stay in the list with their cached line.  However, at the point the ''Head'' node wants to write, it can write its data immediately, but it then must invalidate all other shared copies, via the forward pointers in the sharing list.&lt;br /&gt;
&lt;br /&gt;
This, however, does not protect against a race condition where the memory is ''FRESH'', the ''Head'' node wants to write, and another node also wants to join the list.  The Memory Access mechanism prevents this condition.&lt;br /&gt;
&lt;br /&gt;
==== Memory Access ====&lt;br /&gt;
The final way that the SCI protocol minimizes race conditions is by changing the directory structure.  In Solihin 11.4, there is a ''Home'' node which keeps track of the cache state and has a certain amount of knowledge about the system as a result.  In the SCI protocol, there is still a ''Home'' node of sorts, but this node is only responsible for keeping track of the current ''Head'' of the sharing list.  Such a ''Home'' node is usually the node where the memory block physically resides.  When a node that is not already in the sharing list for a block wants to access that block, it sends a request to the block's ''Home'' node.  If the data is already being used, the ''Home'' node includes in its response the address of the ''Head'' node of the sharing list.  This requirement serializes the order of access: an access request is not serviced at the ''Home'' node until the previous request is finished, and the requests are processed in FIFO order.  In addition, whichever node is the ''Head'' of the sharing list is notified by the new ''Head'' when it is demoted.  All parties involved in a potential race are thus talking to each other, preventing the race from occurring in the first place.&lt;br /&gt;
&lt;br /&gt;
Because of this requirement, the projected race condition from the previous section, where the ''Head'' node is in a ''FRESH'' state and wants to write while another node also wants to read, cannot occur.  Both nodes must make their request to the ''Home'' node: the ''Head'' node must request the ability to write, and the other node must request the data.  One of these requests will reach the ''Home'' node first, and the second request is then deferred to the node whose request arrived first.  Such a scenario is pictured below.&lt;br /&gt;
[[Image:MemoryAccess.png|center]]&lt;br /&gt;
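&lt;br /&gt;
The serialization described above can be sketched in a few lines of Python.  This is a minimal illustration rather than the mechanism defined by the standard: the class name HomeNode and the methods submit and service_next are hypothetical, and the sketch assumes that every serviced request makes the requester the new ''Head'' and hands back the identity of the previous ''Head''.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import deque


class HomeNode:
    """Hypothetical model of a Home node that only tracks the current Head."""

    def __init__(self):
        self.head = None        # id of the current Head node, or None
        self.queue = deque()    # pending requests, oldest first

    def submit(self, node_id):
        """Queue a request; arrival order is the service order (FIFO)."""
        self.queue.append(node_id)

    def service_next(self):
        """Service one request and return (requester, previous_head)."""
        requester = self.queue.popleft()
        previous_head = self.head
        self.head = requester   # assumption: the requester becomes the new Head
        return requester, previous_head


home = HomeNode()
home.submit("A")
home.submit("B")
first = home.service_next()     # A is serviced first and finds no previous Head
second = home.service_next()    # B is told that A is the current Head
```
&lt;br /&gt;
Because the second requester is handed the identity of the first, it has to contact that node directly, which is what keeps the racing parties talking to each other.&lt;br /&gt;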
&lt;br /&gt;
==== Solihin 11.4 Resolution ====&lt;br /&gt;
The three previous sections discussed individual parts of how the SCI protocol reduces race conditions.  Putting all three of these parts together yields the following scenario which resolves the Early Invalidation race as described in the text.&lt;br /&gt;
[[Image:SolihinSCI.png|center]]&lt;br /&gt;
&lt;br /&gt;
# ''A'' sends a request to ''Home'' for access to the memory block.  It then goes into a busy state while it waits for a response.&lt;br /&gt;
# ''B'' also sends a request to ''Home'' for access to the same memory block.  ''A'''s request is received first.&lt;br /&gt;
# ''Home'' responds to ''A'' with the data, but this response gets caught in network traffic and delayed.  This response is sent before ''Home'' processes the request from ''B''.&lt;br /&gt;
# ''Home'' responds to the request from ''B'', telling it that ''A'' is the current ''Head'' node.&lt;br /&gt;
# ''B'' then sends a request to ''A'' to tell it to demote itself.&lt;br /&gt;
# ''A'' still hasn't received the response from ''Home'', so it is still in a transaction, and it tells ''B'' that it is busy.  ''B'' will have to retry the request.&lt;br /&gt;
&lt;br /&gt;
It should be noted that ''B'' will be in the ''PENDING''&amp;lt;sup&amp;gt;&amp;lt;span class = &amp;quot;plainlinks&amp;quot;&amp;gt;[[#Definitions_and_Terms|[def]]]&amp;lt;/span&amp;gt;&amp;lt;/sup&amp;gt; state, so any requests to demote it from the ''Head'' position will be delayed.  This could lead to a chaining of nodes in the ''PENDING'' state, but as soon as node ''A'' receives its response from ''Home'' or times out, the requests in the chain will be successively accepted and processed.&lt;br /&gt;
&lt;br /&gt;
=== Other Possible Race Conditions ===&lt;br /&gt;
The design of the SCI protocol prevents many race conditions from occurring, as has already been shown.  Here are two other conditions that may arise, with a discussion of how the protocol resolves the issue.&lt;br /&gt;
&lt;br /&gt;
==== Concurrent List Deletions ====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==== Simultaneous Deletion and Invalidation ====&lt;br /&gt;
Even though the ''Head'' node performs most of the coherence actions for a sharing list, the individual nodes are able to invalidate themselves, such as when they evict the block from their cache.  This is referred to in the SCI standard as a &amp;quot;deletion&amp;quot; from the sharing list.  Deletion is accomplished by having the invalidating node &amp;quot;lock&amp;quot; itself and then inform its forward and back nodes that they should now point to each other.  This &amp;quot;locking&amp;quot; is essentially another ''PENDING'' state.  A problem could arise, then, when two neighboring nodes try to invalidate themselves at the same time, as they would both be locked and would not respond to each other's messages.  In this case, though, the protocol specifies that the node that is closest to the tail takes priority.  So, the deadlock and/or race condition is averted by ordering the deletions from the tail towards the head.&lt;br /&gt;
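&lt;br /&gt;
The tail-first ordering can be illustrated with a small Python sketch.  It is a simplification under stated assumptions: the sharing list is modelled as a plain Python list from ''Head'' to tail, and the function names deletion_order and delete_all are hypothetical and do not come from the SCI standard.&lt;br /&gt;
&lt;br /&gt;
```python
def deletion_order(sharing_list, deleting):
    """Order pending deletions so that nodes nearer the tail go first.

    sharing_list: node ids from Head (index 0) to tail (last index).
    deleting: ids of nodes that want to leave the list at the same time.
    """
    return sorted(deleting, key=sharing_list.index, reverse=True)


def delete_all(sharing_list, deleting):
    """Apply the deletions one at a time in tail-first order."""
    remaining = list(sharing_list)
    for node in deletion_order(sharing_list, deleting):
        remaining.remove(node)   # the neighbours now point at each other
    return remaining


nodes = ["H", "X", "Y", "T"]
order = deletion_order(nodes, {"X", "Y"})   # Y is closer to the tail than X
left = delete_all(nodes, {"X", "Y"})
```
&lt;br /&gt;
Giving every pair of neighbors a fixed winner means the two locked nodes never wait on each other indefinitely.&lt;br /&gt;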
&lt;br /&gt;
== Definitions and Terms ==&lt;br /&gt;
* ''Head Node'' - The node at the beginning of the sharing list.&lt;br /&gt;
* ''Home Node'' - The node responsible for keeping track of the ''Head'' node for a sharing list for a memory block, usually the node at which the memory block resides.&lt;br /&gt;
* ''PENDING state'' - The intermediate state for a node while it is attempting to become the ''Head'' node in the sharing list.  This includes the transaction involved in querying the ''Home'' node and the transaction involved in demoting the current ''Head'' node.&lt;br /&gt;
* ''SCI'' - Scalable Coherent Interface&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&amp;lt;span id=&amp;quot;1foot&amp;quot;&amp;gt;[[#1body|1.]]&amp;lt;/span&amp;gt; &amp;quot;IEEE Standard for Scalable Coherent Interface (SCI),&amp;quot; IEEE Std 1596-1992, p. i, 1993. doi: 10.1109/IEEESTD.1993.120366&lt;br /&gt;
URL: http://ieeexplore.ieee.org.proxied.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=347683&amp;amp;isnumber=8049 &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;2foot&amp;quot;&amp;gt;[[#2body|2.]]&amp;lt;/span&amp;gt; [http://www.scizzl.com/HowSCIcohWorks.html How SCI Coherence Works] &amp;lt;br&amp;gt;&lt;/div&gt;</summary>
		<author><name>Eapotter</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch11_BB_EP&amp;diff=45105</id>
		<title>CSC/ECE 506 Spring 2011/ch11 BB EP</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch11_BB_EP&amp;diff=45105"/>
		<updated>2011-04-18T22:32:01Z</updated>

		<summary type="html">&lt;p&gt;Eapotter: /* Cache States */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;__TOC__&lt;br /&gt;
&lt;br /&gt;
== History ==&lt;br /&gt;
&lt;br /&gt;
== Background ==&lt;br /&gt;
The Scalable Coherent Interface (SCI) cache coherence protocol defines a set of states for memory, a set of states for cache lines, and a set of actions that transition between these states when processors access memory.&lt;br /&gt;
&lt;br /&gt;
Three sets of these attributes are defined for minimal, typical, and full applications:&lt;br /&gt;
&lt;br /&gt;
* '''Minimal''' - for ‘trivial but correct’ applications that require the presence of the memory in only one cache line.  It does not enable read-sharing, and is appropriate for small multi-processors or applications that don’t require significant sharing.&lt;br /&gt;
&lt;br /&gt;
* '''Typical''' - enables read-sharing of a memory location and provisions for efficiency, such as DMA transfers, local caching of data, and error recovery.  This set adds an additional stable memory state (FRESH) and multiple cache states.  This set will be the focus of this article going forward.&lt;br /&gt;
&lt;br /&gt;
* '''Full''' - adds support for pair-wise sharing, QOLB lock bit synchronization, and cleansing and washing of cache lines.&lt;br /&gt;
&lt;br /&gt;
== State Diagrams ==&lt;br /&gt;
=== Memory States ===&lt;br /&gt;
The memory states define the state of the memory block from the perspective of the home directory.  This 2-bit field is maintained in the memory tag by the home directory along with the pointer to the head of the sharing list (forwId).  This simple state model includes three stable states and one semi-stable state.&lt;br /&gt;
&lt;br /&gt;
{|border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; cellpadding=&amp;quot;2&amp;quot;&lt;br /&gt;
!colspan=&amp;quot;5&amp;quot;| Table 1: Memory States &lt;br /&gt;
|----&lt;br /&gt;
!State&lt;br /&gt;
!Description&lt;br /&gt;
!Minimal&lt;br /&gt;
!Typical&lt;br /&gt;
!Full&lt;br /&gt;
|----&lt;br /&gt;
|HOME&lt;br /&gt;
|no sharing list&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|FRESH&lt;br /&gt;
|sharing-list copy is the same as memory&lt;br /&gt;
|&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|GONE&lt;br /&gt;
|sharing-list copy may be different from memory&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|WASH*&lt;br /&gt;
|transitional state (GONE to FRESH)&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|}&lt;br /&gt;
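&lt;br /&gt;
Table 1 can also be written down as a small lookup structure.  The Python sketch below only transcribes the table; the names MemoryState and SUPPORTED are hypothetical, and no numeric encoding of the 2-bit field is implied, since the table does not give one.&lt;br /&gt;
&lt;br /&gt;
```python
from enum import Enum


class MemoryState(Enum):
    """Memory states from Table 1, keyed to their descriptions."""
    HOME = "no sharing list"
    FRESH = "sharing-list copy is the same as memory"
    GONE = "sharing-list copy may be different from memory"
    WASH = "transitional state (GONE to FRESH)"


# Which option sets include each state, transcribed from Table 1.
SUPPORTED = {
    "minimal": {MemoryState.HOME, MemoryState.GONE},
    "typical": {MemoryState.HOME, MemoryState.FRESH, MemoryState.GONE},
    "full": set(MemoryState),
}

# Four states are exactly what a 2-bit field can distinguish.
state_count = len(MemoryState)
field_capacity = 2 ** 2
```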
&lt;br /&gt;
=== Cache States ===&lt;br /&gt;
The cache-line states are maintained by each processor's cache-coherency controller.  This 7-bit field is stored in each cache line in the sharing list, along with the pointer to the next sharing-list node (forwId) and the previous sharing-list node (backId).  Seven bits enable up to 128 possible values, and SCI defines twenty-nine stable states for use in the minimal, typical, and full sets.&lt;br /&gt;
&lt;br /&gt;
Stable states are those cache states that exist when a memory transaction is not in progress.  Their names are derived from a combination of the position of the node in the sharing list, such as ONLY, HEAD, MID, and TAIL, and the state of the data for that node, such as FRESH, CLEAN, DIRTY, VALID, STALE, etc.&lt;br /&gt;
&lt;br /&gt;
[[Image:cache_states.png|center]]&lt;br /&gt;
&lt;br /&gt;
== Race Conditions ==&lt;br /&gt;
In a distributed shared memory system with caching, the emergence of race conditions is extremely likely.  This is mainly due to the lack of a bus for serialization of actions.  It is further compounded by the problem of network errors and congestion.&lt;br /&gt;
&lt;br /&gt;
The early invalidation case from Section 11.4 in Solihin is an excellent example of a race condition that can arise in a distributed system.  Recall the diagram from the text, and the cache coherence actions, as shown below.&lt;br /&gt;
&lt;br /&gt;
[[Image:EarlyInValidationRace.png|400px|center]]&lt;br /&gt;
The circled actions are as follows:&lt;br /&gt;
# ''A'' sends a read request to ''Home''.&lt;br /&gt;
# ''Home'' replies with data (but the message gets delayed).&lt;br /&gt;
# ''B'' sends a write request to ''Home''.&lt;br /&gt;
# ''Home'' sends an invalidation to ''A'', and it arrives before the ReplyD.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The design of the SCI protocol prevents this race condition, as well as many others; the following sections discuss how.&lt;br /&gt;
&lt;br /&gt;
=== Prevention in the SCI Protocol ===&lt;br /&gt;
Race conditions are almost non-existent in the SCI protocol, due primarily to the protocol's design.  A brief discussion of how this is accomplished follows, and then a walk-through of what happens if the Early Invalidation condition arises in the SCI protocol.&lt;br /&gt;
==== Atomic Transactions ====&lt;br /&gt;
SCI's primary method for preventing race conditions is having atomic transactions.  A transaction is defined as a set of sub-actions necessary to complete some requested action, such as reading from memory or writing to a variable.  Suppose, for example, that node ''A'' is in the middle of a transaction with node ''B''.  Node ''C'' then tries to make a request of node ''A''.  Node ''A'' will respond to node ''C'' that it is busy, telling node ''C'' to try again, as shown in the following diagram.&lt;br /&gt;
[[Image:AtomicBusy.png|center]]&lt;br /&gt;
Thus, in the Early Invalidation race, node ''A'' would be in the middle of a transaction, which would prevent node ''B'' from invalidating it.  Note, however, that nodes will not stay in the busy state indefinitely.  Doing so would lead to potential deadlocks, so all transactions have timeouts that eventually cause the transaction to fail, moving the node out of the busy state so that it can respond to requests again.&lt;br /&gt;
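&lt;br /&gt;
The busy-and-retry behaviour can be sketched as follows.  This is an illustrative Python model, not the message format of the standard: the class Node, its methods, and the reply strings BUSY and ACCEPTED are hypothetical names, and a timeout is modelled simply as a call to finish.&lt;br /&gt;
&lt;br /&gt;
```python
class Node:
    """Hypothetical node that refuses new requests while in a transaction."""

    def __init__(self, name):
        self.name = name
        self.busy_with = None    # partner of the transaction in progress

    def begin(self, partner):
        """Start a transaction with another node."""
        self.busy_with = partner

    def finish(self):
        """Called when the transaction completes or times out."""
        self.busy_with = None

    def handle(self, requester):
        """Accept a request only if no transaction is in progress."""
        if self.busy_with is not None:
            return "BUSY"        # the requester must retry later
        self.busy_with = requester
        return "ACCEPTED"


a = Node("A")
a.begin("B")
first_try = a.handle("C")    # A is mid-transaction with B
a.finish()                   # the transaction completes or times out
second_try = a.handle("C")   # C retries once A has left the busy state
```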
&lt;br /&gt;
==== Head Node ====&lt;br /&gt;
Another method for preventing race conditions is to have the ''Head'' node of a sharing list perform many of the coherence actions.  As a result, only one node performs actions such as writes and invalidations of other sharers, which reduces the possibility of concurrent actions.  If another node wants to write, for instance, it must become the ''Head'' node of the sharing list for the cache line to which it wants to write, displacing the current ''Head'' node as the node that can write.  All state changes in the sharing list occur at the behest of the ''Head'' node, with the exception of nodes deleting themselves from the list.&lt;br /&gt;
&lt;br /&gt;
When nodes simply share a read-only copy of a line, as when the memory is in the ''FRESH'' state, the ''Head'' node is somewhat irrelevant.  All the sharing nodes have their own cached copy of the line.  If any node wants to write, including the ''Head'' node, it must perform an additional action in order to do so.  Likewise, when a sharing list is sharing a line of data in the dirty, or ''GONE'', memory state, as long as the ''Head'' node has not written to the line, all the sharing nodes stay in the list with their cached line.  However, at the point the ''Head'' node wants to write, it can write its data immediately, but it then must invalidate all other shared copies, via the forward pointers in the sharing list.&lt;br /&gt;
&lt;br /&gt;
This, however, does not protect against a race condition where the memory is ''FRESH'', the ''Head'' node wants to write, and another node also wants to join the list.  The Memory Access mechanism prevents this condition.&lt;br /&gt;
&lt;br /&gt;
==== Memory Access ====&lt;br /&gt;
The final way that the SCI protocol minimizes race conditions is by changing the directory structure.  In Solihin 11.4, there is a ''Home'' node which keeps track of the cache state and has a certain amount of knowledge about the system as a result.  In the SCI protocol, there is still a ''Home'' node of sorts, but this node is only responsible for keeping track of which node is the current ''Head''.  Such a ''Home'' node is usually the node where the memory block physically resides.  When a node wants to access a block of memory, it sends a request to the ''Home'' node of that memory block, assuming it isn't already in the sharing list for that block.  If the data is already in use, the ''Home'' node will include in its response the address of the ''Head'' node of the sharing list.  This requirement necessarily serializes the order of access, as one access request is not serviced at the ''Home'' node until the previous request is finished, and the requests are processed in FIFO order.  In addition, whichever node is the ''Head'' of the sharing list will be notified by the new ''Head'' when it is demoted.  All parties involved in a potential race are thus talking to each other, preventing the race from occurring in the first place.&lt;br /&gt;
&lt;br /&gt;
Because of this requirement, the projected race condition from the previous section, where the ''Head'' node is in a ''FRESH'' state and wants to write while another node also wants to read, cannot occur.  This is because both nodes must make their requests to the ''Home'' node; the ''Head'' node must request the ability to write, and the other node must request the data from the ''Home'' node.  One of these requests will reach the ''Home'' node first, and the second request is then deferred to the node whose request arrived first.  Such a scenario is pictured below.&lt;br /&gt;
[[Image:MemoryAccess.png|center]]&lt;br /&gt;
&lt;br /&gt;
==== Solihin 11.4 Resolution ====&lt;br /&gt;
The three previous sections discussed individual parts of how the SCI protocol reduces race conditions.  Putting all three of these parts together yields the following scenario which resolves the Early Invalidation race as described in the text.&lt;br /&gt;
[[Image:SolihinSCI.png|center]]&lt;br /&gt;
&lt;br /&gt;
# ''A'' sends a request to ''Home'' for access to the memory block.  It then goes into a busy state while it waits for a response.&lt;br /&gt;
# ''B'' also sends a request to ''Home'' for access to the same memory block.  ''A'''s request is received first.&lt;br /&gt;
# ''Home'' responds to ''A'' with the data, but this response gets caught in network traffic and delayed.  This response is sent before ''Home'' processes the request from ''B''.&lt;br /&gt;
# ''Home'' responds to the request from ''B'', telling it that ''A'' is the current ''Head'' node.&lt;br /&gt;
# ''B'' then sends a request to ''A'' to tell it to demote itself.&lt;br /&gt;
# ''A'' still hasn't received the response from ''Home'', so it is still in a transaction, and it tells ''B'' that it is busy.  ''B'' will have to retry the request.&lt;br /&gt;
&lt;br /&gt;
It should be noted that ''B'' is also in the busy state, so any requests made of it will also be rejected.  This could lead to a chaining of nodes in the busy state, but as soon as node ''A'' receives its response from ''Home'' or times out, the requests in the chain will be successively accepted and processed.&lt;br /&gt;
&lt;br /&gt;
=== Other Possible Race Conditions ===&lt;br /&gt;
==== Concurrent List Deletions ====&lt;br /&gt;
&lt;br /&gt;
==== Simultaneous Deletion and Invalidation ====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Definitions and Terms ==&lt;br /&gt;
* ''Head Node'' - The node at the beginning of the sharing list.&lt;br /&gt;
* ''Home Node'' - The node responsible for keeping track of the ''Head'' node for a sharing list for a memory block, usually the node at which the memory block resides.&lt;br /&gt;
* ''SCI'' - Scalable Coherent Interface&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&amp;lt;span id=&amp;quot;1foot&amp;quot;&amp;gt;[[#1body|1.]]&amp;lt;/span&amp;gt; &amp;quot;IEEE Standard for Scalable Coherent Interface (SCI),&amp;quot; IEEE Std 1596-1992, p. i, 1993. doi: 10.1109/IEEESTD.1993.120366&lt;br /&gt;
URL: http://ieeexplore.ieee.org.proxied.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=347683&amp;amp;isnumber=8049 &amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;span id=&amp;quot;2foot&amp;quot;&amp;gt;[[#2body|2.]]&amp;lt;/span&amp;gt; [http://www.scizzl.com/HowSCIcohWorks.html How SCI Coherence Works] &amp;lt;br&amp;gt;&lt;/div&gt;</summary>
		<author><name>Eapotter</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch11_BB_EP&amp;diff=45050</id>
		<title>CSC/ECE 506 Spring 2011/ch11 BB EP</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch11_BB_EP&amp;diff=45050"/>
		<updated>2011-04-18T15:16:01Z</updated>

		<summary type="html">&lt;p&gt;Eapotter: /* Memory States */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;__TOC__&lt;br /&gt;
&lt;br /&gt;
== History ==&lt;br /&gt;
&lt;br /&gt;
== Background ==&lt;br /&gt;
The Scalable Coherent Interface (SCI) cache coherence protocol defines a set of states for memory, a set of states for cache lines, and a set of actions that transition between these states when processors access memory.&lt;br /&gt;
&lt;br /&gt;
Three sets of these attributes are defined for minimal, typical, and full applications:&lt;br /&gt;
&lt;br /&gt;
* '''Minimal''' - for ‘trivial but correct’ applications that require the presence of the memory in only one cache line.  It does not enable read-sharing, and is appropriate for small multi-processors or applications that don’t require significant sharing.&lt;br /&gt;
&lt;br /&gt;
* '''Typical''' - enables read-sharing of a memory location and provisions for efficiency, such as DMA transfers, local caching of data, and error recovery.  This set adds an additional stable memory state (FRESH) and multiple cache states.  This set will be the focus of this article going forward.&lt;br /&gt;
&lt;br /&gt;
* '''Full''' - adds support for pair-wise sharing, QOLB lock bit synchronization, and cleansing and washing of cache lines.&lt;br /&gt;
&lt;br /&gt;
== State Diagrams ==&lt;br /&gt;
=== Memory States ===&lt;br /&gt;
The memory states define the state of the memory block from the perspective of the home directory.  This 2-bit field is maintained in the memory tag by the home directory along with the pointer to the head of the sharing list (forwId).  This simple state model includes three stable states and one semi-stable state.&lt;br /&gt;
&lt;br /&gt;
{|border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; cellpadding=&amp;quot;2&amp;quot;&lt;br /&gt;
!colspan=&amp;quot;5&amp;quot;| Table 1: Memory States &lt;br /&gt;
|----&lt;br /&gt;
!State&lt;br /&gt;
!Description&lt;br /&gt;
!Minimal&lt;br /&gt;
!Typical&lt;br /&gt;
!Full&lt;br /&gt;
|----&lt;br /&gt;
|HOME&lt;br /&gt;
|no sharing list&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|FRESH&lt;br /&gt;
|sharing-list copy is the same as memory&lt;br /&gt;
|&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|GONE&lt;br /&gt;
|sharing-list copy may be different from memory&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|WASH*&lt;br /&gt;
|transitional state (GONE to FRESH)&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=== Cache States ===&lt;br /&gt;
The cache-line states are maintained by each processor's cache-coherency controller.  This 7-bit field is stored in each cache line in the sharing list, along with the pointer to the next sharing-list node (forwId) and the previous sharing-list node (backId).  Seven bits enable up to 128 possible values, and SCI defines twenty-nine stable states for use in the minimal, typical, and full sets.&lt;br /&gt;
&lt;br /&gt;
Stable states are those cache states that exist when a memory transaction is not in progress.  Their names are derived from a combination of the position of the node in the sharing list, such as ONLY, HEAD, MID, and TAIL, and the state of the data for that node, such as FRESH, CLEAN, DIRTY, VALID, STALE, etc.&lt;br /&gt;
&lt;br /&gt;
[[Image:cache_states.png|center]]&lt;br /&gt;
&lt;br /&gt;
== Race Conditions ==&lt;br /&gt;
In a distributed shared memory system with caching, the emergence of race conditions is extremely likely.  This is mainly due to the lack of a bus for serialization of actions.  It is further compounded by the problem of network errors and congestion.&lt;br /&gt;
&lt;br /&gt;
The early invalidation case from Section 11.4 in Solihin is an excellent example of a race condition that can arise in a distributed system.  Recall the diagram from the text, and the cache coherence actions, as shown below.&lt;br /&gt;
&lt;br /&gt;
[[Image:EarlyInValidationRace.png|400px|center]]&lt;br /&gt;
The circled actions are as follows:&lt;br /&gt;
# ''A'' sends a read request to ''Home''.&lt;br /&gt;
# ''Home'' replies with data (but the message gets delayed).&lt;br /&gt;
# ''B'' sends a write request to ''Home''.&lt;br /&gt;
# ''Home'' sends an invalidation to ''A'', and it arrives before the ReplyD.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The design of the SCI protocol prevents this race condition, as well as many others; the following sections discuss how.&lt;br /&gt;
&lt;br /&gt;
=== Prevention in the SCI Protocol ===&lt;br /&gt;
Race conditions are almost non-existent in the SCI protocol, due primarily to the protocol's design.  A brief discussion of how this is accomplished follows, and then a walk-through of what happens if the Early Invalidation condition arises in the SCI protocol.&lt;br /&gt;
==== Atomic Transactions ====&lt;br /&gt;
SCI's primary method for preventing race conditions is having atomic transactions.  A transaction is defined as a set of sub-actions necessary to complete some requested action, such as reading from memory or writing to a variable.  Suppose, for example, that node ''A'' is in the middle of a transaction with node ''B''.  Node ''C'' then tries to make a request of node ''A''.  Node ''A'' will respond to node ''C'' that it is busy, telling node ''C'' to try again, as shown in the following diagram.&lt;br /&gt;
[[Image:AtomicBusy.png|center]]&lt;br /&gt;
Thus, in the Early Invalidation race, node ''A'' would be in the middle of a transaction, which would prevent node ''B'' from invalidating it.&lt;br /&gt;
&lt;br /&gt;
==== Head Node ====&lt;br /&gt;
Another method for preventing race conditions is to have the Head Node of a sharing list perform many of the coherence actions.  As a result, only one node performs actions such as writes and invalidations of other sharers, which reduces the possibility of concurrent actions.  If another node wants to write, for instance, it must become the Head Node of the sharing list for the cache line to which it wants to write, displacing the current Head Node as the node that can write.&lt;br /&gt;
&lt;br /&gt;
When nodes simply share a read-only copy of a line, as when the memory is in the ''FRESH'' state, the Head Node is somewhat irrelevant.  All the sharing nodes have their own cached copy of the line.  If any node wants to write, including the Head Node, it must perform an additional action in order to do so.  Likewise, when a sharing list is sharing a line of data in the dirty, or ''GONE'', memory state, as long as the Head Node has not written to the line, all the sharing nodes stay in the list with their cached line.  However, at the point the Head Node wants to write, it can write its data immediately, but it then must invalidate all other shared copies, via the forward pointers in the sharing list.&lt;br /&gt;
&lt;br /&gt;
This, however, does not protect against a race condition where the memory is ''FRESH'', the Head Node wants to write, and another node also wants to join the list.  The Memory Access mechanism prevents this condition.&lt;br /&gt;
&lt;br /&gt;
==== Memory Access ====&lt;br /&gt;
The final way that the SCI protocol minimizes race conditions is by changing the directory structure.  In Solihin 11.4, there is a ''Home'' node which keeps track of the cache state and has a certain amount of knowledge about the system as a result.  In the SCI protocol, there is still a ''Home'' node of sorts, but this node is only responsible for keeping track of which node is the current ''Head''.  Such a ''Home'' node is usually the node where the memory block physically resides.  When a node wants to access a block of memory, it sends a request to the ''Home'' node of that memory block, assuming it isn't already in the sharing list for that block.  If the data is already in use, the ''Home'' node will include in its response the address of the ''Head'' node of the sharing list.  This requirement necessarily serializes the order of access, as one access request is not serviced at the ''Home'' node until the previous request is finished, and the requests are processed in FIFO order.  In addition, whichever node is the ''Head'' of the sharing list will be notified by the new ''Head'' when it is demoted.  All parties involved in a potential race are thus talking to each other, preventing the race from occurring in the first place.&lt;br /&gt;
&lt;br /&gt;
Because of this requirement, the projected race condition from the previous section, where the ''Head'' node is in a ''FRESH'' state and wants to write while another node also wants to read, cannot occur.  This is because both nodes must make their requests to the ''Home'' node; the ''Head'' node must request the ability to write, and the other node must request the data from the ''Home'' node.  One of these requests will reach the ''Home'' node first, and the second request is then deferred to the node whose request arrived first.  Such a scenario is pictured below.&lt;br /&gt;
[[Image:MemoryAccess.png|center]]&lt;br /&gt;
&lt;br /&gt;
==== Solihin 11.4 Resolution ====&lt;br /&gt;
The three previous sections discussed individual parts of how the SCI protocol reduces race conditions.  Putting all three of these parts together yields the following scenario which resolves the Early Invalidation race as described in the text.&lt;br /&gt;
[[Image:SolihinSCI.png|center]]&lt;br /&gt;
&lt;br /&gt;
# ''A'' sends a request to ''Home'' for access to the memory block.  It then goes into a busy state while it waits for a response.&lt;br /&gt;
# ''B'' also sends a request to ''Home'' for access to the same memory block.  ''A'''s request is received first.&lt;br /&gt;
# ''Home'' responds to ''A'' with the data, but this response gets caught in network traffic and delayed.  This response is sent before ''Home'' processes the request from ''B''.&lt;br /&gt;
# ''Home'' responds to the request from ''B'', telling it that ''A'' is the current ''Head'' node.&lt;br /&gt;
# ''B'' then sends a request to ''A'' to tell it to demote itself.&lt;br /&gt;
# ''A'' still hasn't received the response from ''Home'', so it is still in a transaction, and it tells ''B'' that it is busy.  ''B'' will have to retry the request.&lt;br /&gt;
&lt;br /&gt;
=== Possible Race Conditions ===&lt;br /&gt;
==== Communication Delays ====&lt;br /&gt;
&lt;br /&gt;
==== Concurrent List Deletions ====&lt;br /&gt;
&lt;br /&gt;
==== Simultaneous Deletion and Invalidation ====&lt;br /&gt;
&lt;br /&gt;
== Summary ==&lt;br /&gt;
&lt;br /&gt;
== Definitions and Terms ==&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
[1] &amp;quot;IEEE Standard for Scalable Coherent Interface (SCI),&amp;quot; IEEE Std 1596-1992, p. i, 1993. doi: 10.1109/IEEESTD.1993.120366&lt;br /&gt;
URL: http://ieeexplore.ieee.org.proxied.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=347683&amp;amp;isnumber=8049&lt;/div&gt;</summary>
		<author><name>Eapotter</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch11_BB_EP&amp;diff=45030</id>
		<title>CSC/ECE 506 Spring 2011/ch11 BB EP</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch11_BB_EP&amp;diff=45030"/>
		<updated>2011-04-18T04:07:44Z</updated>

		<summary type="html">&lt;p&gt;Eapotter: /* Memory States */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;__TOC__&lt;br /&gt;
&lt;br /&gt;
== History ==&lt;br /&gt;
&lt;br /&gt;
== Background ==&lt;br /&gt;
The Scalable Coherent Interface (SCI) cache coherence protocol defines a set of states for memory, a set of states for cache lines, and a set of actions that transition between these states when processors access memory.&lt;br /&gt;
&lt;br /&gt;
Three sets of these attributes are defined for minimal, typical, and full applications:&lt;br /&gt;
&lt;br /&gt;
* '''Minimal''' - for ‘trivial but correct’ applications that require the presence of the memory in only one cache line.  It does not enable read-sharing, and is appropriate for small multi-processors or applications that don’t require significant sharing.&lt;br /&gt;
&lt;br /&gt;
* '''Typical''' - enables read-sharing of a memory location and provisions for efficiency, such as DMA transfers, local caching of data, and error recovery.  This set adds an additional stable memory state (FRESH) and multiple cache states.  This set will be the focus of this article going forward.&lt;br /&gt;
&lt;br /&gt;
* '''Full''' - adds support for pair-wise sharing, QOLB lock bit synchronization, and cleansing and washing of cache lines.&lt;br /&gt;
&lt;br /&gt;
== State Diagrams ==&lt;br /&gt;
=== Memory States ===&lt;br /&gt;
The memory states define the state of the memory block from the perspective of the home directory.  This 2-bit field is maintained in the memory tag by the home directory along with the pointer to the head of the sharing list (forwId).  This simple state model includes three stable states and one semi-stable state.&lt;br /&gt;
&lt;br /&gt;
{|border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; cellpadding=&amp;quot;2&amp;quot;&lt;br /&gt;
!colspan=&amp;quot;5&amp;quot;| Table 1: Memory States &lt;br /&gt;
|----&lt;br /&gt;
!State&lt;br /&gt;
!Description&lt;br /&gt;
!Minimal&lt;br /&gt;
!Typical&lt;br /&gt;
!Full&lt;br /&gt;
|----&lt;br /&gt;
|HOME&lt;br /&gt;
|no sharing list&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|FRESH&lt;br /&gt;
|sharing-list copy is the same as memory&lt;br /&gt;
|&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|GONE&lt;br /&gt;
|sharing-list copy may be different from memory&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|WASH*&lt;br /&gt;
|transitional state (GONE to FRESH)&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=== Cache States ===&lt;br /&gt;
The cache-line states are maintained by each processor's cache-coherency controller.  This 7-bit field is stored in each cache line in the sharing list, along with the pointer to the next sharing-list node (forwId) and the previous sharing-list node (backId).  Seven bits enable up to 128 possible values, and SCI defines twenty-nine stable states for use in the minimal, typical, and full sets.&lt;br /&gt;
&lt;br /&gt;
Stable states are those cache states that exist when a memory transaction is not in progress.  Their names are derived from a combination of:&lt;br /&gt;
* the position of the node in the sharing list, such as ONLY, HEAD, MID, and TAIL&lt;br /&gt;
* the state of the data for that node, such as FRESH, CLEAN, DIRTY, VALID, STALE, etc.&lt;br /&gt;
&lt;br /&gt;
[[Image:cache_states.png|center]]&lt;br /&gt;
&lt;br /&gt;
== Race Conditions ==&lt;br /&gt;
In a distributed shared memory system with caching, the emergence of race conditions is extremely likely.  This is mainly due to the lack of a bus for serialization of actions.  It is further compounded by the problem of network errors and congestion.&lt;br /&gt;
&lt;br /&gt;
The early invalidation case from Section 11.4 in Solihin is an excellent example of a race condition that can arise in a distributed system.  Recall the diagram from the text, and the cache coherence actions, as shown below.&lt;br /&gt;
&lt;br /&gt;
[[Image:EarlyInValidationRace.png|400px|center]]&lt;br /&gt;
The circled actions are as follows:&lt;br /&gt;
# ''A'' sends a read request to ''Home''.&lt;br /&gt;
# ''Home'' replies with data (but the message gets delayed).&lt;br /&gt;
# ''B'' sends a write request to ''Home''.&lt;br /&gt;
# ''Home'' sends an invalidation to ''A'', and it arrives before the delayed data reply (ReplyD).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The SCI protocol has a way to handle this race condition, as well as many others, and the following sections will discuss how the SCI protocol design can prevent this race condition.&lt;br /&gt;
&lt;br /&gt;
=== Prevention in the SCI Protocol ===&lt;br /&gt;
Race conditions are almost non-existent in the SCI protocol, due primarily to the protocol's design.  The subsections below briefly describe how this is accomplished, then show what happens if this condition arises in the SCI protocol.&lt;br /&gt;
==== Atomic Transactions ====&lt;br /&gt;
SCI's primary method for preventing race conditions is having atomic transactions.  A transaction is defined as a set of sub-actions necessary to complete some requested action, such as reading from memory or writing to a variable.  Suppose, for example, that node ''A'' is in the middle of a transaction with node ''B''.  Node ''C'' then tries to make a request of node ''A''.  Node ''A'' will respond to node ''C'' that it is busy, telling node ''C'' to try again, as shown in the following diagram.&lt;br /&gt;
[[Image:AtomicBusy.png|center]]&lt;br /&gt;
Thus, in the Early Invalidation race, node ''A'' would be in the middle of a transaction, which would prevent node ''B'' from invalidating it.&lt;br /&gt;
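&lt;br /&gt;
The busy-and-retry behaviour described above can be sketched in a few lines of Python.  This is an illustrative model only, not code from the SCI standard: the class Node, the methods begin_transaction, end_transaction and handle_request, the helper request_with_retry, and the reply strings BUSY and ACK are all hypothetical names chosen for this example.&lt;br /&gt;

```python
# Hypothetical model of atomic transactions with a busy/retry reply.
# All names here are illustrative; they are not defined by the SCI standard.

class Node:
    def __init__(self, name):
        self.name = name
        self.partner = None  # node we are currently in a transaction with

    def begin_transaction(self, other):
        self.partner = other

    def end_transaction(self):
        self.partner = None

    def handle_request(self, requester):
        # A node in the middle of a transaction rejects requests from anyone
        # other than its current partner, so the requester must retry later.
        if self.partner is not None and self.partner is not requester:
            return "BUSY"
        return "ACK"


def request_with_retry(requester, target, on_busy, max_attempts=3):
    # Re-send the request until the target accepts it or attempts run out.
    replies = []
    for _ in range(max_attempts):
        reply = target.handle_request(requester)
        replies.append(reply)
        if reply == "ACK":
            break
        on_busy()  # stand-in for waiting before the retry
    return replies


a, b, c = Node("A"), Node("B"), Node("C")
a.begin_transaction(b)
# A finishes its transaction with B while C is waiting to retry.
replies = request_with_retry(c, a, on_busy=a.end_transaction)
print(replies)
```

In this sketch the first request from ''C'' is refused because ''A'' is still busy with ''B'', and the retry succeeds once that transaction has ended.&lt;br /&gt;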
&lt;br /&gt;
==== Head Node ====&lt;br /&gt;
Another method for preventing race conditions is to have the Head Node of a sharing list perform many of the coherence actions.  As a result, only one node is performing actions such as writes and invalidations of other sharers.  Since only one node is performing these actions, the possibility of concurrent actions is decreased.  If another node wants to write, for instance, it must become the Head Node of the sharing list for the cache line to which it wants to write, displacing the current Head Node as the node that can write.&lt;br /&gt;
&lt;br /&gt;
When nodes are simply sharing a read-only copy of a cache line, as when the memory is in the ''FRESH'' state, the Head Node is somewhat irrelevant.  All the sharing nodes have their own cached value of the cache line.  If any node wants to write, including the Head Node, it must perform an additional action in order to do so.  Likewise, when a sharing list is sharing a line of data in the dirty, or ''GONE'', memory state, as long as the Head Node has not written to the line, then all the sharing nodes stay in the list with their cached line.  However, at the point the Head Node wants to write, it can write its data immediately, but it then must invalidate all other shared copies, via the forward pointers in the sharing list.&lt;br /&gt;
&lt;br /&gt;
This, however, does not protect against a race condition where the memory is ''FRESH'', the Head Node wants to write, and another node also wants to join the list.  The Memory Access mechanism prevents this condition.&lt;br /&gt;
&lt;br /&gt;
==== Memory Access ====&lt;br /&gt;
The final way that the SCI protocol minimizes race conditions is by changing the directory structure.  In Solihin 11.4, there is a ''Home'' node which keeps track of the cache state and has a certain amount of knowledge about the system as a result.  In the SCI protocol, there is still a ''Home'' node of sorts, but this node is only responsible for keeping track of which node is the current ''Head''.  Such a ''Home'' node is usually the node where the memory block physically resides.  When a node wants to access a block of memory, it sends a request to the ''Home'' node of that memory block, assuming it isn't already in the sharing list for that block.  In the event that the data is already being used, the ''Home'' node will include in its response the address of the ''Head'' node of the sharing list.  This requirement necessarily serializes the order of access, as one access request is not serviced at the ''Home'' node until the previous request is finished, and the requests are processed in FIFO order.  In addition, whichever node is the ''Head'' of the sharing list will be notified by the new ''Head'' when it is demoted.  All parties involved in a potential race are thus talking to each other, preventing the race from occurring in the first place.&lt;br /&gt;
&lt;br /&gt;
Because of this requirement, the projected race condition from the previous section, where the ''Head'' node is in a ''FRESH'' state and wants to write while another node also wants to read, cannot occur.  This is because both nodes must make their requests to the ''Home'' node; the ''Head'' node must request the ability to write, and the other node must request the data from the ''Home'' node.  One of these requests will make it to the ''Home'' node first, resulting in the second request being deferred to the first node to reach the ''Home'' node.  Such a scenario is pictured below.&lt;br /&gt;
[[Image:MemoryAccess.png|center]]&lt;br /&gt;
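&lt;br /&gt;
The serialization at the ''Home'' node can also be sketched as a small Python model.  This is a deliberately simplified assumption, not the actual SCI transaction set: every request is treated as a request to become the new ''Head'', and the class Home with its methods submit and service_all are hypothetical names used only for illustration.&lt;br /&gt;

```python
from collections import deque

# Hypothetical model of a Home node that serializes requests in FIFO order.
# Names (Home, submit, service_all) are illustrative, not from the SCI standard.

class Home:
    def __init__(self):
        self.head = None        # current Head of the sharing list, if any
        self.pending = deque()  # requests waiting to be serviced

    def submit(self, requester):
        self.pending.append(requester)

    def service_all(self):
        # Service one request at a time; each reply names the previous Head
        # so the requester knows which node it must contact next.
        replies = []
        while self.pending:
            requester = self.pending.popleft()
            replies.append((requester, self.head))
            self.head = requester
        return replies


home = Home()
home.head = "H"    # existing Head that wants to write
home.submit("N")   # another node's read request arrives first
home.submit("H")   # the Head's write request arrives second
replies = home.service_all()
print(replies)
```

Because the two requests are serviced strictly one after the other, the second requester is always told about the first, so in this model neither node can act on a stale view of who the ''Head'' is.&lt;br /&gt;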
&lt;br /&gt;
==== Example Resolution ====&lt;br /&gt;
&lt;br /&gt;
=== Possible Race Conditions ===&lt;br /&gt;
==== Communication Delays ====&lt;br /&gt;
&lt;br /&gt;
==== Concurrent List Deletions ====&lt;br /&gt;
&lt;br /&gt;
==== Simultaneous Deletion and Invalidation ====&lt;br /&gt;
&lt;br /&gt;
== Summary ==&lt;br /&gt;
&lt;br /&gt;
== Definitions and Terms ==&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
[1] &amp;quot;IEEE Standard for Scalable Coherent Interface (SCI),&amp;quot; IEEE Std 1596-1992, 1993. doi: 10.1109/IEEESTD.1993.120366&lt;br /&gt;
URL: http://ieeexplore.ieee.org.proxied.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=347683&amp;amp;isnumber=8049&lt;/div&gt;</summary>
		<author><name>Eapotter</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch11_BB_EP&amp;diff=45026</id>
		<title>CSC/ECE 506 Spring 2011/ch11 BB EP</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch11_BB_EP&amp;diff=45026"/>
		<updated>2011-04-18T04:02:18Z</updated>

		<summary type="html">&lt;p&gt;Eapotter: /* Memory States */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;__TOC__&lt;br /&gt;
&lt;br /&gt;
== History ==&lt;br /&gt;
&lt;br /&gt;
== Background ==&lt;br /&gt;
The Scalable Coherent Interface (SCI) cache coherence protocol defines a set of states for memory, a set of states for cache lines, and a set of actions that transition between these states when processors access memory.&lt;br /&gt;
&lt;br /&gt;
Three sets of these attributes are defined for minimal, typical, and full applications:&lt;br /&gt;
&lt;br /&gt;
* '''Minimal''' - for ‘trivial but correct’ applications that require the presence of the memory in only one cache line.  It does not enable read-sharing, and is appropriate for small multi-processors or applications that don’t require significant sharing.&lt;br /&gt;
&lt;br /&gt;
* '''Typical''' - enables read-sharing of a memory location and provisions for efficiency, such as DMA transfers, local caching of data, and error recovery.  This set adds an additional stable memory state (FRESH) and multiple cache states.  This set will be the focus of this article going forward.&lt;br /&gt;
&lt;br /&gt;
* '''Full''' - adds support for pair-wise sharing, QOLB lock bit synchronization, and cleansing and washing of cache lines.&lt;br /&gt;
&lt;br /&gt;
== State Diagrams ==&lt;br /&gt;
=== Memory States ===&lt;br /&gt;
The memory states define the state of the memory block from the perspective of the home directory.  This 2-bit field is maintained in the memory tag by the home directory along with the pointer to the head of the sharing list (forwId).  This simple state model includes three stable states and one semi-stable state.&lt;br /&gt;
&lt;br /&gt;
{|border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; cellpadding=&amp;quot;2&amp;quot;&lt;br /&gt;
!State&lt;br /&gt;
!Description&lt;br /&gt;
!Minimal&lt;br /&gt;
!Typical&lt;br /&gt;
!Full&lt;br /&gt;
|----&lt;br /&gt;
|HOME&lt;br /&gt;
|no sharing list&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|FRESH&lt;br /&gt;
|sharing-list copy is the same as memory&lt;br /&gt;
|&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|GONE&lt;br /&gt;
|sharing-list copy may be different from memory&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|WASH*&lt;br /&gt;
|transitional state (GONE to FRESH)&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=== Cache States ===&lt;br /&gt;
The cache-line states are maintained by each processor's cache-coherency controller.  This 7-bit field is stored in each cache line in the sharing list, along with the pointer to the next sharing-list node (forwId) and the previous sharing-list node (backId).  Seven bits enable up to 128 possible values, and SCI defines twenty-nine stable states for use in the minimal, typical, and full sets.  &lt;br /&gt;
&lt;br /&gt;
Stable states are those cache states that exist when a memory transaction is not in progress.  Their names are derived from a combination of:&lt;br /&gt;
* the position of the node in the sharing list, such as ONLY, HEAD, MID, and TAIL&lt;br /&gt;
* the state of the data for that node, such as FRESH, CLEAN, DIRTY, VALID, STALE, etc.&lt;br /&gt;
&lt;br /&gt;
[[Image:cache_states.png|center]]&lt;br /&gt;
&lt;br /&gt;
== Race Conditions ==&lt;br /&gt;
In a distributed shared memory system with caching, the emergence of race conditions is extremely likely.  This is mainly due to the lack of a bus for serialization of actions.  It is further compounded by the problem of network errors and congestion.&lt;br /&gt;
&lt;br /&gt;
The early invalidation case from Section 11.4 in Solihin is an excellent example of a race condition that can arise in a distributed system.  Recall the diagram from the text, and the cache coherence actions, as shown below.&lt;br /&gt;
&lt;br /&gt;
[[Image:EarlyInValidationRace.png|400px|center]]&lt;br /&gt;
The circled actions are as follows:&lt;br /&gt;
# ''A'' sends a read request to ''Home''.&lt;br /&gt;
# ''Home'' replies with data (but the message gets delayed).&lt;br /&gt;
# ''B'' sends a write request to ''Home''.&lt;br /&gt;
# ''Home'' sends an invalidation to ''A'', and it arrives before the delayed data reply (ReplyD).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The SCI protocol has a way to handle this race condition, as well as many others, and the following sections will discuss how the SCI protocol design can prevent this race condition.&lt;br /&gt;
&lt;br /&gt;
=== Prevention in the SCI Protocol ===&lt;br /&gt;
Race conditions are almost non-existent in the SCI protocol, due primarily to the protocol's design.  The subsections below briefly describe how this is accomplished, then show what happens if this condition arises in the SCI protocol.&lt;br /&gt;
==== Atomic Transactions ====&lt;br /&gt;
SCI's primary method for preventing race conditions is having atomic transactions.  A transaction is defined as a set of sub-actions necessary to complete some requested action, such as reading from memory or writing to a variable.  Suppose, for example, that node ''A'' is in the middle of a transaction with node ''B''.  Node ''C'' then tries to make a request of node ''A''.  Node ''A'' will respond to node ''C'' that it is busy, telling node ''C'' to try again, as shown in the following diagram.&lt;br /&gt;
[[Image:AtomicBusy.png|center]]&lt;br /&gt;
Thus, in the Early Invalidation race, node ''A'' would be in the middle of a transaction, which would prevent node ''B'' from invalidating it.&lt;br /&gt;
&lt;br /&gt;
==== Head Node ====&lt;br /&gt;
Another method for preventing race conditions is to have the Head Node of a sharing list perform many of the coherence actions.  As a result, only one node is performing actions such as writes and invalidations of other sharers.  Since only one node is performing these actions, the possibility of concurrent actions is decreased.  If another node wants to write, for instance, it must become the Head Node of the sharing list for the cache line to which it wants to write, displacing the current Head Node as the node that can write.&lt;br /&gt;
&lt;br /&gt;
When nodes are simply sharing a read-only copy of a cache line, as when the memory is in the ''FRESH'' state, the Head Node is somewhat irrelevant.  All the sharing nodes have their own cached value of the cache line.  If any node wants to write, including the Head Node, it must perform an additional action in order to do so.  Likewise, when a sharing list is sharing a line of data in the dirty, or ''GONE'', memory state, as long as the Head Node has not written to the line, then all the sharing nodes stay in the list with their cached line.  However, at the point the Head Node wants to write, it can write its data immediately, but it then must invalidate all other shared copies, via the forward pointers in the sharing list.&lt;br /&gt;
&lt;br /&gt;
This, however, does not protect against a race condition where the memory is ''FRESH'', the Head Node wants to write, and another node also wants to join the list.  The Memory Access mechanism prevents this condition.&lt;br /&gt;
&lt;br /&gt;
==== Memory Access ====&lt;br /&gt;
The final way that the SCI protocol minimizes race conditions is by changing the directory structure.  In Solihin 11.4, there is a ''Home'' node which keeps track of the cache state and has a certain amount of knowledge about the system as a result.  In the SCI protocol, there is still a ''Home'' node of sorts, but this node is only responsible for keeping track of which node is the current ''Head''.  Such a ''Home'' node is usually the node where the memory block physically resides.  When a node wants to access a block of memory, it sends a request to the ''Home'' node of that memory block, assuming it isn't already in the sharing list for that block.  In the event that the data is already being used, the ''Home'' node will include in its response the address of the ''Head'' node of the sharing list.  This requirement necessarily serializes the order of access, as one access request is not serviced at the ''Home'' node until the previous request is finished, and the requests are processed in FIFO order.  In addition, whichever node is the ''Head'' of the sharing list will be notified by the new ''Head'' when it is demoted.  All parties involved in a potential race are thus talking to each other, preventing the race from occurring in the first place.&lt;br /&gt;
&lt;br /&gt;
Because of this requirement, the projected race condition from the previous section, where the ''Head'' node is in a ''FRESH'' state and wants to write while another node also wants to read, cannot occur.  This is because both nodes must make their requests to the ''Home'' node; the ''Head'' node must request the ability to write, and the other node must request the data from the ''Home'' node.  One of these requests will make it to the ''Home'' node first, resulting in the second request being deferred to the first node to reach the ''Home'' node.  Such a scenario is pictured below.&lt;br /&gt;
[[Image:MemoryAccess.png|center]]&lt;br /&gt;
&lt;br /&gt;
=== Possible Race Conditions ===&lt;br /&gt;
==== Communication Delays ====&lt;br /&gt;
&lt;br /&gt;
==== Concurrent List Deletions ====&lt;br /&gt;
&lt;br /&gt;
==== Simultaneous Deletion and Invalidation ====&lt;br /&gt;
&lt;br /&gt;
== Summary ==&lt;br /&gt;
&lt;br /&gt;
== Definitions and Terms ==&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
[1] &amp;quot;IEEE Standard for Scalable Coherent Interface (SCI),&amp;quot; IEEE Std 1596-1992, 1993. doi: 10.1109/IEEESTD.1993.120366&lt;br /&gt;
URL: http://ieeexplore.ieee.org.proxied.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=347683&amp;amp;isnumber=8049&lt;/div&gt;</summary>
		<author><name>Eapotter</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch11_BB_EP&amp;diff=45023</id>
		<title>CSC/ECE 506 Spring 2011/ch11 BB EP</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch11_BB_EP&amp;diff=45023"/>
		<updated>2011-04-18T03:55:18Z</updated>

		<summary type="html">&lt;p&gt;Eapotter: /* Memory States */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;__TOC__&lt;br /&gt;
&lt;br /&gt;
== History ==&lt;br /&gt;
&lt;br /&gt;
== Background ==&lt;br /&gt;
The Scalable Coherent Interface (SCI) cache coherence protocol defines a set of states for memory, a set of states for cache lines, and a set of actions that transition between these states when processors access memory.&lt;br /&gt;
&lt;br /&gt;
Three sets of these attributes are defined for minimal, typical, and full applications:&lt;br /&gt;
&lt;br /&gt;
* '''Minimal''' - for ‘trivial but correct’ applications that require the presence of the memory in only one cache line.  It does not enable read-sharing, and is appropriate for small multi-processors or applications that don’t require significant sharing.&lt;br /&gt;
&lt;br /&gt;
* '''Typical''' - enables read-sharing of a memory location and provisions for efficiency, such as DMA transfers, local caching of data, and error recovery.  This set adds an additional stable memory state (FRESH) and multiple cache states.  This set will be the focus of this article going forward.&lt;br /&gt;
&lt;br /&gt;
* '''Full''' - adds support for pair-wise sharing, QOLB lock bit synchronization, and cleansing and washing of cache lines.&lt;br /&gt;
&lt;br /&gt;
== State Diagrams ==&lt;br /&gt;
=== Memory States ===&lt;br /&gt;
The memory states define the state of the memory block from the perspective of the home directory.  This 2-bit field is maintained in the memory tag by the home directory along with the pointer to the head of the sharing list (forwId).  This simple state model includes three stable states and one semi-stable state.&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
!State&lt;br /&gt;
!Description&lt;br /&gt;
!Minimal&lt;br /&gt;
!Typical&lt;br /&gt;
!Full&lt;br /&gt;
|----&lt;br /&gt;
|HOME&lt;br /&gt;
|no sharing list&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|FRESH&lt;br /&gt;
|sharing-list copy is the same as memory&lt;br /&gt;
|&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|GONE&lt;br /&gt;
|sharing-list copy may be different from memory&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|WASH*&lt;br /&gt;
|transitional state (GONE to FRESH)&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=== Cache States ===&lt;br /&gt;
The cache-line states are maintained by each processor's cache-coherency controller.  This 7-bit field is stored in each cache line in the sharing list, along with the pointer to the next sharing-list node (forwId) and the previous sharing-list node (backId).  Seven bits enable up to 128 possible values, and SCI defines twenty-nine stable states for use in the minimal, typical, and full sets.  &lt;br /&gt;
&lt;br /&gt;
Stable states are those cache states that exist when a memory transaction is not in progress.  Their names are derived from a combination of:&lt;br /&gt;
* the position of the node in the sharing list, such as ONLY, HEAD, MID, and TAIL&lt;br /&gt;
* the state of the data for that node, such as FRESH, CLEAN, DIRTY, VALID, STALE, etc.&lt;br /&gt;
&lt;br /&gt;
[[Image:cache_states.png|center]]&lt;br /&gt;
&lt;br /&gt;
== Race Conditions ==&lt;br /&gt;
In a distributed shared memory system with caching, the emergence of race conditions is extremely likely.  This is mainly due to the lack of a bus for serialization of actions.  It is further compounded by the problem of network errors and congestion.&lt;br /&gt;
&lt;br /&gt;
The early invalidation case from Section 11.4 in Solihin is an excellent example of a race condition that can arise in a distributed system.  Recall the diagram from the text, and the cache coherence actions, as shown below.&lt;br /&gt;
&lt;br /&gt;
[[Image:EarlyInValidationRace.png|400px|center]]&lt;br /&gt;
The circled actions are as follows:&lt;br /&gt;
# ''A'' sends a read request to ''Home''.&lt;br /&gt;
# ''Home'' replies with data (but the message gets delayed).&lt;br /&gt;
# ''B'' sends a write request to ''Home''.&lt;br /&gt;
# ''Home'' sends an invalidation to ''A'', and it arrives before the delayed data reply (ReplyD).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The SCI protocol has a way to handle this race condition, as well as many others, and the following sections will discuss how the SCI protocol design can prevent this race condition.&lt;br /&gt;
&lt;br /&gt;
=== Prevention in the SCI Protocol ===&lt;br /&gt;
Race conditions are almost non-existent in the SCI protocol, due primarily to the protocol's design.  The subsections below briefly describe how this is accomplished, then show what happens if this condition arises in the SCI protocol.&lt;br /&gt;
==== Atomic Transactions ====&lt;br /&gt;
SCI's primary method for preventing race conditions is having atomic transactions.  A transaction is defined as a set of sub-actions necessary to complete some requested action, such as reading from memory or writing to a variable.  Suppose, for example, that node ''A'' is in the middle of a transaction with node ''B''.  Node ''C'' then tries to make a request of node ''A''.  Node ''A'' will respond to node ''C'' that it is busy, telling node ''C'' to try again, as shown in the following diagram.&lt;br /&gt;
[[Image:AtomicBusy.png|center]]&lt;br /&gt;
Thus, in the Early Invalidation race, node ''A'' would be in the middle of a transaction, which would prevent node ''B'' from invalidating it.&lt;br /&gt;
&lt;br /&gt;
==== Head Node ====&lt;br /&gt;
Another method for preventing race conditions is to have the Head Node of a sharing list perform many of the coherence actions.  As a result, only one node is performing actions such as writes and invalidations of other sharers.  Since only one node is performing these actions, the possibility of concurrent actions is decreased.  If another node wants to write, for instance, it must become the Head Node of the sharing list for the cache line to which it wants to write, displacing the current Head Node as the node that can write.&lt;br /&gt;
&lt;br /&gt;
When nodes are simply sharing a read-only copy of a cache line, as when the memory is in the ''FRESH'' state, the Head Node is somewhat irrelevant.  All the sharing nodes have their own cached value of the cache line.  If any node wants to write, including the Head Node, it must perform an additional action in order to do so.  Likewise, when a sharing list is sharing a line of data in the dirty, or ''GONE'', memory state, as long as the Head Node has not written to the line, then all the sharing nodes stay in the list with their cached line.  However, at the point the Head Node wants to write, it can write its data immediately, but it then must invalidate all other shared copies, via the forward pointers in the sharing list.&lt;br /&gt;
&lt;br /&gt;
This, however, does not protect against a race condition where the memory is ''FRESH'', the Head Node wants to write, and another node also wants to join the list.  The Memory Access mechanism prevents this condition.&lt;br /&gt;
&lt;br /&gt;
==== Memory Access ====&lt;br /&gt;
The final way that the SCI protocol minimizes race conditions is by changing the directory structure.  In Solihin 11.4, there is a ''Home'' node which keeps track of the cache state and has a certain amount of knowledge about the system as a result.  In the SCI protocol, there is still a ''Home'' node of sorts, but this node is only responsible for keeping track of who the current ''Head'' node is.  Such a ''Home'' node is usually the node where the memory block physically resides.  When a node wants to access a block of memory, it sends a request to the ''Home'' node of that memory block, assuming it isn't already in the sharing list for that block.  This requirement necessarily serializes the order of access, as one access request is not serviced at the ''Home'' node until the previous request is finished, and the requests are processed in FIFO order.&lt;br /&gt;
&lt;br /&gt;
=== Possible Race Conditions ===&lt;br /&gt;
==== Communication Delays ====&lt;br /&gt;
&lt;br /&gt;
==== Concurrent List Deletions ====&lt;br /&gt;
&lt;br /&gt;
==== Simultaneous Deletion and Invalidation ====&lt;br /&gt;
&lt;br /&gt;
== Summary ==&lt;br /&gt;
&lt;br /&gt;
== Definitions and Terms ==&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
[1] &amp;quot;IEEE Standard for Scalable Coherent Interface (SCI),&amp;quot; IEEE Std 1596-1992, 1993. doi: 10.1109/IEEESTD.1993.120366&lt;br /&gt;
URL: http://ieeexplore.ieee.org.proxied.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=347683&amp;amp;isnumber=8049&lt;/div&gt;</summary>
		<author><name>Eapotter</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch11_BB_EP&amp;diff=45022</id>
		<title>CSC/ECE 506 Spring 2011/ch11 BB EP</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch11_BB_EP&amp;diff=45022"/>
		<updated>2011-04-18T03:53:08Z</updated>

		<summary type="html">&lt;p&gt;Eapotter: /* Memory States */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;__TOC__&lt;br /&gt;
&lt;br /&gt;
== History ==&lt;br /&gt;
&lt;br /&gt;
== Background ==&lt;br /&gt;
The Scalable Coherent Interface (SCI) cache coherence protocol defines a set of states for memory, a set of states for cache lines, and a set of actions that transition between these states when processors access memory.&lt;br /&gt;
&lt;br /&gt;
Three sets of these attributes are defined for minimal, typical, and full applications:&lt;br /&gt;
&lt;br /&gt;
* '''Minimal''' - for ‘trivial but correct’ applications that require the presence of the memory in only one cache line.  It does not enable read-sharing, and is appropriate for small multi-processors or applications that don’t require significant sharing.&lt;br /&gt;
&lt;br /&gt;
* '''Typical''' - enables read-sharing of a memory location and provisions for efficiency, such as DMA transfers, local caching of data, and error recovery.  This set adds an additional stable memory state (FRESH) and multiple cache states.  This set will be the focus of this article going forward.&lt;br /&gt;
&lt;br /&gt;
* '''Full''' - adds support for pair-wise sharing, QOLB lock bit synchronization, and cleansing and washing of cache lines.&lt;br /&gt;
&lt;br /&gt;
== State Diagrams ==&lt;br /&gt;
=== Memory States ===&lt;br /&gt;
The memory states define the state of the memory block from the perspective of the home directory.  This 2-bit field is maintained in the memory tag by the home directory along with the pointer to the head of the sharing list (forwId).  This simple state model includes three stable states and one semi-stable state.&lt;br /&gt;
&lt;br /&gt;
{|&lt;br /&gt;
|State&lt;br /&gt;
|Description&lt;br /&gt;
|Minimal&lt;br /&gt;
|Typical&lt;br /&gt;
|Full&lt;br /&gt;
|----&lt;br /&gt;
|HOME&lt;br /&gt;
|no sharing list&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|FRESH&lt;br /&gt;
|sharing-list copy is the same as memory&lt;br /&gt;
|&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|GONE&lt;br /&gt;
|sharing-list copy may be different from memory&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|WASH*&lt;br /&gt;
|transitional state (GONE to FRESH)&lt;br /&gt;
|&lt;br /&gt;
|&lt;br /&gt;
|Y&lt;br /&gt;
|----&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
=== Cache States ===&lt;br /&gt;
The cache-line states are maintained by each processor's cache-coherency controller.  This 7-bit field is stored in each cache line in the sharing list, along with the pointer to the next sharing-list node (forwId) and the previous sharing-list node (backId).  Seven bits enable up to 128 possible values, and SCI defines twenty-nine stable states for use in the minimal, typical, and full sets.  &lt;br /&gt;
&lt;br /&gt;
Stable states are those cache states that exist when a memory transaction is not in progress.  Their names are derived from a combination of:&lt;br /&gt;
* the position of the node in the sharing list, such as ONLY, HEAD, MID, and TAIL&lt;br /&gt;
* the state of the data for that node, such as FRESH, CLEAN, DIRTY, VALID, STALE, etc.&lt;br /&gt;
&lt;br /&gt;
[[Image:cache_states.png|center]]&lt;br /&gt;
&lt;br /&gt;
== Race Conditions ==&lt;br /&gt;
In a distributed shared memory system with caching, the emergence of race conditions is extremely likely.  This is mainly due to the lack of a bus for serialization of actions.  It is further compounded by the problem of network errors and congestion.&lt;br /&gt;
&lt;br /&gt;
The early invalidation case from Section 11.4 in Solihin is an excellent example of a race condition that can arise in a distributed system.  Recall the diagram from the text, and the cache coherence actions, as shown below.&lt;br /&gt;
&lt;br /&gt;
[[Image:EarlyInValidationRace.png|400px|center]]&lt;br /&gt;
The circled actions are as follows:&lt;br /&gt;
# ''A'' sends a read request to ''Home''.&lt;br /&gt;
# ''Home'' replies with data (but the message gets delayed).&lt;br /&gt;
# ''B'' sends a write request to ''Home''.&lt;br /&gt;
# ''Home'' sends an invalidation to ''A'', and it arrives before the delayed data reply (ReplyD).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The SCI protocol has a way to handle this race condition, as well as many others, and the following sections will discuss how the SCI protocol design can prevent this race condition.&lt;br /&gt;
&lt;br /&gt;
=== Prevention in the SCI Protocol ===&lt;br /&gt;
Race conditions are almost non-existent in the SCI protocol, due primarily to the protocol's design.  The subsections below briefly describe how this is accomplished, then show what happens if this condition arises in the SCI protocol.&lt;br /&gt;
==== Atomic Transactions ====&lt;br /&gt;
SCI's primary method for preventing race conditions is having atomic transactions.  A transaction is defined as a set of sub-actions necessary to complete some requested action, such as reading from memory or writing to a variable.  Suppose, for example, that node ''A'' is in the middle of a transaction with node ''B''.  Node ''C'' then tries to make a request of node ''A''.  Node ''A'' will respond to node ''C'' that it is busy, telling node ''C'' to try again, as shown in the following diagram.&lt;br /&gt;
[[Image:AtomicBusy.png|center]]&lt;br /&gt;
Thus, in the Early Invalidation race, node ''A'' would be in the middle of a transaction, which would prevent node ''B'' from invalidating it.&lt;br /&gt;
&lt;br /&gt;
==== Head Node ====&lt;br /&gt;
Another method for preventing race conditions is to have the Head Node of a sharing list perform many of the coherence actions.  As a result, only one node is performing actions such as writes and invalidations of other sharers.  Since only one node is performing these actions, the possibility of concurrent actions is decreased.  If another node wants to write, for instance, it must become the Head Node of the sharing list for the cache line to which it wants to write, displacing the current Head Node as the node that can write.&lt;br /&gt;
&lt;br /&gt;
When simply sharing a read-only copy of a line, as when the memory is in the ''FRESH'' state, the Head Node is somewhat irrelevant.  All the sharing nodes have their own cached copy of the line.  If any node wants to write, including the Head Node, it must perform an additional action in order to do so.  Likewise, when a sharing list is sharing a line of data in the dirty, or ''GONE'', memory state, all the sharing nodes stay in the list with their cached line as long as the Head Node has not written to the line.  However, once the Head Node wants to write, it can write its data immediately, but it must then invalidate all other shared copies via the forward pointers in the sharing list.&lt;br /&gt;
&lt;br /&gt;
This, however, does not protect against a race condition where the memory is ''FRESH'', the Head Node wants to write, and another node also wants to join the list.  The Memory Access mechanism prevents this condition.&lt;br /&gt;
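&lt;br /&gt;
The head-driven invalidation described in this section can be illustrated with a small linked-list sketch.  It is a simplified model under stated assumptions: the Sharer class, its forw and valid fields, and the build_list and head_write helpers are hypothetical names, and back pointers and message exchanges are left out entirely.&lt;br /&gt;
&lt;br /&gt;
```python
class Sharer:
    # Hypothetical sharing-list entry that keeps only a forward pointer.
    def __init__(self, name):
        self.name = name
        self.forw = None
        self.valid = True


def build_list(names):
    # Link the sharers head-first and return the head of the list.
    nodes = [Sharer(n) for n in names]
    for cur, nxt in zip(nodes, nodes[1:]):
        cur.forw = nxt
    return nodes[0]


def head_write(head):
    # The head walks the forward pointers, invalidating and unlinking each sharer.
    purged = []
    node = head.forw
    while node is not None:
        node.valid = False
        purged.append(node.name)
        nxt = node.forw
        node.forw = None
        node = nxt
    head.forw = None
    return purged


head = build_list(['H', 'M', 'T'])
purged = head_write(head)
# purged == ['M', 'T']; the head is now the only node left in the list
```
&lt;br /&gt;
Because only the head runs this loop, no two nodes walk the list to invalidate sharers at the same time, which is the point the section makes about reducing concurrent actions.&lt;br /&gt;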
&lt;br /&gt;
==== Memory Access ====&lt;br /&gt;
The final way that the SCI protocol minimizes race conditions is by changing the directory structure.  In Solihin 11.4, there is a ''Home'' node which keeps track of the cache state and has a certain amount of knowledge about the system as a result.  In the SCI protocol, there is still a ''Home'' node of sorts, but this node is only responsible for keeping track of who the current ''Head'' node is.  Such a ''Home'' node is usually the node where the memory block physically resides.  When a node wants to access a block of memory, it sends a request to the ''Home'' node of that memory block, assuming it isn't already in the sharing list for that block.  This requirement necessarily serializes the order of access, as one access request is not serviced at the ''Home'' node until the previous request is finished, and the requests are processed in FIFO order.&lt;br /&gt;
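&lt;br /&gt;
The serialization at the ''Home'' node can be sketched as a FIFO queue in front of a single head pointer.  This is a minimal model and not the SCI message format: HomeDirectory, submit, and service_next are hypothetical names, and the sketch assumes that each serviced requester becomes the new ''Head'' and is told who the previous ''Head'' was.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import deque


class HomeDirectory:
    # Hypothetical model: the home tracks only the current head of the sharing list.
    def __init__(self):
        self.head = None
        self.pending = deque()

    def submit(self, requester):
        # Requests queue up in arrival order.
        self.pending.append(requester)

    def service_next(self):
        # Service exactly one request: the requester becomes the new head
        # and learns the previous head (None if the list was empty).
        requester = self.pending.popleft()
        old_head = self.head
        self.head = requester
        return requester, old_head


home = HomeDirectory()
for node in ['A', 'B', 'C']:
    home.submit(node)
order = [home.service_next() for _ in range(3)]
# order == [('A', None), ('B', 'A'), ('C', 'B')]
```
&lt;br /&gt;
Each requester sees a well-defined previous head, so the order in which nodes join the list is fixed by the order in which the home services them.&lt;br /&gt;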
&lt;br /&gt;
=== Possible Race Conditions ===&lt;br /&gt;
==== Communication Delays ====&lt;br /&gt;
&lt;br /&gt;
==== Concurrent List Deletions ====&lt;br /&gt;
&lt;br /&gt;
==== Simultaneous Deletion and Invalidation ====&lt;br /&gt;
&lt;br /&gt;
== Summary ==&lt;br /&gt;
&lt;br /&gt;
== Definitions and Terms ==&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
[1] &amp;quot;IEEE Standard for Scalable Coherent Interface (SCI),&amp;quot; IEEE Std 1596-1992, 1993. doi: 10.1109/IEEESTD.1993.120366&lt;br /&gt;
URL: http://ieeexplore.ieee.org.proxied.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=347683&amp;amp;isnumber=8049&lt;/div&gt;</summary>
		<author><name>Eapotter</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch11_BB_EP&amp;diff=45021</id>
		<title>CSC/ECE 506 Spring 2011/ch11 BB EP</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch11_BB_EP&amp;diff=45021"/>
		<updated>2011-04-18T03:44:53Z</updated>

		<summary type="html">&lt;p&gt;Eapotter: /* Background */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;__TOC__&lt;br /&gt;
&lt;br /&gt;
== History ==&lt;br /&gt;
&lt;br /&gt;
== Background ==&lt;br /&gt;
The Scalable Coherent Interface (SCI) cache coherence protocol defines a set of states for memory, a set of states for cache lines, and a set of actions that transition between these states when processors access memory.&lt;br /&gt;
&lt;br /&gt;
Three sets of these attributes are defined for minimal, typical, and full applications:&lt;br /&gt;
&lt;br /&gt;
* '''Minimal''' - for ‘trivial but correct’ applications that require the presence of the memory in only one cache line.  It does not enable read-sharing, and is appropriate for small multi-processors or applications that don’t require significant sharing.&lt;br /&gt;
* '''Typical''' - enables read-sharing of a memory location and adds provisions for efficiency, such as DMA transfers, local caching of data, and error recovery.  This set adds an additional stable memory state (FRESH) and multiple cache states.  This set will be the focus of this article going forward.&lt;br /&gt;
* '''Full''' - adds support for pair-wise sharing, QOLB lock bit synchronization, and cleansing and washing of cache lines.&lt;br /&gt;
&lt;br /&gt;
== State Diagrams ==&lt;br /&gt;
=== Memory States ===&lt;br /&gt;
The memory states define the state of the memory block from the perspective of the home directory.  This 2-bit field is maintained in the memory tag by the home directory along with the pointer to the head of the sharing list (forwId).  This simple state model includes three stable states and one semi-stable state.&lt;br /&gt;
&lt;br /&gt;
=== Cache States ===&lt;br /&gt;
The cache-line states are maintained by each processor's cache-coherency controller.  This 7-bit field is stored in each cache line in the sharing list, along with the pointer to the next sharing-list node (forwId) and the previous sharing-list node (backId).  Seven bits enable up to 128 possible values, and SCI defines twenty-nine stable states for use in the minimal, typical, and full sets.&lt;br /&gt;
&lt;br /&gt;
Stable states are those cache states that exist when a memory transaction is not in progress.  Their names are derived from a combination of:&lt;br /&gt;
* the position of the node in the sharing list, such as ONLY, HEAD, MID, and TAIL&lt;br /&gt;
* the state of the data for that node, such as FRESH, CLEAN, DIRTY, VALID, STALE, etc.&lt;br /&gt;
&lt;br /&gt;
[[Image:cache_states.png|center]]&lt;br /&gt;
&lt;br /&gt;
== Race Conditions ==&lt;br /&gt;
In a distributed shared memory system with caching, the emergence of race conditions is extremely likely.  This is mainly due to the lack of a bus for serialization of actions.  It is further compounded by the problem of network errors and congestion.&lt;br /&gt;
&lt;br /&gt;
The early invalidation case from Section 11.4 in Solihin is an excellent example of a race condition that can arise in a distributed system.  Recall the diagram from the text, and the cache coherence actions, as shown below.&lt;br /&gt;
&lt;br /&gt;
[[Image:EarlyInValidationRace.png|400px|center]]&lt;br /&gt;
The circled actions are as follows:&lt;br /&gt;
# ''A'' sends a read request to ''Home''.&lt;br /&gt;
# ''Home'' replies with data (but the message gets delayed).&lt;br /&gt;
# ''B'' sends a write request to ''Home''.&lt;br /&gt;
# ''Home'' sends an invalidation to ''A'', and it arrives before the delayed data reply (ReplyD).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The SCI protocol has a way to handle this race condition, as well as many others, and the following sections will discuss how the SCI protocol design can prevent this race condition.&lt;br /&gt;
&lt;br /&gt;
=== Prevention in the SCI Protocol ===&lt;br /&gt;
Race conditions are almost non-existent in the SCI protocol, due primarily to the protocol's design.  The mechanisms behind this are described briefly below, followed by a discussion of what happens if a race condition does arise in the SCI protocol.&lt;br /&gt;
==== Atomic Transactions ====&lt;br /&gt;
SCI's primary method for preventing race conditions is having atomic transactions.  A transaction is defined as a set of sub-actions necessary to complete some requested action, such as reading from memory or writing to a variable.  Suppose, for example, that node ''A'' is in the middle of a transaction with node ''B''.  Node ''C'' then tries to make a request of node ''A''.  Node ''A'' will respond to node ''C'' that it is busy, telling node ''C'' to try again, as shown in the following diagram.&lt;br /&gt;
[[Image:AtomicBusy.png|center]]&lt;br /&gt;
Thus, in the Early Invalidation race, node ''A'' would be in the middle of a transaction, which would prevent node ''B'' from invalidating it.&lt;br /&gt;
&lt;br /&gt;
==== Head Node ====&lt;br /&gt;
Another method for preventing race conditions is to have the Head Node of a sharing list perform many of the coherence actions.  As a result, only one node is performing actions such as writes and invalidations of other sharers.  Since only one node is performing these actions, the possibility of concurrent actions is decreased.  If another node wants to write, for instance, it must become the Head Node of the sharing list for the cache line to which it wants to write, displacing the current Head Node as the node that can write.&lt;br /&gt;
&lt;br /&gt;
When simply sharing a read-only copy of a line, as when the memory is in the ''FRESH'' state, the Head Node is somewhat irrelevant.  All the sharing nodes have their own cached copy of the line.  If any node wants to write, including the Head Node, it must perform an additional action in order to do so.  Likewise, when a sharing list is sharing a line of data in the dirty, or ''GONE'', memory state, all the sharing nodes stay in the list with their cached line as long as the Head Node has not written to the line.  However, once the Head Node wants to write, it can write its data immediately, but it must then invalidate all other shared copies via the forward pointers in the sharing list.&lt;br /&gt;
&lt;br /&gt;
This, however, does not protect against a race condition where the memory is ''FRESH'', the Head Node wants to write, and another node also wants to join the list.  The Memory Access mechanism prevents this condition.&lt;br /&gt;
&lt;br /&gt;
==== Memory Access ====&lt;br /&gt;
The final way that the SCI protocol minimizes race conditions is by changing the directory structure.  In Solihin 11.4, there is a ''Home'' node which keeps track of the cache state and has a certain amount of knowledge about the system as a result.  In the SCI protocol, there is still a ''Home'' node of sorts, but this node is only responsible for keeping track of who the current ''Head'' node is.  Such a ''Home'' node is usually the node where the memory block physically resides.  When a node wants to access a block of memory, it sends a request to the ''Home'' node of that memory block, assuming it isn't already in the sharing list for that block.  This requirement necessarily serializes the order of access, as one access request is not serviced at the ''Home'' node until the previous request is finished, and the requests are processed in FIFO order.&lt;br /&gt;
&lt;br /&gt;
=== Possible Race Conditions ===&lt;br /&gt;
==== Communication Delays ====&lt;br /&gt;
&lt;br /&gt;
==== Concurrent List Deletions ====&lt;br /&gt;
&lt;br /&gt;
==== Simultaneous Deletion and Invalidation ====&lt;br /&gt;
&lt;br /&gt;
== Summary ==&lt;br /&gt;
&lt;br /&gt;
== Definitions and Terms ==&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
[1] &amp;quot;IEEE Standard for Scalable Coherent Interface (SCI),&amp;quot; IEEE Std 1596-1992, 1993. doi: 10.1109/IEEESTD.1993.120366&lt;br /&gt;
URL: http://ieeexplore.ieee.org.proxied.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=347683&amp;amp;isnumber=8049&lt;/div&gt;</summary>
		<author><name>Eapotter</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch11_BB_EP&amp;diff=45020</id>
		<title>CSC/ECE 506 Spring 2011/ch11 BB EP</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch11_BB_EP&amp;diff=45020"/>
		<updated>2011-04-18T03:42:55Z</updated>

		<summary type="html">&lt;p&gt;Eapotter: /* References */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;__TOC__&lt;br /&gt;
&lt;br /&gt;
== History ==&lt;br /&gt;
&lt;br /&gt;
== Background ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== State Diagrams ==&lt;br /&gt;
=== Memory States ===&lt;br /&gt;
The memory states define the state of the memory block from the perspective of the home directory.  This 2-bit field is maintained in the memory tag by the home directory along with the pointer to the head of the sharing list (forwId).  This simple state model includes three stable states and one semi-stable state.&lt;br /&gt;
&lt;br /&gt;
=== Cache States ===&lt;br /&gt;
The cache-line states are maintained by each processor's cache-coherency controller.  This 7-bit field is stored in each cache line in the sharing list, along with the pointer to the next sharing-list node (forwId) and the previous sharing-list node (backId).  Seven bits enable up to 128 possible values, and SCI defines twenty-nine stable states for use in the minimal, typical, and full sets.&lt;br /&gt;
&lt;br /&gt;
Stable states are those cache states that exist when a memory transaction is not in progress.  Their names are derived from a combination of:&lt;br /&gt;
* the position of the node in the sharing list, such as ONLY, HEAD, MID, and TAIL&lt;br /&gt;
* the state of the data for that node, such as FRESH, CLEAN, DIRTY, VALID, STALE, etc.&lt;br /&gt;
&lt;br /&gt;
[[Image:cache_states.png|center]]&lt;br /&gt;
&lt;br /&gt;
== Race Conditions ==&lt;br /&gt;
In a distributed shared memory system with caching, the emergence of race conditions is extremely likely.  This is mainly due to the lack of a bus for serialization of actions.  It is further compounded by the problem of network errors and congestion.&lt;br /&gt;
&lt;br /&gt;
The early invalidation case from Section 11.4 in Solihin is an excellent example of a race condition that can arise in a distributed system.  Recall the diagram from the text, and the cache coherence actions, as shown below.&lt;br /&gt;
&lt;br /&gt;
[[Image:EarlyInValidationRace.png|400px|center]]&lt;br /&gt;
The circled actions are as follows:&lt;br /&gt;
# ''A'' sends a read request to ''Home''.&lt;br /&gt;
# ''Home'' replies with data (but the message gets delayed).&lt;br /&gt;
# ''B'' sends a write request to ''Home''.&lt;br /&gt;
# ''Home'' sends an invalidation to ''A'', and it arrives before the delayed data reply (ReplyD).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The SCI protocol has a way to handle this race condition, as well as many others, and the following sections will discuss how the SCI protocol design can prevent this race condition.&lt;br /&gt;
&lt;br /&gt;
=== Prevention in the SCI Protocol ===&lt;br /&gt;
Race conditions are almost non-existent in the SCI protocol, due primarily to the protocol's design.  The mechanisms behind this are described briefly below, followed by a discussion of what happens if a race condition does arise in the SCI protocol.&lt;br /&gt;
==== Atomic Transactions ====&lt;br /&gt;
SCI's primary method for preventing race conditions is having atomic transactions.  A transaction is defined as a set of sub-actions necessary to complete some requested action, such as reading from memory or writing to a variable.  Suppose, for example, that node ''A'' is in the middle of a transaction with node ''B''.  Node ''C'' then tries to make a request of node ''A''.  Node ''A'' will respond to node ''C'' that it is busy, telling node ''C'' to try again, as shown in the following diagram.&lt;br /&gt;
[[Image:AtomicBusy.png|center]]&lt;br /&gt;
Thus, in the Early Invalidation race, node ''A'' would be in the middle of a transaction, which would prevent node ''B'' from invalidating it.&lt;br /&gt;
&lt;br /&gt;
==== Head Node ====&lt;br /&gt;
Another method for preventing race conditions is to have the Head Node of a sharing list perform many of the coherence actions.  As a result, only one node is performing actions such as writes and invalidations of other sharers.  Since only one node is performing these actions, the possibility of concurrent actions is decreased.  If another node wants to write, for instance, it must become the Head Node of the sharing list for the cache line to which it wants to write, displacing the current Head Node as the node that can write.&lt;br /&gt;
&lt;br /&gt;
When simply sharing a read-only copy of a line, as when the memory is in the ''FRESH'' state, the Head Node is somewhat irrelevant.  All the sharing nodes have their own cached copy of the line.  If any node wants to write, including the Head Node, it must perform an additional action in order to do so.  Likewise, when a sharing list is sharing a line of data in the dirty, or ''GONE'', memory state, all the sharing nodes stay in the list with their cached line as long as the Head Node has not written to the line.  However, once the Head Node wants to write, it can write its data immediately, but it must then invalidate all other shared copies via the forward pointers in the sharing list.&lt;br /&gt;
&lt;br /&gt;
This, however, does not protect against a race condition where the memory is ''FRESH'', the Head Node wants to write, and another node also wants to join the list.  The Memory Access mechanism prevents this condition.&lt;br /&gt;
&lt;br /&gt;
==== Memory Access ====&lt;br /&gt;
The final way that the SCI protocol minimizes race conditions is by changing the directory structure.  In Solihin 11.4, there is a ''Home'' node which keeps track of the cache state and has a certain amount of knowledge about the system as a result.  In the SCI protocol, there is still a ''Home'' node of sorts, but this node is only responsible for keeping track of who the current ''Head'' node is.  Such a ''Home'' node is usually the node where the memory block physically resides.  When a node wants to access a block of memory, it sends a request to the ''Home'' node of that memory block, assuming it isn't already in the sharing list for that block.  This requirement necessarily serializes the order of access, as one access request is not serviced at the ''Home'' node until the previous request is finished, and the requests are processed in FIFO order.&lt;br /&gt;
&lt;br /&gt;
=== Possible Race Conditions ===&lt;br /&gt;
==== Communication Delays ====&lt;br /&gt;
&lt;br /&gt;
==== Concurrent List Deletions ====&lt;br /&gt;
&lt;br /&gt;
==== Simultaneous Deletion and Invalidation ====&lt;br /&gt;
&lt;br /&gt;
== Summary ==&lt;br /&gt;
&lt;br /&gt;
== Definitions and Terms ==&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
[1] &amp;quot;IEEE Standard for Scalable Coherent Interface (SCI),&amp;quot; IEEE Std 1596-1992, 1993. doi: 10.1109/IEEESTD.1993.120366&lt;br /&gt;
URL: http://ieeexplore.ieee.org.proxied.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=347683&amp;amp;isnumber=8049&lt;/div&gt;</summary>
		<author><name>Eapotter</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch11_BB_EP&amp;diff=45019</id>
		<title>CSC/ECE 506 Spring 2011/ch11 BB EP</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch11_BB_EP&amp;diff=45019"/>
		<updated>2011-04-18T03:40:41Z</updated>

		<summary type="html">&lt;p&gt;Eapotter: /* Cache States */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;__TOC__&lt;br /&gt;
&lt;br /&gt;
== History ==&lt;br /&gt;
&lt;br /&gt;
== Background ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== State Diagrams ==&lt;br /&gt;
=== Memory States ===&lt;br /&gt;
The memory states define the state of the memory block from the perspective of the home directory.  This 2-bit field is maintained in the memory tag by the home directory along with the pointer to the head of the sharing list (forwId).  This simple state model includes three stable states and one semi-stable state.&lt;br /&gt;
&lt;br /&gt;
=== Cache States ===&lt;br /&gt;
The cache-line states are maintained by each processor's cache-coherency controller.  This 7-bit field is stored in each cache line in the sharing list, along with the pointer to the next sharing-list node (forwId) and the previous sharing-list node (backId).  Seven bits enable up to 128 possible values, and SCI defines twenty-nine stable states for use in the minimal, typical, and full sets.&lt;br /&gt;
&lt;br /&gt;
Stable states are those cache states that exist when a memory transaction is not in progress.  Their names are derived from a combination of:&lt;br /&gt;
* the position of the node in the sharing list, such as ONLY, HEAD, MID, and TAIL&lt;br /&gt;
* the state of the data for that node, such as FRESH, CLEAN, DIRTY, VALID, STALE, etc.&lt;br /&gt;
&lt;br /&gt;
[[Image:cache_states.png|center]]&lt;br /&gt;
&lt;br /&gt;
== Race Conditions ==&lt;br /&gt;
In a distributed shared memory system with caching, the emergence of race conditions is extremely likely.  This is mainly due to the lack of a bus for serialization of actions.  It is further compounded by the problem of network errors and congestion.&lt;br /&gt;
&lt;br /&gt;
The early invalidation case from Section 11.4 in Solihin is an excellent example of a race condition that can arise in a distributed system.  Recall the diagram from the text, and the cache coherence actions, as shown below.&lt;br /&gt;
&lt;br /&gt;
[[Image:EarlyInValidationRace.png|400px|center]]&lt;br /&gt;
The circled actions are as follows:&lt;br /&gt;
# ''A'' sends a read request to ''Home''.&lt;br /&gt;
# ''Home'' replies with data (but the message gets delayed).&lt;br /&gt;
# ''B'' sends a write request to ''Home''.&lt;br /&gt;
# ''Home'' sends an invalidation to ''A'', and it arrives before the delayed data reply (ReplyD).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The SCI protocol has a way to handle this race condition, as well as many others, and the following sections will discuss how the SCI protocol design can prevent this race condition.&lt;br /&gt;
&lt;br /&gt;
=== Prevention in the SCI Protocol ===&lt;br /&gt;
Race conditions are almost non-existent in the SCI protocol, due primarily to the protocol's design.  The mechanisms behind this are described briefly below, followed by a discussion of what happens if a race condition does arise in the SCI protocol.&lt;br /&gt;
==== Atomic Transactions ====&lt;br /&gt;
SCI's primary method for preventing race conditions is having atomic transactions.  A transaction is defined as a set of sub-actions necessary to complete some requested action, such as reading from memory or writing to a variable.  Suppose, for example, that node ''A'' is in the middle of a transaction with node ''B''.  Node ''C'' then tries to make a request of node ''A''.  Node ''A'' will respond to node ''C'' that it is busy, telling node ''C'' to try again, as shown in the following diagram.&lt;br /&gt;
[[Image:AtomicBusy.png|center]]&lt;br /&gt;
Thus, in the Early Invalidation race, node ''A'' would be in the middle of a transaction, which would prevent node ''B'' from invalidating it.&lt;br /&gt;
&lt;br /&gt;
==== Head Node ====&lt;br /&gt;
Another method for preventing race conditions is to have the Head Node of a sharing list perform many of the coherence actions.  As a result, only one node is performing actions such as writes and invalidations of other sharers.  Since only one node is performing these actions, the possibility of concurrent actions is decreased.  If another node wants to write, for instance, it must become the Head Node of the sharing list for the cache line to which it wants to write, displacing the current Head Node as the node that can write.&lt;br /&gt;
&lt;br /&gt;
When simply sharing a read-only copy of a line, as when the memory is in the ''FRESH'' state, the Head Node is somewhat irrelevant.  All the sharing nodes have their own cached copy of the line.  If any node wants to write, including the Head Node, it must perform an additional action in order to do so.  Likewise, when a sharing list is sharing a line of data in the dirty, or ''GONE'', memory state, all the sharing nodes stay in the list with their cached line as long as the Head Node has not written to the line.  However, once the Head Node wants to write, it can write its data immediately, but it must then invalidate all other shared copies via the forward pointers in the sharing list.&lt;br /&gt;
&lt;br /&gt;
This, however, does not protect against a race condition where the memory is ''FRESH'', the Head Node wants to write, and another node also wants to join the list.  The Memory Access mechanism prevents this condition.&lt;br /&gt;
&lt;br /&gt;
==== Memory Access ====&lt;br /&gt;
The final way that the SCI protocol minimizes race conditions is by changing the directory structure.  In Solihin 11.4, there is a ''Home'' node which keeps track of the cache state and has a certain amount of knowledge about the system as a result.  In the SCI protocol, there is still a ''Home'' node of sorts, but this node is only responsible for keeping track of who the current ''Head'' node is.  Such a ''Home'' node is usually the node where the memory block physically resides.  When a node wants to access a block of memory, it sends a request to the ''Home'' node of that memory block, assuming it isn't already in the sharing list for that block.  This requirement necessarily serializes the order of access, as one access request is not serviced at the ''Home'' node until the previous request is finished, and the requests are processed in FIFO order.&lt;br /&gt;
&lt;br /&gt;
=== Possible Race Conditions ===&lt;br /&gt;
==== Communication Delays ====&lt;br /&gt;
&lt;br /&gt;
==== Concurrent List Deletions ====&lt;br /&gt;
&lt;br /&gt;
==== Simultaneous Deletion and Invalidation ====&lt;br /&gt;
&lt;br /&gt;
== Summary ==&lt;br /&gt;
&lt;br /&gt;
== Definitions and Terms ==&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;/div&gt;</summary>
		<author><name>Eapotter</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch11_BB_EP&amp;diff=45018</id>
		<title>CSC/ECE 506 Spring 2011/ch11 BB EP</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch11_BB_EP&amp;diff=45018"/>
		<updated>2011-04-18T03:40:07Z</updated>

		<summary type="html">&lt;p&gt;Eapotter: /* Memory States */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;__TOC__&lt;br /&gt;
&lt;br /&gt;
== History ==&lt;br /&gt;
&lt;br /&gt;
== Background ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== State Diagrams ==&lt;br /&gt;
=== Memory States ===&lt;br /&gt;
The memory states define the state of the memory block from the perspective of the home directory.  This 2-bit field is maintained in the memory tag by the home directory along with the pointer to the head of the sharing list (forwId).  This simple state model includes three stable states and one semi-stable state.&lt;br /&gt;
&lt;br /&gt;
=== Cache States ===&lt;br /&gt;
[[Image:cache_states.png|center]]&lt;br /&gt;
&lt;br /&gt;
== Race Conditions ==&lt;br /&gt;
In a distributed shared memory system with caching, the emergence of race conditions is extremely likely.  This is mainly due to the lack of a bus for serialization of actions.  It is further compounded by the problem of network errors and congestion.&lt;br /&gt;
&lt;br /&gt;
The early invalidation case from Section 11.4 in Solihin is an excellent example of a race condition that can arise in a distributed system.  Recall the diagram from the text, and the cache coherence actions, as shown below.&lt;br /&gt;
&lt;br /&gt;
[[Image:EarlyInValidationRace.png|400px|center]]&lt;br /&gt;
The circled actions are as follows:&lt;br /&gt;
# ''A'' sends a read request to ''Home''.&lt;br /&gt;
# ''Home'' replies with data (but the message gets delayed).&lt;br /&gt;
# ''B'' sends a write request to ''Home''.&lt;br /&gt;
# ''Home'' sends an invalidation to ''A'', and it arrives before the delayed data reply (ReplyD).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The SCI protocol has a way to handle this race condition, as well as many others, and the following sections will discuss how the SCI protocol design can prevent this race condition.&lt;br /&gt;
&lt;br /&gt;
=== Prevention in the SCI Protocol ===&lt;br /&gt;
Race conditions are almost non-existent in the SCI protocol, due primarily to the protocol's design.  The mechanisms behind this are described briefly below, followed by a discussion of what happens if a race condition does arise in the SCI protocol.&lt;br /&gt;
==== Atomic Transactions ====&lt;br /&gt;
SCI's primary method for preventing race conditions is having atomic transactions.  A transaction is defined as a set of sub-actions necessary to complete some requested action, such as reading from memory or writing to a variable.  Suppose, for example, that node ''A'' is in the middle of a transaction with node ''B''.  Node ''C'' then tries to make a request of node ''A''.  Node ''A'' will respond to node ''C'' that it is busy, telling node ''C'' to try again, as shown in the following diagram.&lt;br /&gt;
[[Image:AtomicBusy.png|center]]&lt;br /&gt;
Thus, in the Early Invalidation race, node ''A'' would be in the middle of a transaction, which would prevent node ''B'' from invalidating it.&lt;br /&gt;
&lt;br /&gt;
==== Head Node ====&lt;br /&gt;
Another method for preventing race conditions is to have the Head Node of a sharing list perform many of the coherence actions.  As a result, only one node is performing actions such as writes and invalidations of other sharers.  Since only one node is performing these actions, the possibility of concurrent actions is decreased.  If another node wants to write, for instance, it must become the Head Node of the sharing list for the cache line to which it wants to write, displacing the current Head Node as the node that can write.&lt;br /&gt;
&lt;br /&gt;
When simply sharing a read-only copy of a line, as when the memory is in the ''FRESH'' state, the Head Node is somewhat irrelevant.  All the sharing nodes have their own cached copy of the line.  If any node wants to write, including the Head Node, it must perform an additional action in order to do so.  Likewise, when a sharing list is sharing a line of data in the dirty, or ''GONE'', memory state, all the sharing nodes stay in the list with their cached line as long as the Head Node has not written to the line.  However, once the Head Node wants to write, it can write its data immediately, but it must then invalidate all other shared copies via the forward pointers in the sharing list.&lt;br /&gt;
&lt;br /&gt;
This, however, does not protect against a race condition where the memory is ''FRESH'', the Head Node wants to write, and another node also wants to join the list.  The Memory Access mechanism prevents this condition.&lt;br /&gt;
&lt;br /&gt;
==== Memory Access ====&lt;br /&gt;
The final way that the SCI protocol minimizes race conditions is by changing the directory structure.  In Solihin 11.4, there is a ''Home'' node which keeps track of the cache state and has a certain amount of knowledge about the system as a result.  In the SCI protocol, there is still a ''Home'' node of sorts, but this node is only responsible for keeping track of who the current ''Head'' node is.  Such a ''Home'' node is usually the node where the memory block physically resides.  When a node wants to access a block of memory, it sends a request to the ''Home'' node of that memory block, assuming it isn't already in the sharing list for that block.  This requirement necessarily serializes the order of access, as one access request is not serviced at the ''Home'' node until the previous request is finished, and the requests are processed in FIFO order.&lt;br /&gt;
&lt;br /&gt;
=== Possible Race Conditions ===&lt;br /&gt;
==== Communication Delays ====&lt;br /&gt;
&lt;br /&gt;
==== Concurrent List Deletions ====&lt;br /&gt;
&lt;br /&gt;
==== Simultaneous Deletion and Invalidation ====&lt;br /&gt;
&lt;br /&gt;
== Summary ==&lt;br /&gt;
&lt;br /&gt;
== Definitions and Terms ==&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;/div&gt;</summary>
		<author><name>Eapotter</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch11_BB_EP&amp;diff=45016</id>
		<title>CSC/ECE 506 Spring 2011/ch11 BB EP</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch11_BB_EP&amp;diff=45016"/>
		<updated>2011-04-18T03:38:33Z</updated>

		<summary type="html">&lt;p&gt;Eapotter: /* Memory */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;__TOC__&lt;br /&gt;
&lt;br /&gt;
== History ==&lt;br /&gt;
&lt;br /&gt;
== Background ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== State Diagrams ==&lt;br /&gt;
=== Memory States ===&lt;br /&gt;
&lt;br /&gt;
=== Cache States ===&lt;br /&gt;
[[Image:cache_states.png|center]]&lt;br /&gt;
&lt;br /&gt;
== Race Conditions ==&lt;br /&gt;
In a distributed shared memory system with caching, the emergence of race conditions is extremely likely.  This is mainly due to the lack of a bus for serialization of actions.  It is further compounded by the problem of network errors and congestion.&lt;br /&gt;
&lt;br /&gt;
The early invalidation case from Section 11.4 in Solihin is an excellent example of a race condition that can arise in a distributed system.  Recall the diagram from the text, and the cache coherence actions, as shown below.&lt;br /&gt;
&lt;br /&gt;
[[Image:EarlyInValidationRace.png|400px|center]]&lt;br /&gt;
The circled actions are as follows:&lt;br /&gt;
# ''A'' sends a read request to ''Home''.&lt;br /&gt;
# ''Home'' replies with data (but the message gets delayed).&lt;br /&gt;
# ''B'' sends a write request to ''Home''.&lt;br /&gt;
# ''Home'' sends an invalidation to ''A'', and it arrives before the ReplyD message.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The SCI protocol is designed to handle this race condition, as well as many others.  The following sections discuss how the protocol's design prevents it.&lt;br /&gt;
&lt;br /&gt;
=== Prevention in the SCI Protocol ===&lt;br /&gt;
Race conditions are almost non-existent in the SCI protocol, due primarily to the protocol's design.  This section first describes how the design accomplishes this, and then shows what happens if the early invalidation condition arises in the SCI protocol.&lt;br /&gt;
==== Atomic Transactions ====&lt;br /&gt;
SCI's primary method for preventing race conditions is having atomic transactions.  A transaction is defined as a set of sub-actions necessary to complete some requested action, such as reading from memory or writing to a variable.  Suppose, for example, that node ''A'' is in the middle of a transaction with node ''B''.  Node ''C'' then tries to make a request of node ''A''.  Node ''A'' will respond to node ''C'' that it is busy, telling node ''C'' to try again, as shown in the following diagram.&lt;br /&gt;
[[Image:AtomicBusy.png|center]]&lt;br /&gt;
Thus, in the Early Invalidation race, node ''A'' would be in the middle of a transaction, which would prevent node ''B'' from invalidating it.&lt;br /&gt;
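&lt;br /&gt;
The busy-and-retry behavior described above can be sketched as follows.  This is a minimal model under stated assumptions: the ''Node'' class, the ''send_with_retry'' helper, and the ''on_busy'' callback (which stands in for time passing while node ''A'' finishes its transaction) are all hypothetical names, not SCI-defined operations.&lt;br /&gt;

```python
class Node:
    """Hypothetical node that refuses new requests while mid-transaction."""

    def __init__(self, name):
        self.name = name
        self.in_transaction = False

    def begin_transaction(self):
        self.in_transaction = True

    def end_transaction(self):
        self.in_transaction = False

    def handle_request(self, sender):
        """Return 'BUSY' while a transaction is open, otherwise 'OK'."""
        if self.in_transaction:
            return "BUSY"
        return "OK"


def send_with_retry(target, sender, max_attempts, on_busy=None):
    """Retry until the target accepts the request or attempts run out."""
    replies = []
    for _ in range(max_attempts):
        reply = target.handle_request(sender)
        replies.append(reply)
        if reply == "OK":
            break
        if on_busy is not None:
            on_busy()     # assumption: models the target finishing its work
    return replies


a = Node("A")
a.begin_transaction()     # A is in the middle of a transaction with B
# C's first attempt is refused; A then finishes, and C's retry succeeds.
replies = send_with_retry(a, "C", max_attempts=3, on_busy=a.end_transaction)
print(replies)   # ['BUSY', 'OK']
```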
&lt;br /&gt;
==== Head Node ====&lt;br /&gt;
Another method for preventing race conditions is to have the Head Node of a sharing list perform many of the coherence actions.  As a result, only one node is performing actions such as writes and invalidations of other sharers.  Since only one node is performing these actions, the possibility of concurrent actions is decreased.  If another node wants to write, for instance, it must become the Head Node of the sharing list for the cache line to which it wants to write, displacing the current Head Node as the node that can write.&lt;br /&gt;
&lt;br /&gt;
When a line is simply shared read-only, as when the memory is in the ''FRESH'' state, the Head Node plays little special role.  All the sharing nodes have their own cached copy of the line.  If any node wants to write, including the Head Node, it must perform an additional action in order to do so.  Likewise, when a sharing list shares a line in the dirty, or ''GONE'', memory state, all the sharing nodes stay in the list with their cached copies as long as the Head Node has not written to the line.  However, when the Head Node wants to write, it can write its data immediately, but it must then invalidate all other shared copies via the forward pointers in the sharing list.&lt;br /&gt;
&lt;br /&gt;
This, however, does not protect against a race condition where the memory is ''FRESH'', the Head Node wants to write, and another node also wants to join the list.  The Memory Access mechanism prevents this condition.&lt;br /&gt;
&lt;br /&gt;
==== Memory Access ====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Possible Race Conditions ===&lt;br /&gt;
==== Communication Delays ====&lt;br /&gt;
&lt;br /&gt;
==== Concurrent List Deletions ====&lt;br /&gt;
&lt;br /&gt;
==== Simultaneous Deletion and Invalidation ====&lt;br /&gt;
&lt;br /&gt;
== Summary ==&lt;br /&gt;
&lt;br /&gt;
== Definitions and Terms ==&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;/div&gt;</summary>
		<author><name>Eapotter</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch11_BB_EP&amp;diff=45015</id>
		<title>CSC/ECE 506 Spring 2011/ch11 BB EP</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch11_BB_EP&amp;diff=45015"/>
		<updated>2011-04-18T03:38:21Z</updated>

		<summary type="html">&lt;p&gt;Eapotter: /* Processors */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;__TOC__&lt;br /&gt;
&lt;br /&gt;
== History ==&lt;br /&gt;
&lt;br /&gt;
== Background ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== State Diagrams ==&lt;br /&gt;
=== Memory ===&lt;br /&gt;
&lt;br /&gt;
=== Cache States ===&lt;br /&gt;
[[Image:cache_states.png|center]]&lt;br /&gt;
&lt;br /&gt;
== Race Conditions ==&lt;br /&gt;
In a distributed shared memory system with caching, the emergence of race conditions is extremely likely.  This is mainly due to the lack of a bus for serialization of actions.  It is further compounded by the problem of network errors and congestion.&lt;br /&gt;
&lt;br /&gt;
The early invalidation case from Section 11.4 in Solihin is an excellent example of a race condition that can arise in a distributed system.  Recall the diagram from the text, and the cache coherence actions, as shown below.&lt;br /&gt;
&lt;br /&gt;
[[Image:EarlyInValidationRace.png|400px|center]]&lt;br /&gt;
The circled actions are as follows:&lt;br /&gt;
# ''A'' sends a read request to ''Home''.&lt;br /&gt;
# ''Home'' replies with data (but the message gets delayed).&lt;br /&gt;
# ''B'' sends a write request to ''Home''.&lt;br /&gt;
# ''Home'' sends an invalidation to ''A'', and it arrives before the ReplyD message.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The SCI protocol is designed to handle this race condition, as well as many others.  The following sections discuss how the protocol's design prevents it.&lt;br /&gt;
&lt;br /&gt;
=== Prevention in the SCI Protocol ===&lt;br /&gt;
Race conditions are almost non-existent in the SCI protocol, due primarily to the protocol's design.  This section first describes how the design accomplishes this, and then shows what happens if the early invalidation condition arises in the SCI protocol.&lt;br /&gt;
==== Atomic Transactions ====&lt;br /&gt;
SCI's primary method for preventing race conditions is having atomic transactions.  A transaction is defined as a set of sub-actions necessary to complete some requested action, such as reading from memory or writing to a variable.  Suppose, for example, that node ''A'' is in the middle of a transaction with node ''B''.  Node ''C'' then tries to make a request of node ''A''.  Node ''A'' will respond to node ''C'' that it is busy, telling node ''C'' to try again, as shown in the following diagram.&lt;br /&gt;
[[Image:AtomicBusy.png|center]]&lt;br /&gt;
Thus, in the Early Invalidation race, node ''A'' would be in the middle of a transaction, which would prevent node ''B'' from invalidating it.&lt;br /&gt;
&lt;br /&gt;
==== Head Node ====&lt;br /&gt;
Another method for preventing race conditions is to have the Head Node of a sharing list perform many of the coherence actions.  As a result, only one node is performing actions such as writes and invalidations of other sharers.  Since only one node is performing these actions, the possibility of concurrent actions is decreased.  If another node wants to write, for instance, it must become the Head Node of the sharing list for the cache line to which it wants to write, displacing the current Head Node as the node that can write.&lt;br /&gt;
&lt;br /&gt;
When a line is simply shared read-only, as when the memory is in the ''FRESH'' state, the Head Node plays little special role.  All the sharing nodes have their own cached copy of the line.  If any node wants to write, including the Head Node, it must perform an additional action in order to do so.  Likewise, when a sharing list shares a line in the dirty, or ''GONE'', memory state, all the sharing nodes stay in the list with their cached copies as long as the Head Node has not written to the line.  However, when the Head Node wants to write, it can write its data immediately, but it must then invalidate all other shared copies via the forward pointers in the sharing list.&lt;br /&gt;
&lt;br /&gt;
This, however, does not protect against a race condition where the memory is ''FRESH'', the Head Node wants to write, and another node also wants to join the list.  The Memory Access mechanism prevents this condition.&lt;br /&gt;
&lt;br /&gt;
==== Memory Access ====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Possible Race Conditions ===&lt;br /&gt;
==== Communication Delays ====&lt;br /&gt;
&lt;br /&gt;
==== Concurrent List Deletions ====&lt;br /&gt;
&lt;br /&gt;
==== Simultaneous Deletion and Invalidation ====&lt;br /&gt;
&lt;br /&gt;
== Summary ==&lt;br /&gt;
&lt;br /&gt;
== Definitions and Terms ==&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;/div&gt;</summary>
		<author><name>Eapotter</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch11_BB_EP&amp;diff=45014</id>
		<title>CSC/ECE 506 Spring 2011/ch11 BB EP</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch11_BB_EP&amp;diff=45014"/>
		<updated>2011-04-18T03:37:44Z</updated>

		<summary type="html">&lt;p&gt;Eapotter: /* Processors */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;__TOC__&lt;br /&gt;
&lt;br /&gt;
== History ==&lt;br /&gt;
&lt;br /&gt;
== Background ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== State Diagrams ==&lt;br /&gt;
=== Memory ===&lt;br /&gt;
&lt;br /&gt;
=== Processors ===&lt;br /&gt;
[[Image:cache_states.png|center]]&lt;br /&gt;
&lt;br /&gt;
== Race Conditions ==&lt;br /&gt;
In a distributed shared memory system with caching, the emergence of race conditions is extremely likely.  This is mainly due to the lack of a bus for serialization of actions.  It is further compounded by the problem of network errors and congestion.&lt;br /&gt;
&lt;br /&gt;
The early invalidation case from Section 11.4 in Solihin is an excellent example of a race condition that can arise in a distributed system.  Recall the diagram from the text, and the cache coherence actions, as shown below.&lt;br /&gt;
&lt;br /&gt;
[[Image:EarlyInValidationRace.png|400px|center]]&lt;br /&gt;
The circled actions are as follows:&lt;br /&gt;
# ''A'' sends a read request to ''Home''.&lt;br /&gt;
# ''Home'' replies with data (but the message gets delayed).&lt;br /&gt;
# ''B'' sends a write request to ''Home''.&lt;br /&gt;
# ''Home'' sends an invalidation to ''A'', and it arrives before the ReplyD message.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The SCI protocol is designed to handle this race condition, as well as many others.  The following sections discuss how the protocol's design prevents it.&lt;br /&gt;
&lt;br /&gt;
=== Prevention in the SCI Protocol ===&lt;br /&gt;
Race conditions are almost non-existent in the SCI protocol, due primarily to the protocol's design.  This section first describes how the design accomplishes this, and then shows what happens if the early invalidation condition arises in the SCI protocol.&lt;br /&gt;
==== Atomic Transactions ====&lt;br /&gt;
SCI's primary method for preventing race conditions is having atomic transactions.  A transaction is defined as a set of sub-actions necessary to complete some requested action, such as reading from memory or writing to a variable.  Suppose, for example, that node ''A'' is in the middle of a transaction with node ''B''.  Node ''C'' then tries to make a request of node ''A''.  Node ''A'' will respond to node ''C'' that it is busy, telling node ''C'' to try again, as shown in the following diagram.&lt;br /&gt;
[[Image:AtomicBusy.png|center]]&lt;br /&gt;
Thus, in the Early Invalidation race, node ''A'' would be in the middle of a transaction, which would prevent node ''B'' from invalidating it.&lt;br /&gt;
&lt;br /&gt;
==== Head Node ====&lt;br /&gt;
Another method for preventing race conditions is to have the Head Node of a sharing list perform many of the coherence actions.  As a result, only one node is performing actions such as writes and invalidations of other sharers.  Since only one node is performing these actions, the possibility of concurrent actions is decreased.  If another node wants to write, for instance, it must become the Head Node of the sharing list for the cache line to which it wants to write, displacing the current Head Node as the node that can write.&lt;br /&gt;
&lt;br /&gt;
When a line is simply shared read-only, as when the memory is in the ''FRESH'' state, the Head Node plays little special role.  All the sharing nodes have their own cached copy of the line.  If any node wants to write, including the Head Node, it must perform an additional action in order to do so.  Likewise, when a sharing list shares a line in the dirty, or ''GONE'', memory state, all the sharing nodes stay in the list with their cached copies as long as the Head Node has not written to the line.  However, when the Head Node wants to write, it can write its data immediately, but it must then invalidate all other shared copies via the forward pointers in the sharing list.&lt;br /&gt;
&lt;br /&gt;
This, however, does not protect against a race condition where the memory is ''FRESH'', the Head Node wants to write, and another node also wants to join the list.  The Memory Access mechanism prevents this condition.&lt;br /&gt;
&lt;br /&gt;
==== Memory Access ====&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=== Possible Race Conditions ===&lt;br /&gt;
==== Communication Delays ====&lt;br /&gt;
&lt;br /&gt;
==== Concurrent List Deletions ====&lt;br /&gt;
&lt;br /&gt;
==== Simultaneous Deletion and Invalidation ====&lt;br /&gt;
&lt;br /&gt;
== Summary ==&lt;br /&gt;
&lt;br /&gt;
== Definitions and Terms ==&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;/div&gt;</summary>
		<author><name>Eapotter</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=File:Cache_states.png&amp;diff=45013</id>
		<title>File:Cache states.png</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=File:Cache_states.png&amp;diff=45013"/>
		<updated>2011-04-18T03:34:44Z</updated>

		<summary type="html">&lt;p&gt;Eapotter: Cache State Diagram&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Cache State Diagram&lt;/div&gt;</summary>
		<author><name>Eapotter</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch11_BB_EP&amp;diff=44977</id>
		<title>CSC/ECE 506 Spring 2011/ch11 BB EP</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch11_BB_EP&amp;diff=44977"/>
		<updated>2011-04-16T19:42:09Z</updated>

		<summary type="html">&lt;p&gt;Eapotter: /* References */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;__TOC__&lt;br /&gt;
&lt;br /&gt;
= Overview =&lt;br /&gt;
= State Diagram =&lt;br /&gt;
= Race Conditions =&lt;br /&gt;
= References =&lt;br /&gt;
[1]:&amp;lt;br /&amp;gt;&lt;/div&gt;</summary>
		<author><name>Eapotter</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch11_BB_EP&amp;diff=44976</id>
		<title>CSC/ECE 506 Spring 2011/ch11 BB EP</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch11_BB_EP&amp;diff=44976"/>
		<updated>2011-04-16T19:02:42Z</updated>

		<summary type="html">&lt;p&gt;Eapotter: /* References */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;__TOC__&lt;br /&gt;
&lt;br /&gt;
= Overview =&lt;br /&gt;
= State Diagram =&lt;br /&gt;
= Race Conditions =&lt;br /&gt;
= References =&lt;br /&gt;
[1]: &amp;quot;IEEE Standard for Scalable Coherent Interface (SCI).,&amp;quot; IEEE Std 1596-1992 , vol., no., pp.i, 1993.  doi: 10.1109/IEEESTD.1993.120366&lt;br /&gt;
URL: http://ieeexplore.ieee.org.proxied.lib.ncsu.edu/stamp/stamp.jsp?tp=&amp;amp;arnumber=347683&amp;amp;isnumber=8049&amp;lt;br /&amp;gt;&lt;/div&gt;</summary>
		<author><name>Eapotter</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch11_BB_EP&amp;diff=44940</id>
		<title>CSC/ECE 506 Spring 2011/ch11 BB EP</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch11_BB_EP&amp;diff=44940"/>
		<updated>2011-04-15T15:02:53Z</updated>

		<summary type="html">&lt;p&gt;Eapotter: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;__TOC__&lt;br /&gt;
&lt;br /&gt;
= Overview =&lt;br /&gt;
= State Diagram =&lt;br /&gt;
= Race Conditions =&lt;br /&gt;
= References =&lt;br /&gt;
[1]: http://www.techarp.com/showarticle.aspx?artno=337 &amp;lt;br /&amp;gt;&lt;/div&gt;</summary>
		<author><name>Eapotter</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch11_BB_EP&amp;diff=44939</id>
		<title>CSC/ECE 506 Spring 2011/ch11 BB EP</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch11_BB_EP&amp;diff=44939"/>
		<updated>2011-04-15T15:02:15Z</updated>

		<summary type="html">&lt;p&gt;Eapotter: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;__TOC__&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
== State Diagram ==&lt;br /&gt;
== Race Conditions ==&lt;br /&gt;
= References =&lt;br /&gt;
[1]: http://www.techarp.com/showarticle.aspx?artno=337 &amp;lt;br /&amp;gt;&lt;/div&gt;</summary>
		<author><name>Eapotter</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch11_BB_EP&amp;diff=44938</id>
		<title>CSC/ECE 506 Spring 2011/ch11 BB EP</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch11_BB_EP&amp;diff=44938"/>
		<updated>2011-04-15T15:00:30Z</updated>

		<summary type="html">&lt;p&gt;Eapotter: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;__TOC__&lt;br /&gt;
&lt;br /&gt;
== Overview ==&lt;br /&gt;
== State Diagram ==&lt;br /&gt;
== Race Conditions ==&lt;/div&gt;</summary>
		<author><name>Eapotter</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch11_BB_EP&amp;diff=44937</id>
		<title>CSC/ECE 506 Spring 2011/ch11 BB EP</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch11_BB_EP&amp;diff=44937"/>
		<updated>2011-04-15T15:00:08Z</updated>

		<summary type="html">&lt;p&gt;Eapotter: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Overview ==&lt;br /&gt;
== State Diagram ==&lt;br /&gt;
== Race Conditions ==&lt;/div&gt;</summary>
		<author><name>Eapotter</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=ECE506_Main_Page&amp;diff=44936</id>
		<title>ECE506 Main Page</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=ECE506_Main_Page&amp;diff=44936"/>
		<updated>2011-04-15T14:51:26Z</updated>

		<summary type="html">&lt;p&gt;Eapotter: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page serves as a portal for all wiki material related to CSC506 and ECE506. Link to any new wiki pages from this page, and add links to any current pages.&lt;br /&gt;
&lt;br /&gt;
=Supplements to Solihin Text=&lt;br /&gt;
&lt;br /&gt;
Post links to the textbook supplements in this section.&lt;br /&gt;
*Chapter 2 [[CSC/ECE 506 Spring 2011/ch2 dm | CSC/ECE 506 Spring 2011/ch2 dm]]&lt;br /&gt;
*Chapter 2 [[Parallel_Programming_Models | Parallel Programming Models]]&lt;br /&gt;
*Chapter 2 (Still being revised) [[CSC/ECE 506 Spring 2011/ch2 cl | CSC/ECE 506 Spring 2011/ch2 cl]]&lt;br /&gt;
*Chapter 2a [[ CSC/ECE 506 Spring 2011/ch2a mc | Current Data-Parallel Architectures ]]&lt;br /&gt;
*Chapter 3 (Final Revision) [[ CSC/ECE 506 Spring 2011/ch3 ab | Parallel Architecture Mechanisms and Programming Models ]]&lt;br /&gt;
*Chapter 4a[[ CSC/ECE 506 Spring 2011/ch4a ob | Parallelization of Nelder Mead Algorithm ]]&lt;br /&gt;
*Chapter 4a (Under Construction) [[ CSC/ECE_506_Spring_2011/ch4a_bm | Parallelization of Algorithms  ]]&lt;br /&gt;
*Chapter 4a [[ CSC/ECE 506 Spring 2011/ch4a zz | CSC/ECE 506 Spring 2011/ch4a zz ]]&lt;br /&gt;
*Chapter 6a (Under Construction) [[ CSC/ECE 506 Spring 2011/ch6a jp | CSC/ECE 506 Spring 2011/ch6a jp ]]&lt;br /&gt;
*Chapter 6a (Under Construction) [[ CSC/ECE 506 Spring 2011/ch6a ep | CSC/ECE 506 Spring 2011/ch6a ep ]]&lt;br /&gt;
*Chapter 6b (Ready for First Review) [[CSC/ECE 506 Spring 2011/ch6b ab | CSC/ECE 506 Spring 2011/ch6b ab]]&lt;br /&gt;
*Chapter 7 (Under Construction) [[CSC/ECE 506 Spring 2011/ch7 jp | CSC/ECE 506 Spring 2011/ch7 jp]]&lt;br /&gt;
*Chapter 8 (Under Construction) [[CSC/ECE 506 Spring 2011/ch8 mc | CSC/ECE 506 Spring 2011/ch8 mc]]&lt;br /&gt;
*Chapter 10 (Under Construction) [[CSC/ECE 506 Spring 2011/ch10 sb | CSC/ECE 506 Spring 2011/ch10 sb]]&lt;br /&gt;
*Chapter 10a [[CSC/ECE_506_Spring_2011/ch10a_dc | CSC/ECE_506_Spring_2011/ch10a_dc]]&lt;br /&gt;
*Chapter 11 [[CSC/ECE_506_Spring_2011/ch11_BB_EP | Chapter 11 Supplement]]&lt;/div&gt;</summary>
		<author><name>Eapotter</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch6a_ep&amp;diff=44351</id>
		<title>CSC/ECE 506 Spring 2011/ch6a ep</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch6a_ep&amp;diff=44351"/>
		<updated>2011-03-08T04:23:37Z</updated>

		<summary type="html">&lt;p&gt;Eapotter: /* Recent Architectures and their Cache Characteristics */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;INTRODUCTION TO MEMORY HIERARCHY ORGANIZATION &amp;lt;br/&amp;gt;&lt;br /&gt;
Write-Miss Policies and Prefetching&lt;br /&gt;
&lt;br /&gt;
__TOC__&lt;br /&gt;
&lt;br /&gt;
Write miss policies and prefetching are two strategies that are used by multiprocessors to achieve optimal performance for memory accesses.  Write miss policies can help eliminate write misses and therefore reduce bus traffic.  This reduces the wait-time of all processors sharing the bus.  Prefetching assures that a CPU has data blocks in cache to be read when processing large data files and streaming data.  When CPUs or cores share a cache, prefetched data in a shared cache is available to all processes on those cores processing the data.&lt;br /&gt;
&lt;br /&gt;
This article begins by highlighting the variety of multicore processors on the market today that have hierarchical memory structures and shared caches.  It then explores the write miss policies and prefetching techniques that these multiprocessors can use to take advantage of these architectures.&lt;br /&gt;
&lt;br /&gt;
= Recent Architectures and their Cache Characteristics =&lt;br /&gt;
&lt;br /&gt;
Cache management is impacted by three characteristics of modern processor architectures:  multilevel cache hierarchies, multi-core and multiprocessor chip and systems designs, and shared vs. private caches.&lt;br /&gt;
&lt;br /&gt;
Multi-level cache hierarchies are used by most contemporary chip designs so that memory access can keep pace with CPU speeds, which are advancing more rapidly.  Two-level and three-level cache hierarchies are common.  L1 typically ranges from 16 KB to 64 KB and provides access in 2-4 cycles.  L2 typically ranges from 512 KB to as much as 8 MB and provides access in 6-15 cycles.  L3 typically ranges from 4 MB to 32 MB and provides access in 30-50 cycles.[[#References|[8]]]&lt;br /&gt;
&lt;br /&gt;
Multi-core and multiprocessor designs place additional constraints on cache management by requiring it to consider all concurrently running processes in cache coherency, replenishment, fetching, and other management functions.&lt;br /&gt;
&lt;br /&gt;
In multiprocessors with multiple levels of cache, certain cache levels may be shared by processors while others are private to each processor.  Typically, lower-level caches are private and higher-level caches may or may not be shared.  Notice that none of the examples in the table below has a shared L1 cache.  Cache management functions must consider both shared and private caches when reading and writing data from memory.&lt;br /&gt;
&lt;br /&gt;
Below is a table that illustrates the variety with which these characteristics have been combined in processors from four manufacturers over the past 6 years.&lt;br /&gt;
&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; cellpadding=&amp;quot;2&amp;quot;&lt;br /&gt;
!colspan=&amp;quot;7&amp;quot;| Table 1: Recent Architectures and their Cache Characteristics [[#References|[1]]][[#References|[2]]][[#References|[3]]][[#References|[4]]][[#References|[9]]][[#References|[10]]][[#References|[11]]][[#References|[12]]][[#References|[13]]][[#References|[14]]][[#References|[15]]][[#References|[16]]]&lt;br /&gt;
|----&lt;br /&gt;
!Company&lt;br /&gt;
!Processor&lt;br /&gt;
!Cores&lt;br /&gt;
!L1 Cache&lt;br /&gt;
!L2 Cache&lt;br /&gt;
!L3 Cache&lt;br /&gt;
!Released&lt;br /&gt;
|----&lt;br /&gt;
|AMD&lt;br /&gt;
|Athlon 64 X2&lt;br /&gt;
|2&lt;br /&gt;
|128 KB x 2&lt;br /&gt;
|512 KB x 2 1MB x 2&lt;br /&gt;
|&lt;br /&gt;
|2005&lt;br /&gt;
|----&lt;br /&gt;
|AMD&lt;br /&gt;
|Athlon 64 FX&lt;br /&gt;
|2&lt;br /&gt;
|128 KB x 2&lt;br /&gt;
|1MB x 2&lt;br /&gt;
|&lt;br /&gt;
|2006&lt;br /&gt;
|----&lt;br /&gt;
|AMD&lt;br /&gt;
|Athlon X2&lt;br /&gt;
|2&lt;br /&gt;
|128 KB x 2&lt;br /&gt;
|512 KB x 2&lt;br /&gt;
|&lt;br /&gt;
|2007&lt;br /&gt;
|----&lt;br /&gt;
|AMD&lt;br /&gt;
|Phenom X3&lt;br /&gt;
|3&lt;br /&gt;
|128 KB x 3&lt;br /&gt;
|512 KB x 3&lt;br /&gt;
|2 MB&lt;br /&gt;
|2008&lt;br /&gt;
|----&lt;br /&gt;
|AMD&lt;br /&gt;
|Phenom X4&lt;br /&gt;
|4&lt;br /&gt;
|128 KB x 4&lt;br /&gt;
|512 KB x 4&lt;br /&gt;
|2 MB&lt;br /&gt;
|2007&lt;br /&gt;
|----&lt;br /&gt;
|AMD&lt;br /&gt;
|Athlon II X2&lt;br /&gt;
|2&lt;br /&gt;
|128 KB x 2&lt;br /&gt;
|512 KB x 2 1MB x 2&lt;br /&gt;
|&lt;br /&gt;
|2009&lt;br /&gt;
|----&lt;br /&gt;
|AMD&lt;br /&gt;
|Athlon II X3&lt;br /&gt;
|3&lt;br /&gt;
|128 KB x 3&lt;br /&gt;
|512 KB x 3&lt;br /&gt;
|&lt;br /&gt;
|2009&lt;br /&gt;
|----&lt;br /&gt;
|AMD&lt;br /&gt;
|Athlon II X4&lt;br /&gt;
|4&lt;br /&gt;
|128 KB x 4&lt;br /&gt;
|512 KB x 4&lt;br /&gt;
|&lt;br /&gt;
|2009&lt;br /&gt;
|----&lt;br /&gt;
|AMD&lt;br /&gt;
|Phenom II X2&lt;br /&gt;
|2&lt;br /&gt;
|128 KB x 2&lt;br /&gt;
|512 KB x 2&lt;br /&gt;
|6 MB&lt;br /&gt;
|2009&lt;br /&gt;
|----&lt;br /&gt;
|AMD&lt;br /&gt;
|Phenom II X3&lt;br /&gt;
|3&lt;br /&gt;
|128 KB x 3&lt;br /&gt;
|512 KB x 3&lt;br /&gt;
|6 MB&lt;br /&gt;
|2009&lt;br /&gt;
|----&lt;br /&gt;
|AMD&lt;br /&gt;
|Phenom II X4&lt;br /&gt;
|4&lt;br /&gt;
|128 KB x 4&lt;br /&gt;
|512 KB x 4&lt;br /&gt;
|4-6 MB&lt;br /&gt;
|2009&lt;br /&gt;
|----&lt;br /&gt;
|AMD&lt;br /&gt;
|Phenom II X6&lt;br /&gt;
|6&lt;br /&gt;
|128 KB x 6&lt;br /&gt;
|512 KB x 6&lt;br /&gt;
|6 MB&lt;br /&gt;
|2010&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Pentium D&lt;br /&gt;
|2&lt;br /&gt;
|12+16KB x 2&lt;br /&gt;
|2 MB&lt;br /&gt;
|&lt;br /&gt;
|2005&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Celeron E&lt;br /&gt;
|2&lt;br /&gt;
|32 KB x2&lt;br /&gt;
|512 -1 MB&lt;br /&gt;
|&lt;br /&gt;
|2006&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Pentium D&lt;br /&gt;
|2&lt;br /&gt;
|32 KB x2&lt;br /&gt;
|1 MB&lt;br /&gt;
|&lt;br /&gt;
|2006&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Core 2 Duo&lt;br /&gt;
|2&lt;br /&gt;
|32 KB x2&lt;br /&gt;
|2 - 4 MB&lt;br /&gt;
|&lt;br /&gt;
|2006&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Core 2 Quad&lt;br /&gt;
|4&lt;br /&gt;
|32 KB x 4&lt;br /&gt;
|8 MB&lt;br /&gt;
|&lt;br /&gt;
|2007&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Atom 330&lt;br /&gt;
|2&lt;br /&gt;
|32+24KB x 2&lt;br /&gt;
|512 KB x 2&lt;br /&gt;
|&lt;br /&gt;
|2008&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Pentium E&lt;br /&gt;
|2&lt;br /&gt;
|32 KB x 2&lt;br /&gt;
|2 MB&lt;br /&gt;
|&lt;br /&gt;
|2007&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Core 2 Duo E&lt;br /&gt;
|2&lt;br /&gt;
|32 KB x 2&lt;br /&gt;
|3-6 MB&lt;br /&gt;
|&lt;br /&gt;
|2008&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Core 2 Quad&lt;br /&gt;
|4&lt;br /&gt;
|32 KB x 4&lt;br /&gt;
|2-6 MB x 2&lt;br /&gt;
|&lt;br /&gt;
|2008&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Core i3&lt;br /&gt;
|2&lt;br /&gt;
|32+32 KB x 2&lt;br /&gt;
|256 KB x2&lt;br /&gt;
|4 MB&lt;br /&gt;
|2010&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Core i5 - 6 Series&lt;br /&gt;
|2&lt;br /&gt;
|32+32 KB x 2&lt;br /&gt;
|256 KB x2&lt;br /&gt;
|4 MB&lt;br /&gt;
|2010&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Core i5 - 7 Series&lt;br /&gt;
|4&lt;br /&gt;
|32+32 KB x 4&lt;br /&gt;
|256 KB x 4&lt;br /&gt;
|8 MB&lt;br /&gt;
|2010&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Core i5 - 2400 Series Core i5 - 2500 Series&lt;br /&gt;
|4&lt;br /&gt;
|32+32 KB x 4&lt;br /&gt;
|256 KB x 4&lt;br /&gt;
|6 MB&lt;br /&gt;
|2010&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Core i7 - 8 Series&lt;br /&gt;
|4&lt;br /&gt;
|32+32 KB x 4&lt;br /&gt;
|256 KB x 4&lt;br /&gt;
|8 MB&lt;br /&gt;
|2010&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Core i7 - 9 Series&lt;br /&gt;
|4&lt;br /&gt;
|32+32 KB x 4&lt;br /&gt;
|256 KB x 4&lt;br /&gt;
|8 MB&lt;br /&gt;
|2010&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Core i7 - 970&lt;br /&gt;
|6&lt;br /&gt;
|32+32 KB x 6&lt;br /&gt;
|256 KB x 6&lt;br /&gt;
|12 MB&lt;br /&gt;
|2010&lt;br /&gt;
|----&lt;br /&gt;
|Sun&lt;br /&gt;
|UltraSPARC T1&lt;br /&gt;
|8&lt;br /&gt;
|8 K x 8 Inst. 16 K x 8 Data&lt;br /&gt;
|3 MB&lt;br /&gt;
|&lt;br /&gt;
|2005&lt;br /&gt;
|----&lt;br /&gt;
|Sun&lt;br /&gt;
|SPARC64 VI&lt;br /&gt;
|2&lt;br /&gt;
|128 K x 2 Inst. 128 K x 2 Data&lt;br /&gt;
|5 MB&lt;br /&gt;
|&lt;br /&gt;
|2007&lt;br /&gt;
|----&lt;br /&gt;
|Sun&lt;br /&gt;
|UltraSPARC T2&lt;br /&gt;
|8&lt;br /&gt;
|8 K x 4 Inst. 16 K x 4 Data&lt;br /&gt;
|4 MB&lt;br /&gt;
|&lt;br /&gt;
|2007&lt;br /&gt;
|----&lt;br /&gt;
|Sun&lt;br /&gt;
|SPARC64 VII&lt;br /&gt;
|4&lt;br /&gt;
|64 K x 4 Inst. 64 K x 4 Data&lt;br /&gt;
|6 MB&lt;br /&gt;
|&lt;br /&gt;
|2008&lt;br /&gt;
|----&lt;br /&gt;
|Sun&lt;br /&gt;
|SPARC T3&lt;br /&gt;
|8&lt;br /&gt;
|8 K x 8 Inst. 16 K x 8 Data&lt;br /&gt;
|6 MB&lt;br /&gt;
|&lt;br /&gt;
|2010&lt;br /&gt;
|----&lt;br /&gt;
|Sun&lt;br /&gt;
|SPARC64 VII+&lt;br /&gt;
|2&lt;br /&gt;
|128 K x 2 Inst. 128 K x 2 Data&lt;br /&gt;
|12 MB&lt;br /&gt;
|&lt;br /&gt;
|2010&lt;br /&gt;
|----&lt;br /&gt;
|IBM &lt;br /&gt;
|Power5&lt;br /&gt;
|2&lt;br /&gt;
|64 K x 2 Inst. 64 K x 2 Data&lt;br /&gt;
|4 MB x 2&lt;br /&gt;
|32 MB&lt;br /&gt;
|2004&lt;br /&gt;
|----&lt;br /&gt;
|IBM&lt;br /&gt;
|Power7&lt;br /&gt;
|4, 6, or 8&lt;br /&gt;
|32+32 KB x C&lt;br /&gt;
|256 KB x C&lt;br /&gt;
|4 - 32 MB x C&lt;br /&gt;
|2010&lt;br /&gt;
|----&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Note:  An 'x #' suffix in the L1 and L2 columns indicates that each of the # cores has its own cache of the stated size.&lt;br /&gt;
&lt;br /&gt;
The next two sections will discuss cache write policies and cache prefetching as techniques to improve the performance of these complex caching architectures by reducing write-miss and read-miss rates.&lt;br /&gt;
&lt;br /&gt;
=Cache Write Policies[[#References|[5]]]=&lt;br /&gt;
In section 6.2.3[[#References|[8]]], cache write hit policies and write miss policies were explored.  The write hit policies, write-through and write-back, determine when the data written in the local cache is propagated to lower-level memory.  As a review, write-through writes data to both the cache and memory on every write.  Write-back writes to the cache first and to memory only when a flush is required.&lt;br /&gt;
&lt;br /&gt;
The write miss policies covered in the text[[#References|[8]]], write-allocate and no-write-allocate, determine if a memory block is stored in a cache line after the write occurs.  Note that since a miss occurred on write, the block is not already in a cache line.&lt;br /&gt;
&lt;br /&gt;
Write-miss policies can also be expressed in terms of write-allocate, fetch-on-write, and write-before-hit.   These can be used in combination with the same write-through and write-back hit policies to define the complete cache write policy.  Investigating alternative policies such as these is important since write-misses produce on average one-third of all cache misses.&lt;br /&gt;
&lt;br /&gt;
The following discusses each of these policies:&lt;br /&gt;
&lt;br /&gt;
==Fetch-on-Write==&lt;br /&gt;
On a write miss, the memory block containing the address to be written is fetched from the lower level memory hierarchy before the write proceeds.  Note that this is different from write-allocate. Fetch-on-write determines if the block is fetched from memory, and write-allocate determines if the memory block is stored in a cache line.  Certain write policies may allocate a line in cache for data that is written by the CPU without retrieving it from memory first.&lt;br /&gt;
&lt;br /&gt;
==No-Fetch-on-Write==&lt;br /&gt;
On a write miss, the memory block is not fetched first from the lower level memory hierarchy.  Therefore, the write can proceed without having to wait for the memory block to be returned.&lt;br /&gt;
&lt;br /&gt;
==Write-Before-Hit==&lt;br /&gt;
On a write, the write proceeds before the cache determines if a hit or miss occurred.  In this scenario, the tag and the data can be written simultaneously, but it incurs an immediate bus transaction for each write by the processor.&lt;br /&gt;
&lt;br /&gt;
==No-Write-Before-Hit==&lt;br /&gt;
On a write, the write waits until the cache determines if the block being written to is in the cache or not.  This may avoid a bus transaction by allowing the processor to write to the cache multiple times before the cache line is flushed to memory.&lt;br /&gt;
&lt;br /&gt;
==Write-Miss Policy Combinations==&lt;br /&gt;
&lt;br /&gt;
In practice, these policies are used in combination to provide an overall write policy.  Four combinations of these three write miss policies are relevant, as illustrated in the table below:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[image:policies.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The two combinations of fetch-on-write and no-write-allocate are blocked out in the above diagram.  They are considered unproductive since the data is fetched from lower level memory, but is not stored in the cache.  Temporal and spatial locality suggest that this data will be used again, and will therefore need to be retrieved a second time.&lt;br /&gt;
&lt;br /&gt;
Three combinations named write-validate, write-around, and write-invalidate all use no-fetch-on-write.  They result in 'eliminated misses' when compared to a fetch-on-write policy.  In general, this will yield better cache performance if the overhead to manage the policy remains low. &lt;br /&gt;
&lt;br /&gt;
The following discusses each of these combinations:&lt;br /&gt;
&lt;br /&gt;
===Write-Validate===&lt;br /&gt;
The combination of no-fetch-on-write and write-allocate is referred to as 'write-validate'.  It writes the data into the cache line without fetching the corresponding block from memory first.  The assumption is that the block will be written to memory at a later time.  It requires additional overhead, in the form of dirty bits, to track which bytes in the cache line have been written and which have not.  Lower level memories must also be able to process only the changed portions of these lines.  Otherwise, when the line is flushed to memory, the unwritten bytes may overwrite valid data.&lt;br /&gt;
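&lt;br /&gt;
The bookkeeping described above can be sketched as follows.  This is a minimal illustration, not the design from the cited paper: the class name, the 8-byte line size, and the use of a Python list as the per-byte tracking bits are all assumptions made for the example.&lt;br /&gt;
&lt;br /&gt;
```python
LINE_SIZE = 8  # bytes per cache line (assumed for illustration)


class WriteValidateLine:
    """One cache line that tracks which bytes the processor has written."""

    def __init__(self, base_address):
        self.base_address = base_address
        self.data = bytearray(LINE_SIZE)
        self.written = [False] * LINE_SIZE  # hypothetical per-byte tracking bits

    def write(self, address, value):
        # Write miss: allocate the line without fetching it, mark only this byte.
        offset = address - self.base_address
        self.data[offset] = value
        self.written[offset] = True

    def flush(self, memory):
        # Copy back only the bytes that were actually written, so the
        # never-fetched bytes cannot overwrite valid data in memory.
        for offset, was_written in enumerate(self.written):
            if was_written:
                memory[self.base_address + offset] = self.data[offset]
        self.written = [False] * LINE_SIZE


memory = bytearray(range(16))          # lower-level memory holding 0..15
line = WriteValidateLine(base_address=8)
line.write(10, 0xAA)                   # bytes 8..15 are never fetched
line.flush(memory)
print(list(memory[8:16]))
```
&lt;br /&gt;
After the flush, only the byte at address 10 has changed in memory; the other seven bytes of the block keep their original values because their tracking bits were never set.&lt;br /&gt;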
&lt;br /&gt;
The no-fetch-on-write policy has the advantage of avoiding an immediate bus transaction to retrieve the block from memory.  For a multiprocessor with a cache coherency model, though, a bus transaction will be required to get exclusive ownership of the block.  While this appears to negate the advantages of no-fetch-on-write, the processor is able to continue other tasks while this transaction is occurring, including rewriting to the same cache location.&lt;br /&gt;
&lt;br /&gt;
Tests by the author demonstrated more than a 90% reduction in write misses when compared to fetch-on-write policies.&lt;br /&gt;
&lt;br /&gt;
===Write-Around===&lt;br /&gt;
The combination of no-fetch-on-write, no-write-allocate, and no-write-before-hit is referred to as 'write-around'.  It bypasses the higher level cache, writes directly to the lower level memory, and leaves the existing line in the cache unaltered on a hit or a miss.  This strategy shows performance improvements when the data that is written will not be reread in the near future.  Because the policy is no-write-before-hit, the write is never applied to the cache ahead of the hit-or-miss check; the data goes around the cache to lower level memory.&lt;br /&gt;
&lt;br /&gt;
The author notes that in all but a few cases write-around performs worse than write-validate.  Most applications tend to reread what they have recently written.  Using a write-around policy, this would result in a cache miss and a read from lower-level memory.  With write-validate, the data would be in cache.&lt;br /&gt;
&lt;br /&gt;
Tests by the author demonstrated a 40-70% reduction in write misses when compared to fetch-on-write policies.&lt;br /&gt;
&lt;br /&gt;
===Write-Invalidate===&lt;br /&gt;
The combination of write-before-hit, no-fetch-on-write, and no-write-allocate is referred to as 'write-invalidate' because the line is invalidated on the miss.  The copy that exists in lower level memory after the write miss differs from the one in the cache.  For write hits, though, the data is simply written into the cache using the cache hit policy.  Thus, for hits, the cache is not written around.&lt;br /&gt;
&lt;br /&gt;
Of these three combinations, write-invalidate performed the worst.  The author notes, though, that it does perform better than fetch-on-write and is easy to implement.  Tests by the author demonstrated a 35-50% reduction in write misses when compared to fetch-on-write policies.&lt;br /&gt;
&lt;br /&gt;
===Fetch-on-write===&lt;br /&gt;
When used with write-allocate, fetch-on-write stores the block and processes the write to cache as expected.  This policy was used as the baseline for determining the performance of the other three policy combinations.  In general, all three previous combinations exhibited fewer misses than this policy.&lt;br /&gt;
&lt;br /&gt;
=Prefetching in Contemporary Parallel Processors=&lt;br /&gt;
Prefetching is a technique that can be used in addition to write-hit and write-miss policies to improve the utilization of the cache.  Microprocessors, such as those listed in Table 1, support prefetching logic directly in the chip.&lt;br /&gt;
&lt;br /&gt;
Prefetching is the process of retrieving instructions or data from memory before the process explicitly requests them.  Instruction prefetching is commonly used by both uniprocessors and multiprocessors to reduce process wait states.[[#References|[17]]]  Prefetching of data may also be used, though, to pre-populate caches with data that is likely to be required by the processors in the near term.  If the data requirements are anticipated correctly, the requests to memory will result in a higher cache hit rate and, therefore, a lower overall memory access time.  If the prefetcher guesses wrong, bus traffic can increase unnecessarily, more relevant data can be flushed from caches, and miss rates can increase.&lt;br /&gt;
&lt;br /&gt;
Prefetching algorithms can leverage both temporal and spatial locality in making these decisions.  For example, streaming and sequential access applications often process adjacent memory locations in subsequent tasks.&lt;br /&gt;
&lt;br /&gt;
Following are two examples of processors that support prefetching directly in the chip design:&lt;br /&gt;
&lt;br /&gt;
== Intel Core i7  [[#References|[6]]]==&lt;br /&gt;
The Intel 64 Architecture, including the Intel Core i7, includes both instruction and data prefetching directly on the chip.&lt;br /&gt;
&lt;br /&gt;
The Data Cache Unit prefetcher is a streaming prefetcher for L1 caches that detects ascending access to data that has been loaded very recently.  The processor assumes that this ascending access will continue, and prefetches the next line.&lt;br /&gt;
&lt;br /&gt;
The data prefetch logic (DPL) maintains two arrays to track the recent accesses to memory:  one for upstreams that has 12 entries, and one for downstreams that has 4 entries.  As pages are accessed, their addresses are tracked in these arrays.  When the DPL detects an access to a page that is sequential to an existing entry, it assumes this sequential access will continue, and prefetches the next cache line from memory.&lt;br /&gt;
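&lt;br /&gt;
The detection idea can be sketched in a few lines.  This is a simplified illustration and not Intel's actual logic: the class name, the 64-byte line size, tracking at cache-line granularity, and the use of a single 12-entry array for ascending streams only are assumptions made for the example.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import deque

LINE_SIZE = 64        # bytes per cache line (assumed)
TRACKED_ENTRIES = 12  # tracking array size, borrowed from the upstream array above


class NextLinePrefetcher:
    """Hypothetical ascending-stream detector that prefetches one line ahead."""

    def __init__(self):
        # The oldest tracked line is dropped automatically when the array is full.
        self.recent_lines = deque(maxlen=TRACKED_ENTRIES)
        self.prefetched = []

    def access(self, address):
        line = address // LINE_SIZE
        # An access to the line right after a tracked line looks like an
        # ascending stream, so request the line that follows it.
        if (line - 1) in self.recent_lines:
            self.prefetched.append(line + 1)
        if line not in self.recent_lines:
            self.recent_lines.append(line)


prefetcher = NextLinePrefetcher()
for address in (0, 64, 128, 4096):
    prefetcher.access(address)
print(prefetcher.prefetched)
```
&lt;br /&gt;
In this run the accesses to lines 1 and 2 each extend the stream that started at line 0, so lines 2 and 3 are requested; the isolated access at address 4096 triggers nothing.&lt;br /&gt;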
&lt;br /&gt;
== AMD [[#References|[7]]]==&lt;br /&gt;
&lt;br /&gt;
The AMD Family 10h processors use a stream-detection strategy similar to the Intel process described above to trigger prefetching of the next sequential memory location into the L1 cache.  (Previous AMD processors fetched into the L2 cache, which introduced the access latency of the L2 cache and hindered performance.)&lt;br /&gt;
&lt;br /&gt;
The AMD Family 10h can prefetch more than just the next sequential block when a stream is detected, though.  When this 'unit-stride' prefetcher detects misses to sequential blocks, it can trigger a preset number of prefetch requests from memory.   For example, if the preset value is two, the next two blocks of memory will be prefetched when a sequential access is detected.  Subsequent detection of misses of sequential blocks may only prefetch a single block.  This maintains the 'unit-stride' of two blocks in the cache ahead of the next anticipated read.&lt;br /&gt;
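&lt;br /&gt;
The behavior in the example above can be sketched as follows.  This is an illustrative model only, not AMD's hardware: the class and attribute names are hypothetical, and the model reacts to sequential demand accesses at block granularity rather than to the misses the real prefetcher observes.&lt;br /&gt;
&lt;br /&gt;
```python
class UnitStridePrefetcher:
    """Hypothetical model that keeps `preset` blocks requested ahead of a stream."""

    def __init__(self, preset):
        self.preset = preset    # how many blocks to stay ahead of the demand stream
        self.last_block = None
        self.frontier = None    # highest block already requested for this stream
        self.issued = []

    def access(self, block):
        sequential = self.last_block is not None and block == self.last_block + 1
        self.last_block = block
        if not sequential:
            self.frontier = None    # the stream was broken; start detection over
            return
        if self.frontier is None:
            self.frontier = block   # stream just detected, nothing requested yet
        target = block + self.preset
        # Request every block between the frontier and the target.
        for next_block in range(self.frontier + 1, target + 1):
            self.issued.append(next_block)
        self.frontier = max(self.frontier, target)


prefetcher = UnitStridePrefetcher(preset=2)
for block in (10, 11, 12, 13):
    prefetcher.access(block)
print(prefetcher.issued)
```
&lt;br /&gt;
With a preset of two, the first detected sequential access (block 11) requests blocks 12 and 13; each later sequential access requests a single additional block, which keeps two blocks requested ahead of the stream.&lt;br /&gt;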
&lt;br /&gt;
AMD contends that this is beneficial for applications that work through large data sets sequentially and consume all the data in the stream.  In these cases, a larger unit-stride populates the cache with blocks that will ultimately be processed by the CPU.  AMD suggests that the optimal number of blocks to prefetch is between 4 and 8, and that fetching too many or fetching too soon has impaired performance in empirical testing.&lt;br /&gt;
&lt;br /&gt;
The AMD Family 10h also includes 'Adaptive Prefetching', a hardware optimization that triggers prefetching when the demand stream catches up to the prefetch stream.  In this case, the unit-stride is increased to ensure that the prefetcher is fetching at a rate sufficient to continuously provide data to the processor.&lt;br /&gt;
&lt;br /&gt;
= References =&lt;br /&gt;
[1]: http://www.techarp.com/showarticle.aspx?artno=337 &amp;lt;br /&amp;gt;&lt;br /&gt;
[2]: http://en.wikipedia.org/wiki/SPARC &amp;lt;br /&amp;gt;&lt;br /&gt;
[3]: http://en.wikipedia.org/wiki/POWER7 &amp;lt;br /&amp;gt;&lt;br /&gt;
[4]: http://en.wikipedia.org/wiki/POWER5 &amp;lt;br /&amp;gt;&lt;br /&gt;
[5]: “Cache write policies and performance,” Norman Jouppi, Proc. 20th International Symposium on Computer Architecture (ACM Computer Architecture News 21:2), May 1993, pp. 191–201. http://www.hpl.hp.com/techreports/Compaq-DEC/WRL-91-12.pdf &amp;lt;br /&amp;gt;&lt;br /&gt;
[6]: &amp;quot;Intel® 64 and IA-32 Architectures Optimization Reference Manual&amp;quot;, Intel Corporation, 1997-2011. http://www.intel.com/Assets/PDF/manual/248966.pdf &amp;lt;br /&amp;gt;&lt;br /&gt;
[7]: &amp;quot;Software Optimization Guide for AMD Family 10h and 12h Processors&amp;quot;, Advanced Micro Devices, Inc., 2006–2010. http://support.amd.com/us/Processor_TechDocs/40546.pdf &amp;lt;br /&amp;gt;&lt;br /&gt;
[8]: Yan Solihin, ''Fundamentals of Parallel Computer Architecture'' (Solihin Publishing and Consulting, LLC, 2008-2009), p. 151&amp;lt;br /&amp;gt;&lt;br /&gt;
[9]: http://en.wikipedia.org/wiki/AMD_Phenom&amp;lt;br /&amp;gt;&lt;br /&gt;
[10]: http://en.wikipedia.org/wiki/Phenom_II&amp;lt;br /&amp;gt;&lt;br /&gt;
[11]: http://en.wikipedia.org/wiki/Athlon_II&amp;lt;br /&amp;gt;&lt;br /&gt;
[12]: http://en.wikipedia.org/wiki/List_of_AMD_Athlon_64_microprocessors&amp;lt;br /&amp;gt;&lt;br /&gt;
[13]: http://en.wikipedia.org/wiki/List_of_AMD_Athlon_X2_microprocessors&amp;lt;br /&amp;gt;&lt;br /&gt;
[14]: http://en.wikipedia.org/wiki/List_of_Intel_Core_i5_microprocessors&amp;lt;br /&amp;gt;&lt;br /&gt;
[15]: http://en.wikipedia.org/wiki/List_of_Intel_Core_i3_microprocessors&amp;lt;br /&amp;gt;&lt;br /&gt;
[16]: http://www.intel.com/pressroom/kits/quickrefyr.htm&amp;lt;br /&amp;gt;&lt;br /&gt;
[17]: http://en.wikipedia.org/wiki/Instruction_prefetch&lt;/div&gt;</summary>
		<author><name>Eapotter</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch6a_ep&amp;diff=44350</id>
		<title>CSC/ECE 506 Spring 2011/ch6a ep</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch6a_ep&amp;diff=44350"/>
		<updated>2011-03-08T04:22:02Z</updated>

		<summary type="html">&lt;p&gt;Eapotter: /* Recent Architectures and their Cache Characteristics */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;INTRODUCTION TO MEMORY HIERARCHY ORGANIZATION &amp;lt;br/&amp;gt;&lt;br /&gt;
Write-Miss Policies and Prefetching&lt;br /&gt;
&lt;br /&gt;
__TOC__&lt;br /&gt;
&lt;br /&gt;
Write miss policies and prefetching are two strategies that are used by multiprocessors to achieve optimal performance for memory accesses.  Write miss policies can help eliminate write misses and therefore reduce bus traffic.  This reduces the wait-time of all processors sharing the bus.  Prefetching helps ensure that a CPU has data blocks in cache to be read when processing large data files and streaming data.  When CPUs or cores share a cache, prefetched data in that cache is available to every process on those cores that uses the data.&lt;br /&gt;
&lt;br /&gt;
This article begins by highlighting the variety of multicore processors on the market today that have hierarchical memory structures and shared caches.  It then explores the write miss policies and prefetching techniques that these multiprocessors can use to take advantage of these architectures.&lt;br /&gt;
&lt;br /&gt;
= Recent Architectures and their Cache Characteristics =&lt;br /&gt;
&lt;br /&gt;
Cache management is impacted by three characteristics of modern processor architectures:  multilevel cache hierarchies, multi-core and multiprocessor chip and systems designs, and shared vs. private caches.&lt;br /&gt;
&lt;br /&gt;
Multi-level cache hierarchies are used by most contemporary chip designs to enable memory access to keep pace with CPU speeds that are advancing at a more rapid pace.  Two-level and three-level cache hierarchies are common.  L1 typically ranges from 16-64KB and provides access in 2-4 cycles.  L2 typically ranges from 512KB to as much as 8MB and provides access in 6-15 cycles.  L3 typically ranges from 4MB to 32MB and provides access in 30-50 cycles.[[#References|[8]]]&lt;br /&gt;
&lt;br /&gt;
Multi-core and multiprocessor designs place additional constraints on cache management by requiring it to consider all processes running concurrently in the cache coherency, replenishment, fetching, and other management functions.&lt;br /&gt;
&lt;br /&gt;
In multiprocessors with multiple levels of cache, certain cache levels may be shared by processors and others private to each processor.  Typically, the caches closest to the processor (such as L1) are private, and the caches farther from it may or may not be shared.  Notice that none of the examples in the table below have a shared L1 cache.  Cache management functions must consider both shared and private caches when reading and writing data from memory.&lt;br /&gt;
&lt;br /&gt;
Below is a table that illustrates the variety with which these characteristics have been combined in processors from four manufacturers over the past 6 years.&lt;br /&gt;
&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; cellpadding=&amp;quot;2&amp;quot;&lt;br /&gt;
!colspan=&amp;quot;7&amp;quot;| Table 1: Recent Architectures and their Cache Characteristics [[#References|[1]]][[#References|[2]]][[#References|[3]]][[#References|[4]]][[#References|[9]]][[#References|[10]]][[#References|[11]]][[#References|[12]]][[#References|[13]]][[#References|[14]]][[#References|[15]]][[#References|[16]]]&lt;br /&gt;
|----&lt;br /&gt;
!Company&lt;br /&gt;
!Processor&lt;br /&gt;
!Cores&lt;br /&gt;
!L1 Cache&lt;br /&gt;
!L2 Cache&lt;br /&gt;
!L3 Cache&lt;br /&gt;
!Released&lt;br /&gt;
|----&lt;br /&gt;
|AMD&lt;br /&gt;
|Athlon 64 X2&lt;br /&gt;
|2&lt;br /&gt;
|128 KB x 2&lt;br /&gt;
|512 KB x 2 1MB x 2&lt;br /&gt;
|&lt;br /&gt;
|2005&lt;br /&gt;
|----&lt;br /&gt;
|AMD&lt;br /&gt;
|Athlon 64 FX&lt;br /&gt;
|2&lt;br /&gt;
|128 KB x 2&lt;br /&gt;
|1MB x 2&lt;br /&gt;
|&lt;br /&gt;
|2006&lt;br /&gt;
|----&lt;br /&gt;
|AMD&lt;br /&gt;
|Athlon X2&lt;br /&gt;
|2&lt;br /&gt;
|128 KB x 2&lt;br /&gt;
|512 KB x 2&lt;br /&gt;
|&lt;br /&gt;
|2007&lt;br /&gt;
|----&lt;br /&gt;
|AMD&lt;br /&gt;
|Phenom X3&lt;br /&gt;
|3&lt;br /&gt;
|128 KB x 3&lt;br /&gt;
|512 KB x 3&lt;br /&gt;
|2 MB&lt;br /&gt;
|2008&lt;br /&gt;
|----&lt;br /&gt;
|AMD&lt;br /&gt;
|Phenom X4&lt;br /&gt;
|4&lt;br /&gt;
|128 KB x 4&lt;br /&gt;
|512 KB x 4&lt;br /&gt;
|2 MB&lt;br /&gt;
|2007&lt;br /&gt;
|----&lt;br /&gt;
|AMD&lt;br /&gt;
|Athlon II X2&lt;br /&gt;
|2&lt;br /&gt;
|128 KB x 2&lt;br /&gt;
|512 KB x 2 1MB x 2&lt;br /&gt;
|&lt;br /&gt;
|2009&lt;br /&gt;
|----&lt;br /&gt;
|AMD&lt;br /&gt;
|Athlon II X3&lt;br /&gt;
|3&lt;br /&gt;
|128 KB x 3&lt;br /&gt;
|512 KB x 3&lt;br /&gt;
|&lt;br /&gt;
|2009&lt;br /&gt;
|----&lt;br /&gt;
|AMD&lt;br /&gt;
|Athlon II X4&lt;br /&gt;
|4&lt;br /&gt;
|128 KB x 4&lt;br /&gt;
|512 KB x 4&lt;br /&gt;
|&lt;br /&gt;
|2009&lt;br /&gt;
|----&lt;br /&gt;
|AMD&lt;br /&gt;
|Phenom II X2&lt;br /&gt;
|2&lt;br /&gt;
|128 KB x 2&lt;br /&gt;
|512 KB x 2&lt;br /&gt;
|6 MB&lt;br /&gt;
|2009&lt;br /&gt;
|----&lt;br /&gt;
|AMD&lt;br /&gt;
|Phenom II X3&lt;br /&gt;
|3&lt;br /&gt;
|128 KB x 3&lt;br /&gt;
|512 KB x 3&lt;br /&gt;
|6 MB&lt;br /&gt;
|2009&lt;br /&gt;
|----&lt;br /&gt;
|AMD&lt;br /&gt;
|Phenom II X4&lt;br /&gt;
|4&lt;br /&gt;
|128 KB x 4&lt;br /&gt;
|512 KB x 4&lt;br /&gt;
|4-6 MB&lt;br /&gt;
|2009&lt;br /&gt;
|----&lt;br /&gt;
|AMD&lt;br /&gt;
|Phenom II X6&lt;br /&gt;
|6&lt;br /&gt;
|128 KB x 6&lt;br /&gt;
|512 KB x 6&lt;br /&gt;
|6 MB&lt;br /&gt;
|2010&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Pentium D&lt;br /&gt;
|2&lt;br /&gt;
|12+16KB x 2&lt;br /&gt;
|2 MB&lt;br /&gt;
|&lt;br /&gt;
|2005&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Celeron E&lt;br /&gt;
|2&lt;br /&gt;
|32 KB x2&lt;br /&gt;
|512 KB - 1 MB&lt;br /&gt;
|&lt;br /&gt;
|2006&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Pentium D&lt;br /&gt;
|2&lt;br /&gt;
|32 KB x2&lt;br /&gt;
|1 MB&lt;br /&gt;
|&lt;br /&gt;
|2006&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Core 2 Duo&lt;br /&gt;
|2&lt;br /&gt;
|32 KB x2&lt;br /&gt;
|2 - 4 MB&lt;br /&gt;
|&lt;br /&gt;
|2006&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Core 2 Quad&lt;br /&gt;
|4&lt;br /&gt;
|32 KB x 4&lt;br /&gt;
|8 MB&lt;br /&gt;
|&lt;br /&gt;
|2007&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Atom 330&lt;br /&gt;
|2&lt;br /&gt;
|32+24KB x 2&lt;br /&gt;
|512 KB x 2&lt;br /&gt;
|&lt;br /&gt;
|2008&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Pentium E&lt;br /&gt;
|2&lt;br /&gt;
|32 KB x 2&lt;br /&gt;
|2 MB&lt;br /&gt;
|&lt;br /&gt;
|2007&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Core 2 Duo E&lt;br /&gt;
|2&lt;br /&gt;
|32 KB x 2&lt;br /&gt;
|3-6 MB&lt;br /&gt;
|&lt;br /&gt;
|2008&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Core 2 Quad&lt;br /&gt;
|4&lt;br /&gt;
|32 KB x 4&lt;br /&gt;
|2-6 MB x 2&lt;br /&gt;
|&lt;br /&gt;
|2008&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Core i3&lt;br /&gt;
|2&lt;br /&gt;
|32+32 KB x 2&lt;br /&gt;
|256 KB x2&lt;br /&gt;
|4 MB&lt;br /&gt;
|2010&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Core i5 - 6 Series&lt;br /&gt;
|2&lt;br /&gt;
|32+32 KB x 2&lt;br /&gt;
|256 KB x2&lt;br /&gt;
|4 MB&lt;br /&gt;
|2010&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Core i5 - 7 Series&lt;br /&gt;
|4&lt;br /&gt;
|32+32 KB x 4&lt;br /&gt;
|256 KB x 4&lt;br /&gt;
|8 MB&lt;br /&gt;
|2010&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Core i5 - 2400 Series Core i5 - 2500 Series&lt;br /&gt;
|4&lt;br /&gt;
|32+32 KB x 4&lt;br /&gt;
|256 KB x 4&lt;br /&gt;
|6 MB&lt;br /&gt;
|2010&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Core i7 - 8 Series&lt;br /&gt;
|4&lt;br /&gt;
|32+32 KB x 4&lt;br /&gt;
|256 KB x 4&lt;br /&gt;
|8 MB&lt;br /&gt;
|2010&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Core i7 - 9 Series&lt;br /&gt;
|4&lt;br /&gt;
|32+32 KB x 4&lt;br /&gt;
|256 KB x 4&lt;br /&gt;
|8 MB&lt;br /&gt;
|2010&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Core i7 - 970&lt;br /&gt;
|6&lt;br /&gt;
|32+32 KB x 6&lt;br /&gt;
|256 KB x 6&lt;br /&gt;
|12 MB&lt;br /&gt;
|2010&lt;br /&gt;
|----&lt;br /&gt;
|Sun&lt;br /&gt;
|UltraSPARC T1&lt;br /&gt;
|8&lt;br /&gt;
|8 K x 8 Inst. 16 K x 8 Data&lt;br /&gt;
|3 MB&lt;br /&gt;
|&lt;br /&gt;
|2005&lt;br /&gt;
|----&lt;br /&gt;
|Sun&lt;br /&gt;
|SPARC64 VI&lt;br /&gt;
|2&lt;br /&gt;
|128 K x 2 Inst. 128 K x 2 Data&lt;br /&gt;
|5 MB&lt;br /&gt;
|&lt;br /&gt;
|2007&lt;br /&gt;
|----&lt;br /&gt;
|Sun&lt;br /&gt;
|UltraSPARC T2&lt;br /&gt;
|8&lt;br /&gt;
|8 K x 4 Inst. 16 K x 4 Data&lt;br /&gt;
|4 MB&lt;br /&gt;
|&lt;br /&gt;
|2007&lt;br /&gt;
|----&lt;br /&gt;
|Sun&lt;br /&gt;
|SPARC64 VII&lt;br /&gt;
|4&lt;br /&gt;
|64 K x 4 Inst. 64 K x 4 Data&lt;br /&gt;
|6 MB&lt;br /&gt;
|&lt;br /&gt;
|2008&lt;br /&gt;
|----&lt;br /&gt;
|Sun&lt;br /&gt;
|SPARC T3&lt;br /&gt;
|8&lt;br /&gt;
|8 K x 8 Inst. 16 K x 8 Data&lt;br /&gt;
|6 MB&lt;br /&gt;
|&lt;br /&gt;
|2010&lt;br /&gt;
|----&lt;br /&gt;
|Sun&lt;br /&gt;
|SPARC64 VII+&lt;br /&gt;
|2&lt;br /&gt;
|128 K x 2 Inst. 128 K x 2 Data&lt;br /&gt;
|12 MB&lt;br /&gt;
|&lt;br /&gt;
|2010&lt;br /&gt;
|----&lt;br /&gt;
|IBM &lt;br /&gt;
|Power5&lt;br /&gt;
|2&lt;br /&gt;
|64 K x 2 Inst. 64 K x 2 Data&lt;br /&gt;
|4 MB x 2&lt;br /&gt;
|32 MB&lt;br /&gt;
|2004&lt;br /&gt;
|----&lt;br /&gt;
|IBM&lt;br /&gt;
|Power7&lt;br /&gt;
|4, 6, or 8&lt;br /&gt;
|32+32 KB x C&lt;br /&gt;
|256 KB x C&lt;br /&gt;
|4 - 32 MB x C&lt;br /&gt;
|2010&lt;br /&gt;
|----&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Note:  An 'x #' suffix in the L1 and L2 columns indicates that each of the # cores has its own cache of the stated size.&lt;br /&gt;
&lt;br /&gt;
The next two sections will discuss cache write policies and cache prefetching as techniques to improve cache performance by reducing write-miss and read-miss rates.&lt;br /&gt;
&lt;br /&gt;
=Cache Write Policies[[#References|[5]]]=&lt;br /&gt;
In section 6.2.3[[#References|[8]]], cache write hit policies and write miss policies were explored.  The write hit policies, write-through and write-back, determine when the data written in the local cache is propagated to lower-level memory.  As a review, write-through writes data to both the cache and memory on every write.  Write-back writes to the cache first and to memory only when a flush is required.&lt;br /&gt;
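&lt;br /&gt;
The difference between the two write hit policies can be made concrete by counting memory writes.  The sketch below is an illustration under simple assumptions: the function name is hypothetical, all writes hit the same cache line, and the write-back line is flushed exactly once at the end.&lt;br /&gt;
&lt;br /&gt;
```python
def memory_writes(policy, writes_to_line):
    """Count lower-level memory writes caused by repeated write hits to one cache line."""
    if policy == "write-through":
        # Every write is propagated to memory immediately.
        return writes_to_line
    if policy == "write-back":
        # The dirty line reaches memory once, when it is eventually flushed.
        return 1 if writes_to_line else 0
    raise ValueError("unknown policy: " + policy)


print(memory_writes("write-through", 5))
print(memory_writes("write-back", 5))
```
&lt;br /&gt;
Five writes to the same line cost five memory writes under write-through but only one under write-back, which is why write-back can reduce bus traffic when a line is written repeatedly.&lt;br /&gt;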
&lt;br /&gt;
The write miss policies covered in the text[[#References|[8]]], write-allocate and no-write-allocate, determine if a memory block is stored in a cache line after the write occurs.  Note that since a miss occurred on write, the block is not already in a cache line.&lt;br /&gt;
&lt;br /&gt;
Write-miss policies can also be expressed in terms of write-allocate, fetch-on-write, and write-before-hit.   These can be used in combination with the same write-through and write-back hit policies to define the complete cache write policy.  Investigating alternative policies such as these is important since write-misses produce on average one-third of all cache misses.&lt;br /&gt;
&lt;br /&gt;
The following discusses each of these policies:&lt;br /&gt;
&lt;br /&gt;
==Fetch-on-Write==&lt;br /&gt;
On a write miss, the memory block containing the address to be written is fetched from the lower level memory hierarchy before the write proceeds.  Note that this is different from write-allocate. Fetch-on-write determines if the block is fetched from memory, and write-allocate determines if the memory block is stored in a cache line.  Certain write policies may allocate a line in cache for data that is written by the CPU without retrieving it from memory first.&lt;br /&gt;
&lt;br /&gt;
==No-Fetch-on-Write==&lt;br /&gt;
On a write miss, the memory block is not fetched first from the lower level memory hierarchy.  Therefore, the write can proceed without having to wait for the memory block to be returned.&lt;br /&gt;
&lt;br /&gt;
==Write-Before-Hit==&lt;br /&gt;
On a write, the write proceeds before the cache determines if a hit or miss occurred.  In this scenario, the tag and the data can be written simultaneously, but it incurs an immediate bus transaction for each write by the processor.&lt;br /&gt;
&lt;br /&gt;
==No-Write-Before-Hit==&lt;br /&gt;
On a write, the write waits until the cache determines if the block being written to is in the cache or not.  This may avoid a bus transaction by allowing the processor to write to the cache multiple times before the cache line is flushed to memory.&lt;br /&gt;
&lt;br /&gt;
==Write-Miss Policy Combinations==&lt;br /&gt;
&lt;br /&gt;
In practice, these policies are used in combination to provide an overall write policy.  Four combinations of these three write miss policies are relevant, as illustrated in the table below:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[image:policies.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The two combinations of fetch-on-write and no-write-allocate are blocked out in the above diagram.  They are considered unproductive since the data is fetched from lower level memory, but is not stored in the cache.  Temporal and spatial locality suggest that this data will be used again, and will therefore need to be retrieved a second time.&lt;br /&gt;
&lt;br /&gt;
Three combinations named write-validate, write-around, and write-invalidate all use no-fetch-on-write.  They result in 'eliminated misses' when compared to a fetch-on-write policy.  In general, this will yield better cache performance if the overhead to manage the policy remains low. &lt;br /&gt;
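&lt;br /&gt;
The way the three choices combine into named policies can be written as a small lookup.  This is a sketch based only on the definitions in this article: the function name and the label 'unproductive' are hypothetical, and the write-before-hit choice is assumed not to affect the name once a line is allocated.&lt;br /&gt;
&lt;br /&gt;
```python
def classify_write_miss_policy(fetch_on_write, write_allocate, write_before_hit):
    """Name the write-miss combination selected by the three policy choices."""
    if fetch_on_write:
        # Fetching a block that is then not kept in the cache is the blocked-out case.
        return "fetch-on-write" if write_allocate else "unproductive"
    if write_allocate:
        return "write-validate"
    # No fetch and no allocation: the name depends on write-before-hit.
    return "write-invalidate" if write_before_hit else "write-around"


for flags in [(True, True, False), (False, True, False),
              (False, False, False), (False, False, True)]:
    print(flags, classify_write_miss_policy(*flags))
```
&lt;br /&gt;
The four tuples in the loop correspond to the four relevant combinations: fetch-on-write, write-validate, write-around, and write-invalidate, in that order.&lt;br /&gt;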
&lt;br /&gt;
The following discusses each of these combinations:&lt;br /&gt;
&lt;br /&gt;
===Write-Validate===&lt;br /&gt;
The combination of no-fetch-on-write and write-allocate is referred to as 'write-validate'.  It writes the data into the cache line without fetching the corresponding block from memory first.  The assumption is that the block will be written to memory at a later time.  It requires additional overhead, in the form of dirty bits, to track which bytes in the cache line have been written and which have not.  Lower level memories must also be able to process only the changed portions of these lines.  Otherwise, when the line is flushed to memory, the unwritten bytes may overwrite valid data.&lt;br /&gt;
&lt;br /&gt;
The no-fetch-on-write policy has the advantage of avoiding an immediate bus transaction to retrieve the block from memory.  For a multiprocessor with a cache coherency model, though, a bus transaction will be required to get exclusive ownership of the block.  While this appears to negate the advantages of no-fetch-on-write, the processor is able to continue other tasks while this transaction is occurring, including rewriting to the same cache location.&lt;br /&gt;
&lt;br /&gt;
Tests by the author demonstrated more than a 90% reduction in write misses when compared to fetch-on-write policies.&lt;br /&gt;
&lt;br /&gt;
===Write-Around===&lt;br /&gt;
The combination of no-fetch-on-write, no-write-allocate, and no-write-before-hit is referred to as 'write-around'.  It bypasses the higher level cache, writes directly to the lower level memory, and leaves the existing line in the cache unaltered on a hit or a miss.  This strategy shows performance improvements when the data that is written will not be reread in the near future.  Because the policy is no-write-before-hit, the write is never applied to the cache ahead of the hit-or-miss check; the data goes around the cache to lower level memory.&lt;br /&gt;
&lt;br /&gt;
The author notes that in all but a few cases write-around performs worse than write-validate.  Most applications tend to reread what they have recently written.  Using a write-around policy, this would result in a cache miss and a read from lower-level memory.  With write-validate, the data would be in cache.&lt;br /&gt;
&lt;br /&gt;
Tests by the author demonstrated a 40-70% reduction in write misses when compared to fetch-on-write policies.&lt;br /&gt;
&lt;br /&gt;
===Write-Invalidate===&lt;br /&gt;
The combination of write-before-hit, no-fetch-on-write, and no-write-allocate is referred to as 'write-invalidate' because the line is invalidated on the miss.  The copy that exists in lower level memory after the write miss differs from the one in the cache.  For write hits, though, the data is simply written into the cache using the cache hit policy.  Thus, for hits, the cache is not written around.&lt;br /&gt;
&lt;br /&gt;
Of these three combinations, write-invalidate performed the worst.  The author notes, though, that it does perform better than fetch-on-write and is easy to implement.  Tests by the author demonstrated a 35-50% reduction in write misses when compared to fetch-on-write policies.&lt;br /&gt;
&lt;br /&gt;
===Fetch-on-write===&lt;br /&gt;
When used with write-allocate, fetch-on-write stores the block and processes the write to cache as expected.  This policy was used as the baseline for determining the performance of the other three policy combinations.  In general, all three previous combinations exhibited fewer misses than this policy.&lt;br /&gt;
&lt;br /&gt;
=Prefetching in Contemporary Parallel Processors=&lt;br /&gt;
Prefetching is a technique that can be used in addition to write-hit and write-miss policies to improve the utilization of the cache.  Microprocessors, such as those listed in Table 1, support prefetching logic directly in the chip.&lt;br /&gt;
&lt;br /&gt;
Prefetching is the process of retrieving instructions or data from memory before the process explicitly requests them.  Instruction prefetching is commonly used by both uniprocessors and multiprocessors to reduce process wait states.[[#References|[17]]]  Prefetching of data may also be used, though, to pre-populate caches with data that is likely to be required by the processors in the near term.  If the data requirements are anticipated correctly, the requests to memory will result in a higher cache hit rate and, therefore, a lower overall memory access time.  If the prefetcher guesses wrong, bus traffic can increase unnecessarily, more relevant data can be flushed from caches, and miss rates can increase.&lt;br /&gt;
&lt;br /&gt;
Prefetching algorithms can leverage both temporal and spatial locality in making these decisions.  For example, streaming and sequential access applications often process adjacent memory locations in subsequent tasks.&lt;br /&gt;
&lt;br /&gt;
Following are two examples of processors that support prefetching directly in the chip design:&lt;br /&gt;
&lt;br /&gt;
== Intel Core i7  [[#References|[6]]]==&lt;br /&gt;
The Intel 64 Architecture, including the Intel Core i7, includes both instruction and data prefetching directly on the chip.&lt;br /&gt;
&lt;br /&gt;
The Data Cache Unit prefetcher is a streaming prefetcher for L1 caches that detects ascending access to data that has been loaded very recently.  The processor assumes that this ascending access will continue, and prefetches the next line.&lt;br /&gt;
&lt;br /&gt;
The data prefetch logic (DPL) maintains two arrays to track the recent accesses to memory:  one for the upstreams that has 12 entries, and one for downstreams that has 4 entries.  As pages are accessed, their addresses are tracked in these arrays.  When the DPL detects an access to a page that is sequential to an existing entry, it assumes this sequential access will continue, and prefetches the next cache line from memory.&lt;br /&gt;
&lt;br /&gt;
== AMD [[#References|[7]]]==&lt;br /&gt;
&lt;br /&gt;
The AMD Family 10h processors use a stream-detection strategy similar to the Intel process described above to trigger prefetching of the next sequential memory location into the L1 cache.  (Previous AMD processors fetched into the L2 cache, which introduced the access latency of the L2 cache and hindered performance.)&lt;br /&gt;
&lt;br /&gt;
The AMD Family 10h can prefetch more than just the next sequential block when a stream is detected, though.  When this 'unit-stride' prefetcher detects misses to sequential blocks, it can trigger a preset number of prefetch requests from memory.   For example, if the preset value is two, the next two blocks of memory will be prefetched when a sequential access is detected.  Subsequent detection of misses of sequential blocks may only prefetch a single block.  This maintains the 'unit-stride' of two blocks in the cache ahead of the next anticipated read.&lt;br /&gt;
&lt;br /&gt;
AMD contends that this is beneficial for applications that work through large data sets sequentially and consume all the data in the stream.  In these cases, a larger unit-stride populates the cache with blocks that will ultimately be processed by the CPU.  AMD suggests that the optimal number of blocks to prefetch is between 4 and 8, and that fetching too many or fetching too soon has impaired performance in empirical testing.&lt;br /&gt;
&lt;br /&gt;
The AMD Family 10h also includes 'Adaptive Prefetching', a hardware optimization that triggers prefetching when the demand stream catches up to the prefetch stream.  In this case, the unit-stride is increased to ensure that the prefetcher is fetching at a rate sufficient to continuously provide data to the processor.&lt;br /&gt;
&lt;br /&gt;
= References =&lt;br /&gt;
[1]: http://www.techarp.com/showarticle.aspx?artno=337 &amp;lt;br /&amp;gt;&lt;br /&gt;
[2]: http://en.wikipedia.org/wiki/SPARC &amp;lt;br /&amp;gt;&lt;br /&gt;
[3]: http://en.wikipedia.org/wiki/POWER7 &amp;lt;br /&amp;gt;&lt;br /&gt;
[4]: http://en.wikipedia.org/wiki/POWER5 &amp;lt;br /&amp;gt;&lt;br /&gt;
[5]: “Cache write policies and performance,” Norman Jouppi, Proc. 20th International Symposium on Computer Architecture (ACM Computer Architecture News 21:2), May 1993, pp. 191–201. http://www.hpl.hp.com/techreports/Compaq-DEC/WRL-91-12.pdf &amp;lt;br /&amp;gt;&lt;br /&gt;
[6]: &amp;quot;Intel® 64 and IA-32 Architectures Optimization Reference Manual&amp;quot;, Intel Corporation, 1997-2011. http://www.intel.com/Assets/PDF/manual/248966.pdf &amp;lt;br /&amp;gt;&lt;br /&gt;
[7]: &amp;quot;Software Optimization Guide for AMD Family 10h and 12h Processors&amp;quot;, Advanced Micro Devices, Inc., 2006–2010. http://support.amd.com/us/Processor_TechDocs/40546.pdf &amp;lt;br /&amp;gt;&lt;br /&gt;
[8]: Yan Solihin, ''Fundamentals of Parallel Computer Architecture'' (Solihin Publishing and Consulting, LLC, 2008-2009), p. 151&amp;lt;br /&amp;gt;&lt;br /&gt;
[9]: http://en.wikipedia.org/wiki/AMD_Phenom&amp;lt;br /&amp;gt;&lt;br /&gt;
[10]: http://en.wikipedia.org/wiki/Phenom_II&amp;lt;br /&amp;gt;&lt;br /&gt;
[11]: http://en.wikipedia.org/wiki/Athlon_II&amp;lt;br /&amp;gt;&lt;br /&gt;
[12]: http://en.wikipedia.org/wiki/List_of_AMD_Athlon_64_microprocessors&amp;lt;br /&amp;gt;&lt;br /&gt;
[13]: http://en.wikipedia.org/wiki/List_of_AMD_Athlon_X2_microprocessors&amp;lt;br /&amp;gt;&lt;br /&gt;
[14]: http://en.wikipedia.org/wiki/List_of_Intel_Core_i5_microprocessors&amp;lt;br /&amp;gt;&lt;br /&gt;
[15]: http://en.wikipedia.org/wiki/List_of_Intel_Core_i3_microprocessors&amp;lt;br /&amp;gt;&lt;br /&gt;
[16]: http://www.intel.com/pressroom/kits/quickrefyr.htm&amp;lt;br /&amp;gt;&lt;br /&gt;
[17]: http://en.wikipedia.org/wiki/Instruction_prefetch&lt;/div&gt;</summary>
		<author><name>Eapotter</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch6a_ep&amp;diff=44349</id>
		<title>CSC/ECE 506 Spring 2011/ch6a ep</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch6a_ep&amp;diff=44349"/>
		<updated>2011-03-08T04:20:40Z</updated>

		<summary type="html">&lt;p&gt;Eapotter: /* Recent Architectures and their Cache Characteristics */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;INTRODUCTION TO MEMORY HIERARCHY ORGANIZATION &amp;lt;br/&amp;gt;&lt;br /&gt;
Write-Miss Policies and Prefetching&lt;br /&gt;
&lt;br /&gt;
__TOC__&lt;br /&gt;
&lt;br /&gt;
Write miss policies and prefetching are two strategies that are used by multiprocessors to achieve optimal performance for memory accesses.  Write miss policies can help eliminate write misses and therefore reduce bus traffic.  This reduces the wait-time of all processors sharing the bus.  Prefetching assures that a CPU has data blocks in cache to be read when processing large data files and streaming data.  When CPUs or cores share a cache, prefetched data in a shared cache is available to all processes on those cores processing the data.&lt;br /&gt;
&lt;br /&gt;
This article begins by highlighting the variety of multicore processors on the market today that have hierarchical memory structures and shared caches.  It then explores the write miss policies and prefetching techniques that these multiprocessors can use to take advantage of these architectures.&lt;br /&gt;
&lt;br /&gt;
= Recent Architectures and their Cache Characteristics =&lt;br /&gt;
&lt;br /&gt;
Cache management is impacted by three characteristics of modern processor architectures:  multilevel cache hierarchies, multi-core and multiprocessor chip and systems designs, and shared vs. private caches.&lt;br /&gt;
&lt;br /&gt;
Multi-level cache hierarchies are used by most contemporary chip designs to enable memory access to keep pace with CPU speeds that are advancing at a more rapid pace.  Two-level and three-level cache hierarchies are common.  L1 typically ranges from 16 KB to 64 KB and provides access in 2-4 cycles.  L2 typically ranges from 512 KB to as much as 8 MB and provides access in 6-15 cycles.  L3 typically ranges from 4 MB to 32 MB and provides access in 30-50 cycles.[[#References|[8]]]&lt;br /&gt;
&lt;br /&gt;
Multi-core and multiprocessor designs place additional constraints on cache management by requiring it to consider all processes running concurrently in its cache coherency, replenishment, fetching, and other management functions.&lt;br /&gt;
&lt;br /&gt;
In multiprocessors with multiple levels of cache, certain cache levels may be shared by processors while others are private to each processor.  Typically, the caches closest to the processor are private, and the caches farther from it may or may not be shared.  Notice that none of the examples in the table below have a shared L1 cache.  Cache management functions must consider both shared and private caches when reading and writing data from memory.&lt;br /&gt;
&lt;br /&gt;
Below is a table that illustrates the variety with which these characteristics have been combined in processors from four manufacturers over the past 6 years.&lt;br /&gt;
&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; cellpadding=&amp;quot;2&amp;quot;&lt;br /&gt;
!colspan=&amp;quot;7&amp;quot;| Table 1: Recent Architectures and their Cache Characteristics [[#References|[1]]][[#References|[2]]][[#References|[3]]][[#References|[4]]][[#References|[9]]][[#References|[10]]][[#References|[11]]][[#References|[12]]][[#References|[13]]][[#References|[14]]][[#References|[15]]][[#References|[16]]]&lt;br /&gt;
|----&lt;br /&gt;
!Company&lt;br /&gt;
!Processor&lt;br /&gt;
!Cores&lt;br /&gt;
!L1 Cache&lt;br /&gt;
!L2 Cache&lt;br /&gt;
!L3 Cache&lt;br /&gt;
!Released&lt;br /&gt;
|----&lt;br /&gt;
|AMD&lt;br /&gt;
|Athlon 64 X2&lt;br /&gt;
|2&lt;br /&gt;
|128 KB x 2&lt;br /&gt;
|512 KB x 2 or 1 MB x 2&lt;br /&gt;
|&lt;br /&gt;
|2005&lt;br /&gt;
|----&lt;br /&gt;
|AMD&lt;br /&gt;
|Athlon 64 FX&lt;br /&gt;
|2&lt;br /&gt;
|128 KB x 2&lt;br /&gt;
|1MB x 2&lt;br /&gt;
|&lt;br /&gt;
|2006&lt;br /&gt;
|----&lt;br /&gt;
|AMD&lt;br /&gt;
|Athlon X2&lt;br /&gt;
|2&lt;br /&gt;
|128 KB x 2&lt;br /&gt;
|512 KB x 2&lt;br /&gt;
|&lt;br /&gt;
|2007&lt;br /&gt;
|----&lt;br /&gt;
|AMD&lt;br /&gt;
|Phenom X3&lt;br /&gt;
|3&lt;br /&gt;
|128 KB x 3&lt;br /&gt;
|512 KB x 3&lt;br /&gt;
|2 MB&lt;br /&gt;
|2008&lt;br /&gt;
|----&lt;br /&gt;
|AMD&lt;br /&gt;
|Phenom X4&lt;br /&gt;
|4&lt;br /&gt;
|128 KB x 4&lt;br /&gt;
|512 KB x 4&lt;br /&gt;
|2 MB&lt;br /&gt;
|2007&lt;br /&gt;
|----&lt;br /&gt;
|AMD&lt;br /&gt;
|Athlon II X2&lt;br /&gt;
|2&lt;br /&gt;
|128 KB x 2&lt;br /&gt;
|512 KB x 2 or 1 MB x 2&lt;br /&gt;
|&lt;br /&gt;
|2009&lt;br /&gt;
|----&lt;br /&gt;
|AMD&lt;br /&gt;
|Athlon II X3&lt;br /&gt;
|3&lt;br /&gt;
|128 KB x 3&lt;br /&gt;
|512 KB x 3&lt;br /&gt;
|&lt;br /&gt;
|2009&lt;br /&gt;
|----&lt;br /&gt;
|AMD&lt;br /&gt;
|Athlon II X4&lt;br /&gt;
|4&lt;br /&gt;
|128 KB x 4&lt;br /&gt;
|512 KB x 4&lt;br /&gt;
|&lt;br /&gt;
|2009&lt;br /&gt;
|----&lt;br /&gt;
|AMD&lt;br /&gt;
|Phenom II X2&lt;br /&gt;
|2&lt;br /&gt;
|128 KB x 2&lt;br /&gt;
|512 KB x 2&lt;br /&gt;
|6 MB&lt;br /&gt;
|2009&lt;br /&gt;
|----&lt;br /&gt;
|AMD&lt;br /&gt;
|Phenom II X3&lt;br /&gt;
|3&lt;br /&gt;
|128 KB x 3&lt;br /&gt;
|512 KB x 3&lt;br /&gt;
|6 MB&lt;br /&gt;
|2009&lt;br /&gt;
|----&lt;br /&gt;
|AMD&lt;br /&gt;
|Phenom II X4&lt;br /&gt;
|4&lt;br /&gt;
|128 KB x 4&lt;br /&gt;
|512 KB x 4&lt;br /&gt;
|4-6 MB&lt;br /&gt;
|2009&lt;br /&gt;
|----&lt;br /&gt;
|AMD&lt;br /&gt;
|Phenom II X6&lt;br /&gt;
|6&lt;br /&gt;
|128 KB x 6&lt;br /&gt;
|512 KB x 6&lt;br /&gt;
|6 MB&lt;br /&gt;
|2010&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Pentium D&lt;br /&gt;
|2&lt;br /&gt;
|12+16KB x 2&lt;br /&gt;
|2 MB&lt;br /&gt;
|&lt;br /&gt;
|2005&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Celeron E&lt;br /&gt;
|2&lt;br /&gt;
|32 KB x2&lt;br /&gt;
|512 KB - 1 MB&lt;br /&gt;
|&lt;br /&gt;
|2006&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Pentium D&lt;br /&gt;
|2&lt;br /&gt;
|32 KB x2&lt;br /&gt;
|1 MB&lt;br /&gt;
|&lt;br /&gt;
|2006&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Core 2 Duo&lt;br /&gt;
|2&lt;br /&gt;
|32 KB x2&lt;br /&gt;
|2 - 4 MB&lt;br /&gt;
|&lt;br /&gt;
|2006&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Core 2 Quad&lt;br /&gt;
|4&lt;br /&gt;
|32 KB x 4&lt;br /&gt;
|8 MB&lt;br /&gt;
|&lt;br /&gt;
|2007&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Atom 330&lt;br /&gt;
|2&lt;br /&gt;
|32+24KB x 2&lt;br /&gt;
|512 KB x 2&lt;br /&gt;
|&lt;br /&gt;
|2008&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Pentium E&lt;br /&gt;
|2&lt;br /&gt;
|32 KB x 2&lt;br /&gt;
|2 MB&lt;br /&gt;
|&lt;br /&gt;
|2007&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Core 2 Duo E&lt;br /&gt;
|2&lt;br /&gt;
|32 KB x 2&lt;br /&gt;
|3-6 MB&lt;br /&gt;
|&lt;br /&gt;
|2008&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Core 2 Quad&lt;br /&gt;
|4&lt;br /&gt;
|32 KB x 4&lt;br /&gt;
|2-6 MB x 2&lt;br /&gt;
|&lt;br /&gt;
|2008&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Core i3&lt;br /&gt;
|2&lt;br /&gt;
|32+32 KB x 2&lt;br /&gt;
|256 KB x2&lt;br /&gt;
|4 MB&lt;br /&gt;
|2010&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Core i5 - 6 Series&lt;br /&gt;
|2&lt;br /&gt;
|32+32 KB x 2&lt;br /&gt;
|256 KB x2&lt;br /&gt;
|4 MB&lt;br /&gt;
|2010&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Core i5 - 7 Series&lt;br /&gt;
|4&lt;br /&gt;
|32+32 KB x 4&lt;br /&gt;
|256 KB x 4&lt;br /&gt;
|8 MB&lt;br /&gt;
|2010&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Core i5 - 2400 Series and Core i5 - 2500 Series&lt;br /&gt;
|4&lt;br /&gt;
|32+32 KB x 4&lt;br /&gt;
|256 KB x 4&lt;br /&gt;
|6 MB&lt;br /&gt;
|2010&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Core i7 - 8 Series&lt;br /&gt;
|4&lt;br /&gt;
|32+32 KB x 4&lt;br /&gt;
|256 KB x 4&lt;br /&gt;
|8 MB&lt;br /&gt;
|2010&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Core i7 - 9 Series&lt;br /&gt;
|4&lt;br /&gt;
|32+32 KB x 4&lt;br /&gt;
|256 KB x 4&lt;br /&gt;
|8 MB&lt;br /&gt;
|2010&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Core i7 - 970&lt;br /&gt;
|6&lt;br /&gt;
|32+32 KB x 6&lt;br /&gt;
|256 KB x 6&lt;br /&gt;
|12 MB&lt;br /&gt;
|2010&lt;br /&gt;
|----&lt;br /&gt;
|Sun&lt;br /&gt;
|UltraSPARC T1&lt;br /&gt;
|8&lt;br /&gt;
|8 K x 8 Inst. 16 K x 8 Data&lt;br /&gt;
|3 MB&lt;br /&gt;
|&lt;br /&gt;
|2005&lt;br /&gt;
|----&lt;br /&gt;
|Sun&lt;br /&gt;
|SPARC64 VI&lt;br /&gt;
|2&lt;br /&gt;
|128 K x 2 Inst. 128 K x 2 Data&lt;br /&gt;
|5 MB&lt;br /&gt;
|&lt;br /&gt;
|2007&lt;br /&gt;
|----&lt;br /&gt;
|Sun&lt;br /&gt;
|UltraSPARC T2&lt;br /&gt;
|8&lt;br /&gt;
|8 K x 4 Inst. 16 K x 4 Data&lt;br /&gt;
|4 MB&lt;br /&gt;
|&lt;br /&gt;
|2007&lt;br /&gt;
|----&lt;br /&gt;
|Sun&lt;br /&gt;
|SPARC64 VII&lt;br /&gt;
|4&lt;br /&gt;
|64 K x 4 Inst. 64 K x 4 Data&lt;br /&gt;
|6 MB&lt;br /&gt;
|&lt;br /&gt;
|2008&lt;br /&gt;
|----&lt;br /&gt;
|Sun&lt;br /&gt;
|SPARC T3&lt;br /&gt;
|8&lt;br /&gt;
|8 K x 8 Inst. 16 K x 8 Data&lt;br /&gt;
|6 MB&lt;br /&gt;
|&lt;br /&gt;
|2010&lt;br /&gt;
|----&lt;br /&gt;
|Sun&lt;br /&gt;
|SPARC64 VII+&lt;br /&gt;
|2&lt;br /&gt;
|128 K x 2 Inst. 128 K x 2 Data&lt;br /&gt;
|12 MB&lt;br /&gt;
|&lt;br /&gt;
|2010&lt;br /&gt;
|----&lt;br /&gt;
|IBM &lt;br /&gt;
|Power5&lt;br /&gt;
|2&lt;br /&gt;
|64 K x 2 Inst. 64 K x 2 Data&lt;br /&gt;
|4 MB x 2&lt;br /&gt;
|32 MB&lt;br /&gt;
|2004&lt;br /&gt;
|----&lt;br /&gt;
|IBM&lt;br /&gt;
|Power7&lt;br /&gt;
|4, 6, or 8&lt;br /&gt;
|32+32 KB x C&lt;br /&gt;
|256 KB x C&lt;br /&gt;
|4 - 32 MB x C&lt;br /&gt;
|2010&lt;br /&gt;
|----&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Note:  The 'x #' in the L1 and L2 columns indicates that this cache is for each core.&lt;br /&gt;
&lt;br /&gt;
The next two sections discuss cache write policies and cache prefetching, two techniques that improve cache performance by reducing write-miss and read-miss rates.&lt;br /&gt;
&lt;br /&gt;
=Cache Write Policies[[#References|[5]]]=&lt;br /&gt;
In section 6.2.3[[#References|[8]]], cache write hit policies and write miss policies were explored.  The write hit policies, write-through and write-back, determine when the data written in the local cache is propagated to lower-level memory.  As a review, write-through writes data to both the cache and memory on every write.  Write-back writes to the cache first and to memory only when a flush is required.&lt;br /&gt;
&lt;br /&gt;
The write miss policies covered in the text[[#References|[8]]], write-allocate and no-write-allocate, determine if a memory block is stored in a cache line after the write occurs.  Note that since a miss occurred on write, the block is not already in a cache line.&lt;br /&gt;
&lt;br /&gt;
Write-miss policies can also be expressed in terms of write-allocate, fetch-on-write, and write-before-hit.   These can be used in combination with the same write-through and write-back hit policies to define the complete cache write policy.  Investigating alternative policies such as these is important since write-misses produce on average one-third of all cache misses.&lt;br /&gt;
&lt;br /&gt;
The following discusses each of these policies:&lt;br /&gt;
&lt;br /&gt;
==Fetch-on-Write==&lt;br /&gt;
On a write miss, the memory block containing the address to be written is fetched from the lower level memory hierarchy before the write proceeds.  Note that this is different from write-allocate. Fetch-on-write determines if the block is fetched from memory, and write-allocate determines if the memory block is stored in a cache line.  Certain write policies may allocate a line in cache for data that is written by the CPU without retrieving it from memory first.&lt;br /&gt;
&lt;br /&gt;
==No-Fetch-on-Write==&lt;br /&gt;
On a write miss, the memory block is not fetched first from the lower level memory hierarchy.  Therefore, the write can proceed without having to wait for the memory block to be returned.&lt;br /&gt;
&lt;br /&gt;
==Write-Before-Hit==&lt;br /&gt;
On a write, the write proceeds before the cache determines if a hit or miss occurred.  In this scenario, the tag and the data can be written simultaneously, but it incurs an immediate bus transaction for each write by the processor.&lt;br /&gt;
&lt;br /&gt;
==No-Write-Before-Hit==&lt;br /&gt;
On a write, the write waits until the cache determines if the block being written to is in the cache or not.  This may avoid a bus transaction by allowing the processor to write to the cache multiple times before the cache line is flushed to memory.&lt;br /&gt;
&lt;br /&gt;
==Write-Miss Policy Combinations==&lt;br /&gt;
&lt;br /&gt;
In practice, these policies are used in combination to provide an overall write policy.  Four combinations of these three write miss policies are relevant, as illustrated in the table below:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[image:policies.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The two combinations of fetch-on-write and no-write-allocate are blocked out in the above diagram.  They are considered unproductive since the data is fetched from lower level memory, but is not stored in the cache.  Temporal and spatial locality suggest that this data will be used again, and will therefore need to be retrieved a second time.&lt;br /&gt;
&lt;br /&gt;
Three combinations named write-validate, write-around, and write-invalidate all use no-fetch-on-write.  They result in 'eliminated misses' when compared to a fetch-on-write policy.  In general, this will yield better cache performance if the overhead to manage the policy remains low. &lt;br /&gt;
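&lt;br /&gt;
The naming of these combinations can be summarized as a small decision function.  The following Python sketch is illustrative only: the function name and its three boolean flags are hypothetical labels for the policies described above, not part of any real cache interface.&lt;br /&gt;
&lt;br /&gt;
```python
# Illustrative sketch only: the function name and boolean flags are
# hypothetical labels for the three write-miss choices discussed above.

def classify_write_miss_policy(fetch_on_write, write_allocate, write_before_hit):
    """Return the name of the combined write-miss policy."""
    if fetch_on_write:
        if write_allocate:
            # The block is fetched and kept in a cache line (baseline policy).
            return "fetch-on-write"
        # Fetching a block that is then not stored is considered unproductive.
        return "unproductive"
    if write_allocate:
        # A line is allocated and written without fetching the block first.
        return "write-validate"
    if write_before_hit:
        # The line is invalidated on the miss.
        return "write-invalidate"
    # The write goes to lower-level memory and the cached line is left intact.
    return "write-around"


if __name__ == "__main__":
    for flags in [(True, True, False), (False, True, False),
                  (False, False, False), (False, False, True)]:
        print(flags, classify_write_miss_policy(*flags))
```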
&lt;br /&gt;
The following discusses each of these combinations:&lt;br /&gt;
&lt;br /&gt;
===Write-Validate===&lt;br /&gt;
The combination of no-fetch-on-write and write-allocate is referred to as 'write-validate'.  It writes the data into the cache line without fetching the corresponding block from memory first.  The assumption is that the block will be written to memory at a later time.  It requires additional overhead, in the form of dirty bits, to track which bytes in that cache line have been written and which have not.  Lower level memories must also be able to process only the changed portions of these lines.  Otherwise, when the line is flushed to memory, the unwritten bytes may overwrite valid data.&lt;br /&gt;
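&lt;br /&gt;
The per-byte tracking can be sketched with a toy model of a single cache line.  This is a minimal Python sketch under stated assumptions: the class name, the 8-byte line size, and the flat bytearray standing in for lower level memory are all hypothetical choices made for illustration, and real hardware tracks this state quite differently.&lt;br /&gt;
&lt;br /&gt;
```python
# Hypothetical model of one cache line under write-validate.
LINE_SIZE = 8  # bytes per line; an arbitrary size chosen for the example


class WriteValidateLine:
    def __init__(self, memory):
        self.memory = memory                 # stand-in for lower level memory
        self.tag = None                      # block number held by this line
        self.data = bytearray(LINE_SIZE)
        self.written = [False] * LINE_SIZE   # per-byte tracking bits

    def write(self, tag, offset, value):
        """Write one byte; on a miss, allocate the line without fetching."""
        if self.tag != tag:
            # Write miss: flush the old contents, then claim the line
            # for the new block WITHOUT reading that block from memory.
            self.flush()
            self.tag = tag
            self.written = [False] * LINE_SIZE
        self.data[offset] = value
        self.written[offset] = True

    def flush(self):
        """Copy only the bytes that were actually written back to memory."""
        if self.tag is None:
            return
        base = self.tag * LINE_SIZE
        for offset, was_written in enumerate(self.written):
            if was_written:
                self.memory[base + offset] = self.data[offset]


if __name__ == "__main__":
    memory = bytearray(b"ABCDEFGHIJKLMNOP")   # two 8-byte blocks
    line = WriteValidateLine(memory)
    line.write(1, 2, ord("z"))                # write miss on block 1
    line.flush()
    print(memory)
```
&lt;br /&gt;
In this example only the single written byte of block 1 changes in memory; the other seven bytes keep their original values because their tracking bits were never set.&lt;br /&gt;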
&lt;br /&gt;
The no-fetch-on-write policy has the advantage of avoiding an immediate bus transaction to retrieve the block from memory.  For a multiprocessor with a cache coherency model, though, a bus transaction will be required to get exclusive ownership of the block.  While this appears to negate the advantages of no-fetch-on-write, the processor is able to continue other tasks while this transaction is occurring, including rewriting to the same cache location.&lt;br /&gt;
&lt;br /&gt;
Tests by the author demonstrated more than a 90% reduction in write misses when compared to fetch-on-write policies.&lt;br /&gt;
&lt;br /&gt;
===Write-Around===&lt;br /&gt;
The combination of no-fetch-on-write, no-write-allocate, and no-write-before-hit is referred to as a 'write-around'.  It bypasses the higher level cache, writes directly to the lower level memory, and leaves the existing block line in the cache unaltered on a hit or a miss.  This strategy shows performance improvements when the data that is written will not be reread in the near future.  Since we are writing before a hit is detected, the cache is written around for both hits and misses.&lt;br /&gt;
&lt;br /&gt;
The author notes that in all but a few cases write-around performs worse than write-validate.  Most applications tend to reread what they have recently written.  Using a write-around policy, this would result in a cache miss and a read from lower-level memory.  With write-validate, the data would be in cache.&lt;br /&gt;
&lt;br /&gt;
Tests by the author demonstrated a 40-70% reduction in write misses when compared to fetch-on-write policies.&lt;br /&gt;
&lt;br /&gt;
===Write-Invalidate===&lt;br /&gt;
The combination of write-before-hit, no-fetch-on-write, and no-write-allocate is referred to as 'write-invalidate' because the line is invalidated on the miss.  The copy that exists in lower level memory after the write miss differs from the one in the cache.  For write hits, though, the data is simply written into the cache using the cache hit policy.  Thus, for hits, the cache is not written around.&lt;br /&gt;
&lt;br /&gt;
Of these three combinations, write-invalidate performed the worst.  The author notes, though, that it does perform better than fetch-on-write and is easy to implement.  Tests by the author demonstrated a 35-50% reduction in write misses when compared to fetch-on-write policies.&lt;br /&gt;
&lt;br /&gt;
===Fetch-on-write===&lt;br /&gt;
When used with write-allocate, fetch-on-write stores the block and processes the write to cache as expected.  This policy was used as the baseline for determining the performance of the other three policy combinations.  In general, all three previous combinations exhibited fewer misses than this policy.&lt;br /&gt;
&lt;br /&gt;
=Prefetching in Contemporary Parallel Processors=&lt;br /&gt;
Prefetching is a technique that can be used in addition to write-hit and write-miss policies to improve the utilization of the cache.  Microprocessors, such as those listed in Table 1, support prefetching logic directly in the chip.&lt;br /&gt;
&lt;br /&gt;
Prefetching is the process of retrieving instructions or data from memory before the process explicitly requests them.  Instruction prefetching is commonly used by uniprocessors and multiprocessors to reduce process wait states.[[#References|[17]]]  Data prefetching may also be used to pre-populate caches with data that the processors are likely to require in the near term.  If the data requirements are anticipated correctly, requests to memory will result in a higher cache hit rate and, therefore, a lower overall memory access time.  If the prefetcher guesses wrong, bus traffic can increase unnecessarily, more relevant data can be flushed from caches, and miss rates can increase.&lt;br /&gt;
&lt;br /&gt;
Prefetching algorithms can leverage both temporal and spatial locality in making these decisions.  For example, streaming and sequential access applications often process adjacent memory locations in subsequent tasks.&lt;br /&gt;
&lt;br /&gt;
Following are two examples of processors that support prefetching directly in the chip design:&lt;br /&gt;
&lt;br /&gt;
== Intel Core i7  [[#References|[6]]]==&lt;br /&gt;
The Intel 64 Architecture, including the Intel Core i7, includes both instruction and data prefetching directly on the chip.&lt;br /&gt;
&lt;br /&gt;
The Data Cache Unit prefetcher is a streaming prefetcher for L1 caches that detects ascending access to data that has been loaded very recently.  The processor assumes that this ascending access will continue, and prefetches the next line.&lt;br /&gt;
&lt;br /&gt;
The data prefetch logic (DPL) maintains two arrays to track the recent accesses to memory:  one for the upstreams that has 12 entries, and one for downstreams that has 4 entries.  As pages are accessed, their addresses are tracked in these arrays.  When the DPL detects an access to a page that is sequential to an existing entry, it assumes this sequential access will continue, and prefetches the next cache line from memory.&lt;br /&gt;
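&lt;br /&gt;
The idea of matching a new access against a small array of recent accesses can be sketched in a few lines of Python.  This is a deliberately simplified model, not Intel's actual hardware: the class and method names are hypothetical, only the ascending (upstream) array is modeled, and addresses are reduced to plain line numbers.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import deque

TRACKED_ENTRIES = 12  # size of the upstream array given in the text


class SequentialPrefetcher:
    """Simplified, hypothetical model of an ascending stream detector."""

    def __init__(self):
        # Most recent line numbers; the oldest entry drops out when full.
        self.recent = deque(maxlen=TRACKED_ENTRIES)

    def access(self, line):
        """Record an access; return the line to prefetch, or None."""
        prefetch = None
        if line - 1 in self.recent:
            # The access follows a tracked entry, so assume the
            # sequential pattern continues and request the next line.
            prefetch = line + 1
        self.recent.append(line)
        return prefetch


if __name__ == "__main__":
    prefetcher = SequentialPrefetcher()
    for line in [100, 101, 102, 500]:
        print(line, prefetcher.access(line))
```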
&lt;br /&gt;
== AMD [[#References|[7]]]==&lt;br /&gt;
&lt;br /&gt;
The AMD Family 10h processors use a stream-detection strategy similar to the Intel process described above to trigger prefetching of the next sequential memory location into the L1 cache.  (Previous AMD processors fetched into the L2 cache, which introduced the access latency of the L2 cache and hindered performance.)&lt;br /&gt;
&lt;br /&gt;
The AMD Family 10h can prefetch more than just the next sequential block when a stream is detected, though.  When this 'unit-stride' prefetcher detects misses to sequential blocks, it can trigger a preset number of prefetch requests from memory.   For example, if the preset value is two, the next two blocks of memory will be prefetched when a sequential access is detected.  Subsequent detection of misses of sequential blocks may only prefetch a single block.  This maintains the 'unit-stride' of two blocks in the cache ahead of the next anticipated read.&lt;br /&gt;
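&lt;br /&gt;
The behavior described above, a burst of prefetches when the stream is first detected followed by single-block prefetches that keep the prefetcher a fixed distance ahead, can be sketched as follows.  This Python model is an assumption-laden illustration rather than AMD's implementation: the class name, the access method, and the depth parameter (the preset value) are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
class UnitStridePrefetcher:
    """Hypothetical model of a prefetcher that stays `depth` blocks ahead."""

    def __init__(self, depth=2):
        self.depth = depth             # preset number of blocks to stay ahead
        self.last_block = None         # block number of the previous access
        self.prefetched_up_to = None   # highest block already requested

    def access(self, block):
        """Record a demand access; return the list of blocks to prefetch."""
        requests = []
        if self.last_block is not None and block == self.last_block + 1:
            # Sequential access detected: request whatever is needed to
            # keep `depth` blocks ahead, skipping blocks already requested.
            start = block + 1
            if self.prefetched_up_to is not None:
                start = max(start, self.prefetched_up_to + 1)
            requests = list(range(start, block + self.depth + 1))
            if requests:
                self.prefetched_up_to = requests[-1]
        else:
            # The stream was broken, so forget earlier prefetch progress.
            self.prefetched_up_to = None
        self.last_block = block
        return requests


if __name__ == "__main__":
    prefetcher = UnitStridePrefetcher(depth=2)
    for block in [10, 11, 12, 13, 50]:
        print(block, prefetcher.access(block))
```
&lt;br /&gt;
With a preset value of two, the first sequential access in this model requests two blocks, and each later sequential access requests one more block, so the prefetcher remains two blocks ahead of the demand stream.&lt;br /&gt;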
&lt;br /&gt;
AMD contends that this is beneficial for applications that work through large data sets sequentially and consume all the data in the stream.  In these cases, a larger unit-stride populates the cache with blocks that will ultimately be processed by the CPU.  AMD suggests that the optimal number of blocks to prefetch is between 4 and 8, and that fetching too many or fetching too soon has impaired performance in empirical testing.&lt;br /&gt;
&lt;br /&gt;
The AMD Family 10h also includes 'Adaptive Prefetching', a hardware optimization that triggers prefetching when the demand stream catches up to the prefetch stream.  In this case, the unit-stride is increased to ensure that the prefetcher is fetching at a rate sufficient to continuously provide data to the processor.&lt;br /&gt;
&lt;br /&gt;
= References =&lt;br /&gt;
[1]: http://www.techarp.com/showarticle.aspx?artno=337 &amp;lt;br /&amp;gt;&lt;br /&gt;
[2]: http://en.wikipedia.org/wiki/SPARC &amp;lt;br /&amp;gt;&lt;br /&gt;
[3]: http://en.wikipedia.org/wiki/POWER7 &amp;lt;br /&amp;gt;&lt;br /&gt;
[4]: http://en.wikipedia.org/wiki/POWER5 &amp;lt;br /&amp;gt;&lt;br /&gt;
[5]: “Cache write policies and performance,” Norman Jouppi, Proc. 20th International Symposium on Computer Architecture (ACM Computer Architecture News 21:2), May 1993, pp. 191–201. http://www.hpl.hp.com/techreports/Compaq-DEC/WRL-91-12.pdf &amp;lt;br /&amp;gt;&lt;br /&gt;
[6]: &amp;quot;Intel® 64 and IA-32 Architectures Optimization Reference Manual&amp;quot;, Intel Corporation, 1997-2011. http://www.intel.com/Assets/PDF/manual/248966.pdf &amp;lt;br /&amp;gt;&lt;br /&gt;
[7]: &amp;quot;Software Optimization Guide for AMD Family 10h and 12h Processors&amp;quot;, Advanced Micro Devices, Inc., 2006–2010. http://support.amd.com/us/Processor_TechDocs/40546.pdf &amp;lt;br /&amp;gt;&lt;br /&gt;
[8]: Yan Solihin, ''Fundamentals of Parallel Computer Architecture'' (Solihin Publishing and Consulting, LLC, 2008-2009), p. 151&amp;lt;br /&amp;gt;&lt;br /&gt;
[9]: http://en.wikipedia.org/wiki/AMD_Phenom&amp;lt;br /&amp;gt;&lt;br /&gt;
[10]: http://en.wikipedia.org/wiki/Phenom_II&amp;lt;br /&amp;gt;&lt;br /&gt;
[11]: http://en.wikipedia.org/wiki/Athlon_II&amp;lt;br /&amp;gt;&lt;br /&gt;
[12]: http://en.wikipedia.org/wiki/List_of_AMD_Athlon_64_microprocessors&amp;lt;br /&amp;gt;&lt;br /&gt;
[13]: http://en.wikipedia.org/wiki/List_of_AMD_Athlon_X2_microprocessors&amp;lt;br /&amp;gt;&lt;br /&gt;
[14]: http://en.wikipedia.org/wiki/List_of_Intel_Core_i5_microprocessors&amp;lt;br /&amp;gt;&lt;br /&gt;
[15]: http://en.wikipedia.org/wiki/List_of_Intel_Core_i3_microprocessors&amp;lt;br /&amp;gt;&lt;br /&gt;
[16]: http://www.intel.com/pressroom/kits/quickrefyr.htm&amp;lt;br /&amp;gt;&lt;br /&gt;
[17]: http://en.wikipedia.org/wiki/Instruction_prefetch&lt;/div&gt;</summary>
		<author><name>Eapotter</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch6a_ep&amp;diff=44348</id>
		<title>CSC/ECE 506 Spring 2011/ch6a ep</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch6a_ep&amp;diff=44348"/>
		<updated>2011-03-08T04:15:34Z</updated>

		<summary type="html">&lt;p&gt;Eapotter: /* Prefetching in Contemporary Parallel Processors */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;INTRODUCTION TO MEMORY HIERARCHY ORGANIZATION &amp;lt;br/&amp;gt;&lt;br /&gt;
Write-Miss Policies and Prefetching&lt;br /&gt;
&lt;br /&gt;
__TOC__&lt;br /&gt;
&lt;br /&gt;
Write miss policies and prefetching are two strategies that are used by multiprocessors to achieve optimal performance for memory accesses.  Write miss policies can help eliminate write misses and therefore reduce bus traffic.  This reduces the wait-time of all processors sharing the bus.  Prefetching assures that a CPU has data blocks in cache to be read when processing large data files and streaming data.  When CPUs or cores share a cache, prefetched data in a shared cache is available to all processes on those cores processing the data.&lt;br /&gt;
&lt;br /&gt;
This article begins by highlighting the variety of multicore processors on the market today that have hierarchical memory structures and shared caches.  It then explores the write miss policies and prefetching techniques that these multiprocessors can use to take advantage of these architectures.&lt;br /&gt;
&lt;br /&gt;
= Recent Architectures and their Cache Characteristics =&lt;br /&gt;
&lt;br /&gt;
Cache management is impacted by three characteristics of modern processor architectures:  multilevel cache hierarchies, multi-core and multiprocessor chip and systems designs, and shared vs. private caches.&lt;br /&gt;
&lt;br /&gt;
Multi-level cache hierarchies are used by most contemporary chip designs to enable memory access to keep pace with CPU speeds that are advancing at a more rapid pace.  Two-level and three-level cache hierarchies are common.  L1 typically ranges from 16 KB to 64 KB and provides access in 2-4 cycles.  L2 typically ranges from 512 KB to as much as 8 MB and provides access in 6-15 cycles.  L3 typically ranges from 4 MB to 32 MB and provides access in 30-50 cycles.[[#References|[8]]]&lt;br /&gt;
&lt;br /&gt;
Multi-core and multiprocessor designs place additional constraints on cache management by requiring it to consider all processes running concurrently in its cache coherency, replenishment, fetching, and other management functions.&lt;br /&gt;
&lt;br /&gt;
In multiprocessors with multiple levels of cache, certain cache levels may be shared by processors while others are private to each processor.  Typically, the caches closest to the processor are private, and the caches farther from it may or may not be shared.  Notice that none of the examples in the table below have a shared L1 cache.  Cache management functions must consider both shared and private caches when reading and writing data from memory.&lt;br /&gt;
&lt;br /&gt;
Below is a table that illustrates the variety with which these characteristics have been combined in processors from four manufacturers over the past 6 years.&lt;br /&gt;
&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; cellpadding=&amp;quot;2&amp;quot;&lt;br /&gt;
!colspan=&amp;quot;7&amp;quot;| Table 1: Recent Architectures and their Cache Characteristics [[#References|[1]]][[#References|[2]]][[#References|[3]]][[#References|[4]]][[#References|[9]]][[#References|[10]]][[#References|[11]]][[#References|[12]]][[#References|[13]]][[#References|[14]]][[#References|[15]]][[#References|[16]]]&lt;br /&gt;
|----&lt;br /&gt;
!Company&lt;br /&gt;
!Processor&lt;br /&gt;
!Cores&lt;br /&gt;
!L1 Cache&lt;br /&gt;
!L2 Cache&lt;br /&gt;
!L3 Cache&lt;br /&gt;
!Released&lt;br /&gt;
|----&lt;br /&gt;
|AMD&lt;br /&gt;
|Athlon 64 X2&lt;br /&gt;
|2&lt;br /&gt;
|128 KB x 2&lt;br /&gt;
|512 KB x 2 or 1 MB x 2&lt;br /&gt;
|&lt;br /&gt;
|2005&lt;br /&gt;
|----&lt;br /&gt;
|AMD&lt;br /&gt;
|Athlon 64 FX&lt;br /&gt;
|2&lt;br /&gt;
|128 KB x 2&lt;br /&gt;
|1MB x 2&lt;br /&gt;
|&lt;br /&gt;
|2006&lt;br /&gt;
|----&lt;br /&gt;
|AMD&lt;br /&gt;
|Athlon X2&lt;br /&gt;
|2&lt;br /&gt;
|128 KB x 2&lt;br /&gt;
|512 KB x 2&lt;br /&gt;
|&lt;br /&gt;
|2007&lt;br /&gt;
|----&lt;br /&gt;
|AMD&lt;br /&gt;
|Phenom X3&lt;br /&gt;
|3&lt;br /&gt;
|128 KB x 3&lt;br /&gt;
|512 KB x 3&lt;br /&gt;
|2 MB&lt;br /&gt;
|2008&lt;br /&gt;
|----&lt;br /&gt;
|AMD&lt;br /&gt;
|Phenom X4&lt;br /&gt;
|4&lt;br /&gt;
|128 KB x 4&lt;br /&gt;
|512 KB x 4&lt;br /&gt;
|2 MB&lt;br /&gt;
|2007&lt;br /&gt;
|----&lt;br /&gt;
|AMD&lt;br /&gt;
|Athlon II X2&lt;br /&gt;
|2&lt;br /&gt;
|128 KB x 2&lt;br /&gt;
|512 KB x 2 or 1 MB x 2&lt;br /&gt;
|&lt;br /&gt;
|2009&lt;br /&gt;
|----&lt;br /&gt;
|AMD&lt;br /&gt;
|Athlon II X3&lt;br /&gt;
|3&lt;br /&gt;
|128 KB x 3&lt;br /&gt;
|512 KB x 3&lt;br /&gt;
|&lt;br /&gt;
|2009&lt;br /&gt;
|----&lt;br /&gt;
|AMD&lt;br /&gt;
|Athlon II X4&lt;br /&gt;
|4&lt;br /&gt;
|128 KB x 4&lt;br /&gt;
|512 KB x 4&lt;br /&gt;
|&lt;br /&gt;
|2009&lt;br /&gt;
|----&lt;br /&gt;
|AMD&lt;br /&gt;
|Phenom II X2&lt;br /&gt;
|2&lt;br /&gt;
|128 KB x 2&lt;br /&gt;
|512 KB x 2&lt;br /&gt;
|6 MB&lt;br /&gt;
|2009&lt;br /&gt;
|----&lt;br /&gt;
|AMD&lt;br /&gt;
|Phenom II X3&lt;br /&gt;
|3&lt;br /&gt;
|128 KB x 3&lt;br /&gt;
|512 KB x 3&lt;br /&gt;
|6 MB&lt;br /&gt;
|2009&lt;br /&gt;
|----&lt;br /&gt;
|AMD&lt;br /&gt;
|Phenom II X4&lt;br /&gt;
|4&lt;br /&gt;
|128 KB x 4&lt;br /&gt;
|512 KB x 4&lt;br /&gt;
|4-6 MB&lt;br /&gt;
|2009&lt;br /&gt;
|----&lt;br /&gt;
|AMD&lt;br /&gt;
|Phenom II X6&lt;br /&gt;
|6&lt;br /&gt;
|128 KB x 6&lt;br /&gt;
|512 KB x 6&lt;br /&gt;
|6 MB&lt;br /&gt;
|2010&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Pentium D&lt;br /&gt;
|2&lt;br /&gt;
|12+16KB x 2&lt;br /&gt;
|2 MB&lt;br /&gt;
|&lt;br /&gt;
|2005&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Celeron E&lt;br /&gt;
|2&lt;br /&gt;
|32 KB x2&lt;br /&gt;
|512 KB - 1 MB&lt;br /&gt;
|&lt;br /&gt;
|2006&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Pentium D&lt;br /&gt;
|2&lt;br /&gt;
|32 KB x2&lt;br /&gt;
|1 MB&lt;br /&gt;
|&lt;br /&gt;
|2006&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Core 2 Duo&lt;br /&gt;
|2&lt;br /&gt;
|32 KB x2&lt;br /&gt;
|2 - 4 MB&lt;br /&gt;
|&lt;br /&gt;
|2006&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Core 2 Quad&lt;br /&gt;
|4&lt;br /&gt;
|32 KB x 4&lt;br /&gt;
|8 MB&lt;br /&gt;
|&lt;br /&gt;
|2007&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Atom 330&lt;br /&gt;
|2&lt;br /&gt;
|32+24KB x 2&lt;br /&gt;
|512 KB x 2&lt;br /&gt;
|&lt;br /&gt;
|2008&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Pentium E&lt;br /&gt;
|2&lt;br /&gt;
|32 KB x 2&lt;br /&gt;
|2 MB&lt;br /&gt;
|&lt;br /&gt;
|2007&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Core 2 Duo E&lt;br /&gt;
|2&lt;br /&gt;
|32 KB x 2&lt;br /&gt;
|3-6 MB&lt;br /&gt;
|&lt;br /&gt;
|2008&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Core 2 Quad&lt;br /&gt;
|4&lt;br /&gt;
|32 KB x 4&lt;br /&gt;
|2-6 MB x 2&lt;br /&gt;
|&lt;br /&gt;
|2008&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Core i3&lt;br /&gt;
|2&lt;br /&gt;
|32+32 KB x 2&lt;br /&gt;
|256 KB x2&lt;br /&gt;
|4 MB&lt;br /&gt;
|2010&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Core i5 - 6 Series&lt;br /&gt;
|2&lt;br /&gt;
|32+32 KB x 2&lt;br /&gt;
|256 KB x2&lt;br /&gt;
|4 MB&lt;br /&gt;
|2010&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Core i5 - 7 Series&lt;br /&gt;
|4&lt;br /&gt;
|32+32 KB x 4&lt;br /&gt;
|256 KB x 4&lt;br /&gt;
|8 MB&lt;br /&gt;
|2010&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Core i5 - 2400 Series, Core i5 - 2500 Series&lt;br /&gt;
|4&lt;br /&gt;
|32+32 KB x 4&lt;br /&gt;
|256 KB x 4&lt;br /&gt;
|6 MB&lt;br /&gt;
|2010&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Core i7 - 8 Series&lt;br /&gt;
|4&lt;br /&gt;
|32+32 KB x 4&lt;br /&gt;
|256 KB x 4&lt;br /&gt;
|8 MB&lt;br /&gt;
|2010&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Core i7 - 9 Series&lt;br /&gt;
|4&lt;br /&gt;
|32+32 KB x 4&lt;br /&gt;
|256 KB x 4&lt;br /&gt;
|8 MB&lt;br /&gt;
|2010&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Core i7 - 970&lt;br /&gt;
|6&lt;br /&gt;
|32+32 KB x 6&lt;br /&gt;
|256 KB x 6&lt;br /&gt;
|12 MB&lt;br /&gt;
|2010&lt;br /&gt;
|----&lt;br /&gt;
|Sun&lt;br /&gt;
|UltraSPARC T1&lt;br /&gt;
|8&lt;br /&gt;
|8 K x 8 Inst. 16 K x 8 Data&lt;br /&gt;
|3 MB&lt;br /&gt;
|&lt;br /&gt;
|2005&lt;br /&gt;
|----&lt;br /&gt;
|Sun&lt;br /&gt;
|SPARC64 VI&lt;br /&gt;
|2&lt;br /&gt;
|128 K x 2 Inst. 128 K x 2 Data&lt;br /&gt;
|5 MB&lt;br /&gt;
|&lt;br /&gt;
|2007&lt;br /&gt;
|----&lt;br /&gt;
|Sun&lt;br /&gt;
|UltraSPARC T2&lt;br /&gt;
|8&lt;br /&gt;
|8 K x 4 Inst. 16 K x 4 Data&lt;br /&gt;
|4 MB&lt;br /&gt;
|&lt;br /&gt;
|2007&lt;br /&gt;
|----&lt;br /&gt;
|Sun&lt;br /&gt;
|SPARC64 VII&lt;br /&gt;
|4&lt;br /&gt;
|64 K x 4 Inst. 64 K x 4 Data&lt;br /&gt;
|6 MB&lt;br /&gt;
|&lt;br /&gt;
|2008&lt;br /&gt;
|----&lt;br /&gt;
|Sun&lt;br /&gt;
|SPARC T3&lt;br /&gt;
|8&lt;br /&gt;
|8 K x 8 Inst. 16 K x 8 Data&lt;br /&gt;
|6 MB&lt;br /&gt;
|&lt;br /&gt;
|2010&lt;br /&gt;
|----&lt;br /&gt;
|Sun&lt;br /&gt;
|SPARC64 VII+&lt;br /&gt;
|2&lt;br /&gt;
|128 K x 2 Inst. 128 K x 2 Data&lt;br /&gt;
|12 MB&lt;br /&gt;
|&lt;br /&gt;
|2010&lt;br /&gt;
|----&lt;br /&gt;
|IBM &lt;br /&gt;
|Power5&lt;br /&gt;
|2&lt;br /&gt;
|64 K x 2 Inst. 64 K x 2 Data&lt;br /&gt;
|4 MB x 2&lt;br /&gt;
|32 MB&lt;br /&gt;
|2004&lt;br /&gt;
|----&lt;br /&gt;
|IBM&lt;br /&gt;
|Power7&lt;br /&gt;
|4, 6, or 8&lt;br /&gt;
|32+32 KB x C&lt;br /&gt;
|256 KB x C&lt;br /&gt;
|4 - 32 MB x C&lt;br /&gt;
|2010&lt;br /&gt;
|----&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Note:  The 'x #' suffix (or 'x C', where C is the number of cores) indicates that each core has its own cache of the listed size.&lt;br /&gt;
&lt;br /&gt;
=Cache Write Policies[[#References|[5]]]=&lt;br /&gt;
In section 6.2.3[[#References|[8]]], cache write hit policies and write miss policies were explored.  The write hit policies, write-through and write-back, determine when the data written in the local cache is propagated to lower-level memory.  As a review, write-through writes data to both the cache and memory on every write.  Write-back writes to the cache first and to memory only when a flush is required.&lt;br /&gt;
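&lt;br /&gt;
The difference between the two write hit policies can be sketched in a few lines of Python.  This is a deliberately simplified model rather than a description of any particular cache: it assumes a single cached line, treats every access as a write hit, and assumes one flush at the end.  The function name and addresses are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
def count_memory_writes(addresses, write_back):
    """Count writes that reach memory for a single cached line.

    Assumes every access is a write hit to an already-cached line and
    that one flush happens at the end of the sequence.
    """
    memory_writes = 0
    dirty = False
    for _ in addresses:
        if write_back:
            dirty = True          # update the cache only; defer memory
        else:
            memory_writes += 1    # write-through: cache and memory together
    if write_back and dirty:
        memory_writes += 1        # one flush propagates all deferred writes
    return memory_writes


writes = [0x40, 0x44, 0x48, 0x40]   # four writes that hit the same line
through = count_memory_writes(writes, write_back=False)
back = count_memory_writes(writes, write_back=True)
print(through, back)
```
&lt;br /&gt;
In this model write-through sends all four writes to memory, while write-back sends a single write when the line is flushed.&lt;br /&gt;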
&lt;br /&gt;
The write miss policies covered in the text[[#References|[8]]], write-allocate and no-write-allocate, determine if a memory block is stored in a cache line after the write occurs.  Note that since a miss occurred on write, the block is not already in a cache line.&lt;br /&gt;
&lt;br /&gt;
Write-miss policies can also be expressed in terms of write-allocate, fetch-on-write, and write-before-hit.   These can be used in combination with the same write-through and write-back hit policies to define the complete cache write policy.  Investigating alternative policies such as these is important since write-misses produce on average one-third of all cache misses.&lt;br /&gt;
&lt;br /&gt;
The following discusses each of these policies:&lt;br /&gt;
&lt;br /&gt;
==Fetch-on-Write==&lt;br /&gt;
On a write miss, the memory block containing the address to be written is fetched from the lower level memory hierarchy before the write proceeds.  Note that this is different from write-allocate. Fetch-on-write determines if the block is fetched from memory, and write-allocate determines if the memory block is stored in a cache line.  Certain write policies may allocate a line in cache for data that is written by the CPU without retrieving it from memory first.&lt;br /&gt;
&lt;br /&gt;
==No-Fetch-on-Write==&lt;br /&gt;
On a write miss, the memory block is not fetched first from the lower level memory hierarchy.  Therefore, the write can proceed without having to wait for the memory block to be returned.&lt;br /&gt;
&lt;br /&gt;
==Write-Before-Hit==&lt;br /&gt;
On a write, the write proceeds before the cache determines if a hit or miss occurred.  In this scenario, the tag and the data can be written simultaneously, but it incurs an immediate bus transaction for each write by the processor.&lt;br /&gt;
&lt;br /&gt;
==No-Write-Before-Hit==&lt;br /&gt;
On a write, the write waits until the cache determines if the block being written to is in the cache or not.  This may avoid a bus transaction by allowing the processor to write to the cache multiple times before the cache line is flushed to memory.&lt;br /&gt;
&lt;br /&gt;
==Write-Miss Policy Combinations==&lt;br /&gt;
&lt;br /&gt;
In practice, these policies are used in combination to provide an overall write policy.  Four combinations of these three write miss policies are relevant, as illustrated in the table below:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[image:policies.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The two combinations of fetch-on-write and no-write-allocate are blocked out in the above diagram.  They are considered unproductive since the data is fetched from lower level memory but is not stored in the cache.  Temporal and spatial locality suggest that this data will be used again, and will therefore need to be retrieved a second time.&lt;br /&gt;
&lt;br /&gt;
Three combinations named write-validate, write-around, and write-invalidate all use no-fetch-on-write.  They result in 'eliminated misses' when compared to a fetch-on-write policy.  In general, this will yield better cache performance if the overhead to manage the policy remains low. &lt;br /&gt;
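&lt;br /&gt;
The mapping from the three yes/no choices to the named combinations can be summarized in a short Python sketch.  The function name and the label used for the blocked-out combinations are hypothetical; the mapping itself follows the descriptions in this section.&lt;br /&gt;
&lt;br /&gt;
```python
def name_write_miss_policy(fetch_on_write, write_allocate, write_before_hit):
    """Map the three yes/no choices to the combination names in this section."""
    if fetch_on_write:
        if write_allocate:
            return "fetch-on-write"
        # Fetched from lower-level memory but never stored in the cache.
        return "unproductive"
    if write_allocate:
        return "write-validate"
    if write_before_hit:
        return "write-invalidate"
    return "write-around"


# Enumerate all eight combinations of the three choices.
for fetch in (True, False):
    for allocate in (True, False):
        for before_hit in (True, False):
            print(fetch, allocate, before_hit,
                  name_write_miss_policy(fetch, allocate, before_hit))
```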
&lt;br /&gt;
The following discusses each of these combinations:&lt;br /&gt;
&lt;br /&gt;
===Write-Validate===&lt;br /&gt;
The combination of no-fetch-on-write and write-allocate is referred to as 'write-validate'.  It writes the data into the cache line without fetching the corresponding block from memory first.  The assumption is that the block will be written to memory at a later time.  It requires additional overhead, in the form of dirty bits, to track which bytes have been written into that cache line and which have not.  Lower level memories must also be able to process only the changed portions of these lines.  Otherwise, when the line is flushed to memory, the unwritten bytes may overwrite valid data.&lt;br /&gt;
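&lt;br /&gt;
The per-byte bookkeeping described above can be illustrated with a small Python model.  It is only a sketch under stated assumptions: the class name, the 8-byte line size, and the byte-addressed memory are all hypothetical, and real hardware may track written data at a coarser granularity.&lt;br /&gt;
&lt;br /&gt;
```python
class WriteValidateLine:
    """One cache line with a per-byte 'written' flag (hypothetical model)."""

    def __init__(self, base_address, size=8):
        self.base_address = base_address
        self.data = bytearray(size)
        self.written = [False] * size   # which bytes the CPU has written

    def write(self, address, value):
        # No fetch on the miss: just record the byte and mark it written.
        offset = address - self.base_address
        self.data[offset] = value
        self.written[offset] = True

    def flush(self, memory):
        # Merge only the written bytes so unwritten bytes in the line
        # cannot overwrite valid data in lower-level memory.
        for offset, was_written in enumerate(self.written):
            if was_written:
                memory[self.base_address + offset] = self.data[offset]
        self.written = [False] * len(self.data)


memory = bytearray(range(16))           # pretend lower-level memory
line = WriteValidateLine(base_address=8)
line.write(9, 0xAA)
line.write(12, 0xBB)
line.flush(memory)
print(list(memory[8:16]))
```
&lt;br /&gt;
After the flush, only bytes 9 and 12 of the pretend memory have changed; the remaining bytes keep their original values even though the line was never fetched.&lt;br /&gt;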
&lt;br /&gt;
No-fetch-on-write has the advantage of avoiding an immediate bus transaction to retrieve the block from memory.  For a multiprocessor with a cache coherency model, though, a bus transaction will be required to get exclusive ownership of the block.  While this appears to negate the advantages of no-fetch-on-write, the processor is able to continue other tasks while this transaction is occurring, including rewriting to the same cache location.&lt;br /&gt;
&lt;br /&gt;
Tests by the author of the referenced study[[#References|[5]]] demonstrated more than a 90% reduction in write misses when compared to fetch-on-write policies.&lt;br /&gt;
&lt;br /&gt;
===Write-Around===&lt;br /&gt;
The combination of no-fetch-on-write, no-write-allocate, and no-write-before-hit is referred to as 'write-around'.  It bypasses the higher level cache, writes directly to the lower level memory, and leaves the existing line in the cache unaltered on a hit or a miss.  This strategy shows performance improvements when the data that is written will not be reread in the near future.  Since we are writing before a hit is detected, the cache is written around for both hits and misses.&lt;br /&gt;
&lt;br /&gt;
The author notes that in all but a few cases write-around performs worse than write-validate.  Most applications tend to reread what they have recently written.  Using a write-around policy, this would result in a cache miss and a read from lower-level memory.  With write-validate, the data would be in cache.&lt;br /&gt;
&lt;br /&gt;
Tests by the author demonstrated a 40-70% reduction in write misses when compared to fetch-on-write policies.&lt;br /&gt;
&lt;br /&gt;
===Write-Invalidate===&lt;br /&gt;
The combination of write-before-hit, no-fetch-on-write, and no-write-allocate is referred to as 'write-invalidate' because the line is invalidated on the miss.  The copy that exists in lower level memory after the write miss differs from the one in the cache.  For write hits, though, the data is simply written into the cache using the cache hit policy.  Thus, for hits, the cache is not written around.&lt;br /&gt;
&lt;br /&gt;
Of these three combinations, write-invalidate performed the worst.  The author notes, though, that it does perform better than fetch-on-write and is easy to implement.  Tests by the author demonstrated a 35-50% reduction in write misses when compared to fetch-on-write policies.&lt;br /&gt;
&lt;br /&gt;
===Fetch-on-write===&lt;br /&gt;
When used with write-allocate, fetch-on-write stores the block and processes the write to cache as expected.  This policy was used as the baseline for determining the performance of the other three policy combinations.   In general, all three previous combinations exhibited fewer misses than this policy.&lt;br /&gt;
&lt;br /&gt;
=Prefetching in Contemporary Parallel Processors=&lt;br /&gt;
Prefetching is a technique that can be used in addition to write-hit and write-miss policies to improve the utilization of the cache.  Microprocessors, such as those listed in Table 1, support prefetching logic directly in the chip.&lt;br /&gt;
&lt;br /&gt;
Prefetching is the process of retrieving instructions or data from memory before they are explicitly requested.  Instruction prefetching is commonly used by single and multiprocessors to reduce processor wait states.[[#References|[17]]]  Data prefetching may also be used, though, to pre-populate caches with data that is likely to be required by the processors in the near term.  If the data requirements are anticipated correctly, the requests to memory will result in a greater cache hit rate and, therefore, reduced overall memory access time.  If the prefetcher guesses wrong, bus traffic can increase unnecessarily, more relevant data can be flushed from caches, and miss rates can increase.&lt;br /&gt;
&lt;br /&gt;
Prefetching algorithms can leverage both temporal and spatial locality in making these decisions.  For example, streaming and sequential access applications often process adjacent memory locations in subsequent tasks.&lt;br /&gt;
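&lt;br /&gt;
The benefit for a sequential stream can be seen in a toy Python model of a next-block prefetcher.  The model is an assumption-laden sketch: the cache is unbounded (so nothing is ever evicted), prefetched blocks arrive instantly, and the function name is hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
def count_hits(block_accesses, prefetch_next):
    """Count hits in an unbounded cache of block numbers.

    With prefetch_next set, every access also brings in the following
    block, which exploits spatial locality in sequential streams.
    """
    cached = set()
    hits = 0
    for block in block_accesses:
        if block in cached:
            hits += 1
        cached.add(block)
        if prefetch_next:
            cached.add(block + 1)   # speculative fetch of the next block
    return hits


stream = list(range(100, 110))          # ten sequential blocks
print(count_hits(stream, prefetch_next=False),
      count_hits(stream, prefetch_next=True))
```
&lt;br /&gt;
For this ten-block stream the model records no hits without prefetching and nine hits with it; only the first access misses.&lt;br /&gt;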
&lt;br /&gt;
Following are two examples of processors that support prefetching directly in the chip design:&lt;br /&gt;
&lt;br /&gt;
== Intel Core i7  [[#References|[6]]]==&lt;br /&gt;
The Intel 64 Architecture, including the Intel Core i7, includes both instruction and data prefetching directly on the chip.&lt;br /&gt;
&lt;br /&gt;
The Data Cache Unit prefetcher is a streaming prefetcher for L1 caches that detects ascending access to data that has been loaded very recently.  The processor assumes that this ascending access will continue, and prefetches the next line.&lt;br /&gt;
&lt;br /&gt;
The data prefetch logic (DPL) maintains two arrays to track the recent accesses to memory:  one for the upstreams that has 12 entries, and one for downstreams that has 4 entries.  As pages are accessed, their addresses are tracked in these arrays.  When the DPL detects an access to a page that is sequential to an existing entry, it assumes this sequential access will continue, and prefetches the next cache line from memory.&lt;br /&gt;
&lt;br /&gt;
== AMD [[#References|[7]]]==&lt;br /&gt;
&lt;br /&gt;
The AMD Family 10h processors use a stream-detection strategy similar to the Intel approach described above to trigger prefetching of the next sequential memory location into the L1 cache.  (Previous AMD processors fetched into the L2 cache, which introduced the access latency of the L2 cache and hindered performance.)&lt;br /&gt;
&lt;br /&gt;
The AMD Family 10h can prefetch more than just the next sequential block when a stream is detected, though.  When this 'unit-stride' prefetcher detects misses to sequential blocks, it can trigger a preset number of prefetch requests from memory.   For example, if the preset value is two, the next two blocks of memory will be prefetched when a sequential access is detected.  Subsequent detection of misses of sequential blocks may only prefetch a single block.  This maintains the 'unit-stride' of two blocks in the cache ahead of the next anticipated read.&lt;br /&gt;
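&lt;br /&gt;
The behavior described above, an initial burst followed by single-block top-ups, can be sketched in Python.  This is an illustrative model only and not AMD's hardware logic: it reacts to every sequential access rather than only to misses, and the function name and the 'ahead' parameter are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
def unit_stride_prefetches(block_accesses, ahead=2):
    """Return the blocks prefetched after each access (simplified model).

    'ahead' is the preset number of blocks to stay in front of the
    demand stream. This is an illustrative model, not AMD's hardware logic.
    """
    issued = []
    previous = None
    frontier = None            # highest block prefetched for the stream
    for block in block_accesses:
        sequential = previous is not None and block == previous + 1
        if not sequential:
            frontier = None    # stream broken; wait for a new one
            batch = []
        elif frontier is None:
            # Stream just detected: fetch the preset number of blocks.
            batch = list(range(block + 1, block + 1 + ahead))
        else:
            # Stream continues: top up so we stay 'ahead' blocks in front.
            batch = list(range(frontier + 1, block + 1 + ahead))
        if batch:
            frontier = batch[-1]
        issued.append(batch)
        previous = block
    return issued


print(unit_stride_prefetches([10, 11, 12, 13, 20, 21]))
```
&lt;br /&gt;
In this model the access to block 11 triggers prefetches of blocks 12 and 13, each later sequential access adds one more block, and the jump to block 20 resets the stream until block 21 is seen.&lt;br /&gt;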
&lt;br /&gt;
AMD contends that this is beneficial when processing large data sets, which are often accessed sequentially and processed in their entirety.  In these cases, a larger unit-stride populates the cache with blocks that will ultimately be processed by the CPU.  AMD suggests that the optimal number of blocks to prefetch is between 4 and 8, and that fetching too many or fetching too soon has impaired performance in empirical testing.&lt;br /&gt;
&lt;br /&gt;
The AMD Family 10h also includes 'Adaptive Prefetching', a hardware optimization that triggers prefetching when the demand stream catches up to the prefetch stream.  In this case, the unit-stride is increased to ensure that the prefetcher is fetching at a rate sufficient to continuously provide data to the processor.&lt;br /&gt;
&lt;br /&gt;
= References =&lt;br /&gt;
[1]: http://www.techarp.com/showarticle.aspx?artno=337 &amp;lt;br /&amp;gt;&lt;br /&gt;
[2]: http://en.wikipedia.org/wiki/SPARC &amp;lt;br /&amp;gt;&lt;br /&gt;
[3]: http://en.wikipedia.org/wiki/POWER7 &amp;lt;br /&amp;gt;&lt;br /&gt;
[4]: http://en.wikipedia.org/wiki/POWER5 &amp;lt;br /&amp;gt;&lt;br /&gt;
[5]: “Cache write policies and performance,” Norman Jouppi, Proc. 20th International Symposium on Computer Architecture (ACM Computer Architecture News 21:2), May 1993, pp. 191–201. http://www.hpl.hp.com/techreports/Compaq-DEC/WRL-91-12.pdf &amp;lt;br /&amp;gt;&lt;br /&gt;
[6]: &amp;quot;Intel® 64 and IA-32 Architectures Optimization Reference Manual&amp;quot;, Intel Corporation, 1997-2011. http://www.intel.com/Assets/PDF/manual/248966.pdf &amp;lt;br /&amp;gt;&lt;br /&gt;
[7]: &amp;quot;Software Optimization Guide for AMD Family 10h and 12h Processors&amp;quot;, Advanced Micro Devices, Inc., 2006–2010. http://support.amd.com/us/Processor_TechDocs/40546.pdf &amp;lt;br /&amp;gt;&lt;br /&gt;
[8]: Yan Solihin, ''Fundamentals of Parallel Computer Architecture'' (Solihin Publishing and Consulting, LLC, 2008-2009), p. 151&amp;lt;br /&amp;gt;&lt;br /&gt;
[9]: http://en.wikipedia.org/wiki/AMD_Phenom&amp;lt;br /&amp;gt;&lt;br /&gt;
[10]: http://en.wikipedia.org/wiki/Phenom_II&amp;lt;br /&amp;gt;&lt;br /&gt;
[11]: http://en.wikipedia.org/wiki/Athlon_II&amp;lt;br /&amp;gt;&lt;br /&gt;
[12]: http://en.wikipedia.org/wiki/List_of_AMD_Athlon_64_microprocessors&amp;lt;br /&amp;gt;&lt;br /&gt;
[13]: http://en.wikipedia.org/wiki/List_of_AMD_Athlon_X2_microprocessors&amp;lt;br /&amp;gt;&lt;br /&gt;
[14]: http://en.wikipedia.org/wiki/List_of_Intel_Core_i5_microprocessors&amp;lt;br /&amp;gt;&lt;br /&gt;
[15]: http://en.wikipedia.org/wiki/List_of_Intel_Core_i3_microprocessors&amp;lt;br /&amp;gt;&lt;br /&gt;
[16]: http://www.intel.com/pressroom/kits/quickrefyr.htm&amp;lt;br /&amp;gt;&lt;br /&gt;
[17]: http://en.wikipedia.org/wiki/Instruction_prefetch&lt;/div&gt;</summary>
		<author><name>Eapotter</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch6a_ep&amp;diff=44347</id>
		<title>CSC/ECE 506 Spring 2011/ch6a ep</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch6a_ep&amp;diff=44347"/>
		<updated>2011-03-08T04:14:12Z</updated>

		<summary type="html">&lt;p&gt;Eapotter: /* Cache Write Policies[5] */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;INTRODUCTION TO MEMORY HIERARCHY ORGANIZATION &amp;lt;br/&amp;gt;&lt;br /&gt;
Write-Miss Policies and Prefetching&lt;br /&gt;
&lt;br /&gt;
__TOC__&lt;br /&gt;
&lt;br /&gt;
Write miss policies and prefetching are two strategies that are used by multiprocessors to achieve optimal performance for memory accesses.  Write miss policies can help eliminate write misses and therefore reduce bus traffic.  This reduces the wait-time of all processors sharing the bus.  Prefetching assures that a CPU has data blocks in cache to be read when processing large data files and streaming data.  When CPUs or cores share a cache, prefetched data in a shared cache is available to all processes on those cores processing the data.&lt;br /&gt;
&lt;br /&gt;
This article begins by highlighting the variety of multicore processors on the market today that have hierarchical memory structures and shared caches.  It then explores the write miss policies and prefetching techniques that these multiprocessors can use to take advantage of these architectures.&lt;br /&gt;
&lt;br /&gt;
= Recent Architectures and their Cache Characteristics =&lt;br /&gt;
&lt;br /&gt;
Cache management is impacted by three characteristics of modern processor architectures:  multilevel cache hierarchies, multi-core and multiprocessor chip and systems designs, and shared vs. private caches.&lt;br /&gt;
&lt;br /&gt;
Multi-level cache hierarchies are used by most contemporary chip designs to enable memory access to keep pace with CPU speeds, which are advancing at a more rapid pace.  Two-level and three-level cache hierarchies are common.  L1 typically ranges from 16-64KB and provides access in 2-4 cycles.  L2 typically ranges from 512KB to as much as 8MB and provides access in 6-15 cycles.  L3 typically ranges from 4MB to 32MB and provides access to the data it contains within 30-50 cycles.[[#References|[8]]]&lt;br /&gt;
&lt;br /&gt;
Multi-core and multiprocessor designs place additional constraints on cache management: the cache coherency, replenishment, fetching, and other management functions must account for all processes running concurrently.&lt;br /&gt;
&lt;br /&gt;
In multiprocessors with multiple levels of cache, certain cache levels may be shared by processors while others are private to each processor.  Typically, the caches closest to the processor (such as L1) are private, and caches farther from it (L2 and L3) may or may not be shared.   Notice that none of the examples in the table below has a shared L1 cache.  Cache management functions must consider both shared and private caches when reading data from and writing data to memory.&lt;br /&gt;
&lt;br /&gt;
Below is a table that illustrates the variety with which these characteristics have been combined in processors from four manufacturers over the past 6 years.&lt;br /&gt;
&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; cellpadding=&amp;quot;2&amp;quot;&lt;br /&gt;
!colspan=&amp;quot;7&amp;quot;| Table 1: Recent Architectures and their Cache Characteristics [[#References|[1]]][[#References|[2]]][[#References|[3]]][[#References|[4]]][[#References|[9]]][[#References|[10]]][[#References|[11]]][[#References|[12]]][[#References|[13]]][[#References|[14]]][[#References|[15]]][[#References|[16]]]&lt;br /&gt;
|----&lt;br /&gt;
!Company&lt;br /&gt;
!Processor&lt;br /&gt;
!Cores&lt;br /&gt;
!L1 Cache&lt;br /&gt;
!L2 Cache&lt;br /&gt;
!L3 Cache&lt;br /&gt;
!Released&lt;br /&gt;
|----&lt;br /&gt;
|AMD&lt;br /&gt;
|Athlon 64 X2&lt;br /&gt;
|2&lt;br /&gt;
|128 KB x 2&lt;br /&gt;
|512 KB x 2 or 1 MB x 2&lt;br /&gt;
|&lt;br /&gt;
|2005&lt;br /&gt;
|----&lt;br /&gt;
|AMD&lt;br /&gt;
|Athlon 64 FX&lt;br /&gt;
|2&lt;br /&gt;
|128 KB x 2&lt;br /&gt;
|1MB x 2&lt;br /&gt;
|&lt;br /&gt;
|2006&lt;br /&gt;
|----&lt;br /&gt;
|AMD&lt;br /&gt;
|Athlon X2&lt;br /&gt;
|2&lt;br /&gt;
|128 KB x 2&lt;br /&gt;
|512 KB x 2&lt;br /&gt;
|&lt;br /&gt;
|2007&lt;br /&gt;
|----&lt;br /&gt;
|AMD&lt;br /&gt;
|Phenom X3&lt;br /&gt;
|3&lt;br /&gt;
|128 KB x 3&lt;br /&gt;
|512 KB x 3&lt;br /&gt;
|2 MB&lt;br /&gt;
|2008&lt;br /&gt;
|----&lt;br /&gt;
|AMD&lt;br /&gt;
|Phenom X4&lt;br /&gt;
|4&lt;br /&gt;
|128 KB x 4&lt;br /&gt;
|512 KB x 4&lt;br /&gt;
|2 MB&lt;br /&gt;
|2007&lt;br /&gt;
|----&lt;br /&gt;
|AMD&lt;br /&gt;
|Athlon II X2&lt;br /&gt;
|2&lt;br /&gt;
|128 KB x 2&lt;br /&gt;
|512 KB x 2 or 1 MB x 2&lt;br /&gt;
|&lt;br /&gt;
|2009&lt;br /&gt;
|----&lt;br /&gt;
|AMD&lt;br /&gt;
|Athlon II X3&lt;br /&gt;
|3&lt;br /&gt;
|128 KB x 3&lt;br /&gt;
|512 KB x 3&lt;br /&gt;
|&lt;br /&gt;
|2009&lt;br /&gt;
|----&lt;br /&gt;
|AMD&lt;br /&gt;
|Athlon II X4&lt;br /&gt;
|4&lt;br /&gt;
|128 KB x 4&lt;br /&gt;
|512 KB x 4&lt;br /&gt;
|&lt;br /&gt;
|2009&lt;br /&gt;
|----&lt;br /&gt;
|AMD&lt;br /&gt;
|Phenom II X2&lt;br /&gt;
|2&lt;br /&gt;
|128 KB x 2&lt;br /&gt;
|512 KB x 2&lt;br /&gt;
|6 MB&lt;br /&gt;
|2009&lt;br /&gt;
|----&lt;br /&gt;
|AMD&lt;br /&gt;
|Phenom II X3&lt;br /&gt;
|3&lt;br /&gt;
|128 KB x 3&lt;br /&gt;
|512 KB x 3&lt;br /&gt;
|6 MB&lt;br /&gt;
|2009&lt;br /&gt;
|----&lt;br /&gt;
|AMD&lt;br /&gt;
|Phenom II X4&lt;br /&gt;
|4&lt;br /&gt;
|128 KB x 4&lt;br /&gt;
|512 KB x 4&lt;br /&gt;
|4-6 MB&lt;br /&gt;
|2009&lt;br /&gt;
|----&lt;br /&gt;
|AMD&lt;br /&gt;
|Phenom II X6&lt;br /&gt;
|6&lt;br /&gt;
|128 KB x 6&lt;br /&gt;
|512 KB x 6&lt;br /&gt;
|6 MB&lt;br /&gt;
|2010&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Pentium D&lt;br /&gt;
|2&lt;br /&gt;
|12+16KB x 2&lt;br /&gt;
|2 MB&lt;br /&gt;
|&lt;br /&gt;
|2005&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Celeron E&lt;br /&gt;
|2&lt;br /&gt;
|32 KB x2&lt;br /&gt;
|512 KB - 1 MB&lt;br /&gt;
|&lt;br /&gt;
|2006&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Pentium D&lt;br /&gt;
|2&lt;br /&gt;
|32 KB x2&lt;br /&gt;
|1 MB&lt;br /&gt;
|&lt;br /&gt;
|2006&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Core 2 Duo&lt;br /&gt;
|2&lt;br /&gt;
|32 KB x2&lt;br /&gt;
|2 - 4 MB&lt;br /&gt;
|&lt;br /&gt;
|2006&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Core 2 Quad&lt;br /&gt;
|4&lt;br /&gt;
|32 KB x 4&lt;br /&gt;
|8 MB&lt;br /&gt;
|&lt;br /&gt;
|2007&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Atom 330&lt;br /&gt;
|2&lt;br /&gt;
|32+24KB x 2&lt;br /&gt;
|512 KB x 2&lt;br /&gt;
|&lt;br /&gt;
|2008&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Pentium E&lt;br /&gt;
|2&lt;br /&gt;
|32 KB x 2&lt;br /&gt;
|2 MB&lt;br /&gt;
|&lt;br /&gt;
|2007&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Core 2 Duo E&lt;br /&gt;
|2&lt;br /&gt;
|32 KB x 2&lt;br /&gt;
|3-6 MB&lt;br /&gt;
|&lt;br /&gt;
|2008&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Core 2 Quad&lt;br /&gt;
|4&lt;br /&gt;
|32 KB x 4&lt;br /&gt;
|2-6 MB x 2&lt;br /&gt;
|&lt;br /&gt;
|2008&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Core i3&lt;br /&gt;
|2&lt;br /&gt;
|32+32 KB x 2&lt;br /&gt;
|256 KB x2&lt;br /&gt;
|4 MB&lt;br /&gt;
|2010&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Core i5 - 6 Series&lt;br /&gt;
|2&lt;br /&gt;
|32+32 KB x 2&lt;br /&gt;
|256 KB x2&lt;br /&gt;
|4 MB&lt;br /&gt;
|2010&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Core i5 - 7 Series&lt;br /&gt;
|4&lt;br /&gt;
|32+32 KB x 4&lt;br /&gt;
|256 KB x 4&lt;br /&gt;
|8 MB&lt;br /&gt;
|2010&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Core i5 - 2400 Series, Core i5 - 2500 Series&lt;br /&gt;
|4&lt;br /&gt;
|32+32 KB x 4&lt;br /&gt;
|256 KB x 4&lt;br /&gt;
|6 MB&lt;br /&gt;
|2010&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Core i7 - 8 Series&lt;br /&gt;
|4&lt;br /&gt;
|32+32 KB x 4&lt;br /&gt;
|256 KB x 4&lt;br /&gt;
|8 MB&lt;br /&gt;
|2010&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Core i7 - 9 Series&lt;br /&gt;
|4&lt;br /&gt;
|32+32 KB x 4&lt;br /&gt;
|256 KB x 4&lt;br /&gt;
|8 MB&lt;br /&gt;
|2010&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Core i7 - 970&lt;br /&gt;
|6&lt;br /&gt;
|32+32 KB x 6&lt;br /&gt;
|256 KB x 6&lt;br /&gt;
|12 MB&lt;br /&gt;
|2010&lt;br /&gt;
|----&lt;br /&gt;
|Sun&lt;br /&gt;
|UltraSPARC T1&lt;br /&gt;
|8&lt;br /&gt;
|8 K x 8 Inst. 16 K x 8 Data&lt;br /&gt;
|3 MB&lt;br /&gt;
|&lt;br /&gt;
|2005&lt;br /&gt;
|----&lt;br /&gt;
|Sun&lt;br /&gt;
|SPARC64 VI&lt;br /&gt;
|2&lt;br /&gt;
|128 K x 2 Inst. 128 K x 2 Data&lt;br /&gt;
|5 MB&lt;br /&gt;
|&lt;br /&gt;
|2007&lt;br /&gt;
|----&lt;br /&gt;
|Sun&lt;br /&gt;
|UltraSPARC T2&lt;br /&gt;
|8&lt;br /&gt;
|8 K x 4 Inst. 16 K x 4 Data&lt;br /&gt;
|4 MB&lt;br /&gt;
|&lt;br /&gt;
|2007&lt;br /&gt;
|----&lt;br /&gt;
|Sun&lt;br /&gt;
|SPARC64 VII&lt;br /&gt;
|4&lt;br /&gt;
|64 K x 4 Inst. 64 K x 4 Data&lt;br /&gt;
|6 MB&lt;br /&gt;
|&lt;br /&gt;
|2008&lt;br /&gt;
|----&lt;br /&gt;
|Sun&lt;br /&gt;
|SPARC T3&lt;br /&gt;
|8&lt;br /&gt;
|8 K x 8 Inst. 16 K x 8 Data&lt;br /&gt;
|6 MB&lt;br /&gt;
|&lt;br /&gt;
|2010&lt;br /&gt;
|----&lt;br /&gt;
|Sun&lt;br /&gt;
|SPARC64 VII+&lt;br /&gt;
|2&lt;br /&gt;
|128 K x 2 Inst. 128 K x 2 Data&lt;br /&gt;
|12 MB&lt;br /&gt;
|&lt;br /&gt;
|2010&lt;br /&gt;
|----&lt;br /&gt;
|IBM &lt;br /&gt;
|Power5&lt;br /&gt;
|2&lt;br /&gt;
|64 K x 2 Inst. 64 K x 2 Data&lt;br /&gt;
|4 MB x 2&lt;br /&gt;
|32 MB&lt;br /&gt;
|2004&lt;br /&gt;
|----&lt;br /&gt;
|IBM&lt;br /&gt;
|Power7&lt;br /&gt;
|4, 6, or 8&lt;br /&gt;
|32+32 KB x C&lt;br /&gt;
|256 KB x C&lt;br /&gt;
|4 - 32 MB x C&lt;br /&gt;
|2010&lt;br /&gt;
|----&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Note:  The 'x #' suffix (or 'x C', where C is the number of cores) indicates that each core has its own cache of the listed size.&lt;br /&gt;
&lt;br /&gt;
=Cache Write Policies[[#References|[5]]]=&lt;br /&gt;
In section 6.2.3[[#References|[8]]], cache write hit policies and write miss policies were explored.  The write hit policies, write-through and write-back, determine when the data written in the local cache is propagated to lower-level memory.  As a review, write-through writes data to both the cache and memory on every write.  Write-back writes to the cache first and to memory only when a flush is required.&lt;br /&gt;
&lt;br /&gt;
The write miss policies covered in the text[[#References|[8]]], write-allocate and no-write-allocate, determine if a memory block is stored in a cache line after the write occurs.  Note that since a miss occurred on write, the block is not already in a cache line.&lt;br /&gt;
&lt;br /&gt;
Write-miss policies can also be expressed in terms of write-allocate, fetch-on-write, and write-before-hit.   These can be used in combination with the same write-through and write-back hit policies to define the complete cache write policy.  Investigating alternative policies such as these is important since write-misses produce on average one-third of all cache misses.&lt;br /&gt;
&lt;br /&gt;
The following discusses each of these policies:&lt;br /&gt;
&lt;br /&gt;
==Fetch-on-Write==&lt;br /&gt;
On a write miss, the memory block containing the address to be written is fetched from the lower level memory hierarchy before the write proceeds.  Note that this is different from write-allocate. Fetch-on-write determines if the block is fetched from memory, and write-allocate determines if the memory block is stored in a cache line.  Certain write policies may allocate a line in cache for data that is written by the CPU without retrieving it from memory first.&lt;br /&gt;
&lt;br /&gt;
==No-Fetch-on-Write==&lt;br /&gt;
On a write miss, the memory block is not fetched first from the lower level memory hierarchy.  Therefore, the write can proceed without having to wait for the memory block to be returned.&lt;br /&gt;
&lt;br /&gt;
==Write-Before-Hit==&lt;br /&gt;
On a write, the write proceeds before the cache determines if a hit or miss occurred.  In this scenario, the tag and the data can be written simultaneously, but it incurs an immediate bus transaction for each write by the processor.&lt;br /&gt;
&lt;br /&gt;
==No-Write-Before-Hit==&lt;br /&gt;
On a write, the write waits until the cache determines if the block being written to is in the cache or not.  This may avoid a bus transaction by allowing the processor to write to the cache multiple times before the cache line is flushed to memory.&lt;br /&gt;
&lt;br /&gt;
==Write-Miss Policy Combinations==&lt;br /&gt;
&lt;br /&gt;
In practice, these policies are used in combination to provide an overall write policy.  Four combinations of these three write-miss policies are relevant, as illustrated in the table below:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[image:policies.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The two combinations of fetch-on-write and no-write-allocate are blocked out in the above diagram.  They are considered unproductive since the data is fetched from lower level memory, but is not stored in the cache.  Temporal and spatial locality suggest that this data will be used again, and will therefore need to be retrieved a second time.&lt;br /&gt;
&lt;br /&gt;
Three combinations named write-validate, write-around, and write-invalidate all use no-fetch-on-write.  They result in 'eliminated misses' when compared to a fetch-on-write policy.  In general, this will yield better cache performance if the overhead to manage the policy remains low. &lt;br /&gt;
&lt;br /&gt;
The following discusses each of these combinations:&lt;br /&gt;
&lt;br /&gt;
===Write-Validate===&lt;br /&gt;
The combination of no-fetch-on-write and write-allocate is referred to as 'write-validate'.  It writes the data into the cache line without fetching the corresponding block from memory first.  The assumption is that the block will be written to memory at a later time.  It requires additional overhead, in the form of per-byte status bits (valid or dirty bits), to track which bytes of the cache line have been written and which have not.  Lower-level memories must also be able to process only the changed portions of these lines.  Otherwise, when the line is flushed to memory, the unwritten bytes may overwrite valid data.&lt;br /&gt;
&lt;br /&gt;
No-fetch-on-write has the advantage of avoiding an immediate bus transaction to retrieve the block from memory.  For a multiprocessor with a cache coherency model, though, a bus transaction will be required to get exclusive ownership of the block.  While this appears to negate the advantages of no-fetch-on-write, the processor is able to continue with other work while this transaction is occurring, including writing again to the same cache location.&lt;br /&gt;
&lt;br /&gt;
Tests by the author demonstrated more than a 90% reduction in write misses when compared to fetch-on-write policies.&lt;br /&gt;
&lt;br /&gt;
===Write-Around===&lt;br /&gt;
The combination of no-fetch-on-write, no-write-allocate, and no-write-before-hit is referred to as 'write-around'.  On a write miss, the write bypasses the higher-level cache and goes directly to the lower-level memory, leaving the line that already occupies that cache location unaltered.  This strategy shows performance improvements when the data that is written will not be reread in the near future.  Because the write waits until the hit-or-miss check completes, a miss never disturbs the cached line, so the cache is simply written around.&lt;br /&gt;
&lt;br /&gt;
The author notes that in all but a few cases write-around performs worse than write-validate.  Most applications tend to reread what they have recently written.  Using a write-around policy, this would result in a cache miss and a read from lower-level memory.  With write-validate, the data would be in cache.&lt;br /&gt;
&lt;br /&gt;
Tests by the author demonstrated a 40-70% reduction in write misses when compared to fetch-on-write policies.&lt;br /&gt;
&lt;br /&gt;
===Write-Invalidate===&lt;br /&gt;
The combination of write-before-hit, no-fetch-on-write, and no-write-allocate is referred to as 'write-invalidate' because the line is invalidated on the miss.  The copy that exists in lower level memory after the write miss differs from the one in the cache.  For write hits, though, the data is simply written into the cache using the cache hit policy.  Thus, for hits, the cache is not written around.&lt;br /&gt;
&lt;br /&gt;
Of these three combinations, write-invalidate performed the worst.  The author notes, though, that it does perform better than fetch-on-write and is easy to implement.  Tests by the author demonstrated a 35-50% reduction in write misses when compared to fetch-on-write policies.&lt;br /&gt;
&lt;br /&gt;
===Fetch-on-write===&lt;br /&gt;
When used with write-allocate, fetch-on-write stores the block and processes the write to cache as expected.  This policy was used as the baseline for determining the performance of the other three policy combinations.  In general, all three of the previous combinations exhibited fewer misses than this policy.&lt;br /&gt;
&lt;br /&gt;
=Prefetching in Contemporary Parallel Processors=&lt;br /&gt;
Prefetching is a technique that can be used in addition to write-hit and write-miss policies to improve the utilization of the cache.  Microprocessors, such as those listed in Table 1, support prefetching logic directly in the chip.&lt;br /&gt;
&lt;br /&gt;
Prefetching is the process of retrieving instructions or data from memory before they are explicitly requested.  Instruction prefetching is commonly used by both uniprocessors and multiprocessors to reduce processor wait states.[[#References|[17]]]  Data may also be prefetched, though, to pre-populate caches with data that is likely to be required by the processors in the near term.  If the data requirements are anticipated correctly, a greater percentage of memory requests hit in the cache, which reduces overall memory access time.  If the prefetcher guesses wrong, bus traffic can increase unnecessarily, more relevant data can be evicted from caches, and miss rates can increase.&lt;br /&gt;
&lt;br /&gt;
Prefetching algorithms can leverage both temporal and spatial locality in making these decisions.  For example, streaming and sequential access applications often process adjacent memory locations in subsequent tasks.&lt;br /&gt;
&lt;br /&gt;
Following are two examples of processors that support prefetching directly in the chip design:&lt;br /&gt;
&lt;br /&gt;
== Intel Core i7  [[#References|[6]]]==&lt;br /&gt;
The Intel 64 Architecture, including the Intel Core i7, includes both instruction and data prefetching directly on the chip.&lt;br /&gt;
&lt;br /&gt;
The Data Cache Unit prefetcher is a streaming prefetcher for L1 caches that detects ascending access to data that has been loaded very recently.  The processor assumes that this ascending access will continue, and prefetches the next line.&lt;br /&gt;
&lt;br /&gt;
The data prefetch logic (DPL) maintains two arrays to track the recent accesses to memory:  one for the upstreams that has 12 entries, and one for downstreams that has 4 entries.  As pages are accessed, their addresses are tracked in these arrays.  When the DPL detects an access to a page that is sequential to an existing entry, it assumes this sequential access will continue, and prefetches the next cache line from memory.&lt;br /&gt;
&lt;br /&gt;
== AMD [[#References|[7]]]==&lt;br /&gt;
&lt;br /&gt;
The AMD Family 10h processors use a stream-detection strategy similar to the Intel approach described above to trigger prefetching of the next sequential memory location into the L1 cache.  (Previous AMD processors fetched into the L2 cache, which introduced the access latency of the L2 cache and hindered performance.)&lt;br /&gt;
&lt;br /&gt;
The AMD Family 10h can prefetch more than just the next sequential block when a stream is detected, though.  When this 'unit-stride' prefetcher detects misses to sequential blocks, it can trigger a preset number of prefetch requests from memory.   For example, if the preset value is two, the next two blocks of memory will be prefetched when a sequential access is detected.  Subsequent detection of misses of sequential blocks may only prefetch a single block.  This maintains the 'unit-stride' of two blocks in the cache ahead of the next anticipated read.&lt;br /&gt;
&lt;br /&gt;
AMD contends that this is beneficial for applications that work through large data sets sequentially and consume all of the data in the stream.  In these cases, a larger unit-stride populates the cache with blocks that will ultimately be processed by the CPU.  AMD suggests that the optimal number of blocks to prefetch is between 4 and 8, and that fetching too many blocks or fetching too soon impaired performance in empirical testing. &lt;br /&gt;
&lt;br /&gt;
The AMD Family 10h also includes 'Adaptive Prefetching', a hardware optimization that triggers prefetching when the demand stream catches up to the prefetch stream.  In this case, the unit-stride is increased to assure that the prefetcher is fetching at a rate sufficient to continuously provide data to the processor.&lt;br /&gt;
&lt;br /&gt;
= References =&lt;br /&gt;
[1]: http://www.techarp.com/showarticle.aspx?artno=337 &amp;lt;br /&amp;gt;&lt;br /&gt;
[2]: http://en.wikipedia.org/wiki/SPARC &amp;lt;br /&amp;gt;&lt;br /&gt;
[3]: http://en.wikipedia.org/wiki/POWER7 &amp;lt;br /&amp;gt;&lt;br /&gt;
[4]: http://en.wikipedia.org/wiki/POWER5 &amp;lt;br /&amp;gt;&lt;br /&gt;
[5]: “Cache write policies and performance,” Norman Jouppi, Proc. 20th International Symposium on Computer Architecture (ACM Computer Architecture News 21:2), May 1993, pp. 191–201. http://www.hpl.hp.com/techreports/Compaq-DEC/WRL-91-12.pdf &amp;lt;br /&amp;gt;&lt;br /&gt;
[6]: &amp;quot;Intel® 64 and IA-32 Architectures Optimization Reference Manual&amp;quot;, Intel Corporation, 1997-2011. http://www.intel.com/Assets/PDF/manual/248966.pdf &amp;lt;br /&amp;gt;&lt;br /&gt;
[7]: &amp;quot;Software Optimization Guide for AMD Family 10h and 12h Processors&amp;quot;, Advanced Micro Devices, Inc., 2006–2010. http://support.amd.com/us/Processor_TechDocs/40546.pdf &amp;lt;br /&amp;gt;&lt;br /&gt;
[8]: Yan Solihin, ''Fundamentals of Parallel Computer Architecture'' (Solihin Publishing and Consulting, LLC, 2008-2009), p. 151&amp;lt;br /&amp;gt;&lt;br /&gt;
[9]: http://en.wikipedia.org/wiki/AMD_Phenom&amp;lt;br /&amp;gt;&lt;br /&gt;
[10]: http://en.wikipedia.org/wiki/Phenom_II&amp;lt;br /&amp;gt;&lt;br /&gt;
[11]: http://en.wikipedia.org/wiki/Athlon_II&amp;lt;br /&amp;gt;&lt;br /&gt;
[12]: http://en.wikipedia.org/wiki/List_of_AMD_Athlon_64_microprocessors&amp;lt;br /&amp;gt;&lt;br /&gt;
[13]: http://en.wikipedia.org/wiki/List_of_AMD_Athlon_X2_microprocessors&amp;lt;br /&amp;gt;&lt;br /&gt;
[14]: http://en.wikipedia.org/wiki/List_of_Intel_Core_i5_microprocessors&amp;lt;br /&amp;gt;&lt;br /&gt;
[15]: http://en.wikipedia.org/wiki/List_of_Intel_Core_i3_microprocessors&amp;lt;br /&amp;gt;&lt;br /&gt;
[16]: http://www.intel.com/pressroom/kits/quickrefyr.htm&amp;lt;br /&amp;gt;&lt;br /&gt;
[17]: http://en.wikipedia.org/wiki/Instruction_prefetch&lt;/div&gt;</summary>
		<author><name>Eapotter</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch6a_ep&amp;diff=44346</id>
		<title>CSC/ECE 506 Spring 2011/ch6a ep</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch6a_ep&amp;diff=44346"/>
		<updated>2011-03-08T04:10:40Z</updated>

		<summary type="html">&lt;p&gt;Eapotter: /* Write-Invalidate */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;INTRODUCTION TO MEMORY HIERARCHY ORGANIZATION &amp;lt;br/&amp;gt;&lt;br /&gt;
Write-Miss Policies and Prefetching&lt;br /&gt;
&lt;br /&gt;
__TOC__&lt;br /&gt;
&lt;br /&gt;
Write miss policies and prefetching are two strategies that are used by multiprocessors to achieve optimal performance for memory accesses.  Write miss policies can help eliminate write misses and therefore reduce bus traffic.  This reduces the wait-time of all processors sharing the bus.  Prefetching assures that a CPU has data blocks in cache to be read when processing large data files and streaming data.  When CPUs or cores share a cache, prefetched data in a shared cache is available to all processes on those cores processing the data.&lt;br /&gt;
&lt;br /&gt;
This article begins by highlighting the variety of multicore processors on the market today that have hierarchical memory structures and shared caches.  It then explores the write miss policies and prefetching techniques that these multiprocessors can use to take advantage of these architectures.&lt;br /&gt;
&lt;br /&gt;
= Recent Architectures and their Cache Characteristics =&lt;br /&gt;
&lt;br /&gt;
Cache management is impacted by three characteristics of modern processor architectures:  multilevel cache hierarchies, multi-core and multiprocessor chip and systems designs, and shared vs. private caches.&lt;br /&gt;
&lt;br /&gt;
Multi-level cache hierarchies are used by most contemporary chip designs to enable memory access to keep pace with CPU speeds, which are advancing at a more rapid pace.  Two-level and three-level cache hierarchies are common.  L1 typically ranges from 16 KB to 64 KB and provides access in 2-4 cycles.  L2 typically ranges from 512 KB to as much as 8 MB and provides access in 6-15 cycles.  L3 typically ranges from 4 MB to 32 MB and provides access to the data it contains within 30-50 cycles.[[#References|[8]]]&lt;br /&gt;
&lt;br /&gt;
Multi-core and multiprocessor designs place additional constraints on cache management by requiring it to consider all processes running concurrently when performing cache coherency, replenishment, fetching, and other management functions.&lt;br /&gt;
&lt;br /&gt;
In multiprocessors with multiple levels of cache, certain cache levels may be shared by processors and others private to each processor.  Typically, the caches closest to the processor (such as L1) are private, and the caches farther from it (such as L2 and L3) may or may not be shared.  Notice that none of the examples in the table below has a shared L1 cache.  Cache management functions must consider both shared and private caches when reading and writing data from memory.&lt;br /&gt;
&lt;br /&gt;
Below is a table that illustrates the variety with which these characteristics have been combined in processors from four manufacturers over the past 6 years.&lt;br /&gt;
&lt;br /&gt;
{| border=&amp;quot;1&amp;quot; cellspacing=&amp;quot;0&amp;quot; cellpadding=&amp;quot;2&amp;quot;&lt;br /&gt;
!colspan=&amp;quot;7&amp;quot;| Table 1: Recent Architectures and their Cache Characteristics [[#References|[1]]][[#References|[2]]][[#References|[3]]][[#References|[4]]][[#References|[9]]][[#References|[10]]][[#References|[11]]][[#References|[12]]][[#References|[13]]][[#References|[14]]][[#References|[15]]][[#References|[16]]]&lt;br /&gt;
|----&lt;br /&gt;
!Company&lt;br /&gt;
!Processor&lt;br /&gt;
!Cores&lt;br /&gt;
!L1 Cache&lt;br /&gt;
!L2 Cache&lt;br /&gt;
!L3 Cache&lt;br /&gt;
!Released&lt;br /&gt;
|----&lt;br /&gt;
|AMD&lt;br /&gt;
|Athlon 64 X2&lt;br /&gt;
|2&lt;br /&gt;
|128 KB x 2&lt;br /&gt;
|512 KB x 2 1MB x 2&lt;br /&gt;
|&lt;br /&gt;
|2005&lt;br /&gt;
|----&lt;br /&gt;
|AMD&lt;br /&gt;
|Athlon 64 FX&lt;br /&gt;
|2&lt;br /&gt;
|128 KB x 2&lt;br /&gt;
|1MB x 2&lt;br /&gt;
|&lt;br /&gt;
|2006&lt;br /&gt;
|----&lt;br /&gt;
|AMD&lt;br /&gt;
|Athlon X2&lt;br /&gt;
|2&lt;br /&gt;
|128 KB x 2&lt;br /&gt;
|512 KB x 2&lt;br /&gt;
|&lt;br /&gt;
|2007&lt;br /&gt;
|----&lt;br /&gt;
|AMD&lt;br /&gt;
|Phenom X3&lt;br /&gt;
|3&lt;br /&gt;
|128 KB x 3&lt;br /&gt;
|512 KB x 3&lt;br /&gt;
|2 MB&lt;br /&gt;
|2008&lt;br /&gt;
|----&lt;br /&gt;
|AMD&lt;br /&gt;
|Phenom X4&lt;br /&gt;
|4&lt;br /&gt;
|128 KB x 4&lt;br /&gt;
|512 KB x 4&lt;br /&gt;
|2 MB&lt;br /&gt;
|2007&lt;br /&gt;
|----&lt;br /&gt;
|AMD&lt;br /&gt;
|Athlon II X2&lt;br /&gt;
|2&lt;br /&gt;
|128 KB x 2&lt;br /&gt;
|512 KB x 2 1MB x 2&lt;br /&gt;
|&lt;br /&gt;
|2009&lt;br /&gt;
|----&lt;br /&gt;
|AMD&lt;br /&gt;
|Athlon II X3&lt;br /&gt;
|3&lt;br /&gt;
|128 KB x 3&lt;br /&gt;
|512 KB x 3&lt;br /&gt;
|&lt;br /&gt;
|2009&lt;br /&gt;
|----&lt;br /&gt;
|AMD&lt;br /&gt;
|Athlon II X4&lt;br /&gt;
|4&lt;br /&gt;
|128 KB x 4&lt;br /&gt;
|512 KB x 4&lt;br /&gt;
|&lt;br /&gt;
|2009&lt;br /&gt;
|----&lt;br /&gt;
|AMD&lt;br /&gt;
|Phenom II X2&lt;br /&gt;
|2&lt;br /&gt;
|128 KB x 2&lt;br /&gt;
|512 KB x 2&lt;br /&gt;
|6 MB&lt;br /&gt;
|2009&lt;br /&gt;
|----&lt;br /&gt;
|AMD&lt;br /&gt;
|Phenom II X3&lt;br /&gt;
|3&lt;br /&gt;
|128 KB x 3&lt;br /&gt;
|512 KB x 3&lt;br /&gt;
|6 MB&lt;br /&gt;
|2009&lt;br /&gt;
|----&lt;br /&gt;
|AMD&lt;br /&gt;
|Phenom II X4&lt;br /&gt;
|4&lt;br /&gt;
|128 KB x 4&lt;br /&gt;
|512 KB x 4&lt;br /&gt;
|4-6 MB&lt;br /&gt;
|2009&lt;br /&gt;
|----&lt;br /&gt;
|AMD&lt;br /&gt;
|Phenom II X6&lt;br /&gt;
|6&lt;br /&gt;
|128 KB x 6&lt;br /&gt;
|512 KB x 6&lt;br /&gt;
|6 MB&lt;br /&gt;
|2010&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Pentium D&lt;br /&gt;
|2&lt;br /&gt;
|12+16KB x 2&lt;br /&gt;
|2 MB&lt;br /&gt;
|&lt;br /&gt;
|2005&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Celeron E&lt;br /&gt;
|2&lt;br /&gt;
|32 KB x2&lt;br /&gt;
|512 KB - 1 MB&lt;br /&gt;
|&lt;br /&gt;
|2006&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Pentium D&lt;br /&gt;
|2&lt;br /&gt;
|32 KB x2&lt;br /&gt;
|1 MB&lt;br /&gt;
|&lt;br /&gt;
|2006&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Core 2 Duo&lt;br /&gt;
|2&lt;br /&gt;
|32 KB x2&lt;br /&gt;
|2 - 4 MB&lt;br /&gt;
|&lt;br /&gt;
|2006&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Core 2 Quad&lt;br /&gt;
|4&lt;br /&gt;
|32 KB x 4&lt;br /&gt;
|8 MB&lt;br /&gt;
|&lt;br /&gt;
|2007&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Atom 330&lt;br /&gt;
|2&lt;br /&gt;
|32+24KB x 2&lt;br /&gt;
|512 KB x 2&lt;br /&gt;
|&lt;br /&gt;
|2008&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Pentium E&lt;br /&gt;
|2&lt;br /&gt;
|32 KB x 2&lt;br /&gt;
|2 MB&lt;br /&gt;
|&lt;br /&gt;
|2007&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Core 2 Duo E&lt;br /&gt;
|2&lt;br /&gt;
|32 KB x 2&lt;br /&gt;
|3-6 MB&lt;br /&gt;
|&lt;br /&gt;
|2008&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Core 2 Quad&lt;br /&gt;
|4&lt;br /&gt;
|32 KB x 4&lt;br /&gt;
|2-6 MB x 2&lt;br /&gt;
|&lt;br /&gt;
|2008&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Core i3&lt;br /&gt;
|2&lt;br /&gt;
|32+32 KB x 2&lt;br /&gt;
|256 KB x2&lt;br /&gt;
|4 MB&lt;br /&gt;
|2010&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Core i5 - 6 Series&lt;br /&gt;
|2&lt;br /&gt;
|32+32 KB x 2&lt;br /&gt;
|256 KB x2&lt;br /&gt;
|4 MB&lt;br /&gt;
|2010&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Core i5 - 7 Series&lt;br /&gt;
|4&lt;br /&gt;
|32+32 KB x 4&lt;br /&gt;
|256 KB x 4&lt;br /&gt;
|8 MB&lt;br /&gt;
|2010&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Core i5 - 2400 Series Core i5 - 2500 Series&lt;br /&gt;
|4&lt;br /&gt;
|32+32 KB x 4&lt;br /&gt;
|256 KB x 4&lt;br /&gt;
|6 MB&lt;br /&gt;
|2010&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Core i7 - 8 Series&lt;br /&gt;
|4&lt;br /&gt;
|32+32 KB x 4&lt;br /&gt;
|256 KB x 4&lt;br /&gt;
|8 MB&lt;br /&gt;
|2010&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Core i7 - 9 Series&lt;br /&gt;
|4&lt;br /&gt;
|32+32 KB x 4&lt;br /&gt;
|256 KB x 4&lt;br /&gt;
|8 MB&lt;br /&gt;
|2010&lt;br /&gt;
|----&lt;br /&gt;
|Intel&lt;br /&gt;
|Core i7 - 970&lt;br /&gt;
|6&lt;br /&gt;
|32+32 KB x 6&lt;br /&gt;
|256 KB x 6&lt;br /&gt;
|12 MB&lt;br /&gt;
|2010&lt;br /&gt;
|----&lt;br /&gt;
|Sun&lt;br /&gt;
|UltraSPARC T1&lt;br /&gt;
|8&lt;br /&gt;
|8 K x 8 Inst. 16 K x 8 Data&lt;br /&gt;
|3 MB&lt;br /&gt;
|&lt;br /&gt;
|2005&lt;br /&gt;
|----&lt;br /&gt;
|Sun&lt;br /&gt;
|SPARC64 VI&lt;br /&gt;
|2&lt;br /&gt;
|128 K x 2 Inst. 128 K x 2 Data&lt;br /&gt;
|5 MB&lt;br /&gt;
|&lt;br /&gt;
|2007&lt;br /&gt;
|----&lt;br /&gt;
|Sun&lt;br /&gt;
|UltraSPARC T2&lt;br /&gt;
|8&lt;br /&gt;
|8 K x 4 Inst. 16 K x 4 Data&lt;br /&gt;
|4 MB&lt;br /&gt;
|&lt;br /&gt;
|2007&lt;br /&gt;
|----&lt;br /&gt;
|Sun&lt;br /&gt;
|SPARC64 VII&lt;br /&gt;
|4&lt;br /&gt;
|64 K x 4 Inst. 64 K x 4 Data&lt;br /&gt;
|6 MB&lt;br /&gt;
|&lt;br /&gt;
|2008&lt;br /&gt;
|----&lt;br /&gt;
|Sun&lt;br /&gt;
|SPARC T3&lt;br /&gt;
|8&lt;br /&gt;
|8 K x 8 Inst. 16 K x 8 Data&lt;br /&gt;
|6 MB&lt;br /&gt;
|&lt;br /&gt;
|2010&lt;br /&gt;
|----&lt;br /&gt;
|Sun&lt;br /&gt;
|SPARC64 VII+&lt;br /&gt;
|2&lt;br /&gt;
|128 K x 2 Inst. 128 K x 2 Data&lt;br /&gt;
|12 MB&lt;br /&gt;
|&lt;br /&gt;
|2010&lt;br /&gt;
|----&lt;br /&gt;
|IBM &lt;br /&gt;
|Power5&lt;br /&gt;
|2&lt;br /&gt;
|64 K x 2 Inst. 64 K x 2 Data&lt;br /&gt;
|4 MB x 2&lt;br /&gt;
|32 MB&lt;br /&gt;
|2004&lt;br /&gt;
|----&lt;br /&gt;
|IBM&lt;br /&gt;
|Power7&lt;br /&gt;
|4, 6, or 8&lt;br /&gt;
|32+32 KB x C&lt;br /&gt;
|256 kB x C&lt;br /&gt;
|4 - 32 MB x C&lt;br /&gt;
|2010&lt;br /&gt;
|----&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
Note:  The 'x #' in the L1 and L2 columns indicates that this cache is for each core.&lt;br /&gt;
&lt;br /&gt;
=Cache Write Policies[[#References|[5]]]=&lt;br /&gt;
In section 6.2.3[[#References|[8]]], cache write hit policies and write miss policies were explored.  The write hit policies, write-through and write-back, determine when the data written in the local cache is propagated to lower-level memory.  As a review, write-through writes data to both the cache and memory on every write.  Write-back writes to the cache first and to memory only when a flush is required.  &lt;br /&gt;
&lt;br /&gt;
The write miss policies covered in the text[[#References|[8]]], write-allocate and no-write-allocate, determine if a memory block is stored in a cache line after the write occurs.  Note that since a miss occurred on write, the block is not already in a cache line.&lt;br /&gt;
&lt;br /&gt;
Write-miss policies can also be expressed in terms of write-allocate, fetch-on-write, and write-before-hit.   These can be used in combination with the same write-through and write-back hit policies to define the complete cache write policy.  Investigating alternative policies such as these is important since write-misses produce on average one-third of all cache misses.&lt;br /&gt;
&lt;br /&gt;
The following discusses each of these policies:&lt;br /&gt;
&lt;br /&gt;
==Fetch-on-Write==&lt;br /&gt;
On a write miss, the memory block containing the address to be written is fetched from the lower level memory hierarchy before the write proceeds.  Note that this is different from write-allocate. Fetch-on-write determines if the block is fetched from memory, and write-allocate determines if the memory block is stored in a cache line.  Certain write policies may allocate a line in cache for data that is written by the CPU without retrieving it from memory first.&lt;br /&gt;
&lt;br /&gt;
==No-Fetch-on-Write==&lt;br /&gt;
On a write miss, the memory block is not fetched first from the lower level memory hierarchy.  Therefore, the write can proceed without having to wait for the memory block to be returned.&lt;br /&gt;
&lt;br /&gt;
==Write-Before-Hit==&lt;br /&gt;
On a write, the write proceeds before the cache determines if a hit or miss occurred.  In this scenario, the tag and the data can be written simultaneously, but it incurs an immediate bus transaction for each write by the processor.&lt;br /&gt;
&lt;br /&gt;
==No-Write-Before-Hit==&lt;br /&gt;
On a write, the write waits until the cache determines if the block being written to is in the cache or not.  This may avoid a bus transaction by allowing the processor to write to the cache multiple times before the cache line is flushed to memory.&lt;br /&gt;
&lt;br /&gt;
==Write-Miss Policy Combinations==&lt;br /&gt;
&lt;br /&gt;
In practice, these policies are used in combination to provide an overall write policy.  Four combinations of these three write-miss policies are relevant, as illustrated in the table below:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[image:policies.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The two combinations of fetch-on-write and no-write-allocate are blocked out in the above diagram.  They are considered unproductive since the data is fetched from lower level memory, but is not stored in the cache.  Temporal and spatial locality suggest that this data will be used again, and will therefore need to be retrieved a second time.&lt;br /&gt;
&lt;br /&gt;
Three combinations named write-validate, write-around, and write-invalidate all use no-fetch-on-write.  They result in 'eliminated misses' when compared to a fetch-on-write policy.  In general, this will yield better cache performance if the overhead to manage the policy remains low. &lt;br /&gt;
&lt;br /&gt;
The following discusses each of these combinations:&lt;br /&gt;
&lt;br /&gt;
===Write-Validate===&lt;br /&gt;
The combination of no-fetch-on-write and write-allocate is referred to as 'write-validate'.  It writes the data into the cache line without fetching the corresponding block from memory first.  The assumption is that the block will be written to memory at a later time.  It requires additional overhead, in the form of per-byte status bits (valid or dirty bits), to track which bytes of the cache line have been written and which have not.  Lower-level memories must also be able to process only the changed portions of these lines.  Otherwise, when the line is flushed to memory, the unwritten bytes may overwrite valid data.&lt;br /&gt;
&lt;br /&gt;
No-fetch-on-write has the advantage of avoiding an immediate bus transaction to retrieve the block from memory.  For a multiprocessor with a cache coherency model, though, a bus transaction will be required to get exclusive ownership of the block.  While this appears to negate the advantages of no-fetch-on-write, the processor is able to continue with other work while this transaction is occurring, including writing again to the same cache location.&lt;br /&gt;
&lt;br /&gt;
Tests by the author demonstrated more than a 90% reduction in write misses when compared to fetch-on-write policies.&lt;br /&gt;
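&lt;br /&gt;
The per-byte bookkeeping that write-validate relies on can be sketched as follows.  This is a minimal illustration in Python rather than a description of any real cache: the class name, the 8-byte line size, and the flat byte array standing in for lower-level memory are all assumptions made for the example.  Each byte of the line carries a flag recording whether the processor has written it, and a flush copies back only the flagged bytes so that unwritten bytes never overwrite valid data in memory.&lt;br /&gt;
&lt;br /&gt;
```python
class WriteValidateLine:
    """One cache line that is allocated on a write miss without a fetch."""

    def __init__(self, size=8):
        self.data = bytearray(size)      # line contents; unwritten bytes are meaningless
        self.written = [False] * size    # per-byte flag: has the CPU written this byte?

    def write(self, offset, payload):
        """Store payload into the line and mark those bytes as written."""
        for i, value in enumerate(payload):
            self.data[offset + i] = value
            self.written[offset + i] = True

    def flush(self, memory, base):
        """Copy only the written bytes back to lower-level memory."""
        for i, was_written in enumerate(self.written):
            if was_written:
                memory[base + i] = self.data[i]


memory = bytearray(b"ABCDEFGH")          # hypothetical lower-level memory
line = WriteValidateLine()
line.write(2, b"xy")                     # write miss: allocate the line, no fetch
line.flush(memory, 0)
print(memory.decode())                   # ABxyEFGH
```
&lt;br /&gt;
Only the two written bytes reach memory; the other six bytes of the block keep their original values even though the line was never fetched.&lt;br /&gt;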
&lt;br /&gt;
===Write-Around===&lt;br /&gt;
The combination of no-fetch-on-write, no-write-allocate, and no-write-before-hit is referred to as 'write-around'.  On a write miss, the write bypasses the higher-level cache and goes directly to the lower-level memory, leaving the line that already occupies that cache location unaltered.  This strategy shows performance improvements when the data that is written will not be reread in the near future.  Because the write waits until the hit-or-miss check completes, a miss never disturbs the cached line, so the cache is simply written around.&lt;br /&gt;
&lt;br /&gt;
The author notes that in all but a few cases write-around performs worse than write-validate.  Most applications tend to reread what they have recently written.  Using a write-around policy, this would result in a cache miss and a read from lower-level memory.  With write-validate, the data would be in cache.&lt;br /&gt;
&lt;br /&gt;
Tests by the author demonstrated a 40-70% reduction in write misses when compared to fetch-on-write policies.&lt;br /&gt;
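&lt;br /&gt;
The reread behaviour that separates write-around from write-validate can be illustrated with a toy model.  The sketch below is an assumption-laden simplification: the cache starts empty, has unlimited capacity, and is represented by a set of block numbers, and the function name is hypothetical.  It counts how many later reads hit after a series of write misses under each policy.&lt;br /&gt;
&lt;br /&gt;
```python
def read_hits_after_write_misses(policy, written_blocks, read_blocks):
    """Count reads that hit after a series of write misses to an empty cache."""
    if policy not in ("write-validate", "write-around"):
        raise ValueError("unknown policy: " + policy)
    cached = set()
    for block in written_blocks:
        if policy == "write-validate":
            cached.add(block)            # write-allocate keeps the written line
        # write-around sends the write to memory only, so nothing is cached
    return sum(1 for block in read_blocks if block in cached)


writes = [100, 101, 102]
print(read_hits_after_write_misses("write-validate", writes, writes))  # 3
print(read_hits_after_write_misses("write-around", writes, writes))    # 0
```
&lt;br /&gt;
When the program rereads what it just wrote, every read hits under write-validate and every read misses under write-around, which matches the qualitative comparison above.&lt;br /&gt;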
&lt;br /&gt;
===Write-Invalidate===&lt;br /&gt;
The combination of write-before-hit, no-fetch-on-write, and no-write-allocate is referred to as 'write-invalidate' because the line is invalidated on the miss.  The copy that exists in lower level memory after the write miss differs from the one in the cache.  For write hits, though, the data is simply written into the cache using the cache hit policy.  Thus, for hits, the cache is not written around.&lt;br /&gt;
&lt;br /&gt;
Of these three combinations, write-invalidate performed the worst.  The author notes, though, that it does perform better than fetch-on-write and is easy to implement.  Tests by the author demonstrated a 35-50% reduction in write misses when compared to fetch-on-write policies.&lt;br /&gt;
&lt;br /&gt;
===Fetch-on-write===&lt;br /&gt;
When used with write-allocate, fetch-on-write stores the block and processes the write to cache as expected.  This policy was used as the baseline for determining the performance of the other three policy combinations.  In general, all three of the previous combinations exhibited fewer misses than this policy.&lt;br /&gt;
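&lt;br /&gt;
The way the three write-miss choices combine into the named policies of this section can be summarized as a small decision function.  This is only a sketch of the classification given in the text; the function name and the label "unproductive" for the blocked-out combinations are assumptions introduced for the example.&lt;br /&gt;
&lt;br /&gt;
```python
def classify_write_miss_policy(fetch_on_write, write_allocate, write_before_hit):
    """Name the write-miss policy combination for the three boolean choices."""
    if fetch_on_write:
        # Fetching a block and then not storing it is the blocked-out case.
        return "fetch-on-write" if write_allocate else "unproductive"
    if write_allocate:
        return "write-validate"
    # No fetch and no allocate: the name depends on write-before-hit.
    return "write-invalidate" if write_before_hit else "write-around"


for flags in [(True, True, False), (False, True, False),
              (False, False, False), (False, False, True)]:
    print(flags, classify_write_miss_policy(*flags))
```
&lt;br /&gt;
The loop prints the four relevant combinations in order: fetch-on-write, write-validate, write-around, and write-invalidate.&lt;br /&gt;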
&lt;br /&gt;
=Prefetching in Contemporary Parallel Processors=&lt;br /&gt;
Prefetching is a technique that can be used in addition to write-hit and write-miss policies to improve the utilization of the cache.  Microprocessors, such as those listed in Table 1, support prefetching logic directly in the chip.&lt;br /&gt;
&lt;br /&gt;
Prefetching is the process of retrieving instructions or data from memory before they are explicitly requested.  Instruction prefetching is commonly used by both uniprocessors and multiprocessors to reduce processor wait states.[[#References|[17]]]  Data may also be prefetched, though, to pre-populate caches with data that is likely to be required by the processors in the near term.  If the data requirements are anticipated correctly, a greater percentage of memory requests hit in the cache, which reduces overall memory access time.  If the prefetcher guesses wrong, bus traffic can increase unnecessarily, more relevant data can be evicted from caches, and miss rates can increase.&lt;br /&gt;
&lt;br /&gt;
Prefetching algorithms can leverage both temporal and spatial locality in making these decisions.  For example, streaming and sequential access applications often process adjacent memory locations in subsequent tasks.&lt;br /&gt;
&lt;br /&gt;
Following are two examples of processors that support prefetching directly in the chip design:&lt;br /&gt;
&lt;br /&gt;
== Intel Core i7  [[#References|[6]]]==&lt;br /&gt;
The Intel 64 Architecture, including the Intel Core i7, includes both instruction and data prefetching directly on the chip.&lt;br /&gt;
&lt;br /&gt;
The Data Cache Unit prefetcher is a streaming prefetcher for L1 caches that detects ascending access to data that has been loaded very recently.  The processor assumes that this ascending access will continue, and prefetches the next line.&lt;br /&gt;
&lt;br /&gt;
The data prefetch logic (DPL) maintains two arrays to track the recent accesses to memory:  one for the upstreams that has 12 entries, and one for downstreams that has 4 entries.  As pages are accessed, their addresses are tracked in these arrays.  When the DPL detects an access to a page that is sequential to an existing entry, it assumes this sequential access will continue, and prefetches the next cache line from memory.&lt;br /&gt;
&lt;br /&gt;
== AMD [[#References|[7]]]==&lt;br /&gt;
&lt;br /&gt;
The AMD Family 10h processors use a stream-detection strategy similar to the Intel approach described above to trigger prefetching of the next sequential memory location into the L1 cache.  (Previous AMD processors fetched into the L2 cache, which introduced the access latency of the L2 cache and hindered performance.)&lt;br /&gt;
&lt;br /&gt;
The AMD Family 10h can prefetch more than just the next sequential block when a stream is detected, though.  When this 'unit-stride' prefetcher detects misses to sequential blocks, it can trigger a preset number of prefetch requests from memory.   For example, if the preset value is two, the next two blocks of memory will be prefetched when a sequential access is detected.  Subsequent detection of misses of sequential blocks may only prefetch a single block.  This maintains the 'unit-stride' of two blocks in the cache ahead of the next anticipated read.&lt;br /&gt;
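&lt;br /&gt;
The prefetch-ahead behaviour described above can be approximated with a short simulation.  This is a sketch under stated assumptions, not AMD's hardware logic: the function name is hypothetical, the cache is an unbounded set of block numbers, and a stream is detected whenever an access targets the block immediately after the previous one (the real prefetcher reacts to misses).  With a preset depth of two, the first detection fetches two blocks and each later sequential access fetches one more, keeping the cache two blocks ahead.&lt;br /&gt;
&lt;br /&gt;
```python
def simulate_stream_prefetch(accesses, depth=2):
    """Return (block, hit, prefetched_blocks) for each access in order."""
    cached = set()
    last = None
    log = []
    for block in accesses:
        hit = block in cached
        cached.add(block)                # a demand miss brings the block in
        prefetched = []
        if last is not None and block == last + 1:
            # Sequential stream detected: stay `depth` blocks ahead.
            for ahead in range(block + 1, block + depth + 1):
                if ahead not in cached:
                    cached.add(ahead)
                    prefetched.append(ahead)
        log.append((block, hit, prefetched))
        last = block
    return log


for entry in simulate_stream_prefetch([10, 11, 12, 13]):
    print(entry)
# (10, False, [])
# (11, False, [12, 13])
# (12, True, [14])
# (13, True, [15])
```
&lt;br /&gt;
After the stream is detected at block 11, every later access hits in the cache and triggers a single additional prefetch.&lt;br /&gt;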
&lt;br /&gt;
AMD contends that this is beneficial for applications that work through large data sets sequentially and consume all of the data in the stream.  In these cases, a larger unit-stride populates the cache with blocks that will ultimately be processed by the CPU.  AMD suggests that the optimal number of blocks to prefetch is between 4 and 8, and that fetching too many blocks or fetching too soon impaired performance in empirical testing. &lt;br /&gt;
&lt;br /&gt;
The AMD Family 10h also includes 'Adaptive Prefetching', a hardware optimization that triggers prefetching when the demand stream catches up to the prefetch stream.  In this case, the unit-stride is increased to assure that the prefetcher is fetching at a rate sufficient to continuously provide data to the processor.&lt;br /&gt;
&lt;br /&gt;
= References =&lt;br /&gt;
[1]: http://www.techarp.com/showarticle.aspx?artno=337 &amp;lt;br /&amp;gt;&lt;br /&gt;
[2]: http://en.wikipedia.org/wiki/SPARC &amp;lt;br /&amp;gt;&lt;br /&gt;
[3]: http://en.wikipedia.org/wiki/POWER7 &amp;lt;br /&amp;gt;&lt;br /&gt;
[4]: http://en.wikipedia.org/wiki/POWER5 &amp;lt;br /&amp;gt;&lt;br /&gt;
[5]: “Cache write policies and performance,” Norman Jouppi, Proc. 20th International Symposium on Computer Architecture (ACM Computer Architecture News 21:2), May 1993, pp. 191–201. http://www.hpl.hp.com/techreports/Compaq-DEC/WRL-91-12.pdf &amp;lt;br /&amp;gt;&lt;br /&gt;
[6]: &amp;quot;Intel® 64 and IA-32 Architectures Optimization Reference Manual&amp;quot;, Intel Corporation, 1997-2011. http://www.intel.com/Assets/PDF/manual/248966.pdf &amp;lt;br /&amp;gt;&lt;br /&gt;
[7]: &amp;quot;Software Optimization Guide for AMD Family 10h and 12h Processors&amp;quot;, Advanced Micro Devices, Inc., 2006–2010. http://support.amd.com/us/Processor_TechDocs/40546.pdf &amp;lt;br /&amp;gt;&lt;br /&gt;
[8]: Yan Solihin, ''Fundamentals of Parallel Computer Architecture'' (Solihin Publishing and Consulting, LLC, 2008-2009), p. 151&amp;lt;br /&amp;gt;&lt;br /&gt;
[9]: http://en.wikipedia.org/wiki/AMD_Phenom&amp;lt;br /&amp;gt;&lt;br /&gt;
[10]: http://en.wikipedia.org/wiki/Phenom_II&amp;lt;br /&amp;gt;&lt;br /&gt;
[11]: http://en.wikipedia.org/wiki/Athlon_II&amp;lt;br /&amp;gt;&lt;br /&gt;
[12]: http://en.wikipedia.org/wiki/List_of_AMD_Athlon_64_microprocessors&amp;lt;br /&amp;gt;&lt;br /&gt;
[13]: http://en.wikipedia.org/wiki/List_of_AMD_Athlon_X2_microprocessors&amp;lt;br /&amp;gt;&lt;br /&gt;
[14]: http://en.wikipedia.org/wiki/List_of_Intel_Core_i5_microprocessors&amp;lt;br /&amp;gt;&lt;br /&gt;
[15]: http://en.wikipedia.org/wiki/List_of_Intel_Core_i3_microprocessors&amp;lt;br /&amp;gt;&lt;br /&gt;
[16]: http://www.intel.com/pressroom/kits/quickrefyr.htm&amp;lt;br /&amp;gt;&lt;br /&gt;
[17]: http://en.wikipedia.org/wiki/Instruction_prefetch&lt;/div&gt;</summary>
		<author><name>Eapotter</name></author>
	</entry>
</feed>