<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://wiki.expertiza.ncsu.edu/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Sbasu3</id>
	<title>Expertiza_Wiki - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="https://wiki.expertiza.ncsu.edu/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Sbasu3"/>
	<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=Special:Contributions/Sbasu3"/>
	<updated>2026-05-26T03:46:40Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.41.0</generator>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8b_va&amp;diff=60728</id>
		<title>CSC/ECE 506 Spring 2012/8b va</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8b_va&amp;diff=60728"/>
		<updated>2012-03-29T04:28:30Z</updated>

		<summary type="html">&lt;p&gt;Sbasu3: /* Dragon Protocol */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Update and adaptive coherence protocols on real architectures, and power considerations=&lt;br /&gt;
 &lt;br /&gt;
===Introduction===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In a [http://en.wikipedia.org/wiki/Shared_memory shared memory], shared-bus multiprocessor system, the cache coherence protocol plays an important role in propagating changes from one cache to the others. In most cases, however, update and invalidate coherence protocols are the main source of bus contention. Contention increases the number of busy bus cycles and therefore program execution time, because a processor may stall while its cache waits for the bus. To reduce bus contention, this article presents adaptive hybrid protocol strategies for high-performance multiprocessors.&amp;lt;ref&amp;gt;[http://en.wikipedia.org/wiki/Shared_memory shared memory]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Coherence protocols===&lt;br /&gt;
In general terms, a coherency protocol maintains consistency among all the caches in a [http://en.wikipedia.org/wiki/Shared_memory shared memory] system. The protocol maintains memory coherence according to a specific consistency model. Older multiprocessors support the [http://en.wikipedia.org/wiki/Sequential_consistency sequential consistency model], while modern shared memory systems typically support the release consistency or weak consistency models. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:Cache_Coherency_Generic.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 1.  Multiple Caches of Shared Resource''' &amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Transitions between states in any specific implementation of these protocols may vary. For example, an implementation may choose different update and invalidation transitions such as update-on-read, update-on-write, invalidate-on-read, or invalidate-on-write. The choice of transition may affect the amount of inter-cache traffic, which in turn may affect the amount of cache bandwidth available for actual work. This should be taken into consideration in the design of distributed software that could cause strong contention between the caches of multiple processors.&lt;br /&gt;
Various models and protocols have been devised for maintaining cache coherence, such as the MSI, MESI (also known as the Illinois protocol), MOSI, MOESI, MERSI, MESIF, write-once, Synapse, Berkeley, Firefly, and Dragon protocols.&lt;br /&gt;
&lt;br /&gt;
For the purpose of the main study, this article will focus on the two commonly used coherence protocols: &amp;lt;b&amp;gt;Write-Invalidate (WI) and Write-Update (WU)&amp;lt;/b&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Write-Invalidate Protocol'''&lt;br /&gt;
In a [http://cs.gmu.edu/cne/modules/dsm/purple/wr_inval.html Write-Invalidate Protocol], a processor invalidates the copies of the block held in all other processors' caches and then updates its own copy without further bus operations.&lt;br /&gt;
&lt;br /&gt;
'''Write-Update Protocol'''&lt;br /&gt;
In contrast, in a [http://cs.gmu.edu/cne/modules/dsm/purple/wr_update.html Write-Update Protocol], a processor broadcasts each update to shared data to the other caches so that those caches stay coherent.&lt;br /&gt;
&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;b&amp;gt;Write-Update protocol:&amp;lt;/b&amp;gt; Two common update protocols are the Dragon and Firefly protocols.&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&amp;lt;dd&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Dragon Protocol===&lt;br /&gt;
[[File:Screen_Shot_2012-03-28_at_10.29.04_PM.png|thumb|upright=10.5|right|alt=Table of the additional notations used in the state transition diagram of the Dragon protocol.|Table 1: Definitions of additional notations in the state transition diagram of Dragon Protocol]]&lt;br /&gt;
[[File:Screen Shot 2012-03-28 at 10.18.57 PM.png|thumb|upright=2.5|right|alt=State transition diagram of the Dragon protocol.|Figure 2: State transition diagram of the Dragon protocol]]&lt;br /&gt;
&amp;lt;dd&amp;gt;This protocol was first proposed by researchers at Xerox PARC for their Dragon multiprocessor system.&lt;br /&gt;
The Dragon protocol ensures that data is always valid if the tag matches. Hence, there is no explicit invalid state even though it reserves a miss mode bit for compulsory misses. The Dragon protocol consists of four states:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Exclusive (E): '''means that only one cache (this cache) has a copy of the block, and it has not been modified (the main memory is up-to-date)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Shared-clean (Sc): '''means that potentially two or more caches (including this one) have this block, and main memory may or may not be up-to-date&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Shared-modified (Sm):''' means that potentially two or more caches have this block, main memory is not up-to-date, and it is this cache's responsibility to update main memory when the block is replaced from the cache. A block may be in this state in only one cache at a time; however, one cache may hold the block in this state while others hold it in the shared-clean state.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Modified (M): '''signifies exclusive ownership; the block is modified and present in this cache alone, main memory is stale, and it is this cache's responsibility to supply the data and to update main memory on replacement.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[http://expertiza.csc.ncsu.edu/wiki/index.php/File:Screen_Shot_2012-03-28_at_10.18.57_PM.png Figure 2] depicts the state transition diagram of the Dragon protocol, and [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Screen_Shot_2012-03-28_at_10.29.04_PM.png Table 1] details the additional notations in the state transition diagram.&lt;br /&gt;
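The processor-side behaviour of the four states can be sketched in Python. This is a minimal illustration, not the protocol's full state machine: it assumes the commonly described Dragon behaviour in which a write to a shared block broadcasts an update and the presence of other sharers decides between Sm and M. The names DragonState, on_read_miss, and on_write are hypothetical and chosen only for this example.&lt;br /&gt;

```python
from enum import Enum


class DragonState(Enum):
    E = "Exclusive"
    SC = "Shared-clean"
    SM = "Shared-modified"
    M = "Modified"


def on_read_miss(others_have_copy):
    """State assigned to a block that is loaded on a read miss."""
    return DragonState.SC if others_have_copy else DragonState.E


def on_write(state, others_have_copy):
    """Return (next_state, bus_update_needed) for a processor write hit.

    Assumption: a write to E or M stays local, while a write to Sc or Sm
    broadcasts an update and then checks whether other sharers remain.
    """
    if state in (DragonState.E, DragonState.M):
        return DragonState.M, False
    next_state = DragonState.SM if others_have_copy else DragonState.M
    return next_state, True


if __name__ == "__main__":
    state = on_read_miss(others_have_copy=True)
    print(state.name, on_write(state, others_have_copy=True))
```

Note that no branch produces an invalid state, which matches the statement above that the Dragon protocol has no explicit invalid state.&lt;br /&gt;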
&lt;br /&gt;
===Firefly Protocol===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Screen_Shot_2012-03-29_at_12.14.24_AM.png|thumb|upright=1|right|alt=Table of the additional notations used in the state transition diagram of the Firefly protocol.|Table 2: Definitions of additional notations in the state transition diagram of Figure 3]]&lt;br /&gt;
&amp;lt;dd&amp;gt;The Firefly protocol was developed by DEC for its Firefly multiprocessor workstation.&lt;br /&gt;
&amp;lt;dd&amp;gt;The Firefly protocol also ensures that data is always valid if the tag matches. Hence, there is no explicit invalid state even though it reserves a miss mode bit for compulsory misses. The Firefly protocol consists of the following three states:&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Valid (V): '''The block holds a copy that is coherent with memory. It is the only copy of the data in any cache.&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Dirty (D):''' The block is the only cached copy of the data and it is inconsistent with memory. This is the only state that generates a write-back when the block is replaced in the cache.&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Shared (S):''' The block holds a copy that is coherent with memory. The data may be shared with other caches, but its content is not modified.&lt;br /&gt;
&lt;br /&gt;
[http://expertiza.csc.ncsu.edu/wiki/index.php/File:Screen_Shot_2012-03-28_at_10.19.25_PM.png Figure 3] depicts the state transition diagram of the Firefly protocol, and [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Screen_Shot_2012-03-29_at_12.14.24_AM.png Table 2] details the additional notations in the state transition diagram.&lt;br /&gt;
[[File:Screen_Shot_2012-03-28_at_10.19.25_PM.png|thumb|upright=2.5|center|alt=State transition diagram of the Firefly protocol.|Figure 3: State diagram of the Firefly protocol.]]&lt;br /&gt;
&lt;br /&gt;
=Adaptive coherence protocols=&lt;br /&gt;
&lt;br /&gt;
=====   &amp;lt;b&amp;gt;Disadvantages of Write-Invalidate Protocol&amp;lt;/b&amp;gt;=====&lt;br /&gt;
&amp;lt;dd&amp;gt; Any write by a processor invalidates the shared copies of the cache block in other processors' caches, forcing those caches to issue bus requests to reload the new data, which increases bus bandwidth demand. The effect is worse when one processor frequently updates the block while another processor stalls waiting to read it. For a sequence of '''n''' writes by one processor, each followed by a read from another processor, the '''WI''' protocol performs '''n''' invalidations and '''n''' cache block reads.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Write-Invalidate]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=====   &amp;lt;b&amp;gt;Disadvantages of Write-Update Protocol&amp;lt;/b&amp;gt;=====&lt;br /&gt;
&amp;lt;dd&amp;gt; An update protocol is advantageous for a sequence of '''n''' writes that are each followed by a remote read, because it needs only '''n''' update transactions. However, an update protocol may keep refreshing data in other processors' caches long after those processors have stopped using it, so fewer cache blocks are available for more useful data. This tends to increase conflict and capacity cache misses.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Write-Update]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
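The trade-off between the two protocols can be made concrete with a small bus-cost model in Python. This is only a sketch under simplifying assumptions: a "write run" is a sequence of consecutive writes by one processor that ends when another processor reads the block, and the cycle counts used for an invalidation, a block read, and an update are hypothetical values rather than measurements.&lt;br /&gt;

```python
def write_invalidate_cost(write_runs, ci, cr):
    """Bus cycles under write-invalidate.

    Each write run costs one invalidation (the first write) plus one
    block reread by the other processor; later writes in the run hit locally.
    """
    return len(write_runs) * (ci + cr)


def write_update_cost(write_runs, cu):
    """Bus cycles under write-update: every write broadcasts one update."""
    return sum(write_runs) * cu


if __name__ == "__main__":
    CI, CR, CU = 2, 10, 3  # hypothetical costs in bus cycles
    alternating = [1, 1, 1, 1]  # n = 4 writes, each followed by a remote read
    long_run = [10]             # 10 writes before a single remote read
    for runs in (alternating, long_run):
        print(runs,
              write_invalidate_cost(runs, CI, CR),
              write_update_cost(runs, CU))
```

With these assumed costs, the alternating pattern favours write-update, while the long write run favours write-invalidate, which is the behaviour an adaptive protocol tries to exploit.&lt;br /&gt;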
&lt;br /&gt;
A combined strategy such as an '''adaptive hybrid protocol''' can reduce the pathological behaviors of both update and invalidate protocols. Such a protocol should be applicable to a wide range of network characteristics, and it should automatically adjust its behavior to meet its goals as traffic patterns, node mobility, and other network characteristics change.&lt;br /&gt;
&lt;br /&gt;
====Consideration of cache architecture issue====&lt;br /&gt;
Application execution time also depends directly on the cache block size. &lt;br /&gt;
Some applications execute more quickly with a large cache block size because they exhibit good spatial locality, whereas others run better with small cache blocks because small blocks avoid migratory data or false sharing between processors. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Hardware architecture===&lt;br /&gt;
Most often, a processor architecture uses the same block size for memory-to-cache transfers, cache-to-memory transfers, and coherence. &lt;br /&gt;
The approach described here uses different block sizes for transfer and coherence, based on application requirements.&lt;br /&gt;
A normal cache has the following parameters: capacity ('''C'''), block size ('''L'''), and associativity ('''K'''). &lt;br /&gt;
A sector cache additionally divides each cache block into subblocks of size “b”. Compared with a normal cache ('''C, L, K'''), a sector cache ('''C, L, K, b''') requires one extra state and some extra bus lines to transmit bitmasks describing the status of the subblocks in a particular block, but it maintains the same number of tag and state bits for '''L'''. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
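To make the four parameters concrete, the following Python sketch derives the basic geometry of a sector cache. The function name sector_cache_geometry is hypothetical, and the sample values (a 128 KB, two-way cache with 64-byte blocks and 8-byte subblocks) are taken from the sizes mentioned elsewhere in this article purely as an illustration.&lt;br /&gt;

```python
def sector_cache_geometry(capacity, block_size, associativity, subblock_size):
    """Derive block, set, and subblock counts for a sector cache (C, L, K, b)."""
    if block_size % subblock_size != 0:
        raise ValueError("block size must be a multiple of the subblock size")
    if capacity % (block_size * associativity) != 0:
        raise ValueError("capacity must divide evenly into sets")
    blocks = capacity // block_size
    return {
        "blocks": blocks,
        "sets": blocks // associativity,
        # One status entry per subblock is what the bus bitmask has to carry.
        "subblocks_per_block": block_size // subblock_size,
    }


if __name__ == "__main__":
    print(sector_cache_geometry(128 * 1024, 64, 2, 8))
```

For these values each block holds eight subblocks, so a status bitmask for one block needs eight entries.&lt;br /&gt;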
   &lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Subblock protocol===&lt;br /&gt;
This snoopy-based protocol combines features of the [http://en.wikipedia.org/wiki/MESI_protocol Illinois MESI protocol] and of write policies with subblock validation, using subblocks to obtain the advantages of both small and large cache block sizes.  &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dt&amp;gt;&lt;br /&gt;
=====Block states=====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Invalid:&amp;lt;/b&amp;gt; All subblocks are invalid&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Valid Exclusive:&amp;lt;/b&amp;gt; None of the valid subblocks in this block is present in any other cache. All subblocks that are Clean Shared may be written without a bus transaction. Any subblock in the block may be invalid. There may also be Dirty subblocks, which must be written back upon replacement.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Clean Shared:&amp;lt;/b&amp;gt; The block contains subblocks that are either Clean Shared or Invalid. The block can be replaced without a bus transaction.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty Shared:&amp;lt;/b&amp;gt; Subblocks in this block may be in any state. There may be Dirty subblocks, which must be written back on replacement.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dt&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Finite state diagram of block/ line states is as follows:'''&lt;br /&gt;
  &lt;br /&gt;
&amp;lt;center&amp;gt; [[File:line_protocol.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 4.  Finite state diagram of block''' &amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=====Subblock states=====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Invalid:&amp;lt;/b&amp;gt;  The subblock is invalid&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Clean Shared:&amp;lt;/b&amp;gt; A read access to the subblock will succeed. Unless the block that contains the subblock is in the Valid Exclusive state, a write to the subblock will force an invalidation transaction on the bus.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty Shared:&amp;lt;/b&amp;gt; The subblock is treated like a Clean Shared subblock, except that it must be written back on replacement. At most one cache will have a given subblock in either the Dirty Shared or Dirty state.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty:&amp;lt;/b&amp;gt; The subblock is exclusive to this cache. It must be written back on replacement. Read and write accesses to this subblock hit with no bus transaction.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Finite state diagram of subblock is as follows:''' &lt;br /&gt;
&amp;lt;center&amp;gt;[[File:subblock_protocol.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 5.  Finite state diagram of Sub-block''' &amp;lt;/center&amp;gt; &lt;br /&gt;
   &lt;br /&gt;
The basic idea of this protocol is to perform more data transfers between caches and fewer off-chip memory accesses.&lt;br /&gt;
In contrast with the Illinois protocol, on a read miss a cache that shares the block supplies the requested subblock together with all other clean shared and dirty shared subblocks in that block. &lt;br /&gt;
If the requested subblock must come from main memory, the cache snoopers report which subblocks are cached, and memory provides the requested subblock along with the subblocks that are not currently cached.&lt;br /&gt;
On writes to clean subblocks and on write misses, only the subblock being written is invalidated in other caches, which avoids the penalties associated with false sharing.&lt;br /&gt;
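The benefit of invalidating a single subblock can be sketched in Python. The model below is an assumption-laden simplification: a cached block is represented only by a list of per-subblock valid flags, and the function name invalidate_on_remote_write is hypothetical.&lt;br /&gt;

```python
def invalidate_on_remote_write(valid, written_index, per_subblock):
    """Return the valid flags of a local block after a remote write.

    per_subblock=True models subblock invalidation; per_subblock=False
    models a conventional protocol that invalidates the whole block.
    """
    if per_subblock:
        return [v and i != written_index for i, v in enumerate(valid)]
    return [False] * len(valid)


if __name__ == "__main__":
    block = [True] * 8  # a 64-byte block with eight 8-byte subblocks
    # Another processor writes subblock 2; this processor only uses subblock 5.
    print(invalidate_on_remote_write(block, 2, per_subblock=True)[5])
    print(invalidate_on_remote_write(block, 2, per_subblock=False)[5])
```

With subblock invalidation the local read of subblock 5 still hits, whereas whole-block invalidation turns it into a false-sharing miss.&lt;br /&gt;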
&lt;br /&gt;
&amp;lt;b&amp;gt;In contrast to the Illinois MESI protocol, which requires extra power to maintain the extra states and additional logic, the subblock protocol reduces the number of cache blocks compared with cache update protocols, and thus reduces power consumption. &amp;lt;/b&amp;gt; &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Read-snarfing protocol===&lt;br /&gt;
This is an enhancement to snoopy-cache coherence protocols that takes advantage of the inherent broadcast nature of the bus.&lt;br /&gt;
&lt;br /&gt;
In contrast with the MESI protocol, when one processor reloads a block into its cache after a cache miss, the read-snarfing protocol also supplies the block to the other processors whose copies were invalidated in the past. Only one read is required to restore the block in all caches where it had been invalidated.&lt;br /&gt;
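The effect of snarfing on the number of bus reads can be sketched in Python. The model is deliberately simple and rests on assumptions: each cache is reduced to one of three hypothetical states for a single block ("valid", "invalidated", or "absent"), and only caches that hold an invalidated copy pick the data up from the bus.&lt;br /&gt;

```python
def bus_read(caches, requester, snarfing):
    """Return the cache states after the requester reads the block over the bus."""
    updated = dict(caches)
    for name, state in caches.items():
        # With snarfing, every cache holding an invalidated copy grabs the data too.
        if name == requester or (snarfing and state == "invalidated"):
            updated[name] = "valid"
    return updated


def reads_to_restore(caches, snarfing):
    """Count the bus reads needed until no cache holds an invalidated copy."""
    reads = 0
    while "invalidated" in caches.values():
        requester = next(n for n, s in caches.items() if s == "invalidated")
        caches = bus_read(caches, requester, snarfing)
        reads += 1
    return reads


if __name__ == "__main__":
    caches = {"P0": "valid", "P1": "invalidated",
              "P2": "invalidated", "P3": "invalidated", "P4": "absent"}
    print(reads_to_restore(caches, snarfing=True))
    print(reads_to_restore(caches, snarfing=False))
```

In this example one bus read restores all three invalidated copies with snarfing, while three separate reads are needed without it.&lt;br /&gt;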
&lt;br /&gt;
The protocol modifies the normal update protocol by updating only those subblocks that have been modified, and updates need to be broadcast only while the data is actively shared. &lt;br /&gt;
The read-snarfing protocol maintains a write counter and an invalidation threshold (Tb) for each cache block “b” to overcome WI’s write-after-read problem. The protocol predicts the number of writes that occur on a cache block before the next read request to the same block. An invalidation request is broadcast when the write counter reaches Tb, and the protocol dynamically adjusts the value of Tb based on the behavior of the executing program. &lt;br /&gt;
&lt;br /&gt;
A simple algorithm for the Read-snarfing Random Walk protocol is as follows:&lt;br /&gt;
Initially, Tb of each cache block b is set to 0. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
// R  = invalidation ratio, (Ci + Cr) / Cu&lt;br /&gt;
// Ci = cost in bus cycles of an invalidation transaction&lt;br /&gt;
// Cu = cost in bus cycles of an update transaction&lt;br /&gt;
// Cr = cost in bus cycles of reading a cache block&lt;br /&gt;
// write_run = number of writes to block b before it was accessed by another processor&lt;br /&gt;
&lt;br /&gt;
if (write_run &amp;gt; R) {&lt;br /&gt;
     // Long write run: updates were wasted, so invalidate sooner next time.&lt;br /&gt;
     // The lower bound is 0 so that Tb can reach 0, as described below.&lt;br /&gt;
     if (Tb &amp;gt; 0) {&lt;br /&gt;
	Tb--;&lt;br /&gt;
     }&lt;br /&gt;
} else {&lt;br /&gt;
     // Short write run: the block is actively shared, so allow more updates.&lt;br /&gt;
     if (R &amp;gt; Tb) {&lt;br /&gt;
	Tb++;&lt;br /&gt;
     }&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When the threshold drops to 0, the block is invalidated immediately, with no wasted updates.  &lt;br /&gt;
When the block is actively shared, Tb is adjusted upward so that the block is not invalidated. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&amp;lt;dd&amp;gt;&lt;br /&gt;
=====Example 1 =====&lt;br /&gt;
&amp;lt;dd&amp;gt;Suppose the invalidation ratio (R) = 5&lt;br /&gt;
&amp;lt;dd&amp;gt;Current threshold of the block (Tb) = 3&lt;br /&gt;
&amp;lt;dd&amp;gt;If the processor writes 4 times before the block is accessed by another processor, then according to the above logic Tb becomes 4.&lt;br /&gt;
&amp;lt;dd&amp;gt;This means Tb is at the best possible value and only updates are issued.&lt;br /&gt;
&lt;br /&gt;
=====Example 2 =====&lt;br /&gt;
&amp;lt;dd&amp;gt;Consider R = 5 and Tb = 3 for a particular block&lt;br /&gt;
&amp;lt;dd&amp;gt;If the processor writes 10 times before the block is accessed by another processor&lt;br /&gt;
&amp;lt;dd&amp;gt;Tb will be 2. (Decreased)&lt;br /&gt;
&amp;lt;dd&amp;gt;So the protocol incurs a cost of 2 updates, 1 invalidation and 1 reread.&lt;br /&gt;
&amp;lt;dd&amp;gt;After 2 more such long write runs, Tb will be 0 and invalidation will occur immediately. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
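The threshold adjustment and the two examples can be replayed with a short Python sketch. It assumes that the threshold may fall all the way to 0, as the text states, and it uses hypothetical bus costs to obtain an invalidation ratio of 5; the function names are chosen only for this illustration.&lt;br /&gt;

```python
def invalidation_ratio(ci, cr, cu):
    """R = (Ci + Cr) / Cu, the break-even number of updates."""
    return (ci + cr) / cu


def adjust_threshold(tb, write_run, r):
    """Return the new threshold Tb after observing one write run."""
    if write_run > r:
        # Long write run: lower the threshold, but never below 0 (assumed bound).
        return tb - 1 if tb > 0 else tb
    # Short write run: raise the threshold, but never above R.
    return tb + 1 if r > tb else tb


if __name__ == "__main__":
    r = invalidation_ratio(ci=2, cr=8, cu=2)  # hypothetical costs giving R = 5
    print(adjust_threshold(3, 4, r))          # Example 1: write run of 4
    tb = 3
    for _ in range(3):                        # Example 2: three write runs of 10
        tb = adjust_threshold(tb, 10, r)
        print(tb)
```

Replaying Example 2 this way shows the threshold stepping from 3 down to 0 over three long write runs, after which it stays at 0.&lt;br /&gt;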
&lt;br /&gt;
==Simulation result==&lt;br /&gt;
For all simulated results, the subblock size ('''b''') was '''8 bytes''' and the caches were two-way set-associative ('''K = 2'''). In order to minimize simulation time, only relatively large caches ('''C = 128K''') were simulated.&lt;br /&gt;
&lt;br /&gt;
First, results are shown for the two applications ('''Gauss and Cholesky''') that do well with large block sizes. Next come the results for the three applications ('''MP3D, Topopt, and Pverify''') that perform better with smaller block sizes. Finally, results are reported for Barnes, for which the choice of block size is not very important. The simulation compared the protocols on both normal and sector caches with '''1, 4, 16 and 32 processors''', where the normal cache uses the Illinois protocol and the sector cache uses the read-snarfing protocol. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Simulation]&amp;lt;/ref&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:image500.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 6.  (a-c) Gauss (128K cache) uses large cache block ''' &amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:Gauss32.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:cholesky_128k.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 7.  (a-c) Cholesky (128 K cache) application uses large cache block'''&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:gauss_32pro.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:Cholesky.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:mp3d.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 8.  (a-c) MP3D (128 K caches) application which uses smaller block sizes.'''&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:results_summary_1.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:results_summary_2.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:results_summary_3.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 9.  Final Simulation Results'''&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Conclusion==&lt;br /&gt;
This article described two techniques for improving application performance on bus-based multiprocessors. The first technique is the subblock cache coherence protocol, which combines large transfer blocks with small coherence subblocks to gain the advantages of spatial locality while avoiding false sharing.   &lt;br /&gt;
The second technique is read-snarfing, which reduces the number of read misses caused by earlier cache coherence protocol actions.  &lt;br /&gt;
The simulation results above showed that the subblock protocol works better than the MESI protocol for the 64-byte cache block size, but for small block sizes the Illinois protocol works better because the subblock protocol uses a large transfer block.&lt;br /&gt;
In addition, the read-snarfing protocol is effective in reducing the amount of data transferred and the number of bus transactions, which results in lower bus utilization and higher application speedups. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Quiz==&lt;br /&gt;
The following quiz is intended for the reader to benefit from this article and to help achieve a complete understanding of the research. There are a total of 10 multiple-choice questions.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;&lt;br /&gt;
'''1) What are the two commonly used coherence protocols? '''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Token-based &amp;amp; Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Update &amp;amp; Token-based&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Write-Invalidate &amp;amp; Write-Update &lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Both a and c&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''2) A coherency protocol is a protocol which maintains  __________      ____________ according to a specific ___________     ___________'''&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	System order / memory hierarchy&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Memory coherence / consistency model&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Bus hierarchy / system order&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''3) Increasing bus bandwidth is most commonly seen in what protocol?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Invalidate&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Token-based&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Write-Update &lt;br /&gt;
&amp;lt;dd&amp;gt;e)	Both a and b&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''4) Increasing conflict and capacity cache misses is mostly seen in what protocol?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Token-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Update &lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Write-Invalidate&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	Both a and d&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''5) Due to the good use of  _________  __________ some applications execute more quickly with large cache block size, whereas others run better when cache block sizes are small by avoiding migratory data or ________  _________ between processors'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Spatial locality / false-sharing&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Equidistant locality / Sequential consistency &lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Temporal locality / true sharing &lt;br /&gt;
&amp;lt;dd&amp;gt;d)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''6) The Subblock Protocol consists of four states: Invalid, Clean Shared, Dirty and __________. Name the missing state?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Exclusive&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Valid&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Dirty Shared&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Shared&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''7) In the Subblock Protocol, all subblocks that are clean-shared may be written without a bus transaction. In what state is this achieved?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Valid Exclusive&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Invalid&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Clean Shared&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Dirty Shared&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''8) Which protocol maintains a counter and invalidation threshold for each cache block to overcome the drawback of Write-Invalidate’s write-after-read problem?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Invalidate&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Subblock protocol &lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Read-snarfing protocol&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	Write-Update &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''9) Based on the simulation in this article, even though it was stated that the Subblock Protocol worked better than the MESI protocol in specific executions, in what circumstances would the MESI Protocol work better than the Subblock Protocol?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	For larger block sizes, since MESI uses large block sizes&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	When effectively supplying the block to other processors whose blocks were invalidated in the past&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	MESI will update the need to broadcast when the data is actively shared&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	For small block size because Subblock Protocol uses large transfer block&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''10) Why was it stated that the Read-snarfing protocol achieved lower bus utilization and higher application speedups?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Because instructions on any given processor execute in partial program order, but may not propagate in that order&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Because it’s effective in reducing the amount of data transferred and the number of bus transactions &lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Because Write Update reduces all types of cache misses relative to Write Invalidate, therefore Read-snarfing gains performance &lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Because multiple caches can be in its Shared State simultaneously.&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	None of the above&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
*[http://www.computer.org/portal/web/csdl/doi/10.1109/HPCA.1995.386536 Two techniques for improving performance on bus-based multiprocessors]&lt;br /&gt;
&lt;br /&gt;
*[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Design of an Adaptive Cache Coherence Protocol for Large Scale Multiprocessors]&lt;br /&gt;
&lt;br /&gt;
*[http://www.lfbs.rwth-aachen.de/content/smi Shared Memory Interface]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references&amp;gt;&amp;lt;/references&amp;gt;&lt;/div&gt;</summary>
		<author><name>Sbasu3</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8b_va&amp;diff=60727</id>
		<title>CSC/ECE 506 Spring 2012/8b va</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8b_va&amp;diff=60727"/>
		<updated>2012-03-29T04:27:25Z</updated>

		<summary type="html">&lt;p&gt;Sbasu3: /* Dragon Protocol */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Update and adaptive coherence protocols on real architectures, and power considerations=&lt;br /&gt;
 &lt;br /&gt;
===Introduction===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In a [http://en.wikipedia.org/wiki/Shared_memory shared-memory], shared-bus multiprocessor system, the cache coherence protocol plays an important role in propagating changes from one cache to the others. In most cases, however, update and invalidate coherence protocols are the main source of bus contention, which increases the number of busy bus cycles and therefore program execution time, because a processor may stall while its cache waits for the bus. To reduce bus contention, this article presents adaptive hybrid protocol strategies for high-performance multiprocessors.&amp;lt;ref&amp;gt;[http://en.wikipedia.org/wiki/Shared_memory shared memory]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Coherence protocols===&lt;br /&gt;
In general terms, a coherence protocol maintains consistency between all the caches in a [http://en.wikipedia.org/wiki/Shared_memory shared memory] system. The protocol maintains memory coherence according to a specific consistency model. Older multiprocessors support the [http://en.wikipedia.org/wiki/Sequential_consistency sequential consistency model], while modern shared memory systems typically support the release consistency or weak consistency models.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:Cache_Coherency_Generic.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 1.  Multiple Caches of Shared Resource''' &amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Transitions between states in any specific implementation of these protocols may vary. For example, an implementation may choose different update and invalidation transitions such as update-on-read, update-on-write, invalidate-on-read, or invalidate-on-write. The choice of transition may affect the amount of inter-cache traffic, which in turn may affect the amount of cache bandwidth available for actual work. This should be taken into consideration in the design of distributed software that could cause strong contention between the caches of multiple processors.&lt;br /&gt;
Various models and protocols have been devised for maintaining cache coherence, such as the MSI, MESI (also known as the Illinois protocol), MOSI, MOESI, MERSI, MESIF, write-once, Synapse, Berkeley, Firefly, and Dragon protocols.&lt;br /&gt;
&lt;br /&gt;
For the purpose of the main study, this article will focus on the two commonly used coherence protocols: &amp;lt;b&amp;gt;Write-Invalidate (WI) and Write-Update (WU)&amp;lt;/b&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Write-Invalidate Protocol'''&lt;br /&gt;
In a [http://cs.gmu.edu/cne/modules/dsm/purple/wr_inval.html Write-Invalidate Protocol], a processor invalidates the copies of the cache block held by all other processors and then updates its own copy without further bus operations.&lt;br /&gt;
&lt;br /&gt;
'''Write-Update Protocol'''&lt;br /&gt;
Alternatively, in a [http://cs.gmu.edu/cne/modules/dsm/purple/wr_update.html Write-Update Protocol], a processor broadcasts its updates to shared data to the other caches so that they stay coherent.&lt;br /&gt;
&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;b&amp;gt;Write-Update protocol:&amp;lt;/b&amp;gt; Two common update protocols are the Dragon and Firefly protocols.&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&amp;lt;dd&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Dragon Protocol===&lt;br /&gt;
[[File:Screen_Shot_2012-03-28_at_10.29.04_PM.png|thumb|upright=10.5|right|alt=Table of additional notations used in the state transition diagram of the Dragon protocol.|Table 1: Definitions of additional notations in the state transition diagram of Dragon Protocol]]&lt;br /&gt;
[[File:Screen Shot 2012-03-28 at 10.18.57 PM.png|thumb|upright=2.5|right|alt=State transition diagram of the Dragon protocol.|Figure 2: State transition diagram of the Dragon protocol]]&lt;br /&gt;
&amp;lt;dd&amp;gt;This protocol was first proposed by researchers at Xerox PARC for their Dragon multiprocessor system.&lt;br /&gt;
The Dragon protocol ensures that data is always valid if the tag matches. Hence, there is no explicit invalid state even though it reserves a miss mode bit for compulsory misses. The Dragon protocol consists of four states:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Exclusive (E): '''means that only one cache (this cache) has a copy of the block, and it has not been modified (the main memory is up-to-date)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Shared-clean (Sc): '''means that potentially two or more caches (including this one) have this block, and main memory may or may not be up-to-date&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Shared-modified (Sm):''' means that potentially two or more caches have this block, main memory is not up-to-date, and it is this cache's responsibility to update the main memory at the time this block is replaced from the cache. A block may be in this state in only one cache at a time; however, it is quite possible that one cache has the block in this state while others have it in the shared-clean state.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Modified (M): '''signifies exclusive ownership; the block is modified and present in this cache alone, main memory is stale, and it is this cache's responsibility to supply the data and to update main memory on replacement.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[http://expertiza.csc.ncsu.edu/wiki/index.php/File:Screen_Shot_2012-03-28_at_10.18.57_PM.png Figure 2] depicts the state transition diagram of the Dragon protocol, and [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Screen_Shot_2012-03-28_at_10.29.04_PM.png Table 1] details the additional notations in the state transition diagram.&lt;br /&gt;
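&lt;br /&gt;
The four states described above can be summarized in a short Python sketch. The names DragonState, writes_back_on_replacement and may_be_in_other_caches are illustrative and are not taken from any Dragon implementation. The sketch only encodes the properties listed in the state descriptions and does not model the transitions of the state diagram.&lt;br /&gt;

```python
from enum import Enum


class DragonState(Enum):
    """Dragon protocol block states, as described in the list above."""
    EXCLUSIVE = "E"
    SHARED_CLEAN = "Sc"
    SHARED_MODIFIED = "Sm"
    MODIFIED = "M"


def writes_back_on_replacement(state):
    """Sm and M own the block, so they must update main memory on replacement."""
    return state in (DragonState.SHARED_MODIFIED, DragonState.MODIFIED)


def may_be_in_other_caches(state):
    """Sc and Sm blocks may also be present in other caches."""
    return state in (DragonState.SHARED_CLEAN, DragonState.SHARED_MODIFIED)


# States whose cache is responsible for updating main memory.
owners = [s.value for s in DragonState if writes_back_on_replacement(s)]
```

Only the Sm and M states carry the write-back responsibility, which is why at most one cache may hold a given block in either of them.&lt;br /&gt;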
&lt;br /&gt;
===Firefly Protocol===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Screen_Shot_2012-03-29_at_12.14.24_AM.png|thumb|upright=1|right|alt=Table of additional notations used in the state transition diagram of the Firefly protocol.|Table 2: Definitions of additional notations in the state transition diagram of Figure 3]]&lt;br /&gt;
&amp;lt;dd&amp;gt;The Firefly protocol was developed by DEC for its Firefly multiprocessor workstation.&lt;br /&gt;
&amp;lt;dd&amp;gt;The Firefly protocol also ensures that data is always valid if the tag matches. Hence, there is no explicit invalid state even though it reserves a miss mode bit for compulsory misses. The Firefly protocol consists of the following three states:&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Valid (V): '''This block has a coherent copy of the memory. There is only one copy of the data in caches.&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Dirty (D):''' The block is the only cached copy of the data and it is inconsistent with main memory. This is the only state that generates a write-back when the block is replaced in the cache.&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Shared (S):''' This block has a coherent copy of the memory. The data may be possibly shared, but its content is not modified.&lt;br /&gt;
&lt;br /&gt;
[http://expertiza.csc.ncsu.edu/wiki/index.php/File:Screen_Shot_2012-03-28_at_10.19.25_PM.png Figure 3] depicts the state transition diagram of the Firefly protocol, and [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Screen_Shot_2012-03-29_at_12.14.24_AM.png Table 2] details the additional notations in the state transition diagram.&lt;br /&gt;
[[File:Screen_Shot_2012-03-28_at_10.19.25_PM.png|thumb|upright=2.5|center|alt=State transition diagram of the Firefly protocol.|Figure 3: State diagram of the Firefly protocol.]]&lt;br /&gt;
&lt;br /&gt;
=Adaptive coherence protocols=&lt;br /&gt;
&lt;br /&gt;
=====   &amp;lt;b&amp;gt;Disadvantages of Write-Invalidate Protocol&amp;lt;/b&amp;gt;=====&lt;br /&gt;
&amp;lt;dd&amp;gt; Any write by a processor invalidates the copies of the shared cache block held by other processors, forcing those caches to issue bus requests to reload the new data, which increases bus bandwidth consumption. The problem is worse when one processor frequently updates a block and another processor stalls while waiting to read the same block. For a sequence of '''n''' writes by one processor, each followed by a read from another processor, the '''WI''' protocol generates '''n''' invalidations and '''n''' cache block reads.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Write-Invalidate]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=====   &amp;lt;b&amp;gt;Disadvantages of Write-Update Protocol&amp;lt;/b&amp;gt;=====&lt;br /&gt;
&amp;lt;dd&amp;gt; An update protocol is advantageous in the case described above because it only issues '''n''' updates to the cached blocks. However, an update protocol may keep refreshing data in other processors' caches long after those processors have stopped using it, so fewer cache blocks are available for more useful data. This tends to increase conflict and capacity cache misses.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Write-Update]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
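&lt;br /&gt;
The trade-off between the two protocols can be sketched as a simple bus-cost model. This is an illustrative Python sketch and is not taken from the cited paper. It assumes that under Write-Invalidate a run of consecutive writes costs one invalidation plus one block re-read by the other processor, and that under Write-Update every write in the run costs one update transaction. The function names and the cycle counts in the example are hypothetical.&lt;br /&gt;

```python
def write_invalidate_cycles(runs, c_inv, c_read):
    """Each write run costs one invalidation and one re-read of the block."""
    return runs * (c_inv + c_read)


def write_update_cycles(runs, writes_per_run, c_upd):
    """Every write in every run is broadcast as one update transaction."""
    return runs * writes_per_run * c_upd


# Hypothetical costs in bus cycles: invalidate = 2, block read = 8, update = 2.
wi = write_invalidate_cycles(10, c_inv=2, c_read=8)
wu_short = write_update_cycles(10, writes_per_run=1, c_upd=2)
wu_long = write_update_cycles(10, writes_per_run=8, c_upd=2)
```

With one write per run, the update protocol uses fewer bus cycles in this model. Once a run contains more writes than (c_inv + c_read) / c_upd, which is 5 for the hypothetical costs above, invalidation becomes the cheaper choice.&lt;br /&gt;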
&lt;br /&gt;
A combined strategy such as an '''adaptive hybrid protocol''' can reduce the pathological behaviors of both update and invalidate protocols. Such a protocol should be applicable to a wide range of network characteristics, and it should automatically adjust its behavior to meet its target goals as traffic patterns, node mobility, and other network characteristics change.&lt;br /&gt;
&lt;br /&gt;
====Consideration of cache architecture issue====&lt;br /&gt;
Application execution time also depends directly on the size of the cache block.&lt;br /&gt;
Some applications execute more quickly with a large cache block size because they exhibit good spatial locality, whereas others run better with small cache blocks because these avoid migratory data and false sharing between processors. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Hardware architecture===&lt;br /&gt;
Most often, a processor architecture uses the same block size for memory-to-cache transfers, cache-to-memory transfers, and coherence.&lt;br /&gt;
The approach described here uses different block sizes for transfer and coherence, based on application requirements.&lt;br /&gt;
A normal cache has the following parameters: capacity ('''C'''), block size ('''L'''), and associativity ('''K''').&lt;br /&gt;
A sector cache additionally divides each cache block into subblocks of size '''b'''. Compared to a normal cache ('''C, L, K'''), a sector cache ('''C, L, K, b''') requires one extra state and some extra bus lines to transmit bitmasks that describe the status of the subblocks in a particular block, but it maintains the same number of tag and state bits for '''L'''. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
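&lt;br /&gt;
The relationship between these parameters can be illustrated with a short Python sketch. The function name sector_cache_geometry is hypothetical, and the sketch assumes that the status bitmask carries one entry per subblock, which the text does not state explicitly. The example values (128 KB capacity, 64-byte blocks, two-way associativity, 8-byte subblocks) are the ones mentioned in the simulation and conclusion sections of this article.&lt;br /&gt;

```python
def sector_cache_geometry(capacity, block_size, associativity, subblock_size):
    """Derive block, set, and subblock counts for a sector cache (C, L, K, b)."""
    blocks = capacity // block_size
    subblocks_per_block = block_size // subblock_size
    return {
        "blocks": blocks,
        "sets": blocks // associativity,
        "subblocks_per_block": subblocks_per_block,
        # Assumption: the status bitmask carries one entry per subblock.
        "bitmask_entries": subblocks_per_block,
    }


geometry = sector_cache_geometry(
    capacity=128 * 1024, block_size=64, associativity=2, subblock_size=8
)
```

The number of blocks, and therefore the number of tags, depends only on C and L, which is consistent with the statement that a sector cache keeps the same number of tag bits as a normal cache with the same block size.&lt;br /&gt;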
   &lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Subblock protocol===&lt;br /&gt;
This snoopy protocol combines features of the [http://en.wikipedia.org/wiki/MESI_protocol Illinois MESI protocol] and write policies with subblock validation, using subblocks to gain the advantages of both small and large cache block sizes.  &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dt&amp;gt;&lt;br /&gt;
=====Block states=====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Invalid:&amp;lt;/b&amp;gt; All subblocks are invalid&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Valid Exclusive:&amp;lt;/b&amp;gt; All valid subblocks in this block are not present in any other cache. All subblocks that are Clean Shared may be written without a bus transaction. Any subblock in the block may be invalid. There may also be Dirty subblocks, which must be written back upon replacement.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Clean Shared:&amp;lt;/b&amp;gt; The block contains subblocks that are either Clean Shared or Invalid. The block can be replaced without a bus transaction.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty Shared:&amp;lt;/b&amp;gt; Subblocks in this block may be in any state. There may be Dirty subblocks, which must be written back on replacement.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dt&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''The finite state diagram of the block (line) states is as follows:'''&lt;br /&gt;
  &lt;br /&gt;
&amp;lt;center&amp;gt; [[File:line_protocol.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 4.  Finite state diagram of block''' &amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=====Subblock states=====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Invalid:&amp;lt;/b&amp;gt;  The subblock is invalid&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Clean Shared:&amp;lt;/b&amp;gt; A read access to the subblock will succeed. Unless the block that contains the subblock is in the Valid Exclusive state, a write to the subblock will force an invalidation transaction on the bus.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty Shared:&amp;lt;/b&amp;gt; The subblock is treated like a Clean Shared subblock, except that it must be written back on replacement. At most one cache will have a given subblock in either the Dirty Shared or Dirty state.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty:&amp;lt;/b&amp;gt; The subblock is exclusive to this cache. It must be written back on replacement. Read and write accesses to this subblock hit with no bus transaction.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Finite state diagram of subblock is as follows:''' &lt;br /&gt;
&amp;lt;center&amp;gt;[[File:subblock_protocol.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 5.  Finite state diagram of Sub-block''' &amp;lt;/center&amp;gt; &lt;br /&gt;
   &lt;br /&gt;
The basic idea of this protocol is to transfer more data between caches and to access off-chip memory less often.&lt;br /&gt;
In contrast with the Illinois protocol, on a read miss the cache that holds the block supplies the requested subblock together with all other Clean Shared and Dirty Shared subblocks in that block.&lt;br /&gt;
If the requested subblock must come from main memory, the cache snoopers indicate which subblocks are already cached, and memory provides the requested subblock along with the subblocks that are not currently cached.&lt;br /&gt;
On a write to a clean subblock or on a write miss, the snooper invalidates only the subblock being written, which avoids the penalties associated with false sharing.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;b&amp;gt;In contrast to the Illinois MESI protocol, which requires extra power to maintain its extra states and additional logic, the subblock protocol reduces the number of cache blocks compared to cache update protocols and thus reduces power consumption.&amp;lt;/b&amp;gt; &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
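&lt;br /&gt;
The effect of invalidating a single subblock instead of the whole block can be sketched in Python as follows. This is an illustration only: the list-of-states representation and the function names are assumptions, and the bus transactions themselves are not modelled.&lt;br /&gt;

```python
INVALID = "Invalid"
CLEAN_SHARED = "Clean Shared"


def invalidate_subblock(subblocks, written_index):
    """Subblock protocol: a remote write invalidates only the written subblock."""
    updated = list(subblocks)
    updated[written_index] = INVALID
    return updated


def invalidate_whole_block(subblocks):
    """Whole-block invalidation, shown only for comparison."""
    return [INVALID] * len(subblocks)


# Eight subblocks, for example a 64-byte block made of 8-byte subblocks.
block = [CLEAN_SHARED] * 8
after_subblock = invalidate_subblock(block, written_index=2)
after_whole = invalidate_whole_block(block)
still_valid = after_subblock.count(CLEAN_SHARED)
```

After the remote write, seven of the eight subblocks remain readable in the local cache, whereas whole-block invalidation would force a miss on any later access to the block, even to data that the other processor never touched.&lt;br /&gt;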
&lt;br /&gt;
===Read-snarfing protocol===&lt;br /&gt;
This is an enhancement to snoopy-cache coherence protocols that takes advantage of the inherent broadcast nature of the bus.&lt;br /&gt;
&lt;br /&gt;
In contrast with the MESI protocol, when one processor reloads a block into its cache after a cache miss, the read-snarfing protocol also supplies the block to the other processors whose copies were invalidated in the past. Only one read is required to restore the block in all caches whose copies were invalidated.&lt;br /&gt;
&lt;br /&gt;
The protocol modifies the normal update protocol by updating only those subblocks that have been modified, and updates need to be broadcast only while the data is actively shared.&lt;br /&gt;
The Read-snarfing protocol maintains a counter and an invalidation threshold (Tb) for each cache block '''b''' to overcome the write-after-read drawback of WI. The protocol predicts the number of writes that occur on a single cache block before a read request to the same block. An invalidation request is broadcast when the write counter reaches Tb, and the protocol dynamically adjusts the value of Tb based on the behavior of the program.&lt;br /&gt;
&lt;br /&gt;
A simple algorithm for the Read-snarfing Random Walk protocol is as follows.&lt;br /&gt;
Initially, Tb of each cache block b is set to 0. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
// write_run: number of writes to block b before it is accessed by another processor&lt;br /&gt;
if (write_run &amp;gt; R) {&lt;br /&gt;
     if (Tb &amp;gt; 0) {&lt;br /&gt;
	Tb--;    // long write run: invalidate sooner&lt;br /&gt;
     }&lt;br /&gt;
} else {&lt;br /&gt;
     if (R &amp;gt; Tb) {&lt;br /&gt;
	Tb++;    // block is actively shared: allow more updates&lt;br /&gt;
     }&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
R = Invalidation Ratio, which is (Ci + Cr) / Cu&lt;br /&gt;
Ci: The cost in bus cycles of an invalidation transaction&lt;br /&gt;
Cu: The cost in bus cycles of an update transaction&lt;br /&gt;
Cr: The cost in bus cycles of reading a cache block&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When the threshold falls to 0, the block is invalidated immediately, with no wasted updates.&lt;br /&gt;
When the block is actively shared, Tb is adjusted upward so that the block is not invalidated. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&amp;lt;dd&amp;gt;&lt;br /&gt;
=====Example 1 =====&lt;br /&gt;
&amp;lt;dd&amp;gt;Suppose the invalidation ratio (R) = 5&lt;br /&gt;
&amp;lt;dd&amp;gt;Current threshold for the block (Tb) = 3&lt;br /&gt;
&amp;lt;dd&amp;gt;If the processor writes 4 times before the block is accessed by another processor, then according to the above logic Tb will be 4.&lt;br /&gt;
&amp;lt;dd&amp;gt;This means Tb is at the best possible value and only updates are issued.&lt;br /&gt;
&lt;br /&gt;
=====Example 2 =====&lt;br /&gt;
&amp;lt;dd&amp;gt;Consider R = 5 and Tb = 3 for a particular block&lt;br /&gt;
&amp;lt;dd&amp;gt;If the processor writes 10 times before the block is accessed by another processor,&lt;br /&gt;
&amp;lt;dd&amp;gt;Tb will be 2 (decreased).&lt;br /&gt;
&amp;lt;dd&amp;gt;So the protocol can incur a cost of 2 updates, 1 invalidation and 1 reread.&lt;br /&gt;
&amp;lt;dd&amp;gt;After two more such write runs, Tb will be 0 and invalidation will occur immediately.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
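&lt;br /&gt;
The random-walk adjustment and the two examples above can be reproduced with the following Python sketch. The function names adjust_threshold and invalidation_ratio are hypothetical, and the sketch assumes that the threshold may fall to 0, as the text states for the immediate-invalidation case.&lt;br /&gt;

```python
def invalidation_ratio(c_inv, c_read, c_upd):
    """R = (Ci + Cr) / Cu, with all costs in bus cycles."""
    return (c_inv + c_read) / c_upd


def adjust_threshold(tb, write_run, ratio):
    """Return the new threshold Tb after one observed write run."""
    if write_run > ratio:
        # Long write run: updates were wasted, so invalidate sooner.
        return max(tb - 1, 0)
    if ratio > tb:
        # Short write run: the block is actively shared, so allow more updates.
        return tb + 1
    return tb


example_1 = adjust_threshold(tb=3, write_run=4, ratio=5)
example_2 = adjust_threshold(tb=3, write_run=10, ratio=5)
```

Calling adjust_threshold once per completed write run moves Tb by at most one step each time, so the threshold follows the sharing pattern of the block gradually rather than reacting to a single outlier.&lt;br /&gt;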
&lt;br /&gt;
==Simulation result==&lt;br /&gt;
For all simulated results, the subblock size ('''b''') was '''8 bytes''' and the caches were two-way set-associative ('''K = 2'''). In order to minimize simulation time, only relatively large caches ('''C = 128K''') were simulated.&lt;br /&gt;
&lt;br /&gt;
First, below are the results for the two applications ('''Gauss and Cholesky''') that do well with large block sizes. Next come the results for the three applications ('''MP3D, Topopt, and Pverify''') that perform better with smaller block sizes. Finally, results are reported for Barnes, for which the choice of block size is not very important. The simulations compared the protocols on both conventional and sector caches with '''1, 4, 16 and 32 processors''', where the conventional cache uses the Illinois protocol and the sector cache uses the read-snarfing protocol. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Simulation]&amp;lt;/ref&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:image500.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 6.  (a-c) Gauss (128K cache) uses large cache block ''' &amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:Gauss32.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:cholesky_128k.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 7.  (a-c) Cholesky (128 K cache) application uses large cache block'''&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:gauss_32pro.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:Cholesky.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:mp3d.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 8.  (a-c) MP3D (128 K caches) application which uses smaller block sizes.'''&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:results_summary_1.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:results_summary_2.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:results_summary_3.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 9.  Final Simulation Results'''&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Conclusion==&lt;br /&gt;
This article described two techniques to improve application performance on bus-based multiprocessors. The first technique is the subblock cache coherence protocol, which combines a large transfer block with small coherence subblocks to exploit spatial locality while avoiding false sharing.&lt;br /&gt;
The second technique is read-snarfing, which reduces the number of read misses caused by earlier cache coherence protocol actions.&lt;br /&gt;
The simulation results above showed that the subblock protocol works better than the MESI protocol for the 64-byte cache block size, but for small block sizes the Illinois protocol works better because the subblock protocol uses a large transfer block.&lt;br /&gt;
In addition, the read-snarfing protocol is effective in reducing the amount of data transferred and the number of bus transactions, which results in lower bus utilization and higher application speedups.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Quiz==&lt;br /&gt;
The following quiz is intended for the reader to benefit from this article and to help achieve a complete understanding of the research. There are a total of 10 multiple-choice questions.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;&lt;br /&gt;
'''1) What are the two commonly used coherence protocols? '''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Token-based &amp;amp; Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Update &amp;amp; Token-based&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Write-Invalidate &amp;amp; Write-Update &lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Both a and c&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''2) A coherency protocol is a protocol which maintains  __________      ____________ according to a specific ___________     ___________'''&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	System order / memory hierarchy&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Memory coherence / consistency model&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Bus hierarchy / system order&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''3) Increased bus bandwidth consumption is most commonly seen in which protocol?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Invalidate&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Token-based&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Write-Update &lt;br /&gt;
&amp;lt;dd&amp;gt;e)	Both a and b&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''4) An increase in conflict and capacity cache misses is mostly seen in which protocol?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Token-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Update &lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Write-Invalidate&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	Both a and d&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''5) Because they exhibit good _________  __________, some applications execute more quickly with a large cache block size, whereas others run better when cache block sizes are small by avoiding migratory data or ________  _________ between processors.'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Spatial locality / false sharing&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Equidistant locality / sequential consistency&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Temporal locality / true sharing&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''6) A subblock in the Subblock Protocol can be in one of four states: Invalid, Clean Shared, Dirty and __________. Name the missing state.'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Exclusive&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Valid&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Dirty Shared&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Shared&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''7) In the Subblock Protocol, all subblocks that are clean-shared may be written without a bus transaction. In what state is this achieved?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Valid Exclusive&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Invalid&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Clean Shared&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Dirty Shared&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''8) Which protocol maintains a counter and an invalidation threshold for each cache block to overcome the write-after-read drawback of Write-Invalidate?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Invalidate&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Subblock protocol &lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Read-snarfing protocol&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	Write-Update &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''9) According to the simulation in this article, the Subblock Protocol performed better than the MESI protocol in specific executions. Under what circumstances would the MESI protocol perform better than the Subblock Protocol?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	For larger block sizes, since MESI uses large block sizes&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	When effectively supplying the block to other processors whose blocks were invalidated in the past&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	MESI will update the need to broadcast when the data is actively shared&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	For small block sizes, because the Subblock Protocol uses a large transfer block&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''10) Why was it stated that the Read-snarfing protocol achieved lower bus utilization and higher application speedups?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Because instructions on any given processor execute in partial program order, but may not propagate in that order&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Because it’s effective in reducing the amount of data transfer and the number of bus transactions &lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Because Write Update reduces all types of cache misses relative to Write Invalidate, therefore Read-snarfing gains performance &lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Because multiple caches can be in its Shared State simultaneously.&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	None of the above&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
*[http://www.computer.org/portal/web/csdl/doi/10.1109/HPCA.1995.386536 Two techniques for improving performance on bus-based multiprocessors]&lt;br /&gt;
&lt;br /&gt;
*[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Design of an Adaptive Cache Coherence Protocol for Large Scale Multiprocessors]&lt;br /&gt;
&lt;br /&gt;
*[http://www.lfbs.rwth-aachen.de/content/smi Shared Memory Interface]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references&amp;gt;&amp;lt;/references&amp;gt;&lt;/div&gt;</summary>
		<author><name>Sbasu3</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8b_va&amp;diff=60726</id>
		<title>CSC/ECE 506 Spring 2012/8b va</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8b_va&amp;diff=60726"/>
		<updated>2012-03-29T04:24:55Z</updated>

		<summary type="html">&lt;p&gt;Sbasu3: /* Dragon Protocol */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Update and adaptive coherence protocols on real architectures, and power considerations=&lt;br /&gt;
 &lt;br /&gt;
===Introduction===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In a [http://en.wikipedia.org/wiki/Shared_memory shared-memory], shared-bus multiprocessor system, the cache coherency protocol plays an important role in propagating changes from one cache to the others. In most cases, however, update and invalidate coherence protocols are the main source of bus contention, which can increase the number of bus busy cycles and thus program execution time, because a processor may stall while its cache is waiting for the bus. To reduce bus contention, this article presents some adaptive hybrid protocol strategies for high-performance multiprocessors.&amp;lt;ref&amp;gt;[http://en.wikipedia.org/wiki/Shared_memory shared memory]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Coherence protocols===&lt;br /&gt;
In general terms, a coherency protocol maintains consistency among all the caches in a [http://en.wikipedia.org/wiki/Shared_memory shared-memory] system. The protocol maintains memory coherence according to a specific consistency model. Older multiprocessors support the [http://en.wikipedia.org/wiki/Sequential_consistency sequential consistency model], while modern shared-memory systems typically support the release consistency or weak consistency models. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:Cache_Coherency_Generic.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 1.  Multiple Caches of Shared Resource''' &amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Transitions between states in any specific implementation of these protocols may vary. For example, an implementation may choose different update and invalidation transitions such as update-on-read, update-on-write, invalidate-on-read, or invalidate-on-write. The choice of transition may affect the amount of inter-cache traffic, which in turn may affect the amount of cache bandwidth available for actual work. This should be taken into consideration in the design of distributed software that could cause strong contention between the caches of multiple processors.&lt;br /&gt;
Various models and protocols have been devised for maintaining cache coherence, such as the MSI, MESI (also known as the Illinois protocol), MOSI, MOESI, MERSI, MESIF, write-once, Synapse, Berkeley, Firefly and Dragon protocols.&lt;br /&gt;
&lt;br /&gt;
For the purpose of the main study, this article will focus on the two commonly used coherence protocols: &amp;lt;b&amp;gt;Write-Invalidate (WI) and Write-Update (WU)&amp;lt;/b&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Write-Invalidate Protocol'''&lt;br /&gt;
In a [http://cs.gmu.edu/cne/modules/dsm/purple/wr_inval.html Write-Invalidate Protocol], a processor invalidates the copies of a cache block held by all other processors and then updates its own copy without further bus operations.&lt;br /&gt;
&lt;br /&gt;
'''Write-Update Protocol'''&lt;br /&gt;
In contrast, in a [http://cs.gmu.edu/cne/modules/dsm/purple/wr_update.html Write-Update Protocol], a processor broadcasts updates to shared data to the other caches so that they stay coherent.&lt;br /&gt;
&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;b&amp;gt;Write-Update protocol:&amp;lt;/b&amp;gt; Two common update protocols are the Dragon and Firefly protocols.&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&amp;lt;dd&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Dragon Protocol===&lt;br /&gt;
[[File:Screen_Shot_2012-03-28_at_10.29.04_PM.png|thumb|upright=10.5|right|alt=Table of additional notations used in the Dragon protocol state transition diagram.|Table 1: Definitions of additional notations in the state transition diagram of the Dragon protocol]]&lt;br /&gt;
[[File:Screen Shot 2012-03-28 at 10.18.57 PM.png|thumb|upright=2.5|right|alt=State transition diagram of the Dragon protocol.|Figure 2: State transition diagram of the Dragon protocol]]&lt;br /&gt;
&amp;lt;dd&amp;gt;This protocol was first proposed by researchers at Xerox PARC for their Dragon multiprocessor system.&lt;br /&gt;
The Dragon protocol ensures that data is always valid if the tag matches. Hence, there is no explicit invalid state even though it reserves a miss mode bit for compulsory misses. The Dragon protocol consists of four states:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Exclusive (E): '''means that only one cache (this cache) has a copy of the block, and it has not been modified (the main memory is up-to-date)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Shared-clean (Sc): '''means that potentially two or more caches (including this one) have this block, and main memory may or may not be up-to-date&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Shared-modified (Sm):''' means that potentially two or more caches have this block, main memory is not up-to-date, and it is this cache's responsibility to update the main memory at the time this block is replaced from the cache; a block may be in this state in only one cache at a time; however, it is quite possible that one cache has the block in this state while others have it in the shared-clean state&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Modified (M): '''signifies exclusive ownership; the block is modified and present in this cache alone, main memory is stale, and it is this cache's responsibility to supply the data and to update main memory on replacement.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Figure 2 depicts the state transition diagram of the Dragon protocol, and Table 1 details the additional notations used in the state transition diagram.&lt;br /&gt;
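The four Dragon states listed above can be exercised with a small transition function. The following is only a sketch based on the commonly taught textbook description of the protocol, not a transcription of the figure: the event and action labels (PrRd, PrWr, BusRd, BusUpd, Flush, Update) and the function names are assumptions introduced for illustration.&lt;br /&gt;

```python
# Dragon protocol sketch. State names follow the article; the event and
# action names (PrRd, PrWr, BusRd, BusUpd, Flush, Update) are assumed labels.
E, SC, SM, M = "E", "Sc", "Sm", "M"


def on_processor(state, event, shared):
    """Return (next_state, bus_action) for a processor read or write.

    state is None when the block is not cached (a miss); shared says
    whether another cache currently holds the block.
    """
    if event == "PrRd":
        if state is None:
            return (SC if shared else E), "BusRd"
        return state, None  # read hits never change the state
    if event == "PrWr":
        if state is None:
            return (SM, "BusRd+BusUpd") if shared else (M, "BusRd")
        if state in (E, M):
            return M, None  # no other copy exists, so write locally
        # Sc or Sm: broadcast the new value to the other sharers
        return (SM if shared else M), "BusUpd"
    raise ValueError(f"unknown processor event: {event}")


def on_bus(state, event):
    """Return (next_state, action) for a snooped bus transaction."""
    if event == "BusRd":
        if state in (E, SC):
            return SC, None
        return SM, "Flush"  # M or Sm owns the block and supplies it
    if event == "BusUpd" and state in (SC, SM):
        return SC, "Update"  # another cache wrote and becomes the owner
    raise ValueError(f"unexpected bus event {event} in state {state}")
```

Note that no transition ever discards a block: a write by another cache moves a sharer to Sc with refreshed data, which is why the protocol needs no explicit invalid state.&lt;br /&gt;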
&lt;br /&gt;
===Firefly Protocol===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Screen_Shot_2012-03-29_at_12.14.24_AM.png|thumb|upright=1|right|alt=Table of additional notations used in the Firefly protocol state transition diagram.|Table 2: Definitions of additional notations in the state transition diagram of Figure 3]]&lt;br /&gt;
&amp;lt;dd&amp;gt;The Firefly protocol was developed by DEC for its Firefly multiprocessor workstation.&lt;br /&gt;
&amp;lt;dd&amp;gt;The Firefly protocol also ensures that data is always valid if the tag matches. Hence, there is no explicit invalid state even though it reserves a miss mode bit for compulsory misses. The Firefly protocol consists of the following three states:&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Valid (V): '''This block holds a copy that is coherent with memory. It is the only copy of the data in any cache.&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Dirty (D):''' This block is the only cached copy of the data and it is inconsistent with memory. This is the only state that generates a write-back when the block is replaced in the cache.&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Shared (S):''' This block holds a copy that is coherent with memory. The data may be shared with other caches, but its content is not modified.&lt;br /&gt;
&lt;br /&gt;
[http://expertiza.csc.ncsu.edu/wiki/index.php/File:Screen_Shot_2012-03-28_at_10.19.25_PM.png Figure 3] depicts the state transition diagram of the Firefly protocol, and [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Screen_Shot_2012-03-29_at_12.14.24_AM.png Table 2] details the additional notations in the state transition diagram.&lt;br /&gt;
[[File:Screen_Shot_2012-03-28_at_10.19.25_PM.png|thumb|upright=2.5|center|alt=State transition diagram of the Firefly protocol.|Figure 3: State diagram of the Firefly protocol.]]&lt;br /&gt;
&lt;br /&gt;
=Adaptive coherence protocols=&lt;br /&gt;
&lt;br /&gt;
=====   &amp;lt;b&amp;gt;Disadvantages of Write-Invalidate Protocol&amp;lt;/b&amp;gt;=====&lt;br /&gt;
&amp;lt;dd&amp;gt; Any write by a processor invalidates the shared copies of the cache block in other processors' caches, forcing those caches to issue bus requests to reload the new data, which increases bus bandwidth consumption. The problem is worse when one processor frequently updates a cache block while another processor stalls to read the same block. For a sequence of '''n''' writes by one processor, each followed by a read from another processor, the '''WI''' protocol performs '''n''' invalidations and '''n''' cache block reads.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Write-Invalidate]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=====   &amp;lt;b&amp;gt;Disadvantages of Write-Update Protocol&amp;lt;/b&amp;gt;=====&lt;br /&gt;
&amp;lt;dd&amp;gt; An update protocol is advantageous in the case above because it only updates the cache block '''n''' times. However, an update protocol can keep refreshing data in other processors' caches long after those processors have stopped using it, so fewer cache blocks are available for more useful data. This tends to increase conflict and capacity cache misses.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Write-Update]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
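The trade-off between the two protocols for a sequence of '''n''' writes, each followed by a remote read, can be made concrete with a small cost model. This is a sketch only: the function names are hypothetical and the per-transaction bus-cycle costs are made-up values chosen for illustration, not measurements.&lt;br /&gt;

```python
def write_invalidate_cost(n, c_inv, c_read):
    """Bus cycles for n write/read rounds: one invalidation plus one
    block re-read by the other processor in every round."""
    return n * (c_inv + c_read)


def write_update_cost(n, c_upd):
    """Bus cycles for n write/read rounds: one update broadcast per write."""
    return n * c_upd


# Hypothetical costs in bus cycles (assumed values, for illustration only).
C_INV, C_UPD, C_READ = 2, 3, 13

n = 10
wi_cycles = write_invalidate_cost(n, C_INV, C_READ)  # 10 * (2 + 13) = 150
wu_cycles = write_update_cost(n, C_UPD)              # 10 * 3 = 30
print(wi_cycles, wu_cycles)
```

With these assumed costs the update protocol wins whenever the data is actually re-read after each write; the picture reverses when many writes occur between reads, because every update is then wasted bus traffic.&lt;br /&gt;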
&lt;br /&gt;
A combined strategy such as an '''adaptive hybrid protocol''' can reduce the pathological behaviors of the update and invalidate protocols. Such a protocol should be applicable to a wide range of network characteristics, and it should automatically adjust its behavior to achieve its target goals in the face of changes in traffic patterns, node mobility, and other network characteristics.&lt;br /&gt;
&lt;br /&gt;
====Consideration of cache architecture issue====&lt;br /&gt;
Application performance also depends directly on the cache block size. &lt;br /&gt;
Some applications execute more quickly with a large cache block size because they exhibit good spatial locality, whereas other applications run better when cache block sizes are small because they avoid migratory data or false sharing between processors. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Hardware architecture===&lt;br /&gt;
Most processor architectures use the same block size for memory-to-cache transfers, cache-to-memory transfers, and coherence. &lt;br /&gt;
The approach described here instead uses different block sizes for transfer and coherence, based on application requirements.&lt;br /&gt;
A normal cache has the following parameters: capacity ('''C'''), block size ('''L'''), and associativity ('''K'''). &lt;br /&gt;
A sector cache additionally divides each cache block into subblocks of size '''b'''. Compared to a normal cache ('''C, L, K'''), a sector cache ('''C, L, K, b''') requires one extra state and some extra bus lines to transmit bitmasks describing the status of the subblocks in a particular block, but it maintains the same number of tag and state bits for '''L'''. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
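To make the four sector cache parameters concrete, the sketch below derives the cache geometry from '''C, L, K, b'''. The function names are hypothetical, and the assumption that the subblock bitmask carries one bit per subblock is an illustration rather than something the article specifies. The sample values ('''C''' = 128K, '''L''' = 64 bytes, '''K''' = 2, '''b''' = 8 bytes) are the ones this article quotes for the simulations.&lt;br /&gt;

```python
def sector_cache_geometry(capacity, block_size, associativity, subblock_size):
    """Derive basic geometry for a sector cache (C, L, K, b), sizes in bytes."""
    blocks = capacity // block_size
    subblocks_per_block = block_size // subblock_size
    return {
        "blocks": blocks,
        "sets": blocks // associativity,
        "subblocks_per_block": subblocks_per_block,
        # Assumption: the status bitmask sent on the bus has one bit per subblock.
        "bitmask_bits": subblocks_per_block,
    }


def subblock_index(address, block_size, subblock_size):
    """Index of the subblock that a byte address falls into within its block."""
    return (address % block_size) // subblock_size


geometry = sector_cache_geometry(128 * 1024, 64, 2, 8)
print(geometry)
print(subblock_index(0x1234, 64, 8))
```

Coherence actions then operate on the 8-byte subblock selected by `subblock_index`, while transfers can still move the whole 64-byte block.&lt;br /&gt;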
   &lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Subblock protocol===&lt;br /&gt;
This snoopy-based protocol combines features of the [http://en.wikipedia.org/wiki/MESI_protocol Illinois MESI protocol] and write policies with subblock validation, using subblocks to gain the advantages of both small and large cache block sizes.  &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dt&amp;gt;&lt;br /&gt;
=====Block states=====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Invalid:&amp;lt;/b&amp;gt; All subblocks are invalid&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Valid Exclusive:&amp;lt;/b&amp;gt; All valid subblocks in this block are not present in any other cache. All subblocks that are Clean Shared may be written without a bus transaction. Any subblock in the block may be invalid. There may also be Dirty subblocks, which must be written back upon replacement.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Clean Shared:&amp;lt;/b&amp;gt; The block contains subblocks that are either Clean Shared or Invalid. The block can be replaced without a bus transaction.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty Shared:&amp;lt;/b&amp;gt; Subblocks in this block may be in any state. There may be Dirty subblocks, which must be written back on replacement.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dt&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''The finite state diagram of the block (line) states is as follows:'''&lt;br /&gt;
  &lt;br /&gt;
&amp;lt;center&amp;gt; [[File:line_protocol.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 4.  Finite state diagram of block''' &amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=====Subblock states=====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Invalid:&amp;lt;/b&amp;gt;  The subblock is invalid&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Clean Shared:&amp;lt;/b&amp;gt; A read access to the subblock will succeed. Unless the block containing the subblock is in the Valid Exclusive state, a write to the subblock will force an invalidation transaction on the bus.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty Shared:&amp;lt;/b&amp;gt; The subblock is treated like a Clean Shared subblock, except that it must be written back on replacement. At most one cache will have a given subblock in either the Dirty Shared or Dirty state.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty:&amp;lt;/b&amp;gt; The subblock is exclusive to this cache. It must be written back on replacement. Read and write accesses to this subblock hit with no bus transaction.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Finite state diagram of subblock is as follows:''' &lt;br /&gt;
&amp;lt;center&amp;gt;[[File:subblock_protocol.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 5.  Finite state diagram of Sub-block''' &amp;lt;/center&amp;gt; &lt;br /&gt;
   &lt;br /&gt;
The basic idea of this protocol is to perform more data transfers between caches and fewer off-chip memory accesses.&lt;br /&gt;
In contrast with the Illinois protocol, on a read miss a cache that shares the block sends the requested subblock along with all other Clean Shared and Dirty Shared subblocks in that block. &lt;br /&gt;
If the subblock is only in main memory, the snooping caches report which subblocks they hold, and memory provides the requested subblock along with the subblocks that are not currently cached.&lt;br /&gt;
On a write to a clean subblock or on a write miss, the snooper invalidates only the subblock being written, which avoids the penalties associated with false sharing.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;b&amp;gt;In contrast to the Illinois MESI Protocol, which requires extra power cycles to maintain its extra states and additional logic, the subblock protocol reduces the number of cache blocks compared to cache update protocols, and thus reduces power consumption. &amp;lt;/b&amp;gt; &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Read-snarfing protocol===&lt;br /&gt;
This is an enhancement to snoopy-cache coherence protocols that takes advantage of the inherent broadcast nature of the bus.&lt;br /&gt;
&lt;br /&gt;
In contrast with the MESI protocol, when one processor reloads data into its cache after a cache miss, the read-snarfing protocol also supplies the block to the other processors whose copies were invalidated in the past. Only one read is required to restore the block to all caches in which it was invalidated.&lt;br /&gt;
&lt;br /&gt;
The protocol modifies the normal update protocol by updating only the subblocks that are modified, and updates need to be broadcast only when the data is actively shared. &lt;br /&gt;
The read-snarfing protocol maintains a counter and an invalidation threshold (Tb) for each cache block “b” to overcome the drawback of WI’s write-after-read problem. The protocol predicts the number of write operations that happen on a single cache block before a read request to the same cache block. An invalidation request is broadcast when the write counter reaches Tb, and the protocol dynamically adjusts the value of Tb based on the behavior of the executing program. &lt;br /&gt;
&lt;br /&gt;
A simple algorithm for the Read-snarfing Random Walk protocol is as follows:&lt;br /&gt;
Initially, Tb of each cache block b is set to 0. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
// R  = invalidation ratio = (Ci + Cr) / Cu&lt;br /&gt;
// Ci = cost in bus cycles of an invalidation transaction&lt;br /&gt;
// Cu = cost in bus cycles of an update transaction&lt;br /&gt;
// Cr = cost in bus cycles of reading a cache block&lt;br /&gt;
// write_run = number of writes to the block before another processor accesses it&lt;br /&gt;
&lt;br /&gt;
if (write_run &amp;gt; R) {&lt;br /&gt;
    // Updates were wasted: lower the threshold so invalidation happens sooner.&lt;br /&gt;
    if (Tb &amp;gt; 0) {&lt;br /&gt;
        Tb--;&lt;br /&gt;
    }&lt;br /&gt;
} else {&lt;br /&gt;
    // The block is actively shared: raise the threshold, up to R.&lt;br /&gt;
    if (R &amp;gt; Tb) {&lt;br /&gt;
        Tb++;&lt;br /&gt;
    }&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
A block is invalidated immediately, with no wasted updates, when its threshold has been reduced to 0.  &lt;br /&gt;
When a block is actively shared, Tb is adjusted upward so that the block is not invalidated. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&amp;lt;dd&amp;gt;&lt;br /&gt;
=====Example 1 =====&lt;br /&gt;
&amp;lt;dd&amp;gt;Suppose the invalidation ratio (R) = 5&lt;br /&gt;
&amp;lt;dd&amp;gt;Current threshold of the block (Tb) = 3&lt;br /&gt;
&amp;lt;dd&amp;gt;If the processor writes 4 times before the block is accessed by another processor, then according to the above logic Tb will be 4.&lt;br /&gt;
&amp;lt;dd&amp;gt;This means Tb is at the best possible value and only updates are issued.&lt;br /&gt;
&lt;br /&gt;
=====Example 2 =====&lt;br /&gt;
&amp;lt;dd&amp;gt;Consider R = 5 and Tb = 3 for a particular block&lt;br /&gt;
&amp;lt;dd&amp;gt;If the processor writes 10 times before the block is accessed by another processor&lt;br /&gt;
&amp;lt;dd&amp;gt;Tb will be 2 (decreased).&lt;br /&gt;
&amp;lt;dd&amp;gt;So the protocol can incur a cost of 2 updates, 1 invalidate and 1 reread.&lt;br /&gt;
&amp;lt;dd&amp;gt;After 2 more such write runs, Tb will be 0 and invalidation will occur immediately. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
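The two examples above can be replayed with a short sketch of the threshold adjustment. It assumes the lower bound of Tb is 0, as the surrounding text and Example 2 state; the function and variable names are hypothetical.&lt;br /&gt;

```python
def adapt_threshold(tb, write_run, r):
    """One random-walk step of the invalidation threshold Tb for a block.

    tb        -- current threshold of the block
    write_run -- writes made before another processor accessed the block
    r         -- invalidation ratio, (Ci + Cr) / Cu
    """
    if write_run > r:
        # Long write run: the updates were wasted, so invalidate sooner.
        if tb > 0:
            return tb - 1
        return tb
    # Short write run: the block is actively shared, so allow more updates.
    if r > tb:
        return tb + 1
    return tb


example_1 = adapt_threshold(3, 4, 5)   # write run of 4 does not exceed R: Tb rises
example_2 = adapt_threshold(3, 10, 5)  # write run of 10 exceeds R: Tb falls

# Repeated long write runs walk the threshold down to 0, where it stays.
tb = 3
for run in (10, 10, 10, 10):
    tb = adapt_threshold(tb, run, 5)

print(example_1, example_2, tb)
```

Because Tb moves by at most one per write run, a single unusual run does not flip a block between update and invalidate behavior.&lt;br /&gt;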
&lt;br /&gt;
==Simulation result==&lt;br /&gt;
For all simulated results, the subblock size ('''b''') was '''8 bytes''' and the caches were two-way set-associative ('''K = 2'''). In order to minimize simulation time, only relatively large caches ('''C = 128K''') were simulated.&lt;br /&gt;
&lt;br /&gt;
First, below are the results for the two applications ('''Gauss and Cholesky''') that do well using large block sizes. Next come the results for the three applications ('''MP3D, Topopt, and Pverify''') that perform better using smaller block sizes. Finally come the results for Barnes, for which the choice of block size is not very important. The simulation compared the protocols on both normal and sector caches with '''1, 4, 16 and 32 processors''', where the normal cache uses the Illinois protocol and the sector cache uses the read-snarfing protocol. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Simulation]&amp;lt;/ref&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:image500.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 6.  (a-c) Gauss (128K cache) uses large cache block ''' &amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:Gauss32.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:cholesky_128k.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 7.  (a-c) Cholesky (128 K cache) application uses large cache block'''&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:gauss_32pro.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:Cholesky.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:mp3d.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 8.  (a-c) MP3D (128 K caches) application which uses smaller block sizes.'''&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:results_summary_1.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:results_summary_2.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:results_summary_3.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 9.  Final Simulation Results'''&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Conclusion==&lt;br /&gt;
This article described two techniques for improving application performance on bus-based multiprocessors. The first technique is the subblock cache coherence protocol, which combines large transfer blocks with small coherence subblocks to exploit spatial locality while avoiding false sharing.   &lt;br /&gt;
The second technique is read-snarfing, which reduces the number of read misses caused by earlier cache coherence protocol actions.  &lt;br /&gt;
The simulation results above showed that the subblock protocol works better than the MESI protocol for the 64-byte cache block size, but for small block sizes the Illinois protocol works better because the subblock protocol uses a large transfer block.&lt;br /&gt;
In addition, the read-snarfing protocol is effective in reducing the amount of data transferred and the number of bus transactions, which results in lower bus utilization and higher application speedups. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Quiz==&lt;br /&gt;
The following quiz is intended for the reader to benefit from this article and to help achieve a complete understanding of the research. There are a total of 10 multiple-choice questions.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;&lt;br /&gt;
'''1) What are the two commonly used coherence protocols? '''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Token-based &amp;amp; Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Update &amp;amp; Token-based&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Write-Invalidate &amp;amp; Write-Update &lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Both a and c&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''2) A coherency protocol is a protocol which maintains  __________      ____________ according to a specific ___________     ___________'''&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	System order / memory hierarchy&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Memory coherence / consistency model&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Bus hierarchy / system order&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''3) Increased bus bandwidth consumption is most commonly seen in which protocol?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Invalidate&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Token-based&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Write-Update &lt;br /&gt;
&amp;lt;dd&amp;gt;e)	Both a and b&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''4) An increase in conflict and capacity cache misses is mostly seen in which protocol?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Token-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Update &lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Write-Invalidate&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	Both a and d&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''5) Due to the good use of _________ __________, some applications execute more quickly with a large cache block size, whereas others run better when cache block sizes are small by avoiding migratory data or ________ _________ between processors.'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Spatial locality / false-sharing&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Equidistant locality /Sequential consistency &lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Temporal locality/ true sharing &lt;br /&gt;
&amp;lt;dd&amp;gt;d)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''6) The Subblock Protocol consists of four subblock states: Invalid, Clean Shared, Dirty and __________. Name the missing state.'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Exclusive&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Valid&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Dirty Shared&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Shared&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''7) In the Subblock Protocol, all subblocks that are clean-shared may be written without a bus transaction. In what state is this achieved?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Valid Exclusive&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Invalid&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Clean Shared&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Dirty Shared&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''8) Which protocol maintains a counter and an invalidation threshold for each cache block to overcome the drawback of Write-Invalidate’s write-after-read problem?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Invalidate&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Subblock protocol &lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Read-snarfing protocol&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	Write-Update &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''9) Based on the simulation in this article, the Subblock Protocol worked better than the MESI protocol in specific executions. Under what circumstances would the MESI protocol work better than the Subblock Protocol?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	For larger block sizes, since MESI uses large block sizes&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	For effectively supplying the block to other processors whose blocks were invalidated in the past&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	MESI will update the need to broadcast when the data is actively shared&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	For small block size because Subblock Protocol uses large transfer block&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''10) Why does the article state that the Read-snarfing protocol achieves lower bus utilization and higher application speedups?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Because instructions on any given processor execute in partial program order, but may not propagate in that order&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Because it’s effective in reducing the amount of data transfer and the number of bus transactions &lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Because Write Update reduces all types of cache misses relative to Write Invalidate, therefore Read-snarfing gains performance &lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Because multiple caches can be in its Shared State simultaneously.&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	None of the above&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
*[http://www.computer.org/portal/web/csdl/doi/10.1109/HPCA.1995.386536 Two techniques for improving performance on bus-based multiprocessors]&lt;br /&gt;
&lt;br /&gt;
*[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Design of an Adaptive Cache Coherence Protocol for Large Scale Multiprocessors]&lt;br /&gt;
&lt;br /&gt;
*[http://www.lfbs.rwth-aachen.de/content/smi Shared Memory Interface]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references&amp;gt;&amp;lt;/references&amp;gt;&lt;/div&gt;</summary>
		<author><name>Sbasu3</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8b_va&amp;diff=60725</id>
		<title>CSC/ECE 506 Spring 2012/8b va</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8b_va&amp;diff=60725"/>
		<updated>2012-03-29T04:23:38Z</updated>

		<summary type="html">&lt;p&gt;Sbasu3: /* Firefly Protocol */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Update and adaptive coherence protocols on real architectures, and power considerations=&lt;br /&gt;
 &lt;br /&gt;
===Introduction===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In a [http://en.wikipedia.org/wiki/Shared_memory shared-memory], shared-bus multiprocessor system, the cache coherence protocol plays an important role in propagating changes from one cache to the others. In most cases, however, update and invalidate coherence protocols are the main source of bus contention, which increases the number of busy bus cycles and therefore program execution time, because a processor may stall while its cache waits for the bus. To reduce bus contention, this article presents adaptive hybrid protocol strategies for high-performance multiprocessors.&amp;lt;ref&amp;gt;[http://en.wikipedia.org/wiki/Shared_memory shared memory]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Coherence protocols===&lt;br /&gt;
In general terms, a coherency protocol maintains consistency between all the caches in a [http://en.wikipedia.org/wiki/Shared_memory shared memory] system. The protocol maintains memory coherence according to a specific consistency model. Older multiprocessors support the [http://en.wikipedia.org/wiki/Sequential_consistency sequential consistency model], while modern shared memory systems typically support the release consistency or weak consistency models. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:Cache_Coherency_Generic.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 1.  Multiple Caches of Shared Resource''' &amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Transitions between states in any specific implementation of these protocols may vary. For example, an implementation may choose different update and invalidation transitions such as update-on-read, update-on-write, invalidate-on-read, or invalidate-on-write. The choice of transition may affect the amount of inter-cache traffic, which in turn may affect the amount of cache bandwidth available for actual work. This should be taken into consideration in the design of distributed software that could cause strong contention between the caches of multiple processors.&lt;br /&gt;
Various models and protocols have been devised for maintaining cache coherence, such as MSI, MESI (also known as the Illinois protocol), MOSI, MOESI, MERSI, MESIF, write-once, Synapse, Berkeley, Firefly, and Dragon.&lt;br /&gt;
&lt;br /&gt;
For the purpose of the main study, this article will focus on the two commonly used coherence protocols: &amp;lt;b&amp;gt;Write-Invalidate (WI) and Write-Update (WU)&amp;lt;/b&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Write-Invalidate Protocol'''&lt;br /&gt;
In a [http://cs.gmu.edu/cne/modules/dsm/purple/wr_inval.html Write-Invalidate Protocol], a processor invalidates the copies of the cache block held by all other processors and then updates its own copy without further bus operations.&lt;br /&gt;
&lt;br /&gt;
'''Write-Update Protocol'''&lt;br /&gt;
Alternatively, in a [http://cs.gmu.edu/cne/modules/dsm/purple/wr_update.html Write-Update Protocol], a processor broadcasts updates to shared data to the other caches so that they stay coherent.&lt;br /&gt;
&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;b&amp;gt;Write-Update protocol:&amp;lt;/b&amp;gt; Two common update protocols are the Dragon and Firefly protocols.&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&amp;lt;dd&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Dragon Protocol===&lt;br /&gt;
[[File:Screen_Shot_2012-03-28_at_10.29.04_PM.png|thumb|upright=10.5|right|alt=Table defining the additional notations used in the Dragon protocol state transition diagram.|Table 1: Definitions of additional notations in the state transition diagram of the Dragon protocol]]&lt;br /&gt;
[[File:Screen Shot 2012-03-28 at 10.18.57 PM.png|thumb|upright=2.5|right|alt=State transition diagram of the Dragon protocol.|Figure 2: State transition diagram of the Dragon protocol]]&lt;br /&gt;
&amp;lt;dd&amp;gt;This protocol was first proposed by researchers at Xerox PARC for their Dragon multiprocessor system.&lt;br /&gt;
The Dragon protocol ensures that data is always valid if the tag matches. Hence, there is no explicit invalid state even though it reserves a miss mode bit for compulsory misses. The Dragon protocol consists of four states:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Exclusive (E): '''means that only one cache (this cache) has a copy of the block, and it has not been modified (the main memory is up-to-date)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Shared-clean (Sc): '''means that potentially two or more caches (including this one) have this block, and main memory may or may not be up-to-date&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Shared-modified (Sm):''' means that potentially two or more caches have this block, main memory is not up-to-date, and it is this cache's responsibility to update the main memory when this block is replaced from the cache. A block may be in this state in only one cache at a time; however, it is quite possible that one cache has the block in this state while others have it in the shared-clean state.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Modified (M): '''signifies exclusive ownership: the block is modified and present in this cache alone, main memory is stale, and it is this cache's responsibility to supply the data and to update main memory on replacement.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Firefly Protocol===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Screen_Shot_2012-03-29_at_12.14.24_AM.png|thumb|upright=1|right|alt=Table defining the additional notations used in the Firefly protocol state transition diagram.|Table 2: Definitions of additional notations in the state transition diagram of Figure 3]]&lt;br /&gt;
&amp;lt;dd&amp;gt;The Firefly protocol was developed by DEC for its Firefly multiprocessor workstation.&lt;br /&gt;
&amp;lt;dd&amp;gt;The Firefly protocol also ensures that data is always valid if the tag matches. Hence, there is no explicit invalid state even though it reserves a miss mode bit for compulsory misses. The Firefly protocol consists of the following three states:&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Valid (V): '''This block has a coherent copy of the memory. There is only one copy of the data in caches.&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Dirty (D):''' The block is the only copy of the memory and it is incoherent. This is the only state that generates a write-back when the block is replaced in the cache.&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Shared (S):''' This block has a coherent copy of the memory. The data may be possibly shared, but its content is not modified.&lt;br /&gt;
&lt;br /&gt;
[http://expertiza.csc.ncsu.edu/wiki/index.php/File:Screen_Shot_2012-03-28_at_10.19.25_PM.png Figure 3] depicts the state transition diagram of the Firefly protocol, and [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Screen_Shot_2012-03-29_at_12.14.24_AM.png Table 2] details the additional notations in the state transition diagram.&lt;br /&gt;
[[File:Screen_Shot_2012-03-28_at_10.19.25_PM.png|thumb|upright=2.5|center|alt=State transition diagram of the Firefly protocol.|Figure 3: State diagram of the Firefly protocol.]]&lt;br /&gt;
&lt;br /&gt;
=Adaptive coherence protocols=&lt;br /&gt;
&lt;br /&gt;
=====   &amp;lt;b&amp;gt;Disadvantages of Write-Invalidate Protocol&amp;lt;/b&amp;gt;=====&lt;br /&gt;
&amp;lt;dd&amp;gt; Any write by a processor invalidates the copies of the shared cache block held by other processors, forcing those caches to issue bus requests to reload the new data, which increases bus bandwidth demand. The effect is worse when one processor frequently updates a cache block while another processor stalls reading the same block. For a sequence of '''n''' writes by one processor, each followed by a read from another processor, the '''WI''' protocol performs '''n''' invalidations and '''n''' cache block reads.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Write-Invalidate]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=====   &amp;lt;b&amp;gt;Disadvantages of Write-Update Protocol&amp;lt;/b&amp;gt;=====&lt;br /&gt;
&amp;lt;dd&amp;gt; The update protocol is advantageous in the case above because it only updates the cache block '''n''' times. However, an update protocol can keep refreshing data in other processors' caches long after those processors have stopped using it, so fewer cache blocks are available for more useful data. This tends to increase conflict and capacity cache misses.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Write-Update]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
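As a rough illustration of the trade-off described above, the following sketch compares the bus cost of '''n''' write-then-read pairs under each protocol. The function names and the cycle costs are assumptions chosen for illustration only; they are not taken from the cited study.&lt;br /&gt;

```python
def write_invalidate_cost(n, c_inv, c_read):
    """Bus cycles for n write-then-remote-read pairs under write-invalidate.

    Each write costs one invalidation, and each following remote read
    must reload the whole cache block.
    """
    return n * (c_inv + c_read)


def write_update_cost(n, c_update):
    """Bus cycles for the same n pairs under write-update.

    Each write costs one update broadcast; the remote read then hits.
    """
    return n * c_update


# Hypothetical bus costs in cycles (assumptions, not measured values).
C_INV, C_READ, C_UPDATE = 2, 8, 2

wi = write_invalidate_cost(10, C_INV, C_READ)
wu = write_update_cost(10, C_UPDATE)
print(wi, wu)  # prints: 100 20
```

Under these assumed costs the update protocol is cheaper for this access pattern. The sketch does not model the opposite case, where updates keep refreshing blocks that the other caches no longer use.&lt;br /&gt;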
&lt;br /&gt;
A combined strategy such as an '''adaptive hybrid protocol''' can reduce the pathological behaviors of the update and invalidate protocols. Such a protocol should be applicable to a wide range of network characteristics, and it should automatically adjust its behavior to meet its target goals as traffic patterns, node mobility, and other network characteristics change.&lt;br /&gt;
&lt;br /&gt;
====Consideration of cache architecture issue====&lt;br /&gt;
Application execution time also depends directly on the cache block size. &lt;br /&gt;
Some applications execute more quickly with a large cache block size because they exhibit good spatial locality, whereas others run better with small cache blocks because small blocks avoid migratory data or false sharing between processors. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Hardware architecture===&lt;br /&gt;
Most often, a processor architecture uses the same block size for memory-to-cache transfer, cache-to-memory transfer, and coherence. &lt;br /&gt;
The approach described here uses different block sizes for transfer and coherence, based on application requirements.&lt;br /&gt;
A normal cache has the following parameters: capacity ('''C'''), block size ('''L'''), and associativity ('''K'''). &lt;br /&gt;
A sector cache additionally divides each cache block into subblocks of size '''b'''. Compared to a normal cache ('''C, L, K'''), a sector cache ('''C, L, K, b''') requires one extra state and some extra bus lines to transmit bitmasks describing the status of the subblocks in a particular block, but it maintains the same number of tag and state bits for '''L'''. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
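To make the four parameters concrete, the sketch below derives the basic geometry of a sector cache from ('''C, L, K, b'''). The function name and the returned field names are hypothetical. The example values are the ones this article reports for its simulations (C = 128K, K = 2, b = 8 bytes) together with the 64-byte block size mentioned in the conclusion; one status bit per subblock in the bitmask is an assumption.&lt;br /&gt;

```python
def sector_cache_geometry(capacity, block_size, associativity, subblock_size):
    """Derive block, set, and subblock counts for a sector cache (C, L, K, b).

    All sizes are in bytes. Illustrative helper, not part of any library.
    """
    if block_size % subblock_size != 0:
        raise ValueError("block size L must be a multiple of subblock size b")
    if capacity % (block_size * associativity) != 0:
        raise ValueError("capacity C must be a multiple of L * K")
    blocks = capacity // block_size
    return {
        "blocks": blocks,                                   # tag entries
        "sets": blocks // associativity,                    # sets of K blocks
        "subblocks_per_block": block_size // subblock_size, # assumed bitmask width
    }


geometry = sector_cache_geometry(128 * 1024, 64, 2, 8)
print(geometry)
```

For these values the sketch yields 2048 blocks, 1024 sets, and 8 subblocks per block, so the number of tags matches a normal cache with the same '''C''', '''L''', and '''K'''.&lt;br /&gt;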
   &lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Subblock protocol===&lt;br /&gt;
This snoopy-based protocol combines features of the [http://en.wikipedia.org/wiki/MESI_protocol Illinois MESI protocol] and write policies with subblock validation, using subblocks to gain the advantages of both small and large cache block sizes.  &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dt&amp;gt;&lt;br /&gt;
=====Block states=====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Invalid:&amp;lt;/b&amp;gt; All subblocks are invalid&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Valid Exclusive:&amp;lt;/b&amp;gt; All valid subblocks in this block are not present in any other caches. All subblocks that are Clean Shared may be written without a bus transaction. Any subblocks in the block may be invalid. There may also be Dirty subblocks, which must be written back upon replacement.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Clean Shared:&amp;lt;/b&amp;gt; The block contains subblocks that are either Clean Shared or Invalid. The block can be replaced without a bus transaction.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty Shared:&amp;lt;/b&amp;gt; Subblocks in this block may be in any state. There may be Dirty subblocks, which must be written back on replacement.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dt&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Finite state diagram of block/ line states is as follows:'''&lt;br /&gt;
  &lt;br /&gt;
&amp;lt;center&amp;gt; [[File:line_protocol.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 4.  Finite state diagram of block''' &amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=====Subblock states=====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Invalid:&amp;lt;/b&amp;gt;  The subblock is invalid&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Clean Shared:&amp;lt;/b&amp;gt; A read access to the subblock will succeed. Unless the enclosing block is in the Valid Exclusive state, a write to the subblock will force an invalidation transaction on the bus.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty Shared:&amp;lt;/b&amp;gt; The subblock is treated like a Clean Shared subblock, except that it must be written back on replacement. At most one cache will have a given subblock in either the Dirty Shared or Dirty state.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty:&amp;lt;/b&amp;gt; The subblock is exclusive to this cache. It must be written back on replacement. Read and write accesses to this subblock hit with no bus transaction.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Finite state diagram of subblock is as follows:''' &lt;br /&gt;
&amp;lt;center&amp;gt;[[File:subblock_protocol.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 5.  Finite state diagram of Sub-block''' &amp;lt;/center&amp;gt; &lt;br /&gt;
   &lt;br /&gt;
The basic idea of this protocol is to perform more data transfers between caches and fewer off-chip memory accesses.&lt;br /&gt;
In contrast with the Illinois protocol, on a read miss a cache holding the block sends the requested subblock together with all other Clean Shared and Dirty Shared subblocks in that block. &lt;br /&gt;
If the requested subblock is only in main memory, the snooping caches report which subblocks they hold, and memory provides the requested subblock along with the subblocks that are not currently cached.&lt;br /&gt;
On writes to clean subblocks and on write misses, the snooper invalidates only the subblock being written, which avoids the penalties associated with false sharing.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;b&amp;gt;In contrast to the Illinois MESI protocol, which requires extra power cycles to maintain its extra states and additional logic, the subblock protocol reduces the number of cache blocks compared to cache update protocols, and thus reduces power consumption. &amp;lt;/b&amp;gt; &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Read-snarfing protocol===&lt;br /&gt;
This is an enhancement to snoopy-cache coherence protocols that takes advantage of the inherent broadcast nature of the bus.&lt;br /&gt;
&lt;br /&gt;
In contrast with the MESI protocol, when one processor reloads a block into its cache after a cache miss, the read-snarfing protocol also supplies the block to the other processors whose copies were invalidated in the past. Only one read is required to restore the block to all caches in which it was invalidated.&lt;br /&gt;
&lt;br /&gt;
The protocol modifies the normal update protocol by updating only those subblocks that are modified, and updates need to be broadcast only when the data is actively shared. &lt;br /&gt;
The Read-snarfing protocol maintains a write counter and an invalidation threshold (Tb) for each cache block b to overcome WI’s write-after-read problem. The protocol predicts the number of writes that will occur on a cache block before the next read request to the same block. An invalidation request is broadcast when the write counter reaches Tb, and the protocol dynamically adjusts Tb based on the behavior of the program. &lt;br /&gt;
&lt;br /&gt;
A simple version of the Read-snarfing Random Walk algorithm is as follows.&lt;br /&gt;
Initially, Tb of each cache block b is set to 0. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
// Write run: the number of writes to block b by one processor before&lt;br /&gt;
// the block is accessed by another processor.&lt;br /&gt;
// R  = invalidation ratio = (Ci + Cr) / Cu&lt;br /&gt;
// Ci = cost in bus cycles of an invalidation transaction&lt;br /&gt;
// Cu = cost in bus cycles of an update transaction&lt;br /&gt;
// Cr = cost in bus cycles of reading a cache block&lt;br /&gt;
&lt;br /&gt;
if (most_recent_write_run &amp;gt; R) {&lt;br /&gt;
    // Long write run: updates were wasted, so invalidate sooner next time.&lt;br /&gt;
    if (Tb &amp;gt; 0) {&lt;br /&gt;
        Tb--;&lt;br /&gt;
    }&lt;br /&gt;
} else {&lt;br /&gt;
    // Short write run: the block is actively shared, so keep updating longer.&lt;br /&gt;
    if (R &amp;gt; Tb) {&lt;br /&gt;
        Tb++;&lt;br /&gt;
    }&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When the threshold falls to 0, the block is invalidated immediately with no wasted updates.  &lt;br /&gt;
When the block is actively shared, Tb is adjusted upward so that the block is not invalidated. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&amp;lt;dd&amp;gt;&lt;br /&gt;
=====Example 1 =====&lt;br /&gt;
&amp;lt;dd&amp;gt;Suppose the invalidation ratio (R) = 5.&lt;br /&gt;
&amp;lt;dd&amp;gt;The current threshold of the block (Tb) = 3.&lt;br /&gt;
&amp;lt;dd&amp;gt;If the processor writes 4 times before the block is accessed by another processor, the write run (4) does not exceed R (5), so Tb is incremented to 4.&lt;br /&gt;
&amp;lt;dd&amp;gt;This means Tb is at the best possible value and only updates are issued.&lt;br /&gt;
&lt;br /&gt;
=====Example 2 =====&lt;br /&gt;
&amp;lt;dd&amp;gt;Consider R = 5 and Tb = 3 for a particular block.&lt;br /&gt;
&amp;lt;dd&amp;gt;If the processor writes 10 times before the block is accessed by another processor, the write run (10) exceeds R (5).&lt;br /&gt;
&amp;lt;dd&amp;gt;Tb is therefore decreased to 2.&lt;br /&gt;
&amp;lt;dd&amp;gt;The protocol can then incur a cost of 2 updates, 1 invalidate and 1 reread.&lt;br /&gt;
&amp;lt;dd&amp;gt;After two more such write runs, Tb will be 0 and invalidation will occur immediately. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
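The threshold adjustment and the two examples above can be reproduced with a short sketch. The function names are hypothetical, and the sketch assumes that the threshold may fall to 0 (as the text describes) and is adjusted once per completed write run; it is an illustration of the rule as summarized in this article, not the original authors' implementation.&lt;br /&gt;

```python
def invalidation_ratio(c_inv, c_read, c_update):
    """R = (Ci + Cr) / Cu, with all costs given in bus cycles."""
    return (c_inv + c_read) / c_update


def adjust_threshold(tb, write_run, r):
    """Return the new threshold Tb after one completed write run."""
    if write_run > r:
        # Long write run: lower Tb, but never below 0 (assumed floor).
        return max(tb - 1, 0)
    if r > tb:
        # Short write run: raise Tb, but never above R.
        return tb + 1
    return tb


# Example 1: R = 5, Tb = 3, write run of 4
example_1 = adjust_threshold(3, 4, 5)
# Example 2: R = 5, Tb = 3, write run of 10
example_2 = adjust_threshold(3, 10, 5)
print(example_1, example_2)  # prints: 4 2
```

In this sketch, repeating the long write run of Example 2 twice more lowers the threshold from 2 to 1 and then to 0.&lt;br /&gt;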
&lt;br /&gt;
==Simulation result==&lt;br /&gt;
For all simulated results, the subblock size ('''b''') was '''8 bytes''' and the caches were two-way set-associative ('''K = 2'''). To minimize simulation time, data was simulated only with relatively large caches ('''C = 128K''').&lt;br /&gt;
&lt;br /&gt;
First come the results for the two applications ('''Gauss and Cholesky''') that do well with large block sizes. Next come the results for the three applications ('''MP3D, Topopt, and Pverify''') that perform better with smaller block sizes. Finally come the results for Barnes, for which the choice of block size is not very important. The simulations compared the protocols on both normal and sector caches with '''1, 4, 16 and 32 processors''', where the normal cache uses the Illinois protocol and the sector cache uses the read-snarfing protocol. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Simulation]&amp;lt;/ref&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:image500.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 6.  (a-c) Gauss (128K cache) uses large cache block ''' &amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:Gauss32.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:cholesky_128k.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 7.  (a-c) Cholesky (128 K cache) application uses large cache block'''&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:gauss_32pro.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:Cholesky.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:mp3d.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 8.  (a-c) MP3D (128 K caches) application which uses smaller block sizes.'''&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:results_summary_1.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:results_summary_2.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:results_summary_3.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 9.  Final Simulation Results'''&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Conclusion==&lt;br /&gt;
This article described two techniques for improving application performance on bus-based multiprocessors. The first technique is the subblock cache coherence protocol, which combines large transfer blocks with small coherence subblocks to exploit spatial locality while avoiding false sharing.   &lt;br /&gt;
The second technique is read-snarfing, which reduces the number of read misses caused by earlier cache coherence protocol actions.  &lt;br /&gt;
The simulation results above showed that the subblock protocol works better than the MESI protocol for the 64-byte cache block size, but for small block sizes the Illinois protocol works better because the subblock protocol uses a large transfer block.&lt;br /&gt;
In addition, the Read-snarfing protocol is effective in reducing the amount of data transferred and the number of bus transactions, which results in lower bus utilization and higher application speedups. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Quiz==&lt;br /&gt;
The following quiz is intended for the reader to benefit from this article and to help achieve a complete understanding of the research. There are a total of 10 multiple-choice questions.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;&lt;br /&gt;
'''1) What are the two commonly used coherence protocols? '''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Token-based &amp;amp; Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Update &amp;amp; Token-based&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Write-Invalidate &amp;amp; Write-Update &lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Both a and c&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''2) A coherency protocol is a protocol which maintains  __________      ____________ according to a specific ___________     ___________'''&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	System order / memory hierarchy&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Memory coherence / consistency model&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Bus hierarchy / system order&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''3) Increased bus bandwidth demand is most commonly seen in which protocol?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Invalidate&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Token-based&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Write-Update &lt;br /&gt;
&amp;lt;dd&amp;gt;e)	Both a and b&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''4) An increase in conflict and capacity cache misses is mostly seen in which protocol?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Token-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Update &lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Write-Invalidate&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	Both a and d&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''5) Due to the good use of  _________  __________ some applications execute more quickly with large cache block sizes, whereas others run better when cache block sizes are small by avoiding migratory data or ________  _________ between processors'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Spatial locality / false-sharing&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Equidistant locality /Sequential consistency &lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Temporal locality/ true sharing &lt;br /&gt;
&amp;lt;dd&amp;gt;d)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''6) The Subblock Protocol consists of four states: Invalid, Clean Shared, Dirty and __________. Name the missing state?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Exclusive&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Valid&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Dirty Shared&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Shared&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''7) In the Subblock Protocol, all subblocks that are clean-shared may be written without a bus transaction. In what state is this achieved?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Valid Exclusive&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Invalid&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Clean Shared&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Dirty Shared&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''8) Which protocol maintains a counter and an invalidation threshold for each cache block to overcome Write-Invalidate’s write-after-read problem?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Invalidate&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Subblock protocol &lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Read-snarfing protocol&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	Write-Update &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''9) According to the simulation in this article, the Subblock Protocol works better than the MESI protocol in specific cases. In what circumstances would the MESI Protocol work better than the Subblock Protocol?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	For larger block sizes, since MESI uses large block sizes&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	When effectively supplying the block to other processors whose blocks were invalidated in the past&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	When MESI broadcasts updates only while the data is actively shared&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	For small block sizes, because the Subblock Protocol uses a large transfer block&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''10) Why was it stated that the Read-snarfing protocol achieves lower bus utilization and higher application speedups?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Because instructions on any given processor execute in partial program order, but may not propagate in that order&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Because it is effective in reducing the amount of data transferred and the number of bus transactions&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Because Write Update reduces all types of cache misses relative to Write Invalidate, therefore Read-snarfing gains performance &lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Because multiple caches can be in its Shared State simultaneously.&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	None of the above&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
*[http://www.computer.org/portal/web/csdl/doi/10.1109/HPCA.1995.386536 Two techniques for improving performance on bus-based multiprocessors]&lt;br /&gt;
&lt;br /&gt;
*[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Design of an Adaptive Cache Coherence Protocol for Large Scale Multiprocessors]&lt;br /&gt;
&lt;br /&gt;
*[http://www.lfbs.rwth-aachen.de/content/smi Shared Memory Interface]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references&amp;gt;&amp;lt;/references&amp;gt;&lt;/div&gt;</summary>
		<author><name>Sbasu3</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8b_va&amp;diff=60724</id>
		<title>CSC/ECE 506 Spring 2012/8b va</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8b_va&amp;diff=60724"/>
		<updated>2012-03-29T04:21:36Z</updated>

		<summary type="html">&lt;p&gt;Sbasu3: /* Firefly Protocol */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Update and adaptive coherence protocols on real architectures, and power considerations=&lt;br /&gt;
 &lt;br /&gt;
===Introduction===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In a [http://en.wikipedia.org/wiki/Shared_memory shared memory], shared-bus multiprocessor system, the cache coherence protocol plays an important role in propagating changes from one cache to the others. In most cases, however, update and invalidate coherence protocols are the main source of bus contention, which increases the number of bus busy cycles and therefore program execution time, because a processor may stall while its cache waits for the bus. To reduce bus contention, this article presents some adaptive hybrid protocol strategies for high-performance multiprocessors.&amp;lt;ref&amp;gt;[http://en.wikipedia.org/wiki/Shared_memory shared memory]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Coherence protocols===&lt;br /&gt;
In general terms, a coherency protocol is a protocol which maintains the consistency between all the caches in a system of [http://en.wikipedia.org/wiki/Shared_memory shared memory]. The protocol maintains memory coherence according to a specific consistency model. Older multiprocessors support the [http://en.wikipedia.org/wiki/Sequential_consistency sequential consistency model], while modern shared memory systems typically support the release consistency or weak consistency models. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:Cache_Coherency_Generic.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 1.  Multiple Caches of Shared Resource''' &amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Transitions between states in any specific implementation of these protocols may vary. For example, an implementation may choose different update and invalidation transitions such as update-on-read, update-on-write, invalidate-on-read, or invalidate-on-write. The choice of transition may affect the amount of inter-cache traffic, which in turn may affect the amount of cache bandwidth available for actual work. This should be taken into consideration in the design of distributed software that could cause strong contention between the caches of multiple processors.&lt;br /&gt;
Various models and protocols have been devised for maintaining cache coherence, such as the MSI, MESI (also known as the Illinois protocol), MOSI, MOESI, MERSI, MESIF, write-once, Synapse, Berkeley, Firefly and Dragon protocols.&lt;br /&gt;
&lt;br /&gt;
For the purpose of the main study, this article will focus on the two commonly used coherence protocols: &amp;lt;b&amp;gt;Write-Invalidate (WI) and Write-Update (WU)&amp;lt;/b&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Write-Invalidate Protocol'''&lt;br /&gt;
In a [http://cs.gmu.edu/cne/modules/dsm/purple/wr_inval.html Write-Invalidate Protocol], a processor invalidates the copies of the block in all other processors' caches and then updates its own copy without further bus operations.&lt;br /&gt;
&lt;br /&gt;
'''Write-Update Protocol'''&lt;br /&gt;
In contrast, in a [http://cs.gmu.edu/cne/modules/dsm/purple/wr_update.html Write-Update Protocol], a processor broadcasts updates to shared data to the other caches so that they stay coherent.&lt;br /&gt;
&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;b&amp;gt;Write-Update protocol:&amp;lt;/b&amp;gt; Two common update protocols are the Dragon and Firefly protocols.&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&amp;lt;dd&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Dragon Protocol===&lt;br /&gt;
[[File:Screen_Shot_2012-03-28_at_10.29.04_PM.png|thumb|upright=10.5|right|alt=Table of additional notations used in the state transition diagram of the Dragon protocol.|Table 1: Definitions of additional notations in the state transition diagram of the Dragon protocol]]&lt;br /&gt;
[[File:Screen Shot 2012-03-28 at 10.18.57 PM.png|thumb|upright=2.5|right|alt=State transition diagram of the Dragon protocol.|Figure 2: State transition diagram of the Dragon protocol]]&lt;br /&gt;
&amp;lt;dd&amp;gt;This protocol was first proposed by researchers at Xerox PARC for their Dragon multiprocessor system.&lt;br /&gt;
The Dragon protocol ensures that data is always valid if the tag matches. Hence, there is no explicit invalid state even though it reserves a miss mode bit for compulsory misses. The Dragon protocol consists of four states:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Exclusive (E): '''means that only one cache (this cache) has a copy of the block, and it has not been modified (the main memory is up-to-date)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Shared-clean (Sc): '''means that potentially two or more caches (including this one) have this block, and main memory may or may not be up-to-date&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Shared-modified (Sm):''' means that potentially two or more caches have this block, main memory is not up-to-date, and it is this cache's responsibility to update the main memory when this block is replaced from the cache. A block may be in this state in only one cache at a time; however, it is quite possible that one cache has the block in this state while others have it in the shared-clean state.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Modified (M): '''signifies exclusive ownership; the block is modified and present in this cache alone, main memory is stale, and it is this cache's responsibility to supply the data and to update main memory on replacement.&lt;br /&gt;
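The four state definitions above can be restated as a small lookup of properties. The following Python sketch is only an illustration under that reading; the names DragonState, writes_back_on_replacement, may_be_shared and memory_is_up_to_date are hypothetical and are not part of the Dragon design itself.&lt;br /&gt;

```python
from enum import Enum


class DragonState(Enum):
    """The four Dragon states as described in the text (illustrative only)."""
    E = "Exclusive"
    SC = "Shared-clean"
    SM = "Shared-modified"
    M = "Modified"


def writes_back_on_replacement(state):
    # Only the owner states (Sm and M) are responsible for updating
    # main memory when the block is replaced.
    return state in (DragonState.SM, DragonState.M)


def may_be_shared(state):
    # Sc and Sm mean that potentially two or more caches hold the block.
    return state in (DragonState.SC, DragonState.SM)


def memory_is_up_to_date(state):
    # E: memory is up to date. Sm and M: memory is stale.
    # Sc: memory may or may not be up to date, reported here as None.
    if state is DragonState.E:
        return True
    if state is DragonState.SC:
        return None
    return False
```

Returning None for the shared-clean case is a design choice of this sketch: it keeps "unknown" distinct from "stale" instead of forcing a yes or no answer.&lt;br /&gt;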
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Firefly Protocol===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Screen_Shot_2012-03-29_at_12.14.24_AM.png|thumb|upright=1|right|alt=Table of additional notations used in the state transition diagram of the Firefly protocol.|Table 2: Definitions of additional notations in the state transition diagram of Figure 3]]&lt;br /&gt;
&amp;lt;dd&amp;gt;The Firefly protocol was developed by DEC for its Firefly multiprocessor workstation.&lt;br /&gt;
&amp;lt;dd&amp;gt;The Firefly protocol also ensures that data is always valid if the tag matches. Hence, there is no explicit invalid state even though it reserves a miss mode bit for compulsory misses. The Firefly protocol consists of the following three states:&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Valid (V): '''This block has a coherent copy of the memory. There is only one copy of the data in caches.&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Dirty (D):''' The block is the only cached copy and it is inconsistent with main memory. This is the only state that generates a write-back when the block is replaced in the cache.&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Shared (S):''' This block has a coherent copy of the memory. The data may be shared, but its content is not modified. [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Screen_Shot_2012-03-28_at_10.19.25_PM.png Figure 3] depicts the state transition diagram of the Firefly protocol, and Table 2 details the additional notations in the state transition diagram.&lt;br /&gt;
[[File:Screen_Shot_2012-03-28_at_10.19.25_PM.png|thumb|upright=2.5|center|alt=State transition diagram of the Firefly protocol.|Figure 3: State diagram of the Firefly protocol.]]&lt;br /&gt;
&lt;br /&gt;
=Adaptive coherence protocols=&lt;br /&gt;
&lt;br /&gt;
=====   &amp;lt;b&amp;gt;Disadvantages of Write-Invalidate Protocol&amp;lt;/b&amp;gt;=====&lt;br /&gt;
&amp;lt;dd&amp;gt; Any write by a processor invalidates the copies of the block held in other processors' caches, forcing those caches to issue bus requests to reload the new data, which increases bus bandwidth consumption. The effect is worse when one processor frequently writes a block while another processor repeatedly stalls to read it. For a sequence of '''n''' writes by one processor, each followed by a read from another processor, the '''WI''' protocol performs '''n''' invalidations and '''n''' cache block reads.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Write-Invalidate]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=====   &amp;lt;b&amp;gt;Disadvantages of Write-Update Protocol&amp;lt;/b&amp;gt;=====&lt;br /&gt;
&amp;lt;dd&amp;gt; An update protocol is advantageous in the case above because it needs only '''n''' update operations. However, an update protocol can keep refreshing data in other processors' caches long after those processors have stopped using it, so fewer cache blocks are available for more useful data. This tends to increase conflict and capacity misses.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Write-Update]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
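The trade-off between the two protocols can be made concrete with a rough bus-cost estimate. The sketch below assumes the alternating pattern described above (each of the n writes is followed by a read from another processor) and uses the per-transaction costs Ci, Cr and Cu that this article defines for the invalidation ratio. The function names and the cycle counts in the usage part are hypothetical values chosen for illustration, not measurements.&lt;br /&gt;

```python
def wi_bus_cost(n, c_inv, c_read):
    """Bus cycles for n write-then-remote-read rounds under Write-Invalidate.

    Each round costs one invalidation plus one cache block read.
    """
    return n * (c_inv + c_read)


def wu_bus_cost(n, c_update):
    """Bus cycles for the same n rounds under Write-Update: n updates."""
    return n * c_update


if __name__ == "__main__":
    # Hypothetical cycle counts, chosen only for illustration.
    n, c_inv, c_read, c_update = 10, 2, 8, 3
    print("WI:", wi_bus_cost(n, c_inv, c_read))
    print("WU:", wu_bus_cost(n, c_update))
```

The estimate deliberately ignores the cost that Write-Update pays in the opposite pattern, where updates keep flowing to caches that no longer read the block.&lt;br /&gt;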
&lt;br /&gt;
A combined strategy such as an '''adaptive hybrid protocol''' can reduce the pathological behaviors of the update and invalidate protocols. Such a protocol should be applicable to a wide range of network characteristics, and it should automatically adjust its behavior to achieve its target goals in the face of changes in traffic patterns, node mobility and other network characteristics.&lt;br /&gt;
&lt;br /&gt;
====Consideration of cache architecture issue====&lt;br /&gt;
Application performance also depends directly on the cache block size. &lt;br /&gt;
Some applications execute more quickly with a large cache block size because they exhibit good spatial locality, whereas others run better when cache block sizes are small by avoiding migratory data or false sharing between processors. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Hardware architecture===&lt;br /&gt;
Most processor architectures use the same block size for transfers between memory and cache and for coherence. &lt;br /&gt;
The approach described here uses different block sizes for transfer and coherence, based on application requirements.&lt;br /&gt;
A normal cache has the following parameters: capacity ('''C'''), block size ('''L''') and associativity ('''K'''). &lt;br /&gt;
A sector cache additionally divides each cache block into subblocks of size '''b'''. Compared with a normal cache ('''C, L, K'''), a sector cache ('''C, L, K, b''') requires one extra state and some extra bus lines to transmit bitmasks describing the status of the subblocks in a particular block, but it maintains the same number of tag and state bits for each block of size '''L'''. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
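To see how the four sector cache parameters relate, the following Python sketch derives a few basic quantities from (C, L, K, b). It is a minimal illustration: the function name and dictionary keys are hypothetical, and the assumption that the bitmask carries one bit per subblock is made only for this sketch and is not stated in the source.&lt;br /&gt;

```python
def sector_cache_geometry(capacity, block_size, associativity, subblock_size):
    """Derive basic quantities for a sector cache (C, L, K, b), sizes in bytes."""
    if block_size % subblock_size != 0:
        raise ValueError("block size L must be a multiple of subblock size b")
    num_blocks = capacity // block_size
    subblocks = block_size // subblock_size
    return {
        "blocks": num_blocks,
        "sets": num_blocks // associativity,
        "subblocks_per_block": subblocks,
        # Assumption of this sketch: one bitmask bit per subblock.
        "bitmask_bits": subblocks,
    }


if __name__ == "__main__":
    # C = 128K, L = 64 bytes, K = 2, b = 8 bytes
    print(sector_cache_geometry(128 * 1024, 64, 2, 8))
```

With a 128K two-way cache, 64-byte blocks and 8-byte subblocks, this gives 2048 blocks in 1024 sets, with 8 subblocks per block.&lt;br /&gt;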
   &lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Subblock protocol===&lt;br /&gt;
This snoopy protocol combines features of the [http://en.wikipedia.org/wiki/MESI_protocol Illinois MESI protocol] and write policies with subblock validation, using subblocks to gain the advantages of both small and large cache block sizes.  &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dt&amp;gt;&lt;br /&gt;
=====Block states=====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Invalid:&amp;lt;/b&amp;gt; All subblocks are invalid&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Valid Exclusive:&amp;lt;/b&amp;gt; None of the valid subblocks in this block is present in any other cache. All subblocks that are Clean Shared may be written without a bus transaction. Any subblocks in the block may be invalid. There may also be Dirty subblocks, which must be written back upon replacement.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Clean Shared:&amp;lt;/b&amp;gt; The block contains subblocks that are either Clean Shared or Invalid. The block can be replaced without a bus transaction.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty Shared:&amp;lt;/b&amp;gt; Subblocks in this block may be in any state. There may be Dirty subblocks, which must be written back on replacement.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dt&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Finite state diagram of block/ line states is as follows:'''&lt;br /&gt;
  &lt;br /&gt;
&amp;lt;center&amp;gt; [[File:line_protocol.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 4.  Finite state diagram of block''' &amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=====Subblock states=====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Invalid:&amp;lt;/b&amp;gt;  The subblock is invalid&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Clean Shared:&amp;lt;/b&amp;gt; A read access to the subblock will succeed. Unless the enclosing block is in the Valid Exclusive state, a write to the subblock will force an invalidation transaction on the bus.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty Shared:&amp;lt;/b&amp;gt; The subblock is treated like a Clean Shared subblock, except that it must be written back on replacement. At most one cache will have a given subblock in either the Dirty Shared or Dirty state.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty:&amp;lt;/b&amp;gt; The subblock is exclusive to this cache. It must be written back on replacement. Read and write accesses to this subblock hit with no bus transaction.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Finite state diagram of subblock is as follows:''' &lt;br /&gt;
&amp;lt;center&amp;gt;[[File:subblock_protocol.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 5.  Finite state diagram of Sub-block''' &amp;lt;/center&amp;gt; &lt;br /&gt;
   &lt;br /&gt;
The basic idea of this protocol is to transfer more data between caches and to access off-chip memory less often.&lt;br /&gt;
In contrast with the Illinois protocol, on a read miss the cache holding the block supplies the requested subblock together with all other Clean Shared and Dirty Shared subblocks in that block. &lt;br /&gt;
If the requested subblock must come from main memory, the snooping caches report which subblocks they hold, and memory provides the requested subblock along with the subblocks that are not currently cached.&lt;br /&gt;
On writes to clean subblocks and on write misses, the snooper invalidates only the subblock being written, which avoids the penalties associated with false sharing.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;b&amp;gt;In contrast to the Illinois MESI Protocol, which requires extra power cycles to maintain the extra states and additional logic, the subblock protocol reduces the number of cache blocks compared with cache update protocols, and thus reduces power consumption. &amp;lt;/b&amp;gt; &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Read-snarfing protocol===&lt;br /&gt;
This is an enhancement to snoopy-cache coherence protocols that takes advantage of the inherent broadcast nature of the bus.&lt;br /&gt;
&lt;br /&gt;
In contrast with the MESI protocol, when one processor reloads a block into its cache after a cache miss, the read-snarfing protocol also supplies the block to the other processors whose copies were invalidated in the past. Only one read is required to restore the block in all caches where it had been invalidated.&lt;br /&gt;
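The effect of a single snarfed read can be sketched with a toy model. In the Python sketch below, each cache is a dictionary that maps a block name to the string "valid" or "invalid", where "invalid" stands for a copy that was invalidated earlier but whose tag is still present. The data layout and the function name bus_read are assumptions made for illustration, not the structure of a real cache controller.&lt;br /&gt;

```python
def bus_read(caches, requester, block):
    """Model one bus read of `block` issued by `requester`.

    Every other cache that still holds an invalidated copy of the block
    picks the data up from the bus (it "snarfs" the read).
    Returns the names of the caches that snarfed the block.
    """
    snarfed = []
    for name, lines in caches.items():
        if name == requester:
            continue
        if lines.get(block) == "invalid":
            lines[block] = "valid"
            snarfed.append(name)
    caches[requester][block] = "valid"
    return snarfed


if __name__ == "__main__":
    caches = {
        "P0": {"A": "invalid"},
        "P1": {"A": "invalid"},
        "P2": {},
    }
    print(bus_read(caches, "P0", "A"))
    print(caches)
```

In this toy run, the single read by P0 also restores the copy in P1, while P2, which never held the block, is left untouched.&lt;br /&gt;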
&lt;br /&gt;
The protocol modifies the normal update protocol by updating only those subblocks that have been modified, and updates need to be broadcast only while the data is actively shared. &lt;br /&gt;
The read-snarfing protocol maintains a write counter and an invalidation threshold (Tb) for each cache block '''b''' to overcome WI’s write-after-read problem. The protocol predicts the number of write operations that occur on a single cache block before the next read request to the same block. An invalidation request is broadcast when the write counter reaches Tb, and the protocol dynamically adjusts the value of Tb based on the behavior of the running program. &lt;br /&gt;
&lt;br /&gt;
A simple version of the Read-snarfing Random Walk algorithm is as follows.&lt;br /&gt;
Initially, Tb of each cache block b is set to 0. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
// Adjust Tb after each write run, i.e. the number of writes made by one&lt;br /&gt;
// processor before the block is accessed by another processor.&lt;br /&gt;
if (most_recent_write_run &amp;gt; R) {&lt;br /&gt;
    // Long write run: invalidating and re-reading would have been cheaper.&lt;br /&gt;
    if (Tb &amp;gt; 0) {&lt;br /&gt;
        Tb--;&lt;br /&gt;
    }&lt;br /&gt;
} else {&lt;br /&gt;
    // Short write run: the block is actively shared, so favour updates.&lt;br /&gt;
    if (R &amp;gt; Tb) {&lt;br /&gt;
        Tb++;&lt;br /&gt;
    }&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
R  = invalidation ratio, (Ci + Cr) / Cu&lt;br /&gt;
Ci = cost in bus cycles of an invalidation transaction&lt;br /&gt;
Cu = cost in bus cycles of an update transaction&lt;br /&gt;
Cr = cost in bus cycles of reading a cache block&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When the threshold falls to 0, a block is invalidated immediately, with no wasted updates.  &lt;br /&gt;
When a block is actively shared, Tb is adjusted upward so that the block is not invalidated. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
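The threshold adjustment can also be written as a small runnable function. This Python sketch follows the random-walk rule described in this section and assumes that the threshold is allowed to fall all the way to 0, as the text states. The function and parameter names are hypothetical.&lt;br /&gt;

```python
def invalidation_ratio(c_inv, c_read, c_update):
    """R = (Ci + Cr) / Cu, with all costs given in bus cycles."""
    return (c_inv + c_read) / c_update


def adjust_threshold(tb, write_run, r):
    """Return the new threshold Tb after observing one write run.

    write_run is the number of writes one processor made to the block
    before another processor accessed it.
    """
    if write_run > r:
        # Long write run: move towards invalidating sooner (floor at 0).
        return tb - 1 if tb > 0 else tb
    # Short write run: move towards updating, but never beyond R.
    return tb + 1 if r > tb else tb


if __name__ == "__main__":
    r = 5
    print(adjust_threshold(3, 4, r))   # short run, threshold rises
    print(adjust_threshold(3, 10, r))  # long run, threshold falls
```

Returning the new value instead of mutating shared state keeps the rule easy to test in isolation; a real cache controller would keep Tb alongside the per-block write counter.&lt;br /&gt;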
&amp;lt;dl&amp;gt;&lt;br /&gt;
&amp;lt;dd&amp;gt;&lt;br /&gt;
=====Example 1 =====&lt;br /&gt;
&amp;lt;dd&amp;gt;Suppose the invalidation ratio (R) = 5&lt;br /&gt;
&amp;lt;dd&amp;gt;Current threshold of the block (Tb) = 3&lt;br /&gt;
&amp;lt;dd&amp;gt;If the processor writes 4 times before the block is accessed by another processor, then according to the above logic Tb will be 4.&lt;br /&gt;
&amp;lt;dd&amp;gt;This means Tb is at the best possible value and only updates are issued.&lt;br /&gt;
&lt;br /&gt;
=====Example 2 =====&lt;br /&gt;
&amp;lt;dd&amp;gt;Consider R = 5 and Tb = 3 for a particular block&lt;br /&gt;
&amp;lt;dd&amp;gt;If the processor writes 10 times before the block is accessed by another processor,&lt;br /&gt;
&amp;lt;dd&amp;gt;Tb will be 2 (decreased).&lt;br /&gt;
&amp;lt;dd&amp;gt;So the protocol can incur a cost of 2 updates, 1 invalidation and 1 reread.&lt;br /&gt;
&amp;lt;dd&amp;gt;After 2 more such write runs, Tb will be 0 and invalidation will occur immediately. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Simulation result==&lt;br /&gt;
For all simulated results, the subblock size ('''b''') was '''8 bytes''' and the caches were two-way set-associative ('''K = 2'''). In order to minimize simulation time, data was only simulated with relatively large caches ('''C = 128K''').&lt;br /&gt;
&lt;br /&gt;
First, below are the results for the two applications ('''Gauss and Cholesky''') that do well using large block sizes. Next are the results for the three applications ('''MP3D, Topopt, and Pverify''') that perform better using smaller block sizes. Finally come the results for Barnes, for which the choice of block size is not very important. The simulations compared the protocols on both usual and sector caches with '''1, 4, 16 and 32 processors''', where the usual cache uses the Illinois protocol and the sector cache uses the read-snarfing protocol. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Simulation]&amp;lt;/ref&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:image500.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 6.  (a-c) Gauss (128K cache) uses large cache block ''' &amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:Gauss32.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:cholesky_128k.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 7.  (a-c) Cholesky (128 K cache) application uses large cache block'''&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:gauss_32pro.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:Cholesky.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:mp3d.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 8.  (a-c) MP3D (128 K caches) application which uses smaller block sizes.'''&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:results_summary_1.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:results_summary_2.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:results_summary_3.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 9.  Final Simulation Results'''&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Conclusion==&lt;br /&gt;
This article described two techniques to improve application performance on bus-based multiprocessors. The first technique is the subblock cache coherence protocol, which combines a large transfer block with small coherence subblocks to gain the advantages of spatial locality while avoiding false sharing.   &lt;br /&gt;
The second technique is read-snarfing, which reduces the number of read misses caused by earlier cache coherence protocol actions.  &lt;br /&gt;
The simulation results above showed that the subblock protocol works better than the MESI protocol for the 64-byte cache block size, but for small block sizes the Illinois protocol works better because the subblock protocol uses a large transfer block.&lt;br /&gt;
In addition, the read-snarfing protocol is effective in reducing the amount of data transferred and the number of bus transactions, which leads to lower bus utilization and higher application speedups. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Quiz==&lt;br /&gt;
The following quiz is intended for the reader to benefit from this article and to help achieve a complete understanding of the research. There are a total of 10 multiple-choice questions.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;&lt;br /&gt;
'''1) What are the two commonly used coherence protocols? '''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Token-based &amp;amp; Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Update &amp;amp; Token-based&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Write-Invalidate &amp;amp; Write-Update &lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Both a and c&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''2) A coherency protocol is a protocol which maintains  __________      ____________ according to a specific ___________     ___________'''&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;1)	System order / memory hierarchy&lt;br /&gt;
&amp;lt;dd&amp;gt;2)	Memory coherence / consistency model&lt;br /&gt;
&amp;lt;dd&amp;gt;3)	Bus hierarchy / system order&lt;br /&gt;
&amp;lt;dd&amp;gt;4)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''3) Increased bus bandwidth consumption is most commonly seen in which protocol?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Invalidate&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Token-based&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Write-Update &lt;br /&gt;
&amp;lt;dd&amp;gt;e)	Both a and b&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''4) An increase in conflict and capacity cache misses is mostly seen in which protocol?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Token-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Update &lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Write-Invalidate&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	Both a and d&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''5) Due to the good use of  _________  __________ some applications execute more quickly with a large cache block size, whereas others run better when cache block sizes are small by avoiding migratory data or ________  _________ between processors'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Spatial locality / false sharing&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Equidistant locality / sequential consistency&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Temporal locality / true sharing&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''6) The Subblock Protocol consists of four states: Invalid, Clean Shared, Dirty and __________. Name the missing state?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Exclusive&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Valid&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Dirty Shared&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Shared&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''7) In the Subblock Protocol, all subblocks that are clean-shared may be written without a bus transaction. In what state is this achieved?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Valid Exclusive&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Invalid&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Clean Shared&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Dirty Shared&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''8) Which protocol maintains a counter and an invalidation threshold for each cache block to overcome the drawback of Write-Invalidate’s write-after-read problem?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Invalidate&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Subblock protocol &lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Read-snarfing protocol&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	Write-Update &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''9) Based on the simulation in this article, the Subblock Protocol worked better than the MESI protocol in specific executions. Under what circumstances would the MESI protocol work better than the Subblock Protocol?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	For larger block sizes, since MESI uses large block sizes&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	For effectively supplying the block to other processors whose blocks were invalidated in the past&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	MESI will update the need to broadcast when the data is actively shared&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	For small block size because Subblock Protocol uses large transfer block&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''10) Why was it stated that the Read-snarfing protocol achieved lower bus utilization and higher application speedups?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Because instructions on any given processor execute in partial program order, but may not propagate in that order&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Because it’s effective in reducing the amount of data transfer and the number of bus transactions&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Because Write Update reduces all types of cache misses relative to Write Invalidate, therefore Read-snarfing gains performance &lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Because multiple caches can be in its Shared State simultaneously.&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	None of the above&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
*[http://www.computer.org/portal/web/csdl/doi/10.1109/HPCA.1995.386536 Two techniques for improving performance on bus-based multiprocessors]&lt;br /&gt;
&lt;br /&gt;
*[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Design of an Adaptive Cache Coherence Protocol for Large Scale Multiprocessors]&lt;br /&gt;
&lt;br /&gt;
*[http://www.lfbs.rwth-aachen.de/content/smi Shared Memory Interface]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references&amp;gt;&amp;lt;/references&amp;gt;&lt;/div&gt;</summary>
		<author><name>Sbasu3</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8b_va&amp;diff=60713</id>
		<title>CSC/ECE 506 Spring 2012/8b va</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8b_va&amp;diff=60713"/>
		<updated>2012-03-29T03:56:54Z</updated>

		<summary type="html">&lt;p&gt;Sbasu3: /*    Disadvantages of Write-Update Protocol */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Update and adaptive coherence protocols on real architectures, and power considerations=&lt;br /&gt;
 &lt;br /&gt;
===Introduction===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In a [http://en.wikipedia.org/wiki/Shared_memory shared memory], shared-bus multiprocessor system, the cache coherence protocol plays an important role in propagating changes from one cache to the others. In most cases, however, update and invalidate coherence traffic is the main source of bus contention, which increases the number of busy bus cycles and therefore program execution time, because a processor may stall while its cache waits for the bus. To reduce bus contention, this article presents several adaptive hybrid protocol strategies for high-performance multiprocessors.&amp;lt;ref&amp;gt;[http://en.wikipedia.org/wiki/Shared_memory shared memory]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Coherence protocols===&lt;br /&gt;
In general terms, a coherency protocol is a protocol which maintains the consistency between all the caches in a system of [http://en.wikipedia.org/wiki/Shared_memory shared memory]. The protocol maintains memory coherence according to a specific consistency model. Older multiprocessors support the [http://en.wikipedia.org/wiki/Sequential_consistency sequential consistency model], while modern shared memory systems typically support the release consistency or weak consistency models. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:Cache_Coherency_Generic.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 1.  Multiple Caches of Shared Resource''' &amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Transitions between states in any specific implementation of these protocols may vary. For example, an implementation may choose different update and invalidation transitions such as update-on-read, update-on-write, invalidate-on-read, or invalidate-on-write. The choice of transition may affect the amount of inter-cache traffic, which in turn may affect the amount of cache bandwidth available for actual work. This should be taken into consideration in the design of distributed software that could cause strong contention between the caches of multiple processors.&lt;br /&gt;
Various models and protocols have been devised for maintaining cache coherence, such as MSI, MESI (also known as the Illinois protocol), MOSI, MOESI, MERSI, MESIF, write-once, Synapse, Berkeley, Firefly and Dragon.&lt;br /&gt;
&lt;br /&gt;
For the purpose of the main study, this article focuses on the two commonly used coherence protocols: &amp;lt;b&amp;gt;Write-Invalidate (WI) and Write-Update (WU)&amp;lt;/b&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
'''Write-Invalidate Protocol'''&lt;br /&gt;
In a [http://cs.gmu.edu/cne/modules/dsm/purple/wr_inval.html Write-Invalidate Protocol], a processor invalidates the copies of the block held in all other processors' caches and then updates its own copy without further bus operations.&lt;br /&gt;
&lt;br /&gt;
=Write-Update Protocol=&lt;br /&gt;
In contrast, in a [http://cs.gmu.edu/cne/modules/dsm/purple/wr_update.html Write-Update Protocol], a processor broadcasts its updates to shared data to the other caches so that they stay coherent.&lt;br /&gt;
&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;b&amp;gt;Write-Update protocol:&amp;lt;/b&amp;gt; Two common update protocols are the Dragon and Firefly protocols.&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&amp;lt;dd&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Dragon Protocol===&lt;br /&gt;
[[File:Screen_Shot_2012-03-28_at_10.29.04_PM.png|thumb|upright=10.5|right|alt=Table of additional notations used in the Dragon protocol state transition diagram.|Table 1: Definitions of additional notations in the state transition diagram of Dragon Protocol]]&lt;br /&gt;
[[File:Screen Shot 2012-03-28 at 10.18.57 PM.png|thumb|upright=2.5|right|alt=State transition diagram of the Dragon protocol.|Figure 2: State transition diagram of the Dragon protocol]]&lt;br /&gt;
&amp;lt;dd&amp;gt;This protocol was first proposed by researchers at Xerox PARC for their Dragon multiprocessor system.&lt;br /&gt;
The Dragon protocol ensures that data is always valid if the tag matches. Hence, there is no explicit invalid state even though it reserves a miss mode bit for compulsory misses. The Dragon protocol consists of four states:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Exclusive (E): '''means that only one cache (this cache) has a copy of the block, and it has not been modified (the main memory is up-to-date)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Shared-clean (Sc): '''means that potentially two or more caches (including this one) have this block, and main memory may or may not be up-to-date&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Shared-modified (Sm): '''means that potentially two or more caches have this block, main memory is not up-to-date, and it is this cache's responsibility to update the main memory at the time this block is replaced from the cache; a block may be in this state in only one cache at a time; however, it is quite possible that one cache has the block in this state while others have it in the shared-clean state&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Modified (M): '''signifies exclusive ownership as before; the block is modified and present in this cache alone, main memory is stale, and it is this cache's responsibility to supply the data and to update main memory on replacement.&lt;br /&gt;
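&lt;br /&gt;
The state descriptions above can be summarized in a small lookup, shown here as a minimal Python sketch. The names DragonState, must_write_back and may_be_shared are hypothetical helpers introduced only for illustration; they encode nothing beyond what the four descriptions state.&lt;br /&gt;

```python
from enum import Enum


class DragonState(Enum):
    """The four Dragon states listed above; there is no explicit invalid state."""
    EXCLUSIVE = "E"
    SHARED_CLEAN = "Sc"
    SHARED_MODIFIED = "Sm"
    MODIFIED = "M"


# States whose holder must update main memory when the block is replaced.
WRITE_BACK_ON_REPLACEMENT = frozenset(
    {DragonState.SHARED_MODIFIED, DragonState.MODIFIED}
)

# States in which other caches may also hold a copy of the block.
POSSIBLY_SHARED = frozenset(
    {DragonState.SHARED_CLEAN, DragonState.SHARED_MODIFIED}
)


def must_write_back(state):
    """Return True if replacing a block in this state requires a write-back."""
    return state in WRITE_BACK_ON_REPLACEMENT


def may_be_shared(state):
    """Return True if other caches may hold the block in this state."""
    return state in POSSIBLY_SHARED
```

For example, must_write_back(DragonState.SHARED_MODIFIED) is true, while a block in the Exclusive state can be dropped silently because main memory is up-to-date.&lt;br /&gt;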
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Firefly Protocol===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Screen_Shot_2012-03-28_at_10.19.13_PM.png|thumb|upright=2.5|right|alt=Table of additional notations used in the Firefly protocol state transition diagram.|Table 1: Definitions of additional notations in the state transition diagram of Figure 3]]&lt;br /&gt;
&amp;lt;dd&amp;gt;The Firefly protocol was developed by DEC for its Firefly multiprocessor workstation.&lt;br /&gt;
&amp;lt;dd&amp;gt;The Firefly protocol also ensures that data is always valid if the tag matches. Hence, there is no explicit invalid state even though it reserves a miss mode bit for compulsory misses. The Firefly protocol consists of the following three states:&lt;br /&gt;
[[File:Screen_Shot_2012-03-28_at_10.19.25_PM.png|thumb|upright=2.5|center|alt=State transition diagram of the Firefly protocol.|Figure 3: State diagram of the Firefly protocol.]]&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Valid (V): '''This block has a coherent copy of the memory. There is only one copy of the data in caches.&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Dirty (D):''' The block is the only cached copy and it is inconsistent with memory (memory is stale). This is the only state that generates a write-back when the block is replaced in the cache.&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Shared (S):''' This block has a coherent copy of the memory. The data may be shared, but its content is not modified. Figure 3 depicts the state transition diagram of the Firefly protocol, and Table 1 details the additional notations in the state transition diagram.&lt;br /&gt;
&lt;br /&gt;
=Adaptive coherence protocols=&lt;br /&gt;
&lt;br /&gt;
=====&amp;lt;dd&amp;gt;   &amp;lt;b&amp;gt;Disadvantages of Write-Invalidate Protocol&amp;lt;/b&amp;gt;=====&lt;br /&gt;
&amp;lt;dd&amp;gt; Any write by a processor invalidates the shared copies of the block in other processors' caches, forcing those caches to issue bus requests to reload the new data, which increases bus bandwidth demand. This gets worse when one processor frequently writes a block while another processor stalls waiting to read the same block. For a sequence of '''n''' writes by one processor, each followed by a read from another processor, the '''WI''' protocol performs '''n''' invalidations and '''n''' cache block reads.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Write-Invalidate]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=====&amp;lt;dd&amp;gt;   &amp;lt;b&amp;gt;Disadvantages of Write-Update Protocol&amp;lt;/b&amp;gt;=====&lt;br /&gt;
&amp;lt;dd&amp;gt; An update protocol is advantageous in this case because it only updates the cache block '''n''' times. However, an update protocol can keep refreshing data that other processors no longer need, so unused blocks stay in their caches for too long and fewer cache blocks are available for more useful data. This tends to increase conflict and capacity cache misses.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Write-Update]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
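&lt;br /&gt;
The trade-off described above can be made concrete with a rough cost model. This is only a sketch: it assumes a strict alternation of one write by one processor and one read by another, and the per-transaction costs passed in are hypothetical values in bus cycles, not measurements from the cited study.&lt;br /&gt;

```python
def write_invalidate_cost(n, ci, cr):
    """Bus cycles for n write/read rounds under WI.

    Each round costs one invalidation (ci) plus one cache block read (cr).
    """
    return n * (ci + cr)


def write_update_cost(n, cu):
    """Bus cycles for n write/read rounds under WU.

    Each round costs one update broadcast (cu); the reader then hits locally.
    """
    return n * cu
```

With assumed costs of 2 cycles per invalidation, 8 cycles per block read and 2 cycles per update, ten rounds cost write_invalidate_cost(10, 2, 8) = 100 cycles versus write_update_cost(10, 2) = 20 cycles. The model ignores the cache pollution that makes update protocols expensive when the other processor has stopped reading the block.&lt;br /&gt;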
&lt;br /&gt;
Using a combined strategy such as an '''adaptive hybrid protocol''' can reduce the pathological behaviors of the update and invalidate protocols. Such a protocol should be applicable to a wide range of network characteristics, and it should automatically adjust its behavior to achieve its target goals in the face of changes in traffic patterns, node mobility and other network characteristics.&lt;br /&gt;
&lt;br /&gt;
====Consideration of cache architecture issue====&lt;br /&gt;
Application performance also depends directly on the cache block size.&lt;br /&gt;
Some applications execute more quickly with a large cache block size because they exhibit good spatial locality, whereas other applications run better with small cache blocks because they avoid migratory data or false sharing between processors. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Hardware architecture===&lt;br /&gt;
Most often, a processor architecture uses the same block size for memory-to-cache transfers, cache-to-memory transfers and coherence.&lt;br /&gt;
The approach described here uses different block sizes for transfer and coherence, based on application requirements.&lt;br /&gt;
A normal cache has the following parameters: capacity ('''C'''), block size ('''L''') and associativity ('''K''').&lt;br /&gt;
A sector cache additionally divides each cache block into subblocks of size '''b'''. Compared to a normal cache ('''C, L, K'''), a sector cache ('''C, L, K, b''') requires one extra state and some extra bus lines to transmit bitmasks describing the status of the subblocks in a particular block, but it maintains the same number of tag and state bits for '''L'''. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
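&lt;br /&gt;
To illustrate how the four sector cache parameters relate, the following sketch derives the number of blocks, sets and subblocks per block. The function name sector_cache_geometry is hypothetical, and the width of the bus bitmask (one bit per subblock) is an assumption made for illustration rather than a detail taken from the cited report.&lt;br /&gt;

```python
def sector_cache_geometry(c, l, k, b):
    """Derive basic quantities of a sector cache (C, L, K, b); sizes in bytes."""
    if min(c, l, k, b) == 0 or c % (l * k) != 0 or l % b != 0:
        raise ValueError("expected C divisible by L*K and L divisible by b")
    subblocks = l // b
    return {
        "blocks": c // l,                  # one tag per block, as in a normal cache
        "sets": c // (l * k),              # K blocks per set
        "subblocks_per_block": subblocks,  # coherence units inside one block
        "bitmask_bits": subblocks,         # assumption: one status bit per subblock
    }
```

For instance, a 128K two-way cache with 8-byte subblocks and an illustrative 64-byte block, sector_cache_geometry(128 * 1024, 64, 2, 8), has 2048 blocks in 1024 sets with 8 subblocks per block.&lt;br /&gt;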
   &lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Subblock protocol===&lt;br /&gt;
This snoopy protocol combines features of the [http://en.wikipedia.org/wiki/MESI_protocol Illinois MESI protocol] and write policies with subblock validation, using subblocks to gain the advantages of both small and large cache block sizes.  &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dt&amp;gt;&lt;br /&gt;
=====Block states=====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Invalid:&amp;lt;/b&amp;gt; All subblocks are invalid&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Valid Exclusive:&amp;lt;/b&amp;gt; All valid subblocks in this block are not present in any other caches. All subblocks that are Clean Shared may be written without a bus transaction. Any subblocks in the block may be invalid. There may also be Dirty subblocks, which must be written back upon replacement.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Clean Shared:&amp;lt;/b&amp;gt; The block contains subblocks that are either Clean Shared or Invalid. The block can be replaced without a bus transaction.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty Shared:&amp;lt;/b&amp;gt; Subblocks in this block may be in any state. There may be Dirty subblocks, which must be written back on replacement.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dt&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Finite state diagram of block/ line states is as follows:'''&lt;br /&gt;
  &lt;br /&gt;
&amp;lt;center&amp;gt; [[File:line_protocol.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 4.  Finite state diagram of block''' &amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=====Subblock states=====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Invalid:&amp;lt;/b&amp;gt;  The subblock is invalid&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Clean Shared:&amp;lt;/b&amp;gt; A read access to the subblock will succeed. Unless the block that the subblock belongs to is in the Valid Exclusive state, a write to the subblock will force an invalidation transaction on the bus.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty Shared:&amp;lt;/b&amp;gt; The subblock is treated like a Clean Shared subblock, except that it must be written back on replacement. At most one cache will have a given subblock in either the Dirty Shared or Dirty state.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty:&amp;lt;/b&amp;gt; The subblock is exclusive to this cache. It must be written back on replacement. Read and write accesses to this subblock hit with no bus transaction.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Finite state diagram of subblock is as follows:''' &lt;br /&gt;
&amp;lt;center&amp;gt;[[File:subblock_protocol.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 5.  Finite state diagram of Sub-block''' &amp;lt;/center&amp;gt; &lt;br /&gt;
   &lt;br /&gt;
The basic idea of this protocol is to perform more data transfers between caches and fewer off-chip memory accesses.&lt;br /&gt;
In contrast with the Illinois protocol, on a read miss the cache that holds the block supplies the requested subblock along with all other Clean Shared and Dirty Shared subblocks in that block.&lt;br /&gt;
If the requested subblock must come from main memory, the cache snoopers report which subblocks are cached, and memory provides the requested subblock along with the subblocks that are not currently cached.&lt;br /&gt;
On writes to clean subblocks and on write misses, the snooper invalidates only the subblock being written, which avoids the penalties associated with false sharing.&lt;br /&gt;
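&lt;br /&gt;
The write rules in the block and subblock state descriptions above can be condensed into one decision function. This is a minimal sketch: write_needs_bus is a hypothetical helper, states are passed as plain strings, and treating a write to an Invalid subblock as a write miss that always uses the bus is an assumption based on the description of write misses.&lt;br /&gt;

```python
def write_needs_bus(block_state, subblock_state):
    """Return True if a processor write to the subblock requires a bus transaction."""
    if subblock_state == "Dirty":
        # Exclusive to this cache: reads and writes hit locally.
        return False
    if subblock_state == "Invalid":
        # Assumption: a write miss always goes to the bus.
        return True
    if subblock_state in ("Clean Shared", "Dirty Shared"):
        # Dirty Shared is treated like Clean Shared for writes. Only a
        # Valid Exclusive block lets the write proceed silently.
        return block_state != "Valid Exclusive"
    raise ValueError("unknown subblock state: " + subblock_state)
```

So write_needs_bus("Clean Shared", "Clean Shared") is true and triggers an invalidation of that one subblock, while write_needs_bus("Valid Exclusive", "Clean Shared") is false.&lt;br /&gt;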
&lt;br /&gt;
&amp;lt;b&amp;gt;In contrast to the Illinois MESI protocol, which requires extra power cycles to maintain the extra states and additional logic, the subblock protocol reduces the number of cache blocks compared to cache update protocols, and thus reduces power consumption. &amp;lt;/b&amp;gt; &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Read-snarfing protocol===&lt;br /&gt;
This is an enhancement to snoopy-cache coherence protocols that takes advantage of the inherent broadcast nature of the bus.&lt;br /&gt;
&lt;br /&gt;
In contrast with the MESI protocol, when one processor reloads a block into its cache after a cache miss, the read-snarfing protocol also supplies the block to the other processors whose copies were invalidated in the past. Only one read is required to restore the block in all caches where it was invalidated.&lt;br /&gt;
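&lt;br /&gt;
The effect of snarfing on bus reads can be seen in a toy model. The sketch below is an illustration under simplifying assumptions: each cache is reduced to a single letter for one block ("S" for a valid shared copy, "I" for an invalidated copy whose tag still matches), and every cache with an invalidated copy is assumed to need the block again eventually. The function names are hypothetical.&lt;br /&gt;

```python
def bus_read(states, requester, snarfing):
    """Apply one bus read by `requester`; return the new per-cache states."""
    new_states = list(states)
    new_states[requester] = "S"
    if snarfing:
        # Every cache holding an invalidated copy grabs the data off the bus.
        new_states = ["S" if s == "I" else s for s in new_states]
    return new_states


def reads_until_all_valid(states, snarfing):
    """Count bus reads until no cache holds an invalidated copy."""
    states = list(states)
    reads = 0
    while "I" in states:
        states = bus_read(states, states.index("I"), snarfing)
        reads += 1
    return reads
```

Starting from ["S", "I", "I", "I"], the model needs three bus reads without snarfing and a single bus read with snarfing.&lt;br /&gt;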
&lt;br /&gt;
The protocol modifies the normal update protocol by updating only those subblocks that are modified, and an update only needs to be broadcast when the data is actively shared.&lt;br /&gt;
The read-snarfing protocol maintains a counter and an invalidation threshold (Tb) for each cache block “b” to overcome the drawback of WI’s write-after-read problem. The protocol predicts the number of write operations that happen on a single cache block before a read request to the same cache block. An invalidation request is broadcast when the write counter reaches Tb, and the protocol dynamically adjusts the value of Tb based on the behavior of the running program.&lt;br /&gt;
&lt;br /&gt;
A simple algorithm for the read-snarfing Random Walk protocol is as follows.&lt;br /&gt;
Initially, Tb of each cache block b is set to 0. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
// write_run: number of writes to the block before it is accessed by another processor&lt;br /&gt;
// R:  invalidation ratio, (Ci + Cr) / Cu&lt;br /&gt;
// Ci: the cost in bus cycles of an invalidation transaction&lt;br /&gt;
// Cu: the cost in bus cycles of an update transaction&lt;br /&gt;
// Cr: the cost in bus cycles of reading a cache block&lt;br /&gt;
&lt;br /&gt;
if (write_run &amp;gt; R) {&lt;br /&gt;
    // Long write run: updates were wasted, so lower the threshold&lt;br /&gt;
    if (Tb &amp;gt; 1) {&lt;br /&gt;
        Tb--;&lt;br /&gt;
    }&lt;br /&gt;
} else {&lt;br /&gt;
    // Short write run: the block is actively shared, so raise the threshold&lt;br /&gt;
    if (R &amp;gt; Tb) {&lt;br /&gt;
        Tb++;&lt;br /&gt;
    }&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The block is invalidated immediately, with no wasted updates, when the threshold drops to 0.&lt;br /&gt;
When the block is actively shared, Tb is adjusted upward so that the block is not invalidated. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&amp;lt;dd&amp;gt;&lt;br /&gt;
=====Example 1 =====&lt;br /&gt;
&amp;lt;dd&amp;gt;Suppose the invalidation ratio (R) = 5.&lt;br /&gt;
&amp;lt;dd&amp;gt;The current threshold of the block (Tb) = 3.&lt;br /&gt;
&amp;lt;dd&amp;gt;If the processor writes 4 times before the block is accessed by another processor, then according to the above logic Tb becomes 4.&lt;br /&gt;
&amp;lt;dd&amp;gt;This means Tb is at the best possible value and only updates are issued.&lt;br /&gt;
&lt;br /&gt;
=====Example 2 =====&lt;br /&gt;
&amp;lt;dd&amp;gt;Consider R = 5 and Tb = 3 for a particular block.&lt;br /&gt;
&amp;lt;dd&amp;gt;If the processor writes 10 times before the block is accessed by another processor,&lt;br /&gt;
&amp;lt;dd&amp;gt;Tb will be 2 (decreased).&lt;br /&gt;
&amp;lt;dd&amp;gt;So the protocol incurs a cost of 2 updates, 1 invalidate and 1 reread.&lt;br /&gt;
&amp;lt;dd&amp;gt;After 2 more writes, Tb will be 0 and invalidation will occur immediately.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
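&lt;br /&gt;
The threshold adjustment from the pseudocode above can be written as a small runnable sketch that reproduces the first step of both examples. The function names are hypothetical, the bus cycle costs used for R are illustrative values only, and the lower bound of 1 on the decrement is taken directly from the pseudocode.&lt;br /&gt;

```python
def invalidation_ratio(ci, cr, cu):
    """R = (Ci + Cr) / Cu, with all three costs given in bus cycles."""
    return (ci + cr) / cu


def adjust_threshold(tb, write_run, r):
    """Return the new invalidation threshold Tb for one cache block.

    write_run is the number of writes seen before another processor
    accessed the block; r is the invalidation ratio R.
    """
    if write_run > r:
        # Long write run: updates were wasted, so lean toward invalidation.
        if tb > 1:
            return tb - 1
    elif r > tb:
        # Short write run: the block is actively shared, so lean toward updates.
        return tb + 1
    return tb
```

With R = 5 and Tb = 3, adjust_threshold(3, 4, 5) gives 4 as in Example 1, and adjust_threshold(3, 10, 5) gives 2 as in Example 2. An R of 5 would result, for instance, from assumed costs of Ci = 2, Cr = 8 and Cu = 2.&lt;br /&gt;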
&lt;br /&gt;
==Simulation result==&lt;br /&gt;
For all simulated results, the subblock size ('''b''') was '''8 bytes''' and the caches were two-way set-associative ('''K = 2'''). To minimize simulation time, only relatively large caches ('''C = 128K''') were simulated.&lt;br /&gt;
&lt;br /&gt;
First, results are shown for the two applications ('''Gauss and Cholesky''') that do well with large block sizes. Next come the results for the three applications ('''MP3D, Topopt, and Pverify''') that perform better with smaller block sizes. Finally, results are reported for Barnes, for which the choice of block size is not very important. The simulations compare the protocols on both normal and sector caches with '''1, 4, 16 and 32 processors''', where the normal cache uses the Illinois protocol and the sector cache uses the read-snarfing protocol. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Simulation]&amp;lt;/ref&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:image500.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 6.  (a-c) Gauss (128K cache) uses large cache block ''' &amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:Gauss32.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:cholesky_128k.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 7.  (a-c) Cholesky (128 K cache) application uses large cache block'''&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:gauss_32pro.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:Cholesky.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:mp3d.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 8.  (a-c) MP3D (128 K caches) application which uses smaller block sizes.'''&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:results_summary_1.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:results_summary_2.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:results_summary_3.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 9.  Final Simulation Results'''&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Conclusion==&lt;br /&gt;
This article described two techniques to improve application performance on bus-based multiprocessors. The first technique is the subblock cache coherence protocol, which combines large transfer blocks with small coherence subblocks to exploit spatial locality while avoiding false sharing.&lt;br /&gt;
The second technique is read-snarfing, which reduces the number of read misses caused by earlier cache coherence protocol actions.&lt;br /&gt;
The simulation results above showed that the subblock protocol works better than the MESI protocol for the 64-byte cache block size, but for small block sizes the Illinois protocol works better because the subblock protocol uses a large transfer block.&lt;br /&gt;
In addition, the read-snarfing protocol is effective in reducing the amount of data transferred and the number of bus transactions, which results in lower bus utilization and higher application speedups.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Quiz==&lt;br /&gt;
The following quiz is intended for the reader to benefit from this article and to help achieve a complete understanding of the research. There are a total of 10 multiple-choice questions.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;&lt;br /&gt;
'''1) What are the two commonly used coherence protocols? '''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Token-based &amp;amp; Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Update &amp;amp; Token-based&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Write-Invalidate &amp;amp; Write-Update &lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Both a and c&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''2) A coherency protocol is a protocol which maintains  __________      ____________ according to a specific ___________     ___________'''&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	System order / memory hierarchy&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Memory coherence / consistency model&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Bus hierarchy / system order&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''3) Increased bus bandwidth demand is most commonly seen in which protocol?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Invalidate&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Token-based&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Write-Update &lt;br /&gt;
&amp;lt;dd&amp;gt;e)	Both a and b&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''4) An increase in conflict and capacity cache misses is mostly seen in which protocol?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Token-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Update &lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Write-Invalidate&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	Both a and d&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''5) Because they make good use of  _________  __________, some applications execute more quickly with a large cache block size, whereas others run better with small cache blocks by avoiding migratory data or ________  _________ between processors.'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Spatial locality / false sharing&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Equidistant locality / sequential consistency&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Temporal locality / true sharing&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''6) The Subblock Protocol consists of four states: Invalid, Clean Shared, Dirty and __________. Name the missing state?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Exclusive&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Valid&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Dirty Shared&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Shared&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''7) In the Subblock Protocol, all subblocks that are clean-shared may be written without a bus transaction. In what state is this achieved?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Valid Exclusive&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Invalid&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Clean Shared&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Dirty Shared&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''8) Which protocol maintains a counter and an invalidation threshold for each cache block to overcome the drawback of Write-Invalidate’s write-after-read problem?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Invalidate&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Subblock protocol &lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Read-snarfing protocol&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	Write-Update &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''9) Based on the simulation in this article, the Subblock Protocol worked better than the MESI protocol in specific executions. Under what circumstances would the MESI Protocol work better than the Subblock Protocol?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	For larger block sizes, since MESI uses large block sizes&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	When effectively supplying the block to other processors whose blocks were invalidated in the past&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	MESI will update the need to broadcast when the data is actively shared&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	For small block sizes, because the Subblock Protocol uses a large transfer block&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''10) Why was it stated that the Read-snarfing protocol achieved lower bus utilization and higher application speedups?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Because instructions on any given processor execute in partial program order, but may not propagate in that order&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Because it’s effective in reducing the amount of data transfer and the number of bus transactions&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Because Write Update reduces all types of cache misses relative to Write Invalidate, therefore Read-snarfing gains performance &lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Because multiple caches can be in its Shared State simultaneously.&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	None of the above&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
*[http://www.computer.org/portal/web/csdl/doi/10.1109/HPCA.1995.386536 Two techniques for improving performance on bus-based multiprocessors]&lt;br /&gt;
&lt;br /&gt;
*[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Design of an Adaptive Cache Coherence Protocol for Large Scale Multiprocessors]&lt;br /&gt;
&lt;br /&gt;
*[http://www.lfbs.rwth-aachen.de/content/smi Shared Memory Interface]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references&amp;gt;&amp;lt;/references&amp;gt;&lt;/div&gt;</summary>
		<author><name>Sbasu3</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8b_va&amp;diff=60712</id>
		<title>CSC/ECE 506 Spring 2012/8b va</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8b_va&amp;diff=60712"/>
		<updated>2012-03-29T03:56:28Z</updated>

		<summary type="html">&lt;p&gt;Sbasu3: /* Consideration of cache architecture issue */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Update and adaptive coherence protocols on real architectures, and power considerations=&lt;br /&gt;
 &lt;br /&gt;
===Introduction===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In a [http://en.wikipedia.org/wiki/Shared_memory shared memory], shared-bus multiprocessor system, the cache coherence protocol plays an important role in propagating changes from one cache to the others. In most cases, however, update and invalidate coherence protocols are the main source of bus contention, which increases the number of bus busy cycles and therefore program execution time, because a processor may stall while its cache waits for the bus. To reduce bus contention, this article presents several adaptive hybrid protocol strategies for high-performance multiprocessors.&amp;lt;ref&amp;gt;[http://en.wikipedia.org/wiki/Shared_memory shared memory]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Coherence protocols===&lt;br /&gt;
In general terms, a coherency protocol is a protocol which maintains the consistency between all the caches in a system of [http://en.wikipedia.org/wiki/Shared_memory shared memory]. The protocol maintains memory coherence according to a specific consistency model. Older multiprocessors support the [http://en.wikipedia.org/wiki/Sequential_consistency sequential consistency model], while modern shared memory systems typically support the release consistency or weak consistency models. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:Cache_Coherency_Generic.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 1.  Multiple Caches of Shared Resource''' &amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Transitions between states in any specific implementation of these protocols may vary. For example, an implementation may choose different update and invalidation transitions such as update-on-read, update-on-write, invalidate-on-read, or invalidate-on-write. The choice of transition may affect the amount of inter-cache traffic, which in turn may affect the amount of cache bandwidth available for actual work. This should be taken into consideration in the design of distributed software that could cause strong contention between the caches of multiple processors.&lt;br /&gt;
Various models and protocols have been devised for maintaining cache coherence, such as the MSI protocol, MESI (also known as the Illinois protocol), MOSI, MOESI, MERSI, MESIF, write-once, Synapse, Berkeley, Firefly, and Dragon protocols.&lt;br /&gt;
&lt;br /&gt;
For the purpose of the main study, this article will focus on the two commonly used coherence protocols: &amp;lt;b&amp;gt;Write-Invalidate (WI) and Write-Update (WU)&amp;lt;/b&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Write-Invalidate Protocol'''&lt;br /&gt;
In a [http://cs.gmu.edu/cne/modules/dsm/purple/wr_inval.html Write-Invalidate Protocol], a processor invalidates the copies of a cache block held by all other processors and then updates its own copy without further bus operations.&lt;br /&gt;
&lt;br /&gt;
=Write-Update Protocol=&lt;br /&gt;
By contrast, in a [http://cs.gmu.edu/cne/modules/dsm/purple/wr_update.html Write-Update Protocol], a processor broadcasts its updates to shared data to the other caches so that those caches stay coherent.&lt;br /&gt;
&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;b&amp;gt;Write-Update protocol:&amp;lt;/b&amp;gt; Common update protocols include the Dragon and Firefly protocols.&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&amp;lt;dd&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Dragon Protocol===&lt;br /&gt;
[[File:Screen_Shot_2012-03-28_at_10.29.04_PM.png|thumb|upright=10.5|right|alt=Table of additional notations used in the Dragon protocol state transition diagram.|Table 1: Definitions of additional notations in the state transition diagram of Dragon Protocol]]&lt;br /&gt;
[[File:Screen Shot 2012-03-28 at 10.18.57 PM.png|thumb|upright=2.5|right|alt=State transition diagram of the Dragon protocol.|Figure 2: State transition diagram of the Dragon protocol]]&lt;br /&gt;
&amp;lt;dd&amp;gt;This protocol was first proposed by researchers at Xerox PARC for their Dragon multiprocessor system.&lt;br /&gt;
The Dragon protocol ensures that data is always valid if the tag matches. Hence, there is no explicit invalid state even though it reserves a miss mode bit for compulsory misses. The Dragon protocol consists of four states:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Exclusive (E): '''means that only one cache (this cache) has a copy of the block, and it has not been modified (the main memory is up-to-date)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Shared-clean (Sc): '''means that potentially two or more caches (including this one) have this block, and main memory may or may not be up-to-date&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Shared-modified(Sm):''' means that potentially two or more caches have this block, main memory is not up-to-date, and it is this cache's responsibility to update the main memory at the time this block is replaced from the cache; a block may be in this state in only one cache at a time; however it is quite possible that one cache has the block in this state, while others have it in shared-clean state&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Modified (M): '''signifies exclusive ownership as before; the block is modified and present in this cache alone, main memory is stale, and it is this cache's responsibility to supply the data and to update main memory on replacement.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Firefly Protocol===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Screen_Shot_2012-03-28_at_10.19.13_PM.png|thumb|upright=2.5|right|alt=Table of additional notations used in the Firefly protocol state transition diagram.|Table 1: Definitions of additional notations in the state transition diagram of Figure 3]]&lt;br /&gt;
&amp;lt;dd&amp;gt;The Firefly protocol was developed by DEC for its Firefly multiprocessor workstation.&lt;br /&gt;
&amp;lt;dd&amp;gt;The Firefly protocol also ensures that data is always valid if the tag matches. Hence, there is no explicit invalid state even though it reserves a miss mode bit for compulsory misses. The Firefly protocol consists of the following three states:&lt;br /&gt;
[[File:Screen_Shot_2012-03-28_at_10.19.25_PM.png|thumb|upright=2.5|center|alt=State transition diagram of the Firefly protocol.|Figure 3: State diagram of the Firefly protocol.]]&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Valid (V): '''This block has a coherent copy of the memory. There is only one copy of the data in caches.&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Dirty (D):''' The block is the only cached copy and it is inconsistent with main memory. This is the only state that generates a write-back when the block is replaced in the cache.&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Shared (S):''' This block has a coherent copy of the memory. The data may be possibly shared, but its content is not modified. Figure 3 depicts the state transition diagram of the Firefly protocol, and Table 1 details the additional notations in the state transition diagram.&lt;br /&gt;
&lt;br /&gt;
=Adaptive coherence protocols=&lt;br /&gt;
&lt;br /&gt;
=====&amp;lt;dd&amp;gt;   &amp;lt;b&amp;gt;Disadvantages of Write-Invalidate Protocol&amp;lt;/b&amp;gt;=====&lt;br /&gt;
&amp;lt;dd&amp;gt; Any write by one processor invalidates the shared copies of the cache block held by other processors, forcing those caches to issue bus requests to reload the new data, which increases bus bandwidth demand. This is worse when one processor frequently updates a block while another processor stalls to read the same block. For a sequence of '''n''' writes by one processor, each followed by a read from another processor, the '''WI''' protocol issues '''n''' invalidations and '''n''' cache block reads.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Write-Invalidate]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=====&amp;lt;dd&amp;gt;   &amp;lt;b&amp;gt;Disadvantages of Write-Update Protocol&amp;lt;/b&amp;gt;=====&lt;br /&gt;
&amp;lt;dd&amp;gt; An update protocol is advantageous for the write-then-read sequence described above, because it issues only '''n''' updates to the cached copies. However, an update protocol sometimes keeps refreshing data in other processors' caches that is no longer needed, so fewer cache blocks are available for more useful data. This tends to increase conflict and capacity cache misses.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Write-Update]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
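The trade-off between the two protocols for a sequence of '''n''' write-then-read rounds can be sketched numerically. The following is a minimal illustration, not a model from the cited paper: the function names are hypothetical, the cost symbols follow the Ci, Cr and Cu definitions used later for the invalidation ratio, and the cycle counts are made-up values.&lt;br /&gt;

```python
def write_invalidate_cost(n, ci, cr):
    """Bus cycles for n write-then-read rounds under write-invalidate.

    Each round costs one invalidation (ci) plus one block re-read (cr).
    """
    return n * (ci + cr)


def write_update_cost(n, cu):
    """Bus cycles for the same n rounds under write-update: n updates."""
    return n * cu


# Hypothetical cycle costs, chosen only for illustration.
n, ci, cr, cu = 4, 2, 6, 3
print(write_invalidate_cost(n, ci, cr))  # 4 * (2 + 6) = 32
print(write_update_cost(n, cu))          # 4 * 3 = 12
```

The sketch covers only the bus traffic of this one access pattern; it does not capture the extra conflict and capacity misses that an update protocol can cause.&lt;br /&gt;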
&lt;br /&gt;
A combined strategy such as an '''adaptive hybrid protocol''' can reduce the pathological behaviors of the update and invalidate protocols. Such a protocol should be applicable to a wide range of network characteristics, and it should automatically adjust its behavior to meet its target goals as traffic patterns, node mobility and other network characteristics change.&lt;br /&gt;
&lt;br /&gt;
===Hardware architecture===&lt;br /&gt;
Most processor architectures use the same block size for memory-to-cache transfers, cache-to-memory transfers and coherence. &lt;br /&gt;
The approach described here uses different block sizes for transfer and coherence, based on application requirements.&lt;br /&gt;
A normal cache has the following parameters: capacity ('''C'''), block size ('''L''') and associativity ('''K'''). &lt;br /&gt;
A sector cache additionally divides each cache block into subblocks of size '''b'''. Compared to a normal cache ('''C, L, K'''), a sector cache ('''C, L, K, b''') requires one extra state and some extra bus lines to transmit bitmasks describing the status of the subblocks in a particular block, but it maintains the same number of tag and state bits for '''L'''. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
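To make the sector cache parameters concrete, the sketch below derives the number of blocks, sets and subblocks per block from ('''C, L, K, b'''). The function name is hypothetical, and the comment about bitmask width assumes one bit per subblock, which the source does not state. The sample values are the simulation parameters reported in this article (C = 128K, K = 2, b = 8 bytes) with the 64-byte block size mentioned in the conclusion.&lt;br /&gt;

```python
def sector_cache_geometry(c, l, k, b):
    """Derive basic counts for a sector cache (C, L, K, b); sizes in bytes."""
    if c % (l * k) != 0 or l % b != 0:
        raise ValueError("C must be a multiple of L*K, and L a multiple of b")
    blocks = c // l
    return {
        "blocks": blocks,
        "sets": blocks // k,
        # Assumption: one bit per subblock in the per-block status bitmask.
        "subblocks_per_block": l // b,
    }


print(sector_cache_geometry(128 * 1024, 64, 2, 8))
# {'blocks': 2048, 'sets': 1024, 'subblocks_per_block': 8}
```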
   &lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Subblock protocol===&lt;br /&gt;
This snooping-based protocol combines features of the [http://en.wikipedia.org/wiki/MESI_protocol Illinois MESI protocol] with write policies that use subblock validation, gaining the advantages of both small and large cache block sizes by using subblocks.  &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dt&amp;gt;&lt;br /&gt;
=====Block states=====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Invalid:&amp;lt;/b&amp;gt; All subblocks are invalid&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Valid Exclusive:&amp;lt;/b&amp;gt; All valid subblocks in this block are not present in any other caches. All subblocks that are Clean Shared may be written without a bus transaction. Any subblocks in the block may be invalid. There also may be Dirty subblocks, which must be written back upon replacement.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Clean Shared:&amp;lt;/b&amp;gt; The block contains subblocks that are either Clean Shared or Invalid. The block can be replaced without a bus transaction.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty Shared:&amp;lt;/b&amp;gt; Subblocks in this block may be in any state. There may be Dirty subblocks, which must be written back on replacement.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dt&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Finite state diagram of block/ line states is as follows:'''&lt;br /&gt;
  &lt;br /&gt;
&amp;lt;center&amp;gt; [[File:line_protocol.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 4.  Finite state diagram of block''' &amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=====Subblock states=====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Invalid:&amp;lt;/b&amp;gt;  The subblock is invalid&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Clean Shared:&amp;lt;/b&amp;gt; A read access to the subblock will succeed. Unless the enclosing block is in the Valid Exclusive state, a write to the subblock will force an invalidation transaction on the bus.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty Shared:&amp;lt;/b&amp;gt; The subblock is treated like a Clean Shared subblock, except that it must be written back on replacement. At most one cache will have a given subblock in either the Dirty Shared or Dirty state.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty:&amp;lt;/b&amp;gt; The subblock is exclusive to this cache. It must be written back on replacement. Read and write accesses to this subblock hit with no bus transaction.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Finite state diagram of subblock is as follows:''' &lt;br /&gt;
&amp;lt;center&amp;gt;[[File:subblock_protocol.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 5.  Finite state diagram of Sub-block''' &amp;lt;/center&amp;gt; &lt;br /&gt;
   &lt;br /&gt;
The basic idea of this protocol is to transfer more data between caches and to access off-chip memory less often.&lt;br /&gt;
In contrast with the Illinois protocol, on a read miss the cache that holds the block supplies the requested subblock together with all other Clean Shared and Dirty Shared subblocks in that block. &lt;br /&gt;
If the requested subblock must come from main memory, the cache snoopers report which subblocks are cached, and memory provides the requested subblock along with the subblocks that are not currently cached.&lt;br /&gt;
On writes to clean subblocks and on write misses, the snooper invalidates only the subblock being written, which avoids the penalties associated with false sharing.&lt;br /&gt;
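The effect of invalidating a single subblock rather than a whole block can be sketched as follows. This is an illustrative model only: the class and function names are hypothetical, the state names follow the subblock states listed above, and the block is assumed to hold eight subblocks (a 64-byte block with 8-byte subblocks).&lt;br /&gt;

```python
from enum import Enum


class SubblockState(Enum):
    INVALID = "I"
    CLEAN_SHARED = "CS"
    DIRTY_SHARED = "DS"
    DIRTY = "D"


def snoop_subblock_invalidate(subblocks, index):
    """Return a remote cache's subblock states after it snoops an
    invalidation for the single subblock at `index`."""
    updated = list(subblocks)
    updated[index] = SubblockState.INVALID
    return updated


def valid_count(subblocks):
    """Count subblocks that can still satisfy a read."""
    return sum(1 for s in subblocks if s is not SubblockState.INVALID)


remote = [SubblockState.CLEAN_SHARED] * 8   # assumed: 8 subblocks per block
after = snoop_subblock_invalidate(remote, 3)
print(valid_count(after))  # 7 of 8 subblocks remain readable
```

With whole-block invalidation, the same write would leave none of the eight subblocks readable in the remote cache, which is the false-sharing penalty the subblock protocol is meant to avoid.&lt;br /&gt;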
&lt;br /&gt;
&amp;lt;b&amp;gt;In contrast to the Illinois MESI Protocol, which requires extra power to maintain the extra states and additional logic, the subblock protocol reduces the number of cache blocks compared to cache update protocols, and thus reduces power consumption. &amp;lt;/b&amp;gt; &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Read-snarfing protocol===&lt;br /&gt;
This is an enhancement to snoopy-cache coherence protocols that takes advantage of the inherent broadcast nature of the bus.&lt;br /&gt;
&lt;br /&gt;
In contrast with the MESI protocol, when one processor reloads data into its cache because of a cache miss, the read-snarfing protocol also supplies the block to the other processors whose copies were invalidated in the past. Only one read is required to restore the block in all caches where it was invalidated.&lt;br /&gt;
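A minimal sketch of this behavior is shown below. The data layout, the function name and the single-letter state labels ("I" for invalid, "S" for shared) are assumptions made for illustration; the sketch only models the idea that one bus read refills every cache that still holds the tag in an invalidated state.&lt;br /&gt;

```python
def bus_read_with_snarfing(caches, requester, addr, memory):
    """Serve a read miss and let invalidated copies snarf the data.

    caches: dict of cache id to {addr: {"state": ..., "data": ...}}.
    A cache snarfs only if it still holds the tag in the invalid state "I".
    """
    data = memory[addr]
    for cache_id, lines in caches.items():
        line = lines.get(addr)
        invalidated = line is not None and line["state"] == "I"
        if cache_id == requester or invalidated:
            lines[addr] = {"state": "S", "data": data}
    return data


memory = {0x40: 42}
caches = {
    0: {},                                   # requester: read miss
    1: {0x40: {"state": "I", "data": 7}},    # invalidated earlier: snarfs
    2: {},                                   # never held the block: ignores
}
bus_read_with_snarfing(caches, 0, 0x40, memory)
print(caches[1][0x40])  # {'state': 'S', 'data': 42}
```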
&lt;br /&gt;
The protocol modifies the normal update protocol by updating only those subblocks that are modified, and updates need to be broadcast only when the data is actively shared. &lt;br /&gt;
The read-snarfing protocol maintains a write counter and an invalidation threshold (Tb) for each cache block “b” to overcome the drawback of WI’s write-after-read problem. The protocol predicts the number of write operations that occur on a single cache block before a read request to the same block. An invalidation request is broadcast when the write counter reaches Tb, and the protocol dynamically adjusts the value of Tb based on the behavior of the program. &lt;br /&gt;
&lt;br /&gt;
A simple algorithm for the Read-snarfing Random Walk protocol is as follows.&lt;br /&gt;
Initially, Tb of each cache block b is set to 0. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
// write_run: number of writes to block b before it is accessed by another processor&lt;br /&gt;
if (write_run &amp;gt; R) {&lt;br /&gt;
    if (Tb &amp;gt; 1) {&lt;br /&gt;
        Tb--;&lt;br /&gt;
    }&lt;br /&gt;
} else {&lt;br /&gt;
    if (R &amp;gt; Tb) {&lt;br /&gt;
        Tb++;&lt;br /&gt;
    }&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
R  = invalidation ratio, (Ci + Cr) / Cu&lt;br /&gt;
Ci = cost in bus cycles of an invalidation transaction&lt;br /&gt;
Cu = cost in bus cycles of an update transaction&lt;br /&gt;
Cr = cost in bus cycles of reading a cache block&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When the threshold drops to 0, the block is invalidated immediately, with no wasted updates.  &lt;br /&gt;
When a block is actively shared, Tb is adjusted upward so that the block is not invalidated. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&amp;lt;dd&amp;gt;&lt;br /&gt;
=====Example 1 =====&lt;br /&gt;
&amp;lt;dd&amp;gt;Suppose the invalidation ratio (R) = 5&lt;br /&gt;
&amp;lt;dd&amp;gt;Current threshold for the block (Tb) = 3&lt;br /&gt;
&amp;lt;dd&amp;gt;If the processor writes 4 times before the block is accessed by another processor, then according to the above logic Tb will be 4.&lt;br /&gt;
&amp;lt;dd&amp;gt;This means Tb is at the best possible value and only updates are issued.&lt;br /&gt;
&lt;br /&gt;
=====Example 2 =====&lt;br /&gt;
&amp;lt;dd&amp;gt;Consider R = 5 and Tb = 3 for a particular block.&lt;br /&gt;
&amp;lt;dd&amp;gt;If the processor writes 10 times before the block is accessed by another processor,&lt;br /&gt;
&amp;lt;dd&amp;gt;Tb will be 2 (decreased).&lt;br /&gt;
&amp;lt;dd&amp;gt;So the protocol can incur a cost of 2 updates, 1 invalidate and 1 reread.&lt;br /&gt;
&amp;lt;dd&amp;gt;After 2 more writes, Tb will be 0 and invalidation will occur immediately. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
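The threshold adjustment can also be written as a small runnable function. The sketch below is a direct transcription of the pseudocode as printed above, including its guard that stops decrementing at Tb = 1; the function and parameter names are hypothetical, and the cycle costs passed to the ratio helper are made-up values chosen so that R = 5, as in the two examples.&lt;br /&gt;

```python
def invalidation_ratio(ci, cr, cu):
    """R = (Ci + Cr) / Cu, with all costs in bus cycles."""
    return (ci + cr) / cu


def adjust_threshold(tb, write_run, r):
    """Return the new threshold Tb after observing one write run.

    write_run is the number of writes made before another processor
    accessed the block; r is the invalidation ratio R.
    """
    if write_run > r:
        # Long write run: move toward invalidating sooner.
        if tb > 1:
            return tb - 1
        return tb
    # Short write run: the block looks actively shared, so keep updating longer.
    if r > tb:
        return tb + 1
    return tb


r = invalidation_ratio(2, 8, 2)          # hypothetical costs giving R = 5.0
print(adjust_threshold(3, 4, r))         # Example 1: Tb goes from 3 to 4
print(adjust_threshold(3, 10, r))        # Example 2: Tb goes from 3 to 2
```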
&lt;br /&gt;
==Simulation result==&lt;br /&gt;
For all simulated results, the subblock size ('''b''') was '''8 bytes''' and the caches were two-way set-associative ('''K = 2'''). To minimize simulation time, only relatively large caches ('''C = 128K''') were simulated.&lt;br /&gt;
&lt;br /&gt;
First, below are the results for the two applications ('''Gauss and Cholesky''') that do well with large block sizes. Next come the results for the three applications ('''MP3D, Topopt, and Pverify''') that perform better with smaller block sizes. Finally, results are reported for Barnes, for which the choice of block size is not very important. The simulations compared the protocols on both normal and sector caches with '''1, 4, 16 and 32 processors''', where the normal cache uses the Illinois protocol and the sector cache uses the read-snarfing protocol. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Simulation]&amp;lt;/ref&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:image500.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 6.  (a-c) Gauss (128K cache) uses large cache block ''' &amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:Gauss32.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:cholesky_128k.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 7.  (a-c) Cholesky (128 K cache) application uses large cache block'''&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:gauss_32pro.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:Cholesky.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:mp3d.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 8.  (a-c) MP3D (128 K caches) application which uses smaller block sizes.'''&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:results_summary_1.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:results_summary_2.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:results_summary_3.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 9.  Final Simulation Results'''&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Conclusion==&lt;br /&gt;
This article described two techniques for improving application performance on bus-based multiprocessors. The first technique is the subblock cache coherence protocol, which combines a large transfer block with small coherence subblocks to gain the advantages of spatial locality while avoiding false sharing.   &lt;br /&gt;
The second technique is read-snarfing, which reduces the number of read misses caused by earlier cache coherence protocol actions.  &lt;br /&gt;
The simulation results above show that the subblock protocol works better than the MESI protocol for the 64-byte cache block size, but for small block sizes the Illinois protocol works better because the subblock protocol uses a large transfer block.&lt;br /&gt;
In addition, the read-snarfing protocol is effective in reducing the amount of data transferred and the number of bus transactions, which results in lower bus utilization and higher application speedups. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Quiz==&lt;br /&gt;
The following quiz is intended for the reader to benefit from this article and to help achieve a complete understanding of the research. There are a total of 10 multiple-choice questions.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;&lt;br /&gt;
'''1) What are the two commonly used coherence protocols? '''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Token-based &amp;amp; Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Update &amp;amp; Token-based&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Write-Invalidate &amp;amp; Write-Update &lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Both a and c&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''2) A coherency protocol is a protocol which maintains  __________      ____________ according to a specific ___________     ___________'''&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;1)	System order / memory hierarchy&lt;br /&gt;
&amp;lt;dd&amp;gt;2)	Memory coherence / consistency model&lt;br /&gt;
&amp;lt;dd&amp;gt;3)	Bus hierarchy / system order&lt;br /&gt;
&amp;lt;dd&amp;gt;4)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''3) Increased bus bandwidth demand is most commonly seen in which protocol?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Invalidate&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Token-based&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Write-Update &lt;br /&gt;
&amp;lt;dd&amp;gt;e)	Both a and b&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''4) Increased conflict and capacity cache misses are mostly seen in which protocol?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Token-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Update &lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Write-Invalidate&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	Both a and d&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''5) Due to good use of  _________  __________, some applications execute more quickly with a large cache block size, whereas others run better when cache block sizes are small, by avoiding migratory data or ________  _________ between processors.'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Spatial locality / false-sharing&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Equidistant locality / Sequential consistency&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Temporal locality / true sharing&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''6) The Subblock Protocol consists of four states: Invalid, Clean Shared, Dirty and __________. Name the missing state?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Exclusive&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Valid&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Dirty Shared&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Shared&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''7) In the Subblock Protocol, all subblocks that are clean-shared may be written without a bus transaction. In what state is this achieved?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Valid Exclusive&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Invalid&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Clean Shared&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Dirty Shared&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''8) Which protocol maintains a counter and an invalidation threshold for each cache block to overcome the drawback of Write-Invalidate’s write after read problem?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Invalidate&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Subblock protocol &lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Read-snarfing protocol&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	Write-Update &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''9) Based on the simulation in this article, even though it was stated that the Subblock Protocol worked better than the MESI protocol in specific executions, under what circumstances would the MESI protocol work better than the Subblock Protocol?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	For larger block sizes, since MESI uses large block sizes&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	For effectively supplying the block to other processors whose blocks were invalidated in the past&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	MESI will update the need to broadcast when the data is actively shared&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	For small block sizes, because the Subblock Protocol uses a large transfer block&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''10) Why was it stated that the Read-snarfing protocol achieves lower bus utilization and higher application speedups?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Because instructions on any given processor execute in partial program order, but may not propagate in that order&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Because it’s effective in reducing the amount of data transfer and the number of bus transactions &lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Because Write Update reduces all types of cache misses relative to Write Invalidate, therefore Read-snarfing gains performance &lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Because multiple caches can be in its Shared State simultaneously.&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	None of the above&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
*[http://www.computer.org/portal/web/csdl/doi/10.1109/HPCA.1995.386536 Two techniques for improving performance on bus-based multiprocessors]&lt;br /&gt;
&lt;br /&gt;
*[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Design of an Adaptive Cache Coherence Protocol for Large Scale Multiprocessors]&lt;br /&gt;
&lt;br /&gt;
*[http://www.lfbs.rwth-aachen.de/content/smi Shared Memory Interface]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references&amp;gt;&amp;lt;/references&amp;gt;&lt;/div&gt;</summary>
		<author><name>Sbasu3</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8b_va&amp;diff=60711</id>
		<title>CSC/ECE 506 Spring 2012/8b va</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8b_va&amp;diff=60711"/>
		<updated>2012-03-29T03:55:57Z</updated>

		<summary type="html">&lt;p&gt;Sbasu3: /*    Disadvantages of Write-Update Protocol */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Update and adaptive coherence protocols on real architectures, and power considerations=&lt;br /&gt;
 &lt;br /&gt;
===Introduction===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In a [http://en.wikipedia.org/wiki/Shared_memory shared memory], shared-bus multiprocessor system, the cache coherence protocol plays an important role in propagating changes from one cache to the others. In most cases, however, update and invalidate coherence protocols are the main source of bus contention, which increases the number of busy bus cycles and therefore program execution time, because a processor may stall while its cache waits for the bus. To reduce bus contention, this article presents some adaptive hybrid protocol strategies for high-performance multiprocessors.&amp;lt;ref&amp;gt;[http://en.wikipedia.org/wiki/Shared_memory shared memory]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Coherence protocols===&lt;br /&gt;
In general terms, a coherency protocol is a protocol which maintains consistency between all the caches in a [http://en.wikipedia.org/wiki/Shared_memory shared memory] system. The protocol maintains memory coherence according to a specific consistency model. Older multiprocessors support the [http://en.wikipedia.org/wiki/Sequential_consistency sequential consistency model], while modern shared memory systems typically support the release consistency or weak consistency models. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:Cache_Coherency_Generic.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 1.  Multiple Caches of Shared Resource''' &amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Transitions between states in any specific implementation of these protocols may vary. For example, an implementation may choose different update and invalidation transitions such as update-on-read, update-on-write, invalidate-on-read, or invalidate-on-write. The choice of transition may affect the amount of inter-cache traffic, which in turn may affect the amount of cache bandwidth available for actual work. This should be taken into consideration in the design of distributed software that could cause strong contention between the caches of multiple processors.&lt;br /&gt;
Various models and protocols have been devised for maintaining cache coherence, such as the MSI protocol, MESI (aka the Illinois protocol), MOSI, MOESI, MERSI, MESIF, write-once, Synapse, Berkeley, Firefly and Dragon protocols.&lt;br /&gt;
&lt;br /&gt;
For the purpose of the main study, this article will focus on the two commonly used coherence protocols: &amp;lt;b&amp;gt;Write-Invalidate (WI) and Write-Update (WU)&amp;lt;/b&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Write-Invalidate Protocol'''&lt;br /&gt;
In a [http://cs.gmu.edu/cne/modules/dsm/purple/wr_inval.html Write-Invalidate Protocol], a processor invalidates the copies of the block in all other processors' caches and then updates its own copy without further bus operations.&lt;br /&gt;
&lt;br /&gt;
=Write-Update Protocol=&lt;br /&gt;
In contrast, in a [http://cs.gmu.edu/cne/modules/dsm/purple/wr_update.html Write-Update Protocol], a processor broadcasts its updates to shared data to the other caches so that those caches stay coherent.&lt;br /&gt;
&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
====Consideration of cache architecture issue====&lt;br /&gt;
Application execution time also depends directly on the size of the cache block. &lt;br /&gt;
Some applications execute more quickly with a large cache block size because they exhibit good spatial locality, whereas other applications run better when cache blocks are small because small blocks avoid migratory data or false sharing between processors. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;b&amp;gt;Write-Update protocol:&amp;lt;/b&amp;gt; Two common update protocols are the Dragon and Firefly protocols.&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&amp;lt;dd&amp;gt;&lt;br /&gt;
===Dragon Protocol===&lt;br /&gt;
[[File:Screen_Shot_2012-03-28_at_10.29.04_PM.png|thumb|upright=10.5|right|alt=Table of additional notations used in the Dragon protocol state transition diagram.|Table 1: Definitions of additional notations in the state transition diagram of Dragon Protocol]]&lt;br /&gt;
[[File:Screen Shot 2012-03-28 at 10.18.57 PM.png|thumb|upright=2.5|right|alt=State transition diagram of the Dragon protocol.|Figure 2: State transition diagram of the Dragon protocol]]&lt;br /&gt;
&amp;lt;dd&amp;gt;This protocol was first proposed by researchers at Xerox PARC for their Dragon multiprocessor system.&lt;br /&gt;
The Dragon protocol ensures that data is always valid if the tag matches. Hence, there is no explicit invalid state even though it reserves a miss mode bit for compulsory misses. The Dragon protocol consists of four states:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Exclusive (E): '''means that only one cache (this cache) has a copy of the block, and it has not been modified (the main memory is up-to-date)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Shared-clean (Sc): '''means that potentially two or more caches (including this one) have this block, and main memory may or may not be up-to-date&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Shared-modified(Sm):''' means that potentially two or more caches have this block, main memory is not up-to-date, and it is this cache's responsibility to update the main memory at the time this block is replaced from the cache; a block may be in this state in only one cache at a time; however it is quite possible that one cache has the block in this state, while others have it in shared-clean state&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Modified (M): '''signifies exclusive ownership as before; the block is modified and present in this cache alone, main memory is stale, and it is this cache's responsibility to supply the data and to update main memory on replacement.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Firefly Protocol===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Screen_Shot_2012-03-28_at_10.19.13_PM.png|thumb|upright=2.5|right|alt=Table of additional notations used in the Firefly protocol state transition diagram.|Table 1: Definitions of additional notations in the state transition diagram of Figure 3]]&lt;br /&gt;
&amp;lt;dd&amp;gt;The Firefly protocol was developed by DEC for its Firefly multiprocessor workstation.&lt;br /&gt;
&amp;lt;dd&amp;gt;The Firefly protocol also ensures that data is always valid if the tag matches. Hence, there is no explicit invalid state even though it reserves a miss mode bit for compulsory misses. The Firefly protocol consists of the following three states:&lt;br /&gt;
[[File:Screen_Shot_2012-03-28_at_10.19.25_PM.png|thumb|upright=2.5|center|alt=State transition diagram of the Firefly protocol.|Figure 3: State diagram of the Firefly protocol.]]&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Valid (V): '''This block has a coherent copy of the memory. There is only one copy of the data in caches.&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Dirty (D):''' The block is the only cached copy of the data and it is incoherent with main memory. This is the only state that generates a write-back when the block is replaced in the cache.&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Shared (S):''' This block has a coherent copy of the memory. The data may be possibly shared, but its content is not modified. Figure 3 depicts the state transition diagram of the Firefly protocol, and Table 1 details the additional notations in the state transition diagram.&lt;br /&gt;
&lt;br /&gt;
=Adaptive coherence protocols=&lt;br /&gt;
&lt;br /&gt;
=====&amp;lt;dd&amp;gt;   &amp;lt;b&amp;gt;Disadvantages of Write-Invalidate Protocol&amp;lt;/b&amp;gt;=====&lt;br /&gt;
&amp;lt;dd&amp;gt; Any write by one processor invalidates the shared copies of the cache block held by other processors, forcing those caches to issue a bus request to reload the new data, which increases the bus bandwidth consumed. The effect is worse when one processor frequently updates a cache block and another processor stalls to read the same block. For a sequence of '''n''' writes by one processor, each followed by a read from another processor, the '''WI''' protocol performs '''n''' invalidate operations and '''n''' cache block read operations.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Write-Invalidate]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=====&amp;lt;dd&amp;gt;   &amp;lt;b&amp;gt;Disadvantages of Write-Update Protocol&amp;lt;/b&amp;gt;=====&lt;br /&gt;
&amp;lt;dd&amp;gt; For a sequence of n writes by one processor, each followed by a read from another processor, an update protocol is advantageous because it only updates the cache block n times. However, an update protocol sometimes keeps refreshing data in other processors' caches that is no longer needed, so that data stays cached for too long and fewer cache blocks are available for more useful data. This tends to increase conflict and capacity cache misses.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Write-Update]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
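The traffic difference between the two protocols can be put into rough numbers. The following Python sketch is only an illustration: the function names and the cycle costs passed in are hypothetical, and it assumes the simple access pattern in which each of n writes on one processor is followed by a read on another processor.&lt;br /&gt;

```python
def write_invalidate_cycles(n, c_inv, c_read):
    """Bus cycles under Write-Invalidate: each round costs one
    invalidation plus one block reload by the reading processor."""
    return n * (c_inv + c_read)


def write_update_cycles(n, c_upd):
    """Bus cycles under Write-Update: each round costs one update broadcast."""
    return n * c_upd


if __name__ == "__main__":
    # Hypothetical costs in bus cycles, chosen only for illustration.
    n, c_inv, c_read, c_upd = 10, 1, 4, 1
    print("WI:", write_invalidate_cycles(n, c_inv, c_read))  # WI: 50
    print("WU:", write_update_cycles(n, c_upd))              # WU: 10
```

The sketch counts bus cycles only; it does not model the extra conflict and capacity misses that an update protocol may cause by keeping stale-interest data cached.&lt;br /&gt;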
&lt;br /&gt;
A combined strategy such as an '''adaptive hybrid protocol''' can reduce the pathological behaviors of pure update and invalidate protocols. Such a protocol should be applicable to a wide range of network characteristics, and it should automatically adjust its behavior to achieve target goals in the face of changes in traffic patterns, node mobility and other network characteristics.&lt;br /&gt;
&lt;br /&gt;
===Hardware architecture===&lt;br /&gt;
Most often, a processor architecture uses the same block size for memory-to-cache transfers, cache-to-memory transfers and coherence. &lt;br /&gt;
The approach described here uses different block sizes for transfer and coherence, based on application requirements.&lt;br /&gt;
A normal cache has the following parameters: capacity ('''C'''), block size ('''L''') and associativity ('''K'''). &lt;br /&gt;
A sector cache additionally divides each cache block into subblocks of size '''b'''. Compared to a normal cache ('''C, L, K'''), a sector cache ('''C, L, K, b''') requires one extra state and some extra bus lines to transmit bitmasks corresponding to the status of the subblocks in a particular block, but it maintains the same number of tag and state bits for each block of size '''L'''. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
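To make the four parameters concrete, the following Python sketch derives the geometry of a sector cache from C, L, K and b. The class and method names are hypothetical, and the address mapping shown (block number modulo number of sets) is a common textbook assumption rather than something the cited paper is known to specify.&lt;br /&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SectorCache:
    capacity: int        # C, in bytes
    block_size: int      # L, in bytes
    associativity: int   # K
    subblock_size: int   # b, in bytes

    @property
    def num_blocks(self):
        return self.capacity // self.block_size

    @property
    def num_sets(self):
        return self.num_blocks // self.associativity

    @property
    def subblocks_per_block(self):
        # Also the width of the per-block status bitmask sent on the bus.
        return self.block_size // self.subblock_size

    def locate(self, address):
        """Return (set index, subblock index) for a byte address."""
        block_number = address // self.block_size
        set_index = block_number % self.num_sets
        subblock_index = (address % self.block_size) // self.subblock_size
        return set_index, subblock_index


cache = SectorCache(capacity=128 * 1024, block_size=64,
                    associativity=2, subblock_size=8)
print(cache.num_sets, cache.subblocks_per_block)  # 1024 8
print(cache.locate(0x1234))                       # (72, 6)
```

With these example values, one tag covers a 64-byte block while coherence is tracked for each of its eight 8-byte subblocks.&lt;br /&gt;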
   &lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Subblock protocol===&lt;br /&gt;
This snoopy protocol combines features of the [http://en.wikipedia.org/wiki/MESI_protocol Illinois MESI protocol] and write policies with subblock validation, using subblocks to obtain the advantages of both small and large cache block sizes.  &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dt&amp;gt;&lt;br /&gt;
=====Block states=====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Invalid:&amp;lt;/b&amp;gt; All subblocks are invalid&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Valid Exclusive:&amp;lt;/b&amp;gt; All valid subblocks in this block are not present in any other caches. All subblocks that are Clean Shared may be written without a bus transaction. Any subblocks in the block may be invalid. There may also be Dirty subblocks, which must be written back upon replacement.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Clean Shared:&amp;lt;/b&amp;gt; The block contains subblocks that are either Clean Shared or Invalid. The block can be replaced without a bus transaction.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty Shared:&amp;lt;/b&amp;gt; Subblocks in this block may be in any state. There may be Dirty subblocks, which must be written back on replacement.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dt&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Finite state diagram of block/ line states is as follows:'''&lt;br /&gt;
  &lt;br /&gt;
&amp;lt;center&amp;gt; [[File:line_protocol.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 4.  Finite state diagram of block''' &amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=====Subblock states=====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Invalid:&amp;lt;/b&amp;gt;  The subblock is invalid&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Clean Shared:&amp;lt;/b&amp;gt; A read access to the subblock will succeed. Unless the block that contains the subblock is in the Valid Exclusive state, a write to the subblock will force an invalidation transaction on the bus.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty Shared:&amp;lt;/b&amp;gt; The subblock is treated like a Clean Shared subblock, except that it must be written back on replacement. At most one cache will have a given subblock in either the Dirty Shared or Dirty state.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty:&amp;lt;/b&amp;gt; The subblock is exclusive to this cache. It must be written back on replacement. Read and write accesses to this subblock hit with no bus transaction.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Finite state diagram of subblock is as follows:''' &lt;br /&gt;
&amp;lt;center&amp;gt;[[File:subblock_protocol.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 5.  Finite state diagram of Sub-block''' &amp;lt;/center&amp;gt; &lt;br /&gt;
   &lt;br /&gt;
The basic idea of this protocol is to transfer more data between caches and to access off-chip memory less often.&lt;br /&gt;
In contrast with the Illinois protocol, on a read miss a cache that holds the block supplies the requested subblock together with all other Clean Shared and Dirty Shared subblocks in that block. &lt;br /&gt;
If the requested subblock has to come from main memory, the cache snoopers report which subblocks are cached, and memory provides the requested subblock along with the subblocks that are not currently cached.&lt;br /&gt;
On writes to clean subblocks and on write misses, the snooper invalidates only the subblock being written, which avoids the penalties associated with false sharing.&lt;br /&gt;
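The benefit of invalidating a single subblock can be illustrated with a small Python sketch. It is a simplified model under stated assumptions: a remote copy of a block is represented only by a list of per-subblock valid flags, and the function name and its parameters are hypothetical.&lt;br /&gt;

```python
def remote_valid_after_write(remote_valid, written, per_subblock):
    """Return the valid flags of a remote copy after another cache
    writes subblock number `written` of the same block."""
    if per_subblock:
        # Subblock protocol: only the written subblock is invalidated.
        return [v and i != written for i, v in enumerate(remote_valid)]
    # Whole-block invalidation, as in a conventional invalidate protocol.
    return [False] * len(remote_valid)


remote = [True] * 8  # e.g. a 64-byte block with eight 8-byte subblocks
print(remote_valid_after_write(remote, 2, per_subblock=True))
print(remote_valid_after_write(remote, 2, per_subblock=False))
```

In the per-subblock case the remote processor can still read the seven untouched subblocks without a miss, which is exactly the false-sharing penalty that whole-block invalidation would incur.&lt;br /&gt;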
&lt;br /&gt;
&amp;lt;b&amp;gt;In contrast to the Illinois MESI protocol, which requires extra power cycles to maintain the extra states and additional logic, the subblock protocol reduces the number of cache blocks compared to cache update protocols, and thus reduces power consumption. &amp;lt;/b&amp;gt; &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Read-snarfing protocol===&lt;br /&gt;
This is an enhancement to snoopy-cache coherence protocols that takes advantage of the inherent broadcast nature of the bus.&lt;br /&gt;
&lt;br /&gt;
In contrast with the MESI protocol, when one processor reloads data into its cache after a cache miss, the read-snarfing protocol also supplies the block to the other processors whose copies of the block were invalidated in the past. Only one read is required to restore the block in all caches whose copies were invalidated.&lt;br /&gt;
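A minimal Python sketch of this behaviour follows. It is an illustrative model, not the hardware mechanism itself: each cache is a hypothetical dictionary from address to a state string, and "invalid" stands for a line whose tag still matches but whose data was invalidated earlier.&lt;br /&gt;

```python
def bus_read(caches, requester, address):
    """Model one bus read reply with read snarfing.

    caches maps a cache id to a dict of address -> state. Besides the
    requester, every cache that still holds the tag in the "invalid"
    state snarfs the data from the bus. Returns the ids that were filled.
    """
    filled = []
    for cache_id, lines in caches.items():
        if cache_id == requester or lines.get(address) == "invalid":
            lines[address] = "shared"
            filled.append(cache_id)
    return filled


caches = {
    "P0": {0x40: "invalid"},   # misses and issues the read
    "P1": {0x40: "invalid"},   # invalidated earlier, snarfs the reply
    "P2": {},                  # never held the block, ignores the reply
}
print(bus_read(caches, "P0", 0x40))  # ['P0', 'P1']
```

A single bus transaction refills both P0 and P1, so P1 avoids a later read miss on the same block.&lt;br /&gt;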
&lt;br /&gt;
The protocol modifies the normal update protocol by updating only those subblocks which are modified, and updates only need to be broadcast when the data is actively shared. &lt;br /&gt;
The read-snarfing protocol maintains a counter and an invalidation threshold (Tb) for each cache block “b” to overcome the drawback of WI’s write after read problem. The protocol predicts the number of write operations that happen on a single cache block before a read request to the same cache block. An invalidation request is broadcast when the write counter reaches Tb, and the protocol dynamically adjusts the value of Tb based on the nature of the program execution. &lt;br /&gt;
&lt;br /&gt;
A simple algorithm for the Read-snarfing Random Walk protocol is as follows.&lt;br /&gt;
Initially, Tb of each cache block b is set to 0. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
// write_run: number of writes to block b before it is accessed by another processor&lt;br /&gt;
// R: invalidation ratio, R = (Ci + Cr) / Cu&lt;br /&gt;
//   Ci: the cost in bus cycles of an invalidation transaction&lt;br /&gt;
//   Cu: the cost in bus cycles of an update transaction&lt;br /&gt;
//   Cr: the cost in bus cycles of reading a cache block&lt;br /&gt;
if (write_run &amp;gt; R) {&lt;br /&gt;
    // long write run: lower the threshold, but never below 0&lt;br /&gt;
    if (Tb &amp;gt; 0) {&lt;br /&gt;
        Tb--;&lt;br /&gt;
    }&lt;br /&gt;
} else {&lt;br /&gt;
    // short write run: raise the threshold, but never above R&lt;br /&gt;
    if (R &amp;gt; Tb) {&lt;br /&gt;
        Tb++;&lt;br /&gt;
    }&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When the threshold is reduced to 0, the block is invalidated immediately, with no wasted updates.  &lt;br /&gt;
When the block is actively shared, Tb is adjusted upward so that the block is not invalidated. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&amp;lt;dd&amp;gt;&lt;br /&gt;
=====Example 1 =====&lt;br /&gt;
&amp;lt;dd&amp;gt;Suppose the invalidation ratio (R) = 5.&lt;br /&gt;
&amp;lt;dd&amp;gt;The current threshold of the block (Tb) = 3.&lt;br /&gt;
&amp;lt;dd&amp;gt;If the processor writes the block 4 times before it is accessed by another processor, then according to the above logic Tb will become 4.&lt;br /&gt;
&amp;lt;dd&amp;gt;This means Tb is at the best possible value and only updates are issued.&lt;br /&gt;
&lt;br /&gt;
=====Example 2 =====&lt;br /&gt;
&amp;lt;dd&amp;gt;Consider R = 5 and Tb = 3 for a particular block.&lt;br /&gt;
&amp;lt;dd&amp;gt;If the processor writes the block 10 times before it is accessed by another processor,&lt;br /&gt;
&amp;lt;dd&amp;gt;Tb will be decreased to 2.&lt;br /&gt;
&amp;lt;dd&amp;gt;So the protocol can incur a cost of 2 updates, 1 invalidate and 1 reread.&lt;br /&gt;
&amp;lt;dd&amp;gt;After two more write runs of this kind, Tb will be 0 and invalidation will occur immediately. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
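The two examples can be checked with a short Python sketch of the threshold adjustment. This is a minimal sketch under assumptions: the function names are hypothetical, the bus costs are made-up values chosen so that R = 5, and the threshold is allowed to fall to 0, as the text on immediate invalidation implies.&lt;br /&gt;

```python
def invalidation_ratio(c_inv, c_read, c_upd):
    """R = (Ci + Cr) / Cu, with all costs in bus cycles."""
    return (c_inv + c_read) / c_upd


def adjust_threshold(tb, write_run, r):
    """Random-walk update of the invalidation threshold Tb of one block."""
    if write_run > r:
        # Long write run: updates were wasted, so invalidate sooner.
        return tb - 1 if tb > 0 else tb
    # Short write run: the block is actively shared, so keep updating longer.
    return tb + 1 if r > tb else tb


r = invalidation_ratio(c_inv=1, c_read=4, c_upd=1)  # hypothetical costs, R = 5
print(adjust_threshold(3, 4, r))    # Example 1: prints 4
print(adjust_threshold(3, 10, r))   # Example 2: prints 2
```

Applying the long write run of Example 2 twice more moves Tb from 2 to 1 and then to 0, at which point the block is invalidated on the first write.&lt;br /&gt;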
&lt;br /&gt;
==Simulation result==&lt;br /&gt;
For all simulated results, the subblock size ('''b''') was '''8 bytes''' and the caches were two-way set-associative ('''K = 2'''). In order to minimize simulation time, only relatively large caches ('''C = 128K''') were simulated.&lt;br /&gt;
&lt;br /&gt;
First, below are the results for the two applications ('''Gauss and Cholesky''') that do well using large block sizes. Next come the results for the three applications ('''MP3D, Topopt, and Pverify''') that perform better using smaller block sizes. Finally, results are reported for Barnes, for which the choice of block size is not very important. The simulations compared the protocols on both usual and sector caches with '''1, 4, 16 and 32 processors''', where the usual cache uses the Illinois protocol and the sector cache uses the read-snarfing protocol. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Simulation]&amp;lt;/ref&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:image500.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 6.  (a-c) Gauss (128K cache) uses large cache block ''' &amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:Gauss32.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:cholesky_128k.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 7.  (a-c) Cholesky (128 K cache) application uses large cache block'''&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:gauss_32pro.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:Cholesky.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:mp3d.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 8.  (a-c) MP3D (128 K caches) application which uses smaller block sizes.'''&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:results_summary_1.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:results_summary_2.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:results_summary_3.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 9.  Final Simulation Results'''&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Conclusion==&lt;br /&gt;
This article presented two techniques to improve application performance on bus-based multiprocessors. The first technique is the subblock cache coherence protocol, which combines large transfer blocks with small coherence subblocks in order to exploit spatial locality while avoiding false sharing.   &lt;br /&gt;
The second technique is read-snarfing, which reduces the number of read misses caused by earlier cache coherence protocol actions.  &lt;br /&gt;
The simulation results above showed that the subblock protocol works better than the MESI protocol for the 64-byte cache block size, but for small block sizes the Illinois protocol works better because the subblock protocol uses a large transfer block.&lt;br /&gt;
In addition, the read-snarfing protocol is effective in reducing the amount of data transferred and the number of bus transactions, which results in lower bus utilization and higher application speedups. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Quiz==&lt;br /&gt;
The following quiz is intended for the reader to benefit from this article and to help achieve a complete understanding of the research. There are a total of 10 multiple-choice questions.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;&lt;br /&gt;
'''1) What are the two commonly used coherence protocols? '''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Token-based &amp;amp; Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Update &amp;amp; Token-based&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Write-Invalidate &amp;amp; Write-Update &lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Both a and c&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''2) A coherency protocol is a protocol which maintains  __________      ____________ according to a specific ___________     ___________'''&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	System order / memory hierarchy&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Memory coherence / consistency model&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Bus hierarchy / system order&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''3) Increased bus bandwidth demand is most commonly seen in which protocol?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Invalidate&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Token-based&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Write-Update &lt;br /&gt;
&amp;lt;dd&amp;gt;e)	Both a and b&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''4) Increased conflict and capacity cache misses are mostly seen in which protocol?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Token-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Update &lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Write-Invalidate&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	Both a and d&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''5) Due to the good use of  _________  __________ some applications execute more quickly with a large cache block size, whereas others run better when cache block sizes are small by avoiding migratory data or ________  _________ between processors.'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Spatial locality / false-sharing&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Equidistant locality /Sequential consistency &lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Temporal locality/ true sharing &lt;br /&gt;
&amp;lt;dd&amp;gt;d)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''6) The Subblock Protocol consists of four states: Invalid, Clean Shared, Dirty and __________. Name the missing state.'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Exclusive&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Valid&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Dirty Shared&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Shared&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''7) In the Subblock Protocol, all subblocks that are clean-shared may be written without a bus transaction. In what state is this achieved?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Valid Exclusive&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Invalid&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Clean Shared&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Dirty Shared&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''8) Which protocol maintains a counter and an invalidation threshold for each cache block to overcome the drawback of Write-Invalidate's write after read problem?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Invalidate&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Subblock protocol &lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Read-snarfing protocol&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	Write-Update &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''9) Based on the simulation in this article, the Subblock Protocol worked better than the MESI protocol in specific executions. Under what circumstances would the MESI Protocol work better than the Subblock Protocol?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	For larger block sizes, since MESI uses large block sizes&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	When effectively supplying the block to other processors whose blocks were invalidated in the past&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	When updates only need to be broadcast while the data is actively shared&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	For small block sizes, because the Subblock Protocol uses a large transfer block&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''10) Why was it stated that the Read-snarfing protocol achieved lower bus utilization and higher application speedups?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Because instructions on any given processor execute in partial program order, but may not propagate in that order&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Because it is effective in reducing the amount of data transfer and the number of bus transactions &lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Because Write Update reduces all types of cache misses relative to Write Invalidate, therefore Read-snarfing gains performance &lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Because multiple caches can be in its Shared State simultaneously.&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	None of the above&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
*[http://www.computer.org/portal/web/csdl/doi/10.1109/HPCA.1995.386536 Two techniques for improving performance on bus-based multiprocessors]&lt;br /&gt;
&lt;br /&gt;
*[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Design of an Adaptive Cache Coherence Protocol for Large Scale Multiprocessors]&lt;br /&gt;
&lt;br /&gt;
*[http://www.lfbs.rwth-aachen.de/content/smi Shared Memory Interface]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references&amp;gt;&amp;lt;/references&amp;gt;&lt;/div&gt;</summary>
		<author><name>Sbasu3</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8b_va&amp;diff=60710</id>
		<title>CSC/ECE 506 Spring 2012/8b va</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8b_va&amp;diff=60710"/>
		<updated>2012-03-29T03:55:23Z</updated>

		<summary type="html">&lt;p&gt;Sbasu3: /*    Disadvantages of Write-Invalidate Protocol */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Update and adaptive coherence protocols on real architectures, and power considerations=&lt;br /&gt;
 &lt;br /&gt;
===Introduction===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In a [http://en.wikipedia.org/wiki/Shared_memory shared memory] multiprocessor system with a shared bus, the cache coherence protocol plays an important role: it propagates changes from one cache to the others. In most cases, however, update and invalidate coherence traffic is the main source of bus contention. Contention increases the number of busy bus cycles and therefore program execution time, because a processor may stall while its cache waits for the bus. To reduce bus contention, this article presents adaptive hybrid protocol strategies for high-performance multiprocessors.&amp;lt;ref&amp;gt;[http://en.wikipedia.org/wiki/Shared_memory shared memory]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Coherence protocols===&lt;br /&gt;
In general terms, a coherency protocol is a protocol which maintains the consistency between all the caches in a system of [http://en.wikipedia.org/wiki/Shared_memory shared memory]. The protocol maintains memory coherence according to a specific consistency model. Older multiprocessors support the [http://en.wikipedia.org/wiki/Sequential_consistency sequential consistency model], while modern shared memory systems typically support the release consistency or weak consistency models. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:Cache_Coherency_Generic.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 1.  Multiple Caches of Shared Resource''' &amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Transitions between states in any specific implementation of these protocols may vary. For example, an implementation may choose different update and invalidation transitions such as update-on-read, update-on-write, invalidate-on-read, or invalidate-on-write. The choice of transition may affect the amount of inter-cache traffic, which in turn may affect the amount of cache bandwidth available for actual work. This should be taken into consideration in the design of distributed software that could cause strong contention between the caches of multiple processors.&lt;br /&gt;
Various models and protocols have been devised for maintaining cache coherence, such as the MSI protocol, MESI (also known as the Illinois protocol), MOSI, MOESI, MERSI, MESIF, write-once, Synapse, Berkeley, Firefly and the Dragon protocol.&lt;br /&gt;
&lt;br /&gt;
For the purpose of the main study, this article focuses on the two commonly used coherence protocols: &amp;lt;b&amp;gt;Write-Invalidate (WI) and Write-Update (WU)&amp;lt;/b&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
'''Write-Invalidate Protocol'''&lt;br /&gt;
In a [http://cs.gmu.edu/cne/modules/dsm/purple/wr_inval.html Write-Invalidate Protocol], a processor invalidates the copies of the block held in all other processors' caches and then updates its own copy without further bus operations.&lt;br /&gt;
&lt;br /&gt;
'''Write-Update Protocol'''&lt;br /&gt;
In a [http://cs.gmu.edu/cne/modules/dsm/purple/wr_update.html Write-Update Protocol], by contrast, a processor broadcasts each update to shared data to the other caches so that they stay coherent.&lt;br /&gt;
&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
====Consideration of cache architecture issue====&lt;br /&gt;
Application execution time also depends directly on the cache block size. &lt;br /&gt;
Some applications execute more quickly with a large cache block size because they exhibit good spatial locality, whereas others run better with small cache blocks because small blocks avoid migratory data or false sharing between processors. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;b&amp;gt;Write-Update protocol:&amp;lt;/b&amp;gt; Two common update protocols are the Dragon and Firefly protocols.&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&amp;lt;dd&amp;gt;&lt;br /&gt;
===Dragon Protocol===&lt;br /&gt;
[[File:Screen_Shot_2012-03-28_at_10.29.04_PM.png|thumb|upright=10.5|right|alt=Table of additional notations used in the state transition diagram of the Dragon protocol.|Table 1: Definitions of additional notations in the state transition diagram of Dragon Protocol]]&lt;br /&gt;
[[File:Screen Shot 2012-03-28 at 10.18.57 PM.png|thumb|upright=2.5|right|alt=State transition diagram of the Dragon protocol.|Figure 2: State transition diagram of the Dragon protocol]]&lt;br /&gt;
&amp;lt;dd&amp;gt;This protocol was first proposed by researchers at Xerox PARC for their Dragon multiprocessor system.&lt;br /&gt;
The Dragon protocol ensures that data is always valid if the tag matches. Hence, there is no explicit invalid state even though it reserves a miss mode bit for compulsory misses. The Dragon protocol consists of four states:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Exclusive (E): '''means that only one cache (this cache) has a copy of the block, and it has not been modified (the main memory is up-to-date)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Shared-clean (Sc): '''means that potentially two or more caches (including this one) have this block, and main memory may or may not be up-to-date&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Shared-modified (Sm):''' means that potentially two or more caches have this block, main memory is not up-to-date, and it is this cache's responsibility to update the main memory at the time this block is replaced from the cache. A block may be in this state in only one cache at a time; however, it is quite possible that one cache has the block in this state while others have it in the shared-clean state.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Modified (M): '''signifies exclusive ownership; the block is modified and present in this cache alone, main memory is stale, and it is this cache's responsibility to supply the data and to update main memory on replacement.&lt;br /&gt;
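&lt;br /&gt;
The four states listed above can be summarised in a small Python sketch. It only encodes the properties stated in the list (which states may be shared and which make this cache responsible for updating main memory); it does not model the transitions of Figure 2. The names DragonState, may_be_shared and writes_back_on_replacement are hypothetical and chosen for illustration only.&lt;br /&gt;

```python
from enum import Enum


class DragonState(Enum):
    """The four Dragon states as described in the list above."""
    EXCLUSIVE = "E"
    SHARED_CLEAN = "Sc"
    SHARED_MODIFIED = "Sm"
    MODIFIED = "M"


# States in which this cache holds the only cached copy of the block.
SOLE_COPY = {DragonState.EXCLUSIVE, DragonState.MODIFIED}

# States in which this cache must update main memory when the block is replaced.
OWNER = {DragonState.SHARED_MODIFIED, DragonState.MODIFIED}


def may_be_shared(state):
    """True if two or more caches may hold the block."""
    return state not in SOLE_COPY


def writes_back_on_replacement(state):
    """True if replacing the block requires updating main memory."""
    return state in OWNER
```

Under this reading, only a block in the Sm or M state requires a write-back when it is replaced.&lt;br /&gt;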
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Firefly Protocol===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Screen_Shot_2012-03-28_at_10.19.13_PM.png|thumb|upright=2.5|right|alt=Table of additional notations used in the state transition diagram of the Firefly protocol.|Table 1: Definitions of additional notations in the state transition diagram of Figure 3]]&lt;br /&gt;
&amp;lt;dd&amp;gt;The Firefly protocol was developed by DEC for its Firefly multiprocessor workstation.&lt;br /&gt;
&amp;lt;dd&amp;gt;The Firefly protocol also ensures that data is always valid if the tag matches. Hence, there is no explicit invalid state even though it reserves a miss mode bit for compulsory misses. The Firefly protocol consists of the following three states:&lt;br /&gt;
[[File:Screen_Shot_2012-03-28_at_10.19.25_PM.png|thumb|upright=2.5|center|alt=State transition diagram of the Firefly protocol.|Figure 3: State diagram of the Firefly protocol.]]&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Valid (V): '''This block has a coherent copy of the memory. There is only one copy of the data in the caches.&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Dirty (D):''' This block is the only cached copy of the data and it is incoherent with memory. This is the only state that generates a write-back when the block is replaced in the cache.&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Shared (S):''' This block has a coherent copy of the memory. The data may be shared, but its content is not modified. Figure 3 depicts the state transition diagram of the Firefly protocol, and Table 1 details the additional notations in the state transition diagram.&lt;br /&gt;
&lt;br /&gt;
=Adaptive coherence protocols=&lt;br /&gt;
&lt;br /&gt;
=====&amp;lt;dd&amp;gt;   &amp;lt;b&amp;gt;Disadvantages of Write-Update Protocol&amp;lt;/b&amp;gt;=====&lt;br /&gt;
&amp;lt;dd&amp;gt; An update protocol is advantageous in this case because it only updates the cache blocks n times. However, an update protocol sometimes keeps refreshing data in other processors' caches long after that data is needed, so fewer cache blocks are available for more useful data. This tends to increase conflict and capacity cache misses.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Write-Update]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
A combined strategy such as an '''adaptive hybrid protocol''' can reduce the pathological behaviors of the update and invalidate protocols. Such a protocol should be applicable to a wide range of network characteristics, and it should automatically adjust its behavior to achieve its target goals in the face of changes in traffic patterns, node mobility and other network characteristics.&lt;br /&gt;
&lt;br /&gt;
===Hardware architecture===&lt;br /&gt;
Most processor architectures use the same block size both for transfers between memory and cache and for coherence. &lt;br /&gt;
The approach described here uses different block sizes for transfer and coherence, based on application requirements.&lt;br /&gt;
A normal cache has the following parameters: capacity ('''C'''), block size ('''L''') and associativity ('''K'''). &lt;br /&gt;
A sector cache additionally divides each cache block into subblocks of size '''b'''. Compared to a normal cache ('''C, L, K'''), a sector cache ('''C, L, K, b''') requires one extra state and some extra bus lines to transmit bitmasks corresponding to the status of the subblocks in a particular block, but it maintains the same number of tag and state bits for '''L'''. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
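&lt;br /&gt;
As a rough illustration of the parameters above, the following Python sketch models a sector cache ('''C, L, K, b''') and packs the per-subblock status into a bitmask. The names SectorCache and valid_bitmask are hypothetical, and the bit layout (bit i for subblock i) is an assumption, since the actual bus encoding of the bitmask is not specified in this article.&lt;br /&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SectorCache:
    capacity: int       # C, in bytes
    block_size: int     # L, in bytes
    associativity: int  # K
    subblock_size: int  # b, in bytes

    @property
    def subblocks_per_block(self):
        # Each block of L bytes is divided into L / b subblocks.
        return self.block_size // self.subblock_size

    @property
    def num_sets(self):
        # Standard set count for a K-way set-associative cache.
        return self.capacity // (self.block_size * self.associativity)


def valid_bitmask(valid_flags):
    """Pack per-subblock valid flags into an integer.

    Assumed layout: bit i is set when subblock i is valid.
    """
    return sum(2 ** i for i, valid in enumerate(valid_flags) if valid)
```

For example, with illustrative values of a 128K two-way cache, 64-byte blocks and 8-byte subblocks, SectorCache(128 * 1024, 64, 2, 8) has 8 subblocks per block and 1024 sets, so an 8-bit mask is enough to describe one block.&lt;br /&gt;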
   &lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Subblock protocol===&lt;br /&gt;
This snooping-based protocol combines features of the [http://en.wikipedia.org/wiki/MESI_protocol Illinois MESI protocol] and write policies with subblock validation, using subblocks to obtain the advantages of both small and large cache block sizes.  &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dt&amp;gt;&lt;br /&gt;
=====Block states=====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Invalid:&amp;lt;/b&amp;gt; All subblocks are invalid&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Valid Exclusive:&amp;lt;/b&amp;gt; All valid subblocks in this block are not present in any other caches. All subblocks that are Clean Shared may be written without a bus transaction. Any subblocks in the block may be invalid. There may also be Dirty subblocks, which must be written back upon replacement.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Clean Shared:&amp;lt;/b&amp;gt; The block contains subblocks that are either Clean Shared or Invalid. The block can be replaced without a bus transaction.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty Shared:&amp;lt;/b&amp;gt; Subblocks in this block may be in any state. There may be Dirty subblocks, which must be written back on replacement.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dt&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Finite state diagram of block/ line states is as follows:'''&lt;br /&gt;
  &lt;br /&gt;
&amp;lt;center&amp;gt; [[File:line_protocol.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 4.  Finite state diagram of block''' &amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=====Subblock states=====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Invalid:&amp;lt;/b&amp;gt;  The subblock is invalid&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Clean Shared:&amp;lt;/b&amp;gt; A read access to the subblock will succeed. Unless the enclosing block is in the Valid Exclusive state, a write to the subblock will force an invalidation transaction on the bus.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty Shared:&amp;lt;/b&amp;gt; The subblock is treated like a Clean Shared subblock, except that it must be written back on replacement. At most one cache will have a given subblock in either the Dirty Shared or Dirty state.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty:&amp;lt;/b&amp;gt; The subblock is exclusive to this cache. It must be written back on replacement. Read and write accesses to this subblock hit with no bus transaction.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Finite state diagram of subblock is as follows:''' &lt;br /&gt;
&amp;lt;center&amp;gt;[[File:subblock_protocol.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 5.  Finite state diagram of Sub-block''' &amp;lt;/center&amp;gt; &lt;br /&gt;
   &lt;br /&gt;
The basic idea of this protocol is to transfer more data between caches and to access off-chip memory less often.&lt;br /&gt;
In contrast with the Illinois protocol, on a read miss a cache that holds the block supplies not only the requested subblock but also all other Clean Shared and Dirty Shared subblocks in that block. &lt;br /&gt;
If the requested subblock is only in main memory, the cache snoopers report which subblocks are cached, and memory provides the requested subblock along with the subblocks which are not currently cached.&lt;br /&gt;
On writes to clean subblocks and on write misses, the snooper invalidates only the subblock to be written, which avoids the penalties associated with false sharing.&lt;br /&gt;
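&lt;br /&gt;
The write and replacement rules described by the block and subblock states above can be sketched in Python as follows. This is a simplified reading of the state descriptions, not the full state machines of Figures 4 and 5. The enum and function names are hypothetical, and treating a write to an Invalid subblock as a write miss that always uses the bus is an assumption.&lt;br /&gt;

```python
from enum import Enum


class BlockState(Enum):
    INVALID = "Invalid"
    VALID_EXCLUSIVE = "Valid Exclusive"
    CLEAN_SHARED = "Clean Shared"
    DIRTY_SHARED = "Dirty Shared"


class SubblockState(Enum):
    INVALID = "Invalid"
    CLEAN_SHARED = "Clean Shared"
    DIRTY_SHARED = "Dirty Shared"
    DIRTY = "Dirty"


def write_needs_bus_transaction(block, subblock):
    """Decide whether a processor write to one subblock must use the bus."""
    if subblock is SubblockState.DIRTY:
        return False  # exclusive to this cache: reads and writes hit locally
    if subblock is SubblockState.INVALID:
        return True   # assumption: a write miss always goes to the bus
    # Clean Shared, and Dirty Shared which is treated like Clean Shared:
    # only a Valid Exclusive enclosing block allows a write without the bus.
    return block is not BlockState.VALID_EXCLUSIVE


def must_write_back(subblock):
    """True if the subblock has to be written back on replacement."""
    return subblock in (SubblockState.DIRTY_SHARED, SubblockState.DIRTY)
```

Because the decision is made per subblock, a write to one subblock leaves the other subblocks of the same block untouched, which is how the protocol limits the cost of false sharing.&lt;br /&gt;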
&lt;br /&gt;
&amp;lt;b&amp;gt;In contrast to the Illinois MESI protocol, which requires extra power cycles to maintain the extra states and additional logic, the subblock protocol reduces the number of cache blocks compared to any cache update protocol, and thus reduces power consumption. &amp;lt;/b&amp;gt; &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Read-snarfing protocol===&lt;br /&gt;
This is an enhancement to snoopy-cache coherence protocols that takes advantage of the inherent broadcast nature of the bus.&lt;br /&gt;
&lt;br /&gt;
In contrast with the MESI protocol, when one processor reloads data into its cache after a cache miss, the read-snarfing protocol also supplies the block to the other processors whose copies were invalidated in the past. Only one read is required to restore the block in all caches in which it was invalidated.&lt;br /&gt;
&lt;br /&gt;
The protocol modifies the normal update protocol by updating only those subblocks which are modified, and updates only need to be broadcast while the data is actively shared. &lt;br /&gt;
The read-snarfing protocol maintains a write counter and an invalidation threshold (Tb) for each cache block b to overcome the drawback of WI's write after read problem. The protocol predicts the number of write operations that happen on a single cache block before a read request to the same cache block. An invalidation request is broadcast when the write counter reaches Tb, and the protocol dynamically adjusts the value of Tb based on the behavior of the executing program. &lt;br /&gt;
&lt;br /&gt;
A simple algorithm for the Read-snarfing Random Walk protocol is as follows.&lt;br /&gt;
Initially, Tb of each cache block b is set to 0. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
// R  = invalidation ratio = (Ci + Cr) / Cu&lt;br /&gt;
// Ci = cost in bus cycles of an invalidation transaction&lt;br /&gt;
// Cu = cost in bus cycles of an update transaction&lt;br /&gt;
// Cr = cost in bus cycles of reading a cache block&lt;br /&gt;
// write_run = number of writes to the block before it is accessed by another processor&lt;br /&gt;
&lt;br /&gt;
if (write_run &amp;gt; R) {&lt;br /&gt;
    // Long write run: lower the threshold so that invalidation happens sooner&lt;br /&gt;
    if (Tb &amp;gt; 1) {&lt;br /&gt;
        Tb--;&lt;br /&gt;
    }&lt;br /&gt;
} else {&lt;br /&gt;
    // Short write run: raise the threshold, but never above R&lt;br /&gt;
    if (R &amp;gt; Tb) {&lt;br /&gt;
        Tb++;&lt;br /&gt;
    }&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
A block is invalidated immediately, with no wasted updates, once its threshold has been reduced to 0.  &lt;br /&gt;
When a block is actively shared, Tb is adjusted upward so that the block is not invalidated. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&amp;lt;dd&amp;gt;&lt;br /&gt;
=====Example 1 =====&lt;br /&gt;
&amp;lt;dd&amp;gt;Suppose the invalidation ratio (R) = 5&lt;br /&gt;
&amp;lt;dd&amp;gt;Current threshold of the block (Tb) = 3&lt;br /&gt;
&amp;lt;dd&amp;gt;If the processor writes 4 times before the block is accessed by another processor, the write run (4) does not exceed R and R is greater than Tb, so according to the above logic Tb is incremented to 4.&lt;br /&gt;
&amp;lt;dd&amp;gt;This means Tb is at the best possible value and only updates are issued.&lt;br /&gt;
&lt;br /&gt;
=====Example 2 =====&lt;br /&gt;
&amp;lt;dd&amp;gt;Consider R = 5 and Tb = 3 for a particular block&lt;br /&gt;
&amp;lt;dd&amp;gt;If the processor writes 10 times before the block is accessed by another processor, the write run (10) exceeds R&lt;br /&gt;
&amp;lt;dd&amp;gt;Tb will be 2. (Decreased)&lt;br /&gt;
&amp;lt;dd&amp;gt;So the protocol can incur a cost of 2 updates, 1 invalidate and 1 reread.&lt;br /&gt;
&amp;lt;dd&amp;gt;After 2 more writes, Tb will be 0 and invalidation will occur immediately. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
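&lt;br /&gt;
The threshold adjustment and the first step of both examples above can be reproduced with a short Python sketch. It mirrors the pseudocode as written, including its guards; the function names are hypothetical, and the bus-cycle costs used in the usage note are made-up values chosen only so that R comes out as 5.&lt;br /&gt;

```python
def invalidation_ratio(ci, cr, cu):
    """R = (Ci + Cr) / Cu, with all costs given in bus cycles."""
    return (ci + cr) / cu


def adjust_threshold(tb, write_run, r):
    """Return the new Tb after observing one completed write run."""
    if write_run > r:
        # Long write run: lower Tb, using the same guard as the pseudocode.
        return tb - 1 if tb > 1 else tb
    # Short write run: raise Tb, but never beyond R.
    return tb + 1 if r > tb else tb
```

With hypothetical costs Ci = 2, Cr = 8 and Cu = 2, invalidation_ratio(2, 8, 2) gives R = 5. Then adjust_threshold(3, 4, 5) returns 4 as in Example 1, and adjust_threshold(3, 10, 5) returns 2 as in Example 2.&lt;br /&gt;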
&lt;br /&gt;
==Simulation result==&lt;br /&gt;
For all simulated results, the subblock size ('''b''') was '''8 bytes''' and the caches were two-way set-associative ('''K = 2'''). In order to minimize simulation time, only relatively large caches ('''C = 128K''') were simulated.&lt;br /&gt;
&lt;br /&gt;
First, below are the results for the two applications ('''Gauss and Cholesky''') that do well using large block sizes. Next come the results for the three applications ('''MP3D, Topopt, and Pverify''') that perform better using smaller block sizes. Finally, results are reported for Barnes, for which the choice of block size is not very important. The simulation compared the protocols on both normal and sector caches with '''1, 4, 16 and 32 processors''', where the normal cache uses the Illinois protocol and the sector cache uses the read-snarfing protocol. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Simulation]&amp;lt;/ref&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:image500.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 6.  (a-c) Gauss (128K cache) uses large cache block ''' &amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:Gauss32.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:cholesky_128k.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 7.  (a-c) Cholesky (128 K cache) application uses large cache block'''&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:gauss_32pro.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:Cholesky.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:mp3d.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 8.  (a-c) MP3D (128 K caches) application which uses smaller block sizes.'''&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:results_summary_1.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:results_summary_2.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:results_summary_3.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 9.  Final Simulation Results'''&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Conclusion==&lt;br /&gt;
This article described two techniques for improving application performance on bus-based multiprocessors. The first technique is the subblock cache coherence protocol, which combines a large transfer block with small coherence subblocks in order to exploit spatial locality while avoiding false sharing.   &lt;br /&gt;
The second technique is read snarfing, which reduces the number of read misses caused by earlier cache coherence protocol actions.  &lt;br /&gt;
The simulation results above show that the subblock protocol works better than the MESI protocol for a 64-byte cache block size, but for small block sizes the Illinois protocol works better because the subblock protocol uses a large transfer block.&lt;br /&gt;
In addition, the Read-snarfing protocol is effective in reducing the amount of data transferred and the number of bus transactions, which results in lower bus utilization and higher application speedups. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Quiz==&lt;br /&gt;
The following quiz is intended to help the reader review this article and check their understanding of the research. There are a total of 10 multiple-choice questions.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;&lt;br /&gt;
'''1) What are the two commonly used coherence protocols? '''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Token-based &amp;amp; Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Update &amp;amp; Token-based&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Write-Invalidate &amp;amp; Write-Update &lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Both a and c&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''2) A coherency protocol is a protocol which maintains  __________      ____________ according to a specific ___________     ___________'''&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	System order / memory hierarchy&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Memory coherence / consistency model&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Bus hierarchy / system order&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''3) Increasing bus bandwidth is most commonly seen in what protocol?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Invalidate&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Token-based&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Write-Update &lt;br /&gt;
&amp;lt;dd&amp;gt;e)	Both a and b&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''4) Increased conflict and capacity cache misses are mostly seen in which protocol?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Token-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Update &lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Write-Invalidate&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	Both a and d&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''5) Due to the good use of  _________  __________ some applications execute more quickly with a large cache block size, whereas others run better when cache block sizes are small by avoiding migratory data or ________  _________ between processors'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Spatial locality / false sharing&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Equidistant locality / sequential consistency &lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Temporal locality / true sharing &lt;br /&gt;
&amp;lt;dd&amp;gt;d)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''6) The Subblock Protocol consists of four states: Invalid, Clean Shared, Dirty and __________. Name the missing state.'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Exclusive&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Valid&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Dirty Shared&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Shared&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''7) In the Subblock Protocol, all subblocks that are clean-shared may be written without a bus transaction. In what state is this achieved?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Valid Exclusive&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Invalid&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Clean Shared&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Dirty Shared&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''8) Which protocol maintains a counter and an invalidation threshold for each cache block to overcome the drawback of Write-Invalidate's write after read problem?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Invalidate&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Subblock protocol &lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Read-snarfing protocol&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	Write-Update &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''9) The simulation in this article showed that the Subblock Protocol worked better than the MESI protocol in specific executions. In what circumstances would the MESI Protocol work better than the Subblock Protocol?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	For larger block sizes, since MESI uses large block sizes&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	For effectively supplying the block to other processors whose blocks were invalidated in the past&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	MESI will update the need to broadcast when the data is actively shared&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	For small block size because Subblock Protocol uses large transfer block&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''10) Why was it stated that the Read-snarfing protocol achieved lower bus utilization and higher application speedups?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Because instructions on any given processor execute in partial program order, but may not propagate in that order&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Because it is effective in reducing the amount of data transferred and the number of bus transactions&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Because Write Update reduces all types of cache misses relative to Write Invalidate, therefore Read-snarfing gains performance &lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Because multiple caches can be in its Shared State simultaneously.&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	None of the above&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
*[http://www.computer.org/portal/web/csdl/doi/10.1109/HPCA.1995.386536 Two techniques for improving performance on bus-based multiprocessors]&lt;br /&gt;
&lt;br /&gt;
*[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Design of an Adaptive Cache Coherence Protocol for Large Scale Multiprocessors]&lt;br /&gt;
&lt;br /&gt;
*[http://www.lfbs.rwth-aachen.de/content/smi Shared Memory Interface]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references&amp;gt;&amp;lt;/references&amp;gt;&lt;/div&gt;</summary>
		<author><name>Sbasu3</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8b_va&amp;diff=60709</id>
		<title>CSC/ECE 506 Spring 2012/8b va</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8b_va&amp;diff=60709"/>
		<updated>2012-03-29T03:54:40Z</updated>

		<summary type="html">&lt;p&gt;Sbasu3: /* Adaptive coherence protocols */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Update and adaptive coherence protocols on real architectures, and power considerations=&lt;br /&gt;
 &lt;br /&gt;
===Introduction===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In a [http://en.wikipedia.org/wiki/Shared_memory shared memory], shared-bus multiprocessor system, the cache coherence protocol plays an important role in propagating changes from one cache to the others. In many cases, however, update and invalidate coherence protocols are the main source of bus contention. Contention increases the number of bus busy cycles and therefore program execution time, because a processor may stall while its cache is waiting for the bus. To reduce bus contention, this article presents adaptive hybrid protocol strategies for high-performance multiprocessors.&amp;lt;ref&amp;gt;[http://en.wikipedia.org/wiki/Shared_memory shared memory]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Coherence protocols===&lt;br /&gt;
In general terms, a coherency protocol is a protocol which maintains the consistency between all the caches in a system of [http://en.wikipedia.org/wiki/Shared_memory shared memory]. The protocol maintains memory coherence according to a specific consistency model. Older multiprocessors support the [http://en.wikipedia.org/wiki/Sequential_consistency sequential consistency model], while modern shared memory systems typically support the release consistency or weak consistency models. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:Cache_Coherency_Generic.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 1.  Multiple Caches of Shared Resource''' &amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Transitions between states in any specific implementation of these protocols may vary. For example, an implementation may choose different update and invalidation transitions such as update-on-read, update-on-write, invalidate-on-read, or invalidate-on-write. The choice of transition may affect the amount of inter-cache traffic, which in turn may affect the amount of cache bandwidth available for actual work. This should be taken into consideration in the design of distributed software that could cause strong contention between the caches of multiple processors.&lt;br /&gt;
Various models and protocols have been devised for maintaining cache coherence, such as the MSI, MESI (also known as the Illinois protocol), MOSI, MOESI, MERSI, MESIF, write-once, Synapse, Berkeley, Firefly and Dragon protocols.&lt;br /&gt;
&lt;br /&gt;
For the purpose of the main study, this article will focus on the two commonly used coherence protocols: &amp;lt;b&amp;gt;Write-Invalidate (WI) and Write-Update (WU)&amp;lt;/b&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Write-Invalidate Protocol'''&lt;br /&gt;
In a [http://cs.gmu.edu/cne/modules/dsm/purple/wr_inval.html Write-Invalidate Protocol], a processor invalidates the copies of the cache block held by all other processors and then updates its own copy without further bus operations.&lt;br /&gt;
&lt;br /&gt;
'''Write-Update Protocol'''&lt;br /&gt;
In contrast, in a [http://cs.gmu.edu/cne/modules/dsm/purple/wr_update.html Write-Update Protocol], a processor broadcasts its updates to shared data to the other caches so that they stay coherent.&lt;br /&gt;
&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=====&amp;lt;dd&amp;gt;   &amp;lt;b&amp;gt;Disadvantages of Write-Invalidate Protocol&amp;lt;/b&amp;gt;=====&lt;br /&gt;
&amp;lt;dd&amp;gt; Any write by a processor invalidates the shared cache blocks of other processors, forcing those caches to issue bus requests to reload the new data, which increases bus bandwidth consumption. This is worse when one processor frequently updates a cache block and another processor stalls to read the same block. For a sequence of '''n''' writes by one processor, each followed by a read from another processor, the '''WI''' protocol performs '''n''' invalidations and '''n''' cache block reads.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Write-Invalidate]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
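The cost argument above can be sketched as a simple count of bus transactions. This is only an illustration, assuming one writer and one reader that strictly alternate (every write is followed by a read from the other processor); the function names are hypothetical and do not come from the cited paper.&lt;br /&gt;

```python
def write_invalidate_transactions(n):
    """Bus transactions for n write/read rounds under write-invalidate.

    Assumption: each write by the producer is followed by one read by the
    consumer, so every round costs one invalidation plus one block re-read.
    """
    invalidations = n
    block_reads = n
    return invalidations + block_reads


def write_update_transactions(n):
    """Bus transactions for the same access pattern under write-update.

    Each write broadcasts one update; the consumer's copy stays valid,
    so its reads hit in the cache and need no bus transaction.
    """
    updates = n
    return updates
```

The count treats every transaction as equally expensive; in practice an invalidation, an update and a block read generally cost different numbers of bus cycles.&lt;br /&gt;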
&lt;br /&gt;
====Consideration of cache architecture issue====&lt;br /&gt;
Application execution time also relates directly to the cache block size.&lt;br /&gt;
Some applications execute more quickly with a large cache block size because they exhibit good spatial locality, whereas other applications run better when cache block sizes are small because small blocks avoid migratory data or false sharing between processors. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;b&amp;gt;Write-Update protocol:&amp;lt;/b&amp;gt; Two common update protocols are the Dragon and Firefly protocols.&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&amp;lt;dd&amp;gt;&lt;br /&gt;
===Dragon Protocol===&lt;br /&gt;
[[File:Screen_Shot_2012-03-28_at_10.29.04_PM.png|thumb|upright=10.5|right|alt=Table of additional notations used in the Dragon protocol state transition diagram.|Table 1: Definitions of additional notations in the state transition diagram of Dragon Protocol]]&lt;br /&gt;
[[File:Screen Shot 2012-03-28 at 10.18.57 PM.png|thumb|upright=2.5|right|alt=State transition diagram of the Dragon protocol.|Figure 2: State transition diagram of the Dragon protocol]]&lt;br /&gt;
&amp;lt;dd&amp;gt;This protocol was first proposed by researchers at Xerox PARC for their Dragon multiprocessor system.&lt;br /&gt;
The Dragon protocol ensures that data is always valid if the tag matches. Hence, there is no explicit invalid state even though it reserves a miss mode bit for compulsory misses. The Dragon protocol consists of four states:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Exclusive (E): '''means that only one cache (this cache) has a copy of the block, and it has not been modified (the main memory is up-to-date)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Shared-clean (Sc): '''means that potentially two or more caches (including this one) have this block, and main memory may or may not be up-to-date&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Shared-modified (Sm): '''means that potentially two or more caches have this block, main memory is not up-to-date, and it is this cache's responsibility to update the main memory at the time this block is replaced from the cache. A block may be in this state in only one cache at a time; however, it is quite possible that one cache has the block in this state while others have it in shared-clean state&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Modified (M): '''signifies exclusive ownership as before; the block is modified and present in this cache alone, main memory is stale, and it is this cache's responsibility to supply the data and to update main memory on replacement.&lt;br /&gt;
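&lt;br /&gt;
The four state definitions above can be summarized as two properties per state: whether this cache must write the block back on replacement, and whether other caches may hold a copy. The following is a minimal sketch of that summary only; it does not model the state transitions of Figure 2, and the names used are hypothetical.&lt;br /&gt;

```python
from enum import Enum


class DragonState(Enum):
    E = "Exclusive"
    SC = "Shared-clean"
    SM = "Shared-modified"
    M = "Modified"


# States in which this cache is responsible for updating main memory
# when the block is replaced.
WRITEBACK_ON_REPLACEMENT = {DragonState.SM, DragonState.M}

# States in which other caches may also hold a copy of the block.
POSSIBLY_SHARED = {DragonState.SC, DragonState.SM}


def must_write_back(state):
    """True if replacing a block in this state requires a write-back."""
    return state in WRITEBACK_ON_REPLACEMENT


def may_have_other_copies(state):
    """True if another cache may hold the same block."""
    return state in POSSIBLY_SHARED
```

Note that there is no invalid member in the enumeration, matching the statement that the Dragon protocol has no explicit invalid state.&lt;br /&gt;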
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Firefly Protocol===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Screen_Shot_2012-03-28_at_10.19.13_PM.png|thumb|upright=2.5|right|alt=Table of additional notations used in the Firefly protocol state transition diagram.|Table 1: Definitions of additional notations in the state transition diagram of Figure 3]]&lt;br /&gt;
&amp;lt;dd&amp;gt;The Firefly protocol was developed by DEC for its Firefly multiprocessor workstation.&lt;br /&gt;
&amp;lt;dd&amp;gt;The Firefly protocol also ensures that data is always valid if the tag matches. Hence, there is no explicit invalid state even though it reserves a miss mode bit for compulsory misses. The Firefly protocol consists of the following three states:&lt;br /&gt;
[[File:Screen_Shot_2012-03-28_at_10.19.25_PM.png|thumb|upright=2.5|center|alt=State transition diagram of the Firefly protocol.|Figure 3: State diagram of the Firefly protocol.]]&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Valid (V): '''This block has a coherent copy of the memory. There is only one copy of the data in caches.&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Dirty (D):''' This block is the only cached copy of the data and it is not coherent with memory. This is the only state that generates a write-back when the block is replaced in the cache.&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Shared (S):''' This block has a coherent copy of the memory. The data may be shared, but its content is not modified.&lt;br /&gt;
&amp;lt;dd&amp;gt;Figure 3 depicts the state transition diagram of the Firefly protocol, and Table 1 details the additional notations in the state transition diagram.&lt;br /&gt;
&lt;br /&gt;
=Adaptive coherence protocols=&lt;br /&gt;
&lt;br /&gt;
=====&amp;lt;dd&amp;gt;   &amp;lt;b&amp;gt;Disadvantages of Write-Update Protocol&amp;lt;/b&amp;gt;=====&lt;br /&gt;
&amp;lt;dd&amp;gt; For a sequence of '''n''' writes by one processor interleaved with reads by another, an update protocol is advantageous because it only updates the cache block '''n''' times. However, an update protocol may keep refreshing data in other processors' caches long after those processors have stopped using it, so fewer cache blocks are available for more useful data. This tends to increase conflict and capacity cache misses.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Write-Update]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
A combined strategy such as an '''adaptive hybrid protocol''' can reduce the pathological behaviors of pure update and pure invalidate protocols. Such a protocol should be applicable to a wide range of network characteristics, and it should automatically adjust its behavior to achieve its target goals in the face of changes in traffic patterns, node mobility and other network characteristics.&lt;br /&gt;
&lt;br /&gt;
===Hardware architecture===&lt;br /&gt;
Most processor architectures use the same block size for transfers between memory and cache and for coherence.&lt;br /&gt;
The approach described here uses different block sizes for transfer and coherence, based on application requirements.&lt;br /&gt;
A normal cache has the following parameters: capacity ('''C'''), block size ('''L''') and associativity ('''K''').&lt;br /&gt;
A sector cache additionally divides each cache block into subblocks of size '''b'''. Compared to a normal cache ('''C, L, K'''), a sector cache ('''C, L, K, b''') requires one extra state and some extra bus lines to transmit bitmasks corresponding to the status of the subblocks in a particular block, but it maintains the same number of tag and state bits for '''L'''. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
   &lt;br /&gt;
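The relationship between the four sector cache parameters can be made concrete with a short calculation. The sketch below is an illustration only: the function name is hypothetical, all sizes are assumed to be in bytes, and one status bit per subblock is an assumption about the bitmask rather than something stated in the cited paper.&lt;br /&gt;

```python
def sector_cache_geometry(capacity, block_size, associativity, subblock_size):
    """Derive basic counts for a sector cache (C, L, K, b), sizes in bytes."""
    if block_size % subblock_size != 0:
        raise ValueError("block size must be a multiple of the subblock size")
    if capacity % (block_size * associativity) != 0:
        raise ValueError("capacity must divide evenly into sets")
    blocks = capacity // block_size
    return {
        "blocks": blocks,  # one tag per block, as in a normal cache
        "sets": blocks // associativity,
        # Assumed: one bit per subblock in the bitmask sent on the bus.
        "subblocks_per_block": block_size // subblock_size,
    }
```

For example, with C = 128K, K = 2 and b = 8 bytes (the values used in the simulation section of this article) and an illustrative block size of L = 64 bytes, the function reports 2048 blocks, 1024 sets and 8 subblocks per block.&lt;br /&gt;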
&amp;lt;dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Subblock protocol===&lt;br /&gt;
This snoopy protocol combines features of the [http://en.wikipedia.org/wiki/MESI_protocol Illinois MESI protocol] and write policies with subblock validation, using subblocks to obtain the advantages of both small and large cache block sizes.  &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dt&amp;gt;&lt;br /&gt;
=====Block states=====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Invalid:&amp;lt;/b&amp;gt; All subblocks are invalid&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Valid Exclusive:&amp;lt;/b&amp;gt; All valid subblocks in this block are not present in any other caches. All subblocks that are Clean Shared may be written without a bus transaction. Any subblocks in the block may be invalid. There may also be Dirty subblocks, which must be written back upon replacement.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Clean Shared:&amp;lt;/b&amp;gt; The block contains subblocks that are either Clean Shared or Invalid. The block can be replaced without a bus transaction.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty Shared:&amp;lt;/b&amp;gt; Subblocks in this block may be in any state. There may be Dirty subblocks, which must be written back on replacement.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dt&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Finite state diagram of block/ line states is as follows:'''&lt;br /&gt;
  &lt;br /&gt;
&amp;lt;center&amp;gt; [[File:line_protocol.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 4.  Finite state diagram of block''' &amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=====Subblock states=====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Invalid:&amp;lt;/b&amp;gt;  The subblock is invalid&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Clean Shared:&amp;lt;/b&amp;gt; A read access to the subblock will succeed. A write to the subblock forces an invalidation transaction on the bus, unless the block that contains the subblock is in the Valid Exclusive state.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty Shared:&amp;lt;/b&amp;gt; The subblock is treated like a Clean Shared subblock, except that it must be written back on replacement. At most one cache will have a given subblock in either the Dirty Shared or Dirty state.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty:&amp;lt;/b&amp;gt; The subblock is exclusive to this cache. It must be written back on replacement. Read and write accesses to this subblock hit with no bus transaction.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Finite state diagram of subblock is as follows:''' &lt;br /&gt;
&amp;lt;center&amp;gt;[[File:subblock_protocol.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 5.  Finite state diagram of Sub-block''' &amp;lt;/center&amp;gt; &lt;br /&gt;
   &lt;br /&gt;
The basic idea of this protocol is to transfer more data between caches and to access off-chip memory less often.&lt;br /&gt;
In contrast with the Illinois protocol, on a read miss a cache that shares the block sends the requested subblock together with all other Clean Shared and Dirty Shared subblocks in that block.&lt;br /&gt;
If the requested subblock is in main memory, the cache snoopers report which subblocks are cached, and memory provides the requested subblock along with the subblocks that are not currently cached.&lt;br /&gt;
On writes to clean subblocks and on write misses, the snooper invalidates only the subblock being written, which avoids the penalties associated with false sharing.&lt;br /&gt;
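&lt;br /&gt;
The effect of subblock-granularity invalidation on false sharing can be sketched as follows. This is a simplified illustration, not the protocol's implementation: the block and subblock sizes are illustrative values, the per-subblock valid list stands in for the real subblock states, and the function names are hypothetical.&lt;br /&gt;

```python
BLOCK_SIZE = 64     # bytes per cache block (illustrative value)
SUBBLOCK_SIZE = 8   # bytes per subblock (illustrative value)


def subblock_index(address):
    """Index, within its block, of the subblock holding this byte address."""
    return (address % BLOCK_SIZE) // SUBBLOCK_SIZE


def invalidate_on_remote_write(valid, address):
    """Return the per-subblock valid list after another cache writes address.

    Only the written subblock is invalidated; the other subblocks of the
    block stay valid, so unrelated data in the same block is not lost.
    """
    updated = list(valid)
    updated[subblock_index(address)] = False
    return updated
```

With these sizes, a remote write to address 0x1008 invalidates only subblock 1 of the block, so a local variable at address 0x1030 (subblock 6 of the same block) remains readable without a bus transaction.&lt;br /&gt;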
&lt;br /&gt;
&amp;lt;b&amp;gt;In contrast to the Illinois MESI Protocol, which requires extra power cycles to maintain the extra states and additional logic, the subblock protocol reduces the number of cache blocks compared to cache update protocols, and thus reduces power consumption. &amp;lt;/b&amp;gt; &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Read-snarfing protocol===&lt;br /&gt;
This is an enhancement to snoopy-cache coherence protocols that takes advantage of the inherent broadcast nature of the bus.&lt;br /&gt;
&lt;br /&gt;
In contrast with the MESI protocol, when one processor reloads a block into its cache because of a cache miss, the read-snarfing protocol also supplies the block to the other processors whose copies were invalidated in the past. Only one read is required to restore the block to all caches in which it was invalidated.&lt;br /&gt;
&lt;br /&gt;
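The saving from read-snarfing can be expressed as a count of bus reads. The sketch below is a deliberately simple model with a hypothetical function name; it assumes that every cache whose copy was invalidated will eventually read the block again and that, with snarfing, all of them capture the data from the first reply on the bus.&lt;br /&gt;

```python
def bus_reads_to_restore(invalidated_caches, snarfing):
    """Bus reads needed until every invalidated cache holds the block again.

    Without snarfing, each cache issues its own read; with snarfing, the
    first read restores the block in all invalidated caches at once.
    """
    if invalidated_caches == 0:
        return 0
    return 1 if snarfing else invalidated_caches
```

For four invalidated copies, the model gives four bus reads without snarfing and a single bus read with it.&lt;br /&gt;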
The protocol modifies the normal update protocol by updating only those subblocks that have been modified, and updates only need to be broadcast while the data is actively shared.&lt;br /&gt;
The read-snarfing protocol maintains a counter and an invalidation threshold (Tb) for each cache block b to overcome the write-after-read drawback of WI. The protocol predicts the number of writes that occur to a single cache block before the next read request to the same block. An invalidation request is broadcast when the write counter reaches Tb, and the protocol dynamically adjusts the value of Tb based on the behavior of the program.&lt;br /&gt;
&lt;br /&gt;
A simple algorithm for the read-snarfing Random Walk protocol is as follows.&lt;br /&gt;
Initially, Tb of each cache block b is set to 0. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
// R  = invalidation ratio, (Ci + Cr) / Cu&lt;br /&gt;
// Ci = cost in bus cycles of an invalidation transaction&lt;br /&gt;
// Cu = cost in bus cycles of an update transaction&lt;br /&gt;
// Cr = cost in bus cycles of reading a cache block&lt;br /&gt;
// Tb = invalidation threshold of cache block b&lt;br /&gt;
// write_run = number of writes to b before it was accessed by another processor&lt;br /&gt;
&lt;br /&gt;
if (write_run &amp;gt; R) {&lt;br /&gt;
    // Long write run: lower the threshold so that the block is invalidated sooner&lt;br /&gt;
    if (Tb &amp;gt; 0) {&lt;br /&gt;
        Tb--;&lt;br /&gt;
    }&lt;br /&gt;
} else {&lt;br /&gt;
    // Short write run: raise the threshold (up to R) so that updates continue longer&lt;br /&gt;
    if (R &amp;gt; Tb) {&lt;br /&gt;
        Tb++;&lt;br /&gt;
    }&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When the threshold drops to 0, the block is invalidated immediately, with no wasted updates.&lt;br /&gt;
While the block is actively shared, Tb is adjusted upward so that the block is not invalidated. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&amp;lt;dd&amp;gt;&lt;br /&gt;
=====Example 1 =====&lt;br /&gt;
&amp;lt;dd&amp;gt;Suppose the invalidation ratio (R) = 5.&lt;br /&gt;
&amp;lt;dd&amp;gt;The current threshold of the block (Tb) = 3.&lt;br /&gt;
&amp;lt;dd&amp;gt;If the processor writes 4 times before the block is accessed by another processor, then according to the above logic Tb becomes 4.&lt;br /&gt;
&amp;lt;dd&amp;gt;This means Tb is at the best possible value for this write run, and only updates are issued.&lt;br /&gt;
&lt;br /&gt;
=====Example 2 =====&lt;br /&gt;
&amp;lt;dd&amp;gt;Consider R = 5 and Tb = 3 for a particular block.&lt;br /&gt;
&amp;lt;dd&amp;gt;If the processor writes 10 times before the block is accessed by another processor,&lt;br /&gt;
&amp;lt;dd&amp;gt;Tb becomes 2 (decreased).&lt;br /&gt;
&amp;lt;dd&amp;gt;The protocol can then incur a cost of 2 updates, 1 invalidate and 1 reread.&lt;br /&gt;
&amp;lt;dd&amp;gt;After 2 more such write runs, Tb will be 0 and invalidation will occur immediately.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
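&lt;br /&gt;
The threshold adjustment and the two examples above can be reproduced with a few lines of Python. This is a minimal sketch with hypothetical function names, not code from the cited paper; it assumes that Tb may fall to 0, as the prose and Example 2 describe, and that R is the upper bound for Tb.&lt;br /&gt;

```python
def invalidation_ratio(ci, cr, cu):
    """R = (Ci + Cr) / Cu, with all costs given in bus cycles."""
    return (ci + cr) / cu


def adjust_threshold(tb, write_run, r):
    """One adjustment step of the invalidation threshold of a cache block.

    tb:        current invalidation threshold of the block
    write_run: writes observed before another processor accessed the block
    r:         invalidation ratio
    """
    if write_run > r:
        # Long write run: updates were wasted, so move towards invalidation.
        if tb > 0:
            tb -= 1
    else:
        # Short write run: the block is actively shared, so update longer.
        if r > tb:
            tb += 1
    return tb
```

With R = 5 and Tb = 3, a write run of 4 raises the threshold to 4 (Example 1), while a write run of 10 lowers it to 2 (Example 2); two further write runs of 10 bring it down to 0.&lt;br /&gt;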
&lt;br /&gt;
==Simulation result==&lt;br /&gt;
For all simulated results, the subblock size ('''b''') was '''8 bytes''' and the caches were two-way set-associative ('''K = 2'''). In order to minimize simulation time, only relatively large caches ('''C = 128K''') were simulated.&lt;br /&gt;
&lt;br /&gt;
The results are presented in three groups. First are the results for the two applications ('''Gauss and Cholesky''') that do well using large block sizes. Next are the results for the three applications ('''MP3D, Topopt, and Pverify''') that perform better using smaller block sizes. Last are the results for Barnes, for which the choice of block size is not very important. The simulation compared conventional and sector caches with '''1, 4, 16 and 32 processors''', where the conventional cache uses the Illinois protocol and the sector cache uses the read-snarfing protocol. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Simulation]&amp;lt;/ref&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:image500.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 6.  (a-c) Gauss (128K cache) uses large cache block ''' &amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:Gauss32.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:cholesky_128k.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 7.  (a-c) Cholesky (128 K cache) application uses large cache block'''&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:gauss_32pro.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:Cholesky.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:mp3d.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 8.  (a-c) MP3D (128 K caches) application which uses smaller block sizes.'''&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:results_summary_1.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:results_summary_2.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:results_summary_3.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 9.  Final Simulation Results'''&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Conclusion==&lt;br /&gt;
This article described two techniques for improving application performance on bus-based multiprocessors. The first technique is the subblock cache coherence protocol, which combines a large transfer block with small coherence subblocks in order to exploit spatial locality while avoiding false sharing.&lt;br /&gt;
The second technique is read-snarfing, which reduces the number of read misses caused by earlier cache coherence protocol actions.&lt;br /&gt;
The simulation results above show that the subblock protocol works better than the MESI protocol for a 64-byte cache block size, but for small block sizes the Illinois protocol works better because the subblock protocol uses a large transfer block.&lt;br /&gt;
In addition, the read-snarfing protocol is effective in reducing the amount of data transferred and the number of bus transactions, which results in lower bus utilization and higher application speedups.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Quiz==&lt;br /&gt;
The following quiz is intended for the reader to benefit from this article and to help achieve a complete understanding of the research. There are a total of 10 multiple-choice questions.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;&lt;br /&gt;
'''1) What are the two commonly used coherence protocols? '''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Token-based &amp;amp; Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Update &amp;amp; Token-based&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Write-Invalidate &amp;amp; Write-Update &lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Both a and c&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''2) A coherency protocol is a protocol which maintains  __________      ____________ according to a specific ___________     ___________'''&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	System order / memory hierarchy&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Memory coherence / consistency model&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Bus hierarchy / system order&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''3) Increasing bus bandwidth is most commonly seen in what protocol?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Invalidate&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Token-based&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Write-Update &lt;br /&gt;
&amp;lt;dd&amp;gt;e)	Both a and b&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''4) An increase in conflict and capacity cache misses is mostly seen in which protocol?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Token-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Update &lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Write-Invalidate&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	Both a and d&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''5) Due to good use of  _________  __________ some applications execute more quickly with a large cache block size, whereas others run better when cache block sizes are small, by avoiding migratory data or ________  _________ between processors.'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Spatial locality / false-sharing&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Equidistant locality /Sequential consistency &lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Temporal locality/ true sharing &lt;br /&gt;
&amp;lt;dd&amp;gt;d)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''6) The Subblock Protocol defines four subblock states: Invalid, Clean Shared, Dirty and __________. Name the missing state.'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Exclusive&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Valid&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Dirty Shared&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Shared&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''7) In the Subblock Protocol, all subblocks that are clean-shared may be written without a bus transaction. In what state is this achieved?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Valid Exclusive&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Invalid&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Clean Shared&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Dirty Shared&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''8) Which protocol maintains a counter and an invalidation threshold for each cache block to overcome the write-after-read drawback of Write-Invalidate?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Invalidate&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Subblock protocol &lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Read-snarfing protocol&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	Write-Update &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''9) Based on the simulation in this article, the Subblock Protocol worked better than the MESI protocol in specific executions. In what circumstances would the MESI protocol work better than the Subblock Protocol?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	For larger block sizes, since MESI uses large block sizes&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	When effectively supplying the block to other processors whose blocks were invalidated in the past&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	When MESI broadcasts updates only while the data is actively shared&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	For small block sizes, because the Subblock Protocol uses a large transfer block&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''10) Why was it stated that the Read-snarfing protocol achieves lower bus utilization and higher application speedups?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Because instructions on any given processor execute in partial program order, but may not propagate in that order&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Because it is effective in reducing the amount of data transferred and the number of bus transactions&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Because Write Update reduces all types of cache misses relative to Write Invalidate, therefore Read-snarfing gains performance &lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Because multiple caches can be in its Shared State simultaneously.&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	None of the above&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
*[http://www.computer.org/portal/web/csdl/doi/10.1109/HPCA.1995.386536 Two techniques for improving performance on bus-based multiprocessors]&lt;br /&gt;
&lt;br /&gt;
*[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Design of an Adaptive Cache Coherence Protocol for Large Scale Multiprocessors]&lt;br /&gt;
&lt;br /&gt;
*[http://www.lfbs.rwth-aachen.de/content/smi Shared Memory Interface]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references&amp;gt;&amp;lt;/references&amp;gt;&lt;/div&gt;</summary>
		<author><name>Sbasu3</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8b_va&amp;diff=60708</id>
		<title>CSC/ECE 506 Spring 2012/8b va</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8b_va&amp;diff=60708"/>
		<updated>2012-03-29T03:54:14Z</updated>

		<summary type="html">&lt;p&gt;Sbasu3: /*    Disadvantages of Write-Update Protocol */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Update and adaptive coherence protocols on real architectures, and power considerations=&lt;br /&gt;
 &lt;br /&gt;
===Introduction===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In a [http://en.wikipedia.org/wiki/Shared_memory shared memory] multiprocessor system built around a shared bus, the cache coherency protocol plays an important role in propagating changes from one cache to the others. In most cases, however, update and invalidate coherence protocols are the main source of bus contention. This contention increases the number of busy bus cycles and therefore program execution time, because a processor may stall while its cache waits for the bus. To reduce bus contention, this article presents several adaptive hybrid protocol strategies for high-performance multiprocessors.&amp;lt;ref&amp;gt;[http://en.wikipedia.org/wiki/Shared_memory shared memory]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Coherence protocols===&lt;br /&gt;
In general terms, a coherency protocol maintains consistency between all the caches in a [http://en.wikipedia.org/wiki/Shared_memory shared memory] system. The protocol maintains memory coherence according to a specific consistency model. Older multiprocessors support the [http://en.wikipedia.org/wiki/Sequential_consistency sequential consistency model], while modern shared memory systems typically support the release consistency or weak consistency models. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:Cache_Coherency_Generic.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 1.  Multiple Caches of Shared Resource''' &amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Transitions between states in any specific implementation of these protocols may vary. For example, an implementation may choose different update and invalidation transitions such as update-on-read, update-on-write, invalidate-on-read, or invalidate-on-write. The choice of transition may affect the amount of inter-cache traffic, which in turn may affect the amount of cache bandwidth available for actual work. This should be taken into consideration in the design of distributed software that could cause strong contention between the caches of multiple processors.&lt;br /&gt;
Various models and protocols have been devised for maintaining cache coherence, such as the MSI protocol, MESI (also known as the Illinois protocol), MOSI, MOESI, MERSI, MESIF, write-once, Synapse, Berkeley, Firefly and the Dragon protocol.&lt;br /&gt;
&lt;br /&gt;
For the purpose of the main study, this article will focus on the two commonly used coherence protocols: &amp;lt;b&amp;gt;Write-Invalidate (WI) and Write-Update (WU)&amp;lt;/b&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Write-Invalidate Protocol'''&lt;br /&gt;
In a [http://cs.gmu.edu/cne/modules/dsm/purple/wr_inval.html Write-Invalidate Protocol], a processor invalidates the copies of a cache block held by all other processors and then updates its own copy without further bus operations.&lt;br /&gt;
&lt;br /&gt;
=Write-Update Protocol=&lt;br /&gt;
In contrast, in a [http://cs.gmu.edu/cne/modules/dsm/purple/wr_update.html Write-Update Protocol], a processor broadcasts each update to shared data to the other caches so that those caches stay coherent.&lt;br /&gt;
&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=====&amp;lt;dd&amp;gt;   &amp;lt;b&amp;gt;Disadvantages of Write-Invalidate Protocol&amp;lt;/b&amp;gt;=====&lt;br /&gt;
&amp;lt;dd&amp;gt; Any write by a processor invalidates the shared copies of the cache block held by other processors, forcing those caches to issue bus requests to reload the new data, which in turn increases bus bandwidth consumption. This gets worse if one processor frequently updates a cache block while another processor stalls waiting to read the same block. For a sequence of '''n''' writes by one processor, each followed by a read from another processor, the '''WI''' protocol performs '''n''' invalidate and '''n''' cache block read operations.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Write-Invalidate]&amp;lt;/ref&amp;gt;&lt;br /&gt;
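&lt;br /&gt;
The cost of the write/read sequence described above can be sketched in Python. This is only an illustration under stated assumptions, not a model from the cited paper: the function names and the example cycle counts are hypothetical, and Ci, Cr and Cu stand for the bus-cycle cost of an invalidation, a cache block read and an update, respectively.&lt;br /&gt;
&lt;br /&gt;
```python
def write_invalidate_cost(n: int, ci: int, cr: int) -> int:
    """Bus cycles for n writes, each followed by a read from another processor.

    Under Write-Invalidate every write costs one invalidation and every
    following read costs one cache block reload.
    """
    return n * (ci + cr)


def write_update_cost(n: int, cu: int) -> int:
    """Bus cycles for the same pattern when every write broadcasts one update."""
    return n * cu


if __name__ == "__main__":
    # Hypothetical costs in bus cycles (assumed values, not measurements).
    ci, cr, cu = 1, 6, 2
    for n in (1, 10, 100):
        print(n, write_invalidate_cost(n, ci, cr), write_update_cost(n, cu))
```
&lt;br /&gt;
With these assumed costs the Write-Invalidate total grows as n * (Ci + Cr), which is why a frequently written and frequently read block is expensive under WI.&lt;br /&gt;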
&lt;br /&gt;
&lt;br /&gt;
====Consideration of cache architecture issue====&lt;br /&gt;
Application execution time also depends directly on the cache block size. &lt;br /&gt;
Some applications execute more quickly with a large cache block size because they exhibit good spatial locality, whereas other applications run better when cache block sizes are small, because small blocks avoid migratory data or false sharing between processors. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;b&amp;gt;Write-Update protocol:&amp;lt;/b&amp;gt; Two common update protocols are the Dragon and Firefly protocols.&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&amp;lt;dd&amp;gt;&lt;br /&gt;
===Dragon Protocol===&lt;br /&gt;
[[File:Screen_Shot_2012-03-28_at_10.29.04_PM.png|thumb|upright=10.5|right|alt=Table of additional notations used in the Dragon protocol state transition diagram.|Table 1: Definitions of additional notations in the state transition diagram of Dragon Protocol]]&lt;br /&gt;
[[File:Screen Shot 2012-03-28 at 10.18.57 PM.png|thumb|upright=2.5|right|alt=State transition diagram of the Dragon protocol.|Figure 2: State transition diagram of the Dragon protocol]]&lt;br /&gt;
&amp;lt;dd&amp;gt;This protocol was first proposed by researchers at Xerox PARC for their Dragon multiprocessor system.&lt;br /&gt;
The Dragon protocol ensures that data is always valid if the tag matches. Hence, there is no explicit invalid state even though it reserves a miss mode bit for compulsory misses. The Dragon protocol consists of four states:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Exclusive (E): '''means that only one cache (this cache) has a copy of the block, and it has not been modified (the main memory is up-to-date)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Shared-clean (Sc): '''means that potentially two or more caches (including this one) have this block, and main memory may or may not be up-to-date&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Shared-modified (Sm):''' means that potentially two or more caches have this block, main memory is not up-to-date, and it is this cache's responsibility to update the main memory when this block is replaced from the cache. A block may be in this state in only one cache at a time; however, it is quite possible that one cache has the block in this state while others have it in the shared-clean state.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Modified (M): '''signifies exclusive ownership; the block is modified and present in this cache alone, main memory is stale, and it is this cache's responsibility to supply the data and to update main memory on replacement.&lt;br /&gt;
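&lt;br /&gt;
The four Dragon states listed above can be summarized in a short Python sketch. The enum and helper names are illustrative assumptions; the two properties encoded here are taken only from the state descriptions above and do not cover the bus transitions of the full protocol.&lt;br /&gt;
&lt;br /&gt;
```python
from enum import Enum


class DragonState(Enum):
    EXCLUSIVE = "E"
    SHARED_CLEAN = "Sc"
    SHARED_MODIFIED = "Sm"
    MODIFIED = "M"


# States in which this cache must update main memory when the block is replaced.
WRITE_BACK_STATES = {DragonState.SHARED_MODIFIED, DragonState.MODIFIED}

# States in which two or more caches may hold a copy of the block.
POSSIBLY_SHARED_STATES = {DragonState.SHARED_CLEAN, DragonState.SHARED_MODIFIED}


def must_write_back(state: DragonState) -> bool:
    """Return True if replacing a block in this state requires a write-back."""
    return state in WRITE_BACK_STATES


def may_be_shared(state: DragonState) -> bool:
    """Return True if other caches may also hold the block in this state."""
    return state in POSSIBLY_SHARED_STATES
```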
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Firefly Protocol===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Screen_Shot_2012-03-28_at_10.19.13_PM.png|thumb|upright=2.5|right|alt=Table of additional notations used in the Firefly protocol state transition diagram.|Table 1: Definitions of additional notations in the state transition diagram of Figure 3]]&lt;br /&gt;
&amp;lt;dd&amp;gt;The Firefly protocol was developed by DEC for its Firefly multiprocessor workstation.&lt;br /&gt;
&amp;lt;dd&amp;gt;The Firefly protocol also ensures that data is always valid if the tag matches. Hence, there is no explicit invalid state even though it reserves a miss mode bit for compulsory misses. The Firefly protocol consists of the following three states:&lt;br /&gt;
[[File:Screen_Shot_2012-03-28_at_10.19.25_PM.png|thumb|upright=2.5|center|alt=State transition diagram of the Firefly protocol.|Figure 3: State diagram of the Firefly protocol.]]&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Valid (V): '''This block has a coherent copy of the memory. There is only one copy of the data in caches.&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Dirty (D):''' The block is the only copy of the memory and it is incoherent. This is the only state that generates a write-back when the block is replaced in the cache.&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Shared (S):''' This block has a coherent copy of the memory. The data may be shared, but its content is not modified. Figure 3 depicts the state transition diagram of the Firefly protocol, and Table 1 details the additional notations in the state transition diagram.&lt;br /&gt;
&lt;br /&gt;
=Adaptive coherence protocols=&lt;br /&gt;
&lt;br /&gt;
A combined strategy such as an '''adaptive hybrid protocol''' can reduce the pathological behaviors of pure update and invalidate protocols. Such a protocol should be applicable to a wide range of network characteristics, and it should automatically adjust its behavior to achieve its target goals in the face of changes in traffic patterns, node mobility and other network characteristics.&lt;br /&gt;
&lt;br /&gt;
===Hardware architecture===&lt;br /&gt;
Most processor architectures use the same block size for memory-to-cache transfers, cache-to-memory transfers and coherence. &lt;br /&gt;
The approach described here instead uses different block sizes for transfer and coherence, based on application requirements.&lt;br /&gt;
A normal cache has the following parameters: capacity ('''C'''), block size ('''L''') and associativity ('''K'''). &lt;br /&gt;
A sector cache additionally divides each cache block into subblocks of size “b”. Compared to a normal cache ('''C, L, K'''), a sector cache ('''C, L, K, b''') requires one extra state and some extra bus lines to transmit bitmasks describing the status of the subblocks in a particular block, but it maintains the same number of tag and state bits for '''L'''. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
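&lt;br /&gt;
The relationship between the sector cache parameters can be sketched as follows. The class and property names are hypothetical. The example values (128K capacity read as 128 * 1024 bytes, 64-byte blocks, two-way associativity, 8-byte subblocks) are borrowed from the simulation setup reported in this article, and one status bit per subblock in the bitmask is an assumption made for illustration.&lt;br /&gt;
&lt;br /&gt;
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SectorCache:
    capacity: int        # C, total capacity in bytes
    block_size: int      # L, transfer block size in bytes
    associativity: int   # K, number of ways per set
    subblock_size: int   # b, coherence subblock size in bytes

    @property
    def num_blocks(self) -> int:
        """Number of cache blocks, each with one tag."""
        return self.capacity // self.block_size

    @property
    def num_sets(self) -> int:
        """Number of sets in a K-way set-associative organization."""
        return self.num_blocks // self.associativity

    @property
    def subblocks_per_block(self) -> int:
        """Subblocks per block; also the bitmask width if one bit is assumed per subblock."""
        return self.block_size // self.subblock_size


if __name__ == "__main__":
    cache = SectorCache(capacity=128 * 1024, block_size=64,
                        associativity=2, subblock_size=8)
    # Prints: 2048 1024 8
    print(cache.num_blocks, cache.num_sets, cache.subblocks_per_block)
```
&lt;br /&gt;
The tag count depends only on C and L, which matches the statement that a sector cache keeps the same number of tag bits as a normal cache with the same block size.&lt;br /&gt;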
   &lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Subblock protocol===&lt;br /&gt;
This snoopy-based protocol combines features of the [http://en.wikipedia.org/wiki/MESI_protocol Illinois MESI protocol] and write policies with subblock validation in order to obtain the advantages of both small and large cache block sizes by using subblocks.  &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dt&amp;gt;&lt;br /&gt;
=====Block states=====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Invalid:&amp;lt;/b&amp;gt; All subblocks are invalid&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Valid Exclusive:&amp;lt;/b&amp;gt; None of the valid subblocks in this block is present in any other cache. All subblocks that are Clean Shared may be written without a bus transaction. Any subblock in the block may be invalid. There may also be Dirty subblocks, which must be written back upon replacement.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Clean Shared:&amp;lt;/b&amp;gt; The block contains subblocks that are either Clean Shared or Invalid. The block can be replaced without a bus transaction.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty Shared:&amp;lt;/b&amp;gt; Subblocks in this block may be in any state. There may be Dirty subblocks, which must be written back on replacement.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dt&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''The finite state diagram of the block (line) states is as follows:'''&lt;br /&gt;
  &lt;br /&gt;
&amp;lt;center&amp;gt; [[File:line_protocol.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 4.  Finite state diagram of block''' &amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=====Subblock states=====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Invalid:&amp;lt;/b&amp;gt;  The subblock is invalid&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Clean Shared:&amp;lt;/b&amp;gt; A read access to the subblock will succeed. Unless the block that contains the subblock is in the Valid Exclusive state, a write to the subblock will force an invalidation transaction on the bus.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty Shared:&amp;lt;/b&amp;gt; The subblock is treated like a Clean Shared subblock, except that it must be written back on replacement. At most one cache will have a given subblock in either the Dirty Shared or Dirty state.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty:&amp;lt;/b&amp;gt; The subblock is exclusive to this cache. It must be written back on replacement. Read and write accesses to this subblock hit with no bus transaction.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''The finite state diagram of the subblock states is as follows:''' &lt;br /&gt;
&amp;lt;center&amp;gt;[[File:subblock_protocol.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 5.  Finite state diagram of Sub-block''' &amp;lt;/center&amp;gt; &lt;br /&gt;
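&lt;br /&gt;
The subblock-state rules listed above can be sketched in Python. This is a simplified reading of the state descriptions, not the full protocol: the type and function names are hypothetical, and treating a write to an Invalid subblock as a write miss that needs the bus is an assumption.&lt;br /&gt;
&lt;br /&gt;
```python
from enum import Enum


class BlockState(Enum):
    INVALID = "Invalid"
    VALID_EXCLUSIVE = "Valid Exclusive"
    CLEAN_SHARED = "Clean Shared"
    DIRTY_SHARED = "Dirty Shared"


class SubblockState(Enum):
    INVALID = "Invalid"
    CLEAN_SHARED = "Clean Shared"
    DIRTY_SHARED = "Dirty Shared"
    DIRTY = "Dirty"


def write_needs_bus_transaction(block: BlockState, sub: SubblockState) -> bool:
    """Return True if a write to the subblock must go to the bus."""
    if sub is SubblockState.INVALID:
        # Assumption: a write miss always needs a bus transaction.
        return True
    if sub is SubblockState.DIRTY:
        # Exclusive to this cache: writes hit with no bus transaction.
        return False
    # Clean Shared, and Dirty Shared which is treated like Clean Shared:
    # the write is silent only when the enclosing block is Valid Exclusive.
    return block is not BlockState.VALID_EXCLUSIVE


def must_write_back(sub: SubblockState) -> bool:
    """Return True if the subblock must be written back on replacement."""
    return sub in (SubblockState.DIRTY_SHARED, SubblockState.DIRTY)
```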
   &lt;br /&gt;
The basic idea of this protocol is to perform more data transfers between caches and fewer off-chip memory accesses.&lt;br /&gt;
In contrast with the Illinois protocol, on a read miss the cache that shares the block supplies the requested subblock together with all other Clean Shared and Dirty Shared subblocks in that block. &lt;br /&gt;
If the requested subblock is in main memory, the cache snoopers report which subblocks are cached, and memory provides the requested subblock along with the subblocks that are not currently cached.&lt;br /&gt;
On writes to clean subblocks and on write misses, the snooper invalidates only the subblock being written, which avoids the penalties associated with false sharing.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;b&amp;gt;In contrast to the Illinois MESI Protocol, which requires extra power to maintain the extra states and additional logic, the subblock protocol reduces the number of cache blocks compared to cache update protocols, and thus reduces power consumption. &amp;lt;/b&amp;gt; &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Read-snarfing protocol===&lt;br /&gt;
This is an enhancement to snoopy-cache coherence protocols that takes advantage of the inherent broadcast nature of the bus.&lt;br /&gt;
&lt;br /&gt;
In contrast with the MESI protocol, when one processor reloads data into its cache after a cache miss, the read-snarfing protocol also supplies the block to the other processors whose copies were invalidated in the past. Only one read is required to restore the block in all caches where it was invalidated.&lt;br /&gt;
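&lt;br /&gt;
The effect described above can be illustrated with a small Python sketch. It is a toy model under stated assumptions rather than an implementation of the protocol: the function names and the "V"/"I" state labels are hypothetical, and it assumes that every cache holding an invalidated copy will eventually need the block again.&lt;br /&gt;
&lt;br /&gt;
```python
def bus_read(states: dict, requester: str, snarfing: bool) -> dict:
    """Return the per-cache states of one block after `requester` reads it.

    `states` maps a cache name to "V" (valid copy) or "I" (tag present,
    data invalidated). With snarfing, every invalidated cache captures the
    data from the bus; without it, only the requester is refilled.
    """
    new_states = dict(states)
    for name, state in states.items():
        if state == "I" and (snarfing or name == requester):
            new_states[name] = "V"
    return new_states


def reads_to_restore_all(states: dict, snarfing: bool) -> int:
    """Count the bus reads needed until no cache holds an invalidated copy."""
    reads = 0
    current = dict(states)
    while "I" in current.values():
        requester = next(n for n, s in current.items() if s == "I")
        current = bus_read(current, requester, snarfing)
        reads += 1
    return reads


if __name__ == "__main__":
    block = {"P0": "V", "P1": "I", "P2": "I", "P3": "I"}
    print(reads_to_restore_all(block, snarfing=False))  # prints 3
    print(reads_to_restore_all(block, snarfing=True))   # prints 1
```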
&lt;br /&gt;
The protocol modifies the normal update protocol by updating only those subblocks that are modified, and updates need to be broadcast only while the data is actively shared. &lt;br /&gt;
The read-snarfing protocol maintains a write counter and an invalidation threshold (Tb) for each cache block “b” to overcome WI’s write-after-read problem. The protocol predicts the number of write operations that occur on a single cache block before a read request to the same cache block. An invalidation request is broadcast when the write counter reaches Tb, and the protocol dynamically adjusts the value of Tb based on the behavior of the program. &lt;br /&gt;
&lt;br /&gt;
A simple algorithm for the Read-snarfing Random Walk protocol is as follows:&lt;br /&gt;
Initially, Tb of each cache block b is set to 0. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
// R  = invalidation ratio, (Ci + Cr) / Cu&lt;br /&gt;
// Ci = cost in bus cycles of an invalidation transaction&lt;br /&gt;
// Cu = cost in bus cycles of an update transaction&lt;br /&gt;
// Cr = cost in bus cycles of reading a cache block&lt;br /&gt;
// write_run = length of the most recent write run, i.e. the number of&lt;br /&gt;
//             writes to block b before it is accessed by another processor&lt;br /&gt;
&lt;br /&gt;
if (write_run &amp;gt; R) {&lt;br /&gt;
    // Long write run: lower the threshold, but never below 0&lt;br /&gt;
    if (Tb &amp;gt; 0) {&lt;br /&gt;
        Tb--;&lt;br /&gt;
    }&lt;br /&gt;
} else {&lt;br /&gt;
    // Short write run: raise the threshold, but never above R&lt;br /&gt;
    if (R &amp;gt; Tb) {&lt;br /&gt;
        Tb++;&lt;br /&gt;
    }&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When the threshold drops to 0, a block is invalidated immediately, with no wasted updates.  &lt;br /&gt;
When a block is actively shared, Tb is adjusted upward so that the block is not invalidated. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&amp;lt;dd&amp;gt;&lt;br /&gt;
=====Example 1 =====&lt;br /&gt;
&amp;lt;dd&amp;gt;Suppose the invalidation ratio (R) = 5.&lt;br /&gt;
&amp;lt;dd&amp;gt;The current threshold of the block (Tb) = 3.&lt;br /&gt;
&amp;lt;dd&amp;gt;If the processor writes 4 times before the block is accessed by another processor, then according to the above logic Tb becomes 4.&lt;br /&gt;
&amp;lt;dd&amp;gt;This means Tb is at the best possible value and only updates are issued.&lt;br /&gt;
&lt;br /&gt;
=====Example 2 =====&lt;br /&gt;
&amp;lt;dd&amp;gt;Consider R = 5 and Tb = 3 for a particular block.&lt;br /&gt;
&amp;lt;dd&amp;gt;If the processor writes 10 times before the block is accessed by another processor,&lt;br /&gt;
&amp;lt;dd&amp;gt;Tb will be 2 (decreased).&lt;br /&gt;
&amp;lt;dd&amp;gt;The protocol then incurs a cost of 2 updates, 1 invalidate and 1 reread.&lt;br /&gt;
&amp;lt;dd&amp;gt;After 2 more such write runs, Tb will be 0 and invalidation will occur immediately. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
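&lt;br /&gt;
The threshold adjustment and the two examples above can be reproduced with a minimal Python sketch. The function names are hypothetical, and the sketch assumes that Tb may fall to 0 and never rises above R, as the surrounding text describes.&lt;br /&gt;
&lt;br /&gt;
```python
def invalidation_ratio(ci: int, cr: int, cu: int) -> float:
    """R = (Ci + Cr) / Cu: the number of updates that cost as much as one
    invalidation followed by one reread."""
    return (ci + cr) / cu


def adjust_threshold(tb: int, write_run: int, r: float) -> int:
    """Return the new invalidation threshold Tb after one completed write run."""
    if write_run > r:
        # Long write run: updates were wasted, so invalidate sooner next time.
        # Assumption: Tb may drop to 0 but not below it.
        return tb - 1 if tb > 0 else tb
    # Short write run: the block is actively shared, so tolerate more updates.
    return tb + 1 if r > tb else tb


if __name__ == "__main__":
    print(adjust_threshold(3, 4, 5))    # Example 1: prints 4
    print(adjust_threshold(3, 10, 5))   # Example 2: prints 2
```
&lt;br /&gt;
Calling adjust_threshold repeatedly with long write runs walks Tb down to 0, at which point the block is invalidated on the first write.&lt;br /&gt;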
&lt;br /&gt;
==Simulation result==&lt;br /&gt;
For all simulated results, the subblock size ('''b''') was '''8 bytes''' and the caches were two-way set-associative ('''K = 2'''). In order to minimize simulation time, only relatively large caches ('''C = 128K''') were simulated.&lt;br /&gt;
&lt;br /&gt;
The results below are presented first for the two applications ('''Gauss and Cholesky''') that do well with large block sizes, next for the three applications ('''MP3D, Topopt, and Pverify''') that perform better with smaller block sizes, and finally for Barnes, for which the choice of block size is not very important. The simulations compared normal and sector caches with '''1, 4, 16 and 32 processors''', where the normal cache uses the Illinois protocol and the sector cache uses the read-snarfing protocol. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Simulation]&amp;lt;/ref&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:image500.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 6.  (a-c) Gauss (128K cache) uses large cache block ''' &amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:Gauss32.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:cholesky_128k.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 7.  (a-c) Cholesky (128 K cache) application uses large cache block'''&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:gauss_32pro.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:Cholesky.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:mp3d.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 8.  (a-c) MP3D (128 K caches) application which uses smaller block sizes.'''&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:results_summary_1.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:results_summary_2.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:results_summary_3.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 9.  Final Simulation Results'''&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Conclusion==&lt;br /&gt;
This article described two techniques for improving application performance on bus-based multiprocessors. The first technique is the subblock cache coherence protocol, which combines large transfer blocks with small coherence subblocks to exploit spatial locality while avoiding false sharing.   &lt;br /&gt;
The second technique is read-snarfing, which reduces the number of read misses caused by earlier cache coherence protocol actions.  &lt;br /&gt;
The simulation results above showed that the subblock protocol works better than the MESI protocol for a 64-byte cache block size, but for small block sizes the Illinois protocol works better because the subblock protocol uses a large transfer block.&lt;br /&gt;
In addition, the read-snarfing protocol is effective in reducing the amount of data transferred and the number of bus transactions, which results in lower bus utilization and higher application speedups. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Quiz==&lt;br /&gt;
The following quiz is intended for the reader to benefit from this article and to help achieve a complete understanding of the research. There are a total of 10 multiple-choice questions.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;&lt;br /&gt;
'''1) What are the two commonly used coherence protocols? '''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Token-based &amp;amp; Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Update &amp;amp; Token-based&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Write-Invalidate &amp;amp; Write-Update &lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Both a and c&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''2) A coherency protocol is a protocol which maintains  __________      ____________ according to a specific ___________     ___________'''&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	System order / memory hierarchy&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Memory coherence / consistency model&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Bus hierarchy / system order&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''3) Increasing bus bandwidth is most commonly seen in what protocol?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Invalidate&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Token-based&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Write-Update &lt;br /&gt;
&amp;lt;dd&amp;gt;e)	Both a and b&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''4) An increase in conflict and capacity cache misses is mostly seen in which protocol?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Token-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Update &lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Write-Invalidate&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	Both a and d&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''5) Due to the good use of  _________  __________ some applications execute more quickly with a large cache block size, whereas others run better when cache block sizes are small by avoiding migratory data or ________  _________ between processors.'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Spatial locality / false-sharing&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Equidistant locality /Sequential consistency &lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Temporal locality/ true sharing &lt;br /&gt;
&amp;lt;dd&amp;gt;d)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''6) The Subblock Protocol has four subblock states: Invalid, Clean Shared, Dirty and __________. Name the missing state.'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Exclusive&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Valid&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Dirty Shared&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Shared&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''7) In the Subblock Protocol, all subblocks that are clean-shared may be written without a bus transaction. In what state is this achieved?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Valid Exclusive&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Invalid&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Clean Shared&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Dirty Shared&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''8) Which protocol maintains a counter and an invalidation threshold for each cache block to overcome the drawback of Write-Invalidate’s write-after-read problem?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Invalidate&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Subblock protocol &lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Read-snarfing protocol&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	Write-Update &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''9) Based on the simulation in this article, the Subblock Protocol worked better than the MESI protocol in specific executions. In what circumstances would the MESI protocol work better than the Subblock Protocol?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	For larger block sizes, since MESI uses large block sizes&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	When effectively supplying the block to other processors whose blocks were invalidated in the past&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	When MESI broadcasts updates only while the data is actively shared&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	For small block sizes, because the Subblock Protocol uses a large transfer block&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''10) Why was the Read-snarfing protocol stated to achieve lower bus utilization and higher application speedups?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Because instructions on any given processor execute in partial program order, but may not propagate in that order&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Because it is effective in reducing the amount of data transferred and the number of bus transactions&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Because Write-Update reduces all types of cache misses relative to Write-Invalidate, and therefore Read-snarfing gains performance&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Because multiple caches can be in the Shared state simultaneously&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	None of the above&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
*[http://www.computer.org/portal/web/csdl/doi/10.1109/HPCA.1995.386536 Two techniques for improving performance on bus-based multiprocessors]&lt;br /&gt;
&lt;br /&gt;
*[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Design of an Adaptive Cache Coherence Protocol for Large Scale Multiprocessors]&lt;br /&gt;
&lt;br /&gt;
*[http://www.lfbs.rwth-aachen.de/content/smi Shared Memory Interface]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references&amp;gt;&amp;lt;/references&amp;gt;&lt;/div&gt;</summary>
		<author><name>Sbasu3</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8b_va&amp;diff=60707</id>
		<title>CSC/ECE 506 Spring 2012/8b va</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8b_va&amp;diff=60707"/>
		<updated>2012-03-29T03:53:26Z</updated>

		<summary type="html">&lt;p&gt;Sbasu3: /* Write-Update Protocol */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Update and adaptive coherence protocols on real architectures, and power considerations=&lt;br /&gt;
 &lt;br /&gt;
===Introduction===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In a [http://en.wikipedia.org/wiki/Shared_memory shared-memory], shared-bus multiprocessor system, the cache coherence protocol plays an important role: it propagates changes made in one cache to the others. In most cases, however, update and invalidate coherence traffic is the main source of bus contention. Contention increases the number of busy bus cycles and therefore program execution time, because a processor may stall while its cache waits for the bus. To reduce bus contention, this article presents several adaptive hybrid protocol strategies for high-performance multiprocessors.&amp;lt;ref&amp;gt;[http://en.wikipedia.org/wiki/Shared_memory shared memory]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Coherence protocols===&lt;br /&gt;
In general terms, a coherency protocol is a protocol that keeps all the caches in a [http://en.wikipedia.org/wiki/Shared_memory shared memory] system consistent with one another. The protocol maintains memory coherence according to a specific consistency model. Older multiprocessors support the [http://en.wikipedia.org/wiki/Sequential_consistency sequential consistency model], while modern shared memory systems typically support the release consistency or weak consistency models. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:Cache_Coherency_Generic.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 1.  Multiple Caches of Shared Resource''' &amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Transitions between states in any specific implementation of these protocols may vary. For example, an implementation may choose different update and invalidation transitions such as update-on-read, update-on-write, invalidate-on-read, or invalidate-on-write. The choice of transition may affect the amount of inter-cache traffic, which in turn may affect the amount of cache bandwidth available for actual work. This should be taken into consideration in the design of distributed software that could cause strong contention between the caches of multiple processors.&lt;br /&gt;
Various models and protocols have been devised for maintaining cache coherence, such as the MSI protocol, MESI (also known as the Illinois protocol), MOSI, MOESI, MERSI, MESIF, write-once, Synapse, Berkeley, Firefly and Dragon protocols.&lt;br /&gt;
&lt;br /&gt;
For the purposes of this study, the article focuses on the two commonly used coherence protocols: &amp;lt;b&amp;gt;Write-Invalidate (WI) and Write-Update (WU)&amp;lt;/b&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
'''Write-Invalidate Protocol'''&lt;br /&gt;
In a [http://cs.gmu.edu/cne/modules/dsm/purple/wr_inval.html Write-Invalidate Protocol], a processor first invalidates the copies of the block held in all other processors' caches and then updates its own copy without further bus operations.&lt;br /&gt;
&lt;br /&gt;
=Write-Update Protocol=&lt;br /&gt;
In contrast, in a [http://cs.gmu.edu/cne/modules/dsm/purple/wr_update.html Write-Update Protocol], a processor broadcasts its updates to shared data to the other caches so that those caches stay coherent.&lt;br /&gt;
&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=====&amp;lt;dd&amp;gt;   &amp;lt;b&amp;gt;Disadvantages of Write-Invalidate Protocol&amp;lt;/b&amp;gt;=====&lt;br /&gt;
&amp;lt;dd&amp;gt; Any write by one processor invalidates the shared copies of the block in the other processors' caches, forcing those caches to issue bus requests to reload the new data, which increases bus bandwidth consumption. The problem is worse when one processor frequently updates a block while another processor stalls waiting to read it. For a sequence of '''n''' writes by one processor, each followed by a read from another processor, the '''WI''' protocol performs '''n''' invalidate and '''n''' cache block read operations.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Write-Invalidate]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=====&amp;lt;dd&amp;gt;   &amp;lt;b&amp;gt;Disadvantages of Write-Update Protocol&amp;lt;/b&amp;gt;=====&lt;br /&gt;
&amp;lt;dd&amp;gt; An update protocol is advantageous in the case above because it only performs '''n''' updates of the cache block. However, an update protocol may keep refreshing data in other processors' caches long after those processors have stopped using it, so fewer cache blocks are available for more useful data. This tends to increase conflict and capacity cache misses.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Write-Update]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
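&lt;br /&gt;
The trade-off between the two protocols can be made concrete by counting bus transactions for the write/read sequence described above. The following is only a minimal sketch, not a model of a real bus: the function names are hypothetical, and it assumes that every write by one processor is followed by exactly one read from a single other processor.&lt;br /&gt;

```python
def write_invalidate_transactions(n):
    """Bus transactions under WI for n writes, each followed by a remote read.

    Assumption of this sketch: each write costs one invalidation and each
    following read by the other processor costs one cache block reload.
    """
    return {"invalidate": n, "block_read": n, "update": 0}


def write_update_transactions(n):
    """Bus transactions under WU for the same sequence: one update per write."""
    return {"invalidate": 0, "block_read": 0, "update": n}


def total(transactions):
    """Total number of bus transactions in a count dictionary."""
    return sum(transactions.values())


if __name__ == "__main__":
    n = 10  # hypothetical number of write/read pairs
    print("WI total:", total(write_invalidate_transactions(n)))
    print("WU total:", total(write_update_transactions(n)))
```

For this sharing pattern WU issues half as many transactions as WI. The comparison reverses once the other processor stops reading the block, because every further update is then wasted.&lt;br /&gt;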
&lt;br /&gt;
====Consideration of cache architecture issue====&lt;br /&gt;
Application performance also depends directly on the cache block size. &lt;br /&gt;
Some applications execute more quickly with a large cache block size because they exhibit good spatial locality, whereas others run better with small cache blocks, which avoid migratory data or false sharing between processors. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;b&amp;gt;Write-Update protocol:&amp;lt;/b&amp;gt; Two common update protocols are the Dragon and Firefly protocols.&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&amp;lt;dd&amp;gt;&lt;br /&gt;
===Dragon Protocol===&lt;br /&gt;
[[File:Screen_Shot_2012-03-28_at_10.29.04_PM.png|thumb|upright=10.5|right|alt=Table of additional notations used in the Dragon protocol state transition diagram.|Table 1: Definitions of additional notations in the state transition diagram of Dragon Protocol]]&lt;br /&gt;
[[File:Screen Shot 2012-03-28 at 10.18.57 PM.png|thumb|upright=2.5|right|alt=State transition diagram of the Dragon protocol.|Figure 2: State transition diagram of the Dragon protocol]]&lt;br /&gt;
&amp;lt;dd&amp;gt;This protocol was first proposed by researchers at Xerox PARC for their Dragon multiprocessor system.&lt;br /&gt;
The Dragon protocol ensures that data is always valid if the tag matches. Hence, there is no explicit invalid state even though it reserves a miss mode bit for compulsory misses. The Dragon protocol consists of four states:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Exclusive (E): '''means that only one cache (this cache) has a copy of the block, and it has not been modified (the main memory is up-to-date)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Shared-clean (Sc): '''means that potentially two or more caches (including this one) have this block, and main memory may or may not be up-to-date&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Shared-modified (Sm):''' means that potentially two or more caches have this block, main memory is not up-to-date, and it is this cache's responsibility to update main memory when the block is replaced from the cache. A block may be in this state in only one cache at a time; however, it is quite possible that one cache has the block in this state while others have it in the shared-clean state.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Modified (M): '''signifies exclusive ownership: the block is modified and present in this cache alone, main memory is stale, and it is this cache's responsibility to supply the data and to update main memory on replacement.&lt;br /&gt;
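&lt;br /&gt;
How the four states interact on a write can be sketched in a few lines. The sketch below follows the usual textbook description of the Dragon protocol rather than the exact notation of Figure 2: the transaction name BusUpd, the shared_line flag and the function names are assumptions made for illustration.&lt;br /&gt;

```python
from enum import Enum


class DragonState(Enum):
    E = "Exclusive"
    SC = "Shared-clean"
    SM = "Shared-modified"
    M = "Modified"


def on_processor_write(state, shared_line):
    """Return (next_state, bus_transaction) for a processor write hit.

    shared_line is True when another cache signals that it holds the block
    (an assumed signal, named after the usual textbook description).
    """
    if state in (DragonState.E, DragonState.M):
        # Sole cached copy: write locally, no bus transaction is needed.
        return DragonState.M, None
    # Sc or Sm: broadcast the written data so the other copies stay valid.
    next_state = DragonState.SM if shared_line else DragonState.M
    return next_state, "BusUpd"


def on_snooped_update(state):
    """A snooped update means another cache now owns the block, so a local
    shared copy is refreshed and ends up Shared-clean."""
    if state in (DragonState.SC, DragonState.SM):
        return DragonState.SC
    raise ValueError("E and M blocks are not shared, so no update can hit them")


if __name__ == "__main__":
    print(on_processor_write(DragonState.SC, shared_line=True))
    print(on_snooped_update(DragonState.SM))
```

Note that no path in the sketch leads to an invalid state, which matches the statement above that the protocol keeps every tagged block valid.&lt;br /&gt;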
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Firefly Protocol===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Screen_Shot_2012-03-28_at_10.19.13_PM.png|thumb|upright=2.5|right|alt=Table of additional notations used in the Firefly state transition diagram.|Table 1: Definitions of additional notations in the state transition diagram of Figure 3]]&lt;br /&gt;
&amp;lt;dd&amp;gt;The Firefly protocol was developed by DEC for its Firefly multiprocessor workstation.&lt;br /&gt;
&amp;lt;dd&amp;gt;The Firefly protocol also ensures that data is always valid if the tag matches. Hence, there is no explicit invalid state, even though it reserves a miss mode bit for compulsory misses. The Firefly protocol consists of the following three states:&lt;br /&gt;
[[File:Screen_Shot_2012-03-28_at_10.19.25_PM.png|thumb|upright=2.5|center|alt=State transition diagram of the Firefly protocol.|Figure 3: State diagram of the Firefly protocol.]]&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Valid (V): '''The block holds a copy that is coherent with memory, and it is the only copy of the data in any cache.&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Dirty (D):''' The block is the only cached copy of the data and it is incoherent with memory. This is the only state that generates a write-back when the block is replaced in the cache.&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Shared (S):''' The block holds a copy that is coherent with memory. The data may be shared with other caches, but its content is not modified. Figure 3 depicts the state transition diagram of the Firefly protocol, and Table 1 details the additional notations in the state transition diagram.&lt;br /&gt;
&lt;br /&gt;
=Adaptive coherence protocols=&lt;br /&gt;
&lt;br /&gt;
Combining both strategies in an '''adaptive hybrid protocol''' can reduce the pathological behaviors of pure update and pure invalidate protocols. Such a protocol should be applicable to a wide range of network characteristics, and it should automatically adjust its behavior to achieve its target goals as traffic patterns, node mobility and other network characteristics change.&lt;br /&gt;
&lt;br /&gt;
===Hardware architecture===&lt;br /&gt;
Most processor architectures use the same block size for memory-to-cache transfers, cache-to-memory transfers and coherence. &lt;br /&gt;
The approach described here instead uses different block sizes for transfer and for coherence, based on application requirements.&lt;br /&gt;
A conventional cache has the following parameters: capacity ('''C'''), block size ('''L''') and associativity ('''K'''). &lt;br /&gt;
A sector cache additionally divides each cache block into subblocks of size '''b'''. Compared with a conventional cache ('''C, L, K'''), a sector cache ('''C, L, K, b''') requires one extra state and some extra bus lines to transmit bitmasks describing the status of the subblocks in a particular block, but it maintains the same number of tag and state bits for '''L'''. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
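&lt;br /&gt;
To see how the four parameters relate, the geometry of a sector cache can be derived with simple arithmetic. This is a hedged sketch: the function name and the returned field names are hypothetical, and the one-bit-per-subblock bitmask width is an assumption, since the text above only states that bitmasks describing subblock status are transmitted.&lt;br /&gt;

```python
def sector_cache_geometry(capacity, block_size, associativity, subblock_size):
    """Derive basic geometry for a sector cache (C, L, K, b); sizes in bytes."""
    if block_size % subblock_size != 0:
        raise ValueError("block size L must be a multiple of subblock size b")
    blocks = capacity // block_size
    if blocks % associativity != 0:
        raise ValueError("number of blocks must be a multiple of associativity K")
    subblocks_per_block = block_size // subblock_size
    return {
        "blocks": blocks,
        "sets": blocks // associativity,
        "subblocks_per_block": subblocks_per_block,
        # Assumption: one status bit per subblock in the transmitted bitmask.
        "bitmask_bits": subblocks_per_block,
    }


if __name__ == "__main__":
    # C = 128K, L = 64 bytes, K = 2, b = 8 bytes
    print(sector_cache_geometry(128 * 1024, 64, 2, 8))
```

With C = 128K, L = 64 bytes, K = 2 and b = 8 bytes, the sketch yields 2048 blocks in 1024 sets, with 8 subblocks per block.&lt;br /&gt;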
   &lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Subblock protocol===&lt;br /&gt;
This snoopy protocol combines features of the [http://en.wikipedia.org/wiki/MESI_protocol Illinois MESI protocol] and of write policies with subblock validation, so that subblocks give it the advantages of both small and large cache block sizes.  &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dt&amp;gt;&lt;br /&gt;
=====Block states=====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Invalid:&amp;lt;/b&amp;gt; All subblocks are invalid&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Valid Exclusive:&amp;lt;/b&amp;gt; None of the valid subblocks in this block is present in any other cache. All subblocks that are Clean Shared may be written without a bus transaction. Any subblock in the block may be invalid. There may also be Dirty subblocks, which must be written back upon replacement.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Clean Shared:&amp;lt;/b&amp;gt; The block contains subblocks that are either Clean Shared or Invalid. The block can be replaced without a bus transaction.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty Shared:&amp;lt;/b&amp;gt; Subblocks in this block may be in any state. There may be Dirty subblocks, which must be written back on replacement.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dt&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''The finite state diagram of the block (line) states is as follows:'''&lt;br /&gt;
  &lt;br /&gt;
&amp;lt;center&amp;gt; [[File:line_protocol.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 4.  Finite state diagram of block''' &amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=====Subblock states=====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Invalid:&amp;lt;/b&amp;gt;  The subblock is invalid&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Clean Shared:&amp;lt;/b&amp;gt; A read access to the subblock will succeed. Unless the block that contains the subblock is in the Valid Exclusive state, a write to the subblock will force an invalidation transaction on the bus.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty Shared:&amp;lt;/b&amp;gt; The subblock is treated like a Clean Shared subblock, except that it must be written back on replacement. At most one cache will have a given subblock in either the Dirty Shared or Dirty state.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty:&amp;lt;/b&amp;gt; The subblock is exclusive to this cache. It must be written back on replacement. Read and write accesses to this subblock hit with no bus transaction.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Finite state diagram of subblock is as follows:''' &lt;br /&gt;
&amp;lt;center&amp;gt;[[File:subblock_protocol.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 5.  Finite state diagram of Sub-block''' &amp;lt;/center&amp;gt; &lt;br /&gt;
   &lt;br /&gt;
The basic idea of this protocol is to transfer more data between caches and to access off-chip memory less often.&lt;br /&gt;
In contrast with the Illinois protocol, on a read miss the cache that holds the block supplies the requested subblock together with all other Clean Shared and Dirty Shared subblocks in that block. &lt;br /&gt;
If the requested subblock is only in main memory, the cache snoopers report which subblocks are cached, and memory provides the requested subblock along with the subblocks that are not currently cached.&lt;br /&gt;
On writes to clean subblocks and on write misses, the snooper invalidates only the subblock being written, which avoids the penalties associated with false sharing.&lt;br /&gt;
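&lt;br /&gt;
The write rules stated in the block and subblock state descriptions can be summarized as a small decision function. This is a sketch under assumptions: the function name and the three result labels are hypothetical, and only the cases spelled out in the state descriptions above are modeled.&lt;br /&gt;

```python
BLOCK_STATES = {"Invalid", "Valid Exclusive", "Clean Shared", "Dirty Shared"}
SUBBLOCK_STATES = {"Invalid", "Clean Shared", "Dirty Shared", "Dirty"}


def write_action(block_state, subblock_state):
    """Decide what a processor write to one subblock requires.

    Returns "hit" when the write completes without a bus transaction,
    "invalidate" when other copies of this subblock must be invalidated
    first, and "write_miss" when the subblock is not present.
    """
    if block_state not in BLOCK_STATES:
        raise ValueError("unknown block state: " + block_state)
    if subblock_state not in SUBBLOCK_STATES:
        raise ValueError("unknown subblock state: " + subblock_state)
    if subblock_state == "Invalid":
        return "write_miss"
    if subblock_state == "Dirty":
        # Exclusive to this cache: reads and writes hit locally.
        return "hit"
    # Clean Shared, or Dirty Shared which is treated like Clean Shared.
    if block_state == "Valid Exclusive":
        return "hit"
    return "invalidate"


if __name__ == "__main__":
    print(write_action("Clean Shared", "Clean Shared"))
    print(write_action("Valid Exclusive", "Clean Shared"))
```

Because the decision is made per subblock, a write to one subblock leaves the other subblocks of the same block untouched in remote caches, which is how the protocol avoids false-sharing invalidations.&lt;br /&gt;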
&lt;br /&gt;
&amp;lt;b&amp;gt;In contrast to the Illinois MESI protocol, which requires extra power cycles to maintain its extra states and additional logic, the subblock protocol reduces the number of cache blocks compared to cache update protocols, and thus reduces power consumption.&amp;lt;/b&amp;gt; &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Read-snarfing protocol===&lt;br /&gt;
This is an enhancement to snoopy-cache coherence protocols that takes advantage of the inherent broadcast nature of the bus.&lt;br /&gt;
&lt;br /&gt;
In contrast with the MESI protocol, when one processor reloads a block into its cache after a cache miss, the read-snarfing protocol also supplies the block to the other processors whose copies were invalidated in the past. Only one read is therefore required to restore the block in all caches whose copies were invalidated.&lt;br /&gt;
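&lt;br /&gt;
The saving can be illustrated by replaying a sequence of reads against a toy model of one block. This is a minimal sketch, assuming each cache tracks the block as "valid", "invalid" (tag still present, data invalidated) or "absent"; the function name and these labels are hypothetical.&lt;br /&gt;

```python
def count_bus_reads(states, readers, snarfing):
    """Count bus reads while caches re-read one block.

    states: per-cache status of the block ("valid", "invalid" or "absent").
    readers: indices of the caches, in the order in which they read the block.
    snarfing: when True, every cache holding an invalidated copy grabs the
              data from the bus whenever any cache reads the block.
    """
    states = list(states)  # work on a copy
    bus_reads = 0
    for cache in readers:
        if states[cache] == "valid":
            continue  # read hit, no bus transaction
        bus_reads += 1
        states[cache] = "valid"
        if snarfing:
            states = ["valid" if s == "invalid" else s for s in states]
    return bus_reads


if __name__ == "__main__":
    start = ["absent", "invalid", "invalid", "invalid"]
    print(count_bus_reads(start, [1, 2, 3], snarfing=False))
    print(count_bus_reads(start, [1, 2, 3], snarfing=True))
```

In this toy model three caches with invalidated copies need three bus reads without snarfing, but a single bus read with snarfing.&lt;br /&gt;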
&lt;br /&gt;
The protocol modifies the normal update protocol by updating only those subblocks that are modified, and updates only need to be broadcast while the data is actively shared. &lt;br /&gt;
The read-snarfing protocol maintains a counter and an invalidation threshold (Tb) for each cache block “b” to overcome the write-after-read drawback of WI. The protocol predicts the number of writes to a cache block that occur before the next read request to the same block. An invalidation request is broadcast when the write counter reaches Tb, and the protocol dynamically adjusts the value of Tb based on the behavior of the executing program. &lt;br /&gt;
&lt;br /&gt;
A simple algorithm for the Read-snarfing Random Walk protocol is as follows.&lt;br /&gt;
Initially, Tb of each cache block b is set to 0. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
// Write run: number of writes to the block before it is accessed by another processor&lt;br /&gt;
if (most_recent_write_run &amp;gt; R) {&lt;br /&gt;
    if (Tb &amp;gt; 1) {&lt;br /&gt;
        Tb--;&lt;br /&gt;
    }&lt;br /&gt;
} else {&lt;br /&gt;
    if (R &amp;gt; Tb) {&lt;br /&gt;
        Tb++;&lt;br /&gt;
    }&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
R  = invalidation ratio, which is (Ci + Cr) / Cu&lt;br /&gt;
Ci = the cost in bus cycles of an invalidation transaction&lt;br /&gt;
Cu = the cost in bus cycles of an update transaction&lt;br /&gt;
Cr = the cost in bus cycles of reading a cache block&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When the threshold is reduced to 0, the block is invalidated immediately, with no wasted updates.  &lt;br /&gt;
When the block is actively shared, Tb is adjusted upward so that the block is not invalidated. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&amp;lt;dd&amp;gt;&lt;br /&gt;
=====Example 1 =====&lt;br /&gt;
&amp;lt;dd&amp;gt;Suppose the invalidation ratio (R) = 5.&lt;br /&gt;
&amp;lt;dd&amp;gt;The current threshold of the block (Tb) = 3.&lt;br /&gt;
&amp;lt;dd&amp;gt;If the processor writes 4 times before the block is accessed by another processor then, according to the above logic, Tb becomes 4.&lt;br /&gt;
&amp;lt;dd&amp;gt;This means Tb is at the best possible value and only updates are issued.&lt;br /&gt;
&lt;br /&gt;
=====Example 2 =====&lt;br /&gt;
&amp;lt;dd&amp;gt;Consider R = 5 and Tb = 3 for a particular block.&lt;br /&gt;
&amp;lt;dd&amp;gt;If the processor writes 10 times before the block is accessed by another processor,&lt;br /&gt;
&amp;lt;dd&amp;gt;Tb becomes 2 (decreased).&lt;br /&gt;
&amp;lt;dd&amp;gt;So the protocol can incur a cost of 2 updates, 1 invalidate and 1 reread.&lt;br /&gt;
&amp;lt;dd&amp;gt;After 2 more writes, Tb will be 0 and invalidation will occur immediately. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
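&lt;br /&gt;
The first adjustment step of both examples can be reproduced with a direct transcription of the pseudocode. The sketch below is illustrative only: the function and parameter names are hypothetical, the bus costs in the usage lines are made-up values chosen to give R = 5, and the code follows the pseudocode literally, including the guard that stops decrementing once Tb reaches 1.&lt;br /&gt;

```python
def invalidation_ratio(ci, cr, cu):
    """R = (Ci + Cr) / Cu, with all costs given in bus cycles."""
    return (ci + cr) / cu


def adjust_threshold(tb, write_run, r):
    """Apply one adjustment step of the pseudocode to the threshold tb.

    write_run is the number of writes seen before another processor
    accessed the block.
    """
    if write_run > r:
        # Long write run: move toward invalidating sooner.
        if tb > 1:
            tb -= 1
    elif r > tb:
        # Short write run: the block is actively shared, allow more updates.
        tb += 1
    return tb


if __name__ == "__main__":
    r = invalidation_ratio(ci=2, cr=8, cu=2)  # hypothetical costs, R = 5.0
    print(adjust_threshold(3, 4, r))   # Example 1
    print(adjust_threshold(3, 10, r))  # Example 2
```

With R = 5 and Tb = 3, a write run of 4 raises Tb to 4 as in Example 1, and a write run of 10 lowers it to 2 as in Example 2.&lt;br /&gt;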
&lt;br /&gt;
==Simulation result==&lt;br /&gt;
For all simulated results, the subblock size ('''b''') was '''8 bytes''' and the caches were two-way set-associative ('''K = 2'''). To limit simulation time, only relatively large caches ('''C = 128K''') were simulated.&lt;br /&gt;
&lt;br /&gt;
First come the results for the two applications ('''Gauss and Cholesky''') that do well with large block sizes. Next come the results for the three applications ('''MP3D, Topopt, and Pverify''') that perform better with smaller block sizes. Finally come the results for Barnes, for which the choice of block size is not very important. The simulations compare conventional and sector caches with '''1, 4, 16 and 32 processors''', where the conventional cache uses the Illinois protocol and the sector cache uses the read-snarfing protocol. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Simulation]&amp;lt;/ref&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:image500.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 6.  (a-c) Gauss (128K cache) uses large cache block ''' &amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:Gauss32.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:cholesky_128k.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 7.  (a-c) Cholesky (128 K cache) application uses large cache block'''&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:gauss_32pro.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:Cholesky.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:mp3d.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 8.  (a-c) MP3D (128 K caches) application which uses smaller block sizes.'''&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:results_summary_1.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:results_summary_2.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:results_summary_3.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 9.  Final Simulation Results'''&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Conclusion==&lt;br /&gt;
This article described two techniques for improving application performance on bus-based multiprocessors. The first technique is the subblock cache coherence protocol, which combines a large block with small subblocks in order to exploit spatial locality while avoiding false sharing.   &lt;br /&gt;
The second technique is read-snarfing, which reduces the number of read misses caused by earlier cache coherence protocol actions.  &lt;br /&gt;
The simulation results above show that the subblock protocol works better than the MESI protocol for the 64-byte cache block size, but for small block sizes the Illinois protocol works better because the subblock protocol uses a large transfer block.&lt;br /&gt;
In addition, the read-snarfing protocol is effective in reducing the amount of data transferred and the number of bus transactions, which results in lower bus utilization and higher application speedups. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Quiz==&lt;br /&gt;
The following quiz is intended for the reader to benefit from this article and to help achieve a complete understanding of the research. There are a total of 10 multiple-choice questions.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;&lt;br /&gt;
'''1) What are the two commonly used coherence protocols? '''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Token-based &amp;amp; Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Update &amp;amp; Token-based&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Write-Invalidate &amp;amp; Write-Update &lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Both a and c&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''2) A coherency protocol is a protocol which maintains  __________      ____________ according to a specific ___________     ___________'''&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	System order / memory hierarchy&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Memory coherence / consistency model&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Bus hierarchy / system order&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''3) Increased bus bandwidth consumption is most commonly seen in which protocol?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Invalidate&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Token-based&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Write-Update &lt;br /&gt;
&amp;lt;dd&amp;gt;e)	Both a and b&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''4) Increased conflict and capacity cache misses are mostly seen in which protocol?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Token-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Update &lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Write-Invalidate&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	Both a and d&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''5) Because they make good use of  _________  __________, some applications execute more quickly with a large cache block size, whereas others run better when cache block sizes are small, which avoids migratory data or ________  _________ between processors.'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Spatial locality / false sharing&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Equidistant locality / sequential consistency&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Temporal locality / true sharing&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''6) The Subblock Protocol defines four subblock states: Invalid, Clean Shared, Dirty and __________. Which state is missing?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Exclusive&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Valid&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Dirty Shared&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Shared&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''7) In the Subblock Protocol, Clean Shared subblocks may be written without a bus transaction when their block is in a particular state. Which block state is it?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Valid Exclusive&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Invalid&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Clean Shared&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Dirty Shared&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''8) Which protocol maintains a counter and an invalidation threshold for each cache block to overcome the write-after-read drawback of Write-Invalidate?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Invalidate&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Subblock protocol &lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Read-snarfing protocol&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	Write-Update &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''9) Based on the simulation in this article, the Subblock Protocol worked better than the MESI protocol in specific executions. Under what circumstances would the MESI protocol work better than the Subblock Protocol?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	For larger block sizes, since MESI uses large block sizes&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	For effectively supplying the block to other processors whose blocks were invalidated in the past&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Because MESI only needs to broadcast updates when the data is actively shared&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	For small block sizes, because the Subblock Protocol uses a large transfer block&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''10) Why was the Read-snarfing protocol said to achieve lower bus utilization and higher application speedups?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Because instructions on any given processor execute in partial program order, but may not propagate in that order&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Because it’s effective in reducing the amount of data transferred and the number of bus transactions &lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Because Write Update reduces all types of cache misses relative to Write Invalidate, therefore Read-snarfing gains performance &lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Because multiple caches can be in its Shared State simultaneously.&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	None of the above&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
*[http://www.computer.org/portal/web/csdl/doi/10.1109/HPCA.1995.386536 Two techniques for improving performance on bus-based multiprocessors]&lt;br /&gt;
&lt;br /&gt;
*[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Design of an Adaptive Cache Coherence Protocol for Large Scale Multiprocessors]&lt;br /&gt;
&lt;br /&gt;
*[http://www.lfbs.rwth-aachen.de/content/smi Shared Memory Interface]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references&amp;gt;&amp;lt;/references&amp;gt;&lt;/div&gt;</summary>
		<author><name>Sbasu3</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8b_va&amp;diff=60706</id>
		<title>CSC/ECE 506 Spring 2012/8b va</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8b_va&amp;diff=60706"/>
		<updated>2012-03-29T03:52:17Z</updated>

		<summary type="html">&lt;p&gt;Sbasu3: /* Write-Invalidate Protocol */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Update and adaptive coherence protocols on real architectures, and power considerations=&lt;br /&gt;
 &lt;br /&gt;
===Introduction===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In a [http://en.wikipedia.org/wiki/Shared_memory shared memory], shared-bus multiprocessor system, the cache coherence protocol plays an important role in propagating changes from one cache to the others. In most cases, however, update and invalidate coherence traffic is the main source of bus contention, which increases the number of busy bus cycles and therefore program execution time, because a processor may stall while its cache waits for the bus. To reduce bus contention, this article presents several adaptive hybrid protocol strategies for high-performance multiprocessors.&amp;lt;ref&amp;gt;[http://en.wikipedia.org/wiki/Shared_memory shared memory]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Coherence protocols===&lt;br /&gt;
In general terms, a coherence protocol maintains consistency between all the caches in a [http://en.wikipedia.org/wiki/Shared_memory shared memory] system. The protocol maintains memory coherence according to a specific consistency model. Older multiprocessors support the [http://en.wikipedia.org/wiki/Sequential_consistency sequential consistency model], while modern shared memory systems typically support the release consistency or weak consistency models. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:Cache_Coherency_Generic.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 1.  Multiple Caches of Shared Resource''' &amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Transitions between states in any specific implementation of these protocols may vary. For example, an implementation may choose different update and invalidation transitions such as update-on-read, update-on-write, invalidate-on-read, or invalidate-on-write. The choice of transition may affect the amount of inter-cache traffic, which in turn may affect the amount of cache bandwidth available for actual work. This should be taken into consideration in the design of distributed software that could cause strong contention between the caches of multiple processors.&lt;br /&gt;
Various models and protocols have been devised for maintaining cache coherence, such as MSI, MESI (also known as the Illinois protocol), MOSI, MOESI, MERSI, MESIF, write-once, Synapse, Berkeley, Firefly, and Dragon.&lt;br /&gt;
&lt;br /&gt;
For the purpose of the main study, this article will focus on the two commonly used coherence protocols: &amp;lt;b&amp;gt;Write-Invalidate (WI) and Write-Update (WU)&amp;lt;/b&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Write-Update Protocol=&lt;br /&gt;
In a [http://cs.gmu.edu/cne/modules/dsm/purple/wr_update.html Write-Update protocol], a processor broadcasts its updates to shared data to the other caches so that those caches stay coherent.&lt;br /&gt;
&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=====&amp;lt;dd&amp;gt;   &amp;lt;b&amp;gt;Disadvantages of Write-Invalidate Protocol&amp;lt;/b&amp;gt;=====&lt;br /&gt;
&amp;lt;dd&amp;gt; Any write by one processor invalidates the shared copies of that cache block in the other processors, forcing those caches to issue bus requests to reload the new data, which consumes bus bandwidth. This is worse when one processor frequently updates a block while another processor stalls waiting to read the same block. For a sequence of '''n''' writes by one processor, each followed by a read from another processor, the '''WI''' protocol performs '''n''' invalidations and '''n''' cache block reads.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Write-Invalidate]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=====&amp;lt;dd&amp;gt;   &amp;lt;b&amp;gt;Disadvantages of Write-Update Protocol&amp;lt;/b&amp;gt;=====&lt;br /&gt;
&amp;lt;dd&amp;gt; An update protocol is advantageous in the case above because it only updates the cached copies '''n''' times. However, an update protocol can keep refreshing data in other processors' caches long after those processors have stopped using it, so fewer cache blocks are available for more useful data. This tends to increase conflict and capacity misses.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Write-Update]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
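The bus cost of the two protocols for a sequence of '''n''' writes, each followed by a remote read, can be sketched numerically. This is only an illustrative Python sketch: the function names and the cycle costs used in the example are hypothetical and are not taken from the cited paper. It counts '''n''' invalidations plus '''n''' block reads for WI, and '''n''' updates for WU, as described above.&lt;br /&gt;

```python
def write_invalidate_cycles(n, inv_cycles, read_cycles):
    """Bus cycles under WI: every write invalidates, every later read reloads the block."""
    return n * (inv_cycles + read_cycles)


def write_update_cycles(n, update_cycles):
    """Bus cycles under WU: every write broadcasts one update; readers then hit locally."""
    return n * update_cycles


if __name__ == "__main__":
    # Hypothetical costs: 2 cycles per invalidation, 8 per block read, 3 per update.
    print(write_invalidate_cycles(10, 2, 8))
    print(write_update_cycles(10, 3))
```

The sketch ignores the extra conflict and capacity misses that an update protocol may cause, so it shows only the bus-traffic side of the trade-off.&lt;br /&gt;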
&lt;br /&gt;
====Consideration of cache architecture issue====&lt;br /&gt;
Application execution time also depends directly on the cache block size. &lt;br /&gt;
Some applications execute more quickly with a large cache block size because they exhibit good spatial locality, whereas others run better with small cache blocks, which avoid migratory data or false sharing between processors. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;b&amp;gt;Write-Update protocol:&amp;lt;/b&amp;gt; Two common update protocols are the Dragon and Firefly protocols.&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&amp;lt;dd&amp;gt;&lt;br /&gt;
===Dragon Protocol===&lt;br /&gt;
[[File:Screen_Shot_2012-03-28_at_10.29.04_PM.png|thumb|upright=10.5|right|alt=Table of the notations used in the Dragon protocol state transition diagram.|Table 1: Definitions of additional notations in the state transition diagram of Dragon Protocol]]&lt;br /&gt;
[[File:Screen Shot 2012-03-28 at 10.18.57 PM.png|thumb|upright=2.5|right|alt=State transition diagram of the Dragon protocol.|Figure 2: State transition diagram of the Dragon protocol]]&lt;br /&gt;
&amp;lt;dd&amp;gt;This protocol was first proposed by researchers at Xerox PARC for their Dragon multiprocessor system.&lt;br /&gt;
The Dragon protocol ensures that data is always valid if the tag matches. Hence, there is no explicit invalid state even though it reserves a miss mode bit for compulsory misses. The Dragon protocol consists of four states:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Exclusive (E): '''means that only one cache (this cache) has a copy of the block, and it has not been modified (main memory is up-to-date).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Shared-clean (Sc): '''means that potentially two or more caches (including this one) have this block, and main memory may or may not be up-to-date.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Shared-modified (Sm):''' means that potentially two or more caches have this block, main memory is not up-to-date, and it is this cache's responsibility to update main memory when the block is replaced from the cache. A block may be in this state in only one cache at a time; however, one cache may have the block in this state while others have it in the shared-clean state.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Modified (M): '''signifies exclusive ownership: the block is modified and present in this cache alone, main memory is stale, and it is this cache's responsibility to supply the data and to update main memory on replacement.&lt;br /&gt;
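&amp;lt;dd&amp;gt;The four states above can be summarized by two properties: whether this cache must write the block back on replacement, and whether other caches may also hold a copy. The following Python sketch encodes only what the state descriptions say; the names &lt;code&gt;DragonState&lt;/code&gt;, &lt;code&gt;must_write_back&lt;/code&gt; and &lt;code&gt;may_be_shared&lt;/code&gt; are illustrative and not part of any real implementation.&lt;br /&gt;

```python
from enum import Enum


class DragonState(Enum):
    E = "Exclusive"
    SC = "Shared-clean"
    SM = "Shared-modified"
    M = "Modified"


# States in which this cache is responsible for updating main memory on replacement.
OWNER_STATES = {DragonState.SM, DragonState.M}

# States in which other caches may also hold a copy of the block.
SHARED_STATES = {DragonState.SC, DragonState.SM}


def must_write_back(state):
    """True if replacing a block in this state requires updating main memory."""
    return state in OWNER_STATES


def may_be_shared(state):
    """True if other caches may hold the block as well."""
    return state in SHARED_STATES


if __name__ == "__main__":
    for state in DragonState:
        print(state.value, must_write_back(state), may_be_shared(state))
```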
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Firefly Protocol===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Screen_Shot_2012-03-28_at_10.19.13_PM.png|thumb|upright=2.5|right|alt=Table of the notations used in the Firefly protocol state transition diagram.|Table 1: Definitions of additional notations in the state transition diagram of Figure 3]]&lt;br /&gt;
&amp;lt;dd&amp;gt;The Firefly protocol was developed by DEC for its Firefly multiprocessor workstation.&lt;br /&gt;
&amp;lt;dd&amp;gt;The Firefly protocol also ensures that data is always valid if the tag matches. Hence, there is no explicit invalid state even though it reserves a miss mode bit for compulsory misses. The Firefly protocol consists of the following three states:&lt;br /&gt;
[[File:Screen_Shot_2012-03-28_at_10.19.25_PM.png|thumb|upright=2.5|center|alt=State transition diagram of the Firefly protocol.|Figure 3: State diagram of the Firefly protocol.]]&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Valid (V): '''This block has a coherent copy of the memory. There is only one copy of the data in caches.&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Dirty (D):''' This block is the only cached copy of the data, and it is not coherent with memory. This is the only state that generates a write-back when the block is replaced in the cache.&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Shared (S):''' This block has a coherent copy of the memory. The data may be shared, but its content is not modified.&lt;br /&gt;
&amp;lt;dd&amp;gt;Figure 3 depicts the state transition diagram of the Firefly protocol, and Table 1 details the additional notations in the state transition diagram.&lt;br /&gt;
&lt;br /&gt;
=Adaptive coherence protocols=&lt;br /&gt;
&lt;br /&gt;
A combined strategy such as an '''adaptive hybrid protocol''' can reduce the pathological behaviors of pure update and pure invalidate protocols. Such a protocol should be applicable to a wide range of network characteristics, and it should automatically adjust its behavior to achieve its target goals as traffic patterns, node mobility, and other network characteristics change.&lt;br /&gt;
&lt;br /&gt;
===Hardware architecture===&lt;br /&gt;
Most processor architectures use the same block size for memory-to-cache transfers, cache-to-memory transfers, and coherence. &lt;br /&gt;
The approach described here uses different block sizes for transfer and coherence, based on application requirements.&lt;br /&gt;
A normal cache has the following parameters: capacity ('''C'''), block size ('''L'''), and associativity ('''K'''). &lt;br /&gt;
A sector cache additionally divides each cache block into subblocks of size '''b'''. Compared with a normal cache ('''C, L, K'''), a sector cache ('''C, L, K, b''') requires one extra state and some extra bus lines to transmit bitmasks describing the status of the subblocks in a particular block, but it maintains the same number of tag and state bits for '''L'''. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
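To make the sector cache parameters concrete, the Python sketch below derives the number of blocks, sets, and subblocks per block from ('''C, L, K, b'''). It is an illustration only: the function name is hypothetical, and the example values (C = 128K, L = 64 bytes, K = 2, b = 8 bytes) are the ones mentioned in the simulation and conclusion sections of this article, with 128K assumed to mean 128 × 1024 bytes.&lt;br /&gt;

```python
def sector_cache_geometry(capacity, block_size, associativity, subblock_size):
    """Derive block, set, and subblock counts for a sector cache (C, L, K, b)."""
    blocks = capacity // block_size
    return {
        "blocks": blocks,
        "sets": blocks // associativity,
        # One status bit per subblock would give a bitmask of this width.
        "subblocks_per_block": block_size // subblock_size,
    }


if __name__ == "__main__":
    # C = 128K bytes (assumed), L = 64 bytes, K = 2, b = 8 bytes.
    print(sector_cache_geometry(128 * 1024, 64, 2, 8))
```

With these values, each block holds eight subblocks, so coherence can act on 8-byte units while transfers still use the 64-byte block.&lt;br /&gt;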
   &lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Subblock protocol===&lt;br /&gt;
This snoopy protocol combines features of the [http://en.wikipedia.org/wiki/MESI_protocol Illinois MESI protocol] and write policies with subblock validation, using subblocks to gain the advantages of both small and large cache block sizes.  &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dt&amp;gt;&lt;br /&gt;
=====Block states=====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Invalid:&amp;lt;/b&amp;gt; All subblocks are invalid&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Valid Exclusive:&amp;lt;/b&amp;gt; All valid subblocks in this block are not present in any other caches. All subblocks that are Clean Shared may be written without a bus transaction. Any subblocks in the block may be invalid. There may also be Dirty subblocks, which must be written back upon replacement.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Clean Shared:&amp;lt;/b&amp;gt; The block contains subblocks that are either Clean Shared or Invalid. The block can be replaced without a bus transaction.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty Shared:&amp;lt;/b&amp;gt; Subblocks in this block may be in any state. There may be Dirty subblocks, which must be written back on replacement.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dt&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''The finite state diagram of the block (line) states is as follows:'''&lt;br /&gt;
  &lt;br /&gt;
&amp;lt;center&amp;gt; [[File:line_protocol.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 4.  Finite state diagram of block''' &amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=====Subblock states=====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Invalid:&amp;lt;/b&amp;gt;  The subblock is invalid&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Clean Shared:&amp;lt;/b&amp;gt; A read access to the subblock will succeed. Unless the block that contains the subblock is in the Valid Exclusive state, a write to the subblock will force an invalidation transaction on the bus.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty Shared:&amp;lt;/b&amp;gt; The subblock is treated like a Clean Shared subblock, except that it must be written back on replacement. At most one cache will have a given subblock in either the Dirty Shared or Dirty state.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty:&amp;lt;/b&amp;gt; The subblock is exclusive to this cache. It must be written back on replacement. Read and write accesses to this subblock hit with no bus transaction.&lt;br /&gt;
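&amp;lt;dd&amp;gt;The interaction between block state and subblock state on a write can be sketched as a small decision function. This Python sketch follows the state descriptions above; the function name is hypothetical, states are plain strings for brevity, and treating a write to an Invalid subblock as a write miss that always needs the bus is an assumption of the sketch.&lt;br /&gt;

```python
def write_needs_bus(block_state, subblock_state):
    """Return True if writing the subblock requires a bus transaction."""
    if subblock_state == "Dirty":
        # Exclusive to this cache: reads and writes hit with no bus transaction.
        return False
    if subblock_state == "Invalid":
        # Assumed: a write miss always goes to the bus.
        return True
    # Clean Shared and Dirty Shared subblocks can be written silently only
    # when the enclosing block is in the Valid Exclusive state.
    return block_state != "Valid Exclusive"


if __name__ == "__main__":
    print(write_needs_bus("Valid Exclusive", "Clean Shared"))
    print(write_needs_bus("Clean Shared", "Clean Shared"))
```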
&lt;br /&gt;
&lt;br /&gt;
'''Finite state diagram of subblock is as follows:''' &lt;br /&gt;
&amp;lt;center&amp;gt;[[File:subblock_protocol.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 5.  Finite state diagram of Sub-block''' &amp;lt;/center&amp;gt; &lt;br /&gt;
   &lt;br /&gt;
The basic idea of this protocol is to transfer more data between caches and to access off-chip memory less often.&lt;br /&gt;
In contrast with the Illinois protocol, on a read miss, a cache that shares the block supplies the requested subblock together with all other Clean Shared and Dirty Shared subblocks in that block. &lt;br /&gt;
If the subblock is in main memory, the cache snoopers pass on the information about which subblocks are cached, and memory provides the requested subblock along with the subblocks that are not currently cached.&lt;br /&gt;
On writes to clean subblocks and on write misses, the snooper invalidates only the subblock being written, which avoids the penalties associated with false sharing.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;b&amp;gt;In contrast to the Illinois MESI protocol, which requires extra power cycles to maintain its extra states and additional logic, the subblock protocol reduces the number of cache blocks compared with cache update protocols, and thus reduces power consumption. &amp;lt;/b&amp;gt; &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Read-snarfing protocol===&lt;br /&gt;
This is an enhancement to snoopy-cache coherence protocols that takes advantage of the inherent broadcast nature of the bus.&lt;br /&gt;
&lt;br /&gt;
In contrast with the MESI protocol, when one processor reloads data into its cache after a cache miss, the read-snarfing protocol also supplies the block to the other processors whose copies were invalidated in the past. Only one read is required to restore the block in all caches where it was invalidated.&lt;br /&gt;
&lt;br /&gt;
The protocol modifies the normal update protocol by updating only the subblocks that were modified, and updates need to be broadcast only when the data is actively shared. &lt;br /&gt;
The Read-snarfing protocol maintains a counter and an invalidation threshold (Tb) for each cache block “b” to overcome WI’s write-after-read problem. The protocol predicts the number of write operations that happen on a single cache block before a read request to the same block. An invalidation request is broadcast when the write counter reaches Tb, and the protocol dynamically adjusts the value of Tb based on the behavior of the running program. &lt;br /&gt;
&lt;br /&gt;
A simple algorithm for the Read-snarfing Random Walk protocol is as follows.&lt;br /&gt;
Initially, Tb of each cache block b is set to 0. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
// run: number of writes by one processor before the block is accessed by another processor&lt;br /&gt;
if (run &amp;gt; R) {&lt;br /&gt;
    // Long write run: lower the threshold (it may reach 0, as described in the text)&lt;br /&gt;
    if (Tb &amp;gt; 0) {&lt;br /&gt;
        Tb--;&lt;br /&gt;
    }&lt;br /&gt;
} else {&lt;br /&gt;
    // Short write run: raise the threshold, up to R&lt;br /&gt;
    if (R &amp;gt; Tb) {&lt;br /&gt;
        Tb++;&lt;br /&gt;
    }&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
// R  = invalidation ratio, which is (Ci + Cr) / Cu&lt;br /&gt;
// Ci = cost in bus cycles of an invalidation transaction&lt;br /&gt;
// Cu = cost in bus cycles of an update transaction&lt;br /&gt;
// Cr = cost in bus cycles of reading a cache block&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When the threshold falls to 0, the block is invalidated immediately, with no wasted updates.  &lt;br /&gt;
When the block is actively shared, Tb is adjusted upward so that the block is not invalidated. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&amp;lt;dd&amp;gt;&lt;br /&gt;
=====Example 1 =====&lt;br /&gt;
&amp;lt;dd&amp;gt;Suppose the invalidation ratio (R) = 5.&lt;br /&gt;
&amp;lt;dd&amp;gt;The current threshold for the block (Tb) = 3.&lt;br /&gt;
&amp;lt;dd&amp;gt;If the processor writes 4 times before the block is accessed by another processor, then according to the logic above, Tb becomes 4.&lt;br /&gt;
&amp;lt;dd&amp;gt;This means Tb is at the best possible value and only updates are issued.&lt;br /&gt;
&lt;br /&gt;
=====Example 2 =====&lt;br /&gt;
&amp;lt;dd&amp;gt;Consider R = 5 and Tb = 3 for a particular block.&lt;br /&gt;
&amp;lt;dd&amp;gt;If the processor writes 10 times before the block is accessed by another processor,&lt;br /&gt;
&amp;lt;dd&amp;gt;Tb becomes 2 (decreased).&lt;br /&gt;
&amp;lt;dd&amp;gt;So the protocol can incur a cost of 2 updates, 1 invalidation, and 1 reread.&lt;br /&gt;
&amp;lt;dd&amp;gt;After 2 more such write runs, Tb will be 0 and invalidation will occur immediately. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
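The threshold adjustment and the two examples above can be reproduced with a short Python sketch. It is a minimal illustration, not the authors' implementation: the function names are hypothetical, and it assumes that Tb may fall to 0, as the surrounding text states.&lt;br /&gt;

```python
def invalidation_ratio(c_inv, c_read, c_upd):
    """R = (Ci + Cr) / Cu, with all costs given in bus cycles."""
    return (c_inv + c_read) / c_upd


def adjust_threshold(tb, write_run, r):
    """Return the new invalidation threshold Tb after one observed write run."""
    if write_run > r:
        # Long write run: updates were wasted, so invalidate sooner next time.
        return tb - 1 if tb > 0 else tb
    # Short write run: the block is actively shared, so keep updating longer.
    return tb + 1 if r > tb else tb


if __name__ == "__main__":
    print(adjust_threshold(3, 4, 5))   # Example 1: R = 5, Tb = 3, write run of 4
    print(adjust_threshold(3, 10, 5))  # Example 2: R = 5, Tb = 3, write run of 10
```

Calling &lt;code&gt;adjust_threshold&lt;/code&gt; repeatedly with long write runs walks Tb down to 0 and keeps it there, while short write runs walk it back up toward R.&lt;br /&gt;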
&lt;br /&gt;
==Simulation result==&lt;br /&gt;
For all simulated results, the subblock size ('''b''') was '''8 bytes''' and the caches were two-way set-associative ('''K = 2'''). To minimize simulation time, only relatively large caches ('''C = 128K''') were simulated.&lt;br /&gt;
&lt;br /&gt;
First, results are given for the two applications ('''Gauss and Cholesky''') that do well with large block sizes. Next come the results for the three applications ('''MP3D, Topopt, and Pverify''') that perform better with smaller block sizes. Finally, results are reported for Barnes, for which the choice of block size is not very important. The simulations compared the protocols on both normal and sector caches with '''1, 4, 16 and 32 processors''', where the normal cache uses the Illinois protocol and the sector cache uses the read-snarfing protocol. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Simulation]&amp;lt;/ref&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:image500.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 6.  (a-c) Gauss (128K cache) uses large cache block ''' &amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:Gauss32.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:cholesky_128k.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 7.  (a-c) Cholesky (128 K cache) application uses large cache block'''&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:gauss_32pro.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:Cholesky.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:mp3d.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 8.  (a-c) MP3D (128 K caches) application which uses smaller block sizes.'''&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:results_summary_1.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:results_summary_2.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:results_summary_3.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 9.  Final Simulation Results'''&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Conclusion==&lt;br /&gt;
This article described two techniques for improving application performance on bus-based multiprocessors. The first technique is the subblock cache coherence protocol, which combines a large block with small subblocks to gain the advantages of spatial locality while avoiding false sharing.   &lt;br /&gt;
The second technique is read snarfing, which reduces the number of read misses caused by earlier cache coherence protocol actions.  &lt;br /&gt;
The simulation results above showed that the subblock protocol works better than the MESI protocol for a 64-byte cache block size, but for small block sizes the Illinois protocol works better because the subblock protocol uses a large transfer block.&lt;br /&gt;
In addition, the Read-snarfing protocol is effective in reducing the amount of data transferred and the number of bus transactions, which results in lower bus utilization and higher application speedups. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Quiz==&lt;br /&gt;
The following quiz is intended for the reader to benefit from this article and to help achieve a complete understanding of the research. There are a total of 10 multiple-choice questions.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;&lt;br /&gt;
'''1) What are the two commonly used coherence protocols? '''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Token-based &amp;amp; Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Update &amp;amp; Token-based&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Write-Invalidate &amp;amp; Write-Update &lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Both a and c&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''2) A coherency protocol is a protocol which maintains  __________      ____________ according to a specific ___________     ___________'''&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	System order / memory hierarchy&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Memory coherence / consistency model&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Bus hierarchy / system order&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''3) Increased bus bandwidth consumption is most commonly seen in which protocol?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Invalidate&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Token-based&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Write-Update &lt;br /&gt;
&amp;lt;dd&amp;gt;e)	Both a and b&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''4) Increased conflict and capacity cache misses are mostly seen in which protocol?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Token-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Update &lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Write-Invalidate&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	Both a and d&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''5) Because they make good use of _________ __________, some applications execute more quickly with a large cache block size, whereas others run better with small cache blocks, which avoid migratory data or ________ _________ between processors.'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Spatial locality / false sharing&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Equidistant locality / sequential consistency&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Temporal locality / true sharing&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''6) The Subblock Protocol defines four subblock states: Invalid, Clean Shared, Dirty, and __________. Name the missing state.'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Exclusive&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Valid&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Dirty Shared&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Shared&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''7) In the Subblock Protocol, all subblocks that are clean-shared may be written without a bus transaction. In which block state is this possible?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Valid Exclusive&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Invalid&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Clean Shared&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Dirty Shared&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''8) Which protocol maintains a counter and an invalidation threshold for each cache block to overcome Write-Invalidate’s write-after-read problem?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Invalidate&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Subblock protocol &lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Read-snarfing protocol&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	Write-Update &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''9) Based on the simulation in this article, the Subblock Protocol worked better than the MESI protocol in specific executions. Under what circumstances would the MESI protocol work better than the Subblock Protocol?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	For larger block sizes, since MESI uses large block sizes&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	When it effectively supplies the block to other processors whose blocks were invalidated in the past&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	MESI will update the need to broadcast when the data is actively shared&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	For small block size because Subblock Protocol uses large transfer block&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''10) Why was it stated that the Read-snarfing protocol achieved lower bus utilization and higher application speedups?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Because instructions on any given processor execute in partial program order, but may not propagate in that order&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Because it’s effective in reducing the amount of data transfer and the number of bus transactions &lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Because Write Update reduces all types of cache misses relative to Write Invalidate, therefore Read-snarfing gains performance &lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Because multiple caches can be in its Shared State simultaneously.&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	None of the above&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
*[http://www.computer.org/portal/web/csdl/doi/10.1109/HPCA.1995.386536 Two techniques for improving performance on bus-based multiprocessors]&lt;br /&gt;
&lt;br /&gt;
*[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Design of an Adaptive Cache Coherence Protocol for Large Scale Multiprocessors]&lt;br /&gt;
&lt;br /&gt;
*[http://www.lfbs.rwth-aachen.de/content/smi Shared Memory Interface]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references&amp;gt;&amp;lt;/references&amp;gt;&lt;/div&gt;</summary>
		<author><name>Sbasu3</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8b_va&amp;diff=60705</id>
		<title>CSC/ECE 506 Spring 2012/8b va</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8b_va&amp;diff=60705"/>
		<updated>2012-03-29T03:51:12Z</updated>

		<summary type="html">&lt;p&gt;Sbasu3: /* Write-Update Protocol */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Update and adaptive coherence protocols on real architectures, and power considerations=&lt;br /&gt;
 &lt;br /&gt;
===Introduction===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In a [http://en.wikipedia.org/wiki/Shared_memory shared memory], shared-bus multiprocessor system, the cache coherence protocol plays an important role in propagating changes from one cache to the others. In most cases, however, update and invalidate coherence traffic is the main source of bus contention, which increases the number of busy bus cycles and therefore program execution time, because a processor may stall while its cache waits for the bus. To reduce bus contention, this article presents several adaptive hybrid protocol strategies for high-performance multiprocessors.&amp;lt;ref&amp;gt;[http://en.wikipedia.org/wiki/Shared_memory shared memory]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Coherence protocols===&lt;br /&gt;
In general terms, a coherency protocol is a protocol that maintains consistency among all the caches in a [http://en.wikipedia.org/wiki/Shared_memory shared memory] system. The protocol maintains memory coherence according to a specific consistency model. Older multiprocessors support the [http://en.wikipedia.org/wiki/Sequential_consistency sequential consistency model], while modern shared memory systems typically support the release consistency or weak consistency models. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:Cache_Coherency_Generic.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 1.  Multiple Caches of Shared Resource''' &amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Transitions between states in any specific implementation of these protocols may vary. For example, an implementation may choose different update and invalidation transitions such as update-on-read, update-on-write, invalidate-on-read, or invalidate-on-write. The choice of transition may affect the amount of inter-cache traffic, which in turn may affect the amount of cache bandwidth available for actual work. This should be taken into consideration in the design of distributed software that could cause strong contention between the caches of multiple processors.&lt;br /&gt;
Various models and protocols have been devised for maintaining cache coherence, such as the MSI protocol, MESI (also known as the Illinois protocol), MOSI, MOESI, MERSI, MESIF, write-once, Synapse, Berkeley, Firefly and the Dragon protocol.&lt;br /&gt;
&lt;br /&gt;
For the purpose of the main study, this article will focus on the two commonly used coherence protocols: &amp;lt;b&amp;gt;Write-Invalidate (WI) and Write-Update (WU)&amp;lt;/b&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Write-Update Protocol=&lt;br /&gt;
In a [http://cs.gmu.edu/cne/modules/dsm/purple/wr_update.html Write-Update Protocol], a processor broadcasts updates to shared data to the other caches so that those caches stay coherent.&lt;br /&gt;
&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=====&amp;lt;dd&amp;gt;   &amp;lt;b&amp;gt;Disadvantages of Write-Invalidate Protocol&amp;lt;/b&amp;gt;=====&lt;br /&gt;
&amp;lt;dd&amp;gt; Any write by a processor invalidates the shared copies of the block in other processors' caches, forcing those caches to issue bus requests to reload the new data, which increases bus bandwidth demand. The effect is worse when one processor updates a block frequently while another processor stalls waiting to read the same block. For a sequence of '''n''' writes by one processor, each followed by a read from another processor, the '''WI''' protocol performs '''n''' invalidations and '''n''' cache block reads.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Write-Invalidate]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Write-Invalidate Protocol==&lt;br /&gt;
In a [http://cs.gmu.edu/cne/modules/dsm/purple/wr_inval.html Write-Invalidate Protocol], a processor invalidates the copies of the block in all other processors' caches and then updates its own copy without further bus operations.&lt;br /&gt;
&lt;br /&gt;
=====&amp;lt;dd&amp;gt;   &amp;lt;b&amp;gt;Disadvantages of Write-Update Protocol&amp;lt;/b&amp;gt;=====&lt;br /&gt;
&amp;lt;dd&amp;gt; The update protocol is advantageous in the write-then-read case because it only updates the cached copies '''n''' times. However, an update protocol can keep refreshing data in other processors' caches long after it is needed, so fewer cache blocks are available for more useful data. This tends to increase conflict and capacity cache misses.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Write-Update]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
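The trade-off between the two protocols can be made concrete by counting bus transactions. The following Python sketch assumes the access pattern described in the text, a sequence of '''n''' writes by one processor with each write followed by a read from another processor. The function names are illustrative and are not taken from the cited paper.&lt;br /&gt;

```python
def wi_bus_transactions(n):
    """Write-Invalidate: each write invalidates the reader's copy,
    and each following read must fetch the block again."""
    invalidations = n
    block_reads = n
    return invalidations + block_reads


def wu_bus_transactions(n):
    """Write-Update: each write broadcasts one update,
    so the reader's copy stays valid and its reads hit."""
    updates = n
    return updates


if __name__ == "__main__":
    for n in (1, 10, 100):
        print(n, wi_bus_transactions(n), wu_bus_transactions(n))
```

The sketch counts transactions only, not bus cycles. A cache block read usually moves more data than a single update, so the real cost difference depends on the bus timing of each transaction type.&lt;br /&gt;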
&lt;br /&gt;
====Consideration of cache architecture issue====&lt;br /&gt;
Application execution time also depends directly on the cache block size. &lt;br /&gt;
Some applications execute more quickly with a large cache block size because they exhibit good spatial locality, whereas others run better when cache blocks are small because small blocks avoid migratory data or false sharing between processors. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;b&amp;gt;Write-Update protocol:&amp;lt;/b&amp;gt; Two common update protocols are the Dragon and Firefly protocols.&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&amp;lt;dd&amp;gt;&lt;br /&gt;
===Dragon Protocol===&lt;br /&gt;
[[File:Screen_Shot_2012-03-28_at_10.29.04_PM.png|thumb|upright=10.5|right|alt=Table of additional notations used in the Dragon protocol state transition diagram.|Table 1: Definitions of additional notations in the state transition diagram of Dragon Protocol]]&lt;br /&gt;
[[File:Screen Shot 2012-03-28 at 10.18.57 PM.png|thumb|upright=2.5|right|alt=State transition diagram of the Dragon protocol.|Figure 2 depicts the state transition diagram of the Dragon protocol]]&lt;br /&gt;
&amp;lt;dd&amp;gt;This protocol was first proposed by researchers at Xerox PARC for their Dragon multiprocessor system.&lt;br /&gt;
The Dragon protocol ensures that data is always valid if the tag matches. Hence, there is no explicit invalid state even though it reserves a miss mode bit for compulsory misses. The Dragon protocol consists of four states:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Exclusive (E): '''means that only one cache (this cache) has a copy of the block, and it has not been modified (the main memory is up-to-date)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Shared-clean (Sc): '''means that potentially two or more caches (including this one) have this block, and main memory may or may not be up-to-date&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Shared-modified (Sm):''' means that potentially two or more caches have this block, main memory is not up-to-date, and it is this cache's responsibility to update the main memory when this block is replaced from the cache. A block may be in this state in only one cache at a time; however, it is quite possible that one cache has the block in this state while others have it in the shared-clean state&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Modified (M): '''signifies exclusive ownership; the block is modified and present in this cache alone, main memory is stale, and it is this cache's responsibility to supply the data and to update main memory on replacement.&lt;br /&gt;
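As a minimal sketch of how these four states behave, the Python fragment below records which states make the cache responsible for a write-back and models a processor write hit. The transition rules follow the usual textbook description of the Dragon protocol rather than anything stated in this article, and the names DragonState, must_write_back and on_processor_write are hypothetical.&lt;br /&gt;

```python
from enum import Enum


class DragonState(Enum):
    E = "Exclusive"
    SC = "Shared-clean"
    SM = "Shared-modified"
    M = "Modified"


def must_write_back(state):
    """Only the owner of a modified block updates memory on replacement."""
    return state in (DragonState.SM, DragonState.M)


def on_processor_write(state, others_have_copy):
    """Return (next_state, bus_update_needed) for a write hit.

    others_have_copy models the shared line sampled during the bus update.
    """
    if state in (DragonState.E, DragonState.M):
        # The block is private, so the write completes without bus traffic.
        return DragonState.M, False
    # Sc or Sm: broadcast the new value to the other caches.
    if others_have_copy:
        return DragonState.SM, True
    return DragonState.M, True
```

For example, a write to a Shared-clean block that other caches still hold moves the block to Shared-modified and costs one bus update, while a write to an Exclusive block moves it silently to Modified.&lt;br /&gt;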
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Firefly Protocol===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Screen_Shot_2012-03-28_at_10.19.13_PM.png|thumb|upright=2.5|right|alt=Table of additional notations used in the Firefly protocol state transition diagram.|Table 1: Definitions of additional notations in the state transition diagram of Figure 3]]&lt;br /&gt;
&amp;lt;dd&amp;gt;The Firefly protocol was developed by DEC for its Firefly multiprocessor workstation.&lt;br /&gt;
&amp;lt;dd&amp;gt;The Firefly protocol also ensures that data is always valid if the tag matches. Hence, there is no explicit invalid state even though it reserves a miss mode bit for compulsory misses. The Firefly protocol consists of the following three states:&lt;br /&gt;
[[File:Screen_Shot_2012-03-28_at_10.19.25_PM.png|thumb|upright=2.5|center|alt=State transition diagram of the Firefly protocol.|Figure 3: State diagram of the Firefly protocol.]]&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Valid (V): '''This block has a coherent copy of the memory. There is only one copy of the data in caches.&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Dirty (D):''' This block is the only cached copy and it is not coherent with main memory. This is the only state that generates a write-back when the block is replaced in the cache.&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Shared (S):''' This block has a coherent copy of the memory. The data may be possibly shared, but its content is not modified. Figure 3 depicts the state transition diagram of the Firefly protocol, and Table 1 details the additional notations in the state transition diagram.&lt;br /&gt;
&lt;br /&gt;
=Adaptive coherence protocols=&lt;br /&gt;
&lt;br /&gt;
A combined strategy such as an '''adaptive hybrid protocol''' can reduce the pathological behaviors of pure update and pure invalidate protocols. Such a protocol should be applicable to a wide range of network characteristics, and it should automatically adjust its behavior to achieve its target goals in the face of changes in traffic patterns, node mobility and other network characteristics.&lt;br /&gt;
&lt;br /&gt;
===Hardware architecture===&lt;br /&gt;
Most processor architectures use the same block size for memory-to-cache transfers, cache-to-memory transfers, and coherence. &lt;br /&gt;
The approach described here uses different block sizes for transfer and coherence, based on application requirements.&lt;br /&gt;
A normal cache has the following parameters: capacity ('''C'''), block size ('''L''') and associativity ('''K'''). &lt;br /&gt;
A sector cache additionally divides each cache block into subblocks of size '''b'''. Compared with a normal cache ('''C, L, K'''), a sector cache ('''C, L, K, b''') requires one extra state and some extra bus lines to transmit bitmasks describing the status of the subblocks in a particular block, but it maintains the same number of tag and state bits for '''L'''. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
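To make the ('''C, L, K, b''') parameters concrete, the Python sketch below derives the number of blocks, sets and subblocks per block with ordinary cache arithmetic. The function name sector_cache_geometry is hypothetical. The sample values C = 128K, K = 2 and b = 8 bytes come from the simulation section of this article, and L = 64 bytes is the block size mentioned in the conclusion; combining them in one configuration is an assumption made for illustration.&lt;br /&gt;

```python
def sector_cache_geometry(capacity, block_size, associativity, subblock_size):
    """Derive basic counts for a sector cache described by (C, L, K, b)."""
    if capacity % (block_size * associativity) != 0:
        raise ValueError("C must be a multiple of L * K")
    if block_size % subblock_size != 0:
        raise ValueError("L must be a multiple of b")
    blocks = capacity // block_size
    return {
        "blocks": blocks,
        "sets": blocks // associativity,
        "subblocks_per_block": block_size // subblock_size,
    }


if __name__ == "__main__":
    # C = 128 KB, L = 64 bytes, K = 2, b = 8 bytes
    print(sector_cache_geometry(128 * 1024, 64, 2, 8))
```

If the status bitmask carries one bit per subblock, which is an assumption here, its width equals the subblocks_per_block value.&lt;br /&gt;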
   &lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Subblock protocol===&lt;br /&gt;
This snoopy-based protocol combines features of the [http://en.wikipedia.org/wiki/MESI_protocol Illinois MESI protocol] and write policies with subblock validation, using subblocks to obtain the advantages of both small and large cache block sizes.  &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dt&amp;gt;&lt;br /&gt;
=====Block states=====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Invalid:&amp;lt;/b&amp;gt; All subblocks are invalid&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Valid Exclusive:&amp;lt;/b&amp;gt; None of the valid subblocks in this block are present in any other cache. All subblocks that are Clean Shared may be written without a bus transaction. Any subblock in the block may be invalid. There may also be Dirty subblocks, which must be written back upon replacement.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Clean Shared:&amp;lt;/b&amp;gt; The block contains subblocks that are either Clean Shared or Invalid. The block can be replaced without a bus transaction.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty Shared:&amp;lt;/b&amp;gt; Subblocks in this block may be in any state. There may be Dirty subblocks, which must be written back on replacement.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dt&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Finite state diagram of block/ line states is as follows:'''&lt;br /&gt;
  &lt;br /&gt;
&amp;lt;center&amp;gt; [[File:line_protocol.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 4.  Finite state diagram of block''' &amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=====Subblock states=====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Invalid:&amp;lt;/b&amp;gt;  The subblock is invalid&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Clean Shared:&amp;lt;/b&amp;gt; A read access to the subblock will succeed. Unless the enclosing block is in the Valid Exclusive state, a write to the subblock will force an invalidation transaction on the bus.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty Shared:&amp;lt;/b&amp;gt; The subblock is treated like a Clean Shared subblock, except that it must be written back on replacement. At most one cache will have a given subblock in either the Dirty Shared or Dirty state.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty:&amp;lt;/b&amp;gt; The subblock is exclusive to this cache. It must be written back on replacement. Read and write accesses to this subblock hit with no bus transaction.&lt;br /&gt;
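The subblock rules above can be sketched as a small decision function. This Python fragment is an illustration only: the enum and function names are hypothetical, and treating a write to an Invalid subblock as a miss that always needs the bus is an assumption rather than something this article states.&lt;br /&gt;

```python
from enum import Enum


class BlockState(Enum):
    INVALID = 0
    VALID_EXCLUSIVE = 1
    CLEAN_SHARED = 2
    DIRTY_SHARED = 3


class SubblockState(Enum):
    INVALID = 0
    CLEAN_SHARED = 1
    DIRTY_SHARED = 2
    DIRTY = 3


def write_needs_bus(block, subblock):
    """Return True when a write to the subblock must use the bus."""
    if subblock is SubblockState.INVALID:
        return True  # write miss (assumed to always need the bus)
    if subblock is SubblockState.DIRTY:
        return False  # exclusive to this cache, so the write hits
    # Clean Shared, or Dirty Shared which is treated like Clean Shared:
    # an invalidation is needed unless the whole block is exclusive.
    return block is not BlockState.VALID_EXCLUSIVE


def written_back_on_replacement(subblock):
    """Dirty Shared and Dirty subblocks must be written back."""
    return subblock in (SubblockState.DIRTY_SHARED, SubblockState.DIRTY)
```

A write to a Clean Shared subblock therefore costs an invalidation when the enclosing block is Clean Shared or Dirty Shared, but not when the block is Valid Exclusive.&lt;br /&gt;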
&lt;br /&gt;
&lt;br /&gt;
'''Finite state diagram of subblock is as follows:''' &lt;br /&gt;
&amp;lt;center&amp;gt;[[File:subblock_protocol.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 5.  Finite state diagram of Sub-block''' &amp;lt;/center&amp;gt; &lt;br /&gt;
   &lt;br /&gt;
The basic idea of this protocol is to perform more cache-to-cache transfers and fewer off-chip memory accesses.&lt;br /&gt;
In contrast with the Illinois protocol, on a read miss the cache that shares the block sends the requested subblock together with all other Clean Shared and Dirty Shared subblocks in that block. &lt;br /&gt;
If the subblock must come from main memory, the cache snoopers report which subblocks are cached, and memory provides the requested subblock along with the subblocks that are not currently cached.&lt;br /&gt;
On writes to clean subblocks and on write misses, the snooper invalidates only the subblock being written, which avoids the penalties associated with false sharing.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;b&amp;gt;In contrast to the Illinois MESI protocol, which requires extra power to maintain the extra states and additional logic, the subblock protocol reduces the number of cache blocks compared to cache update protocols, and thus reduces power consumption. &amp;lt;/b&amp;gt; &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Read-snarfing protocol===&lt;br /&gt;
This is an enhancement to snoopy-cache coherence protocols that takes advantage of the inherent broadcast nature of the bus.&lt;br /&gt;
&lt;br /&gt;
In contrast with the MESI protocol, if one processor reloads data into its cache because of a cache miss, the read-snarfing protocol also supplies the block to other processors whose copies were invalidated in the past. Only one read is required to restore the block to all caches in which it was invalidated.&lt;br /&gt;
&lt;br /&gt;
The protocol modifies the normal update protocol by updating only those subblocks that are modified, and updates only need to be broadcast when the data is actively shared. &lt;br /&gt;
The Read-snarfing protocol maintains a counter and an invalidation threshold (Tb) for each cache block “b” to overcome WI’s write-after-read drawback. The protocol predicts the number of writes that occur on a single cache block before a read request to the same block. An invalidation request is broadcast when the write counter reaches Tb, and the protocol dynamically adjusts the value of Tb based on the behavior of the running program. &lt;br /&gt;
&lt;br /&gt;
A simple version of the Read-snarfing Random Walk algorithm is as follows.&lt;br /&gt;
Initially, Tb of each cache block b is set to 0. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
// write_run: number of writes to the block before another processor accesses it&lt;br /&gt;
// R: invalidation ratio, (Ci + Cr) / Cu&lt;br /&gt;
//   Ci: cost in bus cycles of an invalidation transaction&lt;br /&gt;
//   Cu: cost in bus cycles of an update transaction&lt;br /&gt;
//   Cr: cost in bus cycles of reading a cache block&lt;br /&gt;
if (write_run &amp;gt; R) {&lt;br /&gt;
    if (Tb &amp;gt; 0) {   // long write run: invalidate sooner&lt;br /&gt;
        Tb--;&lt;br /&gt;
    }&lt;br /&gt;
} else {&lt;br /&gt;
    if (R &amp;gt; Tb) {   // short write run: keep updating longer&lt;br /&gt;
        Tb++;&lt;br /&gt;
    }&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When the threshold falls to 0, the block is invalidated immediately with no wasted updates.  &lt;br /&gt;
When a block is actively shared, Tb is adjusted upward so that the block is not invalidated. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&amp;lt;dd&amp;gt;&lt;br /&gt;
=====Example 1 =====&lt;br /&gt;
&amp;lt;dd&amp;gt;Suppose invalidation ratio (R) = 5&lt;br /&gt;
&amp;lt;dd&amp;gt;Current threshold of the block (Tb) = 3&lt;br /&gt;
&amp;lt;dd&amp;gt;If the processor writes 4 times before the block is accessed by another processor, then according to the above logic Tb will be 4.&lt;br /&gt;
&amp;lt;dd&amp;gt;This means Tb is at the best possible value and only updates are issued.&lt;br /&gt;
&lt;br /&gt;
=====Example 2 =====&lt;br /&gt;
&amp;lt;dd&amp;gt;Consider R = 5 and Tb = 3 for a particular block.&lt;br /&gt;
&amp;lt;dd&amp;gt;If the processor writes 10 times before the block is accessed by another processor,&lt;br /&gt;
&amp;lt;dd&amp;gt;Tb will be decreased to 2.&lt;br /&gt;
&amp;lt;dd&amp;gt;So the protocol can incur a cost of 2 updates, 1 invalidate and 1 reread.&lt;br /&gt;
&amp;lt;dd&amp;gt;After two more such long write runs, Tb will be 0 and invalidation will occur immediately. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
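The threshold adjustment and the two examples can be checked with a short Python sketch. The function names adjust_threshold and invalidation_ratio are hypothetical, and the sketch assumes that Tb may fall all the way to 0, as Example 2 and the surrounding text describe.&lt;br /&gt;

```python
def invalidation_ratio(ci, cr, cu):
    """R = (Ci + Cr) / Cu, with all costs given in bus cycles."""
    return (ci + cr) / cu


def adjust_threshold(tb, write_run, ratio):
    """Apply one random-walk step to the threshold Tb of a block."""
    if write_run > ratio:
        # Long write run: updates were wasted, so invalidate sooner.
        return tb - 1 if tb > 0 else tb
    # Short write run: the block is actively shared, so keep updating longer.
    return tb + 1 if ratio > tb else tb


if __name__ == "__main__":
    print(adjust_threshold(3, 4, 5))   # Example 1: R = 5, Tb = 3, 4 writes
    print(adjust_threshold(3, 10, 5))  # Example 2: R = 5, Tb = 3, 10 writes
```

Calling adjust_threshold repeatedly with a write run longer than R walks Tb down one step at a time until it reaches 0, at which point the block is invalidated on the first write.&lt;br /&gt;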
&lt;br /&gt;
==Simulation result==&lt;br /&gt;
For all simulated results, the subblock size ('''b''') was '''8 bytes''' and the caches were two-way set-associative ('''K = 2'''). To limit simulation time, only relatively large caches ('''C = 128K''') were simulated.&lt;br /&gt;
&lt;br /&gt;
First, below are the results for the two applications ('''Gauss and Cholesky''') that do well with large block sizes. Next come the results for the three applications ('''MP3D, Topopt, and Pverify''') that perform better with smaller block sizes. Finally come the results for Barnes, for which the choice of block size is not very important. The simulations compared the protocols on both conventional and sector caches with '''1, 4, 16 and 32 processors''', where the conventional cache uses the Illinois protocol and the sector cache uses the read-snarfing protocol. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Simulation]&amp;lt;/ref&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:image500.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 6.  (a-c) Gauss (128K cache) uses large cache block ''' &amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:Gauss32.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:cholesky_128k.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 7.  (a-c) Cholesky (128 K cache) application uses large cache block'''&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:gauss_32pro.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:Cholesky.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:mp3d.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 8.  (a-c) MP3D (128 K caches) application which uses smaller block sizes.'''&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:results_summary_1.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:results_summary_2.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:results_summary_3.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 9.  Final Simulation Results'''&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Conclusion==&lt;br /&gt;
This article described two techniques for improving application performance on bus-based multiprocessors. The first technique is the subblock cache coherence protocol, which combines a large block with small subblocks to gain the advantages of spatial locality while avoiding false sharing.   &lt;br /&gt;
The second technique is read snarfing, which reduces the number of read misses caused by earlier cache coherence protocol actions.  &lt;br /&gt;
The simulation results above show that the subblock protocol works better than the MESI protocol for the 64-byte cache block size, but for small block sizes the Illinois protocol works better because the subblock protocol uses a large transfer block.&lt;br /&gt;
In addition, the read-snarfing protocol is effective in reducing the amount of data transferred and the number of bus transactions, which results in lower bus utilization and higher application speedups. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Quiz==&lt;br /&gt;
The following quiz is intended for the reader to benefit from this article and to help achieve a complete understanding of the research. There are a total of 10 multiple-choice questions.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;&lt;br /&gt;
'''1) What are the two commonly used coherence protocols? '''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Token-based &amp;amp; Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Update &amp;amp; Token-based&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Write-Invalidate &amp;amp; Write-Update &lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Both a and c&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''2) A coherency protocol is a protocol which maintains  __________      ____________ according to a specific ___________     ___________'''&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	System order / memory hierarchy&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Memory coherence / consistency model&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Bus hierarchy / system order&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''3) Increased bus bandwidth demand is most commonly seen in which protocol?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Invalidate&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Token-based&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Write-Update &lt;br /&gt;
&amp;lt;dd&amp;gt;e)	Both a and b&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''4) An increase in conflict and capacity cache misses is mostly seen in which protocol?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Token-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Update &lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Write-Invalidate&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	Both a and d&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''5) Due to the good use of  _________  __________ some applications execute more quickly with large cache block sizes, whereas others run better when cache block sizes are small by avoiding migratory data or ________  _________ between processors'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Spatial locality / false-sharing&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Equidistant locality /Sequential consistency &lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Temporal locality/ true sharing &lt;br /&gt;
&amp;lt;dd&amp;gt;d)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''6) The Subblock Protocol consists of four states: Invalid, Clean Shared, Dirty and __________. Name the missing state.'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Exclusive&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Valid&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Dirty Shared&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Shared&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''7) In the Subblock Protocol, all subblocks that are clean-shared may be written without a bus transaction. In what state is this achieved?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Valid Exclusive&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Invalid&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Clean Shared&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Dirty Shared&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''8) Which protocol maintains a counter and an invalidation threshold for each cache block to overcome the drawback of Write-Invalidate’s write-after-read problem?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Invalidate&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Subblock protocol &lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Read-snarfing protocol&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	Write-Update &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''9) Based on the simulation in this article, the Subblock Protocol worked better than the MESI protocol in specific executions. In what circumstances would the MESI protocol work better than the Subblock Protocol?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	For larger block sizes, since MESI uses large block sizes&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	For effectively supplying the block to other processors whose blocks were invalidated in the past&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	MESI will update the need to broadcast when the data is actively shared&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	For small block sizes, because the Subblock Protocol uses a large transfer block&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''10) Why was the Read-snarfing protocol said to achieve lower bus utilization and higher application speedups?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Because instructions on any given processor execute in partial program order, but may not propagate in that order&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Because it’s effective in reducing the amount of data transfer and the number of bus transactions&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Because Write Update reduces all types of cache misses relative to Write Invalidate, therefore Read-snarfing gains performance &lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Because multiple caches can be in its Shared State simultaneously.&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	None of the above&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
*[http://www.computer.org/portal/web/csdl/doi/10.1109/HPCA.1995.386536 Two techniques for improving performance on bus-based multiprocessors]&lt;br /&gt;
&lt;br /&gt;
*[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Design of an Adaptive Cache Coherence Protocol for Large Scale Multiprocessors]&lt;br /&gt;
&lt;br /&gt;
*[http://www.lfbs.rwth-aachen.de/content/smi Shared Memory Interface]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references&amp;gt;&amp;lt;/references&amp;gt;&lt;/div&gt;</summary>
		<author><name>Sbasu3</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8b_va&amp;diff=60704</id>
		<title>CSC/ECE 506 Spring 2012/8b va</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8b_va&amp;diff=60704"/>
		<updated>2012-03-29T03:48:27Z</updated>

		<summary type="html">&lt;p&gt;Sbasu3: /* Write-Update Protocol */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Update and adaptive coherence protocols on real architectures, and power considerations=&lt;br /&gt;
 &lt;br /&gt;
===Introduction===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In a [http://en.wikipedia.org/wiki/Shared_memory shared memory], shared-bus multiprocessor system, the cache coherence protocol plays an important role in propagating changes from one cache to the others. In most cases, however, update and invalidate coherence protocols are the main source of bus contention, which increases the number of busy bus cycles and therefore program execution time, because a processor may stall while its cache waits for the bus. To reduce bus contention, this article presents several adaptive hybrid protocol strategies for high-performance multiprocessors.&amp;lt;ref&amp;gt;[http://en.wikipedia.org/wiki/Shared_memory shared memory]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Coherence protocols===&lt;br /&gt;
In general terms, a coherency protocol maintains consistency between all the caches in a [http://en.wikipedia.org/wiki/Shared_memory shared memory] system. The protocol maintains memory coherence according to a specific consistency model. Older multiprocessors support the [http://en.wikipedia.org/wiki/Sequential_consistency sequential consistency model], while modern shared memory systems typically support the release consistency or weak consistency models. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:Cache_Coherency_Generic.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 1.  Multiple Caches of Shared Resource''' &amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Transitions between states in any specific implementation of these protocols may vary. For example, an implementation may choose different update and invalidation transitions such as update-on-read, update-on-write, invalidate-on-read, or invalidate-on-write. The choice of transition may affect the amount of inter-cache traffic, which in turn may affect the amount of cache bandwidth available for actual work. This should be taken into consideration in the design of distributed software that could cause strong contention between the caches of multiple processors.&lt;br /&gt;
Various models and protocols have been devised for maintaining cache coherence, such as the MSI, MESI (also known as the Illinois protocol), MOSI, MOESI, MERSI, MESIF, write-once, Synapse, Berkeley, Firefly and Dragon protocols.&lt;br /&gt;
&lt;br /&gt;
For the purpose of the main study, this article will focus on the two commonly used coherence protocols: &amp;lt;b&amp;gt;Write-Invalidate (WI) and Write-Update (WU)&amp;lt;/b&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Write-Update Protocol=&lt;br /&gt;
In a [http://cs.gmu.edu/cne/modules/dsm/purple/wr_update.html Write-Update Protocol], a processor broadcasts updates to shared data to the other caches so that they stay coherent.&lt;br /&gt;
&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=====&amp;lt;dd&amp;gt;   &amp;lt;b&amp;gt;Disadvantages of Write-Invalidate Protocol&amp;lt;/b&amp;gt;=====&lt;br /&gt;
&amp;lt;dd&amp;gt; Any write by a processor invalidates the shared cache blocks of other processors, forcing those caches to issue bus requests to reload the new data, which increases the demand for bus bandwidth. This is worse when one processor frequently updates a cache block while another processor stalls to read the same block. For a sequence of '''n''' writes by one processor, each followed by a read from another processor, the '''WI''' protocol performs '''n''' invalidations and '''n''' cache block reads.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Write-Invalidate]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''Write-Invalidate Protocol'''&lt;br /&gt;
In a [http://cs.gmu.edu/cne/modules/dsm/purple/wr_inval.html Write-Invalidate Protocol], a processor invalidates the copies of the cache block held by all other processors and then updates its own copy without further bus operations.&lt;br /&gt;
&lt;br /&gt;
=====&amp;lt;dd&amp;gt;   &amp;lt;b&amp;gt;Disadvantages of Write-Update Protocol&amp;lt;/b&amp;gt;=====&lt;br /&gt;
&amp;lt;dd&amp;gt; The update protocol is advantageous in the case above because it only updates the cache blocks '''n''' times. However, an update protocol can keep refreshing data that other processors' caches no longer need for too long, so fewer cache blocks are available for more useful data. This tends to increase conflict and capacity cache misses.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Write-Update]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
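The trade-off described above can be made concrete with a small counting sketch. It assumes a hypothetical two-processor scenario in which one processor writes a shared block '''n''' times and the other processor reads it after every write; the function names and the one-transaction-per-event cost model are illustrative assumptions, not taken from the cited paper.&lt;br /&gt;

```python
def write_invalidate_transactions(n):
    """Bus transactions for n (write, remote read) rounds under WI.

    Assumed model: each write invalidates the reader's copy
    (1 invalidation) and each following read misses and reloads
    the block (1 block read).
    """
    invalidations = n
    block_reads = n
    return invalidations + block_reads


def write_update_transactions(n):
    """Bus transactions for the same n rounds under WU.

    Assumed model: each write broadcasts one update; the reader's copy
    stays valid, so its reads hit in the cache without bus traffic.
    """
    updates = n
    return updates


if __name__ == '__main__':
    for n in (1, 10, 100):
        print(n, write_invalidate_transactions(n), write_update_transactions(n))
```

The sketch only counts transactions; it does not model the extra conflict and capacity misses that stale updated blocks can cause under Write-Update.&lt;br /&gt;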
&lt;br /&gt;
====Consideration of cache architecture issue====&lt;br /&gt;
Application execution time also depends directly on the cache block size. &lt;br /&gt;
Some applications execute more quickly with a large cache block size because they exhibit good spatial locality, whereas others run better when cache block sizes are small, because small blocks avoid migratory data or false sharing between processors. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;b&amp;gt;Write-Update protocol:&amp;lt;/b&amp;gt; Common update protocols include the Dragon and Firefly protocols.&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&amp;lt;dd&amp;gt;&lt;br /&gt;
===Dragon Protocol===&lt;br /&gt;
[[File:Screen_Shot_2012-03-28_at_10.29.04_PM.png|thumb|upright=10.5|right|alt=Table of additional notations used in the Dragon state transition diagram.|Table 1: Definitions of additional notations in the state transition diagram of Dragon Protocol]]&lt;br /&gt;
[[File:Screen Shot 2012-03-28 at 10.18.57 PM.png|thumb|upright=2.5|right|alt=State transition diagram of the Dragon protocol.|Figure 2 depicts the state transition diagram of the Dragon protocol]]&lt;br /&gt;
&amp;lt;dd&amp;gt;This protocol was first proposed by researchers at Xerox PARC for their Dragon multiprocessor system.&lt;br /&gt;
The Dragon protocol ensures that data is always valid if the tag matches. Hence, there is no explicit invalid state even though it reserves a miss mode bit for compulsory misses. The Dragon protocol consists of four states:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Exclusive (E): '''means that only one cache (this cache) has a copy of the block, and it has not been modified (the main memory is up-to-date)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Shared-clean (Sc): '''means that potentially two or more caches (including this one) have this block, and main memory may or may not be up-to-date&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Shared-modified(Sm):''' means that potentially two or more caches have this block, main memory is not up-to-date, and it is this cache's responsibility to update the main memory at the time this block is replaced from the cache; a block may be in this state in only one cache at a time; however it is quite possible that one cache has the block in this state, while others have it in shared-clean state&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Modified (M): '''signifies exclusive ownership as before; the block is modified and present in this cache alone, main memory is stale, and it is this cache's responsibility to supply the data and to update main memory on replacement.&lt;br /&gt;
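As a rough illustration of how the four states interact, the sketch below encodes write-hit and snooped-bus transitions in the way the Dragon protocol is commonly presented in textbooks. The state names follow the list above; the function names, the shared_line flag and the event strings BusRd and BusUpd are assumptions made for this example and are not taken from Figure 2.&lt;br /&gt;

```python
from enum import Enum


class State(Enum):
    E = 'Exclusive'
    SC = 'Shared-clean'
    SM = 'Shared-modified'
    M = 'Modified'


def on_processor_write(state, shared_line):
    """Return (next_state, bus_action) for a write hit.

    shared_line is True when another cache signals that it also holds
    the block; it is only consulted when an update is broadcast.
    """
    if state in (State.E, State.M):
        # Sole copy: write locally, no bus transaction needed.
        return State.M, None
    # Sc or Sm: broadcast the new value to the other sharers.
    if shared_line:
        return State.SM, 'BusUpd'
    # No other cache holds the block any more, so take sole ownership.
    return State.M, 'BusUpd'


def on_snooped_event(state, event):
    """Return the next state when another cache's bus event is observed."""
    if event == 'BusRd':
        # A clean sole copy becomes shared; a dirty sole copy stays the owner.
        return {State.E: State.SC, State.M: State.SM}.get(state, state)
    if event == 'BusUpd':
        # The writer becomes the owner; this copy is updated in place.
        return State.SC
    raise ValueError('unknown bus event: ' + event)
```

Note that no transition leads to an invalid state, which matches the statement above that data is always valid when the tag matches.&lt;br /&gt;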
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Firefly Protocol===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Screen_Shot_2012-03-28_at_10.19.13_PM.png|thumb|upright=2.5|right|alt=Table of additional notations used in the Firefly state transition diagram.|Table 1: Definitions of additional notations in the state transition diagram of Figure 3]]&lt;br /&gt;
&amp;lt;dd&amp;gt;The Firefly protocol was developed by DEC for its Firefly multiprocessor workstation.&lt;br /&gt;
&amp;lt;dd&amp;gt;The Firefly protocol also ensures that data is always valid if the tag matches. Hence, there is no explicit invalid state even though it reserves a miss mode bit for compulsory misses. The Firefly protocol consists of the following three states:&lt;br /&gt;
[[File:Screen_Shot_2012-03-28_at_10.19.25_PM.png|thumb|upright=2.5|center|alt=State diagram of the Firefly protocol.|Figure 3: State diagram of the Firefly protocol.]]&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Valid (V): '''This block has a coherent copy of the memory. There is only one copy of the data in caches.&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Dirty (D):''' The block is the only cached copy and it is not coherent with memory. This is the only state that generates a write-back when the block is replaced in the cache.&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Shared (S):''' This block has a coherent copy of the memory. The data may be shared, but its content is not modified. Figure 3 depicts the state transition diagram of the Firefly protocol, and Table 1 details the additional notations in the state transition diagram.&lt;br /&gt;
&lt;br /&gt;
=Adaptive coherence protocols=&lt;br /&gt;
&lt;br /&gt;
A combined strategy such as an '''adaptive hybrid protocol''' can reduce the pathological behaviors of update and invalidate protocols. Such a protocol should be applicable to a wide range of network characteristics, and it should automatically adjust its behavior to achieve its target goals in the face of changes in traffic patterns, node mobility and other network characteristics.&lt;br /&gt;
&lt;br /&gt;
===Hardware architecture===&lt;br /&gt;
Most often, a processor architecture uses the same block size for transfers between memory and cache and for coherence. &lt;br /&gt;
The approach described here uses different block sizes for transfer and coherence, based on application requirements.&lt;br /&gt;
A normal cache has the following parameters: capacity ('''C'''), block size ('''L''') and associativity ('''K'''). &lt;br /&gt;
A sector cache, in contrast, divides each cache block into subblocks of size '''b'''. Compared with a normal cache ('''C, L, K'''), a sector cache ('''C, L, K, b''') requires one extra state and some extra bus lines to transmit bitmasks describing the status of the subblocks in a particular block, but it maintains the same number of tag and state bits for '''L'''. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
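The relationship between the four sector-cache parameters can be sketched as follows. The function name and the returned field names are hypothetical, and one bitmask bit per subblock is an assumption. The sample values C = 128 KB, K = 2 and b = 8 bytes come from the simulation setup later in this article, and L = 64 bytes is the block size mentioned in the conclusion.&lt;br /&gt;

```python
def sector_cache_geometry(C, L, K, b):
    """Derive basic quantities of a sector cache (C, L, K, b), sizes in bytes."""
    if L % b != 0 or C % (L * K) != 0:
        raise ValueError('b must divide L, and L * K must divide C')
    return {
        'blocks': C // L,                # one tag per block, as in a normal cache
        'sets': C // (L * K),            # K blocks per set
        'subblocks_per_block': L // b,   # assumed width of the status bitmask
    }


geometry = sector_cache_geometry(C=128 * 1024, L=64, K=2, b=8)
print(geometry)
```

With these values the cache holds 2048 blocks in 1024 sets, and each block carries 8 subblocks, so the tag count is unchanged while coherence is tracked at a granularity eight times finer.&lt;br /&gt;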
   &lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Subblock protocol===&lt;br /&gt;
This snoopy-based protocol combines features of the [http://en.wikipedia.org/wiki/MESI_protocol Illinois MESI protocol] and write policies with subblock validation, using subblocks to gain the advantages of both small and large cache block sizes.  &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dt&amp;gt;&lt;br /&gt;
=====Block states=====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Invalid:&amp;lt;/b&amp;gt; All subblocks are invalid&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Valid Exclusive:&amp;lt;/b&amp;gt; None of the valid subblocks in this block is present in any other cache. All subblocks that are Clean Shared may be written without a bus transaction. Any subblocks in the block may be invalid. There may also be Dirty subblocks, which must be written back upon replacement.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Clean Shared:&amp;lt;/b&amp;gt; The block contains subblocks that are either Clean Shared or Invalid. The block can be replaced without a bus transaction.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty Shared:&amp;lt;/b&amp;gt; Subblocks in this block may be in any state. There may be Dirty subblocks, which must be written back on replacement.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dt&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Finite state diagram of block/ line states is as follows:'''&lt;br /&gt;
  &lt;br /&gt;
&amp;lt;center&amp;gt; [[File:line_protocol.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 4.  Finite state diagram of block''' &amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=====Subblock states=====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Invalid:&amp;lt;/b&amp;gt;  The subblock is invalid&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Clean Shared:&amp;lt;/b&amp;gt; A read access to the block will succeed. Unless the block the subblock is a part of is in the Valid Exclusive state, a write to the subblock will force an invalidation transaction on the bus.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty Shared:&amp;lt;/b&amp;gt; The subblock is treated like a Clean Shared subblock, except that it must be written back on replacement. At most one cache will have a given subblock in either the Dirty Shared or Dirty state.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty:&amp;lt;/b&amp;gt; The subblock is exclusive to this cache. It must be written back on replacement. Read and write accesses to this subblock hit with no bus transaction.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Finite state diagram of subblock is as follows:''' &lt;br /&gt;
&amp;lt;center&amp;gt;[[File:subblock_protocol.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 5.  Finite state diagram of Sub-block''' &amp;lt;/center&amp;gt; &lt;br /&gt;
   &lt;br /&gt;
The basic idea of this protocol is to favor cache-to-cache data transfers over off-chip memory accesses.&lt;br /&gt;
In contrast to the Illinois protocol, on a read miss the cache holding the block supplies the requested subblock together with all other Clean Shared and Dirty Shared subblocks in that block. &lt;br /&gt;
If the subblock must come from main memory, the cache snoopers pass along information about the subblocks they hold, and memory provides the requested subblock along with the subblocks that are not currently cached.&lt;br /&gt;
On writes to clean subblocks and on write misses, the snooper invalidates only the subblock being written, which avoids the penalties associated with false sharing.&lt;br /&gt;
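The effect of invalidating a single subblock instead of the whole block can be illustrated with a per-block validity mask. This is a minimal sketch: the CachedBlock class, its one-flag-per-subblock mask and the method names are assumptions made for illustration, not structures defined by the protocol.&lt;br /&gt;

```python
class CachedBlock:
    """One cache block whose subblocks are tracked by a validity mask."""

    def __init__(self, subblocks_per_block):
        self.n = subblocks_per_block
        self.valid = [True] * subblocks_per_block  # all subblocks present

    def snoop_remote_write(self, subblock_index):
        # Subblock granularity: drop only the subblock another cache writes.
        self.valid[subblock_index] = False

    def snoop_remote_write_whole_block(self):
        # Block granularity: every subblock is lost.
        self.valid = [False] * self.n

    def read_hits(self, subblock_index):
        return self.valid[subblock_index]


# Processor A reads subblock 0 while processor B writes subblock 5
# of the same block (false sharing at block granularity).
with_subblocks = CachedBlock(8)
with_subblocks.snoop_remote_write(5)

whole_block = CachedBlock(8)
whole_block.snoop_remote_write_whole_block()

print(with_subblocks.read_hits(0), whole_block.read_hits(0))
```

With subblock invalidation the reader of subblock 0 still hits, whereas block-granularity invalidation turns the same access into a miss even though the two processors never touch the same data.&lt;br /&gt;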
&lt;br /&gt;
&amp;lt;b&amp;gt;In contrast to the Illinois MESI protocol, which requires extra power cycles to maintain the extra states and additional logic, the subblock protocol reduces the number of cache blocks compared to any cache update protocol, and thus reduces power consumption.&amp;lt;/b&amp;gt; &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Read-snarfing protocol===&lt;br /&gt;
This is an enhancement to snoopy-cache coherence protocols that takes advantage of the inherent broadcast nature of the bus.&lt;br /&gt;
&lt;br /&gt;
In contrast to the MESI protocol, when one processor reloads a block into its cache because of a cache miss, the read-snarfing protocol also supplies the block to the other processors whose copies were invalidated in the past. Only one read is required to restore the block to all caches in which it was invalidated.&lt;br /&gt;
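A toy bus model illustrates why a single read is enough. It is a sketch under simplifying assumptions: the Bus class, the set-based bookkeeping and the rule that every previously invalidated cache snarfs the block are hypothetical simplifications of the behavior described above.&lt;br /&gt;

```python
class Bus:
    """Toy shared bus tracking which caches hold one particular block."""

    def __init__(self, cache_ids, snarfing):
        self.snarfing = snarfing
        self.valid = set(cache_ids)   # caches with a valid copy
        self.invalidated = set()      # caches whose copy was invalidated
        self.bus_reads = 0

    def write(self, writer):
        # Write-invalidate: every other valid copy is invalidated.
        self.invalidated |= self.valid - {writer}
        self.invalidated.discard(writer)
        self.valid = {writer}

    def read(self, reader):
        if reader in self.valid:
            return                    # cache hit, no bus traffic
        self.bus_reads += 1           # read miss goes on the bus
        self.valid.add(reader)
        self.invalidated.discard(reader)
        if self.snarfing:
            # Invalidated caches grab the block as it passes on the bus.
            self.valid |= self.invalidated
            self.invalidated.clear()


def reads_after_one_write(snarfing, caches=4):
    bus = Bus(range(caches), snarfing)
    bus.write(0)                      # cache 0 writes, the others are invalidated
    for reader in range(1, caches):
        bus.read(reader)              # each of the others re-reads the block
    return bus.bus_reads


print(reads_after_one_write(snarfing=False), reads_after_one_write(snarfing=True))
```

In this model, four caches sharing a block need three bus reads after a write without snarfing, but only one with snarfing.&lt;br /&gt;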
&lt;br /&gt;
The protocol modifies the normal update protocol by updating only those subblocks that are modified, and updates need to be broadcast only while the data is actively shared. &lt;br /&gt;
The read-snarfing protocol maintains a counter and an invalidation threshold (Tb) for each cache block “b” to overcome the write-after-read drawback of WI. The protocol predicts the number of writes that happen to a single cache block before a read request to the same block. An invalidation request is broadcast when the write counter reaches Tb, and the protocol dynamically adjusts the value of Tb based on the behavior of the program. &lt;br /&gt;
&lt;br /&gt;
A simple version of the Read-snarfing Random Walk algorithm is as follows:&lt;br /&gt;
Initially, Tb of each cache block b is set to 0. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
// A write run is the number of writes made to a block&lt;br /&gt;
// before the block is accessed by another processor&lt;br /&gt;
if (most recent write run &amp;gt; R) {&lt;br /&gt;
     if (Tb &amp;gt; 1) {&lt;br /&gt;
          Tb--;   // long write run: invalidate sooner&lt;br /&gt;
     }&lt;br /&gt;
} else {&lt;br /&gt;
     if (R &amp;gt; Tb) {&lt;br /&gt;
          Tb++;   // short write run: keep updating longer&lt;br /&gt;
     }&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
R:   The invalidation ratio, (Ci + Cr) / Cu&lt;br /&gt;
Ci:  The cost in bus cycles of an invalidation transaction&lt;br /&gt;
Cu:  The cost in bus cycles of an update transaction&lt;br /&gt;
Cr:  The cost in bus cycles of reading a cache block&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When the threshold drops to 0, the block is invalidated immediately, with no wasted updates.  &lt;br /&gt;
When the block is actively shared, Tb is adjusted upward so that the block is not invalidated. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&amp;lt;dd&amp;gt;&lt;br /&gt;
=====Example 1 =====&lt;br /&gt;
&amp;lt;dd&amp;gt;Suppose the invalidation ratio (R) = 5&lt;br /&gt;
&amp;lt;dd&amp;gt;Current threshold of the block (Tb) = 3&lt;br /&gt;
&amp;lt;dd&amp;gt;If the processor writes 4 times before the block is accessed by another processor, then according to the above logic Tb will be 4.&lt;br /&gt;
&amp;lt;dd&amp;gt;This means Tb is at the best possible value and only updates are issued.&lt;br /&gt;
&lt;br /&gt;
=====Example 2 =====&lt;br /&gt;
&amp;lt;dd&amp;gt;Consider R = 5 and Tb = 3 for a particular block.&lt;br /&gt;
&amp;lt;dd&amp;gt;If the processor writes 10 times before the block is accessed by another processor,&lt;br /&gt;
&amp;lt;dd&amp;gt;Tb will be 2 (decreased).&lt;br /&gt;
&amp;lt;dd&amp;gt;So the protocol incurs a cost of 2 updates, 1 invalidate and 1 reread.&lt;br /&gt;
&amp;lt;dd&amp;gt;After 2 more writes, Tb will be 0 and invalidation will occur immediately. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
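The threshold adjustment and the first steps of the two examples above can be reproduced with a short sketch. It follows the pseudocode as written, including its lower bound of 1 on the decrement; the function names are hypothetical, and the bus-cycle costs passed to invalidation_ratio are made-up values chosen only so that R = 5.&lt;br /&gt;

```python
def invalidation_ratio(ci, cr, cu):
    """R = (Ci + Cr) / Cu, with all costs measured in bus cycles."""
    return (ci + cr) / cu


def adjust_threshold(tb, write_run, r):
    """Return the new Tb after observing one write run."""
    if write_run > r:
        # Long write run: updates were wasted, so invalidate sooner.
        if tb > 1:
            tb -= 1
    else:
        # Short write run: the block is actively shared, keep updating.
        if r > tb:
            tb += 1
    return tb


r = invalidation_ratio(ci=2, cr=8, cu=2)         # illustrative costs, R = 5.0
example_1 = adjust_threshold(tb=3, write_run=4, r=r)
example_2 = adjust_threshold(tb=3, write_run=10, r=r)
print(example_1, example_2)
```

The first call raises Tb from 3 to 4, as in Example 1, and the second lowers it from 3 to 2, as in the first step of Example 2.&lt;br /&gt;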
&lt;br /&gt;
==Simulation result==&lt;br /&gt;
For all simulated results, the subblock size ('''b''') was '''8 bytes''' and the caches were two-way set-associative ('''K = 2'''). To minimize simulation time, only relatively large caches ('''C = 128K''') were simulated.&lt;br /&gt;
&lt;br /&gt;
First, below are the results for the two applications ('''Gauss and Cholesky''') that do well using large block sizes. Next, the results are listed for the three applications ('''MP3D, Topopt, and Pverify''') that perform better using smaller block sizes. Finally, results are reported for Barnes, for which the choice of block size is not very important. The simulations compared normal and sector caches with '''1, 4, 16 and 32 processors''', where the normal cache uses the Illinois protocol and the sector cache uses the read-snarfing protocol. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Simulation]&amp;lt;/ref&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:image500.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 6.  (a-c) Gauss (128K cache) uses large cache block ''' &amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:Gauss32.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:cholesky_128k.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 7.  (a-c) Cholesky (128 K cache) application uses large cache block'''&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:gauss_32pro.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:Cholesky.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:mp3d.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 8.  (a-c) MP3D (128 K caches) application which uses smaller block sizes.'''&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:results_summary_1.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:results_summary_2.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:results_summary_3.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 9.  Final Simulation Results'''&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Conclusion==&lt;br /&gt;
This article described two techniques for improving application performance on bus-based multiprocessors. The first technique is the subblock cache coherence protocol, which combines large transfer blocks with small coherence subblocks to gain the advantages of spatial locality while avoiding false sharing.   &lt;br /&gt;
The second technique is read-snarfing, which reduces the number of read misses caused by earlier cache coherence protocol actions.  &lt;br /&gt;
The simulation results above showed that the subblock protocol works better than the MESI protocol for the 64-byte cache block size, but for small block sizes the Illinois protocol works better because the subblock protocol uses a large transfer block.&lt;br /&gt;
In addition, the read-snarfing protocol is effective in reducing the amount of data transferred and the number of bus transactions, which results in lower bus utilization and higher application speedups. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Quiz==&lt;br /&gt;
The following quiz is intended for the reader to benefit from this article and to help achieve a complete understanding of the research. There are a total of 10 multiple-choice questions.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;&lt;br /&gt;
'''1) What are the two commonly used coherence protocols? '''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Token-based &amp;amp; Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Update &amp;amp; Token-based&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Write-Invalidate &amp;amp; Write-Update &lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Both a and c&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''2) A coherency protocol is a protocol which maintains  __________      ____________ according to a specific ___________     ___________'''&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	System order / memory hierarchy&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Memory coherence / consistency model&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Bus hierarchy / system order&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''3) Increased demand for bus bandwidth is most commonly seen in which protocol?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Invalidate&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Token-based&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Write-Update &lt;br /&gt;
&amp;lt;dd&amp;gt;e)	Both a and b&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''4) Increased conflict and capacity cache misses are mostly seen in which protocol?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Token-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Update &lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Write-Invalidate&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	Both a and d&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''5) Due to the good use of  _________  __________ some applications execute more quickly with a large cache block size, whereas others run better when cache block sizes are small by avoiding migratory data or ________  _________ between processors.'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Spatial locality / false sharing&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Equidistant locality / sequential consistency&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Temporal locality / true sharing&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''6) The Subblock Protocol consists of four states: Invalid, Clean Shared, Dirty and __________. Name the missing state?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Exclusive&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Valid&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Dirty Shared&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Shared&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''7) In the Subblock Protocol, all subblocks that are clean-shared may be written without a bus transaction. In what state is this achieved?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Valid Exclusive&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Invalid&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Clean Shared&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Dirty Shared&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''8) Which protocol maintains a counter and an invalidation threshold for each cache block to overcome the write-after-read drawback of Write-Invalidate?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Invalidate&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Subblock protocol &lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Read-snarfing protocol&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	Write-Update &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''9) Based on the simulation in this article, the Subblock Protocol worked better than the MESI protocol in specific executions. Under what circumstances would the MESI protocol work better than the Subblock Protocol?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	For larger block sizes, since MESI uses large block sizes&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	When effectively supplying the block to other processors whose blocks were invalidated in the past&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	When MESI broadcasts updates only while the data is actively shared&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	For small block sizes, because the Subblock Protocol uses a large transfer block&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''10) Why was it stated that the Read-snarfing protocol achieved lower bus utilization and higher application speedups?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Because instructions on any given processor execute in partial program order, but may not propagate in that order&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Because it’s effective in reducing the amount of data transfer and the number of bus transactions &lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Because Write Update reduces all types of cache misses relative to Write Invalidate, therefore Read-snarfing gains performance &lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Because multiple caches can be in its Shared State simultaneously.&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	None of the above&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
*[http://www.computer.org/portal/web/csdl/doi/10.1109/HPCA.1995.386536 Two techniques for improving performance on bus-based multiprocessors]&lt;br /&gt;
&lt;br /&gt;
*[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Design of an Adaptive Cache Coherence Protocol for Large Scale Multiprocessors]&lt;br /&gt;
&lt;br /&gt;
*[http://www.lfbs.rwth-aachen.de/content/smi Shared Memory Interface]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references&amp;gt;&amp;lt;/references&amp;gt;&lt;/div&gt;</summary>
		<author><name>Sbasu3</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8b_va&amp;diff=60703</id>
		<title>CSC/ECE 506 Spring 2012/8b va</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8b_va&amp;diff=60703"/>
		<updated>2012-03-29T03:47:33Z</updated>

		<summary type="html">&lt;p&gt;Sbasu3: /* Write-Invalidate Protocol */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Update and adaptive coherence protocols on real architectures, and power considerations=&lt;br /&gt;
 &lt;br /&gt;
===Introduction===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In a [http://en.wikipedia.org/wiki/Shared_memory shared memory], shared-bus multiprocessor system, the cache coherence protocol plays an important role: it propagates changes from one cache to the others. In most cases, however, update and invalidate coherence protocols are the main source of bus contention, which increases the number of busy bus cycles and therefore program execution time, because a processor may stall while its cache waits for the bus. To reduce bus contention, this article presents several adaptive hybrid protocol strategies for high-performance multiprocessors.&amp;lt;ref&amp;gt;[http://en.wikipedia.org/wiki/Shared_memory shared memory]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Coherence protocols===&lt;br /&gt;
In general terms, a coherency protocol is a protocol which maintains the consistency between all the caches in a system of [http://en.wikipedia.org/wiki/Shared_memory shared memory]. The protocol maintains memory coherence according to a specific consistency model. Older multiprocessors support the [http://en.wikipedia.org/wiki/Sequential_consistency sequential consistency model], while modern shared memory systems typically support the release consistency or weak consistency models. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:Cache_Coherency_Generic.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 1.  Multiple Caches of Shared Resource''' &amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Transitions between states in any specific implementation of these protocols may vary. For example, an implementation may choose different update and invalidation transitions such as update-on-read, update-on-write, invalidate-on-read, or invalidate-on-write. The choice of transition may affect the amount of inter-cache traffic, which in turn may affect the amount of cache bandwidth available for actual work. This should be taken into consideration in the design of distributed software that could cause strong contention between the caches of multiple processors.&lt;br /&gt;
Various models and protocols have been devised for maintaining cache coherence, such as MSI, MESI (also known as the Illinois protocol), MOSI, MOESI, MERSI, MESIF, write-once, Synapse, Berkeley, Firefly and Dragon.&lt;br /&gt;
&lt;br /&gt;
For the purpose of the main study, this article will focus on the two commonly used coherence protocols: &amp;lt;b&amp;gt;Write-Invalidate (WI) and Write-Update (WU)&amp;lt;/b&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Write-Update Protocol=&lt;br /&gt;
In a [http://cs.gmu.edu/cne/modules/dsm/purple/wr_update.html Write-Update Protocol], a processor broadcasts its updates to shared data to the other caches so that they stay coherent.&lt;br /&gt;
&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=====&amp;lt;dd&amp;gt;   &amp;lt;b&amp;gt;Disadvantages of Write-Invalidate Protocol&amp;lt;/b&amp;gt;=====&lt;br /&gt;
&amp;lt;dd&amp;gt; Any write by a processor invalidates the copies of the shared cache block held by other processors, forcing those caches to issue bus requests to reload the new data, which increases the demand for bus bandwidth. This is worse if one processor frequently updates a cache block while another processor stalls waiting to read the same block. For a sequence of '''n''' writes by one processor, each followed by a read from another processor, the '''WI''' protocol performs '''n''' invalidations and '''n''' cache block reads.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Write-Invalidate]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=====&amp;lt;dd&amp;gt;   &amp;lt;b&amp;gt;Disadvantages of Write-Update Protocol&amp;lt;/b&amp;gt;=====&lt;br /&gt;
&amp;lt;dd&amp;gt; An update protocol is advantageous in the case above because it only updates the cache block '''n''' times. However, an update protocol can keep refreshing data that other processors' caches no longer need, so fewer cache blocks are available for more useful data. This tends to increase conflict and capacity cache misses.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Write-Update]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
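The trade-off described above can be sketched in a few lines of Python. This is only an illustration, not a simulator: it assumes a single writer and a single reader that strictly alternate, and the function names and the bus-cycle cost parameters are hypothetical, introduced here for the example.&lt;br /&gt;

```python
def write_invalidate_transactions(n):
    """Bus transactions for n write/read rounds under write-invalidate.

    Each write by the producer invalidates the consumer's copy, and each
    following read by the consumer misses and reloads the whole block.
    """
    invalidations = n
    block_reads = n
    return invalidations, block_reads


def write_update_transactions(n):
    """Bus transactions for the same n rounds under write-update.

    Each write broadcasts one update; the consumer's copy stays valid,
    so its reads hit in the cache and need no bus transaction.
    """
    updates = n
    return updates


def bus_cycles_wi(n, c_inv, c_read):
    """Total bus cycles under WI, given hypothetical per-transaction costs."""
    invalidations, block_reads = write_invalidate_transactions(n)
    return invalidations * c_inv + block_reads * c_read


def bus_cycles_wu(n, c_upd):
    """Total bus cycles under WU, given a hypothetical update cost."""
    return write_update_transactions(n) * c_upd
```

The sketch only covers the pattern in which every write is followed by a remote read. When a block is written many times without being read elsewhere, the update count keeps growing while an invalidate protocol would pay only once, which is the case the hybrid protocols later in this article address.&lt;br /&gt;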
&lt;br /&gt;
====Consideration of cache architecture issue====&lt;br /&gt;
Application execution time also depends directly on the cache block size. &lt;br /&gt;
Some applications execute more quickly with a large cache block size because they exhibit good spatial locality, whereas other applications run better with small cache blocks because these avoid migratory data or false sharing between processors. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;b&amp;gt;Write-Update protocol:&amp;lt;/b&amp;gt; Two common update protocols are the Dragon and Firefly protocols.&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&amp;lt;dd&amp;gt;&lt;br /&gt;
===Dragon Protocol===&lt;br /&gt;
[[File:Screen_Shot_2012-03-28_at_10.29.04_PM.png|thumb|upright=10.5|right|alt=Table of additional notations used in the Dragon protocol state transition diagram.|Table 1: Definitions of additional notations in the state transition diagram of Dragon Protocol]]&lt;br /&gt;
[[File:Screen Shot 2012-03-28 at 10.18.57 PM.png|thumb|upright=2.5|right|alt=State transition diagram of the Dragon protocol.|Figure 2: State transition diagram of the Dragon protocol]]&lt;br /&gt;
&amp;lt;dd&amp;gt;This protocol was first proposed by researchers at Xerox PARC for their Dragon multiprocessor system.&lt;br /&gt;
The Dragon protocol ensures that data is always valid if the tag matches. Hence, there is no explicit invalid state even though it reserves a miss mode bit for compulsory misses. The Dragon protocol consists of four states:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Exclusive (E): '''means that only one cache (this cache) has a copy of the block, and it has not been modified (the main memory is up-to-date)&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Shared-clean (Sc): '''means that potentially two or more caches (including this one) have this block, and main memory may or may not be up-to-date&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Shared-modified(Sm):''' means that potentially two or more caches have this block, main memory is not up-to-date, and it is this cache's responsibility to update the main memory at the time this block is replaced from the cache; a block may be in this state in only one cache at a time; however it is quite possible that one cache has the block in this state, while others have it in shared-clean state&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Modified (M): '''signifies exclusive ownership; the block is modified and present in this cache alone, main memory is stale, and it is this cache's responsibility to supply the data and to update main memory on replacement.&lt;br /&gt;
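The processor-side behaviour of these four states can be sketched as a small transition function. The sketch follows the common textbook description of the Dragon protocol rather than anything stated in this article: the transaction names BusRd and BusUpd, the shared-line signal, and the use of None for a block that is not cached are all assumptions of the example. Snooper-side transitions are omitted.&lt;br /&gt;

```python
# States named as in the list above; None stands for "block not in the cache",
# because the protocol has no explicit invalid state.
E, SC, SM, M = "E", "Sc", "Sm", "M"


def processor_read(state, shared_line):
    """Return (next_state, bus_transactions) for a processor read."""
    if state is None:
        # Read miss: fetch the block. The shared line tells the requester
        # whether any other cache also holds a copy.
        return (SC if shared_line else E), ["BusRd"]
    # Read hit: no bus traffic, state unchanged.
    return state, []


def processor_write(state, shared_line):
    """Return (next_state, bus_transactions) for a processor write."""
    if state is None:
        # Write miss: fetch the block, then broadcast the new value
        # only if some other cache shares it.
        if shared_line:
            return SM, ["BusRd", "BusUpd"]
        return M, ["BusRd"]
    if state in (E, M):
        # Sole copy: the write is purely local.
        return M, []
    # Sc or Sm: broadcast the update; remain shared only if sharers remain.
    return (SM if shared_line else M), ["BusUpd"]
```

For example, a write to a block in Sc while another cache still holds it gives processor_write("Sc", True), which moves the writer to Sm and places one update on the bus; no copy is invalidated.&lt;br /&gt;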
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Firefly Protocol===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:Screen_Shot_2012-03-28_at_10.19.13_PM.png|thumb|upright=2.5|right|alt=Table of additional notations used in the Firefly protocol state transition diagram.|Table 2: Definitions of additional notations in the state transition diagram of Figure 3]]&lt;br /&gt;
&amp;lt;dd&amp;gt;The Firefly protocol was developed by DEC for its Firefly multiprocessor workstation.&lt;br /&gt;
&amp;lt;dd&amp;gt;The Firefly protocol also ensures that data is always valid if the tag matches. Hence, there is no explicit invalid state even though it reserves a miss mode bit for compulsory misses. The Firefly protocol consists of the following three states:&lt;br /&gt;
[[File:Screen_Shot_2012-03-28_at_10.19.25_PM.png|thumb|upright=2.5|center|alt=State transition diagram of the Firefly protocol.|Figure 3: State diagram of the Firefly protocol.]]&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Valid (V): '''This block has a coherent copy of the memory. There is only one copy of the data in caches.&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Dirty (D):''' The block is the only cached copy of the data and it is inconsistent with main memory. This is the only state that generates a write-back when the block is replaced in the cache.&lt;br /&gt;
&amp;lt;dd&amp;gt;'''Shared (S):''' This block has a coherent copy of the memory. The data may be shared, but its content is not modified. Figure 3 depicts the state transition diagram of the Firefly protocol, and Table 2 details the additional notations in the state transition diagram.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Adaptive coherence protocols=&lt;br /&gt;
&lt;br /&gt;
A combined strategy such as an '''adaptive hybrid protocol''' can reduce the pathological behaviors of pure update and pure invalidate protocols. Such a protocol should be applicable to a wide range of network characteristics, and it should automatically adjust its behavior to achieve target goals in the face of changes in traffic patterns, node mobility and other network characteristics.&lt;br /&gt;
&lt;br /&gt;
===Hardware architecture===&lt;br /&gt;
Most processor architectures use the same block size for transfers between memory and cache and for coherence. &lt;br /&gt;
The approach described here uses different block sizes for transfer and for coherence, based on application requirements.&lt;br /&gt;
A conventional cache has the following parameters: capacity ('''C'''), block size ('''L''') and associativity ('''K'''). &lt;br /&gt;
A sector cache additionally divides each cache block into subblocks of size '''b'''. Compared to a conventional cache ('''C, L, K'''), a sector cache ('''C, L, K, b''') requires one extra state and some extra bus lines to transmit bitmasks that describe the status of the subblocks in a particular block, but it maintains the same number of tag and state bits for '''L'''. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
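As a quick arithmetic check of what the four parameters imply, the following sketch derives the number of blocks, sets and subblocks per block. The function name and the returned keys are hypothetical, and all sizes are assumed to be given in bytes.&lt;br /&gt;

```python
def sector_cache_geometry(capacity, block_size, associativity, subblock_size):
    """Derive basic counts for a sector cache (C, L, K, b); sizes in bytes."""
    if block_size % subblock_size != 0:
        raise ValueError("block size must be a multiple of the subblock size")
    blocks = capacity // block_size
    return {
        "blocks": blocks,
        "sets": blocks // associativity,
        "subblocks_per_block": block_size // subblock_size,
    }
```

With the parameters reported in the simulation section of this article (C = 128K, K = 2, b = 8) and a 64-byte block, sector_cache_geometry(128 * 1024, 64, 2, 8) gives 2048 blocks, 1024 sets and 8 subblocks per block.&lt;br /&gt;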
   &lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Subblock protocol===&lt;br /&gt;
This snoopy protocol combines features of the [http://en.wikipedia.org/wiki/MESI_protocol Illinois MESI protocol] and write policies with subblock validation, using subblocks to obtain the advantages of both small and large cache block sizes.  &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dt&amp;gt;&lt;br /&gt;
=====Block states=====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Invalid:&amp;lt;/b&amp;gt; All subblocks are invalid&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Valid Exclusive:&amp;lt;/b&amp;gt; None of the valid subblocks in this block is present in any other cache. All subblocks that are clean shared may be written without a bus transaction. Any subblocks in the block may be invalid. There may also be Dirty subblocks, which must be written back upon replacement.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Clean Shared:&amp;lt;/b&amp;gt; The block contains subblocks that are either Clean Shared or Invalid. The block can be replaced without a bus transaction.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty Shared:&amp;lt;/b&amp;gt; Subblocks in this block may be in any state. There may be Dirty subblocks, which must be written back on replacement.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dt&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''The finite state diagram of the block (line) states is as follows:'''&lt;br /&gt;
  &lt;br /&gt;
&amp;lt;center&amp;gt; [[File:line_protocol.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 4.  Finite state diagram of block''' &amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=====Subblock states=====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Invalid:&amp;lt;/b&amp;gt;  The subblock is invalid&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Clean Shared:&amp;lt;/b&amp;gt; A read access to the subblock will succeed. Unless the block that contains the subblock is in the Valid Exclusive state, a write to the subblock forces an invalidation transaction on the bus.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty Shared:&amp;lt;/b&amp;gt; The subblock is treated like a Clean Shared subblock, except that it must be written back on replacement. At most one cache will have a given subblock in either the Dirty Shared or Dirty state.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty:&amp;lt;/b&amp;gt; The subblock is exclusive to this cache. It must be written back on replacement. Read and write accesses to this subblock hit with no bus transaction.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Finite state diagram of subblock is as follows:''' &lt;br /&gt;
&amp;lt;center&amp;gt;[[File:subblock_protocol.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 5.  Finite state diagram of Sub-block''' &amp;lt;/center&amp;gt; &lt;br /&gt;
   &lt;br /&gt;
The basic idea of this protocol is to transfer more data between caches and to access off-chip memory less often.&lt;br /&gt;
In contrast with the Illinois protocol, on a read miss the cache that holds the block supplies the requested subblock together with all other clean and dirty shared subblocks in that block. &lt;br /&gt;
If the subblock is supplied by main memory, the snooping caches report which subblocks they hold, and memory provides the requested subblock along with the subblocks that are not currently cached.&lt;br /&gt;
On a write to a clean subblock or on a write miss, the snooper invalidates only the subblock being written, which avoids the penalties associated with false sharing.&lt;br /&gt;
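The effect of invalidating at subblock granularity can be illustrated with simple address arithmetic. In this sketch the function names are hypothetical, addresses are byte addresses, and the granule is the unit that a write invalidates in other caches: the whole block in a conventional protocol, or a single subblock in the subblock protocol.&lt;br /&gt;

```python
def invalidated_range(write_addr, granule):
    """Byte addresses invalidated in other caches by a write to write_addr."""
    start = (write_addr // granule) * granule
    return range(start, start + granule)


def causes_false_sharing(write_addr, other_addr, granule):
    """True if a write to write_addr also invalidates a different address
    (other_addr) that another processor is using."""
    if other_addr == write_addr:
        return False  # same address: this is true sharing, not false sharing
    return other_addr in invalidated_range(write_addr, granule)
```

With one processor writing address 0 and another using address 32, a 64-byte invalidation granule gives causes_false_sharing(0, 32, 64) equal to True, while an 8-byte subblock granule gives causes_false_sharing(0, 32, 8) equal to False.&lt;br /&gt;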
&lt;br /&gt;
&amp;lt;b&amp;gt;In contrast to the Illinois MESI Protocol, which requires extra power cycles to maintain the extra states and additional logic, the subblock protocol reduces the number of cache blocks compared to cache update protocols, and thus reduces power consumption.&amp;lt;/b&amp;gt; &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Read-snarfing protocol===&lt;br /&gt;
This is an enhancement to snoopy-cache coherence protocols that takes advantage of the inherent broadcast nature of the bus.&lt;br /&gt;
&lt;br /&gt;
In contrast with the MESI protocol, when one processor reloads a block into its cache after a cache miss, the read-snarfing protocol also supplies the block to the other processors whose copies were invalidated in the past. Only one read is required to restore the block in all caches in which it was invalidated.&lt;br /&gt;
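The saving can be illustrated by counting bus reads. This is a minimal sketch with a hypothetical function name; it assumes that every cache whose copy was invalidated eventually reads the block again and that no write intervenes in the meantime.&lt;br /&gt;

```python
def bus_reads_to_restore(invalidated_caches, snarfing):
    """Count bus reads until every listed cache holds the block again."""
    waiting = set(invalidated_caches)
    bus_reads = 0
    while waiting:
        # One of the waiting caches misses and issues a bus read.
        waiting.pop()
        bus_reads += 1
        if snarfing:
            # The remaining caches capture the block as it passes on the bus,
            # so none of them needs to issue its own read later.
            waiting.clear()
    return bus_reads
```

For three caches with invalidated copies, the count is 3 without snarfing and 1 with snarfing.&lt;br /&gt;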
&lt;br /&gt;
The protocol modifies the normal update protocol by updating only those subblocks that were modified, and updates need to be broadcast only while the data is actively shared. &lt;br /&gt;
The read-snarfing protocol maintains a write counter and an invalidation threshold (Tb) for each cache block “b” to overcome the write-after-read drawback of WI. The protocol predicts the number of write operations that happen on a single cache block before a read request to the same cache block. An invalidation request is broadcast when the write counter reaches Tb, and the protocol dynamically adjusts the value of Tb based on the behavior of the program. &lt;br /&gt;
&lt;br /&gt;
A simple algorithm for the Read-snarfing Random Walk protocol is as follows.&lt;br /&gt;
Initially, Tb of each cache block b is set to 0. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
// R  = invalidation ratio = (Ci + Cr) / Cu&lt;br /&gt;
// Ci = cost in bus cycles of an invalidation transaction&lt;br /&gt;
// Cu = cost in bus cycles of an update transaction&lt;br /&gt;
// Cr = cost in bus cycles of reading a cache block&lt;br /&gt;
&lt;br /&gt;
// "most recent write run" = number of writes to block b before it is accessed by another processor&lt;br /&gt;
if (most recent write run &amp;gt; R) {&lt;br /&gt;
    // Long write run: lower the threshold so that invalidation happens sooner (Tb may fall to 0)&lt;br /&gt;
    if (Tb &amp;gt; 0) {&lt;br /&gt;
        Tb--;&lt;br /&gt;
    }&lt;br /&gt;
} else {&lt;br /&gt;
    // Short write run: raise the threshold (up to R) so that the block keeps being updated&lt;br /&gt;
    if (R &amp;gt; Tb) {&lt;br /&gt;
        Tb++;&lt;br /&gt;
    }&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When the threshold falls to 0, the block is invalidated immediately, with no wasted updates.  &lt;br /&gt;
When the block is actively shared, Tb is adjusted upward so that the block is not invalidated. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&amp;lt;dd&amp;gt;&lt;br /&gt;
=====Example 1 =====&lt;br /&gt;
&amp;lt;dd&amp;gt;Suppose the invalidation ratio (R) = 5&lt;br /&gt;
&amp;lt;dd&amp;gt;Current threshold of the block (Tb) = 3&lt;br /&gt;
&amp;lt;dd&amp;gt;If the processor writes 4 times before the block is accessed by another processor, then according to the above logic Tb is increased to 4.&lt;br /&gt;
&amp;lt;dd&amp;gt;This means Tb is at the best possible value and only updates are issued.&lt;br /&gt;
&lt;br /&gt;
=====Example 2 =====&lt;br /&gt;
&amp;lt;dd&amp;gt;Consider R = 5 and Tb = 3 for a particular block.&lt;br /&gt;
&amp;lt;dd&amp;gt;If the processor writes 10 times before the block is accessed by another processor,&lt;br /&gt;
&amp;lt;dd&amp;gt;Tb is decreased to 2.&lt;br /&gt;
&amp;lt;dd&amp;gt;So the protocol incurs a cost of 2 updates, 1 invalidate and 1 reread.&lt;br /&gt;
&amp;lt;dd&amp;gt;After 2 more such write runs, Tb will be 0 and invalidation will occur immediately. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
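The two examples can be replayed with a small Python version of the threshold adjustment. This is a sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, and the lower bound of 0 on Tb is taken from the statement above that the threshold can fall to 0.&lt;br /&gt;

```python
def invalidation_ratio(c_inv, c_read, c_upd):
    """R = (Ci + Cr) / Cu, with all costs given in bus cycles."""
    return (c_inv + c_read) / c_upd


def adjust_threshold(tb, write_run, ratio):
    """Return the new threshold Tb after one observed write run.

    write_run is the number of writes to the block before another
    processor accessed it.
    """
    if write_run > ratio:
        # Long write run: updates were wasted, so invalidate sooner next time.
        if tb > 0:
            tb -= 1
    elif ratio > tb:
        # Short write run: the block is actively shared, so keep updating longer.
        tb += 1
    return tb
```

With R = 5 and Tb = 3, adjust_threshold(3, 4, 5) returns 4 as in Example 1 and adjust_threshold(3, 10, 5) returns 2 as in Example 2. Two further runs of 10 writes bring the threshold to 1 and then to 0.&lt;br /&gt;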
&lt;br /&gt;
==Simulation result==&lt;br /&gt;
For all simulated results, the subblock size ('''b''') was '''8 bytes''' and the caches were two-way set-associative ('''K = 2'''). To minimize simulation time, only relatively large caches ('''C = 128K''') were simulated.&lt;br /&gt;
&lt;br /&gt;
First, below are the results for the two applications ('''Gauss and Cholesky''') that do well with large block sizes. Next come the results for the three applications ('''MP3D, Topopt, and Pverify''') that perform better with smaller block sizes. Finally, results are reported for Barnes, for which the choice of block size is not very important. The simulations compared the protocols on both conventional and sector caches with '''1, 4, 16 and 32 processors''', where the conventional cache uses the Illinois protocol and the sector cache uses the read-snarfing protocol. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Simulation]&amp;lt;/ref&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:image500.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 6.  (a-c) Gauss (128K cache) uses large cache block ''' &amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:Gauss32.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:cholesky_128k.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 7.  (a-c) Cholesky (128 K cache) application uses large cache block'''&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:gauss_32pro.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:Cholesky.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:mp3d.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 8.  (a-c) MP3D (128 K caches) application which uses smaller block sizes.'''&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:results_summary_1.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:results_summary_2.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:results_summary_3.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 9.  Final Simulation Results'''&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Conclusion==&lt;br /&gt;
This article described two techniques to improve application performance on bus-based multiprocessors. The first technique is the subblock cache coherence protocol, which combines a large transfer block with small coherence subblocks to gain the advantages of spatial locality while avoiding false sharing.   &lt;br /&gt;
The second technique is read-snarfing, which reduces the number of read misses caused by earlier cache coherence protocol actions.  &lt;br /&gt;
The simulation results above showed that the subblock protocol works better than the MESI protocol for the 64-byte cache block size, but for small block sizes the Illinois protocol works better because the subblock protocol uses a large transfer block.&lt;br /&gt;
In addition, the read-snarfing protocol is effective in reducing the amount of data transferred and the number of bus transactions, which results in lower bus utilization and higher application speedups. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Quiz==&lt;br /&gt;
The following quiz is intended for the reader to benefit from this article and to help achieve a complete understanding of the research. There are a total of 10 multiple-choice questions.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;&lt;br /&gt;
'''1) What are the two commonly used coherence protocols? '''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Token-based &amp;amp; Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Update &amp;amp; Token-based&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Write-Invalidate &amp;amp; Write-Update &lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Both a and c&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''2) A coherency protocol is a protocol which maintains  __________      ____________ according to a specific ___________     ___________'''&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	System order / memory hierarchy&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Memory coherence / consistency model&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Bus hierarchy / system order&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''3) Increasing bus bandwidth is most commonly seen in what protocol?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Invalidate&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Token-based&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Write-Update &lt;br /&gt;
&amp;lt;dd&amp;gt;e)	Both a and b&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''4) Increasing conflict and capacity cache misses is mostly seen in what protocol?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Token-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Update &lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Write-Invalidate&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	Both a and d&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''5) Due to the good use of  _________  __________ some applications execute more quickly with large cache block size, whereas others run better when cache block sizes are small by avoiding migratory data or ________  _________ between processors'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Spatial locality / false-sharing&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Equidistant locality /Sequential consistency &lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Temporal locality/ true sharing &lt;br /&gt;
&amp;lt;dd&amp;gt;d)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''6) The Subblock Protocol consists of four states: Invalid, Clean Shared, Dirty and __________. Name the missing state?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Exclusive&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Valid&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Dirty Shared&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Shared&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''7) In the Subblock Protocol, all subblocks that are clean-shared may be written without a bus transaction. In what state is this achieved?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Valid Exclusive&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Invalid&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Clean Shared&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Dirty Shared&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''8) Which protocol maintains a counter and invalidation threshold for each cache block to overcome the drawback of Write-Invalidate’s write-after-read problem?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Invalidate&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Subblock protocol &lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Read-snarfing protocol&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	Write-Update &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''9) Based on the simulation in this article, the Subblock Protocol worked better than the MESI protocol in specific executions. Under what circumstances would the MESI Protocol work better than the Subblock Protocol?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	For larger block sizes, since MESI uses large block sizes&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	When effectively supplying the block to other processors whose blocks were invalidated in the past&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	MESI will update the need to broadcast when the data is actively shared&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	For small block size because Subblock Protocol uses large transfer block&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''10) Why was the Read-snarfing protocol said to achieve lower bus utilization and higher application speedups?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Because instructions on any given processor execute in partial program order, but may not propagate in that order&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Because it is effective in reducing the amount of data transferred and the number of bus transactions&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Because Write Update reduces all types of cache misses relative to Write Invalidate, therefore Read-snarfing gains performance &lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Because multiple caches can be in its Shared State simultaneously.&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	None of the above&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
*[http://www.computer.org/portal/web/csdl/doi/10.1109/HPCA.1995.386536 Two techniques for improving performance on bus-based multiprocessors]&lt;br /&gt;
&lt;br /&gt;
*[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Design of an Adaptive Cache Coherence Protocol for Large Scale Multiprocessors]&lt;br /&gt;
&lt;br /&gt;
*[http://www.lfbs.rwth-aachen.de/content/smi Shared Memory Interface]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references&amp;gt;&amp;lt;/references&amp;gt;&lt;/div&gt;</summary>
		<author><name>Sbasu3</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8b_va&amp;diff=60647</id>
		<title>CSC/ECE 506 Spring 2012/8b va</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8b_va&amp;diff=60647"/>
		<updated>2012-03-29T01:05:29Z</updated>

		<summary type="html">&lt;p&gt;Sbasu3: /* Write-Update Protocol */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Update and adaptive coherence protocols on real architectures, and power considerations=&lt;br /&gt;
 &lt;br /&gt;
===Introduction===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In a [http://en.wikipedia.org/wiki/Shared_memory shared memory], shared-bus multiprocessor system, the cache coherence protocol plays an important role: it propagates changes from one cache to the others. In many cases, however, update and invalidate coherence traffic is the main source of bus contention. This increases the number of busy bus cycles and therefore program execution time, because a processor may stall while its cache waits for the bus. To reduce bus contention, this article presents adaptive hybrid protocol strategies for high-performance multiprocessors.&amp;lt;ref&amp;gt;[http://en.wikipedia.org/wiki/Shared_memory shared memory]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Coherence protocols===&lt;br /&gt;
In general terms, a coherence protocol keeps all the caches in a [http://en.wikipedia.org/wiki/Shared_memory shared memory] system consistent with one another. The protocol maintains memory coherence according to a specific consistency model. Older multiprocessors support the [http://en.wikipedia.org/wiki/Sequential_consistency sequential consistency model], while modern shared memory systems typically support the release consistency or weak consistency models. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:Cache_Coherency_Generic.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 1.  Multiple Caches of Shared Resource''' &amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Transitions between states in any specific implementation of these protocols may vary. For example, an implementation may choose different update and invalidation transitions such as update-on-read, update-on-write, invalidate-on-read, or invalidate-on-write. The choice of transition may affect the amount of inter-cache traffic, which in turn may affect the amount of cache bandwidth available for actual work. This should be taken into consideration in the design of distributed software that could cause strong contention between the caches of multiple processors.&lt;br /&gt;
Various models and protocols have been devised for maintaining cache coherence, such as the MSI protocol, MESI (also known as the Illinois protocol), MOSI, MOESI, MERSI, MESIF, write-once, Synapse, Berkeley, Firefly and Dragon protocols.&lt;br /&gt;
&lt;br /&gt;
For the purpose of the main study, this article will focus on the two commonly used coherence protocols: &amp;lt;b&amp;gt;Write-Invalidate (WI) and Write-Update (WU)&amp;lt;/b&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=====Write-Invalidate Protocol=====&lt;br /&gt;
In a [http://cs.gmu.edu/cne/modules/dsm/purple/wr_inval.html Write-Invalidate Protocol], a processor that writes to a shared block first invalidates the copies of that block in all other processors' caches and then updates its own copy without further bus operations.&lt;br /&gt;
&lt;br /&gt;
=====Write-Update Protocol=====&lt;br /&gt;
In contrast, in a [http://cs.gmu.edu/cne/modules/dsm/purple/wr_update.html Write-Update Protocol], a processor broadcasts each update to shared data so that the copies in other caches stay coherent.&lt;br /&gt;
&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=====&amp;lt;dd&amp;gt;   &amp;lt;b&amp;gt;Disadvantages of Write-Invalidate Protocol&amp;lt;/b&amp;gt;=====&lt;br /&gt;
&amp;lt;dd&amp;gt; Every write to a shared block invalidates the copies held by other processors, forcing their caches to issue bus requests to reload the new data, which increases bus bandwidth consumption. The effect is worse when one processor updates a block frequently while another processor stalls to read the same block. For a sequence of '''n''' writes by one processor, each followed by a read from another processor, the '''WI''' protocol performs '''n''' invalidations and '''n''' cache block reads.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Write-Invalidate]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=====&amp;lt;dd&amp;gt;   &amp;lt;b&amp;gt;Disadvantages of Write-Update Protocol&amp;lt;/b&amp;gt;=====&lt;br /&gt;
&amp;lt;dd&amp;gt; The update protocol is advantageous in the case above because it only updates the cache block '''n''' times. However, an update protocol may keep refreshing data in other processors' caches long after they have stopped using it, so fewer cache blocks are available for more useful data. This tends to increase conflict and capacity cache misses.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Write-Update]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
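As a rough illustration of the trade-off described above, the following sketch counts bus transactions for a pattern in which one processor writes a block n times and another processor reads it after each write. The function names and the one-transaction-per-event cost model are assumptions made for illustration; they are not taken from the cited paper.&lt;br /&gt;

```python
def write_invalidate_transactions(n_rounds):
    """Bus transactions under WI when each write is followed by a remote read.

    Assumed model: every write costs one invalidation, and the following
    remote read misses and costs one cache block read.
    """
    invalidations = n_rounds
    block_reads = n_rounds
    return invalidations + block_reads


def write_update_transactions(n_rounds):
    """Bus transactions under WU for the same access pattern.

    Assumed model: every write costs one update broadcast, and the following
    remote read hits in the reader's cache.
    """
    updates = n_rounds
    return updates


if __name__ == "__main__":
    for n in (1, 10, 100):
        print(n, write_invalidate_transactions(n), write_update_transactions(n))
```

Under this simplified model WI needs twice as many transactions as WU for the pattern, but the model deliberately ignores the cost WU pays for updating copies that are no longer being read.&lt;br /&gt;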
&lt;br /&gt;
====Consideration of cache architecture issue====&lt;br /&gt;
Application performance also depends directly on the cache block size. &lt;br /&gt;
Some applications execute more quickly with a large cache block size because they exhibit good spatial locality, whereas others run better with small cache blocks, which avoid migratory data or false sharing between processors. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Adaptive coherence protocols=&lt;br /&gt;
&lt;br /&gt;
A combined strategy such as an '''adaptive hybrid protocol''' can reduce the pathological behaviors of pure update and pure invalidate protocols. Such a protocol should be applicable to a wide range of network characteristics, and it should automatically adjust its behavior to achieve its target goals as traffic patterns, node mobility and other network characteristics change.&lt;br /&gt;
&lt;br /&gt;
===Hardware architecture===&lt;br /&gt;
Most processor architectures use the same block size for memory-to-cache transfers, cache-to-memory transfers and coherence. &lt;br /&gt;
The approach described here uses different block sizes for transfer and coherence, based on application requirements.&lt;br /&gt;
A normal cache has the following parameters: capacity ('''C'''), block size ('''L''') and associativity ('''K'''). &lt;br /&gt;
A sector cache additionally divides each cache block into subblocks of size '''b'''. Compared with a normal cache ('''C, L, K'''), a sector cache ('''C, L, K, b''') requires one extra state and some extra bus lines to transmit bitmasks describing the status of the subblocks in a particular block, but it maintains the same number of tag and state bits for '''L'''. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
   &lt;br /&gt;
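To make the sector cache parameters concrete, the sketch below derives the number of blocks, sets and subblocks per block from C, L, K and b. The helper name is hypothetical, and the example values (a 128K two-way cache with 64-byte blocks and 8-byte subblocks) are assumptions chosen only for illustration.&lt;br /&gt;

```python
def sector_cache_geometry(capacity, block_size, associativity, subblock_size):
    """Derive basic geometry for a sector cache (C, L, K, b); sizes in bytes."""
    if block_size % subblock_size != 0:
        raise ValueError("block size must be a multiple of the subblock size")
    if capacity % (block_size * associativity) != 0:
        raise ValueError("capacity must divide evenly into sets")
    blocks = capacity // block_size
    return {
        "blocks": blocks,
        "sets": blocks // associativity,
        # One status entry per subblock would be needed in a per-block bitmask.
        "subblocks_per_block": block_size // subblock_size,
    }


if __name__ == "__main__":
    print(sector_cache_geometry(128 * 1024, 64, 2, 8))
```

With these assumed values the cache has 2048 blocks in 1024 sets, and each block holds 8 subblocks.&lt;br /&gt;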
&amp;lt;dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Subblock protocol===&lt;br /&gt;
This snoopy protocol combines features of the [http://en.wikipedia.org/wiki/MESI_protocol Illinois MESI protocol] and write policies with subblock validation, using subblocks to obtain the advantages of both small and large cache block sizes.  &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dt&amp;gt;&lt;br /&gt;
=====Block states=====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Invalid:&amp;lt;/b&amp;gt; All subblocks are invalid&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Valid Exclusive:&amp;lt;/b&amp;gt; No valid subblock in this block is present in any other cache. All subblocks that are Clean Shared may be written without a bus transaction. Any subblock in the block may be invalid. There may also be Dirty subblocks, which must be written back upon replacement.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Clean Shared:&amp;lt;/b&amp;gt; The block contains subblocks that are either Clean Shared or Invalid. The block can be replaced without a bus transaction.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty Shared:&amp;lt;/b&amp;gt; Subblocks in this block may be in any state. There may be Dirty subblocks, which must be written back on replacement.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dt&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Finite state diagram of block/ line states is as follows:'''&lt;br /&gt;
  &lt;br /&gt;
&amp;lt;center&amp;gt; [[File:line_protocol.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 2.  Finite state diagram of block''' &amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=====Subblock states=====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Invalid:&amp;lt;/b&amp;gt;  The subblock is invalid&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Clean Shared:&amp;lt;/b&amp;gt; A read access to the subblock will succeed. Unless the enclosing block is in the Valid Exclusive state, a write to the subblock will force an invalidation transaction on the bus.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty Shared:&amp;lt;/b&amp;gt; The subblock is treated like a Clean Shared subblock, except that it must be written back on replacement. At most one cache will have a given subblock in either the Dirty Shared or Dirty state.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty:&amp;lt;/b&amp;gt; The subblock is exclusive to this cache. It must be written back on replacement. Read and write accesses to this subblock hit with no bus transaction.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Finite state diagram of subblock is as follows:''' &lt;br /&gt;
&amp;lt;center&amp;gt;[[File:subblock_protocol.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 3.  Finite state diagram of Sub-block''' &amp;lt;/center&amp;gt; &lt;br /&gt;
   &lt;br /&gt;
The basic idea of this protocol is to transfer more data between caches and to access off-chip memory less often.&lt;br /&gt;
In contrast with the Illinois protocol, on a read miss a cache that shares the block supplies the requested subblock together with all other Clean Shared and Dirty Shared subblocks in that block. &lt;br /&gt;
If the subblock must come from main memory, the cache snoopers report which subblocks are cached, and memory provides the requested subblock along with the subblocks that are not currently cached.&lt;br /&gt;
On writes to clean subblocks and on write misses, the snooper invalidates only the subblock being written, which avoids the penalties associated with false sharing.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;b&amp;gt;In contrast to the Illinois MESI Protocol, which requires extra power cycles to maintain the extra states and additional logic, the subblock protocol reduces the number of cache blocks compared with cache update protocols, and thus reduces power consumption. &amp;lt;/b&amp;gt; &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
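The benefit of invalidating only the written subblock can be illustrated with a deliberately simplified model. The sketch below counts how often a write lands on a coherence unit that a different processor wrote last, which is when an invalidation would be needed. The function name, the trace format and the model itself are assumptions for illustration and do not reproduce the state machines shown above.&lt;br /&gt;

```python
def count_ownership_transfers(writes, coherence_unit):
    """Count writes that hit a coherence unit last written by another processor.

    writes: list of (processor_id, byte_address) pairs in program order.
    coherence_unit: size in bytes of the unit that is invalidated on a write.
    """
    last_writer = {}  # coherence unit index mapped to the last writing processor
    transfers = 0
    for proc, addr in writes:
        unit_index = addr // coherence_unit
        previous = last_writer.get(unit_index)
        if previous is not None and previous != proc:
            transfers += 1
        last_writer[unit_index] = proc
    return transfers


if __name__ == "__main__":
    # Processor 0 writes byte 0 and processor 1 writes byte 8, alternating.
    trace = [(0, 0), (1, 8)] * 4
    print(count_ownership_transfers(trace, 64))  # whole 64-byte block
    print(count_ownership_transfers(trace, 8))   # 8-byte subblock
```

In this trace the two processors never touch the same bytes, so with an 8-byte coherence unit no transfers are counted, whereas with a 64-byte unit every write after the first one triggers a transfer. This is the false-sharing penalty that subblock invalidation is meant to avoid.&lt;br /&gt;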
&lt;br /&gt;
===Read-snarfing protocol===&lt;br /&gt;
This is an enhancement to snoopy-cache coherence protocols that takes advantage of the inherent broadcast nature of the bus.&lt;br /&gt;
&lt;br /&gt;
In contrast with the MESI protocol, when one processor reloads a block into its cache after a cache miss, the read-snarfing protocol also supplies the block to the other processors whose copies were invalidated in the past. Only one read is therefore required to restore the block in all caches where it was invalidated.&lt;br /&gt;
&lt;br /&gt;
The protocol modifies the normal update protocol by updating only those subblocks that have been modified, and updates need to be broadcast only while the data is actively shared. &lt;br /&gt;
The read-snarfing protocol maintains a write counter and an invalidation threshold (Tb) for each cache block “b” to overcome WI's write-after-read problem. The protocol predicts the number of writes that will occur to a cache block before the next read request to the same block. An invalidation request is broadcast when the write counter reaches Tb, and the protocol dynamically adjusts the value of Tb based on the behavior of the program. &lt;br /&gt;
&lt;br /&gt;
A simple algorithm for the Read-snarfing Random Walk protocol is as follows.&lt;br /&gt;
Initially, Tb of each cache block b is set to 0. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
// write_run: number of writes to block b before it is accessed by another processor&lt;br /&gt;
// R: invalidation ratio, R = (Ci + Cr) / Cu&lt;br /&gt;
//   Ci: cost in bus cycles of an invalidation transaction&lt;br /&gt;
//   Cu: cost in bus cycles of an update transaction&lt;br /&gt;
//   Cr: cost in bus cycles of reading a cache block&lt;br /&gt;
if (write_run &amp;gt; R) {&lt;br /&gt;
    // Long write run: lower the threshold so the block is invalidated sooner&lt;br /&gt;
    if (Tb &amp;gt; 0) {&lt;br /&gt;
        Tb--;&lt;br /&gt;
    }&lt;br /&gt;
} else {&lt;br /&gt;
    // Short write run: raise the threshold so the block keeps receiving updates&lt;br /&gt;
    if (R &amp;gt; Tb) {&lt;br /&gt;
        Tb++;&lt;br /&gt;
    }&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When the threshold falls to 0, the block is invalidated immediately, with no wasted updates.  &lt;br /&gt;
When the block is actively shared, Tb is adjusted upward so that the block is not invalidated. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&amp;lt;dd&amp;gt;&lt;br /&gt;
=====Example 1 =====&lt;br /&gt;
&amp;lt;dd&amp;gt;Suppose the invalidation ratio (R) = 5&lt;br /&gt;
&amp;lt;dd&amp;gt;Current threshold of the block (Tb) = 3&lt;br /&gt;
&amp;lt;dd&amp;gt;If the processor writes 4 times before the block is accessed by another processor, then according to the above logic Tb becomes 4.&lt;br /&gt;
&amp;lt;dd&amp;gt;This means Tb is at the best possible value and only updates are issued.&lt;br /&gt;
&lt;br /&gt;
=====Example 2 =====&lt;br /&gt;
&amp;lt;dd&amp;gt;Consider R = 5 and Tb = 3 for a particular block.&lt;br /&gt;
&amp;lt;dd&amp;gt;If the processor writes 10 times before the block is accessed by another processor,&lt;br /&gt;
&amp;lt;dd&amp;gt;Tb becomes 2 (decreased).&lt;br /&gt;
&amp;lt;dd&amp;gt;The protocol then incurs a cost of 2 updates, 1 invalidation and 1 reread.&lt;br /&gt;
&amp;lt;dd&amp;gt;After two more such write runs, Tb will be 0 and invalidation will occur immediately. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
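The threshold adjustment and the two examples above can be sketched in Python as follows. The function and parameter names are hypothetical. The lower bound on Tb is exposed as an assumed parameter named floor, defaulting to 0 in line with the statement that a threshold of 0 invalidates immediately, and the bus cycle costs passed to invalidation_ratio in the usage lines are made-up values.&lt;br /&gt;

```python
def invalidation_ratio(c_invalidate, c_read, c_update):
    """R = (Ci + Cr) / Cu, with all costs given in bus cycles."""
    return (c_invalidate + c_read) / c_update


def adjust_threshold(tb, write_run, r, floor=0):
    """Apply one random-walk step to the invalidation threshold Tb of a block.

    tb: current threshold of the block.
    write_run: writes observed before another processor accessed the block.
    r: invalidation ratio R.
    floor: assumed lower bound for Tb (hypothetical parameter).
    """
    if write_run > r:
        # Long write run: updates were wasted, so invalidate sooner next time.
        if tb > floor:
            tb -= 1
    else:
        # Short write run: the block is actively shared, so allow more updates.
        if r > tb:
            tb += 1
    return tb


if __name__ == "__main__":
    print(invalidation_ratio(2, 8, 2))  # hypothetical costs giving R = 5.0
    print(adjust_threshold(3, 4, 5))    # Example 1
    print(adjust_threshold(3, 10, 5))   # Example 2
```

With R = 5 and Tb = 3, a write run of 4 raises Tb to 4 and a write run of 10 lowers it to 2, matching Examples 1 and 2.&lt;br /&gt;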
&lt;br /&gt;
==Simulation result==&lt;br /&gt;
For all simulated results, the subblock size ('''b''') was '''8 bytes''' and the caches were two-way set-associative ('''K = 2'''). To minimize simulation time, only relatively large caches ('''C = 128K''') were simulated.&lt;br /&gt;
&lt;br /&gt;
The results below first cover the two applications ('''Gauss and Cholesky''') that do well with large block sizes, then the three applications ('''MP3D, Topopt, and Pverify''') that perform better with smaller block sizes, and finally Barnes, for which the choice of block size is not very important. The simulations compared conventional and sector caches with '''1, 4, 16 and 32 processors'''; the conventional cache uses the Illinois protocol, whereas the sector cache uses the read-snarfing protocol. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Simulation]&amp;lt;/ref&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:image500.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 4.  (a-c) Gauss (128K cache) uses large cache block ''' &amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:Gauss32.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:cholesky_128k.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 5.  (a-c) Cholesky (128 K cache) application uses large cache block'''&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:gauss_32pro.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:Cholesky.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:mp3d.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 6.  (a-c) MP3D (128 K caches) application which uses smaller block sizes.'''&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:results_summary_1.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:results_summary_2.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:results_summary_3.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 7.  Final Simulation Results'''&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Conclusion==&lt;br /&gt;
This article described two techniques for improving application performance on bus-based multiprocessors. The first technique is the subblock cache coherence protocol, which combines large transfer blocks with small coherence subblocks to exploit spatial locality while avoiding false sharing.   &lt;br /&gt;
The second technique is read-snarfing, which reduces the number of read misses caused by earlier cache coherence actions.  &lt;br /&gt;
The simulation results above show that the subblock protocol works better than the MESI protocol for the 64-byte cache block size, but for small block sizes the Illinois protocol works better because the subblock protocol uses a large transfer block.&lt;br /&gt;
In addition, the read-snarfing protocol is effective in reducing the amount of data transferred and the number of bus transactions, which results in lower bus utilization and higher application speedups. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Quiz==&lt;br /&gt;
The following quiz is intended to help the reader review this article and check their understanding of the research. There are a total of 10 multiple-choice questions.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;&lt;br /&gt;
'''1) What are the two commonly used coherence protocols? '''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Token-based &amp;amp; Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Update &amp;amp; Token-based&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Write-Invalidate &amp;amp; Write-Update &lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Both a and c&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''2) A coherency protocol is a protocol which maintains  __________      ____________ according to a specific ___________     ___________'''&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	System order / memory hierarchy&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Memory coherence / consistency model&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Bus hierarchy / system order&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''3) Increased bus bandwidth consumption is most commonly seen in which protocol?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Invalidate&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Token-based&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Write-Update &lt;br /&gt;
&amp;lt;dd&amp;gt;e)	Both a and b&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''4) An increase in conflict and capacity cache misses is mostly seen in which protocol?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Token-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Update &lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Write-Invalidate&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	Both a and d&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''5) Because they make good use of  _________  __________, some applications execute more quickly with a large cache block size, whereas others run better with small cache blocks, which avoid migratory data or ________  _________ between processors.'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Spatial locality / false sharing&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Equidistant locality / sequential consistency&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Temporal locality / true sharing&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''6) The Subblock Protocol defines four subblock states: Invalid, Clean Shared, Dirty and __________. Name the missing state.'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Exclusive&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Valid&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Dirty Shared&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Shared&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''7) In the Subblock Protocol, all subblocks that are clean-shared may be written without a bus transaction. In what state is this achieved?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Valid Exclusive&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Invalid&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Clean Shared&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Dirty Shared&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''8) Which protocol maintains a counter and an invalidation threshold for each cache block to overcome the write-after-read problem of Write-Invalidate?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Invalidate&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Subblock protocol &lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Read-snarfing protocol&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	Write-Update &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''9) According to the simulation in this article, the Subblock Protocol worked better than the MESI protocol in specific executions. Under what circumstances would the MESI protocol work better than the Subblock Protocol?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	For larger block sizes, since MESI uses large block sizes&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	When effectively supplying the block to other processors whose blocks were invalidated in the past&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	MESI will update the need to broadcast when the data is actively shared&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	For small block sizes, because the Subblock Protocol uses a large transfer block&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''10) Why was the Read-snarfing protocol said to achieve lower bus utilization and higher application speedups?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Because instructions on any given processor execute in partial program order, but may not propagate in that order&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Because it is effective in reducing the amount of data transferred and the number of bus transactions&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Because Write Update reduces all types of cache misses relative to Write Invalidate, therefore Read-snarfing gains performance &lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Because multiple caches can be in its Shared State simultaneously.&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	None of the above&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
*[http://www.computer.org/portal/web/csdl/doi/10.1109/HPCA.1995.386536 Two techniques for improving performance on bus-based multiprocessors]&lt;br /&gt;
&lt;br /&gt;
*[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Design of an Adaptive Cache Coherence Protocol for Large Scale Multiprocessors]&lt;br /&gt;
&lt;br /&gt;
*[http://www.lfbs.rwth-aachen.de/content/smi Shared Memory Interface]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references&amp;gt;&amp;lt;/references&amp;gt;&lt;/div&gt;</summary>
		<author><name>Sbasu3</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8b_va&amp;diff=60646</id>
		<title>CSC/ECE 506 Spring 2012/8b va</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8b_va&amp;diff=60646"/>
		<updated>2012-03-29T00:47:53Z</updated>

		<summary type="html">&lt;p&gt;Sbasu3: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Update and adaptive coherence protocols on real architectures, and power considerations=&lt;br /&gt;
 &lt;br /&gt;
===Introduction===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In a [http://en.wikipedia.org/wiki/Shared_memory shared memory], shared-bus multiprocessor system, the cache coherence protocol plays an important role: it propagates changes from one cache to the others. In many cases, however, update and invalidate coherence traffic is the main source of bus contention. This increases the number of busy bus cycles and therefore program execution time, because a processor may stall while its cache waits for the bus. To reduce bus contention, this article presents adaptive hybrid protocol strategies for high-performance multiprocessors.&amp;lt;ref&amp;gt;[http://en.wikipedia.org/wiki/Shared_memory shared memory]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Coherence protocols===&lt;br /&gt;
In general terms, a coherence protocol keeps all the caches in a [http://en.wikipedia.org/wiki/Shared_memory shared memory] system consistent with one another. The protocol maintains memory coherence according to a specific consistency model. Older multiprocessors support the [http://en.wikipedia.org/wiki/Sequential_consistency sequential consistency model], while modern shared memory systems typically support the release consistency or weak consistency models. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:Cache_Coherency_Generic.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 1.  Multiple Caches of Shared Resource''' &amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Transitions between states in any specific implementation of these protocols may vary. For example, an implementation may choose different update and invalidation transitions such as update-on-read, update-on-write, invalidate-on-read, or invalidate-on-write. The choice of transition may affect the amount of inter-cache traffic, which in turn may affect the amount of cache bandwidth available for actual work. This should be taken into consideration in the design of distributed software that could cause strong contention between the caches of multiple processors.&lt;br /&gt;
Various models and protocols have been devised for maintaining cache coherence, such as the MSI protocol, MESI (also known as the Illinois protocol), MOSI, MOESI, MERSI, MESIF, write-once, Synapse, Berkeley, Firefly and the Dragon protocol.&lt;br /&gt;
&lt;br /&gt;
For the purpose of the main study, this article will focus on the two commonly used coherence protocols: &amp;lt;b&amp;gt;Write-Invalidate (WI) and Write-Update (WU)&amp;lt;/b&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
=====Write-Invalidate Protocol=====&lt;br /&gt;
In a [http://cs.gmu.edu/cne/modules/dsm/purple/wr_inval.html Write-Invalidate Protocol], a processor invalidates the copies of the cache block held by all other processors and then updates its own cache block without further bus operations.&lt;br /&gt;
&lt;br /&gt;
=====Write-Update Protocol=====&lt;br /&gt;
In contrast, in a [http://cs.gmu.edu/cne/modules/dsm/purple/wr_update.html Write-Update Protocol], a processor broadcasts its updates to shared data to the other caches so that those caches stay coherent.&lt;br /&gt;
&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=====&amp;lt;dd&amp;gt;   &amp;lt;b&amp;gt;Disadvantages of Write-Invalidate Protocol&amp;lt;/b&amp;gt;=====&lt;br /&gt;
&amp;lt;dd&amp;gt; Any write by a processor invalidates the shared cache blocks of other processors, forcing the other caches to issue bus requests to reload the new data, which increases bus bandwidth consumption. This gets worse if one processor frequently updates a cache block while another processor stalls to read the same block. For a sequence of '''n''' writes by one processor, each followed by a read from another processor, the '''WI''' protocol performs '''n''' invalidate and '''n''' cache block read operations.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Write-Invalidate]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=====&amp;lt;dd&amp;gt;   &amp;lt;b&amp;gt;Disadvantages of Write-Update Protocol&amp;lt;/b&amp;gt;=====&lt;br /&gt;
&amp;lt;dd&amp;gt; The update protocol is advantageous in the case above because it only updates the cache blocks '''n''' times. However, an update protocol sometimes keeps refreshing data in other processors' caches long after it is needed there, so fewer cache blocks are available for more useful data. This tends to increase conflict and capacity cache misses.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Write-Update]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
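&lt;br /&gt;
The trade-off between the two protocols can be illustrated with a small cost model. The Python sketch below is only an illustration under stated assumptions: the function names and the cycle costs are hypothetical and are not taken from the cited paper. It counts bus cycles for '''n''' writes by one processor, each followed by a read from another processor.&lt;br /&gt;

```python
def write_invalidate_cost(n, ci, cr):
    """Bus cycles under WI: each of the n writes costs one invalidation
    (ci) plus one cache block reread by the other processor (cr)."""
    return n * (ci + cr)


def write_update_cost(n, cu):
    """Bus cycles under WU: each of the n writes costs one update (cu)."""
    return n * cu


# Hypothetical cycle costs, chosen only for illustration.
CI, CR, CU = 2, 8, 2

wi = write_invalidate_cost(10, CI, CR)
wu = write_update_cost(10, CU)
print(wi, wu)
```

With these assumed costs the update protocol needs fewer bus cycles for this access pattern. The comparison changes once the other processor stops reading the block, because the updates are then wasted.&lt;br /&gt;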
&lt;br /&gt;
====Consideration of cache architecture issue====&lt;br /&gt;
Application execution time also depends directly on the cache block size. &lt;br /&gt;
Some applications execute more quickly with a large cache block size because they exhibit good spatial locality, whereas other applications run better when cache blocks are small because this avoids migratory data or false sharing between processors. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Adaptive coherence protocols=&lt;br /&gt;
&lt;br /&gt;
A combined strategy such as an '''adaptive hybrid protocol''' can reduce the pathological behaviors of update and invalidate protocols. Such a protocol should be applicable to a wide range of network characteristics, and it should automatically adjust its behavior to achieve target goals in the face of changes in traffic patterns, node mobility and other network characteristics.&lt;br /&gt;
&lt;br /&gt;
===Hardware architecture===&lt;br /&gt;
Most often, a processor architecture uses the same block size for memory-to-cache transfer, cache-to-memory transfer and coherence. &lt;br /&gt;
The approach described here uses different block sizes for transfer and coherence based on application requirements.&lt;br /&gt;
A normal cache has the following parameters: capacity ('''C'''), block size ('''L''') and associativity ('''K'''). &lt;br /&gt;
A sector cache additionally divides each cache block into subblocks of size '''b'''. Compared to a normal cache ('''C, L, K'''), a sector cache ('''C, L, K, b''') requires one extra state and some extra bus lines to transmit bitmasks describing the status of the subblocks in a particular block, but it maintains the same number of tag and state bits for '''L'''. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
   &lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Subblock protocol===&lt;br /&gt;
This snoopy-based protocol combines features of the [http://en.wikipedia.org/wiki/MESI_protocol Illinois MESI protocol] and write policies with subblock validation, using subblocks to obtain the advantages of both small and large cache block sizes.  &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dt&amp;gt;&lt;br /&gt;
=====Block states=====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Invalid:&amp;lt;/b&amp;gt; All subblocks are invalid&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Valid Exclusive:&amp;lt;/b&amp;gt; None of the valid subblocks in this block is present in any other cache. All subblocks that are clean shared may be written without a bus transaction. Any subblocks in the block may be invalid. There also may be Dirty blocks which must be written back upon replacement.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Clean Shared:&amp;lt;/b&amp;gt; The block contains subblocks that are either Clean Shared or Invalid. The block can be replaced without a bus transaction.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty Shared:&amp;lt;/b&amp;gt; Subblocks in this block may be in any state. There may be Dirty blocks which must be written back on replacement.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dt&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''The finite state diagram of the block (line) states is as follows:'''&lt;br /&gt;
  &lt;br /&gt;
&amp;lt;center&amp;gt; [[File:line_protocol.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 2.  Finite state diagram of block''' &amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=====Subblock states=====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Invalid:&amp;lt;/b&amp;gt;  The subblock is invalid&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Clean Shared:&amp;lt;/b&amp;gt; A read access to the block will succeed. Unless the block that contains the subblock is in the Valid Exclusive state, a write to the subblock will force an invalidation transaction on the bus.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty Shared:&amp;lt;/b&amp;gt; The subblock is treated like a Clean Shared subblock, except that it must be written back on replacement. At most one cache will have a given subblock in either the Dirty Shared or Dirty state.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty:&amp;lt;/b&amp;gt; The subblock is exclusive to this cache. It must be written back on replacement. Read and write accesses to this subblock hit with no bus transaction.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Finite state diagram of subblock is as follows:''' &lt;br /&gt;
&amp;lt;center&amp;gt;[[File:subblock_protocol.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 3.  Finite state diagram of Sub-block''' &amp;lt;/center&amp;gt; &lt;br /&gt;
   &lt;br /&gt;
The basic idea of this protocol is to transfer more data between caches and to access off-chip memory less often.&lt;br /&gt;
In contrast with the Illinois protocol, on a read miss the cache that holds the block supplies the requested subblock together with all other Clean Shared and Dirty Shared subblocks in that block. &lt;br /&gt;
If the requested subblock has to come from main memory, the snooping caches report which subblocks they hold, and memory provides the requested subblock along with the subblocks that are not currently cached.&lt;br /&gt;
On writes to clean subblocks and on write misses, the snooper invalidates only the subblock being written, which avoids the penalties associated with false sharing.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;b&amp;gt;In contrast to the Illinois MESI protocol, which spends extra power on maintaining its extra states and additional logic, the subblock protocol reduces the number of cache blocks compared to cache update protocols, and thus reduces power consumption. &amp;lt;/b&amp;gt; &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Read-snarfing protocol===&lt;br /&gt;
This is an enhancement to snoopy-cache coherence protocols that takes advantage of the inherent broadcast nature of the bus.&lt;br /&gt;
&lt;br /&gt;
In contrast with the MESI protocol, when one processor reloads a block into its cache after a cache miss, the read-snarfing protocol also supplies the block to the other processors whose copies were invalidated in the past. Only one read is required to restore the block in all caches where it was invalidated.&lt;br /&gt;
&lt;br /&gt;
The protocol modifies the normal update protocol by updating only those subblocks which are modified, and updates only need to be broadcast when the data is actively shared. &lt;br /&gt;
The read-snarfing protocol maintains a counter and an invalidation threshold (Tb) for each cache block “b” to overcome the drawback of WI’s write-after-read problem. The protocol predicts the number of write operations that happen on a single cache block before a read request to the same cache block. An invalidation request is broadcast when the write counter reaches Tb, and the protocol dynamically adjusts the value of Tb based on the behavior of the executing program. &lt;br /&gt;
&lt;br /&gt;
A simple algorithm for the Read-snarfing Random Walk protocol is as follows.&lt;br /&gt;
Initially, Tb of each cache block b is set to 0. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
// writeRun: number of writes to the block before it is accessed by another processor&lt;br /&gt;
// R: invalidation ratio, (Ci + Cr) / Cu, where&lt;br /&gt;
//   Ci: the cost in bus cycles of an invalidation transaction&lt;br /&gt;
//   Cu: the cost in bus cycles of an update transaction&lt;br /&gt;
//   Cr: the cost in bus cycles of reading a cache block&lt;br /&gt;
if (writeRun &amp;gt; R) {&lt;br /&gt;
    if (Tb &amp;gt; 1) {&lt;br /&gt;
        Tb--;&lt;br /&gt;
    }&lt;br /&gt;
} else {&lt;br /&gt;
    if (R &amp;gt; Tb) {&lt;br /&gt;
        Tb++;&lt;br /&gt;
    }&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When the threshold drops to 0, the block is invalidated immediately, with no wasted updates.  &lt;br /&gt;
When a block is actively shared, Tb is adjusted upward so that the block is not invalidated. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&amp;lt;dd&amp;gt;&lt;br /&gt;
=====Example 1 =====&lt;br /&gt;
&amp;lt;dd&amp;gt;Suppose the invalidation ratio (R) = 5&lt;br /&gt;
&amp;lt;dd&amp;gt;Current threshold of the block (Tb) = 3&lt;br /&gt;
&amp;lt;dd&amp;gt;If the processor writes 4 times before the block is accessed by another processor, then according to the above logic Tb will be 4.&lt;br /&gt;
&amp;lt;dd&amp;gt;This means Tb is at the best possible value and only updates are issued.&lt;br /&gt;
&lt;br /&gt;
=====Example 2 =====&lt;br /&gt;
&amp;lt;dd&amp;gt;Consider R = 5 and Tb = 3 for a particular block.&lt;br /&gt;
&amp;lt;dd&amp;gt;If the processor writes 10 times before the block is accessed by another processor,&lt;br /&gt;
&amp;lt;dd&amp;gt;Tb will be 2 (decreased).&lt;br /&gt;
&amp;lt;dd&amp;gt;So the protocol can incur a cost of 2 updates, 1 invalidate and 1 reread.&lt;br /&gt;
&amp;lt;dd&amp;gt;After 2 more writes, Tb will be 0 and invalidation will occur immediately. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
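&lt;br /&gt;
The first adjustment step in each of the two examples above can be checked with a short Python sketch. It mirrors the pseudocode as written, including the guard that keeps Tb from dropping below 1; the function and parameter names are hypothetical and chosen only for illustration.&lt;br /&gt;

```python
def adjust_threshold(tb, write_run, r):
    """Return the new invalidation threshold for one cache block.

    tb: current threshold of the block
    write_run: writes seen before another processor accessed the block
    r: invalidation ratio, (Ci + Cr) / Cu
    """
    if write_run > r:
        # Long write run: lower the threshold, but not below 1.
        if tb > 1:
            tb -= 1
    else:
        # Short write run: raise the threshold, but not above r.
        if r > tb:
            tb += 1
    return tb


print(adjust_threshold(3, 4, 5))   # Example 1: R = 5, Tb = 3, 4 writes
print(adjust_threshold(3, 10, 5))  # Example 2: R = 5, Tb = 3, 10 writes
```

Keeping the adjustment in a pure function makes it easy to replay a trace of write runs and watch how Tb moves up and down by one step at a time.&lt;br /&gt;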
&lt;br /&gt;
==Simulation result==&lt;br /&gt;
For all simulated results, the subblock size ('''b''') was '''8 bytes''' and the caches were two-way set-associative ('''K = 2'''). In order to minimize simulation time, only relatively large caches ('''C = 128K''') were simulated.&lt;br /&gt;
&lt;br /&gt;
First, below are the results for the two applications ('''Gauss and Cholesky''') that do well using large block sizes. Next come the results for the three applications ('''MP3D, Topopt, and Pverify''') that perform better using smaller block sizes. Finally, results are reported for Barnes, for which the choice of block size is not very important. The simulations compared the protocols on both normal and sector caches with '''1, 4, 16 and 32 processors''', where the normal cache uses the Illinois protocol and the sector cache uses the read-snarfing protocol. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Simulation]&amp;lt;/ref&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:image500.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 4.  (a-c) Gauss (128K cache) uses large cache block ''' &amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:Gauss32.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:cholesky_128k.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 5.  (a-c) Cholesky (128 K cache) application uses large cache block'''&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:gauss_32pro.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:Cholesky.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:mp3d.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 6.  (a-c) MP3D (128 K caches) application which uses smaller block sizes.'''&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:results_summary_1.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:results_summary_2.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:results_summary_3.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 7.  Final Simulation Results'''&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Conclusion==&lt;br /&gt;
This article described two techniques for improving application performance on bus-based multiprocessors. The first technique is the subblock cache coherence protocol, which combines large blocks with small subblocks in order to both exploit spatial locality and avoid false sharing.   &lt;br /&gt;
The second technique is read-snarfing, which reduces the number of read misses caused by previous cache coherence protocol actions.  &lt;br /&gt;
The simulation results above showed that the subblock protocol works better than the MESI protocol for the 64-byte cache block size, but for small block sizes the Illinois protocol works better because the subblock protocol uses a large transfer block.&lt;br /&gt;
In addition, the read-snarfing protocol is effective in reducing the amount of data transferred and the number of bus transactions, which results in lower bus utilization and higher application speedups. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Quiz==&lt;br /&gt;
The following quiz is intended for the reader to benefit from this article and to help achieve a complete understanding of the research. There are a total of 10 multiple-choice questions.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;&lt;br /&gt;
'''1) What are the two commonly used coherence protocols? '''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Token-based &amp;amp; Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Update &amp;amp; Token-based&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Write-Invalidate &amp;amp; Write-Update &lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Both a and c&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''2) A coherency protocol is a protocol which maintains  __________      ____________ according to a specific ___________     ___________'''&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	System order / memory hierarchy&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Memory coherence / consistency model&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Bus hierarchy / system order&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''3) Increased bus bandwidth consumption is most commonly seen in which protocol?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Invalidate&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Token-based&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Write-Update &lt;br /&gt;
&amp;lt;dd&amp;gt;e)	Both a and b&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''4) An increase in conflict and capacity cache misses is mostly seen in which protocol?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Token-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Update &lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Write-Invalidate&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	Both a and d&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''5) Due to the good use of  _________  __________ some applications execute more quickly with a large cache block size, whereas others run better when cache block sizes are small by avoiding migratory data or ________  _________ between processors'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Spatial locality / false-sharing&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Equidistant locality /Sequential consistency &lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Temporal locality/ true sharing &lt;br /&gt;
&amp;lt;dd&amp;gt;d)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''6) The Subblock Protocol has four subblock states: Invalid, Clean Shared, Dirty and __________. Name the missing state.'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Exclusive&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Valid&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Dirty Shared&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Shared&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''7) In the Subblock Protocol, all subblocks that are clean-shared may be written without a bus transaction. In what state is this achieved?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Valid Exclusive&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Invalid&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Clean Shared&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Dirty Shared&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''8) Which protocol maintains a counter and an invalidation threshold for each cache block to overcome the drawback of Write-Invalidate’s write-after-read problem?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Invalidate&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Subblock protocol &lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Read-snarfing protocol&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	Write-Update &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''9) Based on the simulation in this article, the Subblock Protocol worked better than the MESI protocol in specific executions. In what circumstances would the MESI Protocol work better than the Subblock Protocol?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	For larger block sizes, since MESI uses large block sizes&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	For effectively supplying the block to other processors whose blocks were invalidated in the past&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	MESI will update the need to broadcast when the data is actively shared&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	For small block size because Subblock Protocol uses large transfer block&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''10) Why was it stated that the Read-snarfing protocol achieved lower bus utilization and higher application speedups?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Because instructions on any given processor execute in partial program order, but may not propagate in that order&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Because it’s effective in reducing the amount of data transferred and the number of bus transactions &lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Because Write Update reduces all types of cache misses relative to Write Invalidate, therefore Read-snarfing gains performance &lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Because multiple caches can be in its Shared State simultaneously.&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	None of the above&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
*[http://www.computer.org/portal/web/csdl/doi/10.1109/HPCA.1995.386536 Two techniques for improving performance on bus-based multiprocessors]&lt;br /&gt;
&lt;br /&gt;
*[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Design of an Adaptive Cache Coherence Protocol for Large Scale Multiprocessors]&lt;br /&gt;
&lt;br /&gt;
*[http://www.lfbs.rwth-aachen.de/content/smi Shared Memory Interface]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references&amp;gt;&amp;lt;/references&amp;gt;&lt;/div&gt;</summary>
		<author><name>Sbasu3</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8b_va&amp;diff=60645</id>
		<title>CSC/ECE 506 Spring 2012/8b va</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8b_va&amp;diff=60645"/>
		<updated>2012-03-29T00:43:27Z</updated>

		<summary type="html">&lt;p&gt;Sbasu3: /* Write-Update Protocol */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Update and adaptive coherence protocols on real architectures, and power considerations=&lt;br /&gt;
 &lt;br /&gt;
===Introduction===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In a [http://en.wikipedia.org/wiki/Shared_memory shared memory], shared-bus multiprocessor system, the cache coherence protocol plays an important role in propagating changes from one cache to the others. In most cases, however, update and invalidate coherence protocols are the main source of bus contention, which increases the number of bus busy cycles and therefore program execution time, because a processor may stall while its cache is waiting for the bus. To reduce bus contention, this article presents some adaptive hybrid protocol strategies for high-performance multiprocessors.&amp;lt;ref&amp;gt;[http://en.wikipedia.org/wiki/Shared_memory shared memory]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Coherence protocols===&lt;br /&gt;
In general terms, a coherence protocol keeps all the caches in a [http://en.wikipedia.org/wiki/Shared_memory shared memory] system consistent. The protocol maintains memory coherence according to a specific consistency model. Older multiprocessors support the [http://en.wikipedia.org/wiki/Sequential_consistency sequential consistency model], while modern shared memory systems typically support the release consistency or weak consistency models. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:Cache_Coherency_Generic.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 1.  Multiple Caches of Shared Resource''' &amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Transitions between states in any specific implementation of these protocols may vary. For example, an implementation may choose different update and invalidation transitions such as update-on-read, update-on-write, invalidate-on-read, or invalidate-on-write. The choice of transition may affect the amount of inter-cache traffic, which in turn may affect the amount of cache bandwidth available for actual work. This should be taken into consideration in the design of distributed software that could cause strong contention between the caches of multiple processors.&lt;br /&gt;
Various models and protocols have been devised for maintaining cache coherence, such as the MSI protocol, MESI (also known as the Illinois protocol), MOSI, MOESI, MERSI, MESIF, write-once, Synapse, Berkeley, Firefly and the Dragon protocol.&lt;br /&gt;
&lt;br /&gt;
For the purpose of the main study, this article will focus on the two commonly used coherence protocols: &amp;lt;b&amp;gt;Write-Invalidate (WI) and Write-Update (WU)&amp;lt;/b&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
=====Write-Invalidate Protocol=====&lt;br /&gt;
In a [http://cs.gmu.edu/cne/modules/dsm/purple/wr_inval.html Write-Invalidate Protocol], a processor invalidates the copies of the cache block held by all other processors and then updates its own cache block without further bus operations.&lt;br /&gt;
&lt;br /&gt;
=====Write-Update Protocol=====&lt;br /&gt;
In contrast, in a [http://cs.gmu.edu/cne/modules/dsm/purple/wr_update.html Write-Update Protocol], a processor broadcasts its updates to shared data to the other caches so that those caches stay coherent.&lt;br /&gt;
&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=====&amp;lt;dd&amp;gt;   &amp;lt;b&amp;gt;Disadvantages of Write-Invalidate Protocol&amp;lt;/b&amp;gt;=====&lt;br /&gt;
&amp;lt;dd&amp;gt; Any write by a processor invalidates the shared cache blocks of other processors, forcing the other caches to issue bus requests to reload the new data, which increases bus bandwidth consumption. This gets worse if one processor frequently updates a cache block while another processor stalls to read the same block. For a sequence of '''n''' writes by one processor, each followed by a read from another processor, the '''WI''' protocol performs '''n''' invalidate and '''n''' cache block read operations.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Write-Invalidate]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=====&amp;lt;dd&amp;gt;   &amp;lt;b&amp;gt;Disadvantages of Write-Update Protocol&amp;lt;/b&amp;gt;=====&lt;br /&gt;
&amp;lt;dd&amp;gt; The update protocol is advantageous in the case above because it only updates the cache blocks '''n''' times. However, an update protocol sometimes keeps refreshing data in other processors' caches long after it is needed there, so fewer cache blocks are available for more useful data. This tends to increase conflict and capacity cache misses.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Write-Update]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
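&lt;br /&gt;
The transaction counts described above can be checked with a short sketch. The function name count_bus_transactions and the labels "WI" and "WU" are hypothetical, chosen only for illustration, and the model assumes that every write is followed by exactly one read from another processor.&lt;br /&gt;

```python
def count_bus_transactions(n, protocol):
    """Bus transactions for n writes by one processor, each followed by a read from another.

    Illustrative model only: "WI" and "WU" are labels chosen for this sketch.
    """
    if protocol == "WI":
        # Every write invalidates the remote copy, so every following read misses.
        return {"invalidate": n, "block_read": n, "update": 0}
    if protocol == "WU":
        # Every write is broadcast as an update, so the remote read hits.
        return {"invalidate": 0, "block_read": 0, "update": n}
    raise ValueError("unknown protocol: " + protocol)


if __name__ == "__main__":
    for name in ("WI", "WU"):
        print(name, count_bus_transactions(4, name))
```

Under these assumptions WI needs twice as many bus transactions as WU for the same sequence, although the cost of each transaction type in bus cycles may differ.&lt;br /&gt;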
&lt;br /&gt;
====Consideration of cache architecture issue====&lt;br /&gt;
Application performance also depends directly on the cache block size. &lt;br /&gt;
Some applications execute more quickly with a large cache block size because they exhibit good spatial locality, whereas others run better with small cache blocks because small blocks avoid migratory data and false sharing between processors. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Adaptive coherence protocols=&lt;br /&gt;
&lt;br /&gt;
A combined strategy such as an '''adaptive hybrid protocol''' can reduce the pathological behaviors of the update and invalidate protocols. Such a protocol should be applicable to a wide range of network characteristics, and it should automatically adjust its behavior to achieve its target goals in the face of changes in traffic patterns, node mobility and other network characteristics.&lt;br /&gt;
&lt;br /&gt;
===Hardware architecture===&lt;br /&gt;
Most often, a processor architecture uses the same block size for memory-to-cache transfers, cache-to-memory transfers and coherence. &lt;br /&gt;
The approach described here uses different block sizes for transfer and coherence, based on application requirements.&lt;br /&gt;
A normal cache has the following parameters: capacity ('''C'''), block size ('''L''') and associativity ('''K'''). &lt;br /&gt;
A sector cache additionally divides each cache block into subblocks of size '''b'''. Compared with a normal cache ('''C, L, K'''), a sector cache ('''C, L, K, b''') requires one extra state and some extra bus lines to transmit bitmasks describing the status of the subblocks in a particular block, but it maintains the same number of tag and state bits for '''L'''. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
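&lt;br /&gt;
To make the four sector-cache parameters concrete, the following sketch derives the number of blocks, sets and subblocks per block from (C, L, K, b). The function name sector_cache_geometry is hypothetical, and the assumption that the status bitmask carries one bit per subblock is an illustration rather than something stated in the source.&lt;br /&gt;

```python
def sector_cache_geometry(capacity, block_size, associativity, subblock_size):
    """Derive the shape of a sector cache (C, L, K, b); all sizes are in bytes."""
    if capacity % (block_size * associativity) != 0:
        raise ValueError("capacity must be a multiple of block_size * associativity")
    if block_size % subblock_size != 0:
        raise ValueError("block_size must be a multiple of subblock_size")
    blocks = capacity // block_size
    subblocks = block_size // subblock_size
    return {
        "blocks": blocks,
        "sets": blocks // associativity,
        "subblocks_per_block": subblocks,
        # Assumption for this sketch: one status bit per subblock in the bus bitmask.
        "bitmask_bits": subblocks,
    }


if __name__ == "__main__":
    # C = 128K, K = 2 and b = 8 bytes; the 64-byte block size is assumed for illustration.
    print(sector_cache_geometry(128 * 1024, 64, 2, 8))
```

With these example values the cache holds 2048 blocks in 1024 sets, and each block is split into 8 subblocks.&lt;br /&gt;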
   &lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Subblock protocol===&lt;br /&gt;
This snoopy-based protocol combines features of the [http://en.wikipedia.org/wiki/MESI_protocol Illinois MESI protocol] and write policies with subblock validation, using subblocks to gain the advantages of both small and large cache block sizes.  &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dt&amp;gt;&lt;br /&gt;
=====Block states=====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Invalid:&amp;lt;/b&amp;gt; All subblocks are invalid&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Valid Exclusive:&amp;lt;/b&amp;gt; None of the valid subblocks in this block is present in any other cache. All subblocks that are clean shared may be written without a bus transaction. Any subblock in the block may be invalid. There may also be Dirty subblocks, which must be written back upon replacement.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Clean Shared:&amp;lt;/b&amp;gt; The block contains subblocks that are either Clean Shared or Invalid. The block can be replaced without a bus transaction.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty Shared:&amp;lt;/b&amp;gt; Subblocks in this block may be in any state. There may be Dirty subblocks, which must be written back on replacement.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dt&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''The finite state diagram of the block (line) states is as follows:'''&lt;br /&gt;
  &lt;br /&gt;
&amp;lt;center&amp;gt; [[File:line_protocol.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 2.  Finite state diagram of block''' &amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=====Subblock states=====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Invalid:&amp;lt;/b&amp;gt;  The subblock is invalid&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Clean Shared:&amp;lt;/b&amp;gt; A read access to the subblock will succeed. Unless the block containing the subblock is in the Valid Exclusive state, a write to the subblock will force an invalidation transaction on the bus.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty Shared:&amp;lt;/b&amp;gt; The subblock is treated like a Clean Shared subblock, except that it must be written back on replacement. At most one cache will have a given subblock in either the Dirty Shared or Dirty state.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty:&amp;lt;/b&amp;gt; The subblock is exclusive to this cache. It must be written back on replacement. Read and write accesses to this subblock hit with no bus transaction.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Finite state diagram of subblock is as follows:''' &lt;br /&gt;
&amp;lt;center&amp;gt;[[File:subblock_protocol.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 3.  Finite state diagram of Sub-block''' &amp;lt;/center&amp;gt; &lt;br /&gt;
   &lt;br /&gt;
The basic idea of this protocol is to transfer more data between caches and to access off-chip memory less often.&lt;br /&gt;
In contrast with the Illinois protocol, on a read miss a cache that shares the block sends the requested subblock together with all other clean and dirty shared subblocks in that block. &lt;br /&gt;
If the subblock must come from main memory, the cache snoopers report which subblocks are cached, and memory provides the requested subblock along with the subblocks that are not currently cached.&lt;br /&gt;
On writes to clean subblocks and on write misses, the snooper invalidates only the subblock being written, which avoids the penalties associated with false sharing.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;b&amp;gt;In contrast to the Illinois MESI protocol, which requires extra power cycles to maintain its extra states and additional logic, the subblock protocol reduces the number of cache blocks compared with cache update protocols, and thus reduces power consumption. &amp;lt;/b&amp;gt; &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
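&lt;br /&gt;
The effect of invalidating only the written subblock can be illustrated with a minimal sketch. The function name apply_remote_write and the list-of-booleans representation of a block are hypothetical; the sketch only contrasts per-subblock invalidation with invalidating the whole block, and it does not model the full state machine.&lt;br /&gt;

```python
def apply_remote_write(valid, written, per_subblock):
    """Return the valid flags of one cached block after another processor writes one subblock.

    valid: list of booleans, one per subblock (hypothetical representation).
    written: index of the subblock the other processor writes.
    per_subblock: True for subblock invalidation, False for whole-block invalidation.
    """
    if per_subblock:
        # Only the written subblock is lost; neighbours in the same block stay usable.
        return [flag and index != written for index, flag in enumerate(valid)]
    # Whole-block invalidation: every subblock is lost, which is where false sharing hurts.
    return [False for _ in valid]


if __name__ == "__main__":
    block = [True, True, True, True]
    print(apply_remote_write(block, 1, True))
    print(apply_remote_write(block, 1, False))
```

With per-subblock invalidation, a processor that only reads subblocks 0, 2 and 3 keeps hitting in its cache after another processor writes subblock 1.&lt;br /&gt;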
&lt;br /&gt;
===Read-snarfing protocol===&lt;br /&gt;
This is an enhancement to snoopy-cache coherence protocols that takes advantage of the inherent broadcast nature of the bus.&lt;br /&gt;
&lt;br /&gt;
In contrast with the MESI protocol, when one processor reloads data into its cache after a cache miss, the read-snarfing protocol also supplies the block to the other processors whose copies were invalidated in the past. Only one read is required to restore the block in all caches where it was invalidated.&lt;br /&gt;
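&lt;br /&gt;
A minimal sketch of this behavior follows. The function name read_block, the state labels and the dictionary layout are hypothetical, and the sketch assumes that only caches still holding an invalidated copy of the block pick up the data from the bus.&lt;br /&gt;

```python
def read_block(states, requester, snarfing):
    """Model one bus read of a single block.

    states maps a cache name to "valid", "invalid" or "absent" for that block.
    Returns the new mapping after the requester's read miss is serviced.
    """
    new_states = dict(states)
    new_states[requester] = "valid"
    if snarfing:
        for name, state in states.items():
            # Assumption: only caches with an invalidated copy snarf the data.
            if state == "invalid":
                new_states[name] = "valid"
    return new_states


if __name__ == "__main__":
    before = {"P0": "invalid", "P1": "invalid", "P2": "absent", "P3": "valid"}
    print(read_block(before, "P0", True))
    print(read_block(before, "P0", False))
```

With snarfing enabled, the single read issued by P0 also revalidates the copy in P1, so a later read by P1 would hit instead of causing a second bus read.&lt;br /&gt;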
&lt;br /&gt;
The protocol modifies the normal update protocol by updating only the subblocks that were modified, and updates need to be broadcast only while the data is actively shared. &lt;br /&gt;
The read-snarfing protocol maintains a write counter and an invalidation threshold (Tb) for each cache block “b” to overcome WI’s write-after-read problem. The protocol predicts the number of writes that occur on a single cache block before a read request to the same block. An invalidation request is broadcast when the write counter reaches Tb, and the protocol dynamically adjusts the value of Tb based on the behavior of the running program. &lt;br /&gt;
&lt;br /&gt;
A simple version of the Read-snarfing Random Walk algorithm is as follows.&lt;br /&gt;
Initially, Tb of each cache block b is set to 0. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
// write_run: number of writes to block b before it is accessed by another processor&lt;br /&gt;
// R: invalidation ratio, R = (Ci + Cr) / Cu&lt;br /&gt;
//   Ci: cost in bus cycles of an invalidation transaction&lt;br /&gt;
//   Cu: cost in bus cycles of an update transaction&lt;br /&gt;
//   Cr: cost in bus cycles of reading a cache block&lt;br /&gt;
if (write_run &amp;gt; R) {&lt;br /&gt;
    if (Tb &amp;gt; 0) {&lt;br /&gt;
        Tb--;   // long write run: invalidate sooner&lt;br /&gt;
    }&lt;br /&gt;
} else {&lt;br /&gt;
    if (R &amp;gt; Tb) {&lt;br /&gt;
        Tb++;   // short write run: keep updating longer&lt;br /&gt;
    }&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When the threshold drops to 0, the block is invalidated immediately with no wasted updates.  &lt;br /&gt;
When the block is actively shared, Tb is adjusted upward so that the block is not invalidated. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&amp;lt;dd&amp;gt;&lt;br /&gt;
=====Example 1 =====&lt;br /&gt;
&amp;lt;dd&amp;gt;Suppose the invalidation ratio (R) = 5&lt;br /&gt;
&amp;lt;dd&amp;gt;Current threshold for the block (Tb) = 3&lt;br /&gt;
&amp;lt;dd&amp;gt;If the processor writes 4 times before the block is accessed by another processor, then according to the above logic Tb becomes 4.&lt;br /&gt;
&amp;lt;dd&amp;gt;This means Tb is at the best possible value and only updates are issued.&lt;br /&gt;
&lt;br /&gt;
=====Example 2 =====&lt;br /&gt;
&amp;lt;dd&amp;gt;Consider R = 5 and Tb = 3 for a particular block.&lt;br /&gt;
&amp;lt;dd&amp;gt;If the processor writes 10 times before the block is accessed by another processor,&lt;br /&gt;
&amp;lt;dd&amp;gt;Tb becomes 2 (decreased).&lt;br /&gt;
&amp;lt;dd&amp;gt;So the protocol incurs a cost of 2 updates, 1 invalidate and 1 reread.&lt;br /&gt;
&amp;lt;dd&amp;gt;After 2 more such write runs, Tb will be 0 and invalidation will occur immediately. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
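&lt;br /&gt;
The threshold adjustment and the two examples can be reproduced with a small runnable sketch. The function names adjust_threshold and invalidation_ratio are hypothetical, and the sketch assumes that the threshold is allowed to drop all the way to 0, as the prose describes.&lt;br /&gt;

```python
def invalidation_ratio(ci, cr, cu):
    """R = (Ci + Cr) / Cu, with all costs given in bus cycles."""
    return (ci + cr) / cu


def adjust_threshold(tb, write_run, r):
    """One random-walk step for the invalidation threshold Tb of a block.

    Assumption for this sketch: Tb is never pushed below 0 or above R.
    """
    if write_run > r:
        # Long write run: updates were wasted, so invalidate sooner next time.
        return tb - 1 if tb > 0 else tb
    # Short write run: the block is actively shared, so keep updating longer.
    return tb + 1 if r > tb else tb


if __name__ == "__main__":
    print(adjust_threshold(3, 4, 5))   # Example 1: run of 4 writes, R = 5
    print(adjust_threshold(3, 10, 5))  # Example 2: run of 10 writes, R = 5
```

Repeating the second call with the returned value walks Tb down one step per long write run until it reaches 0.&lt;br /&gt;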
&lt;br /&gt;
==Simulation result==&lt;br /&gt;
For all simulated results, the subblock size ('''b''') was '''8 bytes''' and the caches were two-way set-associative ('''K = 2'''). To minimize simulation time, only relatively large caches ('''C = 128K''') were simulated.&lt;br /&gt;
&lt;br /&gt;
The results are presented first for the two applications ('''Gauss and Cholesky''') that do well with large block sizes, then for the three applications ('''MP3D, Topopt, and Pverify''') that perform better with smaller block sizes, and finally for Barnes, for which the choice of block size is not very important. The simulations compared normal and sector caches with '''1, 4, 16 and 32 processors''', where the normal cache uses the Illinois protocol and the sector cache uses the read-snarfing protocol. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Simulation]&amp;lt;/ref&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:image500.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 4.  (a-c) Gauss (128K cache) uses large cache block ''' &amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:Gauss32.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:cholesky_128k.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 5.  (a-c) Cholesky (128 K cache) application uses large cache block'''&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:gauss_32pro.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:Cholesky.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:mp3d.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 6.  (a-c) MP3D (128 K caches) application which uses smaller block sizes.'''&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:results_summary_1.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:results_summary_2.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:results_summary_3.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 7.  Final Simulation Results'''&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Conclusion==&lt;br /&gt;
This article described two techniques for improving application performance on bus-based multiprocessors. The first is the subblock cache coherence protocol, which combines a large transfer block with small coherence subblocks to gain the benefits of spatial locality while avoiding false sharing.   &lt;br /&gt;
The second is read-snarfing, which reduces the number of read misses caused by earlier cache coherence actions.  &lt;br /&gt;
The simulation results above show that the subblock protocol works better than the MESI protocol for the 64-byte cache block size, but for small block sizes the Illinois protocol works better because the subblock protocol uses a large transfer block.&lt;br /&gt;
In addition, the read-snarfing protocol is effective in reducing the amount of data transferred and the number of bus transactions, which results in lower bus utilization and higher application speedups. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Quiz==&lt;br /&gt;
The following quiz is intended for the reader to benefit from this article and to help achieve a complete understanding of the research. There are a total of 10 multiple-choice questions.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;&lt;br /&gt;
'''1) What are the two commonly used coherence protocols? '''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Token-based &amp;amp; Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Update &amp;amp; Token-based&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Write-Invalidate &amp;amp; Write-Update &lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Both a and c&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''2) A coherency protocol is a protocol which maintains  __________      ____________ according to a specific ___________     ___________'''&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	System order / memory hierarchy&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Memory coherence / consistency model&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Bus hierarchy / system order&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''3) Increasing bus bandwidth is most commonly seen in what protocol?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Invalidate&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Token-based&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Write-Update &lt;br /&gt;
&amp;lt;dd&amp;gt;e)	Both a and b&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''4) An increase in conflict and capacity cache misses is mostly seen in what protocol?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Token-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Update &lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Write-Invalidate&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	Both a and d&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''5) Due to the good use of  _________  __________ some applications execute more quickly with large cache block size, whereas others run better when cache block sizes are small by avoiding migratory data or ________  _________ between processors'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Spatial locality / false sharing&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Equidistant locality / sequential consistency &lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Temporal locality / true sharing &lt;br /&gt;
&amp;lt;dd&amp;gt;d)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''6) The Subblock Protocol consists of four states: Invalid, Clean Shared, Dirty and __________. Name the missing state?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Exclusive&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Valid&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Dirty Shared&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Shared&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''7) In the Subblock Protocol, all subblocks that are clean-shared may be written without a bus transaction. In what state is this achieved?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Valid Exclusive&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Invalid&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Clean Shared&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Dirty Shared&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''8) Which protocol maintains a counter and invalidation threshold for each cache block to overcome Write-Invalidate’s write-after-read problem?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Invalidate&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Subblock protocol &lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Read-snarfing protocol&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	Write-Update &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''9) Based on the simulation in this article, even though the Subblock Protocol was stated to work better than the MESI Protocol in specific executions, in what circumstances would the MESI Protocol work better than the Subblock Protocol?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	For larger block sizes, since MESI uses large block sizes&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	For effectively supplying the block to other processors whose blocks were invalidated in the past&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	MESI will update the need to broadcast when the data is actively shared&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	For small block size because Subblock Protocol uses large transfer block&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''10) Why was it stated that the Read-snarfing protocol achieved lower bus utilization and higher application speedups?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Because instructions on any given processor execute in partial program order, but may not propagate in that order&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Because it’s effective in reducing the amount of data transfer and number of bus transaction &lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Because Write Update reduces all types of cache misses relative to Write Invalidate, therefore Read-snarfing gains performance &lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Because multiple caches can be in its Shared State simultaneously.&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	None of the above&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
*[http://www.computer.org/portal/web/csdl/doi/10.1109/HPCA.1995.386536 Two techniques for improving performance on bus-based multiprocessors]&lt;br /&gt;
&lt;br /&gt;
*[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Design of an Adaptive Cache Coherence Protocol for Large Scale Multiprocessors]&lt;br /&gt;
&lt;br /&gt;
*[http://www.lfbs.rwth-aachen.de/content/smi Shared Memory Interface]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references&amp;gt;&amp;lt;/references&amp;gt;&lt;/div&gt;</summary>
		<author><name>Sbasu3</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8b_va&amp;diff=60644</id>
		<title>CSC/ECE 506 Spring 2012/8b va</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8b_va&amp;diff=60644"/>
		<updated>2012-03-29T00:42:00Z</updated>

		<summary type="html">&lt;p&gt;Sbasu3: /* Hardware architecture */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Update and adaptive coherence protocols on real architectures, and power considerations=&lt;br /&gt;
 &lt;br /&gt;
===Introduction===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In a [http://en.wikipedia.org/wiki/Shared_memory shared memory], shared-bus multiprocessor system, the cache coherence protocol plays an important role in propagating changes from one cache to the others. In most cases, however, update and invalidate coherence protocols are the main source of bus contention, which increases the number of busy bus cycles and thus program execution time, because a processor may stall while its cache waits for the bus. To reduce bus contention, this article presents some adaptive hybrid protocol strategies for high-performance multiprocessors.&amp;lt;ref&amp;gt;[http://en.wikipedia.org/wiki/Shared_memory shared memory]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Coherence protocols===&lt;br /&gt;
In general terms, a coherency protocol is a protocol which maintains the consistency between all the caches in a system of [http://en.wikipedia.org/wiki/Shared_memory shared memory]. The protocol maintains memory coherence according to a specific consistency model. Older multiprocessors support the [http://en.wikipedia.org/wiki/Sequential_consistency sequential consistency model], while modern shared memory systems typically support the release consistency or weak consistency models. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:Cache_Coherency_Generic.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 1.  Multiple Caches of Shared Resource''' &amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Transitions between states in any specific implementation of these protocols may vary. For example, an implementation may choose different update and invalidation transitions such as update-on-read, update-on-write, invalidate-on-read, or invalidate-on-write. The choice of transition may affect the amount of inter-cache traffic, which in turn may affect the amount of cache bandwidth available for actual work. This should be taken into consideration in the design of distributed software that could cause strong contention between the caches of multiple processors.&lt;br /&gt;
Various models and protocols have been devised for maintaining cache coherence, such as the MSI protocol, MESI (also known as the Illinois protocol), MOSI, MOESI, MERSI, MESIF, write-once, Synapse, Berkeley, Firefly and the Dragon protocol.&lt;br /&gt;
&lt;br /&gt;
For the purpose of the main study, this article will focus on the two commonly used coherence protocols: &amp;lt;b&amp;gt;Write-Invalidate (WI) and Write-Update (WU)&amp;lt;/b&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=====Write-Invalidate Protocol=====&lt;br /&gt;
In a [http://cs.gmu.edu/cne/modules/dsm/purple/wr_inval.html Write-Invalidate Protocol], a processor invalidates the copies of the block in all other processors' caches and then updates its own copy without further bus operations.&lt;br /&gt;
&lt;br /&gt;
=====Write-Update Protocol=====&lt;br /&gt;
Alternatively, in a [http://cs.gmu.edu/cne/modules/dsm/purple/wr_update.html Write-Update Protocol], a processor broadcasts updates to shared data to the other caches so that they stay coherent.&lt;br /&gt;
&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=====&amp;lt;dd&amp;gt;   &amp;lt;b&amp;gt;Disadvantages of Write-Invalidate Protocol&amp;lt;/b&amp;gt;=====&lt;br /&gt;
&amp;lt;dd&amp;gt; Any write by one processor invalidates the copies of that block held in other processors' caches, forcing those caches to issue bus requests to reload the new data, which increases the demand for bus bandwidth. The problem is worse when one processor updates a block frequently while another processor stalls waiting to read the same block. For a sequence of '''n''' writes by one processor, each followed by a read from another processor, the '''WI''' protocol performs '''n''' invalidations and '''n''' cache block reads.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Write-Invalidate]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=====&amp;lt;dd&amp;gt;   &amp;lt;b&amp;gt;Disadvantages of Write-Update Protocol&amp;lt;/b&amp;gt;=====&lt;br /&gt;
&amp;lt;dd&amp;gt; The update protocol is advantageous for the write-then-read sequence described above because it only updates the cached blocks '''n''' times. However, an update protocol can keep refreshing data in other processors' caches long after those processors have stopped using it, so fewer cache blocks are available for more useful data. This tends to increase conflict and capacity misses.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Write-Update]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
====Consideration of cache architecture issue====&lt;br /&gt;
Application performance also depends directly on the cache block size. &lt;br /&gt;
Some applications execute more quickly with a large cache block size because they exhibit good spatial locality, whereas others run better with small cache blocks because small blocks avoid migratory data and false sharing between processors. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Adaptive coherence protocols=&lt;br /&gt;
&lt;br /&gt;
A combined strategy such as an '''adaptive hybrid protocol''' can reduce the pathological behaviors of the update and invalidate protocols. Such a protocol should be applicable to a wide range of network characteristics, and it should automatically adjust its behavior to achieve its target goals in the face of changes in traffic patterns, node mobility and other network characteristics.&lt;br /&gt;
&lt;br /&gt;
===Hardware architecture===&lt;br /&gt;
Most often, a processor architecture uses the same block size for memory-to-cache transfers, cache-to-memory transfers and coherence. &lt;br /&gt;
The approach described here uses different block sizes for transfer and coherence, based on application requirements.&lt;br /&gt;
A normal cache has the following parameters: capacity ('''C'''), block size ('''L''') and associativity ('''K'''). &lt;br /&gt;
A sector cache additionally divides each cache block into subblocks of size '''b'''. Compared with a normal cache ('''C, L, K'''), a sector cache ('''C, L, K, b''') requires one extra state and some extra bus lines to transmit bitmasks describing the status of the subblocks in a particular block, but it maintains the same number of tag and state bits for '''L'''. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
   &lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Subblock protocol===&lt;br /&gt;
This snoopy-based protocol combines features of the [http://en.wikipedia.org/wiki/MESI_protocol Illinois MESI protocol] and write policies with subblock validation, using subblocks to gain the advantages of both small and large cache block sizes.  &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dt&amp;gt;&lt;br /&gt;
=====Block states=====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Invalid:&amp;lt;/b&amp;gt; All subblocks are invalid&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Valid Exclusive:&amp;lt;/b&amp;gt; None of the valid subblocks in this block is present in any other cache. All subblocks that are clean shared may be written without a bus transaction. Any subblock in the block may be invalid. There may also be Dirty subblocks, which must be written back upon replacement.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Clean Shared:&amp;lt;/b&amp;gt; The block contains subblocks that are either Clean Shared or Invalid. The block can be replaced without a bus transaction.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty Shared:&amp;lt;/b&amp;gt; Subblocks in this block may be in any state. There may be Dirty subblocks, which must be written back on replacement.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dt&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''The finite state diagram of the block (line) states is as follows:'''&lt;br /&gt;
  &lt;br /&gt;
&amp;lt;center&amp;gt; [[File:line_protocol.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 2.  Finite state diagram of block''' &amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=====Subblock states=====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Invalid:&amp;lt;/b&amp;gt;  The subblock is invalid&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Clean Shared:&amp;lt;/b&amp;gt; A read access to the subblock will succeed. Unless the block that contains the subblock is in the Valid Exclusive state, a write to the subblock will force an invalidation transaction on the bus.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty Shared:&amp;lt;/b&amp;gt; The subblock is treated like a Clean Shared subblock, except that it must be written back on replacement. At most one cache will have a given subblock in either the Dirty Shared or Dirty state.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty:&amp;lt;/b&amp;gt; The subblock is exclusive to this cache. It must be written back on replacement. Read and write accesses to this subblock hit with no bus transaction.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''The finite state diagram of the subblock states is as follows:''' &lt;br /&gt;
&amp;lt;center&amp;gt;[[File:subblock_protocol.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 3.  Finite state diagram of Sub-block''' &amp;lt;/center&amp;gt; &lt;br /&gt;
   &lt;br /&gt;
The basic idea of this protocol is to transfer more data between caches and to access off-chip memory less often.&lt;br /&gt;
In contrast with the Illinois protocol, on a read miss a cache that holds the block supplies the requested subblock together with all other Clean Shared and Dirty Shared subblocks in that block. &lt;br /&gt;
If the subblock must come from main memory, the snooping caches report which subblocks they hold, and memory provides the requested subblock along with the subblocks that are not currently cached.&lt;br /&gt;
On writes to clean subblocks and on write misses, the snooper invalidates only the subblock being written, which avoids the penalties associated with false sharing.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;b&amp;gt;In contrast to the Illinois MESI protocol, which requires extra power to maintain the extra states and additional logic, the subblock protocol reduces the number of cache blocks compared to cache update protocols, and thus reduces power consumption. &amp;lt;/b&amp;gt; &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
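&lt;br /&gt;
The effect of invalidating a single subblock instead of the whole block can be sketched in Python. This is a simplified illustration, not the protocol as specified: the names CachedBlock, read_hits and write_subblock are hypothetical, block-level states and bus transactions are left out, and 8 subblocks per block is an assumption (a 64-byte block with 8-byte subblocks).&lt;br /&gt;

```python
from dataclasses import dataclass, field

# Subblock states used by this sketch (a subset of the states listed above).
INVALID = "Invalid"
CLEAN_SHARED = "Clean Shared"
DIRTY = "Dirty"


@dataclass
class CachedBlock:
    """One cache's copy of a block, tracked per subblock (hypothetical model)."""
    states: list = field(default_factory=lambda: [CLEAN_SHARED] * 8)

    def read_hits(self, index: int) -> bool:
        # A read hits as long as the subblock itself is not Invalid.
        return self.states[index] != INVALID


def write_subblock(writer: CachedBlock, others: list, index: int) -> None:
    """The writer takes ownership of one subblock; other copies lose only that subblock."""
    for copy in others:
        copy.states[index] = INVALID
    writer.states[index] = DIRTY


if __name__ == "__main__":
    cache_a, cache_b = CachedBlock(), CachedBlock()
    write_subblock(cache_a, [cache_b], 0)
    # Cache B lost subblock 0 but still hits on its neighbour, subblock 1.
    print(cache_b.read_hits(0), cache_b.read_hits(1))
```

With whole-block invalidation, the read of subblock 1 in the second cache would miss as well, which is the false-sharing penalty the subblock protocol tries to avoid.&lt;br /&gt;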
&lt;br /&gt;
===Read-snarfing protocol===&lt;br /&gt;
This is an enhancement to snoopy-cache coherence protocols that takes advantage of the inherent broadcast nature of the bus.&lt;br /&gt;
&lt;br /&gt;
In contrast with the MESI protocol, when one processor reloads a block into its cache after a cache miss, the read-snarfing protocol also supplies the block to the other processors whose copies were invalidated in the past. Only one read is required to restore the block to all caches in which it was invalidated.&lt;br /&gt;
&lt;br /&gt;
The protocol modifies the normal update protocol by updating only those subblocks that have been modified, and updates need to be broadcast only while the data is actively shared. &lt;br /&gt;
The read-snarfing protocol maintains a write counter and an invalidation threshold (Tb) for each cache block “b” to overcome the write-after-read drawback of WI. The protocol predicts the number of write operations that happen on a single cache block before a read request to the same cache block. An invalidation request is broadcast when the write counter reaches Tb, and the protocol dynamically adjusts the value of Tb based on the behavior of the program. &lt;br /&gt;
&lt;br /&gt;
A simple algorithm for the Read-snarfing Random Walk protocol is as follows.&lt;br /&gt;
Initially, Tb of each cache block b is set to 0. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
// Write run: the number of writes to the block before it is accessed by another processor&lt;br /&gt;
if (most recent write run &amp;gt; R) {&lt;br /&gt;
    if (Tb &amp;gt; 0) {&lt;br /&gt;
        Tb--;&lt;br /&gt;
    }&lt;br /&gt;
} else {&lt;br /&gt;
    if (R &amp;gt; Tb) {&lt;br /&gt;
        Tb++;&lt;br /&gt;
    }&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
R  = invalidation ratio, (Ci + Cr) / Cu&lt;br /&gt;
Ci = cost in bus cycles of an invalidation transaction&lt;br /&gt;
Cu = cost in bus cycles of an update transaction&lt;br /&gt;
Cr = cost in bus cycles of reading a cache block&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When the threshold is reduced to 0, the block is invalidated immediately, with no wasted updates.  &lt;br /&gt;
When the block is actively shared, Tb is adjusted upward so that the block is not invalidated. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&amp;lt;dd&amp;gt;&lt;br /&gt;
=====Example 1 =====&lt;br /&gt;
&amp;lt;dd&amp;gt;Suppose the invalidation ratio (R) = 5.&lt;br /&gt;
&amp;lt;dd&amp;gt;The current threshold of the block (Tb) = 3.&lt;br /&gt;
&amp;lt;dd&amp;gt;If the processor writes 4 times before the block is accessed by another processor, then according to the logic above Tb becomes 4.&lt;br /&gt;
&amp;lt;dd&amp;gt;This means Tb is at the best possible value and only updates are issued.&lt;br /&gt;
&lt;br /&gt;
=====Example 2 =====&lt;br /&gt;
&amp;lt;dd&amp;gt;Consider R = 5 and Tb = 3 for a particular block.&lt;br /&gt;
&amp;lt;dd&amp;gt;If the processor writes 10 times before the block is accessed by another processor,&lt;br /&gt;
&amp;lt;dd&amp;gt;Tb becomes 2 (decreased).&lt;br /&gt;
&amp;lt;dd&amp;gt;The protocol then incurs a cost of 2 updates, 1 invalidate and 1 reread.&lt;br /&gt;
&amp;lt;dd&amp;gt;After 2 more such write runs, Tb will be 0 and invalidation will occur immediately. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
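&lt;br /&gt;
The adjustment rule and the two examples above can be reproduced with a short Python sketch. It is a minimal illustration rather than the authors' implementation: the function name adjust_threshold and the floor parameter are assumptions, with the lower bound taken to be 0 because the text states that the threshold can reduce to 0.&lt;br /&gt;

```python
def adjust_threshold(tb: int, write_run: int, ratio: int, floor: int = 0) -> int:
    """Return the new invalidation threshold Tb after one completed write run.

    floor is an assumption of this sketch: the lowest value Tb may take.
    """
    if write_run > ratio:
        # Long write run: most updates were wasted, so invalidate sooner next time.
        if tb > floor:
            tb -= 1
    else:
        # Short write run: the block is actively shared, so allow more updates.
        if ratio > tb:
            tb += 1
    return tb


if __name__ == "__main__":
    # Example 1: R = 5, Tb = 3, write run of 4 writes.
    print(adjust_threshold(3, 4, 5))
    # Example 2: R = 5, Tb = 3, write run of 10 writes.
    print(adjust_threshold(3, 10, 5))
```

Run as a script, this prints 4 for Example 1 and 2 for Example 2. Because R bounds Tb from above, a block never receives more than R updates before it is invalidated.&lt;br /&gt;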
&lt;br /&gt;
==Simulation result==&lt;br /&gt;
For all simulated results, the subblock size ('''b''') was '''8 bytes''' and the caches were two-way set-associative ('''K = 2'''). To minimize simulation time, only relatively large caches ('''C = 128K''') were simulated.&lt;br /&gt;
&lt;br /&gt;
First come the results for the two applications ('''Gauss and Cholesky''') that do well with large block sizes. Next come the results for the three applications ('''MP3D, Topopt, and Pverify''') that perform better with smaller block sizes. Finally, results are reported for Barnes, for which the choice of block size is not very important. The simulations compare usual and sector caches with '''1, 4, 16 and 32 processors''', where the usual cache uses the Illinois protocol and the sector cache uses the read-snarfing protocol. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Simulation]&amp;lt;/ref&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:image500.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 4.  (a-c) Gauss (128K cache) uses large cache block ''' &amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:Gauss32.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:cholesky_128k.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 5.  (a-c) Cholesky (128 K cache) application uses large cache block'''&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:gauss_32pro.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:Cholesky.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:mp3d.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 6.  (a-c) MP3D (128 K caches) application which uses smaller block sizes.'''&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:results_summary_1.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:results_summary_2.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:results_summary_3.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 7.  Final Simulation Results'''&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Conclusion==&lt;br /&gt;
This article described two techniques for improving application performance on bus-based multiprocessors. The first technique is the subblock cache coherence protocol, which combines a large transfer block with small coherence subblocks in order to exploit spatial locality while avoiding false sharing.   &lt;br /&gt;
The second technique is read-snarfing, which reduces the number of read misses caused by earlier cache coherence protocol actions.  &lt;br /&gt;
The simulation results above show that the subblock protocol works better than the MESI protocol for the 64-byte cache block size, but for small block sizes the Illinois protocol works better because the subblock protocol uses a large transfer block.&lt;br /&gt;
In addition, the read-snarfing protocol is effective in reducing the amount of data transferred and the number of bus transactions, which results in lower bus utilization and higher application speedups. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Quiz==&lt;br /&gt;
The following quiz is intended for the reader to benefit from this article and to help achieve a complete understanding of the research. There are a total of 10 multiple-choice questions.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;&lt;br /&gt;
'''1) What are the two commonly used coherence protocols? '''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Token-based &amp;amp; Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Update &amp;amp; Token-based&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Write-Invalidate &amp;amp; Write-Update &lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Both a and c&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''2) A coherency protocol is a protocol which maintains  __________      ____________ according to a specific ___________     ___________'''&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	System order / memory hierarchy&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Memory coherence / consistency model&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Bus hierarchy / system order&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''3) Increased bus bandwidth demand is most commonly seen in which protocol?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Invalidate&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Token-based&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Write-Update &lt;br /&gt;
&amp;lt;dd&amp;gt;e)	Both a and b&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''4) An increase in conflict and capacity cache misses is mostly seen in which protocol?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Token-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Update &lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Write-Invalidate&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	Both a and d&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''5) Due to good use of  _________  __________, some applications execute more quickly with a large cache block size, whereas others run better when cache block sizes are small by avoiding migratory data or ________  _________ between processors.'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Spatial locality / false-sharing&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Equidistant locality /Sequential consistency &lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Temporal locality/ true sharing &lt;br /&gt;
&amp;lt;dd&amp;gt;d)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''6) The Subblock Protocol has four subblock states: Invalid, Clean Shared, Dirty and __________. Name the missing state.'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Exclusive&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Valid&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Dirty Shared&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Shared&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''7) In the Subblock Protocol, all subblocks that are clean-shared may be written without a bus transaction. In what state is this achieved?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Valid Exclusive&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Invalid&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Clean Shared&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Dirty Shared&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''8) Which protocol maintains a counter and an invalidation threshold for each cache block to overcome Write-Invalidate’s write-after-read problem?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Invalidate&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Subblock protocol &lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Read-snarfing protocol&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	Write-Update &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''9) Based on the simulation in this article, the Subblock Protocol worked better than the MESI protocol in specific executions. Under what circumstances would the MESI Protocol work better than the Subblock Protocol?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	For larger block sizes, since MESI uses large block sizes&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	For effectively supplying the block to other processors whose blocks were invalidated in the past&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	MESI will update the need to broadcast when the data is actively shared&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	For small block size because Subblock Protocol uses large transfer block&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''10) Why was it stated that the Read-snarfing protocol achieved lower bus utilization and higher application speedups?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Because instructions on any given processor execute in partial program order, but may not propagate in that order&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Because it’s effective in reducing the amount of data transferred and the number of bus transactions &lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Because Write Update reduces all types of cache misses relative to Write Invalidate, therefore Read-snarfing gains performance &lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Because multiple caches can be in its Shared State simultaneously.&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	None of the above&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
*[http://www.computer.org/portal/web/csdl/doi/10.1109/HPCA.1995.386536 Two techniques for improving performance on bus-based multiprocessors]&lt;br /&gt;
&lt;br /&gt;
*[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Design of an Adaptive Cache Coherence Protocol for Large Scale Multiprocessors]&lt;br /&gt;
&lt;br /&gt;
*[http://www.lfbs.rwth-aachen.de/content/smi Shared Memory Interface]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references&amp;gt;&amp;lt;/references&amp;gt;&lt;/div&gt;</summary>
		<author><name>Sbasu3</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8b_va&amp;diff=60643</id>
		<title>CSC/ECE 506 Spring 2012/8b va</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8b_va&amp;diff=60643"/>
		<updated>2012-03-29T00:41:43Z</updated>

		<summary type="html">&lt;p&gt;Sbasu3: /* Consideration of cache architecture issue */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Update and adaptive coherence protocols on real architectures, and power considerations=&lt;br /&gt;
 &lt;br /&gt;
===Introduction===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In a [http://en.wikipedia.org/wiki/Shared_memory shared memory], shared-bus multiprocessor system, the cache coherence protocol plays an important role in propagating changes from one cache to the others. In most cases, however, update and invalidate coherence protocols are the main source of bus contention, which can increase the number of busy bus cycles and thus program execution time, because a processor may stall while its cache waits for the bus. To reduce bus contention, this article presents some adaptive hybrid protocol strategies for high-performance multiprocessors.&amp;lt;ref&amp;gt;[http://en.wikipedia.org/wiki/Shared_memory shared memory]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Coherence protocols===&lt;br /&gt;
In general terms, a coherency protocol is a protocol which maintains the consistency between all the caches in a system of [http://en.wikipedia.org/wiki/Shared_memory shared memory]. The protocol maintains memory coherence according to a specific consistency model. Older multiprocessors support the [http://en.wikipedia.org/wiki/Sequential_consistency sequential consistency model], while modern shared memory systems typically support the release consistency or weak consistency models. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:Cache_Coherency_Generic.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 1.  Multiple Caches of Shared Resource''' &amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Transitions between states in any specific implementation of these protocols may vary. For example, an implementation may choose different update and invalidation transitions such as update-on-read, update-on-write, invalidate-on-read, or invalidate-on-write. The choice of transition may affect the amount of inter-cache traffic, which in turn may affect the amount of cache bandwidth available for actual work. This should be taken into consideration in the design of distributed software that could cause strong contention between the caches of multiple processors.&lt;br /&gt;
Various models and protocols have been devised for maintaining cache coherence, such as the MSI, MESI (also known as the Illinois protocol), MOSI, MOESI, MERSI, MESIF, write-once, Synapse, Berkeley, Firefly and Dragon protocols.&lt;br /&gt;
&lt;br /&gt;
For the purpose of the main study, this article will focus on the two commonly used coherence protocols: &amp;lt;b&amp;gt;Write-Invalidate (WI) and Write-Update (WU)&amp;lt;/b&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=====Write-Invalidate Protocol=====&lt;br /&gt;
In a [http://cs.gmu.edu/cne/modules/dsm/purple/wr_inval.html Write-Invalidate Protocol], a processor invalidates the copies of the cache block held by all other processors and then updates its own copy without further bus operations.&lt;br /&gt;
&lt;br /&gt;
=====Write-Update Protocol=====&lt;br /&gt;
Alternatively, in a [http://cs.gmu.edu/cne/modules/dsm/purple/wr_update.html Write-Update Protocol], a processor broadcasts updates to shared data to the other caches so that they stay coherent.&lt;br /&gt;
&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based mu1tiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=====&amp;lt;dd&amp;gt;   &amp;lt;b&amp;gt;Disadvantages of Write-Invalidate Protocol&amp;lt;/b&amp;gt;=====&lt;br /&gt;
&amp;lt;dd&amp;gt; Any write operation by a processor invalidates the shared cache blocks of other processors, forcing the other caches to issue bus requests to reload the new data, which increases bus bandwidth demand. This gets worse if one processor frequently updates a block while another processor stalls to read the same block. For a sequence of '''n''' writes by one processor, each followed by a read from another processor, the '''WI''' protocol performs '''n''' invalidate and '''n''' cache block read operations.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Write-Invalidate]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=====&amp;lt;dd&amp;gt;   &amp;lt;b&amp;gt;Disadvantages of Write-Update Protocol&amp;lt;/b&amp;gt;=====&lt;br /&gt;
&amp;lt;dd&amp;gt; In the write-then-read sequence above, the update protocol is advantageous because it only updates the cache blocks n times. However, an update protocol sometimes keeps refreshing data in other processors' caches long after it is needed, so fewer cache blocks are available for more useful data. This tends to increase conflict and capacity cache misses.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Write-Update]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
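&lt;br /&gt;
The trade-off described above can be made concrete by counting bus cycles for n write-then-read rounds. The following Python sketch is an illustration only: the function names are hypothetical, and the per-transaction costs (invalidate, block read, update) are made-up values, not measurements from the cited paper.&lt;br /&gt;

```python
def write_invalidate_cycles(n: int, ci: int, cr: int) -> int:
    """Bus cycles when each of n writes forces one invalidation and one block re-read."""
    return n * (ci + cr)


def write_update_cycles(n: int, cu: int) -> int:
    """Bus cycles when each of n writes is broadcast as one update."""
    return n * cu


if __name__ == "__main__":
    # Hypothetical per-transaction costs in bus cycles, chosen only for illustration.
    ci, cr, cu = 2, 10, 3
    n = 4
    print("WI:", write_invalidate_cycles(n, ci, cr))
    print("WU:", write_update_cycles(n, cu))
```

With these assumed costs, four rounds take 48 bus cycles under WI and 12 under WU. If the other processor never reads the block again, however, every update is wasted, which is the weakness of WU noted above.&lt;br /&gt;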
&lt;br /&gt;
====Consideration of cache architecture issue====&lt;br /&gt;
Application execution time also depends directly on the cache block size. &lt;br /&gt;
Some applications execute more quickly with a large cache block size because they exhibit good spatial locality, whereas others run better with small cache blocks because these avoid migratory data or false sharing between processors. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Adaptive coherence protocols=&lt;br /&gt;
===Hardware architecture===&lt;br /&gt;
Most often, a processor architecture uses the same block size for memory-to-cache transfers, cache-to-memory transfers and coherence. &lt;br /&gt;
The approach described here uses different block sizes for transfer and coherence, based on application requirements.&lt;br /&gt;
A normal cache has the following parameters: capacity ('''C'''), block size ('''L''') and associativity ('''K'''). &lt;br /&gt;
A sector cache, in contrast, divides each cache block into subblocks of size “b”. Compared to a normal cache ('''C, L, K'''), a sector cache ('''C, L, K, b''') requires one extra state and some extra bus lines to transmit bitmasks describing the status of the subblocks in a particular block, but it maintains the same number of tag and state bits for '''L'''. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
   &lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
===Subblock protocol===&lt;br /&gt;
This snoopy-based protocol combines features of the [http://en.wikipedia.org/wiki/MESI_protocol Illinois MESI protocol] and write policies with subblock validation to gain the advantages of both small and large cache block sizes by using subblocks.  &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dt&amp;gt;&lt;br /&gt;
=====Block states=====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Invalid:&amp;lt;/b&amp;gt; All subblocks are invalid&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Valid Exclusive:&amp;lt;/b&amp;gt; None of the valid subblocks in this block is present in any other cache. All subblocks that are Clean Shared may be written without a bus transaction. Any subblock in the block may be invalid. There may also be Dirty subblocks, which must be written back upon replacement.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Clean Shared:&amp;lt;/b&amp;gt; The block contains subblocks that are either Clean Shared or Invalid. The block can be replaced without a bus transaction.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty Shared:&amp;lt;/b&amp;gt; Subblocks in this block may be in any state. There may be Dirty subblocks, which must be written back on replacement.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dt&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Finite state diagram of block/ line states is as follows:'''&lt;br /&gt;
  &lt;br /&gt;
&amp;lt;center&amp;gt; [[File:line_protocol.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 2.  Finite state diagram of block''' &amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=====Subblock states=====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Invalid:&amp;lt;/b&amp;gt;  The subblock is invalid&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Clean Shared:&amp;lt;/b&amp;gt; A read access to the subblock will succeed. Unless the block that contains the subblock is in the Valid Exclusive state, a write to the subblock will force an invalidation transaction on the bus.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty Shared:&amp;lt;/b&amp;gt; The subblock is treated like a Clean Shared subblock, except that it must be written back on replacement. At most one cache will have a given subblock in either the Dirty Shared or Dirty state.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty:&amp;lt;/b&amp;gt; The subblock is exclusive to this cache. It must be written back on replacement. Read and write accesses to this subblock hit with no bus transaction.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Finite state diagram of subblock is as follows:''' &lt;br /&gt;
&amp;lt;center&amp;gt;[[File:subblock_protocol.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 3.  Finite state diagram of Sub-block''' &amp;lt;/center&amp;gt; &lt;br /&gt;
   &lt;br /&gt;
The basic idea of this protocol is to transfer more data between caches and to access off-chip memory less often.&lt;br /&gt;
In contrast with the Illinois protocol, on a read miss a cache that holds the block supplies the requested subblock together with all other Clean Shared and Dirty Shared subblocks in that block. &lt;br /&gt;
If the subblock must come from main memory, the snooping caches report which subblocks they hold, and memory provides the requested subblock along with the subblocks that are not currently cached.&lt;br /&gt;
On writes to clean subblocks and on write misses, the snooper invalidates only the subblock being written, which avoids the penalties associated with false sharing.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;b&amp;gt;In contrast to the Illinois MESI protocol, which requires extra power to maintain the extra states and additional logic, the subblock protocol reduces the number of cache blocks compared to cache update protocols, and thus reduces power consumption. &amp;lt;/b&amp;gt; &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Read-snarfing protocol===&lt;br /&gt;
This is an enhancement to snoopy-cache coherence protocols that takes advantage of the inherent broadcast nature of the bus.&lt;br /&gt;
&lt;br /&gt;
In contrast with the MESI protocol, when one processor reloads data into its cache after a cache miss, the read-snarfing protocol also supplies the block to the other processors whose copies were invalidated in the past. Only one read is required to restore the block to all caches in which it was invalidated.&lt;br /&gt;
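&lt;br /&gt;
The snarfing step described above can be sketched as a toy Python model. This is only an illustration under stated assumptions: the class and function names (Cache, bus_read, snoop_read_reply) are hypothetical, each cache is assumed to keep the tag of an invalidated block, and timing and replacement are ignored.&lt;br /&gt;

```python
from enum import Enum


class LineState(Enum):
    INVALID = "invalid"
    SHARED = "shared"


class Cache:
    """Toy cache that keeps the tag of a block after it is invalidated."""

    def __init__(self, name):
        self.name = name
        self.lines = {}  # address -> (LineState, data)

    def invalidate(self, address):
        # Keep the tag but mark the copy as invalid.
        if address in self.lines:
            self.lines[address] = (LineState.INVALID, None)

    def snoop_read_reply(self, address, data):
        # Read snarfing: take the data only if we hold an invalid copy.
        entry = self.lines.get(address)
        if entry is not None and entry[0] is LineState.INVALID:
            self.lines[address] = (LineState.SHARED, data)
            return True
        return False


def bus_read(requester, others, address, data):
    """One bus read; returns the names of the caches that snarfed the reply."""
    requester.lines[address] = (LineState.SHARED, data)
    return [c.name for c in others if c.snoop_read_reply(address, data)]


caches = [Cache("P0"), Cache("P1"), Cache("P2"), Cache("P3")]
for cache in caches[:3]:
    cache.lines[0x40] = (LineState.SHARED, "old")
    cache.invalidate(0x40)  # an earlier write invalidated these copies

snarfers = bus_read(caches[0], caches[1:], 0x40, "new")
print(snarfers)
```

In this sketch a single read issued by P0 also restores the block in P1 and P2, while P3, which never held the block, is left untouched.&lt;br /&gt;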
&lt;br /&gt;
The protocol modifies the normal update protocol by updating only those subblocks that were modified, and updates need to be broadcast only while the data is actively shared. &lt;br /&gt;
The read-snarfing protocol maintains a write counter and an invalidation threshold (Tb) for each cache block “b” to overcome the drawback of WI’s write-after-read problem. The protocol predicts how many writes will occur to a cache block before the next read request to the same block. An invalidation request is broadcast when the write counter reaches Tb, and the protocol dynamically adjusts the value of Tb based on the behavior of the running program. &lt;br /&gt;
&lt;br /&gt;
A simple version of the Read-snarfing Random Walk algorithm is as follows.&lt;br /&gt;
Initially, Tb of each cache block b is set to 0. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
// R  = invalidation ratio, (Ci + Cr) / Cu&lt;br /&gt;
// Ci = cost in bus cycles of an invalidation transaction&lt;br /&gt;
// Cu = cost in bus cycles of an update transaction&lt;br /&gt;
// Cr = cost in bus cycles of reading a cache block&lt;br /&gt;
&lt;br /&gt;
// write_run = number of writes to the block before it was accessed by another processor&lt;br /&gt;
if (write_run &amp;gt; R) {&lt;br /&gt;
    // Long write run: lower the threshold, but never below 0&lt;br /&gt;
    if (Tb &amp;gt; 0) {&lt;br /&gt;
        Tb--;&lt;br /&gt;
    }&lt;br /&gt;
} else {&lt;br /&gt;
    // Short write run: raise the threshold, but never above R&lt;br /&gt;
    if (R &amp;gt; Tb) {&lt;br /&gt;
        Tb++;&lt;br /&gt;
    }&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When the threshold falls to 0, the block is invalidated immediately, with no wasted updates.  &lt;br /&gt;
When the block is actively shared, Tb is adjusted upward so that the block is not invalidated. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
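&lt;br /&gt;
The threshold adjustment can also be written as a small Python function. This is a minimal sketch, assuming that Tb may fall to 0 (as the paragraph above states) and may rise no higher than R; the function name and the sample write runs are hypothetical.&lt;br /&gt;

```python
def adjust_threshold(tb, write_run, r):
    """Return the new invalidation threshold after one write run.

    tb        -- current threshold of the block
    write_run -- writes seen before another processor accessed the block
    r         -- invalidation ratio (Ci + Cr) / Cu
    """
    if write_run > r:
        # Long write run: updates were wasted, so invalidate sooner next time.
        return tb - 1 if tb > 0 else tb
    # Short write run: the block is actively shared, so keep updating longer.
    return tb + 1 if r > tb else tb


tb = 0  # initial threshold, as stated in the text
history = []
for write_run in [2, 3, 1, 9, 2]:  # hypothetical write-run lengths
    tb = adjust_threshold(tb, write_run, r=5)
    history.append(tb)
print(history)
```

With R = 5, short write runs walk the threshold upward one step at a time and a single long run walks it back down by one step, which is the random-walk behaviour the algorithm is named after.&lt;br /&gt;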
&amp;lt;dl&amp;gt;&lt;br /&gt;
&amp;lt;dd&amp;gt;&lt;br /&gt;
=====Example 1 =====&lt;br /&gt;
&amp;lt;dd&amp;gt;Suppose the invalidation ratio (R) = 5&lt;br /&gt;
&amp;lt;dd&amp;gt;Current threshold of the block (Tb) = 3&lt;br /&gt;
&amp;lt;dd&amp;gt;If the processor writes 4 times before the block is accessed by another processor, then according to the above logic Tb becomes 4.&lt;br /&gt;
&amp;lt;dd&amp;gt;This means Tb is at the best possible value for this write run, and only updates are issued.&lt;br /&gt;
&lt;br /&gt;
=====Example 2 =====&lt;br /&gt;
&amp;lt;dd&amp;gt;Consider R = 5 and Tb = 3 for a particular block.&lt;br /&gt;
&amp;lt;dd&amp;gt;If the processor writes 10 times before the block is accessed by another processor,&lt;br /&gt;
&amp;lt;dd&amp;gt;Tb is decreased to 2.&lt;br /&gt;
&amp;lt;dd&amp;gt;The protocol then incurs a cost of 2 updates, 1 invalidation and 1 reread.&lt;br /&gt;
&amp;lt;dd&amp;gt;After 2 more such write runs, Tb will be 0 and invalidation will occur immediately. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
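&lt;br /&gt;
The cost accounting used in the two examples can be sketched in Python as follows. The cost model is an assumption read off Example 2 (Tb updates, then one invalidation and one later reread once a write run exceeds Tb), and the cycle counts Ci = 2, Cr = 8 and Cu = 2 are hypothetical values chosen only so that R = 5.&lt;br /&gt;

```python
def invalidation_ratio(ci, cr, cu):
    """R = (Ci + Cr) / Cu, as defined in the algorithm above."""
    return (ci + cr) / cu


def write_run_cost(write_run, tb, ci, cr, cu):
    """Bus cycles spent on one write run under threshold tb (assumed model)."""
    if write_run > tb:
        # tb updates, then one invalidation and one reread by the next reader.
        return tb * cu + ci + cr
    # The run ends before the threshold is reached: only updates are issued.
    return write_run * cu


CI, CR, CU = 2, 8, 2  # hypothetical bus-cycle costs

print(invalidation_ratio(CI, CR, CU))
print(write_run_cost(4, 4, CI, CR, CU))    # Example 1: 4 writes with Tb = 4
print(write_run_cost(10, 2, CI, CR, CU))   # Example 2: 10 writes with Tb = 2
print(write_run_cost(10, 10, CI, CR, CU))  # same run, updates only
```

Under these assumed costs, the 10-write run of Example 2 costs 14 bus cycles with Tb = 2, compared with 20 cycles if every write were broadcast as an update.&lt;br /&gt;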
&lt;br /&gt;
==Simulation result==&lt;br /&gt;
For all simulated results, the subblock size ('''b''') was '''8 bytes''' and the caches were two-way set-associative ('''K = 2'''). In order to minimize simulation time, data was only simulated with relatively large caches ('''C = 128K''').&lt;br /&gt;
&lt;br /&gt;
First, below are the results for the two applications ('''Gauss and Cholesky''') that do well using large block sizes. Next come the results for the three applications ('''MP3D, Topopt, and Pverify''') that perform better using smaller block sizes. Finally, results are reported for Barnes, for which the choice of block size is not very important. The simulations compared usual and sector caches with '''1, 4, 16 and 32 processors''', where the usual cache uses the Illinois protocol and the sector cache uses the read-snarfing protocol. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Simulation]&amp;lt;/ref&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:image500.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 4.  (a-c) Gauss (128K cache) uses large cache block ''' &amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:Gauss32.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:cholesky_128k.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 5.  (a-c) Cholesky (128 K cache) application uses large cache block'''&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:gauss_32pro.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:Cholesky.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:mp3d.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 6.  (a-c) MP3D (128 K caches) application which uses smaller block sizes.'''&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:results_summary_1.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:results_summary_2.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:results_summary_3.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 7.  Final Simulation Results'''&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Conclusion==&lt;br /&gt;
This article described two techniques for improving application performance on bus-based multiprocessors. The first technique is the subblock cache coherence protocol, which combines large transfer blocks with small coherence subblocks to gain the advantages of spatial locality while avoiding false sharing.   &lt;br /&gt;
The second technique is read-snarfing, which reduces the number of read misses caused by earlier cache coherence protocol actions.  &lt;br /&gt;
The simulation results above showed that the subblock protocol works better than the MESI protocol for the 64-byte cache block size, but for small block sizes the Illinois protocol works better because the subblock protocol uses a large transfer block.&lt;br /&gt;
In addition, the read-snarfing protocol is effective in reducing the amount of data transferred and the number of bus transactions, which results in lower bus utilization and higher application speedups. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Quiz==&lt;br /&gt;
The following quiz is intended to help the reader check and consolidate their understanding of this article. It contains 10 multiple-choice questions.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;&lt;br /&gt;
'''1) What are the two commonly used coherence protocols? '''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Token-based &amp;amp; Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Update &amp;amp; Token-based&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Write-Invalidate &amp;amp; Write-Update &lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Both a and c&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''2) A coherency protocol is a protocol which maintains  __________      ____________ according to a specific ___________     ___________'''&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	System order / memory hierarchy&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Memory coherence / consistency model&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Bus hierarchy / system order&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''3) Increased bus bandwidth demand is most commonly seen in what protocol?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Invalidate&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Token-based&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Write-Update &lt;br /&gt;
&amp;lt;dd&amp;gt;e)	Both a and b&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''4) An increase in conflict and capacity cache misses is mostly seen in what protocol?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Token-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Update &lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Write-Invalidate&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	Both a and d&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''5) Due to the good use of  _________  __________ some applications execute more quickly with large cache block sizes, whereas others run better when cache block sizes are small by avoiding migratory data or ________  _________ between processors'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Spatial locality / false-sharing&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Equidistant locality /Sequential consistency &lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Temporal locality/ true sharing &lt;br /&gt;
&amp;lt;dd&amp;gt;d)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''6) The Subblock Protocol consists of four states: Invalid, Clean Shared, Dirty and __________. Name the missing state?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Exclusive&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Valid&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Dirty Shared&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Shared&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''7) In the Subblock Protocol, all subblocks that are clean-shared may be written without a bus transaction. In what state is this achieved?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Valid Exclusive&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Invalid&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Clean Shared&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Dirty Shared&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''8) Which protocol maintains a counter and invalidation threshold for each cache block to overcome the drawback of Write-Invalidates’ write after read problem?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Invalidate&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Subblock protocol &lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Read-snarfing protocol&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	Write-Update &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''9) Based on the simulation in this article, even though it was stated that the Subblock Protocol worked better than the MESI protocol in specific executions, under what circumstances would the MESI Protocol work better than the Subblock Protocol?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	For larger block sizes, since MESI uses large block sizes&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	For effectively supplying the block to other processors whose blocks were invalidated in the past&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	MESI will update the need to broadcast when the data is actively shared&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	For small block size because Subblock Protocol uses large transfer block&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''10) Why was it stated that the Read-snarfing protocol achieved lower bus utilization and higher application speedups?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Because instructions on any given processor execute in partial program order, but may not propagate in that order&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Because it’s effective in reducing the amount of data transferred and the number of bus transactions &lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Because Write Update reduces all types of cache misses relative to Write Invalidate, therefore Read-snarfing gains performance &lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Because multiple caches can be in its Shared State simultaneously.&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	None of the above&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
*[http://www.computer.org/portal/web/csdl/doi/10.1109/HPCA.1995.386536 Two techniques for improving performance on bus-based multiprocessors]&lt;br /&gt;
&lt;br /&gt;
*[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Design of an Adaptive Cache Coherence Protocol for Large Scale Multiprocessors]&lt;br /&gt;
&lt;br /&gt;
*[http://www.lfbs.rwth-aachen.de/content/smi Shared Memory Interface]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references&amp;gt;&amp;lt;/references&amp;gt;&lt;/div&gt;</summary>
		<author><name>Sbasu3</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8b_va&amp;diff=60642</id>
		<title>CSC/ECE 506 Spring 2012/8b va</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8b_va&amp;diff=60642"/>
		<updated>2012-03-29T00:41:09Z</updated>

		<summary type="html">&lt;p&gt;Sbasu3: /* Hardware architecture */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Update and adaptive coherence protocols on real architectures, and power considerations=&lt;br /&gt;
 &lt;br /&gt;
===Introduction===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In a [http://en.wikipedia.org/wiki/Shared_memory shared memory], shared-bus multiprocessor system, the cache coherence protocol plays an important role in propagating changes from one cache to the others. In most cases, however, update and invalidate coherence protocols are the main source of bus contention, which increases the number of busy bus cycles and thus program execution time, because a processor may stall while its cache is waiting for the bus. To avoid this bus contention, this article presents some adaptive hybrid protocol strategies for high-performance multiprocessors.&amp;lt;ref&amp;gt;[http://en.wikipedia.org/wiki/Shared_memory shared memory]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Coherence protocols===&lt;br /&gt;
In general terms, a coherency protocol is a protocol which maintains the consistency between all the caches in a system of [http://en.wikipedia.org/wiki/Shared_memory shared memory]. The protocol maintains memory coherence according to a specific consistency model. Older multiprocessors support the [http://en.wikipedia.org/wiki/Sequential_consistency sequential consistency model], while modern shared memory systems typically support the release consistency or weak consistency models. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:Cache_Coherency_Generic.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 1.  Multiple Caches of Shared Resource''' &amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Transitions between states in any specific implementation of these protocols may vary. For example, an implementation may choose different update and invalidation transitions such as update-on-read, update-on-write, invalidate-on-read, or invalidate-on-write. The choice of transition may affect the amount of inter-cache traffic, which in turn may affect the amount of cache bandwidth available for actual work. This should be taken into consideration in the design of distributed software that could cause strong contention between the caches of multiple processors.&lt;br /&gt;
Various models and protocols have been devised for maintaining cache coherence, such as the MSI protocol, MESI (also known as the Illinois protocol), MOSI, MOESI, MERSI, MESIF, write-once, Synapse, Berkeley, Firefly and Dragon protocols.&lt;br /&gt;
&lt;br /&gt;
For the purpose of the main study, this article will focus on the two commonly used coherence protocols: &amp;lt;b&amp;gt;Write-Invalidate (WI) and Write-Update (WU)&amp;lt;/b&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=====Write-Invalidate Protocol=====&lt;br /&gt;
In a [http://cs.gmu.edu/cne/modules/dsm/purple/wr_inval.html Write-Invalidate Protocol], a processor invalidates the copies of the cache block held by all other processors and then updates its own copy without further bus operations.&lt;br /&gt;
&lt;br /&gt;
=====Write-Update Protocol=====&lt;br /&gt;
Alternatively, in a [http://cs.gmu.edu/cne/modules/dsm/purple/wr_update.html Write-Update Protocol], a processor broadcasts updates to shared data to the other caches so that they stay coherent.&lt;br /&gt;
&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=====&amp;lt;dd&amp;gt;   &amp;lt;b&amp;gt;Disadvantages of Write-Invalidate Protocol&amp;lt;/b&amp;gt;=====&lt;br /&gt;
&amp;lt;dd&amp;gt; Any write operation by a processor invalidates the shared copies of the cache block in the other processors' caches, forcing those caches to issue bus requests to reload the new data, which increases bus bandwidth demand. This is worse when one processor frequently updates a cache block while another processor stalls waiting to read the same block. For a sequence of '''n''' writes by one processor, each followed by a read from another processor, the '''WI''' protocol performs '''n''' invalidations and '''n''' cache block reads.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Write-Invalidate]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=====&amp;lt;dd&amp;gt;   &amp;lt;b&amp;gt;Disadvantages of Write-Update Protocol&amp;lt;/b&amp;gt;=====&lt;br /&gt;
&amp;lt;dd&amp;gt; The update protocol is advantageous in the case above because it only updates the cache block '''n''' times. However, an update protocol sometimes keeps refreshing data that other processors' caches no longer need for too long, so fewer cache blocks are available for more useful data. It therefore tends to increase conflict and capacity cache misses.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Write-Update]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
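&lt;br /&gt;
The trade-off between the two protocols can be made concrete with a short Python sketch. It assumes a simple cost model that is not taken from the source: each write run by one processor is followed by one read from another processor, WI pays one invalidation plus one cache block read per run, and WU pays one update per write. The function names and the cycle counts (2, 8 and 2) are hypothetical.&lt;br /&gt;

```python
def wi_cost(write_runs, ci, cr):
    """Write-invalidate: one invalidation and one reread per write run."""
    return len(write_runs) * (ci + cr)


def wu_cost(write_runs, cu):
    """Write-update: one update broadcast for every single write."""
    return sum(write_runs) * cu


CI, CR, CU = 2, 8, 2  # hypothetical bus-cycle costs

ping_pong = [1] * 10  # 10 writes, each followed by a remote read
long_runs = [20, 20]  # two runs of 20 writes, each followed by one remote read

print(wi_cost(ping_pong, CI, CR), wu_cost(ping_pong, CU))
print(wi_cost(long_runs, CI, CR), wu_cost(long_runs, CU))
```

Under these assumptions the update protocol is cheaper for the ping-pong pattern (20 against 100 bus cycles), while the invalidate protocol is cheaper for long write runs (20 against 80 bus cycles).&lt;br /&gt;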
&lt;br /&gt;
====Consideration of cache architecture issue====&lt;br /&gt;
Application execution time also depends directly on the cache block size. &lt;br /&gt;
Some applications execute more quickly with a large cache block size because they exhibit good spatial locality, whereas other applications run better when cache blocks are small because this avoids migratory data or false sharing between processors. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Using a combined strategy such as an '''adaptive hybrid protocol''' can reduce the pathological behaviors of update and invalidate protocols. Such a protocol should be applicable to a wide range of network characteristics, and it should automatically adjust its behavior to achieve target goals in the face of changes in traffic patterns, node mobility and other network characteristics.&lt;br /&gt;
&lt;br /&gt;
=Adaptive coherence protocols=&lt;br /&gt;
===Hardware architecture===&lt;br /&gt;
Most processor architectures use the same block size for memory-to-cache transfers, cache-to-memory transfers and coherence. &lt;br /&gt;
The approach described here uses different block sizes for transfer and coherence, based on application requirements.&lt;br /&gt;
A normal cache has the following parameters: capacity ('''C'''), block size ('''L''') and associativity ('''K'''). &lt;br /&gt;
A sector cache, by contrast, divides each cache block into subblocks of size “b”. Compared with a normal cache ('''C, L, K'''), a sector cache ('''C, L, K, b''') requires one extra state and some extra bus lines to transmit bitmasks corresponding to the status of the subblocks in a particular block, but it maintains the same number of tag and state bits for '''L'''. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
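&lt;br /&gt;
The sector cache parameters can be related to one another with a few lines of Python. This is only an illustrative sketch: the class name is hypothetical, C = 128K, K = 2 and b = 8 are the values used in the simulation section of this article, and L = 64 bytes is an assumed block size.&lt;br /&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SectorCache:
    capacity: int       # C, in bytes
    block_size: int     # L, in bytes
    associativity: int  # K
    subblock_size: int  # b, in bytes

    @property
    def blocks(self):
        # Number of blocks, and therefore of address tags, in the cache.
        return self.capacity // self.block_size

    @property
    def sets(self):
        return self.blocks // self.associativity

    @property
    def subblocks_per_block(self):
        # Also the width in bits of the per-block subblock bitmask.
        return self.block_size // self.subblock_size


cache = SectorCache(capacity=128 * 1024, block_size=64,
                    associativity=2, subblock_size=8)
print(cache.blocks, cache.sets, cache.subblocks_per_block)
```

With these values the cache holds 2048 blocks in 1024 sets, and each block is split into 8 subblocks, so the bitmask sent on the extra bus lines would be 8 bits wide.&lt;br /&gt;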
   &lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
===Subblock protocol===&lt;br /&gt;
This snoopy protocol combines features of the [http://en.wikipedia.org/wiki/MESI_protocol Illinois MESI protocol] and write policies with subblock validation to gain the advantages of both small and large cache block sizes by using subblocks.  &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dt&amp;gt;&lt;br /&gt;
=====Block states=====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Invalid:&amp;lt;/b&amp;gt; All subblocks are invalid&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Valid Exclusive:&amp;lt;/b&amp;gt; None of the valid subblocks in this block is present in any other cache. All subblocks that are Clean Shared may be written without a bus transaction. Any subblock in the block may be invalid. There may also be Dirty subblocks, which must be written back upon replacement.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Clean Shared:&amp;lt;/b&amp;gt; The block contains subblocks that are either Clean Shared or Invalid. The block can be replaced without a bus transaction.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty Shared:&amp;lt;/b&amp;gt; Subblocks in this block may be in any state. There may be Dirty subblocks, which must be written back on replacement.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dt&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Finite state diagram of block/ line states is as follows:'''&lt;br /&gt;
  &lt;br /&gt;
&amp;lt;center&amp;gt; [[File:line_protocol.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 2.  Finite state diagram of block''' &amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=====Subblock states=====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Invalid:&amp;lt;/b&amp;gt;  The subblock is invalid&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Clean Shared:&amp;lt;/b&amp;gt; A read access to the subblock will succeed. Unless the block that the subblock is part of is in the Valid Exclusive state, a write to the subblock will force an invalidation transaction on the bus.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty Shared:&amp;lt;/b&amp;gt; The subblock is treated like a Clean Shared subblock, except that it must be written back on replacement. At most one cache will have a given subblock in either the Dirty Shared or Dirty state.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty:&amp;lt;/b&amp;gt; The subblock is exclusive to this cache. It must be written back on replacement. Read and write accesses to this subblock hit with no bus transaction.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Finite state diagram of subblock is as follows:''' &lt;br /&gt;
&amp;lt;center&amp;gt;[[File:subblock_protocol.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 3.  Finite state diagram of Sub-block''' &amp;lt;/center&amp;gt; &lt;br /&gt;
   &lt;br /&gt;
The basic idea of this protocol is to transfer more data between caches and to access off-chip memory less often.&lt;br /&gt;
In contrast with the Illinois protocol, on a read miss the cache that holds the block sends the requested subblock together with all other Clean Shared and Dirty Shared subblocks in that block. &lt;br /&gt;
If the requested subblock is only in main memory, the snooping caches report which subblocks they hold, and memory provides the requested subblock along with the subblocks that are not currently cached.&lt;br /&gt;
On writes to clean subblocks and on write misses, the snooper invalidates only the subblock being written, which avoids the penalties associated with false sharing.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;b&amp;gt;In contrast to the Illinois MESI protocol, which requires extra power to maintain its extra states and additional logic, the subblock protocol reduces the number of cache blocks, compared to cache update protocols, and thus reduces power consumption. &amp;lt;/b&amp;gt; &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
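&lt;br /&gt;
The way per-subblock invalidation sidesteps false sharing can be illustrated with a short Python sketch. It is a simplification under stated assumptions: only the snooping side of a remote write is modelled, the function names are hypothetical, and a 64-byte block with 8-byte subblocks is assumed.&lt;br /&gt;

```python
from enum import Enum


class SubblockState(Enum):
    INVALID = "Invalid"
    CLEAN_SHARED = "Clean Shared"
    DIRTY_SHARED = "Dirty Shared"
    DIRTY = "Dirty"


def subblock_index(offset, subblock_size):
    """Index of the subblock that contains the given byte offset."""
    return offset // subblock_size


def snoop_remote_write(states, offset, subblock_size):
    """Return the local subblock states after another cache writes at offset."""
    updated = list(states)
    # Only the written subblock is invalidated; its neighbours stay valid.
    updated[subblock_index(offset, subblock_size)] = SubblockState.INVALID
    return updated


local = [SubblockState.CLEAN_SHARED] * 8  # 64-byte block, 8-byte subblocks
after = snoop_remote_write(local, offset=20, subblock_size=8)
print([state.value for state in after])
```

After the remote write to byte 20, only subblock 2 is Invalid in the local cache; a local read of byte 40, which lies in subblock 5, would still hit, whereas a protocol that invalidates whole blocks would turn it into a miss.&lt;br /&gt;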
&lt;br /&gt;
===Read-snarfing protocol===&lt;br /&gt;
This is an enhancement to snoopy-cache coherence protocols that takes advantage of the inherent broadcast nature of the bus.&lt;br /&gt;
&lt;br /&gt;
In contrast with the MESI protocol, when one processor reloads data into its cache after a cache miss, the read-snarfing protocol also supplies the block to the other processors whose copies were invalidated in the past. Only one read is required to restore the block to all caches in which it was invalidated.&lt;br /&gt;
&lt;br /&gt;
The protocol modifies the normal update protocol by updating only those subblocks that were modified, and updates need to be broadcast only while the data is actively shared. &lt;br /&gt;
The read-snarfing protocol maintains a write counter and an invalidation threshold (Tb) for each cache block “b” to overcome the drawback of WI’s write-after-read problem. The protocol predicts how many writes will occur to a cache block before the next read request to the same block. An invalidation request is broadcast when the write counter reaches Tb, and the protocol dynamically adjusts the value of Tb based on the behavior of the running program. &lt;br /&gt;
&lt;br /&gt;
A simple version of the Read-snarfing Random Walk algorithm is as follows.&lt;br /&gt;
Initially, Tb of each cache block b is set to 0. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
// R  = invalidation ratio, (Ci + Cr) / Cu&lt;br /&gt;
// Ci = cost in bus cycles of an invalidation transaction&lt;br /&gt;
// Cu = cost in bus cycles of an update transaction&lt;br /&gt;
// Cr = cost in bus cycles of reading a cache block&lt;br /&gt;
&lt;br /&gt;
// write_run = number of writes to the block before it was accessed by another processor&lt;br /&gt;
if (write_run &amp;gt; R) {&lt;br /&gt;
    // Long write run: lower the threshold, but never below 0&lt;br /&gt;
    if (Tb &amp;gt; 0) {&lt;br /&gt;
        Tb--;&lt;br /&gt;
    }&lt;br /&gt;
} else {&lt;br /&gt;
    // Short write run: raise the threshold, but never above R&lt;br /&gt;
    if (R &amp;gt; Tb) {&lt;br /&gt;
        Tb++;&lt;br /&gt;
    }&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When the threshold falls to 0, the block is invalidated immediately, with no wasted updates.  &lt;br /&gt;
When the block is actively shared, Tb is adjusted upward so that the block is not invalidated. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&amp;lt;dd&amp;gt;&lt;br /&gt;
=====Example 1 =====&lt;br /&gt;
&amp;lt;dd&amp;gt;Suppose the invalidation ratio (R) = 5&lt;br /&gt;
&amp;lt;dd&amp;gt;Current threshold of the block (Tb) = 3&lt;br /&gt;
&amp;lt;dd&amp;gt;If the processor writes 4 times before the block is accessed by another processor, then according to the above logic Tb becomes 4.&lt;br /&gt;
&amp;lt;dd&amp;gt;This means Tb is at the best possible value for this write run, and only updates are issued.&lt;br /&gt;
&lt;br /&gt;
=====Example 2 =====&lt;br /&gt;
&amp;lt;dd&amp;gt;Consider R = 5 and Tb = 3 for a particular block.&lt;br /&gt;
&amp;lt;dd&amp;gt;If the processor writes 10 times before the block is accessed by another processor,&lt;br /&gt;
&amp;lt;dd&amp;gt;Tb is decreased to 2.&lt;br /&gt;
&amp;lt;dd&amp;gt;The protocol then incurs a cost of 2 updates, 1 invalidation and 1 reread.&lt;br /&gt;
&amp;lt;dd&amp;gt;After 2 more such write runs, Tb will be 0 and invalidation will occur immediately. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
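&lt;br /&gt;
The first step of each example above can be checked with a minimal sketch of the threshold adjustment. The code simply transcribes the pseudocode given earlier into Python; the function and variable names are hypothetical and are not taken from the cited paper.&lt;br /&gt;

```python
def invalidation_ratio(ci, cr, cu):
    """R = (Ci + Cr) / Cu, with all costs given in bus cycles."""
    return (ci + cr) / cu


def adjust_threshold(tb, write_run, ratio):
    """Return the new Tb after one completed write run, following the pseudocode."""
    if write_run > ratio:
        # Long write run: updates were wasted, so lower the threshold.
        if tb > 1:
            tb -= 1
    else:
        # Short write run: the block is actively shared, so raise the threshold.
        if ratio > tb:
            tb += 1
    return tb


# Example 1: R = 5, Tb = 3, write run of 4
example_1 = adjust_threshold(3, 4, 5)
# Example 2: R = 5, Tb = 3, write run of 10
example_2 = adjust_threshold(3, 10, 5)
print(example_1, example_2)
```

Running the sketch prints 4 2, matching the Tb values stated in Example 1 and Example 2.&lt;br /&gt;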
&lt;br /&gt;
==Simulation result==&lt;br /&gt;
All simulations used a subblock size ('''b''') of '''8 bytes''' and two-way set-associative caches ('''K = 2'''). To limit simulation time, only relatively large caches ('''C = 128K''') were simulated.&lt;br /&gt;
&lt;br /&gt;
First, results are shown for the two applications ('''Gauss and Cholesky''') that do well with large block sizes. Next come the results for the three applications ('''MP3D, Topopt, and Pverify''') that perform better with smaller block sizes. Finally, results are reported for Barnes, for which the choice of block size is not very important. The simulations compared the protocols on both conventional and sector caches with '''1, 4, 16 and 32 processors''', where the conventional cache uses the Illinois protocol and the sector cache uses the read-snarfing protocol. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Simulation]&amp;lt;/ref&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:image500.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 4.  (a-c) Gauss (128K cache) uses large cache block ''' &amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:Gauss32.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:cholesky_128k.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 5.  (a-c) Cholesky (128 K cache) application uses large cache block'''&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:gauss_32pro.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:Cholesky.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:mp3d.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 6.  (a-c) MP3D (128 K caches) application which uses smaller block sizes.'''&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:results_summary_1.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:results_summary_2.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:results_summary_3.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 7.  Final Simulation Results'''&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Conclusion==&lt;br /&gt;
This article described two techniques for improving application performance on bus-based multiprocessors. The first is the subblock cache coherence protocol, which combines a large transfer block with small coherence subblocks to gain the advantages of spatial locality while avoiding false sharing.   &lt;br /&gt;
The second technique is read-snarfing, which reduces the number of read misses caused by earlier cache coherence protocol actions.  &lt;br /&gt;
The simulation results above show that the subblock protocol works better than the MESI protocol for the 64-byte cache block size, but for small block sizes the Illinois protocol works better because the subblock protocol uses a large transfer block.&lt;br /&gt;
In addition, the read-snarfing protocol is effective in reducing the amount of data transferred and the number of bus transactions, which results in lower bus utilization and higher application speedups. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Quiz==&lt;br /&gt;
The following quiz is intended for the reader to benefit from this article and to help achieve a complete understanding of the research. There are a total of 10 multiple-choice questions.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;&lt;br /&gt;
'''1) What are the two commonly used coherence protocols? '''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Token-based &amp;amp; Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Update &amp;amp; Token-based&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Write-Invalidate &amp;amp; Write-Update &lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Both a and c&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''2) A coherency protocol is a protocol which maintains  __________      ____________ according to a specific ___________     ___________'''&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	System order / memory hierarchy&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Memory coherence / consistency model&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Bus hierarchy / system order&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''3) Increasing bus bandwidth is most commonly seen in what protocol?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Invalidate&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Token-based&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Write-Update &lt;br /&gt;
&amp;lt;dd&amp;gt;e)	Both a and b&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''4) An increase in conflict and capacity cache misses is mostly seen in what protocol?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Token-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Update &lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Write-Invalidate&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	Both a and d&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''5) Due to the good use of  _________  __________ some applications execute more quickly with a large cache block size, whereas others run better when cache block sizes are small by avoiding migratory data or ________  _________ between processors'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Spatial locality / false-sharing&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Equidistant locality /Sequential consistency &lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Temporal locality/ true sharing &lt;br /&gt;
&amp;lt;dd&amp;gt;d)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''6) The Subblock Protocol consists of four states: Invalid, Clean Shared, Dirty and __________. Name the missing state.'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Exclusive&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Valid&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Dirty Shared&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Shared&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''7) In the Subblock Protocol, all subblocks that are clean-shared may be written without a bus transaction. In what state is this achieved?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Valid Exclusive&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Invalid&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Clean Shared&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Dirty Shared&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''8) Which protocol maintains a counter and invalidation threshold for each cache block to overcome the drawback of Write-Invalidates’ write after read problem?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Invalidate&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Subblock protocol &lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Read-snarfing protocol&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	Write-Update &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''9) Based on the simulation in this article, even though the Subblock Protocol was stated to work better than the MESI protocol in specific executions, under what circumstances would the MESI Protocol work better than the Subblock Protocol?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	For larger block sizes, since MESI uses large block sizes&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	For effectively supplying the block to other processors whose blocks were invalidated in the past&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	MESI will update the need to broadcast when the data is actively shared&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	For small block size because Subblock Protocol uses large transfer block&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''10) Why was it stated that the Read-snarfing protocol achieved lower bus utilization and higher application speedups?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Because instructions on any given processor execute in partial program order, but may not propagate in that order&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Because it’s effective in reducing the amount of data transfer and the number of bus transactions &lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Because Write Update reduces all types of cache misses relative to Write Invalidate, therefore Read-snarfing gains performance &lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Because multiple caches can be in its Shared State simultaneously.&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	None of the above&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
*[http://www.computer.org/portal/web/csdl/doi/10.1109/HPCA.1995.386536 Two techniques for improving performance on bus-based multiprocessors]&lt;br /&gt;
&lt;br /&gt;
*[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Design of an Adaptive Cache Coherence Protocol for Large Scale Multiprocessors]&lt;br /&gt;
&lt;br /&gt;
*[http://www.lfbs.rwth-aachen.de/content/smi Shared Memory Interface]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references&amp;gt;&amp;lt;/references&amp;gt;&lt;/div&gt;</summary>
		<author><name>Sbasu3</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8b_va&amp;diff=60641</id>
		<title>CSC/ECE 506 Spring 2012/8b va</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8b_va&amp;diff=60641"/>
		<updated>2012-03-29T00:40:21Z</updated>

		<summary type="html">&lt;p&gt;Sbasu3: /* Update and adaptive coherence protocols on real architectures, and power considerations */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Update and adaptive coherence protocols on real architectures, and power considerations=&lt;br /&gt;
 &lt;br /&gt;
===Introduction===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In a [http://en.wikipedia.org/wiki/Shared_memory shared memory], shared-bus multiprocessor system, the cache coherence protocol plays an important role in propagating changes from one cache to the others. In most cases, however, update and invalidate coherence protocols are the main source of bus contention, which increases the number of busy bus cycles and thus program execution time, because a processor may stall while its cache waits for the bus. To reduce bus contention, this article presents some adaptive hybrid protocol strategies for high-performance multiprocessors.&amp;lt;ref&amp;gt;[http://en.wikipedia.org/wiki/Shared_memory shared memory]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Coherence protocols===&lt;br /&gt;
In general terms, a coherency protocol is a protocol which maintains the consistency between all the caches in a system of [http://en.wikipedia.org/wiki/Shared_memory shared memory]. The protocol maintains memory coherence according to a specific consistency model. Older multiprocessors support the [http://en.wikipedia.org/wiki/Sequential_consistency sequential consistency model], while modern shared memory systems typically support the release consistency or weak consistency models. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:Cache_Coherency_Generic.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 1.  Multiple Caches of Shared Resource''' &amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Transitions between states in any specific implementation of these protocols may vary. For example, an implementation may choose different update and invalidation transitions such as update-on-read, update-on-write, invalidate-on-read, or invalidate-on-write. The choice of transition may affect the amount of inter-cache traffic, which in turn may affect the amount of cache bandwidth available for actual work. This should be taken into consideration in the design of distributed software that could cause strong contention between the caches of multiple processors.&lt;br /&gt;
Various models and protocols have been devised for maintaining cache coherence, such as the MSI protocol, MESI (also known as the Illinois protocol), MOSI, MOESI, MERSI, MESIF, write-once, Synapse, Berkeley, Firefly and the Dragon protocol.&lt;br /&gt;
&lt;br /&gt;
For the purpose of the main study, this article will focus on the two commonly used coherence protocols: &amp;lt;b&amp;gt;Write-Invalidate (WI) and Write-Update (WU)&amp;lt;/b&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=====Write-Invalidate Protocol=====&lt;br /&gt;
In a [http://cs.gmu.edu/cne/modules/dsm/purple/wr_inval.html Write-Invalidate Protocol], a processor invalidates the copies of a cache block held by all other processors and then updates its own copy without further bus operations.&lt;br /&gt;
&lt;br /&gt;
=====Write-Update Protocol=====&lt;br /&gt;
By contrast, in a [http://cs.gmu.edu/cne/modules/dsm/purple/wr_update.html Write-Update Protocol], a processor broadcasts its updates to shared data so that the other caches stay coherent.&lt;br /&gt;
&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=====&amp;lt;dd&amp;gt;   &amp;lt;b&amp;gt;Disadvantages of Write-Invalidate Protocol&amp;lt;/b&amp;gt;=====&lt;br /&gt;
&amp;lt;dd&amp;gt; Any write by one processor invalidates the shared copies of the cache block in the caches of other processors, forcing those caches to issue bus requests to reload the new data, which increases bus bandwidth consumption. The effect is worse when one processor frequently updates a block while another processor stalls waiting to read the same block. For a sequence of '''n''' writes by one processor, each followed by a read from another processor, the '''WI''' protocol performs '''n''' invalidations and '''n''' cache block reads.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Write-Invalidate]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=====&amp;lt;dd&amp;gt;   &amp;lt;b&amp;gt;Disadvantages of Write-Update Protocol&amp;lt;/b&amp;gt;=====&lt;br /&gt;
&amp;lt;dd&amp;gt; The update protocol is advantageous in the case above because it only updates the cache block '''n''' times. However, an update protocol can keep refreshing data that other processors no longer need, so that this data stays in their caches for too long and fewer cache blocks are available for more useful data. This tends to increase conflict and capacity cache misses.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Write-Update]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
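&lt;br /&gt;
The bus-traffic trade-off described above can be made concrete with a small cost model. The following is a minimal sketch under stated assumptions, not code from the cited paper: it assumes per-transaction bus costs Ci (invalidation), Cr (cache block read) and Cu (update), the function names are hypothetical, and the numeric values are invented purely for illustration.&lt;br /&gt;

```python
def write_invalidate_cost(n, ci, cr):
    """Bus cycles for n write/read rounds under WI: n invalidations plus n block reads."""
    return n * (ci + cr)


def write_update_cost(n, cu):
    """Bus cycles for the same n rounds under WU: n update broadcasts."""
    return n * cu


# Hypothetical costs in bus cycles, chosen only for illustration.
CI, CR, CU = 2, 8, 2
rounds = 10
wi_total = write_invalidate_cost(rounds, CI, CR)
wu_total = write_update_cost(rounds, CU)
print(wi_total, wu_total)
```

With these assumed costs, ten write/read rounds cost 100 bus cycles under Write-Invalidate and 20 under Write-Update. The quotient of the two per-round costs, (Ci + Cr) / Cu, is 5 here.&lt;br /&gt;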
&lt;br /&gt;
====Consideration of cache architecture issue====&lt;br /&gt;
Application performance also depends directly on the cache block size. &lt;br /&gt;
Some applications execute more quickly with a large cache block size because they exhibit good spatial locality, whereas others run better with small cache blocks, which avoid migratory data or false sharing between processors. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
A combined strategy such as an '''adaptive hybrid protocol''' can reduce the pathological behaviors of the update and invalidate protocols. Such a protocol should be applicable to a wide range of network characteristics, and it should automatically adjust its behavior to achieve target goals in the face of changes in traffic patterns, node mobility and other network characteristics.&lt;br /&gt;
&lt;br /&gt;
=Adaptive coherence protocols=&lt;br /&gt;
==Hardware architecture==&lt;br /&gt;
Most often, a processor architecture uses the same block size for memory-to-cache transfers, cache-to-memory transfers and coherence. &lt;br /&gt;
The approach described here uses different block sizes for transfer and coherence, based on application requirements.&lt;br /&gt;
A normal cache has the following parameters: capacity ('''C'''), block size ('''L''') and associativity ('''K'''). &lt;br /&gt;
A sector cache additionally divides each cache block into subblocks of size '''b'''. Compared to a normal cache ('''C, L, K'''), a sector cache ('''C, L, K, b''') requires one extra state and some extra bus lines to transmit bitmasks describing the status of the subblocks in a particular block, but it maintains the same number of tag and state bits for '''L'''. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
   &lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
===Subblock protocol===&lt;br /&gt;
This snoop-based protocol combines features of the [http://en.wikipedia.org/wiki/MESI_protocol Illinois MESI protocol] and write policies with subblock validation, using subblocks to gain the advantages of both small and large cache block sizes.  &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dt&amp;gt;&lt;br /&gt;
=====Block states=====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Invalid:&amp;lt;/b&amp;gt; All subblocks are invalid&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Valid Exclusive:&amp;lt;/b&amp;gt; None of the valid subblocks in this block is present in any other cache. All subblocks that are clean shared may be written without a bus transaction. Any subblocks in the block may be invalid. There also may be Dirty blocks which must be written back upon replacement.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Clean Shared:&amp;lt;/b&amp;gt; The block contains subblocks that are either Clean Shared or Invalid. The block can be replaced without a bus transaction.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty Shared:&amp;lt;/b&amp;gt; Subblocks in this block may be in any state. There may be Dirty blocks which must be written back on replacement.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dt&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Finite state diagram of block/ line states is as follows:'''&lt;br /&gt;
  &lt;br /&gt;
&amp;lt;center&amp;gt; [[File:line_protocol.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 2.  Finite state diagram of block''' &amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=====Subblock states=====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Invalid:&amp;lt;/b&amp;gt;  The subblock is invalid&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Clean Shared:&amp;lt;/b&amp;gt; A read access to the block will succeed. Unless the block the subblock is a part of is in the Valid Exclusive state, a write to the subblock will force an invalidation transaction on the bus.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty Shared:&amp;lt;/b&amp;gt; The subblock is treated like a Clean Shared subblock, except that it must be written back on replacement. At most one cache will have a given subblock in either the Dirty Shared or Dirty state.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty:&amp;lt;/b&amp;gt; The subblock is exclusive to this cache. It must be written back on replacement. Read and write accesses to this subblock hit with no bus transaction.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Finite state diagram of subblock is as follows:''' &lt;br /&gt;
&amp;lt;center&amp;gt;[[File:subblock_protocol.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 3.  Finite state diagram of Sub-block''' &amp;lt;/center&amp;gt; &lt;br /&gt;
   &lt;br /&gt;
The basic idea of this protocol is to transfer more data between caches and to access off-chip memory less often.&lt;br /&gt;
In contrast with the Illinois protocol, on a read miss a cache that shares the block sends the requested subblock together with all other clean and dirty shared subblocks in that block. &lt;br /&gt;
If the subblock has to come from main memory, the cache snoopers pass on information about which subblocks are cached, and memory provides the requested subblock along with the subblocks that are not currently cached.&lt;br /&gt;
On writes to clean subblocks and on write misses, the snooper invalidates only the subblock being written, which avoids the penalties associated with false sharing.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;b&amp;gt;In contrast to the Illinois MESI protocol, which requires extra power cycles to maintain the extra states and additional logic, the subblock protocol reduces the number of cache blocks compared to cache update protocols, and thus reduces power consumption. &amp;lt;/b&amp;gt; &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
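&lt;br /&gt;
The subblock invalidation rule described above can be illustrated with a short sketch. It is a simplified model under stated assumptions, not the protocol's actual implementation: a block is represented as a Python list of four subblock states, the state abbreviations and function names are hypothetical, and only the reaction of a remote cache to an invalidation is modelled.&lt;br /&gt;

```python
INVALID, CLEAN_SHARED, DIRTY_SHARED, DIRTY = "I", "CS", "DS", "D"


def snoop_invalidate_subblock(block, index):
    """Remote cache invalidates only the written subblock; the others keep their state."""
    updated = list(block)
    updated[index] = INVALID
    return updated


def snoop_invalidate_block(block):
    """Whole-block invalidation, as in a cache without subblocks, for comparison."""
    return [INVALID for _ in block]


# A remote cache holds all four subblocks of a block in the Clean Shared state.
remote = [CLEAN_SHARED] * 4
after_subblock = snoop_invalidate_subblock(remote, 1)
after_block = snoop_invalidate_block(remote)
valid_left = sum(1 for state in after_subblock if state != INVALID)
print(after_subblock, after_block, valid_left)
```

In this model, a write to subblock 1 by another processor leaves three of the four subblocks valid in the remote cache, whereas whole-block invalidation would discard all four even if the remote processor only uses the other subblocks (the false-sharing case).&lt;br /&gt;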
&lt;br /&gt;
===Read-snarfing protocol===&lt;br /&gt;
This is an enhancement to snoopy-cache coherence protocols that takes advantage of the inherent broadcast nature of the bus.&lt;br /&gt;
&lt;br /&gt;
In contrast with the MESI protocol, when one processor reloads data into its cache after a cache miss, the read-snarfing protocol also supplies the block to the other processors whose copies were invalidated in the past. Only one read is required to restore the block in all caches where it was invalidated.&lt;br /&gt;
&lt;br /&gt;
The protocol modifies the normal update protocol by updating only the subblocks that have been modified; updates need to be broadcast only while the data is actively shared. &lt;br /&gt;
The read-snarfing protocol maintains a counter and an invalidation threshold (Tb) for each cache block “b” to overcome the drawback of WI’s write-after-read problem. The protocol predicts the number of writes that happen to a single cache block before a read request to the same block. An invalidation request is broadcast when the write counter reaches Tb, and the protocol dynamically adjusts the value of Tb based on the behavior of the program. &lt;br /&gt;
&lt;br /&gt;
A simple algorithm for the Read-snarfing Random Walk protocol is as follows.&lt;br /&gt;
Initially, Tb of each cache block b is set to 0. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
// write_run: number of writes to block b before another processor accesses it&lt;br /&gt;
if (write_run &amp;gt; R) {&lt;br /&gt;
    if (Tb &amp;gt; 1) {&lt;br /&gt;
        Tb--;&lt;br /&gt;
    }&lt;br /&gt;
} else {&lt;br /&gt;
    if (R &amp;gt; Tb) {&lt;br /&gt;
        Tb++;&lt;br /&gt;
    }&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
R  = invalidation ratio, (Ci + Cr) / Cu&lt;br /&gt;
Ci = cost in bus cycles of an invalidation transaction&lt;br /&gt;
Cu = cost in bus cycles of an update transaction&lt;br /&gt;
Cr = cost in bus cycles of reading a cache block&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When the threshold drops to 0, the block is invalidated immediately, with no wasted updates.  &lt;br /&gt;
When the block is actively shared, Tb is adjusted upward so that the block is not invalidated. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&amp;lt;dd&amp;gt;&lt;br /&gt;
=====Example 1 =====&lt;br /&gt;
&amp;lt;dd&amp;gt;Suppose the invalidation ratio (R) = 5&lt;br /&gt;
&amp;lt;dd&amp;gt;and the current threshold of the block (Tb) = 3.&lt;br /&gt;
&amp;lt;dd&amp;gt;If the processor writes the block 4 times before another processor accesses it, the write run (4) does not exceed R and R &amp;gt; Tb, so the logic above raises Tb to 4.&lt;br /&gt;
&amp;lt;dd&amp;gt;This means Tb is at the best possible value and only updates are issued.&lt;br /&gt;
&lt;br /&gt;
=====Example 2 =====&lt;br /&gt;
&amp;lt;dd&amp;gt;Consider R = 5 and Tb = 3 for a particular block.&lt;br /&gt;
&amp;lt;dd&amp;gt;If the processor writes the block 10 times before another processor accesses it, the write run exceeds R,&lt;br /&gt;
&amp;lt;dd&amp;gt;so Tb decreases to 2.&lt;br /&gt;
&amp;lt;dd&amp;gt;The protocol can then incur a cost of 2 updates, 1 invalidation and 1 reread.&lt;br /&gt;
&amp;lt;dd&amp;gt;After 2 more writes, Tb will be 0 and invalidation will occur immediately. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Simulation result==&lt;br /&gt;
All simulations used a subblock size ('''b''') of '''8 bytes''' and two-way set-associative caches ('''K = 2'''). To limit simulation time, only relatively large caches ('''C = 128K''') were simulated.&lt;br /&gt;
&lt;br /&gt;
First, results are shown for the two applications ('''Gauss and Cholesky''') that do well with large block sizes. Next come the results for the three applications ('''MP3D, Topopt, and Pverify''') that perform better with smaller block sizes. Finally, results are reported for Barnes, for which the choice of block size is not very important. The simulations compared the protocols on both conventional and sector caches with '''1, 4, 16 and 32 processors''', where the conventional cache uses the Illinois protocol and the sector cache uses the read-snarfing protocol. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Simulation]&amp;lt;/ref&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:image500.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 4.  (a-c) Gauss (128K cache) uses large cache block ''' &amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:Gauss32.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:cholesky_128k.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 5.  (a-c) Cholesky (128 K cache) application uses large cache block'''&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:gauss_32pro.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:Cholesky.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:mp3d.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 6.  (a-c) MP3D (128 K caches) application which uses smaller block sizes.'''&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:results_summary_1.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:results_summary_2.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:results_summary_3.png]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; '''Figure 7.  Final Simulation Results'''&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Conclusion==&lt;br /&gt;
This article described two techniques for improving application performance on bus-based multiprocessors. The first is the subblock cache coherence protocol, which combines a large transfer block with small coherence subblocks to gain the advantages of spatial locality while avoiding false sharing.   &lt;br /&gt;
The second technique is read-snarfing, which reduces the number of read misses caused by earlier cache coherence protocol actions.  &lt;br /&gt;
The simulation results above show that the subblock protocol works better than the MESI protocol for the 64-byte cache block size, but for small block sizes the Illinois protocol works better because the subblock protocol uses a large transfer block.&lt;br /&gt;
In addition, the read-snarfing protocol is effective in reducing the amount of data transferred and the number of bus transactions, which results in lower bus utilization and higher application speedups. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Quiz==&lt;br /&gt;
The following quiz is intended for the reader to benefit from this article and to help achieve a complete understanding of the research. There are a total of 10 multiple-choice questions.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;&lt;br /&gt;
'''1) What are the two commonly used coherence protocols? '''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Token-based &amp;amp; Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Update &amp;amp; Token-based&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Write-Invalidate &amp;amp; Write-Update &lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Both a and c&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''2) A coherency protocol is a protocol which maintains  __________      ____________ according to a specific ___________     ___________'''&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	System order / memory hierarchy&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Memory coherence / consistency model&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Bus hierarchy / system order&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''3) Increased bus bandwidth demand is most commonly seen in which protocol?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Invalidate&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Token-based&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Write-Update &lt;br /&gt;
&amp;lt;dd&amp;gt;e)	Both a and b&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''4) An increase in conflict and capacity cache misses is mostly seen in which protocol?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Token-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Update &lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Write-Invalidate&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	Both a and d&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''5) Due to the good use of  _________  __________ some applications execute more quickly with a large cache block size, whereas others run better when cache block sizes are small by avoiding migratory data or ________  _________ between processors'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Spatial locality / false-sharing&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Equidistant locality / sequential consistency &lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Temporal locality / true sharing &lt;br /&gt;
&amp;lt;dd&amp;gt;d)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''6) In the Subblock Protocol, a subblock can be in one of four states: Invalid, Clean Shared, Dirty and __________. Name the missing state.'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Exclusive&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Valid&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Dirty Shared&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Shared&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''7) In the Subblock Protocol, all subblocks that are Clean Shared may be written without a bus transaction. In which block state is this possible?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Valid Exclusive&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Invalid&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Clean Shared&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Dirty Shared&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''8) Which protocol maintains a counter and invalidation threshold for each cache block to overcome the drawback of Write-Invalidates’ write after read problem?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Snooping-based&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Write-Invalidate&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Subblock protocol &lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Read-snarfing protocol&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	Write-Update &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''9) Based on the simulation in this article, even though it was stated that the Subblock Protocol worked better than the MESI protocol in specific executions, under what circumstances would the MESI Protocol work better than the Subblock Protocol?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	For larger block sizes, since MESI uses large block sizes&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	For effectively supplying the block to other processors whose blocks were invalidated in the past&lt;br /&gt;
&amp;lt;dd&amp;gt;c)	MESI will update the need to broadcast when the data is actively shared&lt;br /&gt;
&amp;lt;dd&amp;gt;d)	For small block size because Subblock Protocol uses large transfer block&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	None of the above&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''10) Why was it stated that the Read-snarfing protocol achieved lower bus utilization and higher application speedups?'''&lt;br /&gt;
&amp;lt;dd&amp;gt;a)	Because instructions on any given processor execute in partial program order, but may not propagate in that order&lt;br /&gt;
&amp;lt;dd&amp;gt;b)	Because it is effective in reducing the amount of data transferred and the number of bus transactions &lt;br /&gt;
&amp;lt;dd&amp;gt;c)	Because Write Update reduces all types of cache misses relative to Write Invalidate, therefore Read-snarfing gains performance &lt;br /&gt;
&amp;lt;dd&amp;gt;d)	Because multiple caches can be in its Shared State simultaneously.&lt;br /&gt;
&amp;lt;dd&amp;gt;e)	None of the above&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
*[http://www.computer.org/portal/web/csdl/doi/10.1109/HPCA.1995.386536 Two techniques for improving performance on bus-based multiprocessors]&lt;br /&gt;
&lt;br /&gt;
*[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Design of an Adaptive Cache Coherence Protocol for Large Scale Multiprocessors]&lt;br /&gt;
&lt;br /&gt;
*[http://www.lfbs.rwth-aachen.de/content/smi Shared Memory Interface]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references&amp;gt;&amp;lt;/references&amp;gt;&lt;/div&gt;</summary>
		<author><name>Sbasu3</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8b_va&amp;diff=59779</id>
		<title>CSC/ECE 506 Spring 2012/8b va</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8b_va&amp;diff=59779"/>
		<updated>2012-03-19T01:15:47Z</updated>

		<summary type="html">&lt;p&gt;Sbasu3: /* Subblock states */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Update and adaptive coherence protocols on real architectures, and power considerations=&lt;br /&gt;
 &lt;br /&gt;
===Introduction===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In a [http://en.wikipedia.org/wiki/Shared_memory shared memory], shared-bus multiprocessor system, the cache coherence protocol plays an important role in propagating changes from one cache to the others. In most cases, however, the traffic generated by update and invalidate coherence protocols is the main source of bus contention. This contention increases the number of busy bus cycles and therefore program execution time, because a processor may stall while its cache is waiting for the bus. To reduce bus contention, this article presents some high-performance adaptive hybrid protocol strategies for multiprocessors.&amp;lt;ref&amp;gt;[http://en.wikipedia.org/wiki/Shared_memory shared memory]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Coherence protocols===&lt;br /&gt;
In general terms, a coherency protocol is a protocol which maintains the consistency between all the caches in a system of [http://en.wikipedia.org/wiki/Shared_memory shared memory]. The protocol maintains memory coherence according to a specific consistency model. Older multiprocessors support the [http://en.wikipedia.org/wiki/Sequential_consistency sequential consistency model], while modern shared memory systems typically support the release consistency or weak consistency models. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;b&amp;gt; Multiple Caches of Shared Resource &amp;lt;/b&amp;gt;&lt;br /&gt;
[[File:Cache_Coherency_Generic.png]]&lt;br /&gt;
&lt;br /&gt;
Transitions between states in any specific implementation of these protocols may vary. For example, an implementation may choose different update and invalidation transitions such as update-on-read, update-on-write, invalidate-on-read, or invalidate-on-write. The choice of transition may affect the amount of inter-cache traffic, which in turn may affect the amount of cache bandwidth available for actual work. This should be taken into consideration in the design of distributed software that could cause strong contention between the caches of multiple processors.&lt;br /&gt;
Various models and protocols have been devised for maintaining cache coherence, such as the MSI protocol, MESI (also known as the Illinois protocol), MOSI, MOESI, MERSI, MESIF, write-once, Synapse, Berkeley, Firefly and the Dragon protocol.&lt;br /&gt;
&lt;br /&gt;
For the purpose of the main study, this article will focus on the two commonly used coherence protocols: &amp;lt;b&amp;gt;Write-Invalidate (WI) and Write-Update (WU)&amp;lt;/b&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=====Write-Invalidate Protocol=====&lt;br /&gt;
In a [http://cs.gmu.edu/cne/modules/dsm/purple/wr_inval.html Write-Invalidate Protocol], a processor first invalidates the copies of the cache block held by all other processors and then updates its own copy without further bus operations.&lt;br /&gt;
&lt;br /&gt;
=====Write-Update Protocol=====&lt;br /&gt;
Alternatively, in a [http://cs.gmu.edu/cne/modules/dsm/purple/wr_update.html Write-Update Protocol], a processor broadcasts each update to shared data to the other caches so that they stay coherent.&lt;br /&gt;
&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=====&amp;lt;dd&amp;gt;   &amp;lt;b&amp;gt;Disadvantages of Write-Invalidate Protocol&amp;lt;/b&amp;gt;=====&lt;br /&gt;
&amp;lt;dd&amp;gt; Any write by a processor invalidates the shared copies of the cache block in other processors, forcing those caches to issue bus requests to reload the new data, which increases bus bandwidth demand. This is worse when one processor frequently updates a block while another processor stalls waiting to read the same block. For a sequence of '''n''' writes by one processor, each followed by a read from another processor, the '''WI''' protocol performs '''n''' invalidations and '''n''' cache block reads.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Write-Invalidate]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=====&amp;lt;dd&amp;gt;   &amp;lt;b&amp;gt;Disadvantages of Write-Update Protocol&amp;lt;/b&amp;gt;=====&lt;br /&gt;
&amp;lt;dd&amp;gt; The update protocol is advantageous in the case described above because it only updates the cache block '''n''' times. However, an update protocol may keep refreshing data in other processors' caches long after those processors have stopped using it, so fewer cache blocks are available for more useful data. This tends to increase conflict and capacity cache misses.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Write-Update]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
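&lt;br /&gt;
The trade-off for the write/read sequence described above can be illustrated with a small Python sketch. It is only an illustrative counting model, not taken from the cited report: the function names and the simple rule (one invalidation plus one block reread per write under WI, one update broadcast per write under WU) are assumptions, and the different bus-cycle costs of the transactions are ignored.&lt;br /&gt;

```python
def wi_bus_transactions(n):
    """Write-Invalidate: each of the n writes by one processor is followed by
    a read from another processor, costing one invalidation and one reread."""
    return {"invalidations": n, "block_reads": n, "updates": 0}


def wu_bus_transactions(n):
    """Write-Update: each of the n writes is broadcast as one update, so the
    reader's copy stays valid and no block reread is needed."""
    return {"invalidations": 0, "block_reads": 0, "updates": n}


def total(transactions):
    # Total number of bus transactions, ignoring their different costs.
    return sum(transactions.values())


if __name__ == "__main__":
    n = 4  # hypothetical length of the write/read sequence
    print("WI:", wi_bus_transactions(n), "total", total(wi_bus_transactions(n)))
    print("WU:", wu_bus_transactions(n), "total", total(wu_bus_transactions(n)))
```

Under these assumptions, WI needs twice as many bus transactions as WU for the same sequence. The model deliberately leaves out the opposite case, in which WU keeps updating copies that are no longer read.&lt;br /&gt;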
&lt;br /&gt;
====Consideration of cache architecture issue====&lt;br /&gt;
Application execution time also depends directly on the cache block size. &lt;br /&gt;
Some applications execute more quickly with a large cache block size because they exhibit good spatial locality, whereas other applications run better with small cache block sizes, which avoid migratory data or false sharing between processors. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
A combined strategy such as an '''adaptive hybrid protocol''' can reduce the pathological behaviors of the update and invalidate protocols. Such a protocol should be applicable to a wide range of network characteristics, and it should automatically adjust its behavior to achieve its target goals in the face of changes in traffic patterns, node mobility and other network characteristics.  &lt;br /&gt;
&lt;br /&gt;
==Hardware architecture==&lt;br /&gt;
Most processor architectures use the same block size for memory-to-cache transfers, cache-to-memory transfers and coherence. &lt;br /&gt;
The approach described here uses different block sizes for transfer and coherence, based on application requirements.&lt;br /&gt;
A normal cache has the following parameters: capacity ('''C'''), block size ('''L''') and associativity ('''K'''). &lt;br /&gt;
A sector cache additionally divides each cache block into subblocks of size '''b'''. Compared to a normal cache ('''C, L, K'''), a sector cache ('''C, L, K, b''') requires one extra state and some extra bus lines to transmit bitmasks describing the status of the subblocks in a particular block, but it maintains the same number of tag and state bits for '''L'''. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
   &lt;br /&gt;
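To make the four sector cache parameters concrete, the following Python sketch derives the basic geometry from '''C''', '''L''', '''K''' and '''b'''. The class and property names are hypothetical, and the assumption that the status bitmask carries one bit per subblock is an illustration rather than something stated in the cited report.&lt;br /&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SectorCache:
    capacity: int        # C, in bytes
    block_size: int      # L, in bytes (transfer block)
    associativity: int   # K, ways per set
    subblock_size: int   # b, in bytes (coherence unit)

    @property
    def num_blocks(self):
        # Total number of cache blocks of size L.
        return self.capacity // self.block_size

    @property
    def num_sets(self):
        # Each set holds K blocks.
        return self.num_blocks // self.associativity

    @property
    def subblocks_per_block(self):
        # Assumed here to equal the bitmask width (one bit per subblock).
        return self.block_size // self.subblock_size


if __name__ == "__main__":
    cache = SectorCache(capacity=128 * 1024, block_size=64,
                        associativity=2, subblock_size=8)
    print(cache.num_blocks, cache.num_sets, cache.subblocks_per_block)
```

With a 128K cache, 64-byte blocks, two-way associativity and 8-byte subblocks, the sketch yields 2048 blocks in 1024 sets with 8 subblocks per block.&lt;br /&gt;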
&amp;lt;dl&amp;gt;&lt;br /&gt;
===Subblock protocol===&lt;br /&gt;
This snoopy protocol combines features of the [http://en.wikipedia.org/wiki/MESI_protocol Illinois MESI protocol] and write policies with subblock validation, using subblocks to obtain the advantages of both small and large cache block sizes.  &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dt&amp;gt;&lt;br /&gt;
=====Block states=====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Invalid:&amp;lt;/b&amp;gt; All subblocks are invalid&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Valid Exclusive:&amp;lt;/b&amp;gt; None of the valid subblocks in this block is present in any other cache. All subblocks that are Clean Shared may be written without a bus transaction. Any subblock in the block may be Invalid. There may also be Dirty subblocks, which must be written back upon replacement.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Clean Shared:&amp;lt;/b&amp;gt; The block contains subblocks that are either Clean Shared or Invalid. The block can be replaced without a bus transaction.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty Shared:&amp;lt;/b&amp;gt; Subblocks in this block may be in any state. There may be Dirty subblocks, which must be written back on replacement.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dt&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Finite state diagram of block/ line states is as follows:'''&lt;br /&gt;
[[File:line_protocol.png]]&lt;br /&gt;
&lt;br /&gt;
=====Subblock states=====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Invalid:&amp;lt;/b&amp;gt;  The subblock is invalid&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Clean Shared:&amp;lt;/b&amp;gt; A read access to the subblock will succeed. Unless the block that contains the subblock is in the Valid Exclusive state, a write to the subblock will force an invalidation transaction on the bus.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty Shared:&amp;lt;/b&amp;gt; The subblock is treated like a Clean Shared subblock, except that it must be written back on replacement. At most one cache will have a given subblock in either the Dirty Shared or Dirty state.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty:&amp;lt;/b&amp;gt; The subblock is exclusive to this cache. It must be written back on replacement. Read and write accesses to this subblock hit with no bus transaction.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Finite state diagram of subblock is as follows:''' &lt;br /&gt;
[[File:subblock_protocol.png]]&lt;br /&gt;
   &lt;br /&gt;
The basic idea of this protocol is to perform more data transfers between caches and fewer off-chip memory accesses.&lt;br /&gt;
In contrast with the Illinois protocol, on a read miss the cache that holds the requested subblock supplies it together with all other Clean Shared and Dirty Shared subblocks in that block. &lt;br /&gt;
If the requested subblock is in main memory, the snooping caches report which subblocks they hold, and memory provides the requested subblock along with the subblocks that are not currently cached.&lt;br /&gt;
On a write to a clean subblock or on a write miss, the snooper invalidates only the subblock being written, which avoids the penalties associated with false sharing.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;b&amp;gt;Compared with the Illinois protocol, this protocol spends extra power on maintaining the extra state and the additional logic. Compared with cache update protocols, however, it reduces the number of cache block refreshes, which reduces power consumption. &amp;lt;/b&amp;gt; &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
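&lt;br /&gt;
The block and subblock states listed above can be summarised in a short Python sketch that decides whether a processor write has to use the bus. The enum and function names are hypothetical, and treating a write to an Invalid subblock as a bus transaction (a write miss) is an assumption made for illustration.&lt;br /&gt;

```python
from enum import Enum


class BlockState(Enum):
    INVALID = "Invalid"
    VALID_EXCLUSIVE = "Valid Exclusive"
    CLEAN_SHARED = "Clean Shared"
    DIRTY_SHARED = "Dirty Shared"


class SubblockState(Enum):
    INVALID = "Invalid"
    CLEAN_SHARED = "Clean Shared"
    DIRTY_SHARED = "Dirty Shared"
    DIRTY = "Dirty"


def write_needs_bus_transaction(block, subblock):
    """Return True if a processor write to this subblock must go to the bus."""
    if subblock is SubblockState.DIRTY:
        # Exclusive to this cache: reads and writes hit locally.
        return False
    if subblock is SubblockState.INVALID:
        # Write miss (assumption): the subblock has to be obtained first.
        return True
    # Clean Shared, or Dirty Shared (treated like Clean Shared for writes):
    # other copies must be invalidated unless the block is Valid Exclusive.
    return block is not BlockState.VALID_EXCLUSIVE


def must_write_back_on_replacement(subblock):
    # Only the two dirty subblock states hold data newer than memory.
    return subblock in (SubblockState.DIRTY, SubblockState.DIRTY_SHARED)
```

For example, a write to a Clean Shared subblock is local when its block is Valid Exclusive, but needs an invalidation transaction when the block is Clean Shared.&lt;br /&gt;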
&lt;br /&gt;
===Read-snarfing protocol===&lt;br /&gt;
This is an enhancement to snoopy-cache coherence protocols that takes advantage of the inherent broadcast nature of the bus.&lt;br /&gt;
&lt;br /&gt;
In contrast with the MESI protocol, when one processor reloads data into its cache because of a cache miss, the read-snarfing protocol also supplies the block to the other processors whose copies were invalidated in the past. Only one read is required to restore the block in all caches in which it was invalidated.&lt;br /&gt;
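&lt;br /&gt;
The snarfing behaviour can be sketched in Python as follows. This is a deliberately simplified model under stated assumptions: caches are plain dictionaries mapping an address to a (state, value) pair, the data is always supplied by memory, and the function name and data layout are hypothetical rather than taken from the cited report.&lt;br /&gt;

```python
def bus_read(caches, requester, address, memory):
    """Perform one bus read under read-snarfing.

    caches    -- dict: cache id to dict of address to (state, value)
    requester -- id of the cache that missed
    Returns the ids of the other caches that snarfed the data.
    """
    value = memory[address]
    caches[requester][address] = ("valid", value)
    snarfers = []
    for cache_id, cache in caches.items():
        if cache_id == requester:
            continue
        entry = cache.get(address)
        # Only caches that still hold the tag in the invalid state snarf.
        if entry is not None and entry[0] == "invalid":
            cache[address] = ("valid", value)
            snarfers.append(cache_id)
    return snarfers


if __name__ == "__main__":
    memory = {0x40: 7}
    caches = {
        0: {},
        1: {0x40: ("invalid", None)},
        2: {0x40: ("invalid", None)},
        3: {},
    }
    print(bus_read(caches, 0, 0x40, memory))
```

In the usage example, one read by cache 0 also revalidates the copies in caches 1 and 2, while cache 3, which never held the block, is left untouched.&lt;br /&gt;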
&lt;br /&gt;
The protocol modifies the normal update protocol by updating only those subblocks that were modified, and updates need to be broadcast only while the data is actively shared. &lt;br /&gt;
The read-snarfing protocol maintains a counter and an invalidation threshold (Tb) for each cache block “b” to overcome the drawback of WI’s write after read problem. The protocol predicts the number of write operations that happen on a single cache block before a read request to the same cache block. An invalidation request is broadcast when the write counter reaches Tb, and the protocol dynamically adjusts the value of Tb based on the behavior of the executing program. &lt;br /&gt;
&lt;br /&gt;
A simple algorithm for the Read-snarfing Random Walk protocol is as follows.&lt;br /&gt;
Initially, Tb of each cache block b is set to 0. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
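// Executed at the end of each write run on cache block b.&lt;br /&gt;
// write_run = number of writes by one processor before the block is accessed by another processor&lt;br /&gt;
if (write_run &amp;gt; R) {&lt;br /&gt;
    // Long write run: the updates were wasted, so invalidate sooner.&lt;br /&gt;
    // The bound is 0 (not 1) so that Tb can reach 0, as the surrounding text describes.&lt;br /&gt;
    if (Tb &amp;gt; 0) {&lt;br /&gt;
        Tb--;&lt;br /&gt;
    }&lt;br /&gt;
} else {&lt;br /&gt;
    // Short write run: the block is actively shared, so keep updating longer.&lt;br /&gt;
    if (R &amp;gt; Tb) {&lt;br /&gt;
        Tb++;&lt;br /&gt;
    }&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
R  = invalidation ratio, (Ci + Cr) / Cu&lt;br /&gt;
Ci = the cost in bus cycles of an invalidation transaction&lt;br /&gt;
Cu = the cost in bus cycles of an update transaction&lt;br /&gt;
Cr = the cost in bus cycles of reading a cache block&lt;br /&gt;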
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When the threshold drops to 0, the block is invalidated immediately, with no wasted updates.  &lt;br /&gt;
When the block is actively shared, Tb is adjusted upward so that the block is not invalidated. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&amp;lt;dd&amp;gt;&lt;br /&gt;
=====Example 1 =====&lt;br /&gt;
&amp;lt;dd&amp;gt;Suppose the invalidation ratio (R) = 5.&lt;br /&gt;
&amp;lt;dd&amp;gt;The current threshold of the block (Tb) = 3.&lt;br /&gt;
&amp;lt;dd&amp;gt;If the processor writes 4 times before the block is accessed by another processor, then according to the above logic Tb will become 4.&lt;br /&gt;
&amp;lt;dd&amp;gt;This means Tb is at the best possible value and only updates are issued.&lt;br /&gt;
&lt;br /&gt;
=====Example 2 =====&lt;br /&gt;
&amp;lt;dd&amp;gt;Consider R = 5 and Tb = 3 for a particular block.&lt;br /&gt;
&amp;lt;dd&amp;gt;If the processor writes 10 times before the block is accessed by another processor,&lt;br /&gt;
&amp;lt;dd&amp;gt;Tb will become 2 (decreased).&lt;br /&gt;
&amp;lt;dd&amp;gt;The protocol then incurs a cost of 2 updates, 1 invalidation and 1 reread.&lt;br /&gt;
&amp;lt;dd&amp;gt;After 2 more such long write runs, Tb will be 0 and invalidation will occur immediately. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
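&lt;br /&gt;
The two examples above can be checked with a small runnable Python version of the threshold adjustment. This is a sketch under stated assumptions: the function names are hypothetical, and the lower bound of 0 is chosen so that Tb can reach 0, as the text describes.&lt;br /&gt;

```python
def invalidation_ratio(ci, cr, cu):
    # R = (Ci + Cr) / Cu, with all costs given in bus cycles.
    return (ci + cr) / cu


def adjust_threshold(tb, write_run, r):
    """Return the new invalidation threshold Tb after one write run.

    tb        -- current threshold of the block
    write_run -- writes by one processor before another processor
                 accessed the block
    r         -- invalidation ratio R
    """
    if write_run > r:
        # Long write run: invalidate sooner, but never go below 0 (assumed bound).
        return tb - 1 if tb > 0 else tb
    # Short write run: keep updating longer, but never exceed R.
    return tb + 1 if r > tb else tb


if __name__ == "__main__":
    print(adjust_threshold(3, 4, 5))    # Example 1: 4 writes, R = 5, Tb = 3
    print(adjust_threshold(3, 10, 5))   # Example 2: 10 writes, R = 5, Tb = 3
    tb = 3
    for _ in range(3):                  # three long write runs in a row
        tb = adjust_threshold(tb, 10, 5)
    print(tb)
```

Example 1 yields Tb = 4 and Example 2 yields Tb = 2. Three consecutive long write runs starting from Tb = 3 bring the threshold down to 0, at which point the block would be invalidated on the first write.&lt;br /&gt;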
&lt;br /&gt;
==Simulation result==&lt;br /&gt;
For all simulation results, the subblock size ('''b''') was '''8 bytes''' and the caches were two-way set-associative ('''K = 2'''). In order to limit simulation time, only relatively large caches ('''C = 128K''') were simulated.&lt;br /&gt;
&lt;br /&gt;
First, below are the results for the two applications ('''Gauss and Cholesky''') that do well with large block sizes. Next come the results for the three applications ('''MP3D, Topopt, and Pverify''') that perform better with smaller block sizes. Finally, results are reported for Barnes, for which the choice of block size is not very important. The simulation compared the protocols on both conventional and sector caches with '''1, 4, 16 and 32 processors''', where the conventional cache uses the Illinois protocol and the sector cache uses the read-snarfing protocol. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Simulation]&amp;lt;/ref&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:image500.png]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Fig: (a-c) Gauss (128K cache) uses large cache block &lt;br /&gt;
&lt;br /&gt;
Fig: [[File:Gauss32.png]]&lt;br /&gt;
&lt;br /&gt;
[[File:cholesky_128k.png]]&lt;br /&gt;
&lt;br /&gt;
Fig: (a-c) Cholesky (128 K cache) application uses large cache block&lt;br /&gt;
&lt;br /&gt;
[[File:gauss_32pro.png]]&lt;br /&gt;
&lt;br /&gt;
Fig: [[File:Cholesky.png]]&lt;br /&gt;
&lt;br /&gt;
[[File:mp3d.png]]&lt;br /&gt;
&lt;br /&gt;
Fig: (a-c) MP3D (128 K caches) application which uses smaller block sizes.&lt;br /&gt;
&lt;br /&gt;
[[File:results_summary_1.png]]&lt;br /&gt;
&lt;br /&gt;
[[File:results_summary_2.png]]&lt;br /&gt;
&lt;br /&gt;
[[File:results_summary_3.png]]&lt;br /&gt;
&lt;br /&gt;
==Conclusion==&lt;br /&gt;
This article described two techniques for improving application performance on bus-based multiprocessors. The first is the subblock cache coherence protocol, which combines a large transfer block with small coherence subblocks in order to exploit spatial locality while avoiding false sharing.   &lt;br /&gt;
The second is read-snarfing, which reduces the number of read misses caused by earlier cache coherence actions.  &lt;br /&gt;
The simulation results above show that the subblock protocol works better than the Illinois (MESI) protocol at the 64-byte cache block size, whereas for small block sizes the Illinois protocol works better because the subblock protocol uses a large transfer block.&lt;br /&gt;
In addition, the read-snarfing protocol is effective in reducing the amount of data transferred and the number of bus transactions, which results in lower bus utilization and higher application speedups. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
*[http://www.computer.org/portal/web/csdl/doi/10.1109/HPCA.1995.386536 Two techniques for improving performance on bus-based multiprocessors]&lt;br /&gt;
&lt;br /&gt;
*[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Design of an Adaptive Cache Coherence Protocol for Large Scale Multiprocessors]&lt;br /&gt;
&lt;br /&gt;
*[http://www.lfbs.rwth-aachen.de/content/smi Shared Memory Interface]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references&amp;gt;&amp;lt;/references&amp;gt;&lt;/div&gt;</summary>
		<author><name>Sbasu3</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8b_va&amp;diff=59776</id>
		<title>CSC/ECE 506 Spring 2012/8b va</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8b_va&amp;diff=59776"/>
		<updated>2012-03-19T01:13:05Z</updated>

		<summary type="html">&lt;p&gt;Sbasu3: /* Subblock states */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Update and adaptive coherence protocols on real architectures, and power considerations=&lt;br /&gt;
 &lt;br /&gt;
===Introduction===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In a [http://en.wikipedia.org/wiki/Shared_memory shared memory], shared-bus multiprocessor system, the cache coherence protocol plays an important role in propagating changes from one cache to the others. In most cases, however, the traffic generated by update and invalidate coherence protocols is the main source of bus contention. This contention increases the number of busy bus cycles and therefore program execution time, because a processor may stall while its cache is waiting for the bus. To reduce bus contention, this article presents some high-performance adaptive hybrid protocol strategies for multiprocessors.&amp;lt;ref&amp;gt;[http://en.wikipedia.org/wiki/Shared_memory shared memory]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Coherence protocols===&lt;br /&gt;
In general terms, a coherency protocol is a protocol which maintains the consistency between all the caches in a system of [http://en.wikipedia.org/wiki/Shared_memory shared memory]. The protocol maintains memory coherence according to a specific consistency model. Older multiprocessors support the [http://en.wikipedia.org/wiki/Sequential_consistency sequential consistency model], while modern shared memory systems typically support the release consistency or weak consistency models. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;b&amp;gt; Multiple Caches of Shared Resource &amp;lt;/b&amp;gt;&lt;br /&gt;
[[File:Cache_Coherency_Generic.png]]&lt;br /&gt;
&lt;br /&gt;
Transitions between states in any specific implementation of these protocols may vary. For example, an implementation may choose different update and invalidation transitions such as update-on-read, update-on-write, invalidate-on-read, or invalidate-on-write. The choice of transition may affect the amount of inter-cache traffic, which in turn may affect the amount of cache bandwidth available for actual work. This should be taken into consideration in the design of distributed software that could cause strong contention between the caches of multiple processors.&lt;br /&gt;
Various models and protocols have been devised for maintaining cache coherence, such as the MSI protocol, MESI (also known as the Illinois protocol), MOSI, MOESI, MERSI, MESIF, write-once, Synapse, Berkeley, Firefly and the Dragon protocol.&lt;br /&gt;
&lt;br /&gt;
For the purpose of the main study, this article will focus on the two commonly used coherence protocols: &amp;lt;b&amp;gt;Write-Invalidate (WI) and Write-Update (WU)&amp;lt;/b&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=====Write-Invalidate Protocol=====&lt;br /&gt;
In a [http://cs.gmu.edu/cne/modules/dsm/purple/wr_inval.html Write-Invalidate Protocol], a processor first invalidates the copies of the cache block held by all other processors and then updates its own copy without further bus operations.&lt;br /&gt;
&lt;br /&gt;
=====Write-Update Protocol=====&lt;br /&gt;
Alternatively, in a [http://cs.gmu.edu/cne/modules/dsm/purple/wr_update.html Write-Update Protocol], a processor broadcasts each update to shared data to the other caches so that they stay coherent.&lt;br /&gt;
&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=====&amp;lt;dd&amp;gt;   &amp;lt;b&amp;gt;Disadvantages of Write-Invalidate Protocol&amp;lt;/b&amp;gt;=====&lt;br /&gt;
&amp;lt;dd&amp;gt; Any write by one processor invalidates the copies of the shared cache block held by other processors, forcing those caches to issue bus requests to reload the new data, which increases bus bandwidth demand. The effect is worse when one processor updates a block frequently while another processor stalls waiting to read the same block. For a sequence of '''n''' writes by one processor, each followed by a read from another processor, the '''WI''' protocol performs '''n''' invalidations and '''n''' cache block reads.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Write-Invalidate]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=====&amp;lt;dd&amp;gt;   &amp;lt;b&amp;gt;Disadvantages of Write-Update Protocol&amp;lt;/b&amp;gt;=====&lt;br /&gt;
&amp;lt;dd&amp;gt; The update protocol is advantageous in the case above because it only updates the cached copies '''n''' times. However, an update protocol can keep refreshing data in other processors' caches long after those processors have stopped using it, so fewer cache blocks are available for more useful data. This tends to increase conflict and capacity misses.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Write-Update]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
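The transaction counts discussed above can be reproduced with a small Python sketch. It is only an illustration under simplifying assumptions: there are two caches, both start with a valid shared copy, and the function name simulate and the counter names are hypothetical rather than taken from the cited paper.&lt;br /&gt;

```python
def simulate(n, protocol):
    """Simulate n rounds of: processor P0 writes a shared block, then P1 reads it.

    Returns the number of bus transactions of each kind.
    Assumption: both caches start with a valid shared copy of the block.
    """
    counts = {"invalidate": 0, "update": 0, "block_read": 0}
    p1_valid = True
    for _ in range(n):
        # P0 writes the block.
        if protocol == "WI":
            if p1_valid:
                counts["invalidate"] += 1  # P1's copy is invalidated
                p1_valid = False
        elif protocol == "WU":
            counts["update"] += 1          # new value is broadcast to P1
        else:
            raise ValueError("protocol must be 'WI' or 'WU'")
        # P1 reads the block.
        if not p1_valid:
            counts["block_read"] += 1      # read miss: reload the whole block
            p1_valid = True
    return counts


if __name__ == "__main__":
    print(simulate(4, "WI"))
    print(simulate(4, "WU"))
```

With n = 4, the WI run reports 4 invalidations and 4 block reads, while the WU run reports 4 updates and no block reads.&lt;br /&gt;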
&lt;br /&gt;
====Consideration of cache architecture issue====&lt;br /&gt;
Application execution time also depends directly on the cache block size. &lt;br /&gt;
Some applications execute more quickly with a large cache block size because they exhibit good spatial locality, whereas others run better with small cache blocks, which limit the cost of migratory data and false sharing between processors. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
A combined strategy such as an '''adaptive hybrid protocol''' can reduce the pathological behaviors of pure update and pure invalidate protocols. Such a protocol should be applicable to a wide range of network characteristics and should automatically adjust its behavior to achieve its target goals in the face of changes in traffic patterns, node mobility and other network characteristics.  &lt;br /&gt;
&lt;br /&gt;
==Hardware architecture==&lt;br /&gt;
Most often, a processor architecture uses the same block size for memory-to-cache transfers, cache-to-memory transfers, and coherence. &lt;br /&gt;
The approach described here uses different block sizes for transfer and coherence, based on application requirements.&lt;br /&gt;
A normal cache has the following parameters: capacity ('''C'''), block size ('''L''') and associativity ('''K'''). &lt;br /&gt;
A sector cache additionally divides each cache block into subblocks of size '''b'''. Compared to a normal cache ('''C, L, K'''), a sector cache ('''C, L, K, b''') requires one extra state and some extra bus lines to transmit bitmasks describing the status of the subblocks in a particular block, but it maintains the same number of tag and state bits for '''L'''. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
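To make the four parameters concrete, the following Python sketch derives the basic geometry of a sector cache from C, L, K and b. The function name sector_cache_geometry is hypothetical, and the sketch assumes one status entry per subblock, so that a per-block bitmask has one bit for each subblock.&lt;br /&gt;

```python
def sector_cache_geometry(capacity, block_size, associativity, subblock_size):
    """Derive basic geometry for a sector cache (C, L, K, b); all sizes in bytes."""
    if block_size % subblock_size != 0:
        raise ValueError("block size L must be a multiple of subblock size b")
    if capacity % (block_size * associativity) != 0:
        raise ValueError("capacity C must be a multiple of L * K")
    blocks = capacity // block_size
    return {
        "blocks": blocks,
        "sets": blocks // associativity,
        # Assumption: one bitmask bit per subblock in a block.
        "subblocks_per_block": block_size // subblock_size,
    }


if __name__ == "__main__":
    # C = 128K, L = 64 bytes, K = 2, b = 8 bytes
    print(sector_cache_geometry(128 * 1024, 64, 2, 8))
```

For a 128K two-way cache with 64-byte blocks and 8-byte subblocks, this gives 2048 blocks in 1024 sets, with 8 subblocks per block.&lt;br /&gt;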
   &lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
===Subblock protocol===&lt;br /&gt;
This snoopy protocol combines features of the [http://en.wikipedia.org/wiki/MESI_protocol Illinois MESI protocol] and of write policies with subblock validation; by using subblocks, it obtains the advantages of both small and large cache block sizes.  &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dt&amp;gt;&lt;br /&gt;
=====Block states=====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Invalid:&amp;lt;/b&amp;gt; All subblocks are invalid&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Valid Exclusive:&amp;lt;/b&amp;gt; All valid subblocks in this block are not present in any other caches. All subblocks that are Clean Shared may be written without a bus transaction. Any subblock in the block may be invalid. There may also be Dirty subblocks, which must be written back upon replacement.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Clean Shared:&amp;lt;/b&amp;gt; The block contains subblocks that are either Clean Shared or Invalid. The block can be replaced without a bus transaction.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty Shared:&amp;lt;/b&amp;gt; Subblocks in this block may be in any state. There may be Dirty subblocks, which must be written back on replacement.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dt&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''The finite state diagram for the block (line) states is as follows:'''&lt;br /&gt;
[[File:line_protocol.png]]&lt;br /&gt;
&lt;br /&gt;
=====Subblock states=====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Invalid:&amp;lt;/b&amp;gt;  The subblock is invalid&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Clean Shared:&amp;lt;/b&amp;gt; A read access to the subblock will succeed. Unless the block that contains the subblock is in the Valid Exclusive state, a write to the subblock will force an invalidation transaction on the bus.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty Shared:&amp;lt;/b&amp;gt; The subblock is treated like a Clean Shared subblock, except that it must be written back on replacement. At most one cache will have a given subblock in either the Dirty Shared or Dirty state.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty:&amp;lt;/b&amp;gt; The subblock is exclusive to this cache. It must be written back on replacement. Read and write accesses to this subblock hit with no bus transaction.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''The finite state diagram for the subblock states is as follows:''' &lt;br /&gt;
[[File:subblock_protocol.png]]&lt;br /&gt;
   &lt;br /&gt;
The basic idea of this protocol is to transfer more data between caches and to access off-chip memory less often.&lt;br /&gt;
In contrast with the Illinois protocol, on a read miss the cache that holds the requested subblock supplies it together with all other clean and dirty shared subblocks in that block. &lt;br /&gt;
If the requested subblock is only in main memory, the snooping caches report which subblocks they hold, and memory provides the requested subblock along with the subblocks that are not currently cached.&lt;br /&gt;
On a write to a clean subblock or on a write miss, the snooper invalidates only the subblock being written, which avoids the penalties associated with false sharing.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;b&amp;gt;In contrast with the Illinois protocol, this protocol requires extra power to maintain the extra state and the additional logic. On the other hand, it refreshes cache blocks less often than a cache update protocol does, which tends to reduce power consumption.&amp;lt;/b&amp;gt; &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
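The effect of invalidating at subblock granularity, as described above for writes, can be illustrated with a short Python sketch. It is a simplified model rather than the protocol itself: the block is assumed to hold four subblocks, the remote cache is assumed to start with every subblock valid, and the names SUBBLOCKS and remote_valid_after_write are hypothetical.&lt;br /&gt;

```python
SUBBLOCKS = 4  # assumed number of subblocks per block, for illustration only


def remote_valid_after_write(written, subblock_granularity):
    """Return the valid flags of a remote cache's copy after a local write.

    written              -- index of the subblock written by the local processor
    subblock_granularity -- True for subblock invalidation, False for whole-block
    Assumption: the remote cache starts with every subblock valid.
    """
    valid = [True] * SUBBLOCKS
    if subblock_granularity:
        valid[written] = False        # only the written subblock is invalidated
    else:
        valid = [False] * SUBBLOCKS   # the whole block is invalidated
    return valid


if __name__ == "__main__":
    print(remote_valid_after_write(1, True))
    print(remote_valid_after_write(1, False))
```

With subblock invalidation, a remote processor that only uses subblocks 0, 2 or 3 still hits after the write, which is the false-sharing case the protocol is meant to help.&lt;br /&gt;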
&lt;br /&gt;
===Read-snarfing protocol===&lt;br /&gt;
This is an enhancement to snoopy-cache coherence protocols that takes advantage of the inherent broadcast nature of the bus.&lt;br /&gt;
&lt;br /&gt;
In contrast with the MESI protocol, when one processor reloads a block into its cache after a cache miss, the read-snarfing protocol also supplies the block to the other processors whose copies were invalidated in the past. Only one read is required to restore the block in all caches where it was invalidated.&lt;br /&gt;
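A minimal Python sketch of this behavior is given below. It assumes that every cache in the list still holds the tag of the block in the Invalid state, so that each of them can pick the data up from the bus; the function name read_miss is hypothetical.&lt;br /&gt;

```python
def read_miss(valid, reader, snarfing):
    """Handle one read miss by cache number `reader`; return the new valid flags.

    valid    -- list of booleans, one per cache, True if the cache holds the block
    snarfing -- True if the other caches may take the block off the bus as well
    Assumption: every invalid cache still has the block's tag and wants it back.
    """
    new_valid = list(valid)
    new_valid[reader] = True
    if snarfing:
        # All snooping caches with an invalidated copy capture the same data reply.
        new_valid = [True] * len(valid)
    return new_valid


if __name__ == "__main__":
    print(read_miss([False, False, False], 0, True))
    print(read_miss([False, False, False], 0, False))
```

With snarfing, a single bus read makes the block valid in all three caches; without it, the two other caches would each need their own read later.&lt;br /&gt;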
&lt;br /&gt;
The protocol modifies the normal update protocol by updating only those subblocks that have been modified, and updates need to be broadcast only while the data is actively shared. &lt;br /&gt;
The read-snarfing protocol maintains a write counter and an invalidation threshold (Tb) for each cache block '''b''' to overcome the write-after-read drawback of WI. The protocol predicts the number of writes that will occur on a cache block before a read request to the same block. An invalidation request is broadcast when the write counter reaches Tb, and the protocol dynamically adjusts the value of Tb based on the behavior of the program. &lt;br /&gt;
&lt;br /&gt;
A simple version of the Read-snarfing Random Walk algorithm is as follows.&lt;br /&gt;
Initially, Tb of each cache block b is set to 0. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
// write_run: number of writes by one processor before another processor accesses the block&lt;br /&gt;
if (write_run &amp;gt; R) {&lt;br /&gt;
    if (Tb &amp;gt; 0) {&lt;br /&gt;
        Tb--;    // long write run: invalidate sooner&lt;br /&gt;
    }&lt;br /&gt;
} else {&lt;br /&gt;
    if (R &amp;gt; Tb) {&lt;br /&gt;
        Tb++;    // short write run: keep updating longer&lt;br /&gt;
    }&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
R = Invalidation Ratio which is (Ci + Cr) / Cu&lt;br /&gt;
Ci:  The cost in bus cycles of an invalidation transaction&lt;br /&gt;
Cu: The cost in bus cycles of an update transaction&lt;br /&gt;
Cr:  The cost in bus cycles of reading a cache block&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When the threshold falls to 0, the block is invalidated immediately, with no wasted updates.  &lt;br /&gt;
When the block is actively shared, Tb is adjusted upward so that the block is not invalidated. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&amp;lt;dd&amp;gt;&lt;br /&gt;
=====Example 1 =====&lt;br /&gt;
&amp;lt;dd&amp;gt;Suppose the invalidation ratio (R) is 5.&lt;br /&gt;
&amp;lt;dd&amp;gt;The current threshold of the block (Tb) is 3.&lt;br /&gt;
&amp;lt;dd&amp;gt;If the processor writes 4 times before the block is accessed by another processor, then according to the logic above Tb becomes 4.&lt;br /&gt;
&amp;lt;dd&amp;gt;This means Tb is at the best possible value and only updates need to be issued.&lt;br /&gt;
&lt;br /&gt;
=====Example 2 =====&lt;br /&gt;
&amp;lt;dd&amp;gt;Consider R = 5 and Tb = 3 for a particular block.&lt;br /&gt;
&amp;lt;dd&amp;gt;If the processor writes 10 times before the block is accessed by another processor,&lt;br /&gt;
&amp;lt;dd&amp;gt;Tb is decreased to 2.&lt;br /&gt;
&amp;lt;dd&amp;gt;The protocol then incurs a cost of 2 updates, 1 invalidation and 1 reread.&lt;br /&gt;
&amp;lt;dd&amp;gt;After 2 more such write runs, Tb will be 0 and invalidation will occur immediately. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
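The two examples can be checked with a small Python sketch of the threshold adjustment. This is an illustrative reading of the algorithm, not the reference implementation: the function name adjust_threshold is hypothetical, and the sketch assumes that Tb is allowed to fall to 0, as the text around the algorithm describes.&lt;br /&gt;

```python
def adjust_threshold(tb, write_run, r):
    """Return the new invalidation threshold Tb after one observed write run.

    tb        -- current threshold of the block
    write_run -- writes by one processor before another processor accessed the block
    r         -- invalidation ratio R = (Ci + Cr) / Cu
    """
    if write_run > r:
        # Long write run: updates were wasted, so invalidate sooner next time.
        # Assumption: Tb may fall to 0 but not below it.
        return tb - 1 if tb > 0 else tb
    # Short write run: the block is actively shared, so allow more updates.
    return tb + 1 if r > tb else tb


if __name__ == "__main__":
    print(adjust_threshold(3, 4, 5))   # Example 1: R = 5, Tb = 3, write run of 4
    print(adjust_threshold(3, 10, 5))  # Example 2: R = 5, Tb = 3, write run of 10
```

Under these assumptions the first call returns 4 and the second returns 2, matching Example 1 and Example 2.&lt;br /&gt;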
&lt;br /&gt;
==Simulation result==&lt;br /&gt;
For all simulated results, the subblock size ('''b''') was '''8 bytes''' and the caches were two-way set-associative ('''K = 2'''). In order to minimize simulation time, only relatively large caches ('''C = 128K''') were simulated.&lt;br /&gt;
&lt;br /&gt;
First, results are shown for the two applications ('''Gauss and Cholesky''') that do well with large block sizes. Next come the results for the three applications ('''MP3D, Topopt, and Pverify''') that perform better with smaller block sizes. Finally, results are reported for Barnes, for which the choice of block size is not very important. The simulations compared conventional and sector caches with '''1, 4, 16 and 32 processors''', where the conventional cache uses the Illinois protocol and the sector cache uses the read-snarfing protocol. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Simulation]&amp;lt;/ref&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:image500.png]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Fig: (a-c) Gauss (128K cache) uses large cache block &lt;br /&gt;
&lt;br /&gt;
Fig: [[File:Gauss32.png]]&lt;br /&gt;
&lt;br /&gt;
[[File:cholesky_128k.png]]&lt;br /&gt;
&lt;br /&gt;
Fig: (a-c) Cholesky (128 K cache) application uses large cache block&lt;br /&gt;
&lt;br /&gt;
[[File:gauss_32pro.png]]&lt;br /&gt;
&lt;br /&gt;
Fig: [[File:Cholesky.png]]&lt;br /&gt;
&lt;br /&gt;
[[File:mp3d.png]]&lt;br /&gt;
&lt;br /&gt;
Fig: (a-c) MP3D (128K caches), an application that uses smaller block sizes.&lt;br /&gt;
&lt;br /&gt;
[[File:results_summary_1.png]]&lt;br /&gt;
&lt;br /&gt;
[[File:results_summary_2.png]]&lt;br /&gt;
&lt;br /&gt;
[[File:results_summary_3.png]]&lt;br /&gt;
&lt;br /&gt;
==Conclusion==&lt;br /&gt;
This article described two techniques for improving application performance on bus-based multiprocessors. The first technique is the subblock cache coherence protocol, which combines a large transfer block with small coherence subblocks to gain the advantages of spatial locality while avoiding false sharing.   &lt;br /&gt;
The second technique is read snarfing, which reduces the number of read misses caused by earlier coherence protocol actions.  &lt;br /&gt;
The simulation results above show that the subblock protocol works better than the MESI protocol for a 64-byte cache block size, but for small block sizes the Illinois protocol works better because the subblock protocol uses a large transfer block.&lt;br /&gt;
In addition, the read-snarfing protocol is effective in reducing the amount of data transferred and the number of bus transactions, which results in lower bus utilization and higher application speedups. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
*[http://www.computer.org/portal/web/csdl/doi/10.1109/HPCA.1995.386536 Two techniques for improving performance on bus-based multiprocessors]&lt;br /&gt;
&lt;br /&gt;
*[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Design of an Adaptive Cache Coherence Protocol for Large Scale Multiprocessors]&lt;br /&gt;
&lt;br /&gt;
*[http://www.lfbs.rwth-aachen.de/content/smi Shared Memory Interface]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references&amp;gt;&amp;lt;/references&amp;gt;&lt;/div&gt;</summary>
		<author><name>Sbasu3</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8b_va&amp;diff=59774</id>
		<title>CSC/ECE 506 Spring 2012/8b va</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8b_va&amp;diff=59774"/>
		<updated>2012-03-19T01:05:48Z</updated>

		<summary type="html">&lt;p&gt;Sbasu3: /* Subblock states */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Update and adaptive coherence protocols on real architectures, and power considerations=&lt;br /&gt;
 &lt;br /&gt;
===Introduction===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In a [http://en.wikipedia.org/wiki/Shared_memory shared memory], shared-bus multiprocessor system, the cache coherence protocol plays an important role in propagating changes from one cache to the others. In most cases, however, update and invalidate coherence protocols are the main source of bus contention, which can increase the number of busy bus cycles and thus program execution time, because a processor may stall while its cache is waiting for the bus. To reduce bus contention, this article presents some adaptive hybrid protocol strategies for high-performance multiprocessors.&amp;lt;ref&amp;gt;[http://en.wikipedia.org/wiki/Shared_memory shared memory]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Coherence protocols===&lt;br /&gt;
In general terms, a coherence protocol keeps the contents of all the caches in a [http://en.wikipedia.org/wiki/Shared_memory shared memory] system consistent with one another. The protocol maintains memory coherence according to a specific consistency model. Older multiprocessors support the [http://en.wikipedia.org/wiki/Sequential_consistency sequential consistency model], while modern shared memory systems typically support the release consistency or weak consistency models. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;b&amp;gt; Multiple Caches of Shared Resource &amp;lt;/b&amp;gt;&lt;br /&gt;
[[File:Cache_Coherency_Generic.png]]&lt;br /&gt;
&lt;br /&gt;
Transitions between states may vary from one implementation of these protocols to another. For example, an implementation may choose different update and invalidation transitions such as update-on-read, update-on-write, invalidate-on-read, or invalidate-on-write. The choice of transition may affect the amount of inter-cache traffic, which in turn may affect the amount of cache bandwidth available for actual work. This should be taken into consideration when designing distributed software that could cause strong contention between the caches of multiple processors.&lt;br /&gt;
Various models and protocols have been devised for maintaining cache coherence, such as MSI, MESI (also known as the Illinois protocol), MOSI, MOESI, MERSI, MESIF, write-once, Synapse, Berkeley, Firefly and Dragon.&lt;br /&gt;
&lt;br /&gt;
For the purpose of the main study, this article focuses on two commonly used classes of coherence protocol: &amp;lt;b&amp;gt;Write-Invalidate (WI) and Write-Update (WU)&amp;lt;/b&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
=====Write-Invalidate Protocol=====&lt;br /&gt;
In a [http://cs.gmu.edu/cne/modules/dsm/purple/wr_inval.html Write-Invalidate Protocol], a processor that writes to a shared block first invalidates the copies of that block in all other processors' caches and then updates its own copy without further bus operations.&lt;br /&gt;
&lt;br /&gt;
=====Write-Update Protocol=====&lt;br /&gt;
In contrast, in a [http://cs.gmu.edu/cne/modules/dsm/purple/wr_update.html Write-Update Protocol], a processor broadcasts each update to shared data to the other caches so that their copies stay coherent.&lt;br /&gt;
&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=====&amp;lt;dd&amp;gt;   &amp;lt;b&amp;gt;Disadvantages of Write-Invalidate Protocol&amp;lt;/b&amp;gt;=====&lt;br /&gt;
&amp;lt;dd&amp;gt; Any write by one processor invalidates the copies of the shared cache block held by other processors, forcing those caches to issue bus requests to reload the new data, which increases bus bandwidth demand. The effect is worse when one processor updates a block frequently while another processor stalls waiting to read the same block. For a sequence of '''n''' writes by one processor, each followed by a read from another processor, the '''WI''' protocol performs '''n''' invalidations and '''n''' cache block reads.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Write-Invalidate]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=====&amp;lt;dd&amp;gt;   &amp;lt;b&amp;gt;Disadvantages of Write-Update Protocol&amp;lt;/b&amp;gt;=====&lt;br /&gt;
&amp;lt;dd&amp;gt; The update protocol is advantageous in the case above because it only updates the cached copies '''n''' times. However, an update protocol can keep refreshing data in other processors' caches long after those processors have stopped using it, so fewer cache blocks are available for more useful data. This tends to increase conflict and capacity misses.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Write-Update]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
====Consideration of cache architecture issue====&lt;br /&gt;
Application execution time also depends directly on the cache block size. &lt;br /&gt;
Some applications execute more quickly with a large cache block size because they exhibit good spatial locality, whereas others run better with small cache blocks, which limit the cost of migratory data and false sharing between processors. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
A combined strategy such as an '''adaptive hybrid protocol''' can reduce the pathological behaviors of pure update and pure invalidate protocols. Such a protocol should be applicable to a wide range of network characteristics and should automatically adjust its behavior to achieve its target goals in the face of changes in traffic patterns, node mobility and other network characteristics.  &lt;br /&gt;
&lt;br /&gt;
==Hardware architecture==&lt;br /&gt;
Most often, a processor architecture uses the same block size for memory-to-cache transfers, cache-to-memory transfers, and coherence. &lt;br /&gt;
The approach described here uses different block sizes for transfer and coherence, based on application requirements.&lt;br /&gt;
A normal cache has the following parameters: capacity ('''C'''), block size ('''L''') and associativity ('''K'''). &lt;br /&gt;
A sector cache additionally divides each cache block into subblocks of size '''b'''. Compared to a normal cache ('''C, L, K'''), a sector cache ('''C, L, K, b''') requires one extra state and some extra bus lines to transmit bitmasks describing the status of the subblocks in a particular block, but it maintains the same number of tag and state bits for '''L'''. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
   &lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
===Subblock protocol===&lt;br /&gt;
This snoopy protocol combines features of the [http://en.wikipedia.org/wiki/MESI_protocol Illinois MESI protocol] and of write policies with subblock validation; by using subblocks, it obtains the advantages of both small and large cache block sizes.  &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dt&amp;gt;&lt;br /&gt;
=====Block states=====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Invalid:&amp;lt;/b&amp;gt; All subblocks are invalid&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Valid Exclusive:&amp;lt;/b&amp;gt; All valid subblocks in this block are not present in any other caches. All subblocks that are Clean Shared may be written without a bus transaction. Any subblock in the block may be invalid. There may also be Dirty subblocks, which must be written back upon replacement.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Clean Shared:&amp;lt;/b&amp;gt; The block contains subblocks that are either Clean Shared or Invalid. The block can be replaced without a bus transaction.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty Shared:&amp;lt;/b&amp;gt; Subblocks in this block may be in any state. There may be Dirty subblocks, which must be written back on replacement.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dt&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''The finite state diagram for the block (line) states is as follows:'''&lt;br /&gt;
[[File:line_protocol.png]]&lt;br /&gt;
&lt;br /&gt;
=====Subblock states=====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Invalid:&amp;lt;/b&amp;gt;  The subblock is invalid&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Clean Shared:&amp;lt;/b&amp;gt; A read access to the subblock will succeed. Unless the block that contains the subblock is in the Valid Exclusive state, a write to the subblock will force an invalidation transaction on the bus.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty Shared:&amp;lt;/b&amp;gt; The subblock is treated like a Clean Shared subblock, except that it must be written back on replacement. At most one cache will have a given subblock in either the Dirty Shared or Dirty state.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty:&amp;lt;/b&amp;gt; The subblock is exclusive to this cache. It must be written back on replacement. Read and write accesses to this subblock hit with no bus transaction.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''The finite state diagram for the subblock states is as follows:''' &lt;br /&gt;
[[File:subblock_protocol.png]]&lt;br /&gt;
   &lt;br /&gt;
The basic idea of this protocol is to transfer more data between caches and to access off-chip memory less often.&lt;br /&gt;
In contrast with the Illinois protocol, on a read miss the cache that holds the requested subblock supplies it together with all other clean and dirty shared subblocks in that block. &lt;br /&gt;
If the requested subblock is only in main memory, the snooping caches report which subblocks they hold, and memory provides the requested subblock along with the subblocks that are not currently cached.&lt;br /&gt;
On a write to a clean subblock or on a write miss, the snooper invalidates only the subblock being written, which avoids the penalties associated with false sharing.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;b&amp;gt;In contrast with the Illinois protocol, this protocol requires extra power to maintain the extra state and the additional logic.&amp;lt;/b&amp;gt; &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Read-snarfing protocol===&lt;br /&gt;
This is an enhancement to snoopy-cache coherence protocols that takes advantage of the inherent broadcast nature of the bus.&lt;br /&gt;
&lt;br /&gt;
In contrast with the MESI protocol, when one processor reloads a block into its cache after a cache miss, the read-snarfing protocol also supplies the block to the other processors whose copies were invalidated earlier. Only one read is required to restore the block to all caches in which it was invalidated.&lt;br /&gt;
&lt;br /&gt;
The protocol modifies the normal update protocol by updating only those subblocks that have been modified, and updates need to be broadcast only while the data is actively shared. &lt;br /&gt;
The read-snarfing protocol maintains a write counter and an invalidation threshold (Tb) for each cache block “b” to overcome the write-after-read drawback of WI. The protocol predicts the number of write operations that happen on a single cache block before a read request to the same cache block. An invalidation request is broadcast when the write counter reaches Tb, and the protocol dynamically adjusts the value of Tb based on the behavior of the running program. &lt;br /&gt;
&lt;br /&gt;
A simple version of the Read-snarfing Random Walk algorithm is as follows:&lt;br /&gt;
Initially, Tb of each cache block b is set to 0. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
// write_run: number of writes to block b by one processor before another processor accesses it&lt;br /&gt;
if (write_run &amp;gt; R) {&lt;br /&gt;
    // Long write run: later updates were wasted, so invalidate sooner (Tb may fall to 0)&lt;br /&gt;
    if (Tb &amp;gt; 0) {&lt;br /&gt;
        Tb--;&lt;br /&gt;
    }&lt;br /&gt;
} else {&lt;br /&gt;
    // Short write run: the block is actively shared, so allow more updates (Tb is capped at R)&lt;br /&gt;
    if (R &amp;gt; Tb) {&lt;br /&gt;
        Tb++;&lt;br /&gt;
    }&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
R  = invalidation ratio, (Ci + Cr) / Cu&lt;br /&gt;
Ci = cost in bus cycles of an invalidation transaction&lt;br /&gt;
Cu = cost in bus cycles of an update transaction&lt;br /&gt;
Cr = cost in bus cycles of reading a cache block&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When the threshold drops to 0, the block is invalidated immediately, with no wasted updates.  &lt;br /&gt;
While a block is actively shared, Tb is adjusted upward so that the block is updated rather than invalidated. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&amp;lt;dd&amp;gt;&lt;br /&gt;
=====Example 1 =====&lt;br /&gt;
&amp;lt;dd&amp;gt;Suppose the invalidation ratio (R) = 5&lt;br /&gt;
&amp;lt;dd&amp;gt;and the current threshold of the block (Tb) = 3.&lt;br /&gt;
&amp;lt;dd&amp;gt;If the processor writes 4 times before the block is accessed by another processor, then according to the above logic Tb becomes 4.&lt;br /&gt;
&amp;lt;dd&amp;gt;This means Tb is at the best possible value for this write run, and only updates are issued.&lt;br /&gt;
&lt;br /&gt;
=====Example 2 =====&lt;br /&gt;
&amp;lt;dd&amp;gt;Consider R = 5 and Tb = 3 for a particular block.&lt;br /&gt;
&amp;lt;dd&amp;gt;If the processor writes 10 times before the block is accessed by another processor,&lt;br /&gt;
&amp;lt;dd&amp;gt;Tb becomes 2 (decreased).&lt;br /&gt;
&amp;lt;dd&amp;gt;The protocol can then incur a cost of 2 updates, 1 invalidate and 1 reread.&lt;br /&gt;
&amp;lt;dd&amp;gt;After 2 more such write runs, Tb becomes 0 and invalidation occurs immediately. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
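The threshold adjustment described above can be sketched in Python. This is only an illustrative sketch, not the protocol's hardware logic: the function name adjust_threshold and its parameter names are hypothetical, and the sketch assumes that Tb may fall to 0, as the text and Example 2 state.&lt;br /&gt;

```python
def adjust_threshold(tb: int, write_run: int, r: int) -> int:
    """Return the new invalidation threshold Tb after one completed write run.

    tb        -- current threshold of the block
    write_run -- writes by one processor before another processor accessed the block
    r         -- invalidation ratio R = (Ci + Cr) / Cu
    """
    if write_run > r:
        # Long run: the later updates were wasted, so invalidate sooner next time.
        # Assumption: the threshold is allowed to reach 0 but not go below it.
        return tb - 1 if tb > 0 else tb
    # Short run: the block is actively shared, so allow more updates, capped at R.
    return tb + 1 if r > tb else tb
```

Calling adjust_threshold(3, 4, 5) reproduces Example 1 (Tb rises to 4), and adjust_threshold(3, 10, 5) reproduces Example 2 (Tb falls to 2).&lt;br /&gt;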
&lt;br /&gt;
==Simulation result==&lt;br /&gt;
For all simulated results, the subblock size ('''b''') was '''8 bytes''' and the caches were two-way set-associative ('''K = 2'''). To keep simulation time down, only relatively large caches ('''C = 128K''') were simulated.&lt;br /&gt;
&lt;br /&gt;
First, below are the results for the two applications ('''Gauss and Cholesky''') that do well with large block sizes. Next come the results for the three applications ('''MP3D, Topopt, and Pverify''') that perform better with smaller block sizes. Finally come the results for Barnes, for which the choice of block size is not very important. The simulations compared normal and sector caches with '''1, 4, 16 and 32 processors''', where the normal cache uses the Illinois protocol and the sector cache uses the read-snarfing protocol. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Simulation]&amp;lt;/ref&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:image500.png]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Fig: (a-c) Gauss (128K cache) uses large cache block &lt;br /&gt;
&lt;br /&gt;
Fig: [[File:Gauss32.png]]&lt;br /&gt;
&lt;br /&gt;
[[File:cholesky_128k.png]]&lt;br /&gt;
&lt;br /&gt;
Fig: (a-c) Cholesky (128 K cache) application uses large cache block&lt;br /&gt;
&lt;br /&gt;
[[File:gauss_32pro.png]]&lt;br /&gt;
&lt;br /&gt;
Fig: [[File:Cholesky.png]]&lt;br /&gt;
&lt;br /&gt;
[[File:mp3d.png]]&lt;br /&gt;
&lt;br /&gt;
Fig: (a-c) MP3D (128 K cache) application, which uses smaller block sizes.&lt;br /&gt;
&lt;br /&gt;
[[File:results_summary_1.png]]&lt;br /&gt;
&lt;br /&gt;
[[File:results_summary_2.png]]&lt;br /&gt;
&lt;br /&gt;
[[File:results_summary_3.png]]&lt;br /&gt;
&lt;br /&gt;
==Conclusion==&lt;br /&gt;
This article described two techniques for improving application performance on bus-based multiprocessors. The first technique is the subblock cache coherence protocol, which combines a large transfer block with small coherence subblocks to gain the advantages of spatial locality while avoiding false sharing.   &lt;br /&gt;
The second technique is read-snarfing, which reduces the number of read misses caused by earlier cache coherence protocol actions.  &lt;br /&gt;
The simulation results above showed that the subblock protocol works better than the MESI protocol for the 64-byte cache block size, but for small block sizes the Illinois protocol works better because the subblock protocol uses a large transfer block.&lt;br /&gt;
In addition, the read-snarfing protocol is effective in reducing the amount of data transferred and the number of bus transactions, which results in lower bus utilization and higher application speedups. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
*[http://www.computer.org/portal/web/csdl/doi/10.1109/HPCA.1995.386536 Two techniques for improving performance on bus-based multiprocessors]&lt;br /&gt;
&lt;br /&gt;
*[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Design of an Adaptive Cache Coherence Protocol for Large Scale Multiprocessors]&lt;br /&gt;
&lt;br /&gt;
*[http://www.lfbs.rwth-aachen.de/content/smi Shared Memory Interface]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references&amp;gt;&amp;lt;/references&amp;gt;&lt;/div&gt;</summary>
		<author><name>Sbasu3</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8b_va&amp;diff=59773</id>
		<title>CSC/ECE 506 Spring 2012/8b va</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8b_va&amp;diff=59773"/>
		<updated>2012-03-19T01:04:06Z</updated>

		<summary type="html">&lt;p&gt;Sbasu3: /* Block states */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Update and adaptive coherence protocols on real architectures, and power considerations=&lt;br /&gt;
 &lt;br /&gt;
===Introduction===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In a [http://en.wikipedia.org/wiki/Shared_memory shared memory], shared-bus multiprocessor system, the cache coherence protocol plays an important role in propagating changes from one cache to the others. In most cases, however, update and invalidate coherence protocols are the main source of bus contention, which increases the number of busy bus cycles and therefore program execution time, because a processor may stall while its cache waits for the bus. To reduce bus contention, this article presents some adaptive hybrid protocol strategies for high-performance multiprocessors.&amp;lt;ref&amp;gt;[http://en.wikipedia.org/wiki/Shared_memory shared memory]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Coherence protocols===&lt;br /&gt;
In general terms, a coherency protocol is a protocol which maintains the consistency between all the caches in a system of [http://en.wikipedia.org/wiki/Shared_memory shared memory]. The protocol maintains memory coherence according to a specific consistency model. Older multiprocessors support the [http://en.wikipedia.org/wiki/Sequential_consistency sequential consistency model], while modern shared memory systems typically support the release consistency or weak consistency models. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;b&amp;gt; Multiple Caches of Shared Resource &amp;lt;/b&amp;gt;&lt;br /&gt;
[[File:Cache_Coherency_Generic.png]]&lt;br /&gt;
&lt;br /&gt;
Transitions between states in any specific implementation of these protocols may vary. For example, an implementation may choose different update and invalidation transitions such as update-on-read, update-on-write, invalidate-on-read, or invalidate-on-write. The choice of transition may affect the amount of inter-cache traffic, which in turn may affect the amount of cache bandwidth available for actual work. This should be taken into consideration in the design of distributed software that could cause strong contention between the caches of multiple processors.&lt;br /&gt;
Various models and protocols have been devised for maintaining cache coherence, such as MSI protocol, MESI (aka Illinois protocol), MOSI, MOESI, MERSI, MESIF, write-once, Synapse, Berkeley, Firefly and Dragon protocol&lt;br /&gt;
&lt;br /&gt;
For the purpose of the main study, this article will focus on the two commonly used coherence protocols: &amp;lt;b&amp;gt;Write-Invalidate (WI) and Write-Update (WU)&amp;lt;/b&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=====Write-Invalidate Protocol=====&lt;br /&gt;
In a [http://cs.gmu.edu/cne/modules/dsm/purple/wr_inval.html Write-Invalidate Protocol], a processor invalidates the copies of the block in all other processors' caches and then updates its own cache block without further bus operations.&lt;br /&gt;
&lt;br /&gt;
=====Write-Update Protocol=====&lt;br /&gt;
In contrast, in a [http://cs.gmu.edu/cne/modules/dsm/purple/wr_update.html Write-Update Protocol], a processor broadcasts updates to shared data to the other caches so that they stay coherent.&lt;br /&gt;
&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=====&amp;lt;dd&amp;gt;   &amp;lt;b&amp;gt;Disadvantages of Write-Invalidate Protocol&amp;lt;/b&amp;gt;=====&lt;br /&gt;
&amp;lt;dd&amp;gt; Any write in one processor invalidates the shared cache blocks of other processors, forcing those caches to issue bus requests to reload the new data, which consumes more bus bandwidth. This gets worse if one processor frequently updates a cache block while another processor stalls to read the same block. For a sequence of '''n''' writes by one processor, each followed by a read from another processor, the '''WI''' protocol performs '''n''' invalidate and '''n''' cache block read operations.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Write-Invalidate]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=====&amp;lt;dd&amp;gt;   &amp;lt;b&amp;gt;Disadvantages of Write-Update Protocol&amp;lt;/b&amp;gt;=====&lt;br /&gt;
&amp;lt;dd&amp;gt; An update protocol is advantageous in the case above because it only updates the cache block '''n''' times. However, an update protocol can keep refreshing data that other processors' caches no longer need, so fewer cache blocks are available for more useful data. This tends to increase conflict and capacity misses.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Write-Update]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
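The bus cost of the two protocols for the write/read sequence described above can be compared with simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the per-transaction costs Ci (invalidation), Cr (cache block read) and Cu (update) are passed in as parameters rather than taken from any measured system.&lt;br /&gt;

```python
def write_invalidate_cost(n: int, ci: int, cr: int) -> int:
    """Bus cycles under WI for n writes, each followed by a read from another processor."""
    # Each write costs one invalidation, and each following read reloads the block.
    return n * (ci + cr)


def write_update_cost(n: int, cu: int) -> int:
    """Bus cycles under WU for the same sequence."""
    # Each write is broadcast as one update; the reader then hits in its own cache.
    return n * cu
```

For example, with assumed costs of Ci = 2, Cr = 8 and Cu = 3 bus cycles, write_invalidate_cost(4, 2, 8) gives 40 cycles, while write_update_cost(4, 3) gives 12 cycles.&lt;br /&gt;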
&lt;br /&gt;
====Consideration of cache architecture issue====&lt;br /&gt;
Application performance also depends directly on the cache block size. &lt;br /&gt;
Some applications execute more quickly with a large cache block size because they exhibit good spatial locality, whereas others run better with small cache blocks, which avoid migratory data or false sharing between processors. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Using a combined strategy such as an '''adaptive hybrid protocol''' can reduce the pathological behaviors of update and invalidate protocols. This protocol should be applicable to a wide range of network characteristics, and it should automatically adjust its behavior to achieve target goals in the face of changes in traffic patterns, node mobility and other network characteristics.  &lt;br /&gt;
&lt;br /&gt;
==Hardware architecture==&lt;br /&gt;
Most often, a processor architecture uses the same block size for memory-to-cache transfers, cache-to-memory transfers and coherence. &lt;br /&gt;
The approach described here uses different block sizes for transfer and coherence, based on application requirements.&lt;br /&gt;
A normal cache has the following parameters: capacity ('''C'''), block size ('''L''') and associativity ('''K'''). &lt;br /&gt;
A sector cache additionally divides each cache block into subblocks of size “b”. Compared to a normal cache ('''C, L, K'''), a sector cache ('''C, L, K, b''') requires one extra state and some extra bus lines to transmit bitmasks describing the status of the subblocks in a particular block, but it maintains the same number of tag and state bits for '''L'''. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
   &lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
===Subblock protocol===&lt;br /&gt;
This snoopy-based protocol combines the features of the [http://en.wikipedia.org/wiki/MESI_protocol Illinois MESI protocol] and write policies with subblock validation, using subblocks to gain the advantages of both small and large cache block sizes.  &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dt&amp;gt;&lt;br /&gt;
=====Block states=====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Invalid:&amp;lt;/b&amp;gt; All subblocks are invalid&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Valid Exclusive:&amp;lt;/b&amp;gt; None of the valid subblocks in this block is present in any other cache. All subblocks that are Clean Shared may be written without a bus transaction. Any subblocks in the block may be invalid. There may also be Dirty blocks, which must be written back upon replacement.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Clean Shared:&amp;lt;/b&amp;gt; The block contains subblocks that are either Clean Shared or Invalid. The block can be replaced without a bus transaction.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty Shared:&amp;lt;/b&amp;gt; Subblocks in this block may be in any state. There may be Dirty blocks which must be written back on replacement.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dt&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Finite state diagram of block/ line states is as follows:'''&lt;br /&gt;
[[File:line_protocol.png]]&lt;br /&gt;
&lt;br /&gt;
=====Subblock states=====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Invalid:&amp;lt;/b&amp;gt;  The subblock is invalid.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Clean Shared:&amp;lt;/b&amp;gt; A read access to the block will succeed. Unless the block the subblock is a part of is in the Valid Exclusive state, a write to the subblock will force an invalidation transaction on the bus.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty Shared:&amp;lt;/b&amp;gt; The subblock is treated like a Clean Shared subblock, except that it must be written back on replacement. At most one cache will have a given subblock in either the Dirty Shared or Dirty state.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty:&amp;lt;/b&amp;gt; The subblock is exclusive to this cache. It must be written back on replacement. Read and write accesses to this subblock hit with no bus transaction.&lt;br /&gt;
 &lt;br /&gt;
[[File:subblock_protocol.png]]&lt;br /&gt;
   &lt;br /&gt;
The basic idea of this protocol is to transfer more data between caches and to make fewer off-chip memory accesses.&lt;br /&gt;
In contrast with the Illinois protocol, on a read miss the cache that shares the block sends the requested subblock together with all other Clean Shared and Dirty Shared subblocks in that block. &lt;br /&gt;
If the requested subblock must come from main memory, the cache snoopers report which subblocks are cached, and memory provides the requested subblock along with the subblocks that are not currently cached.&lt;br /&gt;
On a write to a clean subblock or on a write miss, the snooper invalidates only the subblock being written, which avoids the penalties associated with false sharing.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;b&amp;gt;In contrast with the Illinois protocol, this protocol requires more logic, and therefore extra power, to maintain the extra state.&amp;lt;/b&amp;gt; &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Read-snarfing protocol===&lt;br /&gt;
This is an enhancement to snoopy-cache coherence protocols that takes advantage of the inherent broadcast nature of the bus.&lt;br /&gt;
&lt;br /&gt;
In contrast with the MESI protocol, when one processor reloads a block into its cache after a cache miss, the read-snarfing protocol also supplies the block to the other processors whose copies were invalidated earlier. Only one read is required to restore the block to all caches in which it was invalidated.&lt;br /&gt;
&lt;br /&gt;
The protocol modifies the normal update protocol by updating only those subblocks that have been modified, and updates need to be broadcast only while the data is actively shared. &lt;br /&gt;
The read-snarfing protocol maintains a write counter and an invalidation threshold (Tb) for each cache block “b” to overcome the write-after-read drawback of WI. The protocol predicts the number of write operations that happen on a single cache block before a read request to the same cache block. An invalidation request is broadcast when the write counter reaches Tb, and the protocol dynamically adjusts the value of Tb based on the behavior of the running program. &lt;br /&gt;
&lt;br /&gt;
A simple version of the Read-snarfing Random Walk algorithm is as follows:&lt;br /&gt;
Initially, Tb of each cache block b is set to 0. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
// write_run: number of writes to block b by one processor before another processor accesses it&lt;br /&gt;
if (write_run &amp;gt; R) {&lt;br /&gt;
    // Long write run: later updates were wasted, so invalidate sooner (Tb may fall to 0)&lt;br /&gt;
    if (Tb &amp;gt; 0) {&lt;br /&gt;
        Tb--;&lt;br /&gt;
    }&lt;br /&gt;
} else {&lt;br /&gt;
    // Short write run: the block is actively shared, so allow more updates (Tb is capped at R)&lt;br /&gt;
    if (R &amp;gt; Tb) {&lt;br /&gt;
        Tb++;&lt;br /&gt;
    }&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
R  = invalidation ratio, (Ci + Cr) / Cu&lt;br /&gt;
Ci = cost in bus cycles of an invalidation transaction&lt;br /&gt;
Cu = cost in bus cycles of an update transaction&lt;br /&gt;
Cr = cost in bus cycles of reading a cache block&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When the threshold drops to 0, the block is invalidated immediately, with no wasted updates.  &lt;br /&gt;
While a block is actively shared, Tb is adjusted upward so that the block is updated rather than invalidated. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&amp;lt;dd&amp;gt;&lt;br /&gt;
=====Example 1 =====&lt;br /&gt;
&amp;lt;dd&amp;gt;Suppose the invalidation ratio (R) = 5&lt;br /&gt;
&amp;lt;dd&amp;gt;and the current threshold of the block (Tb) = 3.&lt;br /&gt;
&amp;lt;dd&amp;gt;If the processor writes 4 times before the block is accessed by another processor, then according to the above logic Tb becomes 4.&lt;br /&gt;
&amp;lt;dd&amp;gt;This means Tb is at the best possible value for this write run, and only updates are issued.&lt;br /&gt;
&lt;br /&gt;
=====Example 2 =====&lt;br /&gt;
&amp;lt;dd&amp;gt;Consider R = 5 and Tb = 3 for a particular block.&lt;br /&gt;
&amp;lt;dd&amp;gt;If the processor writes 10 times before the block is accessed by another processor,&lt;br /&gt;
&amp;lt;dd&amp;gt;Tb becomes 2 (decreased).&lt;br /&gt;
&amp;lt;dd&amp;gt;The protocol can then incur a cost of 2 updates, 1 invalidate and 1 reread.&lt;br /&gt;
&amp;lt;dd&amp;gt;After 2 more such write runs, Tb becomes 0 and invalidation occurs immediately. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
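The bus cost of a single write run under a given threshold, as in Example 2, can be sketched in Python. This is an illustrative sketch under stated assumptions, not a definitive model: the function name write_run_cost is hypothetical, and it assumes that the first Tb writes of a run are sent as updates, that a longer run then pays for one invalidation plus one later reread, and that a run no longer than Tb pays for updates only.&lt;br /&gt;

```python
def write_run_cost(write_run: int, tb: int, cu: int, ci: int, cr: int) -> int:
    """Bus cycles spent on one write run for a block with invalidation threshold tb.

    cu, ci, cr -- cost in bus cycles of an update, an invalidation and a block read
    """
    if tb >= write_run:
        # Assumption: the run ends before the threshold is exceeded, so every write is an update.
        return write_run * cu
    # tb updates are sent, then the block is invalidated and later reread by the other processor.
    return tb * cu + ci + cr
```

With assumed costs of Cu = 1, Ci = 1 and Cr = 4 bus cycles (so R = 5), write_run_cost(10, 2, 1, 1, 4) gives 2 updates + 1 invalidate + 1 reread = 7 cycles, compared with 10 cycles if all 10 writes were sent as updates.&lt;br /&gt;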
&lt;br /&gt;
==Simulation result==&lt;br /&gt;
For all simulated results, the subblock size ('''b''') was '''8 bytes''' and the caches were two-way set-associative ('''K = 2'''). To keep simulation time down, only relatively large caches ('''C = 128K''') were simulated.&lt;br /&gt;
&lt;br /&gt;
First, below are the results for the two applications ('''Gauss and Cholesky''') that do well with large block sizes. Next come the results for the three applications ('''MP3D, Topopt, and Pverify''') that perform better with smaller block sizes. Finally come the results for Barnes, for which the choice of block size is not very important. The simulations compared normal and sector caches with '''1, 4, 16 and 32 processors''', where the normal cache uses the Illinois protocol and the sector cache uses the read-snarfing protocol. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Simulation]&amp;lt;/ref&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:image500.png]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Fig: (a-c) Gauss (128K cache) uses large cache block &lt;br /&gt;
&lt;br /&gt;
Fig: [[File:Gauss32.png]]&lt;br /&gt;
&lt;br /&gt;
[[File:cholesky_128k.png]]&lt;br /&gt;
&lt;br /&gt;
Fig: (a-c) Cholesky (128 K cache) application uses large cache block&lt;br /&gt;
&lt;br /&gt;
[[File:gauss_32pro.png]]&lt;br /&gt;
&lt;br /&gt;
Fig: [[File:Cholesky.png]]&lt;br /&gt;
&lt;br /&gt;
[[File:mp3d.png]]&lt;br /&gt;
&lt;br /&gt;
Fig: (a-c) MP3D (128 K cache) application, which uses smaller block sizes.&lt;br /&gt;
&lt;br /&gt;
[[File:results_summary_1.png]]&lt;br /&gt;
&lt;br /&gt;
[[File:results_summary_2.png]]&lt;br /&gt;
&lt;br /&gt;
[[File:results_summary_3.png]]&lt;br /&gt;
&lt;br /&gt;
==Conclusion==&lt;br /&gt;
This article described two techniques for improving application performance on bus-based multiprocessors. The first technique is the subblock cache coherence protocol, which combines a large transfer block with small coherence subblocks to gain the advantages of spatial locality while avoiding false sharing.   &lt;br /&gt;
The second technique is read-snarfing, which reduces the number of read misses caused by earlier cache coherence protocol actions.  &lt;br /&gt;
The simulation results above showed that the subblock protocol works better than the MESI protocol for the 64-byte cache block size, but for small block sizes the Illinois protocol works better because the subblock protocol uses a large transfer block.&lt;br /&gt;
In addition, the read-snarfing protocol is effective in reducing the amount of data transferred and the number of bus transactions, which results in lower bus utilization and higher application speedups. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
*[http://www.computer.org/portal/web/csdl/doi/10.1109/HPCA.1995.386536 Two techniques for improving performance on bus-based multiprocessors]&lt;br /&gt;
&lt;br /&gt;
*[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Design of an Adaptive Cache Coherence Protocol for Large Scale Multiprocessors]&lt;br /&gt;
&lt;br /&gt;
*[http://www.lfbs.rwth-aachen.de/content/smi Shared Memory Interface]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references&amp;gt;&amp;lt;/references&amp;gt;&lt;/div&gt;</summary>
		<author><name>Sbasu3</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8b_va&amp;diff=59772</id>
		<title>CSC/ECE 506 Spring 2012/8b va</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8b_va&amp;diff=59772"/>
		<updated>2012-03-19T01:02:55Z</updated>

		<summary type="html">&lt;p&gt;Sbasu3: /* Block states */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Update and adaptive coherence protocols on real architectures, and power considerations=&lt;br /&gt;
 &lt;br /&gt;
===Introduction===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In a [http://en.wikipedia.org/wiki/Shared_memory shared memory], shared-bus multiprocessor system, the cache coherence protocol plays an important role in propagating changes from one cache to the others. In most cases, however, update and invalidate coherence protocols are the main source of bus contention, which increases the number of busy bus cycles and therefore program execution time, because a processor may stall while its cache waits for the bus. To reduce bus contention, this article presents some adaptive hybrid protocol strategies for high-performance multiprocessors.&amp;lt;ref&amp;gt;[http://en.wikipedia.org/wiki/Shared_memory shared memory]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Coherence protocols===&lt;br /&gt;
In general terms, a coherency protocol is a protocol which maintains the consistency between all the caches in a system of [http://en.wikipedia.org/wiki/Shared_memory shared memory]. The protocol maintains memory coherence according to a specific consistency model. Older multiprocessors support the [http://en.wikipedia.org/wiki/Sequential_consistency sequential consistency model], while modern shared memory systems typically support the release consistency or weak consistency models. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;b&amp;gt; Multiple Caches of Shared Resource &amp;lt;/b&amp;gt;&lt;br /&gt;
[[File:Cache_Coherency_Generic.png]]&lt;br /&gt;
&lt;br /&gt;
Transitions between states in any specific implementation of these protocols may vary. For example, an implementation may choose different update and invalidation transitions such as update-on-read, update-on-write, invalidate-on-read, or invalidate-on-write. The choice of transition may affect the amount of inter-cache traffic, which in turn may affect the amount of cache bandwidth available for actual work. This should be taken into consideration in the design of distributed software that could cause strong contention between the caches of multiple processors.&lt;br /&gt;
Various models and protocols have been devised for maintaining cache coherence, such as the MSI protocol, MESI (also known as the Illinois protocol), MOSI, MOESI, MERSI, MESIF, write-once, Synapse, Berkeley, Firefly and Dragon protocols.&lt;br /&gt;
&lt;br /&gt;
For the purpose of the main study, this article will focus on the two commonly used coherence protocols: &amp;lt;b&amp;gt;Write-Invalidate (WI) and Write-Update (WU)&amp;lt;/b&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=====Write-Invalidate Protocol=====&lt;br /&gt;
In a [http://cs.gmu.edu/cne/modules/dsm/purple/wr_inval.html Write-Invalidate Protocol], a processor invalidates the copies of the cache block held by all other processors and then updates its own copy without further bus operations.&lt;br /&gt;
&lt;br /&gt;
=====Write-Update Protocol=====&lt;br /&gt;
In contrast, in a [http://cs.gmu.edu/cne/modules/dsm/purple/wr_update.html Write-Update Protocol], a processor broadcasts each update to shared data so that the other caches stay coherent.&lt;br /&gt;
&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based mu1tiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=====&amp;lt;dd&amp;gt;   &amp;lt;b&amp;gt;Disadvantages of Write-Invalidate Protocol&amp;lt;/b&amp;gt;=====&lt;br /&gt;
&amp;lt;dd&amp;gt; Any write by one processor invalidates the shared copies of the cache block in other processors, forcing those caches to issue a bus request to reload the new data, which increases bus bandwidth demand. The effect is worse when one processor updates a block frequently while another processor stalls to read the same block. For a sequence of '''n''' writes by one processor, each followed by a read from another processor, the '''WI''' protocol performs '''n''' invalidations and '''n''' cache block reads.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Write-Invalidate]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=====&amp;lt;dd&amp;gt;   &amp;lt;b&amp;gt;Disadvantages of Write-Update Protocol&amp;lt;/b&amp;gt;=====&lt;br /&gt;
&amp;lt;dd&amp;gt; An update protocol is advantageous in the case above, because it only updates the cache blocks '''n''' times. However, an update protocol may keep refreshing data that other processors no longer need, so that it stays in their caches for too long and fewer cache blocks are available for more useful data. This tends to increase conflict and capacity cache misses.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Write-Update]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
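The trade-off between the two protocols can be illustrated with a small cost model. The sketch below is only an illustration under stated assumptions: the function names and the example cycle counts are hypothetical and are not taken from the cited paper; ci, cr and cu stand for the bus-cycle costs of an invalidation, a cache block read and an update.&lt;br /&gt;

```python
def write_invalidate_cost(n, ci, cr):
    """Bus cycles for n writes, each followed by a remote read, under WI.

    Every write costs one invalidation and every remote read costs one
    cache block reload.
    """
    return n * (ci + cr)


def write_update_cost(n, cu):
    """Bus cycles for the same n writes under WU: one update per write."""
    return n * cu


# Hypothetical cycle counts, chosen only for illustration.
n, ci, cr, cu = 10, 2, 8, 2
print(write_invalidate_cost(n, ci, cr))  # prints 100
print(write_update_cost(n, cu))          # prints 20
```

With these assumed costs the update protocol is cheaper as long as every written value is actually read by another processor; the model does not account for updates sent to caches that no longer use the data.&lt;br /&gt;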
&lt;br /&gt;
====Consideration of cache architecture issue====&lt;br /&gt;
Application performance also depends directly on the cache block size. &lt;br /&gt;
Some applications execute more quickly with a large cache block size because they exhibit good spatial locality, whereas others run better with small cache blocks, which avoid the penalties of migratory data and false sharing between processors. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
A combined strategy such as an '''adaptive hybrid protocol''' can reduce the pathological behaviors of pure update and pure invalidate protocols. Such a protocol should be applicable to a wide range of network characteristics, and it should automatically adjust its behavior to achieve target goals in the face of changes in traffic patterns, node mobility and other network characteristics.  &lt;br /&gt;
&lt;br /&gt;
==Hardware architecture==&lt;br /&gt;
Most processor architectures use the same block size for memory-to-cache transfers, cache-to-memory transfers and coherence. &lt;br /&gt;
The approach described here uses different block sizes for transfer and coherence, based on application requirements.&lt;br /&gt;
A normal cache has the following parameters: capacity ('''C'''), block size ('''L''') and associativity ('''K'''). &lt;br /&gt;
A sector cache additionally divides each cache block into subblocks of size '''b'''. Compared to a normal cache ('''C, L, K'''), a sector cache ('''C, L, K, b''') requires one extra state and some extra bus lines to transmit bitmasks describing the status of the subblocks in a particular block, but it maintains the same number of tag and state bits for '''L'''. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
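As a rough illustration of how the four sector cache parameters relate, the following sketch computes the number of blocks, sets and subblocks per block. The function name and the returned field names are hypothetical, and the example values (128 KB capacity, 64-byte blocks, two-way associativity, 8-byte subblocks) are assumptions chosen for illustration.&lt;br /&gt;

```python
def sector_cache_geometry(c, l, k, b):
    """Derive basic counts for a sector cache (C, L, K, b), sizes in bytes."""
    blocks = c // l               # number of cache blocks (one tag each)
    sets = blocks // k            # number of sets for K-way associativity
    subblocks_per_block = l // b  # coherence units inside one transfer block
    return {
        "blocks": blocks,
        "sets": sets,
        "subblocks_per_block": subblocks_per_block,
    }


# Example values: 128 KB cache, 64-byte blocks, 2-way, 8-byte subblocks.
geometry = sector_cache_geometry(128 * 1024, 64, 2, 8)
print(geometry)
```

In this example each tag covers eight subblocks, which is why the status of the subblocks can be sent as a short bitmask on the bus.&lt;br /&gt;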
   &lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
===Subblock protocol===&lt;br /&gt;
This snoopy protocol combines features of the [http://en.wikipedia.org/wiki/MESI_protocol Illinois MESI protocol] and write policies with subblock validation in order to obtain the advantages of both small and large cache block sizes.  &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dt&amp;gt;&lt;br /&gt;
=====Block states=====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Invalid:&amp;lt;/b&amp;gt; All subblocks are invalid&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Valid Exclusive:&amp;lt;/b&amp;gt; None of the valid subblocks in this block is present in any other cache. All subblocks that are Clean Shared may be written without a bus transaction. Any subblock in the block may be invalid. There may also be Dirty subblocks, which must be written back upon replacement.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Clean Shared:&amp;lt;/b&amp;gt; The block contains subblocks that are either Clean Shared or Invalid. The block can be replaced without a bus transaction.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty Shared:&amp;lt;/b&amp;gt; Subblocks in this block may be in any state. There may be Dirty blocks which must be written back on replacement.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based mu1tiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dt&amp;gt;&lt;br /&gt;
&lt;br /&gt;
'''     Finite state diagram of block/ line states is as follows:'''&lt;br /&gt;
[[File:line_protocol.png]]&lt;br /&gt;
&lt;br /&gt;
=====Subblock states=====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Invalid:&amp;lt;/b&amp;gt;  The subblock is invalid.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Clean Shared:&amp;lt;/b&amp;gt; A read access to the subblock will succeed. Unless the block containing the subblock is in the Valid Exclusive state, a write to the subblock forces an invalidation transaction on the bus.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty Shared:&amp;lt;/b&amp;gt; The subblock is treated like a Clean Shared subblock, except that it must be written back on replacement. At most one cache will have a given subblock in either the Dirty Shared or Dirty state.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty:&amp;lt;/b&amp;gt; The subblock is exclusive to this cache. It must be written back on replacement. Read and write accesses to this subblock hit with no bus transaction.&lt;br /&gt;
 &lt;br /&gt;
[[File:subblock_protocol.png]]&lt;br /&gt;
   &lt;br /&gt;
The basic idea of this protocol is to perform more cache-to-cache data transfers and fewer off-chip memory accesses.&lt;br /&gt;
In contrast with the Illinois protocol, on a read miss a cache that holds the block sends the requested subblock together with all other Clean Shared and Dirty Shared subblocks in that block. &lt;br /&gt;
If the subblock must come from main memory, the cache snoopers report which subblocks are cached, and memory provides the requested subblock along with the subblocks that are not currently cached.&lt;br /&gt;
On a write to a clean subblock or a write miss, the snooper invalidates only the subblock being written, which avoids the penalties associated with false sharing.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;b&amp;gt;In contrast with Illinois protocol, this protocol requires extra power cycle to maintain the extra state and more logic.&amp;lt;/b&amp;gt; &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based mu1tiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Read-snarfing protocol===&lt;br /&gt;
This is an enhancement to snoopy-cache coherence protocols that takes advantage of the inherent broadcast nature of the bus.&lt;br /&gt;
&lt;br /&gt;
In contrast with the MESI protocol, when one processor reloads a block into its cache after a cache miss, the read-snarfing protocol also supplies the block to the other processors whose copies were invalidated earlier. Only one read is required to restore the block to all caches in which it was invalidated.&lt;br /&gt;
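A minimal sketch of the snarfing idea follows, assuming each cache is modelled as a dictionary from block number to data and that an invalidated block keeps its tag (modelled here as the value None). The names bus_read and INVALID are hypothetical; this is an illustration of the principle, not the mechanism of any particular machine.&lt;br /&gt;

```python
INVALID = None  # hypothetical marker: tag still present, data invalidated


def bus_read(caches, requester, block, memory):
    """Serve a read miss and let other caches snarf the data from the bus."""
    data = memory[block]
    requester[block] = data
    for cache in caches:
        # A cache snarfs only if it still holds the tag in the invalid state.
        if cache is not requester and block in cache and cache[block] is INVALID:
            cache[block] = data
    return data


memory = {7: 42}
cache0 = {7: INVALID}   # misses on block 7 and issues the bus read
cache1 = {7: INVALID}   # was invalidated earlier, so it snarfs the data
cache2 = {}             # never held block 7, so it ignores the read
bus_read([cache0, cache1, cache2], cache0, 7, memory)
print(cache0, cache1, cache2)  # prints {7: 42} {7: 42} {}
```

A single bus read restores the block in both caches that had lost it, so the second cache does not need a read of its own later.&lt;br /&gt;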
&lt;br /&gt;
The protocol modifies the normal update protocol by updating only those subblocks which are modified, and updates need to be broadcast only when the data is actively shared. &lt;br /&gt;
The read-snarfing protocol maintains a write counter and an invalidation threshold (Tb) for each cache block '''b''' to overcome the drawback of WI's write-after-read problem. The protocol predicts the number of writes to a single cache block that occur before a read request to the same block. An invalidation request is broadcast when the write counter reaches Tb, and the protocol dynamically adjusts the value of Tb based on the behavior of the program. &lt;br /&gt;
&lt;br /&gt;
A simple version of the Read-snarfing Random Walk algorithm is as follows:&lt;br /&gt;
Initially Tb of each cache block b is set to 0. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based mu1tiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
// "most recent write run" is the number of writes to the block before it is accessed by another processor&lt;br /&gt;
if (most recent write run  &amp;gt;  R) {&lt;br /&gt;
     // long write run: lower the threshold so the block is invalidated sooner&lt;br /&gt;
     if (Tb &amp;gt; 0) {&lt;br /&gt;
          Tb--;&lt;br /&gt;
     }&lt;br /&gt;
} else {&lt;br /&gt;
     // short write run: raise the threshold so that updates continue&lt;br /&gt;
     if (R &amp;gt; Tb) {&lt;br /&gt;
          Tb++;&lt;br /&gt;
     }&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
R  = invalidation ratio, (Ci + Cr) / Cu&lt;br /&gt;
Ci = cost in bus cycles of an invalidation transaction&lt;br /&gt;
Cu = cost in bus cycles of an update transaction&lt;br /&gt;
Cr = cost in bus cycles of reading a cache block&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When the threshold drops to 0, the block is invalidated immediately, with no wasted updates.  &lt;br /&gt;
When the block is actively shared, Tb is adjusted upward so that the block is not invalidated. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
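The threshold adjustment can also be written as a small runnable function. This is a sketch under stated assumptions rather than the protocol's definitive implementation: the function names are hypothetical, the threshold is assumed to be allowed to fall to 0 (as the surrounding text describes), and the cycle costs in the usage lines are invented so that R comes out as 5.&lt;br /&gt;

```python
def invalidation_ratio(ci, cr, cu):
    """R = (Ci + Cr) / Cu, all costs in bus cycles."""
    return (ci + cr) / cu


def adjust_threshold(tb, write_run, r):
    """Return the new threshold after one observed write run."""
    excess = max(write_run - r, 0)   # nonzero only if the run was longer than r
    headroom = max(r - tb, 0)        # nonzero only if tb is still below r
    if excess:
        # Long write run: lower the threshold, but never below 0 (assumed floor).
        return max(tb - 1, 0)
    if headroom:
        # Short write run: the block is actively shared, allow more updates.
        return tb + 1
    return tb


r = invalidation_ratio(2, 8, 2)      # hypothetical costs giving R = 5.0
print(adjust_threshold(3, 4, r))     # prints 4
print(adjust_threshold(3, 10, r))    # prints 2
```

The two calls use a threshold of 3 with write runs of 4 and 10: the short run raises the threshold and the long run lowers it.&lt;br /&gt;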
&amp;lt;dl&amp;gt;&lt;br /&gt;
&amp;lt;dd&amp;gt;&lt;br /&gt;
=====Example 1 =====&lt;br /&gt;
&amp;lt;dd&amp;gt;Suppose the invalidation ratio (R) = 5&lt;br /&gt;
&amp;lt;dd&amp;gt;Current threshold for the block (Tb) = 3&lt;br /&gt;
&amp;lt;dd&amp;gt;If the processor writes 4 times before the block is accessed by another processor, then according to the above logic Tb becomes 4.&lt;br /&gt;
&amp;lt;dd&amp;gt;This means Tb is at the best possible value and only updates are issued.&lt;br /&gt;
&lt;br /&gt;
=====Example 2 =====&lt;br /&gt;
&amp;lt;dd&amp;gt;Consider R = 5 and Tb = 3 for a particular block.&lt;br /&gt;
&amp;lt;dd&amp;gt;If the processor writes 10 times before the block is accessed by another processor,&lt;br /&gt;
&amp;lt;dd&amp;gt;Tb is decreased to 2.&lt;br /&gt;
&amp;lt;dd&amp;gt;The protocol then incurs a cost of 2 updates, 1 invalidation and 1 reread.&lt;br /&gt;
&amp;lt;dd&amp;gt;After 2 more such write runs, Tb will be 0 and invalidation will occur immediately. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==Simulation result==&lt;br /&gt;
For all simulated results, the subblock size ('''b''') was '''8 bytes''' and the caches were two-way set-associative ('''K = 2'''). To minimize simulation time, only relatively large caches ('''C = 128K''') were simulated.&lt;br /&gt;
&lt;br /&gt;
First, results are shown for the two applications ('''Gauss and Cholesky''') that do well with large block sizes. Next come the results for the three applications ('''MP3D, Topopt, and Pverify''') that perform better with smaller block sizes. Finally, results are reported for Barnes, for which the choice of block size is not very important. The simulations compared the protocols on both normal and sector caches with '''1, 4, 16 and 32 processors''', where the normal cache uses the Illinois protocol and the sector cache uses the read-snarfing protocol. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Simulation]&amp;lt;/ref&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:image500.png]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Fig: (a-c) Gauss (128K cache) uses large cache block &lt;br /&gt;
&lt;br /&gt;
Fig: [[File:Gauss32.png]]&lt;br /&gt;
&lt;br /&gt;
[[File:cholesky_128k.png]]&lt;br /&gt;
&lt;br /&gt;
Fig: (a-c) Cholesky (128 K cache) application uses large cache block&lt;br /&gt;
&lt;br /&gt;
[[File:gauss_32pro.png]]&lt;br /&gt;
&lt;br /&gt;
Fig: [[File:Cholesky.png]]&lt;br /&gt;
&lt;br /&gt;
[[File:mp3d.png]]&lt;br /&gt;
&lt;br /&gt;
Fig: (a-c) MP3D (128 K caches) application, which uses smaller block sizes.&lt;br /&gt;
&lt;br /&gt;
[[File:results_summary_1.png]]&lt;br /&gt;
&lt;br /&gt;
[[File:results_summary_2.png]]&lt;br /&gt;
&lt;br /&gt;
[[File:results_summary_3.png]]&lt;br /&gt;
&lt;br /&gt;
==Conclusion==&lt;br /&gt;
This article described two techniques for improving application performance on bus-based multiprocessors. The first technique is the subblock cache coherence protocol, which combines a large transfer block with small coherence subblocks to gain the advantages of spatial locality while avoiding false sharing.   &lt;br /&gt;
The second technique is read-snarfing, which reduces the number of read misses caused by earlier cache coherence protocol actions.  &lt;br /&gt;
The simulation results above show that the subblock protocol works better than the MESI protocol for a 64-byte cache block size, but for small block sizes the Illinois protocol works better because the subblock protocol uses a large transfer block.&lt;br /&gt;
In addition, the read-snarfing protocol is effective in reducing the amount of data transferred and the number of bus transactions, which results in lower bus utilization and higher application speedups. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
*[http://www.computer.org/portal/web/csdl/doi/10.1109/HPCA.1995.386536 Two techniques for improving performance on bus-based multiprocessors]&lt;br /&gt;
&lt;br /&gt;
*[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Design of an Adaptive Cache Coherence Protocol for Large Scale Multiprocessors]&lt;br /&gt;
&lt;br /&gt;
*[http://www.lfbs.rwth-aachen.de/content/smi Shared Memory Interface]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references&amp;gt;&amp;lt;/references&amp;gt;&lt;/div&gt;</summary>
		<author><name>Sbasu3</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8b_va&amp;diff=59767</id>
		<title>CSC/ECE 506 Spring 2012/8b va</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8b_va&amp;diff=59767"/>
		<updated>2012-03-19T00:57:23Z</updated>

		<summary type="html">&lt;p&gt;Sbasu3: /* Read-snarfing protocol */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Update and adaptive coherence protocols on real architectures, and power considerations=&lt;br /&gt;
 &lt;br /&gt;
===Introduction===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In a [http://en.wikipedia.org/wiki/Shared_memory shared memory], shared-bus multiprocessor system, the cache coherence protocol plays an important role in propagating changes from one cache to the others. However, update and invalidate coherence traffic is often the main source of bus contention, which increases the number of bus busy cycles and therefore program execution time, because a processor may stall while its cache waits for the bus. To reduce bus contention, this article presents some adaptive hybrid protocol strategies for high-performance multiprocessors.&amp;lt;ref&amp;gt;[http://en.wikipedia.org/wiki/Shared_memory shared memory]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Coherence protocols===&lt;br /&gt;
In general terms, a coherency protocol is a protocol which maintains the consistency between all the caches in a system of [http://en.wikipedia.org/wiki/Shared_memory shared memory]. The protocol maintains memory coherence according to a specific consistency model. Older multiprocessors support the [http://en.wikipedia.org/wiki/Sequential_consistency sequential consistency model], while modern shared memory systems typically support the release consistency or weak consistency models. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;b&amp;gt; Multiple Caches of Shared Resource &amp;lt;/b&amp;gt;&lt;br /&gt;
[[File:Cache_Coherency_Generic.png]]&lt;br /&gt;
&lt;br /&gt;
Transitions between states in any specific implementation of these protocols may vary. For example, an implementation may choose different update and invalidation transitions such as update-on-read, update-on-write, invalidate-on-read, or invalidate-on-write. The choice of transition may affect the amount of inter-cache traffic, which in turn may affect the amount of cache bandwidth available for actual work. This should be taken into consideration in the design of distributed software that could cause strong contention between the caches of multiple processors.&lt;br /&gt;
Various models and protocols have been devised for maintaining cache coherence, such as the MSI protocol, MESI (also known as the Illinois protocol), MOSI, MOESI, MERSI, MESIF, write-once, Synapse, Berkeley, Firefly and Dragon protocols.&lt;br /&gt;
&lt;br /&gt;
For the purpose of the main study, this article will focus on the two commonly used coherence protocols: &amp;lt;b&amp;gt;Write-Invalidate (WI) and Write-Update (WU)&amp;lt;/b&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=====Write-Invalidate Protocol=====&lt;br /&gt;
In a [http://cs.gmu.edu/cne/modules/dsm/purple/wr_inval.html Write-Invalidate Protocol], a processor invalidates the copies of the cache block held by all other processors and then updates its own copy without further bus operations.&lt;br /&gt;
&lt;br /&gt;
=====Write-Update Protocol=====&lt;br /&gt;
In contrast, in a [http://cs.gmu.edu/cne/modules/dsm/purple/wr_update.html Write-Update Protocol], a processor broadcasts each update to shared data so that the other caches stay coherent.&lt;br /&gt;
&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based mu1tiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=====&amp;lt;dd&amp;gt;   &amp;lt;b&amp;gt;Disadvantages of Write-Invalidate Protocol&amp;lt;/b&amp;gt;=====&lt;br /&gt;
&amp;lt;dd&amp;gt; Any write by one processor invalidates the shared copies of the cache block in other processors, forcing those caches to issue a bus request to reload the new data, which increases bus bandwidth demand. The effect is worse when one processor updates a block frequently while another processor stalls to read the same block. For a sequence of '''n''' writes by one processor, each followed by a read from another processor, the '''WI''' protocol performs '''n''' invalidations and '''n''' cache block reads.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Write-Invalidate]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=====&amp;lt;dd&amp;gt;   &amp;lt;b&amp;gt;Disadvantages of Write-Update Protocol&amp;lt;/b&amp;gt;=====&lt;br /&gt;
&amp;lt;dd&amp;gt; An update protocol is advantageous in the case above, because it only updates the cache blocks '''n''' times. However, an update protocol may keep refreshing data that other processors no longer need, so that it stays in their caches for too long and fewer cache blocks are available for more useful data. This tends to increase conflict and capacity cache misses.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Write-Update]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
====Consideration of cache architecture issue====&lt;br /&gt;
Application performance also depends directly on the cache block size. &lt;br /&gt;
Some applications execute more quickly with a large cache block size because they exhibit good spatial locality, whereas others run better with small cache blocks, which avoid the penalties of migratory data and false sharing between processors. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
A combined strategy such as an '''adaptive hybrid protocol''' can reduce the pathological behaviors of pure update and pure invalidate protocols. Such a protocol should be applicable to a wide range of network characteristics, and it should automatically adjust its behavior to achieve target goals in the face of changes in traffic patterns, node mobility and other network characteristics.  &lt;br /&gt;
&lt;br /&gt;
==Hardware architecture==&lt;br /&gt;
Most processor architectures use the same block size for memory-to-cache transfers, cache-to-memory transfers and coherence. &lt;br /&gt;
The approach described here uses different block sizes for transfer and coherence, based on application requirements.&lt;br /&gt;
A normal cache has the following parameters: capacity ('''C'''), block size ('''L''') and associativity ('''K'''). &lt;br /&gt;
A sector cache additionally divides each cache block into subblocks of size '''b'''. Compared to a normal cache ('''C, L, K'''), a sector cache ('''C, L, K, b''') requires one extra state and some extra bus lines to transmit bitmasks describing the status of the subblocks in a particular block, but it maintains the same number of tag and state bits for '''L'''. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
   &lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
===Subblock protocol===&lt;br /&gt;
This snoopy protocol combines features of the [http://en.wikipedia.org/wiki/MESI_protocol Illinois MESI protocol] and write policies with subblock validation in order to obtain the advantages of both small and large cache block sizes.  &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dt&amp;gt;&lt;br /&gt;
=====Block states=====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Invalid:&amp;lt;/b&amp;gt; All subblocks are invalid&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Valid Exclusive:&amp;lt;/b&amp;gt; None of the valid subblocks in this block is present in any other cache. All subblocks that are Clean Shared may be written without a bus transaction. Any subblock in the block may be invalid. There may also be Dirty subblocks, which must be written back upon replacement.&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Clean Shared:&amp;lt;/b&amp;gt; The block contains subblocks that are either Clean Shared or Invalid. The block can be replaced without a bus transaction.&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty Shared:&amp;lt;/b&amp;gt; Subblocks in this block may be in any state. There may be Dirty blocks which must be written back on replacement.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based mu1tiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dt&amp;gt;&lt;br /&gt;
[[File:line_protocol.png]]&lt;br /&gt;
&lt;br /&gt;
=====Subblock states=====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Invalid:&amp;lt;/b&amp;gt;  The subblock is invalid.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Clean Shared:&amp;lt;/b&amp;gt; A read access to the subblock will succeed. Unless the block containing the subblock is in the Valid Exclusive state, a write to the subblock forces an invalidation transaction on the bus.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty Shared:&amp;lt;/b&amp;gt; The subblock is treated like a Clean Shared subblock, except that it must be written back on replacement. At most one cache will have a given subblock in either the Dirty Shared or Dirty state.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty:&amp;lt;/b&amp;gt; The subblock is exclusive to this cache. It must be written back on replacement. Read and write accesses to this subblock hit with no bus transaction.&lt;br /&gt;
 &lt;br /&gt;
[[File:subblock_protocol.png]]&lt;br /&gt;
   &lt;br /&gt;
The basic idea of this protocol is to perform more cache-to-cache data transfers and fewer off-chip memory accesses.&lt;br /&gt;
In contrast with the Illinois protocol, on a read miss a cache that holds the block sends the requested subblock together with all other Clean Shared and Dirty Shared subblocks in that block. &lt;br /&gt;
If the subblock must come from main memory, the cache snoopers report which subblocks are cached, and memory provides the requested subblock along with the subblocks that are not currently cached.&lt;br /&gt;
On a write to a clean subblock or a write miss, the snooper invalidates only the subblock being written, which avoids the penalties associated with false sharing.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;b&amp;gt;In contrast with Illinois protocol, this protocol requires extra power cycle to maintain the extra state and more logic.&amp;lt;/b&amp;gt; &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based mu1tiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Read-snarfing protocol===&lt;br /&gt;
This is an enhancement to snoopy-cache coherence protocols that takes advantage of the inherent broadcast nature of the bus.&lt;br /&gt;
&lt;br /&gt;
In contrast with the MESI protocol, when one processor reloads a block into its cache after a cache miss, the read-snarfing protocol also supplies the block to the other processors whose copies were invalidated earlier. Only one read is required to restore the block in all caches that hold an invalidated copy.&lt;br /&gt;
&lt;br /&gt;
The protocol modifies the normal update protocol by updating only those subblocks that were modified, and updates need to be broadcast only while the data is actively shared.&lt;br /&gt;
The read-snarfing protocol maintains a write counter and an invalidation threshold (Tb) for each cache block “b” to overcome WI’s write-after-read problem. The protocol predicts the number of writes that occur on a cache block before the next read request to the same block. An invalidation request is broadcast when the write counter reaches Tb, and the protocol dynamically adjusts the value of Tb based on the behavior of the program.&lt;br /&gt;
&lt;br /&gt;
A simple version of the Read-snarfing Random Walk algorithm is as follows.&lt;br /&gt;
Initially, Tb of each cache block b is set to 0. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
// "most recent write run" = number of writes to the block before it is accessed by another processor&lt;br /&gt;
if (most recent write run &amp;gt; R) {&lt;br /&gt;
     if (Tb &amp;gt; 1) {&lt;br /&gt;
          Tb--;    // long write run: invalidate sooner&lt;br /&gt;
     }&lt;br /&gt;
} else {&lt;br /&gt;
     if (R &amp;gt; Tb) {&lt;br /&gt;
          Tb++;    // short write run: keep updating longer&lt;br /&gt;
     }&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
R = invalidation ratio, which is (Ci + Cr) / Cu&lt;br /&gt;
Ci: the cost in bus cycles of an invalidation transaction&lt;br /&gt;
Cu: the cost in bus cycles of an update transaction&lt;br /&gt;
Cr: the cost in bus cycles of reading a cache block&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When the threshold drops to 0, the block is invalidated immediately, with no wasted updates.&lt;br /&gt;
When the block is actively shared, Tb is adjusted upward so that the block is not invalidated. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&amp;lt;dd&amp;gt;&lt;br /&gt;
=====Example 1 =====&lt;br /&gt;
&amp;lt;dd&amp;gt;Suppose the invalidation ratio (R) = 5&lt;br /&gt;
&amp;lt;dd&amp;gt;Current threshold of the block (Tb) = 3&lt;br /&gt;
&amp;lt;dd&amp;gt;If the processor writes 4 times before the block is accessed by another processor, the write run does not exceed R and R is greater than Tb, so according to the above logic Tb will be 4.&lt;br /&gt;
&amp;lt;dd&amp;gt;This means Tb is at the best possible value and only updates are issued.&lt;br /&gt;
&lt;br /&gt;
=====Example 2 =====&lt;br /&gt;
&amp;lt;dd&amp;gt;Consider R = 5 and Tb = 3 for a particular block.&lt;br /&gt;
&amp;lt;dd&amp;gt;If the processor writes 10 times before the block is accessed by another processor, the write run exceeds R,&lt;br /&gt;
&amp;lt;dd&amp;gt;so Tb will be 2. (Decreased)&lt;br /&gt;
&amp;lt;dd&amp;gt;So the protocol can incur a cost of 2 updates, 1 invalidate and 1 reread.&lt;br /&gt;
&amp;lt;dd&amp;gt;After 2 more writes, Tb will be 0 and invalidation will occur immediately. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
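The threshold adjustment rule and the two examples above can be sketched in Python. This is only a minimal sketch that follows the pseudocode as written, including its guard that stops decrementing once Tb reaches 1. The function and parameter names (adjust_threshold, invalidation_ratio, write_run) are illustrative assumptions and do not come from the cited paper.&lt;br /&gt;

```python
def invalidation_ratio(ci, cr, cu):
    """R = (Ci + Cr) / Cu, with all costs given in bus cycles."""
    return (ci + cr) / cu


def adjust_threshold(tb, write_run, r):
    """Return the new invalidation threshold after one completed write run.

    tb        -- current threshold Tb of the block
    write_run -- writes made before another processor accessed the block
    r         -- invalidation ratio R
    """
    if write_run > r:
        # Long write run: updates were wasted, so invalidate sooner next time.
        if tb > 1:
            tb -= 1
    else:
        # Short write run: the block is actively shared, so allow more updates.
        if r > tb:
            tb += 1
    return tb


if __name__ == "__main__":
    print(adjust_threshold(tb=3, write_run=4, r=5))   # Example 1
    print(adjust_threshold(tb=3, write_run=10, r=5))  # Example 2
```

With R = 5 and Tb = 3, the two calls reproduce the values from Example 1 (Tb becomes 4) and Example 2 (Tb becomes 2).&lt;br /&gt;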
&lt;br /&gt;
==Simulation result==&lt;br /&gt;
For all simulated results, the subblock size ('''b''') was '''8 bytes''' and the caches were two-way set-associative ('''K = 2'''). To limit simulation time, only relatively large caches ('''C = 128K''') were simulated.&lt;br /&gt;
&lt;br /&gt;
First, below are the results for the two applications ('''Gauss and Cholesky''') that do well with large block sizes. Next come the results for the three applications ('''MP3D, Topopt, and Pverify''') that perform better with smaller block sizes. Finally, results are reported for Barnes, for which the choice of block size is not very important. The simulations compared conventional and sector caches with '''1, 4, 16 and 32 processors''', where the conventional cache uses the Illinois protocol and the sector cache uses the read-snarfing protocol. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Simulation]&amp;lt;/ref&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:image500.png]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Fig: (a-c) Gauss (128K cache) uses large cache block &lt;br /&gt;
&lt;br /&gt;
Fig: [[File:Gauss32.png]]&lt;br /&gt;
&lt;br /&gt;
[[File:cholesky_128k.png]]&lt;br /&gt;
&lt;br /&gt;
Fig: (a-c) Cholesky (128 K cache) application uses large cache block&lt;br /&gt;
&lt;br /&gt;
[[File:gauss_32pro.png]]&lt;br /&gt;
&lt;br /&gt;
Fig: [[File:Cholesky.png]]&lt;br /&gt;
&lt;br /&gt;
[[File:mp3d.png]]&lt;br /&gt;
&lt;br /&gt;
Fig: (a-c) MP3D (128 K caches) application, which uses smaller block sizes.&lt;br /&gt;
&lt;br /&gt;
[[File:results_summary_1.png]]&lt;br /&gt;
&lt;br /&gt;
[[File:results_summary_2.png]]&lt;br /&gt;
&lt;br /&gt;
[[File:results_summary_3.png]]&lt;br /&gt;
&lt;br /&gt;
==Conclusion==&lt;br /&gt;
This article described two techniques for improving application performance on bus-based multiprocessors. The first technique is the subblock cache coherence protocol, which combines large transfer blocks with small coherence subblocks to gain the advantages of spatial locality while avoiding false sharing.&lt;br /&gt;
The second technique is read-snarfing, which reduces the number of read misses caused by earlier cache coherence protocol actions.&lt;br /&gt;
The simulation results above show that the subblock protocol works better than the MESI protocol for the 64-byte cache block size, but for small block sizes the Illinois protocol works better because the subblock protocol uses a large transfer block.&lt;br /&gt;
In addition, the read-snarfing protocol is effective in reducing the amount of data transferred and the number of bus transactions, which results in lower bus utilization and higher application speedups.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
*[http://www.computer.org/portal/web/csdl/doi/10.1109/HPCA.1995.386536 Two techniques for improving performance on bus-based multiprocessors]&lt;br /&gt;
&lt;br /&gt;
*[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Design of an Adaptive Cache Coherence Protocol for Large Scale Multiprocessors]&lt;br /&gt;
&lt;br /&gt;
*[http://www.lfbs.rwth-aachen.de/content/smi Shared Memory Interface]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references&amp;gt;&amp;lt;/references&amp;gt;&lt;/div&gt;</summary>
		<author><name>Sbasu3</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8b_va&amp;diff=59764</id>
		<title>CSC/ECE 506 Spring 2012/8b va</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8b_va&amp;diff=59764"/>
		<updated>2012-03-19T00:53:58Z</updated>

		<summary type="html">&lt;p&gt;Sbasu3: /*   Disadvantages of Write-Update Protocol */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Update and adaptive coherence protocols on real architectures, and power considerations=&lt;br /&gt;
 &lt;br /&gt;
===Introduction===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In a [http://en.wikipedia.org/wiki/Shared_memory shared memory], shared-bus multiprocessor system, the cache coherence protocol plays an important role in propagating changes from one cache to the others. In most cases, however, update and invalidate coherence protocols are the main source of bus contention, which increases the number of busy bus cycles and therefore program execution time, because a processor may stall while its cache waits for the bus. To reduce bus contention, this article presents some adaptive hybrid protocol strategies for high-performance multiprocessors.&amp;lt;ref&amp;gt;[http://en.wikipedia.org/wiki/Shared_memory shared memory]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Coherence protocols===&lt;br /&gt;
In general terms, a coherency protocol is a protocol which maintains the consistency between all the caches in a system of [http://en.wikipedia.org/wiki/Shared_memory shared memory]. The protocol maintains memory coherence according to a specific consistency model. Older multiprocessors support the [http://en.wikipedia.org/wiki/Sequential_consistency sequential consistency model], while modern shared memory systems typically support the release consistency or weak consistency models. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;b&amp;gt; Multiple Caches of Shared Resource &amp;lt;/b&amp;gt;&lt;br /&gt;
[[File:Cache_Coherency_Generic.png]]&lt;br /&gt;
&lt;br /&gt;
Transitions between states in any specific implementation of these protocols may vary. For example, an implementation may choose different update and invalidation transitions such as update-on-read, update-on-write, invalidate-on-read, or invalidate-on-write. The choice of transition may affect the amount of inter-cache traffic, which in turn may affect the amount of cache bandwidth available for actual work. This should be taken into consideration in the design of distributed software that could cause strong contention between the caches of multiple processors.&lt;br /&gt;
Various models and protocols have been devised for maintaining cache coherence, such as the MSI protocol, MESI (also known as the Illinois protocol), MOSI, MOESI, MERSI, MESIF, write-once, Synapse, Berkeley, Firefly and Dragon protocols.&lt;br /&gt;
&lt;br /&gt;
For the purpose of the main study, this article will focus on the two commonly used coherence protocols: &amp;lt;b&amp;gt;Write-Invalidate (WI) and Write-Update (WU)&amp;lt;/b&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=====Write-Invalidate Protocol=====&lt;br /&gt;
In a [http://cs.gmu.edu/cne/modules/dsm/purple/wr_inval.html Write-Invalidate Protocol], a processor invalidates the copies of the cache block held by all other processors and then updates its own copy without further bus operations.&lt;br /&gt;
&lt;br /&gt;
=====Write-Update Protocol=====&lt;br /&gt;
Alternatively, in a [http://cs.gmu.edu/cne/modules/dsm/purple/wr_update.html Write-Update Protocol], a processor broadcasts each update to shared data to the other caches so that they stay coherent.&lt;br /&gt;
&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based mu1tiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=====&amp;lt;dd&amp;gt;   &amp;lt;b&amp;gt;Disadvantages of Write-Invalidate Protocol&amp;lt;/b&amp;gt;=====&lt;br /&gt;
&amp;lt;dd&amp;gt; Any write by a processor invalidates the shared copies of the cache block in other processors, forcing those caches to issue bus requests to reload the new data, which increases bus bandwidth demand. This is worse when one processor frequently updates a block while another processor stalls waiting to read it. For a sequence of '''n''' writes by one processor, each followed by a read from another processor, the '''WI''' protocol performs '''n''' invalidations and '''n''' cache block reads.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Write-Invalidate]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=====&amp;lt;dd&amp;gt;   &amp;lt;b&amp;gt;Disadvantages of Write-Update Protocol&amp;lt;/b&amp;gt;=====&lt;br /&gt;
&amp;lt;dd&amp;gt; The update protocol is advantageous in the write-then-read case above because it only updates the cache block n times. However, an update protocol sometimes keeps refreshing data in other processors' caches long after they have stopped using it, so fewer cache blocks are available for more useful data. This tends to increase conflict and capacity cache misses.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Write-Update]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
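The trade-off described above can be made concrete with a small Python sketch. It only counts bus cycles for '''n''' write-then-read pairs, using the per-transaction costs Ci (invalidation), Cr (block read) and Cu (update). The function names and the numeric costs are hypothetical values chosen for illustration, not measurements from the cited paper.&lt;br /&gt;

```python
def wi_bus_cycles(n, ci, cr):
    """Write-invalidate: each write/read pair costs one invalidation plus one block read."""
    return n * (ci + cr)


def wu_bus_cycles(n, cu):
    """Write-update: each write costs one update broadcast; the reader then hits in its cache."""
    return n * cu


if __name__ == "__main__":
    # Hypothetical costs in bus cycles, chosen only for illustration.
    ci, cr, cu = 2, 8, 3
    for n in (1, 10, 100):
        print(n, wi_bus_cycles(n, ci, cr), wu_bus_cycles(n, cu))
```

The sketch deliberately ignores the cost that the update protocol pays when the receiving caches no longer use the data, which is the disadvantage described above.&lt;br /&gt;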
&lt;br /&gt;
====Consideration of cache architecture issue====&lt;br /&gt;
Application execution time also depends directly on the cache block size.&lt;br /&gt;
Some applications execute more quickly with a large cache block size because they exhibit good spatial locality, whereas others run better with small cache blocks because small blocks avoid the cost of migratory data and false sharing between processors. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Using a combined strategy such as an '''adaptive hybrid protocol''' can reduce the pathological behaviors of update and invalidate protocols. Such a protocol should be applicable to a wide range of network characteristics, and it should automatically adjust its behavior to achieve its target goals in the face of changes in traffic patterns, node mobility and other network characteristics.&lt;br /&gt;
&lt;br /&gt;
==Hardware architecture==&lt;br /&gt;
Most often, a processor architecture uses the same block size for memory-to-cache transfers, cache-to-memory transfers and coherence.&lt;br /&gt;
The approach described here uses different block sizes for transfer and coherence, based on application requirements.&lt;br /&gt;
A normal cache has the following parameters: capacity ('''C'''), block size ('''L''') and associativity ('''K''').&lt;br /&gt;
A sector cache additionally divides each cache block into subblocks of size “b”. Compared with a normal cache ('''C, L, K'''), a sector cache ('''C, L, K, b''') requires one extra state and some extra bus lines to transmit bitmasks describing the status of the subblocks in a particular block, but it maintains the same number of tag and state bits for '''L'''. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
   &lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
===Subblock protocol===&lt;br /&gt;
This snoopy-based protocol combines the features of the [http://en.wikipedia.org/wiki/MESI_protocol Illinois MESI protocol] and write policies with subblock validation to gain the advantages of both small and large cache block sizes by using subblocks.  &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dt&amp;gt;&lt;br /&gt;
=====Block states=====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Invalid:&amp;lt;/b&amp;gt; All subblocks are invalid&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Valid Exclusive:&amp;lt;/b&amp;gt; None of the valid subblocks in this block are present in any other cache. All subblocks that are Clean Shared may be written without a bus transaction. Any subblock in the block may be invalid. There may also be Dirty subblocks, which must be written back upon replacement.&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Clean Shared:&amp;lt;/b&amp;gt; The block contains subblocks that are either Clean Shared or Invalid. The block can be replaced without a bus transaction.&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty Shared:&amp;lt;/b&amp;gt; Subblocks in this block may be in any state. There may be Dirty blocks which must be written back on replacement.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based mu1tiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dt&amp;gt;&lt;br /&gt;
[[File:line_protocol.png]]&lt;br /&gt;
&lt;br /&gt;
=====Subblock states=====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Invalid:&amp;lt;/b&amp;gt;  The subblock is invalid.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Clean Shared:&amp;lt;/b&amp;gt; A read access to the subblock will succeed. Unless the block that contains the subblock is in the Valid Exclusive state, a write to the subblock will force an invalidation transaction on the bus.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty Shared:&amp;lt;/b&amp;gt; The subblock is treated like a Clean Shared subblock, except that it must be written back on replacement. At most one cache will have a given subblock in either the Dirty Shared or Dirty state.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty:&amp;lt;/b&amp;gt; The subblock is exclusive to this cache. It must be written back on replacement. Read and write accesses to this subblock hit with no bus transaction.&lt;br /&gt;
 &lt;br /&gt;
[[File:subblock_protocol.png]]&lt;br /&gt;
   &lt;br /&gt;
The basic idea of this protocol is to transfer more data between caches and to access off-chip memory less often.&lt;br /&gt;
In contrast with the Illinois protocol, on a read miss the cache that holds the block supplies the requested subblock together with all other Clean Shared and Dirty Shared subblocks in that block.&lt;br /&gt;
If the requested subblock is only in main memory, the cache snoopers report which subblocks are cached, and memory provides the requested subblock along with the subblocks that are not currently cached.&lt;br /&gt;
On writes to clean subblocks and on write misses, the snooper invalidates only the subblock being written, which avoids the penalties associated with false sharing.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;b&amp;gt;Compared with the Illinois protocol, this protocol requires additional logic, and therefore additional power, to maintain the extra state.&amp;lt;/b&amp;gt; &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Read-snarfing protocol===&lt;br /&gt;
This is an enhancement to snoopy-cache coherence protocols that takes advantage of the inherent broadcast nature of the bus.&lt;br /&gt;
&lt;br /&gt;
In contrast with the MESI protocol, when one processor reloads a block into its cache after a cache miss, the read-snarfing protocol also supplies the block to the other processors whose copies were invalidated earlier. Only one read is required to restore the block in all caches that hold an invalidated copy.&lt;br /&gt;
&lt;br /&gt;
The protocol modifies the normal update protocol by updating only those subblocks that were modified, and updates need to be broadcast only while the data is actively shared.&lt;br /&gt;
The read-snarfing protocol maintains a write counter and an invalidation threshold (Tb) for each cache block “b” to overcome WI’s write-after-read problem. The protocol predicts the number of writes that occur on a cache block before the next read request to the same block. An invalidation request is broadcast when the write counter reaches Tb, and the protocol dynamically adjusts the value of Tb based on the behavior of the program.&lt;br /&gt;
&lt;br /&gt;
A simple version of the Read-snarfing Random Walk algorithm is as follows.&lt;br /&gt;
Initially, Tb of each cache block b is set to 0. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
// "most recent write run" = number of writes to the block before it is accessed by another processor&lt;br /&gt;
if (most recent write run &amp;gt; R) {&lt;br /&gt;
     if (Tb &amp;gt; 1) {&lt;br /&gt;
          Tb--;    // long write run: invalidate sooner&lt;br /&gt;
     }&lt;br /&gt;
} else {&lt;br /&gt;
     if (R &amp;gt; Tb) {&lt;br /&gt;
          Tb++;    // short write run: keep updating longer&lt;br /&gt;
     }&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
R = invalidation ratio, which is (Ci + Cr) / Cu&lt;br /&gt;
Ci: the cost in bus cycles of an invalidation transaction&lt;br /&gt;
Cu: the cost in bus cycles of an update transaction&lt;br /&gt;
Cr: the cost in bus cycles of reading a cache block&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When the threshold drops to 0, the block is invalidated immediately, with no wasted updates.&lt;br /&gt;
When the block is actively shared, Tb is adjusted upward so that the block is not invalidated. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&amp;lt;dd&amp;gt;&lt;br /&gt;
=====Example 1 =====&lt;br /&gt;
&amp;lt;dd&amp;gt;Suppose the invalidation ratio (R) = 5&lt;br /&gt;
&amp;lt;dd&amp;gt;Current threshold of the block (Tb) = 3&lt;br /&gt;
&amp;lt;dd&amp;gt;If the processor writes 4 times before the block is accessed by another processor, the write run does not exceed R and R is greater than Tb, so according to the above logic Tb will be 4.&lt;br /&gt;
&amp;lt;dd&amp;gt;This means Tb is at the best possible value and only updates are issued.&lt;br /&gt;
&lt;br /&gt;
=====Example 2 =====&lt;br /&gt;
&amp;lt;dd&amp;gt;Consider R = 5 and Tb = 3 for a particular block.&lt;br /&gt;
&amp;lt;dd&amp;gt;If the processor writes 10 times before the block is accessed by another processor, the write run exceeds R,&lt;br /&gt;
&amp;lt;dd&amp;gt;so Tb will be 2. (Decreased)&lt;br /&gt;
&amp;lt;dd&amp;gt;So the protocol can incur a cost of 2 updates, 1 invalidate and 1 reread.&lt;br /&gt;
&amp;lt;dd&amp;gt;After 2 more writes, Tb will be 0 and invalidation will occur immediately. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
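The per-block write counter described earlier in this section can be sketched as a small Python class. This is a minimal sketch under stated assumptions: the class name, method names and returned action strings are hypothetical, and the counter is checked before each update so that a threshold of 0 invalidates immediately with no wasted updates, as the text describes.&lt;br /&gt;

```python
class BlockCoherenceState:
    """Per-block write counter with invalidation threshold Tb (illustrative only)."""

    def __init__(self, tb):
        self.tb = tb          # invalidation threshold Tb
        self.writes = 0       # updates broadcast since the last remote access
        self.shared = True    # other caches still hold a valid copy

    def local_write(self):
        """Return the bus action taken for one write by the owning processor."""
        if not self.shared:
            return "none"             # copies already invalidated: no bus traffic
        if self.writes >= self.tb:
            self.shared = False
            return "invalidate"       # counter reached Tb: stop updating
        self.writes += 1
        return "update"               # still below Tb: broadcast the new value

    def remote_access(self):
        """Another processor reads the block, so a new write run starts."""
        self.writes = 0
        self.shared = True


if __name__ == "__main__":
    block = BlockCoherenceState(tb=2)
    print([block.local_write() for _ in range(4)])
```

With Tb = 2 the four writes produce two updates, one invalidation and then no further bus traffic, which matches the cost of 2 updates and 1 invalidate given in Example 2. The reread happens when the other processor next accesses the block.&lt;br /&gt;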
&lt;br /&gt;
&lt;br /&gt;
==Simulation result==&lt;br /&gt;
For all simulated results, the subblock size ('''b''') was '''8 bytes''' and the caches were two-way set-associative ('''K = 2'''). To limit simulation time, only relatively large caches ('''C = 128K''') were simulated.&lt;br /&gt;
&lt;br /&gt;
First, below are the results for the two applications ('''Gauss and Cholesky''') that do well with large block sizes. Next come the results for the three applications ('''MP3D, Topopt, and Pverify''') that perform better with smaller block sizes. Finally, results are reported for Barnes, for which the choice of block size is not very important. The simulations compared conventional and sector caches with '''1, 4, 16 and 32 processors''', where the conventional cache uses the Illinois protocol and the sector cache uses the read-snarfing protocol. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Simulation]&amp;lt;/ref&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:image500.png]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Fig: (a-c) Gauss (128K cache) uses large cache block &lt;br /&gt;
&lt;br /&gt;
Fig: [[File:Gauss32.png]]&lt;br /&gt;
&lt;br /&gt;
[[File:cholesky_128k.png]]&lt;br /&gt;
&lt;br /&gt;
Fig: (a-c) Cholesky (128 K cache) application uses large cache block&lt;br /&gt;
&lt;br /&gt;
[[File:gauss_32pro.png]]&lt;br /&gt;
&lt;br /&gt;
Fig: [[File:Cholesky.png]]&lt;br /&gt;
&lt;br /&gt;
[[File:mp3d.png]]&lt;br /&gt;
&lt;br /&gt;
Fig: (a-c) MP3D (128 K caches) application, which uses smaller block sizes.&lt;br /&gt;
&lt;br /&gt;
[[File:results_summary_1.png]]&lt;br /&gt;
&lt;br /&gt;
[[File:results_summary_2.png]]&lt;br /&gt;
&lt;br /&gt;
[[File:results_summary_3.png]]&lt;br /&gt;
&lt;br /&gt;
==Conclusion==&lt;br /&gt;
This article described two techniques for improving application performance on bus-based multiprocessors. The first technique is the subblock cache coherence protocol, which combines large transfer blocks with small coherence subblocks to gain the advantages of spatial locality while avoiding false sharing.&lt;br /&gt;
The second technique is read-snarfing, which reduces the number of read misses caused by earlier cache coherence protocol actions.&lt;br /&gt;
The simulation results above show that the subblock protocol works better than the MESI protocol for the 64-byte cache block size, but for small block sizes the Illinois protocol works better because the subblock protocol uses a large transfer block.&lt;br /&gt;
In addition, the read-snarfing protocol is effective in reducing the amount of data transferred and the number of bus transactions, which results in lower bus utilization and higher application speedups.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
*[http://www.computer.org/portal/web/csdl/doi/10.1109/HPCA.1995.386536 Two techniques for improving performance on bus-based multiprocessors]&lt;br /&gt;
&lt;br /&gt;
*[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Design of an Adaptive Cache Coherence Protocol for Large Scale Multiprocessors]&lt;br /&gt;
&lt;br /&gt;
*[http://www.lfbs.rwth-aachen.de/content/smi Shared Memory Interface]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references&amp;gt;&amp;lt;/references&amp;gt;&lt;/div&gt;</summary>
		<author><name>Sbasu3</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8b_va&amp;diff=59763</id>
		<title>CSC/ECE 506 Spring 2012/8b va</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8b_va&amp;diff=59763"/>
		<updated>2012-03-19T00:52:55Z</updated>

		<summary type="html">&lt;p&gt;Sbasu3: /*    Disadvantages of Write-Update Protocol */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Update and adaptive coherence protocols on real architectures, and power considerations=&lt;br /&gt;
 &lt;br /&gt;
===Introduction===&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In a [http://en.wikipedia.org/wiki/Shared_memory shared memory], shared-bus multiprocessor system, the cache coherence protocol plays an important role in propagating changes from one cache to the others. In most cases, however, update and invalidate coherence protocols are the main source of bus contention, which increases the number of busy bus cycles and therefore program execution time, because a processor may stall while its cache waits for the bus. To reduce bus contention, this article presents some adaptive hybrid protocol strategies for high-performance multiprocessors.&amp;lt;ref&amp;gt;[http://en.wikipedia.org/wiki/Shared_memory shared memory]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Coherence protocols===&lt;br /&gt;
In general terms, a coherency protocol is a protocol which maintains the consistency between all the caches in a system of [http://en.wikipedia.org/wiki/Shared_memory shared memory]. The protocol maintains memory coherence according to a specific consistency model. Older multiprocessors support the [http://en.wikipedia.org/wiki/Sequential_consistency sequential consistency model], while modern shared memory systems typically support the release consistency or weak consistency models. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;b&amp;gt; Multiple Caches of Shared Resource &amp;lt;/b&amp;gt;&lt;br /&gt;
[[File:Cache_Coherency_Generic.png]]&lt;br /&gt;
&lt;br /&gt;
Transitions between states in any specific implementation of these protocols may vary. For example, an implementation may choose different update and invalidation transitions such as update-on-read, update-on-write, invalidate-on-read, or invalidate-on-write. The choice of transition may affect the amount of inter-cache traffic, which in turn may affect the amount of cache bandwidth available for actual work. This should be taken into consideration in the design of distributed software that could cause strong contention between the caches of multiple processors.&lt;br /&gt;
Various models and protocols have been devised for maintaining cache coherence, such as the MSI, MESI (also known as the Illinois protocol), MOSI, MOESI, MERSI, MESIF, write-once, Synapse, Berkeley, Firefly and Dragon protocols.&lt;br /&gt;
&lt;br /&gt;
For the purpose of the main study, this article will focus on the two commonly used coherence protocols: &amp;lt;b&amp;gt;Write-Invalidate (WI) and Write-Update (WU)&amp;lt;/b&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=====Write-Invalidate Protocol=====&lt;br /&gt;
In a [http://cs.gmu.edu/cne/modules/dsm/purple/wr_inval.html Write-Invalidate Protocol], a processor invalidates the copies of the cache block held by all other processors and then updates its own copy without further bus operations.&lt;br /&gt;
&lt;br /&gt;
=====Write-Update Protocol=====&lt;br /&gt;
In contrast, in a [http://cs.gmu.edu/cne/modules/dsm/purple/wr_update.html Write-Update Protocol], a processor broadcasts updates to shared data to the other caches so that they stay coherent.&lt;br /&gt;
&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=====&amp;lt;dd&amp;gt;   &amp;lt;b&amp;gt;Disadvantages of Write-Invalidate Protocol&amp;lt;/b&amp;gt;=====&lt;br /&gt;
&amp;lt;dd&amp;gt; Any write by a processor invalidates the shared copies of the cache block in other processors, forcing those caches to issue bus requests to reload the new data, which increases bus bandwidth demand. The situation is worse if one processor frequently updates a block while another processor stalls waiting to read the same block. For a sequence of '''n''' writes by one processor, each followed by a read from another processor, the '''WI''' protocol performs '''n''' invalidations and '''n''' cache block reads.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Write-Invalidate]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=====&amp;lt;dd&amp;gt;  &amp;lt;b&amp;gt;Disadvantages of Write-Update Protocol&amp;lt;/b&amp;gt;=====&lt;br /&gt;
&amp;lt;dd&amp;gt; The update protocol is advantageous in the case above because it needs only '''n''' updates of the cache block. However, an update protocol can keep refreshing data in other processors' caches long after they have stopped using it, so fewer cache blocks are available for more useful data. This tends to increase conflict and capacity misses.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Write-Update]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
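&lt;br /&gt;
The trade-off between the two protocols can be made concrete with a small cost model. The sketch below is illustrative only: the function names and the cycle costs passed to them are assumptions chosen for the example, not values from the cited paper. It counts bus cycles for a sequence of n writes by one processor, each followed by a read from another processor.&lt;br /&gt;

```python
def write_invalidate_cost(n, c_invalidate, c_read):
    """Bus cycles for n write-then-remote-read pairs under write-invalidate.

    Each write costs one invalidation, and each following remote read
    misses and must reload the block.
    """
    return n * (c_invalidate + c_read)


def write_update_cost(n, c_update):
    """Bus cycles for the same sequence under write-update.

    Each write costs one update broadcast; the remote read then hits.
    """
    return n * c_update


if __name__ == "__main__":
    # Hypothetical cycle costs, chosen only for illustration.
    n, c_i, c_r, c_u = 10, 2, 8, 3
    print("WI:", write_invalidate_cost(n, c_i, c_r))
    print("WU:", write_update_cost(n, c_u))
```

The model deliberately ignores the cost that write-update pays for refreshing copies nobody reads any more, which is the disadvantage described above.&lt;br /&gt;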
&lt;br /&gt;
====Consideration of cache architecture issue====&lt;br /&gt;
Application execution time also depends directly on the cache block size. &lt;br /&gt;
Some applications execute more quickly with a large cache block size because they exhibit good spatial locality, whereas other applications run better with small cache blocks because small blocks avoid migratory data and false sharing between processors. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Using a combined strategy such as an '''adaptive hybrid protocol''' can reduce the pathological behaviors of update and invalidate protocols. Such a protocol should be applicable to a wide range of network characteristics and should automatically adjust its behavior to achieve its target goals in the face of changes in traffic patterns, node mobility and other network characteristics.  &lt;br /&gt;
&lt;br /&gt;
==Hardware architecture==&lt;br /&gt;
Most often, a processor architecture uses the same block size for memory-to-cache transfers, cache-to-memory transfers and coherence. &lt;br /&gt;
The approach described here uses different block sizes for transfer and for coherence, based on application requirements.&lt;br /&gt;
A normal cache has the following parameters: capacity ('''C'''), block size ('''L''') and associativity ('''K'''). &lt;br /&gt;
A sector cache additionally divides each cache block into subblocks of size '''b'''. Compared with a normal cache ('''C, L, K'''), a sector cache ('''C, L, K, b''') requires one extra state and some extra bus lines to transmit bitmasks describing the status of the subblocks in a particular block, but it keeps the same number of tag and state bits for '''L'''. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
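&lt;br /&gt;
As a rough illustration of how the four parameters relate, the following sketch derives the number of blocks, sets and subblocks per block for a sector cache. The function name and the returned dictionary are hypothetical; the example values (C = 128K, L = 64 bytes, K = 2, b = 8 bytes) are the ones quoted in the simulation and conclusion sections of this article.&lt;br /&gt;

```python
def sector_cache_geometry(capacity, block_size, associativity, subblock_size):
    """Derive block, set and subblock counts for a sector cache (C, L, K, b)."""
    if block_size % subblock_size != 0:
        raise ValueError("block size L must be a multiple of subblock size b")
    if capacity % (block_size * associativity) != 0:
        raise ValueError("capacity C must be a multiple of L * K")
    blocks = capacity // block_size
    return {
        "blocks": blocks,                          # C / L
        "sets": blocks // associativity,           # C / (L * K)
        "subblocks_per_block": block_size // subblock_size,  # L / b
    }


if __name__ == "__main__":
    print(sector_cache_geometry(128 * 1024, 64, 2, 8))
```

With these values, each 64-byte transfer block carries eight 8-byte coherence subblocks.&lt;br /&gt;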
   &lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
===Subblock protocol===&lt;br /&gt;
This snoopy-based protocol combines features of the [http://en.wikipedia.org/wiki/MESI_protocol Illinois MESI protocol] and write policies with subblock validation, using subblocks to obtain the advantages of both small and large cache block sizes.  &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dt&amp;gt;&lt;br /&gt;
=====Block states=====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Invalid:&amp;lt;/b&amp;gt; All subblocks are invalid.&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Valid Exclusive:&amp;lt;/b&amp;gt; All valid subblocks in this block are not present in any other caches. All subblocks that are clean shared may be written without a bus transaction. Any subblocks in the block may be invalid. There may also be Dirty subblocks, which must be written back upon replacement.&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Clean Shared:&amp;lt;/b&amp;gt; The block contains subblocks that are either Clean Shared or Invalid. The block can be replaced without a bus transaction.&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty Shared:&amp;lt;/b&amp;gt; Subblocks in this block may be in any state. There may be Dirty subblocks, which must be written back on replacement.&amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dt&amp;gt;&lt;br /&gt;
[[File:line_protocol.png]]&lt;br /&gt;
&lt;br /&gt;
=====Subblock states=====&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Invalid:&amp;lt;/b&amp;gt;  The subblock is invalid.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Clean Shared:&amp;lt;/b&amp;gt; A read access to the subblock will succeed. Unless the block that contains the subblock is in the Valid Exclusive state, a write to the subblock will force an invalidation transaction on the bus.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty Shared:&amp;lt;/b&amp;gt; The subblock is treated like a Clean Shared subblock, except that it must be written back on replacement. At most one cache will have a given subblock in either the Dirty Shared or Dirty state.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;dd&amp;gt; &amp;lt;b&amp;gt;Dirty:&amp;lt;/b&amp;gt; The subblock is exclusive to this cache. It must be written back on replacement. Read and write accesses to this subblock hit with no bus transaction.&lt;br /&gt;
 &lt;br /&gt;
[[File:subblock_protocol.png]]&lt;br /&gt;
   &lt;br /&gt;
The basic idea of this protocol is to perform more cache-to-cache data transfers and fewer off-chip memory accesses.&lt;br /&gt;
In contrast with the Illinois protocol, on a read miss the cache that shares the block sends the requested subblock along with all other clean and dirty shared subblocks in that block. &lt;br /&gt;
If the subblock is in main memory, the cache snoopers report which subblocks are cached, and memory provides the requested subblock along with the subblocks that are not currently cached.&lt;br /&gt;
On writes to clean subblocks and on write misses, the snooper invalidates only the subblock being written, which avoids the penalties associated with false sharing.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;b&amp;gt;In contrast with the Illinois protocol, this protocol consumes additional power to maintain the extra state and the additional logic.&amp;lt;/b&amp;gt; &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Read-snarfing protocol===&lt;br /&gt;
This is an enhancement to snoopy-cache coherence protocols that takes advantage of the inherent broadcast nature of the bus.&lt;br /&gt;
&lt;br /&gt;
In contrast with the MESI protocol, when one processor reloads a block into its cache after a cache miss, the read-snarfing protocol also supplies the block to the other processors whose copies were invalidated earlier. Only one read is required to restore the block in all caches in which it was invalidated.&lt;br /&gt;
&lt;br /&gt;
The protocol modifies the normal update protocol by updating only those subblocks that have been modified, and updates need to be broadcast only when the data is actively shared. &lt;br /&gt;
The protocol maintains a write counter and an invalidation threshold (Tb) for each cache block '''b''' to overcome the write-after-read drawback of WI. It predicts the number of writes that will occur on a cache block before the next read request to that block. An invalidation request is broadcast when the write counter reaches Tb, and the protocol dynamically adjusts Tb according to the behavior of the program. &lt;br /&gt;
&lt;br /&gt;
A simple version of the read-snarfing Random Walk algorithm is as follows.&lt;br /&gt;
Initially, the Tb of each cache block b is set to 0. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
// W = length of the most recent write run: the number of writes to the&lt;br /&gt;
//     block before it is accessed by another processor&lt;br /&gt;
if (W &amp;gt; R) {&lt;br /&gt;
	// Long write run: lower the threshold, but never below 0&lt;br /&gt;
	if (Tb &amp;gt; 0) {&lt;br /&gt;
		Tb--;&lt;br /&gt;
	}&lt;br /&gt;
} else {&lt;br /&gt;
	// Short write run: raise the threshold, but never above R&lt;br /&gt;
	if (R &amp;gt; Tb) {&lt;br /&gt;
		Tb++;&lt;br /&gt;
	}&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
R  = Invalidation ratio, which is (Ci + Cr) / Cu&lt;br /&gt;
Ci = The cost in bus cycles of an invalidation transaction&lt;br /&gt;
Cu = The cost in bus cycles of an update transaction&lt;br /&gt;
Cr = The cost in bus cycles of reading a cache block&lt;br /&gt;
&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
When the threshold drops to 0, the block is invalidated immediately, with no wasted updates.  &lt;br /&gt;
When the block is actively shared, Tb is adjusted upward so that the block is not invalidated. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors]&amp;lt;/ref&amp;gt;&lt;br /&gt;
&amp;lt;dl&amp;gt;&lt;br /&gt;
&amp;lt;dd&amp;gt;&lt;br /&gt;
=====Example 1 =====&lt;br /&gt;
&amp;lt;dd&amp;gt;Suppose the invalidation ratio (R) = 5.&lt;br /&gt;
&amp;lt;dd&amp;gt;The current threshold of the block (Tb) = 3.&lt;br /&gt;
&amp;lt;dd&amp;gt;If the processor writes 4 times before the block is accessed by another processor, then according to the logic above, Tb becomes 4.&lt;br /&gt;
&amp;lt;dd&amp;gt;This means Tb is at the best possible value and only updates are issued.&lt;br /&gt;
&lt;br /&gt;
=====Example 2 =====&lt;br /&gt;
&amp;lt;dd&amp;gt;Consider R = 5 and Tb = 3 for a particular block.&lt;br /&gt;
&amp;lt;dd&amp;gt;If the processor writes 10 times before the block is accessed by another processor,&lt;br /&gt;
&amp;lt;dd&amp;gt;Tb becomes 2 (decreased).&lt;br /&gt;
&amp;lt;dd&amp;gt;So the protocol can incur a cost of 2 updates, 1 invalidate and 1 reread.&lt;br /&gt;
&amp;lt;dd&amp;gt;After 2 more such write runs, Tb will be 0 and invalidation will occur immediately. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;/dl&amp;gt;&lt;br /&gt;
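&lt;br /&gt;
The threshold adjustment and the two examples above can be checked with a short sketch. The function names are hypothetical, and the lower bound of 0 on Tb is an assumption taken from the surrounding prose (which describes the threshold starting at and returning to 0), not a detail confirmed from the cited paper.&lt;br /&gt;

```python
def invalidation_ratio(c_invalidate, c_read, c_update):
    """R = (Ci + Cr) / Cu, with all costs in bus cycles."""
    return (c_invalidate + c_read) / c_update


def update_threshold(tb, write_run, r):
    """Return the adjusted invalidation threshold after one write run.

    tb        -- current threshold of the block
    write_run -- writes to the block before another processor accessed it
    r         -- invalidation ratio
    """
    if write_run > r:
        # Long write run: updates were wasted, so invalidate sooner.
        # Assumption: the threshold never drops below 0.
        return tb - 1 if tb > 0 else tb
    # Short write run: the block is actively shared, so invalidate later.
    return tb + 1 if r > tb else tb


if __name__ == "__main__":
    print(update_threshold(3, 4, 5))   # Example 1: R = 5, Tb = 3, 4 writes
    print(update_threshold(3, 10, 5))  # Example 2: R = 5, Tb = 3, 10 writes
```

Because each write run moves Tb by at most one step, the threshold performs a bounded walk between 0 and R as the sharing pattern changes.&lt;br /&gt;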
&lt;br /&gt;
&lt;br /&gt;
==Simulation result==&lt;br /&gt;
For all simulated results, the subblock size ('''b''') was '''8 bytes''' and the caches were two-way set-associative ('''K = 2'''). In order to minimize simulation time, only relatively large caches ('''C = 128K''') were simulated.&lt;br /&gt;
&lt;br /&gt;
First, results are given for the two applications ('''Gauss and Cholesky''') that do well with large block sizes. Next, results are given for the three applications ('''MP3D, Topopt, and Pverify''') that perform better with smaller block sizes. Finally, results are reported for Barnes, for which the choice of block size is not very important. The simulations compared the protocols on both usual and sector caches with '''1, 4, 16 and 32 processors''', where the usual cache uses the Illinois protocol and the sector cache uses the read-snarfing protocol. &amp;lt;ref&amp;gt;[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Two techniques for improving performance on bus-based multiprocessors - Simulation]&amp;lt;/ref&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:image500.png]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Fig: (a-c) Gauss (128K cache) uses large cache block &lt;br /&gt;
&lt;br /&gt;
Fig: [[File:Gauss32.png]]&lt;br /&gt;
&lt;br /&gt;
[[File:cholesky_128k.png]]&lt;br /&gt;
&lt;br /&gt;
Fig: (a-c) Cholesky (128 K cache) application uses large cache block&lt;br /&gt;
&lt;br /&gt;
[[File:gauss_32pro.png]]&lt;br /&gt;
&lt;br /&gt;
Fig: [[File:Cholesky.png]]&lt;br /&gt;
&lt;br /&gt;
[[File:mp3d.png]]&lt;br /&gt;
&lt;br /&gt;
Fig: (a-c) MP3D (128 K caches) application, which uses smaller block sizes.&lt;br /&gt;
&lt;br /&gt;
[[File:results_summary_1.png]]&lt;br /&gt;
&lt;br /&gt;
[[File:results_summary_2.png]]&lt;br /&gt;
&lt;br /&gt;
[[File:results_summary_3.png]]&lt;br /&gt;
&lt;br /&gt;
==Conclusion==&lt;br /&gt;
This article described two techniques to improve application performance on bus-based multiprocessors. The first technique is a subblock cache coherence protocol, which combines a large transfer block with small coherence subblocks to gain the advantages of spatial locality while avoiding false sharing.   &lt;br /&gt;
The second technique is read-snarfing, which reduces the number of read misses caused by earlier cache coherence actions.  &lt;br /&gt;
The simulation results above show that the subblock protocol works better than the MESI protocol for the 64-byte cache block size, but for small block sizes the Illinois protocol works better because the subblock protocol uses a large transfer block.&lt;br /&gt;
In addition, the read-snarfing protocol is effective in reducing the amount of data transferred and the number of bus transactions, which results in lower bus utilization and higher application speedups. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
*[http://www.computer.org/portal/web/csdl/doi/10.1109/HPCA.1995.386536 Two techniques for improving performance on bus-based multiprocessors]&lt;br /&gt;
&lt;br /&gt;
*[http://www.google.com/url?sa=t&amp;amp;rct=j&amp;amp;q=&amp;amp;esrc=s&amp;amp;source=web&amp;amp;cd=1&amp;amp;ved=0CCIQFjAA&amp;amp;url=ftp%3A%2F%2Fftp.cs.washington.edu%2Ftr%2F1994%2F05%2FUW-CSE-94-05-02.PS.Z&amp;amp;ei=2n9kT8gjhPDSAaO2nb4P&amp;amp;usg=AFQjCNFFRgsJiBWjKAMOHcGcRL_vkkSqLg&amp;amp;sig2=aYWddXJdXsNNIFQ5U4zoqg Design of an Adaptive Cache Coherence Protocol for Large Scale Multiprocessors]&lt;br /&gt;
&lt;br /&gt;
*[http://www.lfbs.rwth-aachen.de/content/smi Shared Memory Interface]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references&amp;gt;&amp;lt;/references&amp;gt;&lt;/div&gt;</summary>
		<author><name>Sbasu3</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/6b_am&amp;diff=59522</id>
		<title>CSC/ECE 506 Spring 2012/6b am</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/6b_am&amp;diff=59522"/>
		<updated>2012-03-06T22:40:51Z</updated>

		<summary type="html">&lt;p&gt;Sbasu3: /* Universal read/write Bufferhttp://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=342629&amp;amp;tag=1 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;With the present day processor speeds increasing at a much faster rate than memory speeds&amp;lt;ref&amp;gt;http://www.cesr.ncsu.edu/solihin/Main.html&amp;lt;/ref&amp;gt;, there arises a need that the data transactions between the processor and the memory system be managed such that the slow memory speeds do not affect the performance of the processor system. While the read operations require that the processor wait for the read to complete before resuming execution, the write operations do not have this requirement. This is where a [http://en.wikipedia.org/wiki/Write_buffer write buffer (WB)] comes into the picture, assisting the processor in writes, so that the processor can continue its operation while the write buffer takes complete responsibility of executing the write. Use of a write buffer in this manner also frees up the cache to service read requests while the write is taking place. For a processor that operates at a speed much higher than the memory speed, this hardware scheme prevents performance bottlenecks caused by long-latency writes.&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Write_buffer&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Introduction ==&lt;br /&gt;
&lt;br /&gt;
===Write Buffers in Uni-processors===&lt;br /&gt;
&lt;br /&gt;
[http://expertiza.csc.ncsu.edu/wiki/index.php/File:Uniprocessor_With_WB.png Figure 1] shows a cache-based single-processor system with a write buffer. A write to be performed is pushed into the buffer, implemented as a [http://en.wikipedia.org/wiki/FIFO FIFO] (First In - First Out) [http://en.wikipedia.org/wiki/Queue_(abstract_data_type) queue], which ensures that the writes are performed in the order in which they were issued. In a uni-processor model, where [http://en.wikipedia.org/wiki/Instruction_level_parallelism Instruction Level Parallelism (ILP)] can be extracted, the writes may also be issued out of order, provided that hardware or software protocols check the writes for any dependences that may exist in the instruction stream. &lt;br /&gt;
&lt;br /&gt;
A generic approach towards read-write accesses can be described as follows [http://people.cs.umass.edu/~weems/CmpSci635A/Lecture10/L10.24.html][http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0246a/Cjafdjgj.html]:&lt;br /&gt;
* If a write is followed by a read request to the same memory location while the write is still in the buffer, the buffered value is returned&lt;br /&gt;
* If a write is followed by a write request to the same memory location while the write is still in the buffer, the earlier write is overridden and is updated with the new write value&lt;br /&gt;
&lt;br /&gt;
It follows that write buffers save memory access overhead not only on the first write to a location, but also on closely spaced read-after-write and write-after-write sequences.&lt;br /&gt;
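&lt;br /&gt;
The two rules above can be sketched as a small model. This is a minimal illustration, not a description of any particular hardware: the class name, its methods and the dictionary used as main memory are all assumptions made for the example.&lt;br /&gt;

```python
from collections import OrderedDict


class WriteBuffer:
    """FIFO write buffer with read forwarding and write merging."""

    def __init__(self, memory):
        self.memory = memory          # backing store: address -> value
        self.pending = OrderedDict()  # buffered writes in arrival order

    def write(self, address, value):
        # A second write to a buffered address overrides the earlier value
        # in place, so only one memory access is eventually needed.
        self.pending[address] = value

    def read(self, address):
        # A read to a buffered address returns the buffered value.
        if address in self.pending:
            return self.pending[address]
        return self.memory.get(address)

    def drain_one(self):
        # Retire the oldest buffered write to memory.
        if self.pending:
            address, value = self.pending.popitem(last=False)
            self.memory[address] = value


if __name__ == "__main__":
    memory = {0x10: 1}
    wb = WriteBuffer(memory)
    wb.write(0x10, 2)
    wb.write(0x10, 3)                  # merged with the earlier write
    print(wb.read(0x10), memory[0x10])  # buffered value vs. memory value
```

Overwriting an entry keeps its original position in the queue, so merging does not reorder the remaining writes.&lt;br /&gt;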
&lt;br /&gt;
&lt;br /&gt;
[[Image:Uniprocessor With WB.png|thumb|center|400px|Figure 1.Uni-processor cache based system with write buffer ]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
''Interested readers may read about the store buffer and its implementation in ARM Cortex-R series processors.&amp;lt;ref&amp;gt;http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0363e/Chdcahcf.html&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0246a/Cjafdjgj.html&amp;lt;/ref&amp;gt;''&lt;br /&gt;
&lt;br /&gt;
===Write Buffer Issues in Multiprocessors===&lt;br /&gt;
&lt;br /&gt;
In a multiprocessor system, the same design can be extended, as shown in [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Multiprocessor_with_WB.png Figure 2]. In a generic design, each processor has its own private cache and a write buffer corresponding to that cache. The caches are connected by means of an interconnect with each other as well as with the main memory. As can be seen in [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Multiprocessor_with_WB.png Figure 2], each processor makes a write by pushing it into its write buffer, and the write buffer completes the task of performing the write to the main memory or the cache. Consistency in the ordering of memory accesses at individual processors is maintained in the same way as explained in the section [[#Write Buffers in Uni-processors|above]].&lt;br /&gt;
&lt;br /&gt;
 &lt;br /&gt;
[[Image:Multiprocessor with WB.png|thumb|center|400px|Figure 2. Cache-based multiprocessor system with write buffer &amp;lt;ref name=&amp;quot;dubois&amp;quot;&amp;gt;[http://mprc.pku.cn/mentors/training/ISCAreading/1986/p434-dubois/p434-dubois.pdf]&amp;lt;/ref&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
===The Coherence Problem===&lt;br /&gt;
&lt;br /&gt;
However, with multiple processors working on the same task, it is essential that the memory ordering be consistent not only at the individual processor level, but also with respect to all other processors in the system. Consider the case where a write (STORE A) has been issued by processor P1 into WB1, and the write is waiting to be executed. Since the write has not yet been performed to the cache or the main memory, the other processors have no knowledge of the change made to address A. As a result, a read by another processor from address A will take the value from its own cache, its own write buffer, or the main memory (depending on whether it hits or misses in the cache), and may therefore return a stale value. &lt;br /&gt;
&lt;br /&gt;
This is a classic problem in all multi-processor systems, termed the [http://en.wikipedia.org/wiki/Cache_coherence Cache Coherence] problem. It is the job of the designer to employ protocols that take care of the sequential ordering of the instructions and ensure that the writes made to any one of the caches are propagated and updated in all of the processor caches. With the addition of a write buffer, we add another level to this problem, as we now have to ensure that pending write requests in a processor's buffer do not go unnoticed by other processors.&lt;br /&gt;
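&lt;br /&gt;
The stale read described above can be reproduced with a toy model. The sketch below is an assumption-laden simplification: the class and method names are hypothetical, caches are omitted, and each processor has only a private write buffer in front of a shared memory dictionary.&lt;br /&gt;

```python
class Processor:
    """A processor with a private write buffer in front of shared memory."""

    def __init__(self, shared_memory):
        self.shared_memory = shared_memory
        self.write_buffer = {}

    def store(self, address, value):
        # The store waits in the private buffer; other processors cannot see it.
        self.write_buffer[address] = value

    def load(self, address):
        # Only this processor's own buffer is consulted before memory.
        if address in self.write_buffer:
            return self.write_buffer[address]
        return self.shared_memory[address]

    def flush(self):
        # Perform all buffered writes to shared memory.
        self.shared_memory.update(self.write_buffer)
        self.write_buffer.clear()


if __name__ == "__main__":
    memory = {"A": 0}
    p1, p2 = Processor(memory), Processor(memory)
    p1.store("A", 1)
    print(p1.load("A"), p2.load("A"))  # P1 sees its own store, P2 does not
    p1.flush()
    print(p2.load("A"))
```

Until P1's buffer is drained, or P2 is told about the pending store, the two processors disagree on the value at address A.&lt;br /&gt;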
&lt;br /&gt;
The sequential ordering model used in the system holds a strong bearing on the write buffer management policy, as is explained in a [[#Sequential Consistency|later]] section. We will now take a look at the definitions of different cache coherence policies and then move on to study various hardware implementations of the write buffer and the policies employed.&lt;br /&gt;
&lt;br /&gt;
==Coherence Models==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote&amp;gt;&amp;quot;A memory scheme is coherent if the value returned on a LOAD instruction is always the value given by the latest STORE instruction with the same address.&amp;quot;&amp;lt;ref name=&amp;quot;dubois&amp;quot; /&amp;gt;&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The two conventional models followed to attain this objective are the snoopy-bus protocol and the directory-based coherence protocol&amp;lt;ref name=&amp;quot;chapter-two&amp;quot;&amp;gt;http://www.csl.cornell.edu/~heinrich/dissertation/ChapterTwo.pdf&amp;lt;/ref&amp;gt;. The directory-based protocol is used for distributed memory systems, in which every processor keeps track of what data is being stored using a directory entry. In the snoopy-bus protocol, every processor monitors the reads and writes that are being serviced by the memory bus. The read and write requests can be broadcast on the bus for all the processors to respond based on whether or not they are in possession of that data in their cache. &lt;br /&gt;
&lt;br /&gt;
With the addition of write buffers in the system, it is now essential that a processor be aware of the memory transactions occurring not only at the caches of other processors, but also in their write buffers. The snoopy bus protocol can maintain coherence between multiple write buffers by communicating using one of the two protocols explained below. &lt;br /&gt;
&lt;br /&gt;
=== Write-Update ===&lt;br /&gt;
In this approach, a write request by a processor is first pushed onto its local write buffer and is then broadcast on the bus for all other processors to see. Each processor updates its local cache and buffer with the new value of the data. This saves bus bandwidth in terms of writes, as only one broadcast is needed for all the caches to be up to date with the newest value of the data. This is disadvantageous, however, when the block named in a snooped write transaction is present in a processor's cache: the block needs to be updated right away, stalling other read operations that may be happening on that cache.&lt;br /&gt;
&lt;br /&gt;
===Write-Invalidate===&lt;br /&gt;
This approach is common in systems where a data block is written by one processor but read by multiple processors. Whenever a write is made to shared data, it is pushed onto the local write buffer, and an invalidate signal is sent to all the caches and buffers, asking for that data block to be made invalid, because their copies no longer hold the latest value. This approach avoids the problem presented by the update protocol, as a snooped cache block is invalidated right away and does not keep any read waiting on the cache. However, it may cause additional traffic on the bus, and a separate bus for invalidation requests may be included in the design. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Write–update strategy is liable to keep and update unnecessary data in the cache - the least recently used [http://en.wikipedia.org/wiki/Cache_algorithms#Least_Recently_Used (LRU)] algorithm may evict the useful data from the cache and keep the redundant updated data. Most present day processors employ write-invalidate policy because of its ease of implementation&amp;lt;ref name=&amp;quot;chapter-two&amp;quot; /&amp;gt;. In this article, for write-propagation of all hardware and software based designs we follow the write-invalidate strategy.&lt;br /&gt;
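&lt;br /&gt;
The difference between the two snooping policies can be sketched as follows. This is a simplified model under stated assumptions: the class and function names are hypothetical, write buffers and bus arbitration are omitted, and each cache is a dictionary holding only its valid lines.&lt;br /&gt;

```python
class SnoopyCache:
    """A private cache that snoops writes broadcast on a shared bus."""

    def __init__(self, policy):
        if policy not in ("update", "invalidate"):
            raise ValueError("policy must be 'update' or 'invalidate'")
        self.policy = policy
        self.lines = {}  # address -> value, valid lines only

    def snoop_write(self, address, value):
        if address not in self.lines:
            return  # nothing cached here, so nothing to do
        if self.policy == "update":
            self.lines[address] = value  # refresh the local copy
        else:
            del self.lines[address]      # drop the stale copy


def broadcast_write(writer, others, address, value):
    """Apply a write locally and let every other cache snoop it."""
    writer.lines[address] = value
    for cache in others:
        cache.snoop_write(address, value)


if __name__ == "__main__":
    for policy in ("update", "invalidate"):
        a, b = SnoopyCache(policy), SnoopyCache(policy)
        b.lines["X"] = 1
        broadcast_write(a, [b], "X", 2)
        print(policy, b.lines)
```

Under the update policy the remote copy stays valid and current; under the invalidate policy it disappears, and the next remote read must fetch the block again.&lt;br /&gt;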
&lt;br /&gt;
== Coherence in Write Buffers ==&lt;br /&gt;
&lt;br /&gt;
===Software-Based Coherence===&lt;br /&gt;
The software technique relies on the compiler to ensure that no dependencies exist between the STORE/LOAD accesses carried out at different processors on the shared memory. The compiler, based on indications from the programmer will make sure that there are no incoherent accesses to shared memory. There is a possibility of having a shared READ/WRITE buffer between all processors to access the main memory. &lt;br /&gt;
&lt;br /&gt;
===Hardware-Based Coherence===&lt;br /&gt;
In hardware-based protocols, accesses to the shared memory are communicated using hardware invalidate signals that are broadcast to all the processor memories. Hardware support may also be required for a LOAD to be broadcast to all the caches, so that a read can be performed directly from a remote memory. There are two approaches to maintaining coherence in write buffers and caches for multiprocessor systems:&lt;br /&gt;
&lt;br /&gt;
====Unique Buffer per Processor&amp;lt;ref name=&amp;quot;dubois&amp;quot; /&amp;gt;==== &lt;br /&gt;
[[Image:Unique Buffer Per Processor.png|thumb|right|250px|Figure 3: Cache based system with unique buffer per processor &lt;br /&gt;
[[http://mprc.pku.cn/mentors/training/ISCAreading/1986/p434-dubois/p434-dubois.pdf]]]] &lt;br /&gt;
&lt;br /&gt;
In this configuration, every processor has its own unique buffer, as seen in [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Unique_Buffer_Per_Processor.png Figure 3], that holds the loads and stores of the local processor as well as the invalidates and loads of remote processors wanting access to the shared data. A store request in the local cache results in the invalidation of that data block in all other caches and main memory. A STORE is said to have been performed when &lt;br /&gt;
&lt;br /&gt;
# the request is serviced in the buffer and the cache is updated on a local hit and&lt;br /&gt;
# the request is issued to the buffer and an invalidate request for the shared data has been sent to all other processors.&lt;br /&gt;
&lt;br /&gt;
This can be achieved in two ways depending on the topology of the connection between the caches -&lt;br /&gt;
&lt;br /&gt;
# For a bus-based MP system, the LOADs that miss in the local cache and the STOREs that need invalidation in other caches are broadcast over the bus to all processor caches and memory. The STORE request is then considered performed with respect to all processors when the invalidation signals have been sent out to the private buffers of all the caches. &lt;br /&gt;
# For non-bus MP systems, the caches are connected in a point-to-point mesh-based or ring-based interconnect topology. When a store is encountered in one of the processors, the shared data is first locked in the shared memory to ensure atomic access, and then the invalidate signal is propagated along the point-to-point interconnect in a predetermined order. The STORE to shared memory is performed only when the invalidation has been propagated to all the processor caches and buffers. If any processor issues a LOAD to the same address as the one being STOREd, the LOAD request has to be rejected or buffered to be serviced at a later time, to maintain atomic access to the shared data block.&lt;br /&gt;
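&lt;br /&gt;
The bus-based case can be illustrated with a minimal Python sketch. The class name Bus, its methods and the dictionary-based caches are hypothetical and chosen only for illustration; private buffers and timing are not modelled, and a missing memory word is assumed to read as 0.&lt;br /&gt;
&lt;br /&gt;
```python
# Hypothetical model: each cache is a dict mapping an address to a value.
class Bus:
    def __init__(self, n_caches):
        self.caches = [dict() for _ in range(n_caches)]
        self.memory = {}

    def store(self, writer, addr, value):
        # Broadcast the invalidation to every other cache on the bus.
        invalidated = []
        for cid, cache in enumerate(self.caches):
            if cid != writer and addr in cache:
                del cache[addr]
                invalidated.append(cid)
        # The writer now holds the only valid copy of the block.
        self.caches[writer][addr] = value
        return invalidated

    def load(self, reader, addr):
        cache = self.caches[reader]
        if addr not in cache:
            # A miss is broadcast; a cache holding a valid copy supplies it.
            for other in self.caches:
                if addr in other:
                    cache[addr] = other[addr]
                    break
            else:
                # No cache holds the block, so memory supplies it (0 if unset).
                cache[addr] = self.memory.get(addr, 0)
        return cache[addr]
```
&lt;br /&gt;
After store(0, 0x10, 7) on a bus where caches 1 and 2 hold address 0x10, both copies are invalidated, and a later load by cache 1 misses and obtains the new value from cache 0.&lt;br /&gt;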
&lt;br /&gt;
====Separate Buffers for Local and Remote Accesses&amp;lt;ref name=&amp;quot;dubois&amp;quot; /&amp;gt;====&lt;br /&gt;
[[Image:Separate Invalidate Buffers.png|thumb|right|250px|Figure 4: Cache based system with separate write-invalidate buffers&lt;br /&gt;
[[http://mprc.pku.cn/mentors/training/ISCAreading/1986/p434-dubois/p434-dubois.pdf]]]] &lt;br /&gt;
&lt;br /&gt;
Another approach to maintaining coherence is for every processor to have separate buffers for local data requests and remote data requests. As seen in [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Separate_Invalidate_Buffers.png Figure 4], every processor has a Local-buffer, which queues the local LOADs and STOREs, and a Remote-buffer (also termed Invalidate buffer), which stores the invalidation requests and LOADs coming in from other processors.&lt;br /&gt;
In a bus-based system, this approach makes it difficult to maintain write atomicity, because different invalidate buffers may hold different numbers of invalidation requests, which makes the time at which the concerned data block is invalidated in each cache uncertain. In non-bus-based systems, however, this approach succeeds in maintaining strong coherence, provided the invalidate signal makes sure that the data is invalidated at a cache before moving on to the next processor (rather than just pushing the invalidate request into the buffer and moving on)&amp;lt;ref name=&amp;quot;dubois&amp;quot; /&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
====Universal read/write Buffer&amp;lt;ref&amp;gt;http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=342629&amp;amp;tag=1&amp;lt;/ref&amp;gt;====&lt;br /&gt;
&lt;br /&gt;
[[Image:Universal Read Write Buffer.png|thumb|right|300px|Figure 5: Universal Read/Write Buffer [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/6b_dm#cite_note-4 5]]]&lt;br /&gt;
&lt;br /&gt;
In this approach, a shared bus controller resides between the local bus and the memory bus, where the local bus is connected to all private caches of the processors (illustrated in [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Universal_Read_Write_Buffer.png Figure 5]). This technique supports multiple processors with the same or different coherence protocols, such as MSI, MESI and MOESI, under a write-invalidate strategy. The shared bus controller consists of the local bus controller, the data buffer and the address buffer. Each private cache provides two tri-state signal bits to the local bus: '''INTERVENE''' and '''SHARE'''.&lt;br /&gt;
&lt;br /&gt;
# INTERVENE is asserted when a cache wants to provide valid data to other caches.&lt;br /&gt;
# SHARE is asserted by a cache controller when the address on the bus matches one of its own tag addresses.&lt;br /&gt;
&lt;br /&gt;
=====Structural Components=====&lt;br /&gt;
&lt;br /&gt;
The '''Data Buffer''' consists of three FIFOs:&lt;br /&gt;
&lt;br /&gt;
# The ''non_block write FIFO'' receives non-cacheable memory access data from the CPU. These requests do not require any cache eviction or line-fill cycles. &lt;br /&gt;
# The ''write-back FIFO'' is used for eviction cycles. If there is a cache miss and an eviction is required, the missed data is read from memory into the cache (line-fill cycle) just after the evicted data has been moved to the FIFO. The contents of this FIFO are written back to memory later. The CPU thus gets the new data while the eviction time is hidden from it. &lt;br /&gt;
# The ''read FIFO'' stores data from the non_block write FIFO, the write-back FIFO or memory. Data is held temporarily in this FIFO until the local bus is clear and ready for new data.&lt;br /&gt;
&lt;br /&gt;
The '''Address Buffer''' also consists of three FIFOs: &lt;br /&gt;
&lt;br /&gt;
# ''Non_block write FIFO'' - stores the addresses of non-blocking accesses. The depth of this FIFO should be the same as that of the corresponding Data Buffer FIFO.&lt;br /&gt;
# ''Write-back FIFO'' - stores the starting address of an eviction cycle and holds the address until the memory bus is free.&lt;br /&gt;
# ''Line-fill FIFO'' - stores the starting address of a line-fill cycle.&lt;br /&gt;
&lt;br /&gt;
Snoop logic is started whenever there is a read access to main memory. It compares the read request address with the entries of both the write-back FIFO and the line-fill FIFO. If there is a match, the HIT flag is set and the CPU gets the data immediately through the internal bypass path. As a result, stale data is never read and the CPU does not have to wait for the memory latency.&lt;br /&gt;
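&lt;br /&gt;
The read-snoop bypass can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the class SharedBusBuffer and its method names are hypothetical, only the write-back FIFO is modelled, and each entry pairs an address with its data, whereas the design described here keeps addresses and data in separate buffers.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import deque

class SharedBusBuffer:
    def __init__(self, memory):
        self.memory = memory        # dict standing in for main memory
        self.write_back = deque()   # (address, data) pairs awaiting write-back

    def evict(self, addr, data):
        # Evicted data is parked in the FIFO instead of going to memory at once.
        self.write_back.append((addr, data))

    def read(self, addr):
        # Snoop the FIFO, newest entry first, before going to memory.
        for entry_addr, data in reversed(self.write_back):
            if entry_addr == addr:
                return data, True           # HIT: served through the bypass path
        return self.memory.get(addr), False  # no match: ordinary memory read

    def drain_one(self):
        # Called when the memory bus is free: write the oldest entry back.
        if self.write_back:
            addr, data = self.write_back.popleft()
            self.memory[addr] = data
```
&lt;br /&gt;
A read that matches a pending eviction returns the buffered data with the HIT flag set, even though main memory still holds the old value until drain_one runs.&lt;br /&gt;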
&lt;br /&gt;
For a non_block write cycle, the snoop logic block compares the newly requested address and Byte Enable bits with those previously stored in the non_block write FIFO. If there is an address match but the Byte Enable bits do not overlap, the BGO signal is asserted and the pointer of the non_block write FIFO does not move forward, which prevents the FIFO from filling up quickly.&lt;br /&gt;
&lt;br /&gt;
In this approach, read accesses have priority over write accesses, so whenever the memory bus is available, a pending read address gets direct possession of the memory bus.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The '''Local Bus Controller''' monitors the local bus for the status bits (M, O, E, S, I) and the INTERVENE# and SHARE# bits, and determines whether a transaction needs to access main memory or a private cache. For a main memory access, the controller informs the buffer with the WB (Write Back), LF (Line Fill) or O3 (owned) status signal.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Controller Algorithm'''&lt;br /&gt;
&lt;br /&gt;
''For the MSI and MESI protocols:''&lt;br /&gt;
&lt;br /&gt;
:If there is a read miss on the local bus, '''INTERVENE''' is asserted.&lt;br /&gt;
:If there is a write miss with '''M''' status on the local bus, a write-back cycle is performed and '''WB''' is asserted.&lt;br /&gt;
:If there is a read miss or write miss from any of the protocols, a line-fill cycle is performed and '''LF''' is asserted.&lt;br /&gt;
&lt;br /&gt;
''For the MOESI protocol:''&lt;br /&gt;
&lt;br /&gt;
:If there is a read miss on the local bus, '''INTERVENE''' is asserted.&lt;br /&gt;
&lt;br /&gt;
The above outputs can be written in Boolean equations as shown in the example&amp;lt;ref&amp;gt;http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=00342629&amp;lt;/ref&amp;gt; below:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
      WB = RD * MOESI * INTV + WR * [STATUS=M]&lt;br /&gt;
      LF =  INTV&lt;br /&gt;
      O3 = RD * MOESI * INTV&lt;br /&gt;
      MOESI means the cycle is initiated by the MOESI protocol.&lt;br /&gt;
      INTV means the INTERVENE signal, which is asserted when any cache provides valid data to another cache.&lt;br /&gt;
&lt;br /&gt;
=====''' Algorithms '''=====&lt;br /&gt;
&lt;br /&gt;
The algorithm followed by each of the processors in the multiprocessor system is given below. The processors may follow different protocols, i.e. MSI, MESI or MOESI, as shown in [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Universal_Read_Write_Buffer.png Figure 5], but the algorithms below ensure that memory accesses are carried out coherently.&lt;br /&gt;
&lt;br /&gt;
The ''''S'''' state is common to all three protocols, but this system takes advantage of the MESI and MOESI protocols to reduce memory bus transactions. In the shared state (S), multiple caches hold the same up-to-date data, but memory may or may not have a copy. This change requires the following protocol translation. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
''The algorithms for the '''MOESI''' protocol are as follows:''&lt;br /&gt;
::If a read hit is initiated by another cache&lt;br /&gt;
:::• The cache in the '''M''' state provides the data to the other cache &lt;br /&gt;
:::• It changes its state from '''M''' to '''O'''&lt;br /&gt;
::If the same data is hit again&lt;br /&gt;
:::• The cache in the '''O''' state is responsible for providing the data to the requesting cache&lt;br /&gt;
::If there is a write miss in the cache and INTERVENE is not asserted&lt;br /&gt;
:::• Generate a line-fill cycle&lt;br /&gt;
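&lt;br /&gt;
The first two MOESI rules above can be written as a small transition function. This is only a sketch: the function name snoop_read and the returned pair (next state, supplies data) are hypothetical, and states other than M and O are left unchanged because the rules above do not cover them.&lt;br /&gt;
&lt;br /&gt;
```python
def snoop_read(state):
    # Returns (next_state, supplies_data) for a cache that observes a read
    # hit initiated by another cache.
    if state == 'M':
        # The modified line is supplied to the requester and becomes owned.
        return 'O', True
    if state == 'O':
        # The owner keeps supplying the data on later hits.
        return 'O', True
    # Assumption: other states are not covered by the rules above.
    return state, False
```
&lt;br /&gt;
For example, a line in the M state moves to O on the first snooped read and stays in O, still supplying the data, on the next one.&lt;br /&gt;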
&lt;br /&gt;
&lt;br /&gt;
''The algorithms for the '''MSI''' and '''MESI''' protocols are as follows:''&lt;br /&gt;
''(These protocols have no '''O''' state)''&lt;br /&gt;
&lt;br /&gt;
In the MSI protocol, a cache in the '''M''' state provides data to the other caches and main memory and changes its state from '''M''' to '''S'''. This transaction requires a main memory access, but the local bus translation logic reduces bus accesses and improves performance as follows: &lt;br /&gt;
::• The local bus controller changes the status of the MOESI cache to '''O''' instead of '''S'''&lt;br /&gt;
::• '''M''' of MSI or MESI changes to '''S''' and no update occurs in main memory.&lt;br /&gt;
::So, if the same data is requested by another cache&lt;br /&gt;
:::• '''O''' of MOESI provides the data instead of main memory.&lt;br /&gt;
::If a write cycle is initiated &lt;br /&gt;
:::If the MOESI cache has a valid line in '''M''' or '''O'''  &lt;br /&gt;
::::The MOESI cache sends the data line onto the local bus and changes its state to '''I'''.&lt;br /&gt;
::::(Because the line will be written by the other cache and so become outdated.)&lt;br /&gt;
::If the status is '''E''' or '''S'''&lt;br /&gt;
:::• No cache is involved in the data transfer.&lt;br /&gt;
:::• Main memory provides the data.&lt;br /&gt;
&lt;br /&gt;
==Maintaining Sequential Consistency==&lt;br /&gt;
&lt;br /&gt;
When operating as a part of a multiprocessor, it is not enough to check dependencies between the writes only at the local level. As data is shared and the course of events in different processors may affect the outcomes of each other, it has to be ensured that the sequence of writes is preserved between the multiple processors. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote&amp;gt;“A system is said to be sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.”&amp;lt;ref name=&amp;quot;node6&amp;quot;&amp;gt;http://www.cs.jhu.edu/~gyn/publications/memorymodels/node6.html&amp;lt;/ref&amp;gt;&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&lt;br /&gt;
As mentioned earlier, the sequential ordering policy applied has a strong bearing on the write-buffer management policy. Buffer management refers to the order in which multiple buffer requests are treated. In most cases, the requests are treated in strict FIFO order, while in some cases requests may be allowed to pass each other in the buffer. This is referred to as jockeying&amp;lt;ref name=&amp;quot;dubois&amp;quot; /&amp;gt;. Jockeying is often permitted between memory requests for different memory words, but is not permitted between requests for the same memory word. This approach is called restricted jockeying&amp;lt;ref name=&amp;quot;dubois&amp;quot; /&amp;gt;. &lt;br /&gt;
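&lt;br /&gt;
Restricted jockeying can be sketched in Python as a selection rule over a buffer of pending requests. The function pick_next, the (operation, word) tuples and the can_issue predicate are hypothetical names introduced for this illustration; the predicate stands in for whatever condition lets a request proceed at a given moment.&lt;br /&gt;
&lt;br /&gt;
```python
def pick_next(buffer, can_issue):
    # buffer: list of (operation, word) requests in arrival order.
    # can_issue: predicate telling whether a request could be serviced now.
    skipped_words = set()
    for index, request in enumerate(buffer):
        operation, word = request
        # A request may pass older ones only if none of them is for its word.
        if can_issue(request) and word not in skipped_words:
            del buffer[index]
            return request
        skipped_words.add(word)
    return None
```
&lt;br /&gt;
With a buffer holding a STORE to x, a LOAD from x and a LOAD from y, and a predicate that currently lets only LOADs proceed, the LOAD from y is chosen: it may pass the stalled STORE, while the LOAD from x may not.&lt;br /&gt;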
&lt;br /&gt;
We will discuss some ordering approaches, and explore how they are significant to write buffer coherence management. &lt;br /&gt;
&lt;br /&gt;
===Strong Ordering===&lt;br /&gt;
In strong ordering, the requirements for a system are as follows&amp;lt;ref&amp;gt;http://www.ece.cmu.edu/~ece548/handouts/19coher.pdf&amp;lt;/ref&amp;gt;:&lt;br /&gt;
#All memory operations appear to execute one at a time&lt;br /&gt;
#All memory operations from a single CPU appear to execute in-order&lt;br /&gt;
#All memory operations from different processors are “cleanly” interleaved with each other (serialization)&lt;br /&gt;
&lt;br /&gt;
Serialization is ensured on a single processor by keeping all the accesses to the memory consistent and in order. This implies that there will be no jockeying allowed within the write buffer, and any write access to shared data will result in invalidation of all corresponding copies in the caches and buffers. Strong ordering thus enforces strong sequential consistency by strictly serializing local accesses and communicating shared accesses via invalidations. The serialization, however, imposes a penalty on the efficiency of the processor, as the instructions have to be performed in serial fashion for most of the time.&lt;br /&gt;
&lt;br /&gt;
===Weak Ordering===&lt;br /&gt;
In weakly ordered systems, processors can issue shared memory accesses without waiting for previous accesses to be performed. Jockeying is allowed in the buffers, and accesses can pass each other out of serial program order. However, consistency must still be maintained when accessing synchronization variables. Synchronization variables are the variables used in a multiprocessor program to synchronize the processes running on separate processors&amp;lt;ref name=&amp;quot;node6&amp;quot; /&amp;gt;. Because accesses can pass each other, weakly ordered systems are more efficient than strongly ordered systems, but their implementation is considerably more complex, since the correctness of the program must still be guaranteed.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Example===&lt;br /&gt;
An example for sequential consistency:&lt;br /&gt;
In the following program, the variables “a”, “b”, “flag1” and “flag2” are initialized to 0 and are shared by both processors (CPU1 and CPU2):&lt;br /&gt;
   &lt;br /&gt;
    a = b = flag1 = flag2 = 0;		// initial value&lt;br /&gt;
    CPU1				CPU2 &lt;br /&gt;
    flag1 = 1;				flag2 = 1;&lt;br /&gt;
    a = 1;				a = 2; &lt;br /&gt;
    r1 = a;				r3 = a;&lt;br /&gt;
    r2 = flag2;				r4 = flag1;&lt;br /&gt;
&lt;br /&gt;
The SPARC V8 architecture follows the Total Store Ordering (TSO) model, a relaxed (weak) ordering model in which reads may bypass earlier writes, but writes have to execute in program order. Under TSO, one possible result is r1 = 1, r3 = 2, r2 = r4 = 0.&lt;br /&gt;
Strong ordering, in contrast, enforces strict atomicity, so the results depend on how the instructions of the two processors interleave; r2 = r4 = 0 cannot occur, because at least one of the flag writes must precede the read of that flag by the other processor.&lt;br /&gt;
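&lt;br /&gt;
The outcomes of this example can be checked with a short Python sketch. The encoding of the two programs as tuples and the function names run, sc_outcomes and tso_store_buffer_run are hypothetical. The first function enumerates every interleaving that keeps each program in order (strong ordering); the second models a single TSO execution in which both CPUs finish before either store buffer drains, with loads forwarding from the local buffer.&lt;br /&gt;
&lt;br /&gt;
```python
from itertools import combinations

CPU1 = [('store', 'flag1', 1), ('store', 'a', 1),
        ('load', 'a', 'r1'), ('load', 'flag2', 'r2')]
CPU2 = [('store', 'flag2', 1), ('store', 'a', 2),
        ('load', 'a', 'r3'), ('load', 'flag1', 'r4')]

def run(schedule):
    # Execute one global order of operations against a single shared memory.
    memory = {'a': 0, 'b': 0, 'flag1': 0, 'flag2': 0}
    regs = {}
    for op, var, arg in schedule:
        if op == 'store':
            memory[var] = arg
        else:
            regs[arg] = memory[var]
    return regs['r1'], regs['r2'], regs['r3'], regs['r4']

def sc_outcomes():
    # Every interleaving that preserves each CPU's program order.
    outcomes = set()
    total = len(CPU1) + len(CPU2)
    for slots in combinations(range(total), len(CPU1)):
        first, second = iter(CPU1), iter(CPU2)
        schedule = [next(first) if i in slots else next(second)
                    for i in range(total)]
        outcomes.add(run(schedule))
    return outcomes

def tso_store_buffer_run():
    # Stores stay in a private buffer; loads check that buffer first.
    memory = {'a': 0, 'b': 0, 'flag1': 0, 'flag2': 0}
    regs = {}
    for program in (CPU1, CPU2):
        store_buffer = {}
        for op, var, arg in program:
            if op == 'store':
                store_buffer[var] = arg
            else:
                regs[arg] = store_buffer.get(var, memory[var])
    return regs['r1'], regs['r2'], regs['r3'], regs['r4']
```
&lt;br /&gt;
Each result is a tuple (r1, r2, r3, r4). The set returned by sc_outcomes() contains no tuple with r2 = r4 = 0, while tso_store_buffer_run() produces exactly that case.&lt;br /&gt;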
&lt;br /&gt;
==Overcoming Buffer Stalls&amp;lt;ref name=&amp;quot;store-wait-free&amp;quot;&amp;gt;http://www.eecg.toronto.edu/~moshovos/research/store-wait-free.pdf&amp;lt;/ref&amp;gt;==&lt;br /&gt;
&lt;br /&gt;
Although the write buffer off-loads the write responsibility, and hence the write access latency, from the processor, there are conditions under which processor performance cannot be improved beyond a certain point. For example, the processor needs to be stalled if the write buffer overflows. Two types of stalls mainly affect processor performance: &lt;br /&gt;
#Buffer capacity related stalls: These occur when a write buffer overflows and the processor cannot buffer any more stores. This happens during store bursts, in which a number of stores are requested in quick succession. The ''Scalable Store Buffer''&amp;lt;ref name=&amp;quot;store-wait-free&amp;quot; /&amp;gt; is a technique that aims at overcoming capacity-related stalls.&lt;br /&gt;
#Ordering related stalls: Modern processors can execute speculatively&amp;lt;ref&amp;gt;http://www.pcguide.com/ref/cpu/arch/int/featSpeculative-c.html&amp;lt;/ref&amp;gt;, which means that they can execute down a predicted path to save processor time. But there are times when the speculation goes wrong, and the system has to be brought back to a true state by draining all requests issued down the bad path. This incurs a stall penalty on the processor, as no execution can proceed until the system is rolled back to a non-speculative, true state. ''Atomic Sequence Ordering''&amp;lt;ref name=&amp;quot;store-wait-free&amp;quot; /&amp;gt; is a technique that addresses ordering-related stalls.&lt;br /&gt;
&lt;br /&gt;
The two are explained here in brief.&lt;br /&gt;
&lt;br /&gt;
===Scalable Store Buffer (SSB)===&lt;br /&gt;
&lt;br /&gt;
The SSB as mentioned above is employed to overcome the buffer capacity-related stalls in a write-buffer based system. Conventional buffers follow Total Store Ordering&amp;lt;ref name=&amp;quot;store-wait-free&amp;quot; /&amp;gt; in which store values are forwarded to matching loads, and stores maintain total order of execution. The key architecture changes employed by the SSB are :&lt;br /&gt;
#The store forwarding is performed through L1 cache, so that the CAM look-up within the write buffer to match loads with stores can be done away with. &lt;br /&gt;
##Thus the L1 cache contains data values that are private to the processor&lt;br /&gt;
##Also, the stores are now buffered as a FIFO structure called the Total Store Order Buffer (TSOB)&lt;br /&gt;
#When the stores commit, they drain into the L2 cache, where the values are globally visible. All coherence requests are serviced by the L2 cache.&lt;br /&gt;
&lt;br /&gt;
By using this approach, the effective size of the write buffer becomes equivalent to the size of the L1 cache, which makes the chance of the store buffer overflowing very small. In the event that such an overflow does happen, the stores are stalled until outstanding stores drain out, making room for newer ones. &lt;br /&gt;
&lt;br /&gt;
===Atomic Sequence Ordering (ASO)===&lt;br /&gt;
&lt;br /&gt;
ASO aims at reducing the ordering-related stalls that occur in a multiprocessor. Ordering-related stalls frequently occur when memory accesses are serviced out of order and sequential consistency would be violated; to retain sequential consistency, the processors are forced to stall. With ASO, accesses are grouped into atomic sequences such that the accesses in a sequence are always performed sequentially and atomically. Multiple such sequences may execute out of order, but the order of execution within a sequence is always obeyed. This provides a coarse-grain ordering of sequences, so that ordering stalls can be avoided.&lt;br /&gt;
&lt;br /&gt;
ASO employs the technique of check-pointing the memory accesses, so that in case of a race condition, the state of the system can be restored. A race condition is defined as a specific sequence of memory accesses that violates the sequential ordering of a program. Using this technique, we can define the system to be in one of three distinct states, as described below:&lt;br /&gt;
&lt;br /&gt;
#''Accumulate'': In this state, a check point has been created and an atomic store sequence is being put together for execution. Once the size of the atomic sequence reaches a predetermined size, the sequence moves from the Accumulate state to the Await Permission state.&lt;br /&gt;
#''Await Permission'': In this state, the atomic sequence waits for all of the accesses to be granted permission for the store. Once the permissions for all of the instructions have arrived, the sequence then moves into the Commit state.&lt;br /&gt;
#''Commit'': In the commit state, all the writes in the sequence commit and are drained into the memory. The sequence is considered to be committed once all its writes are globally visible. &lt;br /&gt;
&lt;br /&gt;
A sequence can transition directly from the ''Accumulate'' state to the ''Commit'' state, if the write permissions for all the accesses have already arrived while the sequence was in the ''Accumulate'' state. &lt;br /&gt;
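&lt;br /&gt;
The three states and the direct transition can be sketched as a small Python state machine. This is an illustrative model only: the class AtomicSequence and its methods are hypothetical names, write permissions are tracked per address as an assumption, and check-pointing and rollback are not modelled.&lt;br /&gt;
&lt;br /&gt;
```python
class AtomicSequence:
    def __init__(self, max_size):
        self.max_size = max_size   # predetermined size that closes the sequence
        self.stores = []           # (address, value) pairs in program order
        self.permitted = set()     # addresses whose write permission has arrived
        self.state = 'Accumulate'

    def _all_permitted(self):
        return all(addr in self.permitted for addr, _ in self.stores)

    def add_store(self, addr, value):
        if self.state != 'Accumulate':
            raise RuntimeError('sequence is closed')
        self.stores.append((addr, value))
        if len(self.stores) == self.max_size:
            # Skip Await Permission if every permission is already held.
            self.state = 'Commit' if self._all_permitted() else 'Await Permission'

    def grant(self, addr):
        self.permitted.add(addr)
        if self.state == 'Await Permission' and self._all_permitted():
            self.state = 'Commit'

    def commit(self, memory):
        if self.state != 'Commit':
            raise RuntimeError('permissions are still outstanding')
        # All writes of the sequence become visible together, in order.
        for addr, value in self.stores:
            memory[addr] = value
```
&lt;br /&gt;
A sequence of size 2 whose permissions arrive after it fills passes through Await Permission, whereas a sequence whose permissions were granted while it was still accumulating moves straight to Commit.&lt;br /&gt;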
&lt;br /&gt;
Thus, we can see that ASO removes ordering constraints in the store accesses such that the stalls due to ordering constraints can be reduced to a minimum.&lt;br /&gt;
&lt;br /&gt;
==Conclusions and Observations==&lt;br /&gt;
&lt;br /&gt;
Although SSB and ASO are not themselves means by which write buffers communicate in multiprocessor systems, they are techniques needed to fully utilize the write buffer hardware when using weak ordering approaches such as TSO. Interested readers can find more details about these two schemes in the cited reference&amp;lt;ref name=&amp;quot;store-wait-free&amp;quot; /&amp;gt;.&lt;br /&gt;
We can see from the above architectural techniques that using a write buffer in a cache-based multiprocessor system is very helpful in maximizing processor performance by off-loading the store latency to the write buffer. With hardware and software techniques that take care of issues such as the coherence between multiple buffers, the capacity limitations of the write buffers and the ordering policy followed in the implementation, write buffering can be a very strong technique for improving the performance of a multiprocessor system.&lt;br /&gt;
&lt;br /&gt;
=='''References'''==&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;/div&gt;</summary>
		<author><name>Sbasu3</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/6b_am&amp;diff=59384</id>
		<title>CSC/ECE 506 Spring 2012/6b am</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/6b_am&amp;diff=59384"/>
		<updated>2012-03-05T19:32:22Z</updated>

		<summary type="html">&lt;p&gt;Sbasu3: /* Universal read/write Bufferhttp://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=342629&amp;amp;tag=1 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;With the present day processor speeds increasing at a much faster rate than memory speeds&amp;lt;ref&amp;gt;http://www.cesr.ncsu.edu/solihin/Main.html&amp;lt;/ref&amp;gt;, there arises a need that the data transactions between the processor and the memory system be managed such that the slow memory speeds do not affect the performance of the processor system. While the read operations require that the processor wait for the read to complete before resuming execution, the write operations do not have this requirement. This is where a [http://en.wikipedia.org/wiki/Write_buffer write buffer (WB)] comes into the picture, assisting the processor in writes, so that the processor can continue its operation while the write buffer takes complete responsibility of executing the write. Use of a write buffer in this manner also frees up the cache to service read requests while the write is taking place. For a processor that operates at a speed much higher than the memory speed, this hardware scheme prevents performance bottlenecks caused by long-latency writes.&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Write_buffer&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Introduction ==&lt;br /&gt;
&lt;br /&gt;
===Write Buffers in Uni-processors===&lt;br /&gt;
&lt;br /&gt;
[http://expertiza.csc.ncsu.edu/wiki/index.php/File:Uniprocessor_With_WB.png Figure 1] shows a cache-based single-processor system with a write buffer. A write to be performed is pushed into the buffer, implemented as a [http://en.wikipedia.org/wiki/FIFO FIFO] (First In - First Out) [http://en.wikipedia.org/wiki/Queue_(abstract_data_type) queue], which ensures that the writes are performed in the order in which they were issued. In a uni-processor model, with the requirement and possibility of extracting [http://en.wikipedia.org/wiki/Instruction_level_parallelism Instruction Level Parallelism (ILP)], the writes may also be issued out of order, provided that hardware/software protocols are implemented to check the writes for any dependences that may exist in the instruction stream. &lt;br /&gt;
&lt;br /&gt;
A generic approach towards read-write accesses can be described as follows [http://people.cs.umass.edu/~weems/CmpSci635A/Lecture10/L10.24.html][http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0246a/Cjafdjgj.html]:&lt;br /&gt;
* If a write is followed by a read request to the same memory location while the write is still in the buffer, the buffered value is returned&lt;br /&gt;
* If a write is followed by a write request to the same memory location while the first write is still in the buffer, the earlier buffered value is overwritten with the new value&lt;br /&gt;
&lt;br /&gt;
It follows that write buffers save memory access overhead not only on the first write to a location, but also on closely spaced read-after-write and write-after-write sequences.&lt;br /&gt;
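&lt;br /&gt;
The two rules above can be sketched with a small Python model. The class WriteBuffer and its method names are hypothetical, memory is represented by a plain dictionary, and a missing memory word is assumed to read as 0.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import OrderedDict

class WriteBuffer:
    def __init__(self, memory):
        self.memory = memory
        self.pending = OrderedDict()   # address mapped to value, oldest first

    def write(self, addr, value):
        # A write to an address already in the buffer overwrites the
        # earlier entry in place instead of taking a new slot.
        self.pending[addr] = value

    def read(self, addr):
        # A read that matches a pending write returns the buffered value.
        if addr in self.pending:
            return self.pending[addr]
        return self.memory.get(addr, 0)

    def drain_one(self):
        # Perform the oldest pending write to memory.
        if self.pending:
            addr, value = self.pending.popitem(last=False)
            self.memory[addr] = value
```
&lt;br /&gt;
Two writes to the same address followed by a read cost a single memory access in this model: the second write replaces the first in the buffer, the read is served from the buffer, and only one value is eventually drained to memory.&lt;br /&gt;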
&lt;br /&gt;
&lt;br /&gt;
[[Image:Uniprocessor With WB.png|thumb|center|400px|Figure 1. Uni-processor cache-based system with write buffer ]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
''Interested readers may read about the store buffer and its implementation in ARM Cortex-R series processors.&amp;lt;ref&amp;gt;http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0363e/Chdcahcf.html&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0246a/Cjafdjgj.html&amp;lt;/ref&amp;gt;''&lt;br /&gt;
&lt;br /&gt;
===Write Buffer Issues in Multiprocessors===&lt;br /&gt;
&lt;br /&gt;
In a multiprocessor system, the same design can be extended, as shown in [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Multiprocessor_with_WB.png Figure 2]. In a generic design, each processor has its own private cache and a write buffer corresponding to the cache. The caches are connected to each other and to the main memory by means of an interconnect. As can be seen in [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Multiprocessor_with_WB.png Figure 2], each processor makes a write by pushing it into its write buffer, and the write buffer completes the task of performing the write to the main memory or the cache. Consistency in the ordering of memory accesses at individual processors is maintained in the same way as explained in the section [[#Write Buffers in Uni-processors|above]].&lt;br /&gt;
&lt;br /&gt;
 &lt;br /&gt;
[[Image:Multiprocessor with WB.png|thumb|center|400px|Figure 2. Cache-based multiprocessor system with write buffer &amp;lt;ref&amp;gt;http://mprc.pku.cn/mentors/training/ISCAreading/1986/p434-dubois/p434-dubois.pdf&amp;lt;/ref&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
===The Coherence Problem===&lt;br /&gt;
&lt;br /&gt;
However, with multiple processors working on the same task, it is essential that the memory ordering be consistent not only at the level of an individual processor, but also with respect to all other processors in the system. Consider the case where a write (STORE A) has been issued by processor P1 into WB1, and the write is waiting to be executed. Since the write has not yet been performed to the cache or the main memory, the other processors have no knowledge of the change made to address A. As a result, a read by another processor from address A will take the value from its own cache, its own write buffer, or the main memory (depending on whether it hits or misses in the cache), and none of these holds the new value. &lt;br /&gt;
&lt;br /&gt;
This is a classic problem in all multi-processor systems, known as the [http://en.wikipedia.org/wiki/Cache_coherence Cache Coherence] problem. It is the job of the designer to employ protocols that take care of the sequential ordering of the instructions and ensure that the writes made to any one of the caches are propagated to all of the processor caches. The addition of a write buffer adds another level to this problem, as we now have to ensure that pending write requests in a processor's buffer do not go unnoticed by other processors.&lt;br /&gt;
&lt;br /&gt;
The sequential ordering model used in the system holds a strong bearing on the write buffer management policy, as is explained in a [[#Sequential Consistency|later]] section. We will now take a look at the definitions of different cache coherence policies and then move on to study various hardware implementations of the write buffer and the policies employed.&lt;br /&gt;
&lt;br /&gt;
==Coherence Models==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote&amp;gt;&amp;quot;A memory scheme is coherent if the value returned on a LOAD instruction is always the value given by the latest STORE instruction with the same address.&amp;quot;&amp;lt;ref&amp;gt;http://mprc.pku.cn/mentors/training/ISCAreading/1986/p434-dubois/p434-dubois.pdf&amp;lt;/ref&amp;gt;&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The two conventional models followed to attain this objective are the snoopy-bus protocol and the directory-based coherence protocol&amp;lt;ref&amp;gt;http://www.csl.cornell.edu/~heinrich/dissertation/ChapterTwo.pdf&amp;lt;/ref&amp;gt;. The directory-based protocol is used for distributed memory systems, in which every processor keeps track of what data is being stored using a directory entry. In the snoopy-bus protocol, every processor monitors the reads and writes that are being serviced by the memory bus. The read and write requests can be broadcast on the bus for all the processors to respond based on whether or not they are in possession of that data in their cache. &lt;br /&gt;
&lt;br /&gt;
With the addition of write buffers in the system, it is now essential that a processor be aware of the memory transactions occurring not only at the caches of other processors, but also in their write buffers. The snoopy bus protocol can maintain coherence between multiple write buffers by communicating using one of the two protocols explained below. &lt;br /&gt;
&lt;br /&gt;
=== Write-Update ===&lt;br /&gt;
In this approach, a write request by a processor is first pushed onto its local write buffer and then broadcast on the bus for all other processors to see. Each processor updates its local cache and buffer with the new value of the data. This saves bus bandwidth in terms of writes, as only one broadcast is needed for all the caches to be up to date with the newest value of the data. It is disadvantageous, however, when the block targeted by a snooped write is present in a processor's cache: the block needs to be updated right away, stalling other read operations that may be in progress on that cache.&lt;br /&gt;
&lt;br /&gt;
===Write-Invalidate===&lt;br /&gt;
This approach is common in systems where a data block is written by one processor but read by multiple processors. Whenever a write is made to shared data, it is pushed onto the local write buffer, and an invalidate signal is sent to all the caches and buffers, asking for that data block to be made invalid, as their copies no longer hold the latest value. This approach avoids the problem presented by the update protocol, as a snooped cache block is invalidated right away and does not keep any read waiting on the cache. However, it may cause additional traffic on the bus, and a separate bus for invalidation requests may be included in the design. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The write-update strategy is liable to keep and update unnecessary data in the cache: the least recently used [http://en.wikipedia.org/wiki/Cache_algorithms#Least_Recently_Used (LRU)] algorithm may evict useful data from the cache and keep the redundant updated data. Most present-day processors employ the write-invalidate policy because of its ease of implementation&amp;lt;ref&amp;gt;http://www.csl.cornell.edu/~heinrich/dissertation/ChapterTwo.pdf&amp;lt;/ref&amp;gt;. In this article, all hardware- and software-based designs are assumed to use the write-invalidate strategy for write propagation.&lt;br /&gt;
&lt;br /&gt;
== Coherence in Write Buffers ==&lt;br /&gt;
&lt;br /&gt;
===Software-Based Coherence===&lt;br /&gt;
The software technique relies on the compiler to ensure that no dependencies exist between the STORE/LOAD accesses carried out at different processors on the shared memory. The compiler, based on indications from the programmer, makes sure that there are no incoherent accesses to shared memory. A shared READ/WRITE buffer between all processors may be used to access the main memory. &lt;br /&gt;
&lt;br /&gt;
===Hardware-Based Coherence===&lt;br /&gt;
In hardware-based protocols, accesses to the shared memory are communicated using hardware invalidate signals that are broadcast to all the processor memories. Hardware support may also be required for a LOAD to be broadcast to all the caches, so that a read can be performed directly from a remote memory. The following approaches maintain coherence in write buffers and caches for multiprocessor systems:&lt;br /&gt;
&lt;br /&gt;
====Unique Buffer per Processor&amp;lt;ref&amp;gt;http://mprc.pku.cn/mentors/training/ISCAreading/1986/p434-dubois/p434-dubois.pdf&amp;lt;/ref&amp;gt;==== &lt;br /&gt;
[[Image:Unique Buffer Per Processor.png|thumb|right|250px|Figure 3: Cache based system with unique buffer per processor &lt;br /&gt;
[[http://mprc.pku.cn/mentors/training/ISCAreading/1986/p434-dubois/p434-dubois.pdf]]]] &lt;br /&gt;
&lt;br /&gt;
In this configuration, every processor has its own unique buffer that holds the loads and stores of the local processor as well as the invalidates and loads of remote processors wanting access to the shared data. A store request in the local cache triggers invalidation of that data block in all other caches and main memory. A STORE is said to have been performed when &lt;br /&gt;
&lt;br /&gt;
# the request is serviced in the buffer and the cache is updated on a local hit, and&lt;br /&gt;
# the request is issued to the buffer and an invalidate request has been sent to all other processors for the shared data.&lt;br /&gt;
&lt;br /&gt;
This can be achieved in two ways, depending on the topology of the connection between the caches:&lt;br /&gt;
&lt;br /&gt;
# For a bus-based MP system, the LOADs that miss in the local cache and the STOREs that need invalidation in other caches are broadcast over the bus to all processor caches and memory. The STORE request is then considered performed with respect to all processors when the invalidation signals have been sent out to the private buffers of all the caches. &lt;br /&gt;
# For non-bus MP systems, the caches are connected in a point-to-point topology such as a mesh or a ring. When a store is encountered in one of the processors, the shared data is first locked in the shared memory to ensure atomic access, and then the invalidate signal is propagated along the point-to-point interconnect in a predetermined order. The STORE to shared memory is performed only when the invalidation has been propagated to all the processor caches and buffers. If any processor issues a LOAD to the address that is being STOREd, the LOAD request has to be rejected or buffered to be serviced at a later time, to maintain atomic access to the shared data block.&lt;br /&gt;
&lt;br /&gt;
====Separate Buffers for Local and Remote Accesses&amp;lt;ref&amp;gt;http://mprc.pku.cn/mentors/training/ISCAreading/1986/p434-dubois/p434-dubois.pdf&amp;lt;/ref&amp;gt;====&lt;br /&gt;
[[Image:Separate Invalidate Buffers.png|thumb|right|250px|Figure 4: Cache based system with separate write-invalidate buffers&lt;br /&gt;
[[http://mprc.pku.cn/mentors/training/ISCAreading/1986/p434-dubois/p434-dubois.pdf]]]] &lt;br /&gt;
&lt;br /&gt;
Another approach to maintaining coherence is for every processor to have separate buffers for local data requests and remote data requests. As seen in the figure, every processor has a Local-buffer, which queues the local LOADs and STOREs, and a Remote-buffer (also termed Invalidate buffer), which stores the invalidation requests and LOADs coming in from other processors.&lt;br /&gt;
In a bus-based system, this approach makes it difficult to maintain write atomicity, as different invalidate buffers may hold different numbers of invalidation requests, which makes the time at which the concerned data block is invalidated in each cache uncertain. In non-bus-based systems, however, this approach is successful in maintaining strong coherence, provided the invalidate signal makes sure that the data is invalidated at a cache before moving on to the next processor (rather than just pushing the invalidate request into the buffer and moving on).&lt;br /&gt;
&lt;br /&gt;
====Universal read/write Buffer&amp;lt;ref&amp;gt;http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=342629&amp;amp;tag=1&amp;lt;/ref&amp;gt;====&lt;br /&gt;
&lt;br /&gt;
[[Image:Universal Read Write Buffer.png|thumb|right|300px|Figure 5: Universal Read/Write Buffer [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/6b_dm#cite_note-4 5]]]&lt;br /&gt;
&lt;br /&gt;
In this approach, a shared bus controller resides between the local bus and the memory bus, where the local bus is connected to all private caches of the processors (illustrated in Figure 5). This technique supports multiple processors with the same or different write-invalidate coherence protocols, such as MSI, MESI and MOESI. The shared bus controller consists of the local bus controller, the data buffer and the address buffer. Each private cache provides two tri-state signals to the local bus – '''INTERVENE''' and '''SHARE'''.&lt;br /&gt;
&lt;br /&gt;
# INTERVENE is asserted when a cache wants to provide valid data to other caches.&lt;br /&gt;
# The cache controller asserts SHARE when there is an address match with one of its own tag addresses.&lt;br /&gt;
&lt;br /&gt;
=====''' Data Buffer '''=====&lt;br /&gt;
&lt;br /&gt;
The data buffer consists of three FIFOs:&lt;br /&gt;
&lt;br /&gt;
# The ''non_block write-FIFO'' receives non-cacheable memory access data from the CPU. These requests do not require any cache eviction or line-fill cycles. &lt;br /&gt;
# The ''write-back FIFO'' is used for eviction cycles. If there is a cache miss and an eviction is required, the missed data is read from memory into the cache (line-fill cycle) just after the evicted data has been moved to the FIFO. The contents of this FIFO are written back to memory later on. The CPU thus gets the new data while the eviction time is hidden from it. &lt;br /&gt;
# The ''read-FIFO'' is used for storing data from the non_block write FIFO, the write-back FIFO or memory. Data is temporarily stored in this FIFO until the local bus is cleared and ready for new data.&lt;br /&gt;
&lt;br /&gt;
=====''' Address Buffer '''=====&lt;br /&gt;
&lt;br /&gt;
The address buffer also consists of three FIFOs: &lt;br /&gt;
&lt;br /&gt;
# ''Non_block write FIFO'' - stores the addresses of non-blocking accesses. The depth of this FIFO should be the same as that of the corresponding Data Buffer FIFO.&lt;br /&gt;
# ''Write-back FIFO'' - stores the starting address of an eviction cycle and holds the address until the memory bus is free.&lt;br /&gt;
# ''Line-fill FIFO'' - stores the starting address of a line-fill cycle.&lt;br /&gt;
&lt;br /&gt;
Snoop logic is started whenever there is a read access to main memory. The snoop logic compares the read request address with the entries of both the write-back FIFO and the line-fill FIFO. If there is a match, the HIT flag is set and the CPU gets the data immediately through the internal bypass path. As a result, stale data is not read and the CPU does not have to wait out the memory latency.&lt;br /&gt;
&lt;br /&gt;
For a non_block write cycle, the snoop logic block compares the newly requested address and Byte Enable bits with the entries previously stored in the non_block write FIFO. If there is an address match but no Byte Enable overlap, the BGO signal is asserted and the pointer of the non_block write FIFO does not move forward, which prevents the FIFO from filling up quickly.&lt;br /&gt;
&lt;br /&gt;
In this approach, read accesses have priority over write accesses, so whenever the memory bus is available, a pending read address gets direct possession of the memory bus.&lt;br /&gt;
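&lt;br /&gt;
The snoop check on read accesses can be sketched as follows. This is a simplified illustration, not the design from the cited paper: the class AddressBufferSnoop and its method names are hypothetical, and each FIFO entry pairs an address with its data, whereas the article keeps addresses and data in separate buffers.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import deque


class AddressBufferSnoop:
    """Toy model of the snoop check on the write-back and line-fill FIFOs."""

    def __init__(self):
        self.write_back = deque()   # (address, data) evictions waiting for the memory bus
        self.line_fill = deque()    # (address, data) line fills in flight
        self.memory = {}

    def evict(self, addr, data):
        """Queue an evicted line instead of writing it to memory right away."""
        self.write_back.append((addr, data))

    def read(self, addr):
        """Return (data, hit); hit mirrors the HIT flag of the snoop logic."""
        for fifo in (self.line_fill, self.write_back):
            # Scan newest entries first so the most recent data is returned.
            for entry_addr, data in reversed(fifo):
                if entry_addr == addr:
                    return data, True      # bypass path, no memory access
        return self.memory.get(addr), False

    def drain_one(self):
        """Write the oldest pending eviction to memory when the bus is free."""
        if self.write_back:
            addr, data = self.write_back.popleft()
            self.memory[addr] = data
```
&lt;br /&gt;
A read that matches a pending eviction returns the buffered data with the hit flag set. Once the eviction has drained, the same read is served from memory and returns the same data, so the stale copy in memory is never observed.&lt;br /&gt;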
&lt;br /&gt;
=====''' Local Bus Controller '''=====&lt;br /&gt;
&lt;br /&gt;
The controller monitors the local bus for the status bits (M, O, E, S, I) and the INTERVENE# and SHARE# bits, and determines whether a transaction needs to access main memory or a private cache. For a main memory access, the controller informs the buffer with the WB (Write Back), LF (Line Fill) or O3 (owned) status signal.&lt;br /&gt;
&lt;br /&gt;
Algorithm to generate the controller signals:&lt;br /&gt;
&lt;br /&gt;
For the MSI and MESI protocols: &lt;br /&gt;
&lt;br /&gt;
:If there is a read miss on the local bus -'''INTERVENE''' is asserted.&lt;br /&gt;
:If there is a write miss with '''M''' status on the local bus, write back cycle will be performed -'''WB''' is asserted.&lt;br /&gt;
:If there is a read miss or write miss from any of the protocol '''LF''' cycle will be performed -'''LF''' is asserted.&lt;br /&gt;
&lt;br /&gt;
For the MOESI protocol:&lt;br /&gt;
&lt;br /&gt;
:If there is a read miss on the local bus -'''INTERVENE''' is asserted.&lt;br /&gt;
&lt;br /&gt;
The above outputs can be written in Boolean equations as shown in the example&amp;lt;ref&amp;gt;http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=00342629&amp;lt;/ref&amp;gt; below:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
      WB = RD * MOESI * INTV + WR * [STATUS=M]&lt;br /&gt;
      LF =  INTV&lt;br /&gt;
      O3 = RD * MOESI * INTV&lt;br /&gt;
      MOESI means that the cycle is initiated by a cache using the MOESI protocol.&lt;br /&gt;
      INTV is the INTERVENE signal, which is asserted when a cache provides valid data to another cache.&lt;br /&gt;
&lt;br /&gt;
=====''' Algorithms '''=====&lt;br /&gt;
&lt;br /&gt;
The algorithm followed by each of the processors in a multiprocessor system is given below. The processors may follow different protocols (MSI, MESI or MOESI), but the algorithms below ensure that the memory accesses are carried out coherently.&lt;br /&gt;
&lt;br /&gt;
The '''S''' state is common to all three protocols, but this system takes advantage of the MESI and MOESI protocols to reduce memory bus transactions. In the shared state ('''S'''), multiple caches have the same updated data but memory may or may not have a copy. This change requires the following protocol translation. &lt;br /&gt;
&lt;br /&gt;
The algorithms for the '''MOESI''' protocol are as follows:&lt;br /&gt;
::If a read hit is initiated by another cache&lt;br /&gt;
:::• The cache in the '''M''' state provides the data to the other cache &lt;br /&gt;
:::• It changes its state from '''M''' to '''O'''&lt;br /&gt;
::If the same data is hit again&lt;br /&gt;
:::• The cache in the '''O''' state is responsible for providing the data to the requesting cache.&lt;br /&gt;
&lt;br /&gt;
::If a write misses in the cache and INTERVENE is not activated&lt;br /&gt;
:::• Generate a line-fill cycle&lt;br /&gt;
&lt;br /&gt;
The algorithms for the '''MSI''' and '''MESI''' protocols are as follows. (These protocols have no '''O''' state.)&lt;br /&gt;
In the MSI protocol, a cache in the '''M''' state provides the data to the other caches and to main memory, and changes its state from '''M''' to '''S'''. This transaction requires a main memory access, but the local bus translation logic reduces bus accesses and improves performance as follows. &lt;br /&gt;
::• The local bus controller changes the status of a MOESI cache to '''O''' instead of '''S'''&lt;br /&gt;
::• '''M''' of an MSI or MESI cache changes to '''S''' and no update occurs on main memory.&lt;br /&gt;
::So, if the same data is requested by another cache&lt;br /&gt;
:::• The '''O''' cache of MOESI provides the data instead of main memory.&lt;br /&gt;
::If a write cycle is initiated &lt;br /&gt;
:::If a MOESI cache has a valid line in '''M''' or '''O'''  &lt;br /&gt;
::::The MOESI cache sends the data line out onto the local bus and changes its state to '''I'''.&lt;br /&gt;
::::(Because the line will be written by the other cache and become outdated.)&lt;br /&gt;
::If the status is '''E''' or '''S'''&lt;br /&gt;
:::• No cache is involved in the data transfer.&lt;br /&gt;
:::• Main memory provides the data.&lt;br /&gt;
&lt;br /&gt;
==Maintaining Sequential Consistency==&lt;br /&gt;
&lt;br /&gt;
When operating as part of a multiprocessor, it is not enough to check dependencies between the writes only at the local level. As data is shared and the course of events in different processors may affect each other's outcomes, it has to be ensured that the sequence of writes and the data dependencies are preserved across the processors. &lt;br /&gt;
“A system is said to be sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.”&amp;lt;ref&amp;gt;http://www.cs.jhu.edu/~gyn/publications/memorymodels/node6.html&amp;lt;/ref&amp;gt;&lt;br /&gt;
There are different approaches to this requirement, based on the expected outcome of the program. The conditions of sequential consistency and logical ordering may be relaxed according to the requirements of the program being parallelized. &lt;br /&gt;
&lt;br /&gt;
===Strong Ordering===&lt;br /&gt;
The requirements for strong ordering are as follows:&lt;br /&gt;
 &lt;br /&gt;
1) All memory operations appear to execute one at a time.&lt;br /&gt;
&lt;br /&gt;
2) All memory operations from a single CPU appear to execute in-order.&lt;br /&gt;
&lt;br /&gt;
3) All memory operations from different processors are “cleanly” interleaved with each other (serialization)&lt;br /&gt;
&lt;br /&gt;
===Total Store Ordering===&lt;br /&gt;
Requirements are as follows:&lt;br /&gt;
&lt;br /&gt;
1)	Relaxed consistency in which stores must complete in order, but a store need not complete before a later read of a given location takes place.&lt;br /&gt;
&lt;br /&gt;
2)	Reads are allowed to bypass pending writes, but writes MUST exit the store buffer in FIFO order.&lt;br /&gt;
&lt;br /&gt;
===Partial Store Ordering===&lt;br /&gt;
&lt;br /&gt;
Requirement is as follows:&lt;br /&gt;
&lt;br /&gt;
Even more relaxed consistency, in which stores to any given memory location complete in order, but stores to different locations may complete out of order, and stores need not complete before a later read of a given location takes place.&lt;br /&gt;
&lt;br /&gt;
===Weak Ordering===&lt;br /&gt;
Requirement is as follows:&lt;br /&gt;
&lt;br /&gt;
The most relaxed consistency, in which memory operations may be reordered freely except at barrier synchronization points. The global memory state must be completely settled at each synchronization point, and between synchronization points the memory state may correspond to any ordering of reads and writes.&lt;br /&gt;
&lt;br /&gt;
===Example===&lt;br /&gt;
Example for sequential consistency:&lt;br /&gt;
In the following program, the variables “a”, “b”, “flag1” and “flag2” are initialized to 0 and both processors (CPU1 and CPU2) share all the variables.&lt;br /&gt;
   &lt;br /&gt;
    a = b = flag1 = flag2 = 0;		// initial value&lt;br /&gt;
    CPU1				CPU 2 &lt;br /&gt;
    flag1 = 1;				flag2 = 1;&lt;br /&gt;
    a = 1;				a = 2; &lt;br /&gt;
    r1 = a;				r3 = a;&lt;br /&gt;
    r2 = flag2;				r4 = flag1;&lt;br /&gt;
&lt;br /&gt;
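The SPARC V8 architecture follows the Total Store Ordering model and allows a write followed by a read to complete out of program order. One possible result is therefore r1 = 1, r3 = 2, r2 = r4 = 0.&lt;br /&gt;
Strong ordering, in contrast, enforces strict atomicity, so this result cannot occur: whichever flag is written first in the interleaved order is visible to the other processor's final read, so at least one of r2 and r4 must be 1. The remaining results depend on how the instructions of the two processors are interleaved.&lt;br /&gt;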
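&lt;br /&gt;
The claim about the program above can be checked by brute force. The sketch below enumerates every execution of the two instruction streams, once with memory updated directly (strong ordering) and once with a per-CPU FIFO store buffer whose entries drain to memory at arbitrary times and are forwarded to the CPU's own loads. That buffer model is a simplified stand-in for Total Store Ordering, not the SPARC V8 definition, and the names outcomes, freeze and PROGRAMS are chosen here for illustration.&lt;br /&gt;
&lt;br /&gt;
```python
P1 = [("st", "flag1", 1), ("st", "a", 1), ("ld", "a", "r1"), ("ld", "flag2", "r2")]
P2 = [("st", "flag2", 1), ("st", "a", 2), ("ld", "a", "r3"), ("ld", "flag1", "r4")]
PROGRAMS = (P1, P2)


def freeze(mapping):
    """Turn a dict into a hashable, order-independent tuple."""
    return tuple(sorted(mapping.items()))


def outcomes(use_store_buffer):
    """Return every reachable (r1, r2, r3, r4) for the two programs."""
    results, seen = set(), set()
    # State: program counters, per-CPU store buffers, memory, registers.
    stack = [((0, 0), ((), ()), (), ())]
    while stack:
        state = stack.pop()
        if state in seen:
            continue
        seen.add(state)
        pcs, buffers, memory, regs = state
        finished = True
        for cpu in (0, 1):
            if buffers[cpu]:
                # Drain the oldest buffered store of this CPU (FIFO order).
                finished = False
                addr, value = buffers[cpu][0]
                new_memory = dict(memory)
                new_memory[addr] = value
                new_buffers = list(buffers)
                new_buffers[cpu] = buffers[cpu][1:]
                stack.append((pcs, tuple(new_buffers), freeze(new_memory), regs))
            if pcs[cpu] == len(PROGRAMS[cpu]):
                continue
            # Execute the next instruction of this CPU.
            finished = False
            kind, addr, arg = PROGRAMS[cpu][pcs[cpu]]
            new_pcs = list(pcs)
            new_pcs[cpu] += 1
            new_memory, new_regs = dict(memory), dict(regs)
            new_buffers = list(buffers)
            if kind == "st" and use_store_buffer:
                new_buffers[cpu] = buffers[cpu] + ((addr, arg),)
            elif kind == "st":
                new_memory[addr] = arg
            else:
                value = new_memory.get(addr, 0)
                for buffered_addr, buffered_value in buffers[cpu]:
                    if buffered_addr == addr:
                        value = buffered_value   # newest own store wins
                new_regs[arg] = value
            stack.append((tuple(new_pcs), tuple(new_buffers),
                          freeze(new_memory), freeze(new_regs)))
        if finished:
            final = dict(regs)
            results.add((final["r1"], final["r2"], final["r3"], final["r4"]))
    return results
```
&lt;br /&gt;
With store buffers enabled, the result set contains (1, 0, 2, 0), which is the outcome r1 = 1, r3 = 2, r2 = r4 = 0. Without them, every result has r2 = 1 or r4 = 1.&lt;br /&gt;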
&lt;br /&gt;
===Effects on Write Buffer Operation===&lt;br /&gt;
&lt;br /&gt;
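The ordering model determines how freely a write buffer may hold and drain writes. Under strong ordering, a buffered write has to be performed with respect to all processors before later memory operations of the same processor are allowed to complete. Total store ordering lets reads bypass buffered writes as long as the writes leave the buffer in FIFO order, and partial store ordering additionally lets writes to different locations leave the buffer out of order. Under weak ordering, the buffer only has to be drained completely at synchronization points.&lt;br /&gt;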
&lt;br /&gt;
=='''References'''==&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;br /&gt;
[http://mprc.pku.cn/mentors/training/ISCAreading/1986/p434-dubois/p434-dubois.pdf Memory Access Buffering In Multiprocessors] Michel Dubois, Christoph Scheurich, Faye Briggs&lt;br /&gt;
&lt;br /&gt;
[http://www.ece.cmu.edu/~ece548/handouts/19coher.pdf Multiprocessor Consistency and Coherence] Memory System Architecture, Philip Koopman&lt;br /&gt;
&lt;br /&gt;
[http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=528915 Write buffer design for cache-coherent shared-memory multiprocessors], Fernaz Mounes-Toussi, David J. Lilja&lt;/div&gt;</summary>
		<author><name>Sbasu3</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/6b_am&amp;diff=59383</id>
		<title>CSC/ECE 506 Spring 2012/6b am</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/6b_am&amp;diff=59383"/>
		<updated>2012-03-05T19:31:22Z</updated>

		<summary type="html">&lt;p&gt;Sbasu3: /* Universal read/write Bufferhttp://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=342629&amp;amp;tag=1 */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;With present-day processor speeds increasing at a much faster rate than memory speeds&amp;lt;ref&amp;gt;http://www.cesr.ncsu.edu/solihin/Main.html&amp;lt;/ref&amp;gt;, data transactions between the processor and the memory system need to be managed so that the slow memory does not limit the performance of the processor system. While a read operation requires the processor to wait for the read to complete before resuming execution, a write operation does not. This is where a [http://en.wikipedia.org/wiki/Write_buffer write buffer (WB)] comes into the picture: it assists the processor with writes, so that the processor can continue its operation while the write buffer takes complete responsibility for executing the write. Using a write buffer in this manner also frees up the cache to service read requests while the write is taking place. For a processor that operates at a speed much higher than the memory speed, this hardware scheme prevents performance bottlenecks caused by long-latency writes.&amp;lt;ref&amp;gt;http://en.wikipedia.org/wiki/Write_buffer&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Introduction ==&lt;br /&gt;
&lt;br /&gt;
===Write Buffers in Uni-processors===&lt;br /&gt;
&lt;br /&gt;
[http://expertiza.csc.ncsu.edu/wiki/index.php/File:Uniprocessor_With_WB.png Figure 1] shows a cache-based single-processor system with a write buffer. A write to be performed is pushed into the buffer, which is implemented as a [http://en.wikipedia.org/wiki/FIFO FIFO] (First In - First Out) [http://en.wikipedia.org/wiki/Queue_(abstract_data_type) queue], ensuring that the writes are performed in the order in which they were issued. In a uni-processor model that exploits [http://en.wikipedia.org/wiki/Instruction_level_parallelism Instruction Level Parallelism (ILP)], the writes may also be issued out of order, provided hardware or software mechanisms check the writes for any dependences that may exist in the instruction stream. &lt;br /&gt;
&lt;br /&gt;
A generic approach towards read-write accesses can be described as follows [http://people.cs.umass.edu/~weems/CmpSci635A/Lecture10/L10.24.html][http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0246a/Cjafdjgj.html]:&lt;br /&gt;
* If a write is followed by a read request to the same memory location while the write is still in the buffer, the buffered value is returned&lt;br /&gt;
* If a write is followed by another write request to the same memory location while the first write is still in the buffer, the buffered entry is overwritten with the new value&lt;br /&gt;
&lt;br /&gt;
It follows that write buffers save memory access overhead not only on the first write to a location, but also on closely spaced read-after-write and write-after-write sequences.&lt;br /&gt;
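&lt;br /&gt;
The two rules above can be sketched in a few lines. This is a minimal illustration under simplifying assumptions, not a description of any particular processor: the class WriteBuffer and its method names are hypothetical, memory is a plain dictionary, and the buffer drains only when drain is called.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import OrderedDict


class WriteBuffer:
    """Toy uni-processor write buffer with forwarding and write merging."""

    def __init__(self, memory):
        self.memory = memory
        self.pending = OrderedDict()   # address to value, oldest entry first
        self.memory_writes = 0         # number of writes that reached memory

    def write(self, addr, value):
        # Write-after-write: a pending entry for the same address is
        # overwritten in place instead of queuing a second memory write.
        self.pending[addr] = value

    def read(self, addr):
        # Read-after-write: forward the buffered value if one is pending.
        if addr in self.pending:
            return self.pending[addr]
        return self.memory.get(addr, 0)

    def drain(self):
        # Perform the pending writes in FIFO order.
        while self.pending:
            addr, value = self.pending.popitem(last=False)
            self.memory[addr] = value
            self.memory_writes += 1
```
&lt;br /&gt;
Writing the same address twice and then reading it returns the second value without touching memory, and draining the buffer afterwards issues a single memory write for that address.&lt;br /&gt;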
&lt;br /&gt;
&lt;br /&gt;
[[Image:Uniprocessor With WB.png|thumb|center|400px|Figure 1.Uni-processor cache based system with write buffer ]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
''Interested readers may read about the store buffer and its implementation in ARM Cortex-R series processors.&amp;lt;ref&amp;gt;http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0363e/Chdcahcf.html&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0246a/Cjafdjgj.html&amp;lt;/ref&amp;gt;''&lt;br /&gt;
&lt;br /&gt;
===Write Buffer Issues in Multiprocessors===&lt;br /&gt;
&lt;br /&gt;
In a multiprocessor system, the same design can be extended, as shown in [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Multiprocessor_with_WB.png Figure 2]. In a generic design, each processor has its own private cache and a write buffer corresponding to the cache. The caches are connected by means of an interconnect to each other as well as to the main memory. As can be seen in [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Multiprocessor_with_WB.png Figure 2], each processor makes a write by pushing it into its write buffer, and the write buffer completes the task of performing the write to the main memory or the cache. Consistency of the ordering of memory accesses at each individual processor is maintained in the same way as explained in the section [[#Write Buffers in Uni-processors|above]].&lt;br /&gt;
&lt;br /&gt;
 &lt;br /&gt;
[[Image:Multiprocessor with WB.png|thumb|center|400px|Figure 2. Cache-based multiprocessor system with write buffer &amp;lt;ref&amp;gt;http://mprc.pku.cn/mentors/training/ISCAreading/1986/p434-dubois/p434-dubois.pdf&amp;lt;/ref&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
===The Coherence Problem===&lt;br /&gt;
&lt;br /&gt;
However, with multiple processors working on the same task, it is essential that the memory ordering be consistent not only at the individual processor level, but also with respect to all other processors in the system. Consider the case where a write (STORE A) has been issued by processor P1 into WB1, and the write is waiting to be executed. Since the write has not yet been performed to the cache or the main memory, the other processors have no knowledge of the change made to address A. As a result, a read from address A by another processor takes the value from its own cache, its own write buffer, or the main memory (depending on whether it hits or misses in the cache), and that value may be stale. &lt;br /&gt;
&lt;br /&gt;
This is a classic problem in all multi-processor systems, known as the [http://en.wikipedia.org/wiki/Cache_coherence Cache Coherence] problem. It is the job of the designer to employ protocols that take care of the sequential ordering of the instructions and ensure that the writes made to any one of the caches are propagated to and updated in all of the processor caches. The addition of a write buffer adds another level to this problem, as we now have to ensure that pending write requests in a processor's buffer do not go unnoticed by other processors.&lt;br /&gt;
&lt;br /&gt;
The sequential ordering model used in the system has a strong bearing on the write buffer management policy, as explained in a [[#Maintaining Sequential Consistency|later]] section. We will now look at the definitions of different cache coherence policies and then study various hardware implementations of the write buffer and the policies they employ.&lt;br /&gt;
&lt;br /&gt;
==Coherence Models==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;blockquote&amp;gt;&amp;quot;A memory scheme is coherent if the value returned on a LOAD instruction is always the value given by the latest STORE instruction with the same address.&amp;quot;&amp;lt;ref&amp;gt;http://mprc.pku.cn/mentors/training/ISCAreading/1986/p434-dubois/p434-dubois.pdf&amp;lt;/ref&amp;gt;&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The two conventional models followed to attain this objective are the snoopy-bus protocol and the directory-based coherence protocol&amp;lt;ref&amp;gt;http://www.csl.cornell.edu/~heinrich/dissertation/ChapterTwo.pdf&amp;lt;/ref&amp;gt;. The directory-based protocol is used for distributed memory systems, in which every processor keeps track of what data is being stored using a directory entry. In the snoopy-bus protocol, every processor monitors the reads and writes that are being serviced by the memory bus. The read and write requests can be broadcast on the bus for all the processors to respond based on whether or not they are in possession of that data in their cache. &lt;br /&gt;
&lt;br /&gt;
With the addition of write buffers in the system, it is now essential that a processor be aware of the memory transactions occurring not only at the caches of other processors, but also in their write buffers. The snoopy bus protocol can maintain coherence between multiple write buffers by communicating using one of the two protocols explained below. &lt;br /&gt;
&lt;br /&gt;
=== Write-Update ===&lt;br /&gt;
In this approach, a write request by a processor is first pushed onto its local write buffer and then broadcast on the bus for all other processors to see. Each processor updates its local cache and buffer with the new value of the data. This saves bus bandwidth in terms of writes, as a single broadcast brings all the caches up to date with the newest value of the data. The disadvantage is that when the block named in a snooped write transaction is present in a processor's cache, the block must be updated right away, which stalls other read operations that may be in progress on that cache.&lt;br /&gt;
&lt;br /&gt;
===Write-Invalidate===&lt;br /&gt;
This approach is common in systems where a data block is written by one processor but read by multiple processors. Whenever a write is made to shared data, it is pushed onto the local write buffer, and an invalidate signal is sent to all the other caches and buffers, asking them to mark that data block invalid because their copy no longer holds the latest value. This approach avoids the problem presented by the update protocol, as a snooped cache block is invalidated right away and does not keep any read waiting on the cache. However, the invalidations may cause additional traffic on the bus, and a separate bus for invalidation requests may be included in the design. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The write-update strategy is liable to keep and update unnecessary data in the cache, so the least recently used [http://en.wikipedia.org/wiki/Cache_algorithms#Least_Recently_Used (LRU)] algorithm may evict useful data from the cache and keep the redundant updated data. Most present-day processors employ the write-invalidate policy because of its ease of implementation&amp;lt;ref&amp;gt;http://www.csl.cornell.edu/~heinrich/dissertation/ChapterTwo.pdf&amp;lt;/ref&amp;gt;. In this article, all hardware-based and software-based designs follow the write-invalidate strategy for write propagation.&lt;br /&gt;
&lt;br /&gt;
== Coherence in Write Buffers ==&lt;br /&gt;
&lt;br /&gt;
===Software-Based Coherence===&lt;br /&gt;
The software technique relies on the compiler to ensure that no dependencies exist between the STORE/LOAD accesses carried out at different processors on the shared memory. The compiler, based on indications from the programmer, makes sure that there are no incoherent accesses to shared memory. A shared READ/WRITE buffer between all processors may be used to access the main memory. &lt;br /&gt;
&lt;br /&gt;
===Hardware-Based Coherence===&lt;br /&gt;
In hardware-based protocols, accesses to the shared memory are communicated using hardware invalidate signals that are broadcast to all the processor memories. Hardware support may also be required for a LOAD to be broadcast to all the caches, so that a read can be performed directly from a remote memory. The following approaches maintain coherence in write buffers and caches for multiprocessor systems:&lt;br /&gt;
&lt;br /&gt;
====Unique Buffer per Processor&amp;lt;ref&amp;gt;http://mprc.pku.cn/mentors/training/ISCAreading/1986/p434-dubois/p434-dubois.pdf&amp;lt;/ref&amp;gt;==== &lt;br /&gt;
[[Image:Unique Buffer Per Processor.png|thumb|right|250px|Figure 3: Cache based system with unique buffer per processor &lt;br /&gt;
[[http://mprc.pku.cn/mentors/training/ISCAreading/1986/p434-dubois/p434-dubois.pdf]]]] &lt;br /&gt;
&lt;br /&gt;
In this configuration, every processor has its own unique buffer that holds the loads and stores of the local processor as well as the invalidates and loads of remote processors wanting access to the shared data. A store request in the local cache triggers invalidation of that data block in all other caches and main memory. A STORE is said to have been performed when &lt;br /&gt;
&lt;br /&gt;
# the request is serviced in the buffer and the cache is updated on a local hit, and&lt;br /&gt;
# the request is issued to the buffer and an invalidate request has been sent to all other processors for the shared data.&lt;br /&gt;
&lt;br /&gt;
This can be achieved in two ways, depending on the topology of the connection between the caches:&lt;br /&gt;
&lt;br /&gt;
# For a bus-based MP system, the LOADs that miss in the local cache and the STOREs that need invalidation in other caches are broadcast over the bus to all processor caches and memory. The STORE request is then considered performed with respect to all processors when the invalidation signals have been sent out to the private buffers of all the caches. &lt;br /&gt;
# For non-bus MP systems, the caches are connected in a point-to-point topology such as a mesh or a ring. When a store is encountered in one of the processors, the shared data is first locked in the shared memory to ensure atomic access, and then the invalidate signal is propagated along the point-to-point interconnect in a predetermined order. The STORE to shared memory is performed only when the invalidation has been propagated to all the processor caches and buffers. If any processor issues a LOAD to the address that is being STOREd, the LOAD request has to be rejected or buffered to be serviced at a later time, to maintain atomic access to the shared data block.&lt;br /&gt;
&lt;br /&gt;
====Separate Buffers for Local and Remote Accesses&amp;lt;ref&amp;gt;http://mprc.pku.cn/mentors/training/ISCAreading/1986/p434-dubois/p434-dubois.pdf&amp;lt;/ref&amp;gt;====&lt;br /&gt;
[[Image:Separate Invalidate Buffers.png|thumb|right|250px|Figure 4: Cache based system with separate write-invalidate buffers&lt;br /&gt;
[[http://mprc.pku.cn/mentors/training/ISCAreading/1986/p434-dubois/p434-dubois.pdf]]]] &lt;br /&gt;
&lt;br /&gt;
Another approach to maintaining coherence is for every processor to have separate buffers for local data requests and remote data requests. As seen in the figure, every processor has a Local-buffer, which queues the local LOADs and STOREs, and a Remote-buffer (also termed Invalidate buffer), which stores the invalidation requests and LOADs coming in from other processors.&lt;br /&gt;
In a bus-based system, this approach makes it difficult to maintain write atomicity, as different invalidate buffers may hold different numbers of invalidation requests, which makes the time at which the concerned data block is invalidated in each cache uncertain. In non-bus-based systems, however, this approach is successful in maintaining strong coherence, provided the invalidate signal makes sure that the data is invalidated at a cache before moving on to the next processor (rather than just pushing the invalidate request into the buffer and moving on).&lt;br /&gt;
&lt;br /&gt;
====Universal read/write Buffer&amp;lt;ref&amp;gt;http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=342629&amp;amp;tag=1&amp;lt;/ref&amp;gt;====&lt;br /&gt;
&lt;br /&gt;
[[Image:Universal Read Write Buffer.png|thumb|right|300px|Figure 5: Universal Read/Write Buffer [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/6b_dm#cite_note-4 5]]]&lt;br /&gt;
&lt;br /&gt;
In this approach, a shared bus controller resides between the local bus and the memory bus, where the local bus is connected to the private caches of all the processors (illustrated in [http://expertiza.csc.ncsu.edu/wiki/index.php/File:Uniprocessor_With_WB.png Figure 5]). This technique supports multiple processors that use the same or different write-invalidate coherence protocols, such as MSI, MESI and MOESI. The shared bus controller consists of the local bus controller, the data buffer and the address buffer. Each private cache provides two tri-state signal bits to the local bus: '''INTERVENE''' and '''SHARE'''.&lt;br /&gt;
&lt;br /&gt;
# INTERVENE is asserted when a cache wants to provide valid data to other caches.&lt;br /&gt;
# A cache controller asserts SHARE when the address on the bus matches one of its own tag addresses.&lt;br /&gt;
&lt;br /&gt;
=====''' Data Buffer '''=====&lt;br /&gt;
&lt;br /&gt;
The data buffer consists of three FIFOs:&lt;br /&gt;
&lt;br /&gt;
# The ''non_block write-FIFO'' receives non-cacheable memory access data from the CPU. These requests do not require any cache eviction or line-fill cycles. &lt;br /&gt;
# The ''write-back FIFO'' is used for eviction cycles. If there is a cache miss and an eviction is required, the missing data is read from memory into the cache (line-fill cycle) right after the evicted data has been moved to this FIFO. The FIFO contents are written back to memory later, so the CPU gets the new data while the eviction time is hidden from it. &lt;br /&gt;
# The ''read-FIFO'' stores data coming from the non_block write FIFO, the write-back FIFO or memory. Data is held in this FIFO until the local bus is free and ready for new data.&lt;br /&gt;
&lt;br /&gt;
=====''' Address Buffer '''=====&lt;br /&gt;
&lt;br /&gt;
The address buffer also consists of three FIFOs: &lt;br /&gt;
&lt;br /&gt;
# ''Non_block write FIFO'' - stores the addresses of non-blocking accesses. The depth of this FIFO should be the same as that of the corresponding Data Buffer FIFO.&lt;br /&gt;
# ''Write-back FIFO'' - stores the starting address of an eviction cycle and holds it until the memory bus is free.&lt;br /&gt;
# ''Line-fill FIFO'' - stores the starting address of a line-fill cycle.&lt;br /&gt;
&lt;br /&gt;
Snoop logic is started whenever there is a read access to main memory. It compares the read request address with the entries of both the write-back FIFO and the line-fill FIFO. If there is a match, the HIT flag is set and the CPU gets the data immediately through the internal bypass path. As a result, stale data is not read and the CPU does not have to wait for the memory latency.&lt;br /&gt;
&lt;br /&gt;
For a non_block write cycle, the snoop logic compares the newly requested address and Byte Enable bits with the entries already stored in the non_block write FIFO. If there is an address match but no Byte-Enable overlap, the BGO signal is asserted and the pointer of the non_block write FIFO does not move forward, which prevents the FIFO from filling up quickly.&lt;br /&gt;
&lt;br /&gt;
In this approach, read accesses have priority over write accesses, so whenever the memory bus becomes available, a pending read address gets possession of the memory bus first.&lt;br /&gt;
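&lt;br /&gt;
The bypass behaviour of the snoop logic can be illustrated with a small sketch. This is only a toy model under stated assumptions: the class name SharedBusBuffer and its methods are hypothetical, only the write-back FIFO is modelled (the line-fill FIFO and the BGO handling are left out), and read priority is reduced to draining writes only when the caller signals that the memory bus is idle.&lt;br /&gt;

```python
from collections import deque

class SharedBusBuffer:
    """Toy model of the write-back path of the shared bus controller (hypothetical names)."""

    def __init__(self, memory):
        self.memory = memory            # dict: address -> data
        self.write_back_fifo = deque()  # pending (address, data) evictions

    def evict(self, address, data):
        # An evicted dirty line is parked in the FIFO instead of stalling the CPU.
        self.write_back_fifo.append((address, data))

    def read(self, address):
        # Snoop logic: compare the read address with pending write-backs,
        # newest entry first, and bypass memory on a match (HIT).
        for pending_address, data in reversed(self.write_back_fifo):
            if pending_address == address:
                return data, True
        return self.memory.get(address), False

    def memory_bus_idle(self):
        # Assumption: writes drain only when no read competes for the memory bus.
        if self.write_back_fifo:
            address, data = self.write_back_fifo.popleft()
            self.memory[address] = data

memory = {0x40: "stale"}
buffer = SharedBusBuffer(memory)
buffer.evict(0x40, "fresh")
hit_result = buffer.read(0x40)    # served from the FIFO while memory is still stale
buffer.memory_bus_idle()          # the write-back finally reaches memory
miss_result = buffer.read(0x40)   # FIFO is empty, memory now holds the data
print(hit_result, miss_result)
```

In this sketch the first read returns the buffered value together with a HIT flag, which is the situation in which the real controller avoids both the stale read and the memory latency.&lt;br /&gt;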
&lt;br /&gt;
=====''' Local Bus Controller '''=====&lt;br /&gt;
&lt;br /&gt;
The controller monitors the local bus for the status bits (M, O, E, S, I) and the INTERVENE# and SHARE# bits, and determines whether a transaction needs to access main memory or a private cache. For a main memory access, the controller informs the buffer with the WB (Write Back), LF (Line Fill) or O3 (Owned) signal.&lt;br /&gt;
&lt;br /&gt;
The controller signals are generated as follows:&lt;br /&gt;
&lt;br /&gt;
For the MSI and MESI protocols: &lt;br /&gt;
&lt;br /&gt;
:If there is a read miss on the local bus, '''INTERVENE''' is asserted.&lt;br /&gt;
:If there is a write miss with '''M''' status on the local bus, a write-back cycle is performed and '''WB''' is asserted.&lt;br /&gt;
:If there is a read miss or a write miss under any of the protocols, a line-fill cycle is performed and '''LF''' is asserted.&lt;br /&gt;
&lt;br /&gt;
For the MOESI protocol:&lt;br /&gt;
&lt;br /&gt;
:If there is a read miss on the local bus, '''INTERVENE''' is asserted.&lt;br /&gt;
&lt;br /&gt;
The above outputs can be written in Boolean equations as shown in the example&amp;lt;ref&amp;gt;http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=00342629&amp;lt;/ref&amp;gt; below:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
      WB = RD * MOESI * INTV + WR * [STATUS=M]&lt;br /&gt;
      LF =  INTV&lt;br /&gt;
      O3 = RD * MOESI * INTV&lt;br /&gt;
      MOESI means the cycle is initiated by a cache using the MOESI protocol.&lt;br /&gt;
      INTV is the INTERVENE signal, which is asserted when a cache provides valid data to another cache.&lt;br /&gt;
&lt;br /&gt;
=====''' Algorithms '''=====&lt;br /&gt;
&lt;br /&gt;
The algorithms followed by each of the processors in a multiprocessor system are given below. The processors may follow different protocols (MSI, MESI or MOESI), but the algorithms below ensure that memory accesses are carried out coherently.&lt;br /&gt;
&lt;br /&gt;
The '''S''' state is common to all three protocols, but this system takes advantage of the MESI and MOESI protocols to reduce memory bus transactions. In the shared state ('''S'''), multiple caches hold the same up-to-date data, but memory may or may not have a current copy. This change requires the following protocol translation. &lt;br /&gt;
&lt;br /&gt;
The algorithms for the '''MOESI''' protocol are as follows:&lt;br /&gt;
::If a read hit is initiated by another cache&lt;br /&gt;
:::• The cache holding the line in the '''M''' state provides the data to the other cache &lt;br /&gt;
:::• It changes its state from '''M''' to '''O'''&lt;br /&gt;
::If the same data is hit again&lt;br /&gt;
:::• The cache holding the line in the '''O''' state is responsible for providing the data to the requesting cache.&lt;br /&gt;
&lt;br /&gt;
::If there is a write miss in the cache and INTERVENE is not asserted&lt;br /&gt;
:::• Generate a line-fill cycle&lt;br /&gt;
&lt;br /&gt;
The algorithms for the '''MSI''' and '''MESI''' protocols are as follows (these protocols have no '''O''' state).&lt;br /&gt;
In the plain MSI protocol, a cache holding the line in the '''M''' state provides the data to the other caches and to main memory, and changes its state from '''M''' to '''S'''. This transaction requires a main memory access, but the local bus translation logic reduces the bus accesses and improves performance as follows: &lt;br /&gt;
::• The local bus controller changes the status of a MOESI cache to '''O''' instead of '''S'''&lt;br /&gt;
::• '''M''' of an MSI or MESI cache changes to '''S''' and main memory is not updated.&lt;br /&gt;
::So, if the same data is requested by another cache&lt;br /&gt;
:::• The MOESI cache in the '''O''' state provides the data instead of main memory.&lt;br /&gt;
::If a write cycle is initiated &lt;br /&gt;
:::If a MOESI cache has a valid line in the '''M''' or '''O''' state &lt;br /&gt;
::::The MOESI cache sends the data line out onto the local bus and changes its state to '''I'''.&lt;br /&gt;
::::(Because the line will be written by the other cache and so become outdated.)&lt;br /&gt;
::If the status is '''E''' or '''S'''&lt;br /&gt;
:::• No cache is involved in the data transfer.&lt;br /&gt;
:::• Main memory provides the data.&lt;br /&gt;
&lt;br /&gt;
==Maintaining Sequential Consistency==&lt;br /&gt;
&lt;br /&gt;
When a processor operates as part of a multiprocessor, it is not enough to check dependencies between writes only at the local level. Because data is shared and the course of events in one processor may affect the outcome in another, the order of writes and the data dependencies must be preserved across all the processors. &lt;br /&gt;
“A system is said to be sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operation of each individual processor appear in this sequence in the order specified by its program.”&amp;lt;ref&amp;gt;http://www.cs.jhu.edu/~gyn/publications/memorymodels/node6.html&amp;lt;/ref&amp;gt;&lt;br /&gt;
There are different approaches to this requirement, based on the expected outcome of the program. The condition of sequential consistency and logical ordering may be relaxed as per the requirement of the program we are looking to parallelize. &lt;br /&gt;
&lt;br /&gt;
===Strong Ordering===&lt;br /&gt;
The requirements for strong ordering are as follows:&lt;br /&gt;
 &lt;br /&gt;
1) All memory operations appear to execute one at a time.&lt;br /&gt;
&lt;br /&gt;
2) All memory operations from a single CPU appear to execute in-order.&lt;br /&gt;
&lt;br /&gt;
3) All memory operations from different processors are “cleanly” interleaved with each other (serialization)&lt;br /&gt;
&lt;br /&gt;
===Total Store Ordering===&lt;br /&gt;
Requirements are as follows:&lt;br /&gt;
&lt;br /&gt;
1)	Relaxed consistency in which stores must complete in order, but a store need not complete before a later read takes place.&lt;br /&gt;
&lt;br /&gt;
2)	Reads are allowed to bypass pending writes, but writes MUST exit the store buffer in FIFO order.&lt;br /&gt;
&lt;br /&gt;
===Partial Store Ordering===&lt;br /&gt;
&lt;br /&gt;
Requirement is as follows:&lt;br /&gt;
&lt;br /&gt;
Even more relaxed consistency, in which stores to any given memory location complete in order, but stores to different locations may complete out of order, and a store need not complete before a later read takes place.&lt;br /&gt;
&lt;br /&gt;
===Weak Ordering===&lt;br /&gt;
Requirement is as follows:&lt;br /&gt;
&lt;br /&gt;
Very relaxed consistency in which almost anything goes between synchronization points: the memory state may correspond to any ordering of the reads and writes issued between two synchronization points, but the global memory state must be completely settled at each barrier synchronization point.&lt;br /&gt;
&lt;br /&gt;
===Example===&lt;br /&gt;
An example for sequential consistency:&lt;br /&gt;
In the following program, the variables “a”, “b”, “flag1” and “flag2” are initialized to 0 and are shared by both processors (CPU1 and CPU 2).&lt;br /&gt;
   &lt;br /&gt;
    a = b = flag1 = flag2 = 0;		// initial value&lt;br /&gt;
    CPU1				CPU 2 &lt;br /&gt;
    flag1 = 1;				flag2 = 1;&lt;br /&gt;
    a = 1;				a = 2; &lt;br /&gt;
    r1 = a;				r3 = a;&lt;br /&gt;
    r2 = flag2;				r4 = flag1;&lt;br /&gt;
&lt;br /&gt;
The SPARC V8 architecture follows the Total Store Ordering model, which allows a write followed by a read to complete out of program order. One possible result is therefore r1 = 1, r3 = 2, r2 = r4 = 0.&lt;br /&gt;
Strong ordering, in contrast, enforces strict atomicity, so the result depends only on how the two processors’ instructions are interleaved. The outcome r2 = r4 = 0 can no longer occur: if CPU1 reads flag2 = 0, its earlier write flag1 = 1 is already visible by the time CPU2 reads flag1.&lt;br /&gt;
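&lt;br /&gt;
The claim about strong ordering can be checked by brute force. The following sketch is an illustration written for this article, not code from any of the cited sources: it enumerates every interleaving of the two instruction streams that preserves each CPU's program order, executes each one against a single shared memory without write buffers, and collects the register outcomes. The helper names run and interleavings are hypothetical.&lt;br /&gt;

```python
from itertools import combinations

# Each operation is (kind, variable, value or destination register).
CPU1 = [("st", "flag1", 1), ("st", "a", 1), ("ld", "a", "r1"), ("ld", "flag2", "r2")]
CPU2 = [("st", "flag2", 1), ("st", "a", 2), ("ld", "a", "r3"), ("ld", "flag1", "r4")]

def run(schedule):
    """Execute one interleaving against a single shared memory (no write buffers)."""
    mem = {"a": 0, "b": 0, "flag1": 0, "flag2": 0}
    regs = {}
    for kind, var, arg in schedule:
        if kind == "st":
            mem[var] = arg
        else:
            regs[arg] = mem[var]
    return regs

def interleavings(p, q):
    """Yield every merge of p and q that keeps each program's own order."""
    n = len(p) + len(q)
    for slots in combinations(range(n), len(p)):
        chosen = set(slots)
        ip, iq = iter(p), iter(q)
        yield [next(ip) if i in chosen else next(iq) for i in range(n)]

outcomes = {(r["r1"], r["r2"], r["r3"], r["r4"])
            for r in map(run, interleavings(CPU1, CPU2))}
both_flags_zero = [o for o in outcomes if o[1] == 0 and o[3] == 0]
print(sorted(outcomes))
print("r2 = r4 = 0 reachable under strong ordering:", bool(both_flags_zero))
```

Every tuple is (r1, r2, r3, r4). None of the enumerated outcomes has r2 and r4 both equal to 0, which is the result that Total Store Ordering additionally permits.&lt;br /&gt;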
&lt;br /&gt;
===Effects on Write Buffer Operation===&lt;br /&gt;
&lt;br /&gt;
// Explain here what ordering has to do with the write buffer coherency&lt;br /&gt;
&lt;br /&gt;
=='''References'''==&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;br /&gt;
[http://mprc.pku.cn/mentors/training/ISCAreading/1986/p434-dubois/p434-dubois.pdf Memory Access Buffering In Multiprocessors] Michel Dubois, Christoph Scheurich, Faye Briggs&lt;br /&gt;
&lt;br /&gt;
[http://www.ece.cmu.edu/~ece548/handouts/19coher.pdf Multiprocessor Consistency and Coherence] Memory System Architecture, Philip Koopman&lt;br /&gt;
&lt;br /&gt;
[http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=528915 Write buffer design for cache-coherent shared-memory multiprocessors], Fernaz Mounes-Toussi, David J. Lilja&lt;/div&gt;</summary>
		<author><name>Sbasu3</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/6b_am&amp;diff=59310</id>
		<title>CSC/ECE 506 Spring 2012/6b am</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/6b_am&amp;diff=59310"/>
		<updated>2012-03-04T17:19:56Z</updated>

		<summary type="html">&lt;p&gt;Sbasu3: /* Write-Invalidate */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Introduction ==&lt;br /&gt;
&lt;br /&gt;
With processor speeds increasing at a much faster rate than memory speeds, data transactions between the processor and the memory system need to be managed so that slow memory does not limit the performance of the processor. A read requires the processor to wait for the operation to complete before resuming execution, but a write does not. This is where a write buffer (WB) comes into the picture: the processor hands the write to the buffer and continues its operation, while the write buffer takes complete responsibility for executing the write. &lt;br /&gt;
&lt;br /&gt;
===Write Buffers in Uni-processors===&lt;br /&gt;
&lt;br /&gt;
A write to be performed is put in a buffer implemented as a FIFO queue, so that writes are performed in the order in which they were issued. In a uni-processor model that exploits Instruction Level Parallelism (ILP), writes may also be issued out of order, provided hardware or software mechanisms check the writes for any dependences that exist in the instruction stream. The following figure shows a cache-based single-processor system with a write buffer. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:Uniprocessor With WB.png|thumb|center|400px|Figure 1.Uni-processor cache based system with write buffer ]]&lt;br /&gt;
&lt;br /&gt;
===Write Buffer Issues in Multiprocessors===&lt;br /&gt;
&lt;br /&gt;
In a multiprocessor system, the same design can be extended, as shown in the figure. In this design, each processor will have its own private cache and a write buffer corresponding to the cache. The caches are connected by the means of an interconnect with each other as well as with the main memory. &lt;br /&gt;
 &lt;br /&gt;
[[Image:Multiprocessor with WB.png|thumb|center|400px|Figure 2. Cache- based multiprocessor system with write buffer ]] &lt;br /&gt;
&lt;br /&gt;
As can be seen in the figure above, each processor makes a write by pushing it into the write buffer, and the write buffer completes the task of performing the write to the cache/ main memory. If another write is issued by the processor that modifies the same address as the earlier write, the former write value will be over-written with the new one in the write buffer. Similarly, if a read is issued to the same address, the processor will read the value from the write buffer rather than going into the cache or the main memory.&lt;br /&gt;
&lt;br /&gt;
===The Coherence Problem===&lt;br /&gt;
&lt;br /&gt;
Consider the case where a write (ST_A) has been issued by processor I into the WB_I, and the write is waiting to be executed. Since the write has not yet been performed to the cache or the main memory, the other processors do not have any knowledge about the changes made to address A. As a result, a read operation by another processor from address A will take the value from either its own cache, write buffer, or the main memory (depending on whether it hits or misses in the cache). It is the job of the designer to employ protocols that will take care that the sequential ordering of the instructions is maintained, and that the writes made to any one of the caches are propagated and updated in all of the processor caches. &lt;br /&gt;
&lt;br /&gt;
==Sequential Consistency==&lt;br /&gt;
&lt;br /&gt;
When a processor operates as part of a multiprocessor, it is not enough to check dependencies between writes only at the local level. Because data is shared and the course of events in one processor may affect the outcome in another, the order of writes and the data dependencies must be preserved across all the processors. &lt;br /&gt;
“A system is said to be sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operation of each individual processor appear in this sequence in the order specified by its program.”&lt;br /&gt;
There are different approaches to this requirement, based on the expected outcome of the program. The condition of sequential consistency and logical ordering may be relaxed as per the requirement of the program we are looking to parallelize. &lt;br /&gt;
&lt;br /&gt;
===Strong Ordering===&lt;br /&gt;
The requirements for strong ordering are as follows:&lt;br /&gt;
 &lt;br /&gt;
1) All memory operations appear to execute one at a time.&lt;br /&gt;
&lt;br /&gt;
2) All memory operations from a single CPU appear to execute in-order.&lt;br /&gt;
&lt;br /&gt;
3) All memory operations from different processors are “cleanly” interleaved with each other (serialization)&lt;br /&gt;
&lt;br /&gt;
===Total Store Ordering===&lt;br /&gt;
Requirements are as follows:&lt;br /&gt;
&lt;br /&gt;
1)	Relaxed consistency in which stores must complete in order, but a store need not complete before a later read takes place.&lt;br /&gt;
&lt;br /&gt;
2)	Reads are allowed to bypass pending writes, but writes MUST exit the store buffer in FIFO order.&lt;br /&gt;
&lt;br /&gt;
===Partial Store Ordering===&lt;br /&gt;
&lt;br /&gt;
Requirement is as follows:&lt;br /&gt;
&lt;br /&gt;
Even more relaxed consistency, in which stores to any given memory location complete in order, but stores to different locations may complete out of order, and a store need not complete before a later read takes place.&lt;br /&gt;
&lt;br /&gt;
===Weak Ordering===&lt;br /&gt;
Requirement is as follows:&lt;br /&gt;
&lt;br /&gt;
Very relaxed consistency in which almost anything goes between synchronization points: the memory state may correspond to any ordering of the reads and writes issued between two synchronization points, but the global memory state must be completely settled at each barrier synchronization point.&lt;br /&gt;
&lt;br /&gt;
===Example===&lt;br /&gt;
An example for sequential consistency:&lt;br /&gt;
In the following program, the variables “a”, “b”, “flag1” and “flag2” are initialized to 0 and are shared by both processors (CPU1 and CPU 2).&lt;br /&gt;
   &lt;br /&gt;
    a = b = flag1 = flag2 = 0;		// initial value&lt;br /&gt;
    CPU1				CPU 2 &lt;br /&gt;
    flag1 = 1;				flag2 = 1;&lt;br /&gt;
    a = 1;				a = 2; &lt;br /&gt;
    r1 = a;				r3 = a;&lt;br /&gt;
    r2 = flag2;				r4 = flag1;&lt;br /&gt;
&lt;br /&gt;
The SPARC V8 architecture follows the Total Store Ordering model, which allows a write followed by a read to complete out of program order. One possible result is therefore r1 = 1, r3 = 2, r2 = r4 = 0.&lt;br /&gt;
Strong ordering, in contrast, enforces strict atomicity, so the result depends only on how the two processors’ instructions are interleaved. The outcome r2 = r4 = 0 can no longer occur: if CPU1 reads flag2 = 0, its earlier write flag1 = 1 is already visible by the time CPU2 reads flag1.&lt;br /&gt;
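&lt;br /&gt;
How a write buffer produces the Total Store Ordering result can be sketched with a toy model. The sketch assumes one FIFO store buffer per CPU, loads that first look in the CPU's own buffer (store forwarding) and otherwise read shared memory, and buffers that drain only after all loads have executed. The class name TsoCpu and this particular schedule are illustrative assumptions, not a description of the SPARC V8 hardware.&lt;br /&gt;

```python
from collections import deque

class TsoCpu:
    """Toy CPU with a FIFO store buffer in front of shared memory (hypothetical model)."""

    def __init__(self, memory):
        self.memory = memory
        self.store_buffer = deque()   # FIFO of (variable, value)
        self.regs = {}

    def store(self, var, value):
        # The store retires into the buffer; memory is not touched yet.
        self.store_buffer.append((var, value))

    def load(self, reg, var):
        # A load sees the newest matching entry of its own buffer, else memory.
        for buffered_var, value in reversed(self.store_buffer):
            if buffered_var == var:
                self.regs[reg] = value
                return
        self.regs[reg] = self.memory[var]

    def drain(self):
        # Stores leave the buffer strictly in FIFO order.
        while self.store_buffer:
            var, value = self.store_buffer.popleft()
            self.memory[var] = value

memory = {"a": 0, "b": 0, "flag1": 0, "flag2": 0}
cpu1, cpu2 = TsoCpu(memory), TsoCpu(memory)

cpu1.store("flag1", 1)
cpu1.store("a", 1)
cpu2.store("flag2", 1)
cpu2.store("a", 2)
cpu1.load("r1", "a")        # forwarded from CPU1's own buffer
cpu1.load("r2", "flag2")    # CPU2's store is still buffered
cpu2.load("r3", "a")        # forwarded from CPU2's own buffer
cpu2.load("r4", "flag1")    # CPU1's store is still buffered
cpu1.drain()
cpu2.drain()
print(cpu1.regs, cpu2.regs)
```

With this schedule each CPU reads its own value of a from its buffer while still seeing the other CPU's flag as 0, giving r1 = 1, r3 = 2, r2 = r4 = 0.&lt;br /&gt;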
&lt;br /&gt;
===Effects on Write Buffer Operation===&lt;br /&gt;
&lt;br /&gt;
// Explain here what ordering has to do with the write buffer coherency&lt;br /&gt;
&lt;br /&gt;
==Cache Coherence Models==&lt;br /&gt;
&lt;br /&gt;
The two prominent models for maintaining cache coherence are the snoopy-bus protocol and the directory-based coherence protocol. The directory-based protocol is used for distributed memory systems, in which directory entries are used to keep track of what data is stored where. In the snoopy-bus protocol, every processor monitors the reads and writes that are being serviced by the memory bus. Read and write requests are broadcast on the bus, and each processor responds depending on whether or not it holds that data in its cache. This article looks at shared memory systems.&lt;br /&gt;
&lt;br /&gt;
=== Write-Update ===&lt;br /&gt;
&lt;br /&gt;
In this approach, a write request is broadcast to all the processors, and each processor updates its local cache with the new value of the data. Even if a read misses in a processor's local cache, it can be serviced from any other processor's cache, because the copy has been updated in all caches. This saves bus bandwidth for writes, as only one broadcast is needed for all the caches to be up to date with the newest value of the data.&lt;br /&gt;
&lt;br /&gt;
===Write-Invalidate===&lt;br /&gt;
This approach is common in systems where a data block is written by one processor but read by multiple processors. Whenever a write is made to shared data, an invalidate signal is sent to all the caches, asking them to mark that data block invalid because their copy no longer holds the latest value. This may cause additional traffic on the bus; as a result, a separate bus for invalidation requests may be included in the design.&lt;br /&gt;
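&lt;br /&gt;
A minimal sketch of the write-invalidate idea is given below. It makes several simplifying assumptions that are not part of the article: the class names Bus and Cache are hypothetical, caches are write-through so that memory always holds the latest value, and a line is either present or absent, with no M/E/S/O states.&lt;br /&gt;

```python
class Bus:
    """Shared bus that carries invalidations to every attached cache."""

    def __init__(self):
        self.memory = {}
        self.caches = []

    def broadcast_invalidate(self, sender, address):
        # Every cache except the writer drops its copy of the block.
        for cache in self.caches:
            if cache is not sender:
                cache.lines.pop(address, None)

class Cache:
    def __init__(self, bus):
        self.bus = bus
        self.lines = {}          # address -> value, present means valid
        bus.caches.append(self)

    def read(self, address):
        if address not in self.lines:            # miss: fetch the block
            self.lines[address] = self.bus.memory.get(address, 0)
        return self.lines[address]

    def write(self, address, value):
        self.bus.broadcast_invalidate(self, address)
        self.lines[address] = value
        self.bus.memory[address] = value          # write-through, assumed for brevity

bus = Bus()
c1, c2 = Cache(bus), Cache(bus)
c1.write(0x10, 5)
first = c2.read(0x10)            # miss, then c2 caches the value 5
c1.write(0x10, 7)                # invalidates the copy held by c2
stale_present = 0x10 in c2.lines
second = c2.read(0x10)           # miss again, fetches the new value
print(first, stale_present, second)
```

The second write removes the block from the reader's cache, so its next read misses and fetches the new value instead of returning the stale one.&lt;br /&gt;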
&lt;br /&gt;
'''''Note:''' Most present-day processor implementations are based on the write-invalidate strategy because it is easy to implement, and because the write-update strategy sometimes keeps and updates unnecessary data in the cache. With write-update, the Least Recently Used (LRU) algorithm may evict more recently used data from the cache while keeping the redundant updated data.'' &lt;br /&gt;
:''In this article, write propagation in all hardware-based and software-based designs follows the write-invalidate strategy.'' &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Coherence in Write Buffers ==&lt;br /&gt;
&lt;br /&gt;
===Software-Based Coherence===&lt;br /&gt;
The software technique relies on the compiler to ensure that no dependencies exist between the STORE/LOAD accesses carried out by different processors on the shared memory. The compiler, based on indications from the programmer, makes sure that there are no incoherent accesses to shared memory. It is also possible to have a READ/WRITE buffer shared between all the processors for accessing main memory. &lt;br /&gt;
&lt;br /&gt;
===Hardware-Based Coherence===&lt;br /&gt;
In hardware-based protocols, accesses to the shared memory are communicated using hardware invalidate signals that are broadcast to all the processor memories. Hardware support may also be required for a LOAD to be broadcast to all the caches, so that a read can be performed directly from a remote memory. The following approaches maintain coherence in the write buffers and caches of a multiprocessor system:&lt;br /&gt;
&lt;br /&gt;
====Unique Buffer per Processor==== &lt;br /&gt;
[[Image:Unique Buffer Per Processor.png|thumb|right|250px|Figure 3: Cache based system with unique buffer per processor &lt;br /&gt;
[[http://mprc.pku.cn/mentors/training/ISCAreading/1986/p434-dubois/p434-dubois.pdf]]]] &lt;br /&gt;
&lt;br /&gt;
In this configuration, every processor has its own unique buffer that holds the loads and stores of the local processor as well as the invalidates and loads of remote processors that want access to the shared data. A store request in the local cache triggers invalidation of that data block in all other caches and in main memory. A STORE is said to have been performed when &lt;br /&gt;
&lt;br /&gt;
# the request is serviced in the buffer and the cache is updated on a local hit, and&lt;br /&gt;
# the request is issued to the buffer and an invalidate request has been sent to all other processors sharing the data.&lt;br /&gt;
&lt;br /&gt;
This can be achieved in two ways depending on the topology of the connection between the caches -&lt;br /&gt;
&lt;br /&gt;
# For a bus-based MP system, the LOADs that miss in the local cache and the STOREs that need invalidation in other caches are broadcast over the bus to all processor caches and memory. The STORE request is then considered performed with respect to all processors when the invalidation signals have been sent out to the private buffers of all the caches. &lt;br /&gt;
# For non-bus MP systems, the caches are connected in a point-to-point topology such as a mesh or a ring. When a store is encountered in one of the processors, the shared data is first locked in the shared memory to ensure atomic access, and then the invalidate signal is propagated along the point-to-point interconnect in a predetermined order. The STORE to shared memory is performed only when the invalidation has been propagated to all the processor caches and buffers. If any processor issues a LOAD to the address that is being STOREd, the LOAD request has to be rejected or buffered and serviced later, to maintain atomic access to the shared data block.&lt;br /&gt;
&lt;br /&gt;
====Separate Buffers for Local and Remote Accesses====&lt;br /&gt;
[[Image:Separate Invalidate Buffers.png|thumb|right|250px|Figure 4: Cache based system with separate write-invalidate buffers&lt;br /&gt;
[[http://mprc.pku.cn/mentors/training/ISCAreading/1986/p434-dubois/p434-dubois.pdf]]]] &lt;br /&gt;
&lt;br /&gt;
Another approach to maintaining coherence is for every processor to have separate buffers for local data requests and remote data requests. As seen in the figure, every processor has a Local-buffer which queues the local LOADs and STOREs, and a Remote-buffer (also termed Invalidate buffer) that stores the invalidation requests and LOADs coming in from different processors.&lt;br /&gt;
In a bus-based system, this approach makes it difficult to maintain write atomicity: different invalidate buffers may hold different numbers of pending invalidation requests, so the time at which the data block is actually invalidated in each cache is uncertain. In non-bus-based systems, however, this approach can maintain strong coherence, provided the invalidate signal ensures that the data is invalidated at a cache before moving on to the next processor (rather than just pushing the invalidate request into the buffer and moving on).&lt;br /&gt;
&lt;br /&gt;
====Universal read/write Buffer====&lt;br /&gt;
&lt;br /&gt;
[[Image:Universal Read Write Buffer.png|thumb|right|300px|Figure 5: Universal Read/Write Buffer [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/6b_dm#cite_note-4 5]]]&lt;br /&gt;
&lt;br /&gt;
In this approach, a shared bus controller resides between the local bus and the memory bus, where the local bus is connected to the private caches of all the processors (illustrated in Figure 5). This technique supports multiple processors that use the same or different write-invalidate coherence protocols, such as MSI, MESI and MOESI. The shared bus controller consists of the local bus controller, the data buffer and the address buffer. Each private cache provides two tri-state signal bits to the local bus: '''INTERVENE''' and '''SHARE'''.&lt;br /&gt;
&lt;br /&gt;
# INTERVENE is asserted when a cache wants to provide valid data to other caches.&lt;br /&gt;
# A cache controller asserts SHARE when the address on the bus matches one of its own tag addresses.&lt;br /&gt;
&lt;br /&gt;
=====''' Data Buffer '''=====&lt;br /&gt;
&lt;br /&gt;
The data buffer consists of three FIFOs:&lt;br /&gt;
&lt;br /&gt;
# The ''non_block write-FIFO'' receives non-cacheable memory access data from the CPU. These requests do not require any cache eviction or line-fill cycles. &lt;br /&gt;
# The ''write-back FIFO'' is used for eviction cycles. If there is a cache miss and an eviction is required, the missing data is read from memory into the cache (line-fill cycle) right after the evicted data has been moved to this FIFO. The FIFO contents are written back to memory later, so the CPU gets the new data while the eviction time is hidden from it. &lt;br /&gt;
# The ''read-FIFO'' stores data coming from the non_block write FIFO, the write-back FIFO or memory. Data is held in this FIFO until the local bus is free and ready for new data.&lt;br /&gt;
&lt;br /&gt;
=====''' Address Buffer '''=====&lt;br /&gt;
&lt;br /&gt;
The address buffer also consists of three FIFOs: &lt;br /&gt;
&lt;br /&gt;
# ''Non_block write FIFO'' - stores the addresses of non-blocking accesses. The depth of this FIFO should be the same as that of the corresponding Data Buffer FIFO.&lt;br /&gt;
# ''Write-back FIFO'' - stores the starting address of an eviction cycle and holds it until the memory bus is free.&lt;br /&gt;
# ''Line-fill FIFO'' - stores the starting address of a line-fill cycle.&lt;br /&gt;
&lt;br /&gt;
Snoop logic is started whenever there is a read cycle to main memory. It compares the read request address with the entries of both the write-back FIFO and the line-fill FIFO. If there is a match, the HIT flag is set and the CPU gets the data immediately through the internal bypass path. As a result, stale data is not read and the CPU does not have to wait for the memory latency.&lt;br /&gt;
&lt;br /&gt;
For a non_block write cycle, the snoop logic compares the newly requested address and Byte Enable bits with the entries already stored in the non_block write FIFO. If there is an address match but no Byte-Enable overlap, the BGO signal is asserted and the pointer of the non_block write FIFO does not move forward, which prevents the FIFO from filling up quickly.&lt;br /&gt;
&lt;br /&gt;
In this approach, read cycles have priority over write cycles, so whenever the memory bus becomes available, a pending read address gets access to the memory bus first.&lt;br /&gt;
&lt;br /&gt;
=====''' Local Bus Controller '''=====&lt;br /&gt;
&lt;br /&gt;
The controller monitors the local bus for the status bits (M, O, E, S, I) and the INTERVENE# and SHARE# bits, and determines whether a cycle needs to access main memory or a private cache. For a main memory access, the controller informs the buffer with the WB (Write Back), LF (Line Fill) or O3 (Owned) signal.&lt;br /&gt;
&lt;br /&gt;
The controller signals are generated as follows:&lt;br /&gt;
&lt;br /&gt;
For the MSI and MESI protocols: &lt;br /&gt;
&lt;br /&gt;
If there is a read miss on the local bus -&lt;br /&gt;
::'''INTERVENE''' is asserted.&lt;br /&gt;
If there is a write miss with '''M''' status on the local bus -&lt;br /&gt;
::A write-back cycle is performed&lt;br /&gt;
::The '''WB''' signal is generated.&lt;br /&gt;
If there is a read miss or a write miss under any of the protocols -&lt;br /&gt;
::An '''LF''' cycle is performed&lt;br /&gt;
::The '''LF''' signal is generated.&lt;br /&gt;
&lt;br /&gt;
For the MOESI protocol:&lt;br /&gt;
&lt;br /&gt;
If there is a read miss on the local bus&lt;br /&gt;
::'''INTERVENE''' is asserted&lt;br /&gt;
&lt;br /&gt;
The above outputs can be written in Boolean equations as shown in the example&amp;lt;ref&amp;gt;http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=00342629&amp;lt;/ref&amp;gt; below:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
      WB = RD * MOESI * INTV + WR * [STATUS=M]&lt;br /&gt;
      LF =  INTV&lt;br /&gt;
      O3 = RD * MOESI * INTV&lt;br /&gt;
      MOESI means the cycle is initiated by a cache using the MOESI protocol.&lt;br /&gt;
      INTV is the INTERVENE signal, which is asserted when a cache provides valid data to another cache.&lt;br /&gt;
&lt;br /&gt;
=====''' Algorithms '''=====&lt;br /&gt;
&lt;br /&gt;
The algorithms followed by each of the processors in a multiprocessor system are given below. The processors may follow different protocols (MSI, MESI or MOESI), but the algorithms below ensure that memory accesses are carried out coherently.&lt;br /&gt;
&lt;br /&gt;
The '''S''' state is common to all three protocols, but this system takes advantage of the MESI and MOESI protocols to reduce memory bus transactions. In the shared state ('''S'''), multiple caches hold the same up-to-date data, but memory may or may not have a current copy. This change requires the following protocol translation. &lt;br /&gt;
&lt;br /&gt;
The algorithms for the '''MOESI''' protocol are as follows:&lt;br /&gt;
::If a read hit is initiated by another cache&lt;br /&gt;
:::• The cache holding the line in the '''M''' state provides the data to the other cache &lt;br /&gt;
:::• It changes its state from '''M''' to '''O'''&lt;br /&gt;
::If the same data is hit again&lt;br /&gt;
:::• The cache holding the line in the '''O''' state is responsible for providing the data to the requesting cache.&lt;br /&gt;
&lt;br /&gt;
::If there is a write miss in the cache and INTERVENE is not asserted&lt;br /&gt;
:::• Generate a line-fill cycle&lt;br /&gt;
&lt;br /&gt;
The algorithms for the '''MSI''' and '''MESI''' protocols are as follows. (These protocols have no '''O''' state.)&lt;br /&gt;
In the MSI protocol, a cache holding a line in '''M''' state provides the data to the other caches and to main memory, and changes its state from '''M''' to '''S'''. This transaction normally requires a main memory access, but the local bus translation logic reduces bus accesses and improves performance as follows. &lt;br /&gt;
::• The local bus controller changes the status of the MOESI cache to '''O''' instead of '''S'''&lt;br /&gt;
::• '''M''' of the MSI or MESI cache changes to '''S''', and main memory is not updated.&lt;br /&gt;
::So, if the same data is requested by another cache&lt;br /&gt;
:::• The MOESI cache in '''O''' state provides the data instead of main memory.&lt;br /&gt;
::If a write cycle is initiated &lt;br /&gt;
:::If the MOESI cache has a valid line in '''M''' or '''O''' state &lt;br /&gt;
::::The MOESI cache sends the data line out on the local bus and changes its state to '''I'''.&lt;br /&gt;
::::(The line will be written by the other cache, so the local copy becomes outdated.)&lt;br /&gt;
::If the status is '''E''' or '''S'''&lt;br /&gt;
:::• No cache is involved in the data transfer.&lt;br /&gt;
:::• Main memory provides the data.&lt;br /&gt;
&lt;br /&gt;
=='''References'''==&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;br /&gt;
[http://mprc.pku.cn/mentors/training/ISCAreading/1986/p434-dubois/p434-dubois.pdf Memory Access Buffering In Multiprocessors] Michel Dubois, Christoph Scheurich, Faye Briggs&lt;br /&gt;
&lt;br /&gt;
[http://www.ece.cmu.edu/~ece548/handouts/19coher.pdf Multiprocessor Consistency and Coherence] Memory System Architecture, Philip Koopman&lt;br /&gt;
&lt;br /&gt;
[http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=528915 Write buffer design for cache-coherent shared-memory multiprocessors], Fernaz Mounes-Toussi, David J. Lilja&lt;/div&gt;</summary>
		<author><name>Sbasu3</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/6b_am&amp;diff=59309</id>
		<title>CSC/ECE 506 Spring 2012/6b am</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/6b_am&amp;diff=59309"/>
		<updated>2012-03-04T17:19:30Z</updated>

		<summary type="html">&lt;p&gt;Sbasu3: /* Write-Invalidate */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Introduction ==&lt;br /&gt;
&lt;br /&gt;
Processor speeds are increasing at a much faster rate than memory speeds, so data transactions between the processor and the memory system must be managed such that slow memory does not limit processor performance. A read requires the processor to wait for the operation to complete before resuming execution, but a write does not. This is where a write buffer (WB) comes into the picture: the processor hands the write to the buffer and continues its operation while the write buffer takes complete responsibility for executing the write. &lt;br /&gt;
&lt;br /&gt;
===Write Buffers in Uni-processors===&lt;br /&gt;
&lt;br /&gt;
A write to be performed is put in a buffer implemented as a FIFO queue, so that writes are performed in the order in which they were issued. In a uni-processor model that extracts Instruction Level Parallelism (ILP), writes may also be issued out of order, provided that hardware or software protocols check the writes for any dependences that exist in the instruction stream. The following figure shows a cache-based single-processor system with a write buffer. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:Uniprocessor With WB.png|thumb|center|400px|Figure 1. Uni-processor cache-based system with write buffer ]]&lt;br /&gt;
&lt;br /&gt;
===Write Buffer Issues in Multiprocessors===&lt;br /&gt;
&lt;br /&gt;
In a multiprocessor system, the same design can be extended, as shown in the figure. In this design, each processor has its own private cache and a write buffer corresponding to that cache. The caches are connected to each other and to main memory by means of an interconnect. &lt;br /&gt;
 &lt;br /&gt;
[[Image:Multiprocessor with WB.png|thumb|center|400px|Figure 2. Cache-based multiprocessor system with write buffer ]] &lt;br /&gt;
&lt;br /&gt;
As can be seen in the figure above, each processor performs a write by pushing it into the write buffer, and the write buffer completes the task of performing the write to the cache or main memory. If the processor issues another write to the same address as an earlier buffered write, the earlier value is overwritten with the new one in the write buffer. Similarly, if a read is issued to the same address, the processor reads the value from the write buffer rather than going to the cache or main memory.&lt;br /&gt;
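&lt;br /&gt;
The write-buffer behaviour described above can be sketched as a small Python model. This is only an illustrative sketch, not a description of any real hardware: the class name WriteBuffer, its methods, and the use of a dictionary as main memory are all assumptions made for the example.&lt;br /&gt;

```python
from collections import OrderedDict


class WriteBuffer:
    """Toy per-processor write buffer (hypothetical model, not real hardware)."""

    def __init__(self, memory):
        self.memory = memory            # dict standing in for cache/main memory
        self.pending = OrderedDict()    # address to value, oldest entry first

    def store(self, addr, value):
        # A second store to the same address overwrites the buffered value
        # in place instead of occupying a new slot.
        self.pending[addr] = value

    def load(self, addr):
        # A load to a buffered address is served from the buffer,
        # otherwise it falls through to memory.
        if addr in self.pending:
            return self.pending[addr]
        return self.memory.get(addr, 0)

    def drain_one(self):
        # Retire the oldest pending store to memory (FIFO order).
        if self.pending:
            addr, value = self.pending.popitem(last=False)
            self.memory[addr] = value


memory = {}
wb = WriteBuffer(memory)
wb.store(0x10, 1)
wb.store(0x10, 2)                    # overwrites the earlier buffered store
forwarded = wb.load(0x10)            # served from the buffer
before_drain = memory.get(0x10, 0)   # memory has not seen the store yet
wb.drain_one()
after_drain = memory[0x10]
```

Until drain_one runs, only the issuing processor can observe the new value, which is the root of the coherence problem discussed in the next section.&lt;br /&gt;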
&lt;br /&gt;
===The Coherence Problem===&lt;br /&gt;
&lt;br /&gt;
Consider the case where a write (ST_A) has been issued by processor I into the WB_I, and the write is waiting to be executed. Since the write has not yet been performed to the cache or the main memory, the other processors do not have any knowledge about the changes made to address A. As a result, a read operation by another processor from address A will take the value from either its own cache, write buffer, or the main memory (depending on whether it hits or misses in the cache). It is the job of the designer to employ protocols that will take care that the sequential ordering of the instructions is maintained, and that the writes made to any one of the caches are propagated and updated in all of the processor caches. &lt;br /&gt;
&lt;br /&gt;
==Sequential Consistency==&lt;br /&gt;
&lt;br /&gt;
When operating as part of a multiprocessor, it is not enough to check dependencies between writes only at the local level. Because data is shared and events in one processor may affect outcomes in another, the sequence of writes and the data dependencies must be preserved across the whole multiprocessor. &lt;br /&gt;
“A system is said to be sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.”&lt;br /&gt;
There are different approaches to this requirement, based on the expected outcome of the program. The conditions of sequential consistency and logical ordering may be relaxed according to the requirements of the program being parallelized. &lt;br /&gt;
&lt;br /&gt;
===Strong Ordering===&lt;br /&gt;
The requirements for strong ordering are as follows:&lt;br /&gt;
 &lt;br /&gt;
1) All memory operations appear to execute one at a time.&lt;br /&gt;
&lt;br /&gt;
2) All memory operations from a single CPU appear to execute in-order.&lt;br /&gt;
&lt;br /&gt;
3) All memory operations from different processors are “cleanly” interleaved with each other (serialization)&lt;br /&gt;
&lt;br /&gt;
===Total Store Ordering===&lt;br /&gt;
The requirements are as follows:&lt;br /&gt;
&lt;br /&gt;
1)	Relaxed consistency in which stores must complete in order, but a store need not complete before a later read of a given location takes place.&lt;br /&gt;
&lt;br /&gt;
2)	Reads are allowed to bypass pending writes, but writes MUST exit the store buffer in FIFO order.&lt;br /&gt;
&lt;br /&gt;
===Partial Store Ordering===&lt;br /&gt;
&lt;br /&gt;
The requirement is as follows:&lt;br /&gt;
&lt;br /&gt;
Even more relaxed consistency in which stores to any given memory location complete in order, but stores to different locations may complete out of order, and stores need not complete before a read of a given location takes place.&lt;br /&gt;
&lt;br /&gt;
===Weak Ordering===&lt;br /&gt;
The requirement is as follows:&lt;br /&gt;
&lt;br /&gt;
A highly relaxed consistency model in which memory operations may be reordered freely between synchronization points. Global memory state must be completely settled at each barrier synchronization point, and between synchronization points the memory state may correspond to any ordering of reads and writes.&lt;br /&gt;
&lt;br /&gt;
===Example===&lt;br /&gt;
Example for sequential consistency:&lt;br /&gt;
In the following program, the variables “a”, “b”, “flag1” and “flag2” are initialized to 0, and both processors (CPU1 and CPU2) share all the variables.&lt;br /&gt;
   &lt;br /&gt;
    a = b = flag1 = flag2 = 0;		// initial value&lt;br /&gt;
    CPU1				CPU 2 &lt;br /&gt;
    flag1 = 1;				flag2 = 1;&lt;br /&gt;
    a = 1;				a = 2; &lt;br /&gt;
    r1 = a;				r3 = a;&lt;br /&gt;
    r2 = flag2;				r4 = flag1;&lt;br /&gt;
&lt;br /&gt;
The SPARC V8 architecture follows the Total Store Ordering model and allows a write followed by a read to complete out of program order. One possible result is r1 = 1, r3 = 2, r2 = r4 = 0.&lt;br /&gt;
Strong ordering, however, enforces strict atomicity, so the result depends on how the two processors’ instructions interleave, and r2 = r4 = 0 cannot occur: whichever processor reads the other’s flag as 0 must have finished before the other processor set its flag, so the other processor then reads the first flag as 1.&lt;br /&gt;
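&lt;br /&gt;
The difference between the two models can be checked with a small Python simulation of the program above. This is a simplified sketch under stated assumptions: the function names run and interleavings are made up for the example, and the buffered mode models only the extreme case in which no store leaves its FIFO store buffer until every load has executed. Real Total Store Ordering hardware may drain stores at any time.&lt;br /&gt;

```python
from itertools import permutations


def run(schedule, use_store_buffer):
    """Run the two-CPU example under one interleaving; return (r1, r2, r3, r4)."""
    memory = {"a": 0, "flag1": 0, "flag2": 0}
    buffers = {1: [], 2: []}          # per-CPU FIFO store buffers
    regs = {}
    programs = {
        1: [("st", "flag1", 1), ("st", "a", 1), ("ld", "a", "r1"), ("ld", "flag2", "r2")],
        2: [("st", "flag2", 1), ("st", "a", 2), ("ld", "a", "r3"), ("ld", "flag1", "r4")],
    }
    pc = {1: 0, 2: 0}
    for cpu in schedule:
        op, var, arg = programs[cpu][pc[cpu]]
        pc[cpu] += 1
        if op == "st":
            if use_store_buffer:
                buffers[cpu].append((var, arg))   # store waits in the buffer
            else:
                memory[var] = arg                 # store is visible at once
        else:
            # A load first checks the CPU's own buffer (newest entry wins).
            own = [v for (name, v) in buffers[cpu] if name == var]
            regs[arg] = own[-1] if own else memory[var]
    # Drain the buffers in FIFO order after all loads have executed.
    for cpu in (1, 2):
        for var, value in buffers[cpu]:
            memory[var] = value
    return regs["r1"], regs["r2"], regs["r3"], regs["r4"]


def interleavings():
    # Every distinct ordering of four steps from CPU1 and four from CPU2.
    return set(permutations([1, 1, 1, 1, 2, 2, 2, 2]))


strong_results = {run(s, False) for s in interleavings()}
buffered_results = {run(s, True) for s in interleavings()}
```

Here strong_results collects every outcome reachable when stores become visible immediately, and buffered_results collects the outcomes when each CPU reads its own pending stores but not those of the other CPU.&lt;br /&gt;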
&lt;br /&gt;
===Effects on Write Buffer Operation===&lt;br /&gt;
&lt;br /&gt;
// Explain here what ordering has to do with the write buffer coherency&lt;br /&gt;
&lt;br /&gt;
==Cache Coherence Models==&lt;br /&gt;
&lt;br /&gt;
The two prominent models for maintaining cache coherence are the snoopy-bus protocol and the directory-based coherence protocol. The directory-based protocol is used for distributed memory systems, in which every processor keeps track of what data is being stored by means of a directory entry. In the snoopy-bus protocol, every processor monitors the reads and writes that are being serviced by the memory bus. Read and write requests are broadcast on the bus, and each processor responds based on whether or not it holds that data in its cache. We will be looking at shared memory systems in this article.&lt;br /&gt;
&lt;br /&gt;
=== Write-Update ===&lt;br /&gt;
&lt;br /&gt;
In this approach, a write request is broadcast to all the processors, and each processor updates its local cache with the new value of the data. Even though a read may miss in a processor's local cache, the read can be served from any other processor, as the copy has been updated in all caches. This saves bus bandwidth in terms of writes, as only one broadcast is needed for all the caches to be up to date with the newest value of the data.&lt;br /&gt;
&lt;br /&gt;
===Write-Invalidate===&lt;br /&gt;
This approach is common in systems where a data block is written by one processor but read by multiple processors. Whenever a write is made to shared data, an invalidate signal is sent to all the caches, asking for that data block to be made invalid, as their copy no longer holds the latest value. This may cause additional traffic on the bus; as a result, a separate bus for invalidation requests may be included in the design.&lt;br /&gt;
&lt;br /&gt;
'''Note:''' ''Most present-day processor implementations are based on the write-invalidate strategy because it is easy to implement, and because the write-update strategy sometimes keeps and updates unnecessary data in the cache. The Least Recently Used (LRU) algorithm may then evict more recently used data from the cache while keeping the redundantly updated data.'' &lt;br /&gt;
:''In this article, write propagation in all hardware- and software-based designs follows the write-invalidate strategy.'' &lt;br /&gt;
&lt;br /&gt;
== Coherence in Write Buffers ==&lt;br /&gt;
&lt;br /&gt;
===Software-Based Coherence===&lt;br /&gt;
The software technique relies on the compiler to ensure that no dependencies exist between the STORE/LOAD accesses carried out by different processors on the shared memory. The compiler, based on indications from the programmer, makes sure that there are no incoherent accesses to shared memory. It is also possible to have a READ/WRITE buffer shared between all processors for accessing main memory. &lt;br /&gt;
&lt;br /&gt;
===Hardware-Based Coherence===&lt;br /&gt;
In hardware-based protocols, accesses to the shared memory are communicated using hardware invalidate signals that are broadcast to all the processor memories. Hardware support may also be required for a LOAD to be broadcast to all the caches, so that a read can be performed directly from a remote memory. The following approaches maintain coherence in write buffers and caches for multiprocessor systems:&lt;br /&gt;
&lt;br /&gt;
====Unique Buffer per Processor==== &lt;br /&gt;
[[Image:Unique Buffer Per Processor.png|thumb|right|250px|Figure 3: Cache based system with unique buffer per processor &lt;br /&gt;
[[http://mprc.pku.cn/mentors/training/ISCAreading/1986/p434-dubois/p434-dubois.pdf]]]] &lt;br /&gt;
&lt;br /&gt;
In this configuration, every processor has its own unique buffer that holds the loads and stores of the local processor as well as the invalidates and loads of remote processors wanting access to the shared data. A store request in the local cache triggers invalidation of that data block in all other caches and main memory. A STORE is said to have been performed when &lt;br /&gt;
&lt;br /&gt;
# the request is serviced in the buffer and the cache is updated on a local hit, and&lt;br /&gt;
# the request is issued to the buffer and an invalidate request has been sent to all other processors sharing the data.&lt;br /&gt;
&lt;br /&gt;
This can be achieved in two ways depending on the topology of the connection between the caches -&lt;br /&gt;
&lt;br /&gt;
# For a bus-based MP system, the LOADs that miss in the local cache and the STOREs that need invalidation in other caches are broadcast over the bus to all processor caches and memory. The STORE request is then considered performed with respect to all processors when the invalidation signals have been sent out to the private buffers of all the caches. &lt;br /&gt;
# For non-bus MP systems, the caches are connected in a point-to-point mesh-based or ring-based interconnect topology. When a store is encountered in one of the processors, the shared data is first locked in the shared memory to ensure atomic access, and then the invalidate signal is propagated along the point-to-point interconnect in a predetermined order. The STORE to shared memory is performed only when the invalidation has been propagated to all the processor caches and buffers. If any processor issues a LOAD to the same address that is being STOREd, the LOAD request has to be rejected or buffered to be serviced at a later time, to maintain atomic access to the shared data block.&lt;br /&gt;
&lt;br /&gt;
====Separate Buffers for Local and Remote Accesses====&lt;br /&gt;
[[Image:Separate Invalidate Buffers.png|thumb|right|250px|Figure 4: Cache based system with separate write-invalidate buffers&lt;br /&gt;
[[http://mprc.pku.cn/mentors/training/ISCAreading/1986/p434-dubois/p434-dubois.pdf]]]] &lt;br /&gt;
&lt;br /&gt;
Another approach to maintaining coherence is for every processor to have separate buffers for local data requests and remote data requests. As seen in the figure, every processor has a Local-buffer which queues the local LOADs and STOREs, and a Remote-buffer (also termed Invalidate buffer) that stores the invalidation requests and LOADs coming in from different processors.&lt;br /&gt;
In a bus-based multiprocessor system, this approach makes it difficult to maintain write atomicity, as different invalidate buffers may hold different numbers of invalidation requests, which makes the time at which the data block is invalidated in each cache uncertain. In non-bus-based systems, however, this approach succeeds in maintaining strong coherence, provided the invalidate signal makes sure that the data is invalidated at a cache before moving on to the next processor (rather than just pushing the invalidate request into the buffers and moving on).&lt;br /&gt;
&lt;br /&gt;
====Universal read/write Buffer====&lt;br /&gt;
&lt;br /&gt;
[[Image:Universal Read Write Buffer.png|thumb|right|300px|Figure 5: Universal Read/Write Buffer [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/6b_dm#cite_note-4 5]]]&lt;br /&gt;
&lt;br /&gt;
In this approach, a shared bus controller resides between the local bus and the memory bus, where the local bus is connected to the private caches of all the processors (illustrated in Figure 5). This technique supports multiple processors with the same or different coherence protocols, such as MSI, MESI and MOESI, all using the write-invalidate strategy. The shared bus controller consists of the local bus controller, the data buffer and the address buffer. Each private cache provides two tri-state signal bits to the local bus: '''INTERVENE''' and '''SHARE'''.&lt;br /&gt;
&lt;br /&gt;
# INTERVENE is asserted when any cache wants to provide valid data to other caches.&lt;br /&gt;
# SHARE is asserted by a cache controller when the address matches one of its own tag addresses.&lt;br /&gt;
&lt;br /&gt;
=====''' Data Buffer '''=====&lt;br /&gt;
&lt;br /&gt;
The data buffer consists of three FIFOs:&lt;br /&gt;
&lt;br /&gt;
# The ''non_block write-FIFO'' receives non-cacheable memory access data from the CPU. These requests do not require any cache eviction or line-fill cycles. &lt;br /&gt;
# The ''write-back FIFO'' is used for eviction cycles. If there is a cache miss and an eviction is required, the missed data is read from memory into the cache (line-fill cycle) just after the evicted data has been moved to the FIFO. The contents of this FIFO are written back to memory later on. The CPU therefore gets the new data, while the eviction time is hidden from it. &lt;br /&gt;
# The ''read-FIFO'' is used for storing data from the non_block write FIFO, the write-back FIFO or memory. Data is temporarily stored in this FIFO until the local bus is clear and ready for new data.&lt;br /&gt;
&lt;br /&gt;
=====''' Address Buffer '''=====&lt;br /&gt;
&lt;br /&gt;
The address buffer also consists of three FIFOs:&lt;br /&gt;
&lt;br /&gt;
# ''Non_block write FIFO'' - stores address of non-blocking accesses. The depth of this FIFO should be the same as the corresponding Data Buffer FIFO.&lt;br /&gt;
# ''Write-back FIFO'' - stores starting address of eviction cycle and holds the address until memory bus is free.&lt;br /&gt;
# ''Line-fill FIFO''- stores the starting address of line-fill cycle.&lt;br /&gt;
&lt;br /&gt;
Snoop logic is started whenever there is a read cycle to main memory. The snoop logic compares the read request address with both the write-back FIFO and the line-fill FIFO. If there is a match, the HIT flag is set and the CPU gets the data immediately through the internal bypass path. As a result, stale data is not read and the CPU does not incur the memory latency.&lt;br /&gt;
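&lt;br /&gt;
The address comparison performed by the snoop logic can be sketched in Python as follows. This is a hedged illustration only: the function name snoop_read is hypothetical, and pairing each FIFO address with its data in a single entry is an assumption of the sketch (in the design above, addresses and data sit in separate buffers).&lt;br /&gt;

```python
from collections import deque


def snoop_read(addr, write_back_fifo, line_fill_fifo, memory):
    """Return (hit, data) for a read cycle to main memory (illustrative model)."""
    for fifo in (write_back_fifo, line_fill_fifo):
        # Scan the newest entries first so the most recent data wins.
        for entry_addr, data in reversed(fifo):
            if entry_addr == addr:
                return True, data      # HIT: bypass path, no memory access
    return False, memory[addr]         # no match in either FIFO: go to memory


memory = {0x40: "stale", 0x80: "clean"}
write_back = deque([(0x40, "dirty")])   # evicted line not yet written to memory
line_fill = deque()
hit_result = snoop_read(0x40, write_back, line_fill, memory)
miss_result = snoop_read(0x80, write_back, line_fill, memory)
```

In this sketch the read of address 0x40 is answered from the write-back FIFO, so the stale copy in memory is never returned.&lt;br /&gt;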
&lt;br /&gt;
For a non-block write cycle, the snoop logic compares the newly requested address and Byte-Enable bits with the entries previously stored in the non_block write FIFO. If there is an address match but the Byte-Enable bits do not overlap, the BGO signal is asserted and the pointer of the non_block write FIFO does not move forward, which prevents the FIFO from filling up quickly.&lt;br /&gt;
&lt;br /&gt;
In this approach, read cycles have priority over write cycles, so whenever the memory bus is available, a read address gets direct access to the memory bus.&lt;br /&gt;
&lt;br /&gt;
=====''' Local Bus Controller '''=====&lt;br /&gt;
&lt;br /&gt;
The controller monitors the local bus for the status bits (M, O, E, S, I) and the INTERVENE# and SHARE# bits, and determines whether the cycle needs to access main memory or a private cache. For a main memory access, the controller informs the buffer using the WB (Write Back), LF (Line Fill) or O3 (owned) status signals.&lt;br /&gt;
&lt;br /&gt;
Algorithm to generate the controller signals:&lt;br /&gt;
&lt;br /&gt;
For the MSI and MESI protocols: &lt;br /&gt;
&lt;br /&gt;
If there is a read miss on the local bus -&lt;br /&gt;
::'''INTERVENE''' is asserted.&lt;br /&gt;
If there is a write miss with '''M''' status on the local bus -&lt;br /&gt;
::Write back cycle will be performed&lt;br /&gt;
::Generate '''WB''' signal.&lt;br /&gt;
If there is a read miss or write miss from any of the protocols -&lt;br /&gt;
::'''LF''' cycle will be performed&lt;br /&gt;
::Generate '''LF''' signal&lt;br /&gt;
&lt;br /&gt;
For the MOESI protocol:&lt;br /&gt;
&lt;br /&gt;
If there is a read miss on the local bus&lt;br /&gt;
::'''INTERVENE''' is asserted&lt;br /&gt;
&lt;br /&gt;
The above outputs can be written in Boolean equations as shown in the example&amp;lt;ref&amp;gt;http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=00342629&amp;lt;/ref&amp;gt; below:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
      WB = RD * MOESI * INTV + WR * [STATUS=M]&lt;br /&gt;
      LF =  INTV&lt;br /&gt;
      O3 = RD * MOESI * INTV&lt;br /&gt;
      MOESI means the cycle is initiated by a cache using the MOESI protocol.&lt;br /&gt;
      INTV means the INTERVENE signal, which is asserted when any cache provides valid data to another cache.&lt;br /&gt;
&lt;br /&gt;
=====''' Algorithms '''=====&lt;br /&gt;
&lt;br /&gt;
The algorithm followed by each of the processors in a multiprocessor system is given below. The processors may follow different protocols (MSI, MESI or MOESI), but the algorithms below ensure that memory accesses are carried out coherently.&lt;br /&gt;
&lt;br /&gt;
The '''S''' state is common to all three protocols, but this system takes advantage of the MESI and MOESI protocols to reduce memory bus transactions. In the shared state ('''S'''), multiple caches hold the same up-to-date data, but memory may or may not have a current copy. This change requires the following protocol translation. &lt;br /&gt;
&lt;br /&gt;
The algorithms for the '''MOESI''' protocol are as follows:&lt;br /&gt;
::If a read hit is initiated by another cache&lt;br /&gt;
:::• The cache holding the line in '''M''' state provides the data to the other cache &lt;br /&gt;
:::• It changes its state from '''M''' to '''O'''&lt;br /&gt;
::If the same data is hit again&lt;br /&gt;
:::• The cache holding the line in '''O''' state is responsible for providing the data to the requesting cache.&lt;br /&gt;
&lt;br /&gt;
::If a write misses in the cache and INTERVENE is not asserted&lt;br /&gt;
:::• Generate a line-fill cycle&lt;br /&gt;
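&lt;br /&gt;
The MOESI steps listed above can be sketched in Python. This is a minimal sketch under assumptions: the class MoesiLine and the function needs_line_fill are hypothetical names, and only the transitions named in the list are modelled (all other MOESI transitions are deliberately left out).&lt;br /&gt;

```python
class MoesiLine:
    """One cache line in a MOESI cache, tracking only state and data."""

    def __init__(self, state="I", data=None):
        self.state = state
        self.data = data

    def snoop_read(self):
        """React to a read from another cache; return (intervene, data)."""
        if self.state == "M":
            self.state = "O"            # keep ownership, memory is not updated
            return True, self.data
        if self.state == "O":
            return True, self.data      # the owner keeps supplying the line
        return False, None              # other states are not modelled here


def needs_line_fill(is_write_miss, intervene):
    # A write miss with no intervening cache is served by a line-fill cycle.
    return is_write_miss and not intervene


line = MoesiLine("M", 42)
first = line.snoop_read()        # M supplies the data and moves to O
state_after_first = line.state
second = line.snoop_read()       # O supplies the data on the next hit
fill = needs_line_fill(True, False)
```

After the first remote read the line stays in the owned state, so later requests for the same data are answered by this cache rather than by main memory.&lt;br /&gt;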
&lt;br /&gt;
The algorithms for the '''MSI''' and '''MESI''' protocols are as follows. (These protocols have no '''O''' state.)&lt;br /&gt;
In the MSI protocol, a cache holding a line in '''M''' state provides the data to the other caches and to main memory, and changes its state from '''M''' to '''S'''. This transaction normally requires a main memory access, but the local bus translation logic reduces bus accesses and improves performance as follows. &lt;br /&gt;
::• The local bus controller changes the status of the MOESI cache to '''O''' instead of '''S'''&lt;br /&gt;
::• '''M''' of the MSI or MESI cache changes to '''S''', and main memory is not updated.&lt;br /&gt;
::So, if the same data is requested by another cache&lt;br /&gt;
:::• The MOESI cache in '''O''' state provides the data instead of main memory.&lt;br /&gt;
::If a write cycle is initiated &lt;br /&gt;
:::If the MOESI cache has a valid line in '''M''' or '''O''' state &lt;br /&gt;
::::The MOESI cache sends the data line out on the local bus and changes its state to '''I'''.&lt;br /&gt;
::::(The line will be written by the other cache, so the local copy becomes outdated.)&lt;br /&gt;
::If the status is '''E''' or '''S'''&lt;br /&gt;
:::• No cache is involved in the data transfer.&lt;br /&gt;
:::• Main memory provides the data.&lt;br /&gt;
&lt;br /&gt;
=='''References'''==&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;br /&gt;
[http://mprc.pku.cn/mentors/training/ISCAreading/1986/p434-dubois/p434-dubois.pdf Memory Access Buffering In Multiprocessors] Michel Dubois, Christoph Scheurich, Faye Briggs&lt;br /&gt;
&lt;br /&gt;
[http://www.ece.cmu.edu/~ece548/handouts/19coher.pdf Multiprocessor Consistency and Coherence] Memory System Architecture, Philip Koopman&lt;br /&gt;
&lt;br /&gt;
[http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=528915 Write buffer design for cache-coherent shared-memory multiprocessors], Fernaz Mounes-Toussi, David J. Lilja&lt;/div&gt;</summary>
		<author><name>Sbasu3</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/6b_am&amp;diff=59308</id>
		<title>CSC/ECE 506 Spring 2012/6b am</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/6b_am&amp;diff=59308"/>
		<updated>2012-03-04T16:15:21Z</updated>

		<summary type="html">&lt;p&gt;Sbasu3: /*  Algorithms  */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Introduction ==&lt;br /&gt;
&lt;br /&gt;
Processor speeds are increasing at a much faster rate than memory speeds, so data transactions between the processor and the memory system must be managed such that slow memory does not limit processor performance. A read requires the processor to wait for the operation to complete before resuming execution, but a write does not. This is where a write buffer (WB) comes into the picture: the processor hands the write to the buffer and continues its operation while the write buffer takes complete responsibility for executing the write. &lt;br /&gt;
&lt;br /&gt;
===Write Buffers in Uni-processors===&lt;br /&gt;
&lt;br /&gt;
A write to be performed is put in a buffer implemented as a FIFO queue, so that writes are performed in the order in which they were issued. In a uni-processor model that extracts Instruction Level Parallelism (ILP), writes may also be issued out of order, provided that hardware or software protocols check the writes for any dependences that exist in the instruction stream. The following figure shows a cache-based single-processor system with a write buffer. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:Uniprocessor With WB.png|thumb|center|400px|Figure 1. Uni-processor cache-based system with write buffer ]]&lt;br /&gt;
&lt;br /&gt;
===Write Buffer Issues in Multiprocessors===&lt;br /&gt;
&lt;br /&gt;
In a multiprocessor system, the same design can be extended, as shown in the figure. In this design, each processor has its own private cache and a write buffer corresponding to that cache. The caches are connected to each other and to main memory by means of an interconnect. &lt;br /&gt;
 &lt;br /&gt;
[[Image:Multiprocessor with WB.png|thumb|center|400px|Figure 2. Cache-based multiprocessor system with write buffer ]] &lt;br /&gt;
&lt;br /&gt;
As can be seen in the figure above, each processor performs a write by pushing it into the write buffer, and the write buffer completes the task of performing the write to the cache or main memory. If the processor issues another write to the same address as an earlier buffered write, the earlier value is overwritten with the new one in the write buffer. Similarly, if a read is issued to the same address, the processor reads the value from the write buffer rather than going to the cache or main memory.&lt;br /&gt;
&lt;br /&gt;
===The Coherence Problem===&lt;br /&gt;
&lt;br /&gt;
Consider the case where a write (ST_A) has been issued by processor I into the WB_I, and the write is waiting to be executed. Since the write has not yet been performed to the cache or the main memory, the other processors do not have any knowledge about the changes made to address A. As a result, a read operation by another processor from address A will take the value from either its own cache, write buffer, or the main memory (depending on whether it hits or misses in the cache). It is the job of the designer to employ protocols that will take care that the sequential ordering of the instructions is maintained, and that the writes made to any one of the caches are propagated and updated in all of the processor caches. &lt;br /&gt;
&lt;br /&gt;
==Sequential Consistency==&lt;br /&gt;
&lt;br /&gt;
When operating as part of a multiprocessor, it is not enough to check dependencies between writes only at the local level. Because data is shared and events in one processor may affect outcomes in another, the sequence of writes and the data dependencies must be preserved across the whole multiprocessor. &lt;br /&gt;
“A system is said to be sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.”&lt;br /&gt;
There are different approaches to this requirement, based on the expected outcome of the program. The conditions of sequential consistency and logical ordering may be relaxed according to the requirements of the program being parallelized. &lt;br /&gt;
&lt;br /&gt;
===Strong Ordering===&lt;br /&gt;
The requirements for strong ordering are as follows:&lt;br /&gt;
 &lt;br /&gt;
1) All memory operations appear to execute one at a time.&lt;br /&gt;
&lt;br /&gt;
2) All memory operations from a single CPU appear to execute in-order.&lt;br /&gt;
&lt;br /&gt;
3) All memory operations from different processors are “cleanly” interleaved with each other (serialization)&lt;br /&gt;
&lt;br /&gt;
===Total Store Ordering===&lt;br /&gt;
The requirements are as follows:&lt;br /&gt;
&lt;br /&gt;
1)	Relaxed consistency in which stores must complete in order, but a store need not complete before a later read of a given location takes place.&lt;br /&gt;
&lt;br /&gt;
2)	Reads are allowed to bypass pending writes, but writes MUST exit the store buffer in FIFO order.&lt;br /&gt;
&lt;br /&gt;
===Partial Store Ordering===&lt;br /&gt;
&lt;br /&gt;
The requirement is as follows:&lt;br /&gt;
&lt;br /&gt;
Even more relaxed consistency in which stores to any given memory location complete in order, but stores to different locations may complete out of order, and stores need not complete before a read of a given location takes place.&lt;br /&gt;
&lt;br /&gt;
===Weak Ordering===&lt;br /&gt;
The requirement is as follows:&lt;br /&gt;
&lt;br /&gt;
A highly relaxed consistency model in which memory operations may be reordered freely between synchronization points. Global memory state must be completely settled at each barrier synchronization point, and between synchronization points the memory state may correspond to any ordering of reads and writes.&lt;br /&gt;
&lt;br /&gt;
===Example===&lt;br /&gt;
Example for sequential consistency:&lt;br /&gt;
In the following program, the variables “a”, “b”, “flag1” and “flag2” are initialized to 0, and both processors (CPU1 and CPU2) share all the variables.&lt;br /&gt;
   &lt;br /&gt;
    a = b = flag1 = flag2 = 0;		// initial value&lt;br /&gt;
    CPU1				CPU 2 &lt;br /&gt;
    flag1 = 1;				flag2 = 1;&lt;br /&gt;
    a = 1;				a = 2; &lt;br /&gt;
    r1 = a;				r3 = a;&lt;br /&gt;
    r2 = flag2;				r4 = flag1;&lt;br /&gt;
&lt;br /&gt;
The SPARC V8 architecture follows the Total Store Ordering model and allows a write followed by a read to complete out of program order. One possible result is r1 = 1, r3 = 2, r2 = r4 = 0.&lt;br /&gt;
Strong ordering, however, enforces strict atomicity, so the result depends on how the two processors’ instructions interleave, and r2 = r4 = 0 cannot occur: whichever processor reads the other’s flag as 0 must have finished before the other processor set its flag, so the other processor then reads the first flag as 1.&lt;br /&gt;
&lt;br /&gt;
===Effects on Write Buffer Operation===&lt;br /&gt;
&lt;br /&gt;
// Explain here what ordering has to do with the write buffer coherency&lt;br /&gt;
&lt;br /&gt;
==Cache Coherence Models==&lt;br /&gt;
&lt;br /&gt;
The two prominent models for maintaining cache coherence are the snoopy-bus protocol and the directory-based coherence protocol. The directory-based protocol is used for distributed memory systems, in which every processor keeps track of what data is being stored by means of a directory entry. In the snoopy-bus protocol, every processor monitors the reads and writes that are being serviced by the memory bus. Read and write requests are broadcast on the bus, and each processor responds based on whether or not it holds that data in its cache. We will be looking at shared memory systems in this article.&lt;br /&gt;
&lt;br /&gt;
=== Write-Update ===&lt;br /&gt;
&lt;br /&gt;
In this approach, a write request is broadcast to all the processors, and each processor updates its local cache with the new value of the data. If a read misses in a processor's local cache, it can be serviced by any other processor's cache, as the copy has been updated in all caches. In terms of writes, this saves bus bandwidth, as only one broadcast is needed to bring all the caches up to date with the newest value of the data.&lt;br /&gt;
&lt;br /&gt;
===Write-Invalidate===&lt;br /&gt;
This approach is common in systems where a data block is written by one processor but read by multiple processors. Whenever a write is made to shared data, an invalidate signal is sent to all the caches, asking them to mark that data block invalid, because their copies no longer hold the latest value. This may cause additional traffic on the bus; as a result, a separate bus for invalidation requests may be included in the design.&lt;br /&gt;
&lt;br /&gt;
== Coherence in Write Buffers ==&lt;br /&gt;
&lt;br /&gt;
===Software-Based Coherence===&lt;br /&gt;
The software technique relies on the compiler to ensure that no dependencies exist between the STORE/LOAD accesses carried out by different processors on the shared memory. The compiler, based on indications from the programmer, makes sure that there are no incoherent accesses to shared memory. It is also possible to have a READ/WRITE buffer shared between all processors for access to the main memory.&lt;br /&gt;
&lt;br /&gt;
===Hardware-Based Coherence===&lt;br /&gt;
In hardware-based protocols, accesses to the shared memory are communicated using hardware invalidate signals that are broadcast to all the processor memories. Hardware support may also be required for a LOAD to be broadcast to all the caches, so that a read can be performed directly from a remote memory. The following approaches maintain coherence in write buffers and caches for multiprocessor systems:&lt;br /&gt;
&lt;br /&gt;
====Unique Buffer per Processor==== &lt;br /&gt;
[[Image:Unique Buffer Per Processor.png|thumb|right|250px|Figure 3: Cache based system with unique buffer per processor &lt;br /&gt;
[[http://mprc.pku.cn/mentors/training/ISCAreading/1986/p434-dubois/p434-dubois.pdf]]]] &lt;br /&gt;
&lt;br /&gt;
In this configuration, every processor has its own unique buffer that holds the loads and stores of the local processor as well as the invalidates and loads of remote processors wanting access to the shared data. A store request in the local cache triggers invalidation of that data block in all other caches and main memory. A STORE is said to have been performed when&lt;br /&gt;
&lt;br /&gt;
# the request is serviced in the buffer and the cache is updated on a local hit, and&lt;br /&gt;
# the request is issued to the buffer and an invalidate request has been sent to all other processors for the shared data.&lt;br /&gt;
&lt;br /&gt;
This can be achieved in two ways, depending on the topology of the connection between the caches:&lt;br /&gt;
&lt;br /&gt;
# For a bus-based MP system, the LOADs that miss in the local cache and the STOREs that need invalidation in other caches are broadcast over the bus to all processor caches and memory. The STORE request is then considered performed with respect to all processors when the invalidation signals have been sent out to the private buffers of all the caches.&lt;br /&gt;
# For non-bus MP systems, the caches are connected in a point-to-point topology such as a mesh or a ring. When a store is encountered in one of the processors, the shared data is first locked in the shared memory to ensure atomic access, and then the invalidate signal is propagated along the point-to-point interconnect in a predetermined order. The STORE to shared memory is performed only when the invalidation has been propagated to all the processor caches and buffers. If any processor issues a LOAD to the address that is being STOREd, the LOAD request has to be rejected, or buffered to be serviced at a later time, to maintain atomic access to the shared data block.&lt;br /&gt;
&lt;br /&gt;
====Separate Buffers for Local and Remote Accesses====&lt;br /&gt;
[[Image:Separate Invalidate Buffers.png|thumb|right|250px|Figure 4: Cache based system with separate write-invalidate buffers&lt;br /&gt;
[[http://mprc.pku.cn/mentors/training/ISCAreading/1986/p434-dubois/p434-dubois.pdf]]]] &lt;br /&gt;
&lt;br /&gt;
Another approach to maintaining coherence is for every processor to have separate buffers for local data requests and remote data requests. As seen in the figure, every processor has a Local-buffer, which queues the local LOADs and STOREs, and a Remote-buffer (also termed the Invalidate buffer), which stores the invalidation requests and LOADs coming in from other processors.&lt;br /&gt;
In a bus-based system, this approach makes it difficult to maintain write atomicity, as different invalidate buffers may hold different numbers of invalidation requests, which makes the time at which the data block is invalidated in each cache uncertain. In non-bus-based systems, however, this approach succeeds in maintaining strong coherence, provided the invalidate signal makes sure that the data is invalidated at a cache before moving on to the next processor (rather than just pushing the invalidate request into the buffer and moving on).&lt;br /&gt;
&lt;br /&gt;
====Universal read/write Buffer====&lt;br /&gt;
&lt;br /&gt;
[[Image:Universal Read Write Buffer.png|thumb|right|300px|Figure 5: Universal Read/Write Buffer [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/6b_dm#cite_note-4 5]]]&lt;br /&gt;
&lt;br /&gt;
In this approach, a shared bus controller resides between the local bus and the memory bus, where the local bus is connected to all private caches of the processors (illustrated in Figure 5). This technique supports multiple processors with the same or different write-invalidate coherence protocols, such as MSI, MESI and MOESI. The shared bus controller consists of the local bus controller, the data buffer and the address buffer. Each private cache provides two tri-state signals to the local bus: '''INTERVENE''' and '''SHARE'''.&lt;br /&gt;
&lt;br /&gt;
# INTERVENE is asserted when a cache wants to provide valid data to other caches.&lt;br /&gt;
# SHARE is asserted by a cache controller when there is an address match with one of its own tag addresses.&lt;br /&gt;
&lt;br /&gt;
=====''' Data Buffer '''=====&lt;br /&gt;
&lt;br /&gt;
The data buffer consists of three FIFOs:&lt;br /&gt;
&lt;br /&gt;
# The ''non_block write-FIFO'' receives non-cacheable memory access data from the CPU. These requests do not require any cache eviction or line-fill cycles.&lt;br /&gt;
# The ''write-back FIFO'' is used for eviction cycles. If there is a cache miss and an eviction is required, the missing data is read from memory into the cache (line-fill cycle) just after the evicted data has moved to the FIFO. The contents of this FIFO are written back to memory later, so the CPU gets the new data while the eviction time is hidden from it.&lt;br /&gt;
# The ''read-FIFO'' stores data from the non_block write FIFO, the write-back FIFO or memory. Data is held temporarily in this FIFO until the local bus is clear and ready for new data.&lt;br /&gt;
&lt;br /&gt;
=====''' Address Buffer '''=====&lt;br /&gt;
&lt;br /&gt;
The address buffer also consists of three FIFOs:&lt;br /&gt;
&lt;br /&gt;
# ''Non_block write FIFO'' - stores the addresses of non-blocking accesses. The depth of this FIFO should be the same as that of the corresponding Data Buffer FIFO.&lt;br /&gt;
# ''Write-back FIFO'' - stores the starting address of an eviction cycle and holds the address until the memory bus is free.&lt;br /&gt;
# ''Line-fill FIFO'' - stores the starting address of a line-fill cycle.&lt;br /&gt;
&lt;br /&gt;
The snoop logic is started whenever there is a read cycle to main memory. It compares the read request address with the addresses in both the write-back FIFO and the line-fill FIFO. If there is a match, the HIT flag is set and the CPU gets the data immediately through the internal bypass path. As a result, stale data is not read and the CPU does not incur the memory latency.&lt;br /&gt;
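&lt;br /&gt;
The bypass behaviour can be illustrated with a small software model. The Python sketch below is an assumption-laden simplification rather than the design from the cited paper: it models only the write-back FIFO (the line-fill FIFO is omitted), stores data next to each address, and uses the hypothetical names WriteBackFifo, evict, read and write_back_one.&lt;br /&gt;

```python
from collections import deque

class WriteBackFifo:
    """Evicted lines wait here until the memory bus is free."""

    def __init__(self, memory):
        self.memory = memory          # dict standing in for main memory
        self.entries = deque()        # (line_address, data) pairs, oldest first

    def evict(self, line_address, data):
        # An eviction cycle parks the dirty line in the FIFO.
        self.entries.append((line_address, data))

    def read(self, line_address):
        """Snoop the FIFO on a read; return (hit, data)."""
        # Search newest first so the most recent eviction wins.
        for addr, data in reversed(self.entries):
            if addr == line_address:
                return True, data     # HIT: data comes through the bypass path
        return False, self.memory.get(line_address)

    def write_back_one(self):
        # Retire the oldest entry to main memory.
        addr, data = self.entries.popleft()
        self.memory[addr] = data

memory = {0x40: "old"}
fifo = WriteBackFifo(memory)
fifo.evict(0x40, "new")
print(fifo.read(0x40))
```

In this model the read of address 0x40 returns (True, 'new') even though main memory still holds the stale value, which is the situation the snoop logic is meant to handle.&lt;br /&gt;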
&lt;br /&gt;
For a non-block write cycle, the snoop logic compares the newly requested address and Byte Enable bits with those previously stored in the non_block write FIFO. If there is an address match but no Byte-Enable overlap, the BGO signal is asserted and the pointer of the non_block write FIFO does not move forward, which prevents the FIFO from filling up quickly.&lt;br /&gt;
&lt;br /&gt;
In this approach, read cycles have priority over write cycles, so whenever the memory bus is available, a read address gets direct memory bus access.&lt;br /&gt;
&lt;br /&gt;
=====''' Local Bus Controller '''=====&lt;br /&gt;
&lt;br /&gt;
The controller monitors the local bus for the status bits (M, O, E, S, I) and the INTERVENE# and SHARE# bits, and determines whether the cycle needs to access main memory or a private cache. For a main memory access, the controller informs the buffer with the WB (Write Back), LF (Line Fill) or O3 (owned) status signal.&lt;br /&gt;
&lt;br /&gt;
Algorithm to generate the controller signals:&lt;br /&gt;
&lt;br /&gt;
For the MSI and MESI protocols:&lt;br /&gt;
&lt;br /&gt;
If there is a read miss on the local bus -&lt;br /&gt;
::'''INTERVENE''' is asserted.&lt;br /&gt;
If there is a write miss with '''M''' status on the local bus -&lt;br /&gt;
::A write-back cycle is performed.&lt;br /&gt;
::Generate the '''WB''' signal.&lt;br /&gt;
If there is a read miss or write miss from any of the protocols -&lt;br /&gt;
::An '''LF''' cycle is performed.&lt;br /&gt;
::Generate the '''LF''' signal.&lt;br /&gt;
&lt;br /&gt;
For the MOESI protocol:&lt;br /&gt;
&lt;br /&gt;
If there is a read miss on the local bus&lt;br /&gt;
::'''INTERVENE''' is asserted&lt;br /&gt;
&lt;br /&gt;
The above outputs can be written in Boolean equations as shown in the example&amp;lt;ref&amp;gt;http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=00342629&amp;lt;/ref&amp;gt; below:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
      WB = RD * MOESI * INTV + WR * [STATUS=M]&lt;br /&gt;
      LF =  INTV&lt;br /&gt;
      O3 = RD * MOESI * INTV&lt;br /&gt;
      MOESI means the cycle is initiated by the MOESI protocol.&lt;br /&gt;
      INTV means the INTERVENE signal, which is asserted when any cache provides valid data to another cache.&lt;br /&gt;
&lt;br /&gt;
=====''' Algorithms '''=====&lt;br /&gt;
&lt;br /&gt;
The algorithm followed by each of the processors in a multiprocessor system is given below. The processors may follow different protocols (MSI, MESI or MOESI), but the algorithms below ensure that memory accesses are carried out coherently.&lt;br /&gt;
&lt;br /&gt;
The '''S''' state is common to all three protocols, but this system takes advantage of the MESI and MOESI protocols to reduce memory bus transactions. In the shared state ('''S'''), multiple caches have the same up-to-date data, but memory may or may not have a copy. This change requires the following protocol translation.&lt;br /&gt;
&lt;br /&gt;
The algorithms for the '''MOESI''' protocol are as follows:&lt;br /&gt;
::If a read hit is initiated by another cache&lt;br /&gt;
:::• The cache in the '''M''' state provides the data to the other cache.&lt;br /&gt;
:::• It changes its state from '''M''' to '''O'''.&lt;br /&gt;
::If the same data is hit again&lt;br /&gt;
:::• The cache in the '''O''' state is responsible for providing the data to the requesting cache.&lt;br /&gt;
&lt;br /&gt;
::If there is a write miss in the cache and INTERVENE is not asserted&lt;br /&gt;
:::• Generate a line-fill cycle.&lt;br /&gt;
&lt;br /&gt;
The algorithms for the '''MSI''' and '''MESI''' protocols, which have no '''O''' state, are as follows.&lt;br /&gt;
In the MSI protocol, a cache holding the line in the '''M''' state provides the data to the other caches and to main memory, and changes its state from '''M''' to '''S'''. This transaction normally requires a main memory access, but the local bus translation logic reduces bus accesses and improves performance as follows:&lt;br /&gt;
::• The local bus controller changes the status of the MOESI cache to '''O''' instead of '''S'''.&lt;br /&gt;
::• '''M''' of the MSI or MESI cache changes to '''S''', and main memory is not updated.&lt;br /&gt;
::So, if the same data is requested by another cache&lt;br /&gt;
:::• The MOESI cache in the '''O''' state provides the data instead of main memory.&lt;br /&gt;
::If a write cycle is initiated&lt;br /&gt;
:::If the MOESI cache has a valid line in the '''M''' or '''O''' state&lt;br /&gt;
::::The MOESI cache sends the data line onto the local bus and changes its state to '''I'''&lt;br /&gt;
::::(because the line will be written by the other cache, which makes this copy stale).&lt;br /&gt;
::If the status is '''E''' or '''S'''&lt;br /&gt;
:::• No cache is involved in the data transfer.&lt;br /&gt;
:::• Main memory provides the data.&lt;br /&gt;
&lt;br /&gt;
=='''References'''==&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;br /&gt;
[http://mprc.pku.cn/mentors/training/ISCAreading/1986/p434-dubois/p434-dubois.pdf Memory Access Buffering In Multiprocessors] Michel Dubois, Christoph Scheurich, Faye Briggs&lt;br /&gt;
&lt;br /&gt;
[http://www.ece.cmu.edu/~ece548/handouts/19coher.pdf Multiprocessor Consistency and Coherence] Memory System Architecture, Philip Koopman&lt;br /&gt;
&lt;br /&gt;
[http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=528915 Write buffer design for cache-coherent shared-memory multiprocessors], Fernaz Mounes-Toussi, David J. Lilja&lt;/div&gt;</summary>
		<author><name>Sbasu3</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/6b_am&amp;diff=59307</id>
		<title>CSC/ECE 506 Spring 2012/6b am</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/6b_am&amp;diff=59307"/>
		<updated>2012-03-04T16:12:34Z</updated>

		<summary type="html">&lt;p&gt;Sbasu3: /*  Algorithms  */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Introduction ==&lt;br /&gt;
&lt;br /&gt;
With the present day processor speeds increasing at a much faster rate than memory speeds, there arises a need that the data transactions between the processor and the memory system be managed such that the slow memory speeds do not affect the performance of the processor system. While the read operations require that the processor wait for the read operation to complete before resuming execution, the write operations do not have this requirement. This is where a write buffer (WB) comes into the picture, assisting the processor in writes, so that the processor can continue its operation while the write buffer takes complete responsibility of executing the write. &lt;br /&gt;
&lt;br /&gt;
===Write Buffers in Uni-processors===&lt;br /&gt;
&lt;br /&gt;
A write to be performed is put in a buffer implemented as a FIFO queue, so that writes are performed in the order in which they were issued. In a uni-processor model, where Instruction Level Parallelism (ILP) can be extracted, writes may also be issued out of order, provided that hardware or software protocols check the writes for any dependences that may exist in the instruction stream. The following figure shows a cache-based single-processor system with a write buffer.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:Uniprocessor With WB.png|thumb|center|400px|Figure 1.Uni-processor cache based system with write buffer ]]&lt;br /&gt;
&lt;br /&gt;
===Write Buffer Issues in Multiprocessors===&lt;br /&gt;
&lt;br /&gt;
In a multiprocessor system, the same design can be extended, as shown in the figure. In this design, each processor has its own private cache and a write buffer corresponding to the cache. The caches are connected to each other, as well as to the main memory, by means of an interconnect.&lt;br /&gt;
 &lt;br /&gt;
[[Image:Multiprocessor with WB.png|thumb|center|400px|Figure 2. Cache- based multiprocessor system with write buffer ]] &lt;br /&gt;
&lt;br /&gt;
As can be seen in the figure above, each processor makes a write by pushing it into the write buffer, and the write buffer completes the task of performing the write to the cache or main memory. If the processor issues another write to the same address as an earlier write that is still buffered, the earlier value is overwritten with the new one in the write buffer. Similarly, if a read is issued to the same address, the processor reads the value from the write buffer rather than going to the cache or the main memory.&lt;br /&gt;
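&lt;br /&gt;
The two behaviours described above, merging of writes to the same address and servicing reads from the buffer, can be sketched in a few lines of Python. This is an illustrative model only: it assumes a dictionary keyed by address in place of a hardware FIFO with associative lookup, and the names WriteBuffer, write, read and flush_one are hypothetical.&lt;br /&gt;

```python
class WriteBuffer:
    """Write buffer that merges same-address writes and forwards to reads."""

    def __init__(self, backing_store):
        self.backing_store = backing_store   # dict standing in for cache / main memory
        self.pending = {}                    # address to value, kept in insertion order

    def write(self, addr, value):
        # A second write to a buffered address overwrites the earlier value.
        self.pending[addr] = value

    def read(self, addr):
        # A read to a buffered address is served from the buffer.
        if addr in self.pending:
            return self.pending[addr]
        return self.backing_store.get(addr, 0)

    def flush_one(self):
        # Perform the oldest pending write to the backing store.
        addr = next(iter(self.pending))
        self.backing_store[addr] = self.pending.pop(addr)

memory = {}
wb = WriteBuffer(memory)
wb.write(0x10, 1)
wb.write(0x10, 2)        # merged with the earlier write
print(wb.read(0x10))     # served from the buffer
print(memory)            # the backing store has not been updated yet
```

Until flush_one is called, the new value exists only in this processor's buffer, which is the root of the coherence problem discussed in the next section.&lt;br /&gt;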
&lt;br /&gt;
===The Coherence Problem===&lt;br /&gt;
&lt;br /&gt;
Consider the case where a write (ST_A) has been issued by processor I into the WB_I, and the write is waiting to be executed. Since the write has not yet been performed to the cache or the main memory, the other processors do not have any knowledge about the changes made to address A. As a result, a read operation by another processor from address A will take the value from either its own cache, write buffer, or the main memory (depending on whether it hits or misses in the cache). It is the job of the designer to employ protocols that will take care that the sequential ordering of the instructions is maintained, and that the writes made to any one of the caches are propagated and updated in all of the processor caches. &lt;br /&gt;
&lt;br /&gt;
==Sequential Consistency==&lt;br /&gt;
&lt;br /&gt;
When operating as a part of a multiprocessor, it is not enough to check dependencies between the writes only at the local level. As data is shared and the course of events in different processors may affect each other's outcomes, it has to be ensured that the sequence of writes and the data dependencies are preserved across the multiprocessor.&lt;br /&gt;
“A system is said to be sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.”&lt;br /&gt;
There are different approaches to this requirement, based on the expected outcome of the program. The condition of sequential consistency and logical ordering may be relaxed as required by the program we are looking to parallelize.&lt;br /&gt;
&lt;br /&gt;
===Strong Ordering===&lt;br /&gt;
The requirements for strong ordering are as follows:&lt;br /&gt;
 &lt;br /&gt;
1) All memory operations appear to execute one at a time.&lt;br /&gt;
&lt;br /&gt;
2) All memory operations from a single CPU appear to execute in-order.&lt;br /&gt;
&lt;br /&gt;
3) All memory operations from different processors are “cleanly” interleaved with each other (serialization)&lt;br /&gt;
&lt;br /&gt;
===Total Store Ordering===&lt;br /&gt;
Requirements are as follows:&lt;br /&gt;
&lt;br /&gt;
1) Relaxed consistency in which stores must complete in order, but a store need not complete before a later read to a given location takes place.&lt;br /&gt;
&lt;br /&gt;
2) Reads are allowed to bypass pending writes, but writes MUST exit the store buffer in FIFO order.&lt;br /&gt;
&lt;br /&gt;
===Partial Store Ordering===&lt;br /&gt;
&lt;br /&gt;
Requirement is as follows:&lt;br /&gt;
&lt;br /&gt;
Even more relaxed consistency, in which stores to any given memory location complete in order, but stores to different locations may complete out of order, and stores need not complete before a later read to a given location takes place.&lt;br /&gt;
&lt;br /&gt;
===Weak Ordering===&lt;br /&gt;
Requirement is as follows:&lt;br /&gt;
&lt;br /&gt;
Very relaxed consistency in which memory operations may complete in any order between barrier synchronization points. Global memory state must be completely settled at each synchronization point, and between synchronization points the memory state may correspond to any ordering of reads and writes.&lt;br /&gt;
&lt;br /&gt;
===Example===&lt;br /&gt;
Example for sequential consistency:&lt;br /&gt;
In the following program, the variables “a”, “b”, “flag1” and “flag2” are initialized to 0, and both processors (CPU1 and CPU2) share all of the variables.&lt;br /&gt;
   &lt;br /&gt;
    a = b = flag1 = flag2 = 0;    // initial value&lt;br /&gt;
    CPU1                    CPU2&lt;br /&gt;
    flag1 = 1;              flag2 = 1;&lt;br /&gt;
    a = 1;                  a = 2;&lt;br /&gt;
    r1 = a;                 r3 = a;&lt;br /&gt;
    r2 = flag2;             r4 = flag1;&lt;br /&gt;
&lt;br /&gt;
The SPARC V8 architecture follows the Total Store Ordering model, which allows a read that follows a write in program order to complete before that write. One possible result is therefore r1 = 1, r3 = 2, r2 = r4 = 0.&lt;br /&gt;
Strong ordering, by contrast, makes each write visible to all processors before later operations proceed, so r2 and r4 cannot both be 0; the result depends only on how the two processors’ instructions are interleaved.&lt;br /&gt;
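&lt;br /&gt;
The Total Store Ordering result quoted above can be reproduced with a simple store-buffer model. The Python sketch below is an assumption made for illustration, not a description of SPARC hardware: each CPU has a FIFO store buffer, loads check the CPU's own buffer before shared memory, and the class name TsoCpu and its methods are hypothetical.&lt;br /&gt;

```python
from collections import deque

class TsoCpu:
    """One CPU with a FIFO store buffer in front of shared memory."""

    def __init__(self, memory):
        self.memory = memory
        self.buffer = deque()     # pending (address, value) stores, oldest first
        self.regs = {}

    def store(self, addr, value):
        # The store waits in the buffer; shared memory is not updated yet.
        self.buffer.append((addr, value))

    def load(self, reg, addr):
        # A load checks its own buffer first (newest match wins), then memory.
        for buffered_addr, value in reversed(self.buffer):
            if buffered_addr == addr:
                self.regs[reg] = value
                return
        self.regs[reg] = self.memory[addr]

    def drain(self):
        # Stores leave the buffer in FIFO order, as TSO requires.
        while self.buffer:
            addr, value = self.buffer.popleft()
            self.memory[addr] = value

memory = {"a": 0, "b": 0, "flag1": 0, "flag2": 0}
cpu1, cpu2 = TsoCpu(memory), TsoCpu(memory)

# Both CPUs buffer their stores, then perform their loads before any store drains.
cpu1.store("flag1", 1)
cpu1.store("a", 1)
cpu2.store("flag2", 1)
cpu2.store("a", 2)
cpu1.load("r1", "a")
cpu1.load("r2", "flag2")
cpu2.load("r3", "a")
cpu2.load("r4", "flag1")
cpu1.drain()
cpu2.drain()

print(cpu1.regs, cpu2.regs)
```

With this schedule each CPU sees its own buffered write to a but not the other CPU's flag, so the script prints r1 = 1, r2 = 0 for CPU1 and r3 = 2, r4 = 0 for CPU2.&lt;br /&gt;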
&lt;br /&gt;
===Effects on Write Buffer Operation===&lt;br /&gt;
&lt;br /&gt;
// Explain here what ordering has to do with the write buffer coherency&lt;br /&gt;
&lt;br /&gt;
==Cache Coherence Models==&lt;br /&gt;
&lt;br /&gt;
The two prominent models for maintaining cache coherence are the snoopy-bus protocol and the directory-based coherence protocol. The directory-based protocol is used for distributed memory systems, in which a directory entry is used to keep track of where each block of data is stored. In the snoopy-bus protocol, every processor monitors the reads and writes that are being serviced by the memory bus. Read and write requests are broadcast on the bus, and each processor responds based on whether or not it holds that data in its cache. We will be looking at shared memory systems in this article.&lt;br /&gt;
&lt;br /&gt;
=== Write-Update ===&lt;br /&gt;
&lt;br /&gt;
In this approach, a write request is broadcast to all the processors, and each processor updates its local cache with the new value of the data. If a read misses in a processor's local cache, it can be serviced by any other processor's cache, as the copy has been updated in all caches. In terms of writes, this saves bus bandwidth, as only one broadcast is needed to bring all the caches up to date with the newest value of the data.&lt;br /&gt;
&lt;br /&gt;
===Write-Invalidate===&lt;br /&gt;
This approach is common in systems where a data block is written by one processor but read by multiple processors. Whenever a write is made to shared data, an invalidate signal is sent to all the caches, asking them to mark that data block invalid, because their copies no longer hold the latest value. This may cause additional traffic on the bus; as a result, a separate bus for invalidation requests may be included in the design.&lt;br /&gt;
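&lt;br /&gt;
The difference between the two policies can be seen in a toy model. The Python sketch below assumes each cache is a dictionary from address to value and ignores main memory and bus timing; the function names write_update and write_invalidate are hypothetical, and the update variant refreshes only caches that already hold the block, which is a simplifying assumption.&lt;br /&gt;

```python
def write_update(caches, writer, addr, value):
    # Broadcast the new value; every cache holding the block refreshes its copy.
    for cache_id, cache in caches.items():
        if cache_id == writer or addr in cache:
            cache[addr] = value

def write_invalidate(caches, writer, addr, value):
    # Broadcast an invalidation; only the writer keeps a copy of the block.
    for cache_id, cache in caches.items():
        if cache_id != writer:
            cache.pop(addr, None)
    caches[writer][addr] = value

caches = {0: {"x": 1}, 1: {"x": 1}, 2: {}}

write_update(caches, 0, "x", 5)
print(caches)    # caches 0 and 1 both hold the new value

write_invalidate(caches, 1, "x", 7)
print(caches)    # only cache 1 still holds the block
```

After the update, a read of x in cache 1 hits without further bus traffic; after the invalidation, cache 0 must miss and fetch the block again before it can read x.&lt;br /&gt;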
&lt;br /&gt;
== Coherence in Write Buffers ==&lt;br /&gt;
&lt;br /&gt;
===Software-Based Coherence===&lt;br /&gt;
The software technique relies on the compiler to ensure that no dependencies exist between the STORE/LOAD accesses carried out by different processors on the shared memory. The compiler, based on indications from the programmer, makes sure that there are no incoherent accesses to shared memory. It is also possible to have a READ/WRITE buffer shared between all processors for access to the main memory.&lt;br /&gt;
&lt;br /&gt;
===Hardware-Based Coherence===&lt;br /&gt;
In hardware-based protocols, accesses to the shared memory are communicated using hardware invalidate signals that are broadcast to all the processor memories. Hardware support may also be required for a LOAD to be broadcast to all the caches, so that a read can be performed directly from a remote memory. The following approaches maintain coherence in write buffers and caches for multiprocessor systems:&lt;br /&gt;
&lt;br /&gt;
====Unique Buffer per Processor==== &lt;br /&gt;
[[Image:Unique Buffer Per Processor.png|thumb|right|250px|Figure 3: Cache based system with unique buffer per processor &lt;br /&gt;
[[http://mprc.pku.cn/mentors/training/ISCAreading/1986/p434-dubois/p434-dubois.pdf]]]] &lt;br /&gt;
&lt;br /&gt;
In this configuration, every processor has its own unique buffer that holds the loads and stores of the local processor as well as the invalidates and loads of remote processors wanting access to the shared data. A store request in the local cache triggers invalidation of that data block in all other caches and main memory. A STORE is said to have been performed when&lt;br /&gt;
&lt;br /&gt;
# the request is serviced in the buffer and the cache is updated on a local hit, and&lt;br /&gt;
# the request is issued to the buffer and an invalidate request has been sent to all other processors for the shared data.&lt;br /&gt;
&lt;br /&gt;
This can be achieved in two ways, depending on the topology of the connection between the caches:&lt;br /&gt;
&lt;br /&gt;
# For a bus-based MP system, the LOADs that miss in the local cache and the STOREs that need invalidation in other caches are broadcast over the bus to all processor caches and memory. The STORE request is then considered performed with respect to all processors when the invalidation signals have been sent out to the private buffers of all the caches.&lt;br /&gt;
# For non-bus MP systems, the caches are connected in a point-to-point topology such as a mesh or a ring. When a store is encountered in one of the processors, the shared data is first locked in the shared memory to ensure atomic access, and then the invalidate signal is propagated along the point-to-point interconnect in a predetermined order. The STORE to shared memory is performed only when the invalidation has been propagated to all the processor caches and buffers. If any processor issues a LOAD to the address that is being STOREd, the LOAD request has to be rejected, or buffered to be serviced at a later time, to maintain atomic access to the shared data block.&lt;br /&gt;
&lt;br /&gt;
====Separate Buffers for Local and Remote Accesses====&lt;br /&gt;
[[Image:Separate Invalidate Buffers.png|thumb|right|250px|Figure 4: Cache based system with separate write-invalidate buffers&lt;br /&gt;
[[http://mprc.pku.cn/mentors/training/ISCAreading/1986/p434-dubois/p434-dubois.pdf]]]] &lt;br /&gt;
&lt;br /&gt;
Another approach to maintaining coherence is for every processor to have separate buffers for local data requests and remote data requests. As seen in the figure, every processor has a Local-buffer, which queues the local LOADs and STOREs, and a Remote-buffer (also termed the Invalidate buffer), which stores the invalidation requests and LOADs coming in from other processors.&lt;br /&gt;
In a bus-based system, this approach makes it difficult to maintain write atomicity, as different invalidate buffers may hold different numbers of invalidation requests, which makes the time at which the data block is invalidated in each cache uncertain. In non-bus-based systems, however, this approach succeeds in maintaining strong coherence, provided the invalidate signal makes sure that the data is invalidated at a cache before moving on to the next processor (rather than just pushing the invalidate request into the buffer and moving on).&lt;br /&gt;
&lt;br /&gt;
====Universal read/write Buffer====&lt;br /&gt;
&lt;br /&gt;
[[Image:Universal Read Write Buffer.png|thumb|right|300px|Figure 5: Universal Read/Write Buffer [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/6b_dm#cite_note-4 5]]]&lt;br /&gt;
&lt;br /&gt;
In this approach, a shared bus controller resides between the local bus and the memory bus, where the local bus is connected to all private caches of the processors (illustrated in Figure 5). This technique supports multiple processors with the same or different write-invalidate coherence protocols, such as MSI, MESI and MOESI. The shared bus controller consists of the local bus controller, the data buffer and the address buffer. Each private cache provides two tri-state signals to the local bus: '''INTERVENE''' and '''SHARE'''.&lt;br /&gt;
&lt;br /&gt;
# INTERVENE is asserted when a cache wants to provide valid data to other caches.&lt;br /&gt;
# SHARE is asserted by a cache controller when there is an address match with one of its own tag addresses.&lt;br /&gt;
&lt;br /&gt;
=====''' Data Buffer '''=====&lt;br /&gt;
&lt;br /&gt;
The data buffer consists of three FIFOs:&lt;br /&gt;
&lt;br /&gt;
# The ''non_block write-FIFO'' receives non-cacheable memory access data from the CPU. These requests do not require any cache eviction or line-fill cycles.&lt;br /&gt;
# The ''write-back FIFO'' is used for eviction cycles. If there is a cache miss and an eviction is required, the missing data is read from memory into the cache (line-fill cycle) just after the evicted data has moved to the FIFO. The contents of this FIFO are written back to memory later, so the CPU gets the new data while the eviction time is hidden from it.&lt;br /&gt;
# The ''read-FIFO'' stores data from the non_block write FIFO, the write-back FIFO or memory. Data is held temporarily in this FIFO until the local bus is clear and ready for new data.&lt;br /&gt;
&lt;br /&gt;
=====''' Address Buffer '''=====&lt;br /&gt;
&lt;br /&gt;
The address buffer also consists of three FIFOs:&lt;br /&gt;
&lt;br /&gt;
# ''Non_block write FIFO'' - stores the addresses of non-blocking accesses. The depth of this FIFO should be the same as that of the corresponding Data Buffer FIFO.&lt;br /&gt;
# ''Write-back FIFO'' - stores the starting address of an eviction cycle and holds the address until the memory bus is free.&lt;br /&gt;
# ''Line-fill FIFO'' - stores the starting address of a line-fill cycle.&lt;br /&gt;
&lt;br /&gt;
The snoop logic is started whenever there is a read cycle to main memory. It compares the read request address with the addresses in both the write-back FIFO and the line-fill FIFO. If there is a match, the HIT flag is set and the CPU gets the data immediately through the internal bypass path. As a result, stale data is not read and the CPU does not incur the memory latency.&lt;br /&gt;
&lt;br /&gt;
For a non-block write cycle, the snoop logic compares the newly requested address and Byte Enable bits with those previously stored in the non_block write FIFO. If there is an address match but no Byte-Enable overlap, the BGO signal is asserted and the pointer of the non_block write FIFO does not move forward, which prevents the FIFO from filling up quickly.&lt;br /&gt;
&lt;br /&gt;
In this approach, read cycles have priority over write cycles, so whenever the memory bus is available, a read address gets direct memory bus access.&lt;br /&gt;
&lt;br /&gt;
=====''' Local Bus Controller '''=====&lt;br /&gt;
&lt;br /&gt;
The controller monitors the local bus for the status bits (M, O, E, S, I) and the INTERVENE# and SHARE# bits, and determines whether the cycle needs to access main memory or a private cache. For a main memory access, the controller informs the buffer with the WB (Write Back), LF (Line Fill) or O3 (owned) status signal.&lt;br /&gt;
&lt;br /&gt;
Algorithm to generate the controller signals:&lt;br /&gt;
&lt;br /&gt;
For the MSI and MESI protocols:&lt;br /&gt;
&lt;br /&gt;
If there is a read miss on the local bus -&lt;br /&gt;
::'''INTERVENE''' is asserted.&lt;br /&gt;
If there is a write miss with '''M''' status on the local bus -&lt;br /&gt;
::A write-back cycle is performed.&lt;br /&gt;
::Generate the '''WB''' signal.&lt;br /&gt;
If there is a read miss or write miss from any of the protocols -&lt;br /&gt;
::An '''LF''' cycle is performed.&lt;br /&gt;
::Generate the '''LF''' signal.&lt;br /&gt;
&lt;br /&gt;
For the MOESI protocol:&lt;br /&gt;
&lt;br /&gt;
If there is a read miss on the local bus&lt;br /&gt;
::'''INTERVENE''' is asserted&lt;br /&gt;
&lt;br /&gt;
The above outputs can be written in Boolean equations as shown in the example&amp;lt;ref&amp;gt;http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=00342629&amp;lt;/ref&amp;gt; below:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
      WB = RD * MOESI * INTV + WR * [STATUS=M]&lt;br /&gt;
      LF =  INTV&lt;br /&gt;
      O3 = RD * MOESI * INTV&lt;br /&gt;
      MOESI means the cycle is initiated by a cache using the MOESI protocol.&lt;br /&gt;
      INTV is the INTERVENE signal, which is asserted when any cache provides valid data to another cache.&lt;br /&gt;
&lt;br /&gt;
=====''' Algorithms '''=====&lt;br /&gt;
&lt;br /&gt;
The algorithms followed by the processors in the multiprocessor system are given below. The processors may follow different protocols (MSI, MESI or MOESI), but the algorithms below ensure that memory accesses are carried out coherently.&lt;br /&gt;
&lt;br /&gt;
The '''S''' state is common to all three protocols, but this system takes advantage of the MESI and MOESI protocols to reduce memory bus transactions. In the shared state (S), multiple caches hold the same up-to-date data, but memory may or may not have a copy. This change requires the following protocol translation. &lt;br /&gt;
&lt;br /&gt;
The algorithms for the '''MOESI''' protocol are as follows:&lt;br /&gt;
::If read hit initiated by other cache&lt;br /&gt;
:::• '''M''' state provides data to other cache &lt;br /&gt;
:::• Changes its state from '''M''' to''' O'''&lt;br /&gt;
::If same data is hit again&lt;br /&gt;
:::• The cache holding the line in the '''O''' state is responsible for providing data to the requesting cache.&lt;br /&gt;
&lt;br /&gt;
::If a write misses in the cache and INTERVENE is not asserted&lt;br /&gt;
:::• Generate a line-fill cycle&lt;br /&gt;
&lt;br /&gt;
The algorithms for the '''MSI''' and '''MESI''' protocols are as follows. (These protocols have no '''O''' state.)&lt;br /&gt;
In the MSI protocol, a cache holding the line in the '''M''' state provides the data to the other caches and to main memory, and changes its state from '''M''' to '''S'''. This transaction requires a main memory access, but the local bus translation logic reduces bus accesses and improves performance as follows. &lt;br /&gt;
::• Local Bus controller changes the status of MOESI to '''O''' instead of '''S'''&lt;br /&gt;
::•''' M''' of MSI or MESI changes to '''S''' and no update will occur on main memory.&lt;br /&gt;
::So, if same data is requested by other cache&lt;br /&gt;
:::• '''O''' of MOESI will provide the data instead of main memory.&lt;br /&gt;
::If write cycle initiated &lt;br /&gt;
:::If MOESI cache has a valid line with '''M''' or''' O'''  &lt;br /&gt;
::::MOESI cache will send out the data line on to local bus and change the state to '''I'''.&lt;br /&gt;
::::(Because the line will be written by other cache and outdated.)&lt;br /&gt;
::If the status is '''E''' or '''S'''&lt;br /&gt;
:::• No cache is involved in the data transfer.&lt;br /&gt;
:::• Main memory will provide the data.&lt;br /&gt;
&lt;br /&gt;
=='''References'''==&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;br /&gt;
[http://mprc.pku.cn/mentors/training/ISCAreading/1986/p434-dubois/p434-dubois.pdf Memory Access Buffering In Multiprocessors] Michel Dubois, Christoph Scheurich, Faye Briggs&lt;br /&gt;
&lt;br /&gt;
[http://www.ece.cmu.edu/~ece548/handouts/19coher.pdf Multiprocessor Consistency and Coherence] Memory System Architecture, Philip Koopman&lt;br /&gt;
&lt;br /&gt;
[http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=528915 Write buffer design for cache-coherent shared-memory multiprocessors], Fernaz Mounes-Toussi, David J. Lilja&lt;/div&gt;</summary>
		<author><name>Sbasu3</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/6b_am&amp;diff=59306</id>
		<title>CSC/ECE 506 Spring 2012/6b am</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/6b_am&amp;diff=59306"/>
		<updated>2012-03-04T16:11:58Z</updated>

		<summary type="html">&lt;p&gt;Sbasu3: /*  Algorithms  */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Introduction ==&lt;br /&gt;
&lt;br /&gt;
Processor speeds are increasing at a much faster rate than memory speeds, so data transactions between the processor and the memory system must be managed such that slow memory does not limit processor performance. A read requires the processor to wait for the data before resuming execution, but a write has no such requirement. This is where a write buffer (WB) comes in: the processor hands the write to the buffer and continues executing while the buffer takes complete responsibility for performing the write. &lt;br /&gt;
&lt;br /&gt;
===Write Buffers in Uni-processors===&lt;br /&gt;
&lt;br /&gt;
A write to be performed is put in a buffer implemented as a FIFO queue, so that writes are performed in the order in which they were issued. In a uni-processor model that extracts Instruction Level Parallelism (ILP), writes may also be issued out of order, provided hardware or software mechanisms check the writes for any dependences that exist in the instruction stream. The following figure shows a cache-based single-processor system with a write buffer. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:Uniprocessor With WB.png|thumb|center|400px|Figure 1. Uni-processor cache-based system with write buffer ]]&lt;br /&gt;
&lt;br /&gt;
===Write Buffer Issues in Multiprocessors===&lt;br /&gt;
&lt;br /&gt;
In a multiprocessor system, the same design can be extended, as shown in the figure. In this design, each processor has its own private cache and a write buffer corresponding to that cache. The caches are connected to each other and to the main memory by means of an interconnect. &lt;br /&gt;
 &lt;br /&gt;
[[Image:Multiprocessor with WB.png|thumb|center|400px|Figure 2. Cache-based multiprocessor system with write buffer ]] &lt;br /&gt;
&lt;br /&gt;
As can be seen in the figure above, each processor makes a write by pushing it into the write buffer, and the write buffer completes the task of performing the write to the cache or main memory. If the processor issues another write to the same address as an earlier buffered write, the earlier value is overwritten with the new one in the write buffer. Similarly, if the processor issues a read to the same address, it reads the value from the write buffer rather than going to the cache or the main memory.&lt;br /&gt;
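&lt;br /&gt;
The merging and forwarding behaviour described above can be sketched in a few lines of Python. This is a minimal illustration rather than a model of any real hardware: the class name WriteBuffer, its methods, and the dictionary standing in for main memory are all assumptions made for this example.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import OrderedDict


class WriteBuffer:
    """Hypothetical FIFO write buffer that merges writes to the same address."""

    def __init__(self, memory):
        self.memory = memory            # backing store: dict of address to value
        self.pending = OrderedDict()    # buffered writes, oldest first

    def write(self, addr, value):
        # A second write to a buffered address overwrites the earlier value in place.
        self.pending[addr] = value

    def read(self, addr):
        # A read to a buffered address is served from the buffer, not from memory.
        if addr in self.pending:
            return self.pending[addr]
        return self.memory.get(addr, 0)

    def drain_one(self):
        # Retire the oldest buffered write to memory.
        if self.pending:
            addr, value = self.pending.popitem(last=False)
            self.memory[addr] = value


memory = {}
wb = WriteBuffer(memory)
wb.write(0x10, 1)
wb.write(0x10, 2)                 # merges with the earlier write
forwarded = wb.read(0x10)         # served from the buffer
before_drain = memory.get(0x10)   # memory has not seen the write yet
wb.drain_one()
after_drain = memory.get(0x10)
```
&lt;br /&gt;
Here the second write to address 0x10 replaces the first inside the buffer, the read is served from the buffer, and memory only sees the value once the entry is drained.&lt;br /&gt;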
&lt;br /&gt;
===The Coherence Problem===&lt;br /&gt;
&lt;br /&gt;
Consider the case where a write (ST_A) has been issued by processor I into the WB_I, and the write is waiting to be executed. Since the write has not yet been performed to the cache or the main memory, the other processors do not have any knowledge about the changes made to address A. As a result, a read operation by another processor from address A will take the value from either its own cache, write buffer, or the main memory (depending on whether it hits or misses in the cache). It is the job of the designer to employ protocols that will take care that the sequential ordering of the instructions is maintained, and that the writes made to any one of the caches are propagated and updated in all of the processor caches. &lt;br /&gt;
&lt;br /&gt;
==Sequential Consistency==&lt;br /&gt;
&lt;br /&gt;
When operating as part of a multiprocessor, it is not enough to check dependencies between writes only at the local level. Because data is shared and events in one processor may affect the outcomes in another, the sequence of writes and the data dependencies must be preserved across all the processors. &lt;br /&gt;
“A system is said to be sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.”&lt;br /&gt;
There are different approaches to this requirement, based on the expected outcome of the program. The condition of sequential consistency and logical ordering may be relaxed as per the requirement of the program we are looking to parallelize. &lt;br /&gt;
&lt;br /&gt;
===Strong Ordering===&lt;br /&gt;
The requirements for strong ordering are as follows:&lt;br /&gt;
 &lt;br /&gt;
1) All memory operations appear to execute one at a time.&lt;br /&gt;
&lt;br /&gt;
2) All memory operations from a single CPU appear to execute in-order.&lt;br /&gt;
&lt;br /&gt;
3) All memory operations from different processors are “cleanly” interleaved with each other (serialization)&lt;br /&gt;
&lt;br /&gt;
===Total Store Ordering===&lt;br /&gt;
Requirements are as follows:&lt;br /&gt;
&lt;br /&gt;
1)	Relaxed consistency in which stores must complete in order, but stores need not complete before a read to a given location takes place.&lt;br /&gt;
&lt;br /&gt;
2)	Reads are allowed to bypass pending writes, but writes MUST exit the store buffer in FIFO order.&lt;br /&gt;
&lt;br /&gt;
===Partial Store Ordering===&lt;br /&gt;
&lt;br /&gt;
Requirement is as follows:&lt;br /&gt;
&lt;br /&gt;
Even more relaxed consistency in which stores to any given memory location complete in order, but stores to different locations may complete out of order, and stores need not complete before a read to a given location takes place.&lt;br /&gt;
&lt;br /&gt;
===Weak Ordering===&lt;br /&gt;
Requirement is as follows:&lt;br /&gt;
&lt;br /&gt;
Very relaxed consistency in which any ordering is allowed except at barrier synchronization points. Global memory state must be completely settled at each synchronization point, and between synchronization points the memory state may correspond to any ordering of reads and writes.&lt;br /&gt;
&lt;br /&gt;
===Example===&lt;br /&gt;
The following program illustrates the difference between the ordering models.&lt;br /&gt;
The variables “a”, “b”, “flag1” and “flag2” are initialized to 0 and are shared by both processors (CPU1 and CPU2).&lt;br /&gt;
   &lt;br /&gt;
    a = b = flag1 = flag2 = 0;          // initial value&lt;br /&gt;
    CPU1                                CPU2&lt;br /&gt;
    flag1 = 1;                          flag2 = 1;&lt;br /&gt;
    a = 1;                              a = 2;&lt;br /&gt;
    r1 = a;                             r3 = a;&lt;br /&gt;
    r2 = flag2;                         r4 = flag1;&lt;br /&gt;
&lt;br /&gt;
The SPARC V8 architecture follows the Total Store Ordering model and allows a write followed by a read to complete out of program order. One possible result is therefore r1 = 1, r3 = 2, r2 = r4 = 0.&lt;br /&gt;
Strong ordering, in contrast, makes every operation appear atomic and in program order, so r2 = r4 = 0 cannot occur; the result depends on how the two processors’ instructions are interleaved.&lt;br /&gt;
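&lt;br /&gt;
The Total Store Ordering outcome in the example above can be reproduced with a small simulation. The sketch below is an illustration under stated assumptions, not a model of the SPARC V8 hardware: the class Cpu, its store, load and drain methods, and the rule that every store stays buffered until drain is called are all hypothetical choices made for this example.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import deque


class Cpu:
    """Hypothetical processor with a private FIFO store buffer."""

    def __init__(self, memory):
        self.memory = memory
        self.store_buffer = deque()   # FIFO of (variable, value) pairs

    def store(self, var, value):
        # The store is buffered; other processors cannot see it yet.
        self.store_buffer.append((var, value))

    def load(self, var):
        # Forward from the newest matching buffered store, else read memory.
        for name, value in reversed(self.store_buffer):
            if name == var:
                return value
        return self.memory[var]

    def drain(self):
        # Buffered stores leave in FIFO order.
        while self.store_buffer:
            var, value = self.store_buffer.popleft()
            self.memory[var] = value


memory = {"a": 0, "b": 0, "flag1": 0, "flag2": 0}
cpu1, cpu2 = Cpu(memory), Cpu(memory)

cpu1.store("flag1", 1)
cpu2.store("flag2", 1)
cpu1.store("a", 1)
cpu2.store("a", 2)
r1 = cpu1.load("a")        # forwarded from the store buffer of CPU1
r3 = cpu2.load("a")        # forwarded from the store buffer of CPU2
r2 = cpu1.load("flag2")    # the store by CPU2 is still buffered
r4 = cpu2.load("flag1")    # the store by CPU1 is still buffered
cpu1.drain()
cpu2.drain()
print(r1, r2, r3, r4)
```
&lt;br /&gt;
Each load of “a” is satisfied from the local store buffer, while the loads of the flags bypass the other processor's pending stores and read the initial value from memory, giving r1 = 1, r3 = 2 and r2 = r4 = 0.&lt;br /&gt;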
&lt;br /&gt;
===Effects on Write Buffer Operation===&lt;br /&gt;
&lt;br /&gt;
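The ordering model determines what a write buffer is allowed to do. Under strong ordering, a read may not bypass a write that is still waiting in the buffer, whereas Total Store Ordering allows reads to bypass pending writes as long as the writes leave the buffer in FIFO order.&lt;br /&gt;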
&lt;br /&gt;
==Cache Coherence Models==&lt;br /&gt;
&lt;br /&gt;
The two prominent models for maintaining cache coherence are the snoopy-bus protocol and the directory-based coherence protocol. The directory-based protocol is used for distributed memory systems, in which every processor keeps track of what data is being stored using a directory entry. In the snoopy-bus protocol, every processor monitors the reads and writes that are being serviced by the memory bus. Read and write requests are broadcast on the bus so that each processor can respond according to whether it holds that data in its cache. We will be looking at shared memory systems in this article.&lt;br /&gt;
&lt;br /&gt;
=== Write-Update ===&lt;br /&gt;
&lt;br /&gt;
In this approach, a write request is broadcast to all the processors, and each processor updates its local cache with the new value of the data. A later read of that block can therefore be served by any cache that holds it, because every copy has been updated. In terms of writes, this saves bus bandwidth, as only one broadcast is needed for all the caches to be up to date with the newest value of the data.&lt;br /&gt;
&lt;br /&gt;
===Write-Invalidate===&lt;br /&gt;
This approach is common in systems where a data block is written by one processor but read by multiple processors. Whenever a write is made to shared data, an invalidate signal is sent to all the caches, asking them to mark that data block invalid because their copy no longer holds the latest value. This may cause additional traffic on the bus, so a separate bus for invalidation requests may be included in the design.&lt;br /&gt;
&lt;br /&gt;
== Coherence in Write Buffers ==&lt;br /&gt;
&lt;br /&gt;
===Software-Based Coherence===&lt;br /&gt;
The software technique relies on the compiler to ensure that no dependencies exist between the STORE/LOAD accesses carried out at different processors on the shared memory. The compiler, based on indications from the programmer will make sure that there are no incoherent accesses to shared memory. There is a possibility of having a shared READ/WRITE buffer between all processors to access the main memory. &lt;br /&gt;
&lt;br /&gt;
===Hardware-Based Coherence===&lt;br /&gt;
In hardware-based protocols, accesses to the shared memory are communicated using hardware invalidate signals that are broadcast to all the processor memories. Hardware support may also be required for a LOAD to be broadcast to all the caches, so that a read can be performed directly from a remote memory. The following approaches maintain coherence in write buffers and caches for multiprocessor systems:&lt;br /&gt;
&lt;br /&gt;
====Unique Buffer per Processor==== &lt;br /&gt;
[[Image:Unique Buffer Per Processor.png|thumb|right|250px|Figure 3: Cache based system with unique buffer per processor &lt;br /&gt;
[[http://mprc.pku.cn/mentors/training/ISCAreading/1986/p434-dubois/p434-dubois.pdf]]]] &lt;br /&gt;
&lt;br /&gt;
In this configuration, every processor has its own unique buffer that holds the loads and stores of the local processor as well as the invalidates and loads of remote processors wanting access to the shared data. A store request in the local cache triggers invalidation of that data block in all other caches and main memory. A STORE is said to have been performed when &lt;br /&gt;
&lt;br /&gt;
# the request is serviced in the buffer and the cache is updated on a local hit and&lt;br /&gt;
# the request is issued to the buffer and an invalidate request has been sent to all other processor on the shared data.&lt;br /&gt;
&lt;br /&gt;
This can be achieved in two ways depending on the topology of the connection between the caches -&lt;br /&gt;
&lt;br /&gt;
# For a bus-based MP system, the LOADs that miss in the local cache and the STOREs that need invalidation in other caches are broadcast over the bus to all processor caches and memory. The STORE request is then considered performed with respect to all processors when the invalidation signals have been sent out to the private buffers of all the caches. &lt;br /&gt;
# For non-bus MP systems, the caches are connected in point-to-point mesh-based or interconnect ring-based topology. When a store is encountered in one of the processors, the shared data is first locked in the shared memory to ensure atomic access, and then the invalidate signal is propagated along the point-to-point interconnect in a decided order. The STORE to shared memory is performed only when the invalidation has been propagated to all the processor caches and buffers. If any processor issues a LOAD on the same address as that is being STOREd, then the LOAD request has to be rejected or buffered to be serviced at a later time, to maintain atomic access to the shared data block.&lt;br /&gt;
&lt;br /&gt;
====Separate Buffers for Local and Remote Accesses====&lt;br /&gt;
[[Image:Separate Invalidate Buffers.png|thumb|right|250px|Figure 4: Cache based system with separate write-invalidate buffers&lt;br /&gt;
[[http://mprc.pku.cn/mentors/training/ISCAreading/1986/p434-dubois/p434-dubois.pdf]]]] &lt;br /&gt;
&lt;br /&gt;
Another approach to maintaining coherence is for every processor to have separate buffers for local data requests and remote data requests. As seen in the figure, every processor has a Local-buffer which queues the local LOADs and STOREs, and a Remote-buffer (also termed Invalidate buffer) that stores the invalidation requests and LOADs coming in from different processors.&lt;br /&gt;
In a bus-based multiprocessor system, this approach makes it difficult to maintain write atomicity, because different invalidate buffers may hold different numbers of invalidation requests, which makes the time at which the data block is invalidated in each cache uncertain. In non-bus-based systems, however, this approach is successful in maintaining strong coherence, provided the invalidate signal makes sure that the data is invalidated at a cache before moving on to the next processor (rather than just pushing the invalidate request into the buffer and moving on).&lt;br /&gt;
&lt;br /&gt;
====Universal read/write Buffer====&lt;br /&gt;
&lt;br /&gt;
[[Image:Universal Read Write Buffer.png|thumb|right|300px|Figure 5: Universal Read/Write Buffer [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/6b_dm#cite_note-4 5]]]&lt;br /&gt;
&lt;br /&gt;
In this approach, a shared bus controller resides between the local bus and the memory bus, where the local bus is connected to all the private caches of the processors (illustrated in Fig 5). This technique supports multiple processors with the same or different coherence protocols, such as the MSI, MESI and MOESI write-invalidate strategies. The shared bus controller consists of the local bus controller, the data buffer and the address buffer. Each private cache provides two tri-state signals to the local bus: '''INTERVENE''' and '''SHARE'''.&lt;br /&gt;
&lt;br /&gt;
# INTERVENE is asserted when any cache wants to provide valid data to other caches.&lt;br /&gt;
# SHARE is asserted by a cache controller when there is an address match with any of its own tag addresses.&lt;br /&gt;
&lt;br /&gt;
=====''' Data Buffer '''=====&lt;br /&gt;
&lt;br /&gt;
The data buffer consists of three FIFOs&lt;br /&gt;
&lt;br /&gt;
# The ''non_block write-FIFO'' receives non cacheable memory access data from CPU. These requests do not require any cache eviction or line-fill cycles. &lt;br /&gt;
# The ''write-back FIFO'' is used for eviction cycles. If there is a cache miss and an eviction is required, the missed data is read from memory into the cache (line-fill cycle) just after the evicted data has been moved to this FIFO. The FIFO contents are written back to memory later, so the CPU gets the new data while the eviction time is hidden from it. &lt;br /&gt;
# The ''read-FIFO'' is used for storing data from the non_block write FIFO, the write-back FIFO or memory. Data is held in this FIFO temporarily until the local bus is free and ready for new data.&lt;br /&gt;
&lt;br /&gt;
=====''' Address Buffer '''=====&lt;br /&gt;
&lt;br /&gt;
The address buffer also consists of three FIFOs &lt;br /&gt;
&lt;br /&gt;
# ''Non_block write FIFO'' - stores address of non-blocking accesses. The depth of this FIFO should be the same as the corresponding Data Buffer FIFO.&lt;br /&gt;
# ''Write-back FIFO'' - stores starting address of eviction cycle and holds the address until memory bus is free.&lt;br /&gt;
# ''Line-fill FIFO''- stores the starting address of line-fill cycle.&lt;br /&gt;
&lt;br /&gt;
Snoop logic starts whenever there is a read cycle to main memory. It compares the read request address with the addresses in both the write-back FIFO and the line-fill FIFO. If there is a match, the HIT flag is set and the CPU gets the data immediately through the internal bypass path. As a result, stale data is not read and the CPU does not have to wait out the memory latency.&lt;br /&gt;
&lt;br /&gt;
For a non-block write cycle, the snoop logic compares the newly requested address and Byte Enable bits with the entries already stored in the non_block write FIFO. If there is an address match but no Byte Enable overlap, the BGO signal is asserted and the pointer of the non_block write FIFO does not move forward, which prevents the FIFO from filling up quickly.&lt;br /&gt;
&lt;br /&gt;
In this approach, read cycles have priority over write cycles, so whenever the memory bus is available, a read address gets direct access to it.&lt;br /&gt;
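&lt;br /&gt;
The read-snoop check described above can be sketched as follows. The function name snoop_read and the representation of each FIFO as a Python list of (address, data) pairs are assumptions made for illustration; the design described here keeps addresses and data in separate buffers.&lt;br /&gt;
&lt;br /&gt;
```python
def snoop_read(addr, write_back_fifo, line_fill_fifo, memory):
    """Return (hit, data) for a read cycle aimed at main memory."""
    for fifo in (write_back_fifo, line_fill_fifo):
        # Search the newest entries first so the most recent data wins.
        for entry_addr, data in reversed(fifo):
            if entry_addr == addr:
                return True, data      # HIT: data comes over the bypass path
    return False, memory[addr]         # no match: read main memory


memory = {0x40: "stale", 0x80: "clean"}
write_back_fifo = [(0x40, "evicted-dirty")]   # evicted line not yet written back
line_fill_fifo = []

hit_result = snoop_read(0x40, write_back_fifo, line_fill_fifo, memory)
miss_result = snoop_read(0x80, write_back_fifo, line_fill_fifo, memory)
```
&lt;br /&gt;
The read of address 0x40 hits in the write-back FIFO and returns the evicted line instead of the stale copy in memory, while the read of 0x80 finds no match and goes to memory.&lt;br /&gt;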
&lt;br /&gt;
=====''' Local Bus Controller '''=====&lt;br /&gt;
&lt;br /&gt;
The controller monitors the local bus for the status bits (M, O, E, S, I) and the INTERVENE# and SHARE# bits, and determines whether the cycle needs to access main memory or a private cache. For a main memory access, the controller informs the buffer with the WB (Write Back), LF (Line Fill) or O3 (owned) status signal.&lt;br /&gt;
&lt;br /&gt;
Algorithm to generate controller signals-&lt;br /&gt;
&lt;br /&gt;
For the MSI and MOSI protocols: &lt;br /&gt;
&lt;br /&gt;
If there is a read miss on the local bus -&lt;br /&gt;
::'''INTERVENE''' is asserted.&lt;br /&gt;
If there is a write miss with '''M''' status on the local bus -&lt;br /&gt;
::Write back cycle will be performed&lt;br /&gt;
::Generate '''WB''' signal.&lt;br /&gt;
If there is a read miss or write miss under any of the protocols -&lt;br /&gt;
::'''LF''' cycle will be performed&lt;br /&gt;
::Generate '''LF''' signal&lt;br /&gt;
&lt;br /&gt;
For the MOESI protocol:&lt;br /&gt;
&lt;br /&gt;
If there is a read miss on the local bus&lt;br /&gt;
::'''INTERVENE''' is asserted&lt;br /&gt;
&lt;br /&gt;
The above outputs can be written in Boolean equations as shown in the example&amp;lt;ref&amp;gt;http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=00342629&amp;lt;/ref&amp;gt; below:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
      WB = RD * MOESI * INTV + WR * [STATUS=M]&lt;br /&gt;
      LF =  INTV&lt;br /&gt;
      O3 = RD * MOESI * INTV&lt;br /&gt;
      MOESI means the cycle is initiated by a cache using the MOESI protocol.&lt;br /&gt;
      INTV is the INTERVENE signal, which is asserted when any cache provides valid data to another cache.&lt;br /&gt;
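&lt;br /&gt;
As a quick way to experiment with these equations, they can be evaluated directly in Python. The sketch below transcribes the three equations exactly as printed above, with * read as AND and + as OR; the function name controller_signals and its argument names are assumptions introduced for this example, and no complement bars that may have been lost from the cited source are reconstructed.&lt;br /&gt;
&lt;br /&gt;
```python
def controller_signals(rd, wr, moesi, intv, status):
    """Evaluate the WB, LF and O3 outputs for one local-bus cycle."""
    wb = (rd and moesi and intv) or (wr and status == "M")
    lf = intv
    o3 = rd and moesi and intv
    return {"WB": bool(wb), "LF": bool(lf), "O3": bool(o3)}


# Read cycle from a MOESI cache with INTERVENE asserted.
read_intervened = controller_signals(
    rd=True, wr=False, moesi=True, intv=True, status="M")

# Write cycle to a line in the M state, with no intervention.
write_to_modified = controller_signals(
    rd=False, wr=True, moesi=False, intv=False, status="M")
```
&lt;br /&gt;
With these inputs, the intervened read raises all three signals, while the write to a modified line raises only WB.&lt;br /&gt;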
&lt;br /&gt;
=====''' Algorithms '''=====&lt;br /&gt;
&lt;br /&gt;
The algorithms followed by the processors in the multiprocessor system are given below. The processors may follow different protocols (MSI, MESI or MOESI), but the algorithms below ensure that memory accesses are carried out coherently.&lt;br /&gt;
&lt;br /&gt;
The '''S''' state is common to all three protocols, but this system takes advantage of the MESI and MOESI protocols to reduce memory bus transactions. In the shared state (S), multiple caches hold the same up-to-date data, but memory may or may not have a copy. This change requires the following protocol translation. &lt;br /&gt;
&lt;br /&gt;
The algorithms for the '''MOESI''' protocol are as follows:&lt;br /&gt;
::If read hit initiated by other cache&lt;br /&gt;
:::• '''M''' state provides data to other cache &lt;br /&gt;
:::• Changes its state from '''M''' to''' O'''&lt;br /&gt;
::If same data is hit again&lt;br /&gt;
:::• The cache holding the line in the '''O''' state is responsible for providing data to the requesting cache.&lt;br /&gt;
&lt;br /&gt;
::If a write misses in the cache and INTERVENE is not asserted&lt;br /&gt;
:::• Generate a line-fill cycle&lt;br /&gt;
&lt;br /&gt;
The algorithms for the '''MSI''' and '''MESI''' protocols are as follows. (These protocols have no '''O''' state.)&lt;br /&gt;
In the MSI protocol, a cache holding the line in the '''M''' state provides the data to the other caches and to main memory, and changes its state from '''M''' to '''S'''. This transaction requires a main memory access, but the local bus translation logic reduces bus accesses and improves performance as follows. &lt;br /&gt;
::• Local Bus controller changes the status of MOESI to '''O''' instead of '''S'''&lt;br /&gt;
::•''' M''' of MSI or MESI changes to '''S''' and no update will occur on main memory.&lt;br /&gt;
::So, if same data is requested by other cache&lt;br /&gt;
:::• '''O''' of MOESI will provide the data instead of main memory.&lt;br /&gt;
::If write cycle initiated &lt;br /&gt;
:::If MOESI cache has a valid line with '''M''' or''' O'''  &lt;br /&gt;
::::MOESI cache will send out the data line on to local bus and change the state to '''I'''.&lt;br /&gt;
::::Because the line will be written by other cache and outdated.&lt;br /&gt;
::If the status is '''E''' or '''S'''&lt;br /&gt;
:::• No cache is involved in the data transfer.&lt;br /&gt;
:::• Main memory will provide the data.&lt;br /&gt;
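&lt;br /&gt;
The snoop responses listed above for a MOESI cache can be collected in a small lookup table. This is a minimal sketch under stated assumptions: the names MOESI_SNOOP and snoop are hypothetical, only the transitions mentioned in the text are included, and the move to '''I''' for lines in the '''E''' or '''S''' state on a remote write is an assumption based on the write-invalidate strategy rather than something the text states.&lt;br /&gt;
&lt;br /&gt;
```python
# Maps (current state, bus event from another cache)
# to (next state, whether this cache supplies the data).
MOESI_SNOOP = {
    ("M", "read"):  ("O", True),   # M supplies the data and becomes O
    ("O", "read"):  ("O", True),   # O keeps supplying data on later reads
    ("M", "write"): ("I", True),   # line is sent out, then invalidated
    ("O", "write"): ("I", True),   # line is sent out, then invalidated
    ("E", "write"): ("I", False),  # memory supplies; move to I is an assumption
    ("S", "write"): ("I", False),  # memory supplies; move to I is an assumption
}


def snoop(state, bus_event):
    """Return (next_state, supplies_data) for a snooped bus event."""
    return MOESI_SNOOP[(state, bus_event)]


first_read = snoop("M", "read")
second_read = snoop(first_read[0], "read")
remote_write = snoop("O", "write")
shared_write = snoop("S", "write")
```
&lt;br /&gt;
Following a line through two remote reads shows the owner staying in '''O''' and supplying the data both times, so main memory is not accessed until the line is written by another cache.&lt;br /&gt;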
&lt;br /&gt;
=='''References'''==&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;br /&gt;
[http://mprc.pku.cn/mentors/training/ISCAreading/1986/p434-dubois/p434-dubois.pdf Memory Access Buffering In Multiprocessors] Michel Dubois, Christoph Scheurich, Faye Briggs&lt;br /&gt;
&lt;br /&gt;
[http://www.ece.cmu.edu/~ece548/handouts/19coher.pdf Multiprocessor Consistency and Coherence] Memory System Architecture, Philip Koopman&lt;br /&gt;
&lt;br /&gt;
[http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=528915 Write buffer design for cache-coherent shared-memory multiprocessors], Fernaz Mounes-Toussi, David J. Lilja&lt;/div&gt;</summary>
		<author><name>Sbasu3</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/6b_am&amp;diff=59305</id>
		<title>CSC/ECE 506 Spring 2012/6b am</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/6b_am&amp;diff=59305"/>
		<updated>2012-03-04T16:09:34Z</updated>

		<summary type="html">&lt;p&gt;Sbasu3: /*  Algorithms  */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Introduction ==&lt;br /&gt;
&lt;br /&gt;
Processor speeds are increasing at a much faster rate than memory speeds, so data transactions between the processor and the memory system must be managed such that slow memory does not limit processor performance. A read requires the processor to wait for the data before resuming execution, but a write has no such requirement. This is where a write buffer (WB) comes in: the processor hands the write to the buffer and continues executing while the buffer takes complete responsibility for performing the write. &lt;br /&gt;
&lt;br /&gt;
===Write Buffers in Uni-processors===&lt;br /&gt;
&lt;br /&gt;
A write to be performed is put in a buffer implemented as a FIFO queue, so that writes are performed in the order in which they were issued. In a uni-processor model that extracts Instruction Level Parallelism (ILP), writes may also be issued out of order, provided hardware or software mechanisms check the writes for any dependences that exist in the instruction stream. The following figure shows a cache-based single-processor system with a write buffer. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:Uniprocessor With WB.png|thumb|center|400px|Figure 1. Uni-processor cache-based system with write buffer ]]&lt;br /&gt;
&lt;br /&gt;
===Write Buffer Issues in Multiprocessors===&lt;br /&gt;
&lt;br /&gt;
In a multiprocessor system, the same design can be extended, as shown in the figure. In this design, each processor has its own private cache and a write buffer corresponding to that cache. The caches are connected to each other and to the main memory by means of an interconnect. &lt;br /&gt;
 &lt;br /&gt;
[[Image:Multiprocessor with WB.png|thumb|center|400px|Figure 2. Cache-based multiprocessor system with write buffer ]] &lt;br /&gt;
&lt;br /&gt;
As can be seen in the figure above, each processor makes a write by pushing it into the write buffer, and the write buffer completes the task of performing the write to the cache or main memory. If the processor issues another write to the same address as an earlier buffered write, the earlier value is overwritten with the new one in the write buffer. Similarly, if the processor issues a read to the same address, it reads the value from the write buffer rather than going to the cache or the main memory.&lt;br /&gt;
&lt;br /&gt;
===The Coherence Problem===&lt;br /&gt;
&lt;br /&gt;
Consider the case where a write (ST_A) has been issued by processor I into the WB_I, and the write is waiting to be executed. Since the write has not yet been performed to the cache or the main memory, the other processors do not have any knowledge about the changes made to address A. As a result, a read operation by another processor from address A will take the value from either its own cache, write buffer, or the main memory (depending on whether it hits or misses in the cache). It is the job of the designer to employ protocols that will take care that the sequential ordering of the instructions is maintained, and that the writes made to any one of the caches are propagated and updated in all of the processor caches. &lt;br /&gt;
&lt;br /&gt;
==Sequential Consistency==&lt;br /&gt;
&lt;br /&gt;
When operating as part of a multiprocessor, it is not enough to check dependencies between writes only at the local level. Because data is shared and events in one processor may affect the outcomes in another, the sequence of writes and the data dependencies must be preserved across all the processors. &lt;br /&gt;
“A system is said to be sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.”&lt;br /&gt;
There are different approaches to this requirement, based on the expected outcome of the program. The condition of sequential consistency and logical ordering may be relaxed as per the requirement of the program we are looking to parallelize. &lt;br /&gt;
&lt;br /&gt;
===Strong Ordering===&lt;br /&gt;
The requirements for strong ordering are as follows:&lt;br /&gt;
 &lt;br /&gt;
1) All memory operations appear to execute one at a time.&lt;br /&gt;
&lt;br /&gt;
2) All memory operations from a single CPU appear to execute in-order.&lt;br /&gt;
&lt;br /&gt;
3) All memory operations from different processors are “cleanly” interleaved with each other (serialization)&lt;br /&gt;
&lt;br /&gt;
===Total Store Ordering===&lt;br /&gt;
Requirements are as follows:&lt;br /&gt;
&lt;br /&gt;
1)	Relaxed consistency in which stores must complete in order, but stores need not complete before a read to a given location takes place.&lt;br /&gt;
&lt;br /&gt;
2)	Reads are allowed to bypass pending writes, but writes MUST exit the store buffer in FIFO order.&lt;br /&gt;
&lt;br /&gt;
===Partial Store Ordering===&lt;br /&gt;
&lt;br /&gt;
Requirement is as follows:&lt;br /&gt;
&lt;br /&gt;
Even more relaxed consistency in which stores to any given memory location complete in order, but stores to different locations may complete out of order, and stores need not complete before a read to a given location takes place.&lt;br /&gt;
&lt;br /&gt;
===Weak Ordering===&lt;br /&gt;
Requirement is as follows:&lt;br /&gt;
&lt;br /&gt;
Very relaxed consistency in which any ordering is allowed except at barrier synchronization points. Global memory state must be completely settled at each synchronization point, and between synchronization points the memory state may correspond to any ordering of reads and writes.&lt;br /&gt;
&lt;br /&gt;
===Example===&lt;br /&gt;
The following program illustrates the difference between the ordering models.&lt;br /&gt;
The variables “a”, “b”, “flag1” and “flag2” are initialized to 0 and are shared by both processors (CPU1 and CPU2).&lt;br /&gt;
   &lt;br /&gt;
    a = b = flag1 = flag2 = 0;          // initial value&lt;br /&gt;
    CPU1                                CPU2&lt;br /&gt;
    flag1 = 1;                          flag2 = 1;&lt;br /&gt;
    a = 1;                              a = 2;&lt;br /&gt;
    r1 = a;                             r3 = a;&lt;br /&gt;
    r2 = flag2;                         r4 = flag1;&lt;br /&gt;
&lt;br /&gt;
The SPARC V8 architecture follows the Total Store Ordering model, which allows a write followed by a read to complete out of program order. One possible result is therefore r1 = 1, r3 = 2, r2 = r4 = 0.&lt;br /&gt;
Strong ordering, in contrast, enforces strict atomicity, so this result cannot occur; the values obtained depend only on how the two processors’ instructions are interleaved.&lt;br /&gt;
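&lt;br /&gt;
The contrast between the two models can be checked mechanically. The following is a minimal Python sketch, not taken from the SPARC specification: it enumerates every interleaving of the program above on a single shared memory (strong ordering), and every execution in which each CPU is given a FIFO store buffer that its own loads may read from (a simplified stand-in for Total Store Ordering). The function names sc_outcomes and tso_outcomes and the encoding of operations as tuples are assumptions made for illustration only.&lt;br /&gt;

```python
from itertools import combinations

# Each operation is ("st", variable, value) or ("ld", variable, register).
CPU1 = [("st", "flag1", 1), ("st", "a", 1), ("ld", "a", "r1"), ("ld", "flag2", "r2")]
CPU2 = [("st", "flag2", 1), ("st", "a", 2), ("ld", "a", "r3"), ("ld", "flag1", "r4")]
PROGRAMS = (CPU1, CPU2)
REGS = ("r1", "r2", "r3", "r4")
INITIAL_MEMORY = (("a", 0), ("flag1", 0), ("flag2", 0))


def sc_outcomes():
    """Strong ordering: run every program-order interleaving on one memory."""
    results = set()
    total = len(CPU1) + len(CPU2)
    for cpu1_slots in combinations(range(total), len(CPU1)):
        mem, regs, pcs = dict(INITIAL_MEMORY), {}, [0, 0]
        for step in range(total):
            cpu = 0 if step in cpu1_slots else 1
            kind, var, arg = PROGRAMS[cpu][pcs[cpu]]
            pcs[cpu] += 1
            if kind == "st":
                mem[var] = arg
            else:
                regs[arg] = mem[var]
        results.add(tuple(regs[r] for r in REGS))
    return results


def tso_outcomes():
    """Simplified TSO: stores wait in a per-CPU FIFO buffer before reaching memory."""
    results = set()
    start = ((0, 0), ((), ()), INITIAL_MEMORY, ())
    stack, seen = [start], {start}
    while stack:
        pcs, bufs, mem_items, reg_items = stack.pop()
        mem, regs = dict(mem_items), dict(reg_items)
        successors = []
        for cpu in (0, 1):
            if bufs[cpu]:
                # Drain the oldest buffered store of this CPU to memory (FIFO order).
                var, val = bufs[cpu][0]
                new_mem = dict(mem)
                new_mem[var] = val
                new_bufs = list(bufs)
                new_bufs[cpu] = bufs[cpu][1:]
                successors.append(
                    (pcs, tuple(new_bufs), tuple(sorted(new_mem.items())), reg_items))
            if pcs[cpu] == len(PROGRAMS[cpu]):
                continue
            kind, var, arg = PROGRAMS[cpu][pcs[cpu]]
            new_pcs = list(pcs)
            new_pcs[cpu] += 1
            if kind == "st":
                # A store only enters the local buffer; other CPUs cannot see it yet.
                new_bufs = list(bufs)
                new_bufs[cpu] = bufs[cpu] + ((var, arg),)
                successors.append((tuple(new_pcs), tuple(new_bufs), mem_items, reg_items))
            else:
                # A load takes the newest matching store in its own buffer, else memory.
                pending = [v for (name, v) in bufs[cpu] if name == var]
                value = pending[-1] if pending else mem[var]
                new_regs = dict(regs)
                new_regs[arg] = value
                successors.append(
                    (tuple(new_pcs), bufs, mem_items, tuple(sorted(new_regs.items()))))
        if not successors:
            # Both programs finished and both buffers empty: record the registers.
            results.add(tuple(regs[r] for r in REGS))
        for state in successors:
            if state not in seen:
                seen.add(state)
                stack.append(state)
    return results
```

Under these assumptions, the outcome (r1, r2, r3, r4) = (1, 0, 2, 0) is absent from sc_outcomes() but present in tso_outcomes(), because each CPU can read its own buffered store to “a” while the other CPU's flag store has not yet reached memory.&lt;br /&gt;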
&lt;br /&gt;
===Effects on Write Buffer Operation===&lt;br /&gt;
&lt;br /&gt;
// Explain here what ordering has to do with the write buffer coherency&lt;br /&gt;
&lt;br /&gt;
==Cache Coherence Models==&lt;br /&gt;
&lt;br /&gt;
The two prominent models for maintaining cache coherence are the snoopy-bus protocol and the directory-based coherence protocol. The directory-based protocol is used for distributed memory systems, in which a directory entry is used to keep track of where each block of data is stored. In the snoopy-bus protocol, every processor monitors the reads and writes that are being serviced by the memory bus. Read and write requests are broadcast on the bus, and each processor responds based on whether or not it holds that data in its cache. This article looks at shared memory systems.&lt;br /&gt;
&lt;br /&gt;
=== Write-Update ===&lt;br /&gt;
&lt;br /&gt;
In this approach, a write request is broadcast to all the processors, and each processor updates its local cache with the new value of the data. If a read misses in a processor's local cache, it can be serviced by any other processor's cache, as the copy has been updated in all caches. In terms of writes, this saves bus bandwidth, as only one broadcast is needed for all the caches to be up to date with the newest value of the data.&lt;br /&gt;
&lt;br /&gt;
===Write-Invalidate===&lt;br /&gt;
This approach is common in systems where a data block is written by one processor but read by multiple processors. Whenever a write is made to shared data, an invalidate signal is sent to all the caches, asking for that data block to be marked invalid, as their copies no longer hold the latest value. This may cause additional traffic on the bus; as a result, a separate bus for invalidation requests may be included in the design.&lt;br /&gt;
&lt;br /&gt;
== Coherence in Write Buffers ==&lt;br /&gt;
&lt;br /&gt;
===Software-Based Coherence===&lt;br /&gt;
The software technique relies on the compiler to ensure that no dependencies exist between the STORE/LOAD accesses carried out by different processors on the shared memory. The compiler, based on indications from the programmer, makes sure that there are no incoherent accesses to shared memory. It is also possible to have a READ/WRITE buffer shared between all processors for access to the main memory. &lt;br /&gt;
&lt;br /&gt;
===Hardware-Based Coherence===&lt;br /&gt;
In hardware-based protocols, accesses to the shared memory are communicated using hardware invalidate signals that are broadcast to all the processor memories. Hardware support may also be required for a LOAD to be broadcast to all the caches, so that a read can be performed directly from a remote memory. The following approaches maintain coherence in write buffers and caches for multiprocessor systems:&lt;br /&gt;
&lt;br /&gt;
====Unique Buffer per Processor==== &lt;br /&gt;
[[Image:Unique Buffer Per Processor.png|thumb|right|250px|Figure 3: Cache based system with unique buffer per processor &lt;br /&gt;
[[http://mprc.pku.cn/mentors/training/ISCAreading/1986/p434-dubois/p434-dubois.pdf]]]] &lt;br /&gt;
&lt;br /&gt;
In this configuration, every processor has its own unique buffer that holds the loads and stores of the local processor as well as the invalidates and loads of remote processors wanting access to the shared data. A store request in the local cache results in the invalidation of that data block in all other caches and main memory. A STORE is said to have been performed when &lt;br /&gt;
&lt;br /&gt;
# the request is serviced in the buffer and the cache is updated on a local hit, and&lt;br /&gt;
# the request is issued to the buffer and an invalidate request has been sent to all other processors sharing the data.&lt;br /&gt;
&lt;br /&gt;
This can be achieved in two ways, depending on the topology of the connection between the caches:&lt;br /&gt;
&lt;br /&gt;
# For a bus-based MP system, the LOADs that miss in the local cache and the STOREs that need invalidation in other caches are broadcast over the bus to all processor caches and memory. The STORE request is then considered performed with respect to all processors when the invalidation signals have been sent out to the private buffers of all the caches. &lt;br /&gt;
# For non-bus MP systems, the caches are connected in a point-to-point topology, such as a mesh or a ring. When a store is encountered in one of the processors, the shared data is first locked in the shared memory to ensure atomic access, and then the invalidate signal is propagated along the point-to-point interconnect in a predetermined order. The STORE to shared memory is performed only when the invalidation has been propagated to all the processor caches and buffers. If any processor issues a LOAD to the address that is being STOREd, the LOAD request has to be rejected or buffered to be serviced at a later time, to maintain atomic access to the shared data block.&lt;br /&gt;
&lt;br /&gt;
====Separate Buffers for Local and Remote Accesses====&lt;br /&gt;
[[Image:Separate Invalidate Buffers.png|thumb|right|250px|Figure 4: Cache based system with separate write-invalidate buffers&lt;br /&gt;
[[http://mprc.pku.cn/mentors/training/ISCAreading/1986/p434-dubois/p434-dubois.pdf]]]] &lt;br /&gt;
&lt;br /&gt;
Another approach to maintaining coherence is for every processor to have separate buffers for local data requests and remote data requests. As seen in the figure, every processor has a Local-buffer which queues the local LOADs and STOREs, and a Remote-buffer (also termed Invalidate buffer) that stores the invalidation requests and LOADs coming in from different processors.&lt;br /&gt;
In a bus-based system, this approach makes it difficult to maintain write atomicity, as different invalidate buffers may hold different numbers of invalidation requests, which makes the time at which the concerned data block is invalidated in each cache uncertain. In non-bus-based systems, however, this approach succeeds in maintaining strong coherence, provided the invalidate signal makes sure that the data is invalidated at a cache before moving on to the next processor (rather than just pushing the invalidate request into the buffer and moving on).&lt;br /&gt;
&lt;br /&gt;
====Universal read/write Buffer====&lt;br /&gt;
&lt;br /&gt;
[[Image:Universal Read Write Buffer.png|thumb|right|300px|Figure 5: Universal Read/Write Buffer [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/6b_dm#cite_note-4 5]]]&lt;br /&gt;
&lt;br /&gt;
In this approach, a shared bus controller resides between the local bus and the memory bus, where the local bus is connected to all private caches of the processors (illustrated in Fig 5). This technique supports multiple processors with the same or different write-invalidate coherence protocols, such as MSI, MESI and MOESI. The shared bus controller consists of the local bus controller, the data buffer and the address buffer. Each private cache provides two tri-state signal bits to the local bus: '''INTERVENE''' and '''SHARE'''.&lt;br /&gt;
&lt;br /&gt;
# INTERVENE is asserted when a cache wants to provide valid data to other caches.&lt;br /&gt;
# SHARE is asserted by a cache controller when there is an address match with any of its own tag addresses.&lt;br /&gt;
&lt;br /&gt;
=====''' Data Buffer '''=====&lt;br /&gt;
&lt;br /&gt;
The data buffer consists of three FIFOs:&lt;br /&gt;
&lt;br /&gt;
# The ''non_block write-FIFO'' receives non-cacheable memory access data from the CPU. These requests do not require any cache eviction or line-fill cycles. &lt;br /&gt;
# The ''write-back FIFO'' is used for eviction cycles. If there is a cache miss and an eviction is required, the missed data is read from memory into the cache (line-fill cycle) just after the evicted data has been moved to the FIFO. The contents of this FIFO are written back to memory later on, so the CPU gets the new data while the eviction time is hidden from it. &lt;br /&gt;
# The ''read-FIFO'' is used for storing data from the non_block write FIFO, the write-back FIFO or memory. Data is temporarily stored in this FIFO until the local bus is cleared and ready for new data.&lt;br /&gt;
&lt;br /&gt;
=====''' Address Buffer '''=====&lt;br /&gt;
&lt;br /&gt;
The address buffer also consists of three FIFOs: &lt;br /&gt;
&lt;br /&gt;
# ''Non_block write FIFO'' - stores the addresses of non-blocking accesses. The depth of this FIFO should be the same as that of the corresponding Data Buffer FIFO.&lt;br /&gt;
# ''Write-back FIFO'' - stores the starting address of an eviction cycle and holds the address until the memory bus is free.&lt;br /&gt;
# ''Line-fill FIFO'' - stores the starting address of a line-fill cycle.&lt;br /&gt;
&lt;br /&gt;
Snoop logic is started whenever there is a read cycle to main memory. It compares the read request address with the entries of both the write-back FIFO and the line-fill FIFO. If there is a match, the HIT flag is set and the CPU gets the data immediately through the internal bypass path. As a result, stale data is not read and the CPU does not have to wait out the memory latency.&lt;br /&gt;
&lt;br /&gt;
For a non-block write cycle, the snoop logic block compares the newly requested address and Byte Enable bits with those previously stored in the non_block write FIFO. If there is an address match but no Byte-Enable overlap, the BGO signal is asserted and the pointer of the non_block write FIFO does not move forward, which prevents the FIFO from filling up quickly.&lt;br /&gt;
&lt;br /&gt;
In this approach, read cycles have priority over write cycles, so whenever the memory bus is available, a read address gets direct memory bus access.&lt;br /&gt;
&lt;br /&gt;
=====''' Local Bus Controller '''=====&lt;br /&gt;
&lt;br /&gt;
The controller monitors the local bus for the status bits (M, O, E, S, I) and the INTERVENE# and SHARE# bits, and determines whether the cycle needs to access main memory or a private cache. For a main memory access, the controller informs the buffer with the WB (Write Back), LF (Line Fill) or O3 (owned) status signal.&lt;br /&gt;
&lt;br /&gt;
Algorithm to generate controller signals:&lt;br /&gt;
&lt;br /&gt;
For the MSI and MESI protocols: &lt;br /&gt;
&lt;br /&gt;
If there is a read miss on the local bus -&lt;br /&gt;
::'''INTERVENE''' is asserted.&lt;br /&gt;
If there is a write miss with '''M''' status on the local bus -&lt;br /&gt;
::Write back cycle will be performed&lt;br /&gt;
::Generate '''WB''' signal.&lt;br /&gt;
If there is a read miss or write miss from any of the protocols -&lt;br /&gt;
::'''LF''' cycle will be performed&lt;br /&gt;
::Generate '''LF''' signal&lt;br /&gt;
&lt;br /&gt;
For the MOESI protocol:&lt;br /&gt;
&lt;br /&gt;
If there is a read miss on the local bus&lt;br /&gt;
::'''INTERVENE''' is asserted&lt;br /&gt;
&lt;br /&gt;
The above outputs can be written in Boolean equations as shown in the example&amp;lt;ref&amp;gt;http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=00342629&amp;lt;/ref&amp;gt; below:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
      WB = RD * MOESI * INTV + WR * [STATUS=M]&lt;br /&gt;
      LF =  INTV&lt;br /&gt;
      O3 = RD * MOESI * INTV&lt;br /&gt;
      MOESI means the cycle is initiated by the MOESI protocol.&lt;br /&gt;
      INTV means the INTERVENE signal, which is asserted when any cache provides valid data to another cache.&lt;br /&gt;
&lt;br /&gt;
=====''' Algorithms '''=====&lt;br /&gt;
&lt;br /&gt;
The algorithm followed by each of the processors in a multiprocessor system is given below. The processors may follow different protocols (MSI, MESI or MOESI), but the algorithms below ensure that memory accesses are carried out coherently.&lt;br /&gt;
&lt;br /&gt;
The “S” state is common to all three protocols, but this system takes advantage of the MESI and MOESI protocols to reduce memory bus transactions. In the shared state (S), multiple caches have the same updated data, but memory may or may not have a copy. This change requires the following protocol translation. &lt;br /&gt;
&lt;br /&gt;
The algorithms for the '''MOESI''' protocol are as follows:&lt;br /&gt;
::If a read hit is initiated by another cache&lt;br /&gt;
:::• The cache in the '''M''' state provides the data to the other cache &lt;br /&gt;
:::• It changes its state from '''M''' to '''O'''&lt;br /&gt;
::If the same data is hit again&lt;br /&gt;
:::• The cache in the '''O''' state is responsible for providing the data to the requesting cache.&lt;br /&gt;
&lt;br /&gt;
::If there is a write miss in the cache and INTERVENE is not activated&lt;br /&gt;
:::• Generate a line-fill cycle&lt;br /&gt;
&lt;br /&gt;
The algorithms for the '''MSI''' and '''MESI''' protocols are as follows. (These protocols have no O state.)&lt;br /&gt;
In the MSI protocol, a cache in the “M” state provides data to the other caches and main memory, and also changes its state from '''M''' to '''S'''. This transaction requires a main memory access, but the local bus translation logic reduces bus accesses and improves performance as follows: &lt;br /&gt;
::• The local bus controller changes the status of a MOESI cache to '''O''' instead of '''S'''&lt;br /&gt;
::• '''M''' of an MSI or MESI cache changes to '''S''', and no update occurs in main memory.&lt;br /&gt;
::So, if the same data is requested by another cache&lt;br /&gt;
:::• The '''O''' cache of MOESI provides the data instead of main memory.&lt;br /&gt;
::If a write cycle is initiated &lt;br /&gt;
:::If a MOESI cache has a valid line in '''M''' or '''O'''  &lt;br /&gt;
::::The MOESI cache sends the data line out onto the local bus and changes its state to '''I''',&lt;br /&gt;
::::because the line will be written by another cache and become outdated.&lt;br /&gt;
::If the status is '''E''' or '''S'''&lt;br /&gt;
:::• No cache is involved in the data transfer.&lt;br /&gt;
:::• Main memory provides the data.&lt;br /&gt;
&lt;br /&gt;
=='''References'''==&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;br /&gt;
[http://mprc.pku.cn/mentors/training/ISCAreading/1986/p434-dubois/p434-dubois.pdf Memory Access Buffering In Multiprocessors] Michel Dubois, Christoph Scheurich, Faye Briggs&lt;br /&gt;
&lt;br /&gt;
[http://www.ece.cmu.edu/~ece548/handouts/19coher.pdf Multiprocessor Consistency and Coherence] Memory System Architecture, Philip Koopman&lt;br /&gt;
&lt;br /&gt;
[http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=528915 Write buffer design for cache-coherent shared-memory multiprocessors], Fernaz Mounes-Toussi, David J. Lilja&lt;/div&gt;</summary>
		<author><name>Sbasu3</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/6b_am&amp;diff=59304</id>
		<title>CSC/ECE 506 Spring 2012/6b am</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/6b_am&amp;diff=59304"/>
		<updated>2012-03-04T15:51:01Z</updated>

		<summary type="html">&lt;p&gt;Sbasu3: /*  Algorithms  */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Introduction ==&lt;br /&gt;
&lt;br /&gt;
With present-day processor speeds increasing at a much faster rate than memory speeds, data transactions between the processor and the memory system need to be managed so that slow memory does not limit the performance of the processor system. While a read requires the processor to wait for the operation to complete before resuming execution, a write does not. This is where a write buffer (WB) comes into the picture: it assists the processor with writes, so that the processor can continue its operation while the write buffer takes complete responsibility for executing the write. &lt;br /&gt;
&lt;br /&gt;
===Write Buffers in Uni-processors===&lt;br /&gt;
&lt;br /&gt;
A write to be performed is put in a buffer implemented as a FIFO queue, so that writes are performed in the order in which they were issued. In a uni-processor model, where Instruction Level Parallelism (ILP) can be extracted, writes may also be issued out of order, provided hardware or software protocols are implemented to check the writes for any dependences that may exist in the instruction stream. The following figure shows a cache-based single-processor system with a write buffer. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:Uniprocessor With WB.png|thumb|center|400px|Figure 1.Uni-processor cache based system with write buffer ]]&lt;br /&gt;
&lt;br /&gt;
===Write Buffer Issues in Multiprocessors===&lt;br /&gt;
&lt;br /&gt;
In a multiprocessor system, the same design can be extended, as shown in the figure. In this design, each processor has its own private cache and a write buffer corresponding to the cache. The caches are connected to each other and to the main memory by means of an interconnect. &lt;br /&gt;
 &lt;br /&gt;
[[Image:Multiprocessor with WB.png|thumb|center|400px|Figure 2. Cache- based multiprocessor system with write buffer ]] &lt;br /&gt;
&lt;br /&gt;
As can be seen in the figure above, each processor makes a write by pushing it into the write buffer, and the write buffer completes the task of performing the write to the cache/ main memory. If another write is issued by the processor that modifies the same address as the earlier write, the former write value will be over-written with the new one in the write buffer. Similarly, if a read is issued to the same address, the processor will read the value from the write buffer rather than going into the cache or the main memory.&lt;br /&gt;
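&lt;br /&gt;
The behaviour described above can be sketched in a few lines of Python. This is only an illustrative model under simplifying assumptions (one value per address, a dictionary standing in for the cache or main memory); the class name WriteBuffer and its methods write, read and drain_one are hypothetical and do not come from any cited design.&lt;br /&gt;

```python
from collections import OrderedDict


class WriteBuffer:
    """Toy write buffer placed between a processor and its backing memory."""

    def __init__(self, memory):
        self.memory = memory          # backing store: dict from address to value
        self.pending = OrderedDict()  # buffered writes, oldest entry first

    def write(self, addr, value):
        # A later write to an address that is already buffered overwrites the
        # earlier value in place; the entry keeps its position in the queue.
        self.pending[addr] = value

    def read(self, addr):
        # A read to a buffered address is served from the buffer, not memory.
        if addr in self.pending:
            return self.pending[addr]
        return self.memory.get(addr, 0)

    def drain_one(self):
        # Perform the oldest buffered write to the backing memory.
        if self.pending:
            addr, value = self.pending.popitem(last=False)
            self.memory[addr] = value
```

In this model, two writes to the same address followed by a read return the second value while the backing memory is still unchanged; the value reaches memory only once drain_one() is called, which is exactly the window in which other processors cannot see the write.&lt;br /&gt;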
&lt;br /&gt;
===The Coherence Problem===&lt;br /&gt;
&lt;br /&gt;
Consider the case where a write (ST_A) has been issued by processor I into the WB_I, and the write is waiting to be executed. Since the write has not yet been performed to the cache or the main memory, the other processors do not have any knowledge about the changes made to address A. As a result, a read operation by another processor from address A will take the value from either its own cache, write buffer, or the main memory (depending on whether it hits or misses in the cache). It is the job of the designer to employ protocols that will take care that the sequential ordering of the instructions is maintained, and that the writes made to any one of the caches are propagated and updated in all of the processor caches. &lt;br /&gt;
&lt;br /&gt;
==Sequential Consistency==&lt;br /&gt;
&lt;br /&gt;
When operating as part of a multiprocessor, it is not enough to check dependencies between the writes only at the local level. As data is shared and the course of events in different processors may affect each other's outcomes, it has to be ensured that the sequence of writes and the data dependencies are preserved across the multiprocessor. &lt;br /&gt;
“A system is said to be sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.”&lt;br /&gt;
There are different approaches to this requirement, based on the expected outcome of the program. The condition of sequential consistency and logical ordering may be relaxed as per the requirement of the program we are looking to parallelize. &lt;br /&gt;
&lt;br /&gt;
===Strong Ordering===&lt;br /&gt;
The requirements for strong ordering are as follows:&lt;br /&gt;
 &lt;br /&gt;
1) All memory operations appear to execute one at a time.&lt;br /&gt;
&lt;br /&gt;
2) All memory operations from a single CPU appear to execute in-order.&lt;br /&gt;
&lt;br /&gt;
3) All memory operations from different processors are “cleanly” interleaved with each other (serialization)&lt;br /&gt;
&lt;br /&gt;
===Total Store Ordering===&lt;br /&gt;
Requirements are as follows:&lt;br /&gt;
&lt;br /&gt;
1)	Consistency is relaxed: stores must complete in order, but a store need not complete before a later read takes place.&lt;br /&gt;
&lt;br /&gt;
2)	Reads are allowed to bypass pending writes, but writes MUST exit the store buffer in FIFO order.&lt;br /&gt;
&lt;br /&gt;
===Partial Store Ordering===&lt;br /&gt;
&lt;br /&gt;
Requirement is as follows:&lt;br /&gt;
&lt;br /&gt;
Consistency is relaxed further: stores to any given memory location complete in order, but stores to different locations may complete out of order, and stores need not complete before a later read takes place.&lt;br /&gt;
&lt;br /&gt;
===Weak Ordering===&lt;br /&gt;
Requirement is as follows:&lt;br /&gt;
&lt;br /&gt;
Consistency is relaxed the most: between synchronization points, the memory state may correspond to any ordering of reads and writes, but the global memory state must be completely settled at each barrier synchronization point.&lt;br /&gt;
&lt;br /&gt;
===Example===&lt;br /&gt;
Example for sequential consistency:&lt;br /&gt;
In the following program, the variables “a”, “b”, “flag1” and “flag2” are initialized to 0 and are shared by both processors (CPU1 and CPU2).&lt;br /&gt;
   &lt;br /&gt;
    a = b = flag1 = flag2 = 0;		// initial value&lt;br /&gt;
    CPU1				CPU2&lt;br /&gt;
    flag1 = 1;				flag2 = 1;&lt;br /&gt;
    a = 1;				a = 2; &lt;br /&gt;
    r1 = a;				r3 = a;&lt;br /&gt;
    r2 = flag2;				r4 = flag1;&lt;br /&gt;
&lt;br /&gt;
The SPARC V8 architecture follows the Total Store Ordering model, which allows a write followed by a read to complete out of program order. One possible result is therefore r1 = 1, r3 = 2, r2 = r4 = 0.&lt;br /&gt;
Strong ordering, in contrast, enforces strict atomicity, so this result cannot occur; the values obtained depend only on how the two processors’ instructions are interleaved.&lt;br /&gt;
&lt;br /&gt;
===Effects on Write Buffer Operation===&lt;br /&gt;
&lt;br /&gt;
// Explain here what ordering has to do with the write buffer coherency&lt;br /&gt;
&lt;br /&gt;
==Cache Coherence Models==&lt;br /&gt;
&lt;br /&gt;
The two prominent models for maintaining cache coherence are the snoopy-bus protocol and the directory-based coherence protocol. The directory-based protocol is used for distributed memory systems, in which a directory entry is used to keep track of where each block of data is stored. In the snoopy-bus protocol, every processor monitors the reads and writes that are being serviced by the memory bus. Read and write requests are broadcast on the bus, and each processor responds based on whether or not it holds that data in its cache. This article looks at shared memory systems.&lt;br /&gt;
&lt;br /&gt;
=== Write-Update ===&lt;br /&gt;
&lt;br /&gt;
In this approach, a write request is broadcast to all the processors, and each processor updates its local cache with the new value of the data. If a read misses in a processor's local cache, it can be serviced by any other processor's cache, as the copy has been updated in all caches. In terms of writes, this saves bus bandwidth, as only one broadcast is needed for all the caches to be up to date with the newest value of the data.&lt;br /&gt;
&lt;br /&gt;
===Write-Invalidate===&lt;br /&gt;
This approach is common in systems where a data block is written by one processor but read by multiple processors. Whenever a write is made to shared data, an invalidate signal is sent to all the caches, asking for that data block to be marked invalid, as their copies no longer hold the latest value. This may cause additional traffic on the bus; as a result, a separate bus for invalidation requests may be included in the design.&lt;br /&gt;
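&lt;br /&gt;
The difference between the write-update and write-invalidate approaches can be sketched with a toy model in Python. Each cache is assumed to be a plain dictionary from address to value, and the function name bus_write and its policy argument are hypothetical names chosen for this illustration; real protocols also track per-line states, which are left out here.&lt;br /&gt;

```python
def bus_write(caches, writer, addr, value, policy):
    """Apply one processor write to a list of per-processor caches (dicts)."""
    if policy not in ("update", "invalidate"):
        raise ValueError("policy must be 'update' or 'invalidate'")
    caches[writer][addr] = value
    for cpu, cache in enumerate(caches):
        if cpu == writer or addr not in cache:
            continue  # only remote caches that hold the block are affected
        if policy == "update":
            cache[addr] = value  # remote copy is refreshed with the new value
        else:
            del cache[addr]      # remote copy is dropped; the next read misses
```

With the update policy, a remote cache that holds the block can serve later reads locally with the new value; with the invalidate policy, the same remote cache no longer holds the block, so its next read of that address has to fetch the data again.&lt;br /&gt;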
&lt;br /&gt;
== Coherence in Write Buffers ==&lt;br /&gt;
&lt;br /&gt;
===Software-Based Coherence===&lt;br /&gt;
The software technique relies on the compiler to ensure that no dependencies exist between the STORE/LOAD accesses carried out by different processors on the shared memory. The compiler, based on indications from the programmer, makes sure that there are no incoherent accesses to shared memory. It is also possible to have a READ/WRITE buffer shared between all processors for access to the main memory. &lt;br /&gt;
&lt;br /&gt;
===Hardware-Based Coherence===&lt;br /&gt;
In hardware-based protocols, accesses to the shared memory are communicated using hardware invalidate signals that are broadcast to all the processor memories. Hardware support may also be required for a LOAD to be broadcast to all the caches, so that a read can be performed directly from a remote memory. The following approaches maintain coherence in write buffers and caches for multiprocessor systems:&lt;br /&gt;
&lt;br /&gt;
====Unique Buffer per Processor==== &lt;br /&gt;
[[Image:Unique Buffer Per Processor.png|thumb|right|250px|Figure 3: Cache based system with unique buffer per processor &lt;br /&gt;
[[http://mprc.pku.cn/mentors/training/ISCAreading/1986/p434-dubois/p434-dubois.pdf]]]] &lt;br /&gt;
&lt;br /&gt;
In this configuration, every processor has its own unique buffer that holds the loads and stores of the local processor as well as the invalidates and loads of remote processors wanting access to the shared data. A store request in the local cache results in the invalidation of that data block in all other caches and main memory. A STORE is said to have been performed when &lt;br /&gt;
&lt;br /&gt;
# the request is serviced in the buffer and the cache is updated on a local hit, and&lt;br /&gt;
# the request is issued to the buffer and an invalidate request has been sent to all other processors sharing the data.&lt;br /&gt;
&lt;br /&gt;
This can be achieved in two ways, depending on the topology of the connection between the caches:&lt;br /&gt;
&lt;br /&gt;
# For a bus-based MP system, the LOADs that miss in the local cache and the STOREs that need invalidation in other caches are broadcast over the bus to all processor caches and memory. The STORE request is then considered performed with respect to all processors when the invalidation signals have been sent out to the private buffers of all the caches. &lt;br /&gt;
# For non-bus MP systems, the caches are connected in a point-to-point topology, such as a mesh or a ring. When a store is encountered in one of the processors, the shared data is first locked in the shared memory to ensure atomic access, and then the invalidate signal is propagated along the point-to-point interconnect in a predetermined order. The STORE to shared memory is performed only when the invalidation has been propagated to all the processor caches and buffers. If any processor issues a LOAD to the address that is being STOREd, the LOAD request has to be rejected or buffered to be serviced at a later time, to maintain atomic access to the shared data block.&lt;br /&gt;
&lt;br /&gt;
====Separate Buffers for Local and Remote Accesses====&lt;br /&gt;
[[Image:Separate Invalidate Buffers.png|thumb|right|250px|Figure 4: Cache based system with separate write-invalidate buffers&lt;br /&gt;
[[http://mprc.pku.cn/mentors/training/ISCAreading/1986/p434-dubois/p434-dubois.pdf]]]] &lt;br /&gt;
&lt;br /&gt;
Another approach to maintaining coherence is for every processor to have separate buffers for local data requests and remote data requests. As seen in the figure, every processor has a Local-buffer which queues the local LOADs and STOREs, and a Remote-buffer (also termed Invalidate buffer) that stores the invalidation requests and LOADs coming in from different processors.&lt;br /&gt;
In a bus-based system, this approach makes it difficult to maintain write atomicity, as different invalidate buffers may hold different numbers of invalidation requests, which makes the time at which the concerned data block is invalidated in each cache uncertain. In non-bus-based systems, however, this approach succeeds in maintaining strong coherence, provided the invalidate signal makes sure that the data is invalidated at a cache before moving on to the next processor (rather than just pushing the invalidate request into the buffer and moving on).&lt;br /&gt;
&lt;br /&gt;
====Universal read/write Buffer====&lt;br /&gt;
&lt;br /&gt;
[[Image:Universal Read Write Buffer.png|thumb|right|300px|Figure 5: Universal Read/Write Buffer [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/6b_dm#cite_note-4 5]]]&lt;br /&gt;
&lt;br /&gt;
In this approach, a shared bus controller resides between the local bus and the memory bus, where the local bus is connected to all private caches of the processors (illustrated in Fig 5). This technique supports multiple processors with the same or different write-invalidate coherence protocols, such as MSI, MESI and MOESI. The shared bus controller consists of the local bus controller, the data buffer and the address buffer. Each private cache provides two tri-state signal bits to the local bus: '''INTERVENE''' and '''SHARE'''.&lt;br /&gt;
&lt;br /&gt;
# INTERVENE is asserted when a cache wants to provide valid data to other caches.&lt;br /&gt;
# SHARE is asserted by a cache controller when there is an address match with any of its own tag addresses.&lt;br /&gt;
&lt;br /&gt;
=====''' Data Buffer '''=====&lt;br /&gt;
&lt;br /&gt;
The data buffer consists of three FIFOs:&lt;br /&gt;
&lt;br /&gt;
# The ''non_block write-FIFO'' receives non-cacheable memory access data from the CPU. These requests do not require any cache eviction or line-fill cycles. &lt;br /&gt;
# The ''write-back FIFO'' is used for eviction cycles. If there is a cache miss and an eviction is required, the missed data is read from memory into the cache (line-fill cycle) just after the evicted data has been moved to the FIFO. The contents of this FIFO are written back to memory later on, so the CPU gets the new data while the eviction time is hidden from it. &lt;br /&gt;
# The ''read-FIFO'' is used for storing data from the non_block write FIFO, the write-back FIFO or memory. Data is temporarily stored in this FIFO until the local bus is cleared and ready for new data.&lt;br /&gt;
&lt;br /&gt;
=====''' Address Buffer '''=====&lt;br /&gt;
&lt;br /&gt;
The address buffer also consists of three FIFOs: &lt;br /&gt;
&lt;br /&gt;
# ''Non_block write FIFO'' - stores the addresses of non-blocking accesses. The depth of this FIFO should be the same as that of the corresponding Data Buffer FIFO.&lt;br /&gt;
# ''Write-back FIFO'' - stores the starting address of an eviction cycle and holds the address until the memory bus is free.&lt;br /&gt;
# ''Line-fill FIFO'' - stores the starting address of a line-fill cycle.&lt;br /&gt;
&lt;br /&gt;
Snoop logic is started whenever there is a read cycle to main memory. It compares the read request address with the entries of both the write-back FIFO and the line-fill FIFO. If there is a match, the HIT flag is set and the CPU gets the data immediately through the internal bypass path. As a result, stale data is not read and the CPU does not have to wait out the memory latency.&lt;br /&gt;
&lt;br /&gt;
For a non-block write cycle, the snoop logic block compares the newly requested address and Byte Enable bits with those previously stored in the non_block write FIFO. If there is an address match but no Byte-Enable overlap, the BGO signal is asserted and the pointer of the non_block write FIFO does not move forward, which prevents the FIFO from filling up quickly.&lt;br /&gt;
&lt;br /&gt;
In this approach, read cycles have priority over write cycles, so whenever the memory bus is available, a read address gets direct memory bus access.&lt;br /&gt;
&lt;br /&gt;
=====''' Local Bus Controller '''=====&lt;br /&gt;
&lt;br /&gt;
The controller monitors the local bus for the status bits (M, O, E, S, I) and the INTERVENE# and SHARE# bits, and determines whether a cycle needs to access main memory or a private cache. For a main memory access, the controller informs the buffer using the WB (Write Back), LF (Line Fill) or O3 (Owned) status signal.&lt;br /&gt;
&lt;br /&gt;
Algorithm to generate controller signals-&lt;br /&gt;
&lt;br /&gt;
For the MSI and MESI protocols: &lt;br /&gt;
&lt;br /&gt;
If there is a read miss on the local bus -&lt;br /&gt;
::'''INTERVENE''' is asserted.&lt;br /&gt;
If there is a write miss with '''M''' status on the local bus -&lt;br /&gt;
::Write back cycle will be performed&lt;br /&gt;
::Generate '''WB''' signal.&lt;br /&gt;
If there is a read miss or write miss from any of the protocols -&lt;br /&gt;
::'''LF''' cycle will be performed&lt;br /&gt;
::Generate '''LF''' signal&lt;br /&gt;
&lt;br /&gt;
For the MOESI protocol:&lt;br /&gt;
&lt;br /&gt;
If there is a read miss on the local bus&lt;br /&gt;
::'''INTERVENE''' is asserted&lt;br /&gt;
&lt;br /&gt;
The above outputs can be written in Boolean equations as shown in the example&amp;lt;ref&amp;gt;http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=00342629&amp;lt;/ref&amp;gt; below:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
      WB = RD * MOESI * INTV + WR * [STATUS=M]&lt;br /&gt;
      LF =  INTV&lt;br /&gt;
      O3 = RD * MOESI * INTV&lt;br /&gt;
      MOESI means the cycle is initiated by a cache using the MOESI protocol.&lt;br /&gt;
      INTV means the INTERVENE signal, which is asserted when any cache provides valid data to another cache.&lt;br /&gt;
&lt;br /&gt;
=====''' Algorithms '''=====&lt;br /&gt;
&lt;br /&gt;
The algorithms followed by each of the processors in a multiprocessor system are given below. The processors may follow different protocols (MSI, MESI or MOESI), but the algorithms below ensure that memory accesses are carried out coherently.&lt;br /&gt;
&lt;br /&gt;
The “S” state is common to all three protocols, but this system takes advantage of the MESI and MOESI protocols to reduce memory bus transactions. In the shared state (S), multiple caches hold the same up-to-date data, but memory may or may not have a copy. This change requires the following protocol translation. &lt;br /&gt;
&lt;br /&gt;
The algorithms for the '''MOESI''' protocol are as follows:&lt;br /&gt;
::If a read hit is initiated by another cache&lt;br /&gt;
:::• The cache in M state provides the data to the other cache &lt;br /&gt;
:::• It changes its state from M to O&lt;br /&gt;
::If the same data is hit again&lt;br /&gt;
:::• The cache in O state is responsible for providing the data to the requesting cache.&lt;br /&gt;
&lt;br /&gt;
::If a write misses in the cache and INTERVENE is not asserted&lt;br /&gt;
:::• Generate a line-fill cycle&lt;br /&gt;
&lt;br /&gt;
The algorithms for the MSI and MESI protocols are as follows (the MSI and MESI protocols have no O state):&lt;br /&gt;
&lt;br /&gt;
#	A cache in M state under the MSI or MESI protocol provides the data to the other cache and to main memory, and changes its state from M to S.&lt;br /&gt;
#	The local bus controller is responsible for changing the state from S to O for the MOESI protocol. &lt;br /&gt;
#	M of MSI or MESI still changes to S. &lt;br /&gt;
#	If a write cycle is initiated by another CPU (MSI or MESI),&lt;br /&gt;
::	If a MOESI cache has a valid line with M or O status&lt;br /&gt;
:::  	It will send the data line out on to the local bus and change its state to I.&lt;br /&gt;
:::	[Because the line will be written by the other cache and become outdated]&lt;br /&gt;
&lt;br /&gt;
:5.	If the status is E or S, main memory will provide the data and no cache is involved in data transfer. However, cache with MOESI protocol will be the owner of a particular line.&lt;br /&gt;
&lt;br /&gt;
:6.	Force the status bit to be O instead of S&lt;br /&gt;
&lt;br /&gt;
=='''References'''==&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;br /&gt;
[http://mprc.pku.cn/mentors/training/ISCAreading/1986/p434-dubois/p434-dubois.pdf Memory Access Buffering In Multiprocessors] Michel Dubois, Christoph Scheurich, Faye Briggs&lt;br /&gt;
&lt;br /&gt;
[http://www.ece.cmu.edu/~ece548/handouts/19coher.pdf Multiprocessor Consistency and Coherence] Memory System Architecture, Philip Koopman&lt;br /&gt;
&lt;br /&gt;
[http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=528915 Write buffer design for cache-coherent shared-memory multiprocessors], Fernaz Mounes-Toussi, David J. Lilja&lt;/div&gt;</summary>
		<author><name>Sbasu3</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/6b_am&amp;diff=59303</id>
		<title>CSC/ECE 506 Spring 2012/6b am</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/6b_am&amp;diff=59303"/>
		<updated>2012-03-04T15:40:55Z</updated>

		<summary type="html">&lt;p&gt;Sbasu3: /*  Algorithms  */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Introduction ==&lt;br /&gt;
&lt;br /&gt;
With the present day processor speeds increasing at a much faster rate than memory speeds, there arises a need that the data transactions between the processor and the memory system be managed such that the slow memory speeds do not affect the performance of the processor system. While the read operations require that the processor wait for the read operation to complete before resuming execution, the write operations do not have this requirement. This is where a write buffer (WB) comes into the picture, assisting the processor in writes, so that the processor can continue its operation while the write buffer takes complete responsibility of executing the write. &lt;br /&gt;
&lt;br /&gt;
===Write Buffers in Uni-processors===&lt;br /&gt;
&lt;br /&gt;
A write to be performed is put in a buffer implemented as a FIFO queue, so that the writes are performed in the order that they were called. In a uni-processor model, with the requirement and possibility of extracting Instruction Level Parallelism (ILP), the writes may also be called out-of-order, provided there are some hardware/ software protocols implemented to check the writes for any dependences that may exist in the instruction stream. The following figure shows the cache-based single processor system with a write buffer. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:Uniprocessor With WB.png|thumb|center|400px|Figure 1.Uni-processor cache based system with write buffer ]]&lt;br /&gt;
&lt;br /&gt;
===Write Buffer Issues in Multiprocessors===&lt;br /&gt;
&lt;br /&gt;
In a multiprocessor system, the same design can be extended, as shown in the figure. In this design, each processor will have its own private cache and a write buffer corresponding to the cache. The caches are connected by the means of an interconnect with each other as well as with the main memory. &lt;br /&gt;
 &lt;br /&gt;
[[Image:Multiprocessor with WB.png|thumb|center|400px|Figure 2. Cache- based multiprocessor system with write buffer ]] &lt;br /&gt;
&lt;br /&gt;
As can be seen in the figure above, each processor makes a write by pushing it into the write buffer, and the write buffer completes the task of performing the write to the cache/ main memory. If another write is issued by the processor that modifies the same address as the earlier write, the former write value will be over-written with the new one in the write buffer. Similarly, if a read is issued to the same address, the processor will read the value from the write buffer rather than going into the cache or the main memory.&lt;br /&gt;
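&lt;br /&gt;
The merging and forwarding behaviour described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the class name WriteBuffer, its methods, and the dictionary standing in for main memory are hypothetical and do not come from any real design.&lt;br /&gt;

```python
from collections import OrderedDict


class WriteBuffer:
    """Toy FIFO write buffer with write merging and read forwarding (hypothetical)."""

    def __init__(self, memory):
        self.memory = memory            # backing store: dict of address to value
        self.pending = OrderedDict()    # pending writes, oldest first

    def write(self, address, value):
        # A later write to the same address overwrites the pending value
        # in place, so the entry keeps its original position in the queue.
        self.pending[address] = value

    def read(self, address):
        # A read to a pending address is served from the buffer.
        if address in self.pending:
            return self.pending[address]
        return self.memory.get(address, 0)

    def drain_one(self):
        # Retire the oldest pending write to memory, if there is one.
        if self.pending:
            address, value = self.pending.popitem(last=False)
            self.memory[address] = value


memory = {0x10: 5}
wb = WriteBuffer(memory)
wb.write(0x10, 7)
wb.write(0x10, 9)               # merges with the earlier pending write
forwarded = wb.read(0x10)       # served from the buffer
stale_in_memory = memory[0x10]  # memory is unchanged until the buffer drains
wb.drain_one()
after_drain = memory[0x10]
```

In this sketch the second write to 0x10 replaces the first one inside the buffer, the read is answered by the buffer, and memory only sees the new value after the drain.&lt;br /&gt;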
&lt;br /&gt;
===The Coherence Problem===&lt;br /&gt;
&lt;br /&gt;
Consider the case where a write (ST_A) has been issued by processor I into the WB_I, and the write is waiting to be executed. Since the write has not yet been performed to the cache or the main memory, the other processors do not have any knowledge about the changes made to address A. As a result, a read operation by another processor from address A will take the value from either its own cache, write buffer, or the main memory (depending on whether it hits or misses in the cache). It is the job of the designer to employ protocols that will take care that the sequential ordering of the instructions is maintained, and that the writes made to any one of the caches are propagated and updated in all of the processor caches. &lt;br /&gt;
&lt;br /&gt;
==Sequential Consistency==&lt;br /&gt;
&lt;br /&gt;
When a processor operates as part of a multiprocessor, it is not enough to check dependencies between writes only at the local level. Because data is shared and events in one processor may affect the outcome in another, the sequence of writes and the data dependencies must be preserved across all the processors. &lt;br /&gt;
“A system is said to be sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.”&lt;br /&gt;
There are different approaches to this requirement, based on the expected outcome of the program. The condition of sequential consistency and logical ordering may be relaxed as per the requirement of the program we are looking to parallelize. &lt;br /&gt;
&lt;br /&gt;
===Strong Ordering===&lt;br /&gt;
The requirements for strong ordering are as follows:&lt;br /&gt;
 &lt;br /&gt;
1) All memory operations appear to execute one at a time.&lt;br /&gt;
&lt;br /&gt;
2) All memory operations from a single CPU appear to execute in-order.&lt;br /&gt;
&lt;br /&gt;
3) All memory operations from different processors are “cleanly” interleaved with each other (serialization)&lt;br /&gt;
&lt;br /&gt;
===Total Store Ordering===&lt;br /&gt;
Requirements are as follows:&lt;br /&gt;
&lt;br /&gt;
1)	Relaxed consistency in which stores must complete in order, but stores need not complete before a read to a given location takes place.&lt;br /&gt;
&lt;br /&gt;
2)	Reads are allowed to bypass pending writes, but writes MUST exit the store buffer in FIFO order.&lt;br /&gt;
&lt;br /&gt;
===Partial Store Ordering===&lt;br /&gt;
&lt;br /&gt;
Requirement is as follows:&lt;br /&gt;
&lt;br /&gt;
Even more relaxed consistency in which stores to any given memory location complete in order, but stores to different locations may complete out of order, and stores need not complete before a read to a given location takes place.&lt;br /&gt;
&lt;br /&gt;
===Weak Ordering===&lt;br /&gt;
Requirement is as follows:&lt;br /&gt;
&lt;br /&gt;
Very relaxed consistency in which anything goes except at barrier synchronization points: the global memory state must be completely settled at each synchronization, and between synchronization points the memory state may correspond to any ordering of reads and writes.&lt;br /&gt;
&lt;br /&gt;
===Example===&lt;br /&gt;
Example for sequential consistency:&lt;br /&gt;
In the following program, the variables “a”, “b”, “flag1” and “flag2” are initialized to 0, and both processors (CPU1 and CPU 2) share all the variables.&lt;br /&gt;
   &lt;br /&gt;
    a = b = flag1 = flag2 = 0;		// initial value&lt;br /&gt;
    CPU1				CPU 2 &lt;br /&gt;
    flag1 = 1;				flag2 = 1;&lt;br /&gt;
    a = 1;				a = 2; &lt;br /&gt;
    r1 = a;				r3 = a;&lt;br /&gt;
    r2 = flag2;				r4 = flag1;&lt;br /&gt;
&lt;br /&gt;
The SPARC V8 architecture follows the Total Store Ordering model and allows a write followed by a read to complete out of program order. One possible result is r1 = 1, r3 = 2, r2 = r4 = 0.&lt;br /&gt;
Strong ordering, however, enforces strict atomicity, so the result depends only on how the instructions of the two processors are interleaved, and r2 = r4 = 0 cannot occur.&lt;br /&gt;
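&lt;br /&gt;
The outcome quoted above can be checked by brute force. The sketch below is an illustration, not part of the SPARC specification: it assumes each CPU has a simple FIFO store buffer and enumerates every interleaving of the example program. The function name run and the model labels "TSO" and "SC" are hypothetical, with "SC" standing for the strongly ordered case in which stores go straight to memory.&lt;br /&gt;

```python
def run(model):
    """Enumerate every outcome (r1, r2, r3, r4) of the two-CPU example."""
    programs = (
        (("st", "flag1", 1), ("st", "a", 1), ("ld", "r1", "a"), ("ld", "r2", "flag2")),
        (("st", "flag2", 1), ("st", "a", 2), ("ld", "r3", "a"), ("ld", "r4", "flag1")),
    )
    outcomes = set()

    def step(pcs, buffers, memory, regs):
        done = True
        for cpu in (0, 1):
            # Option 1: retire the oldest buffered store of this CPU (FIFO order).
            if buffers[cpu]:
                done = False
                (var, val), rest = buffers[cpu][0], buffers[cpu][1:]
                new_buffers = list(buffers)
                new_buffers[cpu] = rest
                step(pcs, tuple(new_buffers), {**memory, var: val}, regs)
            # Option 2: execute this CPU's next instruction.
            if pcs[cpu] != len(programs[cpu]):
                done = False
                op, x, y = programs[cpu][pcs[cpu]]
                new_pcs = list(pcs)
                new_pcs[cpu] += 1
                new_buffers, new_memory, new_regs = list(buffers), dict(memory), dict(regs)
                if op == "st" and model == "TSO":
                    # Assumed TSO behaviour: the store waits in the store buffer.
                    new_buffers[cpu] = buffers[cpu] + ((x, y),)
                elif op == "st":
                    new_memory[x] = y
                else:
                    # Load: the newest matching store in the CPU's own buffer
                    # wins, otherwise the value comes from memory.
                    hits = [val for var, val in buffers[cpu] if var == y]
                    new_regs[x] = hits[-1] if hits else memory[y]
                step(tuple(new_pcs), tuple(new_buffers), new_memory, new_regs)
        if done:
            outcomes.add(tuple(regs[r] for r in ("r1", "r2", "r3", "r4")))

    step((0, 0), ((), ()), {"a": 0, "flag1": 0, "flag2": 0}, {})
    return outcomes
```

Under these assumptions, the tuple (1, 0, 2, 0) appears in run("TSO") but not in run("SC"), because both loads of the flags can complete while the other CPU's stores are still sitting in its store buffer.&lt;br /&gt;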
&lt;br /&gt;
===Effects on Write Buffer Operation===&lt;br /&gt;
&lt;br /&gt;
// Explain here what ordering has to do with the write buffer coherency&lt;br /&gt;
&lt;br /&gt;
==Cache Coherence Models==&lt;br /&gt;
&lt;br /&gt;
The two prominent models for maintaining cache coherence are the snoopy-bus protocol and the directory-based coherence protocol. The directory-based protocol is used for distributed memory systems, in which every processor uses a directory entry to keep track of what data is being stored. In the snoopy-bus protocol, every processor monitors the reads and writes that are being serviced by the memory bus. Read and write requests are broadcast on the bus, and each processor responds based on whether or not it holds that data in its cache. This article looks at shared memory systems.&lt;br /&gt;
&lt;br /&gt;
=== Write-Update ===&lt;br /&gt;
&lt;br /&gt;
In this approach, a write request is broadcast to all the processors, and each processor updates its local cache with the new value of the data. Even if a read misses in a processor's local cache, it can be served from any other processor's cache, since the copy has been updated in all caches. This saves bus bandwidth for writes, as a single broadcast is enough to bring all the caches up to date with the newest value of the data.&lt;br /&gt;
&lt;br /&gt;
===Write-Invalidate===&lt;br /&gt;
This approach is common in systems where a data block is written by one processor but read by multiple processors. Whenever a write is made to shared data, an invalidate signal is sent to all the caches, asking them to mark that data block invalid because their copy no longer holds the latest value. This may cause additional traffic on the bus; as a result, a separate bus for invalidation requests may be included in the design.&lt;br /&gt;
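&lt;br /&gt;
The difference between the two policies can be shown with a toy model. The sketch below is an assumption-laden illustration only: caches are plain dictionaries, the function name write and the policy labels "update" and "invalidate" are hypothetical, and bus timing is ignored.&lt;br /&gt;

```python
def write(caches, writer, address, value, policy):
    """Apply one write to a shared block under the given policy (toy model)."""
    caches[writer][address] = value
    for name, cache in caches.items():
        if name == writer or address not in cache:
            continue
        if policy == "update":
            cache[address] = value      # remote copies receive the new value
        else:
            del cache[address]          # remote copies are invalidated


updated = {"P0": {0xA: 1}, "P1": {0xA: 1}}
write(updated, "P0", 0xA, 2, "update")

invalidated = {"P0": {0xA: 1}, "P1": {0xA: 1}}
write(invalidated, "P0", 0xA, 2, "invalidate")
```

After the update write, P1 still holds the block with the new value; after the invalidate write, P1 no longer holds the block and would miss on its next read.&lt;br /&gt;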
&lt;br /&gt;
== Coherence in Write Buffers ==&lt;br /&gt;
&lt;br /&gt;
===Software-Based Coherence===&lt;br /&gt;
The software technique relies on the compiler to ensure that no dependencies exist between the STORE/LOAD accesses carried out by different processors on the shared memory. Based on indications from the programmer, the compiler makes sure that there are no incoherent accesses to shared memory. It is also possible to have a READ/WRITE buffer shared between all processors for accessing the main memory. &lt;br /&gt;
&lt;br /&gt;
===Hardware-Based Coherence===&lt;br /&gt;
In hardware-based protocols, accesses to the shared memory are communicated using hardware invalidate signals that are broadcast to all the processor memories. Hardware support may also be required for a LOAD to be broadcast to all the caches, so that a read can be performed directly from a remote memory. There are two approaches to maintaining coherence in write buffers and caches for a multiprocessor system:&lt;br /&gt;
&lt;br /&gt;
====Unique Buffer per Processor==== &lt;br /&gt;
[[Image:Unique Buffer Per Processor.png|thumb|right|250px|Figure 3: Cache based system with unique buffer per processor &lt;br /&gt;
[[http://mprc.pku.cn/mentors/training/ISCAreading/1986/p434-dubois/p434-dubois.pdf]]]] &lt;br /&gt;
&lt;br /&gt;
In this configuration, every processor has its own unique buffer that holds the loads and stores of the local processor as well as the invalidates and loads of remote processors wanting access to the shared data. A store request in the local cache triggers invalidation of that data block in all other caches and in main memory. A STORE is said to have been performed when &lt;br /&gt;
&lt;br /&gt;
# the request is serviced in the buffer and the cache is updated on a local hit, and&lt;br /&gt;
# the request is issued to the buffer and an invalidate request for the shared data has been sent to all other processors.&lt;br /&gt;
&lt;br /&gt;
This can be achieved in two ways depending on the topology of the connection between the caches -&lt;br /&gt;
&lt;br /&gt;
# For a bus-based MP system, the LOADs that miss in the local cache and the STOREs that need invalidation in other caches are broadcast over the bus to all processor caches and memory. The STORE request is then considered performed with respect to all processors when the invalidation signals have been sent out to the private buffers of all the caches. &lt;br /&gt;
# For non-bus MP systems, the caches are connected in a point-to-point, mesh-based or ring-based interconnect topology. When a store is encountered in one of the processors, the shared data is first locked in the shared memory to ensure atomic access, and the invalidate signal is then propagated along the point-to-point interconnect in a predetermined order. The STORE to shared memory is performed only when the invalidation has been propagated to all the processor caches and buffers. If any processor issues a LOAD to the address that is being STOREd, the LOAD request has to be rejected or buffered and serviced later, to maintain atomic access to the shared data block.&lt;br /&gt;
&lt;br /&gt;
====Separate Buffers for Local and Remote Accesses====&lt;br /&gt;
[[Image:Separate Invalidate Buffers.png|thumb|right|250px|Figure 4: Cache based system with separate write-invalidate buffers&lt;br /&gt;
[[http://mprc.pku.cn/mentors/training/ISCAreading/1986/p434-dubois/p434-dubois.pdf]]]] &lt;br /&gt;
&lt;br /&gt;
Another approach to maintaining coherence is for every processor to have separate buffers for local data requests and remote data requests. As seen in the figure, every processor has a Local-buffer which queues the local LOADs and STOREs, and a Remote-buffer (also termed Invalidate buffer) that stores the invalidation requests and LOADs coming in from different processors.&lt;br /&gt;
In a bus-based multiprocessor system, this approach makes it difficult to maintain write atomicity, because different invalidate buffers may hold different numbers of invalidation requests, which makes the time at which the data block is invalidated in each cache uncertain. In non-bus-based systems, however, this approach is successful in maintaining strong coherence, provided the invalidate signal makes sure that the data is invalidated at a cache before moving on to the next processor (rather than just pushing the invalidate request into the buffer and moving on).&lt;br /&gt;
&lt;br /&gt;
====Universal read/write Buffer====&lt;br /&gt;
&lt;br /&gt;
[[Image:Universal Read Write Buffer.png|thumb|right|300px|Figure 5: Universal Read/Write Buffer [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/6b_dm#cite_note-4 5]]]&lt;br /&gt;
&lt;br /&gt;
In this approach, a shared bus controller resides between the local bus and the memory bus, where the local bus is connected to the private caches of all the processors (illustrated in Figure 5). This technique supports multiple processors with the same or different write-invalidate coherence protocols, such as MSI, MESI and MOESI. The shared bus controller consists of the local bus controller, the data buffer and the address buffer. Each private cache provides two tri-state signal bits to the local bus: '''INTERVENE''' and '''SHARE'''.&lt;br /&gt;
&lt;br /&gt;
# INTERVENE is asserted when a cache wants to provide valid data to other caches.&lt;br /&gt;
# SHARE is asserted by a cache controller when there is an address match with any of its own tag addresses.&lt;br /&gt;
&lt;br /&gt;
=====''' Data Buffer '''=====&lt;br /&gt;
&lt;br /&gt;
The data buffer consists of three FIFOs&lt;br /&gt;
&lt;br /&gt;
# The ''non_block write-FIFO'' receives non-cacheable memory access data from the CPU. These requests do not require any cache eviction or line-fill cycles. &lt;br /&gt;
# The ''write-back FIFO'' is used for eviction cycles. If a cache miss requires an eviction, the evicted data is first moved to this FIFO, and the missed data is then read from memory into the cache (line-fill cycle). The FIFO contents are written back to memory later, so the CPU gets the new data while the eviction time is hidden from it. &lt;br /&gt;
# The ''read-FIFO'' stores data coming from the non_block write FIFO, the write-back FIFO or memory. Data is held in this FIFO temporarily until the local bus is clear and ready for new data.&lt;br /&gt;
&lt;br /&gt;
=====''' Address Buffer '''=====&lt;br /&gt;
&lt;br /&gt;
The address buffer also consists of three FIFOs &lt;br /&gt;
&lt;br /&gt;
# ''Non_block write FIFO'' - stores the addresses of non-blocking accesses. The depth of this FIFO should be the same as that of the corresponding Data Buffer FIFO.&lt;br /&gt;
# ''Write-back FIFO'' - stores the starting address of an eviction cycle and holds it until the memory bus is free.&lt;br /&gt;
# ''Line-fill FIFO'' - stores the starting address of a line-fill cycle.&lt;br /&gt;
&lt;br /&gt;
Snoop logic starts whenever there is a read cycle to main memory. It compares the read request address with the entries of both the write-back FIFO and the line-fill FIFO. If there is a match, the HIT flag is set and the CPU gets the data immediately through the internal bypass path. As a result, stale data is never read and the CPU does not pay the memory latency.&lt;br /&gt;
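&lt;br /&gt;
The address comparison described above can be sketched as follows. This is a simplified illustration, not the cited hardware: it assumes each FIFO entry is an (address, data) pair so that a match can return data directly, and the function name snoop_read and the order in which the two FIFOs are searched are hypothetical choices.&lt;br /&gt;

```python
from collections import deque


def snoop_read(address, write_back_fifo, line_fill_fifo, memory):
    """Return (hit, data) for a read cycle aimed at main memory.

    Each FIFO holds (address, data) pairs, oldest first. Scanning from the
    newest entry backwards means the most recent buffered copy wins.
    """
    for fifo in (write_back_fifo, line_fill_fifo):
        for entry_address, data in reversed(fifo):
            if entry_address == address:
                return True, data       # HIT: bypass path, no memory access
    return False, memory[address]       # no match: pay the memory latency


memory = {0x40: 1, 0x80: 2}
write_back = deque([(0x40, 99)])        # evicted dirty line not yet in memory
line_fill = deque()
hit_result = snoop_read(0x40, write_back, line_fill, memory)
miss_result = snoop_read(0x80, write_back, line_fill, memory)
```

The read of 0x40 is answered from the write-back FIFO with the buffered value rather than the stale copy in memory, while the read of 0x80 falls through to memory.&lt;br /&gt;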
&lt;br /&gt;
For a non-block write cycle, the snoop logic compares the newly requested address and Byte Enable bits with those previously stored in the non_block write FIFO. If the addresses match but the Byte Enables do not overlap, the BGO signal is asserted and the pointer of the non_block write FIFO does not move forward, which keeps the FIFO from filling up quickly.&lt;br /&gt;
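&lt;br /&gt;
A minimal sketch of this merge rule is given below. It rests on several assumptions that are not in the source: only the most recently stored entry is compared, byte lanes are modelled as dictionary keys instead of Byte Enable bits, and the function name push_non_block_write is hypothetical.&lt;br /&gt;

```python
def push_non_block_write(fifo, address, lanes):
    """Append a non-blocking write; return True when BGO would be asserted.

    fifo  : list of [address, lanes] entries, oldest first
    lanes : dict mapping an enabled byte lane (0..3) to its data byte
    """
    if fifo:
        last_address, last_lanes = fifo[-1]
        # Same address and no overlapping byte lanes: merge into the entry.
        if last_address == address and set(last_lanes).isdisjoint(lanes):
            last_lanes.update(lanes)    # the write pointer stays where it is
            return True
    fifo.append([address, dict(lanes)])  # otherwise the write takes a new slot
    return False


fifo = []
first = push_non_block_write(fifo, 0x100, {0: 0xAA})
second = push_non_block_write(fifo, 0x100, {1: 0xBB})   # disjoint lanes: merged
third = push_non_block_write(fifo, 0x100, {1: 0xCC})    # overlapping lane: new slot
```

Here three writes to the same address occupy only two FIFO slots, because the second write touches a different byte lane and is merged into the first entry.&lt;br /&gt;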
&lt;br /&gt;
In this approach, read cycles have priority over write cycles, so whenever the memory bus is available, a read address gets direct access to it.&lt;br /&gt;
&lt;br /&gt;
=====''' Local Bus Controller '''=====&lt;br /&gt;
&lt;br /&gt;
The controller monitors the local bus for the status bits (M, O, E, S, I) and the INTERVENE# and SHARE# bits, and determines whether a cycle needs to access main memory or a private cache. For a main memory access, the controller informs the buffer using the WB (Write Back), LF (Line Fill) or O3 (Owned) status signal.&lt;br /&gt;
&lt;br /&gt;
Algorithm to generate controller signals-&lt;br /&gt;
&lt;br /&gt;
For the MSI and MESI protocols: &lt;br /&gt;
&lt;br /&gt;
If there is a read miss on the local bus -&lt;br /&gt;
::'''INTERVENE''' is asserted.&lt;br /&gt;
If there is a write miss with '''M''' status on the local bus -&lt;br /&gt;
::Write back cycle will be performed&lt;br /&gt;
::Generate '''WB''' signal.&lt;br /&gt;
If there is a read miss or write miss from any of the protocols -&lt;br /&gt;
::'''LF''' cycle will be performed&lt;br /&gt;
::Generate '''LF''' signal&lt;br /&gt;
&lt;br /&gt;
For the MOESI protocol:&lt;br /&gt;
&lt;br /&gt;
If there is a read miss on the local bus&lt;br /&gt;
::'''INTERVENE''' is asserted&lt;br /&gt;
&lt;br /&gt;
The above outputs can be written in Boolean equations as shown in the example&amp;lt;ref&amp;gt;http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=00342629&amp;lt;/ref&amp;gt; below:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
      WB = RD * MOESI * INTV + WR * [STATUS=M]&lt;br /&gt;
      LF =  INTV&lt;br /&gt;
      O3 = RD * MOESI * INTV&lt;br /&gt;
      MOESI means the cycle is initiated by a cache using the MOESI protocol.&lt;br /&gt;
      INTV means the INTERVENE signal, which is asserted when any cache provides valid data to another cache.&lt;br /&gt;
&lt;br /&gt;
=====''' Algorithms '''=====&lt;br /&gt;
&lt;br /&gt;
The algorithms followed by each of the processors in a multiprocessor system are given below. The processors may follow different protocols (MSI, MESI or MOESI), but the algorithms below ensure that memory accesses are carried out coherently.&lt;br /&gt;
&lt;br /&gt;
The “S” state is common to all three protocols, but this system takes advantage of the MESI and MOESI protocols to reduce memory bus transactions. In the shared state (S), multiple caches hold the same up-to-date data, but memory may or may not have a copy. This change requires the following protocol translation. &lt;br /&gt;
&lt;br /&gt;
''MOESI''    The algorithm followed for the MOESI protocol is as follows:&lt;br /&gt;
&lt;br /&gt;
#	A cache in M state provides the data to another cache when that cache initiates a read hit, and changes its state from M to O.&lt;br /&gt;
#	If the same data is hit again, the cache in O state is responsible for providing the data to the requesting cache.&lt;br /&gt;
#	A write cycle needs a line fill if there is a cache miss and INTERVENE is not asserted. &lt;br /&gt;
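&lt;br /&gt;
The first two steps above can be sketched as a small state update. This is a simplified illustration of the owner hand-off under stated assumptions: states are kept in a dictionary keyed by cache name, the function name remote_read is hypothetical, and the behaviour when no cache owns the line (the requester takes E or S and memory supplies the data) follows the usual MOESI convention and is not taken from the source.&lt;br /&gt;

```python
def remote_read(states, requester):
    """Apply a read by `requester`; return the name of the data supplier.

    states maps a cache name to one of "M", "O", "E", "S", "I".
    """
    others = [cache for cache in states if cache != requester]
    for cache in others:
        if states[cache] in ("M", "O"):
            states[cache] = "O"         # M becomes O; O stays the owner
            states[requester] = "S"
            return cache                # INTERVENE: cache-to-cache transfer
    # No owner: memory supplies the line (line-fill cycle).
    sharers = [cache for cache in others if states[cache] in ("E", "S")]
    for cache in sharers:
        states[cache] = "S"
    states[requester] = "S" if sharers else "E"
    return "memory"


states = {"P0": "M", "P1": "I", "P2": "I"}
first_supplier = remote_read(states, "P1")
second_supplier = remote_read(states, "P2")
```

In this run P0 supplies the data both times: it moves from M to O on the first read and remains the owner for the second, so main memory is never accessed.&lt;br /&gt;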
&lt;br /&gt;
&lt;br /&gt;
''MSI'' and ''MESI''	The algorithms for the MSI and MESI protocols are as follows (the MSI and MESI protocols have no O state):&lt;br /&gt;
&lt;br /&gt;
#	A cache in M state under the MSI or MESI protocol provides the data to the other cache and to main memory, and changes its state from M to S.&lt;br /&gt;
#	The local bus controller is responsible for changing the state from S to O for the MOESI protocol. &lt;br /&gt;
#	M of MSI or MESI still changes to S. &lt;br /&gt;
#	If a write cycle is initiated by another CPU (MSI or MESI),&lt;br /&gt;
::	If a MOESI cache has a valid line with M or O status&lt;br /&gt;
:::  	It will send the data line out on to the local bus and change its state to I.&lt;br /&gt;
:::	[Because the line will be written by the other cache and become outdated]&lt;br /&gt;
&lt;br /&gt;
:5.	If the status is E or S, main memory will provide the data and no cache is involved in data transfer. However, cache with MOESI protocol will be the owner of a particular line.&lt;br /&gt;
&lt;br /&gt;
:6.	Force the status bit to be O instead of S&lt;br /&gt;
&lt;br /&gt;
=='''References'''==&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;br /&gt;
[http://mprc.pku.cn/mentors/training/ISCAreading/1986/p434-dubois/p434-dubois.pdf Memory Access Buffering In Multiprocessors] Michel Dubois, Christoph Scheurich, Faye Briggs&lt;br /&gt;
&lt;br /&gt;
[http://www.ece.cmu.edu/~ece548/handouts/19coher.pdf Multiprocessor Consistency and Coherence] Memory System Architecture, Philip Koopman&lt;br /&gt;
&lt;br /&gt;
[http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=528915 Write buffer design for cache-coherent shared-memory multiprocessors], Fernaz Mounes-Toussi, David J. Lilja&lt;/div&gt;</summary>
		<author><name>Sbasu3</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/6b_am&amp;diff=59302</id>
		<title>CSC/ECE 506 Spring 2012/6b am</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/6b_am&amp;diff=59302"/>
		<updated>2012-03-04T15:30:25Z</updated>

		<summary type="html">&lt;p&gt;Sbasu3: /*  Local Bus Controller  */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Introduction ==&lt;br /&gt;
&lt;br /&gt;
With the present day processor speeds increasing at a much faster rate than memory speeds, there arises a need that the data transactions between the processor and the memory system be managed such that the slow memory speeds do not affect the performance of the processor system. While the read operations require that the processor wait for the read operation to complete before resuming execution, the write operations do not have this requirement. This is where a write buffer (WB) comes into the picture, assisting the processor in writes, so that the processor can continue its operation while the write buffer takes complete responsibility of executing the write. &lt;br /&gt;
&lt;br /&gt;
===Write Buffers in Uni-processors===&lt;br /&gt;
&lt;br /&gt;
A write to be performed is put in a buffer implemented as a FIFO queue, so that the writes are performed in the order that they were called. In a uni-processor model, with the requirement and possibility of extracting Instruction Level Parallelism (ILP), the writes may also be called out-of-order, provided there are some hardware/ software protocols implemented to check the writes for any dependences that may exist in the instruction stream. The following figure shows the cache-based single processor system with a write buffer. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:Uniprocessor With WB.png|thumb|center|400px|Figure 1.Uni-processor cache based system with write buffer ]]&lt;br /&gt;
&lt;br /&gt;
===Write Buffer Issues in Multiprocessors===&lt;br /&gt;
&lt;br /&gt;
In a multiprocessor system, the same design can be extended, as shown in the figure. In this design, each processor will have its own private cache and a write buffer corresponding to the cache. The caches are connected by the means of an interconnect with each other as well as with the main memory. &lt;br /&gt;
 &lt;br /&gt;
[[Image:Multiprocessor with WB.png|thumb|center|400px|Figure 2. Cache- based multiprocessor system with write buffer ]] &lt;br /&gt;
&lt;br /&gt;
As can be seen in the figure above, each processor makes a write by pushing it into the write buffer, and the write buffer completes the task of performing the write to the cache/ main memory. If another write is issued by the processor that modifies the same address as the earlier write, the former write value will be over-written with the new one in the write buffer. Similarly, if a read is issued to the same address, the processor will read the value from the write buffer rather than going into the cache or the main memory.&lt;br /&gt;
&lt;br /&gt;
===The Coherence Problem===&lt;br /&gt;
&lt;br /&gt;
Consider the case where a write (ST_A) has been issued by processor I into the WB_I, and the write is waiting to be executed. Since the write has not yet been performed to the cache or the main memory, the other processors do not have any knowledge about the changes made to address A. As a result, a read operation by another processor from address A will take the value from either its own cache, write buffer, or the main memory (depending on whether it hits or misses in the cache). It is the job of the designer to employ protocols that will take care that the sequential ordering of the instructions is maintained, and that the writes made to any one of the caches are propagated and updated in all of the processor caches. &lt;br /&gt;
&lt;br /&gt;
==Sequential Consistency==&lt;br /&gt;
&lt;br /&gt;
When a processor operates as part of a multiprocessor, it is not enough to check dependencies between writes only at the local level. Because data is shared and events in one processor may affect the outcome in another, the sequence of writes and the data dependencies must be preserved across all the processors. &lt;br /&gt;
“A system is said to be sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.”&lt;br /&gt;
There are different approaches to this requirement, based on the expected outcome of the program. The condition of sequential consistency and logical ordering may be relaxed as per the requirement of the program we are looking to parallelize. &lt;br /&gt;
&lt;br /&gt;
===Strong Ordering===&lt;br /&gt;
The requirements for strong ordering are as follows:&lt;br /&gt;
 &lt;br /&gt;
1) All memory operations appear to execute one at a time.&lt;br /&gt;
&lt;br /&gt;
2) All memory operations from a single CPU appear to execute in-order.&lt;br /&gt;
&lt;br /&gt;
3) All memory operations from different processors are “cleanly” interleaved with each other (serialization)&lt;br /&gt;
&lt;br /&gt;
===Total Store Ordering===&lt;br /&gt;
Requirements are as follows:&lt;br /&gt;
&lt;br /&gt;
1)	Relaxed consistency in which stores must complete in order, but stores need not complete before a read to a given location takes place.&lt;br /&gt;
&lt;br /&gt;
2)	Reads are allowed to bypass pending writes, but writes MUST exit the store buffer in FIFO order.&lt;br /&gt;
&lt;br /&gt;
===Partial Store Ordering===&lt;br /&gt;
&lt;br /&gt;
Requirement is as follows:&lt;br /&gt;
&lt;br /&gt;
Even more relaxed consistency in which stores to any given memory location complete in order, but stores to different locations may complete out of order, and stores need not complete before a read to a given location takes place.&lt;br /&gt;
&lt;br /&gt;
===Weak Ordering===&lt;br /&gt;
Requirement is as follows:&lt;br /&gt;
&lt;br /&gt;
Very relaxed consistency: any ordering is allowed except at barrier synchronization points. Global memory state must be completely settled at each synchronization point, and between synchronization points the memory state may correspond to any ordering of reads and writes.&lt;br /&gt;
&lt;br /&gt;
===Example===&lt;br /&gt;
Example for sequential consistency:&lt;br /&gt;
In the following program, the variables “a”, “b”, “flag1” and “flag2” are initialized to 0 and are shared by both processors (CPU 1 and CPU 2).&lt;br /&gt;
   &lt;br /&gt;
    a = b = flag1 = flag2 = 0;		// initial value&lt;br /&gt;
    CPU 1				CPU 2 &lt;br /&gt;
    flag1 = 1;				flag2 = 1;&lt;br /&gt;
    a = 1;				a = 2; &lt;br /&gt;
    r1 = a;				r3 = a;&lt;br /&gt;
    r2 = flag2;				r4 = flag1;&lt;br /&gt;
&lt;br /&gt;
The SPARC V8 architecture follows the Total Store Ordering model and allows a write followed by a read to complete out of program order. One possible result is r1 = 1, r3 = 2, r2 = r4 = 0.&lt;br /&gt;
Strong ordering, in contrast, enforces strict atomicity, so the results depend only on how the two processors’ instructions are interleaved, and r2 = r4 = 0 cannot occur.&lt;br /&gt;
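&lt;br /&gt;
The claim about strong ordering can be checked with a small sketch. It is only an illustration: the tuple encoding of the instructions and the helper names (run, sc_outcomes) are assumptions made here, not part of any architecture manual. The sketch enumerates every interleaving of the two instruction streams that keeps each processor's program order and collects the register results.&lt;br /&gt;

```python
from itertools import combinations

# Hypothetical encoding of the example program: ("st", variable, value)
# is a store, ("ld", variable, register) is a load into a register.
CPU1 = [("st", "flag1", 1), ("st", "a", 1), ("ld", "a", "r1"), ("ld", "flag2", "r2")]
CPU2 = [("st", "flag2", 1), ("st", "a", 2), ("ld", "a", "r3"), ("ld", "flag1", "r4")]


def run(schedule):
    """Execute one interleaving atomically, one operation at a time."""
    mem = {"a": 0, "b": 0, "flag1": 0, "flag2": 0}
    regs = {}
    for kind, var, arg in schedule:
        if kind == "st":
            mem[var] = arg
        else:
            regs[arg] = mem[var]
    return (regs["r1"], regs["r2"], regs["r3"], regs["r4"])


def sc_outcomes():
    """Collect (r1, r2, r3, r4) over all sequentially consistent interleavings."""
    total = len(CPU1) + len(CPU2)
    results = set()
    # Choosing which time slots belong to CPU 1 fixes one interleaving;
    # within each CPU the program order is always preserved.
    for slots in combinations(range(total), len(CPU1)):
        it1, it2 = iter(CPU1), iter(CPU2)
        schedule = [next(it1) if i in slots else next(it2) for i in range(total)]
        results.add(run(schedule))
    return results
```

Under these assumptions, no element of sc_outcomes() has r2 and r4 both equal to 0, which is the result that Total Store Ordering additionally permits.&lt;br /&gt;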
&lt;br /&gt;
===Effects on Write Buffer Operation===&lt;br /&gt;
&lt;br /&gt;
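The ordering model determines how a write buffer may behave. Under strong ordering, a read must not bypass a pending write, so earlier writes have to leave the buffer before a later read completes. Total store ordering allows a FIFO write buffer whose pending writes may be bypassed by reads. Partial store ordering additionally lets writes to different locations leave the buffer out of order. Weak ordering only requires the buffer to be drained at synchronization points.&lt;br /&gt;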
&lt;br /&gt;
==Cache Coherence Models==&lt;br /&gt;
&lt;br /&gt;
The two prominent models for maintaining cache coherence are the snoopy-bus protocol and the directory-based coherence protocol. The directory-based protocol is used for distributed memory systems, in which every processor uses directory entries to keep track of what data is being stored. In the snoopy-bus protocol, every processor monitors the reads and writes being serviced by the memory bus. Read and write requests are broadcast on the bus, and each processor responds based on whether or not it holds that data in its cache. This article looks at shared memory systems.&lt;br /&gt;
&lt;br /&gt;
=== Write-Update ===&lt;br /&gt;
&lt;br /&gt;
In this approach, a write request is broadcast to all the processors, and each processor updates its local cache with the new value of the data. If a read misses in a processor's local cache, it can be serviced from any other cache that holds the block, because every copy has been updated. This saves bus bandwidth on writes, as only one broadcast is needed to bring all the caches up to date with the newest value of the data.&lt;br /&gt;
&lt;br /&gt;
===Write-Invalidate===&lt;br /&gt;
This approach is common in systems where a data block is written by one processor but read by multiple processors. Whenever a write is made to shared data, an invalidate signal is sent to all the caches, asking them to mark that data block invalid because their copies no longer hold the latest value. This may cause additional traffic on the bus; as a result, a separate bus for invalidation requests may be included in the design.&lt;br /&gt;
&lt;br /&gt;
== Coherence in Write Buffers ==&lt;br /&gt;
&lt;br /&gt;
===Software-Based Coherence===&lt;br /&gt;
The software technique relies on the compiler to ensure that no dependencies exist between the STORE/LOAD accesses that different processors make to the shared memory. Based on indications from the programmer, the compiler makes sure that there are no incoherent accesses to shared memory. It is also possible to have a READ/WRITE buffer shared between all processors for accessing the main memory. &lt;br /&gt;
&lt;br /&gt;
===Hardware-Based Coherence===&lt;br /&gt;
In hardware-based protocols, accesses to the shared memory are communicated using hardware invalidate signals that are broadcast to all the processor memories. Hardware support may also be required for a LOAD to be broadcast to all the caches, so that a read can be performed directly from a remote memory. The following approaches maintain coherence in write buffers and caches for multiprocessor systems:&lt;br /&gt;
&lt;br /&gt;
====Unique Buffer per Processor==== &lt;br /&gt;
[[Image:Unique Buffer Per Processor.png|thumb|right|250px|Figure 3: Cache based system with unique buffer per processor &lt;br /&gt;
[[http://mprc.pku.cn/mentors/training/ISCAreading/1986/p434-dubois/p434-dubois.pdf]]]] &lt;br /&gt;
&lt;br /&gt;
In this configuration, every processor has its own unique buffer that holds the loads and stores of the local processor as well as the invalidates and loads of remote processors wanting access to the shared data. A store request in the local cache triggers invalidation of that data block in all other caches and main memory. A STORE is said to have been performed when &lt;br /&gt;
&lt;br /&gt;
# the request is serviced in the buffer and the cache is updated on a local hit, and&lt;br /&gt;
# the request is issued to the buffer and an invalidate request has been sent to all other processors sharing the data.&lt;br /&gt;
&lt;br /&gt;
This can be achieved in two ways, depending on the topology of the connection between the caches:&lt;br /&gt;
&lt;br /&gt;
# For a bus-based MP system, the LOADs that miss in the local cache and the STOREs that need invalidation in other caches are broadcast over the bus to all processor caches and memory. The STORE request is then considered performed with respect to all processors when the invalidation signals have been sent out to the private buffers of all the caches. &lt;br /&gt;
# For non-bus MP systems, the caches are connected in a point-to-point topology such as a mesh or a ring. When a store is encountered in one of the processors, the shared data is first locked in the shared memory to ensure atomic access, and then the invalidate signal is propagated along the point-to-point interconnect in a predetermined order. The STORE to shared memory is performed only when the invalidation has been propagated to all the processor caches and buffers. If any processor issues a LOAD to the address being STOREd, the LOAD request has to be rejected or buffered to be serviced later, to maintain atomic access to the shared data block.&lt;br /&gt;
&lt;br /&gt;
====Separate Buffers for Local and Remote Accesses====&lt;br /&gt;
[[Image:Separate Invalidate Buffers.png|thumb|right|250px|Figure 4: Cache based system with separate write-invalidate buffers&lt;br /&gt;
[[http://mprc.pku.cn/mentors/training/ISCAreading/1986/p434-dubois/p434-dubois.pdf]]]] &lt;br /&gt;
&lt;br /&gt;
Another approach to maintaining coherence is for every processor to have separate buffers for local data requests and remote data requests. As seen in the figure, every processor has a Local-buffer which queues the local LOADs and STOREs, and a Remote-buffer (also termed Invalidate buffer) that stores the invalidation requests and LOADs coming in from different processors.&lt;br /&gt;
In a bus-based system, this approach makes it difficult to maintain write atomicity, because different invalidate buffers may hold different numbers of invalidation requests, which makes it uncertain when the data block concerned is actually invalidated in each cache. In non-bus-based systems, however, this approach succeeds in maintaining strong coherence, provided the invalidate signal makes sure that the data is invalidated at a cache before moving on to the next processor (rather than just pushing the invalidate request into the buffer and moving on).&lt;br /&gt;
&lt;br /&gt;
====Universal read/write Buffer====&lt;br /&gt;
&lt;br /&gt;
[[Image:Universal Read Write Buffer.png|thumb|right|300px|Figure 5: Universal Read/Write Buffer [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/6b_dm#cite_note-4 5]]]&lt;br /&gt;
&lt;br /&gt;
In this approach, a shared bus controller resides between the local bus and the memory bus, where the local bus is connected to all private caches of the processors (illustrated in Fig 5). This technique supports multiple processors with the same or different write-invalidate coherence protocols, such as MSI, MESI and MOESI. The shared bus controller consists of the local bus controller, the data buffer and the address buffer. Each private cache provides two tri-state signal bits to the local bus – '''INTERVENE''' and '''SHARE'''.&lt;br /&gt;
&lt;br /&gt;
# INTERVENE is asserted when a cache wants to provide valid data to other caches.&lt;br /&gt;
# SHARE is asserted by a cache controller when there is an address match with one of its own tag addresses.&lt;br /&gt;
&lt;br /&gt;
=====''' Data Buffer '''=====&lt;br /&gt;
&lt;br /&gt;
The data buffer consists of three FIFOs:&lt;br /&gt;
&lt;br /&gt;
# The ''non_block write-FIFO'' receives non-cacheable memory access data from the CPU. These requests do not require any cache eviction or line-fill cycles. &lt;br /&gt;
# The ''write-back FIFO'' is used for eviction cycles. If there is a cache miss and an eviction is required, the missed data is read from memory into the cache (line-fill cycle) just after the evicted data has been moved to the FIFO. The contents of this FIFO are written back to memory later. The CPU thus gets the new data while the eviction time is hidden from it. &lt;br /&gt;
# The ''read-FIFO'' stores data from the non_block write FIFO, the write-back FIFO or memory. Data is held temporarily in this FIFO until the local bus is clear and ready for new data.&lt;br /&gt;
&lt;br /&gt;
=====''' Address Buffer '''=====&lt;br /&gt;
&lt;br /&gt;
The address buffer also consists of three FIFOs: &lt;br /&gt;
&lt;br /&gt;
# ''Non_block write FIFO'' - stores the addresses of non-blocking accesses. The depth of this FIFO should be the same as that of the corresponding Data Buffer FIFO.&lt;br /&gt;
# ''Write-back FIFO'' - stores the starting address of an eviction cycle and holds the address until the memory bus is free.&lt;br /&gt;
# ''Line-fill FIFO'' - stores the starting address of a line-fill cycle.&lt;br /&gt;
&lt;br /&gt;
Snoop logic is started whenever there is a read cycle to main memory. It compares the read request address with the entries of both the write-back FIFO and the line-fill FIFO. If there is a match, the HIT flag is set and the CPU gets the data immediately through the internal bypass path. As a result, stale data is not read and the CPU does not have to wait out the memory latency.&lt;br /&gt;
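&lt;br /&gt;
The read-side snoop can be sketched in a few lines. This is a simplified model under stated assumptions: only the write-back FIFO is modeled (the line-fill FIFO is left out), addresses are plain integers, and the class and method names (AddressBufferSnoop, evict, read, drain_one) are hypothetical, not taken from the cited design.&lt;br /&gt;

```python
from collections import deque


class AddressBufferSnoop:
    """Hypothetical model: pending write-backs are checked before main memory."""

    def __init__(self, memory):
        self.memory = memory            # dict: address -> data held in main memory
        self.write_back_fifo = deque()  # (address, data) evicted but not yet written

    def evict(self, address, data):
        # The evicted line waits here until the memory bus is free.
        self.write_back_fifo.append((address, data))

    def read(self, address):
        # Snoop: compare the read address with the pending write-back entries,
        # newest first, so the most recently evicted value wins.
        for pending_address, data in reversed(self.write_back_fifo):
            if pending_address == address:
                return data, True       # HIT flag set: data comes via the bypass path
        return self.memory[address], False

    def drain_one(self):
        # The memory bus is free: retire the oldest write-back to main memory.
        address, data = self.write_back_fifo.popleft()
        self.memory[address] = data
```

In this model a read of an address with a pending write-back returns the evicted value with the hit flag set, instead of the stale copy in main memory. Once the entry has drained, the same read is served from memory.&lt;br /&gt;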
&lt;br /&gt;
For a non-block write cycle, the snoop logic compares the newly requested address and Byte Enable bits with those previously stored in the non_block write FIFO. If there is an address match but no Byte Enable overlap, the BGO signal is asserted and the pointer of the non_block write FIFO does not move forward, which prevents the FIFO from filling up quickly.&lt;br /&gt;
&lt;br /&gt;
In this approach, read cycles have priority over write cycles, so whenever the memory bus is available, a read address gets direct memory bus access.&lt;br /&gt;
&lt;br /&gt;
=====''' Local Bus Controller '''=====&lt;br /&gt;
&lt;br /&gt;
The controller monitors the local bus for the status bits (M, O, E, S, I) and the INTERVENE# and SHARE# bits, and determines whether the cycle needs to access main memory or a private cache. For main memory access, the controller informs the buffer with the WB (Write Back), LF (Line Fill) or O3 (owned) status signal.&lt;br /&gt;
&lt;br /&gt;
Algorithm to generate controller signals:&lt;br /&gt;
&lt;br /&gt;
For the MSI and MESI protocols: &lt;br /&gt;
&lt;br /&gt;
If there is a read miss on the local bus -&lt;br /&gt;
::'''INTERVENE''' is asserted.&lt;br /&gt;
If there is a write miss with '''M''' status on the local bus -&lt;br /&gt;
::Write back cycle will be performed&lt;br /&gt;
::Generate '''WB''' signal.&lt;br /&gt;
If there is a read miss or write miss under any of the protocols -&lt;br /&gt;
::'''LF''' cycle will be performed&lt;br /&gt;
::Generate '''LF''' signal&lt;br /&gt;
&lt;br /&gt;
For the MOESI protocol:&lt;br /&gt;
&lt;br /&gt;
If there is a read miss on the local bus&lt;br /&gt;
::'''INTERVENE''' is asserted&lt;br /&gt;
&lt;br /&gt;
The above outputs can be written in Boolean equations as shown in the example&amp;lt;ref&amp;gt;http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=00342629&amp;lt;/ref&amp;gt; below:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
      WB = RD * MOESI * INTV + WR * [STATUS=M]&lt;br /&gt;
      LF =  INTV&lt;br /&gt;
      O3 = RD * MOESI * INTV&lt;br /&gt;
      MOESI means the cycle is initiated by the MOESI protocol.&lt;br /&gt;
      INTV means the INTERVENE signal, which is asserted when a cache provides valid data to another cache.&lt;br /&gt;
&lt;br /&gt;
=====''' Algorithms '''=====&lt;br /&gt;
&lt;br /&gt;
The algorithms followed by each of the processors in a multiprocessor system are given below. The processors may follow different protocols (MSI, MESI or MOESI), but the following algorithms ensure that memory accesses are carried out coherently. &lt;br /&gt;
&lt;br /&gt;
''MOESI''    The algorithm followed for the MOESI protocol is as follows:&lt;br /&gt;
&lt;br /&gt;
#	A cache holding the line in the M state provides the data to another cache on a read hit initiated by that cache, and changes its state from M to O.&lt;br /&gt;
#	If the same data is hit again, the cache holding it in the O state is responsible for providing the data to the requesting cache.&lt;br /&gt;
#	A write cycle needs a line fill if there is a cache miss and no INTERVENE. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
''MSI'' and ''MESI''	The algorithms for the MSI and MESI protocols are as follows (the MSI and MESI protocols have no O state):&lt;br /&gt;
&lt;br /&gt;
#	A cache holding the line in the M state under MSI or MESI provides the data to the other cache and to main memory, and changes its state from M to S.&lt;br /&gt;
#	The local bus controller is responsible for changing the state from S to O for a cache using the MOESI protocol. &lt;br /&gt;
#	The M state of an MSI or MESI cache still changes to S. &lt;br /&gt;
#	If a write cycle is initiated by another CPU (MSI or MESI),&lt;br /&gt;
::	if a MOESI cache has a valid line with M or O status,&lt;br /&gt;
:::  	it sends the data line out onto the local bus and changes its state to I&lt;br /&gt;
:::	[because the line will be written by the other cache and become outdated].&lt;br /&gt;
&lt;br /&gt;
:5.	If the status is E or S, main memory provides the data and no cache is involved in the data transfer. However, the cache using the MOESI protocol becomes the owner of that line.&lt;br /&gt;
&lt;br /&gt;
:6.	The status bit is forced to O instead of S.&lt;br /&gt;
&lt;br /&gt;
=='''References'''==&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;br /&gt;
[http://mprc.pku.cn/mentors/training/ISCAreading/1986/p434-dubois/p434-dubois.pdf Memory Access Buffering In Multiprocessors] Michel Dubois, Christoph Scheurich, Faye Briggs&lt;br /&gt;
&lt;br /&gt;
[http://www.ece.cmu.edu/~ece548/handouts/19coher.pdf Multiprocessor Consistency and Coherence] Memory System Architecture, Philip Koopman&lt;br /&gt;
&lt;br /&gt;
[http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=528915 Write buffer design for cache-coherent shared-memory multiprocessors], Fernaz Mounes-Toussi, David J. Lilja&lt;/div&gt;</summary>
		<author><name>Sbasu3</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/6b_am&amp;diff=59301</id>
		<title>CSC/ECE 506 Spring 2012/6b am</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/6b_am&amp;diff=59301"/>
		<updated>2012-03-04T15:29:23Z</updated>

		<summary type="html">&lt;p&gt;Sbasu3: /*  Local Bus Controller  */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Introduction ==&lt;br /&gt;
&lt;br /&gt;
With the present day processor speeds increasing at a much faster rate than memory speeds, there arises a need that the data transactions between the processor and the memory system be managed such that the slow memory speeds do not affect the performance of the processor system. While the read operations require that the processor wait for the read operation to complete before resuming execution, the write operations do not have this requirement. This is where a write buffer (WB) comes into the picture, assisting the processor in writes, so that the processor can continue its operation while the write buffer takes complete responsibility of executing the write. &lt;br /&gt;
&lt;br /&gt;
===Write Buffers in Uni-processors===&lt;br /&gt;
&lt;br /&gt;
A write to be performed is put in a buffer implemented as a FIFO queue, so that the writes are performed in the order that they were called. In a uni-processor model, with the requirement and possibility of extracting Instruction Level Parallelism (ILP), the writes may also be called out-of-order, provided there are some hardware/ software protocols implemented to check the writes for any dependences that may exist in the instruction stream. The following figure shows the cache-based single processor system with a write buffer. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:Uniprocessor With WB.png|thumb|center|400px|Figure 1.Uni-processor cache based system with write buffer ]]&lt;br /&gt;
&lt;br /&gt;
===Write Buffer Issues in Multiprocessors===&lt;br /&gt;
&lt;br /&gt;
In a multiprocessor system, the same design can be extended, as shown in the figure. In this design, each processor will have its own private cache and a write buffer corresponding to the cache. The caches are connected by the means of an interconnect with each other as well as with the main memory. &lt;br /&gt;
 &lt;br /&gt;
[[Image:Multiprocessor with WB.png|thumb|center|400px|Figure 2. Cache- based multiprocessor system with write buffer ]] &lt;br /&gt;
&lt;br /&gt;
As can be seen in the figure above, each processor makes a write by pushing it into the write buffer, and the write buffer completes the task of performing the write to the cache/ main memory. If another write is issued by the processor that modifies the same address as the earlier write, the former write value will be over-written with the new one in the write buffer. Similarly, if a read is issued to the same address, the processor will read the value from the write buffer rather than going into the cache or the main memory.&lt;br /&gt;
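&lt;br /&gt;
The behavior described above can be sketched as a small model. It is an illustration under assumptions, not a description of any particular hardware: the class name WriteBuffer and its methods are hypothetical, memory is a plain dictionary, and a location that was never written is assumed to read as 0.&lt;br /&gt;

```python
from collections import OrderedDict


class WriteBuffer:
    """Hypothetical write buffer with overwrite-in-place and read forwarding."""

    def __init__(self, memory):
        self.memory = memory           # dict standing in for cache / main memory
        self.pending = OrderedDict()   # address -> value, oldest entry first

    def write(self, address, value):
        # A later write to the same address overwrites the pending value;
        # the entry keeps its original position in the queue.
        self.pending[address] = value

    def read(self, address):
        # A read of a buffered address is served from the write buffer.
        if address in self.pending:
            return self.pending[address]
        return self.memory.get(address, 0)

    def drain_one(self):
        # Perform the oldest pending write to memory.
        address, value = self.pending.popitem(last=False)
        self.memory[address] = value
```

In this sketch, two writes to the same address leave one buffer entry holding the newer value, and a read of that address returns it before memory has been updated at all.&lt;br /&gt;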
&lt;br /&gt;
===The Coherence Problem===&lt;br /&gt;
&lt;br /&gt;
Consider the case where a write (ST_A) has been issued by processor I into the WB_I, and the write is waiting to be executed. Since the write has not yet been performed to the cache or the main memory, the other processors do not have any knowledge about the changes made to address A. As a result, a read operation by another processor from address A will take the value from either its own cache, write buffer, or the main memory (depending on whether it hits or misses in the cache). It is the job of the designer to employ protocols that will take care that the sequential ordering of the instructions is maintained, and that the writes made to any one of the caches are propagated and updated in all of the processor caches. &lt;br /&gt;
&lt;br /&gt;
==Sequential Consistency==&lt;br /&gt;
&lt;br /&gt;
When operating as part of a multiprocessor, it is not enough to check dependencies between writes only at the local level. Because data is shared and events on one processor may affect the outcomes on another, the sequence of writes and the data dependencies must be preserved across all processors. &lt;br /&gt;
“A system is said to be sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.”&lt;br /&gt;
There are different approaches to this requirement, based on the expected outcome of the program. The conditions of sequential consistency and logical ordering may be relaxed according to the requirements of the program being parallelized. &lt;br /&gt;
&lt;br /&gt;
===Strong Ordering===&lt;br /&gt;
The requirements for strong ordering are as follows:&lt;br /&gt;
 &lt;br /&gt;
1) All memory operations appear to execute one at a time.&lt;br /&gt;
&lt;br /&gt;
2) All memory operations from a single CPU appear to execute in-order.&lt;br /&gt;
&lt;br /&gt;
3) All memory operations from different processors are “cleanly” interleaved with each other (serialization)&lt;br /&gt;
&lt;br /&gt;
===Total Store Ordering===&lt;br /&gt;
Requirements are as follows:&lt;br /&gt;
&lt;br /&gt;
1)	Relaxed consistency: stores must complete in order, but a store need not complete before a later read of a given location takes place.&lt;br /&gt;
&lt;br /&gt;
2)	Reads may bypass pending writes, but writes must exit the store buffer in FIFO order.&lt;br /&gt;
&lt;br /&gt;
===Partial Store Ordering===&lt;br /&gt;
&lt;br /&gt;
Requirement is as follows:&lt;br /&gt;
&lt;br /&gt;
Even more relaxed consistency: stores to any given memory location complete in order, but stores to different locations may complete out of order, and stores need not complete before a read of a given location takes place.&lt;br /&gt;
&lt;br /&gt;
===Weak Ordering===&lt;br /&gt;
Requirement is as follows:&lt;br /&gt;
&lt;br /&gt;
Very relaxed consistency: any ordering is allowed except at barrier synchronization points. Global memory state must be completely settled at each synchronization point, and between synchronization points the memory state may correspond to any ordering of reads and writes.&lt;br /&gt;
&lt;br /&gt;
===Example===&lt;br /&gt;
Example for sequential consistency:&lt;br /&gt;
In the following program, the variables “a”, “b”, “flag1” and “flag2” are initialized to 0 and are shared by both processors (CPU 1 and CPU 2).&lt;br /&gt;
   &lt;br /&gt;
    a = b = flag1 = flag2 = 0;		// initial value&lt;br /&gt;
    CPU 1				CPU 2 &lt;br /&gt;
    flag1 = 1;				flag2 = 1;&lt;br /&gt;
    a = 1;				a = 2; &lt;br /&gt;
    r1 = a;				r3 = a;&lt;br /&gt;
    r2 = flag2;				r4 = flag1;&lt;br /&gt;
&lt;br /&gt;
The SPARC V8 architecture follows the Total Store Ordering model and allows a write followed by a read to complete out of program order. One possible result is r1 = 1, r3 = 2, r2 = r4 = 0.&lt;br /&gt;
Strong ordering, in contrast, enforces strict atomicity, so the results depend only on how the two processors’ instructions are interleaved, and r2 = r4 = 0 cannot occur.&lt;br /&gt;
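&lt;br /&gt;
How Total Store Ordering reaches r1 = 1, r3 = 2, r2 = r4 = 0 can be sketched with a per-processor FIFO store buffer. This is a simplified model and not the SPARC V8 definition itself: the class TsoCpu, its methods and the chosen schedule are assumptions made for illustration. A load first looks in its own processor's buffer and otherwise reads memory, so it can complete before that processor's earlier stores become visible to the other processor.&lt;br /&gt;

```python
from collections import deque


class TsoCpu:
    """Hypothetical processor with a FIFO store buffer in front of shared memory."""

    def __init__(self, memory):
        self.memory = memory
        self.buffer = deque()   # pending (variable, value) stores, oldest first
        self.regs = {}

    def store(self, var, value):
        # Stores are queued; they are not yet visible to the other processor.
        self.buffer.append((var, value))

    def load(self, reg, var):
        # A load sees the newest matching store in its own buffer, if any.
        for pending_var, pending_value in reversed(self.buffer):
            if pending_var == var:
                self.regs[reg] = pending_value
                return
        self.regs[reg] = self.memory[var]

    def drain_one(self):
        # Stores leave the buffer strictly in FIFO order.
        var, value = self.buffer.popleft()
        self.memory[var] = value


def run_example():
    memory = {"a": 0, "b": 0, "flag1": 0, "flag2": 0}
    cpu1, cpu2 = TsoCpu(memory), TsoCpu(memory)
    # Both processors issue their stores, which stay buffered.
    cpu1.store("flag1", 1)
    cpu1.store("a", 1)
    cpu2.store("flag2", 1)
    cpu2.store("a", 2)
    # The loads complete while all stores are still pending.
    cpu1.load("r1", "a")
    cpu1.load("r2", "flag2")
    cpu2.load("r3", "a")
    cpu2.load("r4", "flag1")
    # Only now do the store buffers drain to memory.
    while cpu1.buffer:
        cpu1.drain_one()
    while cpu2.buffer:
        cpu2.drain_one()
    return cpu1.regs, cpu2.regs, memory
```

With this schedule each processor reads its own buffered value of a but still sees the other processor's flag as 0, because that store has not yet left the other store buffer.&lt;br /&gt;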
&lt;br /&gt;
===Effects on Write Buffer Operation===&lt;br /&gt;
&lt;br /&gt;
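The ordering model determines how a write buffer may behave. Under strong ordering, a read must not bypass a pending write, so earlier writes have to leave the buffer before a later read completes. Total store ordering allows a FIFO write buffer whose pending writes may be bypassed by reads. Partial store ordering additionally lets writes to different locations leave the buffer out of order. Weak ordering only requires the buffer to be drained at synchronization points.&lt;br /&gt;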
&lt;br /&gt;
==Cache Coherence Models==&lt;br /&gt;
&lt;br /&gt;
The two prominent models for maintaining cache coherence are the snoopy-bus protocol and the directory-based coherence protocol. The directory-based protocol is used for distributed memory systems, in which every processor uses directory entries to keep track of what data is being stored. In the snoopy-bus protocol, every processor monitors the reads and writes being serviced by the memory bus. Read and write requests are broadcast on the bus, and each processor responds based on whether or not it holds that data in its cache. This article looks at shared memory systems.&lt;br /&gt;
&lt;br /&gt;
=== Write-Update ===&lt;br /&gt;
&lt;br /&gt;
In this approach, a write request is broadcast to all the processors, and each processor updates its local cache with the new value of the data. If a read misses in a processor's local cache, it can be serviced from any other cache that holds the block, because every copy has been updated. This saves bus bandwidth on writes, as only one broadcast is needed to bring all the caches up to date with the newest value of the data.&lt;br /&gt;
&lt;br /&gt;
===Write-Invalidate===&lt;br /&gt;
This approach is common in systems where a data block is written by one processor but read by multiple processors. Whenever a write is made to shared data, an invalidate signal is sent to all the caches, asking them to mark that data block invalid because their copies no longer hold the latest value. This may cause additional traffic on the bus; as a result, a separate bus for invalidation requests may be included in the design.&lt;br /&gt;
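&lt;br /&gt;
The difference between the two write policies can be sketched with a toy bus model. Everything here is an assumption made for illustration: the class SnoopyBus and its methods are hypothetical, each cache is a dictionary, memory is updated on every write (write-through) to keep the sketch short, and unwritten locations read as 0.&lt;br /&gt;

```python
class SnoopyBus:
    """Hypothetical bus connecting private caches (dicts) to one shared memory."""

    def __init__(self, num_caches):
        self.memory = {}
        self.caches = [dict() for _ in range(num_caches)]

    def read(self, cpu, address):
        cache = self.caches[cpu]
        if address not in cache:
            # Miss: fetch the block from memory into the local cache.
            cache[address] = self.memory.get(address, 0)
        return cache[address]

    def write_update(self, cpu, address, value):
        # One broadcast: every cache that holds the block takes the new value.
        self.memory[address] = value
        self.caches[cpu][address] = value
        for other in self.caches:
            if address in other:
                other[address] = value

    def write_invalidate(self, cpu, address, value):
        # One broadcast: every other cache drops its now stale copy.
        self.memory[address] = value
        for index, other in enumerate(self.caches):
            if index != cpu:
                other.pop(address, None)
        self.caches[cpu][address] = value
```

After write_update the other cache still holds the block with the new value. After write_invalidate the other cache has lost the block, and its next read misses and fetches the new value again.&lt;br /&gt;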
&lt;br /&gt;
== Coherence in Write Buffers ==&lt;br /&gt;
&lt;br /&gt;
===Software-Based Coherence===&lt;br /&gt;
The software technique relies on the compiler to ensure that no dependencies exist between the STORE/LOAD accesses that different processors make to the shared memory. Based on indications from the programmer, the compiler makes sure that there are no incoherent accesses to shared memory. It is also possible to have a READ/WRITE buffer shared between all processors for accessing the main memory. &lt;br /&gt;
&lt;br /&gt;
===Hardware-Based Coherence===&lt;br /&gt;
In hardware-based protocols, accesses to the shared memory are communicated using hardware invalidate signals that are broadcast to all the processor memories. Hardware support may also be required for a LOAD to be broadcast to all the caches, so that a read can be performed directly from a remote memory. The following approaches maintain coherence in write buffers and caches for multiprocessor systems:&lt;br /&gt;
&lt;br /&gt;
====Unique Buffer per Processor==== &lt;br /&gt;
[[Image:Unique Buffer Per Processor.png|thumb|right|250px|Figure 3: Cache based system with unique buffer per processor &lt;br /&gt;
[[http://mprc.pku.cn/mentors/training/ISCAreading/1986/p434-dubois/p434-dubois.pdf]]]] &lt;br /&gt;
&lt;br /&gt;
In this configuration, every processor has its own unique buffer that holds the loads and stores of the local processor as well as the invalidates and loads of remote processors wanting access to the shared data. A store request in the local cache triggers invalidation of that data block in all other caches and main memory. A STORE is said to have been performed when &lt;br /&gt;
&lt;br /&gt;
# the request is serviced in the buffer and the cache is updated on a local hit, and&lt;br /&gt;
# the request is issued to the buffer and an invalidate request has been sent to all other processors sharing the data.&lt;br /&gt;
&lt;br /&gt;
This can be achieved in two ways, depending on the topology of the connection between the caches:&lt;br /&gt;
&lt;br /&gt;
# For a bus-based MP system, the LOADs that miss in the local cache and the STOREs that need invalidation in other caches are broadcast over the bus to all processor caches and memory. The STORE request is then considered performed with respect to all processors when the invalidation signals have been sent out to the private buffers of all the caches. &lt;br /&gt;
# For non-bus MP systems, the caches are connected in a point-to-point topology such as a mesh or a ring. When a store is encountered in one of the processors, the shared data is first locked in the shared memory to ensure atomic access, and then the invalidate signal is propagated along the point-to-point interconnect in a predetermined order. The STORE to shared memory is performed only when the invalidation has been propagated to all the processor caches and buffers. If any processor issues a LOAD to the address being STOREd, the LOAD request has to be rejected or buffered to be serviced later, to maintain atomic access to the shared data block.&lt;br /&gt;
&lt;br /&gt;
====Separate Buffers for Local and Remote Accesses====&lt;br /&gt;
[[Image:Separate Invalidate Buffers.png|thumb|right|250px|Figure 4: Cache based system with separate write-invalidate buffers&lt;br /&gt;
[[http://mprc.pku.cn/mentors/training/ISCAreading/1986/p434-dubois/p434-dubois.pdf]]]] &lt;br /&gt;
&lt;br /&gt;
Another approach to maintaining coherence is for every processor to have separate buffers for local data requests and remote data requests. As seen in the figure, every processor has a Local-buffer which queues the local LOADs and STOREs, and a Remote-buffer (also termed Invalidate buffer) that stores the invalidation requests and LOADs coming in from different processors.&lt;br /&gt;
In a bus-based system, this approach makes it difficult to maintain write atomicity, because different invalidate buffers may hold different numbers of invalidation requests, which makes it uncertain when the data block concerned is actually invalidated in each cache. In non-bus-based systems, however, this approach succeeds in maintaining strong coherence, provided the invalidate signal makes sure that the data is invalidated at a cache before moving on to the next processor (rather than just pushing the invalidate request into the buffer and moving on).&lt;br /&gt;
&lt;br /&gt;
====Universal read/write Buffer====&lt;br /&gt;
&lt;br /&gt;
[[Image:Universal Read Write Buffer.png|thumb|right|300px|Figure 5: Universal Read/Write Buffer [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/6b_dm#cite_note-4 5]]]&lt;br /&gt;
&lt;br /&gt;
In this approach, a shared bus controller resides between the local bus and the memory bus, where the local bus is connected to all private caches of the processors (illustrated in Fig 5). This technique supports multiple processors with the same or different write-invalidate coherence protocols, such as MSI, MESI and MOESI. The shared bus controller consists of the local bus controller, the data buffer and the address buffer. Each private cache provides two tri-state signal bits to the local bus – '''INTERVENE''' and '''SHARE'''.&lt;br /&gt;
&lt;br /&gt;
# INTERVENE is asserted when a cache wants to provide valid data to other caches.&lt;br /&gt;
# SHARE is asserted by a cache controller when there is an address match with one of its own tag addresses.&lt;br /&gt;
&lt;br /&gt;
=====''' Data Buffer '''=====&lt;br /&gt;
&lt;br /&gt;
The data buffer consists of three FIFOs:&lt;br /&gt;
&lt;br /&gt;
# The ''non_block write-FIFO'' receives non-cacheable memory access data from the CPU. These requests do not require any cache eviction or line-fill cycles. &lt;br /&gt;
# The ''write-back FIFO'' is used for eviction cycles. If there is a cache miss and an eviction is required, the missed data is read from memory into the cache (line-fill cycle) just after the evicted data has been moved to the FIFO. The contents of this FIFO are written back to memory later. The CPU thus gets the new data while the eviction time is hidden from it. &lt;br /&gt;
# The ''read-FIFO'' stores data from the non_block write FIFO, the write-back FIFO or memory. Data is held temporarily in this FIFO until the local bus is clear and ready for new data.&lt;br /&gt;
&lt;br /&gt;
=====''' Address Buffer '''=====&lt;br /&gt;
&lt;br /&gt;
The address buffer also consists of three FIFOs: &lt;br /&gt;
&lt;br /&gt;
# ''Non_block write FIFO'' - stores the addresses of non-blocking accesses. The depth of this FIFO should be the same as that of the corresponding Data Buffer FIFO.&lt;br /&gt;
# ''Write-back FIFO'' - stores the starting address of an eviction cycle and holds the address until the memory bus is free.&lt;br /&gt;
# ''Line-fill FIFO'' - stores the starting address of a line-fill cycle.&lt;br /&gt;
&lt;br /&gt;
Snoop logic is started whenever there is a read cycle to main memory. It compares the read request address with the entries of both the write-back FIFO and the line-fill FIFO. If there is a match, the HIT flag is set and the CPU gets the data immediately through the internal bypass path. As a result, stale data is not read and the CPU does not have to wait out the memory latency.&lt;br /&gt;
&lt;br /&gt;
For a non-block write cycle, the snoop logic compares the newly requested address and Byte Enable bits with those previously stored in the non_block write FIFO. If there is an address match but no Byte Enable overlap, the BGO signal is asserted and the pointer of the non_block write FIFO does not move forward, which prevents the FIFO from filling up quickly.&lt;br /&gt;
&lt;br /&gt;
In this approach, read cycles have priority over write cycles, so whenever the memory bus is available, a read address gets direct memory bus access.&lt;br /&gt;
&lt;br /&gt;
=====''' Local Bus Controller '''=====&lt;br /&gt;
&lt;br /&gt;
The controller monitors the local bus for the status bits (M, O, E, S, I) and the INTERVENE# and SHARE# bits, and determines whether a cycle needs to access main memory or a private cache. For a main memory access, the controller informs the buffer using the WB (Write Back), LF (Line Fill) or O3 (owned) status signal.&lt;br /&gt;
&lt;br /&gt;
Algorithm to generate the controller signals:&lt;br /&gt;
&lt;br /&gt;
For the MSI and MESI protocols: &lt;br /&gt;
&lt;br /&gt;
If there is a read miss on the local bus -&lt;br /&gt;
::INTERVENE is asserted.&lt;br /&gt;
If there is a write miss with M status on the local bus -&lt;br /&gt;
::Write back cycle will be performed&lt;br /&gt;
::Generate WB signal.&lt;br /&gt;
If there is a read miss or write miss under any of the protocols -&lt;br /&gt;
::LF cycle will be performed&lt;br /&gt;
::Generate LF signal&lt;br /&gt;
&lt;br /&gt;
For the MOESI protocol:&lt;br /&gt;
&lt;br /&gt;
If there is a read miss on the local bus&lt;br /&gt;
::INTERVENE is asserted&lt;br /&gt;
&lt;br /&gt;
The above outputs can be written in Boolean equations as shown in the example&amp;lt;ref&amp;gt;http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=00342629&amp;lt;/ref&amp;gt; below:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
      WB = RD * MOESI * INTV + WR * [STATUS=M]&lt;br /&gt;
      LF =  INTV&lt;br /&gt;
      O3 = RD * MOESI * INTV&lt;br /&gt;
      MOESI means the cycle is initiated by a cache using the MOESI protocol.&lt;br /&gt;
      INTV is the INTERVENE signal, which is asserted when a cache provides valid data to another cache.&lt;br /&gt;
&lt;br /&gt;
=====''' Algorithms '''=====&lt;br /&gt;
&lt;br /&gt;
The algorithms followed by each of the processors in a multiprocessor system are given below. The processors may follow different protocols (MSI, MESI or MOESI), but the following algorithms ensure that memory accesses are carried out coherently. &lt;br /&gt;
&lt;br /&gt;
''MOESI''    The algorithm followed for the MOESI protocol is as follows:&lt;br /&gt;
&lt;br /&gt;
#	A cache holding the line in the M state provides the data to another cache when a read initiated by that cache hits the line, and changes its state from M to O.&lt;br /&gt;
#	If the same data is hit again, the cache in the O state is responsible for providing the data to the requesting cache.&lt;br /&gt;
#	A write cycle needs a line fill if there is a cache miss and INTERVENE is not asserted. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
''MSI'' and ''MESI''	The algorithms for the MSI and MESI protocols are as follows. (The MSI and MESI protocols have no O state.)&lt;br /&gt;
&lt;br /&gt;
#	A cache in the M state under MSI or MESI provides the data to the other cache and to main memory, and changes its state from M to S.&lt;br /&gt;
#	The local bus controller is responsible for changing the state from S to O for a MOESI cache. &lt;br /&gt;
#	The M state of an MSI or MESI cache still changes to S. &lt;br /&gt;
#	If a write cycle is initiated by another CPU (MSI or MESI),&lt;br /&gt;
::	If a MOESI cache has a valid line with M or O status&lt;br /&gt;
:::  	It will send the data line out onto the local bus and change its state to I.&lt;br /&gt;
:::	[Because the line will be written by the other cache, making this copy outdated]&lt;br /&gt;
&lt;br /&gt;
:5.	If the status is E or S, main memory will provide the data and no cache is involved in the data transfer. However, the cache with the MOESI protocol will be the owner of that particular line.&lt;br /&gt;
&lt;br /&gt;
:6.	Force the status bit to be O instead of S.&lt;br /&gt;
&lt;br /&gt;
=='''References'''==&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;br /&gt;
[http://mprc.pku.cn/mentors/training/ISCAreading/1986/p434-dubois/p434-dubois.pdf Memory Access Buffering In Multiprocessors] Michel Dubois, Christoph Scheurich, Faye Briggs&lt;br /&gt;
&lt;br /&gt;
[http://www.ece.cmu.edu/~ece548/handouts/19coher.pdf Multiprocessor Consistency and Coherence] Memory System Architecture, Philip Koopman&lt;br /&gt;
&lt;br /&gt;
[http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=528915 Write buffer design for cache-coherent shared-memory multiprocessors], Fernaz Mounes-Toussi, David J. Lilja&lt;/div&gt;</summary>
		<author><name>Sbasu3</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/6b_am&amp;diff=59300</id>
		<title>CSC/ECE 506 Spring 2012/6b am</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/6b_am&amp;diff=59300"/>
		<updated>2012-03-04T15:25:29Z</updated>

		<summary type="html">&lt;p&gt;Sbasu3: /*  Local Bus Controller  */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Introduction ==&lt;br /&gt;
&lt;br /&gt;
With processor speeds increasing at a much faster rate than memory speeds, data transactions between the processor and the memory system must be managed so that slow memory does not limit processor performance. A read requires the processor to wait for the operation to complete before resuming execution, but a write has no such requirement. This is where a write buffer (WB) comes into the picture: the processor hands the write to the buffer and continues executing, while the write buffer takes complete responsibility for carrying out the write. &lt;br /&gt;
&lt;br /&gt;
===Write Buffers in Uni-processors===&lt;br /&gt;
&lt;br /&gt;
A write to be performed is put in a buffer implemented as a FIFO queue, so that the writes are performed in the order in which they were issued. In a uni-processor model that extracts Instruction Level Parallelism (ILP), writes may also be issued out of order, provided that hardware or software mechanisms check the writes for any dependences in the instruction stream. The following figure shows a cache-based single-processor system with a write buffer. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:Uniprocessor With WB.png|thumb|center|400px|Figure 1.Uni-processor cache based system with write buffer ]]&lt;br /&gt;
&lt;br /&gt;
===Write Buffer Issues in Multiprocessors===&lt;br /&gt;
&lt;br /&gt;
In a multiprocessor system, the same design can be extended, as shown in the figure. In this design, each processor will have its own private cache and a write buffer corresponding to the cache. The caches are connected by the means of an interconnect with each other as well as with the main memory. &lt;br /&gt;
 &lt;br /&gt;
[[Image:Multiprocessor with WB.png|thumb|center|400px|Figure 2. Cache- based multiprocessor system with write buffer ]] &lt;br /&gt;
&lt;br /&gt;
As can be seen in the figure above, each processor performs a write by pushing it into its write buffer, and the write buffer then completes the write to the cache or main memory. If the processor issues another write to the same address while the earlier write is still buffered, the earlier value is overwritten with the new one in the write buffer. Similarly, if the processor issues a read to an address that has a buffered write, it reads the value from the write buffer rather than from the cache or main memory.&lt;br /&gt;
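&lt;br /&gt;
The merging and forwarding behaviour described above can be sketched in a few lines of Python. This is a minimal illustration rather than the design of any particular machine: the class name WriteBuffer, its method names, and the dictionary that stands in for the cache and main memory are all assumptions made for the example.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import OrderedDict


class WriteBuffer:
    """Toy FIFO write buffer with write merging and read forwarding."""

    def __init__(self, memory):
        self.memory = memory          # backing store: dict of address to value
        self.pending = OrderedDict()  # buffered writes, oldest first

    def write(self, addr, value):
        # A second write to a buffered address overwrites the pending value
        # in place, so the entry keeps its original position in the queue.
        self.pending[addr] = value

    def read(self, addr):
        # Forward from the buffer when the address has a pending write.
        if addr in self.pending:
            return self.pending[addr]
        return self.memory.get(addr, 0)

    def drain_one(self):
        # Retire the oldest pending write to memory.
        if self.pending:
            addr, value = self.pending.popitem(last=False)
            self.memory[addr] = value


memory = {}
wb = WriteBuffer(memory)
wb.write(0x10, 1)
wb.write(0x20, 7)
wb.write(0x10, 2)            # merges with the earlier write to 0x10
forwarded = wb.read(0x10)    # served from the buffer, memory is still stale
wb.drain_one()               # retires the entry for 0x10 first
```
&lt;br /&gt;
Here the read of address 0x10 is forwarded from the buffer before anything has reached memory, which is exactly the window in which other processors cannot yet see the new value.&lt;br /&gt;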
&lt;br /&gt;
===The Coherence Problem===&lt;br /&gt;
&lt;br /&gt;
Consider the case where a write (ST_A) has been issued by processor I into WB_I and is waiting to be performed. Since the write has not yet reached the cache or main memory, the other processors have no knowledge of the change made to address A. As a result, a read of address A by another processor takes its value from that processor's own cache or write buffer, or from main memory (depending on whether it hits or misses in the cache), and so returns a stale value. It is the designer's job to employ protocols that maintain the sequential ordering of the instructions and that propagate writes made in any one cache to all the other processor caches. &lt;br /&gt;
&lt;br /&gt;
==Sequential Consistency==&lt;br /&gt;
&lt;br /&gt;
When a processor operates as part of a multiprocessor, it is not enough to check dependencies between writes only at the local level. Because data is shared and events in one processor may affect outcomes in another, the order of writes and the data dependencies must be preserved across all the processors. &lt;br /&gt;
“A system is said to be sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.”&lt;br /&gt;
There are different approaches to this requirement, based on the expected outcome of the program. The conditions of sequential consistency and logical ordering may be relaxed according to the requirements of the program being parallelized. &lt;br /&gt;
&lt;br /&gt;
===Strong Ordering===&lt;br /&gt;
The requirements for strong ordering are as follows:&lt;br /&gt;
 &lt;br /&gt;
1) All memory operations appear to execute one at a time.&lt;br /&gt;
&lt;br /&gt;
2) All memory operations from a single CPU appear to execute in-order.&lt;br /&gt;
&lt;br /&gt;
3) All memory operations from different processors are “cleanly” interleaved with each other (serialization)&lt;br /&gt;
&lt;br /&gt;
===Total Store Ordering===&lt;br /&gt;
Requirements are as follows:&lt;br /&gt;
&lt;br /&gt;
1)	Relaxed consistency in which stores must complete in order, but a store need not complete before a later read to a given location takes place.&lt;br /&gt;
&lt;br /&gt;
2)	Reads are allowed to bypass pending writes, but writes MUST exit the store buffer in FIFO order.&lt;br /&gt;
&lt;br /&gt;
===Partial Store Ordering===&lt;br /&gt;
&lt;br /&gt;
Requirement is as follows:&lt;br /&gt;
&lt;br /&gt;
Even more relaxed consistency, in which stores to any given memory location complete in order, but stores to different locations may complete out of order, and stores need not complete before a read to a given location takes place.&lt;br /&gt;
&lt;br /&gt;
===Weak Ordering===&lt;br /&gt;
Requirement is as follows:&lt;br /&gt;
&lt;br /&gt;
Really relaxed consistency where anything goes except at barrier synchronization points: global memory state must be completely settled at each synchronization point, and between synchronization points the memory state may correspond to any ordering of reads and writes.&lt;br /&gt;
&lt;br /&gt;
===Example===&lt;br /&gt;
Examples for Sequential Consistency:&lt;br /&gt;
In the following program, the variables “a”, “b”, “flag1” and “flag2” are initialized to 0 and are shared by both processors (CPU1 and CPU2).&lt;br /&gt;
   &lt;br /&gt;
    a = b = flag1 = flag2 = 0;		// initial value&lt;br /&gt;
    CPU1				CPU 2 &lt;br /&gt;
    flag1 = 1;				flag2 = 1;&lt;br /&gt;
    a = 1;				a = 2; &lt;br /&gt;
    r1 = a;				r3 = a;&lt;br /&gt;
    r2 = flag2;				r4 = flag1;&lt;br /&gt;
&lt;br /&gt;
The SPARC V8 architecture follows the Total Store Ordering model, which allows a write followed by a read to complete out of program order. One possible result is therefore r1 = 1, r3 = 2, r2 = r4 = 0.&lt;br /&gt;
Strong ordering, in contrast, enforces strict atomicity, so the result depends only on how the two processors’ instructions are interleaved, and r2 = r4 = 0 cannot occur: whichever flag read comes last must follow the other processor’s flag write.&lt;br /&gt;
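&lt;br /&gt;
The TSO outcome above can be reproduced with a toy model in which each CPU has a FIFO store buffer and a load checks the CPU's own buffer before shared memory. The sketch below is a simplified assumption, not a model of the SPARC V8 hardware: the Cpu class and its method names are hypothetical, and the schedule shown is just one legal interleaving.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import deque


class Cpu:
    """Hypothetical CPU model with a FIFO store buffer (TSO-like)."""

    def __init__(self, memory):
        self.memory = memory
        self.store_buffer = deque()   # (name, value) pairs, oldest first

    def store(self, name, value):
        # Stores are buffered and are not yet visible to other CPUs.
        self.store_buffer.append((name, value))

    def load(self, name):
        # Newest matching buffered store wins; otherwise read shared memory.
        for buffered_name, value in reversed(self.store_buffer):
            if buffered_name == name:
                return value
        return self.memory[name]

    def drain(self):
        # Stores leave the buffer in FIFO order.
        while self.store_buffer:
            name, value = self.store_buffer.popleft()
            self.memory[name] = value


memory = {"a": 0, "b": 0, "flag1": 0, "flag2": 0}
cpu1, cpu2 = Cpu(memory), Cpu(memory)

# Both CPUs buffer their stores before either buffer drains.
cpu1.store("flag1", 1)
cpu1.store("a", 1)
cpu2.store("flag2", 1)
cpu2.store("a", 2)

r1 = cpu1.load("a")        # forwarded from CPU1's own buffer
r2 = cpu1.load("flag2")    # CPU2's store is not yet visible
r3 = cpu2.load("a")        # forwarded from CPU2's own buffer
r4 = cpu2.load("flag1")    # CPU1's store is not yet visible

cpu1.drain()
cpu2.drain()
print(r1, r2, r3, r4)
```
&lt;br /&gt;
With this schedule the program prints 1 0 2 0, that is r1 = 1, r2 = 0, r3 = 2, r4 = 0. Both flag reads return 0 because each read bypasses the other CPU's still-buffered store.&lt;br /&gt;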
&lt;br /&gt;
===Effects on Write Buffer Operation===&lt;br /&gt;
&lt;br /&gt;
// Explain here what ordering has to do with the write buffer coherency&lt;br /&gt;
&lt;br /&gt;
==Cache Coherence Models==&lt;br /&gt;
&lt;br /&gt;
The two prominent models for maintaining cache coherence are the snoopy-bus protocol and the directory-based coherence protocol. The directory-based protocol is used for distributed memory systems, in which a directory entry is used to keep track of what data is stored in each processor's cache. In the snoopy-bus protocol, every processor monitors the reads and writes that are being serviced by the memory bus. Read and write requests are broadcast on the bus, and each processor responds based on whether or not it holds that data in its cache. We will be looking at shared memory systems in this article.&lt;br /&gt;
&lt;br /&gt;
=== Write-Update ===&lt;br /&gt;
&lt;br /&gt;
In this approach, a write request is broadcast to all the processors, and each processor updates its local cache with the new value of the data. Even if a read misses in a processor's local cache, it can be served from any other processor's cache, as the copy has been updated in all caches. In terms of writes this saves bus bandwidth, as a single broadcast is enough to bring all the caches up to date with the newest value of the data.&lt;br /&gt;
&lt;br /&gt;
===Write-Invalidate===&lt;br /&gt;
This approach is common in systems where a data block is written by one processor but read by multiple processors. Whenever a write is made to shared data, an invalidate signal is sent to all the caches, asking them to mark that data block invalid because their copy no longer holds the latest value. This may cause additional traffic on the bus; as a result, a separate bus for invalidation requests may be included in the design.&lt;br /&gt;
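&lt;br /&gt;
The difference between the two policies can be seen in a small sketch. Each cache is modelled as a plain dictionary, and the function names write_update and write_invalidate are hypothetical. The sketch also assumes that only caches that already hold the block (plus the writer) take part in an update, which is a simplification of the broadcast described above.&lt;br /&gt;
&lt;br /&gt;
```python
def write_update(caches, writer, addr, value):
    # Every cache that holds the block receives the new value.
    for index, cache in enumerate(caches):
        if index == writer or addr in cache:
            cache[addr] = value


def write_invalidate(caches, writer, addr, value):
    # Other copies are dropped; only the writer keeps a valid copy.
    for index, cache in enumerate(caches):
        if index == writer:
            cache[addr] = value
        else:
            cache.pop(addr, None)


update_caches = [{0x40: 5}, {0x40: 5}, {}]
write_update(update_caches, writer=0, addr=0x40, value=9)

invalidate_caches = [{0x40: 5}, {0x40: 5}, {}]
write_invalidate(invalidate_caches, writer=0, addr=0x40, value=9)
```
&lt;br /&gt;
After the update, both sharers hold the new value, so a later read by the second processor hits locally. After the invalidate, the second processor has lost its copy and its next read to that block will miss.&lt;br /&gt;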
&lt;br /&gt;
== Coherence in Write Buffers ==&lt;br /&gt;
&lt;br /&gt;
===Software-Based Coherence===&lt;br /&gt;
The software technique relies on the compiler to ensure that no dependencies exist between the STORE/LOAD accesses carried out by different processors on the shared memory. Based on indications from the programmer, the compiler makes sure that there are no incoherent accesses to shared memory. It is also possible to have a READ/WRITE buffer shared between all processors for access to main memory. &lt;br /&gt;
&lt;br /&gt;
===Hardware-Based Coherence===&lt;br /&gt;
In hardware-based protocols, accesses to the shared memory are communicated using hardware invalidate signals that are broadcast to all the processor memories. Hardware support may also be required for a LOAD to be broadcast to all the caches, so that a read can be performed directly from a remote memory. The following approaches maintain coherence in write buffers and caches for multiprocessor systems:&lt;br /&gt;
&lt;br /&gt;
====Unique Buffer per Processor==== &lt;br /&gt;
[[Image:Unique Buffer Per Processor.png|thumb|right|250px|Figure 3: Cache based system with unique buffer per processor &lt;br /&gt;
[[http://mprc.pku.cn/mentors/training/ISCAreading/1986/p434-dubois/p434-dubois.pdf]]]] &lt;br /&gt;
&lt;br /&gt;
In this configuration, every processor has its own unique buffer that holds the loads and stores of the local processor as well as the invalidates and loads of remote processors wanting access to the shared data. A store request in the local cache causes that data block to be invalidated in all other caches and in main memory. A STORE is said to have been performed when &lt;br /&gt;
&lt;br /&gt;
# the request is serviced in the buffer and the cache is updated on a local hit, and&lt;br /&gt;
# the request is issued to the buffer and an invalidate request has been sent to all other processors sharing the data.&lt;br /&gt;
&lt;br /&gt;
This can be achieved in two ways depending on the topology of the connection between the caches -&lt;br /&gt;
&lt;br /&gt;
# For a bus-based MP system, the LOADs that miss in the local cache and the STOREs that need invalidation in other caches are broadcast over the bus to all processor caches and memory. The STORE request is then considered performed with respect to all processors when the invalidation signals have been sent out to the private buffers of all the caches. &lt;br /&gt;
# For non-bus MP systems, the caches are connected in point-to-point mesh-based or interconnect ring-based topology. When a store is encountered in one of the processors, the shared data is first locked in the shared memory to ensure atomic access, and then the invalidate signal is propagated along the point-to-point interconnect in a decided order. The STORE to shared memory is performed only when the invalidation has been propagated to all the processor caches and buffers. If any processor issues a LOAD on the same address as that is being STOREd, then the LOAD request has to be rejected or buffered to be serviced at a later time, to maintain atomic access to the shared data block.&lt;br /&gt;
&lt;br /&gt;
====Separate Buffers for Local and Remote Accesses====&lt;br /&gt;
[[Image:Separate Invalidate Buffers.png|thumb|right|250px|Figure 4: Cache based system with separate write-invalidate buffers&lt;br /&gt;
[[http://mprc.pku.cn/mentors/training/ISCAreading/1986/p434-dubois/p434-dubois.pdf]]]] &lt;br /&gt;
&lt;br /&gt;
Another approach to maintaining coherence is for every processor to have separate buffers for local data requests and remote data requests. As seen in the figure, every processor has a Local-buffer, which queues the local LOADs and STOREs, and a Remote-buffer (also termed Invalidate buffer), which stores the invalidation requests and LOADs coming in from other processors.&lt;br /&gt;
In a bus-based multiprocessor system, this approach makes it difficult to maintain write atomicity, as different invalidate buffers may hold different numbers of invalidation requests, which makes the time at which the data block is actually invalidated in each cache uncertain. In non-bus-based systems, however, this approach succeeds in maintaining strong coherence, provided the invalidate signal makes sure that the data is invalidated at a cache before moving on to the next processor (rather than just pushing the invalidate request into the buffer and moving on).&lt;br /&gt;
&lt;br /&gt;
====Universal read/write Buffer====&lt;br /&gt;
&lt;br /&gt;
[[Image:Universal Read Write Buffer.png|thumb|right|300px|Figure 5: Universal Read/Write Buffer [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/6b_dm#cite_note-4 5]]]&lt;br /&gt;
&lt;br /&gt;
In this approach, a shared bus controller resides between the local bus and the memory bus, where the local bus is connected to all the private caches of the processors (illustrated in Figure 5). This technique supports multiple processors with the same or different write-invalidate coherence protocols, such as MSI, MESI and MOESI. The shared bus controller consists of the local bus controller, the data buffer and the address buffer. Each private cache provides two tri-state signal bits to the local bus: '''INTERVENE''' and '''SHARE'''.&lt;br /&gt;
&lt;br /&gt;
# INTERVENE is asserted when a cache wants to provide valid data to other caches.&lt;br /&gt;
# SHARE is asserted by a cache controller when the address matches one of its own tag addresses.&lt;br /&gt;
&lt;br /&gt;
=====''' Data Buffer '''=====&lt;br /&gt;
&lt;br /&gt;
The data buffer consists of three FIFOs&lt;br /&gt;
&lt;br /&gt;
# The ''non_block write-FIFO'' receives non-cacheable memory access data from the CPU. These requests do not require any cache eviction or line-fill cycles. &lt;br /&gt;
# The ''write-back FIFO'' is used for eviction cycles. If there is a cache miss and an eviction is required, the missed data is read from memory into the cache (line-fill cycle) just after the evicted data has been moved to the FIFO. The contents of this FIFO are written back to memory later on. The CPU thus gets the new data, and the time taken by the eviction is hidden from it. &lt;br /&gt;
# The ''read-FIFO'' is used for storing data from the non_block write FIFO, the write-back FIFO or memory. Data is held temporarily in this FIFO until the local bus is clear and ready for new data.&lt;br /&gt;
&lt;br /&gt;
=====''' Address Buffer '''=====&lt;br /&gt;
&lt;br /&gt;
The address buffer also consists of three FIFOs &lt;br /&gt;
&lt;br /&gt;
# ''Non_block write FIFO'' - stores the addresses of non-blocking accesses. The depth of this FIFO should be the same as that of the corresponding Data Buffer FIFO.&lt;br /&gt;
# ''Write-back FIFO'' - stores the starting address of an eviction cycle and holds the address until the memory bus is free.&lt;br /&gt;
# ''Line-fill FIFO'' - stores the starting address of a line-fill cycle.&lt;br /&gt;
&lt;br /&gt;
Snoop logic starts whenever there is a read cycle to main memory. It compares the read request address with the addresses in both the write-back FIFO and the line-fill FIFO. If there is a match, the HIT flag is set and the CPU gets the data immediately through the internal bypass path. This way, stale data is never read and the CPU does not pay the memory latency.&lt;br /&gt;
&lt;br /&gt;
For a non-block write cycle, the snoop logic compares the newly requested address and Byte Enable bits with the entries already stored in the non_block write FIFO. If the address matches but the Byte Enable bits do not overlap, the BGO signal is asserted and the pointer of the non_block write FIFO does not advance, which keeps the FIFO from filling up quickly.&lt;br /&gt;
&lt;br /&gt;
In this approach, read cycles have priority over write cycles, so whenever the memory bus is available, a pending read address gets direct access to it.&lt;br /&gt;
&lt;br /&gt;
=====''' Local Bus Controller '''=====&lt;br /&gt;
&lt;br /&gt;
The controller monitors the local bus for the status bits (M, O, E, S, I) and the INTERVENE# and SHARE# bits, and determines whether a cycle needs to access main memory or a private cache. For a main memory access, the controller informs the buffer using the WB (Write Back), LF (Line Fill) or O3 (owned) status signal.&lt;br /&gt;
&lt;br /&gt;
Algorithm to generate the controller signals:&lt;br /&gt;
&lt;br /&gt;
For the MSI and MESI protocols: &lt;br /&gt;
&lt;br /&gt;
If there is a read miss on the local bus -&lt;br /&gt;
INTERVENE is asserted.&lt;br /&gt;
If there is a write miss with M status on the local bus -&lt;br /&gt;
  Write back cycle will be performed&lt;br /&gt;
  Generate WB signal.&lt;br /&gt;
If there is a read miss or write miss under any of the protocols -&lt;br /&gt;
  LF cycle will be performed&lt;br /&gt;
  Generate LF signal&lt;br /&gt;
&lt;br /&gt;
For the MOESI protocol:&lt;br /&gt;
&lt;br /&gt;
If there is a read miss on the local bus&lt;br /&gt;
  INTERVENE is asserted&lt;br /&gt;
&lt;br /&gt;
The above outputs can be written in Boolean equations as shown in the example&amp;lt;ref&amp;gt;http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=00342629&amp;lt;/ref&amp;gt; below:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
      WB = RD * MOESI * INTV + WR * [STATUS=M]&lt;br /&gt;
      LF =  INTV&lt;br /&gt;
      O3 = RD * MOESI * INTV&lt;br /&gt;
      MOESI means the cycle is initiated by a cache using the MOESI protocol.&lt;br /&gt;
      INTV is the INTERVENE signal, which is asserted when a cache provides valid data to another cache.&lt;br /&gt;
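&lt;br /&gt;
The equations can be evaluated directly. The sketch below transcribes them exactly as printed above, reading * as AND and + as OR, and names the third output O3 after the owned status bit. The function name controller_signals and its arguments are assumptions made for illustration, not part of the referenced design.&lt;br /&gt;
&lt;br /&gt;
```python
def controller_signals(rd, wr, moesi, intv, status):
    """Evaluate the WB, LF and O3 equations as written in the text."""
    wb = (rd and moesi and intv) or (wr and status == "M")
    lf = intv
    o3 = rd and moesi and intv
    return {"WB": bool(wb), "LF": bool(lf), "O3": bool(o3)}


# Read cycle from a MOESI cache while another cache intervenes.
read_case = controller_signals(
    rd=True, wr=False, moesi=True, intv=True, status="S")

# Write cycle to a line held in the M state, with no intervention.
write_case = controller_signals(
    rd=False, wr=True, moesi=False, intv=False, status="M")
```
&lt;br /&gt;
In the first case all three signals are asserted. In the second case only WB is asserted, which matches the rule that a write miss with M status triggers a write-back cycle.&lt;br /&gt;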
&lt;br /&gt;
=====''' Algorithms '''=====&lt;br /&gt;
&lt;br /&gt;
The algorithms followed by each of the processors in a multiprocessor system are given below. The processors may follow different protocols (MSI, MESI or MOESI), but the following algorithms ensure that memory accesses are carried out coherently. &lt;br /&gt;
&lt;br /&gt;
''MOESI''    The algorithm followed for the MOESI protocol is as follows:&lt;br /&gt;
&lt;br /&gt;
#	A cache holding the line in the M state provides the data to another cache when a read initiated by that cache hits the line, and changes its state from M to O.&lt;br /&gt;
#	If the same data is hit again, the cache in the O state is responsible for providing the data to the requesting cache.&lt;br /&gt;
#	A write cycle needs a line fill if there is a cache miss and INTERVENE is not asserted. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
''MSI'' and ''MESI''	The algorithms for the MSI and MESI protocols are as follows. (The MSI and MESI protocols have no O state.)&lt;br /&gt;
&lt;br /&gt;
#	A cache in the M state under MSI or MESI provides the data to the other cache and to main memory, and changes its state from M to S.&lt;br /&gt;
#	The local bus controller is responsible for changing the state from S to O for a MOESI cache. &lt;br /&gt;
#	The M state of an MSI or MESI cache still changes to S. &lt;br /&gt;
#	If a write cycle is initiated by another CPU (MSI or MESI),&lt;br /&gt;
::	If a MOESI cache has a valid line with M or O status&lt;br /&gt;
:::  	It will send the data line out onto the local bus and change its state to I.&lt;br /&gt;
:::	[Because the line will be written by the other cache, making this copy outdated]&lt;br /&gt;
&lt;br /&gt;
:5.	If the status is E or S, main memory will provide the data and no cache is involved in the data transfer. However, the cache with the MOESI protocol will be the owner of that particular line.&lt;br /&gt;
&lt;br /&gt;
:6.	Force the status bit to be O instead of S.&lt;br /&gt;
&lt;br /&gt;
=='''References'''==&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;br /&gt;
[http://mprc.pku.cn/mentors/training/ISCAreading/1986/p434-dubois/p434-dubois.pdf Memory Access Buffering In Multiprocessors] Michel Dubois, Christoph Scheurich, Faye Briggs&lt;br /&gt;
&lt;br /&gt;
[http://www.ece.cmu.edu/~ece548/handouts/19coher.pdf Multiprocessor Consistency and Coherence] Memory System Architecture, Philip Koopman&lt;br /&gt;
&lt;br /&gt;
[http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=528915 Write buffer design for cache-coherent shared-memory multiprocessors], Fernaz Mounes-Toussi, David J. Lilja&lt;/div&gt;</summary>
		<author><name>Sbasu3</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/6b_am&amp;diff=59299</id>
		<title>CSC/ECE 506 Spring 2012/6b am</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/6b_am&amp;diff=59299"/>
		<updated>2012-03-04T15:24:06Z</updated>

		<summary type="html">&lt;p&gt;Sbasu3: /*  Local Bus Controller  */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Introduction ==&lt;br /&gt;
&lt;br /&gt;
With processor speeds increasing at a much faster rate than memory speeds, data transactions between the processor and the memory system must be managed so that slow memory does not limit processor performance. A read requires the processor to wait for the operation to complete before resuming execution, but a write has no such requirement. This is where a write buffer (WB) comes into the picture: the processor hands the write to the buffer and continues executing, while the write buffer takes complete responsibility for carrying out the write. &lt;br /&gt;
&lt;br /&gt;
===Write Buffers in Uni-processors===&lt;br /&gt;
&lt;br /&gt;
A write to be performed is put in a buffer implemented as a FIFO queue, so that the writes are performed in the order in which they were issued. In a uni-processor model that extracts Instruction Level Parallelism (ILP), writes may also be issued out of order, provided that hardware or software mechanisms check the writes for any dependences in the instruction stream. The following figure shows a cache-based single-processor system with a write buffer. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:Uniprocessor With WB.png|thumb|center|400px|Figure 1.Uni-processor cache based system with write buffer ]]&lt;br /&gt;
&lt;br /&gt;
===Write Buffer Issues in Multiprocessors===&lt;br /&gt;
&lt;br /&gt;
In a multiprocessor system, the same design can be extended, as shown in the figure. In this design, each processor will have its own private cache and a write buffer corresponding to the cache. The caches are connected by the means of an interconnect with each other as well as with the main memory. &lt;br /&gt;
 &lt;br /&gt;
[[Image:Multiprocessor with WB.png|thumb|center|400px|Figure 2. Cache- based multiprocessor system with write buffer ]] &lt;br /&gt;
&lt;br /&gt;
As can be seen in the figure above, each processor performs a write by pushing it into its write buffer, and the write buffer then completes the write to the cache or main memory. If the processor issues another write to the same address while the earlier write is still buffered, the earlier value is overwritten with the new one in the write buffer. Similarly, if the processor issues a read to an address that has a buffered write, it reads the value from the write buffer rather than from the cache or main memory.&lt;br /&gt;
&lt;br /&gt;
===The Coherence Problem===&lt;br /&gt;
&lt;br /&gt;
Consider the case where a write (ST_A) has been issued by processor I into WB_I and is waiting to be performed. Since the write has not yet reached the cache or main memory, the other processors have no knowledge of the change made to address A. As a result, a read of address A by another processor takes its value from that processor's own cache or write buffer, or from main memory (depending on whether it hits or misses in the cache), and so returns a stale value. It is the designer's job to employ protocols that maintain the sequential ordering of the instructions and that propagate writes made in any one cache to all the other processor caches. &lt;br /&gt;
&lt;br /&gt;
==Sequential Consistency==&lt;br /&gt;
&lt;br /&gt;
When a processor operates as part of a multiprocessor, it is not enough to check dependencies between writes only at the local level. Because data is shared and events in one processor may affect outcomes in another, the order of writes and the data dependencies must be preserved across all the processors. &lt;br /&gt;
“A system is said to be sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.”&lt;br /&gt;
There are different approaches to this requirement, based on the expected outcome of the program. The conditions of sequential consistency and logical ordering may be relaxed according to the requirements of the program being parallelized. &lt;br /&gt;
&lt;br /&gt;
===Strong Ordering===&lt;br /&gt;
The requirements for strong ordering are as follows:&lt;br /&gt;
 &lt;br /&gt;
1) All memory operations appear to execute one at a time.&lt;br /&gt;
&lt;br /&gt;
2) All memory operations from a single CPU appear to execute in-order.&lt;br /&gt;
&lt;br /&gt;
3) All memory operations from different processors are “cleanly” interleaved with each other (serialization)&lt;br /&gt;
&lt;br /&gt;
===Total Store Ordering===&lt;br /&gt;
Requirements are as follows:&lt;br /&gt;
&lt;br /&gt;
1)	Relaxed consistency in which stores must complete in order, but a store need not complete before a later read to a given location takes place.&lt;br /&gt;
&lt;br /&gt;
2)	Reads are allowed to bypass pending writes, but writes MUST exit the store buffer in FIFO order.&lt;br /&gt;
&lt;br /&gt;
===Partial Store Ordering===&lt;br /&gt;
&lt;br /&gt;
Requirement is as follows:&lt;br /&gt;
&lt;br /&gt;
Even more relaxed consistency, in which stores to any given memory location complete in order, but stores to different locations may complete out of order, and stores need not complete before a read to a given location takes place.&lt;br /&gt;
&lt;br /&gt;
===Weak Ordering===&lt;br /&gt;
Requirement is as follows:&lt;br /&gt;
&lt;br /&gt;
Really relaxed consistency where anything goes except at barrier synchronization points: global memory state must be completely settled at each synchronization point, and between synchronization points the memory state may correspond to any ordering of reads and writes.&lt;br /&gt;
&lt;br /&gt;
===Example===&lt;br /&gt;
Example for sequential consistency:&lt;br /&gt;
In the following program, the variables “a”, “b”, “flag1” and “flag2” are initialized to 0, and both processors (CPU 1 and CPU 2) share all of them.&lt;br /&gt;
   &lt;br /&gt;
    a = b = flag1 = flag2 = 0;		// initial value&lt;br /&gt;
    CPU 1                   CPU 2&lt;br /&gt;
    flag1 = 1;              flag2 = 1;&lt;br /&gt;
    a = 1;                  a = 2;&lt;br /&gt;
    r1 = a;                 r3 = a;&lt;br /&gt;
    r2 = flag2;             r4 = flag1;&lt;br /&gt;
&lt;br /&gt;
The SPARC V8 architecture follows the Total Store Ordering model, which allows a write followed by a read to complete out of program order. One possible result is therefore r1 = 1, r3 = 2, r2 = r4 = 0.&lt;br /&gt;
Strong ordering rules this result out: whichever flag store comes first in the global order is visible to the other processor’s later flag read, so at least one of r2 and r4 must be 1. The remaining results depend on how the two processors’ instructions interleave.&lt;br /&gt;
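&lt;br /&gt;
The strong-ordering case can be checked by brute force. The following is a minimal sketch, not part of the original example: it enumerates every interleaving of the two instruction streams that respects program order and collects the register results. The tuple encoding and the helper names (interleavings, run, sc_outcomes) are assumptions made for illustration.&lt;br /&gt;

```python
# Each operation is (kind, variable, value or destination register).
CPU1 = [("st", "flag1", 1), ("st", "a", 1), ("ld", "a", "r1"), ("ld", "flag2", "r2")]
CPU2 = [("st", "flag2", 1), ("st", "a", 2), ("ld", "a", "r3"), ("ld", "flag1", "r4")]


def interleavings(xs, ys):
    """Yield every merge of xs and ys that keeps each list in program order."""
    if not xs or not ys:
        yield list(xs) + list(ys)
        return
    for rest in interleavings(xs[1:], ys):
        yield [xs[0]] + rest
    for rest in interleavings(xs, ys[1:]):
        yield [ys[0]] + rest


def run(schedule):
    """Execute one global order against a single shared memory."""
    mem = {"a": 0, "b": 0, "flag1": 0, "flag2": 0}
    regs = {}
    for kind, var, arg in schedule:
        if kind == "st":
            mem[var] = arg          # store becomes visible immediately
        else:
            regs[arg] = mem[var]    # load reads the current memory value
    return (regs["r1"], regs["r2"], regs["r3"], regs["r4"])


def sc_outcomes():
    """Set of (r1, r2, r3, r4) results reachable under strong ordering."""
    return {run(s) for s in interleavings(CPU1, CPU2)}


outcomes = sc_outcomes()
print("r2 = r4 = 0 possible:", any(r2 == 0 and r4 == 0 for _, r2, _, r4 in outcomes))
```

Under these assumptions the check prints False: no sequentially consistent interleaving leaves both r2 and r4 at 0.&lt;br /&gt;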
&lt;br /&gt;
===Effects on Write Buffer Operation===&lt;br /&gt;
&lt;br /&gt;
// Explain here what ordering has to do with the write buffer coherency&lt;br /&gt;
&lt;br /&gt;
==Cache Coherence Models==&lt;br /&gt;
&lt;br /&gt;
The two prominent models for maintaining cache coherence are the snoopy-bus protocol and the directory-based protocol. The directory-based protocol is used for distributed memory systems, where a directory entry keeps track of which data is stored in which cache. In the snoopy-bus protocol, every processor monitors the reads and writes being serviced on the memory bus. Read and write requests are broadcast on the bus, and each processor responds according to whether it holds that data in its cache. This article looks at shared memory systems.&lt;br /&gt;
&lt;br /&gt;
=== Write-Update ===&lt;br /&gt;
&lt;br /&gt;
In this approach, a write request is broadcast to all the processors, and each processor updates its local cache with the new value of the data. Even if a read misses in a processor’s local cache, it can be served by any other cache holding the block, as the copy has been updated in all caches. This saves bus bandwidth on writes, as only one broadcast is needed for all the caches to hold the newest value of the data.&lt;br /&gt;
&lt;br /&gt;
===Write-Invalidate===&lt;br /&gt;
This approach is common in systems where a data block is written by one processor but read by multiple processors. Whenever a write is made to shared data, an invalidate signal is sent to all the caches, asking them to mark that data block invalid because their copy no longer holds the latest value. This may cause additional traffic on the bus; as a result, a separate bus for invalidation requests may be included in the design.&lt;br /&gt;
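&lt;br /&gt;
The difference between the two policies can be illustrated with a toy bus model. This is a minimal sketch under stated assumptions: the Bus class, its write-through memory, and the rule that every write costs exactly one bus transaction are simplifications introduced here, not a description of any real protocol.&lt;br /&gt;

```python
class Bus:
    """Hypothetical shared bus connecting private caches to memory."""

    def __init__(self, n_caches, policy):
        self.policy = policy              # "update" or "invalidate"
        self.memory = {}
        self.caches = [dict() for _ in range(n_caches)]
        self.transactions = 0             # bus transactions issued so far

    def read(self, cpu, addr):
        cache = self.caches[cpu]
        if addr not in cache:             # miss: fetch the block over the bus
            self.transactions += 1
            cache[addr] = self.memory.get(addr, 0)
        return cache[addr]

    def write(self, cpu, addr, value):
        self.transactions += 1            # assumption: one broadcast per write
        self.memory[addr] = value         # write-through keeps the sketch short
        self.caches[cpu][addr] = value
        for other, cache in enumerate(self.caches):
            if other == cpu or addr not in cache:
                continue
            if self.policy == "update":
                cache[addr] = value       # remote copy refreshed in place
            else:
                del cache[addr]           # remote copy invalidated


def producer_consumer(policy, rounds=4):
    """CPU 0 writes x each round; CPUs 1 and 2 read it back."""
    bus = Bus(n_caches=3, policy=policy)
    for value in range(1, rounds + 1):
        bus.write(0, "x", value)
        assert bus.read(1, "x") == value
        assert bus.read(2, "x") == value
    return bus.transactions


print("update:", producer_consumer("update"))
print("invalidate:", producer_consumer("invalidate"))
```

For this one-writer, two-reader pattern the model counts 6 bus transactions with write-update and 12 with write-invalidate, because every invalidation forces the readers to miss again. A workload in which the readers stop reading would favour invalidation instead.&lt;br /&gt;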
&lt;br /&gt;
== Coherence in Write Buffers ==&lt;br /&gt;
&lt;br /&gt;
===Software-Based Coherence===&lt;br /&gt;
The software technique relies on the compiler to ensure that no dependencies exist between the STORE/LOAD accesses carried out by different processors on the shared memory. The compiler, based on indications from the programmer, makes sure that there are no incoherent accesses to shared memory. A shared READ/WRITE buffer between all processors may be used to access the main memory. &lt;br /&gt;
&lt;br /&gt;
===Hardware-Based Coherence===&lt;br /&gt;
In hardware-based protocols, accesses to the shared memory are communicated using hardware invalidate signals that are broadcast to all the processor memories. Hardware support may also be required for a LOAD to be broadcast to all the caches, so that a read can be performed directly from a remote memory. The following approaches maintain coherence in write buffers and caches for multiprocessor systems:&lt;br /&gt;
&lt;br /&gt;
====Unique Buffer per Processor==== &lt;br /&gt;
[[Image:Unique Buffer Per Processor.png|thumb|right|250px|Figure 3: Cache based system with unique buffer per processor &lt;br /&gt;
[[http://mprc.pku.cn/mentors/training/ISCAreading/1986/p434-dubois/p434-dubois.pdf]]]] &lt;br /&gt;
&lt;br /&gt;
In this configuration, every processor has its own unique buffer that holds the loads and stores of the local processor as well as the invalidates and loads of remote processors that want access to the shared data. A store request in the local cache invalidates that data block in all other caches and in main memory. A STORE is said to have been performed when &lt;br /&gt;
&lt;br /&gt;
# the request is serviced in the buffer and the cache is updated on a local hit, and&lt;br /&gt;
# the request is issued to the buffer and an invalidate request has been sent to all other processors for the shared data.&lt;br /&gt;
&lt;br /&gt;
This can be achieved in two ways depending on the topology of the connection between the caches -&lt;br /&gt;
&lt;br /&gt;
# For a bus-based MP system, the LOADs that miss in the local cache and the STOREs that need invalidation in other caches are broadcast over the bus to all processor caches and memory. The STORE request is then considered performed with respect to all processors when the invalidation signals have been sent out to the private buffers of all the caches. &lt;br /&gt;
# For non-bus MP systems, the caches are connected in a point-to-point mesh or ring topology. When a store is encountered in one of the processors, the shared data is first locked in the shared memory to ensure atomic access, and the invalidate signal is then propagated along the point-to-point interconnect in a predetermined order. The STORE to shared memory is performed only when the invalidation has been propagated to all the processor caches and buffers. If any processor issues a LOAD to the address being STOREd, the LOAD request has to be rejected or buffered and serviced later, to maintain atomic access to the shared data block.&lt;br /&gt;
&lt;br /&gt;
====Separate Buffers for Local and Remote Accesses====&lt;br /&gt;
[[Image:Separate Invalidate Buffers.png|thumb|right|250px|Figure 4: Cache based system with separate write-invalidate buffers&lt;br /&gt;
[[http://mprc.pku.cn/mentors/training/ISCAreading/1986/p434-dubois/p434-dubois.pdf]]]] &lt;br /&gt;
&lt;br /&gt;
Another approach to maintaining coherence is for every processor to have separate buffers for local data requests and remote data requests. As seen in the figure, every processor has a Local-buffer which queues the local LOADs and STOREs, and a Remote-buffer (also termed Invalidate buffer) that stores the invalidation requests and LOADs coming in from different processors.&lt;br /&gt;
In a bus-based system, this approach makes it difficult to maintain write atomicity: different invalidate buffers may hold different numbers of invalidation requests, which makes the time at which the data block is invalidated in each cache uncertain. In non-bus systems, however, this approach succeeds in maintaining strong coherence, provided the invalidate signal ensures that the data is invalidated at one cache before moving on to the next processor (rather than just pushing the invalidate request into the buffer and moving on).&lt;br /&gt;
&lt;br /&gt;
====Universal read/write Buffer====&lt;br /&gt;
&lt;br /&gt;
[[Image:Universal Read Write Buffer.png|thumb|right|300px|Figure 5: Universal Read/Write Buffer [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/6b_dm#cite_note-4 5]]]&lt;br /&gt;
&lt;br /&gt;
In this approach, a shared bus controller resides between the local bus and the memory bus, where the local bus is connected to all private caches of the processors (illustrated in Fig 5). This technique supports multiple processors with the same or different write-invalidate coherence protocols, such as MSI, MESI and MOESI. The shared bus controller consists of the local bus controller, the data buffer and the address buffer. Each private cache provides two tri-state signal bits to the local bus – '''INTERVENE''' and '''SHARE'''.&lt;br /&gt;
&lt;br /&gt;
# INTERVENE is asserted when a cache wants to provide valid data to other caches.&lt;br /&gt;
# SHARE is asserted by a cache controller when there is an address match with any of its own tag addresses.&lt;br /&gt;
&lt;br /&gt;
=====''' Data Buffer '''=====&lt;br /&gt;
&lt;br /&gt;
The data buffer consists of three FIFOs:&lt;br /&gt;
&lt;br /&gt;
# The ''non_block write-FIFO'' receives non-cacheable memory access data from the CPU. These requests do not require any cache eviction or line-fill cycles. &lt;br /&gt;
# The ''write-back FIFO'' is used for eviction cycles. If there is a cache miss and an eviction is required, the missed data is read from memory into the cache (line-fill cycle) just after the evicted data has moved to the FIFO. The FIFO contents are written back to memory later. The CPU thus gets the new data while the eviction time is hidden from it. &lt;br /&gt;
# The ''read-FIFO'' stores data from the non_block write FIFO, the write-back FIFO or memory. Data is held in this FIFO temporarily until the local bus is clear and ready for new data.&lt;br /&gt;
&lt;br /&gt;
=====''' Address Buffer '''=====&lt;br /&gt;
&lt;br /&gt;
The address buffer also consists of three FIFOs &lt;br /&gt;
&lt;br /&gt;
# ''Non_block write FIFO'' - stores the addresses of non-blocking accesses. The depth of this FIFO should be the same as that of the corresponding Data Buffer FIFO.&lt;br /&gt;
# ''Write-back FIFO'' - stores the starting address of an eviction cycle and holds the address until the memory bus is free.&lt;br /&gt;
# ''Line-fill FIFO'' - stores the starting address of a line-fill cycle.&lt;br /&gt;
&lt;br /&gt;
Snoop logic starts whenever there is a read cycle to main memory. It compares the read request address with both the write-back FIFO and the line-fill FIFO. If there is a match, the HIT flag is set and the CPU gets the data immediately through the internal bypass path. As a result, stale data is not read and the CPU does not pay the memory latency.&lt;br /&gt;
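&lt;br /&gt;
The bypass on a FIFO match can be sketched as follows. This is a simplified model under stated assumptions: only the write-back FIFO is modelled, each entry carries its data alongside its address, and the AddressSnoop class and its method names are hypothetical.&lt;br /&gt;

```python
from collections import deque


class AddressSnoop:
    """Hypothetical model of the snoop check on a memory read cycle."""

    def __init__(self, memory):
        self.memory = memory
        self.write_back_fifo = deque()    # (address, data) evicted, not yet in memory

    def evict(self, addr, data):
        """Queue an evicted line for a later write-back."""
        self.write_back_fifo.append((addr, data))

    def drain_one(self):
        """Write the oldest evicted line back to memory."""
        addr, data = self.write_back_fifo.popleft()
        self.memory[addr] = data

    def read(self, addr):
        """Return (data, hit); hit is True when the FIFO supplied the data."""
        # Search newest first so the most recent eviction of a line wins.
        for fifo_addr, data in reversed(self.write_back_fifo):
            if fifo_addr == addr:
                return data, True         # HIT: bypass path, no memory access
        return self.memory.get(addr), False


snoop = AddressSnoop({0x40: "old"})
snoop.evict(0x40, "new")
print(snoop.read(0x40))
```

While the evicted line is still queued, the read returns the queued value with the hit flag set, so the stale copy in memory is never observed.&lt;br /&gt;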
&lt;br /&gt;
For a non-block write cycle, the snoop logic block compares the newly requested address and Byte Enable bits with those previously stored in the non_block write FIFO. If the address matches but the Byte Enables do not overlap, the BGO signal is asserted and the pointer of the non_block write FIFO does not move forward, which prevents the FIFO from filling up quickly.&lt;br /&gt;
&lt;br /&gt;
In this approach, read cycles have priority over write cycles, so whenever the memory bus is available, a read address gets direct access to it.&lt;br /&gt;
&lt;br /&gt;
=====''' Local Bus Controller '''=====&lt;br /&gt;
&lt;br /&gt;
The controller monitors the local bus for the status bits (M, O, E, S, I) and the INTERVENE# and SHARE# bits, and determines whether a cycle needs to access main memory or a private cache. For a main memory access, the controller informs the buffer with the WB (Write Back), LF (Line Fill) or O3 (owned) status signal.&lt;br /&gt;
&lt;br /&gt;
Algorithm to generate controller signals-&lt;br /&gt;
&lt;br /&gt;
For the MSI and MESI protocols: &lt;br /&gt;
&lt;br /&gt;
If there is a read miss on the local bus -&lt;br /&gt;
  INTERVENE is asserted.&lt;br /&gt;
If there is a write miss with M status on the local bus -&lt;br /&gt;
  A write-back cycle is performed.&lt;br /&gt;
  Generate the WB signal.&lt;br /&gt;
If there is a read miss or write miss from any of the protocols -&lt;br /&gt;
  An LF cycle is performed.&lt;br /&gt;
  Generate the LF signal.&lt;br /&gt;
&lt;br /&gt;
For the MOESI protocol:&lt;br /&gt;
&lt;br /&gt;
If there is a read miss on the local bus&lt;br /&gt;
  INTERVENE is asserted&lt;br /&gt;
&lt;br /&gt;
The above outputs can be written in Boolean equations as shown in the example&amp;lt;ref&amp;gt;http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=00342629&amp;lt;/ref&amp;gt; below:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
      WB = RD * MOESI * INTV + WR * [STATUS=M]&lt;br /&gt;
      LF =  INTV&lt;br /&gt;
      O3 = RD * MOESI * INTV&lt;br /&gt;
      MOESI means the cycle is initiated by MOESI protocol.&lt;br /&gt;
      INTV means the INTERVENE signal, which is asserted when any cache provides valid data to another cache.&lt;br /&gt;
&lt;br /&gt;
=====''' Algorithms '''=====&lt;br /&gt;
&lt;br /&gt;
The algorithms followed by each of the processors in a multiprocessor system are given below. The processors may follow different protocols (MSI, MESI or MOESI), but these algorithms ensure that memory accesses are carried out coherently. &lt;br /&gt;
&lt;br /&gt;
''MOESI''    The algorithm followed for the MOESI protocol is as follows:&lt;br /&gt;
&lt;br /&gt;
#	A cache in the M state provides data to another cache on a read hit initiated by that cache, and changes its state from M to O.&lt;br /&gt;
#	If the same data is hit again, the cache in the O state is responsible for providing the data to the requesting cache.&lt;br /&gt;
#	A write cycle needs a line fill if there is a cache miss and no INTERVENE. &lt;br /&gt;
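&lt;br /&gt;
The first two rules can be sketched as a small state machine. This is a minimal illustration, not the controller described above: the Line class and the remote_read function are hypothetical, and the fallback fill from memory when no cache intervenes is shown for a read rather than for the write cycle of rule 3.&lt;br /&gt;

```python
class Line:
    """One cache line: a MOESI state letter plus its data."""

    def __init__(self, state="I", data=None):
        self.state = state
        self.data = data


def remote_read(caches, requester, memory_value):
    """Serve a read miss by `requester`; return (data, supplier index or None)."""
    for i, line in enumerate(caches):
        if i != requester and line.state in ("M", "O"):
            if line.state == "M":
                line.state = "O"          # rule 1: M supplies the data, becomes O
            caches[requester].state = "S"
            caches[requester].data = line.data
            return line.data, i           # rule 2: O keeps supplying later readers
    # No cache intervened, so the line is filled from memory.
    shared = False
    for i, line in enumerate(caches):
        if i != requester and line.state in ("E", "S"):
            line.state = "S"
            shared = True
    caches[requester].state = "S" if shared else "E"
    caches[requester].data = memory_value
    return memory_value, None


caches = [Line("M", 7), Line(), Line()]
print(remote_read(caches, 1, 0), caches[0].state)
print(remote_read(caches, 2, 0), caches[0].state)
```

In both calls cache 0 supplies the value 7: it moves from M to O on the first read and stays in O on the second, while main memory is never consulted.&lt;br /&gt;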
&lt;br /&gt;
&lt;br /&gt;
''MSI'' and ''MESI''	The algorithms for the MSI and MESI protocols are as follows. (The MSI and MESI protocols have no O state.)&lt;br /&gt;
&lt;br /&gt;
#	A cache in the M state of the MSI or MESI protocol provides data to the other cache and to main memory, and changes its state from M to S.&lt;br /&gt;
#	The local bus controller is responsible for changing the state from S to O for the MOESI protocol. &lt;br /&gt;
#	The M state of MSI or MESI still changes to S. &lt;br /&gt;
#	If write cycle initiated by other CPU (MSI or MESI),&lt;br /&gt;
::	If MOESI cache has a valid line with M or O status&lt;br /&gt;
:::	It will send out the data line onto the local bus and change its state to I.&lt;/blockquote&gt;&lt;br /&gt;
:::	[Because the line will be written by other cache and outdated]&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&lt;br /&gt;
:5.	If the status is E or S, main memory will provide the data and no cache is involved in data transfer. However, cache with MOESI protocol will be the owner of a particular line.&lt;br /&gt;
&lt;br /&gt;
:6.	Force the status bit to be O instead of S&lt;br /&gt;
&lt;br /&gt;
=='''References'''==&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;br /&gt;
[http://mprc.pku.cn/mentors/training/ISCAreading/1986/p434-dubois/p434-dubois.pdf Memory Access Buffering In Multiprocessors] Michel Dubois, Christoph Scheurich, Faye Briggs&lt;br /&gt;
&lt;br /&gt;
[http://www.ece.cmu.edu/~ece548/handouts/19coher.pdf Multiprocessor Consistency and Coherence] Memory System Architecture, Philip Koopman&lt;br /&gt;
&lt;br /&gt;
[http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=528915 Write buffer design for cache-coherent shared-memory multiprocessors], Fernaz Mounes-Toussi, David J. Lilja&lt;/div&gt;</summary>
		<author><name>Sbasu3</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/6b_am&amp;diff=59298</id>
		<title>CSC/ECE 506 Spring 2012/6b am</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/6b_am&amp;diff=59298"/>
		<updated>2012-03-04T15:22:32Z</updated>

		<summary type="html">&lt;p&gt;Sbasu3: /*  Local Bus Controller  */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Introduction ==&lt;br /&gt;
&lt;br /&gt;
With processor speeds increasing at a much faster rate than memory speeds, data transactions between the processor and the memory system must be managed so that slow memory does not limit the performance of the processor system. While a read requires the processor to wait for the operation to complete before resuming execution, a write does not. This is where a write buffer (WB) comes into the picture: the processor hands the write to the buffer and continues its operation while the write buffer takes complete responsibility for executing the write. &lt;br /&gt;
&lt;br /&gt;
===Write Buffers in Uni-processors===&lt;br /&gt;
&lt;br /&gt;
A write to be performed is placed in a buffer implemented as a FIFO queue, so that writes are performed in the order in which they were issued. In a uni-processor model that extracts Instruction Level Parallelism (ILP), writes may also be issued out of order, provided hardware or software protocols check the writes for any dependences in the instruction stream. The following figure shows a cache-based single-processor system with a write buffer. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:Uniprocessor With WB.png|thumb|center|400px|Figure 1.Uni-processor cache based system with write buffer ]]&lt;br /&gt;
&lt;br /&gt;
===Write Buffer Issues in Multiprocessors===&lt;br /&gt;
&lt;br /&gt;
In a multiprocessor system, the same design can be extended, as shown in the figure. In this design, each processor has its own private cache and a write buffer corresponding to that cache. The caches are connected by means of an interconnect with each other as well as with the main memory. &lt;br /&gt;
 &lt;br /&gt;
[[Image:Multiprocessor with WB.png|thumb|center|400px|Figure 2. Cache- based multiprocessor system with write buffer ]] &lt;br /&gt;
&lt;br /&gt;
As can be seen in the figure above, each processor makes a write by pushing it into the write buffer, and the write buffer completes the task of performing the write to the cache or main memory. If the processor issues another write to the same address as an earlier buffered write, the earlier value is overwritten with the new one in the write buffer. Similarly, if a read is issued to the same address, the processor reads the value from the write buffer rather than going to the cache or the main memory.&lt;br /&gt;
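&lt;br /&gt;
The overwrite and read-from-buffer behaviour described above can be sketched in a few lines. This is a minimal model, assuming a plain dictionary stands in for the cache or main memory; the WriteBuffer class and its method names are hypothetical.&lt;br /&gt;

```python
from collections import OrderedDict


class WriteBuffer:
    """Hypothetical per-processor write buffer in front of a cache or memory."""

    def __init__(self, backing):
        self.backing = backing            # dict standing in for cache/main memory
        self.pending = OrderedDict()      # address -> value, oldest entry first

    def write(self, addr, value):
        # A second write to a buffered address overwrites the earlier value
        # in place; the entry keeps its original position in the queue.
        self.pending[addr] = value

    def read(self, addr):
        # A read to a buffered address is served from the buffer.
        if addr in self.pending:
            return self.pending[addr]
        return self.backing.get(addr)

    def drain_one(self):
        """Retire the oldest pending write to the backing store."""
        addr, value = self.pending.popitem(last=False)
        self.backing[addr] = value


memory = {}
wb = WriteBuffer(memory)
wb.write("A", 1)
wb.write("B", 2)
wb.write("A", 3)                          # replaces the buffered value 1
print(wb.read("A"), memory)
```

Until drain_one is called, the new value of A exists only in this processor’s buffer and the backing store is still empty.&lt;br /&gt;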
&lt;br /&gt;
===The Coherence Problem===&lt;br /&gt;
&lt;br /&gt;
Consider the case where a write (ST_A) has been issued by processor I into the WB_I, and the write is waiting to be executed. Since the write has not yet been performed to the cache or the main memory, the other processors do not have any knowledge about the changes made to address A. As a result, a read operation by another processor from address A will take the value from either its own cache, write buffer, or the main memory (depending on whether it hits or misses in the cache). It is the job of the designer to employ protocols that will take care that the sequential ordering of the instructions is maintained, and that the writes made to any one of the caches are propagated and updated in all of the processor caches. &lt;br /&gt;
&lt;br /&gt;
==Sequential Consistency==&lt;br /&gt;
&lt;br /&gt;
When operating as a part of a multiprocessor, it is not enough to check dependencies between the writes only at the local level. As data is shared and the course of events in different processors may affect the outcomes of each other, it has to be ensured that the sequence of writes and the data dependencies are preserved across the processors. &lt;br /&gt;
“A system is said to be sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.”&lt;br /&gt;
There are different approaches to this requirement, based on the expected outcome of the program. The condition of sequential consistency and logical ordering may be relaxed as per the requirement of the program we are looking to parallelize. &lt;br /&gt;
&lt;br /&gt;
===Strong Ordering===&lt;br /&gt;
The requirements for strong ordering are as follows:&lt;br /&gt;
 &lt;br /&gt;
1) All memory operations appear to execute one at a time.&lt;br /&gt;
&lt;br /&gt;
2) All memory operations from a single CPU appear to execute in-order.&lt;br /&gt;
&lt;br /&gt;
3) All memory operations from different processors are “cleanly” interleaved with each other (serialization)&lt;br /&gt;
&lt;br /&gt;
===Total Store Ordering===&lt;br /&gt;
Requirements are as follows:&lt;br /&gt;
&lt;br /&gt;
1)	Relaxed consistency: stores must complete in order, but a store need not complete before a later read takes place.&lt;br /&gt;
&lt;br /&gt;
2)	Reads may bypass pending writes, but writes must exit the store buffer in FIFO order.&lt;br /&gt;
&lt;br /&gt;
===Partial Store Ordering===&lt;br /&gt;
&lt;br /&gt;
Requirement is as follows:&lt;br /&gt;
&lt;br /&gt;
Even more relaxed consistency: stores to any given memory location complete in order, but stores to different locations may complete out of order, and a store need not complete before a later read takes place.&lt;br /&gt;
&lt;br /&gt;
===Weak Ordering===&lt;br /&gt;
Requirement is as follows:&lt;br /&gt;
&lt;br /&gt;
Very relaxed consistency: between synchronization points, the memory state may correspond to any ordering of reads and writes. At each barrier synchronization point, however, the global memory state must be completely settled.&lt;br /&gt;
&lt;br /&gt;
===Example===&lt;br /&gt;
Example for sequential consistency:&lt;br /&gt;
In the following program, the variables “a”, “b”, “flag1” and “flag2” are initialized to 0, and both processors (CPU 1 and CPU 2) share all of them.&lt;br /&gt;
   &lt;br /&gt;
    a = b = flag1 = flag2 = 0;		// initial value&lt;br /&gt;
    CPU 1                   CPU 2&lt;br /&gt;
    flag1 = 1;              flag2 = 1;&lt;br /&gt;
    a = 1;                  a = 2;&lt;br /&gt;
    r1 = a;                 r3 = a;&lt;br /&gt;
    r2 = flag2;             r4 = flag1;&lt;br /&gt;
&lt;br /&gt;
The SPARC V8 architecture follows the Total Store Ordering model, which allows a write followed by a read to complete out of program order. One possible result is therefore r1 = 1, r3 = 2, r2 = r4 = 0.&lt;br /&gt;
Strong ordering rules this result out: whichever flag store comes first in the global order is visible to the other processor’s later flag read, so at least one of r2 and r4 must be 1. The remaining results depend on how the two processors’ instructions interleave.&lt;br /&gt;
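&lt;br /&gt;
The Total Store Ordering result can be reproduced with a toy store-buffer model. This is a sketch under stated assumptions rather than a model of the SPARC hardware: the TsoCpu class is hypothetical, loads read their own CPU’s buffered stores first, and both buffers are drained only after all loads have run.&lt;br /&gt;

```python
from collections import deque


class TsoCpu:
    """Hypothetical CPU with a FIFO store buffer, in the spirit of TSO."""

    def __init__(self, memory):
        self.memory = memory
        self.store_buffer = deque()       # (variable, value), oldest first
        self.regs = {}

    def store(self, var, value):
        # The store is buffered and not yet visible to the other CPU.
        self.store_buffer.append((var, value))

    def load(self, reg, var):
        # A load sees the newest buffered store to var from its own CPU;
        # otherwise it bypasses the buffer and reads shared memory.
        for buffered_var, value in reversed(self.store_buffer):
            if buffered_var == var:
                self.regs[reg] = value
                return
        self.regs[reg] = self.memory[var]

    def drain(self):
        # Stores leave the buffer in FIFO order.
        while self.store_buffer:
            var, value = self.store_buffer.popleft()
            self.memory[var] = value


memory = {"a": 0, "b": 0, "flag1": 0, "flag2": 0}
cpu1, cpu2 = TsoCpu(memory), TsoCpu(memory)

cpu1.store("flag1", 1)
cpu1.store("a", 1)
cpu2.store("flag2", 1)
cpu2.store("a", 2)
cpu1.load("r1", "a")
cpu1.load("r2", "flag2")
cpu2.load("r3", "a")
cpu2.load("r4", "flag1")
cpu1.drain()
cpu2.drain()

print(cpu1.regs, cpu2.regs)
```

With this schedule the model yields r1 = 1, r2 = 0, r3 = 2 and r4 = 0: each CPU reads its own buffered value of a, while the flag stores are still sitting in the other CPU’s buffer when the flag loads execute.&lt;br /&gt;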
&lt;br /&gt;
===Effects on Write Buffer Operation===&lt;br /&gt;
&lt;br /&gt;
// Explain here what ordering has to do with the write buffer coherency&lt;br /&gt;
&lt;br /&gt;
==Cache Coherence Models==&lt;br /&gt;
&lt;br /&gt;
The two prominent models for maintaining cache coherence are the snoopy-bus protocol and the directory-based protocol. The directory-based protocol is used for distributed memory systems, where a directory entry keeps track of which data is stored in which cache. In the snoopy-bus protocol, every processor monitors the reads and writes being serviced on the memory bus. Read and write requests are broadcast on the bus, and each processor responds according to whether it holds that data in its cache. This article looks at shared memory systems.&lt;br /&gt;
&lt;br /&gt;
=== Write-Update ===&lt;br /&gt;
&lt;br /&gt;
In this approach, a write request is broadcast to all the processors, and each processor updates its local cache with the new value of the data. Even if a read misses in a processor’s local cache, it can be served by any other cache holding the block, as the copy has been updated in all caches. This saves bus bandwidth on writes, as only one broadcast is needed for all the caches to hold the newest value of the data.&lt;br /&gt;
&lt;br /&gt;
===Write-Invalidate===&lt;br /&gt;
This approach is common in systems where a data block is written by one processor but read by multiple processors. Whenever a write is made to shared data, an invalidate signal is sent to all the caches, asking them to mark that data block invalid because their copy no longer holds the latest value. This may cause additional traffic on the bus; as a result, a separate bus for invalidation requests may be included in the design.&lt;br /&gt;
&lt;br /&gt;
== Coherence in Write Buffers ==&lt;br /&gt;
&lt;br /&gt;
===Software-Based Coherence===&lt;br /&gt;
The software technique relies on the compiler to ensure that no dependencies exist between the STORE/LOAD accesses carried out by different processors on the shared memory. The compiler, based on indications from the programmer, makes sure that there are no incoherent accesses to shared memory. A shared READ/WRITE buffer between all processors may be used to access the main memory. &lt;br /&gt;
&lt;br /&gt;
===Hardware-Based Coherence===&lt;br /&gt;
In hardware-based protocols, accesses to the shared memory are communicated using hardware invalidate signals that are broadcast to all the processor memories. Hardware support may also be required for a LOAD to be broadcast to all the caches, so that a read can be performed directly from a remote memory. The following approaches maintain coherence in write buffers and caches for multiprocessor systems:&lt;br /&gt;
&lt;br /&gt;
====Unique Buffer per Processor==== &lt;br /&gt;
[[Image:Unique Buffer Per Processor.png|thumb|right|250px|Figure 3: Cache based system with unique buffer per processor &lt;br /&gt;
[[http://mprc.pku.cn/mentors/training/ISCAreading/1986/p434-dubois/p434-dubois.pdf]]]] &lt;br /&gt;
&lt;br /&gt;
In this configuration, every processor has its own unique buffer that holds the loads and stores of the local processor as well as the invalidates and loads of remote processors that want access to the shared data. A store request in the local cache invalidates that data block in all other caches and in main memory. A STORE is said to have been performed when &lt;br /&gt;
&lt;br /&gt;
# the request is serviced in the buffer and the cache is updated on a local hit, and&lt;br /&gt;
# the request is issued to the buffer and an invalidate request has been sent to all other processors for the shared data.&lt;br /&gt;
&lt;br /&gt;
This can be achieved in two ways depending on the topology of the connection between the caches -&lt;br /&gt;
&lt;br /&gt;
# For a bus-based MP system, the LOADs that miss in the local cache and the STOREs that need invalidation in other caches are broadcast over the bus to all processor caches and memory. The STORE request is then considered performed with respect to all processors when the invalidation signals have been sent out to the private buffers of all the caches. &lt;br /&gt;
# For non-bus MP systems, the caches are connected in a point-to-point mesh or ring topology. When a store is encountered in one of the processors, the shared data is first locked in the shared memory to ensure atomic access, and the invalidate signal is then propagated along the point-to-point interconnect in a predetermined order. The STORE to shared memory is performed only when the invalidation has been propagated to all the processor caches and buffers. If any processor issues a LOAD to the address being STOREd, the LOAD request has to be rejected or buffered and serviced later, to maintain atomic access to the shared data block.&lt;br /&gt;
&lt;br /&gt;
====Separate Buffers for Local and Remote Accesses====&lt;br /&gt;
[[Image:Separate Invalidate Buffers.png|thumb|right|250px|Figure 4: Cache based system with separate write-invalidate buffers&lt;br /&gt;
[[http://mprc.pku.cn/mentors/training/ISCAreading/1986/p434-dubois/p434-dubois.pdf]]]] &lt;br /&gt;
&lt;br /&gt;
Another approach to maintaining coherence is for every processor to have separate buffers for local data requests and remote data requests. As seen in the figure, every processor has a Local-buffer which queues the local LOADs and STOREs, and a Remote-buffer (also termed Invalidate buffer) that stores the invalidation requests and LOADs coming in from different processors.&lt;br /&gt;
In a bus-based system, this approach makes it difficult to maintain write atomicity: different invalidate buffers may hold different numbers of invalidation requests, which makes the time at which the data block is invalidated in each cache uncertain. In non-bus systems, however, this approach succeeds in maintaining strong coherence, provided the invalidate signal ensures that the data is invalidated at one cache before moving on to the next processor (rather than just pushing the invalidate request into the buffer and moving on).&lt;br /&gt;
&lt;br /&gt;
====Universal read/write Buffer====&lt;br /&gt;
&lt;br /&gt;
[[Image:Universal Read Write Buffer.png|thumb|right|300px|Figure 5: Universal Read/Write Buffer [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/6b_dm#cite_note-4 5]]]&lt;br /&gt;
&lt;br /&gt;
In this approach, a shared bus controller resides between the local bus and the memory bus, where the local bus is connected to all private caches of the processors (illustrated in Fig 5). This technique supports multiple processors with the same or different write-invalidate coherence protocols, such as MSI, MESI and MOESI. The shared bus controller consists of the local bus controller, the data buffer and the address buffer. Each private cache provides two tri-state signal bits to the local bus – '''INTERVENE''' and '''SHARE'''.&lt;br /&gt;
&lt;br /&gt;
# INTERVENE is asserted when a cache wants to provide valid data to other caches.&lt;br /&gt;
# SHARE is asserted by a cache controller when there is an address match with any of its own tag addresses.&lt;br /&gt;
&lt;br /&gt;
=====''' Data Buffer '''=====&lt;br /&gt;
&lt;br /&gt;
The data buffer consists of three FIFOs:&lt;br /&gt;
&lt;br /&gt;
# The ''non_block write-FIFO'' receives non-cacheable memory access data from the CPU. These requests do not require any cache eviction or line-fill cycles. &lt;br /&gt;
# The ''write-back FIFO'' is used for eviction cycles. If there is a cache miss and an eviction is required, the missed data is read from memory into the cache (line-fill cycle) just after the evicted data has moved to the FIFO. The FIFO contents are written back to memory later. The CPU thus gets the new data while the eviction time is hidden from it. &lt;br /&gt;
# The ''read-FIFO'' stores data from the non_block write FIFO, the write-back FIFO or memory. Data is held in this FIFO temporarily until the local bus is clear and ready for new data.&lt;br /&gt;
&lt;br /&gt;
=====''' Address Buffer '''=====&lt;br /&gt;
&lt;br /&gt;
The address buffer also consists of three FIFOs &lt;br /&gt;
&lt;br /&gt;
# ''Non_block write FIFO'' - stores the addresses of non-blocking accesses. The depth of this FIFO should be the same as that of the corresponding Data Buffer FIFO.&lt;br /&gt;
# ''Write-back FIFO'' - stores the starting address of an eviction cycle and holds the address until the memory bus is free.&lt;br /&gt;
# ''Line-fill FIFO'' - stores the starting address of a line-fill cycle.&lt;br /&gt;
&lt;br /&gt;
Snoop logic starts whenever there is a read cycle to main memory. It compares the read request address with both the write-back FIFO and the line-fill FIFO. If there is a match, the HIT flag is set and the CPU gets the data immediately through the internal bypass path. As a result, stale data is not read and the CPU does not pay the memory latency.&lt;br /&gt;
&lt;br /&gt;
For a non-block write cycle, the snoop logic block compares the newly requested address and Byte Enable bits with those previously stored in the non_block write FIFO. If the address matches but the Byte Enables do not overlap, the BGO signal is asserted and the pointer of the non_block write FIFO does not move forward, which prevents the FIFO from filling up quickly.&lt;br /&gt;
&lt;br /&gt;
In this approach, read cycles have priority over write cycles, so whenever the memory bus is available, a read address gets direct access to it.&lt;br /&gt;
&lt;br /&gt;
=====''' Local Bus Controller '''=====&lt;br /&gt;
&lt;br /&gt;
The controller monitors the local bus for the status bits (M, O, E, S, I) and the INTERVENE# and SHARE# bits, and determines whether a cycle needs to access main memory or a private cache. For a main memory access, the controller informs the buffer with the WB (Write Back), LF (Line Fill) or O3 (owned) status signal.&lt;br /&gt;
&lt;br /&gt;
Algorithm to generate controller signals-&lt;br /&gt;
&lt;br /&gt;
For the MSI and MESI protocols:&lt;br /&gt;
&lt;br /&gt;
If there is a read miss on the local bus&lt;br /&gt;
  INTERVENE is asserted.&lt;br /&gt;
If there is a write miss with M status on the local bus&lt;br /&gt;
  Write back cycle will be performed&lt;br /&gt;
  Generate WB signal.&lt;br /&gt;
If there is a read miss or write miss from any of the protocols&lt;br /&gt;
  LF cycle will be performed&lt;br /&gt;
  Generate LF signal.&lt;br /&gt;
&lt;br /&gt;
For the MOESI protocol:&lt;br /&gt;
&lt;br /&gt;
If there is a read miss on the local bus&lt;br /&gt;
  INTERVENE is asserted.&lt;br /&gt;
&lt;br /&gt;
The above outputs can be written in Boolean equations as shown in the example&amp;lt;ref&amp;gt;http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=00342629&amp;lt;/ref&amp;gt; below:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
      WB = RD * MOESI * INTV + WR * [STATUS=M]&lt;br /&gt;
      LF = INTV&lt;br /&gt;
      O3 = RD * MOESI * INTV&lt;br /&gt;
      MOESI means the cycle is initiated by a cache using the MOESI protocol.&lt;br /&gt;
      INTV is the INTERVENE signal, which is asserted when any cache provides valid data to another cache.&lt;br /&gt;
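&lt;br /&gt;
The equations can be tried out with a small Python sketch. It evaluates WB, LF and O3 exactly as they are written above, so any complement bars that the cited paper may use are not reconstructed; the function name controller_signals and its parameter names are hypothetical and chosen only for this illustration.&lt;br /&gt;
&lt;br /&gt;
```python
def controller_signals(rd, wr, moesi, intv, status):
    """Evaluate WB, LF and O3 exactly as the equations are written above.

    rd, wr  -- True for a read cycle / write cycle on the local bus
    moesi   -- True when the cycle is initiated by a MOESI cache
    intv    -- True when INTERVENE is asserted
    status  -- status letter of the line: "M", "O", "E", "S" or "I"
    """
    wb = (rd and moesi and intv) or (wr and status == "M")
    lf = intv
    o3 = rd and moesi and intv
    return {"WB": bool(wb), "LF": bool(lf), "O3": bool(o3)}


# A write cycle that finds the line in M status requests a write back.
print(controller_signals(rd=False, wr=True, moesi=False, intv=False, status="M"))
```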
&lt;br /&gt;
=====''' Algorithms '''=====&lt;br /&gt;
&lt;br /&gt;
The algorithms followed by each of the processors in a multiprocessor system are given below. The processors may follow different protocols (MSI, MESI or MOESI), but these algorithms ensure that memory accesses are carried out coherently.&lt;br /&gt;
&lt;br /&gt;
''MOESI''    The algorithm followed for the MOESI protocol is as follows:&lt;br /&gt;
&lt;br /&gt;
#	A cache holding the line in the M state provides the data when another cache initiates a read that hits the line, and changes its state from M to O.&lt;br /&gt;
#	If the same data is hit again, the cache holding it in the O state is responsible for providing the data to the requesting cache.&lt;br /&gt;
#	A write cycle needs a line fill if there is a cache miss and INTERVENE is not asserted.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
''MSI'' and ''MESI''	The algorithms for the MSI and MESI protocols are as follows. (The MSI and MESI protocols have no O state.)&lt;br /&gt;
&lt;br /&gt;
#	A cache holding the line in the M state of the MSI or MESI protocol provides the data to the other cache and to main memory, and changes its state from M to S.&lt;br /&gt;
#	The local bus controller is responsible for changing the state from S to O for the MOESI protocol.&lt;br /&gt;
#	The M state of MSI or MESI still changes to S.&lt;br /&gt;
#	If a write cycle is initiated by another CPU (MSI or MESI),&lt;br /&gt;
::	If the MOESI cache has a valid line with M or O status&lt;br /&gt;
:::  	It sends the data line out on the local bus and changes its state to I.&lt;br /&gt;
:::	[Because the line will be written by the other cache and its copy becomes outdated]&lt;br /&gt;
&lt;br /&gt;
:5.	If the status is E or S, main memory provides the data and no cache is involved in the data transfer. However, the cache with the MOESI protocol becomes the owner of that line.&lt;br /&gt;
&lt;br /&gt;
:6.	Force the status bit to be O instead of S.&lt;br /&gt;
&lt;br /&gt;
=='''References'''==&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;br /&gt;
[http://mprc.pku.cn/mentors/training/ISCAreading/1986/p434-dubois/p434-dubois.pdf Memory Access Buffering In Multiprocessors] Michel Dubois, Christoph Scheurich, Faye Briggs&lt;br /&gt;
&lt;br /&gt;
[http://www.ece.cmu.edu/~ece548/handouts/19coher.pdf Multiprocessor Consistency and Coherence] Memory System Architecture, Philip Koopman&lt;br /&gt;
&lt;br /&gt;
[http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=528915 Write buffer design for cache-coherent shared-memory multiprocessors], Fernaz Mounes-Toussi, David J. Lilja&lt;/div&gt;</summary>
		<author><name>Sbasu3</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/6b_am&amp;diff=59297</id>
		<title>CSC/ECE 506 Spring 2012/6b am</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/6b_am&amp;diff=59297"/>
		<updated>2012-03-04T15:21:52Z</updated>

		<summary type="html">&lt;p&gt;Sbasu3: /*  Local Bus Controller  */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Introduction ==&lt;br /&gt;
&lt;br /&gt;
Processor speeds are increasing at a much faster rate than memory speeds, so the data transactions between the processor and the memory system must be managed such that slow memory does not limit processor performance. A read requires the processor to wait for the operation to complete before resuming execution, but a write does not. This is where a write buffer (WB) comes into the picture: the processor hands the write to the buffer and continues its operation while the write buffer takes complete responsibility for executing the write.&lt;br /&gt;
&lt;br /&gt;
===Write Buffers in Uni-processors===&lt;br /&gt;
&lt;br /&gt;
A write to be performed is put in a buffer implemented as a FIFO queue, so that writes are performed in the order in which they were issued. In a uni-processor model that extracts Instruction Level Parallelism (ILP), writes may also be issued out of order, provided that hardware or software checks the writes for any dependences that exist in the instruction stream. The following figure shows a cache-based single processor system with a write buffer.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:Uniprocessor With WB.png|thumb|center|400px|Figure 1. Uni-processor cache-based system with write buffer]]&lt;br /&gt;
&lt;br /&gt;
===Write Buffer Issues in Multiprocessors===&lt;br /&gt;
&lt;br /&gt;
In a multiprocessor system, the same design can be extended, as shown in the figure. In this design, each processor has its own private cache and a write buffer corresponding to that cache. The caches are connected to each other and to main memory by an interconnect.&lt;br /&gt;
 &lt;br /&gt;
[[Image:Multiprocessor with WB.png|thumb|center|400px|Figure 2. Cache-based multiprocessor system with write buffer]]&lt;br /&gt;
&lt;br /&gt;
As can be seen in the figure above, each processor performs a write by pushing it into its write buffer, and the write buffer then performs the write to the cache or main memory. If the processor issues another write to the same address as an earlier pending write, the earlier value is overwritten with the new one in the write buffer. Similarly, if the processor issues a read to the same address, it reads the value from the write buffer rather than going to the cache or main memory.&lt;br /&gt;
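&lt;br /&gt;
The merging and forwarding behaviour described above can be sketched in Python. This is a minimal model under stated assumptions, not the design of any particular processor: the class name WriteBuffer, its methods, and the dictionary that stands in for the cache or main memory are all hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import OrderedDict


class WriteBuffer:
    """FIFO write buffer that merges writes to the same address."""

    def __init__(self):
        self.pending = OrderedDict()  # address to value, oldest entry first

    def write(self, address, value):
        # A later write to a buffered address overwrites the earlier value
        # in place; the entry keeps its position in the queue.
        self.pending[address] = value

    def read(self, address, memory):
        # A read of a buffered address is served from the buffer.
        if address in self.pending:
            return self.pending[address]
        return memory.get(address, 0)

    def drain_one(self, memory):
        # Retire the oldest pending write to the cache or main memory.
        if self.pending:
            address, value = self.pending.popitem(last=False)
            memory[address] = value


memory = {"A": 0}                     # hypothetical stand-in for cache/main memory
buffer = WriteBuffer()
buffer.write("A", 1)
buffer.write("A", 2)                  # overwrites the pending value 1
print(buffer.read("A", memory))       # served from the buffer
print(memory["A"])                    # memory has not been written yet
buffer.drain_one(memory)
print(memory["A"])                    # the merged write has reached memory
```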
&lt;br /&gt;
===The Coherence Problem===&lt;br /&gt;
&lt;br /&gt;
Consider the case where a write (ST_A) has been issued by processor I into its write buffer WB_I, and the write is waiting to be executed. Since the write has not yet been performed to the cache or main memory, the other processors have no knowledge of the change made to address A. As a result, a read of address A by another processor takes the value from its own cache, its own write buffer, or main memory (depending on whether it hits or misses in the cache), and that value is stale. It is the job of the designer to employ protocols that maintain the sequential ordering of the instructions and propagate a write made in any one cache to all of the processor caches.&lt;br /&gt;
&lt;br /&gt;
==Sequential Consistency==&lt;br /&gt;
&lt;br /&gt;
When operating as part of a multiprocessor, it is not enough to check dependencies between writes only at the local level. Because data is shared and the course of events in one processor may affect the outcome in another, the sequence of writes and the data dependencies must be preserved across all processors.&lt;br /&gt;
“A system is said to be sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.”&lt;br /&gt;
There are different approaches to this requirement, based on the expected outcome of the program. The conditions of sequential consistency and logical ordering may be relaxed according to the requirements of the program being parallelized.&lt;br /&gt;
&lt;br /&gt;
===Strong Ordering===&lt;br /&gt;
The requirements for strong ordering are as follows:&lt;br /&gt;
 &lt;br /&gt;
1) All memory operations appear to execute one at a time.&lt;br /&gt;
&lt;br /&gt;
2) All memory operations from a single CPU appear to execute in-order.&lt;br /&gt;
&lt;br /&gt;
3) All memory operations from different processors are “cleanly” interleaved with each other (serialization)&lt;br /&gt;
&lt;br /&gt;
===Total Store Ordering===&lt;br /&gt;
Requirements are as follows:&lt;br /&gt;
&lt;br /&gt;
1)	Relaxed consistency in which stores must complete in order, but a store need not complete before a later read of a given location takes place.&lt;br /&gt;
&lt;br /&gt;
2)	Reads are allowed to bypass pending writes, but writes MUST exit the store buffer in FIFO order.&lt;br /&gt;
&lt;br /&gt;
===Partial Store Ordering===&lt;br /&gt;
&lt;br /&gt;
Requirement is as follows:&lt;br /&gt;
&lt;br /&gt;
Even more relaxed consistency, in which stores to any given memory location complete in order, but stores to different locations may complete out of order, and a store need not complete before a later read of a given location takes place.&lt;br /&gt;
&lt;br /&gt;
===Weak Ordering===&lt;br /&gt;
Requirement is as follows:&lt;br /&gt;
&lt;br /&gt;
The most relaxed consistency, in which memory operations may complete in any order except at barrier synchronization points. The global memory state must be completely settled at each synchronization point, and between synchronization points the memory state may correspond to any ordering of reads and writes.&lt;br /&gt;
&lt;br /&gt;
===Example===&lt;br /&gt;
Example for sequential consistency:&lt;br /&gt;
In the following program, the variables “a”, “b”, “flag1” and “flag2” are initialized to 0, and both processors (CPU1 and CPU2) share all the variables.&lt;br /&gt;
   &lt;br /&gt;
    a = b = flag1 = flag2 = 0;		// initial value&lt;br /&gt;
    CPU1				CPU2&lt;br /&gt;
    flag1 = 1;				flag2 = 1;&lt;br /&gt;
    a = 1;				a = 2;&lt;br /&gt;
    r1 = a;				r3 = a;&lt;br /&gt;
    r2 = flag2;				r4 = flag1;&lt;br /&gt;
&lt;br /&gt;
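The SPARC V8 architecture follows the Total Store Ordering model, which allows a read that follows a write in program order to complete before that write. One possible result is therefore r1 = 1, r3 = 2, r2 = r4 = 0.&lt;br /&gt;
Strong ordering, by contrast, executes all operations one at a time, so the result depends on how the instructions of the two processors are interleaved. The result r2 = r4 = 0 cannot occur, because each CPU sets its own flag before it reads the other CPU's flag, so at least one of the two reads must see the value 1.&lt;br /&gt;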
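&lt;br /&gt;
The difference can be checked mechanically. The Python sketch below, which is an illustration under simplifying assumptions rather than a model of SPARC hardware, enumerates every interleaving that strong ordering allows for the example above and then replays one execution in which each CPU keeps its stores in a private buffer while its loads run. The names run, sc_outcomes and tso_example are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
from itertools import combinations

CPU1 = [("st", "flag1", 1), ("st", "a", 1), ("ld", "a", "r1"), ("ld", "flag2", "r2")]
CPU2 = [("st", "flag2", 1), ("st", "a", 2), ("ld", "a", "r3"), ("ld", "flag1", "r4")]


def run(order):
    """Execute one interleaving atomically and return (r1, r2, r3, r4)."""
    mem = {"a": 0, "flag1": 0, "flag2": 0}
    regs = {}
    for kind, var, arg in order:
        if kind == "st":
            mem[var] = arg
        else:
            regs[arg] = mem[var]
    return tuple(regs[r] for r in ("r1", "r2", "r3", "r4"))


def sc_outcomes():
    """All results allowed when every operation executes one at a time."""
    results = set()
    total = len(CPU1) + len(CPU2)
    for slots in combinations(range(total), len(CPU1)):
        ops1, ops2 = iter(CPU1), iter(CPU2)
        order = [next(ops1) if i in slots else next(ops2) for i in range(total)]
        results.add(run(order))
    return results


def tso_example():
    """One execution in which both CPUs buffer their stores before loading."""
    mem = {"a": 0, "flag1": 0, "flag2": 0}
    regs = {}
    for program in (CPU1, CPU2):
        buffered = {}                      # this CPU's private store buffer
        for kind, var, arg in program:
            if kind == "st":
                buffered[var] = arg        # store waits in the buffer
            else:
                # Loads bypass pending stores but see the CPU's own buffer.
                regs[arg] = buffered.get(var, mem[var])
    # The buffers would drain to memory only after this point.
    return tuple(regs[r] for r in ("r1", "r2", "r3", "r4"))


outcome = tso_example()
print(outcome, outcome in sc_outcomes())
```

The printed tuple is in the order (r1, r2, r3, r4), and the second value reports whether strong ordering could have produced the same result.&lt;br /&gt;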
&lt;br /&gt;
===Effects on Write Buffer Operation===&lt;br /&gt;
&lt;br /&gt;
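The ordering model determines what a write buffer is allowed to do. Under strong ordering, the memory operations of a processor must appear to execute in program order, so a read may not bypass writes that are still pending in the buffer. Under total store ordering, reads may bypass pending writes, but the writes must leave the buffer in FIFO order. Under partial store ordering, only the writes to the same location must leave the buffer in order.&lt;br /&gt;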
&lt;br /&gt;
==Cache Coherence Models==&lt;br /&gt;
&lt;br /&gt;
The two prominent models for maintaining cache coherence are the snoopy-bus protocol and the directory-based coherence protocol. The directory-based protocol is used for distributed memory systems, in which a directory entry keeps track of which processors hold each block of data. In the snoopy-bus protocol, every processor monitors the reads and writes that are being serviced by the memory bus. Read and write requests are broadcast on the bus, and each processor responds depending on whether it holds that data in its cache. This article looks at shared memory systems.&lt;br /&gt;
&lt;br /&gt;
=== Write-Update ===&lt;br /&gt;
&lt;br /&gt;
In this approach, a write request is broadcast to all the processors, and each processor updates its local cache with the new value of the data. If a read misses in one processor's local cache, it can be served from any other processor's cache, as the copy has been updated in all caches. In terms of writes this saves bus bandwidth, as a single broadcast is enough to bring all the caches up to date with the newest value of the data.&lt;br /&gt;
&lt;br /&gt;
===Write-Invalidate===&lt;br /&gt;
This approach is common in systems where a data block is written by one processor but read by multiple processors. Whenever a write is made to shared data, an invalidate signal is sent to all the caches, asking them to mark that data block invalid because their copy no longer holds the latest value. This may cause additional traffic on the bus; as a result, a separate bus for invalidation requests may be included in the design.&lt;br /&gt;
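&lt;br /&gt;
The trade-off between the two policies depends on the access pattern. The following Python sketch uses a deliberately simple cost model, which is an assumption of this example and not a measurement: one processor writes a shared block several times and then other processors read it, every cache starts with a valid copy, and each broadcast or refetch costs one bus transaction. The function name bus_transactions is hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
def bus_transactions(policy, writes, readers):
    """Count bus transactions for one writer followed by several readers.

    Assumes every cache starts with a valid copy of the block.
    policy is "update" or "invalidate".
    """
    count = 0
    others_valid = True                    # do the other caches hold a valid copy?
    for _ in range(writes):
        if policy == "update":
            count += 1                     # every write is broadcast to all caches
        elif others_valid:
            count += 1                     # first write broadcasts an invalidate
            others_valid = False           # later writes stay local
    if not others_valid:
        count += readers                   # each reader misses and refetches
    return count


# Columns: writes, readers, cost with update, cost with invalidate
for writes, readers in [(1, 3), (10, 1)]:
    print(writes, readers,
          bus_transactions("update", writes, readers),
          bus_transactions("invalidate", writes, readers))
```

Under this model, write-update is cheaper when a single write is followed by many readers, and write-invalidate is cheaper when one processor writes the block many times before anyone else reads it.&lt;br /&gt;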
&lt;br /&gt;
== Coherence in Write Buffers ==&lt;br /&gt;
&lt;br /&gt;
===Software-Based Coherence===&lt;br /&gt;
The software technique relies on the compiler to ensure that no dependencies exist between the STORE/LOAD accesses carried out by different processors on the shared memory. Based on indications from the programmer, the compiler makes sure that there are no incoherent accesses to shared memory. It is also possible to have a READ/WRITE buffer shared between all processors for access to main memory.&lt;br /&gt;
&lt;br /&gt;
===Hardware-Based Coherence===&lt;br /&gt;
In hardware-based protocols, accesses to the shared memory are communicated using hardware invalidate signals that are broadcast to all the processor memories. Hardware support may also be required for a LOAD to be broadcast to all the caches, so that a read can be performed directly from a remote memory. The following approaches maintain coherence in write buffers and caches for multiprocessor systems:&lt;br /&gt;
&lt;br /&gt;
====Unique Buffer per Processor==== &lt;br /&gt;
[[Image:Unique Buffer Per Processor.png|thumb|right|250px|Figure 3: Cache based system with unique buffer per processor &lt;br /&gt;
[[http://mprc.pku.cn/mentors/training/ISCAreading/1986/p434-dubois/p434-dubois.pdf]]]] &lt;br /&gt;
&lt;br /&gt;
In this configuration, every processor has its own unique buffer that holds the loads and stores of the local processor as well as the invalidates and loads of remote processors that want access to the shared data. A store request in the local cache causes that data block to be invalidated in all other caches and in main memory. A STORE is said to have been performed when&lt;br /&gt;
&lt;br /&gt;
# the request is serviced in the buffer and the cache is updated on a local hit, and&lt;br /&gt;
# the request is issued to the buffer and an invalidate request has been sent to all other processors that hold the shared data.&lt;br /&gt;
&lt;br /&gt;
This can be achieved in two ways, depending on the topology of the connection between the caches:&lt;br /&gt;
&lt;br /&gt;
# For a bus-based MP system, the LOADs that miss in the local cache and the STOREs that need invalidation in other caches are broadcast over the bus to all processor caches and memory. The STORE request is then considered performed with respect to all processors when the invalidation signals have been sent out to the private buffers of all the caches.&lt;br /&gt;
# For non-bus MP systems, the caches are connected in a point-to-point topology such as a mesh or a ring. When a store is encountered in one of the processors, the shared data is first locked in the shared memory to ensure atomic access, and then the invalidate signal is propagated along the point-to-point interconnect in a predetermined order. The STORE to shared memory is performed only when the invalidation has been propagated to all the processor caches and buffers. If any processor issues a LOAD to the address that is being STOREd, the LOAD request has to be rejected or buffered and serviced later, to maintain atomic access to the shared data block.&lt;br /&gt;
&lt;br /&gt;
====Separate Buffers for Local and Remote Accesses====&lt;br /&gt;
[[Image:Separate Invalidate Buffers.png|thumb|right|250px|Figure 4: Cache based system with separate write-invalidate buffers&lt;br /&gt;
[[http://mprc.pku.cn/mentors/training/ISCAreading/1986/p434-dubois/p434-dubois.pdf]]]] &lt;br /&gt;
&lt;br /&gt;
Another approach to maintaining coherence is for every processor to have separate buffers for local data requests and remote data requests. As seen in the figure, every processor has a Local buffer, which queues the local LOADs and STOREs, and a Remote buffer (also termed Invalidate buffer), which stores the invalidation requests and LOADs coming in from other processors.&lt;br /&gt;
In a bus-based system, this approach makes it difficult to maintain write atomicity, because different invalidate buffers may hold different numbers of pending invalidation requests, so the time at which the concerned data block is actually invalidated in each cache is uncertain. In non-bus-based systems, however, this approach succeeds in maintaining strong coherence, provided the invalidate signal makes sure that the data is invalidated at one cache before moving on to the next processor (rather than just pushing the invalidate request into the buffer and moving on).&lt;br /&gt;
&lt;br /&gt;
====Universal read/write Buffer====&lt;br /&gt;
&lt;br /&gt;
[[Image:Universal Read Write Buffer.png|thumb|right|300px|Figure 5: Universal Read/Write Buffer [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/6b_dm#cite_note-4 5]]]&lt;br /&gt;
&lt;br /&gt;
In this approach, a shared bus controller resides between the local bus and the memory bus, and the local bus is connected to the private caches of all the processors (illustrated in Figure 5). This technique supports multiple processors with the same or different write-invalidate coherence protocols, such as MSI, MESI and MOESI. The shared bus controller consists of the local bus controller, the data buffer and the address buffer. Each private cache provides two tri-state signal bits to the local bus: '''INTERVENE''' and '''SHARE'''.&lt;br /&gt;
&lt;br /&gt;
# INTERVENE is asserted when a cache wants to provide valid data to other caches.&lt;br /&gt;
# SHARE is asserted by a cache controller when the address matches one of its own tag addresses.&lt;br /&gt;
&lt;br /&gt;
=====''' Data Buffer '''=====&lt;br /&gt;
&lt;br /&gt;
The data buffer consists of three FIFOs&lt;br /&gt;
&lt;br /&gt;
# The ''non_block write-FIFO'' receives non-cacheable memory access data from the CPU. These requests do not require any cache eviction or line-fill cycles.&lt;br /&gt;
# The ''write-back FIFO'' is used for eviction cycles. If there is a cache miss and an eviction is required, the missing data is read from memory into the cache (line-fill cycle) just after the evicted data has been moved to this FIFO. The FIFO contents are written back to memory later. The CPU thus gets the new data, and the time taken by the eviction is hidden from it.&lt;br /&gt;
# The ''read-FIFO'' stores data arriving from the non_block write FIFO, the write-back FIFO or memory. Data is held in this FIFO until the local bus is clear and ready for new data.&lt;br /&gt;
&lt;br /&gt;
=====''' Address Buffer '''=====&lt;br /&gt;
&lt;br /&gt;
The address buffer also consists of three FIFOs &lt;br /&gt;
&lt;br /&gt;
# ''Non_block write FIFO'' - stores address of non-blocking accesses. The depth of this FIFO should be the same as the corresponding Data Buffer FIFO.&lt;br /&gt;
# ''Write-back FIFO'' - stores starting address of eviction cycle and holds the address until memory bus is free.&lt;br /&gt;
# ''Line-fill FIFO''- stores the starting address of line-fill cycle.&lt;br /&gt;
&lt;br /&gt;
The snoop logic starts whenever there is a read cycle to main memory. It compares the read request address with the entries of both the write-back FIFO and the line-fill FIFO. If there is a match, the HIT flag is set and the CPU gets the data immediately through the internal bypass path. This way stale data is not read, and the CPU does not pay the memory latency.&lt;br /&gt;
&lt;br /&gt;
For a non-block write cycle, the snoop logic compares the newly requested address and Byte Enable bits with the previous entry stored in the non_block write FIFO. If the addresses match but the Byte Enable bits do not overlap, the BGO signal is asserted and the pointer of the non_block write FIFO does not advance, which keeps the FIFO from filling up quickly.&lt;br /&gt;
&lt;br /&gt;
In this approach, read cycles have priority over write cycles, so whenever the memory bus is available, a pending read address gets direct access to it.&lt;br /&gt;
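&lt;br /&gt;
The two snoop checks described above can be sketched in Python. The sketch makes several assumptions that are not in the source: each FIFO is modelled as a Python list, the write-back and line-fill entries carry their data alongside the address, byte enables are sets of byte indices, and the phrase about the pointer not moving forward is read as merging the new write into the most recent entry. The function names snoop_read and non_block_write are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
def snoop_read(address, write_back_fifo, line_fill_fifo):
    """Return (hit, data) for a read cycle to main memory.

    Each FIFO is modelled as a list of (address, data) pairs, oldest first.
    """
    for fifo in (write_back_fifo, line_fill_fifo):
        # Search newest entries first so the most recent data wins.
        for entry_address, data in reversed(fifo):
            if entry_address == address:
                return True, data          # HIT: data returned over the bypass path
    return False, None                     # miss: the read goes to main memory


def non_block_write(fifo, address, byte_enables, data):
    """Queue a non-block write, merging it into the last entry when possible.

    Returns True when the write was merged (the case in which BGO is asserted).
    """
    if fifo:
        last = fifo[-1]
        if last["address"] == address and last["bytes"].isdisjoint(byte_enables):
            last["bytes"] |= set(byte_enables)
            last["data"].update(data)
            return True                    # FIFO pointer does not advance
    fifo.append({"address": address, "bytes": set(byte_enables), "data": dict(data)})
    return False


write_back = [(0x100, "evicted line")]
print(snoop_read(0x100, write_back, []))   # address matches a pending write back
print(snoop_read(0x200, write_back, []))   # no match, memory must be read

queue = []
print(non_block_write(queue, 0x40, {0, 1}, {0: 0xAA, 1: 0xBB}))
print(non_block_write(queue, 0x40, {2, 3}, {2: 0xCC, 3: 0xDD}))
print(len(queue))
```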
&lt;br /&gt;
=====''' Local Bus Controller '''=====&lt;br /&gt;
&lt;br /&gt;
The controller monitors the local bus for the status bits (M, O, E, S, I) and the INTERVENE# and SHARE# signals, and determines whether the cycle needs to access main memory or a private cache. For a main memory access, the controller informs the buffer with the WB (Write Back), LF (Line Fill) or O3 (owned) signal.&lt;br /&gt;
&lt;br /&gt;
Algorithm to generate the controller signals:&lt;br /&gt;
&lt;br /&gt;
For the MSI and MESI protocols:&lt;br /&gt;
&lt;br /&gt;
If there is a read miss on the local bus&lt;br /&gt;
  INTERVENE is asserted.&lt;br /&gt;
If there is a write miss with M status on the local bus&lt;br /&gt;
  Write back cycle will be performed&lt;br /&gt;
  Generate WB signal.&lt;br /&gt;
If there is a read miss or write miss from any of the protocols&lt;br /&gt;
  LF cycle will be performed&lt;br /&gt;
  Generate LF signal.&lt;br /&gt;
&lt;br /&gt;
For the MOESI protocol:&lt;br /&gt;
&lt;br /&gt;
If there is a read miss on the local bus&lt;br /&gt;
  INTERVENE is asserted.&lt;br /&gt;
&lt;br /&gt;
The above outputs can be written in Boolean equations as shown in the example&amp;lt;ref&amp;gt;http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=00342629&amp;lt;/ref&amp;gt; below:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
      WB = RD * MOESI * INTV + WR * [STATUS=M]&lt;br /&gt;
      LF = INTV&lt;br /&gt;
      O3 = RD * MOESI * INTV&lt;br /&gt;
      MOESI means the cycle is initiated by a cache using the MOESI protocol.&lt;br /&gt;
      INTV is the INTERVENE signal, which is asserted when any cache provides valid data to another cache.&lt;br /&gt;
&lt;br /&gt;
=====''' Algorithms '''=====&lt;br /&gt;
&lt;br /&gt;
The algorithms followed by each of the processors in a multiprocessor system are given below. The processors may follow different protocols (MSI, MESI or MOESI), but these algorithms ensure that memory accesses are carried out coherently.&lt;br /&gt;
&lt;br /&gt;
''MOESI''    The algorithm followed for the MOESI protocol is as follows:&lt;br /&gt;
&lt;br /&gt;
#	A cache holding the line in the M state provides the data when another cache initiates a read that hits the line, and changes its state from M to O.&lt;br /&gt;
#	If the same data is hit again, the cache holding it in the O state is responsible for providing the data to the requesting cache.&lt;br /&gt;
#	A write cycle needs a line fill if there is a cache miss and INTERVENE is not asserted.&lt;br /&gt;
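&lt;br /&gt;
The first two steps can be sketched as a small state update in Python. The function name remote_read and the cache names are hypothetical, and the handling of E and S lines (E becomes S on a remote read, and a requester with no other valid copy receives E) follows the usual textbook MOESI behaviour rather than anything stated in the list above.&lt;br /&gt;
&lt;br /&gt;
```python
def remote_read(states, requester):
    """Apply a read miss by `requester` to one line; return who supplies the data.

    states maps a cache name to its MOESI state letter for that line.
    """
    supplier = "memory"                    # used when no cache asserts INTERVENE
    for cache, state in states.items():
        if cache == requester:
            continue
        if state in ("M", "O"):
            supplier = cache               # the owner intervenes and supplies data
            states[cache] = "O"            # M becomes O; O stays O
        elif state == "E":
            states[cache] = "S"            # assumption: textbook E to S transition
    others_valid = any(s != "I" for c, s in states.items() if c != requester)
    states[requester] = "S" if others_valid else "E"
    return supplier


line = {"P0": "M", "P1": "I", "P2": "I"}
print(remote_read(line, "P1"), line)       # P0 supplies the data and moves to O
print(remote_read(line, "P2"), line)       # P0, now in O, supplies the data again
```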
&lt;br /&gt;
&lt;br /&gt;
''MSI'' and ''MESI''	The algorithms for the MSI and MESI protocols are as follows. (The MSI and MESI protocols have no O state.)&lt;br /&gt;
&lt;br /&gt;
#	A cache holding the line in the M state of the MSI or MESI protocol provides the data to the other cache and to main memory, and changes its state from M to S.&lt;br /&gt;
#	The local bus controller is responsible for changing the state from S to O for the MOESI protocol.&lt;br /&gt;
#	The M state of MSI or MESI still changes to S.&lt;br /&gt;
#	If a write cycle is initiated by another CPU (MSI or MESI),&lt;br /&gt;
::	If the MOESI cache has a valid line with M or O status&lt;br /&gt;
:::  	It sends the data line out on the local bus and changes its state to I.&lt;br /&gt;
:::	[Because the line will be written by the other cache and its copy becomes outdated]&lt;br /&gt;
&lt;br /&gt;
:5.	If the status is E or S, main memory provides the data and no cache is involved in the data transfer. However, the cache with the MOESI protocol becomes the owner of that line.&lt;br /&gt;
&lt;br /&gt;
:6.	Force the status bit to be O instead of S.&lt;br /&gt;
&lt;br /&gt;
=='''References'''==&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;br /&gt;
[http://mprc.pku.cn/mentors/training/ISCAreading/1986/p434-dubois/p434-dubois.pdf Memory Access Buffering In Multiprocessors] Michel Dubois, Christoph Scheurich, Faye Briggs&lt;br /&gt;
&lt;br /&gt;
[http://www.ece.cmu.edu/~ece548/handouts/19coher.pdf Multiprocessor Consistency and Coherence] Memory System Architecture, Philip Koopman&lt;br /&gt;
&lt;br /&gt;
[http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=528915 Write buffer design for cache-coherent shared-memory multiprocessors], Fernaz Mounes-Toussi, David J. Lilja&lt;/div&gt;</summary>
		<author><name>Sbasu3</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/6b_am&amp;diff=59296</id>
		<title>CSC/ECE 506 Spring 2012/6b am</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/6b_am&amp;diff=59296"/>
		<updated>2012-03-04T15:20:49Z</updated>

		<summary type="html">&lt;p&gt;Sbasu3: /*  Local Bus Controller  */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Introduction ==&lt;br /&gt;
&lt;br /&gt;
Processor speeds are increasing at a much faster rate than memory speeds, so the data transactions between the processor and the memory system must be managed such that slow memory does not limit processor performance. A read requires the processor to wait for the operation to complete before resuming execution, but a write does not. This is where a write buffer (WB) comes into the picture: the processor hands the write to the buffer and continues its operation while the write buffer takes complete responsibility for executing the write.&lt;br /&gt;
&lt;br /&gt;
===Write Buffers in Uni-processors===&lt;br /&gt;
&lt;br /&gt;
A write to be performed is put in a buffer implemented as a FIFO queue, so that writes are performed in the order in which they were issued. In a uni-processor model that extracts Instruction Level Parallelism (ILP), writes may also be issued out of order, provided that hardware or software checks the writes for any dependences that exist in the instruction stream. The following figure shows a cache-based single processor system with a write buffer.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:Uniprocessor With WB.png|thumb|center|400px|Figure 1. Uni-processor cache-based system with write buffer]]&lt;br /&gt;
&lt;br /&gt;
===Write Buffer Issues in Multiprocessors===&lt;br /&gt;
&lt;br /&gt;
In a multiprocessor system, the same design can be extended, as shown in the figure. In this design, each processor has its own private cache and a write buffer corresponding to that cache. The caches are connected to each other and to main memory by an interconnect.&lt;br /&gt;
 &lt;br /&gt;
[[Image:Multiprocessor with WB.png|thumb|center|400px|Figure 2. Cache-based multiprocessor system with write buffer]]&lt;br /&gt;
&lt;br /&gt;
As can be seen in the figure above, each processor performs a write by pushing it into its write buffer, and the write buffer then performs the write to the cache or main memory. If the processor issues another write to the same address as an earlier pending write, the earlier value is overwritten with the new one in the write buffer. Similarly, if the processor issues a read to the same address, it reads the value from the write buffer rather than going to the cache or main memory.&lt;br /&gt;
&lt;br /&gt;
===The Coherence Problem===&lt;br /&gt;
&lt;br /&gt;
Consider the case where a write (ST_A) has been issued by processor I into its write buffer WB_I, and the write is waiting to be executed. Since the write has not yet been performed to the cache or main memory, the other processors have no knowledge of the change made to address A. As a result, a read of address A by another processor takes the value from its own cache, its own write buffer, or main memory (depending on whether it hits or misses in the cache), and that value is stale. It is the job of the designer to employ protocols that maintain the sequential ordering of the instructions and propagate a write made in any one cache to all of the processor caches.&lt;br /&gt;
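&lt;br /&gt;
A minimal Python sketch of this situation follows. The names write_buffer_i, store, load_from_memory and drain are hypothetical, the shared memory is modelled as a plain dictionary, and caches are left out for brevity.&lt;br /&gt;
&lt;br /&gt;
```python
memory = {"A": 0}
write_buffer_i = []            # WB_I: pending (address, value) stores of processor I


def store(buffer, address, value):
    # The store only enters the issuing processor's private buffer.
    buffer.append((address, value))


def load_from_memory(address):
    # Another processor cannot see remote buffers; it reads shared memory.
    return memory[address]


def drain(buffer):
    # Perform the pending stores in FIFO order.
    while buffer:
        address, value = buffer.pop(0)
        memory[address] = value


store(write_buffer_i, "A", 42)            # ST_A waits in WB_I
stale = load_from_memory("A")             # processor J still reads the old value
drain(write_buffer_i)                     # the write is finally performed
fresh = load_from_memory("A")             # only now is the new value visible
print(stale, fresh)
```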
&lt;br /&gt;
==Sequential Consistency==&lt;br /&gt;
&lt;br /&gt;
When operating as part of a multiprocessor, it is not enough to check dependencies between writes only at the local level. Because data is shared and the course of events in one processor may affect the outcome in another, the sequence of writes and the data dependencies must be preserved across all processors.&lt;br /&gt;
“A system is said to be sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.”&lt;br /&gt;
There are different approaches to this requirement, based on the expected outcome of the program. The conditions of sequential consistency and logical ordering may be relaxed according to the requirements of the program being parallelized.&lt;br /&gt;
&lt;br /&gt;
===Strong Ordering===&lt;br /&gt;
The requirements for strong ordering are as follows:&lt;br /&gt;
 &lt;br /&gt;
1) All memory operations appear to execute one at a time.&lt;br /&gt;
&lt;br /&gt;
2) All memory operations from a single CPU appear to execute in-order.&lt;br /&gt;
&lt;br /&gt;
3) All memory operations from different processors are “cleanly” interleaved with each other (serialization)&lt;br /&gt;
&lt;br /&gt;
===Total Store Ordering===&lt;br /&gt;
Requirements are as follows:&lt;br /&gt;
&lt;br /&gt;
1)	Relaxed consistency in which stores must complete in order, but a store need not complete before a later read of a given location takes place.&lt;br /&gt;
&lt;br /&gt;
2)	Reads are allowed to bypass pending writes, but writes MUST exit the store buffer in FIFO order.&lt;br /&gt;
&lt;br /&gt;
===Partial Store Ordering===&lt;br /&gt;
&lt;br /&gt;
Requirement is as follows:&lt;br /&gt;
&lt;br /&gt;
Even more relaxed consistency, in which stores to any given memory location complete in order, but stores to different locations may complete out of order, and a store need not complete before a later read of a given location takes place.&lt;br /&gt;
&lt;br /&gt;
===Weak Ordering===&lt;br /&gt;
Requirement is as follows:&lt;br /&gt;
&lt;br /&gt;
The most relaxed consistency, in which memory operations may complete in any order except at barrier synchronization points. The global memory state must be completely settled at each synchronization point, and between synchronization points the memory state may correspond to any ordering of reads and writes.&lt;br /&gt;
&lt;br /&gt;
===Example===&lt;br /&gt;
Example for sequential consistency:&lt;br /&gt;
In the following program, the variables “a”, “b”, “flag1” and “flag2” are initialized to 0, and both processors (CPU1 and CPU2) share all the variables.&lt;br /&gt;
   &lt;br /&gt;
    a = b = flag1 = flag2 = 0;		// initial value&lt;br /&gt;
    CPU1				CPU2&lt;br /&gt;
    flag1 = 1;				flag2 = 1;&lt;br /&gt;
    a = 1;				a = 2;&lt;br /&gt;
    r1 = a;				r3 = a;&lt;br /&gt;
    r2 = flag2;				r4 = flag1;&lt;br /&gt;
&lt;br /&gt;
The SPARC V8 architecture follows the Total Store Ordering model, which allows a write followed by a read to complete out of program order. One possible result is r1 = 1, r3 = 2, r2 = r4 = 0.&lt;br /&gt;
Strong ordering, in contrast, enforces strict atomicity, so the result depends only on how the two processors’ instructions interleave, and r2 = r4 = 0 cannot occur: whichever processor reads the other’s flag last must see it set.&lt;br /&gt;
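&lt;br /&gt;
The difference between the two models can be checked by brute force. The following Python sketch is an illustration only, not part of the SPARC specification: it assumes that each CPU has a private FIFO store buffer with read forwarding (a common way to model Total Store Ordering), and the names run_sc and run_tso are made up for this example. It enumerates every outcome (r1, r2, r3, r4) of the program above under both models.&lt;br /&gt;
&lt;br /&gt;
```python
from itertools import combinations

# Each CPU's program, in program order.
# ("st", variable, value) is a store; ("ld", variable, register) is a load.
CPU1 = [("st", "flag1", 1), ("st", "a", 1), ("ld", "a", "r1"), ("ld", "flag2", "r2")]
CPU2 = [("st", "flag2", 1), ("st", "a", 2), ("ld", "a", "r3"), ("ld", "flag1", "r4")]
INITIAL = {"a": 0, "flag1": 0, "flag2": 0}
REGISTERS = ("r1", "r2", "r3", "r4")


def run_sc():
    """Outcomes under strong ordering: every interleaving, no buffering."""
    results = set()
    total = len(CPU1) + len(CPU2)
    for cpu1_slots in combinations(range(total), len(CPU1)):
        mem, regs = dict(INITIAL), {}
        streams = {True: iter(CPU1), False: iter(CPU2)}
        for step in range(total):
            kind, var, x = next(streams[step in cpu1_slots])
            if kind == "st":
                mem[var] = x
            else:
                regs[x] = mem[var]
        results.add(tuple(regs[r] for r in REGISTERS))
    return results


def run_tso():
    """Outcomes when each CPU's stores pass through a private FIFO store buffer."""
    programs = (CPU1, CPU2)
    results, seen = set(), set()

    def explore(pcs, bufs, mem, regs):
        state = (pcs, bufs, mem, regs)
        if state in seen:
            return
        seen.add(state)
        if all(pcs[c] == len(programs[c]) and not bufs[c] for c in (0, 1)):
            final = dict(regs)
            results.add(tuple(final[r] for r in REGISTERS))
            return
        for c in (0, 1):
            if bufs[c]:
                # Drain the oldest buffered store to memory (FIFO order).
                var, val = bufs[c][0]
                new_mem = dict(mem)
                new_mem[var] = val
                new_bufs = list(bufs)
                new_bufs[c] = bufs[c][1:]
                explore(pcs, tuple(new_bufs), tuple(sorted(new_mem.items())), regs)
            if pcs[c] != len(programs[c]):
                kind, var, x = programs[c][pcs[c]]
                new_pcs = list(pcs)
                new_pcs[c] += 1
                if kind == "st":
                    # A store only enters the buffer; memory is unchanged for now.
                    new_bufs = list(bufs)
                    new_bufs[c] = bufs[c] + ((var, x),)
                    explore(tuple(new_pcs), tuple(new_bufs), mem, regs)
                else:
                    # A load sees the CPU's own newest buffered store, else memory.
                    own = [v for name, v in bufs[c] if name == var]
                    value = own[-1] if own else dict(mem)[var]
                    new_regs = tuple(sorted(regs + ((x, value),)))
                    explore(tuple(new_pcs), bufs, mem, new_regs)

    explore((0, 0), ((), ()), tuple(sorted(INITIAL.items())), ())
    return results
```
&lt;br /&gt;
Under these assumptions, the outcome (1, 0, 2, 0) appears in run_tso() but not in run_sc(), and every outcome of run_sc() is also an outcome of run_tso().&lt;br /&gt;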
&lt;br /&gt;
===Effects on Write Buffer Operation===&lt;br /&gt;
&lt;br /&gt;
// Explain here what ordering has to do with the write buffer coherency&lt;br /&gt;
&lt;br /&gt;
==Cache Coherence Models==&lt;br /&gt;
&lt;br /&gt;
The two prominent models for maintaining cache coherence are the snoopy-bus protocol and the directory-based coherence protocol. The directory-based protocol is used for distributed memory systems, in which a directory entry is used to keep track of what data each processor is storing. In the snoopy-bus protocol, every processor monitors the reads and writes that are being serviced by the memory bus. Read and write requests are broadcast on the bus, and each processor responds based on whether or not it holds that data in its cache. We will be looking at shared memory systems in this article.&lt;br /&gt;
&lt;br /&gt;
=== Write-Update ===&lt;br /&gt;
&lt;br /&gt;
In this approach, a write request is broadcast to all the processors, and each processor updates its local cache with the new value of the data. If a read misses in a processor’s local cache, it can be serviced from any other processor’s cache, as the copy has been updated in all caches. This saves bus bandwidth in terms of writes, as a single broadcast is enough to bring all the caches up to date with the newest value of the data.&lt;br /&gt;
&lt;br /&gt;
===Write-Invalidate===&lt;br /&gt;
This approach is common in systems where a data block is written by one processor but read by multiple processors. Whenever a write is made to shared data, an invalidate signal is sent to all the caches, asking them to mark that data block invalid, as their copy no longer holds the latest value. This may cause additional traffic on the bus; as a result, a separate bus for invalidation requests may be included in the design.&lt;br /&gt;
&lt;br /&gt;
== Coherence in Write Buffers ==&lt;br /&gt;
&lt;br /&gt;
===Software-Based Coherence===&lt;br /&gt;
The software technique relies on the compiler to ensure that no dependencies exist between the STORE/LOAD accesses carried out by different processors on the shared memory. Based on indications from the programmer, the compiler makes sure that there are no incoherent accesses to shared memory. It is also possible to have a READ/WRITE buffer shared between all processors for access to the main memory. &lt;br /&gt;
&lt;br /&gt;
===Hardware-Based Coherence===&lt;br /&gt;
In hardware-based protocols, accesses to the shared memory are communicated using hardware invalidate signals that are broadcast to all the processor memories. Hardware support may also be required for a LOAD to be broadcast to all the caches, so that a read can be performed directly from a remote memory. The following approaches maintain coherence in write buffers and caches for multiprocessor systems:&lt;br /&gt;
&lt;br /&gt;
====Unique Buffer per Processor==== &lt;br /&gt;
[[Image:Unique Buffer Per Processor.png|thumb|right|250px|Figure 3: Cache based system with unique buffer per processor &lt;br /&gt;
[[http://mprc.pku.cn/mentors/training/ISCAreading/1986/p434-dubois/p434-dubois.pdf]]]] &lt;br /&gt;
&lt;br /&gt;
In this configuration, every processor has its own unique buffer that holds the loads and stores of the local processor as well as the invalidates and loads of remote processors wanting access to the shared data. A store request in the local cache triggers invalidation of that data block in all other caches and main memory. A STORE is said to have been performed when &lt;br /&gt;
&lt;br /&gt;
# the request is serviced in the buffer and the cache is updated on a local hit, and&lt;br /&gt;
# the request is issued to the buffer and an invalidate request has been sent to all other processors sharing the data.&lt;br /&gt;
&lt;br /&gt;
This can be achieved in two ways, depending on the topology of the connection between the caches:&lt;br /&gt;
&lt;br /&gt;
# For a bus-based MP system, the LOADs that miss in the local cache and the STOREs that need invalidation in other caches are broadcast over the bus to all processor caches and memory. The STORE request is then considered performed with respect to all processors when the invalidation signals have been sent out to the private buffers of all the caches. &lt;br /&gt;
# For non-bus MP systems, the caches are connected in a point-to-point topology such as a mesh or a ring. When a store is encountered in one of the processors, the shared data is first locked in the shared memory to ensure atomic access, and then the invalidate signal is propagated along the point-to-point interconnect in a predetermined order. The STORE to shared memory is performed only when the invalidation has been propagated to all the processor caches and buffers. If any processor issues a LOAD to the address that is being STOREd, the LOAD request has to be rejected or buffered to be serviced at a later time, to maintain atomic access to the shared data block.&lt;br /&gt;
&lt;br /&gt;
====Separate Buffers for Local and Remote Accesses====&lt;br /&gt;
[[Image:Separate Invalidate Buffers.png|thumb|right|250px|Figure 4: Cache based system with separate write-invalidate buffers&lt;br /&gt;
[[http://mprc.pku.cn/mentors/training/ISCAreading/1986/p434-dubois/p434-dubois.pdf]]]] &lt;br /&gt;
&lt;br /&gt;
Another approach to maintaining coherence is for every processor to have separate buffers for local data requests and remote data requests. As seen in the figure, every processor has a Local-buffer, which queues the local LOADs and STOREs, and a Remote-buffer (also termed the Invalidate buffer), which stores the invalidation requests and LOADs coming in from other processors.&lt;br /&gt;
In a bus-based system, this approach makes it difficult to maintain write atomicity, as different invalidate buffers may hold different numbers of invalidation requests, which makes the time at which the data block is actually invalidated in each cache uncertain. In non-bus-based systems, however, this approach succeeds in maintaining strong coherence, provided the invalidate signal makes sure that the data is invalidated at a cache before moving on to the next processor (rather than just pushing the invalidate request into the buffer and moving on).&lt;br /&gt;
&lt;br /&gt;
====Universal read/write Buffer====&lt;br /&gt;
&lt;br /&gt;
[[Image:Universal Read Write Buffer.png|thumb|right|300px|Figure 5: Universal Read/Write Buffer [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/6b_dm#cite_note-4 5]]]&lt;br /&gt;
&lt;br /&gt;
In this approach, a shared bus controller resides between the local bus and the memory bus, where the local bus is connected to all the private caches of the processors (illustrated in Figure 5). This technique supports multiple processors with the same or different write-invalidate coherence protocols, such as MSI, MESI and MOESI. The shared bus controller consists of the local bus controller, the data buffer and the address buffer. Each private cache provides two single-bit tri-state signals to the local bus: '''INTERVENE''' and '''SHARE'''.&lt;br /&gt;
&lt;br /&gt;
# INTERVENE is asserted when a cache wants to provide valid data to other caches.&lt;br /&gt;
# SHARE is asserted by a cache controller when an address matches one of its own tag addresses.&lt;br /&gt;
&lt;br /&gt;
=====''' Data Buffer '''=====&lt;br /&gt;
&lt;br /&gt;
The data buffer consists of three FIFOs:&lt;br /&gt;
&lt;br /&gt;
# The ''non_block write-FIFO'' receives non-cacheable memory access data from the CPU. These requests do not require any cache eviction or line-fill cycles. &lt;br /&gt;
# The ''write-back FIFO'' is used for eviction cycles. If there is a cache miss and an eviction is required, the missing data is read from memory into the cache (line-fill cycle) just after the evicted data has moved to the FIFO. The contents of this FIFO are written back to memory later on. The CPU thus gets the new data, while the time taken by the eviction is hidden from it. &lt;br /&gt;
# The ''read-FIFO'' is used for storing data from the non_block write FIFO, the write-back FIFO or memory. Data is temporarily stored in this FIFO until the local bus is clear and ready for new data.&lt;br /&gt;
&lt;br /&gt;
=====''' Address Buffer '''=====&lt;br /&gt;
&lt;br /&gt;
The address buffer also consists of three FIFOs: &lt;br /&gt;
&lt;br /&gt;
# ''Non_block write FIFO'' - stores the addresses of non-blocking accesses. The depth of this FIFO should be the same as that of the corresponding Data Buffer FIFO.&lt;br /&gt;
# ''Write-back FIFO'' - stores the starting address of an eviction cycle and holds the address until the memory bus is free.&lt;br /&gt;
# ''Line-fill FIFO'' - stores the starting address of a line-fill cycle.&lt;br /&gt;
&lt;br /&gt;
Snoop logic is started whenever there is a read cycle to main memory. The snoop logic compares the read request address with the entries in both the write-back FIFO and the line-fill FIFO. If there is a match, the HIT flag is set and the CPU gets the data immediately through the internal bypass path. As a result, stale data is never read and the CPU does not have to wait out the memory latency.&lt;br /&gt;
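&lt;br /&gt;
The read-path check can be sketched in Python as follows. This is a toy model under stated assumptions: the class SharedBusBuffer and its methods are hypothetical, each write-back entry is assumed to carry its data alongside its address, and the line-fill FIFO comparison is left out for brevity.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import deque


class SharedBusBuffer:
    """Toy model of the write-back FIFO with a snoop check on reads."""

    def __init__(self, memory):
        self.memory = memory          # dict: line address to data
        self.write_back = deque()     # (address, data) pairs waiting for the memory bus

    def evict(self, address, data):
        """Queue an evicted line; memory is only updated later by drain_one()."""
        self.write_back.append((address, data))

    def drain_one(self):
        """Write the oldest evicted line back to memory once the bus is free."""
        if self.write_back:
            address, data = self.write_back.popleft()
            self.memory[address] = data

    def snoop_read(self, address):
        """Return (hit, data); on a hit the newest matching entry is bypassed."""
        for queued_address, data in reversed(self.write_back):
            if queued_address == address:
                return True, data     # HIT: served from the FIFO, not from memory
        return False, self.memory.get(address)
```
&lt;br /&gt;
A read that matches a queued eviction returns the queued data with the hit flag set, so the stale copy in memory is never observed; once the entry has drained, the same read is served from memory.&lt;br /&gt;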
&lt;br /&gt;
For a non-block write cycle, the snoop logic compares the newly requested address and Byte Enable bits with those previously stored in the non_block write FIFO. If there is an address match but no Byte-Enable overlap, the BGO signal is asserted and the pointer of the non_block write FIFO does not move forward, which prevents the FIFO from filling up quickly.&lt;br /&gt;
&lt;br /&gt;
In this approach, read cycles have priority over write cycles, so whenever the memory bus is available, a read address gets direct access to the memory bus.&lt;br /&gt;
&lt;br /&gt;
=====''' Local Bus Controller '''=====&lt;br /&gt;
&lt;br /&gt;
The controller monitors the local bus for the status bits (M, O, E, S, I) and the INTERVENE# and SHARE# bits, and determines whether the cycle needs to access main memory or a private cache. For a main memory access, the controller informs the buffer with the WB (Write Back), LF (Line Fill) or O3 (owned) status signal.&lt;br /&gt;
&lt;br /&gt;
Algorithm to generate the controller signals:&lt;br /&gt;
&lt;br /&gt;
For the MSI and MESI protocols: &lt;br /&gt;
&lt;br /&gt;
If there is a read miss on the local bus&lt;br /&gt;
:INTERVENE is asserted.&lt;br /&gt;
If there is a write miss with M status on the local bus&lt;br /&gt;
:A write-back cycle is performed.&lt;br /&gt;
:The WB signal is generated.&lt;br /&gt;
If there is a read miss or write miss from any of the protocols&lt;br /&gt;
:An LF cycle is performed.&lt;br /&gt;
:The LF signal is generated.&lt;br /&gt;
&lt;br /&gt;
For the MOESI protocol:&lt;br /&gt;
&lt;br /&gt;
If there is a read miss on the local bus&lt;br /&gt;
:INTERVENE is asserted.&lt;br /&gt;
&lt;br /&gt;
The above outputs can be written in Boolean equations as shown in the example&amp;lt;ref&amp;gt;http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=00342629&amp;lt;/ref&amp;gt; below:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
      WB = RD * MOESI * INTV + WR * [STATUS=M]&lt;br /&gt;
      LF =  INTV&lt;br /&gt;
      O3 = RD * MOESI * INTV&lt;br /&gt;
      MOESI means the cycle is initiated by the MOESI protocol.&lt;br /&gt;
      INTV means the INTERVENE signal, which is asserted when any cache provides valid data to another cache.&lt;br /&gt;
&lt;br /&gt;
=====''' Algorithms '''=====&lt;br /&gt;
&lt;br /&gt;
The algorithms followed by each of the processors in a multiprocessor system are given below. The processors may follow different protocols, i.e. MSI, MESI or MOESI, but the following algorithms ensure that memory accesses are carried out coherently. &lt;br /&gt;
&lt;br /&gt;
''MOESI''    The algorithm followed for the MOESI protocol is as follows:&lt;br /&gt;
&lt;br /&gt;
#	A cache holding a line in the M state provides the data to another cache on a read hit initiated by that cache, and changes its state from M to O.&lt;br /&gt;
#	If the same data is hit again, the cache with the O state is responsible for providing the data to the requesting cache.&lt;br /&gt;
#	A write cycle needs a line fill if there is a cache miss and no INTERVENE. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
''MSI'' and ''MESI''	The algorithms for the MSI and MESI protocols are as follows. (The MSI and MESI protocols have no O state.)&lt;br /&gt;
&lt;br /&gt;
#	A cache with a line in the M state under the MSI or MESI protocol provides the data to the other cache and to main memory, and changes the state from M to S.&lt;br /&gt;
#	The local bus controller is responsible for changing the state from S to O for the MOESI protocol. &lt;br /&gt;
#	The M state of MSI or MESI still changes to S. &lt;br /&gt;
#	If a write cycle is initiated by another CPU (MSI or MESI),&lt;br /&gt;
::	If the MOESI cache has a valid line with M or O status&lt;br /&gt;
:::  	It sends the data line out on to the local bus and changes the state to I.&lt;br /&gt;
:::	[Because the line will be written by the other cache and become outdated]&lt;br /&gt;
&lt;br /&gt;
:5.	If the status is E or S, main memory provides the data and no cache is involved in the data transfer. However, the cache with the MOESI protocol becomes the owner of that particular line.&lt;br /&gt;
&lt;br /&gt;
:6.	Force the status bit to be O instead of S.&lt;br /&gt;
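&lt;br /&gt;
The snoop responses listed above can be summarised in a short Python sketch. It is an interpretation of the lists, not the controller's actual logic: the function names and the returned tuples are hypothetical, the E to S change on a snooped read is the usual MESI behaviour rather than something stated above, and the forced S to O adjustment made by the local bus controller is not modelled.&lt;br /&gt;
&lt;br /&gt;
```python
def on_remote_read(protocol, state):
    """Return (next_state, cache_supplies_data, memory_updated) for a snooped read."""
    if protocol not in ("MSI", "MESI", "MOESI"):
        raise ValueError("unknown protocol: " + protocol)
    if state == "M":
        if protocol == "MOESI":
            # Assumed: the cache keeps ownership, so memory is not updated yet.
            return "O", True, False
        # MSI and MESI: data goes to the requester and to main memory.
        return "S", True, True
    if state == "O":
        # The owner keeps supplying the line on later hits.
        return "O", True, False
    if state in ("E", "S"):
        # Main memory supplies the data; no cache takes part in the transfer.
        return "S", False, False
    return "I", False, False


def on_remote_write(state):
    """Return (next_state, cache_supplies_data) when another CPU writes the line."""
    # A line held in M or O is sent out on the local bus before being invalidated.
    return "I", state in ("M", "O")
```
&lt;br /&gt;
For example, on_remote_read("MOESI", "M") yields the O state with the cache supplying the data, whereas on_remote_read("MESI", "M") yields the S state and also updates main memory.&lt;br /&gt;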
&lt;br /&gt;
=='''References'''==&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;br /&gt;
[http://mprc.pku.cn/mentors/training/ISCAreading/1986/p434-dubois/p434-dubois.pdf Memory Access Buffering In Multiprocessors] Michel Dubois, Christoph Scheurich, Faye Briggs&lt;br /&gt;
&lt;br /&gt;
[http://www.ece.cmu.edu/~ece548/handouts/19coher.pdf Multiprocessor Consistency and Coherence] Memory System Architecture, Philip Koopman&lt;br /&gt;
&lt;br /&gt;
[http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=528915 Write buffer design for cache-coherent shared-memory multiprocessors], Fernaz Mounes-Toussi, David J. Lilja&lt;/div&gt;</summary>
		<author><name>Sbasu3</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/6b_am&amp;diff=59295</id>
		<title>CSC/ECE 506 Spring 2012/6b am</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/6b_am&amp;diff=59295"/>
		<updated>2012-03-04T15:16:26Z</updated>

		<summary type="html">&lt;p&gt;Sbasu3: /*  Address Buffer  */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;==Introduction ==&lt;br /&gt;
&lt;br /&gt;
With present-day processor speeds increasing at a much faster rate than memory speeds, data transactions between the processor and the memory system must be managed so that slow memory does not limit processor performance. A read operation requires the processor to wait for the read to complete before resuming execution, but a write operation has no such requirement. This is where a write buffer (WB) comes into the picture: it assists the processor with writes, so that the processor can continue its operation while the write buffer takes complete responsibility for executing the write. &lt;br /&gt;
&lt;br /&gt;
===Write Buffers in Uni-processors===&lt;br /&gt;
&lt;br /&gt;
A write to be performed is placed in a buffer implemented as a FIFO queue, so that writes are performed in the order in which they were issued. In a uni-processor model, where Instruction Level Parallelism (ILP) can be extracted, writes may also be issued out of order, provided that hardware or software protocols check the writes for any dependences that exist in the instruction stream. The following figure shows a cache-based single-processor system with a write buffer. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[Image:Uniprocessor With WB.png|thumb|center|400px|Figure 1.Uni-processor cache based system with write buffer ]]&lt;br /&gt;
&lt;br /&gt;
===Write Buffer Issues in Multiprocessors===&lt;br /&gt;
&lt;br /&gt;
In a multiprocessor system, the same design can be extended, as shown in the figure. In this design, each processor has its own private cache and a write buffer corresponding to that cache. The caches are connected to each other and to the main memory by means of an interconnect. &lt;br /&gt;
 &lt;br /&gt;
[[Image:Multiprocessor with WB.png|thumb|center|400px|Figure 2. Cache- based multiprocessor system with write buffer ]] &lt;br /&gt;
&lt;br /&gt;
As can be seen in the figure above, each processor makes a write by pushing it into the write buffer, and the write buffer completes the task of performing the write to the cache or main memory. If the processor issues another write to the same address as an earlier write still in the buffer, the earlier value is overwritten with the new one in the write buffer. Similarly, if a read is issued to the same address, the processor reads the value from the write buffer rather than going to the cache or the main memory.&lt;br /&gt;
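&lt;br /&gt;
The buffering behaviour described above can be sketched in a few lines of Python. This is a minimal model, not a hardware description: the class name WriteBuffer, its methods, and the use of a dict to stand in for the cache or main memory are all assumptions made for illustration.&lt;br /&gt;
&lt;br /&gt;
```python
from collections import OrderedDict


class WriteBuffer:
    """FIFO write buffer that merges writes to one address and forwards reads."""

    def __init__(self, memory):
        self.memory = memory            # dict standing in for cache / main memory
        self.pending = OrderedDict()    # address to newest value, oldest entry first

    def write(self, address, value):
        # A later write to the same address overwrites the buffered value in place.
        self.pending[address] = value

    def read(self, address):
        # A read checks the buffer first, so it sees the processor's own latest write.
        if address in self.pending:
            return self.pending[address]
        return self.memory.get(address)

    def drain_one(self):
        # Retire the oldest buffered write to memory.
        if self.pending:
            address, value = self.pending.popitem(last=False)
            self.memory[address] = value
```
&lt;br /&gt;
Until drain_one runs, the memory dict still holds the old value even though the issuing processor already reads the new one from its buffer.&lt;br /&gt;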
&lt;br /&gt;
===The Coherence Problem===&lt;br /&gt;
&lt;br /&gt;
Consider the case where a write (ST_A) has been issued by processor I into WB_I, and the write is waiting to be executed. Since the write has not yet been performed to the cache or the main memory, the other processors have no knowledge of the changes made to address A. As a result, a read of address A by another processor takes the value from its own cache, its own write buffer, or the main memory (depending on whether it hits or misses in the cache), and that value may be stale. It is the job of the designer to employ protocols that ensure that the sequential ordering of the instructions is maintained, and that writes made to any one of the caches are propagated to all of the processor caches. &lt;br /&gt;
&lt;br /&gt;
==Sequential Consistency==&lt;br /&gt;
&lt;br /&gt;
When operating as part of a multiprocessor, it is not enough to check dependencies between writes only at the local level. As data is shared and the course of events in one processor may affect the outcome in another, it has to be ensured that the sequence of writes and the data dependencies are preserved across the processors. &lt;br /&gt;
“A system is said to be sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.”&lt;br /&gt;
There are different approaches to this requirement, based on the expected outcome of the program. The conditions of sequential consistency and logical ordering may be relaxed to suit the requirements of the program being parallelized. &lt;br /&gt;
&lt;br /&gt;
===Strong Ordering===&lt;br /&gt;
The requirements for strong ordering are as follows:&lt;br /&gt;
 &lt;br /&gt;
1) All memory operations appear to execute one at a time.&lt;br /&gt;
&lt;br /&gt;
2) All memory operations from a single CPU appear to execute in-order.&lt;br /&gt;
&lt;br /&gt;
3) All memory operations from different processors are “cleanly” interleaved with each other (serialization)&lt;br /&gt;
&lt;br /&gt;
===Total Store Ordering===&lt;br /&gt;
Requirements are as follows:&lt;br /&gt;
&lt;br /&gt;
1)	Relaxed consistency: stores must complete in order, but a store need not complete before a later read to a given location takes place.&lt;br /&gt;
&lt;br /&gt;
2)	Reads may bypass pending writes, but writes must leave the store buffer in FIFO order.&lt;br /&gt;
&lt;br /&gt;
===Partial Store Ordering===&lt;br /&gt;
&lt;br /&gt;
Requirement is as follows:&lt;br /&gt;
&lt;br /&gt;
An even more relaxed consistency model: stores to any given memory location complete in order, but stores to different locations may complete out of order, and stores need not complete before a read to a given location takes place.&lt;br /&gt;
&lt;br /&gt;
===Weak Ordering===&lt;br /&gt;
Requirement is as follows:&lt;br /&gt;
&lt;br /&gt;
A very relaxed consistency model in which any ordering is allowed except at barrier synchronization points. Global memory state must be completely settled at each synchronization point, and between synchronization points the memory state may correspond to any ordering of reads and writes.&lt;br /&gt;
&lt;br /&gt;
===Example===&lt;br /&gt;
Example for sequential consistency:&lt;br /&gt;
In the following program, the variables “a”, “b”, “flag1” and “flag2” are initialized to 0, and both processors (CPU1 and CPU2) share all of the variables.&lt;br /&gt;
   &lt;br /&gt;
    a = b = flag1 = flag2 = 0;		// initial value&lt;br /&gt;
    CPU1				CPU2 &lt;br /&gt;
    flag1 = 1;				flag2 = 1;&lt;br /&gt;
    a = 1;				a = 2; &lt;br /&gt;
    r1 = a;				r3 = a;&lt;br /&gt;
    r2 = flag2;				r4 = flag1;&lt;br /&gt;
&lt;br /&gt;
The SPARC V8 architecture follows the Total Store Ordering model, which allows a write followed by a read to complete out of program order. One possible result is r1 = 1, r3 = 2, r2 = r4 = 0.&lt;br /&gt;
Strong ordering, in contrast, enforces strict atomicity, so the result depends only on how the two processors’ instructions interleave, and r2 = r4 = 0 cannot occur: whichever processor reads the other’s flag last must see it set.&lt;br /&gt;
&lt;br /&gt;
===Effects on Write Buffer Operation===&lt;br /&gt;
&lt;br /&gt;
// Explain here what ordering has to do with the write buffer coherency&lt;br /&gt;
&lt;br /&gt;
==Cache Coherence Models==&lt;br /&gt;
&lt;br /&gt;
The two prominent models for maintaining cache coherence are the snoopy-bus protocol and the directory-based coherence protocol. The directory-based protocol is used for distributed memory systems, in which a directory entry is used to keep track of what data each processor is storing. In the snoopy-bus protocol, every processor monitors the reads and writes that are being serviced by the memory bus. Read and write requests are broadcast on the bus, and each processor responds based on whether or not it holds that data in its cache. We will be looking at shared memory systems in this article.&lt;br /&gt;
&lt;br /&gt;
=== Write-Update ===&lt;br /&gt;
&lt;br /&gt;
In this approach, a write request is broadcast to all the processors, and each processor updates its local cache with the new value of the data. If a read misses in a processor’s local cache, it can be serviced from any other processor’s cache, as the copy has been updated in all caches. This saves bus bandwidth in terms of writes, as a single broadcast is enough to bring all the caches up to date with the newest value of the data.&lt;br /&gt;
&lt;br /&gt;
===Write-Invalidate===&lt;br /&gt;
This approach is common in systems where a data block is written by one processor but read by multiple processors. Whenever a write is made to shared data, an invalidate signal is sent to all the caches, asking them to mark that data block invalid, as their copy no longer holds the latest value. This may cause additional traffic on the bus; as a result, a separate bus for invalidation requests may be included in the design.&lt;br /&gt;
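&lt;br /&gt;
The two policies can be contrasted with a small Python sketch. It is only an illustration under simplifying assumptions: each cache is a plain dict, the bus is implicit, and the function name write_shared and its parameters are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
def write_shared(caches, writer, address, value, policy):
    """Apply one write under 'update' or 'invalidate'.

    caches maps a CPU id to that CPU's cache (a dict of address to value).
    Returns the ids of the CPUs whose copy was invalidated.
    """
    if policy not in ("update", "invalidate"):
        raise ValueError("policy must be 'update' or 'invalidate'")
    caches[writer][address] = value
    invalidated = []
    for cpu, cache in caches.items():
        if cpu == writer or address not in cache:
            continue
        if policy == "update":
            cache[address] = value      # the broadcast carries the new value
        else:
            del cache[address]          # the copy is dropped; the next read misses
            invalidated.append(cpu)
    return invalidated
```
&lt;br /&gt;
With write-update every sharer keeps a valid, current copy after the write, whereas with write-invalidate only the writer does and each former sharer has to fetch the block again on its next read.&lt;br /&gt;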
&lt;br /&gt;
== Coherence in Write Buffers ==&lt;br /&gt;
&lt;br /&gt;
===Software-Based Coherence===&lt;br /&gt;
The software technique relies on the compiler to ensure that no dependencies exist between the STORE/LOAD accesses carried out by different processors on the shared memory. Based on indications from the programmer, the compiler makes sure that there are no incoherent accesses to shared memory. It is also possible to have a READ/WRITE buffer shared between all processors for access to the main memory. &lt;br /&gt;
&lt;br /&gt;
===Hardware-Based Coherence===&lt;br /&gt;
In hardware-based protocols, accesses to the shared memory are communicated using hardware invalidate signals that are broadcast to all the processor memories. Hardware support may also be required for a LOAD to be broadcast to all the caches, so that a read can be performed directly from a remote memory. The following approaches maintain coherence in write buffers and caches for multiprocessor systems:&lt;br /&gt;
&lt;br /&gt;
====Unique Buffer per Processor==== &lt;br /&gt;
[[Image:Unique Buffer Per Processor.png|thumb|right|250px|Figure 3: Cache based system with unique buffer per processor &lt;br /&gt;
[[http://mprc.pku.cn/mentors/training/ISCAreading/1986/p434-dubois/p434-dubois.pdf]]]] &lt;br /&gt;
&lt;br /&gt;
In this configuration, every processor has its own unique buffer that holds the loads and stores of the local processor as well as the invalidates and loads of remote processors wanting access to the shared data. A store request in the local cache triggers invalidation of that data block in all other caches and main memory. A STORE is said to have been performed when &lt;br /&gt;
&lt;br /&gt;
# the request is serviced in the buffer and the cache is updated on a local hit, and&lt;br /&gt;
# the request is issued to the buffer and an invalidate request has been sent to all other processors sharing the data.&lt;br /&gt;
&lt;br /&gt;
This can be achieved in two ways, depending on the topology of the connection between the caches:&lt;br /&gt;
&lt;br /&gt;
# For a bus-based MP system, the LOADs that miss in the local cache and the STOREs that need invalidation in other caches are broadcast over the bus to all processor caches and memory. The STORE request is then considered performed with respect to all processors when the invalidation signals have been sent out to the private buffers of all the caches. &lt;br /&gt;
# For non-bus MP systems, the caches are connected in a point-to-point topology such as a mesh or a ring. When a store is encountered in one of the processors, the shared data is first locked in the shared memory to ensure atomic access, and then the invalidate signal is propagated along the point-to-point interconnect in a predetermined order. The STORE to shared memory is performed only when the invalidation has been propagated to all the processor caches and buffers. If any processor issues a LOAD to the address that is being STOREd, the LOAD request has to be rejected or buffered to be serviced at a later time, to maintain atomic access to the shared data block.&lt;br /&gt;
&lt;br /&gt;
====Separate Buffers for Local and Remote Accesses====&lt;br /&gt;
[[Image:Separate Invalidate Buffers.png|thumb|right|250px|Figure 4: Cache based system with separate write-invalidate buffers&lt;br /&gt;
[[http://mprc.pku.cn/mentors/training/ISCAreading/1986/p434-dubois/p434-dubois.pdf]]]] &lt;br /&gt;
&lt;br /&gt;
Another approach to maintaining coherence is for every processor to have separate buffers for local data requests and remote data requests. As seen in the figure, every processor has a Local-buffer, which queues the local LOADs and STOREs, and a Remote-buffer (also termed the Invalidate buffer), which stores the invalidation requests and LOADs coming in from other processors.&lt;br /&gt;
In a bus-based system, this approach makes it difficult to maintain write atomicity, as different invalidate buffers may hold different numbers of invalidation requests, which makes the time at which the data block is actually invalidated in each cache uncertain. In non-bus-based systems, however, this approach succeeds in maintaining strong coherence, provided the invalidate signal makes sure that the data is invalidated at a cache before moving on to the next processor (rather than just pushing the invalidate request into the buffer and moving on).&lt;br /&gt;
&lt;br /&gt;
====Universal read/write Buffer====&lt;br /&gt;
&lt;br /&gt;
[[Image:Universal Read Write Buffer.png|thumb|right|300px|Figure 5: Universal Read/Write Buffer [http://expertiza.csc.ncsu.edu/wiki/index.php/CSC/ECE_506_Spring_2012/6b_dm#cite_note-4 5]]]&lt;br /&gt;
&lt;br /&gt;
In this approach, a shared bus controller resides between the local bus and the memory bus, where the local bus is connected to all the private caches of the processors (illustrated in Figure 5). This technique supports multiple processors with the same or different write-invalidate coherence protocols, such as MSI, MESI and MOESI. The shared bus controller consists of the local bus controller, the data buffer and the address buffer. Each private cache provides two single-bit tri-state signals to the local bus: '''INTERVENE''' and '''SHARE'''.&lt;br /&gt;
&lt;br /&gt;
# INTERVENE is asserted when a cache wants to provide valid data to other caches.&lt;br /&gt;
# SHARE is asserted by a cache controller when an address matches one of its own tag addresses.&lt;br /&gt;
&lt;br /&gt;
=====''' Data Buffer '''=====&lt;br /&gt;
&lt;br /&gt;
The data buffer consists of three FIFOs:&lt;br /&gt;
&lt;br /&gt;
# The ''non_block write-FIFO'' receives non-cacheable memory access data from the CPU. These requests do not require any cache eviction or line-fill cycles. &lt;br /&gt;
# The ''write-back FIFO'' is used for eviction cycles. If there is a cache miss and an eviction is required, the missing data is read from memory into the cache (line-fill cycle) just after the evicted data has moved to the FIFO. The contents of this FIFO are written back to memory later on. The CPU thus gets the new data, while the time taken by the eviction is hidden from it. &lt;br /&gt;
# The ''read-FIFO'' is used for storing data from the non_block write FIFO, the write-back FIFO or memory. Data is temporarily stored in this FIFO until the local bus is clear and ready for new data.&lt;br /&gt;
&lt;br /&gt;
=====''' Address Buffer '''=====&lt;br /&gt;
&lt;br /&gt;
The address buffer also consists of three FIFOs: &lt;br /&gt;
&lt;br /&gt;
# ''Non_block write FIFO'' - stores the addresses of non-blocking accesses. The depth of this FIFO should be the same as that of the corresponding Data Buffer FIFO.&lt;br /&gt;
# ''Write-back FIFO'' - stores the starting address of an eviction cycle and holds the address until the memory bus is free.&lt;br /&gt;
# ''Line-fill FIFO'' - stores the starting address of a line-fill cycle.&lt;br /&gt;
&lt;br /&gt;
Snoop logic is started whenever there is a read cycle to main memory. The snoop logic compares the read request address with the entries in both the write-back FIFO and the line-fill FIFO. If there is a match, the HIT flag is set and the CPU gets the data immediately through the internal bypass path. As a result, stale data is never read and the CPU does not have to wait out the memory latency.&lt;br /&gt;
&lt;br /&gt;
For a non-block write cycle, the snoop logic compares the newly requested address and Byte Enable bits with those previously stored in the non_block write FIFO. If there is an address match but no Byte-Enable overlap, the BGO signal is asserted and the pointer of the non_block write FIFO does not move forward, which prevents the FIFO from filling up quickly.&lt;br /&gt;
&lt;br /&gt;
In this approach, read cycles have priority over write cycles, so whenever the memory bus is available, a read address gets direct access to the memory bus.&lt;br /&gt;
&lt;br /&gt;
=====''' Local Bus Controller '''=====&lt;br /&gt;
&lt;br /&gt;
The controller monitors the local bus for the status bits (M, O, E, S, I) and the INTERVENE# and SHARE# bits, and determines whether the cycle needs to access main memory or a private cache. For a main memory access, the controller informs the buffer with the WB (Write Back), LF (Line Fill) or O3 (owned) status signal.  &lt;br /&gt;
The above outputs can be written in Boolean equations as shown in the example&amp;lt;ref&amp;gt;http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=00342629&amp;lt;/ref&amp;gt; below:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
      WB = RD * MOESI * INTV + WR * [STATUS=M]&lt;br /&gt;
      LF =  INTV&lt;br /&gt;
      O3 = RD * MOESI * INTV&lt;br /&gt;
      MOESI means the cycle is initiated by the MOESI protocol.&lt;br /&gt;
      INTV means the INTERVENE signal, which is asserted when any cache provides valid data to another cache.&lt;br /&gt;
&lt;br /&gt;
=====''' Algorithms '''=====&lt;br /&gt;
&lt;br /&gt;
The algorithms followed by each processor in a multiprocessor system are given below. The processors may follow different protocols (MSI, MESI or MOESI), but these algorithms ensure that memory accesses are carried out coherently.&lt;br /&gt;
&lt;br /&gt;
''MOESI''    The algorithm followed for the MOESI protocol is as follows:&lt;br /&gt;
&lt;br /&gt;
#	A cache holding the line in the M state provides the data to another cache on a read hit initiated by that cache, and changes its state from M to O.&lt;br /&gt;
#	If the same line is hit again, the cache holding it in the O state is responsible for providing the data to the requesting cache.&lt;br /&gt;
#	A write cycle needs a line fill if there is a cache miss and INTERVENE is not asserted.&lt;br /&gt;
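&lt;br /&gt;
The three MOESI steps can be sketched as two small Python functions. This is a simplified model under stated assumptions: the function names are hypothetical, and the handling of lines in the E, S or I state on a remote read is outside the steps listed here, so those states are simply passed through unchanged rather than modeled.&lt;br /&gt;
&lt;br /&gt;
```python
def on_remote_read(state):
    """Return (new_state, supplies_data) when another cache reads this line.

    Only the M and O cases from the list are modeled; other states are
    returned unchanged as a placeholder, not as a claim about the protocol.
    """
    if state in ("M", "O"):
        # Steps 1 and 2: the M or O holder supplies the data and is the owner.
        return "O", True
    return state, False


def write_needs_line_fill(cache_hit, intervene):
    """Step 3: a line fill is needed on a write miss without INTERVENE."""
    return (not cache_hit) and (not intervene)


first_read = on_remote_read("M")
second_read = on_remote_read(first_read[0])
```
&lt;br /&gt;
Here the first remote read moves the line from M to O, and the second remote read is again served by the same cache, which stays in O.&lt;br /&gt;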
&lt;br /&gt;
&lt;br /&gt;
''MSI'' and ''MESI''	The algorithms for the MSI and MESI protocols are as follows. (The MSI and MESI protocols have no O state.)&lt;br /&gt;
&lt;br /&gt;
#	A cache in the M state under MSI or MESI provides the data to the other cache and to main memory, and changes its state from M to S.&lt;br /&gt;
#	The local bus controller is responsible for changing the state from S to O for a cache using the MOESI protocol.&lt;br /&gt;
#	The M line of the MSI or MESI cache still changes to S.&lt;br /&gt;
#	If a write cycle is initiated by another CPU (MSI or MESI):&lt;br /&gt;
::	If the MOESI cache holds a valid line in the M or O state,&lt;br /&gt;
:::  	it sends the data line onto the local bus and changes its state to I.&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
:::	[This is because the line will be written by the other cache, which makes this copy outdated.]&amp;lt;/blockquote&amp;gt;&lt;br /&gt;
&lt;br /&gt;
:5.	If the status is E or S, main memory provides the data and no cache is involved in the data transfer. However, the cache using the MOESI protocol becomes the owner of that line.&lt;br /&gt;
&lt;br /&gt;
:6.	Force the status bit to be O instead of S&lt;br /&gt;
&lt;br /&gt;
=='''References'''==&lt;br /&gt;
&amp;lt;references/&amp;gt;&lt;br /&gt;
[http://mprc.pku.cn/mentors/training/ISCAreading/1986/p434-dubois/p434-dubois.pdf Memory Access Buffering In Multiprocessors] Michel Dubois, Christoph Scheurich, Faye Briggs&lt;br /&gt;
&lt;br /&gt;
[http://www.ece.cmu.edu/~ece548/handouts/19coher.pdf Multiprocessor Consistency and Coherence] Memory System Architecture, Philip Koopman&lt;br /&gt;
&lt;br /&gt;
[http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=528915 Write buffer design for cache-coherent shared-memory multiprocessors], Fernaz Mounes-Toussi, David J. Lilja&lt;/div&gt;</summary>
		<author><name>Sbasu3</name></author>
	</entry>
</feed>