<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://wiki.expertiza.ncsu.edu/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Pxu</id>
	<title>Expertiza_Wiki - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="https://wiki.expertiza.ncsu.edu/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Pxu"/>
	<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=Special:Contributions/Pxu"/>
	<updated>2026-06-18T19:36:38Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.41.0</generator>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8a_fu&amp;diff=59899</id>
		<title>CSC/ECE 506 Spring 2012/8a fu</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8a_fu&amp;diff=59899"/>
		<updated>2012-03-19T02:50:08Z</updated>

		<summary type="html">&lt;p&gt;Pxu: /* Real Architectures using MOESI */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Introduction to bus-based cache coherence in real machines=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==SMP Architecture==&lt;br /&gt;
Most parallel software in the commercial market relies on the shared-memory programming model, in which all processors access the same physical address space. The most common multiprocessors today use the [http://en.wikipedia.org/wiki/Symmetric_multiprocessing SMP] architecture, which uses a common bus as the interconnect.  In the case of multicore processors (&amp;quot;chip multiprocessors,&amp;quot; or CMP), the SMP architecture applies to the cores, treating them as separate processors. The key problem of shared-memory multiprocessors is providing a consistent view of memory across the various cache hierarchies.  This is called the '''''cache coherence problem'''''. Solving it is critical for correctness, and it is also a performance-sensitive design point in supporting the shared-memory model. The cache coherence mechanisms not only govern communication in a shared-memory multiprocessor, but also typically determine how the memory system transfers data between processors, caches, and memory.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:Busbased SMP.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
At any point in logical time, the permissions for a cache block can allow either a single writer or multiple readers. The '''''coherence protocol''''' ensures that the invariants of the states are maintained. The coherence states used by most cache coherence protocols are shown in ''Table 1'':&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''States'''&lt;br /&gt;
|  '''Access Type'''&lt;br /&gt;
|  '''Invariant'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  read, write&lt;br /&gt;
|  all other caches in I state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  read&lt;br /&gt;
|  all other caches in I state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Owned'''&lt;br /&gt;
|  read&lt;br /&gt;
|  all other caches in I or S state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  read&lt;br /&gt;
|  no other cache in M or E state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|  -&lt;br /&gt;
|  -&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Snooping Protocols=&lt;br /&gt;
==MSI Protocol==&lt;br /&gt;
&lt;br /&gt;
'''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' is a three-state write-back '''invalidation protocol''' and one of the earliest snooping-based cache coherence protocols. It marks each cache line as '''Modified (M)''', '''Shared (S)''', or '''Invalid (I)'''. '''Invalid''' means the cache line is either not present or present but invalid. If the cache line is clean and may be held by more than one processor, it is marked '''Shared'''. If the cache line is dirty and the processor has exclusive ownership of it, it is in the '''Modified''' state. A BusRdX causes all other caches to invalidate (demote) their copies to the '''I''' state; if the line is in the '''M''' state in another cache, that cache flushes it first. A processor write to a line in the '''S''' state, even though it hits in the cache, still issues a BusRdX, and the line is promoted (upgraded) to the '''M''' state.&lt;br /&gt;
&lt;br /&gt;
The state diagram of the MSI protocol is shown below. Note that the left part shows the response to processor-side requests, and the right part shows the response to snooper-side requests.[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
[[File:MSInew.jpg|600px|thumb|center|MSI state transition diagram]]&lt;br /&gt;
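&lt;br /&gt;
The transitions described above can also be written down as a lookup table. The following is only an illustrative sketch in Python, not code from any real machine: the table name MSI and the function step are hypothetical, and the sketch assumes a write hit in the '''S''' state posts a BusRdX, as in the description above.&lt;br /&gt;

```python
# Hypothetical sketch of the MSI transitions described above (not from a real machine).
# Processor-side events: PrRd, PrWr. Snooper-side events: BusRd, BusRdX.
# Each entry maps (state, event) to (next_state, bus_action); None means no bus traffic.
MSI = {
    ("I", "PrRd"): ("S", "BusRd"),
    ("I", "PrWr"): ("M", "BusRdX"),
    ("S", "PrRd"): ("S", None),
    ("S", "PrWr"): ("M", "BusRdX"),   # a write hit in S still needs the bus
    ("M", "PrRd"): ("M", None),
    ("M", "PrWr"): ("M", None),
    ("I", "BusRd"): ("I", None),
    ("I", "BusRdX"): ("I", None),
    ("S", "BusRd"): ("S", None),
    ("S", "BusRdX"): ("I", None),
    ("M", "BusRd"): ("S", "Flush"),   # supply the dirty block, keep a shared copy
    ("M", "BusRdX"): ("I", "Flush"),  # supply the dirty block, give it up
}


def step(state, event):
    """Return (next_state, bus_action) for one cache line."""
    return MSI[(state, event)]
```

For example, step(&amp;quot;S&amp;quot;, &amp;quot;PrWr&amp;quot;) yields the '''M''' state together with a BusRdX, which is the upgrade case discussed above.&lt;br /&gt;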
&lt;br /&gt;
&lt;br /&gt;
===Real Architectures using MSI===&lt;br /&gt;
The '''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' protocol was first used in the '''[http://en.wikipedia.org/wiki/Silicon_Graphics SGI]''' IRIS 4D series. '''[http://en.wikipedia.org/wiki/Silicon_Graphics SGI]''' produced a broad range of '''[http://en.wikipedia.org/wiki/MIPS_architecture MIPS]'''-based (Microprocessor without Interlocked Pipeline Stages) workstations and servers during the 1990s, running '''[http://en.wikipedia.org/wiki/Silicon_Graphics SGI]''''s version of UNIX System V, now called '''[http://en.wikipedia.org/wiki/IRIX IRIX]'''. The 4D-MP graphics superworkstation delivered 40 MIPS (million instructions per second) of computing performance. This unprecedented level of computing and graphics processing in an office-environment workstation was made possible by the fastest available RISC microprocessors in a single shared-memory multiprocessor design driving a tightly coupled, highly parallel graphics system. Aggregate sustained data rates of over one gigabyte per second were achieved by a hierarchy of buses in a balanced system designed to avoid bottlenecks.&lt;br /&gt;
&lt;br /&gt;
The multiprocessor bus used in the 4D-MP graphics superworkstation is a pipelined, block-transfer bus that supports the cache coherence protocol and provides 64 megabytes per second of sustained data bandwidth between the processors, the memory and I/O system, and the graphics subsystem. Because a separate sync bus provides efficient synchronization between processors, the cache coherence protocol could be designed purely for efficient data sharing between processors. If a cache coherence protocol has to support synchronization as well as sharing, the efficiency of the data sharing protocol may have to be compromised to improve the efficiency of the synchronization operations. Hence the machine uses a simple cache coherence protocol, namely the '''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' protocol.&lt;br /&gt;
&lt;br /&gt;
With the simple rules of the '''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' protocol enforced by hardware, efficient synchronization and efficient data sharing are achieved in a simple shared-memory model of parallel processing in the 4D-MP graphics superworkstation.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
The MSI protocol has one serious drawback when executing read-then-write sequences in programs. Each read-then-write involves two bus transactions: a BusRd and a BusRdX. The second transaction is needed even when no other cache holds the block, which wastes a large amount of bandwidth. Therefore, most new machines do not implement the MSI protocol. Instead, they use variants of it, such as the MOSI and MESI protocols, to reduce the amount of traffic on the coherency interconnect.&lt;br /&gt;
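&lt;br /&gt;
The cost of a read-then-write sequence can be counted with a small sketch. This is a simplified model under stated assumptions, not a description of real hardware: the function name bus_transactions and its parameters are hypothetical, the line is assumed to start uncached in the requesting processor, and MESI (covered later in this article) is assumed to load a line in an Exclusive state when no other cache holds it.&lt;br /&gt;

```python
def bus_transactions(protocol, other_sharers):
    """Count bus transactions for one read followed by one write to an uncached line.

    protocol is "MSI" or "MESI"; other_sharers tells whether another cache
    already holds the line. Both names are hypothetical, for illustration only.
    """
    count = 1                      # the read miss always posts a BusRd
    if protocol == "MESI" and not other_sharers:
        state = "E"                # sole copy: loaded in the Exclusive state
    else:
        state = "S"                # MSI has no way to record "sole copy"
    if state == "S":
        count += 1                 # the write needs a BusRdX (MSI) or BusUpgr (MESI)
    return count
```

Under these assumptions MSI always pays for two transactions, while MESI pays for the second one only when the line is actually shared.&lt;br /&gt;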
&lt;br /&gt;
==Synapse protocol==&lt;br /&gt;
From the state transition diagram of MSI, we observe that there is a transition from state '''M''' to state '''S''' when a BusRd is observed for that block. The contents of the block are flushed to the bus before the line goes to the '''S''' state. In certain cases it would be more appropriate to move to the '''I''' state, giving up the block entirely. The choice between moving to '''S''' or '''I''' reflects the designer's assumption about which is more likely: that the original processor will continue reading the block, or that the new processor will write to it. The Synapse protocol, used in the early Synapse multiprocessor, made the alternate choice of going directly from the '''M''' state to the '''I''' state on a BusRd, assuming the migratory pattern would be more frequent. More details about this protocol can be found in these papers published in the 1980s: [http://portal.acm.org/citation.cfm?id=6514 Cache Coherence protocols: evaluation using a multiprocessor simulation model] and [http://portal.acm.org/citation.cfm?id=1499317&amp;amp;dl=GUIDE&amp;amp;coll=GUIDE&amp;amp;CFID=83027384&amp;amp;CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems]&lt;br /&gt;
&lt;br /&gt;
In Synapse protocol '''M''' state is called '''D''' (Dirty) state. The following is the state transition diagram for Synapse protocol:&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:Synapse1.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
===Real Architectures using Synapse===&lt;br /&gt;
No information could be found on any actual architecture using the Synapse cache-coherence protocol.  Judging from the abstract of the 1984 paper by Steve Frank and Armond Inselberg, &amp;quot;[http://dl.acm.org/citation.cfm?id=1499317 Synapse Tightly Coupled Multiprocessors: A New Approach to Solve Old Problems]&amp;quot;, Synapse was theoretical.&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The [http://www.disi.unige.it/person/DelzannoG/CacheProtocol/synapse.hy BABYLON Project] provides one example of a Synapse cache coherence protocol with atomic synchronization actions.&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
In the Synapse system, memory contention is reduced by designing the processor cache around a non-write-through algorithm to minimize the bandwidth used between cache and shared memory. The Synapse Expansion Bus includes an ownership-level protocol between processor caches.[[#References|&amp;lt;sup&amp;gt;[24]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
==[http://en.wikipedia.org/wiki/MESI_protocol MESI]==&lt;br /&gt;
The main drawback of MSI is that each read-write sequence incurs two bus transactions irrespective of whether the cache line is stored in only one cache or not. Highly parallel programs that have little data sharing suffer the most from this.  MESI protocol solves this problem by introducing the '''Exclusive''' state to distinguish between a cache line stored in multiple caches and a line stored in a single cache.&lt;br /&gt;
Let us briefly see how the MESI protocol works. For a more detailed version, refer to the Solihin textbook, p. 215.&lt;br /&gt;
&lt;br /&gt;
The MESI coherence protocol marks each cache line as being in one of the Modified, Exclusive, Shared, or Invalid states. &lt;br /&gt;
* '''Invalid''' : The cache line is either not present or is invalid&lt;br /&gt;
* '''Exclusive''' : The cache line is clean and is owned by this core/processor only&lt;br /&gt;
* '''Modified''' : The cache line is dirty and the core/processor has exclusive ownership of it; the copy in memory is not valid either.&lt;br /&gt;
* '''Shared''' : The cache line is clean and is shared by more than one core/processor&lt;br /&gt;
&lt;br /&gt;
The MESI protocol works as follows: &lt;br /&gt;
A line that is fetched receives the '''E''' or '''S''' state depending on whether it exists in other processors' caches in the system. A cache line gets the '''M''' state when a processor writes to it; if the line is not in the '''E''' or '''M''' state prior to the write, the cache sends a Bus Upgrade (BusUpgr) signal or, as the Intel manuals term it, a “Read-For-Ownership (RFO) request”, which ensures that the line exists in the cache and is in the '''I''' state in all other processors on the bus (if any). The table below summarizes the MESI protocol.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''This cache line is valid?'''&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  No&lt;br /&gt;
|-&lt;br /&gt;
|  '''The memory copy is…'''&lt;br /&gt;
|  out of date&lt;br /&gt;
|  valid&lt;br /&gt;
|  valid&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies exist in caches of other processors?'''&lt;br /&gt;
|  No&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe&lt;br /&gt;
|  Maybe&lt;br /&gt;
|-&lt;br /&gt;
|  '''A write to this line'''&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  goes directly to bus&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
State transition diagram for MESI protocol is shown below.[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
[[File:MESInew.jpg|500px|thumb|center|MESI state transition diagram]] &lt;br /&gt;
&lt;br /&gt;
===Real Architectures using MESI===&lt;br /&gt;
====CMP Implementation in Intel Architecture====&lt;br /&gt;
Intel's [http://en.wikipedia.org/wiki/Pentium_Pro Pentium Pro] microprocessor, introduced in 1995, was the first Intel architecture microprocessor to support symmetric multiprocessing in various multiprocessor configurations.  SMP with the MESI protocol was the architecture used consistently until the introduction of the 45-nm Hi-k Core microarchitecture in Intel's (Nehalem-EP) quad-core x86-64. The 45-nm Hi-k Intel Core microarchitecture uses a new system framework called the [http://en.wikipedia.org/wiki/Intel_QuickPath_Interconnect QuickPath Interconnect], which uses point-to-point interconnection technology based on a distributed shared memory architecture.  It uses a modified version of the MESI protocol called [http://en.wikipedia.org/wiki/MESIF_protocol MESIF], which introduces the additional Forward state.[[#References|&amp;lt;sup&amp;gt;[3]&amp;lt;/sup&amp;gt;]]  (The state diagram of the MESI transitions that occur within the Pentium's data cache can be found on page 63 of [http://books.google.com/books?id=TVzjEZg1--YC&amp;amp;printsec=frontcover&amp;amp;source=gbs_ge_summary_r&amp;amp;cad=0#v=onepage&amp;amp;q&amp;amp;f=false Pentium Processor System Architecture: Second Edition] by Don Anderson, Tom Shanley, MindShare, Inc.)&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
The CMP implementation on the Intel Pentium M processor contains a unified on-chip L1 cache with the processor, a Memory/L2 access control unit, a prefetch unit, and a [http://en.wikipedia.org/wiki/Front-side_bus Front Side Bus (FSB)].  Processor requests are first sought in the L2 cache.  On a miss, they are forwarded to main memory via the FSB. The Memory/L2 access control unit serves as a central point for maintaining coherence within the core and with the external world. It contains a snoop control unit that receives snoop requests from the bus and performs the required operations on each cache (and internal buffers) in parallel. It also handles RFO requests (BusUpgr) and ensures the operation continues only after it guarantees that no other version of the cache line exists in any other cache in the system.[[#References|&amp;lt;sup&amp;gt;[3]&amp;lt;/sup&amp;gt;]]  The CMP implementation on the [http://en.wikipedia.org/wiki/Intel_Core Intel Core Duo] contains duplicated processors with on-chip L1 caches, an L2 controller to handle all L2 cache requests and snoop requests, a bus controller to handle data and I/O requests to and from the FSB, a prefetching unit, and a logical unit to maintain fairness between requests coming from each processor to the L2 cache.[[#References|&amp;lt;sup&amp;gt;[3]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
The Intel bus architecture has been evolving in order to accommodate the demands of scalability while using the same MESI protocol.  It moved from a single shared bus to '''dual independent buses (DIB)''', doubling the available bandwidth, and then to the logical conclusion of DIB with the introduction of '''dedicated high-speed interconnects (DHSI)'''. The DHSI-based platforms use four FSBs, one for each processor in the platform. In both DIB and DHSI, a snoop filter was used in the chipset to cache snoop information, thereby significantly reducing the broadcasting needed for the snoop traffic on the buses. With the production of processors based on the next-generation 45-nm Hi-k Intel Core microarchitecture, the [http://en.wikipedia.org/wiki/Xeon Intel Xeon] processor fabric will transition from a DHSI, with the memory controller in the chipset, to a distributed shared memory architecture using Intel QuickPath Interconnects and the MESIF protocol.[[#References|&amp;lt;sup&amp;gt;[3]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
====ARM MPCore====&lt;br /&gt;
The [http://www.arm.com/products/processors/classic/arm11/arm11-mpcore.php ARM11 MPCore] and [http://en.wikipedia.org/wiki/ARM_Cortex-A9_MPCore Cortex-A9 MPCore] processors support the MESI cache coherency protocol.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]  ARM MPCore defines the states of the MESI protocol it implements as:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies in other caches'''&lt;br /&gt;
|  NO&lt;br /&gt;
|  NO&lt;br /&gt;
|  YES&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Clean or Dirty'''&lt;br /&gt;
|  DIRTY&lt;br /&gt;
|  CLEAN&lt;br /&gt;
|  CLEAN&lt;br /&gt;
|  -&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
The coherency protocol is implemented and managed by the Snoop Control Unit (SCU) in the ARM MPCore, which monitors the traffic between local L1 data caches and the next level of the memory hierarchy. At boot time, each core can choose whether or not to take part in the coherency domain.  Unless explicit system calls bind a task to a specific core ([http://en.wikipedia.org/wiki/Processor_affinity processor affinity]), there is a high chance that the task will at some point migrate to a different core, along with its data as it is used.  Task migration is not handled efficiently by a literal MESI implementation, so the ARM MPCore offers two optimizations that allow for both MESI compliance and efficient migration of tasks: '''Direct Data Intervention (DDI)''' (in which the SCU keeps a copy of the tag RAMs of all cores' caches, which enables it to efficiently detect whether a cache line requested by one core is held by another core in the coherency domain before looking for it in the next level of the memory hierarchy) and '''Cache-to-cache Migration''' (in which, if the SCU finds that the cache line requested by one CPU is present in another core, it will either copy it (if clean) or move it (if dirty) from the other CPU directly into the requesting one, without interacting with external memory).  These optimizations reduce memory traffic in and out of the L1 cache subsystem by eliminating interaction with external memories, which in effect reduces the overall load on the interconnect and the overall power consumption.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
Other architectures that use the MESI cache-coherence protocol include the L2 cache of the IBM POWER4 processor[[#References|&amp;lt;sup&amp;gt;[20]&amp;lt;/sup&amp;gt;]], the L2 cache of the Intel Itanium 2 processor[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]], and the Intel Xeon[[#References|&amp;lt;sup&amp;gt;[22]&amp;lt;/sup&amp;gt;]].&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
One implementation complexity already mentioned is the inefficiency of task migration in MESI that the ARM MPCore addresses.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]  Another possible implementation complexity is found during the replacement of a cache line: one possible MESI implementation requires a message to be sent to main memory when a cache line is flushed (i.e. an E to I transition), as the line was exclusively in one cache before it was removed. It is possible to avoid this replacement message if the system is designed so that the flush of a modified (exclusive) line requires an acknowledgment from main memory. However, this requires the flush to be stored in a 'write-back' buffer until the reply arrives (to ensure the change is successfully propagated to memory).[[#References|&amp;lt;sup&amp;gt;[23]&amp;lt;/sup&amp;gt;]]&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==MESIF==&lt;br /&gt;
&lt;br /&gt;
Invented by Intel, MESIF protocol consists of five states, Modified (M), Exclusive (E), Shared (S), Invalid (I) and Forward (F). Whenever there is a read request, only the cache line in the F state will respond to the request, while all the S state caches remain dormant.  Hence, by designating a single cache line to '''respond to requests''', coherency traffic is substantially reduced when multiple copies of the data exist. Also, on a read request, the F state transitions from F to S state. That is, when a cache line in the '''F''' state is '''copied''', the F state '''migrates''' to the '''newer copy''', while the '''older''' one drops back to '''S'''. Moving the new copy to the F state '''exploits''' both '''temporal and spatial locality'''. Because the newest copy of the cache line is always in the F state, it is very unlikely that the line in the F state will be evicted from the caches. This takes advantage of the temporal locality of the request. The second advantage is that if a particular cache line is in high demand due to spatial locality, the bandwidth used to transmit that data will be spread across several nodes.&lt;br /&gt;
All M to S state transitions and E to S state transitions will now be from '''M to F''' and '''E to F'''.  &lt;br /&gt;
The '''F state''' is '''different''' from the '''Owned state''' of the MOESI protocol as it is '''not''' a unique copy because a valid copy is stored in memory. Thus, unlike the Owned state of the MOESI protocol, in which the data in the O state is the only valid copy of the data, the data in the F state can be evicted or converted to the S state, if desired. &lt;br /&gt;
&lt;br /&gt;
The state transition table is shown below.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Clean / Dirty'''&lt;br /&gt;
|  '''May Write?'''&lt;br /&gt;
|  '''May Forward?'''&lt;br /&gt;
|  '''May Transition To?'''&lt;br /&gt;
|-&lt;br /&gt;
|  M - Modified&lt;br /&gt;
|  Dirty&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  E - Exclusive&lt;br /&gt;
|  Clean&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  MSIF&lt;br /&gt;
|-&lt;br /&gt;
|  S - Shared&lt;br /&gt;
|  Clean&lt;br /&gt;
|  No&lt;br /&gt;
|  No&lt;br /&gt;
|  I&lt;br /&gt;
|-&lt;br /&gt;
|  I - Invalid&lt;br /&gt;
|  -&lt;br /&gt;
|  No&lt;br /&gt;
|  No&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  F - Forward&lt;br /&gt;
|  Clean&lt;br /&gt;
|  No&lt;br /&gt;
|  Yes&lt;br /&gt;
|  SI&lt;br /&gt;
|}&lt;br /&gt;
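&lt;br /&gt;
The migration of the F state on a read can be sketched as follows. This is a deliberately simplified model and not Intel's implementation: the function name read_request and the dictionary layout are hypothetical, only clean lines are modelled (no cache is assumed to hold the line in the M state), and a cache holding the line in the E state is assumed to forward it as well, as the table above allows.&lt;br /&gt;

```python
def read_request(states, requester):
    """Serve a read miss for one line under a simplified MESIF model.

    states maps a cache name to "E", "S", "I" or "F" (clean lines only; the M
    state is left out of this sketch). Returns the name of the responder,
    or "memory" if no cache forwards the line.
    """
    responder = "memory"
    for name, st in states.items():
        if name != requester and st in ("F", "E"):
            responder = name
            states[name] = "S"      # the old forwarder drops back to S
            break
    # Caches in S stay silent; they only affect the state the requester gets.
    others_hold_copy = any(
        st != "I" for name, st in states.items() if name != requester
    )
    states[requester] = "F" if others_hold_copy else "E"
    return responder
```

With states c0 = F, c1 = S, c2 = I and a read from c2, the sketch lets c0 respond alone, leaves c0 and c1 in S, and gives the newest copy in c2 the F state.&lt;br /&gt;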
&lt;br /&gt;
&lt;br /&gt;
===Real Architectures using MESIF===&lt;br /&gt;
&lt;br /&gt;
The MESIF protocol is implemented in the Intel QuickPath Interconnect (QPI), which is a high-speed, packetized, point-to-point interconnect used in Intel's Core i series processors starting in 2008. The narrow high-speed links stitch together processors in a distributed shared memory-style platform architecture. Compared with front-side buses, it offers much higher bandwidth with low latency. QPI has an efficient architecture allowing more interconnect performance to be achieved in real systems. It has a snoop protocol optimized for low latency and high scalability, as well as packet and lane structures enabling quick completion of transactions. [[#References|&amp;lt;sup&amp;gt;[2]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
More information on the QuickPath Interconnect and MESIF protocol can be found at&lt;br /&gt;
'''[http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
&lt;br /&gt;
In the MESIF protocol, only one agent can have a cache line in the F state at any given time; the other agents are allowed to have copies in the S state. Even when a cache line has been forwarded in this state, the home agent still needs to respond with a completion to allow retirement of the resources tracking the transaction.&lt;br /&gt;
&lt;br /&gt;
==MOESI==&lt;br /&gt;
The [http://en.wikipedia.org/wiki/Opteron AMD Opteron] was AMD’s first-generation dual-core processor, with 2 distinct [http://en.wikipedia.org/wiki/Athlon_64 K8 cores] together on a single die.  Cache coherence becomes a bigger problem on such multiprocessors, and an appropriate coherence protocol was needed to address it. The [http://en.wikipedia.org/wiki/Xeon Intel Xeon], the competing product from Intel, used the MESI protocol to handle cache coherence.  MESI has the drawback of consuming considerable time and bandwidth in certain situations. &lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/MOESI_protocol MOESI] was AMD’s answer to this problem. MOESI adds a fifth state to the MESI protocol, called '''“Owned”'''. It addresses the bandwidth problem faced in the MESI protocol when a processor with an invalid copy in its cache wants to access data that another processor has modified: the requesting processor has to wait for the processor that modified the data to write it back to main memory, which takes time and bandwidth. MOESI removes this drawback by allowing '''''dirty sharing'''''. When the data is held by a processor in the new '''“Owned”''' state, it can provide the modified data to other processors without, or even before, writing it to main memory. The processor holding the data in the '''&amp;quot;Owned&amp;quot;''' state remains responsible for updating main memory later, when the cache line is evicted.&lt;br /&gt;
&lt;br /&gt;
MOESI has become one of the most popular snoop-based protocols supported in the AMD64 architecture.  The AMD dual-core Opteron can maintain cache coherence in systems up to 8 processors using this protocol.&lt;br /&gt;
&lt;br /&gt;
The five different states of the MOESI protocol are:&lt;br /&gt;
* '''Modified (M)''' : The most recent copy of the data is present in the cache line. But it is not present in any other processor cache.&lt;br /&gt;
* '''Owned (O)''' : The cache line holds the most recent, correct copy of the data, which may be shared by other processors. The processor holding the line in this state is responsible for writing the correct value to main memory when the line is evicted.  &lt;br /&gt;
* '''Exclusive (E)''' : A cache line holds the most recent, correct copy of the data, which is exclusively present on this processor and a copy is present in the main memory.  &lt;br /&gt;
* '''Shared (S)''' : A cache line in the shared state holds the most recent, correct copy of the data, which may be shared by other processors. &lt;br /&gt;
* '''Invalid (I)''' : A cache line does not hold a valid copy of the data.&lt;br /&gt;
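&lt;br /&gt;
Dirty sharing can be illustrated with a short sketch of how a cache reacts when another cache reads a line it holds. This is an assumption-laden model rather than AMD's implementation: the function names snoop_read and evict are hypothetical, and a line in the E state is assumed to be supplied by memory instead of by the cache.&lt;br /&gt;

```python
def snoop_read(state):
    """Reaction of a MOESI cache when another cache reads a line it holds.

    Returns (new_state, supplies_data, memory_written). Hypothetical helper.
    """
    if state in ("M", "O"):
        return ("O", True, False)    # dirty sharing: supply data, defer the write-back
    if state == "E":
        return ("S", False, False)   # clean line: memory is assumed to supply it
    return (state, False, False)     # S and I copies do nothing


def evict(state):
    """Return True if evicting a line in this state requires a write-back to memory."""
    return state in ("M", "O")
```

In this sketch a line in M moves to O on a remote read without touching memory, and the deferred write-back only happens when evict reports True for the M or O holder.&lt;br /&gt;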
&lt;br /&gt;
A detailed explanation of this protocol implementation on AMD processor can be found in the manual [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of the AMD 64-bit core]&lt;br /&gt;
&lt;br /&gt;
The following table summarizes the MOESI protocol:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Owned'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''This cache line is valid?'''&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  No&lt;br /&gt;
|-&lt;br /&gt;
|  '''The memory copy is…'''&lt;br /&gt;
|  out of date&lt;br /&gt;
|  out of date&lt;br /&gt;
|  valid&lt;br /&gt;
|  valid&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies exist in caches of other processors?'''&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe&lt;br /&gt;
|  Maybe&lt;br /&gt;
|-&lt;br /&gt;
|  '''A write to this line'''&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  goes directly to bus&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
State transition diagram for MOESI protocol is shown below. The top part shows the response to processor-side requests and the bottom part is the response to snooper-side requests. [[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:MOESI.jpg|500px|thumb|center|MOESI state transition diagram]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Real Architectures using MOESI===&lt;br /&gt;
&lt;br /&gt;
The following systems implement MOESI coherence protocol:&lt;br /&gt;
&lt;br /&gt;
1. AMD64 architecture, including Opteron, Athlon 64, and Phenom processors.&lt;br /&gt;
2. HP AlphaServer GS320, which is based on Alpha EV68 microprocessors.&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
&lt;br /&gt;
When the MOESI protocol is implemented on a real architecture like the AMD K10 series, some modifications or optimizations are made to the protocol to allow more efficient operation for specific program patterns. For example, the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] family of microprocessors (Family 0x10), which is AMD’s first generation to incorporate 4 distinct cores on a single die and the first to have a cache that all the cores share, uses the MOESI protocol with some optimization techniques incorporated. &lt;br /&gt;
&lt;br /&gt;
It focuses on a small subset of compute problems which behave like Producer and Consumer programs. In such a computing problem, a thread of a program running on a single core produces data, which is consumed by a thread that is running on a separate core. With such programs, it is desirable to get the two distinct cores to communicate through the shared cache, to avoid round trips to/from main memory. The '''MOESI''' protocol that the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] cache uses for cache coherence can also limit bandwidth. Hence by keeping the cache line in the '''‘M’''' state for such computing problems, we can achieve better performance.&lt;br /&gt;
&lt;br /&gt;
When the producer thread writes a new entry, it allocates cache lines in the '''modified (M)''' state. Eventually, these M-marked cache lines will start to fill the L3 cache. When the consumer reads a cache line, the MOESI protocol changes the state of the cache line to '''owned (O)''' in the L3 cache and pulls down a '''shared (S)''' copy for its own use. Later, the producer thread circles the ring buffer and arrives back at the same cache line it had previously written. However, when the producer attempts to write new data to the owned (marked '''‘O’''') cache line, it finds that it cannot, since a cache line marked '''‘O’''' by the previous consumer read does not have sufficient permission for a write request (in the MOESI protocol). To maintain coherence, the memory controller must initiate probes in the other caches (to handle any other S copies that may exist). This slows down the process.&lt;br /&gt;
&lt;br /&gt;
Thus, it is preferable to keep the cache line in the '''‘M’''' state in the L3 cache. In such a situation, when the producer comes back around the ring buffer, it finds the previously written cache line still marked '''‘M’''', to which it is safe to write without coherence concerns. Thus better performance can be achieved by such optimization techniques to standard protocols when implemented in real machines.&lt;br /&gt;
&lt;br /&gt;
More information on how this is implemented, along with various other optimizations, can be found in the [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors].&lt;br /&gt;
&lt;br /&gt;
==Dragon Protocol==&lt;br /&gt;
The '''[http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]''' is an update-based coherence protocol: unlike the coherence protocols covered so far, it does not invalidate other cached copies. Write propagation is achieved by updating the cached copies instead of invalidating them. The Dragon Protocol does not update memory on a cache-to-cache transfer; it delays bringing memory up to date until the block is evicted and written back, which saves time and lowers the memory access requirements. Moreover, only the written '''byte''' or '''word''' is '''communicated''' to the '''other caches''' instead of the whole block, which further '''reduces''' '''bandwidth''' usage. The protocol detects the sharing status of a block dynamically, using a write-through policy for shared blocks and a write-back policy for currently non-shared blocks. The Dragon Protocol employs the following four states for cache blocks: '''Shared Clean''', '''Shared Modified''', '''Exclusive''' and '''Modified'''. &lt;br /&gt;
* '''Modified (M)''' and '''Exclusive (E)''' - these states have the same meaning as explained in the protocols above. &lt;br /&gt;
* '''Shared Modified (Sm)''' - Only one cache in the system can hold a given block in the Shared Modified state. Potentially two or more caches have this block, memory may or may not be up to date, and this processor's cache has modified the block.&lt;br /&gt;
* '''Shared Clean (Sc)''' - Potentially two or more caches have this block, and memory may or may not be up to date (if no other cache has the block in the Sm state, memory is up to date; otherwise it is not).&lt;br /&gt;
A Shared Modified line is written back to main memory only when it is evicted from the cache on a cache miss, which keeps memory consistent. For more information on the Dragon protocol, refer to page 229 of the Solihin textbook. The state transition diagram is given below for reference.[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
[[File:DragonNew.jpg|500px|thumb|center|Dragon state transition diagram]] &lt;br /&gt;
&lt;br /&gt;
The Dragon protocol implements snoopy caches that provide the appearance of a uniform memory space to multiple processors. Each cache listens to two buses: the processor bus and the memory bus. The caches are also responsible for address translation, so the processor bus carries virtual addresses and the memory bus carries physical addresses.  The Dragon system was designed to support 4 to 8 Dragon processors.&lt;br /&gt;
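&lt;br /&gt;
The four Dragon states described above can be sketched as a small state machine. This is a minimal illustration, assuming the transitions as the protocol is commonly described in textbooks; the function names and the "shared" flag (standing in for the bus shared line) are hypothetical and are not taken from any Dragon hardware documentation.&lt;br /&gt;

```python
# Minimal sketch of Dragon transitions for one cache block.
# States: "E", "M", "Sc" (Shared Clean), "Sm" (Shared Modified).
# "shared" is a hypothetical flag modelling the bus shared line:
# True when at least one other cache holds the block.

def processor_event(state, event, shared):
    """Return (next_state, bus_actions) for a processor-side event."""
    if event == "PrRdMiss":
        return ("Sc" if shared else "E", ["BusRd"])
    if event == "PrWrMiss":
        if shared:
            # Fetch the block, then broadcast the written word.
            return ("Sm", ["BusRd", "BusUpd"])
        return ("M", ["BusRd"])
    if event == "PrRd":
        return (state, [])
    if event == "PrWr":
        if state in ("E", "M"):
            return ("M", [])  # no other copies, so no bus traffic
        # Sc or Sm: update the other copies instead of invalidating them.
        return ("Sm" if shared else "M", ["BusUpd"])
    raise ValueError("unknown processor event: " + event)


def snoop_event(state, event):
    """Return (next_state, actions) for a snooped bus transaction."""
    if event == "BusRd":
        if state in ("M", "Sm"):
            return ("Sm", ["Flush"])  # owner supplies the dirty block
        return ("Sc", [])
    if event == "BusUpd":
        if state in ("Sc", "Sm"):
            # Another cache wrote a word and becomes the new owner.
            return ("Sc", ["Update"])
        raise ValueError("BusUpd cannot be snooped in state " + state)
    raise ValueError("unknown bus event: " + event)
```

No transition ever leads to an invalid state, which is the defining difference from the invalidation-based protocols.&lt;br /&gt;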
&lt;br /&gt;
=[http://en.wikipedia.org/wiki/Instruction_prefetch Instruction Prefetching]=&lt;br /&gt;
&lt;br /&gt;
In standard [http://en.wikipedia.org/wiki/Pipeline_(computing) CPU pipelining], the optimization of instruction prefetching is used to minimize the time the CPU spends waiting on instructions being fetched from main memory.  Prefetching works perfectly on completely sequential programs; but on programs with branches, other optimizations (like [http://en.wikipedia.org/wiki/Branch_predictor branch prediction]) have to be implemented, and the cost of branching the wrong way has to be considered and optimized.[[#References|&amp;lt;sup&amp;gt;[26]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
Solihin[[#References|&amp;lt;sup&amp;gt;[27]&amp;lt;/sup&amp;gt;]] states that many processors rely on software to implement the prefetch special instruction.  Solihin[[#References|&amp;lt;sup&amp;gt;[27]&amp;lt;/sup&amp;gt;]] also mentions that effective prefetching is characterized by '''coverage''' (the fraction of original cache misses that prefetching has turned into cache hits), '''accuracy''' (the fraction of prefetches that resulted in cache hits), and '''timeliness''' (how early the prefetch arrives).  [http://en.wikipedia.org/wiki/Prefetch_buffer Prefetch buffers] help prefetching obtain the desired characteristics.  [http://publib.boulder.ibm.com/infocenter/db2luw/v8/index.jsp?topic=/com.ibm.db2.udb.doc/admin/c0005397.htm Sequential prefetching] detects and prefetches for accesses to contiguous locations, while [http://www.ics.uci.edu/~amrm/slides/amrm_structure/pta/tsld049.htm stride prefetching] detects and prefetches for consecutive accesses that are s cache blocks apart.[[#References|&amp;lt;sup&amp;gt;[27]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
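&lt;br /&gt;
The coverage and accuracy metrics, and the idea behind stride detection, can be written out as a short sketch. The function names, the counter arguments, and the three-access detection window are assumptions made for illustration; they are not taken from Solihin or from any particular prefetcher.&lt;br /&gt;

```python
def coverage(misses_without_prefetch, misses_with_prefetch):
    """Fraction of the original misses that prefetching turned into hits."""
    eliminated = misses_without_prefetch - misses_with_prefetch
    return eliminated / misses_without_prefetch


def accuracy(useful_prefetches, total_prefetches):
    """Fraction of issued prefetches that resulted in cache hits."""
    return useful_prefetches / total_prefetches


def detect_stride(block_addresses):
    """Return the stride s if the last three accesses are s blocks apart.

    Returns None when there is no constant, non-zero stride. The window
    of three accesses is an arbitrary choice for this sketch.
    """
    if len(block_addresses) >= 3:
        a, b, c = block_addresses[-3:]
        if b - a == c - b and b != a:
            return b - a
    return None


def next_prefetch(block_addresses):
    """Block address a stride prefetcher would request next, or None."""
    stride = detect_stride(block_addresses)
    if stride is None:
        return None
    return block_addresses[-1] + stride
```

A stride of 1 corresponds to sequential prefetching, so the same detector covers both cases mentioned above.&lt;br /&gt;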
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
Prefetching brings into the cache additional blocks that surround the blocks currently being accessed by the processor. [[#References|[28]]] states that simply increasing the hit ratio does not automatically improve overall performance in multiprocessor computation.  Prefetching can actually produce stalls on individual CPUs of multiprocessor systems, because the hit ratio benefit applies only to the thread that the prefetch serves, and that thread then has to wait for the other threads to catch up.  As one approach to implementing prefetching within multiprocessor systems, [[#References|[28]]] suggests overlapping prefetching decision-making time with user process time as much as possible, to minimize the impact on the overall execution sequence.&lt;br /&gt;
&lt;br /&gt;
=Word Invalidation=&lt;br /&gt;
One complexity problem applying to a number of the protocols deals with invalidation.  In newer protocols, individual words may be modified in a cache line, as opposed to the entirety of the line.  The other processors will thus have a mostly correct cache line, with only a word difference.  This leads to potential complexity, because to 'correct' the error the other processors must transition from shared to invalid, where they then can read the correct word from the bus and place it back into the line, transitioning back into shared.  This transition is potentially unnecessary, as the second processor may never access the specific word changed, but it may access other words in the cache line.  One potential solution to this is being researched at the present time, advancing the protocols so that a cache line is &amp;quot;not invalidated on the first dirty word, but after the number of dirty words crosses some predetermined value, which is data type and application dependent.&amp;quot;  In other words, if the application can possibly have multiple words 'incorrect,' several transitions to and from the invalid state may be avoided.&lt;br /&gt;
&amp;lt;br/&amp;gt;Source: http://tab.computer.org/tcca/NEWS/sept96/dsmideas.ps&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In essence, the solution proposed here is to advance the MOESI protocol with word invalidation and specific treatment of temporal and spatial data, so that the block is not invalidated.&lt;br /&gt;
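&lt;br /&gt;
The threshold idea quoted above can be sketched as follows. This is only an illustration of the bookkeeping, assuming a per-line set of dirty word indices and a fixed threshold; the class name, its methods, and the threshold value are hypothetical and do not come from the cited source.&lt;br /&gt;

```python
class ThresholdLine:
    """One cached copy of a line that tolerates a few remotely written words."""

    def __init__(self, threshold):
        self.threshold = threshold  # assumed to be tuned per data type and application
        self.dirty_words = set()    # indices of words overwritten by another processor
        self.valid = True

    def remote_write(self, word_index):
        """Record a remote write; invalidate only once the threshold is crossed."""
        self.dirty_words.add(word_index)
        if len(self.dirty_words) > self.threshold:
            self.valid = False
        return self.valid

    def can_read(self, word_index):
        """A word is readable locally if the line is valid and the word is not stale."""
        return self.valid and word_index not in self.dirty_words
```

With a threshold of 2, for example, the first two remote writes leave the untouched words readable, and the third distinct dirty word invalidates the whole line.&lt;br /&gt;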
&lt;br /&gt;
=Performance: MOESI vs MEI/MESI=&lt;br /&gt;
Early multiprocessors (such as the PowerPC processors) were designed to work with three states (Modified, Exclusive, Invalid). This is similar to the MSI protocol discussed earlier; the term MEI is used to remain consistent with the sources used for reference.  As time progressed, more multiprocessors transitioned to the MESI protocol.  This is most likely due to the characteristics of MEI - it is &amp;quot;easy to implement but leads to inefficiencies in the way that memory bus bandwidth is used&amp;quot; [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 source].  As a result, systems with two or more MEI processors most likely will not see the full potential of the extra processors, due to the inefficient bus transactions.&lt;br /&gt;
&lt;br /&gt;
In contrast, the five-state protocol MOESI is more complex to implement than MESI and MEI/MSI.  However, it brings advantages.  As Andy Keane (VP of Marketing for PMC-Sierra) put it, &amp;quot;this [fifth] state allows shared data that is dirty to remain in the cache.  Without this state, any shared line would be written back to memory to change the original state from modified. Since we have a dedicated, fast path from CPU to CPU, this state maximizes the use of this path rather than the path to memory, which is inherently slower&amp;quot; [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 source].  This advantage can be realized by allowing MOESI to make some processor-to-processor transfers from level 2 caches.&lt;br /&gt;
&lt;br /&gt;
Source: [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 Edwards, Chris (08/06/2001). &amp;quot;Architects wrestle with multiprocessor options&amp;quot;. Electronic engineering times (0192-1541), (1178), p. 48.]&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Cache_coherence Cache coherence]&lt;br /&gt;
# [http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]&lt;br /&gt;
# [http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Symmetric_multiprocessing Common System Interface in Intel Processors]&lt;br /&gt;
# [http://www.zak.ict.pwr.wroc.pl/nikodem/ak_materialy/Cache%20consistency%20&amp;amp;%20MESI.pdf Cache consistency with MESI on Intel processor]&lt;br /&gt;
# [http://techreport.com/articles.x/8236/2 AMD dual core Architecture]&lt;br /&gt;
# [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual]&lt;br /&gt;
# [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors]&lt;br /&gt;
# [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of AMD 64 bit core]&lt;br /&gt;
# [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&amp;amp;arnumber=4913 Silicon Graphics Computer Systems]&lt;br /&gt;
# [http://books.google.com/books?id=g82fofiqa5IC&amp;amp;printsec=frontcover&amp;amp;dq=Parallel+computer+architecture:+a+hardware/software+approach+By+David+E.+Culler,+Jaswinder+Pal+Singh,+Anoop+Gupta&amp;amp;source=bl&amp;amp;ots=COrdamlfVn&amp;amp;sig=YcugVqbzTjHvlofvaFq6Ft_tjfY&amp;amp;hl=en&amp;amp;ei=0ZO6S4TJGcOclgejzI3BBw&amp;amp;sa=X&amp;amp;oi=book_result&amp;amp;ct=result&amp;amp;resnum=1&amp;amp;ved=0CAgQ6AEwAA#v=onepage&amp;amp;q=&amp;amp;f=false Parallel computer architecture: a hardware/software approach By David E. Culler, Jaswinder Pal Singh, Anoop Gupta]&lt;br /&gt;
# [http://www.freepatentsonline.com/5283886.html Three state invalidation protocols]&lt;br /&gt;
# [http://portal.acm.org/citation.cfm?id=1499317&amp;amp;dl=GUIDE&amp;amp;coll=GUIDE&amp;amp;CFID=83027384&amp;amp;CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems]&lt;br /&gt;
# [http://portal.acm.org/citation.cfm?id=6514 Cache Coherence protocols: evaluation using a multiprocessor simulation model]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Xerox_Dragon Xerox Dragon]&lt;br /&gt;
# [http://thanaseto.110mb.com/courses/CSD-527-report-engl.pdf Coherence Protocols]&lt;br /&gt;
# [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&amp;amp;arnumber=289691 XDBus]&lt;br /&gt;
# [http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dai0228a/index.html ARM]&lt;br /&gt;
# [http://ixbtlabs.com/articles/ibmpower4/index.html IBM POWER4]&lt;br /&gt;
# Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems. Solihin, Yan&lt;br /&gt;
# [http://www.chiplist.com/Intel_Itanium_2_processor/tree3f-section--2235-/ Intel Itanium 2]&lt;br /&gt;
# [http://techreport.com/articles.x/8236/2 MESI-MESI-MOESI Banana-fana...]&lt;br /&gt;
# [http://rsim.cs.illinois.edu/rsim/Manual/node109.html RSIM]&lt;br /&gt;
# [http://dl.acm.org/citation.cfm?id=1499317 Synapse tightly coupled multiprocessors]&lt;br /&gt;
# [http://dictionary.reference.com/browse/instruction+prefetch prefetching]&lt;br /&gt;
# Yan Solihin, &amp;quot;Fundamentals of Parallel Computer Architecture,&amp;quot; Solihin Publishing &amp;amp; Consulting LLC, 2008, pp. 168-173.&lt;br /&gt;
# David F. Kotz and Carla Schlatter Ellis, &amp;quot;Prefetching in File Systems for MIMD Multiprocessors,&amp;quot; in IEEE Transactions on Parallel and Distributed Systems, Vol. 1, No. 2, April 1990, pp. 218-230.&lt;/div&gt;</summary>
		<author><name>Pxu</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8a_fu&amp;diff=59895</id>
		<title>CSC/ECE 506 Spring 2012/8a fu</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8a_fu&amp;diff=59895"/>
		<updated>2012-03-19T02:45:10Z</updated>

		<summary type="html">&lt;p&gt;Pxu: /* Real Architectures using Synapse */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Introduction to bus-based cache coherence in real machines=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==SMP Architecture==&lt;br /&gt;
Most parallel software in the commercial market relies on the shared-memory programming model in which all processors access the same physical address space. And the most common multiprocessors today use [http://en.wikipedia.org/wiki/Symmetric_multiprocessing SMP] architecture which use a common bus as the interconnect.  In the case of multicore processors (&amp;quot;chip multiprocessors,&amp;quot; or CMP) the SMP architecture applies to the cores treating them as separate processors. The key problem of shared-memory multiprocessors is providing a consistent view of memory with various cache hierarchies.  This is called '''''cache coherence problem'''''. It is  critical to  achieve correctness and performance-sensitive design point for supporting the shared-memory model. The cache coherence mechanisms not only govern communication in a shared-memory multiprocessor, but also typically determine how the memory system transfers data between processors, caches, and memory.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:Busbased SMP.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
At any point in logical time, the permissions for a cache block can allow either a single writer or multiple readers. The '''''coherence protocol''''' ensures the invariants of the states are maintained. The different coherent states used by most of the cache coherence protocols are as shown in ''Table 1'':&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''States'''&lt;br /&gt;
|  '''Access Type'''&lt;br /&gt;
|  '''Invariant'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  read, write&lt;br /&gt;
|  all other caches in I state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  read&lt;br /&gt;
|  all other caches in I state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Owned'''&lt;br /&gt;
|  read&lt;br /&gt;
|  all other caches in I or S state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  read&lt;br /&gt;
|  no other cache in M or E state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|  -&lt;br /&gt;
|  -&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Snooping Protocols=&lt;br /&gt;
==MSI Protocol==&lt;br /&gt;
&lt;br /&gt;
'''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' is a three-state write-back '''invalidation protocol''' and one of the earliest snooping-based cache coherence protocols. It marks each cache line as '''Modified (M)''', '''Shared (S)''', or '''Invalid (I)'''. '''Invalid''' means the cache line is either not present or is in the invalid state. If the cache line is clean and may be held by more than one processor, it is marked '''Shared'''. If the cache line is dirty and the processor has exclusive ownership of it, it is in the '''Modified''' state. A BusRdX causes all other caches to invalidate (demote) their copies to the '''I''' state; a cache that holds the line in the '''M''' state flushes it first. A write to a line held in the '''S''' state still issues a BusRdX, even though the access hits in the cache, and the line is then promoted (upgraded) to the '''M''' state.&lt;br /&gt;
&lt;br /&gt;
The state diagram of the MSI protocol is shown below. Note that the left part shows the response to processor-side requests, and the right part shows the response to snooper-side requests.[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
[[File:MSInew.jpg|600px|thumb|center|MSI state transition diagram]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Real Architectures using MSI===&lt;br /&gt;
The '''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' protocol was first used in the '''[http://en.wikipedia.org/wiki/Silicon_Graphics SGI]''' IRIS 4D series. '''[http://en.wikipedia.org/wiki/Silicon_Graphics SGI]''' produced a broad range of '''[http://en.wikipedia.org/wiki/MIPS_architecture MIPS]'''-based (Microprocessor without Interlocked Pipeline Stages) workstations and servers during the 1990s, running '''[http://en.wikipedia.org/wiki/Silicon_Graphics SGI]''''s version of UNIX System V, now called '''[http://en.wikipedia.org/wiki/IRIX IRIX]'''. The 4D-MP brought 40 MIPS (million instructions per second) of computing performance to a graphics superworkstation. This unprecedented level of computing and graphics processing in an office-environment workstation was made possible by the fastest available RISC microprocessors in a single shared-memory multiprocessor design driving a tightly coupled, highly parallel graphics system. Aggregate sustained data rates of over one gigabyte per second were achieved by a hierarchy of buses in a balanced system designed to avoid bottlenecks.&lt;br /&gt;
&lt;br /&gt;
The multiprocessor bus used in 4D-MP graphics superworkstation is a pipelined, block transfer bus that supports the cache coherence protocol as well as providing 64 megabytes of sustained data bandwidth between the processors, the memory and I/O system, and the graphics subsystem. Because the sync bus provides for efficient synchronization between processors, the cache coherence protocol was designed to support efficient data sharing between processors. If a cache coherence protocol has to support synchronization as well as sharing, a compromise in the efficiency of the data sharing protocol may be necessary to improve the efficiency of the synchronization operations. Hence it uses the simple cache coherence protocol which is the '''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' protocol.&lt;br /&gt;
&lt;br /&gt;
With the simple rules of the '''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' protocol enforced by hardware, efficient synchronization and efficient data sharing are achieved in a simple shared memory model of parallel processing in the 4D-MP graphics superworkstation.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
The MSI protocol has one serious drawback when executing read-then-write sequences in programs. Each read-then-write involves two bus transactions: a BusRd and a BusRdX. These transactions waste a large amount of bandwidth. Therefore, most new machines do not implement the MSI protocol. Instead, they use variants of it, such as the MOSI and MESI protocols, to reduce the amount of traffic in the coherency interconnect.&lt;br /&gt;
&lt;br /&gt;
==Synapse protocol==&lt;br /&gt;
From the state transition diagram of MSI, we observe that there is a transition from state '''M''' to state '''S''' when a BusRd is observed for that block. The contents of the block are flushed to the bus before the block goes to the '''S''' state. In certain cases it would be more appropriate to move to the '''I''' state, giving up the block entirely. The choice between '''S''' and '''I''' reflects the designer's assertion about whether the original processor is more likely to continue reading the block than the new processor is to write to it. The Synapse protocol, used in the early Synapse multiprocessor, made the alternate choice of going directly from the '''M''' state to the '''I''' state on a BusRd, assuming the migratory pattern would be more frequent. More details about this protocol can be found in these papers published in the 1980s: [http://portal.acm.org/citation.cfm?id=6514 Cache Coherence protocols: evaluation using a multiprocessor simulation model] and [http://portal.acm.org/citation.cfm?id=1499317&amp;amp;dl=GUIDE&amp;amp;coll=GUIDE&amp;amp;CFID=83027384&amp;amp;CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems].&lt;br /&gt;
&lt;br /&gt;
In the Synapse protocol, the '''M''' state is called the '''D''' (Dirty) state. The following is the state transition diagram for the Synapse protocol:&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:Synapse1.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
===Real Architectures using Synapse===&lt;br /&gt;
No information could be found on any actual architecture using the Synapse cache-coherence protocol.  Judging from the abstract of the 1984 paper by Steve Frank and Armond Inselberg, &amp;quot;[http://dl.acm.org/citation.cfm?id=1499317 Synapse Tightly Coupled Multiprocessors: A New Approach to Solve Old Problems]&amp;quot;, Synapse was theoretical.&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The [http://www.disi.unige.it/person/DelzannoG/CacheProtocol/synapse.hy BABYLON Project] provides one example of a Synapse cache coherence protocol with atomic synchronization actions.&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
In the Synapse system, memory contention is reduced by designing a processor cache that employs a non-write-through algorithm to minimize bandwidth between the cache and shared memory. The Synapse Expansion Bus includes an ownership-level protocol between processor caches.[[#References|&amp;lt;sup&amp;gt;[25]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
==[http://en.wikipedia.org/wiki/MESI_protocol MESI]==&lt;br /&gt;
The main drawback of MSI is that each read-then-write sequence incurs two bus transactions, irrespective of whether the cache line is stored in only one cache or not. Highly parallel programs that have little data sharing suffer the most from this.  The MESI protocol solves this problem by introducing the '''Exclusive''' state to distinguish between a cache line stored in multiple caches and a line stored in a single cache.&lt;br /&gt;
Let us briefly see how the MESI protocol works. For a more detailed version, refer to page 215 of the Solihin textbook.&lt;br /&gt;
&lt;br /&gt;
The MESI coherence protocol marks each cache line as being in one of the Modified, Exclusive, Shared, or Invalid states. &lt;br /&gt;
* '''Invalid''' : The cache line is either not present or is invalid&lt;br /&gt;
* '''Exclusive''' : The cache line is clean and is owned by this core/processor only&lt;br /&gt;
* '''Modified''' : The cache line is dirty and the core/processor has exclusive ownership of it, exclusive of the memory also.&lt;br /&gt;
* '''Shared''' : The cache line is clean and is shared by more than one core/processor&lt;br /&gt;
&lt;br /&gt;
The MESI protocol works as follows: &lt;br /&gt;
A line that is fetched receives the '''E''' or '''S''' state depending on whether it exists in other caches in the system. A cache line gets the '''M''' state when a processor writes to it; if the line is not in the '''E''' or '''M''' state prior to the write, the cache sends a Bus Upgrade (BusUpgr) signal or, as the Intel manuals term it, a “Read-For-Ownership (RFO) request”, which ensures that the line exists in the cache and is in the '''I''' state in all other processors on the bus (if any). The table below summarizes the MESI protocol.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''This cache line is valid?'''&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  No&lt;br /&gt;
|-&lt;br /&gt;
|  '''The memory copy is…'''&lt;br /&gt;
|  out of date&lt;br /&gt;
|  valid&lt;br /&gt;
|  valid&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies exist in caches of other processors?'''&lt;br /&gt;
|  No&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe&lt;br /&gt;
|  Maybe&lt;br /&gt;
|-&lt;br /&gt;
|  '''A write to this line'''&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  goes directly to bus&lt;br /&gt;
|}&lt;br /&gt;
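&lt;br /&gt;
The benefit of the Exclusive state for read-then-write sequences can be made concrete with a small sketch that lists the bus transactions each protocol issues. It is a simplified model of a single cache and a single block, assuming the block starts out Invalid locally; the function name and the "other_sharers" flag are hypothetical.&lt;br /&gt;

```python
def read_then_write(protocol, other_sharers):
    """Return (final_state, bus_transactions) for a read miss followed by a write.

    protocol      -- "MSI" or "MESI"
    other_sharers -- True if another cache already holds the block
    """
    bus = ["BusRd"]  # the read miss always goes to the bus
    if protocol == "MSI":
        state = "S"  # MSI has no way to record that the copy is the only one
    elif protocol == "MESI":
        state = "S" if other_sharers else "E"
    else:
        raise ValueError("unknown protocol: " + protocol)

    # The write: only a block in S needs a second bus transaction.
    if state == "S":
        bus.append("BusRdX" if protocol == "MSI" else "BusUpgr")
    state = "M"
    return state, bus
```

For data that no other cache holds, the MESI case returns a single BusRd, while MSI still pays for the extra BusRdX.&lt;br /&gt;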
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
State transition diagram for MESI protocol is shown below.[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
[[File:MESInew.jpg|500px|thumb|center|MESI state transition diagram]] &lt;br /&gt;
&lt;br /&gt;
===Real Architectures using MESI===&lt;br /&gt;
====CMP Implementation in Intel Architecture====&lt;br /&gt;
Intel's [http://en.wikipedia.org/wiki/Pentium_Pro Pentium Pro] microprocessor, introduced in 1995, was the first Intel architecture microprocessor to support symmetric multiprocessing in various multiprocessor configurations.  SMP and the MESI protocol were used consistently until the introduction of the 45-nm Hi-k Core microarchitecture in Intel's quad-core x86-64 Nehalem-EP. The 45-nm Hi-k Intel Core microarchitecture uses a new interconnect framework called the [http://en.wikipedia.org/wiki/Intel_QuickPath_Interconnect QuickPath Interconnect], which uses point-to-point interconnection technology based on a distributed shared memory architecture.  It uses a modified version of the MESI protocol called [http://en.wikipedia.org/wiki/MESIF_protocol MESIF], which introduces the additional Forward state.[[#References|&amp;lt;sup&amp;gt;[3]&amp;lt;/sup&amp;gt;]]  (The state diagram of the MESI transitions that occur within the Pentium's data cache can be found on page 63 of [http://books.google.com/books?id=TVzjEZg1--YC&amp;amp;printsec=frontcover&amp;amp;source=gbs_ge_summary_r&amp;amp;cad=0#v=onepage&amp;amp;q&amp;amp;f=false Pentium Processor System Architecture: Second Edition] by Don Anderson, Tom Shanley, MindShare, Inc.)&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
The CMP implementation on the Intel Pentium M processor contains a unified on-chip L1 cache with the processor, a Memory/L2 access control unit, a prefetch unit, and a [http://en.wikipedia.org/wiki/Front-side_bus Front Side Bus (FSB)].  Processor requests are first sought in the L2 cache.  On a miss, they are forwarded to main memory via the FSB. The Memory/L2 access control unit serves as a central point for maintaining coherence within the core and with the external world. It contains a snoop control unit that receives snoop requests from the bus and performs the required operations on each cache (and the internal buffers) in parallel. It also handles RFO requests (BusUpgr) and ensures that the operation continues only after it has guaranteed that no other version of the cache line exists in any other cache in the system.[[#References|&amp;lt;sup&amp;gt;[3]&amp;lt;/sup&amp;gt;]]  The CMP implementation on the [http://en.wikipedia.org/wiki/Intel_Core Intel Core Duo] contains duplicated processors with on-chip L1 caches, an L2 controller to handle all L2 cache requests and snoop requests, a bus controller to handle data and I/O requests to and from the FSB, a prefetching unit, and a logical unit to maintain fairness between the requests coming from each processor to the L2 cache.[[#References|&amp;lt;sup&amp;gt;[3]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
====ARM MPCore====&lt;br /&gt;
The [http://www.arm.com/products/processors/classic/arm11/arm11-mpcore.php ARM11 MPCore] and [http://en.wikipedia.org/wiki/ARM_Cortex-A9_MPCore Cortex-A9 MPCore] processors support the MESI cache coherency protocol.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]  ARM MPCore defines the states of the MESI protocol it implements as:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies in other caches'''&lt;br /&gt;
|  NO&lt;br /&gt;
|  NO&lt;br /&gt;
|  YES&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Clean or Dirty'''&lt;br /&gt;
|  DIRTY&lt;br /&gt;
|  CLEAN&lt;br /&gt;
|  CLEAN&lt;br /&gt;
|  -&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
The coherency protocol is implemented and managed by the Snoop Control Unit (SCU) in the ARM MPCore, which monitors the traffic between the local L1 data caches and the next level of the memory hierarchy. At boot time, each core can choose whether or not to take part in the coherency domain.  Unless explicit system calls bind a task to a specific core ([http://en.wikipedia.org/wiki/Processor_affinity processor affinity]), there is a high chance that a task will at some point migrate to a different core, along with its data as it is used.  Migration of tasks is not handled efficiently by a literal MESI implementation, so the ARM MPCore offers two optimizations that allow for both MESI compliance and migration of tasks. With '''Direct Data Intervention (DDI)''', the SCU keeps a copy of the tag RAMs of all the cores' caches, which enables it to detect efficiently whether a cache line requested by one core is held by another core in the coherency domain before looking for it in the next level of the memory hierarchy. With '''Cache-to-cache Migration''', if the SCU finds that the cache line requested by one CPU is present in another core, it either copies it (if clean) or moves it (if dirty) from the other CPU directly into the requesting one, without interacting with external memory.  These optimizations reduce memory traffic in and out of the L1 cache subsystem by eliminating interaction with external memories, which in effect reduces the overall load on the interconnect and the overall power consumption.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
Other architectures that use the MESI cache-coherence protocol include the L2 cache of the IBM POWER4 processor[[#References|&amp;lt;sup&amp;gt;[20]&amp;lt;/sup&amp;gt;]], the L2 cache of the Intel Itanium 2 processor[[#References|&amp;lt;sup&amp;gt;[22]&amp;lt;/sup&amp;gt;]], and the Intel Xeon[[#References|&amp;lt;sup&amp;gt;[23]&amp;lt;/sup&amp;gt;]].&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
One implementation complexity already mentioned is the inefficiency of task migration in MESI that the ARM MPCore addresses.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]  Another possible implementation complexity is found during the replacement of a cache line: one possible MESI implementation requires a message to be sent to main memory when a cache line is flushed (i.e. an E to I transition), as the line was exclusively in one cache before it was removed. It is possible to avoid this replacement message if the system is designed so that the flush of a modified (exclusive) line requires an acknowledgment from main memory. However, this requires the flush to be stored in a 'write-back' buffer until the reply arrives (to ensure the change is successfully propagated to memory).[[#References|&amp;lt;sup&amp;gt;[23]&amp;lt;/sup&amp;gt;]]&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==MESIF==&lt;br /&gt;
&lt;br /&gt;
The MESIF protocol, developed by Intel, consists of five states: Modified (M), Exclusive (E), Shared (S), Invalid (I) and Forward (F). On a read request, only the cache holding the line in the F state responds, while all caches holding it in the S state remain silent.  Hence, by designating a single cache to '''respond to requests''', coherency traffic is substantially reduced when multiple copies of the data exist. The F state also moves on each read: when a cache line in the '''F''' state is '''copied''', the F state '''migrates''' to the '''newer copy''', while the '''older''' one drops back to '''S'''. Moving the F state to the new copy '''exploits''' both '''temporal and spatial locality'''. Because the F state always sits with the most recently fetched copy, the line in the F state is unlikely to be evicted soon, which takes advantage of the temporal locality of the requests. The second advantage is that if a particular cache line is in high demand due to spatial locality, the bandwidth used to transmit that data is spread across several nodes rather than concentrated on one.&lt;br /&gt;
All M to S and E to S state transitions now become '''M to F''' and '''E to F''' transitions.  &lt;br /&gt;
The '''F state''' is '''different''' from the '''Owned state''' of the MOESI protocol because a line in the F state is clean: a valid copy is also stored in memory, so the cached copy is '''not''' the only up-to-date one. Thus, unlike a line in the Owned state, whose memory copy is stale and which must therefore be written back before it is dropped, a line in the F state can be evicted or converted to the S state at any time. &lt;br /&gt;
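&lt;br /&gt;
The migration of the F state can be sketched in a few lines of Python. This is only an illustrative model of the "newest copy becomes the forwarder" rule described above, not Intel's implementation; the function name read_line and the dictionary-based cache model are assumptions made for this example.&lt;br /&gt;

```python
# Toy model of F-state migration in MESIF for a single cache line.
# `states` maps a cache name to one of "M", "E", "S", "I", "F".
# read_line is a hypothetical helper, not part of any real library.

def read_line(states, requester):
    """Serve a read from `requester`; return the name of the responder."""
    if states[requester] != "I":
        return requester  # read hit, no coherence traffic needed
    responder = "memory"
    for cache, state in states.items():
        if state in ("M", "E", "F"):
            # Only one cache answers; caches in S stay silent.
            responder = cache
            states[cache] = "S"  # the old forwarder drops back to S
            # (a real M line would also have to be written back here)
    # The newest copy becomes the forwarder, or E if nobody else holds the line.
    others_hold = any(s != "I" for c, s in states.items() if c != requester)
    states[requester] = "F" if others_hold else "E"
    return responder


states = {"c0": "I", "c1": "I", "c2": "I"}
print(read_line(states, "c0"), states)  # memory responds, c0 becomes E
print(read_line(states, "c1"), states)  # c0 forwards, c1 becomes F
print(read_line(states, "c2"), states)  # c1 forwards, c2 becomes F
```

After the three reads, exactly one cache holds the line in F and the earlier readers hold it in S, which is the single-responder property the protocol relies on.&lt;br /&gt;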
&lt;br /&gt;
The state transition table is shown below.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Clean / Dirty'''&lt;br /&gt;
|  '''May Write?'''&lt;br /&gt;
|  '''May Forward?'''&lt;br /&gt;
|  '''May Transition To?'''&lt;br /&gt;
|-&lt;br /&gt;
|  M - Modified&lt;br /&gt;
|  Dirty&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  E - Exclusive&lt;br /&gt;
|  Clean&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  MSIF&lt;br /&gt;
|-&lt;br /&gt;
|  S - Shared&lt;br /&gt;
|  Clean&lt;br /&gt;
|  No&lt;br /&gt;
|  No&lt;br /&gt;
|  I&lt;br /&gt;
|-&lt;br /&gt;
|  I - Invalid&lt;br /&gt;
|  -&lt;br /&gt;
|  No&lt;br /&gt;
|  No&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  F - Forward&lt;br /&gt;
|  Clean&lt;br /&gt;
|  No&lt;br /&gt;
|  Yes&lt;br /&gt;
|  SI&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Real Architectures using MESIF===&lt;br /&gt;
&lt;br /&gt;
The MESIF protocol is implemented in the Intel QuickPath Interconnect (QPI), a high-speed, packetized, point-to-point interconnect used in Intel's Core i series processors starting in 2008. The narrow high-speed links stitch together processors in a distributed shared memory-style platform architecture. Compared with front-side buses, it offers much higher bandwidth with low latency. QPI has an efficient architecture that allows more interconnect performance to be achieved in real systems. It has a snoop protocol optimized for low latency and high scalability, as well as packet and lane structures that enable quick completion of transactions. [[#References|&amp;lt;sup&amp;gt;[2]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
More information on the QuickPath Interconnect and MESIF protocol can be found at&lt;br /&gt;
'''[http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
&lt;br /&gt;
In the MESIF protocol, only one agent can have a cache line in the F state at any given time; the other agents are allowed to have copies in the S state. Even when a cache line has been forwarded in this state, the home agent still needs to respond with a completion to allow retirement of the resources tracking the transaction.&lt;br /&gt;
&lt;br /&gt;
==MOESI==&lt;br /&gt;
The [http://en.wikipedia.org/wiki/Opteron AMD Opteron] was AMD’s first-generation dual-core processor, with 2 distinct [http://en.wikipedia.org/wiki/Athlon_64 K8 cores] on a single die.  Cache coherence becomes a bigger problem on such multiprocessors, so an appropriate coherence protocol was needed. The [http://en.wikipedia.org/wiki/Xeon Intel Xeon], the competing processor from Intel, used the MESI protocol to handle cache coherence.  MESI has the drawback of consuming considerable time and bandwidth in certain situations. &lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/MOESI_protocol MOESI] was AMD’s answer to this problem. MOESI adds a fifth state, '''“Owned”''', to the MESI protocol. It addresses the bandwidth problem that MESI faces when a processor with an invalid copy in its cache wants to access data that another processor has modified.  Under MESI, the requesting processor has to wait for the processor that modified the data to write it back to main memory, which takes time and bandwidth. MOESI removes this drawback by allowing '''''dirty sharing''''': a processor holding the data in the '''“Owned”''' state can provide the modified data to other processors without, or before, writing it to main memory. The processor holding the data in the '''&amp;quot;Owned&amp;quot;''' state remains responsible for updating main memory later, when the cache line is evicted.&lt;br /&gt;
&lt;br /&gt;
MOESI has become one of the most popular snoop-based protocols and is supported in the AMD64 architecture.  The AMD dual-core Opteron can maintain cache coherence in systems of up to 8 processors using this protocol.&lt;br /&gt;
&lt;br /&gt;
The five different states of the MOESI protocol are:&lt;br /&gt;
* '''Modified (M)''' : The cache line holds the most recent copy of the data, which is not present in any other processor's cache; the copy in main memory is stale.&lt;br /&gt;
* '''Owned (O)'''   : The cache line holds the most recent, correct copy of the data, which may be shared by other processors. The processor holding the line in this state is responsible for updating main memory with the correct value before the line is evicted.  &lt;br /&gt;
* '''Exclusive (E)''' : The cache line holds the most recent, correct copy of the data, which is present only in this processor's cache; main memory also holds a valid copy.  &lt;br /&gt;
* '''Shared (S)''' : A cache line in the shared state holds the most recent, correct copy of the data, which may be shared by other processors. &lt;br /&gt;
* '''Invalid (I)''' : A cache line does not hold a valid copy of the data.&lt;br /&gt;
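&lt;br /&gt;
The benefit of the Owned state can be made concrete with a small sketch that compares what a snooped read does to a dirty line under MESI and under MOESI. The function remote_read and its return convention are hypothetical and chosen only for this illustration; a real protocol involves many more events.&lt;br /&gt;

```python
# Compare the effect of a snooped read on a dirty line under MESI and MOESI.
# Illustrative sketch only; remote_read is a hypothetical helper.

def remote_read(protocol, owner_state):
    """Return (new owner state, requester state, memory write-backs)."""
    if owner_state == "M" and protocol == "MESI":
        # MESI has no dirty-shared state: the line is flushed to memory.
        return "S", "S", 1
    if owner_state in ("M", "O") and protocol == "MOESI":
        # MOESI: the owner supplies the data and keeps the write-back duty.
        return "O", "S", 0
    if owner_state in ("E", "S"):
        return "S", "S", 0  # line is clean, memory is already up to date
    raise ValueError("state is not valid for this protocol")


print(remote_read("MESI", "M"))   # ('S', 'S', 1)
print(remote_read("MOESI", "M"))  # ('O', 'S', 0)
```

The third element of the result is the point of interest: dirty sharing lets the read complete without a write-back to main memory.&lt;br /&gt;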
&lt;br /&gt;
A detailed explanation of how this protocol is implemented on AMD processors can be found in [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of the AMD 64-bit core].&lt;br /&gt;
&lt;br /&gt;
The following table summarizes the MOESI protocol:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Owned'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''This cache line is valid?'''&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  No&lt;br /&gt;
|-&lt;br /&gt;
|  '''The memory copy is…'''&lt;br /&gt;
|  out of date&lt;br /&gt;
|  out of date&lt;br /&gt;
|  valid&lt;br /&gt;
|  valid&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies exist in caches of other processors?'''&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe (in the Shared state)&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe&lt;br /&gt;
|  Maybe&lt;br /&gt;
|-&lt;br /&gt;
|  '''A write to this line'''&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  goes directly to bus&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The state transition diagram for the MOESI protocol is shown below. The top part shows the responses to processor-side requests and the bottom part shows the responses to snooper-side requests. [[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:MOESI.jpg|500px|thumb|center|MOESI state transition diagram]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Real Architectures using MOESI===&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
&lt;br /&gt;
When the MOESI protocol is implemented on a real architecture such as the AMD K10 series, some modifications and optimizations are made to the protocol to allow more efficient operation for specific program patterns. For example, the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] family of microprocessors (Family 0x10), AMD’s first generation to incorporate 4 distinct cores on a single die and the first to have a cache shared by all the cores, uses the MOESI protocol with some optimization techniques incorporated. &lt;br /&gt;
&lt;br /&gt;
The optimization focuses on a small subset of compute problems that behave like producer-consumer programs. In such a problem, a thread running on one core produces data that is consumed by a thread running on a separate core. With such programs, it is desirable to have the two cores communicate through the shared cache, to avoid round trips to and from main memory. However, the '''MOESI''' protocol that the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] cache uses for cache coherence can limit bandwidth in this pattern. Keeping the cache line in the '''‘M’''' state for such computing problems yields better performance, as explained below.&lt;br /&gt;
&lt;br /&gt;
When the producer thread writes a new entry into the ring buffer it shares with the consumer, it allocates cache lines in the '''modified (M)''' state. Eventually, these M-marked cache lines start to fill the L3 cache. When the consumer reads a cache line, the MOESI protocol changes the state of the line to '''owned (O)''' in the L3 cache and pulls down a '''shared (S)''' copy for the consumer's own use. The producer thread then circles the ring buffer and arrives back at the same cache line it had previously written. However, when the producer attempts to write new data to the owned (marked '''‘O’''') cache line, it finds that it cannot, since a cache line marked '''‘O’''' by the previous consumer read does not have sufficient permission for a write request (in the MOESI protocol). To maintain coherence, the memory controller must initiate probes in the other caches (to handle any other S copies that may exist). This slows down the process.&lt;br /&gt;
&lt;br /&gt;
Thus, it is preferable to keep the cache line in the '''‘M’''' state in the L3 cache. In such a situation, when the producer comes back around the ring buffer, it finds the previously written cache line still marked '''‘M’''', to which it is safe to write without coherence concerns. Optimizations of this kind to the standard protocols can therefore yield better performance when the protocols are implemented in real machines.&lt;br /&gt;
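&lt;br /&gt;
The cost difference can be illustrated with a toy count of coherence probes for a ring buffer. This is a simplified sketch, not a model of the actual Phenom hardware: it tracks one state per ring slot as seen in the shared L3 cache, and the function count_probes and its parameters are hypothetical names.&lt;br /&gt;

```python
# Count coherence probes for a producer/consumer ring buffer.
# Toy model: one state per ring slot as seen in the shared L3 cache.
# keep_modified=True models the optimization described above; the
# function and parameter names are hypothetical.

def count_probes(slots, laps, keep_modified):
    line_state = ["I"] * slots
    probes = 0
    for _ in range(laps):
        for i in range(slots):
            # Producer write: an Owned line lacks write permission,
            # so other caches must be probed before it can be upgraded.
            if line_state[i] == "O":
                probes += 1
            line_state[i] = "M"
            # Consumer read: plain MOESI demotes the line to Owned.
            if not keep_modified:
                line_state[i] = "O"
    return probes


print(count_probes(slots=8, laps=4, keep_modified=False))  # 24
print(count_probes(slots=8, laps=4, keep_modified=True))   # 0
```

In the unoptimized case every write after the first lap pays for a probe, while keeping the line in M removes that cost entirely in this simplified model.&lt;br /&gt;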
&lt;br /&gt;
You can find more information on how this is implemented, and on various other optimizations, in the [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors].&lt;br /&gt;
&lt;br /&gt;
==Dragon Protocol==&lt;br /&gt;
The '''[http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]''' is an update-based coherence protocol: unlike the coherence protocols discussed so far, it does not invalidate other cached copies. Write propagation is achieved by updating the cached copies instead of invalidating them.  The Dragon Protocol does not update memory on a cache-to-cache transfer; it defers making memory consistent with the caches until the data is evicted and written back, which saves time and lowers memory access requirements. Moreover, only the written '''byte''' or '''word''' is '''communicated''' to the '''other caches''' instead of the whole block, which further '''reduces''' '''bandwidth''' usage. The protocol can dynamically detect the sharing status of a block, using a write-through policy for shared blocks and a write-back policy for blocks that are currently not shared. The Dragon Protocol employs the following four states for cache blocks: '''Shared Clean''', '''Shared Modified''', '''Exclusive''' and  '''Modified'''. &lt;br /&gt;
* '''Modified (M)''' and '''Exclusive (E)''' - these states have the same meaning as explained in the protocols above. &lt;br /&gt;
* '''Shared Modified (Sm)''' - Only one cache in the system can hold the block in the Shared Modified state. Potentially two or more caches have this block, main memory is not up to date, and this processor's cache has modified the block and is responsible for writing it back.&lt;br /&gt;
* '''Shared Clean (Sc)''' -  Potentially two or more caches have this block and memory may or may not be up to date (if no other cache has it in the Sm state, memory is up to date; otherwise it is not).&lt;br /&gt;
Only when a Shared Modified line is evicted from the cache on a cache miss is the block written back to main memory to keep memory consistent. For more information on the Dragon protocol, refer to the Solihin textbook, page 229. The state transition diagram is given below for reference.[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
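&lt;br /&gt;
The update-instead-of-invalidate behaviour described above can be sketched as follows. This is a minimal illustration under simplifying assumptions (a single block, and a dictionary listing only the caches that currently hold it); the function dragon_write and the data layout are hypothetical.&lt;br /&gt;

```python
# Toy model of a Dragon-style write: sharers are updated, not invalidated.
# `caches` lists only the caches that currently hold the block; each entry
# has a state ("E", "M", "Sc", "Sm") and the block's words.
# Hypothetical sketch; names are chosen for this example only.

def dragon_write(caches, writer, index, value):
    """Write one word; return how many words were sent on the bus."""
    sharers = [c for c in caches if c != writer]
    caches[writer]["data"][index] = value
    if not sharers:
        caches[writer]["state"] = "M"  # private block, write-back policy
        return 0
    for c in sharers:
        caches[c]["data"][index] = value  # only the written word is sent
        caches[c]["state"] = "Sc"
    caches[writer]["state"] = "Sm"  # writer now owns the dirty block
    return 1


caches = {
    "p0": {"state": "Sc", "data": [0, 0, 0, 0]},
    "p1": {"state": "Sc", "data": [0, 0, 0, 0]},
}
print(dragon_write(caches, "p0", 2, 7), caches)
```

After the write, both caches hold the new word, the writer is in Sm and the other sharer is in Sc; main memory is not touched, which matches the deferred write-back described above.&lt;br /&gt;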
&lt;br /&gt;
[[File:DragonNew.jpg|500px|thumb|center|Dragon state transition diagram]] &lt;br /&gt;
&lt;br /&gt;
The Dragon protocol implements snoopy caches that provide the appearance of a uniform memory space to multiple processors. Each cache listens to 2 buses: the processor bus and the memory bus. The caches are also responsible for address translation, so the processor bus carries virtual addresses and the memory bus carries physical addresses.  The Dragon system was designed to support 4 to 8 Dragon processors.&lt;br /&gt;
&lt;br /&gt;
=[http://en.wikipedia.org/wiki/Instruction_prefetch Instruction Prefetching]=&lt;br /&gt;
&lt;br /&gt;
In standard [http://en.wikipedia.org/wiki/Pipeline_(computing) CPU pipelining], the optimization of instruction prefetching is used to minimize the time the CPU spends waiting on instructions being fetched from main memory.  Prefetching works perfectly on completely sequential programs; but on programs with branches, other optimizations (like [http://en.wikipedia.org/wiki/Branch_predictor branch prediction]) have to be implemented, and the cost of branching the wrong way has to be considered and optimized.[[#References|&amp;lt;sup&amp;gt;[26]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
Solihin[[#References|&amp;lt;sup&amp;gt;[27]&amp;lt;/sup&amp;gt;]] states that many processors provide a special prefetch instruction and rely on software to use it.  Solihin[[#References|&amp;lt;sup&amp;gt;[27]&amp;lt;/sup&amp;gt;]] also mentions that effective prefetching is characterized by '''coverage''' (the fraction of original cache misses that prefetching turns into cache hits), '''accuracy''' (the fraction of prefetches that result in cache hits), and '''timeliness''' (how early the prefetched data arrives relative to its use).  [http://en.wikipedia.org/wiki/Prefetch_buffer Prefetch buffers] help prefetching obtain the desired characteristics.  [http://publib.boulder.ibm.com/infocenter/db2luw/v8/index.jsp?topic=/com.ibm.db2.udb.doc/admin/c0005397.htm Sequential prefetching] detects and prefetches for accesses to contiguous locations, while [http://www.ics.uci.edu/~amrm/slides/amrm_structure/pta/tsld049.htm stride prefetching] detects and prefetches for accesses that are s cache blocks apart from one another.[[#References|&amp;lt;sup&amp;gt;[27]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
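&lt;br /&gt;
As a rough illustration of these definitions, the sketch below computes coverage and accuracy from miss and prefetch counts and shows a very small stride detector. The helper names and the three-access detection window are assumptions made for this example and are not taken from Solihin.&lt;br /&gt;

```python
# Coverage and accuracy as defined above, plus a minimal stride detector.
# Hypothetical helpers for illustration only.

def coverage(misses_without_prefetch, misses_with_prefetch):
    """Fraction of the original misses that prefetching turned into hits."""
    removed = misses_without_prefetch - misses_with_prefetch
    return removed / misses_without_prefetch


def accuracy(useful_prefetches, issued_prefetches):
    """Fraction of issued prefetches that produced a cache hit."""
    return useful_prefetches / issued_prefetches


def next_stride_address(block_addresses):
    """Predict the next block if the last three accesses share one stride.

    Needs at least three addresses; returns None when no stride is stable.
    """
    a, b, c = block_addresses[-3:]
    if b - a == c - b and c != b:
        return c + (c - b)
    return None


print(coverage(1000, 400))                # 0.6
print(accuracy(600, 800))                 # 0.75
print(next_stride_address([10, 14, 18]))  # 22
print(next_stride_address([10, 11, 13]))  # None
```

Sequential prefetching is the special case of a stride of one block, so the same detector predicts block 8 after accesses to blocks 5, 6 and 7.&lt;br /&gt;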
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
Prefetching works by bringing into the cache additional blocks that surround the blocks currently being accessed by the processor. [[#References|[28]]] states that simply increasing the hit ratio does not automatically improve overall performance in multiprocessor computation.  Prefetching can actually produce stalls on individual CPUs of a multiprocessor system, because the hit ratio benefit applies only to the thread that benefits from the prefetch, which then has to wait for the other threads to catch up.  As one approach to implementing prefetching within multiprocessor systems, [[#References|[28]]] suggests overlapping the prefetching decision-making time with user process time as much as possible, to minimize the impact on the overall execution sequence.&lt;br /&gt;
&lt;br /&gt;
= CMP Implementation in Intel Architecture =&lt;br /&gt;
&lt;br /&gt;
Let us now see how Intel architecture using the MESI protocol progressed from a uniprocessor architecture to a Chip MultiProcessor (CMP) using the bus as the interconnect. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Uniprocessor Architecture'''&lt;br /&gt;
&lt;br /&gt;
The diagram below shows the structure of the memory cluster in the Intel Pentium M processor.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:intel_cache1.jpg]]&amp;lt;/center&amp;gt; &amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In this structure we have,&lt;br /&gt;
* A unified on-chip '''L1 cache''' with the '''processor/core''',&lt;br /&gt;
* A '''Memory/L2 access control unit''', through which all the accesses to the L2 cache, main memory and IO space are made,&lt;br /&gt;
* The second level '''L2 cache''' along with the '''prefetch unit''' and&lt;br /&gt;
* '''Front side bus (FSB)''', a single shared bi-directional bus across which all the traffic is sent. This wide bus brings in multiple data bytes at a time. &lt;br /&gt;
&lt;br /&gt;
As Intel explains it, using this structure, processor requests were first sought in the '''L2 cache''' and only on a '''miss''' were they '''forwarded''' to main '''memory''' via the front side bus ('''FSB'''). The '''Memory/L2 access control''' unit served as a central point for '''maintaining coherence''' within the core and with the external world. It '''contains''' a '''snoop control unit''' that receives snoop requests from the bus and performs the required operations on each cache (and internal buffers) in parallel. It also handles RFO requests (BusUpgr) and ensures that the operation continues only after it guarantees that no other version of the cache line exists in any other cache in the system.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''CMP Architecture'''&lt;br /&gt;
&lt;br /&gt;
For the CMP implementation, Intel chose a bus-based architecture using snoopy protocols over a '''directory protocol''': although a directory protocol reduces active power through reduced snoop activity, it '''increases''' '''design complexity''' and '''static power''' because of its larger tag arrays. Since Intel has a large market for processors in the mobility family, a directory-based solution was less favorable, as battery life depends mainly on static power consumption and less on dynamic power.&lt;br /&gt;
Let us examine how '''CMP''' was implemented in the '''Intel Core Duo''', one of the first dual-core processors for the budget/entry-level market. &lt;br /&gt;
The general CMP implementation structure of the Intel Core Duo is shown below.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:intel_cache2.jpg]]&amp;lt;/center&amp;gt; &amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This structure has the following changes when compared to the uniprocessor memory cluster structure. &lt;br /&gt;
* '''L1 cache''' and the '''processor/core''' structure is '''duplicated''' to give 2 cores.&lt;br /&gt;
* The '''Memory/L2 access control''' unit is '''split''' into 2 logical units: '''L2 controller''' and '''bus controller'''. The L2 controller handles all '''requests to the L2''' cache from the core and the snoop requests from the FSB. The '''bus controller''' handles '''data and I/O requests''' to and from the FSB.&lt;br /&gt;
* The '''prefetching''' unit is extended to handle the hardware '''prefetches for each core separately'''.&lt;br /&gt;
* A '''new logical unit''' (represented by the hexagon) was added to maintain '''fairness between the requests''' coming from the different cores and hence balance the requests to L2 and memory.&lt;br /&gt;
&lt;br /&gt;
This new '''partitioned structure''' for the  memory/L2 access control unit '''enhanced''' the '''performance''' while '''reducing power consumption'''. &lt;br /&gt;
For more information on uniprocessor and multiprocessor implementation under the Intel architecture, refer to &lt;br /&gt;
[http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]&lt;br /&gt;
&lt;br /&gt;
The '''Intel bus architecture''' has been '''evolving''' to accommodate the demands of scalability while using the same MESI protocol: from a '''single shared bus''', to '''dual independent buses (DIB)''' that double the available bandwidth, to the logical conclusion of DIB, '''dedicated high-speed interconnects (DHSI)'''. DHSI-based platforms use four FSBs, one for each processor in the platform. In both DIB and DHSI, a snoop filter in the chipset caches snoop information, thereby significantly reducing the broadcasting needed for snoop traffic on the buses. With the production of processors based on the next-generation 45-nm Hi-k Intel Core microarchitecture, the [http://en.wikipedia.org/wiki/Xeon Intel Xeon] processor fabric will transition from DHSI, with the memory controller in the chipset, to a distributed shared memory architecture using '''Intel QuickPath Interconnects with the MESIF protocol'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Word Invalidation=&lt;br /&gt;
One complexity that applies to a number of the protocols concerns invalidation.  In newer protocols, individual words of a cache line may be modified, rather than the whole line.  The other processors then hold a mostly correct cache line that differs in only one word.  This leads to potential complexity, because to 'correct' the error the other processors must transition from shared to invalid, read the correct word from the bus, place it back into the line, and transition back to shared.  This transition is potentially unnecessary, as the second processor may never access the specific word that changed, even though it may access other words in the cache line.  One potential solution being researched is to advance the protocols so that a cache line is &amp;quot;not invalidated on the first dirty word, but after the number of dirty words crosses some predetermined value, which is data type and application dependent.&amp;quot;  In other words, if the application can tolerate multiple 'incorrect' words, several transitions to and from the invalid state may be avoided.&lt;br /&gt;
&amp;lt;br/&amp;gt;Source: http://tab.computer.org/tcca/NEWS/sept96/dsmideas.ps&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In essence, the solution proposed here is to advance the MOESI protocol with word invalidation and specific treatment of temporal and spatial data, so that the block is not invalidated.&lt;br /&gt;
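&lt;br /&gt;
The threshold idea can be sketched as follows. The threshold value, the function name apply_remote_writes and the representation of a line as a set of dirty word indices are all assumptions for illustration; the source only states that the value depends on the data type and the application.&lt;br /&gt;

```python
# Sketch of threshold-based invalidation: a remote copy tolerates a few
# dirty words before it is invalidated. The threshold value and all names
# here are assumptions; the source only says the value is data type and
# application dependent.

def apply_remote_writes(written_words, threshold):
    """Return (state, dirty word set) of a sharer after remote writes."""
    dirty = set()
    for word in written_words:
        dirty.add(word)
        if len(dirty) == threshold + 1:
            return "I", dirty  # too many stale words: invalidate the line
    return "S", dirty  # still usable for words outside `dirty`


print(apply_remote_writes([0, 3], threshold=2))     # stays shared
print(apply_remote_writes([0, 3, 5], threshold=2))  # invalidated
```

With a threshold of zero the sketch degenerates to ordinary invalidation on the first remote write, which is the behaviour the proposal tries to avoid.&lt;br /&gt;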
&lt;br /&gt;
=Performance: MOESI vs MEI/MESI=&lt;br /&gt;
Early multiprocessors (such as the PowerPC processors) were designed to work with three states (Modified, Exclusive, Invalid - this is similar to the MSI protocol discussed earlier, the term MEI is used to remain consistent with the sources used for reference).  As time progressed, more multi-processors transitioned to the MESI protocol.  This is most likely due to the characteristics of MEI - it is &amp;quot;easy to implement but leads to inefficiencies in the way that memory bus bandwidth is used&amp;quot; [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 source].  As a result, systems with two or more MEI processors (most likely) will not see the 'full' potential of the extra processors, due to the inefficient bus transactions.&lt;br /&gt;
&lt;br /&gt;
In contrast, the five-state protocol MOESI is more complex to implement than MESI and MEI/MSI.  However, it brings advantages.  As Andy Keane (VP of Marketing for PMC-Sierra) put it, &amp;quot;this [fifth] state allows shared data that is dirty to remain in the cache.  Without this state, any shared line would be written back to memory to change the original state from modified. Since we have a dedicated, fast path from CPU to CPU, this state maximizes the use of this path rather than the path to memory, which is inherently slower&amp;quot; [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 source].  This advantage can be realized by allowing MOESI to make some processor-to-processor transfers from the level 2 caches.&lt;br /&gt;
&lt;br /&gt;
Source: [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 Edwards, Chris (08/06/2001). &amp;quot;Architects wrestle with multiprocessor options&amp;quot;. Electronic engineering times (0192-1541), (1178), p. 48.]&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Cache_coherence Cache coherence]&lt;br /&gt;
# [http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]&lt;br /&gt;
# [http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Symmetric_multiprocessing Common System Interface in Intel Processors]&lt;br /&gt;
# [http://www.zak.ict.pwr.wroc.pl/nikodem/ak_materialy/Cache%20consistency%20&amp;amp;%20MESI.pdf Cache consistency with MESI on Intel processor]&lt;br /&gt;
# [http://techreport.com/articles.x/8236/2 AMD dual core Architecture]&lt;br /&gt;
# [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual]&lt;br /&gt;
# [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors]&lt;br /&gt;
# [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of AMD 64 bit core]&lt;br /&gt;
# [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&amp;amp;arnumber=4913 Silicon Graphics Computer Systems]&lt;br /&gt;
# [http://books.google.com/books?id=g82fofiqa5IC&amp;amp;printsec=frontcover&amp;amp;dq=Parallel+computer+architecture:+a+hardware/software+approach+By+David+E.+Culler,+Jaswinder+Pal+Singh,+Anoop+Gupta&amp;amp;source=bl&amp;amp;ots=COrdamlfVn&amp;amp;sig=YcugVqbzTjHvlofvaFq6Ft_tjfY&amp;amp;hl=en&amp;amp;ei=0ZO6S4TJGcOclgejzI3BBw&amp;amp;sa=X&amp;amp;oi=book_result&amp;amp;ct=result&amp;amp;resnum=1&amp;amp;ved=0CAgQ6AEwAA#v=onepage&amp;amp;q=&amp;amp;f=false Parallel computer architecture: a hardware/software approach By David E. Culler, Jaswinder Pal Singh, Anoop Gupta]&lt;br /&gt;
# [http://www.freepatentsonline.com/5283886.html Three state invalidation protocols]&lt;br /&gt;
# [http://portal.acm.org/citation.cfm?id=1499317&amp;amp;dl=GUIDE&amp;amp;coll=GUIDE&amp;amp;CFID=83027384&amp;amp;CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems]&lt;br /&gt;
# [http://portal.acm.org/citation.cfm?id=6514 Cache Coherence protocols: evaluation using a multiprocessor simulation model]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Xerox_Dragon Xerox Dragon]&lt;br /&gt;
# [http://thanaseto.110mb.com/courses/CSD-527-report-engl.pdf Coherence Protocols]&lt;br /&gt;
# [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&amp;amp;arnumber=289691 XDBus]&lt;br /&gt;
# [http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dai0228a/index.html ARM]&lt;br /&gt;
# [http://ixbtlabs.com/articles/ibmpower4/index.html IBM POWER4]&lt;br /&gt;
# Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems. Solihin, Yan&lt;br /&gt;
# [http://www.chiplist.com/Intel_Itanium_2_processor/tree3f-section--2235-/ Intel Itanium 2]&lt;br /&gt;
# [http://techreport.com/articles.x/8236/2 MESI-MESI-MOESI Banana-fana...]&lt;br /&gt;
# [http://rsim.cs.illinois.edu/rsim/Manual/node109.html RSIM]&lt;br /&gt;
# [http://dl.acm.org/citation.cfm?id=1499317 Synapse tightly coupled multiprocessors]&lt;br /&gt;
# [http://dictionary.reference.com/browse/instruction+prefetch prefetching]&lt;br /&gt;
# Yan Solihin, &amp;quot;Fundamentals of Parallel Computer Architecture,&amp;quot; Solihin Publishing &amp;amp; Consulting LLC, 2008, pp. 168-173.&lt;br /&gt;
# David F. Kotz and Carla Schlatter Ellis, &amp;quot;Prefetching in File Systems for MIMD Multiprocessors,&amp;quot; in IEEE Transactions on Parallel and Distributed Systems, Vol. 1, No. 2, April 1990, pp. 218-230.&lt;/div&gt;</summary>
		<author><name>Pxu</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8a_fu&amp;diff=59894</id>
		<title>CSC/ECE 506 Spring 2012/8a fu</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8a_fu&amp;diff=59894"/>
		<updated>2012-03-19T02:44:28Z</updated>

		<summary type="html">&lt;p&gt;Pxu: /* MESIF */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Introduction to bus-based cache coherence in real machines=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==SMP Architecture==&lt;br /&gt;
Most parallel software in the commercial market relies on the shared-memory programming model, in which all processors access the same physical address space. The most common multiprocessors today use the [http://en.wikipedia.org/wiki/Symmetric_multiprocessing SMP] architecture, which uses a common bus as the interconnect.  In the case of multicore processors (&amp;quot;chip multiprocessors,&amp;quot; or CMP), the SMP architecture applies to the cores, treating them as separate processors. The key problem of shared-memory multiprocessors is providing a consistent view of memory across the various cache hierarchies.  This is called the '''''cache coherence problem'''''. Solving it is critical for correctness, and it is a performance-sensitive design point in supporting the shared-memory model. The cache coherence mechanisms not only govern communication in a shared-memory multiprocessor, but also typically determine how the memory system transfers data between processors, caches, and memory.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:Busbased SMP.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
At any point in logical time, the permissions for a cache block can allow either a single writer or multiple readers. The '''''coherence protocol''''' ensures that the invariants of the states are maintained. The coherence states used by most cache coherence protocols are shown in ''Table 1'':&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''States'''&lt;br /&gt;
|  '''Access Type'''&lt;br /&gt;
|  '''Invariant'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  read, write&lt;br /&gt;
|  all other caches in I state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  read&lt;br /&gt;
|  all other caches in I state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Owned'''&lt;br /&gt;
|  read&lt;br /&gt;
|  all other caches in I or S state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  read&lt;br /&gt;
|  no other cache in M or E state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|  -&lt;br /&gt;
|  -&lt;br /&gt;
|}&lt;br /&gt;
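The invariants in ''Table 1'' can be checked mechanically. The following Python sketch is illustrative only: the function name invariants_hold and the one-letter state encoding are assumptions made here, not part of any real protocol implementation.&lt;br /&gt;

```python
# Hypothetical helper: checks the invariants of Table 1 for ONE cache block.
# `states` holds one state letter per cache: "M", "O", "E", "S" or "I".
def invariants_hold(states):
    for i, state in enumerate(states):
        others = states[:i] + states[i + 1:]
        if state in ("M", "E"):
            # Modified / Exclusive: every other cache must be Invalid.
            if any(o != "I" for o in others):
                return False
        elif state == "O":
            # Owned: other caches may only be Invalid or Shared.
            if any(o not in ("I", "S") for o in others):
                return False
        elif state == "S":
            # Shared: no other cache may be Modified or Exclusive.
            if any(o in ("M", "E") for o in others):
                return False
    return True
```

For example, one Owned copy next to two Shared copies satisfies the table, while a Modified copy next to a Shared copy does not.&lt;br /&gt;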
&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Snooping Protocols=&lt;br /&gt;
==MSI Protocol==&lt;br /&gt;
&lt;br /&gt;
'''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' is a three-state write-back '''invalidation protocol''' and one of the earliest snooping-based cache coherence protocols. It marks each cache line as '''Modified (M)''', '''Shared (S)''', or '''Invalid (I)'''. '''Invalid''' means the cache line is either not present or is present but invalid. If the cache line is clean and may be shared by more than one processor, it is marked '''Shared'''. If the cache line is dirty and the processor has exclusive ownership of it, it is in the '''Modified''' state. A BusRdX causes all other caches to invalidate (demote) their copies to the '''I''' state; if the line is in the '''M''' state in another cache, that cache flushes it. A processor write to a line in the '''S''' state, even though it is a cache hit, still issues a BusRdX and promotes (upgrades) the line to the '''M''' state.&lt;br /&gt;
&lt;br /&gt;
The state diagram of the MSI protocol is shown below. Note that the left part shows the response to processor-side requests, and the right part shows the response to snooper-side requests.[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
[[File:MSInew.jpg|600px|thumb|center|MSI state transition diagram]]&lt;br /&gt;
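The transitions described above can be written as two lookup tables. This is a minimal Python sketch of a textbook MSI protocol for a single cache line; the table names, the access function, and the event names (PrRd, PrWr, BusRd, BusRdX, Flush) are naming choices made here for illustration.&lt;br /&gt;

```python
# Each entry maps (current state, event) to (next state, bus action or None).
PROCESSOR_SIDE = {
    ("I", "PrRd"): ("S", "BusRd"),
    ("I", "PrWr"): ("M", "BusRdX"),
    ("S", "PrRd"): ("S", None),
    ("S", "PrWr"): ("M", "BusRdX"),  # a write hit in S still goes to the bus
    ("M", "PrRd"): ("M", None),
    ("M", "PrWr"): ("M", None),
}
SNOOPER_SIDE = {
    ("M", "BusRd"): ("S", "Flush"),
    ("M", "BusRdX"): ("I", "Flush"),
    ("S", "BusRd"): ("S", None),
    ("S", "BusRdX"): ("I", None),
    ("I", "BusRd"): ("I", None),
    ("I", "BusRdX"): ("I", None),
}


def access(states, cpu, op):
    """Apply processor request `op` from cache `cpu` to one line.

    `states` (one letter per cache) is updated in place. Returns the list
    of bus actions that the request generated, in order.
    """
    next_state, bus = PROCESSOR_SIDE[(states[cpu], op)]
    actions = []
    if bus is not None:
        actions.append(bus)
        for other in range(len(states)):
            if other != cpu:
                snooped, reply = SNOOPER_SIDE[(states[other], bus)]
                states[other] = snooped
                if reply is not None:
                    actions.append(reply)
    states[cpu] = next_state
    return actions
```

Starting from two invalid copies, a read followed by a write by the same cache yields a BusRd and then a BusRdX, which is the read-then-write cost discussed for MSI.&lt;br /&gt;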
&lt;br /&gt;
&lt;br /&gt;
===Real Architectures using MSI===&lt;br /&gt;
The '''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' protocol was first used in the '''[http://en.wikipedia.org/wiki/Silicon_Graphics SGI]''' IRIS 4D series. '''[http://en.wikipedia.org/wiki/Silicon_Graphics SGI]''' produced a broad range of '''[http://en.wikipedia.org/wiki/MIPS_architecture MIPS]'''-based (Microprocessor without Interlocked Pipeline Stages) workstations and servers during the 1990s, running '''[http://en.wikipedia.org/wiki/Silicon_Graphics SGI]''''s version of UNIX System V, now called '''[http://en.wikipedia.org/wiki/IRIX IRIX]'''. The 4D-MP graphics superworkstation delivered 40 MIPS (million instructions per second) of computing performance. This unprecedented level of computing and graphics processing in an office-environment workstation was made possible by the fastest available RISC microprocessors in a single shared-memory multiprocessor design driving a tightly coupled, highly parallel graphics system. Aggregate sustained data rates of over one gigabyte per second were achieved by a hierarchy of buses in a balanced system designed to avoid bottlenecks.&lt;br /&gt;
&lt;br /&gt;
The multiprocessor bus used in the 4D-MP graphics superworkstation is a pipelined, block-transfer bus that supports the cache coherence protocol and provides 64 megabytes of sustained data bandwidth between the processors, the memory and I/O system, and the graphics subsystem. Because the sync bus provides for efficient synchronization between processors, the cache coherence protocol was designed to support efficient data sharing between processors. If a cache coherence protocol has to support synchronization as well as sharing, a compromise in the efficiency of the data sharing protocol may be necessary to improve the efficiency of the synchronization operations. Hence the machine uses a simple cache coherence protocol, namely the '''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' protocol.&lt;br /&gt;
&lt;br /&gt;
With the simple rules of the '''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' protocol enforced by hardware, efficient synchronization and efficient data sharing are achieved in a simple shared-memory model of parallel processing in the 4D-MP graphics superworkstation.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
The MSI protocol has one serious drawback when programs execute read-then-write sequences. Each read-then-write involves two bus transactions, a BusRd and a BusRdX, even when no other cache holds the block, which wastes a large amount of bandwidth. Therefore, most newer machines do not implement the MSI protocol directly. Instead, they use variants of MSI, such as the MOSI and MESI protocols, that reduce the amount of traffic on the coherence interconnect.&lt;br /&gt;
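The cost of a read-then-write sequence can be made concrete by counting bus transactions. The Python sketch below is a simplified model written for this comparison: the function name and parameters are hypothetical, and it assumes that MESI fills a line in the Exclusive state when no other cache holds it.&lt;br /&gt;

```python
def bus_transactions(protocol, other_sharers):
    """Count bus transactions for a read miss followed by a write to the same line.

    protocol: "MSI" or "MESI" (simplified models, assumed for illustration).
    other_sharers: True when some other cache also holds the line.
    """
    count = 1  # the read miss always issues a BusRd
    if protocol == "MSI":
        state = "S"  # MSI has no way to record that the copy is private
    else:
        state = "S" if other_sharers else "E"
    if state == "S":
        count += 1  # a second transaction is needed to gain write permission
    return count
```

For private data, this model gives 2 transactions under MSI and 1 under MESI; for data that is actually shared, both protocols need 2.&lt;br /&gt;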
&lt;br /&gt;
==Synapse protocol==&lt;br /&gt;
From the state transition diagram of MSI, we observe that there is a transition from state '''M''' to state '''S''' when a BusRd is observed for that block. The contents of the block are flushed to the bus before the line goes to the '''S''' state. In certain cases it would be more appropriate to move to the '''I''' state, thus giving up the block entirely. The choice between moving to '''S''' or to '''I''' reflects the designer's assertion about whether the original processor is more likely to continue reading the block than the new processor is to write to it. The Synapse protocol, used in the early Synapse multiprocessor, made the alternate choice of going directly from the '''M''' state to the '''I''' state on a BusRd, assuming that the migratory pattern would be more frequent. More details about this protocol can be found in these papers published in the 1980s: [http://portal.acm.org/citation.cfm?id=6514 Cache Coherence protocols: evaluation using a multiprocessor simulation model] and [http://portal.acm.org/citation.cfm?id=1499317&amp;amp;dl=GUIDE&amp;amp;coll=GUIDE&amp;amp;CFID=83027384&amp;amp;CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems]&lt;br /&gt;
&lt;br /&gt;
In the Synapse protocol, the '''M''' state is called the '''D''' (Dirty) state. The following is the state transition diagram for the Synapse protocol:&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:Synapse1.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
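The difference between the two design choices fits in a few lines. In this hedged Python sketch, the function name snoop_bus_read and the protocol labels are assumptions made for illustration, and "M" and "D" are treated as the same dirty state.&lt;br /&gt;

```python
def snoop_bus_read(state, protocol):
    """Next state of a line when a BusRd from another cache is snooped.

    protocol: "MSI" or "Synapse" (labels assumed for this sketch).
    """
    if state in ("M", "D"):
        # Both protocols flush the dirty block; they differ in what is kept:
        # MSI keeps a Shared copy, Synapse gives the block up entirely.
        return "S" if protocol == "MSI" else "I"
    return state
```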
&amp;lt;br/&amp;gt;&lt;br /&gt;
===Real Architectures using Synapse===&lt;br /&gt;
No information could be found on any actual architecture that uses the Synapse cache-coherence protocol.  Judging from the abstract of the 1984 paper by Steve Frank and Armond Inselberg, &amp;quot;[http://dl.acm.org/citation.cfm?id=1499317 Synapse Tightly Coupled Multiprocessors: A New Approach to Solve Old Problems]&amp;quot;, Synapse appears to have been theoretical.&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[http://www.disi.unige.it/person/DelzannoG/CacheProtocol/synapse.hy The BABYLON Project] provides one example of a Synapse cache coherence protocol with atomic synchronization actions.&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
In the Synapse system, memory contention is reduced by designing a processor cache that employs a non-write-through algorithm to minimize the bandwidth used between the cache and shared memory. The Synapse Expansion Bus includes an ownership-level protocol between processor caches.[[#References|&amp;lt;sup&amp;gt;[24]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
==[http://en.wikipedia.org/wiki/MESI_protocol MESI]==&lt;br /&gt;
The main drawback of MSI is that each read-then-write sequence incurs two bus transactions, irrespective of whether the cache line is stored in only one cache or not. Highly parallel programs that have little data sharing suffer the most from this.  The MESI protocol solves this problem by introducing the '''Exclusive''' state to distinguish between a cache line stored in multiple caches and a line stored in a single cache.&lt;br /&gt;
Let us briefly see how the MESI protocol works. For a more detailed version, refer to the Solihin textbook, p. 215.&lt;br /&gt;
&lt;br /&gt;
The MESI coherence protocol marks each cache line as being in one of the Modified, Exclusive, Shared, or Invalid states. &lt;br /&gt;
* '''Invalid''' : The cache line is either not present or is invalid&lt;br /&gt;
* '''Exclusive''' : The cache line is clean and is owned by this core/processor only&lt;br /&gt;
* '''Modified''' : The cache line is dirty and the core/processor has exclusive ownership of it; the copy in memory is also excluded, since it is out of date&lt;br /&gt;
* '''Shared''' : The cache line is clean and is shared by more than one core/processor&lt;br /&gt;
&lt;br /&gt;
The MESI protocol works as follows: &lt;br /&gt;
A line that is fetched receives the '''E''' or '''S''' state, depending on whether it exists in other processors in the system. A cache line gets the '''M''' state when a processor writes to it; if the line is not in the '''E''' or '''M''' state prior to the write, the cache sends a Bus Upgrade (BusUpgr) signal or, as the Intel manuals term it, a “Read-For-Ownership (RFO) request”, which ensures that the line exists in the cache and is in the '''I''' state in all other processors on the bus (if any). The table below summarizes the MESI protocol.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''This cache line is valid?'''&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  No&lt;br /&gt;
|-&lt;br /&gt;
|  '''The memory copy is…'''&lt;br /&gt;
|  out of date&lt;br /&gt;
|  valid&lt;br /&gt;
|  valid&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies exist in caches of other processors?'''&lt;br /&gt;
|  No&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe&lt;br /&gt;
|  Maybe&lt;br /&gt;
|-&lt;br /&gt;
|  '''A write to this line'''&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  goes directly to bus&lt;br /&gt;
|}&lt;br /&gt;
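The fill rule and the last row of the table can be encoded directly. The following Python sketch is an illustration under assumptions: the function names are hypothetical, and the bus request names (BusUpgr for a write hit in S, BusRdX for a write miss in I) follow common textbook usage rather than any specific processor manual.&lt;br /&gt;

```python
def fill_state(other_caches_have_copy):
    """State given to a line that is fetched on a read miss."""
    return "S" if other_caches_have_copy else "E"


def write(state):
    """Return (next_state, bus_request) for a processor write to a line."""
    if state in ("M", "E"):
        # No other cache can hold a copy, so the write stays off the bus.
        return "M", None
    if state == "S":
        # The data is already cached; only the other copies must be invalidated.
        return "M", "BusUpgr"
    # Invalid: the line must be fetched together with write permission.
    return "M", "BusRdX"
```

A line filled in E is therefore written without any bus traffic, which is the case that plain MSI cannot optimize.&lt;br /&gt;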
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
State transition diagram for MESI protocol is shown below.[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
[[File:MESInew.jpg|500px|thumb|center|MESI state transition diagram]] &lt;br /&gt;
&lt;br /&gt;
===Real Architectures using MESI===&lt;br /&gt;
====CMP Implementation in Intel Architecture====&lt;br /&gt;
Intel's [http://en.wikipedia.org/wiki/Pentium_Pro Pentium Pro] microprocessor, introduced in 1995, was the first Intel architecture microprocessor to support symmetric multiprocessing in various multiprocessor configurations.  SMP with the MESI protocol was the architecture used consistently until the introduction of the 45-nm Hi-k Core microarchitecture in Intel's quad-core x86-64 Nehalem-EP. The 45-nm Hi-k Intel Core microarchitecture uses a new system framework called the [http://en.wikipedia.org/wiki/Intel_QuickPath_Interconnect QuickPath Interconnect], which uses point-to-point interconnection technology based on a distributed shared memory architecture.  It uses a modified version of the MESI protocol called [http://en.wikipedia.org/wiki/MESIF_protocol MESIF], which introduces the additional Forward state.[[#References|&amp;lt;sup&amp;gt;[3]&amp;lt;/sup&amp;gt;]]  (The state diagram of the MESI transitions that occur within the Pentium's data cache can be found on page 63 of [http://books.google.com/books?id=TVzjEZg1--YC&amp;amp;printsec=frontcover&amp;amp;source=gbs_ge_summary_r&amp;amp;cad=0#v=onepage&amp;amp;q&amp;amp;f=false Pentium Processor System Architecture: Second Edition] by Don Anderson, Tom Shanley, MindShare, Inc)&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
The CMP implementation on the Intel Pentium M processor contains a unified on-chip L1 cache with the processor, a Memory/L2 access control unit, a prefetch unit, and a [http://en.wikipedia.org/wiki/Front-side_bus Front Side Bus (FSB)].  Processor requests are first sought in the L2 cache.  On a miss, they are forwarded to main memory via the FSB. The Memory/L2 access control unit serves as a central point for maintaining coherence within the core and with the external world. It contains a snoop control unit that receives snoop requests from the bus and performs the required operations on each cache (and on the internal buffers) in parallel. It also handles RFO requests (BusUpgr) and ensures that the operation continues only after it guarantees that no other version of the cache line exists in any other cache in the system.[[#References|&amp;lt;sup&amp;gt;[3]&amp;lt;/sup&amp;gt;]]  The CMP implementation on the [http://en.wikipedia.org/wiki/Intel_Core Intel Core Duo] contains duplicated processors with on-chip L1 caches, an L2 controller to handle all L2 cache requests and snoop requests, a bus controller to handle data and I/O requests to and from the FSB, a prefetching unit, and a logical unit to maintain fairness between the requests coming from each processor to the L2 cache.[[#References|&amp;lt;sup&amp;gt;[3]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
====ARM MPCore====&lt;br /&gt;
The [http://www.arm.com/products/processors/classic/arm11/arm11-mpcore.php ARM11 MPCore] and [http://en.wikipedia.org/wiki/ARM_Cortex-A9_MPCore Cortex-A9 MPCore] processors support the MESI cache coherency protocol.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]  ARM MPCore defines the states of the MESI protocol it implements as:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies in other caches'''&lt;br /&gt;
|  NO&lt;br /&gt;
|  NO&lt;br /&gt;
|  YES&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Clean or Dirty'''&lt;br /&gt;
|  DIRTY&lt;br /&gt;
|  CLEAN&lt;br /&gt;
|  CLEAN&lt;br /&gt;
|  -&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
The coherency protocol is implemented and managed by the Snoop Control Unit (SCU) in the ARM MPCore, which monitors the traffic between the local L1 data caches and the next level of the memory hierarchy. At boot time, each core can choose whether or not to take part in the coherency domain.  Unless explicit system calls bind a task to a specific core ([http://en.wikipedia.org/wiki/Processor_affinity processor affinity]), there is a high chance that the task will at some point migrate to a different core, along with its data as it is used.  Task migration is not handled efficiently by a literal MESI implementation, so the ARM MPCore offers two optimizations that keep MESI compliance while supporting migration. With '''Direct Data Intervention (DDI)''', the SCU keeps a copy of the tag RAMs of all cores’ caches, which lets it efficiently detect whether a cache line requested by one core is held by another core in the coherency domain before looking for it in the next level of the memory hierarchy. With '''Cache-to-cache Migration''', if the SCU finds that the cache line requested by one CPU is present in another core, it either copies it (if clean) or moves it (if dirty) from the other CPU directly into the requesting one, without interacting with external memory.  These optimizations reduce memory traffic in and out of the L1 cache subsystem by eliminating interaction with external memories, which in effect reduces the overall load on the interconnect and the overall power consumption.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
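The two optimizations can be pictured with a toy model. The Python sketch below is not ARM's implementation: the class SnoopControlUnit, its fields, and the returned source labels are hypothetical, and each per-core dictionary stands in for both the L1 contents and the SCU's duplicate tag copy.&lt;br /&gt;

```python
class SnoopControlUnit:
    """Toy model of an SCU that can look up every core's L1 contents."""

    def __init__(self, num_cores):
        # One dict per core, mapping a line address to (state, data).
        self.l1 = [dict() for _ in range(num_cores)]

    def read(self, core, addr, memory):
        """Serve a read miss from `core`; returns (data, source)."""
        for other, lines in enumerate(self.l1):
            if other != core and addr in lines:
                state, data = lines[addr]
                if state == "M":
                    # Dirty line: move it to the requester (migration).
                    del lines[addr]
                    self.l1[core][addr] = ("M", data)
                else:
                    # Clean line: copy it, leaving both copies Shared.
                    lines[addr] = ("S", data)
                    self.l1[core][addr] = ("S", data)
                return data, "cache-to-cache"
        # No other core holds the line: fall back to the next memory level.
        data = memory[addr]
        self.l1[core][addr] = ("E", data)
        return data, "memory"
```

In this model, external memory is only consulted when no core in the coherency domain holds the requested line.&lt;br /&gt;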
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
Other architectures that use the MESI cache-coherence protocol include the L2 cache of the IBM POWER4 processor[[#References|&amp;lt;sup&amp;gt;[20]&amp;lt;/sup&amp;gt;]], the L2 cache of the Intel Itanium 2 processor[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]], and the Intel Xeon[[#References|&amp;lt;sup&amp;gt;[22]&amp;lt;/sup&amp;gt;]].&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
One implementation complexity already mentioned is the inefficiency of task migration in MESI that the ARM MPCore addresses.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]  Another possible implementation complexity is found during the replacement of a cache line: one possible MESI implementation requires a message to be sent to main memory when a cache line is flushed (i.e. an E to I transition), as the line was exclusively in one cache before it was removed. It is possible to avoid this replacement message if the system is designed so that the flush of a modified (exclusive) line requires an acknowledgment from main memory. However, this requires the flush to be stored in a 'write-back' buffer until the reply arrives (to ensure the change is successfully propagated to memory).[[#References|&amp;lt;sup&amp;gt;[23]&amp;lt;/sup&amp;gt;]]&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==MESIF==&lt;br /&gt;
&lt;br /&gt;
Invented by Intel, the MESIF protocol consists of five states: Modified (M), Exclusive (E), Shared (S), Invalid (I), and Forward (F). Whenever there is a read request, only the cache holding the line in the F state responds, while all caches holding it in the S state remain silent.  Hence, by designating a single cache to '''respond to requests''', coherency traffic is substantially reduced when multiple copies of the data exist. On a read request, the responding copy also transitions from the F state to the S state. That is, when a cache line in the '''F''' state is '''copied''', the F state '''migrates''' to the '''newer copy''', while the '''older''' one drops back to '''S'''. Moving the F state to the new copy '''exploits''' both '''temporal and spatial locality'''. Because the newest copy of the cache line is always in the F state, it is very unlikely that the line in the F state will be evicted from the caches, which takes advantage of the temporal locality of the requests. The second advantage is that if a particular cache line is in high demand due to spatial locality, the bandwidth used to transmit that data is spread across several nodes.&lt;br /&gt;
Every M-to-S and E-to-S state transition of MESI now becomes an '''M-to-F''' or '''E-to-F''' transition, respectively.  &lt;br /&gt;
The '''F state''' is '''different''' from the '''Owned state''' of the MOESI protocol in that it is '''not''' a unique copy: a valid copy is also stored in memory. Thus, unlike a line in the Owned state of the MOESI protocol, for which memory does not hold a valid copy, a line in the F state can be evicted or converted to the S state, if desired. &lt;br /&gt;
&lt;br /&gt;
The state transition table is shown below.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Clean / Dirty'''&lt;br /&gt;
|  '''May Write?'''&lt;br /&gt;
|  '''May Forward?'''&lt;br /&gt;
|  '''May Transition To?'''&lt;br /&gt;
|-&lt;br /&gt;
|  M - Modified&lt;br /&gt;
|  Dirty&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  E - Exclusive&lt;br /&gt;
|  Clean&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  MSIF&lt;br /&gt;
|-&lt;br /&gt;
|  S - Shared&lt;br /&gt;
|  Clean&lt;br /&gt;
|  No&lt;br /&gt;
|  No&lt;br /&gt;
|  I&lt;br /&gt;
|-&lt;br /&gt;
|  I - Invalid&lt;br /&gt;
|  -&lt;br /&gt;
|  No&lt;br /&gt;
|  No&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  F - Forward&lt;br /&gt;
|  Clean&lt;br /&gt;
|  No&lt;br /&gt;
|  Yes&lt;br /&gt;
|  SI&lt;br /&gt;
|}&lt;br /&gt;
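The migration of the F state to the newest copy can be sketched as follows. This Python fragment models only the clean forwarding case described above; the function name read_request and the list-of-letters encoding are assumptions made for illustration.&lt;br /&gt;

```python
def read_request(states, requester):
    """Serve a read request for one line under MESIF.

    `states` holds one state letter per cache and is updated in place.
    Returns the index of the cache that forwards the data, or None when
    no cache holds the line in F and memory has to respond instead.
    """
    if "F" not in states:
        return None
    forwarder = states.index("F")
    states[forwarder] = "S"  # the older copy drops back to Shared
    states[requester] = "F"  # the newest copy becomes the forwarder
    return forwarder
```

Caches in the S state never answer in this model, so exactly one response is generated no matter how many shared copies exist.&lt;br /&gt;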
&lt;br /&gt;
&lt;br /&gt;
===Real Architectures using MESIF===&lt;br /&gt;
&lt;br /&gt;
The MESIF protocol is implemented in the Intel QuickPath Interconnect (QPI), which is a high-speed, packetized, point-to-point interconnect used in Intel's Core i series processors starting in 2008. The narrow high-speed links stitch together processors in a distributed shared memory-style platform architecture. Compared with front-side buses, it offers much higher bandwidth with low latency. QPI has an efficient architecture that allows more interconnect performance to be achieved in real systems. It has a snoop protocol optimized for low latency and high scalability, as well as packet and lane structures that enable quick completion of transactions. [[#References|&amp;lt;sup&amp;gt;[2]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
More information on the QuickPath Interconnect and MESIF protocol can be found at&lt;br /&gt;
'''[http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
&lt;br /&gt;
In the MESIF protocol, only one agent can have a cache line in the F state at any given time; the other agents are allowed to have copies in the S state. Even when a cache line has been forwarded in this state, the home agent still needs to respond with a completion to allow retirement of the resources tracking the transaction.&lt;br /&gt;
&lt;br /&gt;
==MOESI==&lt;br /&gt;
[http://en.wikipedia.org/wiki/Opteron AMD Opteron] was AMD’s first-generation dual-core processor, with two distinct [http://en.wikipedia.org/wiki/Athlon_64 K8 cores] together on a single die.  Cache coherence is a bigger problem on such multiprocessors, and an appropriate coherence protocol was needed to address it. The [http://en.wikipedia.org/wiki/Xeon Intel Xeon], the competing part from Intel, used the MESI protocol to handle cache coherence.  MESI has the drawback of consuming considerable time and bandwidth in certain situations. &lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/MOESI_protocol MOESI] was AMD’s answer to this problem. MOESI adds a fifth state to the MESI protocol called '''“Owned”'''. It addresses the bandwidth problem that MESI faces when a processor with invalid data in its cache wants to read or modify data that another processor has modified.  Under MESI, the processor seeking access has to wait for the processor that modified the data to write it back to main memory, which takes time and bandwidth. MOESI removes this drawback by allowing dirty sharing.  When the data is held by a processor in the new '''“Owned”''' state, that processor can provide the modified data to other processors without, or even before, writing it to main memory. This is called '''''dirty sharing'''''. The processor holding the data in the '''&amp;quot;Owned&amp;quot;''' state remains responsible for updating main memory later, when the cache line is evicted.&lt;br /&gt;
&lt;br /&gt;
MOESI has become one of the most popular snoop-based protocols supported in the AMD64 architecture.  The AMD dual-core Opteron can maintain cache coherence in systems up to 8 processors using this protocol.&lt;br /&gt;
&lt;br /&gt;
The five different states of the MOESI protocol are:&lt;br /&gt;
* '''Modified (M)''' : The cache line holds the most recent copy of the data, and the data is not present in any other processor's cache.&lt;br /&gt;
* '''Owned (O)''' : The cache line holds the most recent, correct copy of the data, which may be shared by other processors. The processor holding the line in this state is responsible for writing the correct value to main memory when the line is evicted.  &lt;br /&gt;
* '''Exclusive (E)''' : The cache line holds the most recent, correct copy of the data, which is present only on this processor; main memory also holds a copy.  &lt;br /&gt;
* '''Shared (S)''' : The cache line holds the most recent, correct copy of the data, which may be shared by other processors. &lt;br /&gt;
* '''Invalid (I)''' : The cache line does not hold a valid copy of the data.&lt;br /&gt;
&lt;br /&gt;
A detailed explanation of this protocol implementation on AMD processor can be found in the manual [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of the AMD 64-bit core]&lt;br /&gt;
&lt;br /&gt;
The following table summarizes the MOESI protocol:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Owned'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''This cache line is valid?'''&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  No&lt;br /&gt;
|-&lt;br /&gt;
|  '''The memory copy is…'''&lt;br /&gt;
|  out of date&lt;br /&gt;
|  out of date&lt;br /&gt;
|  valid&lt;br /&gt;
|  valid (unless another cache holds the line in Owned)&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies exist in caches of other processors?'''&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe (in Shared state)&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe&lt;br /&gt;
|  Maybe&lt;br /&gt;
|-&lt;br /&gt;
|  '''A write to this line'''&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus to invalidate other copies&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  goes directly to bus&lt;br /&gt;
|}&lt;br /&gt;
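Dirty sharing can be summarized by what a cache does when it snoops a read from another processor and by which states require a write-back on eviction. The Python sketch below is a simplified illustration; the function names snoop_read and must_write_back are hypothetical.&lt;br /&gt;

```python
def snoop_read(state):
    """Return (next_state, supplies_dirty_data) for a snooped remote read."""
    if state in ("M", "O"):
        # Dirty sharing: hand over the modified data and keep ownership,
        # without writing the line back to main memory first.
        return "O", True
    if state == "E":
        return "S", False
    return state, False


def must_write_back(state):
    """True when evicting a line in `state` has to update main memory."""
    return state in ("M", "O")
```

The write-back that MESI performs at the moment of sharing is thus deferred until the owner evicts the line.&lt;br /&gt;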
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
State transition diagram for MOESI protocol is shown below. The top part shows the response to processor-side requests and the bottom part is the response to snooper-side requests. [[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:MOESI.jpg|500px|thumb|center|MOESI state transition diagram]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Real Architectures using MOESI===&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
&lt;br /&gt;
When the MOESI protocol is implemented on a real architecture such as the AMD K10 series, some modifications and optimizations are made to the protocol to allow more efficient operation for specific program patterns. For example, the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] family of microprocessors (Family 0x10), which is AMD’s first generation to incorporate 4 distinct cores on a single die and the first to have a cache that all the cores share, uses the MOESI protocol with some optimization techniques incorporated. &lt;br /&gt;
&lt;br /&gt;
The optimization targets a small subset of compute problems that behave like producer-consumer programs. In such a problem, a thread running on one core produces data, which is consumed by a thread running on a separate core. With such programs, it is desirable to have the two distinct cores communicate through the shared cache, to avoid round trips to and from main memory. However, the '''MOESI''' protocol that the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] cache uses for cache coherence can itself limit bandwidth. By keeping the cache line in the '''‘M’''' state for such computing problems, better performance can be achieved.&lt;br /&gt;
&lt;br /&gt;
When the producer thread writes a new entry, it allocates cache lines in the '''modified (M)''' state. Eventually, these M-marked cache lines start to fill the L3 cache. When the consumer reads a cache line, the MOESI protocol changes the state of the line to '''owned (O)''' in the L3 cache and pulls down a '''shared (S)''' copy for the consumer's own use. The producer thread then circles the ring buffer and arrives back at the same cache line it had previously written. However, when the producer attempts to write new data to the owned (marked '''‘O’''') cache line, it finds that it cannot, since a cache line marked '''‘O’''' by the previous consumer read does not have sufficient permission for a write request (in the MOESI protocol). To maintain coherence, the memory controller must initiate probes in the other caches (to handle any other S copies that may exist), which slows down the process.&lt;br /&gt;
&lt;br /&gt;
Thus, it is preferable to keep the cache line in the '''‘M’''' state in the L3 cache. In such a situation, when the producer comes back around the ring buffer, it finds the previously written cache line still marked '''‘M’''', to which it is safe to write without coherence concerns. Thus better performance can be achieved by such optimization techniques to standard protocols when implemented in real machines.&lt;br /&gt;
&lt;br /&gt;
More information on how this is implemented, and on various other optimizations, can be found in the manual [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors]&lt;br /&gt;
&lt;br /&gt;
==Dragon Protocol==&lt;br /&gt;
The '''[http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]''' is an update-based coherence protocol; unlike the coherence protocols seen so far, it does not invalidate other cached copies. Write propagation is achieved by updating the cached copies instead of invalidating them.  The Dragon Protocol does not update memory on a cache-to-cache transfer; it delays making memory consistent with the cache until the data is evicted and written back, which saves time and lowers the memory access requirements. Moreover, only the written '''byte''' or '''word''' is '''communicated''' to the '''other caches''' instead of the whole block, which further '''reduces''' the '''bandwidth''' usage. The protocol can dynamically detect the sharing status of a block, and it uses a write-through policy for shared blocks and a write-back policy for currently non-shared blocks. The Dragon Protocol employs the following four states for cache blocks: '''Shared Clean''', '''Shared Modified''', '''Exclusive''', and '''Modified'''. &lt;br /&gt;
* '''Modified (M)''' and '''Exclusive (E)''' - these states have the same meaning as explained in the protocols above. &lt;br /&gt;
* '''Shared Modified (Sm)''' - Only one cache in the system can hold a given block in the Shared Modified state. Potentially two or more caches have this block, main memory is not up to date, and this processor's cache, which modified the block, is responsible for updating memory when the block is replaced.&lt;br /&gt;
* '''Shared Clean (Sc)''' - Potentially two or more caches have this block, and memory may or may not be up to date (if no other cache has the block in the Sm state, memory is up to date; otherwise it is not).&lt;br /&gt;
A Shared Modified line is written back to main memory only when it is evicted from the cache on a cache miss, which keeps memory consistent. For more information on the Dragon protocol, refer to the Solihin textbook, page 229. The state transition diagram is given below for reference.[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
[[File:DragonNew.jpg|500px|thumb|center|Dragon state transition diagram]] &lt;br /&gt;
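The update-based behaviour can be sketched for write hits and snooped updates. This Python fragment follows the usual textbook description of Dragon; the function names, the BusUpd label, and the others_have_copy flag (standing in for the shared signal on the bus) are naming assumptions made here.&lt;br /&gt;

```python
def processor_write(state, others_have_copy):
    """Return (next_state, bus_action) for a write hit under Dragon."""
    if state in ("E", "M"):
        # The block is private: write locally, nothing goes on the bus.
        return "M", None
    # Sc or Sm: broadcast the written word so other copies are updated.
    if others_have_copy:
        return "Sm", "BusUpd"
    return "M", "BusUpd"


def snoop_update(state):
    """State of a line after it snoops a BusUpd from another cache.

    The local copy is updated rather than invalidated; a former owner
    in Sm hands ownership to the writer and becomes Sc.
    """
    if state in ("Sc", "Sm"):
        return "Sc"
    return state
```

No transition in this sketch leads to an invalid state, which is what distinguishes Dragon from the invalidation-based protocols above.&lt;br /&gt;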
&lt;br /&gt;
The Dragon protocol implements snoopy caches that provide the appearance of a uniform memory space to multiple processors. Here, each cache listens to 2 buses: the processor bus and the memory bus. The caches are also responsible for address translation, so the processor bus carries virtual addresses and the memory bus carries physical addresses.  The Dragon system was designed to support 4 to 8 Dragon processors.&lt;br /&gt;
&lt;br /&gt;
=[http://en.wikipedia.org/wiki/Instruction_prefetch Instruction Prefetching]=&lt;br /&gt;
&lt;br /&gt;
In standard [http://en.wikipedia.org/wiki/Pipeline_(computing) CPU pipelining], instruction prefetching is used to minimize the time the CPU spends waiting for instructions to be fetched from main memory.  Prefetching works best on completely sequential programs; on programs with branches, other optimizations (like [http://en.wikipedia.org/wiki/Branch_predictor branch prediction]) have to be implemented, and the cost of branching the wrong way has to be considered and optimized.[[#References|&amp;lt;sup&amp;gt;[26]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
Solihin[[#References|&amp;lt;sup&amp;gt;[27]&amp;lt;/sup&amp;gt;]] states that many processors provide a special prefetch instruction and rely on software to use it.  Solihin[[#References|&amp;lt;sup&amp;gt;[27]&amp;lt;/sup&amp;gt;]] also mentions that effective prefetching is characterized by '''coverage''' (the fraction of the original cache misses that prefetching has turned into cache hits), '''accuracy''' (the fraction of prefetches that generated cache hits), and '''timeliness''' (how early the prefetch arrives).  [http://en.wikipedia.org/wiki/Prefetch_buffer Prefetch buffers] help prefetching obtain the desired characteristics.  [http://publib.boulder.ibm.com/infocenter/db2luw/v8/index.jsp?topic=/com.ibm.db2.udb.doc/admin/c0005397.htm Sequential prefetching] detects and prefetches for accesses to contiguous locations, while [http://www.ics.uci.edu/~amrm/slides/amrm_structure/pta/tsld049.htm stride prefetching] detects and prefetches accesses in which consecutive accesses are s cache blocks apart.[[#References|&amp;lt;sup&amp;gt;[27]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
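&lt;br /&gt;
As a rough illustration of the definitions above, coverage and accuracy can be computed from simple counters, and a stride can be detected from a short stream of block addresses. The function names and the counters are assumptions made for this sketch; real prefetchers track strides per instruction in hardware tables rather than over a whole list.&lt;br /&gt;
&lt;br /&gt;
```python
def coverage(misses_removed, baseline_misses):
    """Fraction of the original misses that prefetching turned into hits."""
    return misses_removed / baseline_misses if baseline_misses else 0.0


def accuracy(useful_prefetches, issued_prefetches):
    """Fraction of issued prefetches that produced a cache hit."""
    return useful_prefetches / issued_prefetches if issued_prefetches else 0.0


def detect_stride(block_addresses):
    """Return the constant stride (in cache blocks) of a stream, or None.

    A stride of 1 corresponds to sequential prefetching.
    """
    deltas = {b - a for a, b in zip(block_addresses, block_addresses[1:])}
    # Two accesses give only one delta, which is not enough evidence,
    # so at least three accesses with identical deltas are required.
    if len(deltas) == 1 and len(block_addresses) != 2:
        stride = deltas.pop()
        return stride if stride != 0 else None
    return None
```
&lt;br /&gt;
For example, a prefetcher that removes 30 of 40 original misses while issuing 60 prefetches has a coverage of 0.75 and an accuracy of 0.5.&lt;br /&gt;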
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
Prefetching brings additional blocks into the cache that surround the blocks currently being accessed by the processor. [[#References|[28]]] states that simply increasing the hit ratio does not automatically improve overall performance in multiprocessor computation.  Prefetching can actually produce stalls on individual CPUs of a multiprocessor system, because the hit ratio benefit applies only to the thread currently benefiting from the prefetch, which then has to wait for other threads to catch up.  As one approach to implementing prefetching within multiprocessor systems, [[#References|[28]]] suggests overlapping prefetching decision-making time with user process time as much as possible, to minimize the impact on the overall execution sequence.&lt;br /&gt;
&lt;br /&gt;
= CMP Implementation in Intel Architecture =&lt;br /&gt;
&lt;br /&gt;
Let us now see how Intel architecture using the MESI protocol progressed from a uniprocessor architecture to a Chip MultiProcessor (CMP) using the bus as the interconnect. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Uniprocessor Architecture'''&lt;br /&gt;
&lt;br /&gt;
The diagram below shows the structure of the memory cluster in the Intel Pentium M processor.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:intel_cache1.jpg]]&amp;lt;/center&amp;gt; &amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In this structure we have,&lt;br /&gt;
* A unified on-chip '''L1 cache''' with the '''processor/core''',&lt;br /&gt;
* A '''Memory/L2 access control unit''', through which all the accesses to the L2 cache, main memory and IO space are made,&lt;br /&gt;
* The second level '''L2 cache''' along with the '''prefetch unit''' and&lt;br /&gt;
* '''Front side bus (FSB)''', a single shared bi-directional bus through which all the traffic is sent. These wide buses bring in multiple data bytes at a time. &lt;br /&gt;
&lt;br /&gt;
As Intel explains it, in this structure processor requests were first sought in the '''L2 cache''' and only on a '''miss''' were they '''forwarded''' to main '''memory''' via the front side bus ('''FSB'''). The '''Memory/L2 access control''' unit served as a central point for '''maintaining coherence''' within the core and with the external world. It '''contains''' a '''snoop control unit''' that receives snoop requests from the bus and performs the required operations on each cache (and internal buffers) in parallel. It also handles RFO requests (BusUpgr) and ensures the operation continues only after it guarantees that no other version of the cache line exists in any other cache in the system.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''CMP Architecture'''&lt;br /&gt;
&lt;br /&gt;
For the CMP implementation, Intel chose the bus-based architecture using snoopy protocols over the '''directory protocol''': although a directory protocol reduces active power due to reduced snoop activity, it '''increases''' the '''design complexity''' and the '''static power''' due to larger tag arrays. Since Intel has a large market for processors in the mobility family, a directory-based solution was less favorable, because battery life depends mainly on static power consumption and less on dynamic power.&lt;br /&gt;
Let us examine how '''CMP''' was implemented in the '''Intel Core Duo''', which was one of the first dual-core processors for the budget/entry-level market. &lt;br /&gt;
The general CMP implementation structure of the Intel Core Duo is shown below.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:intel_cache2.jpg]]&amp;lt;/center&amp;gt; &amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This structure has the following changes when compared to the uniprocessor memory cluster structure. &lt;br /&gt;
* '''L1 cache''' and the '''processor/core''' structure is '''duplicated''' to give 2 cores.&lt;br /&gt;
* The '''Memory/L2 access control''' unit is '''split''' into 2 logical units: '''L2 controller''' and '''bus controller'''. The L2 controller handles all '''requests to the L2''' cache from the core and the snoop requests from the FSB. The '''bus controller''' handles '''data and I/O requests''' to and from the FSB.&lt;br /&gt;
* The '''prefetching''' unit is extended to handle the hardware '''prefetches for each core separately'''.&lt;br /&gt;
* A '''new logical unit''' (represented by the hexagon) was added to maintain '''fairness between the requests''' coming from the different cores and hence balance the requests to L2 and memory.&lt;br /&gt;
&lt;br /&gt;
This new '''partitioned structure''' for the  memory/L2 access control unit '''enhanced''' the '''performance''' while '''reducing power consumption'''. &lt;br /&gt;
For more information on uniprocessor and multiprocessor implementation under the Intel architecture, refer to &lt;br /&gt;
[http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]&lt;br /&gt;
&lt;br /&gt;
The '''Intel bus architecture''' has been '''evolving''' to accommodate the demands of scalability while using the same MESI protocol: from a '''single shared bus''', to '''dual independent buses (DIB)''' that double the available bandwidth, and finally to the logical conclusion of DIB with the introduction of '''dedicated high-speed interconnects (DHSI)'''. The DHSI-based platforms use four FSBs, one for each processor in the platform. In both DIB and DHSI, a snoop filter in the chipset caches snoop information, thereby significantly reducing the broadcasting needed for snoop traffic on the buses. With the production of processors based on the next-generation 45-nm Hi-k Intel Core microarchitecture, the [http://en.wikipedia.org/wiki/Xeon Intel Xeon] processor fabric will transition from DHSI, with the memory controller in the chipset, to a distributed shared memory architecture using '''Intel QuickPath Interconnects''' with the '''MESIF protocol'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Word Invalidation=&lt;br /&gt;
One complexity problem applying to a number of the protocols deals with invalidation.  In newer protocols, individual words may be modified in a cache line, as opposed to the entirety of the line.  The other processors will thus have a mostly correct cache line, with only a word difference.  This leads to potential complexity, because to 'correct' the error the other processors must transition from shared to invalid, where they then can read the correct word from the bus and place it back into the line, transitioning back into shared.  This transition is potentially unnecessary, as the second processor may never access the specific word changed, but it may access other words in the cache line.  One potential solution to this is being researched at the present time, advancing the protocols so that a cache line is &amp;quot;not invalidated on the first dirty word, but after the number of dirty words crosses some predetermined value, which is data type and application dependent.&amp;quot;  In other words, if the application can possibly have multiple words 'incorrect,' several transitions to and from the invalid state may be avoided.&lt;br /&gt;
&amp;lt;br/&amp;gt;Source: http://tab.computer.org/tcca/NEWS/sept96/dsmideas.ps&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In essence, the solution proposed here is to advance the MOESI protocol with word invalidation and specific treatment of temporal and spatial data, so that the block is not invalidated.&lt;br /&gt;
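&lt;br /&gt;
The threshold idea can be sketched as follows. This is only an illustration of the quoted proposal, not an implementation from the cited source: the class name, the per-word stale set and the fixed integer threshold are all assumptions made for the example.&lt;br /&gt;
&lt;br /&gt;
```python
class ThresholdLine:
    """Remote copy of a cache line that tolerates a few stale words."""

    def __init__(self, threshold):
        self.stale = set()          # word indices known to be out of date
        self.threshold = threshold  # assumed to be tuned per data type/application
        self.valid = True

    def remote_write(self, word_index):
        """Record that another processor wrote one word of this line."""
        if not self.valid:
            return
        self.stale.add(word_index)
        # The count grows by at most one per write, so reaching
        # threshold + 1 is the moment it crosses the threshold.
        if len(self.stale) == self.threshold + 1:
            self.valid = False      # only now is the whole line invalidated

    def can_read(self, word_index):
        """A local read hits only if the line is valid and the word is fresh."""
        return self.valid and word_index not in self.stale
```
&lt;br /&gt;
With a threshold of 2, two remote writes leave the other words of the line readable, and the third remote write to a new word invalidates the line.&lt;br /&gt;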
&lt;br /&gt;
=Performance: MOESI vs MEI/MESI=&lt;br /&gt;
Early multiprocessors (such as the PowerPC processors) were designed to work with three states: Modified, Exclusive and Invalid. This is similar to the MSI protocol discussed earlier; the term MEI is used to remain consistent with the sources used for reference.  As time progressed, more multiprocessors transitioned to the MESI protocol.  This is most likely due to the characteristics of MEI - it is &amp;quot;easy to implement but leads to inefficiencies in the way that memory bus bandwidth is used&amp;quot; [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 source].  As a result, systems with two or more MEI processors will most likely not see the full potential of the extra processors, due to the inefficient bus transactions.&lt;br /&gt;
&lt;br /&gt;
In contrast, the five-state protocol MOESI is more complex to implement than MESI and MEI/MSI.  However, advantages arise from using it.  As Andy Keane (VP of Marketing for PMC-Sierra) put it, &amp;quot;this [fifth] state allows shared data that is dirty to remain in the cache.  Without this state, any shared line would be written back to memory to change the original state from modified. Since we have a dedicated, fast path from CPU to CPU, this state maximizes the use of this path rather than the path to memory, which is inherently slower&amp;quot; [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 source].  This advantage can be realized by allowing MOESI to make some processor-to-processor transfers from level 2 caches.&lt;br /&gt;
&lt;br /&gt;
Source: [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 Edwards, Chris (08/06/2001). &amp;quot;Architects wrestle with multiprocessor options&amp;quot;. Electronic engineering times (0192-1541), (1178), p. 48.]&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Cache_coherence Cache coherence]&lt;br /&gt;
# [http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]&lt;br /&gt;
# [http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Symmetric_multiprocessing Common System Interface in Intel Processors]&lt;br /&gt;
# [http://www.zak.ict.pwr.wroc.pl/nikodem/ak_materialy/Cache%20consistency%20&amp;amp;%20MESI.pdf Cache consistency with MESI on Intel processor]&lt;br /&gt;
# [http://techreport.com/articles.x/8236/2 AMD dual core Architecture]&lt;br /&gt;
# [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual]&lt;br /&gt;
# [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors]&lt;br /&gt;
# [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of AMD 64 bit core]&lt;br /&gt;
# [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&amp;amp;arnumber=4913 Silicon Graphics Computer Systems]&lt;br /&gt;
# [http://books.google.com/books?id=g82fofiqa5IC&amp;amp;printsec=frontcover&amp;amp;dq=Parallel+computer+architecture:+a+hardware/software+approach+By+David+E.+Culler,+Jaswinder+Pal+Singh,+Anoop+Gupta&amp;amp;source=bl&amp;amp;ots=COrdamlfVn&amp;amp;sig=YcugVqbzTjHvlofvaFq6Ft_tjfY&amp;amp;hl=en&amp;amp;ei=0ZO6S4TJGcOclgejzI3BBw&amp;amp;sa=X&amp;amp;oi=book_result&amp;amp;ct=result&amp;amp;resnum=1&amp;amp;ved=0CAgQ6AEwAA#v=onepage&amp;amp;q=&amp;amp;f=false Parallel computer architecture: a hardware/software approach By David E. Culler, Jaswinder Pal Singh, Anoop Gupta]&lt;br /&gt;
# [http://www.freepatentsonline.com/5283886.html Three state invalidation protocols]&lt;br /&gt;
# [http://portal.acm.org/citation.cfm?id=1499317&amp;amp;dl=GUIDE&amp;amp;coll=GUIDE&amp;amp;CFID=83027384&amp;amp;CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems]&lt;br /&gt;
# [http://portal.acm.org/citation.cfm?id=6514 Cache Coherence protocols: evaluation using a multiprocessor simulation model]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Xerox_Dragon Xerox Dragon]&lt;br /&gt;
# [http://thanaseto.110mb.com/courses/CSD-527-report-engl.pdf Coherence Protocols]&lt;br /&gt;
# [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&amp;amp;arnumber=289691 XDBus]&lt;br /&gt;
# [http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dai0228a/index.html ARM]&lt;br /&gt;
# [http://ixbtlabs.com/articles/ibmpower4/index.html IBM POWER4]&lt;br /&gt;
# Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems. Solihin, Yan&lt;br /&gt;
# [http://www.chiplist.com/Intel_Itanium_2_processor/tree3f-section--2235-/ Intel Itanium 2]&lt;br /&gt;
# [http://techreport.com/articles.x/8236/2 MESI-MESI-MOESI Banana-fana...]&lt;br /&gt;
# [http://rsim.cs.illinois.edu/rsim/Manual/node109.html RSIM]&lt;br /&gt;
# [http://dl.acm.org/citation.cfm?id=1499317 Synapse tightly coupled multiprocessors]&lt;br /&gt;
# [http://dictionary.reference.com/browse/instruction+prefetch prefetching]&lt;br /&gt;
# Yan Solihin, &amp;quot;Fundamentals of Parallel Computer Architecture,&amp;quot; Solihin Publishing &amp;amp; Consulting LLC, 2008, pp. 168-173.&lt;br /&gt;
# David F. Kotz and Carla Schlatter Ellis, &amp;quot;Prefetching in File Systems for MIMD Multiprocessors,&amp;quot; in IEEE Transactions on Parallel and Distributed Systems, Vol. 1, No. 2, April 1990, pp. 218-230.&lt;/div&gt;</summary>
		<author><name>Pxu</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8a_fu&amp;diff=59891</id>
		<title>CSC/ECE 506 Spring 2012/8a fu</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8a_fu&amp;diff=59891"/>
		<updated>2012-03-19T02:42:42Z</updated>

		<summary type="html">&lt;p&gt;Pxu: /* Real Architectures using MESIF */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Introduction to bus-based cache coherence in real machines=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==SMP Architecture==&lt;br /&gt;
Most parallel software in the commercial market relies on the shared-memory programming model, in which all processors access the same physical address space. The most common multiprocessors today use the [http://en.wikipedia.org/wiki/Symmetric_multiprocessing SMP] architecture, which uses a common bus as the interconnect.  In the case of multicore processors (&amp;quot;chip multiprocessors,&amp;quot; or CMP), the SMP architecture applies to the cores, treating them as separate processors. The key problem of shared-memory multiprocessors is providing a consistent view of memory across the various cache hierarchies.  This is called the '''''cache coherence problem'''''. Solving it is critical for correctness and is a performance-sensitive design point in supporting the shared-memory model. The cache coherence mechanisms not only govern communication in a shared-memory multiprocessor, but also typically determine how the memory system transfers data between processors, caches, and memory.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:Busbased SMP.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
At any point in logical time, the permissions for a cache block can allow either a single writer or multiple readers. The '''''coherence protocol''''' ensures the invariants of the states are maintained. The different coherent states used by most of the cache coherence protocols are as shown in ''Table 1'':&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''States'''&lt;br /&gt;
|  '''Access Type'''&lt;br /&gt;
|  '''Invariant'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  read, write&lt;br /&gt;
|  all other caches in I state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  read&lt;br /&gt;
|  all other caches in I state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Owned'''&lt;br /&gt;
|  read&lt;br /&gt;
|  all other caches in I or S state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  read&lt;br /&gt;
|  no other cache in M or E state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|  -&lt;br /&gt;
|  -&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Snooping Protocols=&lt;br /&gt;
==MSI Protocol==&lt;br /&gt;
&lt;br /&gt;
'''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' is a three-state write-back '''invalidation protocol''' and one of the earliest snooping-based cache coherence protocols. It marks each cache line as '''Modified (M)''', '''Shared (S)''' or '''Invalid (I)'''. '''Invalid''' means the cache line is either not present or is in the invalid state. If the cache line is clean and may be shared by more than one processor, it is marked '''Shared'''. If the cache line is dirty and the processor has exclusive ownership of it, it is in the '''Modified''' state. A BusRdX causes other caches to invalidate (demote) their copies to the '''I''' state; a cache that holds the line in the '''M''' state flushes it first. A write to a line held in the '''S''' state, even though it hits in the cache, still issues a BusRdX and promotes (upgrades) the line to the '''M''' state.&lt;br /&gt;
&lt;br /&gt;
The state diagram of the MSI protocol is shown below. Note that the left part shows the response to processor-side requests, and the right part shows the response to snooper-side requests.[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
[[File:MSInew.jpg|600px|thumb|center|MSI state transition diagram]]&lt;br /&gt;
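&lt;br /&gt;
The processor-side and snooper-side halves of the MSI protocol can be sketched as two small transition functions. The function names and the string encoding of states and events are assumptions made for this illustration; the transitions follow the description above.&lt;br /&gt;
&lt;br /&gt;
```python
def msi_processor(state, op):
    """Processor-side MSI step: returns (next_state, bus_transaction)."""
    if op == "PrRd":
        # Only a read miss (state I) needs the bus.
        return ("S", "BusRd") if state == "I" else (state, None)
    # op is "PrWr"
    if state == "M":
        return "M", None
    # From I or S the other copies must be invalidated first.
    return "M", "BusRdX"


def msi_snoop(state, bus_op):
    """Snooper-side MSI step: returns (next_state, action)."""
    if bus_op == "BusRd":
        return ("S", "Flush") if state == "M" else (state, None)
    if bus_op == "BusRdX":
        return "I", ("Flush" if state == "M" else None)
    return state, None
```
&lt;br /&gt;
Starting from the I state, a read followed by a write produces a BusRd and then a BusRdX, even when no other cache holds the line.&lt;br /&gt;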
&lt;br /&gt;
&lt;br /&gt;
===Real Architectures using MSI===&lt;br /&gt;
The '''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' protocol was first used in the '''[http://en.wikipedia.org/wiki/Silicon_Graphics SGI]''' IRIS 4D series. '''[http://en.wikipedia.org/wiki/Silicon_Graphics SGI]''' produced a broad range of '''[http://en.wikipedia.org/wiki/MIPS_architecture MIPS]'''-based (Microprocessor without Interlocked Pipeline Stages) workstations and servers during the 1990s, running '''[http://en.wikipedia.org/wiki/Silicon_Graphics SGI]''''s version of UNIX System V, now called '''[http://en.wikipedia.org/wiki/IRIX IRIX]'''. The 4D-MP graphics superworkstation delivered 40 MIPS (million instructions per second) of computing performance. This unprecedented level of computing and graphics processing in an office-environment workstation was made possible by the fastest available RISC microprocessors in a single shared-memory multiprocessor design driving a tightly coupled, highly parallel graphics system. Aggregate sustained data rates of over one gigabyte per second were achieved by a hierarchy of buses in a balanced system designed to avoid bottlenecks.&lt;br /&gt;
&lt;br /&gt;
The multiprocessor bus used in 4D-MP graphics superworkstation is a pipelined, block transfer bus that supports the cache coherence protocol as well as providing 64 megabytes of sustained data bandwidth between the processors, the memory and I/O system, and the graphics subsystem. Because the sync bus provides for efficient synchronization between processors, the cache coherence protocol was designed to support efficient data sharing between processors. If a cache coherence protocol has to support synchronization as well as sharing, a compromise in the efficiency of the data sharing protocol may be necessary to improve the efficiency of the synchronization operations. Hence it uses the simple cache coherence protocol which is the '''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' protocol.&lt;br /&gt;
&lt;br /&gt;
With the simple rules of the '''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' protocol enforced by hardware, efficient synchronization and efficient data sharing are achieved in a simple shared memory model of parallel processing in the 4D-MP graphics superworkstation.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
The MSI protocol has one serious drawback when executing read-then-write sequences in programs. For each read-then-write, two bus transactions are involved: a BusRd and a BusRdX. These extra transactions waste a large amount of bandwidth. Therefore, most new machines do not implement the MSI protocol. Instead, they use variants of the MSI protocol, such as the MOSI and MESI protocols, to reduce the amount of traffic in the coherency interconnect.&lt;br /&gt;
&lt;br /&gt;
==Synapse protocol==&lt;br /&gt;
From the state transition diagram of MSI, we observe that there is a transition from state '''M''' to state '''S''' when a BusRd is observed for that block. The contents of the block are flushed to the bus before going to the '''S''' state. In certain cases it would be more appropriate to move to the '''I''' state, giving up the block entirely. The choice between moving to '''S''' or '''I''' reflects the designer's assertion about whether the original processor is more likely to continue reading the block than the new processor is to write to it. The Synapse protocol, used in the early Synapse multiprocessor, made the alternate choice of going directly from the '''M''' state to the '''I''' state on a BusRd, assuming the migratory pattern would be more frequent. More details about this protocol can be found in these papers published in the 1980s: [http://portal.acm.org/citation.cfm?id=6514 Cache Coherence protocols: evaluation using a multiprocessor simulation model] and [http://portal.acm.org/citation.cfm?id=1499317&amp;amp;dl=GUIDE&amp;amp;coll=GUIDE&amp;amp;CFID=83027384&amp;amp;CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems]&lt;br /&gt;
&lt;br /&gt;
In Synapse protocol '''M''' state is called '''D''' (Dirty) state. The following is the state transition diagram for Synapse protocol:&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:Synapse1.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
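&lt;br /&gt;
The snooper-side difference from MSI can be sketched in a few lines. Only the transition from D to I on a bus read is taken from the description above; the function name, the S label for the clean state and the remaining transitions are assumed to mirror MSI for the purpose of this illustration.&lt;br /&gt;
&lt;br /&gt;
```python
def synapse_snoop(state, bus_op):
    """Snooper-side step in which a dirty block is given up on any bus read."""
    if state == "D" and bus_op in ("BusRd", "BusRdX"):
        return "I", "Flush"       # D goes straight to I, never to S
    if state == "S" and bus_op == "BusRdX":
        return "I", None          # assumed to behave as in MSI
    return state, None
```
&lt;br /&gt;
A cache holding a block in the D state therefore loses it as soon as another processor reads it, which favours migratory data over data that the original processor keeps reading.&lt;br /&gt;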
&amp;lt;br/&amp;gt;&lt;br /&gt;
===Real Architectures using Synapse===&lt;br /&gt;
No actual architecture using the Synapse cache-coherence protocol could be found.  Judging from the abstract of the 1984 paper by Steve Frank and Armond Inselberg, &amp;quot;[http://dl.acm.org/citation.cfm?id=1499317 Synapse Tightly Coupled Multiprocessors: A New Approach to Solve Old Problems]&amp;quot;, Synapse was theoretical.&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[http://www.disi.unige.it/person/DelzannoG/CacheProtocol/synapse.hy The BABYLON Project] provides one example of a Synapse cache coherence protocol with atomic synchronization actions.&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
In the Synapse system, memory contention is reduced by designing the processor cache around a non-write-through algorithm to minimize the bandwidth used between cache and shared memory. The Synapse Expansion Bus includes an ownership level protocol between processor caches.[[#References|&amp;lt;sup&amp;gt;[24]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
==[http://en.wikipedia.org/wiki/MESI_protocol MESI]==&lt;br /&gt;
The main drawback of MSI is that each read-write sequence incurs two bus transactions irrespective of whether the cache line is stored in only one cache or not. Highly parallel programs that have little data sharing suffer the most from this.  MESI protocol solves this problem by introducing the '''Exclusive''' state to distinguish between a cache line stored in multiple caches and a line stored in a single cache.&lt;br /&gt;
Let us briefly see how the MESI protocol works. For a more detailed version, refer to the Solihin textbook, pg. 215.&lt;br /&gt;
&lt;br /&gt;
The MESI coherence protocol marks each cache line as being in one of the Modified, Exclusive, Shared, or Invalid states. &lt;br /&gt;
* '''Invalid''' : The cache line is either not present or is invalid&lt;br /&gt;
* '''Exclusive''' : The cache line is clean and is owned by this core/processor only&lt;br /&gt;
* '''Modified''' : The cache line is dirty and the core/processor has exclusive ownership of it, exclusive of the memory also.&lt;br /&gt;
* '''Shared''' : The cache line is clean and is shared by more than one core/processor&lt;br /&gt;
&lt;br /&gt;
The MESI protocol works as follows: &lt;br /&gt;
A line that is fetched receives the '''E''' or '''S''' state depending on whether it exists in the caches of other processors in the system. A cache line gets the '''M''' state when a processor writes to it; if the line is not in the '''E''' or '''M''' state prior to the write, the cache sends a Bus Upgrade (BusUpgr) signal or, as the Intel manuals term it, a “Read-For-Ownership (RFO) request”, which ensures that the line exists in the cache and is in the '''I''' state in all other processors on the bus (if any). The table below summarizes the MESI protocol.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''This cache line is valid?'''&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  No&lt;br /&gt;
|-&lt;br /&gt;
|  '''The memory copy is…'''&lt;br /&gt;
|  out of date&lt;br /&gt;
|  valid&lt;br /&gt;
|  valid&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies exist in caches of other processors?'''&lt;br /&gt;
|  No&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe&lt;br /&gt;
|  Maybe&lt;br /&gt;
|-&lt;br /&gt;
|  '''A write to this line'''&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  goes directly to bus&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
State transition diagram for MESI protocol is shown below.[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
[[File:MESInew.jpg|500px|thumb|center|MESI state transition diagram]] &lt;br /&gt;
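&lt;br /&gt;
The benefit of the Exclusive state for read-then-write sequences can be illustrated by counting bus transactions. This is a minimal processor-side sketch: the function names and the others_have_copy flag (standing in for the bus signal that reports sharers) are assumptions made for the example.&lt;br /&gt;
&lt;br /&gt;
```python
def mesi_processor(state, op, others_have_copy=False):
    """Processor-side MESI step: returns (next_state, bus_transaction)."""
    if op == "PrRd":
        if state == "I":
            return ("S" if others_have_copy else "E"), "BusRd"
        return state, None
    # op is "PrWr"
    if state in ("M", "E"):
        return "M", None          # silent upgrade, no bus traffic
    if state == "S":
        return "M", "BusUpgr"     # invalidate the other copies
    return "M", "BusRdX"          # write miss


def bus_transactions(ops, others_have_copy=False):
    """Count the bus transactions for a sequence of processor operations."""
    state, count = "I", 0
    for op in ops:
        state, txn = mesi_processor(state, op, others_have_copy)
        if txn is not None:
            count += 1
    return count
```
&lt;br /&gt;
For a line that no other cache holds, a read followed by a write costs one bus transaction here, compared with the two that MSI requires; when the line is shared, the write still needs a BusUpgr.&lt;br /&gt;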
&lt;br /&gt;
===Real Architectures using MESI===&lt;br /&gt;
====CMP Implementation in Intel Architecture====&lt;br /&gt;
Intel's [http://en.wikipedia.org/wiki/Pentium_Pro Pentium Pro] microprocessor, introduced in 1995, was the first Intel architecture microprocessor to support symmetric multiprocessing in various multiprocessor configurations.  SMP with the MESI protocol was the architecture used consistently until the introduction of the 45-nm Hi-k Core microarchitecture in Intel's (Nehalem-EP) quad-core x86-64. The 45-nm Hi-k Intel Core microarchitecture utilizes a new system framework called the [http://en.wikipedia.org/wiki/Intel_QuickPath_Interconnect QuickPath Interconnect], which uses point-to-point interconnection technology based on a distributed shared memory architecture.  It uses a modified version of the MESI protocol called [http://en.wikipedia.org/wiki/MESIF_protocol MESIF], which introduces the additional Forward state.[[#References|&amp;lt;sup&amp;gt;[3]&amp;lt;/sup&amp;gt;]]  (The state diagram of MESI transitions that occur within the Pentium's data cache can be found on page 63 of [http://books.google.com/books?id=TVzjEZg1--YC&amp;amp;printsec=frontcover&amp;amp;source=gbs_ge_summary_r&amp;amp;cad=0#v=onepage&amp;amp;q&amp;amp;f=false Pentium Processor System Architecture: Second Edition] by Don Anderson, Tom Shanley, MindShare, Inc)&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
The uniprocessor memory cluster of the Intel Pentium M processor contains a unified on-chip L1 cache with the processor, a Memory/L2 access control unit, a prefetch unit, and a [http://en.wikipedia.org/wiki/Front-side_bus Front Side Bus (FSB)].  Processor requests are first sought in the L2 cache.  On a miss, they are forwarded to the main memory via the FSB. The Memory/L2 access control unit serves as a central point for maintaining coherence within the core and with the external world. It contains a snoop control unit that receives snoop requests from the bus and performs the required operations on each cache (and internal buffers) in parallel. It also handles RFO requests (BusUpgr) and ensures the operation continues only after it guarantees that no other version of the cache line exists in any other cache in the system.[[#References|&amp;lt;sup&amp;gt;[3]&amp;lt;/sup&amp;gt;]]  The CMP implementation on the [http://en.wikipedia.org/wiki/Intel_Core Intel Core Duo] contains duplicated processors with on-chip L1 caches, an L2 controller to handle all L2 cache requests and snoop requests, and a bus controller to handle data and I/O requests to and from the FSB.&lt;br /&gt;
This structure has the following changes when compared to the uniprocessor memory cluster structure. &lt;br /&gt;
* '''L1 cache''' and the '''processor/core''' structure is '''duplicated''' to give 2 cores.&lt;br /&gt;
* The '''Memory/L2 access control''' unit is '''split''' into 2 logical units: '''L2 controller''' and '''bus controller'''. The L2 controller handles all '''requests to the L2''' cache from the core and the snoop requests from the FSB. The '''bus controller''' handles '''data and I/O requests''' to and from the FSB.&lt;br /&gt;
* The '''prefetching''' unit is extended to handle the hardware '''prefetches for each core separately'''.&lt;br /&gt;
* A '''new logical unit''' (represented by the hexagon) was added to maintain '''fairness between the requests''' coming from the different cores and hence balance the requests to L2 and memory.&lt;br /&gt;
&lt;br /&gt;
====ARM MPCore====&lt;br /&gt;
The [http://www.arm.com/products/processors/classic/arm11/arm11-mpcore.php ARM11 MPCore] and [http://en.wikipedia.org/wiki/ARM_Cortex-A9_MPCore Cortex-A9 MPCore] processors support the MESI cache coherency protocol.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]  ARM MPCore defines the states of the MESI protocol it implements as:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies in other caches'''&lt;br /&gt;
|  NO&lt;br /&gt;
|  NO&lt;br /&gt;
|  YES&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Clean or Dirty'''&lt;br /&gt;
|  DIRTY&lt;br /&gt;
|  CLEAN&lt;br /&gt;
|  CLEAN&lt;br /&gt;
|  -&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
The coherency protocol is implemented and managed by the Snoop Control Unit (SCU) in the ARM MPCore, which monitors the traffic between the local L1 data caches and the next level of the memory hierarchy. At boot time, each core can choose whether or not to take part in the coherency domain.  Unless explicit system calls bind a task to a specific core ([http://en.wikipedia.org/wiki/Processor_affinity processor affinity]), there is a high chance that a task will at some point migrate to a different core, along with its data as it is used.  A literal MESI implementation does not handle task migration efficiently, so the ARM MPCore offers two optimizations that remain MESI-compliant while speeding up migration: '''Direct Data Intervention (DDI)''' (in which the SCU keeps a copy of the tag RAMs of all cores' caches, which lets it efficiently detect whether a cache line requested by one core is held by another core in the coherency domain before looking for it in the next level of the memory hierarchy) and '''Cache-to-cache Migration''' (in which, if the SCU finds that the cache line requested by one CPU is present in another core, it either copies it (if clean) or moves it (if dirty) from the other CPU directly into the requesting one, without interacting with external memory).  These optimizations reduce memory traffic in and out of the L1 cache subsystem by eliminating interaction with external memories, which in effect reduces the overall load on the interconnect and the overall power consumption.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
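&lt;br /&gt;
The SCU behaviour described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not ARM's implementation: the names SnoopControlUnit, Line and read_miss are hypothetical, and one dictionary per core stands in for both the L1 contents and the SCU's duplicated tag RAMs.&lt;br /&gt;

```python
from dataclasses import dataclass, field


@dataclass
class Line:
    state: str  # one of "M", "E", "S", "I"
    data: int


@dataclass
class SnoopControlUnit:
    # caches[core][addr] models each core's L1 and the SCU's tag copy (assumption)
    caches: dict = field(default_factory=dict)
    memory: dict = field(default_factory=dict)
    memory_reads: int = 0

    def read_miss(self, core, addr):
        # Look in the other cores before going to the next memory level.
        for other, lines in list(self.caches.items()):
            line = lines.get(addr)
            if other == core or line is None or line.state == "I":
                continue
            if line.state == "M":
                # Dirty line: move it, the requester becomes the only holder.
                mine = Line("M", line.data)
                line.state = "I"
            else:
                # Clean line: copy it, both holders end up Shared.
                mine = Line("S", line.data)
                line.state = "S"
            self.caches.setdefault(core, {})[addr] = mine
            return mine
        # No other core holds the line: fetch it from external memory.
        self.memory_reads += 1
        mine = Line("E", self.memory[addr])
        self.caches.setdefault(core, {})[addr] = mine
        return mine
```

In this sketch a migrating task that re-reads its dirty data on a new core never touches external memory; only a line held by no core increments memory_reads.&lt;br /&gt;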
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
Other architectures that use the MESI cache-coherence protocol include the L2 cache of the IBM POWER4 processor[[#References|&amp;lt;sup&amp;gt;[20]&amp;lt;/sup&amp;gt;]], the L2 cache of the Intel Itanium 2 processor[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]], and the Intel Xeon[[#References|&amp;lt;sup&amp;gt;[22]&amp;lt;/sup&amp;gt;]].&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
One implementation complexity already mentioned is the inefficiency of task migration in MESI that the ARM MPCore addresses.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]  Another possible implementation complexity is found during the replacement of a cache line: one possible MESI implementation requires a message to be sent to main memory when a cache line is flushed (i.e. an E to I transition), as the line was exclusively in one cache before it was removed. It is possible to avoid this replacement message if the system is designed so that the flush of a modified (exclusive) line requires an acknowledgment from main memory. However, this requires the flush to be stored in a 'write-back' buffer until the reply arrives (to ensure the change is successfully propagated to memory).[[#References|&amp;lt;sup&amp;gt;[23]&amp;lt;/sup&amp;gt;]]&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==MESIF==&lt;br /&gt;
Let us now walk through a briefing on the '''MESIF protocol''':&lt;br /&gt;
&lt;br /&gt;
Invented by Intel, the MESIF protocol consists of five states: Modified (M), Exclusive (E), Shared (S), Invalid (I) and Forward (F). On a read request, only the cache holding the line in the F state responds, while all caches holding it in the S state remain silent.  Hence, by designating a single cache to '''respond to requests''', coherency traffic is substantially reduced when multiple copies of the data exist. When a cache line in the '''F''' state is '''copied''', the F state '''migrates''' to the '''newer copy''', while the '''older''' one drops back to '''S'''. Moving the F state to the new copy '''exploits''' both '''temporal and spatial locality'''. Because the newest copy of the cache line is always the one in the F state, the line in the F state is unlikely to be evicted from the caches soon, which takes advantage of the temporal locality of the requests. The second advantage is that if a particular cache line is in high demand due to spatial locality, the bandwidth used to transmit that data is spread across several nodes.&lt;br /&gt;
All '''M to S''' and '''E to S''' state transitions of MESI now become '''M to F''' and '''E to F''' transitions.  &lt;br /&gt;
The '''F state''' is '''different''' from the '''Owned state''' of the MOESI protocol because it is '''not''' a unique copy: a valid copy is also stored in memory. Thus, unlike a line in the Owned state of MOESI, for which memory does not hold a valid copy, a line in the F state can be evicted or converted to the S state, if desired. &lt;br /&gt;
&lt;br /&gt;
The state transition table is shown below.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Clean / Dirty'''&lt;br /&gt;
|  '''May Write?'''&lt;br /&gt;
|  '''May Forward?'''&lt;br /&gt;
|  '''May Transition To?'''&lt;br /&gt;
|-&lt;br /&gt;
|  M - Modified&lt;br /&gt;
|  Dirty&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  E - Exclusive&lt;br /&gt;
|  Clean&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  MSIF&lt;br /&gt;
|-&lt;br /&gt;
|  S - Shared&lt;br /&gt;
|  Clean&lt;br /&gt;
|  No&lt;br /&gt;
|  No&lt;br /&gt;
|  I&lt;br /&gt;
|-&lt;br /&gt;
|  I - Invalid&lt;br /&gt;
|  -&lt;br /&gt;
|  No&lt;br /&gt;
|  No&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  F - Forward&lt;br /&gt;
|  Clean&lt;br /&gt;
|  No&lt;br /&gt;
|  Yes&lt;br /&gt;
|  SI&lt;br /&gt;
|}&lt;br /&gt;
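&lt;br /&gt;
The forwarding rule of this section (only the F copy answers a read, and F migrates to the newest copy) can be sketched as follows. This is a simplified illustration, not Intel's implementation: the function name mesif_read and the dictionary of per-core states are hypothetical, and the Modified state is left out to keep the example short.&lt;br /&gt;

```python
def mesif_read(states, requester):
    # states maps core id to "E", "S", "F" or "I" for one cache line.
    # Returns the supplier of the data and updates the states in place.
    supplier = "memory"
    for core, state in states.items():
        if core != requester and state in ("E", "F"):
            supplier = core
            states[core] = "S"  # the older copy drops back to Shared
            break
    others_hold = any(
        state == "S" for core, state in states.items() if core != requester
    )
    # The newest copy becomes the forwarder; a sole copy is Exclusive.
    states[requester] = "F" if others_hold else "E"
    return supplier
```

With states {0: "F", 1: "S", 2: "I"}, a read by core 2 is answered by core 0 alone, and the line ends up as {0: "S", 1: "S", 2: "F"}.&lt;br /&gt;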
&lt;br /&gt;
===Real Architectures using MESIF===&lt;br /&gt;
&lt;br /&gt;
The MESIF protocol is implemented in the Intel QuickPath Interconnect (QPI), which is a high-speed, packetized, point-to-point interconnect used in Intel's Core i series processors starting in 2008. The narrow high-speed links stitch together processors in a distributed shared memory-style platform architecture. Compared with front-side buses, it offers much higher bandwidth with low latency. QPI has an efficient architecture allowing more interconnect performance to be achieved in real systems. It has a snoop protocol optimized for low latency and high scalability, as well as packet and lane structures enabling quick completion of transactions. [[#References|&amp;lt;sup&amp;gt;[2]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
More information on the QuickPath Interconnect and MESIF protocol can be found at&lt;br /&gt;
'''[http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]'''&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
&lt;br /&gt;
==MOESI==&lt;br /&gt;
[http://en.wikipedia.org/wiki/Opteron AMD Opteron] was AMD’s first-generation dual-core processor, which had 2 distinct [http://en.wikipedia.org/wiki/Athlon_64 K8 cores] together on a single die.  Cache coherence becomes a bigger problem on such multiprocessors, so it was necessary to use an appropriate coherence protocol to address it. The [http://en.wikipedia.org/wiki/Xeon Intel Xeon], which was the competing counterpart from Intel, used the MESI protocol to handle cache coherence.  MESI came with the drawback of using considerable time and bandwidth in certain situations. &lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/MOESI_protocol MOESI] was AMD’s answer to this problem. MOESI added a fifth state, called '''“Owned”''', to the MESI protocol. MOESI addresses the bandwidth problem that MESI faces when a processor whose cached copy is invalid wants to access data that another processor has modified.  The processor seeking access has to wait for the processor that modified the data to write it back to main memory, which takes time and bandwidth. MOESI removes this drawback by allowing dirty sharing.  When the data is held by a processor in the new state '''“Owned”''', it can provide the modified data to other processors without, or even before, writing it to main memory. This is called '''''dirty sharing'''''. The processor holding the data in the '''&amp;quot;Owned&amp;quot;''' state stays responsible for updating main memory later, when the cache line is evicted.&lt;br /&gt;
&lt;br /&gt;
MOESI has become one of the most popular snoop-based protocols supported in the AMD64 architecture.  The AMD dual-core Opteron can maintain cache coherence in systems with up to 8 processors using this protocol.&lt;br /&gt;
&lt;br /&gt;
The five different states of the MOESI protocol are:&lt;br /&gt;
* '''Modified (M)''' : The cache line holds the most recent, correct copy of the data. The copy in main memory is stale, and no other processor cache holds a copy.&lt;br /&gt;
* '''Owned (O)'''   : The cache line holds the most recent, correct copy of the data, which may be shared by other processors. The processor holding the line in this state is responsible for updating main memory with the correct value before the line is evicted.  &lt;br /&gt;
* '''Exclusive (E)''' : The cache line holds the most recent, correct copy of the data, which is present only in this processor's cache; main memory also holds a valid copy.  &lt;br /&gt;
* '''Shared (S)''' : The cache line holds the most recent, correct copy of the data, which may be shared by other processors. &lt;br /&gt;
* '''Invalid (I)''' : The cache line does not hold a valid copy of the data.&lt;br /&gt;
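&lt;br /&gt;
Dirty sharing through the Owned state can be illustrated with a small sketch. It is a simplified model of a single cache line under stated assumptions, not AMD's implementation: the class MoesiLine and its methods are hypothetical names, and bus timing is ignored.&lt;br /&gt;

```python
class MoesiLine:
    # Models one cache line shared by several cores (illustrative only).
    def __init__(self, memory_value):
        self.memory = memory_value
        self.states = {}   # core id to "M", "O", "E", "S" or "I"
        self.values = {}   # core id to cached value
        self.writebacks = 0

    def write(self, core, value):
        # Every other copy is invalidated; the writer holds the line Modified.
        for other in self.states:
            if other != core:
                self.states[other] = "I"
        self.states[core] = "M"
        self.values[core] = value

    def read(self, core):
        if self.states.get(core, "I") != "I":
            return self.values[core]  # cache hit
        owner = next(
            (c for c, s in self.states.items() if s in ("M", "O")), None
        )
        if owner is not None:
            # Dirty sharing: the owner supplies the data, memory stays stale.
            self.states[owner] = "O"
            value = self.values[owner]
            self.states[core] = "S"
        else:
            value = self.memory
            others = [c for c, s in self.states.items() if s != "I"]
            for c in others:
                self.states[c] = "S"
            self.states[core] = "S" if others else "E"
        self.values[core] = value
        return value

    def evict(self, core):
        # Only a dirty holder (M or O) has to update main memory.
        if self.states.get(core) in ("M", "O"):
            self.memory = self.values[core]
            self.writebacks += 1
        self.states[core] = "I"
```

After core 0 writes and core 1 reads, core 0 holds the line in O and core 1 in S while main memory is still stale; the write-back happens only when core 0 evicts the line.&lt;br /&gt;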
&lt;br /&gt;
A detailed explanation of how this protocol is implemented on AMD processors can be found in [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of the AMD 64-bit core].&lt;br /&gt;
&lt;br /&gt;
The following table summarizes the MOESI protocol:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Owned'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''This cache line is valid?'''&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  No&lt;br /&gt;
|-&lt;br /&gt;
|  '''The memory copy is…'''&lt;br /&gt;
|  out of date&lt;br /&gt;
|  out of date&lt;br /&gt;
|  valid&lt;br /&gt;
|  valid&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies exist in caches of other processors?'''&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe (in the Shared state)&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe&lt;br /&gt;
|  Maybe&lt;br /&gt;
|-&lt;br /&gt;
|  '''A write to this line'''&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus (other copies must be invalidated first)&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  goes directly to bus&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The state transition diagram for the MOESI protocol is shown below. The top part shows the response to processor-side requests and the bottom part shows the response to snooper-side requests. [[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:MOESI.jpg|500px|thumb|center|MOESI state transition diagram]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Real Architectures using MOESI===&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
&lt;br /&gt;
When the MOESI protocol is implemented on a real architecture such as the AMD K10 series, some modifications and optimizations are made to the protocol to allow more efficient operation for specific program patterns. For example, the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] family of microprocessors (Family 0x10), which is AMD’s first generation to incorporate 4 distinct cores on a single die and the first to have a cache that all the cores share, uses the MOESI protocol with some optimization techniques incorporated. &lt;br /&gt;
&lt;br /&gt;
The optimization focuses on a small subset of compute problems that behave like producer-consumer programs. In such a problem, a thread running on one core produces data, which is consumed by a thread running on a separate core. With such programs, it is desirable to have the two cores communicate through the shared cache, avoiding round trips to and from main memory. The '''MOESI''' protocol that the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] cache uses for cache coherence can also limit bandwidth, so keeping the cache line in the '''‘M’''' state for such problems yields better performance.&lt;br /&gt;
&lt;br /&gt;
When the producer thread writes a new entry, it allocates cache lines in the '''modified (M)''' state. Eventually, these M-marked cache lines start to fill the L3 cache. When the consumer reads a cache line, the MOESI protocol changes the state of the line to '''owned (O)''' in the L3 cache and pulls down a '''shared (S)''' copy for the consumer's own use. The producer thread then circles the ring buffer and arrives back at the same cache line it had previously written. However, when the producer attempts to write new data to the cache line now marked '''‘O’''', it finds that it cannot, since a cache line marked '''‘O’''' by the previous consumer read does not have sufficient permission for a write request (in the MOESI protocol). To maintain coherence, the memory controller must initiate probes in the other caches (to handle any other S copies that may exist), which slows down the process.&lt;br /&gt;
&lt;br /&gt;
Thus, it is preferable to keep the cache line in the '''‘M’''' state in the L3 cache. In that case, when the producer comes back around the ring buffer, it finds the previously written cache line still marked '''‘M’''', to which it is safe to write without coherence concerns. Better performance can therefore be achieved by applying such optimizations to the standard protocols when they are implemented in real machines.&lt;br /&gt;
&lt;br /&gt;
More information on how this is implemented, along with various other optimizations, can be found in the [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors].&lt;br /&gt;
&lt;br /&gt;
==Dragon Protocol==&lt;br /&gt;
The '''[http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]''' is an update-based coherence protocol: unlike the coherence protocols seen so far, it does not invalidate other cached copies. Write propagation is achieved by updating the cached copies instead of invalidating them.  The Dragon Protocol does not update memory on a cache-to-cache transfer; it delays bringing memory and cache into agreement until the data is evicted and written back, which saves time and lowers the memory access requirements. Moreover, only the written '''byte''' or '''word''' is '''communicated''' to the '''other caches''' instead of the whole block, which further '''reduces''' the '''bandwidth''' usage. It can dynamically detect the sharing status of a block, using a write-through policy for shared blocks and a write-back policy for currently non-shared blocks. The Dragon Protocol employs the following four states for the cache blocks: '''Shared Clean''', '''Shared Modified''', '''Exclusive''' and  '''Modified'''. &lt;br /&gt;
* '''Modified (M)''' and '''Exclusive (E)''' - these states have the same meaning as explained in the protocols above. &lt;br /&gt;
* '''Shared Modified (Sm)''' - Only one cache in the system can hold a given block in the Shared Modified state. Potentially two or more caches have this block, memory may or may not be up to date, and this processor's cache has modified the block.&lt;br /&gt;
* '''Shared Clean (Sc)''' -  Potentially two or more caches have this block, and memory may or may not be up to date (if no other cache has it in the Sm state, memory is up to date; otherwise it is not).&lt;br /&gt;
Only when a Shared Modified line is evicted from the cache on a cache miss is the block written back to main memory, in order to keep memory consistent. For more information on the Dragon protocol, refer to the Solihin textbook, page 229. The state transition diagram is given below for reference.[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
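&lt;br /&gt;
The update-on-write behaviour described above can be sketched as follows. This is a minimal illustration that models one word per block and ignores memory and bus timing; the function name dragon_write and the data layout are hypothetical.&lt;br /&gt;

```python
def dragon_write(caches, writer, addr, value):
    # caches maps core id to a dict of addr to (state, value) tuples.
    # Returns how many other caches received the updated word.
    sharers = [c for c in caches if c != writer and addr in caches[c]]
    for c in sharers:
        # Other copies are updated in place, never invalidated.
        caches[c][addr] = ("Sc", value)
    # The writer owns the dirty block: Sm if shared, M if it is the sole holder.
    caches[writer][addr] = ("Sm" if sharers else "M", value)
    return len(sharers)
```

A write to a block held by two caches leaves the writer in Sm and the other holder in Sc with the new value, whereas a write to an unshared block needs no bus update and ends in M.&lt;br /&gt;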
&lt;br /&gt;
[[File:DragonNew.jpg|500px|thumb|center|Dragon state transition diagram]] &lt;br /&gt;
&lt;br /&gt;
The Dragon protocol implements snoopy caches that provide the appearance of a uniform memory space to multiple processors. Here, each cache listens to 2 buses: the processor bus and the memory bus. The caches are also responsible for address translation, so the processor bus carries virtual addresses and the memory bus carries physical addresses.  The Dragon system was designed to support 4 to 8 Dragon processors.&lt;br /&gt;
&lt;br /&gt;
=[http://en.wikipedia.org/wiki/Instruction_prefetch Instruction Prefetching]=&lt;br /&gt;
&lt;br /&gt;
In standard [http://en.wikipedia.org/wiki/Pipeline_(computing) CPU pipelining], the optimization of instruction prefetching is used to minimize the time the CPU spends waiting on instructions being fetched from main memory.  Prefetching works perfectly on completely sequential programs; but on programs with branches, other optimizations (like [http://en.wikipedia.org/wiki/Branch_predictor branch prediction]) have to be implemented, and the cost of branching the wrong way has to be considered and optimized.[[#References|&amp;lt;sup&amp;gt;[26]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
Solihin[[#References|&amp;lt;sup&amp;gt;[27]&amp;lt;/sup&amp;gt;]] states that many processors rely on software to issue a special prefetch instruction.  Solihin[[#References|&amp;lt;sup&amp;gt;[27]&amp;lt;/sup&amp;gt;]] also mentions that effective prefetching is characterized by '''coverage''' (the fraction of original cache misses that prefetching has turned into cache hits), '''accuracy''' (the fraction of prefetches that generated cache hits), and '''timeliness''' (how early the prefetch arrives).  [http://en.wikipedia.org/wiki/Prefetch_buffer Prefetch buffers] help prefetching obtain the desired characteristics.  [http://publib.boulder.ibm.com/infocenter/db2luw/v8/index.jsp?topic=/com.ibm.db2.udb.doc/admin/c0005397.htm Sequential prefetching] detects and prefetches for accesses to contiguous locations, while [http://www.ics.uci.edu/~amrm/slides/amrm_structure/pta/tsld049.htm stride prefetching] detects and prefetches for consecutive accesses that are s cache blocks apart.[[#References|&amp;lt;sup&amp;gt;[27]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
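&lt;br /&gt;
The coverage and accuracy definitions above, together with stride detection, can be made concrete with a short sketch. The function and parameter names are hypothetical and the counts are made-up inputs; a hardware prefetcher would track strides per instruction or per stream rather than over a whole list.&lt;br /&gt;

```python
def prefetch_metrics(original_misses, prefetches_issued, useful_prefetches):
    # coverage: share of the original misses that prefetching turned into hits
    coverage = useful_prefetches / original_misses
    # accuracy: share of the issued prefetches that produced a hit
    accuracy = useful_prefetches / prefetches_issued
    return coverage, accuracy


def detect_stride(block_addresses):
    # Returns s when every consecutive access is s cache blocks apart,
    # otherwise None. A stride of 1 is the sequential case.
    strides = {b - a for a, b in zip(block_addresses, block_addresses[1:])}
    return strides.pop() if len(strides) == 1 else None
```

For example, 60 useful prefetches out of 80 issued, against 100 original misses, give a coverage of 0.6 and an accuracy of 0.75.&lt;br /&gt;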
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
Prefetching's optimization is that it brings into the cache additional blocks surrounding the blocks currently being hit by the processor. [[#References|[28]]] states that simply increasing the hit ratio does not automatically improve overall performance in multiprocessor computation.  Prefetching can actually produce stalls on individual CPUs of multiprocessor systems, as the hit ratio benefit applies only to the thread currently benefiting from the prefetch, which then has to wait for the other threads to catch up.  As one way of implementing prefetching in multiprocessor systems, [[#References|[28]]] suggests overlapping the prefetching decision-making time with user process time as much as possible, to minimize the impact on the overall execution sequence.&lt;br /&gt;
&lt;br /&gt;
= CMP Implementation in Intel Architecture =&lt;br /&gt;
&lt;br /&gt;
Let us now see how Intel architecture using the MESI protocol progressed from a uniprocessor architecture to a Chip MultiProcessor (CMP) using the bus as the interconnect. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Uniprocessor Architecture'''&lt;br /&gt;
&lt;br /&gt;
The diagram below shows the structure of the memory cluster in Intel Pentium M processor.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:intel_cache1.jpg]]&amp;lt;/center&amp;gt; &amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In this structure we have,&lt;br /&gt;
* A unified on-chip '''L1 cache''' with the '''processor/core''',&lt;br /&gt;
* A '''Memory/L2 access control unit''', through which all the accesses to the L2 cache, main memory and IO space are made,&lt;br /&gt;
* The second level '''L2 cache''' along with the '''prefetch unit''' and&lt;br /&gt;
* '''Front side bus (FSB)''', a single shared bi-directional bus across which all the traffic is sent. These wide buses bring in multiple data bytes at a time. &lt;br /&gt;
&lt;br /&gt;
As Intel explains it, with this structure the processor requests are first sought in the '''L2 cache''' and only on a '''miss''' are they '''forwarded''' to the main '''memory''' via the front side bus ('''FSB'''). The '''Memory/L2 access control''' unit serves as a central point for '''maintaining coherence''' within the core and with the external world. It '''contains''' a '''snoop control unit''' that receives snoop requests from the bus and performs the required operations on each cache (and internal buffers) in parallel. It also handles RFO requests (BusUpgr) and ensures the operation continues only after it guarantees that no other version of the cache line exists in any other cache in the system.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''CMP Architecture'''&lt;br /&gt;
&lt;br /&gt;
For its CMP implementation, Intel chose a bus-based architecture using snoopy protocols over the '''directory protocol''': although a directory protocol reduces active power through reduced snoop activity, it '''increases''' the '''design complexity''' and the '''static power''' because of its larger tag arrays. Since Intel has a large market for processors in the mobility family, a directory-based solution was less favorable, because battery life depends mainly on static power consumption and less on dynamic power.&lt;br /&gt;
Let us examine how '''CMP''' was implemented in the '''Intel Core Duo''', which was one of the first dual-core processors for the budget/entry-level market. &lt;br /&gt;
The general CMP implementation structure of the Intel Core Duo is shown below.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:intel_cache2.jpg]]&amp;lt;/center&amp;gt; &amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This structure has the following changes when compared to the uniprocessor memory cluster structure. &lt;br /&gt;
* '''L1 cache''' and the '''processor/core''' structure is '''duplicated''' to give 2 cores.&lt;br /&gt;
* The '''Memory/L2 access control''' unit is '''split''' into 2 logical units: '''L2 controller''' and '''bus controller'''. The L2 controller handles all '''requests to the L2''' cache from the core and the snoop requests from the FSB. The '''bus controller''' handles '''data and I/O requests''' to and from the FSB.&lt;br /&gt;
* The '''prefetching''' unit is extended to handle the hardware '''prefetches for each core separately'''.&lt;br /&gt;
* A '''new logical unit''' (represented by the hexagon) was added to maintain '''fairness between the requests''' coming from the different cores and hence balance the requests to L2 and memory.&lt;br /&gt;
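&lt;br /&gt;
The role of the fairness unit in the last item can be illustrated with a small arbiter sketch. The source does not state which arbitration policy Intel uses, so round-robin is assumed here purely for illustration, and the function name arbitrate is hypothetical.&lt;br /&gt;

```python
from collections import deque


def arbitrate(queues):
    # queues maps core id to a deque of pending L2 requests.
    # Returns the order in which requests are granted (round-robin assumption).
    order = []
    cores = deque(sorted(queues))
    while any(queues.values()):
        core = cores[0]
        cores.rotate(-1)  # the next core gets the following turn
        if queues[core]:
            order.append((core, queues[core].popleft()))
    return order
```

With this policy, a core that issues many requests cannot starve the other core: requests from the two cores are interleaved for as long as both have work pending.&lt;br /&gt;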
&lt;br /&gt;
This new '''partitioned structure''' for the  memory/L2 access control unit '''enhanced''' the '''performance''' while '''reducing power consumption'''. &lt;br /&gt;
For more information on uniprocessor and multiprocessor implementation under the Intel architecture, refer to &lt;br /&gt;
[http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]&lt;br /&gt;
&lt;br /&gt;
The '''Intel bus architecture''' has been '''evolving''' to accommodate the demands of scalability while using the same MESI protocol: from a '''single shared bus''', to '''dual independent buses (DIB)''' that double the available bandwidth, and to the logical conclusion of DIB with the introduction of '''dedicated high-speed interconnects (DHSI)'''. The DHSI-based platforms use four FSBs, one for each processor in the platform. In both DIB and DHSI, a snoop filter is used in the chipset to cache snoop information, thereby significantly reducing the broadcasting needed for the snoop traffic on the buses. With the production of processors based on the next-generation 45-nm Hi-k Intel Core microarchitecture, the [http://en.wikipedia.org/wiki/Xeon Intel Xeon] processor fabric transitions from DHSI, with the memory controller in the chipset, to a distributed shared memory architecture using '''Intel QuickPath Interconnects with the MESIF protocol'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Word Invalidation=&lt;br /&gt;
One complexity problem applying to a number of the protocols deals with invalidation.  In newer protocols, individual words may be modified in a cache line, as opposed to the entirety of the line.  The other processors will thus have a mostly correct cache line, with only a word difference.  This leads to potential complexity, because to 'correct' the error the other processors must transition from shared to invalid, where they then can read the correct word from the bus and place it back into the line, transitioning back into shared.  This transition is potentially unnecessary, as the second processor may never access the specific word changed, but it may access other words in the cache line.  One potential solution to this is being researched at the present time, advancing the protocols so that a cache line is &amp;quot;not invalidated on the first dirty word, but after the number of dirty words crosses some predetermined value, which is data type and application dependent.&amp;quot;  In other words, if the application can possibly have multiple words 'incorrect,' several transitions to and from the invalid state may be avoided.&lt;br /&gt;
&amp;lt;br/&amp;gt;Source: http://tab.computer.org/tcca/NEWS/sept96/dsmideas.ps&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In essence, the solution proposed here is to advance the MOESI protocol with word invalidation and specific treatment of temporal and spatial data, so that the block is not invalidated.&lt;br /&gt;
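&lt;br /&gt;
The threshold idea quoted above can be sketched as follows. This is an illustration of the concept only, not the protocol proposed in the cited source: the class SharedCopy, its methods and the fixed threshold are hypothetical.&lt;br /&gt;

```python
class SharedCopy:
    # One processor's copy of a cache line that tolerates a few stale words.
    def __init__(self, threshold):
        self.threshold = threshold  # assumed to be application dependent
        self.stale_words = set()
        self.valid = True

    def remote_write(self, word_index):
        # Another processor modified one word of the line.
        if not self.valid:
            return
        self.stale_words.add(word_index)
        if len(self.stale_words) > self.threshold:
            self.valid = False  # too many dirty words: invalidate the line

    def can_read(self, word_index):
        # Words untouched by remote writes stay readable without a bus access.
        return self.valid and word_index not in self.stale_words
```

With a threshold of 2, two remote word writes leave the other words of the line readable, and the third one invalidates the whole line.&lt;br /&gt;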
&lt;br /&gt;
=Performance: MOESI vs MEI/MESI=&lt;br /&gt;
Early multiprocessors (such as the PowerPC processors) were designed to work with three states (Modified, Exclusive, Invalid - this is similar to the MSI protocol discussed earlier, the term MEI is used to remain consistent with the sources used for reference).  As time progressed, more multi-processors transitioned to the MESI protocol.  This is most likely due to the characteristics of MEI - it is &amp;quot;easy to implement but leads to inefficiencies in the way that memory bus bandwidth is used&amp;quot; [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 source].  As a result, systems with two or more MEI processors (most likely) will not see the 'full' potential of the extra processors, due to the inefficient bus transactions.&lt;br /&gt;
&lt;br /&gt;
In contrast, the five-state protocol MOESI is more complex to implement than MESI and MEI/MSI.  However, using it brings advantages.  As Andy Keane (VP of Marketing for PMC-Sierra) put it, &amp;quot;this [fifth] state allows shared data that is dirty to remain in the cache.  Without this state, any shared line would be written back to memory to change the original state from modified. Since we have a dedicated, fast path from CPU to CPU, this state maximizes the use of this path rather than the path to memory, which is inherently slower&amp;quot; [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 source].  This advantage can be realized by allowing MOESI to make some processor-to-processor transfers from level 2 caches.&lt;br /&gt;
&lt;br /&gt;
Source: [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 Edwards, Chris (08/06/2001). &amp;quot;Architects wrestle with multiprocessor options&amp;quot;. Electronic engineering times (0192-1541), (1178), p. 48.]&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Cache_coherence Cache coherence]&lt;br /&gt;
# [http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]&lt;br /&gt;
# [http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Symmetric_multiprocessing Common System Interface in Intel Processors]&lt;br /&gt;
# [http://www.zak.ict.pwr.wroc.pl/nikodem/ak_materialy/Cache%20consistency%20&amp;amp;%20MESI.pdf Cache consistency with MESI on Intel processor]&lt;br /&gt;
# [http://techreport.com/articles.x/8236/2 AMD dual core Architecture]&lt;br /&gt;
# [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual]&lt;br /&gt;
# [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors]&lt;br /&gt;
# [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of AMD 64 bit core]&lt;br /&gt;
# [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&amp;amp;arnumber=4913 Silicon Graphics Computer Systems]&lt;br /&gt;
# [http://books.google.com/books?id=g82fofiqa5IC&amp;amp;printsec=frontcover&amp;amp;dq=Parallel+computer+architecture:+a+hardware/software+approach+By+David+E.+Culler,+Jaswinder+Pal+Singh,+Anoop+Gupta&amp;amp;source=bl&amp;amp;ots=COrdamlfVn&amp;amp;sig=YcugVqbzTjHvlofvaFq6Ft_tjfY&amp;amp;hl=en&amp;amp;ei=0ZO6S4TJGcOclgejzI3BBw&amp;amp;sa=X&amp;amp;oi=book_result&amp;amp;ct=result&amp;amp;resnum=1&amp;amp;ved=0CAgQ6AEwAA#v=onepage&amp;amp;q=&amp;amp;f=false Parallel computer architecture: a hardware/software approach By David E. Culler, Jaswinder Pal Singh, Anoop Gupta]&lt;br /&gt;
# [http://www.freepatentsonline.com/5283886.html Three state invalidation protocols]&lt;br /&gt;
# [http://portal.acm.org/citation.cfm?id=1499317&amp;amp;dl=GUIDE&amp;amp;coll=GUIDE&amp;amp;CFID=83027384&amp;amp;CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems]&lt;br /&gt;
# [http://portal.acm.org/citation.cfm?id=6514 Cache Coherence protocols: evaluation using a multiprocessor simulation model]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Xerox_Dragon Xerox Dragon]&lt;br /&gt;
# [http://thanaseto.110mb.com/courses/CSD-527-report-engl.pdf Coherence Protocols]&lt;br /&gt;
# [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&amp;amp;arnumber=289691 XDBus]&lt;br /&gt;
# [http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dai0228a/index.html ARM]&lt;br /&gt;
# [http://ixbtlabs.com/articles/ibmpower4/index.html IBM POWER4]&lt;br /&gt;
# Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems. Solihin, Yan&lt;br /&gt;
# [http://www.chiplist.com/Intel_Itanium_2_processor/tree3f-section--2235-/ Intel Itanium 2]&lt;br /&gt;
# [http://techreport.com/articles.x/8236/2 MESI-MESI-MOESI Banana-fana...]&lt;br /&gt;
# [http://rsim.cs.illinois.edu/rsim/Manual/node109.html RSIM]&lt;br /&gt;
# [http://dl.acm.org/citation.cfm?id=1499317 Synapse tightly coupled multiprocessors]&lt;br /&gt;
# [http://dictionary.reference.com/browse/instruction+prefetch prefetching]&lt;br /&gt;
# Yan Solihin, &amp;quot;Fundamentals of Parallel Computer Architecture,&amp;quot; Solihin Publishing &amp;amp; Consulting LLC, 2008, pp. 168-173.&lt;br /&gt;
# David F. Kotz and Carla Schlatter Ellis, &amp;quot;Prefetching in File Systems for MIMD Multiprocessors,&amp;quot; in IEEE Transactions on Parallel and Distributed Systems, Vol. 1, No. 2, April 1990, pp. 218-230.&lt;/div&gt;</summary>
		<author><name>Pxu</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8a_fu&amp;diff=59886</id>
		<title>CSC/ECE 506 Spring 2012/8a fu</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8a_fu&amp;diff=59886"/>
		<updated>2012-03-19T02:34:54Z</updated>

		<summary type="html">&lt;p&gt;Pxu: /* MESIF */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Introduction to bus-based cache coherence in real machines=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==SMP Architecture==&lt;br /&gt;
Most parallel software in the commercial market relies on the shared-memory programming model, in which all processors access the same physical address space. The most common multiprocessors today use the [http://en.wikipedia.org/wiki/Symmetric_multiprocessing SMP] architecture, which uses a common bus as the interconnect.  In the case of multicore processors (&amp;quot;chip multiprocessors,&amp;quot; or CMP), the SMP architecture applies to the cores, treating them as separate processors. The key problem of shared-memory multiprocessors is providing a consistent view of memory across the various cache hierarchies.  This is called the '''''cache coherence problem'''''. Solving it is critical for correctness and is a performance-sensitive design point in supporting the shared-memory model. The cache coherence mechanisms not only govern communication in a shared-memory multiprocessor, but also typically determine how the memory system transfers data between processors, caches, and memory.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:Busbased SMP.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
At any point in logical time, the permissions for a cache block allow either a single writer or multiple readers. The '''''coherence protocol''''' ensures that the invariants of the states are maintained. The coherence states used by most cache coherence protocols are shown in ''Table 1'':&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''States'''&lt;br /&gt;
|  '''Access Type'''&lt;br /&gt;
|  '''Invariant'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  read, write&lt;br /&gt;
|  all other caches in I state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  read&lt;br /&gt;
|  all other caches in I state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Owned'''&lt;br /&gt;
|  read&lt;br /&gt;
|  all other caches in I or S state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  read&lt;br /&gt;
|  no other cache in M or E state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|  -&lt;br /&gt;
|  -&lt;br /&gt;
|}&lt;br /&gt;
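&lt;br /&gt;
The invariants in ''Table 1'' can be checked mechanically for a single cache block. The following is a minimal sketch and not part of any real design; the function name check_invariants and the single-letter state codes are assumptions made for illustration.&lt;br /&gt;

```python
# Hypothetical helper: checks the Table 1 invariants for ONE cache block.
# `states` lists the state of that block in every cache: "M", "O", "E", "S" or "I".

def check_invariants(states):
    for i, s in enumerate(states):
        # States of the same block in all OTHER caches, ignoring Invalid copies.
        rest = [r for j, r in enumerate(states) if j != i and r != "I"]
        if s in ("M", "E") and rest:
            return False  # M and E require every other cache to be in I
        if s == "O" and any(r != "S" for r in rest):
            return False  # O tolerates only I or S in other caches
        if s == "S" and any(r in ("M", "E") for r in rest):
            return False  # S forbids M or E in any other cache
    return True
```

For example, one Owned copy with two Shared copies satisfies the invariants, while a Modified copy next to a Shared copy does not.&lt;br /&gt;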
&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Snooping Protocols=&lt;br /&gt;
==MSI Protocol==&lt;br /&gt;
&lt;br /&gt;
'''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' is a three-state write-back '''invalidation protocol''' and one of the earliest snooping-based cache coherence protocols. It marks each cache line as '''Modified (M)''', '''Shared (S)''' or '''Invalid (I)'''. '''Invalid''' means the cache line is either not present or is in the invalid state. If the cache line is clean and may be shared by more than one processor, it is marked '''Shared'''. If the cache line is dirty and the processor has exclusive ownership of it, it is in the '''Modified''' state. A BusRdX causes all other caches to invalidate (demote) their copy to the '''I''' state; if the line is in the '''M''' state in another cache, that cache flushes it. A processor write, even when it hits a line in the '''S''' state, must issue a BusRdX before the line is promoted (upgraded) to the '''M''' state.&lt;br /&gt;
&lt;br /&gt;
The state diagram of the MSI protocol is shown below. Note that the left part shows the response to processor-side requests, and the right part shows the response to snooper-side requests.[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
[[File:MSInew.jpg|600px|thumb|center|MSI state transition diagram]]&lt;br /&gt;
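The processor-side and snooper-side transitions of MSI can be sketched as two lookup tables. This is a minimal illustration that assumes the textbook event names PrRd, PrWr, BusRd, BusRdX and Flush; the names MSI_PROCESSOR, MSI_SNOOP and run are hypothetical and do not come from any real implementation.&lt;br /&gt;

```python
# Each key is (current state, event); each value is (next state, bus action).
# A bus action of "-" means that nothing is placed on the bus.
MSI_PROCESSOR = {
    ("I", "PrRd"): ("S", "BusRd"),
    ("I", "PrWr"): ("M", "BusRdX"),
    ("S", "PrRd"): ("S", "-"),
    ("S", "PrWr"): ("M", "BusRdX"),   # write hit in S still needs the bus
    ("M", "PrRd"): ("M", "-"),
    ("M", "PrWr"): ("M", "-"),
}

MSI_SNOOP = {
    ("M", "BusRd"): ("S", "Flush"),   # supply the dirty block, keep a clean copy
    ("M", "BusRdX"): ("I", "Flush"),
    ("S", "BusRd"): ("S", "-"),
    ("S", "BusRdX"): ("I", "-"),
    ("I", "BusRd"): ("I", "-"),
    ("I", "BusRdX"): ("I", "-"),
}

def run(events, state="I"):
    """Apply processor events to one cache line and collect its bus transactions."""
    bus = []
    for event in events:
        state, action = MSI_PROCESSOR[(state, event)]
        if action != "-":
            bus.append(action)
    return state, bus
```

Calling run with a read followed by a write on an initially invalid line yields two bus transactions, a BusRd and then a BusRdX.&lt;br /&gt;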
&lt;br /&gt;
&lt;br /&gt;
===Real Architectures using MSI===&lt;br /&gt;
The '''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' protocol was first used in the '''[http://en.wikipedia.org/wiki/Silicon_Graphics SGI]''' IRIS 4D series. '''[http://en.wikipedia.org/wiki/Silicon_Graphics SGI]''' produced a broad range of '''[http://en.wikipedia.org/wiki/MIPS_architecture MIPS]'''-based (Microprocessor without Interlocked Pipeline Stages) workstations and servers during the 1990s, running '''[http://en.wikipedia.org/wiki/Silicon_Graphics SGI]''''s version of UNIX System V, now called '''[http://en.wikipedia.org/wiki/IRIX IRIX]'''. The 4D-MP graphics superworkstation delivered 40 MIPS (million instructions per second) of computing performance. This unprecedented level of computing and graphics processing in an office-environment workstation was made possible by the fastest available RISC microprocessors in a single shared-memory multiprocessor design driving a tightly coupled, highly parallel graphics system. Aggregate sustained data rates of over one gigabyte per second were achieved by a hierarchy of buses in a balanced system designed to avoid bottlenecks.&lt;br /&gt;
&lt;br /&gt;
The multiprocessor bus used in the 4D-MP graphics superworkstation is a pipelined, block-transfer bus that supports the cache coherence protocol and provides 64 megabytes of sustained data bandwidth between the processors, the memory and I/O system, and the graphics subsystem. Because the sync bus provides efficient synchronization between processors, the cache coherence protocol was designed to support efficient data sharing between processors. If a cache coherence protocol has to support synchronization as well as sharing, a compromise in the efficiency of the data sharing protocol may be necessary to improve the efficiency of the synchronization operations. Hence the 4D-MP uses a simple cache coherence protocol, the '''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' protocol.&lt;br /&gt;
&lt;br /&gt;
With the simple rules of the '''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' protocol enforced by hardware, efficient synchronization and efficient data sharing are achieved in a simple shared-memory model of parallel processing in the 4D-MP graphics superworkstation.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
The MSI protocol has one serious drawback when executing read-then-write sequences. Each read-then-write involves two bus transactions, a BusRd and a BusRdX, which wastes a large amount of bus bandwidth. Therefore, most newer machines do not implement the MSI protocol. Instead, they use variants of MSI, such as the MOSI and MESI protocols, that reduce the amount of traffic on the coherence interconnect.&lt;br /&gt;
&lt;br /&gt;
==Synapse protocol==&lt;br /&gt;
From the state transition diagram of MSI, we observe that there is a transition from state '''M''' to state '''S''' when a BusRd is observed for that block. The contents of the block are flushed to the bus before going to the '''S''' state. In certain cases it would be more appropriate to move to the '''I''' state, thus giving up the block entirely. The choice between '''S''' and '''I''' reflects the designer's assumption about the sharing pattern: moving to '''S''' asserts that the original processor is more likely to continue reading the block than the new processor is to write to it. The Synapse protocol, used in the early Synapse multiprocessor, made the alternate choice of going directly from the '''M''' state to the '''I''' state on a BusRd, assuming that the migratory pattern would be more frequent. More details about this protocol can be found in these papers published in the 1980s: [http://portal.acm.org/citation.cfm?id=6514 Cache Coherence protocols: evaluation using a multiprocessor simulation model] and [http://portal.acm.org/citation.cfm?id=1499317&amp;amp;dl=GUIDE&amp;amp;coll=GUIDE&amp;amp;CFID=83027384&amp;amp;CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems].&lt;br /&gt;
&lt;br /&gt;
In the Synapse protocol, the '''M''' state is called the '''D''' (Dirty) state. The following is the state transition diagram for the Synapse protocol:&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:Synapse1.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
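&lt;br /&gt;
The single design difference discussed above, the reaction of a dirty block to a snooped BusRd, can be written down in a few lines. This is only a sketch; the function name snoop_busrd and the protocol strings are assumptions for illustration.&lt;br /&gt;

```python
def snoop_busrd(state, protocol):
    """Next state of a block when another cache's BusRd is snooped.

    `protocol` is either "MSI" or "Synapse" (hypothetical labels).
    """
    if state in ("M", "D"):
        # The dirty block is flushed in both protocols; they differ only
        # in whether the flushing cache keeps a readable copy afterwards.
        return "S" if protocol == "MSI" else "I"
    return state  # clean and invalid blocks are unaffected by a BusRd
```

MSI keeps a shared copy in the expectation of further reads, whereas Synapse gives the block up entirely.&lt;br /&gt;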
&amp;lt;br/&amp;gt;&lt;br /&gt;
===Real Architectures using Synapse===&lt;br /&gt;
No documentation could be found of any actual architecture using the Synapse cache-coherence protocol.  The abstract of the 1984 paper by Steve Frank and Armond Inselberg, &amp;quot;[http://dl.acm.org/citation.cfm?id=1499317 Synapse Tightly Coupled Multiprocessors: A New Approach to Solve Old Problems]&amp;quot;, suggests that Synapse was theoretical.&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The [http://www.disi.unige.it/person/DelzannoG/CacheProtocol/synapse.hy BABYLON Project] provides one example of a Synapse cache coherence protocol with atomic synchronization actions.&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
In the Synapse system, memory contention is reduced by designing a processor cache that employs a non-write-through algorithm to minimize the bandwidth used between cache and shared memory. The Synapse Expansion Bus includes an ownership-level protocol between processor caches.[[#References|&amp;lt;sup&amp;gt;[24]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
==[http://en.wikipedia.org/wiki/MESI_protocol MESI]==&lt;br /&gt;
The main drawback of MSI is that each read-write sequence incurs two bus transactions irrespective of whether the cache line is stored in only one cache or not. Highly parallel programs that have little data sharing suffer the most from this.  MESI protocol solves this problem by introducing the '''Exclusive''' state to distinguish between a cache line stored in multiple caches and a line stored in a single cache.&lt;br /&gt;
Let us briefly see how the MESI protocol works. For a more detailed version, refer to the Solihin textbook, p. 215.&lt;br /&gt;
&lt;br /&gt;
The MESI coherence protocol marks each cache line as being in one of the Modified, Exclusive, Shared, or Invalid states. &lt;br /&gt;
* '''Invalid''' : The cache line is either not present or is invalid&lt;br /&gt;
* '''Exclusive''' : The cache line is clean and is owned by this core/processor only&lt;br /&gt;
* '''Modified''' : The cache line is dirty and the core/processor has exclusive ownership of it; the copy in memory is out of date.&lt;br /&gt;
* '''Shared''' : The cache line is clean and is shared by more than one core/processor&lt;br /&gt;
&lt;br /&gt;
The MESI protocol works as follows: &lt;br /&gt;
A line that is fetched receives the '''E''' or '''S''' state depending on whether it exists in the caches of other processors in the system. A cache line gets the '''M''' state when a processor writes to it; if the line is not in the '''E''' or '''M''' state prior to the write, the cache sends a Bus Upgrade (BusUpgr) signal or, as the Intel manuals term it, a “Read-For-Ownership (RFO) request”, which ensures that the line exists in the cache and is in the '''I''' state in all other processors on the bus (if any). The table below summarizes the MESI protocol.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''This cache line is valid?'''&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  No&lt;br /&gt;
|-&lt;br /&gt;
|  '''The memory copy is…'''&lt;br /&gt;
|  out of date&lt;br /&gt;
|  valid&lt;br /&gt;
|  valid&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies exist in caches of other processors?'''&lt;br /&gt;
|  No&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe&lt;br /&gt;
|  Maybe&lt;br /&gt;
|-&lt;br /&gt;
|  '''A write to this line'''&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  goes directly to bus&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
State transition diagram for MESI protocol is shown below.[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
[[File:MESInew.jpg|500px|thumb|center|MESI state transition diagram]] &lt;br /&gt;
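The saving that the Exclusive state brings for a read-then-write sequence can be illustrated by counting bus transactions. This is a minimal sketch under stated assumptions: the block starts in the I state, the function name read_then_write and its arguments are hypothetical, and the bus event names follow the text above.&lt;br /&gt;

```python
def read_then_write(protocol, other_sharers):
    """Bus transactions for a PrRd followed by a PrWr to a block starting in I.

    `protocol` is "MSI" or "MESI"; `other_sharers` tells whether any other
    cache holds a valid copy when the read miss occurs.
    """
    bus = ["BusRd"]                      # the read miss always goes to the bus
    if protocol == "MESI" and not other_sharers:
        state = "E"                      # sole copy: loaded as Exclusive
    else:
        state = "S"
    if state == "E":
        state = "M"                      # silent upgrade, no bus transaction
    else:
        # A write to a Shared line must first invalidate the other copies.
        bus.append("BusUpgr" if protocol == "MESI" else "BusRdX")
        state = "M"
    return state, bus
```

With no other sharers, MESI needs only the initial BusRd, whereas MSI always needs two transactions.&lt;br /&gt;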
&lt;br /&gt;
===Real Architectures using MESI===&lt;br /&gt;
====CMP Implementation in Intel Architecture====&lt;br /&gt;
Intel's [http://en.wikipedia.org/wiki/Pentium_Pro Pentium Pro] microprocessor, introduced in 1995, was the first Intel architecture microprocessor to support symmetric multiprocessing in various multiprocessor configurations.  SMP with the MESI protocol was the architecture used consistently until the introduction of the 45-nm Hi-k Core microarchitecture in Intel's quad-core x86-64 Nehalem-EP processors. The 45-nm Hi-k Intel Core microarchitecture uses a new system framework called the [http://en.wikipedia.org/wiki/Intel_QuickPath_Interconnect QuickPath Interconnect], a point-to-point interconnect technology based on a distributed shared memory architecture.  It uses a modified version of the MESI protocol called [http://en.wikipedia.org/wiki/MESIF_protocol MESIF], which introduces the additional Forward state.[[#References|&amp;lt;sup&amp;gt;[3]&amp;lt;/sup&amp;gt;]]  (The state diagram of the MESI transitions that occur within the Pentium's data cache can be found on page 63 of [http://books.google.com/books?id=TVzjEZg1--YC&amp;amp;printsec=frontcover&amp;amp;source=gbs_ge_summary_r&amp;amp;cad=0#v=onepage&amp;amp;q&amp;amp;f=false Pentium Processor System Architecture: Second Edition] by Don Anderson, Tom Shanley, MindShare, Inc.)&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
The Intel Pentium M processor contains a unified on-chip L1 cache with the processor, an L2 access control unit, a prefetch unit, and a [http://en.wikipedia.org/wiki/Front-side_bus Front Side Bus].&lt;br /&gt;
&lt;br /&gt;
====ARM MPCore====&lt;br /&gt;
The [http://www.arm.com/products/processors/classic/arm11/arm11-mpcore.php ARM11 MPCore] and [http://en.wikipedia.org/wiki/ARM_Cortex-A9_MPCore Cortex-A9 MPCore] processors support the MESI cache coherency protocol.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]  ARM MPCore defines the states of the MESI protocol it implements as:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies in other caches'''&lt;br /&gt;
|  NO&lt;br /&gt;
|  NO&lt;br /&gt;
|  YES&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Clean or Dirty'''&lt;br /&gt;
|  DIRTY&lt;br /&gt;
|  CLEAN&lt;br /&gt;
|  CLEAN&lt;br /&gt;
|  -&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
The coherency protocol is implemented and managed by the Snoop Control Unit (SCU) in the ARM MPCore, which monitors the traffic between the local L1 data caches and the next level of the memory hierarchy. At boot time, each core can choose whether or not to take part in the coherency domain.  Unless explicit system calls bind a task to a specific core ([http://en.wikipedia.org/wiki/Processor_affinity processor affinity]), there is a high chance that the task will at some point migrate to a different core, along with its data as it is used.  Task migration is not handled efficiently by a literal MESI implementation, so the ARM MPCore offers two optimizations that allow for both MESI compliance and efficient task migration: '''Direct Data Intervention (DDI)''' (in which the SCU keeps a copy of the tag RAMs of all cores’ caches, which enables it to detect efficiently whether a cache line requested by one core is held by another core in the coherency domain before looking for it in the next level of the memory hierarchy) and '''Cache-to-cache Migration''' (in which, if the SCU finds that the cache line requested by one CPU is present in another core, it either copies it (if clean) or moves it (if dirty) from the other CPU directly into the requesting one, without interacting with external memory).  These optimizations reduce memory traffic in and out of the L1 cache subsystem by eliminating interactions with external memory, which in effect reduces the overall load on the interconnect and the overall power consumption.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
Other architectures that use the MESI cache-coherence protocol include the L2 cache of the IBM POWER4 processor[[#References|&amp;lt;sup&amp;gt;[20]&amp;lt;/sup&amp;gt;]], the L2 cache of the Intel Itanium 2 processor[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]], and the Intel Xeon[[#References|&amp;lt;sup&amp;gt;[22]&amp;lt;/sup&amp;gt;]].&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
One implementation complexity already mentioned is the inefficiency of task migration in MESI that the ARM MPCore addresses.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]  Another possible implementation complexity is found during the replacement of a cache line: one possible MESI implementation requires a message to be sent to main memory when a cache line is flushed (i.e. an E to I transition), as the line was exclusively in one cache before it was removed. It is possible to avoid this replacement message if the system is designed so that the flush of a modified (exclusive) line requires an acknowledgment from main memory. However, this requires the flush to be stored in a 'write-back' buffer until the reply arrives (to ensure the change is successfully propagated to memory).[[#References|&amp;lt;sup&amp;gt;[23]&amp;lt;/sup&amp;gt;]]&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==MESIF==&lt;br /&gt;
Let us now walk through a briefing on the '''MESIF protocol''':&lt;br /&gt;
&lt;br /&gt;
Invented by Intel, the MESIF protocol consists of five states: Modified (M), Exclusive (E), Shared (S), Invalid (I) and Forward (F). On a read request, only the cache holding the line in the '''F''' state responds, while all caches holding it in the '''S''' state remain dormant.  By designating a single cache to respond to requests, coherency traffic is substantially reduced when multiple copies of the data exist. When a cache line in the '''F''' state is copied, the '''F''' state migrates to the newer copy, while the older one drops back to '''S'''. Moving the F state to the new copy exploits both temporal and spatial locality. Because the newest copy of the cache line is always the one in the F state, that copy is unlikely to be evicted soon, which takes advantage of the temporal locality of the requests. In addition, if a particular cache line is in high demand due to spatial locality, the bandwidth used to transmit that data is spread across several nodes.&lt;br /&gt;
All M to S and E to S state transitions now become '''M to F''' and '''E to F''' transitions.  &lt;br /&gt;
The '''F''' state differs from the '''Owned''' state of the MOESI protocol in that a line in the F state is not the only up-to-date copy: a valid copy is also stored in memory. Thus, unlike a line in the O state, whose data is not yet reflected in memory, a line in the F state can be evicted or converted to the S state at any time, if desired. &lt;br /&gt;
&lt;br /&gt;
The state transition table is shown below.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Clean / Dirty'''&lt;br /&gt;
|  '''May Write?'''&lt;br /&gt;
|  '''May Forward?'''&lt;br /&gt;
|  '''May Transition To?'''&lt;br /&gt;
|-&lt;br /&gt;
|  M - Modified&lt;br /&gt;
|  Dirty&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  E - Exclusive&lt;br /&gt;
|  Clean&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  MSIF&lt;br /&gt;
|-&lt;br /&gt;
|  S - Shared&lt;br /&gt;
|  Clean&lt;br /&gt;
|  No&lt;br /&gt;
|  No&lt;br /&gt;
|  I&lt;br /&gt;
|-&lt;br /&gt;
|  I - Invalid&lt;br /&gt;
|  -&lt;br /&gt;
|  No&lt;br /&gt;
|  No&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  F - Forward&lt;br /&gt;
|  Clean&lt;br /&gt;
|  No&lt;br /&gt;
|  Yes&lt;br /&gt;
|  SI&lt;br /&gt;
|}&lt;br /&gt;
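&lt;br /&gt;
The migration of the F state to the newest copy can be sketched as follows. This is an illustrative model only, not the QPI implementation: the function name read_miss, the cache names and the handling of the case in which memory supplies the line are assumptions.&lt;br /&gt;

```python
def read_miss(caches, requester):
    """Serve a read miss by `requester`; `caches` maps cache name to state.

    Returns the name of the cache that forwarded the line, or None when
    memory had to supply it. The dictionary is updated in place.
    """
    forwarder = None
    for name, state in caches.items():
        # Only one cache may forward. A Modified holder is assumed to
        # write the line back as it forwards it (not modeled here).
        if name != requester and state in ("M", "E", "F"):
            forwarder = name
    if forwarder is None:
        # Assumption: with no forwarder, memory answers and the requester
        # becomes the forwarder if other (Shared) copies exist.
        others = [s for n, s in caches.items() if n != requester and s != "I"]
        caches[requester] = "F" if others else "E"
        return None
    caches[forwarder] = "S"   # the older copy drops back to Shared
    caches[requester] = "F"   # the newest copy becomes the forwarder
    return forwarder
```

After each forwarded read there is still exactly one cache in the F state, and it is always the most recent requester.&lt;br /&gt;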
&lt;br /&gt;
===Real Architectures using MESIF===&lt;br /&gt;
The MESIF protocol is implemented in the Intel QuickPath Interconnect (QPI), a high-speed, packetized, point-to-point interconnect used in Intel's Core i series processors starting in 2008. The narrow high-speed links stitch together processors in a distributed shared memory-style platform architecture. Compared with front-side buses, it offers much higher bandwidth with low latency. QPI has an efficient architecture that allows more interconnect performance to be achieved in real systems. It has a snoop protocol optimized for low latency and high scalability, as well as packet and lane structures that enable quick completion of transactions. [[#References|&amp;lt;sup&amp;gt;[2]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
More information on the QuickPath Interconnect and MESIF protocol can be found at&lt;br /&gt;
'''[http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]'''&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
&lt;br /&gt;
==MOESI==&lt;br /&gt;
The [http://en.wikipedia.org/wiki/Opteron AMD Opteron] was AMD’s first-generation dual-core processor, with 2 distinct [http://en.wikipedia.org/wiki/Athlon_64 K8 cores] on a single die.  Cache coherence is a bigger problem on such multiprocessors, and an appropriate coherence protocol was needed to address it. The [http://en.wikipedia.org/wiki/Xeon Intel Xeon], the competing product from Intel, used the MESI protocol to handle cache coherence.  MESI has the drawback of consuming considerable time and bandwidth in certain situations. &lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/MOESI_protocol MOESI] was AMD’s answer to this problem. MOESI adds a fifth state, '''“Owned”''', to the MESI protocol. It addresses the bandwidth problem that MESI faces when a processor holding invalid data in its cache wants to access data that another processor has modified.  Under MESI, the requesting processor has to wait for the processor that modified the data to write it back to main memory, which takes time and bandwidth. MOESI removes this drawback by allowing dirty sharing.  When the data is held by a processor in the new '''“Owned”''' state, that processor can provide the modified data to other processors without, or even before, writing it to main memory. This is called '''''dirty sharing'''''. The processor holding the data in the '''&amp;quot;Owned&amp;quot;''' state remains responsible for updating main memory later, when the cache line is evicted.&lt;br /&gt;
&lt;br /&gt;
MOESI has become one of the most popular snoop-based protocols supported in the AMD64 architecture.  The AMD dual-core Opteron can maintain cache coherence in systems up to 8 processors using this protocol.&lt;br /&gt;
&lt;br /&gt;
The five different states of the MOESI protocol are:&lt;br /&gt;
* '''Modified (M)''' : The cache line holds the most recent copy of the data, and the data is not present in any other processor cache.&lt;br /&gt;
* '''Owned (O)'''   : The cache line holds the most recent, correct copy of the data, which may be shared by other processors. The processor holding the line in this state is responsible for writing the correct value back to main memory before the line is evicted.  &lt;br /&gt;
* '''Exclusive (E)''' : A cache line holds the most recent, correct copy of the data, which is exclusively present on this processor and a copy is present in the main memory.  &lt;br /&gt;
* '''Shared (S)''' : A cache line in the shared state holds the most recent, correct copy of the data, which may be shared by other processors. &lt;br /&gt;
* '''Invalid (I)''' : A cache line does not hold a valid copy of the data.&lt;br /&gt;
&lt;br /&gt;
A detailed explanation of this protocol implementation on AMD processor can be found in the manual [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of the AMD 64-bit core]&lt;br /&gt;
&lt;br /&gt;
The following table summarizes the MOESI protocol:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Owned'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''This cache line is valid?'''&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  No&lt;br /&gt;
|-&lt;br /&gt;
|  '''The memory copy is…'''&lt;br /&gt;
|  out of date&lt;br /&gt;
|  out of date&lt;br /&gt;
|  valid&lt;br /&gt;
|  valid&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies exist in caches of other processors?'''&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe&lt;br /&gt;
|  Maybe&lt;br /&gt;
|-&lt;br /&gt;
|  '''A write to this line'''&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  goes directly to bus&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
State transition diagram for MOESI protocol is shown below. The top part shows the response to processor-side requests and the bottom part is the response to snooper-side requests. [[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:MOESI.jpg|500px|thumb|center|MOESI state transition diagram]]&lt;br /&gt;
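Dirty sharing can be summarized by how a cache reacts when it snoops a read from another cache, and by which states require a write-back on eviction. The following sketch is an illustration only; the function names moesi_snoop_busrd and needs_writeback are hypothetical.&lt;br /&gt;

```python
def moesi_snoop_busrd(state):
    """Reaction of a cache that snoops another cache's BusRd.

    Returns (next state, whether this cache supplies dirty data).
    """
    if state == "M":
        return "O", True    # dirty sharing: keep ownership, memory stays stale
    if state == "O":
        return "O", True    # the owner keeps supplying the dirty line
    if state == "E":
        return "S", False   # clean line: memory already holds valid data
    return state, False     # S and I copies supply no dirty data

def needs_writeback(state):
    """Only lines whose memory copy is out of date are written back on eviction."""
    return state in ("M", "O")
```

Under MESI the first case would instead force a write-back to memory before the line could be shared.&lt;br /&gt;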
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Real Architectures using MOESI===&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
&lt;br /&gt;
When the MOESI protocol is implemented on a real architecture such as the AMD K10 series, some modifications and optimizations are made to the protocol to allow more efficient operation for specific program patterns. For example, the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] family of microprocessors (Family 10h), AMD’s first generation to incorporate 4 distinct cores on a single die and the first to have a cache that all the cores share, uses the MOESI protocol with some optimization techniques incorporated. &lt;br /&gt;
&lt;br /&gt;
The optimization focuses on a small subset of compute problems that behave like producer-consumer programs. In such a problem, a thread running on one core produces data, which is consumed by a thread running on a separate core. With such programs, it is desirable to have the two cores communicate through the shared cache, to avoid round trips to and from main memory. The '''MOESI''' protocol that the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] cache uses for cache coherence can also limit bandwidth. Hence, by keeping the cache line in the '''‘M’''' state for such computing problems, better performance can be achieved.&lt;br /&gt;
&lt;br /&gt;
When the producer thread writes a new entry, it allocates cache lines in the '''modified (M)''' state. Eventually, these M-marked cache lines start to fill the L3 cache. When the consumer reads a cache line, the MOESI protocol changes the state of the line to '''owned (O)''' in the L3 cache and pulls down a '''shared (S)''' copy for the consumer's own use. The producer thread then circles the ring buffer and arrives back at the same cache line it had previously written. However, when the producer attempts to write new data to the owned (marked '''‘O’''') cache line, it finds that it cannot, since a cache line marked '''‘O’''' by the previous consumer read does not have sufficient permission for a write request (in the MOESI protocol). To maintain coherence, the memory controller must initiate probes in the other caches (to handle any other S copies that may exist). This slows down the process.&lt;br /&gt;
&lt;br /&gt;
Thus, it is preferable to keep the cache line in the '''‘M’''' state in the L3 cache. In such a situation, when the producer comes back around the ring buffer, it finds the previously written cache line still marked '''‘M’''', to which it is safe to write without coherence concerns. Thus better performance can be achieved by such optimization techniques to standard protocols when implemented in real machines.&lt;br /&gt;
&lt;br /&gt;
You can find more information on how this is implemented and various other ways of optimizations in this manual [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors]&lt;br /&gt;
&lt;br /&gt;
==Dragon Protocol==&lt;br /&gt;
The '''[http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]''' is an update based coherence protocol which does not invalidate other cached copies like what we have seen in the coherence protocols so far. Write propagation is achieved by updating the cached copies instead of invalidating them.  But the Dragon Protocol does not update memory on a cache to cache transfer and delays the memory and cache consistency until the data is evicted and written back, which saves time and lowers the memory access requirements. Moreover only the written '''byte''' or the '''word''' is '''communicated''' to the '''other caches''' instead of the whole block which further '''reduces''' the '''bandwidth''' usage. It has the ability to detect dynamically, the sharing status of a block and use a write through policy for shared blocks and write back for currently non-shared blocks. The Dragon Protocol employs the following four states for the cache blocks: '''Shared Clean''', '''Shared Modified''', '''Exclusive''' and  '''Modified'''. &lt;br /&gt;
* '''Modified (M)''' and '''Exclusive (E)''' - these states have the same meaning as explained in the protocols above. &lt;br /&gt;
* '''Shared Modified (Sm)''' - Only one cache in the system can hold the block in the Shared Modified state. Potentially two or more caches have this block, memory may or may not be up to date, and this processor's cache has modified the block.&lt;br /&gt;
* '''Shared Clean (Sc)''' - Potentially two or more caches have this block and memory may or may not be up to date (memory is up to date if no other cache holds the block in the Sm state; otherwise it is not).&lt;br /&gt;
A Shared Modified line is written back to main memory only when it is evicted from the cache on a cache miss, which keeps memory consistent. For more information on the Dragon protocol, refer to the Solihin textbook, page 229. The state transition diagram is given below for reference.[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
[[File:DragonNew.jpg|500px|thumb|center|Dragon state transition diagram]] &lt;br /&gt;
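&lt;br /&gt;
The four Dragon states and their update-based transitions can also be written down as a small Python sketch. The function names, the shared_line flag (standing in for the bus signal that reports whether another cache holds the block), and the use of "I" for a block that is not cached are assumptions of this sketch, not part of any real implementation.&lt;br /&gt;

```python
def dragon_processor_write(state, shared_line):
    """Return (next_state, bus_transactions) for a processor write.

    `shared_line` is True when another cache also holds the block.
    """
    if state in ("M", "E"):
        return "M", []  # sole copy: write locally, no bus traffic
    bus = []
    if state == "I":
        bus.append("BusRd")  # write miss: fetch the block first
        if not shared_line:
            return "M", bus
    bus.append("BusUpd")  # broadcast only the written word
    return ("Sm" if shared_line else "M"), bus


# Snooper side: (current state, observed transaction) maps to
# (next state, action taken by this cache).
DRAGON_SNOOP = {
    ("E", "BusRd"): ("Sc", None),
    ("M", "BusRd"): ("Sm", "Flush"),
    ("Sm", "BusRd"): ("Sm", "Flush"),
    ("Sc", "BusUpd"): ("Sc", "Update"),
    ("Sm", "BusUpd"): ("Sc", "Update"),
}


def dragon_snoop(state, bus_transaction):
    """Return (next_state, action); unlisted pairs leave the state unchanged."""
    return DRAGON_SNOOP.get((state, bus_transaction), (state, None))


print(dragon_processor_write("Sc", True))
print(dragon_snoop("Sm", "BusUpd"))
```

Note that no transition leads to an invalid state: a snooped write only updates the local copy, which is the defining property of an update-based protocol.&lt;br /&gt;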
&lt;br /&gt;
The Dragon protocol was implemented with snoopy caches that provide the appearance of a uniform memory space to multiple processors. Each cache listens to two buses: the processor bus and the memory bus. The caches are also responsible for address translation, so the processor bus carries virtual addresses and the memory bus carries physical addresses.  The Dragon system was designed to support 4 to 8 Dragon processors.&lt;br /&gt;
&lt;br /&gt;
=[http://en.wikipedia.org/wiki/Instruction_prefetch Instruction Prefetching]=&lt;br /&gt;
&lt;br /&gt;
In standard [http://en.wikipedia.org/wiki/Pipeline_(computing) CPU pipelining], the optimization of instruction prefetching is used to minimize the time the CPU spends waiting on instructions being fetched from main memory.  Prefetching works perfectly on completely sequential programs; but on programs with branches, other optimizations (like [http://en.wikipedia.org/wiki/Branch_predictor branch prediction]) have to be implemented, and the cost of branching the wrong way has to be considered and optimized.[[#References|&amp;lt;sup&amp;gt;[26]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
Solihin[[#References|&amp;lt;sup&amp;gt;[27]&amp;lt;/sup&amp;gt;]] states that many processors provide a special prefetch instruction and rely on software to issue it.  Solihin[[#References|&amp;lt;sup&amp;gt;[27]&amp;lt;/sup&amp;gt;]] also mentions that effective prefetching is characterized by '''coverage''' (the fraction of original cache misses that prefetching turns into cache hits), '''accuracy''' (the fraction of prefetches that generate cache hits), and '''timeliness''' (how early the prefetch arrives).  [http://en.wikipedia.org/wiki/Prefetch_buffer Prefetch buffers] help prefetching obtain the desired characteristics.  [http://publib.boulder.ibm.com/infocenter/db2luw/v8/index.jsp?topic=/com.ibm.db2.udb.doc/admin/c0005397.htm Sequential prefetching] detects and prefetches for accesses to contiguous locations, while [http://www.ics.uci.edu/~amrm/slides/amrm_structure/pta/tsld049.htm stride prefetching] detects and prefetches for consecutive accesses that are s cache blocks apart.[[#References|&amp;lt;sup&amp;gt;[27]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
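&lt;br /&gt;
As a rough illustration of these definitions, the Python sketch below computes coverage and accuracy from three counters and detects a constant stride in a short history of block addresses. The function names, the counters, and the three-access window are hypothetical choices made for this example; real prefetchers are implemented in hardware and differ in detail.&lt;br /&gt;

```python
def prefetch_metrics(baseline_misses, useful_prefetches, total_prefetches):
    """Return (coverage, accuracy) as fractions.

    baseline_misses:   cache misses observed without prefetching
    useful_prefetches: prefetched blocks that were later hit
    total_prefetches:  all prefetches issued
    """
    coverage = useful_prefetches / baseline_misses
    accuracy = useful_prefetches / total_prefetches
    return coverage, accuracy


def detect_stride(block_addresses):
    """Return the constant stride of the last three accesses, or None.

    A stride of 1 corresponds to sequential prefetching.
    """
    recent = block_addresses[-3:]
    if len(recent) != 3:
        return None
    first = recent[1] - recent[0]
    second = recent[2] - recent[1]
    if first == second and first != 0:
        return first
    return None


print(prefetch_metrics(1000, 600, 800))
stride = detect_stride([10, 14, 18])
print(stride, 18 + stride)  # detected stride and the next block to prefetch
```

In the first call, 600 of 1000 original misses are removed (coverage 0.6), and 600 of 800 issued prefetches turn out to be useful (accuracy 0.75).&lt;br /&gt;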
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
Prefetching improves performance by bringing into the cache additional blocks that surround the blocks the processor is currently accessing. However, [[#References|[28]]] states that simply increasing the hit ratio does not automatically improve overall performance in multiprocessor computation.  Prefetching can actually produce stalls on individual CPUs of a multiprocessor system: the hit-ratio benefit applies only to the thread currently benefiting from the prefetch, which then has to wait for the other threads to catch up.  As one approach to implementing prefetching in multiprocessor systems, [[#References|[28]]] suggests overlapping the prefetching decision-making time with user process time as much as possible, to minimize the impact on the overall execution sequence.&lt;br /&gt;
&lt;br /&gt;
= CMP Implementation in Intel Architecture =&lt;br /&gt;
&lt;br /&gt;
Let us now see how Intel architecture using the MESI protocol progressed from a uniprocessor architecture to a Chip MultiProcessor (CMP) using the bus as the interconnect. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Uniprocessor Architecture'''&lt;br /&gt;
&lt;br /&gt;
The diagram below shows the structure of the memory cluster in the Intel Pentium M processor.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:intel_cache1.jpg]]&amp;lt;/center&amp;gt; &amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In this structure we have,&lt;br /&gt;
* A unified on-chip '''L1 cache''' with the '''processor/core''',&lt;br /&gt;
* A '''Memory/L2 access control unit''', through which all the accesses to the L2 cache, main memory and IO space are made,&lt;br /&gt;
* The second level '''L2 cache''' along with the '''prefetch unit''' and&lt;br /&gt;
* '''Front side bus (FSB)''', a single shared bi-directional bus through which all the traffic is sent. This wide bus brings in multiple data bytes at a time. &lt;br /&gt;
&lt;br /&gt;
As Intel explains it, with this structure the processor requests were first sought in the '''L2 cache''' and were '''forwarded''' to the main '''memory''' via the front side bus ('''FSB''') only on a '''miss'''. The '''Memory/L2 access control''' unit served as a central point for '''maintaining coherence''' within the core and with the external world. It '''contains''' a '''snoop control unit''' that receives snoop requests from the bus and performs the required operations on each cache (and internal buffers) in parallel. It also handles RFO requests (BusUpgr) and ensures that the operation continues only after it has guaranteed that no other version of the cache line exists in any other cache in the system.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''CMP Architecture'''&lt;br /&gt;
&lt;br /&gt;
For the CMP implementation, Intel chose a bus-based architecture using snoopy protocols over a '''directory protocol''': although a directory protocol reduces active power thanks to reduced snoop activity, it '''increases''' the '''design complexity''' and the '''static power''' because of its larger tag arrays. Since Intel has a large market for processors in the mobility family, a directory-based solution was less favorable, because battery life depends mainly on static power consumption and less on dynamic power.&lt;br /&gt;
Let us examine how '''CMP''' was implemented in the '''Intel Core Duo''', which was one of the first dual-core processors for the budget/entry-level market. &lt;br /&gt;
The general CMP implementation structure of the Intel Core Duo is shown below.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:intel_cache2.jpg]]&amp;lt;/center&amp;gt; &amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This structure has the following changes when compared to the uniprocessor memory cluster structure. &lt;br /&gt;
* '''L1 cache''' and the '''processor/core''' structure is '''duplicated''' to give 2 cores.&lt;br /&gt;
* The '''Memory/L2 access control''' unit is '''split''' into 2 logical units: '''L2 controller''' and '''bus controller'''. The L2 controller handles all '''requests to the L2''' cache from the core and the snoop requests from the FSB. The '''bus controller''' handles '''data and I/O requests''' to and from the FSB.&lt;br /&gt;
* The '''prefetching''' unit is extended to handle the hardware '''prefetches for each core separately'''.&lt;br /&gt;
* A '''new logical unit''' (represented by the hexagon) was added to maintain '''fairness between the requests''' coming from the different cores and hence balance the requests to L2 and memory.&lt;br /&gt;
&lt;br /&gt;
This new '''partitioned structure''' for the  memory/L2 access control unit '''enhanced''' the '''performance''' while '''reducing power consumption'''. &lt;br /&gt;
For more information on uniprocessor and multiprocessor implementation under the Intel architecture, refer to &lt;br /&gt;
[http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]&lt;br /&gt;
&lt;br /&gt;
The '''Intel bus architecture''' has been '''evolving''' to accommodate the demands of scalability while using the same MESI protocol: from a '''single shared bus''', to '''dual independent buses (DIB)''' that double the available bandwidth, and on to the logical conclusion of DIB, '''dedicated high-speed interconnects (DHSI)'''. The DHSI-based platforms use four FSBs, one for each processor in the platform. In both DIB and DHSI, a snoop filter in the chipset caches snoop information, thereby significantly reducing the broadcasting needed for the snoop traffic on the buses. With the production of processors based on the next-generation 45-nm Hi-k Intel Core microarchitecture, the [http://en.wikipedia.org/wiki/Xeon Intel Xeon] processor fabric will transition from DHSI, with the memory controller in the chipset, to a distributed shared memory architecture using '''Intel QuickPath Interconnects using MESIF protocol'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Word Invalidation=&lt;br /&gt;
One complexity problem applying to a number of the protocols deals with invalidation.  In newer protocols, individual words may be modified in a cache line, as opposed to the entirety of the line.  The other processors will thus have a mostly correct cache line, with only a word difference.  This leads to potential complexity, because to 'correct' the error the other processors must transition from shared to invalid, where they then can read the correct word from the bus and place it back into the line, transitioning back into shared.  This transition is potentially unnecessary, as the second processor may never access the specific word changed, but it may access other words in the cache line.  One potential solution to this is being researched at the present time, advancing the protocols so that a cache line is &amp;quot;not invalidated on the first dirty word, but after the number of dirty words crosses some predetermined value, which is data type and application dependent.&amp;quot;  In other words, if the application can possibly have multiple words 'incorrect,' several transitions to and from the invalid state may be avoided.&lt;br /&gt;
&amp;lt;br/&amp;gt;Source: http://tab.computer.org/tcca/NEWS/sept96/dsmideas.ps&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In essence, the solution proposed here is to advance the MOESI protocol with word invalidation and specific treatment of temporal and spatial data, so that the block is not invalidated.&lt;br /&gt;
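&lt;br /&gt;
The threshold idea quoted above can be sketched as follows. This is a hedged illustration only: the class name, the stale_word_limit parameter, and the choice to invalidate once the number of stale words reaches that limit are hypothetical, and the cited proposal may define its threshold differently.&lt;br /&gt;

```python
class SharedLineCopy:
    """One cache's shared copy of a line whose words are written elsewhere."""

    def __init__(self, stale_word_limit):
        # Hypothetical tuning knob: data type and application dependent.
        self.stale_word_limit = stale_word_limit
        self.stale_words = set()
        self.state = "S"

    def remote_write(self, word_index):
        """Record that another processor wrote one word of this line."""
        if self.state == "I":
            return self.state
        self.stale_words.add(word_index)
        if len(self.stale_words) == self.stale_word_limit:
            # Too many stale words: give up the whole line.
            self.state = "I"
            self.stale_words.clear()
        return self.state

    def can_read(self, word_index):
        """A word is readable while the line is shared and the word is fresh."""
        return self.state == "S" and word_index not in self.stale_words


line = SharedLineCopy(stale_word_limit=3)
line.remote_write(0)
print(line.state, line.can_read(1), line.can_read(0))
```

With a limit of 3, the first two remote writes to distinct words leave the line in the shared state, so reads of the untouched words still hit; only the third distinct stale word invalidates the line.&lt;br /&gt;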
&lt;br /&gt;
=Performance: MOESI vs MEI/MESI=&lt;br /&gt;
Early multiprocessors (such as the PowerPC processors) were designed to work with three states (Modified, Exclusive, Invalid - this is similar to the MSI protocol discussed earlier, the term MEI is used to remain consistent with the sources used for reference).  As time progressed, more multi-processors transitioned to the MESI protocol.  This is most likely due to the characteristics of MEI - it is &amp;quot;easy to implement but leads to inefficiencies in the way that memory bus bandwidth is used&amp;quot; [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 source].  As a result, systems with two or more MEI processors (most likely) will not see the 'full' potential of the extra processors, due to the inefficient bus transactions.&lt;br /&gt;
&lt;br /&gt;
In contrast, the five-state protocol MOESI is more complex to implement than MESI and MEI/MSI.  However, advantages arise from using it.  As Andy Keane (VP of Marketing for PMC-Sierra) put it, &amp;quot;this [fifth] state allows shared data that is dirty to remain in the cache.  Without this state, any shared line would be written back to memory to change the original state from modified. Since we have a dedicated, fast path from CPU to CPU, this state maximizes the use of this path rather than the path to memory, which is inherently slower&amp;quot; [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 source].  MOESI realizes this advantage by allowing some processor-to-processor transfers to be made directly from the level 2 caches.&lt;br /&gt;
&lt;br /&gt;
Source: [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 Edwards, Chris (08/06/2001). &amp;quot;Architects wrestle with multiprocessor options&amp;quot;. Electronic engineering times (0192-1541), (1178), p. 48.]&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Cache_coherence Cache coherence]&lt;br /&gt;
# [http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]&lt;br /&gt;
# [http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Symmetric_multiprocessing Common System Interface in Intel Processors]&lt;br /&gt;
# [http://www.zak.ict.pwr.wroc.pl/nikodem/ak_materialy/Cache%20consistency%20&amp;amp;%20MESI.pdf Cache consistency with MESI on Intel processor]&lt;br /&gt;
# [http://techreport.com/articles.x/8236/2 AMD dual core Architecture]&lt;br /&gt;
# [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual]&lt;br /&gt;
# [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors]&lt;br /&gt;
# [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of AMD 64 bit core]&lt;br /&gt;
# [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&amp;amp;arnumber=4913 Silicon Graphics Computer Systems]&lt;br /&gt;
# [http://books.google.com/books?id=g82fofiqa5IC&amp;amp;printsec=frontcover&amp;amp;dq=Parallel+computer+architecture:+a+hardware/software+approach+By+David+E.+Culler,+Jaswinder+Pal+Singh,+Anoop+Gupta&amp;amp;source=bl&amp;amp;ots=COrdamlfVn&amp;amp;sig=YcugVqbzTjHvlofvaFq6Ft_tjfY&amp;amp;hl=en&amp;amp;ei=0ZO6S4TJGcOclgejzI3BBw&amp;amp;sa=X&amp;amp;oi=book_result&amp;amp;ct=result&amp;amp;resnum=1&amp;amp;ved=0CAgQ6AEwAA#v=onepage&amp;amp;q=&amp;amp;f=false Parallel computer architecture: a hardware/software approach By David E. Culler, Jaswinder Pal Singh, Anoop Gupta]&lt;br /&gt;
# [http://www.freepatentsonline.com/5283886.html Three state invalidation protocols]&lt;br /&gt;
# [http://portal.acm.org/citation.cfm?id=1499317&amp;amp;dl=GUIDE&amp;amp;coll=GUIDE&amp;amp;CFID=83027384&amp;amp;CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems]&lt;br /&gt;
# [http://portal.acm.org/citation.cfm?id=6514 Cache Coherence protocols: evaluation using a multiprocessor simulation model]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Xerox_Dragon Xerox Dragon]&lt;br /&gt;
# [http://thanaseto.110mb.com/courses/CSD-527-report-engl.pdf Coherence Protocols]&lt;br /&gt;
# [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&amp;amp;arnumber=289691 XDBus]&lt;br /&gt;
# [http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dai0228a/index.html ARM]&lt;br /&gt;
# [http://ixbtlabs.com/articles/ibmpower4/index.html IBM POWER4]&lt;br /&gt;
# Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems. Solihin, Yan&lt;br /&gt;
# [http://www.chiplist.com/Intel_Itanium_2_processor/tree3f-section--2235-/ Intel Itanium 2]&lt;br /&gt;
# [http://techreport.com/articles.x/8236/2 MESI-MESI-MOESI Banana-fana...]&lt;br /&gt;
# [http://rsim.cs.illinois.edu/rsim/Manual/node109.html RSIM]&lt;br /&gt;
# [http://dl.acm.org/citation.cfm?id=1499317 Synapse tightly coupled multiprocessors]&lt;br /&gt;
# [http://dictionary.reference.com/browse/instruction+prefetch prefetching]&lt;br /&gt;
# Yan Solihin, &amp;quot;Fundamentals of Parallel Computer Architecture,&amp;quot; Solihin Publishing &amp;amp; Consulting LLC, 2008, pp. 168-173.&lt;br /&gt;
# David F. Kotz and Carla Schlatter Ellis, &amp;quot;Prefetching in File Systems for MIMD Multiprocessors,&amp;quot; in IEEE Transactions on Parallel and Distributed Systems, Vol. 1, No. 2, April 1990, pp. 218-230.&lt;/div&gt;</summary>
		<author><name>Pxu</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8a_fu&amp;diff=59879</id>
		<title>CSC/ECE 506 Spring 2012/8a fu</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8a_fu&amp;diff=59879"/>
		<updated>2012-03-19T02:26:23Z</updated>

		<summary type="html">&lt;p&gt;Pxu: /* MESIF */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Introduction to bus-based cache coherence in real machines=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==SMP Architecture==&lt;br /&gt;
Most parallel software in the commercial market relies on the shared-memory programming model in which all processors access the same physical address space. And the most common multiprocessors today use [http://en.wikipedia.org/wiki/Symmetric_multiprocessing SMP] architecture which use a common bus as the interconnect.  In the case of multicore processors (&amp;quot;chip multiprocessors,&amp;quot; or CMP) the SMP architecture applies to the cores treating them as separate processors. The key problem of shared-memory multiprocessors is providing a consistent view of memory with various cache hierarchies.  This is called '''''cache coherence problem'''''. It is  critical to  achieve correctness and performance-sensitive design point for supporting the shared-memory model. The cache coherence mechanisms not only govern communication in a shared-memory multiprocessor, but also typically determine how the memory system transfers data between processors, caches, and memory.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:Busbased SMP.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
At any point in logical time, the permissions for a cache block can allow either a single writer or multiple readers. The '''''coherence protocol''''' ensures the invariants of the states are maintained. The different coherent states used by most of the cache coherence protocols are as shown in ''Table 1'':&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''States'''&lt;br /&gt;
|  '''Access Type'''&lt;br /&gt;
|  '''Invariant'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  read, write&lt;br /&gt;
|  all other caches in I state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  read&lt;br /&gt;
|  all other caches in I state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Owned'''&lt;br /&gt;
|  read&lt;br /&gt;
|  all other caches in I or S state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  read&lt;br /&gt;
|  no other cache in M or E state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|  -&lt;br /&gt;
|  -&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Snooping Protocols=&lt;br /&gt;
==MSI Protocol==&lt;br /&gt;
&lt;br /&gt;
'''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' is a three-state write-back '''invalidation protocol''' and one of the earliest snooping-based cache coherence protocols. It marks each cache line as '''Modified (M)''', '''Shared (S)''' or '''Invalid (I)'''. '''Invalid''' means the cache line is either not present or not valid. A cache line that is clean and may be shared by more than one processor is marked '''Shared'''. A cache line that is dirty, with the processor holding exclusive ownership of it, is in the '''Modified''' state. A BusRdX causes the other caches to invalidate (demote) their copies to the '''I''' state; if the line is in the '''M''' state in another cache, that cache flushes it. A write that hits a line in the '''S''' state still issues a BusRdX and promotes (upgrades) the line to the '''M''' state.&lt;br /&gt;
&lt;br /&gt;
The state diagram of the MSI protocol is shown below. Note that the left part shows the response to processor-side requests, and the right part shows the response to snooper-side requests.[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
[[File:MSInew.jpg|600px|thumb|center|MSI state transition diagram]]&lt;br /&gt;
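&lt;br /&gt;
The same transitions can be written as two lookup tables, one for processor-side requests and one for snooper-side requests. The Python sketch below is a minimal model of a single cache line; the table and function names are hypothetical, and a write hit in the S state is modeled as a BusRdX, as in the description above.&lt;br /&gt;

```python
# (current state, processor request) maps to (next state, bus transaction).
PROCESSOR_SIDE = {
    ("I", "PrRd"): ("S", "BusRd"),
    ("I", "PrWr"): ("M", "BusRdX"),
    ("S", "PrRd"): ("S", None),
    ("S", "PrWr"): ("M", "BusRdX"),
    ("M", "PrRd"): ("M", None),
    ("M", "PrWr"): ("M", None),
}

# (current state, snooped transaction) maps to (next state, action).
SNOOPER_SIDE = {
    ("M", "BusRd"): ("S", "Flush"),
    ("M", "BusRdX"): ("I", "Flush"),
    ("S", "BusRd"): ("S", None),
    ("S", "BusRdX"): ("I", None),
    ("I", "BusRd"): ("I", None),
    ("I", "BusRdX"): ("I", None),
}


def run_msi(events, state="I"):
    """Apply a sequence of events to one cache line and return the trace."""
    trace = []
    for event in events:
        table = PROCESSOR_SIDE if event.startswith("Pr") else SNOOPER_SIDE
        state, action = table[(state, event)]
        trace.append((event, state, action))
    return state, trace


final_state, trace = run_msi(["PrRd", "PrWr", "BusRd", "BusRdX"])
for step in trace:
    print(step)
```

The first two steps of the trace show the read-then-write sequence costing two bus transactions (a BusRd followed by a BusRdX), even when no other cache holds the line.&lt;br /&gt;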
&lt;br /&gt;
&lt;br /&gt;
===Real Architectures using MSI===&lt;br /&gt;
The '''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' protocol was first used in the '''[http://en.wikipedia.org/wiki/Silicon_Graphics SGI]''' IRIS 4D series. '''[http://en.wikipedia.org/wiki/Silicon_Graphics SGI]''' produced a broad range of '''[http://en.wikipedia.org/wiki/MIPS_architecture MIPS]'''-based (Microprocessor without Interlocked Pipeline Stages) workstations and servers during the 1990s, running '''[http://en.wikipedia.org/wiki/Silicon_Graphics SGI]''''s version of UNIX System V, now called '''[http://en.wikipedia.org/wiki/IRIX IRIX]'''. The 4D-MP graphics superworkstation delivered 40 MIPS (million instructions per second) of computing performance. This unprecedented level of computing and graphics processing in an office-environment workstation was made possible by the fastest available RISC microprocessors in a single shared-memory multiprocessor design driving a tightly coupled, highly parallel graphics system. Aggregate sustained data rates of over one gigabyte per second were achieved by a hierarchy of buses in a balanced system designed to avoid bottlenecks.&lt;br /&gt;
&lt;br /&gt;
The multiprocessor bus used in the 4D-MP graphics superworkstation is a pipelined, block-transfer bus that supports the cache coherence protocol and provides 64 megabytes per second of sustained data bandwidth between the processors, the memory and I/O system, and the graphics subsystem. Because the sync bus provides for efficient synchronization between processors, the cache coherence protocol was designed to support efficient data sharing between processors. If a cache coherence protocol has to support synchronization as well as sharing, a compromise in the efficiency of the data sharing protocol may be necessary to improve the efficiency of the synchronization operations. Hence the machine uses a simple cache coherence protocol, the '''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' protocol.&lt;br /&gt;
&lt;br /&gt;
With the simple rules of the '''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' protocol enforced by hardware, efficient synchronization and efficient data sharing are achieved in a simple shared-memory model of parallel processing in the 4D-MP graphics superworkstation.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
The MSI protocol has one serious drawback when executing read-then-write sequences in programs. Each read-then-write involves two bus transactions: a BusRd and a BusRdX. Together, these operations waste a large amount of bandwidth. Therefore, most new machines do not implement the MSI protocol. Instead, they use variants of it that reduce the amount of traffic in the coherency interconnect, such as the MOSI and MESI protocols.&lt;br /&gt;
&lt;br /&gt;
==Synapse protocol==&lt;br /&gt;
From the state transition diagram of MSI, we observe that a block transitions from state '''M''' to state '''S''' when a BusRd is observed for it, and that the contents of the block are flushed to the bus before it enters the '''S''' state. In certain cases it would be more appropriate to move to the '''I''' state instead, giving up the block entirely. The choice between '''S''' and '''I''' reflects the designer's expectation about whether the original processor is more likely to continue reading the block or the new processor is more likely to write to it. The Synapse protocol, used in the early Synapse multiprocessor, made the alternate choice of going directly from the '''M''' state to the '''I''' state on a BusRd, assuming that the migratory pattern would be more frequent. More details about this protocol can be found in two papers published in the 1980s: [http://portal.acm.org/citation.cfm?id=6514 Cache Coherence protocols: evaluation using a multiprocessor simulation model] and [http://portal.acm.org/citation.cfm?id=1499317&amp;amp;dl=GUIDE&amp;amp;coll=GUIDE&amp;amp;CFID=83027384&amp;amp;CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems].&lt;br /&gt;
&lt;br /&gt;
In Synapse protocol '''M''' state is called '''D''' (Dirty) state. The following is the state transition diagram for Synapse protocol:&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:Synapse1.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
===Real Architectures using Synapse===&lt;br /&gt;
No information could be found on any actual architecture using the Synapse cache-coherence protocol.  Judging from the abstract of the 1984 paper by Steve Frank and Armond Inselberg, &amp;quot;[http://dl.acm.org/citation.cfm?id=1499317 Synapse Tightly Coupled Multiprocessors: A New Approach to Solve Old Problems]&amp;quot;, Synapse appears to have been theoretical.&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[http://www.disi.unige.it/person/DelzannoG/CacheProtocol/synapse.hy The BABYLON Project] provides one example of a Synapse cache coherence protocol with atomic synchronization actions.&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
In the Synapse system, memory contention is reduced by designing the processor cache around a non-write-through algorithm that minimizes the bandwidth used between the cache and shared memory. The Synapse Expansion Bus includes an ownership-level protocol between processor caches.[[#References|&amp;lt;sup&amp;gt;[24]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
==[http://en.wikipedia.org/wiki/MESI_protocol MESI]==&lt;br /&gt;
The main drawback of MSI is that each read-then-write sequence incurs two bus transactions, irrespective of whether the cache line is stored in only one cache or not. Highly parallel programs that have little data sharing suffer the most from this.  The MESI protocol solves this problem by introducing the '''Exclusive''' state to distinguish between a cache line stored in multiple caches and a line stored in a single cache.&lt;br /&gt;
Let us briefly see how the MESI protocol works. For a more detailed version, refer to the Solihin textbook, pg. 215.&lt;br /&gt;
&lt;br /&gt;
The MESI coherence protocol marks each cache line as being in one of the Modified, Exclusive, Shared, or Invalid states. &lt;br /&gt;
* '''Invalid''' : The cache line is either not present or is invalid&lt;br /&gt;
* '''Exclusive''' : The cache line is clean and is owned by this core/processor only&lt;br /&gt;
* '''Modified''' : The cache line is dirty and the core/processor has exclusive ownership of it; the copy in memory is stale.&lt;br /&gt;
* '''Shared''' : The cache line is clean and is shared by more than one core/processor&lt;br /&gt;
&lt;br /&gt;
The MESI protocol works as follows: &lt;br /&gt;
A line that is fetched receives the '''E''' or '''S''' state, depending on whether it exists in other processors in the system. A cache line gets the '''M''' state when a processor writes to it; if the line is not in the '''E''' or '''M''' state prior to the write, the cache sends a Bus Upgrade (BusUpgr) signal or, as the Intel manuals term it, a “Read-For-Ownership (RFO) request”, which ensures that the line exists in the cache and is in the '''I''' state in all other processors on the bus (if any). The table below summarizes the MESI protocol.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''This cache line is valid?'''&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  No&lt;br /&gt;
|-&lt;br /&gt;
|  '''The memory copy is…'''&lt;br /&gt;
|  out of date&lt;br /&gt;
|  valid&lt;br /&gt;
|  valid&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies exist in caches of other processors?'''&lt;br /&gt;
|  No&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe&lt;br /&gt;
|  Maybe&lt;br /&gt;
|-&lt;br /&gt;
|  '''A write to this line'''&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  goes directly to bus&lt;br /&gt;
|}&lt;br /&gt;
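&lt;br /&gt;
To make the benefit of the Exclusive state concrete, the Python sketch below counts the bus transactions generated by one read-then-write sequence on a line that starts out invalid. The function name and the other_sharers flag are hypothetical, and the sketch assumes that a MESI write to a Shared line uses BusUpgr while an MSI write uses BusRdX, as described in this article.&lt;br /&gt;

```python
def read_then_write(protocol, other_sharers):
    """Return the bus transactions for a read followed by a write.

    protocol:      "MSI" or "MESI"
    other_sharers: True when another cache already holds the line
    """
    bus = ["BusRd"]  # the read miss always goes to the bus
    if protocol == "MESI" and not other_sharers:
        state = "E"  # sole copy: loaded in the Exclusive state
    else:
        state = "S"
    if state == "S":
        # Write permission has to be requested on the bus.
        bus.append("BusUpgr" if protocol == "MESI" else "BusRdX")
    # In the E state the write is silent: E becomes M with no bus traffic.
    return bus


print(read_then_write("MSI", other_sharers=False))
print(read_then_write("MESI", other_sharers=False))
print(read_then_write("MESI", other_sharers=True))
```

For a line that no other cache holds, MSI issues two transactions where MESI issues one, which is why programs with little data sharing gain the most from the extra state.&lt;br /&gt;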
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
State transition diagram for MESI protocol is shown below.[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
[[File:MESInew.jpg|500px|thumb|center|MESI state transition diagram]] &lt;br /&gt;
&lt;br /&gt;
===Real Architectures using MESI===&lt;br /&gt;
&lt;br /&gt;
Intel's [http://en.wikipedia.org/wiki/Pentium_Pro Pentium Pro] microprocessor, introduced in 1995, was the first Intel architecture microprocessor to support symmetric multiprocessing in various multiprocessor configurations.  SMP with the MESI protocol remained the architecture in consistent use until the introduction of the 45-nm Hi-k Core microarchitecture in Intel's quad-core x86-64 (Nehalem-EP) processors. The 45-nm Hi-k Intel Core microarchitecture uses a new system framework called the [http://en.wikipedia.org/wiki/Intel_QuickPath_Interconnect QuickPath Interconnect], which uses point-to-point interconnection technology based on a distributed shared memory architecture.  It uses a modified version of the MESI protocol called [http://en.wikipedia.org/wiki/MESIF_protocol MESIF], which introduces the additional Forward state.  (The state diagram of MESI transitions that occur within the Pentium's data cache can be found on page 63 of [http://books.google.com/books?id=TVzjEZg1--YC&amp;amp;printsec=frontcover&amp;amp;source=gbs_ge_summary_r&amp;amp;cad=0#v=onepage&amp;amp;q&amp;amp;f=false Pentium Processor System Architecture: Second Edition] by Don Anderson, Tom Shanley, MindShare, Inc.)&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
The [http://www.arm.com/products/processors/classic/arm11/arm11-mpcore.php ARM11 MPCore] and [http://en.wikipedia.org/wiki/ARM_Cortex-A9_MPCore Cortex-A9 MPCore] processors support the MESI cache coherency protocol.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]  ARM MPCore defines the states of the MESI protocol it implements as:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies in other caches'''&lt;br /&gt;
|  NO&lt;br /&gt;
|  NO&lt;br /&gt;
|  YES&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Clean or Dirty'''&lt;br /&gt;
|  DIRTY&lt;br /&gt;
|  CLEAN&lt;br /&gt;
|  CLEAN&lt;br /&gt;
|  -&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
The coherency protocol is implemented and managed by the Snoop Control Unit (SCU) in the ARM MPCore, which monitors the traffic between the local L1 data caches and the next level of the memory hierarchy. At boot time, each core can choose whether or not to take part in the coherency domain.  Unless explicit system calls bind a task to a specific core ([http://en.wikipedia.org/wiki/Processor_affinity processor affinity]), there is a high chance that the task will at some point migrate to a different core, along with its data as it is used.  Task migration is not handled efficiently by a literal MESI implementation, so the ARM MPCore offers two optimizations that preserve MESI compliance while supporting the migration of tasks. With '''Direct Data Intervention (DDI)''', the SCU keeps a copy of the tag RAMs of all cores' caches, which lets it efficiently detect whether a cache line requested by one core is held by another core in the coherency domain before looking for it in the next level of the memory hierarchy. With '''Cache-to-cache Migration''', if the SCU finds that the cache line requested by one CPU is present in another core, it either copies it (if clean) or moves it (if dirty) from the other CPU directly into the requesting one, without interacting with external memory.  These optimizations reduce memory traffic in and out of the L1 cache subsystem by eliminating interaction with external memories, which in turn reduces the overall load on the interconnect and the overall power consumption.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
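&lt;br /&gt;
The lookup performed by the SCU can be sketched in Python as follows. This is a toy model under stated assumptions: the class name, the per-core tag dictionaries, and the returned labels are hypothetical, and it only mirrors the copy-if-clean, move-if-dirty behaviour described above rather than the actual ARM hardware.&lt;br /&gt;

```python
class SnoopControlUnit:
    """Toy SCU that keeps a duplicate of every core's cache tags."""

    def __init__(self, num_cores):
        # One dictionary per core: line address maps to "clean" or "dirty".
        self.tags = [dict() for _ in range(num_cores)]

    def read_miss(self, requester, address):
        """Serve a miss from another core's cache when possible."""
        for core, core_tags in enumerate(self.tags):
            if core == requester or address not in core_tags:
                continue
            if core_tags[address] == "dirty":
                # Dirty line: move it, the source core gives up its copy.
                del core_tags[address]
                self.tags[requester][address] = "dirty"
                return ("moved", core)
            # Clean line: copy it, both cores keep the line.
            self.tags[requester][address] = "clean"
            return ("copied", core)
        # No core holds the line: fall back to the next memory level.
        self.tags[requester][address] = "clean"
        return ("external_memory", None)


scu = SnoopControlUnit(num_cores=2)
scu.tags[0][64] = "dirty"
print(scu.read_miss(1, 64))   # served from core 0 without touching memory
print(scu.read_miss(1, 128))  # not cached anywhere
```

Because the duplicated tags are consulted first, a migrated task can pick up its working set from the previous core's cache instead of going to external memory.&lt;br /&gt;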
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
Other architectures that use the MESI cache-coherence protocol include the L2 cache of the IBM POWER4 processor[[#References|&amp;lt;sup&amp;gt;[20]&amp;lt;/sup&amp;gt;]], the L2 cache of the Intel Itanium 2 processor[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]], and the Intel Xeon[[#References|&amp;lt;sup&amp;gt;[22]&amp;lt;/sup&amp;gt;]].&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
One implementation complexity already mentioned is the inefficiency of task migration in MESI, which the ARM MPCore addresses.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]  Another arises when a cache line is replaced: one possible MESI implementation requires a message to be sent to main memory when a line is flushed (i.e. an E to I transition), because the line was held exclusively by one cache before it was removed. This replacement message can be avoided if the system is designed so that the flush of a modified (exclusive) line requires an acknowledgment from main memory. However, the flush must then be held in a 'write-back' buffer until the reply arrives, to ensure that the change is successfully propagated to memory.[[#References|&amp;lt;sup&amp;gt;[23]&amp;lt;/sup&amp;gt;]]&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==MESIF==&lt;br /&gt;
This section gives a brief overview of the '''MESIF protocol''':&lt;br /&gt;
&lt;br /&gt;
Developed by Intel, the MESIF protocol consists of five states: Modified (M), Exclusive (E), Shared (S), Invalid (I) and Forward (F). On a read request, only the cache holding the line in the F state responds, while all caches holding it in the S state remain silent.  By designating a single cache to '''respond to requests''', coherency traffic is substantially reduced when multiple copies of the data exist. When a cache line in the '''F''' state is '''copied''', the F state '''migrates''' to the '''newer copy''', while the '''older''' one drops back to '''S'''. Moving the F state to the new copy '''exploits''' both '''temporal and spatial locality'''. Because the newest copy of the cache line is always the one in the F state, the line in the F state is unlikely to be evicted from the caches, which takes advantage of the temporal locality of the requests. In addition, if a particular cache line is in high demand due to spatial locality, the bandwidth used to transmit that data is spread across several nodes.&lt;br /&gt;
All M to S and E to S state transitions of MESI now become '''M to F''' and '''E to F''' transitions.  &lt;br /&gt;
The '''F state''' is '''different''' from the '''Owned state''' of the MOESI protocol because a line in the F state is clean: a valid copy is also stored in memory. Thus, unlike a line in the Owned state of MOESI, whose owner must eventually write it back because memory is out of date, a line in the F state can be evicted or converted to the S state at any time without a write-back. &lt;br /&gt;
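The migration of the F state on a read can be sketched as follows. This is a simplified illustration, not Intel's implementation: the function name mesif_read is hypothetical, only lines held in F, S or I are modelled, and the choice to give the requester the F state when only S copies remain is an assumption of this sketch.&lt;br /&gt;

```python
# Hypothetical sketch of a read request under MESIF for one cache line.
# states holds one entry per cache: "M", "E", "S", "I" or "F".
def mesif_read(states, requester):
    if states[requester] != "I":
        return states, "hit"            # requester already holds a valid copy
    if any(s in ("M", "E") for s in states):
        raise NotImplementedError("M/E holders are outside this sketch")
    new = list(states)
    if "F" in new:
        forwarder = new.index("F")
        new[forwarder] = "S"            # the older copy drops back to S
        new[requester] = "F"            # the newest copy becomes the forwarder
        return new, "cache %d" % forwarder
    if "S" in new:
        new[requester] = "F"            # assumption: memory supplies, requester forwards next
        return new, "memory"
    new[requester] = "E"                # no other copy exists anywhere
    return new, "memory"
```

Note that the caches in the S state never answer: either the single F holder or memory supplies the data.&lt;br /&gt;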
&lt;br /&gt;
&lt;br /&gt;
===Real Architectures using MESIF===&lt;br /&gt;
The MESIF protocol is implemented in the Intel QuickPath Interconnect (QPI), a high-speed, packetized, point-to-point interconnect used in Intel's Core i series processors starting in 2008. The narrow high-speed links stitch together processors in a distributed shared memory-style platform architecture. Compared with front-side buses, QPI offers much higher bandwidth with low latency. It is designed to deliver more of its interconnect performance in real systems: its snoop protocol is optimized for low latency and high scalability, and its packet and lane structures enable quick completion of transactions. [[#References|&amp;lt;sup&amp;gt;[2]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
More information on the QuickPath Interconnect and MESIF protocol can be found at&lt;br /&gt;
'''[http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]'''&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
&lt;br /&gt;
==MOESI==&lt;br /&gt;
The [http://en.wikipedia.org/wiki/Opteron AMD Opteron] was AMD’s first-generation dual-core processor, with two distinct [http://en.wikipedia.org/wiki/Athlon_64 K8 cores] on a single die.  Cache coherence becomes a bigger problem on such multiprocessors, and an appropriate coherence protocol was needed to address it. The [http://en.wikipedia.org/wiki/Xeon Intel Xeon], the competing processor from Intel, used the MESI protocol to handle cache coherence.  MESI has the drawback of consuming considerable time and bandwidth in certain situations. &lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/MOESI_protocol MOESI] was AMD’s answer to this problem. MOESI adds a fifth state, '''“Owned”''', to the MESI protocol. It addresses the bandwidth problem MESI faces when a processor wants to access data that another processor has modified: under MESI, the requesting processor has to wait for the processor that modified the data to write it back to main memory, which takes time and bandwidth. MOESI removes this drawback by allowing '''''dirty sharing'''''.  A processor holding the data in the '''“Owned”''' state can supply the modified data to other processors without, or before, writing it to main memory. The processor holding the data in the '''&amp;quot;Owned&amp;quot;''' state remains responsible for updating main memory later, when the cache line is evicted.&lt;br /&gt;
&lt;br /&gt;
MOESI has become one of the most popular snoop-based protocols supported in the AMD64 architecture.  The AMD dual-core Opteron can maintain cache coherence in systems up to 8 processors using this protocol.&lt;br /&gt;
&lt;br /&gt;
The five different states of the MOESI protocol are:&lt;br /&gt;
* '''Modified (M)''' : The cache line holds the most recent, correct copy of the data. The copy in main memory is out of date, and no other processor cache holds a copy.&lt;br /&gt;
* '''Owned (O)'''   : The cache line holds the most recent, correct copy of the data, and other processors may hold shared copies. The processor holding the line in this state is responsible for updating main memory when the line is evicted.  &lt;br /&gt;
* '''Exclusive (E)''' : The cache line holds the most recent, correct copy of the data, which is held only by this processor; main memory also holds an up-to-date copy.  &lt;br /&gt;
* '''Shared (S)''' : The cache line holds the most recent, correct copy of the data, which may also be held by other processors. &lt;br /&gt;
* '''Invalid (I)''' : The cache line does not hold a valid copy of the data.&lt;br /&gt;
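The difference that the Owned state makes can be sketched by comparing what happens when another cache reads a line under MESI and under MOESI. The function names snoop_read and needs_writeback are hypothetical, and the memory write count is a simplification that ignores timing.&lt;br /&gt;

```python
# Hypothetical sketch: effect of a snooped read on the cache that holds the line.
# Returns (new_holder_state, requester_state, memory_writes).
def snoop_read(protocol, holder_state):
    if holder_state == "M":
        if protocol == "MOESI":
            return ("O", "S", 0)    # dirty sharing: memory is not updated
        return ("S", "S", 1)        # MESI: the flush also updates memory
    if holder_state == "O":
        return ("O", "S", 0)        # the owner keeps supplying the dirty data
    if holder_state in ("E", "S"):
        return ("S", "S", 0)        # clean data, memory is already up to date
    raise ValueError("holder has no valid copy")

# A line must be written back on eviction only if memory is out of date.
def needs_writeback(state):
    return state in ("M", "O")
```

Under these assumptions the write to memory is deferred under MOESI until the owner evicts the line.&lt;br /&gt;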
&lt;br /&gt;
A detailed explanation of how this protocol is implemented on AMD processors can be found in [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of the AMD 64-bit core].&lt;br /&gt;
&lt;br /&gt;
The following table summarizes the MOESI protocol:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Owned'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''This cache line is valid?'''&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  No&lt;br /&gt;
|-&lt;br /&gt;
|  '''The memory copy is…'''&lt;br /&gt;
|  out of date&lt;br /&gt;
|  out of date&lt;br /&gt;
|  valid&lt;br /&gt;
|  valid, unless another cache holds the line in Owned&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies exist in caches of other processors?'''&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe&lt;br /&gt;
|  Maybe&lt;br /&gt;
|-&lt;br /&gt;
|  '''A write to this line'''&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  goes directly to bus&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The state transition diagram for the MOESI protocol is shown below. The top part shows the responses to processor-side requests and the bottom part shows the responses to snooper-side requests. [[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:MOESI.jpg|500px|thumb|center|MOESI state transition diagram]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Real Architectures using MOESI===&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
&lt;br /&gt;
When the MOESI protocol is implemented on a real architecture such as the AMD K10 series, the protocol is modified or optimized to operate more efficiently for specific program patterns. For example, the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] family of microprocessors (Family 0x10), AMD’s first generation to incorporate four distinct cores on a single die and the first to have a cache shared by all the cores, uses the MOESI protocol with some optimization techniques incorporated. &lt;br /&gt;
&lt;br /&gt;
One such optimization targets a small subset of compute problems that behave like producer-consumer programs. In such a problem, a thread running on one core produces data that is consumed by a thread running on a separate core. With such programs, it is desirable for the two cores to communicate through the shared cache, to avoid round trips to and from main memory. However, the '''MOESI''' protocol that the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] cache uses for coherence can limit bandwidth in this pattern, so better performance can be achieved by keeping the cache line in the '''‘M’''' state.&lt;br /&gt;
&lt;br /&gt;
When the producer thread writes a new entry, it allocates cache lines in the '''modified (M)''' state. Eventually, these M-marked cache lines start to fill the L3 cache. When the consumer reads a cache line, the MOESI protocol changes the state of the line to '''owned (O)''' in the L3 cache and pulls down a '''shared (S)''' copy for the consumer's own use. The producer thread then circles the ring buffer and arrives back at the same cache line it had previously written. However, when the producer attempts to write new data to the line, it finds that it cannot do so directly, since a line marked '''‘O’''' by the previous consumer read does not carry sufficient permission for a write under the MOESI protocol. To maintain coherence, the memory controller must initiate probes in the other caches (to handle any S copies that may exist), which slows down the write.&lt;br /&gt;
&lt;br /&gt;
It is therefore preferable to keep the cache line in the '''‘M’''' state in the L3 cache. Then, when the producer comes back around the ring buffer, it finds the previously written cache line still marked '''‘M’''' and can safely write to it without coherence concerns. This illustrates how optimizations to standard protocols can improve performance when the protocols are implemented in real machines.&lt;br /&gt;
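The cost of the O-state write in the producer-consumer pattern can be illustrated with a toy count of probes. This is a deliberately simplified sketch: the function count_probes and its parameters are hypothetical, the producer and consumer each complete a full lap of the ring buffer in turn, and only probes caused by writes to lines in the O state are counted.&lt;br /&gt;

```python
# Hypothetical sketch: count probes caused by producer writes to 'O' lines.
def count_probes(slots, laps, keep_modified):
    state = ["I"] * slots               # state of each ring-buffer line in the shared cache
    probes = 0
    for _ in range(laps):
        for i in range(slots):          # producer writes every slot
            if state[i] == "O":
                probes += 1             # other caches must be probed before the write
            state[i] = "M"
        for i in range(slots):          # consumer reads every slot
            if not keep_modified:
                state[i] = "O"          # standard MOESI: M becomes O on the read
    return probes
```

With 4 slots and 3 laps this sketch counts 8 probes for the standard behaviour and none when the lines are kept in the M state.&lt;br /&gt;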
&lt;br /&gt;
More information on how this is implemented, and on other optimizations, can be found in the [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors].&lt;br /&gt;
&lt;br /&gt;
==Dragon Protocol==&lt;br /&gt;
The '''[http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]''' is an update-based coherence protocol: unlike the coherence protocols discussed so far, it does not invalidate other cached copies. Write propagation is achieved by updating the cached copies instead of invalidating them.  The Dragon Protocol does not update memory on a cache-to-cache transfer; it delays making memory consistent with the caches until the data is evicted and written back, which saves time and lowers the memory access requirements. Moreover, only the written '''byte''' or '''word''' is '''communicated''' to the '''other caches''' instead of the whole block, which further '''reduces''' the '''bandwidth''' usage. The protocol dynamically detects the sharing status of a block, broadcasting writes to shared blocks (a write-through-style policy) and using write-back for blocks that are currently not shared. The Dragon Protocol employs the following four states for cache blocks: '''Shared Clean''', '''Shared Modified''', '''Exclusive''' and  '''Modified'''. &lt;br /&gt;
* '''Modified (M)''' and '''Exclusive (E)''' - these states have the same meaning as explained in the protocols above. &lt;br /&gt;
* '''Shared Modified (Sm)''' - Only one cache can hold a given block in the Shared Modified state. Potentially two or more caches hold this block, memory may or may not be up to date, and this processor's cache has modified the block.&lt;br /&gt;
* '''Shared Clean (Sc)''' -  Potentially two or more caches hold this block and memory may or may not be up to date (if no other cache holds it in the Sm state, memory is up to date; otherwise it is not).&lt;br /&gt;
Only when a Shared Modified line is evicted from the cache on a cache miss is the block written back to main memory to keep memory consistent. For more information on the Dragon protocol, refer to the Solihin textbook, page 229. The state transition diagram is given below for reference.[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
[[File:DragonNew.jpg|500px|thumb|center|Dragon state transition diagram]] &lt;br /&gt;
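The write-hit behaviour of an update-based protocol such as Dragon can be sketched as follows. The function name dragon_write is hypothetical, write misses are not modelled, and None is used here to mean that a cache does not hold the block.&lt;br /&gt;

```python
# Hypothetical sketch of a processor write hit under the Dragon protocol.
# states holds one entry per cache for a single block: "E", "M", "Sc", "Sm" or None.
def dragon_write(states, writer):
    new = list(states)
    me = new[writer]
    if me in ("E", "M"):
        new[writer] = "M"               # no sharers: silent write, no bus traffic
        return new, None
    if me in ("Sc", "Sm"):
        others = [i for i, s in enumerate(new) if i != writer and s is not None]
        if others:
            for i in others:
                new[i] = "Sc"           # sharers take the updated word and stay valid
            new[writer] = "Sm"          # the writer now owns the dirty shared block
        else:
            new[writer] = "M"           # no sharer answered: the block is private again
        return new, "BusUpd"
    raise ValueError("write miss is not modelled in this sketch")
```

No copy is invalidated in any branch, which is the key difference from the invalidation-based protocols above.&lt;br /&gt;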
&lt;br /&gt;
The Dragon protocol was implemented with snoopy caches that provide the appearance of a uniform memory space to multiple processors. Each cache listens to two buses: the processor bus and the memory bus. The caches are also responsible for address translation, so the processor bus carries virtual addresses and the memory bus carries physical addresses.  The Dragon system was designed to support 4 to 8 Dragon processors.&lt;br /&gt;
&lt;br /&gt;
=[http://en.wikipedia.org/wiki/Instruction_prefetch Instruction Prefetching]=&lt;br /&gt;
&lt;br /&gt;
In standard [http://en.wikipedia.org/wiki/Pipeline_(computing) CPU pipelining], instruction prefetching is used to minimize the time the CPU spends waiting for instructions to be fetched from main memory.  Prefetching works best on completely sequential programs; for programs with branches, other optimizations (like [http://en.wikipedia.org/wiki/Branch_predictor branch prediction]) have to be implemented, and the cost of a mispredicted branch has to be considered and minimized.[[#References|&amp;lt;sup&amp;gt;[26]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
Solihin[[#References|&amp;lt;sup&amp;gt;[27]&amp;lt;/sup&amp;gt;]] states that many processors provide a special prefetch instruction that software can use to implement prefetching.  Solihin[[#References|&amp;lt;sup&amp;gt;[27]&amp;lt;/sup&amp;gt;]] also mentions that effective prefetching is characterized by '''coverage''' (the fraction of original cache misses that prefetching turns into cache hits), '''accuracy''' (the fraction of prefetches that result in cache hits), and '''timeliness''' (how early the prefetch arrives).  [http://en.wikipedia.org/wiki/Prefetch_buffer Prefetch buffers] help prefetching achieve the desired characteristics.  [http://publib.boulder.ibm.com/infocenter/db2luw/v8/index.jsp?topic=/com.ibm.db2.udb.doc/admin/c0005397.htm Sequential prefetching] detects and prefetches accesses to contiguous locations, while [http://www.ics.uci.edu/~amrm/slides/amrm_structure/pta/tsld049.htm stride prefetching] detects and prefetches accesses that are ''s'' cache blocks apart between consecutive accesses.[[#References|&amp;lt;sup&amp;gt;[27]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
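The coverage and accuracy definitions above, and the idea of detecting a stride, can be made concrete with a short sketch. The function names and the counter arguments are hypothetical; a real prefetcher tracks this information in hardware tables rather than in lists.&lt;br /&gt;

```python
# Hypothetical sketch of the prefetching metrics defined above.
def prefetch_metrics(baseline_misses, misses_removed, prefetches_issued, useful_prefetches):
    coverage = misses_removed / baseline_misses        # share of original misses turned into hits
    accuracy = useful_prefetches / prefetches_issued   # share of prefetches that produced hits
    return coverage, accuracy

# Return the constant stride (in cache blocks) of an access stream, or None.
# A stride of 1 corresponds to sequential prefetching.
def detect_stride(block_addresses):
    strides = {b - a for a, b in zip(block_addresses, block_addresses[1:])}
    if len(strides) == 1:
        return strides.pop()
    return None
```

For example, a run with 100 original misses of which 60 are removed by 80 issued prefetches has a coverage of 0.6 and an accuracy of 0.75.&lt;br /&gt;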
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
Prefetching brings into the cache additional blocks surrounding the blocks currently being accessed by the processor. [[#References|[28]]] states that simply increasing the hit ratio does not automatically improve overall performance in multiprocessor computation.  Prefetching can actually produce stalls on individual CPUs of a multiprocessor system, as the hit ratio benefit applies only to the thread currently benefiting from the prefetch, which then has to wait for the other threads to catch up.  As one approach to implementing prefetching in multiprocessor systems, [[#References|[28]]] suggests overlapping prefetching decision-making time with user process time as much as possible, to minimize the impact on the overall execution sequence.&lt;br /&gt;
&lt;br /&gt;
= CMP Implementation in Intel Architecture =&lt;br /&gt;
&lt;br /&gt;
Let us now see how the Intel architecture, using the MESI protocol, progressed from a uniprocessor architecture to a Chip MultiProcessor (CMP) using the bus as the interconnect. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Uniprocessor Architecture'''&lt;br /&gt;
&lt;br /&gt;
The diagram below shows the structure of the memory cluster in the Intel Pentium M processor.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:intel_cache1.jpg]]&amp;lt;/center&amp;gt; &amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In this structure we have,&lt;br /&gt;
* A unified on-chip '''L1 cache''' with the '''processor/core''',&lt;br /&gt;
* A '''Memory/L2 access control unit''', through which all the accesses to the L2 cache, main memory and IO space are made,&lt;br /&gt;
* The second level '''L2 cache''' along with the '''prefetch unit''' and&lt;br /&gt;
* '''Front side bus (FSB)''', a single shared bi-directional bus across which all the traffic is sent. This wide bus brings in multiple data bytes at a time. &lt;br /&gt;
&lt;br /&gt;
As Intel explains it, in this structure processor requests are first looked up in the '''L2 cache''' and only on a '''miss''' are they '''forwarded''' to main '''memory''' via the front side bus ('''FSB'''). The '''Memory/L2 access control''' unit serves as a central point for '''maintaining coherence''' within the core and with the external world. It '''contains''' a '''snoop control unit''' that receives snoop requests from the bus and performs the required operations on each cache (and the internal buffers) in parallel. It also handles RFO requests (BusUpgr) and ensures that the operation continues only after it guarantees that no other version of the cache line exists in any other cache in the system.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''CMP Architecture'''&lt;br /&gt;
&lt;br /&gt;
For the CMP implementation, Intel chose a bus-based architecture using snoopy protocols over the '''directory protocol''' because, although a directory protocol reduces active power through reduced snoop activity, it '''increases''' the '''design complexity''' and the '''static power''' due to larger tag arrays. Since Intel has a large market for processors in the mobility family, a directory-based solution was less favorable, as battery life depends mainly on static power consumption and less on dynamic power.&lt;br /&gt;
Let us examine how '''CMP''' was implemented in the '''Intel Core Duo''', which was one of the first dual-core processors for the budget/entry-level market. &lt;br /&gt;
The general CMP implementation structure of the Intel Core Duo is shown below.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:intel_cache2.jpg]]&amp;lt;/center&amp;gt; &amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This structure has the following changes when compared to the uniprocessor memory cluster structure. &lt;br /&gt;
* '''L1 cache''' and the '''processor/core''' structure is '''duplicated''' to give 2 cores.&lt;br /&gt;
* The '''Memory/L2 access control''' unit is '''split''' into 2 logical units: '''L2 controller''' and '''bus controller'''. The L2 controller handles all '''requests to the L2''' cache from the core and the snoop requests from the FSB. The '''bus controller''' handles '''data and I/O requests''' to and from the FSB.&lt;br /&gt;
* The '''prefetching''' unit is extended to handle the hardware '''prefetches for each core separately'''.&lt;br /&gt;
* A '''new logical unit''' (represented by the hexagon) was added to maintain '''fairness between the requests''' coming from the different cores and hence balance the requests to L2 and memory.&lt;br /&gt;
&lt;br /&gt;
This new '''partitioned structure''' for the  memory/L2 access control unit '''enhanced''' the '''performance''' while '''reducing power consumption'''. &lt;br /&gt;
For more information on uniprocessor and multiprocessor implementation under the Intel architecture, refer to &lt;br /&gt;
[http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]&lt;br /&gt;
&lt;br /&gt;
The '''Intel bus architecture''' has been '''evolving''' to accommodate the demands of scalability while using the same MESI protocol: from a '''single shared bus''', to '''dual independent buses (DIB)''', which double the available bandwidth, to the logical conclusion of DIB, '''dedicated high-speed interconnects (DHSI)'''. The DHSI-based platforms use four FSBs, one for each processor in the platform. In both DIB and DHSI, a snoop filter in the chipset caches snoop information, thereby significantly reducing the broadcasting needed for snoop traffic on the buses. With the production of processors based on the next-generation 45-nm Hi-k Intel Core microarchitecture, the [http://en.wikipedia.org/wiki/Xeon Intel Xeon] processor fabric transitions from DHSI, with the memory controller in the chipset, to a distributed shared memory architecture using '''Intel QuickPath Interconnects with the MESIF protocol'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Word Invalidation=&lt;br /&gt;
One complexity that applies to a number of the protocols concerns invalidation.  In newer protocols, individual words in a cache line may be modified, as opposed to the entire line.  The other processors then hold a mostly correct cache line that differs by only one word.  This leads to potential complexity, because to correct the difference the other processors must transition from shared to invalid, read the correct word from the bus, place it back into the line, and transition back to shared.  This transition is potentially unnecessary, as the second processor may never access the specific word that changed, even though it may access other words in the cache line.  One potential solution being researched is to advance the protocols so that a cache line is &amp;quot;not invalidated on the first dirty word, but after the number of dirty words crosses some predetermined value, which is data type and application dependent.&amp;quot;  In other words, if the application can tolerate several 'incorrect' words in a line, several transitions to and from the invalid state may be avoided.&lt;br /&gt;
&amp;lt;br/&amp;gt;Source: http://tab.computer.org/tcca/NEWS/sept96/dsmideas.ps&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In essence, the solution proposed here is to advance the MOESI protocol with word invalidation and specific treatment of temporal and spatial data, so that the block is not invalidated.&lt;br /&gt;
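The threshold idea quoted above can be sketched for a single cached line. This is only an illustration of the counting rule: the class ThresholdLine, its methods, and the policy of refusing reads to individual dirty words are assumptions of this sketch, not details taken from the cited proposal.&lt;br /&gt;

```python
# Hypothetical sketch: invalidate a line only after too many of its words are stale.
class ThresholdLine:
    def __init__(self, threshold):
        self.threshold = threshold      # assumed to be tuned per data type and application
        self.dirty = set()              # indices of words written by other processors
        self.valid = True

    def remote_write(self, word_index):
        # Another processor modified one word of this line.
        self.dirty.add(word_index)
        if len(self.dirty) > self.threshold:
            self.valid = False          # only now is the whole line invalidated

    def can_read(self, word_index):
        # Untouched words stay readable while the line is still valid.
        return self.valid and word_index not in self.dirty
```

With a threshold of 2, the first two remote writes leave the rest of the line readable and the third invalidates it.&lt;br /&gt;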
&lt;br /&gt;
=Performance: MOESI vs MEI/MESI=&lt;br /&gt;
Early multiprocessors (such as the PowerPC processors) were designed to work with three states (Modified, Exclusive, Invalid; this is similar to the MSI protocol discussed earlier, and the term MEI is used to remain consistent with the sources used for reference).  As time progressed, more multiprocessors transitioned to the MESI protocol.  This is most likely due to the characteristics of MEI: it is &amp;quot;easy to implement but leads to inefficiencies in the way that memory bus bandwidth is used&amp;quot; [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 source].  As a result, systems with two or more MEI processors will most likely not realize the full potential of the extra processors, due to the inefficient bus transactions.&lt;br /&gt;
&lt;br /&gt;
In contrast, the five-state MOESI protocol is more complex to implement than MESI and MEI/MSI.  However, using it brings advantages.  As Andy Keane (VP of Marketing for PMC-Sierra) put it, &amp;quot;this [fifth] state allows shared data that is dirty to remain in the cache.  Without this state, any shared line would be written back to memory to change the original state from modified. Since we have a dedicated, fast path from CPU to CPU, this state maximizes the use of this path rather than the path to memory, which is inherently slower&amp;quot; [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 source].  This advantage can be realized by allowing MOESI to make some processor-to-processor transfers directly between level 2 caches.&lt;br /&gt;
&lt;br /&gt;
Source: [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 Edwards, Chris (08/06/2001). &amp;quot;Architects wrestle with multiprocessor options&amp;quot;. Electronic engineering times (0192-1541), (1178), p. 48.]&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Cache_coherence Cache coherence]&lt;br /&gt;
# [http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]&lt;br /&gt;
# [http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Symmetric_multiprocessing Common System Interface in Intel Processors]&lt;br /&gt;
# [http://www.zak.ict.pwr.wroc.pl/nikodem/ak_materialy/Cache%20consistency%20&amp;amp;%20MESI.pdf Cache consistency with MESI on Intel processor]&lt;br /&gt;
# [http://techreport.com/articles.x/8236/2 AMD dual core Architecture]&lt;br /&gt;
# [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual]&lt;br /&gt;
# [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors]&lt;br /&gt;
# [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of AMD 64 bit core]&lt;br /&gt;
# [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&amp;amp;arnumber=4913 Silicon Graphics Computer Systems]&lt;br /&gt;
# [http://books.google.com/books?id=g82fofiqa5IC&amp;amp;printsec=frontcover&amp;amp;dq=Parallel+computer+architecture:+a+hardware/software+approach+By+David+E.+Culler,+Jaswinder+Pal+Singh,+Anoop+Gupta&amp;amp;source=bl&amp;amp;ots=COrdamlfVn&amp;amp;sig=YcugVqbzTjHvlofvaFq6Ft_tjfY&amp;amp;hl=en&amp;amp;ei=0ZO6S4TJGcOclgejzI3BBw&amp;amp;sa=X&amp;amp;oi=book_result&amp;amp;ct=result&amp;amp;resnum=1&amp;amp;ved=0CAgQ6AEwAA#v=onepage&amp;amp;q=&amp;amp;f=false Parallel computer architecture: a hardware/software approach By David E. Culler, Jaswinder Pal Singh, Anoop Gupta]&lt;br /&gt;
# [http://www.freepatentsonline.com/5283886.html Three state invalidation protocols]&lt;br /&gt;
# [http://portal.acm.org/citation.cfm?id=1499317&amp;amp;dl=GUIDE&amp;amp;coll=GUIDE&amp;amp;CFID=83027384&amp;amp;CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems]&lt;br /&gt;
# [http://portal.acm.org/citation.cfm?id=6514 Cache Coherence protocols: evaluation using a multiprocessor simulation model]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Xerox_Dragon Xerox Dragon]&lt;br /&gt;
# [http://thanaseto.110mb.com/courses/CSD-527-report-engl.pdf Coherence Protocols]&lt;br /&gt;
# [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&amp;amp;arnumber=289691 XDBus]&lt;br /&gt;
# [http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dai0228a/index.html ARM]&lt;br /&gt;
# [http://ixbtlabs.com/articles/ibmpower4/index.html IBM POWER4]&lt;br /&gt;
# Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems. Solihin, Yan&lt;br /&gt;
# [http://www.chiplist.com/Intel_Itanium_2_processor/tree3f-section--2235-/ Intel Itanium 2]&lt;br /&gt;
# [http://techreport.com/articles.x/8236/2 MESI-MESI-MOESI Banana-fana...]&lt;br /&gt;
# [http://rsim.cs.illinois.edu/rsim/Manual/node109.html RSIM]&lt;br /&gt;
# [http://dl.acm.org/citation.cfm?id=1499317 Synapse tightly coupled multiprocessors]&lt;br /&gt;
# [http://dictionary.reference.com/browse/instruction+prefetch prefetching]&lt;br /&gt;
# Yan Solihin, &amp;quot;Fundamentals of Parallel Computer Architecture,&amp;quot; Solihin Publishing &amp;amp; Consulting LLC, 2008, pp. 168-173.&lt;br /&gt;
# David F. Kotz and Carla Schlatter Ellis, &amp;quot;Prefetching in File Systems for MIMD Multiprocessors,&amp;quot; in IEEE Transactions on Parallel and Distributed Systems, Vol. 1, No. 2, April 1990, pp. 218-230.&lt;/div&gt;</summary>
		<author><name>Pxu</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8a_fu&amp;diff=59860</id>
		<title>CSC/ECE 506 Spring 2012/8a fu</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8a_fu&amp;diff=59860"/>
		<updated>2012-03-19T02:16:37Z</updated>

		<summary type="html">&lt;p&gt;Pxu: /* MESIF */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Introduction to bus-based cache coherence in real machines=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==SMP Architecture==&lt;br /&gt;
Most parallel software in the commercial market relies on the shared-memory programming model, in which all processors access the same physical address space. The most common multiprocessors today use the [http://en.wikipedia.org/wiki/Symmetric_multiprocessing SMP] architecture, which uses a common bus as the interconnect.  In the case of multicore processors (&amp;quot;chip multiprocessors,&amp;quot; or CMP), the SMP architecture applies to the cores, treating them as separate processors. The key problem of shared-memory multiprocessors is providing a consistent view of memory across the various cache hierarchies.  This is called the '''''cache coherence problem'''''. Cache coherence is a correctness-critical and performance-sensitive design point for supporting the shared-memory model. The cache coherence mechanisms not only govern communication in a shared-memory multiprocessor, but also typically determine how the memory system transfers data between processors, caches, and memory.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:Busbased SMP.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
At any point in logical time, the permissions for a cache block allow either a single writer or multiple readers. The '''''coherence protocol''''' ensures that the invariants of the states are maintained. The coherence states used by most cache coherence protocols are shown in ''Table 1'':&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''States'''&lt;br /&gt;
|  '''Access Type'''&lt;br /&gt;
|  '''Invariant'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  read, write&lt;br /&gt;
|  all other caches in I state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  read&lt;br /&gt;
|  all other caches in I state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Owned'''&lt;br /&gt;
|  read&lt;br /&gt;
|  all other caches in I or S state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  read&lt;br /&gt;
|  no other cache in M or E state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|  -&lt;br /&gt;
|  -&lt;br /&gt;
|}&lt;br /&gt;
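&lt;br /&gt;
As a minimal sketch of the invariants in ''Table 1'', the Python below checks a snapshot of one block's state across all caches. The function name ''is_coherent'' and the one-letter state encoding are assumptions made for illustration; they do not come from any protocol specification.&lt;br /&gt;
&lt;br /&gt;
```python
def is_coherent(states):
    """Check the Table 1 invariants for one block.

    `states` holds one letter per cache: M, O, E, S, or I.
    """
    valid = [s for s in states if s != 'I']
    # M or E: every other cache must be in the I state.
    if 'M' in valid or 'E' in valid:
        return len(valid) == 1
    # Otherwise only O and S remain: at most one cache may be the owner.
    return valid.count('O') in (0, 1)
```
&lt;br /&gt;
For example, one owner with two sharers passes the check, while a Modified copy next to a Shared copy does not.&lt;br /&gt;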
&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Snooping Protocols=&lt;br /&gt;
==MSI Protocol==&lt;br /&gt;
&lt;br /&gt;
'''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' is a three-state write-back '''invalidation protocol''' and one of the earliest snooping-based cache coherence protocols. It marks each cache line as '''Modified (M)''', '''Shared (S)''', or '''Invalid (I)'''. '''Invalid''' means the cache line is either not present or is in an invalid state. If the cache line is clean and may be shared by more than one processor, it is marked '''Shared'''. If the cache line is dirty and the processor has exclusive ownership of it, it is in the '''Modified''' state. A BusRdX causes the other caches to invalidate (demote) their copies to the '''I''' state; if the line is present in the '''M''' state in another cache, that cache flushes it. A write that hits a line in the '''S''' state still issues a BusRdX, after which the line is promoted (upgraded) to the '''M''' state.&lt;br /&gt;
&lt;br /&gt;
The state diagram of the MSI protocol is shown below. Note that the left part shows the response to processor-side requests, and the right part shows the response to snooper-side requests.[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
[[File:MSInew.jpg|600px|thumb|center|MSI state transition diagram]]&lt;br /&gt;
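&lt;br /&gt;
The transitions described above can be sketched as two lookup tables, one for processor-side requests and one for snooped bus transactions. This is an illustrative model under simplifying assumptions (a single block, an atomic bus, and a write hit in S issuing BusRdX as in the text); the names ''PROCESSOR'', ''SNOOPER'' and ''access'' are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
# Processor side: (state, request) maps to (next state, bus transaction).
PROCESSOR = {
    ('I', 'PrRd'): ('S', 'BusRd'),
    ('I', 'PrWr'): ('M', 'BusRdX'),
    ('S', 'PrRd'): ('S', None),
    ('S', 'PrWr'): ('M', 'BusRdX'),
    ('M', 'PrRd'): ('M', None),
    ('M', 'PrWr'): ('M', None),
}

# Snooper side: (state, snooped transaction) maps to (next state, action).
SNOOPER = {
    ('M', 'BusRd'): ('S', 'Flush'),
    ('M', 'BusRdX'): ('I', 'Flush'),
    ('S', 'BusRd'): ('S', None),
    ('S', 'BusRdX'): ('I', None),
    ('I', 'BusRd'): ('I', None),
    ('I', 'BusRdX'): ('I', None),
}


def access(states, cpu, request):
    """Apply one processor request to the per-cache states of a single block.

    Returns the bus transaction that was issued, or None on a silent hit.
    """
    states[cpu], bus = PROCESSOR[(states[cpu], request)]
    if bus is not None:
        # Every other cache snoops the transaction and reacts to it.
        for other in range(len(states)):
            if other != cpu:
                states[other], _ = SNOOPER[(states[other], bus)]
    return bus
```
&lt;br /&gt;
Starting from two caches in the I state, a read followed by a write on processor 0 issues a BusRd and then a BusRdX, that is, two bus transactions.&lt;br /&gt;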
&lt;br /&gt;
&lt;br /&gt;
===Real Architectures using MSI===&lt;br /&gt;
The '''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' protocol was first used in the '''[http://en.wikipedia.org/wiki/Silicon_Graphics SGI]''' IRIS 4D series. '''[http://en.wikipedia.org/wiki/Silicon_Graphics SGI]''' produced a broad range of '''[http://en.wikipedia.org/wiki/MIPS_architecture MIPS]'''-based (Microprocessor without Interlocked Pipeline Stages) workstations and servers during the 1990s, running '''[http://en.wikipedia.org/wiki/Silicon_Graphics SGI]''''s version of UNIX System V, now called '''[http://en.wikipedia.org/wiki/IRIX IRIX]'''. The 4D-MP graphics superworkstation delivered 40 MIPS (million instructions per second) of computing performance. This unprecedented level of computing and graphics processing in an office-environment workstation was made possible by the fastest available RISC microprocessors in a single shared-memory multiprocessor design driving a tightly coupled, highly parallel graphics system. Aggregate sustained data rates of over one gigabyte per second were achieved by a hierarchy of buses in a balanced system designed to avoid bottlenecks.&lt;br /&gt;
&lt;br /&gt;
The multiprocessor bus used in the 4D-MP graphics superworkstation is a pipelined, block-transfer bus that supports the cache coherence protocol and provides 64 megabytes per second of sustained data bandwidth between the processors, the memory and I/O system, and the graphics subsystem. Because the sync bus provides for efficient synchronization between processors, the cache coherence protocol was designed to support efficient data sharing between processors. If a cache coherence protocol has to support synchronization as well as sharing, a compromise in the efficiency of the data sharing protocol may be necessary to improve the efficiency of the synchronization operations. Hence the machine uses a simple cache coherence protocol, the '''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' protocol.&lt;br /&gt;
&lt;br /&gt;
With the simple rules of the '''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' protocol enforced by hardware, efficient synchronization and efficient data sharing are achieved in a simple shared-memory model of parallel processing in the 4D-MP graphics superworkstation.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
The MSI protocol has one serious drawback when executing read-then-write sequences in programs. Each read-then-write involves two bus transactions: a BusRd and a BusRdX. The second transaction is wasted bandwidth whenever no other cache holds the line. Therefore, most new machines do not implement the MSI protocol. Instead, they use variants of it, such as the MOSI and MESI protocols, to reduce the amount of traffic on the coherency interconnect.&lt;br /&gt;
&lt;br /&gt;
==Synapse protocol==&lt;br /&gt;
From the state transition diagram of MSI, we observe that there is a transition from state '''M''' to state '''S''' when a BusRd is observed for that block. The contents of the block are flushed to the bus before the line goes to the '''S''' state. In certain cases it would be more appropriate to move to the '''I''' state, thus giving up the block entirely. The choice between moving to '''S''' or '''I''' reflects the designer's assertion about whether the original processor is more likely to continue reading the block or the new processor is more likely to write to it. The Synapse protocol, used in the early Synapse multiprocessor, made the alternate choice of going directly from the '''M''' state to the '''I''' state on a BusRd, assuming the migratory pattern would be more frequent. More details about this protocol can be found in these papers published in the 1980s: [http://portal.acm.org/citation.cfm?id=6514 Cache Coherence protocols: evaluation using a multiprocessor simulation model] and [http://portal.acm.org/citation.cfm?id=1499317&amp;amp;dl=GUIDE&amp;amp;coll=GUIDE&amp;amp;CFID=83027384&amp;amp;CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems].&lt;br /&gt;
&lt;br /&gt;
In Synapse protocol '''M''' state is called '''D''' (Dirty) state. The following is the state transition diagram for Synapse protocol:&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:Synapse1.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
===Real Architectures using Synapse===&lt;br /&gt;
Nothing could be found on any actual architecture using the Synapse cache-coherence protocol.  Judging from the abstract of the paper written by Steve Frank and Armond Inselberg in 1984, &amp;quot;[http://dl.acm.org/citation.cfm?id=1499317 Synapse Tightly Coupled Multiprocessors: A New Approach to Solve Old Problems]&amp;quot;, Synapse appears to have been theoretical.&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The [http://www.disi.unige.it/person/DelzannoG/CacheProtocol/synapse.hy BABYLON Project] provides one example of a Synapse cache coherence protocol with atomic synchronization actions.&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
In the Synapse system, memory contention is reduced by designing the processor cache around a non-write-through algorithm that minimizes the bandwidth needed between cache and shared memory. The Synapse Expansion Bus includes an ownership-level protocol between processor caches.[[#References|&amp;lt;sup&amp;gt;[24]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
==[http://en.wikipedia.org/wiki/MESI_protocol MESI]==&lt;br /&gt;
The main drawback of MSI is that each read-then-write sequence incurs two bus transactions, irrespective of whether the cache line is stored in only one cache or not. Highly parallel programs that have little data sharing suffer the most from this.  The MESI protocol solves this problem by introducing the '''Exclusive''' state to distinguish between a cache line stored in multiple caches and a line stored in a single cache.&lt;br /&gt;
Let us briefly see how the MESI protocol works. For a more detailed version, refer to the Solihin textbook, p. 215.&lt;br /&gt;
&lt;br /&gt;
The MESI coherence protocol marks each cache line as being in one of the Modified, Exclusive, Shared, or Invalid states. &lt;br /&gt;
* '''Invalid''' : The cache line is either not present or is invalid&lt;br /&gt;
* '''Exclusive''' : The cache line is clean and is owned by this core/processor only&lt;br /&gt;
* '''Modified''' : The cache line is dirty and the core/processor has exclusive ownership of it; even the copy in memory is out of date.&lt;br /&gt;
* '''Shared''' : The cache line is clean and is shared by more than one core/processor&lt;br /&gt;
&lt;br /&gt;
The MESI protocol works as follows: &lt;br /&gt;
A line that is fetched receives the '''E''' or '''S''' state depending on whether it exists in the caches of other processors in the system. A cache line gets the '''M''' state when a processor writes to it; if the line is not in the '''E''' or '''M''' state prior to the write, the cache sends a Bus Upgrade (BusUpgr) signal or, as the Intel manuals term it, a “Read-For-Ownership (RFO) request”, which ensures that the line exists in this cache and is in the '''I''' state in all other processors on the bus (if any). The table below summarizes the MESI protocol.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''This cache line is valid?'''&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  No&lt;br /&gt;
|-&lt;br /&gt;
|  '''The memory copy is…'''&lt;br /&gt;
|  out of date&lt;br /&gt;
|  valid&lt;br /&gt;
|  valid&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies exist in caches of other processors?'''&lt;br /&gt;
|  No&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe&lt;br /&gt;
|  Maybe&lt;br /&gt;
|-&lt;br /&gt;
|  '''A write to this line'''&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  goes directly to bus&lt;br /&gt;
|}&lt;br /&gt;
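&lt;br /&gt;
The read and write rules summarized in the table can be sketched as two small functions. This is a simplified model that assumes a single block and ignores snooped transactions; the function names and the ''others_have_copy'' flag (standing in for the shared signal on the bus) are assumptions for illustration.&lt;br /&gt;
&lt;br /&gt;
```python
def mesi_read(state, others_have_copy):
    """Return (next state, bus transaction) for a processor read."""
    if state == 'I':
        # A fetched line becomes S if another cache holds it, otherwise E.
        return ('S' if others_have_copy else 'E', 'BusRd')
    return (state, None)


def mesi_write(state):
    """Return (next state, bus transaction) for a processor write."""
    if state in ('M', 'E'):
        return ('M', None)        # silent upgrade: nothing goes to the bus
    if state == 'S':
        return ('M', 'BusUpgr')   # invalidate the other sharers
    return ('M', 'BusRdX')        # I: fetch the line with ownership
```
&lt;br /&gt;
Under these assumptions, a read miss to an unshared line followed by a write issues only the initial BusRd, because the E to M upgrade is silent; this is the transaction that MSI would spend on a BusRdX.&lt;br /&gt;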
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
State transition diagram for MESI protocol is shown below.[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
[[File:MESInew.jpg|500px|thumb|center|MESI state transition diagram]] &lt;br /&gt;
&lt;br /&gt;
===Real Architectures using MESI===&lt;br /&gt;
&lt;br /&gt;
Intel's [http://en.wikipedia.org/wiki/Pentium_Pro Pentium Pro] microprocessor, introduced in 1995, was the first Intel architecture microprocessor to support symmetric multiprocessing in various multiprocessor configurations.  SMP and the MESI protocol were used consistently until the introduction of the 45-nm Hi-k Core microarchitecture in Intel's (Nehalem-EP) quad-core x86-64 processors. This microarchitecture uses a new system framework called the [http://en.wikipedia.org/wiki/Intel_QuickPath_Interconnect QuickPath Interconnect], which uses point-to-point interconnection technology based on a distributed shared-memory architecture.  It uses a modified version of the MESI protocol called [http://en.wikipedia.org/wiki/MESIF_protocol MESIF], which introduces the additional Forward state.  (The state diagram of MESI transitions that occur within the Pentium's data cache can be found on page 63 of [http://books.google.com/books?id=TVzjEZg1--YC&amp;amp;printsec=frontcover&amp;amp;source=gbs_ge_summary_r&amp;amp;cad=0#v=onepage&amp;amp;q&amp;amp;f=false Pentium Processor System Architecture: Second Edition] by Don Anderson, Tom Shanley, MindShare, Inc.)&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
The [http://www.arm.com/products/processors/classic/arm11/arm11-mpcore.php ARM11 MPCore] and [http://en.wikipedia.org/wiki/ARM_Cortex-A9_MPCore Cortex-A9 MPCore] processors support the MESI cache coherency protocol.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]  ARM MPCore defines the states of the MESI protocol it implements as:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies in other caches'''&lt;br /&gt;
|  NO&lt;br /&gt;
|  NO&lt;br /&gt;
|  YES&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Clean or Dirty'''&lt;br /&gt;
|  DIRTY&lt;br /&gt;
|  CLEAN&lt;br /&gt;
|  CLEAN&lt;br /&gt;
|  -&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
The coherency protocol is implemented and managed by the Snoop Control Unit (SCU) in the ARM MPCore, which monitors the traffic between the local L1 data caches and the next level of the memory hierarchy. At boot time, each core can choose whether or not to take part in the coherency domain.  Unless explicit system calls bind a task to a specific core ([http://en.wikipedia.org/wiki/Processor_affinity processor affinity]), there is a high chance that the task will at some point migrate to a different core, along with its data as it is used.  Task migration is not handled efficiently by a literal MESI implementation, so the ARM MPCore offers two optimizations that allow for both MESI compliance and efficient migration of tasks: '''Direct Data Intervention (DDI)''' (in which the SCU keeps a copy of the tag RAMs of all cores’ caches, which enables it to efficiently detect whether a cache line requested by one core is held by another core in the coherency domain before looking for it in the next level of the memory hierarchy) and '''Cache-to-cache Migration''' (in which, if the SCU finds that the cache line requested by one CPU is present in another core, it either copies it (if clean) or moves it (if dirty) from the other CPU directly into the requesting one, without interacting with external memory).  These optimizations reduce memory traffic in and out of the L1 cache subsystem by eliminating interaction with external memories, which in effect reduces the overall load on the interconnect and the overall power consumption.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
Other architectures that use the MESI cache-coherence protocol include the L2 cache of the IBM POWER4 processor[[#References|&amp;lt;sup&amp;gt;[20]&amp;lt;/sup&amp;gt;]], the L2 cache of the Intel Itanium 2 processor[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]], and the Intel Xeon[[#References|&amp;lt;sup&amp;gt;[22]&amp;lt;/sup&amp;gt;]].&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
One implementation complexity already mentioned is the inefficiency of task migration in MESI that the ARM MPCore addresses.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]  Another possible implementation complexity is found during the replacement of a cache line: one possible MESI implementation requires a message to be sent to main memory when a cache line is flushed (i.e. an E to I transition), as the line was exclusively in one cache before it was removed. It is possible to avoid this replacement message if the system is designed so that the flush of a modified (exclusive) line requires an acknowledgment from main memory. However, this requires the flush to be stored in a 'write-back' buffer until the reply arrives (to ensure the change is successfully propagated to memory).[[#References|&amp;lt;sup&amp;gt;[23]&amp;lt;/sup&amp;gt;]]&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==MESIF==&lt;br /&gt;
Let us now walk through a briefing on the '''MESIF protocol''':&lt;br /&gt;
&lt;br /&gt;
The '''MESIF''' protocol, used in the latest Intel multi-core processors, was introduced to '''accommodate the point-to-point''' links used in the QuickPath Interconnect. Using the '''MESI''' protocol in this architecture would send many redundant messages between different processors, often with unnecessarily high latency. For example, when a processor requests a cache line that is stored in multiple locations, every location might respond with the data. As the requesting processor only needs a single copy of the data, the system would be wasting bandwidth. &lt;br /&gt;
As a solution to this problem, an additional state, the '''Forward state''', was added by slightly changing the role of the Shared state. Whenever there is a read request, only the cache holding the line in the F state responds to the request, while all the caches holding it in the S state remain dormant.  Hence, by designating a single cache line to '''respond to requests''', coherency traffic is substantially reduced when multiple copies of the data exist. Also, on a read request, the responding copy transitions from the F state to the S state. That is, when a cache line in the '''F''' state is '''copied''', the F state '''migrates''' to the '''newer copy''', while the '''older''' one drops back to '''S'''. Moving the new copy to the F state '''exploits''' both '''temporal and spatial locality'''. Because the newest copy of the cache line is always in the F state, it is very unlikely that the line in the F state will be evicted from the caches. This takes advantage of the temporal locality of the request. The second advantage is that if a particular cache line is in high demand due to spatial locality, the bandwidth used to transmit that data will be spread across several nodes.&lt;br /&gt;
All M to S and E to S state transitions will now be from '''M to F''' and '''E to F'''.  &lt;br /&gt;
The '''F state''' is '''different''' from the '''Owned state''' of the MOESI protocol in that it is '''not''' a unique copy, because a valid copy is stored in memory. Thus, unlike the Owned state of the MOESI protocol, in which the copy in memory may be out of date and the line must therefore be written back before it is evicted, the data in the F state can be evicted or converted to the S state, if desired. &lt;br /&gt;
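&lt;br /&gt;
The way the Forward state designates a single responder can be sketched as follows. This is a simplified illustration that assumes the block is clean and that the other caches hold it only in the F, S, or I state (the M and E cases are left out); the function name ''mesif_read'' is hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
def mesif_read(states, requester):
    """Serve a read miss for one block; `states` lists the per-cache MESIF states.

    Returns the index of the cache that supplied the data, or None if memory did.
    """
    responder = None
    for cpu, state in enumerate(states):
        if state == 'F':
            responder = cpu      # only the F copy answers; S copies stay silent
            states[cpu] = 'S'    # the older copy drops back to S
    # The newest copy becomes the forwarder, or E if nobody else holds the line.
    shared = any(s == 'S' for s in states)
    states[requester] = 'F' if shared else 'E'
    return responder
```
&lt;br /&gt;
With the states F, S, I and a read from the third cache, only the first cache responds, and the result is S, S, F.&lt;br /&gt;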
&lt;br /&gt;
More information on the QuickPath Interconnect and MESIF protocol can be found at&lt;br /&gt;
'''[http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]'''&lt;br /&gt;
&lt;br /&gt;
===Real Architectures using MESIF===&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
&lt;br /&gt;
==MOESI==&lt;br /&gt;
[http://en.wikipedia.org/wiki/Opteron AMD Opteron] was AMD’s first-generation dual-core processor, which had 2 distinct [http://en.wikipedia.org/wiki/Athlon_64 K8 cores] together on a single die.  Cache coherence produces bigger problems on such multiprocessors, and it was necessary to use an appropriate coherence protocol to address them. The [http://en.wikipedia.org/wiki/Xeon Intel Xeon], which was the competing product from Intel, used the MESI protocol to handle cache coherence.  MESI came with the drawback of using much time and bandwidth in certain situations. &lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/MOESI_protocol MOESI] was AMD’s answer to this problem. MOESI added a fifth state to the MESI protocol called '''“Owned”'''. MOESI addresses the bandwidth problem faced in the MESI protocol when a processor holding an invalid copy of the data in its cache wants to modify the data.  Under MESI, the processor seeking access to the data has to wait for the processor that modified it to write it back to main memory, which takes time and bandwidth. This drawback is removed in MOESI by allowing dirty sharing.  When the data is held by a processor in the new state '''“Owned”''', it can provide other processors the modified data without, or even before, writing it to main memory. This is called '''''dirty sharing'''''. The processor holding the data in the '''&amp;quot;Owned&amp;quot;''' state stays responsible for updating main memory later, when the cache line is evicted.&lt;br /&gt;
&lt;br /&gt;
MOESI has become one of the most popular snoop-based protocols supported in the AMD64 architecture.  The AMD dual-core Opteron can maintain cache coherence in systems up to 8 processors using this protocol.&lt;br /&gt;
&lt;br /&gt;
The five different states of the MOESI protocol are:&lt;br /&gt;
* '''Modified (M)''' : The most recent copy of the data is present in the cache line. But it is not present in any other processor cache.&lt;br /&gt;
* '''Owned (O)'''   : The cache line has the most recent correct copy of the data. This can be shared by other processors. The processor holding the cache line in this state is responsible for updating main memory with the correct value when the line is evicted.  &lt;br /&gt;
* '''Exclusive (E)''' : A cache line holds the most recent, correct copy of the data, which is exclusively present on this processor and a copy is present in the main memory.  &lt;br /&gt;
* '''Shared (S)''' : A cache line in the shared state holds the most recent, correct copy of the data, which may be shared by other processors. &lt;br /&gt;
* '''Invalid (I)''' : A cache line does not hold a valid copy of the data.&lt;br /&gt;
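&lt;br /&gt;
Dirty sharing can be sketched as the reaction of a cache that snoops a read request from another processor. This is an illustrative model only; the function names are hypothetical, and the second return value reports only whether the cache supplies dirty data.&lt;br /&gt;
&lt;br /&gt;
```python
def moesi_snoop_read(state):
    """Return (next state, supplies dirty data) when a cache snoops a read request."""
    if state in ('M', 'O'):
        # Dirty sharing: answer from the cache and stay responsible for the write-back.
        return ('O', True)
    if state == 'E':
        return ('S', False)
    return (state, False)


def moesi_evict(state):
    """Return True if evicting a line in `state` must write it back to memory."""
    return state in ('M', 'O')
```
&lt;br /&gt;
A Modified line that is read by another processor moves to Owned without any write to main memory; the write-back is deferred until the owner evicts the line.&lt;br /&gt;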
&lt;br /&gt;
A detailed explanation of this protocol implementation on AMD processor can be found in the manual [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of the AMD 64-bit core]&lt;br /&gt;
&lt;br /&gt;
The following table summarizes the MOESI protocol:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Owned'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''This cache line is valid?'''&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  No&lt;br /&gt;
|-&lt;br /&gt;
|  '''The memory copy is…'''&lt;br /&gt;
|  out of date&lt;br /&gt;
|  out of date&lt;br /&gt;
|  valid&lt;br /&gt;
|  valid, unless another cache holds the line in the Owned state&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies exist in caches of other processors?'''&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe&lt;br /&gt;
|  Maybe&lt;br /&gt;
|-&lt;br /&gt;
|  '''A write to this line'''&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  goes directly to bus&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
State transition diagram for MOESI protocol is shown below. The top part shows the response to processor-side requests and the bottom part is the response to snooper-side requests. [[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:MOESI.jpg|500px|thumb|center|MOESI state transition diagram]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Real Architectures using MOESI===&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
&lt;br /&gt;
When the MOESI protocol is implemented on a real architecture like the AMD K10 series, some modifications and optimizations are made to the protocol that allow more efficient operation for some specific program patterns. For example, the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] family of microprocessors (Family 0x10), which is AMD’s first generation to incorporate 4 distinct cores on a single die and the first to have a cache that all the cores share, uses the MOESI protocol with some optimization techniques incorporated. &lt;br /&gt;
&lt;br /&gt;
The optimization focuses on a small subset of compute problems that behave like producer and consumer programs. In such a problem, a thread running on one core produces data, which is consumed by a thread running on a separate core. With such programs, it is desirable to get the two distinct cores to communicate through the shared cache, to avoid round trips to and from main memory. The '''MOESI''' protocol that the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] cache uses for cache coherence can also limit bandwidth. Hence, by keeping the cache line in the '''‘M’''' state for such computing problems, we can achieve better performance.&lt;br /&gt;
&lt;br /&gt;
When the producer thread writes a new entry, it allocates cache lines in the '''modified (M)''' state. Eventually, these M-marked cache lines start to fill the L3 cache. When the consumer reads a cache line, the MOESI protocol changes the state of that line to '''owned (O)''' in the L3 cache and pulls down a '''shared (S)''' copy for the consumer's own use. The producer thread then circles the ring buffer and arrives back at the same cache line it had previously written. However, when the producer attempts to write new data to the owned (marked '''‘O’''') cache line, it finds that it cannot, since a cache line marked '''‘O’''' by the previous consumer read does not have sufficient permission for a write request (in the MOESI protocol). To maintain coherence, the memory controller must initiate probes in the other caches (to handle any other S copies that may exist). This slows down the process.&lt;br /&gt;
&lt;br /&gt;
Thus, it is preferable to keep the cache line in the '''‘M’''' state in the L3 cache. In such a situation, when the producer comes back around the ring buffer, it finds the previously written cache line still marked '''‘M’''', to which it is safe to write without coherence concerns. Thus better performance can be achieved by such optimization techniques to standard protocols when implemented in real machines.&lt;br /&gt;
&lt;br /&gt;
More information on how this is implemented, along with various other optimizations, can be found in the manual [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors].&lt;br /&gt;
&lt;br /&gt;
==Dragon Protocol==&lt;br /&gt;
The '''[http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]''' is an update-based coherence protocol that, unlike the coherence protocols we have seen so far, does not invalidate other cached copies. Write propagation is achieved by updating the cached copies instead of invalidating them.  The Dragon Protocol does not update memory on a cache-to-cache transfer; it delays bringing memory into consistency with the cache until the data is evicted and written back, which saves time and lowers the memory access requirements. Moreover, only the written '''byte''' or '''word''' is '''communicated''' to the '''other caches''' instead of the whole block, which further '''reduces''' the '''bandwidth''' usage. It can dynamically detect the sharing status of a block, using a write-through policy for shared blocks and write-back for currently non-shared blocks. The Dragon Protocol employs the following four states for the cache blocks: '''Shared Clean''', '''Shared Modified''', '''Exclusive''' and  '''Modified'''. &lt;br /&gt;
* '''Modified (M)''' and '''Exclusive (E)''' - these states have the same meaning as explained in the protocols above. &lt;br /&gt;
* '''Shared Modified (Sm)''' - Only one cache in the system can hold the block in the Shared Modified state. Potentially two or more caches have this block, memory may or may not be up to date, and this processor's cache has modified the block.&lt;br /&gt;
* '''Shared Clean (Sc)''' -  Potentially two or more caches have this block and memory may or may not be up to date (if no other cache has it in the Sm state, memory is up to date; otherwise it is not).&lt;br /&gt;
Only when a Shared Modified line is evicted from the cache on a cache miss is the block written back to main memory in order to keep memory consistent. For more information on the Dragon protocol, refer to the Solihin textbook, page 229. The state transition diagram is given below for reference.[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
[[File:DragonNew.jpg|500px|thumb|center|Dragon state transition diagram]] &lt;br /&gt;
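&lt;br /&gt;
The update-based write path can be sketched as below. This is a simplified model under stated assumptions: one value per block instead of individual words, ''None'' marking a cache that does not hold the block, and a writer that already caches the block. The function name ''dragon_write'' is hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
def dragon_write(states, values, cpu, new_value):
    """Write `new_value` from `cpu` to a block it already caches.

    `states` and `values` are per-cache lists; returns the bus transaction issued.
    """
    values[cpu] = new_value
    if states[cpu] in ('E', 'M'):
        states[cpu] = 'M'        # sole copy: no bus traffic is needed
        return None
    # Sc or Sm: broadcast the write so other copies are updated, not invalidated.
    sharers = [c for c in range(len(states))
               if c != cpu and states[c] in ('Sc', 'Sm')]
    for c in sharers:
        values[c] = new_value
        states[c] = 'Sc'         # a previous Sm holder hands ownership to the writer
    states[cpu] = 'Sm' if sharers else 'M'
    return 'BusUpd'
```
&lt;br /&gt;
Note that main memory never appears in this sketch: the write reaches the other caches, while memory is only brought up to date when the Sm or M copy is eventually evicted.&lt;br /&gt;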
&lt;br /&gt;
The Dragon protocol implements snoopy caches that provide the appearance of a uniform memory space to multiple processors. Here, each cache listens to 2 buses: the processor bus and the memory bus. The caches are also responsible for address translation, so the processor bus carries virtual addresses and the memory bus carries physical addresses.  The Dragon system was designed to support 4 to 8 Dragon processors.&lt;br /&gt;
&lt;br /&gt;
=[http://en.wikipedia.org/wiki/Instruction_prefetch Instruction Prefetching]=&lt;br /&gt;
&lt;br /&gt;
In standard [http://en.wikipedia.org/wiki/Pipeline_(computing) CPU pipelining], the optimization of instruction prefetching is used to minimize the time the CPU spends waiting on instructions being fetched from main memory.  Prefetching works perfectly on completely sequential programs; but on programs with branches, other optimizations (like [http://en.wikipedia.org/wiki/Branch_predictor branch prediction]) have to be implemented, and the cost of branching the wrong way has to be considered and optimized.[[#References|&amp;lt;sup&amp;gt;[26]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
Source: Yan Solihin, &amp;quot;Fundamentals of Parallel Computer Architecture,&amp;quot; Solihin Publishing &amp;amp; Consulting LLC, 2008, pp. 168-173.&lt;br /&gt;
&lt;br /&gt;
Solihin states that many processors rely on software to issue the special prefetch instruction.  He also mentions that effective prefetching is characterized by '''coverage''' (the fraction of original cache misses that prefetching turns into cache hits), '''accuracy''' (the fraction of prefetches that result in cache hits), and '''timeliness''' (how early the prefetch arrives).  [http://en.wikipedia.org/wiki/Prefetch_buffer Prefetch buffers] help prefetching obtain the desired characteristics.  [http://publib.boulder.ibm.com/infocenter/db2luw/v8/index.jsp?topic=/com.ibm.db2.udb.doc/admin/c0005397.htm Sequential prefetching] detects and prefetches for accesses to contiguous locations, while [http://www.ics.uci.edu/~amrm/slides/amrm_structure/pta/tsld049.htm stride prefetching] detects and prefetches accesses that are s cache blocks apart between consecutive accesses.&lt;br /&gt;
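&lt;br /&gt;
As a small worked illustration of these definitions, the sketch below computes coverage and accuracy from three counters and detects a constant stride in a sequence of block addresses. The function names and the counter names are assumptions made for this example, not taken from the textbook.&lt;br /&gt;
&lt;br /&gt;
```python
def prefetch_metrics(original_misses, prefetches_issued, useful_prefetches):
    """Return (coverage, accuracy) for a prefetcher.

    coverage: fraction of the original misses that prefetching turned into hits.
    accuracy: fraction of the issued prefetches that produced a cache hit.
    """
    coverage = useful_prefetches / original_misses
    accuracy = useful_prefetches / prefetches_issued
    return coverage, accuracy


def detect_stride(block_addresses):
    """Return the constant stride s between consecutive block accesses, or None."""
    strides = {b - a for a, b in zip(block_addresses, block_addresses[1:])}
    return strides.pop() if len(strides) == 1 else None
```
&lt;br /&gt;
With 100 original misses, 80 prefetches issued and 60 of them useful, coverage is 0.6 and accuracy is 0.75. Sequential prefetching corresponds to a detected stride of 1.&lt;br /&gt;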
&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
Source: David F. Kotz and Carla Schlatter Ellis, &amp;quot;Prefetching in File Systems for MIMD Multiprocessors,&amp;quot; in IEEE Transactions on Parallel and Distributed Systems, Vol. 1, No. 2,  April 1990, pp. 218-230.&lt;br /&gt;
&lt;br /&gt;
Prefetching brings additional blocks into the cache that surround the blocks currently being accessed by the processor. Kotz and Ellis state that simply increasing the hit ratio does not automatically improve overall performance in multiprocessor computation.  Prefetching can actually produce stalls on individual CPUs of multiprocessor systems, as the hit ratio benefit applies only to the thread currently benefiting from the prefetch, which then has to wait for other threads to catch up.  As one approach to implementing prefetching within multiprocessor systems, Kotz and Ellis suggest overlapping prefetching decision-making time with user process time as much as possible, to minimize the impact on the overall execution sequence.&lt;br /&gt;
&lt;br /&gt;
= CMP Implementation in Intel Architecture =&lt;br /&gt;
&lt;br /&gt;
Let us now see how Intel architecture using the MESI protocol progressed from a uniprocessor architecture to a Chip MultiProcessor (CMP) using the bus as the interconnect. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Uniprocessor Architecture'''&lt;br /&gt;
&lt;br /&gt;
The diagram below shows the structure of the memory cluster in Intel Pentium M processor.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:intel_cache1.jpg]]&amp;lt;/center&amp;gt; &amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In this structure we have,&lt;br /&gt;
* A unified on-chip '''L1 cache''' with the '''processor/core''',&lt;br /&gt;
* A '''Memory/L2 access control unit''', through which all the accesses to the L2 cache, main memory and IO space are made,&lt;br /&gt;
* The second level '''L2 cache''' along with the '''prefetch unit''' and&lt;br /&gt;
* '''Front side bus (FSB)''', a single shared bi-directional bus through which all the traffic is sent across. These wide buses bring in multiple data bytes at a time. &lt;br /&gt;
&lt;br /&gt;
As Intel explains it, using this structure, the processor requests were first sought in the '''L2 cache''' and only on a '''miss''' were they '''forwarded''' to the main '''memory''' via the front side bus ('''FSB'''). The '''Memory/L2 access control''' unit served as a central point for '''maintaining coherence''' within the core and with the external world. It '''contains''' a '''snoop control unit''' that receives snoop requests from the bus and performs the required operations on each cache (and internal buffers) in parallel. It also handles RFO requests (BusUpgr) and ensures the operation continues only after it guarantees that no other version of the cache line exists in any other cache in the system.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''CMP Architecture'''&lt;br /&gt;
&lt;br /&gt;
For the CMP implementation, Intel chose the bus-based architecture using snoopy protocols over the '''directory protocol''' because, although a directory protocol reduces active power through reduced snoop activity, it '''increases''' the '''design complexity''' and the '''static power''' due to larger tag arrays. Since Intel has a large market for processors in the mobility family, a directory-based solution was less favorable, as battery life depends mainly on static power consumption and less on dynamic power.&lt;br /&gt;
Let us examine how '''CMP''' was implemented in '''Intel Core Duo''', which was one of the first dual-core processors for the budget/entry-level market. &lt;br /&gt;
The general CMP implementation structure of the Intel Core Duo is shown below:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:intel_cache2.jpg]]&amp;lt;/center&amp;gt; &amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This structure has the following changes when compared to the uniprocessor memory cluster structure. &lt;br /&gt;
* '''L1 cache''' and the '''processor/core''' structure is '''duplicated''' to give 2 cores.&lt;br /&gt;
* The '''Memory/L2 access control''' unit is '''split''' into 2 logical units: '''L2 controller''' and '''bus controller'''. The L2 controller handles all '''requests to the L2''' cache from the core and the snoop requests from the FSB. The '''bus controller''' handles '''data and I/O requests''' to and from the FSB.&lt;br /&gt;
* The '''prefetching''' unit is extended to handle the hardware '''prefetches for each core separately'''.&lt;br /&gt;
* A '''new logical unit''' (represented by the hexagon) was added to maintain '''fairness between the requests''' coming from the different cores and hence balance the requests to L2 and memory.&lt;br /&gt;
&lt;br /&gt;
This new '''partitioned structure''' for the  memory/L2 access control unit '''enhanced''' the '''performance''' while '''reducing power consumption'''. &lt;br /&gt;
For more information on uniprocessor and multiprocessor implementation under the Intel architecture, refer to &lt;br /&gt;
[http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]&lt;br /&gt;
&lt;br /&gt;
The '''Intel bus architecture''' has been '''evolving''' to accommodate the demands of scalability while using the same MESI protocol: from a '''single shared bus''', to '''dual independent buses (DIB)''' that double the available bandwidth, to the logical conclusion of DIB with the introduction of '''dedicated high-speed interconnects (DHSI)'''. The DHSI-based platforms use four FSBs, one for each processor in the platform. In both DIB and DHSI, a snoop filter was used in the chipset to cache snoop information, thereby significantly reducing the broadcasting needed for the snoop traffic on the buses. With the production of processors based on next generation 45-nm Hi-k Intel Core microarchitecture, the [http://en.wikipedia.org/wiki/Xeon Intel Xeon] processor fabric will transition from a DHSI, with the memory controller in the chipset, to a distributed shared memory architecture using '''Intel QuickPath Interconnects using MESIF protocol'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Word Invalidation=&lt;br /&gt;
One complexity problem applying to a number of the protocols deals with invalidation.  In newer protocols, individual words may be modified in a cache line, as opposed to the entirety of the line.  The other processors will thus have a mostly correct cache line, with only a word difference.  This leads to potential complexity, because to 'correct' the error the other processors must transition from shared to invalid, where they then can read the correct word from the bus and place it back into the line, transitioning back into shared.  This transition is potentially unnecessary, as the second processor may never access the specific word changed, but it may access other words in the cache line.  One potential solution to this is being researched at the present time, advancing the protocols so that a cache line is &amp;quot;not invalidated on the first dirty word, but after the number of dirty words crosses some predetermined value, which is data type and application dependent.&amp;quot;  In other words, if the application can possibly have multiple words 'incorrect,' several transitions to and from the invalid state may be avoided.&lt;br /&gt;
&amp;lt;br/&amp;gt;Source: http://tab.computer.org/tcca/NEWS/sept96/dsmideas.ps&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In essence, the solution proposed here is to advance the MOESI protocol with word invalidation and specific treatment of temporal and spatial data, so that the block is not invalidated.&lt;br /&gt;
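&lt;br /&gt;
The threshold idea quoted above can be sketched in Python as follows. This is only an illustration under stated assumptions: the class name, the `threshold` parameter and the rule that the line is invalidated when the count of stale words reaches the threshold are all hypothetical, and the sketch ignores how the stale words would later be refreshed.&lt;br /&gt;

```python
class ThresholdLine:
    """Toy shared cache line that tolerates a few remotely written words."""

    def __init__(self, words_per_line, threshold):
        self.stale = [False] * words_per_line  # True where a remote write made our word stale
        self.threshold = threshold             # hypothetical tuning knob, assumed to be at least 1
        self.valid = True

    def remote_write(self, word_index):
        """Another processor wrote one word of this line."""
        if not self.valid:
            return
        self.stale[word_index] = True
        # Invalidate the whole line only once enough distinct words are stale.
        if sum(self.stale) == self.threshold:
            self.valid = False

    def can_read(self, word_index):
        """A local read hits only if the line is valid and that word is not stale."""
        return self.valid and not self.stale[word_index]
```

With four words per line and a threshold of two, a single remote write leaves the line valid, so local reads of the three untouched words still hit; a remote write to a second word invalidates the line.&lt;br /&gt;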
&lt;br /&gt;
=Performance: MOESI vs MEI/MESI=&lt;br /&gt;
Early multiprocessors (such as the PowerPC processors) were designed to work with three states (Modified, Exclusive, Invalid - this is similar to the MSI protocol discussed earlier, the term MEI is used to remain consistent with the sources used for reference).  As time progressed, more multi-processors transitioned to the MESI protocol.  This is most likely due to the characteristics of MEI - it is &amp;quot;easy to implement but leads to inefficiencies in the way that memory bus bandwidth is used&amp;quot; [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 source].  As a result, systems with two or more MEI processors (most likely) will not see the 'full' potential of the extra processors, due to the inefficient bus transactions.&lt;br /&gt;
&lt;br /&gt;
In contrast, the five-state protocol MOESI is more complex to implement than MESI and MEI/MSI.  However, advantages arise from using it.  As Andy Keane (VP of Marketing for PMC-Sierra) put it, &amp;quot;this [fifth] state allows shared data that is dirty to remain in the cache.  Without this state, any shared line would be written back to memory to change the original state from modified. Since we have a dedicated, fast path from CPU to CPU, this state maximizes the use of this path rather than the path to memory, which is inherently slower&amp;quot; [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 source].  This advantage can be realized by allowing MOESI to make some processor-to-processor transfers from level 2 caches.&lt;br /&gt;
&lt;br /&gt;
Source: [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 Edwards, Chris (08/06/2001). &amp;quot;Architects wrestle with multiprocessor options&amp;quot;. Electronic engineering times (0192-1541), (1178), p. 48.]&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Cache_coherence Cache coherence]&lt;br /&gt;
# [http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]&lt;br /&gt;
# [http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Symmetric_multiprocessing Common System Interface in Intel Processors]&lt;br /&gt;
# [http://www.zak.ict.pwr.wroc.pl/nikodem/ak_materialy/Cache%20consistency%20&amp;amp;%20MESI.pdf Cache consistency with MESI on Intel processor]&lt;br /&gt;
# [http://techreport.com/articles.x/8236/2 AMD dual core Architecture]&lt;br /&gt;
# [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual]&lt;br /&gt;
# [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors]&lt;br /&gt;
# [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of AMD 64 bit core]&lt;br /&gt;
# [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&amp;amp;arnumber=4913 Silicon Graphics Computer Systems]&lt;br /&gt;
# [http://books.google.com/books?id=g82fofiqa5IC&amp;amp;printsec=frontcover&amp;amp;dq=Parallel+computer+architecture:+a+hardware/software+approach+By+David+E.+Culler,+Jaswinder+Pal+Singh,+Anoop+Gupta&amp;amp;source=bl&amp;amp;ots=COrdamlfVn&amp;amp;sig=YcugVqbzTjHvlofvaFq6Ft_tjfY&amp;amp;hl=en&amp;amp;ei=0ZO6S4TJGcOclgejzI3BBw&amp;amp;sa=X&amp;amp;oi=book_result&amp;amp;ct=result&amp;amp;resnum=1&amp;amp;ved=0CAgQ6AEwAA#v=onepage&amp;amp;q=&amp;amp;f=false Parallel computer architecture: a hardware/software approach By David E. Culler, Jaswinder Pal Singh, Anoop Gupta]&lt;br /&gt;
# [http://www.freepatentsonline.com/5283886.html Three state invalidation protocols]&lt;br /&gt;
# [http://portal.acm.org/citation.cfm?id=1499317&amp;amp;dl=GUIDE&amp;amp;coll=GUIDE&amp;amp;CFID=83027384&amp;amp;CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems]&lt;br /&gt;
# [http://portal.acm.org/citation.cfm?id=6514 Cache Coherence protocols: evaluation using a multiprocessor simulation model]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Xerox_Dragon Xerox Dragon]&lt;br /&gt;
# [http://thanaseto.110mb.com/courses/CSD-527-report-engl.pdf Coherence Protocols]&lt;br /&gt;
# [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&amp;amp;arnumber=289691 XDBus]&lt;br /&gt;
# [http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dai0228a/index.html ARM]&lt;br /&gt;
# [http://ixbtlabs.com/articles/ibmpower4/index.html IBM POWER4]&lt;br /&gt;
# Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems. Solihin, Yan&lt;br /&gt;
# [http://www.chiplist.com/Intel_Itanium_2_processor/tree3f-section--2235-/ Intel Itanium 2]&lt;br /&gt;
# [http://techreport.com/articles.x/8236/2 MESI-MESI-MOESI Banana-fana...]&lt;br /&gt;
# [http://rsim.cs.illinois.edu/rsim/Manual/node109.html RSIM]&lt;br /&gt;
# [http://dl.acm.org/citation.cfm?id=1499317 Synapse tightly coupled multiprocessors]&lt;br /&gt;
# [http://dictionary.reference.com/browse/instruction+prefetch prefetching]&lt;/div&gt;</summary>
		<author><name>Pxu</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8a_fu&amp;diff=59859</id>
		<title>CSC/ECE 506 Spring 2012/8a fu</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8a_fu&amp;diff=59859"/>
		<updated>2012-03-19T02:16:16Z</updated>

		<summary type="html">&lt;p&gt;Pxu: /* Implementation Complexities */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Introduction to bus-based cache coherence in real machines=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==SMP Architecture==&lt;br /&gt;
Most parallel software in the commercial market relies on the shared-memory programming model, in which all processors access the same physical address space. The most common multiprocessors today use the [http://en.wikipedia.org/wiki/Symmetric_multiprocessing SMP] architecture, which uses a common bus as the interconnect.  In the case of multicore processors (&amp;quot;chip multiprocessors,&amp;quot; or CMP), the SMP architecture applies to the cores, treating them as separate processors. The key problem of shared-memory multiprocessors is providing a consistent view of memory with various cache hierarchies.  This is called the '''''cache coherence problem'''''. Solving it is critical for correctness and is a performance-sensitive design point for supporting the shared-memory model. The cache coherence mechanisms not only govern communication in a shared-memory multiprocessor, but also typically determine how the memory system transfers data between processors, caches, and memory.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:Busbased SMP.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
At any point in logical time, the permissions for a cache block can allow either a single writer or multiple readers. The '''''coherence protocol''''' ensures the invariants of the states are maintained. The different coherent states used by most of the cache coherence protocols are as shown in ''Table 1'':&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''States'''&lt;br /&gt;
|  '''Access Type'''&lt;br /&gt;
|  '''Invariant'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  read, write&lt;br /&gt;
|  all other caches in I state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  read&lt;br /&gt;
|  all other caches in I state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Owned'''&lt;br /&gt;
|  read&lt;br /&gt;
|  all other caches in I or S state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  read&lt;br /&gt;
|  no other cache in M or E state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|  -&lt;br /&gt;
|  -&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Snooping Protocols=&lt;br /&gt;
==MSI Protocol==&lt;br /&gt;
&lt;br /&gt;
'''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' is a three-state write-back '''invalidation protocol''' and one of the earliest snooping-based cache coherence protocols. It marks each cache line as '''Modified (M)''', '''Shared (S)''' or '''Invalid (I)'''. '''Invalid''' means the cache line is either not present or is in the invalid state. If the cache line is clean and may be shared by more than one processor, it is marked '''Shared'''. If the cache line is dirty and the processor has exclusive ownership of it, it is in the '''Modified''' state. A BusRdX causes other caches to invalidate (demote) their copies to the '''I''' state; if the line is in '''M''' state in another cache, that cache flushes it. A write that hits a line in '''S''' state still issues a BusRdX, and the line is promoted (upgraded) to '''M''' state.&lt;br /&gt;
&lt;br /&gt;
The state diagram of the MSI protocol is shown below. Note that the left part shows the response to processor-side requests, and the right part shows the response to snooper-side requests.[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
[[File:MSInew.jpg|600px|thumb|center|MSI state transition diagram]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Real Architectures using MSI===&lt;br /&gt;
The '''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' protocol was first used in the '''[http://en.wikipedia.org/wiki/Silicon_Graphics SGI]''' IRIS 4D series. '''[http://en.wikipedia.org/wiki/Silicon_Graphics SGI]''' produced a broad range of '''[http://en.wikipedia.org/wiki/MIPS_architecture MIPS]'''-based (Microprocessor without Interlocked Pipeline Stages) workstations and servers during the 1990s, running '''[http://en.wikipedia.org/wiki/Silicon_Graphics SGI]''''s version of UNIX System V, now called '''[http://en.wikipedia.org/wiki/IRIX IRIX]'''. The 4D-MP graphics superworkstation brought 40 MIPS (million instructions per second) of computing performance to a graphics superworkstation. The unprecedented level of computing and graphics processing in an office-environment workstation was made possible by the fastest available RISC microprocessors in a single shared memory multiprocessor design driving a tightly coupled, highly parallel graphics system. Aggregate sustained data rates of over one gigabyte per second were achieved by a hierarchy of buses in a balanced system designed to avoid bottlenecks.&lt;br /&gt;
&lt;br /&gt;
The multiprocessor bus used in 4D-MP graphics superworkstation is a pipelined, block transfer bus that supports the cache coherence protocol as well as providing 64 megabytes of sustained data bandwidth between the processors, the memory and I/O system, and the graphics subsystem. Because the sync bus provides for efficient synchronization between processors, the cache coherence protocol was designed to support efficient data sharing between processors. If a cache coherence protocol has to support synchronization as well as sharing, a compromise in the efficiency of the data sharing protocol may be necessary to improve the efficiency of the synchronization operations. Hence it uses the simple cache coherence protocol which is the '''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' protocol.&lt;br /&gt;
&lt;br /&gt;
With the simple rules of the '''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' protocol enforced by hardware, efficient synchronization and efficient data sharing are achieved in a simple shared memory model of parallel processing in the 4D-MP graphics superworkstation.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
The MSI protocol has one serious drawback when executing read-then-write sequences in programs. Each read-then-write involves two bus transactions: a BusRd and a BusRdX. These operations waste a large amount of bandwidth. Therefore, most new machines do not implement the MSI protocol. Instead, they use variants of it, such as the MOSI and MESI protocols, to reduce the amount of traffic in the coherency interconnect.&lt;br /&gt;
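&lt;br /&gt;
The two bus transactions of a read-then-write sequence can be seen in a minimal Python sketch of the processor-side MSI transitions described in this section. The table and function names are hypothetical, and snooper-side transitions are deliberately left out.&lt;br /&gt;

```python
# Processor-side MSI transitions: (state, operation) maps to (next state, bus transaction).
MSI_PROCESSOR_SIDE = {
    ("I", "read"): ("S", "BusRd"),
    ("I", "write"): ("M", "BusRdX"),
    ("S", "read"): ("S", None),
    ("S", "write"): ("M", "BusRdX"),   # a write hit in S still goes to the bus
    ("M", "read"): ("M", None),
    ("M", "write"): ("M", None),
}


def run_msi(operations, state="I"):
    """Apply processor operations to one cache line and collect the bus transactions issued."""
    bus = []
    for op in operations:
        state, transaction = MSI_PROCESSOR_SIDE[(state, op)]
        if transaction is not None:
            bus.append(transaction)
    return state, bus
```

Calling `run_msi(["read", "write"])` on a line that starts in the I state ends in M and records a BusRd followed by a BusRdX, even when no other cache holds the line.&lt;br /&gt;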
&lt;br /&gt;
==Synapse protocol==&lt;br /&gt;
From the state transition diagram of MSI, we observe that there is a transition from state '''M''' to state '''S''' when a BusRd is observed for that block. The contents of the block are flushed to the bus before going to '''S''' state. In certain cases it would be more appropriate to move to '''I''' state, thus giving up the block entirely. The choice between '''S''' and '''I''' reflects the designer's assertion about whether the original processor is more likely to continue reading the block or the new processor is more likely to write to it. The Synapse protocol, used in the early Synapse multiprocessor, made the alternate choice of going directly from '''M''' state to '''I''' state on a BusRd, assuming the migratory pattern would be more frequent. More details about this protocol can be found in these papers published in the 1980s: [http://portal.acm.org/citation.cfm?id=6514 Cache Coherence protocols: evaluation using a multiprocessor simulation model] and [http://portal.acm.org/citation.cfm?id=1499317&amp;amp;dl=GUIDE&amp;amp;coll=GUIDE&amp;amp;CFID=83027384&amp;amp;CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems]&lt;br /&gt;
&lt;br /&gt;
In Synapse protocol '''M''' state is called '''D''' (Dirty) state. The following is the state transition diagram for Synapse protocol:&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:Synapse1.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
===Real Architectures using Synapse===&lt;br /&gt;
No information could be found on any actual architecture using the Synapse cache-coherence protocol.  Judging from the abstract of the 1984 paper by Steve Frank and Armond Inselberg, &amp;quot;[http://dl.acm.org/citation.cfm?id=1499317 Synapse Tightly Coupled Multiprocessors: A New Approach to Solve Old Problems]&amp;quot;, Synapse appears to have been theoretical.&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[http://www.disi.unige.it/person/DelzannoG/CacheProtocol/synapse.hy The BABYLON Project] provides one example of a Synapse cache coherence protocol with atomic synchronization actions.&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
In the Synapse system, memory contention is reduced by designing a processor cache that employs a non-write-through algorithm to minimize bandwidth between cache and shared memory. The Synapse Expansion Bus includes an ownership level protocol between processor caches.[[#References|&amp;lt;sup&amp;gt;[24]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
==[http://en.wikipedia.org/wiki/MESI_protocol MESI]==&lt;br /&gt;
The main drawback of MSI is that each read-write sequence incurs two bus transactions irrespective of whether the cache line is stored in only one cache or not. Highly parallel programs that have little data sharing suffer the most from this.  MESI protocol solves this problem by introducing the '''Exclusive''' state to distinguish between a cache line stored in multiple caches and a line stored in a single cache.&lt;br /&gt;
Let us briefly see how the MESI protocol works. For a more detailed version, refer to the Solihin textbook, pg. 215.&lt;br /&gt;
&lt;br /&gt;
The MESI coherence protocol marks each cache line as being in one of the Modified, Exclusive, Shared, or Invalid states. &lt;br /&gt;
* '''Invalid''' : The cache line is either not present or is invalid&lt;br /&gt;
* '''Exclusive''' : The cache line is clean and is owned by this core/processor only&lt;br /&gt;
* '''Modified''' : The cache line is dirty and the core/processor has exclusive ownership of the cache line, exclusive of the memory also.&lt;br /&gt;
* '''Shared''' : The cache line is clean and is shared by more than one core/processor&lt;br /&gt;
&lt;br /&gt;
The MESI protocol works as follows: &lt;br /&gt;
A line that is fetched receives the '''E''' or '''S''' state depending on whether it exists in other processors in the system. A cache line gets the '''M''' state when a processor writes to it; if the line is not in '''E''' or '''M''' state prior to the write, the cache sends a Bus Upgrade (BusUpgr) signal or, as the Intel manuals term it, a “Read-For-Ownership (RFO) request”, which ensures that the line exists in the cache and is in the '''I''' state in all other processors on the bus (if any). A table is shown below to summarize the MESI protocol.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''This cache line is valid?'''&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  No&lt;br /&gt;
|-&lt;br /&gt;
|  '''The memory copy is…'''&lt;br /&gt;
|  out of date&lt;br /&gt;
|  valid&lt;br /&gt;
|  valid&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies exist in caches of other processors?'''&lt;br /&gt;
|  No&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe&lt;br /&gt;
|  Maybe&lt;br /&gt;
|-&lt;br /&gt;
|  '''A write to this line'''&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  goes directly to bus&lt;br /&gt;
|}&lt;br /&gt;
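&lt;br /&gt;
The processor-side behaviour summarized in the table can be written as a short Python sketch. The function names and the `others_have_copy` flag are hypothetical, and the sketch uses BusRd, BusUpgr and BusRdX as the transaction names; it omits snooper-side transitions and data transfer.&lt;br /&gt;

```python
def mesi_processor_side(state, op, others_have_copy=False):
    """Return (next state, bus transaction) for a processor read or write under MESI."""
    if op == "read":
        if state == "I":
            # A fill gets S if another cache holds the line, otherwise E.
            return ("S" if others_have_copy else "E"), "BusRd"
        return state, None                 # M, E and S all hit without bus traffic
    if op == "write":
        if state in ("M", "E"):
            return "M", None               # E to M is a silent upgrade
        if state == "S":
            return "M", "BusUpgr"          # other copies must be invalidated
        return "M", "BusRdX"               # write miss from I
    raise ValueError("unknown operation: " + op)


def count_bus_transactions(ops, others_have_copy=False):
    """Run a sequence of operations on one line starting in I and count bus transactions."""
    state, count = "I", 0
    for op in ops:
        state, transaction = mesi_processor_side(state, op, others_have_copy)
        if transaction is not None:
            count += 1
    return state, count
```

For a read followed by a write, `count_bus_transactions(["read", "write"])` reports one bus transaction when no other cache holds the line, and two when `others_have_copy=True`, which is the saving the Exclusive state provides for unshared data.&lt;br /&gt;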
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
State transition diagram for MESI protocol is shown below.[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
[[File:MESInew.jpg|500px|thumb|center|MESI state transition diagram]] &lt;br /&gt;
&lt;br /&gt;
===Real Architectures using MESI===&lt;br /&gt;
&lt;br /&gt;
Intel's [http://en.wikipedia.org/wiki/Pentium_Pro Pentium Pro] microprocessor, introduced in 1995, was the first Intel architecture microprocessor to support symmetric multiprocessing in various multiprocessor configurations.  SMP and the MESI protocol were used consistently until the introduction of the 45-nm Hi-k Core microarchitecture in Intel's (Nehalem-EP) quad-core x86-64 processors. The 45-nm Hi-k Intel Core microarchitecture utilizes a new system framework called the [http://en.wikipedia.org/wiki/Intel_QuickPath_Interconnect QuickPath Interconnect], which uses point-to-point interconnection technology based on a distributed shared memory architecture.  It uses a modified version of the MESI protocol called [http://en.wikipedia.org/wiki/MESIF_protocol MESIF], which introduces the additional Forward state.  (The state diagram of MESI transitions that occur within the Pentium's data cache can be found on page 63 of [http://books.google.com/books?id=TVzjEZg1--YC&amp;amp;printsec=frontcover&amp;amp;source=gbs_ge_summary_r&amp;amp;cad=0#v=onepage&amp;amp;q&amp;amp;f=false Pentium Processor System Architecture: Second Edition] by Don Anderson, Tom Shanley, MindShare, Inc)&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
The [http://www.arm.com/products/processors/classic/arm11/arm11-mpcore.php ARM11 MPCore] and [http://en.wikipedia.org/wiki/ARM_Cortex-A9_MPCore Cortex-A9 MPCore] processors support the MESI cache coherency protocol.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]  ARM MPCore defines the states of the MESI protocol it implements as:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies in other caches'''&lt;br /&gt;
|  NO&lt;br /&gt;
|  NO&lt;br /&gt;
|  YES&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Clean or Dirty'''&lt;br /&gt;
|  DIRTY&lt;br /&gt;
|  CLEAN&lt;br /&gt;
|  CLEAN&lt;br /&gt;
|  -&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
The coherency protocol is implemented and managed by the Snoop Control Unit (SCU) in the ARM MPCore, which monitors the traffic between local L1 data caches and the next level of the memory hierarchy. At boot time, each core can choose to partake in the coherency domain or not.  Unless explicit system calls bind a task to a specific core ([http://en.wikipedia.org/wiki/Processor_affinity processor affinity]), there is a high chance that a task will at some point migrate to a different core, along with its data as it is used.  Migration of tasks is not handled efficiently by a literal MESI implementation, so the ARM MPCore offers two optimizations that allow for MESI compliance and migration of tasks: '''Direct Data Intervention (DDI)''' (in which the SCU keeps a copy of all cores' cache tag RAMs, which enables it to efficiently detect whether a cache line requested by a core is held by another core in the coherency domain before looking for it in the next level of the memory hierarchy) and '''Cache-to-cache Migration''' (where, if the SCU finds that the cache line requested by one CPU is present in another core, it will either copy it (if clean) or move it (if dirty) from the other CPU directly into the requesting one, without interacting with external memory).  These optimizations reduce memory traffic in and out of the L1 cache subsystem by eliminating interaction with external memories, which in effect reduces the overall load on the interconnect and the overall power consumption.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
Other architectures that use the MESI cache-coherence protocol include the L2 cache of the IBM POWER4 processor[[#References|&amp;lt;sup&amp;gt;[20]&amp;lt;/sup&amp;gt;]], the L2 cache of the Intel Itanium 2 processor[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]], and the Intel Xeon[[#References|&amp;lt;sup&amp;gt;[22]&amp;lt;/sup&amp;gt;]].&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
One implementation complexity already mentioned is the inefficiency of task migration in MESI that the ARM MPCore addresses.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]  Another possible implementation complexity is found during the replacement of a cache line: one possible MESI implementation requires a message to be sent to main memory when a cache line is flushed (i.e. an E to I transition), as the line was exclusively in one cache before it was removed. It is possible to avoid this replacement message if the system is designed so that the flush of a modified (exclusive) line requires an acknowledgment from main memory. However, this requires the flush to be stored in a 'write-back' buffer until the reply arrives (to ensure the change is successfully propagated to memory).[[#References|&amp;lt;sup&amp;gt;[23]&amp;lt;/sup&amp;gt;]]&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==MESIF==&lt;br /&gt;
Let us now walk through a brief overview of the '''MESIF protocol''':&lt;br /&gt;
&lt;br /&gt;
The '''MESIF''' protocol, used in the latest Intel multi-core processors, was introduced to '''accommodate the point-to-point''' links used in the QuickPath Interconnect. Using the '''MESI''' protocol in this architecture would send many redundant messages between different processors, often with unnecessarily high latency. For example, when a processor requests a cache line that is stored in multiple locations, every location might respond with the data. As the requesting processor only needs a single copy of the data, the system would be wasting bandwidth. &lt;br /&gt;
As a solution to this problem, an additional state, '''Forward state''', was added by slightly changing the role of the Shared state. Whenever there is a read request, only the cache line in the F state will respond to the request, while all the S state caches remain dormant.  Hence, by designating a single cache line to '''respond to requests''', coherency traffic is substantially reduced when multiple copies of the data exist. Also, on a read request, the F state transitions from F to S state. That is, when a cache line in the '''F''' state is '''copied''', the F state '''migrates''' to the '''newer copy''', while the '''older''' one drops back to '''S'''. Moving the new copy to the F state '''exploits''' both '''temporal and spatial locality'''. Because the newest copy of the cache line is always in the F state, it is very unlikely that the line in the F state will be evicted from the caches. This takes advantage of the temporal locality of the request. The second advantage is that if a particular cache line is in high demand due to spatial locality, the bandwidth used to transmit that data will be spread across several nodes.&lt;br /&gt;
All M to S and E to S state transitions now become '''M to F''' and '''E to F''' transitions.  &lt;br /&gt;
The '''F state''' is '''different''' from the '''Owned state''' of the MOESI protocol because a line in the F state is '''not''' the only up-to-date copy: a valid copy is also stored in memory. Thus, unlike a line in the Owned state, which is dirty with respect to memory and must be written back before it is discarded, a line in the F state can be evicted or converted to the S state, if desired. &lt;br /&gt;
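The forwarding behaviour described above can be sketched as a toy model. This is only a minimal Python sketch under simplifying assumptions, not Intel's implementation: the function name read, the cache names and the dictionary layout are hypothetical, only read requests are modelled, the line is assumed to be already shared (one F copy plus S copies), and M and E holders are left out.&lt;br /&gt;

```python
# Toy model of MESIF read handling for ONE cache line.
# All names are illustrative assumptions, not part of any real API.

def read(states, requester):
    """Serve a read miss from `requester`.

    `states` maps a cache name to "F", "S" or "I" for this line.
    Returns the name of the cache that supplies the data, or "memory".
    """
    forwarders = [name for name, state in states.items()
                  if state == "F" and name != requester]
    if forwarders:
        responder = forwarders[0]   # only the F copy answers; S copies stay silent
        states[responder] = "S"     # the older copy drops back to S
    else:
        responder = "memory"        # no forwarder left, so memory supplies the data
    states[requester] = "F"         # the newest copy takes over the F role
    return responder


states = {"P0": "F", "P1": "S", "P2": "I", "P3": "I"}
first = read(states, "P2")    # P0 forwards, then drops to S
second = read(states, "P3")   # P2, the newest copy, now forwards
print(first, second, states)
```

Each read is answered by exactly one cache no matter how many S copies exist, and the responder role moves from cache to cache, which is how the bandwidth for a popular line gets spread across nodes.&lt;br /&gt;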
&lt;br /&gt;
More information on the QuickPath Interconnect and MESIF protocol can be found at&lt;br /&gt;
'''[http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]'''&lt;br /&gt;
&lt;br /&gt;
===Real Architectures using MESIF===&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
&lt;br /&gt;
==MOESI==&lt;br /&gt;
The [http://en.wikipedia.org/wiki/Opteron AMD Opteron] was AMD’s first-generation dual-core processor, with two distinct [http://en.wikipedia.org/wiki/Athlon_64 K8 cores] on a single die.  Cache coherence becomes a bigger problem on such multiprocessors, so an appropriate coherence protocol was needed. The [http://en.wikipedia.org/wiki/Xeon Intel Xeon], the competing processor from Intel, used the MESI protocol to handle cache coherence.  MESI has the drawback of consuming considerable time and bandwidth in certain situations. &lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/MOESI_protocol MOESI] was AMD’s answer to this problem. MOESI adds a fifth state, '''“Owned”''', to the MESI protocol. It addresses the bandwidth problem that arises in MESI when a processor whose cached copy is invalid wants to access data that another processor has modified: the requesting processor has to wait for the modifying processor to write the data back to main memory, which takes time and bandwidth. MOESI removes this drawback by allowing dirty sharing.  When a processor holds the data in the new '''“Owned”''' state, it can provide the modified data to other processors without, or before, writing it to main memory. This is called '''''dirty sharing'''''. The processor holding the data in the '''&amp;quot;Owned&amp;quot;''' state remains responsible for updating main memory later, when the cache line is evicted.&lt;br /&gt;
&lt;br /&gt;
MOESI has become one of the most popular snoop-based protocols supported in the AMD64 architecture.  The AMD dual-core Opteron can maintain cache coherence in systems up to 8 processors using this protocol.&lt;br /&gt;
&lt;br /&gt;
The five different states of the MOESI protocol are:&lt;br /&gt;
* '''Modified (M)''' : The most recent copy of the data is present in the cache line. But it is not present in any other processor cache.&lt;br /&gt;
* '''Owned (O)'''   : The cache line has the most recent, correct copy of the data. It can be shared by other processors. The processor holding the line in this state is responsible for updating main memory with the correct value before the line is evicted.  &lt;br /&gt;
* '''Exclusive (E)''' : A cache line holds the most recent, correct copy of the data, which is exclusively present on this processor and a copy is present in the main memory.  &lt;br /&gt;
* '''Shared (S)''' : A cache line in the shared state holds the most recent, correct copy of the data, which may be shared by other processors. &lt;br /&gt;
* '''Invalid (I)''' : A cache line does not hold a valid copy of the data.&lt;br /&gt;
&lt;br /&gt;
A detailed explanation of the implementation of this protocol on AMD processors can be found in [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of the AMD 64-bit core].&lt;br /&gt;
&lt;br /&gt;
The following table summarizes the MOESI protocol:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
&lt;br /&gt;
|  '''Owned''' &lt;br /&gt;
&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|  '''This cache line is valid?'''&lt;br /&gt;
&lt;br /&gt;
|  Yes&lt;br /&gt;
&lt;br /&gt;
|  Yes&lt;br /&gt;
&lt;br /&gt;
|  Yes&lt;br /&gt;
&lt;br /&gt;
|  Yes&lt;br /&gt;
&lt;br /&gt;
|  No&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|  '''The memory copy is…'''&lt;br /&gt;
&lt;br /&gt;
|  out of date&lt;br /&gt;
&lt;br /&gt;
|  out of date&lt;br /&gt;
&lt;br /&gt;
|  valid&lt;br /&gt;
&lt;br /&gt;
|  valid&lt;br /&gt;
&lt;br /&gt;
|  -&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|  '''Copies exist in caches of other processors?'''&lt;br /&gt;
&lt;br /&gt;
|  No&lt;br /&gt;
&lt;br /&gt;
|  Maybe&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe&lt;br /&gt;
&lt;br /&gt;
|  Maybe&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|  '''A write to this line'''&lt;br /&gt;
&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
&lt;br /&gt;
|  goes directly to bus&lt;br /&gt;
&lt;br /&gt;
|}&lt;br /&gt;
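Dirty sharing, as described in this section, can be illustrated with a small model. The following is a minimal Python sketch, not AMD's implementation: the CacheLine class and the remote_read and evict functions are hypothetical names, a single cache line is modelled, and only the case where the holder has the line in M or O is covered.&lt;br /&gt;

```python
# Toy model of MOESI dirty sharing for one cache line.
# Class and function names are illustrative assumptions.

class CacheLine:
    def __init__(self, state="I", value=None):
        self.state = state
        self.value = value


def remote_read(holder, reader):
    """Another processor reads a line that `holder` has in M or O."""
    if holder.state in ("M", "O"):
        holder.state = "O"            # holder keeps the dirty line and owns it
        reader.value = holder.value   # cache-to-cache transfer, memory untouched
        reader.state = "S"
        return "cache"
    return "memory"                   # E, S and I holders are not modelled here


def evict(line, memory):
    """Evicting a dirty line (M or O) is what finally updates main memory."""
    if line.state in ("M", "O"):
        memory["value"] = line.value
    line.state = "I"
    line.value = None


memory = {"value": 0}        # stale copy in main memory
p0 = CacheLine("M", 42)      # P0 has modified the line
p1 = CacheLine()             # P1 does not hold the line yet

source = remote_read(p0, p1)
owner_state = p0.state
stale_during_sharing = memory["value"]   # memory is still out of date here
evict(p0, memory)
print(source, owner_state, p1.value, stale_during_sharing, memory["value"])
```

The reader gets the modified value straight from the owning cache while main memory stays stale; memory is only updated when the owner evicts the line.&lt;br /&gt;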
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The state transition diagram for the MOESI protocol is shown below. The top part shows the response to processor-side requests and the bottom part shows the response to snooper-side requests. [[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:MOESI.jpg|500px|thumb|center|MOESI state transition diagram]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Real Architectures using MOESI===&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
&lt;br /&gt;
When the MOESI protocol is implemented on a real architecture such as the AMD K10 series, some modifications and optimizations are made to the protocol to allow more efficient operation for specific program patterns. For example, the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] family of microprocessors (Family 0x10), AMD’s first generation to incorporate four distinct cores on a single die and the first to have a cache shared by all the cores, uses the MOESI protocol with some optimization techniques incorporated. &lt;br /&gt;
&lt;br /&gt;
The optimization focuses on a small subset of compute problems that behave like producer-consumer programs. In such a problem, a thread running on one core produces data that is consumed by a thread running on a separate core. With such programs, it is desirable for the two cores to communicate through the shared cache, avoiding round trips to and from main memory. However, the '''MOESI''' protocol that the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] cache uses for cache coherence can limit the achievable bandwidth in this pattern. Keeping the cache line in the '''‘M’''' state for such problems gives better performance.&lt;br /&gt;
&lt;br /&gt;
When the producer thread writes a new entry, it allocates cache lines in the '''modified (M)''' state. Eventually, these M-marked cache lines start to fill the L3 cache. When the consumer reads a cache line, the MOESI protocol changes the state of the line to '''owned (O)''' in the L3 cache and pulls down a '''shared (S)''' copy for the consumer's own use. The producer thread then circles the ring buffer and arrives back at the same cache line it had previously written. However, when the producer attempts to write new data to the cache line now marked '''‘O’''', it finds that it cannot, since a line marked '''‘O’''' by the previous consumer read does not have sufficient permission for a write request (in the MOESI protocol). To maintain coherence, the memory controller must initiate probes in the other caches (to handle any other S copies that may exist), which slows down the process.&lt;br /&gt;
&lt;br /&gt;
Thus, it is preferable to keep the cache line in the '''‘M’''' state in the L3 cache. In such a situation, when the producer comes back around the ring buffer, it finds the previously written cache line still marked '''‘M’''', to which it is safe to write without coherence concerns. Thus better performance can be achieved by such optimization techniques to standard protocols when implemented in real machines.&lt;br /&gt;
&lt;br /&gt;
You can find more information on how this is implemented and various other ways of optimizations in this manual [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors]&lt;br /&gt;
&lt;br /&gt;
==Dragon Protocol==&lt;br /&gt;
The '''[http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]''' is an update-based coherence protocol: unlike the coherence protocols seen so far, it does not invalidate other cached copies. Write propagation is achieved by updating the cached copies instead of invalidating them.  The Dragon Protocol does not update memory on a cache-to-cache transfer; it delays making memory consistent with the cache until the data is evicted and written back, which saves time and lowers memory access requirements. Moreover, only the written '''byte''' or '''word''' is '''communicated''' to the '''other caches''' instead of the whole block, which further '''reduces''' '''bandwidth''' usage. The protocol dynamically detects the sharing status of a block and uses a write-through policy for shared blocks and a write-back policy for blocks that are currently not shared. The Dragon Protocol employs the following four states for cache blocks: '''Shared Clean''', '''Shared Modified''', '''Exclusive''' and  '''Modified'''. &lt;br /&gt;
* '''Modified (M)''' and '''Exclusive (E)''' - these states have the same meaning as explained in the protocols above. &lt;br /&gt;
* '''Shared Modified (Sm)''' - Potentially two or more caches have this block, memory may or may not be up to date, and this processor's cache has modified the block. Only one cache in the system can hold a given block in the Shared Modified state.&lt;br /&gt;
* '''Shared Clean (Sc)''' -  Potentially two or more caches have this block and memory may or may not be up to date (if no other cache has it in Sm state, memory will be up to date else it is not).&lt;br /&gt;
Only when a Shared Modified line is evicted from the cache on a cache miss is the block written back to main memory, keeping memory consistent. For more information on the Dragon protocol, refer to the Solihin textbook, page 229. The state transition diagram is given below for reference.[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
[[File:DragonNew.jpg|500px|thumb|center|Dragon state transition diagram]] &lt;br /&gt;
&lt;br /&gt;
The Dragon system implements snoopy caches that provide the appearance of a uniform memory space to multiple processors. Here, each cache listens to two buses: the processor bus and the memory bus. The caches are also responsible for address translation, so the processor bus carries virtual addresses and the memory bus carries physical addresses.  The Dragon system was designed to support 4 to 8 Dragon processors.&lt;br /&gt;
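The update-based write propagation of the Dragon protocol can be sketched as follows. This is a minimal Python sketch under stated assumptions rather than the Xerox design: the function processor_write and the dictionary layout are hypothetical, the writer is assumed to already hold the block (a write hit), and sharing is detected by inspecting the other caches directly instead of through a shared signal on the bus.&lt;br /&gt;

```python
# Toy model of a Dragon-style write hit on one cache block.
# Names and data layout are illustrative assumptions.

def processor_write(caches, writer, index, value):
    """Write one word of the block; return the number of words put on the bus.

    `caches` maps a cache name to {"state": ..., "data": [...]}, where state is
    "E", "M", "Sc", "Sm", or None when the block is not present in that cache.
    """
    sharers = [name for name, cache in caches.items()
               if name != writer and cache["state"] is not None]
    caches[writer]["data"][index] = value
    if not sharers:
        caches[writer]["state"] = "M"          # unshared block: no bus traffic
        return 0
    for name in sharers:
        caches[name]["data"][index] = value    # update the copy, do not invalidate it
        caches[name]["state"] = "Sc"
    caches[writer]["state"] = "Sm"             # the writer now owns the block
    return 1                                   # only the written word is communicated


caches = {
    "P0": {"state": "Sc", "data": [1, 2, 3, 4]},
    "P1": {"state": "Sm", "data": [1, 2, 3, 4]},
    "P2": {"state": None, "data": None},
}
sent = processor_write(caches, "P0", 2, 99)

private = {
    "P0": {"state": "E", "data": [0, 0]},
    "P1": {"state": None, "data": None},
}
sent_private = processor_write(private, "P0", 0, 7)
print(sent, caches["P1"]["data"], sent_private, private["P0"]["state"])
```

In the shared case a single word crosses the bus and the other copy stays valid; in the private case the write stays local and the block simply becomes Modified.&lt;br /&gt;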
&lt;br /&gt;
=[http://en.wikipedia.org/wiki/Instruction_prefetch Instruction Prefetching]=&lt;br /&gt;
&lt;br /&gt;
In standard [http://en.wikipedia.org/wiki/Pipeline_(computing) CPU pipelining], the optimization of instruction prefetching is used to minimize the time the CPU spends waiting on instructions being fetched from main memory.  Prefetching works perfectly on completely sequential programs; but on programs with branches, other optimizations (like [http://en.wikipedia.org/wiki/Branch_predictor branch prediction]) have to be implemented, and the cost of branching the wrong way has to be considered and optimized.[[#References|&amp;lt;sup&amp;gt;[26]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
SOURCE : Yan Solihin, &amp;quot;Fundamentals of Parallel Computer Architecture,&amp;quot; Solihin Publishing &amp;amp; Consulting LLC, 2008, pp. 168-173.&lt;br /&gt;
&lt;br /&gt;
Solihin states that many processors rely on software to issue a special prefetch instruction.  He also mentions that effective prefetching is characterized by '''coverage''' (the fraction of original cache misses that prefetching turns into cache hits), '''accuracy''' (the fraction of prefetches that result in cache hits), and '''timeliness''' (how early the prefetch arrives).  [http://en.wikipedia.org/wiki/Prefetch_buffer Prefetch buffers] help prefetching obtain the desired characteristics.  [http://publib.boulder.ibm.com/infocenter/db2luw/v8/index.jsp?topic=/com.ibm.db2.udb.doc/admin/c0005397.htm Sequential prefetching] detects and prefetches accesses to contiguous locations, while [http://www.ics.uci.edu/~amrm/slides/amrm_structure/pta/tsld049.htm stride prefetching] detects and prefetches accesses that are ''s'' cache blocks apart between consecutive accesses.&lt;br /&gt;
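The coverage and accuracy definitions, and the difference between sequential and stride access patterns, can be made concrete with a short calculation. This is a minimal Python sketch; the function names and the example counts are made up for illustration and do not come from the cited text.&lt;br /&gt;

```python
# Prefetch quality metrics and a simple stride check.
# Function names and the numbers below are illustrative assumptions.

def prefetch_metrics(baseline_misses, useful_prefetches, issued_prefetches):
    """Return (coverage, accuracy).

    coverage: fraction of the original misses that prefetching turned into hits
    accuracy: fraction of issued prefetches that produced a cache hit
    """
    coverage = useful_prefetches / baseline_misses
    accuracy = useful_prefetches / issued_prefetches
    return coverage, accuracy


def detect_stride(block_addresses):
    """Return the constant stride between consecutive accesses, or None."""
    strides = {b - a for a, b in zip(block_addresses, block_addresses[1:])}
    return strides.pop() if len(strides) == 1 else None


coverage, accuracy = prefetch_metrics(
    baseline_misses=200, useful_prefetches=120, issued_prefetches=150)
sequential = detect_stride([7, 8, 9, 10])     # contiguous blocks, stride 1
strided = detect_stride([10, 14, 18, 22])     # blocks 4 apart
irregular = detect_stride([3, 9, 10, 40])     # no single stride
print(coverage, accuracy, sequential, strided, irregular)
```

A sequential prefetcher corresponds to the stride-1 case, while a stride prefetcher would use the detected stride ''s'' to choose the next block to fetch.&lt;br /&gt;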
&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
SOURCE : David F. Kotz and Carla Schlatter Ellis, &amp;quot;Prefetching in File Systems for MIMD Multiprocessors,&amp;quot; in IEEE Transactions on Parallel and Distributed Systems, Vol. 1, No. 2,  April 1990, pp. 218-230.&lt;br /&gt;
&lt;br /&gt;
Prefetching brings additional blocks into the cache that surround the blocks currently being accessed by the processor. Kotz and Ellis state that simply increasing the hit ratio does not automatically improve overall performance in multiprocessor computation.  Prefetching can actually produce stalls on individual CPUs of a multiprocessor system, because the hit-ratio benefit applies only to the thread currently benefiting from the prefetch, which then has to wait for the other threads to catch up.  As one approach to implementing prefetching in multiprocessor systems, they suggest overlapping prefetch decision-making time with user process time as much as possible, to minimize the impact on the overall execution sequence.&lt;br /&gt;
&lt;br /&gt;
= CMP Implementation in Intel Architecture =&lt;br /&gt;
&lt;br /&gt;
Let us now see how Intel architecture using the MESI protocol progressed from a uniprocessor architecture to a Chip MultiProcessor (CMP) using the bus as the interconnect. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Uniprocessor Architecture'''&lt;br /&gt;
&lt;br /&gt;
The diagram below shows the structure of the memory cluster in the Intel Pentium M processor.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:intel_cache1.jpg]]&amp;lt;/center&amp;gt; &amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In this structure we have,&lt;br /&gt;
* A unified on-chip '''L1 cache''' with the '''processor/core''',&lt;br /&gt;
* A '''Memory/L2 access control unit''', through which all the accesses to the L2 cache, main memory and IO space are made,&lt;br /&gt;
* The second level '''L2 cache''' along with the '''prefetch unit''' and&lt;br /&gt;
* '''Front side bus (FSB)''', a single shared bi-directional bus through which all the traffic is sent. This wide bus brings in multiple data bytes at a time. &lt;br /&gt;
&lt;br /&gt;
As Intel explains it, using this structure, processor requests were first sought in the '''L2 cache''' and only on a '''miss''' were they '''forwarded''' to main '''memory''' via the front side bus ('''FSB'''). The '''Memory/L2 access control''' unit served as a central point for '''maintaining coherence''' within the core and with the external world. It '''contains''' a '''snoop control unit''' that receives snoop requests from the bus and performs the required operations on each cache (and internal buffers) in parallel. It also handles RFO requests (BusUpgr) and ensures the operation continues only after it guarantees that no other version of the cache line exists in any other cache in the system.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''CMP Architecture'''&lt;br /&gt;
&lt;br /&gt;
For the CMP implementation, Intel chose a bus-based architecture using snoopy protocols over the '''directory protocol''': although a directory protocol reduces active power due to reduced snoop activity, it '''increases''' the '''design complexity''' and the '''static power''' due to larger tag arrays. Since Intel has a large market for processors in the mobility family, a directory-based solution was less favorable, because battery life depends mainly on static power consumption and less on dynamic power.&lt;br /&gt;
Let us examine how '''CMP''' was implemented in the '''Intel Core Duo''', which was one of the first dual-core processors for the budget/entry-level market. &lt;br /&gt;
The general CMP implementation structure of the Intel Core Duo is shown below.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:intel_cache2.jpg]]&amp;lt;/center&amp;gt; &amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This structure has the following changes when compared to the uniprocessor memory cluster structure. &lt;br /&gt;
* '''L1 cache''' and the '''processor/core''' structure is '''duplicated''' to give 2 cores.&lt;br /&gt;
* The '''Memory/L2 access control''' unit is '''split''' into 2 logical units: '''L2 controller''' and '''bus controller'''. The L2 controller handles all '''requests to the L2''' cache from the core and the snoop requests from the FSB. The '''bus controller''' handles '''data and I/O requests''' to and from the FSB.&lt;br /&gt;
* The '''prefetching''' unit is extended to handle the hardware '''prefetches for each core separately'''.&lt;br /&gt;
* A '''new logical unit''' (represented by the hexagon) was added to maintain '''fairness between the requests''' coming from the different cores and hence balance the requests to L2 and memory.&lt;br /&gt;
&lt;br /&gt;
This new '''partitioned structure''' for the  memory/L2 access control unit '''enhanced''' the '''performance''' while '''reducing power consumption'''. &lt;br /&gt;
For more information on uniprocessor and multiprocessor implementation under the Intel architecture, refer to &lt;br /&gt;
[http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]&lt;br /&gt;
&lt;br /&gt;
The '''Intel bus architecture''' has been '''evolving''' in order to accommodate the demands of scalability while using the same MESI protocol: from a '''single shared bus''', to '''dual independent buses (DIB)''' that double the available bandwidth, and finally to the logical conclusion of DIB, '''dedicated high-speed interconnects (DHSI)'''. The DHSI-based platforms use four FSBs, one for each processor in the platform. In both DIB and DHSI, a snoop filter is used in the chipset to cache snoop information, thereby significantly reducing the broadcasting needed for the snoop traffic on the buses. With the production of processors based on the next-generation 45-nm Hi-k Intel Core microarchitecture, the [http://en.wikipedia.org/wiki/Xeon Intel Xeon] processor fabric will transition from DHSI, with the memory controller in the chipset, to a distributed shared memory architecture using '''Intel QuickPath Interconnects with the MESIF protocol'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Word Invalidation=&lt;br /&gt;
One complexity problem applying to a number of the protocols deals with invalidation.  In newer protocols, individual words may be modified in a cache line, as opposed to the entirety of the line.  The other processors will thus have a mostly correct cache line, with only a word difference.  This leads to potential complexity, because to 'correct' the error the other processors must transition from shared to invalid, where they then can read the correct word from the bus and place it back into the line, transitioning back into shared.  This transition is potentially unnecessary, as the second processor may never access the specific word changed, but it may access other words in the cache line.  One potential solution to this is being researched at the present time, advancing the protocols so that a cache line is &amp;quot;not invalidated on the first dirty word, but after the number of dirty words crosses some predetermined value, which is data type and application dependent.&amp;quot;  In other words, if the application can possibly have multiple words 'incorrect,' several transitions to and from the invalid state may be avoided.&lt;br /&gt;
&amp;lt;br/&amp;gt;Source: http://tab.computer.org/tcca/NEWS/sept96/dsmideas.ps&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In essence, the solution proposed here is to advance the MOESI protocol with word invalidation and specific treatment of temporal and spatial data, so that the block is not invalidated.&lt;br /&gt;
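The threshold idea quoted above can be sketched in a few lines. This is only an illustrative Python sketch of one possible reading of the proposal: the SharedCopy class, its method names and the threshold value are hypothetical, and the model tracks a single shared copy of one cache line.&lt;br /&gt;

```python
# Toy model of threshold-based word invalidation for one shared cache line.
# Class, method names and the threshold are illustrative assumptions.

class SharedCopy:
    def __init__(self, threshold):
        self.threshold = threshold   # assumed tunable per data type and application
        self.dirty_words = set()     # words known to be stale in this copy
        self.state = "S"

    def remote_write(self, word_index):
        """Another processor modified one word; return this copy's new state."""
        if self.state == "I":
            return self.state
        self.dirty_words.add(word_index)
        # The count grows by at most one per call, so it crosses the
        # threshold exactly when it reaches threshold + 1.
        if len(self.dirty_words) == self.threshold + 1:
            self.state = "I"
        return self.state

    def can_read_locally(self, word_index):
        """A word is usable locally if the line is valid and the word is not stale."""
        return self.state == "S" and word_index not in self.dirty_words


copy = SharedCopy(threshold=2)
after_first = copy.remote_write(0)     # one dirty word: line stays shared
after_second = copy.remote_write(1)    # two dirty words: still within threshold
clean_word_ok = copy.can_read_locally(3)
stale_word_ok = copy.can_read_locally(0)
after_third = copy.remote_write(5)     # third dirty word crosses the threshold
print(after_first, after_second, clean_word_ok, stale_word_ok, after_third)
```

With a threshold of 0 the sketch degenerates to ordinary invalidation on the first dirty word, which is the behaviour the proposal tries to avoid for lines whose other words are still in use.&lt;br /&gt;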
&lt;br /&gt;
=Performance: MOESI vs MEI/MESI=&lt;br /&gt;
Early multiprocessors (such as the PowerPC processors) were designed to work with three states (Modified, Exclusive, Invalid - this is similar to the MSI protocol discussed earlier, the term MEI is used to remain consistent with the sources used for reference).  As time progressed, more multi-processors transitioned to the MESI protocol.  This is most likely due to the characteristics of MEI - it is &amp;quot;easy to implement but leads to inefficiencies in the way that memory bus bandwidth is used&amp;quot; [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 source].  As a result, systems with two or more MEI processors (most likely) will not see the 'full' potential of the extra processors, due to the inefficient bus transactions.&lt;br /&gt;
&lt;br /&gt;
In contrast, the five-state protocol MOESI is more complex to implement than MESI and MEI/MSI.  However, advantages arise from using it.  As Andy Keane (VP of Marketing for PMC-Sierra) put it, &amp;quot;this [fifth] state allows shared data that is dirty to remain in the cache.  Without this state, any shared line would be written back to memory to change the original state from modified. Since we have a dedicated, fast path from CPU to CPU, this state maximizes the use of this path rather than the path to memory, which is inherently slower&amp;quot; [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 source].  This advantage can be implemented by allowing MOESI to make some processor-to-processor transfers from level 2 caches.&lt;br /&gt;
&lt;br /&gt;
Source: [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 Edwards, Chris (08/06/2001). &amp;quot;Architects wrestle with multiprocessor options&amp;quot;. Electronic engineering times (0192-1541), (1178), p. 48.]&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Cache_coherence Cache coherence]&lt;br /&gt;
# [http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]&lt;br /&gt;
# [http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Symmetric_multiprocessing Common System Interface in Intel Processors]&lt;br /&gt;
# [http://www.zak.ict.pwr.wroc.pl/nikodem/ak_materialy/Cache%20consistency%20&amp;amp;%20MESI.pdf Cache consistency with MESI on Intel processor]&lt;br /&gt;
# [http://techreport.com/articles.x/8236/2 AMD dual core Architecture]&lt;br /&gt;
# [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual]&lt;br /&gt;
# [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors]&lt;br /&gt;
# [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of AMD 64 bit core]&lt;br /&gt;
# [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&amp;amp;arnumber=4913 Silicon Graphics Computer Systems]&lt;br /&gt;
# [http://books.google.com/books?id=g82fofiqa5IC&amp;amp;printsec=frontcover&amp;amp;dq=Parallel+computer+architecture:+a+hardware/software+approach+By+David+E.+Culler,+Jaswinder+Pal+Singh,+Anoop+Gupta&amp;amp;source=bl&amp;amp;ots=COrdamlfVn&amp;amp;sig=YcugVqbzTjHvlofvaFq6Ft_tjfY&amp;amp;hl=en&amp;amp;ei=0ZO6S4TJGcOclgejzI3BBw&amp;amp;sa=X&amp;amp;oi=book_result&amp;amp;ct=result&amp;amp;resnum=1&amp;amp;ved=0CAgQ6AEwAA#v=onepage&amp;amp;q=&amp;amp;f=false Parallel computer architecture: a hardware/software approach By David E. Culler, Jaswinder Pal Singh, Anoop Gupta]&lt;br /&gt;
# [http://www.freepatentsonline.com/5283886.html Three state invalidation protocols]&lt;br /&gt;
# [http://portal.acm.org/citation.cfm?id=1499317&amp;amp;dl=GUIDE&amp;amp;coll=GUIDE&amp;amp;CFID=83027384&amp;amp;CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems]&lt;br /&gt;
# [http://portal.acm.org/citation.cfm?id=6514 Cache Coherence protocols: evaluation using a multiprocessor simulation model]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Xerox_Dragon Xerox Dragon]&lt;br /&gt;
# [http://thanaseto.110mb.com/courses/CSD-527-report-engl.pdf Coherence Protocols]&lt;br /&gt;
# [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&amp;amp;arnumber=289691 XDBus]&lt;br /&gt;
# [http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dai0228a/index.html ARM]&lt;br /&gt;
# [http://ixbtlabs.com/articles/ibmpower4/index.html IBM POWER4]&lt;br /&gt;
# Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems. Solihin, Yan&lt;br /&gt;
# [http://www.chiplist.com/Intel_Itanium_2_processor/tree3f-section--2235-/ Intel Itanium 2]&lt;br /&gt;
# [http://techreport.com/articles.x/8236/2 MESI-MESI-MOESI Banana-fana...]&lt;br /&gt;
# [http://rsim.cs.illinois.edu/rsim/Manual/node109.html RSIM]&lt;br /&gt;
# [http://dl.acm.org/citation.cfm?id=1499317 Synapse tightly coupled multiprocessors]&lt;br /&gt;
# [http://dictionary.reference.com/browse/instruction+prefetch prefetching]&lt;/div&gt;</summary>
		<author><name>Pxu</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8a_fu&amp;diff=59848</id>
		<title>CSC/ECE 506 Spring 2012/8a fu</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8a_fu&amp;diff=59848"/>
		<updated>2012-03-19T01:59:15Z</updated>

		<summary type="html">&lt;p&gt;Pxu: /* MOESI */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Introduction to bus-based cache coherence in real machines=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==SMP Architecture==&lt;br /&gt;
Most parallel software in the commercial market relies on the shared-memory programming model, in which all processors access the same physical address space. The most common multiprocessors today use the [http://en.wikipedia.org/wiki/Symmetric_multiprocessing SMP] architecture, which uses a common bus as the interconnect.  In the case of multicore processors (&amp;quot;chip multiprocessors,&amp;quot; or CMP), the SMP architecture applies to the cores, treating them as separate processors. The key problem of shared-memory multiprocessors is providing a consistent view of memory with various cache hierarchies.  This is called the '''''cache coherence problem'''''. Solving it is critical for correctness and is a performance-sensitive design point in supporting the shared-memory model. The cache coherence mechanisms not only govern communication in a shared-memory multiprocessor, but also typically determine how the memory system transfers data between processors, caches, and memory.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:Busbased SMP.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
At any point in logical time, the permissions for a cache block can allow either a single writer or multiple readers. The '''''coherence protocol''''' ensures the invariants of the states are maintained. The different coherent states used by most of the cache coherence protocols are as shown in ''Table 1'':&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''States'''&lt;br /&gt;
|  '''Access Type'''&lt;br /&gt;
|  '''Invariant'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  read, write&lt;br /&gt;
|  all other caches in I state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  read&lt;br /&gt;
|  all other caches in I state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Owned'''&lt;br /&gt;
|  read&lt;br /&gt;
|  all other caches in I or S state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  read&lt;br /&gt;
|  no other cache in M or E state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|  -&lt;br /&gt;
|  -&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Snooping Protocols=&lt;br /&gt;
==MSI Protocol==&lt;br /&gt;
&lt;br /&gt;
'''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' is a three-state write-back '''invalidation protocol''' and one of the earliest snooping-based cache coherence protocols. It marks each cache line as '''Modified (M)''', '''Shared (S)''' or '''Invalid (I)'''. '''Invalid''' means the cache line is either not present or in the invalid state. If the cache line is clean and may be shared by more than one processor, it is marked '''Shared'''. If the cache line is dirty and the processor has exclusive ownership of it, it is in the '''Modified''' state. A BusRdx causes the other caches to invalidate their copies (demote them to the '''I''' state); if the line is in the '''M''' state in another cache, that cache flushes it. A write that hits a line in the '''S''' state still issues a BusRdx, and the line is promoted (upgraded) to the '''M''' state.&lt;br /&gt;
&lt;br /&gt;
The state diagram of the MSI protocol is shown below. Note that the left part shows the response to processor-side requests, and the right part shows the response to snooper-side requests.[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
[[File:MSInew.jpg|600px|thumb|center|MSI state transition diagram]]&lt;br /&gt;
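The MSI transitions described above can also be written down as a lookup table. This is a minimal Python sketch of the textbook protocol for a single cache line, not of any particular machine: the table name MSI and the function step are hypothetical, and the event names PrRd, PrWr, BusRd, BusRdX and Flush follow common textbook usage.&lt;br /&gt;

```python
# MSI protocol for one cache line as a transition table.
# (current state, event) maps to (next state, bus action or None).
# Table and function names are illustrative assumptions.

MSI = {
    ("I", "PrRd"):   ("S", "BusRd"),    # read miss: fetch the block
    ("I", "PrWr"):   ("M", "BusRdX"),   # write miss: fetch and invalidate others
    ("I", "BusRd"):  ("I", None),
    ("I", "BusRdX"): ("I", None),
    ("S", "PrRd"):   ("S", None),       # read hit
    ("S", "PrWr"):   ("M", "BusRdX"),   # write hit in S still goes to the bus
    ("S", "BusRd"):  ("S", None),
    ("S", "BusRdX"): ("I", None),       # another writer: invalidate this copy
    ("M", "PrRd"):   ("M", None),
    ("M", "PrWr"):   ("M", None),       # write hit in M stays local
    ("M", "BusRd"):  ("S", "Flush"),    # supply the dirty block, keep a shared copy
    ("M", "BusRdX"): ("I", "Flush"),    # supply the dirty block, give it up
}


def step(state, event):
    """Return (next state, bus action) for one processor or snooped event."""
    return MSI[(state, event)]


state = "I"
trace = []
for event in ["PrRd", "PrWr", "BusRd", "BusRdX"]:
    state, action = step(state, event)
    trace.append((state, action))
print(trace)
```

Processor events (PrRd, PrWr) correspond to the left part of the diagram and snooped bus events (BusRd, BusRdX) to the right part.&lt;br /&gt;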
&lt;br /&gt;
&lt;br /&gt;
===Real Architectures using MSI===&lt;br /&gt;
The '''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' protocol was first used in the '''[http://en.wikipedia.org/wiki/Silicon_Graphics SGI]''' IRIS 4D series. '''[http://en.wikipedia.org/wiki/Silicon_Graphics SGI]''' produced a broad range of '''[http://en.wikipedia.org/wiki/MIPS_architecture MIPS]'''-based (Microprocessor without Interlocked Pipeline Stages) workstations and servers during the 1990s, running '''[http://en.wikipedia.org/wiki/Silicon_Graphics SGI]''''s version of UNIX System V, now called '''[http://en.wikipedia.org/wiki/IRIX IRIX]'''. The 4D-MP graphics superworkstation delivered 40 MIPS (million instructions per second) of computing performance. This unprecedented level of computing and graphics processing in an office-environment workstation was made possible by the fastest available RISC microprocessors in a single shared-memory multiprocessor design driving a tightly coupled, highly parallel graphics system. Aggregate sustained data rates of over one gigabyte per second were achieved by a hierarchy of buses in a balanced system designed to avoid bottlenecks.&lt;br /&gt;
&lt;br /&gt;
The multiprocessor bus used in the 4D-MP graphics superworkstation is a pipelined, block-transfer bus that supports the cache coherence protocol and provides 64 megabytes per second of sustained data bandwidth between the processors, the memory and I/O system, and the graphics subsystem. Because the sync bus provides efficient synchronization between processors, the cache coherence protocol could be designed purely for efficient data sharing. If a cache coherence protocol has to support synchronization as well as sharing, the efficiency of data sharing may have to be compromised to improve the efficiency of the synchronization operations. The 4D-MP therefore uses a simple cache coherence protocol, namely the '''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' protocol.&lt;br /&gt;
&lt;br /&gt;
With the simple rules of the '''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' protocol enforced by hardware, the 4D-MP graphics superworkstation achieves efficient synchronization and efficient data sharing in a simple shared-memory model of parallel processing.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
&lt;br /&gt;
==Synapse protocol==&lt;br /&gt;
From the state transition diagram of MSI, we observe a transition from state '''M''' to state '''S''' when a BusRd is observed for that block. The contents of the block are flushed to the bus before the line goes to the '''S''' state. In certain cases it would be more appropriate to move to the '''I''' state, giving up the block entirely. Choosing '''S''' over '''I''' reflects the designer's assertion that the original processor is more likely to continue reading the block than the new processor is to write to it. The Synapse protocol, used in the early Synapse multiprocessor, made the alternate choice of going directly from the '''M''' state to the '''I''' state on a BusRd, assuming that the migratory pattern would be more frequent. More details about this protocol can be found in these papers published in the 1980s: [http://portal.acm.org/citation.cfm?id=6514 Cache Coherence protocols: evaluation using a multiprocessor simulation model] and [http://portal.acm.org/citation.cfm?id=1499317&amp;amp;dl=GUIDE&amp;amp;coll=GUIDE&amp;amp;CFID=83027384&amp;amp;CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems].&lt;br /&gt;
&lt;br /&gt;
In the Synapse protocol, the '''M''' state is called the '''D''' (Dirty) state. The following is the state transition diagram for the Synapse protocol:&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:Synapse1.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
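The single design difference between MSI and Synapse discussed above can be written down in a few lines. This is a hedged sketch: the function name is hypothetical, and the dirty state is written as 'M' in both cases (Synapse calls it 'D').&lt;br /&gt;

```python
def on_bus_read(protocol, state):
    """Next state of a snooping cache that observes a BusRd.

    `protocol` is 'MSI' or 'Synapse'; `state` is 'M', 'S' or 'I'.
    """
    if state != 'M':
        return state                 # S stays S, I stays I
    # The dirty block is flushed in both protocols; only the next
    # state differs.
    if protocol == 'Synapse':
        return 'I'                   # give the block up entirely
    return 'S'                       # MSI keeps a shared copy


msi_next = on_bus_read('MSI', 'M')
synapse_next = on_bus_read('Synapse', 'M')
```

With the Synapse choice, a later write by the requesting processor finds no other copy to invalidate, which favors migratory data; with the MSI choice, the original holder can keep reading without a new miss.&lt;br /&gt;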
===Real Architectures using Synapse===&lt;br /&gt;
Nothing could be found on any actual architecture using the Synapse cache-coherence protocol.  From the abstract of the paper written by Steve Frank and Armond Inselberg in 1984, &amp;quot;[http://dl.acm.org/citation.cfm?id=1499317 Synapse Tightly Coupled Multiprocessors: A New Approach to Solve Old Problems]&amp;quot;, Synapse was theoretical.&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The [http://www.disi.unige.it/person/DelzannoG/CacheProtocol/synapse.hy BABYLON Project] provides one example of a Synapse cache coherence protocol with atomic synchronization actions.&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
In the Synapse system, memory contention is reduced by giving each processor a cache that uses a non-write-through algorithm, which minimizes the bandwidth needed between cache and shared memory. The Synapse Expansion Bus includes an ownership-level protocol between processor caches.[[#References|&amp;lt;sup&amp;gt;[24]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
==[http://en.wikipedia.org/wiki/MESI_protocol MESI]==&lt;br /&gt;
The main drawback of MSI is that each read-write sequence incurs two bus transactions, regardless of whether the cache line is stored in only one cache. Highly parallel programs that have little data sharing suffer the most from this.  The MESI protocol solves this problem by introducing the '''Exclusive''' state to distinguish between a cache line stored in multiple caches and a line stored in a single cache.&lt;br /&gt;
Let us briefly see how the MESI protocol works. For a more detailed version, refer to the Solihin textbook, page 215.&lt;br /&gt;
&lt;br /&gt;
The MESI coherence protocol marks each cache line as being in one of the Modified, Exclusive, Shared, or Invalid states. &lt;br /&gt;
* '''Invalid''' : The cache line is either not present or is invalid&lt;br /&gt;
* '''Exclusive''' : The cache line is clean and is owned by this core/processor only&lt;br /&gt;
* '''Modified''' : The cache line is dirty and the core/processor has exclusive ownership of it; the copy in memory is out of date&lt;br /&gt;
* '''Shared''' : The cache line is clean and is shared by more than one core/processor&lt;br /&gt;
&lt;br /&gt;
The MESI protocol works as follows: &lt;br /&gt;
A line that is fetched receives the '''E''' or '''S''' state, depending on whether it exists in other caches in the system. A cache line gets the '''M''' state when a processor writes to it. If the line is not in the '''E''' or '''M''' state prior to the write, the cache sends a Bus Upgrade (BusUpgr) signal or, as the Intel manuals term it, a “Read-For-Ownership (RFO) request”, which ensures that the line exists in the cache and is in the '''I''' state in all other processors on the bus (if any). The table below summarizes the MESI protocol.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''This cache line is valid?'''&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  No&lt;br /&gt;
|-&lt;br /&gt;
|  '''The memory copy is…'''&lt;br /&gt;
|  out of date&lt;br /&gt;
|  valid&lt;br /&gt;
|  valid&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies exist in caches of other processors?'''&lt;br /&gt;
|  No&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe&lt;br /&gt;
|  Maybe&lt;br /&gt;
|-&lt;br /&gt;
|  '''A write to this line'''&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  goes directly to bus&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
State transition diagram for MESI protocol is shown below.[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
[[File:MESInew.jpg|500px|thumb|center|MESI state transition diagram]] &lt;br /&gt;
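&lt;br /&gt;
The benefit of the '''E''' state for a read followed by a write can be made concrete by counting bus transactions. The following is a minimal sketch under simplifying assumptions (one cache line, a read miss followed immediately by a write); the function name and parameters are hypothetical.&lt;br /&gt;

```python
def read_then_write(protocol, other_copy_exists):
    """Count bus transactions for a read miss followed by a write.

    `protocol` is 'MSI' or 'MESI'; `other_copy_exists` says whether
    another cache already holds the line when the read miss occurs.
    """
    bus_transactions = 1                 # the read miss always posts a BusRd
    if protocol == 'MESI' and not other_copy_exists:
        state = 'E'                      # sole clean copy
    else:
        state = 'S'                      # MSI has no E state
    if state == 'S':
        bus_transactions += 1            # BusUpgr / RFO (BusRdX in MSI)
    # From E, the write moves the line to M without any bus traffic.
    return bus_transactions


msi_private = read_then_write('MSI', False)
mesi_private = read_then_write('MESI', False)
mesi_shared = read_then_write('MESI', True)
```

For data that is private to one processor, the sketch reports two transactions under MSI and one under MESI; when the line really is shared, MESI pays the same two transactions as MSI.&lt;br /&gt;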
&lt;br /&gt;
===Real Architectures using MESI===&lt;br /&gt;
&lt;br /&gt;
Intel's [http://en.wikipedia.org/wiki/Pentium_Pro Pentium Pro] microprocessor, introduced in 1995, was the first Intel architecture microprocessor to support symmetric multiprocessing in various multiprocessor configurations.  SMP with the MESI protocol remained Intel's approach until the introduction of the 45-nm Hi-k Intel Core microarchitecture in the quad-core x86-64 Nehalem-EP. The 45-nm Hi-k Intel Core microarchitecture uses a new system framework called the [http://en.wikipedia.org/wiki/Intel_QuickPath_Interconnect QuickPath Interconnect], which uses point-to-point interconnection technology based on a distributed shared memory architecture.  It uses a modified version of the MESI protocol called [http://en.wikipedia.org/wiki/MESIF_protocol MESIF], which introduces the additional Forward state.  (The state diagram of the MESI transitions that occur within the Pentium's data cache can be found on page 63 of [http://books.google.com/books?id=TVzjEZg1--YC&amp;amp;printsec=frontcover&amp;amp;source=gbs_ge_summary_r&amp;amp;cad=0#v=onepage&amp;amp;q&amp;amp;f=false Pentium Processor System Architecture: Second Edition] by Don Anderson, Tom Shanley, MindShare, Inc.)&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
The [http://www.arm.com/products/processors/classic/arm11/arm11-mpcore.php ARM11 MPCore] and [http://en.wikipedia.org/wiki/ARM_Cortex-A9_MPCore Cortex-A9 MPCore] processors support the MESI cache coherency protocol.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]  ARM MPCore defines the states of the MESI protocol it implements as:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies in other caches'''&lt;br /&gt;
|  NO&lt;br /&gt;
|  NO&lt;br /&gt;
|  YES&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Clean or Dirty'''&lt;br /&gt;
|  DIRTY&lt;br /&gt;
|  CLEAN&lt;br /&gt;
|  CLEAN&lt;br /&gt;
|  -&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
The coherency protocol is implemented and managed by the Snoop Control Unit (SCU) in the ARM MPCore, which monitors the traffic between the local L1 data caches and the next level of the memory hierarchy. At boot time, each core can choose whether or not to take part in the coherency domain.  Unless explicit system calls bind a task to a specific core ([http://en.wikipedia.org/wiki/Processor_affinity processor affinity]), there is a high chance that a task will at some point migrate to a different core, along with its data as it is used.  Task migration is not handled efficiently by a literal MESI implementation, so the ARM MPCore offers two optimizations that keep MESI compliance while supporting migration. With '''Direct Data Intervention (DDI)''', the SCU keeps a copy of the tag RAMs of all cores’ caches, which lets it efficiently detect whether a cache line requested by one core is held by another core in the coherency domain before looking for it in the next level of the memory hierarchy. With '''Cache-to-cache Migration''', if the SCU finds that the cache line requested by one CPU is present in another core, it either copies it (if clean) or moves it (if dirty) from the other CPU directly into the requesting one, without interacting with external memory.  These optimizations reduce memory traffic in and out of the L1 cache subsystem by eliminating interaction with external memories, which in effect reduces the overall load on the interconnect and the overall power consumption.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
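&lt;br /&gt;
The two optimizations can be illustrated with a toy model of the SCU's duplicate tag store. This is only a sketch of the behavior described in the paragraph above, not ARM's implementation: the data layout, the function name and the returned strings are all hypothetical, and a line fetched from the next memory level is assumed to enter the requester in the '''E''' state.&lt;br /&gt;

```python
def serve_read_miss(tag_copies, requester, address):
    """Serve an L1 read miss using the SCU's copies of the tag RAMs.

    `tag_copies` maps core id to a dict of address to MESI state.
    Returns a string naming where the data came from.
    """
    holder = None
    for core, tags in tag_copies.items():
        # Direct Data Intervention: consult the duplicated tags first.
        if core != requester and tags.get(address, 'I') != 'I':
            holder = core
            break
    if holder is None:
        tag_copies[requester][address] = 'E'   # assumption: sole clean copy
        return 'memory'
    if tag_copies[holder][address] == 'M':
        # Dirty line: move it, so the old holder gives it up.
        tag_copies[holder][address] = 'I'
        tag_copies[requester][address] = 'M'
        return 'moved from core %d' % holder
    # Clean line: copy it, so both cores share it.
    tag_copies[holder][address] = 'S'
    tag_copies[requester][address] = 'S'
    return 'copied from core %d' % holder


tags = {0: {0x40: 'M', 0x80: 'E'}, 1: {}}
dirty_result = serve_read_miss(tags, 1, 0x40)
clean_result = serve_read_miss(tags, 1, 0x80)
absent_result = serve_read_miss(tags, 1, 0xC0)
```

Only the last request in this example reaches the next level of the memory hierarchy; the first two are satisfied between the L1 caches.&lt;br /&gt;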
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
Other architectures that use the MESI cache-coherence protocol include the L2 cache of the IBM POWER4 processor[[#References|&amp;lt;sup&amp;gt;[20]&amp;lt;/sup&amp;gt;]], the L2 cache of the Intel Itanium 2 processor[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]], and the Intel Xeon[[#References|&amp;lt;sup&amp;gt;[22]&amp;lt;/sup&amp;gt;]].&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
One implementation complexity already mentioned is the inefficiency of task migration in MESI that the ARM MPCore addresses.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]  Another possible implementation complexity is found during the replacement of a cache line: one possible MESI implementation requires a message to be sent to main memory when a cache line is flushed (i.e. an E to I transition), as the line was exclusively in one cache before it was removed. It is possible to avoid this replacement message if the system is designed so that the flush of a modified (exclusive) line requires an acknowledgment from main memory. However, this requires the flush to be stored in a 'write-back' buffer until the reply arrives (to ensure the change is successfully propagated to memory).[[#References|&amp;lt;sup&amp;gt;[23]&amp;lt;/sup&amp;gt;]]&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==MESIF==&lt;br /&gt;
Let us now briefly walk through the '''MESIF protocol''':&lt;br /&gt;
&lt;br /&gt;
The '''MESIF''' protocol, used in the latest Intel multi-core processors, was introduced to '''accommodate the point-to-point''' links used in the QuickPath Interconnect. Using the '''MESI''' protocol in this architecture would send many redundant messages between different processors, often with unnecessarily high latency. For example, when a processor requests a cache line that is stored in multiple locations, every location might respond with the data. As the requesting processor needs only a single copy of the data, the system would be wasting bandwidth. &lt;br /&gt;
As a solution to this problem, an additional state, the '''Forward state''', was added by slightly changing the role of the Shared state. Whenever there is a read request, only the cache holding the line in the F state responds, while all caches holding it in the S state remain dormant.  By designating a single cache line to '''respond to requests''', coherency traffic is substantially reduced when multiple copies of the data exist. On a read request, the responding line also transitions from F to S: when a cache line in the '''F''' state is '''copied''', the F state '''migrates''' to the '''newer copy''', while the '''older''' one drops back to '''S'''. Moving the F state to the new copy '''exploits''' both '''temporal and spatial locality'''. Because the newest copy of the cache line is always the one in the F state, the line in the F state is very unlikely to be evicted from the caches; this takes advantage of the temporal locality of the request. The second advantage is that if a particular cache line is in high demand due to spatial locality, the bandwidth used to transmit that data is spread across several nodes.&lt;br /&gt;
Accordingly, all M-to-S and E-to-S state transitions of MESI become '''M to F''' and '''E to F''' transitions.  &lt;br /&gt;
The '''F state''' is '''different''' from the '''Owned state''' of the MOESI protocol in that it does '''not''' hold a unique copy: a valid copy is also stored in memory. Thus, unlike a line in the Owned state of MOESI, which holds the only valid copy of the data, a line in the F state can be evicted or converted to the S state if desired. &lt;br /&gt;
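&lt;br /&gt;
The migration of the F state can be sketched as follows. This is a deliberately narrow illustration that models only the F, S and I states of a single line; the function name and the dictionary layout are assumptions, and the case where no cache holds the line in F (memory would have to reply) is not modelled.&lt;br /&gt;

```python
def read_miss(states, requester):
    """Serve a read miss when some cache holds the line in the F state.

    `states` maps cache id to 'F', 'S' or 'I'. Returns the id of the
    single cache that responds, or None if no forwarder exists.
    """
    responders = [cache for cache, state in states.items()
                  if state == 'F' and cache != requester]
    if not responders:
        return None                  # not modelled: memory would reply
    forwarder = responders[0]        # caches in S stay silent
    states[forwarder] = 'S'          # the older copy drops back to S
    states[requester] = 'F'          # the newest copy becomes the forwarder
    return forwarder


states = {0: 'F', 1: 'S', 2: 'I', 3: 'I'}
first_responder = read_miss(states, 2)
second_responder = read_miss(states, 3)
```

Each request is answered by exactly one cache, and successive requests are answered by different caches, which is how the forwarding load is spread across nodes.&lt;br /&gt;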
&lt;br /&gt;
More information on the QuickPath Interconnect and MESIF protocol can be found at&lt;br /&gt;
'''[http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]'''&lt;br /&gt;
&lt;br /&gt;
===Real Architectures using MESIF===&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
&lt;br /&gt;
==MOESI==&lt;br /&gt;
The [http://en.wikipedia.org/wiki/Opteron AMD Opteron] was AMD’s first-generation dual-core processor, with 2 distinct [http://en.wikipedia.org/wiki/Athlon_64 K8 cores] on a single die.  Cache coherence poses bigger problems on such multiprocessors, so an appropriate coherence protocol was necessary. The [http://en.wikipedia.org/wiki/Xeon Intel Xeon], the competing processor from Intel, used the MESI protocol to handle cache coherence.  MESI has the drawback of consuming considerable time and bandwidth in certain situations. &lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/MOESI_protocol MOESI] was AMD’s answer to this problem. MOESI adds a fifth state, '''“Owned”''', to the MESI protocol. It addresses the bandwidth problem that MESI faces when a processor holding invalid data in its cache wants to modify the data: the processor seeking access has to wait for the processor that modified the data to write it back to main memory, which takes time and bandwidth. MOESI removes this drawback by allowing dirty sharing.  When data is held by a processor in the new '''“Owned”''' state, that processor can provide the modified data to other processors without, or even before, writing it to main memory. This is called '''''dirty sharing'''''. The processor holding the data in the '''&amp;quot;Owned&amp;quot;''' state remains responsible for updating main memory later, when the cache line is evicted.&lt;br /&gt;
&lt;br /&gt;
MOESI has become one of the most popular snoop-based protocols supported in the AMD64 architecture.  The AMD dual-core Opteron can maintain cache coherence in systems up to 8 processors using this protocol.&lt;br /&gt;
&lt;br /&gt;
The five different states of the MOESI protocol are:&lt;br /&gt;
* '''Modified (M)''' : The cache line holds the most recent copy of the data, which is not present in any other processor's cache.&lt;br /&gt;
* '''Owned (O)''' : The cache line holds the most recent, correct copy of the data, which may be shared by other processors. The processor holding the line in this state is responsible for writing the correct value to main memory before the line is evicted.&lt;br /&gt;
* '''Exclusive (E)''' : The cache line holds the most recent, correct copy of the data, which is present only in this processor's cache; main memory also holds a valid copy.&lt;br /&gt;
* '''Shared (S)''' : The cache line holds the most recent, correct copy of the data, which may be shared by other processors. &lt;br /&gt;
* '''Invalid (I)''' : The cache line does not hold a valid copy of the data.&lt;br /&gt;
&lt;br /&gt;
A detailed explanation of how this protocol is implemented on AMD processors can be found in [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of the AMD 64-bit core].&lt;br /&gt;
&lt;br /&gt;
The following table summarizes the MOESI protocol:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Owned'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''This cache line is valid?'''&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  No&lt;br /&gt;
|-&lt;br /&gt;
|  '''The memory copy is…'''&lt;br /&gt;
|  out of date&lt;br /&gt;
|  out of date&lt;br /&gt;
|  valid&lt;br /&gt;
|  valid&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies exist in caches of other processors?'''&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe (in the Shared state)&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe&lt;br /&gt;
|  Maybe&lt;br /&gt;
|-&lt;br /&gt;
|  '''A write to this line'''&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  goes directly to bus&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
State transition diagram for MOESI protocol is shown below. The top part shows the response to processor-side requests and the bottom part is the response to snooper-side requests. [[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:MOESI.jpg|500px|thumb|center|MOESI state transition diagram]]&lt;br /&gt;
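&lt;br /&gt;
Dirty sharing can be contrasted with MESI in a short sketch. It models only what happens to a cache that holds a dirty line when another cache reads it, and when that line is later evicted; the function names are hypothetical and other states are left out on purpose.&lt;br /&gt;

```python
def snooped_read(protocol, state):
    """Reaction of a cache holding a dirty line to another cache's read.

    Returns (new_state, memory_written). Only the dirty states are
    modelled: 'M' for both protocols and 'O' for MOESI.
    """
    if state == 'M':
        if protocol == 'MOESI':
            return 'O', False        # supply the data, stay responsible for it
        return 'S', True             # MESI: write back to memory, then share
    if state == 'O' and protocol == 'MOESI':
        return 'O', False            # keep supplying data without a write-back
    raise ValueError('state not modelled in this sketch')


def eviction_writes_memory(state):
    """A line must be written back on eviction if memory is out of date."""
    return state in ('M', 'O')


mesi_reaction = snooped_read('MESI', 'M')
moesi_reaction = snooped_read('MOESI', 'M')
```

Under MOESI the write to main memory is deferred until the Owned line is evicted, rather than being paid on the first read by another processor.&lt;br /&gt;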
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Real Architectures using MOESI===&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
&lt;br /&gt;
When the MOESI protocol is implemented on a real architecture such as the AMD K10 series, the protocol is modified or optimized to operate more efficiently for specific program patterns. For example, the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] family of microprocessors (Family 0x10), AMD’s first generation to incorporate 4 distinct cores on a single die and the first to have a cache that all the cores share, uses the MOESI protocol with some optimization techniques incorporated. &lt;br /&gt;
&lt;br /&gt;
These optimizations focus on a small subset of compute problems that behave like producer-consumer programs. In such a problem, a thread running on one core produces data, which is consumed by a thread running on a separate core. With such programs, it is desirable for the two cores to communicate through the shared cache, avoiding round trips to and from main memory. However, the '''MOESI''' protocol that the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] cache uses for cache coherence can also limit bandwidth. Keeping the cache line in the '''‘M’''' state for such problems yields better performance.&lt;br /&gt;
&lt;br /&gt;
When the producer thread writes a new entry (here, into a ring buffer shared with the consumer), it allocates cache lines in the '''modified (M)''' state. Eventually, these M-marked cache lines start to fill the L3 cache. When the consumer reads a cache line, the MOESI protocol changes the state of the line to '''owned (O)''' in the L3 cache and pulls down a '''shared (S)''' copy for the consumer's own use. The producer thread then circles the ring buffer and arrives back at the cache line it had previously written. However, when the producer attempts to write new data to this line, it finds that it cannot: a cache line marked '''‘O’''' by the previous consumer read does not have sufficient permission for a write request in the MOESI protocol. To maintain coherence, the memory controller must initiate probes in the other caches to handle any other S copies that may exist, which slows down the process.&lt;br /&gt;
&lt;br /&gt;
Thus, it is preferable to keep the cache line in the '''‘M’''' state in the L3 cache. When the producer comes back around the ring buffer, it then finds the previously written cache line still marked '''‘M’''', so it can write to it safely without coherence concerns. Optimizations of this kind to the standard protocols can therefore yield better performance when the protocols are implemented in real machines.&lt;br /&gt;
&lt;br /&gt;
More information on how this is implemented, and on various other optimizations, can be found in the [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization Guide for AMD Family 10h Processors].&lt;br /&gt;
&lt;br /&gt;
==Dragon Protocol==&lt;br /&gt;
The '''[http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]''' is an update-based coherence protocol: unlike the coherence protocols we have seen so far, it does not invalidate other cached copies. Write propagation is achieved by updating the cached copies instead of invalidating them.  The Dragon Protocol does not update memory on a cache-to-cache transfer; it delays making memory consistent with the cache until the data is evicted and written back, which saves time and lowers the memory access requirements. Moreover, only the written '''byte''' or '''word''' is '''communicated''' to the '''other caches''' instead of the whole block, which further '''reduces''' the '''bandwidth''' usage. The protocol can dynamically detect the sharing status of a block, using a write-through policy for shared blocks and a write-back policy for blocks that are currently not shared. The Dragon Protocol employs the following four states for cache blocks: '''Shared Clean''', '''Shared Modified''', '''Exclusive''' and  '''Modified'''. &lt;br /&gt;
* '''Modified (M)''' and '''Exclusive (E)''' - these states have the same meaning as explained in the protocols above. &lt;br /&gt;
* '''Shared Modified (Sm)''' - Only one cache in the system can hold the block in the Shared Modified state. Potentially two or more caches have this block, memory may or may not be up to date, and this processor's cache has modified the block.&lt;br /&gt;
* '''Shared Clean (Sc)''' -  Potentially two or more caches have this block and memory may or may not be up to date (if no other cache has it in Sm state, memory will be up to date else it is not).&lt;br /&gt;
Only when a Shared Modified line is evicted from the cache on a cache miss is the block written back to main memory, which keeps memory consistent. For more information on the Dragon protocol, refer to the Solihin textbook, page 229. The state transition diagram is given below for reference.[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
[[File:DragonNew.jpg|500px|thumb|center|Dragon state transition diagram]] &lt;br /&gt;
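&lt;br /&gt;
The update-based behavior can be illustrated with a toy write to a block that two caches share. This is a minimal sketch, not the full Dragon state machine: the function name and data layout are hypothetical, and it assumes every cache in the dictionary already holds the block.&lt;br /&gt;

```python
def write_shared(caches, writer, offset, value):
    """Processor write to a block that other caches also hold.

    `caches` maps cache id to {'state': ..., 'data': [...]} for one
    block. Returns the payload placed on the bus: a single
    (offset, value) word rather than the whole block.
    """
    update = (offset, value)                 # the bus update carries one word
    for cache_id, line in caches.items():
        line['data'][offset] = value         # copies are updated, not invalidated
        if cache_id == writer:
            line['state'] = 'Sm'             # the writer now owns the dirty block
        else:
            line['state'] = 'Sc'
    return update


memory = [0, 0, 0, 0]                        # main memory is not touched below
caches = {
    0: {'state': 'Sc', 'data': list(memory)},
    1: {'state': 'Sc', 'data': list(memory)},
}
payload = write_shared(caches, 0, 2, 7)
```

After the write, both caches see the new word while main memory still holds the old value; memory would only be brought up to date when the Shared Modified line is evicted.&lt;br /&gt;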
&lt;br /&gt;
The Dragon protocol implements snoopy caches that provide the appearance of a uniform memory space to multiple processors. Each cache listens to 2 buses: the processor bus and the memory bus. The caches are also responsible for address translation, so the processor bus carries virtual addresses and the memory bus carries physical addresses.  The Dragon system was designed to support 4 to 8 Dragon processors.&lt;br /&gt;
&lt;br /&gt;
=[http://en.wikipedia.org/wiki/Instruction_prefetch Instruction Prefetching]=&lt;br /&gt;
&lt;br /&gt;
In standard [http://en.wikipedia.org/wiki/Pipeline_(computing) CPU pipelining], instruction prefetching is an optimization used to minimize the time the CPU spends waiting for instructions to be fetched from main memory.  Prefetching works perfectly on completely linear programs, but on programs with branches, other optimizations (like [http://en.wikipedia.org/wiki/Branch_predictor branch prediction]) have to be implemented, and the cost of branching the wrong way has to be considered.  Source:  http://dictionary.reference.com/browse/instruction+prefetch&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
SOURCE : David F. Kotz and Carla Schlatter Ellis, &amp;quot;Prefetching in File Systems for MIMD Multiprocessors,&amp;quot; in IEEE Transactions on Parallel and Distributed Systems, Vol. 1, No. 2,  April 1990, pp. 218-230.&lt;br /&gt;
&lt;br /&gt;
Prefetching brings into the cache additional blocks that surround the blocks currently being accessed by the processor. Kotz and Ellis state that simply increasing the hit ratio does not automatically improve overall performance in multiprocessor computation.  Prefetching can actually produce stalls on individual CPUs of a multiprocessor system, because the hit-ratio benefit applies only to the thread that benefits from the prefetch, which then has to wait for other threads to catch up.  As one approach to implementing prefetching in multiprocessor systems, they suggest overlapping prefetching decision-making time with user process time as much as possible, to minimize the impact on the overall execution sequence. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Instruction prefetching is a technique used to speed up the execution of a program. In multiprocessors, however, prefetching can come at a cost. Prefetching can interact with data modifications in ways that the memory coherence protocol cannot handle. In such situations, software must use serializing instructions or cache-invalidation instructions to guarantee that subsequent data accesses are coherent. &lt;br /&gt;
&amp;lt;br /&amp;gt;&lt;br /&gt;
An example of this type of a situation is a page-table update followed by accesses to the physical pages referenced by the updated page tables. The physical-memory references for the page tables are different than the physical-memory references for the data. Because of prefetching there may be problems with correctness. The following sequence of events shows such a situation when software changes the translation of virtual-page A from physical-page M to physical-page N:&lt;br /&gt;
# The tables that translate virtual-page A to physical-page M are now held only in main memory. The copies in the cache are invalidated.&lt;br /&gt;
# Software changes the page-table entry for virtual-page A in main memory to point to physical-page N rather than physical-page M.&lt;br /&gt;
# Data in virtual-page A is accessed.&lt;br /&gt;
&lt;br /&gt;
Software expects the processor to access the data from physical-page N after the update. However, it is possible for the processor to prefetch the data from physical-page M before the page table for virtual page A is updated. Because the physical-memory references are different, the processor does not recognize them as requiring coherence checking and believes it is safe to prefetch the data from virtual-page A, which is translated into a read from physical page M. Similar behavior can occur when instructions are prefetched from beyond the page-table update instruction.&lt;br /&gt;
&lt;br /&gt;
To prevent such errors, software must execute a special instruction (in the AMD64 architecture, INVLPG or MOV CR3) immediately after the page-table update to ensure that subsequent instruction fetches and data accesses use the correct virtual-page-to-physical-page translation. It is not necessary to perform a TLB invalidation operation preceding the table update.&lt;br /&gt;
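&lt;br /&gt;
The hazard and its fix can be mimicked with a toy translation cache. This is only an analogy in software, under the assumption that the processor keeps using a translation it obtained before the update; the class and method names are hypothetical, and `invalidate` merely stands in for the instruction executed after the page-table update.&lt;br /&gt;

```python
class ToyMMU:
    """Toy translation cache; all names here are hypothetical."""

    def __init__(self, page_table):
        self.page_table = page_table     # virtual page to physical page, in memory
        self.cached = {}                 # translations the processor already holds

    def translate(self, vpage):
        # Reuse a held translation if there is one, else walk the table.
        if vpage not in self.cached:
            self.cached[vpage] = self.page_table[vpage]
        return self.cached[vpage]

    def invalidate(self, vpage):
        # Stand-in for the instruction run after the page-table update.
        self.cached.pop(vpage, None)


page_table = {'A': 'M'}
mmu = ToyMMU(page_table)
mmu.translate('A')                 # the processor has already used A to M
page_table['A'] = 'N'              # software updates the entry in memory
stale = mmu.translate('A')         # without the fix: still physical-page M
mmu.invalidate('A')
fresh = mmu.translate('A')         # after the fix: physical-page N
```

The first access after the update still reaches physical-page M, and only the access after the invalidation reaches physical-page N, which is the behavior software expects.&lt;br /&gt;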
&lt;br /&gt;
More information can be found about this in [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual]&lt;br /&gt;
&lt;br /&gt;
= CMP Implementation in Intel Architecture =&lt;br /&gt;
&lt;br /&gt;
Let us now see how Intel architecture using the MESI protocol progressed from a uniprocessor architecture to a Chip MultiProcessor (CMP) using the bus as the interconnect. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Uniprocessor Architecture'''&lt;br /&gt;
&lt;br /&gt;
The diagram below shows the structure of the memory cluster in the Intel Pentium M processor.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:intel_cache1.jpg]]&amp;lt;/center&amp;gt; &amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In this structure we have,&lt;br /&gt;
* A unified on-chip '''L1 cache''' with the '''processor/core''',&lt;br /&gt;
* A '''Memory/L2 access control unit''', through which all the accesses to the L2 cache, main memory and IO space are made,&lt;br /&gt;
* The second level '''L2 cache''' along with the '''prefetch unit''' and&lt;br /&gt;
* '''Front side bus (FSB)''', a single shared bi-directional bus across which all the traffic is sent. This wide bus brings in multiple data bytes at a time. &lt;br /&gt;
&lt;br /&gt;
As Intel explains it, with this structure processor requests are first sought in the '''L2 cache''' and only on a '''miss''' are they '''forwarded''' to main '''memory''' via the front side bus ('''FSB'''). The '''Memory/L2 access control''' unit serves as a central point for '''maintaining coherence''' within the core and with the external world. It '''contains''' a '''snoop control unit''' that receives snoop requests from the bus and performs the required operations on each cache (and on internal buffers) in parallel. It also handles RFO requests (BusUpgr) and lets the operation continue only after it has guaranteed that no other version of the cache line exists in any other cache in the system.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''CMP Architecture'''&lt;br /&gt;
&lt;br /&gt;
For the CMP implementation, Intel chose a bus-based architecture using snoopy protocols over a '''directory protocol''': although a directory protocol reduces active power through reduced snoop activity, it '''increases''' the '''design complexity''' and the '''static power''' because of its larger tag arrays. Since Intel has a large market for processors in the mobility family, a directory-based solution was less favorable, because battery life depends mainly on static power consumption and less on dynamic power.&lt;br /&gt;
Let us examine how '''CMP''' was implemented in the '''Intel Core Duo''', which was one of the first dual-core processors for the budget/entry-level market. &lt;br /&gt;
The general CMP implementation structure of the Intel Core Duo is shown below.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:intel_cache2.jpg]]&amp;lt;/center&amp;gt; &amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This structure has the following changes when compared to the uniprocessor memory cluster structure. &lt;br /&gt;
* '''L1 cache''' and the '''processor/core''' structure is '''duplicated''' to give 2 cores.&lt;br /&gt;
* The '''Memory/L2 access control''' unit is '''split''' into 2 logical units: '''L2 controller''' and '''bus controller'''. The L2 controller handles all '''requests to the L2''' cache from the core and the snoop requests from the FSB. The '''bus controller''' handles '''data and I/O requests''' to and from the FSB.&lt;br /&gt;
* The '''prefetching''' unit is extended to handle the hardware '''prefetches for each core separately'''.&lt;br /&gt;
* A '''new logical unit''' (represented by the hexagon) was added to maintain '''fairness between the requests''' coming from the different cores and hence balance the requests to L2 and memory.&lt;br /&gt;
&lt;br /&gt;
This new '''partitioned structure''' for the  memory/L2 access control unit '''enhanced''' the '''performance''' while '''reducing power consumption'''. &lt;br /&gt;
For more information on uniprocessor and multiprocessor implementation under the Intel architecture, refer to &lt;br /&gt;
[http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]&lt;br /&gt;
&lt;br /&gt;
The '''Intel bus architecture''' has been '''evolving''' to accommodate the demands of scalability while keeping the same MESI protocol: from a '''single shared bus''', to '''dual independent buses (DIB)''' that double the available bandwidth, and on to the logical conclusion of DIB, '''dedicated high-speed interconnects (DHSI)'''. The DHSI-based platforms use four FSBs, one for each processor in the platform. In both DIB and DHSI, a snoop filter in the chipset caches snoop information, thereby significantly reducing the snoop traffic that must be broadcast on the buses. With the production of processors based on the next-generation 45-nm Hi-k Intel Core microarchitecture, the [http://en.wikipedia.org/wiki/Xeon Intel Xeon] processor fabric will transition from DHSI, with the memory controller in the chipset, to a distributed shared memory architecture using '''Intel QuickPath Interconnects''' and the '''MESIF protocol'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Word Invalidation=&lt;br /&gt;
One complexity problem applying to a number of the protocols deals with invalidation.  In newer protocols, individual words may be modified in a cache line, as opposed to the entirety of the line.  The other processors will thus have a mostly correct cache line, with only a word difference.  This leads to potential complexity, because to 'correct' the error the other processors must transition from shared to invalid, where they then can read the correct word from the bus and place it back into the line, transitioning back into shared.  This transition is potentially unnecessary, as the second processor may never access the specific word changed, but it may access other words in the cache line.  One potential solution to this is being researched at the present time, advancing the protocols so that a cache line is &amp;quot;not invalidated on the first dirty word, but after the number of dirty words crosses some predetermined value, which is data type and application dependent.&amp;quot;  In other words, if the application can possibly have multiple words 'incorrect,' several transitions to and from the invalid state may be avoided.&lt;br /&gt;
&amp;lt;br/&amp;gt;Source: http://tab.computer.org/tcca/NEWS/sept96/dsmideas.ps&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In essence, the solution proposed here is to advance the MOESI protocol with word invalidation and specific treatment of temporal and spatial data, so that the block is not invalidated.&lt;br /&gt;
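&lt;br /&gt;
The threshold idea quoted above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name ''remote_write'', the use of a set to track dirty word indices, and the ''threshold'' parameter are all hypothetical and are not taken from the cited source.&lt;br /&gt;

```python
# Hypothetical sketch: a line is invalidated only after the number of
# remotely modified (dirty) words crosses a predetermined threshold.
def remote_write(dirty_words, word_index, threshold):
    """Record a remote write to one word; return True once the line must be invalidated."""
    dirty_words.add(word_index)  # writing the same word twice counts once
    return len(dirty_words) > threshold


dirty = set()
first = remote_write(dirty, 3, threshold=2)   # one dirty word: line stays valid
second = remote_write(dirty, 5, threshold=2)  # two dirty words: line stays valid
third = remote_write(dirty, 7, threshold=2)   # three dirty words: invalidate
```

With a threshold of 0 the sketch degenerates to ordinary invalidation on the first dirty word.&lt;br /&gt;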
&lt;br /&gt;
=Performance: MOESI vs MEI/MESI=&lt;br /&gt;
Early multiprocessors (such as the PowerPC processors) were designed to work with three states (Modified, Exclusive, Invalid - this is similar to the MSI protocol discussed earlier, the term MEI is used to remain consistent with the sources used for reference).  As time progressed, more multi-processors transitioned to the MESI protocol.  This is most likely due to the characteristics of MEI - it is &amp;quot;easy to implement but leads to inefficiencies in the way that memory bus bandwidth is used&amp;quot; [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 source].  As a result, systems with two or more MEI processors (most likely) will not see the 'full' potential of the extra processors, due to the inefficient bus transactions.&lt;br /&gt;
&lt;br /&gt;
In contrast, the five-state protocol MOESI is more complex to implement than MESI and MEI/MSI, but it brings advantages.  As Andy Keane (VP of Marketing for PMC-Sierra) put it, &amp;quot;this [fifth] state allows shared data that is dirty to remain in the cache.  Without this state, any shared line would be written back to memory to change the original state from modified. Since we have a dedicated, fast path from CPU to CPU, this state maximizes the use of this path rather than the path to memory, which is inherently slower&amp;quot; [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 source].  This advantage is realized by letting MOESI serve some processor-to-processor transfers directly from the level 2 caches.&lt;br /&gt;
&lt;br /&gt;
Source: [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 Edwards, Chris (08/06/2001). &amp;quot;Architects wrestle with multiprocessor options&amp;quot;. Electronic engineering times (0192-1541), (1178), p. 48.]&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Cache_coherence Cache coherence]&lt;br /&gt;
# [http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]&lt;br /&gt;
# [http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Symmetric_multiprocessing Common System Interface in Intel Processors]&lt;br /&gt;
# [http://www.zak.ict.pwr.wroc.pl/nikodem/ak_materialy/Cache%20consistency%20&amp;amp;%20MESI.pdf Cache consistency with MESI on Intel processor]&lt;br /&gt;
# [http://techreport.com/articles.x/8236/2 AMD dual core Architecture]&lt;br /&gt;
# [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual]&lt;br /&gt;
# [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors]&lt;br /&gt;
# [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of AMD 64 bit core]&lt;br /&gt;
# [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&amp;amp;arnumber=4913 Silicon Graphics Computer Systems]&lt;br /&gt;
# [http://books.google.com/books?id=g82fofiqa5IC&amp;amp;printsec=frontcover&amp;amp;dq=Parallel+computer+architecture:+a+hardware/software+approach+By+David+E.+Culler,+Jaswinder+Pal+Singh,+Anoop+Gupta&amp;amp;source=bl&amp;amp;ots=COrdamlfVn&amp;amp;sig=YcugVqbzTjHvlofvaFq6Ft_tjfY&amp;amp;hl=en&amp;amp;ei=0ZO6S4TJGcOclgejzI3BBw&amp;amp;sa=X&amp;amp;oi=book_result&amp;amp;ct=result&amp;amp;resnum=1&amp;amp;ved=0CAgQ6AEwAA#v=onepage&amp;amp;q=&amp;amp;f=false Parallel computer architecture: a hardware/software approach By David E. Culler, Jaswinder Pal Singh, Anoop Gupta]&lt;br /&gt;
# [http://www.freepatentsonline.com/5283886.html Three state invalidation protocols]&lt;br /&gt;
# [http://portal.acm.org/citation.cfm?id=1499317&amp;amp;dl=GUIDE&amp;amp;coll=GUIDE&amp;amp;CFID=83027384&amp;amp;CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems]&lt;br /&gt;
# [http://portal.acm.org/citation.cfm?id=6514 Cache Coherence protocols: evaluation using a multiprocessor simulation model]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Xerox_Dragon Xerox Dragon]&lt;br /&gt;
# [http://thanaseto.110mb.com/courses/CSD-527-report-engl.pdf Coherence Protocols]&lt;br /&gt;
# [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&amp;amp;arnumber=289691 XDBus]&lt;br /&gt;
# [http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dai0228a/index.html ARM]&lt;br /&gt;
# [http://ixbtlabs.com/articles/ibmpower4/index.html IBM POWER4]&lt;br /&gt;
# Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems. Solihin, Yan&lt;br /&gt;
# [http://www.chiplist.com/Intel_Itanium_2_processor/tree3f-section--2235-/ Intel Itanium 2]&lt;br /&gt;
# [http://techreport.com/articles.x/8236/2 MESI-MESI-MOESI Banana-fana...]&lt;br /&gt;
# [http://rsim.cs.illinois.edu/rsim/Manual/node109.html RSIM]&lt;br /&gt;
# [http://dl.acm.org/citation.cfm?id=1499317 Synapse tightly coupled multiprocessors]&lt;/div&gt;</summary>
		<author><name>Pxu</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8a_fu&amp;diff=59842</id>
		<title>CSC/ECE 506 Spring 2012/8a fu</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8a_fu&amp;diff=59842"/>
		<updated>2012-03-19T01:53:27Z</updated>

		<summary type="html">&lt;p&gt;Pxu: /* MESIF */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Introduction to bus-based cache coherence in real machines=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==SMP Architecture==&lt;br /&gt;
Most parallel software in the commercial market relies on the shared-memory programming model in which all processors access the same physical address space. And the most common multiprocessors today use [http://en.wikipedia.org/wiki/Symmetric_multiprocessing SMP] architecture which use a common bus as the interconnect.  In the case of multicore processors (&amp;quot;chip multiprocessors,&amp;quot; or CMP) the SMP architecture applies to the cores treating them as separate processors. The key problem of shared-memory multiprocessors is providing a consistent view of memory with various cache hierarchies.  This is called '''''cache coherence problem'''''. It is  critical to  achieve correctness and performance-sensitive design point for supporting the shared-memory model. The cache coherence mechanisms not only govern communication in a shared-memory multiprocessor, but also typically determine how the memory system transfers data between processors, caches, and memory.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:Busbased SMP.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
At any point in logical time, the permissions for a cache block can allow either a single writer or multiple readers. The '''''coherence protocol''''' ensures the invariants of the states are maintained. The different coherent states used by most of the cache coherence protocols are as shown in ''Table 1'':&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''States'''&lt;br /&gt;
|  '''Access Type'''&lt;br /&gt;
|  '''Invariant'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  read, write&lt;br /&gt;
|  all other caches in I state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  read&lt;br /&gt;
|  all other caches in I state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Owned'''&lt;br /&gt;
|  read&lt;br /&gt;
|  all other caches in I or S state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  read&lt;br /&gt;
|  no other cache in M or E state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|  -&lt;br /&gt;
|  -&lt;br /&gt;
|}&lt;br /&gt;
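&lt;br /&gt;
The invariants in ''Table 1'' can be checked mechanically. The following Python sketch is illustrative only: the function name ''invariants_hold'' and the encoding of each cache's state as a single letter are assumptions made for this example, not part of any real protocol implementation.&lt;br /&gt;

```python
from collections import Counter


# Hypothetical helper: `states` lists the state letter (M, O, E, S or I)
# that each cache holds for the same block.
def invariants_hold(states):
    """Return True if the per-block state combination satisfies Table 1."""
    counts = Counter(states)
    # Modified and Exclusive require every other cache to be in I.
    for exclusive_state in ("M", "E"):
        if counts[exclusive_state] and counts["I"] != len(states) - 1:
            return False
    # Owned tolerates other copies only in I or S, so at most one owner.
    if counts["O"] > 1:
        return False
    return True


ok = invariants_hold(["O", "S", "I"])    # one owner plus a sharer is legal
bad = invariants_hold(["M", "S", "I"])   # a sharer next to M breaks the invariant
```

The Shared invariant (no other cache in M or E) follows from the first check, because any M or E copy forces all other caches into I.&lt;br /&gt;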
&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Snooping Protocols=&lt;br /&gt;
==MSI Protocol==&lt;br /&gt;
&lt;br /&gt;
'''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' is a three-state write-back '''invalidation protocol''' and one of the earliest snooping-based cache-coherence protocols. It marks each cache line as '''Modified (M)''', '''Shared (S)''' or '''Invalid (I)'''. '''Invalid''' means the cache line is either not present or is in the invalid state. If the cache line is clean and may be shared by more than one processor, it is marked '''Shared'''. If the cache line is dirty and the processor has exclusive ownership of it, it is in the '''Modified''' state. A BusRdX causes all other caches to invalidate (demote) their copies to the '''I''' state; if the line is in the '''M''' state in another cache, that cache flushes it. A processor write, even if it hits a line in the '''S''' state, issues a BusRdX and promotes (upgrades) the line to '''M'''.&lt;br /&gt;
&lt;br /&gt;
The state diagram of the MSI protocol is shown below. Note that the left part shows the response to processor-side requests, and the right part shows the response to snooper-side requests.[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
[[File:MSInew.jpg|600px|thumb|center|MSI state transition diagram]]&lt;br /&gt;
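&lt;br /&gt;
The MSI transitions described above can also be written as a lookup table. The Python sketch below is a minimal illustration, assuming the usual textbook event names (PrRd, PrWr, BusRd, BusRdX, Flush); the names ''MSI'' and ''step'' are hypothetical and chosen only for this example.&lt;br /&gt;

```python
# Hypothetical transition table: (state, event) maps to (next state, bus action).
# PrRd/PrWr are processor-side events; BusRd/BusRdX are snooped bus events.
MSI = {
    ("I", "PrRd"): ("S", "BusRd"),
    ("I", "PrWr"): ("M", "BusRdX"),
    ("S", "PrRd"): ("S", None),
    ("S", "PrWr"): ("M", "BusRdX"),   # a write hit in S still goes to the bus
    ("M", "PrRd"): ("M", None),
    ("M", "PrWr"): ("M", None),
    ("I", "BusRd"): ("I", None),
    ("I", "BusRdX"): ("I", None),
    ("S", "BusRd"): ("S", None),
    ("S", "BusRdX"): ("I", None),     # invalidate the clean copy
    ("M", "BusRd"): ("S", "Flush"),   # supply the dirty block, keep it shared
    ("M", "BusRdX"): ("I", "Flush"),  # supply the dirty block, then invalidate
}


def step(state, event):
    """Return (next_state, bus_action) for one cache line."""
    return MSI[(state, event)]


next_state, action = step("S", "PrWr")
```

A table-driven form makes it easy to compare protocols: a variant only needs different entries, not different code.&lt;br /&gt;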
&lt;br /&gt;
&lt;br /&gt;
===Real Architectures using MSI===&lt;br /&gt;
The '''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' protocol was first used in the '''[http://en.wikipedia.org/wiki/Silicon_Graphics SGI]''' IRIS 4D series. '''[http://en.wikipedia.org/wiki/Silicon_Graphics SGI]''' produced a broad range of '''[http://en.wikipedia.org/wiki/MIPS_architecture MIPS]'''-based (Microprocessor without Interlocked Pipeline Stages) workstations and servers during the 1990s, running '''[http://en.wikipedia.org/wiki/Silicon_Graphics SGI]''''s version of UNIX System V, now called '''[http://en.wikipedia.org/wiki/IRIX IRIX]'''. The 4D-MP graphics superworkstation delivered 40 MIPS (million instructions per second) of computing performance. This unprecedented level of computing and graphics processing in an office-environment workstation was made possible by the fastest available RISC microprocessors in a single shared memory multiprocessor design driving a tightly coupled, highly parallel graphics system. Aggregate sustained data rates of over one gigabyte per second were achieved by a hierarchy of buses in a balanced system designed to avoid bottlenecks.&lt;br /&gt;
&lt;br /&gt;
The multiprocessor bus used in the 4D-MP graphics superworkstation is a pipelined, block transfer bus that supports the cache coherence protocol and provides 64 megabytes per second of sustained data bandwidth between the processors, the memory and I/O system, and the graphics subsystem. Because the sync bus provides for efficient synchronization between processors, the cache coherence protocol was designed to support efficient data sharing between processors. If a cache coherence protocol has to support synchronization as well as sharing, a compromise in the efficiency of the data sharing protocol may be necessary to improve the efficiency of the synchronization operations. Hence the machine uses a simple cache coherence protocol, the '''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' protocol.&lt;br /&gt;
&lt;br /&gt;
With the simple rules of the '''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' protocol enforced by hardware, efficient synchronization and efficient data sharing are achieved in a simple shared memory model of parallel processing in the 4D-MP graphics superworkstation.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
&lt;br /&gt;
==Synapse protocol==&lt;br /&gt;
From the state transition diagram of MSI, we observe a transition from state '''M''' to state '''S''' when a BusRd is observed for that block. The contents of the block are flushed to the bus before going to the '''S''' state. In certain cases it would be more appropriate to move to the '''I''' state, giving up the block entirely. The choice between '''S''' and '''I''' reflects the designer's assertion about which is more likely: that the original processor continues reading the block, or that the new processor writes to it. The Synapse protocol, used in the early Synapse multiprocessor, made the alternate choice of going directly from the '''M''' state to the '''I''' state on a BusRd, assuming the migratory pattern would be more frequent. More details about this protocol can be found in these papers published in the 1980s: [http://portal.acm.org/citation.cfm?id=6514 Cache Coherence protocols: evaluation using a multiprocessor simulation model] and [http://portal.acm.org/citation.cfm?id=1499317&amp;amp;dl=GUIDE&amp;amp;coll=GUIDE&amp;amp;CFID=83027384&amp;amp;CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems].&lt;br /&gt;
&lt;br /&gt;
In the Synapse protocol the '''M''' state is called the '''D''' (Dirty) state. The following is the state transition diagram for the Synapse protocol:&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:Synapse1.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
===Real Architectures using Synapse===&lt;br /&gt;
Apart from the Synapse multiprocessor itself, no actual architecture using the Synapse cache-coherence protocol could be found.  The protocol is described by Steve Frank and Armond Inselberg in their 1984 paper &amp;quot;[http://dl.acm.org/citation.cfm?id=1499317 Synapse Tightly Coupled Multiprocessors: A New Approach to Solve Old Problems]&amp;quot;.&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The [http://www.disi.unige.it/person/DelzannoG/CacheProtocol/synapse.hy BABYLON Project] provides one example of a Synapse cache coherence protocol with atomic synchronization actions.&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
In the Synapse system, memory contention is reduced by designing a processor cache that employs a non-write-through algorithm to minimize bandwidth between cache and shared memory. The Synapse Expansion Bus includes an ownership level protocol between processor caches.[[#References|&amp;lt;sup&amp;gt;[24]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
==[http://en.wikipedia.org/wiki/MESI_protocol MESI]==&lt;br /&gt;
The main drawback of MSI is that each read-write sequence incurs two bus transactions irrespective of whether the cache line is stored in only one cache or not. Highly parallel programs that have little data sharing suffer the most from this.  MESI protocol solves this problem by introducing the '''Exclusive''' state to distinguish between a cache line stored in multiple caches and a line stored in a single cache.&lt;br /&gt;
Let us briefly see how the MESI protocol works. For a more detailed version, refer to the Solihin textbook, p. 215.&lt;br /&gt;
&lt;br /&gt;
The MESI coherence protocol marks each cache line as being in one of the Modified, Exclusive, Shared, or Invalid states. &lt;br /&gt;
* '''Invalid''' : The cache line is either not present or is invalid&lt;br /&gt;
* '''Exclusive''' : The cache line is clean and is owned by this core/processor only&lt;br /&gt;
* '''Modified''' : The cache line is dirty and the core/processor has exclusive ownership of it, exclusive of the memory also.&lt;br /&gt;
* '''Shared''' : The cache line is clean and is shared by more than one core/processor&lt;br /&gt;
&lt;br /&gt;
The MESI protocol works as follows: &lt;br /&gt;
A line that is fetched receives the '''E''' or '''S''' state depending on whether it exists in other processors in the system. A cache line gets the '''M''' state when a processor writes to it; if the line is not in the '''E''' or '''M''' state prior to the write, the cache sends a Bus Upgrade (BusUpgr) signal or, as the Intel manuals term it, a “Read-For-Ownership (RFO) request”, which ensures that the line exists in the cache and is in the '''I''' state in all other processors on the bus (if any). The table below summarizes the MESI protocol.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''This cache line is valid?'''&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  No&lt;br /&gt;
|-&lt;br /&gt;
|  '''The memory copy is…'''&lt;br /&gt;
|  out of date&lt;br /&gt;
|  valid&lt;br /&gt;
|  valid&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies exist in caches of other processors?'''&lt;br /&gt;
|  No&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe&lt;br /&gt;
|  Maybe&lt;br /&gt;
|-&lt;br /&gt;
|  '''A write to this line'''&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  goes directly to bus&lt;br /&gt;
|}&lt;br /&gt;
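&lt;br /&gt;
The saving that the '''Exclusive''' state brings over MSI can be made concrete by counting bus transactions for a read miss followed by a write to the same line. The Python sketch below is a simplified model under stated assumptions: the function name ''bus_transactions'' and its parameters are hypothetical, and only the BusRd and BusUpgr (RFO) transactions named in the text are counted.&lt;br /&gt;

```python
# Hypothetical model: count bus transactions for "read miss, then write"
# on one cache line under MSI or MESI.
def bus_transactions(protocol, other_caches_have_copy):
    """Return the number of bus transactions for a read followed by a write."""
    transactions = 1  # BusRd for the initial read miss
    if protocol == "MESI" and not other_caches_have_copy:
        state = "E"  # sole copy in the system: loaded as Exclusive
    else:
        state = "S"  # MSI has no E state; MESI loads S when sharers exist
    if state == "S":
        transactions += 1  # BusUpgr (RFO) so that other copies go to I
    return transactions


msi_private = bus_transactions("MSI", other_caches_have_copy=False)
mesi_private = bus_transactions("MESI", other_caches_have_copy=False)
```

For data that is not shared, the model gives two transactions under MSI and one under MESI, which is the case the '''E''' state was introduced for.&lt;br /&gt;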
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
State transition diagram for MESI protocol is shown below.[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
[[File:MESInew.jpg|500px|thumb|center|MESI state transition diagram]] &lt;br /&gt;
&lt;br /&gt;
===Real Architectures using MESI===&lt;br /&gt;
&lt;br /&gt;
Intel's [http://en.wikipedia.org/wiki/Pentium_Pro Pentium Pro] microprocessor, introduced in 1995, was the first Intel architecture microprocessor to support symmetric multiprocessing in various multiprocessor configurations.  SMP with the MESI protocol was used consistently until the introduction of the 45-nm Hi-k Core microarchitecture in Intel's (Nehalem-EP) quad-core x86-64 processors. The 45-nm Hi-k Intel Core microarchitecture uses a new system framework called the [http://en.wikipedia.org/wiki/Intel_QuickPath_Interconnect QuickPath Interconnect], a point-to-point interconnection technology based on a distributed shared memory architecture.  It uses a modified version of the MESI protocol called [http://en.wikipedia.org/wiki/MESIF_protocol MESIF], which introduces the additional Forward state.  (The state diagram of MESI transitions that occur within the Pentium's data cache can be found on page 63 of [http://books.google.com/books?id=TVzjEZg1--YC&amp;amp;printsec=frontcover&amp;amp;source=gbs_ge_summary_r&amp;amp;cad=0#v=onepage&amp;amp;q&amp;amp;f=false Pentium Processor System Architecture: Second Edition] by Don Anderson, Tom Shanley, MindShare, Inc.)&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
The [http://www.arm.com/products/processors/classic/arm11/arm11-mpcore.php ARM11 MPCore] and [http://en.wikipedia.org/wiki/ARM_Cortex-A9_MPCore Cortex-A9 MPCore] processors support the MESI cache coherency protocol.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]  ARM MPCore defines the states of the MESI protocol it implements as:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies in other caches'''&lt;br /&gt;
|  NO&lt;br /&gt;
|  NO&lt;br /&gt;
|  YES&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Clean or Dirty'''&lt;br /&gt;
|  DIRTY&lt;br /&gt;
|  CLEAN&lt;br /&gt;
|  CLEAN&lt;br /&gt;
|  -&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
The coherency protocol is implemented and managed by the Snoop Control Unit (SCU) in the ARM MPCore, which monitors the traffic between local L1 data caches and the next level of the memory hierarchy. At boot time, each core can choose whether or not to take part in the coherency domain.  Unless explicit system calls bind a task to a specific core ([http://en.wikipedia.org/wiki/Processor_affinity processor affinity]), there is a high chance that a task will at some point migrate to a different core, along with its data as it is used.  Task migration is not handled efficiently by a literal MESI implementation, so the ARM MPCore offers two optimizations that allow for MESI compliance and migration of tasks. With '''Direct Data Intervention (DDI)''', the SCU keeps a copy of the tag RAMs of all cores' caches, which lets it efficiently detect whether a cache line requested by one core is held by another core in the coherency domain before looking for it in the next level of the memory hierarchy. With '''Cache-to-cache Migration''', if the SCU finds that the cache line requested by one CPU is present in another core, it either copies it (if clean) or moves it (if dirty) from the other CPU directly into the requesting one, without interacting with external memory.  These optimizations reduce memory traffic in and out of the L1 cache subsystem by eliminating interaction with external memories, which in effect reduces the overall load on the interconnect and the overall power consumption.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
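&lt;br /&gt;
The two SCU optimizations described above can be sketched as a lookup in duplicated tag information followed by a copy or a move. The Python code below is a rough illustration only: the function name ''scu_lookup'', the dictionary layout, and the clean/dirty strings are assumptions for this example and do not describe the actual ARM hardware interface.&lt;br /&gt;

```python
# Hypothetical model of the SCU's duplicated tag RAMs:
# tag_copies maps core name to {address: "clean" or "dirty"}.
def scu_lookup(tag_copies, requester, addr):
    """Serve a line request from another core's L1 if possible.

    Returns the supplying core, or None if the request must go to the
    next level of the memory hierarchy.
    """
    for core, tags in tag_copies.items():
        if core != requester and addr in tags:
            status = tags[addr]
            if status == "dirty":
                del tags[addr]  # move: the supplying core gives the line up
            # copy (clean) or move (dirty) without touching external memory
            tag_copies[requester][addr] = status
            return core
    return None


tags = {"cpu0": {0x40: "dirty"}, "cpu1": {}}
supplier = scu_lookup(tags, "cpu1", 0x40)  # dirty line migrates from cpu0 to cpu1
```

Because the SCU consults only its own tag copies, the L1 caches themselves are not disturbed unless one of them actually holds the requested line.&lt;br /&gt;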
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
Other architectures that use the MESI cache-coherence protocol include the L2 cache of the IBM POWER4 processor[[#References|&amp;lt;sup&amp;gt;[20]&amp;lt;/sup&amp;gt;]], the L2 cache of the Intel Itanium 2 processor[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]], and the Intel Xeon[[#References|&amp;lt;sup&amp;gt;[22]&amp;lt;/sup&amp;gt;]].&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
One implementation complexity already mentioned is the inefficiency of task migration in MESI that the ARM MPCore addresses.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]  Another possible implementation complexity is found during the replacement of a cache line: one possible MESI implementation requires a message to be sent to main memory when a cache line is flushed (i.e. an E to I transition), as the line was exclusively in one cache before it was removed. It is possible to avoid this replacement message if the system is designed so that the flush of a modified (exclusive) line requires an acknowledgment from main memory. However, this requires the flush to be stored in a 'write-back' buffer until the reply arrives (to ensure the change is successfully propagated to memory).[[#References|&amp;lt;sup&amp;gt;[23]&amp;lt;/sup&amp;gt;]]&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==MESIF==&lt;br /&gt;
Let us now walk briefly through the '''MESIF protocol''':&lt;br /&gt;
&lt;br /&gt;
The '''MESIF''' protocol, used in the latest Intel multi-core processors, was introduced to '''accommodate the point-to-point''' links used in the QuickPath Interconnect. Using the '''MESI''' protocol in this architecture would send many redundant messages between different processors, often with unnecessarily high latency. For example, when a processor requests a cache line that is stored in multiple locations, every location might respond with the data. As the requesting processor only needs a single copy of the data, the system would be wasting bandwidth. &lt;br /&gt;
As a solution to this problem, an additional state, the '''Forward state''', was added by slightly changing the role of the Shared state. Whenever there is a read request, only the cache holding the line in the F state responds to the request, while all the caches holding it in the S state remain dormant.  Hence, by designating a single cache line to '''respond to requests''', coherency traffic is substantially reduced when multiple copies of the data exist. On a read request, the responding line also transitions from the F to the S state. That is, when a cache line in the '''F''' state is '''copied''', the F state '''migrates''' to the '''newer copy''', while the '''older''' one drops back to '''S'''. Moving the new copy to the F state '''exploits''' both '''temporal and spatial locality'''. Because the newest copy of the cache line is always in the F state, it is very unlikely that the line in the F state will be evicted from the caches. This takes advantage of the temporal locality of the request. The second advantage is that if a particular cache line is in high demand due to spatial locality, the bandwidth used to transmit that data will be spread across several nodes.&lt;br /&gt;
All M to S and E to S state transitions now become '''M to F''' and '''E to F''' transitions.  &lt;br /&gt;
The '''F state''' is '''different''' from the '''Owned state''' of the MOESI protocol in that it is '''not''' a unique copy, because a valid copy is stored in memory. Thus, unlike the Owned state of the MOESI protocol, in which the data in the O state is the only valid copy of the data, the data in the F state can be evicted or converted to the S state, if desired. &lt;br /&gt;
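&lt;br /&gt;
The forwarding behaviour described above can be sketched for a line that is already shared. The Python code below is a minimal illustration under stated assumptions: it models only the F, S and I states (M and E handling is left out), the function name ''read_miss'' is hypothetical, and the fallback to memory when no cache holds the line in F is an assumption of this sketch.&lt;br /&gt;

```python
# Hypothetical model of a MESIF read miss on an already shared line.
# `states` maps cache name to "F", "S" or "I" for one cache line.
def read_miss(states, requester):
    """Serve a read miss and return the name of the single responder."""
    forwarders = [cache for cache, state in states.items() if state == "F"]
    if forwarders:
        responder = forwarders[0]
        states[responder] = "S"  # the older copy drops back to S
    else:
        responder = "memory"     # assumption: memory answers if no F copy exists
    states[requester] = "F"      # the newest copy becomes the forwarder
    return responder


line = {"c0": "F", "c1": "S", "c2": "I"}
who = read_miss(line, "c2")  # only c0 answers; c1 stays silent
```

After the call, ''c2'' holds the line in F and both ''c0'' and ''c1'' hold it in S, so the next read miss is answered by ''c2'' alone.&lt;br /&gt;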
&lt;br /&gt;
More information on the QuickPath Interconnect and MESIF protocol can be found at&lt;br /&gt;
'''[http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]'''&lt;br /&gt;
&lt;br /&gt;
===Real Architectures using MESIF===&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
&lt;br /&gt;
==MOESI==&lt;br /&gt;
[http://en.wikipedia.org/wiki/Opteron AMD Opteron] was AMD’s first-generation dual-core processor, with 2 distinct [http://en.wikipedia.org/wiki/Athlon_64 K8 cores] together on a single die.  Cache coherence poses bigger problems on such multiprocessors, and an appropriate coherence protocol was needed to address them. The [http://en.wikipedia.org/wiki/Xeon Intel Xeon], the competing processor from Intel, used the MESI protocol to handle cache coherence.  MESI came with the drawback of using much time and bandwidth in certain situations. &lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/MOESI_protocol MOESI] was AMD’s answer to this problem. MOESI added a fifth state to the MESI protocol called '''“Owned”'''. MOESI addresses the bandwidth problem faced in the MESI protocol when a processor holding invalid data in its cache wants to modify the data.  Under MESI, the processor seeking access has to wait for the processor that modified the data to write it back to main memory, which takes time and bandwidth. MOESI removes this drawback: when the data is held by a processor in the new '''“Owned”''' state, it can provide other processors with the modified data without, or even before, writing it to main memory. This is called '''''dirty sharing'''''. The processor holding the data in the '''&amp;quot;Owned&amp;quot;''' state stays responsible for updating main memory later, when the cache line is evicted.&lt;br /&gt;
&lt;br /&gt;
MOESI has become one of the most popular snoop-based protocols and is supported in the AMD64 architecture.  The AMD dual-core Opteron can maintain cache coherence in systems of up to 8 processors using this protocol.&lt;br /&gt;
&lt;br /&gt;
The five different states of the MOESI protocol are:&lt;br /&gt;
* '''Modified (M)''' : The cache line holds the most recent copy of the data, which is not present in any other processor cache.&lt;br /&gt;
* '''Owned (O)'''   : The cache line holds the most recent, correct copy of the data, which may be shared by other processors. The processor holding the line in this state is responsible for updating main memory with the correct value before the line is evicted.  &lt;br /&gt;
* '''Exclusive (E)''' : A cache line holds the most recent, correct copy of the data, which is exclusively present on this processor and a copy is present in the main memory.  &lt;br /&gt;
* '''Shared (S)''' : A cache line in the shared state holds the most recent, correct copy of the data, which may be shared by other processors. &lt;br /&gt;
* '''Invalid (I)''' : A cache line does not hold a valid copy of the data.&lt;br /&gt;
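&lt;br /&gt;
The benefit of the Owned state can be sketched with a small model of one event: a remote read that hits a line another cache holds in the Modified state. This is only an illustration under simplifying assumptions. The function names remote_read_of_modified and evict are hypothetical, and the MESI branch follows the description above, in which the dirty line is written back to memory before it is shared.&lt;br /&gt;
&lt;br /&gt;
```python
# Sketch of a remote read hitting a Modified line, under MESI and MOESI.
# Function names and the returned tuple are assumptions for this example.

def remote_read_of_modified(protocol):
    # Returns (holder state, requester state, memory writes caused).
    memory_writes = 0
    if protocol == 'MESI':
        memory_writes += 1            # dirty line is written back first
        holder, requester = 'S', 'S'
    elif protocol == 'MOESI':
        holder, requester = 'O', 'S'  # dirty sharing: cache-to-cache supply
    else:
        raise ValueError('unknown protocol: ' + protocol)
    return holder, requester, memory_writes


def evict(state):
    # M and O lines are dirty and must be written back; E, S and I are not.
    return 1 if state in ('M', 'O') else 0


if __name__ == '__main__':
    print(remote_read_of_modified('MESI'))
    print(remote_read_of_modified('MOESI'))
```
&lt;br /&gt;
In this model the MOESI case causes no memory write at the time of sharing; the write-back is deferred until the Owned line is evicted.&lt;br /&gt;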
&lt;br /&gt;
A detailed explanation of how this protocol is implemented on AMD processors can be found in [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of the AMD 64-bit core].&lt;br /&gt;
&lt;br /&gt;
The following table summarizes the MOESI protocol:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
&lt;br /&gt;
|  '''Owned'''&lt;br /&gt;
&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|  '''This cache line is valid?'''&lt;br /&gt;
&lt;br /&gt;
|  Yes&lt;br /&gt;
&lt;br /&gt;
|  Yes&lt;br /&gt;
&lt;br /&gt;
|  Yes&lt;br /&gt;
&lt;br /&gt;
|  Yes&lt;br /&gt;
&lt;br /&gt;
|  No&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|  '''The memory copy is…'''&lt;br /&gt;
&lt;br /&gt;
|  out of date&lt;br /&gt;
&lt;br /&gt;
|  out of date&lt;br /&gt;
&lt;br /&gt;
|  valid&lt;br /&gt;
&lt;br /&gt;
|  valid&lt;br /&gt;
&lt;br /&gt;
|  -&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|  '''Copies exist in caches of other processors?'''&lt;br /&gt;
&lt;br /&gt;
|  No&lt;br /&gt;
&lt;br /&gt;
|  Maybe&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe&lt;br /&gt;
&lt;br /&gt;
|  Maybe&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|  '''A write to this line'''&lt;br /&gt;
&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
&lt;br /&gt;
|  goes directly to bus&lt;br /&gt;
&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The state transition diagram for the MOESI protocol is shown below. The top part shows the response to processor-side requests and the bottom part shows the response to snooper-side requests. [[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:MOESI.jpg|500px|thumb|center|MOESI state transition diagram]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br /&amp;gt;&lt;br /&gt;
&amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Optimization techniques on MOESI===&lt;br /&gt;
&lt;br /&gt;
In real machines, applying optimization techniques to the standard cache coherence protocol improves performance. For example, the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] family of microprocessors (Family 10h), AMD’s first generation to incorporate 4 distinct cores on a single die and the first to have a cache that all the cores share, uses the MOESI protocol with some optimizations incorporated.&lt;br /&gt;
&lt;br /&gt;
The optimization targets a small subset of compute problems that behave like producer-consumer programs. In such a problem, a thread running on one core produces data that is consumed by a thread running on a separate core. With such programs, it is desirable to have the two cores communicate through the shared cache, avoiding round trips to and from main memory. However, the '''MOESI''' protocol that the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] cache uses for coherence can limit bandwidth in this pattern. Keeping the cache line in the '''‘M’''' state for such problems gives better performance, as explained below.&lt;br /&gt;
&lt;br /&gt;
When the producer thread writes a new entry, it allocates cache lines in the '''modified (M)''' state. Eventually, these M-marked cache lines start to fill the L3 cache. When the consumer reads a cache line, the MOESI protocol changes the state of the line to '''owned (O)''' in the L3 cache and pulls down a '''shared (S)''' copy for the consumer's own use. The producer thread then circles the ring buffer and arrives back at the same cache line it had previously written. However, when the producer attempts to write new data to the cache line now marked '''‘O’''', it finds that it cannot, since under the MOESI protocol a line marked '''‘O’''' by the previous consumer read does not carry sufficient permission for a write. To maintain coherence, the memory controller must probe the other caches (to handle any other S copies that may exist), which slows the process down.&lt;br /&gt;
&lt;br /&gt;
Thus, it is preferable to keep the cache line in the '''‘M’''' state in the L3 cache. Then, when the producer comes back around the ring buffer, it finds the previously written cache line still marked '''‘M’''' and can safely write to it without coherence concerns. Optimizations of this kind to standard protocols can thus yield better performance when implemented in real machines.&lt;br /&gt;
&lt;br /&gt;
More information on how this is implemented, and on various other optimizations, can be found in the [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Real Architectures using Synapse===&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
&lt;br /&gt;
==Dragon Protocol==&lt;br /&gt;
The '''[http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]''' is an update-based coherence protocol: unlike the coherence protocols seen so far, it does not invalidate other cached copies. Write propagation is achieved by updating the cached copies instead of invalidating them. The Dragon Protocol does not update memory on a cache-to-cache transfer; it delays making memory consistent with the cache until the data is evicted and written back, which saves time and lowers memory access requirements. Moreover, only the written '''byte''' or '''word''' is '''communicated''' to the '''other caches''' instead of the whole block, which further '''reduces bandwidth''' usage. The protocol dynamically detects the sharing status of a block, using an update policy (write-through to the other caches) for shared blocks and write-back for blocks that are currently not shared. The Dragon Protocol employs the following four states for cache blocks: '''Shared Clean''', '''Shared Modified''', '''Exclusive''' and '''Modified'''.&lt;br /&gt;
* '''Modified (M)''' and '''Exclusive (E)''' - these states have the same meaning as in the protocols above.&lt;br /&gt;
* '''Shared Modified (Sm)''' - Potentially two or more caches hold this block, main memory is not up to date, and this cache, which modified the block, is responsible for updating memory when the block is replaced. Only one cache in the system can hold a given block in the Shared Modified state.&lt;br /&gt;
* '''Shared Clean (Sc)''' - Potentially two or more caches hold this block, and memory may or may not be up to date (memory is up to date only if no other cache holds the block in the Sm state).&lt;br /&gt;
A Shared Modified line is written back to main memory only when it is evicted from the cache on a cache miss, which keeps memory consistent. For more information on the Dragon protocol, refer to the Solihin textbook, page 229. The state transition diagram is given below for reference.[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
[[File:DragonNew.jpg|500px|thumb|center|Dragon state transition diagram]] &lt;br /&gt;
&lt;br /&gt;
The Dragon protocol was implemented with snoopy caches that provide the appearance of a uniform memory space to multiple processors. Each cache listens to two buses: the processor bus and the memory bus. The caches are also responsible for address translation, so the processor bus carries virtual addresses and the memory bus carries physical addresses. The Dragon system was designed to support 4 to 8 Dragon processors.&lt;br /&gt;
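&lt;br /&gt;
The update-based behaviour described in this section can be sketched as follows. This is a minimal model, not the Xerox Dragon implementation: the class DragonCache and the function write_shared are hypothetical names, a block is modelled as a short list of words, and the bus is reduced to a loop over the other caches.&lt;br /&gt;
&lt;br /&gt;
```python
# Sketch of a Dragon-style write to a block that other caches may also hold.
# Only the written word is broadcast; main memory is left untouched.
# Class and function names are hypothetical.

class DragonCache:
    def __init__(self, words, state):
        self.words = list(words)      # private copy of the block
        self.state = state            # one of 'E', 'M', 'Sc', 'Sm'


def write_shared(writer, others, index, value, memory):
    writer.words[index] = value
    for cache in others:
        cache.words[index] = value    # the update carries one word, not the block
        cache.state = 'Sc'            # any previous owner gives up ownership
    # With sharers the writer becomes the owner (Sm); alone it holds M.
    writer.state = 'Sm' if others else 'M'
    return memory                     # unchanged until the owner evicts the block


if __name__ == '__main__':
    memory = [0, 0, 0, 0]
    a = DragonCache(memory, 'Sc')
    b = DragonCache(memory, 'Sm')
    write_shared(a, [b], 2, 7, memory)
    print(a.state, b.state, b.words, memory)
```
&lt;br /&gt;
After the write, both caches hold the new word while main memory still holds the old one, which is the delayed memory update the protocol relies on.&lt;br /&gt;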
&lt;br /&gt;
=[http://en.wikipedia.org/wiki/Instruction_prefetch Instruction Prefetching]=&lt;br /&gt;
&lt;br /&gt;
In standard [http://en.wikipedia.org/wiki/Pipeline_(computing) CPU pipelining], instruction prefetching is an optimization used to minimize the time the CPU spends waiting for instructions to be fetched from main memory. Prefetching works perfectly on completely linear programs; on programs with branches, other optimizations (like [http://en.wikipedia.org/wiki/Branch_predictor branch prediction]) have to be implemented, and the cost of taking the wrong branch has to be considered. Source: http://dictionary.reference.com/browse/instruction+prefetch&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
David F. Kotz and Carla Schlatter Ellis, &amp;quot;Prefetching in File Systems for MIMD Multiprocessors,&amp;quot; in IEEE Transactions on Parallel and Distributed Systems, Vol. 1, No. 2,  April 1990, pp. 218-230.&lt;br /&gt;
&lt;br /&gt;
Prefetching brings into the cache additional blocks surrounding the blocks the processor is currently accessing. The source cited above states that simply increasing the hit ratio does not automatically improve overall performance in multiprocessor computation.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Instruction prefetching is a technique used to speed up the execution of a program. In multiprocessors, however, prefetching comes at a cost: because of prefetching, data can be modified in a way that the memory coherence protocol cannot handle. In such situations, software must use serializing instructions or cache-invalidation instructions to guarantee that subsequent data accesses are coherent.&lt;br /&gt;
&amp;lt;br /&amp;gt;&lt;br /&gt;
An example of such a situation is a page-table update followed by accesses to the physical pages referenced by the updated page tables. The physical-memory references for the page tables are different from the physical-memory references for the data, so prefetching can cause problems with correctness. The following sequence of events shows such a situation, in which software changes the translation of virtual-page A from physical-page M to physical-page N:&lt;br /&gt;
# The tables that translate virtual-page A to physical-page M are now held only in main memory. The copies in the cache are invalidated.&lt;br /&gt;
# Software changes the page-table entry for virtual-page A in main memory to point to physical-page N rather than physical-page M.&lt;br /&gt;
# Data in virtual-page A is accessed.&lt;br /&gt;
&lt;br /&gt;
Software expects the processor to access the data from physical-page N after the update. However, it is possible for the processor to prefetch the data from physical-page M before the page table for virtual page A is updated. Because the physical-memory references are different, the processor does not recognize them as requiring coherence checking and believes it is safe to prefetch the data from virtual-page A, which is translated into a read from physical page M. Similar behavior can occur when instructions are prefetched from beyond the page-table update instruction.&lt;br /&gt;
&lt;br /&gt;
To prevent such errors, software must execute special instructions (the AMD64 manual names INVLPG and MOV CR3) immediately after the page-table update to ensure that subsequent instruction fetches and data accesses use the correct virtual-page-to-physical-page translation. It is not necessary to perform a TLB invalidation operation before the table update.&lt;br /&gt;
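&lt;br /&gt;
The hazard in the sequence above can be illustrated with a toy model. It is only a sketch under strong assumptions: the class Processor and its methods are hypothetical, a remembered translation stands in for whatever the processor obtained before the page-table update, and invalidate models the effect of the instruction executed right after the update.&lt;br /&gt;
&lt;br /&gt;
```python
# Toy model of the page-table update hazard. All names are hypothetical.

class Processor:
    def __init__(self, page_table, physical):
        self.page_table = page_table          # virtual page to physical page
        self.physical = physical              # physical page to contents
        self.cached = {}                      # translations already looked up

    def read(self, vpage):
        # Reuse an earlier translation if one exists, as a prefetch would.
        if vpage not in self.cached:
            self.cached[vpage] = self.page_table[vpage]
        return self.physical[self.cached[vpage]]

    def invalidate(self, vpage):
        # Models the invalidating instruction issued after the update.
        self.cached.pop(vpage, None)


if __name__ == '__main__':
    page_table = {'A': 'M'}
    physical = {'M': 'old data', 'N': 'new data'}
    cpu = Processor(page_table, physical)
    cpu.read('A')                 # translation A to M is now remembered
    page_table['A'] = 'N'         # software updates the page table
    print(cpu.read('A'))          # still served from physical-page M
    cpu.invalidate('A')
    print(cpu.read('A'))          # now served from physical-page N
```
&lt;br /&gt;
Without the invalidation step the second read still returns the contents of physical-page M, which is the stale access the text warns about.&lt;br /&gt;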
&lt;br /&gt;
More information can be found about this in [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual]&lt;br /&gt;
&lt;br /&gt;
= CMP Implementation in Intel Architecture =&lt;br /&gt;
&lt;br /&gt;
Let us now see how the Intel architecture, using the MESI protocol, progressed from a uniprocessor architecture to a Chip MultiProcessor (CMP) using the bus as the interconnect.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Uniprocessor Architecture'''&lt;br /&gt;
&lt;br /&gt;
The diagram below shows the structure of the memory cluster in the Intel Pentium M processor.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:intel_cache1.jpg]]&amp;lt;/center&amp;gt; &amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In this structure we have,&lt;br /&gt;
* A unified on-chip '''L1 cache''' with the '''processor/core''',&lt;br /&gt;
* A '''Memory/L2 access control unit''', through which all the accesses to the L2 cache, main memory and IO space are made,&lt;br /&gt;
* The second level '''L2 cache''' along with the '''prefetch unit''' and&lt;br /&gt;
* '''Front side bus (FSB)''', a single shared bi-directional bus across which all the traffic is sent. These wide buses bring in multiple data bytes at a time.&lt;br /&gt;
&lt;br /&gt;
As Intel explains it, in this structure processor requests were first sought in the '''L2 cache''' and were '''forwarded''' to main '''memory''' via the front side bus ('''FSB''') only on a '''miss'''. The '''Memory/L2 access control''' unit served as a central point for '''maintaining coherence''' within the core and with the external world. It '''contains''' a '''snoop control unit''' that receives snoop requests from the bus and performs the required operations on each cache (and internal buffers) in parallel. It also handles RFO (read-for-ownership) requests (BusUpgr) and ensures that the operation continues only after it guarantees that no other version of the cache line exists in any other cache in the system.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''CMP Architecture'''&lt;br /&gt;
&lt;br /&gt;
For the CMP implementation, Intel chose a bus-based architecture using snoopy protocols over a '''directory protocol'''. Although a directory protocol reduces active power because of reduced snoop activity, it '''increases''' the '''design complexity''' and the '''static power''' because of its larger tag arrays. Since Intel has a large market for processors in the mobility family, a directory-based solution was less favorable: battery life depends mainly on static power consumption and less on dynamic power.&lt;br /&gt;
Let us examine how '''CMP''' was implemented in the '''Intel Core Duo''', which was one of the first dual-core processors for the budget/entry-level market.&lt;br /&gt;
The general CMP implementation structure of the Intel Core Duo is shown below.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:intel_cache2.jpg]]&amp;lt;/center&amp;gt; &amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This structure has the following changes when compared to the uniprocessor memory cluster structure. &lt;br /&gt;
* '''L1 cache''' and the '''processor/core''' structure is '''duplicated''' to give 2 cores.&lt;br /&gt;
* The '''Memory/L2 access control''' unit is '''split''' into 2 logical units: '''L2 controller''' and '''bus controller'''. The L2 controller handles all '''requests to the L2''' cache from the core and the snoop requests from the FSB. The '''bus controller''' handles '''data and I/O requests''' to and from the FSB.&lt;br /&gt;
* The '''prefetching''' unit is extended to handle the hardware '''prefetches for each core separately'''.&lt;br /&gt;
* A '''new logical unit''' (represented by the hexagon) was added to maintain '''fairness between the requests''' coming from the different cores and hence balance the requests to L2 and memory.&lt;br /&gt;
&lt;br /&gt;
This new '''partitioned structure''' for the  memory/L2 access control unit '''enhanced''' the '''performance''' while '''reducing power consumption'''. &lt;br /&gt;
For more information on uniprocessor and multiprocessor implementation under the Intel architecture, refer to &lt;br /&gt;
[http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]&lt;br /&gt;
&lt;br /&gt;
The '''Intel bus architecture''' has been '''evolving''' to accommodate the demands of scalability while using the same MESI protocol: from a '''single shared bus''', to '''dual independent buses (DIB)''' that double the available bandwidth, to the logical conclusion of DIB with the introduction of '''dedicated high-speed interconnects (DHSI)'''. DHSI-based platforms use four FSBs, one for each processor in the platform. In both DIB and DHSI, a snoop filter in the chipset caches snoop information, significantly reducing the broadcasting needed for snoop traffic on the buses. With the production of processors based on the next-generation 45-nm Hi-k Intel Core microarchitecture, the [http://en.wikipedia.org/wiki/Xeon Intel Xeon] processor fabric will transition from DHSI, with the memory controller in the chipset, to a distributed shared memory architecture using '''Intel QuickPath Interconnects''' and the '''MESIF protocol'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Word Invalidation=&lt;br /&gt;
One complexity problem that applies to a number of the protocols concerns invalidation. In newer protocols, individual words in a cache line may be modified, as opposed to the entire line. The other processors then hold a mostly correct cache line that differs in only one word. This leads to potential complexity: to 'correct' the error, the other processors must transition from shared to invalid, read the correct word from the bus, place it back into the line, and transition back to shared. This transition is potentially unnecessary, as the second processor may never access the specific word that changed, though it may access other words in the cache line. One potential solution is being researched at the present time: advancing the protocols so that a cache line is &amp;quot;not invalidated on the first dirty word, but after the number of dirty words crosses some predetermined value, which is data type and application dependent.&amp;quot; In other words, if the application can tolerate multiple 'incorrect' words, several transitions to and from the invalid state may be avoided.&lt;br /&gt;
&amp;lt;br/&amp;gt;Source: http://tab.computer.org/tcca/NEWS/sept96/dsmideas.ps&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In essence, the solution proposed here is to advance the MOESI protocol with word invalidation and specific treatment of temporal and spatial data, so that the block is not invalidated.&lt;br /&gt;
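&lt;br /&gt;
The threshold idea quoted above can be sketched as follows. This is an illustration only, not the scheme from the cited source: the class name RemoteCopy is hypothetical, and the threshold is simply a parameter that, as the source notes, would depend on data type and application.&lt;br /&gt;
&lt;br /&gt;
```python
# Sketch of threshold-based invalidation: a remote copy of a line is
# invalidated only once the number of distinct dirty words exceeds a
# threshold. The class name and threshold values are assumptions.

class RemoteCopy:
    def __init__(self, threshold):
        self.threshold = threshold
        self.dirty_words = set()      # indices of words written elsewhere
        self.state = 'S'

    def remote_write(self, word_index):
        # Record the stale word; invalidate once too many words are stale.
        self.dirty_words.add(word_index)
        if len(self.dirty_words) > self.threshold:
            self.state = 'I'
        return self.state


if __name__ == '__main__':
    copy = RemoteCopy(threshold=2)
    for index in (0, 0, 1, 3):
        print(index, copy.remote_write(index))
```
&lt;br /&gt;
With a threshold of 0 this degenerates to ordinary invalidation on the first remote write.&lt;br /&gt;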
&lt;br /&gt;
=Performance: MOESI vs MEI/MESI=&lt;br /&gt;
Early multiprocessors (such as the PowerPC processors) were designed to work with three states: Modified, Exclusive and Invalid. This is similar to the MSI protocol discussed earlier; the term MEI is used here to remain consistent with the sources used for reference. As time progressed, more multiprocessors transitioned to the MESI protocol. This is most likely due to the characteristics of MEI: it is &amp;quot;easy to implement but leads to inefficiencies in the way that memory bus bandwidth is used&amp;quot; [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 source]. As a result, systems with two or more MEI processors will most likely not see the full potential of the extra processors, because of the inefficient bus transactions.&lt;br /&gt;
&lt;br /&gt;
In contrast, the five-state MOESI protocol is more complex to implement than MESI and MEI/MSI. However, using it brings advantages. As Andy Keane (VP of Marketing for PMC-Sierra) put it, &amp;quot;this [fifth] state allows shared data that is dirty to remain in the cache.  Without this state, any shared line would be written back to memory to change the original state from modified. Since we have a dedicated, fast path from CPU to CPU, this state maximizes the use of this path rather than the path to memory, which is inherently slower&amp;quot; [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 source]. This advantage can be realized by allowing MOESI to make some processor-to-processor transfers from level 2 caches.&lt;br /&gt;
&lt;br /&gt;
Source: [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 Edwards, Chris (08/06/2001). &amp;quot;Architects wrestle with multiprocessor options&amp;quot;. Electronic engineering times (0192-1541), (1178), p. 48.]&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Cache_coherence Cache coherence]&lt;br /&gt;
# [http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]&lt;br /&gt;
# [http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Symmetric_multiprocessing Common System Interface in Intel Processors]&lt;br /&gt;
# [http://www.zak.ict.pwr.wroc.pl/nikodem/ak_materialy/Cache%20consistency%20&amp;amp;%20MESI.pdf Cache consistency with MESI on Intel processor]&lt;br /&gt;
# [http://techreport.com/articles.x/8236/2 AMD dual core Architecture]&lt;br /&gt;
# [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual]&lt;br /&gt;
# [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors]&lt;br /&gt;
# [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of AMD 64 bit core]&lt;br /&gt;
# [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&amp;amp;arnumber=4913 Silicon Graphics Computer Systems]&lt;br /&gt;
# [http://books.google.com/books?id=g82fofiqa5IC&amp;amp;printsec=frontcover&amp;amp;dq=Parallel+computer+architecture:+a+hardware/software+approach+By+David+E.+Culler,+Jaswinder+Pal+Singh,+Anoop+Gupta&amp;amp;source=bl&amp;amp;ots=COrdamlfVn&amp;amp;sig=YcugVqbzTjHvlofvaFq6Ft_tjfY&amp;amp;hl=en&amp;amp;ei=0ZO6S4TJGcOclgejzI3BBw&amp;amp;sa=X&amp;amp;oi=book_result&amp;amp;ct=result&amp;amp;resnum=1&amp;amp;ved=0CAgQ6AEwAA#v=onepage&amp;amp;q=&amp;amp;f=false Parallel computer architecture: a hardware/software approach By David E. Culler, Jaswinder Pal Singh, Anoop Gupta]&lt;br /&gt;
# [http://www.freepatentsonline.com/5283886.html Three state invalidation protocols]&lt;br /&gt;
# [http://portal.acm.org/citation.cfm?id=1499317&amp;amp;dl=GUIDE&amp;amp;coll=GUIDE&amp;amp;CFID=83027384&amp;amp;CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems]&lt;br /&gt;
# [http://portal.acm.org/citation.cfm?id=6514 Cache Coherence protocols: evaluation using a multiprocessor simulation model]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Xerox_Dragon Xerox Dragon]&lt;br /&gt;
# [http://thanaseto.110mb.com/courses/CSD-527-report-engl.pdf Coherence Protocols]&lt;br /&gt;
# [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&amp;amp;arnumber=289691 XDBus]&lt;br /&gt;
# [http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dai0228a/index.html ARM]&lt;br /&gt;
# [http://ixbtlabs.com/articles/ibmpower4/index.html IBM POWER4]&lt;br /&gt;
# Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems. Solihin, Yan&lt;br /&gt;
# [http://www.chiplist.com/Intel_Itanium_2_processor/tree3f-section--2235-/ Intel Itanium 2]&lt;br /&gt;
# [http://techreport.com/articles.x/8236/2 MESI-MESI-MOESI Banana-fana...]&lt;br /&gt;
# [http://rsim.cs.illinois.edu/rsim/Manual/node109.html RSIM]&lt;br /&gt;
# [http://dl.acm.org/citation.cfm?id=1499317 Synapse tightly coupled multiprocessors]&lt;/div&gt;</summary>
		<author><name>Pxu</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8a_fu&amp;diff=59841</id>
		<title>CSC/ECE 506 Spring 2012/8a fu</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8a_fu&amp;diff=59841"/>
		<updated>2012-03-19T01:53:06Z</updated>

		<summary type="html">&lt;p&gt;Pxu: /* MSI Protocol */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Introduction to bus-based cache coherence in real machines=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==SMP Architecture==&lt;br /&gt;
Most parallel software in the commercial market relies on the shared-memory programming model, in which all processors access the same physical address space. The most common multiprocessors today use the [http://en.wikipedia.org/wiki/Symmetric_multiprocessing SMP] architecture, which uses a common bus as the interconnect. In the case of multicore processors (&amp;quot;chip multiprocessors,&amp;quot; or CMP), the SMP architecture applies to the cores, treating them as separate processors. The key problem of shared-memory multiprocessors is providing a consistent view of memory with various cache hierarchies. This is called the '''''cache coherence problem'''''. Solving it is critical for correctness, and it is a performance-sensitive design point in supporting the shared-memory model. The cache coherence mechanisms not only govern communication in a shared-memory multiprocessor, but also typically determine how the memory system transfers data between processors, caches, and memory.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:Busbased SMP.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
At any point in logical time, the permissions for a cache block allow either a single writer or multiple readers. The '''''coherence protocol''''' ensures that the invariants of the states are maintained. The coherence states used by most cache coherence protocols are shown in ''Table 1'':&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''States'''&lt;br /&gt;
|  '''Access Type'''&lt;br /&gt;
|  '''Invariant'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  read, write&lt;br /&gt;
|  all other caches in I state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  read&lt;br /&gt;
|  all other caches in I state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Owned'''&lt;br /&gt;
|  read&lt;br /&gt;
|  all other caches in I or S state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  read&lt;br /&gt;
|  no other cache in M or E state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|  -&lt;br /&gt;
|  -&lt;br /&gt;
|}&lt;br /&gt;
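&lt;br /&gt;
The invariants in ''Table 1'' can be checked mechanically for a single block, given the state that block has in every cache. The sketch below is only an illustration; the function name coherent and the list-of-states representation are assumptions made here.&lt;br /&gt;
&lt;br /&gt;
```python
# Check the invariants of Table 1 for one block, given the state of that
# block in every cache. The function name and the list representation
# are assumptions made for this sketch.

def coherent(states):
    for i, s in enumerate(states):
        others = states[:i] + states[i + 1:]
        # Modified or Exclusive: all other caches must be Invalid.
        if s in ('M', 'E') and any(o != 'I' for o in others):
            return False
        # Owned: all other caches must be Invalid or Shared.
        if s == 'O' and any(o not in ('I', 'S') for o in others):
            return False
        # Shared: no other cache may be Modified or Exclusive.
        if s == 'S' and any(o in ('M', 'E') for o in others):
            return False
    return True


if __name__ == '__main__':
    print(coherent(['M', 'I', 'I']))
    print(coherent(['O', 'S', 'S', 'I']))
    print(coherent(['M', 'S']))
```
&lt;br /&gt;
A combination such as one Modified copy alongside a Shared copy is rejected, which matches the single-writer or multiple-reader rule stated above.&lt;br /&gt;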
&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Snooping Protocols=&lt;br /&gt;
==MSI Protocol==&lt;br /&gt;
&lt;br /&gt;
'''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' is a three-state write-back '''invalidation protocol''' and one of the earliest snooping-based cache coherence protocols. It marks each cache line as '''Modified (M)''', '''Shared (S)''' or '''Invalid (I)'''. '''Invalid''' means the cache line is either not present or is in the invalid state. If the cache line is clean and may be shared by more than one processor, it is marked '''Shared'''. If the cache line is dirty and the processor has exclusive ownership of it, it is in the '''Modified''' state. A BusRdX causes other caches to invalidate (demote) their copies to the '''I''' state; if the line is in the '''M''' state in another cache, that cache flushes it. A write to a line in the '''S''' state, even though it is a cache hit, issues a BusRdX and promotes (upgrades) the line to the '''M''' state.&lt;br /&gt;
&lt;br /&gt;
The state diagram of the MSI protocol is shown below. Note that the left part shows the response to processor-side requests, and the right part shows the response to snooper-side requests.[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
[[File:MSInew.jpg|600px|thumb|center|MSI state transition diagram]]&lt;br /&gt;
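&lt;br /&gt;
The MSI transitions can also be written down as a small lookup table. This is a minimal sketch rather than a hardware description: the event names (PrRd, PrWr, BusRd, BusRdX) follow the usual textbook convention, while the table MSI and the helper msi_next are hypothetical names introduced here.&lt;br /&gt;
&lt;br /&gt;
```python
# Minimal MSI sketch: (state, event) maps to (next state, bus action).
# The table is an illustrative assumption, not a description of any
# particular machine.

MSI = {
    # Processor-side requests
    ('I', 'PrRd'):   ('S', 'BusRd'),
    ('I', 'PrWr'):   ('M', 'BusRdX'),
    ('S', 'PrRd'):   ('S', None),
    ('S', 'PrWr'):   ('M', 'BusRdX'),   # upgrade: a hit that still uses the bus
    ('M', 'PrRd'):   ('M', None),
    ('M', 'PrWr'):   ('M', None),
    # Snooper-side requests
    ('I', 'BusRd'):  ('I', None),
    ('I', 'BusRdX'): ('I', None),
    ('S', 'BusRd'):  ('S', None),
    ('S', 'BusRdX'): ('I', None),       # another cache wants to write
    ('M', 'BusRd'):  ('S', 'Flush'),    # supply the dirty block, keep a clean copy
    ('M', 'BusRdX'): ('I', 'Flush'),
}


def msi_next(state, event):
    # Return (next_state, bus_action) for one cache line.
    return MSI[(state, event)]


if __name__ == '__main__':
    print(msi_next('S', 'PrWr'))
    print(msi_next('M', 'BusRd'))
```
&lt;br /&gt;
For example, msi_next('S', 'PrWr') returns ('M', 'BusRdX'), the upgrade case mentioned above.&lt;br /&gt;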
&lt;br /&gt;
&lt;br /&gt;
===Real Architectures using MSI===&lt;br /&gt;
The '''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' protocol was first used in the '''[http://en.wikipedia.org/wiki/Silicon_Graphics SGI]''' IRIS 4D series. '''[http://en.wikipedia.org/wiki/Silicon_Graphics SGI]''' produced a broad range of '''[http://en.wikipedia.org/wiki/MIPS_architecture MIPS]'''-based (Microprocessor without Interlocked Pipeline Stages) workstations and servers during the 1990s, running '''[http://en.wikipedia.org/wiki/Silicon_Graphics SGI]''''s version of UNIX System V, now called '''[http://en.wikipedia.org/wiki/IRIX IRIX]'''. The 4D-MP graphics superworkstation delivered 40 MIPS (million instructions per second) of computing performance. This unprecedented level of computing and graphics processing in an office-environment workstation was made possible by the fastest available RISC microprocessors in a single shared-memory multiprocessor design driving a tightly coupled, highly parallel graphics system. Aggregate sustained data rates of over one gigabyte per second were achieved by a hierarchy of buses in a balanced system designed to avoid bottlenecks.&lt;br /&gt;
&lt;br /&gt;
The multiprocessor bus used in the 4D-MP graphics superworkstation is a pipelined, block-transfer bus that supports the cache coherence protocol and provides 64 megabytes per second of sustained data bandwidth between the processors, the memory and I/O system, and the graphics subsystem. Because the sync bus provides efficient synchronization between processors, the cache coherence protocol could be designed to support efficient data sharing between processors. If a cache coherence protocol has to support synchronization as well as sharing, a compromise in the efficiency of the data sharing protocol may be necessary to improve the efficiency of the synchronization operations. Hence the machine uses a simple cache coherence protocol, namely the '''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' protocol.&lt;br /&gt;
&lt;br /&gt;
With the simple rules of the '''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' protocol enforced by hardware, efficient synchronization and efficient data sharing are achieved in a simple shared-memory model of parallel processing in the 4D-MP graphics superworkstation.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
&lt;br /&gt;
==Synapse protocol==&lt;br /&gt;
From the state transition diagram of MSI, we observe that there is a transition from state '''M''' to state '''S''' when a BusRd is observed for that block. The contents of the block are flushed to the bus before the line goes to the '''S''' state. In certain cases it would be more appropriate to move to the '''I''' state, giving up the block entirely. The choice between moving to '''S''' or '''I''' reflects the designer's assertion about whether the original processor is more likely to continue reading the block than the new processor is to write to it. The Synapse protocol, used in the early Synapse multiprocessor, made the alternate choice of going directly from the '''M''' state to the '''I''' state on a BusRd, assuming the migratory pattern would be more frequent. More details about this protocol can be found in these papers published in the 1980s: [http://portal.acm.org/citation.cfm?id=6514 Cache Coherence protocols: evaluation using a multiprocessor simulation model] and [http://portal.acm.org/citation.cfm?id=1499317&amp;amp;dl=GUIDE&amp;amp;coll=GUIDE&amp;amp;CFID=83027384&amp;amp;CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems].&lt;br /&gt;
&lt;br /&gt;
In the Synapse protocol, the '''M''' state is called the '''D''' (Dirty) state. The following is the state transition diagram for the Synapse protocol:&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:Synapse1.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
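&lt;br /&gt;
The difference between MSI and Synapse described above comes down to a single snooped transition. The following is a minimal Python sketch of that one transition, not a full protocol model. The function name and the returned tuple are illustrative assumptions, and the bus action is simplified to a flush in both cases.&lt;br /&gt;

```python
# Hypothetical helper (not taken from either paper): next state of a cache
# that holds a block in M (called D in Synapse) and snoops a BusRd.
def snoop_busrd_from_modified(protocol):
    if protocol == 'MSI':
        # Keep a readable copy: assumes the old owner keeps reading.
        return ('S', 'Flush')
    if protocol == 'Synapse':
        # Give the block up entirely: assumes migratory sharing.
        return ('I', 'Flush')
    raise ValueError('unknown protocol: ' + protocol)


for name in ('MSI', 'Synapse'):
    print(name, snoop_busrd_from_modified(name))
```

Under MSI the old owner can keep reading without another bus transaction; under Synapse its next read is a miss.&lt;br /&gt;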
===Real Architectures using Synapse===&lt;br /&gt;
Nothing could be found on any actual architecture using the Synapse cache-coherence protocol.  Judging from the abstract of the paper written by Steve Frank and Armond Inselberg in 1984, &amp;quot;[http://dl.acm.org/citation.cfm?id=1499317 Synapse Tightly Coupled Multiprocessors: A New Approach to Solve Old Problems]&amp;quot;, Synapse was theoretical.&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[http://www.disi.unige.it/person/DelzannoG/CacheProtocol/synapse.hy The BABYLON Project] provides one example of a Synapse cache coherence protocol with atomic synchronization actions.&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
In the Synapse system, memory contention is reduced by designing a processor cache that employs a non-write-through algorithm to minimize the bandwidth used between cache and shared memory. The Synapse Expansion Bus includes an ownership-level protocol between processor caches.[[#References|&amp;lt;sup&amp;gt;[24]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
==[http://en.wikipedia.org/wiki/MESI_protocol MESI]==&lt;br /&gt;
The main drawback of MSI is that each read-write sequence incurs two bus transactions irrespective of whether the cache line is stored in only one cache or not. Highly parallel programs that have little data sharing suffer the most from this.  MESI protocol solves this problem by introducing the '''Exclusive''' state to distinguish between a cache line stored in multiple caches and a line stored in a single cache.&lt;br /&gt;
Let us briefly see how the MESI protocol works. For a more detailed version, refer to the Solihin textbook, p. 215.&lt;br /&gt;
&lt;br /&gt;
The MESI coherence protocol marks each cache line as being in one of the Modified, Exclusive, Shared, or Invalid states. &lt;br /&gt;
* '''Invalid''' : The cache line is either not present or is invalid&lt;br /&gt;
* '''Exclusive''' : The cache line is clean and is owned by this core/processor only&lt;br /&gt;
* '''Modified''' : The cache line is dirty and the core/processor has exclusive ownership of it; the copy in memory is stale.&lt;br /&gt;
* '''Shared''' : The cache line is clean and is shared by more than one core/processor&lt;br /&gt;
&lt;br /&gt;
The MESI protocol works as follows: &lt;br /&gt;
A line that is fetched receives the '''E''' or '''S''' state depending on whether it exists in other processors in the system. A cache line gets the '''M''' state when a processor writes to it; if the line is not in the '''E''' or '''M''' state prior to the write, the cache sends a Bus Upgrade (BusUpgr) signal or, as the Intel manuals term it, a “Read-For-Ownership (RFO) request”, which ensures that the line exists in the cache and is in the '''I''' state in all other processors on the bus (if any). The table below summarizes the MESI protocol.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''This cache line is valid?'''&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  No&lt;br /&gt;
|-&lt;br /&gt;
|  '''The memory copy is…'''&lt;br /&gt;
|  out of date&lt;br /&gt;
|  valid&lt;br /&gt;
|  valid&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies exist in caches of other processors?'''&lt;br /&gt;
|  No&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe&lt;br /&gt;
|  Maybe&lt;br /&gt;
|-&lt;br /&gt;
|  '''A write to this line'''&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  goes directly to bus&lt;br /&gt;
|}&lt;br /&gt;
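&lt;br /&gt;
The saving that the Exclusive state buys can be made concrete by counting bus transactions for a read miss followed by a write to the same line. This is a small Python sketch under simplifying assumptions (one line, no evictions, no intervening snoops); the function and parameter names are hypothetical.&lt;br /&gt;

```python
def bus_transactions_read_then_write(protocol, other_sharers):
    """Count bus transactions for a read miss followed by a write."""
    count = 1                       # the read miss always issues a BusRd
    if protocol == 'MSI':
        state = 'S'                 # MSI has no Exclusive state
    elif protocol == 'MESI':
        state = 'S' if other_sharers else 'E'
    else:
        raise ValueError('unknown protocol: ' + protocol)
    if state == 'S':
        count += 1                  # BusUpgr (RFO) is needed to reach M
    return count                    # from E the move to M is silent


print(bus_transactions_read_then_write('MSI', False))    # 2
print(bus_transactions_read_then_write('MESI', False))   # 1
print(bus_transactions_read_then_write('MESI', True))    # 2
```

MESI only saves the second transaction when no other cache holds the line, which is the common case for programs with little data sharing.&lt;br /&gt;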
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
State transition diagram for MESI protocol is shown below.[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
[[File:MESInew.jpg|500px|thumb|center|MESI state transition diagram]] &lt;br /&gt;
&lt;br /&gt;
===Real Architectures using MESI===&lt;br /&gt;
&lt;br /&gt;
Intel's [http://en.wikipedia.org/wiki/Pentium_Pro Pentium Pro] microprocessor, introduced in 1995, was the first Intel architecture microprocessor to support symmetric multiprocessing in various multiprocessor configurations.  SMP and the MESI protocol were used consistently until the introduction of the 45-nm Hi-k Core microarchitecture in Intel's (Nehalem-EP) quad-core x86-64 processors. The 45-nm Hi-k Intel Core microarchitecture uses a new system framework called the [http://en.wikipedia.org/wiki/Intel_QuickPath_Interconnect QuickPath Interconnect], a point-to-point interconnection technology based on a distributed shared memory architecture.  It uses a modified version of the MESI protocol called [http://en.wikipedia.org/wiki/MESIF_protocol MESIF], which introduces the additional Forward state.  (The state diagram of MESI transitions that occur within the Pentium's data cache can be found on page 63 of [http://books.google.com/books?id=TVzjEZg1--YC&amp;amp;printsec=frontcover&amp;amp;source=gbs_ge_summary_r&amp;amp;cad=0#v=onepage&amp;amp;q&amp;amp;f=false Pentium Processor System Architecture: Second Edition] by Don Anderson, Tom Shanley, MindShare, Inc.)&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
The [http://www.arm.com/products/processors/classic/arm11/arm11-mpcore.php ARM11 MPCore] and [http://en.wikipedia.org/wiki/ARM_Cortex-A9_MPCore Cortex-A9 MPCore] processors support the MESI cache coherency protocol.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]  ARM MPCore defines the states of the MESI protocol it implements as:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies in other caches'''&lt;br /&gt;
|  NO&lt;br /&gt;
|  NO&lt;br /&gt;
|  YES&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Clean or Dirty'''&lt;br /&gt;
|  DIRTY&lt;br /&gt;
|  CLEAN&lt;br /&gt;
|  CLEAN&lt;br /&gt;
|  -&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
The coherency protocol is implemented and managed by the Snoop Control Unit (SCU) in the ARM MPCore, which monitors the traffic between local L1 data caches and the next level of the memory hierarchy. At boot time, each core can choose whether or not to take part in the coherency domain.  Unless explicit system calls bind a task to a specific core ([http://en.wikipedia.org/wiki/Processor_affinity processor affinity]), there is a high chance that a task will at some point migrate to a different core, along with its data as it is used.  Task migration is not handled efficiently by a literal MESI implementation, so the ARM MPCore offers two optimizations that preserve MESI compliance while supporting migration: '''Direct Data Intervention (DDI)''' (in which the SCU keeps a copy of the tag RAMs of all cores' caches, enabling it to efficiently detect whether a cache line requested by one core is held by another core in the coherency domain before looking for it in the next level of the memory hierarchy) and '''Cache-to-cache Migration''' (in which, if the SCU finds that the cache line requested by one CPU is present in another core, it either copies it (if clean) or moves it (if dirty) from the other CPU directly into the requesting one, without interacting with external memory).  These optimizations reduce memory traffic in and out of the L1 cache subsystem by eliminating interaction with external memories, which in effect reduces the overall load on the interconnect and the overall power consumption.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
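&lt;br /&gt;
The two optimizations described above can be sketched together in a toy model. This is a Python illustration only: the class and method names are hypothetical, a dictionary per core stands in for the duplicated tag RAMs, and the states left behind after a transfer are assumptions rather than details taken from the ARM documentation.&lt;br /&gt;

```python
class SnoopControlUnit:
    """Toy SCU: consults duplicate tags before going to the next level."""

    def __init__(self, num_cores):
        # Duplicate tag store: one dict per core, address to MESI state.
        self.tags = [dict() for _ in range(num_cores)]

    def read(self, requester, addr):
        """Serve a read miss from core `requester` for line `addr`."""
        for core, tags in enumerate(self.tags):
            if core == requester or tags.get(addr, 'I') == 'I':
                continue
            if tags[addr] == 'M':
                # Dirty line: move it, the old holder gives it up.
                del tags[addr]
                self.tags[requester][addr] = 'M'
                return 'migrated from core %d' % core
            # Clean line: copy it, both copies end up Shared.
            tags[addr] = 'S'
            self.tags[requester][addr] = 'S'
            return 'copied from core %d' % core
        # No other core holds the line: fall back to the next level.
        self.tags[requester][addr] = 'E'
        return 'fetched from next memory level'


scu = SnoopControlUnit(2)
scu.tags[1][0x40] = 'M'
print(scu.read(0, 0x40))   # migrated from core 1
```

In both the copy and the move case external memory is never touched, which is where the traffic and power savings come from.&lt;br /&gt;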
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
Other architectures that use the MESI cache-coherence protocol include the L2 cache of the IBM POWER4 processor[[#References|&amp;lt;sup&amp;gt;[20]&amp;lt;/sup&amp;gt;]], the L2 cache of the Intel Itanium 2 processor[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]], and the Intel Xeon[[#References|&amp;lt;sup&amp;gt;[22]&amp;lt;/sup&amp;gt;]].&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
One implementation complexity already mentioned is the inefficiency of task migration in MESI that the ARM MPCore addresses.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]  Another possible implementation complexity is found during the replacement of a cache line: one possible MESI implementation requires a message to be sent to main memory when a cache line is flushed (i.e. an E to I transition), as the line was exclusively in one cache before it was removed. It is possible to avoid this replacement message if the system is designed so that the flush of a modified (exclusive) line requires an acknowledgment from main memory. However, this requires the flush to be stored in a 'write-back' buffer until the reply arrives (to ensure the change is successfully propagated to memory).[[#References|&amp;lt;sup&amp;gt;[23]&amp;lt;/sup&amp;gt;]]&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==MESIF==&lt;br /&gt;
Let us now briefly walk through the '''MESIF protocol''':&lt;br /&gt;
&lt;br /&gt;
The '''MESIF''' protocol, used in the latest Intel multi-core processors, was introduced to '''accommodate the point-to-point''' links used in the QuickPath Interconnect. Using the '''MESI''' protocol in this architecture would send many redundant messages between different processors, often with unnecessarily high latency. For example, when a processor requests a cache line that is stored in multiple locations, every location might respond with the data. As the requesting processor only needs a single copy of the data, the system would be wasting bandwidth. &lt;br /&gt;
As a solution to this problem, an additional state, the '''Forward state''', was added by slightly changing the role of the Shared state. Whenever there is a read request, only the cache holding the line in the F state responds to the request, while all the caches holding it in the S state remain dormant.  Hence, by designating a single cache line to '''respond to requests''', coherency traffic is substantially reduced when multiple copies of the data exist. Also, on a read request, the responding line transitions from the F state to the S state. That is, when a cache line in the '''F''' state is '''copied''', the F state '''migrates''' to the '''newer copy''', while the '''older''' one drops back to '''S'''. Moving the new copy to the F state '''exploits''' both '''temporal and spatial locality'''. Because the newest copy of the cache line is always in the F state, it is very unlikely that the line in the F state will be evicted from the caches. This takes advantage of the temporal locality of the request. The second advantage is that if a particular cache line is in high demand due to spatial locality, the bandwidth used to transmit that data will be spread across several nodes.&lt;br /&gt;
All M to S and E to S state transitions will now be '''M to F''' and '''E to F''' transitions.  &lt;br /&gt;
The '''F state''' is '''different''' from the '''Owned state''' of the MOESI protocol as it is '''not''' a unique copy because a valid copy is stored in memory. Thus, unlike the Owned state of the MOESI protocol, in which the data in the O state is the only valid copy of the data, the data in the F state can be evicted or converted to the S state, if desired. &lt;br /&gt;
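&lt;br /&gt;
The hand-over of the Forward state on a read can be sketched as follows. This is a minimal Python illustration of the behaviour described above, not Intel's implementation: the function name and the dictionary representation are assumptions, and lines held in the M state (which would also need a write-back) are deliberately left out.&lt;br /&gt;

```python
def mesif_read(states, requester):
    """Serve a read miss. `states` maps cache id to a state letter.

    Returns the id of the responding cache, or 'memory'.
    """
    responder = None
    for cache, state in states.items():
        # Only an F (or E) holder answers; S holders stay silent.
        if cache != requester and state in ('F', 'E'):
            responder = cache
            break
    if responder is None:
        sharers = [c for c, s in states.items()
                   if s == 'S' and c != requester]
        # Assumption: with no forwarder left, memory supplies the line.
        states[requester] = 'F' if sharers else 'E'
        return 'memory'
    states[responder] = 'S'      # the older copy drops back to S
    states[requester] = 'F'      # the newest copy becomes the forwarder
    return responder


caches = {0: 'F', 1: 'S', 2: 'I'}
print(mesif_read(caches, 2), caches)   # 0 {0: 'S', 1: 'S', 2: 'F'}
```

Exactly one cache answers each request, however many shared copies exist.&lt;br /&gt;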
&lt;br /&gt;
More information on the QuickPath Interconnect and MESIF protocol can be found at&lt;br /&gt;
'''[http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]'''&lt;br /&gt;
&lt;br /&gt;
===Real Architectures using MESIF===&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
&lt;br /&gt;
==MOESI==&lt;br /&gt;
The [http://en.wikipedia.org/wiki/Opteron AMD Opteron] was AMD’s first-generation dual-core processor, with 2 distinct [http://en.wikipedia.org/wiki/Athlon_64 K8 cores] together on a single die.  Cache coherence produces bigger problems on such multiprocessors, so it was necessary to use an appropriate coherence protocol. The [http://en.wikipedia.org/wiki/Xeon Intel Xeon], the competing processor from Intel, used the MESI protocol to handle cache coherence.  MESI came with the drawback of using much time and bandwidth in certain situations. &lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/MOESI_protocol MOESI] was AMD’s answer to this problem. MOESI adds a fifth state to the MESI protocol called '''“Owned”'''. MOESI addresses the bandwidth problem faced in the MESI protocol when a processor that has invalid data in its cache wants to access data modified by another processor: the processor seeking access has to wait for the processor that modified the data to write it back to main memory, which takes time and bandwidth. This drawback is removed in MOESI by allowing dirty sharing.  When the data is held by a processor in the new state '''“Owned”''', it can provide other processors with the modified data without, or even before, writing it to main memory. This is called '''''dirty sharing'''''. The processor holding the data in the '''&amp;quot;Owned&amp;quot;''' state remains responsible for updating main memory later, when the cache line is evicted.&lt;br /&gt;
&lt;br /&gt;
MOESI has become one of the most popular snoop-based protocols supported in the AMD64 architecture.  The AMD dual-core Opteron can maintain cache coherence in systems up to 8 processors using this protocol.&lt;br /&gt;
&lt;br /&gt;
The five different states of the MOESI protocol are:&lt;br /&gt;
* '''Modified (M)''' : The most recent copy of the data is present in the cache line. But it is not present in any other processor cache.&lt;br /&gt;
* '''Owned (O)''' : The cache line has the most recent, correct copy of the data. This can be shared by other processors. The processor holding this cache line in the Owned state is responsible for writing the correct value back to main memory when the line is evicted.  &lt;br /&gt;
* '''Exclusive (E)''' : A cache line holds the most recent, correct copy of the data, which is exclusively present on this processor and a copy is present in the main memory.  &lt;br /&gt;
* '''Shared (S)''' : A cache line in the shared state holds the most recent, correct copy of the data, which may be shared by other processors. &lt;br /&gt;
* '''Invalid (I)''' : A cache line does not hold a valid copy of the data.&lt;br /&gt;
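&lt;br /&gt;
Dirty sharing, as defined by the states above, can be sketched as two small functions: one for a snooped read and one for eviction. This is a Python illustration with hypothetical names. Which agent supplies a clean line is implementation dependent, so the sketch assumes memory does.&lt;br /&gt;

```python
def moesi_snoop_read(state):
    """Return (next_state, cache_supplies_data, memory_written)."""
    if state in ('M', 'O'):
        # Dirty sharing: hand over the data, defer the write-back.
        return ('O', True, False)
    if state in ('E', 'S'):
        # Clean line: assumed to be supplied by memory in this sketch.
        return ('S', False, False)
    return ('I', False, False)


def moesi_must_write_back(state):
    """Only M and O lines hold data that memory does not have."""
    return state in ('M', 'O')


print(moesi_snoop_read('M'))          # ('O', True, False)
print(moesi_must_write_back('O'))     # True
```

The write to memory happens once, at eviction, instead of on every read by another processor.&lt;br /&gt;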
&lt;br /&gt;
A detailed explanation of this protocol implementation on AMD processor can be found in the manual [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of the AMD 64-bit core]&lt;br /&gt;
&lt;br /&gt;
The following table summarizes the MOESI protocol:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Owned'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''This cache line is valid?'''&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  No&lt;br /&gt;
|-&lt;br /&gt;
|  '''The memory copy is…'''&lt;br /&gt;
|  out of date&lt;br /&gt;
|  out of date&lt;br /&gt;
|  valid&lt;br /&gt;
|  valid (unless another cache holds the line in Owned)&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies exist in caches of other processors?'''&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe (in Shared state)&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe&lt;br /&gt;
|  Maybe&lt;br /&gt;
|-&lt;br /&gt;
|  '''A write to this line'''&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  goes directly to bus&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
State transition diagram for MOESI protocol is shown below. The top part shows the response to processor-side requests and the bottom part is the response to snooper-side requests. [[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:MOESI.jpg|500px|thumb|center|MOESI state transition diagram]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br /&amp;gt;&lt;br /&gt;
&amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Optimization techniques on MOESI===&lt;br /&gt;
&lt;br /&gt;
In real machines, applying optimization techniques to the standard cache coherence protocol improves performance. For example, the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] family of microprocessors (Family 0x10), AMD’s first generation to incorporate 4 distinct cores on a single die and the first to have a cache that all the cores share, uses the MOESI protocol with some optimization techniques incorporated. &lt;br /&gt;
&lt;br /&gt;
The optimization focuses on a small subset of compute problems that behave like producer and consumer programs. In such a problem, a thread running on one core produces data, which is consumed by a thread running on a separate core. With such programs, it is desirable to get the two cores to communicate through the shared cache, to avoid round trips to and from main memory. The '''MOESI''' protocol that the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] cache uses for cache coherence can also limit bandwidth. Hence, by keeping the cache line in the '''‘M’''' state for such computing problems, we can achieve better performance.&lt;br /&gt;
&lt;br /&gt;
When the producer thread writes a new entry into the ring buffer it shares with the consumer, it allocates cache lines in the '''modified (M)''' state. Eventually, these M-marked cache lines start to fill the L3 cache. When the consumer reads a cache line, the MOESI protocol changes the state of the cache line to '''owned (O)''' in the L3 cache and pulls down a '''shared (S)''' copy for the consumer's own use. The producer thread then circles the ring buffer and arrives back at the same cache line it had previously written. However, when the producer attempts to write new data to the owned (marked '''‘O’''') cache line, it finds that it cannot, since a cache line marked '''‘O’''' by the previous consumer read does not have sufficient permission for a write request (in the MOESI protocol). To maintain coherence, the memory controller must initiate probes in the other caches (to handle any other S copies that may exist). This slows down the process.&lt;br /&gt;
&lt;br /&gt;
Thus, it is preferable to keep the cache line in the '''‘M’''' state in the L3 cache. In such a situation, when the producer comes back around the ring buffer, it finds the previously written cache line still marked '''‘M’''', to which it is safe to write without coherence concerns. Thus better performance can be achieved by such optimization techniques to standard protocols when implemented in real machines.&lt;br /&gt;
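&lt;br /&gt;
The cost difference between the two cases can be sketched by counting probes over one lap of the ring buffer. This is a Python illustration with hypothetical function names; it assumes every slot maps to its own cache line and that the consumer has read every slot before the producer returns to it.&lt;br /&gt;

```python
def producer_write_needs_probe(l3_state):
    """A write proceeds without probes only if the line is still M."""
    return l3_state != 'M'


def probes_per_lap(ring_slots, state_after_consumer_read):
    """Probes issued while the producer rewrites every slot once."""
    return sum(
        1 for _ in range(ring_slots)
        if producer_write_needs_probe(state_after_consumer_read)
    )


print(probes_per_lap(8, 'O'))   # 8: plain MOESI, lines were downgraded
print(probes_per_lap(8, 'M'))   # 0: optimized, lines were kept in M
```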
&lt;br /&gt;
You can find more information on how this is implemented and various other ways of optimizations in this manual [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Real Architectures using MOESI===&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
&lt;br /&gt;
==Dragon Protocol==&lt;br /&gt;
The '''[http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]''' is an update-based coherence protocol: unlike the coherence protocols we have seen so far, it does not invalidate other cached copies. Write propagation is achieved by updating the cached copies instead of invalidating them.  The Dragon Protocol does not update memory on a cache-to-cache transfer; it delays making memory consistent with the caches until the data is evicted and written back, which saves time and lowers the memory access requirements. Moreover, only the written '''byte''' or '''word''' is '''communicated''' to the '''other caches''' instead of the whole block, which further '''reduces''' the '''bandwidth''' usage. It can dynamically detect the sharing status of a block, using a write-through policy for shared blocks and write-back for currently non-shared blocks. The Dragon Protocol employs the following four states for cache blocks: '''Shared Clean''', '''Shared Modified''', '''Exclusive''' and  '''Modified'''. &lt;br /&gt;
* '''Modified (M)''' and '''Exclusive (E)''' - these states have the same meaning as explained in the protocols above. &lt;br /&gt;
* '''Shared Modified (Sm)''' - Only one cache in the system can hold the block in the Shared Modified state. Potentially two or more caches have this block, memory may or may not be up to date, and this processor's cache has modified the block.&lt;br /&gt;
* '''Shared Clean (Sc)''' -  Potentially two or more caches have this block and memory may or may not be up to date (if no other cache has it in Sm state, memory will be up to date else it is not).&lt;br /&gt;
Only when a Shared Modified line is evicted from the cache on a cache miss is the block written back to main memory in order to keep memory consistent. For more information on the Dragon protocol, refer to the Solihin textbook, page 229. The state transition diagram is given below for reference.[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
[[File:DragonNew.jpg|500px|thumb|center|Dragon state transition diagram]] &lt;br /&gt;
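&lt;br /&gt;
The processor-write side of the state description above can be sketched as one function. This is a minimal Python illustration, not a complete Dragon implementation: the function name is hypothetical, only write hits are covered, and the sharing status is passed in as a boolean that stands for the shared signal sampled on the bus.&lt;br /&gt;

```python
def dragon_processor_write(state, shared_line_asserted):
    """Return (next_state, bus_action) for a write hit."""
    if state in ('E', 'M'):
        # Private block: write-back policy, nothing appears on the bus.
        return ('M', None)
    if state in ('Sc', 'Sm'):
        # Possibly shared block: broadcast only the written word.
        if shared_line_asserted:
            return ('Sm', 'BusUpd')   # the other copies are updated
        return ('M', 'BusUpd')        # no sharers left: block is private
    raise ValueError('not a valid Dragon state: ' + state)


print(dragon_processor_write('Sc', True))    # ('Sm', 'BusUpd')
print(dragon_processor_write('E', False))    # ('M', None)
```

No state in this sketch ever leads to an invalidation, which is the defining difference from the MSI family.&lt;br /&gt;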
&lt;br /&gt;
The Dragon protocol implements snoopy caches that provide the appearance of a uniform memory space to multiple processors. Here, each cache listens to 2 buses: the processor bus and the memory bus. The caches are also responsible for address translation, so the processor bus carries virtual addresses and the memory bus carries physical addresses.  The Dragon system was designed to support 4 to 8 Dragon processors.&lt;br /&gt;
&lt;br /&gt;
=[http://en.wikipedia.org/wiki/Instruction_prefetch Instruction Prefetching]=&lt;br /&gt;
&lt;br /&gt;
In standard [http://en.wikipedia.org/wiki/Pipeline_(computing) CPU pipelining], the optimization of instruction prefetching is used to minimize the time the CPU spends waiting on instructions being fetched from main memory.  Prefetching works perfectly on completely linear programs; but on programs with branches, other optimizations (like [http://en.wikipedia.org/wiki/Branch_predictor branch prediction]) have to be implemented, and the cost of branching the wrong way has to be considered.  Source:  http://dictionary.reference.com/browse/instruction+prefetch&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
David F. Kotz and Carla Schlatter Ellis, &amp;quot;Prefetching in File Systems for MIMD Multiprocessors,&amp;quot; in IEEE Transactions on Parallel and Distributed Systems, Vol. 1, No. 2,  April 1990, pp. 218-230.&lt;br /&gt;
&lt;br /&gt;
The optimization in prefetching is that it brings into the cache additional blocks surrounding the blocks currently being accessed by the processor. Kotz and Ellis (cited above) state that simply increasing the hit ratio does not automatically improve overall performance in multiprocessor computation.  &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Instruction prefetching is a technique used to speed up the execution of a program. But in multiprocessors, prefetching comes at a cost. Due to prefetching, data can be modified in such a way that the memory coherence protocol is not able to handle the effects. In such situations software must use serializing instructions or cache-invalidation instructions to guarantee that subsequent data accesses are coherent. &lt;br /&gt;
&amp;lt;br /&amp;gt;&lt;br /&gt;
An example of this type of a situation is a page-table update followed by accesses to the physical pages referenced by the updated page tables. The physical-memory references for the page tables are different than the physical-memory references for the data. Because of prefetching there may be problems with correctness. The following sequence of events shows such a situation when software changes the translation of virtual-page A from physical-page M to physical-page N:&lt;br /&gt;
# The tables that translate virtual-page A to physical-page M are now held only in main memory. The copies in the cache are invalidated.&lt;br /&gt;
# The software changes the page-table entry for virtual-page A in main memory to point to physical-page N rather than physical-page M.&lt;br /&gt;
# Data in virtual-page A is accessed.&lt;br /&gt;
&lt;br /&gt;
Software expects the processor to access the data from physical-page N after the update. However, it is possible for the processor to prefetch the data from physical-page M before the page table for virtual page A is updated. Because the physical-memory references are different, the processor does not recognize them as requiring coherence checking and believes it is safe to prefetch the data from virtual-page A, which is translated into a read from physical page M. Similar behavior can occur when instructions are prefetched from beyond the page-table update instruction.&lt;br /&gt;
&lt;br /&gt;
To prevent such errors, software must execute an invalidation instruction (on AMD64, INVLPG or a MOV to CR3) immediately after the page-table update, which ensures that subsequent instruction fetches and data accesses use the correct virtual-page-to-physical-page translation. It is not necessary to perform a TLB invalidation operation preceding the table update.&lt;br /&gt;
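&lt;br /&gt;
The hazard and its fix can be imitated with a toy model. This Python sketch is only an analogy for the sequence above: the class is hypothetical, a dictionary that remembers earlier lookups stands in for translations the processor obtained ahead of the page-table update, and the invalidate method stands in for the instruction executed after the update.&lt;br /&gt;

```python
class ToyMMU:
    """Toy model: a page table in memory plus remembered translations."""

    def __init__(self):
        self.page_table = {}   # virtual page to physical page, in memory
        self.cached = {}       # translations obtained ahead of time

    def translate(self, vpage):
        # Reuse an earlier translation if one exists (the hazard).
        if vpage not in self.cached:
            self.cached[vpage] = self.page_table[vpage]
        return self.cached[vpage]

    def invalidate(self, vpage):
        # Stand-in for the invalidation issued after a table update.
        self.cached.pop(vpage, None)


mmu = ToyMMU()
mmu.page_table['A'] = 'M'
mmu.translate('A')                 # translation obtained early
mmu.page_table['A'] = 'N'          # software updates the page table
print(mmu.translate('A'))          # M: stale, not what software expects
mmu.invalidate('A')
print(mmu.translate('A'))          # N: correct after invalidation
```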
&lt;br /&gt;
More information can be found about this in [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual]&lt;br /&gt;
&lt;br /&gt;
= CMP Implementation in Intel Architecture =&lt;br /&gt;
&lt;br /&gt;
Let us now see how Intel architecture using the MESI protocol progressed from a uniprocessor architecture to a Chip MultiProcessor (CMP) using the bus as the interconnect. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Uniprocessor Architecture'''&lt;br /&gt;
&lt;br /&gt;
The diagram below shows the structure of the memory cluster in Intel Pentium M processor.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:intel_cache1.jpg]]&amp;lt;/center&amp;gt; &amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In this structure we have,&lt;br /&gt;
* A unified on-chip '''L1 cache''' with the '''processor/core''',&lt;br /&gt;
* A '''Memory/L2 access control unit''', through which all the accesses to the L2 cache, main memory and IO space are made,&lt;br /&gt;
* The second level '''L2 cache''' along with the '''prefetch unit''' and&lt;br /&gt;
* '''Front side bus (FSB)''', a single shared bi-directional bus through which all the traffic is sent. This wide bus brings in multiple data bytes at a time. &lt;br /&gt;
&lt;br /&gt;
As Intel explains it, using this structure, the processor requests were first sought in the '''L2 cache''' and only on a '''miss''' were they '''forwarded''' to the main '''memory''' via the front side bus ('''FSB'''). The '''Memory/L2 access control''' unit served as a central point for '''maintaining coherence''' within the core and with the external world. It '''contains''' a '''snoop control unit''' that receives snoop requests from the bus and performs the required operations on each cache (and internal buffers) in parallel. It also handles RFO requests (BusUpgr) and ensures the operation continues only after it guarantees that no other version of the cache line exists in any other cache in the system.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''CMP Architecture'''&lt;br /&gt;
&lt;br /&gt;
For its CMP implementation, Intel chose the bus-based architecture using snoopy protocols over the '''directory protocol''' because, although the directory protocol reduces active power due to reduced snoop activity, it '''increases''' the '''design complexity''' and the '''static power''' due to larger tag arrays. Since Intel has a large market for processors in the mobility family, a directory-based solution was less favorable: battery life depends mainly on static power consumption and less on dynamic power.&lt;br /&gt;
Let us examine how '''CMP''' was implemented in the '''Intel Core Duo''', which was one of the first dual-core processors for the budget/entry-level market. &lt;br /&gt;
The general CMP implementation structure of the Intel Core Duo is shown below.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:intel_cache2.jpg]]&amp;lt;/center&amp;gt; &amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This structure has the following changes when compared to the uniprocessor memory cluster structure. &lt;br /&gt;
* '''L1 cache''' and the '''processor/core''' structure is '''duplicated''' to give 2 cores.&lt;br /&gt;
* The '''Memory/L2 access control''' unit is '''split''' into 2 logical units: '''L2 controller''' and '''bus controller'''. The L2 controller handles all '''requests to the L2''' cache from the core and the snoop requests from the FSB. The '''bus controller''' handles '''data and I/O requests''' to and from the FSB.&lt;br /&gt;
* The '''prefetching''' unit is extended to handle the hardware '''prefetches for each core separately'''.&lt;br /&gt;
* A '''new logical unit''' (represented by the hexagon) was added to maintain '''fairness between the requests''' coming from the different cores and hence balance the requests to L2 and memory.&lt;br /&gt;
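&lt;br /&gt;
One way such a fairness unit could balance requests is round-robin arbitration. The Python sketch below is purely an assumption for illustration: the source does not state which arbitration policy the Core Duo uses, and the function name and request labels are hypothetical.&lt;br /&gt;

```python
from collections import deque


def round_robin(queues):
    """Interleave per-core request queues, one request per core per turn."""
    pending = [deque(q) for q in queues]
    order = []
    while any(pending):                    # an empty deque is falsy
        for core, queue in enumerate(pending):
            if queue:
                order.append((core, queue.popleft()))
    return order


# Core 0 has three pending L2 requests, core 1 has one.
print(round_robin([['a', 'b', 'c'], ['x']]))
# [(0, 'a'), (1, 'x'), (0, 'b'), (0, 'c')]
```

With this policy a core that issues many requests cannot starve the other core, at the price of not always serving the oldest request first.&lt;br /&gt;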
&lt;br /&gt;
This new '''partitioned structure''' for the  memory/L2 access control unit '''enhanced''' the '''performance''' while '''reducing power consumption'''. &lt;br /&gt;
For more information on uniprocessor and multiprocessor implementation under the Intel architecture, refer to &lt;br /&gt;
[http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]&lt;br /&gt;
&lt;br /&gt;
The '''Intel bus architecture''' has been '''evolving''' in order to accommodate the demands of scalability while using the same MESI protocol: from a '''single shared bus''', to '''dual independent buses (DIB)''' that double the available bandwidth, to the logical conclusion of DIB with the introduction of '''dedicated high-speed interconnects (DHSI)'''. The DHSI-based platforms use four FSBs, one for each processor in the platform. In both DIB and DHSI, a snoop filter is used in the chipset to cache snoop information, thereby significantly reducing the broadcasting needed for the snoop traffic on the buses. With the production of processors based on the next-generation 45-nm Hi-k Intel Core microarchitecture, the [http://en.wikipedia.org/wiki/Xeon Intel Xeon] processor fabric will transition from DHSI, with the memory controller in the chipset, to a distributed shared memory architecture using '''Intel QuickPath Interconnects with the MESIF protocol'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Word Invalidation=&lt;br /&gt;
One complexity problem applying to a number of the protocols deals with invalidation.  In newer protocols, individual words may be modified in a cache line, as opposed to the entirety of the line.  The other processors will thus have a mostly correct cache line, with only a word difference.  This leads to potential complexity, because to 'correct' the error the other processors must transition from shared to invalid, where they then can read the correct word from the bus and place it back into the line, transitioning back into shared.  This transition is potentially unnecessary, as the second processor may never access the specific word changed, but it may access other words in the cache line.  One potential solution to this is being researched at the present time, advancing the protocols so that a cache line is &amp;quot;not invalidated on the first dirty word, but after the number of dirty words crosses some predetermined value, which is data type and application dependent.&amp;quot;  In other words, if the application can possibly have multiple words 'incorrect,' several transitions to and from the invalid state may be avoided.&lt;br /&gt;
&amp;lt;br/&amp;gt;Source: http://tab.computer.org/tcca/NEWS/sept96/dsmideas.ps&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In essence, the solution proposed here is to advance the MOESI protocol with word invalidation and specific treatment of temporal and spatial data, so that the block is not invalidated.&lt;br /&gt;
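&lt;br /&gt;
The threshold idea quoted above can be illustrated with a minimal sketch. It is not taken from the cited source: the class name, the per-line set of stale words, and the choice to invalidate exactly when the count reaches the threshold are all assumptions made for illustration.&lt;br /&gt;

```python
# Hypothetical sketch of threshold-based word invalidation (illustrative names).
class SharedLine:
    def __init__(self, threshold):
        self.state = "S"
        self.dirty_words = set()     # word indexes made stale by remote writes
        self.threshold = threshold   # assumed to be tuned per data type / application

    def remote_write(self, word_index):
        """Another processor wrote one word of this line."""
        if self.state == "I":
            return
        self.dirty_words.add(word_index)
        if len(self.dirty_words) == self.threshold:
            self.state = "I"         # too many stale words: invalidate the whole line

    def can_read(self, word_index):
        """A local read hits only if the line is valid and the word is not stale."""
        return self.state == "S" and word_index not in self.dirty_words


line = SharedLine(threshold=3)
line.remote_write(2)
still_valid = line.can_read(0)       # other words of the line remain readable
stale_word = line.can_read(2)        # the remotely written word is not
line.remote_write(5)
line.remote_write(7)
final_state = line.state             # third stale word invalidates the line
```

With a threshold of 1 the sketch degenerates to ordinary whole-line invalidation on the first remote write.&lt;br /&gt;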
&lt;br /&gt;
=Performance: MOESI vs MEI/MESI=&lt;br /&gt;
Early multiprocessors (such as the PowerPC processors) were designed to work with three states (Modified, Exclusive, Invalid; this is similar to the MSI protocol discussed earlier, and the term MEI is used to remain consistent with the sources used for reference).  As time progressed, more multiprocessors transitioned to the MESI protocol.  This is most likely due to the characteristics of MEI: it is &amp;quot;easy to implement but leads to inefficiencies in the way that memory bus bandwidth is used&amp;quot; [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 source].  As a result, systems with two or more MEI processors will most likely not see the full potential of the extra processors, due to the inefficient bus transactions.&lt;br /&gt;
&lt;br /&gt;
In contrast, the five-state MOESI protocol is more complex to implement than MESI and MEI/MSI, but it brings advantages.  As Andy Keane (VP of Marketing for PMC-Sierra) put it, &amp;quot;this [fifth] state allows shared data that is dirty to remain in the cache.  Without this state, any shared line would be written back to memory to change the original state from modified. Since we have a dedicated, fast path from CPU to CPU, this state maximizes the use of this path rather than the path to memory, which is inherently slower&amp;quot; [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 source].  This advantage is realized by letting MOESI serve some processor-to-processor transfers directly from the level 2 caches.&lt;br /&gt;
&lt;br /&gt;
Source: [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 Edwards, Chris (08/06/2001). &amp;quot;Architects wrestle with multiprocessor options&amp;quot;. Electronic engineering times (0192-1541), (1178), p. 48.]&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Cache_coherence Cache coherence]&lt;br /&gt;
# [http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]&lt;br /&gt;
# [http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Symmetric_multiprocessing Common System Interface in Intel Processors]&lt;br /&gt;
# [http://www.zak.ict.pwr.wroc.pl/nikodem/ak_materialy/Cache%20consistency%20&amp;amp;%20MESI.pdf Cache consistency with MESI on Intel processor]&lt;br /&gt;
# [http://techreport.com/articles.x/8236/2 AMD dual core Architecture]&lt;br /&gt;
# [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual]&lt;br /&gt;
# [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors]&lt;br /&gt;
# [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of AMD 64 bit core]&lt;br /&gt;
# [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&amp;amp;arnumber=4913 Silicon Graphics Computer Systems]&lt;br /&gt;
# [http://books.google.com/books?id=g82fofiqa5IC&amp;amp;printsec=frontcover&amp;amp;dq=Parallel+computer+architecture:+a+hardware/software+approach+By+David+E.+Culler,+Jaswinder+Pal+Singh,+Anoop+Gupta&amp;amp;source=bl&amp;amp;ots=COrdamlfVn&amp;amp;sig=YcugVqbzTjHvlofvaFq6Ft_tjfY&amp;amp;hl=en&amp;amp;ei=0ZO6S4TJGcOclgejzI3BBw&amp;amp;sa=X&amp;amp;oi=book_result&amp;amp;ct=result&amp;amp;resnum=1&amp;amp;ved=0CAgQ6AEwAA#v=onepage&amp;amp;q=&amp;amp;f=false Parallel computer architecture: a hardware/software approach By David E. Culler, Jaswinder Pal Singh, Anoop Gupta]&lt;br /&gt;
# [http://www.freepatentsonline.com/5283886.html Three state invalidation protocols]&lt;br /&gt;
# [http://portal.acm.org/citation.cfm?id=1499317&amp;amp;dl=GUIDE&amp;amp;coll=GUIDE&amp;amp;CFID=83027384&amp;amp;CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems]&lt;br /&gt;
# [http://portal.acm.org/citation.cfm?id=6514 Cache Coherence protocols: evaluation using a multiprocessor simulation model]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Xerox_Dragon Xerox Dragon]&lt;br /&gt;
# [http://thanaseto.110mb.com/courses/CSD-527-report-engl.pdf Coherence Protocols]&lt;br /&gt;
# [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&amp;amp;arnumber=289691 XDBus]&lt;br /&gt;
# [http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dai0228a/index.html ARM]&lt;br /&gt;
# [http://ixbtlabs.com/articles/ibmpower4/index.html IBM POWER4]&lt;br /&gt;
# Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems. Solihin, Yan&lt;br /&gt;
# [http://www.chiplist.com/Intel_Itanium_2_processor/tree3f-section--2235-/ Intel Itanium 2]&lt;br /&gt;
# [http://techreport.com/articles.x/8236/2 MESI-MESI-MOESI Banana-fana...]&lt;br /&gt;
# [http://rsim.cs.illinois.edu/rsim/Manual/node109.html RSIM]&lt;br /&gt;
# [http://dl.acm.org/citation.cfm?id=1499317 Synapse tightly coupled multiprocessors]&lt;/div&gt;</summary>
		<author><name>Pxu</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8a_fu&amp;diff=59839</id>
		<title>CSC/ECE 506 Spring 2012/8a fu</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8a_fu&amp;diff=59839"/>
		<updated>2012-03-19T01:51:59Z</updated>

		<summary type="html">&lt;p&gt;Pxu: /* MOESI */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Introduction to bus-based cache coherence in real machines=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==SMP Architecture==&lt;br /&gt;
Most parallel software in the commercial market relies on the shared-memory programming model, in which all processors access the same physical address space. The most common multiprocessors today use the [http://en.wikipedia.org/wiki/Symmetric_multiprocessing SMP] architecture, which uses a common bus as the interconnect.  In the case of multicore processors (&amp;quot;chip multiprocessors,&amp;quot; or CMP), the SMP architecture applies to the cores, treating them as separate processors. The key problem of shared-memory multiprocessors is providing a consistent view of memory across the various cache hierarchies.  This is called the '''''cache coherence problem'''''. Solving it is critical for correctness, and it is a performance-sensitive design point in supporting the shared-memory model. The cache coherence mechanisms not only govern communication in a shared-memory multiprocessor, but also typically determine how the memory system transfers data between processors, caches, and memory.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:Busbased SMP.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
At any point in logical time, the permissions for a cache block can allow either a single writer or multiple readers. The '''''coherence protocol''''' ensures that the invariants of the states are maintained. The coherence states used by most cache coherence protocols are shown in ''Table 1'':&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''States'''&lt;br /&gt;
|  '''Access Type'''&lt;br /&gt;
|  '''Invariant'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  read, write&lt;br /&gt;
|  all other caches in I state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  read&lt;br /&gt;
|  all other caches in I state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Owned'''&lt;br /&gt;
|  read&lt;br /&gt;
|  all other caches in I or S state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  read&lt;br /&gt;
|  no other cache in M or E state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|  -&lt;br /&gt;
|  -&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Snooping Protocols=&lt;br /&gt;
==MSI Protocol==&lt;br /&gt;
&lt;br /&gt;
'''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' is a three-state write-back '''invalidation protocol''' and one of the earliest snooping-based cache coherence protocols. It marks each cache line as '''Modified (M)''', '''Shared (S)''', or '''Invalid (I)'''. '''Invalid''' means the cache line is either not present or is in the invalid state. If the cache line is clean and may be shared by more than one processor, it is marked '''Shared'''. If the cache line is dirty and the processor has exclusive ownership of it, it is in the '''Modified''' state. A BusRdX causes the other caches to invalidate (demote) their copies to the '''I''' state; a cache that holds the line in the '''M''' state flushes it first. A processor write, even if it hits a line in the '''S''' state, posts a BusRdX, after which the line is promoted (upgraded) to the '''M''' state.&lt;br /&gt;
&lt;br /&gt;
The state diagram of the MSI protocol is shown below. Note that the left part shows the response to processor-side requests, and the right part shows the response to snooper-side requests.[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
[[File:MSInew.jpg|600px|thumb|center|MSI state transition diagram]]&lt;br /&gt;
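&lt;br /&gt;
The MSI transitions described above can also be written as a lookup table. The following is a minimal sketch of the textbook MSI protocol, not of any particular machine; the event names (PrRd, PrWr, BusRd, BusRdX) follow the usual textbook convention and the function name is illustrative.&lt;br /&gt;

```python
# MSI next-state table: (current state, event) maps to (next state, bus action).
# PrRd/PrWr are processor-side requests; BusRd/BusRdX are snooped bus requests.
MSI = {
    ("I", "PrRd"):   ("S", "BusRd"),
    ("I", "PrWr"):   ("M", "BusRdX"),
    ("I", "BusRd"):  ("I", None),
    ("I", "BusRdX"): ("I", None),
    ("S", "PrRd"):   ("S", None),
    ("S", "PrWr"):   ("M", "BusRdX"),   # write hit in S still goes to the bus
    ("S", "BusRd"):  ("S", None),
    ("S", "BusRdX"): ("I", None),       # another writer: invalidate our copy
    ("M", "PrRd"):   ("M", None),
    ("M", "PrWr"):   ("M", None),
    ("M", "BusRd"):  ("S", "Flush"),    # supply the dirty data, keep a clean copy
    ("M", "BusRdX"): ("I", "Flush"),
}


def step(state, event):
    """Return (next state, bus action) for one cache line."""
    return MSI[(state, event)]
```

For example, step("S", "PrWr") returns the pair ("M", "BusRdX"), which is the upgrade case discussed above.&lt;br /&gt;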
&lt;br /&gt;
&lt;br /&gt;
===Real Architectures using Synapse===&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
&lt;br /&gt;
==Synapse protocol==&lt;br /&gt;
From the state transition diagram of MSI, we observe that there is a transition from state '''M''' to state '''S''' when a BusRd is observed for that block. The contents of the block are flushed to the bus before the line goes to the '''S''' state. In certain cases it would be more appropriate to move to the '''I''' state, thus giving up the block entirely. The choice between '''S''' and '''I''' reflects the designer's assertion about whether the original processor is more likely to continue reading the block or the new processor is more likely to write to it. The Synapse protocol, used in the early Synapse multiprocessor, made the alternate choice of going directly from the '''M''' state to the '''I''' state on a BusRd, assuming the migratory pattern would be more frequent. More details about this protocol can be found in these papers published in the 1980s: [http://portal.acm.org/citation.cfm?id=6514 Cache Coherence protocols: evaluation using a multiprocessor simulation model] and [http://portal.acm.org/citation.cfm?id=1499317&amp;amp;dl=GUIDE&amp;amp;coll=GUIDE&amp;amp;CFID=83027384&amp;amp;CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems].&lt;br /&gt;
&lt;br /&gt;
In Synapse protocol '''M''' state is called '''D''' (Dirty) state. The following is the state transition diagram for Synapse protocol:&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:Synapse1.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
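The single design difference described above can be captured in a few lines. This is only a sketch of that one transition, not of the full Synapse protocol; the function name is illustrative, and the dirty state is written as M for both protocols even though Synapse calls it D.&lt;br /&gt;

```python
def snooped_bus_read(protocol, state):
    """Next state of a line when another cache's BusRd is snooped."""
    if state != "M":
        return state                  # S stays S, I stays I
    # A dirty line is flushed in both protocols; they differ in what is kept.
    return "I" if protocol == "synapse" else "S"
```

Under a migratory access pattern the new reader soon writes the block, so giving it up immediately saves the later invalidation.&lt;br /&gt;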
===Real Architectures using Synapse===&lt;br /&gt;
No information could be found on any other actual architecture using the Synapse cache-coherence protocol.  The abstract of the 1984 paper by Steve Frank and Armond Inselberg, &amp;quot;[http://dl.acm.org/citation.cfm?id=1499317 Synapse Tightly Coupled Multiprocessors: A New Approach to Solve Old Problems]&amp;quot;, suggests that Synapse was largely theoretical.&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The [http://www.disi.unige.it/person/DelzannoG/CacheProtocol/synapse.hy BABYLON Project] provides one example of a Synapse cache coherence protocol with atomic synchronization actions.&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
In the Synapse system, memory contention is reduced by designing a processor cache that employs a non-write-through algorithm to minimize bandwidth between cache and shared memory. The Synapse Expansion Bus includes an ownership-level protocol between processor caches.[[#References|&amp;lt;sup&amp;gt;[24]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
==[http://en.wikipedia.org/wiki/MESI_protocol MESI]==&lt;br /&gt;
The main drawback of MSI is that each read-write sequence incurs two bus transactions, irrespective of whether the cache line is stored in only one cache or not. Highly parallel programs that have little data sharing suffer the most from this.  The MESI protocol solves this problem by introducing the '''Exclusive''' state to distinguish between a cache line stored in multiple caches and a line stored in a single cache.&lt;br /&gt;
Let us briefly see how the MESI protocol works. For a more detailed version, refer to the Solihin textbook, p. 215.&lt;br /&gt;
&lt;br /&gt;
The MESI coherence protocol marks each cache line as being in one of the Modified, Exclusive, Shared, or Invalid states. &lt;br /&gt;
* '''Invalid''' : The cache line is either not present or is invalid&lt;br /&gt;
* '''Exclusive''' : The cache line is clean and is owned by this core/processor only&lt;br /&gt;
* '''Modified''' : The cache line is dirty and the core/processor has exclusive ownership of the cache line, exclusive of the memory also.&lt;br /&gt;
* '''Shared''' : The cache line is clean and is shared by more than one core/processor&lt;br /&gt;
&lt;br /&gt;
The MESI protocol works as follows: &lt;br /&gt;
A line that is fetched receives the '''E''' or '''S''' state, depending on whether it already exists in the caches of other processors in the system. A cache line gets the '''M''' state when a processor writes to it; if the line is not in the '''E''' or '''M''' state prior to the write, the cache sends a Bus Upgrade (BusUpgr) signal or, as the Intel manuals term it, a “Read-For-Ownership (RFO) request”, which ensures that the line exists in this cache and is in the '''I''' state in all other processors on the bus (if any). The table below summarizes the MESI protocol.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''This cache line is valid?'''&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  No&lt;br /&gt;
|-&lt;br /&gt;
|  '''The memory copy is…'''&lt;br /&gt;
|  out of date&lt;br /&gt;
|  valid&lt;br /&gt;
|  valid&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies exist in caches of other processors?'''&lt;br /&gt;
|  No&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe&lt;br /&gt;
|  Maybe&lt;br /&gt;
|-&lt;br /&gt;
|  '''A write to this line'''&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  goes directly to bus&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
State transition diagram for MESI protocol is shown below.[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
[[File:MESInew.jpg|500px|thumb|center|MESI state transition diagram]] &lt;br /&gt;
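&lt;br /&gt;
The saving that the Exclusive state brings over MSI can be checked with a small count of bus transactions for a read miss followed by a write to the same line. This is a simplified sketch under the assumption that every invalidation or upgrade counts as one bus transaction; the function and parameter names are illustrative.&lt;br /&gt;

```python
def bus_transactions(protocol, other_sharers):
    """Bus transactions for a read miss followed by a write to the same line."""
    count = 1                         # BusRd for the initial read miss
    if protocol == "MESI" and not other_sharers:
        state = "E"                   # no other copy exists: load as Exclusive
    else:
        state = "S"                   # MSI has no E state; MESI uses S when shared
    if state == "S":
        count += 1                    # BusUpgr/BusRdX to invalidate other copies
    return count                      # an E line is upgraded to M silently
```

For unshared data the count is 2 under MSI and 1 under MESI; when other sharers exist, both protocols need the second transaction.&lt;br /&gt;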
&lt;br /&gt;
===Real Architectures using MESI===&lt;br /&gt;
&lt;br /&gt;
Intel's [http://en.wikipedia.org/wiki/Pentium_Pro Pentium Pro] microprocessor, introduced in 1995, was the first Intel architecture microprocessor to support symmetric multiprocessing in various multiprocessor configurations.  SMP and the MESI protocol were used consistently until the introduction of the 45-nm Hi-k Core microarchitecture in Intel's (Nehalem-EP) quad-core x86-64. The 45-nm Hi-k Intel Core microarchitecture utilizes a new system framework called the [http://en.wikipedia.org/wiki/Intel_QuickPath_Interconnect QuickPath Interconnect], which uses point-to-point interconnection technology based on a distributed shared memory architecture.  It uses a modified version of the MESI protocol called [http://en.wikipedia.org/wiki/MESIF_protocol MESIF], which introduces the additional Forward state.  (The state diagram of MESI transitions that occur within the Pentium's data cache can be found on page 63 of [http://books.google.com/books?id=TVzjEZg1--YC&amp;amp;printsec=frontcover&amp;amp;source=gbs_ge_summary_r&amp;amp;cad=0#v=onepage&amp;amp;q&amp;amp;f=false Pentium Processor System Architecture: Second Edition] by Don Anderson, Tom Shanley, MindShare, Inc.)&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
The [http://www.arm.com/products/processors/classic/arm11/arm11-mpcore.php ARM11 MPCore] and [http://en.wikipedia.org/wiki/ARM_Cortex-A9_MPCore Cortex-A9 MPCore] processors support the MESI cache coherency protocol.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]  ARM MPCore defines the states of the MESI protocol it implements as:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies in other caches'''&lt;br /&gt;
|  NO&lt;br /&gt;
|  NO&lt;br /&gt;
|  YES&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Clean or Dirty'''&lt;br /&gt;
|  DIRTY&lt;br /&gt;
|  CLEAN&lt;br /&gt;
|  CLEAN&lt;br /&gt;
|  -&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
The coherency protocol is implemented and managed by the Snoop Control Unit (SCU) in the ARM MPCore, which monitors the traffic between local L1 data caches and the next level of the memory hierarchy. At boot time, each core can choose whether or not to take part in the coherency domain.  Unless explicit system calls bind a task to a specific core ([http://en.wikipedia.org/wiki/Processor_affinity processor affinity]), there is a high chance that a task will at some point migrate to a different core, along with its data as it is used.  A literal MESI implementation does not handle task migration efficiently, so the ARM MPCore offers two optimizations that allow for MESI compliance and migration of tasks: '''Direct Data Intervention (DDI)''' (in which the SCU keeps a copy of the tag RAMs of all cores' caches, which enables it to efficiently detect whether a cache line requested by a core is held by another core in the coherency domain before looking for it in the next level of the memory hierarchy) and '''Cache-to-cache Migration''' (where, if the SCU finds that the cache line requested by one CPU is present in another core, it will either copy it (if clean) or move it (if dirty) from the other CPU directly into the requesting one, without interacting with external memory).  These optimizations reduce memory traffic in and out of the L1 cache subsystem by eliminating interaction with external memories, which in effect reduces the overall load on the interconnect and the overall power consumption.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
Other architectures that use the MESI cache-coherence protocol include the L2 cache of the IBM POWER4 processor[[#References|&amp;lt;sup&amp;gt;[20]&amp;lt;/sup&amp;gt;]], the L2 cache of the Intel Itanium 2 processor[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]], and the Intel Xeon[[#References|&amp;lt;sup&amp;gt;[22]&amp;lt;/sup&amp;gt;]].&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
One implementation complexity already mentioned is the inefficiency of task migration in MESI that the ARM MPCore addresses.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]  Another possible implementation complexity is found during the replacement of a cache line: one possible MESI implementation requires a message to be sent to main memory when a cache line is flushed (i.e. an E to I transition), as the line was exclusively in one cache before it was removed. It is possible to avoid this replacement message if the system is designed so that the flush of a modified (exclusive) line requires an acknowledgment from main memory. However, this requires the flush to be stored in a 'write-back' buffer until the reply arrives (to ensure the change is successfully propagated to memory).[[#References|&amp;lt;sup&amp;gt;[23]&amp;lt;/sup&amp;gt;]]&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==MESIF==&lt;br /&gt;
Let us now briefly walk through the '''MESIF protocol''':&lt;br /&gt;
&lt;br /&gt;
The '''MESIF''' protocol, used in the latest Intel multi-core processors, was introduced to '''accommodate the point-to-point''' links used in the QuickPath Interconnect. Using the '''MESI''' protocol in this architecture would send many redundant messages between different processors, often with unnecessarily high latency. For example, when a processor requests a cache line that is stored in multiple locations, every location might respond with the data. As the requesting processor only needs a single copy of the data, the system would be wasting bandwidth. &lt;br /&gt;
As a solution to this problem, an additional state, the '''Forward state''', was added by slightly changing the role of the Shared state. Whenever there is a read request, only the cache holding the line in the F state responds to the request, while all the caches holding it in the S state remain dormant.  Hence, by designating a single cache line to '''respond to requests''', coherency traffic is substantially reduced when multiple copies of the data exist. Also, on a read request, the responding line transitions from the F state to the S state. That is, when a cache line in the '''F''' state is '''copied''', the F state '''migrates''' to the '''newer copy''', while the '''older''' one drops back to '''S'''. Moving the new copy to the F state '''exploits''' both '''temporal and spatial locality'''. Because the newest copy of the cache line is always in the F state, it is very unlikely that the line in the F state will be evicted from the caches. This takes advantage of the temporal locality of the request. The second advantage is that if a particular cache line is in high demand due to spatial locality, the bandwidth used to transmit that data will be spread across several nodes.&lt;br /&gt;
All M to S and E to S state transitions now become '''M to F''' and '''E to F''' transitions.  &lt;br /&gt;
The '''F state''' is '''different''' from the '''Owned state''' of the MOESI protocol in that it is '''not''' a unique copy: a valid copy is also stored in memory. Thus, unlike a line in the Owned state of the MOESI protocol, for which memory does not hold a valid copy, the data in the F state can be evicted or converted to the S state, if desired. &lt;br /&gt;
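&lt;br /&gt;
The forwarding rule can be sketched as follows for a single clean line that is already shared. The dictionary layout and function name are illustrative, and the fallback in which memory supplies the data and the requester becomes the forwarder when no F copy exists is an assumption of this sketch rather than a statement about Intel's implementation.&lt;br /&gt;

```python
def serve_read(states, requester):
    """states maps a cache name to "F", "S" or "I" for one clean, shared line.
    Returns the single source that answers the read request."""
    forwarders = [cache for cache, state in states.items() if state == "F"]
    if forwarders:
        source = forwarders[0]
        states[source] = "S"          # old forwarder drops back to Shared
    else:
        source = "memory"             # assumed fallback when no F copy exists
    states[requester] = "F"           # the newest copy becomes the forwarder
    return source


line = {"c0": "F", "c1": "S", "c2": "I"}
first = serve_read(line, "c2")        # only c0 answers; c1 stays silent
```

However many S copies exist, exactly one responder supplies the data, which is the traffic reduction described above.&lt;br /&gt;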
&lt;br /&gt;
More information on the QuickPath Interconnect and MESIF protocol can be found at&lt;br /&gt;
'''[http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]'''&lt;br /&gt;
&lt;br /&gt;
===Real Architectures using Synapse===&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
&lt;br /&gt;
==MOESI==&lt;br /&gt;
The [http://en.wikipedia.org/wiki/Opteron AMD Opteron] was AMD’s first-generation dual-core processor, with 2 distinct [http://en.wikipedia.org/wiki/Athlon_64 K8 cores] together on a single die.  Cache coherence produces bigger problems on such multiprocessors, and an appropriate coherence protocol was needed to address them. The [http://en.wikipedia.org/wiki/Xeon Intel Xeon], the competing counterpart from Intel, used the MESI protocol to handle cache coherence.  MESI came with the drawback of using much time and bandwidth in certain situations. &lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/MOESI_protocol MOESI] was AMD’s answer to this problem. MOESI adds a fifth state, called '''“Owned”''', to the MESI protocol. MOESI addresses the bandwidth problem faced in the MESI protocol when a processor holding an invalid copy in its cache wants to access data that another processor has modified.  In MESI, the processor seeking access has to wait for the processor that modified the data to write it back to main memory, which takes time and bandwidth. MOESI removes this drawback by allowing dirty sharing.  When a processor holds the data in the new '''“Owned”''' state, it can provide the modified data to other processors without, or even before, writing it to main memory. This is called '''''dirty sharing'''''. The processor holding the data in the '''&amp;quot;Owned&amp;quot;''' state remains responsible for updating main memory later, when the cache line is evicted.&lt;br /&gt;
&lt;br /&gt;
MOESI has become one of the most popular snoop-based protocols supported in the AMD64 architecture.  The AMD dual-core Opteron can maintain cache coherence in systems up to 8 processors using this protocol.&lt;br /&gt;
&lt;br /&gt;
The five different states of the MOESI protocol are:&lt;br /&gt;
* '''Modified (M)''' : The cache line holds the most recent copy of the data, which is not present in any other processor's cache.&lt;br /&gt;
* '''Owned (O)'''   : The cache line holds the most recent, correct copy of the data, which may be shared by other processors. The processor holding the line in this state is responsible for updating main memory with the correct value before the line is evicted.  &lt;br /&gt;
* '''Exclusive (E)''' : The cache line holds the most recent, correct copy of the data, which is present exclusively on this processor; a copy is also present in main memory.  &lt;br /&gt;
* '''Shared (S)''' : The cache line holds the most recent, correct copy of the data, which may be shared by other processors. &lt;br /&gt;
* '''Invalid (I)''' : A cache line does not hold a valid copy of the data.&lt;br /&gt;
&lt;br /&gt;
A detailed explanation of this protocol implementation on AMD processor can be found in the manual [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of the AMD 64-bit core]&lt;br /&gt;
&lt;br /&gt;
The following table summarizes the MOESI protocol:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Owned'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''This cache line is valid?'''&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  No&lt;br /&gt;
|-&lt;br /&gt;
|  '''The memory copy is…'''&lt;br /&gt;
|  out of date&lt;br /&gt;
|  out of date&lt;br /&gt;
|  valid&lt;br /&gt;
|  valid (unless another cache holds the line in Owned)&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies exist in caches of other processors?'''&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe (in Shared state)&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe&lt;br /&gt;
|  Maybe&lt;br /&gt;
|-&lt;br /&gt;
|  '''A write to this line'''&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  goes directly to bus&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
State transition diagram for MOESI protocol is shown below. The top part shows the response to processor-side requests and the bottom part is the response to snooper-side requests. [[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:MOESI.jpg|500px|thumb|center|MOESI state transition diagram]]&lt;br /&gt;
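&lt;br /&gt;
The effect of dirty sharing can be contrasted with MESI in a short sketch covering a single scenario: cache A holds a line in the Modified state and cache B issues a read. The function name and the shape of the return value are illustrative only.&lt;br /&gt;

```python
def remote_read_of_dirty_line(protocol):
    """Cache A holds a line in M and cache B reads it.
    Returns (A's new state, B's new state, memory written back now)."""
    if protocol == "MOESI":
        # A supplies the dirty data cache-to-cache and keeps ownership;
        # memory is updated only later, when the Owned line is evicted.
        return ("O", "S", False)
    # MESI: the flush that supplies the data also updates main memory.
    return ("S", "S", True)
```

The third element of the result is the difference that matters: under MOESI the write to main memory is deferred.&lt;br /&gt;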
&lt;br /&gt;
&amp;lt;br /&amp;gt;&lt;br /&gt;
&amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Optimization techniques on MOESI===&lt;br /&gt;
&lt;br /&gt;
In real machines, applying optimization techniques to the standard cache coherence protocol improves performance. For example, the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] family of microprocessors (Family 0x10), which is AMD’s first generation to incorporate 4 distinct cores on a single die and the first to have a cache that all the cores share, uses the MOESI protocol with some optimization techniques incorporated. &lt;br /&gt;
&lt;br /&gt;
The optimization targets a small subset of compute problems that behave like producer-consumer programs. In such a problem, a thread of a program running on one core produces data, which is consumed by a thread running on a separate core. With such programs, it is desirable to get the two distinct cores to communicate through the shared cache, to avoid round trips to and from main memory. The '''MOESI''' protocol that the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] cache uses for cache coherence can also limit bandwidth. Hence, by keeping the cache line in the '''‘M’''' state for such computing problems, we can achieve better performance.&lt;br /&gt;
&lt;br /&gt;
When the producer thread writes a new entry into the ring buffer shared with the consumer, it allocates cache lines in the '''modified (M)''' state. Eventually, these M-marked cache lines start to fill the L3 cache. When the consumer reads a cache line, the MOESI protocol changes the state of the cache line to '''owned (O)''' in the L3 cache and pulls down a '''shared (S)''' copy for the consumer's own use. Later, the producer thread circles the ring buffer and arrives back at the same cache line it had previously written. However, when the producer attempts to write new data to the cache line now marked '''‘O’''', it finds that it cannot, since a cache line marked '''‘O’''' by the previous consumer read does not have sufficient permission for a write request (in the MOESI protocol). To maintain coherence, the memory controller must initiate probes in the other caches (to handle any other S copies that may exist). This slows down the process.&lt;br /&gt;
&lt;br /&gt;
Thus, it is preferable to keep the cache line in the '''‘M’''' state in the L3 cache. In such a situation, when the producer comes back around the ring buffer, it finds the previously written cache line still marked '''‘M’''', to which it is safe to write without coherence concerns. Thus better performance can be achieved by such optimization techniques to standard protocols when implemented in real machines.&lt;br /&gt;
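&lt;br /&gt;
The cost described above can be illustrated with a toy model that counts how many producer writes need a probe. It is a deliberately simplified sketch: the slot-per-cache-line ring buffer, the function name, and the keep_modified flag are assumptions for illustration and do not model how the AMD hardware or the optimization guide achieves the M state.&lt;br /&gt;

```python
def probes_needed(slots, laps, keep_modified):
    """Count producer writes that must probe other caches in a ring buffer.
    Each slot stands for one cache line as seen in the shared L3 cache."""
    state = ["I"] * slots
    probes = 0
    for _ in range(laps):
        for i in range(slots):
            if state[i] == "O":
                probes += 1           # no write permission on an Owned line
            state[i] = "M"            # producer writes the slot
            if not keep_modified:
                state[i] = "O"        # consumer read leaves the line Owned
    return probes
```

With 4 slots and 3 laps, the default behavior needs 8 probes (every write after the first lap), while keeping the lines in the M state needs none.&lt;br /&gt;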
&lt;br /&gt;
More information on how this is implemented, along with various other optimizations, can be found in the manual [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors].&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
===Real Architectures using Synapse===&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
&lt;br /&gt;
==Dragon Protocol==&lt;br /&gt;
The '''[http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]''' is an update-based coherence protocol: unlike the coherence protocols we have seen so far, it does not invalidate other cached copies. Write propagation is achieved by updating the cached copies instead of invalidating them.  The Dragon Protocol does not update memory on a cache-to-cache transfer; it delays making memory consistent with the cache until the data is evicted and written back, which saves time and lowers the memory access requirements. Moreover, only the written '''byte''' or '''word''' is '''communicated''' to the '''other caches''' instead of the whole block, which further '''reduces''' the '''bandwidth''' usage. The protocol can dynamically detect the sharing status of a block, using a write-through policy for shared blocks and a write-back policy for currently non-shared blocks. The Dragon Protocol employs the following four states for the cache blocks: '''Shared Clean''', '''Shared Modified''', '''Exclusive''' and  '''Modified'''. &lt;br /&gt;
* '''Modified (M)''' and '''Exclusive (E)''' - these states have the same meaning as explained in the protocols above. &lt;br /&gt;
* '''Shared Modified (Sm)''' - Potentially two or more caches hold this block, memory may or may not be up to date, and this processor's cache has modified the block. For any given block, at most one cache can hold it in the Shared Modified state.&lt;br /&gt;
* '''Shared Clean (Sc)''' -  Potentially two or more caches have this block and memory may or may not be up to date (if no other cache has it in Sm state, memory will be up to date else it is not).&lt;br /&gt;
A Shared Modified line is written back to main memory only when it is evicted from the cache, in order to keep memory consistent. For more information on the Dragon protocol, refer to the Solihin textbook, page 229. The state transition diagram is given below for reference.[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
[[File:DragonNew.jpg|500px|thumb|center|Dragon state transition diagram]] &lt;br /&gt;
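&lt;br /&gt;
The write-hit behaviour of the four Dragon states can be sketched as follows. This is a minimal illustration assuming the usual textbook transitions; the function names and the &lt;code&gt;others_have_copy&lt;/code&gt; flag (standing in for the shared signal on the bus) are hypothetical, and write misses are left out.&lt;br /&gt;

```python
# Toy model of Dragon write hits and snooped updates (hypothetical helper names).

def processor_write(state, others_have_copy):
    """Return (next_state, bus_action) for a write hit in the given state."""
    if state in ("E", "M"):
        # Sole copy: write locally without any bus traffic.
        return "M", None
    if state in ("Sc", "Sm"):
        # Broadcast only the written word. The writer becomes the owner (Sm)
        # while sharers remain; otherwise it now holds the only copy (M).
        return ("Sm" if others_have_copy else "M"), "BusUpd"
    raise ValueError("write misses are not modelled in this sketch")


def snoop_bus_update(state):
    """A sharer that observes BusUpd patches its word and ends up in Sc."""
    if state in ("Sc", "Sm"):
        return "Sc"
    raise ValueError("E and M lines have no sharers, so no BusUpd is expected")


print(processor_write("E", False))
print(processor_write("Sc", True))
print(snoop_bus_update("Sm"))
```

Note that no transition leads to an invalid state: sharers are updated in place rather than invalidated.&lt;br /&gt;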
&lt;br /&gt;
The Dragon protocol implements snoopy caches that provide the appearance of a uniform memory space to multiple processors. Each cache listens to two buses: the processor bus and the memory bus. The caches are also responsible for address translation, so the processor bus carries virtual addresses and the memory bus carries physical addresses.  The Dragon system was designed to support 4 to 8 Dragon processors.&lt;br /&gt;
&lt;br /&gt;
=[http://en.wikipedia.org/wiki/Instruction_prefetch Instruction Prefetching]=&lt;br /&gt;
&lt;br /&gt;
In standard [http://en.wikipedia.org/wiki/Pipeline_(computing) CPU pipelining], instruction prefetching is an optimization used to minimize the time the CPU spends waiting for instructions to be fetched from main memory.  Prefetching works best on completely linear programs; for programs with branches, other optimizations (like [http://en.wikipedia.org/wiki/Branch_predictor branch prediction]) have to be implemented, and the cost of taking the wrong branch has to be considered.  Source:  http://dictionary.reference.com/browse/instruction+prefetch&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
David F. Kotz and Carla Schlatter Ellis, &amp;quot;Prefetching in File Systems for MIMD Multiprocessors,&amp;quot; in IEEE Transactions on Parallel and Distributed Systems, Vol. 1, No. 2,  April 1990, pp. 218-230.&lt;br /&gt;
&lt;br /&gt;
Prefetching improves the hit ratio by bringing into the cache additional blocks that surround the blocks currently being accessed by the processor. However, the source cited above states that simply increasing the hit ratio does not automatically improve overall performance in multiprocessor computation.  &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Instruction prefetching is a technique used to speed up the execution of a program. In multiprocessors, however, prefetching comes at a cost: data can be modified in such a way that the memory coherence protocol cannot handle the effects of a prefetch. In such situations, software must use serializing instructions or cache-invalidation instructions to guarantee that subsequent data accesses are coherent. &lt;br /&gt;
&amp;lt;br /&amp;gt;&lt;br /&gt;
An example of such a situation is a page-table update followed by accesses to the physical pages referenced by the updated page tables. The physical-memory references for the page tables are different from the physical-memory references for the data, so prefetching can cause correctness problems. The following sequence of events shows such a situation, in which software changes the translation of virtual-page A from physical-page M to physical-page N:&lt;br /&gt;
# The tables that translate virtual-page A to physical-page M are now held only in main memory. The copies in the cache are invalidated.&lt;br /&gt;
# Software changes the page-table entry for virtual-page A in main memory to point to physical-page N rather than physical-page M.&lt;br /&gt;
# Data in virtual-page A is accessed.&lt;br /&gt;
&lt;br /&gt;
Software expects the processor to access the data from physical-page N after the update. However, it is possible for the processor to prefetch the data from physical-page M before the page table for virtual page A is updated. Because the physical-memory references are different, the processor does not recognize them as requiring coherence checking and believes it is safe to prefetch the data from virtual-page A, which is translated into a read from physical page M. Similar behavior can occur when instructions are prefetched from beyond the page-table update instruction.&lt;br /&gt;
&lt;br /&gt;
To prevent such errors, software must execute special instructions (the serializing or invalidation instructions mentioned above) immediately after the page-table update, to ensure that subsequent instruction fetches and data accesses use the correct virtual-page-to-physical-page translation. It is not necessary to perform a TLB invalidation operation preceding the table update.&lt;br /&gt;
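&lt;br /&gt;
The hazard and its fix can be illustrated with a toy model. This is a sketch under strong simplifying assumptions, not a description of real hardware: the class &lt;code&gt;ToyCore&lt;/code&gt; and its methods are hypothetical, a dictionary stands in for the page table, and &lt;code&gt;serialize&lt;/code&gt; stands in for whatever serializing or invalidation instruction the software executes.&lt;br /&gt;

```python
# Toy model of a stale speculative translation (not real hardware behaviour).

class ToyCore:
    def __init__(self, page_table):
        self.page_table = page_table   # virtual page to physical page, in memory
        self.speculative = {}          # translations captured by early prefetches

    def prefetch(self, vpage):
        # The prefetch walks the page table as it exists right now.
        self.speculative[vpage] = self.page_table[vpage]

    def serialize(self):
        # Stand-in for the serializing or invalidation instruction:
        # discard everything that was fetched speculatively.
        self.speculative.clear()

    def access(self, vpage):
        # A prefetched translation, if present, wins over the table in memory.
        return self.speculative.get(vpage, self.page_table[vpage])


table = {"A": "M"}
core = ToyCore(table)
core.prefetch("A")             # the prefetch runs ahead of the update
table["A"] = "N"               # software remaps virtual-page A to physical-page N
stale = core.access("A")       # still resolves to physical-page M
core.serialize()               # executed right after the page-table update
fresh = core.access("A")       # now resolves to physical-page N
print(stale, fresh)
```

Without the call to &lt;code&gt;serialize&lt;/code&gt;, the access after the update keeps using the translation to physical-page M.&lt;br /&gt;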
&lt;br /&gt;
More information about this can be found in the [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual].&lt;br /&gt;
&lt;br /&gt;
= CMP Implementation in Intel Architecture =&lt;br /&gt;
&lt;br /&gt;
Let us now see how Intel architecture using the MESI protocol progressed from a uniprocessor architecture to a Chip MultiProcessor (CMP) using the bus as the interconnect. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Uniprocessor Architecture'''&lt;br /&gt;
&lt;br /&gt;
The diagram below shows the structure of the memory cluster in the Intel Pentium M processor.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:intel_cache1.jpg]]&amp;lt;/center&amp;gt; &amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In this structure we have,&lt;br /&gt;
* A unified on-chip '''L1 cache''' with the '''processor/core''',&lt;br /&gt;
* A '''Memory/L2 access control unit''', through which all the accesses to the L2 cache, main memory and IO space are made,&lt;br /&gt;
* The second level '''L2 cache''' along with the '''prefetch unit''' and&lt;br /&gt;
* '''Front side bus (FSB)''', a single shared bi-directional bus through which all the traffic is sent. Such wide buses bring in multiple data bytes at a time. &lt;br /&gt;
&lt;br /&gt;
As Intel explains it, in this structure processor requests were first looked up in the '''L2 cache''' and, only on a '''miss''', '''forwarded''' to main '''memory''' via the front side bus ('''FSB'''). The '''Memory/L2 access control''' unit served as a central point for '''maintaining coherence''' within the core and with the external world. It '''contains''' a '''snoop control unit''' that receives snoop requests from the bus and performs the required operations on each cache (and internal buffers) in parallel. It also handles RFO requests (BusUpgr) and ensures that the operation continues only after it has guaranteed that no other version of the cache line exists in any other cache in the system.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''CMP Architecture'''&lt;br /&gt;
&lt;br /&gt;
For the CMP implementation, Intel chose a bus-based architecture using snoopy protocols over a '''directory protocol'''. Although a directory protocol reduces active power because of reduced snoop activity, it '''increases''' the '''design complexity''' and the '''static power''' because of larger tag arrays. Since Intel has a large market for processors in the mobility family, a directory-based solution was less favorable: battery life depends mainly on static power consumption and less on dynamic power.&lt;br /&gt;
Let us examine how '''CMP''' was implemented in the '''Intel Core Duo''', which was one of the first dual-core processors for the budget/entry-level market. &lt;br /&gt;
The general CMP implementation structure of the Intel Core Duo is shown below.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:intel_cache2.jpg]]&amp;lt;/center&amp;gt; &amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This structure has the following changes when compared to the uniprocessor memory cluster structure. &lt;br /&gt;
* '''L1 cache''' and the '''processor/core''' structure is '''duplicated''' to give 2 cores.&lt;br /&gt;
* The '''Memory/L2 access control''' unit is '''split''' into 2 logical units: '''L2 controller''' and '''bus controller'''. The L2 controller handles all '''requests to the L2''' cache from the core and the snoop requests from the FSB. The '''bus controller''' handles '''data and I/O requests''' to and from the FSB.&lt;br /&gt;
* The '''prefetching''' unit is extended to handle the hardware '''prefetches for each core separately'''.&lt;br /&gt;
* A '''new logical unit''' (represented by the hexagon) was added to maintain '''fairness between the requests''' coming from the different cores and hence balance the requests to L2 and memory.&lt;br /&gt;
&lt;br /&gt;
This new '''partitioned structure''' for the  memory/L2 access control unit '''enhanced''' the '''performance''' while '''reducing power consumption'''. &lt;br /&gt;
For more information on uniprocessor and multiprocessor implementation under the Intel architecture, refer to &lt;br /&gt;
[http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]&lt;br /&gt;
&lt;br /&gt;
The '''Intel bus architecture''' has been '''evolving''' to accommodate the demands of scalability while using the same MESI protocol: from a '''single shared bus''', to '''dual independent buses (DIB)''' that double the available bandwidth, and on to the logical conclusion of DIB, '''dedicated high-speed interconnects (DHSI)'''. DHSI-based platforms use four FSBs, one for each processor in the platform. In both DIB and DHSI, a snoop filter in the chipset caches snoop information, thereby significantly reducing the broadcasting needed for snoop traffic on the buses. With the production of processors based on the next-generation 45-nm Hi-k Intel Core microarchitecture, the [http://en.wikipedia.org/wiki/Xeon Intel Xeon] processor fabric will transition from DHSI, with the memory controller in the chipset, to a distributed shared memory architecture using '''Intel QuickPath Interconnects''' and the '''MESIF protocol'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Word Invalidation=&lt;br /&gt;
One source of complexity common to a number of the protocols is invalidation.  In newer protocols, individual words may be modified in a cache line, as opposed to the entirety of the line.  The other processors will thus hold a mostly correct cache line that differs in only one word.  This adds complexity because, to 'correct' the error, the other processors must transition from shared to invalid, read the correct word from the bus, place it back into the line, and transition back into shared.  This round trip is potentially unnecessary, as the second processor may never access the specific word that changed, although it may access other words in the cache line.  One potential solution is being researched at present: advancing the protocols so that a cache line is &amp;quot;not invalidated on the first dirty word, but after the number of dirty words crosses some predetermined value, which is data type and application dependent.&amp;quot;  In other words, if the application can tolerate multiple 'incorrect' words, several transitions to and from the invalid state may be avoided.&lt;br /&gt;
&amp;lt;br/&amp;gt;Source: http://tab.computer.org/tcca/NEWS/sept96/dsmideas.ps&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In essence, the solution proposed here is to advance the MOESI protocol with word invalidation and specific treatment of temporal and spatial data, so that the block is not invalidated.&lt;br /&gt;
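&lt;br /&gt;
The threshold idea quoted above can be sketched as follows. This is a minimal illustration, not the cited proposal itself: the class name &lt;code&gt;SharedLineCopy&lt;/code&gt;, the method &lt;code&gt;remote_write&lt;/code&gt; and the threshold value are hypothetical, and the model tracks only one sharer's copy of one line.&lt;br /&gt;

```python
# Toy model of threshold-based word invalidation (hypothetical names).

class SharedLineCopy:
    def __init__(self, threshold):
        self.threshold = threshold   # tolerated number of stale words
        self.dirty_words = set()     # word indices written by other processors
        self.state = "S"

    def remote_write(self, word_index):
        """Record a remote write to one word and return the resulting state."""
        if self.state == "I":
            return self.state
        self.dirty_words.add(word_index)
        # The set grows by at most one word per call, so reaching
        # threshold + 1 is the moment the count crosses the threshold.
        if len(self.dirty_words) == self.threshold + 1:
            self.state = "I"
        return self.state


line = SharedLineCopy(threshold=2)
history = [line.remote_write(w) for w in (0, 1, 1, 3)]
print(history)
```

With a threshold of two, the copy stays shared while words 0 and 1 are stale and is invalidated only when a third distinct word is written.&lt;br /&gt;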
&lt;br /&gt;
=Performance: MOESI vs MEI/MESI=&lt;br /&gt;
Early multiprocessors (such as the PowerPC processors) were designed to work with three states (Modified, Exclusive, Invalid; this is similar to the MSI protocol discussed earlier, and the term MEI is used here to remain consistent with the sources used for reference).  As time progressed, more multiprocessors transitioned to the MESI protocol.  This is most likely due to the characteristics of MEI: it is &amp;quot;easy to implement but leads to inefficiencies in the way that memory bus bandwidth is used&amp;quot; [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 source].  As a result, systems with two or more MEI processors will most likely not see the full potential of the extra processors, because of the inefficient bus transactions.&lt;br /&gt;
&lt;br /&gt;
In contrast, the five-state MOESI protocol is more complex to implement than MESI and MEI/MSI.  However, it brings advantages.  As Andy Keane (VP of Marketing for PMC-Sierra) put it, &amp;quot;this [fifth] state allows shared data that is dirty to remain in the cache.  Without this state, any shared line would be written back to memory to change the original state from modified. Since we have a dedicated, fast path from CPU to CPU, this state maximizes the use of this path rather than the path to memory, which is inherently slower&amp;quot; [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 source].  This advantage is realized by letting MOESI perform some processor-to-processor transfers directly from the level 2 caches.&lt;br /&gt;
&lt;br /&gt;
Source: [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 Edwards, Chris (08/06/2001). &amp;quot;Architects wrestle with multiprocessor options&amp;quot;. Electronic engineering times (0192-1541), (1178), p. 48.]&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Cache_coherence Cache coherence]&lt;br /&gt;
# [http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]&lt;br /&gt;
# [http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Symmetric_multiprocessing Common System Interface in Intel Processors]&lt;br /&gt;
# [http://www.zak.ict.pwr.wroc.pl/nikodem/ak_materialy/Cache%20consistency%20&amp;amp;%20MESI.pdf Cache consistency with MESI on Intel processor]&lt;br /&gt;
# [http://techreport.com/articles.x/8236/2 AMD dual core Architecture]&lt;br /&gt;
# [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual]&lt;br /&gt;
# [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors]&lt;br /&gt;
# [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of AMD 64 bit core]&lt;br /&gt;
# [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&amp;amp;arnumber=4913 Silicon Graphics Computer Systems]&lt;br /&gt;
# [http://books.google.com/books?id=g82fofiqa5IC&amp;amp;printsec=frontcover&amp;amp;dq=Parallel+computer+architecture:+a+hardware/software+approach+By+David+E.+Culler,+Jaswinder+Pal+Singh,+Anoop+Gupta&amp;amp;source=bl&amp;amp;ots=COrdamlfVn&amp;amp;sig=YcugVqbzTjHvlofvaFq6Ft_tjfY&amp;amp;hl=en&amp;amp;ei=0ZO6S4TJGcOclgejzI3BBw&amp;amp;sa=X&amp;amp;oi=book_result&amp;amp;ct=result&amp;amp;resnum=1&amp;amp;ved=0CAgQ6AEwAA#v=onepage&amp;amp;q=&amp;amp;f=false Parallel computer architecture: a hardware/software approach By David E. Culler, Jaswinder Pal Singh, Anoop Gupta]&lt;br /&gt;
# [http://www.freepatentsonline.com/5283886.html Three state invalidation protocols]&lt;br /&gt;
# [http://portal.acm.org/citation.cfm?id=1499317&amp;amp;dl=GUIDE&amp;amp;coll=GUIDE&amp;amp;CFID=83027384&amp;amp;CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems]&lt;br /&gt;
# [http://portal.acm.org/citation.cfm?id=6514 Cache Coherence protocols: evaluation using a multiprocessor simulation model]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Xerox_Dragon Xerox Dragon]&lt;br /&gt;
# [http://thanaseto.110mb.com/courses/CSD-527-report-engl.pdf Coherence Protocols]&lt;br /&gt;
# [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&amp;amp;arnumber=289691 XDBus]&lt;br /&gt;
# [http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dai0228a/index.html ARM]&lt;br /&gt;
# [http://ixbtlabs.com/articles/ibmpower4/index.html IBM POWER4]&lt;br /&gt;
# Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems. Solihin, Yan&lt;br /&gt;
# [http://www.chiplist.com/Intel_Itanium_2_processor/tree3f-section--2235-/ Intel Itanium 2]&lt;br /&gt;
# [http://techreport.com/articles.x/8236/2 MESI-MESI-MOESI Banana-fana...]&lt;br /&gt;
# [http://rsim.cs.illinois.edu/rsim/Manual/node109.html RSIM]&lt;br /&gt;
# [http://dl.acm.org/citation.cfm?id=1499317 Synapse tightly coupled multiprocessors]&lt;/div&gt;</summary>
		<author><name>Pxu</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8a_fu&amp;diff=59838</id>
		<title>CSC/ECE 506 Spring 2012/8a fu</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8a_fu&amp;diff=59838"/>
		<updated>2012-03-19T01:51:48Z</updated>

		<summary type="html">&lt;p&gt;Pxu: /* MESIF */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Introduction to bus-based cache coherence in real machines=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==SMP Architecture==&lt;br /&gt;
Most parallel software in the commercial market relies on the shared-memory programming model, in which all processors access the same physical address space. The most common multiprocessors today use the [http://en.wikipedia.org/wiki/Symmetric_multiprocessing SMP] architecture, which uses a common bus as the interconnect.  In the case of multicore processors (&amp;quot;chip multiprocessors,&amp;quot; or CMP), the SMP architecture applies to the cores, treating them as separate processors. The key problem of shared-memory multiprocessors is providing a consistent view of memory across the various cache hierarchies.  This is called the '''''cache coherence problem'''''. Solving it is critical for correctness, and it is a performance-sensitive design point for supporting the shared-memory model. The cache coherence mechanisms not only govern communication in a shared-memory multiprocessor, but also typically determine how the memory system transfers data between processors, caches, and memory.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:Busbased SMP.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
At any point in logical time, the permissions for a cache block can allow either a single writer or multiple readers. The '''''coherence protocol''''' ensures that the invariants of the states are maintained. The coherence states used by most cache coherence protocols are shown in ''Table 1'':&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''States'''&lt;br /&gt;
|  '''Access Type'''&lt;br /&gt;
|  '''Invariant'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  read, write&lt;br /&gt;
|  all other caches in I state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  read&lt;br /&gt;
|  all other caches in I state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Owned'''&lt;br /&gt;
|  read&lt;br /&gt;
|  all other caches in I or S state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  read&lt;br /&gt;
|  no other cache in M or E state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|  -&lt;br /&gt;
|  -&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Snooping Protocols=&lt;br /&gt;
==MSI Protocol==&lt;br /&gt;
&lt;br /&gt;
'''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' is a three-state write-back '''invalidation protocol''' and one of the earliest snooping-based cache coherence protocols. It marks each cache line as '''Modified (M)''', '''Shared (S)''' or '''Invalid (I)'''. '''Invalid''' means the cache line is either not present or present but invalid. If the cache line is clean and may be shared by more than one processor, it is marked '''Shared'''. If the cache line is dirty and the processor has exclusive ownership of it, it is in the '''Modified''' state. A BusRdX causes other caches to invalidate (demote) their copies to the '''I''' state; a cache holding the line in '''M''' state also flushes it. A write to a line held in '''S''' state, even though it hits in the cache, still issues a BusRdX and promotes (upgrades) the line to '''M''' state.&lt;br /&gt;
&lt;br /&gt;
The state diagram of the MSI protocol is shown below. Note that the left part shows the response to processor-side requests, and the right part shows the response to snooper-side requests.[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
[[File:MSInew.jpg|600px|thumb|center|MSI state transition diagram]]&lt;br /&gt;
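&lt;br /&gt;
The MSI transitions described above can be written down as a small lookup table. This is a sketch following the usual textbook form of the protocol; the names &lt;code&gt;MSI&lt;/code&gt; and &lt;code&gt;step&lt;/code&gt; are hypothetical, and PrRd/PrWr denote processor reads and writes.&lt;br /&gt;

```python
# (current state, event) maps to (next state, bus action); None means no action.
MSI = {
    ("I", "PrRd"):   ("S", "BusRd"),
    ("I", "PrWr"):   ("M", "BusRdX"),
    ("S", "PrRd"):   ("S", None),
    ("S", "PrWr"):   ("M", "BusRdX"),   # write hit in S still goes to the bus
    ("S", "BusRd"):  ("S", None),
    ("S", "BusRdX"): ("I", None),
    ("M", "PrRd"):   ("M", None),
    ("M", "PrWr"):   ("M", None),
    ("M", "BusRd"):  ("S", "Flush"),
    ("M", "BusRdX"): ("I", "Flush"),
}


def step(state, event):
    """Apply one event; unlisted pairs (snoops in I) leave the line unchanged."""
    return MSI.get((state, event), (state, None))


print(step("S", "PrWr"))
print(step("M", "BusRd"))
print(step("I", "BusRdX"))
```

Processor events come from the left half of the diagram and snooped bus events from the right half.&lt;br /&gt;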
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Synapse protocol==&lt;br /&gt;
From the state transition diagram of MSI, we observe that there is a transition from state '''M''' to state '''S''' when a BusRd is observed for that block. The contents of the block are flushed to the bus before the block goes to '''S''' state. In certain cases it would be more appropriate to move to '''I''' state, thus giving up the block entirely. The choice between moving to '''S''' or '''I''' reflects the designer's assertion about which is more likely: that the original processor continues reading the block, or that the new processor writes to it. The Synapse protocol, used in the early Synapse multiprocessor, made the alternate choice of going directly from '''M''' state to '''I''' state on a BusRd, assuming the migratory pattern would be more frequent. More details about this protocol can be found in these papers published in the 1980s: [http://portal.acm.org/citation.cfm?id=6514 Cache Coherence protocols: evaluation using a multiprocessor simulation model] and [http://portal.acm.org/citation.cfm?id=1499317&amp;amp;dl=GUIDE&amp;amp;coll=GUIDE&amp;amp;CFID=83027384&amp;amp;CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems].&lt;br /&gt;
&lt;br /&gt;
In the Synapse protocol, the '''M''' state is called the '''D''' (Dirty) state. The following is the state transition diagram for the Synapse protocol:&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:Synapse1.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
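The design choice that separates Synapse from MSI, as described above, can be sketched in a few lines. This is an illustration of that single transition only, under the assumption that everything else is left out; the function name &lt;code&gt;on_snooped_bus_read&lt;/code&gt; and the protocol labels are hypothetical.&lt;br /&gt;

```python
# Toy comparison of what a dirty line does when another cache issues BusRd.

def on_snooped_bus_read(state, protocol):
    """Return (next_state, bus_action) for a snooped BusRd."""
    if state in ("M", "D"):
        # Both protocols flush the dirty block. MSI keeps a shared copy,
        # while Synapse gives the block up, betting on a migratory pattern.
        return ("S" if protocol == "MSI" else "I"), "Flush"
    return state, None


print(on_snooped_bus_read("M", "MSI"))
print(on_snooped_bus_read("D", "Synapse"))
```

If the original processor reads the block again, the MSI choice avoids a miss; if the new processor goes on to write, the Synapse choice avoids a later invalidation.&lt;br /&gt;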
===Real Architectures using Synapse===&lt;br /&gt;
No actual architecture using the Synapse cache-coherence protocol could be found.  Judging from the abstract of the 1984 paper by Steve Frank and Armond Inselberg, &amp;quot;[http://dl.acm.org/citation.cfm?id=1499317 Synapse Tightly Coupled Multiprocessors: A New Approach to Solve Old Problems]&amp;quot;, Synapse appears to have been theoretical.&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[http://www.disi.unige.it/person/DelzannoG/CacheProtocol/synapse.hy The BABYLON Project] provides one example of a Synapse cache coherence protocol with atomic synchronization actions.&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
In the Synapse system, memory contention is reduced by designing the processor cache to use a non-write-through algorithm, which minimizes the bandwidth between cache and shared memory. The Synapse Expansion Bus includes an ownership level protocol between processor caches.[[#References|&amp;lt;sup&amp;gt;[24]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
==[http://en.wikipedia.org/wiki/MESI_protocol MESI]==&lt;br /&gt;
The main drawback of MSI is that each read-write sequence incurs two bus transactions, irrespective of whether the cache line is stored in only one cache or not. Highly parallel programs that have little data sharing suffer the most from this.  The MESI protocol solves this problem by introducing the '''Exclusive''' state to distinguish between a cache line stored in multiple caches and a line stored in a single cache.&lt;br /&gt;
Let us briefly see how the MESI protocol works. For a more detailed version, refer to the Solihin textbook, pg. 215.&lt;br /&gt;
&lt;br /&gt;
The MESI coherence protocol marks each cache line as being in one of the Modified, Exclusive, Shared, or Invalid states. &lt;br /&gt;
* '''Invalid''' : The cache line is either not present or is invalid&lt;br /&gt;
* '''Exclusive''' : The cache line is clean and is owned by this core/processor only&lt;br /&gt;
* '''Modified''' : The cache line is dirty and the core/processor has exclusive ownership of the cache line, exclusive of the memory also.&lt;br /&gt;
* '''Shared''' : The cache line is clean and is shared by more than one core/processor&lt;br /&gt;
&lt;br /&gt;
The MESI protocol works as follows: &lt;br /&gt;
A line that is fetched receives the '''E''' or '''S''' state, depending on whether it exists in the caches of other processors in the system. A cache line gets the '''M''' state when a processor writes to it; if the line is not in '''E''' or '''M''' state prior to the write, the cache sends a Bus Upgrade (BusUpgr) signal or, as the Intel manuals term it, a “Read-For-Ownership (RFO) request”, which ensures that the line exists in the cache and is in the '''I''' state in all other processors on the bus (if any). The table below summarizes the MESI protocol.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''This cache line is valid?'''&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  No&lt;br /&gt;
|-&lt;br /&gt;
|  '''The memory copy is…'''&lt;br /&gt;
|  out of date&lt;br /&gt;
|  valid&lt;br /&gt;
|  valid&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies exist in caches of other processors?'''&lt;br /&gt;
|  No&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe&lt;br /&gt;
|  Maybe&lt;br /&gt;
|-&lt;br /&gt;
|  '''A write to this line'''&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  goes directly to bus&lt;br /&gt;
|}&lt;br /&gt;
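&lt;br /&gt;
The benefit of the Exclusive state can be sketched by counting bus transactions for a read miss followed by a write to a line that no other cache holds. This is a minimal illustration: the function names are hypothetical, and using BusRdX for a write to an invalid line is an assumption that follows the usual textbook naming.&lt;br /&gt;

```python
# Toy model of MESI fills and writes (hypothetical helper names).

def fill_state(other_caches_have_copy):
    """State given to a line fetched on a read miss."""
    return "S" if other_caches_have_copy else "E"


def processor_write(state):
    """Return (next_state, bus_action) for a processor write."""
    if state in ("M", "E"):
        return "M", None          # already exclusive: no bus transaction
    if state == "S":
        return "M", "BusUpgr"     # invalidate the other copies
    return "M", "BusRdX"          # assumption: a write to an I line fetches it


def private_read_then_write(protocol):
    """Bus transactions for a read miss, then a write, to an unshared line."""
    if protocol == "MSI":
        return ["BusRd", "BusRdX"]
    state = fill_state(other_caches_have_copy=False)
    _, action = processor_write(state)
    return ["BusRd"] + ([action] if action else [])


print(private_read_then_write("MSI"))
print(private_read_then_write("MESI"))
```

For private data, MESI needs only the initial BusRd, whereas MSI pays for a second transaction on the write.&lt;br /&gt;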
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
State transition diagram for MESI protocol is shown below.[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
[[File:MESInew.jpg|500px|thumb|center|MESI state transition diagram]] &lt;br /&gt;
&lt;br /&gt;
===Real Architectures using MESI===&lt;br /&gt;
&lt;br /&gt;
Intel's [http://en.wikipedia.org/wiki/Pentium_Pro Pentium Pro] microprocessor, introduced in 1995, was the first Intel architecture microprocessor to support symmetric multiprocessing in various multiprocessor configurations.  SMP with the MESI protocol was the architecture used consistently until the introduction of the 45-nm Hi-k Core microarchitecture in Intel's quad-core x86-64 (Nehalem-EP) processors. The 45-nm Hi-k Intel Core microarchitecture uses a new system framework called the [http://en.wikipedia.org/wiki/Intel_QuickPath_Interconnect QuickPath Interconnect], which uses point-to-point interconnection technology based on a distributed shared memory architecture.  It uses a modified version of the MESI protocol called [http://en.wikipedia.org/wiki/MESIF_protocol MESIF], which introduces the additional Forward state.  (The state diagram of the MESI transitions that occur within the Pentium's data cache can be found on page 63 of [http://books.google.com/books?id=TVzjEZg1--YC&amp;amp;printsec=frontcover&amp;amp;source=gbs_ge_summary_r&amp;amp;cad=0#v=onepage&amp;amp;q&amp;amp;f=false Pentium Processor System Architecture: Second Edition] by Don Anderson, Tom Shanley, MindShare, Inc.)&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
The [http://www.arm.com/products/processors/classic/arm11/arm11-mpcore.php ARM11 MPCore] and [http://en.wikipedia.org/wiki/ARM_Cortex-A9_MPCore Cortex-A9 MPCore] processors support the MESI cache coherency protocol.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]  ARM MPCore defines the states of the MESI protocol it implements as:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies in other caches'''&lt;br /&gt;
|  NO&lt;br /&gt;
|  NO&lt;br /&gt;
|  YES&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Clean or Dirty'''&lt;br /&gt;
|  DIRTY&lt;br /&gt;
|  CLEAN&lt;br /&gt;
|  CLEAN&lt;br /&gt;
|  -&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
The coherency protocol is implemented and managed by the Snoop Control Unit (SCU) in the ARM MPCore, which monitors the traffic between the local L1 data caches and the next level of the memory hierarchy. At boot time, each core can choose whether or not to take part in the coherency domain.  Unless explicit system calls bind a task to a specific core ([http://en.wikipedia.org/wiki/Processor_affinity processor affinity]), there is a high chance that the task will at some point migrate to a different core, along with its data as it is used.  Task migration is not handled efficiently by a literal MESI implementation, so the ARM MPCore offers two optimizations that allow for both MESI compliance and efficient task migration: '''Direct Data Intervention (DDI)''', in which the SCU keeps a copy of the tag RAMs of all cores' caches, enabling it to efficiently detect whether a cache line requested by one core is held by another core in the coherency domain before looking for it in the next level of the memory hierarchy; and '''Cache-to-cache Migration''', in which, if the SCU finds that the cache line requested by one CPU is present in another core, it either copies it (if clean) or moves it (if dirty) from the other CPU directly into the requesting one, without interacting with external memory.  These optimizations reduce memory traffic in and out of the L1 cache subsystem by eliminating interaction with external memories, which in effect reduces the overall load on the interconnect and the overall power consumption.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
Other architectures that use the MESI cache-coherence protocol include the L2 cache of the IBM POWER4 processor[[#References|&amp;lt;sup&amp;gt;[20]&amp;lt;/sup&amp;gt;]], the L2 cache of the Intel Itanium 2 processor[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]], and the Intel Xeon[[#References|&amp;lt;sup&amp;gt;[22]&amp;lt;/sup&amp;gt;]].&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
One implementation complexity already mentioned is the inefficiency of task migration in MESI that the ARM MPCore addresses.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]  Another possible implementation complexity is found during the replacement of a cache line: one possible MESI implementation requires a message to be sent to main memory when a cache line is flushed (i.e. an E to I transition), as the line was exclusively in one cache before it was removed. It is possible to avoid this replacement message if the system is designed so that the flush of a modified (exclusive) line requires an acknowledgment from main memory. However, this requires the flush to be stored in a 'write-back' buffer until the reply arrives (to ensure the change is successfully propagated to memory).[[#References|&amp;lt;sup&amp;gt;[23]&amp;lt;/sup&amp;gt;]]&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==MESIF==&lt;br /&gt;
The following is a brief overview of the '''MESIF protocol''':&lt;br /&gt;
&lt;br /&gt;
The '''MESIF''' protocol, used in the latest Intel multi-core processors, was introduced to '''accommodate the point-to-point''' links used in the QuickPath Interconnect. Using the '''MESI''' protocol in this architecture would send many redundant messages between different processors, often with unnecessarily high latency. For example, when a processor requests a cache line that is stored in multiple locations, every location might respond with the data. As the requesting processor needs only a single copy of the data, the system would be wasting bandwidth. &lt;br /&gt;
As a solution to this problem, an additional state, the '''Forward (F) state''', was added by slightly changing the role of the Shared state. Whenever there is a read request, only the cache holding the line in the F state responds, while all caches holding it in the S state stay silent.  Hence, by designating a single copy to '''respond to requests''', coherency traffic is substantially reduced when multiple copies of the data exist. On such a read request the responder also changes state: when a cache line in the '''F''' state is '''copied''', the F state '''migrates''' to the '''newer copy''', while the '''older''' one drops back to '''S'''. Moving the F state to the new copy '''exploits''' both '''temporal and spatial locality'''. Because the newest copy of the cache line is always the one in the F state, the line in the F state is very unlikely to be evicted from the caches, which takes advantage of the temporal locality of the requests. The second advantage is that if a particular cache line is in high demand due to spatial locality, the bandwidth used to transmit that data is spread across several nodes.&lt;br /&gt;
Correspondingly, the transitions that go from M to S and from E to S in MESI become '''M to F''' and '''E to F''' in MESIF.  &lt;br /&gt;
The '''F state''' is '''different''' from the '''Owned state''' of the MOESI protocol in that the line is '''not''' the only valid copy: memory also holds a valid copy. Thus, unlike a line in the O state of the MOESI protocol, which holds the only valid copy of the data, a line in the F state can be evicted or converted to the S state if desired. &lt;br /&gt;
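&lt;br /&gt;
The migration of the F state can be illustrated with a minimal sketch. It follows the description in which the newest copy takes over the F state and the previous responder drops to S; the function name, the dictionary layout and the handling of a request that finds only S copies are assumptions made for illustration, not part of any Intel specification.&lt;br /&gt;

```python
# Hypothetical sketch of MESIF read handling for a single cache line.
# `states` maps a cache id to one of 'M', 'E', 'S', 'I', 'F'.

def read_line(states, requester):
    # Returns the id of the cache that supplied the data,
    # or None when memory had to supply it.
    if states[requester] != 'I':
        return requester              # read hit: no coherence traffic
    responder = None
    for cache, state in states.items():
        if state in ('M', 'E', 'F'):
            responder = cache         # at most one cache can be in M, E or F
    if responder is not None:
        # A line in M would also be written back to memory here (omitted).
        states[responder] = 'S'       # the older copy drops back to S
        states[requester] = 'F'       # the newest copy becomes the forwarder
    elif any(state == 'S' for state in states.values()):
        # Assumption of this sketch: with only S copies left,
        # memory supplies the data and the requester takes S.
        states[requester] = 'S'
    else:
        states[requester] = 'E'       # no other cached copy exists
    return responder
```

Each read is answered by exactly one cache, so the number of data responses stays at one regardless of how many S copies exist.&lt;br /&gt;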
&lt;br /&gt;
More information on the QuickPath Interconnect and MESIF protocol can be found at&lt;br /&gt;
'''[http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==MOESI==&lt;br /&gt;
[http://en.wikipedia.org/wiki/Opteron AMD Opteron] was AMD’s first-generation dual-core processor, with two distinct [http://en.wikipedia.org/wiki/Athlon_64 K8 cores] on a single die.  Cache coherence becomes a bigger problem on such multiprocessors, and an appropriate coherence protocol was needed to address it. The [http://en.wikipedia.org/wiki/Xeon Intel Xeon], the competing product from Intel, used the MESI protocol to handle cache coherence.  MESI has the drawback of consuming considerable time and bandwidth in certain situations. &lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/MOESI_protocol MOESI] was AMD’s answer to this problem. MOESI adds a fifth state, '''“Owned”''', to the MESI protocol. It addresses the bandwidth problem that arises in MESI when a processor holding an invalid copy of a line wants to access data that another processor has modified: the requesting processor has to wait for the processor that modified the data to write it back to main memory, which takes time and bandwidth. MOESI removes this drawback.  When a processor holds the data in the new '''“Owned”''' state, it can supply the modified data to other processors without, or even before, writing it to main memory. This is called '''''dirty sharing'''''. The processor holding the data in the '''&amp;quot;Owned&amp;quot;''' state remains responsible for updating main memory later, when the cache line is evicted.&lt;br /&gt;
&lt;br /&gt;
MOESI has become one of the most popular snoop-based protocols supported in the AMD64 architecture.  The AMD dual-core Opteron can maintain cache coherence in systems up to 8 processors using this protocol.&lt;br /&gt;
&lt;br /&gt;
The five different states of the MOESI protocol are:&lt;br /&gt;
* '''Modified (M)''': The cache line holds the most recent copy of the data, and no other processor cache holds a copy.&lt;br /&gt;
* '''Owned (O)''': The cache line holds the most recent, correct copy of the data, which may be shared with other processors. The processor holding the line in this state is responsible for writing the correct value to main memory when the line is evicted.  &lt;br /&gt;
* '''Exclusive (E)''': The cache line holds the most recent, correct copy of the data, which is present only on this processor; main memory also holds a valid copy.  &lt;br /&gt;
* '''Shared (S)''': The cache line holds the most recent, correct copy of the data, which may be shared by other processors. &lt;br /&gt;
* '''Invalid (I)''': The cache line does not hold a valid copy of the data.&lt;br /&gt;
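&lt;br /&gt;
As a rough illustration of dirty sharing, the sketch below models how a cache holding a line reacts when it snoops a read from another processor, and which states require a write-back on eviction. The function names and the tuple layout are hypothetical, and the choice to let memory supply clean data for a line in E is an assumption of this sketch rather than a documented AMD behavior.&lt;br /&gt;

```python
# Illustrative model of MOESI dirty sharing for a single cache line.

def snoop_read(holder_state):
    # Returns (new_state, supplies_data, writes_memory) for a cache
    # that observes a read request from another processor.
    if holder_state in ('M', 'O'):
        # Keep the dirty line, supply it directly, defer the write-back.
        return ('O', True, False)
    if holder_state == 'E':
        # Clean line: memory holds the same data (assumed to supply it).
        return ('S', False, False)
    return (holder_state, False, False)   # S and I copies stay silent

def evict(holder_state):
    # True when evicting the line requires a write-back to main memory.
    return holder_state in ('M', 'O')
```

In plain MESI the first case would instead force a write to memory before the line could be shared, which is the cost the Owned state avoids.&lt;br /&gt;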
&lt;br /&gt;
A detailed explanation of how this protocol is implemented on AMD processors can be found in [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of the AMD 64-bit core].&lt;br /&gt;
&lt;br /&gt;
The following table summarizes the MOESI protocol:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
&lt;br /&gt;
|  '''Owned''' &lt;br /&gt;
&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|  '''This cache line is valid?'''&lt;br /&gt;
&lt;br /&gt;
|  Yes&lt;br /&gt;
&lt;br /&gt;
|  Yes&lt;br /&gt;
&lt;br /&gt;
|  Yes&lt;br /&gt;
&lt;br /&gt;
|  Yes&lt;br /&gt;
&lt;br /&gt;
|  No&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|  '''The memory copy is…'''&lt;br /&gt;
&lt;br /&gt;
|  out of date&lt;br /&gt;
&lt;br /&gt;
|  out of date&lt;br /&gt;
&lt;br /&gt;
|  valid&lt;br /&gt;
&lt;br /&gt;
|  valid&lt;br /&gt;
&lt;br /&gt;
|  -&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|  '''Copies exist in caches of other processors?'''&lt;br /&gt;
&lt;br /&gt;
|  No&lt;br /&gt;
&lt;br /&gt;
|  Maybe (in the Shared state)&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe&lt;br /&gt;
&lt;br /&gt;
|  Maybe&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|  '''A write to this line'''&lt;br /&gt;
&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus (other copies must be invalidated)&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
&lt;br /&gt;
|  goes directly to bus&lt;br /&gt;
&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The state transition diagram for the MOESI protocol is shown below. The top part shows the response to processor-side requests and the bottom part shows the response to snooper-side requests. [[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:MOESI.jpg|500px|thumb|center|MOESI state transition diagram]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br /&amp;gt;&lt;br /&gt;
&amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Optimization techniques on MOESI===&lt;br /&gt;
&lt;br /&gt;
In real machines, applying optimization techniques to the standard cache coherence protocol can improve performance. For example, the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] family of microprocessors (Family 0x10), AMD’s first generation to incorporate four distinct cores on a single die and the first to have a cache shared by all the cores, uses the MOESI protocol with some optimizations. &lt;br /&gt;
&lt;br /&gt;
One such optimization targets a small subset of compute problems that behave like producer-consumer programs. In such a problem, a thread running on one core produces data that is consumed by a thread running on a separate core. With such programs, it is desirable to have the two cores communicate through the shared cache, to avoid round trips to and from main memory. However, the '''MOESI''' protocol that the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] cache uses for cache coherence can itself limit bandwidth. Keeping the cache line in the '''‘M’''' state for such problems gives better performance.&lt;br /&gt;
&lt;br /&gt;
When the producer thread writes a new entry, it allocates cache lines in the '''modified (M)''' state. Eventually, these M-marked cache lines start to fill the L3 cache. When the consumer reads a cache line, the MOESI protocol changes the state of the line to '''owned (O)''' in the L3 cache and pulls down a '''shared (S)''' copy for the consumer’s own use. The producer thread then circles the ring buffer that the two threads share and arrives back at the same cache line it had previously written. However, when the producer attempts to write new data to the owned (marked '''‘O’''') cache line, it finds that it cannot, since a cache line marked '''‘O’''' by the previous consumer read does not have sufficient permission for a write request (in the MOESI protocol). To maintain coherence, the memory controller must initiate probes in the other caches (to handle any S copies that may exist), which slows down the process.&lt;br /&gt;
&lt;br /&gt;
Thus, it is preferable to keep the cache line in the '''‘M’''' state in the L3 cache. In that case, when the producer comes back around the ring buffer, it finds the previously written cache line still marked '''‘M’''', and it can write to the line safely without coherence concerns. Optimizations of this kind, applied to standard protocols, can therefore improve performance in real machines.&lt;br /&gt;
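&lt;br /&gt;
The effect on the producer can be sketched with a simple count of coherence probes. The model below is a deliberate simplification: the function names are hypothetical, one probe round per rewrite of an O line is an assumed unit of cost rather than a measured figure, and real hardware behavior is described in the AMD optimization guide.&lt;br /&gt;

```python
# Toy model: cost of rewriting ring-buffer lines left in 'O' versus 'M'.

def producer_rewrite(l3_state):
    # Returns (new_state, probe_rounds) when the producer writes the line again.
    if l3_state == 'M':
        return ('M', 0)   # still dirty and unshared: the write proceeds at once
    if l3_state == 'O':
        return ('M', 1)   # S copies may exist, so other caches are probed first
    raise ValueError('state not covered by this sketch')

def count_probes(state_after_consumer_read, lines, laps):
    # Total probe rounds when the producer circles a ring buffer of
    # `lines` cache lines `laps` times.
    probes = 0
    for _ in range(laps):
        for _ in range(lines):
            _, needed = producer_rewrite(state_after_consumer_read)
            probes += needed
    return probes
```

For example, under these assumptions count_probes('O', 8, 3) gives 24 probe rounds, while count_probes('M', 8, 3) gives none.&lt;br /&gt;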
&lt;br /&gt;
You can find more information on how this is implemented and various other ways of optimizations in this manual [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors]&lt;br /&gt;
&lt;br /&gt;
==Dragon Protocol==&lt;br /&gt;
The '''[http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]''' is an update-based coherence protocol: unlike the coherence protocols seen so far, it does not invalidate other cached copies. Write propagation is achieved by updating the cached copies instead of invalidating them.  The Dragon Protocol does not update memory on a cache-to-cache transfer; it delays making memory consistent with the caches until the data is evicted and written back, which saves time and lowers the memory access requirements. Moreover, only the written '''byte''' or '''word''' is '''communicated''' to the '''other caches''' instead of the whole block, which further '''reduces''' '''bandwidth''' usage. The protocol can detect the sharing status of a block dynamically, using a write-through policy for shared blocks and a write-back policy for blocks that are currently not shared. The Dragon Protocol employs the following four states for cache blocks: '''Shared Clean''', '''Shared Modified''', '''Exclusive''' and  '''Modified'''. &lt;br /&gt;
* '''Modified (M)''' and '''Exclusive (E)''' - these states have the same meaning as explained in the protocols above. &lt;br /&gt;
* '''Shared Modified (Sm)''' - Two or more caches potentially hold this block, memory may or may not be up to date, and this processor’s cache has modified the block. Only one cache in the system can hold a given block in the Shared Modified state.&lt;br /&gt;
* '''Shared Clean (Sc)''' -  Two or more caches potentially hold this block, and memory may or may not be up to date (memory is up to date if no other cache holds the block in the Sm state; otherwise it is not).&lt;br /&gt;
Only when a Shared Modified line is evicted from the cache (on a cache miss) is the block written back to main memory to keep memory consistent. For more information on the Dragon protocol, refer to the Solihin textbook, page 229. The state transition diagram is given below for reference.[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
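&lt;br /&gt;
The update-based write path can be sketched as follows. This is a minimal model of a processor write to a block the writer already holds; the function name and data layout are hypothetical, and bus arbitration, misses and write-backs are left out.&lt;br /&gt;

```python
# Hypothetical sketch of a Dragon processor write to one block.
# `caches` maps a cache id to {'state': ..., 'data': {word_index: value}}.
# The writer is assumed to hold the block in E, M, Sc or Sm.

def dragon_write(caches, writer, word, value):
    # Returns the number of words sent on the bus for this write.
    line = caches[writer]
    line['data'][word] = value
    if line['state'] in ('E', 'M'):
        line['state'] = 'M'               # no other copy exists: no bus traffic
        return 0
    sharers = [c for c in caches
               if c != writer and caches[c]['state'] in ('Sc', 'Sm')]
    for c in sharers:
        caches[c]['data'][word] = value   # the update carries only the written word
        caches[c]['state'] = 'Sc'         # at most one cache may remain in Sm
    line['state'] = 'Sm' if sharers else 'M'
    return 1
```

The other caches keep a valid, updated copy, so a later read by a sharer hits locally instead of missing as it would after an invalidation.&lt;br /&gt;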
&lt;br /&gt;
[[File:DragonNew.jpg|500px|thumb|center|Dragon state transition diagram]] &lt;br /&gt;
&lt;br /&gt;
The Dragon protocol was implemented with snoopy caches that provide the appearance of a uniform memory space to multiple processors. Each cache listens to two buses: the processor bus and the memory bus. The caches are also responsible for address translation, so the processor bus carries virtual addresses and the memory bus carries physical addresses.  The Dragon system was designed to support 4 to 8 Dragon processors.&lt;br /&gt;
&lt;br /&gt;
=[http://en.wikipedia.org/wiki/Instruction_prefetch Instruction Prefetching]=&lt;br /&gt;
&lt;br /&gt;
In standard [http://en.wikipedia.org/wiki/Pipeline_(computing) CPU pipelining], instruction prefetching is used to minimize the time the CPU spends waiting for instructions to be fetched from main memory.  Prefetching works well on completely linear programs; on programs with branches, however, other optimizations (like [http://en.wikipedia.org/wiki/Branch_predictor branch prediction]) have to be implemented, and the cost of a wrongly predicted branch has to be considered.  Source:  http://dictionary.reference.com/browse/instruction+prefetch&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
David F. Kotz and Carla Schlatter Ellis, &amp;quot;Prefetching in File Systems for MIMD Multiprocessors,&amp;quot; in IEEE Transactions on Parallel and Distributed Systems, Vol. 1, No. 2,  April 1990, pp. 218-230.&lt;br /&gt;
&lt;br /&gt;
Prefetching brings into the cache additional blocks that surround the blocks currently being accessed by the processor. Kotz and Ellis (cited above) report that simply increasing the hit ratio does not automatically improve overall performance in multiprocessor computation.  &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Instruction prefetching is a technique used to speed up the execution of a program. In multiprocessors, however, prefetching can come at a cost: because of prefetching, data can be modified in ways that the memory coherence protocol is not able to handle. In such situations software must use serializing instructions or cache-invalidation instructions to guarantee that subsequent data accesses are coherent. &lt;br /&gt;
&amp;lt;br /&amp;gt;&lt;br /&gt;
An example of this type of a situation is a page-table update followed by accesses to the physical pages referenced by the updated page tables. The physical-memory references for the page tables are different than the physical-memory references for the data. Because of prefetching there may be problems with correctness. The following sequence of events shows such a situation when software changes the translation of virtual-page A from physical-page M to physical-page N:&lt;br /&gt;
# The tables that translate virtual-page A to physical-page M are now held only in main memory. The copies in the cache are invalidated.&lt;br /&gt;
# Software changes the page-table entry for virtual-page A in main memory to point to physical-page N rather than physical-page M.&lt;br /&gt;
# Data in virtual-page A is accessed.&lt;br /&gt;
&lt;br /&gt;
Software expects the processor to access the data from physical-page N after the update. However, it is possible for the processor to prefetch the data from physical-page M before the page table for virtual page A is updated. Because the physical-memory references are different, the processor does not recognize them as requiring coherence checking and believes it is safe to prefetch the data from virtual-page A, which is translated into a read from physical page M. Similar behavior can occur when instructions are prefetched from beyond the page-table update instruction.&lt;br /&gt;
&lt;br /&gt;
To prevent such errors, software must execute a special instruction immediately after the page-table update (the AMD64 manual names INVLPG and MOV CR3 for this purpose) to ensure that subsequent instruction fetches and data accesses use the correct virtual-page-to-physical-page translation. It is not necessary to perform a TLB invalidation operation before the table update.&lt;br /&gt;
&lt;br /&gt;
More information can be found about this in [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual]&lt;br /&gt;
&lt;br /&gt;
= CMP Implementation in Intel Architecture =&lt;br /&gt;
&lt;br /&gt;
Let us now see how Intel architecture using the MESI protocol progressed from a uniprocessor architecture to a Chip MultiProcessor (CMP) using the bus as the interconnect. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Uniprocessor Architecture'''&lt;br /&gt;
&lt;br /&gt;
The diagram below shows the structure of the memory cluster in Intel Pentium M processor.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:intel_cache1.jpg]]&amp;lt;/center&amp;gt; &amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In this structure we have,&lt;br /&gt;
* A unified on-chip '''L1 cache''' with the '''processor/core''',&lt;br /&gt;
* A '''Memory/L2 access control unit''', through which all the accesses to the L2 cache, main memory and IO space are made,&lt;br /&gt;
* The second level '''L2 cache''' along with the '''prefetch unit''' and&lt;br /&gt;
* '''Front side bus (FSB)''', a single shared bi-directional bus through which all the traffic is sent. This wide bus brings in multiple data bytes at a time. &lt;br /&gt;
&lt;br /&gt;
As Intel explains it, with this structure processor requests were first looked up in the '''L2 cache''' and, only on a '''miss''', '''forwarded''' to main '''memory''' via the front side bus ('''FSB'''). The '''Memory/L2 access control''' unit served as a central point for '''maintaining coherence''' within the core and with the external world. It '''contains''' a '''snoop control unit''' that receives snoop requests from the bus and performs the required operations on each cache (and internal buffers) in parallel. It also handles RFO requests (BusUpgr) and ensures that the operation continues only after it has guaranteed that no other version of the cache line exists in any other cache in the system.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''CMP Architecture'''&lt;br /&gt;
&lt;br /&gt;
For its CMP implementation, Intel chose a bus-based architecture using snoopy protocols over a '''directory protocol''': although a directory protocol reduces active power through reduced snoop activity, it '''increases''' '''design complexity''' and '''static power''' because of its larger tag arrays. Since Intel has a large market for processors in the mobility family, and battery life depends mainly on static power consumption and less on dynamic power, a directory-based solution was less favorable.&lt;br /&gt;
Let us examine how '''CMP''' was implemented in the '''Intel Core Duo''', which was one of the first dual-core processors for the budget/entry-level market. &lt;br /&gt;
The general CMP implementation structure of the Intel Core Duo is shown below.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:intel_cache2.jpg]]&amp;lt;/center&amp;gt; &amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This structure has the following changes when compared to the uniprocessor memory cluster structure. &lt;br /&gt;
* The '''L1 cache''' and '''processor/core''' structure is '''duplicated''' to give two cores.&lt;br /&gt;
* The '''Memory/L2 access control''' unit is '''split''' into 2 logical units: '''L2 controller''' and '''bus controller'''. The L2 controller handles all '''requests to the L2''' cache from the core and the snoop requests from the FSB. The '''bus controller''' handles '''data and I/O requests''' to and from the FSB.&lt;br /&gt;
* The '''prefetching''' unit is extended to handle the hardware '''prefetches for each core separately'''.&lt;br /&gt;
* A '''new logical unit''' (represented by the hexagon) was added to maintain '''fairness between the requests''' coming from the different cores and hence balance the requests to L2 and memory.&lt;br /&gt;
&lt;br /&gt;
This new '''partitioned structure''' for the  memory/L2 access control unit '''enhanced''' the '''performance''' while '''reducing power consumption'''. &lt;br /&gt;
For more information on uniprocessor and multiprocessor implementation under the Intel architecture, refer to &lt;br /&gt;
[http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]&lt;br /&gt;
&lt;br /&gt;
The '''Intel bus architecture''' has been '''evolving''' to accommodate the demands of scalability while using the same MESI protocol: from a '''single shared bus''', to '''dual independent buses (DIB)''' that double the available bandwidth, to the logical conclusion of DIB, '''dedicated high-speed interconnects (DHSI)'''. The DHSI-based platforms use four FSBs, one for each processor in the platform. In both DIB and DHSI, a snoop filter is used in the chipset to cache snoop information, thereby significantly reducing the broadcasting needed for snoop traffic on the buses. With the production of processors based on the next-generation 45-nm Hi-k Intel Core microarchitecture, the [http://en.wikipedia.org/wiki/Xeon Intel Xeon] processor fabric will transition from DHSI, with the memory controller in the chipset, to a distributed shared memory architecture using '''Intel QuickPath Interconnects with the MESIF protocol'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Word Invalidation=&lt;br /&gt;
One complexity that applies to a number of the protocols concerns invalidation.  In newer protocols, individual words of a cache line may be modified, as opposed to the entire line.  The other processors will thus hold a mostly correct cache line that differs in only one word.  This adds complexity, because to 'correct' the error the other processors must transition from shared to invalid, then read the correct word from the bus and place it back into the line, transitioning back into shared.  This transition is potentially unnecessary, as the second processor may never access the specific word that changed, yet it may access other words in the cache line.  One potential solution being researched is to advance the protocols so that a cache line is &amp;quot;not invalidated on the first dirty word, but after the number of dirty words crosses some predetermined value, which is data type and application dependent.&amp;quot;  In other words, if the application can tolerate having multiple words 'incorrect,' several transitions to and from the invalid state may be avoided.&lt;br /&gt;
&amp;lt;br/&amp;gt;Source: http://tab.computer.org/tcca/NEWS/sept96/dsmideas.ps&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In essence, the solution proposed here is to advance the MOESI protocol with word invalidation and specific treatment of temporal and spatial data, so that the block is not invalidated.&lt;br /&gt;
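&lt;br /&gt;
A minimal sketch of the thresholding idea, seen from a cache that shares the line, is given below. The class and method names are hypothetical, and the sketch assumes the sharer can track which words are stale; the quoted proposal does not prescribe an implementation.&lt;br /&gt;

```python
# Hypothetical sharer-side view of a line under threshold-based invalidation.

class SharedLine:
    def __init__(self, threshold):
        self.threshold = threshold    # application-dependent, per the proposal
        self.dirty_words = set()      # words modified by other processors
        self.valid = True

    def remote_write(self, word_index):
        # Record that another processor modified one word of this line.
        self.dirty_words.add(word_index)
        # The set grows by at most one word per call, so this equality
        # fires exactly when the dirty count first crosses the threshold.
        if len(self.dirty_words) == self.threshold + 1:
            self.valid = False        # too many stale words: invalidate the line

    def can_read(self, word_index):
        # Clean words of a still-valid line can be read without a bus transaction.
        return self.valid and word_index not in self.dirty_words
```

With a threshold of 2, the line survives two remote writes and is invalidated by the third, so reads of untouched words in between do not miss.&lt;br /&gt;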
&lt;br /&gt;
=Performance: MOESI vs MEI/MESI=&lt;br /&gt;
Early multiprocessors (such as the PowerPC processors) were designed to work with three states (Modified, Exclusive, Invalid - this is similar to the MSI protocol discussed earlier, the term MEI is used to remain consistent with the sources used for reference).  As time progressed, more multi-processors transitioned to the MESI protocol.  This is most likely due to the characteristics of MEI - it is &amp;quot;easy to implement but leads to inefficiencies in the way that memory bus bandwidth is used&amp;quot; [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 source].  As a result, systems with two or more MEI processors (most likely) will not see the 'full' potential of the extra processors, due to the inefficient bus transactions.&lt;br /&gt;
&lt;br /&gt;
In contrast, the five-state protocol MOESI is more complex to implement than MESI and MEI/MSI.  However, advantages arise from using it.  As Andy Keane (VP of Marketing for PMC-Sierra) put it, &amp;quot;this [fifth] state allows shared data that is dirty to remain in the cache.  Without this state, any shared line would be written back to memory to change the original state from modified. Since we have a dedicated, fast path from CPU to CPU, this state maximizes the use of this path rather than the path to memory, which is inherently slower&amp;quot; [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 source].  MOESI realizes this advantage by making some processor-to-processor transfers directly from the level 2 caches.&lt;br /&gt;
&lt;br /&gt;
Source: [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 Edwards, Chris (08/06/2001). &amp;quot;Architects wrestle with multiprocessor options&amp;quot;. Electronic engineering times (0192-1541), (1178), p. 48.]&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Cache_coherence Cache coherence]&lt;br /&gt;
# [http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]&lt;br /&gt;
# [http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Symmetric_multiprocessing Common System Interface in Intel Processors]&lt;br /&gt;
# [http://www.zak.ict.pwr.wroc.pl/nikodem/ak_materialy/Cache%20consistency%20&amp;amp;%20MESI.pdf Cache consistency with MESI on Intel processor]&lt;br /&gt;
# [http://techreport.com/articles.x/8236/2 AMD dual core Architecture]&lt;br /&gt;
# [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual]&lt;br /&gt;
# [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors]&lt;br /&gt;
# [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of AMD 64 bit core]&lt;br /&gt;
# [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&amp;amp;arnumber=4913 Silicon Graphics Computer Systems]&lt;br /&gt;
# [http://books.google.com/books?id=g82fofiqa5IC&amp;amp;printsec=frontcover&amp;amp;dq=Parallel+computer+architecture:+a+hardware/software+approach+By+David+E.+Culler,+Jaswinder+Pal+Singh,+Anoop+Gupta&amp;amp;source=bl&amp;amp;ots=COrdamlfVn&amp;amp;sig=YcugVqbzTjHvlofvaFq6Ft_tjfY&amp;amp;hl=en&amp;amp;ei=0ZO6S4TJGcOclgejzI3BBw&amp;amp;sa=X&amp;amp;oi=book_result&amp;amp;ct=result&amp;amp;resnum=1&amp;amp;ved=0CAgQ6AEwAA#v=onepage&amp;amp;q=&amp;amp;f=false Parallel computer architecture: a hardware/software approach By David E. Culler, Jaswinder Pal Singh, Anoop Gupta]&lt;br /&gt;
# [http://www.freepatentsonline.com/5283886.html Three state invalidation protocols]&lt;br /&gt;
# [http://portal.acm.org/citation.cfm?id=1499317&amp;amp;dl=GUIDE&amp;amp;coll=GUIDE&amp;amp;CFID=83027384&amp;amp;CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems]&lt;br /&gt;
# [http://portal.acm.org/citation.cfm?id=6514 Cache Coherence protocols: evaluation using a multiprocessor simulation model]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Xerox_Dragon Xerox Dragon]&lt;br /&gt;
# [http://thanaseto.110mb.com/courses/CSD-527-report-engl.pdf Coherence Protocols]&lt;br /&gt;
# [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&amp;amp;arnumber=289691 XDBus]&lt;br /&gt;
# [http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dai0228a/index.html ARM]&lt;br /&gt;
# [http://ixbtlabs.com/articles/ibmpower4/index.html IBM POWER4]&lt;br /&gt;
# Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems. Solihin, Yan&lt;br /&gt;
# [http://www.chiplist.com/Intel_Itanium_2_processor/tree3f-section--2235-/ Intel Itanium 2]&lt;br /&gt;
# [http://techreport.com/articles.x/8236/2 MESI-MESI-MOESI Banana-fana...]&lt;br /&gt;
# [http://rsim.cs.illinois.edu/rsim/Manual/node109.html RSIM]&lt;br /&gt;
# [http://dl.acm.org/citation.cfm?id=1499317 Synapse tightly coupled multiprocessors]&lt;/div&gt;</summary>
		<author><name>Pxu</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8a_fu&amp;diff=59836</id>
		<title>CSC/ECE 506 Spring 2012/8a fu</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8a_fu&amp;diff=59836"/>
		<updated>2012-03-19T01:51:30Z</updated>

		<summary type="html">&lt;p&gt;Pxu: /* MSI Protocol */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Introduction to bus-based cache coherence in real machines=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==SMP Architecture==&lt;br /&gt;
Most parallel software in the commercial market relies on the shared-memory programming model, in which all processors access the same physical address space. The most common multiprocessors today use the [http://en.wikipedia.org/wiki/Symmetric_multiprocessing SMP] architecture, which uses a common bus as the interconnect.  In the case of multicore processors (&amp;quot;chip multiprocessors,&amp;quot; or CMP), the SMP architecture applies to the cores, treating them as separate processors. The key problem of shared-memory multiprocessors is providing a consistent view of memory across the various cache hierarchies.  This is called the '''''cache coherence problem'''''. Solving it is critical for correctness and is a performance-sensitive design point in supporting the shared-memory model. The cache coherence mechanisms not only govern communication in a shared-memory multiprocessor, but also typically determine how the memory system transfers data between processors, caches, and memory.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:Busbased SMP.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
At any point in logical time, the permissions for a cache block can allow either a single writer or multiple readers. The '''''coherence protocol''''' ensures the invariants of the states are maintained. The different coherent states used by most of the cache coherence protocols are as shown in ''Table 1'':&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''States'''&lt;br /&gt;
|  '''Access Type'''&lt;br /&gt;
|  '''Invariant'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  read, write&lt;br /&gt;
|  all other caches in I state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  read&lt;br /&gt;
|  all other caches in I state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Owned'''&lt;br /&gt;
|  read&lt;br /&gt;
|  all other caches in I or S state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  read&lt;br /&gt;
|  no other cache in M or E state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|  -&lt;br /&gt;
|  -&lt;br /&gt;
|}&lt;br /&gt;
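The invariants in ''Table 1'' can be checked mechanically. The following is a minimal sketch, assuming each cache's state for one block is given as a single letter (M, O, E, S, I); the function name is hypothetical and not part of any protocol specification.&lt;br /&gt;

```python
def invariants_hold(states):
    """Return True if the per-cache states of ONE block satisfy Table 1.

    `states` is a list such as ["M", "I", "I"], one entry per cache.
    (Hypothetical helper, for illustration only.)
    """
    for i, state in enumerate(states):
        others = states[:i] + states[i + 1:]
        if state in ("M", "E"):
            # Modified / Exclusive: every other cache must be Invalid.
            if any(o != "I" for o in others):
                return False
        elif state == "O":
            # Owned: other caches may only be Invalid or Shared.
            if any(o not in ("I", "S") for o in others):
                return False
        elif state == "S":
            # Shared: no other cache may hold the block in M or E.
            if any(o in ("M", "E") for o in others):
                return False
    return True
```

For example, one Modified copy alongside a Shared copy violates the single-writer rule, while one Owned copy alongside several Shared copies is allowed.&lt;br /&gt;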
&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Snooping Protocols=&lt;br /&gt;
==MSI Protocol==&lt;br /&gt;
&lt;br /&gt;
'''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' is a three-state write-back '''invalidation protocol''' and one of the earliest snooping-based cache coherence protocols. It marks each cache line as '''Modified (M)''', '''Shared (S)''', or '''Invalid (I)'''. '''Invalid''' means the cache line is either not present or is in the invalid state. If the cache line is clean and may be shared by more than one processor, it is marked '''Shared'''. If the cache line is dirty and the processor has exclusive ownership of it, it is in the '''Modified''' state. A BusRdX causes the other caches to invalidate (demote) their copies to the '''I''' state; a cache that holds the line in the '''M''' state flushes it first. A processor write issues a BusRdX even when it hits a line in the '''S''' state, and the line is then promoted (upgraded) to the '''M''' state.&lt;br /&gt;
&lt;br /&gt;
The state diagram of the MSI protocol is shown below. Note that the left part shows the response to processor-side requests, and the right part shows the response to snooper-side requests.[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
[[File:MSInew.jpg|600px|thumb|center|MSI state transition diagram]]&lt;br /&gt;
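The transitions described above can be written as two lookup tables, one for processor-side requests and one for snooped bus transactions. This is a minimal sketch of the basic MSI protocol as described in this section (a write hit in S issues BusRdX); the table and function names are assumptions made for illustration.&lt;br /&gt;

```python
# (current state, processor request) maps to (next state, bus transaction)
MSI_PROCESSOR = {
    ("I", "PrRd"): ("S", "BusRd"),
    ("I", "PrWr"): ("M", "BusRdX"),
    ("S", "PrRd"): ("S", None),
    ("S", "PrWr"): ("M", "BusRdX"),
    ("M", "PrRd"): ("M", None),
    ("M", "PrWr"): ("M", None),
}

# (current state, snooped transaction) maps to (next state, action)
MSI_SNOOP = {
    ("I", "BusRd"): ("I", None),
    ("I", "BusRdX"): ("I", None),
    ("S", "BusRd"): ("S", None),
    ("S", "BusRdX"): ("I", None),
    ("M", "BusRd"): ("S", "Flush"),
    ("M", "BusRdX"): ("I", "Flush"),
}


def simulate_msi(num_caches, ops):
    """Run (cache index, request) pairs for ONE block; return states and bus log."""
    states = ["I"] * num_caches
    bus_log = []
    for cache, request in ops:
        next_state, bus_txn = MSI_PROCESSOR[(states[cache], request)]
        if bus_txn is not None:
            bus_log.append(bus_txn)
            # Every other cache snoops the transaction and reacts.
            for other in range(num_caches):
                if other != cache:
                    states[other], action = MSI_SNOOP[(states[other], bus_txn)]
                    if action is not None:
                        bus_log.append(action)
        states[cache] = next_state
    return states, bus_log
```

Running a read followed by a write on cache 0, then a read on cache 1, shows the two bus transactions of the read-write sequence and the flush forced by the later read.&lt;br /&gt;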
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Synapse protocol==&lt;br /&gt;
From the state transition diagram of MSI, we observe that there is a transition from state '''M''' to state '''S''' when a BusRd is observed for that block. The contents of the block are flushed to the bus before the block goes to the '''S''' state. In certain cases it would be more appropriate to move to the '''I''' state, thus giving up the block entirely. Choosing '''S''' reflects the designer's assertion that the original processor is more likely to continue reading the block than the new processor is to write to it. The Synapse protocol, used in the early Synapse multiprocessor, made the alternate choice of going directly from the '''M''' state to the '''I''' state on a BusRd, assuming that the migratory pattern would be more frequent. More details about this protocol can be found in these papers published in the 1980s: [http://portal.acm.org/citation.cfm?id=6514 Cache coherence protocols: evaluation using a multiprocessor simulation model] and [http://portal.acm.org/citation.cfm?id=1499317&amp;amp;dl=GUIDE&amp;amp;coll=GUIDE&amp;amp;CFID=83027384&amp;amp;CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems].&lt;br /&gt;
&lt;br /&gt;
In the Synapse protocol, the '''M''' state is called the '''D''' (Dirty) state. The following is the state transition diagram for the Synapse protocol:&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:Synapse1.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
===Real Architectures using Synapse===&lt;br /&gt;
Nothing could be found on any actual architecture using the Synapse cache-coherence protocol.  The abstract of the 1984 paper by Steve Frank and Armond Inselberg, &amp;quot;[http://dl.acm.org/citation.cfm?id=1499317 Synapse Tightly Coupled Multiprocessors: A New Approach to Solve Old Problems]&amp;quot;, suggests that Synapse was theoretical.&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[http://www.disi.unige.it/person/DelzannoG/CacheProtocol/synapse.hy The BABYLON Project] provides an example of a Synapse cache coherence protocol with atomic synchronization actions.&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
In the Synapse system, memory contention is reduced by designing a processor cache that employs a non-write-through algorithm to minimize bandwidth between the cache and shared memory. The Synapse Expansion Bus includes an ownership-level protocol between processor caches.[[#References|&amp;lt;sup&amp;gt;[24]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
==[http://en.wikipedia.org/wiki/MESI_protocol MESI]==&lt;br /&gt;
The main drawback of MSI is that each read-write sequence incurs two bus transactions irrespective of whether the cache line is stored in only one cache or not. Highly parallel programs that have little data sharing suffer the most from this.  The MESI protocol solves this problem by introducing the '''Exclusive''' state to distinguish between a cache line stored in multiple caches and a line stored in a single cache.&lt;br /&gt;
Let us briefly see how the MESI protocol works. For a more detailed version, refer to the Solihin textbook, p. 215.&lt;br /&gt;
&lt;br /&gt;
The MESI coherence protocol marks each cache line as being in one of the Modified, Exclusive, Shared, or Invalid states. &lt;br /&gt;
* '''Invalid''' : The cache line is either not present or is invalid&lt;br /&gt;
* '''Exclusive''' : The cache line is clean and is owned by this core/processor only&lt;br /&gt;
* '''Modified''' : The cache line is dirty and the core/processor has exclusive ownership of it, exclusive of main memory as well.&lt;br /&gt;
* '''Shared''' : The cache line is clean and is shared by more than one core/processor&lt;br /&gt;
&lt;br /&gt;
The MESI protocol works as follows: &lt;br /&gt;
A line that is fetched receives the '''E''' or '''S''' state depending on whether it exists in other processors in the system. A cache line gets the '''M''' state when a processor writes to it; if the line is not in the '''E''' or '''M''' state prior to the write, the cache sends a Bus Upgrade (BusUpgr) signal or, as the Intel manuals term it, a “Read-For-Ownership (RFO) request”, which ensures that the line exists in the cache and is in the '''I''' state in all other processors on the bus (if any). The table below summarizes the MESI protocol.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''This cache line is valid?'''&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  No&lt;br /&gt;
|-&lt;br /&gt;
|  '''The memory copy is…'''&lt;br /&gt;
|  out of date&lt;br /&gt;
|  valid&lt;br /&gt;
|  valid&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies exist in caches of other processors?'''&lt;br /&gt;
|  No&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe&lt;br /&gt;
|  Maybe&lt;br /&gt;
|-&lt;br /&gt;
|  '''A write to this line'''&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  goes directly to bus&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
State transition diagram for MESI protocol is shown below.[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
[[File:MESInew.jpg|500px|thumb|center|MESI state transition diagram]] &lt;br /&gt;
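The benefit of the Exclusive state can be illustrated by counting bus transactions for a read miss followed by a write to the same line. This is a simplified sketch under the assumptions stated in the comments; the function name and its parameters are hypothetical.&lt;br /&gt;

```python
def bus_transactions(protocol, other_caches_have_copy):
    """Count bus transactions for a read miss followed by a write to one line.

    Simplified model: `protocol` is "MSI" or "MESI"; the line starts Invalid
    in the requesting cache. Returns (transaction count, final state).
    """
    if protocol not in ("MSI", "MESI"):
        raise ValueError("protocol must be 'MSI' or 'MESI'")
    count = 1  # the read miss always places a BusRd on the bus
    if protocol == "MSI" or other_caches_have_copy:
        state = "S"  # MSI has no Exclusive state
    else:
        state = "E"  # MESI: no other cache holds the line
    if state == "S":
        count += 1   # the write needs an invalidating transaction (RFO)
    # From E, the write upgrades to M silently, with no bus transaction.
    return count, "M"
```

With no sharers, MSI needs two transactions while MESI needs one; when another cache holds a copy, both need two.&lt;br /&gt;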
&lt;br /&gt;
===Real Architectures using MESI===&lt;br /&gt;
&lt;br /&gt;
Intel's [http://en.wikipedia.org/wiki/Pentium_Pro Pentium Pro] microprocessor, introduced in 1995, was the first Intel architecture microprocessor to support symmetric multiprocessing in various multiprocessor configurations.  SMP with the MESI protocol was used consistently until the introduction of the 45-nm Hi-k Core microarchitecture in Intel's (Nehalem-EP) quad-core x86-64 processors. The 45-nm Hi-k Intel Core microarchitecture uses a new system framework called the [http://en.wikipedia.org/wiki/Intel_QuickPath_Interconnect QuickPath Interconnect], which uses point-to-point interconnection technology based on a distributed shared-memory architecture.  It uses a modified version of the MESI protocol called [http://en.wikipedia.org/wiki/MESIF_protocol MESIF], which introduces the additional Forward state.  (The state diagram of the MESI transitions that occur within the Pentium's data cache can be found on page 63 of [http://books.google.com/books?id=TVzjEZg1--YC&amp;amp;printsec=frontcover&amp;amp;source=gbs_ge_summary_r&amp;amp;cad=0#v=onepage&amp;amp;q&amp;amp;f=false Pentium Processor System Architecture: Second Edition] by Don Anderson, Tom Shanley, MindShare, Inc.)&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
The [http://www.arm.com/products/processors/classic/arm11/arm11-mpcore.php ARM11 MPCore] and [http://en.wikipedia.org/wiki/ARM_Cortex-A9_MPCore Cortex-A9 MPCore] processors support the MESI cache coherency protocol.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]  ARM MPCore defines the states of the MESI protocol it implements as:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies in other caches'''&lt;br /&gt;
|  NO&lt;br /&gt;
|  NO&lt;br /&gt;
|  YES&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Clean or Dirty'''&lt;br /&gt;
|  DIRTY&lt;br /&gt;
|  CLEAN&lt;br /&gt;
|  CLEAN&lt;br /&gt;
|  -&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
The coherency protocol is implemented and managed by the Snoop Control Unit (SCU) in the ARM MPCore, which monitors the traffic between the local L1 data caches and the next level of the memory hierarchy. At boot time, each core can choose whether or not to take part in the coherency domain.  Unless explicit system calls bind a task to a specific core ([http://en.wikipedia.org/wiki/Processor_affinity processor affinity]), there is a high chance that the task will at some point migrate to a different core, along with its data as it is used.  Task migration is not handled efficiently by a literal MESI implementation, so the ARM MPCore offers two optimizations that remain MESI compliant while supporting migration of tasks. With '''Direct Data Intervention (DDI)''', the SCU keeps a copy of the tag RAMs of all the cores' caches, which lets it efficiently detect whether a cache line requested by one core is held by another core in the coherency domain before looking for it in the next level of the memory hierarchy. With '''Cache-to-cache Migration''', if the SCU finds that the cache line requested by one CPU is present in another core, it either copies it (if clean) or moves it (if dirty) from the other CPU directly into the requesting one, without interacting with external memory.  These optimizations reduce memory traffic in and out of the L1 cache subsystem by eliminating interaction with external memories, which in effect reduces the overall load on the interconnect and the overall power consumption.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
Other architectures that use the MESI cache-coherence protocol include the L2 cache of the IBM POWER4 processor[[#References|&amp;lt;sup&amp;gt;[20]&amp;lt;/sup&amp;gt;]], the L2 cache of the Intel Itanium 2 processor[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]], and the Intel Xeon[[#References|&amp;lt;sup&amp;gt;[22]&amp;lt;/sup&amp;gt;]].&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
One implementation complexity already mentioned is the inefficiency of task migration in MESI that the ARM MPCore addresses.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]  Another possible implementation complexity is found during the replacement of a cache line: one possible MESI implementation requires a message to be sent to main memory when a cache line is flushed (i.e. an E to I transition), as the line was exclusively in one cache before it was removed. It is possible to avoid this replacement message if the system is designed so that the flush of a modified (exclusive) line requires an acknowledgment from main memory. However, this requires the flush to be stored in a 'write-back' buffer until the reply arrives (to ensure the change is successfully propagated to memory).[[#References|&amp;lt;sup&amp;gt;[23]&amp;lt;/sup&amp;gt;]]&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==MESIF==&lt;br /&gt;
Let us now briefly walk through the '''MESIF protocol''':&lt;br /&gt;
&lt;br /&gt;
The '''MESIF''' protocol, used in the latest Intel multi-core processors, was introduced to '''accommodate the point-to-point''' links used in the QuickPath Interconnect. Using the '''MESI''' protocol in this architecture would send many redundant messages between different processors, often with unnecessarily high latency. For example, when a processor requests a cache line that is stored in multiple locations, every location might respond with the data. As the requesting processor only needs a single copy of the data, the system would be wasting bandwidth. &lt;br /&gt;
As a solution to this problem, an additional state, the '''Forward state''', was added by slightly changing the role of the Shared state. Whenever there is a read request, only the cache line in the F state responds to the request, while all the S state caches remain dormant.  Hence, by designating a single cache line to '''respond to requests''', coherency traffic is substantially reduced when multiple copies of the data exist. Also, on a read request, the responding line transitions from the F state to the S state. That is, when a cache line in the '''F''' state is '''copied''', the F state '''migrates''' to the '''newer copy''', while the '''older''' one drops back to '''S'''. Moving the new copy to the F state '''exploits''' both '''temporal and spatial locality'''. Because the newest copy of the cache line is always in the F state, it is very unlikely that the line in the F state will be evicted from the caches. This takes advantage of the temporal locality of the request. The second advantage is that if a particular cache line is in high demand due to spatial locality, the bandwidth used to transmit that data is spread across several nodes.&lt;br /&gt;
All M to S and E to S state transitions will now be '''M to F''' and '''E to F'''.  &lt;br /&gt;
The '''F state''' is '''different''' from the '''Owned state''' of the MOESI protocol, as it is '''not''' a unique copy: a valid copy is also stored in memory. Thus, unlike a line in the Owned state of the MOESI protocol, for which main memory may be out of date and the owner must eventually write the data back, a line in the F state can be evicted or converted to the S state, if desired. &lt;br /&gt;
&lt;br /&gt;
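The migration of the F state can be sketched as follows. This is a simplified model, not Intel's implementation: it tracks only the E, F, S, and I states of one line, assumes a cache in E or F supplies the data, and uses hypothetical names throughout.&lt;br /&gt;

```python
def mesif_read_miss(states, requester):
    """Apply one read miss to `states` (dict: cache name to state).

    Returns the name of the single responder ("memory" if no cache forwards).
    Simplifying assumption: only the E, F, S and I states are modelled.
    """
    if states[requester] != "I":
        raise ValueError("requester already holds the line")
    others = {c: s for c, s in states.items() if c != requester}
    forwarders = [c for c, s in others.items() if s in ("F", "E")]
    if forwarders:
        responder = forwarders[0]
        states[responder] = "S"   # the older copy drops back to Shared
    else:
        responder = "memory"      # caches in S stay silent
    any_copy = any(s != "I" for s in others.values())
    # The newest copy becomes the forwarder; a sole copy is Exclusive.
    states[requester] = "F" if any_copy else "E"
    return responder
```

Three successive read misses by c0, c1, and c2 are answered by memory, c0, and c1 respectively, and each request receives exactly one response.&lt;br /&gt;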
More information on the QuickPath Interconnect and MESIF protocol can be found at&lt;br /&gt;
'''[http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]'''&lt;br /&gt;
&lt;br /&gt;
==MOESI==&lt;br /&gt;
[http://en.wikipedia.org/wiki/Opteron AMD Opteron] was AMD’s first-generation dual-core processor, with 2 distinct [http://en.wikipedia.org/wiki/Athlon_64 K8 cores] together on a single die.  Cache coherence becomes a bigger problem on such multiprocessors, and an appropriate coherence protocol was needed to address it. The [http://en.wikipedia.org/wiki/Xeon Intel Xeon], the competing product from Intel, used the MESI protocol to handle cache coherence.  MESI came with the drawback of using considerable time and bandwidth in certain situations. &lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/MOESI_protocol MOESI] was AMD’s answer to this problem. MOESI added a fifth state to the MESI protocol called '''“Owned”'''. MOESI addresses the bandwidth problem faced in the MESI protocol when a processor holding an invalid copy of a line in its cache wants to access data that another processor has modified.  The processor seeking access has to wait for the processor that modified the data to write it back to main memory, which takes time and bandwidth. This drawback is removed in MOESI by allowing dirty sharing.  When the data is held by a processor in the new state '''“Owned”''', it can provide other processors the modified data without, or even before, writing it to main memory. This is called '''''dirty sharing'''''. The processor with the data in the '''&amp;quot;Owned&amp;quot;''' state stays responsible for updating main memory later, when the cache line is evicted.&lt;br /&gt;
&lt;br /&gt;
MOESI has become one of the most popular snoop-based protocols supported in the AMD64 architecture.  The AMD dual-core Opteron can maintain cache coherence in systems of up to 8 processors using this protocol.&lt;br /&gt;
&lt;br /&gt;
The five different states of the MOESI protocol are:&lt;br /&gt;
* '''Modified (M)''' : The most recent copy of the data is present in the cache line. But it is not present in any other processor cache.&lt;br /&gt;
* '''Owned (O)'''   : The cache line has the most recent, correct copy of the data. This can be shared by other processors. The processor holding the line in this state is responsible for updating main memory with the correct value when the line is evicted.  &lt;br /&gt;
* '''Exclusive (E)''' : A cache line holds the most recent, correct copy of the data, which is exclusively present on this processor and a copy is present in the main memory.  &lt;br /&gt;
* '''Shared (S)''' : A cache line in the shared state holds the most recent, correct copy of the data, which may be shared by other processors. &lt;br /&gt;
* '''Invalid (I)''' : A cache line does not hold a valid copy of the data.&lt;br /&gt;
&lt;br /&gt;
A detailed explanation of this protocol implementation on AMD processor can be found in the manual [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of the AMD 64-bit core]&lt;br /&gt;
&lt;br /&gt;
The following table summarizes the MOESI protocol:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Owned'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''This cache line is valid?'''&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  No&lt;br /&gt;
|-&lt;br /&gt;
|  '''The memory copy is…'''&lt;br /&gt;
|  out of date&lt;br /&gt;
|  out of date&lt;br /&gt;
|  valid&lt;br /&gt;
|  valid&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies exist in caches of other processors?'''&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe (in the Shared state)&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe&lt;br /&gt;
|  Maybe&lt;br /&gt;
|-&lt;br /&gt;
|  '''A write to this line'''&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus (invalidates shared copies) and updates cache&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  goes directly to bus&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
State transition diagram for MOESI protocol is shown below. The top part shows the response to processor-side requests and the bottom part is the response to snooper-side requests. [[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:MOESI.jpg|500px|thumb|center|MOESI state transition diagram]]&lt;br /&gt;
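The effect of dirty sharing on memory traffic can be sketched by counting main-memory writes in a producer-consumer exchange. This is a simplified model under stated assumptions (one line, one producer, one consumer, the line evicted once at the end); the function name is hypothetical.&lt;br /&gt;

```python
def memory_writebacks(protocol, rounds):
    """Count main-memory writes when a producer writes a line and a consumer
    then reads it, repeated `rounds` times, after which the producer evicts it.

    Simplified model; `protocol` is "MESI" or "MOESI".
    """
    if protocol not in ("MESI", "MOESI"):
        raise ValueError("protocol must be 'MESI' or 'MOESI'")
    writes = 0
    producer_state = "I"
    for _ in range(rounds):
        producer_state = "M"      # producer write; consumer copy invalidated
        if protocol == "MESI":
            writes += 1           # the flush on the consumer read updates memory
            producer_state = "S"
        else:
            producer_state = "O"  # dirty data supplied cache-to-cache
    if producer_state in ("M", "O"):
        writes += 1               # a dirty line is written back on eviction
    return writes
```

In this model MESI writes memory once per round, whereas MOESI defers the single write-back until the line is evicted.&lt;br /&gt;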
&lt;br /&gt;
&amp;lt;br /&amp;gt;&lt;br /&gt;
&amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Optimization techniques on MOESI===&lt;br /&gt;
&lt;br /&gt;
In real machines, applying optimization techniques to the standard cache coherence protocol improves performance. For example, the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] family of microprocessors (Family 10h), which is AMD’s first generation to incorporate 4 distinct cores on a single die and the first to have a cache that all the cores share, uses the MOESI protocol with some optimization techniques incorporated. &lt;br /&gt;
&lt;br /&gt;
One such optimization focuses on a small subset of compute problems that behave like producer-consumer programs. In such a problem, a thread running on one core produces data that is consumed by a thread running on a separate core. With such programs, it is desirable to get the two cores to communicate through the shared cache, to avoid round trips to and from main memory. However, the '''MOESI''' protocol that the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] cache uses for cache coherence can also limit bandwidth. By keeping the cache line in the '''‘M’''' state for such computing problems, we can achieve better performance.&lt;br /&gt;
&lt;br /&gt;
When the producer thread writes a new entry into a ring buffer, it allocates cache lines in the '''modified (M)''' state. Eventually, these M-marked cache lines start to fill the L3 cache. When the consumer reads a cache line, the MOESI protocol changes the state of the cache line to '''owned (O)''' in the L3 cache and pulls down a '''shared (S)''' copy for its own use. The producer thread then circles the ring buffer and arrives back at the same cache line it had previously written. However, when the producer attempts to write new data to the owned (marked '''‘O’''') cache line, it finds that it cannot, since a cache line marked '''‘O’''' by the previous consumer read does not have sufficient permission for a write request (in the MOESI protocol). To maintain coherence, the memory controller must initiate probes in the other caches (to handle any other S copies that may exist). This slows down the process.&lt;br /&gt;
&lt;br /&gt;
Thus, it is preferable to keep the cache line in the '''‘M’''' state in the L3 cache. In such a situation, when the producer comes back around the ring buffer, it finds the previously written cache line still marked '''‘M’''', to which it is safe to write without coherence concerns. Thus better performance can be achieved by such optimization techniques to standard protocols when implemented in real machines.&lt;br /&gt;
&lt;br /&gt;
You can find more information on how this is implemented and various other ways of optimizations in this manual [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors]&lt;br /&gt;
&lt;br /&gt;
==Dragon Protocol==&lt;br /&gt;
The '''[http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]''' is an update-based coherence protocol that, unlike the coherence protocols we have seen so far, does not invalidate other cached copies. Write propagation is achieved by updating the cached copies instead of invalidating them.  The Dragon Protocol does not update memory on a cache-to-cache transfer; it delays making memory consistent with the cache until the data is evicted and written back, which saves time and lowers the memory access requirements. Moreover, only the written '''byte''' or '''word''' is '''communicated''' to the '''other caches''' instead of the whole block, which further '''reduces''' the '''bandwidth''' usage. It can dynamically detect the sharing status of a block and use a write-through policy for shared blocks and write-back for currently non-shared blocks. The Dragon Protocol employs the following four states for the cache blocks: '''Shared Clean''', '''Shared Modified''', '''Exclusive''' and  '''Modified'''. &lt;br /&gt;
* '''Modified (M)''' and '''Exclusive (E)''' - these states have the same meaning as explained in the protocols above. &lt;br /&gt;
* '''Shared Modified (Sm)''' - Only one cache in the system can hold the block in the Shared Modified state. Potentially two or more caches have this block, memory may or may not be up to date, and this processor's cache has modified the block.&lt;br /&gt;
* '''Shared Clean (Sc)''' -  Potentially two or more caches have this block and memory may or may not be up to date (if no other cache has it in the Sm state, memory is up to date; otherwise it is not).&lt;br /&gt;
Only when a Shared Modified line is evicted from the cache on a cache miss is the block written back to main memory in order to keep memory consistent. For more information on the Dragon protocol, refer to the Solihin textbook, page 229. The state transition diagram is given below for reference.[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
[[File:DragonNew.jpg|500px|thumb|center|Dragon state transition diagram]] &lt;br /&gt;
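The update-based write path can be sketched as follows. This is a simplified model of a write hit only, assuming that a cache holding the block appears in the `states` dictionary and a cache without it is simply absent; the function name and data layout are hypothetical.&lt;br /&gt;

```python
def dragon_write_hit(states, data, writer, value):
    """Apply a write hit by `writer` to one block under a Dragon-style protocol.

    `states` maps cache name to "E", "M", "Sc" or "Sm"; `data` maps cache
    name to the cached word. Returns the bus transaction used, or None.
    Main memory is deliberately not modelled: it is not updated here.
    """
    if writer not in states:
        raise ValueError("writer does not hold the block (not a write hit)")
    sharers = [c for c in states if c != writer]
    data[writer] = value
    if sharers:
        for cache in sharers:
            states[cache] = "Sc"   # the copy stays valid ...
            data[cache] = value    # ... and receives the written word
        states[writer] = "Sm"      # the writer now owns the dirty shared block
        return "BusUpd"
    states[writer] = "M"           # no sharers: the write stays local
    return None
```

When another cache shares the block, the write produces a BusUpd and every copy sees the new value; with no sharers the line simply becomes Modified.&lt;br /&gt;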
&lt;br /&gt;
The Dragon protocol implements snoopy caches that provide the appearance of a uniform memory space to multiple processors. Here, each cache listens to 2 buses: the processor bus and the memory bus. The caches are also responsible for address translation, so the processor bus carries virtual addresses and the memory bus carries physical addresses.  The Dragon system was designed to support 4 to 8 Dragon processors.&lt;br /&gt;
&lt;br /&gt;
=[http://en.wikipedia.org/wiki/Instruction_prefetch Instruction Prefetching]=&lt;br /&gt;
&lt;br /&gt;
In standard [http://en.wikipedia.org/wiki/Pipeline_(computing) CPU pipelining], instruction prefetching is used to minimize the time the CPU spends waiting for instructions to be fetched from main memory.  Prefetching works perfectly on completely linear programs; but on programs with branches, other optimizations (like [http://en.wikipedia.org/wiki/Branch_predictor branch prediction]) have to be implemented, and the cost of taking the wrong branch has to be considered.  Source:  http://dictionary.reference.com/browse/instruction+prefetch&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
David F. Kotz and Carla Schlatter Ellis, &amp;quot;Prefetching in File Systems for MIMD Multiprocessors,&amp;quot; in IEEE Transactions on Parallel and Distributed Systems, Vol. 1, No. 2,  April 1990, pp. 218-230.&lt;br /&gt;
&lt;br /&gt;
Prefetching brings additional blocks into the cache, namely those surrounding the blocks currently being accessed by the processor. Kotz and Ellis (cited above) state that simply increasing the hit ratio does not automatically improve overall performance in multiprocessor computation.  &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Instruction prefetching is a technique used to speed up the execution of a program. But in multiprocessors, prefetching comes at a cost: data can be modified in such a way that the memory coherence protocol is not able to handle the effects of a prefetch. In such situations software must use serializing instructions or cache-invalidation instructions to guarantee that subsequent data accesses are coherent. &lt;br /&gt;
&amp;lt;br /&amp;gt;&lt;br /&gt;
An example of this type of a situation is a page-table update followed by accesses to the physical pages referenced by the updated page tables. The physical-memory references for the page tables are different than the physical-memory references for the data. Because of prefetching there may be problems with correctness. The following sequence of events shows such a situation when software changes the translation of virtual-page A from physical-page M to physical-page N:&lt;br /&gt;
# The tables that translate virtual-page A to physical-page M are now held only in main memory. The copies in the cache are invalidated.&lt;br /&gt;
# Software changes the page-table entry for virtual-page A in main memory to point to physical-page N rather than physical-page M.&lt;br /&gt;
# Data in virtual-page A is accessed.&lt;br /&gt;
&lt;br /&gt;
Software expects the processor to access the data from physical-page N after the update. However, it is possible for the processor to prefetch the data from physical-page M before the page table for virtual page A is updated. Because the physical-memory references are different, the processor does not recognize them as requiring coherence checking and believes it is safe to prefetch the data from virtual-page A, which is translated into a read from physical page M. Similar behavior can occur when instructions are prefetched from beyond the page-table update instruction.&lt;br /&gt;
&lt;br /&gt;
To prevent such errors, the architecture provides special instructions that software must execute immediately after the page-table update to ensure that subsequent instruction fetches and data accesses use the correct virtual-page-to-physical-page translation. It is not necessary to perform a TLB invalidation operation preceding the table update.&lt;br /&gt;
&lt;br /&gt;
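The hazard and its remedy can be illustrated with a toy model. This is only a sketch: the class below is hypothetical, it reduces the processor to a page table plus a set of translations it has already fetched, and it does not model any real instruction.&lt;br /&gt;

```python
class ToyMMU:
    """Toy model of a processor that may fetch a translation early."""

    def __init__(self, page_table):
        self.page_table = dict(page_table)  # the table held in main memory
        self.fetched = {}                   # translations already fetched

    def prefetch(self, vpage):
        # The processor speculatively resolves the translation in advance.
        self.fetched[vpage] = self.page_table[vpage]

    def update_page_table(self, vpage, ppage):
        # Software changes the entry in memory; fetched copies are untouched.
        self.page_table[vpage] = ppage

    def invalidate(self, vpage):
        # Stands in for the special instruction executed after the update.
        self.fetched.pop(vpage, None)

    def access(self, vpage):
        if vpage not in self.fetched:
            self.fetched[vpage] = self.page_table[vpage]
        return self.fetched[vpage]
```

If virtual-page A is prefetched while it still maps to M, an access after the update keeps returning M until the invalidation is performed, after which it returns N.&lt;br /&gt;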
More information can be found about this in [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual]&lt;br /&gt;
&lt;br /&gt;
= CMP Implementation in Intel Architecture =&lt;br /&gt;
&lt;br /&gt;
Let us now see how Intel architecture using the MESI protocol progressed from a uniprocessor architecture to a Chip MultiProcessor (CMP) using the bus as the interconnect. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Uniprocessor Architecture'''&lt;br /&gt;
&lt;br /&gt;
The diagram below shows the structure of the memory cluster in the Intel Pentium M processor.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:intel_cache1.jpg]]&amp;lt;/center&amp;gt; &amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In this structure we have,&lt;br /&gt;
* A unified on-chip '''L1 cache''' with the '''processor/core''',&lt;br /&gt;
* A '''Memory/L2 access control unit''', through which all the accesses to the L2 cache, main memory and IO space are made,&lt;br /&gt;
* The second level '''L2 cache''' along with the '''prefetch unit''' and&lt;br /&gt;
* '''Front side bus (FSB)''', a single shared bi-directional bus through which all the traffic is sent. This wide bus brings in multiple data bytes at a time. &lt;br /&gt;
&lt;br /&gt;
As Intel explains it, with this structure the processor requests were first sought in the '''L2 cache''' and, only on a '''miss''', were they '''forwarded''' to the main '''memory''' via the front side bus ('''FSB'''). The '''Memory/L2 access control''' unit served as a central point for '''maintaining coherence''' within the core and with the external world. It '''contains''' a '''snoop control unit''' that receives snoop requests from the bus and performs the required operations on each cache (and internal buffers) in parallel. It also handles RFO requests (BusUpgr) and ensures that the operation continues only after it guarantees that no other version of the cache line exists in any other cache in the system.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''CMP Architecture'''&lt;br /&gt;
&lt;br /&gt;
For the CMP implementation, Intel chose the bus-based architecture using snoopy protocols over the '''directory protocol''' because, although a directory protocol reduces active power due to reduced snoop activity, it '''increases''' the '''design complexity''' and the '''static power''' due to larger tag arrays. Since Intel has a large market for processors in the mobility family, a directory-based solution was less favorable, since battery life depends mainly on static power consumption and less on dynamic power.&lt;br /&gt;
Let us examine how '''CMP''' was implemented in the '''Intel Core Duo''', which was one of the first dual-core processors for the budget/entry-level market. &lt;br /&gt;
The general CMP implementation structure of the Intel Core Duo is shown below.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:intel_cache2.jpg]]&amp;lt;/center&amp;gt; &amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This structure has the following changes when compared to the uniprocessor memory cluster structure. &lt;br /&gt;
* The '''L1 cache''' and the '''processor/core''' structure are '''duplicated''' to give 2 cores.&lt;br /&gt;
* The '''Memory/L2 access control''' unit is '''split''' into 2 logical units: '''L2 controller''' and '''bus controller'''. The L2 controller handles all '''requests to the L2''' cache from the core and the snoop requests from the FSB. The '''bus controller''' handles '''data and I/O requests''' to and from the FSB.&lt;br /&gt;
* The '''prefetching''' unit is extended to handle the hardware '''prefetches for each core separately'''.&lt;br /&gt;
* A '''new logical unit''' (represented by the hexagon) was added to maintain '''fairness between the requests''' coming from the different cores and hence balance the requests to L2 and memory.&lt;br /&gt;
&lt;br /&gt;
This new '''partitioned structure''' for the  memory/L2 access control unit '''enhanced''' the '''performance''' while '''reducing power consumption'''. &lt;br /&gt;
For more information on uniprocessor and multiprocessor implementation under the Intel architecture, refer to &lt;br /&gt;
[http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]&lt;br /&gt;
&lt;br /&gt;
The '''Intel bus architecture''' has been '''evolving''' to accommodate the demands of scalability while using the same MESI protocol: from a '''single shared bus''', to '''dual independent buses (DIB)''' that double the available bandwidth, and on to the logical conclusion of DIB, '''dedicated high-speed interconnects (DHSI)'''. The DHSI-based platforms use four FSBs, one for each processor in the platform. In both DIB and DHSI, a snoop filter in the chipset caches snoop information, thereby significantly reducing the snoop traffic that has to be broadcast on the buses. With the production of processors based on the next-generation 45-nm Hi-k Intel Core microarchitecture, the [http://en.wikipedia.org/wiki/Xeon Intel Xeon] processor fabric will transition from DHSI, with the memory controller in the chipset, to a distributed shared memory architecture using '''Intel QuickPath Interconnects with the MESIF protocol'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Word Invalidation=&lt;br /&gt;
One source of complexity that applies to a number of the protocols is invalidation.  In newer protocols, individual words in a cache line may be modified, as opposed to the entire line.  The other processors will thus hold a mostly correct cache line that differs in only one word.  This adds complexity, because to 'correct' the error the other processors must transition from shared to invalid, read the correct word from the bus, place it back into the line, and transition back to shared.  This round trip is potentially unnecessary, as the second processor may never access the specific word that changed, even though it may access other words in the cache line.  One potential solution being researched at present is to advance the protocols so that a cache line is &amp;quot;not invalidated on the first dirty word, but after the number of dirty words crosses some predetermined value, which is data type and application dependent.&amp;quot;  In other words, if the application can tolerate several 'incorrect' words, several transitions to and from the invalid state may be avoided.&lt;br /&gt;
&amp;lt;br/&amp;gt;Source: http://tab.computer.org/tcca/NEWS/sept96/dsmideas.ps&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In essence, the solution proposed here is to advance the MOESI protocol with word invalidation and specific treatment of temporal and spatial data, so that the block is not invalidated.&lt;br /&gt;
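&lt;br /&gt;
The threshold idea quoted above can be sketched as follows. This is a minimal illustration only, not the mechanism from the cited source: the class name, the method names and the &amp;quot;max_dirty_words&amp;quot; parameter are hypothetical, and a single cache line in one remote cache is modeled.&lt;br /&gt;

```python
# Sketch of threshold-based invalidation for one shared cache line.
# max_dirty_words is a hypothetical tuning knob, not taken from the source.

class SharedLine:
    def __init__(self, max_dirty_words):
        self.max_dirty_words = max_dirty_words
        self.dirty_words = set()   # word offsets overwritten by another processor
        self.state = "S"

    def remote_write(self, word):
        """Another processor modified one word of this line."""
        if self.state == "I":
            return
        self.dirty_words.add(word)
        # The set grows by at most one word per call, so equality detects
        # the moment the count crosses the tolerated maximum.
        if len(self.dirty_words) == self.max_dirty_words + 1:
            self.state = "I"
            self.dirty_words.clear()

    def can_read_locally(self, word):
        """A word is usable only while the line is valid and the word is not stale."""
        return self.state == "S" and word not in self.dirty_words
```

With a threshold of 2, the line stays in the shared state through writes to two distinct words and is invalidated only by a write to a third distinct word.&lt;br /&gt;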
&lt;br /&gt;
=Performance: MOESI vs MEI/MESI=&lt;br /&gt;
Early multiprocessors (such as the PowerPC processors) were designed to work with three states (Modified, Exclusive, Invalid - this is similar to the MSI protocol discussed earlier, the term MEI is used to remain consistent with the sources used for reference).  As time progressed, more multi-processors transitioned to the MESI protocol.  This is most likely due to the characteristics of MEI - it is &amp;quot;easy to implement but leads to inefficiencies in the way that memory bus bandwidth is used&amp;quot; [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 source].  As a result, systems with two or more MEI processors (most likely) will not see the 'full' potential of the extra processors, due to the inefficient bus transactions.&lt;br /&gt;
&lt;br /&gt;
In contrast, the five-state protocol MOESI is more complex to implement than MESI and MEI/MSI.  However, advantages arise from using it.  As Andy Keane (VP of Marketing for PMC-Sierra) put it &amp;quot;this [fifth] state allows shared data that is dirty to remain in the cache.  Without this state, any shared line would be written back to memory to change the original state from modified. Since we have a dedicated, fast path from CPU to CPU, this state maximizes the use of this path rather than the path to memory, which is inherently slower&amp;quot; [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 source].  This advantage can be implemented by allowing MOESI to make some processor-to-processor transfers from level 2 caches.&lt;br /&gt;
&lt;br /&gt;
Source: [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 Edwards, Chris (08/06/2001). &amp;quot;Architects wrestle with multiprocessor options&amp;quot;. Electronic engineering times (0192-1541), (1178), p. 48.]&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Cache_coherence Cache coherence]&lt;br /&gt;
# [http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]&lt;br /&gt;
# [http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Symmetric_multiprocessing Common System Interface in Intel Processors]&lt;br /&gt;
# [http://www.zak.ict.pwr.wroc.pl/nikodem/ak_materialy/Cache%20consistency%20&amp;amp;%20MESI.pdf Cache consistency with MESI on Intel processor]&lt;br /&gt;
# [http://techreport.com/articles.x/8236/2 AMD dual core Architecture]&lt;br /&gt;
# [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual]&lt;br /&gt;
# [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors]&lt;br /&gt;
# [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of AMD 64 bit core]&lt;br /&gt;
# [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&amp;amp;arnumber=4913 Silicon Graphics Computer Systems]&lt;br /&gt;
# [http://books.google.com/books?id=g82fofiqa5IC&amp;amp;printsec=frontcover&amp;amp;dq=Parallel+computer+architecture:+a+hardware/software+approach+By+David+E.+Culler,+Jaswinder+Pal+Singh,+Anoop+Gupta&amp;amp;source=bl&amp;amp;ots=COrdamlfVn&amp;amp;sig=YcugVqbzTjHvlofvaFq6Ft_tjfY&amp;amp;hl=en&amp;amp;ei=0ZO6S4TJGcOclgejzI3BBw&amp;amp;sa=X&amp;amp;oi=book_result&amp;amp;ct=result&amp;amp;resnum=1&amp;amp;ved=0CAgQ6AEwAA#v=onepage&amp;amp;q=&amp;amp;f=false Parallel computer architecture: a hardware/software approach By David E. Culler, Jaswinder Pal Singh, Anoop Gupta]&lt;br /&gt;
# [http://www.freepatentsonline.com/5283886.html Three state invalidation protocols]&lt;br /&gt;
# [http://portal.acm.org/citation.cfm?id=1499317&amp;amp;dl=GUIDE&amp;amp;coll=GUIDE&amp;amp;CFID=83027384&amp;amp;CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems]&lt;br /&gt;
# [http://portal.acm.org/citation.cfm?id=6514 Cache Coherence protocols: evaluation using a multiprocessor simulation model]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Xerox_Dragon Xerox Dragon]&lt;br /&gt;
# [http://thanaseto.110mb.com/courses/CSD-527-report-engl.pdf Coherence Protocols]&lt;br /&gt;
# [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&amp;amp;arnumber=289691 XDBus]&lt;br /&gt;
# [http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dai0228a/index.html ARM]&lt;br /&gt;
# [http://ixbtlabs.com/articles/ibmpower4/index.html IBM POWER4]&lt;br /&gt;
# Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems. Solihin, Yan&lt;br /&gt;
# [http://www.chiplist.com/Intel_Itanium_2_processor/tree3f-section--2235-/ Intel Itanium 2]&lt;br /&gt;
# [http://techreport.com/articles.x/8236/2 MESI-MESI-MOESI Banana-fana...]&lt;br /&gt;
# [http://rsim.cs.illinois.edu/rsim/Manual/node109.html RSIM]&lt;br /&gt;
# [http://dl.acm.org/citation.cfm?id=1499317 Synapse tightly coupled multiprocessors]&lt;/div&gt;</summary>
		<author><name>Pxu</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8a_fu&amp;diff=59833</id>
		<title>CSC/ECE 506 Spring 2012/8a fu</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8a_fu&amp;diff=59833"/>
		<updated>2012-03-19T01:47:48Z</updated>

		<summary type="html">&lt;p&gt;Pxu: /* MSI Protocol */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Introduction to bus-based cache coherence in real machines=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==SMP Architecture==&lt;br /&gt;
Most parallel software in the commercial market relies on the shared-memory programming model in which all processors access the same physical address space. And the most common multiprocessors today use [http://en.wikipedia.org/wiki/Symmetric_multiprocessing SMP] architecture which use a common bus as the interconnect.  In the case of multicore processors (&amp;quot;chip multiprocessors,&amp;quot; or CMP) the SMP architecture applies to the cores treating them as separate processors. The key problem of shared-memory multiprocessors is providing a consistent view of memory with various cache hierarchies.  This is called '''''cache coherence problem'''''. It is  critical to  achieve correctness and performance-sensitive design point for supporting the shared-memory model. The cache coherence mechanisms not only govern communication in a shared-memory multiprocessor, but also typically determine how the memory system transfers data between processors, caches, and memory.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:Busbased SMP.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
At any point in logical time, the permissions for a cache block can allow either a single writer or multiple readers. The '''''coherence protocol''''' ensures the invariants of the states are maintained. The different coherent states used by most of the cache coherence protocols are as shown in ''Table 1'':&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''States'''&lt;br /&gt;
|  '''Access Type'''&lt;br /&gt;
|  '''Invariant'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  read, write&lt;br /&gt;
|  all other caches in I state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  read&lt;br /&gt;
|  all other caches in I state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Owned'''&lt;br /&gt;
|  read&lt;br /&gt;
|  all other caches in I or S state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  read&lt;br /&gt;
|  no other cache in M or E state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|  -&lt;br /&gt;
|  -&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Snooping Protocols=&lt;br /&gt;
==MSI Protocol==&lt;br /&gt;
&lt;br /&gt;
'''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' is a three-state write-back '''invalidation protocol''' and one of the earliest snooping-based cache coherence protocols. It marks each cache line as '''Modified (M)''', '''Shared (S)''' or '''Invalid (I)'''. '''Invalid''' means the cache line is either not present or is in the invalid state. If the cache line is clean and may be shared by more than one processor, it is marked '''Shared'''. If the cache line is dirty and the processor has exclusive ownership of it, it is in the '''Modified''' state. A BusRdX causes all other caches to invalidate (demote) the line to the '''I''' state; if the line is in the '''M''' state in another cache, that cache flushes it. A processor write that hits a line in the '''S''' state still issues a BusRdX, after which the line is promoted (upgraded) to the '''M''' state.&lt;br /&gt;
&lt;br /&gt;
The state diagram of the MSI protocol is shown below. Note that the left part shows the response to processor-side requests, and the right part shows the response to snooper-side requests.[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
[[File:MSInew.jpg|600px|thumb|center|MSI state transition diagram]]&lt;br /&gt;
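&lt;br /&gt;
The transitions described above can be sketched as a small simulation. This is a minimal illustration under simplifying assumptions: it tracks a single cache line, there are no evictions, and the class and method names are hypothetical rather than taken from any real simulator.&lt;br /&gt;

```python
# Minimal MSI sketch: one cache line, several caches, one snooping bus.
# State names follow the text; the class and method names are illustrative.

class MSIBus:
    def __init__(self, n_caches):
        self.state = ["I"] * n_caches   # per-cache state of the single line
        self.log = []                   # bus transactions, in order

    def read(self, cid):
        if self.state[cid] == "I":      # read miss: post a BusRd
            self.log.append("BusRd")
            for other in range(len(self.state)):
                if other != cid and self.state[other] == "M":
                    self.log.append("Flush")   # the dirty holder supplies the data
                    self.state[other] = "S"
            self.state[cid] = "S"
        # a read hit in S or M causes no bus traffic

    def write(self, cid):
        if self.state[cid] != "M":      # write miss, or write hit in S
            self.log.append("BusRdX")
            for other in range(len(self.state)):
                if other != cid:
                    if self.state[other] == "M":
                        self.log.append("Flush")
                    self.state[other] = "I"    # all other copies are invalidated
            self.state[cid] = "M"
```

For example, a read followed by a write on cache 0 posts a BusRd and then a BusRdX, even though no other cache holds the line.&lt;br /&gt;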
&lt;br /&gt;
==Synapse protocol==&lt;br /&gt;
From the state transition diagram of MSI, we observe that there is a transition from state '''M''' to state '''S''' when a BusRd is observed for that block. The contents of the block are flushed to the bus before the cache goes to the '''S''' state. In certain cases it would be more appropriate to move to the '''I''' state, thus giving up the block entirely. The choice between '''S''' and '''I''' reflects the designer's assertion about whether the original processor is more likely to continue reading the block than the new processor is to write to it. The Synapse protocol, used in the early Synapse multiprocessor, made the alternate choice of going directly from the '''M''' state to the '''I''' state on a BusRd, assuming the migratory pattern would be more frequent. More details about this protocol can be found in these papers published in the 1980s: [http://portal.acm.org/citation.cfm?id=6514 Cache Coherence protocols: evaluation using a multiprocessor simulation model] and [http://portal.acm.org/citation.cfm?id=1499317&amp;amp;dl=GUIDE&amp;amp;coll=GUIDE&amp;amp;CFID=83027384&amp;amp;CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems]&lt;br /&gt;
&lt;br /&gt;
In Synapse protocol '''M''' state is called '''D''' (Dirty) state. The following is the state transition diagram for Synapse protocol:&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:Synapse1.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
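The design choice between MSI and Synapse can be sketched as two snooper-side responses. This is a hedged illustration only: the function names are hypothetical, and the cost function simply counts whether the original holder would miss if it read the block again.&lt;br /&gt;

```python
# Snooper-side response of a cache that holds a block dirty and observes a BusRd.

def snoop_busrd_on_dirty(protocol):
    """Return (bus action, next state of the original holder)."""
    if protocol == "MSI":
        return ("Flush", "S")    # keep a shared copy for later reads
    if protocol == "Synapse":
        return ("Flush", "I")    # give the block up entirely
    raise ValueError("unknown protocol: " + protocol)

def reread_cost(protocol):
    """Bus transactions needed if the original holder reads the block again."""
    _, next_state = snoop_busrd_on_dirty(protocol)
    return 0 if next_state == "S" else 1
```

Under this sketch the MSI choice pays off when the original processor keeps reading the block, while the Synapse choice assumes the block has migrated to the new processor.&lt;br /&gt;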
===Real Architectures using Synapse===&lt;br /&gt;
Nothing could be found on any actual architecture using the Synapse cache-coherence protocol.  From the abstract of the paper written by Steve Frank and Armond Inselberg in 1984, &amp;quot;[http://dl.acm.org/citation.cfm?id=1499317 Synapse Tightly Coupled Multiprocessors: A New Approach to Solve Old Problems]&amp;quot;, Synapse was theoretical.&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The [http://www.disi.unige.it/person/DelzannoG/CacheProtocol/synapse.hy The BABYLON Project] did one example of a Synapse cache coherence protocol with atomic synchronization actions.&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
In the Synapse system, memory contention is reduced by designing a processor cache that employs a non-write-through algorithm to minimize the bandwidth used between the cache and shared memory. The Synapse Expansion Bus includes an ownership-level protocol between processor caches.[[#References|&amp;lt;sup&amp;gt;[24]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
==[http://en.wikipedia.org/wiki/MESI_protocol MESI]==&lt;br /&gt;
The main drawback of MSI is that each read-write sequence incurs two bus transactions irrespective of whether the cache line is stored in only one cache or not. Highly parallel programs that have little data sharing suffer the most from this.  MESI protocol solves this problem by introducing the '''Exclusive''' state to distinguish between a cache line stored in multiple caches and a line stored in a single cache.&lt;br /&gt;
Let us briefly see how the MESI protocol works. For a more detailed version, refer to the Solihin textbook, p. 215.&lt;br /&gt;
&lt;br /&gt;
The MESI coherence protocol marks each cache line as being in one of the Modified, Exclusive, Shared, or Invalid states. &lt;br /&gt;
* '''Invalid''' : The cache line is either not present or is invalid&lt;br /&gt;
* '''Exclusive''' : The cache line is clean and is owned by this core/processor only&lt;br /&gt;
* '''Modified''' : The cache line is dirty and the core/processor has exclusive ownership of the cache line, exclusive of the memory also.&lt;br /&gt;
* '''Shared''' : The cache line is clean and is shared by more than one core/processor&lt;br /&gt;
&lt;br /&gt;
The MESI protocol works as follows: &lt;br /&gt;
A line that is fetched receives the '''E''' or '''S''' state, depending on whether it exists in other processors in the system. A cache line gets the '''M''' state when a processor writes to it; if the line is not in the '''E''' or '''M''' state prior to the write, the cache sends a Bus Upgrade (BusUpgr) signal or, as the Intel manuals term it, a “Read-For-Ownership (RFO) request”, which ensures that the line exists in the cache and is in the '''I''' state in all other processors on the bus (if any). The table below summarizes the MESI protocol.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''This cache line is valid?'''&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  No&lt;br /&gt;
|-&lt;br /&gt;
|  '''The memory copy is…'''&lt;br /&gt;
|  out of date&lt;br /&gt;
|  valid&lt;br /&gt;
|  valid&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies exist in caches of other processors?'''&lt;br /&gt;
|  No&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe&lt;br /&gt;
|  Maybe&lt;br /&gt;
|-&lt;br /&gt;
|  '''A write to this line'''&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  goes directly to bus&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
State transition diagram for MESI protocol is shown below.[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
[[File:MESInew.jpg|500px|thumb|center|MESI state transition diagram]] &lt;br /&gt;
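&lt;br /&gt;
The saving that the '''Exclusive''' state brings for a read followed by a write can be sketched as follows. This is a minimal illustration with a hypothetical function name; it assumes the write to a shared line is carried out with a BusUpgr and ignores evictions.&lt;br /&gt;

```python
def read_then_write(protocol, other_sharers):
    """Return (final state, bus transactions) for a read miss followed by
    a write to the same line by the same processor."""
    log = ["BusRd"]                  # the read miss always goes to the bus
    if protocol == "MESI" and not other_sharers:
        state = "E"                  # no other copy exists: load as Exclusive
    else:
        state = "S"                  # MSI, or MESI with other sharers
    if state == "S":
        log.append("BusUpgr")        # possible sharers must be invalidated
    state = "M"                      # E to M is silent; S to M needed the bus
    return state, log
```

For a line that no other cache holds, the MSI case returns two bus transactions while the MESI case returns one.&lt;br /&gt;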
&lt;br /&gt;
===Real Architectures using MESI===&lt;br /&gt;
&lt;br /&gt;
Intel's [http://en.wikipedia.org/wiki/Pentium_Pro Pentium Pro] microprocessor, introduced in 1995, was the first Intel architecture microprocessor to support symmetric multiprocessing in various multiprocessor configurations.  SMP with the MESI protocol was used consistently until the introduction of the 45-nm Hi-k Core microarchitecture in Intel's quad-core x86-64 Nehalem-EP. The 45-nm Hi-k Intel Core microarchitecture uses a new system framework called the [http://en.wikipedia.org/wiki/Intel_QuickPath_Interconnect QuickPath Interconnect], which uses point-to-point interconnection technology based on a distributed shared memory architecture.  It uses a modified version of the MESI protocol called [http://en.wikipedia.org/wiki/MESIF_protocol MESIF], which introduces the additional Forward state.  (The state diagram of the MESI transitions that occur within the Pentium's data cache can be found on page 63 of [http://books.google.com/books?id=TVzjEZg1--YC&amp;amp;printsec=frontcover&amp;amp;source=gbs_ge_summary_r&amp;amp;cad=0#v=onepage&amp;amp;q&amp;amp;f=false Pentium Processor System Architecture: Second Edition] by Don Anderson, Tom Shanley, MindShare, Inc)&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
The [http://www.arm.com/products/processors/classic/arm11/arm11-mpcore.php ARM11 MPCore] and [http://en.wikipedia.org/wiki/ARM_Cortex-A9_MPCore Cortex-A9 MPCore] processors support the MESI cache coherency protocol.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]  ARM MPCore defines the states of the MESI protocol it implements as:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies in other caches'''&lt;br /&gt;
|  NO&lt;br /&gt;
|  NO&lt;br /&gt;
|  YES&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Clean or Dirty'''&lt;br /&gt;
|  DIRTY&lt;br /&gt;
|  CLEAN&lt;br /&gt;
|  CLEAN&lt;br /&gt;
|  -&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
The coherency protocol is implemented and managed by the Snoop Control Unit (SCU) in the ARM MPCore, which monitors the traffic between the local L1 data caches and the next level of the memory hierarchy. At boot time, each core can choose whether or not to take part in the coherency domain.  Unless explicit system calls bind a task to a specific core ([http://en.wikipedia.org/wiki/Processor_affinity processor affinity]), there is a high chance that the task will at some point migrate to a different core, along with its data as it is used.  Task migration is not handled efficiently by a literal MESI implementation, so the ARM MPCore offers two optimizations that allow for both MESI compliance and task migration: '''Direct Data Intervention (DDI)''', in which the SCU keeps a copy of the tag RAMs of all cores’ caches, so that it can efficiently detect whether a cache line requested by one core is held by another core in the coherency domain before looking for it in the next level of the memory hierarchy; and '''Cache-to-cache Migration''', in which, if the SCU finds that the cache line requested by one CPU is present in another core, it either copies it (if clean) or moves it (if dirty) from the other CPU directly into the requesting one, without interacting with external memory.  These optimizations reduce memory traffic in and out of the L1 cache subsystem by eliminating interaction with external memories, which in effect reduces the overall load on the interconnect and the overall power consumption.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
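The two optimizations described above can be sketched together as a lookup in the SCU's duplicate tags. This is a simplified illustration, not ARM's implementation: the function name, the dictionary layout and the &amp;quot;clean&amp;quot;/&amp;quot;dirty&amp;quot; markers are assumptions made for the example.&lt;br /&gt;

```python
def scu_fetch(tag_copies, requester, address):
    """tag_copies maps a core id to a dict {address: "clean" or "dirty"},
    standing in for the SCU's copy of each core's tag RAM.
    Returns (how the line was obtained, supplying core or None)."""
    for core, tags in tag_copies.items():
        if core != requester and address in tags:
            if tags[address] == "dirty":
                del tags[address]                        # dirty line: move it
                tag_copies[requester][address] = "dirty"
                return ("moved", core)
            tag_copies[requester][address] = "clean"     # clean line: copy it
            return ("copied", core)
    tag_copies[requester][address] = "clean"             # no core holds the line
    return ("memory", None)
```

Only the last case interacts with the next level of the memory hierarchy; the first two are served from another core's L1 cache.&lt;br /&gt;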
&amp;lt;br/&amp;gt;&lt;br /&gt;
Other architectures that use the MESI cache-coherence protocol include the L2 cache of the IBM POWER4 processor[[#References|&amp;lt;sup&amp;gt;[20]&amp;lt;/sup&amp;gt;]], the L2 cache of the Intel Itanium 2 processor[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]], and the Intel Xeon[[#References|&amp;lt;sup&amp;gt;[22]&amp;lt;/sup&amp;gt;]].&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
One implementation complexity already mentioned is the inefficiency of task migration in MESI that the ARM MPCore addresses.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]  Another possible implementation complexity is found during the replacement of a cache line: one possible MESI implementation requires a message to be sent to main memory when a cache line is flushed (i.e. an E to I transition), as the line was exclusively in one cache before it was removed. It is possible to avoid this replacement message if the system is designed so that the flush of a modified (exclusive) line requires an acknowledgment from main memory. However, this requires the flush to be stored in a 'write-back' buffer until the reply arrives (to ensure the change is successfully propagated to memory).[[#References|&amp;lt;sup&amp;gt;[23]&amp;lt;/sup&amp;gt;]]&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==MESIF==&lt;br /&gt;
Let us now walk through a briefing on the '''MESIF protocol''':&lt;br /&gt;
&lt;br /&gt;
The '''MESIF''' protocol, used in the latest Intel multi-core processors, was introduced to '''accommodate the point-to-point''' links used in the QuickPath Interconnect. Using the '''MESI''' protocol in this architecture would send many redundant messages between different processors, often with unnecessarily high latency. For example, when a processor requests a cache line that is stored in multiple locations, every location might respond with the data. As the requesting processor only needs a single copy of the data, the system would be wasting bandwidth. &lt;br /&gt;
As a solution to this problem, an additional state, '''Forward state''', was added by slightly changing the role of the Shared state. Whenever there is a read request, only the cache line in the F state will respond to the request, while all the S state caches remain dormant.  Hence, by designating a single cache line to '''respond to requests''', coherency traffic is substantially reduced when multiple copies of the data exist. Also, on a read request, the F state transitions from F to S state. That is, when a cache line in the '''F''' state is '''copied''', the F state '''migrates''' to the '''newer copy''', while the '''older''' one drops back to '''S'''. Moving the new copy to the F state '''exploits''' both '''temporal and spatial locality'''. Because the newest copy of the cache line is always in the F state, it is very unlikely that the line in the F state will be evicted from the caches. This takes advantage of the temporal locality of the request. The second advantage is that if a particular cache line is in high demand due to spatial locality, the bandwidth used to transmit that data will be spread across several nodes.&lt;br /&gt;
All M-to-S and E-to-S state transitions now become '''M to F''' and '''E to F''' transitions.  &lt;br /&gt;
The '''F state''' is '''different''' from the '''Owned state''' of the MOESI protocol: a line in the F state is '''not''' a unique copy, because a valid copy is also stored in memory. Thus, unlike data in the Owned state of the MOESI protocol, for which memory does not hold a valid copy, the data in the F state can be evicted or converted to the S state, if desired. &lt;br /&gt;
&lt;br /&gt;
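The migration of the Forward state on a read can be sketched as follows. This is a minimal illustration that models only the F, S, E and I states of one line; the function name and the fallback to memory when no cache holds the line in F are assumptions of the sketch, not details taken from Intel's documentation.&lt;br /&gt;

```python
def mesif_read(states, requester):
    """states maps a cache id to the state of one line in that cache.
    Returns the id of the cache that supplies the data, or "memory"."""
    others = {cid: st for cid, st in states.items() if cid != requester}
    forwarders = [cid for cid, st in others.items() if st == "F"]
    if forwarders:
        supplier = forwarders[0]
        states[supplier] = "S"       # the older copy drops back to Shared
        states[requester] = "F"      # the newest copy becomes the responder
        return supplier
    # No cache holds the line in F; S copies stay silent, so memory responds.
    if any(st == "S" for st in others.values()):
        states[requester] = "F"
    else:
        states[requester] = "E"      # no other copy exists at all
    return "memory"
```

Only one cache ever answers a read in this sketch, which is the traffic reduction the Forward state is meant to provide.&lt;br /&gt;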
More information on the QuickPath Interconnect and MESIF protocol can be found at&lt;br /&gt;
'''[http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]'''&lt;br /&gt;
&lt;br /&gt;
==MOESI==&lt;br /&gt;
[http://en.wikipedia.org/wiki/Opteron AMD Opteron] was AMD’s first-generation dual-core processor, with 2 distinct [http://en.wikipedia.org/wiki/Athlon_64 K8 cores] together on a single die.  Cache coherence poses bigger problems on such multiprocessors, and an appropriate coherence protocol was needed to address them. The [http://en.wikipedia.org/wiki/Xeon Intel Xeon], the competing product from Intel, used the MESI protocol to handle cache coherence.  MESI came with the drawback of using much time and bandwidth in certain situations. &lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/MOESI_protocol MOESI] was AMD’s answer to this problem. MOESI adds a fifth state, called '''“Owned”''', to the MESI protocol. It addresses the bandwidth problem faced in the MESI protocol when a processor with invalid data in its cache wants to modify the data.  Under MESI, the processor seeking access to the data has to wait for the processor that modified it to write it back to main memory, which takes time and bandwidth. MOESI removes this drawback by allowing dirty sharing.  When the data is held by a processor in the new '''“Owned”''' state, that processor can provide the modified data to other processors without, or even before, writing it to main memory. This is called '''''dirty sharing'''''. The processor holding the data in the '''&amp;quot;Owned&amp;quot;''' state remains responsible for updating main memory later, when the cache line is evicted.&lt;br /&gt;
&lt;br /&gt;
MOESI has become one of the most popular snoop-based protocols and is supported in the AMD64 architecture.  The AMD dual-core Opteron can maintain cache coherence in systems of up to 8 processors using this protocol.&lt;br /&gt;
&lt;br /&gt;
The five different states of the MOESI protocol are:&lt;br /&gt;
* '''Modified (M)''' : The most recent copy of the data is present in the cache line. But it is not present in any other processor cache.&lt;br /&gt;
* '''Owned (O)'''   : The cache line has the most recent, correct copy of the data. This can be shared by other processors. The processor holding the cache line in this state is responsible for updating main memory with the correct value before the line gets evicted.  &lt;br /&gt;
* '''Exclusive (E)''' : A cache line holds the most recent, correct copy of the data, which is exclusively present on this processor and a copy is present in the main memory.  &lt;br /&gt;
* '''Shared (S)''' : A cache line in the shared state holds the most recent, correct copy of the data, which may be shared by other processors. &lt;br /&gt;
* '''Invalid (I)''' : A cache line does not hold a valid copy of the data.&lt;br /&gt;
&lt;br /&gt;
A detailed explanation of this protocol implementation on AMD processor can be found in the manual [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of the AMD 64-bit core]&lt;br /&gt;
&lt;br /&gt;
The following table summarizes the MOESI protocol:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Owned'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''This cache line is valid?'''&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  No&lt;br /&gt;
|-&lt;br /&gt;
|  '''The memory copy is…'''&lt;br /&gt;
|  out of date&lt;br /&gt;
|  out of date&lt;br /&gt;
|  valid&lt;br /&gt;
|  valid&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies exist in caches of other processors?'''&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe&lt;br /&gt;
|  Maybe&lt;br /&gt;
|-&lt;br /&gt;
|  '''A write to this line'''&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  goes directly to bus&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
State transition diagram for MOESI protocol is shown below. The top part shows the response to processor-side requests and the bottom part is the response to snooper-side requests. [[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:MOESI.jpg|500px|thumb|center|MOESI state transition diagram]]&lt;br /&gt;
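&lt;br /&gt;
Dirty sharing can be sketched by comparing how a cache that holds a line in the M state answers a read from another cache under MESI and under MOESI. This is a hedged illustration with hypothetical function names; it counts only write-backs to main memory.&lt;br /&gt;

```python
def snoop_read_on_modified(protocol):
    """Response of a cache holding a line in M when another cache reads it.
    Returns (next state of the holder, memory write-backs performed now)."""
    if protocol == "MESI":
        return ("S", 1)     # the line is flushed so that memory is clean again
    if protocol == "MOESI":
        return ("O", 0)     # the holder supplies dirty data and keeps ownership
    raise ValueError("unknown protocol: " + protocol)

def writebacks_on_eviction(state):
    """Memory write-backs needed when a line in the given state is evicted."""
    return 1 if state in ("M", "O") else 0
```

Under MOESI the write-back is deferred until the owner evicts the line, so repeated reads by other caches cost no memory traffic in this sketch.&lt;br /&gt;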
&lt;br /&gt;
&amp;lt;br /&amp;gt;&lt;br /&gt;
&amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Optimization techniques on MOESI===&lt;br /&gt;
&lt;br /&gt;
In real machines, applying optimization techniques to the standard cache coherence protocol improves performance. For example, the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] family of microprocessors (Family 10h), which is AMD’s first generation to incorporate 4 distinct cores on a single die and the first to have a cache that all the cores share, uses the MOESI protocol with some optimization techniques incorporated. &lt;br /&gt;
&lt;br /&gt;
This optimization targets a small subset of compute problems that behave like producer-consumer programs. In such a problem, a thread running on one core produces data that is consumed by a thread running on a separate core. With such programs, it is desirable for the two cores to communicate through the shared cache, avoiding round trips to and from main memory. However, the '''MOESI''' protocol that the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] cache uses for cache coherence can limit bandwidth in this pattern. Keeping the cache line in the '''‘M’''' state for such problems yields better performance.&lt;br /&gt;
&lt;br /&gt;
When the producer thread writes a new entry into the ring buffer, it allocates cache lines in the '''modified (M)''' state. Eventually, these M-marked cache lines start to fill the L3 cache. When the consumer reads a cache line, the MOESI protocol changes the state of the line to '''owned (O)''' in the L3 cache and pulls down a '''shared (S)''' copy for the consumer's own use. The producer thread then circles the ring buffer and arrives back at the same cache line it had previously written. However, when the producer attempts to write new data to that line, now marked '''‘O’''', it finds that it cannot: in the MOESI protocol, a cache line marked '''‘O’''' by the previous consumer read does not carry sufficient permission for a write. To maintain coherence, the memory controller must first probe the other caches (to handle any other S copies that may exist), which slows down the write.&lt;br /&gt;
&lt;br /&gt;
Thus, it is preferable to keep the cache line in the '''‘M’''' state in the L3 cache. When the producer comes back around the ring buffer, it then finds the previously written cache line still marked '''‘M’''', so it can write to it safely without coherence concerns. Optimizations of this kind to the standard protocols can therefore yield better performance in real machines.&lt;br /&gt;
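&lt;br /&gt;
The effect described above can be approximated with a very small simulation. The following Python sketch is only a toy model under stated assumptions: the function name count_probes, its parameters, and the rule that every write to an O-marked line costs exactly one probe are all hypothetical simplifications, not measurements of a real Phenom.&lt;br /&gt;

```python
def count_probes(laps, slots, keep_modified):
    """Count coherence probes for a producer circling a ring buffer.

    Each slot stands for one cache line in the shared L3 cache.
    keep_modified=False models plain MOESI, where a consumer read
    demotes the line from M to O; keep_modified=True models the
    optimization that leaves the line in M.
    """
    ring = ["I"] * slots  # nothing has been written yet
    probes = 0
    for _ in range(laps):
        for i in range(slots):
            # Producer write: an O line must be probed before it is written.
            if ring[i] == "O":
                probes += 1
            ring[i] = "M"
            # Consumer read of the freshly written entry.
            if not keep_modified:
                ring[i] = "O"
    return probes


if __name__ == "__main__":
    print("plain MOESI:", count_probes(3, 4, keep_modified=False))
    print("keep M     :", count_probes(3, 4, keep_modified=True))
```

In this model the plain variant pays one probe per slot on every lap after the first, while the variant that keeps lines in M pays none.&lt;br /&gt;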
&lt;br /&gt;
More information on how this is implemented, along with various other optimizations, can be found in the [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors].&lt;br /&gt;
&lt;br /&gt;
==Dragon Protocol==&lt;br /&gt;
The '''[http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]''' is an update-based coherence protocol: unlike the coherence protocols seen so far, it does not invalidate other cached copies. Write propagation is achieved by updating the cached copies instead of invalidating them. The Dragon Protocol does not update memory on a cache-to-cache transfer; it delays bringing memory in line with the caches until the data is evicted and written back, which saves time and lowers memory access requirements. Moreover, only the written '''byte''' or '''word''' is '''communicated''' to the '''other caches''' instead of the whole block, which further '''reduces''' '''bandwidth''' usage. The protocol detects the sharing status of a block dynamically, propagating writes through to the other caches for shared blocks and using write-back for blocks that are currently not shared. The Dragon Protocol employs the following four states for cache blocks: '''Shared Clean''', '''Shared Modified''', '''Exclusive''' and '''Modified'''. &lt;br /&gt;
* '''Modified (M)''' and '''Exclusive (E)''' - these states have the same meaning as explained in the protocols above. &lt;br /&gt;
* '''Shared Modified (Sm)''' - Potentially two or more caches have this block, memory may or may not be up to date, and this processor's cache has modified the block. Only one cache in the system can hold a given block in the Shared Modified state.&lt;br /&gt;
* '''Shared Clean (Sc)''' - Potentially two or more caches have this block, and memory may or may not be up to date (memory is up to date if no other cache holds the block in the Sm state; otherwise it is not).&lt;br /&gt;
Only when a Shared Modified line is evicted from the cache on a cache miss is the block written back to main memory to keep memory consistent. For more information on the Dragon protocol, refer to the Solihin textbook, page 229. The state transition diagram is given below for reference.[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
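&lt;br /&gt;
To make the update-based behaviour concrete, here is a minimal Python sketch of two Dragon transitions: a processor write hit and a snooped bus update. The function names and the shared_line parameter (standing for the bus signal that tells a writer whether any other cache still holds the block) are assumptions made for this example.&lt;br /&gt;

```python
def processor_write(state, shared_line):
    """Return (bus action, next state) for a write hit under Dragon.

    shared_line is True when another cache reports holding the block.
    """
    if state in ("E", "M"):
        # No other copy exists, so the write stays off the bus.
        return (None, "M")
    if state in ("Sc", "Sm"):
        # Broadcast only the written word; stay shared if sharers remain.
        return ("BusUpd", "Sm" if shared_line else "M")
    raise ValueError("unknown state: " + state)


def snooped_bus_update(state):
    """Return the next state after another cache broadcasts an update."""
    if state in ("Sc", "Sm"):
        # The local copy is updated in place; the writer becomes the Sm holder.
        return "Sc"
    raise ValueError("a BusUpd cannot be snooped in state " + state)


if __name__ == "__main__":
    print(processor_write("Sc", shared_line=True))
    print(processor_write("Sm", shared_line=False))
    print(snooped_bus_update("Sm"))
```

Note that no transition in this sketch leads to an invalid state, which is the key difference from the invalidation-based protocols.&lt;br /&gt;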
&lt;br /&gt;
[[File:DragonNew.jpg|500px|thumb|center|Dragon state transition diagram]] &lt;br /&gt;
&lt;br /&gt;
The Dragon system implemented snoopy caches that provided the appearance of a uniform memory space to multiple processors. Each cache listens to 2 buses: the processor bus and the memory bus. The caches are also responsible for address translation, so the processor bus carries virtual addresses and the memory bus carries physical addresses.  The Dragon system was designed to support 4 to 8 Dragon processors.&lt;br /&gt;
&lt;br /&gt;
=[http://en.wikipedia.org/wiki/Instruction_prefetch Instruction Prefetching]=&lt;br /&gt;
&lt;br /&gt;
In standard [http://en.wikipedia.org/wiki/Pipeline_(computing) CPU pipelining], instruction prefetching is an optimization that minimizes the time the CPU spends waiting for instructions to be fetched from main memory.  Prefetching works well on completely linear programs, but on programs with branches, other optimizations (like [http://en.wikipedia.org/wiki/Branch_predictor branch prediction]) have to be implemented, and the cost of branching the wrong way has to be considered.  Source:  http://dictionary.reference.com/browse/instruction+prefetch&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Instruction prefetching is a technique used to speed up program execution, but in multiprocessors it comes at a cost in performance. With prefetching, data can be modified in ways whose effects the memory coherence protocol cannot handle. In such situations, software must use serializing instructions or cache-invalidation instructions to guarantee that subsequent data accesses are coherent. &lt;br /&gt;
&amp;lt;br /&amp;gt;&lt;br /&gt;
An example of such a situation is a page-table update followed by accesses to the physical pages referenced by the updated page tables. The physical-memory references for the page tables differ from the physical-memory references for the data, so prefetching can cause correctness problems. The following sequence of events shows such a situation, in which software changes the translation of virtual-page A from physical-page M to physical-page N:&lt;br /&gt;
# The tables that translate virtual-page A to physical-page M are now held only in main memory. The copies in the cache are invalidated.&lt;br /&gt;
# Software changes the page-table entry for virtual-page A in main memory to point to physical-page N rather than physical-page M.&lt;br /&gt;
# Data in virtual-page A is accessed.&lt;br /&gt;
&lt;br /&gt;
Software expects the processor to access the data from physical-page N after the update. However, it is possible for the processor to prefetch the data from physical-page M before the page table for virtual page A is updated. Because the physical-memory references are different, the processor does not recognize them as requiring coherence checking and believes it is safe to prefetch the data from virtual-page A, which is translated into a read from physical page M. Similar behavior can occur when instructions are prefetched from beyond the page-table update instruction.&lt;br /&gt;
&lt;br /&gt;
To prevent this problem, software must execute a translation-invalidating instruction (on AMD64, INVLPG or a MOV to CR3) immediately after the page-table update, to ensure that subsequent instruction fetches and data accesses use the correct virtual-page-to-physical-page translation. It is not necessary to perform a TLB invalidation operation preceding the table update.&lt;br /&gt;
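&lt;br /&gt;
The page-table scenario above can be mimicked with a toy model in Python. This is only a sketch under loud assumptions: the ToyCore class, its prefetch, invalidate and access methods, and the demo function are hypothetical names, and a dictionary stands in for both the page table and the early-fetched translation held by the core.&lt;br /&gt;

```python
class ToyCore:
    """A core that may hold a translation it fetched ahead of time."""

    def __init__(self, page_table):
        self.page_table = page_table  # shared mapping: virtual page to physical page
        self.held = {}                # translations already fetched by this core

    def prefetch(self, vpage):
        # Speculatively resolve the translation before it is needed.
        self.held[vpage] = self.page_table[vpage]

    def invalidate(self, vpage):
        # Stands in for the invalidation instruction run by software.
        self.held.pop(vpage, None)

    def access(self, vpage):
        # Use the held translation if present, else walk the page table.
        if vpage not in self.held:
            self.held[vpage] = self.page_table[vpage]
        return self.held[vpage]


def demo(invalidate_after_update):
    table = {"A": "M"}
    core = ToyCore(table)
    core.prefetch("A")        # the core runs ahead of the update
    table["A"] = "N"          # software rewrites the page-table entry
    if invalidate_after_update:
        core.invalidate("A")
    return core.access("A")   # which physical page is actually read?


if __name__ == "__main__":
    print("without invalidation:", demo(False))
    print("with invalidation   :", demo(True))
```

Without the invalidation step the toy core still reads physical-page M; with it, the access goes to physical-page N as software expects.&lt;br /&gt;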
&lt;br /&gt;
More information about this can be found in the [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual].&lt;br /&gt;
&lt;br /&gt;
= CMP Implementation in Intel Architecture =&lt;br /&gt;
&lt;br /&gt;
Let us now see how Intel architecture using the MESI protocol progressed from a uniprocessor architecture to a Chip MultiProcessor (CMP) using the bus as the interconnect. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Uniprocessor Architecture'''&lt;br /&gt;
&lt;br /&gt;
The diagram below shows the structure of the memory cluster in the Intel Pentium M processor.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:intel_cache1.jpg]]&amp;lt;/center&amp;gt; &amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In this structure we have,&lt;br /&gt;
* A unified on-chip '''L1 cache''' with the '''processor/core''',&lt;br /&gt;
* A '''Memory/L2 access control unit''', through which all the accesses to the L2 cache, main memory and IO space are made,&lt;br /&gt;
* The second level '''L2 cache''' along with the '''prefetch unit''' and&lt;br /&gt;
* '''Front side bus (FSB)''', a single shared bi-directional bus through which all the traffic is sent. These wide buses bring in multiple data bytes at a time. &lt;br /&gt;
&lt;br /&gt;
As Intel explains it, with this structure processor requests were first looked up in the '''L2 cache''', and only on a '''miss''' were they '''forwarded''' to main '''memory''' via the front side bus ('''FSB'''). The '''Memory/L2 access control''' unit served as a central point for '''maintaining coherence''' within the core and with the external world. It '''contains''' a '''snoop control unit''' that receives snoop requests from the bus and performs the required operations on each cache (and internal buffers) in parallel. It also handles RFO requests (BusUpgr) and ensures that the operation continues only after it has guaranteed that no other version of the cache line exists in any other cache in the system.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''CMP Architecture'''&lt;br /&gt;
&lt;br /&gt;
For the CMP implementation, Intel chose a bus-based architecture using snoopy protocols over the '''directory protocol''': although a directory protocol reduces active power through reduced snoop activity, it '''increases''' the '''design complexity''' and the '''static power''' due to larger tag arrays. Since Intel has a large market for processors in the mobility family, a directory-based solution was less favorable, because battery life depends mainly on static power consumption and less on dynamic power.&lt;br /&gt;
Let us examine how '''CMP''' was implemented in '''Intel Core Duo''', which was one of the first dual-core processors for the budget/entry-level market. &lt;br /&gt;
The general CMP implementation structure of the Intel Core Duo is shown below.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:intel_cache2.jpg]]&amp;lt;/center&amp;gt; &amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This structure has the following changes when compared to the uniprocessor memory cluster structure. &lt;br /&gt;
* '''L1 cache''' and the '''processor/core''' structure is '''duplicated''' to give 2 cores.&lt;br /&gt;
* The '''Memory/L2 access control''' unit is '''split''' into 2 logical units: '''L2 controller''' and '''bus controller'''. The L2 controller handles all '''requests to the L2''' cache from the core and the snoop requests from the FSB. The '''bus controller''' handles '''data and I/O requests''' to and from the FSB.&lt;br /&gt;
* The '''prefetching''' unit is extended to handle the hardware '''prefetches for each core separately'''.&lt;br /&gt;
* A '''new logical unit''' (represented by the hexagon) was added to maintain '''fairness between the requests''' coming from the different cores and hence balance the requests to L2 and memory.&lt;br /&gt;
&lt;br /&gt;
This new '''partitioned structure''' for the  memory/L2 access control unit '''enhanced''' the '''performance''' while '''reducing power consumption'''. &lt;br /&gt;
For more information on uniprocessor and multiprocessor implementation under the Intel architecture, refer to &lt;br /&gt;
[http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]&lt;br /&gt;
&lt;br /&gt;
The '''Intel bus architecture''' has been '''evolving''' to accommodate the demands of scalability while using the same MESI protocol: from a '''single shared bus''', to '''dual independent buses (DIB)''' that double the available bandwidth, to the logical conclusion of DIB with the introduction of '''dedicated high-speed interconnects (DHSI)'''. DHSI-based platforms use four FSBs, one for each processor in the platform. In both DIB and DHSI, a snoop filter in the chipset caches snoop information, significantly reducing the broadcasting needed for snoop traffic on the buses. With the production of processors based on the next-generation 45-nm Hi-k Intel Core microarchitecture, the [http://en.wikipedia.org/wiki/Xeon Intel Xeon] processor fabric will transition from DHSI, with the memory controller in the chipset, to a distributed shared memory architecture using '''Intel QuickPath Interconnects with the MESIF protocol'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Word Invalidation=&lt;br /&gt;
One source of complexity common to a number of the protocols is invalidation.  In newer protocols, individual words may be modified in a cache line, as opposed to the entire line.  The other processors will thus have a mostly correct cache line that differs in only one word.  This leads to potential complexity: to 'correct' the error, the other processors must transition from shared to invalid, after which they can read the correct word from the bus and place it back into the line, transitioning back into shared.  This transition is potentially unnecessary, as the second processor may never access the specific word that changed, but it may access other words in the cache line.  One potential solution being researched is to advance the protocols so that a cache line is &amp;quot;not invalidated on the first dirty word, but after the number of dirty words crosses some predetermined value, which is data type and application dependent.&amp;quot;  In other words, if the application can tolerate multiple 'incorrect' words, several transitions to and from the invalid state may be avoided.&lt;br /&gt;
&amp;lt;br/&amp;gt;Source: http://tab.computer.org/tcca/NEWS/sept96/dsmideas.ps&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In essence, the solution proposed here is to advance the MOESI protocol with word invalidation and specific treatment of temporal and spatial data, so that the block is not invalidated.&lt;br /&gt;
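&lt;br /&gt;
The thresholded invalidation idea can be sketched in a few lines of Python. Everything below is a hypothetical illustration rather than the cited proposal's actual design: the class name SharedLineCopy, its methods, and the choice to track stale words in a set are assumptions made for this example.&lt;br /&gt;

```python
class SharedLineCopy:
    """One cache's shared copy of a line that other caches may write."""

    def __init__(self, threshold):
        self.threshold = threshold  # stale words tolerated before invalidating
        self.state = "S"
        self.stale_words = set()

    def remote_write(self, word_index):
        """Another processor modified one word of the line."""
        if self.state == "I" or word_index in self.stale_words:
            return
        if len(self.stale_words) == self.threshold:
            # One more dirty word would cross the threshold: give up the line.
            self.state = "I"
            self.stale_words.clear()
        else:
            self.stale_words.add(word_index)

    def can_read_locally(self, word_index):
        """True if this word can be served without a bus transaction."""
        return self.state == "S" and word_index not in self.stale_words


if __name__ == "__main__":
    line = SharedLineCopy(threshold=2)
    line.remote_write(0)
    print(line.state, line.can_read_locally(3), line.can_read_locally(0))
    line.remote_write(1)
    line.remote_write(2)
    print(line.state)
```

With a threshold of zero the sketch degenerates to ordinary line invalidation, since the first remote write invalidates the copy.&lt;br /&gt;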
&lt;br /&gt;
=Performance: MOESI vs MEI/MESI=&lt;br /&gt;
Early multiprocessors (such as the PowerPC processors) were designed to work with three states (Modified, Exclusive, Invalid; this is similar to the MSI protocol discussed earlier, and the term MEI is used to remain consistent with the sources used for reference).  As time progressed, more multiprocessors transitioned to the MESI protocol.  This is most likely due to the characteristics of MEI: it is &amp;quot;easy to implement but leads to inefficiencies in the way that memory bus bandwidth is used&amp;quot; [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 source].  As a result, systems with two or more MEI processors will most likely not see the full potential of the extra processors, due to the inefficient bus transactions.&lt;br /&gt;
&lt;br /&gt;
In contrast, the five-state MOESI protocol is more complex to implement than MESI and MEI/MSI.  However, it brings advantages.  As Andy Keane (VP of Marketing for PMC-Sierra) put it, &amp;quot;this [fifth] state allows shared data that is dirty to remain in the cache.  Without this state, any shared line would be written back to memory to change the original state from modified. Since we have a dedicated, fast path from CPU to CPU, this state maximizes the use of this path rather than the path to memory, which is inherently slower&amp;quot; [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 source].  This advantage is realized by allowing MOESI to make some processor-to-processor transfers from level 2 caches.&lt;br /&gt;
&lt;br /&gt;
Source: [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 Edwards, Chris (08/06/2001). &amp;quot;Architects wrestle with multiprocessor options&amp;quot;. Electronic engineering times (0192-1541), (1178), p. 48.]&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Cache_coherence Cache coherence]&lt;br /&gt;
# [http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]&lt;br /&gt;
# [http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Symmetric_multiprocessing Common System Interface in Intel Processors]&lt;br /&gt;
# [http://www.zak.ict.pwr.wroc.pl/nikodem/ak_materialy/Cache%20consistency%20&amp;amp;%20MESI.pdf Cache consistency with MESI on Intel processor]&lt;br /&gt;
# [http://techreport.com/articles.x/8236/2 AMD dual core Architecture]&lt;br /&gt;
# [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual]&lt;br /&gt;
# [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors]&lt;br /&gt;
# [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of AMD 64 bit core]&lt;br /&gt;
# [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&amp;amp;arnumber=4913 Silicon Graphics Computer Systems]&lt;br /&gt;
# [http://books.google.com/books?id=g82fofiqa5IC&amp;amp;printsec=frontcover&amp;amp;dq=Parallel+computer+architecture:+a+hardware/software+approach+By+David+E.+Culler,+Jaswinder+Pal+Singh,+Anoop+Gupta&amp;amp;source=bl&amp;amp;ots=COrdamlfVn&amp;amp;sig=YcugVqbzTjHvlofvaFq6Ft_tjfY&amp;amp;hl=en&amp;amp;ei=0ZO6S4TJGcOclgejzI3BBw&amp;amp;sa=X&amp;amp;oi=book_result&amp;amp;ct=result&amp;amp;resnum=1&amp;amp;ved=0CAgQ6AEwAA#v=onepage&amp;amp;q=&amp;amp;f=false Parallel computer architecture: a hardware/software approach By David E. Culler, Jaswinder Pal Singh, Anoop Gupta]&lt;br /&gt;
# [http://www.freepatentsonline.com/5283886.html Three state invalidation protocols]&lt;br /&gt;
# [http://portal.acm.org/citation.cfm?id=1499317&amp;amp;dl=GUIDE&amp;amp;coll=GUIDE&amp;amp;CFID=83027384&amp;amp;CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems]&lt;br /&gt;
# [http://portal.acm.org/citation.cfm?id=6514 Cache Coherence protocols: evaluation using a multiprocessor simulation model]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Xerox_Dragon Xerox Dragon]&lt;br /&gt;
# [http://thanaseto.110mb.com/courses/CSD-527-report-engl.pdf Coherence Protocols]&lt;br /&gt;
# [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&amp;amp;arnumber=289691 XDBus]&lt;br /&gt;
# [http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dai0228a/index.html ARM]&lt;br /&gt;
# [http://ixbtlabs.com/articles/ibmpower4/index.html IBM POWER4]&lt;br /&gt;
# Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems. Solihin, Yan&lt;br /&gt;
# [http://www.chiplist.com/Intel_Itanium_2_processor/tree3f-section--2235-/ Intel Itanium 2]&lt;br /&gt;
# [http://techreport.com/articles.x/8236/2 MESI-MESI-MOESI Banana-fana...]&lt;br /&gt;
# [http://rsim.cs.illinois.edu/rsim/Manual/node109.html RSIM]&lt;br /&gt;
# [http://dl.acm.org/citation.cfm?id=1499317 Synapse tightly coupled multiprocessors]&lt;/div&gt;</summary>
		<author><name>Pxu</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8a_fu&amp;diff=59832</id>
		<title>CSC/ECE 506 Spring 2012/8a fu</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8a_fu&amp;diff=59832"/>
		<updated>2012-03-19T01:47:18Z</updated>

		<summary type="html">&lt;p&gt;Pxu: /* MOESI */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Introduction to bus-based cache coherence in real machines=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==SMP Architecture==&lt;br /&gt;
Most parallel software in the commercial market relies on the shared-memory programming model, in which all processors access the same physical address space. The most common multiprocessors today use the [http://en.wikipedia.org/wiki/Symmetric_multiprocessing SMP] architecture, which uses a common bus as the interconnect.  In the case of multicore processors (&amp;quot;chip multiprocessors,&amp;quot; or CMP), the SMP architecture applies to the cores, treating them as separate processors. The key problem of shared-memory multiprocessors is providing a consistent view of memory across the various cache hierarchies.  This is called the '''''cache coherence problem'''''. Solving it is critical to correctness and is a performance-sensitive design point for supporting the shared-memory model. The cache coherence mechanisms not only govern communication in a shared-memory multiprocessor, but also typically determine how the memory system transfers data between processors, caches, and memory.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:Busbased SMP.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
At any point in logical time, the permissions for a cache block allow either a single writer or multiple readers. The '''''coherence protocol''''' ensures that the invariants of the states are maintained. The coherence states used by most cache coherence protocols are shown in ''Table 1'':&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''States'''&lt;br /&gt;
|  '''Access Type'''&lt;br /&gt;
|  '''Invariant'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  read, write&lt;br /&gt;
|  all other caches in I state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  read&lt;br /&gt;
|  all other caches in I state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Owned'''&lt;br /&gt;
|  read&lt;br /&gt;
|  all other caches in I or S state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  read&lt;br /&gt;
|  no other cache in M or E state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|  -&lt;br /&gt;
|  -&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Snooping Protocols=&lt;br /&gt;
==MSI Protocol==&lt;br /&gt;
&lt;br /&gt;
'''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' is a three-state write-back '''invalidation protocol''' and one of the earliest snooping-based cache coherence protocols. It marks each cache line as '''Modified (M)''', '''Shared (S)''' or '''Invalid (I)'''. '''Invalid''' means the cache line is either not present or is in the invalid state. If the cache line is clean and may be shared by more than one processor, it is marked '''Shared'''. If the cache line is dirty and the processor has exclusive ownership of it, it is in the '''Modified''' state. A BusRdX causes other caches to invalidate (demote) their copies to the '''I''' state; a cache holding the line in the '''M''' state flushes it first. A processor write to a line in the '''S''' state, even though it hits in the cache, still issues a BusRdX to promote (upgrade) the line to the '''M''' state.&lt;br /&gt;
&lt;br /&gt;
The state diagram of the MSI protocol is shown below. Note that the left part shows the response to processor-side requests, and the right part shows the response to snooper-side requests.[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
[[File:MSInew.jpg|600px|thumb|center|MSI state transition diagram]]&lt;br /&gt;
&lt;br /&gt;
==Synapse protocol==&lt;br /&gt;
From the state transition diagram of MSI, we observe that there is a transition from state '''M''' to state '''S''' when a BusRd is observed for that block. The contents of the block are flushed to the bus before the block goes to the '''S''' state. In certain cases it would be more appropriate to move to the '''I''' state, giving up the block entirely. The choice between moving to '''S''' or '''I''' reflects the designer's assertion about whether the original processor is more likely to continue reading the block or the new processor is more likely to write to it. The Synapse protocol, used in the early Synapse multiprocessor, made the alternate choice of going directly from the '''M''' state to the '''I''' state on a BusRd, assuming the migratory pattern would be more frequent. More details about this protocol can be found in these papers published in the 1980s: [http://portal.acm.org/citation.cfm?id=6514 Cache Coherence protocols: evaluation using a multiprocessor simulation model] and [http://portal.acm.org/citation.cfm?id=1499317&amp;amp;dl=GUIDE&amp;amp;coll=GUIDE&amp;amp;CFID=83027384&amp;amp;CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems].&lt;br /&gt;
&lt;br /&gt;
In the Synapse protocol, the '''M''' state is called the '''D''' (Dirty) state. The following is the state transition diagram for the Synapse protocol:&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:Synapse1.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
===Real Architectures using Synapse===&lt;br /&gt;
No information could be found on any actual architecture using the Synapse cache-coherence protocol.  Judging from the abstract of the 1984 paper by Steve Frank and Armond Inselberg, &amp;quot;[http://dl.acm.org/citation.cfm?id=1499317 Synapse Tightly Coupled Multiprocessors: A New Approach to Solve Old Problems]&amp;quot;, Synapse was theoretical.&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[http://www.disi.unige.it/person/DelzannoG/CacheProtocol/synapse.hy The BABYLON Project] provides one example of a Synapse cache coherence protocol with atomic synchronization actions.&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
In the Synapse system, memory contention is reduced by designing the processor cache around a non-write-through algorithm that minimizes the bandwidth used between cache and shared memory. The Synapse Expansion Bus includes an ownership-level protocol between processor caches.[[#References|&amp;lt;sup&amp;gt;[24]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
==[http://en.wikipedia.org/wiki/MESI_protocol MESI]==&lt;br /&gt;
The main drawback of MSI is that each read-write sequence incurs two bus transactions, irrespective of whether the cache line is stored in only one cache or not. Highly parallel programs that have little data sharing suffer the most from this.  The MESI protocol solves this problem by introducing the '''Exclusive''' state to distinguish between a cache line stored in multiple caches and a line stored in a single cache.&lt;br /&gt;
Let us briefly see how the MESI protocol works. For a more detailed version, refer to the Solihin textbook, p. 215.&lt;br /&gt;
&lt;br /&gt;
The MESI coherence protocol marks each cache line as being in one of the Modified, Exclusive, Shared, or Invalid states. &lt;br /&gt;
* '''Invalid''' : The cache line is either not present or is invalid&lt;br /&gt;
* '''Exclusive''' : The cache line is clean and is owned by this core/processor only&lt;br /&gt;
* '''Modified''' : The cache line is dirty and the core/processor has exclusive ownership of the cache line, exclusive of memory as well.&lt;br /&gt;
* '''Shared''' : The cache line is clean and is shared by more than one core/processor&lt;br /&gt;
&lt;br /&gt;
The MESI protocol works as follows: &lt;br /&gt;
A line that is fetched receives the '''E''' or '''S''' state depending on whether it exists in other processors in the system. A cache line gets the '''M''' state when a processor writes to it; if the line is not in the '''E''' or '''M''' state prior to the write, the cache sends a Bus Upgrade (BusUpgr) signal or, as the Intel manuals term it, a “Read-For-Ownership (RFO) request”, which ensures that the line exists in the cache and is in the '''I''' state in all other processors on the bus (if any). The table below summarizes the MESI protocol.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''This cache line is valid?'''&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  No&lt;br /&gt;
|-&lt;br /&gt;
|  '''The memory copy is…'''&lt;br /&gt;
|  out of date&lt;br /&gt;
|  valid&lt;br /&gt;
|  valid&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies exist in caches of other processors?'''&lt;br /&gt;
|  No&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe&lt;br /&gt;
|  Maybe&lt;br /&gt;
|-&lt;br /&gt;
|  '''A write to this line'''&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  goes directly to bus&lt;br /&gt;
|}&lt;br /&gt;
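&lt;br /&gt;
The saving that the Exclusive state brings over MSI can be shown with a short Python sketch that counts bus transactions for a read miss followed by a write to the same line. The function name bus_transactions and its parameters are assumptions made for this illustration.&lt;br /&gt;

```python
def bus_transactions(protocol, other_sharers):
    """Count bus transactions for a read miss followed by a write.

    protocol is "MSI" or "MESI"; other_sharers says whether any other
    cache holds the line when it is first read.
    """
    count = 1  # the read miss always places a BusRd on the bus
    if protocol == "MSI":
        state = "S"  # MSI cannot tell a private line from a shared one
    elif protocol == "MESI":
        state = "S" if other_sharers else "E"
    else:
        raise ValueError("unknown protocol: " + protocol)
    if state == "S":
        count += 1   # the write needs a BusUpgr (RFO) to reach M
    # From E, the write moves to M silently, with no bus transaction.
    return count


if __name__ == "__main__":
    for proto in ("MSI", "MESI"):
        for shared in (False, True):
            print(proto, "other sharers:", shared, bus_transactions(proto, shared))
```

For a line held by no other cache, the sketch reports two transactions under MSI and one under MESI; when the line is shared, both protocols need two.&lt;br /&gt;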
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The state transition diagram for the MESI protocol is shown below.[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
[[File:MESInew.jpg|500px|thumb|center|MESI state transition diagram]] &lt;br /&gt;
&lt;br /&gt;
===Real Architectures using MESI===&lt;br /&gt;
&lt;br /&gt;
Intel's [http://en.wikipedia.org/wiki/Pentium_Pro Pentium Pro] microprocessor, introduced in 1995, was the first Intel architecture microprocessor to support symmetric multiprocessing in various multiprocessor configurations.  SMP with the MESI protocol was the architecture used consistently until the introduction of the 45-nm Hi-k Core microarchitecture in Intel's (Nehalem-EP) quad-core x86-64. The 45-nm Hi-k Intel Core microarchitecture uses a new system framework called the [http://en.wikipedia.org/wiki/Intel_QuickPath_Interconnect QuickPath Interconnect], which uses point-to-point interconnection technology based on a distributed shared memory architecture.  It uses a modified version of the MESI protocol called [http://en.wikipedia.org/wiki/MESIF_protocol MESIF], which introduces the additional Forward state.  (The state diagram of MESI transitions that occur within the Pentium's data cache can be found on page 63 of [http://books.google.com/books?id=TVzjEZg1--YC&amp;amp;printsec=frontcover&amp;amp;source=gbs_ge_summary_r&amp;amp;cad=0#v=onepage&amp;amp;q&amp;amp;f=false Pentium Processor System Architecture: Second Edition] by Don Anderson, Tom Shanley, MindShare, Inc.)&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
The [http://www.arm.com/products/processors/classic/arm11/arm11-mpcore.php ARM11 MPCore] and [http://en.wikipedia.org/wiki/ARM_Cortex-A9_MPCore Cortex-A9 MPCore] processors support the MESI cache coherency protocol.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]  ARM MPCore defines the states of the MESI protocol it implements as:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies in other caches'''&lt;br /&gt;
|  NO&lt;br /&gt;
|  NO&lt;br /&gt;
|  YES&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Clean or Dirty'''&lt;br /&gt;
|  DIRTY&lt;br /&gt;
|  CLEAN&lt;br /&gt;
|  CLEAN&lt;br /&gt;
|  -&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
The coherency protocol is implemented and managed by the Snoop Control Unit (SCU) in the ARM MPCore, which monitors the traffic between the local L1 data caches and the next level of the memory hierarchy. At boot time, each core can choose whether or not to take part in the coherency domain.  Unless explicit system calls bind a task to a specific core ([http://en.wikipedia.org/wiki/Processor_affinity processor affinity]), there is a high chance that a task will at some point migrate to a different core, along with its data as it is used.  Task migration is not handled efficiently by a literal MESI implementation, so the ARM MPCore offers two optimizations that preserve MESI compliance while making migration cheaper: '''Direct Data Intervention (DDI)''' (in which the SCU keeps a copy of the tag RAMs of all the cores' caches, which lets it efficiently detect whether a cache line requested by one core is held by another core in the coherency domain before looking for it in the next level of the memory hierarchy) and '''Cache-to-cache Migration''' (where, if the SCU finds that the cache line requested by one CPU is present in another core, it either copies it (if clean) or moves it (if dirty) from the other CPU directly into the requesting one, without interacting with external memory).  These optimizations reduce memory traffic in and out of the L1 cache subsystem by eliminating interaction with external memories, which in effect reduces the overall load on the interconnect and the overall power consumption.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
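The lookup-then-copy-or-move behaviour described above can be sketched in Python. This is only an illustrative toy model, not ARM's implementation: the function name scu_fetch, the dictionary standing in for the duplicated tag RAMs, and the core names are all assumptions made for this example.&lt;br /&gt;

```python
# Toy model of the SCU's duplicated tag RAMs (all names are hypothetical).
# tags maps a core name to {address: "clean" or "dirty"} for its L1 lines.

def scu_fetch(tags, requester, addr):
    """Satisfy a miss by `requester`; return where the line came from."""
    for core, lines in tags.items():
        if core != requester and addr in lines:
            status = lines[addr]
            if status == "dirty":
                del lines[addr]             # a dirty line is moved, not copied
            tags[requester][addr] = status  # requester now holds the line
            return core
    tags[requester][addr] = "clean"         # no core holds it: use external memory
    return "external memory"


tags = {"cpu0": {0x40: "dirty", 0x80: "clean"}, "cpu1": {}}
print(scu_fetch(tags, "cpu1", 0x40))  # dirty line migrates out of cpu0
print(scu_fetch(tags, "cpu1", 0x80))  # clean line is copied from cpu0
print(scu_fetch(tags, "cpu1", 0xC0))  # nobody holds it, so memory is accessed
```

Only the third request in this sketch touches external memory, which is the traffic reduction the two optimizations aim for.&lt;br /&gt;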
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
Other architectures that use the MESI cache-coherence protocol include the L2 cache of the IBM POWER4 processor[[#References|&amp;lt;sup&amp;gt;[20]&amp;lt;/sup&amp;gt;]], the L2 cache of the Intel Itanium 2 processor[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]], and the Intel Xeon[[#References|&amp;lt;sup&amp;gt;[22]&amp;lt;/sup&amp;gt;]].&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
One implementation complexity already mentioned is the inefficiency of task migration in MESI that the ARM MPCore addresses.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]  Another possible implementation complexity is found during the replacement of a cache line: one possible MESI implementation requires a message to be sent to main memory when a cache line is flushed (i.e. an E to I transition), as the line was exclusively in one cache before it was removed. It is possible to avoid this replacement message if the system is designed so that the flush of a modified (exclusive) line requires an acknowledgment from main memory. However, this requires the flush to be stored in a 'write-back' buffer until the reply arrives (to ensure the change is successfully propagated to memory).[[#References|&amp;lt;sup&amp;gt;[23]&amp;lt;/sup&amp;gt;]]&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==MESIF==&lt;br /&gt;
Let us now walk through a brief overview of the '''MESIF protocol''':&lt;br /&gt;
&lt;br /&gt;
The '''MESIF''' protocol, used in the latest Intel multi-core processors, was introduced to '''accommodate the point-to-point''' links used in the QuickPath Interconnect. Using the '''MESI''' protocol in this architecture would send many redundant messages between different processors, often with unnecessarily high latency. For example, when a processor requests a cache line that is stored in multiple locations, every location might respond with the data. As the requesting processor only needs a single copy of the data, the system would be wasting bandwidth. &lt;br /&gt;
As a solution to this problem, an additional state, the '''Forward state''', was added by slightly changing the role of the Shared state. Whenever there is a read request, only the cache line in the F state responds to the request, while all the S state caches remain dormant.  Hence, by designating a single cache line to '''respond to requests''', coherency traffic is substantially reduced when multiple copies of the data exist. Also, on a read request, the responding line transitions from the F state to the S state. That is, when a cache line in the '''F''' state is '''copied''', the F state '''migrates''' to the '''newer copy''', while the '''older''' one drops back to '''S'''. Moving the new copy to the F state '''exploits''' both '''temporal and spatial locality'''. Because the newest copy of the cache line is always in the F state, it is very unlikely that the line in the F state will be evicted from the caches. This takes advantage of the temporal locality of the request. The second advantage is that if a particular cache line is in high demand due to spatial locality, the bandwidth used to transmit that data will be spread across several nodes.&lt;br /&gt;
All M to S and E to S state transitions now become '''M to F''' and '''E to F''' transitions.  &lt;br /&gt;
The '''F state''' is '''different''' from the '''Owned state''' of the MOESI protocol as it is '''not''' a unique copy because a valid copy is stored in memory. Thus, unlike the Owned state of the MOESI protocol, in which the data in the O state is the only valid copy of the data, the data in the F state can be evicted or converted to the S state, if desired. &lt;br /&gt;
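The forwarding rule can be illustrated with a small Python sketch. It is a toy model under stated assumptions, not Intel's implementation: the function read_miss and the cache names are hypothetical, the Modified state is left out for brevity, and the E to F transition follows the description given in this section.&lt;br /&gt;

```python
# Toy model of MESIF read handling for a single cache line.
# `states` maps a (hypothetical) cache name to "E", "S", "F" or "I".
# The Modified state is deliberately omitted to keep the sketch short.

def read_miss(states, requester):
    """Handle a read miss by `requester`; return who supplied the data."""
    for name, state in list(states.items()):
        if state == "F":
            states[name] = "S"        # the old forwarder drops back to S
            states[requester] = "F"   # F migrates to the newest copy
            return name
        if state == "E":
            states[name] = "F"        # per the text: E-to-S becomes E-to-F
            states[requester] = "S"
            return name
    # No cache can forward. Memory holds a valid copy and supplies it.
    # Assumption: the requester becomes the forwarder if sharers remain.
    states[requester] = "F" if "S" in states.values() else "E"
    return "memory"


states = {"A": "I", "B": "I", "C": "I"}
for cache in ("A", "B", "C"):
    supplier = read_miss(states, cache)
    print(cache, "was served by", supplier, "and states are now", states)
```

However many caches end up sharing the line, each request in this model is answered by exactly one supplier, which is the point of the F state.&lt;br /&gt;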
&lt;br /&gt;
More information on the QuickPath Interconnect and MESIF protocol can be found at&lt;br /&gt;
'''[http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]'''&lt;br /&gt;
&lt;br /&gt;
==MOESI==&lt;br /&gt;
[http://en.wikipedia.org/wiki/Opteron AMD Opteron] was AMD’s first-generation dual-core processor, with 2 distinct [http://en.wikipedia.org/wiki/Athlon_64 K8 cores] together on a single die.  Cache coherence poses bigger problems on such multiprocessors, so an appropriate coherence protocol was needed to address it. The [http://en.wikipedia.org/wiki/Xeon Intel Xeon], the competing product from Intel, used the MESI protocol to handle cache coherence.  MESI has the drawback of using considerable time and bandwidth in certain situations. &lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/MOESI_protocol MOESI] was AMD’s answer to this problem. MOESI adds a fifth state, '''“Owned”''', to the MESI protocol. It addresses the bandwidth problem that arises in MESI when a processor holding an invalid copy of a line wants to access data that another processor has modified: the requesting processor has to wait for the processor that modified the data to write it back to main memory, which takes time and bandwidth. MOESI removes this drawback by allowing dirty sharing.  When a processor holds the data in the new '''“Owned”''' state, it can provide the modified data to other processors without, or even before, writing it to main memory. This is called '''''dirty sharing'''''. The processor holding the data in the '''&amp;quot;Owned&amp;quot;''' state remains responsible for updating main memory later, when the cache line is evicted.&lt;br /&gt;
&lt;br /&gt;
MOESI has become one of the most popular snoop-based protocols supported in the AMD64 architecture.  The AMD dual-core Opteron can maintain cache coherence in systems up to 8 processors using this protocol.&lt;br /&gt;
&lt;br /&gt;
The five different states of the MOESI protocol are:&lt;br /&gt;
* '''Modified (M)''' : The most recent copy of the data is present in the cache line. But it is not present in any other processor cache.&lt;br /&gt;
* '''Owned (O)'''   : The cache line has the most recent correct copy of the data. This can be shared by other processors. The processor holding the line in this state is responsible for updating the value in main memory before the line is evicted.  &lt;br /&gt;
* '''Exclusive (E)''' : A cache line holds the most recent, correct copy of the data, which is exclusively present on this processor and a copy is present in the main memory.  &lt;br /&gt;
* '''Shared (S)''' : A cache line in the shared state holds the most recent, correct copy of the data, which may be shared by other processors. &lt;br /&gt;
* '''Invalid (I)''' : A cache line does not hold a valid copy of the data.&lt;br /&gt;
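To make the benefit of the Owned state concrete, the following Python sketch compares what happens when another cache reads a line that the holder has modified. It is a simplified illustration under assumptions: the function names snoop_read and evict are hypothetical, and only the memory write-back count is modelled.&lt;br /&gt;

```python
# Simplified comparison of MESI and MOESI on a snooped read (hypothetical names).

def snoop_read(holder_state, protocol):
    """Another cache reads the line held in `holder_state`.

    Returns (new holder state, requester state, memory writes needed now).
    """
    if holder_state == "M":
        if protocol == "MOESI":
            return ("O", "S", 0)   # dirty sharing: supply data, defer write-back
        return ("S", "S", 1)       # MESI: memory must be brought up to date
    if holder_state == "O":
        return ("O", "S", 0)       # the owner keeps supplying the dirty data
    if holder_state in ("E", "S"):
        return ("S", "S", 0)       # clean copies: nothing to write back
    raise ValueError("holder has no valid copy")


def evict(state):
    """Memory writes needed when a line in `state` is evicted."""
    return 1 if state in ("M", "O") else 0


for protocol in ("MESI", "MOESI"):
    holder, requester, writes = snoop_read("M", protocol)
    print(protocol, "holder:", holder, "requester:", requester,
          "immediate write-backs:", writes, "on later eviction:", evict(holder))
```

In this model MOESI does not avoid the write-back; it postpones it until the owner evicts the line, which keeps it off the critical path of the read.&lt;br /&gt;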
&lt;br /&gt;
A detailed explanation of this protocol implementation on AMD processor can be found in the manual [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of the AMD 64-bit core]&lt;br /&gt;
&lt;br /&gt;
The following table summarizes the MOESI protocol:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
&lt;br /&gt;
|  '''Owned''' &lt;br /&gt;
&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|  '''This cache line is valid?'''&lt;br /&gt;
&lt;br /&gt;
|  Yes&lt;br /&gt;
&lt;br /&gt;
|  Yes&lt;br /&gt;
&lt;br /&gt;
|  Yes&lt;br /&gt;
&lt;br /&gt;
|  Yes&lt;br /&gt;
&lt;br /&gt;
|  No&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|  '''The memory copy is…'''&lt;br /&gt;
&lt;br /&gt;
|  out of date&lt;br /&gt;
&lt;br /&gt;
|  out of date&lt;br /&gt;
&lt;br /&gt;
|  valid&lt;br /&gt;
&lt;br /&gt;
|  valid&lt;br /&gt;
&lt;br /&gt;
|  -&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|  '''Copies exist in caches of other processors?'''&lt;br /&gt;
&lt;br /&gt;
|  No&lt;br /&gt;
&lt;br /&gt;
|  Maybe&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe&lt;br /&gt;
&lt;br /&gt;
|  Maybe&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|  '''A write to this line'''&lt;br /&gt;
&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
&lt;br /&gt;
|  goes directly to bus&lt;br /&gt;
&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
State transition diagram for MOESI protocol is shown below. The top part shows the response to processor-side requests and the bottom part is the response to snooper-side requests. [[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:MOESI.jpg|500px|thumb|center|MOESI state transition diagram]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br /&amp;gt;&lt;br /&gt;
&amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Optimization techniques on MOESI===&lt;br /&gt;
&lt;br /&gt;
In real machines, applying optimization techniques to the standard cache coherence protocol can improve performance. For example, the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] family of microprocessors (Family 0x10), which is AMD’s first generation to incorporate 4 distinct cores on a single die and the first to have a cache that all the cores share, uses the MOESI protocol with some optimization techniques incorporated. &lt;br /&gt;
&lt;br /&gt;
It focuses on a small subset of compute problems which behave like Producer and Consumer programs. In such a computing problem, a thread of a program running on a single core produces data, which is consumed by a thread that is running on a separate core. With such programs, it is desirable to get the two distinct cores to communicate through the shared cache, to avoid round trips to/from main memory. The '''MOESI''' protocol that the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] cache uses for cache coherence can also limit bandwidth. Hence by keeping the cache line in the '''‘M’''' state for such computing problems, we can achieve better performance.&lt;br /&gt;
&lt;br /&gt;
When the producer thread writes a new entry, it allocates cache lines in the '''modified (M)''' state. Eventually, these M-marked cache lines start to fill the L3 cache. When the consumer reads the cache line, the MOESI protocol changes the state of the cache line to '''owned (O)''' in the L3 cache and pulls down a '''shared (S)''' copy for its own use. Now, the producer thread circles the ring buffer and arrives back at the same cache line it had previously written. However, when the producer attempts to write new data to the owned (marked '''‘O’''') cache line, it finds that it cannot, since a cache line marked '''‘O’''' by the previous consumer read does not have sufficient permission for a write request (in the MOESI protocol). To maintain coherence, the memory controller must initiate probes in the other caches (to handle any other S copies that may exist). This slows down the process.&lt;br /&gt;
&lt;br /&gt;
Thus, it is preferable to keep the cache line in the '''‘M’''' state in the L3 cache. In such a situation, when the producer comes back around the ring buffer, it finds the previously written cache line still marked '''‘M’''', to which it is safe to write without coherence concerns. Thus better performance can be achieved by such optimization techniques to standard protocols when implemented in real machines.&lt;br /&gt;
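The effect of the two policies on a ring buffer can be estimated with a small Python simulation. This is a rough sketch under assumptions, not a model of the actual Phenom hardware: the function producer_probes and its parameters are hypothetical, and it simply counts one probe each time the producer finds a line in the O state.&lt;br /&gt;

```python
# Rough count of coherence probes for a producer/consumer ring buffer.
# All names here are hypothetical; one probe is charged per write to an O line.

def producer_probes(ring_slots, laps, keep_modified):
    """Return how many probes the producer triggers over `laps` passes."""
    l3_state = ["I"] * ring_slots
    probes = 0
    for _ in range(laps):
        for slot in range(ring_slots):       # producer writes every slot
            if l3_state[slot] == "O":
                probes += 1                  # must probe for stale S copies
            l3_state[slot] = "M"
        for slot in range(ring_slots):       # consumer reads every slot
            if not keep_modified:
                l3_state[slot] = "O"         # standard MOESI: M becomes O
    return probes


print("standard MOESI:", producer_probes(4, 3, keep_modified=False))
print("line kept in M:", producer_probes(4, 3, keep_modified=True))
```

With the line kept in M, the producer never pays for a probe when it comes back around the ring, which is the behaviour the optimization is after.&lt;br /&gt;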
&lt;br /&gt;
You can find more information on how this is implemented and various other ways of optimizations in this manual [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors]&lt;br /&gt;
&lt;br /&gt;
==Dragon Protocol==&lt;br /&gt;
The '''[http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]''' is an update-based coherence protocol: unlike the coherence protocols we have seen so far, it does not invalidate other cached copies. Write propagation is achieved by updating the cached copies instead of invalidating them.  The Dragon Protocol does not update memory on a cache-to-cache transfer; it delays making memory consistent with the cache until the data is evicted and written back, which saves time and lowers the memory access requirements. Moreover, only the written '''byte''' or '''word''' is '''communicated''' to the '''other caches''' instead of the whole block, which further '''reduces''' the '''bandwidth''' usage. It can dynamically detect the sharing status of a block and use a write-through policy for shared blocks and a write-back policy for blocks that are currently not shared. The Dragon Protocol employs the following four states for the cache blocks: '''Shared Clean''', '''Shared Modified''', '''Exclusive''' and  '''Modified'''. &lt;br /&gt;
* '''Modified (M)''' and '''Exclusive (E)''' - these states have the same meaning as explained in the protocols above. &lt;br /&gt;
* '''Shared Modified (Sm)''' - Only one cache in the system can hold a given block in the Shared Modified state. Potentially two or more caches have this block, memory may or may not be up to date, and this processor's cache has modified the block.&lt;br /&gt;
* '''Shared Clean (Sc)''' -  Potentially two or more caches have this block and memory may or may not be up to date (if no other cache has it in Sm state, memory will be up to date else it is not).&lt;br /&gt;
Only when a Shared Modified line is evicted from the cache on a cache miss is the block written back to main memory, in order to keep memory consistent. For more information on the Dragon protocol, refer to the Solihin textbook, page 229. The state transition diagram is given below for reference.[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
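The update-instead-of-invalidate behaviour on a write hit can be sketched in Python. This is a minimal illustration, assuming the four states listed above; the function write_hit and the processor names are hypothetical, and misses and write-backs are not modelled.&lt;br /&gt;

```python
# Minimal sketch of a Dragon write hit (hypothetical names).
# `states` maps a processor name to "E", "M", "Sc", "Sm" or None (block absent).

def write_hit(states, writer):
    """Apply a write hit by `writer`; return the number of bus updates sent."""
    if states[writer] in ("E", "M"):
        states[writer] = "M"          # sole copy: no bus traffic needed
        return 0
    sharers = [name for name, state in states.items()
               if name != writer and state in ("Sc", "Sm")]
    for name in sharers:
        states[name] = "Sc"           # copies are updated in place, not invalidated
    # The writer becomes the single Sm holder, or M if no sharer remains.
    states[writer] = "Sm" if sharers else "M"
    return 1


states = {"P0": "Sc", "P1": "Sm", "P2": None}
updates = write_hit(states, "P0")
print("bus updates:", updates, "states:", states)
```

Note that after the write every sharer still holds a valid copy, and that at most one cache is left in the Sm state.&lt;br /&gt;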
&lt;br /&gt;
[[File:DragonNew.jpg|500px|thumb|center|Dragon state transition diagram]] &lt;br /&gt;
&lt;br /&gt;
The Dragon protocol implements snoopy caches that provide the appearance of a uniform memory space to multiple processors. Here, each cache listens to 2 buses: the processor bus and the memory bus. The caches are also responsible for address translation, so the processor bus carries virtual addresses and the memory bus carries physical addresses.  The Dragon system was designed to support 4 to 8 Dragon processors.&lt;br /&gt;
&lt;br /&gt;
=[http://en.wikipedia.org/wiki/Instruction_prefetch Instruction Prefetching]=&lt;br /&gt;
&lt;br /&gt;
In standard [http://en.wikipedia.org/wiki/Pipeline_(computing) CPU pipelining], instruction prefetching is an optimization used to minimize the time the CPU spends waiting for instructions to be fetched from main memory.  Prefetching works perfectly on completely linear programs, but on programs with branches, other optimizations (like [http://en.wikipedia.org/wiki/Branch_predictor branch prediction]) have to be implemented, and the cost of branching the wrong way has to be considered.  Source:  http://dictionary.reference.com/browse/instruction+prefetch&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Instruction prefetching is a technique used to speed up the execution of a program. But in multiprocessors, prefetching comes at a cost: because of prefetching, data can be modified in ways whose effects the memory coherence protocol is not able to handle. In such situations software must use serializing instructions or cache-invalidation instructions to guarantee that subsequent data accesses are coherent. &lt;br /&gt;
&amp;lt;br /&amp;gt;&lt;br /&gt;
An example of this type of a situation is a page-table update followed by accesses to the physical pages referenced by the updated page tables. The physical-memory references for the page tables are different than the physical-memory references for the data. Because of prefetching there may be problems with correctness. The following sequence of events shows such a situation when software changes the translation of virtual-page A from physical-page M to physical-page N:&lt;br /&gt;
# The tables that translate virtual-page A to physical-page M are now held only in main memory. The copies in the cache are invalidated.&lt;br /&gt;
# Software changes the page-table entry for virtual-page A in main memory to point to physical-page N rather than physical-page M.&lt;br /&gt;
# Data in virtual-page A is accessed.&lt;br /&gt;
&lt;br /&gt;
Software expects the processor to access the data from physical-page N after the update. However, it is possible for the processor to prefetch the data from physical-page M before the page table for virtual page A is updated. Because the physical-memory references are different, the processor does not recognize them as requiring coherence checking and believes it is safe to prefetch the data from virtual-page A, which is translated into a read from physical page M. Similar behavior can occur when instructions are prefetched from beyond the page-table update instruction.&lt;br /&gt;
&lt;br /&gt;
To prevent such errors, software must execute a translation-invalidating instruction (on AMD64, INVLPG or MOV CR3) immediately after the page-table update to ensure that subsequent instruction fetches and data accesses use the correct virtual-page-to-physical-page translation. It is not necessary to perform a TLB invalidation operation preceding the table update.&lt;br /&gt;
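The hazard and its fix can be mimicked with a toy Python model. This is only an analogy under assumptions, not a description of real hardware: the ToyMMU class and its methods are hypothetical, and a cached translation stands in for anything the processor picked up ahead of the page-table update.&lt;br /&gt;

```python
# Toy analogy for a stale translation after a page-table update.
# ToyMMU and its methods are hypothetical names used only for illustration.

class ToyMMU:
    def __init__(self, page_table):
        self.page_table = page_table   # virtual page to physical page, "in memory"
        self.cached = {}               # translations the processor already holds

    def translate(self, vpage):
        """Return the physical page, reusing a cached translation if present."""
        if vpage not in self.cached:
            self.cached[vpage] = self.page_table[vpage]
        return self.cached[vpage]

    def invalidate(self, vpage):
        """Drop the cached translation, as software must do after an update."""
        self.cached.pop(vpage, None)


mmu = ToyMMU({"A": "M"})
mmu.translate("A")                 # translation picked up before the update
mmu.page_table["A"] = "N"          # software remaps virtual page A
print("without invalidation:", mmu.translate("A"))
mmu.invalidate("A")
print("after invalidation:  ", mmu.translate("A"))
```

The first lookup after the remap still resolves to the old physical page; only after the explicit invalidation does the access reach the new one, which matches the ordering requirement stated above.&lt;br /&gt;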
&lt;br /&gt;
More information can be found about this in [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual]&lt;br /&gt;
&lt;br /&gt;
= CMP Implementation in Intel Architecture =&lt;br /&gt;
&lt;br /&gt;
Let us now see how Intel architecture using the MESI protocol progressed from a uniprocessor architecture to a Chip MultiProcessor (CMP) using the bus as the interconnect. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Uniprocessor Architecture'''&lt;br /&gt;
&lt;br /&gt;
The diagram below shows the structure of the memory cluster in Intel Pentium M processor.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:intel_cache1.jpg]]&amp;lt;/center&amp;gt; &amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In this structure we have,&lt;br /&gt;
* A unified on-chip '''L1 cache''' with the '''processor/core''',&lt;br /&gt;
* A '''Memory/L2 access control unit''', through which all the accesses to the L2 cache, main memory and IO space are made,&lt;br /&gt;
* The second level '''L2 cache''' along with the '''prefetch unit''' and&lt;br /&gt;
* '''Front side bus (FSB)''', a single shared bi-directional bus through which all the traffic is sent. These wide buses bring in multiple data bytes at a time. &lt;br /&gt;
&lt;br /&gt;
As Intel explains it, using this structure, the processor requests were first sought in the '''L2 cache''' and only on a '''miss''' were they '''forwarded''' to the main '''memory''' via the front side bus ('''FSB'''). The '''Memory/L2 access control''' unit served as a central point for '''maintaining coherence''' within the core and with the external world. It '''contains''' a '''snoop control unit''' that receives snoop requests from the bus and performs the required operations on each cache (and internal buffers) in parallel. It also handles RFO requests (BusUpgr) and ensures the operation continues only after it guarantees that no other version of the cache line exists in any other cache in the system.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''CMP Architecture'''&lt;br /&gt;
&lt;br /&gt;
For the CMP implementation, Intel chose the bus-based architecture using snoopy protocols over the '''directory protocol''' because, although a directory protocol reduces active power through reduced snoop activity, it '''increases''' the '''design complexity''' and the '''static power''' due to larger tag arrays. Since Intel has a large market for processors in the mobility family, a directory-based solution was less favorable, as battery life depends mainly on static power consumption and less on dynamic power.&lt;br /&gt;
Let us examine how '''CMP''' was implemented in the '''Intel Core Duo''', which was one of the first dual-core processors for the budget/entry-level market. &lt;br /&gt;
The general CMP implementation structure of the Intel Core Duo is shown below.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:intel_cache2.jpg]]&amp;lt;/center&amp;gt; &amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This structure has the following changes when compared to the uniprocessor memory cluster structure. &lt;br /&gt;
* '''L1 cache''' and the '''processor/core''' structure is '''duplicated''' to give 2 cores.&lt;br /&gt;
* The '''Memory/L2 access control''' unit is '''split''' into 2 logical units: '''L2 controller''' and '''bus controller'''. The L2 controller handles all '''requests to the L2''' cache from the core and the snoop requests from the FSB. The '''bus controller''' handles '''data and I/O requests''' to and from the FSB.&lt;br /&gt;
* The '''prefetching''' unit is extended to handle the hardware '''prefetches for each core separately'''.&lt;br /&gt;
* A '''new logical unit''' (represented by the hexagon) was added to maintain '''fairness between the requests''' coming from the different cores and hence balance the requests to L2 and memory.&lt;br /&gt;
&lt;br /&gt;
This new '''partitioned structure''' for the  memory/L2 access control unit '''enhanced''' the '''performance''' while '''reducing power consumption'''. &lt;br /&gt;
For more information on uniprocessor and multiprocessor implementation under the Intel architecture, refer to &lt;br /&gt;
[http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]&lt;br /&gt;
&lt;br /&gt;
The '''Intel bus architecture''' has been '''evolving''' in order to accommodate the demands of scalability while using the same MESI protocol: from a '''single shared bus''' to '''dual independent buses (DIB)''', doubling the available bandwidth, and on to the logical conclusion of DIB with the introduction of '''dedicated high-speed interconnects (DHSI)'''. The DHSI-based platforms use four FSBs, one for each processor in the platform. In both DIB and DHSI, a snoop filter was used in the chipset to cache snoop information, thereby significantly reducing the broadcasting needed for the snoop traffic on the buses. With the production of processors based on next generation 45-nm Hi-k Intel Core microarchitecture, the [http://en.wikipedia.org/wiki/Xeon Intel Xeon] processor fabric will transition from a DHSI, with the memory controller in the chipset, to a distributed shared memory architecture using '''Intel QuickPath Interconnects using MESIF protocol'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Word Invalidation=&lt;br /&gt;
One complexity problem applying to a number of the protocols deals with invalidation.  In newer protocols, individual words may be modified in a cache line, as opposed to the entirety of the line.  The other processors will thus have a mostly correct cache line, with only a word difference.  This leads to potential complexity, because to 'correct' the error the other processors must transition from shared to invalid, where they then can read the correct word from the bus and place it back into the line, transitioning back into shared.  This transition is potentially unnecessary, as the second processor may never access the specific word changed, but it may access other words in the cache line.  One potential solution to this is being researched at the present time, advancing the protocols so that a cache line is &amp;quot;not invalidated on the first dirty word, but after the number of dirty words crosses some predetermined value, which is data type and application dependent.&amp;quot;  In other words, if the application can possibly have multiple words 'incorrect,' several transitions to and from the invalid state may be avoided.&lt;br /&gt;
&amp;lt;br/&amp;gt;Source: http://tab.computer.org/tcca/NEWS/sept96/dsmideas.ps&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In essence, the solution proposed here is to advance the MOESI protocol with word invalidation and specific treatment of temporal and spatial data, so that the block is not invalidated.&lt;br /&gt;
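The threshold idea quoted above can be sketched in Python. This is a hypothetical illustration rather than the cited proposal's actual mechanism: the function apply_remote_writes and its parameters are assumptions, and it only counts how many times a sharer's copy of one line would be invalidated.&lt;br /&gt;

```python
# Sketch of threshold-based invalidation for one shared cache line.
# Function and parameter names are hypothetical.

def apply_remote_writes(dirty_word_threshold, written_words):
    """Return (invalidations, words still stale) for a sharer's copy."""
    stale = set()
    invalidations = 0
    for word in written_words:
        stale.add(word)                        # this word is now out of date
        if len(stale) == dirty_word_threshold:
            invalidations += 1                 # too many stale words: invalidate
            stale.clear()                      # the line is refetched in full
    return invalidations, sorted(stale)


writes = [0, 1, 0, 2, 5]                       # word offsets written remotely
print("invalidate on first dirty word:", apply_remote_writes(1, writes))
print("invalidate at three dirty words:", apply_remote_writes(3, writes))
```

A threshold of one reproduces ordinary invalidation, while a larger threshold trades fewer invalidations for a copy that is allowed to contain a few stale words, which is safe only if the sharer does not read those words.&lt;br /&gt;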
&lt;br /&gt;
=Performance: MOESI vs MEI/MESI=&lt;br /&gt;
Early multiprocessors (such as the PowerPC processors) were designed to work with three states (Modified, Exclusive, Invalid - this is similar to the MSI protocol discussed earlier, the term MEI is used to remain consistent with the sources used for reference).  As time progressed, more multi-processors transitioned to the MESI protocol.  This is most likely due to the characteristics of MEI - it is &amp;quot;easy to implement but leads to inefficiencies in the way that memory bus bandwidth is used&amp;quot; [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 source].  As a result, systems with two or more MEI processors (most likely) will not see the 'full' potential of the extra processors, due to the inefficient bus transactions.&lt;br /&gt;
&lt;br /&gt;
In contrast, the five-state protocol MOESI is more complex to implement than MESI and MEI/MSI.  However, advantages arise from using it.  As Andy Keane (VP of Marketing for PMC-Sierra) put it, &amp;quot;this [fifth] state allows shared data that is dirty to remain in the cache.  Without this state, any shared line would be written back to memory to change the original state from modified. Since we have a dedicated, fast path from CPU to CPU, this state maximizes the use of this path rather than the path to memory, which is inherently slower&amp;quot; [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 source].  This advantage can be realized by allowing MOESI to make some processor-to-processor transfers from level 2 caches.&lt;br /&gt;
&lt;br /&gt;
Source: [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 Edwards, Chris (08/06/2001). &amp;quot;Architects wrestle with multiprocessor options&amp;quot;. Electronic engineering times (0192-1541), (1178), p. 48.]&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Cache_coherence Cache coherence]&lt;br /&gt;
# [http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]&lt;br /&gt;
# [http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Symmetric_multiprocessing Common System Interface in Intel Processors]&lt;br /&gt;
# [http://www.zak.ict.pwr.wroc.pl/nikodem/ak_materialy/Cache%20consistency%20&amp;amp;%20MESI.pdf Cache consistency with MESI on Intel processor]&lt;br /&gt;
# [http://techreport.com/articles.x/8236/2 AMD dual core Architecture]&lt;br /&gt;
# [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual]&lt;br /&gt;
# [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors]&lt;br /&gt;
# [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of AMD 64 bit core]&lt;br /&gt;
# [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&amp;amp;arnumber=4913 Silicon Graphics Computer Systems]&lt;br /&gt;
# [http://books.google.com/books?id=g82fofiqa5IC&amp;amp;printsec=frontcover&amp;amp;dq=Parallel+computer+architecture:+a+hardware/software+approach+By+David+E.+Culler,+Jaswinder+Pal+Singh,+Anoop+Gupta&amp;amp;source=bl&amp;amp;ots=COrdamlfVn&amp;amp;sig=YcugVqbzTjHvlofvaFq6Ft_tjfY&amp;amp;hl=en&amp;amp;ei=0ZO6S4TJGcOclgejzI3BBw&amp;amp;sa=X&amp;amp;oi=book_result&amp;amp;ct=result&amp;amp;resnum=1&amp;amp;ved=0CAgQ6AEwAA#v=onepage&amp;amp;q=&amp;amp;f=false Parallel computer architecture: a hardware/software approach By David E. Culler, Jaswinder Pal Singh, Anoop Gupta]&lt;br /&gt;
# [http://www.freepatentsonline.com/5283886.html Three state invalidation protocols]&lt;br /&gt;
# [http://portal.acm.org/citation.cfm?id=1499317&amp;amp;dl=GUIDE&amp;amp;coll=GUIDE&amp;amp;CFID=83027384&amp;amp;CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems]&lt;br /&gt;
# [http://portal.acm.org/citation.cfm?id=6514 Cache Coherence protocols: evaluation using a multiprocessor simulation model]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Xerox_Dragon Xerox Dragon]&lt;br /&gt;
# [http://thanaseto.110mb.com/courses/CSD-527-report-engl.pdf Coherence Protocols]&lt;br /&gt;
# [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&amp;amp;arnumber=289691 XDBus]&lt;br /&gt;
# [http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dai0228a/index.html ARM]&lt;br /&gt;
# [http://ixbtlabs.com/articles/ibmpower4/index.html IBM POWER4]&lt;br /&gt;
# Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems. Solihin, Yan&lt;br /&gt;
# [http://www.chiplist.com/Intel_Itanium_2_processor/tree3f-section--2235-/ Intel Itanium 2]&lt;br /&gt;
# [http://techreport.com/articles.x/8236/2 MESI-MESI-MOESI Banana-fana...]&lt;br /&gt;
# [http://rsim.cs.illinois.edu/rsim/Manual/node109.html RSIM]&lt;br /&gt;
# [http://dl.acm.org/citation.cfm?id=1499317 Synapse tightly coupled multiprocessors]&lt;/div&gt;</summary>
		<author><name>Pxu</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8a_fu&amp;diff=59831</id>
		<title>CSC/ECE 506 Spring 2012/8a fu</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8a_fu&amp;diff=59831"/>
		<updated>2012-03-19T01:47:03Z</updated>

		<summary type="html">&lt;p&gt;Pxu: /* Dragon Protocol */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Introduction to bus-based cache coherence in real machines=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==SMP Architecture==&lt;br /&gt;
Most parallel software in the commercial market relies on the shared-memory programming model, in which all processors access the same physical address space. The most common multiprocessors today use the [http://en.wikipedia.org/wiki/Symmetric_multiprocessing SMP] architecture, which uses a common bus as the interconnect.  In the case of multicore processors (&amp;quot;chip multiprocessors,&amp;quot; or CMP), the SMP architecture applies to the cores, treating them as separate processors. The key problem of shared-memory multiprocessors is providing a consistent view of memory across the various cache hierarchies.  This is called the '''''cache coherence problem'''''. Solving it is critical for correctness, and it is a performance-sensitive design point in supporting the shared-memory model. The cache coherence mechanisms not only govern communication in a shared-memory multiprocessor, but also typically determine how the memory system transfers data between processors, caches, and memory.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:Busbased SMP.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
At any point in logical time, the permissions for a cache block allow either a single writer or multiple readers. The '''''coherence protocol''''' ensures that these invariants are maintained. The coherence states used by most cache coherence protocols are shown in ''Table 1'':&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''States'''&lt;br /&gt;
|  '''Access Type'''&lt;br /&gt;
|  '''Invariant'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  read, write&lt;br /&gt;
|  all other caches in I state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  read&lt;br /&gt;
|  all other caches in I state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Owned'''&lt;br /&gt;
|  read&lt;br /&gt;
|  all other caches in I or S state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  read&lt;br /&gt;
|  no other cache in M or E state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|  -&lt;br /&gt;
|  -&lt;br /&gt;
|}&lt;br /&gt;
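&lt;br /&gt;
The invariants in ''Table 1'' can be sketched as a small check over the states that the caches hold for one block. This is only an illustration: the function name is_coherent and the list-of-letters representation are assumptions, not part of any real protocol implementation.&lt;br /&gt;

```python
# Hypothetical helper: check the Table 1 invariants for one cache block.
# states holds one state letter per cache, for example ['M', 'I', 'I'].

def is_coherent(states):
    exclusive_holders = states.count('M') + states.count('E')
    valid_copies = len(states) - states.count('I')
    # M and E: every other cache must be in the I state.
    if exclusive_holders != 0:
        return exclusive_holders == 1 and valid_copies == 1
    # O: at most one owner; the remaining caches are in I or S.
    # S: allowed here because no cache is in M or E.
    return states.count('O') in (0, 1)


print(is_coherent(['M', 'I', 'I']))  # True
print(is_coherent(['O', 'S', 'I']))  # True
print(is_coherent(['M', 'S', 'I']))  # False
```

The last call fails the check because a Shared copy coexists with a Modified one, which breaks the single-writer rule.&lt;br /&gt;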
&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Snooping Protocols=&lt;br /&gt;
==MSI Protocol==&lt;br /&gt;
&lt;br /&gt;
'''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' is a three-state write-back '''invalidation protocol''' and one of the earliest snooping-based cache coherence protocols. It marks each cache line as '''Modified (M)''', '''Shared (S)''', or '''Invalid (I)'''. '''Invalid''' means the cache line is either not present or is present but invalid. If the cache line is clean and may be shared by more than one processor, it is marked '''Shared'''. If the cache line is dirty and the processor has exclusive ownership of it, it is in the '''Modified''' state. A BusRdX causes other caches to invalidate (demote) their copies to the '''I''' state; a cache that holds the line in the '''M''' state flushes it first. A processor write to a line in the '''S''' state, even though it hits in the cache, must still post a BusRdX before the line is promoted (upgraded) to the '''M''' state.&lt;br /&gt;
&lt;br /&gt;
The state diagram of the MSI protocol is shown below. Note that the left part shows the response to processor-side requests, and the right part shows the response to snooper-side requests.[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
[[File:MSInew.jpg|600px|thumb|center|MSI state transition diagram]]&lt;br /&gt;
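&lt;br /&gt;
The MSI transitions described above can also be written as a lookup table. The sketch below is a simplified model under stated assumptions: the dictionary layout and the step helper are hypothetical, and a write hit in '''S''' is modeled with BusRdX rather than a separate upgrade transaction.&lt;br /&gt;

```python
# A minimal MSI sketch. The dictionary layout and the step() helper are
# assumptions for illustration; event names follow the text above.
MSI = {
    # processor-side requests: (state, event) maps to (next state, bus action)
    ('I', 'PrRd'): ('S', 'BusRd'),
    ('I', 'PrWr'): ('M', 'BusRdX'),
    ('S', 'PrRd'): ('S', None),
    ('S', 'PrWr'): ('M', 'BusRdX'),  # a write hit in S still goes to the bus
    ('M', 'PrRd'): ('M', None),
    ('M', 'PrWr'): ('M', None),
    # snooper-side requests
    ('M', 'BusRd'): ('S', 'Flush'),
    ('M', 'BusRdX'): ('I', 'Flush'),
    ('S', 'BusRd'): ('S', None),
    ('S', 'BusRdX'): ('I', None),
    ('I', 'BusRd'): ('I', None),
    ('I', 'BusRdX'): ('I', None),
}


def step(state, event):
    # Look up the next state and the bus action for one cache line.
    return MSI[(state, event)]


print(step('S', 'PrWr'))   # ('M', 'BusRdX')
print(step('M', 'BusRd'))  # ('S', 'Flush')
```

Changing the ('M', 'BusRd') entry to ('I', 'Flush') would give the alternate choice discussed for the Synapse protocol.&lt;br /&gt;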
&lt;br /&gt;
==Synapse protocol==&lt;br /&gt;
From the state transition diagram of MSI, we observe that there is a transition from state '''M''' to state '''S''' when a BusRd is observed for that block. The contents of the block are flushed to the bus before the block goes to the '''S''' state. In certain cases it would be more appropriate to move to the '''I''' state, giving up the block entirely. The choice between '''S''' and '''I''' reflects the designer's assertion about whether the original processor is more likely to continue reading the block than the new processor is to write to it. The Synapse protocol, used in the early Synapse multiprocessor, made the alternate choice of going directly from the '''M''' state to the '''I''' state on a BusRd, assuming the migratory pattern would be more frequent. More details about this protocol can be found in these papers published in the 1980s: [http://portal.acm.org/citation.cfm?id=6514 Cache Coherence protocols: evaluation using a multiprocessor simulation model] and [http://portal.acm.org/citation.cfm?id=1499317&amp;amp;dl=GUIDE&amp;amp;coll=GUIDE&amp;amp;CFID=83027384&amp;amp;CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems].&lt;br /&gt;
&lt;br /&gt;
In the Synapse protocol, the '''M''' state is called the '''D''' (Dirty) state. The following is the state transition diagram for the Synapse protocol:&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:Synapse1.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
===Real Architectures using Synapse===&lt;br /&gt;
No information could be found on any actual architecture using the Synapse cache-coherence protocol.  Judging from the abstract of the paper written by Steve Frank and Armond Inselberg in 1984, &amp;quot;[http://dl.acm.org/citation.cfm?id=1499317 Synapse Tightly Coupled Multiprocessors: A New Approach to Solve Old Problems]&amp;quot;, Synapse appears to have been theoretical.&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The [http://www.disi.unige.it/person/DelzannoG/CacheProtocol/synapse.hy BABYLON Project] provides one example of a Synapse cache coherence protocol with atomic synchronization actions.&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
In the Synapse system, memory contention is reduced by designing the processor cache around a non-write-through algorithm that minimizes the bandwidth used between cache and shared memory. The Synapse Expansion Bus includes an ownership-level protocol between processor caches.[[#References|&amp;lt;sup&amp;gt;[24]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
==[http://en.wikipedia.org/wiki/MESI_protocol MESI]==&lt;br /&gt;
The main drawback of MSI is that each read-write sequence incurs two bus transactions, irrespective of whether the cache line is stored in only one cache or not. Highly parallel programs that have little data sharing suffer the most from this.  The MESI protocol solves this problem by introducing the '''Exclusive''' state to distinguish between a cache line stored in multiple caches and a line stored in a single cache.&lt;br /&gt;
Let us briefly see how the MESI protocol works. For a more detailed version, refer to the Solihin textbook, p. 215.&lt;br /&gt;
&lt;br /&gt;
The MESI coherence protocol marks each cache line as being in one of the Modified, Exclusive, Shared, or Invalid states. &lt;br /&gt;
* '''Invalid''' : The cache line is either not present or is invalid&lt;br /&gt;
* '''Exclusive''' : The cache line is clean and is owned by this core/processor only&lt;br /&gt;
* '''Modified''' : The cache line is dirty and the core/processor has exclusive ownership of it, exclusive of memory as well (the memory copy is out of date).&lt;br /&gt;
* '''Shared''' : The cache line is clean and is shared by more than one core/processor&lt;br /&gt;
&lt;br /&gt;
The MESI protocol works as follows: &lt;br /&gt;
A line that is fetched receives the '''E''' or '''S''' state, depending on whether it exists in other processors in the system. A cache line gets the '''M''' state when a processor writes to it; if the line is not in the '''E''' or '''M''' state prior to the write, the cache sends a Bus Upgrade (BusUpgr) signal or, as the Intel manuals term it, a “Read-For-Ownership (RFO) request”, which ensures that the line exists in the cache and is in the '''I''' state in all other processors on the bus (if any). The table below summarizes the MESI protocol.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''This cache line is valid?'''&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  No&lt;br /&gt;
|-&lt;br /&gt;
|  '''The memory copy is…'''&lt;br /&gt;
|  out of date&lt;br /&gt;
|  valid&lt;br /&gt;
|  valid&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies exist in caches of other processors?'''&lt;br /&gt;
|  No&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe&lt;br /&gt;
|  Maybe&lt;br /&gt;
|-&lt;br /&gt;
|  '''A write to this line'''&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  goes directly to bus&lt;br /&gt;
|}&lt;br /&gt;
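&lt;br /&gt;
The saving that the '''Exclusive''' state brings for a read followed by a write can be counted with a short sketch. The function name bus_transactions and its parameters are hypothetical, and the model assumes the line starts in the '''I''' state in the requesting cache.&lt;br /&gt;

```python
# Hypothetical sketch: count bus transactions for a read miss followed by
# a write to the same line, starting from the I state.

def bus_transactions(protocol, shared_elsewhere):
    # The read miss posts one BusRd under both protocols.
    count = 1
    if protocol == 'MESI' and not shared_elsewhere:
        state = 'E'  # no other cache holds the line
    else:
        state = 'S'  # MSI has no E state; MESI uses S when sharers exist
    # The write: from E the upgrade to M is silent, from S it goes to the bus.
    if state == 'S':
        count += 1
    return count


print(bus_transactions('MSI', False))   # 2
print(bus_transactions('MESI', False))  # 1
print(bus_transactions('MESI', True))   # 2
```

For data that is not shared, MESI halves the bus traffic of this sequence; when other caches hold the line, it behaves like MSI.&lt;br /&gt;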
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
State transition diagram for MESI protocol is shown below.[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
[[File:MESInew.jpg|500px|thumb|center|MESI state transition diagram]] &lt;br /&gt;
&lt;br /&gt;
===Real Architectures using MESI===&lt;br /&gt;
&lt;br /&gt;
Intel's [http://en.wikipedia.org/wiki/Pentium_Pro Pentium Pro] microprocessor, introduced in 1995, was the first Intel architecture microprocessor to support symmetric multiprocessing in various multiprocessor configurations.  SMP with the MESI protocol was used consistently until the introduction of the 45-nm Hi-k Intel Core microarchitecture in Intel's quad-core x86-64 Nehalem-EP. This microarchitecture uses a new system framework called the [http://en.wikipedia.org/wiki/Intel_QuickPath_Interconnect QuickPath Interconnect], which uses point-to-point interconnection technology based on a distributed shared memory architecture.  It uses a modified version of the MESI protocol called [http://en.wikipedia.org/wiki/MESIF_protocol MESIF], which introduces the additional Forward state.  (The state diagram of the MESI transitions that occur within the Pentium's data cache can be found on page 63 of [http://books.google.com/books?id=TVzjEZg1--YC&amp;amp;printsec=frontcover&amp;amp;source=gbs_ge_summary_r&amp;amp;cad=0#v=onepage&amp;amp;q&amp;amp;f=false Pentium Processor System Architecture: Second Edition] by Don Anderson, Tom Shanley, MindShare, Inc.)&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
The [http://www.arm.com/products/processors/classic/arm11/arm11-mpcore.php ARM11 MPCore] and [http://en.wikipedia.org/wiki/ARM_Cortex-A9_MPCore Cortex-A9 MPCore] processors support the MESI cache coherency protocol.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]  ARM MPCore defines the states of the MESI protocol it implements as:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies in other caches'''&lt;br /&gt;
|  NO&lt;br /&gt;
|  NO&lt;br /&gt;
|  YES&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Clean or Dirty'''&lt;br /&gt;
|  DIRTY&lt;br /&gt;
|  CLEAN&lt;br /&gt;
|  CLEAN&lt;br /&gt;
|  -&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
The coherency protocol is implemented and managed by the Snoop Control Unit (SCU) in the ARM MPCore, which monitors the traffic between the local L1 data caches and the next level of the memory hierarchy. At boot time, each core can choose whether or not to take part in the coherency domain.  Unless explicit system calls bind a task to a specific core ([http://en.wikipedia.org/wiki/Processor_affinity processor affinity]), there is a high chance that the task will at some point migrate to a different core, along with its data as it is used.  Task migration is not handled efficiently by a literal MESI implementation, so the ARM MPCore offers two optimizations that allow for both MESI compliance and efficient task migration: '''Direct Data Intervention (DDI)''' (in which the SCU keeps a copy of the tag RAMs of all cores' caches, which lets it efficiently detect whether a cache line requested by one core is held by another core in the coherency domain before looking for it in the next level of the memory hierarchy) and '''Cache-to-cache Migration''' (in which, if the SCU finds that the cache line requested by one CPU is present in another core, it either copies it (if clean) or moves it (if dirty) from the other CPU directly into the requesting one, without interacting with external memory).  These optimizations reduce memory traffic in and out of the L1 cache subsystem by eliminating interaction with external memories, which in effect reduces the overall load on the interconnect and the overall power consumption.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
Other architectures that use the MESI cache-coherence protocol include the L2 cache of the IBM POWER4 processor[[#References|&amp;lt;sup&amp;gt;[20]&amp;lt;/sup&amp;gt;]], the L2 cache of the Intel Itanium 2 processor[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]], and the Intel Xeon[[#References|&amp;lt;sup&amp;gt;[22]&amp;lt;/sup&amp;gt;]].&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
One implementation complexity already mentioned is the inefficiency of task migration in MESI that the ARM MPCore addresses.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]  Another possible implementation complexity is found during the replacement of a cache line: one possible MESI implementation requires a message to be sent to main memory when a cache line is flushed (i.e. an E to I transition), as the line was exclusively in one cache before it was removed. It is possible to avoid this replacement message if the system is designed so that the flush of a modified (exclusive) line requires an acknowledgment from main memory. However, this requires the flush to be stored in a 'write-back' buffer until the reply arrives (to ensure the change is successfully propagated to memory).[[#References|&amp;lt;sup&amp;gt;[23]&amp;lt;/sup&amp;gt;]]&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==MESIF==&lt;br /&gt;
Let us now briefly walk through the '''MESIF protocol''':&lt;br /&gt;
&lt;br /&gt;
The '''MESIF''' protocol, used in the latest Intel multi-core processors, was introduced to '''accommodate the point-to-point''' links used in the QuickPath Interconnect. Using the '''MESI''' protocol in this architecture would send many redundant messages between different processors, often with unnecessarily high latency. For example, when a processor requests a cache line that is stored in multiple locations, every location might respond with the data. As the requesting processor needs only a single copy of the data, the system would be wasting bandwidth. &lt;br /&gt;
As a solution to this problem, an additional state, the '''Forward state''', was added by slightly changing the role of the Shared state. Whenever there is a read request, only the cache holding the line in the F state responds to the request, while all the caches holding it in the S state remain dormant.  Hence, by designating a single cache line to '''respond to requests''', coherency traffic is substantially reduced when multiple copies of the data exist. Also, on a read request, the responding copy transitions from the F state to the S state. That is, when a cache line in the '''F''' state is '''copied''', the F state '''migrates''' to the '''newer copy''', while the '''older''' one drops back to '''S'''. Moving the new copy to the F state '''exploits''' both '''temporal and spatial locality'''. Because the newest copy of the cache line is always in the F state, it is very unlikely that the line in the F state will be evicted from the caches. This takes advantage of the temporal locality of the request. The second advantage is that if a particular cache line is in high demand due to spatial locality, the bandwidth used to transmit that data will be spread across several nodes.&lt;br /&gt;
All M-to-S and E-to-S state transitions now become '''M-to-F''' and '''E-to-F''' transitions.  &lt;br /&gt;
The '''F state''' is '''different''' from the '''Owned state''' of the MOESI protocol, as the line in the F state is '''not''' the only valid copy: a valid copy is also stored in memory. Thus, unlike the Owned state of the MOESI protocol, in which the cache holds dirty data that memory does not yet have, a line in the F state can be evicted or converted to the S state, if desired. &lt;br /&gt;
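&lt;br /&gt;
The migration of the F state on a read can be sketched as follows. This is an illustrative model only: the function read_miss and the dictionary of per-cache states are assumptions, holders in the M or E state are not modeled, and the choice to give the requester the F state when memory supplies a line that has only S copies is also an assumption.&lt;br /&gt;

```python
# Hypothetical MESIF sketch: serve a read miss for one cache line.
# states maps a cache name to its state letter for that line.

def read_miss(states, requester):
    forwarders = [name for name, s in states.items() if s == 'F']
    if forwarders:
        supplier = forwarders[0]
        states[supplier] = 'S'   # the older copy drops back to S
        states[requester] = 'F'  # the newest copy becomes the forwarder
        return supplier
    # No forwarder: memory supplies the line (M and E holders are not modeled).
    has_sharers = 'S' in states.values()
    states[requester] = 'F' if has_sharers else 'E'
    return 'memory'


states = dict(P0='F', P1='S', P2='I')
print(read_miss(states, 'P2'))  # P0
print(states['P0'], states['P1'], states['P2'])  # S S F
```

Only P0 answers the request even though P1 also holds a valid copy, which is the traffic reduction described above.&lt;br /&gt;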
&lt;br /&gt;
More information on the QuickPath Interconnect and MESIF protocol can be found at&lt;br /&gt;
'''[http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]'''&lt;br /&gt;
&lt;br /&gt;
==MOESI==&lt;br /&gt;
[http://en.wikipedia.org/wiki/Opteron AMD Opteron] was AMD’s first-generation dual-core processor, with 2 distinct [http://en.wikipedia.org/wiki/Athlon_64 K8 cores] together on a single die.  Cache coherence produces bigger problems on such multiprocessors, and it was necessary to use an appropriate coherence protocol to address them. The [http://en.wikipedia.org/wiki/Xeon Intel Xeon], the competing part from Intel, used the MESI protocol to handle cache coherence.  MESI came with the drawback of using much time and bandwidth in certain situations. &lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/MOESI_protocol MOESI] was AMD’s answer to this problem. MOESI added a fifth state, called '''“Owned”''', to the MESI protocol. MOESI addresses the bandwidth problem that arises in the MESI protocol when a processor holding an invalid copy in its cache wants to access data that another processor has modified.  Under MESI, the processor seeking access has to wait for the processor that modified the data to write it back to main memory, which takes time and bandwidth. MOESI removes this drawback by allowing dirty sharing.  When the data is held by a processor in the new '''“Owned”''' state, that processor can provide the modified data to other processors without, or even before, writing it to main memory. This is called '''''dirty sharing'''''. The processor holding the data in the '''&amp;quot;Owned&amp;quot;''' state remains responsible for updating main memory later, when the cache line is evicted.&lt;br /&gt;
&lt;br /&gt;
MOESI has become one of the most popular snoop-based protocols supported in the AMD64 architecture.  The AMD dual-core Opteron can maintain cache coherence in systems up to 8 processors using this protocol.&lt;br /&gt;
&lt;br /&gt;
The five different states of the MOESI protocol are:&lt;br /&gt;
* '''Modified (M)''' : The cache line holds the most recent, correct copy of the data, which is not present in any other processor's cache. The copy in main memory is out of date.&lt;br /&gt;
* '''Owned (O)''' : The cache line holds the most recent, correct copy of the data, which may be shared by other processors. The processor holding the line in this state is responsible for writing the correct value to main memory when the line is evicted.  &lt;br /&gt;
* '''Exclusive (E)''' : The cache line holds the most recent, correct copy of the data, which is present only on this processor; main memory also holds a valid copy.  &lt;br /&gt;
* '''Shared (S)''' : A cache line in the shared state holds the most recent, correct copy of the data, which may be shared by other processors. &lt;br /&gt;
* '''Invalid (I)''' : A cache line does not hold a valid copy of the data.&lt;br /&gt;
&lt;br /&gt;
A detailed explanation of how this protocol is implemented on AMD processors can be found in the article [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of the AMD 64-bit core].&lt;br /&gt;
&lt;br /&gt;
The following table summarizes the MOESI protocol:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Owned'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''This cache line is valid?'''&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  No&lt;br /&gt;
|-&lt;br /&gt;
|  '''The memory copy is…'''&lt;br /&gt;
|  out of date&lt;br /&gt;
|  out of date&lt;br /&gt;
|  valid&lt;br /&gt;
|  valid&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies exist in caches of other processors?'''&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe&lt;br /&gt;
|  Maybe&lt;br /&gt;
|-&lt;br /&gt;
|  '''A write to this line'''&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  goes directly to bus&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
State transition diagram for MOESI protocol is shown below. The top part shows the response to processor-side requests and the bottom part is the response to snooper-side requests. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:MOESI.jpg|500px|thumb|center|MOESI state transition diagram]]&lt;br /&gt;
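&lt;br /&gt;
Dirty sharing can be illustrated by comparing what happens under MESI and MOESI when another processor reads a dirty line. The function remote_read and its return tuple are hypothetical, and the sketch models only lines held in the '''M''' or '''O''' state.&lt;br /&gt;

```python
# Hypothetical sketch: another processor reads a line this cache holds dirty.
# Returns (holder next state, requester state, memory write-backs).

def remote_read(protocol, holder_state):
    if holder_state not in ('M', 'O'):
        raise ValueError('this sketch only models dirty lines')
    if protocol == 'MOESI':
        # Dirty sharing: the holder keeps ownership and memory stays stale.
        return ('O', 'S', 0)
    if protocol == 'MESI' and holder_state == 'M':
        # The flush also updates main memory before both copies become S.
        return ('S', 'S', 1)
    raise ValueError('unsupported combination')


print(remote_read('MESI', 'M'))   # ('S', 'S', 1)
print(remote_read('MOESI', 'M'))  # ('O', 'S', 0)
print(remote_read('MOESI', 'O'))  # ('O', 'S', 0)
```

Under MOESI the write-back is deferred until the owner evicts the line, so repeated reads by other processors cost no memory writes.&lt;br /&gt;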
&lt;br /&gt;
&amp;lt;br /&amp;gt;&lt;br /&gt;
&amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Optimization techniques on MOESI===&lt;br /&gt;
&lt;br /&gt;
In real machines, applying optimization techniques to the standard cache coherence protocol can improve performance. For example, the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] family of microprocessors (Family 10h), which is AMD’s first generation to incorporate 4 distinct cores on a single die and the first to have a cache that all the cores share, uses the MOESI protocol with some optimization techniques incorporated. &lt;br /&gt;
&lt;br /&gt;
The optimization focuses on a small subset of compute problems that behave like producer-consumer programs. In such a problem, a thread running on one core produces data, which is consumed by a thread running on a separate core. With such programs, it is desirable to get the two cores to communicate through the shared cache, to avoid round trips to and from main memory. The '''MOESI''' protocol that the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] cache uses for cache coherence can also limit bandwidth in this pattern. Hence, by keeping the cache line in the '''‘M’''' state for such computing problems, we can achieve better performance.&lt;br /&gt;
&lt;br /&gt;
When the producer thread writes a new entry to the ring buffer it shares with the consumer, it allocates cache lines in the '''modified (M)''' state. Eventually, these M-marked cache lines start to fill the L3 cache. When the consumer reads a cache line, the MOESI protocol changes the state of that line to '''owned (O)''' in the L3 cache and pulls down a '''shared (S)''' copy for the consumer's own use. The producer thread then circles the ring buffer and arrives back at the cache line it had previously written. However, when the producer attempts to write new data to the owned (marked '''‘O’''') cache line, it finds that it cannot, since a cache line marked '''‘O’''' by the previous consumer read does not have sufficient permission for a write request (in the MOESI protocol). To maintain coherence, the memory controller must initiate probes in the other caches (to handle any other S copies that may exist), which slows down the process.&lt;br /&gt;
&lt;br /&gt;
Thus, it is preferable to keep the cache line in the '''‘M’''' state in the L3 cache. In such a situation, when the producer comes back around the ring buffer, it finds the previously written cache line still marked '''‘M’''', to which it is safe to write without coherence concerns. Thus better performance can be achieved by such optimization techniques to standard protocols when implemented in real machines.&lt;br /&gt;
&lt;br /&gt;
More information on how this is implemented, and on various other optimizations, can be found in the manual [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors].&lt;br /&gt;
&lt;br /&gt;
==Dragon Protocol==&lt;br /&gt;
The '''[http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]''' is an update-based coherence protocol: unlike the coherence protocols we have seen so far, it does not invalidate other cached copies. Write propagation is achieved by updating the cached copies instead of invalidating them.  The Dragon Protocol does not update memory on a cache-to-cache transfer; it delays making memory consistent with the cache until the data is evicted and written back, which saves time and lowers the memory access requirements. Moreover, only the written '''byte''' or '''word''' is '''communicated''' to the '''other caches''' instead of the whole block, which further '''reduces''' the '''bandwidth''' usage. It can dynamically detect the sharing status of a block and use a write-through policy for shared blocks and a write-back policy for currently non-shared blocks. The Dragon Protocol employs the following four states for cache blocks: '''Shared Clean''', '''Shared Modified''', '''Exclusive''' and  '''Modified'''. &lt;br /&gt;
* '''Modified (M)''' and '''Exclusive (E)''' - these states have the same meaning as explained in the protocols above. &lt;br /&gt;
* '''Shared Modified (Sm)''' - For a given block, only one cache in the system can hold it in the Shared Modified state. Potentially two or more caches have this block, memory may or may not be up to date, and this processor's cache has modified the block.&lt;br /&gt;
* '''Shared Clean (Sc)''' -  Potentially two or more caches have this block, and memory may or may not be up to date (if no other cache has it in the Sm state, memory is up to date; otherwise it is not).&lt;br /&gt;
Only when a Shared Modified line is evicted from the cache on a cache miss is the block written back to main memory, in order to keep memory consistent. For more information on the Dragon protocol, refer to the Solihin textbook, page 229. The state transition diagram is given below for reference.[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
[[File:DragonNew.jpg|500px|thumb|center|Dragon state transition diagram]] &lt;br /&gt;
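&lt;br /&gt;
The update-based behavior of a write hit can be sketched for a single word. The function dragon_write, the per-cache dictionaries, and the bus action name BusUpd used as a return value are assumptions for illustration; main memory is deliberately not modeled, since the protocol leaves it stale until eviction.&lt;br /&gt;

```python
# Hypothetical Dragon sketch: a write hit on one word of a cached block.
# caches maps a cache name to a dict with the block state and the word value;
# only caches that currently hold the block appear in it.

def dragon_write(caches, writer, value):
    line = caches[writer]
    line['value'] = value
    if line['state'] in ('E', 'M'):
        line['state'] = 'M'  # no other copy exists, so no bus traffic
        return 'none'
    others = [name for name in caches if name != writer]
    for name in others:
        caches[name]['value'] = value  # update the copy instead of invalidating
        caches[name]['state'] = 'Sc'
    # The writer becomes responsible for the dirty block.
    line['state'] = 'Sm' if others else 'M'
    return 'BusUpd'


caches = dict(P0=dict(state='Sc', value=1), P1=dict(state='Sc', value=1))
print(dragon_write(caches, 'P0', 7))  # BusUpd
print(caches['P0']['state'], caches['P1']['state'], caches['P1']['value'])  # Sm Sc 7
```

P1 keeps a valid, updated copy after the write, whereas an invalidation protocol would have forced it to miss on its next read.&lt;br /&gt;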
&lt;br /&gt;
The Dragon protocol implements snoopy caches that provide the appearance of a uniform memory space to multiple processors. Here, each cache listens to 2 buses: the processor bus and the memory bus. The caches are also responsible for address translation, so the processor bus carries virtual addresses and the memory bus carries physical addresses.  The Dragon system was designed to support 4 to 8 Dragon processors.&lt;br /&gt;
&lt;br /&gt;
=[http://en.wikipedia.org/wiki/Instruction_prefetch Instruction Prefetching]=&lt;br /&gt;
&lt;br /&gt;
In standard [http://en.wikipedia.org/wiki/Pipeline_(computing) CPU pipelining], instruction prefetching is an optimization used to minimize the time the CPU spends waiting on instructions being fetched from main memory.  Prefetching works perfectly on completely linear programs; but on programs with branches, other optimizations (like [http://en.wikipedia.org/wiki/Branch_predictor branch prediction]) have to be implemented, and the cost of taking the wrong branch has to be considered.  Source:  http://dictionary.reference.com/browse/instruction+prefetch&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
Instruction prefetching is a technique used to speed up the execution of a program. In multiprocessor systems, however, prefetching comes at a cost: because of prefetching, data can be modified in such a way that the memory coherence protocol is not able to handle the effects. In such situations, software must use serializing instructions or cache-invalidation instructions to guarantee that subsequent data accesses are coherent. &lt;br /&gt;
&amp;lt;br /&amp;gt;&lt;br /&gt;
An example of this type of situation is a page-table update followed by accesses to the physical pages referenced by the updated page tables. The physical-memory references for the page tables are different from the physical-memory references for the data. Because of prefetching, there may be problems with correctness. The following sequence of events shows such a situation when software changes the translation of virtual-page A from physical-page M to physical-page N:&lt;br /&gt;
# The tables that translate virtual-page A to physical-page M are now held only in main memory. The copies in the cache are invalidated.&lt;br /&gt;
# Software changes the page-table entry for virtual-page A in main memory to point to physical-page N rather than physical-page M.&lt;br /&gt;
# Data in virtual-page A is accessed.&lt;br /&gt;
&lt;br /&gt;
Software expects the processor to access the data from physical-page N after the update. However, it is possible for the processor to prefetch the data from physical-page M before the page table for virtual page A is updated. Because the physical-memory references are different, the processor does not recognize them as requiring coherence checking and believes it is safe to prefetch the data from virtual-page A, which is translated into a read from physical page M. Similar behavior can occur when instructions are prefetched from beyond the page-table update instruction.&lt;br /&gt;
&lt;br /&gt;
To prevent such errors, software must execute special instructions provided by the architecture (on AMD64, INVLPG or a MOV to CR3) immediately after the page-table update, to ensure that subsequent instruction fetches and data accesses use the correct virtual-page-to-physical-page translation. It is not necessary to perform a TLB invalidation operation preceding the table update.&lt;br /&gt;
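&lt;br /&gt;
The hazard in the sequence above can be mimicked with a toy model in which a dictionary stands in for translations that the processor has already cached or prefetched. Everything here is hypothetical (the names page_table, cached, translate and invalidate are illustrative), and the model says nothing about how real hardware stores translations.&lt;br /&gt;

```python
# Toy model: a stale translation survives a page-table update until software
# explicitly invalidates it. All names here are illustrative assumptions.

page_table = dict(A='M')  # virtual-page A maps to physical-page M
cached = dict()           # translations the processor already holds


def translate(vpage):
    # Reuse a held translation without re-reading the page table.
    if vpage not in cached:
        cached[vpage] = page_table[vpage]
    return cached[vpage]


def invalidate(vpage):
    # Stands in for the invalidating instruction run after the update.
    cached.pop(vpage, None)


translate('A')          # the A to M translation is now held by the processor
page_table['A'] = 'N'   # software updates the page table in memory
stale = translate('A')  # still uses physical-page M
invalidate('A')
fresh = translate('A')  # now uses physical-page N
print(stale, fresh)     # M N
```

The access made before the invalidation still goes to physical-page M, which is the incorrect behavior that the instruction executed after the page-table update prevents.&lt;br /&gt;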
&lt;br /&gt;
More information can be found about this in [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual]&lt;br /&gt;
&lt;br /&gt;
= CMP Implementation in Intel Architecture =&lt;br /&gt;
&lt;br /&gt;
Let us now see how Intel architecture using the MESI protocol progressed from a uniprocessor architecture to a Chip MultiProcessor (CMP) using the bus as the interconnect. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Uniprocessor Architecture'''&lt;br /&gt;
&lt;br /&gt;
The diagram below shows the structure of the memory cluster in the Intel Pentium M processor.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:intel_cache1.jpg]]&amp;lt;/center&amp;gt; &amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In this structure we have,&lt;br /&gt;
* A unified on-chip '''L1 cache''' with the '''processor/core''',&lt;br /&gt;
* A '''Memory/L2 access control unit''', through which all the accesses to the L2 cache, main memory and IO space are made,&lt;br /&gt;
* The second level '''L2 cache''' along with the '''prefetch unit''' and&lt;br /&gt;
* '''Front side bus (FSB)''', a single shared bi-directional bus that carries all the traffic. This wide bus brings in multiple data bytes at a time. &lt;br /&gt;
&lt;br /&gt;
As Intel explains it, in this structure processor requests are first looked up in the '''L2 cache''' and only on a '''miss''' are they '''forwarded''' to main '''memory''' via the front side bus ('''FSB'''). The '''Memory/L2 access control''' unit serves as a central point for '''maintaining coherence''' within the core and with the external world. It '''contains''' a '''snoop control unit''' that receives snoop requests from the bus and performs the required operations on each cache (and internal buffers) in parallel. It also handles RFO requests (BusUpgr) and lets the operation continue only after it guarantees that no other version of the cache line exists in any other cache in the system.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''CMP Architecture'''&lt;br /&gt;
&lt;br /&gt;
For the CMP implementation, Intel chose a bus-based architecture using snoopy protocols over a '''directory protocol'''. Although a directory protocol reduces active power because of reduced snoop activity, it '''increases''' the '''design complexity''' and the '''static power''' because of its larger tag arrays. Since Intel has a large market for processors in the mobility family, a directory-based solution was less favorable: battery life depends mainly on static power consumption and less on dynamic power.&lt;br /&gt;
Let us examine how '''CMP''' was implemented in the '''Intel Core Duo''', which was one of the first dual-core processors for the budget/entry-level market. &lt;br /&gt;
The general CMP implementation structure of the Intel Core Duo is shown below.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:intel_cache2.jpg]]&amp;lt;/center&amp;gt; &amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This structure has the following changes when compared to the uniprocessor memory cluster structure. &lt;br /&gt;
* '''L1 cache''' and the '''processor/core''' structure is '''duplicated''' to give 2 cores.&lt;br /&gt;
* The '''Memory/L2 access control''' unit is '''split''' into 2 logical units: '''L2 controller''' and '''bus controller'''. The L2 controller handles all '''requests to the L2''' cache from the core and the snoop requests from the FSB. The '''bus controller''' handles '''data and I/O requests''' to and from the FSB.&lt;br /&gt;
* The '''prefetching''' unit is extended to handle the hardware '''prefetches for each core separately'''.&lt;br /&gt;
* A '''new logical unit''' (represented by the hexagon) was added to maintain '''fairness between the requests''' coming from the different cores and hence balance the requests to L2 and memory.&lt;br /&gt;
&lt;br /&gt;
This new '''partitioned structure''' for the  memory/L2 access control unit '''enhanced''' the '''performance''' while '''reducing power consumption'''. &lt;br /&gt;
For more information on uniprocessor and multiprocessor implementation under the Intel architecture, refer to &lt;br /&gt;
[http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]&lt;br /&gt;
&lt;br /&gt;
The '''Intel bus architecture''' has been '''evolving''' to accommodate the demands of scalability while using the same MESI protocol: from a '''single shared bus''', to '''dual independent buses (DIB)''' that double the available bandwidth, and on to the logical conclusion of DIB, '''dedicated high-speed interconnects (DHSI)'''. The DHSI-based platforms use four FSBs, one for each processor in the platform. In both DIB and DHSI, a snoop filter in the chipset caches snoop information, thereby significantly reducing the broadcasting needed for the snoop traffic on the buses. With the production of processors based on the next-generation 45-nm Hi-k Intel Core microarchitecture, the [http://en.wikipedia.org/wiki/Xeon Intel Xeon] processor fabric will transition from DHSI, with the memory controller in the chipset, to a distributed shared memory architecture using '''Intel QuickPath Interconnects with the MESIF protocol'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Word Invalidation=&lt;br /&gt;
One complexity problem applying to a number of the protocols deals with invalidation.  In newer protocols, individual words may be modified in a cache line, as opposed to the entirety of the line.  The other processors will thus have a mostly correct cache line, with only a word difference.  This leads to potential complexity, because to 'correct' the error the other processors must transition from shared to invalid, where they then can read the correct word from the bus and place it back into the line, transitioning back into shared.  This transition is potentially unnecessary, as the second processor may never access the specific word changed, but it may access other words in the cache line.  One potential solution to this is being researched at the present time, advancing the protocols so that a cache line is &amp;quot;not invalidated on the first dirty word, but after the number of dirty words crosses some predetermined value, which is data type and application dependent.&amp;quot;  In other words, if the application can possibly have multiple words 'incorrect,' several transitions to and from the invalid state may be avoided.&lt;br /&gt;
&amp;lt;br/&amp;gt;Source: http://tab.computer.org/tcca/NEWS/sept96/dsmideas.ps&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In essence, the solution proposed here is to advance the MOESI protocol with word invalidation and specific treatment of temporal and spatial data, so that the block is not invalidated.&lt;br /&gt;
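&lt;br /&gt;
The threshold idea quoted above can be illustrated with a small sketch. It is not taken from the cited source: the class name, the method names and the &amp;quot;limit&amp;quot; parameter are hypothetical, and the sketch assumes that a remote copy is invalidated once the number of distinct stale words reaches a positive, application-chosen limit.&lt;br /&gt;

```python
class SharedLine:
    # Hypothetical model of a remote copy of a cache line that tolerates
    # a few stale words before the whole line is invalidated.
    def __init__(self, limit):
        self.limit = limit          # assumed positive, application dependent
        self.stale_words = set()    # indices of words modified elsewhere
        self.state = "S"

    def remote_write(self, word_index):
        # Another processor modified one word of this line.
        if self.state == "I":
            return
        self.stale_words.add(word_index)
        if len(self.stale_words) == self.limit:
            # Too many stale words: fall back to invalidating the line.
            self.state = "I"
            self.stale_words.clear()

    def can_read(self, word_index):
        # A word is readable while the line is shared and the word is fresh.
        return self.state == "S" and word_index not in self.stale_words
```

With a limit of 1 the sketch behaves like a conventional invalidation protocol; larger limits keep the untouched words of the line readable for longer.&lt;br /&gt;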
&lt;br /&gt;
=Performance: MOESI vs MEI/MESI=&lt;br /&gt;
Early multiprocessors (such as the PowerPC processors) were designed to work with three states (Modified, Exclusive, Invalid - this is similar to the MSI protocol discussed earlier, the term MEI is used to remain consistent with the sources used for reference).  As time progressed, more multi-processors transitioned to the MESI protocol.  This is most likely due to the characteristics of MEI - it is &amp;quot;easy to implement but leads to inefficiencies in the way that memory bus bandwidth is used&amp;quot; [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 source].  As a result, systems with two or more MEI processors (most likely) will not see the 'full' potential of the extra processors, due to the inefficient bus transactions.&lt;br /&gt;
&lt;br /&gt;
In contrast, the five-state protocol MOESI is more complex to implement than MESI and MEI/MSI.  However, advantages arise from using it.  As Andy Keane (VP of Marketing for PMC-Sierra) put it, &amp;quot;this [fifth] state allows shared data that is dirty to remain in the cache.  Without this state, any shared line would be written back to memory to change the original state from modified. Since we have a dedicated, fast path from CPU to CPU, this state maximizes the use of this path rather than the path to memory, which is inherently slower&amp;quot; [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 source].  This advantage can be realized by allowing MOESI to make some processor-to-processor transfers from level 2 caches.&lt;br /&gt;
&lt;br /&gt;
Source: [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 Edwards, Chris (08/06/2001). &amp;quot;Architects wrestle with multiprocessor options&amp;quot;. Electronic engineering times (0192-1541), (1178), p. 48.]&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Cache_coherence Cache coherence]&lt;br /&gt;
# [http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]&lt;br /&gt;
# [http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Symmetric_multiprocessing Common System Interface in Intel Processors]&lt;br /&gt;
# [http://www.zak.ict.pwr.wroc.pl/nikodem/ak_materialy/Cache%20consistency%20&amp;amp;%20MESI.pdf Cache consistency with MESI on Intel processor]&lt;br /&gt;
# [http://techreport.com/articles.x/8236/2 AMD dual core Architecture]&lt;br /&gt;
# [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual]&lt;br /&gt;
# [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors]&lt;br /&gt;
# [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of AMD 64 bit core]&lt;br /&gt;
# [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&amp;amp;arnumber=4913 Silicon Graphics Computer Systems]&lt;br /&gt;
# [http://books.google.com/books?id=g82fofiqa5IC&amp;amp;printsec=frontcover&amp;amp;dq=Parallel+computer+architecture:+a+hardware/software+approach+By+David+E.+Culler,+Jaswinder+Pal+Singh,+Anoop+Gupta&amp;amp;source=bl&amp;amp;ots=COrdamlfVn&amp;amp;sig=YcugVqbzTjHvlofvaFq6Ft_tjfY&amp;amp;hl=en&amp;amp;ei=0ZO6S4TJGcOclgejzI3BBw&amp;amp;sa=X&amp;amp;oi=book_result&amp;amp;ct=result&amp;amp;resnum=1&amp;amp;ved=0CAgQ6AEwAA#v=onepage&amp;amp;q=&amp;amp;f=false Parallel computer architecture: a hardware/software approach By David E. Culler, Jaswinder Pal Singh, Anoop Gupta]&lt;br /&gt;
# [http://www.freepatentsonline.com/5283886.html Three state invalidation protocols]&lt;br /&gt;
# [http://portal.acm.org/citation.cfm?id=1499317&amp;amp;dl=GUIDE&amp;amp;coll=GUIDE&amp;amp;CFID=83027384&amp;amp;CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems]&lt;br /&gt;
# [http://portal.acm.org/citation.cfm?id=6514 Cache Coherence protocols: evaluation using a multiprocessor simulation model]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Xerox_Dragon Xerox Dragon]&lt;br /&gt;
# [http://thanaseto.110mb.com/courses/CSD-527-report-engl.pdf Coherence Protocols]&lt;br /&gt;
# [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&amp;amp;arnumber=289691 XDBus]&lt;br /&gt;
# [http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dai0228a/index.html ARM]&lt;br /&gt;
# [http://ixbtlabs.com/articles/ibmpower4/index.html IBM POWER4]&lt;br /&gt;
# Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems. Solihin, Yan&lt;br /&gt;
# [http://www.chiplist.com/Intel_Itanium_2_processor/tree3f-section--2235-/ Intel Itanium 2]&lt;br /&gt;
# [http://techreport.com/articles.x/8236/2 MESI-MESI-MOESI Banana-fana...]&lt;br /&gt;
# [http://rsim.cs.illinois.edu/rsim/Manual/node109.html RSIM]&lt;br /&gt;
# [http://dl.acm.org/citation.cfm?id=1499317 Synapse tightly coupled multiprocessors]&lt;/div&gt;</summary>
		<author><name>Pxu</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8a_fu&amp;diff=59829</id>
		<title>CSC/ECE 506 Spring 2012/8a fu</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8a_fu&amp;diff=59829"/>
		<updated>2012-03-19T01:46:45Z</updated>

		<summary type="html">&lt;p&gt;Pxu: /* MESI */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Introduction to bus-based cache coherence in real machines=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==SMP Architecture==&lt;br /&gt;
Most parallel software in the commercial market relies on the shared-memory programming model, in which all processors access the same physical address space. The most common multiprocessors today use the [http://en.wikipedia.org/wiki/Symmetric_multiprocessing SMP] architecture, which uses a common bus as the interconnect.  In the case of multicore processors (&amp;quot;chip multiprocessors,&amp;quot; or CMP), the SMP architecture applies to the cores, treating them as separate processors. The key problem of shared-memory multiprocessors is providing a consistent view of memory with various cache hierarchies.  This is called the '''''cache coherence problem'''''. Solving it is critical for correctness and is a performance-sensitive design point in supporting the shared-memory model. The cache coherence mechanisms not only govern communication in a shared-memory multiprocessor, but also typically determine how the memory system transfers data between processors, caches, and memory.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:Busbased SMP.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
At any point in logical time, the permissions for a cache block can allow either a single writer or multiple readers. The '''''coherence protocol''''' ensures the invariants of the states are maintained. The different coherent states used by most of the cache coherence protocols are as shown in ''Table 1'':&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''States'''&lt;br /&gt;
|  '''Access Type'''&lt;br /&gt;
|  '''Invariant'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  read, write&lt;br /&gt;
|  all other caches in I state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  read&lt;br /&gt;
|  all other caches in I state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Owned'''&lt;br /&gt;
|  read&lt;br /&gt;
|  all other caches in I or S state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  read&lt;br /&gt;
|  no other cache in M or E state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|  -&lt;br /&gt;
|  -&lt;br /&gt;
|}&lt;br /&gt;
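&lt;br /&gt;
The invariants in ''Table 1'' can be checked mechanically. The following is a minimal sketch, assuming one state letter per cache for a single block; the dictionary and function names are illustrative and do not come from any real coherence implementation.&lt;br /&gt;

```python
# For a cache holding the block in a given state, the set of states that
# every OTHER cache is allowed to be in (transcribed from Table 1).
ALLOWED_ELSEWHERE = {
    "M": {"I"},
    "E": {"I"},
    "O": {"I", "S"},
    "S": {"S", "O", "I"},            # no other cache in M or E
    "I": {"M", "O", "E", "S", "I"},  # an invalid copy constrains nobody
}


def is_coherent(states):
    # states: list with one state letter per cache, all for the same block.
    for i, mine in enumerate(states):
        for j, theirs in enumerate(states):
            if i != j and theirs not in ALLOWED_ELSEWHERE[mine]:
                return False
    return True
```

For example, one Modified copy next to Invalid copies passes the check, while a Modified copy next to a Shared copy violates the single-writer rule and fails it.&lt;br /&gt;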
&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Snooping Protocols=&lt;br /&gt;
==MSI Protocol==&lt;br /&gt;
&lt;br /&gt;
'''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' is a three-state write-back '''invalidation protocol''' and one of the earliest snooping-based cache-coherence protocols. It marks each cache line as '''Modified (M)''', '''Shared (S)''' or '''Invalid (I)'''. '''Invalid''' means the cache line is either not present or not valid. If the cache line is clean and may be shared by more than one processor, it is marked '''Shared'''. If the cache line is dirty and the processor has exclusive ownership of it, it is in the '''Modified''' state. A BusRdX causes other caches to invalidate (demote) their copies to the '''I''' state; a cache that holds the line in '''M''' state flushes it first. A processor write, even if it hits a line in '''S''' state, issues a BusRdX and promotes the line to '''M''' (upgrade).&lt;br /&gt;
&lt;br /&gt;
The state diagram of the MSI protocol is shown below. Note that the left part shows the response to processor-side requests, and the right part shows the response to snooper-side requests.[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
[[File:MSInew.jpg|600px|thumb|center|MSI state transition diagram]]&lt;br /&gt;
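&lt;br /&gt;
The MSI transitions described above can also be written as a lookup table. This is a sketch of the textbook protocol rather than of any particular machine; the event names PrRd and PrWr (processor read and write) are assumed here, while BusRd, BusRdX and Flush follow the terms used in this article.&lt;br /&gt;

```python
# (current state, observed event) maps to (next state, bus action or None).
MSI = {
    # processor-side requests
    ("I", "PrRd"): ("S", "BusRd"),
    ("I", "PrWr"): ("M", "BusRdX"),
    ("S", "PrRd"): ("S", None),
    ("S", "PrWr"): ("M", "BusRdX"),   # write hit in S still upgrades via bus
    ("M", "PrRd"): ("M", None),
    ("M", "PrWr"): ("M", None),
    # snooper-side requests
    ("I", "BusRd"): ("I", None),
    ("I", "BusRdX"): ("I", None),
    ("S", "BusRd"): ("S", None),
    ("S", "BusRdX"): ("I", None),
    ("M", "BusRd"): ("S", "Flush"),   # supply dirty data, keep a clean copy
    ("M", "BusRdX"): ("I", "Flush"),  # supply dirty data, give up the line
}


def step(state, event):
    # Look up the next state and the bus action for one event.
    return MSI[(state, event)]
```

Changing the single entry for a BusRd seen in the M state so that the next state is I instead of S gives the alternate choice discussed for the Synapse protocol in the next section.&lt;br /&gt;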
&lt;br /&gt;
==Synapse protocol==&lt;br /&gt;
From the state transition diagram of MSI, we observe that there is a transition from state '''M''' to state '''S''' when a BusRd is observed for that block. The contents of the block are flushed to the bus before going to the '''S''' state. In certain cases it would be more appropriate to move to the '''I''' state, thus giving up the block entirely. The choice between '''S''' and '''I''' reflects the designer's assertion about whether the original processor is more likely to continue reading the block than the new processor is to write to it. The Synapse protocol, used in the early Synapse multiprocessor, made the alternate choice of going directly from '''M''' to '''I''' on a BusRd, assuming the migratory pattern would be more frequent. More details about this protocol can be found in these papers published in the 1980s: [http://portal.acm.org/citation.cfm?id=6514 Cache Coherence protocols: evaluation using a multiprocessor simulation model] and [http://portal.acm.org/citation.cfm?id=1499317&amp;amp;dl=GUIDE&amp;amp;coll=GUIDE&amp;amp;CFID=83027384&amp;amp;CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems].&lt;br /&gt;
&lt;br /&gt;
In Synapse protocol '''M''' state is called '''D''' (Dirty) state. The following is the state transition diagram for Synapse protocol:&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:Synapse1.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
===Real Architectures using Synapse===&lt;br /&gt;
No information could be found on any actual architecture using the Synapse cache-coherence protocol.  Judging from the abstract of the 1984 paper by Steve Frank and Armond Inselberg, &amp;quot;[http://dl.acm.org/citation.cfm?id=1499317 Synapse Tightly Coupled Multiprocessors: A New Approach to Solve Old Problems]&amp;quot;, Synapse appears to have been theoretical.&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The [http://www.disi.unige.it/person/DelzannoG/CacheProtocol/synapse.hy BABYLON Project] provides one example of a Synapse cache coherence protocol with atomic synchronization actions.&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
In the Synapse system, memory contention is reduced by designing a processor cache that employs a non-write-through algorithm to minimize bandwidth between cache and shared memory. The Synapse Expansion Bus includes an ownership-level protocol between processor caches.[[#References|&amp;lt;sup&amp;gt;[24]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
==[http://en.wikipedia.org/wiki/MESI_protocol MESI]==&lt;br /&gt;
The main drawback of MSI is that each read-write sequence incurs two bus transactions irrespective of whether the cache line is stored in only one cache or not. Highly parallel programs that have little data sharing suffer the most from this.  MESI protocol solves this problem by introducing the '''Exclusive''' state to distinguish between a cache line stored in multiple caches and a line stored in a single cache.&lt;br /&gt;
Let us briefly see how the MESI protocol works. For a more detailed version, refer to the Solihin textbook, p. 215.&lt;br /&gt;
&lt;br /&gt;
The MESI coherence protocol marks each cache line as being in one of the Modified, Exclusive, Shared, or Invalid states. &lt;br /&gt;
* '''Invalid''' : The cache line is either not present or is invalid&lt;br /&gt;
* '''Exclusive''' : The cache line is clean and is owned by this core/processor only&lt;br /&gt;
* '''Modified''' : The cache line is dirty and the core/processor has exclusive ownership of it; the copy in memory is out of date.&lt;br /&gt;
* '''Shared''' : The cache line is clean and is shared by more than one core/processor&lt;br /&gt;
&lt;br /&gt;
The MESI protocol works as follows: &lt;br /&gt;
A line that is fetched receives the '''E''' or '''S''' state depending on whether it exists in other processors in the system. A cache line gets the '''M''' state when a processor writes to it; if the line is not in the '''E''' or '''M''' state prior to the write, the cache sends a Bus Upgrade (BusUpgr) signal or, as the Intel manuals term it, a “Read-For-Ownership (RFO) request”, which ensures that the line exists in the cache and is in the '''I''' state in all other processors on the bus (if any). The table below summarizes the MESI protocol.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''This cache line is valid?'''&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  No&lt;br /&gt;
|-&lt;br /&gt;
|  '''The memory copy is…'''&lt;br /&gt;
|  out of date&lt;br /&gt;
|  valid&lt;br /&gt;
|  valid&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies exist in caches of other processors?'''&lt;br /&gt;
|  No&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe&lt;br /&gt;
|  Maybe&lt;br /&gt;
|-&lt;br /&gt;
|  '''A write to this line'''&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  goes directly to bus&lt;br /&gt;
|}&lt;br /&gt;
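&lt;br /&gt;
The benefit of the Exclusive state can be made concrete by counting bus transactions for a read miss followed by a write to the same line. The sketch below is a simplified model under that single assumption, with an illustrative function name; it ignores replacements and snooped events.&lt;br /&gt;

```python
def bus_transactions(protocol, other_sharers):
    # Count bus transactions for one read miss followed by one write
    # to the same line by the same processor.
    count = 1                        # the read miss always goes to the bus
    if protocol == "MESI" and not other_sharers:
        state = "E"                  # sole cached copy: Exclusive
    else:
        state = "S"                  # MSI has no E state; shared lines are S
    if state == "S":
        count += 1                   # the write needs an upgrade (RFO)
    return count
```

Under this model MSI always pays two transactions, while MESI pays only one when no other cache holds the line, which is the case that matters for parallel programs with little data sharing.&lt;br /&gt;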
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
State transition diagram for MESI protocol is shown below.[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
[[File:MESInew.jpg|500px|thumb|center|MESI state transition diagram]] &lt;br /&gt;
&lt;br /&gt;
===Real Architectures using MESI===&lt;br /&gt;
&lt;br /&gt;
Intel's [http://en.wikipedia.org/wiki/Pentium_Pro Pentium Pro] microprocessor, introduced in 1995, was the first Intel architecture microprocessor to support symmetric multiprocessing in various multiprocessor configurations.  SMP with the MESI protocol was used consistently until the introduction of the 45-nm Hi-k Core microarchitecture in Intel's quad-core x86-64 (Nehalem-EP). The 45-nm Hi-k Intel Core microarchitecture uses a new system framework called the [http://en.wikipedia.org/wiki/Intel_QuickPath_Interconnect QuickPath Interconnect], a point-to-point interconnection technology based on a distributed shared memory architecture.  It uses a modified version of the MESI protocol called [http://en.wikipedia.org/wiki/MESIF_protocol MESIF], which introduces the additional Forward state.  (The state diagram of MESI transitions that occur within the Pentium's data cache can be found on page 63 of [http://books.google.com/books?id=TVzjEZg1--YC&amp;amp;printsec=frontcover&amp;amp;source=gbs_ge_summary_r&amp;amp;cad=0#v=onepage&amp;amp;q&amp;amp;f=false Pentium Processor System Architecture: Second Edition] by Don Anderson, Tom Shanley, MindShare, Inc.)&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
The [http://www.arm.com/products/processors/classic/arm11/arm11-mpcore.php ARM11 MPCore] and [http://en.wikipedia.org/wiki/ARM_Cortex-A9_MPCore Cortex-A9 MPCore] processors support the MESI cache coherency protocol.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]  ARM MPCore defines the states of the MESI protocol it implements as:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies in other caches'''&lt;br /&gt;
|  NO&lt;br /&gt;
|  NO&lt;br /&gt;
|  YES&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Clean or Dirty'''&lt;br /&gt;
|  DIRTY&lt;br /&gt;
|  CLEAN&lt;br /&gt;
|  CLEAN&lt;br /&gt;
|  -&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
The coherency protocol is implemented and managed by the Snoop Control Unit (SCU) in the ARM MPCore, which monitors the traffic between the local L1 data caches and the next level of the memory hierarchy. At boot time, each core can choose whether or not to take part in the coherency domain.  Unless explicit system calls bind a task to a specific core ([http://en.wikipedia.org/wiki/Processor_affinity processor affinity]), there is a high chance that a task will at some point migrate to a different core, along with its data as it is used.  Task migration is not handled efficiently by a literal MESI implementation, so the ARM MPCore offers two optimizations that allow for both MESI compliance and task migration: '''Direct Data Intervention (DDI)''' (in which the SCU keeps a copy of every core's cache tag RAMs, which lets it efficiently detect whether a cache line requested by one core is held by another core in the coherency domain before looking for it in the next level of the memory hierarchy) and '''Cache-to-cache Migration''' (in which, if the SCU finds that the cache line requested by one CPU is present in another core, it either copies it (if clean) or moves it (if dirty) from the other CPU directly into the requesting one, without interacting with external memory).  These optimizations reduce memory traffic in and out of the L1 cache subsystem by eliminating interaction with external memories, which in effect reduces the overall load on the interconnect and the overall power consumption.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
Other architectures that use the MESI cache-coherence protocol include the L2 cache of the IBM POWER4 processor[[#References|&amp;lt;sup&amp;gt;[20]&amp;lt;/sup&amp;gt;]], the L2 cache of the Intel Itanium 2 processor[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]], and the Intel Xeon[[#References|&amp;lt;sup&amp;gt;[22]&amp;lt;/sup&amp;gt;]].&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
One implementation complexity already mentioned is the inefficiency of task migration in MESI that the ARM MPCore addresses.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]  Another possible implementation complexity is found during the replacement of a cache line: one possible MESI implementation requires a message to be sent to main memory when a cache line is flushed (i.e. an E to I transition), as the line was exclusively in one cache before it was removed. It is possible to avoid this replacement message if the system is designed so that the flush of a modified (exclusive) line requires an acknowledgment from main memory. However, this requires the flush to be stored in a 'write-back' buffer until the reply arrives (to ensure the change is successfully propagated to memory).[[#References|&amp;lt;sup&amp;gt;[23]&amp;lt;/sup&amp;gt;]]&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==MESIF==&lt;br /&gt;
Let us now briefly walk through the '''MESIF protocol''':&lt;br /&gt;
&lt;br /&gt;
The '''MESIF''' protocol, used in the latest Intel multi-core processors, was introduced to '''accommodate the point-to-point''' links used in the QuickPath Interconnect. Using the '''MESI''' protocol in this architecture would send many redundant messages between different processors, often with unnecessarily high latency. For example, when a processor requests a cache line that is stored in multiple locations, every location might respond with the data. As the requesting processor only needs a single copy of the data, the system would be wasting bandwidth. &lt;br /&gt;
As a solution to this problem, an additional state, '''Forward state''', was added by slightly changing the role of the Shared state. Whenever there is a read request, only the cache line in the F state will respond to the request, while all the S state caches remain dormant.  Hence, by designating a single cache line to '''respond to requests''', coherency traffic is substantially reduced when multiple copies of the data exist. Also, on a read request, the F state transitions from F to S state. That is, when a cache line in the '''F''' state is '''copied''', the F state '''migrates''' to the '''newer copy''', while the '''older''' one drops back to '''S'''. Moving the new copy to the F state '''exploits''' both '''temporal and spatial locality'''. Because the newest copy of the cache line is always in the F state, it is very unlikely that the line in the F state will be evicted from the caches. This takes advantage of the temporal locality of the request. The second advantage is that if a particular cache line is in high demand due to spatial locality, the bandwidth used to transmit that data will be spread across several nodes.&lt;br /&gt;
All M to S state transition and E to S state transitions will now be from '''M to F''' and '''E to F'''.  &lt;br /&gt;
The '''F state''' is '''different''' from the '''Owned state''' of the MOESI protocol as it is '''not''' a unique copy because a valid copy is stored in memory. Thus, unlike the Owned state of the MOESI protocol, in which the data in the O state is the only valid copy of the data, the data in the F state can be evicted or converted to the S state, if desired. &lt;br /&gt;
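&lt;br /&gt;
The migration of the Forward state on a read can be sketched as follows. This is only an illustration of the behaviour described above, not Intel's implementation: the function and cache names are hypothetical, and the sketch assumes the line is already shared with exactly one other cache holding it in F.&lt;br /&gt;

```python
def read_request(states, requester):
    # states maps a cache name to "F", "S" or "I" for one cache line.
    # Only the F holder answers a read; caches in S stay silent.
    forwarders = [c for c, s in states.items() if s == "F" and c != requester]
    if len(forwarders) != 1:
        raise ValueError("sketch assumes exactly one other cache holds F")
    forwarder = forwarders[0]
    states[forwarder] = "S"      # the older copy drops back to S
    states[requester] = "F"      # the newest copy becomes the forwarder
    return forwarder
```

Each read therefore produces a single data response, and the responsibility for answering later requests moves to the most recent reader.&lt;br /&gt;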
&lt;br /&gt;
More information on the QuickPath Interconnect and MESIF protocol can be found at&lt;br /&gt;
'''[http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]'''&lt;br /&gt;
&lt;br /&gt;
==MOESI==&lt;br /&gt;
[http://en.wikipedia.org/wiki/Opteron AMD Opteron] was AMD’s first-generation dual-core processor, with 2 distinct [http://en.wikipedia.org/wiki/Athlon_64 K8 cores] together on a single die.  Cache coherence poses bigger problems on such multiprocessors, and an appropriate coherence protocol was needed to address them. The [http://en.wikipedia.org/wiki/Xeon Intel Xeon], the competing product from Intel, used the MESI protocol to handle cache coherence.  MESI has the drawback of using considerable time and bandwidth in certain situations. &lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/MOESI_protocol MOESI] was AMD’s answer to this problem. MOESI adds a fifth state to the MESI protocol called '''“Owned”'''. It addresses the bandwidth problem MESI faces when a processor that has invalid data in its cache wants to modify the data: the processor seeking access has to wait for the processor that modified the data to write it back to main memory, which takes time and bandwidth. MOESI removes this drawback by allowing dirty sharing.  When a processor holds the data in the new '''“Owned”''' state, it can provide the modified data to other processors without, or even before, writing it to main memory. This is called '''''dirty sharing'''''. The processor holding the data in the '''&amp;quot;Owned&amp;quot;''' state remains responsible for updating main memory later, when the cache line is evicted.&lt;br /&gt;
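&lt;br /&gt;
Dirty sharing can be contrasted with MESI in a short sketch. The class and function names are illustrative, memory is modelled as a plain dictionary, and only the case of a remote read hitting a Modified line is covered.&lt;br /&gt;

```python
class Line:
    # One cache line that starts out dirty (Modified) in its cache.
    def __init__(self, value):
        self.state = "M"
        self.value = value


def snoop_read(line, memory, protocol):
    # Another processor reads a line that this cache holds in M.
    if protocol == "MESI":
        memory["value"] = line.value   # write back before sharing
        line.state = "S"
    else:
        # MOESI: keep the dirty data and leave memory stale for now.
        line.state = "O"
    return line.value                  # data supplied to the reader


def evict(line, memory):
    # The owner must update memory before giving up a dirty line.
    if line.state in ("M", "O"):
        memory["value"] = line.value
    line.state = "I"
```

Under MOESI the write to memory is deferred until eviction, so repeated sharing of a frequently modified line uses the cache-to-cache path rather than the slower path to memory.&lt;br /&gt;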
&lt;br /&gt;
MOESI has become one of the most popular snoop-based protocols supported in the AMD64 architecture.  The AMD dual-core Opteron can maintain cache coherence in systems up to 8 processors using this protocol.&lt;br /&gt;
&lt;br /&gt;
The five different states of the MOESI protocol are:&lt;br /&gt;
* '''Modified (M)''' : The most recent copy of the data is present in the cache line. But it is not present in any other processor cache.&lt;br /&gt;
* '''Owned (O)'''   : The cache line has the most recent, correct copy of the data. This can be shared by other processors. The processor holding the line in this state is responsible for updating the correct value in main memory before the line gets evicted.  &lt;br /&gt;
* '''Exclusive (E)''' : A cache line holds the most recent, correct copy of the data, which is exclusively present on this processor and a copy is present in the main memory.  &lt;br /&gt;
* '''Shared (S)''' : A cache line in the shared state holds the most recent, correct copy of the data, which may be shared by other processors. &lt;br /&gt;
* '''Invalid (I)''' : A cache line does not hold a valid copy of the data.&lt;br /&gt;
&lt;br /&gt;
A detailed explanation of how this protocol is implemented on AMD processors can be found in [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of the AMD 64-bit core].&lt;br /&gt;
&lt;br /&gt;
The following table summarizes the MOESI protocol:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
&lt;br /&gt;
|  '''Owned''' &lt;br /&gt;
&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|  '''This cache line is valid?'''&lt;br /&gt;
&lt;br /&gt;
|  Yes&lt;br /&gt;
&lt;br /&gt;
|  Yes&lt;br /&gt;
&lt;br /&gt;
|  Yes&lt;br /&gt;
&lt;br /&gt;
|  Yes&lt;br /&gt;
&lt;br /&gt;
|  No&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|  '''The memory copy is…'''&lt;br /&gt;
&lt;br /&gt;
|  out of date&lt;br /&gt;
&lt;br /&gt;
|  out of date&lt;br /&gt;
&lt;br /&gt;
|  valid&lt;br /&gt;
&lt;br /&gt;
|  valid&lt;br /&gt;
&lt;br /&gt;
|  -&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|  '''Copies exist in caches of other processors?'''&lt;br /&gt;
&lt;br /&gt;
|  No&lt;br /&gt;
&lt;br /&gt;
|  Maybe&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe&lt;br /&gt;
&lt;br /&gt;
|  Maybe&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|  '''A write to this line'''&lt;br /&gt;
&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
&lt;br /&gt;
|  goes directly to bus&lt;br /&gt;
&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The state transition diagram for the MOESI protocol is shown below. The top part shows the response to processor-side requests and the bottom part shows the response to snooper-side requests. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:MOESI.jpg|500px|thumb|center|MOESI state transition diagram]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br /&amp;gt;&lt;br /&gt;
&amp;lt;br /&amp;gt;&lt;br /&gt;
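The difference between MESI and MOESI when another cache's read is snooped can be sketched in a few lines of Python. This is only an illustrative model under simplifying assumptions, not AMD's implementation: the function name ''snoop_bus_read'' and its return convention (next state, whether main memory is updated) are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
def snoop_bus_read(protocol, state):
    """Return (next_state, memory_updated) when another cache's read is snooped.

    Hypothetical helper: models only the snooped-read case for one cache line.
    """
    if protocol not in ("MESI", "MOESI"):
        raise ValueError("unknown protocol: " + protocol)
    if state == "O" and protocol == "MESI":
        raise ValueError("MESI has no Owned state")
    if state in ("M", "O"):
        if protocol == "MOESI":
            # Dirty sharing: supply the data cache-to-cache, keep ownership.
            return "O", False
        # MESI: the dirty line is flushed to memory before it is shared.
        return "S", True
    if state == "E":
        return "S", False      # clean line, memory is already valid
    return state, False        # S and I are unaffected by a snooped read


print(snoop_bus_read("MESI", "M"))   # ('S', True)
print(snoop_bus_read("MOESI", "M"))  # ('O', False)
```
&lt;br /&gt;
In this sketch the memory write on a snooped read of a dirty line appears only in the MESI case, which is the bandwidth cost that the Owned state is meant to avoid.&lt;br /&gt;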
&lt;br /&gt;
===Optimization techniques on MOESI===&lt;br /&gt;
&lt;br /&gt;
In real machines, applying optimizations to the standard cache coherence protocol can improve performance. For example, the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] family of microprocessors (Family 0x10), AMD’s first generation to incorporate four distinct cores on a single die and the first to have a cache shared by all the cores, uses the MOESI protocol with some optimizations. &lt;br /&gt;
&lt;br /&gt;
The optimization targets a small subset of compute problems that behave like producer-consumer programs. In such a problem, a thread running on one core produces data that is consumed by a thread running on a separate core. For such programs it is desirable for the two cores to communicate through the shared cache, avoiding round trips to and from main memory. However, the '''MOESI''' protocol that the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] cache uses for coherence can limit bandwidth in this access pattern. Keeping the cache line in the '''‘M’''' state for such problems gives better performance.&lt;br /&gt;
&lt;br /&gt;
When the producer thread writes a new entry, it allocates cache lines in the '''modified (M)''' state. Eventually, these M-marked cache lines start to fill the L3 cache. When the consumer reads a cache line, the MOESI protocol changes the state of the line to '''owned (O)''' in the L3 cache and pulls down a '''shared (S)''' copy for the consumer's own use. The producer thread then circles the ring buffer and arrives back at the cache line it previously wrote. When the producer attempts to write new data to this line, it cannot do so directly, because a line marked '''‘O’''' by the previous consumer read does not carry sufficient permission for a write request in the MOESI protocol. To maintain coherence, the memory controller must initiate probes in the other caches to handle any S copies that may exist, which slows down the process.&lt;br /&gt;
&lt;br /&gt;
Thus, it is preferable to keep the cache line in the '''‘M’''' state in the L3 cache. When the producer comes back around the ring buffer, it then finds the previously written cache line still marked '''‘M’''', to which it is safe to write without coherence concerns. Optimizations of this kind to the standard protocols can therefore yield better performance in real machines.&lt;br /&gt;
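&lt;br /&gt;
The effect described above can be illustrated with a toy count of the producer writes that need coherence probes. This is a deliberately simplified sketch, not a model of the Phenom hardware: the function ''count_probes'' and its parameters are hypothetical, and the first write to each line is also counted as needing a probe.&lt;br /&gt;
&lt;br /&gt;
```python
def count_probes(laps, slots, keep_modified):
    """Count producer writes that need probes to other caches (toy model)."""
    l3_state = ["I"] * slots  # state of each ring-buffer line in the L3 cache
    probes = 0
    for _ in range(laps):
        for slot in range(slots):
            # Producer write: only a line already in M can be written silently.
            if l3_state[slot] != "M":
                probes += 1
            l3_state[slot] = "M"
            # Consumer read: standard MOESI demotes the L3 copy from M to O;
            # the optimized policy leaves the L3 copy in M.
            if not keep_modified:
                l3_state[slot] = "O"
    return probes


print(count_probes(3, 4, keep_modified=False))  # 12
print(count_probes(3, 4, keep_modified=True))   # 4
```
&lt;br /&gt;
With three laps over four slots, every write needs a probe under the standard behavior, while keeping the line in M leaves only the four initial writes.&lt;br /&gt;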
&lt;br /&gt;
More information on how this is implemented, and on other optimizations, can be found in the [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors].&lt;br /&gt;
&lt;br /&gt;
==Dragon Protocol==&lt;br /&gt;
The '''[http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]''' is an update-based coherence protocol: unlike the coherence protocols seen so far, it does not invalidate other cached copies. Write propagation is achieved by updating the cached copies instead of invalidating them.  The Dragon Protocol does not update memory on a cache-to-cache transfer; it defers making memory consistent with the cache until the block is evicted and written back, which saves time and reduces memory accesses. Moreover, only the written '''byte''' or '''word''' is '''communicated''' to the '''other caches''' instead of the whole block, which further '''reduces''' '''bandwidth''' usage. The protocol dynamically detects the sharing status of a block, propagating writes to shared blocks immediately (in the manner of write-through) and using write-back for blocks that are currently not shared. The Dragon Protocol employs the following four states for cache blocks: '''Shared Clean''', '''Shared Modified''', '''Exclusive''' and '''Modified'''. &lt;br /&gt;
* '''Modified (M)''' and '''Exclusive (E)''' - these states have the same meaning as explained in the protocols above. &lt;br /&gt;
* '''Shared Modified (Sm)''' - Potentially two or more caches have this block, memory is not up to date, and this processor's cache has modified the block. Only one cache in the system can hold the block in the Shared Modified state.&lt;br /&gt;
* '''Shared Clean (Sc)''' -  Potentially two or more caches have this block and memory may or may not be up to date (if no other cache has it in Sm state, memory will be up to date else it is not).&lt;br /&gt;
A Shared Modified block is written back to main memory only when it is evicted from the cache (for example, on a cache miss that replaces it), which keeps memory consistent. For more information on the Dragon protocol, refer to the Solihin textbook, page 229. The state transition diagram is given below for reference.&lt;br /&gt;
&lt;br /&gt;
[[File:DragonNew.jpg|500px|thumb|center|Dragon state transition diagram]] &lt;br /&gt;
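&lt;br /&gt;
The write-hit and update behavior of the four states can be sketched as follows. This is a minimal sketch following the textbook description of the protocol; the function names, the ''BusUpd'' label for the update transaction, and the ''shared_line_asserted'' flag (whether any other cache reports holding the block) are assumptions made for illustration.&lt;br /&gt;
&lt;br /&gt;
```python
def dragon_write(state, shared_line_asserted):
    """Return (next_state, bus_transaction) for a processor write hit."""
    if state in ("E", "M"):
        return "M", None  # sole copy: write-back behavior, no bus traffic
    if state in ("Sc", "Sm"):
        # Broadcast only the written word; other caches update their copies.
        next_state = "Sm" if shared_line_asserted else "M"
        return next_state, "BusUpd"
    raise ValueError("unknown state: " + state)


def dragon_snoop_update(state):
    """A snooped BusUpd: the writer becomes the owner (Sm), so we hold Sc."""
    if state in ("Sc", "Sm"):
        return "Sc"
    raise ValueError("BusUpd should not be snooped in state " + state)


print(dragon_write("Sc", True))    # ('Sm', 'BusUpd')
print(dragon_write("Sm", False))   # ('M', 'BusUpd')
print(dragon_snoop_update("Sm"))   # Sc
```
&lt;br /&gt;
Note that no transition leads to an invalid state: a snooped update leaves the block readable in the other caches.&lt;br /&gt;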
&lt;br /&gt;
The Dragon protocol implements snoopy caches that provide the appearance of a uniform memory space to multiple processors. Each cache listens to two buses: the processor bus and the memory bus. The caches are also responsible for address translation, so the processor bus carries virtual addresses and the memory bus carries physical addresses.  The Dragon system was designed to support 4 to 8 Dragon processors.&lt;br /&gt;
&lt;br /&gt;
=[http://en.wikipedia.org/wiki/Instruction_prefetch Instruction Prefetching]=&lt;br /&gt;
&lt;br /&gt;
In standard [http://en.wikipedia.org/wiki/Pipeline_(computing) CPU pipelining], instruction prefetching is used to minimize the time the CPU spends waiting for instructions to be fetched from main memory.  Prefetching works well on straight-line code, but for programs with branches other optimizations (such as [http://en.wikipedia.org/wiki/Branch_predictor branch prediction]) have to be implemented, and the cost of fetching down the wrong branch has to be considered.  Source:  http://dictionary.reference.com/browse/instruction+prefetch&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Instruction prefetching is a technique used to speed up the execution of a program. In multiprocessors, however, prefetching comes at a cost: it can cause memory to be read before an update in a way that the memory coherence protocol cannot detect. In such situations, software must use serializing instructions or cache-invalidation instructions to guarantee that subsequent data accesses are coherent. &lt;br /&gt;
&amp;lt;br /&amp;gt;&lt;br /&gt;
An example of this type of a situation is a page-table update followed by accesses to the physical pages referenced by the updated page tables. The physical-memory references for the page tables are different than the physical-memory references for the data. Because of prefetching there may be problems with correctness. The following sequence of events shows such a situation when software changes the translation of virtual-page A from physical-page M to physical-page N:&lt;br /&gt;
# The tables that translate virtual-page A to physical-page M are now held only in main memory. The copies in the cache are invalidated.&lt;br /&gt;
# Software changes the page-table entry for virtual-page A in main memory to point to physical-page N rather than physical-page M.&lt;br /&gt;
# Data in virtual-page A is accessed.&lt;br /&gt;
&lt;br /&gt;
Software expects the processor to access the data from physical-page N after the update. However, it is possible for the processor to prefetch the data from physical-page M before the page table for virtual page A is updated. Because the physical-memory references are different, the processor does not recognize them as requiring coherence checking and believes it is safe to prefetch the data from virtual-page A, which is translated into a read from physical page M. Similar behavior can occur when instructions are prefetched from beyond the page-table update instruction.&lt;br /&gt;
&lt;br /&gt;
To prevent such errors, the processor provides special instructions that software executes immediately after the page-table update to ensure that subsequent instruction fetches and data accesses use the correct virtual-page-to-physical-page translation. It is not necessary to perform a TLB invalidation operation preceding the table update.&lt;br /&gt;
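&lt;br /&gt;
The hazard in the sequence above can be illustrated with a toy model in which a translation obtained early (standing in for a prefetch) is kept in a small cache of translations. Everything here is hypothetical and greatly simplified: the class ''ToyMmu'', its methods and the ''invalidate'' flag are illustrative names, not a description of any real processor interface.&lt;br /&gt;
&lt;br /&gt;
```python
class ToyMmu:
    """Toy model: a page table plus a cache of already-used translations."""

    def __init__(self, page_table):
        self.page_table = dict(page_table)  # virtual page -> physical page
        self.cached = {}                    # translations fetched earlier

    def translate(self, vpage):
        # A translation fetched once (for example by a prefetch) is reused.
        if vpage not in self.cached:
            self.cached[vpage] = self.page_table[vpage]
        return self.cached[vpage]

    def update_mapping(self, vpage, ppage, invalidate):
        self.page_table[vpage] = ppage
        if invalidate:
            # Stands in for the instruction executed right after the update.
            self.cached.pop(vpage, None)


mmu = ToyMmu({"A": "M"})
mmu.translate("A")                            # early access caches A -> M
mmu.update_mapping("A", "N", invalidate=False)
print(mmu.translate("A"))                     # M (stale translation)
mmu.update_mapping("A", "N", invalidate=True)
print(mmu.translate("A"))                     # N
```
&lt;br /&gt;
The stale result in the first case corresponds to the processor reading from physical-page M after software expected accesses to go to physical-page N.&lt;br /&gt;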
&lt;br /&gt;
More information about this can be found in the [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual].&lt;br /&gt;
&lt;br /&gt;
= CMP Implementation in Intel Architecture =&lt;br /&gt;
&lt;br /&gt;
Let us now see how Intel architecture using the MESI protocol progressed from a uniprocessor architecture to a Chip MultiProcessor (CMP) using the bus as the interconnect. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Uniprocessor Architecture'''&lt;br /&gt;
&lt;br /&gt;
The diagram below shows the structure of the memory cluster in the Intel Pentium M processor.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:intel_cache1.jpg]]&amp;lt;/center&amp;gt; &amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In this structure we have,&lt;br /&gt;
* A unified on-chip '''L1 cache''' with the '''processor/core''',&lt;br /&gt;
* A '''Memory/L2 access control unit''', through which all the accesses to the L2 cache, main memory and IO space are made,&lt;br /&gt;
* The second level '''L2 cache''' along with the '''prefetch unit''' and&lt;br /&gt;
* '''Front side bus (FSB)''', a single shared bi-directional bus through which all the traffic is sent. These wide buses bring in multiple data bytes at a time. &lt;br /&gt;
&lt;br /&gt;
As Intel explains it, with this structure processor requests were first sought in the '''L2 cache''' and only on a '''miss''' were they '''forwarded''' to main '''memory''' via the front side bus ('''FSB'''). The '''Memory/L2 access control''' unit served as a central point for '''maintaining coherence''' within the core and with the external world. It '''contains''' a '''snoop control unit''' that receives snoop requests from the bus and performs the required operations on each cache (and internal buffers) in parallel. It also handles RFO requests (BusUpgr) and ensures that the operation continues only after it guarantees that no other version of the cache line exists in any other cache in the system.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''CMP Architecture'''&lt;br /&gt;
&lt;br /&gt;
For the CMP implementation, Intel chose a bus-based architecture using snoopy protocols over a '''directory protocol''': although a directory protocol reduces active power because of reduced snoop activity, it '''increases''' '''design complexity''' and '''static power''' because of larger tag arrays. Since Intel has a large market for processors in the mobility family, a directory-based solution was less favorable, because battery life depends mainly on static power consumption and less on dynamic power.&lt;br /&gt;
Let us examine how '''CMP''' was implemented in the '''Intel Core Duo''', one of the first dual-core processors for the budget/entry-level market. &lt;br /&gt;
The general CMP implementation structure of the Intel Core Duo is shown below.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:intel_cache2.jpg]]&amp;lt;/center&amp;gt; &amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This structure has the following changes when compared to the uniprocessor memory cluster structure. &lt;br /&gt;
* '''L1 cache''' and the '''processor/core''' structure is '''duplicated''' to give 2 cores.&lt;br /&gt;
* The '''Memory/L2 access control''' unit is '''split''' into 2 logical units: '''L2 controller''' and '''bus controller'''. The L2 controller handles all '''requests to the L2''' cache from the core and the snoop requests from the FSB. The '''bus controller''' handles '''data and I/O requests''' to and from the FSB.&lt;br /&gt;
* The '''prefetching''' unit is extended to handle the hardware '''prefetches for each core separately'''.&lt;br /&gt;
* A '''new logical unit''' (represented by the hexagon) was added to maintain '''fairness between the requests''' coming from the different cores and hence balance the requests to L2 and memory.&lt;br /&gt;
&lt;br /&gt;
This new '''partitioned structure''' for the  memory/L2 access control unit '''enhanced''' the '''performance''' while '''reducing power consumption'''. &lt;br /&gt;
For more information on uniprocessor and multiprocessor implementation under the Intel architecture, refer to &lt;br /&gt;
[http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]&lt;br /&gt;
&lt;br /&gt;
The '''Intel bus architecture''' has been '''evolving''' to accommodate the demands of scalability while using the same MESI protocol: from a '''single shared bus''', to '''dual independent buses (DIB)''' that double the available bandwidth, and on to the logical conclusion of DIB, '''dedicated high-speed interconnects (DHSI)'''. The DHSI-based platforms use four FSBs, one for each processor in the platform. In both DIB and DHSI, a snoop filter in the chipset caches snoop information, significantly reducing the broadcasting needed for snoop traffic on the buses. With the production of processors based on the next-generation 45-nm Hi-k Intel Core microarchitecture, the [http://en.wikipedia.org/wiki/Xeon Intel Xeon] processor fabric will transition from DHSI, with the memory controller in the chipset, to a distributed shared memory architecture using '''Intel QuickPath Interconnects with the MESIF protocol'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Word Invalidation=&lt;br /&gt;
One complexity problem applying to a number of the protocols deals with invalidation.  In newer protocols, individual words may be modified in a cache line, as opposed to the entirety of the line.  The other processors will thus have a mostly correct cache line, with only a word difference.  This leads to potential complexity, because to 'correct' the error the other processors must transition from shared to invalid, where they then can read the correct word from the bus and place it back into the line, transitioning back into shared.  This transition is potentially unnecessary, as the second processor may never access the specific word changed, but it may access other words in the cache line.  One potential solution to this is being researched at the present time, advancing the protocols so that a cache line is &amp;quot;not invalidated on the first dirty word, but after the number of dirty words crosses some predetermined value, which is data type and application dependent.&amp;quot;  In other words, if the application can possibly have multiple words 'incorrect,' several transitions to and from the invalid state may be avoided.&lt;br /&gt;
&amp;lt;br/&amp;gt;Source: http://tab.computer.org/tcca/NEWS/sept96/dsmideas.ps&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In essence, the solution proposed here is to advance the MOESI protocol with word invalidation and specific treatment of temporal and spatial data, so that the block is not invalidated.&lt;br /&gt;
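&lt;br /&gt;
The threshold idea quoted above can be sketched as follows. This is only an illustration of the general idea, not the scheme from the cited source: the class ''ThresholdLine'', its methods and the fixed numeric threshold are hypothetical.&lt;br /&gt;
&lt;br /&gt;
```python
class ThresholdLine:
    """A cached line that tolerates a few remotely modified (stale) words."""

    def __init__(self, threshold):
        self.threshold = threshold  # stale words tolerated before invalidation
        self.stale_words = set()
        self.valid = True

    def remote_write(self, word_index):
        """Record that another processor modified one word of this line."""
        if not self.valid:
            return
        self.stale_words.add(word_index)
        if len(self.stale_words) > self.threshold:
            # Too many stale words: invalidate the whole line.
            self.valid = False
            self.stale_words.clear()

    def can_read_locally(self, word_index):
        """A word is readable locally if the line is valid and the word is fresh."""
        return self.valid and word_index not in self.stale_words


line = ThresholdLine(threshold=2)
line.remote_write(0)
print(line.can_read_locally(1))  # True: other words are still usable
print(line.can_read_locally(0))  # False: this word is stale
```
&lt;br /&gt;
With a threshold of 2, the line in this sketch is invalidated only when a third distinct word is modified remotely.&lt;br /&gt;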
&lt;br /&gt;
=Performance: MOESI vs MEI/MESI=&lt;br /&gt;
Early multiprocessors (such as the PowerPC processors) were designed to work with three states (Modified, Exclusive, Invalid - this is similar to the MSI protocol discussed earlier, the term MEI is used to remain consistent with the sources used for reference).  As time progressed, more multi-processors transitioned to the MESI protocol.  This is most likely due to the characteristics of MEI - it is &amp;quot;easy to implement but leads to inefficiencies in the way that memory bus bandwidth is used&amp;quot; [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 source].  As a result, systems with two or more MEI processors (most likely) will not see the 'full' potential of the extra processors, due to the inefficient bus transactions.&lt;br /&gt;
&lt;br /&gt;
In contrast, the five-state protocol MOESI is more complex to implement than MESI and MEI/MSI.  However, advantages arise from using it.  As Andy Keane (VP of Marketing for PMC-Sierra) put it, &amp;quot;this [fifth] state allows shared data that is dirty to remain in the cache.  Without this state, any shared line would be written back to memory to change the original state from modified. Since we have a dedicated, fast path from CPU to CPU, this state maximizes the use of this path rather than the path to memory, which is inherently slower&amp;quot; [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 source].  MOESI realizes this advantage by allowing some processor-to-processor transfers directly from the level 2 caches.&lt;br /&gt;
&lt;br /&gt;
Source: [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 Edwards, Chris (08/06/2001). &amp;quot;Architects wrestle with multiprocessor options&amp;quot;. Electronic engineering times (0192-1541), (1178), p. 48.]&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Cache_coherence Cache coherence]&lt;br /&gt;
# [http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]&lt;br /&gt;
# [http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Symmetric_multiprocessing Common System Interface in Intel Processors]&lt;br /&gt;
# [http://www.zak.ict.pwr.wroc.pl/nikodem/ak_materialy/Cache%20consistency%20&amp;amp;%20MESI.pdf Cache consistency with MESI on Intel processor]&lt;br /&gt;
# [http://techreport.com/articles.x/8236/2 AMD dual core Architecture]&lt;br /&gt;
# [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual]&lt;br /&gt;
# [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors]&lt;br /&gt;
# [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of AMD 64 bit core]&lt;br /&gt;
# [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&amp;amp;arnumber=4913 Silicon Graphics Computer Systems]&lt;br /&gt;
# [http://books.google.com/books?id=g82fofiqa5IC&amp;amp;printsec=frontcover&amp;amp;dq=Parallel+computer+architecture:+a+hardware/software+approach+By+David+E.+Culler,+Jaswinder+Pal+Singh,+Anoop+Gupta&amp;amp;source=bl&amp;amp;ots=COrdamlfVn&amp;amp;sig=YcugVqbzTjHvlofvaFq6Ft_tjfY&amp;amp;hl=en&amp;amp;ei=0ZO6S4TJGcOclgejzI3BBw&amp;amp;sa=X&amp;amp;oi=book_result&amp;amp;ct=result&amp;amp;resnum=1&amp;amp;ved=0CAgQ6AEwAA#v=onepage&amp;amp;q=&amp;amp;f=false Parallel computer architecture: a hardware/software approach By David E. Culler, Jaswinder Pal Singh, Anoop Gupta]&lt;br /&gt;
# [http://www.freepatentsonline.com/5283886.html Three state invalidation protocols]&lt;br /&gt;
# [http://portal.acm.org/citation.cfm?id=1499317&amp;amp;dl=GUIDE&amp;amp;coll=GUIDE&amp;amp;CFID=83027384&amp;amp;CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems]&lt;br /&gt;
# [http://portal.acm.org/citation.cfm?id=6514 Cache Coherence protocols: evaluation using a multiprocessor simulation model]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Xerox_Dragon Xerox Dragon]&lt;br /&gt;
# [http://thanaseto.110mb.com/courses/CSD-527-report-engl.pdf Coherence Protocols]&lt;br /&gt;
# [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&amp;amp;arnumber=289691 XDBus]&lt;br /&gt;
# [http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dai0228a/index.html ARM]&lt;br /&gt;
# [http://ixbtlabs.com/articles/ibmpower4/index.html IBM POWER4]&lt;br /&gt;
# Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems. Solihin, Yan&lt;br /&gt;
# [http://www.chiplist.com/Intel_Itanium_2_processor/tree3f-section--2235-/ Intel Itanium 2]&lt;br /&gt;
# [http://techreport.com/articles.x/8236/2 MESI-MESI-MOESI Banana-fana...]&lt;br /&gt;
# [http://rsim.cs.illinois.edu/rsim/Manual/node109.html RSIM]&lt;br /&gt;
# [http://dl.acm.org/citation.cfm?id=1499317 Synapse tightly coupled multiprocessors]&lt;/div&gt;</summary>
		<author><name>Pxu</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8a_fu&amp;diff=59828</id>
		<title>CSC/ECE 506 Spring 2012/8a fu</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8a_fu&amp;diff=59828"/>
		<updated>2012-03-19T01:46:03Z</updated>

		<summary type="html">&lt;p&gt;Pxu: /* MSI Protocol */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Introduction to bus-based cache coherence in real machines=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==SMP Architecture==&lt;br /&gt;
Most parallel software in the commercial market relies on the shared-memory programming model, in which all processors access the same physical address space. The most common multiprocessors today use the [http://en.wikipedia.org/wiki/Symmetric_multiprocessing SMP] architecture, which uses a common bus as the interconnect.  In the case of multicore processors (&amp;quot;chip multiprocessors,&amp;quot; or CMP), the SMP architecture applies to the cores, treating them as separate processors. The key problem of shared-memory multiprocessors is providing a consistent view of memory across the various cache hierarchies.  This is called the '''''cache coherence problem'''''. Solving it is critical for correctness, and it is a performance-sensitive design point in supporting the shared-memory model. The cache coherence mechanisms not only govern communication in a shared-memory multiprocessor, but also typically determine how the memory system transfers data between processors, caches, and memory.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:Busbased SMP.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
At any point in logical time, the permissions for a cache block allow either a single writer or multiple readers. The '''''coherence protocol''''' ensures that the invariants of the states are maintained. The coherence states used by most cache coherence protocols are shown in ''Table 1'':&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''States'''&lt;br /&gt;
|  '''Access Type'''&lt;br /&gt;
|  '''Invariant'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  read, write&lt;br /&gt;
|  all other caches in I state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  read&lt;br /&gt;
|  all other caches in I state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Owned'''&lt;br /&gt;
|  read&lt;br /&gt;
|  all other caches in I or S state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  read&lt;br /&gt;
|  no other cache in M or E state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|  -&lt;br /&gt;
|  -&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Snooping Protocols=&lt;br /&gt;
==MSI Protocol==&lt;br /&gt;
&lt;br /&gt;
'''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' is a three-state write-back '''invalidation protocol''' and one of the earliest snooping-based cache coherence protocols. It marks each cache line as '''Modified (M)''', '''Shared (S)''' or '''Invalid (I)'''. '''Invalid''' means the cache line is either not present or not valid. A cache line that is clean and may be shared by more than one processor is marked '''Shared'''. A cache line that is dirty, with the processor holding exclusive ownership of it, is in the '''Modified''' state. A BusRdx causes other caches to invalidate (demote) their copies to the '''I''' state; a cache holding the line in the '''M''' state flushes it first. A write to a line in the '''S''' state, even though it hits in the cache, issues a BusRdx and promotes the line to the '''M''' state (upgrade).&lt;br /&gt;
&lt;br /&gt;
The state diagram of the MSI protocol is shown below. Note that the left part shows the response to processor-side requests, and the right part shows the response to snooper-side requests.[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
[[File:MSInew.jpg|600px|thumb|center|MSI state transition diagram]]&lt;br /&gt;
&lt;br /&gt;
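The MSI transitions described in this section can be written as two lookup tables, one for processor-side requests and one for snooped bus transactions. This is a minimal sketch rather than a real controller: the table and function names are hypothetical, ''PrRd'' and ''PrWr'' denote processor reads and writes, and ''BusRdX'' is the read-exclusive transaction written BusRdx in the text.&lt;br /&gt;
&lt;br /&gt;
```python
# (state, processor request) -> (next state, bus transaction or None)
PROCESSOR_SIDE = {
    ("I", "PrRd"): ("S", "BusRd"),
    ("I", "PrWr"): ("M", "BusRdX"),
    ("S", "PrRd"): ("S", None),
    ("S", "PrWr"): ("M", "BusRdX"),  # a write hit in S still uses the bus
    ("M", "PrRd"): ("M", None),
    ("M", "PrWr"): ("M", None),
}

# (state, snooped transaction) -> (next state, action or None)
SNOOP_SIDE = {
    ("I", "BusRd"): ("I", None),
    ("I", "BusRdX"): ("I", None),
    ("S", "BusRd"): ("S", None),
    ("S", "BusRdX"): ("I", None),
    ("M", "BusRd"): ("S", "Flush"),   # supply the dirty block, keep a copy
    ("M", "BusRdX"): ("I", "Flush"),  # supply the dirty block, give it up
}


def on_processor(state, request):
    """Look up the response to a request from the local processor."""
    return PROCESSOR_SIDE[(state, request)]


def on_snoop(state, transaction):
    """Look up the response to a transaction observed on the bus."""
    return SNOOP_SIDE[(state, transaction)]


print(on_processor("S", "PrWr"))  # ('M', 'BusRdX')
print(on_snoop("M", "BusRd"))     # ('S', 'Flush')
```
&lt;br /&gt;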
==Synapse protocol==&lt;br /&gt;
From the state transition diagram of MSI, we observe a transition from state '''M''' to state '''S''' when a BusRd is observed for the block. The contents of the block are flushed to the bus before the transition to '''S'''. In certain cases it would be more appropriate to move to the '''I''' state, giving up the block entirely. The choice between '''S''' and '''I''' reflects the designer's judgment about whether the original processor is more likely to continue reading the block or the new processor is more likely to write to it. The Synapse protocol, used in the early Synapse multiprocessor, made the alternative choice of going directly from '''M''' to '''I''' on a BusRd, assuming the migratory pattern would be more frequent. More details about this protocol can be found in these papers published in the 1980s: [http://portal.acm.org/citation.cfm?id=6514 Cache Coherence protocols: evaluation using a multiprocessor simulation model] and [http://portal.acm.org/citation.cfm?id=1499317&amp;amp;dl=GUIDE&amp;amp;coll=GUIDE&amp;amp;CFID=83027384&amp;amp;CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems].&lt;br /&gt;
&lt;br /&gt;
In Synapse protocol '''M''' state is called '''D''' (Dirty) state. The following is the state transition diagram for Synapse protocol:&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:Synapse1.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
===Real Architectures using Synapse===&lt;br /&gt;
No actual architecture using the Synapse cache-coherence protocol could be found.  Judging from the abstract of the 1984 paper by Steve Frank and Armond Inselberg, &amp;quot;[http://dl.acm.org/citation.cfm?id=1499317 Synapse Tightly Coupled Multiprocessors: A New Approach to Solve Old Problems]&amp;quot;, Synapse was theoretical.&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The [http://www.disi.unige.it/person/DelzannoG/CacheProtocol/synapse.hy BABYLON Project] provides one example of a Synapse cache coherence protocol with atomic synchronization actions.&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
In the Synapse system, memory contention is reduced by designing the processor cache to use a non-write-through algorithm, which minimizes the bandwidth needed between cache and shared memory. The Synapse Expansion Bus includes an ownership-level protocol between processor caches.[[#References|&amp;lt;sup&amp;gt;[24]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
==[http://en.wikipedia.org/wiki/MESI_protocol MESI]==&lt;br /&gt;
The main drawback of MSI is that each read-write sequence incurs two bus transactions irrespective of whether the cache line is stored in only one cache or not. Highly parallel programs that have little data sharing suffer the most from this.  MESI protocol solves this problem by introducing the '''Exclusive''' state to distinguish between a cache line stored in multiple caches and a line stored in a single cache.&lt;br /&gt;
Let us briefly see how the MESI protocol works. For a more detailed version, refer to the Solihin textbook, p. 215.&lt;br /&gt;
&lt;br /&gt;
The MESI coherence protocol marks each cache line as being in one of the Modified, Exclusive, Shared, or Invalid states. &lt;br /&gt;
* '''Invalid''' : The cache line is either not present or is invalid&lt;br /&gt;
* '''Exclusive''' : The cache line is clean and is owned by this core/processor only&lt;br /&gt;
* '''Modified''' : The cache line is dirty and the core/processor has exclusive ownership of it; the copy in memory is out of date.&lt;br /&gt;
* '''Shared''' : The cache line is clean and is shared by more than one core/processor&lt;br /&gt;
&lt;br /&gt;
The MESI protocol works as follows: &lt;br /&gt;
A line that is fetched receives the '''E''' or '''S''' state depending on whether it exists in the caches of other processors in the system. A cache line gets the '''M''' state when a processor writes to it; if the line is not in the '''E''' or '''M''' state prior to the write, the cache sends a Bus Upgrade (BusUpgr) signal or, as the Intel manuals term it, a “Read-For-Ownership (RFO) request”, which ensures that the line exists in the cache and is in the '''I''' state in all other processors on the bus (if any). The table below summarizes the MESI protocol.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''This cache line is valid?'''&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  No&lt;br /&gt;
|-&lt;br /&gt;
|  '''The memory copy is…'''&lt;br /&gt;
|  out of date&lt;br /&gt;
|  valid&lt;br /&gt;
|  valid&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies exist in caches of other processors?'''&lt;br /&gt;
|  No&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe&lt;br /&gt;
|  Maybe&lt;br /&gt;
|-&lt;br /&gt;
|  '''A write to this line'''&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  goes directly to bus&lt;br /&gt;
|}&lt;br /&gt;
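&lt;br /&gt;
The transitions summarized above can be sketched as a small simulation. This is only an illustrative model under stated assumptions, not any vendor's implementation: the names Bus, Cache and State are hypothetical, a single cache line is tracked, and the transaction labels BusRd and BusRdX are textbook-style names assumed here alongside the BusUpgr signal mentioned in the text.&lt;br /&gt;

```python
from enum import Enum


class State(Enum):
    M = 'Modified'
    E = 'Exclusive'
    S = 'Shared'
    I = 'Invalid'


class Bus:
    """Toy snooping bus (hypothetical): every cache sees every transaction."""

    def __init__(self):
        self.caches = []
        self.log = []


class Cache:
    """Tracks the MESI state of one cache line in one processor's cache."""

    def __init__(self, name, bus):
        self.name, self.bus, self.state = name, bus, State.I
        bus.caches.append(self)

    def others(self):
        return [c for c in self.bus.caches if c is not self]

    def read(self):
        if self.state is State.I:                 # read miss goes to the bus
            self.bus.log.append((self.name, 'BusRd'))
            holders = [c for c in self.others() if c.state is not State.I]
            for c in holders:                     # snoopers drop to Shared
                c.state = State.S                 # (an M copy is flushed first)
            self.state = State.S if holders else State.E

    def write(self):
        if self.state in (State.M, State.E):      # no bus transaction needed
            self.state = State.M
            return
        kind = 'BusUpgr' if self.state is State.S else 'BusRdX'
        self.bus.log.append((self.name, kind))
        for c in self.others():                   # all other copies invalidated
            c.state = State.I
        self.state = State.M
```

In this sketch, a processor that reads a line nobody else holds and then writes it produces a single bus transaction, which is the saving that the Exclusive state is meant to provide over MSI.&lt;br /&gt;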
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The state transition diagram for the MESI protocol is shown below.&lt;br /&gt;
&lt;br /&gt;
[[File:MESInew.jpg|500px|thumb|center|MESI state transition diagram]] &lt;br /&gt;
&lt;br /&gt;
===Real Architectures using MESI===&lt;br /&gt;
&lt;br /&gt;
Intel's [http://en.wikipedia.org/wiki/Pentium_Pro Pentium Pro] microprocessor, introduced in 1995, was the first Intel architecture microprocessor to support symmetric multiprocessing in various multiprocessor configurations.  SMP with the MESI protocol was used consistently until the introduction of the 45-nm Hi-k Intel Core microarchitecture in Intel's quad-core x86-64 (Nehalem-EP) processors. This microarchitecture uses a new system framework called the [http://en.wikipedia.org/wiki/Intel_QuickPath_Interconnect QuickPath Interconnect], which uses point-to-point interconnection technology based on a distributed shared memory architecture.  It uses a modified version of the MESI protocol called [http://en.wikipedia.org/wiki/MESIF_protocol MESIF], which introduces the additional Forward state.  (The state diagram of MESI transitions that occur within the Pentium's data cache can be found on page 63 of [http://books.google.com/books?id=TVzjEZg1--YC&amp;amp;printsec=frontcover&amp;amp;source=gbs_ge_summary_r&amp;amp;cad=0#v=onepage&amp;amp;q&amp;amp;f=false Pentium Processor System Architecture: Second Edition] by Don Anderson, Tom Shanley, MindShare, Inc.)&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
The [http://www.arm.com/products/processors/classic/arm11/arm11-mpcore.php ARM11 MPCore] and [http://en.wikipedia.org/wiki/ARM_Cortex-A9_MPCore Cortex-A9 MPCore] processors support the MESI cache coherency protocol.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]  ARM MPCore defines the states of the MESI protocol it implements as:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies in other caches'''&lt;br /&gt;
|  NO&lt;br /&gt;
|  NO&lt;br /&gt;
|  YES&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Clean or Dirty'''&lt;br /&gt;
|  DIRTY&lt;br /&gt;
|  CLEAN&lt;br /&gt;
|  CLEAN&lt;br /&gt;
|  -&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
The coherency protocol is implemented and managed by the Snoop Control Unit (SCU) in the ARM MPCore, which monitors the traffic between local L1 data caches and the next level of the memory hierarchy. At boot time, each core can choose whether or not to take part in the coherency domain.  Unless explicit system calls bind a task to a specific core ([http://en.wikipedia.org/wiki/Processor_affinity processor affinity]), there is a high chance that a task will at some point migrate to a different core, along with its data as it is used.  Migration of tasks is not handled efficiently by a literal MESI implementation, so the ARM MPCore offers two optimizations that allow for MESI compliance and migration of tasks: '''Direct Data Intervention (DDI)''' (in which the SCU keeps a copy of the tag RAMs of all the cores' caches, which enables it to efficiently detect whether a cache line requested by a core is in another core in the coherency domain before looking for it in the next level of the memory hierarchy) and '''Cache-to-cache Migration''' (in which, if the SCU finds that the cache line requested by one CPU is present in another core, it will either copy it (if clean) or move it (if dirty) from the other CPU directly into the requesting one, without interacting with external memory).  These optimizations reduce memory traffic in and out of the L1 cache subsystem by eliminating interaction with external memories, which in effect reduces the overall load on the interconnect and the overall power consumption.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
Other architectures that use the MESI cache-coherence protocol include the L2 cache of the IBM POWER4 processor[[#References|&amp;lt;sup&amp;gt;[20]&amp;lt;/sup&amp;gt;]], the L2 cache of the Intel Itanium 2 processor[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]], and the Intel Xeon[[#References|&amp;lt;sup&amp;gt;[22]&amp;lt;/sup&amp;gt;]].&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
One implementation complexity already mentioned is the inefficiency of task migration in MESI that the ARM MPCore addresses.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]  Another possible implementation complexity is found during the replacement of a cache line: one possible MESI implementation requires a message to be sent to main memory when a cache line is flushed (i.e. an E to I transition), as the line was exclusively in one cache before it was removed. It is possible to avoid this replacement message if the system is designed so that the flush of a modified (exclusive) line requires an acknowledgment from main memory. However, this requires the flush to be stored in a 'write-back' buffer until the reply arrives (to ensure the change is successfully propagated to memory).[[#References|&amp;lt;sup&amp;gt;[23]&amp;lt;/sup&amp;gt;]]&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==MESIF==&lt;br /&gt;
Let us now walk through a brief overview of the '''MESIF protocol''':&lt;br /&gt;
&lt;br /&gt;
The '''MESIF''' protocol, used in the latest Intel multi-core processors, was introduced to '''accommodate the point-to-point''' links used in the QuickPath Interconnect. Using the '''MESI''' protocol in this architecture would send many redundant messages between different processors, often with unnecessarily high latency. For example, when a processor requests a cache line that is stored in multiple locations, every location might respond with the data. As the requesting processor only needs a single copy of the data, the system would be wasting bandwidth. &lt;br /&gt;
As a solution to this problem, an additional state, the '''Forward state''', was added by slightly changing the role of the Shared state. Whenever there is a read request, only the cache holding the line in the F state responds to the request, while all the caches holding it in the S state remain dormant.  Hence, by designating a single cache to '''respond to requests''', coherency traffic is substantially reduced when multiple copies of the data exist. Also, on a read request, the responding copy transitions from the F state to the S state. That is, when a cache line in the '''F''' state is '''copied''', the F state '''migrates''' to the '''newer copy''', while the '''older''' one drops back to '''S'''. Moving the new copy to the F state '''exploits''' both '''temporal and spatial locality'''. Because the newest copy of the cache line is always in the F state, it is very unlikely that the line in the F state will be evicted from the caches. This takes advantage of the temporal locality of the request. The second advantage is that if a particular cache line is in high demand due to spatial locality, the bandwidth used to transmit that data will be spread across several nodes.&lt;br /&gt;
All M to S and E to S state transitions now become '''M to F''' and '''E to F''' transitions.  &lt;br /&gt;
The '''F state''' is '''different''' from the '''Owned state''' of the MOESI protocol: a line in the F state is clean, because a valid copy is also stored in memory. Thus, unlike a line in the Owned state, for which memory is out of date and the owner must eventually write the data back, a line in the F state can be evicted or converted to the S state, if desired. &lt;br /&gt;
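&lt;br /&gt;
The read-miss behaviour described in this section can be sketched as follows. This is a simplified model, not Intel's implementation: the function name read_miss and the dictionary layout are hypothetical, only one cache line is tracked, and the fallback in which memory responds and the requester becomes the forwarder is an assumption of the sketch.&lt;br /&gt;

```python
def read_miss(states, requester):
    """Serve a read miss for one cache line under a simplified MESIF model.

    states maps a cache name to 'M', 'E', 'S', 'I' or 'F' and is updated
    in place. Returns the name of the single responder, or 'memory'.
    """
    responder, new_state = 'memory', None
    for name in states:
        if name == requester:
            continue
        if states[name] == 'F':
            # Only the forwarder answers; F migrates to the newer copy.
            responder, new_state = name, 'F'
            states[name] = 'S'
        elif states[name] in ('M', 'E'):
            # M to F and E to F, as stated in the text (a dirty M copy is
            # assumed to be written back as part of the transfer).
            responder, new_state = name, 'S'
            states[name] = 'F'
        # Caches holding the line in S stay silent.
    if new_state is None:
        # Assumption: with no designated responder, memory supplies the data.
        others = [s for n, s in states.items() if n != requester]
        new_state = 'E' if all(s == 'I' for s in others) else 'F'
    states[requester] = new_state
    return responder
```

However many Shared copies exist, each request in this model is answered by at most one cache, which is the traffic reduction the Forward state is intended to provide.&lt;br /&gt;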
&lt;br /&gt;
More information on the QuickPath Interconnect and MESIF protocol can be found at&lt;br /&gt;
'''[http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]'''&lt;br /&gt;
&lt;br /&gt;
==MOESI==&lt;br /&gt;
The [http://en.wikipedia.org/wiki/Opteron AMD Opteron] was AMD’s first-generation dual-core processor, with 2 distinct [http://en.wikipedia.org/wiki/Athlon_64 K8 cores] together on a single die.  Cache coherence poses bigger problems on such multiprocessors, so an appropriate coherence protocol was needed to address them. The [http://en.wikipedia.org/wiki/Xeon Intel Xeon], the competing processor from Intel, used the MESI protocol to handle cache coherence.  MESI has the drawback of consuming considerable time and bandwidth in certain situations. &lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/MOESI_protocol MOESI] was AMD’s answer to this problem. MOESI adds a fifth state, '''“Owned”''', to the MESI protocol. It addresses the bandwidth problem that MESI faces when a processor holding invalid data in its cache wants to access data that another processor has modified: the requesting processor has to wait for the processor that modified the data to write it back to main memory, which takes time and bandwidth. MOESI removes this drawback by allowing dirty sharing.  When a processor holds the data in the new '''“Owned”''' state, it can provide the modified data to other processors without, or even before, writing it to main memory. This is called '''''dirty sharing'''''. The processor holding the data in the '''&amp;quot;Owned&amp;quot;''' state remains responsible for updating main memory later, when the cache line is evicted.&lt;br /&gt;
&lt;br /&gt;
MOESI has become one of the most popular snoop-based protocols supported in the AMD64 architecture.  The AMD dual-core Opteron can maintain cache coherence in systems of up to 8 processors using this protocol.&lt;br /&gt;
&lt;br /&gt;
The five different states of the MOESI protocol are:&lt;br /&gt;
* '''Modified (M)''' : The cache line holds the most recent copy of the data, and the data is not present in any other processor's cache.&lt;br /&gt;
* '''Owned (O)'''   : The cache line holds the most recent, correct copy of the data, which may be shared by other processors. The processor holding the line in this state is responsible for writing the correct value to main memory before the line is evicted.  &lt;br /&gt;
* '''Exclusive (E)''' : The cache line holds the most recent, correct copy of the data, which is present only in this processor's cache; a copy is also present in main memory.  &lt;br /&gt;
* '''Shared (S)''' : The cache line holds the most recent, correct copy of the data, which may be shared by other processors. &lt;br /&gt;
* '''Invalid (I)''' : The cache line does not hold a valid copy of the data.&lt;br /&gt;
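&lt;br /&gt;
A minimal sketch of dirty sharing, under stated assumptions, is given here. The names Line, snooped_read and evict are hypothetical, memory is modelled as a plain dictionary, and only the transitions involving a dirty holder are covered.&lt;br /&gt;

```python
class Line:
    """One cache's copy of a single line: a MOESI state plus the data."""

    def __init__(self, state='I', value=None):
        self.state, self.value = state, value


def snooped_read(holder, requester):
    """Serve another cache's read miss from a dirty holder (M or O)."""
    if holder.state not in ('M', 'O'):
        raise ValueError('holder must be in the M or O state')
    holder.state = 'O'               # still responsible for the write-back
    requester.state = 'S'
    requester.value = holder.value   # cache-to-cache transfer, memory untouched


def evict(line, memory):
    """Write the line back on eviction only if it is dirty (M or O)."""
    if line.state in ('M', 'O'):
        memory['value'] = line.value
        memory['writes'] += 1
    line.state, line.value = 'I', None
```

In this model the memory write is deferred until the Owned copy is evicted, while the sharer receives the up-to-date value immediately from the owner.&lt;br /&gt;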
&lt;br /&gt;
A detailed explanation of this protocol implementation on AMD processor can be found in the manual [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of the AMD 64-bit core]&lt;br /&gt;
&lt;br /&gt;
The following table summarizes the MOESI protocol:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Owned'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''This cache line is valid?'''&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  No&lt;br /&gt;
|-&lt;br /&gt;
|  '''The memory copy is…'''&lt;br /&gt;
|  out of date&lt;br /&gt;
|  out of date&lt;br /&gt;
|  valid&lt;br /&gt;
|  valid (out of date if another cache holds the line in the Owned state)&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies exist in caches of other processors?'''&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe (in the Shared state)&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe&lt;br /&gt;
|  Maybe&lt;br /&gt;
|-&lt;br /&gt;
|  '''A write to this line'''&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  goes directly to bus&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The state transition diagram for the MOESI protocol is shown below. The top part shows the response to processor-side requests and the bottom part shows the response to snooper-side requests. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:MOESI.jpg|500px|thumb|center|MOESI state transition diagram]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br /&amp;gt;&lt;br /&gt;
&amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Optimization techniques on MOESI===&lt;br /&gt;
&lt;br /&gt;
In real machines, applying optimization techniques to the standard cache coherence protocol can improve performance. For example, the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] family of microprocessors (Family 0x10), AMD’s first generation to incorporate 4 distinct cores on a single die and the first to have a cache that all the cores share, uses the MOESI protocol with some optimization techniques incorporated. &lt;br /&gt;
&lt;br /&gt;
One such optimization focuses on a small subset of compute problems that behave like producer-consumer programs. In such a problem, a thread running on one core produces data, which is consumed by a thread running on a separate core. With such programs, it is desirable to get the two distinct cores to communicate through the shared cache, to avoid round trips to and from main memory. The '''MOESI''' protocol that the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] cache uses for cache coherence can also limit bandwidth. Hence, by keeping the cache line in the '''‘M’''' state for such computing problems, we can achieve better performance.&lt;br /&gt;
&lt;br /&gt;
When the producer thread writes a new entry to the ring buffer it shares with the consumer, it allocates cache lines in the '''modified (M)''' state. Eventually, these M-marked cache lines will start to fill the L3 cache. When the consumer reads the cache line, the MOESI protocol changes the state of the cache line to '''owned (O)''' in the L3 cache and pulls down a '''shared (S)''' copy for its own use. Now, the producer thread circles the ring buffer and arrives back at the same cache line it had previously written. However, when the producer attempts to write new data to the owned (marked '''‘O’''') cache line, it finds that it cannot, since a cache line marked '''‘O’''' by the previous consumer read does not have sufficient permission for a write request (in the MOESI protocol). To maintain coherence, the memory controller must initiate probes in the other caches (to handle any other S copies that may exist). This slows down the process.&lt;br /&gt;
&lt;br /&gt;
Thus, it is preferable to keep the cache line in the '''‘M’''' state in the L3 cache. In such a situation, when the producer comes back around the ring buffer, it finds the previously written cache line still marked '''‘M’''', to which it is safe to write without coherence concerns. Thus better performance can be achieved by such optimization techniques to standard protocols when implemented in real machines.&lt;br /&gt;
&lt;br /&gt;
You can find more information on how this is implemented and various other ways of optimizations in this manual [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors]&lt;br /&gt;
&lt;br /&gt;
==Dragon Protocol==&lt;br /&gt;
The '''[http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]''' is an update-based coherence protocol: unlike the coherence protocols we have seen so far, it does not invalidate other cached copies. Write propagation is achieved by updating the cached copies instead of invalidating them.  The Dragon Protocol does not update memory on a cache-to-cache transfer; it delays making memory consistent with the caches until the data is evicted and written back, which saves time and lowers the memory access requirements. Moreover, only the written '''byte''' or '''word''' is '''communicated''' to the '''other caches''' instead of the whole block, which further '''reduces''' the '''bandwidth''' usage. It can dynamically detect the sharing status of a block, using a write-through policy for shared blocks and a write-back policy for currently non-shared blocks. The Dragon Protocol employs the following four states for the cache blocks: '''Shared Clean''', '''Shared Modified''', '''Exclusive''' and  '''Modified'''. &lt;br /&gt;
* '''Modified (M)''' and '''Exclusive (E)''' - these states have the same meaning as explained in the protocols above. &lt;br /&gt;
* '''Shared Modified (Sm)''' - Potentially two or more caches have this block, memory may or may not be up to date, and this processor's cache has modified the block. Only one cache in the system can hold a given block in the Shared Modified state.&lt;br /&gt;
* '''Shared Clean (Sc)''' -  Potentially two or more caches have this block and memory may or may not be up to date (if no other cache has it in the Sm state, memory is up to date; otherwise it is not).&lt;br /&gt;
A Shared Modified block is written back to main memory, to keep memory consistent, only when it is evicted from the cache on a cache miss. For more information on the Dragon protocol, refer to the Solihin textbook, page 229. The state transition diagram is given below for reference.&lt;br /&gt;
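&lt;br /&gt;
The update-based write described above can be sketched as follows. This is a simplified model under stated assumptions: the function name dragon_write and the two dictionaries are hypothetical, the writer is assumed to hold the block already, and the sharing check stands in for the shared signal that a real bus would provide.&lt;br /&gt;

```python
def dragon_write(states, values, writer, new_value):
    """Apply a processor write under a simplified Dragon model.

    states maps a cache name to 'E', 'M', 'Sc', 'Sm', or None when the
    block is absent; values maps a cache name to the cached word.
    Returns True when an update had to be broadcast on the bus.
    """
    sharers = [n for n, s in states.items()
               if n != writer and s is not None]
    values[writer] = new_value
    if not sharers:
        states[writer] = 'M'          # no other copy: purely local write
        return False
    for name in sharers:
        values[name] = new_value      # other copies are updated, not invalidated
        states[name] = 'Sc'
    states[writer] = 'Sm'             # the writer now owns the dirty block
    return True
```

Note that memory does not appear in the sketch at all: consistent with the description above, it is brought up to date only when the Shared Modified (or Modified) block is eventually evicted.&lt;br /&gt;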
&lt;br /&gt;
[[File:DragonNew.jpg|500px|thumb|center|Dragon state transition diagram]] &lt;br /&gt;
&lt;br /&gt;
The Dragon protocol implements snoopy caches that provide the appearance of a uniform memory space to multiple processors. Here, each cache listens to 2 buses: the processor bus and the memory bus. The caches are also responsible for address translation, so the processor bus carries virtual addresses and the memory bus carries physical addresses.  The Dragon system was designed to support 4 to 8 Dragon processors.&lt;br /&gt;
&lt;br /&gt;
=[http://en.wikipedia.org/wiki/Instruction_prefetch Instruction Prefetching]=&lt;br /&gt;
&lt;br /&gt;
In standard [http://en.wikipedia.org/wiki/Pipeline_(computing) CPU pipelining], instruction prefetching is an optimization used to minimize the time the CPU spends waiting on instructions being fetched from main memory.  Prefetching works perfectly on completely linear programs; but on programs with branches, other optimizations (like [http://en.wikipedia.org/wiki/Branch_predictor branch prediction]) have to be implemented, and the cost of taking the wrong branch has to be considered.  Source:  http://dictionary.reference.com/browse/instruction+prefetch&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Instruction prefetching is a technique used to speed up the execution of a program. But in multiprocessors, prefetching comes at a cost: data can be modified in such a way that the memory coherence protocol is not able to handle the effects. In such situations, software must use serializing instructions or cache-invalidation instructions to guarantee that subsequent data accesses are coherent. &lt;br /&gt;
&amp;lt;br /&amp;gt;&lt;br /&gt;
An example of this type of a situation is a page-table update followed by accesses to the physical pages referenced by the updated page tables. The physical-memory references for the page tables are different than the physical-memory references for the data. Because of prefetching there may be problems with correctness. The following sequence of events shows such a situation when software changes the translation of virtual-page A from physical-page M to physical-page N:&lt;br /&gt;
# The tables that translate virtual-page A to physical-page M are now held only in main memory. The copies in the cache are invalidated.&lt;br /&gt;
# Page-table entry is changed by the software for virtual-page A in main memory to point to physical page N rather than physical-page M.&lt;br /&gt;
# Data in virtual-page A is accessed.&lt;br /&gt;
&lt;br /&gt;
Software expects the processor to access the data from physical-page N after the update. However, it is possible for the processor to prefetch the data from physical-page M before the page table for virtual page A is updated. Because the physical-memory references are different, the processor does not recognize them as requiring coherence checking and believes it is safe to prefetch the data from virtual-page A, which is translated into a read from physical page M. Similar behavior can occur when instructions are prefetched from beyond the page-table update instruction.&lt;br /&gt;
&lt;br /&gt;
To prevent such errors, software must execute an invalidating or serializing instruction immediately after the page-table update, which ensures that subsequent instruction fetches and data accesses use the correct virtual-page-to-physical-page translation. It is not necessary to perform a TLB invalidation operation preceding the table update.&lt;br /&gt;
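&lt;br /&gt;
The hazard and its fix can be illustrated with a toy model. This is a deliberate simplification, not a description of real hardware: the class ToyMMU is hypothetical, and the early (prefetched) use of the old translation is modelled as a cached translation entry that survives the page-table update until software invalidates it.&lt;br /&gt;

```python
class ToyMMU:
    """Hypothetical model: translations are cached on first use."""

    def __init__(self, page_table, memory):
        self.page_table = page_table   # virtual page -> physical page
        self.memory = memory           # physical page -> contents
        self.cached = {}               # translations the processor already holds

    def load(self, vpage):
        if vpage not in self.cached:
            self.cached[vpage] = self.page_table[vpage]
        return self.memory[self.cached[vpage]]

    def invalidate(self, vpage):
        """Stands in for the instruction executed after the table update."""
        self.cached.pop(vpage, None)


page_table = {'A': 'M'}
mmu = ToyMMU(page_table, {'M': 'old data', 'N': 'new data'})
first = mmu.load('A')       # translation A -> M is picked up early
page_table['A'] = 'N'       # software changes the translation
stale = mmu.load('A')       # still served from physical page M
mmu.invalidate('A')         # executed right after the table update
fresh = mmu.load('A')       # now served from physical page N
```

In this model, the access made before the invalidation still reads physical page M, while the access made after it reads physical page N, which mirrors the ordering requirement stated above.&lt;br /&gt;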
&lt;br /&gt;
More information can be found about this in [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual]&lt;br /&gt;
&lt;br /&gt;
= CMP Implementation in Intel Architecture =&lt;br /&gt;
&lt;br /&gt;
Let us now see how Intel architecture using the MESI protocol progressed from a uniprocessor architecture to a Chip MultiProcessor (CMP) using the bus as the interconnect. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Uniprocessor Architecture'''&lt;br /&gt;
&lt;br /&gt;
The diagram below shows the structure of the memory cluster in Intel Pentium M processor.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:intel_cache1.jpg]]&amp;lt;/center&amp;gt; &amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In this structure we have,&lt;br /&gt;
* A unified on-chip '''L1 cache''' with the '''processor/core''',&lt;br /&gt;
* A '''Memory/L2 access control unit''', through which all the accesses to the L2 cache, main memory and IO space are made,&lt;br /&gt;
* The second level '''L2 cache''' along with the '''prefetch unit''' and&lt;br /&gt;
* '''Front side bus (FSB)''', a single shared bi-directional bus through which all the traffic is sent. These wide buses bring in multiple data bytes at a time. &lt;br /&gt;
&lt;br /&gt;
As Intel explains it, using this structure, the processor requests were first sought in the '''L2 cache''' and only on a '''miss''' were they '''forwarded''' to the main '''memory''' via the front side bus ('''FSB'''). The '''Memory/L2 access control''' unit served as a central point for '''maintaining coherence''' within the core and with the external world. It '''contains''' a '''snoop control unit''' that receives snoop requests from the bus and performs the required operations on each cache (and internal buffers) in parallel. It also handles RFO requests (BusUpgr) and ensures the operation continues only after it guarantees that no other version of the cache line exists in any other cache in the system.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''CMP Architecture'''&lt;br /&gt;
&lt;br /&gt;
For the CMP implementation, Intel chose the bus-based architecture using snoopy protocols over the '''directory protocol''': although a directory protocol reduces active power because of reduced snoop activity, it '''increases''' the '''design complexity''' and the '''static power''' because of larger tag arrays. Since Intel has a large market for processors in the mobility family, a directory-based solution was less favorable, since battery life depends mainly on static power consumption and less on dynamic power.&lt;br /&gt;
Let us examine how '''CMP''' was implemented in the '''Intel Core Duo''', which was one of the first dual-core processors for the budget/entry-level market. &lt;br /&gt;
The general CMP implementation structure of the Intel Core Duo is shown below.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:intel_cache2.jpg]]&amp;lt;/center&amp;gt; &amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This structure has the following changes when compared to the uniprocessor memory cluster structure. &lt;br /&gt;
* '''L1 cache''' and the '''processor/core''' structure is '''duplicated''' to give 2 cores.&lt;br /&gt;
* The '''Memory/L2 access control''' unit is '''split''' into 2 logical units: '''L2 controller''' and '''bus controller'''. The L2 controller handles all '''requests to the L2''' cache from the core and the snoop requests from the FSB. The '''bus controller''' handles '''data and I/O requests''' to and from the FSB.&lt;br /&gt;
* The '''prefetching''' unit is extended to handle the hardware '''prefetches for each core separately'''.&lt;br /&gt;
* A '''new logical unit''' (represented by the hexagon) was added to maintain '''fairness between the requests''' coming from the different cores and hence balance the requests to L2 and memory.&lt;br /&gt;
&lt;br /&gt;
This new '''partitioned structure''' for the  memory/L2 access control unit '''enhanced''' the '''performance''' while '''reducing power consumption'''. &lt;br /&gt;
For more information on uniprocessor and multiprocessor implementation under the Intel architecture, refer to &lt;br /&gt;
[http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]&lt;br /&gt;
&lt;br /&gt;
The '''Intel bus architecture''' has been '''evolving''' in order to accommodate the demands of scalability while using the same MESI protocol: from a '''single shared bus''', to '''dual independent buses (DIB)''' that double the available bandwidth, to the logical conclusion of DIB with the introduction of '''dedicated high-speed interconnects (DHSI)'''. The DHSI-based platforms use four FSBs, one for each processor in the platform. In both DIB and DHSI, a snoop filter was used in the chipset to cache snoop information, thereby significantly reducing the broadcasting needed for the snoop traffic on the buses. With the production of processors based on the next-generation 45-nm Hi-k Intel Core microarchitecture, the [http://en.wikipedia.org/wiki/Xeon Intel Xeon] processor fabric will transition from DHSI, with the memory controller in the chipset, to a distributed shared memory architecture using '''Intel QuickPath Interconnects with the MESIF protocol'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Word Invalidation=&lt;br /&gt;
One complexity problem applying to a number of the protocols deals with invalidation.  In newer protocols, individual words may be modified in a cache line, as opposed to the entirety of the line.  The other processors will thus have a mostly correct cache line, with only a word difference.  This leads to potential complexity, because to 'correct' the error the other processors must transition from shared to invalid, where they then can read the correct word from the bus and place it back into the line, transitioning back into shared.  This transition is potentially unnecessary, as the second processor may never access the specific word changed, but it may access other words in the cache line.  One potential solution to this is being researched at the present time, advancing the protocols so that a cache line is &amp;quot;not invalidated on the first dirty word, but after the number of dirty words crosses some predetermined value, which is data type and application dependent.&amp;quot;  In other words, if the application can possibly have multiple words 'incorrect,' several transitions to and from the invalid state may be avoided.&lt;br /&gt;
&amp;lt;br/&amp;gt;Source: http://tab.computer.org/tcca/NEWS/sept96/dsmideas.ps&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In essence, the solution proposed here is to advance the MOESI protocol with word invalidation and specific treatment of temporal and spatial data, so that the block is not invalidated.&lt;br /&gt;
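&lt;br /&gt;
The threshold idea quoted above can be sketched as follows. This is an illustrative model only, not the proposal's actual mechanism: the class ThresholdLine and its method names are hypothetical, and the threshold value is chosen arbitrarily for the example.&lt;br /&gt;

```python
class ThresholdLine:
    """A shared copy that tolerates a few remotely written (stale) words."""

    def __init__(self, threshold):
        self.valid = True
        self.stale = set()          # indices of words written elsewhere
        self.threshold = threshold  # assumed to be tuned per application

    def remote_write(self, index):
        """Another processor wrote word index; returns True if still valid."""
        if self.valid:
            self.stale.add(index)
            if len(self.stale) > self.threshold:
                self.valid = False  # too many dirty words: invalidate the line
                self.stale.clear()
        return self.valid

    def local_read_hits(self, index):
        """A local read hits only on a valid line and a word that is not stale."""
        return self.valid and index not in self.stale
```

With a threshold of 2, for example, the first two distinct remotely written words leave the rest of the line readable, and only the third one forces the transition to the invalid state.&lt;br /&gt;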
&lt;br /&gt;
=Performance: MOESI vs MEI/MESI=&lt;br /&gt;
Early multiprocessors (such as the PowerPC processors) were designed to work with three states (Modified, Exclusive, Invalid - this is similar to the MSI protocol discussed earlier, the term MEI is used to remain consistent with the sources used for reference).  As time progressed, more multi-processors transitioned to the MESI protocol.  This is most likely due to the characteristics of MEI - it is &amp;quot;easy to implement but leads to inefficiencies in the way that memory bus bandwidth is used&amp;quot; [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 source].  As a result, systems with two or more MEI processors (most likely) will not see the 'full' potential of the extra processors, due to the inefficient bus transactions.&lt;br /&gt;
&lt;br /&gt;
In contrast, the five-state protocol MOESI is more complex to implement than MESI and MEI/MSI.  However, advantages arise from using it.  As Andy Keane (VP of Marketing for PMC-Sierra) put it, &amp;quot;this [fifth] state allows shared data that is dirty to remain in the cache.  Without this state, any shared line would be written back to memory to change the original state from modified. Since we have a dedicated, fast path from CPU to CPU, this state maximizes the use of this path rather than the path to memory, which is inherently slower&amp;quot; [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 source].  This advantage can be realized by allowing MOESI to make some processor-to-processor transfers from level 2 caches.&lt;br /&gt;
&lt;br /&gt;
Source: [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 Edwards, Chris (08/06/2001). &amp;quot;Architects wrestle with multiprocessor options&amp;quot;. Electronic engineering times (0192-1541), (1178), p. 48.]&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Cache_coherence Cache coherence]&lt;br /&gt;
# [http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]&lt;br /&gt;
# [http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Symmetric_multiprocessing Common System Interface in Intel Processors]&lt;br /&gt;
# [http://www.zak.ict.pwr.wroc.pl/nikodem/ak_materialy/Cache%20consistency%20&amp;amp;%20MESI.pdf Cache consistency with MESI on Intel processor]&lt;br /&gt;
# [http://techreport.com/articles.x/8236/2 AMD dual core Architecture]&lt;br /&gt;
# [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual]&lt;br /&gt;
# [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors]&lt;br /&gt;
# [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of AMD 64 bit core]&lt;br /&gt;
# [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&amp;amp;arnumber=4913 Silicon Graphics Computer Systems]&lt;br /&gt;
# [http://books.google.com/books?id=g82fofiqa5IC&amp;amp;printsec=frontcover&amp;amp;dq=Parallel+computer+architecture:+a+hardware/software+approach+By+David+E.+Culler,+Jaswinder+Pal+Singh,+Anoop+Gupta&amp;amp;source=bl&amp;amp;ots=COrdamlfVn&amp;amp;sig=YcugVqbzTjHvlofvaFq6Ft_tjfY&amp;amp;hl=en&amp;amp;ei=0ZO6S4TJGcOclgejzI3BBw&amp;amp;sa=X&amp;amp;oi=book_result&amp;amp;ct=result&amp;amp;resnum=1&amp;amp;ved=0CAgQ6AEwAA#v=onepage&amp;amp;q=&amp;amp;f=false Parallel computer architecture: a hardware/software approach By David E. Culler, Jaswinder Pal Singh, Anoop Gupta]&lt;br /&gt;
# [http://www.freepatentsonline.com/5283886.html Three state invalidation protocols]&lt;br /&gt;
# [http://portal.acm.org/citation.cfm?id=1499317&amp;amp;dl=GUIDE&amp;amp;coll=GUIDE&amp;amp;CFID=83027384&amp;amp;CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems]&lt;br /&gt;
# [http://portal.acm.org/citation.cfm?id=6514 Cache Coherence protocols: evaluation using a multiprocessor simulation model]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Xerox_Dragon Xerox Dragon]&lt;br /&gt;
# [http://thanaseto.110mb.com/courses/CSD-527-report-engl.pdf Coherence Protocols]&lt;br /&gt;
# [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&amp;amp;arnumber=289691 XDBus]&lt;br /&gt;
# [http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dai0228a/index.html ARM]&lt;br /&gt;
# [http://ixbtlabs.com/articles/ibmpower4/index.html IBM POWER4]&lt;br /&gt;
# Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems. Solihin, Yan&lt;br /&gt;
# [http://www.chiplist.com/Intel_Itanium_2_processor/tree3f-section--2235-/ Intel Itanium 2]&lt;br /&gt;
# [http://techreport.com/articles.x/8236/2 MESI-MESI-MOESI Banana-fana...]&lt;br /&gt;
# [http://rsim.cs.illinois.edu/rsim/Manual/node109.html RSIM]&lt;br /&gt;
# [http://dl.acm.org/citation.cfm?id=1499317 Synapse tightly coupled multiprocessors]&lt;/div&gt;</summary>
		<author><name>Pxu</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8a_fu&amp;diff=59826</id>
		<title>CSC/ECE 506 Spring 2012/8a fu</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8a_fu&amp;diff=59826"/>
		<updated>2012-03-19T01:45:31Z</updated>

		<summary type="html">&lt;p&gt;Pxu: /* References */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Introduction to bus-based cache coherence in real machines=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==SMP Architecture==&lt;br /&gt;
Most parallel software in the commercial market relies on the shared-memory programming model in which all processors access the same physical address space. And the most common multiprocessors today use [http://en.wikipedia.org/wiki/Symmetric_multiprocessing SMP] architecture which use a common bus as the interconnect.  In the case of multicore processors (&amp;quot;chip multiprocessors,&amp;quot; or CMP) the SMP architecture applies to the cores treating them as separate processors. The key problem of shared-memory multiprocessors is providing a consistent view of memory with various cache hierarchies.  This is called '''''cache coherence problem'''''. It is  critical to  achieve correctness and performance-sensitive design point for supporting the shared-memory model. The cache coherence mechanisms not only govern communication in a shared-memory multiprocessor, but also typically determine how the memory system transfers data between processors, caches, and memory.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:Busbased SMP.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
At any point in logical time, the permissions for a cache block can allow either a single writer or multiple readers. The '''''coherence protocol''''' ensures the invariants of the states are maintained. The different coherent states used by most of the cache coherence protocols are as shown in ''Table 1'':&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''States'''&lt;br /&gt;
|  '''Access Type'''&lt;br /&gt;
|  '''Invariant'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  read, write&lt;br /&gt;
|  all other caches in I state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  read&lt;br /&gt;
|  all other caches in I state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Owned'''&lt;br /&gt;
|  read&lt;br /&gt;
|  all other caches in I or S state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  read&lt;br /&gt;
|  no other cache in M or E state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|  -&lt;br /&gt;
|  -&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Snooping Protocols=&lt;br /&gt;
==MSI Protocol==&lt;br /&gt;
&lt;br /&gt;
'''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' is a three-state write-back '''invalidation protocol''' and one of the earliest snooping-based cache coherence protocols. It marks each cache line as '''Modified (M)''', '''Shared (S)''' or '''Invalid (I)'''. '''Invalid''' means the cache line is either not present or is present but invalid. If the cache line is clean and may be shared by more than one processor, it is marked '''Shared'''. If the cache line is dirty and the processor has exclusive ownership of it, it is in the '''Modified''' state. A BusRdX causes all other caches to invalidate (demote) their copies to the '''I''' state; a cache that holds the line in the '''M''' state also flushes it. A write that hits a line in the '''S''' state still issues a BusRdX, after which the line is promoted (upgraded) to the '''M''' state.&lt;br /&gt;
&lt;br /&gt;
The state diagram of the MSI protocol is shown below. Note that the left part shows the response to processor-side requests, and the right part shows the response to snooper-side requests.&amp;lt;ref&amp;gt;Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems. Solihin, Yan&amp;lt;/ref&amp;gt;&lt;br /&gt;
[[File:MSInew.jpg|600px|thumb|center|MSI state transition diagram]]&lt;br /&gt;
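&lt;br /&gt;
The MSI transitions described in this section can be sketched as a small lookup table. This is only an illustrative Python model, not code from the cited textbook: the event names (PrRd, PrWr, BusRd, BusRdX) follow the usual textbook convention, while the table name MSI and the function step are hypothetical.&lt;br /&gt;

```python
# Minimal sketch of the MSI state machine for a single cache line.
# Event names are the usual textbook ones; MSI and step() are hypothetical.
# Each entry maps (state, event) to (next_state, action placed on the bus).
MSI = {
    # Processor-side requests
    ("I", "PrRd"): ("S", "BusRd"),
    ("I", "PrWr"): ("M", "BusRdX"),
    ("S", "PrRd"): ("S", None),
    ("S", "PrWr"): ("M", "BusRdX"),   # a write hit in S still needs the bus
    ("M", "PrRd"): ("M", None),
    ("M", "PrWr"): ("M", None),
    # Snooper-side requests
    ("I", "BusRd"): ("I", None),
    ("I", "BusRdX"): ("I", None),
    ("S", "BusRd"): ("S", None),
    ("S", "BusRdX"): ("I", None),
    ("M", "BusRd"): ("S", "Flush"),
    ("M", "BusRdX"): ("I", "Flush"),
}


def step(state, event):
    """Return (next_state, bus_action) for one event on one cache line."""
    return MSI[(state, event)]


# A read followed by a write to the same line uses the bus twice.
state, first = step("I", "PrRd")
state, second = step(state, "PrWr")
print(state, first, second)
```

The last three lines show the read-then-write sequence that costs two bus transactions even when no other cache holds the line.&lt;br /&gt;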
&lt;br /&gt;
==Synapse protocol==&lt;br /&gt;
From the state transition diagram of MSI, we observe that there is a transition from state '''M''' to state '''S''' when a BusRd is observed for that block. The contents of the block are flushed to the bus before the block goes to the '''S''' state. In certain cases it would be more appropriate to move to the '''I''' state, thus giving up the block entirely. The choice between '''S''' and '''I''' reflects the designer's assumption about the access pattern: moving to '''S''' assumes that the original processor is more likely to continue reading the block than the new processor is to write to it. The Synapse protocol, used in the early Synapse multiprocessor, made the alternate choice of going directly from the '''M''' state to the '''I''' state on a BusRd, assuming that the migratory pattern would be more frequent. More details about this protocol can be found in these papers published in the 1980s: [http://portal.acm.org/citation.cfm?id=6514 Cache Coherence protocols: evaluation using a multiprocessor simulation model] and [http://portal.acm.org/citation.cfm?id=1499317&amp;amp;dl=GUIDE&amp;amp;coll=GUIDE&amp;amp;CFID=83027384&amp;amp;CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems]&lt;br /&gt;
&lt;br /&gt;
In the Synapse protocol, the '''M''' state is called the '''D''' (Dirty) state. The following is the state transition diagram for the Synapse protocol:&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:Synapse1.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
===Real Architectures using Synapse===&lt;br /&gt;
No information could be found on any actual architecture using the Synapse cache-coherence protocol.  Judging from the abstract of the 1984 paper by Steve Frank and Armond Inselberg, &amp;quot;[http://dl.acm.org/citation.cfm?id=1499317 Synapse Tightly Coupled Multiprocessors: A New Approach to Solve Old Problems]&amp;quot;, Synapse appears to have been theoretical.&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The [http://www.disi.unige.it/person/DelzannoG/CacheProtocol/synapse.hy BABYLON Project] provides one example of a Synapse cache coherence protocol with atomic synchronization actions.&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
In the Synapse system, memory contention is reduced by designing a processor cache that employs a non-write-through algorithm to minimize the bandwidth used between the cache and shared memory. The Synapse Expansion Bus includes an ownership-level protocol between processor caches.[[#References|&amp;lt;sup&amp;gt;[24]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
==[http://en.wikipedia.org/wiki/MESI_protocol MESI]==&lt;br /&gt;
The main drawback of MSI is that each read-write sequence incurs two bus transactions, irrespective of whether the cache line is stored in only one cache or not. Highly parallel programs that have little data sharing suffer the most from this.  The MESI protocol solves this problem by introducing the '''Exclusive''' state to distinguish between a cache line stored in multiple caches and a line stored in a single cache.&lt;br /&gt;
Let us briefly see how the MESI protocol works. For a more detailed version, refer to the Solihin textbook, p. 215.&lt;br /&gt;
&lt;br /&gt;
The MESI coherence protocol marks each cache line as being in one of the Modified, Exclusive, Shared, or Invalid states.&lt;br /&gt;
* '''Invalid''' : The cache line is either not present or is invalid&lt;br /&gt;
* '''Exclusive''' : The cache line is clean and is owned by this core/processor only&lt;br /&gt;
* '''Modified''' : The cache line is dirty and the core/processor has exclusive ownership of it; the copy in memory is stale&lt;br /&gt;
* '''Shared''' : The cache line is clean and is shared by more than one core/processor&lt;br /&gt;
&lt;br /&gt;
The MESI protocol works as follows:&lt;br /&gt;
A line that is fetched receives the '''E''' or '''S''' state, depending on whether it exists in the caches of other processors in the system. A cache line gets the '''M''' state when a processor writes to it; if the line is not in the '''E''' or '''M''' state prior to the write, the cache sends a Bus Upgrade (BusUpgr) signal or, as the Intel manuals term it, a “Read-For-Ownership (RFO) request”, which ensures that the line exists in the cache and is in the '''I''' state in all other processors on the bus (if any). The table below summarizes the MESI protocol.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''This cache line is valid?'''&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  No&lt;br /&gt;
|-&lt;br /&gt;
|  '''The memory copy is…'''&lt;br /&gt;
|  out of date&lt;br /&gt;
|  valid&lt;br /&gt;
|  valid&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies exist in caches of other processors?'''&lt;br /&gt;
|  No&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe&lt;br /&gt;
|  Maybe&lt;br /&gt;
|-&lt;br /&gt;
|  '''A write to this line'''&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  goes directly to bus&lt;br /&gt;
|}&lt;br /&gt;
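&lt;br /&gt;
The benefit of the Exclusive state can be illustrated by counting bus transactions for a read followed by a write. The following is a minimal Python sketch of the processor-side transitions only; the function names and the shared flag (standing in for the shared signal that other caches assert on a BusRd) are assumptions made for illustration.&lt;br /&gt;

```python
# Sketch of processor-side MESI transitions for one cache line.
# The "shared" flag models the shared signal other caches assert
# on a BusRd; function and variable names are hypothetical.
def mesi_processor(state, op, shared=False):
    """Return (next_state, bus_transaction or None)."""
    if op == "read":
        if state == "I":
            return ("S" if shared else "E", "BusRd")
        return (state, None)          # read hit in M, E or S
    if op == "write":
        if state == "M":
            return ("M", None)
        if state == "E":
            return ("M", None)        # silent upgrade, no bus traffic
        if state == "S":
            return ("M", "BusUpgr")   # invalidate the other sharers
        return ("M", "BusRdX")        # write miss from I
    raise ValueError(op)


def bus_transactions(ops, shared=False):
    """Count bus transactions for a sequence of operations starting in I."""
    state, count = "I", 0
    for op in ops:
        state, bus = mesi_processor(state, op, shared)
        if bus is not None:
            count += 1
    return state, count


print(bus_transactions(["read", "write"]))               # private line
print(bus_transactions(["read", "write"], shared=True))  # shared line
```

In this model the private line needs one bus transaction for the read-write sequence, whereas the shared line needs two, as it would under MSI.&lt;br /&gt;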
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
State transition diagram for MESI protocol is shown below.&lt;br /&gt;
&lt;br /&gt;
[[File:MESInew.jpg|500px|thumb|center|MESI state transition diagram]] &lt;br /&gt;
&lt;br /&gt;
===Real Architectures using MESI===&lt;br /&gt;
&lt;br /&gt;
Intel's [http://en.wikipedia.org/wiki/Pentium_Pro Pentium Pro] microprocessor, introduced in 1995, was the first Intel architecture microprocessor to support symmetric multiprocessing in various multiprocessor configurations.  SMP with the MESI protocol was used consistently until the introduction of the 45-nm Hi-k Core microarchitecture in Intel's quad-core x86-64 Nehalem-EP processors. This microarchitecture uses a new system framework called the [http://en.wikipedia.org/wiki/Intel_QuickPath_Interconnect QuickPath Interconnect], a point-to-point interconnection technology based on a distributed shared memory architecture.  It uses a modified version of the MESI protocol called [http://en.wikipedia.org/wiki/MESIF_protocol MESIF], which introduces the additional Forward state.  (The state diagram of MESI transitions that occur within the Pentium's data cache can be found on page 63 of [http://books.google.com/books?id=TVzjEZg1--YC&amp;amp;printsec=frontcover&amp;amp;source=gbs_ge_summary_r&amp;amp;cad=0#v=onepage&amp;amp;q&amp;amp;f=false Pentium Processor System Architecture: Second Edition] by Don Anderson, Tom Shanley, MindShare, Inc.)&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
The [http://www.arm.com/products/processors/classic/arm11/arm11-mpcore.php ARM11 MPCore] and [http://en.wikipedia.org/wiki/ARM_Cortex-A9_MPCore Cortex-A9 MPCore] processors support the MESI cache coherency protocol.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]  ARM MPCore defines the states of the MESI protocol it implements as:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies in other caches'''&lt;br /&gt;
|  NO&lt;br /&gt;
|  NO&lt;br /&gt;
|  YES&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Clean or Dirty'''&lt;br /&gt;
|  DIRTY&lt;br /&gt;
|  CLEAN&lt;br /&gt;
|  CLEAN&lt;br /&gt;
|  -&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
The coherency protocol is implemented and managed by the Snoop Control Unit (SCU) in the ARM MPCore, which monitors the traffic between the local L1 data caches and the next level of the memory hierarchy. At boot time, each core can choose whether or not to take part in the coherency domain.  Unless explicit system calls bind a task to a specific core ([http://en.wikipedia.org/wiki/Processor_affinity processor affinity]), there is a good chance that the task will at some point migrate to a different core, along with its data as it is used.  Task migration is not handled efficiently by a literal MESI implementation, so the ARM MPCore offers two optimizations that remain MESI-compliant while supporting migration: '''Direct Data Intervention (DDI)''', in which the SCU keeps a copy of the tag RAMs of all the cores' caches, enabling it to detect efficiently whether a cache line requested by one core is held by another core in the coherency domain before looking for it in the next level of the memory hierarchy; and '''Cache-to-cache Migration''', in which, if the SCU finds that the cache line requested by one CPU is present in another core, it either copies it (if clean) or moves it (if dirty) from the other CPU directly into the requesting one, without interacting with external memory.  These optimizations reduce memory traffic in and out of the L1 cache subsystem by eliminating interaction with external memories, which in effect reduces the overall load on the interconnect and the overall power consumption.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
Other architectures that use the MESI cache-coherence protocol include the L2 cache of the IBM POWER4 processor[[#References|&amp;lt;sup&amp;gt;[20]&amp;lt;/sup&amp;gt;]], the L2 cache of the Intel Itanium 2 processor[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]], and the Intel Xeon[[#References|&amp;lt;sup&amp;gt;[22]&amp;lt;/sup&amp;gt;]].&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
One implementation complexity already mentioned is the inefficiency of task migration in MESI that the ARM MPCore addresses.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]  Another possible implementation complexity is found during the replacement of a cache line: one possible MESI implementation requires a message to be sent to main memory when a cache line is flushed (i.e. an E to I transition), as the line was exclusively in one cache before it was removed. It is possible to avoid this replacement message if the system is designed so that the flush of a modified (exclusive) line requires an acknowledgment from main memory. However, this requires the flush to be stored in a 'write-back' buffer until the reply arrives (to ensure the change is successfully propagated to memory).[[#References|&amp;lt;sup&amp;gt;[23]&amp;lt;/sup&amp;gt;]]&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==MESIF==&lt;br /&gt;
Let us now briefly walk through the '''MESIF protocol''':&lt;br /&gt;
&lt;br /&gt;
The '''MESIF''' protocol, used in the latest Intel multi-core processors, was introduced to '''accommodate the point-to-point''' links used in the QuickPath Interconnect. Using the '''MESI''' protocol in this architecture would send many redundant messages between different processors, often with unnecessarily high latency. For example, when a processor requests a cache line that is stored in multiple locations, every location might respond with the data. As the requesting processor only needs a single copy of the data, the system would be wasting bandwidth.&lt;br /&gt;
As a solution to this problem, an additional state, the '''Forward state''', was added by slightly changing the role of the Shared state. Whenever there is a read request, only the cache holding the line in the F state responds to the request, while all the caches holding it in the S state remain dormant.  Hence, by designating a single copy to '''respond to requests''', coherency traffic is substantially reduced when multiple copies of the data exist. Also, on a read request, the responding copy transitions from the F state to the S state. That is, when a cache line in the '''F''' state is '''copied''', the F state '''migrates''' to the '''newer copy''', while the '''older''' one drops back to '''S'''. Moving the new copy to the F state '''exploits''' both '''temporal and spatial locality'''. Because the newest copy of the cache line is always in the F state, it is very unlikely that the line in the F state will be evicted from the caches. This takes advantage of the temporal locality of the request. The second advantage is that if a particular cache line is in high demand due to spatial locality, the bandwidth used to transmit that data will be spread across several nodes.&lt;br /&gt;
All M-to-S and E-to-S state transitions now become '''M to F''' and '''E to F''' transitions.&lt;br /&gt;
The '''F state''' is '''different''' from the '''Owned state''' of the MOESI protocol in that it is '''not''' a unique copy, because a valid copy is stored in memory. Thus, unlike the Owned state of the MOESI protocol, in which the data in the O state is the only valid copy of the data, the data in the F state can be evicted or converted to the S state, if desired.&lt;br /&gt;
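&lt;br /&gt;
The migration of the Forward state on a read can be sketched as follows. This is a simplified Python illustration that only covers lines held in the F, S or I states; the cache names and the function mesif_read are hypothetical, and the case where no other cache holds the line (where the requester would normally take the E state) is deliberately left out.&lt;br /&gt;

```python
# Sketch of how the Forward state migrates on a read in MESIF.
# caches maps a hypothetical cache name to its state for one line.
# Simplification: only F, S and I copies are modelled.
def mesif_read(caches, requester):
    """Serve a read miss; return which agent supplied the data."""
    holders = [name for name, state in caches.items() if state == "F"]
    if holders:
        responder = holders[0]        # only the F copy answers
        caches[responder] = "S"       # the older copy drops back to S
    else:
        responder = "memory"          # S copies stay silent
    caches[requester] = "F"           # the newest copy becomes the forwarder
    return responder


caches = {"P0": "F", "P1": "S", "P2": "I"}
print(mesif_read(caches, "P2"), caches)
```

Only one cache answers the request even though two caches hold valid copies, which is the traffic reduction the F state is meant to provide.&lt;br /&gt;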
&lt;br /&gt;
More information on the QuickPath Interconnect and MESIF protocol can be found at&lt;br /&gt;
'''[http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]'''&lt;br /&gt;
&lt;br /&gt;
==MOESI==&lt;br /&gt;
The [http://en.wikipedia.org/wiki/Opteron AMD Opteron] was AMD’s first-generation dual-core processor, with two distinct [http://en.wikipedia.org/wiki/Athlon_64 K8 cores] together on a single die.  Cache coherence poses bigger problems on such multiprocessors, and an appropriate coherence protocol was needed to address them. The [http://en.wikipedia.org/wiki/Xeon Intel Xeon], the competing product from Intel, used the MESI protocol to handle cache coherence.  MESI came with the drawback of using a lot of time and bandwidth in certain situations.&lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/MOESI_protocol MOESI] was AMD’s answer to this problem. MOESI adds a fifth state, called '''“Owned”''', to the MESI protocol. MOESI addresses the bandwidth problem that arises in the MESI protocol when a processor holding an invalid copy of a line wants to access data that another processor has modified.  The requesting processor has to wait for the processor that modified the data to write it back to main memory, which takes time and bandwidth. MOESI removes this drawback by allowing dirty sharing.  When the data is held by a processor in the new state '''“Owned”''', that processor can provide the modified data to other processors without, or even before, writing it to main memory. This is called '''''dirty sharing'''''. The processor holding the data in the '''&amp;quot;Owned&amp;quot;''' state remains responsible for updating main memory later, when the cache line is evicted.&lt;br /&gt;
&lt;br /&gt;
MOESI has become one of the most popular snoop-based protocols supported in the AMD64 architecture.  The AMD dual-core Opteron can maintain cache coherence in systems up to 8 processors using this protocol.&lt;br /&gt;
&lt;br /&gt;
The five different states of the MOESI protocol are:&lt;br /&gt;
* '''Modified (M)''' : The cache line holds the most recent copy of the data, and the data is not present in any other processor's cache.&lt;br /&gt;
* '''Owned (O)'''   : The cache line holds the most recent, correct copy of the data, which may be shared by other processors. The processor holding the line in this state is responsible for writing the correct value back to main memory when the line is evicted.&lt;br /&gt;
* '''Exclusive (E)''' : A cache line holds the most recent, correct copy of the data, which is exclusively present on this processor and a copy is present in the main memory.  &lt;br /&gt;
* '''Shared (S)''' : A cache line in the shared state holds the most recent, correct copy of the data, which may be shared by other processors. &lt;br /&gt;
* '''Invalid (I)''' : A cache line does not hold a valid copy of the data.&lt;br /&gt;
&lt;br /&gt;
A detailed explanation of this protocol implementation on AMD processor can be found in the manual [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of the AMD 64-bit core]&lt;br /&gt;
&lt;br /&gt;
The following table summarizes the MOESI protocol:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Owned'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''This cache line is valid?'''&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  No&lt;br /&gt;
|-&lt;br /&gt;
|  '''The memory copy is…'''&lt;br /&gt;
|  out of date&lt;br /&gt;
|  out of date&lt;br /&gt;
|  valid&lt;br /&gt;
|  valid&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies exist in caches of other processors?'''&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe&lt;br /&gt;
|  Maybe&lt;br /&gt;
|-&lt;br /&gt;
|  '''A write to this line'''&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  goes directly to bus&lt;br /&gt;
|}&lt;br /&gt;
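&lt;br /&gt;
The difference dirty sharing makes can be sketched by comparing what happens when another cache reads a line that is held in the Modified state. This is a minimal Python illustration; the function name and the shape of its return value are assumptions made for this example.&lt;br /&gt;

```python
# Sketch of dirty sharing: what happens when another cache reads a line
# that this cache holds in M. Names and return values are hypothetical.
def snoop_read_of_modified(protocol):
    """Return (owner_next_state, requester_state, memory_written)."""
    if protocol == "MESI":
        # The dirty line is flushed to memory and both copies become clean.
        return ("S", "S", True)
    if protocol == "MOESI":
        # The owner supplies the data cache to cache; memory stays stale
        # until the Owned line is eventually evicted.
        return ("O", "S", False)
    raise ValueError(protocol)


for protocol in ("MESI", "MOESI"):
    print(protocol, snoop_read_of_modified(protocol))
```

Under MOESI the write to main memory is deferred, which is where the saving in time and bandwidth comes from.&lt;br /&gt;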
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
State transition diagram for MOESI protocol is shown below. The top part shows the response to processor-side requests and the bottom part is the response to snooper-side requests. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:MOESI.jpg|500px|thumb|center|MOESI state transition diagram]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br /&amp;gt;&lt;br /&gt;
&amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Optimization techniques on MOESI===&lt;br /&gt;
&lt;br /&gt;
In real machines, applying optimization techniques to the standard cache coherence protocol improves the performance of the machine. For example, the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] family of microprocessors (Family 0x10), which is AMD’s first generation to incorporate 4 distinct cores on a single die and the first to have a cache that all the cores share, uses the MOESI protocol with some optimization techniques incorporated.&lt;br /&gt;
&lt;br /&gt;
The optimization targets a small subset of compute problems that behave like producer-consumer programs. In such a problem, a thread of a program running on one core produces data, which is consumed by a thread running on a separate core. With such programs, it is desirable to get the two distinct cores to communicate through the shared cache, to avoid round trips to and from main memory. The '''MOESI''' protocol that the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] cache uses for cache coherence can also limit bandwidth. Hence, by keeping the cache line in the '''‘M’''' state for such computing problems, we can achieve better performance.&lt;br /&gt;
&lt;br /&gt;
When the producer thread writes a new entry, it allocates cache lines in the '''modified (M)''' state. Eventually, these M-marked cache lines start to fill the L3 cache. When the consumer reads the cache line, the MOESI protocol changes the state of the cache line to '''owned (O)''' in the L3 cache and pulls down a '''shared (S)''' copy for its own use. Now, the producer thread circles the ring buffer and arrives back at the same cache line it had previously written. However, when the producer attempts to write new data to the owned (marked '''‘O’''') cache line, it finds that it cannot, since a cache line marked '''‘O’''' by the previous consumer read does not have sufficient permission for a write request (in the MOESI protocol). To maintain coherence, the memory controller must initiate probes in the other caches (to handle any other S copies that may exist). This slows down the process.&lt;br /&gt;
&lt;br /&gt;
Thus, it is preferable to keep the cache line in the '''‘M’''' state in the L3 cache. In such a situation, when the producer comes back around the ring buffer, it finds the previously written cache line still marked '''‘M’''', to which it is safe to write without coherence concerns. Thus better performance can be achieved by such optimization techniques to standard protocols when implemented in real machines.&lt;br /&gt;
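&lt;br /&gt;
The effect of this optimization on a ring buffer can be sketched with a toy model that counts how often the producer's write needs coherence probes. This is an illustrative Python sketch, not AMD's implementation: the function count_probes and its keep_modified switch are hypothetical stand-ins for the behaviour described above.&lt;br /&gt;

```python
# Toy model of the producer/consumer ring buffer discussed above.
# It counts how often the producer's write needs coherence probes.
# keep_modified is a hypothetical switch standing in for the optimization.
def count_probes(slots, laps, keep_modified):
    l3_state = ["I"] * slots          # state of each ring slot in the shared L3
    probes = 0
    for _ in range(laps):
        for i in range(slots):
            # Producer write: a line in O lacks write permission,
            # so the other caches must be probed first.
            if l3_state[i] == "O":
                probes += 1
            l3_state[i] = "M"
            # Consumer read: plain MOESI demotes the L3 copy from M to O.
            if not keep_modified:
                l3_state[i] = "O"
    return probes


print(count_probes(slots=4, laps=3, keep_modified=False))
print(count_probes(slots=4, laps=3, keep_modified=True))
```

With plain MOESI every write after the first lap triggers probes, while keeping the line in the M state avoids them entirely in this model.&lt;br /&gt;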
&lt;br /&gt;
You can find more information on how this is implemented and various other ways of optimizations in this manual [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors]&lt;br /&gt;
&lt;br /&gt;
==Dragon Protocol==&lt;br /&gt;
The '''[http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]''' is an update-based coherence protocol: unlike the coherence protocols we have seen so far, it does not invalidate other cached copies. Write propagation is achieved by updating the cached copies instead of invalidating them.  The Dragon Protocol does not update memory on a cache-to-cache transfer; it delays making memory consistent with the caches until the data is evicted and written back, which saves time and lowers the memory access requirements. Moreover, only the written '''byte''' or '''word''' is '''communicated''' to the '''other caches''' instead of the whole block, which further '''reduces''' the '''bandwidth''' usage. The protocol can dynamically detect the sharing status of a block, and uses a write-through policy for shared blocks and a write-back policy for currently non-shared blocks. The Dragon Protocol employs the following four states for the cache blocks: '''Shared Clean''', '''Shared Modified''', '''Exclusive''' and '''Modified'''.&lt;br /&gt;
* '''Modified (M)''' and '''Exclusive (E)''' - these states have the same meaning as explained in the protocols above.&lt;br /&gt;
* '''Shared Modified (Sm)''' - Potentially two or more caches have this block, memory may or may not be up to date, and this processor's cache has modified the block. Only one cache in the system can hold the block in the Shared Modified state.&lt;br /&gt;
* '''Shared Clean (Sc)''' -  Potentially two or more caches have this block and memory may or may not be up to date (if no other cache has it in the Sm state, memory is up to date; otherwise it is not).&lt;br /&gt;
Only when a Shared Modified line is evicted from the cache on a cache miss is the block written back to main memory in order to keep memory consistent. For more information on the Dragon protocol, refer to the Solihin textbook, page 229. The state transition diagram is given below for reference.&lt;br /&gt;
&lt;br /&gt;
[[File:DragonNew.jpg|500px|thumb|center|Dragon state transition diagram]] &lt;br /&gt;
&lt;br /&gt;
The Dragon protocol implements snoopy caches that provide the appearance of a uniform memory space to multiple processors. Here, each cache listens to two buses: the processor bus and the memory bus. The caches are also responsible for address translation, so the processor bus carries virtual addresses and the memory bus carries physical addresses.  The Dragon system was designed to support 4 to 8 Dragon processors.&lt;br /&gt;
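&lt;br /&gt;
Update-based write propagation can be sketched as follows. This is a simplified Python illustration rather than the actual Dragon hardware: the dictionary layout and the helper dragon_write are hypothetical, and the caches dictionary is assumed to contain only the caches that currently hold the block.&lt;br /&gt;

```python
# Sketch of update-based write propagation in the style of Dragon.
# caches holds only the caches that currently have a copy of the block;
# the dictionary layout and the helper dragon_write are hypothetical.
def dragon_write(caches, writer, index, value):
    """Write one word; return the number of words sent on the bus."""
    caches[writer]["data"][index] = value
    sharers = [name for name in caches if name != writer]
    if not sharers:
        caches[writer]["state"] = "M"         # private block: write back later
        return 0
    for name in sharers:
        caches[name]["data"][index] = value   # the update carries a single word
        caches[name]["state"] = "Sc"
    caches[writer]["state"] = "Sm"            # writer now owns the dirty block
    return 1


caches = {
    "P0": {"state": "Sc", "data": [1, 2, 3, 4]},
    "P1": {"state": "Sc", "data": [1, 2, 3, 4]},
}
sent = dragon_write(caches, "P0", 2, 99)
print(sent, caches["P0"]["state"], caches["P1"])
```

The other copy is updated in place and stays valid, and main memory is not touched until the Shared Modified block is evicted.&lt;br /&gt;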
&lt;br /&gt;
=[http://en.wikipedia.org/wiki/Instruction_prefetch Instruction Prefetching]=&lt;br /&gt;
&lt;br /&gt;
In standard [http://en.wikipedia.org/wiki/Pipeline_(computing) CPU pipelining], instruction prefetching is an optimization used to minimize the time the CPU spends waiting for instructions to be fetched from main memory.  Prefetching works perfectly on completely linear programs; but on programs with branches, other optimizations (like [http://en.wikipedia.org/wiki/Branch_predictor branch prediction]) have to be implemented, and the cost of following the wrong branch has to be considered.  Source:  http://dictionary.reference.com/browse/instruction+prefetch&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
Instruction prefetching is a technique used to speed up the execution of a program. In multiprocessor systems, however, prefetching comes at the cost of performance. Because of prefetching, data can be modified in a manner that the memory coherence protocol is not able to handle. In such situations software must use serializing instructions or cache-invalidation instructions to guarantee that subsequent data accesses are coherent.&lt;br /&gt;
&amp;lt;br /&amp;gt;&lt;br /&gt;
An example of such a situation is a page-table update followed by accesses to the physical pages referenced by the updated page tables. The physical-memory references for the page tables are different from the physical-memory references for the data, so prefetching can cause correctness problems. The following sequence of events shows such a situation, in which software changes the translation of virtual-page A from physical-page M to physical-page N:&lt;br /&gt;
# The tables that translate virtual-page A to physical-page M are now held only in main memory. The copies in the cache are invalidated.&lt;br /&gt;
# Software changes the page-table entry for virtual-page A in main memory to point to physical-page N rather than physical-page M.&lt;br /&gt;
# Data in virtual-page A is accessed.&lt;br /&gt;
&lt;br /&gt;
Software expects the processor to access the data from physical-page N after the update. However, it is possible for the processor to prefetch the data from physical-page M before the page table for virtual page A is updated. Because the physical-memory references are different, the processor does not recognize them as requiring coherence checking and believes it is safe to prefetch the data from virtual-page A, which is translated into a read from physical page M. Similar behavior can occur when instructions are prefetched from beyond the page-table update instruction.&lt;br /&gt;
&lt;br /&gt;
To prevent such errors, software must execute special instructions (on AMD64, INVLPG or MOV CR3) immediately after the page-table update to ensure that subsequent instruction fetches and data accesses use the correct virtual-page-to-physical-page translation. It is not necessary to perform a TLB invalidation operation preceding the table update.&lt;br /&gt;
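&lt;br /&gt;
The page-table scenario above can be illustrated with a toy model. This is only a sketch under stated assumptions: the dictionaries and the functions access and invalidate are hypothetical stand-ins for a TLB, a page table and a TLB-invalidating instruction, not a model of any real processor.&lt;br /&gt;
&lt;br /&gt;
```python
# Toy model (all names hypothetical): a TLB that caches translations
# and a page table held in "main memory".
page_table = {"A": "M"}                      # virtual page A maps to physical page M
physical = {"M": "old data", "N": "new data"}
tlb = {}


def access(vpage):
    """Translate through the TLB, filling it from the page table on a miss."""
    if vpage not in tlb:
        tlb[vpage] = page_table[vpage]
    return physical[tlb[vpage]]


def invalidate(vpage):
    """Stand-in for a TLB-invalidating instruction run after the update."""
    tlb.pop(vpage, None)


access("A")                # an early (prefetched) access caches the mapping A to M
page_table["A"] = "N"      # software updates the page-table entry
stale = access("A")        # the cached translation still points at page M
invalidate("A")            # invalidation is placed after the update
fresh = access("A")        # the translation is re-read and points at page N
```

In this sketch the invalidation is needed only after the update, which mirrors the statement that no invalidation is required before it.&lt;br /&gt;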
&lt;br /&gt;
More information can be found about this in [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual]&lt;br /&gt;
&lt;br /&gt;
= CMP Implementation in Intel Architecture =&lt;br /&gt;
&lt;br /&gt;
Let us now see how Intel architecture using the MESI protocol progressed from a uniprocessor architecture to a Chip MultiProcessor (CMP) using the bus as the interconnect. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Uniprocessor Architecture'''&lt;br /&gt;
&lt;br /&gt;
The diagram below shows the structure of the memory cluster in the Intel Pentium M processor.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:intel_cache1.jpg]]&amp;lt;/center&amp;gt; &amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In this structure we have,&lt;br /&gt;
* A unified on-chip '''L1 cache''' with the '''processor/core''',&lt;br /&gt;
* A '''Memory/L2 access control unit''', through which all the accesses to the L2 cache, main memory and IO space are made,&lt;br /&gt;
* The second level '''L2 cache''' along with the '''prefetch unit''' and&lt;br /&gt;
* '''Front side bus (FSB)''', a single shared bi-directional bus through which all traffic is sent. These wide buses bring in multiple data bytes at a time. &lt;br /&gt;
&lt;br /&gt;
As Intel explains it, in this structure processor requests were first sought in the '''L2 cache''' and were '''forwarded''' to main '''memory''' via the front side bus ('''FSB''') only on a '''miss'''. The '''Memory/L2 access control''' unit served as a central point for '''maintaining coherence''' within the core and with the external world. It '''contains''' a '''snoop control unit''' that receives snoop requests from the bus and performs the required operations on each cache (and internal buffers) in parallel. It also handles RFO requests (BusUpgr) and ensures that the operation continues only after it guarantees that no other version of the cache line exists in any other cache in the system.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''CMP Architecture'''&lt;br /&gt;
&lt;br /&gt;
For the CMP implementation, Intel chose a bus-based architecture using snoopy protocols over a '''directory protocol''': although a directory protocol reduces active power because of reduced snoop activity, it '''increases''' the '''design complexity''' and the '''static power''' because of larger tag arrays. Since Intel has a large market for processors in the mobility family, a directory-based solution was less favorable, as battery life depends mainly on static power consumption and less on dynamic power.&lt;br /&gt;
Let us examine how '''CMP''' was implemented in the '''Intel Core Duo''', which was one of the first dual-core processors for the budget/entry-level market. &lt;br /&gt;
The general CMP implementation structure of the Intel Core Duo is shown below.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:intel_cache2.jpg]]&amp;lt;/center&amp;gt; &amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This structure has the following changes when compared to the uniprocessor memory cluster structure. &lt;br /&gt;
* '''L1 cache''' and the '''processor/core''' structure is '''duplicated''' to give 2 cores.&lt;br /&gt;
* The '''Memory/L2 access control''' unit is '''split''' into 2 logical units: '''L2 controller''' and '''bus controller'''. The L2 controller handles all '''requests to the L2''' cache from the core and the snoop requests from the FSB. The '''bus controller''' handles '''data and I/O requests''' to and from the FSB.&lt;br /&gt;
* The '''prefetching''' unit is extended to handle the hardware '''prefetches for each core separately'''.&lt;br /&gt;
* A '''new logical unit''' (represented by the hexagon) was added to maintain '''fairness between the requests''' coming from the different cores and hence balance the requests to L2 and memory.&lt;br /&gt;
&lt;br /&gt;
This new '''partitioned structure''' for the  memory/L2 access control unit '''enhanced''' the '''performance''' while '''reducing power consumption'''. &lt;br /&gt;
For more information on uniprocessor and multiprocessor implementation under the Intel architecture, refer to &lt;br /&gt;
[http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]&lt;br /&gt;
&lt;br /&gt;
The '''Intel bus architecture''' has been '''evolving''' to accommodate the demands of scalability while using the same MESI protocol: from a '''single shared bus''', to '''dual independent buses (DIB)''', which doubled the available bandwidth, and finally to the logical conclusion of DIB, '''dedicated high-speed interconnects (DHSI)'''. The DHSI-based platforms use four FSBs, one for each processor in the platform. In both DIB and DHSI, a snoop filter in the chipset caches snoop information, thereby significantly reducing the broadcasting needed for snoop traffic on the buses. With the production of processors based on the next-generation 45-nm Hi-k Intel Core microarchitecture, the [http://en.wikipedia.org/wiki/Xeon Intel Xeon] processor fabric will transition from DHSI, with the memory controller in the chipset, to a distributed shared memory architecture using '''Intel QuickPath Interconnects with the MESIF protocol'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Word Invalidation=&lt;br /&gt;
One source of complexity that applies to a number of the protocols is invalidation.  In newer protocols, individual words may be modified in a cache line, as opposed to the entirety of the line.  The other processors will thus have a mostly correct cache line that differs in only one word.  This leads to potential complexity, because to 'correct' the error the other processors must transition from shared to invalid, then read the correct word from the bus and place it back into the line, transitioning back into shared.  This transition is potentially unnecessary, as the second processor may never access the specific word that changed, even though it may access other words in the cache line.  One potential solution being researched at the present time is to advance the protocols so that a cache line is &amp;quot;not invalidated on the first dirty word, but after the number of dirty words crosses some predetermined value, which is data type and application dependent.&amp;quot;  In other words, if the application can tolerate multiple 'incorrect' words, several transitions to and from the invalid state may be avoided.&lt;br /&gt;
&amp;lt;br/&amp;gt;Source: http://tab.computer.org/tcca/NEWS/sept96/dsmideas.ps&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In essence, the solution proposed here is to advance the MOESI protocol with word invalidation and specific treatment of temporal and spatial data, so that the block is not invalidated.&lt;br /&gt;
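&lt;br /&gt;
The threshold idea can be sketched as follows. This is a minimal illustration, not the cited proposal's design: the class ThresholdLine, its threshold parameter and its method names are hypothetical, and the model tracks only which words of a remote copy have gone stale.&lt;br /&gt;
&lt;br /&gt;
```python
class ThresholdLine:
    """Remote copy of a cache line that tolerates a few stale words."""

    def __init__(self, threshold):
        self.threshold = threshold   # hypothetical, tuned per data type/application
        self.stale_words = set()     # indices of words another processor has written
        self.valid = True

    def remote_write(self, word):
        """Record that another processor modified one word of this line."""
        self.stale_words.add(word)
        # Invalidate only once the stale-word count crosses the threshold.
        if len(self.stale_words) not in range(self.threshold + 1):
            self.valid = False

    def can_read(self, word):
        """A local read hits only on a valid line and a word that is not stale."""
        return self.valid and word not in self.stale_words


line = ThresholdLine(threshold=2)
line.remote_write(3)     # one stale word: the line as a whole stays valid
```

With a threshold of 2, the line in this sketch survives two remote word writes and is invalidated on the third, so reads of untouched words keep hitting until then.&lt;br /&gt;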
&lt;br /&gt;
=Performance: MOESI vs MEI/MESI=&lt;br /&gt;
Early multiprocessors (such as the PowerPC processors) were designed to work with three states (Modified, Exclusive, Invalid - this is similar to the MSI protocol discussed earlier, the term MEI is used to remain consistent with the sources used for reference).  As time progressed, more multi-processors transitioned to the MESI protocol.  This is most likely due to the characteristics of MEI - it is &amp;quot;easy to implement but leads to inefficiencies in the way that memory bus bandwidth is used&amp;quot; [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 source].  As a result, systems with two or more MEI processors (most likely) will not see the 'full' potential of the extra processors, due to the inefficient bus transactions.&lt;br /&gt;
&lt;br /&gt;
In contrast, the five-state protocol MOESI is more complex to implement than MESI and MEI/MSI.  However, advantages arise from using it.  As Andy Keane (VP of Marketing for PMC-Sierra) put it, &amp;quot;this [fifth] state allows shared data that is dirty to remain in the cache.  Without this state, any shared line would be written back to memory to change the original state from modified. Since we have a dedicated, fast path from CPU to CPU, this state maximizes the use of this path rather than the path to memory, which is inherently slower&amp;quot; [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 source].  This advantage can be realized by allowing MOESI to make some processor-to-processor transfers from level 2 caches.&lt;br /&gt;
&lt;br /&gt;
Source: [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 Edwards, Chris (08/06/2001). &amp;quot;Architects wrestle with multiprocessor options&amp;quot;. Electronic engineering times (0192-1541), (1178), p. 48.]&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Cache_coherence Cache coherence]&lt;br /&gt;
# [http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]&lt;br /&gt;
# [http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Symmetric_multiprocessing Common System Interface in Intel Processors]&lt;br /&gt;
# [http://www.zak.ict.pwr.wroc.pl/nikodem/ak_materialy/Cache%20consistency%20&amp;amp;%20MESI.pdf Cache consistency with MESI on Intel processor]&lt;br /&gt;
# [http://techreport.com/articles.x/8236/2 AMD dual core Architecture]&lt;br /&gt;
# [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual]&lt;br /&gt;
# [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors]&lt;br /&gt;
# [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of AMD 64 bit core]&lt;br /&gt;
# [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&amp;amp;arnumber=4913 Silicon Graphics Computer Systems]&lt;br /&gt;
# [http://books.google.com/books?id=g82fofiqa5IC&amp;amp;printsec=frontcover&amp;amp;dq=Parallel+computer+architecture:+a+hardware/software+approach+By+David+E.+Culler,+Jaswinder+Pal+Singh,+Anoop+Gupta&amp;amp;source=bl&amp;amp;ots=COrdamlfVn&amp;amp;sig=YcugVqbzTjHvlofvaFq6Ft_tjfY&amp;amp;hl=en&amp;amp;ei=0ZO6S4TJGcOclgejzI3BBw&amp;amp;sa=X&amp;amp;oi=book_result&amp;amp;ct=result&amp;amp;resnum=1&amp;amp;ved=0CAgQ6AEwAA#v=onepage&amp;amp;q=&amp;amp;f=false Parallel computer architecture: a hardware/software approach By David E. Culler, Jaswinder Pal Singh, Anoop Gupta]&lt;br /&gt;
# [http://www.freepatentsonline.com/5283886.html Three state invalidation protocols]&lt;br /&gt;
# [http://portal.acm.org/citation.cfm?id=1499317&amp;amp;dl=GUIDE&amp;amp;coll=GUIDE&amp;amp;CFID=83027384&amp;amp;CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems]&lt;br /&gt;
# [http://portal.acm.org/citation.cfm?id=6514 Cache Coherence protocols: evaluation using a multiprocessor simulation model]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Xerox_Dragon Xerox Dragon]&lt;br /&gt;
# [http://thanaseto.110mb.com/courses/CSD-527-report-engl.pdf Coherence Protocols]&lt;br /&gt;
# [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&amp;amp;arnumber=289691 XDBus]&lt;br /&gt;
# [http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dai0228a/index.html ARM]&lt;br /&gt;
# [http://ixbtlabs.com/articles/ibmpower4/index.html IBM POWER4]&lt;br /&gt;
# Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems. Solihin, Yan&lt;br /&gt;
# [http://www.chiplist.com/Intel_Itanium_2_processor/tree3f-section--2235-/ Intel Itanium 2]&lt;br /&gt;
# [http://techreport.com/articles.x/8236/2 MESI-MESI-MOESI Banana-fana...]&lt;br /&gt;
# [http://rsim.cs.illinois.edu/rsim/Manual/node109.html RSIM]&lt;br /&gt;
# [http://dl.acm.org/citation.cfm?id=1499317 Synapse tightly coupled multiprocessors]&lt;/div&gt;</summary>
		<author><name>Pxu</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8a_fu&amp;diff=59824</id>
		<title>CSC/ECE 506 Spring 2012/8a fu</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8a_fu&amp;diff=59824"/>
		<updated>2012-03-19T01:44:01Z</updated>

		<summary type="html">&lt;p&gt;Pxu: /* MSI Protocol */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Introduction to bus-based cache coherence in real machines=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==SMP Architecture==&lt;br /&gt;
Most parallel software in the commercial market relies on the shared-memory programming model, in which all processors access the same physical address space. The most common multiprocessors today use the [http://en.wikipedia.org/wiki/Symmetric_multiprocessing SMP] architecture, which uses a common bus as the interconnect.  In the case of multicore processors (&amp;quot;chip multiprocessors,&amp;quot; or CMP), the SMP architecture applies to the cores, treating them as separate processors. The key problem of shared-memory multiprocessors is providing a consistent view of memory with various cache hierarchies.  This is called the '''''cache coherence problem'''''. Solving it is critical for correctness and is a performance-sensitive design point in supporting the shared-memory model. The cache coherence mechanisms not only govern communication in a shared-memory multiprocessor, but also typically determine how the memory system transfers data between processors, caches, and memory.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:Busbased SMP.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
At any point in logical time, the permissions for a cache block can allow either a single writer or multiple readers. The '''''coherence protocol''''' ensures that the invariants of the states are maintained. The different coherence states used by most cache coherence protocols are shown in ''Table 1'':&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''States'''&lt;br /&gt;
|  '''Access Type'''&lt;br /&gt;
|  '''Invariant'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  read, write&lt;br /&gt;
|  all other caches in I state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  read&lt;br /&gt;
|  all other caches in I state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Owned'''&lt;br /&gt;
|  read&lt;br /&gt;
|  all other caches in I or S state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  read&lt;br /&gt;
|  no other cache in M or E state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|  -&lt;br /&gt;
|  -&lt;br /&gt;
|}&lt;br /&gt;
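&lt;br /&gt;
The invariants in ''Table 1'' can be expressed as a small checker. This is a sketch only: the function name coherent and the one-letter-per-cache encoding are assumptions made for illustration.&lt;br /&gt;
&lt;br /&gt;
```python
def coherent(states):
    """Check the Table 1 invariants for one block.

    states holds one letter per cache: "M", "O", "E", "S" or "I".
    """
    exclusive_like = states.count("M") + states.count("E")
    if exclusive_like:
        # M or E requires every other cache to be in the I state.
        return exclusive_like == 1 and states.count("I") == len(states) - 1
    # O tolerates only I or S elsewhere, so there is at most one owner.
    return states.count("O") in (0, 1)


ok = coherent(["M", "I", "I"])      # a single writer, no other copies
bad = coherent(["M", "S", "I"])     # a reader coexists with a writer
```

Any combination of S and I copies passes the check, which corresponds to the multiple-readers case.&lt;br /&gt;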
&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Snooping Protocols=&lt;br /&gt;
==MSI Protocol==&lt;br /&gt;
&lt;br /&gt;
'''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' is a three-state write-back '''invalidation protocol''' and one of the earliest snooping-based cache coherence protocols. It marks each cache line as being in the '''Modified (M)''', '''Shared (S)''' or '''Invalid (I)''' state. '''Invalid''' means the cache line is either not present or is in the invalid state. If the cache line is clean and may be shared by more than one processor, it is marked '''Shared'''. If the cache line is dirty and the processor has exclusive ownership of it, it is in the '''Modified''' state. A BusRdX causes other caches to invalidate (demote) their copies to the '''I''' state; if the line is present in the '''M''' state in another cache, that cache flushes it. A processor write, even if it hits a line in the '''S''' state, issues a BusRdX and promotes (upgrades) the line to the '''M''' state.&lt;br /&gt;
&lt;br /&gt;
The state diagram of the MSI protocol is shown below. Note that the left part shows the response to processor-side requests, and the right part shows the response to snooper-side requests.&amp;lt;ref&amp;gt;Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems. Solihin, Yan&amp;lt;/ref&amp;gt;&lt;br /&gt;
[[File:MSInew.jpg|600px|thumb|center|MSI state transition diagram]]&lt;br /&gt;
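&lt;br /&gt;
The MSI transitions described in this section can also be written as a lookup table. This is a sketch that follows the common textbook formulation; the event names PrRd, PrWr, BusRd and BusRdX and the table layout are assumptions for illustration, not a transcription of the diagram.&lt;br /&gt;
&lt;br /&gt;
```python
# (state, event) maps to (next state, bus action); None means no bus traffic.
MSI = {
    # Processor-side requests
    ("I", "PrRd"): ("S", "BusRd"),
    ("I", "PrWr"): ("M", "BusRdX"),
    ("S", "PrRd"): ("S", None),
    ("S", "PrWr"): ("M", "BusRdX"),   # a write hit in S still goes to the bus
    ("M", "PrRd"): ("M", None),
    ("M", "PrWr"): ("M", None),
    # Snooper-side requests
    ("I", "BusRd"): ("I", None),
    ("I", "BusRdX"): ("I", None),
    ("S", "BusRd"): ("S", None),
    ("S", "BusRdX"): ("I", None),
    ("M", "BusRd"): ("S", "Flush"),
    ("M", "BusRdX"): ("I", "Flush"),
}


def step(state, event):
    """Look up the next state and bus action for one cache line."""
    return MSI[(state, event)]


after_write = step("S", "PrWr")
```

Keeping the processor-side and snooper-side entries in one table mirrors the two halves of the state diagram.&lt;br /&gt;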
&lt;br /&gt;
==Synapse protocol==&lt;br /&gt;
From the state transition diagram of MSI, we observe that there is a transition from state '''M''' to state '''S''' when a BusRd is observed for that block. The contents of the block are flushed to the bus before going to the '''S''' state. In certain cases it would be more appropriate to move to the '''I''' state, thus giving up the block entirely. The choice between moving to '''S''' or '''I''' reflects the designer's assertion about whether the original processor is more likely to continue reading the block than the new processor is to write to it. The Synapse protocol, used in the early Synapse multiprocessor, made the alternate choice of going directly from the '''M''' state to the '''I''' state on a BusRd, assuming the migratory pattern would be more frequent. More details about this protocol can be found in these papers published in the 1980s: [http://portal.acm.org/citation.cfm?id=6514 Cache Coherence protocols: evaluation using a multiprocessor simulation model] and [http://portal.acm.org/citation.cfm?id=1499317&amp;amp;dl=GUIDE&amp;amp;coll=GUIDE&amp;amp;CFID=83027384&amp;amp;CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems].&lt;br /&gt;
&lt;br /&gt;
In the Synapse protocol, the '''M''' state is called the '''D''' (Dirty) state. The following is the state transition diagram for the Synapse protocol:&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:Synapse1.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
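The design choice that separates Synapse from MSI can be isolated in a few lines. This is a minimal sketch: the function on_bus_read and the protocol labels are hypothetical, and the dirty state is written as M for both protocols even though Synapse calls it D.&lt;br /&gt;
&lt;br /&gt;
```python
def on_bus_read(state, protocol):
    """Next state of a snooping cache that observes a BusRd for its block."""
    if state == "M":
        # Both protocols flush the dirty block; they differ in what they keep.
        return "I" if protocol == "synapse" else "S"
    return state


msi_next = on_bus_read("M", "msi")
synapse_next = on_bus_read("M", "synapse")
```

Only the dirty-state case differs, which is why the rest of the state machine is left out of this sketch.&lt;br /&gt;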
===Real Architectures using Synapse===&lt;br /&gt;
No information could be found on any actual architecture using the Synapse cache-coherence protocol.  Judging from the abstract of the 1984 paper by Steve Frank and Armond Inselberg, &amp;quot;[http://dl.acm.org/citation.cfm?id=1499317 Synapse Tightly Coupled Multiprocessors: A New Approach to Solve Old Problems]&amp;quot;, Synapse was theoretical.&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[http://www.disi.unige.it/person/DelzannoG/CacheProtocol/synapse.hy The BABYLON Project] provides one example of a Synapse cache coherence protocol with atomic synchronization actions.&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
In the Synapse system, memory contention is reduced by designing a processor cache that employs a non-write-through algorithm to minimize bandwidth between cache and shared memory. The Synapse Expansion Bus includes an ownership-level protocol between processor caches.[[#References|&amp;lt;sup&amp;gt;[24]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
==[http://en.wikipedia.org/wiki/MESI_protocol MESI]==&lt;br /&gt;
The main drawback of MSI is that each read-write sequence incurs two bus transactions irrespective of whether the cache line is stored in only one cache or not. Highly parallel programs that have little data sharing suffer the most from this.  MESI protocol solves this problem by introducing the '''Exclusive''' state to distinguish between a cache line stored in multiple caches and a line stored in a single cache.&lt;br /&gt;
Let us briefly see how the MESI protocol works. For a more detailed version, refer to the Solihin textbook, page 215.&lt;br /&gt;
&lt;br /&gt;
The MESI coherence protocol marks each cache line as being in one of the Modified, Exclusive, Shared, or Invalid states. &lt;br /&gt;
* '''Invalid''' : The cache line is either not present or is invalid&lt;br /&gt;
* '''Exclusive''' : The cache line is clean and is owned by this core/processor only&lt;br /&gt;
* '''Modified''' :  The cache line is dirty and the core/processor has exclusive ownership of the cache line, exclusive of the memory also.&lt;br /&gt;
* '''Shared''' : The cache line is clean and is shared by more than one core/processor&lt;br /&gt;
&lt;br /&gt;
The MESI protocol works as follows: &lt;br /&gt;
A line that is fetched receives the '''E''' or '''S''' state depending on whether it exists in other processors in the system. A cache line gets the '''M''' state when a processor writes to it; if the line is not in the '''E''' or '''M''' state prior to the write, the cache sends a Bus Upgrade (BusUpgr) signal or, as the Intel manuals term it, a “Read-For-Ownership (RFO) request”, which ensures that the line exists in the cache and is in the '''I''' state in all other processors on the bus (if any). The table below summarizes the MESI protocol.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''This cache line is valid?'''&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  No&lt;br /&gt;
|-&lt;br /&gt;
|  '''The memory copy is…'''&lt;br /&gt;
|  out of date&lt;br /&gt;
|  valid&lt;br /&gt;
|  valid&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies exist in caches of other processors?'''&lt;br /&gt;
|  No&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe&lt;br /&gt;
|  Maybe&lt;br /&gt;
|-&lt;br /&gt;
|  '''A write to this line'''&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  goes directly to bus&lt;br /&gt;
|}&lt;br /&gt;
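&lt;br /&gt;
The processor-side behavior summarized in the table can be sketched as a function. This is an illustration under assumptions: the name processor_access, the others_have_copy flag (standing in for the bus signal that reports other sharers) and the use of BusRdX for a write miss follow common textbook convention and are not taken from a specific Intel implementation.&lt;br /&gt;
&lt;br /&gt;
```python
def processor_access(state, op, others_have_copy=False):
    """Return (next state, bus transaction or None) for a read or write."""
    if op == "read":
        if state == "I":
            # A fetched line becomes S if another cache holds it, else E.
            return ("S" if others_have_copy else "E", "BusRd")
        return (state, None)
    # Everything below handles a write.
    if state in ("M", "E"):
        return ("M", None)            # silent upgrade, nothing goes to the bus
    if state == "S":
        return ("M", "BusUpgr")       # other copies must be invalidated first
    return ("M", "BusRdX")            # write miss: fetch the line with ownership


private_read = processor_access("I", "read")
then_write = processor_access(private_read[0], "write")
```

The read-then-write sequence on a private line costs one bus transaction here instead of the two that MSI would need, which is the motivation for the Exclusive state.&lt;br /&gt;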
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
State transition diagram for MESI protocol is shown below.&lt;br /&gt;
&lt;br /&gt;
[[File:MESInew.jpg|500px|thumb|center|MESI state transition diagram]] &lt;br /&gt;
&lt;br /&gt;
===Real Architectures using MESI===&lt;br /&gt;
&lt;br /&gt;
Intel's [http://en.wikipedia.org/wiki/Pentium_Pro Pentium Pro] microprocessor, introduced in 1995, was the first Intel architecture microprocessor to support symmetric multiprocessing in various multiprocessor configurations.  SMP with the MESI protocol was used consistently until the introduction of the 45-nm Hi-k Core microarchitecture in Intel's (Nehalem-EP) quad-core x86-64. The 45-nm Hi-k Intel Core microarchitecture uses a new system framework called the [http://en.wikipedia.org/wiki/Intel_QuickPath_Interconnect QuickPath Interconnect], which uses point-to-point interconnection technology based on a distributed shared memory architecture.  It uses a modified version of the MESI protocol called [http://en.wikipedia.org/wiki/MESIF_protocol MESIF], which introduces the additional Forward state.  (The state diagram of MESI transitions that occur within the Pentium's data cache can be found on page 63 of [http://books.google.com/books?id=TVzjEZg1--YC&amp;amp;printsec=frontcover&amp;amp;source=gbs_ge_summary_r&amp;amp;cad=0#v=onepage&amp;amp;q&amp;amp;f=false Pentium Processor System Architecture: Second Edition] by Don Anderson, Tom Shanley, MindShare, Inc.)&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
The [http://www.arm.com/products/processors/classic/arm11/arm11-mpcore.php ARM11 MPCore] and [http://en.wikipedia.org/wiki/ARM_Cortex-A9_MPCore Cortex-A9 MPCore] processors support the MESI cache coherency protocol.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]  ARM MPCore defines the states of the MESI protocol it implements as:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies in other caches'''&lt;br /&gt;
|  NO&lt;br /&gt;
|  NO&lt;br /&gt;
|  YES&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Clean or Dirty'''&lt;br /&gt;
|  DIRTY&lt;br /&gt;
|  CLEAN&lt;br /&gt;
|  CLEAN&lt;br /&gt;
|  -&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
The coherency protocol is implemented and managed by the Snoop Control Unit (SCU) in the ARM MPCore, which monitors the traffic between local L1 data caches and the next level of the memory hierarchy. At boot time, each core can choose whether or not to take part in the coherency domain.  Unless explicit system calls bind a task to a specific core ([http://en.wikipedia.org/wiki/Processor_affinity processor affinity]), there is a high chance that a task will at some point migrate to a different core, along with its data as it is used.  Task migration is not handled efficiently by a literal MESI implementation, so the ARM MPCore offers two optimizations that allow for MESI compliance and migration of tasks: '''Direct Data Intervention (DDI)''', in which the SCU keeps a copy of the tag RAMs of all cores’ caches, enabling it to efficiently detect whether a cache line requested by one core is held by another core in the coherency domain before looking for it in the next level of the memory hierarchy; and '''Cache-to-cache Migration''', in which, if the SCU finds that the cache line requested by one CPU is present in another core, it either copies it (if clean) or moves it (if dirty) from the other CPU directly into the requesting one, without interacting with external memory.  These optimizations reduce memory traffic in and out of the L1 cache subsystem by eliminating interaction with external memories, which in effect reduces the overall load on the interconnect and the overall power consumption.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
Other architectures that use the MESI cache-coherence protocol include the L2 cache of the IBM POWER4 processor[[#References|&amp;lt;sup&amp;gt;[20]&amp;lt;/sup&amp;gt;]], the L2 cache of the Intel Itanium 2 processor[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]], and the Intel Xeon[[#References|&amp;lt;sup&amp;gt;[22]&amp;lt;/sup&amp;gt;]].&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
One implementation complexity already mentioned is the inefficiency of task migration in MESI that the ARM MPCore addresses.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]  Another possible implementation complexity is found during the replacement of a cache line: one possible MESI implementation requires a message to be sent to main memory when a cache line is flushed (i.e. an E to I transition), as the line was exclusively in one cache before it was removed. It is possible to avoid this replacement message if the system is designed so that the flush of a modified (exclusive) line requires an acknowledgment from main memory. However, this requires the flush to be stored in a 'write-back' buffer until the reply arrives (to ensure the change is successfully propagated to memory).[[#References|&amp;lt;sup&amp;gt;[23]&amp;lt;/sup&amp;gt;]]&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==MESIF==&lt;br /&gt;
Let us now walk through a brief overview of the '''MESIF protocol''':&lt;br /&gt;
&lt;br /&gt;
The '''MESIF''' protocol, used in the latest Intel multi-core processors, was introduced to '''accommodate the point-to-point''' links used in the QuickPath Interconnect. Using the '''MESI''' protocol in this architecture would send many redundant messages between different processors, often with unnecessarily high latency. For example, when a processor requests a cache line that is stored in multiple locations, every location might respond with the data. As the requesting processor only needs a single copy of the data, the system would be wasting bandwidth. &lt;br /&gt;
As a solution to this problem, an additional state, '''Forward state''', was added by slightly changing the role of the Shared state. Whenever there is a read request, only the cache line in the F state will respond to the request, while all the S state caches remain dormant.  Hence, by designating a single cache line to '''respond to requests''', coherency traffic is substantially reduced when multiple copies of the data exist. Also, on a read request, the F state transitions from F to S state. That is, when a cache line in the '''F''' state is '''copied''', the F state '''migrates''' to the '''newer copy''', while the '''older''' one drops back to '''S'''. Moving the new copy to the F state '''exploits''' both '''temporal and spatial locality'''. Because the newest copy of the cache line is always in the F state, it is very unlikely that the line in the F state will be evicted from the caches. This takes advantage of the temporal locality of the request. The second advantage is that if a particular cache line is in high demand due to spatial locality, the bandwidth used to transmit that data will be spread across several nodes.&lt;br /&gt;
All M to S and E to S state transitions will now be from '''M to F''' and '''E to F'''.  &lt;br /&gt;
The '''F state''' is '''different''' from the '''Owned state''' of the MOESI protocol as it is '''not''' a unique copy because a valid copy is stored in memory. Thus, unlike the Owned state of the MOESI protocol, in which the data in the O state is the only valid copy of the data, the data in the F state can be evicted or converted to the S state, if desired. &lt;br /&gt;
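&lt;br /&gt;
The migration of the F state on a read can be sketched as follows. This is a simplified model under assumptions: the function read_miss and the dictionary of per-cache states are hypothetical, only lines already held in the F or S state by other caches are modeled, and a line with no sharers is assumed to be installed as E, as in MESI.&lt;br /&gt;
&lt;br /&gt;
```python
def read_miss(caches, requester):
    """Serve a read miss for one line; caches maps cache name to MESIF state."""
    sharers = [name for name, state in caches.items()
               if name != requester and state in ("F", "S")]
    supplier = "memory"
    for name in sharers:
        if caches[name] == "F":
            caches[name] = "S"    # the older copy drops back to S
            supplier = name       # only the F copy responds; S copies stay silent
    # Assumption: with no sharers the line is installed as E, as in MESI.
    caches[requester] = "F" if sharers else "E"
    return supplier


caches = {"P0": "F", "P1": "S", "P2": "I"}
supplier = read_miss(caches, "P2")
```

After the call in this sketch, P2 holds the line in F and P0 has dropped to S, so the next reader would be served by the newest copy.&lt;br /&gt;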
&lt;br /&gt;
More information on the QuickPath Interconnect and MESIF protocol can be found at&lt;br /&gt;
'''[http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]'''&lt;br /&gt;
&lt;br /&gt;
==MOESI==&lt;br /&gt;
The [http://en.wikipedia.org/wiki/Opteron AMD Opteron] was AMD’s first-generation dual-core processor, with 2 distinct [http://en.wikipedia.org/wiki/Athlon_64 K8 cores] together on a single die.  Cache coherence poses bigger problems on such multiprocessors, and an appropriate coherence protocol was needed to address them. The [http://en.wikipedia.org/wiki/Xeon Intel Xeon], the competing counterpart from Intel, used the MESI protocol to handle cache coherence.  MESI came with the drawback of using much time and bandwidth in certain situations. &lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/MOESI_protocol MOESI] was AMD’s answer to this problem. MOESI adds a fifth state to the MESI protocol called '''“Owned”'''. It addresses the bandwidth problem faced in the MESI protocol when a processor holding an invalid copy of a line requests data that another processor has modified.  Under MESI, the requesting processor has to wait for the processor that modified the data to write it back to main memory, which takes time and bandwidth. MOESI removes this drawback by allowing dirty sharing.  When the data is held by a processor in the new state '''“Owned”''', that processor can provide the modified data to other processors without, or even before, writing it to main memory. This is called '''''dirty sharing'''''. The processor holding the data in the '''&amp;quot;Owned&amp;quot;''' state remains responsible for updating main memory later, when the cache line is evicted.&lt;br /&gt;
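&lt;br /&gt;
To make the difference concrete, the following Python sketch compares what happens to a dirty line under MESI and MOESI when another cache reads it. It is a simplified illustration under stated assumptions: the function names ''snoop_read'' and ''evict'' are hypothetical, and only the transitions discussed in this section are modeled.&lt;br /&gt;

```python
def snoop_read(protocol, holder_state):
    """Return (new_holder_state, memory_writebacks) when another cache
    reads a line that this cache holds. Illustrative names only."""
    if holder_state == "M":
        if protocol == "MESI":
            # Dirty data must reach memory before it can be shared.
            return ("S", 1)
        if protocol == "MOESI":
            # Dirty sharing: supply the line cache to cache, defer write-back.
            return ("O", 0)
    if holder_state == "O":
        # The owner keeps supplying the dirty line without touching memory.
        return ("O", 0)
    if holder_state == "E":
        return ("S", 0)
    return (holder_state, 0)


def evict(holder_state):
    """Return the number of write-backs caused by evicting the line."""
    # Only the dirty states (M and O) are responsible for updating memory.
    return 1 if holder_state in ("M", "O") else 0
```

In this model MOESI moves the single write-back from the moment of sharing to the moment of eviction, which is the saving the Owned state is meant to provide.&lt;br /&gt;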
&lt;br /&gt;
MOESI has become one of the most popular snoop-based protocols and is supported in the AMD64 architecture.  The AMD dual-core Opteron can maintain cache coherence in systems with up to 8 processors using this protocol.&lt;br /&gt;
&lt;br /&gt;
The five different states of the MOESI protocol are:&lt;br /&gt;
* '''Modified (M)''' : The cache line holds the most recent copy of the data, which is not present in any other processor's cache.&lt;br /&gt;
* '''Owned (O)'''   : The cache line holds the most recent, correct copy of the data, which may also be shared by other processors. The processor holding the line in this state is responsible for updating main memory with the correct value when the line is evicted.  &lt;br /&gt;
* '''Exclusive (E)''' : A cache line holds the most recent, correct copy of the data, which is exclusively present on this processor and a copy is present in the main memory.  &lt;br /&gt;
* '''Shared (S)''' : A cache line in the shared state holds the most recent, correct copy of the data, which may be shared by other processors. &lt;br /&gt;
* '''Invalid (I)''' : A cache line does not hold a valid copy of the data.&lt;br /&gt;
&lt;br /&gt;
A detailed explanation of how this protocol is implemented on AMD processors can be found in [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of the AMD 64-bit core]&lt;br /&gt;
&lt;br /&gt;
The following table summarizes the MOESI protocol:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
&lt;br /&gt;
|  '''Owned''' &lt;br /&gt;
&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|  '''This cache line is valid?'''&lt;br /&gt;
&lt;br /&gt;
|  Yes&lt;br /&gt;
&lt;br /&gt;
|  Yes&lt;br /&gt;
&lt;br /&gt;
|  Yes&lt;br /&gt;
&lt;br /&gt;
|  Yes&lt;br /&gt;
&lt;br /&gt;
|  No&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|  '''The memory copy is…'''&lt;br /&gt;
&lt;br /&gt;
|  out of date&lt;br /&gt;
&lt;br /&gt;
|  out of date&lt;br /&gt;
&lt;br /&gt;
|  valid&lt;br /&gt;
&lt;br /&gt;
|  valid&lt;br /&gt;
&lt;br /&gt;
|  -&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|  '''Copies exist in caches of other processors?'''&lt;br /&gt;
&lt;br /&gt;
|  No&lt;br /&gt;
&lt;br /&gt;
|  Maybe&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe&lt;br /&gt;
&lt;br /&gt;
|  Maybe&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|  '''A write to this line'''&lt;br /&gt;
&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
&lt;br /&gt;
|  goes directly to bus&lt;br /&gt;
&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The state transition diagram for the MOESI protocol is shown below. The top part shows the response to processor-side requests and the bottom part shows the response to snooper-side requests. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:MOESI.jpg|500px|thumb|center|MOESI state transition diagram]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br /&amp;gt;&lt;br /&gt;
&amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Optimization techniques on MOESI===&lt;br /&gt;
&lt;br /&gt;
In real machines, applying optimization techniques on top of the standard cache coherence protocol can improve performance. For example, the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] family of microprocessors (Family 10h), AMD’s first generation to incorporate 4 distinct cores on a single die and the first to have a cache that all the cores share, uses the MOESI protocol with some optimization techniques incorporated. &lt;br /&gt;
&lt;br /&gt;
One such optimization targets a small subset of compute problems that behave like producer-consumer programs. In such a problem, a thread running on one core produces data, which is consumed by a thread running on a separate core. With such programs, it is desirable to have the two cores communicate through the shared cache, to avoid round trips to and from main memory. However, the '''MOESI''' protocol that the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] cache uses for cache coherence can itself limit bandwidth in this pattern. Keeping the cache line in the '''‘M’''' state for such computing problems gives better performance.&lt;br /&gt;
&lt;br /&gt;
When the producer thread writes a new entry into the ring buffer it shares with the consumer, it allocates cache lines in the '''modified (M)''' state. Eventually, these M-marked cache lines will start to fill the L3 cache. When the consumer reads the cache line, the MOESI protocol changes the state of the cache line to '''owned (O)''' in the L3 cache and pulls down a '''shared (S)''' copy for its own use. Later, the producer thread circles the ring buffer and arrives back at the same cache line it had previously written. However, when the producer attempts to write new data to the owned (marked '''‘O’''') cache line, it finds that it cannot, since a cache line marked '''‘O’''' by the previous consumer read does not have sufficient permission for a write request (in the MOESI protocol). To maintain coherence, the memory controller must initiate probes in the other caches (to handle any other S copies that may exist). This slows down the process.&lt;br /&gt;
&lt;br /&gt;
Thus, it is preferable to keep the cache line in the '''‘M’''' state in the L3 cache. In such a situation, when the producer comes back around the ring buffer, it finds the previously written cache line still marked '''‘M’''', to which it is safe to write without coherence concerns. Thus better performance can be achieved by such optimization techniques to standard protocols when implemented in real machines.&lt;br /&gt;
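&lt;br /&gt;
The effect described above can be illustrated with a small Python sketch that counts the probes triggered when the producer writes to lines left in the O state. It is a rough model under stated assumptions, not AMD's implementation: the function ''count_probes'' and its parameters are hypothetical, one probe is charged per write to an O line, and the consumer is assumed to read every slot once per lap.&lt;br /&gt;

```python
def count_probes(laps, slots, keep_modified):
    """Count probes caused by producer writes to O lines in a ring buffer.

    keep_modified=False models the default behaviour described above
    (a consumer read demotes the line to O); keep_modified=True models
    the optimization (the line stays M). Purely illustrative.
    """
    state = ["I"] * slots
    probes = 0
    for _ in range(laps):
        for i in range(slots):
            # Producer write: an O line lacks write permission, so other
            # caches must be probed before the line can become M again.
            if state[i] == "O":
                probes += 1
            state[i] = "M"
        for i in range(slots):
            # Consumer read of every slot written in this lap.
            if not keep_modified:
                state[i] = "O"
    return probes
```

With 3 laps over 4 slots, the default behaviour incurs a probe on every write after the first lap, while keeping the lines in M incurs none in this model.&lt;br /&gt;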
&lt;br /&gt;
You can find more information on how this is implemented and various other ways of optimizations in this manual [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors]&lt;br /&gt;
&lt;br /&gt;
==Dragon Protocol==&lt;br /&gt;
The '''[http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]''' is an update-based coherence protocol: unlike the coherence protocols seen so far, it does not invalidate other cached copies. Write propagation is achieved by updating the cached copies instead of invalidating them.  The Dragon Protocol does not update memory on a cache-to-cache transfer; it delays making memory consistent with the caches until the data is evicted and written back, which saves time and lowers the memory access requirements. Moreover, only the written '''byte''' or '''word''' is '''communicated''' to the '''other caches''' instead of the whole block, which further '''reduces''' '''bandwidth''' usage. The protocol dynamically detects the sharing status of a block and uses a write-through (update) policy for shared blocks and a write-back policy for blocks that are currently not shared. The Dragon Protocol employs the following four states for the cache blocks: '''Shared Clean''', '''Shared Modified''', '''Exclusive''' and  '''Modified'''. &lt;br /&gt;
* '''Modified (M)''' and '''Exclusive (E)''' - these states have the same meaning as explained in the protocols above. &lt;br /&gt;
* '''Shared Modified (Sm)''' - Potentially two or more caches have this block, main memory is not up to date, and this cache, which was the last to modify the block, is responsible for updating memory when the block is replaced. Only one cache in the system can hold a given block in the Shared Modified state.&lt;br /&gt;
* '''Shared Clean (Sc)''' -  Potentially two or more caches have this block and memory may or may not be up to date (if no other cache has it in the Sm state, memory is up to date; otherwise it is not).&lt;br /&gt;
Only when a Shared Modified line is evicted from the cache (for example, to make room after a cache miss) is the block written back to main memory to keep memory consistent. For more information on the Dragon protocol, refer to the Solihin textbook, page 229. The state transition diagram is given below for reference.&lt;br /&gt;
&lt;br /&gt;
[[File:DragonNew.jpg|500px|thumb|center|Dragon state transition diagram]] &lt;br /&gt;
&lt;br /&gt;
The Dragon system implemented snoopy caches that provided the appearance of a uniform memory space to multiple processors. Each cache listens to 2 buses: the processor bus and the memory bus. The caches are also responsible for address translation, so the processor bus carries virtual addresses and the memory bus carries physical addresses.  The Dragon system was designed to support 4 to 8 Dragon processors.&lt;br /&gt;
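&lt;br /&gt;
The update-based behaviour of the Dragon protocol can be sketched in Python as follows. This is a minimal model for illustration only: the class ''DragonBlock'' and its methods are hypothetical names, a write miss is simplified to a read followed by a write, and bus timing is ignored.&lt;br /&gt;

```python
class DragonBlock:
    """Toy model of one block under an update-based protocol (illustrative)."""

    def __init__(self, num_caches, words):
        self.memory = [0] * words
        self.state = [None] * num_caches   # None means the block is not cached
        self.data = [None] * num_caches

    def _others(self, c):
        return [i for i in range(len(self.state))
                if i != c and self.state[i] is not None]

    def read(self, c):
        if self.state[c] is None:
            others = self._others(c)
            source = self.memory
            for i in others:
                if self.state[i] in ("M", "Sm"):
                    source = self.data[i]   # the dirty holder supplies the data
                    self.state[i] = "Sm"
                else:
                    self.state[i] = "Sc"
            self.data[c] = list(source)
            self.state[c] = "Sc" if others else "E"
        return self.data[c]

    def write(self, c, word, value):
        if self.state[c] is None:
            self.read(c)                    # simplification: fetch the block first
        self.data[c][word] = value
        others = self._others(c)
        if others:
            for i in others:
                self.data[i][word] = value  # bus update carries only the written word
                self.state[i] = "Sc"
            self.state[c] = "Sm"
        else:
            self.state[c] = "M"

    def evict(self, c):
        if self.state[c] in ("M", "Sm"):
            self.memory = list(self.data[c])   # memory is updated only on write-back
        self.state[c] = None
        self.data[c] = None
```

After two caches read the block and one of them writes a word, the other cache sees the new word immediately while memory stays stale until the Sm copy is evicted.&lt;br /&gt;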
&lt;br /&gt;
=[http://en.wikipedia.org/wiki/Instruction_prefetch Instruction Prefetching]=&lt;br /&gt;
&lt;br /&gt;
In standard [http://en.wikipedia.org/wiki/Pipeline_(computing) CPU pipelining], the optimization of instruction prefetching is used to minimize the time the CPU spends waiting on instructions being fetched from main memory.  Prefetching works well on completely linear programs, but on programs with branches other optimizations (like [http://en.wikipedia.org/wiki/Branch_predictor branch prediction]) have to be implemented, and the cost of prefetching down the wrong branch has to be considered.  Source:  http://dictionary.reference.com/browse/instruction+prefetch&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
Instruction prefetching is a technique used to speed up the execution of a program. In multiprocessor systems, however, prefetching can compromise correctness: because of prefetching, data can be modified in a way that the memory coherence protocol is not able to handle. In such situations software must use serializing instructions or cache-invalidation instructions to guarantee that subsequent data accesses are coherent. &lt;br /&gt;
&amp;lt;br /&amp;gt;&lt;br /&gt;
An example of such a situation is a page-table update followed by accesses to the physical pages referenced by the updated page tables. The physical-memory references for the page tables are different than the physical-memory references for the data. Because of prefetching there may be problems with correctness. The following sequence of events shows such a situation when software changes the translation of virtual-page A from physical-page M to physical-page N:&lt;br /&gt;
# The tables that translate virtual-page A to physical-page M are now held only in main memory. The copies in the cache are invalidated.&lt;br /&gt;
# Software changes the page-table entry for virtual-page A in main memory to point to physical-page N rather than physical-page M.&lt;br /&gt;
# Data in virtual-page A is accessed.&lt;br /&gt;
&lt;br /&gt;
Software expects the processor to access the data from physical-page N after the update. However, it is possible for the processor to prefetch the data from physical-page M before the page table for virtual page A is updated. Because the physical-memory references are different, the processor does not recognize them as requiring coherence checking and believes it is safe to prefetch the data from virtual-page A, which is translated into a read from physical page M. Similar behavior can occur when instructions are prefetched from beyond the page-table update instruction.&lt;br /&gt;
&lt;br /&gt;
To prevent such errors, software must execute a translation-invalidating instruction (on AMD64, INVLPG or a MOV to CR3) immediately after the page-table update to ensure that subsequent instruction fetches and data accesses use the correct virtual-page-to-physical-page translation. It is not necessary to perform a TLB invalidation operation preceding the table update.&lt;br /&gt;
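&lt;br /&gt;
The hazard and its fix can be mimicked with a short Python sketch. It is only an analogy under stated assumptions: a dictionary stands in for translations the processor has already obtained (for example through a prefetch), and the helper names ''translate'' and ''invalidate'' are hypothetical, not real processor or operating-system interfaces.&lt;br /&gt;

```python
page_table = {"A": "M"}      # virtual page to physical page, held in memory
tlb = {}                     # translations the processor has already cached


def translate(vpage):
    """Return the physical page, caching the translation like a TLB would."""
    if vpage not in tlb:
        tlb[vpage] = page_table[vpage]
    return tlb[vpage]


def invalidate(vpage):
    """Stand-in for a translation-invalidating instruction (hypothetical helper)."""
    tlb.pop(vpage, None)


translate("A")               # the old translation of A is picked up early
page_table["A"] = "N"        # software remaps A to physical page N
stale = translate("A")       # without invalidation the old mapping is still used
invalidate("A")              # invalidate right after the page-table update
fresh = translate("A")       # now the lookup sees the new entry
```

In this sketch ''stale'' still refers to physical page M, while ''fresh'' refers to N, which mirrors why the invalidation has to follow the page-table update.&lt;br /&gt;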
&lt;br /&gt;
More information can be found about this in [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual]&lt;br /&gt;
&lt;br /&gt;
= CMP Implementation in Intel Architecture =&lt;br /&gt;
&lt;br /&gt;
Let us now see how Intel architecture using the MESI protocol progressed from a uniprocessor architecture to a Chip MultiProcessor (CMP) using the bus as the interconnect. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Uniprocessor Architecture'''&lt;br /&gt;
&lt;br /&gt;
The diagram below shows the structure of the memory cluster in the Intel Pentium M processor.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:intel_cache1.jpg]]&amp;lt;/center&amp;gt; &amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In this structure we have,&lt;br /&gt;
* A unified on-chip '''L1 cache''' with the '''processor/core''',&lt;br /&gt;
* A '''Memory/L2 access control unit''', through which all the accesses to the L2 cache, main memory and IO space are made,&lt;br /&gt;
* The second level '''L2 cache''' along with the '''prefetch unit''' and&lt;br /&gt;
* '''Front side bus (FSB)''', a single shared bi-directional bus that carries all the traffic. This wide bus brings in multiple data bytes at a time. &lt;br /&gt;
&lt;br /&gt;
As Intel explains it, using this structure, processor requests were first sought in the '''L2 cache''' and only on a '''miss''' were they '''forwarded''' to main '''memory''' via the front side bus ('''FSB'''). The '''Memory/L2 access control''' unit served as a central point for '''maintaining coherence''' within the core and with the external world. It '''contains''' a '''snoop control unit''' that receives snoop requests from the bus and performs the required operations on each cache (and internal buffers) in parallel. It also handles RFO requests (BusUpgr) and ensures the operation continues only after it guarantees that no other version of the cache line exists in any other cache in the system.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''CMP Architecture'''&lt;br /&gt;
&lt;br /&gt;
For the CMP implementation, Intel chose a bus-based architecture using snoopy protocols over a '''directory protocol'''. Although a directory protocol reduces active power thanks to reduced snoop activity, it '''increases''' the '''design complexity''' and the '''static power''' due to larger tag arrays. Since Intel has a large market for processors in the mobility family, a directory-based solution was less favorable, because battery life depends mainly on static power consumption and less on dynamic power.&lt;br /&gt;
Let us examine how '''CMP''' was implemented in the '''Intel Core Duo''', which was one of the first dual-core processors for the budget/entry-level market. &lt;br /&gt;
The general CMP implementation structure of the Intel Core Duo is shown below.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:intel_cache2.jpg]]&amp;lt;/center&amp;gt; &amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This structure has the following changes when compared to the uniprocessor memory cluster structure. &lt;br /&gt;
* The '''L1 cache''' and the '''processor/core''' structure are '''duplicated''' to give 2 cores.&lt;br /&gt;
* The '''Memory/L2 access control''' unit is '''split''' into 2 logical units: '''L2 controller''' and '''bus controller'''. The L2 controller handles all '''requests to the L2''' cache from the core and the snoop requests from the FSB. The '''bus controller''' handles '''data and I/O requests''' to and from the FSB.&lt;br /&gt;
* The '''prefetching''' unit is extended to handle the hardware '''prefetches for each core separately'''.&lt;br /&gt;
* A '''new logical unit''' (represented by the hexagon) was added to maintain '''fairness between the requests''' coming from the different cores and hence balance the requests to L2 and memory.&lt;br /&gt;
&lt;br /&gt;
This new '''partitioned structure''' for the  memory/L2 access control unit '''enhanced''' the '''performance''' while '''reducing power consumption'''. &lt;br /&gt;
For more information on uniprocessor and multiprocessor implementation under the Intel architecture, refer to &lt;br /&gt;
[http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]&lt;br /&gt;
&lt;br /&gt;
The '''Intel bus architecture''' has been '''evolving''' in order to accommodate the demands of scalability while using the same MESI protocol: from a '''single shared bus''', to '''dual independent buses (DIB)''' that double the available bandwidth, to the logical conclusion of DIB with the introduction of '''dedicated high-speed interconnects (DHSI)'''. The DHSI-based platforms use four FSBs, one for each processor in the platform. In both DIB and DHSI, a snoop filter in the chipset caches snoop information, thereby significantly reducing the broadcasting needed for the snoop traffic on the buses. With the production of processors based on the next-generation 45-nm Hi-k Intel Core microarchitecture, the [http://en.wikipedia.org/wiki/Xeon Intel Xeon] processor fabric will transition from a DHSI, with the memory controller in the chipset, to a distributed shared memory architecture using '''Intel QuickPath Interconnects with the MESIF protocol'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Word Invalidation=&lt;br /&gt;
One source of complexity common to a number of the protocols is invalidation.  In newer protocols, individual words may be modified in a cache line, as opposed to the entirety of the line.  The other processors will thus have a mostly correct cache line that differs in only one word.  This leads to potential complexity, because to 'correct' the error the other processors must transition from shared to invalid, then read the correct word from the bus and place it back into the line, transitioning back into shared.  This transition is potentially unnecessary, as the second processor may never access the specific word that changed, even though it may access other words in the cache line.  One potential solution to this is being researched at the present time, advancing the protocols so that a cache line is &amp;quot;not invalidated on the first dirty word, but after the number of dirty words crosses some predetermined value, which is data type and application dependent.&amp;quot;  In other words, if the application can tolerate multiple 'incorrect' words, several transitions to and from the invalid state may be avoided.&lt;br /&gt;
&amp;lt;br/&amp;gt;Source: http://tab.computer.org/tcca/NEWS/sept96/dsmideas.ps&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In essence, the solution proposed here is to advance the MOESI protocol with word invalidation and specific treatment of temporal and spatial data, so that the block is not invalidated.&lt;br /&gt;
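&lt;br /&gt;
The threshold idea quoted above can be sketched in Python as follows. This is a hedged illustration rather than the cited proposal's actual design: the class ''ThresholdLine'', the ''limit'' parameter (assumed to be at least 1) and the rule that a stale word itself cannot be read locally are all assumptions made for the example.&lt;br /&gt;

```python
class ThresholdLine:
    """Remote copy of a cache line that tolerates a few stale words (illustrative)."""

    def __init__(self, words, limit):
        self.dirty = [False] * words   # words that were modified elsewhere
        self.limit = limit             # hypothetical, application-dependent value
        self.valid = True

    def remote_write(self, word):
        """Another processor modified one word of this line."""
        self.dirty[word] = True
        # The count grows by at most one per call, so equality is enough.
        if sum(self.dirty) == self.limit:
            self.valid = False

    def can_read(self, word):
        """A local read may proceed only on a valid line and a clean word."""
        return self.valid and not self.dirty[word]
```

With a limit of 3, the first two distinct remote writes leave the line valid, so reads of untouched words avoid a trip through the invalid state; the third distinct dirty word invalidates the whole line.&lt;br /&gt;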
&lt;br /&gt;
=Performance: MOESI vs MEI/MESI=&lt;br /&gt;
Early multiprocessors (such as the PowerPC processors) were designed to work with three states (Modified, Exclusive, Invalid - this is similar to the MSI protocol discussed earlier, the term MEI is used to remain consistent with the sources used for reference).  As time progressed, more multi-processors transitioned to the MESI protocol.  This is most likely due to the characteristics of MEI - it is &amp;quot;easy to implement but leads to inefficiencies in the way that memory bus bandwidth is used&amp;quot; [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 source].  As a result, systems with two or more MEI processors (most likely) will not see the 'full' potential of the extra processors, due to the inefficient bus transactions.&lt;br /&gt;
&lt;br /&gt;
In contrast, the five-state protocol MOESI is more complex to implement than MESI and MEI/MSI.  However, advantages arise from using it.  As Andy Keane (VP of Marketing for PMC-Sierra) put it, &amp;quot;this [fifth] state allows shared data that is dirty to remain in the cache.  Without this state, any shared line would be written back to memory to change the original state from modified. Since we have a dedicated, fast path from CPU to CPU, this state maximizes the use of this path rather than the path to memory, which is inherently slower&amp;quot; [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 source].  This advantage can be realized by allowing MOESI to make some processor-to-processor transfers from level 2 caches.&lt;br /&gt;
&lt;br /&gt;
Source: [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 Edwards, Chris (08/06/2001). &amp;quot;Architects wrestle with multiprocessor options&amp;quot;. Electronic engineering times (0192-1541), (1178), p. 48.]&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Cache_coherence Cache coherence]&lt;br /&gt;
# [http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]&lt;br /&gt;
# [http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Symmetric_multiprocessing Common System Interface in Intel Processors]&lt;br /&gt;
# [http://www.zak.ict.pwr.wroc.pl/nikodem/ak_materialy/Cache%20consistency%20&amp;amp;%20MESI.pdf Cache consistency with MESI on Intel processor]&lt;br /&gt;
# [http://techreport.com/articles.x/8236/2 AMD dual core Architecture]&lt;br /&gt;
# [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual]&lt;br /&gt;
# [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors]&lt;br /&gt;
# [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of AMD 64 bit core]&lt;br /&gt;
# [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&amp;amp;arnumber=4913 Silicon Graphics Computer Systems]&lt;br /&gt;
# [http://books.google.com/books?id=g82fofiqa5IC&amp;amp;printsec=frontcover&amp;amp;dq=Parallel+computer+architecture:+a+hardware/software+approach+By+David+E.+Culler,+Jaswinder+Pal+Singh,+Anoop+Gupta&amp;amp;source=bl&amp;amp;ots=COrdamlfVn&amp;amp;sig=YcugVqbzTjHvlofvaFq6Ft_tjfY&amp;amp;hl=en&amp;amp;ei=0ZO6S4TJGcOclgejzI3BBw&amp;amp;sa=X&amp;amp;oi=book_result&amp;amp;ct=result&amp;amp;resnum=1&amp;amp;ved=0CAgQ6AEwAA#v=onepage&amp;amp;q=&amp;amp;f=false Parallel computer architecture: a hardware/software approach By David E. Culler, Jaswinder Pal Singh, Anoop Gupta]&lt;br /&gt;
# [http://www.freepatentsonline.com/5283886.html Three state invalidation protocols]&lt;br /&gt;
# [http://portal.acm.org/citation.cfm?id=1499317&amp;amp;dl=GUIDE&amp;amp;coll=GUIDE&amp;amp;CFID=83027384&amp;amp;CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems]&lt;br /&gt;
# [http://portal.acm.org/citation.cfm?id=6514 Cache Coherence protocols: evaluation using a multiprocessor simulation model]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Xerox_Dragon Xerox Dragon]&lt;br /&gt;
# [http://thanaseto.110mb.com/courses/CSD-527-report-engl.pdf Coherence Protocols]&lt;br /&gt;
# [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&amp;amp;arnumber=289691 XDBus]&lt;br /&gt;
# [http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dai0228a/index.html ARM]&lt;br /&gt;
# [http://ixbtlabs.com/articles/ibmpower4/index.html IBM POWER4]&lt;br /&gt;
# [http://www.chiplist.com/Intel_Itanium_2_processor/tree3f-section--2235-/ Intel Itanium 2]&lt;br /&gt;
# [http://techreport.com/articles.x/8236/2 MESI-MESI-MOESI Banana-fana...]&lt;br /&gt;
# [http://rsim.cs.illinois.edu/rsim/Manual/node109.html RSIM]&lt;br /&gt;
# [http://dl.acm.org/citation.cfm?id=1499317 Synapse tightly coupled multiprocessors]&lt;/div&gt;</summary>
		<author><name>Pxu</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8a_fu&amp;diff=59822</id>
		<title>CSC/ECE 506 Spring 2012/8a fu</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8a_fu&amp;diff=59822"/>
		<updated>2012-03-19T01:43:22Z</updated>

		<summary type="html">&lt;p&gt;Pxu: /* MSI Protocol */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Introduction to bus-based cache coherence in real machines=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==SMP Architecture==&lt;br /&gt;
Most parallel software in the commercial market relies on the shared-memory programming model, in which all processors access the same physical address space. The most common multiprocessors today use the [http://en.wikipedia.org/wiki/Symmetric_multiprocessing SMP] architecture, which uses a common bus as the interconnect.  In the case of multicore processors (&amp;quot;chip multiprocessors,&amp;quot; or CMP), the SMP architecture applies to the cores, treating them as separate processors. The key problem of shared-memory multiprocessors is providing a consistent view of memory across the various cache hierarchies.  This is called the '''''cache coherence problem'''''. Solving it is critical for correctness, and it is a performance-sensitive design point for supporting the shared-memory model. The cache coherence mechanisms not only govern communication in a shared-memory multiprocessor, but also typically determine how the memory system transfers data between processors, caches, and memory.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:Busbased SMP.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
At any point in logical time, the permissions for a cache block can allow either a single writer or multiple readers. The '''''coherence protocol''''' ensures the invariants of the states are maintained. The different coherent states used by most of the cache coherence protocols are as shown in ''Table 1'':&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''States'''&lt;br /&gt;
|  '''Access Type'''&lt;br /&gt;
|  '''Invariant'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  read, write&lt;br /&gt;
|  all other caches in I state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  read&lt;br /&gt;
|  all other caches in I state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Owned'''&lt;br /&gt;
|  read&lt;br /&gt;
|  all other caches in I or S state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  read&lt;br /&gt;
|  no other cache in M or E state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|  -&lt;br /&gt;
|  -&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Snooping Protocols=&lt;br /&gt;
==MSI Protocol==&lt;br /&gt;
&lt;br /&gt;
'''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' is a three-state write-back '''invalidation protocol''' and one of the earliest snooping-based cache coherence protocols. It marks each cache line as '''Modified (M)''', '''Shared (S)''' or '''Invalid (I)'''. '''Invalid''' means the cache line is either not present or is in the invalid state. If the cache line is clean and may be shared by more than one processor, it is marked '''Shared'''. If the cache line is dirty and the processor has exclusive ownership of it, it is in the '''Modified''' state. A BusRdX causes other caches to invalidate (demote) their copies to the '''I''' state; a cache that holds the line in the '''M''' state flushes it first. A processor write that hits a line in the '''S''' state still issues a BusRdX and promotes (upgrades) the line to the '''M''' state.&lt;br /&gt;
&lt;br /&gt;
The state diagram of the MSI protocol is shown below. Note that the left part shows the response to processor-side requests, and the right part shows the response to snooper-side requests.[ref]Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems. Solihin, Yan[/ref]&lt;br /&gt;
[[File:MSInew.jpg|600px|thumb|center|MSI state transition diagram]]&lt;br /&gt;
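&lt;br /&gt;
The MSI transitions described in this section can also be written down as a lookup table. The Python sketch below is one possible encoding, assuming the usual textbook event names (PrRd, PrWr, BusRd, BusRdX, Flush); the table name ''MSI'' and the function ''step'' are hypothetical.&lt;br /&gt;

```python
# (current state, event) mapped to (next state, bus action).
# A bus action of None means no bus traffic is generated.
MSI = {
    # Processor-side requests
    ("I", "PrRd"):   ("S", "BusRd"),
    ("I", "PrWr"):   ("M", "BusRdX"),
    ("S", "PrRd"):   ("S", None),
    ("S", "PrWr"):   ("M", "BusRdX"),   # a hit in S is still upgraded via the bus
    ("M", "PrRd"):   ("M", None),
    ("M", "PrWr"):   ("M", None),
    # Snooper-side requests
    ("I", "BusRd"):  ("I", None),
    ("I", "BusRdX"): ("I", None),
    ("S", "BusRd"):  ("S", None),
    ("S", "BusRdX"): ("I", None),
    ("M", "BusRd"):  ("S", "Flush"),
    ("M", "BusRdX"): ("I", "Flush"),
}


def step(state, event):
    """Return (next_state, bus_action) for one cache line."""
    return MSI[(state, event)]
```

For example, a snooped BusRd on a Modified line yields the S state with a Flush, and a processor write to a Shared line yields the M state with a BusRdX.&lt;br /&gt;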
&lt;br /&gt;
==Synapse protocol==&lt;br /&gt;
From the state transition diagram of MSI, we observe that there is a transition to state '''S''' from state '''M''' when a BusRd is observed for that block. The contents of the block are flushed to the bus before going to the '''S''' state. In certain cases it would be more appropriate to move to the '''I''' state, giving up the block entirely. The choice between '''S''' and '''I''' reflects the designer's assertion about whether the original processor is more likely to continue reading the block than the new processor is to write to it. The Synapse protocol, used in the early Synapse multiprocessor, made the alternate choice of going directly from the '''M''' state to the '''I''' state on a BusRd, assuming the migratory pattern would be more frequent. More details about this protocol can be found in these papers published in the 1980s: [http://portal.acm.org/citation.cfm?id=6514 Cache Coherence protocols: evaluation using a multiprocessor simulation model] and [http://portal.acm.org/citation.cfm?id=1499317&amp;amp;dl=GUIDE&amp;amp;coll=GUIDE&amp;amp;CFID=83027384&amp;amp;CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems]&lt;br /&gt;
&lt;br /&gt;
In Synapse protocol '''M''' state is called '''D''' (Dirty) state. The following is the state transition diagram for Synapse protocol:&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:Synapse1.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
===Real Architectures using Synapse===&lt;br /&gt;
No information could be found on any other architecture using the Synapse cache-coherence protocol.  Judging from the abstract of the paper written by Steve Frank and Armond Inselberg in 1984, &amp;quot;[http://dl.acm.org/citation.cfm?id=1499317 Synapse Tightly Coupled Multiprocessors: A New Approach to Solve Old Problems]&amp;quot;, Synapse appears to have been largely theoretical.&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The [http://www.disi.unige.it/person/DelzannoG/CacheProtocol/synapse.hy BABYLON Project] provides one example of a Synapse cache coherence protocol with atomic synchronization actions.&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
In the Synapse system, memory contention is reduced by designing a processor cache that employs a non-write-through algorithm to minimize bandwidth between cache and shared memory. The Synapse Expansion Bus includes an ownership level protocol between processor caches.[[#References|&amp;lt;sup&amp;gt;[24]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
==[http://en.wikipedia.org/wiki/MESI_protocol MESI]==&lt;br /&gt;
The main drawback of MSI is that each read-write sequence incurs two bus transactions, irrespective of whether the cache line is stored in only one cache or not. Highly parallel programs that have little data sharing suffer the most from this.  The MESI protocol solves this problem by introducing the '''Exclusive''' state to distinguish between a cache line stored in multiple caches and a line stored in a single cache.&lt;br /&gt;
Let us briefly see how the MESI protocol works. For a more detailed version, refer to the Solihin textbook, p. 215.&lt;br /&gt;
&lt;br /&gt;
The MESI coherence protocol marks each cache line as being in one of the Modified, Exclusive, Shared, or Invalid states. &lt;br /&gt;
* '''Invalid''' : The cache line is either not present or is invalid&lt;br /&gt;
* '''Exclusive''' : The cache line is clean and is owned by this core/processor only&lt;br /&gt;
* '''Modified''' : The cache line is dirty and the core/processor has exclusive ownership of it; the copy in memory is out of date.&lt;br /&gt;
* '''Shared''' : The cache line is clean and is shared by more than one core/processor&lt;br /&gt;
&lt;br /&gt;
The MESI protocol works as follows: &lt;br /&gt;
A line that is fetched receives the '''E''' or '''S''' state, depending on whether it exists in other caches in the system. A cache line gets the '''M''' state when a processor writes to it; if the line is not in the '''E''' or '''M''' state prior to the write, the cache sends a Bus Upgrade (BusUpgr) signal or, as the Intel manuals term it, a “Read-For-Ownership (RFO) request”, which ensures that the line exists in the cache and is in the '''I''' state in all other processors on the bus (if any). The table below summarizes the MESI protocol.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''This cache line is valid?'''&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  No&lt;br /&gt;
|-&lt;br /&gt;
|  '''The memory copy is…'''&lt;br /&gt;
|  out of date&lt;br /&gt;
|  valid&lt;br /&gt;
|  valid&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies exist in caches of other processors?'''&lt;br /&gt;
|  No&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe&lt;br /&gt;
|  Maybe&lt;br /&gt;
|-&lt;br /&gt;
|  '''A write to this line'''&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  goes directly to bus&lt;br /&gt;
|}&lt;br /&gt;
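The processor-side behaviour summarized in the table can be sketched as a small state machine. This is a simplified illustration rather than any vendor's implementation: the parameter ''others_have_copy'' stands in for the bus signal that reports whether other caches hold the line, and the use of BusRdX for a write miss follows common textbook naming and is an assumption here.&lt;br /&gt;

```python
# Minimal MESI sketch for one cache line seen from a single cache.
# 'others_have_copy' stands in for the copies-exist signal on the bus.
# BusRdX for a write miss is a textbook naming assumption.

def processor_read(state, others_have_copy):
    # Return (next_state, bus_transaction) for a processor read.
    if state == 'I':
        return ('S' if others_have_copy else 'E'), 'BusRd'
    return state, None                 # M, E and S hit without bus traffic


def processor_write(state):
    # Return (next_state, bus_transaction) for a processor write.
    if state in ('M', 'E'):
        return 'M', None               # silent upgrade, no bus traffic
    if state == 'S':
        return 'M', 'BusUpgr'          # invalidate the other copies
    return 'M', 'BusRdX'               # I: fetch the line with ownership


def read_then_write(others_have_copy):
    # Bus transactions caused by a read miss followed by a write.
    state, first = processor_read('I', others_have_copy)
    state, second = processor_write(state)
    return [t for t in (first, second) if t is not None]
```

Calling ''read_then_write(False)'' yields a single BusRd, which is the saving over MSI for unshared data; with ''True'' the second transaction (BusUpgr) is still required.&lt;br /&gt;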
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The state transition diagram for the MESI protocol is shown below.&lt;br /&gt;
&lt;br /&gt;
[[File:MESInew.jpg|500px|thumb|center|MESI state transition diagram]] &lt;br /&gt;
&lt;br /&gt;
===Real Architectures using MESI===&lt;br /&gt;
&lt;br /&gt;
Intel's [http://en.wikipedia.org/wiki/Pentium_Pro Pentium Pro] microprocessor, introduced in 1995, was the first Intel architecture microprocessor to support symmetric multiprocessing in various multiprocessor configurations.  SMP with the MESI protocol was the architecture used consistently until the introduction of the 45-nm Hi-k Core microarchitecture in Intel's quad-core x86-64 processors (Nehalem-EP). The 45-nm Hi-k Intel Core microarchitecture utilizes a new system framework called the [http://en.wikipedia.org/wiki/Intel_QuickPath_Interconnect QuickPath Interconnect], which uses point-to-point interconnection technology based on a distributed shared memory architecture.  It uses a modified version of the MESI protocol called [http://en.wikipedia.org/wiki/MESIF_protocol MESIF], which introduces the additional Forward state.  (The state diagram of the MESI transitions that occur within the Pentium's data cache can be found on page 63 of [http://books.google.com/books?id=TVzjEZg1--YC&amp;amp;printsec=frontcover&amp;amp;source=gbs_ge_summary_r&amp;amp;cad=0#v=onepage&amp;amp;q&amp;amp;f=false Pentium Processor System Architecture: Second Edition] by Don Anderson, Tom Shanley, MindShare, Inc.)&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
The [http://www.arm.com/products/processors/classic/arm11/arm11-mpcore.php ARM11 MPCore] and [http://en.wikipedia.org/wiki/ARM_Cortex-A9_MPCore Cortex-A9 MPCore] processors support the MESI cache coherency protocol.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]  ARM MPCore defines the states of the MESI protocol it implements as:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies in other caches'''&lt;br /&gt;
|  NO&lt;br /&gt;
|  NO&lt;br /&gt;
|  YES&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Clean or Dirty'''&lt;br /&gt;
|  DIRTY&lt;br /&gt;
|  CLEAN&lt;br /&gt;
|  CLEAN&lt;br /&gt;
|  -&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
The coherency protocol is implemented and managed by the Snoop Control Unit (SCU) in the ARM MPCore, which monitors the traffic between the local L1 data caches and the next level of the memory hierarchy. At boot time, each core can choose whether or not to take part in the coherency domain.  Unless explicit system calls bind a task to a specific core ([http://en.wikipedia.org/wiki/Processor_affinity processor affinity]), there is a high chance that a task will at some point migrate to a different core, along with its data as it is used.  Task migration is not handled efficiently by a literal MESI implementation, so the ARM MPCore offers two optimizations that allow for both MESI compliance and migration of tasks: '''Direct Data Intervention (DDI)''' (in which the SCU keeps a copy of the tag RAMs of all cores' caches, enabling it to efficiently detect whether a cache line requested by a core is held by another core in the coherency domain before looking for it in the next level of the memory hierarchy) and '''Cache-to-cache Migration''' (in which, if the SCU finds that the cache line requested by one CPU is present in another core, it either copies it (if clean) or moves it (if dirty) from the other CPU directly into the requesting one, without interacting with external memory).  These optimizations reduce memory traffic in and out of the L1 cache subsystem by eliminating interaction with external memories, which in effect reduces the overall load on the interconnect and the overall power consumption.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
Other architectures that use the MESI cache-coherence protocol include the L2 cache of the IBM POWER4 processor[[#References|&amp;lt;sup&amp;gt;[20]&amp;lt;/sup&amp;gt;]], the L2 cache of the Intel Itanium 2 processor[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]], and the Intel Xeon[[#References|&amp;lt;sup&amp;gt;[22]&amp;lt;/sup&amp;gt;]].&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
One implementation complexity already mentioned is the inefficiency of task migration in MESI that the ARM MPCore addresses.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]  Another possible implementation complexity is found during the replacement of a cache line: one possible MESI implementation requires a message to be sent to main memory when a cache line is flushed (i.e. an E to I transition), as the line was exclusively in one cache before it was removed. It is possible to avoid this replacement message if the system is designed so that the flush of a modified (exclusive) line requires an acknowledgment from main memory. However, this requires the flush to be stored in a 'write-back' buffer until the reply arrives (to ensure the change is successfully propagated to memory).[[#References|&amp;lt;sup&amp;gt;[23]&amp;lt;/sup&amp;gt;]]&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==MESIF==&lt;br /&gt;
Let us now briefly walk through the '''MESIF protocol''':&lt;br /&gt;
&lt;br /&gt;
The '''MESIF''' protocol, used in the latest Intel multi-core processors, was introduced to '''accommodate the point-to-point''' links used in the QuickPath Interconnect. Using the '''MESI''' protocol in this architecture would send many redundant messages between different processors, often with unnecessarily high latency. For example, when a processor requests a cache line that is stored in multiple locations, every location might respond with the data. As the requesting processor only needs a single copy of the data, the system would be wasting bandwidth. &lt;br /&gt;
As a solution to this problem, an additional state, the '''Forward state''', was added by slightly changing the role of the Shared state. Whenever there is a read request, only the cache holding the line in the F state responds, while all caches holding it in the S state remain dormant.  Hence, by designating a single cache to '''respond to requests''', coherency traffic is substantially reduced when multiple copies of the data exist. On a read request, the responding line also transitions from F to S. That is, when a cache line in the '''F''' state is '''copied''', the F state '''migrates''' to the '''newer copy''', while the '''older''' one drops back to '''S'''. Moving the F state to the new copy '''exploits''' both '''temporal and spatial locality'''. Because the newest copy of the cache line is always in the F state, it is very unlikely that the line in the F state will be evicted from the caches. This takes advantage of the temporal locality of the request. The second advantage is that if a particular cache line is in high demand due to spatial locality, the bandwidth used to transmit that data will be spread across several nodes.&lt;br /&gt;
All M to S and E to S state transitions now become '''M to F''' and '''E to F''' transitions.  &lt;br /&gt;
The '''F state''' is '''different''' from the '''Owned state''' of the MOESI protocol: a line in the F state is '''not''' a unique copy, because a valid copy is also stored in memory. Thus, unlike a line in the Owned state, which holds modified data that memory does not yet have, a line in the F state can be evicted or converted to the S state, if desired. &lt;br /&gt;
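The migration of the F state on a read can be sketched as follows. This is a simplified model that only covers a line already held clean in the F and S states; the cache identifiers and the function name ''read_miss'' are hypothetical and are not taken from Intel documentation.&lt;br /&gt;

```python
# Sketch of MESIF read handling for a line that is already shared clean.
# 'states' maps a hypothetical cache id to its state for one line.

def read_miss(states, requester):
    # Caches in S stay silent; only the cache in F supplies the data.
    responders = [c for c, s in states.items() if s == 'F']
    new_states = dict(states)
    for c in responders:
        new_states[c] = 'S'            # old forwarder drops back to S
    new_states[requester] = 'F'        # newest copy becomes the forwarder
    return new_states, responders


before = {'c0': 'F', 'c1': 'S', 'c2': 'I'}
after, responders = read_miss(before, 'c2')
```

In this example only ''c0'' answers the request even though ''c1'' also holds the line, and ''c2'' ends up as the new forwarder.&lt;br /&gt;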
&lt;br /&gt;
More information on the QuickPath Interconnect and MESIF protocol can be found at&lt;br /&gt;
'''[http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]'''&lt;br /&gt;
&lt;br /&gt;
==MOESI==&lt;br /&gt;
The [http://en.wikipedia.org/wiki/Opteron AMD Opteron] was AMD’s first-generation dual-core processor, with 2 distinct [http://en.wikipedia.org/wiki/Athlon_64 K8 cores] together on a single die.  Cache coherence becomes a bigger problem on such multiprocessors, so an appropriate coherence protocol was needed to address it. The [http://en.wikipedia.org/wiki/Xeon Intel Xeon], the competing product from Intel, used the MESI protocol to handle cache coherence.  MESI has the drawback of consuming considerable time and bandwidth in certain situations. &lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/MOESI_protocol MOESI] was AMD’s answer to this problem. MOESI adds a fifth state, '''“Owned”''', to the MESI protocol. It addresses the bandwidth problem that arises in MESI when a processor with invalid data in its cache wants to access data that another processor has modified.  Under MESI, the processor seeking access has to wait for the processor that modified the data to write it back to main memory, which takes time and bandwidth. MOESI removes this drawback by allowing dirty sharing.  When the data is held by a processor in the new '''“Owned”''' state, that processor can provide the modified data to other processors without, or even before, writing it to main memory. This is called '''''dirty sharing'''''. The processor holding the data in the '''&amp;quot;Owned&amp;quot;''' state remains responsible for updating main memory later, when the cache line is evicted.&lt;br /&gt;
&lt;br /&gt;
MOESI has become one of the most popular snoop-based protocols and is supported in the AMD64 architecture.  The AMD dual-core Opteron can maintain cache coherence in systems of up to 8 processors using this protocol.&lt;br /&gt;
&lt;br /&gt;
The five different states of the MOESI protocol are:&lt;br /&gt;
* '''Modified (M)''' : The cache line holds the most recent copy of the data, and no other processor cache holds a copy.&lt;br /&gt;
* '''Owned (O)'''   : The cache line holds the most recent, correct copy of the data, which can be shared by other processors. The processor holding the line in this state is responsible for writing the correct value back to main memory when the line is evicted.  &lt;br /&gt;
* '''Exclusive (E)''' : The cache line holds the most recent, correct copy of the data, which is present only in this processor's cache; a copy is also present in main memory.  &lt;br /&gt;
* '''Shared (S)''' : A cache line in the shared state holds the most recent, correct copy of the data, which may be shared by other processors. &lt;br /&gt;
* '''Invalid (I)''' : A cache line does not hold a valid copy of the data.&lt;br /&gt;
&lt;br /&gt;
A detailed explanation of how this protocol is implemented on AMD processors can be found in [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of the AMD 64-bit core].&lt;br /&gt;
&lt;br /&gt;
The following table summarizes the MOESI protocol:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Owned'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''This cache line is valid?'''&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  No&lt;br /&gt;
|-&lt;br /&gt;
|  '''The memory copy is…'''&lt;br /&gt;
|  out of date&lt;br /&gt;
|  out of date&lt;br /&gt;
|  valid&lt;br /&gt;
|  valid&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies exist in caches of other processors?'''&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe (Shared copies)&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe&lt;br /&gt;
|  Maybe&lt;br /&gt;
|-&lt;br /&gt;
|  '''A write to this line'''&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus (other copies must be invalidated)&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  goes directly to bus&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The state transition diagram for the MOESI protocol is shown below. The top part shows the response to processor-side requests and the bottom part shows the response to snooper-side requests. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:MOESI.jpg|500px|thumb|center|MOESI state transition diagram]]&lt;br /&gt;
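The benefit of dirty sharing can be illustrated by comparing how a cache reacts to a snooped read under MESI and under MOESI. This is a simplified sketch under stated assumptions: the function names are illustrative, and clean lines are assumed to be supplied by memory, which real implementations may handle differently.&lt;br /&gt;

```python
# Sketch: a snooped read of one line under MESI versus MOESI.
# Each function returns (next state of the holder, data supplier,
# whether main memory is written). Clean lines are assumed to be
# supplied by memory, which is a simplification.

def snoop_read_mesi(state):
    if state == 'M':
        return 'S', 'cache', True      # the flush also updates main memory
    if state == 'E':
        return 'S', 'memory', False
    return state, 'memory', False


def snoop_read_moesi(state):
    if state in ('M', 'O'):
        return 'O', 'cache', False     # dirty sharing: memory stays stale
    if state == 'E':
        return 'S', 'memory', False
    return state, 'memory', False


def memory_writes(snoop, state, reads):
    # Count write-backs caused by a run of remote reads to one line.
    writes = 0
    for _ in range(reads):
        state, _supplier, wrote = snoop(state)
        writes += int(wrote)
    return writes
```

For a modified line read three times by other processors, ''memory_writes(snoop_read_mesi, 'M', 3)'' counts one write-back, while the MOESI variant counts none until the owner eventually evicts the line.&lt;br /&gt;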
&lt;br /&gt;
&amp;lt;br /&amp;gt;&lt;br /&gt;
&amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Optimization techniques on MOESI===&lt;br /&gt;
&lt;br /&gt;
In real machines, applying optimization techniques to the standard cache coherence protocol improves performance. For example, the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] family of microprocessors (Family 10h), which is AMD’s first generation to incorporate 4 distinct cores on a single die and the first to have a cache that all the cores share, uses the MOESI protocol with some optimization techniques incorporated. &lt;br /&gt;
&lt;br /&gt;
One such optimization focuses on a small subset of compute problems that behave like producer and consumer programs. In such a problem, a thread running on one core produces data, which is consumed by a thread running on a separate core. With such programs, it is desirable to get the two distinct cores to communicate through the shared cache, to avoid round trips to and from main memory. The '''MOESI''' protocol that the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] cache uses for cache coherence can also limit bandwidth. Hence, by keeping the cache line in the '''‘M’''' state for such computing problems, we can achieve better performance.&lt;br /&gt;
&lt;br /&gt;
When the producer thread writes a new entry, it allocates cache lines in the '''modified (M)''' state. Eventually, these M-marked cache lines will start to fill the L3 cache. When the consumer reads the cache line, the MOESI protocol changes the state of the cache line to '''owned (O)''' in the L3 cache and pulls down a '''shared (S)''' copy for its own use. Now, the producer thread circles the ring buffer and arrives back at the same cache line it had previously written. However, when the producer attempts to write new data to the owned (marked '''‘O’''') cache line, it finds that it cannot, since a cache line marked '''‘O’''' by the previous consumer read does not have sufficient permission for a write request (in the MOESI protocol). To maintain coherence, the memory controller must initiate probes in the other caches (to handle any other S copies that may exist). This slows down the process.&lt;br /&gt;
&lt;br /&gt;
Thus, it is preferable to keep the cache line in the '''‘M’''' state in the L3 cache. In such a situation, when the producer comes back around the ring buffer, it finds the previously written cache line still marked '''‘M’''', to which it is safe to write without coherence concerns. Thus better performance can be achieved by such optimization techniques to standard protocols when implemented in real machines.&lt;br /&gt;
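The effect described above can be illustrated with a toy count of coherence probes. This is only a sketch: the function name, the parameters and the assumption of exactly one probe per write to an owned line are hypothetical simplifications and are not taken from the AMD optimization guide.&lt;br /&gt;

```python
# Toy count of coherence probes for a producer that rewrites ring-buffer lines.
# 'consumer_downgrades' is a hypothetical switch: True models plain MOESI
# (a consumer read leaves the L3 copy in O), False models the optimization
# in which the line stays in M.

def probes_for_ring(slots, laps, consumer_downgrades):
    l3_state = ['M'] * slots           # producer has filled the ring once
    probes = 0
    for _ in range(laps):
        for i in range(slots):
            if consumer_downgrades:
                l3_state[i] = 'O'      # consumer read: M to O, plus an S copy
            if l3_state[i] == 'O':
                probes += 1            # a write to an O line must probe sharers
            l3_state[i] = 'M'          # producer write completes
    return probes
```

With 4 slots and 2 laps, the plain MOESI case counts 8 probes while the case that keeps lines in M counts none.&lt;br /&gt;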
&lt;br /&gt;
More information on how this is implemented, along with various other optimizations, can be found in the [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization Guide for AMD Family 10h Processors].&lt;br /&gt;
&lt;br /&gt;
==Dragon Protocol==&lt;br /&gt;
The '''[http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]''' is an update-based coherence protocol which, unlike the coherence protocols we have seen so far, does not invalidate other cached copies. Write propagation is achieved by updating the cached copies instead of invalidating them.  The Dragon Protocol does not update memory on a cache-to-cache transfer; it delays making memory consistent with the caches until the data is evicted and written back, which saves time and lowers the memory access requirements. Moreover, only the written '''byte''' or '''word''' is '''communicated''' to the '''other caches''' instead of the whole block, which further '''reduces''' '''bandwidth''' usage. It can dynamically detect the sharing status of a block, using a write-through policy for shared blocks and write-back for blocks that are currently not shared. The Dragon Protocol employs the following four states for the cache blocks: '''Shared Clean''', '''Shared Modified''', '''Exclusive''' and  '''Modified'''. &lt;br /&gt;
* '''Modified (M)''' and '''Exclusive (E)''' - these states have the same meaning as explained in the protocols above. &lt;br /&gt;
* '''Shared Modified (Sm)''' - Only one cache in the system can hold a given block in the Shared Modified state. Potentially two or more caches have this block, memory is not up to date, and this processor's cache has modified the block.&lt;br /&gt;
* '''Shared Clean (Sc)''' -  Potentially two or more caches have this block and memory may or may not be up to date (if no other cache has it in Sm state, memory will be up to date else it is not).&lt;br /&gt;
Only when a Shared Modified line is evicted from the cache on a cache miss is the block written back to main memory, in order to keep memory consistent. For more information on the Dragon protocol, refer to the Solihin textbook, page 229. The state transition diagram is given below for reference.&lt;br /&gt;
&lt;br /&gt;
[[File:DragonNew.jpg|500px|thumb|center|Dragon state transition diagram]] &lt;br /&gt;
&lt;br /&gt;
The Dragon protocol implements snoopy caches that provide the appearance of a uniform memory space to multiple processors. Here, each cache listens to 2 buses: the processor bus and the memory bus. The caches are also responsible for address translation, so the processor bus carries virtual addresses and the memory bus carries physical addresses.  The Dragon system was designed to support 4 to 8 Dragon processors.&lt;br /&gt;
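The update-based write handling of the Dragon protocol can be sketched as follows. This is a simplified illustration that only covers write hits and snooped updates: the function names are assumptions, ''shared_line'' stands in for the bus signal that reports whether other caches hold the block, and BusUpd is the textbook name for the update transaction.&lt;br /&gt;

```python
# Sketch of Dragon processor-write handling for one block in one cache.
# 'shared_line' stands in for the bus signal that reports other copies.

def dragon_write(state, shared_line):
    # Return (next_state, bus_transaction) for a write hit.
    if state in ('E', 'M'):
        return 'M', None               # sole copy: written back later
    if state in ('Sc', 'Sm'):
        # Broadcast only the written word; other caches update their copies.
        return ('Sm' if shared_line else 'M'), 'BusUpd'
    raise ValueError('unknown state: ' + state)


def dragon_snoop_update(state):
    # A cache that sees another cache's BusUpd keeps its updated copy.
    if state in ('Sc', 'Sm'):
        return 'Sc'                    # the writer is now the Sm owner
    return state
```

Note that no state in this sketch is ever invalidated: a remote write moves a copy to Sc at most, which is the defining difference from the invalidation-based protocols above.&lt;br /&gt;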
&lt;br /&gt;
=[http://en.wikipedia.org/wiki/Instruction_prefetch Instruction Prefetching]=&lt;br /&gt;
&lt;br /&gt;
In standard [http://en.wikipedia.org/wiki/Pipeline_(computing) CPU pipelining], instruction prefetching is used to minimize the time the CPU spends waiting for instructions to be fetched from main memory.  Prefetching works perfectly on completely linear programs; but on programs with branches, other optimizations (like [http://en.wikipedia.org/wiki/Branch_predictor branch prediction]) have to be implemented, and the cost of taking the wrong branch has to be considered.  Source:  http://dictionary.reference.com/browse/instruction+prefetch&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Instruction prefetching is a technique used to speed up the execution of a program. In multiprocessors, however, prefetching comes at the cost of performance. Due to prefetching, data can be modified in such a way that the memory coherence protocol is unable to handle the effects. In such situations, software must use serializing instructions or cache-invalidation instructions to guarantee that subsequent data accesses are coherent. &lt;br /&gt;
&amp;lt;br /&amp;gt;&lt;br /&gt;
An example of this type of a situation is a page-table update followed by accesses to the physical pages referenced by the updated page tables. The physical-memory references for the page tables are different than the physical-memory references for the data. Because of prefetching there may be problems with correctness. The following sequence of events shows such a situation when software changes the translation of virtual-page A from physical-page M to physical-page N:&lt;br /&gt;
# The tables that translate virtual-page A to physical-page M are now held only in main memory. The copies in the cache are invalidated.&lt;br /&gt;
# Software changes the page-table entry for virtual-page A in main memory to point to physical-page N rather than physical-page M.&lt;br /&gt;
# Data in virtual-page A is accessed.&lt;br /&gt;
&lt;br /&gt;
Software expects the processor to access the data from physical-page N after the update. However, it is possible for the processor to prefetch the data from physical-page M before the page table for virtual page A is updated. Because the physical-memory references are different, the processor does not recognize them as requiring coherence checking and believes it is safe to prefetch the data from virtual-page A, which is translated into a read from physical page M. Similar behavior can occur when instructions are prefetched from beyond the page-table update instruction.&lt;br /&gt;
&lt;br /&gt;
To prevent such errors, software must execute a TLB-invalidating instruction immediately after the page-table update to ensure that subsequent instruction fetches and data accesses use the correct virtual-page-to-physical-page translation. It is not necessary to perform a TLB invalidation operation preceding the table update.&lt;br /&gt;
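The hazard and its remedy can be illustrated with a toy model. This is a sketch under loud assumptions: the dictionary ''tlb'' merely plays the role of a translation that the hardware has already cached or prefetched, and none of the names correspond to real processor structures or instructions.&lt;br /&gt;

```python
# Toy model of the stale-translation hazard described above.
# The dict 'tlb' plays the role of a cached (possibly prefetched) translation;
# all names here are illustrative, not real hardware interfaces.

page_table = {'A': 'M'}                # virtual page A maps to physical page M
tlb = dict(page_table)                 # translation already cached by hardware
memory = {'M': 'old data', 'N': 'new data'}


def load(vpage):
    # Prefer the cached translation, fall back to the page table.
    ppage = tlb.get(vpage, page_table[vpage])
    return memory[ppage]


page_table['A'] = 'N'                  # software updates the page-table entry
stale = load('A')                      # still translated through page M
tlb.pop('A', None)                     # invalidate right after the update
fresh = load('A')                      # now translated through page N
```

The first load after the update still reads from physical-page M; only after the invalidation does the access reach physical-page N.&lt;br /&gt;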
&lt;br /&gt;
More information can be found about this in [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual]&lt;br /&gt;
&lt;br /&gt;
= CMP Implementation in Intel Architecture =&lt;br /&gt;
&lt;br /&gt;
Let us now see how Intel architecture using the MESI protocol progressed from a uniprocessor architecture to a Chip MultiProcessor (CMP) using the bus as the interconnect. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Uniprocessor Architecture'''&lt;br /&gt;
&lt;br /&gt;
The diagram below shows the structure of the memory cluster in the Intel Pentium M processor.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:intel_cache1.jpg]]&amp;lt;/center&amp;gt; &amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In this structure we have,&lt;br /&gt;
* A unified on-chip '''L1 cache''' with the '''processor/core''',&lt;br /&gt;
* A '''Memory/L2 access control unit''', through which all the accesses to the L2 cache, main memory and IO space are made,&lt;br /&gt;
* The second level '''L2 cache''' along with the '''prefetch unit''' and&lt;br /&gt;
* '''Front side bus (FSB)''', a single shared bi-directional bus across which all the traffic is sent. These wide buses bring in multiple data bytes at a time. &lt;br /&gt;
&lt;br /&gt;
As Intel explains it, with this structure processor requests are first sought in the '''L2 cache''' and only on a '''miss''' are they '''forwarded''' to main '''memory''' via the front side bus ('''FSB'''). The '''Memory/L2 access control''' unit serves as a central point for '''maintaining coherence''' within the core and with the external world. It '''contains''' a '''snoop control unit''' that receives snoop requests from the bus and performs the required operations on each cache (and the internal buffers) in parallel. It also handles RFO requests (BusUpgr) and ensures that the operation continues only after it guarantees that no other version of the cache line exists in any other cache in the system.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''CMP Architecture'''&lt;br /&gt;
&lt;br /&gt;
For its CMP implementation, Intel chose a bus-based architecture using snoopy protocols over a '''directory protocol''' because, although a directory protocol reduces active power through reduced snoop activity, it '''increases''' the '''design complexity''' and the '''static power''' due to larger tag arrays. Since Intel has a large market for processors in the mobility family, a directory-based solution was less favorable, as battery life depends mainly on static power consumption and less on dynamic power.&lt;br /&gt;
Let us examine how '''CMP''' was implemented in the '''Intel Core Duo''', which was one of the first dual-core processors for the budget/entry-level market. &lt;br /&gt;
The general CMP implementation structure of the Intel Core Duo is shown below.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:intel_cache2.jpg]]&amp;lt;/center&amp;gt; &amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This structure has the following changes when compared to the uniprocessor memory cluster structure. &lt;br /&gt;
* '''L1 cache''' and the '''processor/core''' structure is '''duplicated''' to give 2 cores.&lt;br /&gt;
* The '''Memory/L2 access control''' unit is '''split''' into 2 logical units: '''L2 controller''' and '''bus controller'''. The L2 controller handles all '''requests to the L2''' cache from the core and the snoop requests from the FSB. The '''bus controller''' handles '''data and I/O requests''' to and from the FSB.&lt;br /&gt;
* The '''prefetching''' unit is extended to handle the hardware '''prefetches for each core separately'''.&lt;br /&gt;
* A '''new logical unit''' (represented by the hexagon) was added to maintain '''fairness between the requests''' coming from the different cores and hence balance the requests to L2 and memory.&lt;br /&gt;
&lt;br /&gt;
This new '''partitioned structure''' for the  memory/L2 access control unit '''enhanced''' the '''performance''' while '''reducing power consumption'''. &lt;br /&gt;
For more information on uniprocessor and multiprocessor implementation under the Intel architecture, refer to &lt;br /&gt;
[http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]&lt;br /&gt;
&lt;br /&gt;
The '''Intel bus architecture''' has been '''evolving''' in order to accommodate the demands of scalability while using the same MESI protocol: from a '''single shared bus''', to '''dual independent buses (DIB)''' that double the available bandwidth, and on to the logical conclusion of DIB with the introduction of '''dedicated high-speed interconnects (DHSI)'''. The DHSI-based platforms use four FSBs, one for each processor in the platform. In both DIB and DHSI, a snoop filter is used in the chipset to cache snoop information, thereby significantly reducing the broadcasting needed for the snoop traffic on the buses. With the production of processors based on the next-generation 45-nm Hi-k Intel Core microarchitecture, the [http://en.wikipedia.org/wiki/Xeon Intel Xeon] processor fabric will transition from DHSI, with the memory controller in the chipset, to a distributed shared memory architecture using '''Intel QuickPath Interconnects with the MESIF protocol'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Word Invalidation=&lt;br /&gt;
One complexity problem applying to a number of the protocols deals with invalidation.  In newer protocols, individual words may be modified in a cache line, as opposed to the entirety of the line.  The other processors will thus have a mostly correct cache line, with only a word difference.  This leads to potential complexity, because to 'correct' the error the other processors must transition from shared to invalid, where they then can read the correct word from the bus and place it back into the line, transitioning back into shared.  This transition is potentially unnecessary, as the second processor may never access the specific word changed, but it may access other words in the cache line.  One potential solution to this is being researched at the present time, advancing the protocols so that a cache line is &amp;quot;not invalidated on the first dirty word, but after the number of dirty words crosses some predetermined value, which is data type and application dependent.&amp;quot;  In other words, if the application can possibly have multiple words 'incorrect,' several transitions to and from the invalid state may be avoided.&lt;br /&gt;
&amp;lt;br/&amp;gt;Source: http://tab.computer.org/tcca/NEWS/sept96/dsmideas.ps&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In essence, the solution proposed here is to advance the MOESI protocol with word invalidation and specific treatment of temporal and spatial data, so that the block is not invalidated.&lt;br /&gt;
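The threshold idea quoted above can be sketched as follows. This is only an illustration of the principle: the class name, its methods and the concrete ''threshold'' value are hypothetical, since the source only states that the value depends on the data type and the application.&lt;br /&gt;

```python
# Sketch of threshold-based invalidation for one remotely cached line.
# 'threshold' is a hypothetical tuning value; the source only says that it
# depends on the data type and the application.

class SharedLine:
    def __init__(self, threshold):
        self.threshold = threshold
        self.dirty_words = set()
        self.state = 'S'

    def remote_write(self, word_index):
        # Record a word made stale by another processor's write.
        self.dirty_words.add(word_index)
        if len(self.dirty_words) > self.threshold:
            self.state = 'I'           # too many stale words: give up the line

    def can_read(self, word_index):
        # Clean words stay readable while the line remains shared.
        return self.state == 'S' and word_index not in self.dirty_words


line = SharedLine(threshold=2)
line.remote_write(0)
readable_clean_word = line.can_read(1)
readable_dirty_word = line.can_read(0)
```

With a threshold of 2, the line stays in the shared state after one remote write, so the untouched word 1 can still be read locally, while the stale word 0 cannot.&lt;br /&gt;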
&lt;br /&gt;
=Performance: MOESI vs MEI/MESI=&lt;br /&gt;
Early multiprocessors (such as the PowerPC processors) were designed to work with three states: Modified, Exclusive, and Invalid. This is similar to the MSI protocol discussed earlier; the term MEI is used here to remain consistent with the sources used for reference.  As time progressed, more multiprocessors transitioned to the MESI protocol.  This is most likely due to the characteristics of MEI: it is &amp;quot;easy to implement but leads to inefficiencies in the way that memory bus bandwidth is used&amp;quot; [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 source].  As a result, systems with two or more MEI processors will most likely not see the full potential of the extra processors, because of the inefficient bus transactions.&lt;br /&gt;
&lt;br /&gt;
In contrast, the five-state protocol MOESI is more complex to implement than MESI and MEI/MSI.  However, advantages arise from using it.  As Andy Keane (VP of Marketing for PMC-Sierra) put it, &amp;quot;this [fifth] state allows shared data that is dirty to remain in the cache.  Without this state, any shared line would be written back to memory to change the original state from modified. Since we have a dedicated, fast path from CPU to CPU, this state maximizes the use of this path rather than the path to memory, which is inherently slower&amp;quot; [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 source].  This advantage can be realized by allowing MOESI to make some processor-to-processor transfers from level 2 caches.&lt;br /&gt;
&lt;br /&gt;
Source: [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 Edwards, Chris (08/06/2001). &amp;quot;Architects wrestle with multiprocessor options&amp;quot;. Electronic engineering times (0192-1541), (1178), p. 48.]&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Cache_coherence Cache coherence]&lt;br /&gt;
# [http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]&lt;br /&gt;
# [http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Symmetric_multiprocessing Common System Interface in Intel Processors]&lt;br /&gt;
# [http://www.zak.ict.pwr.wroc.pl/nikodem/ak_materialy/Cache%20consistency%20&amp;amp;%20MESI.pdf Cache consistency with MESI on Intel processor]&lt;br /&gt;
# [http://techreport.com/articles.x/8236/2 AMD dual core Architecture]&lt;br /&gt;
# [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual]&lt;br /&gt;
# [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors]&lt;br /&gt;
# [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of AMD 64 bit core]&lt;br /&gt;
# [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&amp;amp;arnumber=4913 Silicon Graphics Computer Systems]&lt;br /&gt;
# [http://books.google.com/books?id=g82fofiqa5IC&amp;amp;printsec=frontcover&amp;amp;dq=Parallel+computer+architecture:+a+hardware/software+approach+By+David+E.+Culler,+Jaswinder+Pal+Singh,+Anoop+Gupta&amp;amp;source=bl&amp;amp;ots=COrdamlfVn&amp;amp;sig=YcugVqbzTjHvlofvaFq6Ft_tjfY&amp;amp;hl=en&amp;amp;ei=0ZO6S4TJGcOclgejzI3BBw&amp;amp;sa=X&amp;amp;oi=book_result&amp;amp;ct=result&amp;amp;resnum=1&amp;amp;ved=0CAgQ6AEwAA#v=onepage&amp;amp;q=&amp;amp;f=false Parallel computer architecture: a hardware/software approach By David E. Culler, Jaswinder Pal Singh, Anoop Gupta]&lt;br /&gt;
# [http://www.freepatentsonline.com/5283886.html Three state invalidation protocols]&lt;br /&gt;
# [http://portal.acm.org/citation.cfm?id=1499317&amp;amp;dl=GUIDE&amp;amp;coll=GUIDE&amp;amp;CFID=83027384&amp;amp;CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems]&lt;br /&gt;
# [http://portal.acm.org/citation.cfm?id=6514 Cache Coherence protocols: evaluation using a multiprocessor simulation model]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Xerox_Dragon Xerox Dragon]&lt;br /&gt;
# [http://thanaseto.110mb.com/courses/CSD-527-report-engl.pdf Coherence Protocols]&lt;br /&gt;
# [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&amp;amp;arnumber=289691 XDBus]&lt;br /&gt;
# [http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dai0228a/index.html ARM]&lt;br /&gt;
# [http://ixbtlabs.com/articles/ibmpower4/index.html IBM POWER4]&lt;br /&gt;
# [http://www.chiplist.com/Intel_Itanium_2_processor/tree3f-section--2235-/ Intel Itanium 2]&lt;br /&gt;
# [http://techreport.com/articles.x/8236/2 MESI-MESI-MOESI Banana-fana...]&lt;br /&gt;
# [http://rsim.cs.illinois.edu/rsim/Manual/node109.html RSIM]&lt;br /&gt;
# [http://dl.acm.org/citation.cfm?id=1499317 Synapse tightly coupled multiprocessors]&lt;/div&gt;</summary>
		<author><name>Pxu</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8a_fu&amp;diff=59817</id>
		<title>CSC/ECE 506 Spring 2012/8a fu</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8a_fu&amp;diff=59817"/>
		<updated>2012-03-19T01:41:37Z</updated>

		<summary type="html">&lt;p&gt;Pxu: /* Dragon Protocol */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Introduction to bus-based cache coherence in real machines=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==SMP Architecture==&lt;br /&gt;
Most parallel software in the commercial market relies on the shared-memory programming model, in which all processors access the same physical address space. The most common multiprocessors today use the [http://en.wikipedia.org/wiki/Symmetric_multiprocessing SMP] architecture, which uses a common bus as the interconnect.  In the case of multicore processors (&amp;quot;chip multiprocessors,&amp;quot; or CMP), the SMP architecture applies to the cores, treating them as separate processors. The key problem of shared-memory multiprocessors is providing a consistent view of memory across the various cache hierarchies.  This is called the '''''cache coherence problem'''''. Solving it is critical for correctness, and it is also a performance-sensitive design point for supporting the shared-memory model. The cache coherence mechanisms not only govern communication in a shared-memory multiprocessor, but also typically determine how the memory system transfers data between processors, caches, and memory.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:Busbased SMP.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
At any point in logical time, the permissions for a cache block allow either a single writer or multiple readers. The '''''coherence protocol''''' ensures that the invariants of the states are maintained. The coherence states used by most cache coherence protocols are shown in ''Table 1'':&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''States'''&lt;br /&gt;
|  '''Access Type'''&lt;br /&gt;
|  '''Invariant'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  read, write&lt;br /&gt;
|  all other caches in I state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  read&lt;br /&gt;
|  all other caches in I state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Owned'''&lt;br /&gt;
|  read&lt;br /&gt;
|  all other caches in I or S state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  read&lt;br /&gt;
|  no other cache in M or E state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|  -&lt;br /&gt;
|  -&lt;br /&gt;
|}&lt;br /&gt;
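The single-writer or multiple-reader rule behind ''Table 1'' can be checked mechanically. The sketch below is a minimal illustration that assumes the states of one block across all caches are given as a list of one-letter codes; the function name coherent is hypothetical.&lt;br /&gt;

```python
# Hypothetical checker for the invariants listed in Table 1.
def coherent(states):
    """Return True if the per-cache states of one block are compatible.

    states is a list such as ["M", "I", "I"], one entry per cache.
    """
    exclusive = [s for s in states if s in ("M", "E")]
    owners = [s for s in states if s == "O"]
    valid = [s for s in states if s != "I"]
    if exclusive:
        # M or E requires every other cache to be in the I state.
        return len(valid) == 1
    # Otherwise any number of S copies may exist, with at most one O.
    return len(owners) in (0, 1)
```

For example, a Modified copy next to a Shared copy is rejected, while one Owned copy alongside several Shared copies is accepted.&lt;br /&gt;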
&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Snooping Protocols=&lt;br /&gt;
==MSI Protocol==&lt;br /&gt;
&lt;br /&gt;
'''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' is a three-state write-back '''invalidation protocol''' and one of the earliest snooping-based cache coherence protocols. It marks each cache line as being in the '''Modified (M)''', '''Shared (S)''', or '''Invalid (I)''' state. '''Invalid''' means the cache line is either not present or holds invalid data. If the cache line is clean and may be shared by more than one processor, it is marked '''Shared'''. If the cache line is dirty and the processor has exclusive ownership of it, it is in the '''Modified''' state. A BusRdX causes the other caches to invalidate (demote) their copies to the '''I''' state; a cache that holds the line in the '''M''' state flushes it first. A processor write to a line already held in the '''S''' state still posts a BusRdX, after which the line is promoted (upgraded) to the '''M''' state.&lt;br /&gt;
&lt;br /&gt;
The state diagram of the MSI protocol is shown below. Note that the left part shows the response to processor-side requests, and the right part shows the response to snooper-side requests.&lt;br /&gt;
[[File:MSInew.jpg|600px|thumb|center|MSI state transition diagram]]&lt;br /&gt;
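The MSI transitions described in this section can also be written as a small lookup table. This is a minimal sketch that assumes the usual textbook event names (PrRd, PrWr, BusRd, BusRdX); the names MSI and msi_step are hypothetical and do not come from any real implementation.&lt;br /&gt;

```python
# Hypothetical sketch of the MSI transitions for one cache line.
# Keys are (current_state, event); values are (next_state, bus_action).
MSI = {
    ("I", "PrRd"):   ("S", "BusRd"),
    ("I", "PrWr"):   ("M", "BusRdX"),
    ("I", "BusRd"):  ("I", None),
    ("I", "BusRdX"): ("I", None),
    ("S", "PrRd"):   ("S", None),
    ("S", "PrWr"):   ("M", "BusRdX"),   # write hit in S still posts BusRdX
    ("S", "BusRd"):  ("S", None),
    ("S", "BusRdX"): ("I", None),       # another cache wants to write
    ("M", "PrRd"):   ("M", None),
    ("M", "PrWr"):   ("M", None),
    ("M", "BusRd"):  ("S", "Flush"),    # supply the dirty block, keep a copy
    ("M", "BusRdX"): ("I", "Flush"),    # supply the dirty block, give it up
}


def msi_step(state, event):
    """Return (next_state, bus_action) for one line in one cache."""
    return MSI[(state, event)]
```

Processor-side events (PrRd, PrWr) correspond to the left part of the diagram and snooped events (BusRd, BusRdX) to the right part.&lt;br /&gt;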
&lt;br /&gt;
==Synapse protocol==&lt;br /&gt;
From the state transition diagram of MSI, we observe that there is a transition from state '''M''' to state '''S''' when a BusRd is observed for that block. The contents of the block are flushed to the bus before the line goes to the '''S''' state. In certain cases it would be more appropriate to move to the '''I''' state, thus giving up the block entirely. The choice between '''S''' and '''I''' reflects the designer's assumption about which is more likely: that the original processor continues reading the block, or that the new processor goes on to write it. The Synapse protocol, used in the early Synapse multiprocessor, made the alternate choice of going directly from the '''M''' state to the '''I''' state on a BusRd, assuming the migratory pattern would be more frequent. More details about this protocol can be found in these papers published in the 1980s: [http://portal.acm.org/citation.cfm?id=6514 Cache Coherence protocols: evaluation using a multiprocessor simulation model] and [http://portal.acm.org/citation.cfm?id=1499317&amp;amp;dl=GUIDE&amp;amp;coll=GUIDE&amp;amp;CFID=83027384&amp;amp;CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems].&lt;br /&gt;
&lt;br /&gt;
In Synapse protocol '''M''' state is called '''D''' (Dirty) state. The following is the state transition diagram for Synapse protocol:&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:Synapse1.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
===Real Architectures using Synapse===&lt;br /&gt;
No documentation could be found of an actual architecture that uses the Synapse cache-coherence protocol.  Judging from the abstract of the 1984 paper by Steve Frank and Armond Inselberg, &amp;quot;[http://dl.acm.org/citation.cfm?id=1499317 Synapse Tightly Coupled Multiprocessors: A New Approach to Solve Old Problems]&amp;quot;, Synapse appears to have been largely theoretical.&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The [http://www.disi.unige.it/person/DelzannoG/CacheProtocol/synapse.hy BABYLON Project] includes one example of a Synapse cache coherence protocol with atomic synchronization actions.&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
In the Synapse system, memory contention is reduced by designing the processor cache around a non-write-through algorithm that minimizes the bandwidth used between cache and shared memory. The Synapse Expansion Bus includes an ownership level protocol between processor caches.[[#References|&amp;lt;sup&amp;gt;[24]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
==[http://en.wikipedia.org/wiki/MESI_protocol MESI]==&lt;br /&gt;
The main drawback of MSI is that each read-write sequence incurs two bus transactions irrespective of whether the cache line is stored in only one cache or not. Highly parallel programs that have little data sharing suffer the most from this.  MESI protocol solves this problem by introducing the '''Exclusive''' state to distinguish between a cache line stored in multiple caches and a line stored in a single cache.&lt;br /&gt;
Let us briefly see how the MESI protocol works. For a more detailed version, refer to the Solihin textbook, pg. 215.&lt;br /&gt;
&lt;br /&gt;
The MESI coherence protocol marks each cache line as being in one of the Modified, Exclusive, Shared, or Invalid states. &lt;br /&gt;
* '''Invalid''' : The cache line is either not present or is invalid&lt;br /&gt;
* '''Exclusive''' : The cache line is clean and is owned by this core/processor only&lt;br /&gt;
* '''Modified''' : The cache line is dirty and the core/processor has exclusive ownership of it; the copy in memory is out of date&lt;br /&gt;
* '''Shared''' : The cache line is clean and is shared by more than one core/processor&lt;br /&gt;
&lt;br /&gt;
The MESI protocol works as follows: &lt;br /&gt;
A line that is fetched receives the '''E''' or '''S''' state, depending on whether it exists in the caches of other processors in the system. A cache line gets the '''M''' state when a processor writes to it; if the line is not in the '''E''' or '''M''' state prior to the write, the cache sends a Bus Upgrade (BusUpgr) signal or, as the Intel manuals term it, a “Read-For-Ownership (RFO) request”, which ensures that the line exists in the cache and is in the '''I''' state in all other processors on the bus (if any). The table below summarizes the MESI protocol.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''This cache line is valid?'''&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  No&lt;br /&gt;
|-&lt;br /&gt;
|  '''The memory copy is…'''&lt;br /&gt;
|  out of date&lt;br /&gt;
|  valid&lt;br /&gt;
|  valid&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies exist in caches of other processors?'''&lt;br /&gt;
|  No&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe&lt;br /&gt;
|  Maybe&lt;br /&gt;
|-&lt;br /&gt;
|  '''A write to this line'''&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  goes directly to bus&lt;br /&gt;
|}&lt;br /&gt;
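The saving that the Exclusive state provides can be counted directly. The sketch below is an illustration only: it assumes a single processor that takes a read miss and then writes the same line, and the function name bus_transactions and its parameters are hypothetical.&lt;br /&gt;

```python
# Hypothetical count of bus transactions for a read miss followed by a
# write to the same line by one processor.
def bus_transactions(protocol, other_sharers):
    """protocol is "MSI" or "MESI"; other_sharers says whether any other
    cache holds the line when it is fetched."""
    count = 1  # the read miss always posts a BusRd
    if protocol == "MSI":
        state = "S"  # MSI has no way to record that the line is unshared
    elif protocol == "MESI":
        state = "S" if other_sharers else "E"
    else:
        raise ValueError("unknown protocol: " + protocol)
    if state == "S":
        count += 1  # a write to an S line must go to the bus
    # A write to an E line moves silently to M with no bus traffic.
    return count
```

For an unshared line this gives two transactions under MSI and one under MESI, which is the case that programs with little data sharing hit most often.&lt;br /&gt;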
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The state transition diagram for the MESI protocol is shown below.&lt;br /&gt;
&lt;br /&gt;
[[File:MESInew.jpg|500px|thumb|center|MESI state transition diagram]] &lt;br /&gt;
&lt;br /&gt;
===Real Architectures using MESI===&lt;br /&gt;
&lt;br /&gt;
Intel's [http://en.wikipedia.org/wiki/Pentium_Pro Pentium Pro] microprocessor, introduced in 1995, was the first Intel architecture microprocessor to support symmetric multiprocessing in various multiprocessor configurations.  SMP with the MESI protocol was used consistently until the introduction of the 45-nm Hi-k Core micro-architecture in Intel's (Nehalem-EP) quad-core x86-64. The 45-nm Hi-k Intel Core microarchitecture uses a new system framework called the [http://en.wikipedia.org/wiki/Intel_QuickPath_Interconnect QuickPath Interconnect], which uses point-to-point interconnection technology based on a distributed shared memory architecture.  It uses a modified version of the MESI protocol called [http://en.wikipedia.org/wiki/MESIF_protocol MESIF], which introduces the additional Forward state.  (The state diagram of MESI transitions that occur within the Pentium's data cache can be found on page 63 of [http://books.google.com/books?id=TVzjEZg1--YC&amp;amp;printsec=frontcover&amp;amp;source=gbs_ge_summary_r&amp;amp;cad=0#v=onepage&amp;amp;q&amp;amp;f=false Pentium Processor System Architecture: Second Edition] by Don Anderson, Tom Shanley, MindShare, Inc)&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
The [http://www.arm.com/products/processors/classic/arm11/arm11-mpcore.php ARM11 MPCore] and [http://en.wikipedia.org/wiki/ARM_Cortex-A9_MPCore Cortex-A9 MPCore] processors support the MESI cache coherency protocol.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]  ARM MPCore defines the states of the MESI protocol it implements as:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies in other caches'''&lt;br /&gt;
|  NO&lt;br /&gt;
|  NO&lt;br /&gt;
|  YES&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Clean or Dirty'''&lt;br /&gt;
|  DIRTY&lt;br /&gt;
|  CLEAN&lt;br /&gt;
|  CLEAN&lt;br /&gt;
|  -&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
The coherency protocol is implemented and managed by the Snoop Control Unit (SCU) in the ARM MPCore, which monitors the traffic between local L1 data caches and the next level of the memory hierarchy. At boot time, each core can choose whether or not to take part in the coherency domain.  Unless explicit system calls bind a task to a specific core ([http://en.wikipedia.org/wiki/Processor_affinity processor affinity]), there is a high chance that a task will at some point migrate to a different core, along with its data as it is used.  Task migration is not handled efficiently by a literal MESI implementation, so the ARM MPCore offers two optimizations that allow for MESI compliance and migration of tasks: '''Direct Data Intervention (DDI)''' (in which the SCU keeps a copy of the tag RAMs of all cores’ caches, which lets it efficiently detect whether a cache line requested by one core is held by another core in the coherency domain before looking for it in the next level of the memory hierarchy) and '''Cache-to-cache Migration''' (in which, if the SCU finds that the cache line requested by one CPU is present in another core, it either copies it (if clean) or moves it (if dirty) from the other CPU directly into the requesting one, without interacting with external memory).  These optimizations reduce memory traffic in and out of the L1 cache subsystem by eliminating interaction with external memories, which in effect reduces the overall load on the interconnect and the overall power consumption.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
Other architectures that use the MESI cache-coherence protocol include the L2 cache of the IBM POWER4 processor[[#References|&amp;lt;sup&amp;gt;[20]&amp;lt;/sup&amp;gt;]], the L2 cache of the Intel Itanium 2 processor[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]], and the Intel Xeon[[#References|&amp;lt;sup&amp;gt;[22]&amp;lt;/sup&amp;gt;]].&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
One implementation complexity already mentioned is the inefficiency of task migration in MESI that the ARM MPCore addresses.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]  Another possible implementation complexity is found during the replacement of a cache line: one possible MESI implementation requires a message to be sent to main memory when a cache line is flushed (i.e. an E to I transition), as the line was exclusively in one cache before it was removed. It is possible to avoid this replacement message if the system is designed so that the flush of a modified (exclusive) line requires an acknowledgment from main memory. However, this requires the flush to be stored in a 'write-back' buffer until the reply arrives (to ensure the change is successfully propagated to memory).[[#References|&amp;lt;sup&amp;gt;[23]&amp;lt;/sup&amp;gt;]]&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==MESIF==&lt;br /&gt;
Let us now briefly walk through the '''MESIF protocol''':&lt;br /&gt;
&lt;br /&gt;
The '''MESIF''' protocol, used in the latest Intel multi-core processors, was introduced to '''accommodate the point-to-point''' links used in the QuickPath Interconnect. Using the '''MESI''' protocol in this architecture would send many redundant messages between different processors, often with unnecessarily high latency. For example, when a processor requests a cache line that is stored in multiple locations, every location might respond with the data. As the requesting processor only needs a single copy of the data, the system would be wasting bandwidth. &lt;br /&gt;
As a solution to this problem, an additional state, the '''Forward state''', was added by slightly changing the role of the Shared state. Whenever there is a read request, only the cache holding the line in the F state responds to the request, while all the caches holding it in the S state remain dormant.  Hence, by designating a single cache line to '''respond to requests''', coherency traffic is substantially reduced when multiple copies of the data exist. On a read request, the responding copy also transitions from the F state to the S state. That is, when a cache line in the '''F''' state is '''copied''', the F state '''migrates''' to the '''newer copy''', while the '''older''' one drops back to '''S'''. Moving the new copy to the F state '''exploits''' both '''temporal and spatial locality'''. Because the newest copy of the cache line is always in the F state, it is very unlikely that the line in the F state will be evicted from the caches. This takes advantage of the temporal locality of the request. The second advantage is that if a particular cache line is in high demand due to spatial locality, the bandwidth used to transmit that data will be spread across several nodes.&lt;br /&gt;
All M to S and E to S state transitions now become '''M to F''' and '''E to F''' transitions.  &lt;br /&gt;
The '''F state''' is '''different''' from the '''Owned state''' of the MOESI protocol in that a line in F is '''not''' a unique copy, because a valid copy is also stored in memory. Thus, unlike the Owned state of the MOESI protocol, in which the O copy is dirty and memory may be out of date, the data in the F state can be evicted or converted to the S state, if desired. &lt;br /&gt;
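The migration of the Forward state to the newest copy can be sketched as follows. This is a simplified illustration that only models a read served by a cache holding the line in F; the function name mesif_read and the dictionary layout are hypothetical, and the cases where memory or an M/E holder responds are deliberately left out.&lt;br /&gt;

```python
# Hypothetical sketch of how the F state migrates on a read request.
def mesif_read(states, requester):
    """Apply one read by `requester` to a dict mapping cache id to state.

    Returns the id of the cache that forwards the data, or None when no
    cache holds the line in F (a case this sketch does not model).
    """
    forwarders = [c for c, s in states.items() if s == "F"]
    if not forwarders:
        return None
    old = forwarders[0]        # at most one F copy exists at a time
    states[old] = "S"          # the older copy drops back to Shared
    states[requester] = "F"    # the newest copy becomes the forwarder
    return old
```

Caches in the S state are never consulted here, which reflects the point made above that only the single F copy answers a read.&lt;br /&gt;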
&lt;br /&gt;
More information on the QuickPath Interconnect and MESIF protocol can be found at&lt;br /&gt;
'''[http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]'''&lt;br /&gt;
&lt;br /&gt;
==MOESI==&lt;br /&gt;
[http://en.wikipedia.org/wiki/Opteron AMD Opteron] was AMD’s first-generation dual-core processor, with 2 distinct [http://en.wikipedia.org/wiki/Athlon_64 K8 cores] together on a single die.  Cache coherence poses bigger problems on such multiprocessors, and an appropriate coherence protocol was needed to address them. The [http://en.wikipedia.org/wiki/Xeon Intel Xeon], Intel's competing processor, used the MESI protocol to handle cache coherence.  MESI came with the drawback of using much time and bandwidth in certain situations. &lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/MOESI_protocol MOESI] was AMD’s answer to this problem. MOESI adds a fifth state to the MESI protocol, called '''“Owned”'''. MOESI addresses the bandwidth problem faced in the MESI protocol when a processor that has invalid data in its cache wants to access data that another processor has modified.  Under MESI, the processor seeking access has to wait for the processor that modified the data to write it back to main memory, which takes time and bandwidth. MOESI removes this drawback by allowing dirty sharing.  When the data is held by a processor in the new state '''“Owned”''', it can provide the modified data to other processors without, or even before, writing it to main memory. This is called '''''dirty sharing'''''. The processor holding the data in the '''&amp;quot;Owned&amp;quot;''' state remains responsible for updating main memory later, when the cache line is evicted.&lt;br /&gt;
&lt;br /&gt;
MOESI has become one of the most popular snoop-based protocols supported in the AMD64 architecture.  The AMD dual-core Opteron can maintain cache coherence in systems up to 8 processors using this protocol.&lt;br /&gt;
&lt;br /&gt;
The five different states of the MOESI protocol are:&lt;br /&gt;
* '''Modified (M)''' : The most recent copy of the data is present in the cache line. But it is not present in any other processor cache.&lt;br /&gt;
* '''Owned (O)'''   : The cache line has the most recent correct copy of the data. This can be shared by other processors. The processor holding the line in this state is responsible for writing the correct value to main memory before the line is evicted.  &lt;br /&gt;
* '''Exclusive (E)''' : A cache line holds the most recent, correct copy of the data, which is exclusively present on this processor and a copy is present in the main memory.  &lt;br /&gt;
* '''Shared (S)''' : A cache line in the shared state holds the most recent, correct copy of the data, which may be shared by other processors. &lt;br /&gt;
* '''Invalid (I)''' : A cache line does not hold a valid copy of the data.&lt;br /&gt;
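To make the role of the Owned state concrete, the sketch below contrasts what happens to a Modified line when another processor's read is snooped. It is a minimal illustration under the assumptions stated in the comments; the function name snoop_read_on_modified is hypothetical.&lt;br /&gt;

```python
# Hypothetical sketch of dirty sharing: MESI versus MOESI.
def snoop_read_on_modified(protocol):
    """Return (next_state, memory_written) for a line held in M when a
    read from another processor is snooped."""
    if protocol == "MESI":
        # The dirty data is written back so that the line can become S.
        return ("S", True)
    if protocol == "MOESI":
        # The data is supplied cache to cache; memory stays stale and the
        # owner writes the line back only when it is evicted.
        return ("O", False)
    raise ValueError("unknown protocol: " + protocol)
```

The second element of the result is the memory write that MOESI avoids, which is where the bandwidth saving described above comes from.&lt;br /&gt;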
&lt;br /&gt;
A detailed explanation of this protocol implementation on AMD processor can be found in the manual [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of the AMD 64-bit core]&lt;br /&gt;
&lt;br /&gt;
The following table summarizes the MOESI protocol:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Owned'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''This cache line is valid?'''&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  No&lt;br /&gt;
|-&lt;br /&gt;
|  '''The memory copy is…'''&lt;br /&gt;
|  out of date&lt;br /&gt;
|  out of date&lt;br /&gt;
|  valid&lt;br /&gt;
|  valid&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies exist in caches of other processors?'''&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe&lt;br /&gt;
|  Maybe&lt;br /&gt;
|-&lt;br /&gt;
|  '''A write to this line'''&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  goes directly to bus&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The state transition diagram for the MOESI protocol is shown below. The top part shows the response to processor-side requests and the bottom part shows the response to snooper-side requests. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:MOESI.jpg|500px|thumb|center|MOESI state transition diagram]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br /&amp;gt;&lt;br /&gt;
&amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Optimization techniques on MOESI===&lt;br /&gt;
&lt;br /&gt;
In real machines, applying optimization techniques to the standard cache coherence protocol can improve performance. For example, the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] family of microprocessors (Family 10h), which is AMD’s first generation to incorporate 4 distinct cores on a single die and the first to have a cache that all the cores share, uses the MOESI protocol with some optimization techniques incorporated. &lt;br /&gt;
&lt;br /&gt;
One such optimization targets a small subset of compute problems that behave like producer-consumer programs. In such a problem, a thread running on one core produces data, which is consumed by a thread running on a separate core. With such programs, it is desirable to get the two cores to communicate through the shared cache, to avoid round trips to and from main memory. The '''MOESI''' protocol that the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] cache uses for cache coherence can also limit bandwidth. Hence, by keeping the cache line in the '''‘M’''' state for such computing problems, we can achieve better performance.&lt;br /&gt;
&lt;br /&gt;
When the producer thread writes a new entry into the ring buffer it shares with the consumer, it allocates cache lines in the '''modified (M)''' state. Eventually, these M-marked cache lines will start to fill the L3 cache. When the consumer reads the cache line, the MOESI protocol changes the state of the cache line to '''owned (O)''' in the L3 cache and pulls down a '''shared (S)''' copy for its own use. The producer thread then circles the ring buffer and arrives back at the same cache line it had previously written. However, when the producer attempts to write new data to the owned (marked '''‘O’''') cache line, it finds that it cannot, since a cache line marked '''‘O’''' by the previous consumer read does not have sufficient permission for a write request (in the MOESI protocol). To maintain coherence, the memory controller must initiate probes in the other caches (to handle any other S copies that may exist). This slows down the process.&lt;br /&gt;
&lt;br /&gt;
Thus, it is preferable to keep the cache line in the '''‘M’''' state in the L3 cache. In such a situation, when the producer comes back around the ring buffer, it finds the previously written cache line still marked '''‘M’''', to which it is safe to write without coherence concerns. Thus better performance can be achieved by such optimization techniques to standard protocols when implemented in real machines.&lt;br /&gt;
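The effect described above can be illustrated by counting probes in a toy model of the ring buffer. Everything here is an assumption made for illustration: the function name producer_probes, the flag keep_modified, and the rule that the consumer reads each slot before the producer returns to it. The sketch does not model how real hardware keeps the line in M.&lt;br /&gt;

```python
# Hypothetical toy model of a producer writing a ring buffer of cache lines.
def producer_probes(laps, slots, keep_modified):
    """Count the probes the producer triggers over `laps` passes.

    keep_modified=False models standard MOESI, where a consumer read
    demotes the line to O; keep_modified=True models the optimization
    in which the line stays in M in the shared cache.
    """
    state = ["I"] * slots
    probes = 0
    for _ in range(laps):
        for i in range(slots):
            if state[i] == "O":
                probes += 1      # O lacks write permission: probe the S copies
            state[i] = "M"       # the producer's write leaves the line in M
            # The consumer reads the slot before the producer comes back.
            if not keep_modified:
                state[i] = "O"
    return probes
```

In this model, three laps over four slots cost eight probes under the standard behaviour and none when the lines stay in M; the first lap is free in both cases because the lines start out Invalid.&lt;br /&gt;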
&lt;br /&gt;
You can find more information on how this is implemented and various other ways of optimizations in this manual [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors]&lt;br /&gt;
&lt;br /&gt;
==Dragon Protocol==&lt;br /&gt;
The '''[http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]''' is an update-based coherence protocol: unlike the coherence protocols seen so far, it does not invalidate other cached copies. Write propagation is achieved by updating the cached copies instead of invalidating them.  The Dragon Protocol does not update memory on a cache-to-cache transfer; it delays bringing memory in line with the caches until the data is evicted and written back, which saves time and lowers the memory access requirements. Moreover, only the written '''byte''' or '''word''' is '''communicated''' to the '''other caches''' instead of the whole block, which further '''reduces''' the '''bandwidth''' usage. It can dynamically detect the sharing status of a block, using a write-through policy for shared blocks and write-back for currently non-shared blocks. The Dragon Protocol employs the following four states for cache blocks: '''Shared Clean''', '''Shared Modified''', '''Exclusive''' and  '''Modified'''. &lt;br /&gt;
* '''Modified (M)''' and '''Exclusive (E)''' - these states have the same meaning as explained in the protocols above. &lt;br /&gt;
* '''Shared Modified (Sm)''' - Only one cache in the system can hold a given line in the Shared Modified state. Potentially two or more caches have this block, memory may or may not be up to date, and this processor's cache has modified the block.&lt;br /&gt;
* '''Shared Clean (Sc)''' -  Potentially two or more caches have this block and memory may or may not be up to date (if no other cache has it in Sm state, memory will be up to date else it is not).&lt;br /&gt;
A Shared Modified line is written back to main memory only when it is evicted from the cache on a cache miss, in order to keep memory consistent. For more information on the Dragon protocol, refer to the Solihin textbook, page 229. The state transition diagram is given below for reference.&lt;br /&gt;
&lt;br /&gt;
[[File:DragonNew.jpg|500px|thumb|center|Dragon state transition diagram]] &lt;br /&gt;
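The update-on-write behaviour can be sketched as below. This is a minimal illustration that assumes the writer already holds the block and that the dictionary lists only the caches currently holding it; the function name dragon_write and the data layout are hypothetical.&lt;br /&gt;

```python
# Hypothetical sketch of a processor write under the Dragon protocol.
def dragon_write(holders, writer, word, value):
    """Write one word of a block that `writer` already holds.

    holders maps cache id to {"state": str, "data": list of words} for
    every cache that currently holds the block.
    Returns the number of words sent on the bus.
    """
    holders[writer]["data"][word] = value
    others = [c for c in holders if c != writer]
    if not others:
        holders[writer]["state"] = "M"     # sole holder: no bus traffic
        return 0
    for c in others:
        holders[c]["data"][word] = value   # the update carries only one word
        holders[c]["state"] = "Sc"         # a previous Sm holder is demoted
    holders[writer]["state"] = "Sm"        # the writer now owns the dirty block
    return 1
```

Memory is not touched anywhere in this function, matching the point above that the write-back is deferred until the Shared Modified line is evicted.&lt;br /&gt;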
&lt;br /&gt;
The Dragon protocol implements snoopy caches that provide the appearance of a uniform memory space to multiple processors. Here, each cache listens to 2 buses: the processor bus and the memory bus. The caches are also responsible for address translation, so the processor bus carries virtual addresses and the memory bus carries physical addresses.  The Dragon system was designed to support 4 to 8 Dragon processors.&lt;br /&gt;
&lt;br /&gt;
=[http://en.wikipedia.org/wiki/Instruction_prefetch Instruction Prefetching]=&lt;br /&gt;
&lt;br /&gt;
In standard [http://en.wikipedia.org/wiki/Pipeline_(computing) CPU pipelining], instruction prefetching is an optimization used to minimize the time the CPU spends waiting for instructions to be fetched from main memory.  Prefetching works perfectly on completely linear programs, but on programs with branches other optimizations (like [http://en.wikipedia.org/wiki/Branch_predictor branch prediction]) have to be implemented, and the cost of taking the wrong branch has to be considered.  Source:  http://dictionary.reference.com/browse/instruction+prefetch&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Instruction prefetching is a technique used to speed up the execution of a program. In multiprocessors, however, prefetching comes at a cost: data can be modified after it has been prefetched, in a way that the memory coherence protocol is not able to handle. In such situations, software must use serializing instructions or cache-invalidation instructions to guarantee that subsequent data accesses are coherent. &lt;br /&gt;
&amp;lt;br /&amp;gt;&lt;br /&gt;
An example of this type of situation is a page-table update followed by accesses to the physical pages referenced by the updated page tables. The physical-memory references for the page tables are different from the physical-memory references for the data, so prefetching can cause correctness problems. The following sequence of events shows such a situation, in which software changes the translation of virtual-page A from physical-page M to physical-page N:&lt;br /&gt;
# The tables that translate virtual-page A to physical-page M are now held only in main memory. The copies in the cache are invalidated.&lt;br /&gt;
# Page-table entry is changed by the software for virtual-page A in main memory to point to physical page N rather than physical-page M.&lt;br /&gt;
# Data in virtual-page A is accessed.&lt;br /&gt;
&lt;br /&gt;
Software expects the processor to access the data from physical-page N after the update. However, it is possible for the processor to prefetch the data from physical-page M before the page table for virtual page A is updated. Because the physical-memory references are different, the processor does not recognize them as requiring coherence checking and believes it is safe to prefetch the data from virtual-page A, which is translated into a read from physical page M. Similar behavior can occur when instructions are prefetched from beyond the page-table update instruction.&lt;br /&gt;
&lt;br /&gt;
To prevent this problem, software must execute an invalidating instruction (on AMD64, INVLPG or MOV CR3) immediately after the page-table update, which ensures that subsequent instruction fetches and data accesses use the correct virtual-page-to-physical-page translation. It is not necessary to perform a TLB invalidation operation preceding the table update.&lt;br /&gt;
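&lt;br /&gt;
The hazard and its fix can be mimicked with a toy model. This is a deliberately simplified sketch: a real processor does all of this in hardware, the names ToyMMU, translate and invalidate are hypothetical, and the invalidate method merely stands in for whatever invalidation instruction the architecture provides.&lt;br /&gt;

```python
# Toy model of the stale-translation hazard described above.
# ToyMMU, translate and invalidate are hypothetical names for this sketch.

class ToyMMU:
    def __init__(self, page_table):
        self.page_table = page_table   # virtual page to physical page, "in memory"
        self.cached = {}               # translations the processor already fetched

    def translate(self, virtual_page):
        # Reuse an earlier (possibly speculative) lookup when one exists.
        if virtual_page not in self.cached:
            self.cached[virtual_page] = self.page_table[virtual_page]
        return self.cached[virtual_page]

    def invalidate(self, virtual_page):
        # Discard the cached translation so the next access re-reads the table.
        self.cached.pop(virtual_page, None)


page_table = {"A": "M"}
mmu = ToyMMU(page_table)
mmu.translate("A")             # the processor looks ahead: A maps to M
page_table["A"] = "N"          # software updates the page table
stale = mmu.translate("A")     # still M: the update was not observed
mmu.invalidate("A")            # invalidate right after the update
fresh = mmu.translate("A")     # now N, as software expects
```

Without the invalidate step, the second lookup keeps returning physical-page M even though the page table already points to physical-page N.&lt;br /&gt;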
&lt;br /&gt;
More information about this can be found in the [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual].&lt;br /&gt;
&lt;br /&gt;
= CMP Implementation in Intel Architecture =&lt;br /&gt;
&lt;br /&gt;
Let us now see how Intel architecture using the MESI protocol progressed from a uniprocessor architecture to a Chip MultiProcessor (CMP) using the bus as the interconnect. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Uniprocessor Architecture'''&lt;br /&gt;
&lt;br /&gt;
The diagram below shows the structure of the memory cluster in Intel Pentium M processor.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:intel_cache1.jpg]]&amp;lt;/center&amp;gt; &amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In this structure we have,&lt;br /&gt;
* A unified on-chip '''L1 cache''' with the '''processor/core''',&lt;br /&gt;
* A '''Memory/L2 access control unit''', through which all the accesses to the L2 cache, main memory and IO space are made,&lt;br /&gt;
* The second level '''L2 cache''' along with the '''prefetch unit''' and&lt;br /&gt;
* '''Front side bus (FSB)''', a single shared bi-directional bus through which all the traffic is sent. These wide buses bring in multiple data bytes at a time. &lt;br /&gt;
&lt;br /&gt;
As Intel explains it, using this structure, the processor requests were first sought in the '''L2 cache''' and only on a '''miss''' were they '''forwarded''' to the main '''memory''' via the front side bus ('''FSB'''). The '''Memory/L2 access control''' unit served as a central point for '''maintaining coherence''' within the core and with the external world. It '''contains''' a '''snoop control unit''' that receives snoop requests from the bus and performs the required operations on each cache (and internal buffers) in parallel. It also handles RFO requests (BusUpgr) and ensures that the operation continues only after it guarantees that no other version of the cache line exists in any other cache in the system.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''CMP Architecture'''&lt;br /&gt;
&lt;br /&gt;
For the CMP implementation, Intel chose the bus-based architecture using snoopy protocols over the '''directory protocol''': although a directory protocol reduces active power because of reduced snoop activity, it '''increases''' the '''design complexity''' and the '''static power''' because of larger tag arrays. Since Intel has a large market for processors in the mobility family, a directory-based solution was less favorable, because battery life depends mainly on static power consumption and less on dynamic power.&lt;br /&gt;
Let us examine how '''CMP''' was implemented in the '''Intel Core Duo''', which was one of the first dual-core processors for the budget/entry-level market. &lt;br /&gt;
The general CMP implementation structure of the Intel Core Duo is shown below.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:intel_cache2.jpg]]&amp;lt;/center&amp;gt; &amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This structure has the following changes when compared to the uniprocessor memory cluster structure. &lt;br /&gt;
* '''L1 cache''' and the '''processor/core''' structure is '''duplicated''' to give 2 cores.&lt;br /&gt;
* The '''Memory/L2 access control''' unit is '''split''' into 2 logical units: '''L2 controller''' and '''bus controller'''. The L2 controller handles all '''requests to the L2''' cache from the core and the snoop requests from the FSB. The '''bus controller''' handles '''data and I/O requests''' to and from the FSB.&lt;br /&gt;
* The '''prefetching''' unit is extended to handle the hardware '''prefetches for each core separately'''.&lt;br /&gt;
* A '''new logical unit''' (represented by the hexagon) was added to maintain '''fairness between the requests''' coming from the different cores and hence balance the requests to L2 and memory.&lt;br /&gt;
&lt;br /&gt;
This new '''partitioned structure''' for the  memory/L2 access control unit '''enhanced''' the '''performance''' while '''reducing power consumption'''. &lt;br /&gt;
For more information on uniprocessor and multiprocessor implementation under the Intel architecture, refer to &lt;br /&gt;
[http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]&lt;br /&gt;
&lt;br /&gt;
The '''Intel bus architecture''' has been '''evolving''' to accommodate the demands of scalability while using the same MESI protocol: from a '''single shared bus''', to '''dual independent buses (DIB)''' that double the available bandwidth, and on to the logical conclusion of DIB with the introduction of '''dedicated high-speed interconnects (DHSI)'''. The DHSI-based platforms use four FSBs, one for each processor in the platform. In both DIB and DHSI, a snoop filter in the chipset caches snoop information, thereby significantly reducing the broadcasting needed for snoop traffic on the buses. With the production of processors based on the next-generation 45-nm Hi-k Intel Core microarchitecture, the [http://en.wikipedia.org/wiki/Xeon Intel Xeon] processor fabric will transition from DHSI, with the memory controller in the chipset, to a distributed shared memory architecture using '''Intel QuickPath Interconnects''' with the '''MESIF protocol'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Word Invalidation=&lt;br /&gt;
One complexity problem applying to a number of the protocols deals with invalidation.  In newer protocols, individual words may be modified in a cache line, as opposed to the entirety of the line.  The other processors will thus have a mostly correct cache line, with only a word difference.  This leads to potential complexity, because to 'correct' the error the other processors must transition from shared to invalid, where they then can read the correct word from the bus and place it back into the line, transitioning back into shared.  This transition is potentially unnecessary, as the second processor may never access the specific word changed, but it may access other words in the cache line.  One potential solution to this is being researched at the present time, advancing the protocols so that a cache line is &amp;quot;not invalidated on the first dirty word, but after the number of dirty words crosses some predetermined value, which is data type and application dependent.&amp;quot;  In other words, if the application can possibly have multiple words 'incorrect,' several transitions to and from the invalid state may be avoided.&lt;br /&gt;
&amp;lt;br/&amp;gt;Source: http://tab.computer.org/tcca/NEWS/sept96/dsmideas.ps&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In essence, the solution proposed here is to advance the MOESI protocol with word invalidation and specific treatment of temporal and spatial data, so that the block is not invalidated.&lt;br /&gt;
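&lt;br /&gt;
The threshold idea quoted above can be sketched from the point of view of one sharer. This is a minimal illustration under assumptions, not the scheme from the cited source: the class SharedLineCopy and its fields are hypothetical, and the threshold is simply passed in, since the source only says that it depends on the data type and the application.&lt;br /&gt;

```python
# Sketch of threshold-based invalidation for one shared cache line.
# SharedLineCopy and its fields are hypothetical names for this example.

class SharedLineCopy:
    def __init__(self, words_per_line, dirty_threshold):
        self.dirty = [False] * words_per_line   # words changed by another cache
        self.dirty_threshold = dirty_threshold
        self.valid = True

    def observe_remote_write(self, word_index):
        """Record that another processor modified one word of this line."""
        if not self.valid:
            return
        self.dirty[word_index] = True
        # Invalidate the whole line only once the dirty count crosses the threshold.
        if sum(self.dirty) > self.dirty_threshold:
            self.valid = False

    def can_read_locally(self, word_index):
        """A word is usable if the line is valid and that word is not stale."""
        return self.valid and not self.dirty[word_index]
```

With a threshold of one, the first remote write leaves the other words of the line readable, and only the second remote write invalidates the line.&lt;br /&gt;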
&lt;br /&gt;
=Performance: MOESI vs MEI/MESI=&lt;br /&gt;
Early multiprocessors (such as the PowerPC processors) were designed to work with three states (Modified, Exclusive, Invalid - this is similar to the MSI protocol discussed earlier, the term MEI is used to remain consistent with the sources used for reference).  As time progressed, more multi-processors transitioned to the MESI protocol.  This is most likely due to the characteristics of MEI - it is &amp;quot;easy to implement but leads to inefficiencies in the way that memory bus bandwidth is used&amp;quot; [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 source].  As a result, systems with two or more MEI processors (most likely) will not see the 'full' potential of the extra processors, due to the inefficient bus transactions.&lt;br /&gt;
&lt;br /&gt;
In contrast, the five-state protocol MOESI is more complex to implement than MESI and MEI/MSI.  However, using it brings advantages.  As Andy Keane (VP of Marketing for PMC-Sierra) put it, &amp;quot;this [fifth] state allows shared data that is dirty to remain in the cache.  Without this state, any shared line would be written back to memory to change the original state from modified. Since we have a dedicated, fast path from CPU to CPU, this state maximizes the use of this path rather than the path to memory, which is inherently slower&amp;quot; [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 source].  MOESI realizes this advantage by allowing some processor-to-processor transfers to be made directly from the level 2 caches.&lt;br /&gt;
&lt;br /&gt;
Source: [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 Edwards, Chris (08/06/2001). &amp;quot;Architects wrestle with multiprocessor options&amp;quot;. Electronic engineering times (0192-1541), (1178), p. 48.]&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Cache_coherence Cache coherence]&lt;br /&gt;
# [http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]&lt;br /&gt;
# [http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Symmetric_multiprocessing Common System Interface in Intel Processors]&lt;br /&gt;
# [http://www.zak.ict.pwr.wroc.pl/nikodem/ak_materialy/Cache%20consistency%20&amp;amp;%20MESI.pdf Cache consistency with MESI on Intel processor]&lt;br /&gt;
# [http://techreport.com/articles.x/8236/2 AMD dual core Architecture]&lt;br /&gt;
# [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual]&lt;br /&gt;
# [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors]&lt;br /&gt;
# [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of AMD 64 bit core]&lt;br /&gt;
# [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&amp;amp;arnumber=4913 Silicon Graphics Computer Systems]&lt;br /&gt;
# [http://books.google.com/books?id=g82fofiqa5IC&amp;amp;printsec=frontcover&amp;amp;dq=Parallel+computer+architecture:+a+hardware/software+approach+By+David+E.+Culler,+Jaswinder+Pal+Singh,+Anoop+Gupta&amp;amp;source=bl&amp;amp;ots=COrdamlfVn&amp;amp;sig=YcugVqbzTjHvlofvaFq6Ft_tjfY&amp;amp;hl=en&amp;amp;ei=0ZO6S4TJGcOclgejzI3BBw&amp;amp;sa=X&amp;amp;oi=book_result&amp;amp;ct=result&amp;amp;resnum=1&amp;amp;ved=0CAgQ6AEwAA#v=onepage&amp;amp;q=&amp;amp;f=false Parallel computer architecture: a hardware/software approach By David E. Culler, Jaswinder Pal Singh, Anoop Gupta]&lt;br /&gt;
# [http://www.freepatentsonline.com/5283886.html Three state invalidation protocols]&lt;br /&gt;
# [http://portal.acm.org/citation.cfm?id=1499317&amp;amp;dl=GUIDE&amp;amp;coll=GUIDE&amp;amp;CFID=83027384&amp;amp;CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems]&lt;br /&gt;
# [http://portal.acm.org/citation.cfm?id=6514 Cache Coherence protocols: evaluation using a multiprocessor simulation model]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Xerox_Dragon Xerox Dragon]&lt;br /&gt;
# [http://thanaseto.110mb.com/courses/CSD-527-report-engl.pdf Coherence Protocols]&lt;br /&gt;
# [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&amp;amp;arnumber=289691 XDBus]&lt;br /&gt;
# [http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dai0228a/index.html ARM]&lt;br /&gt;
# [http://ixbtlabs.com/articles/ibmpower4/index.html IBM POWER4]&lt;br /&gt;
# [http://www.chiplist.com/Intel_Itanium_2_processor/tree3f-section--2235-/ Intel Itanium 2]&lt;br /&gt;
# [http://techreport.com/articles.x/8236/2 MESI-MESI-MOESI Banana-fana...]&lt;br /&gt;
# [http://rsim.cs.illinois.edu/rsim/Manual/node109.html RSIM]&lt;br /&gt;
# [http://dl.acm.org/citation.cfm?id=1499317 Synapse tightly coupled multiprocessors]&lt;/div&gt;</summary>
		<author><name>Pxu</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=File:MESInew.jpg&amp;diff=59816</id>
		<title>File:MESInew.jpg</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=File:MESInew.jpg&amp;diff=59816"/>
		<updated>2012-03-19T01:41:12Z</updated>

		<summary type="html">&lt;p&gt;Pxu: uploaded a new version of &amp;amp;quot;File:MESInew.jpg&amp;amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;/div&gt;</summary>
		<author><name>Pxu</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=File:DragonNew.jpg&amp;diff=59815</id>
		<title>File:DragonNew.jpg</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=File:DragonNew.jpg&amp;diff=59815"/>
		<updated>2012-03-19T01:40:59Z</updated>

		<summary type="html">&lt;p&gt;Pxu: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;/div&gt;</summary>
		<author><name>Pxu</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8a_fu&amp;diff=59806</id>
		<title>CSC/ECE 506 Spring 2012/8a fu</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8a_fu&amp;diff=59806"/>
		<updated>2012-03-19T01:35:05Z</updated>

		<summary type="html">&lt;p&gt;Pxu: /* MESI */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Introduction to bus-based cache coherence in real machines=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==SMP Architecture==&lt;br /&gt;
Most parallel software in the commercial market relies on the shared-memory programming model, in which all processors access the same physical address space. The most common multiprocessors today use the [http://en.wikipedia.org/wiki/Symmetric_multiprocessing SMP] architecture, which uses a common bus as the interconnect.  In the case of multicore processors (&amp;quot;chip multiprocessors,&amp;quot; or CMP), the SMP architecture applies to the cores, treating them as separate processors. The key problem of shared-memory multiprocessors is providing a consistent view of memory across the various cache hierarchies.  This is called the '''''cache coherence problem'''''. Solving it is critical for correctness, and it is a performance-sensitive design point in supporting the shared-memory model. The cache coherence mechanisms not only govern communication in a shared-memory multiprocessor, but also typically determine how the memory system transfers data between processors, caches, and memory.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:Busbased SMP.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
At any point in logical time, the permissions for a cache block can allow either a single writer or multiple readers. The '''''coherence protocol''''' ensures the invariants of the states are maintained. The different coherent states used by most of the cache coherence protocols are as shown in ''Table 1'':&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''States'''&lt;br /&gt;
|  '''Access Type'''&lt;br /&gt;
|  '''Invariant'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  read, write&lt;br /&gt;
|  all other caches in I state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  read&lt;br /&gt;
|  all other caches in I state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Owned'''&lt;br /&gt;
|  read&lt;br /&gt;
|  all other caches in I or S state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  read&lt;br /&gt;
|  no other cache in M or E state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|  -&lt;br /&gt;
|  -&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Snooping Protocols=&lt;br /&gt;
==MSI Protocol==&lt;br /&gt;
&lt;br /&gt;
'''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' is a three-state write-back '''invalidation protocol''' and one of the earliest snooping-based cache coherence protocols. It marks each cache line as '''Modified (M)''', '''Shared (S)''' or '''Invalid (I)'''. '''Invalid''' means the cache line is either not present or not valid. If the cache line is clean and may be shared by more than one processor, it is marked '''Shared'''. If the cache line is dirty and the processor has exclusive ownership of it, it is in the '''Modified''' state. A BusRdx causes the other caches to invalidate (demote) their copies to the '''I''' state; a cache that holds the line in the '''M''' state flushes it first. A write that hits a line in the '''S''' state still issues a BusRdx, and the line is promoted (upgraded) to the '''M''' state.&lt;br /&gt;
&lt;br /&gt;
The state diagram of the MSI protocol is shown below. Note that the left part shows the response to processor-side requests, and the right part shows the response to snooper-side requests.&lt;br /&gt;
[[File:MSInew.jpg|600px|thumb|center|MSI state transition diagram]]&lt;br /&gt;
&lt;br /&gt;
==Synapse protocol==&lt;br /&gt;
From the state transition diagram of MSI, we observe that there is a transition from state '''M''' to state '''S''' when a BusRd is observed for that block. The contents of the block are flushed to the bus before the cache goes to the '''S''' state. In certain cases it would be more appropriate to move to the '''I''' state, thus giving up the block entirely. The choice between moving to '''S''' or '''I''' reflects the designer's assertion about whether the original processor is more likely to continue reading the block than the new processor is to write to it. The Synapse protocol, used in the early Synapse multiprocessor, made the alternate choice of going directly from the '''M''' state to the '''I''' state on a BusRd, assuming that the migratory pattern would be more frequent. More details about this protocol can be found in these papers published in the 1980s: [http://portal.acm.org/citation.cfm?id=6514 Cache Coherence protocols: evaluation using a multiprocessor simulation model] and [http://portal.acm.org/citation.cfm?id=1499317&amp;amp;dl=GUIDE&amp;amp;coll=GUIDE&amp;amp;CFID=83027384&amp;amp;CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems].&lt;br /&gt;
&lt;br /&gt;
In Synapse protocol '''M''' state is called '''D''' (Dirty) state. The following is the state transition diagram for Synapse protocol:&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:Synapse1.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
===Real Architectures using Synapse===&lt;br /&gt;
No information could be found on any actual architecture using the Synapse cache-coherence protocol.  Judging from the abstract of the 1984 paper by Steve Frank and Armond Inselberg, &amp;quot;[http://dl.acm.org/citation.cfm?id=1499317 Synapse Tightly Coupled Multiprocessors: A New Approach to Solve Old Problems]&amp;quot;, Synapse was theoretical.&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[http://www.disi.unige.it/person/DelzannoG/CacheProtocol/synapse.hy The BABYLON Project] provides one example of a Synapse cache coherence protocol with atomic synchronization actions.&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
In the Synapse system, memory contention is reduced by designing a processor cache that employs a non-write-through algorithm to minimize the bandwidth between cache and shared memory. The Synapse Expansion Bus includes an ownership level protocol between processor caches.[[#References|&amp;lt;sup&amp;gt;[24]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
==[http://en.wikipedia.org/wiki/MESI_protocol MESI]==&lt;br /&gt;
The main drawback of MSI is that each read-write sequence incurs two bus transactions irrespective of whether the cache line is stored in only one cache or not. Highly parallel programs that have little data sharing suffer the most from this.  MESI protocol solves this problem by introducing the '''Exclusive''' state to distinguish between a cache line stored in multiple caches and a line stored in a single cache.&lt;br /&gt;
Let us briefly see how the MESI protocol works. For a more detailed version, refer to the Solihin textbook, pg. 215.&lt;br /&gt;
&lt;br /&gt;
The MESI coherence protocol marks each cache line as being in one of the Modified, Exclusive, Shared, or Invalid states. &lt;br /&gt;
* '''Invalid''' : The cache line is either not present or is invalid&lt;br /&gt;
* '''Exclusive''' : The cache line is clean and is owned by this core/processor only&lt;br /&gt;
* '''Modified''' : The cache line is dirty and the core/processor has exclusive ownership of it; the copy in memory is not valid either.&lt;br /&gt;
* '''Shared''' : The cache line is clean and is shared by more than one core/processor&lt;br /&gt;
&lt;br /&gt;
The MESI protocol works as follows: &lt;br /&gt;
A line that is fetched receives the '''E''' or '''S''' state, depending on whether it exists in other processors in the system. A cache line gets the '''M''' state when a processor writes to it; if the line is not in the '''E''' or '''M''' state prior to the write, the cache sends a Bus Upgrade (BusUpgr) signal or, as the Intel manuals term it, a “Read-For-Ownership (RFO) request”, which ensures that the line exists in the cache and is in the '''I''' state in all other processors on the bus (if any). The table below summarizes the MESI protocol.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''This cache line is valid?'''&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  No&lt;br /&gt;
|-&lt;br /&gt;
|  '''The memory copy is…'''&lt;br /&gt;
|  out of date&lt;br /&gt;
|  valid&lt;br /&gt;
|  valid&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies exist in caches of other processors?'''&lt;br /&gt;
|  No&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe&lt;br /&gt;
|  Maybe&lt;br /&gt;
|-&lt;br /&gt;
|  '''A write to this line'''&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  goes directly to bus&lt;br /&gt;
|}&lt;br /&gt;
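&lt;br /&gt;
The transitions summarized in the table can also be written as a pair of lookup functions. This is a minimal sketch rather than any vendor's implementation: the function names, the others_have_copy flag, and the use of BusRdX as the name of the write-miss transaction are assumptions made for this example.&lt;br /&gt;

```python
# Minimal MESI sketch: map (state, event) to (next state, bus transaction).
# The function names and the others_have_copy flag are assumptions.

def processor_event(state, event, others_have_copy=False):
    """Return (next_state, bus_transaction) for a local read or write."""
    if event == "PrRd":
        if state == "I":
            # On a miss, the shared signal decides between E and S.
            return ("S" if others_have_copy else "E", "BusRd")
        return (state, None)                  # hit in M, E or S
    if event == "PrWr":
        if state in ("M", "E"):
            return ("M", None)                # silent upgrade from E
        if state == "S":
            return ("M", "BusUpgr")           # invalidate the other copies
        return ("M", "BusRdX")                # write miss from I
    raise ValueError("unknown event: " + event)


def snoop_event(state, transaction):
    """Return (next_state, flush) for a transaction seen on the bus."""
    if state == "I":
        return ("I", False)
    if transaction == "BusRd":
        return ("S", state == "M")            # only a dirty line must flush
    if transaction in ("BusRdX", "BusUpgr"):
        return ("I", state == "M")
    raise ValueError("unknown transaction: " + transaction)
```

A read followed by a write to a line that no other cache holds goes from I to E with one BusRd and then from E to M with no bus transaction at all, which is the saving over MSI that motivates the Exclusive state.&lt;br /&gt;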
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
State transition diagram for MESI protocol is shown below.&lt;br /&gt;
&lt;br /&gt;
[[File:MESInew.jpg|500px|thumb|center|MESI state transition diagram]] &lt;br /&gt;
&lt;br /&gt;
===Real Architectures using MESI===&lt;br /&gt;
&lt;br /&gt;
Intel's [http://en.wikipedia.org/wiki/Pentium_Pro Pentium Pro] microprocessor, introduced in 1995, was the first Intel architecture microprocessor to support symmetric multiprocessing in various multiprocessor configurations.  SMP with the MESI protocol was used consistently until the introduction of the 45-nm Hi-k Core micro-architecture in Intel's (Nehalem-EP) quad-core x86-64. The 45-nm Hi-k Intel Core microarchitecture uses a new system framework called the [http://en.wikipedia.org/wiki/Intel_QuickPath_Interconnect QuickPath Interconnect], a point-to-point interconnection technology based on a distributed shared memory architecture.  It uses a modified version of the MESI protocol called [http://en.wikipedia.org/wiki/MESIF_protocol MESIF], which introduces the additional Forward state.  (The state diagram of the MESI transitions that occur within the Pentium's data cache can be found on page 63 of [http://books.google.com/books?id=TVzjEZg1--YC&amp;amp;printsec=frontcover&amp;amp;source=gbs_ge_summary_r&amp;amp;cad=0#v=onepage&amp;amp;q&amp;amp;f=false Pentium Processor System Architecture: Second Edition] by Don Anderson, Tom Shanley, MindShare, Inc.)&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
The [http://www.arm.com/products/processors/classic/arm11/arm11-mpcore.php ARM11 MPCore] and [http://en.wikipedia.org/wiki/ARM_Cortex-A9_MPCore Cortex-A9 MPCore] processors support the MESI cache coherency protocol.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]  ARM MPCore defines the states of the MESI protocol it implements as:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies in other caches'''&lt;br /&gt;
|  NO&lt;br /&gt;
|  NO&lt;br /&gt;
|  YES&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Clean or Dirty'''&lt;br /&gt;
|  DIRTY&lt;br /&gt;
|  CLEAN&lt;br /&gt;
|  CLEAN&lt;br /&gt;
|  -&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
The coherency protocol is implemented and managed by the Snoop Control Unit (SCU) in the ARM MPCore, which monitors the traffic between the local L1 data caches and the next level of the memory hierarchy. At boot time, each core can choose whether or not to take part in the coherency domain.  Unless explicit system calls bind a task to a specific core ([http://en.wikipedia.org/wiki/Processor_affinity processor affinity]), there is a high chance that a task will at some point migrate to a different core, along with its data as it is used.  Task migration is not handled efficiently by a literal MESI implementation, so the ARM MPCore offers two optimizations that allow for both MESI compliance and migration of tasks: '''Direct Data Intervention (DDI)''' (in which the SCU keeps a copy of the tag RAMs of all cores’ caches, which enables it to efficiently detect whether a cache line requested by one core is held by another core in the coherency domain before looking for it in the next level of the memory hierarchy) and '''Cache-to-cache Migration''' (in which, if the SCU finds that the cache line requested by one CPU is present in another core, it either copies it (if clean) or moves it (if dirty) from the other CPU directly into the requesting one, without interacting with external memory).  These optimizations reduce memory traffic in and out of the L1 cache subsystem by eliminating interaction with external memories, which in effect reduces the overall load on the interconnect and the overall power consumption.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
Other architectures that use the MESI cache-coherence protocol include the L2 cache of the IBM POWER4 processor[[#References|&amp;lt;sup&amp;gt;[20]&amp;lt;/sup&amp;gt;]], the L2 cache of the Intel Itanium 2 processor[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]], and the Intel Xeon[[#References|&amp;lt;sup&amp;gt;[22]&amp;lt;/sup&amp;gt;]].&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
One implementation complexity already mentioned is the inefficiency of task migration in MESI that the ARM MPCore addresses.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]  Another possible implementation complexity is found during the replacement of a cache line: one possible MESI implementation requires a message to be sent to main memory when a cache line is flushed (i.e. an E to I transition), as the line was exclusively in one cache before it was removed. It is possible to avoid this replacement message if the system is designed so that the flush of a modified (exclusive) line requires an acknowledgment from main memory. However, this requires the flush to be stored in a 'write-back' buffer until the reply arrives (to ensure the change is successfully propagated to memory).[[#References|&amp;lt;sup&amp;gt;[23]&amp;lt;/sup&amp;gt;]]&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==MESIF==&lt;br /&gt;
Let us now briefly walk through the '''MESIF protocol''':&lt;br /&gt;
&lt;br /&gt;
The '''MESIF''' protocol, used in the latest Intel multi-core processors, was introduced to '''accommodate the point-to-point''' links used in the QuickPath Interconnect. Using the '''MESI''' protocol in this architecture would send many redundant messages between different processors, often with unnecessarily high latency. For example, when a processor requests a cache line that is stored in multiple locations, every location might respond with the data. As the requesting processor only needs a single copy of the data, the system would be wasting bandwidth. &lt;br /&gt;
As a solution to this problem, an additional state, '''Forward state''', was added by slightly changing the role of the Shared state. Whenever there is a read request, only the cache line in the F state will respond to the request, while all the S state caches remain dormant.  Hence, by designating a single cache line to '''respond to requests''', coherency traffic is substantially reduced when multiple copies of the data exist. Also, on a read request, the F state transitions from F to S state. That is, when a cache line in the '''F''' state is '''copied''', the F state '''migrates''' to the '''newer copy''', while the '''older''' one drops back to '''S'''. Moving the new copy to the F state '''exploits''' both '''temporal and spatial locality'''. Because the newest copy of the cache line is always in the F state, it is very unlikely that the line in the F state will be evicted from the caches. This takes advantage of the temporal locality of the request. The second advantage is that if a particular cache line is in high demand due to spatial locality, the bandwidth used to transmit that data will be spread across several nodes.&lt;br /&gt;
All M to S state transition and E to S state transitions will now be from '''M to F''' and '''E to F'''.  &lt;br /&gt;
The '''F state''' is '''different''' from the '''Owned state''' of the MOESI protocol as it is '''not''' a unique copy because a valid copy is stored in memory. Thus, unlike the Owned state of the MOESI protocol, in which the data in the O state is the only valid copy of the data, the data in the F state can be evicted or converted to the S state, if desired. &lt;br /&gt;
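The forwarding behaviour described above can be sketched in a few lines of Python. This is only an illustrative model, not Intel's implementation: the class name MesifLine and its methods are hypothetical, only clean lines (E, S, F) are modelled, and the sketch follows the migration rule (the newest copy takes F, the previous responder drops to S).&lt;br /&gt;

```python
# Minimal MESIF read-sharing sketch (hypothetical names, clean lines only).

class MesifLine:
    """Tracks the state of one cache line in every cache of a small system."""

    def __init__(self, num_caches):
        self.state = ["I"] * num_caches   # per-cache state: E, S, F or I
        self.log = []                     # (requester, responder) pairs

    def read(self, requester):
        """Handle a read from `requester`; return who supplied the data."""
        if self.state[requester] != "I":
            return "hit"
        # Only a cache holding the line in E or F may answer; S copies stay silent.
        responders = [c for c, s in enumerate(self.state) if s in ("E", "F")]
        if responders:
            source = responders[0]
            self.state[source] = "S"      # the older copy drops back to S
            self.state[requester] = "F"   # the newest copy becomes the forwarder
        else:
            # No designated responder: memory supplies the line.
            source = "memory"
            shared = any(s == "S" for s in self.state)
            self.state[requester] = "F" if shared else "E"
        self.log.append((requester, source))
        return source


line = MesifLine(4)
print(line.read(0), line.state)   # first reader gets the line from memory
print(line.read(1), line.state)   # cache 0 responds, F moves to cache 1
print(line.read(2), line.state)   # cache 1 responds, F moves to cache 2
```

Each read miss is answered by exactly one source, however many S copies exist, which is the traffic reduction the F state is meant to provide.&lt;br /&gt;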
&lt;br /&gt;
More information on the QuickPath Interconnect and MESIF protocol can be found at&lt;br /&gt;
'''[http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]'''&lt;br /&gt;
&lt;br /&gt;
==MOESI==&lt;br /&gt;
[http://en.wikipedia.org/wiki/Opteron AMD Opteron] was AMD’s first-generation dual-core processor, with 2 distinct [http://en.wikipedia.org/wiki/Athlon_64 K8 cores] together on a single die.  Cache coherence becomes a bigger problem on such multiprocessors, so an appropriate coherence protocol was needed. The [http://en.wikipedia.org/wiki/Xeon Intel Xeon], the competing product from Intel, used the MESI protocol to handle cache coherence.  MESI came with the drawback of using much time and bandwidth in certain situations. &lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/MOESI_protocol MOESI] was AMD’s answer to this problem. MOESI adds a fifth state, '''“Owned”''', to the MESI protocol. It addresses the bandwidth problem faced in MESI when a processor holding an invalid copy in its cache wants to access data that another processor has modified.  Under MESI, the requesting processor has to wait for the processor that modified the data to write it back to main memory, which takes time and bandwidth. MOESI removes this drawback by allowing dirty sharing.  When a processor holds the data in the new '''“Owned”''' state, it can provide the modified data to other processors without, or even before, writing it to main memory. This is called '''''dirty sharing'''''. The processor holding the data in the '''&amp;quot;Owned&amp;quot;''' state remains responsible for updating main memory later, when the cache line is evicted.&lt;br /&gt;
&lt;br /&gt;
MOESI has become one of the most popular snoop-based protocols supported in the AMD64 architecture.  The AMD dual-core Opteron can maintain cache coherence in systems up to 8 processors using this protocol.&lt;br /&gt;
&lt;br /&gt;
The five different states of the MOESI protocol are:&lt;br /&gt;
* '''Modified (M)''' : The cache line holds the most recent copy of the data, which is not present in any other processor cache.&lt;br /&gt;
* '''Owned (O)'''   : The cache line holds the most recent, correct copy of the data, which may be shared by other processors. The processor holding the line in this state is responsible for writing the correct value back to main memory when the line is evicted.  &lt;br /&gt;
* '''Exclusive (E)''' : A cache line holds the most recent, correct copy of the data, which is exclusively present on this processor and a copy is present in the main memory.  &lt;br /&gt;
* '''Shared (S)''' : A cache line in the shared state holds the most recent, correct copy of the data, which may be shared by other processors. &lt;br /&gt;
* '''Invalid (I)''' : A cache line does not hold a valid copy of the data.&lt;br /&gt;
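The state descriptions above, and dirty sharing in particular, can be illustrated with a small Python model. It is a sketch under simplifying assumptions, not AMD's implementation: MoesiLine and its methods are hypothetical names, the line holds a single value, and a write simply invalidates every other copy.&lt;br /&gt;

```python
# Minimal MOESI sketch showing dirty sharing (hypothetical names, one-value line).

class MoesiLine:
    def __init__(self, num_caches, memory_value=0):
        self.state = ["I"] * num_caches
        self.data = [None] * num_caches
        self.memory = memory_value
        self.memory_writes = 0            # counts write-backs to main memory

    def write(self, cache, value):
        """Writer gains exclusive ownership; all other copies are invalidated."""
        for other in range(len(self.state)):
            if other != cache:
                self.state[other] = "I"
                self.data[other] = None
        self.state[cache] = "M"
        self.data[cache] = value

    def read(self, cache):
        if self.state[cache] != "I":
            return self.data[cache]
        owners = [c for c, s in enumerate(self.state) if s in ("M", "O")]
        if owners:
            owner = owners[0]
            self.state[owner] = "O"       # dirty sharing: memory is not updated
            self.data[cache] = self.data[owner]
            self.state[cache] = "S"
        else:
            shared = any(s != "I" for s in self.state)
            for c, s in enumerate(self.state):
                if s == "E":
                    self.state[c] = "S"   # an exclusive copy becomes shared
            self.data[cache] = self.memory
            self.state[cache] = "S" if shared else "E"
        return self.data[cache]

    def evict(self, cache):
        """Only M and O lines must be written back when they leave the cache."""
        if self.state[cache] in ("M", "O"):
            self.memory = self.data[cache]
            self.memory_writes += 1
        self.state[cache] = "I"
        self.data[cache] = None


line = MoesiLine(3)
line.write(0, 42)
print(line.read(1), line.state, line.memory)   # cache 1 gets 42, memory still stale
line.evict(0)
print(line.state, line.memory)                 # the write-back happens only now
```

In this model the reader obtains the modified value directly from the owning cache, and main memory is written only when the owner evicts the line.&lt;br /&gt;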
&lt;br /&gt;
A detailed explanation of this protocol implementation on AMD processor can be found in the manual [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of the AMD 64-bit core]&lt;br /&gt;
&lt;br /&gt;
The following table summarizes the MOESI protocol:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
&lt;br /&gt;
|  '''Owned''' &lt;br /&gt;
&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|  '''This cache line is valid?'''&lt;br /&gt;
&lt;br /&gt;
|  Yes&lt;br /&gt;
&lt;br /&gt;
|  Yes&lt;br /&gt;
&lt;br /&gt;
|  Yes&lt;br /&gt;
&lt;br /&gt;
|  Yes&lt;br /&gt;
&lt;br /&gt;
|  No&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|  '''The memory copy is…'''&lt;br /&gt;
&lt;br /&gt;
|  out of date&lt;br /&gt;
&lt;br /&gt;
|  out of date&lt;br /&gt;
&lt;br /&gt;
|  valid&lt;br /&gt;
&lt;br /&gt;
|  valid&lt;br /&gt;
&lt;br /&gt;
|  -&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|  '''Copies exist in caches of other processors?'''&lt;br /&gt;
&lt;br /&gt;
|  No&lt;br /&gt;
&lt;br /&gt;
|  Maybe&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe&lt;br /&gt;
&lt;br /&gt;
|  Maybe&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|  '''A write to this line'''&lt;br /&gt;
&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus (other copies must be invalidated first)&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
&lt;br /&gt;
|  goes directly to bus&lt;br /&gt;
&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The state transition diagram for the MOESI protocol is shown below. The top part shows the response to processor-side requests and the bottom part shows the response to snooper-side requests. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:MOESI.jpg|500px|thumb|center|MOESI state transition diagram]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br /&amp;gt;&lt;br /&gt;
&amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Optimization techniques on MOESI===&lt;br /&gt;
&lt;br /&gt;
In real machines, applying optimization techniques to the standard cache coherence protocol improves performance. For example, the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] family of microprocessors (Family 10h), which is AMD’s first generation to incorporate 4 distinct cores on a single die and the first to have a cache that all the cores share, uses the MOESI protocol with some optimization techniques incorporated. &lt;br /&gt;
&lt;br /&gt;
The optimization targets a small subset of compute problems that behave like producer-consumer programs. In such a problem, a thread running on one core produces data, which is consumed by a thread running on a separate core. With such programs, it is desirable to get the two cores to communicate through the shared cache, to avoid round trips to and from main memory. The '''MOESI''' protocol that the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] cache uses for cache coherence can also limit bandwidth. Hence, by keeping the cache line in the '''‘M’''' state for such computing problems, better performance can be achieved.&lt;br /&gt;
&lt;br /&gt;
When the producer thread writes a new entry, it allocates cache lines in the '''modified (M)''' state. Eventually, these M-marked cache lines start to fill the L3 cache. When the consumer reads a cache line, the MOESI protocol changes the state of the cache line to '''owned (O)''' in the L3 cache and pulls down a '''shared (S)''' copy for its own use. Now, the producer thread circles the ring buffer and arrives back at the cache line it had previously written. However, when the producer attempts to write new data to the owned (marked '''‘O’''') cache line, it finds that it cannot, since a cache line marked '''‘O’''' by the previous consumer read does not have sufficient permission for a write request (in the MOESI protocol). To maintain coherence, the memory controller must initiate probes in the other caches (to handle any other S copies that may exist). This slows down the process.&lt;br /&gt;
&lt;br /&gt;
Thus, it is preferable to keep the cache line in the '''‘M’''' state in the L3 cache. In such a situation, when the producer comes back around the ring buffer, it finds the previously written cache line still marked '''‘M’''', to which it is safe to write without coherence concerns. Thus better performance can be achieved by such optimization techniques to standard protocols when implemented in real machines.&lt;br /&gt;
&lt;br /&gt;
More information on how this is implemented, along with other optimization techniques, can be found in the manual [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors]&lt;br /&gt;
&lt;br /&gt;
==Dragon Protocol==&lt;br /&gt;
The '''[http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]''' is an update-based coherence protocol: unlike the coherence protocols seen so far, it does not invalidate other cached copies. Write propagation is achieved by updating the cached copies instead of invalidating them.  The Dragon Protocol does not update memory on a cache-to-cache transfer; it delays making memory consistent with the cache until the data is evicted and written back, which saves time and lowers the memory access requirements. Moreover, only the written '''byte''' or '''word''' is '''communicated''' to the '''other caches''' instead of the whole block, which further '''reduces''' the '''bandwidth''' usage. The protocol can dynamically detect the sharing status of a block, using a write-through policy for shared blocks and a write-back policy for currently non-shared blocks. The Dragon Protocol employs the following four states for cache blocks: '''Shared Clean''', '''Shared Modified''', '''Exclusive''' and  '''Modified'''. &lt;br /&gt;
* '''Modified (M)''' and '''Exclusive (E)''' - these states have the same meaning as explained in the protocols above. &lt;br /&gt;
* '''Shared Modified (Sm)''' - Potentially two or more caches have this block, memory may or may not be up to date, and this processor's cache has modified the block. Only one cache in the system can hold the line in the Shared Modified state.&lt;br /&gt;
* '''Shared Clean (Sc)''' -  Potentially two or more caches have this block and memory may or may not be up to date (if no other cache has it in the Sm state, memory is up to date; otherwise it is not).&lt;br /&gt;
A Shared Modified line is written back to main memory only when it is evicted from the cache on a cache miss, which keeps memory consistent. For more information on the Dragon protocol, refer to the Solihin textbook, page 229. The state transition diagram is given below for reference.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:Dragon.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The Dragon protocol implements snoopy caches that provide the appearance of a uniform memory space to multiple processors. Here, each cache listens to 2 buses: the processor bus and the memory bus. The caches are also responsible for address translation, so the processor bus carries virtual addresses and the memory bus carries physical addresses.  The Dragon system was designed to support 4 to 8 Dragon processors. &lt;br /&gt;
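The update-based behaviour of the Dragon protocol can be sketched as follows. This is a simplified model under stated assumptions rather than the Xerox design: DragonLine and its methods are hypothetical names, the line holds a single word, a write miss is modelled as a read miss followed by a write, and the marker "I" only means that the block is not present in a cache (Dragon itself has no Invalid state).&lt;br /&gt;

```python
# Minimal Dragon (update-based) sketch; names and structure are illustrative.

class DragonLine:
    def __init__(self, num_caches, memory_value=0):
        self.state = ["I"] * num_caches   # E, M, Sc, Sm, or "I" for not present
        self.data = [None] * num_caches
        self.memory = memory_value
        self.bus_updates = 0              # number of word updates broadcast

    def _sharers(self, cache):
        return [c for c, s in enumerate(self.state) if c != cache and s != "I"]

    def read(self, cache):
        if self.state[cache] == "I":
            sharers = self._sharers(cache)
            if sharers:
                # Prefer the cache that owns the dirty data, if there is one.
                owners = [c for c in sharers if self.state[c] in ("M", "Sm")]
                source = owners[0] if owners else sharers[0]
                self.data[cache] = self.data[source]
                for c in sharers:
                    if self.state[c] == "E":
                        self.state[c] = "Sc"
                    elif self.state[c] == "M":
                        self.state[c] = "Sm"
                self.state[cache] = "Sc"
            else:
                self.data[cache] = self.memory
                self.state[cache] = "E"
        return self.data[cache]

    def write(self, cache, value):
        if self.state[cache] == "I":
            self.read(cache)              # write miss: fetch the block first
        sharers = self._sharers(cache)
        self.data[cache] = value
        if sharers:
            # Broadcast the written word; other copies are updated, not invalidated.
            self.bus_updates += 1
            for c in sharers:
                self.data[c] = value
                self.state[c] = "Sc"
            self.state[cache] = "Sm"      # the writer now owns the dirty block
        else:
            self.state[cache] = "M"       # no sharers: plain write-back behaviour

    def evict(self, cache):
        if self.state[cache] in ("M", "Sm"):
            self.memory = self.data[cache]   # memory is updated only on eviction
        self.state[cache] = "I"
        self.data[cache] = None


line = DragonLine(3)
line.read(0)
line.read(1)
line.write(1, 7)
print(line.state, line.data, line.memory)   # cache 0 sees the new word, memory does not
```

After the write, both cached copies hold the new word while main memory keeps the old value until the Shared Modified copy is evicted.&lt;br /&gt;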
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=[http://en.wikipedia.org/wiki/Instruction_prefetch Instruction Prefetching]=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Instruction prefetching is a technique used to speed up the execution of a program. But in multiprocessors, prefetching comes at a cost: data can be modified in such a way that the memory coherence protocol is not able to handle the effects of a prefetch. In such situations, software must use serializing instructions or cache-invalidation instructions to guarantee that subsequent data accesses are coherent. &lt;br /&gt;
&amp;lt;br /&amp;gt;&lt;br /&gt;
An example of this type of a situation is a page-table update followed by accesses to the physical pages referenced by the updated page tables. The physical-memory references for the page tables are different than the physical-memory references for the data. Because of prefetching there may be problems with correctness. The following sequence of events shows such a situation when software changes the translation of virtual-page A from physical-page M to physical-page N:&lt;br /&gt;
# The tables that translate virtual-page A to physical-page M are now held only in main memory. The copies in the cache are invalidated.&lt;br /&gt;
# Software changes the page-table entry for virtual-page A in main memory to point to physical-page N rather than physical-page M.&lt;br /&gt;
# Data in virtual-page A is accessed.&lt;br /&gt;
&lt;br /&gt;
Software expects the processor to access the data from physical-page N after the update. However, it is possible for the processor to prefetch the data from physical-page M before the page table for virtual page A is updated. Because the physical-memory references are different, the processor does not recognize them as requiring coherence checking and believes it is safe to prefetch the data from virtual-page A, which is translated into a read from physical page M. Similar behavior can occur when instructions are prefetched from beyond the page-table update instruction.&lt;br /&gt;
&lt;br /&gt;
To prevent such errors, software must execute special instructions (on AMD64, INVLPG or MOV CR3) immediately after the page-table update to ensure that subsequent instruction fetches and data accesses use the correct virtual-page-to-physical-page translation. It is not necessary to perform a TLB invalidation operation preceding the table update.&lt;br /&gt;
&lt;br /&gt;
More information can be found about this in [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual]&lt;br /&gt;
&lt;br /&gt;
= CMP Implementation in Intel Architecture =&lt;br /&gt;
&lt;br /&gt;
Let us now see how Intel architecture using the MESI protocol progressed from a uniprocessor architecture to a Chip MultiProcessor (CMP) using the bus as the interconnect. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Uniprocessor Architecture'''&lt;br /&gt;
&lt;br /&gt;
The diagram below shows the structure of the memory cluster in the Intel Pentium M processor.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:intel_cache1.jpg]]&amp;lt;/center&amp;gt; &amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In this structure we have,&lt;br /&gt;
* A unified on-chip '''L1 cache''' with the '''processor/core''',&lt;br /&gt;
* A '''Memory/L2 access control unit''', through which all the accesses to the L2 cache, main memory and IO space are made,&lt;br /&gt;
* The second level '''L2 cache''' along with the '''prefetch unit''' and&lt;br /&gt;
* '''Front side bus (FSB)''', a single shared bi-directional bus through which all the traffic is sent. These wide buses bring in multiple data bytes at a time. &lt;br /&gt;
&lt;br /&gt;
As Intel explains it, using this structure, the processor requests were first sought in the '''L2 cache''' and only on a '''miss''' were they '''forwarded''' to the main '''memory''' via the front side bus ('''FSB'''). The '''Memory/L2 access control''' unit served as a central point for '''maintaining coherence''' within the core and with the external world. It '''contains''' a '''snoop control unit''' that receives snoop requests from the bus and performs the required operations on each cache (and internal buffers) in parallel. It also handles RFO requests (BusUpgr) and ensures the operation continues only after it guarantees that no other version of the cache line exists in any other cache in the system.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''CMP Architecture'''&lt;br /&gt;
&lt;br /&gt;
For the CMP implementation, Intel chose a bus-based architecture using snoopy protocols over a '''directory protocol''': although a directory protocol reduces active power thanks to reduced snoop activity, it '''increases''' the '''design complexity''' and the '''static power''' because of larger tag arrays. Since Intel has a large market for processors in the mobility family, a directory-based solution was less favorable, as battery life depends mainly on static power consumption and less on dynamic power.&lt;br /&gt;
Let us examine how '''CMP''' was implemented in the '''Intel Core Duo''', which was one of the first dual-core processors for the budget/entry-level market. &lt;br /&gt;
The general CMP implementation structure of the Intel Core Duo is shown below.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:intel_cache2.jpg]]&amp;lt;/center&amp;gt; &amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This structure has the following changes when compared to the uniprocessor memory cluster structure. &lt;br /&gt;
* '''L1 cache''' and the '''processor/core''' structure is '''duplicated''' to give 2 cores.&lt;br /&gt;
* The '''Memory/L2 access control''' unit is '''split''' into 2 logical units: '''L2 controller''' and '''bus controller'''. The L2 controller handles all '''requests to the L2''' cache from the core and the snoop requests from the FSB. The '''bus controller''' handles '''data and I/O requests''' to and from the FSB.&lt;br /&gt;
* The '''prefetching''' unit is extended to handle the hardware '''prefetches for each core separately'''.&lt;br /&gt;
* A '''new logical unit''' (represented by the hexagon) was added to maintain '''fairness between the requests''' coming from the different cores and hence balance the requests to L2 and memory.&lt;br /&gt;
&lt;br /&gt;
This new '''partitioned structure''' for the  memory/L2 access control unit '''enhanced''' the '''performance''' while '''reducing power consumption'''. &lt;br /&gt;
For more information on uniprocessor and multiprocessor implementation under the Intel architecture, refer to &lt;br /&gt;
[http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]&lt;br /&gt;
&lt;br /&gt;
The '''Intel bus architecture''' has been '''evolving''' to accommodate the demands of scalability while using the same MESI protocol: from a '''single shared bus''', to '''dual independent buses (DIB)''' that double the available bandwidth, and on to the logical conclusion of DIB, '''dedicated high-speed interconnects (DHSI)'''. The DHSI-based platforms use four FSBs, one for each processor in the platform. In both DIB and DHSI, a snoop filter was used in the chipset to cache snoop information, thereby significantly reducing the broadcasting needed for the snoop traffic on the buses. With the production of processors based on the next-generation 45-nm Hi-k Intel Core microarchitecture, the [http://en.wikipedia.org/wiki/Xeon Intel Xeon] processor fabric will transition from DHSI, with the memory controller in the chipset, to a distributed shared memory architecture using '''Intel QuickPath Interconnects with the MESIF protocol'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Word Invalidation=&lt;br /&gt;
One complexity problem applying to a number of the protocols deals with invalidation.  In newer protocols, individual words may be modified in a cache line, as opposed to the entirety of the line.  The other processors will thus have a mostly correct cache line, with only a word difference.  This leads to potential complexity, because to 'correct' the error the other processors must transition from shared to invalid, where they then can read the correct word from the bus and place it back into the line, transitioning back into shared.  This transition is potentially unnecessary, as the second processor may never access the specific word changed, but it may access other words in the cache line.  One potential solution to this is being researched at the present time, advancing the protocols so that a cache line is &amp;quot;not invalidated on the first dirty word, but after the number of dirty words crosses some predetermined value, which is data type and application dependent.&amp;quot;  In other words, if the application can possibly have multiple words 'incorrect,' several transitions to and from the invalid state may be avoided.&lt;br /&gt;
&amp;lt;br/&amp;gt;Source: http://tab.computer.org/tcca/NEWS/sept96/dsmideas.ps&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In essence, the solution proposed here is to advance the MOESI protocol with word invalidation and specific treatment of temporal and spatial data, so that the block is not invalidated.&lt;br /&gt;
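The threshold idea quoted above can be sketched as a per-line set of stale words. This is only an illustration of the principle, not the proposal from the cited source: ThresholdLine, its methods and the choice to invalidate as soon as the count reaches the threshold are assumptions made here.&lt;br /&gt;

```python
# Sketch of threshold-based word invalidation (hypothetical names and policy).

class ThresholdLine:
    """A remote copy of a cache line that tolerates a few stale words."""

    def __init__(self, threshold):
        self.threshold = threshold   # assumed to be a positive integer
        self.valid = True
        self.stale_words = set()     # indices of words written by other processors

    def remote_write(self, word):
        """Another processor wrote `word`: mark it stale instead of invalidating."""
        if not self.valid:
            return
        self.stale_words.add(word)
        # The set grows by at most one word per call, so equality detects
        # the moment the number of dirty words reaches the threshold.
        if len(self.stale_words) == self.threshold:
            self.valid = False
            self.stale_words.clear()

    def can_read(self, word):
        """A word is readable if the line is valid and the word is not stale."""
        return self.valid and word not in self.stale_words


line = ThresholdLine(threshold=3)
line.remote_write(2)
print(line.can_read(2), line.can_read(5))   # only the written word is unusable
line.remote_write(4)
line.remote_write(6)
print(line.valid)                           # third dirty word invalidates the line
```

With a threshold of 3, a processor that never touches the modified words keeps reading its copy through the first two remote writes without any transition to the invalid state.&lt;br /&gt;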
&lt;br /&gt;
=Performance: MOESI vs MEI/MESI=&lt;br /&gt;
Early multiprocessors (such as the PowerPC processors) were designed to work with three states (Modified, Exclusive, Invalid - this is similar to the MSI protocol discussed earlier, the term MEI is used to remain consistent with the sources used for reference).  As time progressed, more multi-processors transitioned to the MESI protocol.  This is most likely due to the characteristics of MEI - it is &amp;quot;easy to implement but leads to inefficiencies in the way that memory bus bandwidth is used&amp;quot; [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 source].  As a result, systems with two or more MEI processors (most likely) will not see the 'full' potential of the extra processors, due to the inefficient bus transactions.&lt;br /&gt;
&lt;br /&gt;
In contrast, the five-state protocol MOESI is more complex to implement than MESI and MEI/MSI.  However, advantages arise from using it.  As Andy Keane (VP of Marketing for PMC-Sierra) put it, &amp;quot;this [fifth] state allows shared data that is dirty to remain in the cache.  Without this state, any shared line would be written back to memory to change the original state from modified. Since we have a dedicated, fast path from CPU to CPU, this state maximizes the use of this path rather than the path to memory, which is inherently slower&amp;quot; [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 source].  This advantage can be realized by allowing MOESI to make some processor-to-processor transfers from level 2 caches.&lt;br /&gt;
&lt;br /&gt;
Source: [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 Edwards, Chris (08/06/2001). &amp;quot;Architects wrestle with multiprocessor options&amp;quot;. Electronic engineering times (0192-1541), (1178), p. 48.]&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Cache_coherence Cache coherence]&lt;br /&gt;
# [http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]&lt;br /&gt;
# [http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Symmetric_multiprocessing Common System Interface in Intel Processors]&lt;br /&gt;
# [http://www.zak.ict.pwr.wroc.pl/nikodem/ak_materialy/Cache%20consistency%20&amp;amp;%20MESI.pdf Cache consistency with MESI on Intel processor]&lt;br /&gt;
# [http://techreport.com/articles.x/8236/2 AMD dual core Architecture]&lt;br /&gt;
# [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual]&lt;br /&gt;
# [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors]&lt;br /&gt;
# [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of AMD 64 bit core]&lt;br /&gt;
# [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&amp;amp;arnumber=4913 Silicon Graphics Computer Systems]&lt;br /&gt;
# [http://books.google.com/books?id=g82fofiqa5IC&amp;amp;printsec=frontcover&amp;amp;dq=Parallel+computer+architecture:+a+hardware/software+approach+By+David+E.+Culler,+Jaswinder+Pal+Singh,+Anoop+Gupta&amp;amp;source=bl&amp;amp;ots=COrdamlfVn&amp;amp;sig=YcugVqbzTjHvlofvaFq6Ft_tjfY&amp;amp;hl=en&amp;amp;ei=0ZO6S4TJGcOclgejzI3BBw&amp;amp;sa=X&amp;amp;oi=book_result&amp;amp;ct=result&amp;amp;resnum=1&amp;amp;ved=0CAgQ6AEwAA#v=onepage&amp;amp;q=&amp;amp;f=false Parallel computer architecture: a hardware/software approach By David E. Culler, Jaswinder Pal Singh, Anoop Gupta]&lt;br /&gt;
# [http://www.freepatentsonline.com/5283886.html Three state invalidation protocols]&lt;br /&gt;
# [http://portal.acm.org/citation.cfm?id=1499317&amp;amp;dl=GUIDE&amp;amp;coll=GUIDE&amp;amp;CFID=83027384&amp;amp;CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems]&lt;br /&gt;
# [http://portal.acm.org/citation.cfm?id=6514 Cache Coherence protocols: evaluation using a multiprocessor simulation model]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Xerox_Dragon Xerox Dragon]&lt;br /&gt;
# [http://thanaseto.110mb.com/courses/CSD-527-report-engl.pdf Coherence Protocols]&lt;br /&gt;
# [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&amp;amp;arnumber=289691 XDBus]&lt;br /&gt;
# [http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dai0228a/index.html ARM]&lt;br /&gt;
# [http://ixbtlabs.com/articles/ibmpower4/index.html IBM POWER4]&lt;br /&gt;
# [http://www.chiplist.com/Intel_Itanium_2_processor/tree3f-section--2235-/ Intel Itanium 2]&lt;br /&gt;
# [http://techreport.com/articles.x/8236/2 MESI-MESI-MOESI Banana-fana...]&lt;br /&gt;
# [http://rsim.cs.illinois.edu/rsim/Manual/node109.html RSIM]&lt;br /&gt;
# [http://dl.acm.org/citation.cfm?id=1499317 Synapse tightly coupled multiprocessors]&lt;/div&gt;</summary>
		<author><name>Pxu</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8a_fu&amp;diff=59805</id>
		<title>CSC/ECE 506 Spring 2012/8a fu</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8a_fu&amp;diff=59805"/>
		<updated>2012-03-19T01:34:50Z</updated>

		<summary type="html">&lt;p&gt;Pxu: /* MESI */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Introduction to bus-based cache coherence in real machines=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==SMP Architecture==&lt;br /&gt;
Most parallel software in the commercial market relies on the shared-memory programming model, in which all processors access the same physical address space. The most common multiprocessors today use the [http://en.wikipedia.org/wiki/Symmetric_multiprocessing SMP] architecture, which uses a common bus as the interconnect.  In the case of multicore processors (&amp;quot;chip multiprocessors,&amp;quot; or CMP), the SMP architecture applies to the cores, treating them as separate processors. The key problem of shared-memory multiprocessors is providing a consistent view of memory with various cache hierarchies.  This is called the '''''cache coherence problem'''''. It is critical for correctness and is a performance-sensitive design point for supporting the shared-memory model. The cache coherence mechanisms not only govern communication in a shared-memory multiprocessor, but also typically determine how the memory system transfers data between processors, caches, and memory.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:Busbased SMP.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
At any point in logical time, the permissions for a cache block can allow either a single writer or multiple readers. The '''''coherence protocol''''' ensures the invariants of the states are maintained. The different coherent states used by most of the cache coherence protocols are as shown in ''Table 1'':&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''States'''&lt;br /&gt;
|  '''Access Type'''&lt;br /&gt;
|  '''Invariant'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  read, write&lt;br /&gt;
|  all other caches in I state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  read&lt;br /&gt;
|  all other caches in I state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Owned'''&lt;br /&gt;
|  read&lt;br /&gt;
|  all other caches in I or S state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  read&lt;br /&gt;
|  no other cache in M or E state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|  -&lt;br /&gt;
|  -&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Snooping Protocols=&lt;br /&gt;
==MSI Protocol==&lt;br /&gt;
&lt;br /&gt;
'''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' is a three-state write-back '''invalidation protocol''' and one of the earliest snooping-based cache-coherence protocols. It marks each cache line as '''Modified (M)''', '''Shared (S)''' or '''Invalid (I)'''. '''Invalid''' means the cache line is either not present or is in the invalid state. If the cache line is clean and may be shared by more than one processor, it is marked '''Shared'''. If the cache line is dirty and the processor has exclusive ownership of it, it is in the '''Modified''' state. A BusRdx causes the other caches to invalidate (demote) their copies to the '''I''' state; if the line is in the '''M''' state in another cache, that cache flushes it. A processor write that hits a line in the '''S''' state still issues a BusRdx, and the line is promoted (upgraded) to the '''M''' state.&lt;br /&gt;
&lt;br /&gt;
The state diagram of the MSI protocol is shown below. Note that the left part shows the response to processor-side requests, and the right part shows the response to snooper-side requests.&lt;br /&gt;
[[File:MSInew.jpg|600px|thumb|center|MSI state transition diagram]]&lt;br /&gt;
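The MSI transitions described above can also be written as a lookup table. The sketch below is a minimal Python rendering under the usual textbook assumptions; the table name MSI and the helper step are hypothetical, and the event names PrRd, PrWr, BusRd and BusRdx follow the conventions used in this article.&lt;br /&gt;

```python
# MSI transition table: (current state, event) maps to (next state, action).
# Processor events (PrRd, PrWr) may place a request on the bus;
# snooped events (BusRd, BusRdx) may force a Flush of a modified line.
MSI = {
    ("I", "PrRd"):   ("S", "BusRd"),
    ("I", "PrWr"):   ("M", "BusRdx"),
    ("S", "PrRd"):   ("S", None),
    ("S", "PrWr"):   ("M", "BusRdx"),   # upgrade: a hit in S still goes to the bus
    ("M", "PrRd"):   ("M", None),
    ("M", "PrWr"):   ("M", None),
    ("I", "BusRd"):  ("I", None),
    ("I", "BusRdx"): ("I", None),
    ("S", "BusRd"):  ("S", None),
    ("S", "BusRdx"): ("I", None),
    ("M", "BusRd"):  ("S", "Flush"),
    ("M", "BusRdx"): ("I", "Flush"),
}


def step(states, cache, request):
    """Apply a processor request from `cache`; every other cache snoops the bus.

    Returns the new per-cache states, the bus transaction issued (or None),
    and whether some other cache had to flush a modified copy.
    """
    new_states = list(states)
    new_states[cache], bus_action = MSI[(states[cache], request)]
    flushed = False
    if bus_action is not None:
        for other, state in enumerate(states):
            if other != cache:
                new_states[other], reaction = MSI[(state, bus_action)]
                flushed = flushed or reaction == "Flush"
    return new_states, bus_action, flushed


states = ["I", "I"]
states, bus, flushed = step(states, 0, "PrWr")   # cache 0 writes the block
print(states, bus, flushed)
states, bus, flushed = step(states, 1, "PrRd")   # cache 1 reads it afterwards
print(states, bus, flushed)
```

Keeping the protocol in a table makes it easy to compare variants: the Synapse choice discussed in the next section, for example, would change only the entry for a BusRd snooped in the M state.&lt;br /&gt;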
&lt;br /&gt;
==Synapse protocol==&lt;br /&gt;
From the state transition diagram of MSI, we observe that there is a transition from state '''M''' to state '''S''' when a BusRd is observed for that block. The contents of the block are flushed to the bus before going to the '''S''' state. In certain cases it would be more appropriate to move to the '''I''' state, giving up the block entirely. This choice of moving to '''S''' or '''I''' reflects the designer's assertion that the original processor is more likely to continue reading the block than the new processor is to write to it. The Synapse protocol, used in the early Synapse multiprocessor, made the alternate choice of going directly from the '''M''' state to the '''I''' state on a BusRd, assuming the migratory pattern would be more frequent. More details about this protocol can be found in these papers published in the 1980s: [http://portal.acm.org/citation.cfm?id=6514 Cache Coherence protocols: evaluation using a multiprocessor simulation model] and [http://portal.acm.org/citation.cfm?id=1499317&amp;amp;dl=GUIDE&amp;amp;coll=GUIDE&amp;amp;CFID=83027384&amp;amp;CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems]&lt;br /&gt;
&lt;br /&gt;
In the Synapse protocol, the '''M''' state is called the '''D''' (Dirty) state. The following is the state transition diagram for the Synapse protocol:&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:Synapse1.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
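&lt;br /&gt;
The one design choice that separates Synapse from MSI in the discussion above is the snooper's reaction to a BusRd on a dirty line. The following sketch shows only that choice; the function name and the protocol labels are hypothetical, and the state is written as 'M' even though Synapse calls it D.&lt;br /&gt;

```python
def snoop_bus_read(state, protocol):
    """Return (next_state, bus_action) for a cache that observes a BusRd."""
    if state != 'M':
        return state, None      # clean or absent copies are unaffected by a read
    if protocol == 'synapse':
        return 'I', 'Flush'     # give up the block entirely (migratory assumption)
    return 'S', 'Flush'         # MSI keeps a shared copy for later reads


print(snoop_bus_read('M', 'msi'))      # ('S', 'Flush')
print(snoop_bus_read('M', 'synapse'))  # ('I', 'Flush')
```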
&amp;lt;br/&amp;gt;&lt;br /&gt;
===Real Architectures using Synapse===&lt;br /&gt;
No information could be found on any actual architecture using the Synapse cache-coherence protocol.  Judging from the abstract of the 1984 paper by Steve Frank and Armond Inselberg, &amp;quot;[http://dl.acm.org/citation.cfm?id=1499317 Synapse Tightly Coupled Multiprocessors: A New Approach to Solve Old Problems]&amp;quot;, Synapse appears to have been theoretical.&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The [http://www.disi.unige.it/person/DelzannoG/CacheProtocol/synapse.hy BABYLON Project] provides one example of a Synapse cache coherence protocol with atomic synchronization actions.&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
In the Synapse system, memory contention is reduced by designing the processor cache around a non-write-through algorithm that minimizes the bandwidth used between the cache and shared memory. The Synapse Expansion Bus includes an ownership-level protocol between processor caches.[[#References|&amp;lt;sup&amp;gt;[24]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
==[http://en.wikipedia.org/wiki/MESI_protocol MESI]==&lt;br /&gt;
The main drawback of MSI is that each read-write sequence incurs two bus transactions irrespective of whether the cache line is stored in only one cache or not. Highly parallel programs that have little data sharing suffer the most from this.  MESI protocol solves this problem by introducing the '''Exclusive''' state to distinguish between a cache line stored in multiple caches and a line stored in a single cache.&lt;br /&gt;
Let us briefly see how the MESI protocol works. For a more detailed version, refer to the Solihin textbook, p. 215.&lt;br /&gt;
&lt;br /&gt;
The MESI coherence protocol marks each cache line as being in one of the Modified, Exclusive, Shared, or Invalid states. &lt;br /&gt;
* '''Invalid''' : The cache line is either not present or is invalid&lt;br /&gt;
* '''Exclusive''' : The cache line is clean and is owned by this core/processor only&lt;br /&gt;
* '''Modified''' : The cache line is dirty and the core/processor has exclusive ownership of it, exclusive of memory as well.&lt;br /&gt;
* '''Shared''' : The cache line is clean and is shared by more than one core/processor&lt;br /&gt;
&lt;br /&gt;
The MESI protocol works as follows: &lt;br /&gt;
A line that is fetched receives the '''E''' or '''S''' state, depending on whether it exists in the caches of other processors in the system. A cache line gets the '''M''' state when a processor writes to it; if the line is not in the '''E''' or '''M''' state prior to the write, the cache sends a Bus Upgrade (BusUpgr) signal or, as the Intel manuals term it, a “Read-For-Ownership (RFO) request”, which ensures that the line exists in the cache and is in the '''I''' state in all other processors on the bus (if any). The table below summarizes the MESI protocol.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''This cache line is valid?'''&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  No&lt;br /&gt;
|-&lt;br /&gt;
|  '''The memory copy is…'''&lt;br /&gt;
|  out of date&lt;br /&gt;
|  valid&lt;br /&gt;
|  valid&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies exist in caches of other processors?'''&lt;br /&gt;
|  No&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe&lt;br /&gt;
|  Maybe&lt;br /&gt;
|-&lt;br /&gt;
|  '''A write to this line'''&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  goes directly to bus&lt;br /&gt;
|}&lt;br /&gt;
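&lt;br /&gt;
The benefit of the Exclusive state for the read-then-write sequence mentioned at the start of this section can be illustrated by counting bus transactions. This is a sketch under simplifying assumptions: the function name is hypothetical, the line starts out Invalid in the requesting cache, and the write is modeled as a BusUpgr.&lt;br /&gt;

```python
def bus_transactions(protocol, other_caches_hold_line):
    """List the bus transactions for a read miss followed by a write to the same line."""
    transactions = ['BusRd']             # the read miss always goes to the bus
    if protocol == 'MESI' and not other_caches_hold_line:
        state = 'E'                      # no sharer answered, so the line is Exclusive
    else:
        state = 'S'                      # MSI has no E state; MESI with sharers also lands in S
    if state == 'S':
        transactions.append('BusUpgr')   # the write must invalidate possible sharers
    # In state E the write upgrades the line to M without using the bus.
    return transactions


print(bus_transactions('MSI', False))   # ['BusRd', 'BusUpgr']
print(bus_transactions('MESI', False))  # ['BusRd']
print(bus_transactions('MESI', True))   # ['BusRd', 'BusUpgr']
```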
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The state transition diagram for the MESI protocol is shown below.&lt;br /&gt;
&lt;br /&gt;
[[File:File.png|500px|thumb|center|MESI state transition diagram]] &lt;br /&gt;
&lt;br /&gt;
===Real Architectures using MESI===&lt;br /&gt;
&lt;br /&gt;
Intel's [http://en.wikipedia.org/wiki/Pentium_Pro Pentium Pro] microprocessor, introduced in 1995, was the first Intel architecture microprocessor to support symmetric multiprocessing in various multiprocessor configurations.  SMP with the MESI protocol was the architecture used consistently until the introduction of the 45-nm Hi-k Core microarchitecture in Intel's (Nehalem-EP) quad-core x86-64. The 45-nm Hi-k Intel Core microarchitecture uses a new system framework called the [http://en.wikipedia.org/wiki/Intel_QuickPath_Interconnect QuickPath Interconnect], which uses point-to-point interconnection technology based on a distributed shared memory architecture.  It uses a modified version of the MESI protocol called [http://en.wikipedia.org/wiki/MESIF_protocol MESIF], which introduces the additional Forward state.  (The state diagram of the MESI transitions that occur within the Pentium's data cache can be found on page 63 of [http://books.google.com/books?id=TVzjEZg1--YC&amp;amp;printsec=frontcover&amp;amp;source=gbs_ge_summary_r&amp;amp;cad=0#v=onepage&amp;amp;q&amp;amp;f=false Pentium Processor System Architecture: Second Edition] by Don Anderson, Tom Shanley, MindShare, Inc.)&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
The [http://www.arm.com/products/processors/classic/arm11/arm11-mpcore.php ARM11 MPCore] and [http://en.wikipedia.org/wiki/ARM_Cortex-A9_MPCore Cortex-A9 MPCore] processors support the MESI cache coherency protocol.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]  ARM MPCore defines the states of the MESI protocol it implements as:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies in other caches'''&lt;br /&gt;
|  NO&lt;br /&gt;
|  NO&lt;br /&gt;
|  YES&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Clean or Dirty'''&lt;br /&gt;
|  DIRTY&lt;br /&gt;
|  CLEAN&lt;br /&gt;
|  CLEAN&lt;br /&gt;
|  -&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
The coherency protocol is implemented and managed by the Snoop Control Unit (SCU) in the ARM MPCore, which monitors the traffic between the local L1 data caches and the next level of the memory hierarchy. At boot time, each core can choose whether or not to take part in the coherency domain.  Unless explicit system calls bind a task to a specific core ([http://en.wikipedia.org/wiki/Processor_affinity processor affinity]), there is a high chance that the task will at some point migrate to a different core, along with its data as it is used.  Task migration is not handled efficiently by a literal MESI implementation, so the ARM MPCore offers two optimizations that allow for both MESI compliance and task migration:&lt;br /&gt;
* '''Direct Data Intervention (DDI)''': the SCU keeps a copy of the tag RAMs of all cores’ caches. This lets it efficiently detect whether a cache line requested by one core is held by another core in the coherency domain before looking for it in the next level of the memory hierarchy.&lt;br /&gt;
* '''Cache-to-cache Migration''': if the SCU finds that the cache line requested by one CPU is present in another core, it either copies it (if clean) or moves it (if dirty) from the other CPU directly into the requesting one, without interacting with external memory.&lt;br /&gt;
These optimizations reduce memory traffic in and out of the L1 cache subsystem by eliminating interaction with external memories, which in effect reduces the overall load on the interconnect and the overall power consumption.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
Other architectures that use the MESI cache-coherence protocol include the L2 cache of the IBM POWER4 processor[[#References|&amp;lt;sup&amp;gt;[20]&amp;lt;/sup&amp;gt;]], the L2 cache of the Intel Itanium 2 processor[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]], and the Intel Xeon[[#References|&amp;lt;sup&amp;gt;[22]&amp;lt;/sup&amp;gt;]].&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
One implementation complexity already mentioned is the inefficiency of task migration in MESI that the ARM MPCore addresses.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]  Another possible implementation complexity is found during the replacement of a cache line: one possible MESI implementation requires a message to be sent to main memory when a cache line is flushed (i.e. an E to I transition), as the line was exclusively in one cache before it was removed. It is possible to avoid this replacement message if the system is designed so that the flush of a modified (exclusive) line requires an acknowledgment from main memory. However, this requires the flush to be stored in a 'write-back' buffer until the reply arrives (to ensure the change is successfully propagated to memory).[[#References|&amp;lt;sup&amp;gt;[23]&amp;lt;/sup&amp;gt;]]&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==MESIF==&lt;br /&gt;
Let us now briefly walk through the '''MESIF protocol''':&lt;br /&gt;
&lt;br /&gt;
The '''MESIF''' protocol, used in the latest Intel multi-core processors, was introduced to '''accommodate the point-to-point''' links used in the QuickPath Interconnect. Using the '''MESI''' protocol in this architecture would send many redundant messages between different processors, often with unnecessarily high latency. For example, when a processor requests a cache line that is stored in multiple locations, every location might respond with the data. As the requesting processor only needs a single copy of the data, the system would be wasting bandwidth. &lt;br /&gt;
As a solution to this problem, an additional state, the '''Forward (F) state''', was added by slightly changing the role of the Shared state. Whenever there is a read request, only the cache holding the line in the F state responds, while all caches holding it in the S state remain silent.  Hence, by designating a single cache to '''respond to requests''', coherency traffic is substantially reduced when multiple copies of the data exist. When a cache line in the '''F''' state is '''copied''', the F state '''migrates''' to the '''newer copy''', while the '''older''' one drops back to '''S'''. Moving the F state to the new copy '''exploits''' both '''temporal and spatial locality'''. Because the newest copy of the cache line is always the one in the F state, the line in the F state is very unlikely to be evicted from its cache. This takes advantage of the temporal locality of the request. The second advantage is that if a particular cache line is in high demand due to spatial locality, the bandwidth used to transmit that data is spread across several nodes.&lt;br /&gt;
All M to S and E to S state transitions now become '''M to F''' and '''E to F''' transitions.  &lt;br /&gt;
The '''F state''' is '''different''' from the '''Owned state''' of the MOESI protocol in that it is '''not''' a unique copy: a valid copy is also stored in memory. Thus, unlike a line in the Owned state of the MOESI protocol, for which memory is out of date, a line in the F state can be evicted or converted to the S state, if desired. &lt;br /&gt;
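&lt;br /&gt;
The forwarding rule described above (one responder per read, with the F state migrating to the newest copy) can be sketched as follows. This is a simplified illustration that tracks only the F, S, E and I states of one line; the function name and the fallback to memory when no cache holds the line in F are assumptions of the example.&lt;br /&gt;

```python
def mesif_read(states, requester):
    """Serve a read miss for one line; states holds one state per cache.

    Returns the index of the single cache that supplies the data,
    or None when memory has to supply it.
    """
    responder = None
    for cpu, state in enumerate(states):
        if state == 'F':
            responder = cpu                  # only the forwarder answers
    other_copies = [s for s in states if s != 'I']
    if responder is not None:
        states[responder] = 'S'              # the older copy drops back to S
        states[requester] = 'F'              # the newest copy becomes the forwarder
    elif other_copies:
        states[requester] = 'F'              # S copies stay silent; memory responds
    else:
        states[requester] = 'E'              # no other copy exists anywhere
    return responder


states = ['F', 'S', 'I']
print(mesif_read(states, 2), states)  # 0 ['S', 'S', 'F']
```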
&lt;br /&gt;
More information on the QuickPath Interconnect and MESIF protocol can be found at&lt;br /&gt;
'''[http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]'''&lt;br /&gt;
&lt;br /&gt;
==MOESI==&lt;br /&gt;
[http://en.wikipedia.org/wiki/Opteron AMD Opteron] was AMD’s first-generation dual-core processor, with two distinct [http://en.wikipedia.org/wiki/Athlon_64 K8 cores] together on a single die.  Cache coherence poses bigger problems on such multiprocessors, so an appropriate coherence protocol was needed. The [http://en.wikipedia.org/wiki/Xeon Intel Xeon], the competing processor from Intel, used the MESI protocol to handle cache coherence.  MESI came with the drawback of using much time and bandwidth in certain situations. &lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/MOESI_protocol MOESI] was AMD’s answer to this problem. MOESI adds a fifth state, '''“Owned”''', to the MESI protocol. It addresses the bandwidth problem that arises in MESI when a processor holding invalid data in its cache wants to access data that another processor has modified.  Under MESI, the processor seeking access has to wait for the processor that modified the data to write it back to main memory, which takes time and bandwidth. MOESI removes this drawback by allowing dirty sharing.  When a processor holds the data in the new '''“Owned”''' state, it can provide the modified data to other processors without, or even before, writing it to main memory. This is called '''''dirty sharing'''''. The processor holding the data in the '''&amp;quot;Owned&amp;quot;''' state remains responsible for updating main memory later, when the cache line is evicted.&lt;br /&gt;
&lt;br /&gt;
MOESI has become one of the most popular snoop-based protocols supported in the AMD64 architecture.  The AMD dual-core Opteron can maintain cache coherence in systems up to 8 processors using this protocol.&lt;br /&gt;
&lt;br /&gt;
The five different states of the MOESI protocol are:&lt;br /&gt;
* '''Modified (M)''' : The most recent copy of the data is present in the cache line. But it is not present in any other processor cache.&lt;br /&gt;
* '''Owned (O)'''   : The cache line holds the most recent, correct copy of the data, which may be shared by other processors. The processor holding the line in this state is responsible for writing the correct value back to main memory when the line is evicted.  &lt;br /&gt;
* '''Exclusive (E)''' : A cache line holds the most recent, correct copy of the data, which is exclusively present on this processor and a copy is present in the main memory.  &lt;br /&gt;
* '''Shared (S)''' : A cache line in the shared state holds the most recent, correct copy of the data, which may be shared by other processors. &lt;br /&gt;
* '''Invalid (I)''' : A cache line does not hold a valid copy of the data.&lt;br /&gt;
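&lt;br /&gt;
Dirty sharing can be made concrete by comparing what happens when another processor reads a line that is held in the M state. The sketch below is illustrative only: both function names and the returned dictionary keys are hypothetical, and a write-back is counted as a single memory write.&lt;br /&gt;

```python
def remote_read_of_dirty_line(protocol):
    """Another processor reads a line that the supplier holds in the M state."""
    if protocol == 'MOESI':
        # The supplier keeps the dirty data and becomes its owner; memory is not touched.
        return {'supplier': 'O', 'reader': 'S', 'memory_writes': 0}
    # MESI: the dirty line is written back so that both copies can be clean and shared.
    return {'supplier': 'S', 'reader': 'S', 'memory_writes': 1}


def writebacks_on_eviction(state):
    """Number of memory writes caused by evicting a line in the given state."""
    return 1 if state in ('M', 'O') else 0


print(remote_read_of_dirty_line('MESI'))   # memory_writes is 1
print(remote_read_of_dirty_line('MOESI'))  # memory_writes is 0
print(writebacks_on_eviction('O'))         # 1: the deferred write-back happens here
```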
&lt;br /&gt;
A detailed explanation of this protocol implementation on AMD processor can be found in the manual [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of the AMD 64-bit core]&lt;br /&gt;
&lt;br /&gt;
The following table summarizes the MOESI protocol:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Owned'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''This cache line is valid?'''&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  No&lt;br /&gt;
|-&lt;br /&gt;
|  '''The memory copy is…'''&lt;br /&gt;
|  out of date&lt;br /&gt;
|  out of date&lt;br /&gt;
|  valid&lt;br /&gt;
|  valid&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies exist in caches of other processors?'''&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe&lt;br /&gt;
|  Maybe&lt;br /&gt;
|-&lt;br /&gt;
|  '''A write to this line'''&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  goes directly to bus&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The state transition diagram for the MOESI protocol is shown below. The top part shows the response to processor-side requests and the bottom part shows the response to snooper-side requests. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:MOESI.jpg|500px|thumb|center|MOESI state transition diagram]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br /&amp;gt;&lt;br /&gt;
&amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Optimization techniques on MOESI===&lt;br /&gt;
&lt;br /&gt;
In real machines, applying optimization techniques to the standard cache coherence protocol can improve performance. For example, the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] family of microprocessors (Family 10h), AMD’s first generation to incorporate four distinct cores on a single die and the first to have a cache that all the cores share, uses the MOESI protocol with some optimization techniques incorporated. &lt;br /&gt;
&lt;br /&gt;
The optimization focuses on a small subset of compute problems that behave like producer-consumer programs. In such a problem, a thread running on one core produces data that is consumed by a thread running on a separate core. With such programs, it is desirable to have the two cores communicate through the shared cache, avoiding round trips to and from main memory. The '''MOESI''' protocol that the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] cache uses for cache coherence can itself limit bandwidth in this case, so keeping the cache line in the '''‘M’''' state for such computing problems yields better performance.&lt;br /&gt;
&lt;br /&gt;
When the producer thread writes a new entry, it allocates cache lines in the '''modified (M)''' state. Eventually, these M-marked cache lines start to fill the L3 cache. When the consumer reads a cache line, the MOESI protocol changes the state of the line to '''owned (O)''' in the L3 cache and pulls down a '''shared (S)''' copy for the consumer's own use. The producer thread then circles the ring buffer and arrives back at the same cache line it had previously written. However, when the producer attempts to write new data to the cache line now marked '''‘O’''', it finds that it cannot, since a line marked '''‘O’''' by the previous consumer read does not carry sufficient permission for a write request (in the MOESI protocol). To maintain coherence, the memory controller must initiate probes in the other caches (to handle any other S copies that may exist). This slows down the process.&lt;br /&gt;
&lt;br /&gt;
Thus, it is preferable to keep the cache line in the '''‘M’''' state in the L3 cache. In such a situation, when the producer comes back around the ring buffer, it finds the previously written cache line still marked '''‘M’''', to which it is safe to write without coherence concerns. Thus better performance can be achieved by such optimization techniques to standard protocols when implemented in real machines.&lt;br /&gt;
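&lt;br /&gt;
The effect of this optimization on a ring buffer can be estimated with a small counting model. This is only a sketch of the reasoning above, not of the Phenom hardware: the function name, the keep_modified flag, and the assumption that every slot is read once per lap are all made up for the example.&lt;br /&gt;

```python
def producer_consumer_probes(ring_size, laps, keep_modified):
    """Count the probes the producer triggers while rewriting a ring buffer."""
    l3_state = ['I'] * ring_size
    probes = 0
    for _ in range(laps):
        for slot in range(ring_size):
            # Producer side: a write to an O line lacks write permission,
            # so other caches must be probed for S copies first.
            if l3_state[slot] == 'O':
                probes += 1
            l3_state[slot] = 'M'
            # Consumer side: under plain MOESI the read demotes the line to O;
            # with the optimization the L3 copy stays in M.
            if not keep_modified:
                l3_state[slot] = 'O'
    return probes


print(producer_consumer_probes(4, 3, keep_modified=False))  # 8
print(producer_consumer_probes(4, 3, keep_modified=True))   # 0
```

In this model the first lap is free in both cases, because the lines are freshly allocated; every later lap costs one probe per slot unless the line is kept in the M state.&lt;br /&gt;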
&lt;br /&gt;
You can find more information on how this is implemented and various other ways of optimizations in this manual [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors]&lt;br /&gt;
&lt;br /&gt;
==Dragon Protocol==&lt;br /&gt;
The '''[http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]''' is an update-based coherence protocol: unlike the coherence protocols we have seen so far, it does not invalidate other cached copies. Write propagation is achieved by updating the cached copies instead of invalidating them.  The Dragon Protocol does not update memory on a cache-to-cache transfer; it delays bringing memory and cache into agreement until the data is evicted and written back, which saves time and lowers the memory access requirements. Moreover, only the written '''byte''' or '''word''' is '''communicated''' to the '''other caches''' instead of the whole block, which further '''reduces''' the '''bandwidth''' usage. The protocol can dynamically detect the sharing status of a block, using a write-through policy for shared blocks and a write-back policy for currently non-shared blocks. The Dragon Protocol employs the following four states for cache blocks: '''Shared Clean''', '''Shared Modified''', '''Exclusive''' and  '''Modified'''. &lt;br /&gt;
* '''Modified (M)''' and '''Exclusive (E)''' - these states have the same meaning as explained in the protocols above. &lt;br /&gt;
* '''Shared Modified (Sm)''' - Only one cache in the system can hold a given block in the Shared Modified state. Potentially two or more caches have this block, memory may or may not be up to date, and this processor's cache has modified the block.&lt;br /&gt;
* '''Shared Clean (Sc)''' -  Potentially two or more caches have this block, and memory may or may not be up to date (if no other cache has it in the Sm state, memory is up to date; otherwise it is not).&lt;br /&gt;
Only when a Shared Modified line is evicted from the cache on a cache miss is the block written back to main memory, in order to keep memory consistent. For more information on the Dragon protocol, refer to the Solihin textbook, page 229. The state transition diagram is given below for reference.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:Dragon.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
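&lt;br /&gt;
The update-instead-of-invalidate behaviour on a write hit can be sketched as below. This is a simplified illustration for a single block: the function name is hypothetical, 'NP' is a made-up marker for a cache that does not hold the block (Dragon itself has no Invalid state), and the writer is assumed to already cache the block.&lt;br /&gt;

```python
def dragon_write_hit(states, writer):
    """Apply a write hit by one cache; states holds one state per cache.

    Returns 'BusUpd' when the written word is broadcast, otherwise None.
    """
    if states[writer] in ('E', 'M'):
        states[writer] = 'M'              # sole copy: write locally, no bus traffic
        return None
    sharers = [c for c in range(len(states))
               if c != writer and states[c] in ('Sc', 'Sm')]
    for c in sharers:
        states[c] = 'Sc'                  # sharers take the update and stay valid
    # The writer owns the dirty data if sharers remain, else it becomes the sole copy.
    states[writer] = 'Sm' if sharers else 'M'
    return 'BusUpd'


states = ['Sc', 'Sm', 'NP']
print(dragon_write_hit(states, 0), states)  # BusUpd ['Sm', 'Sc', 'NP']
```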
&lt;br /&gt;
The Dragon protocol implements snoopy caches that provide the appearance of a uniform memory space to multiple processors. Here, each cache listens to two buses: the processor bus and the memory bus. The caches are also responsible for address translation, so the processor bus carries virtual addresses and the memory bus carries physical addresses.  The Dragon system was designed to support 4 to 8 Dragon processors. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=[http://en.wikipedia.org/wiki/Instruction_prefetch Instruction Prefetching]=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Instruction prefetching is a technique used to speed up the execution of a program. In multiprocessors, however, prefetching comes at a cost: prefetched data can be modified in a way that the memory coherence protocol cannot handle. In such situations, software must use serializing instructions or cache-invalidation instructions to guarantee that subsequent data accesses are coherent. &lt;br /&gt;
&amp;lt;br /&amp;gt;&lt;br /&gt;
An example of this type of a situation is a page-table update followed by accesses to the physical pages referenced by the updated page tables. The physical-memory references for the page tables are different than the physical-memory references for the data. Because of prefetching there may be problems with correctness. The following sequence of events shows such a situation when software changes the translation of virtual-page A from physical-page M to physical-page N:&lt;br /&gt;
# The tables that translate virtual-page A to physical-page M are now held only in main memory. The copies in the cache are invalidated.&lt;br /&gt;
# Page-table entry is changed by the software for virtual-page A in main memory to point to physical page N rather than physical-page M.&lt;br /&gt;
# Data in virtual-page A is accessed.&lt;br /&gt;
&lt;br /&gt;
Software expects the processor to access the data from physical-page N after the update. However, it is possible for the processor to prefetch the data from physical-page M before the page table for virtual page A is updated. Because the physical-memory references are different, the processor does not recognize them as requiring coherence checking and believes it is safe to prefetch the data from virtual-page A, which is translated into a read from physical page M. Similar behavior can occur when instructions are prefetched from beyond the page-table update instruction.&lt;br /&gt;
&lt;br /&gt;
To prevent such errors, software must execute a TLB-invalidating instruction (such as INVLPG or MOV CR3 on AMD64) immediately after the page-table update, to ensure that subsequent instruction fetches and data accesses use the correct virtual-page-to-physical-page translation. It is not necessary to perform a TLB invalidation operation preceding the table update.&lt;br /&gt;
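&lt;br /&gt;
The hazard and its fix can be mimicked with a toy model in which the processor keeps using a translation it obtained before the page-table update. Everything here is hypothetical (the TinyMMU class, its methods, and the remap helper); it only illustrates the ordering of events, not real hardware behaviour.&lt;br /&gt;

```python
class TinyMMU:
    """Toy model: the processor may hold on to a translation it looked up earlier."""

    def __init__(self, page_table):
        self.page_table = page_table   # virtual page to physical page, in memory
        self.cached = {}               # translations the processor already used

    def access(self, vpage):
        """Return the physical page that an access to vpage actually reaches."""
        if vpage not in self.cached:
            self.cached[vpage] = self.page_table[vpage]
        return self.cached[vpage]

    def invalidate(self, vpage):
        """Stand-in for the invalidating instruction executed after the update."""
        self.cached.pop(vpage, None)


def remap(invalidate_after_update):
    """Change the mapping of page A from M to N, then access page A."""
    mmu = TinyMMU({'A': 'M'})
    mmu.access('A')                    # early (prefetch-like) use of the old mapping
    mmu.page_table['A'] = 'N'          # software updates the page-table entry
    if invalidate_after_update:
        mmu.invalidate('A')
    return mmu.access('A')


print(remap(False))  # M: the stale translation is still used
print(remap(True))   # N: the access sees the updated page table
```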
&lt;br /&gt;
More information can be found about this in [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual]&lt;br /&gt;
&lt;br /&gt;
= CMP Implementation in Intel Architecture =&lt;br /&gt;
&lt;br /&gt;
Let us now see how Intel architecture using the MESI protocol progressed from a uniprocessor architecture to a Chip MultiProcessor (CMP) using the bus as the interconnect. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Uniprocessor Architecture'''&lt;br /&gt;
&lt;br /&gt;
The diagram below shows the structure of the memory cluster in the Intel Pentium M processor.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:intel_cache1.jpg]]&amp;lt;/center&amp;gt; &amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In this structure we have,&lt;br /&gt;
* A unified on-chip '''L1 cache''' with the '''processor/core''',&lt;br /&gt;
* A '''Memory/L2 access control unit''', through which all the accesses to the L2 cache, main memory and IO space are made,&lt;br /&gt;
* The second level '''L2 cache''' along with the '''prefetch unit''' and&lt;br /&gt;
* '''Front side bus (FSB)''', a single shared bi-directional bus across which all the traffic is sent. These wide buses bring in multiple data bytes at a time. &lt;br /&gt;
&lt;br /&gt;
As Intel explains it, in this structure processor requests are first sought in the '''L2 cache''' and only on a '''miss''' are they '''forwarded''' to main '''memory''' via the front side bus ('''FSB'''). The '''Memory/L2 access control''' unit serves as a central point for '''maintaining coherence''' within the core and with the external world. It '''contains''' a '''snoop control unit''' that receives snoop requests from the bus and performs the required operations on each cache (and on the internal buffers) in parallel. It also handles RFO requests (BusUpgr) and lets the operation continue only after it has guaranteed that no other version of the cache line exists in any other cache in the system.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''CMP Architecture'''&lt;br /&gt;
&lt;br /&gt;
For the CMP implementation, Intel chose a bus-based architecture with snoopy protocols over a '''directory protocol''': although a directory protocol reduces active power because of reduced snoop activity, it '''increases''' the '''design complexity''' and the '''static power''' because of larger tag arrays. Since Intel has a large market for processors in the mobility family, a directory-based solution was less favorable, because battery life depends mainly on static power consumption and less on dynamic power.&lt;br /&gt;
Let us examine how '''CMP''' was implemented in the '''Intel Core Duo''', which was one of the first dual-core processors for the budget/entry-level market. &lt;br /&gt;
The general CMP implementation structure of the Intel Core Duo is shown below.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:intel_cache2.jpg]]&amp;lt;/center&amp;gt; &amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This structure has the following changes when compared to the uniprocessor memory cluster structure. &lt;br /&gt;
* '''L1 cache''' and the '''processor/core''' structure is '''duplicated''' to give 2 cores.&lt;br /&gt;
* The '''Memory/L2 access control''' unit is '''split''' into 2 logical units: '''L2 controller''' and '''bus controller'''. The L2 controller handles all '''requests to the L2''' cache from the core and the snoop requests from the FSB. The '''bus controller''' handles '''data and I/O requests''' to and from the FSB.&lt;br /&gt;
* The '''prefetching''' unit is extended to handle the hardware '''prefetches for each core separately'''.&lt;br /&gt;
* A '''new logical unit''' (represented by the hexagon) was added to maintain '''fairness between the requests''' coming from the different cores and hence balance the requests to L2 and memory.&lt;br /&gt;
&lt;br /&gt;
This new '''partitioned structure''' for the  memory/L2 access control unit '''enhanced''' the '''performance''' while '''reducing power consumption'''. &lt;br /&gt;
For more information on uniprocessor and multiprocessor implementation under the Intel architecture, refer to &lt;br /&gt;
[http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]&lt;br /&gt;
&lt;br /&gt;
The '''Intel bus architecture''' has been '''evolving''' to accommodate the demands of scalability while using the same MESI protocol: from a '''single shared bus''', to '''dual independent buses (DIB)''' that double the available bandwidth, to the logical conclusion of DIB, '''dedicated high-speed interconnects (DHSI)'''. The DHSI-based platforms use four FSBs, one for each processor in the platform. In both DIB and DHSI, a snoop filter in the chipset caches snoop information, thereby significantly reducing the broadcasting needed for snoop traffic on the buses. With the production of processors based on the next-generation 45-nm Hi-k Intel Core microarchitecture, the [http://en.wikipedia.org/wiki/Xeon Intel Xeon] processor fabric will transition from DHSI, with the memory controller in the chipset, to a distributed shared memory architecture using '''Intel QuickPath Interconnects with the MESIF protocol'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Word Invalidation=&lt;br /&gt;
One complexity problem applying to a number of the protocols deals with invalidation.  In newer protocols, individual words may be modified in a cache line, as opposed to the entirety of the line.  The other processors will thus have a mostly correct cache line, with only a word difference.  This leads to potential complexity, because to 'correct' the error the other processors must transition from shared to invalid, where they then can read the correct word from the bus and place it back into the line, transitioning back into shared.  This transition is potentially unnecessary, as the second processor may never access the specific word changed, but it may access other words in the cache line.  One potential solution to this is being researched at the present time, advancing the protocols so that a cache line is &amp;quot;not invalidated on the first dirty word, but after the number of dirty words crosses some predetermined value, which is data type and application dependent.&amp;quot;  In other words, if the application can possibly have multiple words 'incorrect,' several transitions to and from the invalid state may be avoided.&lt;br /&gt;
&amp;lt;br/&amp;gt;Source: http://tab.computer.org/tcca/NEWS/sept96/dsmideas.ps&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In essence, the solution proposed here is to advance the MOESI protocol with word invalidation and specific treatment of temporal and spatial data, so that the block is not invalidated.&lt;br /&gt;
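&lt;br /&gt;
The threshold idea quoted above can be sketched as a shared line that tolerates a limited number of stale words before giving up its copy. This is a hypothetical illustration of the proposal, not an implementation of any published protocol; the class name, its methods, and the meaning of the threshold parameter are assumptions.&lt;br /&gt;

```python
class SharedLine:
    """A remote copy of a cache line that tolerates a few stale words."""

    def __init__(self, threshold):
        self.threshold = threshold   # assumed tunable: dirty words tolerated
        self.dirty_words = set()     # indices of words modified elsewhere
        self.state = 'S'

    def remote_write(self, word_index):
        """Record that another processor modified one word of this line."""
        self.dirty_words.add(word_index)
        # The set grows by at most one word per call, so this fires exactly
        # when the dirty-word count first exceeds the threshold.
        if len(self.dirty_words) == self.threshold + 1:
            self.state = 'I'
        return self.state

    def can_read_locally(self, word_index):
        """A word is usable only if the line is still shared and the word is clean."""
        return self.state == 'S' and word_index not in self.dirty_words


line = SharedLine(threshold=2)
print(line.remote_write(0), line.can_read_locally(1))  # S True
print(line.remote_write(3))                            # S
print(line.remote_write(5), line.can_read_locally(1))  # I False
```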
&lt;br /&gt;
=Performance: MOESI vs MEI/MESI=&lt;br /&gt;
Early multiprocessors (such as the PowerPC processors) were designed to work with three states: Modified, Exclusive and Invalid. This is similar to the MSI protocol discussed earlier; the term MEI is used here to remain consistent with the sources used for reference.  As time progressed, more multiprocessors transitioned to the MESI protocol.  This is most likely due to the characteristics of MEI: it is &amp;quot;easy to implement but leads to inefficiencies in the way that memory bus bandwidth is used&amp;quot; [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 source].  As a result, systems with two or more MEI processors will most likely not see the 'full' potential of the extra processors, due to the inefficient bus transactions.&lt;br /&gt;
&lt;br /&gt;
In contrast, the five-state MOESI protocol is more complex to implement than MESI and MEI/MSI, but it brings advantages.  As Andy Keane (VP of Marketing for PMC-Sierra) put it, &amp;quot;this [fifth] state allows shared data that is dirty to remain in the cache.  Without this state, any shared line would be written back to memory to change the original state from modified. Since we have a dedicated, fast path from CPU to CPU, this state maximizes the use of this path rather than the path to memory, which is inherently slower&amp;quot; [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 source].  This advantage is realized by allowing MOESI to make some processor-to-processor transfers directly between level 2 caches.&lt;br /&gt;
&lt;br /&gt;
Source: [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 Edwards, Chris (08/06/2001). &amp;quot;Architects wrestle with multiprocessor options&amp;quot;. Electronic engineering times (0192-1541), (1178), p. 48.]&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Cache_coherence Cache coherence]&lt;br /&gt;
# [http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]&lt;br /&gt;
# [http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Symmetric_multiprocessing Common System Interface in Intel Processors]&lt;br /&gt;
# [http://www.zak.ict.pwr.wroc.pl/nikodem/ak_materialy/Cache%20consistency%20&amp;amp;%20MESI.pdf Cache consistency with MESI on Intel processor]&lt;br /&gt;
# [http://techreport.com/articles.x/8236/2 AMD dual core Architecture]&lt;br /&gt;
# [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual]&lt;br /&gt;
# [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors]&lt;br /&gt;
# [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of AMD 64 bit core]&lt;br /&gt;
# [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&amp;amp;arnumber=4913 Silicon Graphics Computer Systems]&lt;br /&gt;
# [http://books.google.com/books?id=g82fofiqa5IC&amp;amp;printsec=frontcover&amp;amp;dq=Parallel+computer+architecture:+a+hardware/software+approach+By+David+E.+Culler,+Jaswinder+Pal+Singh,+Anoop+Gupta&amp;amp;source=bl&amp;amp;ots=COrdamlfVn&amp;amp;sig=YcugVqbzTjHvlofvaFq6Ft_tjfY&amp;amp;hl=en&amp;amp;ei=0ZO6S4TJGcOclgejzI3BBw&amp;amp;sa=X&amp;amp;oi=book_result&amp;amp;ct=result&amp;amp;resnum=1&amp;amp;ved=0CAgQ6AEwAA#v=onepage&amp;amp;q=&amp;amp;f=false Parallel computer architecture: a hardware/software approach By David E. Culler, Jaswinder Pal Singh, Anoop Gupta]&lt;br /&gt;
# [http://www.freepatentsonline.com/5283886.html Three state invalidation protocols]&lt;br /&gt;
# [http://portal.acm.org/citation.cfm?id=1499317&amp;amp;dl=GUIDE&amp;amp;coll=GUIDE&amp;amp;CFID=83027384&amp;amp;CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems]&lt;br /&gt;
# [http://portal.acm.org/citation.cfm?id=6514 Cache Coherence protocols: evaluation using a multiprocessor simulation model]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Xerox_Dragon Xerox Dragon]&lt;br /&gt;
# [http://thanaseto.110mb.com/courses/CSD-527-report-engl.pdf Coherence Protocols]&lt;br /&gt;
# [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&amp;amp;arnumber=289691 XDBus]&lt;br /&gt;
# [http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dai0228a/index.html ARM]&lt;br /&gt;
# [http://ixbtlabs.com/articles/ibmpower4/index.html IBM POWER4]&lt;br /&gt;
# [http://www.chiplist.com/Intel_Itanium_2_processor/tree3f-section--2235-/ Intel Itanium 2]&lt;br /&gt;
# [http://techreport.com/articles.x/8236/2 MESI-MESI-MOESI Banana-fana...]&lt;br /&gt;
# [http://rsim.cs.illinois.edu/rsim/Manual/node109.html RSIM]&lt;br /&gt;
# [http://dl.acm.org/citation.cfm?id=1499317 Synapse tightly coupled multiprocessors]&lt;/div&gt;</summary>
		<author><name>Pxu</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=File:MESInew.jpg&amp;diff=59804</id>
		<title>File:MESInew.jpg</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=File:MESInew.jpg&amp;diff=59804"/>
		<updated>2012-03-19T01:34:33Z</updated>

		<summary type="html">&lt;p&gt;Pxu: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;/div&gt;</summary>
		<author><name>Pxu</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8a_fu&amp;diff=59799</id>
		<title>CSC/ECE 506 Spring 2012/8a fu</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8a_fu&amp;diff=59799"/>
		<updated>2012-03-19T01:31:50Z</updated>

		<summary type="html">&lt;p&gt;Pxu: /* MOESI */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Introduction to bus-based cache coherence in real machines=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==SMP Architecture==&lt;br /&gt;
Most parallel software in the commercial market relies on the shared-memory programming model, in which all processors access the same physical address space. The most common multiprocessors today use the [http://en.wikipedia.org/wiki/Symmetric_multiprocessing SMP] architecture, which uses a common bus as the interconnect.  In the case of multicore processors (&amp;quot;chip multiprocessors,&amp;quot; or CMP), the SMP architecture applies to the cores, treating them as separate processors. The key problem of shared-memory multiprocessors is providing a consistent view of memory in the presence of various cache hierarchies.  This is called the '''''cache coherence problem'''''. Solving it is critical for correctness, and it is a performance-sensitive design point in supporting the shared-memory model. The cache coherence mechanisms not only govern communication in a shared-memory multiprocessor, but also typically determine how the memory system transfers data between processors, caches, and memory.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:Busbased SMP.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
At any point in logical time, the permissions for a cache block allow either a single writer or multiple readers. The '''''coherence protocol''''' ensures that the invariants of the states are maintained. The coherence states used by most cache coherence protocols are shown in ''Table 1'':&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''States'''&lt;br /&gt;
|  '''Access Type'''&lt;br /&gt;
|  '''Invariant'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  read, write&lt;br /&gt;
|  all other caches in I state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  read&lt;br /&gt;
|  all other caches in I state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Owned'''&lt;br /&gt;
|  read&lt;br /&gt;
|  all other caches in I or S state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  read&lt;br /&gt;
|  no other cache in M or E state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|  -&lt;br /&gt;
|  -&lt;br /&gt;
|}&lt;br /&gt;
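&lt;br /&gt;
The invariants in ''Table 1'' can be checked mechanically. The following is a minimal sketch, assuming one state letter per cache for a single block; the function name '''is_coherent''' is hypothetical and only illustrates the table.&lt;br /&gt;

```python
# Hypothetical helper: checks the Table 1 invariants for one cache block.
# `states` holds one letter per cache: "M", "O", "E", "S" or "I".
def is_coherent(states):
    valid = [s for s in states if s != "I"]
    # M and E require every other cache to be in the I state.
    if "M" in valid or "E" in valid:
        return len(valid) == 1
    # At most one owner; all remaining valid copies are then S.
    return valid.count("O") in (0, 1)
```

For example, one cache in O with two caches in S satisfies the invariants, while one cache in M alongside a cache in S does not.&lt;br /&gt;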
&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Snooping Protocols=&lt;br /&gt;
==MSI Protocol==&lt;br /&gt;
&lt;br /&gt;
'''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' is a three-state write-back '''invalidation protocol''' and one of the earliest snooping-based cache-coherence protocols. It marks each cache line as '''Modified (M)''', '''Shared (S)''' or '''Invalid (I)'''. '''Invalid''' means the cache line is either not present or is in the invalid state. If the cache line is clean and may be shared by more than one processor, it is marked '''Shared'''. If the cache line is dirty and the processor has exclusive ownership of it, it is in the '''Modified''' state. A BusRdX causes all other caches to invalidate (demote) their copies to the '''I''' state; if the line is in the '''M''' state in another cache, that cache flushes it. A write that hits a line in the '''S''' state still issues a BusRdX, after which the line is promoted (upgraded) to the '''M''' state.&lt;br /&gt;
&lt;br /&gt;
The state diagram of the MSI protocol is shown below. Note that the left part shows the response to processor-side requests, and the right part shows the response to snooper-side requests.&lt;br /&gt;
[[File:MSInew.jpg|600px|thumb|center|MSI state transition diagram]]&lt;br /&gt;
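&lt;br /&gt;
The MSI transitions described above can also be written as a lookup table. This is a minimal sketch for one block in one cache; the event names (PrRd, PrWr, BusRd, BusRdX, Flush) follow common textbook usage, and the table and function names are illustrative assumptions.&lt;br /&gt;

```python
# Minimal MSI sketch for one block in one cache.
# Key: (current state, event). Value: (next state, bus action or None).
# PrRd/PrWr come from the processor; BusRd/BusRdX are snooped from the bus.
MSI = {
    ("I", "PrRd"):   ("S", "BusRd"),
    ("I", "PrWr"):   ("M", "BusRdX"),
    ("S", "PrRd"):   ("S", None),
    ("S", "PrWr"):   ("M", "BusRdX"),   # a write hit in S still goes to the bus
    ("S", "BusRd"):  ("S", None),
    ("S", "BusRdX"): ("I", None),
    ("M", "PrRd"):   ("M", None),
    ("M", "PrWr"):   ("M", None),
    ("M", "BusRd"):  ("S", "Flush"),    # supply the dirty block, keep a shared copy
    ("M", "BusRdX"): ("I", "Flush"),
}

def msi_step(state, event):
    # Events not listed (e.g. snoops seen while in I) leave the state unchanged.
    return MSI.get((state, event), (state, None))
```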
&lt;br /&gt;
==Synapse protocol==&lt;br /&gt;
From the state transition diagram of MSI, we observe a transition from state '''M''' to state '''S''' when a BusRd is observed for that block. The contents of the block are flushed to the bus before the line goes to the '''S''' state. In certain cases it would be more appropriate to move to the '''I''' state, giving up the block entirely. The choice between moving to '''S''' or '''I''' reflects the designer's assertion about whether the original processor is more likely to continue reading the block than the new processor is to write to it. The Synapse protocol, used in the early Synapse multiprocessor, made the alternative choice of going directly from the '''M''' state to the '''I''' state on a BusRd, assuming the migratory pattern would be more frequent. More details about this protocol can be found in these papers published in the 1980s: [http://portal.acm.org/citation.cfm?id=6514 Cache Coherence protocols: evaluation using a multiprocessor simulation model] and [http://portal.acm.org/citation.cfm?id=1499317&amp;amp;dl=GUIDE&amp;amp;coll=GUIDE&amp;amp;CFID=83027384&amp;amp;CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems]&lt;br /&gt;
&lt;br /&gt;
In the Synapse protocol, the '''M''' state is called the '''D''' (Dirty) state. The following is the state transition diagram for the Synapse protocol:&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:Synapse1.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
===Real Architectures using Synapse===&lt;br /&gt;
Nothing can be found on any actual architecture using the Synapse cache-coherence protocol.  From the abstract of the paper written by Steve Frank and Armond Inselberg in 1984, &amp;quot;[http://dl.acm.org/citation.cfm?id=1499317 Synapse Tightly Coupled Multiprocessors: A New Approach to Solve Old Problems]&amp;quot;, Synapse was theoretical.&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[http://www.disi.unige.it/person/DelzannoG/CacheProtocol/synapse.hy The BABYLON Project] provides one example of a Synapse cache coherence protocol with atomic synchronization actions.&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
In the Synapse system, memory contention is reduced by designing a processor cache that employs a non-write-through algorithm to minimize the bandwidth used between cache and shared memory. The Synapse Expansion Bus includes an ownership-level protocol between processor caches.[[#References|&amp;lt;sup&amp;gt;[24]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
==[http://en.wikipedia.org/wiki/MESI_protocol MESI]==&lt;br /&gt;
The main drawback of MSI is that each read-write sequence incurs two bus transactions, irrespective of whether the cache line is stored in only one cache or not. Highly parallel programs that have little data sharing suffer the most from this.  The MESI protocol solves this problem by introducing the '''Exclusive''' state to distinguish between a cache line stored in multiple caches and a line stored in a single cache.&lt;br /&gt;
Let us briefly see how the MESI protocol works. For a more detailed version, refer to the Solihin textbook, p. 215.&lt;br /&gt;
&lt;br /&gt;
The MESI coherence protocol marks each cache line as being in one of the Modified, Exclusive, Shared, or Invalid states. &lt;br /&gt;
* '''Invalid''' : The cache line is either not present or is invalid&lt;br /&gt;
* '''Exclusive''' : The cache line is clean and is owned by this core/processor only&lt;br /&gt;
* '''Modified''' : The cache line is dirty and the core/processor has exclusive ownership of it, exclusive of the memory also.&lt;br /&gt;
* '''Shared''' : The cache line is clean and is shared by more than one core/processor&lt;br /&gt;
&lt;br /&gt;
The MESI protocol works as follows: &lt;br /&gt;
A line that is fetched receives the '''E''' or '''S''' state depending on whether it exists in other processors' caches in the system. A cache line gets the '''M''' state when a processor writes to it; if the line is not in the '''E''' or '''M''' state prior to the write, the cache sends a Bus Upgrade (BusUpgr) signal or, as the Intel manuals term it, a “Read-For-Ownership (RFO) request”, which ensures that the line exists in this cache and is in the '''I''' state in all other processors on the bus (if any). The table below summarizes the MESI protocol.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''This cache line is valid?'''&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  No&lt;br /&gt;
|-&lt;br /&gt;
|  '''The memory copy is…'''&lt;br /&gt;
|  out of date&lt;br /&gt;
|  valid&lt;br /&gt;
|  valid&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies exist in caches of other processors?'''&lt;br /&gt;
|  No&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe&lt;br /&gt;
|  Maybe&lt;br /&gt;
|-&lt;br /&gt;
|  '''A write to this line'''&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  goes directly to bus&lt;br /&gt;
|}&lt;br /&gt;
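&lt;br /&gt;
The saving that the '''Exclusive''' state brings over MSI can be sketched by counting bus transactions for a read miss followed by a write. This is an illustrative model only; the function names are assumptions, and the bus transaction names (BusUpgr, BusRdX) follow common textbook usage.&lt;br /&gt;

```python
# MESI sketch for the processor side of one cache (names are illustrative).
def mesi_read_miss(others_have_copy):
    # A fetched line is S if any other cache holds it, otherwise E.
    return "S" if others_have_copy else "E"

def mesi_write(state):
    # Returns (next_state, bus_transaction or None).
    if state in ("M", "E"):
        return ("M", None)           # silent upgrade, no bus traffic
    if state == "S":
        return ("M", "BusUpgr")      # invalidate the other copies
    return ("M", "BusRdX")           # I: fetch the line with ownership

def bus_transactions_read_then_write(protocol, others_have_copy=False):
    # Counts bus transactions for a read miss followed by a write.
    if protocol == "MSI":
        return 2                     # BusRd, then a second transaction for the write
    state = mesi_read_miss(others_have_copy)
    return 1 + (0 if mesi_write(state)[1] is None else 1)
```

When no other cache holds the line, the MESI sequence needs one bus transaction where MSI needs two; when the line is shared, both need two.&lt;br /&gt;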
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The transition diagram from the lecture slides is given below for reference.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:MESI.jpg]]&amp;lt;/center&amp;gt; &amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Real Architectures using MESI===&lt;br /&gt;
&lt;br /&gt;
Intel's [http://en.wikipedia.org/wiki/Pentium_Pro Pentium Pro] microprocessor, introduced in 1995, was the first Intel architecture microprocessor to support symmetric multiprocessing in various multiprocessor configurations.  SMP with the MESI protocol was used consistently until the introduction of the 45-nm Hi-k Core microarchitecture in Intel's quad-core x86-64 (Nehalem-EP) processors. The 45-nm Hi-k Intel Core microarchitecture uses a new system framework called the [http://en.wikipedia.org/wiki/Intel_QuickPath_Interconnect QuickPath Interconnect], which uses point-to-point interconnection technology based on a distributed shared memory architecture.  It uses a modified version of the MESI protocol called [http://en.wikipedia.org/wiki/MESIF_protocol MESIF], which introduces the additional Forward state.  (The state diagram of the MESI transitions that occur within the Pentium's data cache can be found on page 63 of [http://books.google.com/books?id=TVzjEZg1--YC&amp;amp;printsec=frontcover&amp;amp;source=gbs_ge_summary_r&amp;amp;cad=0#v=onepage&amp;amp;q&amp;amp;f=false Pentium Processor System Architecture: Second Edition] by Don Anderson, Tom Shanley, MindShare, Inc.)&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
The [http://www.arm.com/products/processors/classic/arm11/arm11-mpcore.php ARM11 MPCore] and [http://en.wikipedia.org/wiki/ARM_Cortex-A9_MPCore Cortex-A9 MPCore] processors support the MESI cache coherency protocol.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]  ARM MPCore defines the states of the MESI protocol it implements as:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies in other caches'''&lt;br /&gt;
|  NO&lt;br /&gt;
|  NO&lt;br /&gt;
|  YES&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Clean or Dirty'''&lt;br /&gt;
|  DIRTY&lt;br /&gt;
|  CLEAN&lt;br /&gt;
|  CLEAN&lt;br /&gt;
|  -&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
The coherency protocol is implemented and managed by the Snoop Control Unit (SCU) in the ARM MPCore, which monitors the traffic between the local L1 data caches and the next level of the memory hierarchy. At boot time, each core can choose whether or not to take part in the coherency domain.  Unless explicit system calls bind a task to a specific core ([http://en.wikipedia.org/wiki/Processor_affinity processor affinity]), there is a high chance that a task will at some point migrate to a different core, along with its data as it is used.  Migration of tasks is not handled efficiently by a literal MESI implementation, so the ARM MPCore offers two optimizations that allow for both MESI compliance and migration of tasks. With '''Direct Data Intervention (DDI)''', the SCU keeps a copy of the tag RAMs of all cores' caches, which enables it to efficiently detect whether a cache line requested by one core is held by another core in the coherency domain before looking for it in the next level of the memory hierarchy. With '''Cache-to-cache Migration''', if the SCU finds that the cache line requested by one CPU is present in another core, it will either copy it (if clean) or move it (if dirty) from the other CPU directly into the requesting one, without interacting with external memory.  These optimizations reduce memory traffic in and out of the L1 cache subsystem by eliminating interaction with external memories, which in effect reduces the overall load on the interconnect and the overall power consumption.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
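&lt;br /&gt;
The copy-or-move decision described above can be sketched as follows. This is a simplified model under stated assumptions, not ARM's implementation: the dictionary '''tags''' stands in for the SCU's duplicated tag RAMs, and the function name and return strings are hypothetical.&lt;br /&gt;

```python
# Sketch of the SCU decision described above (all names are illustrative).
# `tags` maps a core id to {line_address: "clean" or "dirty"}.
def scu_fetch(tags, requester, line):
    for core, lines in tags.items():
        if core != requester and line in lines:
            status = lines[line]
            if status == "dirty":
                del lines[line]                 # move: the source gives the line up
            tags[requester][line] = status      # copy (clean) or move (dirty)
            return "cache-to-cache from core " + str(core)
    # No other core holds the line: go to the next level of the hierarchy.
    tags[requester][line] = "clean"
    return "fetched from next memory level"
```

A dirty line requested by core 1 while held by core 0 ends up only in core 1's cache, and external memory is not involved.&lt;br /&gt;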
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
Other architectures that use the MESI cache-coherence protocol include the L2 cache of the IBM POWER4 processor[[#References|&amp;lt;sup&amp;gt;[20]&amp;lt;/sup&amp;gt;]], the L2 cache of the Intel Itanium 2 processor[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]], and the Intel Xeon[[#References|&amp;lt;sup&amp;gt;[22]&amp;lt;/sup&amp;gt;]].&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
One implementation complexity already mentioned is the inefficiency of task migration in MESI that the ARM MPCore addresses.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]  Another possible implementation complexity is found during the replacement of a cache line: one possible MESI implementation requires a message to be sent to main memory when a cache line is flushed (i.e. an E to I transition), as the line was exclusively in one cache before it was removed. It is possible to avoid this replacement message if the system is designed so that the flush of a modified (exclusive) line requires an acknowledgment from main memory. However, this requires the flush to be stored in a 'write-back' buffer until the reply arrives (to ensure the change is successfully propagated to memory).[[#References|&amp;lt;sup&amp;gt;[23]&amp;lt;/sup&amp;gt;]]&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==MESIF==&lt;br /&gt;
Let us now walk through a brief overview of the '''MESIF protocol''':&lt;br /&gt;
&lt;br /&gt;
The '''MESIF''' protocol, used in the latest Intel multi-core processors, was introduced to '''accommodate the point-to-point''' links used in the QuickPath Interconnect. Using the '''MESI''' protocol in this architecture would send many redundant messages between different processors, often with unnecessarily high latency. For example, when a processor requests a cache line that is stored in multiple locations, every location might respond with the data. As the requesting processor only needs a single copy of the data, the system would be wasting bandwidth. &lt;br /&gt;
As a solution to this problem, an additional state, the '''Forward state''', was added by slightly changing the role of the Shared state. Whenever there is a read request, only the cache holding the line in the F state responds to the request, while all the caches holding it in the S state remain dormant.  Hence, by designating a single cache line to '''respond to requests''', coherency traffic is substantially reduced when multiple copies of the data exist. On a read request, the responding line also transitions from the F state to the S state. That is, when a cache line in the '''F''' state is '''copied''', the F state '''migrates''' to the '''newer copy''', while the '''older''' one drops back to '''S'''. Moving the new copy to the F state '''exploits''' both '''temporal and spatial locality'''. Because the newest copy of the cache line is always in the F state, it is very unlikely that the line in the F state will be evicted from the caches. This takes advantage of the temporal locality of the request. The second advantage is that if a particular cache line is in high demand due to spatial locality, the bandwidth used to transmit that data will be spread across several nodes.&lt;br /&gt;
All M-to-S and E-to-S state transitions now become '''M to F''' and '''E to F''' transitions.  &lt;br /&gt;
The '''F state''' is '''different''' from the '''Owned state''' of the MOESI protocol as it is '''not''' a unique copy because a valid copy is stored in memory. Thus, unlike the Owned state of the MOESI protocol, in which the data in the O state is the only valid copy of the data, the data in the F state can be evicted or converted to the S state, if desired. &lt;br /&gt;
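&lt;br /&gt;
The migration of the F state on a read can be sketched as follows. This is a simplified model that assumes the line is clean (no cache holds it in M or E); the function name and the dictionary of per-cache states are illustrative assumptions.&lt;br /&gt;

```python
# MESIF read sketch: `caches` maps a cache name to its state for one line.
# Assumes a clean line, i.e. no cache currently holds it in M or E.
def mesif_read(caches, requester):
    others = {c: s for c, s in caches.items() if c != requester}
    if all(s == "I" for s in others.values()):
        caches[requester] = "E"      # no other copy: memory supplies the line
        return []
    # Only the cache holding F answers; caches in S stay silent.
    responders = [c for c, s in others.items() if s == "F"]
    for c in responders:
        caches[c] = "S"              # the old forwarder drops back to S
    caches[requester] = "F"          # the newest copy becomes the forwarder
    return responders                # caches that supplied the data
```

Starting from one cache in F and one in S, a read by a third cache is answered by a single responder, after which the requester holds the line in F.&lt;br /&gt;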
&lt;br /&gt;
More information on the QuickPath Interconnect and MESIF protocol can be found at&lt;br /&gt;
'''[http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]'''&lt;br /&gt;
&lt;br /&gt;
==MOESI==&lt;br /&gt;
The [http://en.wikipedia.org/wiki/Opteron AMD Opteron] was AMD’s first-generation dual-core processor, with 2 distinct [http://en.wikipedia.org/wiki/Athlon_64 K8 cores] together on a single die.  Cache coherence poses bigger problems on such multiprocessors, so an appropriate coherence protocol was needed to address them. The [http://en.wikipedia.org/wiki/Xeon Intel Xeon], the competing processor from Intel, used the MESI protocol to handle cache coherence.  MESI has the drawback of consuming considerable time and bandwidth in certain situations. &lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/MOESI_protocol MOESI] was AMD’s answer to this problem. MOESI adds a fifth state, '''“Owned”''', to the MESI protocol. It addresses the bandwidth problem that MESI faces when a processor holding invalid data in its cache wants to access data that another processor has modified.  Under MESI, the processor seeking access has to wait for the processor that modified the data to write it back to main memory, which takes time and bandwidth. MOESI removes this drawback by allowing dirty sharing.  When a processor holds the data in the new '''“Owned”''' state, it can provide the modified data to other processors without, or even before, writing it to main memory. This is called '''''dirty sharing'''''. The processor holding the data in the '''&amp;quot;Owned&amp;quot;''' state remains responsible for updating main memory later, when the cache line is evicted.&lt;br /&gt;
&lt;br /&gt;
MOESI has become one of the most popular snoop-based protocols supported in the AMD64 architecture.  The AMD dual-core Opteron can maintain cache coherence in systems up to 8 processors using this protocol.&lt;br /&gt;
&lt;br /&gt;
The five different states of the MOESI protocol are:&lt;br /&gt;
* '''Modified (M)''' : The most recent copy of the data is present in the cache line. But it is not present in any other processor cache.&lt;br /&gt;
* '''Owned (O)'''   : The cache line has the most recent correct copy of the data, which may be shared by other processors. The processor holding this cache line in the Owned state is responsible for writing the correct value back to main memory when the line is evicted.  &lt;br /&gt;
* '''Exclusive (E)''' : A cache line holds the most recent, correct copy of the data, which is exclusively present on this processor and a copy is present in the main memory.  &lt;br /&gt;
* '''Shared (S)''' : A cache line in the shared state holds the most recent, correct copy of the data, which may be shared by other processors. &lt;br /&gt;
* '''Invalid (I)''' : A cache line does not hold a valid copy of the data.&lt;br /&gt;
&lt;br /&gt;
A detailed explanation of this protocol implementation on AMD processor can be found in the manual [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of the AMD 64-bit core]&lt;br /&gt;
&lt;br /&gt;
The following table summarizes the MOESI protocol:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Owned'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''This cache line is valid?'''&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  No&lt;br /&gt;
|-&lt;br /&gt;
|  '''The memory copy is…'''&lt;br /&gt;
|  out of date&lt;br /&gt;
|  out of date&lt;br /&gt;
|  valid&lt;br /&gt;
|  valid, or out of date if another cache holds the line in Owned&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies exist in caches of other processors?'''&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe&lt;br /&gt;
|  Maybe&lt;br /&gt;
|-&lt;br /&gt;
|  '''A write to this line'''&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  goes directly to bus&lt;br /&gt;
|}&lt;br /&gt;
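&lt;br /&gt;
The difference that dirty sharing makes can be sketched by comparing how a cache reacts when it snoops another cache's read. This is a minimal model with hypothetical function names; whether a clean line is supplied by a cache or by memory is implementation specific and is not modeled here.&lt;br /&gt;

```python
# Snoop-side sketch: reaction of one cache to another cache's read (BusRd).
# Each function returns (next_state, memory_write_back).
def moesi_snoop_busrd(state):
    if state in ("M", "O"):
        return ("O", False)     # dirty sharing: supply the data, no write-back yet
    if state == "E":
        return ("S", False)     # clean line, nothing to write back
    return (state, False)       # S and I are unaffected

def mesi_snoop_busrd(state):
    if state == "M":
        return ("S", True)      # MESI must flush the dirty line to memory
    if state == "E":
        return ("S", False)
    return (state, False)
```

For a line held in M, the MOESI variant moves to O without a memory write-back, while the MESI variant moves to S and writes the line back.&lt;br /&gt;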
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The state transition diagram for the MOESI protocol is shown below. The top part shows the response to processor-side requests and the bottom part shows the response to snooper-side requests. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
[[File:MOESI.jpg|500px|thumb|center|MOESI state transition diagram]]&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br /&amp;gt;&lt;br /&gt;
&amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Optimization techniques on MOESI===&lt;br /&gt;
&lt;br /&gt;
In real machines, applying optimization techniques to the standard cache coherence protocol can improve performance. For example, the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] family of microprocessors (Family 0x10), which is AMD’s first generation to incorporate 4 distinct cores on a single die and the first to have a cache that all the cores share, uses the MOESI protocol with some optimization techniques incorporated. &lt;br /&gt;
&lt;br /&gt;
The optimization targets a small subset of compute problems that behave like producer-consumer programs. In such a problem, a thread of a program running on one core produces data, which is consumed by a thread running on a separate core. With such programs, it is desirable to get the two distinct cores to communicate through the shared cache, to avoid round trips to and from main memory. The '''MOESI''' protocol that the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] cache uses for cache coherence can also limit bandwidth. Hence, by keeping the cache line in the '''‘M’''' state for such computing problems, better performance can be achieved.&lt;br /&gt;
&lt;br /&gt;
When the producer thread writes a new entry into the ring buffer it shares with the consumer, it allocates cache lines in the '''modified (M)''' state. Eventually, these M-marked cache lines will start to fill the L3 cache. When the consumer reads the cache line, the MOESI protocol changes the state of the cache line to '''owned (O)''' in the L3 cache and pulls down a '''shared (S)''' copy for its own use. The producer thread then circles the ring buffer and arrives back at the same cache line it had previously written. However, when the producer attempts to write new data to the owned (marked '''‘O’''') cache line, it finds that it cannot, since a cache line marked '''‘O’''' by the previous consumer read does not have sufficient permission for a write request (in the MOESI protocol). To maintain coherence, the memory controller must initiate probes in the other caches (to handle any other S copies that may exist). This slows down the process.&lt;br /&gt;
&lt;br /&gt;
Thus, it is preferable to keep the cache line in the '''‘M’''' state in the L3 cache. In such a situation, when the producer comes back around the ring buffer, it finds the previously written cache line still marked '''‘M’''', to which it is safe to write without coherence concerns. Thus better performance can be achieved by such optimization techniques to standard protocols when implemented in real machines.&lt;br /&gt;
&lt;br /&gt;
You can find more information on how this is implemented and various other ways of optimizations in this manual [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors]&lt;br /&gt;
&lt;br /&gt;
==Dragon Protocol==&lt;br /&gt;
The '''[http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]''' is an update-based coherence protocol: unlike the coherence protocols seen so far, it does not invalidate other cached copies. Write propagation is achieved by updating the cached copies instead of invalidating them.  The Dragon Protocol does not update memory on a cache-to-cache transfer; it delays making memory consistent with the cache until the data is evicted and written back, which saves time and lowers the memory access requirements. Moreover, only the written '''byte''' or '''word''' is '''communicated''' to the '''other caches''' instead of the whole block, which further '''reduces''' the '''bandwidth''' usage. The protocol can dynamically detect the sharing status of a block, and uses a write-through policy for shared blocks and a write-back policy for currently non-shared blocks. The Dragon Protocol employs the following four states for the cache blocks: '''Shared Clean''', '''Shared Modified''', '''Exclusive''' and  '''Modified'''. &lt;br /&gt;
* '''Modified (M)''' and '''Exclusive (E)''' - these states have the same meaning as explained in the protocols above. &lt;br /&gt;
* '''Shared Modified (Sm)''' - Potentially two or more caches hold this block, memory may or may not be up to date, and this processor's cache has modified the block. Only one cache in the system can hold a given block in the Shared Modified state.&lt;br /&gt;
* '''Shared Clean (Sc)''' - Potentially two or more caches hold this block, and memory may or may not be up to date (memory is up to date only if no other cache holds the block in the Sm state).&lt;br /&gt;
A Shared Modified block is written back to main memory only when the line is evicted from the cache on a cache miss; this is the point at which memory is brought up to date. For more information on the Dragon protocol, refer to the Solihin textbook, page 229. The state transition diagram is given below for reference.&lt;br /&gt;
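&lt;br /&gt;
The four states can be exercised with a small Python sketch. It follows the commonly taught form of the Dragon transitions (BusRd on a miss, BusUpd on a write to a shared block) and is an illustration only: the function name dragon_access and the list-of-states representation are assumptions made for this example, and evictions are not modeled.&lt;br /&gt;
&lt;br /&gt;
```python
def dragon_access(states, cpu, op):
    # states: per-cache state of one block; None means the block is not cached.
    # op is 'read' or 'write'. Returns the list of bus transactions issued.
    others = [i for i in range(len(states)) if i != cpu and states[i] is not None]
    shared = len(others) != 0          # models the shared line on the bus
    mine = states[cpu]
    bus = []
    if mine is None:
        bus.append('BusRd')
        # Snoopers: E becomes Sc, M becomes Sm and supplies the data.
        for i in others:
            if states[i] == 'E':
                states[i] = 'Sc'
            elif states[i] == 'M':
                states[i] = 'Sm'
        mine = 'Sc' if shared else 'E'
    if op == 'write':
        if shared:
            bus.append('BusUpd')
            # Every other holder takes the new word and ends up Shared Clean.
            for i in others:
                states[i] = 'Sc'
            mine = 'Sm'
        else:
            mine = 'M'                 # no sharers: plain write-back behavior
    states[cpu] = mine
    return bus
```
&lt;br /&gt;
Note that after any sequence of calls at most one cache holds the block in Sm, matching the rule stated in the list above.&lt;br /&gt;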
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:Dragon.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The Dragon protocol implements snoopy caches that provide the appearance of a uniform memory space to multiple processors. Each cache listens to two buses: the processor bus and the memory bus. The caches are also responsible for address translation, so the processor bus carries virtual addresses and the memory bus carries physical addresses. The Dragon system was designed to support 4 to 8 Dragon processors.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=[http://en.wikipedia.org/wiki/Instruction_prefetch Instruction Prefetching]=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Instruction prefetching is a technique used to speed up the execution of a program. In multiprocessors, however, prefetching comes at a cost: because of prefetching, data can be modified in ways that the memory coherence protocol is not able to handle. In such situations, software must use serializing instructions or cache-invalidation instructions to guarantee that subsequent data accesses are coherent.&lt;br /&gt;
&amp;lt;br /&amp;gt;&lt;br /&gt;
An example of such a situation is a page-table update followed by accesses to the physical pages referenced by the updated page tables. The physical-memory references for the page tables are different from the physical-memory references for the data, so prefetching can cause correctness problems. The following sequence of events shows such a situation, in which software changes the translation of virtual-page A from physical-page M to physical-page N:&lt;br /&gt;
# The tables that translate virtual-page A to physical-page M are now held only in main memory. The copies in the cache are invalidated.&lt;br /&gt;
# Software changes the page-table entry for virtual-page A in main memory to point to physical-page N rather than physical-page M.&lt;br /&gt;
# Data in virtual-page A is accessed.&lt;br /&gt;
&lt;br /&gt;
Software expects the processor to access the data from physical-page N after the update. However, it is possible for the processor to prefetch the data from physical-page M before the page table for virtual page A is updated. Because the physical-memory references are different, the processor does not recognize them as requiring coherence checking and believes it is safe to prefetch the data from virtual-page A, which is translated into a read from physical page M. Similar behavior can occur when instructions are prefetched from beyond the page-table update instruction.&lt;br /&gt;
&lt;br /&gt;
To prevent such errors, software must execute a special instruction immediately after the page-table update (in the AMD64 architecture, INVLPG or MOV CR3) to ensure that subsequent instruction fetches and data accesses use the correct virtual-page-to-physical-page translation. It is not necessary to perform a TLB invalidation operation preceding the table update.&lt;br /&gt;
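&lt;br /&gt;
The hazard and its fix can be mimicked with a short Python sketch. This is only an analogy under stated assumptions: ToyMmu, its cached dictionary and its invalidate method are hypothetical stand-ins for the early translation done by the processor and for the invalidating instruction, not a model of any real hardware interface.&lt;br /&gt;
&lt;br /&gt;
```python
class ToyMmu:
    def __init__(self, page_table):
        self.page_table = page_table   # virtual page mapped to physical page
        self.cached = {}               # translations the processor fetched early

    def translate(self, vpage):
        # Reuse an early (possibly stale) translation if one exists.
        if vpage not in self.cached:
            self.cached[vpage] = self.page_table[vpage]
        return self.cached[vpage]

    def invalidate(self, vpage):
        # Stand-in for the instruction run right after the page-table update.
        self.cached.pop(vpage, None)
```
&lt;br /&gt;
If translate('A') runs before the table entry for A is changed from M to N, later calls keep returning M until invalidate('A') is called, after which the access goes to N as the software expects.&lt;br /&gt;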
&lt;br /&gt;
More information on this can be found in the [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual].&lt;br /&gt;
&lt;br /&gt;
= CMP Implementation in Intel Architecture =&lt;br /&gt;
&lt;br /&gt;
Let us now see how Intel architecture using the MESI protocol progressed from a uniprocessor architecture to a Chip MultiProcessor (CMP) using the bus as the interconnect. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Uniprocessor Architecture'''&lt;br /&gt;
&lt;br /&gt;
The diagram below shows the structure of the memory cluster in the Intel Pentium M processor.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:intel_cache1.jpg]]&amp;lt;/center&amp;gt; &amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In this structure we have,&lt;br /&gt;
* A unified on-chip '''L1 cache''' with the '''processor/core''',&lt;br /&gt;
* A '''Memory/L2 access control unit''', through which all the accesses to the L2 cache, main memory and IO space are made,&lt;br /&gt;
* The second level '''L2 cache''' along with the '''prefetch unit''' and&lt;br /&gt;
* '''Front side bus (FSB)''', a single shared bi-directional bus across which all the traffic is sent. These wide buses bring in multiple data bytes at a time.&lt;br /&gt;
&lt;br /&gt;
As Intel explains it, with this structure the processor requests were first sought in the '''L2 cache''' and were '''forwarded''' to main '''memory''' via the front side bus ('''FSB''') only on a '''miss'''. The '''Memory/L2 access control''' unit served as a central point for '''maintaining coherence''' within the core and with the external world. It '''contains''' a '''snoop control unit''' that receives snoop requests from the bus and performs the required operations on each cache (and internal buffers) in parallel. It also handles RFO requests (BusUpgr) and ensures that the operation continues only after it guarantees that no other version of the cache line exists in any other cache in the system.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''CMP Architecture'''&lt;br /&gt;
&lt;br /&gt;
For the CMP implementation, Intel chose the bus-based architecture with snoopy protocols over the '''directory protocol''': although a directory protocol reduces active power thanks to reduced snoop activity, it '''increases''' the '''design complexity''' and the '''static power''' because of its larger tag arrays. Since Intel has a large market for processors in the mobility family, a directory-based solution was less favorable, because battery life depends mainly on static power consumption and less on dynamic power.&lt;br /&gt;
Let us examine how '''CMP''' was implemented in the '''Intel Core Duo''', which was one of the first dual-core processors for the budget/entry-level market.&lt;br /&gt;
The general CMP implementation structure of the Intel Core Duo is shown below.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:intel_cache2.jpg]]&amp;lt;/center&amp;gt; &amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This structure has the following changes when compared to the uniprocessor memory cluster structure. &lt;br /&gt;
* '''L1 cache''' and the '''processor/core''' structure is '''duplicated''' to give 2 cores.&lt;br /&gt;
* The '''Memory/L2 access control''' unit is '''split''' into 2 logical units: '''L2 controller''' and '''bus controller'''. The L2 controller handles all '''requests to the L2''' cache from the core and the snoop requests from the FSB. The '''bus controller''' handles '''data and I/O requests''' to and from the FSB.&lt;br /&gt;
* The '''prefetching''' unit is extended to handle the hardware '''prefetches for each core separately'''.&lt;br /&gt;
* A '''new logical unit''' (represented by the hexagon) was added to maintain '''fairness between the requests''' coming from the different cores and hence balance the requests to L2 and memory.&lt;br /&gt;
&lt;br /&gt;
This new '''partitioned structure''' for the  memory/L2 access control unit '''enhanced''' the '''performance''' while '''reducing power consumption'''. &lt;br /&gt;
For more information on uniprocessor and multiprocessor implementation under the Intel architecture, refer to &lt;br /&gt;
[http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]&lt;br /&gt;
&lt;br /&gt;
The '''Intel bus architecture''' has been '''evolving''' to accommodate the demands of scalability while using the same MESI protocol: from a '''single shared bus''', to '''dual independent buses (DIB)''' that doubled the available bandwidth, and on to the logical conclusion of DIB with the introduction of '''dedicated high-speed interconnects (DHSI)'''. The DHSI-based platforms use four FSBs, one for each processor in the platform. In both DIB and DHSI, a snoop filter in the chipset caches snoop information, thereby significantly reducing the snoop traffic that must be broadcast on the buses. With the production of processors based on the next-generation 45-nm Hi-k Intel Core microarchitecture, the [http://en.wikipedia.org/wiki/Xeon Intel Xeon] processor fabric will transition from DHSI, with the memory controller in the chipset, to a distributed shared memory architecture using '''Intel QuickPath Interconnects''' and the '''MESIF protocol'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Word Invalidation=&lt;br /&gt;
One complexity that applies to a number of the protocols concerns invalidation. In newer protocols, individual words in a cache line may be modified, as opposed to the entire line. The other processors thus hold a mostly correct cache line that differs in only one word. This leads to potential complexity: to 'correct' the error, the other processors must transition from shared to invalid, read the correct word from the bus, place it back into the line, and transition back to shared. This transition is potentially unnecessary, because the second processor may never access the specific word that changed, even though it may access other words in the cache line. One potential solution is being researched: advancing the protocols so that a cache line is &amp;quot;not invalidated on the first dirty word, but after the number of dirty words crosses some predetermined value, which is data type and application dependent.&amp;quot; In other words, if the application can tolerate multiple 'incorrect' words, several transitions to and from the invalid state may be avoided.&lt;br /&gt;
&amp;lt;br/&amp;gt;Source: http://tab.computer.org/tcca/NEWS/sept96/dsmideas.ps&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In essence, the solution proposed here is to advance the MOESI protocol with word invalidation and specific treatment of temporal and spatial data, so that the block is not invalidated.&lt;br /&gt;
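&lt;br /&gt;
The threshold idea quoted above can be sketched in Python as follows. This is a minimal illustration, not the scheme from the cited source: the class name ThresholdLine, the per-word stale flags and the threshold parameter are all assumptions introduced for this example.&lt;br /&gt;
&lt;br /&gt;
```python
class ThresholdLine:
    def __init__(self, words, threshold):
        self.stale = [False] * words   # words another processor has overwritten
        self.threshold = threshold     # hypothetical, application-dependent limit
        self.valid = True

    def remote_write(self, word):
        # Another processor modified one word of the line.
        self.stale[word] = True
        # The stale count grows by at most one per call, so it crosses the
        # threshold exactly when it reaches threshold + 1.
        if sum(self.stale) == self.threshold + 1:
            self.valid = False

    def can_read(self, word):
        # A word is usable while the line is valid and that word is not stale.
        return self.valid and not self.stale[word]
```
&lt;br /&gt;
With a threshold of 0 the line is invalidated on the first dirty word, as in a conventional protocol; with a larger threshold the untouched words stay readable for longer.&lt;br /&gt;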
&lt;br /&gt;
=Performance: MOESI vs MEI/MESI=&lt;br /&gt;
Early multiprocessors (such as the PowerPC processors) were designed to work with three states: Modified, Exclusive, and Invalid. This is similar to the MSI protocol discussed earlier; the term MEI is used here to remain consistent with the sources used for reference. As time progressed, more multiprocessors transitioned to the MESI protocol. This is most likely due to the characteristics of MEI: it is &amp;quot;easy to implement but leads to inefficiencies in the way that memory bus bandwidth is used&amp;quot; [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 source]. As a result, systems with two or more MEI processors will most likely not see the 'full' potential of the extra processors, because of the inefficient bus transactions.&lt;br /&gt;
&lt;br /&gt;
In contrast, the five-state MOESI protocol is more complex to implement than MESI and MEI/MSI, but it brings advantages. As Andy Keane (VP of Marketing for PMC-Sierra) put it, &amp;quot;this [fifth] state allows shared data that is dirty to remain in the cache.  Without this state, any shared line would be written back to memory to change the original state from modified. Since we have a dedicated, fast path from CPU to CPU, this state maximizes the use of this path rather than the path to memory, which is inherently slower&amp;quot; [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 source].  MOESI realizes this advantage by allowing some processor-to-processor transfers directly from the level 2 caches.&lt;br /&gt;
&lt;br /&gt;
Source: [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 Edwards, Chris (08/06/2001). &amp;quot;Architects wrestle with multiprocessor options&amp;quot;. Electronic engineering times (0192-1541), (1178), p. 48.]&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Cache_coherence Cache coherence]&lt;br /&gt;
# [http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]&lt;br /&gt;
# [http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Symmetric_multiprocessing Common System Interface in Intel Processors]&lt;br /&gt;
# [http://www.zak.ict.pwr.wroc.pl/nikodem/ak_materialy/Cache%20consistency%20&amp;amp;%20MESI.pdf Cache consistency with MESI on Intel processor]&lt;br /&gt;
# [http://techreport.com/articles.x/8236/2 AMD dual core Architecture]&lt;br /&gt;
# [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual]&lt;br /&gt;
# [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors]&lt;br /&gt;
# [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of AMD 64 bit core]&lt;br /&gt;
# [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&amp;amp;arnumber=4913 Silicon Graphics Computer Systems]&lt;br /&gt;
# [http://books.google.com/books?id=g82fofiqa5IC&amp;amp;printsec=frontcover&amp;amp;dq=Parallel+computer+architecture:+a+hardware/software+approach+By+David+E.+Culler,+Jaswinder+Pal+Singh,+Anoop+Gupta&amp;amp;source=bl&amp;amp;ots=COrdamlfVn&amp;amp;sig=YcugVqbzTjHvlofvaFq6Ft_tjfY&amp;amp;hl=en&amp;amp;ei=0ZO6S4TJGcOclgejzI3BBw&amp;amp;sa=X&amp;amp;oi=book_result&amp;amp;ct=result&amp;amp;resnum=1&amp;amp;ved=0CAgQ6AEwAA#v=onepage&amp;amp;q=&amp;amp;f=false Parallel computer architecture: a hardware/software approach By David E. Culler, Jaswinder Pal Singh, Anoop Gupta]&lt;br /&gt;
# [http://www.freepatentsonline.com/5283886.html Three state invalidation protocols]&lt;br /&gt;
# [http://portal.acm.org/citation.cfm?id=1499317&amp;amp;dl=GUIDE&amp;amp;coll=GUIDE&amp;amp;CFID=83027384&amp;amp;CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems]&lt;br /&gt;
# [http://portal.acm.org/citation.cfm?id=6514 Cache Coherence protocols: evaluation using a multiprocessor simulation model]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Xerox_Dragon Xerox Dragon]&lt;br /&gt;
# [http://thanaseto.110mb.com/courses/CSD-527-report-engl.pdf Coherence Protocols]&lt;br /&gt;
# [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&amp;amp;arnumber=289691 XDBus]&lt;br /&gt;
# [http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dai0228a/index.html ARM]&lt;br /&gt;
# [http://ixbtlabs.com/articles/ibmpower4/index.html IBM POWER4]&lt;br /&gt;
# [http://www.chiplist.com/Intel_Itanium_2_processor/tree3f-section--2235-/ Intel Itanium 2]&lt;br /&gt;
# [http://techreport.com/articles.x/8236/2 MESI-MESI-MOESI Banana-fana...]&lt;br /&gt;
# [http://rsim.cs.illinois.edu/rsim/Manual/node109.html RSIM]&lt;br /&gt;
# [http://dl.acm.org/citation.cfm?id=1499317 Synapse tightly coupled multiprocessors]&lt;/div&gt;</summary>
		<author><name>Pxu</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8a_fu&amp;diff=59796</id>
		<title>CSC/ECE 506 Spring 2012/8a fu</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/8a_fu&amp;diff=59796"/>
		<updated>2012-03-19T01:30:03Z</updated>

		<summary type="html">&lt;p&gt;Pxu: /* MSI Protocol */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Introduction to bus-based cache coherence in real machines=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==SMP Architecture==&lt;br /&gt;
Most parallel software in the commercial market relies on the shared-memory programming model, in which all processors access the same physical address space. The most common multiprocessors today use the [http://en.wikipedia.org/wiki/Symmetric_multiprocessing SMP] architecture, which uses a common bus as the interconnect.  In the case of multicore processors (&amp;quot;chip multiprocessors,&amp;quot; or CMP), the SMP architecture applies to the cores, treating them as separate processors. The key problem of shared-memory multiprocessors is providing a consistent view of memory across the various cache hierarchies.  This is called the '''''cache coherence problem'''''. Solving it is critical for correctness, and it is a performance-sensitive design point in supporting the shared-memory model. The cache coherence mechanisms not only govern communication in a shared-memory multiprocessor, but also typically determine how the memory system transfers data between processors, caches, and memory.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:Busbased SMP.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
At any point in logical time, the permissions for a cache block can allow either a single writer or multiple readers. The '''''coherence protocol''''' ensures the invariants of the states are maintained. The coherence states used by most cache coherence protocols are shown in ''Table 1'':&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''States'''&lt;br /&gt;
|  '''Access Type'''&lt;br /&gt;
|  '''Invariant'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  read, write&lt;br /&gt;
|  all other caches in I state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  read&lt;br /&gt;
|  all other caches in I state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Owned'''&lt;br /&gt;
|  read&lt;br /&gt;
|  all other caches in I or S state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  read&lt;br /&gt;
|  no other cache in M or E state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|  -&lt;br /&gt;
|  -&lt;br /&gt;
|}&lt;br /&gt;
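&lt;br /&gt;
The invariants in the table can be checked mechanically. The Python sketch below is one possible reading of the table, with a hypothetical function name (coherent); it assumes the input lists one state letter (M, O, E, S or I) per cache for the same block.&lt;br /&gt;
&lt;br /&gt;
```python
def coherent(states):
    # states: one letter per cache for the same block, e.g. ['S', 'S', 'I'].
    m_or_e = sum(1 for s in states if s in ('M', 'E'))
    if m_or_e != 0:
        # Modified or Exclusive: every other cache must be Invalid.
        return m_or_e == 1 and states.count('I') == len(states) - 1
    # No M or E copy: at most one Owned copy, the rest Shared or Invalid.
    return states.count('O') in (0, 1)
```
&lt;br /&gt;
For example, a Modified copy next to a Shared copy is rejected, while one Owned copy alongside Shared copies is accepted.&lt;br /&gt;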
&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Snooping Protocols=&lt;br /&gt;
==MSI Protocol==&lt;br /&gt;
&lt;br /&gt;
'''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' is a three-state write-back '''invalidation protocol''' and one of the earliest snooping-based cache coherence protocols. It marks each cache line as being in the '''Modified (M)''', '''Shared (S)''' or '''Invalid (I)''' state. '''Invalid''' means the cache line is either not present or is in the invalid state. If the cache line is clean and may be shared by more than one processor, it is marked '''Shared'''. If the cache line is dirty and the processor has exclusive ownership of it, it is in the '''Modified''' state. A BusRdX causes the other caches to invalidate (demote) their copies to the '''I''' state; a cache that holds the line in the '''M''' state flushes it first. A processor write, even if it hits a line in the '''S''' state, issues a BusRdX and promotes (upgrades) the line to the '''M''' state.&lt;br /&gt;
&lt;br /&gt;
The state diagram of the MSI protocol is shown below. Note that the left part shows the response to processor-side requests, and the right part shows the response to snooper-side requests.&lt;br /&gt;
[[File:MSInew.jpg|600px|thumb|center|MSI state transition diagram]]&lt;br /&gt;
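&lt;br /&gt;
The two halves of the MSI state diagram can also be written down as lookup tables. The Python sketch below uses the usual textbook event names (PrRd, PrWr, BusRd, BusRdX, Flush); the table names and the msi_step helper are assumptions made for this example.&lt;br /&gt;
&lt;br /&gt;
```python
# (current state, processor request) mapped to (next state, bus transaction)
MSI_PROCESSOR = {
    ('I', 'PrRd'): ('S', 'BusRd'),
    ('I', 'PrWr'): ('M', 'BusRdX'),
    ('S', 'PrRd'): ('S', None),
    ('S', 'PrWr'): ('M', 'BusRdX'),   # write hit in S still goes to the bus
    ('M', 'PrRd'): ('M', None),
    ('M', 'PrWr'): ('M', None),
}

# (current state, snooped transaction) mapped to (next state, action)
MSI_SNOOP = {
    ('M', 'BusRd'): ('S', 'Flush'),
    ('M', 'BusRdX'): ('I', 'Flush'),
    ('S', 'BusRd'): ('S', None),
    ('S', 'BusRdX'): ('I', None),
    ('I', 'BusRd'): ('I', None),
    ('I', 'BusRdX'): ('I', None),
}

def msi_step(state, event):
    # Processor events start with 'Pr'; everything else is treated as snooped.
    table = MSI_PROCESSOR if event.startswith('Pr') else MSI_SNOOP
    return table[(state, event)]
```
&lt;br /&gt;
For instance, msi_step('S', 'PrWr') yields ('M', 'BusRdX'), the upgrade described in the paragraph above.&lt;br /&gt;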
&lt;br /&gt;
==Synapse protocol==&lt;br /&gt;
From the state transition diagram of MSI, we observe that there is a transition from state '''M''' to state '''S''' when a BusRd is observed for that block. The contents of the block are flushed to the bus before the cache goes to the '''S''' state. In certain cases it would be more appropriate to move to the '''I''' state, thus giving up the block entirely. The choice between moving to '''S''' or '''I''' reflects the designer's judgment of whether the original processor is more likely to continue reading the block than the new processor is to write to it. The Synapse protocol, used in the early Synapse multiprocessor, made the alternate choice of going directly from the '''M''' state to the '''I''' state on a BusRd, assuming the migratory pattern would be more frequent. More details about this protocol can be found in these papers published in the 1980s: [http://portal.acm.org/citation.cfm?id=6514 Cache Coherence protocols: evaluation using a multiprocessor simulation model] and [http://portal.acm.org/citation.cfm?id=1499317&amp;amp;dl=GUIDE&amp;amp;coll=GUIDE&amp;amp;CFID=83027384&amp;amp;CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems].&lt;br /&gt;
&lt;br /&gt;
In Synapse protocol '''M''' state is called '''D''' (Dirty) state. The following is the state transition diagram for Synapse protocol:&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:Synapse1.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
===Real Architectures using Synapse===&lt;br /&gt;
No information could be found on any actual architecture using the Synapse cache-coherence protocol.  Judging from the abstract of the 1984 paper by Steve Frank and Armond Inselberg, &amp;quot;[http://dl.acm.org/citation.cfm?id=1499317 Synapse Tightly Coupled Multiprocessors: A New Approach to Solve Old Problems]&amp;quot;, Synapse appears to have been theoretical.&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[http://www.disi.unige.it/person/DelzannoG/CacheProtocol/synapse.hy The BABYLON Project] provides one example of a Synapse cache coherence protocol with atomic synchronization actions.&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
In the Synapse system, memory contention is reduced by a processor cache that employs a non-write-through algorithm to minimize bandwidth between cache and shared memory. The Synapse Expansion Bus includes an ownership level protocol between processor caches.[[#References|&amp;lt;sup&amp;gt;[24]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
==[http://en.wikipedia.org/wiki/MESI_protocol MESI]==&lt;br /&gt;
The main drawback of MSI is that each read-write sequence incurs two bus transactions irrespective of whether the cache line is stored in only one cache or not. Highly parallel programs that have little data sharing suffer the most from this.  MESI protocol solves this problem by introducing the '''Exclusive''' state to distinguish between a cache line stored in multiple caches and a line stored in a single cache.&lt;br /&gt;
Let us briefly see how the MESI protocol works. For a more detailed version, refer to the Solihin textbook, pg. 215.&lt;br /&gt;
&lt;br /&gt;
The MESI coherence protocol marks each cache line as being in one of the Modified, Exclusive, Shared, or Invalid states.&lt;br /&gt;
* '''Invalid''' : The cache line is either not present or is invalid&lt;br /&gt;
* '''Exclusive''' : The cache line is clean and is owned by this core/processor only&lt;br /&gt;
* '''Modified''' : The cache line is dirty and the core/processor has exclusive ownership of the cache line, exclusive of the memory also.&lt;br /&gt;
* '''Shared''' : The cache line is clean and is shared by more than one core/processor&lt;br /&gt;
&lt;br /&gt;
The MESI protocol works as follows: &lt;br /&gt;
A line that is fetched receives the '''E''' or '''S''' state, depending on whether it exists in other processors in the system. A cache line gets the '''M''' state when a processor writes to it; if the line is not in the '''E''' or '''M''' state prior to the write, the cache sends a Bus Upgrade (BusUpgr) signal or, as the Intel manuals term it, a “Read-For-Ownership (RFO) request”, which ensures that the line exists in the cache and is in the '''I''' state in all other processors on the bus (if any). The table below summarizes the MESI protocol.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''This cache line is valid?'''&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  No&lt;br /&gt;
|-&lt;br /&gt;
|  '''The memory copy is…'''&lt;br /&gt;
|  out of date&lt;br /&gt;
|  valid&lt;br /&gt;
|  valid&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies exist in caches of other processors?'''&lt;br /&gt;
|  No&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe&lt;br /&gt;
|  Maybe&lt;br /&gt;
|-&lt;br /&gt;
|  '''A write to this line'''&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  goes directly to bus&lt;br /&gt;
|}&lt;br /&gt;
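&lt;br /&gt;
The behavior summarized in the table can be traced with a small Python sketch. It is a simplified illustration: the function name mesi_access is hypothetical, data movement and write-backs are not modeled, and it assumes the common textbook naming in which a write miss issues BusRdX and a write to a Shared line issues BusUpgr.&lt;br /&gt;
&lt;br /&gt;
```python
def mesi_access(states, cpu, op):
    # states: one MESI letter per cache for the same line; op is 'read' or 'write'.
    # Returns the bus transaction issued, or None if the access stays off the bus.
    others = [i for i in range(len(states)) if i != cpu]
    bus = None
    if op == 'read':
        if states[cpu] == 'I':
            bus = 'BusRd'
            sharers = [i for i in others if states[i] != 'I']
            for i in sharers:
                states[i] = 'S'          # M and E holders drop to Shared
            states[cpu] = 'S' if sharers else 'E'
    else:
        if states[cpu] == 'I':
            bus = 'BusRdX'
        elif states[cpu] == 'S':
            bus = 'BusUpgr'
        if bus is not None:
            for i in others:
                states[i] = 'I'          # all other copies are invalidated
        states[cpu] = 'M'
    return bus
```
&lt;br /&gt;
A read followed by a write from the only holder returns None for the write, which is the saving over MSI that the Exclusive state provides.&lt;br /&gt;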
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The transition diagram from the lecture slides is given below for reference.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:MESI.jpg]]&amp;lt;/center&amp;gt; &amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Real Architectures using MESI===&lt;br /&gt;
&lt;br /&gt;
Intel's [http://en.wikipedia.org/wiki/Pentium_Pro Pentium Pro] microprocessor, introduced in 1995, was the first Intel architecture microprocessor to support symmetric multiprocessing in various multiprocessor configurations.  SMP and the MESI protocol were used consistently until the introduction of the 45-nm Hi-k Core microarchitecture in Intel's quad-core x86-64 (Nehalem-EP) processors. The 45-nm Hi-k Intel Core microarchitecture utilizes a new system framework called the [http://en.wikipedia.org/wiki/Intel_QuickPath_Interconnect QuickPath Interconnect], which uses point-to-point interconnection technology based on a distributed shared memory architecture.  It uses a modified version of the MESI protocol called [http://en.wikipedia.org/wiki/MESIF_protocol MESIF], which introduces the additional Forward state.  (The state diagram of the MESI transitions that occur within the Pentium's data cache can be found on page 63 of [http://books.google.com/books?id=TVzjEZg1--YC&amp;amp;printsec=frontcover&amp;amp;source=gbs_ge_summary_r&amp;amp;cad=0#v=onepage&amp;amp;q&amp;amp;f=false Pentium Processor System Architecture: Second Edition] by Don Anderson, Tom Shanley, MindShare, Inc.)&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
The [http://www.arm.com/products/processors/classic/arm11/arm11-mpcore.php ARM11 MPCore] and [http://en.wikipedia.org/wiki/ARM_Cortex-A9_MPCore Cortex-A9 MPCore] processors support the MESI cache coherency protocol.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]  ARM MPCore defines the states of the MESI protocol it implements as:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies in other caches'''&lt;br /&gt;
|  NO&lt;br /&gt;
|  NO&lt;br /&gt;
|  YES&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Clean or Dirty'''&lt;br /&gt;
|  DIRTY&lt;br /&gt;
|  CLEAN&lt;br /&gt;
|  CLEAN&lt;br /&gt;
|  -&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
The coherency protocol is implemented and managed by the Snoop Control Unit (SCU) in the ARM MPCore, which monitors the traffic between the local L1 data caches and the next level of the memory hierarchy. At boot time, each core can choose whether or not to take part in the coherency domain.  Unless explicit system calls bind a task to a specific core ([http://en.wikipedia.org/wiki/Processor_affinity processor affinity]), there is a high chance that a task will at some point migrate to a different core, along with its data as it is used.  A literal MESI implementation does not handle task migration efficiently, so the ARM MPCore offers two optimizations that allow for both MESI compliance and migration of tasks: '''Direct Data Intervention (DDI)''' (in which the SCU keeps a copy of the tag RAMs of all cores' caches, which enables it to efficiently detect whether a cache line requested by one core is held by another core in the coherency domain before looking for it in the next level of the memory hierarchy) and '''Cache-to-cache Migration''' (in which, if the SCU finds that the cache line requested by one CPU is present in another core, it either copies it (if clean) or moves it (if dirty) from the other CPU directly into the requesting one, without interacting with external memory).  These optimizations reduce memory traffic in and out of the L1 cache subsystem by eliminating interaction with external memories, which in effect reduces the overall load on the interconnect and the overall power consumption.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
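&lt;br /&gt;
The two SCU optimizations described above can be pictured with a short Python sketch. It is a conceptual toy only and does not reflect the actual ARM hardware interface: the class ToyScu, its per-core tag dictionaries and the returned strings are all assumptions made for illustration.&lt;br /&gt;
&lt;br /&gt;
```python
class ToyScu:
    def __init__(self, cores):
        # Assumed layout: one dict per core mapping an address to 'clean' or
        # 'dirty', standing in for the SCU copy of that core's L1 tag RAM.
        self.tags = [dict() for _ in range(cores)]

    def request(self, core, addr):
        if addr in self.tags[core]:
            return 'hit'
        # Consult the duplicated tags before going to the next memory level.
        for other, tags in enumerate(self.tags):
            if other != core and addr in tags:
                if tags[addr] == 'dirty':
                    del tags[addr]                     # a dirty line is moved
                    self.tags[core][addr] = 'dirty'
                    return 'moved from core ' + str(other)
                self.tags[core][addr] = 'clean'        # a clean line is copied
                return 'copied from core ' + str(other)
        self.tags[core][addr] = 'clean'
        return 'external memory'
```
&lt;br /&gt;
Only the last return path touches external memory, which is where the traffic and power savings mentioned above would come from.&lt;br /&gt;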
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
Other architectures that use the MESI cache-coherence protocol include the L2 cache of the IBM POWER4 processor[[#References|&amp;lt;sup&amp;gt;[20]&amp;lt;/sup&amp;gt;]], the L2 cache of the Intel Itanium 2 processor[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]], and the Intel Xeon[[#References|&amp;lt;sup&amp;gt;[22]&amp;lt;/sup&amp;gt;]].&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
One implementation complexity already mentioned is the inefficiency of task migration in MESI that the ARM MPCore addresses.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]  Another possible implementation complexity is found during the replacement of a cache line: one possible MESI implementation requires a message to be sent to main memory when a cache line is flushed (i.e. an E to I transition), as the line was exclusively in one cache before it was removed. It is possible to avoid this replacement message if the system is designed so that the flush of a modified (exclusive) line requires an acknowledgment from main memory. However, this requires the flush to be stored in a 'write-back' buffer until the reply arrives (to ensure the change is successfully propagated to memory).[[#References|&amp;lt;sup&amp;gt;[23]&amp;lt;/sup&amp;gt;]]&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==MESIF==&lt;br /&gt;
Let us now walk through a briefing on the '''MESIF protocol''':&lt;br /&gt;
&lt;br /&gt;
The '''MESIF''' protocol, used in the latest Intel multi-core processors, was introduced to '''accommodate the point-to-point''' links used in the QuickPath Interconnect. Using the '''MESI''' protocol in this architecture would send many redundant messages between different processors, often with unnecessarily high latency. For example, when a processor requests a cache line that is stored in multiple locations, every location might respond with the data. As the requesting processor needs only a single copy of the data, the system would be wasting bandwidth.&lt;br /&gt;
As a solution to this problem, an additional state, the '''Forward state''', was added by slightly changing the role of the Shared state. Whenever there is a read request, only the cache holding the line in the F state responds, while all caches holding it in the S state remain silent.  Hence, by designating a single cache to '''respond to requests''', coherency traffic is substantially reduced when multiple copies of the data exist. When a cache line in the '''F''' state is '''copied''', the F state '''migrates''' to the '''newer copy''', while the '''older''' one drops back to '''S'''. Moving the F state to the new copy '''exploits''' both '''temporal and spatial locality'''. Because the newest copy of the cache line is always the one in the F state, it is very unlikely that the line in the F state will be evicted from the caches. This takes advantage of the temporal locality of the request. The second advantage is that if a particular cache line is in high demand due to spatial locality, the bandwidth used to transmit that data is spread across several nodes.&lt;br /&gt;
All transitions that went from M to S or from E to S under MESI now go from '''M to F''' and '''E to F'''.  &lt;br /&gt;
The '''F state''' is '''different''' from the '''Owned state''' of the MOESI protocol as it is '''not''' a unique copy because a valid copy is stored in memory. Thus, unlike the Owned state of the MOESI protocol, in which the data in the O state is the only valid copy of the data, the data in the F state can be evicted or converted to the S state, if desired. &lt;br /&gt;
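&lt;br /&gt;
The forwarding rule described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not Intel's implementation: the function name read_miss and the dictionary of per-cache states are hypothetical, the Modified state is left out, and a cache holding the line in E is assumed to respond in the same way as one holding it in F.&lt;br /&gt;

```python
def read_miss(states, requester):
    """Serve a read miss and return the list of agents that supplied the data."""
    others = {c: s for c, s in states.items() if c != requester and s != "I"}
    # Only the cache holding the line in F (or E, by assumption) responds;
    # caches holding it in S stay silent.
    suppliers = [c for c, s in others.items() if s in ("F", "E")]
    for cache in suppliers:
        states[cache] = "S"      # the older copy drops back to Shared
    # The newest copy takes the Forward role; a sole copy is Exclusive.
    states[requester] = "F" if others else "E"
    return suppliers or ["memory"]


# Four processors read the same line one after another (hypothetical names).
states = {"P0": "I", "P1": "I", "P2": "I", "P3": "I"}
suppliers_per_read = [read_miss(states, p) for p in ("P0", "P1", "P2", "P3")]
print(suppliers_per_read)
print(states)
```

In this run every read is answered by exactly one agent, and the supplier changes from read to read, which is the spreading of bandwidth across nodes mentioned above.&lt;br /&gt;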
&lt;br /&gt;
More information on the QuickPath Interconnect and MESIF protocol can be found at&lt;br /&gt;
'''[http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]'''&lt;br /&gt;
&lt;br /&gt;
==MOESI==&lt;br /&gt;
The [http://en.wikipedia.org/wiki/Opteron AMD Opteron] was AMD’s first-generation dual-core processor, with 2 distinct [http://en.wikipedia.org/wiki/Athlon_64 K8 cores] together on a single die.  Cache coherence becomes a bigger problem on such multiprocessors, so an appropriate coherence protocol was needed. The [http://en.wikipedia.org/wiki/Xeon Intel Xeon], the competing processor from Intel, used the MESI protocol to handle cache coherence.  MESI has the drawback of consuming considerable time and bandwidth in certain situations. &lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/MOESI_protocol MOESI] was AMD’s answer to this problem. MOESI adds a fifth state, '''“Owned”''', to the MESI protocol. It addresses the bandwidth problem that arises in MESI when a processor holding an invalid copy wants to access data that another processor has modified.  Under MESI, the processor seeking access has to wait for the processor that modified the data to write it back to main memory, which takes time and bandwidth. MOESI removes this drawback by allowing dirty sharing.  When the data is held by a processor in the new state '''“Owned”''', it can provide other processors the modified data without, or even before, writing it to the main memory. This is called '''''dirty sharing'''''. The processor holding the data in the '''&amp;quot;Owned&amp;quot;''' state remains responsible for updating the main memory later, when the cache line is evicted.&lt;br /&gt;
&lt;br /&gt;
MOESI has become one of the most popular snoop-based protocols supported in the AMD64 architecture.  The AMD dual-core Opteron can maintain cache coherence in systems of up to 8 processors using this protocol.&lt;br /&gt;
&lt;br /&gt;
The five different states of the MOESI protocol are:&lt;br /&gt;
* '''Modified (M)''' : The cache line holds the most recent copy of the data, and no other processor cache holds a copy.&lt;br /&gt;
* '''Owned (O)'''   : The cache line holds the most recent, correct copy of the data, which may also be shared by other processors. The processor holding the line in this state is responsible for writing the correct value back to main memory when the line is evicted.  &lt;br /&gt;
* '''Exclusive (E)''' : The cache line holds the most recent, correct copy of the data; it is present only in this processor's cache, and main memory holds an identical copy.  &lt;br /&gt;
* '''Shared (S)''' : The cache line holds the most recent, correct copy of the data, which may be shared by other processors. &lt;br /&gt;
* '''Invalid (I)''' : The cache line does not hold a valid copy of the data.&lt;br /&gt;
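&lt;br /&gt;
Dirty sharing can be sketched in a few lines of code. The sketch below is illustrative only and rests on several assumptions: the helper names bus_read and evict, the single cache line, and the write counter kept in the memory dictionary are all hypothetical and are not taken from any AMD documentation.&lt;br /&gt;

```python
def bus_read(states, data, memory, requester):
    """Serve a read by `requester`; return the value it receives."""
    for cache, state in states.items():
        if cache != requester and state in ("M", "O"):
            states[cache] = "O"             # the dirty line is kept, now Owned
            states[requester] = "S"
            data[requester] = data[cache]   # cache-to-cache transfer
            return data[requester]          # main memory is not written
    # No dirty copy exists, so main memory supplies the data.
    sharers = [c for c, s in states.items() if c != requester and s != "I"]
    for cache in sharers:
        states[cache] = "S"
    states[requester] = "S" if sharers else "E"
    data[requester] = memory["value"]
    return data[requester]


def evict(states, data, memory, cache):
    """Evict a line; a dirty (M or O) line is written back first."""
    if states[cache] in ("M", "O"):
        memory["value"] = data[cache]
        memory["writes"] += 1
    states[cache] = "I"


# P0 holds a modified value that main memory has not seen yet.
states = {"P0": "M", "P1": "I", "P2": "I"}
data = {"P0": 42, "P1": None, "P2": None}
memory = {"value": 0, "writes": 0}

first = bus_read(states, data, memory, "P1")
second = bus_read(states, data, memory, "P2")
writes_before_evict = memory["writes"]
evict(states, data, memory, "P0")
print(first, second, writes_before_evict, memory)
```

Both readers receive the modified value directly from P0, and main memory is written only once, when the Owned line is finally evicted.&lt;br /&gt;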
&lt;br /&gt;
A detailed explanation of this protocol implementation on AMD processor can be found in the manual [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of the AMD 64-bit core]&lt;br /&gt;
&lt;br /&gt;
The following table summarizes the MOESI protocol:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Owned'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''This cache line is valid?'''&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  No&lt;br /&gt;
|-&lt;br /&gt;
|  '''The memory copy is…'''&lt;br /&gt;
|  out of date&lt;br /&gt;
|  out of date&lt;br /&gt;
|  valid&lt;br /&gt;
|  valid&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies exist in caches of other processors?'''&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe (in Shared state)&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe&lt;br /&gt;
|  Maybe&lt;br /&gt;
|-&lt;br /&gt;
|  '''A write to this line'''&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  goes directly to bus&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The state transitions for MOESI are shown below: &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:MOESI.jpg|300px|thumb|center|alt text]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;MOESI State Transition Diagram&amp;lt;/center&amp;gt;&lt;br /&gt;
&amp;lt;br /&amp;gt;&lt;br /&gt;
&amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Optimization techniques on MOESI===&lt;br /&gt;
&lt;br /&gt;
In real machines, applying optimization techniques to the standard cache coherence protocol can improve performance. For example, the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] family of microprocessors (Family 10h), which is AMD’s first generation to incorporate 4 distinct cores on a single die and the first to have a cache that all the cores share, uses the MOESI protocol with some optimization techniques incorporated. &lt;br /&gt;
&lt;br /&gt;
One such optimization targets a small subset of compute problems that behave like producer-consumer programs. In such a problem, a thread running on one core produces data, which is consumed by a thread running on a separate core. With such programs, it is desirable to have the two cores communicate through the shared cache, to avoid round trips to and from main memory. The '''MOESI''' protocol that the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] cache uses for cache coherence can itself limit bandwidth in this pattern. Hence, by keeping the cache line in the '''‘M’''' state for such computing problems, we can achieve better performance.&lt;br /&gt;
&lt;br /&gt;
When the producer thread writes a new entry, it allocates cache lines in the '''modified (M)''' state. Eventually, these M-marked cache lines start to fill the L3 cache. When the consumer reads a cache line, the MOESI protocol changes the state of the line to '''owned (O)''' in the L3 cache and pulls down a '''shared (S)''' copy for the consumer's own use. The producer thread then circles the ring buffer and arrives back at the same cache line it had previously written. However, when the producer attempts to write new data to the cache line now marked '''‘O’''', it finds that it cannot, since a cache line marked '''‘O’''' by the previous consumer read does not have sufficient permission for a write request (in the MOESI protocol). To maintain coherence, the memory controller must initiate probes in the other caches (to handle any other S copies that may exist). This slows down the process.&lt;br /&gt;
&lt;br /&gt;
Thus, it is preferable to keep the cache line in the '''‘M’''' state in the L3 cache. In such a situation, when the producer comes back around the ring buffer, it finds the previously written cache line still marked '''‘M’''', to which it is safe to write without coherence concerns. Thus better performance can be achieved by such optimization techniques to standard protocols when implemented in real machines.&lt;br /&gt;
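&lt;br /&gt;
The effect of this optimization can be estimated with a toy model of the ring buffer. The sketch below is an assumption-laden illustration rather than a description of the Phenom hardware: the function count_probes, its parameters, and the one-probe-per-write cost are hypothetical, and the consumer is assumed to read each slot right after the producer writes it.&lt;br /&gt;

```python
def count_probes(lines, laps, keep_modified):
    """Count the probes triggered by producer writes over several laps."""
    state = ["I"] * lines          # L3 state of each ring-buffer slot
    probes = 0
    for _ in range(laps):
        for slot in range(lines):
            if state[slot] == "O":
                probes += 1        # a write to an Owned line must probe first
            state[slot] = "M"      # the producer writes the slot
            if not keep_modified:
                state[slot] = "O"  # the consumer read demotes the line to Owned
    return probes


standard = count_probes(lines=4, laps=3, keep_modified=False)
optimized = count_probes(lines=4, laps=3, keep_modified=True)
print(standard, optimized)
```

With 4 slots and 3 laps, the standard behaviour pays one probe per slot on every lap after the first, while keeping the lines in the M state avoids the probes entirely in this model.&lt;br /&gt;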
&lt;br /&gt;
You can find more information on how this is implemented and various other ways of optimizations in this manual [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors]&lt;br /&gt;
&lt;br /&gt;
==Dragon Protocol==&lt;br /&gt;
The '''[http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]''' is an update-based coherence protocol: unlike the coherence protocols we have seen so far, it does not invalidate other cached copies. Write propagation is achieved by updating the cached copies instead of invalidating them.  The Dragon Protocol does not update memory on a cache-to-cache transfer; it delays making memory consistent with the caches until the data is evicted and written back, which saves time and lowers the memory access requirements. Moreover, only the written '''byte''' or '''word''' is '''communicated''' to the '''other caches''' instead of the whole block, which further '''reduces''' '''bandwidth''' usage. It can dynamically detect the sharing status of a block, and it uses a write-through policy for shared blocks and a write-back policy for currently non-shared blocks. The Dragon Protocol employs the following four states for the cache blocks: '''Shared Clean''', '''Shared Modified''', '''Exclusive''' and  '''Modified'''. &lt;br /&gt;
* '''Modified (M)''' and '''Exclusive (E)''' - these states have the same meaning as explained in the protocols above. &lt;br /&gt;
* '''Shared Modified (Sm)''' - Two or more caches potentially have this block, memory may or may not be up to date, and this processor's cache has modified the block. Only one cache in the system can hold the block in the Shared Modified state.&lt;br /&gt;
* '''Shared Clean (Sc)''' -  Two or more caches potentially have this block, and memory may or may not be up to date (memory is up to date only if no other cache holds the block in the Sm state).&lt;br /&gt;
Only when a Shared Modified line is evicted from the cache on a cache miss is the block written back to main memory, in order to keep memory consistent. For more information on the Dragon protocol, refer to the Solihin textbook, page 229. The state transition diagram is given below for reference.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:Dragon.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The Dragon system implemented snoopy caches that provided the appearance of a uniform memory space to multiple processors. Each cache listens to 2 buses: the processor bus and the memory bus. The caches are also responsible for address translation, so the processor bus carries virtual addresses and the memory bus carries physical addresses.  The Dragon system was designed to support 4 to 8 Dragon processors. &lt;br /&gt;
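&lt;br /&gt;
The update-based write propagation of the Dragon protocol can be sketched as follows. This is a simplified illustration, not the Xerox design: the function write_hit and the dictionary layout are hypothetical, only write hits are modelled, and a cache that does not hold the block is represented by None.&lt;br /&gt;

```python
def write_hit(caches, writer, index, value):
    """Write one word of a block the writer already caches.

    Returns the number of other caches that received the word update.
    """
    block = caches[writer]
    block["words"][index] = value
    sharers = [c for c, b in caches.items() if c != writer and b is not None]
    if not sharers:
        block["state"] = "M"       # sole copy: ordinary write-back behaviour
        return 0
    for cache in sharers:
        # Only the written word is sent; the copy is updated, not invalidated.
        caches[cache]["words"][index] = value
        caches[cache]["state"] = "Sc"
    block["state"] = "Sm"          # the writer becomes responsible for the block
    return len(sharers)


shared = {
    "P0": {"state": "Sc", "words": [1, 2, 3, 4]},
    "P1": {"state": "Sm", "words": [1, 2, 3, 4]},
    "P2": None,
}
updates = write_hit(shared, "P0", 2, 99)

private = {"P0": {"state": "E", "words": [5, 6]}, "P1": None}
no_updates = write_hit(private, "P0", 0, 7)
print(updates, no_updates)
```

In the shared case the writer ends in Sm and the former Sm holder drops to Sc while keeping a valid, updated copy; in the private case no bus update is needed and the block simply becomes Modified.&lt;br /&gt;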
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=[http://en.wikipedia.org/wiki/Instruction_prefetch Instruction Prefetching]=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Instruction prefetching is a technique used to speed up the execution of a program. In multiprocessors, however, prefetching comes at a cost: prefetched instructions or data can become stale in ways that the memory coherence protocol is not able to handle. In such situations software must use serializing instructions or cache-invalidation instructions to guarantee that subsequent data accesses are coherent. &lt;br /&gt;
&amp;lt;br /&amp;gt;&lt;br /&gt;
An example of this type of a situation is a page-table update followed by accesses to the physical pages referenced by the updated page tables. The physical-memory references for the page tables are different than the physical-memory references for the data. Because of prefetching there may be problems with correctness. The following sequence of events shows such a situation when software changes the translation of virtual-page A from physical-page M to physical-page N:&lt;br /&gt;
# The tables that translate virtual-page A to physical-page M are now held only in main memory. The copies in the cache are invalidated.&lt;br /&gt;
# Software changes the page-table entry for virtual-page A in main memory to point to physical-page N rather than physical-page M.&lt;br /&gt;
# Data in virtual-page A is accessed.&lt;br /&gt;
&lt;br /&gt;
Software expects the processor to access the data from physical-page N after the update. However, it is possible for the processor to prefetch the data from physical-page M before the page table for virtual page A is updated. Because the physical-memory references are different, the processor does not recognize them as requiring coherence checking and believes it is safe to prefetch the data from virtual-page A, which is translated into a read from physical page M. Similar behavior can occur when instructions are prefetched from beyond the page-table update instruction.&lt;br /&gt;
&lt;br /&gt;
To prevent such errors, software must execute a special instruction immediately after the page-table update (the AMD64 manual names INVLPG and MOV CR3 for this purpose) to ensure that subsequent instruction fetches and data accesses use the correct virtual-page-to-physical-page translation. It is not necessary to perform a TLB invalidation operation preceding the table update.&lt;br /&gt;
&lt;br /&gt;
More information can be found about this in [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual]&lt;br /&gt;
&lt;br /&gt;
= CMP Implementation in Intel Architecture =&lt;br /&gt;
&lt;br /&gt;
Let us now see how Intel architecture using the MESI protocol progressed from a uniprocessor architecture to a Chip MultiProcessor (CMP) using the bus as the interconnect. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Uniprocessor Architecture'''&lt;br /&gt;
&lt;br /&gt;
The diagram below shows the structure of the memory cluster in Intel Pentium M processor.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:intel_cache1.jpg]]&amp;lt;/center&amp;gt; &amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In this structure we have,&lt;br /&gt;
* A unified on-chip '''L1 cache''' with the '''processor/core''',&lt;br /&gt;
* A '''Memory/L2 access control unit''', through which all the accesses to the L2 cache, main memory and IO space are made,&lt;br /&gt;
* The second level '''L2 cache''' along with the '''prefetch unit''' and&lt;br /&gt;
* '''Front side bus (FSB)''', a single shared bi-directional bus through which all the traffic is sent. This wide bus brings in multiple data bytes at a time. &lt;br /&gt;
&lt;br /&gt;
As Intel explains it, using this structure, the processor requests were first sought in the '''L2 cache''' and only on a '''miss''' were they '''forwarded''' to the main '''memory''' via the front side bus ('''FSB'''). The '''Memory/L2 access control''' unit served as a central point for '''maintaining coherence''' within the core and with the external world. It '''contains''' a '''snoop control unit''' that receives snoop requests from the bus and performs the required operations on each cache (and internal buffers) in parallel. It also handles RFO requests (BusUpgr) and ensures the operation continues only after it guarantees that no other version of the cache line exists in any other cache in the system.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''CMP Architecture'''&lt;br /&gt;
&lt;br /&gt;
For the CMP implementation, Intel chose a bus-based architecture using snoopy protocols over a '''directory protocol'''. Although a directory protocol reduces active power because of reduced snoop activity, it '''increases''' '''design complexity''' and '''static power''' because of its larger tag arrays. Since Intel has a large market for processors in the mobility family, a directory-based solution was less favorable: battery life depends mainly on static power consumption and less on dynamic power.&lt;br /&gt;
Let us examine how '''CMP''' was implemented in the '''Intel Core Duo''', which was one of the first dual-core processors for the budget/entry-level market. &lt;br /&gt;
The general CMP implementation structure of the Intel Core Duo is shown below.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:intel_cache2.jpg]]&amp;lt;/center&amp;gt; &amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This structure has the following changes when compared to the uniprocessor memory cluster structure. &lt;br /&gt;
* '''L1 cache''' and the '''processor/core''' structure is '''duplicated''' to give 2 cores.&lt;br /&gt;
* The '''Memory/L2 access control''' unit is '''split''' into 2 logical units: '''L2 controller''' and '''bus controller'''. The L2 controller handles all '''requests to the L2''' cache from the core and the snoop requests from the FSB. The '''bus controller''' handles '''data and I/O requests''' to and from the FSB.&lt;br /&gt;
* The '''prefetching''' unit is extended to handle the hardware '''prefetches for each core separately'''.&lt;br /&gt;
* A '''new logical unit''' (represented by the hexagon) was added to maintain '''fairness between the requests''' coming from the different cores and hence balance the requests to L2 and memory.&lt;br /&gt;
&lt;br /&gt;
This new '''partitioned structure''' for the  memory/L2 access control unit '''enhanced''' the '''performance''' while '''reducing power consumption'''. &lt;br /&gt;
For more information on uniprocessor and multiprocessor implementation under the Intel architecture, refer to &lt;br /&gt;
[http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]&lt;br /&gt;
&lt;br /&gt;
The '''Intel bus architecture''' has been '''evolving''' in order to accommodate the demands of scalability while using the same MESI protocol: from a '''single shared bus''', to '''dual independent buses (DIB)''' that double the available bandwidth, and then to the logical conclusion of DIB, '''dedicated high-speed interconnects (DHSI)'''. The DHSI-based platforms use four FSBs, one for each processor in the platform. In both DIB and DHSI, a snoop filter in the chipset caches snoop information, thereby significantly reducing the broadcasting needed for the snoop traffic on the buses. With the production of processors based on the next-generation 45-nm Hi-k Intel Core microarchitecture, the [http://en.wikipedia.org/wiki/Xeon Intel Xeon] processor fabric will transition from DHSI, with the memory controller in the chipset, to a distributed shared memory architecture using '''Intel QuickPath Interconnects with the MESIF protocol'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Word Invalidation=&lt;br /&gt;
One complexity problem applying to a number of the protocols deals with invalidation.  In newer protocols, individual words may be modified in a cache line, as opposed to the entirety of the line.  The other processors will thus have a mostly correct cache line, with only a word difference.  This leads to potential complexity, because to 'correct' the error the other processors must transition from shared to invalid, where they then can read the correct word from the bus and place it back into the line, transitioning back into shared.  This transition is potentially unnecessary, as the second processor may never access the specific word changed, but it may access other words in the cache line.  One potential solution to this is being researched at the present time, advancing the protocols so that a cache line is &amp;quot;not invalidated on the first dirty word, but after the number of dirty words crosses some predetermined value, which is data type and application dependent.&amp;quot;  In other words, if the application can possibly have multiple words 'incorrect,' several transitions to and from the invalid state may be avoided.&lt;br /&gt;
&amp;lt;br/&amp;gt;Source: http://tab.computer.org/tcca/NEWS/sept96/dsmideas.ps&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In essence, the solution proposed here is to advance the MOESI protocol with word invalidation and specific treatment of temporal and spatial data, so that the block is not invalidated.&lt;br /&gt;
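&lt;br /&gt;
The threshold idea quoted above can be sketched as follows. This is a hypothetical model written for illustration, not the scheme from the cited source: the class SharedLine, its method names, and the fixed threshold of 2 are assumptions.&lt;br /&gt;

```python
class SharedLine:
    """A shared cache line that tolerates a limited number of stale words."""

    def __init__(self, threshold):
        self.threshold = threshold   # stale words tolerated before invalidation
        self.stale_words = set()     # indices modified by other processors
        self.state = "S"

    def remote_write(self, index):
        """Record that another processor modified word `index`."""
        if self.state == "I":
            return
        self.stale_words.add(index)
        # The stale count grows by at most one per call, so equality
        # detects the moment the count crosses the threshold.
        if len(self.stale_words) == self.threshold + 1:
            self.state = "I"

    def can_read_locally(self, index):
        """True if the word can be served without a bus transaction."""
        return self.state == "S" and index not in self.stale_words


line = SharedLine(threshold=2)
line.remote_write(0)
still_usable = line.can_read_locally(3)   # an untouched word remains readable
stale_word = line.can_read_locally(0)     # the modified word does not
line.remote_write(0)                      # the same word again: count unchanged
line.remote_write(5)
state_at_threshold = line.state
line.remote_write(6)                      # a third distinct word crosses the threshold
print(still_usable, stale_word, state_at_threshold, line.state)
```

The line stays in the shared state while one or two words are stale, so reads of the untouched words avoid the round trip through the invalid state; only the third distinct dirty word invalidates the whole line.&lt;br /&gt;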
&lt;br /&gt;
=Performance: MOESI vs MEI/MESI=&lt;br /&gt;
Early multiprocessors (such as the PowerPC processors) were designed to work with three states: Modified, Exclusive, and Invalid. This is similar to the MSI protocol discussed earlier; the term MEI is used here to remain consistent with the sources used for reference.  As time progressed, more multiprocessors transitioned to the MESI protocol.  This is most likely due to the characteristics of MEI: it is &amp;quot;easy to implement but leads to inefficiencies in the way that memory bus bandwidth is used&amp;quot; [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 source].  As a result, systems with two or more MEI processors will most likely not see the full potential of the extra processors, due to the inefficient bus transactions.&lt;br /&gt;
&lt;br /&gt;
In contrast, the five-state protocol MOESI is more complex to implement than MESI and MEI/MSI.  However, advantages arise from using it.  As Andy Keane (VP of Marketing for PMC-Sierra) put it, &amp;quot;this [fifth] state allows shared data that is dirty to remain in the cache.  Without this state, any shared line would be written back to memory to change the original state from modified. Since we have a dedicated, fast path from CPU to CPU, this state maximizes the use of this path rather than the path to memory, which is inherently slower&amp;quot; [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 source].  This advantage can be realized by letting MOESI make some processor-to-processor transfers from the level 2 caches.&lt;br /&gt;
&lt;br /&gt;
Source: [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 Edwards, Chris (08/06/2001). &amp;quot;Architects wrestle with multiprocessor options&amp;quot;. Electronic engineering times (0192-1541), (1178), p. 48.]&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Cache_coherence Cache coherence]&lt;br /&gt;
# [http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]&lt;br /&gt;
# [http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Symmetric_multiprocessing Common System Interface in Intel Processors]&lt;br /&gt;
# [http://www.zak.ict.pwr.wroc.pl/nikodem/ak_materialy/Cache%20consistency%20&amp;amp;%20MESI.pdf Cache consistency with MESI on Intel processor]&lt;br /&gt;
# [http://techreport.com/articles.x/8236/2 AMD dual core Architecture]&lt;br /&gt;
# [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual]&lt;br /&gt;
# [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors]&lt;br /&gt;
# [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of AMD 64 bit core]&lt;br /&gt;
# [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&amp;amp;arnumber=4913 Silicon Graphics Computer Systems]&lt;br /&gt;
# [http://books.google.com/books?id=g82fofiqa5IC&amp;amp;printsec=frontcover&amp;amp;dq=Parallel+computer+architecture:+a+hardware/software+approach+By+David+E.+Culler,+Jaswinder+Pal+Singh,+Anoop+Gupta&amp;amp;source=bl&amp;amp;ots=COrdamlfVn&amp;amp;sig=YcugVqbzTjHvlofvaFq6Ft_tjfY&amp;amp;hl=en&amp;amp;ei=0ZO6S4TJGcOclgejzI3BBw&amp;amp;sa=X&amp;amp;oi=book_result&amp;amp;ct=result&amp;amp;resnum=1&amp;amp;ved=0CAgQ6AEwAA#v=onepage&amp;amp;q=&amp;amp;f=false Parallel computer architecture: a hardware/software approach By David E. Culler, Jaswinder Pal Singh, Anoop Gupta]&lt;br /&gt;
# [http://www.freepatentsonline.com/5283886.html Three state invalidation protocols]&lt;br /&gt;
# [http://portal.acm.org/citation.cfm?id=1499317&amp;amp;dl=GUIDE&amp;amp;coll=GUIDE&amp;amp;CFID=83027384&amp;amp;CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems]&lt;br /&gt;
# [http://portal.acm.org/citation.cfm?id=6514 Cache Coherence protocols: evaluation using a multiprocessor simulation model]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Xerox_Dragon Xerox Dragon]&lt;br /&gt;
# [http://thanaseto.110mb.com/courses/CSD-527-report-engl.pdf Coherence Protocols]&lt;br /&gt;
# [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&amp;amp;arnumber=289691 XDBus]&lt;br /&gt;
# [http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dai0228a/index.html ARM]&lt;br /&gt;
# [http://ixbtlabs.com/articles/ibmpower4/index.html IBM POWER4]&lt;br /&gt;
# [http://www.chiplist.com/Intel_Itanium_2_processor/tree3f-section--2235-/ Intel Itanium 2]&lt;br /&gt;
# [http://techreport.com/articles.x/8236/2 MESI-MESI-MOESI Banana-fana...]&lt;br /&gt;
# [http://rsim.cs.illinois.edu/rsim/Manual/node109.html RSIM]&lt;br /&gt;
# [http://dl.acm.org/citation.cfm?id=1499317 Synapse tightly coupled multiprocessors]&lt;/div&gt;</summary>
		<author><name>Pxu</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2010/8a_fu&amp;diff=59795</id>
		<title>CSC/ECE 506 Spring 2010/8a fu</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2010/8a_fu&amp;diff=59795"/>
		<updated>2012-03-19T01:28:35Z</updated>

		<summary type="html">&lt;p&gt;Pxu: /* MSI Protocol */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Introduction to bus-based cache coherence in real machines=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==SMP Architecture==&lt;br /&gt;
Most parallel software in the commercial market relies on the shared-memory programming model, in which all processors access the same physical address space. The most common multiprocessors today use the [http://en.wikipedia.org/wiki/Symmetric_multiprocessing SMP] architecture, which uses a common bus as the interconnect.  In the case of multicore processors (&amp;quot;chip multiprocessors,&amp;quot; or CMP), the SMP architecture applies to the cores, treating them as separate processors. The key problem of shared-memory multiprocessors is providing a consistent view of memory with various cache hierarchies.  This is called the '''''cache coherence problem'''''. Solving it is critical for correctness, and it is a performance-sensitive design point in supporting the shared-memory model. The cache coherence mechanisms not only govern communication in a shared-memory multiprocessor, but also typically determine how the memory system transfers data between processors, caches, and memory.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:Busbased SMP.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
At any point in logical time, the permissions for a cache block can allow either a single writer or multiple readers. The '''''coherence protocol''''' ensures the invariants of the states are maintained. The different coherent states used by most of the cache coherence protocols are as shown in ''Table 1'':&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''States'''&lt;br /&gt;
|  '''Access Type'''&lt;br /&gt;
|  '''Invariant'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  read, write&lt;br /&gt;
|  all other caches in I state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  read&lt;br /&gt;
|  all other caches in I state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Owned'''&lt;br /&gt;
|  read&lt;br /&gt;
|  all other caches in I or S state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  read&lt;br /&gt;
|  no other cache in M or E state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|  -&lt;br /&gt;
|  -&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Snooping Protocols=&lt;br /&gt;
==MSI Protocol==&lt;br /&gt;
&lt;br /&gt;
'''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' is a three-state write-back '''invalidation protocol''' and one of the earliest snooping-based cache coherence protocols. It marks each cache line as being in the '''Modified (M)''', '''Shared (S)''' or '''Invalid (I)''' state. '''Invalid''' means the cache line is either not present or is in the invalid state. If the cache line is clean and may be shared by more than one processor, it is marked '''Shared'''. If the cache line is dirty and the processor has exclusive ownership of it, it is in the '''Modified''' state. A BusRdx causes other caches to invalidate (demote) their copies to the '''I''' state; a cache that holds the line in the '''M''' state flushes it first. A processor write that hits a line in the '''S''' state also issues a BusRdx, and the line is promoted (upgraded) to the '''M''' state.&lt;br /&gt;
&lt;br /&gt;
The state diagram of the MSI protocol is shown below. Note that the left part shows the response to processor-side requests, and the right part shows the response to snooper-side requests.&lt;br /&gt;
[[File:MSInew.jpg|600px|thumb|center|MSI state transition diagram]]&lt;br /&gt;
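&lt;br /&gt;
The MSI transitions described above can also be written down as a lookup table. The sketch below is one possible encoding, assuming the usual textbook event names (PrRd and PrWr for processor requests, BusRd and BusRdx for snooped requests); the names MSI, step and run are hypothetical.&lt;br /&gt;

```python
# (current state, event) maps to (next state, bus action or None)
MSI = {
    ("I", "PrRd"): ("S", "BusRd"),
    ("I", "PrWr"): ("M", "BusRdx"),
    ("I", "BusRd"): ("I", None),
    ("I", "BusRdx"): ("I", None),
    ("S", "PrRd"): ("S", None),
    ("S", "PrWr"): ("M", "BusRdx"),   # upgrade: other copies get invalidated
    ("S", "BusRd"): ("S", None),
    ("S", "BusRdx"): ("I", None),
    ("M", "PrRd"): ("M", None),
    ("M", "PrWr"): ("M", None),
    ("M", "BusRd"): ("S", "Flush"),   # supply the dirty block, keep a clean copy
    ("M", "BusRdx"): ("I", "Flush"),
}


def step(state, event):
    """Return (next_state, bus_action) for one cache line."""
    return MSI[(state, event)]


def run(events, state="I"):
    """Apply a sequence of events; return the final state and bus actions."""
    actions = []
    for event in events:
        state, action = step(state, event)
        if action:
            actions.append(action)
    return state, actions


read_then_write = run(["PrRd", "PrWr"])
write_then_snooped_read = run(["PrWr", "BusRd"])
print(read_then_write, write_then_snooped_read)
```

A read followed by a write to the same line generates two bus transactions in this table even when no other cache holds the line.&lt;br /&gt;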
&lt;br /&gt;
==Synapse protocol==&lt;br /&gt;
From the state transition diagram of MSI, we observe that there is a transition from state '''M''' to state '''S''' when a BusRd is observed for that block. The contents of the block are flushed to the bus before the transition to the '''S''' state. In certain cases it would be more appropriate to move to the '''I''' state, giving up the block entirely. The choice between moving to '''S''' or '''I''' reflects the designer's assertion about whether the original processor is more likely to continue reading the block than the new processor is to write to it. The Synapse protocol, used in the early Synapse multiprocessor, made the alternate choice of going directly from the '''M''' state to the '''I''' state on a BusRd, assuming the migratory pattern would be more frequent. More details about this protocol can be found in these papers published in the 1980s: [http://portal.acm.org/citation.cfm?id=6514 Cache Coherence protocols: evaluation using a multiprocessor simulation model] and [http://portal.acm.org/citation.cfm?id=1499317&amp;amp;dl=GUIDE&amp;amp;coll=GUIDE&amp;amp;CFID=83027384&amp;amp;CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems]&lt;br /&gt;
&lt;br /&gt;
In Synapse protocol '''M''' state is called '''D''' (Dirty) state. The following is the state transition diagram for Synapse protocol:&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:Synapse1.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
===Real Architectures using Synapse===&lt;br /&gt;
Nothing can be found on any actual architecture using the Synapse cache-coherence protocol.  Judging from the abstract of the paper written by Steve Frank and Armond Inselberg in 1984, &amp;quot;[http://dl.acm.org/citation.cfm?id=1499317 Synapse Tightly Coupled Multiprocessors: A New Approach to Solve Old Problems]&amp;quot;, Synapse was theoretical.&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The [http://www.disi.unige.it/person/DelzannoG/CacheProtocol/synapse.hy BABYLON Project] provides one example of a Synapse cache coherence protocol with atomic synchronization actions.&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
In the Synapse system, memory contention is reduced by designing the processor cache around a non-write-through algorithm to minimize bandwidth between cache and shared memory. The Synapse Expansion Bus includes an ownership level protocol between processor caches.[[#References|&amp;lt;sup&amp;gt;[24]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
==[http://en.wikipedia.org/wiki/MESI_protocol MESI]==&lt;br /&gt;
The main drawback of MSI is that each read-write sequence incurs two bus transactions irrespective of whether the cache line is stored in only one cache or not. Highly parallel programs that have little data sharing suffer the most from this.  MESI protocol solves this problem by introducing the '''Exclusive''' state to distinguish between a cache line stored in multiple caches and a line stored in a single cache.&lt;br /&gt;
Let us briefly see how the MESI protocol works. For a more detailed version, refer to the Solihin textbook, pg. 215.&lt;br /&gt;
&lt;br /&gt;
The MESI coherence protocol marks each cache line as being in one of the Modified, Exclusive, Shared, or Invalid states. &lt;br /&gt;
* '''Invalid''' : The cache line is either not present or is invalid&lt;br /&gt;
* '''Exclusive''' : The cache line is clean and is owned by this core/processor only&lt;br /&gt;
* '''Modified''' : The cache line is dirty and the core/processor has exclusive ownership of it; the copy in memory is stale.&lt;br /&gt;
* '''Shared''' : The cache line is clean and is shared by more than one core/processor&lt;br /&gt;
&lt;br /&gt;
The MESI protocol works as follows: &lt;br /&gt;
A line that is fetched receives the '''E''' or '''S''' state, depending on whether it exists in other processors in the system. A cache line gets the '''M''' state when a processor writes to it; if the line is not in the '''E''' or '''M''' state prior to the write, the cache sends a Bus Upgrade (BusUpgr) signal or, as the Intel manuals term it, a “Read-For-Ownership (RFO) request”, which ensures that the line exists in the cache and is in the '''I''' state in all other processors on the bus (if any). The table below summarizes the MESI protocol.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''This cache line is valid?'''&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  No&lt;br /&gt;
|-&lt;br /&gt;
|  '''The memory copy is…'''&lt;br /&gt;
|  out of date&lt;br /&gt;
|  valid&lt;br /&gt;
|  valid&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies exist in caches of other processors?'''&lt;br /&gt;
|  No&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe&lt;br /&gt;
|  Maybe&lt;br /&gt;
|-&lt;br /&gt;
|  '''A write to this line'''&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  goes directly to bus&lt;br /&gt;
|}&lt;br /&gt;
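&lt;br /&gt;
The rules summarized in the table can also be written as a small state machine. The following Python sketch is a simplified illustration under stated assumptions, not Intel's implementation: the function names, the event names (PrRd, PrWr, BusRd, BusRdX) and the others_have_copy flag are hypothetical, and a write from the '''I''' state is modeled as a separate BusRdX because the cache also needs the data, which is a common textbook convention rather than something the text above prescribes.&lt;br /&gt;

```python
# Toy MESI model for one cache line. Event names and the others_have_copy
# flag are illustrative assumptions, not a real hardware interface.

def processor_access(state, op, others_have_copy=False):
    """Return (next_state, bus_transaction) for a local read or write."""
    if op == "PrRd":
        if state == "I":
            # A fetched line becomes E if no other cache holds it, otherwise S.
            return ("S" if others_have_copy else "E"), "BusRd"
        return state, None  # read hit in M, E or S needs no bus traffic
    if op == "PrWr":
        if state in ("M", "E"):
            return "M", None  # silent upgrade: the saving MESI has over MSI
        # From S only ownership is needed (BusUpgr); from I the data is needed too.
        return "M", ("BusUpgr" if state == "S" else "BusRdX")
    raise ValueError("unknown operation: " + op)


def snoop(state, bus_event):
    """Return (next_state, flush) when another cache's transaction is observed."""
    if state == "I":
        return "I", False
    if bus_event == "BusRd":
        # Only a Modified line must supply data; M, E and S all end up Shared.
        return "S", state == "M"
    # BusRdX or BusUpgr: another cache is about to write, so drop the line.
    return "I", state == "M"
```

A read followed by a write to a line that no other cache holds goes from '''I''' to '''E''' with one BusRd and then from '''E''' to '''M''' with no bus transaction, which is the case MESI was introduced to optimize.&lt;br /&gt;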
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The transition diagram from the lecture slides is given below for reference.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:MESI.jpg]]&amp;lt;/center&amp;gt; &amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Real Architectures using MESI===&lt;br /&gt;
&lt;br /&gt;
Intel's [http://en.wikipedia.org/wiki/Pentium_Pro Pentium Pro] microprocessor, introduced in 1995, was the first Intel architecture microprocessor to support symmetric multiprocessing in various multiprocessor configurations.  SMP with the MESI protocol was used consistently until the introduction of the 45-nm Hi-k Intel Core microarchitecture in Intel's quad-core x86-64 Nehalem-EP. This microarchitecture uses a new system framework called the [http://en.wikipedia.org/wiki/Intel_QuickPath_Interconnect QuickPath Interconnect], a point-to-point interconnection technology based on a distributed shared memory architecture.  It uses a modified version of the MESI protocol called [http://en.wikipedia.org/wiki/MESIF_protocol MESIF], which introduces the additional Forward state.  (The state diagram of the MESI transitions that occur within the Pentium's data cache can be found on page 63 of [http://books.google.com/books?id=TVzjEZg1--YC&amp;amp;printsec=frontcover&amp;amp;source=gbs_ge_summary_r&amp;amp;cad=0#v=onepage&amp;amp;q&amp;amp;f=false Pentium Processor System Architecture: Second Edition] by Don Anderson, Tom Shanley, MindShare, Inc.)&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
The [http://www.arm.com/products/processors/classic/arm11/arm11-mpcore.php ARM11 MPCore] and [http://en.wikipedia.org/wiki/ARM_Cortex-A9_MPCore Cortex-A9 MPCore] processors support the MESI cache coherency protocol.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]  ARM MPCore defines the states of the MESI protocol it implements as:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies in other caches'''&lt;br /&gt;
|  NO&lt;br /&gt;
|  NO&lt;br /&gt;
|  YES&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Clean or Dirty'''&lt;br /&gt;
|  DIRTY&lt;br /&gt;
|  CLEAN&lt;br /&gt;
|  CLEAN&lt;br /&gt;
|  -&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
The coherency protocol is implemented and managed by the Snoop Control Unit (SCU) in the ARM MPCore, which monitors the traffic between the local L1 data caches and the next level of the memory hierarchy. At boot time, each core can choose whether or not to take part in the coherency domain.  Unless explicit system calls bind a task to a specific core ([http://en.wikipedia.org/wiki/Processor_affinity processor affinity]), there is a high chance that the task will at some point migrate to a different core, along with its data as it is used.  Task migration is not handled efficiently by a literal MESI implementation, so the ARM MPCore offers two optimizations that allow for MESI compliance and migration of tasks: '''Direct Data Intervention (DDI)''', in which the SCU keeps a copy of the tag RAMs of all cores’ caches, enabling it to efficiently detect whether a cache line requested by one core is held by another core in the coherency domain before looking for it in the next level of the memory hierarchy; and '''Cache-to-cache Migration''', in which, if the SCU finds that the cache line requested by one CPU is present in another core, it either copies it (if clean) or moves it (if dirty) from the other CPU directly into the requesting one, without interacting with external memory.  These optimizations reduce memory traffic in and out of the L1 cache subsystem by eliminating interaction with external memories, which in effect reduces the overall load on the interconnect and the overall power consumption.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
Other architectures that use the MESI cache-coherence protocol include the L2 cache of the IBM POWER4 processor[[#References|&amp;lt;sup&amp;gt;[20]&amp;lt;/sup&amp;gt;]], the L2 cache of the Intel Itanium 2 processor[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]], and the Intel Xeon[[#References|&amp;lt;sup&amp;gt;[22]&amp;lt;/sup&amp;gt;]].&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
One implementation complexity already mentioned is the inefficiency of task migration in MESI that the ARM MPCore addresses.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]  Another possible implementation complexity is found during the replacement of a cache line: one possible MESI implementation requires a message to be sent to main memory when a cache line is flushed (i.e. an E to I transition), as the line was exclusively in one cache before it was removed. It is possible to avoid this replacement message if the system is designed so that the flush of a modified (exclusive) line requires an acknowledgment from main memory. However, this requires the flush to be stored in a 'write-back' buffer until the reply arrives (to ensure the change is successfully propagated to memory).[[#References|&amp;lt;sup&amp;gt;[23]&amp;lt;/sup&amp;gt;]]&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==MESIF==&lt;br /&gt;
Let us now walk briefly through the '''MESIF protocol''':&lt;br /&gt;
&lt;br /&gt;
The '''MESIF''' protocol, used in the latest Intel multi-core processors, was introduced to '''accommodate the point-to-point''' links used in the QuickPath Interconnect. Using the '''MESI''' protocol in this architecture would send many redundant messages between different processors, often with unnecessarily high latency. For example, when a processor requests a cache line that is stored in multiple locations, every location might respond with the data. As the requesting processor only needs a single copy of the data, the system would be wasting bandwidth. &lt;br /&gt;
As a solution to this problem, an additional state, the '''Forward state''', was added by slightly changing the role of the Shared state. Whenever there is a read request, only the cache line in the F state responds to the request, while all the S state caches remain dormant.  Hence, by designating a single cache line to '''respond to requests''', coherency traffic is substantially reduced when multiple copies of the data exist. Also, on a read request, the responding line transitions from the F state to the S state. That is, when a cache line in the '''F''' state is '''copied''', the F state '''migrates''' to the '''newer copy''', while the '''older''' one drops back to '''S'''. Moving the new copy to the F state '''exploits''' both '''temporal and spatial locality'''. Because the newest copy of the cache line is always in the F state, it is very unlikely that the line in the F state will be evicted from the caches. This takes advantage of the temporal locality of the request. The second advantage is that if a particular cache line is in high demand due to spatial locality, the bandwidth used to transmit that data will be spread across several nodes.&lt;br /&gt;
All M to S and E to S state transitions will now be from '''M to F''' and '''E to F'''.  &lt;br /&gt;
The '''F state''' is '''different''' from the '''Owned state''' of the MOESI protocol as it is '''not''' a unique copy because a valid copy is stored in memory. Thus, unlike the Owned state of the MOESI protocol, in which the data in the O state is the only valid copy of the data, the data in the F state can be evicted or converted to the S state, if desired. &lt;br /&gt;
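&lt;br /&gt;
The migration of the F state described above can be sketched as follows. This is a minimal Python illustration under stated assumptions: the function name, the cache names and the dictionary layout are hypothetical, the sketch covers only lines held in the F, S or I states, and it assumes that a requester served by memory also takes the F state.&lt;br /&gt;

```python
# Toy MESIF read handling for one cache line. Cache names and the dict layout
# are illustrative assumptions; M and E holders are outside this sketch.

def read_miss(states, requester):
    """Serve a read miss and return who supplied the data.

    states maps a cache name to "F", "S" or "I" and is updated in place.
    """
    if any(s in ("M", "E") for s in states.values()):
        raise ValueError("sketch covers only lines in F, S or I")
    forwarders = [c for c, s in states.items() if s == "F"]
    if forwarders:
        supplier = forwarders[0]   # only the F copy answers; S copies stay silent
        states[supplier] = "S"     # the older copy drops back to Shared
    else:
        supplier = "memory"        # no forwarder left, so memory answers
    states[requester] = "F"        # the newest copy becomes the forwarder
    return supplier
```

However many S copies exist, exactly one responder supplies the data, which is the traffic reduction the protocol aims for.&lt;br /&gt;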
&lt;br /&gt;
More information on the QuickPath Interconnect and MESIF protocol can be found at&lt;br /&gt;
'''[http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]'''&lt;br /&gt;
&lt;br /&gt;
==MOESI==&lt;br /&gt;
The [http://en.wikipedia.org/wiki/Opteron AMD Opteron] was AMD’s first-generation dual-core processor, with 2 distinct [http://en.wikipedia.org/wiki/Athlon_64 K8 cores] together on a single die.  Cache coherence poses bigger problems on such multiprocessors, and an appropriate coherence protocol was needed to address them. The [http://en.wikipedia.org/wiki/Xeon Intel Xeon], the competing counterpart from Intel, used the MESI protocol to handle cache coherence.  MESI came with the drawback of using much time and bandwidth in certain situations. &lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/MOESI_protocol MOESI] was AMD’s answer to this problem. MOESI added a fifth state, called '''“Owned”''', to the MESI protocol. MOESI addresses the bandwidth problem faced in the MESI protocol when a processor holding invalid data in its cache wants to modify that data.  Under MESI, the processor seeking access has to wait for the processor that modified the data to write it back to main memory, which takes time and bandwidth. This drawback is removed in MOESI by allowing dirty sharing.  When the data is held by a processor in the new state '''“Owned”''', it can provide the modified data to other processors without, or even before, writing it to main memory. This is called '''''dirty sharing'''''. The processor holding the data in '''&amp;quot;Owned&amp;quot;''' remains responsible for updating main memory later, when the cache line is evicted.&lt;br /&gt;
&lt;br /&gt;
MOESI has become one of the most popular snoop-based protocols supported in the AMD64 architecture.  The AMD dual-core Opteron can maintain cache coherence in systems up to 8 processors using this protocol.&lt;br /&gt;
&lt;br /&gt;
The five different states of the MOESI protocol are:&lt;br /&gt;
* '''Modified (M)''' : The most recent copy of the data is present in the cache line. But it is not present in any other processor cache.&lt;br /&gt;
* '''Owned (O)'''   : The cache line holds the most recent, correct copy of the data, which may also be shared by other processors. The processor holding the line in this state is responsible for updating main memory with the correct value before the line is evicted.  &lt;br /&gt;
* '''Exclusive (E)''' : A cache line holds the most recent, correct copy of the data, which is exclusively present on this processor and a copy is present in the main memory.  &lt;br /&gt;
* '''Shared (S)''' : A cache line in the shared state holds the most recent, correct copy of the data, which may be shared by other processors. &lt;br /&gt;
* '''Invalid (I)''' : A cache line does not hold a valid copy of the data.&lt;br /&gt;
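&lt;br /&gt;
Dirty sharing and the write-back responsibility of the Owned state can be sketched in Python. This is a simplified illustration, not AMD's implementation: the function names and the returned tuple are hypothetical, and the sketch assumes that memory supplies the data when the snooped line is clean.&lt;br /&gt;

```python
# Toy MOESI snoop handling for one cache line. Function names and the
# return shape are illustrative assumptions.

def snoop_read(state):
    """Return (next_state, supplies_data, writes_memory) on a snooped read."""
    if state in ("M", "O"):
        # Dirty sharing: supply the modified data cache-to-cache and keep
        # responsibility for the eventual write-back; memory stays stale.
        return "O", True, False
    if state == "E":
        return "S", False, False   # assumption: memory supplies the clean data
    return state, False, False     # S and I are unchanged


def evict(state):
    """Return True if evicting a line in this state must write it back."""
    return state in ("M", "O")
```

A snooped read of a Modified line moves it to Owned without any memory write; the write-back happens only when that line is later evicted.&lt;br /&gt;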
&lt;br /&gt;
A detailed explanation of this protocol implementation on AMD processor can be found in the manual [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of the AMD 64-bit core]&lt;br /&gt;
&lt;br /&gt;
The following table summarizes the MOESI protocol:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Owned'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''This cache line is valid?'''&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  No&lt;br /&gt;
|-&lt;br /&gt;
|  '''The memory copy is…'''&lt;br /&gt;
|  out of date&lt;br /&gt;
|  out of date&lt;br /&gt;
|  valid&lt;br /&gt;
|  valid&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies exist in caches of other processors?'''&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe&lt;br /&gt;
|  Maybe&lt;br /&gt;
|-&lt;br /&gt;
|  '''A write to this line'''&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  goes directly to bus&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The state transition diagram for MOESI is shown below: &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:MOESI.jpg|500px|thumb|center|MOESI state transition diagram]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br /&amp;gt;&lt;br /&gt;
&amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Optimization techniques on MOESI===&lt;br /&gt;
&lt;br /&gt;
In real machines, applying optimization techniques to the standard cache coherence protocol improves performance. For example, the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] family of microprocessors (Family 0x10), which is AMD’s first generation to incorporate 4 distinct cores on a single die and the first to have a cache that all the cores share, uses the MOESI protocol with some optimization techniques incorporated. &lt;br /&gt;
&lt;br /&gt;
One such optimization focuses on a small subset of compute problems that behave like producer-consumer programs. In such a problem, a thread running on one core produces data, which is consumed by a thread running on a separate core. With such programs, it is desirable to get the two distinct cores to communicate through the shared cache, to avoid round trips to and from main memory. The '''MOESI''' protocol that the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] cache uses for cache coherence can also limit bandwidth. Hence, by keeping the cache line in the '''‘M’''' state for such computing problems, we can achieve better performance.&lt;br /&gt;
&lt;br /&gt;
When the producer thread writes a new entry, it allocates cache lines in the '''modified (M)''' state. Eventually, these M-marked cache lines start to fill the L3 cache. When the consumer reads a cache line, the MOESI protocol changes the state of that line to '''owned (O)''' in the L3 cache and pulls down a '''shared (S)''' copy for the consumer's own use. The producer thread then circles the ring buffer and arrives back at the same cache line it had previously written. However, when the producer attempts to write new data to the owned (marked '''‘O’''') cache line, it finds that it cannot, since a cache line marked '''‘O’''' by the previous consumer read does not have sufficient permission for a write request (in the MOESI protocol). To maintain coherence, the memory controller must initiate probes in the other caches (to handle any other S copies that may exist). This slows down the process.&lt;br /&gt;
&lt;br /&gt;
Thus, it is preferable to keep the cache line in the '''‘M’''' state in the L3 cache. In such a situation, when the producer comes back around the ring buffer, it finds the previously written cache line still marked '''‘M’''', to which it is safe to write without coherence concerns. Thus better performance can be achieved by such optimization techniques to standard protocols when implemented in real machines.&lt;br /&gt;
&lt;br /&gt;
You can find more information on how this is implemented and various other ways of optimizations in this manual [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors]&lt;br /&gt;
&lt;br /&gt;
==Dragon Protocol==&lt;br /&gt;
The '''[http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]''' is an update-based coherence protocol: unlike the coherence protocols we have seen so far, it does not invalidate other cached copies. Write propagation is achieved by updating the cached copies instead of invalidating them.  The Dragon Protocol does not update memory on a cache-to-cache transfer; it delays bringing memory and cache into agreement until the data is evicted and written back, which saves time and lowers the memory access requirements. Moreover, only the written '''byte''' or '''word''' is '''communicated''' to the '''other caches''' instead of the whole block, which further '''reduces''' the '''bandwidth''' usage. It can dynamically detect the sharing status of a block and use a write-through policy for shared blocks and a write-back policy for currently non-shared blocks. The Dragon Protocol employs the following four states for the cache blocks: '''Shared Clean''', '''Shared Modified''', '''Exclusive''' and  '''Modified'''. &lt;br /&gt;
* '''Modified (M)''' and '''Exclusive (E)''' - these states have the same meaning as explained in the protocols above. &lt;br /&gt;
* '''Shared Modified (Sm)''' - Only one cache line in the system can be in the Shared Modified state. Potentially two or more caches have this block, memory may or may not be up to date, and this processor's cache has modified the block.&lt;br /&gt;
* '''Shared Clean (Sc)''' -  Potentially two or more caches have this block and memory may or may not be up to date (if no other cache has it in Sm state, memory will be up to date else it is not).&lt;br /&gt;
Only when a Shared Modified line is evicted from the cache on a cache miss is the block written back to main memory in order to keep memory consistent. For more information on the Dragon protocol, refer to the Solihin textbook, page 229. The state transition diagram is given below for reference.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:Dragon.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The Dragon protocol implements snoopy caches that provide the appearance of a uniform memory space to multiple processors. Here, each cache listens to 2 buses: the processor bus and the memory bus. The caches are also responsible for address translation, so the processor bus carries virtual addresses and the memory bus carries physical addresses.  The Dragon system was designed to support 4 to 8 Dragon processors. &lt;br /&gt;
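&lt;br /&gt;
The update-based behavior on a write hit can be sketched in Python. This is a simplified illustration using the four state names listed above: the function names, the BusUpd event name and the shared_line flag are assumptions, and write misses are left out.&lt;br /&gt;

```python
# Toy Dragon write handling for one cache block. The function names, the
# BusUpd event name and the shared_line flag are illustrative assumptions.

def processor_write(state, shared_line):
    """Return (next_state, bus_transaction) for a write hit.

    shared_line models the bus signal that tells the writer whether any
    other cache still holds the block.
    """
    if state in ("E", "M"):
        return "M", None            # sole copy: write locally, no bus traffic
    if state in ("Sc", "Sm"):
        if shared_line:
            return "Sm", "BusUpd"   # broadcast only the written word
        return "M", "BusUpd"        # no other cache holds the block any more
    raise ValueError("write miss is not modeled in this sketch")


def snoop_update(state):
    """A cache that sees BusUpd patches its copy and stays valid as Sc."""
    if state in ("Sc", "Sm"):
        return "Sc"
    return state
```

No transition in this sketch leads to an invalid state: a sharer that observes an update keeps a valid, refreshed copy.&lt;br /&gt;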
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=[http://en.wikipedia.org/wiki/Instruction_prefetch Instruction Prefetching]=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Instruction prefetching is a technique used to speed up the execution of a program. In multiprocessors, however, prefetching comes at a cost. Due to prefetching, data can be modified in such a way that the memory coherence protocol cannot handle the effects. In such situations, software must use serializing instructions or cache-invalidation instructions to guarantee that subsequent data accesses are coherent. &lt;br /&gt;
&amp;lt;br /&amp;gt;&lt;br /&gt;
An example of this type of a situation is a page-table update followed by accesses to the physical pages referenced by the updated page tables. The physical-memory references for the page tables are different than the physical-memory references for the data. Because of prefetching there may be problems with correctness. The following sequence of events shows such a situation when software changes the translation of virtual-page A from physical-page M to physical-page N:&lt;br /&gt;
# The tables that translate virtual-page A to physical-page M are now held only in main memory. The copies in the cache are invalidated.&lt;br /&gt;
# Software changes the page-table entry for virtual-page A in main memory to point to physical-page N rather than physical-page M.&lt;br /&gt;
# Data in virtual-page A is accessed.&lt;br /&gt;
&lt;br /&gt;
Software expects the processor to access the data from physical-page N after the update. However, it is possible for the processor to prefetch the data from physical-page M before the page table for virtual page A is updated. Because the physical-memory references are different, the processor does not recognize them as requiring coherence checking and believes it is safe to prefetch the data from virtual-page A, which is translated into a read from physical page M. Similar behavior can occur when instructions are prefetched from beyond the page-table update instruction.&lt;br /&gt;
&lt;br /&gt;
To prevent such errors, software must execute one of the special instructions mentioned above immediately after the page-table update to ensure that subsequent instruction fetches and data accesses use the correct virtual-page-to-physical-page translation. It is not necessary to perform a TLB invalidation operation preceding the table update.&lt;br /&gt;
&lt;br /&gt;
More information can be found about this in [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual]&lt;br /&gt;
&lt;br /&gt;
= CMP Implementation in Intel Architecture =&lt;br /&gt;
&lt;br /&gt;
Let us now see how Intel architecture using the MESI protocol progressed from a uniprocessor architecture to a Chip MultiProcessor (CMP) using the bus as the interconnect. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Uniprocessor Architecture'''&lt;br /&gt;
&lt;br /&gt;
The diagram below shows the structure of the memory cluster in Intel Pentium M processor.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:intel_cache1.jpg]]&amp;lt;/center&amp;gt; &amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In this structure we have,&lt;br /&gt;
* A unified on-chip '''L1 cache''' with the '''processor/core''',&lt;br /&gt;
* A '''Memory/L2 access control unit''', through which all the accesses to the L2 cache, main memory and IO space are made,&lt;br /&gt;
* The second level '''L2 cache''' along with the '''prefetch unit''' and&lt;br /&gt;
* '''Front side bus (FSB)''', a single shared bi-directional bus through which all the traffic is sent. These wide buses bring in multiple data bytes at a time. &lt;br /&gt;
&lt;br /&gt;
As Intel explains it, using this structure, processor requests were first sought in the '''L2 cache''' and only on a '''miss''' were they '''forwarded''' to the main '''memory''' via the front side bus ('''FSB'''). The '''Memory/L2 access control''' unit served as a central point for '''maintaining coherence''' within the core and with the external world. It '''contains''' a '''snoop control unit''' that receives snoop requests from the bus and performs the required operations on each cache (and internal buffers) in parallel. It also handles RFO requests (BusUpgr) and ensures the operation continues only after it guarantees that no other version of the cache line exists in any other cache in the system.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''CMP Architecture'''&lt;br /&gt;
&lt;br /&gt;
For the CMP implementation, Intel chose the bus-based architecture using snoopy protocols over the '''directory protocol''' because, although the directory protocol reduces active power thanks to reduced snoop activity, it '''increases''' the '''design complexity''' and the '''static power''' due to larger tag arrays. Since Intel has a large market for processors in the mobility family, a directory-based solution was less favorable, since battery life depends mainly on static power consumption and less on dynamic power.&lt;br /&gt;
Let us examine how '''CMP''' was implemented in the '''Intel Core Duo''', which was one of the first dual-core processors for the budget/entry-level market. &lt;br /&gt;
The general CMP implementation structure of the Intel Core Duo is shown below.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:intel_cache2.jpg]]&amp;lt;/center&amp;gt; &amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This structure has the following changes when compared to the uniprocessor memory cluster structure. &lt;br /&gt;
* '''L1 cache''' and the '''processor/core''' structure is '''duplicated''' to give 2 cores.&lt;br /&gt;
* The '''Memory/L2 access control''' unit is '''split''' into 2 logical units: '''L2 controller''' and '''bus controller'''. The L2 controller handles all '''requests to the L2''' cache from the core and the snoop requests from the FSB. The '''bus controller''' handles '''data and I/O requests''' to and from the FSB.&lt;br /&gt;
* The '''prefetching''' unit is extended to handle the hardware '''prefetches for each core separately'''.&lt;br /&gt;
* A '''new logical unit''' (represented by the hexagon) was added to maintain '''fairness between the requests''' coming from the different cores and hence balance the requests to L2 and memory.&lt;br /&gt;
&lt;br /&gt;
This new '''partitioned structure''' for the  memory/L2 access control unit '''enhanced''' the '''performance''' while '''reducing power consumption'''. &lt;br /&gt;
For more information on uniprocessor and multiprocessor implementation under the Intel architecture, refer to &lt;br /&gt;
[http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]&lt;br /&gt;
&lt;br /&gt;
The '''Intel bus architecture''' has been '''evolving''' in order to accommodate the demands of scalability while using the same MESI protocol: from a '''single shared bus''' to '''dual independent buses (DIB)''', doubling the available bandwidth, and on to the logical conclusion of DIB with the introduction of '''dedicated high-speed interconnects (DHSI)'''. The DHSI-based platforms use four FSBs, one for each processor in the platform. In both DIB and DHSI, a snoop filter was used in the chipset to cache snoop information, thereby significantly reducing the broadcasting needed for the snoop traffic on the buses. With the production of processors based on the next generation 45-nm Hi-k Intel Core microarchitecture, the [http://en.wikipedia.org/wiki/Xeon Intel Xeon] processor fabric will transition from a DHSI, with the memory controller in the chipset, to a distributed shared memory architecture using '''Intel QuickPath Interconnects using MESIF protocol'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Word Invalidation=&lt;br /&gt;
One complexity problem applying to a number of the protocols deals with invalidation.  In newer protocols, individual words may be modified in a cache line, as opposed to the entirety of the line.  The other processors will thus have a mostly correct cache line, with only a word difference.  This leads to potential complexity, because to 'correct' the error the other processors must transition from shared to invalid, where they then can read the correct word from the bus and place it back into the line, transitioning back into shared.  This transition is potentially unnecessary, as the second processor may never access the specific word changed, but it may access other words in the cache line.  One potential solution to this is being researched at the present time, advancing the protocols so that a cache line is &amp;quot;not invalidated on the first dirty word, but after the number of dirty words crosses some predetermined value, which is data type and application dependent.&amp;quot;  In other words, if the application can possibly have multiple words 'incorrect,' several transitions to and from the invalid state may be avoided.&lt;br /&gt;
&amp;lt;br/&amp;gt;Source: http://tab.computer.org/tcca/NEWS/sept96/dsmideas.ps&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In essence, the solution proposed here is to advance the MOESI protocol with word invalidation and specific treatment of temporal and spatial data, so that the block is not invalidated.&lt;br /&gt;
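&lt;br /&gt;
The threshold idea quoted above can be sketched in Python. This is only an illustration of the concept, not the mechanism proposed in the cited source: the class name, the method names and the default threshold of 2 are assumptions.&lt;br /&gt;

```python
# Toy model of threshold-based invalidation for one shared cache line.
# The class name, method names and default threshold are illustrative assumptions.

class SharedLineCopy:
    def __init__(self, threshold=2):
        self.threshold = threshold   # number of stale words that triggers invalidation
        self.dirty_words = set()     # word offsets known to be stale locally
        self.valid = True

    def remote_write(self, word):
        """Another processor wrote one word of this line."""
        self.dirty_words.add(word)
        if len(self.dirty_words) == self.threshold:
            self.valid = False       # threshold reached: invalidate the whole line

    def can_read(self, word):
        """A local read hits only if the line is valid and the word is not stale."""
        return self.valid and word not in self.dirty_words
```

With a threshold of 2, a single remote write leaves the rest of the line readable, and only the second distinct dirty word invalidates it. A threshold of 1 gives the conventional behavior of invalidating on the first dirty word.&lt;br /&gt;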
&lt;br /&gt;
=Performance: MOESI vs MEI/MESI=&lt;br /&gt;
Early multiprocessors (such as the PowerPC processors) were designed to work with three states (Modified, Exclusive, Invalid - this is similar to the MSI protocol discussed earlier, the term MEI is used to remain consistent with the sources used for reference).  As time progressed, more multi-processors transitioned to the MESI protocol.  This is most likely due to the characteristics of MEI - it is &amp;quot;easy to implement but leads to inefficiencies in the way that memory bus bandwidth is used&amp;quot; [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 source].  As a result, systems with two or more MEI processors (most likely) will not see the 'full' potential of the extra processors, due to the inefficient bus transactions.&lt;br /&gt;
&lt;br /&gt;
In contrast, the five-state MOESI protocol is more complex to implement than MESI and MEI/MSI.  However, it offers advantages.  As Andy Keane (VP of Marketing for PMC-Sierra) put it, &amp;quot;this [fifth] state allows shared data that is dirty to remain in the cache.  Without this state, any shared line would be written back to memory to change the original state from modified. Since we have a dedicated, fast path from CPU to CPU, this state maximizes the use of this path rather than the path to memory, which is inherently slower&amp;quot; [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 source].  MOESI realizes this advantage by allowing some processor-to-processor transfers directly from the level 2 caches.&lt;br /&gt;
&lt;br /&gt;
Source: [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 Edwards, Chris (08/06/2001). &amp;quot;Architects wrestle with multiprocessor options&amp;quot;. Electronic engineering times (0192-1541), (1178), p. 48.]&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Cache_coherence Cache coherence]&lt;br /&gt;
# [http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]&lt;br /&gt;
# [http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Symmetric_multiprocessing Common System Interface in Intel Processors]&lt;br /&gt;
# [http://www.zak.ict.pwr.wroc.pl/nikodem/ak_materialy/Cache%20consistency%20&amp;amp;%20MESI.pdf Cache consistency with MESI on Intel processor]&lt;br /&gt;
# [http://techreport.com/articles.x/8236/2 AMD dual core Architecture]&lt;br /&gt;
# [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual]&lt;br /&gt;
# [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors]&lt;br /&gt;
# [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of AMD 64 bit core]&lt;br /&gt;
# [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&amp;amp;arnumber=4913 Silicon Graphics Computer Systems]&lt;br /&gt;
# [http://books.google.com/books?id=g82fofiqa5IC&amp;amp;printsec=frontcover&amp;amp;dq=Parallel+computer+architecture:+a+hardware/software+approach+By+David+E.+Culler,+Jaswinder+Pal+Singh,+Anoop+Gupta&amp;amp;source=bl&amp;amp;ots=COrdamlfVn&amp;amp;sig=YcugVqbzTjHvlofvaFq6Ft_tjfY&amp;amp;hl=en&amp;amp;ei=0ZO6S4TJGcOclgejzI3BBw&amp;amp;sa=X&amp;amp;oi=book_result&amp;amp;ct=result&amp;amp;resnum=1&amp;amp;ved=0CAgQ6AEwAA#v=onepage&amp;amp;q=&amp;amp;f=false Parallel computer architecture: a hardware/software approach By David E. Culler, Jaswinder Pal Singh, Anoop Gupta]&lt;br /&gt;
# [http://www.freepatentsonline.com/5283886.html Three state invalidation protocols]&lt;br /&gt;
# [http://portal.acm.org/citation.cfm?id=1499317&amp;amp;dl=GUIDE&amp;amp;coll=GUIDE&amp;amp;CFID=83027384&amp;amp;CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems]&lt;br /&gt;
# [http://portal.acm.org/citation.cfm?id=6514 Cache Coherence protocols: evaluation using a multiprocessor simulation model]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Xerox_Dragon Xerox Dragon]&lt;br /&gt;
# [http://thanaseto.110mb.com/courses/CSD-527-report-engl.pdf Coherence Protocols]&lt;br /&gt;
# [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&amp;amp;arnumber=289691 XDBus]&lt;br /&gt;
# [http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dai0228a/index.html ARM]&lt;br /&gt;
# [http://ixbtlabs.com/articles/ibmpower4/index.html IBM POWER4]&lt;br /&gt;
# [http://www.chiplist.com/Intel_Itanium_2_processor/tree3f-section--2235-/ Intel Itanium 2]&lt;br /&gt;
# [http://techreport.com/articles.x/8236/2 MESI-MESI-MOESI Banana-fana...]&lt;br /&gt;
# [http://rsim.cs.illinois.edu/rsim/Manual/node109.html RSIM]&lt;br /&gt;
# [http://dl.acm.org/citation.cfm?id=1499317 Synapse tightly coupled multiprocessors]&lt;/div&gt;</summary>
		<author><name>Pxu</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2010/8a_fu&amp;diff=59793</id>
		<title>CSC/ECE 506 Spring 2010/8a fu</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2010/8a_fu&amp;diff=59793"/>
		<updated>2012-03-19T01:27:34Z</updated>

		<summary type="html">&lt;p&gt;Pxu: /* MOESI */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Introduction to bus-based cache coherence in real machines=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==SMP Architecture==&lt;br /&gt;
Most parallel software in the commercial market relies on the shared-memory programming model in which all processors access the same physical address space. And the most common multiprocessors today use [http://en.wikipedia.org/wiki/Symmetric_multiprocessing SMP] architecture which use a common bus as the interconnect.  In the case of multicore processors (&amp;quot;chip multiprocessors,&amp;quot; or CMP) the SMP architecture applies to the cores treating them as separate processors. The key problem of shared-memory multiprocessors is providing a consistent view of memory with various cache hierarchies.  This is called '''''cache coherence problem'''''. It is  critical to  achieve correctness and performance-sensitive design point for supporting the shared-memory model. The cache coherence mechanisms not only govern communication in a shared-memory multiprocessor, but also typically determine how the memory system transfers data between processors, caches, and memory.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:Busbased SMP.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
At any point in logical time, the permissions for a cache block can allow either a single writer or multiple readers. The '''''coherence protocol''''' ensures the invariants of the states are maintained. The different coherent states used by most of the cache coherence protocols are as shown in ''Table 1'':&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''States'''&lt;br /&gt;
|  '''Access Type'''&lt;br /&gt;
|  '''Invariant'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  read, write&lt;br /&gt;
|  all other caches in I state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  read&lt;br /&gt;
|  all other caches in I state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Owned'''&lt;br /&gt;
|  read&lt;br /&gt;
|  all other caches in I or S state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  read&lt;br /&gt;
|  no other cache in M or E state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|  -&lt;br /&gt;
|  -&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Snooping Protocols=&lt;br /&gt;
==MSI Protocol==&lt;br /&gt;
&lt;br /&gt;
'''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' is a three-state write-back '''invalidation protocol''' and one of the earliest snooping-based cache coherence protocols. It marks each cache line as '''Modified (M)''', '''Shared (S)''', or '''Invalid (I)'''. '''Invalid''' means the cache line is either not present or is in the invalid state. If the cache line is clean and may be held by more than one processor, it is marked '''Shared'''. If the cache line is dirty and the processor has exclusive ownership of it, it is in the '''Modified''' state. A BusRdX causes all other caches to invalidate (demote) their copies to the '''I''' state; if the line is in the '''M''' state in another cache, that cache flushes it. A write that hits a line in the '''S''' state still issues a BusRdX, after which the line is promoted (upgraded) to the '''M''' state.&lt;br /&gt;
&lt;br /&gt;
The state diagram of the MSI protocol is shown below. Note that the left part shows the response to processor-side requests, and the right part shows the response to snooper-side requests.&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:MSInew.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
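The processor-side and snooper-side responses can also be written as two lookup tables. The following is a minimal sketch assuming the textbook MSI transitions described above; the names PROCESSOR_SIDE, SNOOP_SIDE, and step are hypothetical and do not come from any real simulator.&lt;br /&gt;

```python
# Illustrative MSI tables (assumed textbook transitions, not a real simulator).
# Processor side: (state, processor event) maps to (next state, bus request).
PROCESSOR_SIDE = {
    ("I", "PrRd"): ("S", "BusRd"),
    ("I", "PrWr"): ("M", "BusRdX"),
    ("S", "PrRd"): ("S", None),
    ("S", "PrWr"): ("M", "BusRdX"),  # a write hit in S still needs the bus
    ("M", "PrRd"): ("M", None),
    ("M", "PrWr"): ("M", None),
}

# Snooper side: (state, observed bus request) maps to (next state, action).
SNOOP_SIDE = {
    ("S", "BusRd"): ("S", None),
    ("S", "BusRdX"): ("I", None),
    ("M", "BusRd"): ("S", "Flush"),
    ("M", "BusRdX"): ("I", "Flush"),
}


def step(states, cpu, event):
    """Apply one processor event and let every other cache snoop the bus."""
    next_state, bus = PROCESSOR_SIDE[(states[cpu], event)]
    actions = []
    if bus is not None:
        for other in range(len(states)):
            if other == cpu or states[other] == "I":
                continue  # the requester and invalid copies do not respond
            states[other], action = SNOOP_SIDE[(states[other], bus)]
            if action is not None:
                actions.append((other, action))
    states[cpu] = next_state
    return bus, actions


states = ["I", "I"]
print(step(states, 0, "PrWr"), states)  # cache 0 obtains the line in M
print(step(states, 1, "PrRd"), states)  # cache 0 flushes and drops to S
print(step(states, 1, "PrWr"), states)  # the upgrade invalidates cache 0
```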
&lt;br /&gt;
==Synapse protocol==&lt;br /&gt;
From the state transition diagram of MSI, we observe that there is a transition from state '''M''' to state '''S''' when a BusRd is observed for that block. The contents of the block are flushed to the bus before going to the '''S''' state. In certain cases it would be more appropriate to move to the '''I''' state, thus giving up the block entirely. The choice between '''S''' and '''I''' reflects the designer's assertion about whether the original processor is more likely to continue reading the block than the new processor is to write to it. The Synapse protocol, used in the early Synapse multiprocessor, made the alternate choice of going directly from the '''M''' state to the '''I''' state on a BusRd, assuming the migratory pattern would be more frequent. More details about this protocol can be found in these papers published in the 1980s: [http://portal.acm.org/citation.cfm?id=6514 Cache Coherence protocols: evaluation using a multiprocessor simulation model] and [http://portal.acm.org/citation.cfm?id=1499317&amp;amp;dl=GUIDE&amp;amp;coll=GUIDE&amp;amp;CFID=83027384&amp;amp;CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems]&lt;br /&gt;
&lt;br /&gt;
In Synapse protocol '''M''' state is called '''D''' (Dirty) state. The following is the state transition diagram for Synapse protocol:&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:Synapse1.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
===Real Architectures using Synapse===&lt;br /&gt;
No information could be found on any actual architecture using the Synapse cache-coherence protocol.  Judging from the abstract of the 1984 paper by Steve Frank and Armond Inselberg, &amp;quot;[http://dl.acm.org/citation.cfm?id=1499317 Synapse Tightly Coupled Multiprocessors: A New Approach to Solve Old Problems]&amp;quot;, Synapse was theoretical.&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The [http://www.disi.unige.it/person/DelzannoG/CacheProtocol/synapse.hy The BABYLON Project] did one example of a Synapse cache coherence protocol with atomic synchronization actions.&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
In the Synapse system, memory contention is reduced by designing the processor cache around a non-write-through algorithm that minimizes the bandwidth needed between cache and shared memory. The Synapse Expansion Bus includes an ownership-level protocol between processor caches.[[#References|&amp;lt;sup&amp;gt;[24]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
==[http://en.wikipedia.org/wiki/MESI_protocol MESI]==&lt;br /&gt;
The main drawback of MSI is that each read-write sequence incurs two bus transactions irrespective of whether the cache line is stored in only one cache or not. Highly parallel programs that have little data sharing suffer the most from this.  MESI protocol solves this problem by introducing the '''Exclusive''' state to distinguish between a cache line stored in multiple caches and a line stored in a single cache.&lt;br /&gt;
Let us briefly see how the MESI protocol works. For a more detailed version, refer to the Solihin textbook, p. 215.&lt;br /&gt;
&lt;br /&gt;
The MESI coherence protocol marks each cache line as being in one of the Modified, Exclusive, Shared, or Invalid states. &lt;br /&gt;
* '''Invalid''' : The cache line is either not present or is invalid&lt;br /&gt;
* '''Exclusive''' : The cache line is clean and is owned by this core/processor only&lt;br /&gt;
* '''Modified''' : The cache line is dirty and the core/processor has exclusive ownership of it, exclusive even of memory (the memory copy is out of date).&lt;br /&gt;
* '''Shared''' : The cache line is clean and is shared by more than one core/processor&lt;br /&gt;
&lt;br /&gt;
The MESI protocol works as follows: &lt;br /&gt;
A line that is fetched receives the '''E''' or '''S''' state depending on whether it exists in other processors in the system. A cache line gets the '''M''' state when a processor writes to it; if the line is not in the '''E''' or '''M''' state prior to the write, the cache sends a Bus Upgrade (BusUpgr) signal or, as the Intel manuals term it, a “Read-For-Ownership (RFO) request”, which ensures that the line exists in the cache and is in the '''I''' state in all other processors on the bus (if any). The table below summarizes the MESI protocol.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''This cache line is valid?'''&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  No&lt;br /&gt;
|-&lt;br /&gt;
|  '''The memory copy is…'''&lt;br /&gt;
|  out of date&lt;br /&gt;
|  valid&lt;br /&gt;
|  valid&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies exist in caches of other processors?'''&lt;br /&gt;
|  No&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe&lt;br /&gt;
|  Maybe&lt;br /&gt;
|-&lt;br /&gt;
|  '''A write to this line'''&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  goes directly to bus&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The transition diagram from the lecture slides is given below for reference.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:MESI.jpg]]&amp;lt;/center&amp;gt; &amp;lt;br /&amp;gt;&lt;br /&gt;
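The saving that the Exclusive state buys can be illustrated by counting the bus requests for a read followed by a write to the same line. This is a simplified sketch under the assumptions stated in the comments; the helper name bus_transactions and its parameters are hypothetical.&lt;br /&gt;

```python
# Count bus requests for a read followed by a write to the same line.
# Simplified model: the helper name and request strings are illustrative only.
def bus_transactions(protocol, other_sharers):
    """Return the bus requests one cache issues for PrRd followed by PrWr."""
    requests = ["BusRd"]  # the initial read miss always goes to the bus
    if protocol == "MESI" and not other_sharers:
        state = "E"  # no other copy exists, so the line is loaded Exclusive
    else:
        state = "S"  # MSI has no E state; MESI uses S when sharers exist
    if state == "S":
        # The write must announce itself so that any sharers invalidate.
        requests.append("BusRdX" if protocol == "MSI" else "BusUpgr")
    # From E the write is silent: the line simply moves to M.
    return requests


for protocol in ("MSI", "MESI"):
    print(protocol, bus_transactions(protocol, other_sharers=False))
```

For an unshared line the sketch issues two requests under MSI and one under MESI, which is the case that programs with little data sharing hit most often.&lt;br /&gt;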
&lt;br /&gt;
===Real Architectures using MESI===&lt;br /&gt;
&lt;br /&gt;
Intel's [http://en.wikipedia.org/wiki/Pentium_Pro Pentium Pro] microprocessor, introduced in 1995, was the first Intel architecture microprocessor to support symmetric multiprocessing in various multiprocessor configurations.  SMP with the MESI protocol was used consistently until the introduction of the 45-nm Hi-k Core microarchitecture in Intel's (Nehalem-EP) quad-core x86-64. The 45-nm Hi-k Intel Core microarchitecture uses a new system framework called the [http://en.wikipedia.org/wiki/Intel_QuickPath_Interconnect QuickPath Interconnect], a point-to-point interconnection technology based on a distributed shared memory architecture.  It uses a modified version of the MESI protocol called [http://en.wikipedia.org/wiki/MESIF_protocol MESIF], which introduces the additional Forward state.  (The state diagram of MESI transitions that occur within the Pentium's data cache can be found on page 63 of [http://books.google.com/books?id=TVzjEZg1--YC&amp;amp;printsec=frontcover&amp;amp;source=gbs_ge_summary_r&amp;amp;cad=0#v=onepage&amp;amp;q&amp;amp;f=false Pentium Processor System Architecture: Second Edition] by Don Anderson, Tom Shanley, MindShare, Inc)&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
The [http://www.arm.com/products/processors/classic/arm11/arm11-mpcore.php ARM11 MPCore] and [http://en.wikipedia.org/wiki/ARM_Cortex-A9_MPCore Cortex-A9 MPCore] processors support the MESI cache coherency protocol.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]  ARM MPCore defines the states of the MESI protocol it implements as:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies in other caches'''&lt;br /&gt;
|  NO&lt;br /&gt;
|  NO&lt;br /&gt;
|  YES&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Clean or Dirty'''&lt;br /&gt;
|  DIRTY&lt;br /&gt;
|  CLEAN&lt;br /&gt;
|  CLEAN&lt;br /&gt;
|  -&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
The coherency protocol is implemented and managed by the Snoop Control Unit (SCU) in the ARM MPCore, which monitors the traffic between the local L1 data caches and the next level of the memory hierarchy. At boot time, each core can choose whether or not to take part in the coherency domain.  Unless explicit system calls bind a task to a specific core ([http://en.wikipedia.org/wiki/Processor_affinity processor affinity]), there is a high chance that the task will at some point migrate to a different core, along with its data as it is used.  Task migration is not handled efficiently by a literal MESI implementation, so the ARM MPCore offers two optimizations that allow for MESI compliance and migration of tasks: '''Direct Data Intervention (DDI)''' (the SCU keeps a copy of the tag RAMs of every core's cache, which lets it efficiently detect whether a cache line requested by one core is held by another core in the coherency domain before looking for it in the next level of the memory hierarchy) and '''Cache-to-cache Migration''' (if the SCU finds that the cache line requested by one CPU is present in another core, it will either copy it (if clean) or move it (if dirty) from the other CPU directly into the requesting one, without interacting with external memory).  These optimizations reduce memory traffic in and out of the L1 cache subsystem by eliminating interaction with external memories, which in effect reduces the overall load on the interconnect and the overall power consumption.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
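The two optimizations can be pictured as a lookup in duplicated tags followed by a copy-or-move decision. The sketch below is only an illustration of that idea, not ARM's implementation; serve_request, duplicate_tags, and dirty_lines are hypothetical names.&lt;br /&gt;

```python
# Sketch of a snoop control unit consulting duplicated L1 tags.
# All names are hypothetical; this is not ARM's implementation.
def serve_request(duplicate_tags, dirty_lines, requester, address):
    """Return (source, action) for a cache line requested by one core."""
    for core, tags in duplicate_tags.items():
        if core != requester and address in tags:
            # Found in a peer L1: copy a clean line, move a dirty one.
            action = "move" if (core, address) in dirty_lines else "copy"
            return core, action
    return "next level", "fetch"  # no peer holds it: go down the hierarchy


tags = {"cpu0": {0x100, 0x140}, "cpu1": {0x200}}
dirty = {("cpu0", 0x140)}
print(serve_request(tags, dirty, "cpu1", 0x100))
print(serve_request(tags, dirty, "cpu1", 0x140))
print(serve_request(tags, dirty, "cpu0", 0x300))
```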
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
Other architectures that use the MESI cache-coherence protocol include the L2 cache of the IBM POWER4 processor[[#References|&amp;lt;sup&amp;gt;[20]&amp;lt;/sup&amp;gt;]], the L2 cache of the Intel Itanium 2 processor[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]], and the Intel Xeon[[#References|&amp;lt;sup&amp;gt;[22]&amp;lt;/sup&amp;gt;]].&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
One implementation complexity already mentioned is the inefficiency of task migration in MESI that the ARM MPCore addresses.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]  Another possible implementation complexity is found during the replacement of a cache line: one possible MESI implementation requires a message to be sent to main memory when a cache line is flushed (i.e. an E to I transition), as the line was exclusively in one cache before it was removed. It is possible to avoid this replacement message if the system is designed so that the flush of a modified (exclusive) line requires an acknowledgment from main memory. However, this requires the flush to be stored in a 'write-back' buffer until the reply arrives (to ensure the change is successfully propagated to memory).[[#References|&amp;lt;sup&amp;gt;[23]&amp;lt;/sup&amp;gt;]]&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==MESIF==&lt;br /&gt;
Let us now walk through a brief overview of the '''MESIF protocol''':&lt;br /&gt;
&lt;br /&gt;
The '''MESIF''' protocol, used in the latest Intel multi-core processors, was introduced to '''accommodate the point-to-point''' links used in the QuickPath Interconnect. Using the '''MESI''' protocol in this architecture would send many redundant messages between different processors, often with unnecessarily high latency. For example, when a processor requests a cache line that is stored in multiple locations, every location might respond with the data. As the requesting processor only needs a single copy of the data, the system would be wasting bandwidth. &lt;br /&gt;
As a solution to this problem, an additional state, '''Forward state''', was added by slightly changing the role of the Shared state. Whenever there is a read request, only the cache line in the F state will respond to the request, while all the S state caches remain dormant.  Hence, by designating a single cache line to '''respond to requests''', coherency traffic is substantially reduced when multiple copies of the data exist. Also, on a read request, the F state transitions from F to S state. That is, when a cache line in the '''F''' state is '''copied''', the F state '''migrates''' to the '''newer copy''', while the '''older''' one drops back to '''S'''. Moving the new copy to the F state '''exploits''' both '''temporal and spatial locality'''. Because the newest copy of the cache line is always in the F state, it is very unlikely that the line in the F state will be evicted from the caches. This takes advantage of the temporal locality of the request. The second advantage is that if a particular cache line is in high demand due to spatial locality, the bandwidth used to transmit that data will be spread across several nodes.&lt;br /&gt;
All M to S state transition and E to S state transitions will now be from '''M to F''' and '''E to F'''.  &lt;br /&gt;
The '''F state''' is '''different''' from the '''Owned state''' of the MOESI protocol as it is '''not''' a unique copy because a valid copy is stored in memory. Thus, unlike a line in the Owned state of the MOESI protocol, which holds data that main memory does not yet have and must therefore be written back before it is dropped, the data in the F state can be evicted or converted to the S state, if desired. &lt;br /&gt;
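The migration of the F state to the newest copy can be sketched as below. This is a simplified model that only covers clean lines and is not Intel's actual logic; the function read_miss and the cache names are hypothetical.&lt;br /&gt;

```python
# Sketch of F-state migration on read misses (clean lines only, illustrative).
def read_miss(states, requester):
    """Serve a read miss; return the caches that supplied the data."""
    if "M" in states.values():
        raise ValueError("this sketch only models clean lines")
    # Only an E or F holder responds; plain S copies stay silent.
    suppliers = [c for c, s in states.items() if s in ("E", "F")]
    for cache in suppliers:
        states[cache] = "S"  # the previous supplier drops back to Shared
    others_hold_copy = any(
        s == "S" for c, s in states.items() if c != requester
    )
    # The newest copy becomes the forwarder; a sole copy is Exclusive.
    states[requester] = "F" if others_hold_copy else "E"
    return suppliers


states = {"c0": "E", "c1": "I", "c2": "I"}
print(read_miss(states, "c1"), states)
print(read_miss(states, "c2"), states)
```

In each step exactly one cache answers the request, no matter how many Shared copies already exist.&lt;br /&gt;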
&lt;br /&gt;
More information on the QuickPath Interconnect and MESIF protocol can be found at&lt;br /&gt;
'''[http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]'''&lt;br /&gt;
&lt;br /&gt;
==MOESI==&lt;br /&gt;
The [http://en.wikipedia.org/wiki/Opteron AMD Opteron] was AMD’s first-generation dual-core processor, with 2 distinct [http://en.wikipedia.org/wiki/Athlon_64 K8 cores] together on a single die.  Cache coherence poses bigger problems on such multiprocessors, so it was necessary to use an appropriate coherence protocol. The [http://en.wikipedia.org/wiki/Xeon Intel Xeon], the competing processor from Intel, used the MESI protocol to handle cache coherence.  MESI came with the drawback of using much time and bandwidth in certain situations. &lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/MOESI_protocol MOESI] was AMD’s answer to this problem. MOESI adds a fifth state, '''“Owned”''', to the MESI protocol. It addresses the bandwidth problem that MESI faces when a processor holding an invalid copy in its cache wants to access data that another processor has modified.  The processor seeking the data has to wait for the processor that modified it to write it back to main memory, which takes time and bandwidth. MOESI removes this drawback by allowing dirty sharing.  When the data is held by a processor in the new '''“Owned”''' state, it can provide the modified data to other processors without, or even before, writing it to main memory. This is called '''''dirty sharing'''''. The processor holding the data in '''&amp;quot;Owned&amp;quot;''' remains responsible for updating main memory later, when the cache line is evicted.&lt;br /&gt;
&lt;br /&gt;
MOESI has become one of the most popular snoop-based protocols supported in the AMD64 architecture.  The AMD dual-core Opteron can maintain cache coherence in systems up to 8 processors using this protocol.&lt;br /&gt;
&lt;br /&gt;
The five different states of the MOESI protocol are:&lt;br /&gt;
* '''Modified (M)''' : The most recent copy of the data is present in the cache line. But it is not present in any other processor cache.&lt;br /&gt;
* '''Owned (O)'''   : The cache line has the most recent correct copy of the data. It can be shared by other processors. The processor holding the line in this state is responsible for updating main memory with the correct value before the line is evicted.  &lt;br /&gt;
* '''Exclusive (E)''' : A cache line holds the most recent, correct copy of the data, which is exclusively present on this processor and a copy is present in the main memory.  &lt;br /&gt;
* '''Shared (S)''' : A cache line in the shared state holds the most recent, correct copy of the data, which may be shared by other processors. &lt;br /&gt;
* '''Invalid (I)''' : A cache line does not hold a valid copy of the data.&lt;br /&gt;
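The effect of dirty sharing can be sketched by comparing what a snooped read does to a Modified line under each protocol. The two helpers below are hypothetical illustrations built only from the state descriptions above, not from AMD's documentation.&lt;br /&gt;

```python
# Sketch: what a snooped read does to a Modified line (illustrative only).
def snoop_read_on_modified(protocol):
    """Return (supplier's next state, whether memory is written back now)."""
    if protocol == "MOESI":
        return "O", False  # dirty sharing: supply the data, defer write-back
    if protocol == "MESI":
        return "S", True  # the line must be made clean before it is shared
    raise ValueError("unknown protocol: " + protocol)


def needs_write_back_on_eviction(state):
    """Return True when an evicted line must first be written to memory."""
    return state in ("M", "O")


print(snoop_read_on_modified("MESI"))
print(snoop_read_on_modified("MOESI"))
print(needs_write_back_on_eviction("O"))
```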
&lt;br /&gt;
A detailed explanation of this protocol implementation on AMD processor can be found in the manual [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of the AMD 64-bit core]&lt;br /&gt;
&lt;br /&gt;
The following table summarizes the MOESI protocol:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Owned'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''This cache line is valid?'''&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  No&lt;br /&gt;
|-&lt;br /&gt;
|  '''The memory copy is…'''&lt;br /&gt;
|  out of date&lt;br /&gt;
|  out of date&lt;br /&gt;
|  valid&lt;br /&gt;
|  valid (unless another cache holds the line in Owned)&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies exist in caches of other processors?'''&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe (in Shared state)&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe&lt;br /&gt;
|  Maybe&lt;br /&gt;
|-&lt;br /&gt;
|  '''A write to this line'''&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  goes directly to bus&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
State transition for MOESI is as shown below : &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:MOESI.jpg|500px|thumb|center|MOESI state transition diagram]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br /&amp;gt;&lt;br /&gt;
&amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Optimization techniques on MOESI===&lt;br /&gt;
&lt;br /&gt;
In real machines, applying optimization techniques to the standard cache coherence protocol can improve performance. For example, the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] family of microprocessors (Family 0x10), which is AMD’s first generation to incorporate 4 distinct cores on a single die and the first to have a cache that all the cores share, uses the MOESI protocol with some optimization techniques incorporated. &lt;br /&gt;
&lt;br /&gt;
The optimization focuses on a small subset of compute problems that behave like producer and consumer programs. In such a problem, a thread of a program running on one core produces data, which is consumed by a thread running on a separate core. With such programs, it is desirable to get the two distinct cores to communicate through the shared cache, to avoid round trips to and from main memory. The '''MOESI''' protocol that the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] cache uses for cache coherence can also limit bandwidth. Hence, by keeping the cache line in the '''‘M’''' state for such computing problems, we can achieve better performance.&lt;br /&gt;
&lt;br /&gt;
When the producer thread writes a new entry, it allocates cache lines in the '''modified (M)''' state. Eventually, these M-marked cache lines will start to fill the L3 cache. When the consumer reads the cache line, the MOESI protocol changes the state of the cache line to '''owned (O)''' in the L3 cache and pulls down a '''shared (S)''' copy for its own use. Now, the producer thread circles the ring buffer to arrive back at the same cache line it had previously written. However, when the producer attempts to write new data to the owned (marked '''‘O’''') cache line, it finds that it cannot, since a cache line marked '''‘O’''' by the previous consumer read does not have sufficient permission for a write request (in the MOESI protocol). To maintain coherence, the memory controller must initiate probes in the other caches (to handle any other S copies that may exist). This will slow down the process.&lt;br /&gt;
&lt;br /&gt;
Thus, it is preferable to keep the cache line in the '''‘M’''' state in the L3 cache. In such a situation, when the producer comes back around the ring buffer, it finds the previously written cache line still marked '''‘M’''', to which it is safe to write without coherence concerns. Thus better performance can be achieved by such optimization techniques to standard protocols when implemented in real machines.&lt;br /&gt;
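The cost described above can be made concrete by counting how many producer writes land on a line in the Owned state. The model below is a rough sketch: the function run and the flag keep_modified are hypothetical, and the flag merely stands in for whatever mechanism keeps the line in the M state, which the text does not spell out.&lt;br /&gt;

```python
# Count coherence probes as a producer laps a ring buffer (illustrative model).
def run(ring_lines, laps, keep_modified):
    """Return how many producer writes found the line Owned and had to probe."""
    state = ["I"] * ring_lines  # state of each ring-buffer line
    probes = 0
    for _ in range(laps):
        for line in range(ring_lines):
            if state[line] == "O":
                probes += 1  # O lacks write permission: probe out S copies
            state[line] = "M"  # the producer writes the entry
            # The consumer then reads the entry. A plain MOESI read leaves
            # the line Owned, while the optimization keeps it Modified.
            state[line] = "M" if keep_modified else "O"
    return probes


print(run(ring_lines=4, laps=3, keep_modified=False))
print(run(ring_lines=4, laps=3, keep_modified=True))
```

In this model every write after the first lap pays for a probe unless the line is kept Modified.&lt;br /&gt;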
&lt;br /&gt;
You can find more information on how this is implemented and various other ways of optimizations in this manual [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors]&lt;br /&gt;
&lt;br /&gt;
==Dragon Protocol==&lt;br /&gt;
The '''[http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]''' is an update-based coherence protocol: unlike the coherence protocols seen so far, it does not invalidate other cached copies. Write propagation is achieved by updating the cached copies instead of invalidating them.  The Dragon Protocol does not update memory on a cache-to-cache transfer; it delays bringing memory in line with the caches until the data is evicted and written back, which saves time and lowers the memory access requirements. Moreover, only the written '''byte''' or '''word''' is '''communicated''' to the '''other caches''' instead of the whole block, which further '''reduces''' the '''bandwidth''' usage. It can dynamically detect the sharing status of a block, using a write-through policy for shared blocks and write-back for currently non-shared blocks. The Dragon Protocol employs the following four states for the cache blocks: '''Shared Clean''', '''Shared Modified''', '''Exclusive''' and  '''Modified'''. &lt;br /&gt;
* '''Modified (M)''' and '''Exclusive (E)''' - these states have the same meaning as explained in the protocols above. &lt;br /&gt;
* '''Shared Modified (Sm)''' - Potentially two or more caches have this block, memory may or may not be up to date, and this processor's cache has modified the block. Only one cache in the system can hold the line in the Shared Modified state.&lt;br /&gt;
* '''Shared Clean (Sc)''' -  Potentially two or more caches have this block and memory may or may not be up to date (if no other cache has it in Sm state, memory will be up to date else it is not).&lt;br /&gt;
When a Shared Modified line is evicted from the cache on a cache miss only then is the block written back to the main memory in order to keep memory consistent. For more information on Dragon protocol, refer to Solihin textbook, page number 229. The state transition diagram has been given below for reference.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:Dragon.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
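The update-instead-of-invalidate behaviour can be sketched for a single write hit. This is a minimal illustration assuming the four states listed above; write_word and the layout of the copies dictionary are hypothetical.&lt;br /&gt;

```python
# Sketch of a Dragon-style write hit (illustrative; names are hypothetical).
def write_word(copies, writer, offset, value):
    """Write one word and update, rather than invalidate, the other copies.

    copies maps each cache that currently holds the block to its record.
    """
    sharers = [name for name in copies if name != writer]
    copies[writer]["data"][offset] = value
    for name in sharers:
        copies[name]["data"][offset] = value  # the update carries one word
        copies[name]["state"] = "Sc"
    # With sharers the writer becomes the single Shared Modified holder;
    # memory is left untouched until that line is eventually evicted.
    copies[writer]["state"] = "Sm" if sharers else "M"
    return len(sharers)  # number of caches that received the update


copies = {
    "p0": {"state": "Sc", "data": [1, 2, 3, 4]},
    "p1": {"state": "Sc", "data": [1, 2, 3, 4]},
}
print(write_word(copies, "p0", 2, 99), copies)
```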
&lt;br /&gt;
The Dragon protocol implements snoopy caches that provide the appearance of a uniform memory space to multiple processors. Here, each cache listens to 2 buses: the processor bus and the memory bus. The caches are also responsible for address translation, so the processor bus carries virtual addresses and the memory bus carries physical addresses.  The Dragon system was designed to support 4 to 8 Dragon processors. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=[http://en.wikipedia.org/wiki/Instruction_prefetch Instruction Prefetching]=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Instruction prefetching is a technique used to speed up program execution. In multiprocessors, however, prefetching can come at the cost of correctness: prefetched data or instructions can be modified in a way that the memory coherence protocol cannot detect. In such situations software must use serializing instructions or cache-invalidation instructions to guarantee that subsequent data accesses are coherent. &lt;br /&gt;
&amp;lt;br /&amp;gt;&lt;br /&gt;
An example of this type of a situation is a page-table update followed by accesses to the physical pages referenced by the updated page tables. The physical-memory references for the page tables are different than the physical-memory references for the data. Because of prefetching there may be problems with correctness. The following sequence of events shows such a situation when software changes the translation of virtual-page A from physical-page M to physical-page N:&lt;br /&gt;
# The tables that translate virtual-page A to physical-page M are now held only in main memory. The copies in the cache are invalidated.&lt;br /&gt;
# Software changes the page-table entry for virtual-page A in main memory to point to physical-page N rather than physical-page M.&lt;br /&gt;
# Data in virtual-page A is accessed.&lt;br /&gt;
&lt;br /&gt;
Software expects the processor to access the data from physical-page N after the update. However, it is possible for the processor to prefetch the data from physical-page M before the page table for virtual page A is updated. Because the physical-memory references are different, the processor does not recognize them as requiring coherence checking and believes it is safe to prefetch the data from virtual-page A, which is translated into a read from physical page M. Similar behavior can occur when instructions are prefetched from beyond the page-table update instruction.&lt;br /&gt;
&lt;br /&gt;
To prevent such errors, software must execute a translation-invalidating instruction (on AMD64, INVLPG or a MOV to CR3) immediately after the page-table update to ensure that subsequent instruction fetches and data accesses use the correct virtual-page-to-physical-page translation. It is not necessary to perform a TLB invalidation operation preceding the table update.&lt;br /&gt;
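&lt;br /&gt;
The hazard can be illustrated with a toy model. This is a sketch, not a description of any real processor: the ToyMMU class and its methods are hypothetical, and a simple dictionary stands in for whatever translation the processor has already looked up ahead of time.&lt;br /&gt;

```python
class ToyMMU:
    """Toy model: a page table in memory plus translations already looked up."""

    def __init__(self, page_table):
        self.page_table = dict(page_table)  # virtual page to physical page
        self.cached = {}                    # translations the processor holds

    def translate(self, vpage):
        # A translation that was looked up earlier is reused without
        # re-reading the page table, so a stale mapping can survive an update.
        if vpage not in self.cached:
            self.cached[vpage] = self.page_table[vpage]
        return self.cached[vpage]

    def update_page_table(self, vpage, new_ppage):
        # Software rewrites the page-table entry in main memory
        self.page_table[vpage] = new_ppage

    def invalidate(self, vpage):
        # Stand-in for the invalidating instruction issued after the update
        self.cached.pop(vpage, None)


mmu = ToyMMU({"A": "M"})
mmu.translate("A")               # the processor looks ahead: A maps to M
mmu.update_page_table("A", "N")  # software remaps A to N
stale = mmu.translate("A")       # the old mapping is still used
mmu.invalidate("A")
fresh = mmu.translate("A")       # the new mapping is picked up
```

In this model the access made before the invalidation still resolves to physical-page M, and only the access made after it resolves to physical-page N.&lt;br /&gt;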
&lt;br /&gt;
More information can be found in the [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual].&lt;br /&gt;
&lt;br /&gt;
= CMP Implementation in Intel Architecture =&lt;br /&gt;
&lt;br /&gt;
Let us now see how Intel architecture using the MESI protocol progressed from a uniprocessor architecture to a Chip MultiProcessor (CMP) using the bus as the interconnect. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Uniprocessor Architecture'''&lt;br /&gt;
&lt;br /&gt;
The diagram below shows the structure of the memory cluster in the Intel Pentium M processor.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:intel_cache1.jpg]]&amp;lt;/center&amp;gt; &amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In this structure we have,&lt;br /&gt;
* A unified on-chip '''L1 cache''' with the '''processor/core''',&lt;br /&gt;
* A '''Memory/L2 access control unit''', through which all the accesses to the L2 cache, main memory and IO space are made,&lt;br /&gt;
* The second level '''L2 cache''' along with the '''prefetch unit''' and&lt;br /&gt;
* '''Front side bus (FSB)''', a single shared bi-directional bus across which all the traffic is sent. These wide buses bring in multiple data bytes at a time. &lt;br /&gt;
&lt;br /&gt;
As Intel explains it, using this structure, processor requests were first sought in the '''L2 cache''' and were '''forwarded''' to main '''memory''' via the front side bus ('''FSB''') only on a '''miss'''. The '''Memory/L2 access control''' unit served as a central point for '''maintaining coherence''' within the core and with the external world. It '''contains''' a '''snoop control unit''' that receives snoop requests from the bus and performs the required operations on each cache (and internal buffers) in parallel. It also handles RFO requests (BusUpgr) and ensures the operation continues only after it guarantees that no other version of the cache line exists in any other cache in the system.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''CMP Architecture'''&lt;br /&gt;
&lt;br /&gt;
For the CMP implementation, Intel chose a bus-based architecture using snoopy protocols over a '''directory protocol''': although a directory protocol reduces active power through reduced snoop activity, it '''increases''' the '''design complexity''' and the '''static power''' due to larger tag arrays. Since Intel has a large market for processors in the mobility family, a directory-based solution was less favorable, because battery life depends mainly on static power consumption and less on dynamic power.&lt;br /&gt;
Let us examine how '''CMP''' was implemented in the '''Intel Core Duo''', which was one of the first dual-core processors for the budget/entry-level market. &lt;br /&gt;
The general CMP implementation structure of the Intel Core Duo is shown below.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:intel_cache2.jpg]]&amp;lt;/center&amp;gt; &amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This structure has the following changes when compared to the uniprocessor memory cluster structure. &lt;br /&gt;
* '''L1 cache''' and the '''processor/core''' structure is '''duplicated''' to give 2 cores.&lt;br /&gt;
* The '''Memory/L2 access control''' unit is '''split''' into 2 logical units: '''L2 controller''' and '''bus controller'''. The L2 controller handles all '''requests to the L2''' cache from the core and the snoop requests from the FSB. The '''bus controller''' handles '''data and I/O requests''' to and from the FSB.&lt;br /&gt;
* The '''prefetching''' unit is extended to handle the hardware '''prefetches for each core separately'''.&lt;br /&gt;
* A '''new logical unit''' (represented by the hexagon) was added to maintain '''fairness between the requests''' coming from the different cores and hence balance the requests to L2 and memory.&lt;br /&gt;
&lt;br /&gt;
This new '''partitioned structure''' for the  memory/L2 access control unit '''enhanced''' the '''performance''' while '''reducing power consumption'''. &lt;br /&gt;
For more information on uniprocessor and multiprocessor implementation under the Intel architecture, refer to &lt;br /&gt;
[http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]&lt;br /&gt;
&lt;br /&gt;
The '''Intel bus architecture''' has been '''evolving''' to accommodate the demands of scalability while using the same MESI protocol: from a '''single shared bus''', to '''dual independent buses (DIB)''' that double the available bandwidth, to the logical conclusion of DIB, '''dedicated high-speed interconnects (DHSI)'''. The DHSI-based platforms use four FSBs, one for each processor in the platform. In both DIB and DHSI, a snoop filter in the chipset caches snoop information, thereby significantly reducing the broadcasting needed for snoop traffic on the buses. With the production of processors based on the next-generation 45-nm Hi-k Intel Core microarchitecture, the [http://en.wikipedia.org/wiki/Xeon Intel Xeon] processor fabric will transition from DHSI, with the memory controller in the chipset, to a distributed shared memory architecture using '''Intel QuickPath Interconnects with the MESIF protocol'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Word Invalidation=&lt;br /&gt;
One complexity problem applying to a number of the protocols deals with invalidation.  In newer protocols, individual words may be modified in a cache line, as opposed to the entirety of the line.  The other processors will thus have a mostly correct cache line, with only a word difference.  This leads to potential complexity, because to 'correct' the error the other processors must transition from shared to invalid, where they then can read the correct word from the bus and place it back into the line, transitioning back into shared.  This transition is potentially unnecessary, as the second processor may never access the specific word changed, but it may access other words in the cache line.  One potential solution to this is being researched at the present time, advancing the protocols so that a cache line is &amp;quot;not invalidated on the first dirty word, but after the number of dirty words crosses some predetermined value, which is data type and application dependent.&amp;quot;  In other words, if the application can possibly have multiple words 'incorrect,' several transitions to and from the invalid state may be avoided.&lt;br /&gt;
&amp;lt;br/&amp;gt;Source: http://tab.computer.org/tcca/NEWS/sept96/dsmideas.ps&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In essence, the solution proposed here is to advance the MOESI protocol with word invalidation and specific treatment of temporal and spatial data, so that the block is not invalidated.&lt;br /&gt;
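&lt;br /&gt;
A minimal sketch of the threshold idea follows. It is an assumption-laden illustration rather than the proposal from the cited source: the ThresholdLine class, its fields, and the choice to track stale words by offset are all hypothetical.&lt;br /&gt;

```python
class ThresholdLine:
    """A remote copy of a cache line that tolerates a few stale words."""

    def __init__(self, threshold):
        self.threshold = threshold  # assumed tunable per data type or application
        self.stale_words = set()    # word offsets written by another processor
        self.state = "S"

    def remote_write(self, word_offset):
        """Record a write by another processor and return the resulting state."""
        if self.state == "I":
            return self.state
        self.stale_words.add(word_offset)
        # The stale count grows by at most one per write, so equality with
        # threshold + 1 marks the moment the count crosses the threshold.
        if len(self.stale_words) == self.threshold + 1:
            self.state = "I"
        return self.state

    def can_read(self, word_offset):
        # Reads of untouched words still hit while the line stays shared
        return self.state == "S" and word_offset not in self.stale_words
```

With a threshold of 2, for example, the first two remote writes leave the line in the shared state and reads of the untouched words still hit; only the third distinct dirty word forces the transition to invalid.&lt;br /&gt;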
&lt;br /&gt;
=Performance: MOESI vs MEI/MESI=&lt;br /&gt;
Early multiprocessors (such as the PowerPC processors) were designed to work with three states (Modified, Exclusive, Invalid - this is similar to the MSI protocol discussed earlier, the term MEI is used to remain consistent with the sources used for reference).  As time progressed, more multi-processors transitioned to the MESI protocol.  This is most likely due to the characteristics of MEI - it is &amp;quot;easy to implement but leads to inefficiencies in the way that memory bus bandwidth is used&amp;quot; [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 source].  As a result, systems with two or more MEI processors (most likely) will not see the 'full' potential of the extra processors, due to the inefficient bus transactions.&lt;br /&gt;
&lt;br /&gt;
In contrast, the five-state protocol MOESI is more complex to implement than MESI and MEI/MSI.  However, advantages arise from using it.  As Andy Keane (VP of Marketing for PMC-Sierra) put it, &amp;quot;this [fifth] state allows shared data that is dirty to remain in the cache.  Without this state, any shared line would be written back to memory to change the original state from modified. Since we have a dedicated, fast path from CPU to CPU, this state maximizes the use of this path rather than the path to memory, which is inherently slower&amp;quot; [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 source].  This advantage can be realized by allowing MOESI to make some processor-to-processor transfers from level 2 caches.&lt;br /&gt;
&lt;br /&gt;
Source: [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 Edwards, Chris (08/06/2001). &amp;quot;Architects wrestle with multiprocessor options&amp;quot;. Electronic engineering times (0192-1541), (1178), p. 48.]&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Cache_coherence Cache coherence]&lt;br /&gt;
# [http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]&lt;br /&gt;
# [http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Symmetric_multiprocessing Common System Interface in Intel Processors]&lt;br /&gt;
# [http://www.zak.ict.pwr.wroc.pl/nikodem/ak_materialy/Cache%20consistency%20&amp;amp;%20MESI.pdf Cache consistency with MESI on Intel processor]&lt;br /&gt;
# [http://techreport.com/articles.x/8236/2 AMD dual core Architecture]&lt;br /&gt;
# [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual]&lt;br /&gt;
# [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors]&lt;br /&gt;
# [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of AMD 64 bit core]&lt;br /&gt;
# [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&amp;amp;arnumber=4913 Silicon Graphics Computer Systems]&lt;br /&gt;
# [http://books.google.com/books?id=g82fofiqa5IC&amp;amp;printsec=frontcover&amp;amp;dq=Parallel+computer+architecture:+a+hardware/software+approach+By+David+E.+Culler,+Jaswinder+Pal+Singh,+Anoop+Gupta&amp;amp;source=bl&amp;amp;ots=COrdamlfVn&amp;amp;sig=YcugVqbzTjHvlofvaFq6Ft_tjfY&amp;amp;hl=en&amp;amp;ei=0ZO6S4TJGcOclgejzI3BBw&amp;amp;sa=X&amp;amp;oi=book_result&amp;amp;ct=result&amp;amp;resnum=1&amp;amp;ved=0CAgQ6AEwAA#v=onepage&amp;amp;q=&amp;amp;f=false Parallel computer architecture: a hardware/software approach By David E. Culler, Jaswinder Pal Singh, Anoop Gupta]&lt;br /&gt;
# [http://www.freepatentsonline.com/5283886.html Three state invalidation protocols]&lt;br /&gt;
# [http://portal.acm.org/citation.cfm?id=1499317&amp;amp;dl=GUIDE&amp;amp;coll=GUIDE&amp;amp;CFID=83027384&amp;amp;CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems]&lt;br /&gt;
# [http://portal.acm.org/citation.cfm?id=6514 Cache Coherence protocols: evaluation using a multiprocessor simulation model]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Xerox_Dragon Xerox Dragon]&lt;br /&gt;
# [http://thanaseto.110mb.com/courses/CSD-527-report-engl.pdf Coherence Protocols]&lt;br /&gt;
# [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&amp;amp;arnumber=289691 XDBus]&lt;br /&gt;
# [http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dai0228a/index.html ARM]&lt;br /&gt;
# [http://ixbtlabs.com/articles/ibmpower4/index.html IBM POWER4]&lt;br /&gt;
# [http://www.chiplist.com/Intel_Itanium_2_processor/tree3f-section--2235-/ Intel Itanium 2]&lt;br /&gt;
# [http://techreport.com/articles.x/8236/2 MESI-MESI-MOESI Banana-fana...]&lt;br /&gt;
# [http://rsim.cs.illinois.edu/rsim/Manual/node109.html RSIM]&lt;br /&gt;
# [http://dl.acm.org/citation.cfm?id=1499317 Synapse tightly coupled multiprocessors]&lt;/div&gt;</summary>
		<author><name>Pxu</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2010/8a_fu&amp;diff=59791</id>
		<title>CSC/ECE 506 Spring 2010/8a fu</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2010/8a_fu&amp;diff=59791"/>
		<updated>2012-03-19T01:26:56Z</updated>

		<summary type="html">&lt;p&gt;Pxu: /* MOESI */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Introduction to bus-based cache coherence in real machines=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==SMP Architecture==&lt;br /&gt;
Most parallel software in the commercial market relies on the shared-memory programming model, in which all processors access the same physical address space. The most common multiprocessors today use the [http://en.wikipedia.org/wiki/Symmetric_multiprocessing SMP] architecture, which uses a common bus as the interconnect.  In the case of multicore processors (&amp;quot;chip multiprocessors,&amp;quot; or CMP), the SMP architecture applies to the cores, treating them as separate processors. The key problem of shared-memory multiprocessors is providing a consistent view of memory across the various cache hierarchies.  This is called the '''''cache coherence problem'''''. Cache coherence is critical for correctness and is also a performance-sensitive design point for supporting the shared-memory model. The cache coherence mechanisms not only govern communication in a shared-memory multiprocessor, but also typically determine how the memory system transfers data between processors, caches, and memory.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:Busbased SMP.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
At any point in logical time, the permissions for a cache block can allow either a single writer or multiple readers. The '''''coherence protocol''''' ensures the invariants of the states are maintained. The different coherent states used by most of the cache coherence protocols are as shown in ''Table 1'':&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''States'''&lt;br /&gt;
|  '''Access Type'''&lt;br /&gt;
|  '''Invariant'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  read, write&lt;br /&gt;
|  all other caches in I state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  read&lt;br /&gt;
|  all other caches in I state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Owned'''&lt;br /&gt;
|  read&lt;br /&gt;
|  all other caches in I or S state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  read&lt;br /&gt;
|  no other cache in M or E state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|  -&lt;br /&gt;
|  -&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Snooping Protocols=&lt;br /&gt;
==MSI Protocol==&lt;br /&gt;
&lt;br /&gt;
'''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' is a three-state write-back '''invalidation protocol''' and one of the earliest snooping-based cache coherence protocols. It marks each cache line as '''Modified (M)''', '''Shared (S)''' or '''Invalid (I)'''. '''Invalid''' means the cache line is either not present or is in the invalid state. If the cache line is clean and may be shared by more than one processor, it is marked '''Shared'''. If the cache line is dirty and the processor has exclusive ownership of it, it is in the '''Modified''' state. A BusRdX causes all other caches to invalidate (demote) their copies to the '''I''' state; a cache that holds the line in the '''M''' state also flushes it. A processor write, even if it hits a line in the '''S''' state, issues a BusRdX, and the line is promoted (upgraded) to the '''M''' state.&lt;br /&gt;
&lt;br /&gt;
The state diagram of the MSI protocol is shown below. Note that the left part shows the response to processor-side requests, and the right part shows the response to snooper-side requests.&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:MSInew.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
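&lt;br /&gt;
The two halves of the diagram can be sketched as two small functions. This is an illustrative Python sketch of the textbook MSI protocol, not code from any real cache controller; the function names and return conventions are assumptions.&lt;br /&gt;

```python
def msi_processor(state, op):
    """Processor side: return (next_state, bus_request or None)."""
    if op == "PrRd":
        # A read misses only in I; hits in S or M need no bus traffic
        return ("S", "BusRd") if state == "I" else (state, None)
    # PrWr: any state other than M must gain exclusive ownership first
    return ("M", None) if state == "M" else ("M", "BusRdX")


def msi_snoop(state, bus_op):
    """Snooper side: return (next_state, action or None)."""
    if state == "M":
        # The only up-to-date copy is here, so it is flushed either way
        return ("S" if bus_op == "BusRd" else "I"), "Flush"
    if state == "S" and bus_op == "BusRdX":
        return "I", None
    return state, None
```

For instance, a write to a line held in the S state returns the M state together with a BusRdX request, which matches the upgrade described in the text.&lt;br /&gt;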
&lt;br /&gt;
==Synapse protocol==&lt;br /&gt;
From the state transition diagram of MSI, we observe that there is a transition from state '''M''' to state '''S''' when a BusRd is observed for that block. The contents of the block are flushed to the bus before going to the '''S''' state. In certain cases it would be more appropriate to move to the '''I''' state, thus giving up the block entirely. The choice between '''S''' and '''I''' reflects the designer's assertion about whether the original processor is more likely to continue reading the block than the new processor is to write to it. The Synapse protocol, used in the early Synapse multiprocessor, made the alternate choice of going directly from the '''M''' state to the '''I''' state on a BusRd, assuming the migratory pattern would be more frequent. More details about this protocol can be found in these papers published in the 1980s: [http://portal.acm.org/citation.cfm?id=6514 Cache Coherence protocols: evaluation using a multiprocessor simulation model] and [http://portal.acm.org/citation.cfm?id=1499317&amp;amp;dl=GUIDE&amp;amp;coll=GUIDE&amp;amp;CFID=83027384&amp;amp;CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems].&lt;br /&gt;
&lt;br /&gt;
In Synapse protocol '''M''' state is called '''D''' (Dirty) state. The following is the state transition diagram for Synapse protocol:&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:Synapse1.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
===Real Architectures using Synapse===&lt;br /&gt;
Nothing can be found on any actual architecture using the Synapse cache-coherence protocol.  Judging from the abstract of the 1984 paper by Steve Frank and Armond Inselberg, &amp;quot;[http://dl.acm.org/citation.cfm?id=1499317 Synapse Tightly Coupled Multiprocessors: A New Approach to Solve Old Problems]&amp;quot;, Synapse was theoretical.&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[http://www.disi.unige.it/person/DelzannoG/CacheProtocol/synapse.hy The BABYLON Project] provides one example of a Synapse cache coherence protocol with atomic synchronization actions.&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
In the Synapse system, memory contention is reduced by designing a processor cache that employs a non-write-through algorithm to minimize bandwidth between cache and shared memory. The Synapse Expansion Bus includes an ownership-level protocol between processor caches.[[#References|&amp;lt;sup&amp;gt;[24]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
==[http://en.wikipedia.org/wiki/MESI_protocol MESI]==&lt;br /&gt;
The main drawback of MSI is that each read-write sequence incurs two bus transactions irrespective of whether the cache line is stored in only one cache or not. Highly parallel programs that have little data sharing suffer the most from this.  MESI protocol solves this problem by introducing the '''Exclusive''' state to distinguish between a cache line stored in multiple caches and a line stored in a single cache.&lt;br /&gt;
Let us briefly see how the MESI protocol works. For a more detailed version, refer to the Solihin textbook, p. 215.&lt;br /&gt;
&lt;br /&gt;
The MESI coherence protocol marks each cache line as being in one of the Modified, Exclusive, Shared, or Invalid states. &lt;br /&gt;
* '''Invalid''' : The cache line is either not present or is invalid&lt;br /&gt;
* '''Exclusive''' : The cache line is clean and is owned by this core/processor only&lt;br /&gt;
* '''Modified''' : The cache line is dirty and the core/processor has exclusive ownership of it; the copy in memory is stale.&lt;br /&gt;
* '''Shared''' : The cache line is clean and is shared by more than one core/processor&lt;br /&gt;
&lt;br /&gt;
The MESI protocol works as follows: &lt;br /&gt;
A line that is fetched receives the '''E''' or '''S''' state depending on whether it exists in other processors in the system. A cache line gets the '''M''' state when a processor writes to it; if the line is not in the '''E''' or '''M''' state prior to the write, the cache sends a Bus Upgrade (BusUpgr) signal or, as the Intel manuals term it, a “Read-For-Ownership (RFO) request”, which ensures that the line exists in the cache and is in the '''I''' state in all other processors on the bus (if any). The table below summarizes the MESI protocol.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''This cache line is valid?'''&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  No&lt;br /&gt;
|-&lt;br /&gt;
|  '''The memory copy is…'''&lt;br /&gt;
|  out of date&lt;br /&gt;
|  valid&lt;br /&gt;
|  valid&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies exist in caches of other processors?'''&lt;br /&gt;
|  No&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe&lt;br /&gt;
|  Maybe&lt;br /&gt;
|-&lt;br /&gt;
|  '''A write to this line'''&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  goes directly to bus&lt;br /&gt;
|}&lt;br /&gt;
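&lt;br /&gt;
The benefit of the Exclusive state shown in the table above can be made concrete by counting bus transactions. The following Python sketch is illustrative only: the function names are hypothetical, and issuing a BusRdX for a write to an invalid line is a textbook convention assumed here rather than something stated in the text.&lt;br /&gt;

```python
def mesi_processor(state, op, others_have_copy=False):
    """Processor side of MESI: return (next_state, bus_request or None)."""
    if op == "PrRd":
        if state == "I":
            # The fetched line is E when no other cache holds it, else S
            return ("S" if others_have_copy else "E"), "BusRd"
        return state, None
    # PrWr
    if state in ("M", "E"):
        return "M", None       # E upgrades to M silently, with no bus traffic
    if state == "S":
        return "M", "BusUpgr"  # invalidate the other copies
    return "M", "BusRdX"       # assumed: fetch with intent to write


def count_bus_transactions(ops, others_have_copy=False):
    """Run a sequence of processor operations starting from the I state."""
    state, count = "I", 0
    for op in ops:
        state, request = mesi_processor(state, op, others_have_copy)
        if request is not None:
            count += 1
    return state, count
```

A read followed by a write to an unshared line costs a single bus transaction here, whereas the same sequence on a shared line costs two, which is the cost MSI pays in every case.&lt;br /&gt;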
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The transition diagram from the lecture slides is given below for reference.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:MESI.jpg]]&amp;lt;/center&amp;gt; &amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Real Architectures using MESI===&lt;br /&gt;
&lt;br /&gt;
Intel's [http://en.wikipedia.org/wiki/Pentium_Pro Pentium Pro] microprocessor, introduced in 1995, was the first Intel architecture microprocessor to support symmetric multiprocessing in various multiprocessor configurations.  SMP with the MESI protocol was the architecture used consistently until the introduction of the 45-nm Hi-k Core microarchitecture in Intel's (Nehalem-EP) quad-core x86-64 processors. The 45-nm Hi-k Intel Core microarchitecture utilizes a new system framework called the [http://en.wikipedia.org/wiki/Intel_QuickPath_Interconnect QuickPath Interconnect], which uses point-to-point interconnection technology based on a distributed shared memory architecture.  It uses a modified version of the MESI protocol called [http://en.wikipedia.org/wiki/MESIF_protocol MESIF], which introduces the additional Forward state.  (The state diagram of MESI transitions that occur within the Pentium's data cache can be found on page 63 of [http://books.google.com/books?id=TVzjEZg1--YC&amp;amp;printsec=frontcover&amp;amp;source=gbs_ge_summary_r&amp;amp;cad=0#v=onepage&amp;amp;q&amp;amp;f=false Pentium Processor System Architecture: Second Edition] by Don Anderson, Tom Shanley, MindShare, Inc.)&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
The [http://www.arm.com/products/processors/classic/arm11/arm11-mpcore.php ARM11 MPCore] and [http://en.wikipedia.org/wiki/ARM_Cortex-A9_MPCore Cortex-A9 MPCore] processors support the MESI cache coherency protocol.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]  ARM MPCore defines the states of the MESI protocol it implements as:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies in other caches'''&lt;br /&gt;
|  NO&lt;br /&gt;
|  NO&lt;br /&gt;
|  YES&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Clean or Dirty'''&lt;br /&gt;
|  DIRTY&lt;br /&gt;
|  CLEAN&lt;br /&gt;
|  CLEAN&lt;br /&gt;
|  -&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
The coherency protocol is implemented and managed by the Snoop Control Unit (SCU) in the ARM MPCore, which monitors the traffic between local L1 data caches and the next level of the memory hierarchy. At boot time, each core can choose to partake in the coherency domain or not.  Unless explicit system calls bind a task to a specific core ([http://en.wikipedia.org/wiki/Processor_affinity processor affinity]), there is a high chance that a task will at some point migrate to a different core, along with its data as it is used.  Migration of tasks is not handled efficiently by a literal MESI implementation, so the ARM MPCore offers two optimizations that allow for MESI compliance and migration of tasks: '''Direct Data Intervention (DDI)''' (in which the SCU keeps a copy of the tag RAMs of all cores’ caches, which enables it to efficiently detect whether a cache line requested by a core is held by another core in the coherency domain before looking for it in the next level of the memory hierarchy) and '''Cache-to-cache Migration''' (where, if the SCU finds that the cache line requested by one CPU is present in another core, it will either copy it (if clean) or move it (if dirty) from the other CPU directly into the requesting one, without interacting with external memory).  These optimizations reduce memory traffic in and out of the L1 cache subsystem by eliminating interaction with external memories, which in effect reduces the overall load on the interconnect and the overall power consumption.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
Other architectures that use the MESI cache-coherence protocol include the L2 cache of the IBM POWER4 processor[[#References|&amp;lt;sup&amp;gt;[20]&amp;lt;/sup&amp;gt;]], the L2 cache of the Intel Itanium 2 processor[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]], and the Intel Xeon[[#References|&amp;lt;sup&amp;gt;[22]&amp;lt;/sup&amp;gt;]].&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
One implementation complexity already mentioned is the inefficiency of task migration in MESI that the ARM MPCore addresses.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]  Another possible implementation complexity is found during the replacement of a cache line: one possible MESI implementation requires a message to be sent to main memory when a cache line is flushed (i.e. an E to I transition), as the line was exclusively in one cache before it was removed. It is possible to avoid this replacement message if the system is designed so that the flush of a modified (exclusive) line requires an acknowledgment from main memory. However, this requires the flush to be stored in a 'write-back' buffer until the reply arrives (to ensure the change is successfully propagated to memory).[[#References|&amp;lt;sup&amp;gt;[23]&amp;lt;/sup&amp;gt;]]&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==MESIF==&lt;br /&gt;
Let us now walk through a brief overview of the '''MESIF protocol''':&lt;br /&gt;
&lt;br /&gt;
The '''MESIF''' protocol, used in the latest Intel multi-core processors, was introduced to '''accommodate the point-to-point''' links used in the QuickPath Interconnect. Using the '''MESI''' protocol in this architecture would send many redundant messages between different processors, often with unnecessarily high latency. For example, when a processor requests a cache line that is stored in multiple locations, every location might respond with the data. As the requesting processor only needs a single copy of the data, the system would be wasting bandwidth. &lt;br /&gt;
As a solution to this problem, an additional state, '''Forward state''', was added by slightly changing the role of the Shared state. Whenever there is a read request, only the cache line in the F state will respond to the request, while all the S state caches remain dormant.  Hence, by designating a single cache line to '''respond to requests''', coherency traffic is substantially reduced when multiple copies of the data exist. Also, on a read request, the F state transitions from F to S state. That is, when a cache line in the '''F''' state is '''copied''', the F state '''migrates''' to the '''newer copy''', while the '''older''' one drops back to '''S'''. Moving the new copy to the F state '''exploits''' both '''temporal and spatial locality'''. Because the newest copy of the cache line is always in the F state, it is very unlikely that the line in the F state will be evicted from the caches. This takes advantage of the temporal locality of the request. The second advantage is that if a particular cache line is in high demand due to spatial locality, the bandwidth used to transmit that data will be spread across several nodes.&lt;br /&gt;
All M to S state transitions and E to S state transitions will now be from '''M to F''' and '''E to F'''.  &lt;br /&gt;
The '''F state''' is '''different''' from the '''Owned state''' of the MOESI protocol as it is '''not''' a unique copy because a valid copy is stored in memory. Thus, unlike the Owned state of the MOESI protocol, in which the data in the O state is the only valid copy of the data, the data in the F state can be evicted or converted to the S state, if desired. &lt;br /&gt;
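&lt;br /&gt;
The migration of the F state on a read can be sketched as follows. This toy Python model is an assumption-laden illustration, not Intel's implementation: it only tracks the F, S and I states of clean copies, the mesif_read function and the cache names are hypothetical, and a read with no other cached copy is assumed to give the requester the E state.&lt;br /&gt;

```python
def mesif_read(caches, requester):
    """Serve a read miss; `caches` maps a cache name to "F", "S" or "I"."""
    forwarders = [name for name, state in caches.items() if state == "F"]
    sharers = [name for name, state in caches.items() if state == "S"]
    if forwarders:
        supplier = forwarders[0]
        caches[supplier] = "S"  # the old forwarder drops back to S
    else:
        supplier = "memory"     # S copies stay silent, so memory answers
    # The newest copy becomes the forwarder if anyone else holds the line
    caches[requester] = "F" if (forwarders or sharers) else "E"
    return supplier


caches = {"P0": "F", "P1": "S", "P2": "I"}
supplier = mesif_read(caches, "P2")
```

After this read, P0 has supplied the data and dropped to S, P1 never responded, and P2 holds the single F copy, so exactly one cache answers the next request as well.&lt;br /&gt;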
&lt;br /&gt;
More information on the QuickPath Interconnect and MESIF protocol can be found at&lt;br /&gt;
'''[http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]'''&lt;br /&gt;
&lt;br /&gt;
==MOESI==&lt;br /&gt;
[http://en.wikipedia.org/wiki/Opteron AMD Opteron] was AMD’s first-generation dual-core processor, with 2 distinct [http://en.wikipedia.org/wiki/Athlon_64 K8 cores] together on a single die.  Cache coherence becomes a bigger problem on such multiprocessors, so an appropriate coherence protocol was needed to address it. The [http://en.wikipedia.org/wiki/Xeon Intel Xeon], the competing processor from Intel, used the MESI protocol to handle cache coherence.  MESI has the drawback of consuming considerable time and bandwidth in certain situations. &lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/MOESI_protocol MOESI] was AMD’s answer to this problem. MOESI adds a fifth state to the MESI protocol, called '''“Owned”'''. It addresses the bandwidth problem that MESI faces when a processor with an invalid copy in its cache wants to access data that another processor has modified: the requesting processor has to wait for the processor that modified the data to write it back to main memory, which takes time and bandwidth. MOESI removes this drawback by allowing '''''dirty sharing'''''. When a processor holds the data in the new '''“Owned”''' state, it can provide the modified data to other processors without first writing it to main memory. The processor holding the data in the '''&amp;quot;Owned&amp;quot;''' state remains responsible for updating main memory later, when the cache line is evicted.&lt;br /&gt;
&lt;br /&gt;
MOESI has become one of the most popular snoop-based protocols and is supported in the AMD64 architecture.  The AMD dual-core Opteron can maintain cache coherence in systems of up to 8 processors using this protocol.&lt;br /&gt;
&lt;br /&gt;
The five different states of the MOESI protocol are:&lt;br /&gt;
* '''Modified (M)''' : The cache line holds the most recent copy of the data, and no other processor cache holds a copy.&lt;br /&gt;
* '''Owned (O)'''   : The cache line holds the most recent, correct copy of the data, which may also be shared by other processors. The processor holding the line in this state is responsible for writing the correct value back to main memory before the line is evicted.  &lt;br /&gt;
* '''Exclusive (E)''' : The cache line holds the most recent, correct copy of the data; it is present only on this processor, and main memory holds the same copy.  &lt;br /&gt;
* '''Shared (S)''' : A cache line in the shared state holds the most recent, correct copy of the data, which may be shared by other processors. &lt;br /&gt;
* '''Invalid (I)''' : A cache line does not hold a valid copy of the data.&lt;br /&gt;
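Dirty sharing through the Owned state can be sketched in a few lines of Python. The &lt;code&gt;Line&lt;/code&gt; class, the &lt;code&gt;memory&lt;/code&gt; dictionary, and both functions are hypothetical illustrations that track a single cache line; they are not taken from any AMD documentation.&lt;br /&gt;

```python
class Line:
    """One cache's copy of a single line (hypothetical model)."""

    def __init__(self, state="I", value=None):
        self.state = state
        self.value = value


memory = {"value": 0, "writes": 0}


def remote_read(owner, reader):
    """`reader` misses on a line that `owner` holds dirty (M or O)."""
    assert owner.state in ("M", "O")
    owner.state = "O"              # stays dirty; memory is not updated
    reader.state = "S"
    reader.value = owner.value     # cache-to-cache transfer of the dirty data


def evict(line):
    """Write back on eviction if this cache is responsible for the data."""
    if line.state in ("M", "O"):
        memory["value"] = line.value
        memory["writes"] += 1
    line.state, line.value = "I", None


p0 = Line("M", 42)
p1 = Line()
remote_read(p0, p1)   # p0: M to O, p1: I to S, memory still stale
evict(p0)             # only now does memory receive 42
```

In this sketch the read by the second processor costs no memory write; the single write-back is deferred until the owner evicts the line.&lt;br /&gt;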
&lt;br /&gt;
A detailed explanation of how this protocol is implemented on AMD processors can be found in [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of the AMD 64-bit core].&lt;br /&gt;
&lt;br /&gt;
The following table summarizes the MOESI protocol:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Owned'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''This cache line is valid?'''&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  No&lt;br /&gt;
|-&lt;br /&gt;
|  '''The memory copy is…'''&lt;br /&gt;
|  out of date&lt;br /&gt;
|  out of date&lt;br /&gt;
|  valid&lt;br /&gt;
|  valid&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies exist in caches of other processors?'''&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe&lt;br /&gt;
|  Maybe&lt;br /&gt;
|-&lt;br /&gt;
|  '''A write to this line'''&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  goes directly to bus&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The state transitions for MOESI are shown below: &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[File:MOESI.jpg|300px|thumb|center|alt text]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;MOESI State Transition Diagram&amp;lt;/center&amp;gt;&lt;br /&gt;
&amp;lt;br /&amp;gt;&lt;br /&gt;
&amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Optimization techniques on MOESI===&lt;br /&gt;
&lt;br /&gt;
In real machines, applying optimization techniques to the standard cache coherence protocol improves performance. For example, the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] family of microprocessors (Family 0x10), which is AMD’s first generation to incorporate 4 distinct cores on a single die and the first to have a cache that all the cores share, uses the MOESI protocol with some optimization techniques incorporated. &lt;br /&gt;
&lt;br /&gt;
The optimization targets a small subset of compute problems that behave like producer-consumer programs. In such a problem, a thread running on one core produces data, which is consumed by a thread running on a separate core. With such programs, it is desirable to have the two cores communicate through the shared cache, to avoid round trips to and from main memory. However, the '''MOESI''' protocol that the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] cache uses for cache coherence can limit the achievable bandwidth. Keeping the cache line in the '''‘M’''' state for such computing problems gives better performance.&lt;br /&gt;
&lt;br /&gt;
When the producer thread writes a new entry, it allocates cache lines in the '''modified (M)''' state. Eventually, these M-marked cache lines start to fill the L3 cache. When the consumer reads a cache line, the MOESI protocol changes the state of the line to '''owned (O)''' in the L3 cache and pulls down a '''shared (S)''' copy for the consumer's own use. The producer thread then circles the ring buffer and arrives back at the cache line it had previously written. However, when the producer attempts to write new data to that cache line, now marked '''‘O’''', it finds that it cannot, since a line marked '''‘O’''' by the previous consumer read does not carry sufficient permission for a write request in the MOESI protocol. To maintain coherence, the memory controller must initiate probes in the other caches to handle any S copies that may exist, which slows down the write.&lt;br /&gt;
&lt;br /&gt;
Thus, it is preferable to keep the cache line in the '''‘M’''' state in the L3 cache. In that case, when the producer comes back around the ring buffer, it finds the previously written cache line still marked '''‘M’''', to which it is safe to write without coherence concerns. Applying such optimizations to the standard protocols thus yields better performance in real machines.&lt;br /&gt;
&lt;br /&gt;
More information on how this is implemented, and on various other optimizations, can be found in the [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors].&lt;br /&gt;
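The cost difference between the two cases can be made concrete with a toy Python model of the ring buffer. The function &lt;code&gt;run_ring&lt;/code&gt; and its &lt;code&gt;consumer_read_demotes&lt;/code&gt; switch are hypothetical; the model only counts producer writes that find a slot in the O state and makes no claim about how real hardware keeps the line in M.&lt;br /&gt;

```python
def run_ring(slots, laps, consumer_read_demotes):
    """Count producer writes that need coherence probes over `laps` laps."""
    state = ["I"] * slots
    probes = 0
    for _ in range(laps):
        for i in range(slots):
            if state[i] == "O":
                probes += 1      # O lacks write permission: probe the other caches
            state[i] = "M"       # the producer writes the slot
            # The consumer reads the slot before the producer returns to it.
            if consumer_read_demotes:
                state[i] = "O"   # standard MOESI: M becomes O, consumer gets S
    return probes


with_demotion = run_ring(slots=4, laps=3, consumer_read_demotes=True)
kept_modified = run_ring(slots=4, laps=3, consumer_read_demotes=False)
```

Under these assumptions, every producer write after the first lap pays for a probe when consumer reads demote the line, and none do when the line stays in M.&lt;br /&gt;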
&lt;br /&gt;
==Dragon Protocol==&lt;br /&gt;
The '''[http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]''' is an update-based coherence protocol: unlike the coherence protocols seen so far, it does not invalidate other cached copies. Write propagation is achieved by updating the cached copies instead of invalidating them.  The Dragon Protocol does not update memory on a cache-to-cache transfer; it delays making memory consistent with the cache until the data is evicted and written back, which saves time and lowers the memory access requirements. Moreover, only the written '''byte''' or '''word''' is '''communicated''' to the '''other caches''' instead of the whole block, which further '''reduces''' the '''bandwidth''' usage. The protocol dynamically detects the sharing status of a block and uses a write-through policy for shared blocks and a write-back policy for blocks that are currently not shared. The Dragon Protocol employs the following four states for the cache blocks: '''Shared Clean''', '''Shared Modified''', '''Exclusive''' and  '''Modified'''. &lt;br /&gt;
* '''Modified (M)''' and '''Exclusive (E)''' - these states have the same meaning as explained in the protocols above. &lt;br /&gt;
* '''Shared Modified (Sm)''' - Potentially two or more caches have this block, memory may or may not be up to date, and this processor's cache has modified the block. Only one cache in the system can hold the block in the Shared Modified state.&lt;br /&gt;
* '''Shared Clean (Sc)''' -  Potentially two or more caches have this block and memory may or may not be up to date (if no other cache has it in Sm state, memory will be up to date else it is not).&lt;br /&gt;
Only when a Shared Modified line is evicted from the cache on a cache miss is the block written back to main memory in order to keep memory consistent. For more information on the Dragon protocol, refer to the Solihin textbook, page 229. The state transition diagram is given below for reference.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:Dragon.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The Dragon protocol was implemented with snoopy caches that provided the appearance of a uniform memory space to multiple processors. Each cache listens to 2 buses: the processor bus and the memory bus. The caches are also responsible for address translation, so the processor bus carries virtual addresses and the memory bus carries physical addresses.  The Dragon system was designed to support 4 to 8 Dragon processors. &lt;br /&gt;
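The update-on-write behavior of Dragon can be sketched as follows. &lt;code&gt;DragonCache&lt;/code&gt; and &lt;code&gt;write_word&lt;/code&gt; are hypothetical names, and the model covers only a processor write hit on one block; read misses, write misses, and eviction are left out.&lt;br /&gt;

```python
class DragonCache:
    def __init__(self, name):
        self.name = name
        self.state = None        # None means the block is not cached
        self.block = None


def write_word(writer, others, index, value):
    """Processor write hit in `writer`; return the number of words sent on the bus."""
    writer.block[index] = value
    sharers = [c for c in others if c.state is not None]
    if not sharers:
        writer.state = "M"       # no other copy: the write stays local
        return 0
    for c in sharers:
        c.block[index] = value   # update the copy instead of invalidating it
        c.state = "Sc"
    writer.state = "Sm"          # the writer now owns the dirty shared block
    return 1                     # only the written word is broadcast


p0, p1 = DragonCache("P0"), DragonCache("P1")
p0.state, p0.block = "Sc", [0, 0, 0, 0]
p1.state, p1.block = "Sc", [0, 0, 0, 0]
sent = write_word(p0, [p1], 2, 9)   # p1 is updated in place, not invalidated
```

After the write, both caches hold the new word and memory has not been touched, which matches the delayed write-back described for the Sm state.&lt;br /&gt;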
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=[http://en.wikipedia.org/wiki/Instruction_prefetch Instruction Prefetching]=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Instruction prefetching is a technique used to speed up the execution of a program. In multiprocessors, however, prefetching can come at the cost of correctness: prefetched data can be modified in ways whose effects the memory coherence protocol cannot handle. In such situations, software must use serializing instructions or cache-invalidation instructions to guarantee that subsequent data accesses are coherent. &lt;br /&gt;
&amp;lt;br /&amp;gt;&lt;br /&gt;
An example of such a situation is a page-table update followed by accesses to the physical pages referenced by the updated page tables. The physical-memory references for the page tables are different from the physical-memory references for the data. Because of prefetching, there may be problems with correctness. The following sequence of events shows such a situation when software changes the translation of virtual-page A from physical-page M to physical-page N:&lt;br /&gt;
# The tables that translate virtual-page A to physical-page M are now held only in main memory. The copies in the cache are invalidated.&lt;br /&gt;
# Software changes the page-table entry for virtual-page A in main memory to point to physical-page N rather than physical-page M.&lt;br /&gt;
# Data in virtual-page A is accessed.&lt;br /&gt;
&lt;br /&gt;
Software expects the processor to access the data from physical-page N after the update. However, it is possible for the processor to prefetch the data from physical-page M before the page table for virtual page A is updated. Because the physical-memory references are different, the processor does not recognize them as requiring coherence checking and believes it is safe to prefetch the data from virtual-page A, which is translated into a read from physical page M. Similar behavior can occur when instructions are prefetched from beyond the page-table update instruction.&lt;br /&gt;
&lt;br /&gt;
To prevent such errors, software must execute a TLB-invalidating instruction (on AMD64, INVLPG or a MOV to CR3) immediately after the page-table update to ensure that subsequent instruction fetches and data accesses use the correct virtual-page-to-physical-page translation. It is not necessary to perform a TLB invalidation operation preceding the table update.&lt;br /&gt;
&lt;br /&gt;
More information about this can be found in the [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual].&lt;br /&gt;
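The stale-translation hazard can be mimicked with a toy Python model. The dictionaries &lt;code&gt;page_table&lt;/code&gt;, &lt;code&gt;tlb&lt;/code&gt;, and &lt;code&gt;physical&lt;/code&gt; and the two functions are hypothetical stand-ins: a cached translation plays the role of the early (prefetched) translation, and &lt;code&gt;invalidate&lt;/code&gt; plays the role of the invalidation step that software runs after the page-table update.&lt;br /&gt;

```python
page_table = {"A": "M"}          # virtual page A maps to physical page M
tlb = {}                         # cached (possibly prefetched) translations
physical = {"M": "old data", "N": "new data"}


def access(vpage):
    """Translate through the TLB, filling it from the page table on a miss."""
    if vpage not in tlb:
        tlb[vpage] = page_table[vpage]
    return physical[tlb[vpage]]


def invalidate(vpage):
    """Stand-in for the invalidation step software runs after the update."""
    tlb.pop(vpage, None)


access("A")                      # the translation A to M is now cached
page_table["A"] = "N"            # software retargets virtual page A
stale = access("A")              # still reads physical page M
invalidate("A")
fresh = access("A")              # now reads physical page N
```

The model shows why the invalidation has to come after the update: invalidating before it would leave a window in which the old translation could be cached again.&lt;br /&gt;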
&lt;br /&gt;
= CMP Implementation in Intel Architecture =&lt;br /&gt;
&lt;br /&gt;
Let us now see how Intel architecture using the MESI protocol progressed from a uniprocessor architecture to a Chip MultiProcessor (CMP) using the bus as the interconnect. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Uniprocessor Architecture'''&lt;br /&gt;
&lt;br /&gt;
The diagram below shows the structure of the memory cluster in the Intel Pentium M processor.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:intel_cache1.jpg]]&amp;lt;/center&amp;gt; &amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In this structure we have,&lt;br /&gt;
* A unified on-chip '''L1 cache''' with the '''processor/core''',&lt;br /&gt;
* A '''Memory/L2 access control unit''', through which all the accesses to the L2 cache, main memory and IO space are made,&lt;br /&gt;
* The second level '''L2 cache''' along with the '''prefetch unit''' and&lt;br /&gt;
* '''Front side bus (FSB)''', a single shared bi-directional bus through which all the traffic is sent. This wide bus brings in multiple data bytes at a time. &lt;br /&gt;
&lt;br /&gt;
As Intel explains it, using this structure, the processor requests were first sought in the '''L2 cache''' and only on a '''miss''' were they '''forwarded''' to the main '''memory''' via the front side bus ('''FSB'''). The '''Memory/L2 access control''' unit served as a central point for '''maintaining coherence''' within the core and with the external world. It '''contains''' a '''snoop control unit''' that receives snoop requests from the bus and performs the required operations on each cache (and internal buffers) in parallel. It also handles RFO requests (BusUpgr) and ensures the operation continues only after it guarantees that no other version of the cache line exists in any other cache in the system.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''CMP Architecture'''&lt;br /&gt;
&lt;br /&gt;
For the CMP implementation, Intel chose a bus-based architecture using snoopy protocols over a '''directory protocol''': although a directory protocol reduces active power because of reduced snoop activity, it '''increases''' the '''design complexity''' and the '''static power''' because of larger tag arrays. Since Intel has a large market for processors in the mobility family, a directory-based solution was less favorable, because battery life depends mainly on static power consumption and less on dynamic power.&lt;br /&gt;
Let us examine how '''CMP''' was implemented in the '''Intel Core Duo''', which was one of the first dual-core processors for the budget/entry-level market. &lt;br /&gt;
The general CMP implementation structure of the Intel Core Duo is shown below.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:intel_cache2.jpg]]&amp;lt;/center&amp;gt; &amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This structure has the following changes when compared to the uniprocessor memory cluster structure. &lt;br /&gt;
* '''L1 cache''' and the '''processor/core''' structure is '''duplicated''' to give 2 cores.&lt;br /&gt;
* The '''Memory/L2 access control''' unit is '''split''' into 2 logical units: '''L2 controller''' and '''bus controller'''. The L2 controller handles all '''requests to the L2''' cache from the core and the snoop requests from the FSB. The '''bus controller''' handles '''data and I/O requests''' to and from the FSB.&lt;br /&gt;
* The '''prefetching''' unit is extended to handle the hardware '''prefetches for each core separately'''.&lt;br /&gt;
* A '''new logical unit''' (represented by the hexagon) was added to maintain '''fairness between the requests''' coming from the different cores and hence balance the requests to L2 and memory.&lt;br /&gt;
&lt;br /&gt;
This new '''partitioned structure''' for the  memory/L2 access control unit '''enhanced''' the '''performance''' while '''reducing power consumption'''. &lt;br /&gt;
For more information on uniprocessor and multiprocessor implementation under the Intel architecture, refer to &lt;br /&gt;
[http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]&lt;br /&gt;
&lt;br /&gt;
The '''Intel bus architecture''' has been '''evolving''' to accommodate the demands of scalability while using the same MESI protocol: from a '''single shared bus''', to '''dual independent buses (DIB)''' that double the available bandwidth, and on to the logical conclusion of DIB, '''dedicated high-speed interconnects (DHSI)'''. The DHSI-based platforms use four FSBs, one for each processor in the platform. In both DIB and DHSI, a snoop filter in the chipset caches snoop information, thereby significantly reducing the broadcasting needed for the snoop traffic on the buses. With the production of processors based on the next-generation 45-nm Hi-k Intel Core microarchitecture, the [http://en.wikipedia.org/wiki/Xeon Intel Xeon] processor fabric will transition from DHSI, with the memory controller in the chipset, to a distributed shared memory architecture using '''Intel QuickPath Interconnects with the MESIF protocol'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Word Invalidation=&lt;br /&gt;
One source of complexity that applies to a number of the protocols is invalidation.  In newer protocols, individual words of a cache line may be modified, as opposed to the entire line.  The other processors will thus hold a mostly correct cache line that differs in only one word.  This leads to potential complexity, because to 'correct' the error the other processors must transition from shared to invalid, then read the correct word from the bus and place it back into the line, transitioning back into shared.  This transition is potentially unnecessary, as the second processor may never access the specific word that changed, even though it may access other words in the cache line.  One potential solution being researched is to advance the protocols so that a cache line is &amp;quot;not invalidated on the first dirty word, but after the number of dirty words crosses some predetermined value, which is data type and application dependent.&amp;quot;  In other words, if the application can tolerate several 'incorrect' words in a line, several transitions to and from the invalid state may be avoided.&lt;br /&gt;
&amp;lt;br/&amp;gt;Source: http://tab.computer.org/tcca/NEWS/sept96/dsmideas.ps&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In essence, the solution proposed here is to advance the MOESI protocol with word invalidation and specific treatment of temporal and spatial data, so that the block is not invalidated.&lt;br /&gt;
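A threshold-based invalidation policy of this kind might look like the following Python sketch. The &lt;code&gt;SharedLine&lt;/code&gt; class and its &lt;code&gt;threshold&lt;/code&gt; parameter are hypothetical; the cited source describes the idea only in prose, so this is one possible reading of it rather than the proposed design.&lt;br /&gt;

```python
class SharedLine:
    """A shared copy of a line that tolerates a few remotely written words."""

    def __init__(self, threshold):
        self.valid = True
        self.dirty_words = set()
        self.threshold = threshold   # hypothetical, application-dependent limit

    def remote_write(self, index):
        """Another processor wrote word `index`; invalidate only past the threshold."""
        self.dirty_words.add(index)
        if len(self.dirty_words) > self.threshold:
            self.valid = False

    def can_read(self, index):
        """A word is readable while the line is valid and the word itself is clean."""
        return self.valid and index not in self.dirty_words


copy = SharedLine(threshold=2)
copy.remote_write(1)
still_usable = copy.can_read(0)   # other words of the line remain readable
```

With a threshold of zero the sketch degenerates to ordinary whole-line invalidation on the first remote write.&lt;br /&gt;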
&lt;br /&gt;
=Performance: MOESI vs MEI/MESI=&lt;br /&gt;
Early multiprocessors (such as the PowerPC processors) were designed to work with three states (Modified, Exclusive, Invalid). This is similar to the MSI protocol discussed earlier; the term MEI is used to remain consistent with the sources used for reference.  As time progressed, more multiprocessors transitioned to the MESI protocol.  This is most likely due to the characteristics of MEI: it is &amp;quot;easy to implement but leads to inefficiencies in the way that memory bus bandwidth is used&amp;quot; [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 source].  As a result, systems with two or more MEI processors will most likely not see the full potential of the extra processors, due to the inefficient bus transactions.&lt;br /&gt;
&lt;br /&gt;
In contrast, the five-state MOESI protocol is more complex to implement than MESI and MEI/MSI.  However, using it brings advantages.  As Andy Keane (VP of Marketing for PMC-Sierra) put it, &amp;quot;this [fifth] state allows shared data that is dirty to remain in the cache.  Without this state, any shared line would be written back to memory to change the original state from modified. Since we have a dedicated, fast path from CPU to CPU, this state maximizes the use of this path rather than the path to memory, which is inherently slower&amp;quot; [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 source].  This advantage can be realized by allowing MOESI to make some processor-to-processor transfers from the level 2 caches.&lt;br /&gt;
&lt;br /&gt;
Source: [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 Edwards, Chris (08/06/2001). &amp;quot;Architects wrestle with multiprocessor options&amp;quot;. Electronic engineering times (0192-1541), (1178), p. 48.]&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Cache_coherence Cache coherence]&lt;br /&gt;
# [http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]&lt;br /&gt;
# [http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Symmetric_multiprocessing Common System Interface in Intel Processors]&lt;br /&gt;
# [http://www.zak.ict.pwr.wroc.pl/nikodem/ak_materialy/Cache%20consistency%20&amp;amp;%20MESI.pdf Cache consistency with MESI on Intel processor]&lt;br /&gt;
# [http://techreport.com/articles.x/8236/2 AMD dual core Architecture]&lt;br /&gt;
# [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual]&lt;br /&gt;
# [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors]&lt;br /&gt;
# [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of AMD 64 bit core]&lt;br /&gt;
# [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&amp;amp;arnumber=4913 Silicon Graphics Computer Systems]&lt;br /&gt;
# [http://books.google.com/books?id=g82fofiqa5IC&amp;amp;printsec=frontcover&amp;amp;dq=Parallel+computer+architecture:+a+hardware/software+approach+By+David+E.+Culler,+Jaswinder+Pal+Singh,+Anoop+Gupta&amp;amp;source=bl&amp;amp;ots=COrdamlfVn&amp;amp;sig=YcugVqbzTjHvlofvaFq6Ft_tjfY&amp;amp;hl=en&amp;amp;ei=0ZO6S4TJGcOclgejzI3BBw&amp;amp;sa=X&amp;amp;oi=book_result&amp;amp;ct=result&amp;amp;resnum=1&amp;amp;ved=0CAgQ6AEwAA#v=onepage&amp;amp;q=&amp;amp;f=false Parallel computer architecture: a hardware/software approach By David E. Culler, Jaswinder Pal Singh, Anoop Gupta]&lt;br /&gt;
# [http://www.freepatentsonline.com/5283886.html Three state invalidation protocols]&lt;br /&gt;
# [http://portal.acm.org/citation.cfm?id=1499317&amp;amp;dl=GUIDE&amp;amp;coll=GUIDE&amp;amp;CFID=83027384&amp;amp;CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems]&lt;br /&gt;
# [http://portal.acm.org/citation.cfm?id=6514 Cache Coherence protocols: evaluation using a multiprocessor simulation model]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Xerox_Dragon Xerox Dragon]&lt;br /&gt;
# [http://thanaseto.110mb.com/courses/CSD-527-report-engl.pdf Coherence Protocols]&lt;br /&gt;
# [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&amp;amp;arnumber=289691 XDBus]&lt;br /&gt;
# [http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dai0228a/index.html ARM]&lt;br /&gt;
# [http://ixbtlabs.com/articles/ibmpower4/index.html IBM POWER4]&lt;br /&gt;
# [http://www.chiplist.com/Intel_Itanium_2_processor/tree3f-section--2235-/ Intel Itanium 2]&lt;br /&gt;
# [http://techreport.com/articles.x/8236/2 MESI-MESI-MOESI Banana-fana...]&lt;br /&gt;
# [http://rsim.cs.illinois.edu/rsim/Manual/node109.html RSIM]&lt;br /&gt;
# [http://dl.acm.org/citation.cfm?id=1499317 Synapse tightly coupled multiprocessors]&lt;/div&gt;</summary>
		<author><name>Pxu</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2010/8a_fu&amp;diff=59790</id>
		<title>CSC/ECE 506 Spring 2010/8a fu</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2010/8a_fu&amp;diff=59790"/>
		<updated>2012-03-19T01:26:05Z</updated>

		<summary type="html">&lt;p&gt;Pxu: /* MOESI */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Introduction to bus-based cache coherence in real machines=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==SMP Architecture==&lt;br /&gt;
Most parallel software in the commercial market relies on the shared-memory programming model, in which all processors access the same physical address space. The most common multiprocessors today use the [http://en.wikipedia.org/wiki/Symmetric_multiprocessing SMP] architecture, which uses a common bus as the interconnect.  In the case of multicore processors (&amp;quot;chip multiprocessors,&amp;quot; or CMP), the SMP architecture applies to the cores, treating them as separate processors. The key problem of shared-memory multiprocessors is providing a consistent view of memory across the various cache hierarchies.  This is called the '''''cache coherence problem'''''. Solving it is critical for correctness, and it is a performance-sensitive design point in supporting the shared-memory model. The cache coherence mechanisms not only govern communication in a shared-memory multiprocessor, but also typically determine how the memory system transfers data between processors, caches, and memory.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:Busbased SMP.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
At any point in logical time, the permissions for a cache block can allow either a single writer or multiple readers. The '''''coherence protocol''''' ensures the invariants of the states are maintained. The different coherent states used by most of the cache coherence protocols are as shown in ''Table 1'':&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''States'''&lt;br /&gt;
|  '''Access Type'''&lt;br /&gt;
|  '''Invariant'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  read, write&lt;br /&gt;
|  all other caches in I state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  read&lt;br /&gt;
|  all other caches in I state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Owned'''&lt;br /&gt;
|  read&lt;br /&gt;
|  all other caches in I or S state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  read&lt;br /&gt;
|  no other cache in M or E state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|  -&lt;br /&gt;
|  -&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Snooping Protocols=&lt;br /&gt;
==MSI Protocol==&lt;br /&gt;
&lt;br /&gt;
'''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' is a three-state write-back '''invalidation protocol''' and one of the earliest snooping-based cache coherence protocols. It marks each cache line as '''Modified (M)''', '''Shared (S)''', or '''Invalid (I)'''. '''Invalid''' means the cache line is either not present or is in the invalid state. If the cache line is clean and may be shared by more than one processor, it is marked '''Shared'''. If the cache line is dirty and the processor has exclusive ownership of it, it is in the '''Modified''' state. A BusRdX causes all other caches to invalidate (demote) their copies to the '''I''' state; a cache that holds the line in the '''M''' state flushes it first. A processor write, even when it hits a line in the '''S''' state, issues a BusRdX and promotes (upgrades) the line to the '''M''' state.&lt;br /&gt;
&lt;br /&gt;
The state diagram of the MSI protocol is shown below. Note that the left part shows the response to processor-side requests, and the right part shows the response to snooper-side requests.&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:MSInew.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
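The MSI transitions described above can also be written out as a lookup table. The following Python sketch uses the common textbook event names (PrRd, PrWr, BusRd, BusRdX, Flush); the &lt;code&gt;MSI&lt;/code&gt; dictionary and the &lt;code&gt;step&lt;/code&gt; helper are hypothetical names chosen for illustration.&lt;br /&gt;

```python
# (current state, event) maps to (next state, bus action).
# A bus action of None means the event generates no bus traffic.
MSI = {
    # Processor-side requests
    ("I", "PrRd"):   ("S", "BusRd"),
    ("I", "PrWr"):   ("M", "BusRdX"),
    ("S", "PrRd"):   ("S", None),
    ("S", "PrWr"):   ("M", "BusRdX"),   # a write hit in S still goes to the bus
    ("M", "PrRd"):   ("M", None),
    ("M", "PrWr"):   ("M", None),
    # Snooper-side requests
    ("I", "BusRd"):  ("I", None),
    ("I", "BusRdX"): ("I", None),
    ("S", "BusRd"):  ("S", None),
    ("S", "BusRdX"): ("I", None),
    ("M", "BusRd"):  ("S", "Flush"),
    ("M", "BusRdX"): ("I", "Flush"),
}


def step(state, event):
    """Return (next_state, bus_action) for one cache line."""
    return MSI[(state, event)]


# A read followed by a write to an uncached line costs two bus transactions.
after_read, first_action = step("I", "PrRd")
after_write, second_action = step(after_read, "PrWr")
```

The last two lines show the read-then-write sequence that always costs two bus transactions under MSI, even when no other cache holds the line.&lt;br /&gt;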
&lt;br /&gt;
==Synapse protocol==&lt;br /&gt;
From the state transition diagram of MSI, we observe that there is a transition from state '''M''' to state '''S''' when a BusRd is observed for that block. The contents of the block are flushed to the bus before the line goes to the '''S''' state. In certain cases it may be more appropriate to move to the '''I''' state, giving up the block entirely. The choice between moving to '''S''' or to '''I''' reflects the designer's assertion about whether the original processor is more likely to continue reading the block than the new processor is to write to it. The Synapse protocol, used in the early Synapse multiprocessor, made the alternate choice of going directly from the '''M''' state to the '''I''' state on a BusRd, assuming the migratory pattern would be more frequent. More details about this protocol can be found in these papers published in the 1980s: [http://portal.acm.org/citation.cfm?id=6514 Cache Coherence protocols: evaluation using a multiprocessor simulation model] and [http://portal.acm.org/citation.cfm?id=1499317&amp;amp;dl=GUIDE&amp;amp;coll=GUIDE&amp;amp;CFID=83027384&amp;amp;CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems].&lt;br /&gt;
&lt;br /&gt;
In the Synapse protocol, the '''M''' state is called the '''D''' (Dirty) state. The following is the state transition diagram for the Synapse protocol:&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:Synapse1.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
===Real Architectures using Synapse===&lt;br /&gt;
Nothing can be found on any actual architecture using the Synapse cache-coherence protocol.  Judging from the abstract of the 1984 paper by Steve Frank and Armond Inselberg, &amp;quot;[http://dl.acm.org/citation.cfm?id=1499317 Synapse Tightly Coupled Multiprocessors: A New Approach to Solve Old Problems]&amp;quot;, Synapse was theoretical.&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[http://www.disi.unige.it/person/DelzannoG/CacheProtocol/synapse.hy The BABYLON Project] provides one example of a Synapse cache coherence protocol with atomic synchronization actions.&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
In the Synapse system, memory contention is reduced by designing a processor cache that employs a non-write-through algorithm to minimize bandwidth between the cache and shared memory. The Synapse Expansion Bus includes an ownership-level protocol between processor caches.[[#References|&amp;lt;sup&amp;gt;[24]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
==[http://en.wikipedia.org/wiki/MESI_protocol MESI]==&lt;br /&gt;
The main drawback of MSI is that each read-write sequence incurs two bus transactions irrespective of whether the cache line is stored in only one cache or not. Highly parallel programs that have little data sharing suffer the most from this.  MESI protocol solves this problem by introducing the '''Exclusive''' state to distinguish between a cache line stored in multiple caches and a line stored in a single cache.&lt;br /&gt;
Let us briefly see how the MESI protocol works. For a more detailed version, refer to the Solihin textbook, p. 215.&lt;br /&gt;
&lt;br /&gt;
The MESI coherence protocol marks each cache line as being in one of the Modified, Exclusive, Shared, or Invalid states. &lt;br /&gt;
* '''Invalid''' : The cache line is either not present or is invalid&lt;br /&gt;
* '''Exclusive''' : The cache line is clean and is owned by this core/processor only&lt;br /&gt;
* '''Modified''' : The cache line is dirty and the core/processor has exclusive ownership of it; the copy in memory is out of date.&lt;br /&gt;
* '''Shared''' : The cache line is clean and is shared by more than one core/processor&lt;br /&gt;
&lt;br /&gt;
The MESI protocol works as follows: &lt;br /&gt;
A line that is fetched receives the '''E''' or '''S''' state, depending on whether it exists in other caches in the system. A cache line gets the '''M''' state when a processor writes to it; if the line is not in the '''E''' or '''M''' state prior to the write, the cache sends a Bus Upgrade (BusUpgr) signal or, as the Intel manuals term it, a “Read-For-Ownership (RFO) request”, which ensures that the line exists in the writing cache and is in the '''I''' state in all other processors on the bus (if any). The table below summarizes the MESI protocol.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''This cache line is valid?'''&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  No&lt;br /&gt;
|-&lt;br /&gt;
|  '''The memory copy is…'''&lt;br /&gt;
|  out of date&lt;br /&gt;
|  valid&lt;br /&gt;
|  valid&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies exist in caches of other processors?'''&lt;br /&gt;
|  No&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe&lt;br /&gt;
|  Maybe&lt;br /&gt;
|-&lt;br /&gt;
|  '''A write to this line'''&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  goes directly to bus&lt;br /&gt;
|}&lt;br /&gt;
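The requester-side behavior summarized in the table can be sketched as a small state function. This is a minimal sketch under stated assumptions: the event names PrRd and PrWr and the use of BusRdX for a write miss follow common textbook usage rather than anything stated above, and the function names are hypothetical.&lt;br /&gt;

```python
# Minimal sketch of requester-side MESI transitions for one cache line.

def mesi_processor_event(state, event, others_have_copy=False):
    """Return (next_state, bus_transaction); the transaction may be None."""
    if event == "PrRd":
        if state == "I":
            # A miss fetches the line: E if no other cache holds it, else S.
            return ("S" if others_have_copy else "E", "BusRd")
        return (state, None)            # read hit in M, E, or S
    if event == "PrWr":
        if state in ("M", "E"):
            return ("M", None)          # write proceeds without bus traffic
        if state == "S":
            return ("M", "BusUpgr")     # invalidate the other copies
        return ("M", "BusRdX")          # write miss (assumed transaction)
    raise ValueError("unknown event: " + event)


def run(events, others_have_copy):
    """Replay events starting from I; return final state and bus log."""
    state, bus_log = "I", []
    for event in events:
        state, bus = mesi_processor_event(state, event, others_have_copy)
        if bus is not None:
            bus_log.append(bus)
    return state, bus_log


print(run(["PrRd", "PrWr"], others_have_copy=False))  # ('M', ['BusRd'])
print(run(["PrRd", "PrWr"], others_have_copy=True))   # ('M', ['BusRd', 'BusUpgr'])
```

The first call shows the benefit of the Exclusive state: a read followed by a write to an unshared line costs one bus transaction instead of two.&lt;br /&gt;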
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The transition diagram from the lecture slides is given below for reference.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:MESI.jpg]]&amp;lt;/center&amp;gt; &amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Real Architectures using MESI===&lt;br /&gt;
&lt;br /&gt;
Intel's [http://en.wikipedia.org/wiki/Pentium_Pro Pentium Pro] microprocessor, introduced in 1995, was the first Intel architecture microprocessor to support symmetric multiprocessing in various multiprocessor configurations.  SMP with the MESI protocol was used consistently until the introduction of the 45-nm Hi-k Core microarchitecture in Intel's quad-core x86-64 processor (Nehalem-EP). The 45-nm Hi-k Intel Core microarchitecture uses a new system framework called the [http://en.wikipedia.org/wiki/Intel_QuickPath_Interconnect QuickPath Interconnect], a point-to-point interconnection technology based on a distributed shared memory architecture.  It uses a modified version of the MESI protocol called [http://en.wikipedia.org/wiki/MESIF_protocol MESIF], which introduces the additional Forward state.  (The state diagram of MESI transitions that occur within the Pentium's data cache can be found on page 63 of [http://books.google.com/books?id=TVzjEZg1--YC&amp;amp;printsec=frontcover&amp;amp;source=gbs_ge_summary_r&amp;amp;cad=0#v=onepage&amp;amp;q&amp;amp;f=false Pentium Processor System Architecture: Second Edition] by Don Anderson, Tom Shanley, MindShare, Inc.)&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
The [http://www.arm.com/products/processors/classic/arm11/arm11-mpcore.php ARM11 MPCore] and [http://en.wikipedia.org/wiki/ARM_Cortex-A9_MPCore Cortex-A9 MPCore] processors support the MESI cache coherency protocol.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]  ARM MPCore defines the states of the MESI protocol it implements as:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies in other caches'''&lt;br /&gt;
|  NO&lt;br /&gt;
|  NO&lt;br /&gt;
|  YES&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Clean or Dirty'''&lt;br /&gt;
|  DIRTY&lt;br /&gt;
|  CLEAN&lt;br /&gt;
|  CLEAN&lt;br /&gt;
|  -&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
The coherency protocol is implemented and managed by the Snoop Control Unit (SCU) in the ARM MPCore, which monitors the traffic between the local L1 data caches and the next level of the memory hierarchy. At boot time, each core can choose whether or not to take part in the coherency domain.  Unless explicit system calls bind a task to a specific core ([http://en.wikipedia.org/wiki/Processor_affinity processor affinity]), there is a high chance that a task will at some point migrate to a different core, along with its data as it is used.  Task migration is not handled efficiently by a literal MESI implementation, so the ARM MPCore offers two optimizations that allow for both MESI compliance and task migration:&lt;br /&gt;
* '''Direct Data Intervention (DDI)''': the SCU keeps a copy of the tag RAMs of all cores' caches. This enables it to efficiently detect whether a cache line requested by one core is held by another core in the coherency domain before looking for it in the next level of the memory hierarchy.&lt;br /&gt;
* '''Cache-to-cache Migration''': if the SCU finds that the cache line requested by one CPU is present in another core, it will either copy it (if clean) or move it (if dirty) from the other CPU directly into the requesting one, without interacting with external memory.&lt;br /&gt;
These optimizations reduce memory traffic in and out of the L1 cache subsystem by eliminating interaction with external memories, which in effect reduces the overall load on the interconnect and the overall power consumption.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
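The two optimizations can be modeled roughly as follows. This is a hypothetical sketch, not ARM's implementation: the class name, the dictionary-based tag directory, and the choice to leave both copies Shared after a clean copy are all assumptions made for illustration.&lt;br /&gt;

```python
# Hypothetical model of an SCU with a duplicated tag directory and
# cache-to-cache transfers. Only read misses are modeled.

class SnoopControlUnit:
    def __init__(self, num_cores):
        # tags[core] maps a line address to its MESI state ("M", "E", "S").
        self.tags = [dict() for _ in range(num_cores)]

    def read_miss(self, core, addr):
        """Serve a read miss from `core`; return where the data came from."""
        for other, tag_ram in enumerate(self.tags):
            if other == core or addr not in tag_ram:
                continue
            if tag_ram[addr] == "M":
                # Dirty line: move it, so the previous holder gives it up.
                del tag_ram[addr]
                self.tags[core][addr] = "M"
            else:
                # Clean line: copy it, so both caches now share it.
                tag_ram[addr] = "S"
                self.tags[core][addr] = "S"
            return "core%d" % other
        # No core holds the line: go to the next level of the hierarchy.
        self.tags[core][addr] = "E"
        return "memory"


scu = SnoopControlUnit(num_cores=2)
scu.tags[0][0x40] = "M"
print(scu.read_miss(1, 0x40))  # core0
print(scu.read_miss(0, 0x80))  # memory
```

Because the lookup consults only the SCU's own copy of the tags, a request that another core can satisfy never reaches external memory.&lt;br /&gt;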
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
Other architectures that use the MESI cache-coherence protocol include the L2 cache of the IBM POWER4 processor[[#References|&amp;lt;sup&amp;gt;[20]&amp;lt;/sup&amp;gt;]], the L2 cache of the Intel Itanium 2 processor[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]], and the Intel Xeon[[#References|&amp;lt;sup&amp;gt;[22]&amp;lt;/sup&amp;gt;]].&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
One implementation complexity already mentioned is the inefficiency of task migration in MESI that the ARM MPCore addresses.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]  Another possible implementation complexity is found during the replacement of a cache line: one possible MESI implementation requires a message to be sent to main memory when a cache line is flushed (i.e. an E to I transition), as the line was exclusively in one cache before it was removed. It is possible to avoid this replacement message if the system is designed so that the flush of a modified (exclusive) line requires an acknowledgment from main memory. However, this requires the flush to be stored in a 'write-back' buffer until the reply arrives (to ensure the change is successfully propagated to memory).[[#References|&amp;lt;sup&amp;gt;[23]&amp;lt;/sup&amp;gt;]]&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==MESIF==&lt;br /&gt;
Let us now briefly walk through the '''MESIF protocol''':&lt;br /&gt;
&lt;br /&gt;
The '''MESIF''' protocol, used in the latest Intel multi-core processors, was introduced to '''accommodate the point-to-point''' links used in the QuickPath Interconnect. Using the '''MESI''' protocol in this architecture would send many redundant messages between different processors, often with unnecessarily high latency. For example, when a processor requests a cache line that is stored in multiple locations, every location might respond with the data. As the requesting processor only needs a single copy of the data, the system would be wasting bandwidth. &lt;br /&gt;
As a solution to this problem, an additional state, the '''Forward state''', was added by slightly changing the role of the Shared state. Whenever there is a read request, only the cache holding the line in the F state responds, while all caches holding it in the S state remain silent.  Hence, by designating a single cache to '''respond to requests''', coherency traffic is substantially reduced when multiple copies of the data exist. The responding copy also changes state on a read request: when a cache line in the '''F''' state is '''copied''', the F state '''migrates''' to the '''newer copy''', while the '''older''' one drops back to '''S'''. Moving the F state to the new copy '''exploits''' both '''temporal and spatial locality'''. Because the newest copy of the cache line is always the one in the F state, it is very unlikely that the line in the F state will be evicted from the caches; this takes advantage of the temporal locality of the request. The second advantage is that if a particular cache line is in high demand due to spatial locality, the bandwidth used to transmit that data is spread across several nodes.&lt;br /&gt;
All M to S and E to S state transitions of MESI now become '''M to F''' and '''E to F''' transitions.  &lt;br /&gt;
The '''F state''' is '''different''' from the '''Owned state''' of the MOESI protocol in that it does '''not''' hold a unique copy, because a valid copy is also stored in memory. Thus, unlike a line in the Owned state of the MOESI protocol, whose data is not yet reflected in memory, a line in the F state can be evicted or converted to the S state, if desired. &lt;br /&gt;
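The migration of the F state on a read can be sketched as below. This is a simplified, hypothetical model: it follows the description in which the newest copy becomes the forwarder and the older one drops to S, it does not model lines in the M state, and the rule that the requester takes F when only S copies exist is an assumption.&lt;br /&gt;

```python
# Sketch of a MESIF read miss for one line. `states` holds one state per
# cache and is updated in place. Lines in M are outside this sketch.

def mesif_read_miss(states, requester):
    """Return the index of the responding cache, or None if memory responds."""
    responder = None
    for i, s in enumerate(states):
        if i != requester and s in ("F", "E"):
            responder = i              # the single designated responder
    shared = any(s == "S" for i, s in enumerate(states) if i != requester)
    if responder is not None:
        states[responder] = "S"        # the older copy drops back to S
        states[requester] = "F"        # the newest copy becomes the forwarder
    elif shared:
        states[requester] = "F"        # memory answers; S copies stay silent
    else:
        states[requester] = "E"        # no other cached copy exists
    return responder


line = ["F", "S", "I"]
print(mesif_read_miss(line, requester=2), line)  # 0 ['S', 'S', 'F']
```

At most one cache holds the line in F after each request, so exactly one cache answers the next read.&lt;br /&gt;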
&lt;br /&gt;
More information on the QuickPath Interconnect and MESIF protocol can be found at&lt;br /&gt;
'''[http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]'''&lt;br /&gt;
&lt;br /&gt;
==MOESI==&lt;br /&gt;
[http://en.wikipedia.org/wiki/Opteron AMD Opteron] was AMD’s first-generation dual-core processor, with 2 distinct [http://en.wikipedia.org/wiki/Athlon_64 K8 cores] together on a single die.  Cache coherence becomes a bigger problem on such multiprocessors, and an appropriate coherence protocol was needed to address it. The [http://en.wikipedia.org/wiki/Xeon Intel Xeon], the competing processor from Intel, used the MESI protocol to handle cache coherence.  MESI came with the drawback of using considerable time and bandwidth in certain situations. &lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/MOESI_protocol MOESI] was AMD’s answer to this problem. MOESI adds a fifth state to the MESI protocol called '''“Owned”'''. MOESI addresses the bandwidth problem that MESI faces when a processor with an invalid copy in its cache needs data that another processor has modified.  Under MESI, the processor seeking access has to wait for the processor that modified the data to write it back to main memory, which takes time and bandwidth. MOESI removes this drawback by allowing dirty sharing.  When the data is held by a processor in the new '''“Owned”''' state, it can provide other processors with the modified data without, or even before, writing it to main memory. This is called '''''dirty sharing'''''. The processor holding the data in the '''&amp;quot;Owned&amp;quot;''' state remains responsible for updating main memory later, when the cache line is evicted.&lt;br /&gt;
&lt;br /&gt;
MOESI has become one of the most popular snoop-based protocols supported in the AMD64 architecture.  The AMD dual-core Opteron can maintain cache coherence in systems up to 8 processors using this protocol.&lt;br /&gt;
&lt;br /&gt;
The five different states of the MOESI protocol are:&lt;br /&gt;
* '''Modified (M)''' : The most recent copy of the data is present in the cache line. But it is not present in any other processor cache.&lt;br /&gt;
* '''Owned (O)'''   : The cache line holds the most recent, correct copy of the data, which may be shared by other processors. The processor holding the line in this state is responsible for writing the correct value back to main memory when the line is evicted.  &lt;br /&gt;
* '''Exclusive (E)''' : A cache line holds the most recent, correct copy of the data, which is exclusively present on this processor and a copy is present in the main memory.  &lt;br /&gt;
* '''Shared (S)''' : A cache line in the shared state holds the most recent, correct copy of the data, which may be shared by other processors. &lt;br /&gt;
* '''Invalid (I)''' : A cache line does not hold a valid copy of the data.&lt;br /&gt;
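Dirty sharing can be illustrated with a toy model of one cache line shared by two caches. This is a hypothetical sketch: the class and method names are assumptions, the Exclusive state is not modeled, and a line holds a single value.&lt;br /&gt;

```python
# Toy two-cache model of MOESI dirty sharing for a single line.

class OwnedLineDemo:
    def __init__(self):
        self.memory = 0
        self.state = ["I", "I"]
        self.data = [None, None]
        self.memory_writes = 0

    def write(self, cache, value):
        other = 1 - cache
        self.state[other], self.data[other] = "I", None  # invalidate the peer
        self.state[cache], self.data[cache] = "M", value

    def read(self, cache):
        other = 1 - cache
        if self.state[cache] == "I":
            if self.state[other] in ("M", "O"):
                # Dirty sharing: the peer supplies the data and becomes the
                # owner; main memory is not written.
                self.state[other] = "O"
                self.data[cache] = self.data[other]
            else:
                self.data[cache] = self.memory
            self.state[cache] = "S"
        return self.data[cache]

    def evict(self, cache):
        if self.state[cache] in ("M", "O"):
            self.memory = self.data[cache]   # the owner writes back on eviction
            self.memory_writes += 1
        self.state[cache], self.data[cache] = "I", None


demo = OwnedLineDemo()
demo.write(0, 7)
print(demo.read(1), demo.state, demo.memory_writes)  # 7 ['O', 'S'] 0
demo.evict(0)
print(demo.memory, demo.memory_writes)               # 7 1
```

The second cache obtains the modified value while memory is still stale; the write-back happens only when the owner evicts the line.&lt;br /&gt;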
&lt;br /&gt;
A detailed explanation of this protocol implementation on AMD processor can be found in the manual [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of the AMD 64-bit core]&lt;br /&gt;
&lt;br /&gt;
The following table summarizes the MOESI protocol:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Owned'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''This cache line is valid?'''&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  No&lt;br /&gt;
|-&lt;br /&gt;
|  '''The memory copy is…'''&lt;br /&gt;
|  out of date&lt;br /&gt;
|  out of date&lt;br /&gt;
|  valid&lt;br /&gt;
|  valid&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies exist in caches of other processors?'''&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe&lt;br /&gt;
|  Maybe&lt;br /&gt;
|-&lt;br /&gt;
|  '''A write to this line'''&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  goes directly to bus&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The state transition diagram for MOESI is shown below:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:MOESI.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;MOESI State Transition Diagram&amp;lt;/center&amp;gt;&lt;br /&gt;
&amp;lt;br /&amp;gt;&lt;br /&gt;
&amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Optimization techniques on MOESI===&lt;br /&gt;
&lt;br /&gt;
In real machines, applying optimization techniques to the standard cache coherence protocol improves performance. For example, the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] family of microprocessors (Family 0x10), which is AMD’s first generation to incorporate 4 distinct cores on a single die and the first to have a cache that all the cores share, uses the MOESI protocol with some optimization techniques incorporated. &lt;br /&gt;
&lt;br /&gt;
It focuses on a small subset of compute problems which behave like Producer and Consumer programs. In such a computing problem, a thread of a program running on a single core produces data, which is consumed by a thread that is running on a separate core. With such programs, it is desirable to get the two distinct cores to communicate through the shared cache, to avoid round trips to/from main memory. The '''MOESI''' protocol that the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] cache uses for cache coherence can also limit bandwidth. Hence by keeping the cache line in the '''‘M’''' state for such computing problems, we can achieve better performance.&lt;br /&gt;
&lt;br /&gt;
When the producer thread writes a new entry, it allocates cache lines in the '''modified (M)''' state. Eventually, these M-marked cache lines will start to fill the L3 cache. When the consumer reads the cache line, the MOESI protocol changes the state of the cache line to '''owned (O)''' in the L3 cache and pulls down a '''shared (S)''' copy for its own use. Now, the producer thread circles the ring buffer and arrives back at the same cache line it had previously written. However, when the producer attempts to write new data to the owned (marked '''‘O’''') cache line, it finds that it cannot, since a cache line marked '''‘O’''' by the previous consumer read does not have sufficient permission for a write request (in the MOESI protocol). To maintain coherence, the memory controller must initiate probes in the other caches (to handle any other S copies that may exist). This slows down the process.&lt;br /&gt;
&lt;br /&gt;
Thus, it is preferable to keep the cache line in the '''‘M’''' state in the L3 cache. In such a situation, when the producer comes back around the ring buffer, it finds the previously written cache line still marked '''‘M’''', to which it is safe to write without coherence concerns. Thus better performance can be achieved by such optimization techniques to standard protocols when implemented in real machines.&lt;br /&gt;
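The effect of the optimization can be illustrated by counting the probes that producer writes trigger over several laps of the ring buffer. This is a hypothetical counting model, not AMD's mechanism: the function name and the one-probe-per-Owned-line cost are assumptions.&lt;br /&gt;

```python
# Count the probes caused by producer writes to a ring buffer whose entries
# live in the shared L3 cache, one cache line per entry.

def count_probes(slots, laps, demote_on_consumer_read):
    l3_state = ["I"] * slots
    probes = 0
    for _ in range(laps):
        for i in range(slots):
            # Producer write: an Owned line lacks write permission, so
            # probes are needed before the line can return to M.
            if l3_state[i] == "O":
                probes += 1
            l3_state[i] = "M"
            # Consumer read of the freshly written entry.
            if demote_on_consumer_read:
                l3_state[i] = "O"   # standard MOESI behavior
            # Otherwise the line is kept in M, as the optimization prefers.
    return probes


print(count_probes(slots=4, laps=3, demote_on_consumer_read=True))   # 8
print(count_probes(slots=4, laps=3, demote_on_consumer_read=False))  # 0
```

With demotion to Owned, every write after the first lap pays for a probe; keeping the line Modified avoids them entirely in this model.&lt;br /&gt;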
&lt;br /&gt;
You can find more information on how this is implemented and various other ways of optimizations in this manual [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors]&lt;br /&gt;
&lt;br /&gt;
==Dragon Protocol==&lt;br /&gt;
The '''[http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]''' is an update-based coherence protocol: unlike the coherence protocols we have seen so far, it does not invalidate other cached copies. Write propagation is achieved by updating the cached copies instead of invalidating them.  The Dragon Protocol does not update memory on a cache-to-cache transfer; it delays making memory consistent with the caches until the data is evicted and written back, which saves time and lowers the memory access requirements. Moreover, only the written '''byte''' or '''word''' is '''communicated''' to the '''other caches''' instead of the whole block, which further '''reduces''' '''bandwidth''' usage. It can dynamically detect the sharing status of a block and use a write-through policy for shared blocks and a write-back policy for currently non-shared blocks. The Dragon Protocol employs the following four states for cache blocks: '''Shared Clean''', '''Shared Modified''', '''Exclusive''' and  '''Modified'''. &lt;br /&gt;
* '''Modified (M)''' and '''Exclusive (E)''' - these states have the same meaning as explained in the protocols above. &lt;br /&gt;
* '''Shared Modified (Sm)''' - Only one cache in the system can hold the line in the Shared Modified state. Potentially two or more caches have this block, memory may or may not be up to date, and this processor's cache has modified the block.&lt;br /&gt;
* '''Shared Clean (Sc)''' -  Potentially two or more caches have this block and memory may or may not be up to date (if no other cache has it in Sm state, memory will be up to date else it is not).&lt;br /&gt;
Only when a Shared Modified line is evicted from the cache on a cache miss is the block written back to main memory, in order to keep memory consistent. For more information on the Dragon protocol, refer to the Solihin textbook, page 229. The state transition diagram is given below for reference.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:Dragon.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
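The update-on-write behavior can be sketched for write hits as follows. This is a minimal sketch under stated assumptions: the function name and data layout are hypothetical, write misses are not modeled, and the return value simply counts the words placed on the bus.&lt;br /&gt;

```python
# Sketch of a Dragon write hit. `states` holds one state per cache and
# `lines` one list of words per cache; both are updated in place.

def dragon_write_hit(states, lines, writer, word, value):
    """Apply a write hit; return the number of words sent on the bus."""
    if states[writer] not in ("E", "M", "Sc", "Sm"):
        raise ValueError("write misses are not modeled in this sketch")
    lines[writer][word] = value
    if states[writer] in ("E", "M"):
        states[writer] = "M"        # unshared block: write-back, no bus use
        return 0
    sharers = [i for i, s in enumerate(states)
               if i != writer and s in ("Sc", "Sm")]
    for i in sharers:
        lines[i][word] = value      # update the copy instead of invalidating
        states[i] = "Sc"            # any previous Sm holder gives up that role
    states[writer] = "Sm" if sharers else "M"
    return 1                        # only the written word is broadcast


states = ["Sc", "Sm"]
lines = [[1, 2], [1, 2]]
print(dragon_write_hit(states, lines, writer=0, word=1, value=9))  # 1
print(states, lines)  # ['Sm', 'Sc'] [[1, 9], [1, 9]]
```

Memory is never touched here, which matches the protocol's choice to defer the write-back until the Shared Modified or Modified line is evicted.&lt;br /&gt;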
&lt;br /&gt;
The Dragon protocol implements snoopy caches that provide the appearance of a uniform memory space to multiple processors. Here, each cache listens to 2 buses: the processor bus and the memory bus. The caches are also responsible for address translation, so the processor bus carries virtual addresses and the memory bus carries physical addresses.  The Dragon system was designed to support 4 to 8 Dragon processors. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=[http://en.wikipedia.org/wiki/Instruction_prefetch Instruction Prefetching]=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Instruction prefetching is a technique used to speed up the execution of a program. But in multiprocessors, prefetching comes at a cost. Due to prefetching, data can be modified in such a way that the memory coherence protocol is not able to handle the effects. In such situations, software must use serializing instructions or cache-invalidation instructions to guarantee that subsequent data accesses are coherent. &lt;br /&gt;
&amp;lt;br /&amp;gt;&lt;br /&gt;
An example of this type of a situation is a page-table update followed by accesses to the physical pages referenced by the updated page tables. The physical-memory references for the page tables are different than the physical-memory references for the data. Because of prefetching there may be problems with correctness. The following sequence of events shows such a situation when software changes the translation of virtual-page A from physical-page M to physical-page N:&lt;br /&gt;
# The tables that translate virtual-page A to physical-page M are now held only in main memory. The copies in the cache are invalidated.&lt;br /&gt;
# Page-table entry is changed by the software for virtual-page A in main memory to point to physical page N rather than physical-page M.&lt;br /&gt;
# Data in virtual-page A is accessed.&lt;br /&gt;
&lt;br /&gt;
Software expects the processor to access the data from physical-page N after the update. However, it is possible for the processor to prefetch the data from physical-page M before the page table for virtual page A is updated. Because the physical-memory references are different, the processor does not recognize them as requiring coherence checking and believes it is safe to prefetch the data from virtual-page A, which is translated into a read from physical page M. Similar behavior can occur when instructions are prefetched from beyond the page-table update instruction.&lt;br /&gt;
&lt;br /&gt;
To prevent such errors, software must execute special invalidation instructions immediately after the page-table update, which ensures that subsequent instruction fetches and data accesses use the correct virtual-page-to-physical-page translation. It is not necessary to perform a TLB invalidation operation preceding the table update.&lt;br /&gt;
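The hazard and its fix can be illustrated with a toy model in which a cached translation stands in for the processor's early use of the old mapping. This is a simplified analogy, not the AMD64 mechanism: the class, its methods, and the dictionary-based page table are all hypothetical.&lt;br /&gt;

```python
# Toy model: a stale cached translation keeps pointing at physical-page M
# until software explicitly invalidates it after the page-table update.

class TinyMMU:
    def __init__(self, page_table):
        self.page_table = page_table   # virtual page to physical page
        self.cached = {}               # translations already picked up

    def translate(self, vpage):
        if vpage not in self.cached:
            self.cached[vpage] = self.page_table[vpage]
        return self.cached[vpage]

    def invalidate(self, vpage):
        # Stands in for the invalidation instruction run after the update.
        self.cached.pop(vpage, None)


mmu = TinyMMU({"A": "M"})
mmu.translate("A")             # the old translation is picked up early
mmu.page_table["A"] = "N"      # software updates the page table
print(mmu.translate("A"))      # M
mmu.invalidate("A")
print(mmu.translate("A"))      # N
```

Without the invalidation step the access still goes to physical-page M, which is the incorrect outcome described above.&lt;br /&gt;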
&lt;br /&gt;
More information can be found about this in [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual]&lt;br /&gt;
&lt;br /&gt;
= CMP Implementation in Intel Architecture =&lt;br /&gt;
&lt;br /&gt;
Let us now see how Intel architecture using the MESI protocol progressed from a uniprocessor architecture to a Chip MultiProcessor (CMP) using the bus as the interconnect. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Uniprocessor Architecture'''&lt;br /&gt;
&lt;br /&gt;
The diagram below shows the structure of the memory cluster in Intel Pentium M processor.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:intel_cache1.jpg]]&amp;lt;/center&amp;gt; &amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In this structure we have,&lt;br /&gt;
* A unified on-chip '''L1 cache''' with the '''processor/core''',&lt;br /&gt;
* A '''Memory/L2 access control unit''', through which all the accesses to the L2 cache, main memory and IO space are made,&lt;br /&gt;
* The second level '''L2 cache''' along with the '''prefetch unit''' and&lt;br /&gt;
* '''Front side bus (FSB)''', a single shared bi-directional bus through which all the traffic is sent. These wide buses bring in multiple data bytes at a time. &lt;br /&gt;
&lt;br /&gt;
As Intel explains it, in this structure processor requests are first sought in the '''L2 cache''' and only on a '''miss''' are they '''forwarded''' to main '''memory''' via the front side bus ('''FSB'''). The '''Memory/L2 access control''' unit serves as a central point for '''maintaining coherence''' within the core and with the external world. It '''contains''' a '''snoop control unit''' that receives snoop requests from the bus and performs the required operations on each cache (and internal buffers) in parallel. It also handles RFO requests (BusUpgr) and ensures that the operation continues only after it guarantees that no other version of the cache line exists in any other cache in the system.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''CMP Architecture'''&lt;br /&gt;
&lt;br /&gt;
For its CMP implementation, Intel chose a bus-based architecture using snoopy protocols over a '''directory protocol''': although a directory protocol reduces active power due to reduced snoop activity, it '''increases''' the '''design complexity''' and the '''static power''' due to larger tag arrays. Since Intel has a large market for processors in the mobility family, a directory-based solution was less favorable, because battery life depends mainly on static power consumption and less on dynamic power.&lt;br /&gt;
Let us examine how '''CMP''' was implemented in the '''Intel Core Duo''', which was one of the first dual-core processors for the budget/entry-level market. &lt;br /&gt;
The general CMP implementation structure of the Intel Core Duo is shown below&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:intel_cache2.jpg]]&amp;lt;/center&amp;gt; &amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This structure has the following changes when compared to the uniprocessor memory cluster structure. &lt;br /&gt;
* '''L1 cache''' and the '''processor/core''' structure is '''duplicated''' to give 2 cores.&lt;br /&gt;
* The '''Memory/L2 access control''' unit is '''split''' into 2 logical units: '''L2 controller''' and '''bus controller'''. The L2 controller handles all '''requests to the L2''' cache from the core and the snoop requests from the FSB. The '''bus controller''' handles '''data and I/O requests''' to and from the FSB.&lt;br /&gt;
* The '''prefetching''' unit is extended to handle the hardware '''prefetches for each core separately'''.&lt;br /&gt;
* A '''new logical unit''' (represented by the hexagon) was added to maintain '''fairness between the requests''' coming from the different cores and hence balance the requests to L2 and memory.&lt;br /&gt;
&lt;br /&gt;
This new '''partitioned structure''' for the  memory/L2 access control unit '''enhanced''' the '''performance''' while '''reducing power consumption'''. &lt;br /&gt;
For more information on uniprocessor and multiprocessor implementation under the Intel architecture, refer to &lt;br /&gt;
[http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]&lt;br /&gt;
&lt;br /&gt;
The '''Intel bus architecture''' has been '''evolving''' to accommodate the demands of scalability while using the same MESI protocol: from a '''single shared bus''', to '''dual independent buses (DIB)''' that double the available bandwidth, to the logical conclusion of DIB with the introduction of '''dedicated high-speed interconnects (DHSI)'''. The DHSI-based platforms use four FSBs, one for each processor in the platform. In both DIB and DHSI, a snoop filter in the chipset caches snoop information, thereby significantly reducing the broadcasting needed for snoop traffic on the buses. With the production of processors based on the next-generation 45-nm Hi-k Intel Core microarchitecture, the [http://en.wikipedia.org/wiki/Xeon Intel Xeon] processor fabric will transition from DHSI, with the memory controller in the chipset, to a distributed shared memory architecture using '''Intel QuickPath Interconnects with the MESIF protocol'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Word Invalidation=&lt;br /&gt;
One complexity problem applying to a number of the protocols deals with invalidation.  In newer protocols, individual words may be modified in a cache line, as opposed to the entirety of the line.  The other processors will thus have a mostly correct cache line, with only a word difference.  This leads to potential complexity, because to 'correct' the error the other processors must transition from shared to invalid, where they then can read the correct word from the bus and place it back into the line, transitioning back into shared.  This transition is potentially unnecessary, as the second processor may never access the specific word changed, but it may access other words in the cache line.  One potential solution to this is being researched at the present time, advancing the protocols so that a cache line is &amp;quot;not invalidated on the first dirty word, but after the number of dirty words crosses some predetermined value, which is data type and application dependent.&amp;quot;  In other words, if the application can possibly have multiple words 'incorrect,' several transitions to and from the invalid state may be avoided.&lt;br /&gt;
&amp;lt;br/&amp;gt;Source: http://tab.computer.org/tcca/NEWS/sept96/dsmideas.ps&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In essence, the solution proposed here is to advance the MOESI protocol with word invalidation and specific treatment of temporal and spatial data, so that the block is not invalidated.&lt;br /&gt;
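The threshold idea can be sketched as a line that tolerates a bounded number of remotely modified words before it is invalidated. This is a hypothetical illustration of the quoted proposal, not an implementation from the source: the class, its method names, and the fixed integer threshold are assumptions.&lt;br /&gt;

```python
# Sketch: a shared cache line that is invalidated only after the number of
# dirty (remotely modified) words crosses a predetermined threshold.

class ThresholdLine:
    def __init__(self, threshold):
        self.threshold = threshold
        self.dirty_words = set()
        self.state = "S"

    def remote_write(self, word):
        """Record that another processor modified one word of this line."""
        if self.state == "I":
            return
        self.dirty_words.add(word)
        # The count grows by at most one per call, so it crosses the
        # threshold exactly when it equals threshold + 1.
        if len(self.dirty_words) == self.threshold + 1:
            self.state = "I"

    def can_read_locally(self, word):
        """A word is usable while the line is valid and the word is clean."""
        return self.state == "S" and word not in self.dirty_words


line = ThresholdLine(threshold=2)
line.remote_write(0)
print(line.state, line.can_read_locally(1), line.can_read_locally(0))  # S True False
line.remote_write(1)
line.remote_write(3)
print(line.state)  # I
```

A processor that never touches the modified words keeps hitting in its cache until the third distinct word is written, avoiding the early trips through the Invalid state.&lt;br /&gt;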
&lt;br /&gt;
=Performance: MOESI vs MEI/MESI=&lt;br /&gt;
Early multiprocessors (such as the PowerPC processors) were designed to work with three states (Modified, Exclusive, Invalid - this is similar to the MSI protocol discussed earlier, the term MEI is used to remain consistent with the sources used for reference).  As time progressed, more multi-processors transitioned to the MESI protocol.  This is most likely due to the characteristics of MEI - it is &amp;quot;easy to implement but leads to inefficiencies in the way that memory bus bandwidth is used&amp;quot; [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 source].  As a result, systems with two or more MEI processors (most likely) will not see the 'full' potential of the extra processors, due to the inefficient bus transactions.&lt;br /&gt;
&lt;br /&gt;
In contrast, the five-state protocol MOESI is more complex to implement than MESI and MEI/MSI.  However, advantages arise from using it.  As Andy Keane (VP of Marketing for PMC-Sierra) put it, &amp;quot;this [fifth] state allows shared data that is dirty to remain in the cache.  Without this state, any shared line would be written back to memory to change the original state from modified. Since we have a dedicated, fast path from CPU to CPU, this state maximizes the use of this path rather than the path to memory, which is inherently slower&amp;quot; [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 source].  This advantage can be realized by allowing MOESI to make some processor-to-processor transfers from level 2 caches.&lt;br /&gt;
&lt;br /&gt;
Source: [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 Edwards, Chris (08/06/2001). &amp;quot;Architects wrestle with multiprocessor options&amp;quot;. Electronic engineering times (0192-1541), (1178), p. 48.]&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Cache_coherence Cache coherence]&lt;br /&gt;
# [http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]&lt;br /&gt;
# [http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Symmetric_multiprocessing Common System Interface in Intel Processors]&lt;br /&gt;
# [http://www.zak.ict.pwr.wroc.pl/nikodem/ak_materialy/Cache%20consistency%20&amp;amp;%20MESI.pdf Cache consistency with MESI on Intel processor]&lt;br /&gt;
# [http://techreport.com/articles.x/8236/2 AMD dual core Architecture]&lt;br /&gt;
# [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual]&lt;br /&gt;
# [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors]&lt;br /&gt;
# [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of AMD 64 bit core]&lt;br /&gt;
# [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&amp;amp;arnumber=4913 Silicon Graphics Computer Systems]&lt;br /&gt;
# [http://books.google.com/books?id=g82fofiqa5IC&amp;amp;printsec=frontcover&amp;amp;dq=Parallel+computer+architecture:+a+hardware/software+approach+By+David+E.+Culler,+Jaswinder+Pal+Singh,+Anoop+Gupta&amp;amp;source=bl&amp;amp;ots=COrdamlfVn&amp;amp;sig=YcugVqbzTjHvlofvaFq6Ft_tjfY&amp;amp;hl=en&amp;amp;ei=0ZO6S4TJGcOclgejzI3BBw&amp;amp;sa=X&amp;amp;oi=book_result&amp;amp;ct=result&amp;amp;resnum=1&amp;amp;ved=0CAgQ6AEwAA#v=onepage&amp;amp;q=&amp;amp;f=false Parallel computer architecture: a hardware/software approach By David E. Culler, Jaswinder Pal Singh, Anoop Gupta]&lt;br /&gt;
# [http://www.freepatentsonline.com/5283886.html Three state invalidation protocols]&lt;br /&gt;
# [http://portal.acm.org/citation.cfm?id=1499317&amp;amp;dl=GUIDE&amp;amp;coll=GUIDE&amp;amp;CFID=83027384&amp;amp;CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems]&lt;br /&gt;
# [http://portal.acm.org/citation.cfm?id=6514 Cache Coherence protocols: evaluation using a multiprocessor simulation model]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Xerox_Dragon Xerox Dragon]&lt;br /&gt;
# [http://thanaseto.110mb.com/courses/CSD-527-report-engl.pdf Coherence Protocols]&lt;br /&gt;
# [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&amp;amp;arnumber=289691 XDBus]&lt;br /&gt;
# [http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dai0228a/index.html ARM]&lt;br /&gt;
# [http://ixbtlabs.com/articles/ibmpower4/index.html IBM POWER4]&lt;br /&gt;
# [http://www.chiplist.com/Intel_Itanium_2_processor/tree3f-section--2235-/ Intel Itanium 2]&lt;br /&gt;
# [http://techreport.com/articles.x/8236/2 MESI-MESI-MOESI Banana-fana...]&lt;br /&gt;
# [http://rsim.cs.illinois.edu/rsim/Manual/node109.html RSIM]&lt;br /&gt;
# [http://dl.acm.org/citation.cfm?id=1499317 Synapse tightly coupled multiprocessors]&lt;/div&gt;</summary>
		<author><name>Pxu</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2010/8a_fu&amp;diff=59789</id>
		<title>CSC/ECE 506 Spring 2010/8a fu</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2010/8a_fu&amp;diff=59789"/>
		<updated>2012-03-19T01:23:51Z</updated>

		<summary type="html">&lt;p&gt;Pxu: /* MSI Protocol */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Introduction to bus-based cache coherence in real machines=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==SMP Architecture==&lt;br /&gt;
Most parallel software in the commercial market relies on the shared-memory programming model, in which all processors access the same physical address space. The most common multiprocessors today use the [http://en.wikipedia.org/wiki/Symmetric_multiprocessing SMP] architecture, which uses a common bus as the interconnect.  In the case of multicore processors (&amp;quot;chip multiprocessors,&amp;quot; or CMP), the SMP architecture applies to the cores, treating them as separate processors. The key problem of shared-memory multiprocessors is providing a consistent view of memory across the various cache hierarchies.  This is called the '''''cache coherence problem'''''. Solving it is critical for correctness, and it is a performance-sensitive design point in supporting the shared-memory model. The cache coherence mechanisms not only govern communication in a shared-memory multiprocessor, but also typically determine how the memory system transfers data between processors, caches, and memory.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:Busbased SMP.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
At any point in logical time, the permissions for a cache block allow either a single writer or multiple readers. The '''''coherence protocol''''' ensures that the invariants of the states are maintained. The coherence states used by most cache coherence protocols are shown in ''Table 1'':&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''States'''&lt;br /&gt;
|  '''Access Type'''&lt;br /&gt;
|  '''Invariant'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  read, write&lt;br /&gt;
|  all other caches in I state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  read&lt;br /&gt;
|  all other caches in I state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Owned'''&lt;br /&gt;
|  read&lt;br /&gt;
|  all other caches in I or S state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  read&lt;br /&gt;
|  no other cache in M or E state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|  -&lt;br /&gt;
|  -&lt;br /&gt;
|}&lt;br /&gt;
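The invariants in ''Table 1'' can be expressed as a small check over the states that all caches hold for one block. The following Python sketch is illustrative only: the function name is_coherent and the list-of-letters encoding are assumptions made for this example, not part of any real protocol implementation.&lt;br /&gt;

```python
# Hypothetical checker for the invariants listed in Table 1.
# Each list entry is the state (M, E, O, S or I) one cache holds for the same block.

def is_coherent(states):
    """Return True if the per-cache states of ONE block satisfy Table 1."""
    m = states.count("M")
    e = states.count("E")
    o = states.count("O")
    s = states.count("S")
    if m + e != 0:
        # M or E: exactly one such copy, and every other cache must be in I.
        return m + e == 1 and o + s == 0
    # No M/E copy: at most one owner; any number of S and I copies is fine.
    return o in (0, 1)


print(is_coherent(["M", "I", "I"]))   # single writer: True
print(is_coherent(["O", "S", "I"]))   # one owner plus a reader: True
print(is_coherent(["M", "S", "I"]))   # writer and reader together: False
```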
&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Snooping Protocols=&lt;br /&gt;
==MSI Protocol==&lt;br /&gt;
&lt;br /&gt;
'''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' is a three-state write-back '''invalidation protocol''' and one of the earliest snooping-based cache coherence protocols. It marks each cache line as '''Modified (M)''', '''Shared (S)''', or '''Invalid (I)'''. '''Invalid''' means the cache line is either not present or present but invalid. If the cache line is clean and may be shared by more than one processor, it is marked '''Shared'''. If the cache line is dirty and the processor has exclusive ownership of it, it is in the '''Modified''' state. A BusRdX causes all other caches to invalidate (demote) their copies to the '''I''' state; if the line is in the '''M''' state in another cache, that cache flushes it. A write to a line in the '''S''' state, even though it hits in the cache, still issues a BusRdX so that the line can be promoted (upgraded) to the '''M''' state.&lt;br /&gt;
&lt;br /&gt;
The state diagram of the MSI protocol is shown below. Note that the left part shows the response to processor-side requests, and the right part shows the response to snooper-side requests.&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:MSInew.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
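The MSI transitions described above can also be written as a lookup table. The Python sketch below is a minimal illustration under the assumption of a single cache line and atomic bus transactions; the dictionary layout and the helper name run are hypothetical, while the event names (PrRd, PrWr, BusRd, BusRdX, Flush) follow the usual textbook convention.&lt;br /&gt;

```python
# Illustrative MSI table: (current state, event) maps to (next state, bus action).
# A bus action of None means the event is handled without using the bus.
MSI = {
    ("I", "PrRd"):   ("S", "BusRd"),
    ("I", "PrWr"):   ("M", "BusRdX"),
    ("S", "PrRd"):   ("S", None),
    ("S", "PrWr"):   ("M", "BusRdX"),   # a cache hit that still needs the bus
    ("M", "PrRd"):   ("M", None),
    ("M", "PrWr"):   ("M", None),
    ("I", "BusRd"):  ("I", None),
    ("I", "BusRdX"): ("I", None),
    ("S", "BusRd"):  ("S", None),
    ("S", "BusRdX"): ("I", None),       # another cache wants to write
    ("M", "BusRd"):  ("S", "Flush"),    # supply the dirty block, keep a clean copy
    ("M", "BusRdX"): ("I", "Flush"),
}


def run(events, state="I"):
    """Feed a sequence of events to one cache line; return (final state, bus actions)."""
    actions = []
    for event in events:
        state, action = MSI[(state, event)]
        if action is not None:
            actions.append(action)
    return state, actions


# A read followed by a write costs two bus transactions in MSI.
print(run(["PrRd", "PrWr"]))
```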
&lt;br /&gt;
==Synapse protocol==&lt;br /&gt;
From the state transition diagram of MSI, we observe that there is a transition from state '''M''' to state '''S''' when a BusRd is observed for that block. The contents of the block are flushed to the bus before going to the '''S''' state. In certain cases it would be more appropriate to move to the '''I''' state, giving up the block entirely. The choice between moving to '''S''' or '''I''' reflects the designer's assertion about whether the original processor is more likely to continue reading the block than the new processor is to write to it. The Synapse protocol, used in the early Synapse multiprocessor, made the alternate choice of going directly from the '''M''' state to the '''I''' state on a BusRd, assuming that the migratory pattern would be more frequent. More details about this protocol can be found in these papers published in the 1980s: [http://portal.acm.org/citation.cfm?id=6514 Cache Coherence protocols: evaluation using a multiprocessor simulation model] and [http://portal.acm.org/citation.cfm?id=1499317&amp;amp;dl=GUIDE&amp;amp;coll=GUIDE&amp;amp;CFID=83027384&amp;amp;CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems].&lt;br /&gt;
&lt;br /&gt;
In the Synapse protocol, the '''M''' state is called the '''D''' (Dirty) state. The following is the state transition diagram for the Synapse protocol:&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:Synapse1.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
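The one transition in which Synapse differs from MSI, as described above, can be isolated in a few lines. This Python sketch is an illustration only; the function name on_snooped_busrd and the string protocol names are assumptions, and the dirty state is written as M even though Synapse calls it D.&lt;br /&gt;

```python
# Illustrative only: the snooper-side reaction to an observed BusRd.
# The dirty state is written as M here; the Synapse protocol calls it D.

def on_snooped_busrd(state, protocol):
    """Return (next state, flushes block) for a cache that observes a BusRd."""
    if state != "M":
        return state, False      # S and I copies are unaffected by a remote read
    if protocol == "Synapse":
        return "I", True         # give the block up entirely (migratory pattern)
    return "S", True             # MSI keeps a readable copy


print(on_snooped_busrd("M", "MSI"))
print(on_snooped_busrd("M", "Synapse"))
```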
&amp;lt;br/&amp;gt;&lt;br /&gt;
===Real Architectures using Synapse===&lt;br /&gt;
Nothing could be found on any actual architecture using the Synapse cache-coherence protocol.  Judging from the abstract of the 1984 paper by Steve Frank and Armond Inselberg, &amp;quot;[http://dl.acm.org/citation.cfm?id=1499317 Synapse Tightly Coupled Multiprocessors: A New Approach to Solve Old Problems]&amp;quot;, Synapse was theoretical.&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The [http://www.disi.unige.it/person/DelzannoG/CacheProtocol/synapse.hy BABYLON Project] provides one example of a Synapse cache coherence protocol with atomic synchronization actions.&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
In the Synapse system, memory contention is reduced by designing the processor cache around a non-write-through algorithm that minimizes the bandwidth used between cache and shared memory. The Synapse Expansion Bus includes an ownership-level protocol between processor caches.[[#References|&amp;lt;sup&amp;gt;[24]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
==[http://en.wikipedia.org/wiki/MESI_protocol MESI]==&lt;br /&gt;
The main drawback of MSI is that each read-then-write sequence incurs two bus transactions, regardless of whether the cache line is stored in only one cache. Highly parallel programs that have little data sharing suffer the most from this.  The MESI protocol solves this problem by introducing the '''Exclusive''' state to distinguish between a cache line stored in multiple caches and a line stored in a single cache.&lt;br /&gt;
Let us briefly see how the MESI protocol works. For a more detailed version, refer to the Solihin textbook, p. 215.&lt;br /&gt;
&lt;br /&gt;
The MESI coherence protocol marks each cache line as being in one of the Modified, Exclusive, Shared, or Invalid states. &lt;br /&gt;
* '''Invalid''' : The cache line is either not present or is invalid&lt;br /&gt;
* '''Exclusive''' : The cache line is clean and is owned by this core/processor only&lt;br /&gt;
* '''Modified''' : The cache line is dirty and the core/processor has exclusive ownership of it; not even memory holds a current copy.&lt;br /&gt;
* '''Shared''' : The cache line is clean and may be shared by more than one core/processor&lt;br /&gt;
&lt;br /&gt;
The MESI protocol works as follows: &lt;br /&gt;
A line that is fetched receives the '''E''' or '''S''' state depending on whether it exists in other caches in the system. A cache line gets the '''M''' state when a processor writes to it; if the line is not in the '''E''' or '''M''' state prior to the write, the cache sends a Bus Upgrade (BusUpgr) signal or, as the Intel manuals term it, a “Read-For-Ownership (RFO) request”, which ensures that the line exists in the cache and is in the '''I''' state in all other processors on the bus (if any). The table below summarizes the MESI protocol.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''This cache line is valid?'''&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  No&lt;br /&gt;
|-&lt;br /&gt;
|  '''The memory copy is…'''&lt;br /&gt;
|  out of date&lt;br /&gt;
|  valid&lt;br /&gt;
|  valid&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies exist in caches of other processors?'''&lt;br /&gt;
|  No&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe&lt;br /&gt;
|  Maybe&lt;br /&gt;
|-&lt;br /&gt;
|  '''A write to this line'''&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  goes directly to bus&lt;br /&gt;
|}&lt;br /&gt;
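The saving that the Exclusive state brings can be made concrete by counting bus transactions for a read miss followed by a write to the same line. The Python sketch below is a simplified model under stated assumptions: the function name bus_transactions is hypothetical, and only the read miss (BusRd) and the upgrade request (BusUpgr or RFO) are counted.&lt;br /&gt;

```python
# Simplified model: count bus transactions for "read miss, then write"
# on one cache line. Only BusRd and BusUpgr (RFO) requests are counted.

def bus_transactions(protocol, other_sharers):
    """Return the number of bus transactions for a read miss followed by a write."""
    count = 1                                  # the read miss always posts a BusRd
    if protocol == "MESI" and not other_sharers:
        state = "E"                            # no other cache holds the line
    else:
        state = "S"                            # MSI has no E state to fall into
    if state == "S":
        count += 1                             # the write must invalidate other copies
    return count


print(bus_transactions("MSI", other_sharers=False))    # 2
print(bus_transactions("MESI", other_sharers=False))   # 1
print(bus_transactions("MESI", other_sharers=True))    # 2
```

For private data, which is the common case in programs with little sharing, MESI therefore halves the number of bus transactions in this model.&lt;br /&gt;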
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The transition diagram from the lecture slides is given below for reference.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:MESI.jpg]]&amp;lt;/center&amp;gt; &amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Real Architectures using MESI===&lt;br /&gt;
&lt;br /&gt;
Intel's [http://en.wikipedia.org/wiki/Pentium_Pro Pentium Pro] microprocessor, introduced in 1995, was the first Intel architecture microprocessor to support symmetric multiprocessing in various multiprocessor configurations.  SMP with the MESI protocol was the architecture used consistently until the introduction of the 45-nm Hi-k Core microarchitecture in Intel's quad-core x86-64 (Nehalem-EP) processors. The 45-nm Hi-k Intel Core microarchitecture uses a new system framework called the [http://en.wikipedia.org/wiki/Intel_QuickPath_Interconnect QuickPath Interconnect], which uses point-to-point interconnection technology based on a distributed shared memory architecture.  It uses a modified version of the MESI protocol called [http://en.wikipedia.org/wiki/MESIF_protocol MESIF], which introduces the additional Forward state.  (The state diagram of the MESI transitions that occur within the Pentium's data cache can be found on page 63 of [http://books.google.com/books?id=TVzjEZg1--YC&amp;amp;printsec=frontcover&amp;amp;source=gbs_ge_summary_r&amp;amp;cad=0#v=onepage&amp;amp;q&amp;amp;f=false Pentium Processor System Architecture: Second Edition] by Don Anderson, Tom Shanley, MindShare, Inc.)&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
The [http://www.arm.com/products/processors/classic/arm11/arm11-mpcore.php ARM11 MPCore] and [http://en.wikipedia.org/wiki/ARM_Cortex-A9_MPCore Cortex-A9 MPCore] processors support the MESI cache coherency protocol.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]  ARM MPCore defines the states of the MESI protocol it implements as:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies in other caches'''&lt;br /&gt;
|  NO&lt;br /&gt;
|  NO&lt;br /&gt;
|  YES&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Clean or Dirty'''&lt;br /&gt;
|  DIRTY&lt;br /&gt;
|  CLEAN&lt;br /&gt;
|  CLEAN&lt;br /&gt;
|  -&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
The coherency protocol is implemented and managed by the Snoop Control Unit (SCU) in the ARM MPCore, which monitors the traffic between the local L1 data caches and the next level of the memory hierarchy. At boot time, each core can choose whether or not to take part in the coherency domain.  Unless explicit system calls bind a task to a specific core ([http://en.wikipedia.org/wiki/Processor_affinity processor affinity]), there is a high chance that a task will at some point migrate to a different core, along with its data as it is used.  Task migration is not handled efficiently by a literal MESI implementation, so the ARM MPCore offers two optimizations that allow for MESI compliance and efficient task migration. With '''Direct Data Intervention (DDI)''', the SCU keeps a copy of the tag RAMs of all cores' caches, which lets it efficiently detect whether a cache line requested by one core is held by another core in the coherency domain before looking for it in the next level of the memory hierarchy. With '''Cache-to-cache Migration''', if the SCU finds that the cache line requested by one CPU is present in another core, it either copies it (if clean) or moves it (if dirty) from the other CPU directly into the requesting one, without interacting with external memory.  These optimizations reduce memory traffic in and out of the L1 cache subsystem by eliminating interaction with external memories, which in effect reduces the overall load on the interconnect and the overall power consumption.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
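The two SCU optimizations described above amount to a lookup in a duplicated tag store followed by a copy-or-move decision. The Python sketch below is a rough illustration, not ARM's implementation: the function name scu_lookup, the core names, and the dictionary that stands in for the duplicated tag RAMs are all assumptions made for this example.&lt;br /&gt;

```python
# Rough model of the SCU decision. tag_copies stands in for the duplicated
# tag RAMs: it maps a core name to {line address: "clean" or "dirty"}.

def scu_lookup(tag_copies, requester, line):
    """Return (supplying core or None, action) for a line requested by one core."""
    for core, tags in tag_copies.items():
        if core != requester and line in tags:
            # Cache-to-cache migration: copy a clean line, move a dirty one.
            action = "move" if tags[line] == "dirty" else "copy"
            return core, action
    # No other core holds the line, so go to the next level of the hierarchy.
    return None, "fetch from next level"


tags = {"cpu0": {0x40: "dirty", 0x80: "clean"}, "cpu1": {}}
print(scu_lookup(tags, "cpu1", 0x40))
print(scu_lookup(tags, "cpu1", 0x80))
print(scu_lookup(tags, "cpu1", 0xC0))
```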
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
Other architectures that use the MESI cache-coherence protocol include the L2 cache of the IBM POWER4 processor[[#References|&amp;lt;sup&amp;gt;[20]&amp;lt;/sup&amp;gt;]], the L2 cache of the Intel Itanium 2 processor[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]], and the Intel Xeon[[#References|&amp;lt;sup&amp;gt;[22]&amp;lt;/sup&amp;gt;]].&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
One implementation complexity already mentioned is the inefficiency of task migration in MESI that the ARM MPCore addresses.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]  Another possible implementation complexity is found during the replacement of a cache line: one possible MESI implementation requires a message to be sent to main memory when a cache line is flushed (i.e. an E to I transition), as the line was exclusively in one cache before it was removed. It is possible to avoid this replacement message if the system is designed so that the flush of a modified (exclusive) line requires an acknowledgment from main memory. However, this requires the flush to be stored in a 'write-back' buffer until the reply arrives (to ensure the change is successfully propagated to memory).[[#References|&amp;lt;sup&amp;gt;[23]&amp;lt;/sup&amp;gt;]]&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==MESIF==&lt;br /&gt;
Let us now briefly walk through the '''MESIF protocol''':&lt;br /&gt;
&lt;br /&gt;
The '''MESIF''' protocol, used in the latest Intel multi-core processors, was introduced to '''accommodate the point-to-point''' links used in the QuickPath Interconnect. Using the '''MESI''' protocol in this architecture would send many redundant messages between different processors, often with unnecessarily high latency. For example, when a processor requests a cache line that is stored in multiple locations, every location might respond with the data. As the requesting processor only needs a single copy of the data, the system would be wasting bandwidth. &lt;br /&gt;
As a solution to this problem, an additional state, the '''Forward state''', was added by slightly changing the role of the Shared state. Whenever there is a read request, only the cache holding the line in the F state responds to the request, while all the caches holding it in the S state remain dormant.  Hence, by designating a single cache line to '''respond to requests''', coherency traffic is substantially reduced when multiple copies of the data exist. Also, on a read request, the responding copy transitions from the F state to the S state. That is, when a cache line in the '''F''' state is '''copied''', the F state '''migrates''' to the '''newer copy''', while the '''older''' one drops back to '''S'''. Moving the new copy to the F state '''exploits''' both '''temporal and spatial locality'''. Because the newest copy of the cache line is always in the F state, it is very unlikely that the line in the F state will be evicted from the caches. This takes advantage of the temporal locality of the request. The second advantage is that if a particular cache line is in high demand due to spatial locality, the bandwidth used to transmit that data will be spread across several nodes.&lt;br /&gt;
All M to S and E to S state transitions will now be '''M to F''' and '''E to F''' transitions.  &lt;br /&gt;
The '''F state''' is '''different''' from the '''Owned state''' of the MOESI protocol as it is '''not''' a unique copy because a valid copy is stored in memory. Thus, unlike the Owned state of the MOESI protocol, in which the data in the O state is the only valid copy of the data, the data in the F state can be evicted or converted to the S state, if desired. &lt;br /&gt;
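The forwarding rule (only the F copy answers, and the F state migrates to the newest copy) can be sketched as follows. This Python model is a simplification under stated assumptions: the function name read_miss is hypothetical, Modified copies are not modelled, and giving the requester the F state when only S copies remain is an assumption of the sketch rather than something stated in the text.&lt;br /&gt;

```python
# Simplified MESIF read miss for one line. states maps a cache name to one of
# I, S, E or F. Modified copies are deliberately left out of this sketch.

def read_miss(states, requester):
    """Apply a read miss; return the supplier of the data (a cache name or "memory")."""
    supplier = "memory"
    others_valid = False
    for name in states:
        if name == requester or states[name] == "I":
            continue
        others_valid = True
        if states[name] in ("F", "E"):
            supplier = name           # the single designated responder answers
            states[name] = "S"        # the older copy drops back to S
        # caches in S stay silent
    # Assumption: the newest copy becomes the forwarder whenever sharers exist.
    states[requester] = "F" if others_valid else "E"
    return supplier


line = {"P0": "F", "P1": "S", "P2": "I"}
print(read_miss(line, "P2"), line)
```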
&lt;br /&gt;
More information on the QuickPath Interconnect and MESIF protocol can be found at&lt;br /&gt;
'''[http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]'''&lt;br /&gt;
&lt;br /&gt;
==MOESI==&lt;br /&gt;
The [http://en.wikipedia.org/wiki/Opteron AMD Opteron] was AMD’s first-generation dual-core processor, with two distinct [http://en.wikipedia.org/wiki/Athlon_64 K8 cores] on a single die.  Cache coherence becomes a bigger problem on such multiprocessors, and an appropriate coherence protocol was needed to address it. The [http://en.wikipedia.org/wiki/Xeon Intel Xeon], the competing processor from Intel, used the MESI protocol to handle cache coherence.  MESI came with the drawback of using much time and bandwidth in certain situations. &lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/MOESI_protocol MOESI] was AMD’s answer to this problem. MOESI adds a fifth state, '''“Owned”''', to the MESI protocol. MOESI addresses the bandwidth problem faced in the MESI protocol when a processor that has invalid data in its cache wants to access data that another processor has modified.  The processor seeking access has to wait for the processor that modified the data to write it back to main memory, which takes time and bandwidth. MOESI removes this drawback by allowing dirty sharing.  When the data is held by a processor in the new '''“Owned”''' state, it can provide the modified data to other processors without, or even before, writing it to main memory. This is called '''''dirty sharing'''''. The processor with the data in the '''&amp;quot;Owned&amp;quot;''' state remains responsible for updating main memory later, when the cache line is evicted.&lt;br /&gt;
&lt;br /&gt;
MOESI has become one of the most popular snoop-based protocols supported in the AMD64 architecture.  The AMD dual-core Opteron can maintain cache coherence in systems up to 8 processors using this protocol.&lt;br /&gt;
&lt;br /&gt;
The five different states of the MOESI protocol are:&lt;br /&gt;
* '''Modified (M)''' : The cache line holds the most recent, correct copy of the data, and no other processor cache holds a copy.&lt;br /&gt;
* '''Owned (O)''' : The cache line holds the most recent, correct copy of the data, which may be shared by other processors. The processor holding the line in this state is responsible for updating main memory with the correct value before the line is evicted.  &lt;br /&gt;
* '''Exclusive (E)''' : A cache line holds the most recent, correct copy of the data, which is exclusively present on this processor and a copy is present in the main memory.  &lt;br /&gt;
* '''Shared (S)''' : A cache line in the shared state holds the most recent, correct copy of the data, which may be shared by other processors. &lt;br /&gt;
* '''Invalid (I)''' : A cache line does not hold a valid copy of the data.&lt;br /&gt;
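The effect of dirty sharing on memory traffic can be illustrated by counting write-backs. The Python sketch below is a simplified model, assuming one writer and one reader that take turns on a single line; the function name memory_writes and the round-based workload are hypothetical.&lt;br /&gt;

```python
# Simplified model: one writer and one reader alternate on a single cache line.
# Each round the writer modifies the line and the reader then reads it.
# After the last round the line is evicted from the writer's cache.

def memory_writes(protocol, rounds):
    """Count write-backs to main memory for the workload described above."""
    writes = 0
    dirty = False
    for _ in range(rounds):
        dirty = True              # the writer modifies the line
        if protocol == "MESI":
            writes += 1           # M to S: the flush also updates memory
            dirty = False
        # MOESI: M to O, the data is forwarded cache to cache, memory stays stale
    if dirty:
        writes += 1               # the owner writes back once, on eviction
    return writes


print(memory_writes("MESI", 4))    # 4
print(memory_writes("MOESI", 4))   # 1
```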
&lt;br /&gt;
A detailed explanation of this protocol implementation on AMD processor can be found in the manual [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of the AMD 64-bit core]&lt;br /&gt;
&lt;br /&gt;
The following table summarizes the MOESI protocol:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Owned'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''This cache line is valid?'''&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  No&lt;br /&gt;
|-&lt;br /&gt;
|  '''The memory copy is…'''&lt;br /&gt;
|  out of date&lt;br /&gt;
|  out of date&lt;br /&gt;
|  valid&lt;br /&gt;
|  valid (unless another cache holds the line in the Owned state)&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies exist in caches of other processors?'''&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe (in the Shared state)&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe&lt;br /&gt;
|  Maybe&lt;br /&gt;
|-&lt;br /&gt;
|  '''A write to this line'''&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  goes directly to bus&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
State transition for MOESI is as shown below : &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:MOESI_State_Transition_Diagram.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; MOESI State transition Diagram&amp;lt;/center&amp;gt;&lt;br /&gt;
&amp;lt;br /&amp;gt;&lt;br /&gt;
&amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Optimization techniques on MOESI===&lt;br /&gt;
&lt;br /&gt;
In real machines, applying optimization techniques to the standard cache coherence protocol can improve performance. For example, the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] family of microprocessors (Family 0x10), AMD’s first generation to incorporate four distinct cores on a single die and the first to have a cache that all the cores share, uses the MOESI protocol with some optimization techniques incorporated. &lt;br /&gt;
&lt;br /&gt;
The optimization focuses on a small subset of compute problems that behave like producer-consumer programs. In such a problem, a thread running on one core produces data, which is consumed by a thread running on a separate core. With such programs, it is desirable to get the two distinct cores to communicate through the shared cache, to avoid round trips to and from main memory. The '''MOESI''' protocol that the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] cache uses for cache coherence can also limit bandwidth. Hence, by keeping the cache line in the '''‘M’''' state for such computing problems, better performance can be achieved.&lt;br /&gt;
&lt;br /&gt;
When the producer thread writes a new entry, it allocates cache lines in the '''modified (M)''' state. Eventually, these M-marked cache lines will start to fill the L3 cache. When the consumer reads the cache line, the MOESI protocol changes the state of the cache line to '''owned (O)''' in the L3 cache and pulls down a '''shared (S)''' copy for its own use. Now, the producer thread circles the ring buffer and arrives back at the same cache line it had previously written. However, when the producer attempts to write new data to the owned (marked '''‘O’''') cache line, it finds that it cannot, since a cache line marked '''‘O’''' by the previous consumer read does not have sufficient permission for a write request (in the MOESI protocol). To maintain coherence, the memory controller must initiate probes in the other caches (to handle any other S copies that may exist). This slows down the process.&lt;br /&gt;
&lt;br /&gt;
Thus, it is preferable to keep the cache line in the '''‘M’''' state in the L3 cache. In such a situation, when the producer comes back around the ring buffer, it finds the previously written cache line still marked '''‘M’''', to which it is safe to write without coherence concerns. Thus better performance can be achieved by such optimization techniques to standard protocols when implemented in real machines.&lt;br /&gt;
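The cost of the O-state lines in the ring-buffer scenario can be estimated with a toy model. The Python sketch below is only an illustration under strong assumptions: every entry is read by the consumer right after it is written, each write to an O-marked line costs exactly one round of probes, and the function name probes_needed and its parameters are hypothetical.&lt;br /&gt;

```python
# Toy model of a producer circling a ring buffer held in a shared cache.
# keep_modified=True models the optimization that leaves lines in the M state.

def probes_needed(slots, laps, keep_modified):
    """Count probe rounds the producer triggers while rewriting the buffer."""
    line_state = ["I"] * slots
    probes = 0
    for _ in range(laps):
        for i in range(slots):
            if line_state[i] == "O":
                probes += 1       # S copies may exist and must be handled first
            line_state[i] = "M"   # the producer writes the entry
            # The consumer reads the entry right away.
            line_state[i] = "M" if keep_modified else "O"
    return probes


print(probes_needed(slots=4, laps=3, keep_modified=False))   # 8
print(probes_needed(slots=4, laps=3, keep_modified=True))    # 0
```

In this model the first lap is free, and every later lap pays one probe round per entry unless the lines stay in the M state.&lt;br /&gt;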
&lt;br /&gt;
You can find more information on how this is implemented and various other ways of optimizations in this manual [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Dragon Protocol==&lt;br /&gt;
The '''[http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]''' is an update-based coherence protocol; unlike the coherence protocols we have seen so far, it does not invalidate other cached copies. Write propagation is achieved by updating the cached copies instead of invalidating them.  The Dragon Protocol does not update memory on a cache-to-cache transfer; it delays making memory consistent with the cache until the data is evicted and written back, which saves time and lowers the memory access requirements. Moreover, only the written '''byte''' or '''word''' is '''communicated''' to the '''other caches''' instead of the whole block, which further '''reduces''' '''bandwidth''' usage. It can dynamically detect the sharing status of a block and use a write-through policy for shared blocks and a write-back policy for currently non-shared blocks. The Dragon Protocol employs the following four states for cache blocks: '''Shared Clean''', '''Shared Modified''', '''Exclusive''' and  '''Modified'''. &lt;br /&gt;
* '''Modified (M)''' and '''Exclusive (E)''' - these states have the same meaning as explained in the protocols above. &lt;br /&gt;
* '''Shared Modified (Sm)''' - Only one cache in the system can hold the line in the Shared Modified state. Potentially two or more caches have this block, memory may or may not be up to date, and this processor's cache has modified the block.&lt;br /&gt;
* '''Shared Clean (Sc)''' -  Potentially two or more caches have this block and memory may or may not be up to date (if no other cache has it in Sm state, memory will be up to date else it is not).&lt;br /&gt;
Only when a Shared Modified line is evicted from the cache on a cache miss is the block written back to main memory in order to keep memory consistent. For more information on the Dragon protocol, refer to the Solihin textbook, page 229. The state transition diagram is given below for reference.&lt;br /&gt;
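The update-based behaviour on a write hit can be sketched in a few lines. This Python model is illustrative only: the function name processor_write and the dictionary encoding are assumptions, caches that do not hold the block are simply absent from the dictionary, and memory is never touched because Dragon defers the write-back until eviction.&lt;br /&gt;

```python
# Illustrative Dragon write hit. caches maps a cache name to [state, value]
# for ONE block; caches that do not hold the block are absent from the dict.

def processor_write(caches, writer, value):
    """Apply a write by `writer`; return the number of bus updates sent."""
    sharers = [name for name in caches if name != writer]
    caches[writer][1] = value
    if not sharers:
        caches[writer][0] = "M"        # E or M: write locally, no bus traffic
        return 0
    for name in sharers:
        caches[name] = ["Sc", value]   # copies are updated, not invalidated
    caches[writer][0] = "Sm"           # the writer now owns the dirty shared block
    return 1


block = {"P0": ["Sc", 1], "P1": ["Sm", 1]}
print(processor_write(block, "P0", 7), block)
```

Note that the previous Sm holder drops to Sc, so only one cache is in the Shared Modified state at any time.&lt;br /&gt;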
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:Dragon.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The Dragon protocol implements snoopy caches that provide the appearance of a uniform memory space to multiple processors. Here, each cache listens to two buses: the processor bus and the memory bus. The caches are also responsible for address translation, so the processor bus carries virtual addresses and the memory bus carries physical addresses.  The Dragon system was designed to support 4 to 8 Dragon processors. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=[http://en.wikipedia.org/wiki/Instruction_prefetch Instruction Prefetching]=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Instruction prefetching is a technique used to speed up the execution of a program. But in multiprocessors, prefetching comes at a cost. Due to prefetching, data can be modified in such a way that the memory coherence protocol will not be able to handle the effects. In such situations, software must use serializing instructions or cache-invalidation instructions to guarantee that subsequent data accesses are coherent. &lt;br /&gt;
&amp;lt;br /&amp;gt;&lt;br /&gt;
An example of this type of a situation is a page-table update followed by accesses to the physical pages referenced by the updated page tables. The physical-memory references for the page tables are different than the physical-memory references for the data. Because of prefetching there may be problems with correctness. The following sequence of events shows such a situation when software changes the translation of virtual-page A from physical-page M to physical-page N:&lt;br /&gt;
# The tables that translate virtual-page A to physical-page M are now held only in main memory. The copies in the cache are invalidated.&lt;br /&gt;
# Software changes the page-table entry for virtual-page A in main memory to point to physical-page N rather than physical-page M.&lt;br /&gt;
# Data in virtual-page A is accessed.&lt;br /&gt;
&lt;br /&gt;
Software expects the processor to access the data from physical-page N after the update. However, it is possible for the processor to prefetch the data from physical-page M before the page table for virtual page A is updated. Because the physical-memory references are different, the processor does not recognize them as requiring coherence checking and believes it is safe to prefetch the data from virtual-page A, which is translated into a read from physical page M. Similar behavior can occur when instructions are prefetched from beyond the page-table update instruction.&lt;br /&gt;
&lt;br /&gt;
To prevent such errors, software must perform a TLB invalidation operation (on AMD64, an INVLPG or MOV CR3 instruction) immediately after the page-table update, to ensure that subsequent instruction fetches and data accesses use the correct virtual-page-to-physical-page translation. It is not necessary to perform a TLB invalidation operation preceding the table update.&lt;br /&gt;
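The hazard can be illustrated with a toy model. The sketch below is not processor code: the dictionaries page_table and tlb and the functions translate and update_mapping are hypothetical names, and the early lookup stands in for a prefetch that translated virtual-page A before the update. It only shows why an invalidation after the update is needed.&lt;br /&gt;

```python
# Toy model: "tlb" caches translations that the processor fetched early.
page_table = {"A": "M"}   # virtual page -> physical page, held in memory
tlb = {}                  # translations cached inside the processor

def translate(vpage):
    """Return the physical page, preferring a cached translation."""
    if vpage not in tlb:
        tlb[vpage] = page_table[vpage]  # page-table walk
    return tlb[vpage]

def update_mapping(vpage, new_ppage, invalidate):
    """Change the page table; optionally drop the cached translation."""
    page_table[vpage] = new_ppage
    if invalidate:
        tlb.pop(vpage, None)  # stands in for a TLB invalidation operation

translate("A")                               # early access caches A -> M
update_mapping("A", "N", invalidate=False)
stale = translate("A")                       # still "M": stale translation
update_mapping("A", "N", invalidate=True)
fresh = translate("A")                       # now "N", as software expects
```

Without the invalidation step, the access to virtual-page A keeps reading physical-page M even though the page table already points to physical-page N.&lt;br /&gt;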
&lt;br /&gt;
More information about this can be found in the [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual].&lt;br /&gt;
&lt;br /&gt;
= CMP Implementation in Intel Architecture =&lt;br /&gt;
&lt;br /&gt;
Let us now see how Intel architecture using the MESI protocol progressed from a uniprocessor architecture to a Chip MultiProcessor (CMP) using the bus as the interconnect. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Uniprocessor Architecture'''&lt;br /&gt;
&lt;br /&gt;
The diagram below shows the structure of the memory cluster in the Intel Pentium M processor.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:intel_cache1.jpg]]&amp;lt;/center&amp;gt; &amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In this structure we have,&lt;br /&gt;
* A unified on-chip '''L1 cache''' with the '''processor/core''',&lt;br /&gt;
* A '''Memory/L2 access control unit''', through which all the accesses to the L2 cache, main memory and IO space are made,&lt;br /&gt;
* The second level '''L2 cache''' along with the '''prefetch unit''' and&lt;br /&gt;
* '''Front side bus (FSB)''', a single shared bi-directional bus across which all the traffic is sent. These wide buses bring in multiple data bytes at a time. &lt;br /&gt;
&lt;br /&gt;
As Intel explains it, with this structure processor requests were first sought in the '''L2 cache''' and, only on a '''miss''', '''forwarded''' to the main '''memory''' via the front side bus ('''FSB'''). The '''Memory/L2 access control''' unit served as a central point for '''maintaining coherence''' within the core and with the external world. It '''contains''' a '''snoop control unit''' that receives snoop requests from the bus and performs the required operations on each cache (and internal buffers) in parallel. It also handles RFO requests (BusUpgr) and ensures the operation continues only after it guarantees that no other version of the cache line exists in any other cache in the system.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''CMP Architecture'''&lt;br /&gt;
&lt;br /&gt;
For the CMP implementation, Intel chose the bus-based architecture with snoopy protocols over the '''directory protocol'''. Although a directory protocol reduces active power through reduced snoop activity, it '''increases''' the '''design complexity''' and the '''static power''' because of its larger tag arrays. Since Intel has a large market for processors in the mobility family, a directory-based solution was less favorable: battery life depends mainly on static power consumption and less on dynamic power.&lt;br /&gt;
Let us examine how '''CMP''' was implemented in the '''Intel Core Duo''', which was one of the first dual-core processors for the budget/entry-level market. &lt;br /&gt;
The general CMP implementation structure of the Intel Core Duo is shown below.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:intel_cache2.jpg]]&amp;lt;/center&amp;gt; &amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This structure has the following changes when compared to the uniprocessor memory cluster structure. &lt;br /&gt;
* The '''L1 cache''' and the '''processor/core''' structure are '''duplicated''' to give 2 cores.&lt;br /&gt;
* The '''Memory/L2 access control''' unit is '''split''' into 2 logical units: '''L2 controller''' and '''bus controller'''. The L2 controller handles all '''requests to the L2''' cache from the core and the snoop requests from the FSB. The '''bus controller''' handles '''data and I/O requests''' to and from the FSB.&lt;br /&gt;
* The '''prefetching''' unit is extended to handle the hardware '''prefetches for each core separately'''.&lt;br /&gt;
* A '''new logical unit''' (represented by the hexagon) was added to maintain '''fairness between the requests''' coming from the different cores and hence balance the requests to L2 and memory.&lt;br /&gt;
&lt;br /&gt;
This new '''partitioned structure''' for the  memory/L2 access control unit '''enhanced''' the '''performance''' while '''reducing power consumption'''. &lt;br /&gt;
For more information on uniprocessor and multiprocessor implementation under the Intel architecture, refer to &lt;br /&gt;
[http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]&lt;br /&gt;
&lt;br /&gt;
The '''Intel bus architecture''' has been '''evolving''' to accommodate the demands of scalability while using the same MESI protocol: from a '''single shared bus''', to '''dual independent buses (DIB)''' that double the available bandwidth, and on to the logical conclusion of DIB, '''dedicated high-speed interconnects (DHSI)'''. The DHSI-based platforms use four FSBs, one for each processor in the platform. In both DIB and DHSI, a snoop filter was used in the chipset to cache snoop information, thereby significantly reducing the broadcasting needed for the snoop traffic on the buses. With the production of processors based on the next-generation 45-nm Hi-k Intel Core microarchitecture, the [http://en.wikipedia.org/wiki/Xeon Intel Xeon] processor fabric will transition from a DHSI, with the memory controller in the chipset, to a distributed shared memory architecture using '''Intel QuickPath Interconnects with the MESIF protocol'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Word Invalidation=&lt;br /&gt;
One source of complexity that applies to a number of the protocols is invalidation.  In newer protocols, individual words may be modified in a cache line, as opposed to the entire line.  The other processors will thus have a mostly correct cache line that differs in only one word.  This leads to potential complexity, because to 'correct' the error the other processors must transition from shared to invalid, read the correct word from the bus, place it back into the line, and transition back to shared.  This transition is potentially unnecessary, as the second processor may never access the specific word that changed, but it may access other words in the cache line.  One potential solution being researched is to advance the protocols so that a cache line is &amp;quot;not invalidated on the first dirty word, but after the number of dirty words crosses some predetermined value, which is data type and application dependent.&amp;quot;  In other words, if the application can tolerate multiple 'incorrect' words, several transitions to and from the invalid state may be avoided.&lt;br /&gt;
&amp;lt;br/&amp;gt;Source: http://tab.computer.org/tcca/NEWS/sept96/dsmideas.ps&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In essence, the solution proposed here is to advance the MOESI protocol with word invalidation and specific treatment of temporal and spatial data, so that the block is not invalidated.&lt;br /&gt;
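The threshold idea can be sketched as follows. This is an illustrative model only, not the mechanism of the cited proposal: the class name SharedLine, the limit parameter and the per-word bookkeeping are assumptions made for the example.&lt;br /&gt;

```python
class SharedLine:
    """One cache's shared copy of a line that tolerates a few stale words."""

    def __init__(self, limit):
        self.limit = limit        # assumed tunable: dirty words tolerated
        self.stale_words = set()  # word offsets modified by another cache
        self.state = "S"

    def remote_write(self, word):
        """Another processor modified `word`; invalidate only past the limit."""
        if self.state == "I":
            return
        self.stale_words.add(word)
        if len(self.stale_words) > self.limit:
            self.state = "I"
            self.stale_words.clear()

    def can_read(self, word):
        """A word is usable if the line is valid and that word is not stale."""
        return self.state == "S" and word not in self.stale_words
```

With limit=2, a copy stays in the shared state after remote writes to two different words and still serves reads of the untouched words; the third distinct dirty word invalidates the line.&lt;br /&gt;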
&lt;br /&gt;
=Performance: MOESI vs MEI/MESI=&lt;br /&gt;
Early multiprocessors (such as the PowerPC processors) were designed to work with three states (Modified, Exclusive, Invalid - this is similar to the MSI protocol discussed earlier, the term MEI is used to remain consistent with the sources used for reference).  As time progressed, more multi-processors transitioned to the MESI protocol.  This is most likely due to the characteristics of MEI - it is &amp;quot;easy to implement but leads to inefficiencies in the way that memory bus bandwidth is used&amp;quot; [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 source].  As a result, systems with two or more MEI processors (most likely) will not see the 'full' potential of the extra processors, due to the inefficient bus transactions.&lt;br /&gt;
&lt;br /&gt;
In contrast, the five-state MOESI protocol is more complex to implement than MESI and MEI/MSI, but it brings advantages.  As Andy Keane (VP of Marketing for PMC-Sierra) put it, &amp;quot;this [fifth] state allows shared data that is dirty to remain in the cache.  Without this state, any shared line would be written back to memory to change the original state from modified. Since we have a dedicated, fast path from CPU to CPU, this state maximizes the use of this path rather than the path to memory, which is inherently slower&amp;quot; [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 source].  This advantage can be realized by letting MOESI make some processor-to-processor transfers from level 2 caches.&lt;br /&gt;
&lt;br /&gt;
Source: [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 Edwards, Chris (08/06/2001). &amp;quot;Architects wrestle with multiprocessor options&amp;quot;. Electronic engineering times (0192-1541), (1178), p. 48.]&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Cache_coherence Cache coherence]&lt;br /&gt;
# [http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]&lt;br /&gt;
# [http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Symmetric_multiprocessing Common System Interface in Intel Processors]&lt;br /&gt;
# [http://www.zak.ict.pwr.wroc.pl/nikodem/ak_materialy/Cache%20consistency%20&amp;amp;%20MESI.pdf Cache consistency with MESI on Intel processor]&lt;br /&gt;
# [http://techreport.com/articles.x/8236/2 AMD dual core Architecture]&lt;br /&gt;
# [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual]&lt;br /&gt;
# [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors]&lt;br /&gt;
# [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of AMD 64 bit core]&lt;br /&gt;
# [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&amp;amp;arnumber=4913 Silicon Graphics Computer Systems]&lt;br /&gt;
# [http://books.google.com/books?id=g82fofiqa5IC&amp;amp;printsec=frontcover&amp;amp;dq=Parallel+computer+architecture:+a+hardware/software+approach+By+David+E.+Culler,+Jaswinder+Pal+Singh,+Anoop+Gupta&amp;amp;source=bl&amp;amp;ots=COrdamlfVn&amp;amp;sig=YcugVqbzTjHvlofvaFq6Ft_tjfY&amp;amp;hl=en&amp;amp;ei=0ZO6S4TJGcOclgejzI3BBw&amp;amp;sa=X&amp;amp;oi=book_result&amp;amp;ct=result&amp;amp;resnum=1&amp;amp;ved=0CAgQ6AEwAA#v=onepage&amp;amp;q=&amp;amp;f=false Parallel computer architecture: a hardware/software approach By David E. Culler, Jaswinder Pal Singh, Anoop Gupta]&lt;br /&gt;
# [http://www.freepatentsonline.com/5283886.html Three state invalidation protocols]&lt;br /&gt;
# [http://portal.acm.org/citation.cfm?id=1499317&amp;amp;dl=GUIDE&amp;amp;coll=GUIDE&amp;amp;CFID=83027384&amp;amp;CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems]&lt;br /&gt;
# [http://portal.acm.org/citation.cfm?id=6514 Cache Coherence protocols: evaluation using a multiprocessor simulation model]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Xerox_Dragon Xerox Dragon]&lt;br /&gt;
# [http://thanaseto.110mb.com/courses/CSD-527-report-engl.pdf Coherence Protocols]&lt;br /&gt;
# [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&amp;amp;arnumber=289691 XDBus]&lt;br /&gt;
# [http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dai0228a/index.html ARM]&lt;br /&gt;
# [http://ixbtlabs.com/articles/ibmpower4/index.html IBM POWER4]&lt;br /&gt;
# [http://www.chiplist.com/Intel_Itanium_2_processor/tree3f-section--2235-/ Intel Itanium 2]&lt;br /&gt;
# [http://techreport.com/articles.x/8236/2 MESI-MESI-MOESI Banana-fana...]&lt;br /&gt;
# [http://rsim.cs.illinois.edu/rsim/Manual/node109.html RSIM]&lt;br /&gt;
# [http://dl.acm.org/citation.cfm?id=1499317 Synapse tightly coupled multiprocessors]&lt;/div&gt;</summary>
		<author><name>Pxu</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2010/8a_fu&amp;diff=59788</id>
		<title>CSC/ECE 506 Spring 2010/8a fu</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2010/8a_fu&amp;diff=59788"/>
		<updated>2012-03-19T01:21:22Z</updated>

		<summary type="html">&lt;p&gt;Pxu: /* MSI Protocol */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Introduction to bus-based cache coherence in real machines=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==SMP Architecture==&lt;br /&gt;
Most parallel software in the commercial market relies on the shared-memory programming model, in which all processors access the same physical address space. The most common multiprocessors today use the [http://en.wikipedia.org/wiki/Symmetric_multiprocessing SMP] architecture, which uses a common bus as the interconnect.  In the case of multicore processors (&amp;quot;chip multiprocessors,&amp;quot; or CMP), the SMP architecture applies to the cores, treating them as separate processors. The key problem of shared-memory multiprocessors is providing a consistent view of memory with various cache hierarchies.  This is called the '''''cache coherence problem'''''. Solving it is critical for correctness and is a performance-sensitive design point in supporting the shared-memory model. The cache coherence mechanisms not only govern communication in a shared-memory multiprocessor, but also typically determine how the memory system transfers data between processors, caches, and memory.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:Busbased SMP.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
At any point in logical time, the permissions for a cache block allow either a single writer or multiple readers. The '''''coherence protocol''''' ensures that the invariants of the states are maintained. The coherence states used by most cache coherence protocols are shown in ''Table 1'':&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''States'''&lt;br /&gt;
|  '''Access Type'''&lt;br /&gt;
|  '''Invariant'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  read, write&lt;br /&gt;
|  all other caches in I state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  read&lt;br /&gt;
|  all other caches in I state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Owned'''&lt;br /&gt;
|  read&lt;br /&gt;
|  all other caches in I or S state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  read&lt;br /&gt;
|  no other cache in M or E state&lt;br /&gt;
|-&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|  -&lt;br /&gt;
|  -&lt;br /&gt;
|}&lt;br /&gt;
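The invariants in Table 1 can be checked mechanically. The following sketch transcribes the table into a lookup and tests a list of per-cache states for a single block; the names ALLOWED_OTHERS and is_coherent are hypothetical, and the entry for the Invalid state (which the table leaves unconstrained) is an assumption that anything is allowed.&lt;br /&gt;

```python
# For each state, the states that *other* caches may hold for the same block
# (transcribed from Table 1; letters abbreviate the state names).
ALLOWED_OTHERS = {
    "M": {"I"},
    "E": {"I"},
    "O": {"I", "S"},
    "S": {"I", "S", "O"},            # "no other cache in M or E state"
    "I": {"M", "O", "E", "S", "I"},  # assumption: Invalid constrains nothing
}

def is_coherent(states):
    """Check Table 1 for one block, given the state held by each cache."""
    for i, mine in enumerate(states):
        others = states[:i] + states[i + 1:]
        if any(other not in ALLOWED_OTHERS[mine] for other in others):
            return False
    return True
```

For example, is_coherent(["O", "S", "S"]) holds, while ["M", "S"] and ["O", "O"] both violate the single-writer, multiple-reader rule.&lt;br /&gt;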
&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=Snooping Protocols=&lt;br /&gt;
==MSI Protocol==&lt;br /&gt;
&lt;br /&gt;
'''[http://en.wikipedia.org/wiki/MSI_protocol MSI]''' is a three-state write-back '''invalidation protocol''' and one of the earliest snooping-based cache coherence protocols. It marks each cache line as '''Modified (M)''', '''Shared (S)''' or '''Invalid (I)'''. '''Invalid''' means the cache line is either not present or is in the invalid state. If the cache line is clean and may be shared by more than one processor, it is marked '''Shared'''. If the cache line is dirty and the processor has exclusive ownership of it, it is in the '''Modified''' state. A BusRdX causes other caches to invalidate (demote) their copies to the '''I''' state; a cache that holds the line in the '''M''' state flushes it first. A write to a line in the '''S''' state, even though it hits in the cache, issues a BusRdX and promotes the line to the '''M''' state (upgrade).&lt;br /&gt;
&lt;br /&gt;
The following state transition diagram for MSI protocol explains the working of the protocol:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:MSInew.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
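The MSI transitions described above can also be written as a lookup table. This is a minimal sketch using the event names from the text (PrRd and PrWr for local accesses, BusRd and BusRdX for snooped requests); the names MSI and msi_step are hypothetical.&lt;br /&gt;

```python
# (state, event) -> (next_state, bus_action). PrRd/PrWr come from the local
# processor; BusRd/BusRdX are requests snooped from other caches.
MSI = {
    ("I", "PrRd"):   ("S", "BusRd"),
    ("I", "PrWr"):   ("M", "BusRdX"),
    ("S", "PrRd"):   ("S", None),
    ("S", "PrWr"):   ("M", "BusRdX"),   # write hit in S still goes to the bus
    ("S", "BusRd"):  ("S", None),
    ("S", "BusRdX"): ("I", None),
    ("M", "PrRd"):   ("M", None),
    ("M", "PrWr"):   ("M", None),
    ("M", "BusRd"):  ("S", "Flush"),
    ("M", "BusRdX"): ("I", "Flush"),
}

def msi_step(state, event):
    """Look up one transition; pairs not listed leave the state unchanged."""
    return MSI.get((state, event), (state, None))

# Trace one line: local read, local write, then two snooped requests.
state = "I"
trace = []
for event in ["PrRd", "PrWr", "BusRd", "BusRdX"]:
    state, action = msi_step(state, event)
    trace.append((state, action))
```

The trace walks the line through S, M, S and back to I, with a flush each time a snooped request finds the line in the M state.&lt;br /&gt;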
&lt;br /&gt;
==Synapse protocol==&lt;br /&gt;
From the state transition diagram of MSI, we observe that there is a transition from state '''M''' to state '''S''' when a BusRd is observed for that block. The contents of the block are flushed to the bus before going to the '''S''' state. In certain cases it would be more appropriate to move to the '''I''' state, giving up the block entirely. The choice between moving to '''S''' or '''I''' reflects the designer's assumption about whether the original processor is more likely to continue reading the block or the new processor is more likely to write to it. The Synapse protocol, used in the early Synapse multiprocessor, made the alternate choice of going directly from the '''M''' state to the '''I''' state on a BusRd, assuming the migratory pattern would be more frequent. More details about this protocol can be found in these papers published in the 1980s: [http://portal.acm.org/citation.cfm?id=6514 Cache Coherence protocols: evaluation using a multiprocessor simulation model] and [http://portal.acm.org/citation.cfm?id=1499317&amp;amp;dl=GUIDE&amp;amp;coll=GUIDE&amp;amp;CFID=83027384&amp;amp;CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems].&lt;br /&gt;
&lt;br /&gt;
In Synapse protocol '''M''' state is called '''D''' (Dirty) state. The following is the state transition diagram for Synapse protocol:&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:Synapse1.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
===Real Architectures using Synapse===&lt;br /&gt;
No information could be found on any actual architecture using the Synapse cache-coherence protocol.  Judging from the abstract of the paper written by Steve Frank and Armond Inselberg in 1984, &amp;quot;[http://dl.acm.org/citation.cfm?id=1499317 Synapse Tightly Coupled Multiprocessors: A New Approach to Solve Old Problems]&amp;quot;, Synapse was theoretical.&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
[http://www.disi.unige.it/person/DelzannoG/CacheProtocol/synapse.hy The BABYLON Project] provides one example of a Synapse cache coherence protocol with atomic synchronization actions.&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
In the Synapse system, memory contention is reduced by designing a processor cache that employs a non-write-through algorithm to minimize bandwidth between cache and shared memory. The Synapse Expansion Bus includes an ownership-level protocol between processor caches.[[#References|&amp;lt;sup&amp;gt;[24]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
==[http://en.wikipedia.org/wiki/MESI_protocol MESI]==&lt;br /&gt;
The main drawback of MSI is that each read-write sequence incurs two bus transactions irrespective of whether the cache line is stored in only one cache or not. Highly parallel programs that have little data sharing suffer the most from this.  MESI protocol solves this problem by introducing the '''Exclusive''' state to distinguish between a cache line stored in multiple caches and a line stored in a single cache.&lt;br /&gt;
Let us briefly see how the MESI protocol works. For a more detailed version, refer to the Solihin textbook, page 215.&lt;br /&gt;
&lt;br /&gt;
The MESI coherence protocol marks each cache line as being in one of the Modified, Exclusive, Shared, or Invalid states. &lt;br /&gt;
* '''Invalid''' : The cache line is either not present or is invalid&lt;br /&gt;
* '''Exclusive''' : The cache line is clean and is owned by this core/processor only&lt;br /&gt;
* '''Modified''' : The cache line is dirty and the core/processor has exclusive ownership of it; the copy in memory is not valid either.&lt;br /&gt;
* '''Shared''' : The cache line is clean and is shared by more than one core/processor&lt;br /&gt;
&lt;br /&gt;
The MESI protocol works as follows: &lt;br /&gt;
A fetched line receives the '''E''' or '''S''' state depending on whether it exists in other processors' caches in the system. A cache line gets the '''M''' state when a processor writes to it; if the line is not in the '''E''' or '''M''' state prior to the write, the cache sends a Bus Upgrade (BusUpgr) signal or, as the Intel manuals term it, a “Read-For-Ownership (RFO) request”, which ensures that the line exists in the cache and is in the '''I''' state in all other processors on the bus (if any). The table below summarizes the MESI protocol.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''This cache line is valid?'''&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  Yes&lt;br /&gt;
|  No&lt;br /&gt;
|-&lt;br /&gt;
|  '''The memory copy is…'''&lt;br /&gt;
|  out of date&lt;br /&gt;
|  valid&lt;br /&gt;
|  valid&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies exist in caches of other processors?'''&lt;br /&gt;
|  No&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe&lt;br /&gt;
|  Maybe&lt;br /&gt;
|-&lt;br /&gt;
|  '''A write to this line'''&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  goes directly to bus&lt;br /&gt;
|}&lt;br /&gt;
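The benefit of the Exclusive state for a read followed by a write can be shown by counting bus transactions. The sketch below is a simplified model of a single cache line, not a full protocol implementation; the function name bus_transactions and its parameters are hypothetical.&lt;br /&gt;

```python
def bus_transactions(protocol, other_sharers):
    """Count bus transactions for a read miss followed by a write to one line."""
    count = 1  # the read miss always issues a BusRd
    if protocol == "MESI" and not other_sharers:
        state = "E"   # no other copy exists: the line is loaded Exclusive
    else:
        state = "S"   # MSI has no E state; MESI loads Shared if copies exist
    if state == "S":
        count += 1    # the write needs a BusUpgr / RFO to invalidate others
    return count      # a line in E is upgraded to M without using the bus
```

For a line that no other cache holds, MSI needs two transactions and MESI needs one; when other caches do share the line, both protocols need two.&lt;br /&gt;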
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The transition diagram from the lecture slides is given below for reference.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:MESI.jpg]]&amp;lt;/center&amp;gt; &amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Real Architectures using MESI===&lt;br /&gt;
&lt;br /&gt;
Intel's [http://en.wikipedia.org/wiki/Pentium_Pro Pentium Pro] microprocessor, introduced in 1995, was the first Intel architecture microprocessor to support symmetric multiprocessing in various multiprocessor configurations.  SMP with the MESI protocol was the architecture used consistently until the introduction of the 45-nm Hi-k Core microarchitecture in Intel's (Nehalem-EP) quad-core x86-64. The 45-nm Hi-k Intel Core microarchitecture uses a new system framework called the [http://en.wikipedia.org/wiki/Intel_QuickPath_Interconnect QuickPath Interconnect], which uses point-to-point interconnection technology based on a distributed shared memory architecture.  It uses a modified version of the MESI protocol called [http://en.wikipedia.org/wiki/MESIF_protocol MESIF], which introduces the additional Forward state.  (The state diagram of MESI transitions that occur within the Pentium's data cache can be found on page 63 of [http://books.google.com/books?id=TVzjEZg1--YC&amp;amp;printsec=frontcover&amp;amp;source=gbs_ge_summary_r&amp;amp;cad=0#v=onepage&amp;amp;q&amp;amp;f=false Pentium Processor System Architecture: Second Edition] by Don Anderson, Tom Shanley, MindShare, Inc.)&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
The [http://www.arm.com/products/processors/classic/arm11/arm11-mpcore.php ARM11 MPCore] and [http://en.wikipedia.org/wiki/ARM_Cortex-A9_MPCore Cortex-A9 MPCore] processors support the MESI cache coherency protocol.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]  ARM MPCore defines the states of the MESI protocol it implements as:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
|-&lt;br /&gt;
|  '''Copies in other caches'''&lt;br /&gt;
|  NO&lt;br /&gt;
|  NO&lt;br /&gt;
|  YES&lt;br /&gt;
|  -&lt;br /&gt;
|-&lt;br /&gt;
|  '''Clean or Dirty'''&lt;br /&gt;
|  DIRTY&lt;br /&gt;
|  CLEAN&lt;br /&gt;
|  CLEAN&lt;br /&gt;
|  -&lt;br /&gt;
|}&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
The coherency protocol is implemented and managed by the Snoop Control Unit (SCU) in the ARM MPCore, which monitors the traffic between the local L1 data caches and the next level of the memory hierarchy. At boot time, each core can choose whether or not to take part in the coherency domain.  Unless explicit system calls bind a task to a specific core ([http://en.wikipedia.org/wiki/Processor_affinity processor affinity]), there is a high chance that the task will at some point migrate to a different core, along with its data as it is used.  Task migration is not handled efficiently by a literal MESI implementation, so the ARM MPCore offers two optimizations that allow for both MESI compliance and task migration: '''Direct Data Intervention (DDI)''', in which the SCU keeps a copy of the tag RAMs of all cores’ caches, enabling it to efficiently detect whether a cache line requested by one core is held by another core in the coherency domain before looking for it in the next level of the memory hierarchy; and '''Cache-to-cache Migration''', in which, if the SCU finds that the cache line requested by one CPU is present in another core, it either copies it (if clean) or moves it (if dirty) from the other CPU directly into the requesting one, without interacting with external memory.  These optimizations reduce memory traffic in and out of the L1 cache subsystem by eliminating interaction with external memories, which in effect reduces the overall load on the interconnect and the overall power consumption.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
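The two optimizations can be pictured with a toy model of the SCU. This sketch is an assumption-laden illustration, not ARM's implementation: the class SnoopControlUnit, its tags list (standing in for the duplicated tag RAMs) and the returned labels are all hypothetical, and the states assigned after a copy or a move are simplifications.&lt;br /&gt;

```python
class SnoopControlUnit:
    """Toy SCU that mirrors every core's tags as core -> {address: state}."""

    def __init__(self, num_cores):
        self.tags = [dict() for _ in range(num_cores)]
        self.memory_reads = 0

    def read(self, core, addr):
        """Serve a read miss from another core's L1 when possible."""
        for other, lines in enumerate(self.tags):
            if other == core or addr not in lines:
                continue
            if lines[addr] == "M":
                del lines[addr]              # dirty: move the line
                self.tags[core][addr] = "M"
                return "moved"
            lines[addr] = "S"                # clean: copy the line
            self.tags[core][addr] = "S"
            return "copied"
        self.memory_reads += 1               # no core has it: use next level
        self.tags[core][addr] = "E"
        return "memory"
```

Only the first access to an address reaches the next level of the memory hierarchy in this model; later requests from other cores are satisfied by a copy or a move between L1 caches.&lt;br /&gt;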
&amp;lt;br/&amp;gt;&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
Other architectures that use the MESI cache-coherence protocol include the L2 cache of the IBM POWER4 processor[[#References|&amp;lt;sup&amp;gt;[20]&amp;lt;/sup&amp;gt;]], the L2 cache of the Intel Itanium 2 processor[[#References|&amp;lt;sup&amp;gt;[21]&amp;lt;/sup&amp;gt;]], and the Intel Xeon[[#References|&amp;lt;sup&amp;gt;[22]&amp;lt;/sup&amp;gt;]].&lt;br /&gt;
&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Implementation Complexities===&lt;br /&gt;
One implementation complexity already mentioned is the inefficiency of task migration in MESI that the ARM MPCore addresses.[[#References|&amp;lt;sup&amp;gt;[19]&amp;lt;/sup&amp;gt;]]  Another possible implementation complexity is found during the replacement of a cache line: one possible MESI implementation requires a message to be sent to main memory when a cache line is flushed (i.e. an E to I transition), as the line was exclusively in one cache before it was removed. It is possible to avoid this replacement message if the system is designed so that the flush of a modified (exclusive) line requires an acknowledgment from main memory. However, this requires the flush to be stored in a 'write-back' buffer until the reply arrives (to ensure the change is successfully propagated to memory).[[#References|&amp;lt;sup&amp;gt;[23]&amp;lt;/sup&amp;gt;]]&amp;lt;br/&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==MESIF==&lt;br /&gt;
Let us now walk through a brief description of the '''MESIF protocol''':&lt;br /&gt;
&lt;br /&gt;
The '''MESIF''' protocol, used in the latest Intel multi-core processors, was introduced to '''accommodate the point-to-point''' links used in the QuickPath Interconnect. Using the '''MESI''' protocol in this architecture would send many redundant messages between different processors, often with unnecessarily high latency. For example, when a processor requests a cache line that is stored in multiple locations, every location might respond with the data. As the requesting processor only needs a single copy of the data, the system would be wasting bandwidth. &lt;br /&gt;
As a solution to this problem, an additional state, '''Forward state''', was added by slightly changing the role of the Shared state. Whenever there is a read request, only the cache line in the F state will respond to the request, while all the S state caches remain dormant.  Hence, by designating a single cache line to '''respond to requests''', coherency traffic is substantially reduced when multiple copies of the data exist. Also, on a read request, the F state transitions from F to S state. That is, when a cache line in the '''F''' state is '''copied''', the F state '''migrates''' to the '''newer copy''', while the '''older''' one drops back to '''S'''. Moving the new copy to the F state '''exploits''' both '''temporal and spatial locality'''. Because the newest copy of the cache line is always in the F state, it is very unlikely that the line in the F state will be evicted from the caches. This takes advantage of the temporal locality of the request. The second advantage is that if a particular cache line is in high demand due to spatial locality, the bandwidth used to transmit that data will be spread across several nodes.&lt;br /&gt;
All M-to-S and E-to-S state transitions now become '''M to F''' and '''E to F''' transitions.  &lt;br /&gt;
The '''F state''' is '''different''' from the '''Owned state''' of the MOESI protocol as it is '''not''' a unique copy because a valid copy is stored in memory. Thus, unlike the Owned state of the MOESI protocol, in which the data in the O state is the only valid copy of the data, the data in the F state can be evicted or converted to the S state, if desired. &lt;br /&gt;
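The migration of the F state on a read can be sketched as follows. The function read_request and the dictionary layout are hypothetical, and the model covers only lines held in the F, S and I states. What happens when no cache holds the line in the F state is not covered by the description above; here memory answers and the requester still receives the F state, which is an assumption of this sketch.&lt;br /&gt;

```python
def read_request(states, requester):
    """Handle a read miss under MESIF; `states` maps cache name -> state.

    Returns the caches that respond with data (an empty list means memory).
    """
    responders = [cache for cache, s in states.items() if s == "F"]
    for cache in responders:
        states[cache] = "S"   # the old forwarder drops back to Shared
    states[requester] = "F"   # the newest copy becomes the forwarder
    return responders
```

With one cache in F and several in S, only the F holder answers the request, so a single data response crosses the interconnect no matter how many shared copies exist.&lt;br /&gt;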
&lt;br /&gt;
More information on the QuickPath Interconnect and MESIF protocol can be found at&lt;br /&gt;
'''[http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]'''&lt;br /&gt;
&lt;br /&gt;
==MOESI==&lt;br /&gt;
[http://en.wikipedia.org/wiki/Opteron AMD Opteron] was AMD’s first-generation dual-core processor, with two distinct [http://en.wikipedia.org/wiki/Athlon_64 K8 cores] together on a single die.  Cache coherence poses bigger problems on such multiprocessors, and an appropriate coherence protocol was needed to address them. The [http://en.wikipedia.org/wiki/Xeon Intel Xeon], the competing product from Intel, used the MESI protocol to handle cache coherence.  MESI came with the drawback of using much time and bandwidth in certain situations. &lt;br /&gt;
&lt;br /&gt;
[http://en.wikipedia.org/wiki/MOESI_protocol MOESI] was AMD’s answer to this problem. MOESI adds a fifth state to the MESI protocol, called '''“Owned”'''. It addresses the bandwidth problem that MESI faces when a processor holding an invalid copy wants to access data that another processor has modified.  Under MESI, the processor seeking access has to wait for the processor that modified the data to write it back to main memory, which takes time and bandwidth. MOESI removes this drawback.  When a processor holds the data in the new '''“Owned”''' state, it can supply the modified data to other processors without, or even before, writing it to main memory. This is called '''''dirty sharing'''''. The processor holding the data in '''&amp;quot;Owned&amp;quot;''' remains responsible for updating main memory later, when the cache line is evicted.&lt;br /&gt;
&lt;br /&gt;
MOESI has become one of the most popular snoop-based protocols supported in the AMD64 architecture.  The AMD dual-core Opteron can maintain cache coherence in systems up to 8 processors using this protocol.&lt;br /&gt;
&lt;br /&gt;
The five different states of the MOESI protocol are:&lt;br /&gt;
* '''Modified (M)''' : The most recent copy of the data is present in the cache line. But it is not present in any other processor cache.&lt;br /&gt;
* '''Owned (O)''' : The cache line holds the most recent, correct copy of the data, which may be shared by other processors. The processor holding the line in this state is responsible for updating main memory with the correct value when the line is evicted.  &lt;br /&gt;
* '''Exclusive (E)''' : A cache line holds the most recent, correct copy of the data, which is exclusively present on this processor and a copy is present in the main memory.  &lt;br /&gt;
* '''Shared (S)''' : A cache line in the shared state holds the most recent, correct copy of the data, which may be shared by other processors. &lt;br /&gt;
* '''Invalid (I)''' : A cache line does not hold a valid copy of the data.&lt;br /&gt;
&lt;br /&gt;
A detailed explanation of how this protocol is implemented on AMD processors can be found in [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of the AMD 64-bit core].&lt;br /&gt;
&lt;br /&gt;
The following table summarizes the MOESI protocol:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot;&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|  '''Cache Line State:'''&lt;br /&gt;
&lt;br /&gt;
|  '''Modified'''&lt;br /&gt;
&lt;br /&gt;
|  '''Owned'''&lt;br /&gt;
&lt;br /&gt;
|  '''Exclusive'''&lt;br /&gt;
&lt;br /&gt;
|  '''Shared'''&lt;br /&gt;
&lt;br /&gt;
|  '''Invalid'''&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|  '''This cache line is valid?'''&lt;br /&gt;
&lt;br /&gt;
|  Yes&lt;br /&gt;
&lt;br /&gt;
|  Yes&lt;br /&gt;
&lt;br /&gt;
|  Yes&lt;br /&gt;
&lt;br /&gt;
|  Yes&lt;br /&gt;
&lt;br /&gt;
|  No&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|  '''The memory copy is…'''&lt;br /&gt;
&lt;br /&gt;
|  out of date&lt;br /&gt;
&lt;br /&gt;
|  out of date&lt;br /&gt;
&lt;br /&gt;
|  valid&lt;br /&gt;
&lt;br /&gt;
|  valid (unless another cache holds the line in Owned)&lt;br /&gt;
&lt;br /&gt;
|  -&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|  '''Copies exist in caches of other processors?'''&lt;br /&gt;
&lt;br /&gt;
|  No&lt;br /&gt;
&lt;br /&gt;
|  Maybe&lt;br /&gt;
|  No&lt;br /&gt;
|  Maybe&lt;br /&gt;
&lt;br /&gt;
|  Maybe&lt;br /&gt;
&lt;br /&gt;
|-&lt;br /&gt;
&lt;br /&gt;
|  '''A write to this line'''&lt;br /&gt;
&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
|  does not go to bus&lt;br /&gt;
&lt;br /&gt;
|  goes to bus and updates cache&lt;br /&gt;
&lt;br /&gt;
|  goes directly to bus&lt;br /&gt;
&lt;br /&gt;
|}&lt;br /&gt;
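&lt;br /&gt;
The behaviour summarized in the table can be sketched as a small Python model of one memory block shared by several caches. This is an illustrative simplification, not AMD's implementation: the class name MoesiLine and its fields are hypothetical, evictions are not modelled, and every bus request is counted as one transaction.&lt;br /&gt;

```python
class MoesiLine:
    """One memory block tracked across `n` caches (toy snooping model)."""

    def __init__(self, n):
        self.state = ["I"] * n
        self.memory_stale = False
        self.bus_transactions = 0

    def read(self, c):
        if self.state[c] != "I":
            return                        # read hit: no bus traffic
        self.bus_transactions += 1        # read request goes to the bus
        others = [i for i, s in enumerate(self.state) if s != "I"]
        for i in others:
            if self.state[i] == "M":
                self.state[i] = "O"       # dirty sharing: no write-back needed
            elif self.state[i] == "E":
                self.state[i] = "S"
        self.state[c] = "S" if others else "E"

    def write(self, c):
        if self.state[c] not in ("M", "E"):
            self.bus_transactions += 1    # O, S and I must go to the bus
            for i in range(len(self.state)):
                if i != c:
                    self.state[i] = "I"   # other copies are invalidated
        self.state[c] = "M"               # M and E are written silently
        self.memory_stale = True


line = MoesiLine(2)
line.write(0)        # I to M, one bus transaction
line.read(1)         # cache 0: M to O, cache 1: I to S
print(line.state, line.memory_stale)   # ['O', 'S'] True
```

After the read, cache 0 still holds the only up-to-date dirty copy in O while memory remains stale. A further write by cache 0 would again need the bus, because the line is no longer in M.&lt;br /&gt;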
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The state transitions for MOESI are shown below:&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:MOESI_State_Transition_Diagram.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt; MOESI State transition Diagram&amp;lt;/center&amp;gt;&lt;br /&gt;
&amp;lt;br /&amp;gt;&lt;br /&gt;
&amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===Optimization techniques on MOESI===&lt;br /&gt;
&lt;br /&gt;
In real machines, applying optimization techniques to the standard cache coherence protocol can improve performance. For example, the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] family of microprocessors (Family 0x10), AMD’s first generation to incorporate 4 distinct cores on a single die and the first to have a cache that all the cores share, uses the MOESI protocol with some optimizations incorporated. &lt;br /&gt;
&lt;br /&gt;
One such optimization targets a small subset of compute problems that behave like producer-consumer programs. In such a problem, a thread running on one core produces data, which is consumed by a thread running on a separate core. With such programs, it is desirable for the two cores to communicate through the shared cache, avoiding round trips to and from main memory. However, the '''MOESI''' protocol that the [http://en.wikipedia.org/wiki/AMD_Phenom AMD Phenom] cache uses for cache coherence can itself limit bandwidth in this pattern. Keeping the cache line in the '''‘M’''' state for such problems achieves better performance.&lt;br /&gt;
&lt;br /&gt;
When the producer thread writes a new entry to the ring buffer it shares with the consumer, it allocates cache lines in the '''modified (M)''' state. Eventually, these M-marked cache lines start to fill the L3 cache. When the consumer reads a cache line, the MOESI protocol changes the state of the line to '''owned (O)''' in the L3 cache and pulls down a '''shared (S)''' copy for the consumer's own use. The producer thread then circles the ring buffer and arrives back at the same cache line it had previously written. However, when the producer attempts to write new data to the owned (marked '''‘O’''') cache line, it finds that it cannot, since a cache line marked '''‘O’''' by the previous consumer read does not have sufficient permission for a write request (in the MOESI protocol). To maintain coherence, the memory controller must initiate probes in the other caches (to handle any other S copies that may exist), which slows down the process.&lt;br /&gt;
&lt;br /&gt;
Thus, it is preferable to keep the cache line in the '''‘M’''' state in the L3 cache. In such a situation, when the producer comes back around the ring buffer, it finds the previously written cache line still marked '''‘M’''', to which it is safe to write without coherence concerns. Thus better performance can be achieved by such optimization techniques to standard protocols when implemented in real machines.&lt;br /&gt;
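&lt;br /&gt;
The effect on the ring buffer can be estimated with a rough Python sketch. It only counts the producer writes that hit a line left in the O state. The function count_probes and its parameters are hypothetical names, and how the hardware actually keeps the line in M is not modelled here.&lt;br /&gt;

```python
def count_probes(laps, slots, keep_modified):
    """Count producer writes that need coherence probes.

    `laps` is how many times the producer circles a ring buffer of
    `slots` cache lines; `keep_modified` selects the optimized behaviour.
    """
    l3_state = ["I"] * slots          # state of each ring-buffer line in L3
    probes = 0
    for _ in range(laps):
        for slot in range(slots):
            # Producer write: a line in O lacks write permission.
            if l3_state[slot] == "O":
                probes += 1
            l3_state[slot] = "M"
            # Consumer read: plain MOESI downgrades M to O,
            # the optimization leaves the L3 copy in M.
            if not keep_modified:
                l3_state[slot] = "O"
    return probes


print(count_probes(3, 4, keep_modified=False))   # 8
print(count_probes(3, 4, keep_modified=True))    # 0
```

In this toy model, every write after the first lap pays for a probe under plain MOESI, while the optimized variant pays for none.&lt;br /&gt;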
&lt;br /&gt;
You can find more information on how this is implemented and various other ways of optimizations in this manual [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Dragon Protocol==&lt;br /&gt;
The '''[http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]''' is an update-based coherence protocol: unlike the coherence protocols seen so far, it does not invalidate other cached copies. Write propagation is achieved by updating the cached copies instead of invalidating them.  The Dragon Protocol does not update memory on a cache-to-cache transfer; it delays making memory consistent with the caches until the data is evicted and written back, which saves time and lowers the memory access requirements. Moreover, only the written '''byte''' or '''word''' is '''communicated''' to the '''other caches''' instead of the whole block, which further '''reduces''' '''bandwidth''' usage. The protocol dynamically detects the sharing status of a block, using a write-through policy for shared blocks and a write-back policy for blocks that are currently not shared. The Dragon Protocol employs the following four states for cache blocks: '''Shared Clean''', '''Shared Modified''', '''Exclusive''' and  '''Modified'''. &lt;br /&gt;
* '''Modified (M)''' and '''Exclusive (E)''' - these states have the same meaning as explained in the protocols above. &lt;br /&gt;
* '''Shared Modified (Sm)''' - Only one cache in the system can hold a given block in the Shared Modified state. Potentially two or more caches have this block, memory may or may not be up to date, and this processor's cache has modified the block.&lt;br /&gt;
* '''Shared Clean (Sc)''' -  Potentially two or more caches have this block and memory may or may not be up to date (if no other cache has it in Sm state, memory will be up to date else it is not).&lt;br /&gt;
Only when a Shared Modified line is evicted from the cache on a cache miss is the block written back to main memory in order to keep memory consistent. For more information on the Dragon protocol, refer to the Solihin textbook, page 229. The state transition diagram is given below for reference.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:Dragon.jpg]]&amp;lt;/center&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The Dragon protocol is implemented with snoopy caches that provide the appearance of a uniform memory space to multiple processors. Each cache listens to 2 buses: the processor bus and the memory bus. The caches are also responsible for address translation, so the processor bus carries virtual addresses and the memory bus carries physical addresses.  The Dragon system was designed to support 4 to 8 Dragon processors. &lt;br /&gt;
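&lt;br /&gt;
The update-based behaviour can be sketched in Python for a single block. This is a simplified illustration rather than the Xerox implementation: the class DragonLine and its method names are hypothetical, one value stands in for the written word, and a write miss is treated as a read miss followed by a write.&lt;br /&gt;

```python
class DragonLine:
    """One block shared by `n` caches under a toy Dragon (update) protocol."""

    def __init__(self, n, memory_value=0):
        self.state = [None] * n           # None means the block is not cached
        self.value = [None] * n
        self.memory = memory_value
        self.updates = 0                  # update broadcasts sent on the bus

    def _sharers(self, c):
        return [i for i, s in enumerate(self.state) if s is not None and i != c]

    def read(self, c):
        if self.state[c] is None:
            sharers = self._sharers(c)
            if sharers:
                src = sharers[0]
                for i in sharers:         # prefer the cache holding dirty data
                    if self.state[i] in ("M", "Sm"):
                        src = i
                if self.state[src] == "M":
                    self.state[src] = "Sm"    # still dirty, now shared
                elif self.state[src] == "E":
                    self.state[src] = "Sc"
                self.value[c] = self.value[src]
                self.state[c] = "Sc"
            else:
                self.value[c] = self.memory
                self.state[c] = "E"
        return self.value[c]

    def write(self, c, v):
        if self.state[c] is None:
            self.read(c)                  # write miss: fetch the block first
        sharers = self._sharers(c)
        self.value[c] = v
        if sharers:
            self.updates += 1             # only the written word is broadcast
            for i in sharers:
                self.value[i] = v         # copies are updated, not invalidated
                self.state[i] = "Sc"
            self.state[c] = "Sm"          # the writer now owns the dirty block
        else:
            self.state[c] = "M"           # unshared block: silent write

    def evict(self, c):
        if self.state[c] in ("M", "Sm"):
            self.memory = self.value[c]   # memory is updated only here
        self.state[c] = None
        self.value[c] = None


d = DragonLine(2, memory_value=7)
d.write(0, 8)          # unshared: cache 0 goes to M without bus traffic
d.read(1)              # cache 0 becomes Sm, cache 1 becomes Sc
d.write(1, 9)          # update broadcast: cache 0 sees 9 and drops to Sc
print(d.state, d.value, d.memory)   # ['Sc', 'Sm'] [9, 9] 7
```

Memory still holds the old value 7 after both writes; it is brought up to date only when the Sm copy is evicted.&lt;br /&gt;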
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=[http://en.wikipedia.org/wiki/Instruction_prefetch Instruction Prefetching]=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Instruction prefetching is a technique used to speed up the execution of a program. In multiprocessors, however, prefetching can come at the cost of correctness: data can be modified in ways whose effects the memory coherence protocol is not able to handle. In such situations, software must use serializing instructions or cache-invalidation instructions to guarantee that subsequent data accesses are coherent. &lt;br /&gt;
&amp;lt;br /&amp;gt;&lt;br /&gt;
An example of this type of a situation is a page-table update followed by accesses to the physical pages referenced by the updated page tables. The physical-memory references for the page tables are different than the physical-memory references for the data. Because of prefetching there may be problems with correctness. The following sequence of events shows such a situation when software changes the translation of virtual-page A from physical-page M to physical-page N:&lt;br /&gt;
# The tables that translate virtual-page A to physical-page M are now held only in main memory. The copies in the cache are invalidated.&lt;br /&gt;
# Software changes the page-table entry for virtual-page A in main memory to point to physical-page N rather than physical-page M.&lt;br /&gt;
# Data in virtual-page A is accessed.&lt;br /&gt;
&lt;br /&gt;
Software expects the processor to access the data from physical-page N after the update. However, it is possible for the processor to prefetch the data from physical-page M before the page table for virtual page A is updated. Because the physical-memory references are different, the processor does not recognize them as requiring coherence checking and believes it is safe to prefetch the data from virtual-page A, which is translated into a read from physical page M. Similar behavior can occur when instructions are prefetched from beyond the page-table update instruction.&lt;br /&gt;
&lt;br /&gt;
To prevent such errors, the architecture provides special instructions (in the AMD64 manual, INVLPG or MOV CR3) that software executes immediately after the page-table update to ensure that subsequent instruction fetches and data accesses use the correct virtual-page-to-physical-page translation. It is not necessary to perform a TLB invalidation operation preceding the table update.&lt;br /&gt;
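&lt;br /&gt;
The hazard in the three-step sequence above can be mimicked with a toy Python model. It is an analogy, not processor behaviour: a translation that was looked up early stands in for the prefetch, the class ToyMMU and its methods are hypothetical, and the invalidate flag stands in for whichever invalidation instruction the architecture provides.&lt;br /&gt;

```python
class ToyMMU:
    """Page table plus a small cache of translations that were looked up early."""

    def __init__(self, page_table):
        self.page_table = dict(page_table)   # virtual page to physical page
        self.cached = {}                     # early (possibly stale) lookups

    def translate(self, vpage):
        if vpage not in self.cached:
            self.cached[vpage] = self.page_table[vpage]   # page-table walk
        return self.cached[vpage]

    def remap(self, vpage, ppage, invalidate):
        self.page_table[vpage] = ppage       # software edits the table entry
        if invalidate:
            self.cached.pop(vpage, None)     # stand-in for the invalidation step


mmu = ToyMMU({"A": "M"})
mmu.translate("A")                       # early lookup, before the update
mmu.remap("A", "N", invalidate=False)
print(mmu.translate("A"))                # M  (stale: software expected N)
mmu.remap("A", "N", invalidate=True)
print(mmu.translate("A"))                # N
```

Without the invalidation step the access still goes to physical-page M, which is the incorrect behaviour the text describes.&lt;br /&gt;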
&lt;br /&gt;
More information can be found about this in [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual]&lt;br /&gt;
&lt;br /&gt;
= CMP Implementation in Intel Architecture =&lt;br /&gt;
&lt;br /&gt;
Let us now see how Intel architecture using the MESI protocol progressed from a uniprocessor architecture to a Chip MultiProcessor (CMP) using the bus as the interconnect. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''Uniprocessor Architecture'''&lt;br /&gt;
&lt;br /&gt;
The diagram below shows the structure of the memory cluster in Intel Pentium M processor.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:intel_cache1.jpg]]&amp;lt;/center&amp;gt; &amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In this structure we have,&lt;br /&gt;
* A unified on-chip '''L1 cache''' with the '''processor/core''',&lt;br /&gt;
* A '''Memory/L2 access control unit''', through which all the accesses to the L2 cache, main memory and IO space are made,&lt;br /&gt;
* The second level '''L2 cache''' along with the '''prefetch unit''' and&lt;br /&gt;
* '''Front side bus (FSB)''', a single shared bi-directional bus across which all the traffic is sent. These wide buses bring in multiple data bytes at a time. &lt;br /&gt;
&lt;br /&gt;
As Intel explains it, in this structure processor requests are first sought in the '''L2 cache''' and only on a '''miss''' are they '''forwarded''' to main '''memory''' via the front side bus ('''FSB'''). The '''Memory/L2 access control''' unit serves as a central point for '''maintaining coherence''' within the core and with the external world. It '''contains''' a '''snoop control unit''' that receives snoop requests from the bus and performs the required operations on each cache (and internal buffers) in parallel. It also handles RFO requests (BusUpgr) and ensures that the operation continues only after it guarantees that no other version of the cache line exists in any other cache in the system.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''CMP Architecture'''&lt;br /&gt;
&lt;br /&gt;
For the CMP implementation, Intel chose a bus-based architecture using snoopy protocols over a '''directory protocol''': although a directory protocol reduces active power thanks to reduced snoop activity, it '''increases''' the '''design complexity''' and the '''static power''' due to larger tag arrays. Since Intel has a large market for processors in the mobility family, a directory-based solution was less favorable, because battery life depends mainly on static power consumption and less on dynamic power.&lt;br /&gt;
Let us examine how '''CMP''' was implemented in the '''Intel Core Duo''', which was one of the first dual-core processors for the budget/entry-level market. &lt;br /&gt;
The general CMP implementation structure of the Intel Core Duo is shown below.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;center&amp;gt;[[Image:intel_cache2.jpg]]&amp;lt;/center&amp;gt; &amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This structure has the following changes when compared to the uniprocessor memory cluster structure. &lt;br /&gt;
* '''L1 cache''' and the '''processor/core''' structure is '''duplicated''' to give 2 cores.&lt;br /&gt;
* The '''Memory/L2 access control''' unit is '''split''' into 2 logical units: '''L2 controller''' and '''bus controller'''. The L2 controller handles all '''requests to the L2''' cache from the core and the snoop requests from the FSB. The '''bus controller''' handles '''data and I/O requests''' to and from the FSB.&lt;br /&gt;
* The '''prefetching''' unit is extended to handle the hardware '''prefetches for each core separately'''.&lt;br /&gt;
* A '''new logical unit''' (represented by the hexagon) was added to maintain '''fairness between the requests''' coming from the different cores and hence balance the requests to L2 and memory.&lt;br /&gt;
&lt;br /&gt;
This new '''partitioned structure''' for the  memory/L2 access control unit '''enhanced''' the '''performance''' while '''reducing power consumption'''. &lt;br /&gt;
For more information on uniprocessor and multiprocessor implementation under the Intel architecture, refer to &lt;br /&gt;
[http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]&lt;br /&gt;
&lt;br /&gt;
The '''Intel bus architecture''' has been '''evolving''' to accommodate the demands of scalability while using the same MESI protocol: from a '''single shared bus''', to '''dual independent buses (DIB)''' that double the available bandwidth, and on to the logical conclusion of DIB, '''dedicated high-speed interconnects (DHSI)'''. DHSI-based platforms use four FSBs, one for each processor in the platform. In both DIB and DHSI, a snoop filter in the chipset caches snoop information, thereby significantly reducing the snoop traffic that must be broadcast on the buses. With the production of processors based on next generation 45-nm Hi-k Intel Core microarchitecture, the [http://en.wikipedia.org/wiki/Xeon Intel Xeon] processor fabric will transition from a DHSI, with the memory controller in the chipset, to a distributed shared memory architecture using '''Intel QuickPath Interconnects using MESIF protocol'''.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
=Word Invalidation=&lt;br /&gt;
One source of complexity that applies to a number of the protocols is invalidation.  In newer protocols, individual words may be modified in a cache line, as opposed to the entire line.  The other processors will thus hold a mostly correct cache line that differs in only one word.  This adds complexity because, to 'correct' the error, the other processors must transition from shared to invalid, then read the correct word from the bus and place it back into the line, transitioning back into shared.  This transition is potentially unnecessary, as the second processor may never access the specific word that changed, although it may access other words in the cache line.  One potential solution being researched is to advance the protocols so that a cache line is &amp;quot;not invalidated on the first dirty word, but after the number of dirty words crosses some predetermined value, which is data type and application dependent.&amp;quot;  In other words, if the application can tolerate multiple 'incorrect' words in a line, several transitions to and from the invalid state may be avoided.&lt;br /&gt;
&amp;lt;br/&amp;gt;Source: http://tab.computer.org/tcca/NEWS/sept96/dsmideas.ps&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In essence, the solution proposed here is to advance the MOESI protocol with word invalidation and specific treatment of temporal and spatial data, so that the block is not invalidated.&lt;br /&gt;
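&lt;br /&gt;
A minimal Python sketch of the threshold idea follows. It is only one possible reading of the quoted proposal: the class ThresholdLine, its fields and the choice to treat a read of a stale word as a miss are assumptions made for illustration and are not taken from the cited source.&lt;br /&gt;

```python
class ThresholdLine:
    """A remote copy of a cache line that tolerates a few stale words."""

    def __init__(self, threshold):
        self.threshold = threshold    # number of dirty words tolerated
        self.dirty_words = set()      # offsets of words modified elsewhere
        self.valid = True
        self.invalidations = 0

    def remote_write(self, word):
        """Another processor modified `word` of this line."""
        if not self.valid:
            return
        self.dirty_words.add(word)
        # The set grows by at most one word per call, so reaching
        # threshold + 1 is the moment the threshold is crossed.
        if len(self.dirty_words) == self.threshold + 1:
            self.valid = False
            self.invalidations += 1
            self.dirty_words.clear()

    def local_read(self, word):
        """Return True on a hit; a stale word or an invalid line is a miss."""
        return self.valid and word not in self.dirty_words


line = ThresholdLine(threshold=2)
line.remote_write(0)
print(line.local_read(1), line.local_read(0))   # True False
line.remote_write(3)
line.remote_write(5)
print(line.valid, line.invalidations)           # False 1
```

With a threshold of 2, the first two remote writes leave the rest of the line usable, and only the third one forces the transition to invalid.&lt;br /&gt;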
&lt;br /&gt;
=Performance: MOESI vs MEI/MESI=&lt;br /&gt;
Early multiprocessors (such as the PowerPC processors) were designed to work with three states (Modified, Exclusive, Invalid - this is similar to the MSI protocol discussed earlier, the term MEI is used to remain consistent with the sources used for reference).  As time progressed, more multi-processors transitioned to the MESI protocol.  This is most likely due to the characteristics of MEI - it is &amp;quot;easy to implement but leads to inefficiencies in the way that memory bus bandwidth is used&amp;quot; [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 source].  As a result, systems with two or more MEI processors (most likely) will not see the 'full' potential of the extra processors, due to the inefficient bus transactions.&lt;br /&gt;
&lt;br /&gt;
In contrast, the five-state protocol MOESI is more complex to implement than MESI and MEI/MSI.  However, it brings advantages.  As Andy Keane (VP of Marketing for PMC-Sierra) put it, &amp;quot;this [fifth] state allows shared data that is dirty to remain in the cache.  Without this state, any shared line would be written back to memory to change the original state from modified. Since we have a dedicated, fast path from CPU to CPU, this state maximizes the use of this path rather than the path to memory, which is inherently slower&amp;quot; [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 source].  This advantage can be realized by allowing MOESI to make some processor-to-processor transfers from level 2 caches.&lt;br /&gt;
&lt;br /&gt;
Source: [http://www.lexisnexis.com.www.lib.ncsu.edu:2048/hottopics/lnacademic/?shr=t&amp;amp;csi=155278&amp;amp;sr=HLEAD%28Architects+wrestle+with+multiprocessor+options%29+and+date+is+August%2C%202001 Edwards, Chris (08/06/2001). &amp;quot;Architects wrestle with multiprocessor options&amp;quot;. Electronic engineering times (0192-1541), (1178), p. 48.]&lt;br /&gt;
&lt;br /&gt;
=References=&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Cache_coherence Cache coherence]&lt;br /&gt;
# [http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]&lt;br /&gt;
# [http://www.intel.com/technology/itj/2006/volume10issue02/art02_CMP_Implementation/p03_implementation.htm CMP Implementation in Intel Core Duo Processors]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Symmetric_multiprocessing Common System Interface in Intel Processors]&lt;br /&gt;
# [http://www.zak.ict.pwr.wroc.pl/nikodem/ak_materialy/Cache%20consistency%20&amp;amp;%20MESI.pdf Cache consistency with MESI on Intel processor]&lt;br /&gt;
# [http://techreport.com/articles.x/8236/2 AMD dual core Architecture]&lt;br /&gt;
# [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf AMD64 Architecture Programmer's manual]&lt;br /&gt;
# [http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf Software Optimization guide for AMD 10h Processors]&lt;br /&gt;
# [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of AMD 64 bit core]&lt;br /&gt;
# [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&amp;amp;arnumber=4913 Silicon Graphics Computer Systems]&lt;br /&gt;
# [http://books.google.com/books?id=g82fofiqa5IC&amp;amp;printsec=frontcover&amp;amp;dq=Parallel+computer+architecture:+a+hardware/software+approach+By+David+E.+Culler,+Jaswinder+Pal+Singh,+Anoop+Gupta&amp;amp;source=bl&amp;amp;ots=COrdamlfVn&amp;amp;sig=YcugVqbzTjHvlofvaFq6Ft_tjfY&amp;amp;hl=en&amp;amp;ei=0ZO6S4TJGcOclgejzI3BBw&amp;amp;sa=X&amp;amp;oi=book_result&amp;amp;ct=result&amp;amp;resnum=1&amp;amp;ved=0CAgQ6AEwAA#v=onepage&amp;amp;q=&amp;amp;f=false Parallel computer architecture: a hardware/software approach By David E. Culler, Jaswinder Pal Singh, Anoop Gupta]&lt;br /&gt;
# [http://www.freepatentsonline.com/5283886.html Three state invalidation protocols]&lt;br /&gt;
# [http://portal.acm.org/citation.cfm?id=1499317&amp;amp;dl=GUIDE&amp;amp;coll=GUIDE&amp;amp;CFID=83027384&amp;amp;CFTOKEN=95680533 Synapse tightly coupled multiprocessors: a new approach to solve old problems]&lt;br /&gt;
# [http://portal.acm.org/citation.cfm?id=6514 Cache Coherence protocols: evaluation using a multiprocessor simulation model]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol]&lt;br /&gt;
# [http://en.wikipedia.org/wiki/Xerox_Dragon Xerox Dragon]&lt;br /&gt;
# [http://thanaseto.110mb.com/courses/CSD-527-report-engl.pdf Coherence Protocols]&lt;br /&gt;
# [http://ieeexplore.ieee.org.www.lib.ncsu.edu:2048/stamp/stamp.jsp?tp=&amp;amp;arnumber=289691 XDBus]&lt;br /&gt;
# [http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dai0228a/index.html ARM]&lt;br /&gt;
# [http://ixbtlabs.com/articles/ibmpower4/index.html IBM POWER4]&lt;br /&gt;
# [http://www.chiplist.com/Intel_Itanium_2_processor/tree3f-section--2235-/ Intel Itanium 2]&lt;br /&gt;
# [http://techreport.com/articles.x/8236/2 MESI-MESI-MOESI Banana-fana...]&lt;br /&gt;
# [http://rsim.cs.illinois.edu/rsim/Manual/node109.html RSIM]&lt;br /&gt;
# [http://dl.acm.org/citation.cfm?id=1499317 Synapse tightly coupled multiprocessors]&lt;/div&gt;</summary>
		<author><name>Pxu</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=File:MSInew.jpg&amp;diff=59785</id>
		<title>File:MSInew.jpg</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=File:MSInew.jpg&amp;diff=59785"/>
		<updated>2012-03-19T01:18:39Z</updated>

		<summary type="html">&lt;p&gt;Pxu: State diagram of MSI protocol&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;State diagram of MSI protocol&lt;/div&gt;</summary>
		<author><name>Pxu</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=File:MOESI.jpg&amp;diff=59783</id>
		<title>File:MOESI.jpg</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=File:MOESI.jpg&amp;diff=59783"/>
		<updated>2012-03-19T01:18:04Z</updated>

		<summary type="html">&lt;p&gt;Pxu: State diagram of MOESI protocol&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;State diagram of MOESI protocol&lt;/div&gt;</summary>
		<author><name>Pxu</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/3a_yw&amp;diff=58363</id>
		<title>CSC/ECE 506 Spring 2012/3a yw</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2012/3a_yw&amp;diff=58363"/>
		<updated>2012-02-12T00:19:03Z</updated>

		<summary type="html">&lt;p&gt;Pxu: Created page with &amp;quot;Patterns of Parallel Programming --------------------------------------------------------------   ==Introduction== -------------------------------------------------------------- ...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Patterns of Parallel Programming&lt;br /&gt;
--------------------------------------------------------------&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Introduction==&lt;br /&gt;
--------------------------------------------------------------&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The trend in general purpose microprocessor design has shifted from single-core processors to chip multiprocessors. The theoretical performance of a multi-core processor is much higher than that of a single-core processor running at a similar clock speed. However, despite the rapid advances in hardware performance, the full potential of this processing power is not being exploited in the community for one clear reason: the difficulty of designing parallel software. &amp;lt;ref&amp;gt;http://www.cs.uiuc.edu/homes/snir/PPP/patterns/concerns.pdf&amp;lt;/ref&amp;gt; Parallel programming requires much more effort than serial programming because programmers need to find the parallelism that exists in an algorithm and figure out a way to distribute the workload across multiple processors while still getting the same result as the serial program. Identifying tasks, designing parallel algorithms, and managing the load balance among many processors have been daunting tasks for novice programmers, and even experienced programmers are often trapped by design decisions that result in suboptimal performance. &amp;lt;ref&amp;gt;http://www.cs.uiuc.edu/homes/snir/PPP/patterns/concerns.pdf&amp;lt;/ref&amp;gt; Therefore, to make parallel programming easier for programmers, several common design patterns have been identified over the last decade.&lt;br /&gt;
&lt;br /&gt;
What is a design pattern? It can be defined as a quality description of a frequently occurring problem in some domain, together with its solution. &amp;lt;ref&amp;gt;http://www.cs.uiuc.edu/homes/snir/PPP/patterns/patterns.ppt&amp;lt;/ref&amp;gt; In the area of parallel programming, design patterns refer to the definition, solution and guidelines for common parallelization problems. Such patterns can take the form of templates in software tools, written paragraphs with detailed descriptions, or charts, diagrams and examples.&amp;lt;ref&amp;gt;http://www.cs.uiuc.edu/homes/snir/PPP/patterns/concerns.pdf&amp;lt;/ref&amp;gt; Design patterns implement various types of common process structures and interactions found in parallel systems, but with the key components - the application-specific procedures - unspecified. &amp;lt;ref&amp;gt;http://www.cs.uiuc.edu/homes/snir/PPP/skeleton/dpndp.pdf&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
===A Sample Pattern===&lt;br /&gt;
Let's take a look at an example below. Although the example is not related to parallel programming, it gives you a general idea about what exactly a pattern is, which is helpful for understanding the parallel programming patterns we are going to describe in the following sections.&lt;br /&gt;
[[File:Fig1-lunchpattern.jpg|200px|thumb|right|&amp;lt;ref&amp;gt;http://www.cs.uiuc.edu/homes/snir/PPP/patterns/patterns.ppt&amp;lt;/ref&amp;gt;]]&lt;br /&gt;
The example describes a lunch pattern. As seen from the figure, a pattern is usually made up of definition, driving forces, solution, benefits, difficulties and related patterns. We will follow this format to describe some of the parallel design patterns. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Why Do We Need Parallel Programming Patterns?==&lt;br /&gt;
--------------------------------------------------------------&lt;br /&gt;
&lt;br /&gt;
To fully exploit the performance of parallel hardware, programmers need to design their programs in a totally different manner than serial code, which has turned out to be a difficult task. Design patterns guide programmers systematically towards optimal parallel performance by providing them with a framework for parallel programs. The patterns are abstractions of commonly occurring structures and communication characteristics of parallel applications. Programmers just need to fill in their application-specific procedures in these patterns, which enables them to develop parallel code faster and more easily. Using the patterns also makes correct operation of the parallel application more likely, since the data structures and communication-handling code are well tested. &amp;lt;ref&amp;gt;http://www.cs.uiuc.edu/homes/snir/PPP/skeleton/dpndp.pdf&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Common Parallel Programming Patterns==&lt;br /&gt;
--------------------------------------------------------------&lt;br /&gt;
&lt;br /&gt;
In this section, we present a number of commonly used parallel programming patterns. Based on their form of parallelism, we divide the patterns into three categories - functional parallelism, data parallelism and inseparable. The first few patterns are basic patterns represented as graphs. We will focus our discussion on the patterns that use geometric or mesh data structures.&lt;br /&gt;
&lt;br /&gt;
===Functional Parallelism Patterns===&lt;br /&gt;
------------------------------------------------------------&lt;br /&gt;
Functional parallelism focuses on distributing independent tasks across multiple processors. Each processor executes a different thread on the same or different data. Communication occurs among the processors while they are executing the code. We present two patterns in this category.&lt;br /&gt;
====Embarrassingly Parallel Pattern====&lt;br /&gt;
[[File:Fig3-embpara.jpg|200px|thumb|right|&amp;lt;ref&amp;gt;http://www.cs.uiuc.edu/homes/snir/PPP/patterns/patterns.ppt&amp;lt;/ref&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
The figure above shows a pattern where multiple independent tasks are distributed by the master node to four slave nodes. The tasks are independent of each other, so there is no communication between the slave nodes; communication occurs only between master and slave. Ideally, the master node distributes the same amount of work to each slave node. However, this depends on the size of each task, which might vary significantly from one task to another, so load balancing is difficult to achieve with this approach.&lt;br /&gt;
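The master and slave roles described above can be sketched in Python. This is a minimal illustration, not a reference implementation: a thread pool stands in for the slave nodes, and the function simulate_task and the task sizes are hypothetical names chosen only to show tasks of uneven cost.&lt;br /&gt;

```python
from concurrent.futures import ThreadPoolExecutor


def simulate_task(size):
    # Hypothetical independent task: its cost grows with `size`.
    return sum(i * i for i in range(size))


def run_master(task_sizes, num_workers=4):
    # The master hands tasks to idle workers one at a time, so a worker
    # that finishes a small task early simply picks up the next one.
    # Workers never talk to each other, only to the master.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # map() returns results in the order the tasks were submitted.
        return list(pool.map(simulate_task, task_sizes))
```

Handing out tasks one at a time, rather than assigning a fixed share to each worker up front, is one common way to soften the load-balancing problem when task sizes vary.&lt;br /&gt;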
&lt;br /&gt;
====Divide &amp;amp; Conquer====&lt;br /&gt;
Divide and Conquer is an approach to solving a big problem by splitting it into several independent smaller problems, which are solved by multiple processors; the intermediate results are then merged to get the final answer. The figure below is a graphical representation of the divide and conquer approach. A big problem is divided into multiple smaller problems, which are then distributed to multiple processors. An issue with this approach is that initially, when the number of tasks is low, the program is not able to take full advantage of parallel hardware resources. One solution is to look for parallelism that exists within each subproblem and try to exploit it with other patterns.&lt;br /&gt;
[[File:Fig4-dc.jpg|200px|thumb|right|&amp;lt;ref&amp;gt;http://www.cs.uiuc.edu/homes/snir/PPP/patterns/patterns.ppt&amp;lt;/ref&amp;gt;]]&lt;br /&gt;
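As an illustration of the split, solve, and merge steps, here is a small Python sketch of a sort built that way. It assumes a thread pool as the set of processors; the function name parallel_merge_sort and the parameter num_parts are hypothetical.&lt;br /&gt;

```python
from concurrent.futures import ThreadPoolExecutor
from heapq import merge


def parallel_merge_sort(data, num_parts=4):
    if not data:
        return []
    # Divide: cut the input into at most num_parts contiguous subproblems.
    step = -(-len(data) // num_parts)  # ceiling division
    parts = [data[i:i + step] for i in range(0, len(data), step)]

    # Conquer: sort every subproblem independently, one task per part.
    with ThreadPoolExecutor(max_workers=num_parts) as pool:
        runs = list(pool.map(sorted, parts))

    # Merge: combine sorted runs pairwise until a single run remains.
    while len(runs) != 1:
        runs = [list(merge(*runs[i:i + 2])) for i in range(0, len(runs), 2)]
    return runs[0]
```

In this sketch the merge rounds run sequentially, and each round has about half as many independent merges as the one before, which mirrors the limited parallelism available near the root of the divide and conquer tree.&lt;br /&gt;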
&lt;br /&gt;
===Data Parallelism Patterns===&lt;br /&gt;
--------------------------------------------------&lt;br /&gt;
====Replicable Pattern====&lt;br /&gt;
[[File:Fig5-rep.jpg|200px|thumb|right|&amp;lt;ref&amp;gt;http://www.cs.uiuc.edu/homes/snir/PPP/patterns/patterns.ppt&amp;lt;/ref&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
The above figure shows a replicable pattern. The issue is that we need to perform a set of operations on a global data structure, which causes dependencies across different threads. To solve this problem, each thread makes a local copy of the required global data, performs its operations on that local data, and produces a partial result. All the partial results are merged together by the master node to generate the final solution. The merging operation is called reduction. This approach can also be considered a form of divide and conquer.&lt;br /&gt;
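A minimal Python sketch of local replicas followed by a reduction, assuming the global data structure is a word histogram. The names count_local and word_histogram are hypothetical, and a thread pool stands in for the worker nodes.&lt;br /&gt;

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_local(chunk):
    # Each worker updates only its private replica of the histogram,
    # so no worker depends on another while counting.
    local = Counter()
    for word in chunk:
        local[word] += 1
    return local


def word_histogram(words, num_workers=3):
    # Give every worker its own slice of the input.
    chunks = [words[i::num_workers] for i in range(num_workers)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(count_local, chunks))
    # Reduction: the master merges the partial results into the final one.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total
```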
&lt;br /&gt;
====Repository Pattern====&lt;br /&gt;
[[File:Fig6-repo.jpg|200px|thumb|right|&amp;lt;ref&amp;gt;http://www.cs.uiuc.edu/homes/snir/PPP/patterns/patterns.ppt&amp;lt;/ref&amp;gt;]]&lt;br /&gt;
&lt;br /&gt;
In this case, we have a centralized data structure which is shared by all the computation nodes. Each worker node needs to perform operations on the central data structure in a non-deterministic way. Therefore, the central data structure is controlled by a node to guarantee that each element in the data structure is only accessible by one worker node at any time.&lt;br /&gt;
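One way to picture the controlling node is as an object that owns the shared data and serializes access to each element. The Python sketch below assumes one lock per element; the class Repository and the function run_workers are hypothetical names used only for illustration.&lt;br /&gt;

```python
import threading


class Repository:
    # Hypothetical controller for a shared list of counters.
    def __init__(self, size):
        self._items = [0] * size
        # One lock per element: two workers may update different elements
        # at the same time, but never the same element.
        self._locks = [threading.Lock() for _ in range(size)]

    def update(self, index, delta):
        with self._locks[index]:
            self._items[index] += delta

    def snapshot(self):
        return list(self._items)


def run_workers(repo, size, num_workers=4, updates_per_worker=1000):
    def work(worker_id):
        for step in range(updates_per_worker):
            # Each worker walks the elements starting from its own offset.
            repo.update((worker_id * 7 + step) % size, 1)

    threads = [threading.Thread(target=work, args=(w,)) for w in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return repo.snapshot()
```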
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
====Recursive Data Pattern&amp;lt;ref&amp;gt;Patterns for Parallel Programming, Timothy G. Mattson, ISBN-10: 0321228111&amp;lt;/ref&amp;gt;====&lt;br /&gt;
=====Problem Description=====&lt;br /&gt;
Suppose the problem involves an operation on a recursive data structure (such as a List, Tree, or Graph) that appears to require sequential processing. How can operations on these data structures be performed in parallel? &lt;br /&gt;
=====Driving Forces=====&lt;br /&gt;
*Recasting the problem to transform an inherently sequential traversal of the recursive data structure into one that allows all elements to be operated upon concurrently does so at the cost of increasing the total work of the computation. This must be balanced against the improved performance available from running in parallel.&lt;br /&gt;
*This recasting may be difficult to achieve (because it requires looking at the original problem from an unusual perspective) and may lead to a design that is difficult to understand and maintain.&lt;br /&gt;
*Whether the concurrency exposed by this pattern can be effectively exploited to improve performance depends on how computationally expensive the operation is and on the cost of communication relative to computation on the target parallel computer system. &lt;br /&gt;
&lt;br /&gt;
=====Solutions=====&lt;br /&gt;
The most challenging part of applying this pattern is restructuring the operations over a recursive data structure into a form that exposes additional concurrency. General guidelines are difficult to construct, but the key ideas should be clear from the examples provided below.&lt;br /&gt;
Consider the following situation: suppose we have a forest of rooted directed trees (defined by specifying, for each node, its immediate ancestor, with a root node's ancestor being itself) and want to compute, for each node in the forest, the root of the tree containing that node. To do this in a sequential program, we would probably trace depth-first through each tree from its root to its leaf nodes; as we visit each node, we have the needed information about the corresponding root. Total running time of such a program for a forest of N nodes would be O(N). There is some potential for concurrency (operating on sub-trees concurrently), but there is no obvious way to operate on all elements concurrently, because it appears that we cannot find the root for a particular node without knowing its parent's root.[[File:Recursive1.jpg|200px|thumb|right|Finding roots in a forest. Solid lines represent the original parent-child relationships among nodes; dashed lines point from nodes to their successors.]]&lt;br /&gt;
 &lt;br /&gt;
&lt;br /&gt;
However, a rethinking of the problem exposes additional concurrency: We first define for each node a &amp;quot;successor&amp;quot;, which initially will be its parent and ultimately will be the root of the tree to which the node belongs. We then calculate for each node its &amp;quot;successor's successor&amp;quot;. For nodes one &amp;quot;hop&amp;quot; from the root, this calculation does not change the value of its successor (because a root's parent is itself). For nodes at least two &amp;quot;hops&amp;quot; away from a root, this calculation makes the node's successor its parent's parent. We repeat this calculation until it converges (that is, the values produced by one step are the same as those produced by the preceding step), at which point every node's successor is the desired value. The figure below shows an example requiring three steps to converge. At each step we can operate on all N nodes in the tree concurrently, and the algorithm converges in at most log N steps.&lt;br /&gt;
What we have done is transform the original sequential calculation (find roots for nodes one &amp;quot;hop&amp;quot; from a root, then find roots for nodes two &amp;quot;hops&amp;quot; from a root, etc.)  into a calculation that computes a partial result (successor) for each node and then repeatedly combines these partial results, first with neighboring results, then with results from nodes two hops away, then with results from nodes four hops away, and so on. This strategy can be applied to other problems that at first appear unavoidably sequential; the Examples section presents other examples. This technique is sometimes referred to as pointer jumping or recursive doubling.&lt;br /&gt;
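The successor-doubling idea can be simulated sequentially in Python. This is a sketch under the assumption that the forest is given as a list in which parent[i] is the immediate ancestor of node i; the function name find_roots is hypothetical. A real parallel version would perform all N updates of one step at the same time.&lt;br /&gt;

```python
def find_roots(parent):
    # parent[i] is the immediate ancestor of node i; a root is its own parent.
    successor = list(parent)
    steps = 0
    while True:
        # Every entry of the new list depends only on the old list, so all
        # N updates of one step could run concurrently on a parallel machine.
        jumped = [successor[successor[i]] for i in range(len(successor))]
        if jumped == successor:
            # Converged: every successor is now the root of its tree.
            return successor, steps
        successor = jumped
        steps += 1
```

The second return value counts only the steps that changed some successor; the final pass that detects convergence is not counted.&lt;br /&gt;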
=====Example=====&lt;br /&gt;
Algorithms developed with this pattern are a type of data parallel algorithm. They are widely used on SIMD platforms and to a lesser extent in languages such as [http://en.wikipedia.org/wiki/High_Performance_Fortran High Performance Fortran]. These platforms support the fine-grained concurrency required for the pattern and handle synchronization automatically because every computation step (logically if not physically) occurs in lockstep on all the processors. Hillis and Steele &amp;lt;ref&amp;gt;W. Daniel Hillis and Guy L. Steele, Jr. Data parallel algorithms. Communications of the ACM, 29(12): 1170-1183, 1986.&amp;lt;/ref&amp;gt; describe several interesting applications of this pattern, including finding the end of a linked list, computing all partial sums of a linked list, region labeling in two-dimensional images, and parsing.&lt;br /&gt;
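Pseudocode for finding partial sums of a list:&lt;br /&gt;
&lt;pre&gt;&lt;br /&gt;
// x[k] is the value at node k; next[k] is its successor (null at the end)&lt;br /&gt;
for all k in parallel&lt;br /&gt;
{&lt;br /&gt;
    temp[k] = next[k];&lt;br /&gt;
    while temp[k] != null&lt;br /&gt;
    {&lt;br /&gt;
        // add the running sum at k into the node that temp[k] points to&lt;br /&gt;
        x[temp[k]] = x[k] + x[temp[k]];&lt;br /&gt;
        // double the distance covered by the pointer&lt;br /&gt;
        temp[k] = temp[temp[k]];&lt;br /&gt;
    }&lt;br /&gt;
}&lt;br /&gt;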
&amp;lt;/pre&amp;gt;&lt;br /&gt;
In combinatorial optimization, problems involving traversing all nodes in a graph or tree can often be solved with this pattern by first finding an ordering on the nodes to create a list. [http://en.wikipedia.org/wiki/Euler_tour_technique Euler tour] and [http://en.wikipedia.org/wiki/Ear_decomposition Ear decomposition] are well-known techniques to compute this ordering.&lt;br /&gt;
&lt;br /&gt;
====Geometric and Irregular Mesh Pattern&amp;lt;ref&amp;gt; http://www.cs.uiuc.edu/homes/snir/PPP/patterns/AMR.pdf&amp;lt;/ref&amp;gt;====&lt;br /&gt;
=====Problem Description=====&lt;br /&gt;
Data parallelism is exposed on a geometric mesh structure (either irregular or regular), where each point iteratively communicates with nearby neighboring points in computing a solution until convergence has been reached. A system of equations characterizes and governs the global and local behavior of the mesh structure, exposing each partition of mesh elements to different error and accuracy as the computation progresses in steps. Due to the varying error among the partitions, some mesh points provide sufficiently accurate results within a small number of steps, while others exhibit inaccurate results that may require “refined” computation at a more fine-grained resolution. Efficiency is an important requirement of the process, so it is necessary to adaptively refine meshes for selected regions while leaving uninteresting parts of the domain at a lower resolution.&lt;br /&gt;
&lt;br /&gt;
=====Driving Forces=====&lt;br /&gt;
*The performance of the adaptively refined computation on the mesh structure must be higher than that of a uniformly refined computation. In other words, the overhead of maintaining adaptive features must be relatively low.  &lt;br /&gt;
*To provide an accurate criterion for further refinement, a good local error estimate must be obtained locally without consulting the global mesh structure. Since global knowledge is limited, useful heuristics must be employed to calculate the local error. &lt;br /&gt;
*An efficient data structure is needed to support frequent changes in structural resolution and to preserve data locality across subsequent refinements.&lt;br /&gt;
*Partitioning and re-partitioning of the mesh structure after each refinement stage must give each processing unit a balanced computational load and minimal communication overhead.&lt;br /&gt;
*At each refinement stage, data migration and work stealing need to be implemented to balance the computational load dynamically.  &lt;br /&gt;
&lt;br /&gt;
=====Solutions=====&lt;br /&gt;
The solution is an iterative process that consists of multiple components. First, we need an initial partition to divide the mesh points among the processing units. Second, we need an error indicator that evaluates how close the locally computed results are to the real solution. Third, when the error is above a certain tolerance level, the partition needs to be “refined”, meaning that the mesh size is reduced by a constant factor (usually a power of two). Fourth, as the partition is altered, the mapping of the data elements to the processing units must also be adjusted for better load balance while preserving data locality. All these components repeat, supported by data structures designed for efficient access and locality preservation. The algorithm for the Adaptive-Mesh-Refinement pattern can be outlined as follows: &lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
n = number of processors;&lt;br /&gt;
m = mesh structure;&lt;br /&gt;
Initially partition m over n processors;&lt;br /&gt;
while (not all partitions satisfy error tolerance) {&lt;br /&gt;
	compute locally value of partition p;  &lt;br /&gt;
	// using the system of equations.&lt;br /&gt;
	&lt;br /&gt;
	for each mesh point mp in partition p {&lt;br /&gt;
		if (errorEstimate(mp) &amp;gt; tol) {&lt;br /&gt;
			mark mp for refinement;&lt;br /&gt;
		}&lt;br /&gt;
	}&lt;br /&gt;
	refine mesh structure where marked;&lt;br /&gt;
	redistribute m OR &lt;br /&gt;
	migrate individual data between processors;&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
=====Example=====&lt;br /&gt;
The [http://ww2.cs.mu.oz.au/498/notes/node51.html Barnes-Hut algorithm] is used in n-body particle simulations to compute the force between each pair among n particles and thereby update their positions. The spatial domain is represented as a quadtree or octree, which is iteratively updated as each particle’s position changes during the simulation. One special property that the algorithm exploits is the distance between the particles. If a cluster of particles is located sufficiently far away from another, its effective force can be approximated by using the centre of mass of the cluster (without having to visit individual nodes). This effective force can be calculated recursively, thus reducing the number of traversals required to compute force interactions. &lt;br /&gt;
Although quite different from solving partial differential equations (PDEs), Barnes-Hut also exhibits adaptive refinement in a simple way: whereas refinement in a PDE solver is triggered by an error condition, Barnes-Hut adaptively updates the tree representation depending on the distance between the particles. This “distance indicator” acts as a guideline for whether to refine or even retract a given data representation as the particles move throughout the iterations. As this happens, proper load balancing and redistribution must follow to maintain the performance of the program. &lt;br /&gt;
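The centre-of-mass approximation at the heart of this example can be checked numerically with a short Python sketch. It is not a Barnes-Hut implementation (there is no tree): it only compares the exact force from a distant cluster with the force from a single pseudo-particle at the cluster's centre of mass. The inverse-square force law with the gravitational constant set to 1, the two-dimensional setting, and the sample cluster are all assumptions made for illustration.&lt;br /&gt;

```python
import math


def center_of_mass(cluster):
    # cluster is a list of (mass, x, y) tuples.
    total = sum(m for m, _, _ in cluster)
    cx = sum(m * x for m, x, _ in cluster) / total
    cy = sum(m * y for m, _, y in cluster) / total
    return total, cx, cy


def force_on(px, py, bodies):
    # Net inverse-square attraction on a unit-mass particle at (px, py),
    # with the gravitational constant taken as 1 (an assumption).
    fx = fy = 0.0
    for m, x, y in bodies:
        dx, dy = x - px, y - py
        r = math.hypot(dx, dy)
        fx += m * dx / r ** 3
        fy += m * dy / r ** 3
    return fx, fy


# A small cluster about 100 units away from the particle at the origin.
cluster = [(1.0, 99.0, 0.0), (1.0, 101.0, 0.0), (2.0, 100.0, 1.0), (2.0, 100.0, -1.0)]
exact = force_on(0.0, 0.0, cluster)                     # visits every particle
approx = force_on(0.0, 0.0, [center_of_mass(cluster)])  # one pseudo-particle
```

For this well-separated cluster the two force vectors agree closely, which is why the algorithm can skip the individual particles of a distant cluster.&lt;br /&gt;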
&lt;br /&gt;
====Pipeline Pattern&amp;lt;ref&amp;gt; Patterns for Parallel Programming, Timothy G. Mattson, ISBN-10: 0321228111&amp;lt;/ref&amp;gt;====&lt;br /&gt;
=====Problem Description=====&lt;br /&gt;
Suppose that the overall computation involves performing a calculation on many sets of data, where the calculation can be viewed in terms of data flowing through a sequence of stages. How can the potential concurrency be exploited?&lt;br /&gt;
An assembly line is a good analogy for this pattern. Suppose we want to manufacture a number of cars. The manufacturing process can be broken down into a sequence of operations each of which adds some component, say the engine or the windshield, to the car. An assembly line (pipeline) assigns a component to each worker. As each car moves down the assembly line, each worker installs the same component over and over on a succession of cars. After the pipeline is full (and until it starts to empty) the workers can all be busy simultaneously, all performing their operations on the cars that are currently at their stations.&lt;br /&gt;
&lt;br /&gt;
=====Driving Forces=====&lt;br /&gt;
*A good solution should make it simple to express the ordering constraints. The ordering constraints in this problem are simple and regular and lend themselves to being expressed in terms of data flowing through a pipeline.&lt;br /&gt;
*The target platform can include special-purpose hardware that can perform some of the desired operations.&lt;br /&gt;
* In some applications, future additions, modifications, or reordering of the stages in the pipeline are expected.&lt;br /&gt;
* In some applications, occasional items in the input sequence can contain errors that prevent their processing.&lt;br /&gt;
=====Solutions=====&lt;br /&gt;
The key idea of this pattern is captured by the assembly-line analogy, namely that the potential concurrency can be exploited by assigning each operation (stage of the pipeline) to a different worker and having them work simultaneously, with the data elements passing from one worker to the next as operations are completed. In parallel-programming terms, the idea is to assign each task (stage of the pipeline) to a unit of execution (UE), such as a thread or process, and provide a mechanism whereby each stage of the pipeline can send data elements to the next stage. This strategy is probably the most straightforward way to deal with this type of ordering constraint. It allows the application to take advantage of special-purpose hardware by appropriate mapping of pipeline stages to processing elements (PEs) and provides a reasonable mechanism for handling errors, described later. It also is likely to yield a modular design that can later be extended or modified.&lt;br /&gt;
[[File:Pipeline1 yw.jpg|200px|thumb|right]]&lt;br /&gt;
=====Example=====&lt;br /&gt;
A type of calculation widely used in signal processing involves performing the following computations repeatedly on different sets of data.&lt;br /&gt;
*Perform a discrete Fourier transform (DFT) on a set of data.&lt;br /&gt;
* Manipulate the result of the transform elementwise.&lt;br /&gt;
* Perform an inverse DFT on the result of the manipulation.&lt;br /&gt;
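The three steps above map naturally onto a three-stage pipeline. The Python sketch below assumes one thread per stage connected by FIFO queues, a naive DFT written directly from its definition, and a hypothetical elementwise manipulation (scale) that doubles every coefficient. The names stage, run_pipeline, and DONE are likewise illustrative.&lt;br /&gt;

```python
import cmath
import queue
import threading

DONE = object()  # sentinel that tells a stage the input stream has ended


def dft(xs, sign=-1):
    # Naive discrete Fourier transform, straight from the definition.
    n = len(xs)
    return [sum(x * cmath.exp(sign * 2j * cmath.pi * t * k / n)
                for t, x in enumerate(xs)) for k in range(n)]


def inverse_dft(xs):
    return [v / len(xs) for v in dft(xs, sign=1)]


def scale(xs):
    # Hypothetical elementwise manipulation: double every coefficient.
    return [2 * v for v in xs]


def stage(func, inbox, outbox):
    # Each stage repeatedly takes one item, processes it, and passes it on.
    while True:
        item = inbox.get()
        if item is DONE:
            outbox.put(DONE)
            return
        outbox.put(func(item))


def run_pipeline(data_sets):
    funcs = [dft, scale, inverse_dft]
    queues = [queue.Queue() for _ in range(len(funcs) + 1)]
    threads = [threading.Thread(target=stage, args=(f, queues[i], queues[i + 1]))
               for i, f in enumerate(funcs)]
    for t in threads:
        t.start()
    for data in data_sets:
        queues[0].put(data)
    queues[0].put(DONE)
    results = []
    while True:
        out = queues[-1].get()
        if out is DONE:
            break
        results.append(out)
    for t in threads:
        t.join()
    return results
```

Once the first data set has left the first stage, all three stages can be busy on different data sets at the same time. Because each stage is a single thread reading from a FIFO queue, results come out in the order the data sets went in.&lt;br /&gt;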
===Inseparable Patterns===&lt;br /&gt;
-----------------------------------------------------------&lt;br /&gt;
Some situations do not fit any of the patterns above. In these cases, elements need explicit protection when they are accessed. Examples include [http://en.wikipedia.org/wiki/Mutual_exclusion mutual exclusion] and the [http://en.wikipedia.org/wiki/Producer-consumer_problem producer-consumer] problem. Some patterns in this category are only vaguely defined. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Limitations of Parallel Patterns==&lt;br /&gt;
--------------------------------------------------------------&lt;br /&gt;
Some argue that the parallel patterns identified so far contribute little or nothing to parallel software development. They claim that parallel design patterns are too trivial and fail to give detailed guidelines for the specific parameters that can affect performance across a wide range of complex problem requirements. From experience, the community is coming to a consensus that there are only a limited number of patterns: pipeline, master &amp;amp; slave, divide &amp;amp; conquer, geometric, replicable, repository, and not many more. So far, community efforts have focused on discovering more design patterns, but new patterns vary only a little and fall not far from one of the patterns mentioned above. The ineffectiveness of these patterns, in fact, comes not from a lack of variety but from the inflexibility of how existing patterns are presented.&amp;lt;ref&amp;gt;http://www.cs.uiuc.edu/homes/snir/PPP/patterns/concerns.pdf&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Comparison==&lt;br /&gt;
--------------------------------------------------------------------------&lt;br /&gt;
Although the patterns are very different from each other, since they are meant to fit different problems, they do have some similarities. They all start by decomposing a sequential problem to expose concurrency. As the two categories above indicate, data parallelism patterns expose concurrency by working on different data elements at the same time, whereas functional parallelism patterns expose concurrency by working on independent functional tasks at the same time&amp;lt;ref&amp;gt; http://www.cs.uiuc.edu/homes/snir/PPP/ &amp;lt;/ref&amp;gt;.&lt;br /&gt;
Also, patterns that solve similar problems have features in common. For example, many dependencies exist in both the mesh pattern and the pipeline pattern. The repository and Divide &amp;amp; Conquer patterns, on the other hand, access the same memory locations frequently, so such patterns need memory protection.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==Quiz==&lt;br /&gt;
-------------------------------------------------------&lt;br /&gt;
1. Which of the following is NOT defined by a parallel design pattern? &amp;lt;br /&amp;gt;&lt;br /&gt;
	A. Communication framework&amp;lt;br /&amp;gt;&lt;br /&gt;
	B. Application-specific procedures&amp;lt;br /&amp;gt;&lt;br /&gt;
	C. Data structure&amp;lt;br /&amp;gt;&lt;br /&gt;
	D. Type of parallelism&amp;lt;br /&amp;gt;&lt;br /&gt;
	&lt;br /&gt;
&lt;br /&gt;
2. Which of the following is true for the embarrassingly parallel pattern? &amp;lt;br /&amp;gt;&lt;br /&gt;
	A. Slave nodes communicate with each other&amp;lt;br /&amp;gt;&lt;br /&gt;
	B. Slave nodes only communicate with the master node&amp;lt;br /&amp;gt;&lt;br /&gt;
	C. Slave nodes are responsible for load balancing&amp;lt;br /&amp;gt;&lt;br /&gt;
	D. None of the above&amp;lt;br /&amp;gt;&lt;br /&gt;
	&lt;br /&gt;
&lt;br /&gt;
3. Which of the following is a characteristic of the replicable parallel pattern? &amp;lt;br /&amp;gt;&lt;br /&gt;
	A. Slave nodes perform operations on the global data structure directly. &amp;lt;br /&amp;gt;&lt;br /&gt;
	B. Slave nodes have their own copy of required data. &amp;lt;br /&amp;gt;&lt;br /&gt;
	C. Slave nodes are responsible for reduction. &amp;lt;br /&amp;gt;&lt;br /&gt;
	D. None of the above&amp;lt;br /&amp;gt;&lt;br /&gt;
	&lt;br /&gt;
&lt;br /&gt;
4. Which pattern has a centralized data structure to which independent computations need to be applied?&amp;lt;br /&amp;gt;&lt;br /&gt;
	A. Pipeline.&amp;lt;br /&amp;gt;&lt;br /&gt;
	B. Repository.&amp;lt;br /&amp;gt;&lt;br /&gt;
	C. Mesh.&amp;lt;br /&amp;gt;&lt;br /&gt;
	D. Replicable.&amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
5. Which structure(s) can be example(s) of recursive data structure(s)? &amp;lt;br /&amp;gt;&lt;br /&gt;
	A. array.&amp;lt;br /&amp;gt;&lt;br /&gt;
	B. list.&amp;lt;br /&amp;gt;&lt;br /&gt;
	C. tree.&amp;lt;br /&amp;gt;&lt;br /&gt;
	D. graph.&amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
6. The Barnes-Hut algorithm is an example of which pattern? &amp;lt;br /&amp;gt;&lt;br /&gt;
	A. mesh pattern.&amp;lt;br /&amp;gt;&lt;br /&gt;
	B. pipeline pattern.&amp;lt;br /&amp;gt;&lt;br /&gt;
	C. recursive data pattern.&amp;lt;br /&amp;gt;&lt;br /&gt;
	D. replicable.&amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
7. Which of the following are steps in solving the geometric pattern problem?&amp;lt;br /&amp;gt;&lt;br /&gt;
	A. An initial partition is needed to divide the mesh points among the processing units.&amp;lt;br /&amp;gt;&lt;br /&gt;
	B. An error indicator evaluates how close the locally computed results are to the real solution.&amp;lt;br /&amp;gt;&lt;br /&gt;
	C. When the error is above a certain tolerance level, the partition needs to be “refined”.&amp;lt;br /&amp;gt;&lt;br /&gt;
	D. Protection is set up for a centralized data structure.&amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
8. Which may be the most challenging part of solving the recursive data structure pattern problem? &amp;lt;br /&amp;gt;&lt;br /&gt;
	A. rewrite the operations into a form that exposes concurrency&amp;lt;br /&amp;gt;&lt;br /&gt;
	B. balance the load of each processor&amp;lt;br /&amp;gt;&lt;br /&gt;
	C. protection of data structure&amp;lt;br /&amp;gt;&lt;br /&gt;
	D. deal with dependencies&amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
9.  Which pattern can be compared to an assembly line? &amp;lt;br /&amp;gt;&lt;br /&gt;
	A. Pipeline pattern&amp;lt;br /&amp;gt;&lt;br /&gt;
	B. Recursive data pattern&amp;lt;br /&amp;gt;&lt;br /&gt;
	C. Replicable&amp;lt;br /&amp;gt;&lt;br /&gt;
	D. Geometric&amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
10.  Which is a practical example of the pipeline pattern? &amp;lt;br /&amp;gt;&lt;br /&gt;
	A. signal processing&amp;lt;br /&amp;gt;&lt;br /&gt;
	B. tree traversal&amp;lt;br /&amp;gt;&lt;br /&gt;
	C. ocean simulation&amp;lt;br /&amp;gt;&lt;br /&gt;
	D. Barnes-Hut algorithm&amp;lt;br /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
&lt;br /&gt;
&amp;lt;references&amp;gt;&amp;lt;/references&amp;gt;&lt;/div&gt;</summary>
		<author><name>Pxu</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=File:Fig6-repo.jpg&amp;diff=58279</id>
		<title>File:Fig6-repo.jpg</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=File:Fig6-repo.jpg&amp;diff=58279"/>
		<updated>2012-02-11T05:49:29Z</updated>

		<summary type="html">&lt;p&gt;Pxu: Repository pattern&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Repository pattern&lt;/div&gt;</summary>
		<author><name>Pxu</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=File:Fig5-rep.jpg&amp;diff=58278</id>
		<title>File:Fig5-rep.jpg</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=File:Fig5-rep.jpg&amp;diff=58278"/>
		<updated>2012-02-11T05:42:26Z</updated>

		<summary type="html">&lt;p&gt;Pxu: Replicable pattern&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Replicable pattern&lt;/div&gt;</summary>
		<author><name>Pxu</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=File:Fig4-dc.jpg&amp;diff=58277</id>
		<title>File:Fig4-dc.jpg</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=File:Fig4-dc.jpg&amp;diff=58277"/>
		<updated>2012-02-11T05:25:53Z</updated>

		<summary type="html">&lt;p&gt;Pxu: Divide and Conquer&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Divide and Conquer&lt;/div&gt;</summary>
		<author><name>Pxu</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=File:Fig3-embpara.jpg&amp;diff=58276</id>
		<title>File:Fig3-embpara.jpg</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=File:Fig3-embpara.jpg&amp;diff=58276"/>
		<updated>2012-02-11T05:00:30Z</updated>

		<summary type="html">&lt;p&gt;Pxu: Embarassingly Parallel Pattern&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Embarassingly Parallel Pattern&lt;/div&gt;</summary>
		<author><name>Pxu</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=File:Fig1-lunchpattern.jpg&amp;diff=58275</id>
		<title>File:Fig1-lunchpattern.jpg</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=File:Fig1-lunchpattern.jpg&amp;diff=58275"/>
		<updated>2012-02-11T03:51:45Z</updated>

		<summary type="html">&lt;p&gt;Pxu: &amp;quot;Lunch&amp;quot; pattern&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;quot;Lunch&amp;quot; pattern&lt;/div&gt;</summary>
		<author><name>Pxu</name></author>
	</entry>
</feed>