CSC/ECE 506 Spring 2012/8a cj: Difference between revisions

From Expertiza_Wiki
Jump to navigation Jump to search
No edit summary
No edit summary
Line 6: Line 6:
[[Image:Busbased SMP.jpg|frame|center|<b>Figure 1:</b> Typical Bus-Based Processor Model]]
[[Image:Busbased SMP.jpg|frame|center|<b>Figure 1:</b> Typical Bus-Based Processor Model]]


==Summary of Cache Coherence Protocols==
At any point in logical time, the permissions for a cache block can allow either a single writer or multiple readers. The '''''[http://en.wikipedia.org/wiki/Cache_coherence#Coherency_protocol coherence protocol]''''' ensures the invariants of the states are maintained. The different coherent states used by most of the cache coherent protocols are shown in ''Table 1'':
<center>
{| class="wikitable" border="1"
|-
|  '''States'''
|  '''Access Type'''
|  '''Invariant'''
|-
|  '''Modified'''
|  read, write
|  all other caches in I state
|-
|  '''Exclusive'''
|  read
|  all other caches in I state
|-
|  '''Owned'''
|  read
|  all other caches in I or S state
|-
|  '''Shared'''
|  read
|  no other cache in M or E state
|-
|  '''Invalid'''
|  -
|  -
|}
<b>Table 1:</b> States Used by the Cache Coherent Protocols</center>
<br />
<br />
The first widely adopted approach to cache coherence is snooping on a bus. We will now discuss them in details in the following sections.
===MSI===
'''[http://en.wikipedia.org/wiki/MSI_protocol MSI protocol]''' is a three-state write-back '''invalidation protocol''' which is one of the earliest snooping-based cache coherence-protocols. It marks the cache line with one of the '''Modified (M) ,Shared (S)''', and '''Invalid (I)''' state. '''Invalid''' means the cache line is either not present or is in invalid state. If the cache line is clean and is shared by more than one processor , it is marked as '''Shared'''. If a cache line is dirty and the processor has exclusive ownership of it, it is in '''Modified''' state. BusRdx causes other processors to invalidate (demote) its cache block to the '''I''' state. If it is present in the '''M''' state in another cache, it will flush.
The following state transition diagram for MSI protocol explains the working of the protocol:
[[Image:MSI.jpg|frame|center|<b>Figure 2:</b> MSI State Diagram]]
===MESI===
'''[http://en.wikipedia.org/wiki/MSI_protocol MSI protocol]''' has a major drawback in that each read-write sequence incurs two bus transactions irrespective of whether the cache line is stored in only one cache or not. This is a huge setback for highly parallel programs that have little data sharing. '''[http://en.wikipedia.org/wiki/MESI_protocol MESI'''] protocol solves this problem by introducing the E ('''Exclusive''') state to distinguish between a cache line stored in multiple caches and a line stored in a single cache.
Let us briefly look at how the MESI protocol works. (For a more detailed version readers are referred to Solihin textbook pg. 215)
'''[http://en.wikipedia.org/wiki/MESI_protocol MESI]''' coherence protocol marks each cache line in one of the Modified, Exclusive, Shared or Invalid state.
* '''Invalid''' : The cache line is either not present or is invalid
* '''Exclusive''' : The cache line is clean and is owned by this core/processor only
* '''Modified''' :  This implies that the cache line is dirty and the core/processor has  exclusive ownership of the cache line, exclusive of the memory also.
* '''Shared''' : The cache line is clean and is shared by more than one core/processor
In a nutshell, the '''[http://en.wikipedia.org/wiki/MESI_protocol MESI'''] protocol works as the following:
A line that is just fetched receives '''E''' or '''S''' state depending on whether it exists in other processors in the system or not. Similarly, a cache line gets the '''M''' state when a processor writes to it. If the line is not in '''E''' or '''M''' state prior to being written to, the cache sends a Bus Upgrade (BusUpgr) signal or as the Intel manuals call it, a “Read-For-Ownership (RFO) request” that ensures that the line exists in the cache and is in the '''I''' state in all other processors on the bus (if any). The table shown below will summarize '''[http://en.wikipedia.org/wiki/MESI_protocol MESI]''' protocol.
<center>
{| class="wikitable" border="1"
|-
|  '''Cache Line State:'''
|  '''Modified'''
|  '''Exclusive'''
|  '''Shared'''
|  '''Invalid'''
|-
|  '''This cache line is valid?'''
|  Yes
|  Yes
|  Yes
|  No
|-
|  '''The memory copy is…'''
|  out of date
|  valid
|  valid
|  -
|-
|  '''Copies exist in caches of other processors?''' 
|  No
|  No
|  Maybe
|  Maybe
|-
|  '''A write to this line'''
|  does not go to bus
|  does not go to bus
|  goes to bus and updates cache
|  goes directly to bus
|}
<b>Table 2:</b> MESI Protocol Summary
</center>
The transition diagram from the lecture slides is given below for reference.
[[Image:MESI.jpg|frame|center|<b>Figure 4:</b> MESI State Diagram]]
===MESIF===
The '''[http://en.wikipedia.org/wiki/MESIF_protocol MESIF]''' protocol, used in the latest Intel multi-core processors Core i7, was introduced to '''accommodate the point-to-point''' links used in the [http://en.wikipedia.org/wiki/Intel_QuickPath_Interconnect QuickPath Interconnect]. Using the '''MESI''' protocol in this architecture would send many redundant messages between different processors, often with unnecessarily high latency. For example, when a processor requests a cache line that is stored in multiple locations, every location might respond with the data. As the requesting processor only needs a single copy of the data, the system would be '''wasting the bandwidth'''.
As a '''solution''' to this problem, an additional state, '''Forward state''', was added by slightly changing the role of the Shared state. Whenever there is a read request, only the cache line in the F state will respond to the request, while all the S state caches remain dormant.  Hence, by designating a single cache line to '''respond to requests''', coherency traffic is substantially reduced when multiple copies of the data exist. Also, on a read request, the F state transitions to the S state. That is, when a cache line in the '''F''' state is '''copied''', the F state '''migrates''' to the '''newer copy''', while the '''older''' one drops back to the '''S''' state. Moving the new copy to the F state '''exploits''' both '''temporal and spatial locality'''. Because the newest copy of the cache line is always in the F state, it is very unlikely that the line in the F state will be evicted from the caches. This takes advantage of the temporal locality of the request. The second advantage is that if a particular cache line is in high demand due to spatial locality, the bandwidth used to transmit that data will be spread across several cores.
All M to S state transitions and E to S state transitions will now be from '''M to F''' and '''E to F'''. 
The '''F state''' is '''different''' from the '''Owned state''' of the MOESI protocol as it is '''not''' a '''unique''' copy because a valid copy is stored in memory. Thus, unlike the Owned state of the MOESI protocol, in which the data in the O state is the only valid copy of the data, the data in the F state can be evicted or converted to the S state, if desired.
More information on the QuickPath Interconnect and MESIF protocol can be found at
'''[http://www.intel.com/technology/quickpath/introduction.pdf Introduction to QuickPath Interconnect]'''.
===MOESI===
[http://en.wikipedia.org/wiki/Opteron AMD Opteron] was AMD’s first-generation dual core processor which had two distinct [http://en.wikipedia.org/wiki/Athlon_64 K8 cores] together on a single die. Cache coherence produces bigger problems on such multicore processors. It was necessary to use an appropriate coherence protocol to address this problem. The [http://en.wikipedia.org/wiki/Xeon Intel Xeon], which was the competitive counterpart from Intel to AMD dual core Opteron, used the [http://en.wikipedia.org/wiki/MESI_protocol MESI] protocol to handle cache coherence. [http://en.wikipedia.org/wiki/MESI_protocol MESI] came with the drawback of using much time and bandwidth in certain situations.
[http://en.wikipedia.org/wiki/MOESI_protocol MOESI] was AMD’s answer to this problem. [http://en.wikipedia.org/wiki/MOESI_protocol MOESI] added a fifth state to [http://en.wikipedia.org/wiki/MESI_protocol MESI] protocol called the '''owned''' state. [http://en.wikipedia.org/wiki/MOESI_protocol MOESI] protocol addresses the bandwidth problem faced by MESI protocol when processor having invalid data in its cache wants to modify the data.  The processor seeking the data will have to wait for the processor which modified this data to write back to the main memory, which takes time and bandwidth. This drawback is removed in MOESI by allowing dirty sharing.  When the data is held by a processor in the new state '''owned''', it can provide other processors the modified data without writing it to the main memory. This is called '''''dirty sharing'''''. The processor with the data in the '''owned''' state stays responsible to update the main memory later when the cache line is evicted.
[http://en.wikipedia.org/wiki/MOESI_protocol MOESI]  has become one of the most popular snoop-based protocols supported in AMD64 architecture.  AMD dual-core Opteron can maintain cache coherence in systems up to 8 processors using this protocol.
The five different states of [http://en.wikipedia.org/wiki/MOESI_protocol MOESI] protocol are:
* '''Modified (M)''': The most recent copy of the data is present in the cache line. But it is not present in any other processor cache.
* '''Owned (O)''': The cache line has the most recent correct copy of the data. This can be shared by other processors. The processor in this state for this cache line is responsible to update the correct value in the main memory before it gets evicted.
* '''Exclusive (E)''': A cache line holds the most recent, correct copy of the data, which is exclusively present on this processor and a copy is present in the main memory.
* '''Shared (S)''': A cache line in the shared state holds the most recent, correct copy of the data, which may be shared by other processors.
* '''Invalid (I)''': A cache line does not hold a valid copy of the data.
A detailed explanation of this protocol implementation on AMD processor can be found in the manual [http://www.chip-architect.com/news/2003_09_21_Detailed_Architecture_of_AMDs_64bit_Core.html Architecture of AMD 64-bit core]
The following table summarizes the [http://en.wikipedia.org/wiki/MOESI_protocol MOESI] protocol:
<center>
{| class="wikitable" border="1"
|-
|  '''Cache Line State:'''
|  '''Modified'''
|  '''Owner'''
|  '''Exclusive'''
|  '''Shared'''
|  '''Invalid'''
|-
|  '''This cache line is valid?'''
|  Yes
|  Yes
|  Yes
|  Yes
|  No
|-
|  '''The memory copy is…'''
|  out of date
|  out of date
|  valid
|  valid
|  -
|-
|  '''Copies exist in caches of other processors?'''
|  No
|  No
|  Yes(out of date values)|  Maybe
|  Maybe
|  -
|-
|  '''A write to this line'''
|  does not go to bus
|  does not go to bus|  does not go to bus
|  goes to bus and updates cache
|  goes directly to bus
|  -
|}
<b>Table 3:</b> MOESI Protocol Summary
</center>
State transition for [http://en.wikipedia.org/wiki/MOESI_protocol MOESI]is as shown below :
[[Image:MOESI_State_Transition_Diagram.jpg|frame|center|<b>Figure 7:</b> [http://en.wikipedia.org/wiki/MOESI_protocol MOESI]State transition Diagram]]
We will now discuss which cache coherence protocol processor vendors like '''Intel''', '''AMD''' and others adopt in various products.
===Cache Coherence Protocols and Their Performance===
Of all protocols, MSI appears to have the worst performance due to the fact that each read-write sequence incurs two bus transactions even if the data is stored in only one cache. This causes a huge performance penalty for highly optimized parallel programs that have little data sharing. MESI protocol solves this problem by introducing the E (Exclusive) state to distinguish between a cache line stored in multiple caches and a line stored in only one cache. This protocol, however, can still be improved further.
One remaining issue with MESI protocol is that in multi-processor configuration, when a processor requests a cache line that is stored in multiple cache, every cache might respond with the data. This behavior would waste bus bandwidth and incur high latency. As a solution to this problem, MESIF protocol was introduced. An additional state, Forward state, was added by slightly changing the role of the Shared state. The result of this improvement is that instead of all caches responding, only the cache line in the F state will respond to the request, while all the S state caches remain dormant. As a result, performance and power efficiency is increased.[23]
Another problem with MESI protocol is when a processor having invalid data in its cache wants to modify the data, it will have to wait for the other processor to write back to the memory, which is a very slow process. MOESI protocol solves this problem by allowing dirty
sharing. This also saves bus bandwidth and increases performance.[24]





Revision as of 14:43, 20 March 2012

Introduction to bus-based cache coherence in real machines

Most parallel software in the commercial market relies on the shared-memory programming model in which all processors access the same physical address space. The most common multiprocessors today use SMP architecture which use a common bus as the interconnect. In the case of multicore processors ("chip multiprocessors," or CMP) the SMP architecture applies to the cores treating them as separate processors. The key problem of shared-memory multiprocessors is providing a consistent view of memory with various cache hierarchies. This is called cache coherence problem. It is critical to achieve correctness and performance-sensitive design point for supporting the shared-memory model. The cache coherence mechanisms not only govern communication in a shared-memory multiprocessor, but also typically determine how the memory system transfers data between processors, caches, and memory.

Figure 1: Typical Bus-Based Processor Model

Summary of Cache Coherence Protocols

At any point in logical time, the permissions for a cache block can allow either a single writer or multiple readers. The coherence protocol ensures the invariants of the states are maintained. The different coherent states used by most of the cache coherent protocols are shown in Table 1:

States Access Type Invariant
Modified read, write all other caches in I state
Exclusive read all other caches in I state
Owned read all other caches in I or S state
Shared read no other cache in M or E state
Invalid - -
Table 1: States Used by the Cache Coherent Protocols




The first widely adopted approach to cache coherence is snooping on a bus. We will now discuss them in details in the following sections.

MSI

MSI protocol is a three-state write-back invalidation protocol which is one of the earliest snooping-based cache coherence-protocols. It marks the cache line with one of the Modified (M) ,Shared (S), and Invalid (I) state. Invalid means the cache line is either not present or is in invalid state. If the cache line is clean and is shared by more than one processor , it is marked as Shared. If a cache line is dirty and the processor has exclusive ownership of it, it is in Modified state. BusRdx causes other processors to invalidate (demote) its cache block to the I state. If it is present in the M state in another cache, it will flush.

The following state transition diagram for MSI protocol explains the working of the protocol:

Figure 2: MSI State Diagram

MESI

MSI protocol has a major drawback in that each read-write sequence incurs two bus transactions irrespective of whether the cache line is stored in only one cache or not. This is a huge setback for highly parallel programs that have little data sharing. MESI protocol solves this problem by introducing the E (Exclusive) state to distinguish between a cache line stored in multiple caches and a line stored in a single cache. Let us briefly look at how the MESI protocol works. (For a more detailed version readers are referred to Solihin textbook pg. 215)

MESI coherence protocol marks each cache line in one of the Modified, Exclusive, Shared or Invalid state.

  • Invalid : The cache line is either not present or is invalid
  • Exclusive : The cache line is clean and is owned by this core/processor only
  • Modified : This implies that the cache line is dirty and the core/processor has exclusive ownership of the cache line, exclusive of the memory also.
  • Shared : The cache line is clean and is shared by more than one core/processor

In a nutshell, the MESI protocol works as the following: A line that is just fetched receives E or S state depending on whether it exists in other processors in the system or not. Similarly, a cache line gets the M state when a processor writes to it. If the line is not in E or M state prior to being written to, the cache sends a Bus Upgrade (BusUpgr) signal or as the Intel manuals call it, a “Read-For-Ownership (RFO) request” that ensures that the line exists in the cache and is in the I state in all other processors on the bus (if any). The table shown below will summarize MESI protocol.

Cache Line State: Modified Exclusive Shared Invalid
This cache line is valid? Yes Yes Yes No
The memory copy is… out of date valid valid -
Copies exist in caches of other processors? No No Maybe Maybe
A write to this line does not go to bus does not go to bus goes to bus and updates cache goes directly to bus

Table 2: MESI Protocol Summary

The transition diagram from the lecture slides is given below for reference.

Figure 4: MESI State Diagram

MESIF

The MESIF protocol, used in the latest Intel multi-core processors Core i7, was introduced to accommodate the point-to-point links used in the QuickPath Interconnect. Using the MESI protocol in this architecture would send many redundant messages between different processors, often with unnecessarily high latency. For example, when a processor requests a cache line that is stored in multiple locations, every location might respond with the data. As the requesting processor only needs a single copy of the data, the system would be wasting the bandwidth. As a solution to this problem, an additional state, Forward state, was added by slightly changing the role of the Shared state. Whenever there is a read request, only the cache line in the F state will respond to the request, while all the S state caches remain dormant. Hence, by designating a single cache line to respond to requests, coherency traffic is substantially reduced when multiple copies of the data exist. Also, on a read request, the F state transitions to the S state. That is, when a cache line in the F state is copied, the F state migrates to the newer copy, while the older one drops back to the S state. Moving the new copy to the F state exploits both temporal and spatial locality. Because the newest copy of the cache line is always in the F state, it is very unlikely that the line in the F state will be evicted from the caches. This takes advantage of the temporal locality of the request. The second advantage is that if a particular cache line is in high demand due to spatial locality, the bandwidth used to transmit that data will be spread across several cores. All M to S state transitions and E to S state transitions will now be from M to F and E to F. The F state is different from the Owned state of the MOESI protocol as it is not a unique copy because a valid copy is stored in memory. Thus, unlike the Owned state of the MOESI protocol, in which the data in the O state is the only valid copy of the data, the data in the F state can be evicted or converted to the S state, if desired.

More information on the QuickPath Interconnect and MESIF protocol can be found at Introduction to QuickPath Interconnect.

MOESI

AMD Opteron was AMD’s first-generation dual core processor which had two distinct K8 cores together on a single die. Cache coherence produces bigger problems on such multicore processors. It was necessary to use an appropriate coherence protocol to address this problem. The Intel Xeon, which was the competitive counterpart from Intel to AMD dual core Opteron, used the MESI protocol to handle cache coherence. MESI came with the drawback of using much time and bandwidth in certain situations.

MOESI was AMD’s answer to this problem. MOESI added a fifth state to MESI protocol called the owned state. MOESI protocol addresses the bandwidth problem faced by MESI protocol when processor having invalid data in its cache wants to modify the data. The processor seeking the data will have to wait for the processor which modified this data to write back to the main memory, which takes time and bandwidth. This drawback is removed in MOESI by allowing dirty sharing. When the data is held by a processor in the new state owned, it can provide other processors the modified data without writing it to the main memory. This is called dirty sharing. The processor with the data in the owned state stays responsible to update the main memory later when the cache line is evicted.

MOESI has become one of the most popular snoop-based protocols supported in AMD64 architecture. AMD dual-core Opteron can maintain cache coherence in systems up to 8 processors using this protocol.

The five different states of MOESI protocol are:

  • Modified (M): The most recent copy of the data is present in the cache line. But it is not present in any other processor cache.
  • Owned (O): The cache line has the most recent correct copy of the data. This can be shared by other processors. The processor in this state for this cache line is responsible to update the correct value in the main memory before it gets evicted.
  • Exclusive (E): A cache line holds the most recent, correct copy of the data, which is exclusively present on this processor and a copy is present in the main memory.
  • Shared (S): A cache line in the shared state holds the most recent, correct copy of the data, which may be shared by other processors.
  • Invalid (I): A cache line does not hold a valid copy of the data.

A detailed explanation of this protocol implementation on AMD processor can be found in the manual Architecture of AMD 64-bit core

The following table summarizes the MOESI protocol:

Cache Line State: Modified Owner Exclusive Shared Invalid
This cache line is valid? Yes Yes Yes Yes No
The memory copy is… out of date out of date valid valid -
Copies exist in caches of other processors? No No Maybe Maybe -
A write to this line does not go to bus does not go to bus goes to bus and updates cache goes directly to bus -

Table 3: MOESI Protocol Summary

State transition for MOESIis as shown below :

Figure 7: MOESIState transition Diagram

We will now discuss which cache coherence protocol processor vendors like Intel, AMD and others adopt in various products.

Cache Coherence Protocols and Their Performance

Of all protocols, MSI appears to have the worst performance due to the fact that each read-write sequence incurs two bus transactions even if the data is stored in only one cache. This causes a huge performance penalty for highly optimized parallel programs that have little data sharing. MESI protocol solves this problem by introducing the E (Exclusive) state to distinguish between a cache line stored in multiple caches and a line stored in only one cache. This protocol, however, can still be improved further.

One remaining issue with MESI protocol is that in multi-processor configuration, when a processor requests a cache line that is stored in multiple cache, every cache might respond with the data. This behavior would waste bus bandwidth and incur high latency. As a solution to this problem, MESIF protocol was introduced. An additional state, Forward state, was added by slightly changing the role of the Shared state. The result of this improvement is that instead of all caches responding, only the cache line in the F state will respond to the request, while all the S state caches remain dormant. As a result, performance and power efficiency is increased.[23]

Another problem with MESI protocol is when a processor having invalid data in its cache wants to modify the data, it will have to wait for the other processor to write back to the memory, which is a very slow process. MOESI protocol solves this problem by allowing dirty sharing. This also saves bus bandwidth and increases performance.[24]


MSI, MESI, MESIF, and MOESI protocols on real architectures

MSI Protocol

Figure 1: MSI State Diagram

MESI Protocol

Figure 2: MESI State Diagram

Five State Protocols

MOESI

Figure 3: MOESI State Diagram

MESIF

Figure 4: Reduced Traffic with MESIF Protocol [1]

Protocol Performance

Protocol performance is strongly linked to the program that is running on the parallel system. This complicates protocol performance evaluation and prevents the designer from definitively choosing the best protocol. This binding of program and protocol makes "Making design decisions in real systems ... part art and part science. The art is the past experience, intuition, and aesthetics of the designers, and the science is workload-driven evaluation." [4]

While making definitive design decisions for machine designed to run a variety of programs is challenging, the performance of various cache parameters for specific programs can be simulated allowing for the prediction of general trends. Figure 5 shows the performance of the MESI protocol (III) and the standard MSI 3 state protocol (3st). The MSI protocol when BusRdX instead of BusUpgr is used for S -> M transitions is also shown (3st-RdX)

Figure 5: Traffic for various protocols [2]
Figure 6: Type of Miss [3]

References

  1. The research of the inclusive cache used in multi-core processor
  2. Parallel Computer Architecture: A Hardware/Software Approach