Introduction to bus-based cache coherence in real machines
Most parallel software in the commercial market relies on the shared-memory programming model, in which all processors access the same physical address space. The most common multiprocessors today use the SMP architecture, which uses a common bus as the interconnect. In the case of multicore processors ("chip multiprocessors," or CMPs), the SMP architecture applies to the cores, treating them as separate processors. The key problem of shared-memory multiprocessors is providing a consistent view of memory across the various cache hierarchies. This is called the cache coherence problem, and solving it is critical to achieving a correct and performance-sensitive design for the shared-memory model. The cache coherence mechanisms not only govern communication in a shared-memory multiprocessor, but also typically determine how the memory system transfers data between processors, caches, and memory.
Summary of Cache Coherence Protocols
At any point in logical time, the permissions for a cache block can allow either a single writer or multiple readers. The coherence protocol ensures that the invariants of these states are maintained. The coherence states used by most cache coherence protocols, and the invariant each one guarantees, are shown in Table 1; a small sketch of the underlying invariant follows the table:
States | Access Type | Invariant |
Modified | read, write | all other caches in I state |
Exclusive | read | all other caches in I state |
Owned | read | all other caches in I or S state |
Shared | read | no other cache in M or E state |
Invalid | - | - |
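
This single-writer-or-multiple-readers rule can be made concrete with a small check over the per-cache states of one block. The sketch below is illustrative only: the state letters follow Table 1, while the function name and the list-of-states representation are assumptions made for the example.

    # Check the coherence invariant for one block: M or E in one cache means
    # every other cache must be in I; an O copy may coexist only with S or I.
    def coherent(states):
        """states: the block's state ('M', 'O', 'E', 'S' or 'I') in each cache."""
        for i, st in enumerate(states):
            others = states[:i] + states[i + 1:]
            if st in ("M", "E") and any(o != "I" for o in others):
                return False          # single-writer rule violated
            if st == "O" and any(o not in ("I", "S") for o in others):
                return False          # an owner may coexist only with S/I copies
        return True

    print(coherent(["M", "I", "I"]))  # True: one writer, all other caches invalid
    print(coherent(["M", "S", "I"]))  # False: a stale shared copy exists
    print(coherent(["O", "S", "S"]))  # True: dirty sharing under a single owner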
The first widely adopted approach to cache coherence uses snooping on a bus. Snooping entails each processor seeing, or "snooping," all bus transactions. Common bus transactions are:
- BusRd The cache controller asks for a copy of the data it does not intend to modify
- Bus Read-Exclusive (BusRdX) The cache controller asks for a copy of the data it intends to modify (all other caches need to be invalidated)
- Writeback (BusWB) Main memory updates its contents
- Bus Upgrade (BusUpgr) The cache controller already holds the data and asks for exclusive ownership, invalidating all other copies; the bus does not return data to the requestor
MSI
The MSI protocol, one of the earliest snooping-based cache coherence protocols, is a three-state write-back invalidation protocol. It marks each cache line with one of three possible states: Modified (M), Shared (S), and Invalid (I). If a cache line is dirty and the processor has exclusive ownership of it, it is in the Modified state. If the cache line is clean and potentially shared by other processors, it is marked Shared. Invalid means the cache line is either not present or holds stale data. A BusRdX causes all other processors to invalidate (demote) their copy of the cache line to the I state. If another cache holds the line in the M state, it must flush (write back) the line.
The state transition diagram for the MSI protocol (Figure 3: MSI State Diagram) illustrates how the protocol works.
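
To make the transitions concrete, here is a minimal sketch of an MSI controller for a single cache line, using the bus transactions listed earlier (BusUpgr is omitted; S -> M is modeled with BusRdX, as in the basic protocol). It is an illustrative model under those assumptions, not any vendor's implementation, and the function names are invented for the example.

    M, S, I = "Modified", "Shared", "Invalid"

    def processor_event(state, op):
        """Return (new_state, bus_transaction) for a local read or write."""
        if op == "PrRd":
            if state == I:
                return S, "BusRd"      # fetch a copy we do not intend to modify
            return state, None          # M and S satisfy reads locally
        if op == "PrWr":
            if state in (I, S):
                return M, "BusRdX"     # gain exclusive ownership, invalidate others
            return M, None              # already Modified
        raise ValueError(op)

    def snoop_event(state, bus_tx):
        """Return (new_state, flush) when another cache's transaction is snooped."""
        if bus_tx == "BusRd":
            if state == M:
                return S, True          # flush dirty data, downgrade to Shared
            return state, False
        if bus_tx == "BusRdX":
            return I, state == M        # invalidate; flush first if Modified
        return state, False

    # Example: cache 0 writes a line that cache 1 later reads.
    s0, tx = processor_event(I, "PrWr")   # cache 0: I -> M via BusRdX
    s1, _ = snoop_event(I, tx)            # cache 1 stays Invalid
    s1, tx = processor_event(s1, "PrRd")  # cache 1: I -> S via BusRd
    s0, flushed = snoop_event(s0, tx)     # cache 0: M -> S, flushes dirty data
    print(s0, s1, flushed)                # Shared Shared True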
MSI Implementation Complexities
MSI is one of the most basic cache coherence protocols, and also among the easiest to implement. However, the MSI protocol requires that all state changes occur atomically to ensure cache coherence, and is therefore not recommended for real-world, complex systems.[3] Adding an exclusive, owner, or forwarding state to the architecture increases the cache coherence design complexity, but greatly improves performance. For this reason, most modern multiprocessors use a variant of MSI (MESI, MOSI, or MOESI). [4]
MESI
The MSI protocol has a major drawback: each read-then-write sequence incurs two bus transactions, irrespective of whether the cache line is stored in only one cache or not. This is a huge setback for highly parallel programs that have little data sharing. The MESI protocol solves this problem by introducing the E (Exclusive) state to distinguish between a cache line stored in multiple caches and a line stored in a single cache. Let us briefly look at how the MESI protocol works (for a more detailed treatment, readers are referred to the Solihin textbook, p. 215).
MESI coherence protocol marks each cache line as being in one of the Modified, Exclusive, Shared or Invalid states.
- Invalid : The cache line is either not present or is invalid
- Exclusive : The cache line is clean and is owned by this core/processor only
- Modified : The cache line is dirty and this core/processor has exclusive ownership of it; the copy in main memory is stale as well
- Shared : The cache line is clean and is shared by more than one core/processor
The MESI protocol works as follows: a line that has just been fetched receives the E state if no other cache in the system holds it, and the S state otherwise. A cache line gets the M state when a processor writes to it; if the line is not in the E or M state prior to being written, the cache sends a Bus Upgrade (BusUpgr) signal or, as the Intel manuals call it, a "Read-For-Ownership (RFO) request," which ensures that the line is present in this cache and in the I state in all other caches on the bus (if any). The table below summarizes the MESI protocol; a short code sketch of these transitions follows the table.
Cache Line State: | Modified | Exclusive | Shared | Invalid |
This cache line is valid? | Yes | Yes | Yes | No |
The memory copy is… | out of date | valid | valid | - |
Copies exist in caches of other processors? | No | No | Maybe | Maybe |
A write to this line | does not go to bus | does not go to bus | goes to bus and updates cache | goes directly to bus |
Table 2: MESI Protocol Summary
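
As a companion to the table, here is a sketch of the processor-side MESI transitions in the same illustrative style as the MSI sketch above. The others_have_copy flag stands in for the COPIES-EXIST bus line discussed in the next subsection; all names are invented for the example.

    M, E, S, I = "Modified", "Exclusive", "Shared", "Invalid"

    def mesi_processor_event(state, op, others_have_copy=False):
        """Return (new_state, bus_transaction) for a local read or write."""
        if op == "PrRd":
            if state == I:
                # A fresh fetch is Exclusive only if no other cache has the line.
                return (S, "BusRd") if others_have_copy else (E, "BusRd")
            return state, None
        if op == "PrWr":
            if state == E:
                return M, None        # silent upgrade: no bus transaction needed
            if state == S:
                return M, "BusUpgr"   # invalidate other copies (Intel: RFO)
            if state == I:
                return M, "BusRdX"
            return M, None            # already Modified
        raise ValueError(op)

    # The E state pays off for unshared data: read-then-write costs one bus
    # transaction (BusRd) instead of two (BusRd + BusUpgr) under MSI.
    state, tx1 = mesi_processor_event(I, "PrRd", others_have_copy=False)  # -> E
    state, tx2 = mesi_processor_event(state, "PrWr")                      # -> M
    print(state, tx1, tx2)  # Modified BusRd None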
MESI Implementation Complexities
In 1984, researchers at the University of Illinois developed and published a new, improved cache coherence protocol, which added an "exclusive" state to the MSI protocol. The new state improves performance by reducing bus traffic when data is being used by a single processor. For more information on MESI, the original article is available here: A low-overhead coherence solution for multiprocessors with private cache memories
The improved performance comes at the cost of greater implementation complexity. The MESI protocol requires an additional COPIES-EXIST bus line, which is not necessary when using MSI. Transferring data between caches is also more complicated, especially the FlushOpt (a snooped request used to indicate that a cache block has been posted on the bus for another processor). In the worst case, several cache controllers may attempt to flush the same cache block to the bus, only to realize that another controller has already completed the FlushOpt. They will have wasted time and power by reading the cache, trying to access the bus, and then canceling the operation. For more details, see pages 215-220 of the Solihin textbook.
MOESI
The AMD Opteron was AMD’s first-generation dual-core processor, with two distinct K8 cores together on a single die. Cache coherence poses bigger problems on such multicore processors, and it was necessary to use an appropriate coherence protocol to address them. The Intel Xeon, the competitive counterpart from Intel to the AMD dual-core Opteron, used the MESI protocol to handle cache coherence. MESI came with the drawback of using too much time and bandwidth in certain situations.

MOESI was AMD’s answer to this problem. MOESI added a fifth state to the MESI protocol, called the Owned state. It addresses the bandwidth problem MESI faces when a processor whose cached copy is invalid wants to access data that another processor has modified: under MESI, the requester has to wait for the modifying processor to write the data back to main memory, which takes time and bandwidth. MOESI removes this drawback by allowing dirty sharing. When the data is held by a processor in the new Owned state, it can provide other processors the modified data without writing it to main memory; this is called dirty sharing. The processor with the data in the Owned state remains responsible for updating main memory later, when the cache line is evicted. (A code sketch of dirty sharing appears after the state list below.)

MOESI has become one of the most popular snoop-based protocols supported in the AMD64 architecture. Using this protocol, the AMD dual-core Opteron can maintain cache coherence in systems of up to 8 processors.
The five different states of MOESI protocol are:
- Modified (M): The most recent copy of the data is present in the cache line. But it is not present in any other processor cache.
- Owned (O): The cache line holds the most recent, correct copy of the data, which can be shared by other processors. The processor holding the line in this state is responsible for updating the correct value in main memory before the line is evicted.
- Exclusive (E): A cache line holds the most recent, correct copy of the data, which is exclusively present on this processor and a copy is present in the main memory.
- Shared (S): A cache line in the shared state holds the most recent, correct copy of the data, which may be shared by other processors.
- Invalid (I): A cache line does not hold a valid copy of the data.
A detailed explanation of this protocol’s implementation on AMD processors can be found in the manual Architecture of AMD 64-bit core.
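
Here is a minimal sketch of the dirty sharing described above, in the same illustrative style as the earlier sketches; the function names and the eviction helper are invented for the example.

    M, O, E, S, I = "Modified", "Owned", "Exclusive", "Shared", "Invalid"

    def moesi_snoop_busrd(state):
        """Return (new_state, supplier) when another cache's BusRd is snooped."""
        if state == M:
            return O, "cache"   # dirty sharing: supply data, keep write-back duty
        if state == O:
            return O, "cache"   # the owner keeps supplying the dirty line
        if state == E:
            return S, "cache"   # a clean copy may also be supplied cache-to-cache
        return state, None       # S and I do not respond

    def must_write_back_on_evict(state):
        """Only M and O lines owe main memory an update."""
        return state in (M, O)

    # A producer writes (M); a consumer then reads twice. No write-back to
    # main memory happens until the owner finally evicts the line.
    owner, _ = moesi_snoop_busrd(M)       # M -> O, data supplied cache-to-cache
    owner, _ = moesi_snoop_busrd(owner)   # stays O, still supplying data
    print(owner, must_write_back_on_evict(owner))  # Owned True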
The following table summarizes the MOESI protocol:
Cache Line State: | Modified | Owned | Exclusive | Shared | Invalid
This cache line is valid? | Yes | Yes | Yes | Yes | No
The memory copy is… | out of date | out of date | valid | valid | -
Copies exist in caches of other processors? | No | Maybe | No | Maybe | -
A write to this line | does not go to bus | goes to bus and invalidates other copies | does not go to bus | goes to bus and updates cache | -

Table 3: MOESI Protocol Summary
State transitions for MOESI are shown in the MOESI state diagram (Figure 3: MOESI State Diagram).
MOESI Implementation Complexities
Allowing dirty sharing by implementing an Owned state reduces bus traffic between the individual caches and main memory, but comes at the cost of higher complexity. MOESI has two more states than MSI, so the underlying state transition logic is more complex. In addition, MOESI has six bus-side request types, compared with three for MSI (or four, if BusUpgr is used). For more information, see pp. 222-228 in the Solihin text.
One complexity problem that applies to a number of these protocols concerns invalidation. In newer protocols, individual words may be modified in a cache line, as opposed to the entire line. The other processors will thus have a mostly correct cache line, with only a word's difference. This adds potential complexity, because to 'correct' the error the other processors must transition from Shared to Invalid, then read the correct word from the bus and place it back into the line, transitioning back to Shared. This transition is potentially unnecessary: the second processor may never access the specific word that changed, yet it may access other words in the cache line. One potential solution being researched at present advances the protocols so that a cache line is "not invalidated on the first dirty word, but after the number of dirty words crosses some predetermined value, which is data type and application dependent." In other words, if the application can tolerate multiple 'incorrect' words, several transitions to and from the Invalid state may be avoided. Source: http://tab.computer.org/tcca/NEWS/sept96/dsmideas.ps
In essence, the solution proposed there is to extend the MOESI protocol with word invalidation and specific treatment of temporal and spatial data, so that the whole block is not invalidated.
MESIF
Intel's solution to the redundant messages sent under the MESI protocol was the MESIF protocol, incorporated in recent Intel multi-core processors such as the Core i7 and accommodating the point-to-point links used in the QuickPath Interconnect. Using the MESI protocol in this architecture would send many redundant messages between different processors, often with unnecessarily high latency. For example, when a processor requests a cache line that is stored in multiple locations, every location might respond with the data. As the requesting processor only needs a single copy of the data, the system would be wasting bandwidth.

As a solution to this problem, an additional state, the Forward (F) state, was added by slightly changing the role of the Shared state. Whenever there is a read request, only the cache holding the line in the F state responds, while all the S-state caches remain dormant. Hence, by designating a single cache line to respond to requests, coherency traffic is substantially reduced when multiple copies of the data exist. This advantage is diagrammatically depicted in Figure 5. On a read request, the F state also migrates: when a cache line in the F state is copied, the F state moves to the newer copy, while the older one drops back to the S state. Moving the new copy to the F state exploits both temporal and spatial locality. Because the newest copy of the cache line is always in the F state, it is very unlikely that the line in the F state will be evicted from the caches; this takes advantage of the temporal locality of the request. The second advantage is that if a particular cache line is in high demand due to spatial locality, the bandwidth used to transmit that data is spread across several cores. All M to S and E to S state transitions will now be from M to F and E to F.

The F state is different from the Owned state of the MOESI protocol in that it is not a unique copy, because a valid copy is stored in memory. Thus, unlike the Owned state of MOESI, in which the owner's data is the only valid copy, the data in the F state can be evicted or converted to the S state, if desired.
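
The single-responder rule can be sketched as follows. This is an illustrative model only: the list-of-states representation, the function name, and the fallback behavior when no F copy exists are assumptions made for the example.

    M, E, S, I, F = "Modified", "Exclusive", "Shared", "Invalid", "Forward"

    def mesif_read_request(states, requester):
        """Serve a read request; states holds the line's state in each cache."""
        responders = [i for i, st in enumerate(states) if st == F]
        assert len(responders) <= 1, "at most one cache may hold the line in F"
        if responders:
            old_f = responders[0]
            states[old_f] = S        # the older copy drops back to Shared
            states[requester] = F    # the newest copy becomes the responder
            return old_f             # exactly one cache supplied the data
        return None                  # no F copy: memory (or an M/E owner) responds

    caches = [F, S, S, I]            # three sharers; cache 3 has no copy
    supplier = mesif_read_request(caches, requester=3)
    print(supplier, caches)          # 0 ['Shared', 'Shared', 'Shared', 'Forward']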
MESIF Implementation Complexities
Intel has used the MESIF protocol in its recent Nehalem architecture, in an attempt to reduce the amount of snooping traffic between caches. It accomplishes this by using a large L3 cache, shared by all the cores, to keep track of which cores are using data. If data in the L3 cache has been modified, each core using that data will need to update its L1/L2 caches with the new value.
MESIF uses the new "Forward" state to designate a single cache which will respond to all read requests for the shared data. This protocol is very similar in complexity to the MESI protocol, and actually reduces the amount of communication between cores. For more details, see Cache Organization and Memory Management of the Intel Nehalem Computer Architecture.
Protocol Performance
Protocol performance is strongly linked to the program that is running on the parallel system. This complicates protocol performance evaluation and prevents the designer from definitively choosing the best protocol. This binding of program and protocol makes design decisions in real systems "... part art and part science. The art is the past experience, intuition, and aesthetics of the designers, and the science is workload-driven evaluation." [5]
While making definitive design decisions for a machine designed to run a variety of programs is challenging, the performance of various cache parameters for specific programs can be simulated, allowing for the prediction of general trends. Figure 5 shows the traffic of the MESI protocol (III) and the standard MSI 3-state protocol (3st). The MSI protocol with BusRdX used instead of BusUpgr for S -> M transitions is also shown (3st-RdX).
Simulation yields somewhat unexpected results, with the MESI protocol showing negligible traffic savings over the MSI protocol with BusUpgr. The addition of the Exclusive state is expected to reduce traffic; however, simulation reveals that only a small fraction of transitions are E->M, and thus there is little performance gain from adding the E state. This suggests that a more complex protocol may not lead to substantial savings, because of the link between program structure and protocol. While additional complexity doesn’t always provide savings, it is evident that the added complexity of the BusUpgr command (2nd bar) reduces traffic compared to the BusRdX command (3rd bar). This again reaffirms that design decisions are part art, part science.
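
The flavor of such a comparison can be reproduced in miniature. The sketch below is a toy model, not the simulator behind Figure 5: it counts bus traffic in bytes for a single line that one processor writes repeatedly while another processor's reads keep downgrading it to S, with the S -> M transition handled either by BusUpgr or by BusRdX. The per-transaction byte costs are assumptions made for the example.

    ADDR_BYTES, LINE_BYTES = 6, 64   # assumed command/address and data costs

    def msi_write_traffic(n_writes, use_busupgr):
        """Bus bytes for n_writes; a snooped read downgrades M -> S in between."""
        state, traffic = "I", 0
        for _ in range(n_writes):
            if state == "I":
                traffic += ADDR_BYTES + LINE_BYTES      # BusRdX fetches the line
            elif state == "S":
                if use_busupgr:
                    traffic += ADDR_BYTES               # BusUpgr carries no data
                else:
                    traffic += ADDR_BYTES + LINE_BYTES  # BusRdX refetches the line
            state = "S"  # the write made us M; the reader's BusRd snoops us to S
        return traffic

    print(msi_write_traffic(1000, use_busupgr=True))   # 6064 bytes
    print(msi_write_traffic(1000, use_busupgr=False))  # 70000 bytes

In this toy trace the upgrade-based variant moves an order of magnitude less data, the same direction of effect as the 3st versus 3st-RdX bars; the gap between MESI and MSI, by contrast, depends on how often E->M transitions actually occur.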
Protocol Implementation on Real Machines
Intel
MESI & Intel Processors
The Pentium Pro microprocessor, introduced in 1995, was the first Intel architecture microprocessor to support symmetric multiprocessing (SMP) in various multiprocessor configurations. SMP with the MESI protocol was the architecture used consistently until the introduction of the 45-nm Hi-k Core micro-architecture in Intel's (Nehalem-EP) quad-core x86-64. The 45-nm Hi-k Intel Core microarchitecture utilizes a new framework called the QuickPath Interconnect, which uses point-to-point interconnection technology based on a distributed shared memory architecture. It uses a modified version of the MESI protocol, called MESIF, which introduces an additional state: F, the Forward state.
The Intel architecture uses the MESI protocol as the basis for ensuring cache coherence; this is true whether the system is built on one of the older processors that communicate over a common bus or on the new Intel QuickPath point-to-point interconnection technology.
CMP Implementation in Intel Architecture
Let us now see how the Intel architecture, using the MESI protocol, progressed from a uniprocessor architecture to a chip multiprocessor (CMP) using the bus as the interconnect.
Uniprocessor Architecture
The diagram below shows the structure of the memory cluster in the Intel Pentium M processor.
In this structure we have,
- A unified on-chip L1 cache with the processor/core,
- A Memory/L2 access control unit, through which all the accesses to the L2 cache, main memory and IO space are made,
- The second level L2 cache along with the prefetch unit and
- Front side bus (FSB), a single shared bi-directional bus through which all the traffic is sent across.These wide buses bring in multiple data bytes at a time.
As Intel explains it, using this structure, processor requests were first sought in the L2 cache and, only on a miss, forwarded to main memory via the front side bus (FSB). The Memory/L2 access control unit serves as a central point for maintaining coherence within the core and with the external world. It contains a snoop control unit that receives snoop requests from the bus and performs the required operations on each cache (and internal buffers) in parallel. It also handles RFO requests (BusUpgr) and allows the operation to continue only after it guarantees that no other copy of the cache line exists in any other cache in the system.
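As a rough illustration of this RFO guarantee, the following C sketch invalidates every other copy of the line before letting the write proceed. It is a toy model, not Intel's hardware, which probes the caches and internal buffers in parallel:
<pre>
#include <stdio.h>

typedef enum { I, S, E, M } State;

#define NCACHES 2
static State line[NCACHES];   /* one line's state in each cache */

/* Read-for-ownership (BusUpgr): invalidate all other copies, then
 * grant the writer exclusive (M) access. */
static void rfo(int writer) {
    for (int c = 0; c < NCACHES; c++) {
        if (c != writer && line[c] != I) {
            /* an M copy would first be written back to memory */
            line[c] = I;
            printf("snoop: invalidated copy in cache %d\n", c);
        }
    }
    /* Only now is it guaranteed that no other copy exists, so the
     * write may proceed. */
    line[writer] = M;
}

int main(void) {
    line[0] = S; line[1] = S;   /* both caches share the line */
    rfo(0);                     /* cache 0 wants to write      */
    printf("cache 0: M, cache 1: %s\n", line[1] == I ? "I" : "?");
    return 0;
}
</pre>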
=====CMP Architecture=====
For its CMP implementation, Intel chose a bus-based architecture with snoopy protocols over a directory protocol: although a directory protocol reduces active power due to reduced snoop activity, it increases design complexity and static power due to larger tag arrays. Since Intel has a large market for processors in the mobility family, a directory-based solution was less favorable, as battery life depends mainly on static power consumption and less on dynamic power. Let us examine how CMP was implemented in the Intel Core Duo, one of the first dual-core processors for the budget/entry-level market. The general CMP implementation structure of the Intel Core Duo is shown below.
This structure has the following changes when compared to the uniprocessor memory cluster structure:
* The L1 cache and processor/core structure is duplicated to give two cores.
* The Memory/L2 access control unit is split into two logical units: an L2 controller and a bus controller. The L2 controller handles all requests to the L2 cache from the cores as well as the snoop requests from the FSB. The bus controller handles data and I/O requests to and from the FSB.
* The prefetching unit is extended to handle the hardware prefetches for each core separately.
* A new logical unit (represented by the hexagon) was added to maintain fairness between the requests coming from the different cores and hence balance the requests to L2 and memory; a sketch of one possible arbitration policy follows below.
This new partitioned structure for the Memory/L2 access control unit enhanced performance while reducing power consumption. For more information on the uniprocessor and multiprocessor implementations under the Intel architecture, refer to CMP Implementation in Intel Core Duo Processors.
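The arbitration policy of the fairness unit is not spelled out in the sources above; as a sketch, a simple round-robin arbiter like the following C fragment would achieve the stated goal of balancing L2/memory requests between the two cores. The policy and names here are assumptions for illustration, not the documented Core Duo mechanism.
<pre>
#include <stdio.h>
#include <stdbool.h>

#define NCORES 2
static int last_granted = NCORES - 1;

/* requests[c] is true if core c has a pending L2/memory request.
 * Returns the core granted this cycle, or -1 if none.  Starting the
 * search just after the previous winner guarantees that a core with
 * a pending request waits at most NCORES - 1 grants. */
static int arbitrate(const bool requests[NCORES]) {
    for (int i = 1; i <= NCORES; i++) {
        int c = (last_granted + i) % NCORES;
        if (requests[c]) {
            last_granted = c;
            return c;
        }
    }
    return -1;
}

int main(void) {
    bool req[NCORES] = { true, true };
    printf("grant: core %d\n", arbitrate(req)); /* core 0          */
    printf("grant: core %d\n", arbitrate(req)); /* core 1: fairness */
    return 0;
}
</pre>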
The Intel bus architecture has been evolving to accommodate the demands of scalability while using the same MESI protocol: from a single shared bus, to dual independent buses (DIB) that double the available bandwidth, to the logical conclusion of DIB with the introduction of dedicated high-speed interconnects (DHSI). DHSI-based platforms use four FSBs, one for each processor in the platform. In both DIB and DHSI, a snoop filter in the chipset caches snoop information, significantly reducing the broadcasting needed for snoop traffic on the buses. With the production of processors based on the next-generation 45-nm Hi-k Intel Core microarchitecture, the Intel Xeon processor fabric transitioned from DHSI, with the memory controller in the chipset, to a distributed shared memory architecture using the Intel QuickPath Interconnect and the MESIF protocol.
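To illustrate what caching snoop information in the chipset buys, here is a toy snoop filter in C. The sizes, the direct-mapped organization, and the presence-vector layout are assumptions for illustration; the real DIB/DHSI snoop filters are more elaborate tag arrays.
<pre>
#include <stdbool.h>
#include <stdio.h>

#define NBUSES   4     /* e.g., one FSB per processor under DHSI */
#define NENTRIES 1024

typedef struct {
    unsigned long tag;
    bool present[NBUSES];   /* which buses may cache this line */
    bool valid;
} FilterEntry;

static FilterEntry filter[NENTRIES];

/* Forward a snoop for `addr` only to buses that may hold a copy,
 * instead of broadcasting to all of them. */
static void snoop(unsigned long addr, int requesting_bus) {
    FilterEntry *e = &filter[addr % NENTRIES];
    bool hit = e->valid && e->tag == addr;

    for (int b = 0; b < NBUSES; b++) {
        if (b == requesting_bus) continue;
        /* On a filter miss we must snoop conservatively everywhere;
         * on a hit, only the buses marked present. */
        if (!hit || e->present[b])
            printf("snoop bus %d for line %#lx\n", b, addr);
    }

    /* Record that the requester now holds a copy (a miss evicts the
     * previous occupant of this toy direct-mapped entry). */
    if (!hit) {
        e->tag = addr; e->valid = true;
        for (int b = 0; b < NBUSES; b++) e->present[b] = false;
    }
    e->present[requesting_bus] = true;
}
</pre>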
===AMD - Advanced Micro Devices Processors===
====MOESI & AMD Processors====
====AMD Opteron Memory Architecture====
The AMD processor's high-performance cache architecture includes an integrated, 64-bit, dual-ported 128-Kbyte split L1 cache with a separate snoop port, multi-level translation lookaside buffers (TLBs), a scalable L2 cache controller with a 72-bit (64-bit data + 8-bit ECC) interface to as much as 8 Mbytes of industry-standard SDR or DDR SRAMs, and an integrated tag for the most cost-effective 512-Kbyte L2 configurations. The AMD Athlon processor's integrated L1 cache comprises two separate 64-Kbyte, two-way set-associative data and instruction caches.
More information about this can be found in the AMD64 Architecture Programmer's Manual.
====Special Coherence Considerations in AMD64 Architectures====
Instruction prefetching is a technique used to speed up program execution, but in multiprocessors prefetching can come at the cost of correctness: data can be modified in such a way that the memory coherence protocol cannot handle the effects. In such situations, software must use serializing instructions or cache-invalidation instructions to guarantee that subsequent data accesses are coherent.
An example of this type of situation is a page-table update followed by accesses to the physical pages referenced by the updated page tables. The physical-memory references for the page tables are different from the physical-memory references for the data, so prefetching may cause correctness problems. The following sequence of events shows such a situation, when software changes the translation of virtual-page A from physical-page M to physical-page N:
# The tables that translate virtual-page A to physical-page M are held only in main memory; the copies in the cache are invalidated.
# Software changes the page-table entry for virtual-page A in main memory to point to physical-page N rather than physical-page M.
# Data in virtual-page A is accessed.
Software expects the processor to access the data from physical-page N after the update. However, it is possible for the processor to prefetch the data from physical-page M before the page table for virtual-page A is updated. Because the physical-memory references are different, the processor does not recognize them as requiring coherence checking and believes it is safe to prefetch the data from virtual-page A, which is translated into a read from physical-page M. Similar behavior can occur when instructions are prefetched from beyond the page-table update instruction.
To prevent such errors from occurring, the architecture provides instructions such as INVLPG and MOV CR3, which software executes immediately after the page-table update to ensure that subsequent instruction fetches and data accesses use the correct virtual-page-to-physical-page translation. It is not necessary to perform a TLB invalidation operation before the table update.
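A minimal sketch of the safe sequence, in kernel-style C for x86-64 with GCC inline assembly; `pte_for` is a hypothetical helper standing in for the page-table walk, not a real API:
<pre>
#include <stdint.h>

/* Hypothetical helper: returns a pointer to the page-table entry
 * that maps virtual page `va`. */
extern volatile uint64_t *pte_for(void *va);

void remap_page(void *va, uint64_t new_pte)
{
    /* 1. Update the translation in memory: virtual-page A now points
     *    to physical-page N. */
    *pte_for(va) = new_pte;

    /* 2. INVLPG immediately after the update: it flushes the stale
     *    TLB entry and serializes, so later instruction fetches and
     *    data accesses cannot use a copy prefetched through the old
     *    physical-page M.  No invalidation is needed before the
     *    update. */
    asm volatile("invlpg (%0)" : : "r"(va) : "memory");
}
</pre>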
More information can be found in the AMD64 Architecture Programmer's Manual.
====Optimization techniques on MOESI when implemented on AMD Phenom processors====
In real machines, applying optimization techniques on top of the standard cache coherence protocol improves performance. Consider the AMD Phenom family of microprocessors (Family 0x10), AMD's first generation to incorporate four distinct cores on a single die and the first with a cache that all the cores share; it uses the MOESI protocol with some optimization techniques incorporated.
One such optimization focuses on a small subset of compute problems that behave like producer-consumer programs: a thread running on one core produces data that is consumed by a thread running on a separate core. With such programs, it is desirable to have the two distinct cores communicate through the shared cache, avoiding round trips to and from main memory. However, the MOESI protocol that the AMD Phenom cache uses for coherence can limit bandwidth; keeping the cache line in the M state for such problems achieves better performance.
When the producer thread writes a new entry, it allocates cache lines in the M state. Eventually, these M-marked cache lines start to fill the L3 cache. When the consumer reads a cache line, the MOESI protocol changes its state to O in the L3 cache and pulls down a shared (S) copy for the consumer's own use. Later, the producer thread circles the ring buffer and arrives back at the same cache line it had previously written. When the producer attempts to write new data to the owned (O) cache line, however, it finds that it cannot: a line left in the O state by the previous consumer read does not carry sufficient permission for a write request in MOESI. To maintain coherence, the memory controller must initiate probes in the other caches (to handle any S copies that may exist), which slows down the process.
Thus, it is preferable to keep the cache line in the M state in the L3 cache. In that case, when the producer comes back around the ring buffer, it finds the previously written cache line still in the M state and can safely write to it without coherence concerns. Better performance can thus be achieved by applying such optimizations to standard protocols in real machines.
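The following C sketch shows the access pattern in question; the names, sizes, and ring-buffer layout are illustrative, not AMD's. The comments mark where MOESI state transitions generate or avoid probe traffic; the concrete techniques for keeping lines in the M state are described in the optimization guide cited below.
<pre>
#include <stdint.h>
#include <string.h>

#define SLOTS     64
#define LINE_SIZE 64   /* one slot per cache line, avoiding false sharing */

struct slot { char payload[LINE_SIZE]; };
static struct slot ring[SLOTS] __attribute__((aligned(LINE_SIZE)));

static unsigned head;  /* producer's next slot */

/* Producer: writes one slot per call, wrapping around the ring. */
void produce(const char *data, size_t n)
{
    struct slot *s = &ring[head];
    head = (head + 1) % SLOTS;

    /* First trip around the ring: this store allocates the line in M.
     * Later trips: if a consumer read left the line in O (with an S
     * copy in the consumer's cache), this store must wait for probes
     * that invalidate the S copies; if the line instead stayed in M
     * in the shared L3, the rewrite completes without probes. */
    memcpy(s->payload, data, n < LINE_SIZE ? n : LINE_SIZE);
}
</pre>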
More information on how this is implemented, and on various other optimizations, can be found in the Software Optimization Guide for AMD Family 10h Processors.
==References==
* The research of the inclusive cache used in multi-core processor
* Parallel Computer Architecture: A Hardware/Software Approach
* Cache coherence
* Introduction to QuickPath Interconnect
* CMP Implementation in Intel Core Duo Processors
* Common System Interface in Intel Processors
* AMD dual core Architecture
* AMD64 Architecture Programmer's Manual
* Software Optimization Guide for AMD Family 10h Processors
* Architecture of AMD 64 bit core
* Three state invalidation protocols
* Synapse tightly coupled multiprocessors: a new approach to solve old problems
* Coherence protocols: evaluation using a multiprocessor simulation model
* Dragon Protocol
* Xerox Dragon
* XDBus
* Power Efficient Cache Coherence
* The Common System Interface: Intel's Future Interconnect
* AMD64 Architecture Programmer's Manual Vol. 2: System Programming
* Hackenberg, D., Molka, D., and Nagel, W.E., "Comparing cache architectures and coherency protocols on x86-64 multicore SMP systems"