CSC/ECE 506 Spring 2010/8a sk: Difference between revisions
No edit summary |
|||
Line 20: | Line 20: | ||
The three states in the MSI protocol are used to directly enforce the cache coherence by allowing only a single processor in the modified state at a given time, allowing multiple processors in the shared state concurrently, and disallowing other processors in the shared state while a processor is in the modified state. In the MSI system, there is a serious drawback that an explicit BusRdX transaction is required for a read followed by a write, even if there are no other shared copies. When a processor reads in and modifies a data item, two bus transactions are generated in this protocol even in the absence of shared copies. The first is a BusRd that gets the cache block to S state, and the second is a BusRdX(or BusUpgr) to invalidate other cached copies. The BusRdX is useless in case there are no other shared copies. | The three states in the MSI protocol are used to directly enforce the cache coherence by allowing only a single processor in the modified state at a given time, allowing multiple processors in the shared state concurrently, and disallowing other processors in the shared state while a processor is in the modified state. In the MSI system, there is a serious drawback that an explicit BusRdX transaction is required for a read followed by a write, even if there are no other shared copies. When a processor reads in and modifies a data item, two bus transactions are generated in this protocol even in the absence of shared copies. The first is a BusRd that gets the cache block to S state, and the second is a BusRdX(or BusUpgr) to invalidate other cached copies. The BusRdX is useless in case there are no other shared copies. | ||
<br><br> | <br><br> | ||
== MSI in SGI IRIS 4D multiprocessor== | |||
SGI(Silicon Graphics, Inc) had used a protocol much like MSI in the IRIS 4D multiprocessor series. The bus used in this workstation is a pipelined, block transfer bus that supports the cache coherence protocol. The 4D-MP graphics superworkstation brought 40 MIPS(million instructions per second) of computing performance to a graphics super workstation. The cache coherence protocol had to support synchronization as well as sharing, so improvement in efficiency operations necessitated a compromise in the efficiency of the data sharing protocol. With the simple rules of MSI protocol enforced by hardware, efficient synchronization and efficient data sharing are achieved in a simple shared memory model of parallel processing in the 4D-MP graphics super workstation. So the simple MSI protocol is well-suited to these needs. | SGI(Silicon Graphics, Inc) had used a protocol much like MSI in the IRIS 4D multiprocessor series. The bus used in this workstation is a pipelined, block transfer bus that supports the cache coherence protocol. The 4D-MP graphics superworkstation brought 40 MIPS(million instructions per second) of computing performance to a graphics super workstation. The cache coherence protocol had to support synchronization as well as sharing, so improvement in efficiency operations necessitated a compromise in the efficiency of the data sharing protocol. With the simple rules of MSI protocol enforced by hardware, efficient synchronization and efficient data sharing are achieved in a simple shared memory model of parallel processing in the 4D-MP graphics super workstation. So the simple MSI protocol is well-suited to these needs. | ||
Line 45: | Line 46: | ||
[[File:MESI-state.jpg|550px]] | [[File:MESI-state.jpg|550px]] | ||
<br> | <br> | ||
== MESI in Intel's IA32 Pentium processors== | |||
MESI protocol is used in Intel's IA32 Pentium class processors. In Intel Pentium 4 IA-32 Architecture Processor, the data cache uses | MESI protocol is used in Intel's IA32 Pentium class processors. In Intel Pentium 4 IA-32 Architecture Processor, the data cache uses | ||
MESI protocol to support more efficient write-back cache in addition to the write-through cache previously used by the Intel486 processor. | MESI protocol to support more efficient write-back cache in addition to the write-through cache previously used by the Intel486 processor. | ||
Line 63: | Line 64: | ||
When a cache line in the F state is copied, the F state migrates to the newer copy, while the older one drops back to S. This has two advantages over pinning the F state to the original copy of the cache line. First, because the newest copy of the cache line is always in the F state, it is very unlikely that the line in the F state will be evicted from the caches. In essence, this takes advantage of the temporal locality of the request. The second advantage is that if a particular cache line is in high demand due to spatial locality, the bandwidth used to transmit that data will be spread across several nodes. | When a cache line in the F state is copied, the F state migrates to the newer copy, while the older one drops back to S. This has two advantages over pinning the F state to the original copy of the cache line. First, because the newest copy of the cache line is always in the F state, it is very unlikely that the line in the F state will be evicted from the caches. In essence, this takes advantage of the temporal locality of the request. The second advantage is that if a particular cache line is in high demand due to spatial locality, the bandwidth used to transmit that data will be spread across several nodes. | ||
<br><br> | <br><br> | ||
== MESIF in Intel Nehalem Computer== | |||
Intel Nehalem Computer uses the MESIF protocol. In the Nehalem architecture each core has its own L1 and L2 cache. Nehalem does has a shared cache, implemented as L3 cache. This cache is shared among all cores and is relatively large. This cache is inclusive, meaning that | Intel Nehalem Computer uses the MESIF protocol. In the Nehalem architecture each core has its own L1 and L2 cache. Nehalem does has a shared cache, implemented as L3 cache. This cache is shared among all cores and is relatively large. This cache is inclusive, meaning that | ||
it duplicates all data stored in each indivitual L1 and L2 cache. This duplication greatly adds to the inter-core communication efficiency because any given core does not have to locate data in another processor’s cache. If the requested data is not found in any level of the core’s cache, it knows the data is also not present in any other core’s cache. To ensure coherency across all caches, the L3 cache has | it duplicates all data stored in each indivitual L1 and L2 cache. This duplication greatly adds to the inter-core communication efficiency because any given core does not have to locate data in another processor’s cache. If the requested data is not found in any level of the core’s cache, it knows the data is also not present in any other core’s cache. To ensure coherency across all caches, the L3 cache has | ||
Line 97: | Line 99: | ||
Only blocks in state Dirty are written back when replaced. Any cache with a copy of a block in state Dirty is called the owner of that block. If no Dirty copy exists, main memory is the owner.<br> | Only blocks in state Dirty are written back when replaced. Any cache with a copy of a block in state Dirty is called the owner of that block. If no Dirty copy exists, main memory is the owner.<br> | ||
== Synapse N=1 multiprocessor== | |||
Synapse protocol was used in the Synapse N + 1, a multiprocessor for fault-tolerant transaction processing. The N + 1 differs from other shared bus designs considered in that it has two system buses. The added bandwidth of the extra bus allows the system to be expanded to a maximum of 28 processors. A single-bit tag is included with each cache block in main memory, indicating whether main memory is to respond to a miss on that block. If a cache has a modified copy of the block, the bit tells the memory that it need not respond. Thus if the bit is set, an access to the block fails until the data is written back to memory by the cache with the exclusive copy and the bit is reset. This prevents a possible race condition if a cache does not respond quickly enough to inhibit main memory from responding. | Synapse protocol was used in the Synapse N + 1, a multiprocessor for fault-tolerant transaction processing. The N + 1 differs from other shared bus designs considered in that it has two system buses. The added bandwidth of the extra bus allows the system to be expanded to a maximum of 28 processors. A single-bit tag is included with each cache block in main memory, indicating whether main memory is to respond to a miss on that block. If a cache has a modified copy of the block, the bit tells the memory that it need not respond. Thus if the bit is set, an access to the block fails until the data is written back to memory by the cache with the exclusive copy and the bit is reset. This prevents a possible race condition if a cache does not respond quickly enough to inhibit main memory from responding. | ||
Revision as of 03:10, 19 March 2012
Introduction
MSI
MSI protocol is a three-state write-back invalidation protocol which is one of the simplest and earliest-used snooping-based cache coherence-protocols. According to this protocol, a cache block can be in one of the Modified (M), Shared (S), and Invalid (I) states.
- Modified state indicates that the variable in the cache has been modified and therefore has a different value from that found in the main memory. A cache line in this state thus needs to be written back to the block to memory during eviction.
- Shared state indicates that the block exists in one or more caches, and is clean, implying that its value is consistent with the one in main memory. So a cache block in this state can be evicted without having to write it back to the main memory.
- Invalid state indicates that the cache block is invalid.
In MSI protocol, processor requests to the cache include:
- PrRd: Processor requests read to a cache block.
- PrWr: Processor requests write to a cache block.
In MSI protocol Bus-side requests include:
- BusRd: BusRd transaction is generated by a PrRd that misses in the cache, and the processor expects a data response as a result. The cache controller puts the address on the bus and asks for a copy that it does not intend to modify. The memory supplies the data.
- BusRdX: BusRdX transaction is generated by a PrWr to a block that is either not in the cache or is in the cache but is not in modified state. The cache controller puts the address on the bus and asks for an exclusive copy that it intends to modify. The memory system provides the data. All other caches are invalidated. Once the cache obtains the exclusive copy, the write can be performed in the cache.
- Flush: Flush is a snooped request that indicates that an entire cache block is written back to the main memory by another processor.
The MSI states under different conditions are shown below:
The three states in the MSI protocol are used to directly enforce the cache coherence by allowing only a single processor in the modified state at a given time, allowing multiple processors in the shared state concurrently, and disallowing other processors in the shared state while a processor is in the modified state. In the MSI system, there is a serious drawback that an explicit BusRdX transaction is required for a read followed by a write, even if there are no other shared copies. When a processor reads in and modifies a data item, two bus transactions are generated in this protocol even in the absence of shared copies. The first is a BusRd that gets the cache block to S state, and the second is a BusRdX(or BusUpgr) to invalidate other cached copies. The BusRdX is useless in case there are no other shared copies.
MSI in SGI IRIS 4D multiprocessor
SGI(Silicon Graphics, Inc) had used a protocol much like MSI in the IRIS 4D multiprocessor series. The bus used in this workstation is a pipelined, block transfer bus that supports the cache coherence protocol. The 4D-MP graphics superworkstation brought 40 MIPS(million instructions per second) of computing performance to a graphics super workstation. The cache coherence protocol had to support synchronization as well as sharing, so improvement in efficiency operations necessitated a compromise in the efficiency of the data sharing protocol. With the simple rules of MSI protocol enforced by hardware, efficient synchronization and efficient data sharing are achieved in a simple shared memory model of parallel processing in the 4D-MP graphics super workstation. So the simple MSI protocol is well-suited to these needs.
MESI
MESI protocol is a 4-state protocol, in which a cache block can have an 'Exclusive' state apart from the Modified, Shared and Invalid state as in MSI protocol.
- Modified : The cache block is possibly dirty,valid only in one cache, and the value may differ from the value in main memory so may need to be written back to main memory upon eviction.
- Exclusive : The cache block is clean, valid and exists only in one cache.
- Shared : The cache block exists in more than one cache and is clean.
- Invalid : The value of the cache block is invalid.
In the MESI protocol, the same as the MSI protocol, processor requests to the
cache include:
- PrRd: processor-side request to read to a cache block
- PrWr: processor-side request to write to a cache block
Bus-side request, include:
- BusRd: snooped request that indicates there is a read request to a cache block made by another processor.
- BusRdX: snooped request that indicates there is a read exclusive (write)request to a cache block made by another processor which does not already have the block.
- BusUpgr: snooped request that indicates that there is a write request to a cache block that another processor already has in its cache.
- Flush: snooped request that indicates that an entire cache block is written back to the main memory by another processor.
- FlushOpt: snooped request that indicates that an entire cache block is posted on the bus in order to supply it to another processor. We distinguish FlushOpt from Flush because while Flush is needed for write propagation, FlushOpt is not required for correctness. It is implemented as a performance enhancing feature that can be removed without impacting correctness. It is referred to as an optional block flush cache-to-cache transfer.
MESI states under different conditions is shown in the figure below:
MESI in Intel's IA32 Pentium processors
MESI protocol is used in Intel's IA32 Pentium class processors. In Intel Pentium 4 IA-32 Architecture Processor, the data cache uses
MESI protocol to support more efficient write-back cache in addition to the write-through cache previously used by the Intel486 processor.
According to Intel's processor manual, IA-32 processors (beginning with the Pentium processor) and Intel 64 processors use MESI (modified, exclusive, shared, invalid) cache protocol to maintain consistency with internal caches and caches in other processors.
The figure below represents the cache structure of Intel Pentium 4 processors:
In the L1 data cache and in the L2/L3 unified caches, the MESI (modified, exclusive, shared, invalid) cache protocol maintains consistency with caches of other processors. The L1 data cache and the L2/L3 unified caches have two MESI status flags per cache line. Each line can be marked as being in one of the states defined in the table shown below. In general, the operation of the MESI protocol is transparent to programs. The L1 instruction cache in P6 family processors implements only the “SI” part of the MESI protocol, because the instruction cache is not writable. The instruction cache monitors changes in the data cache to maintain consistency between the caches when instructions are modified.
During replacement of a cache block, some MESI implementations require a message to be sent to memory when a cache line is flushed - an E to I transition, as the line was exclusively in one cache before it was removed. In alternate implementation, this replacement message could be avoided if the system is designed so that the flush of a modified /exclusive line requires an acknowledgment from the memory. However, this requires the flush to be stored in a 'write-back' buffer until the reply arrives to ensure the change is successfully propagated to memory.
There is a bandwidth trade-off in both these implementations.
MESIF
The MESIF protocol is a cache coherency and memory coherence protocol developed by Intel for cache coherent non-uniform memory architectures.The protocol consists of five states, Modified (M), Exclusive (E), Shared (S), Invalid (I) and Forward (F).
The M, E, S and I states are the same as in the MESI protocol. When a processor requests a cache line that is stored in multiple locations, every location might respond with the data. However, the requesting processor only needs a single copy of the data, so the system is wasting a bit of bandwidth. Intel's solution to this issue is rather elegant. They adapted the standard MESI protocol to include an additional state, the Forwarding (F) state, and changed the role of the Shared (S) state. In the MESIF protocol, only a single instance of a cache line may be in the F state and that instance is the only one that may be duplicated. Other caches may hold the data, but it will be in the shared state and cannot be copied. In other words, the cache line in the F state is used to respond to any read requests, while the S state cache lines are now silent. This makes the line in the F state a first amongst equals, when responding to snoop requests. By designating a single cache line to respond to requests, coherency traffic is substantially reduced when multiple copies of the data exist.
When a cache line in the F state is copied, the F state migrates to the newer copy, while the older one drops back to S. This has two advantages over pinning the F state to the original copy of the cache line. First, because the newest copy of the cache line is always in the F state, it is very unlikely that the line in the F state will be evicted from the caches. In essence, this takes advantage of the temporal locality of the request. The second advantage is that if a particular cache line is in high demand due to spatial locality, the bandwidth used to transmit that data will be spread across several nodes.
MESIF in Intel Nehalem Computer
Intel Nehalem Computer uses the MESIF protocol. In the Nehalem architecture each core has its own L1 and L2 cache. Nehalem does has a shared cache, implemented as L3 cache. This cache is shared among all cores and is relatively large. This cache is inclusive, meaning that
it duplicates all data stored in each indivitual L1 and L2 cache. This duplication greatly adds to the inter-core communication efficiency because any given core does not have to locate data in another processor’s cache. If the requested data is not found in any level of the core’s cache, it knows the data is also not present in any other core’s cache. To ensure coherency across all caches, the L3 cache has
additional flags that keep track of which core the data came from. If the data is modified in L3 cache, then the L3 cache knows if the data came from a different core than last time,and that the data in the first core needs its L1/L2 values updated with the new data. This greatly reduces the amount of traditional “snooping” coherency traffic between cores.
The cache organization of a 8-core Intel Nehalem Processor is shown below:
MOESI
The cache-coherency protocol supported by the AMD64 architecture is the MOESI (modified, owned, exclusive, shared, invalid) protocol. The states of the MOESI protocol are:
- Invalid: A cache line in the invalid state does not hold a valid copy of the data. Valid copies of the data can be either in main memory or another processor cache.
- Exclusive: A cache line in the exclusive state holds the most recent, correct copy of the data. The copy in main memory is also the most recent, correct copy of the data. No other processor holds a copy of the data.
- Shared: A cache line in the shared state holds the most recent, correct copy of the data. Other processors in the system may hold copies of the data in the shared state, as well. If no other processor holds it in the owned state, then the copy in main memory is also the most recent.
- Modified: A cache line in the modified state holds the most recent, correct copy of the data. The copy in main memory is stale (incorrect), and no other processor holds a copy.
- Owned: A cache line in the owned state holds the most recent, correct copy of the data. The owned state is similar to the shared state in that other processors can hold a copy of the most recent, correct data. Unlike the shared state, however, the copy in main memory can be stale (incorrect). Only one processor can hold the data in the owned state—all other processors must hold the data in the shared state.
The MOESI state transition diagram shows the general MOESI state transitions possible with various types of memory accesses. This is a logical software view, not a hardware view, of how cache-line state transitions. Instruction-execution activity and external-bus transactions can both be used to modify the cache MOESI state in multiprocessing or multi-mastering systems. To maintain memory coherency, external bus masters (typically other processors with their own internal caches) need to acquire the most recent copy of data before caching it internally. That copy can be in main memory or in the internal caches of other bus-mastering devices. When an external master has a cache read-miss or write-miss, it probes the other mastering devices to determine whether the most recent copy of data is held in any of their caches. If one of the other mastering devices holds the most recent copy, it provides it to the requesting device. Otherwise, the most recent copy is provided
by main memory.
There are two general types of bus-master probes:
- Read probes indicate the external master is requesting the data for read purposes.
- Write probes indicate the external master is requesting the data for the purpose of modifying it.
The state transitions involving probes are initiated by other processors and external bus masters into the processor. Some read probes are initiated by devices that intend to cache the data. Others, such as those initiated by I/O devices, do not intend to cache the data.
Some processor implementations do not change the data MOESI state if the read probe is initiated by a
device that does not intend to cache the data.
State transitions involving read misses and write misses can cause the processor to generate probes into external bus masters and to read main memory. Read hits do not cause a MOESI-state change. Write hits generally cause a MOESI-state change into the modified state. If the cache line is already in the modified state, a write hit does not change its state. The specific operation of external-bus signals and transactions and how they influence a cache MOESI state are implementation dependent. For example, an implementation could convert a write miss to a WB memory type into two separate MOESI-state changes. The first would be a read-miss placing the cache line in the exclusive state. This would be followed by a write hit into the exclusive cache line, changing the cache-line state to modified.
Synapse Protocol
In Synapse protocol, cache blocks are in one of the following states:
- Invalid: Indicates that the block is invalid
- Valid: Indicates that the cache block is unmodified and possibly shared
- Dirty: indicates that the cache block is modified and there are no other copies.
Only blocks in state Dirty are written back when replaced. Any cache with a copy of a block in state Dirty is called the owner of that block. If no Dirty copy exists, main memory is the owner.
Synapse N=1 multiprocessor
Synapse protocol was used in the Synapse N + 1, a multiprocessor for fault-tolerant transaction processing. The N + 1 differs from other shared bus designs considered in that it has two system buses. The added bandwidth of the extra bus allows the system to be expanded to a maximum of 28 processors. A single-bit tag is included with each cache block in main memory, indicating whether main memory is to respond to a miss on that block. If a cache has a modified copy of the block, the bit tells the memory that it need not respond. Thus if the bit is set, an access to the block fails until the data is written back to memory by the cache with the exclusive copy and the bit is reset. This prevents a possible race condition if a cache does not respond quickly enough to inhibit main memory from responding.
Summary
References
1.http://rolfed.com/nehalem/nehalemPaper.pdf
2.http://www.cs.uwaterloo.ca/~brecht/courses/856/Possible-Readings/multicore/cache-performance-x86-2009.pdf
3.http://www.realworldtech.com/page.cfm?ArticleID=rwt082807020032&p=5
4.http://download.intel.com/support/processors/pentium4/sb/25366521.pdf
5.http://www.cse.wustl.edu/~roger/569M.s09/24593.pdf