CSC/ECE 506 Spring 2013/6a cs: Difference between revisions
(59 intermediate revisions by 2 users not shown) | |||
Line 14: | Line 14: | ||
In multiprocessors with multiple levels of cache, certain cache levels may be shared by processors and others private to each processor. Typically, lower-level caches are private and high-level caches may or may not be shared. Notice that none of the examples in the below have a shared L1 cache. Cache management functions must consider both shared and private caches when reading and writing data from memory. | In multiprocessors with multiple levels of cache, certain cache levels may be shared by processors and others private to each processor. Typically, lower-level caches are private and high-level caches may or may not be shared. Notice that none of the examples in the below have a shared L1 cache. Cache management functions must consider both shared and private caches when reading and writing data from memory. | ||
The three-level cache was used again first with the introduction of multiple processor cores, where the L3 was added to the CPU die. It became common to have the three levels be larger in size than the next and today it is not uncommon to find Level 3 cache sizes of eight megabytes. This trend appears to continue for the foreseeable future.<sup><span id="19body">[[#19foot|[19]]]</span></sup> | |||
Below is a table that illustrates the variety with which these characteristics have been combined in processors from four manufacturers over the past 6 years. | |||
[[Image:Uniprocessor_memory_hierarchy.jpg|thumbnail|200px|Uniprocessor memory hierarchy<sup><span id="2body">[[#2foot|[2]]]</span></sup>]] | [[Image:Uniprocessor_memory_hierarchy.jpg|thumbnail|200px|Uniprocessor memory hierarchy<sup><span id="2body">[[#2foot|[2]]]</span></sup>]] | ||
[[Image:Dualcore_memory_hierarchy.jpg|thumbnail|200px|Dualcore memory hierarchy<sup><span id="3body">[[#3foot|[3]]]</span></sup>]] | [[Image:Dualcore_memory_hierarchy.jpg|thumbnail|200px|Dualcore memory hierarchy<sup><span id="3body">[[#3foot|[3]]]</span></sup>]] | ||
{| border='1' class="wikitable" style="text-align:center" | {| border='1' class="wikitable" style="text-align:center" | ||
|+style="white-space:nowrap"|Table 1: Cache on different Microprocessors | |+style="white-space:nowrap"|Table 1: Cache on different Microprocessors<sup><span id="13body">[[#13foot|[13]]]</span></sup><sup><span id="14body">[[#14foot|[14]]]</span></sup><sup><span id="15body">[[#15foot|[15]]]</span></sup><sup><span id="16body">[[#16foot|[16]]]</span></sup><sup><span id="17body">[[#17foot|[17]]]</span></sup><sup><span id="18body">[[#18foot|[18]]]</span></sup> | ||
|- | |- | ||
! Company & Processor !! # cores !! L1 cache Per core !! L1 cache Per core !! L3 cache !! Year | ! Company & Processor !! # cores !! L1 cache Per core !! L1 cache Per core !! L3 cache !! Year | ||
|- | |- | ||
| Intel Xeon Clovertown || 2 || I:4*32KB D:4*32KB || 2*4MB || - || Jan 2007 | | Intel Xeon Clovertown || 2 || I:4*32KB D:4*32KB || 2*4MB || - || Jan 2007 | ||
Line 52: | Line 55: | ||
| Azul Systems Vega3 7300 Series || 864 || 768GB || - || - || May 2008 | | Azul Systems Vega3 7300 Series || 864 || 768GB || - || - || May 2008 | ||
|- | |- | ||
| | | AMD Athlon II X2 || 2 || 128 KB x 2 || 512 KB x 2 1MB x 2 || - || 2009 | ||
|- | |||
| AMD Athlon II X3 || 3 || 128 KB x 3 || 512 KB x 3 || - || 2009 | |||
|- | |||
| AMD Athlon II X4 || 4 || 128 KB x 4 || 512 KB x 4 || - || 2009 | |||
|- | |||
| AMD Phenom II X2 || 2 || 128 KB x 2 || 512 KB x 2 || 6MB || 2009 | |||
|- | |||
| AMD Phenom II X3 || 3 || 128 KB x 3 || 512 KB x 3 || 6MB || 2009 | |||
|- | |||
| AMD Phenom II X4 || 4 || 128 KB x 4 || 512 KB x 4 || 4-6MB || 2009 | |||
|- | |||
| AMD Phenom II X6 || 6 || 128 KB x 6 || 512 KB x 4 || 6MB || 2010 | |||
|- | |||
| Intel Core i3 || 2 || 32+32 KB x 2 || 256 KB x 2 || 4MB || 2010 | |||
|- | |||
| Intel Core i3 || 2 || 32+32 KB x 2 || 256 KB x 2 || 4MB || 2010 | |||
|- | |||
| Intel Core i5 - 6 Series || 2 || 32+32 KB x 2 || 256 KB x 2 || 4MB || 2010 | |||
|- | |||
| Intel Core i5 - 7 Series || 4 || 32+32 KB x 4 || 256 KB x 4 || 8 MB || 2010 | |||
|- | |||
| Intel Core i5 - 2400 Series Core i5 - 2500 Series || 4 || 32+32 KB x 4 || 256 KB x 4 || 6 MB || 2010 | |||
|- | |||
| Intel Core i7 - 8 Series || 4 || 32+32 KB x 4 || 256 KB x 4 || 8 MB || 2010 | |||
|- | |||
| Intel Core i7 - 9 Series || 4 || 32+32 KB x 4 || 256 KB x 4 || 8 MB || 2010 | |||
|- | |||
| Intel Core i7 - 970 || 6 || 32+32 KB x 6 || 256 KB x 6 || 12 MB || 2010 | |||
|- | |||
| Sun SPARC T4 || 8 || 8 x 16 + 16 || 8 x 128 KB || 4 MB || 2011 | |||
|- | |||
| Sun SPARC64 IXfx || 16 || 32 KB x 16 || 12 MB || - || 2011 | |||
|- | |||
| Intel i7-2600 || 4 || 32 x 16 KB x 4 || 256 KB x 4 || 8MB || 2011 | |||
|- | |||
| Intel i7-3970X || 6 || 32 x 16 KB x 6 || 256 KB x 6 || 15MB || November 2012 | |||
|- | |||
| Intel i5-3330 || 4 || 32 x 32 KB x 4 || 256 KB x 4 || 6MB || 2012 | |||
|- | |||
| Intel i5-3570K || 4 || 32 x 32 KB x 4 || 256 KB x 4 || 6MB || 2012 | |||
|- | |||
| Intel i7-3820 || 4 || 32 x 32 KB x 4 || 256 KB x 4 || 10MB || February 2012 | |||
|- | |||
| Intel i7-3540M || 2 || 32 x 32 KB x 2 || 256 KB x 2 || 4MB || January 2013 | |||
|- | |||
| Intel i5-3339Y || 2 || 32 x 32 KB x 2 || 256 KB x 2 || 3MB || January 2013 | |||
|} | |} | ||
Note: The 'x #' in the L1 and L2 columns indicates that this cache is for each core. | |||
The next two sections will discuss cache write policies and cache prefetching as techniques to improve the performance of these complex caching architectures by reducing write-miss and read-miss rates. | |||
=Cache Write Policies= | =Cache Write Policies= | ||
Line 86: | Line 139: | ||
*Write-allocate vs Write no-allocate: When a write misses in the cache, there may or may not be a line in the cache allocated to the block. For write-allocate, there will be a line in the cache for the written data. This policy is typically associated with write-back caches. For no-write-allocate, there will not be a line in the cache. | *Write-allocate vs Write no-allocate: When a write misses in the cache, there may or may not be a line in the cache allocated to the block. For write-allocate, there will be a line in the cache for the written data. This policy is typically associated with write-back caches. For no-write-allocate, there will not be a line in the cache. | ||
*Fetch-on-write vs no-fetch-on-write: The fetch-on-write will cause the block of data to be fetched from a lower memory hierarchy if the write misses. The policy fetches a block on every write miss. | *Fetch-on-write vs no-fetch-on-write: The fetch-on-write will cause the block of data to be fetched from a lower memory hierarchy if the write misses. The policy fetches a block on every write miss. Fetch-on-write determines if the block is fetched from memory, and write-allocate determines if the memory block is stored in a cache line. Certain write policies may allocate a line in cache for data that is written by the CPU without retrieving it from memory first. | ||
*Write-before-hit vs no-write-before-hit: The write-before-hit will write data to the cache before checking the cache tags for a match. In case of a miss, the policy will displace the block of data already in the cache. | *Write-before-hit vs no-write-before-hit: The write-before-hit will write data to the cache before checking the cache tags for a match. In case of a miss, the policy will displace the block of data already in the cache. | ||
==Combination Policies== | ==Combination Policies== | ||
* Write-validate: It is a combination of no-fetch-on-write and write-allocate<sup><span id="4body">[[#4foot|[4]]]</span></sup>. The policy allows partial lines to be written to the cache on a miss. It provides for better performance as well as works with machines that have various line sizes and does not add instruction execution overhead to the program being run. Write-validate requires that the lower level system memory can support writes of partial lines. Tests by the author demonstrated more than a 90% reduction in write misses when compared to fetch-on-write policies. | The two combinations of fetch-on-write and no-write-allocate are blocked out in the above diagram. They are considered unproductive since the data is fetched from lower level memory, but is not stored in the cache. Temporal and spacial locality suggest that this data will be used again, and will therefore need to be retrieved a second time. | ||
Three combinations named write-validate, write-around, and write-invalidate all use no-fetch-on-write. They result in 'eliminated misses' when compared to a fetch-on-write policy. In general, this will yield better cache performance if the overhead to manage the policy remains low. | |||
* Write-validate: It is a combination of no-fetch-on-write and write-allocate<sup><span id="4body">[[#4foot|[4]]]</span></sup>. The policy allows partial lines to be written to the cache on a miss. It provides for better performance as well as works with machines that have various line sizes and does not add instruction execution overhead to the program being run. Write-validate requires that the lower level system memory can support writes of partial lines. The no-fetch-on-write has the advantage of avoiding an immediate bus transaction to retrieve the block from memory. For a multiprocessor with a cache coherency model, though, a bus transaction will be required to get exclusive ownership of the block. While this appears to negate the advantages of no-fetch-on-write, the process is able to continue other tasks while this transaction is occurring, including rewriting to the same cache location. Tests by the author demonstrated more than a 90% reduction in write misses when compared to fetch-on-write policies. | |||
* Write-invalidate: This policy is a combination of write-before-hit, no-fetch-on-write, and no-write-allocate<sup><span id="4body">[[#4foot|[4]]]</span></sup>. This policy invalidates lines when there is a miss. With this policy, the copy that exists in lower level memory after the write miss differs from the one in the cache. For write hits, though, the data is simply written into the cache using the cache hit policy. Thus, for hits, the cache is not written around. Tests by the author demonstrated a 35-50% reduction in write misses when compared to fetch-on-write policies. | * Write-invalidate: This policy is a combination of write-before-hit, no-fetch-on-write, and no-write-allocate<sup><span id="4body">[[#4foot|[4]]]</span></sup>. This policy invalidates lines when there is a miss. With this policy, the copy that exists in lower level memory after the write miss differs from the one in the cache. For write hits, though, the data is simply written into the cache using the cache hit policy. Thus, for hits, the cache is not written around. Tests by the author demonstrated a 35-50% reduction in write misses when compared to fetch-on-write policies. | ||
* Write-around: Combination of no-fetch-on-write, no-write-allocate, and no-write-before-hit<sup><span id="4body">[[#4foot|[4]]]</span></sup>. This policy uses a non-blocking write scheme to write to cache. It writes data to the next lower cache without modifying the data of the cache line. It bypasses the higher level cache, writes directly to the lower level memory, and leaves the existing block line in the cache unaltered on a hit or a miss. This strategy shows performance improvements when the data that is written will not be reread in the near future. Since we are writing before a hit is detected, the cache is written around for both hits and misses. | * Write-around: Combination of no-fetch-on-write, no-write-allocate, and no-write-before-hit<sup><span id="4body">[[#4foot|[4]]]</span></sup>. This policy uses a non-blocking write scheme to write to cache. It writes data to the next lower cache without modifying the data of the cache line. It bypasses the higher level cache, writes directly to the lower level memory, and leaves the existing block line in the cache unaltered on a hit or a miss. This strategy shows performance improvements when the data that is written will not be reread in the near future. Since we are writing before a hit is detected, the cache is written around for both hits and misses. The author notes that in only but a few cases write-around performs worse than write-validate policies. Most applications tend to reread what they have recently written. Using a write-around policy, this would result in a cache miss and a read from lower-level memory. With write-validate, the data would be in cache. Tests by the author demonstrated a 40-70% reduction in write misses when compared to fetch-on-write policies. | ||
The author notes that in only but a few cases write-around performs worse than write-validate policies. Most applications tend to reread what they have recently written. Using a write-around policy, this would result in a cache miss and a read from lower-level memory. With write-validate, the data would be in cache. | |||
Of these three combinations, write-validate performed the worse. The author notes, though, that it does perform better than fetch-on-write and is easy to implement. | Of these three combinations, write-validate performed the worse. The author notes, though, that it does perform better than fetch-on-write and is easy to implement. | ||
* Fetch-on-write: When used with write-allocate, fetch-on-write stores the block and processes the write to cache as expected. This policy was used as the base-line for determining the performance of the other three policy combinations. In general, all three previous combinations exhibited fewer misses than this policy. | |||
=Prefetching= | =Prefetching= | ||
Cache is very efficient in terms on access time once that data or instructions are in the cache. But when a process tries to access something that is not already in the cache, a cache miss occurs and those pages need to be brought into the cache from memory. Generally cache miss are expensive to the performance as the processor has to wait for that data (In parallel processing, a process can execute other tasks while it is waiting on data, but there will be some overhead for this). Prefetching is a technique in which data is brought into cache before the program needs it. In other words, it is a way to reduce cache misses. Prefetching uses some type of prediction to mechanism to anticipate the next data that will be needed and brings them into cache. It is not guaranteed that the | [http://en.wikipedia.org/wiki/Instruction_prefetch Prefetching] is a technique that can be used in addition to write-hit and write-miss policies to improved the utilization of the cache. Microprocessors, such as those listed in Table 1, support prefetching logic directly in the chip. Cache is very efficient in terms on access time once that data or instructions are in the cache. But when a process tries to access something that is not already in the cache, a cache miss occurs and those pages need to be brought into the cache from memory. Generally cache miss are expensive to the performance as the processor has to wait for that data (In parallel processing, a process can execute other tasks while it is waiting on data, but there will be some overhead for this). Prefetching is a technique in which data is brought into cache before the program needs it. In other words, it is a way to reduce cache misses. Prefetching uses some type of prediction to mechanism to anticipate the next data that will be needed and brings them into cache. It is not guaranteed that the prefetched data will be used. Goal here is to reduce cache misses to improve overall performance.<br><br> | ||
Some architectures have instructions to prefetch data into cache. Programmers and compliers can insert this prefect instruction in the code. This is known as software prefetching. In hardware prefetching, processor observers the system behavior and issues requests for prefetching. Intel 8086 and Motorola 68000 were the first microprocessors to implement instruction prefetch. Graphics Processing Units benefit from prefetching due to spatial locality property of the data<sup><span id="8body">[[#8foot|[8]]]</span></sup>. | Some architectures have instructions to prefetch data into cache. Programmers and compliers can insert this prefect instruction in the code. This is known as software prefetching. In hardware prefetching, processor observers the system behavior and issues requests for prefetching. Intel 8086 and Motorola 68000 were the first microprocessors to implement instruction prefetch. Graphics Processing Units benefit from prefetching due to spatial locality property of the data<sup><span id="8body">[[#8foot|[8]]]</span></sup>. | ||
==Advantages== | ==Advantages== | ||
* | *Masks data access latency caused by cache misses.<sup><span id="20body">[[#20foot|[20]]]</span></sup> | ||
*Improves the performance between the processor and the memory.<sup><span id="20body">[[#20foot|[20]]]</span></sup> | |||
==Disadvantages== | ==Disadvantages== | ||
* Wastes bandwidth when prefetched data is not used. | * Wastes bandwidth when prefetched data is not used. | ||
Line 135: | Line 193: | ||
Following are two examples of processors that support prefetching directly in the chip design: | Following are two examples of processors that support prefetching directly in the chip design: | ||
== Intel Core i7<sup><span id="11body">[[#11foot|[11]]]</span></sup>== | ===Prefetching in Intel Core i7<sup><span id="11body">[[#11foot|[11]]]</span></sup>=== | ||
The Intel 64 Architecture, including the Intel Core i7, includes both instruction and data prefetching directly on the chip. | The Intel 64 Architecture, including the Intel Core i7, includes both instruction and data prefetching directly on the chip. | ||
Line 142: | Line 200: | ||
The data prefetch logic (DPL) maintains two arrays to track the recent accesses to memory: one for the upstreams that has 12 entries, and one for downstreams that has 4 entires. As pages are accessed, their addresses are tracked in these arrays. When the DPL detects an access to a page that is sequential to an existing entry, it assumes this sequential access will continue, and prefetches the next cache line from memory. | The data prefetch logic (DPL) maintains two arrays to track the recent accesses to memory: one for the upstreams that has 12 entries, and one for downstreams that has 4 entires. As pages are accessed, their addresses are tracked in these arrays. When the DPL detects an access to a page that is sequential to an existing entry, it assumes this sequential access will continue, and prefetches the next cache line from memory. | ||
== AMD<sup><span id="12body">[[#12foot|[12]]]</span></sup>== | ===Prefetching in AMD<sup><span id="12body">[[#12foot|[12]]]</span></sup>=== | ||
The AMD Family 10h processors use a stream-detection strategy similar to the Intel process described above to trigger prefetching of the next sequential memory location into the L1 cache. (Previous AMD processors fetched into the L2 cache, which introduced the access latency of the L2 cache and hindered performance.) | The AMD Family 10h processors use a stream-detection strategy similar to the Intel process described above to trigger prefetching of the next sequential memory location into the L1 cache. (Previous AMD processors fetched into the L2 cache, which introduced the access latency of the L2 cache and hindered performance.) | ||
Line 151: | Line 209: | ||
The AMD Family 10h also includes 'Adaptive Prefetching', a hardware optimization that triggers prefetching when the demand stream catches up to the prefetch stream. In this case, the unit-stride is increased to assure that the prefetcher is fetching at a rate sufficient enough to continuously provide data to the processor. | The AMD Family 10h also includes 'Adaptive Prefetching', a hardware optimization that triggers prefetching when the demand stream catches up to the prefetch stream. In this case, the unit-stride is increased to assure that the prefetcher is fetching at a rate sufficient enough to continuously provide data to the processor. | ||
=Cache Coherence Support= | |||
<ref>[http://en.wikipedia.org/wiki/Cache_coherence Cache coherence Wiki]</ref>[http://en.wikipedia.org/wiki/Cache_coherence Cache Coherence] (also cache coherency) refers to the consistency of data stored in local caches of a shared resource. In a [http://en.wikipedia.org/wiki/Shared_memory Shared memory] multiprocessor system with a separate cache memory for each processor, it is possible to have many copies of any one instruction operand: one copy in the main memory and one in each cache memory. When one copy of an operand is changed, the other copies of the operand must be changed also. Cache coherence is the discipline that ensures that changes in the values of shared operands are propagated throughout the system in a timely fashion. | |||
Cache coherence is achieved if the following conditions are met. | |||
* If a processor P1 writes a value A to a memory location X and reads from the same memory location X, then P1 should be returned value A provided no other write happened in-between. | |||
* If a processor P1 writes a value A to a memory location X and another processor P2 reads from the same memory location X, then P2 should be returned the value A written by P1 provided no other write happened in-between. | |||
* Consider that a processor P1 writes a value A to a memory location X followed by another processor P2 which writes a value B to the same memory location X. In this case the writes should appear in the same order i.e. A and then B. | |||
==Software vs. Hardware solutions== | |||
Both software and hardware solutions exists for the cache coherence. Hardware solutions are most widely used than software solutions. Software scheme require support from cache or runtime system and some cases also require hardware assistance. | |||
<ref>[ftp://ftp.cs.wisc.edu/markhill/Papers/isca91_coherence.pdf Comparison of Hardware and Software Cache Coherence Schemes] Sarita V. Adve, Vikram S. Adve, Mark D. Hill, Mary K. Vernon</ref>Recent studies have shown that software schemes are comparable to the hardware solutions. The only cases for which software schemes perform significantly worse than hardware schemes are when there is a greater than 15% reduction in hit rate due to inaccurate prediction of memory access conflicts or when there are many writes in the program that are not executed at run-time. For relatively well structured and deterministic programs, on the other hand, software schemes perform significantly in the same range as the hardware schemes. | |||
==Cache Coherence Schemes – Fetch and Replacements== | |||
===Invalidation Schemes vs. Update Strategies<ref>[https://parasol.tamu.edu/~rwerger/Courses/654/cachecoherence1.pdf Cache Coherence] Josep Torrellas</ref>=== | |||
* [http://en.wikipedia.org/wiki/Cache_invalidation Invalidation]: On a write, all other caches with a copy are invalidated. Invalidation is least recommended when there is a single producer and many consumers of data. | |||
* Update: On a write, all other caches with a copy are updated. Update Strategy is least recommended when there are multiple writes by one programming element say PE1 before data is read by another programming element PE2. | |||
The strategies followed for fetch and replacement in some of the schemes are discussed below. | |||
===Snoopy Cache Coherence Schemes=== | |||
#[http://en.wikipedia.org/wiki/Bus_sniffing Snooping]g<ref>[http://en.wikipedia.org/wiki/Cache_coherence Cache coherence Wiki]</ref> is the process where the individual caches monitor address lines for accesses to memory locations that they have cached. When a write is observed to a location that a cache has a copy of and the cache controller invalidates its own copy of the snooped memory location. | |||
#Snoopy Scheme is a distributed cache coherence scheme based on the notion of a snoop that watches all activity on a global bus, or is informed about such activity by some global broadcast mechanism. | |||
#When replacement of one of the entries is required the snoop filter selects for replacement the entry representing the cache line or lines owned by the fewest nodes, as determined from a presence vector in each of the entries. | |||
===Directory Based Cache Coherence Schemes=== | |||
#In a directory-based system, the data being shared is placed in a common directory that maintains the coherence between caches. The directory acts as a filter through which the processor must ask permission to load an entry from the primary memory to its cache. When an entry is changed the directory either updates or invalidates the other caches with that entry. | |||
#Fetch and Replacement Scenarios<ref>[http://www.di.unipi.it/~vannesch/SPA%202010-11/Silvia.pdf Cache Coherence Techniques] Silvia Lametti</ref>: | |||
##Block in Uncached State: The block required by a processor P1 is not in the cache. When a processor P1 tries to read an uncached block X, read miss occurs. | |||
##*Read Miss: P1 is sent data from memory and P1 is made as the only sharing Node. The block X that is cached is marked as shared. | |||
##*Write Miss: P1 is sent the data and also made as the sharing node. The block is now marked exclusive to indicate that the only valid copy is cached. | |||
##Block is Shared: The block requested by a Processor P1 is in shared state. | |||
##*Read Miss: Requesting Processor P1 is sent the data and P1 is added to the set of processors sharing the data. | |||
##*Write Miss: Requesting Processor P1 is sent the value. All processors sharing the data are sent and invalidate message. The block is made exclusive for the Processor P1. | |||
##Block is Exclusive: The current value of the block (say X) is held in the cache of the processor (the owner) P1 which currently owns the block exclusively. | |||
##*Read Miss: When a processor P2 raises fetch request for the block X, P1 is sent the notification that a fetch request has been posted for a block currently held by P1. P1 sends the data back to the directory, where it is written to memory and sent back to requesting processor P2. P2 will now be added to the set of processor sharing the block X along with P1. | |||
##*Data Write-back: The owner processor P1 is replacing the block and hence must write it back. This cause the copy of block X in memory to be up to date. The block will now be uncached and the set maintaining the list of shared processors will be emptied | |||
##*Write Miss: A processor P2 makes write request to the block X which is currently owned by P1. P1 will be notified of this, which causes P1 to update the memory with its value of block X. The updated block X is now sent to P2 and it is made exclusive for P2. | |||
===Write Through Schemes=== | |||
In write through schemes, all processor writes results in update of local cache and a global bus write that updates the main memory and also invalidates or updates all other caches with that same item. | |||
===Write-Back/Ownership Schemes=== | |||
In write-back/ownership schemes, a single cache has ownership of a block. In this case when a processor writes it will not result in bus writes thus conserving bandwidth. | |||
===Pointer-Based Coherence Schemes<ref>[http://courses.engr.illinois.edu/cs533/reading_list/1c.pdf Reducing Memory and Traffic Requirements for Scalable Directory-Based Cache Coherence Schemes] Anoop Gupta, Wolf-Dietrich Weber, and Todd Mowry</ref>=== | |||
#The Full Bit Vector Schemes | |||
#*This scheme associates a complete bit vector, one bit per processor, with each block of main memory. The directory also contains a dirty-bit for each memory block to indicate if some processor has been given exclusive access to modify that block in its cache. Each bit indicates whether that memory block is being cached by the corresponding processor, and thus the directory has full knowledge of the processors caching a given block. | |||
#*When a block has to be invalidated, messages are sent to all processors whose caches have a copy. In terms of message traffic needed to keep the caches coherent, this is the best that an invalidation-based directory scheme can do. | |||
#Limited Pointer Schemes | |||
#*Recent studies has shown that for most kinds of data objects the corresponding memory locations are cached by only a small number of processors at any given time. This knowledge can be exploited to reduce directory memory overhead by restricting each directory entry to a small fixed number of pointers, each pointing to a processor caching that memory block. | |||
#*An important implication of limited pointer schemes is that there must exist some mechanism to handle blocks that are cached by more processors than the number of pointers in the directory entry. | |||
==Cache Coherency protocol== | |||
A '''coherency protocol''' is a protocol which maintains the consistency between all the caches in a system of [http://en.wikipedia.org/wiki/Distributed_shared_memory distributed shared memory]. The protocol maintains [http://en.wikipedia.org/wiki/Memory_coherence memory coherence] according to a specific [http://en.wikipedia.org/wiki/Consistency_model consistency model]. Older multiprocessors support the [http://en.wikipedia.org/wiki/Sequential_consistency sequential consistency] model, while modern shared memory systems typically support the [http://en.wikipedia.org/wiki/Release_consistency release consistency] or [http://en.wikipedia.org/wiki/Weak_consistency weak consistency] models. | |||
Transitions between states in any specific implementation of these protocols may vary. For example, an implementation may choose different update and invalidation transitions such as update-on-read, update-on-write, invalidate-on-read, or invalidate-on-write. The choice of transition may affect the amount of inter-cache traffic, which in turn may affect the amount of cache bandwidth available for actual work. This should be taken into consideration in the design of distributed software that could cause strong contention between the caches of multiple processors. | |||
Various models and protocols have been devised for maintaining coherence, such as [http://en.wikipedia.org/wiki/MSI_protocol MSI], [http://en.wikipedia.org/wiki/MESI_protocol MESI] (aka Illinois), [http://en.wikipedia.org/wiki/MOSI_protocol MOSI], [http://en.wikipedia.org/wiki/MOESI_protocol MOESI], [http://en.wikipedia.org/wiki/MERSI_protocol MERSI], [http://en.wikipedia.org/wiki/MESIF_protocol MESIF], [http://en.wikipedia.org/wiki/Write-once_(cache_coherence) Write-once (cache coherence)], and Synapse, Berkeley, [http://en.wikipedia.org/wiki/Firefly_protocol Firefly] and [http://en.wikipedia.org/wiki/Dragon_protocol Dragon]'''. | |||
=References= | =References= | ||
Line 165: | Line 286: | ||
<span id="11foot">[[#11body|11.]]</span> "Intel® 64 and IA-32 Architectures Optimization Reference Manual", Intel Corporation, 1997-2011. http://www.intel.com/Assets/PDF/manual/248966.pdf <br /> | <span id="11foot">[[#11body|11.]]</span> "Intel® 64 and IA-32 Architectures Optimization Reference Manual", Intel Corporation, 1997-2011. http://www.intel.com/Assets/PDF/manual/248966.pdf <br /> | ||
<span id="12foot">[[#12body|12.]]</span> "Software Optimization Guide for AMD Family 10h and 12h Processors", Advanced Micro Devices, Inc., 2006–2010. http://support.amd.com/us/Processor_TechDocs/40546.pdf <br /> | <span id="12foot">[[#12body|12.]]</span> "Software Optimization Guide for AMD Family 10h and 12h Processors", Advanced Micro Devices, Inc., 2006–2010. http://support.amd.com/us/Processor_TechDocs/40546.pdf <br /> | ||
<span id="13foot">[[#13body|13.]]</span> http://en.wikipedia.org/wiki/SPARC <br /> | |||
<span id="14foot">[[#14body|14.]]</span> http://en.wikipedia.org/wiki/Athlon_II<br /> | |||
<span id="15foot">[[#15body|15.]]</span> http://en.wikipedia.org/wiki/Phenom_II<br /> | |||
<span id="16foot">[[#16body|16.]]</span> http://en.wikipedia.org/wiki/List_of_Intel_Core_i5_microprocessors<br /> | |||
<span id="17foot">[[#17body|17.]]</span> http://en.wikipedia.org/wiki/List_of_Intel_microprocessors | |||
<br /> | |||
<span id="18foot">[[#18body|18.]]</span> http://www.cpu-world.com/ | |||
<br /> | |||
<span id="19foot">[[#19body|19.]]</span> http://en.wikipedia.org/wiki/CPU_cache#History | |||
<br /> | |||
<span id="20foot">[[#20body|20.]]</span> http://www.multicoreinfo.com/prefetching-multicore-processors/ | |||
<br /> | |||
=Other References= | |||
<references/> |
Latest revision as of 07:43, 4 March 2013
Cache Hierarchy
In a simple computer model, processor reads data and instructions from the memory and operates on the data. Operating frequency of CPU increased faster than the speed of memory and memory interconnects. For example, cores in Intel first generation i7 processors run at 3.2 GHz frequency, while the memory only runs at 1.3GHz frequency. Also, multi-core architecture started putting more demand on memory bandwidth. This increases the latency in memory access and CPU will have to be idle for most of the time. Due to this, memory became a bottle neck in performance.
To solve this problem, “cache” was invented. Cache is simply a temporary volatile storage space like primary memory but runs at the speed similar to core frequency. CPU can access data and instructions from cache in few clock cycles while accessing data from main memory can take more than 50 cycles. In early days of computing, cache was implemented as a stand alone chip outside the processor. In today’s processors, cache is implemented on same die as core.
There can be multiple levels of caches, each cache subsequently away from the core and larger in size. L1 is closest to the CPU and as a result, fastest to excess. Next to L1 is L2 cache and then L3. L1 cache is divided into instruction cache and data cache. This is better than having a combined larger cache as instruction cache being read-only is easy to implement while data cache is read-write.
Cache management is impacted by three characteristics of modern processor architectures: multilevel cache hierarchies, multi-core and multiprocessor chip and systems designs, and shared vs. private caches.
Multi-level cache hierarchies are used by most contemporary chip designs to enable memory access to keep pace with CPU speeds that are advancing at a more rapid pace. Two-level and three-level cache hierarchies are common. L1 typically ranges from 16-64KB and provide access in 2-4 cycles. L2 typically ranges from 512KB to as much as 8MB and provides access in 6-15 cycles. L3 typically ranges from 4MB to 32MB and provides access to the data it contains within 50-30 cycles.[10]
Multi-core and multprocessor designs place additional constraints on cache management by requiring them to consider all processes running concurrently in the cache coherency, replenishment, fetching, and other management functions.
In multiprocessors with multiple levels of cache, certain cache levels may be shared by processors and others private to each processor. Typically, lower-level caches are private and high-level caches may or may not be shared. Notice that none of the examples in the below have a shared L1 cache. Cache management functions must consider both shared and private caches when reading and writing data from memory.
The three-level cache was used again first with the introduction of multiple processor cores, where the L3 was added to the CPU die. It became common to have the three levels be larger in size than the next and today it is not uncommon to find Level 3 cache sizes of eight megabytes. This trend appears to continue for the foreseeable future.[19]
Below is a table that illustrates the variety with which these characteristics have been combined in processors from four manufacturers over the past 6 years.
Company & Processor | # cores | L1 cache Per core | L1 cache Per core | L3 cache | Year |
---|---|---|---|---|---|
Intel Xeon Clovertown | 2 | I:4*32KB D:4*32KB | 2*4MB | - | Jan 2007 |
Intel Xeon 3200 series | 4 | - | 2*4MB | - | Jan 2007 |
AMD Athlon 64FX | 2 | I:64KB D:4KB Both 2-way set assoc. | 1MB 16way set assoc. | - | Jan 2007 |
AMD Athlon 64X2 | 2 | I:64KB D:4KB Both 2-way set assoc. | 512KB/1MB 16way set assoc. | 2MB | Jan 2007 |
AMD Barcelona | 4 | I:64KB D:64KB | 512KB | 2MB Shared | Aug 2007 |
Sun Microsystems Ultra Sparc T2 | 8 | I:16KB D:8KB | 4MB (8 banks) 16way set assoc. | - | Oct 2007 |
Intel Xeon Wolfdale DP | 2 | D:96KB | 6MB | - | Nov 2007 |
Intel Xeon Hapertown | 4 | D:96KB | 2*6MB | - | Nov 2007 |
AMD Phenom | 4 | I:64KB D:64KB | 512KB | 2MB Shared | Nov 2007 Mar 2008 |
Intel Core 2 Duo | 2 | I:32KB D:32KB | 2/4MB 8 way set assoc. | - | 2008 |
Intel Penryn Wolfdale DP | 4 | - | 6-12MB | 6MB | Mar 2008 Aug 2008 |
Intel Core 2 Quad Yorkfield | 4 | D:96KB | 12MB | - | Mar 2008 |
AMD Toliman | 3K10 | I:64KB D:64KB | 512KB | 2MB Shared | Apr 2008 |
Azul Systems Vega3 7300 Series | 864 | 768GB | - | - | May 2008 |
AMD Athlon II X2 | 2 | 128 KB x 2 | 512 KB x 2 1MB x 2 | - | 2009 |
AMD Athlon II X3 | 3 | 128 KB x 3 | 512 KB x 3 | - | 2009 |
AMD Athlon II X4 | 4 | 128 KB x 4 | 512 KB x 4 | - | 2009 |
AMD Phenom II X2 | 2 | 128 KB x 2 | 512 KB x 2 | 6MB | 2009 |
AMD Phenom II X3 | 3 | 128 KB x 3 | 512 KB x 3 | 6MB | 2009 |
AMD Phenom II X4 | 4 | 128 KB x 4 | 512 KB x 4 | 4-6MB | 2009 |
AMD Phenom II X6 | 6 | 128 KB x 6 | 512 KB x 4 | 6MB | 2010 |
Intel Core i3 | 2 | 32+32 KB x 2 | 256 KB x 2 | 4MB | 2010 |
Intel Core i3 | 2 | 32+32 KB x 2 | 256 KB x 2 | 4MB | 2010 |
Intel Core i5 - 6 Series | 2 | 32+32 KB x 2 | 256 KB x 2 | 4MB | 2010 |
Intel Core i5 - 7 Series | 4 | 32+32 KB x 4 | 256 KB x 4 | 8 MB | 2010 |
Intel Core i5 - 2400 Series Core i5 - 2500 Series | 4 | 32+32 KB x 4 | 256 KB x 4 | 6 MB | 2010 |
Intel Core i7 - 8 Series | 4 | 32+32 KB x 4 | 256 KB x 4 | 8 MB | 2010 |
Intel Core i7 - 9 Series | 4 | 32+32 KB x 4 | 256 KB x 4 | 8 MB | 2010 |
Intel Core i7 - 970 | 6 | 32+32 KB x 6 | 256 KB x 6 | 12 MB | 2010 |
Sun SPARC T4 | 8 | 8 x 16 + 16 | 8 x 128 KB | 4 MB | 2011 |
Sun SPARC64 IXfx | 16 | 32 KB x 16 | 12 MB | - | 2011 |
Intel i7-2600 | 4 | 32 x 16 KB x 4 | 256 KB x 4 | 8MB | 2011 |
Intel i7-3970X | 6 | 32 x 16 KB x 6 | 256 KB x 6 | 15MB | November 2012 |
Intel i5-3330 | 4 | 32 x 32 KB x 4 | 256 KB x 4 | 6MB | 2012 |
Intel i5-3570K | 4 | 32 x 32 KB x 4 | 256 KB x 4 | 6MB | 2012 |
Intel i7-3820 | 4 | 32 x 32 KB x 4 | 256 KB x 4 | 10MB | February 2012 |
Intel i7-3540M | 2 | 32 x 32 KB x 2 | 256 KB x 2 | 4MB | January 2013 |
Intel i5-3339Y | 2 | 32 x 32 KB x 2 | 256 KB x 2 | 3MB | January 2013 |
Note: The 'x #' in the L1 and L2 columns indicates that this cache is for each core.
The next two sections will discuss cache write policies and cache prefetching as techniques to improve the performance of these complex caching architectures by reducing write-miss and read-miss rates.
Cache Write Policies
In section 6.2.3[10], cache write hit policies and write miss policies were explored. The write hit policies, write-through and write-back, determine when the data written in the local cache is propagated to lower-level memory. As review, write-through writes data to the cache and memory on a write. Write-back writes to cache first and to memory only when a flush is required. The write miss policies covered in the text[10], write-allocate and write no-allocate, determine if a memory block is stored in a cache line after the write occurs. Note that since a miss occurred on write, the block is not already in a cache line. Write-miss policies can also be expressed in terms of write-allocate, fetch-on-write, and write-before-hit. These can be used in combination with the same write-through and write-back hit policies to define the complete cache write policy. Investigating alternative policies such as these is important since write-misses produce on average one-third of all cache misses.
Write hit policies
- Write-through: Also known as store-through, this policy will write to main memory whenever a write is performed to cache.
- Advantages:
- Easy to implement
- Main memory has most recent copy of the data
- Read misses never result in writes to main memory
- Disadvantages:
- Every write needs to access main memory
- Bandwidth intensive
- Writes are slower
- Advantages:
- Write-back: Also known as store-in or copy-back, this policy will write to main memory only when a block of data is purged from the cache storage.
- Advantages:
- Writes are as fast as the speed of the cache memory
- Multiple writes to a block require one write to main memory
- Less bandwidth intensive
- Disadvantages:
- Harder to implement
- Main memory may not be consistent with cache
- Reads that result in data replacement may cause dirt blocks to be written to main memory
- Advantages:
Write miss policies
Write miss policies can help eliminate write misses and therefore reduce bus traffic. This reduces the wait-time of all processors sharing the bus.
- Write-allocate vs Write no-allocate: When a write misses in the cache, there may or may not be a line in the cache allocated to the block. For write-allocate, there will be a line in the cache for the written data. This policy is typically associated with write-back caches. For no-write-allocate, there will not be a line in the cache.
- Fetch-on-write vs no-fetch-on-write: The fetch-on-write will cause the block of data to be fetched from a lower memory hierarchy if the write misses. The policy fetches a block on every write miss. Fetch-on-write determines if the block is fetched from memory, and write-allocate determines if the memory block is stored in a cache line. Certain write policies may allocate a line in cache for data that is written by the CPU without retrieving it from memory first.
- Write-before-hit vs no-write-before-hit: The write-before-hit will write data to the cache before checking the cache tags for a match. In case of a miss, the policy will displace the block of data already in the cache.
Combination Policies
The two combinations of fetch-on-write and no-write-allocate are blocked out in the above diagram. They are considered unproductive since the data is fetched from lower level memory, but is not stored in the cache. Temporal and spacial locality suggest that this data will be used again, and will therefore need to be retrieved a second time. Three combinations named write-validate, write-around, and write-invalidate all use no-fetch-on-write. They result in 'eliminated misses' when compared to a fetch-on-write policy. In general, this will yield better cache performance if the overhead to manage the policy remains low.
- Write-validate: It is a combination of no-fetch-on-write and write-allocate[4]. The policy allows partial lines to be written to the cache on a miss. It provides for better performance as well as works with machines that have various line sizes and does not add instruction execution overhead to the program being run. Write-validate requires that the lower level system memory can support writes of partial lines. The no-fetch-on-write has the advantage of avoiding an immediate bus transaction to retrieve the block from memory. For a multiprocessor with a cache coherency model, though, a bus transaction will be required to get exclusive ownership of the block. While this appears to negate the advantages of no-fetch-on-write, the process is able to continue other tasks while this transaction is occurring, including rewriting to the same cache location. Tests by the author demonstrated more than a 90% reduction in write misses when compared to fetch-on-write policies.
- Write-invalidate: This policy is a combination of write-before-hit, no-fetch-on-write, and no-write-allocate[4]. This policy invalidates lines when there is a miss. With this policy, the copy that exists in lower level memory after the write miss differs from the one in the cache. For write hits, though, the data is simply written into the cache using the cache hit policy. Thus, for hits, the cache is not written around. Tests by the author demonstrated a 35-50% reduction in write misses when compared to fetch-on-write policies.
- Write-around: Combination of no-fetch-on-write, no-write-allocate, and no-write-before-hit[4]. This policy uses a non-blocking write scheme to write to cache. It writes data to the next lower cache without modifying the data of the cache line. It bypasses the higher level cache, writes directly to the lower level memory, and leaves the existing block line in the cache unaltered on a hit or a miss. This strategy shows performance improvements when the data that is written will not be reread in the near future. Since we are writing before a hit is detected, the cache is written around for both hits and misses. The author notes that in only but a few cases write-around performs worse than write-validate policies. Most applications tend to reread what they have recently written. Using a write-around policy, this would result in a cache miss and a read from lower-level memory. With write-validate, the data would be in cache. Tests by the author demonstrated a 40-70% reduction in write misses when compared to fetch-on-write policies.
Of these three combinations, write-validate performed the worse. The author notes, though, that it does perform better than fetch-on-write and is easy to implement.
- Fetch-on-write: When used with write-allocate, fetch-on-write stores the block and processes the write to cache as expected. This policy was used as the base-line for determining the performance of the other three policy combinations. In general, all three previous combinations exhibited fewer misses than this policy.
Prefetching
Prefetching is a technique that can be used in addition to write-hit and write-miss policies to improved the utilization of the cache. Microprocessors, such as those listed in Table 1, support prefetching logic directly in the chip. Cache is very efficient in terms on access time once that data or instructions are in the cache. But when a process tries to access something that is not already in the cache, a cache miss occurs and those pages need to be brought into the cache from memory. Generally cache miss are expensive to the performance as the processor has to wait for that data (In parallel processing, a process can execute other tasks while it is waiting on data, but there will be some overhead for this). Prefetching is a technique in which data is brought into cache before the program needs it. In other words, it is a way to reduce cache misses. Prefetching uses some type of prediction to mechanism to anticipate the next data that will be needed and brings them into cache. It is not guaranteed that the prefetched data will be used. Goal here is to reduce cache misses to improve overall performance.
Some architectures have instructions to prefetch data into cache. Programmers and compliers can insert this prefect instruction in the code. This is known as software prefetching. In hardware prefetching, processor observers the system behavior and issues requests for prefetching. Intel 8086 and Motorola 68000 were the first microprocessors to implement instruction prefetch. Graphics Processing Units benefit from prefetching due to spatial locality property of the data[8].
Advantages
- Masks data access latency caused by cache misses.[20]
- Improves the performance between the processor and the memory.[20]
Disadvantages
- Wastes bandwidth when prefetched data is not used.
- Hardware prefetching requires complex architecture. Second order effect is cost of implementation on silicon and validation costs.
- Software prefetching adds additional instructions to the program, making the program larger.
- If same cache is used for prefetching, then prefetching could cause other cache blocks to be evicted. If the evicted blocks are needed, then that will generate a cache miss. This can be prevented by having a separate cache for prefetching but it comes with hardware costs.
- When scheduler changes the task running on a processor, prefetched data may become useless.
Effectiveness
Prefetching effectiveness can be tracked by following matrices
- Coverage is defined as fraction of original cache misses that were prefetched resulting in cache hit.
- Accuracy is defined as fraction of prefetches that are useful.
- Timeliness measures how early the prefetches arrive.
Ideally, a system should have high coverage, high accuracy and optimum timeliness. Realistically, aggressive prefetches can increase coverage but decrease accuracy and vice versa. Also, if prefetching is done too early, the fetched data may have to be evicted without being used and if done too late, it can cause cache miss.
Stream Buffer Prefetching
This is a technique for prefetching which uses a FIFO buffer in which each entry is a cacheline and has address (or tag) and a available bit. System prefetches a stream of sequential data into a stream buffer and multiple stream buffers can be used to prefetch multiple streams in parallel. On a cache access, head entries of stream buffers are check for match along with cache check. If the block is not found in cache but found at the head of stream buffer, it is moved to cache and next entry in the buffer becomes the head. If the block is not found in cache or as head of buffer, data is brought from memory into cache and the subsequent address as assigned to a new stream buffer. Only the heads of the stream buffers are checked during cache access and not the whole buffer. Checking all the entries in all the buffers will increase hardware complexity.
Above plot shows the cache hit improvements for with respect to number of stream buffers on different programs. Graph on left compares 8 programs from NAS suite while graph on right shows programs from Unix unity suite[7].
Prefetching in Parallel Computing
On a uniprocessor system, prefetching is definitely helpful to improve performance. On a multiprocessor system prefetching is useful, but there are tighter constrains in implementing because of the fact that data will be shared between different processors or cores. In message passing parallel model, each parallel thread has its own memory space and prefetching can be implemented in same way as for uniprocessor system.
In shared memory parallel programming, multiple threads that run on different processors share common memory space. If multiple processors use common cache, then prefetching implementation is similar to uniprocessor system. Difficulties arise when each core has its own cache. Some of the case-scenarios that can occur are:
- Processor P1 has prefetched some data D1 into its stream buffer but is not used it. At the same time processor P2 reads data D1 into its cache and modifies it and one of the cache coherency protocols would be used to inform processor P1 about this change. D1 is not in P1’s cache so it many simply ignore this. Now, when P1 ties to read D1, it will get the stale data from its stream buffer. One way to prevent this is by improving stream buffers so that they can modify their data just like a cache. This adds complexity to the architecture and increases cost[6].
- Prefetching mechanism can fetch data D1, D2, D3,….D10 into P1’s buffer. Now due to parallel processing, P1 only needs to operate on D1 to D5 while P2 will operate on remaining data. Some bandwidth was wasted in fetching D6 to D10 into P1 even though it did not use it. There is a trade off to be made here, if prefetching is very conservative then it will lead to miss and if not then it will waste bandwidth.
- In a multiprocessor system, if threads are not bound to a core, Operation system can rebalance the treads of different cores. This will require the prefetched buffers to be trashed.
Following are two examples of processors that support prefetching directly in the chip design:
Prefetching in Intel Core i7[11]
The Intel 64 Architecture, including the Intel Core i7, includes both instruction and data prefetching directly on the chip.
The Data Cache Unit prefetcher is a streaming prefetcher for L1 caches that detects ascending access to data that has been loaded very recently. The processor assumes that this ascending access will continue, and prefetches the next line.
The data prefetch logic (DPL) maintains two arrays to track the recent accesses to memory: one for the upstreams that has 12 entries, and one for downstreams that has 4 entires. As pages are accessed, their addresses are tracked in these arrays. When the DPL detects an access to a page that is sequential to an existing entry, it assumes this sequential access will continue, and prefetches the next cache line from memory.
Prefetching in AMD[12]
The AMD Family 10h processors use a stream-detection strategy similar to the Intel process described above to trigger prefetching of the next sequential memory location into the L1 cache. (Previous AMD processors fetched into the L2 cache, which introduced the access latency of the L2 cache and hindered performance.)
The AMD Family 10h can prefetch more than just the next sequential block when a stream is detected, though. When this 'unit-stride' prefetcher detects misses to sequential blocks, it can trigger a preset number of prefetch requests from memory. For example, if the preset value is two, the next two blocks of memory will be prefetched when a sequential access is detected. Subsequent detection of misses of sequential blocks may only prefetch a single block. This maintains the 'unit-stride' of two blocks in the cache ahead of the next anticipated read.
AMD contends that this is beneficial when processing large data sets which often process sequential data and process all the data in the stream. In these cases, a larger unit-stride populates the cache with blocks that will ultimately be processed by the CPU. AMD suggests that the optimal number of blocks to prefetch is between 4 and 8, and that fetching too many or fetching too soon has impaired performance in empirical testing.
The AMD Family 10h also includes 'Adaptive Prefetching', a hardware optimization that triggers prefetching when the demand stream catches up to the prefetch stream. In this case, the unit-stride is increased to assure that the prefetcher is fetching at a rate sufficient enough to continuously provide data to the processor.
Cache Coherence Support
<ref>Cache coherence Wiki</ref>Cache Coherence (also cache coherency) refers to the consistency of data stored in local caches of a shared resource. In a Shared memory multiprocessor system with a separate cache memory for each processor, it is possible to have many copies of any one instruction operand: one copy in the main memory and one in each cache memory. When one copy of an operand is changed, the other copies of the operand must be changed also. Cache coherence is the discipline that ensures that changes in the values of shared operands are propagated throughout the system in a timely fashion.
Cache coherence is achieved if the following conditions are met.
- If a processor P1 writes a value A to a memory location X and reads from the same memory location X, then P1 should be returned value A provided no other write happened in-between.
- If a processor P1 writes a value A to a memory location X and another processor P2 reads from the same memory location X, then P2 should be returned the value A written by P1 provided no other write happened in-between.
- Consider that a processor P1 writes a value A to a memory location X followed by another processor P2 which writes a value B to the same memory location X. In this case the writes should appear in the same order i.e. A and then B.
Software vs. Hardware solutions
Both software and hardware solutions exists for the cache coherence. Hardware solutions are most widely used than software solutions. Software scheme require support from cache or runtime system and some cases also require hardware assistance.
<ref>Comparison of Hardware and Software Cache Coherence Schemes Sarita V. Adve, Vikram S. Adve, Mark D. Hill, Mary K. Vernon</ref>Recent studies have shown that software schemes are comparable to the hardware solutions. The only cases for which software schemes perform significantly worse than hardware schemes are when there is a greater than 15% reduction in hit rate due to inaccurate prediction of memory access conflicts or when there are many writes in the program that are not executed at run-time. For relatively well structured and deterministic programs, on the other hand, software schemes perform significantly in the same range as the hardware schemes.
Cache Coherence Schemes – Fetch and Replacements
Invalidation Schemes vs. Update Strategies<ref>Cache Coherence Josep Torrellas</ref>
- Invalidation: On a write, all other caches with a copy are invalidated. Invalidation is least recommended when there is a single producer and many consumers of data.
- Update: On a write, all other caches with a copy are updated. Update Strategy is least recommended when there are multiple writes by one programming element say PE1 before data is read by another programming element PE2.
The strategies followed for fetch and replacement in some of the schemes are discussed below.
Snoopy Cache Coherence Schemes
- Snoopingg<ref>Cache coherence Wiki</ref> is the process where the individual caches monitor address lines for accesses to memory locations that they have cached. When a write is observed to a location that a cache has a copy of and the cache controller invalidates its own copy of the snooped memory location.
- Snoopy Scheme is a distributed cache coherence scheme based on the notion of a snoop that watches all activity on a global bus, or is informed about such activity by some global broadcast mechanism.
- When replacement of one of the entries is required the snoop filter selects for replacement the entry representing the cache line or lines owned by the fewest nodes, as determined from a presence vector in each of the entries.
Directory Based Cache Coherence Schemes
- In a directory-based system, the data being shared is placed in a common directory that maintains the coherence between caches. The directory acts as a filter through which the processor must ask permission to load an entry from the primary memory to its cache. When an entry is changed the directory either updates or invalidates the other caches with that entry.
- Fetch and Replacement Scenarios<ref>Cache Coherence Techniques Silvia Lametti</ref>:
- Block in Uncached State: The block required by a processor P1 is not in the cache. When a processor P1 tries to read an uncached block X, read miss occurs.
- Read Miss: P1 is sent data from memory and P1 is made as the only sharing Node. The block X that is cached is marked as shared.
- Write Miss: P1 is sent the data and also made as the sharing node. The block is now marked exclusive to indicate that the only valid copy is cached.
- Block is Shared: The block requested by a Processor P1 is in shared state.
- Read Miss: Requesting Processor P1 is sent the data and P1 is added to the set of processors sharing the data.
- Write Miss: Requesting Processor P1 is sent the value. All processors sharing the data are sent and invalidate message. The block is made exclusive for the Processor P1.
- Block is Exclusive: The current value of the block (say X) is held in the cache of the processor (the owner) P1 which currently owns the block exclusively.
- Read Miss: When a processor P2 raises fetch request for the block X, P1 is sent the notification that a fetch request has been posted for a block currently held by P1. P1 sends the data back to the directory, where it is written to memory and sent back to requesting processor P2. P2 will now be added to the set of processor sharing the block X along with P1.
- Data Write-back: The owner processor P1 is replacing the block and hence must write it back. This cause the copy of block X in memory to be up to date. The block will now be uncached and the set maintaining the list of shared processors will be emptied
- Write Miss: A processor P2 makes write request to the block X which is currently owned by P1. P1 will be notified of this, which causes P1 to update the memory with its value of block X. The updated block X is now sent to P2 and it is made exclusive for P2.
- Block in Uncached State: The block required by a processor P1 is not in the cache. When a processor P1 tries to read an uncached block X, read miss occurs.
Write Through Schemes
In write through schemes, all processor writes results in update of local cache and a global bus write that updates the main memory and also invalidates or updates all other caches with that same item.
Write-Back/Ownership Schemes
In write-back/ownership schemes, a single cache has ownership of a block. In this case when a processor writes it will not result in bus writes thus conserving bandwidth.
Pointer-Based Coherence Schemes<ref>Reducing Memory and Traffic Requirements for Scalable Directory-Based Cache Coherence Schemes Anoop Gupta, Wolf-Dietrich Weber, and Todd Mowry</ref>
- The Full Bit Vector Schemes
- This scheme associates a complete bit vector, one bit per processor, with each block of main memory. The directory also contains a dirty-bit for each memory block to indicate if some processor has been given exclusive access to modify that block in its cache. Each bit indicates whether that memory block is being cached by the corresponding processor, and thus the directory has full knowledge of the processors caching a given block.
- When a block has to be invalidated, messages are sent to all processors whose caches have a copy. In terms of message traffic needed to keep the caches coherent, this is the best that an invalidation-based directory scheme can do.
- Limited Pointer Schemes
- Recent studies has shown that for most kinds of data objects the corresponding memory locations are cached by only a small number of processors at any given time. This knowledge can be exploited to reduce directory memory overhead by restricting each directory entry to a small fixed number of pointers, each pointing to a processor caching that memory block.
- An important implication of limited pointer schemes is that there must exist some mechanism to handle blocks that are cached by more processors than the number of pointers in the directory entry.
Cache Coherency protocol
A coherency protocol is a protocol which maintains the consistency between all the caches in a system of distributed shared memory. The protocol maintains memory coherence according to a specific consistency model. Older multiprocessors support the sequential consistency model, while modern shared memory systems typically support the release consistency or weak consistency models.
Transitions between states in any specific implementation of these protocols may vary. For example, an implementation may choose different update and invalidation transitions such as update-on-read, update-on-write, invalidate-on-read, or invalidate-on-write. The choice of transition may affect the amount of inter-cache traffic, which in turn may affect the amount of cache bandwidth available for actual work. This should be taken into consideration in the design of distributed software that could cause strong contention between the caches of multiple processors.
Various models and protocols have been devised for maintaining coherence, such as MSI, MESI (aka Illinois), MOSI, MOESI, MERSI, MESIF, Write-once (cache coherence), and Synapse, Berkeley, Firefly and Dragon.
References
1. http://www.real-knowledge.com/memory.htm
2. Computer Design & Technology- Lectures slides by Prof.Eric Rotenberg
3. Fundamentals of Parallel Computer Architecture by Prof.Yan Solihin
4. “Cache write policies and performance,” Norman Jouppi, Proc. 20th International Symposium on Computer Architecture (ACM Computer Architecture News 21:2), May 1993, pp. 191–201.
5. Architecture of Parallel Computers, Lecture slides by Prof. Edward Gehringer
6. “Parallel computer architecture: a hardware/software approach” by David. E. Culler, Jaswinder Pal Singh, Anoop Gupta (pg 887)
7. “Evaluating Stream Buffers as a Secondary Cache Replacement” by Subbarao Palacharla and R. E. Kessler. (ref#2)
8. http://en.wikipedia.org/wiki/Instruction_prefetch
9. http://www.real-knowledge.com/memory.htm
10. Yan Solihin, Fundamentals of Parallel Computer Architecture (Solihin Publishing and Consulting, LLC, 2008-2009), p. 151
11. "Intel® 64 and IA-32 Architectures Optimization Reference Manual", Intel Corporation, 1997-2011. http://www.intel.com/Assets/PDF/manual/248966.pdf
12. "Software Optimization Guide for AMD Family 10h and 12h Processors", Advanced Micro Devices, Inc., 2006–2010. http://support.amd.com/us/Processor_TechDocs/40546.pdf
13. http://en.wikipedia.org/wiki/SPARC
14. http://en.wikipedia.org/wiki/Athlon_II
15. http://en.wikipedia.org/wiki/Phenom_II
16. http://en.wikipedia.org/wiki/List_of_Intel_Core_i5_microprocessors
17. http://en.wikipedia.org/wiki/List_of_Intel_microprocessors
18. http://www.cpu-world.com/
19. http://en.wikipedia.org/wiki/CPU_cache#History
20. http://www.multicoreinfo.com/prefetching-multicore-processors/
Other References
<references/>