User:Stchen
=Introduction to Update and Adaptive Coherence Protocols on Real Architectures=
In parallel computer architectures, '''cache coherence''' refers to the consistency of data that is stored throughout the [http://en.wikipedia.org/wiki/Cache_(computing) caches] on individual processors and in shared memory. The problem arises because there are multiple caches on multiple processors: when one processor's write changes a value that is also held in shared memory and in other caches, all of the caches must agree on the new value. This is illustrated below.
<center>[[Image:Cache Coherency Generic.png|400px|Multiple Caches of Shared Resource]]</center>
<center> '''Figure 1. Multiple Caches of Shared Resource''' </center>
According to <ref name="glasco">Glasco, D.B.; Delagi, B.A.; Flynn, M.J., "Update-based cache coherence protocols for scalable shared-memory multiprocessors," Proceedings of the Twenty-Seventh Hawaii International Conference on System Sciences, vol. 1, pp. 534-545, 4-7 Jan. 1994. doi: 10.1109/HICSS.1994.323135 [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=323135&isnumber=7709 Paper]</ref>, there are two ways to maintain cache consistency: invalidation and update. The difference is that invalidation "purges the copies of the line from the other caches which results in a single copy of the line, and updating forwards the write value to the other caches, after which all caches are consistent"<ref name="glasco"></ref>. Since invalidation protocols have already been covered, we will focus on update protocols; a newer class of adaptive coherence protocols will also be discussed below.
According to the Solihin textbook (page 229), "One of the drawbacks of an invalidate-based protocol is that it incurs high number of coherence misses." In other words, a read to an invalidated block suffers a cache miss, and serving that miss can incur high latency. To avoid this, one can use an update coherence protocol or an adaptive coherence protocol.
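To make the invalidate/update distinction concrete, here is a minimal, hypothetical Python model of one shared address held in two private caches. The classes and policy names are mine for illustration, not from any real protocol; the point is only that the read after a remote write misses under invalidation but hits under update.

```python
class Cache:
    """One private cache holding a single block for address "x"."""

    def __init__(self):
        self.block = None            # None means the copy is invalid

    def fill(self, value):
        self.block = value

    def read(self, memory):
        """Return (value, was_miss)."""
        if self.block is None:       # invalid copy: coherence miss, refetch
            self.block = memory["x"]
            return self.block, True
        return self.block, False

def write(writer, others, memory, value, policy):
    """Perform a write under either coherence policy (illustrative model)."""
    writer.block = value
    memory["x"] = value
    for c in others:
        if policy == "invalidate":
            c.block = None           # purge the remote copy
        else:                        # "update": forward the new value
            c.block = value

memory = {"x": 0}
p0, p1 = Cache(), Cache()
p0.fill(0); p1.fill(0)

write(p0, [p1], memory, 42, "invalidate")
print(p1.read(memory))               # (42, True): read after invalidation misses

write(p0, [p1], memory, 7, "update")
print(p1.read(memory))               # (7, False): the updated copy is still valid
```

The extra latency of the invalidation case is exactly the coherence miss the Solihin quote refers to.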
==Update Coherence Protocol==
===Introduction===
Update-based cache coherence protocols work by directly updating all cached copies of a block in the system. This differs from invalidation-based protocols in that write propagation is achieved without invalidations and the coherence misses they cause, saving the misses themselves, the time spent servicing them, and bus bandwidth. The update-based protocols discussed in this section are the [http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol] and the [http://en.wikipedia.org/wiki/Firefly_protocol Firefly protocol].
===Dragon Protocol===
The [http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol] is an implementation of an update-based coherence protocol. It further saves bandwidth by updating only the specific words written within a cache block rather than the entire block. The caches use write-allocate and write-update policies. The Dragon Protocol consists of four states ('''Modified (M)''', '''Exclusive (E)''', '''Shared Modified (Sm)''', '''Shared Clean (Sc)''') and includes no Invalid state, because any block present in a cache is valid.
* '''Modified (M)''' - cache block is exclusively owned, but may differ from main memory
* '''Exclusive (E)''' - cache block is clean and resides in only one cache
* '''Shared Modified (Sm)''' - cache block resides in multiple caches and is possibly dirty
* '''Shared Clean (Sc)''' - cache block resides in multiple caches and is clean
Because any cached block is assumed to be valid, no Invalid state is needed; a cached block may, however, differ from main memory. Below are the finite state machines for the processor-side and bus-side transitions. The Dragon protocol uses snooping caches to present a single uniform memory space even though there are multiple processors.
[[File:Dragon Protocol Processor-Side.png|550px|center]]
<center> '''Figure 2. Dragon Protocol Processor-Side''' </center>
[[File:Dragon Protocol Bus-Side.png|500px|center]]
<center> '''Figure 3. Dragon Protocol Bus-Side''' </center>
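The processor-side state machine can be written down as a small lookup table. The sketch below is my own encoding of the standard textbook Dragon transitions, not production code: `PrRd`/`PrWr` are processor read/write hits, `PrRdMiss`/`PrWrMiss` are misses (the block, denoted `None`, is not yet cached), and `copies_exist` models the snooped COPIES_EXIST bus line.

```python
# Dragon processor-side transitions, keyed by (state, event, copies_exist).
DRAGON = {
    # write hits (read hits never change state)
    ("E",  "PrWr", False): "M",
    ("M",  "PrWr", False): "M",
    ("M",  "PrWr", True):  "M",
    ("Sc", "PrWr", True):  "Sm",      # send BusUpd; other copies remain
    ("Sc", "PrWr", False): "M",       # no sharers left: take exclusive ownership
    ("Sm", "PrWr", True):  "Sm",      # send BusUpd
    ("Sm", "PrWr", False): "M",
    # misses: block not yet cached, denoted None
    (None, "PrRdMiss", True):  "Sc",
    (None, "PrRdMiss", False): "E",
    (None, "PrWrMiss", True):  "Sm",  # BusRd followed by BusUpd
    (None, "PrWrMiss", False): "M",
}

def dragon_next(state, event, copies_exist):
    """Next state for a processor-side event (sketch of Figure 2)."""
    if event == "PrRd":
        return state                  # read hits are purely local
    return DRAGON[(state, event, copies_exist)]

print(dragon_next("Sc", "PrWr", True))       # Sm
print(dragon_next(None, "PrRdMiss", False))  # E
```

Note that every write hit while other copies exist produces a BusUpd rather than an invalidation, which is exactly where the protocol trades bus traffic for fewer coherence misses.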
The Dragon Protocol was implemented in the [http://en.wikipedia.org/wiki/Cray_CS6400 Cray CS6400] (also known as the Xerox Dragon multiprocessor workstation) and was developed by Xerox. It was available with either 60MHz or 85MHz processors. The Xerox Dragon was designed as a research machine with numerous processors.
===Firefly Protocol===
The [http://en.wikipedia.org/wiki/Firefly_protocol Firefly protocol] is another example of an update-based coherence protocol. However, unlike the [http://en.wikipedia.org/wiki/Dragon_protocol Dragon Protocol], it uses a write-through policy, which writes all changes back to memory. It has three states:
* '''Valid Exclusive (VE)''' - cache block is exclusively owned and clean
* '''Dirty (D)''' - cache block is exclusively owned and dirty
* '''Shared (S)''' - cache block resides in multiple caches and is not modified
The Firefly Protocol uses a special bus line called the SharedLine for quick detection of copies of a block in other caches. It plays the same role as the COPIES_EXIST (C) and !COPIES_EXIST (!C) conditions and is shown that way in the finite state machines below. As in the Dragon protocol, there is no Invalid state because cache blocks are never invalidated.
[[File:Firefly Protocol Processor-Side.png|550px|center]]
<center> '''Figure 4. Firefly Protocol Processor-Side''' </center>
[[File:Firefly Protocol Bus-Side.png|550px|center]]
<center> '''Figure 5. Firefly Protocol Bus-Side''' </center>
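The Firefly write-hit transitions can be tabulated the same way. This is again my own sketch of the textbook state machine (miss transitions are omitted to keep the table small); `shared_line` models the SharedLine signal. The write-through policy shows up in the S state: the write goes onto the bus and updates memory, so if the SharedLine is no longer asserted the copy becomes Valid Exclusive rather than Dirty.

```python
# Firefly processor-side write-hit transitions, keyed by
# (state, event, shared_line).
FIREFLY = {
    ("VE", "PrWr", False): "D",   # first write to a private clean block
    ("D",  "PrWr", False): "D",   # further writes stay local
    ("S",  "PrWr", True):  "S",   # write-through update; sharers remain
    ("S",  "PrWr", False): "VE",  # no sharers left; memory was updated, so clean
}

def firefly_next(state, event, shared_line):
    """Next state for a processor-side hit (sketch of Figure 4)."""
    if event == "PrRd":
        return state              # read hits never change state
    return FIREFLY[(state, event, shared_line)]

print(firefly_next("S", "PrWr", False))   # VE
print(firefly_next("VE", "PrWr", False))  # D
```

Contrast this with the Dragon table: because Firefly writes shared blocks through to memory, a block that loses its last sharer ends up clean (VE), whereas Dragon's equivalent transition ends in M.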
The Firefly protocol was used in the [http://en.wikipedia.org/wiki/DEC_Firefly DEC Firefly] multiprocessor workstation, developed by [http://en.wikipedia.org/wiki/Digital_Equipment_Corporation Digital Equipment Corporation]. The system is asymmetric, and the caches are direct-mapped to support multiprocessing. The cache capacity was 16KB for the original [http://en.wikipedia.org/wiki/MicroVAX_78032 MicroVAX 78032] microprocessor and was later increased to 64KB when the system was upgraded to the [http://en.wikipedia.org/wiki/CVAX#CVAX_78034 CVAX 78034] microprocessor.
==Adaptive Coherence Protocols==
===Introduction===
Even though both update and invalidate protocols have clear advantages, each also has disadvantages. In a write-invalidate protocol, any write by one processor invalidates the shared copies in the caches of the other processors, forcing those caches to issue read requests to retrieve the new data. This consumes substantial bus bandwidth and is especially bad when a few processors frequently update a shared block. Write-update protocols mitigate this issue: a write updates all other caches at the same time it is applied locally. Unfortunately, this creates a different problem: some updates are unnecessary, sent to copies the other processors will never read again. These updates keep blocks alive in remote caches and tend to increase conflict and capacity misses.
Adaptive protocols try to mitigate both problems. An adaptive protocol still incurs some bursts of bus traffic and some unnecessary updates, but both can be reduced depending on how the adaptive algorithm switches between write-invalidate and write-update. There are also adaptive directory-based protocols, but these are not discussed here.
===Competitive Update Protocol===
Competitive-update protocols are "hybrid protocols between write-invalidate and write-update"<ref name="nilsson">H. Nilsson, P. Stenström, "An adaptive update-based cache coherence protocol for reduction of miss rate and traffic," Proc. Parallel Architectures and Languages Europe (PARLE) Conf., Lecture Notes in Computer Science 817, Springer-Verlag, Berlin, Athens, Greece, Jul. 1994, pp. 363–374. [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.26.7116&rep=rep1&type=pdf Paper]</ref>. These hybrids aim to reduce the coherence miss rate caused by invalidation or update alone. The remaining issue is that there can be high traffic peaks, and these peaks can offset the performance gain<ref name="nilsson"></ref>.
According to Nilsson et al.<ref name="nilsson2">H. Nilsson, P. Stenström, and M. Dubois, "Implementation and Evaluation of Update-Based Cache Protocols Under Relaxed Memory Consistency Models", Technical Report, Dept. of Computer Engineering, Lund University, Sweden, July 1993</ref>, competitive-update protocols outperform write-invalidate protocols under relaxed memory consistency. The concept is simple: a write to a block initially sends updates to the remote copies, but if a local processor does not access its copy between updates, that copy is eventually invalidated instead. The effect is that regularly accessed copies keep receiving updates while idle copies stop generating traffic. The limitation is that migratory data makes this protocol sub-optimal. The latest research in this area is the Competitive Update Protocol with Migratory Detection<ref name="nilsson"></ref>, which recognizes migratory data and compensates for it.
<center>[[Image:CompetitiveUpdateProtocolWithMigratoryDetection.jpg|800px]]</center>
<center> '''Figure 6. Competitive Update Protocol With Migratory Detection<ref name="nilsson"></ref>''' </center>
<center> '''Coherence actions for detection of migratory blocks (left) and coherence actions for read misses to migratory blocks (right).''' </center>
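The competitive countdown idea can be sketched in a few lines of Python. The threshold value and class names here are invented for illustration; the real counter management is described in the Nilsson and Stenström paper.

```python
THRESHOLD = 4  # hypothetical tuning value: updates tolerated between local accesses

class Copy:
    """One remote copy of a block under a competitive-update scheme (sketch)."""

    def __init__(self):
        self.valid = True
        self.counter = THRESHOLD

    def local_access(self):
        """A local read/write proves the copy is in use: reset the countdown."""
        if self.valid:
            self.counter = THRESHOLD
        return self.valid

    def remote_update(self, value):
        """An update arrives from another processor's write."""
        if not self.valid:
            return                    # already invalidated; no traffic needed
        self.counter -= 1
        if self.counter == 0:
            self.valid = False        # switch from update to invalidate

c = Copy()
for _ in range(THRESHOLD):
    c.remote_update(1)
print(c.valid)                        # False: an unused copy gets invalidated
```

A copy that is touched locally between updates keeps its counter refreshed and continues to receive updates, which is exactly the "regularly accessed copies stay updated" behavior described above.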
This is only one of many ways to deal with migratory data. For further reading, a Google Scholar search on "Adaptive Protocols and Migratory" returns many papers on different ways to handle the migratory-data issue that arises when using adaptive protocols.
===Cachet===
Cachet is an adaptive cache coherence protocol that uses micro-protocols.<ref name="shen">Xiaowei Shen, Arvind, and Larry Rudolph. 1999. CACHET: an adaptive cache coherence protocol for distributed shared-memory systems. In Proceedings of the 13th International Conference on Supercomputing (ICS '99). ACM, New York, NY, USA, 135-144. DOI=10.1145/305138.305187 [http://doi.acm.org/10.1145/305138.305187 Paper]</ref> Cachet recognizes that shared-memory programs exhibit a variety of access patterns and that no fixed cache coherence protocol works well for all of them.<ref name="bennet">J. K. Bennett, J. B. Carter, and W. Zwaenepoel. Adaptive Software Cache Management for Distributed Shared Memory Architectures. In Proceedings of the 17th Annual International Symposium on Computer Architecture, May 1990</ref><ref name="eggers">S. Eggers and R. H. Katz. Evaluating the Performance for Four Snooping Cache Coherency Protocols. In Proceedings of the 16th Annual International Symposium on Computer Architecture, May 1989</ref><ref name="falsafi">B. Falsafi, A. R. Lebeck, S. K. Reinhardt, I. Schoinas, M. D. Hill, J. R. Larus, A. Rogers, and D. A. Wood. Application specific protocols for user-level shared memory. In Supercomputing, Nov. 1994</ref><ref name="weber">W. D. Weber and A. Gupta. Analysis of Cache Invalidation Patterns in Multiprocessors. In Proceedings of the 3rd International Conference on Architectural Support for Programming Languages and Operating Systems, 1989</ref> Cachet obtains the access pattern either through program annotations supplied by the programmer or through recognition by the compiler.
So how does it work? The paper describes three micro-protocols, quoted below:
<blockquote>"Cachet-Base: The most straightforward implementation simply uses the memory as the rendezvous. When a Commit instruction is executed for an address that is cached in the Dirty state, the data must be written back to the memory before the instruction can complete. A Reconcile instruction for an address cached in the Clean state requires the data be purged from the cache before the instruction can complete. An attractive characteristic of Cachet-Base is its simplicity; no state needs to be maintained at the memory side."<ref name=shen></ref></blockquote> | |||
<blockquote>"Cachet-WriterPush: Since load operations are usually more frequent than store operations, it is desirable to allow a Reconcile instruction to complete even when the address is cached in the Clean state. Thus, the following load access to the address causes no cache miss. Correspondingly, when a Commit instruction is performed on a dirty cell, it cannot complete before clean copies of the address are purged from all other caches. Therefore, it can be a lengthy process to commit an address that is cached in the Dirty state."<ref name=shen></ref></blockquote>
<blockquote>"Cachet-Migratory: When an address is exclusively accessed by one processor for a reasonable time period, it makes sense to give the cache the exclusive ownership so that all instructions on the address become local operations. This is reminiscent of the exclusive state in conventional MESI like protocols. The protocol ensures that an address can be cached in at most one cache at any time. Therefore, a Commit instruction can complete even when the address is cached in the Dirty state, and a Reconcile instruction can complete even when the address is cached in the Clean state. The exclusive ownership can migrate among different caches whenever necessary."<ref name=shen></ref></blockquote> | |||
<blockquote>"Different micro-protocols are optimized for different access patterns. Cachet-Base is ideal when the location is randomly accessed by multiple processors, and only necessary commit and reconcile operations are invoked. A conventional implementation of release consistency usually requires that all addresses be indistinguishably committed before a release, and reconciled after an acquire. Such excessive use of commit and reconcile operations can result in performance degradation under Cachet-Base."<ref name=shen></ref></blockquote> | |||
<blockquote>"Cachet-WriterPush is appropriate when certain processors are likely to read an address many times before another processor writes the address. A reconcile operation performed on a clean copy causes no purge operation, regardless of whether the reconcile is necessary. Thus, subsequent load operations to the address can continually use the cached data without causing any cache miss. Cachet Migratory fits well when one processor is likely to read and write an address many times before another processor accesses the address."<ref name=shen></ref></blockquote> | |||
What is most interesting about Cachet is its ability to switch among these micro-protocols as the access pattern changes; the excerpts above from the paper explain the tradeoffs best.
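As an illustration only, a runtime or compiler pass might select a micro-protocol from the observed access pattern along these lines. The selection rules below merely paraphrase the quoted descriptions; the function and its parameters are hypothetical, not the paper's actual mechanism.

```python
def choose_micro_protocol(readers, writers, reads_per_write):
    """Pick a Cachet micro-protocol for one address (illustrative heuristic).

    readers / writers: sets of processor ids that access the address.
    reads_per_write: observed ratio of loads to stores.
    """
    if len(readers | writers) == 1:
        return "Cachet-Migratory"    # one processor owns the address for a while
    if reads_per_write > 1 and readers - writers:
        return "Cachet-WriterPush"   # read-mostly sharing: keep clean copies alive
    return "Cachet-Base"             # random access by many processors

print(choose_micro_protocol({0}, {0}, 10))       # Cachet-Migratory
print(choose_micro_protocol({0, 1, 2}, {3}, 8))  # Cachet-WriterPush
```

In the real protocol the choice is also revisited over time, since the exclusive ownership of Cachet-Migratory must be able to migrate when another processor starts accessing the address.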
==Power Considerations==
A major factor in power consumption is how many bus transactions a protocol incurs. Different protocols require different bus transactions, so we can loosely estimate how much power each technique uses by comparing their bus transactions over the same read/write pattern. We will use the patterns found in '''Ch 8 of Solihin'''.
[[File:MSI_Protocol_operation.png|600px]]
[[File:MOESI_Protocol_operation.png|600px]]
[[File:Dragon_Protocol_operation.png|600px]]
As the spreadsheets above show, the update-based protocol demonstrates better power behavior in terms of bus transactions and memory accesses. The invalidation protocols require 6 bus transactions plus 4 memory accesses for MSI, and 5 bus transactions plus 1 memory access for MOESI. This is higher than the Dragon protocol, which uses only 4 bus transactions and 1 memory access. Counting bus transactions and memory accesses over the same sequence of processor operations puts the protocols on a level playing field for comparison. The Dragon protocol also spends less time on the bus because it passes only the modified words instead of whole cache blocks.
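The comparison above can be folded into a single rough energy figure. The transaction counts below are the ones quoted in this section; the per-event energy weights are placeholders chosen for illustration (a memory access is assumed to cost several times a bus transaction), not measured values.

```python
# (bus transactions, memory accesses) per protocol, from the tables above.
COUNTS = {
    "MSI":    (6, 4),
    "MOESI":  (5, 1),
    "Dragon": (4, 1),
}

def rough_energy(protocol, bus_cost=1.0, mem_cost=4.0):
    """Weighted event count as a stand-in for energy (illustrative weights)."""
    bus, mem = COUNTS[protocol]
    return bus * bus_cost + mem * mem_cost

for name in COUNTS:
    print(name, rough_energy(name))   # MSI 22.0, MOESI 9.0, Dragon 8.0
```

Under any weighting with positive costs, Dragon's counts dominate MOESI's and MSI's here, which is why the ranking is robust even though the weights themselves are guesses.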
Unfortunately, for adaptive protocols we cannot provide even rough estimates. With competitive update and Cachet, the cost depends largely on the workload itself. For example, bus traffic under competitive update varies because the protocol invalidates copies that are not regularly used, and that behavior differs from program to program. What makes power estimates hard for adaptive protocols is precisely their adaptivity: depending on the program involved, the performance and power used can vary drastically.
==Quiz==
'''Question 1: What write policy is used in the Dragon Protocol?'''
a) write-permeate
b) write-through
c) write-back
d) write-update
'''Question 2: How does the Dragon Protocol save on bandwidth?'''
a) it does not have an invalid state and cannot incur read/write misses
b) it consistently updates the memory
c) it updates specific words within the cache instead of entire blocks
d) it only flushes the memory when in the exclusive state
'''Question 3: How do the states of the Dragon protocol differ from those of the MOESI protocol?'''
a) The Dragon shared states allow copies to exist in multiple caches
b) The Dragon protocol splits the shared state into clean and modified
c) The Exclusive state in the Dragon protocol can be dirty
d) The Modified state is exclusively owned in the MOESI protocol
'''Question 4: Why do update-based protocols not require an invalidation state?'''
a) because they do not allow for an invalid block
b) because if a block is in the cache then it is valid
c) (a) and (b) are correct
d) None of the above
'''Question 5: Which of the following is NOT true of the states in the Firefly protocol?'''
a) The states use the SharedLine technique to detect copies of the cache block
b) Dirty allows multiple modified copies of the cache block
c) Shared allows multiple copies of the cache block to exist
d) Valid Exclusive allows only one copy of the cache block
'''Question 6: Which two of the following are adaptive protocols?'''
a) Competitive Update
b) Competitive Invalidate
c) Cachet
d) Roulette
'''Question 7: The Competitive Update Protocol uses what protocol(s) as its basis?'''
a) Update
b) Invalidate
c) Both (a) and (b)
d) None of the above
'''Question 8: What is the major drawback of adaptive protocols?'''
a) Write-Updates
b) Write-Invalidates
c) Overall Hardware Limitations
d) Migratory Data
'''Question 9: Which of the following is not a sub-protocol of Cachet?'''
a) Cachet-Base
b) Cachet-WriterPush
c) Cachet-ReadPush
d) Cachet-Migratory
'''Question 10: What is the basis of the Cachet protocol?'''
a) It switches between update and invalidate
b) It switches between three sub-protocols
c) It switches between update and competitive-update protocols
d) It switches between invalidate and competitive-update protocols
==References==
<references />
Latest revision as of 02:42, 22 March 2012
Introduction to Update and Adaptive Coherence Protocols on Real Architectures
In parallel computer architectures, cache coherence refers to the consistency of data that is stored throughout the caches on individual processors or throughout the shared memory. The problem here is that we have multiple caches on multiple processors. When an update to a single cache makes changes to a shared memory, you will need to have all the caches be coherent on the value change. This is better shown below.
According to <ref name="glasco">Glasco, D.B.; Delagi, B.A.; Flynn, M.J.; , "Update-based cache coherence protocols for scalable shared-memory multiprocessors," System Sciences, 1994. Proceedings of the Twenty-Seventh Hawaii International Conference on , vol.1, no., pp.534-545, 4-7 Jan. 1994 doi: 10.1109/HICSS.1994.323135 Paper</ref> there are two ways to maintain cache consistency. These are invalidation and update. The difference is that invalidation will "... purges the copies of the line from the other caches which results in a single copy of the line, and updating forwards the write value to the other caches, after which all caches are consistent"<ref name="glasco"></ref>. Since we have already talked about invalidation protocols, we will focus on update protocols. However, now there has been a new type of adaptive coherence protocols and that will also be discussed below.
According to Solihin textbook, page number 229, "One of the drawbacks of an invalidate-based protocol is that it incurs high number of coherence misses." What this means is that when a read has been made to an invalidated block, there will be a cache miss and serving this miss can create a high latency. To solve this, one can use a update coherence protocol, or an adaptive coherence protocol.
Update Coherence Protocol
Introduction
Update-based cache coherence protocols work by directly updating all the cache values in the system. This differs from the invalidation-based protocols because it achieves write propagation without having to invalidation and misses. This saves on numerous coherence misses, time spent to correct the miss, and bandwidth usage. The update-based protocols we will be discussing in this section are the Dragon Protocol and the Firefly protocol.
Dragon Protocol
The Dragon Protocol is an implementation of update-based coherence protocol. It further saves on bandwidth by updating the specific words within the cache instead of the entire block. The caches use write allocate and write update policies. The Dragon Protocol is made up of four states (Modified (M), Exclusive (E), Shared Modified (Sm), Shared Clean (Sc)) and does not include an invalidation state, because if the block is in cache it is valid.
- Modified (M) - cache block is exclusively owned, however it can be different from main memory
- Exclusive (E) - cache block is clean and is only in one cache
- Shared Modified (Sm) - cache block resides in multiple caches and is possible dirty
- Shared Clean (Sc) - cache block resides in multiple caches and is clean
There is not an invalidation state, because if a block is cached then it assumed to be valid. However, it can differ from main memory. Below are the finite state machines for the processor-side calls and bus-side calls. Dragon protocol utilizes snoopy caches to appear as if it as a uniform memory space even though there are multiple processors.
The Dragon Protocol is implemented in the Cray CS6400 (also know as the Xerox Dragon multiprocessor workstation) and was developed by Xerox in. It was available with either 60Mhz or 85Mhz processors. The Xerox Dragon was designed to be a research numerous processors
Firefly Protocol
Firefly protocol is another example of update coherence cache protocols. However, unlike the Dragon Protocol, it uses write-through policy (which writes all changes back to memory).
- Valid Exclusive (VE) - cache block is exclusively owned, cache block is clean
- Dirty (D) - exclusive rights to the cache block, cache block is dirty
- Shared (S) - cache block is shared but is not modified
The Firefly Protocol uses a special bus technique called SharedLine to allow quick detection to copies of the block in other caches. It is similar to the COPIES_EXIST (C) and !COPIES_EXIST, and is shown that way in the finite state machines below. Similar to the Dragon protocol, there is no invalidation state because no cache blocks are ever invalidated.
The Firefly protocol is used in the DEC Firefly multiprocessor workstation, developed by Digital Equipment Corporation. The system is asymmetic and the cache is direct-mapped to support multiprocessing. The cache capicity was 16KB for the original MicroVAX 78032 microprocessor were latter increased to 64KB when upgraded to the CVAX_78034 microprocessor.
Adaptive Coherence Protocols
Introduction
Even though there are clear advantages to using either update protocols or invalidate protocols, there are still disadvantages for each. In a write invalidate protocol, any update/write operation in a processor invalidates the shared cache blocks of other processors. This will force the other caches to do read requests such that the new data is retrieved. This tends to cause a high bus bandwidth and can be especially bad if there is few processors that frequently update the cache. Fortunately, write update protocols mitigate this issue. It will update all other caches at the same time it propagates an update itself. Unfortunately, this creates a different problem, there will sometime be unnecessary update to the cache on the other processors. This tends to increase conflict and capacity cache misses.
Adaptive protocols tries to mitigate these problems. It will both tend to have some high bus traffic as well as some unnecessary updates. But, these can be mitigated based on how the adaptive algorithm switches between write-invalidate and write-update. There is also adaptive directory-based protocols, but these are not discussed here.
Competitive Update Protocol
A competitive-update protocol is a "..hybrid protocols between write-invalidate and write-update.."<ref name="nilsson"> H. Nilsson, P. Stenström "An adaptive update-based cache coherence protocol for reduction of miss rate and traffic" Proc. Parallel Architectures and Languages Europe (PARLE) Conf., Lecture Notes in Computer Science, Athens, Greece, 817, Springer-Verlag, Berlin (Jul. 1994), pp. 363–374 Paper</ref> These hybrid protocols are used to reduce the coherence miss rate caused by invalidation or update alone. The sole issue here is that there can be high traffic peeks and these peeks can offset the performance gain<ref name="nilsson"></ref> According to Nilsson in <ref name="nilsson2">H. Nilsson, P. Stenström, and M. Dubois, “Implementation and Evaluation of Update- Based Cache Protocols Under Relaxed Memory Consistency Models”, Technical Report, Dept. of Computer Engineering, Lund University, Sweden, July 1993</ref>, competitive-update protocols will outperform write-invalidate protocols under relaxed memory consistency. The concept presented is very simple. The first write to a block causes an update to the copy of the block. If instead the local processor does not access it, it will then propagate an invalidate. What this effectively does is make regularly accessed copies of the memory block be updated. The limitation here is that migratory data makes this protocol sub-optimal. The latest research done in this area is Competitive Update Protocol with Migratory Detection<ref name="nilsson"></ref>. This recognizes when there is migratory data and compensates.
This is only one of many ways to deal with migratory data. For further reading, a Google Scholar search on "Adaptive Protocols and Migratory" will return many papers published on different ways to deal with migratory data issue that arises when using adaptive protocols.
Cachet
Cachet is an adaptive cache coherence protocol that uses micro-protocols <ref name="shen">Xiaowei Shen, Arvind, and Larry Rudolph. 1999. CACHET: an adaptive cache coherence protocol for distributed shared-memory systems. In Proceedings of the 13th international conference on Supercomputing (ICS '99). ACM, New York, NY, USA, 135-144. DOI=10.1145/305138.305187Paper</ref> Cachet recognizes that shared-memory programs have various access patterns and no fixed cache coherence protocol works well for all access patterns.<ref name="bennet">J. K. Bennett, J. B. Carter, and W. Zwaenepoel. Adaptive Software Cache Management for Distributed Shared Memory Architectures. In Proceedings of the 17th Annual International Symposium on Computer Architecture, May 1990</ref><ref name="eggers">S. Eggers and R. H. Katz. Evaluating the Performance for Four Snooping Cache Coherency Protocols. In Proceedings of the 16th Annual International Symposium on Computer Architecture, May 1989</ref><ref name="falsafi">B. Falsafi, A. R. Lebeck, S. K. Reinhardt, I. Schoinas, M. D. Hill, J. R. Larus, A. Rogers, and D. A. Wood. Application specific protocols for user-level shared memory. In Supercomputing, Nov. 1994</ref><ref name="weber">W. D. Weber and A. Gupta. Analysis of Cache Invalidation Patterns in Multiprocessors. In Proceedings of the 3rd International Conference on Architectural Support for Programming Languages and Operating Systems, 1989</ref>. What cachet attempts to do is either take in the access pattern through program annotations from the programmer or recognition by the compiler.
So how does it work?
"Cachet-Base: The most straightforward implementation simply uses the memory as the rendezvous. When a Commit instruction is executed for an address that is cached in the Dirty state, the data must be written back to the memory before the instruction can complete. A Reconcile instruction for an address cached in the Clean state requires the data be purged from the cache before the instruction can complete. An attractive characteristic of Cachet-Base is its simplicity; no state needs to be maintained at the memory side."<ref name=shen></ref>
Cachet-WriterPush: Since load operations are usually more frequent than store operations, it is desirable to allow a Reconcile instruction to complete even when the address is cached in the Clean state. Thus, the following load access to the address causes no cache miss. Correspondingly, when a Commit instruction is performed on a dirty cell, it cannot complete before clean copies of the address are purged from all other caches. Therefore, it can be a lengthy process to commit an address that is cached in the Dirty state."<ref name=shen></ref>
"Cachet-Migratory: When an address is exclusively accessed by one processor for a reasonable time period, it makes sense to give the cache the exclusive ownership so that all instructions on the address become local operations. This is reminiscent of the exclusive state in conventional MESI like protocols. The protocol ensures that an address can be cached in at most one cache at any time. Therefore, a Commit instruction can complete even when the address is cached in the Dirty state, and a Reconcile instruction can complete even when the address is cached in the Clean state. The exclusive ownership can migrate among different caches whenever necessary."<ref name=shen></ref>
"Different micro-protocols are optimized for different access patterns. Cachet-Base is ideal when the location is randomly accessed by multiple processors, and only necessary commit and reconcile operations are invoked. A conventional implementation of release consistency usually requires that all addresses be indistinguishably committed before a release, and reconciled after an acquire. Such excessive use of commit and reconcile operations can result in performance degradation under Cachet-Base."<ref name=shen></ref>
"Cachet-WriterPush is appropriate when certain processors are likely to read an address many times before another processor writes the address. A reconcile operation performed on a clean copy causes no purge operation, regardless of whether the reconcile is necessary. Thus, subsequent load operations to the address can continually use the cached data without causing any cache miss. Cachet Migratory fits well when one processor is likely to read and write an address many times before another processor accesses the address."<ref name=shen></ref>
What is most interesting about Cachet is its ability to switch between these micro-protocols at runtime, adapting to the access pattern each address actually exhibits.
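To make the Commit/Reconcile semantics of the simplest micro-protocol concrete, here is a minimal sketch of Cachet-Base behavior. The class and method names are our own illustrative assumptions, not an interface from the paper; only the rules (Commit writes back a Dirty line, Reconcile purges a Clean line, memory is the sole rendezvous) come from the description above.

```python
# Sketch of Cachet-Base: memory is the rendezvous, so no sharing state is
# kept at the memory side.  All names here are hypothetical.

class CachetBaseCache:
    CLEAN, DIRTY = "Clean", "Dirty"

    def __init__(self, memory):
        self.memory = memory          # shared backing store: {addr: value}
        self.lines = {}               # addr -> (state, value)

    def store(self, addr, value):
        self.lines[addr] = (self.DIRTY, value)

    def load(self, addr):
        if addr not in self.lines:                      # miss: fetch from memory
            self.lines[addr] = (self.CLEAN, self.memory[addr])
        return self.lines[addr][1]

    def commit(self, addr):
        """Commit: a Dirty line must be written back before completing."""
        if addr in self.lines:
            state, value = self.lines[addr]
            if state == self.DIRTY:
                self.memory[addr] = value               # write back
                self.lines[addr] = (self.CLEAN, value)

    def reconcile(self, addr):
        """Reconcile: a Clean line must be purged before completing."""
        if addr in self.lines and self.lines[addr][0] == self.CLEAN:
            del self.lines[addr]                        # force a refetch later

memory = {0x10: 1}
cache = CachetBaseCache(memory)
cache.store(0x10, 42)
cache.commit(0x10)        # write-back makes the value visible to other caches
print(memory[0x10])       # 42
cache.reconcile(0x10)     # purge the clean copy; the next load refetches
print(cache.load(0x10))   # 42
```

Cachet-WriterPush and Cachet-Migratory would change exactly these two methods: WriterPush lets `reconcile` complete without purging a Clean line, and Migratory lets both complete locally because only one cache may hold the address.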
==Power Considerations==
A major issue when considering power is how many bus transactions a protocol generates. Different protocols require different bus transactions, so we can loosely gauge how much power each technique uses by comparing their bus transactions over the same read/write pattern. We will use the access patterns found in Ch. 8 of Solihin.
As the above spreadsheets show, the update-based protocols demonstrate better power consumption in terms of bus transactions and memory accesses. The invalidation protocols require 6 bus transactions plus 4 memory accesses for the MSI protocol, and 5 bus transactions plus 1 memory access for the MOESI protocol. This is higher than the Dragon protocol, which uses only 4 bus transactions and 1 memory access. By counting the bus transactions and memory accesses over the same sequence of processor accesses, we can put the different protocols on the same footing and compare them. The Dragon protocol also spends less time on the bus because it transfers only the modified words instead of whole cache blocks.
Unfortunately, for adaptive protocols we are unable to provide even rough estimates. With both competitive update and Cachet, the cost depends on the workload itself. For example, bus traffic under competitive update can vary widely: competitive update invalidates cache blocks that are not regularly used, and which blocks those are differs from program to program. What makes power estimates hard for adaptive protocols is precisely their adaptive nature; depending on the program involved, the performance and power used can vary drastically.
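The competitive-update heuristic mentioned above can be sketched with a per-copy counter: each remote copy counts incoming updates, resets the count on a local access, and self-invalidates once the count reaches a threshold, so rarely-read copies stop generating update traffic. The threshold value and all names here are illustrative assumptions.

```python
# Sketch of the competitive-update heuristic.  A copy that receives
# THRESHOLD updates without any local access invalidates itself,
# switching that line from update to invalidate behavior.

THRESHOLD = 4   # assumed cutoff; real designs tune this empirically

class CompetitiveLine:
    def __init__(self, value):
        self.value = value
        self.valid = True
        self.updates_since_access = 0

    def local_read(self):
        if not self.valid:
            return None                     # miss: the copy must be refetched
        self.updates_since_access = 0       # local use resets the counter
        return self.value

    def remote_update(self, value):
        if not self.valid:
            return                          # already invalidated: no traffic
        self.updates_since_access += 1
        if self.updates_since_access >= THRESHOLD:
            self.valid = False              # give up the copy
        else:
            self.value = value

line = CompetitiveLine(0)
for v in range(1, 6):
    line.remote_update(v)                   # 5 updates, no local reads
print(line.valid)                           # False: the copy invalidated itself
```

This is exactly why the power cost is workload-dependent: a program that reads between writes keeps resetting the counter and pays for updates, while one that does not quickly degenerates to invalidate-style behavior.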
==Quiz==
Question 1 : What write protocol is used in the Dragon Protocol?
a) write-permeate b) write-through c) write-back d) write-update
Question 2 : How does the Dragon Protocol save on bandwidth?
a) does not have an invalid state and cannot incur read/write misses b) consistently updates the memory c) updates specific words within the cache instead of entire blocks d) only flushes the memory when in the exclusive state
Question 3 : How are the states different for Dragon protocol than MOESI protocol?
a) The Dragon shared states can allow for copies to exist in multiple caches b) split up the share state into clean and modified c) Exclusive state in the Dragon protocol can be dirty d) The Modified state is exclusively owned in the MOESI protocol
Question 4 : Why do update-based protocols not require an invalid state?
a) because it does not allow for an invalid block b) if the block is in the cache, then it is valid c) both (a) and (b) are correct d) None of the above
Question 5 : Which of the following is NOT true of the states in the Firefly protocol?
a) The states use the SharedLine technique to detect copies of the block in other caches b) Dirty allows multiple modified copies of the cache block c) Shared allows multiple copies of the cache block to exist d) Valid Exclusive allows only one copy of the block to exist, in a single cache
Question 6 : What are two adaptive protocols?
a) Competitive Update b) Competitive Invalidate c) Cachet d) Roulette
Question 7 : Competitive Update Protocol uses what protocol(s) as its basis?
a) Update b) Invalidate c) Both (a) and (b) d) None of the above
Question 8 : What is the major drawback for adaptive protocols?
a) Write-Updates b) Write-Invalidates c) Overall Hardware Limitations d) Migratory Data
Question 9 : What is not a sub-protocol of Cachet?
a) Cachet-Base b) Cachet-WriterPush c) Cachet-ReadPush d) Cachet-Migratory
Question 10 : What is the basis for Cachet protocol?
a) It switches between update and invalidate b) It switches between three sub-protocols c) It switches between update and competitive update protocols d) It switches between invalidate and competitive update protocols
==References==
<references />