CSC/ECE 506 Spring 2012/ch10 sj: Difference between revisions
| Line 8: | Line 8: | ||
| ==Limited capacity for replication== | ==Limited capacity for replication== | ||
| Communicated data is automatically replicated only in the processor cache, not in local memory. This can lead to capacity misses and artifactual communication when working sets are large and include nonlocal data or when conflict misses are numerous. | Communicated data is automatically replicated only in the processor cache, not in local memory. This can lead to capacity misses and artifactual communication when working sets are large and include nonlocal data or when conflict misses are numerous. | ||
| One way to address this is to increase cache size in order to reduce misses, but there are limitations to how large a cache can be. For example, a typical full-map directory coherence for a 16 multicore processor with 64-byte blocks, 1MB L2 caches, and 64KB L1 caches would require over 288,000 directory entries. The directory size per core would then need to be 1.2MB to keep everything coherent, which would exceed the size of the L2 caches. While alternatives to full-map storage can be used, the use of these typically either increases both complexity and latency or reduces the ability to scale.  | |||
| In addition, increasing cache size also increases access latency. Current caches are so large that having uniform access to the entire cache is no longer practical. Instead, the cache is broken up into slices, with the slices closest to the processer having the fastest access, while the ones further away requiring significantly more access time. This means that there are practical limits to how large a cache can be and still allow for a full-map directory.<ref>http://www.cs.cmu.edu/~hardav/papers/2009-ISCA-R-NUCA-Hardavellas.pdf</ref> | |||
| ==High design and implementation cost== | ==High design and implementation cost== | ||
| Protocols are complex and getting them right in hardware take substantial design time. A simple directory-based protocol can be developed under various assumptions, namely that the directory state and sharing vector are always aware of current cache state and that there is no overlap between requests. These assumptions, however, do not generally hold true in real systems, and various means have been devised to handle these types of situations. The first situation can be handled by splitting up one of the states based on which node is believed to be the owner at the time. While the second situation can be handle by serializing requests, this tends to be very expensive from a performance standpoint. Additional complexity, therefore, must be built into the protocol design in order to allow for non-atomic request processing.<ref>Yan Solihin. Fundamentals of Parallel Computer Architecture. 2008-2009.</ref> | Protocols are complex and getting them right in hardware take substantial design time. A simple directory-based protocol can be developed under various assumptions, namely that the directory state and sharing vector are always aware of current cache state and that there is no overlap between requests. These assumptions, however, do not generally hold true in real systems, and various means have been devised to handle these types of situations. The first situation can be handled by splitting up one of the states based on which node is believed to be the owner at the time. While the second situation can be handle by serializing requests, this tends to be very expensive from a performance standpoint. Additional complexity, therefore, must be built into the protocol design in order to allow for non-atomic request processing.<ref>Yan Solihin. Fundamentals of Parallel Computer Architecture. 2008-2009.</ref> | ||
Revision as of 02:50, 4 April 2012
Limitations of cache-coherent directory-based systems
While cache-coherent directory-based systems offer a more scalable alternative than both snoopy bus protocols and snoopy point-to-point protocols, they are not without their limitations. These limitations can be overcome to some extent, but this results in trade-offs on both the hardware and the software level.
High waiting time at memory operations
Sequential consistency affects performance in scalable design. Because of this, a typical directory-based cache coherence protocol, such as a DASH, will use a relaxed consistency protocol in order to reduce the need to wait for operations to complete. DASH's release consistency model means that processor isn't locked up after a write operation begins. Still, there can be some bottle necks since synchronization must occur at implicit or explicit fences.
Addressing these potential stalls requires additional complexity in the way the protocol is implemented. For example, simple counters on each processor to ensure that invalidation operations have reached all processors could result in a processor using a dirty cache line with outstanding invalidates. Instead, the counter must be at the cache line level, and, in the case of DASH, invalidates must be enforced at the second-level cache.<ref>http://web.cecs.pdx.edu/~alaa/ece588/papers/lenoski_isca_1990.pdf</ref>
Limited capacity for replication
Communicated data is automatically replicated only in the processor cache, not in local memory. This can lead to capacity misses and artifactual communication when working sets are large and include nonlocal data or when conflict misses are numerous.
One way to address this is to increase cache size in order to reduce misses, but there are limitations to how large a cache can be. For example, a typical full-map directory coherence for a 16 multicore processor with 64-byte blocks, 1MB L2 caches, and 64KB L1 caches would require over 288,000 directory entries. The directory size per core would then need to be 1.2MB to keep everything coherent, which would exceed the size of the L2 caches. While alternatives to full-map storage can be used, the use of these typically either increases both complexity and latency or reduces the ability to scale.
In addition, increasing cache size also increases access latency. Current caches are so large that having uniform access to the entire cache is no longer practical. Instead, the cache is broken up into slices, with the slices closest to the processer having the fastest access, while the ones further away requiring significantly more access time. This means that there are practical limits to how large a cache can be and still allow for a full-map directory.<ref>http://www.cs.cmu.edu/~hardav/papers/2009-ISCA-R-NUCA-Hardavellas.pdf</ref>
High design and implementation cost
Protocols are complex and getting them right in hardware take substantial design time. A simple directory-based protocol can be developed under various assumptions, namely that the directory state and sharing vector are always aware of current cache state and that there is no overlap between requests. These assumptions, however, do not generally hold true in real systems, and various means have been devised to handle these types of situations. The first situation can be handled by splitting up one of the states based on which node is believed to be the owner at the time. While the second situation can be handle by serializing requests, this tends to be very expensive from a performance standpoint. Additional complexity, therefore, must be built into the protocol design in order to allow for non-atomic request processing.<ref>Yan Solihin. Fundamentals of Parallel Computer Architecture. 2008-2009.</ref>
Intuition behind relaxed memory-consistency models
The System Specification
What program orders among memory operations are guaranteed to be preserved? If program order is not guaranteed to be preserved by default, what mechanisms does the system provide for a programmer to enforce order explicitly when desired?
The Programmer's Interface
The programmer's interface should provide certain rules for the programmer to follow so that (s)he doesn't have to worry about order-preserving mechanisms.
The Translation Mechanism
There needs to be a translation of the programmers' annotations to the interface exported by the system specification, so that the system may do its job.
References
<references></references>