CSC/ECE 506 Spring 2012/ch10 sj
Limitations of cache-coherent directory-based systems
While cache-coherent directory-based systems offer a more scalable alternative to both snoopy bus protocols and snoopy point-to-point protocols, they are not without their limitations. These limitations can be overcome to some extent, but doing so involves trade-offs at both the hardware and the software level.
High waiting time at memory operations
Enforcing sequential consistency limits performance in a scalable design. Because of this, a typical directory-based cache coherence protocol, such as DASH, uses a relaxed consistency model to reduce the time spent waiting for operations to complete. DASH's release consistency model means that a processor is not stalled after a write operation begins. Still, bottlenecks can remain, since synchronization must occur at implicit or explicit fences.
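The difference between an ordinary write and a release can be illustrated with a toy software model. This is only a sketch under stated assumptions, not DASH's hardware: the class `ToyProcessor`, its `pending` list, and the `stalls` counter are hypothetical names, and "draining" the list stands in for waiting until all outstanding writes have been acknowledged.

```python
class ToyProcessor:
    """Toy model: ordinary writes never stall; only a release waits."""

    def __init__(self, memory):
        self.memory = memory   # shared dict standing in for main memory
        self.pending = []      # writes issued but not yet globally performed
        self.stalls = 0        # how many times the processor had to wait

    def write(self, addr, value):
        # Ordinary write: queue it and keep executing.
        self.pending.append((addr, value))

    def drain(self):
        # Stand-in for waiting on acknowledgements of all earlier writes.
        if self.pending:
            self.stalls += 1
        for addr, value in self.pending:
            self.memory[addr] = value
        self.pending.clear()

    def release(self, flag_addr, value):
        # Release: earlier writes must be performed before the flag is visible.
        self.drain()
        self.memory[flag_addr] = value


memory = {"data_a": 0, "data_b": 0, "flag": 0}
p = ToyProcessor(memory)
p.write("data_a", 1)
p.write("data_b", 2)
visible_before_release = dict(memory)  # snapshot: neither write is visible yet
p.release("flag", 1)                   # the single point where the model waits
```

In this model the two data writes cost no waiting at all; the only stall is at the release, which corresponds to the fence mentioned above.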
Addressing these potential stalls requires additional complexity in the way the protocol is implemented. For example, simple counters on each processor to ensure that invalidation operations have reached all processors could result in a processor using a dirty cache line with outstanding invalidates. Instead, the counter must be at the cache line level, and, in the case of DASH, invalidates must be enforced at the second-level cache.<ref>http://web.cecs.pdx.edu/~alaa/ece588/papers/lenoski_isca_1990.pdf</ref>
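The following sketch illustrates why the count of outstanding invalidations is more useful per cache line than per processor. It is a software illustration only, assuming a hypothetical `InvalidationTracker` class; it does not reproduce the DASH hardware described in the cited paper.

```python
from collections import defaultdict


class InvalidationTracker:
    """Tracks invalidation acknowledgements still expected, per cache line."""

    def __init__(self):
        # line address -> number of acknowledgements not yet received
        self.outstanding = defaultdict(int)

    def write_issued(self, line, num_sharers):
        # A write to a shared line sends one invalidation to each sharer.
        self.outstanding[line] += num_sharers

    def ack_received(self, line):
        if self.outstanding.get(line, 0) == 0:
            raise ValueError("unexpected acknowledgement for line %r" % (line,))
        self.outstanding[line] -= 1

    def line_ready(self, line):
        # The line may be used safely once all of its invalidations are acked.
        return self.outstanding.get(line, 0) == 0

    def fence_ready(self):
        # A fence may complete only when no line has invalidations in flight.
        return all(count == 0 for count in self.outstanding.values())


tracker = InvalidationTracker()
tracker.write_issued(0x40, num_sharers=2)
tracker.write_issued(0x80, num_sharers=1)
tracker.ack_received(0x80)

ready_80 = tracker.line_ready(0x80)   # all invalidations for 0x80 are acked
ready_40 = tracker.line_ready(0x40)   # two invalidations are still in flight
fence_mid = tracker.fence_ready()

tracker.ack_received(0x40)
tracker.ack_received(0x40)
fence_end = tracker.fence_ready()
```

A single per-processor counter would hold the value 2 after the first acknowledgement and could not tell that line `0x80` is already safe while line `0x40` is not; the per-line counts make that distinction explicit.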
Limited capacity for replication
Communicated data is automatically replicated only in the processor cache, not in local memory. This can lead to capacity misses and artifactual communication when working sets are large and include nonlocal data or when conflict misses are numerous.
High design and implementation cost
Protocols are complex, and getting them right in hardware takes substantial design time. A simple directory-based protocol can be developed under two simplifying assumptions: that the directory state and sharing vector always reflect the current cache state, and that requests do not overlap. These assumptions, however, do not generally hold in real systems, and various means have been devised to handle the cases where they fail. The first case can be handled by splitting up one of the states based on which node is believed to be the owner at the time. While the second case can be handled by serializing requests, this tends to be very expensive from a performance standpoint. Additional complexity, therefore, must be built into the protocol design in order to allow for non-atomic request processing.<ref>Yan Solihin. Fundamentals of Parallel Computer Architecture. 2008-2009.</ref>
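To make the simple case concrete, here is a minimal sketch of a directory that keeps a state and a sharing vector per block. It relies on exactly the two simplifying assumptions above: the directory always knows the true cache state, and each request is processed atomically. The class names, the three states (U for uncached, S for shared, EM for exclusive or modified), the message names, and the choice to send a read of an uncached block to S are all assumptions made for illustration, not the protocol from the cited text.

```python
class DirectoryEntry:
    def __init__(self, num_nodes):
        self.state = "U"                     # U, S, or EM
        self.sharers = [False] * num_nodes   # sharing vector, one bit per node


class Directory:
    def __init__(self, num_nodes):
        self.num_nodes = num_nodes
        self.entries = {}

    def _entry(self, block):
        if block not in self.entries:
            self.entries[block] = DirectoryEntry(self.num_nodes)
        return self.entries[block]

    def read(self, block, node):
        """Handle a read request; return the messages the directory sends."""
        entry = self._entry(block)
        messages = []
        if entry.state == "EM":
            owner = entry.sharers.index(True)
            if owner != node:
                # Ask the owner to supply the data and downgrade to shared.
                messages.append(("intervention", owner))
                entry.state = "S"
        else:
            entry.state = "S"
        entry.sharers[node] = True
        return messages

    def write(self, block, node):
        """Handle a write request; invalidate every other cached copy."""
        entry = self._entry(block)
        messages = [("invalidate", n)
                    for n in range(self.num_nodes)
                    if entry.sharers[n] and n != node]
        entry.state = "EM"
        entry.sharers = [n == node for n in range(self.num_nodes)]
        return messages


d = Directory(num_nodes=4)
d.read("B", 0)
d.read("B", 1)
write_msgs = d.write("B", 2)   # nodes 0 and 1 must be invalidated
read_msgs = d.read("B", 3)     # node 2 owns the block and must supply it
final = d.entries["B"]
```

Because each call runs to completion before the next begins, the sketch never sees a request arrive while invalidations from an earlier one are still in flight. Handling that overlap is where the additional protocol complexity described above comes from.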
Intuition behind relaxed memory-consistency models
The System Specification
What program orders among memory operations are guaranteed to be preserved? If program order is not guaranteed to be preserved by default, what mechanisms does the system provide for a programmer to enforce order explicitly when desired?
The Programmer's Interface
The programmer's interface should provide a set of rules that, if followed, free the programmer from having to reason about the underlying order-preserving mechanisms.
The Translation Mechanism
The programmer's annotations must be translated to the interface exported by the system specification, so that the system can enforce the orderings the programmer intended.
References
<references></references>