=Memory Consistency Models=
==Introduction==
In a parallel environment, efficient access to shared memory is crucial for good performance. Memory consistency models provide a set of rules describing how memory operations may interact, allowing the programmer to expect a certain outcome. Without any consistency guarantees, most parallel programs would be extremely non-deterministic.
This chapter covers various models of consistency and their advantages, as well as which models are used by actual systems and their performance.
==Sequential Consistency==
Sequential consistency is the simplest memory consistency model, based on the operation of a single-processor system. This model, as formally defined by Lamport<ref>
Leslie Lamport. How to make a multiprocessor computer that correctly executes multiprocess programs. IEEE Transactions on Computers, C-28(9):690–691, September 1979.</ref>, can be stated with two rules:
* Memory operations must appear to occur atomically
* Operations on a single processor must appear to operate in program order
These two simple rules create a strong guarantee of memory consistency and allow programmers to reason about the probable outcomes of a program. They do not ensure deterministic execution, since operations on different processors can occur simultaneously and their relative order can vary between runs, but they greatly reduce the number of possible outcomes. If no consistency were guaranteed, there would be no way to know in which order operations might execute, and such reasoning would be impossible.
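
For example, consider the classic store-buffering litmus test, sketched below in C++ using std::atomic operations, which are sequentially consistent by default. Under sequential consistency at least one thread must observe the other's write, so the outcome r1 == 0 and r2 == 0 is impossible.
<source lang="cpp">
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1, r2;

void thread1() {
    x.store(1);     // sequentially consistent by default
    r1 = y.load();  // must appear after the store in the global order
}

void thread2() {
    y.store(1);
    r2 = x.load();
}

int main() {
    std::thread t1(thread1), t2(thread2);
    t1.join();
    t2.join();
    // Under sequential consistency, (r1, r2) can be (0,1), (1,0), or (1,1),
    // but never (0,0): some store must precede both loads.
    printf("r1=%d r2=%d\n", r1, r2);
    return 0;
}
</source>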
==Relaxed Consistency==
Sequential consistency is easy to understand and implement, but it is fairly restrictive and can reduce performance. In some cases SC is stronger than necessary, and a more relaxed model can be used in order to achieve greater performance. These relaxed consistency models allow a greater amount of reordering while still providing enough ordering guarantees to reason about a program.
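
As a sketch of what a relaxed model permits, the following C++ fragment labels only the one ordering the program actually depends on, using release and acquire operations; all other accesses remain free to be reordered by the compiler and the hardware.
<source lang="cpp">
#include <atomic>
#include <cstdio>
#include <thread>

int payload = 0;               // ordinary, non-atomic data
std::atomic<bool> ready{false};

void producer() {
    payload = 42;              // may be freely reordered with other work...
    ready.store(true, std::memory_order_release);  // ...but not past this store
}

void consumer() {
    while (!ready.load(std::memory_order_acquire)) { /* spin */ }
    printf("%d\n", payload);   // the acquire/release pair guarantees 42
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
    return 0;
}
</source>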
=Limitations of cache-coherent directory-based systems=
While cache-coherent directory-based systems offer a more scalable alternative to both snoopy bus protocols and snoopy point-to-point protocols, they are not without their limitations. These limitations can be overcome to some extent, but doing so results in trade-offs at both the hardware and the software level.

==High waiting time at memory operations==

Sequential consistency limits performance in scalable designs. Because of this, a typical directory-based cache coherence protocol, such as DASH, uses a relaxed consistency model in order to reduce the need to wait for operations to complete. DASH's release consistency model means that the processor is not stalled while a write operation completes. Still, bottlenecks can remain, since synchronization must occur at implicit or explicit fences.
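
The C++ sketch below illustrates the idea of paying for ordering only at fences rather than at every write; the fence operations shown are standard C++ primitives used for illustration, not DASH hardware primitives.
<source lang="cpp">
#include <atomic>

std::atomic<int> lock_word{0};
int shared_data = 0;

void critical_section() {
    while (lock_word.exchange(1, std::memory_order_relaxed)) { /* spin */ }
    std::atomic_thread_fence(std::memory_order_acquire);   // fence on entry

    shared_data++;   // writes inside the section may complete in any order;
                     // the processor does not stall on each one

    std::atomic_thread_fence(std::memory_order_release);   // wait here instead,
    lock_word.store(0, std::memory_order_relaxed);         // then release the lock
}
</source>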

Addressing these potential stalls requires additional complexity in the way the protocol is implemented. For example, keeping only a simple counter on each processor to ensure that invalidation operations have reached all processors could allow a processor to use a dirty cache line that still has outstanding invalidations. Instead, the counter must be kept at the cache-line level, and, in the case of DASH, invalidations must be enforced at the second-level cache.<ref>http://web.cecs.pdx.edu/~alaa/ece588/papers/lenoski_isca_1990.pdf</ref>
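
The hypothetical sketch below illustrates per-line tracking; the structure and field names are invented for illustration and do not reflect the actual DASH implementation.
<source lang="cpp">
#include <bitset>

constexpr int kProcessors = 16;

struct DirectoryEntry {
    std::bitset<kProcessors> sharers;  // which caches hold this line
    int pending_invals = 0;            // acks still outstanding for THIS line
    bool dirty = false;
};

// A write must not be considered complete while its own line still has
// unacknowledged invalidations; acks for other lines are irrelevant to it.
bool write_can_complete(const DirectoryEntry& e) {
    return e.pending_invals == 0;
}

void on_invalidate_ack(DirectoryEntry& e, int proc) {
    e.sharers.reset(proc);
    --e.pending_invals;  // a single per-processor counter could not make
                         // this per-line distinction
}
</source>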

==Limited capacity for replication==

Communicated data is automatically replicated only in the processor cache, not in local memory. This can lead to capacity misses and artifactual communication when working sets are large and include nonlocal data or when conflict misses are numerous.

One way to address this is to increase cache size in order to reduce misses, but there are limits to how large a cache can be. For example, a typical full-map directory protocol for a 16-core processor with 64-byte blocks, 1MB L2 caches, and 64KB L1 caches would require over 288,000 directory entries. The directory size per core would then need to be 1.2MB to keep everything coherent, which would exceed the size of the L2 caches. While alternatives to full-map storage can be used, these typically either increase both complexity and latency or reduce the ability to scale.
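
The entry count can be checked with a back-of-envelope calculation. The sketch below assumes split 64KB instruction and data L1 caches, an assumption the text does not state, so it is only a plausible reconstruction of the arithmetic.
<source lang="cpp">
#include <cstdio>

int main() {
    const long block    = 64;                        // bytes per cache block
    const long l2_lines = 1L * 1024 * 1024 / block;  // 16,384 lines per 1MB L2
    const long l1_lines = 2 * (64L * 1024 / block);  // 2,048 lines per split L1 pair
    const long cores    = 16;

    // One directory entry per cached line across all cores:
    const long entries = cores * (l2_lines + l1_lines);
    printf("directory entries: %ld\n", entries);     // 294,912 -- over 288,000
    return 0;
}
</source>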

In addition, increasing cache size also increases access latency. Current caches are so large that uniform access to the entire cache is no longer practical. Instead, the cache is broken up into slices, with the slices closest to the processor having the fastest access, while those farther away require significantly more access time. This means that there are practical limits to how large a cache can be and still allow for a full-map directory.<ref>http://www.cs.cmu.edu/~hardav/papers/2009-ISCA-R-NUCA-Hardavellas.pdf</ref>

==High design and implementation cost==

Protocols are complex, and getting them right in hardware takes substantial design time. A simple directory-based protocol can be developed under certain assumptions, namely that the directory state and sharing vector always reflect the current cache state and that requests never overlap. These assumptions, however, do not generally hold in real systems, and various means have been devised to handle such situations.

The first situation can be handled by splitting a state based on which node is believed to be the owner at the time. While the second situation can be handled by serializing requests, this tends to be very expensive from a performance standpoint. Additional complexity must therefore be built into the protocol design in order to allow for non-atomic request processing.<ref>Yan Solihin. Fundamentals of Parallel Computer Architecture. 2008-2009.</ref>
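
One common way to support non-atomic request processing is to add transient states to the directory, as in the illustrative sketch below; the state names are hypothetical rather than taken from any particular protocol.
<source lang="cpp">
// Stable states plus transient states recording an in-flight request, so
// that the directory can NAK or queue overlapping requests instead of
// serializing all request processing.
enum class DirState {
    Uncached,
    Shared,
    Exclusive,
    Shared2Exclusive,   // upgrade in progress, invalidation acks pending
    Exclusive2Shared    // downgrade in progress, writeback pending
};
</source>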

=Intuition behind relaxed memory-consistency models=

==The System Specification==

The system specification must answer two questions. What program orders among memory operations are guaranteed to be preserved? If program order is not guaranteed to be preserved by default, what mechanisms does the system provide for a programmer to enforce order explicitly when desired?

==The Programmer's Interface==

The programmer's interface should provide a set of rules for the programmer to follow so that they do not have to reason directly about the system's order-preserving mechanisms.
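
For example, under a contract that guarantees sequentially consistent behavior for data-race-free programs, the programmer only needs to protect shared data with properly labeled synchronization operations, as in this minimal C++ sketch.
<source lang="cpp">
#include <mutex>

std::mutex m;
int balance = 0;

void deposit(int amount) {
    std::lock_guard<std::mutex> guard(m);  // the lock is the labeled
    balance += amount;                     // synchronization; the library and
}                                          // hardware supply any needed fences
</source>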

==The Translation Mechanism==

There needs to be a translation of the programmers' annotations to the interface exported by the system specification, so that the system may do its job.
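
As an illustration, a compiler performs this translation when it maps source-level orderings to machine instructions. The mappings noted in the comments below are typical ones and depend on the compiler and architecture revision.
<source lang="cpp">
#include <atomic>

std::atomic<int> flag{0};

int read_flag() {
    // x86-64: a plain MOV suffices (loads already have acquire semantics)
    // ARMv8:  LDAR, the load-acquire instruction
    return flag.load(std::memory_order_acquire);
}

void set_flag() {
    // x86-64: XCHG, or MOV followed by MFENCE
    // ARMv8:  STLR, the store-release instruction
    flag.store(1, std::memory_order_seq_cst);
}
</source>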

=References=

<references></references>