Revision as of 09:58, 13 April 2010

Memory Consistency models

Introduction

The memory consistency model of a shared-memory multiprocessor provides a formal specification of how the memory system will appear to the programmer, eliminating the gap between the behavior expected by the programmer and the actual behavior supported by a system. Effectively, the consistency model places restrictions on the values that can be returned by a read in a shared-memory program execution.

This material gives a brief explanation about the intuition behind using relaxed memory consistency models for scalable design of multiprocessors. It also explains about the consistency models in real multiprocessor systems like Digital Alpha, Sparc V9 RMO, IBM Power PC, Intel Pentium, and processors from Sun Microsystems.

Sequential Consistency Model (SC)

To write correct and efficient shared memory programs, programmers need a precise notion of shared memory semantics. To ensure correct execution a programmer expects that the data value read should be the same as the latest value written to it in the system. However in many commercial shared memory systems,the processor may observe an older value, causing unexpected behavior.The Memory consistency model of a shared memory multiprocessor formaly specifies how the memory system will appera to the programmer.Essentially, a memory consistency model restricts the values that a read can return. Intuitively, a read should return the value of the "last" write to the same memory location. In uniprocessors, "last" is precisely defined by the sequential order specified by the program called program order. This is not the case in multiprocessors.A write and read of a variable, Data, are not related by program order because they reside on two different processors.

The uniprocessors model, however can be extented to apply to multiprocessors in a natural way. The resulting model is called Sequential consistency. In brief, sequential consistency requires that all memory operations

appear to execute one at a time,
all memory operations of a single processor appear to execute in the order described by that processor's program.

This model ensures that the reads of a variable, Data, will return the new values written to it by a processor. Sequential consistency provides a simple, intuitive programming model. Because of its strict consistency requirements,sequential consistency, many of the architecture and compiler optimizations used in uniprocessors are not safely applicable to sequentially consistent multiprocessors.[ For more deatils on sequential consistency model and its advantages/disadvantages refer to solihin textbook pg 284 through 292]. For this reason, many Relaxed consistency models have been proposed, most of which are supported by commercial architectures.

Performance of SC on multiprocessors

Sequential Consistency (SC) is the most intuitive programming interface for shared memory multiprocessors. A system implementing SC appears to execute memory operations one at a time and in program order. A program written for an SC system requires and relies on a specified memory behavior to execute correctly. Implementing memory accesses according to the SC model constraints, however, would adversely impact performance because memory accesses in shared-memory multiprocessors often incur prohibitively long latencies (tens of times longer than in uniprocessor systems). To enforce sequential consistency, illegal reordering caused by hardware optimizations like Write buffers, Write assembly caches, Non-blocking caches etc and compiler optimizations like code motion, register allocation, eliminating common subexpressions, loop transformations etc resulting in reordering are not allowed. These are the optimizations which are implemented for better performance and are valid in uniprocessors. But in case of multiprocessors, these do not allow to satisfy the sequential consistency requirements and hence are not allowed. This affects the performance.

A number of techniques have been proposed to enable the use of certain optimizations by the hardware and compiler without violating sequential consistency, those having the potential to substantially boost performance. Some of them are mentioned below:

Hardware optimization techniques:

Prefetching : It is hardware optimization technique in which the processor automatically prefetches ownership for any write operations that are delayed due to the program order requirement (e.g., by issuing prefetch-exclusive requests for any writes delayed

in the write buffer), thus partially overlapping the service of the delayed writes with the operations preceding them in program order. This technique is only applicable to cache-based systems that use an invalidation-based protocol. This technique is suitable for statically scheduled processors.

Speculative Reads : It is a hardware optimization technique in which read operations that are delayed due to the program order requirement are serviced speculatively ahead of time. Sequential consistency is guaranteed by simply rolling back and reissuing the read and subsequent operations in the infrequent case that the read line gets invalidated or updated before the read could have been issued in a more straightforward implementation. This is suitable for dynamically scheduled processors since much

of the roll back machinery is already present to deal with branch mispredictions.

More information about these two techniques can be found in this paper presented by Kourosh Gharachorloo, Anoop Gupta, and John Hennessy of Stanford University at International Conference on Parallel Processing, Two techniques to enhance the performance of memory consistency models

These two techniques of Prefetching and Speculative Reads are expected to be supported by several next generation microprocessors like MIPS R10000 and Intel P6, thus enabling more efficient hardware implementations of sequential consistency.

Software Optimization techniques

Shasha and Snir's agorithm : It is a compiler algorithm proposed by Dennis Shasha and Marc Snir and is used to detect when memory operations can be reordered without violating sequential consistency. It uses the technique where Sequential consistency can be enforced by delaying each access to shared memory until the previous access of the same processor has terminated. For performance reasons, it allows several accesses by the same processor to proceed concurrently. It performs an analysis to find a minimal set of delays that enforces sequential consistency. The analysis extends to interprocessor synchronization constraints and to code where blocks of operations have to execute atomically thus providing a new compiler optimization techniques for parallel languages that support shared variables.

In detail implementation of this algorithm can be studied from the paper : Efficient and correct execution of parallel programs that share memory

Compiler algorithm for SPMD (Single Program multiple data) programs : The algorithm proposed by Sasha and Snir has exponential complexity. This new algorithm simplified the cycle detection analysis used in their algorithm to achieve the job in polynomial time.

More information about this can be found in this paper by Arvind Krishnamurthy and Katherine Yelick presented at 7th International Workshop on Languages and Compilers for Parallel Computing Optimizing parallel SPMD programs

In general the performance of sequential consistency model on multiprocessors with shared memory is low. But with the use of above techniques, better performance can be achieved.

All of the above schemes mentioned to improve performance of SC allow a processor to overlap or reorder its memory accesses without software support, however, either require complex or restricted hardware (e.g., hardware prefetching and rollback) or the gains are expected to be small. Further, the optimizations of these schemes can be exploited by hardware (or the runtime system software), but cannot be exploited by compilers. Work related to compiler optimizations including that by Shasha and Snir motivate their work for hardware optimizations. Hardware costs are thus high to implement these techniques.

Hence researchers and vendors have alternatively relied on relaxed memory consistency models that augment the shared-address space programming interface with directives enabling software to inform hardware when memory ordering is necessary.

Relaxed Consistency Models

Uniprocessor optimizations such as write buffers and overlapped execution can violate sequential consistency. Therefore, general implementations of sequential consistency require a processor to execute its memory operations one at a time in program order, and writes are executed atomically. Many schemes discussed in the previous section do not require memory operations of a processor to be executed one at a time or atomically. However, these schemes either require restricting the network, or using complex hardware, or aggressive compiler technology. Even so, the performance gains possible are not known. For these reasons, there seems to be an emerging consensus in the industry to support relaxed memory models.

Different Relaxed consistency Models

=Weak Ordering

In a multiprocessor system, memory accesses are weakly ordered if

accesses to global synchronizing variables are strongly ordered
no access to a synchronizing variable is issued by a processor before all its preceding global data accesses have been globally performed
no access to global data is issued by a processor before its preceding access to a synchronizing variable has been globally performed.

@@ Line 53: / Line 53: @@
 Uniprocessor optimizations such as write buffers and overlapped execution can violate sequential consistency. Therefore, general implementations of sequential consistency require a processor to execute its memory operations one at a time in program order, and writes are executed atomically. Many schemes discussed in the previous section do not require memory operations of a processor to be executed one at a time or atomically. However, these schemes either require restricting the network, or using complex hardware, or aggressive compiler technology. Even so, the performance gains possible are not known. For these reasons, there seems to be an emerging consensus in the industry to support relaxed memory models.
+=='''Different Relaxed consistency Models'''==
+==='''Weak Ordering'''==
+In a multiprocessor system, memory accesses are weakly ordered if
+* accesses to global synchronizing variables are strongly ordered
+* no access to a synchronizing variable is issued by a processor before all its preceding global data accesses have been globally performed
+* no access to global data is issued by a processor before its preceding access to a synchronizing variable has been globally performed.
+==='''Processor Consistency'''===
+==='''Release Consistency'''===
+=='''Performance of Relaxed Consistency Models'''==
 = '''Consistency in Real Machines''' =

CSC/ECE 506 Spring 2010/chapter 10: Difference between revisions

Revision as of 09:58, 13 April 2010

Contents

Memory Consistency models

Introduction

Sequential Consistency Model (SC)

Performance of SC on multiprocessors

Relaxed Consistency Models

Different Relaxed consistency Models

=Weak Ordering

Processor Consistency

Release Consistency

Performance of Relaxed Consistency Models

Consistency in Real Machines

References

Navigation menu

CSC/ECE 506 Spring 2010/chapter 10: Difference between revisions

Revision as of 09:58, 13 April 2010

Memory Consistency models

Introduction

Sequential Consistency Model (SC)

Performance of SC on multiprocessors

Relaxed Consistency Models

Different Relaxed consistency Models

=Weak Ordering

Processor Consistency

Release Consistency

Performance of Relaxed Consistency Models

Consistency in Real Machines

References

Navigation menu

Search