Expertiza_Wiki - User contributions [en]

CSC/ECE 506 Spring 2012/10b sr

2012-04-05T08:51:50Z

Sshanbh: /* Practical performance impact */

'''Use of consistency models in current multiprocessors'''

The memory consistency model of a shared memory system determines the order in which memory operations appear to the programmer to execute. This article describes how consistency is used in today's multiprocessors and then examines the details of popular consistency models. The impact of these models on multiprocessor performance is also discussed.

== Introduction ==

Many modern computer systems and most multicore chips support [http://en.wikipedia.org/wiki/Shared_memory shared memory] in hardware. In a shared memory system, each of the processor cores may read and write to a single shared address space. These designs seek various goodness properties, such as high performance, low power, and low cost. Of course, it is not valuable to provide these goodness properties without first providing correctness. Correct shared memory seems intuitive at a hand-wave level, but there are subtle issues in even defining what it means for a shared memory system to be correct, as well as many subtle corner cases in designing a correct shared memory implementation. Moreover, these subtleties must be mastered in hardware implementations, where bug fixes are expensive.

It is the job of consistency to define shared memory correctness. Consistency definitions provide rules about loads and stores (or memory reads and writes) and how they act upon memory. Ideally, consistency definitions would be simple and easy to understand. However, defining what it means for shared memory to behave correctly is more subtle than defining the correct behavior of, for example, a single-threaded processor core. The correctness criterion for a single processor core partitions behavior between one correct result and many incorrect alternatives. This is because the processor’s architecture mandates that the execution of a thread transforms a given input state into a single well-defined output state, even on an out-of-order core. Shared memory consistency models, however, concern the loads and stores of multiple threads and usually allow many correct executions while disallowing many incorrect ones. The possibility of multiple correct executions is due to the ISA allowing multiple threads to execute concurrently, often with many possible legal interleavings of instructions from different threads. The multitude of correct executions complicates the erstwhile simple challenge of determining whether an execution is correct. Nevertheless, consistency must be mastered to implement shared memory and, in some cases, to write correct programs that use it.

== Consistency in current-day multiprocessors ==

Today's scalable multiprocessors are mostly built with a distributed shared memory architecture. The memory is physically distributed but logically shared. In other words, the address spaces generated by all processing nodes globally form a single address space. The main advantage of such a system lies in combining the scalability of distributed hardware with the programmability of a shared [http://en.wikipedia.org/wiki/Virtual_memory virtual memory]. Representative systems include the Cray T3D, the Stanford Dash, the MIT Alewife, the Tera Computer, and the Convex SPP.

A scalable system should be able to hide the long latencies of remote memory accesses. Several latency hiding techniques have been proposed. Coherent caches reduce the frequency of remote memory accesses by caching data close to the processor. Relaxed memory consistency allows reordering of memory events and buffering or pipelining of remote memory accesses. Data prefetching attempts to hide long read latency by issuing read requests well ahead of time, with the expectation that the data will be available in the cache when it is referenced. [http://en.wikipedia.org/wiki/Multithreading_(computer_architecture) Multithreading] attempts to hide the long latency by context switching between several active threads, thus allowing the processor to perform useful work while waiting for remote requests or synchronization faults to complete.

== Consistency models used ==

=== Address translation aware memory consistency <ref>http://people.ee.duke.edu/~sorin/papers/ieeemicro11_toppick.pdf</ref> ===

These memory consistency models define the behavior of operations (loads, stores, memory barriers, etc.) on physical addresses and virtual addresses. The two important levels of memory consistency that can be classified as address translation aware are described below:

==== Physical address memory consistency (PAMC)====
Correct PAMC is necessary for unmapped code to work correctly. Unmapped software, including the boot code and the part of the system software that manages address translation (AT), relies upon PAMC. It is the responsibility of the hardware to implement PAMC, and the model is specified precisely in the architectural manual.
An AT-oblivious consistency model, that is, one defined without reference to address translation, can be adapted as the specification of PAMC without much difficulty.

Example:
The PAMC model could be SC. In such a case the interface would specify that
(i) there must exist a total order of all loads and stores to physical addresses that respects the program order of each thread and
(ii) the value of each load is equal to the value of the most recent store to that physical address in the total order.
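
Conditions (i) and (ii) can be checked mechanically on a small program. The following is a minimal sketch, not part of the cited specification: it enumerates every total order of a two-thread program that respects program order and lets each load return the most recent store to its address. The addresses <code>PA_X</code> and <code>PA_Y</code>, the register names, and the helper functions are illustrative assumptions, and memory is assumed to start at 0.

```python
from itertools import combinations

def interleavings(t0, t1):
    """Yield every merge of two operation lists that keeps each list's own order."""
    n, m = len(t0), len(t1)
    for slots in combinations(range(n + m), n):
        slots = set(slots)
        i = j = 0
        order = []
        for k in range(n + m):
            if k in slots:
                order.append((0, t0[i]))
                i += 1
            else:
                order.append((1, t1[j]))
                j += 1
        yield order

def run(order):
    """Execute one total order; each load returns the most recent store (condition ii)."""
    memory = {}   # physical address -> value, implicitly 0 if never stored (assumption)
    loads = {}    # (thread, register) -> value read
    for thread, (kind, addr, arg) in order:
        if kind == "st":
            memory[addr] = arg
        else:
            loads[(thread, arg)] = memory.get(addr, 0)
    return loads

# Hypothetical two-thread program on physical addresses:
# thread 0 stores 1 to PA_X, then loads PA_Y into r0
# thread 1 stores 1 to PA_Y, then loads PA_X into r1
T0 = [("st", "PA_X", 1), ("ld", "PA_Y", "r0")]
T1 = [("st", "PA_Y", 1), ("ld", "PA_X", "r1")]

def sc_outcomes(t0, t1):
    """Collect the (r0, r1) results of all orders allowed by condition (i)."""
    results = set()
    for order in interleavings(t0, t1):
        loads = run(order)
        results.add((loads[(0, "r0")], loads[(1, "r1")]))
    return results
```

Under these assumptions the outcome in which both loads return 0 never appears, because one of the two stores must come first in any total order.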

==== Virtual address memory consistency (VAMC)====
Correct VAMC is required for mapped code to work correctly.

Although adapting an AT-oblivious consistency model for PAMC is straightforward, there are a few challenges when adapting an AT-oblivious consistency model for VAMC:
* synonyms - Multiple [http://en.wikipedia.org/wiki/Virtual_address_space virtual addresses] may map to the same [http://en.wikipedia.org/wiki/Physical_address physical address]. Suppose two virtual addresses VA1 and VA2 map to the same physical address PA. SC requires that the value of a load is equal to the value of the most recent store to the same address. A naive definition of VAMC that ignores the level of indirection introduced by AT would therefore allow a load from VA2 to miss the most recent store to VA1, even though both name the same physical location. To overcome this challenge, we re-formulate AT-oblivious consistency models for VAMC by applying the model to synonym sets of virtual addresses rather than individual addresses. For instance, we can define SC for VAMC as follows: there must exist a total order of all loads and stores to virtual addresses that respects program order and in which each load gets the value of the most recent store to any virtual address in the same virtual address synonym set. Incorporating synonyms explicitly in the consistency model allows programmers to reason about the ordering of accesses to virtual addresses.
* mapping and permission changes - Another challenge is that the set of memory operations at the VAMC level is richer than at the PAMC level: in addition to loads and stores, it includes operations that change address mappings and access permissions, and VAMC must specify how these are ordered.
* load/store side effects - Yet another challenge in specifying VAMC is that loads and stores to virtual addresses have certain side effects. The AT system includes status bits such as accessed and dirty bits for each page table entry. These status bits are part of the architectural state, and the ordering of updates to those bits must thus be specified in VAMC. To achieve this we add two new operations to the specification tables: Ld-sb (load’s impact on status bits) and St-sb (store’s impact on status bits).
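
The synonym problem described in the list above can be illustrated with a small sketch. It is a toy model under stated assumptions, not the formulation from the cited paper: the class names, the dictionary-based page table, and the default value 0 are all hypothetical. The naive memory keys values by virtual address, while the synonym-aware memory resolves each virtual address to its physical address first, so all members of a synonym set see the same store.

```python
class NaiveMemory:
    """Keys values by virtual address and ignores address translation."""

    def __init__(self):
        self.cells = {}  # virtual address -> value

    def store(self, va, value):
        self.cells[va] = value

    def load(self, va):
        return self.cells.get(va, 0)


class SynonymAwareMemory:
    """Resolves every access through the page table before touching memory."""

    def __init__(self, page_table):
        self.page_table = page_table  # virtual address -> physical address
        self.cells = {}               # physical address -> value

    def store(self, va, value):
        self.cells[self.page_table[va]] = value

    def load(self, va):
        return self.cells.get(self.page_table[va], 0)


# VA1 and VA2 are synonyms: both map to the same physical address PA.
page_table = {"VA1": "PA", "VA2": "PA"}

naive = NaiveMemory()
naive.store("VA1", 7)

aware = SynonymAwareMemory(page_table)
aware.store("VA1", 7)
```

In this sketch <code>naive.load("VA2")</code> still returns the initial value, whereas <code>aware.load("VA2")</code> returns the value stored through VA1, which is the behavior the synonym-set definition of SC requires.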

=== Causal consistency ===
[[Image:causal.jpg|thumb|right|300px|Causal Consistency example]]

Hutto and Ahamad <ref>P.W. Hutto and M. Ahamad. Slow memory: Weakening consistency to enhance concurrency in distributed shared memories. In Proceedings of the 10th International Conference on Distributed Computing Systems, pages 302–311, May 1990.</ref> introduced causal consistency. Lamport <ref>Leslie Lamport. Time, clocks, and the ordering of events in a distributed system. Communications of the ACM, 21(7):558–565, 1978.</ref> defined the notion of potential causality to capture the flow of information in a distributed system. This notion can be applied to a memory system by interpreting a write as a message-send event and a read as a message-read event. A memory is causally consistent if all processors agree on the order of causally related events. Causally unrelated events (concurrent events) can be observed in different orders.
The example shown here is a legal execution history under CC but not under SC. Note that W(x)1 and W(x)2 are causally related as P2 observed the first write by P1. Furthermore, P3 and P4 observe the accesses W(x)2 and W(x)3 in different orders, which would not be legal in SC.
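
The rule that causally related writes must be seen in the same order everywhere can be sketched as a small checker. This is a simplified illustration rather than a full definition of causal consistency: the function name is hypothetical, the caller is assumed to supply the complete (transitively closed) set of causally ordered write pairs, and the observation orders for P3 and P4 below are assumptions chosen to match the description of the figure.

```python
def is_causally_consistent(causal_pairs, observed):
    """Return True if no processor sees a causally later write before an earlier one.

    causal_pairs: iterable of (earlier_write, later_write)
    observed: dict mapping processor name -> list of writes in the order it saw them
    """
    for order in observed.values():
        position = {write: index for index, write in enumerate(order)}
        for earlier, later in causal_pairs:
            if earlier in position and later in position:
                if position[earlier] > position[later]:
                    return False
    return True


# W(x)1 causally precedes W(x)2; W(x)2 and W(x)3 are concurrent.
causal_pairs = [("W(x)1", "W(x)2")]

# Assumed observation orders: P3 and P4 disagree only on the concurrent writes.
legal_views = {
    "P3": ["W(x)1", "W(x)3", "W(x)2"],
    "P4": ["W(x)1", "W(x)2", "W(x)3"],
}

# A view that reverses the causally related pair is rejected.
illegal_views = {
    "P3": ["W(x)2", "W(x)1", "W(x)3"],
    "P4": ["W(x)1", "W(x)2", "W(x)3"],
}
```

The first set of views passes the check even though P3 and P4 disagree, which is exactly what separates causal consistency from SC.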

=== Delta consistency <ref>http://en.wikipedia.org/wiki/Delta_consistency</ref>===

The delta consistency model states that after a fixed time period δ, an update is propagated through the system and all replicas will be consistent. In other words, barring a short bounded interval after a modification, the result of any read operation is consistent with a read on the original copy of an object. If an object is modified, the read will not be consistent during the short period of time following its modification. Once the fixed time period elapses, the modification is propagated and the read is now consistent.
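
The bounded staleness window can be sketched with a logical clock. This is a toy model under stated assumptions: the class and method names are hypothetical, time is a plain number supplied by the caller, writes are assumed to be issued in nondecreasing time order, and the replica is modeled in the worst case, where it lags the original copy by exactly δ.

```python
class DeltaReplicatedValue:
    """One object with an original copy and a worst-case replica that lags by delta."""

    def __init__(self, initial, delta):
        self.delta = delta
        # (write_time, value) pairs in time order; the initial value is always visible.
        self.history = [(float("-inf"), initial)]

    def write(self, now, value):
        self.history.append((now, value))

    def _value_at(self, time):
        visible = [value for written, value in self.history if written <= time]
        return visible[-1]

    def read_primary(self, now):
        """Read the original copy, which reflects every write up to now."""
        return self._value_at(now)

    def read_replica(self, now):
        """Read a replica that only reflects writes that are at least delta old."""
        return self._value_at(now - self.delta)


obj = DeltaReplicatedValue(initial=0, delta=3)
obj.write(now=10, value=5)
```

With these numbers, a replica read at time 11 still returns the old value, while a replica read at time 13 or later agrees with the original copy.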

=== Entry consistency <ref>http://cs.gmu.edu/cne/modules/dsm/orange/entry_con.html</ref>===

This consistency model has been designed to be used with critical sections. Here, the programmer needs to use acquire and release at the start and end of each [http://en.wikipedia.org/wiki/Critical_section critical section], just as in both variants of [http://en.wikipedia.org/wiki/Release_consistency release consistency]. However, it also requires every ordinary shared variable to be associated with a synchronization variable such as a lock or a barrier. If the elements of an array need to be accessed independently in parallel, then each element of the array must be associated with its own lock. When an acquire is done on a synchronization variable, only those ordinary shared variables guarded by that synchronization variable are made consistent. Release consistency does not associate shared variables with locks or barriers and at acquire time has to determine empirically which variables it needs. This is where entry consistency differs from release consistency.

Formally, a memory exhibits entry consistency if the following conditions are met:

* An acquire access of a synchronization variable is not allowed to perform with respect to a process until all updates to the guarded shared data have been performed with respect to that process.
* Before an exclusive mode access to a synchronization variable by a process is allowed to perform with respect to that process, no other process may hold the synchronization variable, not even in nonexclusive mode.
* After an exclusive mode access to a synchronization variable has been performed, any other process's next nonexclusive mode access to that synchronization variable may not be performed until it has performed with respect to that variable's owner.
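
The association between a synchronization variable and the data it guards can be sketched as follows. This is a simplified single-threaded illustration, not a real distributed shared memory implementation: the <code>Node</code> and <code>GuardLock</code> classes are hypothetical, only exclusive mode is modeled, and the lock itself is assumed to carry the latest values of the guarded variables.

```python
class Node:
    """A process with its own, possibly stale, copy of the shared variables."""

    def __init__(self, name):
        self.name = name
        self.local = {}


class GuardLock:
    """Synchronization variable that guards an explicit set of shared variables."""

    def __init__(self, guarded):
        self.guarded = set(guarded)
        self.latest = {}    # values of the guarded variables as of the last release
        self.holder = None

    def acquire(self, node):
        # Exclusive mode: no other process may hold the synchronization variable.
        assert self.holder is None, "lock is already held"
        self.holder = node
        # Only the guarded variables are made consistent at acquire time.
        node.local.update(self.latest)

    def release(self, node):
        assert self.holder is node, "only the holder may release"
        self.latest = {v: node.local[v] for v in self.guarded if v in node.local}
        self.holder = None


a, b = Node("A"), Node("B")
lock_x = GuardLock(guarded=["x"])

lock_x.acquire(a)
a.local["x"] = 1      # guarded by lock_x
a.local["y"] = 2      # not guarded by lock_x
lock_x.release(a)

lock_x.acquire(b)     # B now sees x, but nothing is said about y
```

After B's acquire, <code>b.local</code> contains the new value of <code>x</code> only; the unguarded variable <code>y</code> is not propagated by this synchronization variable.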

=== Eventual consistency <ref>http://en.wikipedia.org/wiki/Eventual_consistency</ref>===

Given a sufficiently long period of time over which no changes are sent, all updates can be expected to propagate eventually through the system. Thus, all the replicas will be consistent. As the consistency achieved is eventual, the possibility of conflicts is high. The conflicts have to be resolved. There are three types of resolution:
* Read repair: The correction is done when a read finds an inconsistency. This slows down the read operation.
* Write repair: The correction is done when a write operation finds an inconsistency, slowing down the write operation.
* Asynchronous repair: The correction is not part of a read or write operation.
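
Read repair, the first option in the list above, can be sketched in a few lines. The sketch rests on assumptions that are not in the text: each replica is a dictionary mapping a key to a <code>(version, value)</code> pair, conflicts are resolved by taking the highest version number, and at least one replica holds the key. The function name is hypothetical.

```python
def read_with_repair(replicas, key):
    """Return the newest value for key and overwrite stale or missing copies."""
    # Assumed conflict rule: the copy with the highest version number wins.
    newest = max((r[key] for r in replicas if key in r), key=lambda vv: vv[0])
    for replica in replicas:
        if replica.get(key) != newest:
            replica[key] = newest   # the repair work happens on the read path
    return newest[1]


replicas = [
    {"k": (2, "new")},
    {"k": (1, "old")},   # stale copy
    {},                  # replica that never received the update
]
value = read_with_repair(replicas, "k")
```

The extra loop over the replicas is the cost mentioned above: the read is slowed down because it also carries out the correction.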

=== Linearizability <ref>http://en.wikipedia.org/wiki/Linearizability</ref>===

This is also known as strict or atomic consistency. An operation is linearizable if it appears to the rest of the system to occur instantaneously. An atomic operation either occurs completely or does not occur at all. In reality, atomic operations do not actually occur instantaneously. Atomicity is enforced by mutual exclusion. At the software level, locks or semaphores are used to achieve this, while at the hardware level, a cache coherency protocol may be used. This makes it appear to the user that the entire operation occurred in a single instruction.

A sequence of invocations and responses made of an object by a set of threads is referred to as a history. Each invocation of a function has a subsequent response.

Example:
Suppose two threads, A and B, attempt to acquire a lock, backing off if it is already taken. This would be modeled as both threads invoking the lock operation, then both threads receiving a response, one successful, one not:
* A calls lock
* B calls lock
* lock returns fail to A
* lock returns success to B

When every invocation is immediately followed by its response, the history is called a sequential history. A history is linearizable if:

* its invocations and responses can be reordered to yield a sequential history
* that sequential history is correct according to the sequential definition of the object
* if a response preceded an invocation in the original history, it must still precede it in the sequential reordering

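The above example can be reordered into a sequential history in two ways.

One way would be:
* A calls lock
* lock returns fail to A
* B calls lock
* lock returns success to B

This is not a correct sequential history: the lock is free when A calls it, so A's call should have succeeded.

The other way would be:
* B calls lock
* lock returns success to B
* A calls lock
* lock returns fail to A

This is a correct sequential history, and it is therefore a linearization of the original history.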

Thus, an object is linearizable if all valid histories of its use can be linearized.
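
The three conditions for a linearizable history can be checked by brute force on the lock example. The following is a minimal sketch under stated assumptions: each thread performs exactly one operation, the lock is never released, and the event encoding and function names are hypothetical. It tries every ordering of the operations, rejects orderings in which an operation that had already returned is placed after one that was called later, and keeps those that match the sequential behavior of the lock.

```python
from itertools import permutations

def operations(history):
    """Pair each call with its return: thread -> [call_index, return_index, result]."""
    ops = {}
    for index, event in enumerate(history):
        if event[0] == "call":
            ops[event[1]] = [index, None, None]
        else:  # ("ret", thread, result)
            ops[event[1]][1] = index
            ops[event[1]][2] = event[2]
    return ops

def legal_sequential(results):
    """Sequential behavior of a lock that is never released: only the first call succeeds."""
    return all(r == ("success" if i == 0 else "fail") for i, r in enumerate(results))

def linearizations(history):
    """Return every ordering of the operations that is a valid linearization."""
    ops = operations(history)
    found = []
    for order in permutations(ops):
        respects_real_time = all(
            not (ops[later][1] < ops[earlier][0])   # 'later' must not have returned before 'earlier' was called
            for i, earlier in enumerate(order)
            for later in order[i + 1:]
        )
        if respects_real_time and legal_sequential([ops[t][2] for t in order]):
            found.append(order)
    return found

# The overlapping history from the example above.
overlapping = [("call", "A"), ("call", "B"), ("ret", "A", "fail"), ("ret", "B", "success")]

# A history in which A fails on a free lock before B even calls: no linearization exists.
non_overlapping = [("call", "A"), ("ret", "A", "fail"), ("call", "B"), ("ret", "B", "success")]
```

For the overlapping history the only linearization found is the one in which B's operation is ordered before A's; the second history has none, because A's failed call completed before B's call began.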

=== PRAM consistency <ref>http://en.wikipedia.org/wiki/PRAM_consistency</ref>===
[[Image:pram.jpg|thumb|right|300px|PRAM Consistency example]]

Pipelined RAM (PRAM) consistency is also known as FIFO consistency. The reasoning that led to this model was as follows. Consider a multiprocessor where each processor has a local copy of the shared memory. For the memory to be scalable, the cost of an access should be independent of the time it takes to access the other processors' memories. The designers of the model proposed that on a read, a PRAM would simply return the value stored in the local copy of the memory. On a write, it would update the local copy first and broadcast the new value to the other processors. Assuming a constant time for initiating a broadcast operation, the goal of making the cost of a read or write constant is thus achieved. In terms of ordering constraints, this is equivalent to requiring that all processors observe the writes from a single processor in the same order, while they may disagree on the order of writes by different processors.
The example shown is legal under PRAM but not under SC or CC. P3 and P4 observe the writes by P1 and P2 in different orders, although W(x)1 and W(x)2 are potentially causally related.
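
The ordering constraint can be expressed as a small checker. This is an illustrative sketch rather than a formal definition: the function name is hypothetical, each write is assumed to have a unique label, and the observation orders for P3 and P4 are assumptions that match the description of the figure.

```python
def is_pram_consistent(program_order, observed):
    """Return True if every observer sees each writer's writes in program order.

    program_order: dict mapping writer -> list of its writes in program order
    observed: dict mapping observer -> list of writes in the order it saw them
    """
    for view in observed.values():
        for writes in program_order.values():
            seen = [w for w in view if w in writes]
            expected = [w for w in writes if w in seen]
            if seen != expected:
                return False
    return True


# P1 and P2 each issue one write; P3 and P4 see them in opposite orders.
program_order = {"P1": ["W(x)1"], "P2": ["W(x)2"]}
views = {"P3": ["W(x)1", "W(x)2"], "P4": ["W(x)2", "W(x)1"]}

# Two writes from the same processor seen out of order violate PRAM consistency.
bad_program_order = {"P1": ["W(x)1", "W(x)5"]}
bad_views = {"P3": ["W(x)5", "W(x)1"]}
```

The first case is accepted because the disagreement involves writes from different processors; the second is rejected because a single processor's writes are reordered.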

=== Release consistency <ref>http://en.wikipedia.org/wiki/Release_consistency</ref>===

Release consistency is one of the consistency models used in the domain of concurrent programming (e.g. in distributed shared memory, distributed transactions etc.).

Systems of this kind are characterised by the existence of two special synchronisation operations, release and acquire. Before issuing a write to a memory object, a node must acquire the object via a special operation, and later release it. The code that runs between the acquire and release operations therefore constitutes the critical region. The system is said to provide release consistency if all write operations by a certain node are seen by the other nodes after the former releases the object and before the latter acquire it.

There are two kinds of protocols that implement release consistency:
* ''eager'', where all coherence actions are performed on release operations, and
* ''lazy'', where all coherence actions are delayed until after a subsequent acquire
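
The difference between the two protocol families can be sketched with plain dictionaries standing in for each node's copy of memory. This is a toy model under stated assumptions, not a real protocol: the <code>Lock</code> class and the three functions are hypothetical, and message passing is replaced by direct dictionary updates.

```python
class Lock:
    """Carries the writes recorded at release time until a later acquire pulls them."""

    def __init__(self):
        self.pending = {}


def release_eager(writer_updates, other_copies):
    """Eager: push the releaser's writes to every other copy at release time."""
    for copy in other_copies:
        copy.update(writer_updates)


def release_lazy(writer_updates, lock):
    """Lazy: only record the writes; nothing is sent to other nodes yet."""
    lock.pending.update(writer_updates)


def acquire_lazy(lock, my_copy):
    """Lazy: the acquiring node pulls the recorded writes when it acquires."""
    my_copy.update(lock.pending)


# Eager protocol: both other nodes are updated as soon as the writer releases.
eager_b, eager_c = {}, {}
release_eager({"x": 1}, [eager_b, eager_c])

# Lazy protocol: only the node that later acquires the lock is updated.
lazy_b, lazy_c = {}, {}
lock = Lock()
release_lazy({"x": 1}, lock)
copies_after_release = (dict(lazy_b), dict(lazy_c))
acquire_lazy(lock, lazy_b)
```

In the lazy case node C never receives the update because it never acquires the lock, which shows how delaying coherence actions can avoid traffic to nodes that do not need the data.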

=== Sequential consistency <ref>http://en.wikipedia.org/wiki/Sequential_consistency</ref>===
[[Image:sc.jpg|thumb|right|300px|Sequential Consistency example]]

Sequential consistency was first defined by Lamport in 1979. He defined a memory system to be sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.

In a sequentially consistent system, all processors must agree on the order of observed effects. The image to the right shows an example for SC.
Note that R(y)2 by processor P3 reads a value that has not been written yet! Of course, this is not possible in any real physical system. However, it shows a surprising flexibility of the SC model. Another reason why this is not a legal history for atomic consistency is that the
write operations W(x)1 and W(y)2 appear commuted at processor P3.

Sequential consistency has been the canonical memory consistency model for a long time. However, many multiprocessor machines actually implement a slightly weaker model called processor consistency.

=== Weak consistency ===

A memory system is weakly consistent if it enforces the following restrictions:
* accesses to synchronization variables are sequentially consistent and
* no access to a synchronization variable is issued by a processor before all previous data accesses have been performed and
* no access is issued by a processor before a previous access to a synchronization variable has been performed
Notice that the meaning of “previous” is well-defined because it refers to program order. That is, an access A precedes access B if and only if the processor that executed access B has previously executed access A. Synchronizing accesses work as fences. At the time a synchronizing access performs, all previous accesses by that processor are guaranteed to have performed, and all future accesses by that processor are guaranteed not to have performed. The synchronization model corresponding to these access order constraints is relatively simple. A program executing on a weakly consistent system appears sequentially consistent if the following two constraints are observed:
* There are no data races and
* Synchronization is visible to the memory system.
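
The fence-like behavior of synchronizing accesses can be sketched with a processor that buffers ordinary writes. This is a simplified single-processor illustration with hypothetical names: ordinary data writes are held in a local buffer, and an access to a synchronization variable first drains that buffer so that all previous data accesses have performed.

```python
class WeakProcessor:
    """Buffers ordinary writes and drains the buffer at every synchronizing access."""

    def __init__(self, memory):
        self.memory = memory   # shared memory visible to other processors
        self.buffer = []       # data writes issued but not yet performed

    def write(self, addr, value):
        # An ordinary data write may be delayed.
        self.buffer.append((addr, value))

    def sync_write(self, addr, value):
        # All previous data accesses must perform before the synchronizing access.
        for buffered_addr, buffered_value in self.buffer:
            self.memory[buffered_addr] = buffered_value
        self.buffer.clear()
        self.memory[addr] = value


shared = {}
p = WeakProcessor(shared)
p.write("data", 42)
visible_before_sync = dict(shared)   # the data write is not yet visible
p.sync_write("flag", 1)              # drains the buffer, then sets the flag
```

A consumer that reads <code>flag</code> as 1 is therefore guaranteed, in this model, to find <code>data</code> already written, provided it also treats its read of <code>flag</code> as a synchronizing access.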

=== Strong consistency ===

Strong consistency is one of the consistency models used in the domain of concurrent programming (e.g. in distributed shared memory, distributed transactions etc.). Strong consistency is supported if:
* All accesses are seen by all parallel processes (or nodes, processors etc.) in the same order (sequentially)
* Only one consistent state can be observed, as opposed to weak consistency, where different parallel processes (or nodes etc.) can perceive variables in different states.

== Practical performance impact <ref>http://classes.soe.ucsc.edu/cmpe221/Spring05/papers/29multi.pdf</ref>==

[[Image:consistency_models.jpg|thumb|center|700px|Implementation of consistency models]]

Here, we define performance as the processor utilization achieved in an execution. The reason for using processor utilization as the distinguishing factor is that it provides reasonable results even when the program’s control path is not deterministic and depends on the relative timing of synchronization accesses. Let us make a comparative analysis of the performance achieved by the various consistency models on the LFC architecture (an aggressive implementation with lockup-free caches).

A BASE model has been added to the four consistency models, namely Sequential Consistency (SC), Processor Consistency (PC), Release Consistency (RC) and Weak Consistency (WC). BASE is the most constrained model and is used as the baseline for all performance comparisons. It incorporates no buffering or pipelining and waits for each read and write to complete before proceeding.

=== Performance of SC versus BASE ===

The SC model does not perform significantly better than BASE. The performance gain from BASE to SC is small for most applications. This is because reads are expected to be closely interleaved with writes. Significant write clustering may occur sometimes, for example, when initializing data structures, but such occurrences are expected to be infrequent.

=== Performance of PC versus SC and BASE ===

In PC, sequential consistency is abandoned. The main benefit of this extra complexity comes from the fact that reads do not have to stall for pending writes to perform. However, some of the benefits may be lost if the write buffer gets full and stalls the processor. The PC model is relatively successful in hiding almost all of the latency of writes given a reasonably deep write buffer. Since the comparison of PC and WC is more involved than the comparison with RC, we next examine PC versus RC. Subsequently, we compare PC and WC.

=== Performance of PC versus RC ===

In addition to providing all the benefits of the PC model, RC allows pipelining of writes by exploiting information about the synchronization accesses. That is, writes can be retired from the write buffer before ownership has been obtained. The fact that writes are retired at a faster rate has two implications:
* the write buffer becoming full is less of a problem, and
* in cases where a release operation is behind several writes in the write buffer, that release can be observed sooner by a processor waiting to do an acquire.
The write buffer getting full is really not a problem for PC. Therefore, most gains, if observed, will be due to faster completion of synchronization.

=== Performance of WC versus PC and RC ===

The differences between WC and RC arise because WC does not differentiate between acquire and release synchronization operations. Consequently, any synchronization operation must conservatively satisfy the constraints of both release and acquire. Thus, compared to RC, WC stalls the processor at an acquire until pending writes and releases complete. In addition, the processor is stalled for pending releases if it attempts to do a read operation.

On comparing WC and PC, we observe the surprising result that PC sometimes performs better than WC. WC has the advantage that writes can be retired at a faster rate from the write buffer. The disadvantage of WC relative to PC is the same as its disadvantage relative to RC: WC stalls the processor at some points for pending writes and releases to perform.

== Conclusion ==

In this article all the popular memory consistency models have been discussed in terms of how they are able to support multiprocessors today. Also, a performance based comparison of the different consistency models is presented. A decent effort has been made to discuss the relative weaknesses and strengths of different models. From what has been laid out earlier it is easy to conclude that there is no one best model, only the model that best fits the application, architecture, and programmer.

The choice is between a very strict consistency model such as SC and other relaxed consistency models. The sequential memory model is the easiest model to use for parallel computing. For a novice programmer who does not fully understand the problems inherent with memory consistency and cannot implement explicit mechanisms for synchronization, a sequentially consistent system may be the only possible model that will result in program correctness. There will be a significant penalty in performance, however, which could negate even the advantage of using a parallel system in the first place. Across a wide range of applications, the relaxed consistency models would on average give the best performance, but at greater programmer effort. A programmer with extensive knowledge of the architecture and experience in parallel programming will be able to write a high-performing application while ensuring correctness with synchronization mechanisms like barriers, locks, and fence instructions.

This article should help a programmer to make this choice and help select an appropriate model for his/her multiprocessor design.

==References==
<references/>

CSC/ECE 506 Spring 2012/10b sr

2012-04-05T08:51:03Z

Sshanbh: /* PRAM consistency */

'''Use of consistency models in current multiprocessors'''

The memory consistency model of a shared memory system determines the order in which memory operations will appear to execute to the programmer. This article describes how consistency is used in multiprocessors today and later digs into the details of popular consistency models in use today. The impact of these models on the multiprocessor performance is also discussed.

== Introduction ==

Many modern computer systems and most multicore chips support [http://en.wikipedia.org/wiki/Shared_memory shared memory] in hardware. In a shared memory system, each of the processor cores may read and write to a single shared address space. These designs seek various goodness properties, such as high performance, low power, and low cost. Of course, it is not valuable to provide these goodness properties without first providing correctness. Correct shared memory seems intuitive at a hand-wave level, but, there are subtle issues in even defining what it means for a shared memory system to be correct, as well as many subtle corner cases in designing a correct shared memory implementation. Moreover, these subtleties must be mastered in hardware implementations where bug fixes are expensive.

It is the job of consistency to define shared memory correctness. Consistency definitions provide rules about loads and stores (or memory reads and writes) and how they act upon memory. Ideally, consistency definitions would be simple and easy to understand. However, defining what it means for shared memory to behave correctly is more subtle than defining the correct behavior of, for example, a single-threaded processor core. The correctness criterion for a single processor core partitions behavior between one correct result and many incorrect alternatives. This is because the processor’s architecture mandates that the execution of a thread transforms a given input state into a single well-defined output state, even on an out-of-order core. Shared memory consistency models, however, concern the loads and stores of multiple threads and usually allow many correct executions while disallowing many incorrect ones. The possibility of multiple correct executions is due to the ISA allowing multiple threads to execute concurrently, often with many possible legal interleavings of instructions from different threads. The multitude of correct executions complicates the erstwhile simple challenge of determining whether an execution is correct. Nevertheless, consistency must be mastered to implement shared memory and, in some cases, to write correct programs that use it.

== Consistency in current-day multiprocessors ==

Today's scalable multiprocessors are mostly built with a distributed shared memory architecture. The memory is physically distributed but logically shared. In other words, the address spaces generated by all processing nodes globally form a single address space. The main advantage of such a system lies in the scalability with distributed hardware and programmability of a shared [http://en.wikipedia.org/wiki/Virtual_memory virtual memory]. Representative systems include the Cray T3D, the Stanford Dash, the MIT Alewife , the Teracomputer, and Convex SPP.

A scalable system should be able to hide the long latencies of remote memory accesses. Several latency hiding techniques have been proposed: Coherent caches reduce frequency of remote memory accesses by caching data close to the processor. Relaxed memory consistency allows reordering of memory events and buffering or pipelining of remote memory accesses. Data prefetching attempts to hide long read latency by issuing read requests well ahead of time, with the exception that the data will be available in the cache when it is referenced. [http://en.wikipedia.org/wiki/Multithreading_(computer_architecture) Multithreading] attempts to hide the long latency by context switching between several active threads, thus allowing the processor to perform useful work while waiting for remote requests or synchronization faults to complete.

== Consistency models used ==

=== Address translation aware memory consistency <ref>http://people.ee.duke.edu/~sorin/papers/ieeemicro11_toppick.pdf</ref> ===

These memory consistency models define the behavior of operations (loads, stores, memory barriers, etc.) on physical addresses and virtual addresses. The two important levels of memory consistency that can be classified as address translation aware are described below:

==== Physical address memory consistency (PAMC)====
It is necessary to have correct PAMC for unmapped code to work correctly. Unmapped software, including the boot code and part of the system software that manages AT, relies upon PAMC. It is the responsibility of the hardware to implement PAMC and this is specified precisely in the architectural manual.
It is not too difficult to adapt an AT-oblivious consistency model as the specification of PAMC.

Example:
The PAMC model could be SC. In such a case the interface would specify that
(i) there must exist a total order of all loads and stores to physical addresses that respects the program order of each thread and
(ii) the value of each load is equal to the value of the most recent store to that physical address in the total order.

==== Virtual address memory consistency (VAMC)====
Correct VAMC is required for mapped code to work correctly.

Although adapting an AT-oblivious consistency model for PAMC is straightforward, there are a few challenges when adapting an AT-oblivious consistency model for VAMC:
* synonyms - Multiple [http://en.wikipedia.org/wiki/Virtual_address_space virtual addresses] may map to the same [http://en.wikipedia.org/wiki/Physical_address physical address]. Suppose two virtual addresses VA1 and VA2 map to the same physical address PA. SC requires that the value of a load is equal to the value of the most recent store to the same address. It is possible to have a naive definition of VAMC that does not consider the level of indirection introduced by AT. To overcome this challenge, we re-formulate AT-oblivious consistency models for VAMC by applying the model to synonym sets of virtual addresses rather than individual addresses. For instance, we can define SC for VAMC as follows: there must exist a total order of all loads and stores to virtual addresses that respects program order and in which each load gets the value of the most recent store to any virtual address in the same virtual address synonym set. Incorporating synonyms explicitly in the consistency model allows programmers to reason about the ordering of accesses to virtual addresses.
* mapping and permission changes - Another challenge is that the set of memory operations at the VAMC level is richer than at the PAMC level.
* load/store side effects - Yet another challenge in specifying VAMC is that loads and stores to virtual addresses have certain side effects. The AT system includes status bits such as accessed and dirty bits for each page table entry. These status bits are part of the architectural state, and the ordering of updates to those bits must thus be specified in VAMC. To achieve this we add two new operations to the specification tables: Ld-sb (load’s impact on status bits) and St-sb (store’s impact on status bits).

=== Causal consistency ===
[[Image:causal.jpg|thumb|right|300px|Causal Consistency example]]

Hutto and Ahamad <ref>P.W. Hutto andM. Ahamad. Slowmemory: Weakening consistency to enhance concurrency in distributed shared memories. In
Proceedings of the 10th International Conference on Distributed Computing Systems, pages 302–311,May 1990.</ref> introduced causal consistency. Lamport <ref>Leslie Lamport. Time, clocks, and the ordering of events in a distributed system. Communications of the ACM, 21(7):558–565, 1978.</ref> defined the notion of potential causality to capture the flow of information in a distributed system. This notion can be applied to a memory system by interpreting a write as a message-send event and a read as a message-read event. A memory is causally consistent if all processors agree on the order of causally related events. Causally unrelated events (concurrent events) can be observed in different orders.
The example shown here is a legal execution history under CC but not under SC. Note that W(x)1 and W(x)2 are causally related as P2 observed the first write by P1. Furthermore, P3 and P4 observe the accesses W(x)2 and W(x)3 in different orders, which would not be legal in SC.
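The condition can be checked mechanically on a small history. The sketch below assumes the set of causally related write pairs is already known; the function name and the observation lists are hypothetical and only modeled on the description above, not read off the figure.

```python
def respects_causality(observed, causal_pairs):
    """observed: processor -> list of writes in the order that processor saw them.
    causal_pairs: set of (earlier, later) writes that are causally related.
    True if every processor that saw both writes of a pair saw them in causal order.
    """
    for order in observed.values():
        position = {write: i for i, write in enumerate(order)}
        for earlier, later in causal_pairs:
            if earlier in position and later in position:
                if position[earlier] > position[later]:
                    return False
    return True

causal_pairs = {("W(x)1", "W(x)2")}     # P2 wrote 2 after reading 1
legal = {
    "P3": ["W(x)1", "W(x)3", "W(x)2"],  # concurrent writes 2 and 3 seen in
    "P4": ["W(x)1", "W(x)2", "W(x)3"],  # different orders: allowed
}
illegal = {"P3": ["W(x)2", "W(x)1"]}    # causally related writes swapped
print(respects_causality(legal, causal_pairs), respects_causality(illegal, causal_pairs))
```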

=== Delta consistency <ref>http://en.wikipedia.org/wiki/Delta_consistency</ref>===

The delta consistency model states that after a fixed time period δ, an update is propagated through the system and all replicas will be consistent. In other words, barring a short bounded interval after a modification, the result of any read operation is consistent with a read on the original copy of an object. If an object is modified, the read will not be consistent during the short period of time following its modification. Once the fixed time period elapses, the modification is propagated and the read is now consistent.
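A minimal sketch of this behaviour, assuming a logical clock passed in by the caller, writes issued in non-decreasing time order, and a hypothetical class name <code>DeltaReplica</code>:

```python
class DeltaReplica:
    """A replica that reflects an update only once `delta` time units have passed."""

    def __init__(self, delta, initial=0):
        self.delta = delta
        self.initial = initial
        self.updates = []                 # (write_time, value), in write order

    def write(self, now, value):
        self.updates.append((now, value))

    def read(self, now):
        value = self.initial
        for write_time, written in self.updates:
            if now - write_time >= self.delta:
                value = written           # this update has propagated by `now`
        return value

replica = DeltaReplica(delta=5)
replica.write(now=10, value=42)
print(replica.read(now=12), replica.read(now=15))
```

A read inside the interval still returns the old value; a read at or after time 15 returns the new one.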

=== Entry consistency <ref>http://cs.gmu.edu/cne/modules/dsm/orange/entry_con.html</ref>===

This consistency model has been designed to be used with critical sections. Here, the programmer needs to use acquire and release at the start and end of each [http://en.wikipedia.org/wiki/Critical_section critical section] just like in both variants of [http://en.wikipedia.org/wiki/Release_consistency release consistency]. However, it also requires every ordinary shared variable to be associated with a synchronization variable such as a lock or a barrier. If the elements of an array need to be accessed independently in parallel, then each element of the array must be associated with its own lock. When an acquire is done on a synchronization variable, only those ordinary shared variables guarded by that synchronization variable are made consistent. Release consistency does not associate shared variables with locks or barriers and at acquire time has to determine empirically which variables it needs. This is where entry consistency differs from release consistency.

Formally, a memory exhibits entry consistency if the following conditions are met:

* An acquire access of a synchronization variable is not allowed to perform with respect to a process until all updates to the guarded shared data have been performed with respect to that process.
* Before an exclusive mode access to a synchronization variable by a process is allowed to perform with respect to that process, no other process may hold the synchronization variable, not even in nonexclusive mode.
* After an exclusive mode access to a synchronization variable has been performed, any other process's next nonexclusive mode access to that synchronization variable may not be performed until it has performed with respect to that variable's owner.
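The association between synchronization variables and the data they guard can be sketched as follows. This is a single-threaded illustration under stated assumptions: the classes <code>SharedStore</code> and <code>Process</code> are hypothetical, and exclusive versus nonexclusive modes are not modeled.

```python
class SharedStore:
    """Authoritative copies plus the lock-to-variable association."""

    def __init__(self, guards):
        self.guards = guards      # synchronization variable -> guarded variables
        self.home = {}            # variable -> latest released value

class Process:
    def __init__(self, store):
        self.store = store
        self.local = {}           # local copies, possibly stale or missing

    def acquire(self, lock):
        # Only the variables guarded by `lock` are brought up to date.
        for var in self.store.guards[lock]:
            if var in self.store.home:
                self.local[var] = self.store.home[var]

    def release(self, lock):
        # Publish this process's updates to the guarded variables.
        for var in self.store.guards[lock]:
            if var in self.local:
                self.store.home[var] = self.local[var]

store = SharedStore({"lock_a": {"a"}, "lock_b": {"b"}})
writer, reader = Process(store), Process(store)
writer.acquire("lock_a")
writer.acquire("lock_b")
writer.local["a"], writer.local["b"] = 1, 2
writer.release("lock_a")
writer.release("lock_b")
reader.acquire("lock_a")
print(reader.local)               # only "a" was made consistent
```

Because the reader acquired only <code>lock_a</code>, it has no copy of <code>b</code> at all; under release consistency the acquire would not be tied to a particular set of variables.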

=== Eventual consistency <ref>http://en.wikipedia.org/wiki/Eventual_consistency</ref>===

Given a sufficiently long period of time over which no changes are sent, all updates can be expected to propagate eventually through the system. Thus, all the replicas will be consistent. As the consistency achieved is eventual, the possibility of conflicts is high. The conflicts have to be resolved. There are three types of resolution:
* Read repair: The correction is done when a read finds an inconsistency. This slows down the read operation.
* Write repair: The correction is done when a write operation finds an inconsistency, slowing down the write operation.
* Asynchronous repair: The correction is not part of a read or write operation.
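Read repair can be sketched as below. The sketch assumes each replica stores a (version, value) pair per key and that the highest version wins; that resolution rule and the function name are assumptions made for illustration, not something the source prescribes.

```python
def read_with_repair(replicas, key):
    """replicas: list of dicts mapping key -> (version, value).

    Returns the value with the highest version and overwrites stale or
    missing copies on the way, so the repair cost is paid by the read.
    Raises ValueError if no replica holds the key.
    """
    newest = max((r[key] for r in replicas if key in r), key=lambda pair: pair[0])
    for replica in replicas:
        if replica.get(key) != newest:
            replica[key] = newest
    return newest[1]

replicas = [{"k": (1, "old")}, {"k": (2, "new")}, {}]
print(read_with_repair(replicas, "k"))
```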

=== Linearizability <ref>http://en.wikipedia.org/wiki/Linearizability</ref>===

This is also known as strict or atomic consistency. An operation is linearizable if it appears to the rest of the system to occur instantaneously. An atomic operation either occurs completely or does not occur at all. In reality, atomic operations do not actually occur instantaneously. Atomicity is enforced by mutual exclusion. At the software level, locks or semaphores are used to achieve this, while at the hardware level, a cache coherency protocol may be used. This makes it appear to the user that the entire operation occurred in a single instruction.

A sequence of invocations and responses made of an object by a set of threads is referred to as a history. Each invocation of a function is followed by a subsequent response.

Example:
Suppose two threads, A and B, attempt to acquire a lock, backing off if it is already taken. This would be modeled as both threads invoking the lock operation, then both threads receiving a response, one successful, one not:
* A calls lock
* B calls lock
* lock returns fail to A
* lock returns success to B

When every invocation is immediately followed by its response, the history is called a sequential history. A history is linearizable if:

* its invocations and responses can be reordered to yield a sequential history
* that sequential history is correct according to the sequential definition of the object
* if a response preceded an invocation in the original history, it must still precede it in the sequential reordering

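The invocations and responses of the example above can be reordered into a sequential history in two ways.

One way would be:
* A calls lock
* lock returns fail to A
* B calls lock
* lock returns success to B
This history is sequential, but it does not match the sequential definition of a lock: A's call fails even though no thread holds the lock at that point.

The other way would be:
* B calls lock
* lock returns success to B
* A calls lock
* lock returns fail to A
This history is sequential and correct according to the definition of a lock, so it is a linearization of the original history.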

Thus, an object is linearizable if all valid histories of its use can be linearized.
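The three conditions can be checked by brute force for the lock example. The sketch below assumes one try-lock operation per thread and a lock that is never released, so only the first call in a sequential history may succeed; the event encoding and function name are hypothetical.

```python
from itertools import permutations

def linearizations(history):
    """history: list of ("call", thread) and ("ret", thread, result) events.
    Returns every sequential order of the operations that is a linearization.
    """
    call_at, ret_at, result = {}, {}, {}
    for index, event in enumerate(history):
        if event[0] == "call":
            call_at[event[1]] = index
        else:
            ret_at[event[1]] = index
            result[event[1]] = event[2]
    legal = []
    for order in permutations(call_at):
        # An operation that returned before another was called must stay first.
        keeps_real_time = all(
            not (ret_at[later] < call_at[earlier])
            for i, earlier in enumerate(order)
            for later in order[i + 1:]
        )
        # Sequential definition: only the first try-lock succeeds.
        expected = ["success"] + ["fail"] * (len(order) - 1)
        if keeps_real_time and [result[t] for t in order] == expected:
            legal.append(order)
    return legal

history = [("call", "A"), ("call", "B"), ("ret", "A", "fail"), ("ret", "B", "success")]
print(linearizations(history))
```

For this history only the order in which B's operation comes before A's satisfies the sequential definition of the lock. If A's failing response had arrived before B's invocation, no linearization would exist.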

=== PRAM consistency <ref>http://en.wikipedia.org/wiki/PRAM_consistency</ref>===
[[Image:pram.jpg|thumb|right|300px|PRAM Consistency example]]

PRAM (pipelined RAM) consistency is also known as FIFO consistency. The reasoning that led to this model was as follows: consider a multiprocessor where each processor has a local copy of the shared memory. For the memory to be scalable, the cost of an access should be independent of the time it takes to access the other processors’ memories. Lipton and Sandberg, who introduced the model, proposed that on a read, a PRAM would simply return the value stored in the local copy of the memory. On a write, it would update the local copy first and broadcast the new value to the other processors. Assuming a constant time for initiating a broadcast operation, the goal of making the cost of a read or write constant is thus achieved. In terms of ordering constraints, this is equivalent to requiring that all processors observe the writes from a single processor in the same order, while they may disagree on the order of writes by different processors.
The example shown is legal under PRAM but not under SC or CC. P3 and P4 observe the writes by P1 and P2 in different orders, although W(x)1 and W(x)2 are potentially causally related.
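The ordering constraint can be written as a short check: every observer must see each individual writer's writes in that writer's program order. The function name and the small histories are hypothetical.

```python
def is_pram_consistent(issued, observed):
    """issued: writer -> list of its writes in program order.
    observed: observer -> list of writes in the order the observer saw them.
    """
    for order in observed.values():
        position = {write: i for i, write in enumerate(order)}
        for writes in issued.values():
            seen = [position[w] for w in writes if w in position]
            if seen != sorted(seen):      # one writer's writes seen out of order
                return False
    return True

issued = {"P1": ["W(x)1"], "P2": ["W(x)2"]}
observed = {"P3": ["W(x)1", "W(x)2"], "P4": ["W(x)2", "W(x)1"]}
print(is_pram_consistent(issued, observed))
```

The disagreement between P3 and P4 is accepted because the two writes come from different processors.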

=== Release consistency <ref>http://en.wikipedia.org/wiki/Release_consistency</ref>===

Release consistency is one of the consistency models used in the domain of concurrent programming (e.g. in distributed shared memory, distributed transactions etc.).

Systems of this kind are characterised by the existence of two special synchronisation operations, release and acquire. Before issuing a write to a memory object a node must acquire the object via a special operation, and later release it. The code that runs between the acquire and release operations therefore constitutes the critical region. The system is said to provide release consistency if all write operations by a certain node are seen by the other nodes after the former releases the object and before the latter acquire it.

There are two kinds of protocols that implement release consistency:
* ''eager'', where all coherence actions are performed on release operations, and
* ''lazy'', where all coherence actions are delayed until after a subsequent acquire
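A minimal sketch of the eager variant, assuming each node keeps its own view of memory and a set of writes made inside the current critical region. The classes <code>Node</code> and <code>EagerRC</code> are hypothetical and ignore locking itself.

```python
class Node:
    def __init__(self):
        self.view = {}        # this node's view of shared memory
        self.pending = {}     # writes made inside the current critical region

class EagerRC:
    """Eager protocol sketch: writes are pushed to every node at release."""

    def __init__(self, nodes):
        self.nodes = nodes

    def write(self, node, var, value):
        node.view[var] = value
        node.pending[var] = value

    def release(self, node):
        for other in self.nodes:
            other.view.update(node.pending)   # coherence action at release time
        node.pending.clear()

n1, n2 = Node(), Node()
rc = EagerRC([n1, n2])
rc.write(n1, "x", 1)
seen_before_release = n2.view.get("x")
rc.release(n1)
print(seen_before_release, n2.view["x"])
```

A lazy protocol would instead leave the pending writes with the releasing node and have the next acquirer fetch them.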

=== Sequential consistency <ref>http://en.wikipedia.org/wiki/Sequential_consistency</ref>===
[[Image:sc.jpg|thumb|right|300px|Sequential Consistency example]]

Sequential consistency was first defined by Lamport in 1979. He defined a memory system to be sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.

In a sequentially consistent system, all processors must agree on the order of observed effects. The image to the right shows an example for SC.
Note that R(y)2 by processor P3 reads a value that has not been written yet! Of course, this is not possible in any real physical system. However, it shows a surprising flexibility of the SC model. Another reason why this is not a legal history for atomic consistency is that the
write operations W(x)1 and W(y)2 appear commuted at processor P3.

Sequential consistency has been the canonical memory consistency model for a long time. However, many multiprocessor machines actually implement a slightly weaker model called processor consistency.
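Lamport's definition can be explored by enumerating every interleaving of two short programs that preserves each program's order. The sketch below uses a well-known two-processor test in which each processor stores to one location and then loads the other; the tuple encoding and function names are assumptions made for illustration.

```python
def interleavings(a, b):
    """Yield every merge of sequences a and b that keeps each one's order."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest

def sc_outcomes(p1, p2):
    """Run each interleaving against one shared memory; collect (r1, r2)."""
    outcomes = set()
    for schedule in interleavings(p1, p2):
        memory, regs = {"x": 0, "y": 0}, {}
        for op, target, arg in schedule:
            if op == "st":
                memory[target] = arg
            else:                 # "ld": target is a register, arg an address
                regs[target] = memory[arg]
        outcomes.add((regs["r1"], regs["r2"]))
    return outcomes

p1 = [("st", "x", 1), ("ld", "r1", "y")]
p2 = [("st", "y", 1), ("ld", "r2", "x")]
print(sorted(sc_outcomes(p1, p2)))
```

No interleaving produces r1 = 0 and r2 = 0 together, so that outcome is forbidden under SC. Weaker models that let a load bypass a buffered store can produce it.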

=== Weak consistency ===

A memory system is weakly consistent if it enforces the following restrictions:
* accesses to synchronization variables are sequentially consistent and
* no access to a synchronization variable is issued in a processor before all previous data accesses have been performed and
* no access is issued by a processor before a previous access to a synchronization variable has been performed
Notice that the meaning of “previous” is well-defined because it refers to program order. That is, an access A precedes access B if and only if the processor that executed access B has previously executed access A. Synchronizing accesses work as fences: at the time a synchronizing access performs, all previous accesses by that processor are guaranteed to have performed. The synchronization model corresponding to these access order constraints is relatively simple. A program executing on a weakly consistent system appears sequentially consistent if the following two constraints are observed:
* There are no data races and
* Synchronization is visible to the memory system.
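The fence-like role of a synchronizing access can be sketched with a write buffer that is drained before the synchronization access performs. The class <code>WeakProcessor</code> and the <code>data</code>/<code>flag</code> names are hypothetical, and reads are left out for brevity.

```python
class WeakProcessor:
    def __init__(self, memory):
        self.memory = memory
        self.buffer = []          # data writes issued but not yet performed

    def write(self, addr, value):
        self.buffer.append((addr, value))

    def sync_write(self, addr, value):
        # All previous data accesses perform before the synchronization access.
        for pending_addr, pending_value in self.buffer:
            self.memory[pending_addr] = pending_value
        self.buffer.clear()
        self.memory[addr] = value

memory = {}
producer = WeakProcessor(memory)
producer.write("data", 42)
visible_before_sync = "data" in memory
producer.sync_write("flag", 1)
print(visible_before_sync, memory)
```

A consumer that observes <code>flag</code> set is therefore guaranteed to find <code>data</code> already written, which is why a data-race-free program appears sequentially consistent.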

=== Strong consistency ===

Strong consistency is one of the consistency models used in the domain of concurrent programming (e.g. in distributed shared memory, distributed transactions etc.). Strong consistency is supported if:
* All accesses are seen by all parallel processes (or nodes, processors etc.) in the same order (sequentially)
* Only one consistent state can be observed. In contrast, under weak consistency different parallel processes (or nodes etc.) may perceive variables in different states.

== Practical performance impact ==

[[Image:consistency_models.jpg|thumb|center|700px|Implementation of consistency models]]

Here, we define performance as the processor utilization achieved in an execution. The reason for using processor utilization as the distinguishing factor is that it provides reasonable results even when the program’s control path is not deterministic and depends on relative timing of synchronization accesses. Let us make a comparative analysis of the performance achieved by the various consistency models on the LFC architecture (an aggressive implementation with lock-free caches).

A BASE model is added to the four consistency models, namely Sequential Consistency (SC), Processor Consistency (PC), Release Consistency (RC) and Weak Consistency (WC). BASE is the most constrained model and is used as the baseline for all performance comparisons. It incorporates no buffering or pipelining and waits for each read and write to complete before proceeding.

=== Performance of SC versus BASE ===

The SC model does not perform significantly better than BASE. The performance gain from BASE to SC is small for most applications. This is because reads are expected to be closely interleaved with writes. Significant write clustering may occur sometimes, for example, when initializing data structures, but such occurrences are expected to be infrequent.

=== Performance of PC versus SC and BASE ===

In PC, sequential consistency is abandoned, which adds complexity. The main benefit of this extra complexity comes from the fact that reads do not have to stall for pending writes to perform. However, some of the benefit may be lost if the write buffer gets full and stalls the processor. The PC model is relatively successful in hiding almost all of the latency of writes given a reasonably deep write buffer. Since the comparison of PC and WC is more involved than the comparison with RC, we next examine PC versus RC. Subsequently, we compare PC and WC.
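The effect of a write buffer can be illustrated with a toy cycle count. Everything in this sketch is an assumption made for illustration: the latency of 10 cycles, the buffer depths, the trace, and the rule that buffered writes drain one at a time. It is not the simulation behind the results discussed here.

```python
def base_cycles(trace, latency):
    """BASE: the processor waits for every read and write to complete."""
    return len(trace) * latency

def buffered_cycles(trace, latency, depth):
    """Reads stall for `latency`; writes take one cycle to enter a write
    buffer of `depth` entries and drain to memory one at a time."""
    now = 0
    in_flight = []        # completion times of buffered writes
    last_done = 0         # when the most recently buffered write completes
    for op in trace:
        in_flight = [t for t in in_flight if t > now]
        if op == "r":
            now += latency                # reads do not wait for pending writes
            continue
        if len(in_flight) == depth:       # buffer full: stall until a slot frees
            now = min(in_flight)
            in_flight = [t for t in in_flight if t > now]
        last_done = max(now, last_done) + latency
        in_flight.append(last_done)
        now += 1
    return now

trace = "wwrr"            # two writes followed by two reads
print(base_cycles(trace, 10), buffered_cycles(trace, 10, 2), buffered_cycles(trace, 10, 1))
```

In this toy model a buffer of depth 2 hides both write latencies, while a buffer of depth 1 fills up and stalls the processor on the second write.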

=== Performance of PC versus RC ===

In addition to providing all the benefits of the PC model, RC allows pipelining of writes by exploiting information about the synchronization accesses. That is, writes can be retired from the write buffer before ownership has been obtained. The fact that writes are retired at a faster rate has two implications:
* the write buffer becoming full is less of a problem, and
* in cases where a release operation is behind several writes in the write buffer, that release can be observed sooner by a processor waiting to do an acquire.
The write buffer getting full is really not a problem for PC. Therefore, most gains, if observed, will be due to faster completion of synchronization.

=== Performance of WC versus PC and RC ===

The differences between WC and RC arise because WC does not differentiate between acquire and release synchronization operations. Consequently, any synchronization operation must conservatively satisfy the constraints of both release and acquire. Thus, compared to RC, WC stalls the processor at an acquire until pending writes and releases complete. In addition, the processor is stalled for pending releases if it
attempts to do a read operation.

On comparing WC and PC we observe the surprising result that PC sometimes performs better than WC. WC has the advantage that writes can be retired at a faster rate from the write buffer. The disadvantage of WC relative to PC is the same as its disadvantage relative to RC: WC stalls the processor at some points for pending writes and releases to perform.

== Conclusion ==

In this article all the popular memory consistency models have been discussed in terms of how they are able to support multiprocessors today. Also, a performance based comparison of the different consistency models is presented. A decent effort has been made to discuss the relative weaknesses and strengths of different models. From what has been laid out earlier it is easy to conclude that there is no one best model, only the model that best fits the application, architecture, and programmer.

The choice is between a very strict consistency model such as SC and the relaxed consistency models. The sequential memory model is the easiest model to use for parallel computing. For a novice programmer who does not fully understand the problems inherent in memory consistency and cannot implement explicit mechanisms for synchronization, a sequentially consistent system may be the only model that will result in program correctness. There will be a significant penalty in performance, however, which could negate even the advantage of using a parallel system in the first place. Across a wide range of applications, the relaxed consistency models would on average give the best performance, but at greater programmer effort. A programmer with extensive knowledge of the architecture and experience in parallel programming will be able to write a high-performing application while ensuring correctness with synchronization mechanisms like barriers, locks, and fence instructions.

This article should help a programmer to make this choice and help select an appropriate model for his/her multiprocessor design.

==References==
<references/>

CSC/ECE 506 Spring 2012/10b sr

2012-04-05T08:45:45Z

Sshanbh: /* Sequential consistency */

'''Use of consistency models in current multiprocessors'''

The memory consistency model of a shared memory system determines the order in which memory operations will appear to execute to the programmer. This article describes how consistency is used in multiprocessors today and later digs into the details of popular consistency models in use today. The impact of these models on the multiprocessor performance is also discussed.

== Introduction ==

Many modern computer systems and most multicore chips support [http://en.wikipedia.org/wiki/Shared_memory shared memory] in hardware. In a shared memory system, each of the processor cores may read and write to a single shared address space. These designs seek various goodness properties, such as high performance, low power, and low cost. Of course, it is not valuable to provide these goodness properties without first providing correctness. Correct shared memory seems intuitive at a hand-wave level, but, there are subtle issues in even defining what it means for a shared memory system to be correct, as well as many subtle corner cases in designing a correct shared memory implementation. Moreover, these subtleties must be mastered in hardware implementations where bug fixes are expensive.

It is the job of consistency to define shared memory correctness. Consistency definitions provide rules about loads and stores (or memory reads and writes) and how they act upon memory. Ideally, consistency definitions would be simple and easy to understand. However, defining what it means for shared memory to behave correctly is more subtle than defining the correct behavior of, for example, a single-threaded processor core. The correctness criterion for a single processor core partitions behavior between one correct result and many incorrect alternatives. This is because the processor’s architecture mandates that the execution of a thread transforms a given input state into a single well-defined output state, even on an out-of-order core. Shared memory consistency models, however, concern the loads and stores of multiple threads and usually allow many correct executions while disallowing many incorrect ones. The possibility of multiple correct executions is due to the ISA allowing multiple threads to execute concurrently, often with many possible legal interleavings of instructions from different threads. The multitude of correct executions complicates the erstwhile simple challenge of determining whether an execution is correct. Nevertheless, consistency must be mastered to implement shared memory and, in some cases, to write correct programs that use it.

== Consistency in current-day multiprocessors ==

Today's scalable multiprocessors are mostly built with a distributed shared memory architecture. The memory is physically distributed but logically shared. In other words, the address spaces generated by all processing nodes globally form a single address space. The main advantage of such a system lies in the scalability of distributed hardware and the programmability of a shared [http://en.wikipedia.org/wiki/Virtual_memory virtual memory]. Representative systems include the Cray T3D, the Stanford Dash, the MIT Alewife, the Teracomputer, and Convex SPP.

A scalable system should be able to hide the long latencies of remote memory accesses. Several latency hiding techniques have been proposed. Coherent caches reduce the frequency of remote memory accesses by caching data close to the processor. Relaxed memory consistency allows reordering of memory events and buffering or pipelining of remote memory accesses. Data prefetching attempts to hide long read latency by issuing read requests well ahead of time, with the expectation that the data will be available in the cache when it is referenced. [http://en.wikipedia.org/wiki/Multithreading_(computer_architecture) Multithreading] attempts to hide the long latency by context switching between several active threads, thus allowing the processor to perform useful work while waiting for remote requests or synchronization faults to complete.

== Consistency models used ==

=== Address translation aware memory consistency <ref>http://people.ee.duke.edu/~sorin/papers/ieeemicro11_toppick.pdf</ref> ===

These memory consistency models define the behavior of operations (loads, stores, memory barriers, etc.) on physical addresses and virtual addresses. The two important levels of memory consistency that can be classified as address translation aware are described below:

==== Physical address memory consistency (PAMC)====
It is necessary to have correct PAMC for unmapped code to work correctly. Unmapped software, including the boot code and part of the system software that manages AT, relies upon PAMC. It is the responsibility of the hardware to implement PAMC and this is specified precisely in the architectural manual.
It is not too difficult to adapt an AT-oblivious consistency model as the specification of PAMC.

Example:
The PAMC model could be SC. In such a case the interface would specify that
(i) there must exist a total order of all loads and stores to physical addresses that respects the program order of each thread and
(ii) the value of each load is equal to the value of the most recent store to that physical address in the total order.
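Conditions (i) and (ii) can be turned into a check of a candidate total order. The sketch assumes operations are encoded as (thread, kind, address, value) tuples and that unwritten locations hold 0; the function name is hypothetical.

```python
def is_sc_total_order(total_order, programs):
    """total_order: list of (thread, kind, addr, value) operations.
    programs: thread -> list of that thread's operations in program order.
    """
    # (i) the total order must contain each thread's operations in program order
    for thread, program in programs.items():
        if [op for op in total_order if op[0] == thread] != program:
            return False
    # (ii) each load returns the most recent store to the same physical address
    memory = {}
    for _, kind, addr, value in total_order:
        if kind == "st":
            memory[addr] = value
        elif memory.get(addr, 0) != value:
            return False
    return True

store = ("T1", "st", "PA", 1)
load = ("T2", "ld", "PA", 1)
programs = {"T1": [store], "T2": [load]}
print(is_sc_total_order([store, load], programs), is_sc_total_order([load, store], programs))
```

An execution satisfies SC at the physical address level if at least one total order passes this check.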

==== Virtual address memory consistency (VAMC)====
Correct VAMC is required for mapped code to work correctly.

Although adapting an AT-oblivious consistency model for PAMC is straightforward, there are a few challenges when adapting an AT-oblivious consistency model for VAMC:
* synonyms - Multiple [http://en.wikipedia.org/wiki/Virtual_address_space virtual addresses] may map to the same [http://en.wikipedia.org/wiki/Physical_address physical address]. Suppose two virtual addresses VA1 and VA2 map to the same physical address PA. SC requires that the value of a load is equal to the value of the most recent store to the same address. It is possible to have a naive definition of VAMC that does not consider the level of indirection introduced by AT. To overcome this challenge, we re-formulate AT-oblivious consistency models for VAMC by applying the model to synonym sets of virtual addresses rather than individual addresses. For instance, we can define SC for VAMC as follows: there must exist a total order of all loads and stores to virtual addresses that respects program order and in which each load gets the value of the most recent store to any virtual address in the same virtual address synonym set. Incorporating synonyms explicitly in the consistency model allows programmers to reason about the ordering of accesses to virtual addresses.
* mapping and permission changes - Another challenge is that the set of memory operations at the VAMC level is richer than at the PAMC level.
* load/store side effects - Yet another challenge in specifying VAMC is that loads and stores to virtual addresses have certain side effects. The AT system includes status bits such as accessed and dirty bits for each page table entry. These status bits are part of the architectural state, and the ordering of updates to those bits must thus be specified in VAMC. To achieve this we add two new operations to the specification tables: Ld-sb (load’s impact on status bits) and St-sb (store’s impact on status bits).

=== Causal consistency ===
[[Image:causal.jpg|thumb|right|300px|Causal Consistency example]]

Hutto and Ahamad <ref>P. W. Hutto and M. Ahamad. Slow memory: Weakening consistency to enhance concurrency in distributed shared memories. In Proceedings of the 10th International Conference on Distributed Computing Systems, pages 302–311, May 1990.</ref> introduced causal consistency. Lamport <ref>Leslie Lamport. Time, clocks, and the ordering of events in a distributed system. Communications of the ACM, 21(7):558–565, 1978.</ref> defined the notion of potential causality to capture the flow of information in a distributed system. This notion can be applied to a memory system by interpreting a write as a message-send event and a read as a message-read event. A memory is causally consistent if all processors agree on the order of causally related events. Causally unrelated events (concurrent events) can be observed in different orders.
The example shown here is a legal execution history under CC but not under SC. Note that W(x)1 and W(x)2 are causally related as P2 observed the first write by P1. Furthermore, P3 and P4 observe the accesses W(x)2 and W(x)3 in different orders, which would not be legal in SC.

=== Delta consistency <ref>http://en.wikipedia.org/wiki/Delta_consistency</ref>===

The delta consistency model states that after a fixed time period δ, an update is propagated through the system and all replicas will be consistent. In other words, barring a short bounded interval after a modification, the result of any read operation is consistent with a read on the original copy of an object. If an object is modified, the read will not be consistent during the short period of time following its modification. Once the fixed time period elapses, the modification is propagated and the read is now consistent.

=== Entry consistency <ref>http://cs.gmu.edu/cne/modules/dsm/orange/entry_con.html</ref>===

This consistency model has been designed to be used with critical sections. Here, the programmer needs to use acquire and release at the start and end of each [http://en.wikipedia.org/wiki/Critical_section critical section] just like in both variants of [http://en.wikipedia.org/wiki/Release_consistency release consistency]. However, it also requires every ordinary shared variable to be associated with a synchronization variable such as a lock or a barrier. If the elements of an array need to be accessed independently in parallel, then each element of the array must be associated with its own lock. When an acquire is done on a synchronization variable, only those ordinary shared variables guarded by that synchronization variable are made consistent. Release consistency does not associate shared variables with locks or barriers and at acquire time has to determine empirically which variables it needs. This is where entry consistency differs from release consistency.

Formally, a memory exhibits entry consistency if the following conditions are met:

* An acquire access of a synchronization variable is not allowed to perform with respect to a process until all updates to the guarded shared data have been performed with respect to that process.
* Before an exclusive mode access to a synchronization variable by a process is allowed to perform with respect to that process, no other process may hold the synchronization variable, not even in nonexclusive mode.
* After an exclusive mode access to a synchronization variable has been performed, any other process's next nonexclusive mode access to that synchronization variable may not be performed until it has performed with respect to that variable's owner.

=== Eventual consistency <ref>http://en.wikipedia.org/wiki/Eventual_consistency</ref>===

Given a sufficiently long period of time over which no changes are sent, all updates can be expected to propagate eventually through the system. Thus, all the replicas will be consistent. As the consistency achieved is eventual, the possibility of conflicts is high. The conflicts have to be resolved. There are three types of resolution:
* Read repair: The correction is done when a read finds an inconsistency. This slows down the read operation.
* Write repair: The correction is done when a write operation finds an inconsistency, slowing down the write operation.
* Asynchronous repair: The correction is not part of a read or write operation.

=== Linearizability <ref>http://en.wikipedia.org/wiki/Linearizability</ref>===

This is also known as strict or atomic consistency. An operation is linearizable if it appears to the rest of the system to occur instantaneously. An atomic operation either occurs completely or does not occur at all. In reality, atomic operations do not actually occur instantaneously. Atomicity is enforced by mutual exclusion. At the software level, locks or semaphores are used to achieve this, while at the hardware level, a cache coherency protocol may be used. This makes it appear to the user that the entire operation occurred in a single instruction.

A sequence of invocations and responses made of an object by a set of threads is referred to as a history. Each invocation of a function is followed by a subsequent response.

Example:
Suppose two threads, A and B, attempt to acquire a lock, backing off if it is already taken. This would be modeled as both threads invoking the lock operation, then both threads receiving a response, one successful, one not:
* A calls lock
* B calls lock
* lock returns fail to A
* lock returns success to B

When every invocation is immediately followed by its response, the history is called a sequential history. A history is linearizable if:

* its invocations and responses can be reordered to yield a sequential history
* that sequential history is correct according to the sequential definition of the object
* if a response preceded an invocation in the original history, it must still precede it in the sequential reordering

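The invocations and responses of the example above can be reordered into a sequential history in two ways.

One way would be:
* A calls lock
* lock returns fail to A
* B calls lock
* lock returns success to B
This history is sequential, but it does not match the sequential definition of a lock: A's call fails even though no thread holds the lock at that point.

The other way would be:
* B calls lock
* lock returns success to B
* A calls lock
* lock returns fail to A
This history is sequential and correct according to the definition of a lock, so it is a linearization of the original history.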

Thus, an object is linearizable if all valid histories of its use can be linearized.

=== PRAM consistency ===
[[Image:pram.jpg|thumb|right|300px|PRAM Consistency example]]

PRAM (pipelined RAM) consistency is also known as FIFO consistency. The reasoning that led to this model was as follows: consider a multiprocessor where each processor has a local copy of the shared memory. For the memory to be scalable, the cost of an access should be independent of the time it takes to access the other processors’ memories. Lipton and Sandberg, who introduced the model, proposed that on a read, a PRAM would simply return the value stored in the local copy of the memory. On a write, it would update the local copy first and broadcast the new value to the other processors. Assuming a constant time for initiating a broadcast operation, the goal of making the cost of a read or write constant is thus achieved. In terms of ordering constraints, this is equivalent to requiring that all processors observe the writes from a single processor in the same order, while they may disagree on the order of writes by different processors.
The example shown is legal under PRAM but not under SC or CC. P3 and P4 observe the writes by P1 and P2 in different orders, although W(x)1 and W(x)2 are potentially causally related.

=== Release consistency <ref>http://en.wikipedia.org/wiki/Release_consistency</ref>===

Release consistency is one of the consistency models used in the domain of concurrent programming (e.g. in distributed shared memory, distributed transactions etc.).

Systems of this kind are characterised by the existence of two special synchronisation operations, release and acquire. Before issuing a write to a memory object a node must acquire the object via a special operation, and later release it. The code that runs between the acquire and release operations therefore constitutes the critical region. The system is said to provide release consistency if all write operations by a certain node are seen by the other nodes after the former releases the object and before the latter acquire it.

There are two kinds of protocols that implement release consistency:
* ''eager'', where all coherence actions are performed on release operations, and
* ''lazy'', where all coherence actions are delayed until after a subsequent acquire

=== Sequential consistency <ref>http://en.wikipedia.org/wiki/Sequential_consistency</ref>===
[[Image:sc.jpg|thumb|right|300px|Sequential Consistency example]]

Sequential consistency was first defined by Lamport in 1979. He defined a memory system to be sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.

In a sequentially consistent system, all processors must agree on the order of observed effects. The image to the right shows an example for SC.
Note that R(y)2 by processor P3 reads a value that has not been written yet! Of course, this is not possible in any real physical system. However, it shows a surprising flexibility of the SC model. Another reason why this is not a legal history for atomic consistency is that the
write operations W(x)1 and W(y)2 appear commuted at processor P3.

Sequential consistency has been the canonical memory consistency model for a long time. However, many multiprocessor machines actually implement a slightly weaker model called processor consistency.
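
Lamport's definition can be explored by brute force on very small programs. The Python sketch below is an illustration, not part of any cited source: the helper names and the two-thread test program (each thread writes one variable, then reads the other) are assumptions chosen for the example. It enumerates every interleaving that preserves each thread's program order and collects the register values that result.

```python
def interleavings(threads):
    """Yield every merge of the per-thread operation lists that keeps program order."""
    if all(not t for t in threads):
        yield []
        return
    for i, t in enumerate(threads):
        if t:
            rest = threads[:i] + [t[1:]] + threads[i + 1:]
            for tail in interleavings(rest):
                yield [t[0]] + tail


def sc_outcomes(threads):
    """Return the set of final register values reachable under sequential consistency."""
    outcomes = set()
    for order in interleavings(threads):
        memory, registers = {}, {}
        for op, addr, arg in order:
            if op == "W":
                memory[addr] = arg                      # arg is the value written
            else:
                registers[arg] = memory.get(addr, 0)    # arg names the destination register
        outcomes.add(tuple(sorted(registers.items())))
    return outcomes


# Memory is assumed to start at 0 everywhere.
p1 = [("W", "x", 1), ("R", "y", "r1")]
p2 = [("W", "y", 1), ("R", "x", "r2")]
outcomes = sc_outcomes([p1, p2])
both_zero_possible = (("r1", 0), ("r2", 0)) in outcomes
```

Under SC at least one of the two reads must observe the other thread's write, so the outcome r1 = r2 = 0 never appears; models that let reads bypass pending writes can produce it.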

=== Weak consistency ===

A memory system is weakly consistent if it enforces the following restrictions:
* accesses to synchronization variables are sequentially consistent and
* no access to a synchronization variable is issued in a processor before all previous data accesses have been performed and
* no access is issued by a processor before a previous access to a synchronization variable has been performed
Notice that the meaning of “previous” is well-defined because it refers to program order. That is, an access A precedes access B if and only if the processor that executed access B has previously executed access A. Synchronizing accesses work as fences. At the time a synchronizing access performs, all previous accesses by that processor are guaranteed to have performed, and all later accesses by that processor are guaranteed not to have performed. The synchronization model corresponding to these access order constraints is relatively simple. A program executing on a weakly consistent system appears sequentially consistent if the following two constraints are observed:
* There are no data races and
* Synchronization is visible to the memory system.
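
The fence-like behaviour of synchronizing accesses can be illustrated with a toy model. The Python sketch below is a simplification under stated assumptions: the class <code>WeakProcessor</code> and its method names are hypothetical, data writes sit in a per-processor buffer until a synchronization access drains it, and synchronization accesses go straight to shared memory.

```python
class WeakProcessor:
    """Toy processor whose data writes may linger in a local buffer."""

    def __init__(self, memory):
        self.memory = memory   # shared dict standing in for main memory
        self.buffer = []       # data writes issued but not yet performed

    def _drain(self):
        # All previous data accesses perform before a synchronization access.
        for addr, value in self.buffer:
            self.memory[addr] = value
        self.buffer.clear()

    def write(self, addr, value):
        self.buffer.append((addr, value))

    def read(self, addr):
        # A processor always sees its own most recent buffered write.
        for a, v in reversed(self.buffer):
            if a == addr:
                return v
        return self.memory.get(addr, 0)

    def sync_write(self, addr, value):
        self._drain()
        self.memory[addr] = value

    def sync_read(self, addr):
        self._drain()
        return self.memory.get(addr, 0)


memory = {}
producer, consumer = WeakProcessor(memory), WeakProcessor(memory)

producer.write("data", 42)
early = consumer.read("data")        # racy read: the write is still buffered
producer.sync_write("flag", 1)       # synchronization makes "data" visible first
flag = consumer.sync_read("flag")
late = consumer.read("data")         # ordered after the flag, so it sees 42
```

The racy read is the kind of access the "no data races" condition rules out; once both sides go through the synchronization variable, the program behaves as it would under SC.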

=== Strong consistency ===

Strong consistency is one of the consistency models used in the domain of concurrent programming (e.g. in distributed shared memory, distributed transactions etc.). Strong consistency is supported if:
* All accesses are seen by all parallel processes (or nodes, processors etc.) in the same order (sequentially)
* Only one consistent state can be observed, as opposed to weak consistency, where different parallel processes (or nodes etc.) can perceive variables in different states.

== Practical performance impact ==

[[Image:consistency_models.jpg|thumb|center|700px|Implementation of consistency models]]

Here, we define performance as the processor utilization achieved in an execution. The reason for using processor utilization as the distinguishing factor is that it provides reasonable results even when the program’s control path is not deterministic and depends on relative timing of synchronization accesses. Let us make a comparative analysis of the performance achieved by the various consistency models on the LFC architecture (an aggressive implementation with lockup-free caches).

A BASE model has been added to the four consistency models, viz. Sequential Consistency (SC), Processor Consistency (PC), Release Consistency (RC) and Weak Consistency (WC). BASE is the most constrained model and is used as the baseline for all performance comparisons. It incorporates no buffering or pipelining and waits for each read and write to complete before proceeding.

=== Performance of SC versus BASE ===

The SC model does not perform significantly better than BASE. The performance gain from BASE to SC is small for most applications. This is because reads are expected to be closely interleaved with writes, which leaves little opportunity to overlap consecutive writes. Significant write clustering may occur sometimes, for example, when initializing data structures, but such occurrences are expected to be infrequent.

=== Performance of PC versus SC and BASE ===

In PC, sequential consistency is abandoned. The main benefit of this extra complexity comes from the fact that reads do not have to stall for pending writes to perform. However, some of the benefits may be lost if the write buffer gets full and stalls the processor. The PC model is relatively successful in hiding almost all of the latency of writes given a reasonably deep write buffer. Since the comparison of PC and WC is more involved than the comparison with RC, we next examine PC versus RC. Subsequently, we compare PC and WC.
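
A back-of-the-envelope model shows why hiding write latency raises processor utilization. The Python sketch below is a toy, not the simulator behind the comparison: the function <code>utilization</code>, the one-busy-cycle-per-access cost, and the operation counts and latency are all assumptions made up for illustration. BASE stalls for the full latency on every read and write, while the idealised PC case assumes a write buffer deep enough that only reads stall.

```python
def utilization(reads, writes, latency, hide_writes):
    """Fraction of cycles spent doing useful work in a toy stall model.

    Each access costs one busy cycle. An access that is not hidden also
    stalls the processor for `latency` cycles.
    """
    busy = reads + writes
    stalling_accesses = reads if hide_writes else reads + writes
    stall = stalling_accesses * latency
    return busy / (busy + stall)


# Hypothetical workload: 10 reads, 10 writes, 9 stall cycles per exposed access.
base = utilization(reads=10, writes=10, latency=9, hide_writes=False)
pc = utilization(reads=10, writes=10, latency=9, hide_writes=True)
```

The model ignores the case where the write buffer fills up, which is precisely where some of the benefit of PC can be lost.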

=== Performance of PC versus RC ===

In addition to providing all the benefits of the PC model, RC allows pipelining of writes by exploiting information about the synchronization accesses. That is, writes can be retired from the write buffer before ownership has been obtained. The fact that writes are retired at a faster rate has two implications:
* the write buffer becoming full is less of a problem, and
* in cases where a release operation is behind several writes in the write buffer, that release can be observed sooner by a processor waiting to do an acquire.
Since the write buffer getting full is not really a problem for PC in the first place, most gains, if observed, will be due to faster completion of synchronization.

=== Performance of WC versus PC and RC ===

The differences between WC and RC arise because WC does not differentiate between acquire and release synchronization operations. Consequently, any synchronization operation must conservatively satisfy the constraints of both release and acquire. Thus, compared to RC, WC stalls the processor at an acquire until pending writes and releases complete. In addition, the processor is stalled for pending releases if it attempts to do a read operation.

On comparing WC and PC we observe a surprising result: PC sometimes performs better than WC. WC has the advantage that writes can be retired at a faster rate from the write buffer. The disadvantage of WC relative to PC is the same as the disadvantage of WC relative to RC, in that WC stalls the processor at some points for pending writes and releases to perform.

== Conclusion ==

This article has discussed the popular memory consistency models in terms of how they support today's multiprocessors, presented a performance-based comparison of them, and discussed the relative strengths and weaknesses of the different models. From what has been laid out earlier, there is no one best model, only the model that best fits the application, architecture, and programmer.

The choice is between a very strict consistency model such as SC and other relaxed consistency models. The sequential memory model is the easiest model to use for parallel computing. For a novice programmer who does not fully understand the problems inherent with memory consistency and cannot implement explicit mechanisms for synchronization, a sequentially consistent system may be the only possible model that will result in program correctness. There will be a significant penalty in performance, however, which could negate even the advantage of using a parallel system in the first place. Across a wide range of applications, the relaxed consistency models would average the best performance, but at greater programmer effort. A programmer with extensive knowledge of the architecture and experience in parallel programming will be able to write a high-performing application while ensuring correctness with synchronization mechanisms like barriers, locks, and fence instructions.

This article should help a programmer to make this choice and help select an appropriate model for his/her multiprocessor design.

==References==
<references/>

CSC/ECE 506 Spring 2012/10b sr

2012-04-05T08:44:33Z

Sshanbh: /* Release consistency */

'''Use of consistency models in current multiprocessors'''

The memory consistency model of a shared memory system determines the order in which memory operations will appear to execute to the programmer. This article describes how consistency is used in multiprocessors today and later digs into the details of popular consistency models in use today. The impact of these models on the multiprocessor performance is also discussed.

== Introduction ==

Many modern computer systems and most multicore chips support [http://en.wikipedia.org/wiki/Shared_memory shared memory] in hardware. In a shared memory system, each of the processor cores may read and write to a single shared address space. These designs seek various goodness properties, such as high performance, low power, and low cost. Of course, it is not valuable to provide these goodness properties without first providing correctness. Correct shared memory seems intuitive at a hand-wave level, but, there are subtle issues in even defining what it means for a shared memory system to be correct, as well as many subtle corner cases in designing a correct shared memory implementation. Moreover, these subtleties must be mastered in hardware implementations where bug fixes are expensive.

It is the job of consistency to define shared memory correctness. Consistency definitions provide rules about loads and stores (or memory reads and writes) and how they act upon memory. Ideally, consistency definitions would be simple and easy to understand. However, defining what it means for shared memory to behave correctly is more subtle than defining the correct behavior of, for example, a single-threaded processor core. The correctness criterion for a single processor core partitions behavior between one correct result and many incorrect alternatives. This is because the processor’s architecture mandates that the execution of a thread transforms a given input state into a single well-defined output state, even on an out-of-order core. Shared memory consistency models, however, concern the loads and stores of multiple threads and usually allow many correct executions while disallowing many incorrect ones. The possibility of multiple correct executions is due to the ISA allowing multiple threads to execute concurrently, often with many possible legal interleavings of instructions from different threads. The multitude of correct executions complicates the erstwhile simple challenge of determining whether an execution is correct. Nevertheless, consistency must be mastered to implement shared memory and, in some cases, to write correct programs that use it.

== Consistency in current-day multiprocessors ==

Today's scalable multiprocessors are mostly built with a distributed shared memory architecture. The memory is physically distributed but logically shared. In other words, the address spaces generated by all processing nodes globally form a single address space. The main advantage of such a system lies in the scalability with distributed hardware and programmability of a shared [http://en.wikipedia.org/wiki/Virtual_memory virtual memory]. Representative systems include the Cray T3D, the Stanford Dash, the MIT Alewife, the Teracomputer, and Convex SPP.

A scalable system should be able to hide the long latencies of remote memory accesses. Several latency hiding techniques have been proposed: Coherent caches reduce the frequency of remote memory accesses by caching data close to the processor. Relaxed memory consistency allows reordering of memory events and buffering or pipelining of remote memory accesses. Data prefetching attempts to hide long read latency by issuing read requests well ahead of time, with the expectation that the data will be available in the cache when it is referenced. [http://en.wikipedia.org/wiki/Multithreading_(computer_architecture) Multithreading] attempts to hide the long latency by context switching between several active threads, thus allowing the processor to perform useful work while waiting for remote requests or synchronization faults to complete.

== Consistency models used ==

=== Address translation aware memory consistency <ref>http://people.ee.duke.edu/~sorin/papers/ieeemicro11_toppick.pdf</ref> ===

These memory consistency models define the behavior of operations (loads, stores, memory barriers, etc.) on physical addresses and virtual addresses. The two important levels of memory consistency that can be classified as address translation aware are described below:

==== Physical address memory consistency (PAMC)====
It is necessary to have correct PAMC for unmapped code to work correctly. Unmapped software, including the boot code and the part of the system software that manages address translation (AT), relies upon PAMC. It is the responsibility of the hardware to implement PAMC, and this is specified precisely in the architectural manual.
It is not too difficult to adapt an AT-oblivious consistency model as the specification of PAMC.

Example:
The PAMC model could be SC. In such a case the interface would specify that
(i) there must exist a total order of all loads and stores to physical addresses that respects the program order of each thread and
(ii) the value of each load is equal to the value of the most recent store to that physical address in the total order.

==== Virtual address memory consistency (VAMC)====
Correct VAMC is required for mapped code to work correctly.

Although adapting an AT-oblivious consistency model for PAMC is straightforward, there are a few challenges when adapting an AT-oblivious consistency model for VAMC:
* synonyms - Multiple [http://en.wikipedia.org/wiki/Virtual_address_space virtual addresses] may map to the same [http://en.wikipedia.org/wiki/Physical_address physical address]. Suppose two virtual addresses VA1 and VA2 map to the same physical address PA. SC requires that the value of a load is equal to the value of the most recent store to the same address. A naive definition of VAMC that does not consider the level of indirection introduced by AT would treat VA1 and VA2 as different addresses, so a load from VA2 would not be required to return the value of the most recent store to VA1. To overcome this challenge, we re-formulate AT-oblivious consistency models for VAMC by applying the model to synonym sets of virtual addresses rather than individual addresses. For instance, we can define SC for VAMC as follows: there must exist a total order of all loads and stores to virtual addresses that respects program order and in which each load gets the value of the most recent store to any virtual address in the same virtual address synonym set. Incorporating synonyms explicitly in the consistency model allows programmers to reason about the ordering of accesses to virtual addresses.
* mapping and permission changes - Another challenge is that the set of memory operations at the VAMC level is richer than at the PAMC level.
* load/store side effects - Yet another challenge in specifying VAMC is that loads and stores to virtual addresses have certain side effects. The AT system includes status bits such as accessed and dirty bits for each page table entry. These status bits are part of the architectural state, and the ordering of updates to those bits must thus be specified in VAMC. To achieve this we add two new operations to the specification tables: Ld-sb (load’s impact on status bits) and St-sb (store’s impact on status bits).
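
The synonym-set formulation of SC for VAMC can be sketched as a check on a candidate total order. The Python below is illustrative only: the function name, the <code>naive</code> flag, the fixed page table, and the initial memory value of 0 are assumptions, and the check covers only the load-value rule (it takes for granted that the given order already respects program order and that mappings do not change).

```python
def loads_see_latest_store(order, page_table, naive=False):
    """Check the load-value rule of SC on a candidate total order.

    order:      list of (op, virtual_address, value), op is "St" or "Ld"
    page_table: {virtual_address: physical_address}
    naive:      if True, ignore address translation and track each VA separately
    """
    latest = {}
    for op, va, value in order:
        # Virtual addresses that translate to the same PA form one synonym set.
        key = va if naive else page_table[va]
        if op == "St":
            latest[key] = value
        elif latest.get(key, 0) != value:
            return False
    return True


page_table = {"VA1": "PA", "VA2": "PA", "VA3": "PB"}

# A load through the synonym VA2 returns the value stored through VA1.
good = [("St", "VA1", 1), ("Ld", "VA2", 1)]
# A load through VA2 misses the store made through VA1.
stale = [("St", "VA1", 1), ("Ld", "VA2", 0)]

good_ok = loads_see_latest_store(good, page_table)
stale_ok = loads_see_latest_store(stale, page_table)
stale_ok_naive = loads_see_latest_store(stale, page_table, naive=True)
```

The naive variant accepts the stale execution because it never connects VA1 and VA2, which is the gap the synonym-set definition closes.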

=== Causal consistency ===
[[Image:causal.jpg|thumb|right|300px|Causal Consistency example]]

Hutto and Ahamad <ref>P.W. Hutto and M. Ahamad. Slow memory: Weakening consistency to enhance concurrency in distributed shared memories. In Proceedings of the 10th International Conference on Distributed Computing Systems, pages 302–311, May 1990.</ref> introduced causal consistency. Lamport <ref>Leslie Lamport. Time, clocks, and the ordering of events in a distributed system. Communications of the ACM, 21(7):558–565, 1978.</ref> defined the notion of potential causality to capture the flow of information in a distributed system. This notion can be applied to a memory system by interpreting a write as a message-send event and a read as a message-read event. A memory is causally consistent if all processors agree on the order of causally related events. Causally unrelated events (concurrent events) can be observed in different orders.
The example shown here is a legal execution history under CC but not under SC. Note that W(x)1 and W(x)2 are causally related as P2 observed the first write by P1. Furthermore, P3 and P4 observe the accesses W(x)2 and W(x)3 in different orders, which would not be legal in SC.

=== Delta consistency <ref>http://en.wikipedia.org/wiki/Delta_consistency</ref>===

The delta consistency model states that after a fixed time period δ, an update is propagated through the system and all replicas will be consistent. In other words, barring a short bounded interval after a modification, the result of any read operation is consistent with a read on the original copy of an object. If an object is modified, the read will not be consistent during the short period of time following its modification. Once the fixed time period elapses, the modification is propagated and the read is now consistent.
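
The bounded staleness window can be illustrated with a toy replica. In the Python sketch below the class <code>DeltaReplica</code>, the explicit integer clock passed to each call, and the example values are all assumptions made for illustration; an update issued at time t becomes visible at the replica at time t + δ.

```python
class DeltaReplica:
    """Toy replica that applies each update exactly `delta` time units after it was issued."""

    def __init__(self, delta, initial=None):
        self.delta = delta
        self.value = initial
        self.in_flight = []   # (time the update becomes visible, value)

    def update(self, now, value):
        self.in_flight.append((now + self.delta, value))

    def read(self, now):
        # Apply, in order of arrival time, every update whose window has elapsed.
        arrived = [u for u in self.in_flight if u[0] <= now]
        self.in_flight = [u for u in self.in_flight if u[0] > now]
        for _, value in sorted(arrived, key=lambda u: u[0]):
            self.value = value
        return self.value


replica = DeltaReplica(delta=5, initial=0)
replica.update(now=0, value=1)
inside_window = replica.read(now=3)   # still the old value
after_window = replica.read(now=5)    # the update has propagated
```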

=== Entry consistency <ref>http://cs.gmu.edu/cne/modules/dsm/orange/entry_con.html</ref>===

This consistency model has been designed to be used with critical sections. Here, the programmer needs to use acquire and release at the start and end of each [http://en.wikipedia.org/wiki/Critical_section critical section] just like in both variants of [http://en.wikipedia.org/wiki/Release_consistency release consistency]. However, it also requires every ordinary shared variable to be associated with a synchronization variable such as a lock or a barrier. If the elements of an array need to be accessed independently in parallel, then each element of the array must be associated with a lock. When an acquire is done on a synchronization variable, only those ordinary shared variables guarded by that synchronization variable are made consistent. Release consistency does not associate shared variables with locks or barriers and at acquire time has to determine empirically which variables it needs. This is where entry consistency differs from release consistency.

Formally, a memory exhibits entry consistency if the following conditions are met:

* An acquire access of a synchronization variable is not allowed to perform with respect to a process until all updates to the guarded shared data have been performed with respect to that process.
* Before an exclusive mode access to a synchronization variable by a process is allowed to perform with respect to that process, no other process may hold the synchronization variable, not even in nonexclusive mode.
* After an exclusive mode access to a synchronization variable has been performed, any other process's next nonexclusive mode access to that synchronization variable may not be performed until it has performed with respect to that variable's owner.

=== Eventual consistency <ref>http://en.wikipedia.org/wiki/Eventual_consistency</ref>===

Given a sufficiently long period of time over which no changes are sent, all updates can be expected to propagate eventually through the system. Thus, all the replicas will be consistent. As the consistency achieved is eventual, the possibility of conflicts is high. The conflicts have to be resolved. There are three types of resolution:
* Read repair: The correction is done when a read finds an inconsistency. This slows down the read operation.
* Write repair: The correction is done when a write operation finds an inconsistency, slowing down the write operation.
* Asynchronous repair: The correction is not part of a read or write operation.
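
Read repair, the first of the three strategies, can be sketched as follows. This Python example is a simplified illustration: the function name, the representation of each replica as a dictionary of <code>(version, value)</code> pairs, and the rule that the highest version wins are assumptions, since the text above does not say how conflicting copies are ranked.

```python
def read_with_repair(replicas, key):
    """Read `key` from all replicas, return the newest value, and fix stale copies.

    replicas: list of dicts mapping key -> (version, value)
    """
    copies = [r[key] for r in replicas if key in r]
    newest = max(copies, key=lambda entry: entry[0], default=None)
    if newest is None:
        return None
    for replica in replicas:
        if replica.get(key) != newest:
            # The repair runs inside the read path, which is why reads get slower.
            replica[key] = newest
    return newest[1]


# Hypothetical state: one replica missed the latest update, one never saw the key.
replicas = [
    {"x": (2, "new")},
    {"x": (1, "old")},
    {},
]
value = read_with_repair(replicas, "x")
```

After the call every replica holds the same copy, so the inconsistency is resolved at the cost of extra work on that read.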

=== Linearizability <ref>http://en.wikipedia.org/wiki/Linearizability</ref>===

This is also known as strict or atomic consistency. An operation is linearizable if it appears to the rest of the system to occur instantaneously. An atomic operation either occurs completely or does not occur at all. In reality, atomic operations do not actually occur instantaneously. Atomicity is enforced by mutual exclusion. At the software level, locks or semaphores are used to achieve this, while at the hardware level, a cache coherency protocol may be used. This makes it appear to the user that the entire operation occurred in a single instruction.

A sequence of invocations and responses made of an object by a set of threads is referred to as a history. When a function is invoked, a subsequent response is generated.

Example:
Suppose two threads, A and B, attempt to acquire a lock, backing off if it's already taken.
This would be modeled as both threads invoking the lock operation, then both threads receiving a response, one successful, one not.
* A calls lock
* B calls lock
* lock returns fail to A
* lock returns success to B

When every invocation is immediately followed by its response, the history is called a sequential history. A history is linearizable if:

* its invocations and responses can be reordered to yield a sequential history
* that sequential history is correct according to the sequential definition of the object
* if a response preceded an invocation in the original history, it must still precede it in the sequential reordering

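Now we can reorder the above example in two ways as follows:

Example:
One way would be:
* A calls lock
* lock returns fail to A
* B calls lock
* lock returns success to B
This is not a correct sequential history for a lock: the lock was free when A called it, so A should have succeeded.

The other way would be:
* B calls lock
* lock returns success to B
* A calls lock
* lock returns fail to A
This reordering is a correct sequential history, so the original history is linearizable.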

Thus, an object is linearizable if all valid histories of its use can be linearized.
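
The three conditions can be checked mechanically for tiny histories by trying every reordering. The Python sketch below is a brute-force illustration, not a practical checker: the function name, the encoding of each call as <code>(thread, invoked_at, responded_at, result)</code> with integer timestamps, and the sequential rule used for the lock (the first call succeeds and every later call fails, since nobody unlocks) are assumptions made for this example.

```python
from itertools import permutations


def is_linearizable(ops):
    """ops: list of (thread, invoked_at, responded_at, result) for lock attempts."""
    for order in permutations(ops):
        # A call that responded before another was invoked must stay ahead of it.
        respects_real_time = all(
            not (later[2] < earlier[1])
            for i, earlier in enumerate(order)
            for later in order[i + 1:]
        )
        if not respects_real_time:
            continue
        # Sequential definition of the lock: free at first, held after one success.
        held = False
        correct = True
        for _thread, _inv, _res, result in order:
            expected = "fail" if held else "success"
            if result != expected:
                correct = False
                break
            held = True
        if correct:
            return True
    return False


# The history from the example: the two calls overlap in time.
overlapping = [("A", 0, 2, "fail"), ("B", 1, 3, "success")]
# A variant in which A's failed call finishes before B's call even starts.
sequential_fail_first = [("A", 0, 1, "fail"), ("B", 2, 3, "success")]

overlapping_ok = is_linearizable(overlapping)
fail_first_ok = is_linearizable(sequential_fail_first)
```

The overlapping history is accepted because B's call may be ordered first. The variant is rejected because real-time order forces A first, and a failed attempt on a free lock violates the sequential definition.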

=== PRAM consistency ===
[[Image:pram.jpg|thumb|right|300px|PRAM Consistency example]]

It is also known as FIFO consistency. The reasoning that led to this model was as follows: Consider a multiprocessor where each processor has a local copy of the shared memory. For the memory to be scalable, the time taken by an access should be independent of the time it takes to access the other processors’ memories. Lipton and Sandberg proposed that on a read, a PRAM would simply return the value stored in the local copy of the memory. On a write, it would update the local copy first and broadcast the new value to the other processors. Assuming a constant time for initiating a broadcast operation, the goal of making the cost for a read or write constant is thus achieved. In terms of ordering constraints, this is equivalent to
requiring that all processors observe the writes from a single processor in the same order while they may disagree on the order of writes by different processors.
The example shown is legal under PRAM but not under SC or CC. P3 and P4 observe the writes by P1 and P2 in different orders, although W(x)1 and W(x)2 are potentially causally related.

=== Release consistency <ref>http://en.wikipedia.org/wiki/Release_consistency</ref>===

Release consistency is one of the consistency models used in the domain of concurrent programming (e.g. in distributed shared memory, distributed transactions, etc.).

Systems of this kind are characterised by the existence of two special synchronisation operations, release and acquire. Before issuing a write to a memory object, a node must acquire the object via a special operation, and it must later release it. The code that runs between the acquire and release operations therefore constitutes the critical region. The system is said to provide release consistency if all write operations by a certain node are seen by the other nodes after the former releases the object and before the latter acquire it.

There are two kinds of protocols that implement release consistency:
* ''eager'', where all coherence actions are performed on release operations, and
* ''lazy'', where all coherence actions are delayed until after a subsequent acquire

=== Sequential consistency ===
[[Image:sc.jpg|thumb|right|300px|Sequential Consistency example]]

Sequential consistency was first defined by Lamport in 1979. He defined a memory system to be sequentially consistent if:
the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations
of each individual processor appear in this sequence in the order specified by its program.

In a sequentially consistent system, all processors must agree on the order of observed effects. The image to the right shows an example for SC.
Note that R(y)2 by processor P3 reads a value that has not been written yet! Of course, this is not possible in any real physical system. However, it shows a surprising flexibility of the SC model. Another reason why this is not a legal history for atomic consistency is that the
write operations W(x)1 and W(y)2 appear commuted at processor P3.

Sequential consistency has been the canonical memory consistency model for a long time. However, many multiprocessor machines actually implement a slightly weaker model called processor consistency.

=== Weak consistency ===

A memory system is weakly consistent if it enforces the following restrictions:
* accesses to synchronization variables are sequentially consistent and
* no access to a synchronization variable is issued in a processor before all previous data accesses have been performed and
* no access is issued by a processor before a previous access to a synchronization variable has been performed
Notice that the meaning of “previous” is well-defined because it refers to program order. That is, an access A precedes access B if and only if the processor that executed access B has previously executed access A. Synchronizing accesses work as fences. At the time a synchronizing access performs, all previous accesses by that processor are guaranteed to have performed, and all later accesses by that processor are guaranteed not to have performed. The synchronization model corresponding to these access order constraints is relatively simple. A program executing on a weakly consistent system appears sequentially consistent if the following two constraints are observed:
* There are no data races and
* Synchronization is visible to the memory system.

=== Strong consistency ===

Strong consistency is one of the consistency models used in the domain of concurrent programming (e.g. in distributed shared memory, distributed transactions etc.). Strong consistency is supported if:
* All accesses are seen by all parallel processes (or nodes, processors etc.) in the same order (sequentially)
* Only one consistent state can be observed, as opposed to weak consistency, where different parallel processes (or nodes etc.) can perceive variables in different states.

== Practical performance impact ==

[[Image:consistency_models.jpg|thumb|center|700px|Implementation of consistency models]]

Here, we define performance as the processor utilization achieved in an execution. The reason for using processor utilization as the distinguishing factor is that it provides reasonable results even when the program’s control path is not deterministic and depends on relative timing of synchronization accesses. Let us make a comparative analysis of the performance achieved by the various consistency models on the LFC architecture (an aggressive implementation with lockup-free caches).

A BASE model has been added to the four consistency models, viz. Sequential Consistency (SC), Processor Consistency (PC), Release Consistency (RC) and Weak Consistency (WC). BASE is the most constrained model and is used as the baseline for all performance comparisons. It incorporates no buffering or pipelining and waits for each read and write to complete before proceeding.

=== Performance of SC versus BASE ===

The SC model does not perform significantly better than BASE. The performance gain from BASE to SC is small for most applications. This is because reads are expected to be closely interleaved with writes, which leaves little opportunity to overlap consecutive writes. Significant write clustering may occur sometimes, for example, when initializing data structures, but such occurrences are expected to be infrequent.

=== Performance of PC versus SC and BASE ===

In PC, sequential consistency is abandoned. The main benefit of this extra complexity comes from the fact that reads do not have to stall for pending writes to perform. However, some of the benefits may be lost if the write buffer gets full and stalls the processor. The PC model is relatively successful in hiding almost all of the latency of writes given a reasonably deep write buffer. Since the comparison of PC and WC is more involved than the comparison with RC, we next examine PC versus RC. Subsequently, we compare PC and WC.

=== Performance of PC versus RC ===

In addition to providing all the benefits of the PC model, RC allows pipelining of writes by exploiting information about the synchronization accesses. That is, writes can be retired from the write buffer before ownership has been obtained. The fact that writes are retired at a faster rate has two implications:
* the write buffer becoming full is less of a problem, and
* in cases where a release operation is behind several writes in the write buffer, that release can be observed sooner by a processor waiting to do an acquire.
Since the write buffer getting full is not really a problem for PC in the first place, most gains, if observed, will be due to faster completion of synchronization.

=== Performance of WC versus PC and RC ===

The differences between WC and RC arise because WC does not differentiate between acquire and release synchronization operations. Consequently, any synchronization operation must conservatively satisfy the constraints of both release and acquire. Thus, compared to RC, WC stalls the processor at an acquire until pending writes and releases complete. In addition, the processor is stalled for pending releases if it attempts to do a read operation.

On comparing WC and PC we observe a surprising result: PC sometimes performs better than WC. WC has the advantage that writes can be retired at a faster rate from the write buffer. The disadvantage of WC relative to PC is the same as the disadvantage of WC relative to RC, in that WC stalls the processor at some points for pending writes and releases to perform.

== Conclusion ==

This article has discussed the popular memory consistency models in terms of how they support today's multiprocessors, presented a performance-based comparison of them, and discussed the relative strengths and weaknesses of the different models. From what has been laid out earlier, there is no one best model, only the model that best fits the application, architecture, and programmer.

The choice is between a very strict consistency model such as SC and other relaxed consistency models. The sequential memory model is the easiest model to use for parallel computing. For a novice programmer who does not fully understand the problems inherent with memory consistency and cannot implement explicit mechanisms for synchronization, a sequentially consistent system may be the only possible model that will result in program correctness. There will be a significant penalty in performance, however, which could negate even the advantage of using a parallel system in the first place. Across a wide range of applications, the relaxed consistency models would average the best performance, but at greater programmer effort. A programmer with extensive knowledge of the architecture and experience in parallel programming will be able to write a high-performing application while ensuring correctness with synchronization mechanisms like barriers, locks, and fence instructions.

This article should help a programmer to make this choice and help select an appropriate model for his/her multiprocessor design.

==References==
<references/>

CSC/ECE 506 Spring 2012/10b sr

2012-04-05T08:42:43Z

Sshanbh: /* Linearizability */

'''Use of consistency models in current multiprocessors'''

The memory consistency model of a shared memory system determines the order in which memory operations will appear to execute to the programmer. This article describes how consistency is used in multiprocessors today and later digs into the details of popular consistency models in use today. The impact of these models on the multiprocessor performance is also discussed.

== Introduction ==

Many modern computer systems and most multicore chips support [http://en.wikipedia.org/wiki/Shared_memory shared memory] in hardware. In a shared memory system, each of the processor cores may read and write to a single shared address space. These designs seek various goodness properties, such as high performance, low power, and low cost. Of course, it is not valuable to provide these goodness properties without first providing correctness. Correct shared memory seems intuitive at a hand-wave level, but, there are subtle issues in even defining what it means for a shared memory system to be correct, as well as many subtle corner cases in designing a correct shared memory implementation. Moreover, these subtleties must be mastered in hardware implementations where bug fixes are expensive.

It is the job of consistency to define shared memory correctness. Consistency definitions provide rules about loads and stores (or memory reads and writes) and how they act upon memory. Ideally, consistency definitions would be simple and easy to understand. However, defining what it means for shared memory to behave correctly is more subtle than defining the correct behavior of, for example, a single-threaded processor core. The correctness criterion for a single processor core partitions behavior between one correct result and many incorrect alternatives. This is because the processor’s architecture mandates that the execution of a thread transforms a given input state into a single well-defined output state, even on an out-of-order core. Shared memory consistency models, however, concern the loads and stores of multiple threads and usually allow many correct executions while disallowing many incorrect ones. The possibility of multiple correct executions is due to the ISA allowing multiple threads to execute concurrently, often with many possible legal interleavings of instructions from different threads. The multitude of correct executions complicates the erstwhile simple challenge of determining whether an execution is correct. Nevertheless, consistency must be mastered to implement shared memory and, in some cases, to write correct programs that use it.

== Consistency in current-day multiprocessors ==

Today's scalable multiprocessors are mostly built with a distributed shared memory architecture. The memory is physically distributed but logically shared. In other words, the address spaces generated by all processing nodes globally form a single address space. The main advantage of such a system lies in combining the scalability of distributed hardware with the programmability of a shared [http://en.wikipedia.org/wiki/Virtual_memory virtual memory]. Representative systems include the Cray T3D, the Stanford Dash, the MIT Alewife, the Teracomputer, and the Convex SPP.

A scalable system should be able to hide the long latencies of remote memory accesses. Several latency hiding techniques have been proposed. Coherent caches reduce the frequency of remote memory accesses by caching data close to the processor. Relaxed memory consistency allows reordering of memory events and buffering or pipelining of remote memory accesses. Data prefetching attempts to hide long read latency by issuing read requests well ahead of time, with the expectation that the data will be available in the cache when it is referenced. [http://en.wikipedia.org/wiki/Multithreading_(computer_architecture) Multithreading] attempts to hide the long latency by context switching between several active threads, thus allowing the processor to perform useful work while waiting for remote requests or synchronization faults to complete.

== Consistency models used ==

=== Address translation aware memory consistency <ref>http://people.ee.duke.edu/~sorin/papers/ieeemicro11_toppick.pdf</ref> ===

These memory consistency models define the behavior of operations (loads, stores, memory barriers, etc.) on physical addresses and virtual addresses. The two important levels of memory consistency that can be classified as address translation aware are described below:

==== Physical address memory consistency (PAMC)====
It is necessary to have correct PAMC for unmapped code to work correctly. Unmapped software, including the boot code and part of the system software that manages AT, relies upon PAMC. It is the responsibility of the hardware to implement PAMC and this is specified precisely in the architectural manual.
It is not too difficult to adapt an AT-oblivious consistency model as the specification of PAMC.

Example:
The PAMC model could be SC. In such a case the interface would specify that
(i) there must exist a total order of all loads and stores to physical addresses that respects the program order of each thread and
(ii) the value of each load is equal to the value of the most recent store to that physical address in the total order.
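
The two conditions above can be checked mechanically for a proposed total order. The following is a minimal sketch in Python, not part of any architectural manual: the <code>Op</code> record, the address labels, and the function name are hypothetical, and memory is assumed to start at zero. Note that SC only requires that some such total order exists; this sketch checks one candidate order.

```python
from typing import NamedTuple

class Op(NamedTuple):
    thread: int   # issuing thread
    index: int    # position in that thread's program order
    kind: str     # "ld" or "st"
    addr: str     # physical address label (hypothetical)
    value: int    # value stored, or value the load returned

def is_sc_order(order, initial=0):
    """Check conditions (i) and (ii) for one proposed total order."""
    last_index = {}   # thread -> index of its latest op seen so far
    memory = {}       # physical address -> most recent stored value
    for op in order:
        # (i) the total order must respect each thread's program order
        if op.index <= last_index.get(op.thread, -1):
            return False
        last_index[op.thread] = op.index
        if op.kind == "st":
            memory[op.addr] = op.value
        # (ii) a load returns the most recent store to that address
        elif memory.get(op.addr, initial) != op.value:
            return False
    return True

# Usage: the load of PA1 returns 1, so the store must come first.
good = [Op(0, 0, "st", "PA1", 1), Op(1, 0, "ld", "PA1", 1)]
bad = [Op(1, 0, "ld", "PA1", 1), Op(0, 0, "st", "PA1", 1)]
print(is_sc_order(good), is_sc_order(bad))
```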

==== Virtual address memory consistency (VAMC)====
Correct VAMC is required for mapped code to work correctly.

Although adapting an AT-oblivious consistency model for PAMC is straightforward, there are a few challenges when adapting an AT-oblivious consistency model for VAMC:
* synonyms - Multiple [http://en.wikipedia.org/wiki/Virtual_address_space virtual addresses] may map to the same [http://en.wikipedia.org/wiki/Physical_address physical address]. Suppose two virtual addresses VA1 and VA2 map to the same physical address PA. SC requires that the value of a load is equal to the value of the most recent store to the same address. A naive definition of VAMC that does not consider the level of indirection introduced by AT would treat VA1 and VA2 as different addresses, so a load from VA2 would not be required to return the value of the most recent store to VA1. To overcome this challenge, we re-formulate AT-oblivious consistency models for VAMC by applying the model to synonym sets of virtual addresses rather than individual addresses. For instance, we can define SC for VAMC as follows: there must exist a total order of all loads and stores to virtual addresses that respects program order and in which each load gets the value of the most recent store to any virtual address in the same virtual address synonym set. Incorporating synonyms explicitly in the consistency model allows programmers to reason about the ordering of accesses to virtual addresses.
* mapping and permission changes - Another challenge is that the set of memory operations at the VAMC level is richer than at the PAMC level.
* load/store side effects - Yet another challenge in specifying VAMC is that loads and stores to virtual addresses have certain side effects. The AT system includes status bits such as accessed and dirty bits for each page table entry. These status bits are part of the architectural state, and the ordering of updates to those bits must thus be specified in VAMC. To achieve this we add two new operations to the specification tables: Ld-sb (load’s impact on status bits) and St-sb (store’s impact on status bits).
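
The synonym challenge in the first bullet can be illustrated with a small sketch. It replays a total order of loads and stores twice: once keyed on the virtual address itself, and once keyed on the synonym set (identified here by the physical address). The page table contents, the operation tuples, and the function name are assumptions made for illustration only.

```python
def replay_loads(order, key_of, initial=0):
    """Replay (kind, va, value) operations in total order.

    key_of maps a virtual address to the key under which stores are
    remembered. Returns the value each load observes, in order.
    """
    latest = {}     # key -> most recently stored value
    observed = []
    for kind, va, value in order:
        key = key_of(va)
        if kind == "st":
            latest[key] = value
        else:
            observed.append(latest.get(key, initial))
    return observed

# Hypothetical mapping: VA1 and VA2 are synonyms for the same page.
page_table = {"VA1": "PA", "VA2": "PA", "VA3": "PB"}
order = [("st", "VA1", 7), ("ld", "VA2", None), ("ld", "VA3", None)]

# Keyed on the virtual address: the load from VA2 misses the store to VA1.
naive = replay_loads(order, key_of=lambda va: va)
# Keyed on the synonym set: the load from VA2 sees the store to VA1.
aware = replay_loads(order, key_of=lambda va: page_table[va])
print(naive, aware)
```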

=== Causal consistency ===
[[Image:causal.jpg|thumb|right|300px|Causal Consistency example]]

Hutto and Ahamad <ref>P.W. Hutto and M. Ahamad. Slow memory: Weakening consistency to enhance concurrency in distributed shared memories. In Proceedings of the 10th International Conference on Distributed Computing Systems, pages 302–311, May 1990.</ref> introduced causal consistency. Lamport <ref>Leslie Lamport. Time, clocks, and the ordering of events in a distributed system. Communications of the ACM, 21(7):558–565, 1978.</ref> defined the notion of potential causality to capture the flow of information in a distributed system. This notion can be applied to a memory system by interpreting a write as a message-send event and a read as a message-read event. A memory is causally consistent if all processors agree on the order of causally related events. Causally unrelated events (concurrent events) can be observed in different orders.
The example shown here is a legal execution history under CC but not under SC. Note that W(x)1 and W(x)2 are causally related as P2 observed the first write by P1. Furthermore, P3 and P4 observe the accesses W(x)2 and W(x)3 in different orders, which would not be legal in SC.
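
One common way to track potential causality is with vector timestamps. The sketch below assumes such timestamps are available (the article itself does not prescribe them) and checks whether the order in which one processor observed a set of writes respects causality. The timestamp values are hypothetical, chosen so that w1 precedes w2 and w3 is concurrent with w2.

```python
def happened_before(a, b):
    """True if vector timestamp a causally precedes vector timestamp b."""
    return a != b and all(x <= y for x, y in zip(a, b))

def respects_causality(observed, clocks):
    """observed: write names in the order one processor saw them.
    clocks: write name -> vector timestamp (tuple of ints)."""
    for i, earlier in enumerate(observed):
        for later in observed[i + 1:]:
            # A write seen later must not causally precede one seen earlier.
            if happened_before(clocks[later], clocks[earlier]):
                return False
    return True

# Hypothetical timestamps: w1 -> w2 is causal, w2 and w3 are concurrent.
clocks = {"w1": (1, 0), "w2": (1, 1), "w3": (2, 0)}

view_a = ["w1", "w2", "w3"]   # allowed
view_b = ["w1", "w3", "w2"]   # also allowed: w2 and w3 are concurrent
view_c = ["w2", "w1", "w3"]   # not allowed: w1 causally precedes w2
print([respects_causality(v, clocks) for v in (view_a, view_b, view_c)])
```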

=== Delta consistency <ref>http://en.wikipedia.org/wiki/Delta_consistency</ref>===

The delta consistency model states that after a fixed time period δ, an update is propagated through the system and all replicas will be consistent. In other words, barring a short bounded interval after a modification, the result of any read operation is consistent with a read on the original copy of an object. If an object is modified, the read will not be consistent during the short period of time following its modification. Once the fixed time period elapses, the modification is propagated and the read is now consistent.
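
As a rough illustration of the bounded interval, the sketch below models a single replica that makes each write visible exactly δ time units after it is issued. The class name and the use of explicit integer timestamps are assumptions; a real system would only guarantee visibility within δ, not exactly at δ.

```python
class DeltaReplica:
    """Toy replica: a write becomes visible delta time units after it is issued."""

    def __init__(self, delta, initial=0):
        self.delta = delta
        self.value = initial
        self.pending = []   # (visible_at, value), in the order writes were issued

    def write(self, value, now):
        # Assumes writes are issued with non-decreasing timestamps.
        self.pending.append((now + self.delta, value))

    def read(self, now):
        # Apply every update whose propagation interval has elapsed.
        while self.pending and self.pending[0][0] <= now:
            self.value = self.pending.pop(0)[1]
        return self.value

replica = DeltaReplica(delta=5)
replica.write(42, now=10)
stale = replica.read(now=12)   # inside the interval: old value
fresh = replica.read(now=15)   # interval elapsed: new value
print(stale, fresh)
```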

=== Entry consistency <ref>http://cs.gmu.edu/cne/modules/dsm/orange/entry_con.html</ref>===

This consistency model has been designed to be used with critical sections. Here, the programmer needs to use acquire and release at the start and end of each [http://en.wikipedia.org/wiki/Critical_section critical section], just as in both variants of [http://en.wikipedia.org/wiki/Release_consistency release consistency]. However, it also requires every ordinary shared variable to be associated with a synchronization variable such as a lock or a barrier. If the elements of an array need to be accessed independently in parallel, then each element of the array must be associated with its own lock. When an acquire is done on a synchronization variable, only those ordinary shared variables guarded by that synchronization variable are made consistent. Release consistency does not associate shared variables with locks or barriers and at acquire time has to determine empirically which variables it needs. This is where entry consistency differs from release consistency.

Formally, a memory exhibits entry consistency if the following conditions are met:

* An acquire access of a synchronization variable is not allowed to perform with respect to a process until all updates to the guarded shared data have been performed with respect to that process.
* Before an exclusive mode access to a synchronization variable by a process is allowed to perform with respect to that process, no other process may hold the synchronization variable, not even in nonexclusive mode.
* After an exclusive mode access to a synchronization variable has been performed, any other process's next nonexclusive mode access to that synchronization variable may not be performed until it has performed with respect to that variable's owner.

=== Eventual consistency <ref>http://en.wikipedia.org/wiki/Eventual_consistency</ref>===

Given a sufficiently long period of time over which no changes are sent, all updates can be expected to propagate eventually through the system. Thus, all the replicas will be consistent. As the consistency achieved is eventual, the possibility of conflicts is high. The conflicts have to be resolved. There are three types of resolution:
* Read repair: The correction is done when a read finds an inconsistency. This slows down the read operation.
* Write repair: The correction is done when a write operation finds an inconsistency, slowing down the write operation.
* Asynchronous repair: The correction is not part of a read or write operation.
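
Read repair can be sketched as follows. The sketch assumes each replica stores a (version, value) pair per key and that the highest version wins; that conflict-resolution rule and all names are assumptions for illustration, not something the model itself mandates.

```python
def read_with_repair(replicas, key):
    """Read key from all replicas, return the newest value, and repair stale copies.

    replicas: list of dicts mapping key -> (version, value).
    Assumes at least one replica holds the key.
    """
    newest = max((r[key] for r in replicas if key in r), key=lambda vv: vv[0])
    for r in replicas:
        if r.get(key) != newest:
            # The repair happens on the read path, which is what slows the read.
            r[key] = newest
    return newest[1]

# Three replicas that have diverged: one current, one stale, one empty.
replicas = [{"x": (2, "b")}, {"x": (1, "a")}, {}]
result = read_with_repair(replicas, "x")
print(result, replicas)
```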

=== Linearizability <ref>http://en.wikipedia.org/wiki/Linearizability</ref>===

This is also known as strict or atomic consistency. An operation is linearizable if it appears to the rest of the system to occur instantaneously. An atomic operation either occurs completely or does not occur at all. In reality, atomic operations do not actually occur instantaneously. Atomicity is enforced by mutual exclusion. At the software level, locks or semaphores are used to achieve this, while at the hardware level, a cache coherency protocol may be used. This makes it appear to the user that the entire operation occurred in a single instruction.

A sequence of invocations and responses made of an object by a set of threads is referred to as a history. When a function is invoked, a subsequent response is generated.

Example:
Suppose two threads, A and B, attempt to acquire a lock, backing off if it is already taken. This would be modeled as both threads invoking the lock operation, then both threads receiving a response, one successful, one not:
* A calls lock
* B calls lock
* lock returns fail to A
* lock returns success to B

When all calls receive immediate responses, the history is called a sequential history. A history is linearizable if:

* its invocations and responses can be reordered to yield a sequential history
* that sequential history is correct according to the sequential definition of the object
* if a response preceded an invocation in the original history, it must still precede it in the sequential reordering

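Now we can reorder the above example in two ways as follows:

Example:
One way would be:
* A calls lock
* lock returns fail to A
* B calls lock
* lock returns success to B

This history is sequential, but it does not match the sequential definition of a lock: A called first on a free lock, so A should have succeeded.

The other way would be:
* B calls lock
* lock returns success to B
* A calls lock
* lock returns fail to A

This history is sequential and correct for a lock, so it is a linearization of the original history.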

Thus, an object is linearizable if all valid histories of its use can be linearized.
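
The three conditions can be checked by brute force for the small lock history above. The following sketch is only an illustration under stated assumptions: each thread makes exactly one call, the lock is a try-lock that is never released, and the event encoding and function name are hypothetical.

```python
from itertools import permutations

def linearizations(history):
    """Return every ordering of the calls that is a valid linearization.

    history: list of ("inv", thread) and ("res", thread, result) events,
    with exactly one try-lock call per thread.
    """
    inv = {e[1]: i for i, e in enumerate(history) if e[0] == "inv"}
    res = {e[1]: (i, e[2]) for i, e in enumerate(history) if e[0] == "res"}
    valid = []
    for order in permutations(inv):
        # A response that preceded an invocation must still precede it.
        if any(res[b][0] < inv[a]
               for i, a in enumerate(order) for b in order[i + 1:]):
            continue
        # Sequential definition: the first call succeeds, later calls fail.
        held, ok = False, True
        for t in order:
            expected = "fail" if held else "success"
            ok = ok and res[t][1] == expected
            held = True
        if ok:
            valid.append(order)
    return valid

history = [("inv", "A"), ("inv", "B"),
           ("res", "A", "fail"), ("res", "B", "success")]
print(linearizations(history))
```

Only the ordering in which B's call comes first survives both checks, which matches the second reordering above.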

=== PRAM consistency ===
[[Image:pram.jpg|thumb|right|300px|PRAM Consistency example]]

It is also known as FIFO consistency. The reasoning that led to this model was as follows: Consider a multi-processor where each processor has a local copy of the shared memory. For the memory to be scalable, the cost of an access should be independent of the time it takes to access the other processors’ memories. Lipton and Sandberg, who introduced the model, proposed that on a read, a PRAM would simply return the value stored in the local copy of the memory. On a write, it would update the local copy first and broadcast the new value to the other processors. Assuming a constant time for initiating a broadcast operation, the goal of making the cost for a read or write constant is thus achieved. In terms of ordering constraints, this is equivalent to requiring that all processors observe the writes from a single processor in the same order, while they may disagree on the order of writes by different processors.
The example shown is legal under PRAM but not under SC or CC. P3 and P4 observe the writes by P1 and P2 in different orders, although W(x)1 and W(x)2 are potentially causally related.
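
The ordering constraint can be expressed as a simple check on what each processor observes. The sketch below assumes every observer eventually sees every write and that values written by one processor are distinct; the processor names and values are hypothetical and do not reproduce the figure exactly.

```python
def is_pram_view(view, program_order):
    """Check that one observer saw each writer's writes in issue order.

    view: list of (writer, value) in the order the observer saw them.
    program_order: writer -> list of values in the order they were written.
    """
    for writer, issued in program_order.items():
        seen = [value for w, value in view if w == writer]
        if seen != issued:
            return False
    return True

program_order = {"P1": [1, 3], "P2": [2]}

# Observers may interleave different writers differently ...
view_p3 = [("P1", 1), ("P2", 2), ("P1", 3)]
view_p4 = [("P2", 2), ("P1", 1), ("P1", 3)]
# ... but may not reorder the writes of a single writer.
view_bad = [("P1", 3), ("P1", 1), ("P2", 2)]
print([is_pram_view(v, program_order) for v in (view_p3, view_p4, view_bad)])
```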

=== Release consistency ===

Release consistency is one of the consistency models used in the domain of concurrent programming (e.g. in distributed shared memory, distributed transactions etc.).

Systems of this kind are characterised by the existence of two special synchronisation operations, release and acquire. Before issuing a write to a memory object, a node must acquire the object via a special operation, and later release it. The code that runs between the acquire and release operations therefore constitutes the critical region. The system is said to provide release consistency if all write operations by a certain node are seen by the other nodes after the former releases the object and before the latter acquire it.

There are two kinds of protocols that implement release consistency:
* ''eager'', where all coherence actions are performed on release operations, and
* ''lazy'', where all coherence actions are delayed until after a subsequent acquire

=== Sequential consistency ===
[[Image:sc.jpg|thumb|right|300px|Sequential Consistency example]]

Sequential consistency was first defined by Lamport in 1979. He defined a memory system to be sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.

In a sequentially consistent system, all processors must agree on the order of observed effects. The image to the right shows an example for SC. Note that R(y)2 by processor P3 reads a value that has not been written yet! Of course, this is not possible in any real physical system. However, it shows a surprising flexibility of the SC model. Another reason why this is not a legal history for atomic consistency is that the write operations W(x)1 and W(y)2 appear commuted at processor P3.

Sequential consistency has been the canonical memory consistency model for a long time. However, many multiprocessor machines actually implement a slightly weaker model called processor consistency.
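
What SC permits can be made concrete with a small litmus test. The sketch below enumerates every interleaving of two hypothetical two-instruction threads that preserves program order and collects the register values that result. The program and register names are assumptions chosen for illustration.

```python
from itertools import permutations

# Thread 0: x = 1; r0 = y        Thread 1: y = 1; r1 = x
T0 = [("st", "x", 1), ("ld", "y", "r0")]
T1 = [("st", "y", 1), ("ld", "x", "r1")]

def sc_outcomes(threads):
    """Return the set of (r0, r1) results over all SC interleavings."""
    ops = [(t, i) for t, prog in enumerate(threads) for i in range(len(prog))]
    outcomes = set()
    for order in permutations(ops):
        # Keep only interleavings that preserve each thread's program order.
        if any(a[0] == b[0] and a[1] > b[1]
               for k, a in enumerate(order) for b in order[k + 1:]):
            continue
        mem, regs = {"x": 0, "y": 0}, {}
        for t, i in order:
            kind, addr, arg = threads[t][i]
            if kind == "st":
                mem[addr] = arg
            else:
                regs[arg] = mem[addr]
        outcomes.add((regs["r0"], regs["r1"]))
    return outcomes

print(sorted(sc_outcomes([T0, T1])))
```

The result (0, 0) never appears: under SC at least one of the loads must follow the other thread's store. Models that let a read bypass a pending write can allow it.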

=== Weak consistency ===

A memory system is weakly consistent if it enforces the following restrictions:
* accesses to synchronization variables are sequentially consistent and
* no access to a synchronization variable is issued in a processor before all previous data accesses have been performed and
* no access is issued by a processor before a previous access to a synchronization variable has been performed
Notice that the meaning of “previous” is well-defined because it refers to program order. That is, an access A precedes access B if and only if the processor that executed access B has previously executed access A. Synchronizing accesses work as fences. At the time a synchronizing access performs, all previous accesses by that processor are guaranteed to have performed and all subsequent accesses by that processor are guaranteed not to have performed. The synchronization model corresponding to these access order constraints is relatively simple. A program executing on a weakly consistent system appears sequentially consistent if the following two constraints are observed:
* There are no data races and
* Synchronization is visible to the memory system.

=== Strong consistency ===

Strong consistency is one of the consistency models used in the domain of concurrent programming (e.g. in distributed shared memory, distributed transactions etc.). Strong consistency is supported if:
* All accesses are seen by all parallel processes (or nodes, processors etc.) in the same order (sequentially)
* Only one consistent state can be observed, as opposed to weak consistency, where different parallel processes (or nodes etc.) can perceive variables in different states.

== Practical performance impact ==

[[Image:consistency_models.jpg|thumb|center|700px|Implementation of consistency models]]

Here, we define performance as the processor utilization achieved in an execution. The reason for using processor utilization as the distinguishing factor is that it provides reasonable results even when the program’s control path is not deterministic and depends on the relative timing of synchronization accesses. Let us make a comparative analysis of the performance achieved by the various consistency models on the LFC architecture (an aggressive implementation with lockup-free caches).

A BASE model has been added to the four consistency models viz. Sequential Consistency (SC), Processor Consistency (PC), Release Consistency (RC) and Weak Consistency (WC). This is the most constrained model and is used as baseline for all performance comparisons. It incorporates no buffering or pipelining and waits for each read and write to complete before proceeding.

=== Performance of SC versus BASE ===

The SC model does not perform significantly better than BASE. The performance gain from BASE to SC is small for most applications. This is because reads are expected to be closely interleaved with writes. Significant write clustering may occur sometimes, for example, when initializing data structures, but such occurrences are expected to be infrequent.

=== Performance of PC versus SC and BASE ===

In PC, sequential consistency is abandoned. The main benefit of this extra complexity comes from the fact that reads do not have to stall for pending writes to perform. However, some of the benefits may be lost if the write buffer gets full and stalls the processor. The PC model is relatively successful in hiding almost all of the latency of writes given a reasonably deep write buffer. Since the comparison of PC and WC is more involved than the comparison with RC, we next examine PC versus RC. Subsequently, we compare PC and WC.

=== Performance of PC versus RC ===

In addition to providing all the benefits of the PC model, RC allows pipelining of writes by exploiting information about the synchronization accesses. That is, writes can be retired from the write buffer before ownership has been obtained. The fact that writes are retired at a faster rate has two implications:
* the write buffer becoming full is less of a problem, and
* in cases where a release operation is behind several writes in the write buffer, that release can be observed sooner by a processor waiting to do an acquire.
The write buffer getting full is really not a problem for PC. Therefore, most gains, if observed, will be due to faster completion of synchronization.

=== Performance of WC versus PC and RC ===

The differences between WC and RC arise because WC does not differentiate between acquire and release synchronization operations. Consequently, any synchronization operation must conservatively satisfy the constraints of both release and acquire. Thus, compared to RC, WC stalls the processor at an acquire until pending writes and releases complete. In addition, the processor is stalled for pending releases if it attempts to do a read operation.

On comparing WC and PC, we observe the surprising result that PC sometimes performs better than WC. WC has the advantage that writes can be retired at a faster rate from the write buffer. The disadvantage of WC relative to PC is the same as its disadvantage relative to RC, in that WC stalls the processor at some points for pending writes and releases to perform.

== Conclusion ==

In this article all the popular memory consistency models have been discussed in terms of how they are able to support multiprocessors today. Also, a performance based comparison of the different consistency models is presented. A decent effort has been made to discuss the relative weaknesses and strengths of different models. From what has been laid out earlier it is easy to conclude that there is no one best model, only the model that best fits the application, architecture, and programmer.

The choice is between a very strict consistency model such as SC and the more relaxed consistency models. The sequential memory model is the easiest model to use for parallel computing. For a novice programmer who does not fully understand the problems inherent in memory consistency and cannot implement explicit mechanisms for synchronization, a sequentially consistent system may be the only model that will result in program correctness. There will be a significant penalty in performance, however, which could negate even the advantage of using a parallel system in the first place. Across a wide range of applications, the relaxed consistency models would on average give the best performance, but at greater programmer effort. A programmer with extensive knowledge of the architecture and experience in parallel programming will be able to write a high-performing application while ensuring correctness with synchronization mechanisms like barriers, locks, and fence instructions.

This article should help a programmer to make this choice and help select an appropriate model for his/her multiprocessor design.

==References==
<references/>

CSC/ECE 506 Spring 2012/10b sr

2012-04-05T08:40:11Z

Sshanbh: /* Eventual consistency */

'''Use of consistency models in current multiprocessors'''

The memory consistency model of a shared memory system determines the order in which memory operations will appear to execute to the programmer. This article describes how consistency is used in multiprocessors today and later digs into the details of popular consistency models in use today. The impact of these models on the multiprocessor performance is also discussed.

== Introduction ==

Many modern computer systems and most multicore chips support [http://en.wikipedia.org/wiki/Shared_memory shared memory] in hardware. In a shared memory system, each of the processor cores may read and write to a single shared address space. These designs seek various goodness properties, such as high performance, low power, and low cost. Of course, it is not valuable to provide these goodness properties without first providing correctness. Correct shared memory seems intuitive at a hand-wave level, but there are subtle issues in even defining what it means for a shared memory system to be correct, as well as many subtle corner cases in designing a correct shared memory implementation. Moreover, these subtleties must be mastered in hardware implementations where bug fixes are expensive.

It is the job of consistency to define shared memory correctness. Consistency definitions provide rules about loads and stores (or memory reads and writes) and how they act upon memory. Ideally, consistency definitions would be simple and easy to understand. However, defining what it means for shared memory to behave correctly is more subtle than defining the correct behavior of, for example, a single-threaded processor core. The correctness criterion for a single processor core partitions behavior between one correct result and many incorrect alternatives. This is because the processor’s architecture mandates that the execution of a thread transforms a given input state into a single well-defined output state, even on an out-of-order core. Shared memory consistency models, however, concern the loads and stores of multiple threads and usually allow many correct executions while disallowing many incorrect ones. The possibility of multiple correct executions is due to the ISA allowing multiple threads to execute concurrently, often with many possible legal interleavings of instructions from different threads. The multitude of correct executions complicates the erstwhile simple challenge of determining whether an execution is correct. Nevertheless, consistency must be mastered to implement shared memory and, in some cases, to write correct programs that use it.

== Consistency in current-day multiprocessors ==

Today's scalable multiprocessors are mostly built with a distributed shared memory architecture. The memory is physically distributed but logically shared. In other words, the address spaces generated by all processing nodes globally form a single address space. The main advantage of such a system lies in combining the scalability of distributed hardware with the programmability of a shared [http://en.wikipedia.org/wiki/Virtual_memory virtual memory]. Representative systems include the Cray T3D, the Stanford Dash, the MIT Alewife, the Teracomputer, and the Convex SPP.

A scalable system should be able to hide the long latencies of remote memory accesses. Several latency hiding techniques have been proposed. Coherent caches reduce the frequency of remote memory accesses by caching data close to the processor. Relaxed memory consistency allows reordering of memory events and buffering or pipelining of remote memory accesses. Data prefetching attempts to hide long read latency by issuing read requests well ahead of time, with the expectation that the data will be available in the cache when it is referenced. [http://en.wikipedia.org/wiki/Multithreading_(computer_architecture) Multithreading] attempts to hide the long latency by context switching between several active threads, thus allowing the processor to perform useful work while waiting for remote requests or synchronization faults to complete.

== Consistency models used ==

=== Address translation aware memory consistency <ref>http://people.ee.duke.edu/~sorin/papers/ieeemicro11_toppick.pdf</ref> ===

These memory consistency models define the behavior of operations (loads, stores, memory barriers, etc.) on physical addresses and virtual addresses. The two important levels of memory consistency that can be classified as address translation aware are described below:

==== Physical address memory consistency (PAMC)====
It is necessary to have correct PAMC for unmapped code to work correctly. Unmapped software, including the boot code and part of the system software that manages AT, relies upon PAMC. It is the responsibility of the hardware to implement PAMC and this is specified precisely in the architectural manual.
It is not too difficult to adapt an AT-oblivious consistency model as the specification of PAMC.

Example:
The PAMC model could be SC. In such a case the interface would specify that
(i) there must exist a total order of all loads and stores to physical addresses that respects the program order of each thread and
(ii) the value of each load is equal to the value of the most recent store to that physical address in the total order.

==== Virtual address memory consistency (VAMC)====
Correct VAMC is required for mapped code to work correctly.

Although adapting an AT-oblivious consistency model for PAMC is straightforward, there are a few challenges when adapting an AT-oblivious consistency model for VAMC:
* synonyms - Multiple [http://en.wikipedia.org/wiki/Virtual_address_space virtual addresses] may map to the same [http://en.wikipedia.org/wiki/Physical_address physical address]. Suppose two virtual addresses VA1 and VA2 map to the same physical address PA. SC requires that the value of a load is equal to the value of the most recent store to the same address. A naive definition of VAMC that does not consider the level of indirection introduced by AT would treat VA1 and VA2 as different addresses, so a load from VA2 would not be required to return the value of the most recent store to VA1. To overcome this challenge, we re-formulate AT-oblivious consistency models for VAMC by applying the model to synonym sets of virtual addresses rather than individual addresses. For instance, we can define SC for VAMC as follows: there must exist a total order of all loads and stores to virtual addresses that respects program order and in which each load gets the value of the most recent store to any virtual address in the same virtual address synonym set. Incorporating synonyms explicitly in the consistency model allows programmers to reason about the ordering of accesses to virtual addresses.
* mapping and permission changes - Another challenge is that the set of memory operations at the VAMC level is richer than at the PAMC level.
* load/store side effects - Yet another challenge in specifying VAMC is that loads and stores to virtual addresses have certain side effects. The AT system includes status bits such as accessed and dirty bits for each page table entry. These status bits are part of the architectural state, and the ordering of updates to those bits must thus be specified in VAMC. To achieve this we add two new operations to the specification tables: Ld-sb (load’s impact on status bits) and St-sb (store’s impact on status bits).

=== Causal consistency ===
[[Image:causal.jpg|thumb|right|300px|Causal Consistency example]]

Hutto and Ahamad <ref>P.W. Hutto and M. Ahamad. Slow memory: Weakening consistency to enhance concurrency in distributed shared memories. In Proceedings of the 10th International Conference on Distributed Computing Systems, pages 302–311, May 1990.</ref> introduced causal consistency. Lamport <ref>Leslie Lamport. Time, clocks, and the ordering of events in a distributed system. Communications of the ACM, 21(7):558–565, 1978.</ref> defined the notion of potential causality to capture the flow of information in a distributed system. This notion can be applied to a memory system by interpreting a write as a message-send event and a read as a message-read event. A memory is causally consistent if all processors agree on the order of causally related events. Causally unrelated events (concurrent events) can be observed in different orders.
The example shown here is a legal execution history under CC but not under SC. Note that W(x)1 and W(x)2 are causally related as P2 observed the first write by P1. Furthermore, P3 and P4 observe the accesses W(x)2 and W(x)3 in different orders, which would not be legal in SC.

=== Delta consistency <ref>http://en.wikipedia.org/wiki/Delta_consistency</ref>===

The delta consistency model states that after a fixed time period δ, an update is propagated through the system and all replicas will be consistent. In other words, barring a short bounded interval after a modification, the result of any read operation is consistent with a read on the original copy of an object. If an object is modified, the read will not be consistent during the short period of time following its modification. Once the fixed time period elapses, the modification is propagated and the read is now consistent.

=== Entry consistency <ref>http://cs.gmu.edu/cne/modules/dsm/orange/entry_con.html</ref>===

This consistency model has been designed to be used with critical sections. Here, the programmer needs to use acquire and release at the start and end of each [http://en.wikipedia.org/wiki/Critical_section critical section], just as in both variants of [http://en.wikipedia.org/wiki/Release_consistency release consistency]. However, it also requires every ordinary shared variable to be associated with a synchronization variable such as a lock or a barrier. If the elements of an array need to be accessed independently in parallel, then each element of the array must be associated with its own lock. When an acquire is done on a synchronization variable, only those ordinary shared variables guarded by that synchronization variable are made consistent. Release consistency does not associate shared variables with locks or barriers and at acquire time has to determine empirically which variables it needs. This is where entry consistency differs from release consistency.

Formally, a memory exhibits entry consistency if the following conditions are met:

* An acquire access of a synchronization variable is not allowed to perform with respect to a process until all updates to the guarded shared data have been performed with respect to that process.
* Before an exclusive mode access to a synchronization variable by a process is allowed to perform with respect to that process, no other process may hold the synchronization variable, not even in nonexclusive mode.
* After an exclusive mode access to a synchronization variable has been performed, any other process's next nonexclusive mode access to that synchronization variable may not be performed until it has performed with respect to that variable's owner.
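
The association between synchronization variables and the data they guard can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the class <code>EntryConsistentNode</code>, the <code>guards</code> mapping, and the dictionary <code>home</code> standing in for the up-to-date copy of shared memory are hypothetical, and the sketch models only which variables are made consistent, not mutual exclusion or access modes.

```python
class EntryConsistentNode:
    """Toy node that refreshes only the variables guarded by the acquired lock."""

    def __init__(self, guards):
        # guards: synchronization variable name -> set of shared variable names
        self.guards = guards
        self.local = {}             # this node's local copies

    def acquire(self, sync_var, home):
        # Only the variables guarded by sync_var are made consistent.
        for name in self.guards[sync_var]:
            if name in home:
                self.local[name] = home[name]

    def release(self, sync_var, home):
        # Publish local updates to the guarded variables.
        for name in self.guards[sync_var]:
            if name in self.local:
                home[name] = self.local[name]
```

Acquiring one lock leaves variables guarded by other locks untouched in the local copy, which is the property that distinguishes this model from release consistency.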

=== Eventual consistency <ref>http://en.wikipedia.org/wiki/Eventual_consistency</ref>===

Given a sufficiently long period of time over which no changes are sent, all updates can be expected to propagate eventually through the system. Thus, all the replicas will be consistent. As the consistency achieved is eventual, the possibility of conflicts is high. The conflicts have to be resolved. There are three types of resolution:
* Read repair: The correction is done when a read finds an inconsistency. This slows down the read operation.
* Write repair: The correction is done when a write operation finds an inconsistency, slowing down the write operation.
* Asynchronous repair: The correction is not part of a read or write operation.
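
As an illustration of read repair, the following Python sketch resolves conflicts by picking the entry with the highest version number. The version numbers, the function name <code>read_with_repair</code>, and the representation of replicas as dictionaries are assumptions made for this example; real systems may use timestamps, vector clocks, or other conflict resolution rules.

```python
def read_with_repair(replicas, key):
    """Return the newest value for `key` and repair stale replicas on the way.

    Each replica is a dict mapping key -> (version, value); a higher version
    is newer. Raises ValueError if no replica holds the key.
    """
    newest = max((r[key] for r in replicas if key in r),
                 key=lambda entry: entry[0])
    for r in replicas:
        if r.get(key) != newest:
            # The correction is done on the read path, which slows the read.
            r[key] = newest
    return newest[1]
```

After the call, every replica holds the newest entry for the key, so later reads of that key find no inconsistency.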

=== Linearizability ===

This is also known as strict or atomic consistency. An operation is linearizable if it appears to the rest of the system to occur instantaneously. An atomic operation either occurs completely or does not occur at all. In reality, atomic operations do not actually occur instantaneously; atomicity is enforced by mutual exclusion. At the software level, locks or semaphores are used to achieve this, while at the hardware level, a cache coherency protocol may be used. This makes it appear to the user that the entire operation occurred in a single instruction.

A sequence of invocations and responses made of an object by a set of threads is referred to as a history. Each invocation of a function is followed by a subsequent response.

Example:
Suppose two threads, A and B, attempt to acquire a lock, backing off if it is already taken.
This would be modeled as both threads invoking the lock operation, then both threads receiving a response, one successful, one not.
A calls lock
B calls lock
lock returns fail to A
lock returns success to B

A sequential history is one in which every invocation is immediately followed by its response. A history is linearizable if:

* its invocations and responses can be reordered to yield a sequential history
* that sequential history is correct according to the sequential definition of the object
* if a response preceded an invocation in the original history, it must still precede it in the sequential reordering

Now we can reorder the above example in two ways as follows:

Example:
One way would be:
A calls lock
lock returns fail to A
B calls lock
lock returns success to B

The other way would be:
B calls lock
lock returns success to B
A calls lock
lock returns fail to A

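Only the second reordering is correct according to the sequential definition of a lock: in the first, A calls lock while the lock is free, so A should have succeeded. Since a correct reordering exists, the original history is linearizable. An object is linearizable if all valid histories of its use can be linearized.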
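
The three conditions can be checked by brute force for small histories. The Python sketch below is an illustration only: the function name <code>is_linearizable</code> and the tuple encoding of operations are assumptions, and the sequential definition used is that of a lock that starts free and is never released, so the first call succeeds and every later call fails.

```python
from itertools import permutations

def is_linearizable(ops):
    """ops: list of (thread, invoke_time, response_time, succeeded) lock calls."""
    for order in permutations(ops):
        # If b responded before a was invoked, b must not be placed after a.
        respects_time = all(
            not (b[2] < a[1])
            for i, a in enumerate(order)
            for b in order[i + 1:]
        )
        # Sequential definition: the first call succeeds, later calls fail.
        matches_spec = all(op[3] == (i == 0) for i, op in enumerate(order))
        if respects_time and matches_spec:
            return True
    return False
```

For the history in the example (A calls at time 0, B calls at time 1, A fails at time 2, B succeeds at time 3) the check finds the reordering in which B goes first. If A had received its failure before B even called lock, no reordering would satisfy all three conditions.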

=== PRAM consistency ===
[[Image:pram.jpg|thumb|right|300px|PRAM Consistency example]]

PRAM (pipelined RAM) consistency is also known as FIFO consistency. The reasoning that led to this model was as follows: Consider a multiprocessor where each processor has a local copy of the shared memory. For the memory to be scalable, the cost of an access should be independent of the time it takes to access the other processors’ memories. The proposal was that on a read, a PRAM would simply return the value stored in the local copy of the memory. On a write, it would update the local copy first and broadcast the new value to the other processors. Assuming a constant time for initiating a broadcast operation, the goal of making the cost for a read or write constant is thus achieved. In terms of ordering constraints, this is equivalent to requiring that all processors observe the writes from a single processor in the same order while they may disagree on the order of writes by different processors.
The example shown is legal under PRAM but not under SC or CC. P3 and P4 observe the writes by P1 and P2 in different orders, although W(x)1 and W(x)2 are potentially causally related.
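
The ordering constraint can be expressed as a small check in Python. This is a sketch under assumptions: the function name <code>is_pram_consistent</code> and the data layout are hypothetical, and each processor's writes are assumed to be distinct so that they can be located in its program order.

```python
def is_pram_consistent(program_order, observed):
    """program_order: writer -> list of its writes in issue order.
    observed: observer -> list of (writer, write) in the order seen."""
    for seen in observed.values():
        for writer, issued in program_order.items():
            from_writer = [w for who, w in seen if who == writer]
            # An observer may have seen only some of the writes, but those
            # it saw from one writer must appear in that writer's issue order.
            positions = [issued.index(w) for w in from_writer]
            if positions != sorted(positions):
                return False
    return True
```

Two observers that disagree on the relative order of writes from different processors still pass the check; an observer that sees two writes of the same processor out of order does not.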

=== Release consistency ===

Release consistency is one of the consistency models used in the domain of concurrent programming (e.g. in distributed shared memory, distributed transactions etc.).

Systems of this kind are characterised by the existence of two special synchronisation operations, release and acquire. Before issuing a write to a memory object, a node must acquire the object via a special operation, and later release it. The code that runs between the acquire and release operations therefore constitutes the critical region. The system is said to provide release consistency if all write operations by a certain node are seen by the other nodes after the former releases the object and before the latter acquire it.

There are two kinds of protocols that implement release consistency:
* ''eager'', where all coherence actions are performed on release operations, and
* ''lazy'', where all coherence actions are delayed until after a subsequent acquire
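
A minimal Python sketch of the eager variant follows. The class <code>EagerReleaseNode</code> and the dictionary standing in for the memory visible to other nodes are hypothetical, and the sketch shows only when writes become visible; it does not implement mutual exclusion between nodes.

```python
class EagerReleaseNode:
    """Toy node: writes stay local until release, then are all pushed out."""

    def __init__(self, shared):
        self.shared = shared    # dict standing in for memory seen by other nodes
        self.pending = {}       # writes made inside the critical region
        self.holding = False

    def acquire(self):
        self.holding = True

    def write(self, name, value):
        if not self.holding:
            raise RuntimeError("write issued without a preceding acquire")
        self.pending[name] = value

    def release(self):
        # Eager protocol: all coherence actions are performed at the release.
        self.shared.update(self.pending)
        self.pending.clear()
        self.holding = False
```

A lazy protocol would instead keep the pending writes at the releasing node and transfer them only when another node performs its next acquire.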

=== Sequential consistency ===
[[Image:sc.jpg|thumb|right|300px|Sequential Consistency example]]

Sequential consistency was first defined by Lamport in 1979. He defined a memory system to be sequentially consistent if:
the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations
of each individual processor appear in this sequence in the order specified by its program.

In a sequentially consistent system, all processors must agree on the order of observed effects. The image to the right shows an example for SC.
Note that R(y)2 by processor P3 reads a value that has not been written yet! Of course, this is not possible in any real physical system. However, it shows a surprising flexibility of the SC model. Another reason why this is not a legal history for atomic consistency is that the
write operations W(x)1 and W(y)2 appear commuted at processor P3.

Sequential consistency has been the canonical memory consistency model for a long time. However, many multiprocessor machines actually implement a slightly weaker model called processor consistency.
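
Lamport's definition can be turned directly into a brute-force check for small executions: search for one interleaving that keeps each processor's program order and in which every read returns the most recently written value. The Python sketch below is an illustration only; the function name, the tuple encoding of operations, and the initial value of 0 for every variable are assumptions.

```python
def is_sequentially_consistent(programs, initial=0):
    """programs: one list per processor of ("W", var, value) or ("R", var, value)."""

    def search(positions, memory):
        if all(pos == len(prog) for pos, prog in zip(positions, programs)):
            return True
        for p, prog in enumerate(programs):
            if positions[p] == len(prog):
                continue
            kind, var, value = prog[positions[p]]
            if kind == "R" and memory.get(var, initial) != value:
                continue  # this read cannot be next in the sequential order
            new_memory = dict(memory)
            if kind == "W":
                new_memory[var] = value
            new_positions = positions[:p] + (positions[p] + 1,) + positions[p + 1:]
            if search(new_positions, new_memory):
                return True
        return False

    return search((0,) * len(programs), {})
```

For example, if P1 executes W(x)1 then R(y)0 and P2 executes W(y)1 then R(x)0, no sequential order explains both reads, so the check fails. The search is exponential in the number of operations and is meant only for reasoning about short examples.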

=== Weak consistency ===

A memory system is weakly consistent if it enforces the following restrictions:
* accesses to synchronization variables are sequentially consistent and
* no access to a synchronization variable is issued in a processor before all previous data accesses have been performed and
* no access is issued by a processor before a previous access to a synchronization variable has been performed
Notice that the meaning of “previous” is well-defined because it refers to program order. That is, an access A precedes access B if and only if the processor that executed access B has previously executed access A. Synchronizing accesses work as fences. At the time a synchronizing access performs, all previous accesses by that processor are guaranteed to have performed. The synchronization model corresponding to these access order constraints is relatively simple. A program executing on a weakly consistent system appears sequentially consistent if the following two constraints are observed:
* There are no data races and
* Synchronization is visible to the memory system.

=== Strong consistency ===

Strong consistency is one of the consistency models used in the domain of concurrent programming (e.g. in distributed shared memory, distributed transactions etc.). Strong consistency is supported if:
* All accesses are seen by all parallel processes (or nodes, processors etc.) in the same order (sequentially)
* Only one consistent state can be observed. In weak consistency, by contrast, different parallel processes (or nodes etc.) may perceive variables in different states.

== Practical performance impact ==

[[Image:consistency_models.jpg|thumb|center|700px|Implementation of consistency models]]

Here, we define performance as the processor utilization achieved in an execution. The reason for using processor utilization as the distinguishing factor is that it provides reasonable results even when the program’s control path is not deterministic and depends on relative timing of synchronization accesses. Let us make a comparative analysis of the performance achieved by the various consistency models on the LFC architecture (an aggressive implementation with lock-free caches).

A BASE model has been added to the four consistency models, viz. Sequential Consistency (SC), Processor Consistency (PC), Release Consistency (RC) and Weak Consistency (WC). BASE is the most constrained model and is used as the baseline for all performance comparisons. It incorporates no buffering or pipelining and waits for each read and write to complete before proceeding.

=== Performance of SC versus BASE ===

The SC model does not perform significantly better than BASE. The performance gain from BASE to SC is small for most applications. This is because reads are expected to be closely interleaved with writes. Significant write clustering may occur sometimes, for example, when initializing data structures, but such occurrences are expected to be infrequent.

=== Performance of PC versus SC and BASE ===

In PC, sequential consistency is abandoned. The main benefit of this extra complexity comes from the fact that reads do not have to stall for pending writes to perform. However, some of the benefits may be lost if the write buffer gets full and stalls the processor. The PC model is relatively successful in hiding almost all of the latency of writes given a reasonably deep write buffer. Since the comparison of PC and WC is more involved than the comparison with RC, we next examine PC versus RC. Subsequently, we compare PC and WC.

=== Performance of PC versus RC ===

In addition to providing all the benefits of the PC model, RC allows pipelining of writes by exploiting information about the synchronization accesses. That is, writes can be retired from the write buffer before ownership has been obtained. The fact that writes are retired at a faster rate has two implications:
* the write buffer becoming full is less of a problem, and
* in cases where a release operation is behind several writes in the write buffer, that release can be observed sooner by a processor waiting to do an acquire.
The write buffer getting full is really not a problem for PC. Therefore, most gains, if observed, will be due to faster completion of synchronization.

=== Performance of WC versus PC and RC ===

The differences between WC and RC arise because WC does not differentiate between acquire and release synchronization operations. Consequently, any synchronization operation must conservatively satisfy the constraints of both release and acquire. Thus, compared to RC, WC stalls the processor at an acquire until pending writes and releases complete. In addition, the processor is stalled for pending releases if it
attempts to do a read operation.

On comparing WC and PC we observe the surprising result that PC sometimes performs better than WC. WC has the advantage that writes can be retired at a faster rate from the write buffer. The disadvantage of WC relative to PC is the same as its disadvantage relative to RC: WC stalls the processor at some points for pending writes and releases to perform.

== Conclusion ==

In this article the popular memory consistency models have been discussed in terms of how they support multiprocessors today, and a performance-based comparison of the models has been presented along with their relative strengths and weaknesses. From what has been laid out, it is clear that there is no one best model, only the model that best fits the application, architecture, and programmer.

The choice is between a very strict consistency model such as SC and the relaxed consistency models. The sequential memory model is the easiest model to use for parallel computing. For a novice programmer who does not fully understand the problems inherent in memory consistency and cannot implement explicit mechanisms for synchronization, a sequentially consistent system may be the only model that will result in program correctness. There will be a significant penalty in performance, however, which could negate even the advantage of using a parallel system in the first place. Across a wide range of applications, the relaxed consistency models achieve the best performance on average, but at greater programmer effort. A programmer with extensive knowledge of the architecture and experience in parallel programming will be able to write a high-performing application while ensuring correctness with synchronization mechanisms like barriers, locks, and fence instructions.

This article should help a programmer to make this choice and help select an appropriate model for his/her multiprocessor design.

==References==
<references/>

CSC/ECE 506 Spring 2012/10b sr

2012-04-05T08:38:27Z

Sshanbh: /* Entry consistency */

'''Use of consistency models in current multiprocessors'''

The memory consistency model of a shared memory system determines the order in which memory operations will appear to execute to the programmer. This article describes how consistency is used in multiprocessors today and later digs into the details of popular consistency models in use today. The impact of these models on the multiprocessor performance is also discussed.

== Introduction ==

Many modern computer systems and most multicore chips support [http://en.wikipedia.org/wiki/Shared_memory shared memory] in hardware. In a shared memory system, each of the processor cores may read and write to a single shared address space. These designs seek various goodness properties, such as high performance, low power, and low cost. Of course, it is not valuable to provide these goodness properties without first providing correctness. Correct shared memory seems intuitive at a hand-wave level, but, there are subtle issues in even defining what it means for a shared memory system to be correct, as well as many subtle corner cases in designing a correct shared memory implementation. Moreover, these subtleties must be mastered in hardware implementations where bug fixes are expensive.

It is the job of consistency to define shared memory correctness. Consistency definitions provide rules about loads and stores (or memory reads and writes) and how they act upon memory. Ideally, consistency definitions would be simple and easy to understand. However, defining what it means for shared memory to behave correctly is more subtle than defining the correct behavior of, for example, a single-threaded processor core. The correctness criterion for a single processor core partitions behavior between one correct result and many incorrect alternatives. This is because the processor’s architecture mandates that the execution of a thread transforms a given input state into a single well-defined output state, even on an out-of-order core. Shared memory consistency models, however, concern the loads and stores of multiple threads and usually allow many correct executions while disallowing many incorrect ones. The possibility of multiple correct executions is due to the ISA allowing multiple threads to execute concurrently, often with many possible legal interleavings of instructions from different threads. The multitude of correct executions complicates the erstwhile simple challenge of determining whether an execution is correct. Nevertheless, consistency must be mastered to implement shared memory and, in some cases, to write correct programs that use it.

== Consistency in current-day multiprocessors ==

Today's scalable multiprocessors are mostly built with a distributed shared memory architecture. The memory is physically distributed but logically shared. In other words, the address spaces generated by all processing nodes globally form a single address space. The main advantage of such a system lies in the scalability of distributed hardware combined with the programmability of a shared [http://en.wikipedia.org/wiki/Virtual_memory virtual memory]. Representative systems include the Cray T3D, the Stanford Dash, the MIT Alewife, the Tera computer, and the Convex SPP.

A scalable system should be able to hide the long latencies of remote memory accesses. Several latency hiding techniques have been proposed: Coherent caches reduce the frequency of remote memory accesses by caching data close to the processor. Relaxed memory consistency allows reordering of memory events and buffering or pipelining of remote memory accesses. Data prefetching attempts to hide long read latency by issuing read requests well ahead of time, with the expectation that the data will be available in the cache when it is referenced. [http://en.wikipedia.org/wiki/Multithreading_(computer_architecture) Multithreading] attempts to hide the long latency by context switching between several active threads, thus allowing the processor to perform useful work while waiting for remote requests or synchronization faults to complete.

== Consistency models used ==

=== Address translation aware memory consistency <ref>http://people.ee.duke.edu/~sorin/papers/ieeemicro11_toppick.pdf</ref> ===

These memory consistency models define the behavior of operations (loads, stores, memory barriers, etc.) on physical addresses and virtual addresses. The two important levels of memory consistency that can be classified as address translation aware are described below:

==== Physical address memory consistency (PAMC)====
Correct PAMC is necessary for unmapped code to work correctly. Unmapped software, including the boot code and the part of the system software that manages address translation (AT), relies upon PAMC. It is the responsibility of the hardware to implement PAMC, and this is specified precisely in the architectural manual.
It is not too difficult to adapt an AT-oblivious consistency model as the specification of PAMC.

Example:
The PAMC model could be SC. In such a case the interface would specify that
(i) there must exist a total order of all loads and stores to physical addresses that respects the program order of each thread and
(ii) the value of each load is equal to the value of the most recent store to that physical address in the total order.

==== Virtual address memory consistency (VAMC)====
Correct VAMC is required for mapped code to work correctly.

Although adapting an AT-oblivious consistency model for PAMC is straightforward, there are a few challenges when adapting an AT-oblivious consistency model for VAMC:
* synonyms - Multiple [http://en.wikipedia.org/wiki/Virtual_address_space virtual addresses] may map to the same [http://en.wikipedia.org/wiki/Physical_address physical address]. Suppose two virtual addresses VA1 and VA2 map to the same physical address PA. SC requires that the value of a load is equal to the value of the most recent store to the same address. A naive definition of VAMC that ignores the level of indirection introduced by AT would apply this rule to each virtual address separately, so a load from VA2 would not be required to observe the most recent store to VA1. To overcome this challenge, we re-formulate AT-oblivious consistency models for VAMC by applying the model to synonym sets of virtual addresses rather than individual addresses. For instance, we can define SC for VAMC as follows: there must exist a total order of all loads and stores to virtual addresses that respects program order and in which each load gets the value of the most recent store to any virtual address in the same virtual address synonym set. Incorporating synonyms explicitly in the consistency model allows programmers to reason about the ordering of accesses to virtual addresses.
* mapping and permission changes - Another challenge is that the set of memory operations at the VAMC level is richer than at the PAMC level.
* load/store side effects - Yet another challenge in specifying VAMC is that loads and stores to virtual addresses have certain side effects. The AT system includes status bits such as accessed and dirty bits for each page table entry. These status bits are part of the architectural state, and the ordering of updates to those bits must thus be specified in VAMC. To achieve this we add two new operations to the specification tables: Ld-sb (load’s impact on status bits) and St-sb (store’s impact on status bits).
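
The synonym-set formulation of SC for VAMC can be illustrated with a short Python check. This is a sketch under assumptions: the function name, the encoding of operations, and the fixed <code>translation</code> dictionary are hypothetical, and mapping changes and status-bit side effects are deliberately left out.

```python
def loads_match_synonym_sets(total_order, translation, initial=0):
    """total_order: list of ("St" | "Ld", virtual_address, value) in a proposed
    total order. translation: virtual address -> physical address (fixed here)."""
    latest = {}  # physical address -> value of the most recent store
    for kind, va, value in total_order:
        pa = translation[va]  # all synonyms of va share this physical address
        if kind == "St":
            latest[pa] = value
        elif latest.get(pa, initial) != value:
            # The load did not get the most recent store to its synonym set.
            return False
    return True
```

With VA1 and VA2 both mapped to PA, a store of 7 to VA1 followed by a load of 7 from VA2 is accepted, while a load of the stale initial value from VA2 is rejected.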

=== Causal consistency ===
[[Image:causal.jpg|thumb|right|300px|Causal Consistency example]]

Hutto and Ahamad <ref>P.W. Hutto and M. Ahamad. Slow memory: Weakening consistency to enhance concurrency in distributed shared memories. In Proceedings of the 10th International Conference on Distributed Computing Systems, pages 302–311, May 1990.</ref> introduced causal consistency. Lamport <ref>Leslie Lamport. Time, clocks, and the ordering of events in a distributed system. Communications of the ACM, 21(7):558–565, 1978.</ref> defined the notion of potential causality to capture the flow of information in a distributed system. This notion can be applied to a memory system by interpreting a write as a message-send event and a read as a message-read event. A memory is causally consistent if all processors agree on the order of causally related events. Causally unrelated events (concurrent events) can be observed in different orders.
The example shown here is a legal execution history under CC but not under SC. Note that W(x)1 and W(x)2 are causally related as P2 observed the first write by P1. Furthermore, P3 and P4 observe the accesses W(x)2 and W(x)3 in different orders, which would not be legal in SC.

=== Delta consistency <ref>http://en.wikipedia.org/wiki/Delta_consistency</ref>===

The delta consistency model states that after a fixed time period δ, an update is propagated through the system and all replicas will be consistent. In other words, barring a short bounded interval after a modification, the result of any read operation is consistent with a read on the original copy of an object. If an object is modified, the read will not be consistent during the short period of time following its modification. Once the fixed time period elapses, the modification is propagated and the read is now consistent.

=== Entry consistency <ref>http://cs.gmu.edu/cne/modules/dsm/orange/entry_con.html</ref>===

This consistency model has been designed to be used with critical sections. Here, the programmer needs to use acquire and release at the start and end of each [http://en.wikipedia.org/wiki/Critical_section critical section] just like in both variants of [http://en.wikipedia.org/wiki/Release_consistency release consistency]. However, it also requires every ordinary shared variable to be associated with a synchronization variable such as a lock or a barrier. If the elements of an array need to be accessed independently in parallel, then each element of the array must be associated with a lock. When an acquire is done on a synchronization variable, only those ordinary shared variables guarded by that synchronization variable are made consistent. Release consistency does not associate shared variables with locks or barriers and at acquire time has to determine empirically which variables it needs. This is where entry consistency differs from release consistency.

Formally, a memory exhibits entry consistency if the following conditions are met:

* An acquire access of a synchronization variable is not allowed to perform with respect to a process until all updates to the guarded shared data have been performed with respect to that process.
* Before an exclusive mode access to a synchronization variable by a process is allowed to perform with respect to that process, no other process may hold the synchronization variable, not even in nonexclusive mode.
* After an exclusive mode access to a synchronization variable has been performed, any other process's next nonexclusive mode access to that synchronization variable may not be performed until it has performed with respect to that variable's owner.

=== Eventual consistency ===

Given a sufficiently long period of time over which no changes are sent, all updates can be expected to propagate eventually through the system. Thus, all the replicas will be consistent. As the consistency achieved is eventual, the possibility of conflicts is high. The conflicts have to be resolved. There are three types of resolution:
* Read repair: The correction is done when a read finds an inconsistency. This slows down the read operation.
* Write repair: The correction is done when a write operation finds an inconsistency, slowing down the write operation.
* Asynchronous repair: The correction is not part of a read or write operation.

=== Linearizability ===

This is also known as strict or atomic consistency. An operation is linearizable if it appears to the rest of the system to occur instantaneously. An atomic operation either occurs completely or does not occur at all. In reality, atomic operations do not actually occur instantaneously; atomicity is enforced by mutual exclusion. At the software level, locks or semaphores are used to achieve this, while at the hardware level, a cache coherency protocol may be used. This makes it appear to the user that the entire operation occurred in a single instruction.

A sequence of invocations and responses made of an object by a set of threads is referred to as a history. Each invocation of a function is followed by a subsequent response.

Example:
Suppose two threads, A and B, attempt to acquire a lock, backing off if it is already taken.
This would be modeled as both threads invoking the lock operation, then both threads receiving a response, one successful, one not.
A calls lock
B calls lock
lock returns fail to A
lock returns success to B

A sequential history is one in which every invocation is immediately followed by its response. A history is linearizable if:

* its invocations and responses can be reordered to yield a sequential history
* that sequential history is correct according to the sequential definition of the object
* if a response preceded an invocation in the original history, it must still precede it in the sequential reordering

Now we can reorder the above example in two ways as follows:

Example:
One way would be:
A calls lock
lock returns fail to A
B calls lock
lock returns success to B

The other way would be:
B calls lock
lock returns success to B
A calls lock
lock returns fail to A

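Only the second reordering is correct according to the sequential definition of a lock: in the first, A calls lock while the lock is free, so A should have succeeded. Since a correct reordering exists, the original history is linearizable. An object is linearizable if all valid histories of its use can be linearized.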

=== PRAM consistency ===
[[Image:pram.jpg|thumb|right|300px|PRAM Consistency example]]

PRAM (pipelined RAM) consistency is also known as FIFO consistency. The reasoning that led to this model was as follows: Consider a multiprocessor where each processor has a local copy of the shared memory. For the memory to be scalable, the cost of an access should be independent of the time it takes to access the other processors’ memories. The proposal was that on a read, a PRAM would simply return the value stored in the local copy of the memory. On a write, it would update the local copy first and broadcast the new value to the other processors. Assuming a constant time for initiating a broadcast operation, the goal of making the cost for a read or write constant is thus achieved. In terms of ordering constraints, this is equivalent to requiring that all processors observe the writes from a single processor in the same order while they may disagree on the order of writes by different processors.
The example shown is legal under PRAM but not under SC or CC. P3 and P4 observe the writes by P1 and P2 in different orders, although W(x)1 and W(x)2 are potentially causally related.

=== Release consistency ===

Release consistency is one of the consistency models used in the domain of concurrent programming (e.g. in distributed shared memory, distributed transactions etc.).

Systems of this kind are characterised by the existence of two special synchronisation operations, release and acquire. Before issuing a write to a memory object, a node must acquire the object via a special operation, and later release it. The code that runs between the acquire and release operations therefore constitutes the critical region. The system is said to provide release consistency if all write operations by a certain node are seen by the other nodes after the former releases the object and before the latter acquire it.

There are two kinds of protocols that implement release consistency:
* ''eager'', where all coherence actions are performed on release operations, and
* ''lazy'', where all coherence actions are delayed until after a subsequent acquire

=== Sequential consistency ===
[[Image:sc.jpg|thumb|right|300px|Sequential Consistency example]]

Sequential consistency was first defined by Lamport in 1979. He defined a memory system to be sequentially consistent if:
the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations
of each individual processor appear in this sequence in the order specified by its program.

In a sequentially consistent system, all processors must agree on the order of observed effects. The image to the right shows an example for SC.
Note that R(y)2 by processor P3 reads a value that has not been written yet! Of course, this is not possible in any real physical system. However, it shows a surprising flexibility of the SC model. Another reason why this is not a legal history for atomic consistency is that the
write operations W(x)1 and W(y)2 appear commuted at processor P3.

Sequential consistency has been the canonical memory consistency model for a long time. However, many multiprocessor machines actually implement a slightly weaker model called processor consistency.

=== Weak consistency ===

A memory system is weakly consistent if it enforces the following restrictions:
* accesses to synchronization variables are sequentially consistent and
* no access to a synchronization variable is issued in a processor before all previous data accesses have been performed and
* no access is issued by a processor before a previous access to a synchronization variable has been performed
Notice that the meaning of “previous” is well-defined because it refers to program order. That is, an access A precedes access B if and only if the processor that executed access B has previously executed access A. Synchronizing accesses work as fences. At the time a synchronizing access performs, all previous accesses by that processor are guaranteed to have performed. The synchronization model corresponding to these access order constraints is relatively simple. A program executing on a weakly consistent system appears sequentially consistent if the following two constraints are observed:
* There are no data races and
* Synchronization is visible to the memory system.

=== Strong consistency ===

Strong consistency is one of the consistency models used in the domain of concurrent programming (e.g. in distributed shared memory, distributed transactions etc.). Strong consistency is supported if:
* All accesses are seen by all parallel processes (or nodes, processors etc.) in the same order (sequentially)
* Only one consistent state can be observed. In weak consistency, by contrast, different parallel processes (or nodes etc.) may perceive variables in different states.

== Practical performance impact ==

[[Image:consistency_models.jpg|thumb|center|700px|Implementation of consistency models]]

Here, we define performance as the processor utilization achieved in an execution. The reason for using processor utilization as the distinguishing factor is that it provides reasonable results even when the program’s control path is not deterministic and depends on relative timing of synchronization accesses. Let us make a comparative analysis of the performance achieved by the various consistency models on the LFC architecture (an aggressive implementation with lock-free caches).

A BASE model has been added to the four consistency models, namely Sequential Consistency (SC), Processor Consistency (PC), Release Consistency (RC) and Weak Consistency (WC). BASE is the most constrained model and is used as the baseline for all performance comparisons. It incorporates no buffering or pipelining and waits for each read and write to complete before proceeding.

=== Performance of SC versus BASE ===

The SC model does not perform significantly better than BASE; the performance gain from BASE to SC is small for most applications. This is because reads are expected to be closely interleaved with writes, and under SC a read has to wait for pending writes to perform, so there is little opportunity to overlap writes. Significant write clustering may occur sometimes, for example, when initializing data structures, but such occurrences are expected to be infrequent.

=== Performance of PC versus SC and BASE ===

In PC, sequential consistency is abandoned. The main benefit of this extra complexity comes from the fact that reads do not have to stall for pending writes to perform. However, some of the benefits may be lost if the write buffer gets full and stalls the processor. The PC model is relatively successful in hiding almost all of the latency of writes given a reasonably deep write buffer. Since the comparison of PC and WC is more involved than the comparison with RC, we next examine PC versus RC. Subsequently, we compare PC and WC.

=== Performance of PC versus RC ===

In addition to providing all the benefits of the PC model, RC allows pipelining of writes by exploiting information about the synchronization accesses. That is, writes can be retired from the write buffer before ownership has been obtained. The fact that writes are retired at a faster rate has two implications:
* the write buffer becoming full is less of a problem, and
* in cases where a release operation is behind several writes in the write buffer, that release can be observed sooner by a processor waiting to do an acquire.
The write buffer getting full is not really a problem for PC in the first place. Therefore, most gains, if observed, will be due to faster completion of synchronization.

=== Performance of WC versus PC and RC ===

The differences between WC and RC arise because WC does not differentiate between acquire and release synchronization operations. Consequently, any synchronization operation must conservatively satisfy the constraints of both release and acquire. Thus, compared to RC, WC stalls the processor at an acquire until pending writes and releases complete. In addition, the processor is stalled for pending releases if it attempts to do a read operation.

On comparing WC and PC we observe a surprising result that PC sometimes performs better than WC. WC has the advantage that writes can be retired at a faster rate from the write buffer. The disadvantage of WC to PC is the same as the disadvantage of WC to RC, in that WC stalls the processor at some points for pending writes and releases to perform.

== Conclusion ==

In this article all the popular memory consistency models have been discussed in terms of how they are able to support multiprocessors today. Also, a performance based comparison of the different consistency models is presented. A decent effort has been made to discuss the relative weaknesses and strengths of different models. From what has been laid out earlier it is easy to conclude that there is no one best model, only the model that best fits the application, architecture, and programmer.

The choice is between a very strict consistency model such as SC and the relaxed consistency models. The sequential memory model is the easiest model to use for parallel computing. For a novice programmer who does not fully understand the problems inherent in memory consistency and cannot implement explicit mechanisms for synchronization, a sequentially consistent system may be the only possible model that will result in program correctness. There will be a significant penalty in performance, however, which could negate even the advantage of using a parallel system in the first place. Across a wide range of applications, the relaxed consistency models would average the best performance, but at greater programmer effort. A programmer with extensive knowledge of the architecture and experience in parallel programming will be able to write a high-performing application while ensuring correctness with synchronization mechanisms like barriers, locks, and fence instructions.

This article should help a programmer to make this choice and help select an appropriate model for his/her multiprocessor design.

==References==
<references/>

CSC/ECE 506 Spring 2012/10b sr

2012-04-05T08:36:43Z

Sshanbh: /* Delta consistency */

'''Use of consistency models in current multiprocessors'''

The memory consistency model of a shared memory system determines the order in which memory operations will appear to execute to the programmer. This article describes how consistency is used in multiprocessors today and later digs into the details of popular consistency models in use today. The impact of these models on the multiprocessor performance is also discussed.

== Introduction ==

Many modern computer systems and most multicore chips support [http://en.wikipedia.org/wiki/Shared_memory shared memory] in hardware. In a shared memory system, each of the processor cores may read and write to a single shared address space. These designs seek various goodness properties, such as high performance, low power, and low cost. Of course, it is not valuable to provide these goodness properties without first providing correctness. Correct shared memory seems intuitive at a hand-wave level, but, there are subtle issues in even defining what it means for a shared memory system to be correct, as well as many subtle corner cases in designing a correct shared memory implementation. Moreover, these subtleties must be mastered in hardware implementations where bug fixes are expensive.

It is the job of consistency to define shared memory correctness. Consistency definitions provide rules about loads and stores (or memory reads and writes) and how they act upon memory. Ideally, consistency definitions would be simple and easy to understand. However, defining what it means for shared memory to behave correctly is more subtle than defining the correct behavior of, for example, a single-threaded processor core. The correctness criterion for a single processor core partitions behavior between one correct result and many incorrect alternatives. This is because the processor’s architecture mandates that the execution of a thread transforms a given input state into a single well-defined output state, even on an out-of-order core. Shared memory consistency models, however, concern the loads and stores of multiple threads and usually allow many correct executions while disallowing many incorrect ones. The possibility of multiple correct executions is due to the ISA allowing multiple threads to execute concurrently, often with many possible legal interleavings of instructions from different threads. The multitude of correct executions complicates the erstwhile simple challenge of determining whether an execution is correct. Nevertheless, consistency must be mastered to implement shared memory and, in some cases, to write correct programs that use it.

== Consistency in current-day multiprocessors ==

Today's scalable multiprocessors are mostly built with a distributed shared memory architecture. The memory is physically distributed but logically shared. In other words, the address spaces generated by all processing nodes globally form a single address space. The main advantage of such a system lies in the scalability of distributed hardware combined with the programmability of a shared [http://en.wikipedia.org/wiki/Virtual_memory virtual memory]. Representative systems include the Cray T3D, the Stanford Dash, the MIT Alewife, the Teracomputer, and Convex SPP.

A scalable system should be able to hide the long latencies of remote memory accesses. Several latency hiding techniques have been proposed. Coherent caches reduce the frequency of remote memory accesses by caching data close to the processor. Relaxed memory consistency allows reordering of memory events and buffering or pipelining of remote memory accesses. Data prefetching attempts to hide long read latency by issuing read requests well ahead of time, with the expectation that the data will be available in the cache when it is referenced. [http://en.wikipedia.org/wiki/Multithreading_(computer_architecture) Multithreading] attempts to hide the long latency by context switching between several active threads, thus allowing the processor to perform useful work while waiting for remote requests or synchronization faults to complete.

== Consistency models used ==

=== Address translation aware memory consistency <ref>http://people.ee.duke.edu/~sorin/papers/ieeemicro11_toppick.pdf</ref> ===

These memory consistency models define the behavior of operations (loads, stores, memory barriers, etc.) on physical addresses and virtual addresses. The two important levels of memory consistency that can be classified as address translation aware are described below:

==== Physical address memory consistency (PAMC)====
It is necessary to have correct PAMC for unmapped code to work correctly. Unmapped software, including the boot code and part of the system software that manages AT, relies upon PAMC. It is the responsibility of the hardware to implement PAMC and this is specified precisely in the architectural manual.
It is not too difficult to adapt an AT-oblivious consistency model as the specification of PAMC.

Example:
The PAMC model could be SC. In such a case the interface would specify that
(i) there must exist a total order of all loads and stores to physical addresses that respects the program order of each thread and
(ii) the value of each load is equal to the value of the most recent store to that physical address in the total order.
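
Conditions (i) and (ii) can be checked mechanically for a proposed total order. The following is a minimal sketch, assuming a hypothetical encoding of operations as <code>(thread, kind, address, value)</code> tuples and an initial memory value of 0; the function name <code>is_sc_order</code> is invented for illustration and is not taken from the cited paper.

```python
def is_sc_order(total_order, programs, initial=0):
    """Return True if total_order satisfies conditions (i) and (ii).

    total_order: list of (thread, kind, address, value), kind is "ld" or "st".
    programs: dict mapping each thread to its operations in program order.
    """
    # (i) the total order restricted to one thread must equal its program order
    for thread, program in programs.items():
        if [op for op in total_order if op[0] == thread] != program:
            return False
    # (ii) each load returns the most recent store to the same physical address
    memory = {}
    for _thread, kind, address, value in total_order:
        if kind == "st":
            memory[address] = value
        elif memory.get(address, initial) != value:
            return False
    return True


# Hypothetical two-thread program on physical addresses 0x10 and 0x20.
p1 = [("P1", "st", 0x10, 1), ("P1", "ld", 0x20, 0)]
p2 = [("P2", "st", 0x20, 1), ("P2", "ld", 0x10, 1)]
programs = {"P1": p1, "P2": p2}

good_order = [p1[0], p1[1], p2[0], p2[1]]
bad_order = [p1[0], p2[0], p1[1], p2[1]]  # P1's load would have to return 1

print(is_sc_order(good_order, programs))
print(is_sc_order(bad_order, programs))
```

An execution is sequentially consistent if at least one such total order exists; this sketch only validates a single candidate order rather than searching for one.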

==== Virtual address memory consistency (VAMC)====
Correct VAMC is required for mapped code to work correctly.

Although adapting an AT-oblivious consistency model for PAMC is straightforward, there are a few challenges when adapting an AT-oblivious consistency model for VAMC:
* synonyms - Multiple [http://en.wikipedia.org/wiki/Virtual_address_space virtual addresses] may map to the same [http://en.wikipedia.org/wiki/Physical_address physical address]. Suppose two virtual addresses VA1 and VA2 map to the same physical address PA. SC requires that the value of a load is equal to the value of the most recent store to the same address. It is possible to have a naive definition of VAMC that does not consider the level of indirection introduced by AT. To overcome this challenge, we re-formulate AT-oblivious consistency models for VAMC by applying the model to synonym sets of virtual addresses rather than individual addresses. For instance, we can define SC for VAMC as follows: there must exist a total order of all loads and stores to virtual addresses that respects program order and in which each load gets the value of the most recent store to any virtual address in the same virtual address synonym set. Incorporating synonyms explicitly in the consistency model allows programmers to reason about the ordering of accesses to virtual addresses.
* mapping and permission changes - Another challenge is that the set of memory operations at the VAMC level is richer than at the PAMC level.
* load/store side effects - Yet another challenge in specifying VAMC is that loads and stores to virtual addresses have certain side effects. The AT system includes status bits such as accessed and dirty bits for each page table entry. These status bits are part of the architectural state, and the ordering of updates to those bits must thus be specified in VAMC. To achieve this we add two new operations to the specification tables: Ld-sb (load’s impact on status bits) and St-sb (store’s impact on status bits).
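
The synonym problem can be made concrete with a small sketch. It assumes a fixed, hypothetical mapping from virtual to physical addresses (so mapping changes and status-bit side effects are out of scope) and uses the physical address as the name of the synonym set; <code>load_values_ok</code> is an illustrative helper, not part of the cited specification.

```python
def load_values_ok(history, va_to_pa, initial=0):
    """Check that every load sees the latest store to its synonym set.

    history: list of (kind, virtual_address, value) in the total order.
    va_to_pa: mapping that places each virtual address in a synonym set.
    """
    memory = {}  # keyed by synonym set (here, the physical address)
    for kind, va, value in history:
        synonym_set = va_to_pa[va]
        if kind == "st":
            memory[synonym_set] = value
        elif memory.get(synonym_set, initial) != value:
            return False
    return True


# A store through VA1 followed by a load through its synonym VA2.
history = [("st", "VA1", 7), ("ld", "VA2", 7)]

synonym_aware = {"VA1": "PA", "VA2": "PA"}
naive = {"VA1": "VA1", "VA2": "VA2"}  # treats every virtual address as distinct

print(load_values_ok(history, synonym_aware))
print(load_values_ok(history, naive))
```

With the naive mapping the same execution is rejected, because the load through VA2 is compared against the initial value instead of the store made through VA1.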

=== Causal consistency ===
[[Image:causal.jpg|thumb|right|300px|Causal Consistency example]]

Hutto and Ahamad <ref>P.W. Hutto and M. Ahamad. Slow memory: Weakening consistency to enhance concurrency in distributed shared memories. In Proceedings of the 10th International Conference on Distributed Computing Systems, pages 302–311, May 1990.</ref> introduced causal consistency. Lamport <ref>Leslie Lamport. Time, clocks, and the ordering of events in a distributed system. Communications of the ACM, 21(7):558–565, 1978.</ref> defined the notion of potential causality to capture the flow of information in a distributed system. This notion can be applied to a memory system by interpreting a write as a message-send event and a read as a message-read event. A memory is causally consistent if all processors agree on the order of causally related events. Causally unrelated events (concurrent events) can be observed in different orders.
The example shown here is a legal execution history under CC but not under SC. Note that W(x)1 and W(x)2 are causally related as P2 observed the first write by P1. Furthermore, P3 and P4 observe the accesses W(x)2 and W(x)3 in different orders, which would not be legal in SC.

=== Delta consistency <ref>http://en.wikipedia.org/wiki/Delta_consistency</ref>===

The delta consistency model states that after a fixed time period δ, an update is propagated through the system and all replicas will be consistent. In other words, barring a short bounded interval after a modification, the result of any read operation is consistent with a read on the original copy of an object. If an object is modified, the read will not be consistent during the short period of time following its modification. Once the fixed time period elapses, the modification is propagated and the read is now consistent.
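
The behaviour of a single replica under this model can be sketched as follows. The class <code>DeltaReplica</code> and its logical clock are assumptions made for illustration: time is an integer supplied by the caller, and every update becomes visible exactly δ time units after it was made on the original copy, which is the latest moment the model allows.

```python
class DeltaReplica:
    """Toy replica whose reads lag the original copy by at most delta."""

    def __init__(self, delta, initial=None):
        self.delta = delta
        self.value = initial
        self.pending = []  # (visible_at, value), kept in write order

    def write_at_origin(self, now, value):
        # The update reaches this replica once delta time units have elapsed.
        self.pending.append((now + self.delta, value))

    def read(self, now):
        # Apply every update whose propagation period has elapsed.
        while self.pending and self.pending[0][0] <= now:
            self.value = self.pending.pop(0)[1]
        return self.value


replica = DeltaReplica(delta=5, initial="old")
replica.write_at_origin(now=10, value="new")

inside_window = replica.read(now=12)  # still within the delta interval
after_window = replica.read(now=15)   # delta has elapsed
print(inside_window, after_window)
```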

=== Entry consistency ===

This consistency model has been designed to be used with critical sections. Here, the programmer needs to use acquire and release at the start and end of each [http://en.wikipedia.org/wiki/Critical_section critical section], just as in both variants of [http://en.wikipedia.org/wiki/Release_consistency release consistency]. However, it also requires every ordinary shared variable to be associated with a synchronization variable such as a lock or a barrier. If the elements of an array need to be accessed independently in parallel, then each element of the array must be associated with a lock. When an acquire is done on a synchronization variable, only those ordinary shared variables guarded by that synchronization variable are made consistent. Release consistency does not associate shared variables with locks or barriers and at acquire time has to determine empirically which variables it needs. This is where entry consistency differs from release consistency.

Formally, a memory exhibits entry consistency if the following conditions are met:

* An acquire access of a synchronization variable is not allowed to perform with respect to a process until all updates to the guarded shared data have been performed with respect to that process.
* Before an exclusive mode access to a synchronization variable by a process is allowed to perform with respect to that process, no other process may hold the synchronization variable, not even in nonexclusive mode.
* After an exclusive mode access to a synchronization variable has been performed, any other process's next nonexclusive mode access to that synchronization variable may not be performed until it has performed with respect to that variable's owner.

=== Eventual consistency ===

Given a sufficiently long period of time over which no changes are sent, all updates can be expected to propagate eventually through the system. Thus, all the replicas will be consistent. As the consistency achieved is eventual, the possibility of conflicts is high. The conflicts have to be resolved. There are three types of resolution:
* Read repair: The correction is done when a read finds an inconsistency. This slows down the read operation.
* Write repair: The correction is done when a write operation finds an inconsistency, slowing down the write operation.
* Asynchronous repair: The correction is not part of a read or write operation.
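
Read repair, the first of these strategies, can be sketched as follows. The sketch assumes that each replica is a dictionary mapping a key to a <code>(version, value)</code> pair, that the highest version number wins, and that at least one replica holds the key; these are illustrative assumptions, since the text does not prescribe how conflicts are detected or which copy is chosen.

```python
def read_with_repair(replicas, key):
    """Read key from all replicas and repair stale copies along the way.

    replicas: list of dicts mapping key -> (version, value).
    Assumes at least one replica holds the key (otherwise max() raises).
    """
    newest = max((r[key] for r in replicas if key in r),
                 key=lambda entry: entry[0])
    for r in replicas:
        if r.get(key) != newest:
            r[key] = newest  # the extra writes are what slow the read down
    return newest[1]


replicas = [{"x": (2, "b")}, {"x": (1, "a")}, {}]
result = read_with_repair(replicas, "x")
print(result, replicas)
```

Write repair would perform the same reconciliation inside the write path, while asynchronous repair would run it in the background, outside both reads and writes.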

=== Linearizability ===

This is also known as strict or atomic consistency. An operation is linearizable if it appears to the rest of the system to occur instantaneously. An atomic operation either occurs completely or does not occur at all. In reality, atomic operations do not actually occur instantaneously. Atomicity is enforced by mutual exclusion. At the software level, locks or semaphores are used to achieve this, while at the hardware level, a cache coherency protocol may be used. This makes it appear to the user that the entire operation occurred in a single instruction.

A sequence of invocations and responses made of an object by a set of threads is referred to as a history. Each invocation of a function is followed by a subsequent response.

Example:
Suppose two threads, A and B, attempt to acquire a lock, backing off if it's already taken. This would be modeled as both threads invoking the lock operation, then both threads receiving a response, one successful, one not.
* A calls lock
* B calls lock
* lock returns fail to A
* lock returns success to B

When every invocation is immediately followed by its response, the history is called a sequential history. A history is linearizable if:

* its invocations and responses can be reordered to yield a sequential history
* that sequential history is correct according to the sequential definition of the object
* if a response preceded an invocation in the original history, it must still precede it in the sequential reordering

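The invocations and responses of the above example can be reordered into a sequential history in two ways.

One way would be:
* A calls lock
* lock returns fail to A
* B calls lock
* lock returns success to B

The other way would be:
* B calls lock
* lock returns success to B
* A calls lock
* lock returns fail to A

Assuming the lock is free at the start, only the second reordering is correct according to the sequential definition of a lock, because the first caller of a free lock must succeed. The second reordering is therefore the one that linearizes the original history.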

Thus, an object is linearizable if all valid histories of its use can be linearized.
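
The three conditions for a linearizable history can be tested by brute force on the lock example. The sketch below assumes that the lock is free initially and is never released, and it encodes each operation as a hypothetical tuple <code>(thread, invocation_index, response_index, result)</code>, where the indices are positions in the original history.

```python
from itertools import permutations

# The history "A calls, B calls, fail to A, success to B" as operations.
ops = [("A", 0, 2, "fail"), ("B", 1, 3, "success")]


def respects_real_time(order):
    # If x responded before y was invoked, x may not be placed after y.
    for i, x in enumerate(order):
        for y in order[:i]:  # y is placed before x in the candidate order
            if x[2] < y[1]:
                return False
    return True


def legal_for_lock(order):
    # Sequential definition assumed here: the first lock call on a free
    # lock succeeds, and every later call fails (no unlock in this example).
    held = False
    for _thread, _inv, _res, result in order:
        if result != ("fail" if held else "success"):
            return False
        held = True
    return True


def linearizations(operations):
    return [order for order in permutations(operations)
            if respects_real_time(order) and legal_for_lock(order)]


found = linearizations(ops)
print([[op[0] for op in order] for order in found])
```

Because the two calls overlap in the original history, either order respects real time, and only the order with B first matches the behaviour of a lock. If A's response had preceded B's invocation, no linearization would exist for these results.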

=== PRAM consistency ===
[[Image:pram.jpg|thumb|right|300px|PRAM Consistency example]]

It is also known as FIFO consistency and was proposed by Lipton and Sandberg. The reasoning that led to this model was as follows: consider a multiprocessor where each processor has a local copy of the shared memory. For the memory to be scalable, the time for an access should be independent of the time it takes to access the other processors’ memories. Lipton and Sandberg proposed that on a read, a PRAM would simply return the value stored in the local copy of the memory. On a write, it would update the local copy first and broadcast the new value to the other processors. Assuming a constant time for initiating a broadcast operation, the goal of making the cost of a read or write constant is thus achieved. In terms of ordering constraints, this is equivalent to requiring that all processors observe the writes from a single processor in the same order, while they may disagree on the order of writes by different processors.
The example shown is legal under PRAM but not under SC or CC. P3 and P4 observe the writes by P1 and P2 in different orders, although W(x)1 and W(x)2 are potentially causally related.
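
The ordering constraint can be expressed as a short check. The sketch assumes uniquely labelled writes and a hypothetical record of the order in which each observer saw them; the sample data only mirrors the situation described in the text (two observers disagreeing on writes from different processors) and is not a transcription of the figure.

```python
def is_pram_consistent(writes_by_processor, observed_by_processor):
    """Each observer must see any single processor's writes in issue order."""
    for observed in observed_by_processor.values():
        for program_order in writes_by_processor.values():
            seen = [w for w in observed if w in program_order]
            expected = [w for w in program_order if w in seen]
            if seen != expected:
                return False
    return True


# Writes from different processors may be seen in different orders.
writes = {"P1": ["W(x)1"], "P2": ["W(x)2"]}
observed = {"P3": ["W(x)1", "W(x)2"], "P4": ["W(x)2", "W(x)1"]}

# Two writes from the same processor seen out of order violate PRAM.
writes_single = {"P1": ["W(x)1", "W(x)3"]}
observed_reordered = {"P3": ["W(x)3", "W(x)1"]}

print(is_pram_consistent(writes, observed))
print(is_pram_consistent(writes_single, observed_reordered))
```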

=== Release consistency ===

Release consistency is one of the consistency models used in the domain of concurrent programming (e.g. in distributed shared memory, distributed transactions etc.).

Systems of this kind are characterised by the existence of two special synchronisation operations, release and acquire. Before issuing a write to a memory object, a node must acquire the object via a special operation, and later release it. The code that runs between the acquire and the release therefore constitutes the critical region. The system is said to provide release consistency if all write operations by a certain node are seen by the other nodes after the former releases the object and before the latter acquire it.

There are two kinds of protocols that implement release consistency:
* ''eager'', where all coherence actions are performed on release operations, and
* ''lazy'', where all coherence actions are delayed until after a subsequent acquire

=== Sequential consistency ===
[[Image:sc.jpg|thumb|right|300px|Sequential Consistency example]]

Sequential consistency was first defined by Lamport in 1979. He defined a memory system to be sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.

In a sequentially consistent system, all processors must agree on the order of observed effects. The image to the right shows an example for SC.
Note that R(y)2 by processor P3 reads a value that has not been written yet! Of course, this is not possible in any real physical system. However, it shows a surprising flexibility of the SC model. Another reason why this is not a legal history for atomic consistency is that the write operations W(x)1 and W(y)2 appear commuted at processor P3.

Sequential consistency has been the canonical memory consistency model for a long time. However, many multiprocessor machines actually implement a slightly weaker model called processor consistency.

=== Weak consistency ===

A memory system is weakly consistent if it enforces the following restrictions:
* accesses to synchronization variables are sequentially consistent and
* no access to a synchronization variable is issued in a processor before all previous data accesses have been performed and
* no access is issued by a processor before a previous access to a synchronization variable has been performed
Notice that the meaning of “previous” is well-defined because it refers to program order. That is, an access A precedes access B if and only if the processor that executed access B has previously executed access A. Synchronizing accesses work as fences: at the time a synchronizing access performs, all previous accesses by that processor are guaranteed to have performed. The synchronization model corresponding to these access order constraints is relatively simple. A program executing on a weakly consistent system appears sequentially consistent if the following two constraints are observed:
* There are no data races and
* Synchronization is visible to the memory system.

=== Strong consistency ===

Strong consistency is one of the consistency models used in the domain of concurrent programming (e.g. in distributed shared memory, distributed transactions etc.). Strong consistency is supported if:
* All accesses are seen by all parallel processes (or nodes, processors etc.) in the same order (sequentially)
* Only one consistent state can be observed. Under weak consistency, by contrast, different parallel processes (or nodes, etc.) may perceive variables in different states.

== Practical performance impact ==

[[Image:consistency_models.jpg|thumb|center|700px|Implementation of consistency models]]

Here, we define performance as the processor utilization achieved in an execution. Processor utilization is used as the distinguishing factor because it provides reasonable results even when the program’s control path is not deterministic and depends on the relative timing of synchronization accesses. The following sections compare the performance achieved by the various consistency models on the LFC architecture (an aggressive implementation with lockup-free caches).

A BASE model has been added to the four consistency models, namely Sequential Consistency (SC), Processor Consistency (PC), Release Consistency (RC) and Weak Consistency (WC). BASE is the most constrained model and is used as the baseline for all performance comparisons. It incorporates no buffering or pipelining and waits for each read and write to complete before proceeding.

=== Performance of SC versus BASE ===

The SC model does not perform significantly better than BASE; the performance gain from BASE to SC is small for most applications. This is because reads are expected to be closely interleaved with writes, and under SC a read has to wait for pending writes to perform, so there is little opportunity to overlap writes. Significant write clustering may occur sometimes, for example, when initializing data structures, but such occurrences are expected to be infrequent.

=== Performance of PC versus SC and BASE ===

In PC, sequential consistency is abandoned. The main benefit of this extra complexity comes from the fact that reads do not have to stall for pending writes to perform. However, some of the benefits may be lost if the write buffer gets full and stalls the processor. The PC model is relatively successful in hiding almost all of the latency of writes given a reasonably deep write buffer. Since the comparison of PC and WC is more involved than the comparison with RC, we next examine PC versus RC. Subsequently, we compare PC and WC.

=== Performance of PC versus RC ===

In addition to providing all the benefits of the PC model, RC allows pipelining of writes by exploiting information about the synchronization accesses. That is, writes can be retired from the write buffer before ownership has been obtained. The fact that writes are retired at a faster rate has two implications:
* the write buffer becoming full is less of a problem, and
* in cases where a release operation is behind several writes in the write buffer, that release can be observed sooner by a processor waiting to do an acquire.
The write buffer getting full is not really a problem for PC in the first place. Therefore, most gains, if observed, will be due to faster completion of synchronization.

=== Performance of WC versus PC and RC ===

The differences between WC and RC arise because WC does not differentiate between acquire and release synchronization operations. Consequently, any synchronization operation must conservatively satisfy the constraints of both release and acquire. Thus, compared to RC, WC stalls the processor at an acquire until pending writes and releases complete. In addition, the processor is stalled for pending releases if it attempts to do a read operation.

On comparing WC and PC we observe a surprising result that PC sometimes performs better than WC. WC has the advantage that writes can be retired at a faster rate from the write buffer. The disadvantage of WC to PC is the same as the disadvantage of WC to RC, in that WC stalls the processor at some points for pending writes and releases to perform.

== Conclusion ==

In this article all the popular memory consistency models have been discussed in terms of how they are able to support multiprocessors today. Also, a performance based comparison of the different consistency models is presented. A decent effort has been made to discuss the relative weaknesses and strengths of different models. From what has been laid out earlier it is easy to conclude that there is no one best model, only the model that best fits the application, architecture, and programmer.

The choice is between a very strict consistency model such as SC and the relaxed consistency models. The sequential memory model is the easiest model to use for parallel computing. For a novice programmer who does not fully understand the problems inherent in memory consistency and cannot implement explicit mechanisms for synchronization, a sequentially consistent system may be the only possible model that will result in program correctness. There will be a significant penalty in performance, however, which could negate even the advantage of using a parallel system in the first place. Across a wide range of applications, the relaxed consistency models would average the best performance, but at greater programmer effort. A programmer with extensive knowledge of the architecture and experience in parallel programming will be able to write a high-performing application while ensuring correctness with synchronization mechanisms like barriers, locks, and fence instructions.

This article should help a programmer to make this choice and help select an appropriate model for his/her multiprocessor design.

==References==
<references/>

CSC/ECE 506 Spring 2012/10b sr

2012-04-05T08:32:52Z

Sshanbh: /* Address translation aware memory consistency */

'''Use of consistency models in current multiprocessors'''

The memory consistency model of a shared memory system determines the order in which memory operations will appear to execute to the programmer. This article describes how consistency is used in multiprocessors today and later digs into the details of popular consistency models in use today. The impact of these models on the multiprocessor performance is also discussed.

== Introduction ==

Many modern computer systems and most multicore chips support [http://en.wikipedia.org/wiki/Shared_memory shared memory] in hardware. In a shared memory system, each of the processor cores may read and write to a single shared address space. These designs seek various goodness properties, such as high performance, low power, and low cost. Of course, it is not valuable to provide these goodness properties without first providing correctness. Correct shared memory seems intuitive at a hand-wave level, but, there are subtle issues in even defining what it means for a shared memory system to be correct, as well as many subtle corner cases in designing a correct shared memory implementation. Moreover, these subtleties must be mastered in hardware implementations where bug fixes are expensive.

It is the job of consistency to define shared memory correctness. Consistency definitions provide rules about loads and stores (or memory reads and writes) and how they act upon memory. Ideally, consistency definitions would be simple and easy to understand. However, defining what it means for shared memory to behave correctly is more subtle than defining the correct behavior of, for example, a single-threaded processor core. The correctness criterion for a single processor core partitions behavior between one correct result and many incorrect alternatives. This is because the processor’s architecture mandates that the execution of a thread transforms a given input state into a single well-defined output state, even on an out-of-order core. Shared memory consistency models, however, concern the loads and stores of multiple threads and usually allow many correct executions while disallowing many incorrect ones. The possibility of multiple correct executions is due to the ISA allowing multiple threads to execute concurrently, often with many possible legal interleavings of instructions from different threads. The multitude of correct executions complicates the erstwhile simple challenge of determining whether an execution is correct. Nevertheless, consistency must be mastered to implement shared memory and, in some cases, to write correct programs that use it.

== Consistency in current-day multiprocessors ==

Today's scalable multiprocessors are mostly built with a distributed shared memory architecture. The memory is physically distributed but logically shared. In other words, the address spaces generated by all processing nodes globally form a single address space. The main advantage of such a system lies in the scalability of distributed hardware combined with the programmability of a shared [http://en.wikipedia.org/wiki/Virtual_memory virtual memory]. Representative systems include the Cray T3D, the Stanford Dash, the MIT Alewife, the Teracomputer, and Convex SPP.

A scalable system should be able to hide the long latencies of remote memory accesses. Several latency hiding techniques have been proposed. Coherent caches reduce the frequency of remote memory accesses by caching data close to the processor. Relaxed memory consistency allows reordering of memory events and buffering or pipelining of remote memory accesses. Data prefetching attempts to hide long read latency by issuing read requests well ahead of time, with the expectation that the data will be available in the cache when it is referenced. [http://en.wikipedia.org/wiki/Multithreading_(computer_architecture) Multithreading] attempts to hide the long latency by context switching between several active threads, thus allowing the processor to perform useful work while waiting for remote requests or synchronization faults to complete.

== Consistency models used ==

=== Address translation aware memory consistency <ref>http://people.ee.duke.edu/~sorin/papers/ieeemicro11_toppick.pdf</ref> ===

These memory consistency models define the behavior of operations (loads, stores, memory barriers, etc.) on physical addresses and virtual addresses. The two important levels of memory consistency that can be classified as address translation aware are described below:

==== Physical address memory consistency (PAMC)====
It is necessary to have correct PAMC for unmapped code to work correctly. Unmapped software, including the boot code and part of the system software that manages AT, relies upon PAMC. It is the responsibility of the hardware to implement PAMC and this is specified precisely in the architectural manual.
It is not too difficult to adapt an AT-oblivious consistency model as the specification of PAMC.

Example:
The PAMC model could be SC. In such a case the interface would specify that
(i) there must exist a total order of all loads and stores to physical addresses that respects the program order of each thread and
(ii) the value of each load is equal to the value of the most recent store to that physical address in the total order.
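
The two conditions can be checked mechanically for a proposed total order. The following is a minimal sketch, not part of any architectural manual: the <code>Op</code> record, the function name <code>is_sc_witness</code>, and the assumption that memory initially holds 0 are all illustrative choices.

```python
from typing import NamedTuple

class Op(NamedTuple):
    thread: int   # issuing thread
    index: int    # position in that thread's program order
    kind: str     # "ld" or "st"
    addr: int     # physical address
    value: int    # value stored, or value the load returned

def is_sc_witness(total_order, initial=0):
    """Return True if total_order satisfies conditions (i) and (ii)."""
    last_index = {}   # thread -> index of its latest operation seen so far
    memory = {}       # physical address -> most recent stored value
    for op in total_order:
        # (i) the total order must respect each thread's program order
        if op.index <= last_index.get(op.thread, -1):
            return False
        last_index[op.thread] = op.index
        if op.kind == "st":
            memory[op.addr] = op.value
        # (ii) a load returns the most recent store to the same address
        elif memory.get(op.addr, initial) != op.value:
            return False
    return True

# Thread 1 observes each store of thread 0 as it happens: allowed.
good = [Op(0, 0, "st", 0x100, 1), Op(1, 0, "ld", 0x100, 1),
        Op(0, 1, "st", 0x100, 2), Op(1, 1, "ld", 0x100, 2)]
# The load returns 1 although 2 is the most recent store: violates (ii).
stale = [Op(0, 0, "st", 0x100, 1), Op(0, 1, "st", 0x100, 2),
         Op(1, 0, "ld", 0x100, 1)]
```

An execution is SC if at least one total order of its operations passes this check; the sketch only validates a given candidate order and does not search for one.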

==== Virtual address memory consistency (VAMC)====
Correct VAMC is required for mapped code to work correctly.

Although adapting an AT-oblivious consistency model for PAMC is straightforward, there are a few challenges when adapting an AT-oblivious consistency model for VAMC:
* synonyms - Multiple [http://en.wikipedia.org/wiki/Virtual_address_space virtual addresses] may map to the same [http://en.wikipedia.org/wiki/Physical_address physical address]. Suppose two virtual addresses VA1 and VA2 map to the same physical address PA. SC requires that the value of a load is equal to the value of the most recent store to the same address. A naive definition of VAMC that does not consider the level of indirection introduced by AT would treat VA1 and VA2 as unrelated addresses, so a load from VA2 would not be required to return the value of an earlier store to VA1. To overcome this challenge, we re-formulate AT-oblivious consistency models for VAMC by applying the model to synonym sets of virtual addresses rather than individual addresses. For instance, we can define SC for VAMC as follows: there must exist a total order of all loads and stores to virtual addresses that respects program order and in which each load gets the value of the most recent store to any virtual address in the same virtual address synonym set. Incorporating synonyms explicitly in the consistency model allows programmers to reason about the ordering of accesses to virtual addresses.
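* mapping and permission changes - Another challenge is that the set of memory operations at the VAMC level is richer than at the PAMC level. In addition to loads and stores, it includes operations that change virtual-to-physical mappings and access permissions, and the model must specify how these are ordered with respect to loads and stores.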
* load/store side effects - Yet another challenge in specifying VAMC is that loads and stores to virtual addresses have certain side effects. The AT system includes status bits such as accessed and dirty bits for each page table entry. These status bits are part of the architectural state, and the ordering of updates to those bits must thus be specified in VAMC. To achieve this we add two new operations to the specification tables: Ld-sb (load’s impact on status bits) and St-sb (store’s impact on status bits).
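
The synonym-set rule for loads can be sketched as follows. This is only an illustration under simplifying assumptions: the page table is a plain dictionary that never changes, the history is an already chosen total order, and the names <code>load_value</code> and <code>naive_load_value</code> are hypothetical.

```python
def load_value(history, page_table, va, initial=0):
    """SC for VAMC: return the most recent store to any virtual address
    in the same synonym set as va (same physical address)."""
    pa = page_table[va]
    for kind, addr, value in reversed(history):
        if kind == "st" and page_table[addr] == pa:
            return value
    return initial

def naive_load_value(history, va, initial=0):
    """AT-oblivious rule: only stores to the identical virtual address count."""
    for kind, addr, value in reversed(history):
        if kind == "st" and addr == va:
            return value
    return initial

# VA1 and VA2 are synonyms for PA; VA3 maps elsewhere (illustrative mapping).
page_table = {"VA1": "PA", "VA2": "PA", "VA3": "PB"}
history = [("st", "VA1", 7), ("st", "VA3", 9)]   # total order so far

synonym_aware = load_value(history, page_table, "VA2")   # sees the store to VA1
naive = naive_load_value(history, "VA2")                 # misses it
```

The naive rule returns the initial value for the load from VA2, which is exactly the behavior the synonym-set formulation rules out.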

=== Causal consistency ===
[[Image:causal.jpg|thumb|right|300px|Causal Consistency example]]

Hutto and Ahamad <ref>P.W. Hutto and M. Ahamad. Slow memory: Weakening consistency to enhance concurrency in distributed shared memories. In Proceedings of the 10th International Conference on Distributed Computing Systems, pages 302–311, May 1990.</ref> introduced causal consistency. Lamport <ref>Leslie Lamport. Time, clocks, and the ordering of events in a distributed system. Communications of the ACM, 21(7):558–565, 1978.</ref> defined the notion of potential causality to capture the flow of information in a distributed system. This notion can be applied to a memory system by interpreting a write as a message-send event and a read as a message-read event. A memory is causally consistent if all processors agree on the order of causally related events. Causally unrelated events (concurrent events) can be observed in different orders.
The example shown here is a legal execution history under CC but not under SC. Note that W(x)1 and W(x)2 are causally related as P2 observed the first write by P1. Furthermore, P3 and P4 observe the accesses W(x)2 and W(x)3 in different orders, which would not be legal in SC.

=== Delta consistency ===

The delta consistency model states that after a fixed time period δ, an update is propagated through the system and all replicas will be consistent. In other words, barring a short bounded interval after a modification, the result of any read operation is consistent with a read on the original copy of an object. If an object is modified, the read will not be consistent during the short period of time following its modification. Once the fixed time period elapses, the modification is propagated and the read is now consistent.
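
A small simulation may make the bounded window concrete. This is a sketch under assumptions that are not in the model's definition: time is an integer, propagation takes exactly δ (the worst case the model allows), and the class name <code>DeltaStore</code> is invented for the example.

```python
class DeltaStore:
    """One object with an original copy and a replica that lags by delta."""

    def __init__(self, delta, initial=0):
        self.delta = delta
        self.initial = initial
        self.log = []   # (write_time, value), appended in increasing time order

    def write(self, time, value):
        self.log.append((time, value))

    def read_original(self, time):
        visible = [v for t, v in self.log if t <= time]
        return visible[-1] if visible else self.initial

    def read_replica(self, time):
        # A write becomes visible at the replica only delta time units later.
        visible = [v for t, v in self.log if t + self.delta <= time]
        return visible[-1] if visible else self.initial

store = DeltaStore(delta=5)
store.write(10, 42)
inside_window = store.read_replica(12)    # still the old value
after_window = store.read_replica(15)     # update has propagated
```

Reads of the replica disagree with the original copy only during the interval of length δ that follows a write.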

=== Entry consistency ===

This consistency model has been designed to be used with critical sections. Here, the programmer needs to use acquire and release at the start and end of each [http://en.wikipedia.org/wiki/Critical_section critical section] just like in both variants of [http://en.wikipedia.org/wiki/Release_consistency release consistency]. However, it also requires every ordinary shared variable to be associated with a synchronization variable such as a lock or a barrier. If the elements of an array need to be accessed independently in parallel, then each element of the array must be associated with a lock. When an acquire is done on a synchronization variable, only those ordinary shared variables guarded by that synchronization variable are made consistent. Release consistency does not associate shared variables with locks or barriers and at acquire time has to determine empirically which variables it needs. This is where entry consistency differs from release consistency.

Formally, a memory exhibits entry consistency if the following conditions are met:

* An acquire access of a synchronization variable is not allowed to perform with respect to a process until all updates to the guarded shared data have been performed with respect to that process.
* Before an exclusive mode access to a synchronization variable by a process is allowed to perform with respect to that process, no other process may hold the synchronization variable, not even in nonexclusive mode.
* After an exclusive mode access to a synchronization variable has been performed, any other process's next nonexclusive mode access to that synchronization variable may not be performed until it has performed with respect to that variable's owner.
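
The association between synchronization variables and the data they guard can be illustrated with a toy model. The sketch below is an assumption-laden simplification: it models only which variables become consistent at an acquire, it does not model mutual exclusion or exclusive versus nonexclusive modes, and the class and lock names are hypothetical.

```python
class EntryConsistentMemory:
    def __init__(self, guards):
        self.guards = guards   # lock name -> set of variables it guards
        self.home = {}         # values made visible by releases
        self.local = {}        # process id -> private copy of variables

    def acquire(self, pid, lock):
        # Only the variables guarded by this lock are made consistent.
        view = self.local.setdefault(pid, {})
        for var in self.guards[lock]:
            if var in self.home:
                view[var] = self.home[var]

    def release(self, pid, lock):
        # Publish the guarded variables this process has a copy of.
        view = self.local.setdefault(pid, {})
        for var in self.guards[lock]:
            if var in view:
                self.home[var] = view[var]

    def write(self, pid, var, value):
        self.local.setdefault(pid, {})[var] = value

    def read(self, pid, var):
        return self.local.get(pid, {}).get(var)

mem = EntryConsistentMemory({"Lx": {"x"}, "Ly": {"y"}})
mem.acquire(1, "Lx"); mem.write(1, "x", 1); mem.release(1, "Lx")
mem.acquire(1, "Ly"); mem.write(1, "y", 2); mem.release(1, "Ly")

mem.acquire(2, "Lx")       # process 2 acquires only the lock guarding x
x_seen = mem.read(2, "x")  # up to date
y_seen = mem.read(2, "y")  # not made consistent: Ly was never acquired
```

Process 2 sees the new value of x but has no consistent copy of y until it acquires Ly, which is the distinction from release consistency described above.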

=== Eventual consistency ===

Given a sufficiently long period of time over which no changes are sent, all updates can be expected to propagate eventually through the system. Thus, all the replicas will be consistent. As the consistency achieved is eventual, the possibility of conflicts is high. The conflicts have to be resolved. There are three types of resolution:
* Read repair: The correction is done when a read finds an inconsistency. This slows down the read operation.
* Write repair: The correction is done when a write operation finds an inconsistency, slowing down the write operation.
* Asynchronous repair: The correction is not part of a read or write operation.
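
Read repair can be sketched in a few lines. The example assumes, purely for illustration, that each replica stores a (version, value) pair per key and that the highest version wins; real systems may use other conflict resolution rules, and the function name is hypothetical.

```python
def read_with_repair(replicas, key):
    """Read key from every replica, return the newest value,
    and overwrite stale or missing copies on the way."""
    newest = max((r[key] for r in replicas if key in r),
                 key=lambda versioned: versioned[0])
    for replica in replicas:
        if replica.get(key) != newest:
            replica[key] = newest   # the repair is extra work done by the read
    return newest[1]

# Three replicas: one current, one stale, one that never received the key.
replicas = [{"k": (2, "new")}, {"k": (1, "old")}, {}]
value = read_with_repair(replicas, "k")
```

After the call, all three replicas hold the version 2 entry, at the cost of the additional writes performed during the read.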

=== Linearizability ===

This is also known as strict or atomic consistency. An operation is linearizable if it appears to the rest of the system to occur instantaneously. An atomic operation either occurs completely or does not occur at all. In reality, atomic operations do not actually occur instantaneously. Atomicity is enforced by mutual exclusion. At the software level, locks or semaphores are used to achieve this, while at the hardware level, a cache coherency protocol may be used. This makes it appear to the user that the entire operation occurred in a single instruction.

A sequence of invocations and responses made of an object by a set of threads is referred to as a history. When a function is invoked, a subsequent response is generated.

Example:
Suppose two threads, A and B, attempt to acquire a lock, backing off if it is already taken. This would be modeled as both threads invoking the lock operation, then both threads receiving a response, one successful, one not:
* A calls lock
* B calls lock
* lock returns fail to A
* lock returns success to B

A sequential history is one in which every invocation is immediately followed by its response. A history is linearizable if:

* its invocations and responses can be reordered to yield a sequential history
* that sequential history is correct according to the sequential definition of the object
* if a response preceded an invocation in the original history, it must still precede it in the sequential reordering

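Now we can reorder the above example in two ways as follows:

Example:
One way would be:
* A calls lock
* lock returns fail to A
* B calls lock
* lock returns success to B

The other way would be:
* B calls lock
* lock returns success to B
* A calls lock
* lock returns fail to A

The first reordering is not a correct sequential history: A calls lock while the lock is still free, so its call should have succeeded. Only the second reordering satisfies the sequential definition of a lock, and it is therefore the valid linearization of the original history. Thus, an object is linearizable if all valid histories of its use can be linearized.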
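
The three conditions can be checked by brute force for the small lock history above. The following sketch assumes one operation per thread and a lock that is never released during the history; the event encoding and the function names are illustrative only.

```python
from itertools import permutations

# Concurrent history from the example: (event, thread, result)
history = [
    ("call", "A", None),
    ("call", "B", None),
    ("ret", "A", "fail"),
    ("ret", "B", "success"),
]

def operations(history):
    """Collapse a history into {thread: [call_pos, ret_pos, result]}."""
    ops = {}
    for pos, (event, thread, result) in enumerate(history):
        if event == "call":
            ops[thread] = [pos, None, None]
        else:
            ops[thread][1:] = [pos, result]
    return ops

def legal_try_lock(results):
    """Sequential definition assumed here: only the first call succeeds."""
    return all(r == ("success" if i == 0 else "fail")
               for i, r in enumerate(results))

def linearizations(history):
    ops = operations(history)
    found = []
    for order in permutations(ops):
        # If X returned before Y was called originally, X must stay before Y.
        respects_time = all(
            not (ops[later][1] < ops[earlier][0])
            for i, earlier in enumerate(order)
            for later in order[i + 1:]
        )
        if respects_time and legal_try_lock([ops[t][2] for t in order]):
            found.append(order)
    return found

valid = linearizations(history)   # B's call is ordered before A's
```

For this history the search finds a single linearization, the one in which B's call precedes A's. A history in which A's failed call had already returned before B called lock would have no linearization at all under this sequential definition.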

=== PRAM consistency ===
[[Image:pram.jpg|thumb|right|300px|PRAM Consistency example]]

It is also known as FIFO consistency. PRAM (pipelined RAM) was proposed by Lipton and Sandberg, and the reasoning that led to this model was as follows: Consider a multiprocessor where each processor has a local copy of the shared memory. For the memory to be scalable, the time of an access should be independent of the time it takes to access the other processors’ memories. They proposed that on a read, a PRAM would simply return the value stored in the local copy of the memory. On a write, it would update the local copy first and broadcast the new value to the other processors. Assuming a constant time for initiating a broadcast operation, the goal of making the cost for a read or write constant is thus achieved. In terms of ordering constraints, this is equivalent to requiring that all processors observe the writes from a single processor in the same order while they may disagree on the order of writes by different processors.
The example shown is legal under PRAM but not under SC or CC. P3 and P4 observe the writes by P1 and P2 in different orders, although W(x)1 and W(x)2 are potentially causally related.
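
The ordering constraint can be expressed as a short check. The sketch below uses a made-up history (it is not the one in the figure), and the function name and data layout are assumptions for illustration.

```python
def is_pram_consistent(issued, observed):
    """issued: writer -> list of its writes in program order.
    observed: observer -> list of (writer, write) in the order seen."""
    for seen in observed.values():
        for writer, program_order in issued.items():
            from_writer = [w for who, w in seen if who == writer]
            # The writes seen from one writer must appear in that
            # writer's program order (an in-order subsequence).
            remaining = iter(program_order)
            if not all(w in remaining for w in from_writer):
                return False
    return True

issued = {"P1": ["x=1", "x=2"], "P2": ["x=3"]}

# P3 and P4 interleave the two writers differently: fine under PRAM.
ok = {
    "P3": [("P1", "x=1"), ("P2", "x=3"), ("P1", "x=2")],
    "P4": [("P2", "x=3"), ("P1", "x=1"), ("P1", "x=2")],
}
# P3 sees P1's writes out of program order: not PRAM consistent.
bad = {"P3": [("P1", "x=2"), ("P1", "x=1"), ("P2", "x=3")]}
```

Observers are free to disagree on how writes from different processors interleave; only the per-writer order is constrained.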

=== Release consistency ===

Release consistency is one of the consistency models used in the domain of concurrent programming (e.g. in distributed shared memory, distributed transactions etc.).

Systems of this kind are characterised by the existence of two special synchronisation operations, release and acquire. Before issuing a write to a memory object, a node must acquire the object via a special operation, and later release it. Therefore, the code that runs between the acquire and release operations constitutes the critical region. The system is said to provide release consistency if all write operations by a certain node are seen by the other nodes after the former releases the object and before the latter acquire it.

There are two kinds of protocols that implement release consistency:
* ''eager'', where all coherence actions are performed on release operations, and
* ''lazy'', where all coherence actions are delayed until after a subsequent acquire
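
The difference between the two protocol families lies in when updates travel. The toy model below is a sketch under strong assumptions: "coherence actions" are reduced to copying dictionary entries, the lazy variant attaches pending writes to the lock object, and all names are hypothetical.

```python
class Node:
    def __init__(self):
        self.memory = {}    # this node's view of shared data
        self.pending = {}   # writes made since the last release

def write(node, var, value):
    node.memory[var] = value
    node.pending[var] = value

def release_eager(node, others):
    # Eager: push every pending write to every other node at release time.
    for other in others:
        other.memory.update(node.pending)
    node.pending.clear()

def release_lazy(node, lock):
    # Lazy: only record the writes with the lock; nothing is sent yet.
    lock.update(node.pending)
    node.pending.clear()

def acquire_lazy(node, lock):
    # Lazy: the next acquirer pulls the writes associated with the lock.
    node.memory.update(lock)

# Eager protocol: b and c are both updated when a releases.
a, b, c = Node(), Node(), Node()
write(a, "x", 1)
release_eager(a, [b, c])
eager_views = (b.memory.get("x"), c.memory.get("x"))

# Lazy protocol: only the node that later acquires the lock is updated.
a, b, c = Node(), Node(), Node()
lock = {}
write(a, "x", 1)
release_lazy(a, lock)
before_acquire = b.memory.get("x")
acquire_lazy(b, lock)
lazy_views = (b.memory.get("x"), c.memory.get("x"))
```

In the lazy variant the node that never acquires the lock receives no update, which is where the reduction in communication comes from.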

=== Sequential consistency ===
[[Image:sc.jpg|thumb|right|300px|Sequential Consistency example]]

Sequential consistency was first defined by Lamport in 1979. He defined a memory system to be sequentially consistent if
the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations
of each individual processor appear in this sequence in the order specified by its program.

In a sequentially consistent system, all processors must agree on the order of observed effects. The image to the right shows an example for SC.
Note that R(y)2 by processor P3 reads a value that has not been written yet! Of course, this is not possible in any real physical system. However, it shows a surprising flexibility of the SC model. Another reason why this is not a legal history for atomic consistency is that the
write operations W(x)1 and W(y)2 appear commuted at processor P3.

Sequential consistency has been the canonical memory consistency model for a long time. However, many multiprocessor machines actually implement a slightly weaker model called processor consistency.
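
Lamport's definition can be explored by enumerating every interleaving of a small program. The sketch below uses a two-thread test that is not taken from the figure: each thread stores to one variable and then loads the other, with both variables assumed to start at 0. The function names are illustrative.

```python
def interleavings(a, b):
    """Yield every merge of a and b that keeps each sequence's own order."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest

def run(schedule):
    """Execute one sequential order and return the loaded values (r0, r1)."""
    memory, regs = {"x": 0, "y": 0}, {}
    for kind, target, source in schedule:
        if kind == "st":
            memory[target] = source
        else:
            regs[target] = memory[source]
    return regs["r0"], regs["r1"]

t0 = [("st", "x", 1), ("ld", "r0", "y")]   # x = 1; r0 = y
t1 = [("st", "y", 1), ("ld", "r1", "x")]   # y = 1; r1 = x

outcomes = {run(s) for s in interleavings(t0, t1)}
```

The outcome in which both loads return 0 never appears, because in any sequential order at least one store precedes the other thread's load. Models weaker than SC, in which a load may bypass a pending write, can produce that outcome.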

=== Weak consistency ===

A memory system is weakly consistent if it enforces the following restrictions:
* accesses to synchronization variables are sequentially consistent and
* no access to a synchronization variable is issued in a processor before all previous data accesses have been performed and
* no access is issued by a processor before a previous access to a synchronization variable has been performed
Notice that the meaning of “previous” is well-defined because it refers to program order. That is, an access A precedes access B if and only if the processor that executed access B has previously executed access A. Synchronizing accesses work as fences. At the time a synchronizing access performs, all previous accesses by that processor are guaranteed to have performed. The synchronization model corresponding to these access order constraints is relatively simple. A program executing on a weakly consistent system appears sequentially consistent if the following two constraints are observed:
* There are no data races and
* Synchronization is visible to the memory system.

=== Strong consistency ===

Strong consistency is one of the consistency models used in the domain of concurrent programming (e.g. in distributed shared memory, distributed transactions etc.). Strong consistency is supported if:
* All accesses are seen by all parallel processes (or nodes, processors etc.) in the same order (sequentially)
* Only one consistent state can be observed, as opposed to weak consistency, where different parallel processes (or nodes etc.) can perceive variables in different states.

== Practical performance impact ==

[[Image:consistency_models.jpg|thumb|center|700px|Implementation of consistency models]]

Here, we define performance as the processor utilization achieved in an execution. The reason for using processor utilization as the distinguishing factor is that it provides reasonable results even when the program’s control path is not deterministic and depends on relative timing of synchronization accesses. Let us make a comparative analysis of the performance achieved by the various consistency models on the LFC architecture (an aggressive implementation with lock-free caches).

A BASE model has been added to the four consistency models, viz. Sequential Consistency (SC), Processor Consistency (PC), Release Consistency (RC) and Weak Consistency (WC). BASE is the most constrained model and is used as the baseline for all performance comparisons. It incorporates no buffering or pipelining and waits for each read and write to complete before proceeding.

=== Performance of SC versus BASE ===

The SC model does not perform significantly better than BASE. The performance gain from BASE to SC is small for most applications. This is because reads are expected to be closely interleaved with writes: under SC a read still has to wait for pending writes to perform, so the processor gains only when several writes occur back to back. Significant write clustering may occur sometimes, for example, when initializing data structures, but such occurrences are expected to be infrequent.

=== Performance of PC versus SC and BASE ===

In PC, sequential consistency is abandoned. The main benefit of this extra complexity comes from the fact that reads do not have to stall for pending writes to perform. However, some of the benefits may be lost if the write buffer gets full and stalls the processor. The PC model is relatively successful in hiding almost all of the latency of writes given a reasonably deep write buffer. Since the comparison of PC and WC is more involved than the comparison with RC, we next examine PC versus RC. Subsequently, we compare PC and WC.

=== Performance of PC versus RC ===

In addition to providing all the benefits of the PC model, RC allows pipelining of writes by exploiting information about the synchronization accesses. That is, writes can be retired from the write buffer before ownership has been obtained. The fact that writes are retired at a faster rate has two implications:
* the write buffer becoming full is less of a problem, and
* in cases where a release operation is behind several writes in the write buffer, that release can be observed sooner by a processor waiting to do an acquire.
The write buffer getting full is really not a problem for PC. Therefore, most gains, if observed, will be due to faster completion of synchronization.

=== Performance of WC versus PC and RC ===

The differences between WC and RC arise because WC does not differentiate between acquire and release synchronization operations. Consequently, any synchronization operation must conservatively satisfy the constraints of both release and acquire. Thus, compared to RC, WC stalls the processor at an acquire until pending writes and releases complete. In addition, the processor is stalled for pending releases if it attempts to do a read operation.

On comparing WC and PC, we observe the surprising result that PC sometimes performs better than WC. WC has the advantage that writes can be retired at a faster rate from the write buffer. The disadvantage of WC relative to PC is the same as its disadvantage relative to RC: WC stalls the processor at some points until pending writes and releases have performed.

== Conclusion ==

In this article, the popular memory consistency models have been discussed in terms of how they support today's multiprocessors. A performance-based comparison of the different consistency models has also been presented, along with the relative strengths and weaknesses of the models. From what has been laid out earlier, it is clear that there is no one best model, only the model that best fits the application, architecture, and programmer.

The choice is between a very strict consistency model such as SC and other relaxed consistency models. The sequential memory model is the easiest model to use for parallel computing. For a novice programmer who does not fully understand the problems inherent with memory consistency and cannot implement explicit mechanisms for synchronization, a sequentially consistent system may be the only possible model that will result in program correctness. There will be a significant penalty in performance, however, which could negate even the advantage of using a parallel system in the first place. Across a wide range of applications, the relaxed consistency models would average the best performance, but at greater programmer effort. A programmer with extensive knowledge of the architecture and experience in parallel programming will be able to write a high-performing application while ensuring correctness with synchronization mechanisms like barriers, locks, and fence instructions.

This article should help a programmer to make this choice and help select an appropriate model for his/her multiprocessor design.

==References==
<references/>

CSC/ECE 506 Spring 2012/10b sr

2012-04-05T07:51:13Z

Sshanbh: /* Entry consistency */

'''Use of consistency models in current multiprocessors'''

The memory consistency model of a shared memory system determines the order in which memory operations will appear to execute to the programmer. This article describes how consistency is used in multiprocessors today and later digs into the details of popular consistency models in use today. The impact of these models on the multiprocessor performance is also discussed.

== Introduction ==

Many modern computer systems and most multicore chips support [http://en.wikipedia.org/wiki/Shared_memory shared memory] in hardware. In a shared memory system, each of the processor cores may read and write to a single shared address space. These designs seek various goodness properties, such as high performance, low power, and low cost. Of course, it is not valuable to provide these goodness properties without first providing correctness. Correct shared memory seems intuitive at a hand-wave level, but, there are subtle issues in even defining what it means for a shared memory system to be correct, as well as many subtle corner cases in designing a correct shared memory implementation. Moreover, these subtleties must be mastered in hardware implementations where bug fixes are expensive.

It is the job of consistency to define shared memory correctness. Consistency definitions provide rules about loads and stores (or memory reads and writes) and how they act upon memory. Ideally, consistency definitions would be simple and easy to understand. However, defining what it means for shared memory to behave correctly is more subtle than defining the correct behavior of, for example, a single-threaded processor core. The correctness criterion for a single processor core partitions behavior between one correct result and many incorrect alternatives. This is because the processor’s architecture mandates that the execution of a thread transforms a given input state into a single well-defined output state, even on an out-of-order core. Shared memory consistency models, however, concern the loads and stores of multiple threads and usually allow many correct executions while disallowing many incorrect ones. The possibility of multiple correct executions is due to the ISA allowing multiple threads to execute concurrently, often with many possible legal interleavings of instructions from different threads. The multitude of correct executions complicates the erstwhile simple challenge of determining whether an execution is correct. Nevertheless, consistency must be mastered to implement shared memory and, in some cases, to write correct programs that use it.

== Consistency in current-day multiprocessors ==

Today's scalable multiprocessors are mostly built with a distributed shared memory architecture. The memory is physically distributed but logically shared. In other words, the address spaces generated by all processing nodes globally form a single address space. The main advantage of such a system lies in the scalability of distributed hardware combined with the programmability of a shared [http://en.wikipedia.org/wiki/Virtual_memory virtual memory]. Representative systems include the Cray T3D, the Stanford Dash, the MIT Alewife, the Teracomputer, and Convex SPP.

A scalable system should be able to hide the long latencies of remote memory accesses. Several latency hiding techniques have been proposed: Coherent caches reduce the frequency of remote memory accesses by caching data close to the processor. Relaxed memory consistency allows reordering of memory events and buffering or pipelining of remote memory accesses. Data prefetching attempts to hide long read latency by issuing read requests well ahead of time, with the expectation that the data will be available in the cache when it is referenced. [http://en.wikipedia.org/wiki/Multithreading_(computer_architecture) Multithreading] attempts to hide the long latency by context switching between several active threads, thus allowing the processor to perform useful work while waiting for remote requests or synchronization faults to complete.

== Consistency models used ==

=== Address translation aware memory consistency ===

These memory consistency models define the behavior of operations (loads, stores, memory barriers, etc.) on physical addresses and virtual addresses. The two important levels of memory consistency that can be classified as address translation aware are described below:

==== Physical address memory consistency (PAMC)====
It is necessary to have correct PAMC for unmapped code to work correctly. Unmapped software, including the boot code and part of the system software that manages AT, relies upon PAMC. It is the responsibility of the hardware to implement PAMC and this is specified precisely in the architectural manual.
It is not too difficult to adapt an AT-oblivious consistency model as the specification of PAMC.

Example:
The PAMC model could be SC. In such a case the interface would specify that
(i) there must exist a total order of all loads and stores to physical addresses that respects the program order of each thread and
(ii) the value of each load is equal to the value of the most recent store to that physical address in the total order.

==== Virtual address memory consistency (VAMC)====
Correct VAMC is required for mapped code to work correctly.

Although adapting an AT-oblivious consistency model for PAMC is straightforward, there are a few challenges when adapting an AT-oblivious consistency model for VAMC:
* synonyms - Multiple [http://en.wikipedia.org/wiki/Virtual_address_space virtual addresses] may map to the same [http://en.wikipedia.org/wiki/Physical_address physical address]. Suppose two virtual addresses VA1 and VA2 map to the same physical address PA. SC requires that the value of a load is equal to the value of the most recent store to the same address. It is possible to have a naive definition of VAMC that does not consider the level of indirection introduced by AT. To overcome this challenge, we re-formulate AT-oblivious consistency models for VAMC by applying the model to synonym sets of virtual addresses rather than individual addresses. For instance, we can define SC for VAMC as follows: there must exist a total order of all loads and stores to virtual addresses that respects program order and in which each load gets the value of the most recent store to any virtual address in the same virtual address synonym set. Incorporating synonyms explicitly in the consistency model allows programmers to reason about the ordering of accesses to virtual addresses.
* mapping and permission changes - Another challenge is that the set of memory operations at the VAMC level is richer than at the PAMC level.
* load/store side effects - Yet another challenge in specifying VAMC is that loads and stores to virtual addresses have certain side effects. The AT system includes status bits such as accessed and dirty bits for each page table entry. These status bits are part of the architectural state, and the ordering of updates to those bits must thus be specified in VAMC. To achieve this we add two new operations to the specification tables: Ld-sb (load’s impact on status bits) and St-sb (store’s impact on status bits).

=== Causal consistency ===
[[Image:causal.jpg|thumb|right|300px|Causal Consistency example]]

Hutto and Ahamad <ref>P.W. Hutto and M. Ahamad. Slow memory: Weakening consistency to enhance concurrency in distributed shared memories. In Proceedings of the 10th International Conference on Distributed Computing Systems, pages 302–311, May 1990.</ref> introduced causal consistency. Lamport <ref>Leslie Lamport. Time, clocks, and the ordering of events in a distributed system. Communications of the ACM, 21(7):558–565, 1978.</ref> defined the notion of potential causality to capture the flow of information in a distributed system. This notion can be applied to a memory system by interpreting a write as a message-send event and a read as a message-read event. A memory is causally consistent if all processors agree on the order of causally related events. Causally unrelated events (concurrent events) can be observed in different orders.
The example shown here is a legal execution history under CC but not under SC. Note that W(x)1 and W(x)2 are causally related as P2 observed the first write by P1. Furthermore, P3 and P4 observe the accesses W(x)2 and W(x)3 in different orders, which would not be legal in SC.

=== Delta consistency ===

The delta consistency model states that after a fixed time period δ, an update is propagated through the system and all replicas will be consistent. In other words, barring a short bounded interval after a modification, the result of any read operation is consistent with a read on the original copy of an object. If an object is modified, the read will not be consistent during the short period of time following its modification. Once the fixed time period elapses, the modification is propagated and the read is now consistent.

=== Entry consistency ===

This consistency model has been designed to be used with critical sections. Here, the programmer needs to use acquire and release at the start and end of each [http://en.wikipedia.org/wiki/Critical_section critical section] just like in both variants of [http://en.wikipedia.org/wiki/Release_consistency release consistency]. However, it also requires every ordinary shared variable to be associated with a synchronization variable such as a lock or a barrier. If the elements of an array need to be accessed independently in parallel, then each element of the array must be associated with a lock. When an acquire is done on a synchronization variable, only those ordinary shared variables guarded by that synchronization variable are made consistent. Release consistency does not associate shared variables with locks or barriers and at acquire time has to determine empirically which variables it needs. This is where entry consistency differs from release consistency.

Formally, a memory exhibits entry consistency if the following conditions are met:

* An acquire access of a synchronization variable is not allowed to perform with respect to a process until all updates to the guarded shared data have been performed with respect to that process.
* Before an exclusive mode access to a synchronization variable by a process is allowed to perform with respect to that process, no other process may hold the synchronization variable, not even in nonexclusive mode.
* After an exclusive mode access to a synchronization variable has been performed, any other process's next nonexclusive mode access to that synchronization variable may not be performed until it has performed with respect to that variable's owner.

=== Eventual consistency ===

Given a sufficiently long period of time over which no changes are sent, all updates can be expected to propagate eventually through the system. Thus, all the replicas will be consistent. As the consistency achieved is eventual, the possibility of conflicts is high. The conflicts have to be resolved. There are three types of resolution:
* Read repair: The correction is done when a read finds an inconsistency. This slows down the read operation.
* Write repair: The correction is done when a write operation finds an inconsistency, slowing down the write operation.
* Asynchronous repair: The correction is not part of a read or write operation.

=== Linearizability ===

This is also known as strict or atomic consistency. An operation is linearizable if it appears to the rest of the system to occur instantaneously. An atomic operation either occurs completely or does not occur at all. In reality, atomic operations do not actually occur instantaneously. Atomicity is enforced by mutual exclusion. At the software level, locks or semaphores are used to achieve this, while at the hardware level, a cache coherency protocol may be used. This makes it appear to the user that the entire operation occurred in a single instruction.

A sequence of invocations and responses made of an object by a set of threads is referred to as a history. When a function is invoked, a subsequent response is generated.

Example:
Suppose two threads, A and B, attempt to acquire a lock, backing off if it is already taken. This would be modeled as both threads invoking the lock operation, then both threads receiving a response, one successful, one not:
* A calls lock
* B calls lock
* lock returns fail to A
* lock returns success to B

A sequential history is one in which every invocation is immediately followed by its response. A history is linearizable if:

* its invocations and responses can be reordered to yield a sequential history
* that sequential history is correct according to the sequential definition of the object
* if a response preceded an invocation in the original history, it must still precede it in the sequential reordering

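The invocations and responses in the above example can be reordered into a sequential history in two ways.

One way would be:
* A calls lock
* lock returns fail to A
* B calls lock
* lock returns success to B

This history is sequential, but it does not match the sequential definition of a lock: assuming the lock is initially free, A's call should have succeeded.

The other way would be:
* B calls lock
* lock returns success to B
* A calls lock
* lock returns fail to A

This history is sequential and is correct according to the sequential definition of the lock, so the original history is linearizable.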

Thus, an object is linearizable if all valid histories of its use can be linearized.
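
The search for a valid sequential reordering can be sketched for the lock example. The sketch assumes that the lock is initially free, that it is never released, and that the two operations overlap in the original history, so no real-time ordering constraint applies between them. The function names are hypothetical.

```python
from itertools import permutations


def legal_for_try_lock(ops):
    """Check a sequential history against the sequential definition of a lock.

    ops is a sequence of (thread, succeeded) pairs, one per completed call.
    Assumption: the lock starts free and is never released.
    """
    held = False
    for _thread, succeeded in ops:
        if succeeded != (not held):
            return False  # a call must succeed exactly when the lock is free
        if succeeded:
            held = True
    return True


def linearizations(overlapping_ops):
    """Return every ordering of mutually overlapping operations that is legal."""
    return [list(p) for p in permutations(overlapping_ops) if legal_for_try_lock(p)]
```

Calling <code>linearizations([("A", False), ("B", True)])</code> tries both orderings of the example and keeps only those that match the sequential definition of the lock; the history is linearizable if the result is not empty.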

=== PRAM consistency ===
[[Image:pram.jpg|thumb|right|300px|PRAM Consistency example]]

PRAM (pipelined RAM) consistency is also known as FIFO consistency. The reasoning that led to this model was as follows. Consider a multiprocessor where each processor has a local copy of the shared memory. For the memory to be scalable, the time to perform an access should be independent of the time it takes to access the other processors’ memories. Lipton and Sandberg therefore proposed that on a read, a PRAM would simply return the value stored in the local copy of the memory. On a write, it would update the local copy first and then broadcast the new value to the other processors. Assuming a constant time for initiating a broadcast operation, the goal of making the cost of a read or write constant is thus achieved. In terms of ordering constraints, this is equivalent to requiring that all processors observe the writes from a single processor in the same order, while they may disagree on the order of writes by different processors.

The example shown is legal under PRAM but not under SC or CC. P3 and P4 observe the writes by P1 and P2 in different orders, although W(x)1 and W(x)2 are potentially causally related.
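
The ordering constraint can be expressed as a small checker. The sketch assumes that every write has a unique label and that the order in which each observer sees the writes is known; the function name and the data layout are hypothetical, and the sample histories are only loosely modeled on the figure.

```python
def is_pram_consistent(program_order, observations):
    """Check that every observer sees each writer's writes in program order.

    program_order: writer -> list of write labels in issue order.
    observations:  observer -> list of write labels in the order observed.
    Writes by different writers may be observed in any relative order.
    """
    for seen in observations.values():
        for writes in program_order.values():
            seen_from_writer = [w for w in seen if w in writes]
            expected = [w for w in writes if w in seen_from_writer]
            if seen_from_writer != expected:
                return False
    return True
```

For instance, two observers that see W(x)1 from P1 and W(x)2 from P2 in opposite orders are accepted, whereas an observer that sees two writes of the same processor out of program order is rejected.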

=== Release consistency ===

Release consistency is one of the consistency models used in the domain of concurrent programming (e.g. in distributed shared memory, distributed transactions etc.).

Systems of this kind are characterised by the existence of two special synchronisation operations, release and acquire. Before issuing a write to a memory object, a node must acquire the object via a special operation, and later release it. The code that runs between the acquire and the release therefore constitutes the critical region. The system is said to provide release consistency if all write operations by a certain node are seen by the other nodes after the former releases the object and before the latter acquire it.

There are two kinds of protocols that implement release consistency:
* ''eager'', where all coherence actions are performed on release operations, and
* ''lazy'', where all coherence actions are delayed until after a subsequent acquire.
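
The difference between the two kinds of protocols can be sketched with a toy model. The class <code>RCSystem</code> is hypothetical; it assumes a single lock, whole-value updates, and one private memory copy per node, which is far simpler than a real protocol.

```python
class RCSystem:
    """Toy model of eager versus lazy propagation under release consistency."""

    def __init__(self, node_names, eager):
        self.eager = eager
        self.memory = {n: {} for n in node_names}   # per-node copies
        self.pending = {n: {} for n in node_names}  # writes made in the critical region
        self.at_lock = {}                           # lazy only: updates parked at the lock

    def acquire(self, node):
        if not self.eager:
            # Lazy: coherence actions happen at the subsequent acquire.
            self.memory[node].update(self.at_lock)

    def write(self, node, var, value):
        self.memory[node][var] = value
        self.pending[node][var] = value

    def release(self, node):
        updates, self.pending[node] = self.pending[node], {}
        if self.eager:
            # Eager: coherence actions happen at the release.
            for copy in self.memory.values():
                copy.update(updates)
        else:
            self.at_lock.update(updates)
```

With <code>eager=True</code> every node's copy is updated when the writer releases; with <code>eager=False</code> a node sees the update only after its own next acquire.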

=== Sequential consistency ===
[[Image:sc.jpg|thumb|right|300px|Sequential Consistency example]]

Sequential consistency was first defined by Lamport in 1979. He defined a memory system to be sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.

In a sequentially consistent system, all processors must agree on the order of observed effects. The image to the right shows an example of SC. Note that R(y)2 by processor P3 reads a value that has not been written yet. This is not possible in any real physical system, but it shows a surprising flexibility of the SC model. Another reason why this is not a legal history for atomic consistency is that the write operations W(x)1 and W(y)2 appear commuted at processor P3.

Sequential consistency has been the canonical memory consistency model for a long time. However, many multiprocessor machines actually implement a slightly weaker model called processor consistency.
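
Lamport's definition can be made concrete by enumerating every interleaving of two small programs and collecting the results. The two programs below form a common litmus test and are not taken from the figure; all names are hypothetical and both variables are assumed to start at 0.

```python
def interleavings(a, b):
    """Yield every merge of lists a and b that keeps each list's own order."""
    if not a:
        yield list(b)
        return
    if not b:
        yield list(a)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def run(schedule):
    """Execute one sequential order and return the values read into (r1, r2)."""
    mem = {"x": 0, "y": 0}
    regs = {}
    for op, var, arg in schedule:
        if op == "W":
            mem[var] = arg
        else:
            regs[arg] = mem[var]
    return regs["r1"], regs["r2"]


P1 = [("W", "x", 1), ("R", "y", "r1")]  # P1: x = 1; r1 = y
P2 = [("W", "y", 1), ("R", "x", "r2")]  # P2: y = 1; r2 = x


def sc_outcomes():
    """All (r1, r2) results that a sequentially consistent memory can produce."""
    return {run(s) for s in interleavings(P1, P2)}
```

The result (0, 0) is absent from <code>sc_outcomes()</code>: no sequential order that respects both program orders lets both reads miss both writes. Weaker models that let a read bypass a pending write can produce it.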

=== Weak consistency ===

A memory system is weakly consistent if it enforces the following restrictions:
* accesses to synchronization variables are sequentially consistent and
* no access to a synchronization variable is issued in a processor before all previous data accesses have been performed and
* no access is issued by a processor before a previous access to a synchronization variable has been performed
Notice that the meaning of “previous” is well-defined because it refers to program order. That is, an access A precedes access B if and only if the processor that executed access B has previously executed access A. Synchronizing accesses work as fences. At the time a synchronizing access performs, all previous accesses by that processor are guaranteed to have performed, and all later accesses by that processor are guaranteed not to have performed. The synchronization model corresponding to these access order constraints is relatively simple. A program executing on a weakly consistent system appears sequentially consistent if the following two constraints are observed:
* There are no data races and
* Synchronization is visible to the memory system.
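
The fence-like behavior of a synchronizing access can be sketched with a per-processor write buffer. The class <code>WeakProcessor</code> is hypothetical; it models only the rule that all previous data accesses must have performed before a synchronization access is issued.

```python
class WeakProcessor:
    """Toy processor whose data writes sit in a buffer until a synchronization access."""

    def __init__(self, shared):
        self.shared = shared  # memory visible to all processors
        self.buffer = []      # data writes that have not yet performed

    def write(self, var, value):
        # Ordinary data writes may be delayed and are not yet globally visible.
        self.buffer.append((var, value))

    def sync_write(self, sync_var, value):
        # All previous data accesses must perform before the synchronization access.
        for var, buffered in self.buffer:
            self.shared[var] = buffered
        self.buffer.clear()
        self.shared[sync_var] = value
```

A consumer that waits for the synchronization variable to change is therefore guaranteed to find the data writes that preceded it in program order.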

=== Strong consistency ===

Strong consistency is one of the consistency models used in the domain of concurrent programming (e.g. in distributed shared memory, distributed transactions etc.). Strong consistency is supported if:
* All accesses are seen by all parallel processes (or nodes, processors etc.) in the same order (sequentially)
* Only one consistent state can be observed, as opposed to weak consistency, where different parallel processes (or nodes etc.) can perceive variables in different states.

== Practical performance impact ==

[[Image:consistency_models.jpg|thumb|center|700px|Implementation of consistency models]]

Here, we define performance as the processor utilization achieved in an execution. Processor utilization is used as the distinguishing factor because it provides reasonable results even when the program’s control path is not deterministic and depends on the relative timing of synchronization accesses. The following sections compare the performance achieved by the various consistency models on the LFC architecture (an aggressive implementation with lockup-free caches).

A BASE model has been added to the four consistency models, namely Sequential Consistency (SC), Processor Consistency (PC), Release Consistency (RC) and Weak Consistency (WC). BASE is the most constrained model and is used as the baseline for all performance comparisons. It incorporates no buffering or pipelining and waits for each read and write to complete before proceeding.

=== Performance of SC versus BASE ===

The SC model does not perform significantly better than BASE. The performance gain from BASE to SC is small for most applications. This is because reads are expected to be closely interleaved with writes, so there are few runs of consecutive writes for SC to overlap. Significant write clustering may occur sometimes, for example, when initializing data structures, but such occurrences are expected to be infrequent.

=== Performance of PC versus SC and BASE ===

PC abandons sequential consistency. The main benefit of this extra complexity is that reads do not have to stall for pending writes to perform. However, some of the benefit may be lost if the write buffer gets full and stalls the processor. The PC model is relatively successful in hiding almost all of the latency of writes given a reasonably deep write buffer. Since the comparison of PC and WC is more involved than the comparison with RC, we next examine PC versus RC. Subsequently, we compare PC and WC.

=== Performance of PC versus RC ===

In addition to providing all the benefits of the PC model, RC allows pipelining of writes by exploiting information about the synchronization accesses. That is, writes can be retired from the write buffer before ownership has been obtained. The fact that writes are retired at a faster rate has two implications:
* the write buffer becoming full is less of a problem, and
* in cases where a release operation is behind several writes in the write buffer, that release can be observed sooner by a processor waiting to do an acquire.
Since the write buffer becoming full is rarely a problem even for PC, most gains of RC over PC, if observed, will be due to faster completion of synchronization.

=== Performance of WC versus PC and RC ===

The differences between WC and RC arise because WC does not differentiate between acquire and release synchronization operations. Consequently, any synchronization operation must conservatively satisfy the constraints of both release and acquire. Thus, compared to RC, WC stalls the processor at an acquire until pending writes and releases complete. In addition, the processor is stalled for pending releases if it attempts to do a read operation.

Comparing WC and PC yields the surprising result that PC sometimes performs better than WC. WC has the advantage that writes can be retired from the write buffer at a faster rate. The disadvantage of WC relative to PC is the same as its disadvantage relative to RC: WC stalls the processor at some points until pending writes and releases have performed.

== Conclusion ==

This article has discussed the popular memory consistency models in terms of how they support today's multiprocessors, and has presented a performance-based comparison of the different models along with their relative strengths and weaknesses. From what has been laid out earlier, there is no one best model, only the model that best fits the application, architecture, and programmer.

The choice is between a very strict consistency model such as SC and the relaxed consistency models. The sequential memory model is the easiest model to use for parallel computing. For a novice programmer who does not fully understand the problems inherent in memory consistency and cannot implement explicit mechanisms for synchronization, a sequentially consistent system may be the only possible model that will result in program correctness. There will be a significant penalty in performance, however, which could negate even the advantage of using a parallel system in the first place. Across a wide range of applications, the relaxed consistency models would average the best performance, but at greater programmer effort. A programmer with extensive knowledge of the architecture and experience in parallel programming will be able to write a high-performing application while ensuring correctness with synchronization mechanisms like barriers, locks, and fence instructions.

This article should help a programmer make this choice and select an appropriate model for their multiprocessor design.

==References==
<references/>

CSC/ECE 506 Spring 2012/10b sr

2012-04-05T07:32:06Z

Sshanbh: /* Virtual address memory consistency (VAMC) */

== Consistency in current-day multiprocessors ==

Today's scalable multiprocessors are mostly built with a distributed shared memory architecture. The memory is physically distributed but logically shared. In other words, the address spaces generated by all processing nodes globally form a single address space. The main advantage of such a system lies in combining the scalability of distributed hardware with the programmability of a shared [http://en.wikipedia.org/wiki/Virtual_memory virtual memory]. Representative systems include the Cray T3D, the Stanford Dash, the MIT Alewife, the Teracomputer, and Convex SPP.

A scalable system should be able to hide the long latencies of remote memory accesses. Several latency hiding techniques have been proposed. Coherent caches reduce the frequency of remote memory accesses by caching data close to the processor. Relaxed memory consistency allows reordering of memory events and buffering or pipelining of remote memory accesses. Data prefetching attempts to hide long read latency by issuing read requests well ahead of time, with the expectation that the data will be available in the cache when it is referenced. [http://en.wikipedia.org/wiki/Multithreading_(computer_architecture) Multithreading] attempts to hide the long latency by context switching between several active threads, thus allowing the processor to perform useful work while waiting for remote requests or synchronization faults to complete.

== Consistency models used ==

=== Address translation aware memory consistency ===

These memory consistency models define the behavior of operations (loads, stores, memory barriers, etc.) on physical addresses and virtual addresses. The two important levels of memory consistency that can be classified as address translation aware are described below:

==== Physical address memory consistency (PAMC)====
Correct PAMC is necessary for unmapped code to work correctly. Unmapped software, including the boot code and the part of the system software that manages address translation (AT), relies upon PAMC. It is the responsibility of the hardware to implement PAMC, and it is specified precisely in the architecture manual.
It is not too difficult to adapt an AT-oblivious consistency model as the specification of PAMC.

Example:
The PAMC model could be SC. In such a case the interface would specify that:
* (i) there must exist a total order of all loads and stores to physical addresses that respects the program order of each thread, and
* (ii) the value of each load is equal to the value of the most recent store to that physical address in the total order.
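
The two conditions of this example can be checked mechanically for a proposed total order. In the sketch below, the tuple layout <code>(thread, kind, address, value)</code>, the function name, and the assumption that memory is initially 0 are all hypothetical.

```python
def respects_sc(total_order, program_orders, initial=0):
    """Check a candidate total order against the two SC conditions.

    total_order:    list of (thread, kind, address, value), kind is "Ld" or "St".
    program_orders: thread -> list of that thread's operations in program order.
    """
    # Condition (i): the total order respects the program order of each thread.
    for thread, ops in program_orders.items():
        if [op for op in total_order if op[0] == thread] != ops:
            return False
    # Condition (ii): each load returns the most recent store to its address.
    memory = {}
    for _thread, kind, address, value in total_order:
        if kind == "St":
            memory[address] = value
        elif memory.get(address, initial) != value:
            return False
    return True
```

An execution is sequentially consistent at the physical address level if at least one total order passes this check.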

==== Virtual address memory consistency (VAMC)====
Correct VAMC is required for mapped code to work correctly.

Although adapting an AT-oblivious consistency model for PAMC is straightforward, there are a few challenges when adapting an AT-oblivious consistency model for VAMC:
* synonyms - Multiple [http://en.wikipedia.org/wiki/Virtual_address_space virtual addresses] may map to the same [http://en.wikipedia.org/wiki/Physical_address physical address]. Suppose two virtual addresses VA1 and VA2 map to the same physical address PA. SC requires that the value of a load is equal to the value of the most recent store to the same address. A naive definition of VAMC that ignores the level of indirection introduced by AT would treat VA1 and VA2 as unrelated addresses, so a load from VA2 would not be required to return the value of the most recent store to VA1. To overcome this challenge, we re-formulate AT-oblivious consistency models for VAMC by applying the model to synonym sets of virtual addresses rather than individual addresses. For instance, we can define SC for VAMC as follows: there must exist a total order of all loads and stores to virtual addresses that respects program order and in which each load gets the value of the most recent store to any virtual address in the same virtual address synonym set. Incorporating synonyms explicitly in the consistency model allows programmers to reason about the ordering of accesses to virtual addresses.
* mapping and permission changes - Another challenge is that the set of memory operations at the VAMC level is richer than at the PAMC level, because it also includes operations that change mappings and permissions.
* load/store side effects - Yet another challenge in specifying VAMC is that loads and stores to virtual addresses have certain side effects. The AT system includes status bits such as accessed and dirty bits for each page table entry. These status bits are part of the architectural state, and the ordering of updates to those bits must thus be specified in VAMC. To achieve this we add two new operations to the specification tables: Ld-sb (load’s impact on status bits) and St-sb (store’s impact on status bits).
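
The synonym-set formulation of SC can be sketched as follows. The sketch assumes a fixed translation table (no mapping or permission changes), ignores status bits, and uses the hypothetical name <code>run_vamc_sc</code>; two virtual addresses belong to the same synonym set when they translate to the same physical address.

```python
def run_vamc_sc(total_order, translate, initial=0):
    """Replay a total order of virtual-address accesses and return the load values.

    total_order: list of (kind, virtual_address, value), kind is "Ld" or "St";
                 the value field is ignored for loads.
    translate:   virtual address -> physical address (assumed not to change).
    """
    last_store = {}  # one entry per synonym set, keyed by physical address
    loaded = []
    for kind, va, value in total_order:
        synonym_set = translate[va]
        if kind == "St":
            last_store[synonym_set] = value
        else:
            # A load sees the most recent store to any synonym of its address.
            loaded.append(last_store.get(synonym_set, initial))
    return loaded
```

With VA1 and VA2 both mapped to PA, a load from VA2 that follows a store to VA1 in the total order returns the stored value, which a definition over individual virtual addresses would not require.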

=== Causal consistency ===
[[Image:causal.jpg|thumb|right|300px|Causal Consistency example]]

Hutto and Ahamad <ref>P. W. Hutto and M. Ahamad. Slow memory: Weakening consistency to enhance concurrency in distributed shared memories. In Proceedings of the 10th International Conference on Distributed Computing Systems, pages 302–311, May 1990.</ref> introduced causal consistency. Lamport <ref>Leslie Lamport. Time, clocks, and the ordering of events in a distributed system. Communications of the ACM, 21(7):558–565, 1978.</ref> defined the notion of potential causality to capture the flow of information in a distributed system. This notion can be applied to a memory system by interpreting a write as a message-send event and a read as a message-read event. A memory is causally consistent if all processors agree on the order of causally related events. Causally unrelated events (concurrent events) can be observed in different orders.
The example shown here is a legal execution history under CC but not under SC. Note that W(x)1 and W(x)2 are causally related as P2 observed the first write by P1. Furthermore, P3 and P4 observe the accesses W(x)2 and W(x)3 in different orders, which would not be legal in SC.

=== Delta consistency ===

The delta consistency model states that after a fixed time period δ, an update is propagated through the system and all replicas will be consistent. In other words, barring a short bounded interval after a modification, the result of any read operation is consistent with a read on the original copy of an object. If an object is modified, the read will not be consistent during the short period of time following its modification. Once the fixed time period elapses, the modification is propagated and the read is now consistent.
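
The bounded staleness can be sketched with timestamps. The sketch assumes that an update made at time t becomes visible on a replica exactly at time t + δ and that timestamps are comparable numbers; the function name <code>replica_read</code> is hypothetical.

```python
def replica_read(updates, now, delta, initial=None):
    """Return the value a replica would report at time now.

    updates: list of (time, value) pairs applied to the original copy.
    Assumption: an update made at time t is visible on the replica from t + delta.
    """
    visible = [(t, v) for t, v in updates if t + delta <= now]
    if not visible:
        return initial
    # The most recent visible update determines the value read.
    return max(visible, key=lambda update: update[0])[1]
```

A read issued less than δ after a modification may still return the old value; once δ has elapsed it returns the same value as a read on the original copy.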

=== Entry consistency ===

This consistency model has been designed to be used with critical sections. As in both variants of release consistency, the programmer needs to use acquire and release at the start and end of each critical section. However, entry consistency also requires every ordinary shared variable to be associated with a synchronization variable such as a lock or a barrier. If the elements of an array need to be accessed independently in parallel, then each element of the array must be associated with its own lock. When an acquire is done on a synchronization variable, only those ordinary shared variables guarded by that synchronization variable are made consistent. Release consistency, by contrast, does not associate shared variables with locks or barriers and has to determine empirically at acquire time which variables it needs. This is where entry consistency differs from release consistency.

Formally, a memory exhibits entry consistency if the following conditions are met:

* An acquire access of a synchronization variable is not allowed to perform with respect to a process until all updates to the guarded shared data have been performed with respect to that process.
* Before an exclusive mode access to a synchronization variable by a process is allowed to perform with respect to that process, no other process may hold the synchronization variable, not even in nonexclusive mode.
* After an exclusive mode access to a synchronization variable has been performed, any other process's next nonexclusive mode access to that synchronization variable may not be performed until it has performed with respect to that variable's owner.

=== Eventual consistency ===

Given a sufficiently long period of time over which no changes are sent, all updates can be expected to propagate eventually through the system. Thus, all the replicas will be consistent. As the consistency achieved is eventual, the possibility of conflicts is high. The conflicts have to be resolved. There are three types of resolution:
* Read repair: The correction is done when a read finds an inconsistency. This slows down the read operation.
* Write repair: The correction is done when a write operation finds an inconsistency, slowing down the write operation.
* Asynchronous repair: The correction is not part of a read or write operation.

=== Linearizability ===

This is also known as strict or atomic consistency. An operation is linearizable if it appears to the rest of the system to occur instantaneously. An atomic operation either occurs completely or does not occur at all. In reality, atomic operations do not actually occur instantaneously. Atomicity is enforced by mutual exclusion. At the software level, locks or semaphores are used to achieve this, while at the hardware level, a cache coherency protocol maybe used. This makes it appear to the user that the entire operation occurred in a single instruction.

A sequence of invocations and responses made of an object by a set of threads is referred to as history. When a function is invoked, a subsequent response is generated.

Example:
Suppose two threads, A and B attempt to acquire a lock, backing off if it's already taken.
This would be modeled as both threads invoking the lock operation, then both threads receiving a response, one successful, one not.
A calls lock
B calls lock
lock returns fail to A
lock returns success to B

When all calls make immediate responses, the history is called sequential history. A history that is linearizable is:

* its invocations and responses can be reordered to yield a sequential history
* that sequential history is correct according to the sequential definition of the object
* if a response preceded an invocation in the original history, it must still precede it in the sequential reordering

Now we can reorder the above example in two ways as follows:

Example:
One way would be:
A calls lock
lock returns fail to A
B calls lock
lock returns success to B

The other way would be:
B calls lock
lock returns success to B
A calls lock
lock returns fail to A

Thus, an object is linearizable if all valid histories of its use can be linearized.

=== PRAM consistency ===
[[Image:pram.jpg|thumb|right|300px|PRAM Consistency example]]

It is also known as FIFO consistency. The reasoning that led to this model was as follows: Consider a multi-processor where each processor has a
local copy of the shared memory. For the memory to be scalable, an access should be independent of the time it takes to access the other processors’ memories. They proposed that on a read, a PRAM would simply return the value stored in the local copy of the memory. On a write,
it would update the local copy first and broadcast the new value to the other processors. Assuming a constant time for initiating a broadcast operation, the goal of making the cost for a read or write constant is thus achieved. In terms of ordering constraints, this is equivalent to
requiring that all processors observe the writes from a single processor in the same order while they may disagree on the order of writes by different processors.
The example shown is legal under PRAM but not under SC or CC. P3 and P4 observe the writes by P1 and P2 in different orders, although W(x)1 and W(x)2 are potentially causally related.

=== Release consistency ===

Release consistency is one of the consistency models used in the domain of the concurrent programming (e.g. in distributed shared memory, distributed transactions etc.).

Systems of this kind are characterised by the existence of two special synchronisation operations, release and acquire. Before issuing a write to a memory object a node must acquire the object via a special operation, and later release it. Therefore the application that runs within the operation acquire and release constitutes the critical region. The system is said to provide release consistency, if all write operations by a certain node are seen by the other nodes after the former releases the object and before the latter acquire it.

There are two kinds of protocols that implement release consistency:
* ''eager'', where all coherence actions are performed on release operations, and
* ''lazy'', where all coherence actions are delayed until after a subsequent acquire

=== Sequential consistency ===
[[Image:sc.jpg|thumb|right|300px|PRAM Consistency example]]

Sequential consistency was first defined by Lamport in 1979. He defined a memory system to be sequentially consistent if:
the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations
of each individual processor appear in this sequence in the order specified by it's program.

In a sequentially consistent system, all processors must agree on the order of observed effects. The image to the right shows an example for SC.
Note that R(y)2 by processor P3 reads a value that has not been written yet! Of course, this is not possible in any real physical system. However, it shows a surprising flexibility of the SC model. Another reason why this is not a legal history for atomic consistency is that the
write operations W(x)1 and W(y)2 appear commuted at processor P3.

Sequential consistency has been the canonical memory consistency model for a long time. However, many multiprocessor machines actually implement a slightly weaker model called processor consistency.

=== Weak consistency ===

A memory system is weakly consistent if it enforces the following restrictions:
* accesses to synchronization variables are sequentially consistent and
* no access to a synchronization variable is issued in a processor before all previous data accesses have been performed and
* no access is issued by a processor before a previous access to a synchronization variable has been performed
Notice that the meaning of “previous” is well-defined because it refers to program order. That is, an access A precedes access B if an only if the processor that executed access B has previously executed access A. Synchronizing accesses work as fences. At the time a synchronizing access performs, all previous accesses by that processor are guaranteed not to have performed. The synchronization model corresponding to these access order constraints is relatively simple. A program executing on a weakly consistent system appears sequentially consistent if the following two constraints are observed:
* There are no data races and
* Synchronization is visible to the memory system.

=== Strong consistency ===

Strong consistency is one of the consistency models used in the domain of concurrent programming (e.g. in distributed shared memory, distributed transactions etc.). Strong consistency is supported if:
* All accesses are seen by all parallel processes (or nodes, processors etc.) in the same order (sequentially)
* Only one consistent state can be observed. On the other hand, in weak consistency (where different parallel processes or nodes etc.) perceive variables in different states.

== Practical performance impact ==

[[Image:consistency_models.jpg|thumb|center|700px|Implementation of consistency models]]

Here, we define performance as the processor utilization achieved in an execution. The reason for using processor utilization as the distinguishing factor is that it provides reasonable results even when the program’s control path is not deterministic and depends on relative timing of synchronization accesses. Let us make a comparative analysis of the performance achieved by the various consistency models on the LFC architecture (an aggressive implementation with lock-free caches).

A BASE model has been added to the four consistency models viz. Sequential Consistency (SC), Processor Consistency (PC), Release Consistency (RC) and Weak Consistency (WC). This is the most constrained model and is used as baseline for all performance comparisons. It incorporates no buffering or pipelining and waits for each read and write to complete before proceeding.

=== Performance of SC versus BASE ===

The SC model does not perform signiﬁcantly better than BASE. The performance gains from BASE to SC is small for most applications. This is because reads are expected to be closely interleaved with writes. Signiﬁcant write clustering may occur sometimes, for example, when initializing data structures, but such occurrences are expected to be infrequent.

=== Performance of PC versus SC and BASE ===

In PC, sequential consistency is abandoned. The main beneﬁt of this extra complexity comes from the fact that reads do not have to stall for pending writes to perform. However, some of the beneﬁts may be lost if the write buffer gets full and stalls the processor. The PC model is relatively successful in hiding almost all of the latency of writes given a reasonably deep write buffer. Since the comparison of PC and WC is more involved than the comparison with RC, we next examine PC versus RC. Subsequently, we compare PC and WC.

=== Performance of PC versus RC ===

In addition to providing all the benefits of the PC model, it allows pipelining of writes by exploiting information about the synchronization accesses. That is, writes can be retired from the write buffer before ownership has been obtained. The fact that writes are retired at a faster rate has two implications:
* the write buffer becoming full is less of a problem, and
* in cases where a release operation is behind several writes in the write buffer, that release can be observed sooner by
a processor waiting to do an acquire.
The write buffer getting full is really not a problem for PC. Therefore, most gains, if observed, will be due to faster completion of synchronization.

=== Performance of WC versus PC and RC ===

The differences between WC and RC arise because WC does not differentiate between acquire and release synchronization operations. Consequently, any synchronization operation must conservatively satisfy the constraints of both release and acquire. Thus, compared to RC, WC stalls the processor at an acquire until pending writes and releases complete. In addition, the processor is stalled for pending releases if it attempts to do a read operation.

On comparing WC and PC we observe a surprising result that PC sometimes performs better than WC. WC has the advantage that writes can be retired at a faster rate from the write buffer. The disadvantage of WC to PC is the same as the disadvantage of WC to RC, in that WC stalls the processor at some points for pending writes and releases to perform.

== Conclusion ==

This article has discussed the popular memory consistency models in terms of how they support today's multiprocessors, and has presented a performance-based comparison of the different models along with their relative weaknesses and strengths. From what has been laid out earlier, it is clear that there is no one best model, only the model that best fits the application, architecture, and programmer.

The choice is between a very strict consistency model such as SC and other relaxed consistency models. The sequential memory model is the easiest model to use for parallel computing. For a novice programmer who does not fully understand the problems inherent with memory consistency and cannot implement explicit mechanisms for synchronization, a sequentially consistent system may be the only possible model that will result in program correctness. There will be a significant penalty in performance, however, which could negate even the advantage of using a parallel system in the first place. Across a wide range of applications, the relaxed consistency models would average the best performance, but at greater programmer effort. A programmer with extensive knowledge of the architecture and experience in parallel programming will be able to write a high-performing application while ensuring correctness with synchronization mechanisms like barriers, locks, and fence instructions.

This article should help a programmer to make this choice and help select an appropriate model for his/her multiprocessor design.

==References==
<references/>

CSC/ECE 506 Spring 2012/10b sr

2012-04-05T07:12:12Z

Sshanbh: /* Consistency in current-day multiprocessors */

'''Use of consistency models in current multiprocessors'''

The memory consistency model of a shared memory system determines the order in which memory operations will appear to execute to the programmer. This article describes how consistency is used in multiprocessors today and later digs into the details of popular consistency models in use today. The impact of these models on the multiprocessor performance is also discussed.

== Introduction ==

Many modern computer systems and most multicore chips support [http://en.wikipedia.org/wiki/Shared_memory shared memory] in hardware. In a shared memory system, each of the processor cores may read and write to a single shared address space. These designs seek various goodness properties, such as high performance, low power, and low cost. Of course, it is not valuable to provide these goodness properties without first providing correctness. Correct shared memory seems intuitive at a hand-wave level, but, there are subtle issues in even defining what it means for a shared memory system to be correct, as well as many subtle corner cases in designing a correct shared memory implementation. Moreover, these subtleties must be mastered in hardware implementations where bug fixes are expensive.

It is the job of consistency to define shared memory correctness. Consistency definitions provide rules about loads and stores (or memory reads and writes) and how they act upon memory. Ideally, consistency definitions would be simple and easy to understand. However, defining what it means for shared memory to behave correctly is more subtle than defining the correct behavior of, for example, a single-threaded processor core. The correctness criterion for a single processor core partitions behavior between one correct result and many incorrect alternatives. This is because the processor’s architecture mandates that the execution of a thread transforms a given input state into a single well-defined output state, even on an out-of-order core. Shared memory consistency models, however, concern the loads and stores of multiple threads and usually allow many correct executions while disallowing many incorrect ones. The possibility of multiple correct executions is due to the ISA allowing multiple threads to execute concurrently, often with many possible legal interleavings of instructions from different threads. The multitude of correct executions complicates the erstwhile simple challenge of determining whether an execution is correct. Nevertheless, consistency must be mastered to implement shared memory and, in some cases, to write correct programs that use it.

== Consistency in current-day multiprocessors ==

Today's scalable multiprocessors are mostly built with a distributed shared memory architecture. The memory is physically distributed but logically shared. In other words, the address spaces generated by all processing nodes globally form a single address space. The main advantage of such a system lies in the scalability of distributed hardware combined with the programmability of a shared [http://en.wikipedia.org/wiki/Virtual_memory virtual memory]. Representative systems include the Cray T3D, the Stanford Dash, the MIT Alewife, the Teracomputer, and Convex SPP.

A scalable system should be able to hide the long latencies of remote memory accesses. Several latency hiding techniques have been proposed: Coherent caches reduce the frequency of remote memory accesses by caching data close to the processor. Relaxed memory consistency allows reordering of memory events and buffering or pipelining of remote memory accesses. Data prefetching attempts to hide long read latency by issuing read requests well ahead of time, with the expectation that the data will be available in the cache when it is referenced. [http://en.wikipedia.org/wiki/Multithreading_(computer_architecture) Multithreading] attempts to hide the long latency by context switching between several active threads, thus allowing the processor to perform useful work while waiting for remote requests or synchronization faults to complete.

== Consistency models used ==

=== Address translation aware memory consistency ===

These memory consistency models define the behavior of operations (loads, stores, memory barriers, etc.) on physical addresses and virtual addresses. The two important levels of memory consistency that can be classified as address translation aware are described below:

==== Physical address memory consistency (PAMC)====
It is necessary to have correct PAMC for unmapped code to work correctly. Unmapped software, including the boot code and part of the system software that manages AT, relies upon PAMC. It is the responsibility of the hardware to implement PAMC and this is specified precisely in the architectural manual.
It is not too difficult to adapt an AT-oblivious consistency model as the specification of PAMC.

Example:
The PAMC model could be SC. In such a case the interface would specify that
(i) there must exist a total order of all loads and stores to physical addresses that respects the program order of each thread and
(ii) the value of each load is equal to the value of the most recent store to that physical address in the total order.
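
For small executions, the two conditions above can be checked by brute force. The following Python sketch is only an illustration under stated assumptions: the function name <code>is_sc</code>, the tuple encoding of operations, and the initial memory value of 0 are hypothetical choices, not part of any architecture manual. It searches for a total order that respects each thread's program order and in which every load returns the most recent store to the same address.

```python
def is_sc(threads, init=0):
    """Return True if some total order of all operations satisfies SC.

    threads: list of per-thread operation lists in program order.
    Each operation is ("st", addr, value) or ("ld", addr, observed_value).
    init: assumed initial value of every address (an assumption of this sketch).
    """
    def search(positions, memory):
        # All threads finished: a legal total order has been found.
        if all(p == len(t) for p, t in zip(positions, threads)):
            return True
        for i, thread in enumerate(threads):
            p = positions[i]
            if p == len(thread):
                continue
            kind, addr, value = thread[p]
            nxt = positions[:i] + (p + 1,) + positions[i + 1:]
            if kind == "st":
                # Condition (i): take the next op of thread i in program order.
                if search(nxt, {**memory, addr: value}):
                    return True
            elif memory.get(addr, init) == value:
                # Condition (ii): the load must see the most recent store.
                if search(nxt, memory):
                    return True
        return False

    return search(tuple(0 for _ in threads), {})


# Classic litmus test: both loads returning 0 is impossible under SC.
t1 = [("st", "x", 1), ("ld", "y", 0)]
t2 = [("st", "y", 1), ("ld", "x", 0)]
print(is_sc([t1, t2]))
```

The search is exponential in the number of operations, so it is only suitable for litmus-test-sized examples.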

==== Virtual address memory consistency (VAMC)====
Correct VAMC is required for mapped code to work correctly.

Although adapting an AT-oblivious consistency model for PAMC is straightforward, there are a few challenges when adapting an AT-oblivious consistency model for VAMC:
* synonyms - Multiple virtual addresses may map to the same physical address. Suppose two virtual addresses VA1 and VA2 map to the same physical address PA. SC requires that the value of a load is equal to the value of the most recent store to the same address. A naive definition of VAMC that does not consider the level of indirection introduced by AT would treat VA1 and VA2 as different addresses, so a load from VA2 would not be required to return the value of an earlier store to VA1. To overcome this challenge, we re-formulate AT-oblivious consistency models for VAMC by applying the model to synonym sets of virtual addresses rather than individual addresses. For instance, we can define SC for VAMC as follows: there must exist a total order of all loads and stores to virtual addresses that respects program order and in which each load gets the value of the most recent store to any virtual address in the same virtual address synonym set. Incorporating synonyms explicitly in the consistency model allows programmers to reason about the ordering of accesses to virtual addresses.
* mapping and permission changes - Another challenge is that the set of memory operations at the VAMC level is richer than at the PAMC level.
* load/store side effects - Yet another challenge in specifying VAMC is that loads and stores to virtual addresses have certain side effects. The AT system includes status bits such as accessed and dirty bits for each page table entry. These status bits are part of the architectural state, and the ordering of updates to those bits must thus be specified in VAMC. To achieve this we add two new operations to the specification tables: Ld-sb (load’s impact on status bits) and St-sb (store’s impact on status bits).
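
The synonym issue can be illustrated with a small Python sketch. The class <code>SynonymMemory</code>, its method names, and the default value of 0 are hypothetical and exist only for this illustration; mapping changes and status bits are deliberately left out. Values are stored per physical address, so a load through any virtual address in a synonym set sees the most recent store made through any other member of that set.

```python
class SynonymMemory:
    """Toy memory in which several virtual addresses may share a physical one."""

    def __init__(self, page_table):
        # page_table maps virtual address -> physical address (fixed here;
        # mapping and permission changes are not modelled in this sketch).
        self.page_table = dict(page_table)
        self.physical = {}

    def store(self, va, value):
        self.physical[self.page_table[va]] = value

    def load(self, va, default=0):
        return self.physical.get(self.page_table[va], default)

    def synonym_set(self, va):
        """All virtual addresses that translate to the same physical address."""
        pa = self.page_table[va]
        return {v for v, p in self.page_table.items() if p == pa}


mem = SynonymMemory({"VA1": "PA", "VA2": "PA", "VA3": "PB"})
mem.store("VA1", 7)
print(mem.load("VA2"))  # the store through VA1 is visible through its synonym VA2
```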

=== Causal consistency ===
[[Image:causal.jpg|thumb|right|300px|Causal Consistency example]]

Hutto and Ahamad <ref>P.W. Hutto and M. Ahamad. Slow memory: Weakening consistency to enhance concurrency in distributed shared memories. In Proceedings of the 10th International Conference on Distributed Computing Systems, pages 302–311, May 1990.</ref> introduced causal consistency. Lamport <ref>Leslie Lamport. Time, clocks, and the ordering of events in a distributed system. Communications of the ACM, 21(7):558–565, 1978.</ref> defined the notion of potential causality to capture the flow of information in a distributed system. This notion can be applied to a memory system by interpreting a write as a message-send event and a read as a message-read event. A memory is causally consistent if all processors agree on the order of causally related events. Causally unrelated events (concurrent events) can be observed in different orders.
The example shown here is a legal execution history under CC but not under SC. Note that W(x)1 and W(x)2 are causally related as P2 observed the first write by P1. Furthermore, P3 and P4 observe the accesses W(x)2 and W(x)3 in different orders, which would not be legal in SC.

=== Delta consistency ===

The delta consistency model states that after a fixed time period δ, an update is propagated through the system and all replicas will be consistent. In other words, barring a short bounded interval after a modification, the result of any read operation is consistent with a read on the original copy of an object. If an object is modified, the read will not be consistent during the short period of time following its modification. Once the fixed time period elapses, the modification is propagated and the read is now consistent.
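
A minimal Python sketch of this behaviour is given below. It assumes a logical clock passed in by the caller and models the worst case, in which every update becomes visible exactly δ time units after it is issued; the class <code>DeltaReplica</code> and its methods are hypothetical names used only for illustration.

```python
class DeltaReplica:
    """Toy replica whose reads become consistent delta time units after an update."""

    def __init__(self, delta):
        self.delta = delta
        self.pending = []  # (time at which the update becomes visible, key, value)
        self.data = {}

    def receive_update(self, now, key, value):
        # The update is in flight until now + delta (worst case assumed).
        self.pending.append((now + self.delta, key, value))

    def read(self, now, key):
        # Apply, in arrival order, every update whose propagation period has elapsed.
        still_pending = []
        for visible_at, k, v in self.pending:
            if visible_at <= now:
                self.data[k] = v
            else:
                still_pending.append((visible_at, k, v))
        self.pending = still_pending
        return self.data.get(key)


replica = DeltaReplica(delta=5)
replica.receive_update(0, "x", 1)
replica.receive_update(10, "x", 2)
print(replica.read(12, "x"), replica.read(15, "x"))  # stale inside the window, then fresh
```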

=== Entry consistency ===

This consistency model has been designed to be used with critical sections. Here, the programmer needs to use acquire and release at the start and end of each critical section, just as in both variants of release consistency. However, it also requires every ordinary shared variable to be associated with a synchronization variable such as a lock or a barrier. If the elements of an array need to be accessed independently in parallel, then each element of the array must be associated with a lock. When an acquire is done on a synchronization variable, only those ordinary shared variables guarded by that synchronization variable are made consistent. Release consistency does not associate shared variables with locks or barriers and at acquire time has to determine empirically which variables it needs. This is where entry consistency differs from release consistency.

Formally, a memory exhibits entry consistency if the following conditions are met:

* An acquire access of a synchronization variable is not allowed to perform with respect to a process until all updates to the guarded shared data have been performed with respect to that process.
* Before an exclusive mode access to a synchronization variable by a process is allowed to perform with respect to that process, no other process may hold the synchronization variable, not even in nonexclusive mode.
* After an exclusive mode access to a synchronization variable has been performed, any other process's next nonexclusive mode access to that synchronization variable may not be performed until it has performed with respect to that variable's owner.

=== Eventual consistency ===

Given a sufficiently long period of time over which no changes are sent, all updates can be expected to propagate eventually through the system. Thus, all the replicas will be consistent. As the consistency achieved is eventual, the possibility of conflicts is high. The conflicts have to be resolved. There are three types of resolution:
* Read repair: The correction is done when a read finds an inconsistency. This slows down the read operation.
* Write repair: The correction is done when a write operation finds an inconsistency, slowing down the write operation.
* Asynchronous repair: The correction is not part of a read or write operation.
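
Read repair can be sketched as follows. This is only an illustration: it assumes that every stored value carries a version number and that the highest version wins, which is one common conflict-resolution policy but is not prescribed by the text above. The function name <code>read_with_repair</code> is hypothetical.

```python
def read_with_repair(replicas, key):
    """Read `key` from all replicas and repair stale copies.

    replicas: list of dicts mapping key -> (version, value).
    Assumption of this sketch: the entry with the highest version is correct.
    """
    entries = [r[key] for r in replicas if key in r]
    if not entries:
        return None
    newest = max(entries, key=lambda entry: entry[0])
    # The extra work below is what slows the read down.
    for r in replicas:
        if r.get(key) != newest:
            r[key] = newest
    return newest[1]


replicas = [{"k": (1, "a")}, {"k": (2, "b")}, {}]
print(read_with_repair(replicas, "k"))
```

After the call, all three replicas hold the same entry for the key, so a later read finds no inconsistency.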

=== Linearizability ===

This is also known as strict or atomic consistency. An operation is linearizable if it appears to the rest of the system to occur instantaneously. An atomic operation either occurs completely or does not occur at all. In reality, atomic operations do not actually occur instantaneously. Atomicity is enforced by mutual exclusion. At the software level, locks or semaphores are used to achieve this, while at the hardware level, a cache coherency protocol may be used. This makes it appear to the user that the entire operation occurred in a single instruction.

A sequence of invocations and responses made on an object by a set of threads is referred to as a history. Each invocation of a function is followed by a corresponding response.

Example:
Suppose two threads, A and B attempt to acquire a lock, backing off if it's already taken.
This would be modeled as both threads invoking the lock operation, then both threads receiving a response, one successful, one not.
A calls lock
B calls lock
lock returns fail to A
lock returns success to B

When every call is immediately followed by its response, the history is called a sequential history. A history is linearizable if:

* its invocations and responses can be reordered to yield a sequential history
* that sequential history is correct according to the sequential definition of the object
* if a response preceded an invocation in the original history, it must still precede it in the sequential reordering

Now we can reorder the above example in two ways as follows:

Example:
One way would be:
A calls lock
lock returns fail to A
B calls lock
lock returns success to B

The other way would be:
B calls lock
lock returns success to B
A calls lock
lock returns fail to A

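Assuming the lock was free initially, only the second reordering is a correct sequential history, because in the first one A's call would have succeeded. The original history is therefore linearizable through the second reordering. In general, an object is linearizable if all valid histories of its use can be linearized.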
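
The three conditions can be checked mechanically for the lock example. The Python sketch below is an illustration under stated assumptions: the lock is free initially, it is never released, each thread makes exactly one call, and the event encoding and the function name <code>linearizations</code> are hypothetical.

```python
from itertools import permutations


def linearizations(history):
    """Return the thread orders that linearize a try-lock history.

    history: list of ("call", thread) and ("ret", thread, result) events,
    with one call per thread (an assumption of this sketch).
    """
    ops = {}
    for index, event in enumerate(history):
        if event[0] == "call":
            ops[event[1]] = {"call": index}
        else:
            ops[event[1]].update(ret=index, result=event[2])

    valid = []
    for order in permutations(ops):
        # A response that preceded an invocation must still precede it.
        respects_time = all(
            not (ops[later]["ret"] < ops[earlier]["call"])
            for i, earlier in enumerate(order)
            for later in order[i + 1:]
        )
        # Sequential definition of the lock: free at first, never released,
        # so only the first attempt succeeds.
        expected = ["success"] + ["fail"] * (len(order) - 1)
        matches_spec = [ops[t]["result"] for t in order] == expected
        if respects_time and matches_spec:
            valid.append(order)
    return valid


history = [("call", "A"), ("call", "B"),
           ("ret", "A", "fail"), ("ret", "B", "success")]
print(linearizations(history))
```

Under these assumptions, the only order returned for the example history places B's operation before A's.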

=== PRAM consistency ===
[[Image:pram.jpg|thumb|right|300px|PRAM Consistency example]]

It is also known as FIFO consistency. The reasoning that led to this model was as follows: Consider a multiprocessor where each processor has a local copy of the shared memory. For the memory to be scalable, the time taken by an access should be independent of the time it takes to access the other processors’ memories. Its authors, Lipton and Sandberg, proposed that on a read, a PRAM would simply return the value stored in the local copy of the memory. On a write, it would update the local copy first and broadcast the new value to the other processors. Assuming a constant time for initiating a broadcast operation, the goal of making the cost for a read or write constant is thus achieved. In terms of ordering constraints, this is equivalent to requiring that all processors observe the writes from a single processor in the same order while they may disagree on the order of writes by different processors.
The example shown is legal under PRAM but not under SC or CC. P3 and P4 observe the writes by P1 and P2 in different orders, although W(x)1 and W(x)2 are potentially causally related.
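
The ordering constraint can be expressed as a simple check. The Python sketch below is only an illustration: the function name <code>is_pram</code> and the data layout are hypothetical, and it assumes that the values written by a single processor are distinct so that each observed write can be matched to a program-order position.

```python
def is_pram(program_writes, observations):
    """Check that every observer sees each writer's writes in program order.

    program_writes: {processor: [values written, in program order]}
    observations:   {observer: [(writer, value), ...] in the order observed}
    """
    for seen in observations.values():
        for writer, writes in program_writes.items():
            from_writer = [v for w, v in seen if w == writer]
            # from_writer must be a subsequence of the writer's program order.
            remaining = iter(writes)
            if not all(v in remaining for v in from_writer):
                return False
    return True


writes = {"P1": [1, 2], "P2": [3]}
# P3 and P4 disagree on where P2's write falls, which PRAM allows.
observed = {
    "P3": [("P1", 1), ("P2", 3), ("P1", 2)],
    "P4": [("P2", 3), ("P1", 1), ("P1", 2)],
}
print(is_pram(writes, observed))
```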

=== Release consistency ===

Release consistency is one of the consistency models used in the domain of concurrent programming (e.g. in distributed shared memory, distributed transactions etc.).

Systems of this kind are characterised by the existence of two special synchronisation operations, release and acquire. Before issuing a write to a memory object, a node must acquire the object via a special operation, and later release it. The code that runs between the acquire and the release therefore constitutes the critical region. The system is said to provide release consistency if all write operations by a certain node are seen by the other nodes after the former releases the object and before the latter acquire it.

There are two kinds of protocols that implement release consistency:
* ''eager'', where all coherence actions are performed on release operations, and
* ''lazy'', where all coherence actions are delayed until after a subsequent acquire

=== Sequential consistency ===
[[Image:sc.jpg|thumb|right|300px|Sequential Consistency example]]

Sequential consistency was first defined by Lamport in 1979. He defined a memory system to be sequentially consistent if: the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.

In a sequentially consistent system, all processors must agree on the order of observed effects. The image to the right shows an example for SC. Note that R(y)2 by processor P3 reads a value that has not been written yet! Of course, this is not possible in any real physical system. However, it shows a surprising flexibility of the SC model. Another reason why this is not a legal history for atomic consistency is that the write operations W(x)1 and W(y)2 appear commuted at processor P3.

Sequential consistency has been the canonical memory consistency model for a long time. However, many multiprocessor machines actually implement a slightly weaker model called processor consistency.

=== Weak consistency ===

A memory system is weakly consistent if it enforces the following restrictions:
* accesses to synchronization variables are sequentially consistent and
* no access to a synchronization variable is issued in a processor before all previous data accesses have been performed and
* no access is issued by a processor before a previous access to a synchronization variable has been performed
Notice that the meaning of “previous” is well-defined because it refers to program order. That is, an access A precedes access B if and only if the processor that executed access B has previously executed access A. Synchronizing accesses work as fences: at the time a synchronizing access performs, all previous accesses by that processor are guaranteed to have performed. The synchronization model corresponding to these access order constraints is relatively simple. A program executing on a weakly consistent system appears sequentially consistent if the following two constraints are observed:
* There are no data races and
* Synchronization is visible to the memory system.

=== Strong consistency ===

Strong consistency is one of the consistency models used in the domain of concurrent programming (e.g. in distributed shared memory, distributed transactions etc.). Strong consistency is supported if:
* All accesses are seen by all parallel processes (or nodes, processors etc.) in the same order (sequentially)
* Only one consistent state can be observed. This contrasts with weak consistency, where different parallel processes (or nodes etc.) can perceive variables in different states.

== Practical performance impact ==

[[Image:consistency_models.jpg|thumb|center|700px|Implementation of consistency models]]

Here, we define performance as the processor utilization achieved in an execution. The reason for using processor utilization as the distinguishing factor is that it provides reasonable results even when the program’s control path is not deterministic and depends on relative timing of synchronization accesses. Let us make a comparative analysis of the performance achieved by the various consistency models on the LFC architecture (an aggressive implementation with lock-free caches).

A BASE model has been added to the four consistency models viz. Sequential Consistency (SC), Processor Consistency (PC), Release Consistency (RC) and Weak Consistency (WC). This is the most constrained model and is used as baseline for all performance comparisons. It incorporates no buffering or pipelining and waits for each read and write to complete before proceeding.

=== Performance of SC versus BASE ===

The SC model does not perform significantly better than BASE. The performance gain from BASE to SC is small for most applications. This is because reads are expected to be closely interleaved with writes. Significant write clustering may occur sometimes, for example, when initializing data structures, but such occurrences are expected to be infrequent.

=== Performance of PC versus SC and BASE ===

In PC, sequential consistency is abandoned. The main benefit of this extra complexity comes from the fact that reads do not have to stall for pending writes to perform. However, some of the benefits may be lost if the write buffer gets full and stalls the processor. The PC model is relatively successful in hiding almost all of the latency of writes given a reasonably deep write buffer. Since the comparison of PC and WC is more involved than the comparison with RC, we next examine PC versus RC. Subsequently, we compare PC and WC.

=== Performance of PC versus RC ===

In addition to providing all the benefits of the PC model, RC allows pipelining of writes by exploiting information about the synchronization accesses. That is, writes can be retired from the write buffer before ownership has been obtained. The fact that writes are retired at a faster rate has two implications:
* the write buffer becoming full is less of a problem, and
* in cases where a release operation is behind several writes in the write buffer, that release can be observed sooner by a processor waiting to do an acquire.
The write buffer getting full is not really a problem for PC to begin with. Therefore, most gains, if observed, will be due to faster completion of synchronization.

=== Performance of WC versus PC and RC ===

The differences between WC and RC arise because WC does not differentiate between acquire and release synchronization operations. Consequently, any synchronization operation must conservatively satisfy the constraints of both release and acquire. Thus, compared to RC, WC stalls the processor at an acquire until pending writes and releases complete. In addition, the processor is stalled for pending releases if it attempts to do a read operation.

On comparing WC and PC we observe a surprising result that PC sometimes performs better than WC. WC has the advantage that writes can be retired at a faster rate from the write buffer. The disadvantage of WC to PC is the same as the disadvantage of WC to RC, in that WC stalls the processor at some points for pending writes and releases to perform.

== Conclusion ==

This article has discussed the popular memory consistency models in terms of how they support today's multiprocessors, and has presented a performance-based comparison of the different models along with their relative weaknesses and strengths. From what has been laid out earlier, it is clear that there is no one best model, only the model that best fits the application, architecture, and programmer.

The choice is between a very strict consistency model such as SC and other relaxed consistency models. The sequential memory model is the easiest model to use for parallel computing. For a novice programmer who does not fully understand the problems inherent with memory consistency and cannot implement explicit mechanisms for synchronization, a sequentially consistent system may be the only possible model that will result in program correctness. There will be a significant penalty in performance, however, which could negate even the advantage of using a parallel system in the first place. Across a wide range of applications, the relaxed consistency models would average the best performance, but at greater programmer effort. A programmer with extensive knowledge of the architecture and experience in parallel programming will be able to write a high-performing application while ensuring correctness with synchronization mechanisms like barriers, locks, and fence instructions.

This article should help a programmer to make this choice and help select an appropriate model for his/her multiprocessor design.

==References==
<references/>

CSC/ECE 506 Spring 2012/10b sr

2012-04-05T07:09:12Z

Sshanbh: /* Consistency in current-day multiprocessors */

'''Use of consistency models in current multiprocessors'''

The memory consistency model of a shared memory system determines the order in which memory operations will appear to execute to the programmer. This article describes how consistency is used in multiprocessors today and later digs into the details of popular consistency models in use today. The impact of these models on the multiprocessor performance is also discussed.

== Introduction ==

Many modern computer systems and most multicore chips support [http://en.wikipedia.org/wiki/Shared_memory shared memory] in hardware. In a shared memory system, each of the processor cores may read and write to a single shared address space. These designs seek various goodness properties, such as high performance, low power, and low cost. Of course, it is not valuable to provide these goodness properties without first providing correctness. Correct shared memory seems intuitive at a hand-wave level, but, there are subtle issues in even defining what it means for a shared memory system to be correct, as well as many subtle corner cases in designing a correct shared memory implementation. Moreover, these subtleties must be mastered in hardware implementations where bug fixes are expensive.

It is the job of consistency to define shared memory correctness. Consistency definitions provide rules about loads and stores (or memory reads and writes) and how they act upon memory. Ideally, consistency definitions would be simple and easy to understand. However, defining what it means for shared memory to behave correctly is more subtle than defining the correct behavior of, for example, a single-threaded processor core. The correctness criterion for a single processor core partitions behavior between one correct result and many incorrect alternatives. This is because the processor’s architecture mandates that the execution of a thread transforms a given input state into a single well-defined output state, even on an out-of-order core. Shared memory consistency models, however, concern the loads and stores of multiple threads and usually allow many correct executions while disallowing many incorrect ones. The possibility of multiple correct executions is due to the ISA allowing multiple threads to execute concurrently, often with many possible legal interleavings of instructions from different threads. The multitude of correct executions complicates the erstwhile simple challenge of determining whether an execution is correct. Nevertheless, consistency must be mastered to implement shared memory and, in some cases, to write correct programs that use it.

== Consistency in current-day multiprocessors ==

Today's scalable multiprocessors are mostly built with a distributed shared memory architecture. The memory is physically distributed but logically shared. In other words, the address spaces generated by all processing nodes globally form a single address space. The main advantage of such a system lies in the scalability of distributed hardware combined with the programmability of a shared [http://en.wikipedia.org/wiki/Virtual_memory virtual memory]. Representative systems include the Cray T3D, the Stanford Dash, the MIT Alewife, the Teracomputer, and Convex SPP.

A scalable system should be able to hide the long latencies of remote memory accesses. Several latency hiding techniques have been proposed: Coherent caches reduce the frequency of remote memory accesses by caching data close to the processor. Relaxed memory consistency allows reordering of memory events and buffering or pipelining of remote memory accesses. Data prefetching attempts to hide long read latency by issuing read requests well ahead of time, with the expectation that the data will be available in the cache when it is referenced. Multithreading attempts to hide the long latency by context switching between several active threads, thus allowing the processor to perform useful work while waiting for remote requests or synchronization faults to complete.

== Consistency models used ==

=== Address translation aware memory consistency ===

These memory consistency models define the behavior of operations (loads, stores, memory barriers, etc.) on physical addresses and virtual addresses. The two important levels of memory consistency that can be classified as address translation aware are described below:

==== Physical address memory consistency (PAMC)====
It is necessary to have correct PAMC for unmapped code to work correctly. Unmapped software, including the boot code and part of the system software that manages AT, relies upon PAMC. It is the responsibility of the hardware to implement PAMC and this is specified precisely in the architectural manual.
It is not too difficult to adapt an AT-oblivious consistency model as the specification of PAMC.

Example:
The PAMC model could be SC. In such a case the interface would specify that
(i) there must exist a total order of all loads and stores to physical addresses that respects the program order of each thread and
(ii) the value of each load is equal to the value of the most recent store to that physical address in the total order.

==== Virtual address memory consistency (VAMC)====
Correct VAMC is required for mapped code to work correctly.

Although adapting an AT-oblivious consistency model for PAMC is straightforward, there are a few challenges when adapting an AT-oblivious consistency model for VAMC:
* synonyms - Multiple virtual addresses may map to the same physical address. Suppose two virtual addresses VA1 and VA2 map to the same physical address PA. SC requires that the value of a load is equal to the value of the most recent store to the same address. A naive definition of VAMC that does not consider the level of indirection introduced by AT would treat VA1 and VA2 as different addresses, so a load from VA2 would not be required to return the value of an earlier store to VA1. To overcome this challenge, we re-formulate AT-oblivious consistency models for VAMC by applying the model to synonym sets of virtual addresses rather than individual addresses. For instance, we can define SC for VAMC as follows: there must exist a total order of all loads and stores to virtual addresses that respects program order and in which each load gets the value of the most recent store to any virtual address in the same virtual address synonym set. Incorporating synonyms explicitly in the consistency model allows programmers to reason about the ordering of accesses to virtual addresses.
* mapping and permission changes - Another challenge is that the set of memory operations at the VAMC level is richer than at the PAMC level.
* load/store side effects - Yet another challenge in specifying VAMC is that loads and stores to virtual addresses have certain side effects. The AT system includes status bits such as accessed and dirty bits for each page table entry. These status bits are part of the architectural state, and the ordering of updates to those bits must thus be specified in VAMC. To achieve this we add two new operations to the specification tables: Ld-sb (load’s impact on status bits) and St-sb (store’s impact on status bits).

=== Causal consistency ===
[[Image:causal.jpg|thumb|right|300px|Causal Consistency example]]

Hutto and Ahamad <ref>P.W. Hutto and M. Ahamad. Slow memory: Weakening consistency to enhance concurrency in distributed shared memories. In Proceedings of the 10th International Conference on Distributed Computing Systems, pages 302–311, May 1990.</ref> introduced causal consistency. Lamport <ref>Leslie Lamport. Time, clocks, and the ordering of events in a distributed system. Communications of the ACM, 21(7):558–565, 1978.</ref> defined the notion of potential causality to capture the flow of information in a distributed system. This notion can be applied to a memory system by interpreting a write as a message-send event and a read as a message-read event. A memory is causally consistent if all processors agree on the order of causally related events. Causally unrelated events (concurrent events) can be observed in different orders.
The example shown here is a legal execution history under CC but not under SC. Note that W(x)1 and W(x)2 are causally related as P2 observed the first write by P1. Furthermore, P3 and P4 observe the accesses W(x)2 and W(x)3 in different orders, which would not be legal in SC.

=== Delta consistency ===

The delta consistency model states that after a fixed time period δ, an update is propagated through the system and all replicas will be consistent. In other words, barring a short bounded interval after a modification, the result of any read operation is consistent with a read on the original copy of an object. If an object is modified, the read will not be consistent during the short period of time following its modification. Once the fixed time period elapses, the modification is propagated and the read is now consistent.
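
The bounded staleness window can be sketched as follows. This is a toy model, not a real protocol: time is a logical integer supplied by the caller, and the class name <code>DeltaReplica</code> and its methods are hypothetical. An update made at time t becomes visible at the replica at time t + δ, so reads inside the window may return stale data and reads after it may not:

```python
class DeltaReplica:
    """Toy replica that applies each update exactly delta time units late."""

    def __init__(self, delta):
        self.delta = delta
        self.pending = []   # (time at which the update becomes visible, key, value)
        self.data = {}

    def write(self, now, key, value):
        # The update is made on the original copy at time `now`;
        # this replica only sees it once the delta window has elapsed.
        self.pending.append((now + self.delta, key, value))

    def read(self, now, key):
        still_pending = []
        for visible_at, k, v in sorted(self.pending, key=lambda p: p[0]):
            if visible_at <= now:
                self.data[k] = v            # propagation has completed
            else:
                still_pending.append((visible_at, k, v))
        self.pending = still_pending
        return self.data.get(key)


replica = DeltaReplica(delta=5)
replica.write(now=0, key="x", value=1)
print(replica.read(now=3, key="x"))   # inside the window: stale
print(replica.read(now=5, key="x"))   # window elapsed: consistent
```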

=== Entry consistency ===

This consistency model has been designed to be used with critical sections. Here, the programmer needs to use acquire and release at the start and end of each critical section, just as in both variants of release consistency. However, it also requires every ordinary shared variable to be associated with a synchronization variable such as a lock or a barrier. If the elements of an array need to be accessed independently in parallel, then each element of the array must be associated with its own lock. When an acquire is done on a synchronization variable, only those ordinary shared variables guarded by that synchronization variable are made consistent. Release consistency does not associate shared variables with locks or barriers, and at acquire time it has to determine empirically which variables it needs. This is where entry consistency differs from release consistency.

Formally, a memory exhibits entry consistency if the following conditions are met:

* An acquire access of a synchronization variable is not allowed to perform with respect to a process until all updates to the guarded shared data have been performed with respect to that process.
* Before an exclusive mode access to a synchronization variable by a process is allowed to perform with respect to that process, no other process may hold the synchronization variable, not even in nonexclusive mode.
* After an exclusive mode access to a synchronization variable has been performed, any other process's next nonexclusive mode access to that synchronization variable may not be performed until it has performed with respect to that variable's owner.

=== Eventual consistency ===

Given a sufficiently long period of time over which no changes are sent, all updates can be expected to propagate eventually through the system. Thus, all the replicas will be consistent. As the consistency achieved is eventual, the possibility of conflicts is high. The conflicts have to be resolved. There are three types of resolution:
* Read repair: The correction is done when a read finds an inconsistency. This slows down the read operation.
* Write repair: The correction is done when a write operation finds an inconsistency, slowing down the write operation.
* Asynchronous repair: The correction is not part of a read or write operation.
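
As a rough illustration of read repair, the sketch below assumes that every stored value carries a version number and that the highest version wins. That conflict-resolution rule is an assumption made for the example and is not prescribed by eventual consistency itself. The function name <code>read_with_repair</code> is hypothetical:

```python
def read_with_repair(replicas, key):
    """Read `key` from all replicas and repair any stale or missing copies.

    `replicas` is a list of dicts mapping key -> (version, value).
    Assumption: the copy with the highest version number is the correct one.
    """
    copies = [r[key] for r in replicas if key in r]
    if not copies:
        return None
    newest = max(copies, key=lambda copy: copy[0])
    for r in replicas:
        if r.get(key) != newest:
            r[key] = newest        # the extra work that slows down the read
    return newest[1]


replicas = [{"x": (1, "old")}, {"x": (2, "new")}, {}]
print(read_with_repair(replicas, "x"))
print(replicas)
```

After the read, all three replicas hold the same copy of <code>x</code>.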

=== Linearizability ===

This is also known as strict or atomic consistency. An operation is linearizable if it appears to the rest of the system to occur instantaneously. An atomic operation either occurs completely or does not occur at all. In reality, atomic operations do not actually occur instantaneously. Atomicity is enforced by mutual exclusion. At the software level, locks or semaphores are used to achieve this, while at the hardware level, a cache coherency protocol may be used. This makes it appear to the user that the entire operation occurred in a single instruction.

A sequence of invocations and responses made of an object by a set of threads is referred to as a history. Each invocation of a function is followed by a subsequent response.

Example:
Suppose two threads, A and B, attempt to acquire a lock, backing off if it is already taken.
This would be modeled as both threads invoking the lock operation, then both threads receiving a response, one successful, one not:
* A calls lock
* B calls lock
* lock returns fail to A
* lock returns success to B

When every invocation is immediately followed by its response, the history is called a sequential history. A history is linearizable if:

* its invocations and responses can be reordered to yield a sequential history
* that sequential history is correct according to the sequential definition of the object
* if a response preceded an invocation in the original history, it must still precede it in the sequential reordering

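The example above can be reordered into a sequential history in two ways, but only one of them is a linearization:

One way would be:
* A calls lock
* lock returns fail to A
* B calls lock
* lock returns success to B
This reordering is not correct according to the sequential definition of a lock: the lock is free when A calls it, so A's call should have succeeded.

The other way would be:
* B calls lock
* lock returns success to B
* A calls lock
* lock returns fail to A
This reordering is a correct sequential history, so the original history is linearizable.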

Thus, an object is linearizable if all valid histories of its use can be linearized.
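
The three conditions can be checked by brute force for small histories. The sketch below is illustrative only: the event encoding and all function names are assumptions, each thread performs exactly one try-lock, and the lock is never released. It enumerates every reordering of the example history and keeps those that respect the original response-before-invocation order and the sequential definition of a lock:

```python
from itertools import permutations


def operations(history):
    """Pair each invocation with its response; one operation per thread."""
    ops = {}
    for index, event in enumerate(history):
        kind, thread = event[0], event[1]
        if kind == "inv":
            ops[thread] = {"thread": thread, "inv": index}
        else:
            ops[thread]["res"] = index
            ops[thread]["result"] = event[2]
    return list(ops.values())


def legal_trylock_sequence(seq):
    """Sequential definition: a try-lock succeeds only if the lock is free."""
    held = False
    for op in seq:
        expected = "fail" if held else "success"
        if op["result"] != expected:
            return False
        held = held or op["result"] == "success"
    return True


def respects_real_time(seq):
    """An operation that finished before another began must stay before it."""
    for position, later in enumerate(seq):
        for earlier in seq[:position]:
            if later["res"] < earlier["inv"]:
                return False
    return True


def linearizations(history):
    ops = operations(history)
    return [[op["thread"] for op in seq]
            for seq in permutations(ops)
            if respects_real_time(seq) and legal_trylock_sequence(seq)]


history = [("inv", "A"), ("inv", "B"),
           ("res", "A", "fail"), ("res", "B", "success")]
print(linearizations(history))
```

For this history the only surviving order is B followed by A, because an order in which A goes first would require A's call on a free lock to succeed.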

=== PRAM consistency ===
[[Image:pram.jpg|thumb|right|300px|PRAM Consistency example]]

It is also known as FIFO consistency. The reasoning that led to this model was as follows: consider a multiprocessor where each processor has a local copy of the shared memory. For the memory to be scalable, the cost of an access should be independent of the time it takes to access the other processors’ memories. Lipton and Sandberg, who introduced the model, proposed that on a read, a PRAM would simply return the value stored in the local copy of the memory. On a write, it would update the local copy first and broadcast the new value to the other processors. Assuming a constant time for initiating a broadcast operation, the goal of making the cost of a read or write constant is thus achieved. In terms of ordering constraints, this is equivalent to requiring that all processors observe the writes from a single processor in the same order, while they may disagree on the order of writes by different processors.
The example shown is legal under PRAM but not under SC or CC. P3 and P4 observe the writes by P1 and P2 in different orders, although W(x)1 and W(x)2 are potentially causally related.
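
The ordering constraint can be expressed as a small check. The sketch below is an illustration with invented names (<code>is_pram_view</code>, the processor labels and the write values are not taken from the figure), and it assumes each writer issues distinct values. It tests whether the sequence of writes observed by one processor preserves the issue order of every individual writer:

```python
def is_pram_view(view, program_order):
    """Check one observer's view against the PRAM (FIFO) constraint.

    view:          list of (writer, value) in the order the observer saw them
    program_order: dict mapping writer -> list of values in issue order
    """
    for writer, issued in program_order.items():
        seen = [value for w, value in view if w == writer]
        remaining = iter(issued)
        # `value in remaining` advances the iterator, so this succeeds only
        # if `seen` appears in `issued` in the same relative order.
        if not all(value in remaining for value in seen):
            return False
    return True


program_order = {"P1": [1, 2], "P2": [3]}
view_p3 = [("P1", 1), ("P2", 3), ("P1", 2)]
view_p4 = [("P2", 3), ("P1", 1), ("P1", 2)]
print(is_pram_view(view_p3, program_order), is_pram_view(view_p4, program_order))
print(is_pram_view([("P1", 2), ("P1", 1)], program_order))
```

The two observers disagree on where P2's write falls relative to P1's writes, which PRAM permits, whereas seeing P1's own writes out of order is rejected.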

=== Release consistency ===

Release consistency is one of the consistency models used in the domain of concurrent programming (e.g. in distributed shared memory, distributed transactions etc.).

Systems of this kind are characterised by the existence of two special synchronisation operations, release and acquire. Before issuing a write to a memory object, a node must acquire the object via a special operation, and later release it. The code that runs between the acquire and release operations therefore constitutes the critical region. The system is said to provide release consistency if all write operations by a certain node are seen by the other nodes after the former releases the object and before the latter acquire it.

There are two kinds of protocols that implement release consistency:
* ''eager'', where all coherence actions are performed on release operations, and
* ''lazy'', where all coherence actions are delayed until after a subsequent acquire

=== Sequential consistency ===
[[Image:sc.jpg|thumb|right|300px|Sequential Consistency example]]

Sequential consistency was first defined by Lamport in 1979. He defined a memory system to be sequentially consistent if:
the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.

In a sequentially consistent system, all processors must agree on the order of observed effects. The image to the right shows an example for SC.
Note that R(y)2 by processor P3 reads a value that has not been written yet! Of course, this is not possible in any real physical system. However, it shows a surprising flexibility of the SC model. Another reason why this is not a legal history for atomic consistency is that the
write operations W(x)1 and W(y)2 appear commuted at processor P3.

Sequential consistency has been the canonical memory consistency model for a long time. However, many multiprocessor machines actually implement a slightly weaker model called processor consistency.
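
Lamport's definition can be explored by enumerating interleavings. The sketch below uses a two-thread test that is not taken from this article (each thread stores to one variable and then loads the other), with hypothetical function names. It executes every interleaving that preserves each thread's program order against a single memory and collects the possible results:

```python
def interleavings(a, b):
    """Yield every merge of sequences a and b that keeps each one's order."""
    if not a:
        yield list(b)
        return
    if not b:
        yield list(a)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def sc_outcomes(thread0, thread1):
    """Return the set of (r1, r2) results reachable under SC."""
    outcomes = set()
    for schedule in interleavings(thread0, thread1):
        memory = {"x": 0, "y": 0}
        registers = {}
        for op, variable, arg in schedule:
            if op == "st":
                memory[variable] = arg               # arg is the value stored
            else:
                registers[arg] = memory[variable]    # arg is the register name
        outcomes.add((registers["r1"], registers["r2"]))
    return outcomes


thread0 = [("st", "x", 1), ("ld", "y", "r1")]
thread1 = [("st", "y", 1), ("ld", "x", "r2")]
print(sorted(sc_outcomes(thread0, thread1)))
```

The result in which both loads return 0 never appears, because it would require each load to precede the other thread's store in the single sequential order. Weaker models that let a load bypass an earlier buffered store can permit that result.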

=== Weak consistency ===

A memory system is weakly consistent if it enforces the following restrictions:
* accesses to synchronization variables are sequentially consistent and
* no access to a synchronization variable is issued in a processor before all previous data accesses have been performed and
* no access is issued by a processor before a previous access to a synchronization variable has been performed
Notice that the meaning of “previous” is well-defined because it refers to program order. That is, an access A precedes access B if and only if the processor that executed access B has previously executed access A. Synchronizing accesses work as fences. At the time a synchronizing access performs, all previous accesses by that processor are guaranteed to have performed. The synchronization model corresponding to these access order constraints is relatively simple. A program executing on a weakly consistent system appears sequentially consistent if the following two constraints are observed:
* There are no data races and
* Synchronization is visible to the memory system.

=== Strong consistency ===

Strong consistency is one of the consistency models used in the domain of concurrent programming (e.g. in distributed shared memory, distributed transactions etc.). Strong consistency is supported if:
* All accesses are seen by all parallel processes (or nodes, processors etc.) in the same order (sequentially)
* Only one consistent state can be observed. In contrast, under weak consistency different parallel processes (or nodes etc.) may perceive variables in different states.

== Practical performance impact ==

[[Image:consistency_models.jpg|thumb|center|700px|Implementation of consistency models]]

Here, we define performance as the processor utilization achieved in an execution. The reason for using processor utilization as the distinguishing factor is that it provides reasonable results even when the program’s control path is not deterministic and depends on relative timing of synchronization accesses. Let us make a comparative analysis of the performance achieved by the various consistency models on the LFC architecture (an aggressive implementation with lock-free caches).

A BASE model has been added to the four consistency models viz. Sequential Consistency (SC), Processor Consistency (PC), Release Consistency (RC) and Weak Consistency (WC). This is the most constrained model and is used as baseline for all performance comparisons. It incorporates no buffering or pipelining and waits for each read and write to complete before proceeding.

=== Performance of SC versus BASE ===

The SC model does not perform significantly better than BASE. The performance gain from BASE to SC is small for most applications. This is because reads are expected to be closely interleaved with writes. Significant write clustering may occur sometimes, for example, when initializing data structures, but such occurrences are expected to be infrequent.

=== Performance of PC versus SC and BASE ===

In PC, sequential consistency is abandoned. The main benefit of this extra complexity comes from the fact that reads do not have to stall for pending writes to perform. However, some of the benefit may be lost if the write buffer gets full and stalls the processor. Given a reasonably deep write buffer, the PC model is relatively successful in hiding almost all of the latency of writes. Since the comparison of PC and WC is more involved than the comparison with RC, we next examine PC versus RC. Subsequently, we compare PC and WC.

=== Performance of PC versus RC ===

In addition to providing all the benefits of the PC model, RC allows pipelining of writes by exploiting information about the synchronization accesses. That is, writes can be retired from the write buffer before ownership has been obtained. The fact that writes are retired at a faster rate has two implications:
* the write buffer becoming full is less of a problem, and
* in cases where a release operation is behind several writes in the write buffer, that release can be observed sooner by a processor waiting to do an acquire.
The write buffer becoming full is not much of a problem for PC in the first place. Therefore, most gains, if observed, will be due to faster completion of synchronization.

=== Performance of WC versus PC and RC ===

The differences between WC and RC arise because WC does not differentiate between acquire and release synchronization operations. Consequently, any synchronization operation must conservatively satisfy the constraints of both release and acquire. Thus, compared to RC, WC stalls the processor at an acquire until pending writes and releases complete. In addition, the processor is stalled for pending releases if it
attempts to do a read operation.

Comparing WC and PC yields a surprising result: PC sometimes performs better than WC. WC has the advantage that writes can be retired at a faster rate from the write buffer. The disadvantage of WC relative to PC is the same as its disadvantage relative to RC, in that WC stalls the processor at some points for pending writes and releases to perform.

== Conclusion ==

In this article all the popular memory consistency models have been discussed in terms of how they are able to support multiprocessors today. Also, a performance based comparison of the different consistency models is presented. A decent effort has been made to discuss the relative weaknesses and strengths of different models. From what has been laid out earlier it is easy to conclude that there is no one best model, only the model that best fits the application, architecture, and programmer.

The choice is between a very strict consistency model such as SC and other relaxed consistency models. The sequential memory model is the easiest model to use for parallel computing. For a novice programmer who does not fully understand the problems inherent with memory consistency and cannot implement explicit mechanisms for synchronization, a sequentially consistent system may be the only possible model that will result in program correctness. There will be a significant penalty in performance, however, which could negate even the advantage of using a parallel system in the first place. Across a wide range of applications, the relaxed consistency models would average the best performance, but at greater programmer effort. A programmer with extensive knowledge of the architecture and experience in parallel programming will be able to write a high-performing application while ensuring correctness with synchronization mechanisms like barriers, locks, and fence instructions.

This article should help a programmer to make this choice and help select an appropriate model for his/her multiprocessor design.

==References==
<references/>

CSC/ECE 506 Spring 2012/10b sr

2012-04-05T07:05:25Z

Sshanbh: /* Consistency in current-day multiprocessors */

'''Use of consistency models in current multiprocessors'''

The memory consistency model of a shared memory system determines the order in which memory operations will appear to execute to the programmer. This article describes how consistency is used in multiprocessors today and later digs into the details of popular consistency models in use today. The impact of these models on the multiprocessor performance is also discussed.

== Introduction ==

Many modern computer systems and most multicore chips support [http://en.wikipedia.org/wiki/Shared_memory shared memory] in hardware. In a shared memory system, each of the processor cores may read and write to a single shared address space. These designs seek various goodness properties, such as high performance, low power, and low cost. Of course, it is not valuable to provide these goodness properties without first providing correctness. Correct shared memory seems intuitive at a hand-wave level, but, there are subtle issues in even defining what it means for a shared memory system to be correct, as well as many subtle corner cases in designing a correct shared memory implementation. Moreover, these subtleties must be mastered in hardware implementations where bug fixes are expensive.

It is the job of consistency to define shared memory correctness. Consistency definitions provide rules about loads and stores (or memory reads and writes) and how they act upon memory. Ideally, consistency definitions would be simple and easy to understand. However, defining what it means for shared memory to behave correctly is more subtle than defining the correct behavior of, for example, a single-threaded processor core. The correctness criterion for a single processor core partitions behavior between one correct result and many incorrect alternatives. This is because the processor’s architecture mandates that the execution of a thread transforms a given input state into a single well-defined output state, even on an out-of-order core. Shared memory consistency models, however, concern the loads and stores of multiple threads and usually allow many correct executions while disallowing many incorrect ones. The possibility of multiple correct executions is due to the ISA allowing multiple threads to execute concurrently, often with many possible legal interleavings of instructions from different threads. The multitude of correct executions complicates the erstwhile simple challenge of determining whether an execution is correct. Nevertheless, consistency must be mastered to implement shared memory and, in some cases, to write correct programs that use it.

== Consistency in current-day multiprocessors ==

Today's scalable multiprocessors are mostly built with a distributed shared memory architecture. The memory is physically distributed but logically shared. In other words, the address spaces generated by all processing nodes globally form a single address space. The main advantage of such a system lies in combining the scalability of distributed hardware with the programmability of a shared [http://en.wikipedia.org/wiki/Virtual_memory virtual memory]. Representative systems include the Cray T3D, the Stanford Dash, the MIT Alewife, the Teracomputer, and Convex SPP.

A scalable system should be able to hide the long latencies of remote memory accesses. Several latency hiding techniques have been proposed: Coherent caches reduce the frequency of remote memory accesses by caching data close to the processor. Relaxed memory consistency allows reordering of memory events and buffering or pipelining of remote memory accesses. Data prefetching attempts to hide long read latency by issuing read requests well ahead of time, with the expectation that the data will be available in the cache when it is referenced. Multithreading attempts to hide the long latency by context switching between several active threads, thus allowing the processor to perform useful work while waiting for remote requests or synchronization faults to complete.

== Consistency models used ==

=== Address translation aware memory consistency ===

These memory consistency models define the behavior of operations (loads, stores, memory barriers, etc.) on physical addresses and virtual addresses. The two important levels of memory consistency that can be classified as address translation aware are described below:

==== Physical address memory consistency (PAMC)====
It is necessary to have correct PAMC for unmapped code to work correctly. Unmapped software, including the boot code and part of the system software that manages AT, relies upon PAMC. It is the responsibility of the hardware to implement PAMC and this is specified precisely in the architectural manual.
It is not too difficult to adapt an AT-oblivious consistency model as the specification of PAMC.

Example:
The PAMC model could be SC. In such a case the interface would specify that
(i) there must exist a total order of all loads and stores to physical addresses that respects the program order of each thread and
(ii) the value of each load is equal to the value of the most recent store to that physical address in the total order.

==== Virtual address memory consistency (VAMC)====
Correct VAMC is required for mapped code to work correctly.

Although adapting an AT-oblivious consistency model for PAMC is straightforward, there are a few challenges when adapting an AT-oblivious consistency model for VAMC:
* synonyms - Multiple virtual addresses may map to the same physical address. Suppose two virtual addresses VA1 and VA2 map to the same physical address PA. SC requires that the value of a load is equal to the value of the most recent store to the same address. A naive definition of VAMC that ignores the level of indirection introduced by AT would treat VA1 and VA2 as different addresses, so a load from VA2 would not be required to return the value of the most recent store to VA1. To overcome this challenge, we re-formulate AT-oblivious consistency models for VAMC by applying the model to synonym sets of virtual addresses rather than individual addresses. For instance, we can define SC for VAMC as follows: there must exist a total order of all loads and stores to virtual addresses that respects program order and in which each load gets the value of the most recent store to any virtual address in the same virtual address synonym set. Incorporating synonyms explicitly in the consistency model allows programmers to reason about the ordering of accesses to virtual addresses.
* mapping and permission changes - Another challenge is that the set of memory operations at the VAMC level is richer than at the PAMC level: in addition to loads and stores, VAMC must account for operations that change address mappings and access permissions.
* load/store side effects - Yet another challenge in specifying VAMC is that loads and stores to virtual addresses have certain side effects. The AT system includes status bits such as accessed and dirty bits for each page table entry. These status bits are part of the architectural state, and the ordering of updates to those bits must thus be specified in VAMC. To achieve this we add two new operations to the specification tables: Ld-sb (load’s impact on status bits) and St-sb (store’s impact on status bits).

=== Causal consistency ===
[[Image:causal.jpg|thumb|right|300px|Causal Consistency example]]

Hutto and Ahamad <ref>P.W. Hutto and M. Ahamad. Slow memory: Weakening consistency to enhance concurrency in distributed shared memories. In Proceedings of the 10th International Conference on Distributed Computing Systems, pages 302–311, May 1990.</ref> introduced causal consistency. Lamport <ref>Leslie Lamport. Time, clocks, and the ordering of events in a distributed system. Communications of the ACM, 21(7):558–565, 1978.</ref> defined the notion of potential causality to capture the flow of information in a distributed system. This notion can be applied to a memory system by interpreting a write as a message-send event and a read as a message-read event. A memory is causally consistent if all processors agree on the order of causally related events. Causally unrelated events (concurrent events) can be observed in different orders.
The example shown here is a legal execution history under CC but not under SC. Note that W(x)1 and W(x)2 are causally related as P2 observed the first write by P1. Furthermore, P3 and P4 observe the accesses W(x)2 and W(x)3 in different orders, which would not be legal in SC.

=== Delta consistency ===

The delta consistency model states that after a fixed time period δ, an update is propagated through the system and all replicas will be consistent. In other words, barring a short bounded interval after a modification, the result of any read operation is consistent with a read on the original copy of an object. If an object is modified, the read will not be consistent during the short period of time following its modification. Once the fixed time period elapses, the modification is propagated and the read is now consistent.

=== Entry consistency ===

This consistency model has been designed to be used with critical sections. Here, the programmer needs to use acquire and release at the start and end of each critical section, just as in both variants of release consistency. However, it also requires every ordinary shared variable to be associated with a synchronization variable such as a lock or a barrier. If the elements of an array need to be accessed independently in parallel, then each element of the array must be associated with its own lock. When an acquire is done on a synchronization variable, only those ordinary shared variables guarded by that synchronization variable are made consistent. Release consistency does not associate shared variables with locks or barriers, and at acquire time it has to determine empirically which variables it needs. This is where entry consistency differs from release consistency.

Formally, a memory exhibits entry consistency if the following conditions are met:

* An acquire access of a synchronization variable is not allowed to perform with respect to a process until all updates to the guarded shared data have been performed with respect to that process.
* Before an exclusive mode access to a synchronization variable by a process is allowed to perform with respect to that process, no other process may hold the synchronization variable, not even in nonexclusive mode.
* After an exclusive mode access to a synchronization variable has been performed, any other process's next nonexclusive mode access to that synchronization variable may not be performed until it has performed with respect to that variable's owner.

=== Eventual consistency ===

Given a sufficiently long period of time over which no changes are sent, all updates can be expected to propagate eventually through the system. Thus, all the replicas will be consistent. As the consistency achieved is eventual, the possibility of conflicts is high. The conflicts have to be resolved. There are three types of resolution:
* Read repair: The correction is done when a read finds an inconsistency. This slows down the read operation.
* Write repair: The correction is done when a write operation finds an inconsistency, slowing down the write operation.
* Asynchronous repair: The correction is not part of a read or write operation.

=== Linearizability ===

This is also known as strict or atomic consistency. An operation is linearizable if it appears to the rest of the system to occur instantaneously. An atomic operation either occurs completely or does not occur at all. In reality, atomic operations do not actually occur instantaneously. Atomicity is enforced by mutual exclusion. At the software level, locks or semaphores are used to achieve this, while at the hardware level, a cache coherency protocol may be used. This makes it appear to the user that the entire operation occurred in a single instruction.

A sequence of invocations and responses made of an object by a set of threads is referred to as a history. Each invocation of a function is followed by a subsequent response.

Example:
Suppose two threads, A and B, attempt to acquire a lock, backing off if it is already taken.
This would be modeled as both threads invoking the lock operation, then both threads receiving a response, one successful, one not:
* A calls lock
* B calls lock
* lock returns fail to A
* lock returns success to B

When every invocation is immediately followed by its response, the history is called a sequential history. A history is linearizable if:

* its invocations and responses can be reordered to yield a sequential history
* that sequential history is correct according to the sequential definition of the object
* if a response preceded an invocation in the original history, it must still precede it in the sequential reordering

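The example above can be reordered into a sequential history in two ways, but only one of them is a linearization:

One way would be:
* A calls lock
* lock returns fail to A
* B calls lock
* lock returns success to B
This reordering is not correct according to the sequential definition of a lock: the lock is free when A calls it, so A's call should have succeeded.

The other way would be:
* B calls lock
* lock returns success to B
* A calls lock
* lock returns fail to A
This reordering is a correct sequential history, so the original history is linearizable.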

Thus, an object is linearizable if all valid histories of its use can be linearized.

=== PRAM consistency ===
[[Image:pram.jpg|thumb|right|300px|PRAM Consistency example]]

It is also known as FIFO consistency. The reasoning that led to this model was as follows: consider a multiprocessor where each processor has a local copy of the shared memory. For the memory to be scalable, the cost of an access should be independent of the time it takes to access the other processors’ memories. Lipton and Sandberg, who introduced the model, proposed that on a read, a PRAM would simply return the value stored in the local copy of the memory. On a write, it would update the local copy first and broadcast the new value to the other processors. Assuming a constant time for initiating a broadcast operation, the goal of making the cost of a read or write constant is thus achieved. In terms of ordering constraints, this is equivalent to requiring that all processors observe the writes from a single processor in the same order, while they may disagree on the order of writes by different processors.
The example shown is legal under PRAM but not under SC or CC. P3 and P4 observe the writes by P1 and P2 in different orders, although W(x)1 and W(x)2 are potentially causally related.

=== Release consistency ===

Release consistency is one of the consistency models used in the domain of concurrent programming (e.g. in distributed shared memory, distributed transactions etc.).

Systems of this kind are characterised by the existence of two special synchronisation operations, release and acquire. Before issuing a write to a memory object, a node must acquire the object via a special operation, and later release it. The code that runs between the acquire and release operations therefore constitutes the critical region. The system is said to provide release consistency if all write operations by a certain node are seen by the other nodes after the former releases the object and before the latter acquire it.

There are two kinds of protocols that implement release consistency:
* ''eager'', where all coherence actions are performed on release operations, and
* ''lazy'', where all coherence actions are delayed until after a subsequent acquire

=== Sequential consistency ===
[[Image:sc.jpg|thumb|right|300px|Sequential Consistency example]]

Sequential consistency was first defined by Lamport in 1979. He defined a memory system to be sequentially consistent if:
the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.

In a sequentially consistent system, all processors must agree on the order of observed effects. The image to the right shows an example for SC.
Note that R(y)2 by processor P3 reads a value that has not been written yet! Of course, this is not possible in any real physical system. However, it shows a surprising flexibility of the SC model. Another reason why this is not a legal history for atomic consistency is that the write operations W(x)1 and W(y)2 appear commuted at processor P3.

Sequential consistency has been the canonical memory consistency model for a long time. However, many multiprocessor machines actually implement a slightly weaker model called processor consistency.
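
Lamport's definition can be explored with a small brute-force sketch. The Python below enumerates every interleaving that preserves each processor's program order and collects the values the loads can return. The two-processor test program, the initial memory value of 0, and the helper names <code>interleavings</code> and <code>sc_outcomes</code> are assumptions made for illustration; they are not taken from the cited work.

```python
def interleavings(programs):
    """Yield every merge of the per-processor lists that keeps program order."""
    if all(len(p) == 0 for p in programs):
        yield []
        return
    for i, prog in enumerate(programs):
        if prog:
            rest = programs[:i] + [prog[1:]] + programs[i + 1:]
            for tail in interleavings(rest):
                yield [prog[0]] + tail


def sc_outcomes(programs):
    """Return the set of load results reachable under sequential consistency."""
    outcomes = set()
    for order in interleavings(programs):
        memory, loads = {}, {}
        for kind, var, arg in order:
            if kind == "W":
                memory[var] = arg              # arg is the value written
            else:
                loads[arg] = memory.get(var, 0)  # arg names the destination register
        outcomes.add(tuple(sorted(loads.items())))
    return outcomes


# Hypothetical test: each processor writes one variable, then reads the other.
p0 = [("W", "x", 1), ("R", "y", "r0")]
p1 = [("W", "y", 1), ("R", "x", "r1")]
results = sc_outcomes([p0, p1])
```

In this sketch no interleaving lets both loads return 0, which is the outcome that weaker models with write buffering may allow.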

=== Weak consistency ===

A memory system is weakly consistent if it enforces the following restrictions:
* accesses to synchronization variables are sequentially consistent and
* no access to a synchronization variable is issued in a processor before all previous data accesses have been performed and
* no access is issued by a processor before a previous access to a synchronization variable has been performed
Notice that the meaning of “previous” is well-defined because it refers to program order. That is, an access A precedes access B if and only if the processor that executed access B has previously executed access A. Synchronizing accesses work as fences. At the time a synchronizing access performs, all previous accesses by that processor are guaranteed to have performed. The synchronization model corresponding to these access order constraints is relatively simple. A program executing on a weakly consistent system appears sequentially consistent if the following two constraints are observed:
* There are no data races and
* Synchronization is visible to the memory system.
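
The fence behaviour of a synchronizing access can be sketched with a toy model. In the Python below, each processor buffers its data writes and a synchronization access drains the buffer before it completes. The class name <code>WeakProcessor</code>, the dictionary used as shared memory, and the initial value of 0 are hypothetical choices for illustration only.

```python
class WeakProcessor:
    """Toy processor whose data writes become visible at synchronization accesses."""

    def __init__(self, memory):
        self.memory = memory   # shared dict: address -> value
        self.buffer = []       # pending (address, value) data writes

    def write(self, addr, value):
        self.buffer.append((addr, value))   # buffered, not yet visible to others

    def read(self, addr):
        for a, v in reversed(self.buffer):  # forward this processor's newest write
            if a == addr:
                return v
        return self.memory.get(addr, 0)

    def sync(self):
        # Acts as a fence: all previous data accesses perform before it completes.
        for a, v in self.buffer:
            self.memory[a] = v
        self.buffer.clear()


shared = {}
p1, p2 = WeakProcessor(shared), WeakProcessor(shared)
p1.write("x", 1)
before = p2.read("x")   # p1's write is still buffered
p1.sync()
after = p2.read("x")    # visible once the synchronizing access has performed
```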

=== Strong consistency ===

Strong consistency is one of the consistency models used in the domain of concurrent programming (e.g. in distributed shared memory, distributed transactions etc.). Strong consistency is supported if:
* All accesses are seen by all parallel processes (or nodes, processors etc.) in the same order (sequentially)
* Only one consistent state can be observed. In contrast, under weak consistency different parallel processes (or nodes etc.) may perceive variables in different states.

== Practical performance impact ==

[[Image:consistency_models.jpg|thumb|center|700px|Implementation of consistency models]]

Here, we define performance as the processor utilization achieved in an execution. Processor utilization is used as the distinguishing factor because it provides reasonable results even when the program’s control path is not deterministic and depends on the relative timing of synchronization accesses. Let us make a comparative analysis of the performance achieved by the various consistency models on the LFC architecture (an aggressive implementation with lockup-free caches).

A BASE model has been added to the four consistency models, viz. Sequential Consistency (SC), Processor Consistency (PC), Release Consistency (RC) and Weak Consistency (WC). BASE is the most constrained model and is used as the baseline for all performance comparisons. It incorporates no buffering or pipelining and waits for each read and write to complete before proceeding.

=== Performance of SC versus BASE ===

The SC model does not perform significantly better than BASE. The performance gain from BASE to SC is small for most applications. This is because reads are expected to be closely interleaved with writes. Significant write clustering may occur sometimes, for example, when initializing data structures, but such occurrences are expected to be infrequent.

=== Performance of PC versus SC and BASE ===

In PC, sequential consistency is abandoned. The main benefit of this extra complexity comes from the fact that reads do not have to stall for pending writes to perform. However, some of the benefits may be lost if the write buffer gets full and stalls the processor. The PC model is relatively successful in hiding almost all of the latency of writes given a reasonably deep write buffer. Since the comparison of PC and WC is more involved than the comparison with RC, we next examine PC versus RC. Subsequently, we compare PC and WC.
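
The effect of the write buffer can be illustrated with a toy cycle count. The Python below is only a sketch under strong assumptions: every access has the same fixed latency, writes retire one at a time from the buffer, and placing a write in the buffer costs one cycle. The function names, the trace format, and all numbers are hypothetical and do not reproduce the measurements discussed here.

```python
def base_cycles(trace, latency):
    """BASE: every read and write stalls the processor for the full latency."""
    return len(trace) * latency


def pc_cycles(trace, latency, depth):
    """PC-like model: writes retire one at a time from a buffer of `depth` entries."""
    now = 0
    pending = []  # completion times of buffered writes, oldest first
    for op in trace:
        pending = [t for t in pending if t > now]  # drop writes that have completed
        if op == "R":
            now += latency  # the read stalls the processor, but not for pending writes
            continue
        if len(pending) >= depth:
            now = pending.pop(0)  # buffer full: stall until the oldest write completes
        start = pending[-1] if pending else now  # writes are not pipelined
        pending.append(start + latency)
        now += 1  # assumed one cycle to place the write in the buffer
    return max([now] + pending)


interleaved = "WRWR"   # reads closely interleaved with writes
base = base_cycles(interleaved, latency=10)
pc = pc_cycles(interleaved, latency=10, depth=4)
shallow = pc_cycles("WWR", latency=10, depth=1)   # a one-entry buffer fills and stalls
```

With these assumed parameters the interleaved trace takes 40 cycles under BASE and 22 under the PC-like model, because each write overlaps with the following read.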

=== Performance of PC versus RC ===

In addition to providing all the benefits of the PC model, RC allows pipelining of writes by exploiting information about the synchronization accesses. That is, writes can be retired from the write buffer before ownership has been obtained. The fact that writes are retired at a faster rate has two implications:
* the write buffer becoming full is less of a problem, and
* in cases where a release operation is behind several writes in the write buffer, that release can be observed sooner by a processor waiting to do an acquire.
However, the write buffer getting full is rarely a problem even for PC. Therefore, most gains, if observed, will be due to faster completion of synchronization.

=== Performance of WC versus PC and RC ===

The differences between WC and RC arise because WC does not differentiate between acquire and release synchronization operations. Consequently, any synchronization operation must conservatively satisfy the constraints of both release and acquire. Thus, compared to RC, WC stalls the processor at an acquire until pending writes and releases complete. In addition, the processor is stalled for pending releases if it attempts to do a read operation.

On comparing WC and PC, we observe the surprising result that PC sometimes performs better than WC. WC has the advantage that writes can be retired at a faster rate from the write buffer. The disadvantage of WC relative to PC is the same as its disadvantage relative to RC: WC stalls the processor at some points until pending writes and releases have performed.

== Conclusion ==

In this article, the popular memory consistency models have been discussed in terms of how they are able to support multiprocessors today. A performance-based comparison of the different consistency models has also been presented, together with the relative weaknesses and strengths of the models. From what has been laid out earlier, there is no one best model, only the model that best fits the application, architecture, and programmer.

The choice is between a very strict consistency model such as SC and other relaxed consistency models. The sequential memory model is the easiest model to use for parallel computing. For a novice programmer who does not fully understand the problems inherent with memory consistency and cannot implement explicit mechanisms for synchronization, a sequentially consistent system may be the only possible model that will result in program correctness. There will be a significant penalty in performance, however, which could negate even the advantage of using a parallel system in the first place. Across a wide range of applications, the relaxed consistency models would average the best performance, but at greater programmer effort. A programmer with extensive knowledge of the architecture and experience in parallel programming will be able to write a high-performing application while ensuring correctness with synchronization mechanisms like barriers, locks, and fence instructions.

This article should help a programmer to make this choice and help select an appropriate model for his/her multiprocessor design.

==References==
<references/>

CSC/ECE 506 Spring 2012/10b sr

2012-04-05T06:59:29Z

Sshanbh: /* Consistency in current-day multiprocessors */

'''Use of consistency models in current multiprocessors'''

The memory consistency model of a shared memory system determines the order in which memory operations will appear to execute to the programmer. This article describes how consistency is used in multiprocessors today and later digs into the details of popular consistency models in use today. The impact of these models on the multiprocessor performance is also discussed.

== Introduction ==

Many modern computer systems and most multicore chips support [http://en.wikipedia.org/wiki/Shared_memory shared memory] in hardware. In a shared memory system, each of the processor cores may read and write to a single shared address space. These designs seek various goodness properties, such as high performance, low power, and low cost. Of course, it is not valuable to provide these goodness properties without first providing correctness. Correct shared memory seems intuitive at a hand-wave level, but there are subtle issues in even defining what it means for a shared memory system to be correct, as well as many subtle corner cases in designing a correct shared memory implementation. Moreover, these subtleties must be mastered in hardware implementations where bug fixes are expensive.

It is the job of consistency to define shared memory correctness. Consistency definitions provide rules about loads and stores (or memory reads and writes) and how they act upon memory. Ideally, consistency definitions would be simple and easy to understand. However, defining what it means for shared memory to behave correctly is more subtle than defining the correct behavior of, for example, a single-threaded processor core. The correctness criterion for a single processor core partitions behavior between one correct result and many incorrect alternatives. This is because the processor’s architecture mandates that the execution of a thread transforms a given input state into a single well-defined output state, even on an out-of-order core. Shared memory consistency models, however, concern the loads and stores of multiple threads and usually allow many correct executions while disallowing many incorrect ones. The possibility of multiple correct executions is due to the ISA allowing multiple threads to execute concurrently, often with many possible legal interleavings of instructions from different threads. The multitude of correct executions complicates the erstwhile simple challenge of determining whether an execution is correct. Nevertheless, consistency must be mastered to implement shared memory and, in some cases, to write correct programs that use it.

== Consistency in current-day multiprocessors ==

Today's scalable multiprocessors are mostly built with a distributed shared memory architecture. The memory is physically distributed but logically shared. In other words, the address spaces generated by all processing nodes globally form a single address space. The main advantage of such a system lies in the scalability of distributed hardware combined with the programmability of a shared virtual memory. Representative systems include the Cray T3D, the Stanford Dash, the MIT Alewife, the Teracomputer, and Convex SPP.

A scalable system should be able to hide the long latencies of remote memory accesses. Several latency hiding techniques have been proposed: Coherent caches reduce the frequency of remote memory accesses by caching data close to the processor. Relaxed memory consistency allows reordering of memory events and buffering or pipelining of remote memory accesses. Data prefetching attempts to hide long read latency by issuing read requests well ahead of time, with the expectation that the data will be available in the cache when it is referenced. Multithreading attempts to hide the long latency by context switching between several active threads, thus allowing the processor to perform useful work while waiting for remote requests or synchronization faults to complete.

== Consistency models used ==

=== Address translation aware memory consistency ===

These memory consistency models define the behavior of operations (loads, stores, memory barriers, etc.) on physical addresses and virtual addresses. The two important levels of memory consistency that can be classified as address translation aware are described below:

==== Physical address memory consistency (PAMC)====
It is necessary to have correct PAMC for unmapped code to work correctly. Unmapped software, including the boot code and the part of the system software that manages address translation (AT), relies upon PAMC. It is the responsibility of the hardware to implement PAMC, and this is specified precisely in the architectural manual.
It is not too difficult to adapt an AT-oblivious consistency model as the specification of PAMC.

Example: The PAMC model could be SC. In such a case the interface would specify that:
* (i) there must exist a total order of all loads and stores to physical addresses that respects the program order of each thread, and
* (ii) the value of each load is equal to the value of the most recent store to that physical address in the total order.
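
Conditions (i) and (ii) can be checked mechanically for a proposed total order. The Python below is a minimal sketch: the function name <code>is_sc_total_order</code>, the tuple encoding of operations, the example addresses, and the assumption that memory initially holds 0 are all hypothetical.

```python
def is_sc_total_order(order, programs, initial=0):
    """Check conditions (i) and (ii) for a proposed total order.

    order:    list of (thread, index) pairs naming operations in `programs`
    programs: dict thread -> list of ("ld" | "st", physical_address, value)
    """
    expected = [(t, i) for t, ops in programs.items() for i in range(len(ops))]
    if sorted(order) != sorted(expected):
        return False                      # every operation must appear exactly once
    next_index = {t: 0 for t in programs}
    memory = {}
    for t, i in order:
        if i != next_index[t]:
            return False                  # (i) program order of each thread
        next_index[t] += 1
        kind, addr, value = programs[t][i]
        if kind == "st":
            memory[addr] = value
        elif memory.get(addr, initial) != value:
            return False                  # (ii) load sees the most recent store
    return True


programs = {
    "T0": [("st", 0x100, 1), ("ld", 0x200, 0)],
    "T1": [("st", 0x200, 1), ("ld", 0x100, 1)],
}
good = [("T0", 0), ("T0", 1), ("T1", 0), ("T1", 1)]
bad_program_order = [("T0", 1), ("T0", 0), ("T1", 0), ("T1", 1)]
bad_value = [("T1", 0), ("T0", 0), ("T0", 1), ("T1", 1)]
```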

==== Virtual address memory consistency (VAMC)====
Correct VAMC is required for mapped code to work correctly.

Although adapting an AT-oblivious consistency model for PAMC is straightforward, there are a few challenges when adapting an AT-oblivious consistency model for VAMC:
* synonyms - Multiple virtual addresses may map to the same physical address. Suppose two virtual addresses VA1 and VA2 map to the same physical address PA. SC requires that the value of a load is equal to the value of the most recent store to the same address. A naive definition of VAMC that does not consider the level of indirection introduced by AT would treat VA1 and VA2 as different addresses, so a load from VA2 would not be required to return the value of the most recent store to VA1. To overcome this challenge, we re-formulate AT-oblivious consistency models for VAMC by applying the model to synonym sets of virtual addresses rather than individual addresses. For instance, we can define SC for VAMC as follows: there must exist a total order of all loads and stores to virtual addresses that respects program order and in which each load gets the value of the most recent store to any virtual address in the same virtual address synonym set. Incorporating synonyms explicitly in the consistency model allows programmers to reason about the ordering of accesses to virtual addresses.
* mapping and permission changes - Another challenge is that the set of memory operations at the VAMC level is richer than at the PAMC level.
* load/store side effects - Yet another challenge in specifying VAMC is that loads and stores to virtual addresses have certain side effects. The AT system includes status bits such as accessed and dirty bits for each page table entry. These status bits are part of the architectural state, and the ordering of updates to those bits must thus be specified in VAMC. To achieve this we add two new operations to the specification tables: Ld-sb (load’s impact on status bits) and St-sb (store’s impact on status bits).
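
The synonym-set formulation can be made concrete with a toy model. In the Python sketch below, a fixed page table maps virtual addresses to physical addresses, and a load returns the most recent store to any virtual address in the same synonym set. The class name <code>SynonymMemory</code>, the string addresses, and the initial value of 0 are assumptions for illustration; mapping changes and status bits are not modelled.

```python
class SynonymMemory:
    """Toy model: a load sees the most recent store to any synonym of its address."""

    def __init__(self, page_table):
        self.page_table = page_table   # virtual address -> physical address
        self.contents = {}             # physical address -> value

    def store(self, va, value):
        self.contents[self.page_table[va]] = value

    def load(self, va):
        return self.contents.get(self.page_table[va], 0)

    def synonym_set(self, va):
        """All virtual addresses that map to the same physical address as `va`."""
        pa = self.page_table[va]
        return {v for v, p in self.page_table.items() if p == pa}


mem = SynonymMemory({"VA1": "PA", "VA2": "PA", "VA3": "PB"})
mem.store("VA1", 7)
via_synonym = mem.load("VA2")    # VA2 is a synonym of VA1
unrelated = mem.load("VA3")      # VA3 maps to a different physical address
```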

=== Causal consistency ===
[[Image:causal.jpg|thumb|right|300px|Causal Consistency example]]

Hutto and Ahamad <ref>P.W. Hutto and M. Ahamad. Slow memory: Weakening consistency to enhance concurrency in distributed shared memories. In Proceedings of the 10th International Conference on Distributed Computing Systems, pages 302–311, May 1990.</ref> introduced causal consistency. Lamport <ref>Leslie Lamport. Time, clocks, and the ordering of events in a distributed system. Communications of the ACM, 21(7):558–565, 1978.</ref> defined the notion of potential causality to capture the flow of information in a distributed system. This notion can be applied to a memory system by interpreting a write as a message-send event and a read as a message-read event. A memory is causally consistent if all processors agree on the order of causally related events. Causally unrelated events (concurrent events) can be observed in different orders.
The example shown here is a legal execution history under CC but not under SC. Note that W(x)1 and W(x)2 are causally related as P2 observed the first write by P1. Furthermore, P3 and P4 observe the accesses W(x)2 and W(x)3 in different orders, which would not be legal in SC.

=== Delta consistency ===

The delta consistency model states that after a fixed time period δ, an update is propagated through the system and all replicas will be consistent. In other words, barring a short bounded interval after a modification, the result of any read operation is consistent with a read on the original copy of an object. If an object is modified, the read will not be consistent during the short period of time following its modification. Once the fixed time period elapses, the modification is propagated and the read is now consistent.
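
A minimal sketch of this behaviour, assuming a simulated clock passed in by the caller and the worst case in which every update takes exactly δ time units to reach a replica. The class name <code>DeltaReplica</code> and its methods are hypothetical.

```python
class DeltaReplica:
    """Replica that applies an update once `delta` time units have elapsed."""

    def __init__(self, delta):
        self.delta = delta
        self.value = None
        self.pending = []   # (visible_at, value), in arrival order

    def receive(self, value, now):
        self.pending.append((now + self.delta, value))

    def read(self, now):
        # Apply every update whose propagation interval has elapsed.
        while self.pending and self.pending[0][0] <= now:
            self.value = self.pending.pop(0)[1]
        return self.value


replica = DeltaReplica(delta=5)
replica.receive("v1", now=0)
early = replica.read(now=3)   # inside the interval: the read may still be stale
late = replica.read(now=5)    # after delta: the read reflects the update
```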

=== Entry consistency ===

This consistency model has been designed to be used with critical sections. Here, the programmer needs to use acquire and release at the start and end of each critical section, just as in both variants of release consistency. However, it also requires every ordinary shared variable to be associated with a synchronization variable such as a lock or a barrier. If the elements of an array need to be accessed independently in parallel, then each element of the array must be associated with a lock. When an acquire is done on a synchronization variable, only those ordinary shared variables guarded by that synchronization variable are made consistent. Release consistency does not associate shared variables with locks or barriers and at acquire time has to determine empirically which variables it needs. This is where entry consistency differs from release consistency.

Formally, a memory exhibits entry consistency if the following conditions are met:

* An acquire access of a synchronization variable is not allowed to perform with respect to a process until all updates to the guarded shared data have been performed with respect to that process.
* Before an exclusive mode access to a synchronization variable by a process is allowed to perform with respect to that process, no other process may hold the synchronization variable, not even in nonexclusive mode.
* After an exclusive mode access to a synchronization variable has been performed, any other process's next nonexclusive mode access to that synchronization variable may not be performed until it has performed with respect to that variable's owner.

=== Eventual consistency ===

Given a sufficiently long period of time over which no changes are sent, all updates can be expected to propagate eventually through the system. Thus, all the replicas will be consistent. As the consistency achieved is eventual, the possibility of conflicts is high. The conflicts have to be resolved. There are three types of resolution:
* Read repair: The correction is done when a read finds an inconsistency. This slows down the read operation.
* Write repair: The correction is done when a write operation finds an inconsistency, slowing down the write operation.
* Asynchronous repair: The correction is not part of a read or write operation.
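
Read repair can be sketched as follows. The Python below assumes that each replica stores a version number with every value and that the highest version wins; this last-writer-wins rule, the function name <code>read_with_repair</code>, and the dictionary-based replicas are assumptions for illustration, not a description of any particular system.

```python
def read_with_repair(replicas, key):
    """Read `key` from all replicas, return the newest value, and repair stale copies.

    Each replica is a dict mapping key -> (version, value); the highest version wins.
    """
    newest = max((r[key] for r in replicas if key in r),
                 key=lambda entry: entry[0], default=None)
    if newest is None:
        return None                      # no replica holds the key
    for r in replicas:
        if r.get(key) != newest:
            r[key] = newest              # read repair: extra writes slow this read
    return newest[1]


replicas = [{"k": (2, "new")}, {"k": (1, "old")}, {}]
value = read_with_repair(replicas, "k")
```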

=== Linearizability ===

This is also known as strict or atomic consistency. An operation is linearizable if it appears to the rest of the system to occur instantaneously. An atomic operation either occurs completely or does not occur at all. In reality, atomic operations do not actually occur instantaneously. Atomicity is enforced by mutual exclusion. At the software level, locks or semaphores are used to achieve this, while at the hardware level, a cache coherency protocol may be used. This makes it appear to the user that the entire operation occurred in a single instruction.

A sequence of invocations and responses made of an object by a set of threads is referred to as a history. Each invocation of a function is followed by a matching response.

Example: Suppose two threads, A and B, attempt to acquire a lock, backing off if it is already taken. This would be modeled as both threads invoking the lock operation, then both threads receiving a response, one successful, one not:
* A calls lock
* B calls lock
* lock returns fail to A
* lock returns success to B

When all calls receive immediate responses, the history is called a sequential history. A history is linearizable if:

* its invocations and responses can be reordered to yield a sequential history
* that sequential history is correct according to the sequential definition of the object
* if a response preceded an invocation in the original history, it must still precede it in the sequential reordering

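The invocations and responses of the above example can be reordered into a sequential history in two ways.

One way would be:
* A calls lock
* lock returns fail to A
* B calls lock
* lock returns success to B

The other way would be:
* B calls lock
* lock returns success to B
* A calls lock
* lock returns fail to A

Assuming the lock was free initially, the first reordering is not correct according to the sequential definition of a lock, because A's attempt on a free lock should have succeeded. The second reordering is correct, so it is a linearization of the original history.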

Thus, an object is linearizable if all valid histories of its use can be linearized.
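
The three conditions can be checked by brute force for small histories. The Python sketch below assumes that each thread performs exactly one lock operation, that the lock is free initially, and that an attempt on a held lock fails; the function name <code>linearizations</code> and the event encoding are hypothetical.

```python
from itertools import permutations


def linearizations(history):
    """Yield the sequential orders of a try-lock history that are valid linearizations.

    history: list of ("call" | "ret", thread, result) events in real-time order,
    with result set to None for calls.
    """
    call_at = {t: i for i, (kind, t, _) in enumerate(history) if kind == "call"}
    ret_at = {t: i for i, (kind, t, _) in enumerate(history) if kind == "ret"}
    result = {t: r for kind, t, r in history if kind == "ret"}
    for order in permutations(call_at):
        # If b's response preceded a's invocation, b must stay before a.
        if any(ret_at[b] < call_at[a]
               for i, a in enumerate(order) for b in order[i + 1:]):
            continue
        # Sequential definition: only the first attempt on the free lock succeeds.
        held, ok = False, True
        for t in order:
            ok = ok and result[t] == ("fail" if held else "success")
            held = True
        if ok:
            yield order


history = [
    ("call", "A", None),
    ("call", "B", None),
    ("ret", "A", "fail"),
    ("ret", "B", "success"),
]
valid = list(linearizations(history))
```

Under these assumptions the checker accepts only the order in which B's operation comes first, since an attempt by A on a free lock would have had to succeed.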

=== PRAM consistency ===
[[Image:pram.jpg|thumb|right|300px|PRAM Consistency example]]

It is also known as FIFO consistency. The reasoning that led to this model was as follows: Consider a multiprocessor where each processor has a local copy of the shared memory. For the memory to be scalable, the time for an access should be independent of the time it takes to access the other processors’ memories. Lipton and Sandberg proposed that on a read, a PRAM would simply return the value stored in the local copy of the memory. On a write, it would update the local copy first and broadcast the new value to the other processors. Assuming a constant time for initiating a broadcast operation, the goal of making the cost for a read or write constant is thus achieved. In terms of ordering constraints, this is equivalent to requiring that all processors observe the writes from a single processor in the same order, while they may disagree on the order of writes by different processors.
The example shown is legal under PRAM but not under SC or CC. P3 and P4 observe the writes by P1 and P2 in different orders, although W(x)1 and W(x)2 are potentially causally related.
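
The ordering constraint can be expressed as a small check. The Python below is a sketch: the function name <code>is_pram_consistent</code>, the string labels for writes, and the sample histories are hypothetical and are not the ones in the figure.

```python
def is_pram_consistent(issued, observed):
    """Check that every observer sees each writer's writes in program order.

    issued:   writer -> list of writes in program order
    observed: observer -> list of (writer, write) pairs in the order seen
    """
    for seen in observed.values():
        for writer, writes in issued.items():
            from_writer = [w for src, w in seen if src == writer]
            if from_writer != writes[:len(from_writer)]:
                return False
    return True


issued = {"P1": ["x=1", "x=2"], "P2": ["y=1"]}
# P3 and P4 disagree on how P1's and P2's writes interleave, which PRAM permits.
legal = {
    "P3": [("P1", "x=1"), ("P2", "y=1"), ("P1", "x=2")],
    "P4": [("P2", "y=1"), ("P1", "x=1"), ("P1", "x=2")],
}
# Here P4 sees P1's two writes in the wrong order, which PRAM forbids.
illegal = {
    "P3": [("P1", "x=1"), ("P1", "x=2")],
    "P4": [("P1", "x=2"), ("P1", "x=1")],
}
```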

=== Release consistency ===

Release consistency is one of the consistency models used in the domain of concurrent programming (e.g. in distributed shared memory, distributed transactions etc.).

Systems of this kind are characterised by two special synchronisation operations, acquire and release. Before issuing a write to a memory object, a node must acquire the object via a special operation, and it must later release it. The code that runs between the acquire and the release therefore constitutes the critical region. The system is said to provide release consistency if all write operations performed by a node before it releases the object are seen by any other node that subsequently acquires the object.

There are two kinds of protocols that implement release consistency:
* ''eager'', where all coherence actions are performed on release operations, and
* ''lazy'', where all coherence actions are delayed until after a subsequent acquire
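
The difference between the two protocols can be sketched with a toy lock. In the Python below, each node is a plain dictionary acting as its local copy of memory; an eager lock pushes the released writes to every node, while a lazy lock updates only the node that next acquires. The class name <code>ReleaseConsistentLock</code> and its interface are hypothetical, and mutual exclusion itself is not modelled.

```python
class ReleaseConsistentLock:
    """Toy lock that propagates writes at release (eager) or at the next acquire (lazy)."""

    def __init__(self, nodes, eager):
        self.nodes = nodes    # list of per-node dicts acting as local memories
        self.eager = eager
        self.pending = {}     # released writes not yet propagated (lazy only)

    def acquire(self, node):
        if not self.eager:
            node.update(self.pending)    # lazy: only the acquirer is brought up to date

    def release(self, node, writes):
        node.update(writes)
        if self.eager:
            for other in self.nodes:     # eager: push to every node at release
                other.update(writes)
        else:
            self.pending.update(writes)


n1, n2, n3 = {}, {}, {}
eager_lock = ReleaseConsistentLock([n1, n2, n3], eager=True)
eager_lock.acquire(n1)
eager_lock.release(n1, {"x": 1})         # n2 and n3 are updated immediately

m1, m2, m3 = {}, {}, {}
lazy_lock = ReleaseConsistentLock([m1, m2, m3], eager=False)
lazy_lock.acquire(m1)
lazy_lock.release(m1, {"x": 1})
m2_before = dict(m2)                     # nothing has been propagated yet
lazy_lock.acquire(m2)                    # m2 receives the writes only now
```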

=== Sequential consistency ===
[[Image:sc.jpg|thumb|right|300px|Sequential Consistency example]]

Sequential consistency was first defined by Lamport in 1979. He defined a memory system to be sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.

In a sequentially consistent system, all processors must agree on the order of observed effects. The image to the right shows an example for SC.
Note that R(y)2 by processor P3 reads a value that has not been written yet! Of course, this is not possible in any real physical system. However, it shows a surprising flexibility of the SC model. Another reason why this is not a legal history for atomic consistency is that the write operations W(x)1 and W(y)2 appear commuted at processor P3.

Sequential consistency has been the canonical memory consistency model for a long time. However, many multiprocessor machines actually implement a slightly weaker model called processor consistency.

=== Weak consistency ===

A memory system is weakly consistent if it enforces the following restrictions:
* accesses to synchronization variables are sequentially consistent and
* no access to a synchronization variable is issued in a processor before all previous data accesses have been performed and
* no access is issued by a processor before a previous access to a synchronization variable has been performed
Notice that the meaning of “previous” is well-defined because it refers to program order. That is, an access A precedes access B if and only if the processor that executed access B has previously executed access A. Synchronizing accesses work as fences. At the time a synchronizing access performs, all previous accesses by that processor are guaranteed to have performed. The synchronization model corresponding to these access order constraints is relatively simple. A program executing on a weakly consistent system appears sequentially consistent if the following two constraints are observed:
* There are no data races and
* Synchronization is visible to the memory system.

=== Strong consistency ===

Strong consistency is one of the consistency models used in the domain of concurrent programming (e.g. in distributed shared memory, distributed transactions etc.). Strong consistency is supported if:
* All accesses are seen by all parallel processes (or nodes, processors etc.) in the same order (sequentially)
* Only one consistent state can be observed. In contrast, under weak consistency different parallel processes (or nodes etc.) may perceive variables in different states.

== Practical performance impact ==

[[Image:consistency_models.jpg|thumb|center|700px|Implementation of consistency models]]

Here, we define performance as the processor utilization achieved in an execution. Processor utilization is used as the distinguishing factor because it provides reasonable results even when the program’s control path is not deterministic and depends on the relative timing of synchronization accesses. Let us make a comparative analysis of the performance achieved by the various consistency models on the LFC architecture (an aggressive implementation with lockup-free caches).

A BASE model has been added to the four consistency models, viz. Sequential Consistency (SC), Processor Consistency (PC), Release Consistency (RC) and Weak Consistency (WC). BASE is the most constrained model and is used as the baseline for all performance comparisons. It incorporates no buffering or pipelining and waits for each read and write to complete before proceeding.

=== Performance of SC versus BASE ===

The SC model does not perform significantly better than BASE. The performance gain from BASE to SC is small for most applications. This is because reads are expected to be closely interleaved with writes. Significant write clustering may occur sometimes, for example, when initializing data structures, but such occurrences are expected to be infrequent.

=== Performance of PC versus SC and BASE ===

In PC, sequential consistency is abandoned. The main benefit of this extra complexity comes from the fact that reads do not have to stall for pending writes to perform. However, some of the benefits may be lost if the write buffer gets full and stalls the processor. The PC model is relatively successful in hiding almost all of the latency of writes given a reasonably deep write buffer. Since the comparison of PC and WC is more involved than the comparison with RC, we next examine PC versus RC. Subsequently, we compare PC and WC.

=== Performance of PC versus RC ===

In addition to providing all the benefits of the PC model, RC allows pipelining of writes by exploiting information about the synchronization accesses. That is, writes can be retired from the write buffer before ownership has been obtained. The fact that writes are retired at a faster rate has two implications:
* the write buffer becoming full is less of a problem, and
* in cases where a release operation is behind several writes in the write buffer, that release can be observed sooner by a processor waiting to do an acquire.
However, the write buffer getting full is rarely a problem even for PC. Therefore, most gains, if observed, will be due to faster completion of synchronization.

=== Performance of WC versus PC and RC ===

The differences between WC and RC arise because WC does not differentiate between acquire and release synchronization operations. Consequently, any synchronization operation must conservatively satisfy the constraints of both release and acquire. Thus, compared to RC, WC stalls the processor at an acquire until pending writes and releases complete. In addition, the processor is stalled for pending releases if it attempts to do a read operation.

On comparing WC and PC, we observe the surprising result that PC sometimes performs better than WC. WC has the advantage that writes can be retired at a faster rate from the write buffer. The disadvantage of WC relative to PC is the same as its disadvantage relative to RC: WC stalls the processor at some points until pending writes and releases have performed.

== Conclusion ==

In this article, the popular memory consistency models have been discussed in terms of how they are able to support multiprocessors today. A performance-based comparison of the different consistency models has also been presented, together with the relative weaknesses and strengths of the models. From what has been laid out earlier, there is no one best model, only the model that best fits the application, architecture, and programmer.

The choice is between a very strict consistency model such as SC and other relaxed consistency models. The sequential memory model is the easiest model to use for parallel computing. For a novice programmer who does not fully understand the problems inherent with memory consistency and cannot implement explicit mechanisms for synchronization, a sequentially consistent system may be the only possible model that will result in program correctness. There will be a significant penalty in performance, however, which could negate even the advantage of using a parallel system in the first place. Across a wide range of applications, the relaxed consistency models would average the best performance, but at greater programmer effort. A programmer with extensive knowledge of the architecture and experience in parallel programming will be able to write a high-performing application while ensuring correctness with synchronization mechanisms like barriers, locks, and fence instructions.

This article should help a programmer to make this choice and help select an appropriate model for his/her multiprocessor design.

==References==
<references/>

CSC/ECE 506 Spring 2012/10b sr

2012-04-05T01:27:08Z

Sshanbh: /* Introduction */

'''Use of consistency models in current multiprocessors'''

The memory consistency model of a shared memory system determines the order in which memory operations will appear to execute to the programmer. This article describes how consistency is used in multiprocessors today and later digs into the details of popular consistency models in use today. The impact of these models on the multiprocessor performance is also discussed.

== Introduction ==

Many modern computer systems and most multicore chips support [http://en.wikipedia.org/wiki/Shared_memory shared memory] in hardware. In a shared memory system, each of the processor cores may read and write to a single shared address space. These designs seek various goodness properties, such as high performance, low power, and low cost. Of course, it is not valuable to provide these goodness properties without first providing correctness. Correct shared memory seems intuitive at a hand-wave level, but there are subtle issues in even defining what it means for a shared memory system to be correct, as well as many subtle corner cases in designing a correct shared memory implementation. Moreover, these subtleties must be mastered in hardware implementations where bug fixes are expensive.

It is the job of consistency to define shared memory correctness. Consistency definitions provide rules about loads and stores (or memory reads and writes) and how they act upon memory. Ideally, consistency definitions would be simple and easy to understand. However, defining what it means for shared memory to behave correctly is more subtle than defining the correct behavior of, for example, a single-threaded processor core. The correctness criterion for a single processor core partitions behavior between one correct result and many incorrect alternatives. This is because the processor’s architecture mandates that the execution of a thread transforms a given input state into a single well-defined output state, even on an out-of-order core. Shared memory consistency models, however, concern the loads and stores of multiple threads and usually allow many correct executions while disallowing many incorrect ones. The possibility of multiple correct executions is due to the ISA allowing multiple threads to execute concurrently, often with many possible legal interleavings of instructions from different threads. The multitude of correct executions complicates the erstwhile simple challenge of determining whether an execution is correct. Nevertheless, consistency must be mastered to implement shared memory and, in some cases, to write correct programs that use it.

== Consistency in current-day multiprocessors ==

Today's scalable multiprocessors are mostly built with a distributed shared memory architecture. The memory is physically distributed but logically shared. In other words, the address spaces generated by all processing nodes globally form a single address space. The main advantage of such a system lies in combining the scalability of distributed hardware with the programmability of a shared virtual memory. Representative systems include the Cray T3D, the Stanford DASH, the MIT Alewife, the Tera computer, and the Convex SPP.

A scalable system should be able to hide the long latencies of remote memory accesses. Several latency hiding techniques have been proposed. Coherent caches reduce the frequency of remote memory accesses by caching data close to the processor. Relaxed memory consistency allows reordering of memory events and buffering or pipelining of remote memory accesses. Data prefetching attempts to hide long read latency by issuing read requests well ahead of time, with the expectation that the data will be available in the cache when it is referenced. Multithreading attempts to hide the long latency by context switching between several active threads, thus allowing the processor to perform useful work while waiting for remote requests or synchronization faults to complete.

== Consistency models used ==

=== Address translation aware memory consistency ===

These memory consistency models define the behavior of operations (loads, stores, memory barriers, etc.) on physical addresses and virtual addresses, taking address translation (AT) into account. The two important levels of memory consistency that can be classified as address translation aware are described below:

==== Physical address memory consistency (PAMC)====
It is necessary to have correct PAMC for unmapped code to work correctly. Unmapped software, including the boot code and part of the system software that manages AT, relies upon PAMC. It is the responsibility of the hardware to implement PAMC and this is specified precisely in the architectural manual.
It is not too difficult to adapt an AT-oblivious consistency model as the specification of PAMC.

Example:
The PAMC model could be SC. In such a case the interface would specify that
(i) there must exist a total order of all loads and stores to physical addresses that respects the program order of each thread and
(ii) the value of each load is equal to the value of the most recent store to that physical address in the total order.
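
The two conditions above can be checked mechanically for a small execution. The following is a minimal sketch, not part of any architectural specification: the function name <code>is_sc_order</code>, the tuple encoding of operations, and the address names <code>PA_x</code> and <code>PA_y</code> are all assumptions made for illustration.

```python
def is_sc_order(order, programs, initial=0):
    """Check a candidate total order against the two SC conditions.

    order:    list of (thread, index) pairs, the proposed total order.
    programs: dict thread -> list of operations in program order, where an
              operation is ("st", address, value) or ("ld", address, value_read).
    """
    # Every operation must appear exactly once in the total order.
    expected = {(t, i) for t, ops in programs.items() for i in range(len(ops))}
    if len(order) != len(expected) or set(order) != expected:
        return False
    next_index = {t: 0 for t in programs}
    memory = {}
    for thread, index in order:
        # (i) the total order respects each thread's program order
        if index != next_index[thread]:
            return False
        next_index[thread] += 1
        kind, address, value = programs[thread][index]
        if kind == "st":
            memory[address] = value
        # (ii) a load returns the most recent store to that physical address
        elif memory.get(address, initial) != value:
            return False
    return True


# Hypothetical two-thread execution on physical addresses PA_x and PA_y.
programs = {
    "T0": [("st", "PA_x", 1), ("ld", "PA_y", 0)],
    "T1": [("st", "PA_y", 1), ("ld", "PA_x", 1)],
}
good_order = [("T0", 0), ("T0", 1), ("T1", 0), ("T1", 1)]
bad_order = [("T1", 0), ("T0", 0), ("T0", 1), ("T1", 1)]
```

Here <code>good_order</code> satisfies both conditions, while <code>bad_order</code> fails condition (ii) because the load of <code>PA_y</code> would have to return 1 rather than 0.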

==== Virtual address memory consistency (VAMC)====
Correct VAMC is required for mapped code to work correctly.

Although adapting an AT-oblivious consistency model for PAMC is straightforward, there are a few challenges when adapting an AT-oblivious consistency model for VAMC:
* synonyms - Multiple virtual addresses may map to the same physical address. Suppose two virtual addresses VA1 and VA2 map to the same physical address PA. SC requires that the value of a load is equal to the value of the most recent store to the same address. A naive definition of VAMC that ignores the level of indirection introduced by AT would apply this rule to each virtual address separately, so a load from VA2 would not be required to observe an earlier store to VA1 even though both name the same physical location. To overcome this challenge, we re-formulate AT-oblivious consistency models for VAMC by applying the model to synonym sets of virtual addresses rather than individual addresses. For instance, we can define SC for VAMC as follows: there must exist a total order of all loads and stores to virtual addresses that respects program order and in which each load gets the value of the most recent store to any virtual address in the same virtual address synonym set. Incorporating synonyms explicitly in the consistency model allows programmers to reason about the ordering of accesses to virtual addresses.
* mapping and permission changes - Another challenge is that the set of memory operations at the VAMC level is richer than at the PAMC level: in addition to loads and stores, VAMC must order the operations that change address mappings and access permissions.
* load/store side effects - Yet another challenge in specifying VAMC is that loads and stores to virtual addresses have certain side effects. The AT system includes status bits such as accessed and dirty bits for each page table entry. These status bits are part of the architectural state, and the ordering of updates to those bits must thus be specified in VAMC. To achieve this we add two new operations to the specification tables: Ld-sb (load’s impact on status bits) and St-sb (store’s impact on status bits).
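
The synonym problem can be illustrated with a short sketch. It assumes a fixed, hypothetical page table (<code>PAGE_TABLE</code>) and an already chosen total order of operations; the function name and encoding are illustrative and cover only the synonym bullet, not mapping changes or status bits.

```python
# Hypothetical mapping: VA1 and VA2 are synonyms for the same physical address.
PAGE_TABLE = {"VA1": "PA", "VA2": "PA", "VA3": "PB"}


def loads_see_latest_store(order, set_of, initial=0):
    """Check that each load returns the most recent store to its synonym set.

    order:  list of (kind, virtual_address, value) in a total order that is
            assumed to already respect program order.
    set_of: function mapping a virtual address to its synonym-set identifier.
    """
    memory = {}
    for kind, va, value in order:
        key = set_of(va)
        if kind == "st":
            memory[key] = value
        elif memory.get(key, initial) != value:
            return False
    return True


def naive(va):
    # Naive VAMC: treat every virtual address as its own location.
    return va


def synonym_aware(va):
    # Synonym-aware VAMC: the physical address identifies the synonym set.
    return PAGE_TABLE[va]


# A store through VA1 followed by a load through VA2 that misses it.
stale_read = [("st", "VA1", 1), ("ld", "VA2", 0)]
```

Under the naive definition the stale read is accepted, because VA2 is treated as a location that was never written; the synonym-aware definition rejects it.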

=== Causal consistency ===
[[Image:causal.jpg|thumb|right|300px|Causal Consistency example]]

Hutto and Ahamad <ref>P.W. Hutto and M. Ahamad. Slow memory: Weakening consistency to enhance concurrency in distributed shared memories. In Proceedings of the 10th International Conference on Distributed Computing Systems, pages 302–311, May 1990.</ref> introduced causal consistency. Lamport <ref>Leslie Lamport. Time, clocks, and the ordering of events in a distributed system. Communications of the ACM, 21(7):558–565, 1978.</ref> defined the notion of potential causality to capture the flow of information in a distributed system. This notion can be applied to a memory system by interpreting a write as a message-send event and a read as a message-read event. A memory is causally consistent if all processors agree on the order of causally related events. Causally unrelated events (concurrent events) can be observed in different orders.
The example shown here is a legal execution history under CC but not under SC. Note that W(x)1 and W(x)2 are causally related as P2 observed the first write by P1. Furthermore, P3 and P4 observe the accesses W(x)2 and W(x)3 in different orders, which would not be legal in SC.
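
The agreement requirement can be expressed as a small check. This is a sketch under stated assumptions: the write identifiers <code>w1</code>, <code>w2</code>, <code>w3</code> and the observed orders are hypothetical and only loosely modeled on the figure, and <code>causal_pairs</code> is assumed to already contain every causally related pair (including transitive ones).

```python
def causally_consistent(observed, causal_pairs):
    """Return True if every processor sees causally related writes in order.

    observed:     dict processor -> list of write ids in the order observed.
    causal_pairs: set of (earlier, later) write ids that are causally related.
    Writes that are not related by causal_pairs are concurrent and may be
    observed in any order.
    """
    for seen in observed.values():
        position = {write: i for i, write in enumerate(seen)}
        for earlier, later in causal_pairs:
            if earlier in position and later in position:
                if position[earlier] > position[later]:
                    return False
    return True


# w1 causally precedes w2; w3 is concurrent with w2.
causal_pairs = {("w1", "w2")}
legal = {"P3": ["w1", "w3", "w2"], "P4": ["w1", "w2", "w3"]}
illegal = {"P3": ["w2", "w1", "w3"], "P4": ["w1", "w2", "w3"]}
```

In <code>legal</code>, the two observers disagree only on the order of the concurrent writes, which causal consistency permits; in <code>illegal</code>, P3 sees a causally related pair out of order.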

=== Delta consistency ===

The delta consistency model states that after a fixed time period δ, an update is propagated through the system and all replicas will be consistent. In other words, barring a short bounded interval after a modification, the result of any read operation is consistent with a read on the original copy of an object. If an object is modified, the read will not be consistent during the short period of time following its modification. Once the fixed time period elapses, the modification is propagated and the read is now consistent.
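
As a rough illustration, the bound can be modeled by tracking write times. The class below is a hypothetical sketch, not taken from any real system: it assumes writes are recorded in increasing time order and reports the oldest value a read is still allowed to return at a given time.

```python
class DeltaReplica:
    """Worst-case view of an object under delta consistency."""

    def __init__(self, delta, initial=None):
        self.delta = delta      # propagation bound (the fixed period δ)
        self.initial = initial
        self.updates = []       # (write_time, value), in increasing time order

    def write(self, time, value):
        self.updates.append((time, value))

    def guaranteed_read(self, time):
        """Value that every replica is guaranteed to have reached by `time`."""
        value = self.initial
        for write_time, written in self.updates:
            # An update is guaranteed visible once delta has elapsed.
            if write_time + self.delta <= time:
                value = written
        return value


replica = DeltaReplica(delta=5, initial=0)
replica.write(time=10, value=42)
```

A read at time 12 falls inside the window and may still return the old value 0, while a read at time 15 or later is guaranteed to return 42. A real replica may of course expose the new value earlier than the bound.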

=== Entry consistency ===

This consistency model has been designed to be used with critical sections. Here, the programmer needs to use acquire and release at the start and end of each critical section, just as in both variants of release consistency. However, it also requires every ordinary shared variable to be associated with a synchronization variable such as a lock or a barrier. If the elements of an array need to be accessed independently in parallel, then each element of the array must be associated with a lock. When an acquire is done on a synchronization variable, only those ordinary shared variables guarded by that synchronization variable are made consistent. Release consistency does not associate shared variables with locks or barriers and at acquire time has to determine empirically which variables it needs. This is where entry consistency differs from release consistency.

Formally, a memory exhibits entry consistency if the following conditions are met:

* An acquire access of a synchronization variable is not allowed to perform with respect to a process until all updates to the guarded shared data have been performed with respect to that process.
* Before an exclusive mode access to a synchronization variable by a process is allowed to perform with respect to that process, no other process may hold the synchronization variable, not even in nonexclusive mode.
* After an exclusive mode access to a synchronization variable has been performed, any other process's next nonexclusive mode access to that synchronization variable may not be performed until it has performed with respect to that variable's owner.
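
The association between synchronization variables and the data they guard can be sketched as follows. This is a simplified, single-threaded illustration with hypothetical class names; it models only which variables become consistent on an acquire, not mutual exclusion or the exclusive and nonexclusive modes.

```python
class GuardedLock:
    """A synchronization variable that guards a named set of shared variables."""

    def __init__(self, guarded):
        self.guarded = set(guarded)
        self.latest = {}    # most recently released values of guarded variables


class Process:
    def __init__(self):
        self.local = {}     # this process's view of shared memory

    def acquire(self, lock):
        # Only the variables guarded by this lock are made consistent.
        self.local.update(lock.latest)

    def write(self, lock, name, value):
        assert name in lock.guarded, "variable is not guarded by this lock"
        self.local[name] = value

    def release(self, lock):
        # Publish the guarded variables this process has a value for.
        for name in lock.guarded:
            if name in self.local:
                lock.latest[name] = self.local[name]


lock_x, lock_y = GuardedLock(["x"]), GuardedLock(["y"])
p1, p2 = Process(), Process()
p1.acquire(lock_x)
p1.acquire(lock_y)
p1.write(lock_x, "x", 1)
p1.write(lock_y, "y", 2)
p1.release(lock_x)
p1.release(lock_y)
p2.acquire(lock_x)      # p2 learns about x only
```

After <code>p2</code> acquires only <code>lock_x</code>, its view contains the new value of <code>x</code> but nothing about <code>y</code>, which is the behavior the first condition describes.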

=== Eventual consistency ===

Given a sufficiently long period of time over which no changes are sent, all updates can be expected to propagate eventually through the system. Thus, all the replicas will be consistent. As the consistency achieved is eventual, the possibility of conflicts is high. The conflicts have to be resolved. There are three types of resolution:
* Read repair: The correction is done when a read finds an inconsistency. This slows down the read operation.
* Write repair: The correction is done when a write operation finds an inconsistency, slowing down the write operation.
* Asynchronous repair: The correction is not part of a read or write operation.
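
Read repair, the first option above, can be sketched in a few lines. The example assumes each replica stores a <code>(version, value)</code> pair per key and that the highest version wins; both the data layout and the function name are assumptions for illustration, and real systems use more elaborate conflict resolution.

```python
def read_with_repair(replicas, key):
    """Read `key` from all replicas and repair stale copies along the way.

    replicas: list of dicts mapping key -> (version, value).
    Raises ValueError if no replica holds the key.
    """
    copies = [replica[key] for replica in replicas if key in replica]
    newest = max(copies, key=lambda entry: entry[0])
    for replica in replicas:
        if replica.get(key) != newest:
            replica[key] = newest   # the extra work that slows the read down
    return newest[1]


replicas = [{"k": (2, "new")}, {"k": (1, "old")}, {}]
value = read_with_repair(replicas, "k")
```

After the call, all three replicas hold version 2, so later reads find no inconsistency for this key.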

=== Linearizability ===

This is also known as strict or atomic consistency. An operation is linearizable if it appears to the rest of the system to occur instantaneously. An atomic operation either occurs completely or does not occur at all. In reality, atomic operations do not actually occur instantaneously. Atomicity is enforced by mutual exclusion. At the software level, locks or semaphores are used to achieve this, while at the hardware level, a cache coherency protocol may be used. This makes it appear to the user that the entire operation occurred in a single instruction.

A sequence of invocations and responses made on an object by a set of threads is referred to as a history. Each invocation of a function is followed by a subsequent response.

Example:
Suppose two threads, A and B, attempt to acquire a lock, backing off if it is already taken. This would be modeled as both threads invoking the lock operation, then both threads receiving a response, one successful, one not.
* A calls lock
* B calls lock
* lock returns fail to A
* lock returns success to B

A sequential history is one in which every call is immediately followed by its response. A history is linearizable if:

* its invocations and responses can be reordered to yield a sequential history
* that sequential history is correct according to the sequential definition of the object
* if a response preceded an invocation in the original history, it must still precede it in the sequential reordering

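Ignoring the lock's semantics for a moment, the events of the above example can be reordered into a sequential history in two ways. Assuming the lock is initially free, only the second reordering is correct according to the sequential definition of a lock, because a call on a free lock must succeed. The original history is therefore linearizable through the second reordering.

Example:
One way would be:
* A calls lock
* lock returns fail to A
* B calls lock
* lock returns success to B

The other way would be:
* B calls lock
* lock returns success to B
* A calls lock
* lock returns fail to A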

Thus, an object is linearizable if all valid histories of its use can be linearized.
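
For very small histories, the three conditions can be checked by brute force. The sketch below makes several assumptions that are not in the text: each thread performs exactly one completed operation, the lock is initially free, and the lock is never released. The function name and event encoding are hypothetical.

```python
from itertools import permutations


def linearizations(history):
    """Return every sequential ordering of threads that linearizes `history`.

    history: list of ("call", thread) and ("return", thread, result) events,
             where result is "success" or "fail".
    """
    ops = {}
    for position, event in enumerate(history):
        if event[0] == "call":
            ops[event[1]] = {"call": position}
        else:
            ops[event[1]].update({"return": position, "result": event[2]})

    valid = []
    for candidate in permutations(ops):
        # An operation that returned before another was called must stay first.
        respects_time = all(
            not (ops[b]["return"] < ops[a]["call"])
            for i, a in enumerate(candidate)
            for b in candidate[i + 1:]
        )
        # Sequential lock definition: a call succeeds only on a free lock.
        held, legal = False, True
        for thread in candidate:
            expected = "fail" if held else "success"
            legal = legal and ops[thread]["result"] == expected
            held = True     # the lock is held after the first call succeeds
        if respects_time and legal:
            valid.append(candidate)
    return valid


history = [
    ("call", "A"),
    ("call", "B"),
    ("return", "A", "fail"),
    ("return", "B", "success"),
]
```

For the example history, the only ordering returned is B before A, since an ordering that starts with A would require a call on a free lock to fail.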

=== PRAM consistency ===
[[Image:pram.jpg|thumb|right|300px|PRAM Consistency example]]

It is also known as FIFO consistency; PRAM here stands for pipelined RAM. The reasoning that led to this model was as follows: Consider a multiprocessor where each processor has a local copy of the shared memory. For the memory to be scalable, the cost of an access should be independent of the time it takes to access the other processors’ memories. The proposal was that on a read, a PRAM would simply return the value stored in the local copy of the memory. On a write, it would update the local copy first and broadcast the new value to the other processors. Assuming a constant time for initiating a broadcast operation, the goal of making the cost of a read or write constant is thus achieved. In terms of ordering constraints, this is equivalent to requiring that all processors observe the writes from a single processor in the same order, while they may disagree on the order of writes by different processors.
The example shown is legal under PRAM but not under SC or CC. P3 and P4 observe the writes by P1 and P2 in different orders, although W(x)1 and W(x)2 are potentially causally related.
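
The ordering constraint can be checked directly: every observer must see each writer's writes in that writer's program order. The following sketch uses hypothetical write identifiers and observed orders, not the exact history from the figure.

```python
def pram_consistent(program_writes, observed):
    """Return True if every observer sees each writer's writes in program order.

    program_writes: dict writer -> list of write ids in program order.
    observed:       dict observer -> list of write ids in the order seen.
    """
    for seen in observed.values():
        for writes in program_writes.values():
            seen_from_writer = [w for w in seen if w in writes]
            in_program_order = [w for w in writes if w in seen]
            if seen_from_writer != in_program_order:
                return False
    return True


program_writes = {"P1": ["a1", "a2"], "P2": ["b1"]}
# Observers disagree on how P1's and P2's writes interleave: allowed.
legal = {"P3": ["a1", "b1", "a2"], "P4": ["b1", "a1", "a2"]}
# P3 sees P1's two writes in the wrong order: not allowed.
illegal = {"P3": ["a2", "a1", "b1"], "P4": ["b1", "a1", "a2"]}
```

Note that the check never compares writes from different writers, which is exactly the freedom that makes PRAM weaker than SC and CC.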

=== Release consistency ===

Release consistency is one of the consistency models used in the domain of concurrent programming (e.g. in distributed shared memory, distributed transactions etc.).

Systems of this kind are characterised by the existence of two special synchronisation operations, release and acquire. Before issuing a write to a memory object, a node must acquire the object via a special operation, and later release it. The code that runs between the acquire and release operations therefore constitutes the critical region. The system is said to provide release consistency if all write operations by a certain node are seen by the other nodes after the former releases the object and before the latter acquire it.

There are two kinds of protocols that implement release consistency:
* ''eager'', where all coherence actions are performed on release operations, and
* ''lazy'', where all coherence actions are delayed until after a subsequent acquire
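
The difference between the two protocol families is mainly a question of when updates travel. The sketch below is a deliberately simplified, single-threaded illustration with hypothetical names (<code>Node</code>, <code>release</code>, <code>acquire</code>, and a <code>shared</code> dictionary standing in for whatever the protocol uses to record released updates); it is not a description of any particular implementation.

```python
class Node:
    def __init__(self):
        self.cache = {}     # this node's local copies of shared objects


def release(writer, others, shared, eager):
    """Record the writer's updates; an eager protocol also pushes them now."""
    shared.update(writer.cache)
    if eager:
        for node in others:
            node.cache.update(writer.cache)


def acquire(node, shared):
    """A lazy protocol delivers pending updates only at the next acquire."""
    node.cache.update(shared)


# Eager: the reader's copy is updated at release time.
writer, reader, shared = Node(), Node(), {}
writer.cache["x"] = 1
release(writer, [reader], shared, eager=True)
eager_view_before_acquire = dict(reader.cache)

# Lazy: the reader's copy is updated only when it acquires.
writer, reader, shared = Node(), Node(), {}
writer.cache["x"] = 1
release(writer, [reader], shared, eager=False)
lazy_view_before_acquire = dict(reader.cache)
acquire(reader, shared)
lazy_view_after_acquire = dict(reader.cache)
```

In both cases the reader sees the write by the time it has acquired, which is all release consistency requires; the lazy variant simply avoids sending updates to nodes that never acquire.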

=== Sequential consistency ===
[[Image:sc.jpg|thumb|right|300px|Sequential Consistency example]]

Sequential consistency was first defined by Lamport in 1979. He defined a memory system to be sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.

In a sequentially consistent system, all processors must agree on the order of observed effects. The image to the right shows an example for SC.
Note that R(y)2 by processor P3 reads a value that has not been written yet! Of course, this is not possible in any real physical system. However, it shows a surprising flexibility of the SC model. Another reason why this is not a legal history for atomic consistency is that the
write operations W(x)1 and W(y)2 appear commuted at processor P3.

Sequential consistency has been the canonical memory consistency model for a long time. However, many multiprocessor machines actually implement a slightly weaker model called processor consistency.
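
Lamport's definition can be explored by enumerating every interleaving of a small program. The sketch below does this for a classic two-thread test in which each thread stores to one variable and then loads the other; the encoding of operations and the register names <code>r0</code> and <code>r1</code> are assumptions for illustration, and all variables are assumed to start at 0.

```python
from itertools import permutations


def sc_outcomes(threads):
    """Return the set of register outcomes reachable under SC.

    threads: list of programs; an operation is ("st", variable, value)
             or ("ld", variable, register_name).
    """
    # One label per operation; a schedule is an ordering of these labels,
    # and each thread's operations are consumed in program order.
    labels = [t for t, ops in enumerate(threads) for _ in ops]
    outcomes = set()
    for schedule in set(permutations(labels)):
        memory, registers = {}, {}
        pc = [0] * len(threads)
        for t in schedule:
            kind, variable, arg = threads[t][pc[t]]
            pc[t] += 1
            if kind == "st":
                memory[variable] = arg
            else:
                registers[arg] = memory.get(variable, 0)
        outcomes.add(tuple(sorted(registers.items())))
    return outcomes


threads = [
    [("st", "x", 1), ("ld", "y", "r0")],
    [("st", "y", 1), ("ld", "x", "r1")],
]
outcomes = sc_outcomes(threads)
```

Three outcomes are reachable, and the one in which both loads return 0 is not among them. Models that let a load bypass an earlier pending store to a different address, such as the processor consistency model mentioned above, can additionally produce that outcome.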

=== Weak consistency ===

A memory system is weakly consistent if it enforces the following restrictions:
* accesses to synchronization variables are sequentially consistent and
* no access to a synchronization variable is issued in a processor before all previous data accesses have been performed and
* no access is issued by a processor before a previous access to a synchronization variable has been performed
Notice that the meaning of “previous” is well-defined because it refers to program order. That is, an access A precedes access B if and only if the processor that executed access B has previously executed access A. Synchronizing accesses work as fences. At the time a synchronizing access performs, all previous accesses by that processor are guaranteed to have performed, and all subsequent accesses by that processor are guaranteed not to have performed. The synchronization model corresponding to these access order constraints is relatively simple. A program executing on a weakly consistent system appears sequentially consistent if the following two constraints are observed:
* There are no data races and
* Synchronization is visible to the memory system.
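
The second and third restrictions on a weakly consistent memory can be checked against a per-processor trace. The sketch assumes each access is described by hypothetical <code>issue</code> and <code>perform</code> timestamps and a <code>sync</code> flag; it does not cover the first restriction, which concerns the global order of synchronization accesses.

```python
def obeys_weak_ordering(accesses):
    """Check one processor's accesses, given in program order.

    accesses: list of dicts with keys "sync" (bool), "issue" and "perform"
              (timestamps). An access may not be issued before an earlier
              access has performed if either of the two is a synchronization
              access; ordinary data accesses may overlap freely.
    """
    for i, later in enumerate(accesses):
        for earlier in accesses[:i]:
            if later["sync"] or earlier["sync"]:
                if later["issue"] < earlier["perform"]:
                    return False
    return True


# Two overlapping data accesses, then a synchronization access acting as a fence.
ok_trace = [
    {"sync": False, "issue": 0, "perform": 10},
    {"sync": False, "issue": 1, "perform": 5},
    {"sync": True, "issue": 10, "perform": 12},
    {"sync": False, "issue": 12, "perform": 20},
]
# The synchronization access is issued while a data access is still pending.
bad_trace = [
    {"sync": False, "issue": 0, "perform": 10},
    {"sync": True, "issue": 3, "perform": 4},
]
```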

=== Strong consistency ===

Strong consistency is one of the consistency models used in the domain of concurrent programming (e.g. in distributed shared memory, distributed transactions etc.). Strong consistency is supported if:
* All accesses are seen by all parallel processes (or nodes, processors etc.) in the same order (sequentially)
* Only one consistent state can be observed, as opposed to weak consistency, where different parallel processes (or nodes etc.) can perceive variables in different states.

== Practical performance impact ==

[[Image:consistency_models.jpg|thumb|center|700px|Implementation of consistency models]]

Here, we define performance as the processor utilization achieved in an execution. The reason for using processor utilization as the distinguishing factor is that it provides reasonable results even when the program’s control path is not deterministic and depends on relative timing of synchronization accesses. Let us make a comparative analysis of the performance achieved by the various consistency models on the LFC architecture (an aggressive implementation with lock-free caches).

A BASE model has been added to the four consistency models, viz. Sequential Consistency (SC), Processor Consistency (PC), Release Consistency (RC) and Weak Consistency (WC). BASE is the most constrained model and is used as the baseline for all performance comparisons. It incorporates no buffering or pipelining and waits for each read and write to complete before proceeding.

=== Performance of SC versus BASE ===

The SC model does not perform significantly better than BASE. The performance gain from BASE to SC is small for most applications. This is because reads are expected to be closely interleaved with writes. Significant write clustering may occur sometimes, for example, when initializing data structures, but such occurrences are expected to be infrequent.

=== Performance of PC versus SC and BASE ===

In PC, sequential consistency is abandoned. The main benefit of this extra complexity comes from the fact that reads do not have to stall for pending writes to perform. However, some of the benefits may be lost if the write buffer gets full and stalls the processor. The PC model is relatively successful in hiding almost all of the latency of writes given a reasonably deep write buffer. Since the comparison of PC and WC is more involved than the comparison with RC, we next examine PC versus RC. Subsequently, we compare PC and WC.
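
The qualitative difference between BASE, SC and PC can be illustrated with a toy cycle-count model. This is not the LFC simulation discussed in this section: the latency of 10 cycles, the unbounded write buffer that services one write at a time, and the one-cycle issue cost are all assumptions chosen only to make the effect visible.

```python
def run(trace, model, lat=10):
    """Return (busy_cycles, total_cycles) for a string of 'R' and 'W' operations.

    model is "BASE", "SC" or "PC". Buffered writes are serviced one at a
    time, each taking `lat` cycles, and the buffer is assumed never to fill.
    """
    now = 0             # current cycle
    write_done = 0      # cycle at which the last buffered write completes
    busy = 0
    for op in trace:
        if op == "W" and model != "BASE":
            # Buffered write: the processor does not wait for it.
            write_done = max(now, write_done) + lat
        elif op == "W":
            now += lat                      # BASE waits for every write
        elif op == "R":
            if model == "SC":
                now = max(now, write_done)  # read stalls for pending writes
            now += lat                      # every model waits for the read
        now += 1                            # one busy cycle to issue the op
        busy += 1
    return busy, now


def utilization(trace, model, lat=10):
    busy, total = run(trace, model, lat)
    return busy / total


interleaved = "WRWR"    # reads closely interleaved with writes
```

With reads and writes interleaved, the toy model takes 44 cycles under BASE, 42 under SC and 24 under PC, which mirrors the observations above: SC gains little over BASE because each read waits for the preceding write anyway, while PC hides most of the write latency.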

=== Performance of PC versus RC ===

In addition to providing all the benefits of the PC model, RC allows pipelining of writes by exploiting information about the synchronization accesses. That is, writes can be retired from the write buffer before ownership has been obtained. The fact that writes are retired at a faster rate has two implications:
* the write buffer becoming full is less of a problem, and
* in cases where a release operation is behind several writes in the write buffer, that release can be observed sooner by a processor waiting to do an acquire.
The write buffer getting full is really not a problem for PC. Therefore, most gains, if observed, will be due to faster completion of synchronization.

=== Performance of WC versus PC and RC ===

The differences between WC and RC arise because WC does not differentiate between acquire and release synchronization operations. Consequently, any synchronization operation must conservatively satisfy the constraints of both release and acquire. Thus, compared to RC, WC stalls the processor at an acquire until pending writes and releases complete. In addition, the processor is stalled for pending releases if it
attempts to do a read operation.

On comparing WC and PC, we observe the surprising result that PC sometimes performs better than WC. WC has the advantage that writes can be retired at a faster rate from the write buffer. The disadvantage of WC relative to PC is the same as its disadvantage relative to RC: WC stalls the processor at some points for pending writes and releases to perform.

== Conclusion ==
==References==
<references/>

CSC/ECE 506 Spring 2012/10b sr

2012-04-05T01:13:47Z

Sshanbh:

'''Use of consistency models in current multiprocessors'''

The memory consistency model of a shared memory system determines the order in which memory operations will appear to execute to the programmer. This article describes how consistency is used in multiprocessors today and later digs into the details of popular consistency models in use today. The impact of these models on the multiprocessor performance is also discussed.

== Introduction ==

Many modern computer systems and most multicore chips support shared memory in hardware. In a shared memory system, each of the processor cores may read and write to a single shared address space. These designs seek various goodness properties, such as high performance, low power, and low cost. Of course, it is not valuable to provide these goodness properties without first providing correctness. Correct shared memory seems intuitive at a hand-wave level, but there are subtle issues in even defining what it means for a shared memory system to be correct, as well as many subtle corner cases in designing a correct shared memory implementation. Moreover, these subtleties must be mastered in hardware implementations, where bug fixes are expensive.

It is the job of consistency to define shared memory correctness. Consistency definitions provide rules about loads and stores (or memory reads and writes) and how they act upon memory. Ideally, consistency definitions would be simple and easy to understand. However, defining what it means for shared memory to behave correctly is more subtle than defining the correct behavior of, for example, a single-threaded processor core. The correctness criterion for a single processor core partitions behavior between one correct result and many incorrect alternatives. This is because the processor’s architecture mandates that the execution of a thread transforms a given input state into a single well-defined output state, even on an out-of-order core. Shared memory consistency models, however, concern the loads and stores of multiple threads and usually allow many correct executions while disallowing many incorrect ones. The possibility of multiple correct executions is due to the ISA allowing multiple threads to execute concurrently, often with many possible legal interleavings of instructions from different threads. The multitude of correct executions complicates the erstwhile simple challenge of determining whether an execution is correct. Nevertheless, consistency must be mastered to implement shared memory and, in some cases, to write correct programs that use it.

== Consistency in current-day multiprocessors ==

Today's scalable multiprocessors are mostly built with a distributed shared memory architecture. The memory is physically distributed but logically shared. In other words, the address spaces generated by all processing nodes globally form a single address space. The main advantage of such a system lies in combining the scalability of distributed hardware with the programmability of a shared virtual memory. Representative systems include the Cray T3D, the Stanford DASH, the MIT Alewife, the Tera computer, and the Convex SPP.

A scalable system should be able to hide the long latencies of remote memory accesses. Several latency hiding techniques have been proposed. Coherent caches reduce the frequency of remote memory accesses by caching data close to the processor. Relaxed memory consistency allows reordering of memory events and buffering or pipelining of remote memory accesses. Data prefetching attempts to hide long read latency by issuing read requests well ahead of time, with the expectation that the data will be available in the cache when it is referenced. Multithreading attempts to hide the long latency by context switching between several active threads, thus allowing the processor to perform useful work while waiting for remote requests or synchronization faults to complete.

== Consistency models used ==

=== Address translation aware memory consistency ===

These memory consistency models define the behavior of operations (loads, stores, memory barriers, etc.) on physical addresses and virtual addresses, taking address translation (AT) into account. The two important levels of memory consistency that can be classified as address translation aware are described below:

==== Physical address memory consistency (PAMC)====
It is necessary to have correct PAMC for unmapped code to work correctly. Unmapped software, including the boot code and part of the system software that manages AT, relies upon PAMC. It is the responsibility of the hardware to implement PAMC and this is specified precisely in the architectural manual.
It is not too difficult to adapt an AT-oblivious consistency model as the specification of PAMC.

Example:
The PAMC model could be SC. In such a case the interface would specify that
(i) there must exist a total order of all loads and stores to physical addresses that respects the program order of each thread and
(ii) the value of each load is equal to the value of the most recent store to that physical address in the total order.

==== Virtual address memory consistency (VAMC)====
Correct VAMC is required for mapped code to work correctly.

Although adapting an AT-oblivious consistency model for PAMC is straightforward, there are a few challenges when adapting an AT-oblivious consistency model for VAMC:
* synonyms - Multiple virtual addresses may map to the same physical address. Suppose two virtual addresses VA1 and VA2 map to the same physical address PA. SC requires that the value of a load is equal to the value of the most recent store to the same address. A naive definition of VAMC that ignores the level of indirection introduced by AT would apply this rule to each virtual address separately, so a load from VA2 would not be required to observe an earlier store to VA1 even though both name the same physical location. To overcome this challenge, we re-formulate AT-oblivious consistency models for VAMC by applying the model to synonym sets of virtual addresses rather than individual addresses. For instance, we can define SC for VAMC as follows: there must exist a total order of all loads and stores to virtual addresses that respects program order and in which each load gets the value of the most recent store to any virtual address in the same virtual address synonym set. Incorporating synonyms explicitly in the consistency model allows programmers to reason about the ordering of accesses to virtual addresses.
* mapping and permission changes - Another challenge is that the set of memory operations at the VAMC level is richer than at the PAMC level: in addition to loads and stores, VAMC must order the operations that change address mappings and access permissions.
* load/store side effects - Yet another challenge in specifying VAMC is that loads and stores to virtual addresses have certain side effects. The AT system includes status bits such as accessed and dirty bits for each page table entry. These status bits are part of the architectural state, and the ordering of updates to those bits must thus be specified in VAMC. To achieve this we add two new operations to the specification tables: Ld-sb (load’s impact on status bits) and St-sb (store’s impact on status bits).

=== Causal consistency ===
[[Image:causal.jpg|thumb|right|300px|Causal Consistency example]]

Hutto and Ahamad <ref>P.W. Hutto and M. Ahamad. Slow memory: Weakening consistency to enhance concurrency in distributed shared memories. In Proceedings of the 10th International Conference on Distributed Computing Systems, pages 302–311, May 1990.</ref> introduced causal consistency. Lamport <ref>Leslie Lamport. Time, clocks, and the ordering of events in a distributed system. Communications of the ACM, 21(7):558–565, 1978.</ref> defined the notion of potential causality to capture the flow of information in a distributed system. This notion can be applied to a memory system by interpreting a write as a message-send event and a read as a message-read event. A memory is causally consistent if all processors agree on the order of causally related events. Causally unrelated events (concurrent events) can be observed in different orders.
The example shown here is a legal execution history under CC but not under SC. Note that W(x)1 and W(x)2 are causally related as P2 observed the first write by P1. Furthermore, P3 and P4 observe the accesses W(x)2 and W(x)3 in different orders, which would not be legal in SC.

=== Delta consistency ===

The delta consistency model states that after a fixed time period δ, an update is propagated through the system and all replicas will be consistent. In other words, barring a short bounded interval after a modification, the result of any read operation is consistent with a read on the original copy of an object. If an object is modified, the read will not be consistent during the short period of time following its modification. Once the fixed time period elapses, the modification is propagated and the read is now consistent.

=== Entry consistency ===

This consistency model has been designed to be used with critical sections. Here, the programmer needs to use acquire and release at the start and end of each critical section, just as in both variants of release consistency. However, it also requires every ordinary shared variable to be associated with a synchronization variable such as a lock or a barrier. If the elements of an array need to be accessed independently in parallel, then each element of the array must be associated with a lock. When an acquire is done on a synchronization variable, only those ordinary shared variables guarded by that synchronization variable are made consistent. Release consistency does not associate shared variables with locks or barriers and at acquire time has to determine empirically which variables it needs. This is where entry consistency differs from release consistency.

Formally, a memory exhibits entry consistency if the following conditions are met:

* An acquire access of a synchronization variable is not allowed to perform with respect to a process until all updates to the guarded shared data have been performed with respect to that process.
* Before an exclusive mode access to a synchronization variable by a process is allowed to perform with respect to that process, no other process may hold the synchronization variable, not even in nonexclusive mode.
* After an exclusive mode access to a synchronization variable has been performed, any other process's next nonexclusive mode access to that synchronization variable may not be performed until it has performed with respect to that variable's owner.

=== Eventual consistency ===

Given a sufficiently long period of time over which no changes are sent, all updates can be expected to propagate eventually through the system. Thus, all the replicas will be consistent. As the consistency achieved is eventual, the possibility of conflicts is high. The conflicts have to be resolved. There are three types of resolution:
* Read repair: The correction is done when a read finds an inconsistency. This slows down the read operation.
* Write repair: The correction is done when a write operation finds an inconsistency, slowing down the write operation.
* Asynchronous repair: The correction is not part of a read or write operation.

=== Linearizability ===

This is also known as strict or atomic consistency. An operation is linearizable if it appears to the rest of the system to occur instantaneously. An atomic operation either occurs completely or does not occur at all. In reality, atomic operations do not actually occur instantaneously. Atomicity is enforced by mutual exclusion. At the software level, locks or semaphores are used to achieve this, while at the hardware level, a cache coherency protocol may be used. This makes it appear to the user that the entire operation occurred in a single instruction.

A sequence of invocations and responses made on an object by a set of threads is referred to as a history. Each invocation of a function is followed by a subsequent response.

Example:
Suppose two threads, A and B, attempt to acquire a lock, backing off if it is already taken. This would be modeled as both threads invoking the lock operation, then both threads receiving a response, one successful, one not.
* A calls lock
* B calls lock
* lock returns fail to A
* lock returns success to B

A sequential history is one in which every call is immediately followed by its response. A history is linearizable if:

* its invocations and responses can be reordered to yield a sequential history
* that sequential history is correct according to the sequential definition of the object
* if a response preceded an invocation in the original history, it must still precede it in the sequential reordering

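Ignoring the lock's semantics for a moment, the events of the above example can be reordered into a sequential history in two ways. Assuming the lock is initially free, only the second reordering is correct according to the sequential definition of a lock, because a call on a free lock must succeed. The original history is therefore linearizable through the second reordering.

Example:
One way would be:
* A calls lock
* lock returns fail to A
* B calls lock
* lock returns success to B

The other way would be:
* B calls lock
* lock returns success to B
* A calls lock
* lock returns fail to A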

Thus, an object is linearizable if all valid histories of its use can be linearized.

=== PRAM consistency ===
[[Image:pram.jpg|thumb|right|300px|PRAM Consistency example]]

PRAM (pipelined RAM) consistency is also known as FIFO consistency. The reasoning that led to this model was as follows: Consider a multiprocessor where each processor has a local copy of the shared memory. For the memory to be scalable, the cost of an access should be independent of the time it takes to access the other processors’ memories. Lipton and Sandberg proposed that on a read, a PRAM would simply return the value stored in the local copy of the memory. On a write, it would update the local copy first and broadcast the new value to the other processors. Assuming a constant time for initiating a broadcast operation, the goal of making the cost for a read or write constant is thus achieved. In terms of ordering constraints, this is equivalent to requiring that all processors observe the writes from a single processor in the same order, while they may disagree on the order of writes by different processors.
The example shown is legal under PRAM but not under SC or CC. P3 and P4 observe the writes by P1 and P2 in different orders, although W(x)1 and W(x)2 are potentially causally related.
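
The ordering constraint can be expressed as a small check. This is a sketch under stated assumptions: the function name and the data layout are hypothetical, every observer is assumed to see every write eventually, and the write labels only loosely follow the notation of the figure.

```python
def respects_pram(program_orders, observed):
    """program_orders maps each writer to its writes in program order.
    observed is the sequence of (writer, write) pairs seen by one processor.
    PRAM only requires that each writer's writes appear in program order."""
    for writer, writes in program_orders.items():
        seen = [write for who, write in observed if who == writer]
        if seen != writes:
            return False
    return True

program_orders = {"P1": ["W(x)1"], "P2": ["W(x)2"]}
p3_view = [("P1", "W(x)1"), ("P2", "W(x)2")]
p4_view = [("P2", "W(x)2"), ("P1", "W(x)1")]

# P3 and P4 disagree on the order of writes by different processors,
# which PRAM allows.
print(respects_pram(program_orders, p3_view))
print(respects_pram(program_orders, p4_view))

# Seeing one processor's writes out of program order is not allowed.
print(respects_pram({"P1": ["W(a)1", "W(a)2"]},
                    [("P1", "W(a)2"), ("P1", "W(a)1")]))
```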

=== Release consistency ===

Release consistency is one of the consistency models used in the domain of concurrent programming (e.g. in distributed shared memory, distributed transactions etc.).

Systems of this kind are characterised by the existence of two special synchronisation operations, release and acquire. Before issuing a write to a memory object, a node must acquire the object via a special operation, and later release it. The code that runs between the acquire and the release therefore constitutes the critical region. The system is said to provide release consistency if all write operations by a certain node are seen by the other nodes after the former releases the object and before the latter acquire it.

There are two kinds of protocols that implement release consistency:
* ''eager'', where all coherence actions are performed on release operations, and
* ''lazy'', where all coherence actions are delayed until after a subsequent acquire
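
The acquire/release discipline can be illustrated with a toy model. The sketch below is an assumption-laden simplification, not either of the two protocols above: the classes `SharedObject` and `Node` are hypothetical, writes are published to a single home copy at release, and a node refreshes its local copy at acquire.

```python
class SharedObject:
    """Home copy of a memory object plus its owner (hypothetical model)."""
    def __init__(self):
        self.data = {}
        self.owner = None

class Node:
    def __init__(self, name):
        self.name = name
        self.local = {}    # this node's local copy of the object
        self.pending = {}  # writes made inside the critical region

    def acquire(self, obj):
        if obj.owner is not None:
            raise RuntimeError("object already held by " + obj.owner)
        obj.owner = self.name
        self.local = dict(obj.data)  # bring the local copy up to date

    def write(self, obj, key, value):
        if obj.owner != self.name:
            raise RuntimeError("write outside an acquire/release pair")
        self.local[key] = value
        self.pending[key] = value

    def release(self, obj):
        obj.data.update(self.pending)  # writes become visible to others here
        self.pending.clear()
        obj.owner = None

obj, n1, n2 = SharedObject(), Node("n1"), Node("n2")
n1.acquire(obj)
n1.write(obj, "x", 1)
print("x" in obj.data)   # not yet visible outside the critical region
n1.release(obj)
n2.acquire(obj)
print(n2.local["x"])     # visible after n1's release and n2's acquire
```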

=== Sequential consistency ===
[[Image:sc.jpg|thumb|right|300px|Sequential Consistency example]]

Sequential consistency was first defined by Lamport in 1979. He defined a memory system to be sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.

In a sequentially consistent system, all processors must agree on the order of observed effects. The image to the right shows an example for SC.
Note that R(y)2 by processor P3 reads a value that has not been written yet! Of course, this is not possible in any real physical system. However, it shows a surprising flexibility of the SC model. Another reason why this is not a legal history for atomic consistency is that the write operations W(x)1 and W(y)2 appear commuted at processor P3.

Sequential consistency has been the canonical memory consistency model for a long time. However, many multiprocessor machines actually implement a slightly weaker model called processor consistency.
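
Lamport's definition can be explored by enumerating every interleaving that preserves each processor's program order. The sketch below uses a two-processor test program that is not taken from this article; the program, the register names and the helper functions are assumptions chosen for illustration.

```python
def interleavings(a, b):
    """Yield every merge of a and b that keeps each list's own order."""
    if not a:
        yield list(b)
        return
    if not b:
        yield list(a)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest

def run(schedule):
    """Execute one sequential order against a single shared memory."""
    memory = {"x": 0, "y": 0}
    registers = {}
    for kind, variable, arg in schedule:
        if kind == "W":
            memory[variable] = arg
        else:
            registers[arg] = memory[variable]
    return registers["r1"], registers["r2"]

p1 = [("W", "x", 1), ("R", "y", "r1")]  # x = 1; r1 = y
p2 = [("W", "y", 1), ("R", "x", "r2")]  # y = 1; r2 = x

outcomes = {run(s) for s in interleavings(p1, p2)}
print(sorted(outcomes))  # → [(0, 1), (1, 0), (1, 1)]
```

The outcome (0, 0) never appears, because in every sequential order at least one write precedes both reads. A machine that lets reads bypass buffered writes can produce (0, 0), which is one way to observe that it is weaker than SC.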

=== Weak consistency ===

A memory system is weakly consistent if it enforces the following restrictions:
* accesses to synchronization variables are sequentially consistent and
* no access to a synchronization variable is issued in a processor before all previous data accesses have been performed and
* no access is issued by a processor before a previous access to a synchronization variable has been performed
Notice that the meaning of “previous” is well-defined because it refers to program order. That is, an access A precedes access B if and only if the processor that executed access B has previously executed access A. Synchronizing accesses work as fences. At the time a synchronizing access performs, all previous accesses by that processor are guaranteed to have performed, and all subsequent accesses by that processor are guaranteed not to have performed. The synchronization model corresponding to these access order constraints is relatively simple. A program executing on a weakly consistent system appears sequentially consistent if the following two constraints are observed:
* There are no data races and
* Synchronization is visible to the memory system.

=== Strong consistency ===

Strong consistency is one of the consistency models used in the domain of concurrent programming (e.g. in distributed shared memory, distributed transactions etc.). Strong consistency is supported if:
* All accesses are seen by all parallel processes (or nodes, processors etc.) in the same order (sequentially)
* Only one consistent state can be observed, as opposed to weak consistency, where different parallel processes (or nodes etc.) can perceive variables in different states.

== Practical performance impact ==

[[Image:consistency_models.jpg|thumb|center|700px|Implementation of consistency models]]

Here, we define performance as the processor utilization achieved in an execution. The reason for using processor utilization as the distinguishing factor is that it provides reasonable results even when the program’s control path is not deterministic and depends on the relative timing of synchronization accesses. Let us make a comparative analysis of the performance achieved by the various consistency models on the LFC architecture (an aggressive implementation with lockup-free caches).

A BASE model has been added to the four consistency models, viz. Sequential Consistency (SC), Processor Consistency (PC), Release Consistency (RC) and Weak Consistency (WC). BASE is the most constrained model and is used as the baseline for all performance comparisons. It incorporates no buffering or pipelining and waits for each read and write to complete before proceeding.

=== Performance of SC versus BASE ===

The SC model does not perform significantly better than BASE. The performance gain from BASE to SC is small for most applications. This is because reads are expected to be closely interleaved with writes. Significant write clustering may occur sometimes, for example, when initializing data structures, but such occurrences are expected to be infrequent.

=== Performance of PC versus SC and BASE ===

In PC, sequential consistency is abandoned. The main benefit of this extra complexity comes from the fact that reads do not have to stall for pending writes to perform. However, some of the benefit may be lost if the write buffer fills up and stalls the processor. Given a reasonably deep write buffer, the PC model is relatively successful in hiding almost all of the latency of writes. Since the comparison of PC and WC is more involved than the comparison with RC, we next examine PC versus RC. Subsequently, we compare PC and WC.

=== Performance of PC versus RC ===

In addition to providing all the benefits of the PC model, RC allows pipelining of writes by exploiting information about the synchronization accesses. That is, writes can be retired from the write buffer before ownership has been obtained. The fact that writes are retired at a faster rate has two implications:
* the write buffer becoming full is less of a problem, and
* in cases where a release operation is behind several writes in the write buffer, that release can be observed sooner by a processor waiting to do an acquire.
In practice, the write buffer getting full is rarely a problem for PC. Therefore, most gains of RC over PC, if observed, will be due to faster completion of synchronization.

=== Performance of WC versus PC and RC ===

The differences between WC and RC arise because WC does not differentiate between acquire and release synchronization operations. Consequently, any synchronization operation must conservatively satisfy the constraints of both release and acquire. Thus, compared to RC, WC stalls the processor at an acquire until pending writes and releases complete. In addition, the processor is stalled for pending releases if it attempts to do a read operation.

On comparing WC and PC, we observe the surprising result that PC sometimes performs better than WC. WC has the advantage that writes can be retired from the write buffer at a faster rate. The disadvantage of WC relative to PC is the same as its disadvantage relative to RC: WC stalls the processor at some points until pending writes and releases have performed.

== Conclusion ==
==References==
<references/>

CSC/ECE 506 Spring 2012/10b sr

2012-04-05T00:28:37Z

Sshanbh: /* Practical performance impact */

'''Use of consistency models in current multiprocessors'''

The memory consistency model of a shared memory system determines the order in which memory operations will appear to execute to the programmer. This article describes how consistency is used in multiprocessors today and later digs into the details of popular consistency models in use today. The impact of these models on the multiprocessor performance is also discussed. The article finishes off with a discussion about how the consistency models perform with larger multiprocessors.

== Introduction ==

Many modern computer systems and most multicore chips support shared memory in hardware. In a shared memory system, each of the processor cores may read and write to a single shared address space. These designs seek various goodness properties, such as high performance, low power, and low cost. Of course, it is not valuable to provide these goodness properties without first providing correctness. Correct shared memory seems intuitive at a hand-wave level, but there are subtle issues in even defining what it means for a shared memory system to be correct, as well as many subtle corner cases in designing a correct shared memory implementation. Moreover, these subtleties must be mastered in hardware implementations where bug fixes are expensive.

It is the job of consistency to define shared memory correctness. Consistency definitions provide rules about loads and stores (or memory reads and writes) and how they act upon memory. Ideally, consistency definitions would be simple and easy to understand. However, defining what it means for shared memory to behave correctly is more subtle than defining the correct behavior of, for example, a single-threaded processor core. The correctness criterion for a single processor core partitions behavior between one correct result and many incorrect alternatives. This is because the processor’s architecture mandates that the execution of a thread transforms a given input state into a single well-defined output state, even on an out-of-order core. Shared memory consistency models, however, concern the loads and stores of multiple threads and usually allow many correct executions while disallowing many incorrect ones. The possibility of multiple correct executions is due to the ISA allowing multiple threads to execute concurrently, often with many possible legal interleavings of instructions from different threads. The multitude of correct executions complicates the erstwhile simple challenge of determining whether an execution is correct. Nevertheless, consistency must be mastered to implement shared memory and, in some cases, to write correct programs that use it.

== Consistency in current-day multiprocessors ==

Today's scalable multiprocessors are mostly built with a distributed shared memory architecture. The memory is physically distributed but logically shared. In other words, the address spaces generated by all processing nodes globally form a single address space. The main advantage of such a system lies in the scalability of distributed hardware combined with the programmability of a shared virtual memory. Representative systems include the Cray T3D, the Stanford Dash, the MIT Alewife, the Teracomputer, and Convex SPP.

A scalable system should be able to hide the long latencies of remote memory accesses. Several latency hiding techniques have been proposed: Coherent caches reduce the frequency of remote memory accesses by caching data close to the processor. Relaxed memory consistency allows reordering of memory events and buffering or pipelining of remote memory accesses. Data prefetching attempts to hide long read latency by issuing read requests well ahead of time, with the expectation that the data will be available in the cache when it is referenced. Multithreading attempts to hide the long latency by context switching between several active threads, thus allowing the processor to perform useful work while waiting for remote requests or synchronization faults to complete.

== Consistency models used ==

=== Address translation aware memory consistency ===

These memory consistency models define the behavior of operations (loads, stores, memory barriers, etc.) on physical addresses and virtual addresses. The two important levels of memory consistency that can be classified as address translation aware are described below:

==== Physical address memory consistency (PAMC)====
Correct PAMC is required for unmapped code to work correctly. Unmapped software, including the boot code and the part of the system software that manages address translation (AT), relies upon PAMC. The hardware is responsible for implementing PAMC, which is specified precisely in the architectural manual.
Adapting an AT-oblivious consistency model as the specification of PAMC is not too difficult.

Example:
The PAMC model could be SC. In such a case the interface would specify that
(i) there must exist a total order of all loads and stores to physical addresses that respects the program order of each thread and
(ii) the value of each load is equal to the value of the most recent store to that physical address in the total order.
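
Conditions (i) and (ii) translate directly into a check on a proposed total order. The sketch below is illustrative only: the tuple layout, the function name and the assumption that every physical address initially holds 0 are not part of any architectural specification.

```python
def is_sc_total_order(total_order, program_orders):
    """Operations are (thread, kind, physical_address, value) tuples."""
    # (i) the total order must respect the program order of each thread
    for thread, ops in program_orders.items():
        if [op for op in total_order if op[0] == thread] != ops:
            return False
    # (ii) each load returns the most recent store to that physical address
    memory = {}
    for _thread, kind, address, value in total_order:
        if kind == "store":
            memory[address] = value
        elif memory.get(address, 0) != value:  # assumed initial value: 0
            return False
    return True

store = ("T1", "store", 0x100, 1)
load = ("T2", "load", 0x100, 1)
program_orders = {"T1": [store], "T2": [load]}

print(is_sc_total_order([store, load], program_orders))  # load sees the store
print(is_sc_total_order([load, store], program_orders))  # load would read 0
```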

==== Virtual address memory consistency (VAMC)====
Correct VAMC is required for mapped code to work correctly.

Although adapting an AT-oblivious consistency model for PAMC is straightforward, there are a few challenges when adapting an AT-oblivious consistency model for VAMC:
* synonyms - Multiple virtual addresses may map to the same physical address. Suppose two virtual addresses VA1 and VA2 map to the same physical address PA. SC requires that the value of a load is equal to the value of the most recent store to the same address. A naive definition of VAMC that ignores the level of indirection introduced by AT would treat VA1 and VA2 as different addresses, so a load from VA2 would not be required to return the value of the most recent store to VA1. To overcome this challenge, we re-formulate AT-oblivious consistency models for VAMC by applying the model to synonym sets of virtual addresses rather than individual addresses. For instance, we can define SC for VAMC as follows: there must exist a total order of all loads and stores to virtual addresses that respects program order and in which each load gets the value of the most recent store to any virtual address in the same virtual address synonym set. Incorporating synonyms explicitly in the consistency model allows programmers to reason about the ordering of accesses to virtual addresses.
* mapping and permission changes - Another challenge is that the set of memory operations at the VAMC level is richer than at the PAMC level.
* load/store side effects - Yet another challenge in specifying VAMC is that loads and stores to virtual addresses have certain side effects. The AT system includes status bits such as accessed and dirty bits for each page table entry. These status bits are part of the architectural state, and the ordering of updates to those bits must thus be specified in VAMC. To achieve this we add two new operations to the specification tables: Ld-sb (load’s impact on status bits) and St-sb (store’s impact on status bits).
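
The effect of synonym sets on the load-value rule can be shown with a short sketch. It assumes a fixed mapping (no mapping or permission changes), an initial value of 0, and hypothetical names throughout; it checks only the load-value condition, not program order.

```python
def loads_consistent(total_order, key_of):
    """Check that each load returns the most recent store, where key_of
    decides which addresses count as the same location."""
    memory = {}
    for kind, virtual_address, value in total_order:
        key = key_of(virtual_address)
        if kind == "store":
            memory[key] = value
        elif memory.get(key, 0) != value:  # assumed initial value: 0
            return False
    return True

va_to_pa = {"VA1": "PA", "VA2": "PA"}  # VA1 and VA2 are synonyms
order = [("store", "VA1", 1), ("load", "VA2", 0)]  # the load returns stale 0

# Naive rule: each virtual address is its own location, so the stale load passes.
print(loads_consistent(order, lambda va: va))
# Synonym-aware rule: locations are synonym sets, so the stale load is rejected.
print(loads_consistent(order, lambda va: va_to_pa[va]))
```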

=== Causal consistency ===
[[Image:causal.jpg|thumb|right|300px|Causal Consistency example]]

Hutto and Ahamad <ref>P.W. Hutto and M. Ahamad. Slow memory: Weakening consistency to enhance concurrency in distributed shared memories. In Proceedings of the 10th International Conference on Distributed Computing Systems, pages 302–311, May 1990.</ref> introduced causal consistency. Lamport <ref>Leslie Lamport. Time, clocks, and the ordering of events in a distributed system. Communications of the ACM, 21(7):558–565, 1978.</ref> defined the notion of potential causality to capture the flow of information in a distributed system. This notion can be applied to a memory system by interpreting a write as a message-send event and a read as a message-read event. A memory is causally consistent if all processors agree on the order of causally related events. Causally unrelated events (concurrent events) can be observed in different orders.
The example shown here is a legal execution history under CC but not under SC. Note that W(x)1 and W(x)2 are causally related as P2 observed the first write by P1. Furthermore, P3 and P4 observe the accesses W(x)2 and W(x)3 in different orders, which would not be legal in SC.
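
The rule that only causally related writes must be seen in the same order can be sketched as follows. The function name and data layout are hypothetical, and the only causal pair encoded is the one stated in the text, W(x)1 before W(x)2; W(x)2 and W(x)3 are treated as concurrent.

```python
def causally_consistent(observed_orders, causal_pairs):
    """observed_orders maps each observer to the order in which it saw writes.
    causal_pairs lists (earlier, later) writes that are causally related."""
    for order in observed_orders.values():
        position = {write: i for i, write in enumerate(order)}
        for earlier, later in causal_pairs:
            if (earlier in position and later in position
                    and position[earlier] > position[later]):
                return False
    return True

causal_pairs = [("W(x)1", "W(x)2")]
views = {
    "P3": ["W(x)1", "W(x)3", "W(x)2"],
    "P4": ["W(x)1", "W(x)2", "W(x)3"],
}
# P3 and P4 disagree only on the concurrent writes, which CC allows.
print(causally_consistent(views, causal_pairs))

# Seeing W(x)2 before W(x)1 would violate causal consistency.
print(causally_consistent({"P5": ["W(x)2", "W(x)1"]}, causal_pairs))
```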

=== Delta consistency ===

The delta consistency model states that after a fixed time period δ, an update is propagated through the system and all replicas will be consistent. In other words, barring a short bounded interval after a modification, the result of any read operation is consistent with a read on the original copy of an object. If an object is modified, the read will not be consistent during the short period of time following its modification. Once the fixed time period elapses, the modification is propagated and the read is now consistent.
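
The bounded staleness window can be modeled with a toy replica. This is a sketch under assumptions: the class `DeltaReplica` is hypothetical, time is an abstract integer, and the replica shows the worst case the model permits, where an update becomes visible exactly δ after the write.

```python
class DeltaReplica:
    """Replica that reflects a write at most `delta` time units after it."""

    def __init__(self, initial, delta):
        self.value = initial
        self.delta = delta
        self.pending = []  # (time at which the update becomes visible, value)

    def receive(self, value, write_time):
        # Worst case allowed by the model: visible exactly delta after the write.
        self.pending.append((write_time + self.delta, value))

    def read(self, now):
        due = sorted((p for p in self.pending if p[0] <= now), key=lambda p: p[0])
        for _visible_at, value in due:
            self.value = value
        self.pending = [p for p in self.pending if p[0] > now]
        return self.value

replica = DeltaReplica(initial="old", delta=5)
replica.receive("new", write_time=10)
print(replica.read(now=12))  # inside the window: the read may be stale
print(replica.read(now=15))  # window elapsed: the read must see the update
```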

=== Entry consistency ===

This consistency model has been designed to be used with critical sections. Here, the programmer needs to use acquire and release at the start and end of each critical section, just like in both variants of release consistency. However, it also requires every ordinary shared variable to be associated with a synchronization variable such as a lock or a barrier. If the elements of an array need to be accessed independently in parallel, then each element of the array must be associated with a lock. When an acquire is done on a synchronization variable, only those ordinary shared variables guarded by that synchronization variable are made consistent. Release consistency does not associate shared variables with locks or barriers and at acquire time has to determine empirically which variables it needs. This is where entry consistency differs from release consistency.

Formally, a memory exhibits entry consistency if the following conditions are met:

* An acquire access of a synchronization variable is not allowed to perform with respect to a process until all updates to the guarded shared data have been performed with respect to that process.
* Before an exclusive mode access to a synchronization variable by a process is allowed to perform with respect to that process, no other process may hold the synchronization variable, not even in nonexclusive mode.
* After an exclusive mode access to a synchronization variable has been performed, any other process's next nonexclusive mode access to that synchronization variable may not be performed until it has performed with respect to that variable's owner.

=== Eventual consistency ===

Given a sufficiently long period of time over which no changes are sent, all updates can be expected to propagate eventually through the system. Thus, all the replicas will be consistent. As the consistency achieved is eventual, the possibility of conflicts is high. The conflicts have to be resolved. There are three types of resolution:
* Read repair: The correction is done when a read finds an inconsistency. This slows down the read operation.
* Write repair: The correction is done when a write operation finds an inconsistency, slowing down the write operation.
* Asynchronous repair: The correction is not part of a read or write operation.
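
Read repair can be sketched in a few lines. The version numbers and the rule that the highest version wins are assumptions introduced for this illustration, as is the requirement that at least one replica holds the key; the text above does not prescribe how conflicts are ordered.

```python
def read_with_repair(replicas, key):
    """Return the newest value for key and overwrite stale replicas.
    Each replica maps key -> (version, value)."""
    newest = max((r[key] for r in replicas if key in r),
                 key=lambda entry: entry[0])
    for r in replicas:
        if r.get(key) != newest:
            r[key] = newest  # the repair work is what slows the read down
    return newest[1]

replicas = [{"k": (2, "b")}, {"k": (1, "a")}, {}]
print(read_with_repair(replicas, "k"))
print(replicas)  # every replica now holds version 2
```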

=== Linearizability ===

This is also known as strict or atomic consistency. An operation is linearizable if it appears to the rest of the system to occur instantaneously. An atomic operation either occurs completely or does not occur at all. In reality, atomic operations do not actually occur instantaneously. Atomicity is enforced by mutual exclusion. At the software level, locks or semaphores are used to achieve this, while at the hardware level, a cache coherency protocol may be used. This makes it appear to the user that the entire operation occurred in a single instruction.

A sequence of invocations and responses made on an object by a set of threads is referred to as a history. Each invocation of a function is followed by a matching response.

Example:
Suppose two threads, A and B, attempt to acquire a lock, backing off if it is already taken.
This would be modeled as both threads invoking the lock operation, then both threads receiving a response, one successful, one not.
A calls lock
B calls lock
lock returns fail to A
lock returns success to B

When every call receives its response immediately, the history is called a sequential history. A history is linearizable if:

* its invocations and responses can be reordered to yield a sequential history
* that sequential history is correct according to the sequential definition of the object
* if a response preceded an invocation in the original history, it must still precede it in the sequential reordering

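Now consider two candidate reorderings of the above example:

Example:
One candidate would be:
A calls lock
lock returns fail to A
B calls lock
lock returns success to B

This is a sequential history, but it is not correct according to the sequential definition of a lock: the lock is free when A calls it, so A's call should have succeeded.

The other candidate would be:
B calls lock
lock returns success to B
A calls lock
lock returns fail to A

This is a correct sequential history, so the original history is linearizable.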

Thus, an object is linearizable if all valid histories of its use can be linearized.

=== PRAM consistency ===
[[Image:pram.jpg|thumb|right|300px|PRAM Consistency example]]

PRAM (pipelined RAM) consistency is also known as FIFO consistency. The reasoning that led to this model was as follows: Consider a multiprocessor where each processor has a local copy of the shared memory. For the memory to be scalable, the cost of an access should be independent of the time it takes to access the other processors’ memories. Lipton and Sandberg proposed that on a read, a PRAM would simply return the value stored in the local copy of the memory. On a write, it would update the local copy first and broadcast the new value to the other processors. Assuming a constant time for initiating a broadcast operation, the goal of making the cost for a read or write constant is thus achieved. In terms of ordering constraints, this is equivalent to requiring that all processors observe the writes from a single processor in the same order, while they may disagree on the order of writes by different processors.
The example shown is legal under PRAM but not under SC or CC. P3 and P4 observe the writes by P1 and P2 in different orders, although W(x)1 and W(x)2 are potentially causally related.

=== Release consistency ===

Release consistency is one of the consistency models used in the domain of concurrent programming (e.g. in distributed shared memory, distributed transactions etc.).

Systems of this kind are characterised by the existence of two special synchronisation operations, release and acquire. Before issuing a write to a memory object, a node must acquire the object via a special operation, and later release it. The code that runs between the acquire and the release therefore constitutes the critical region. The system is said to provide release consistency if all write operations by a certain node are seen by the other nodes after the former releases the object and before the latter acquire it.

There are two kinds of protocols that implement release consistency:
* ''eager'', where all coherence actions are performed on release operations, and
* ''lazy'', where all coherence actions are delayed until after a subsequent acquire

=== Sequential consistency ===
[[Image:sc.jpg|thumb|right|300px|Sequential Consistency example]]

Sequential consistency was first defined by Lamport in 1979. He defined a memory system to be sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.

In a sequentially consistent system, all processors must agree on the order of observed effects. The image to the right shows an example for SC.
Note that R(y)2 by processor P3 reads a value that has not been written yet! Of course, this is not possible in any real physical system. However, it shows a surprising flexibility of the SC model. Another reason why this is not a legal history for atomic consistency is that the write operations W(x)1 and W(y)2 appear commuted at processor P3.

Sequential consistency has been the canonical memory consistency model for a long time. However, many multiprocessor machines actually implement a slightly weaker model called processor consistency.

=== Weak consistency ===

A memory system is weakly consistent if it enforces the following restrictions:
* accesses to synchronization variables are sequentially consistent and
* no access to a synchronization variable is issued in a processor before all previous data accesses have been performed and
* no access is issued by a processor before a previous access to a synchronization variable has been performed
Notice that the meaning of “previous” is well-defined because it refers to program order. That is, an access A precedes access B if and only if the processor that executed access B has previously executed access A. Synchronizing accesses work as fences. At the time a synchronizing access performs, all previous accesses by that processor are guaranteed to have performed, and all subsequent accesses by that processor are guaranteed not to have performed. The synchronization model corresponding to these access order constraints is relatively simple. A program executing on a weakly consistent system appears sequentially consistent if the following two constraints are observed:
* There are no data races and
* Synchronization is visible to the memory system.

=== Strong consistency ===

Strong consistency is one of the consistency models used in the domain of concurrent programming (e.g. in distributed shared memory, distributed transactions etc.). Strong consistency is supported if:
* All accesses are seen by all parallel processes (or nodes, processors etc.) in the same order (sequentially)
* Only one consistent state can be observed, as opposed to weak consistency, where different parallel processes (or nodes etc.) can perceive variables in different states.

== Practical performance impact ==

[[Image:consistency_models.jpg|thumb|center|700px|Implementation of consistency models]]

Here, we define performance as the processor utilization achieved in an execution. The reason for using processor utilization as the distinguishing factor is that it provides reasonable results even when the program’s control path is not deterministic and depends on the relative timing of synchronization accesses. Let us make a comparative analysis of the performance achieved by the various consistency models on the LFC architecture (an aggressive implementation with lockup-free caches).

A BASE model has been added to the four consistency models, viz. Sequential Consistency (SC), Processor Consistency (PC), Release Consistency (RC) and Weak Consistency (WC). BASE is the most constrained model and is used as the baseline for all performance comparisons. It incorporates no buffering or pipelining and waits for each read and write to complete before proceeding.

=== Performance of SC versus BASE ===

The SC model does not perform significantly better than BASE. The performance gain from BASE to SC is small for most applications. This is because reads are expected to be closely interleaved with writes. Significant write clustering may occur sometimes, for example, when initializing data structures, but such occurrences are expected to be infrequent.

=== Performance of PC versus SC and BASE ===

In PC, sequential consistency is abandoned. The main benefit of this extra complexity comes from the fact that reads do not have to stall for pending writes to perform. However, some of the benefit may be lost if the write buffer fills up and stalls the processor. Given a reasonably deep write buffer, the PC model is relatively successful in hiding almost all of the latency of writes. Since the comparison of PC and WC is more involved than the comparison with RC, we next examine PC versus RC. Subsequently, we compare PC and WC.

=== Performance of PC versus RC ===

In addition to providing all the benefits of the PC model, RC allows pipelining of writes by exploiting information about the synchronization accesses. That is, writes can be retired from the write buffer before ownership has been obtained. The fact that writes are retired at a faster rate has two implications:
* the write buffer becoming full is less of a problem, and
* in cases where a release operation is behind several writes in the write buffer, that release can be observed sooner by a processor waiting to do an acquire.
In practice, the write buffer getting full is rarely a problem for PC. Therefore, most gains of RC over PC, if observed, will be due to faster completion of synchronization.

=== Performance of WC versus PC and RC ===

The differences between WC and RC arise because WC does not differentiate between acquire and release synchronization operations. Consequently, any synchronization operation must conservatively satisfy the constraints of both release and acquire. Thus, compared to RC, WC stalls the processor at an acquire until pending writes and releases complete. In addition, the processor is stalled for pending releases if it attempts to do a read operation.

On comparing WC and PC, we observe the surprising result that PC sometimes performs better than WC. WC has the advantage that writes can be retired from the write buffer at a faster rate. The disadvantage of WC relative to PC is the same as its disadvantage relative to RC: WC stalls the processor at some points until pending writes and releases have performed.

== Conclusion ==
==References==
<references/>

File:Consistency models.jpg

2012-04-04T23:30:21Z

Sshanbh: uploaded a new version of "File:Consistency models.jpg": Implementation of consistency models

Implementation of consistency models

CSC/ECE 506 Spring 2012/10b sr

2012-04-04T23:25:04Z

Sshanbh: /* Practical performance impact */

'''Use of consistency models in current multiprocessors'''

The memory consistency model of a shared memory system determines the order in which memory operations will appear to execute to the programmer. This article describes how consistency is used in multiprocessors today and later digs into the details of popular consistency models in use today. The impact of these models on the multiprocessor performance is also discussed. The article finishes off with a discussion about how the consistency models perform with larger multiprocessors.

== Introduction ==

Many modern computer systems and most multicore chips support shared memory in hardware. In a shared memory system, each of the processor cores may read and write to a single shared address space. These designs seek various goodness properties, such as high performance, low power, and low cost. Of course, it is not valuable to provide these goodness properties without first providing correctness. Correct shared memory seems intuitive at a hand-wave level, but there are subtle issues in even defining what it means for a shared memory system to be correct, as well as many subtle corner cases in designing a correct shared memory implementation. Moreover, these subtleties must be mastered in hardware implementations where bug fixes are expensive.

It is the job of consistency to define shared memory correctness. Consistency definitions provide rules about loads and stores (or memory reads and writes) and how they act upon memory. Ideally, consistency definitions would be simple and easy to understand. However, defining what it means for shared memory to behave correctly is more subtle than defining the correct behavior of, for example, a single-threaded processor core. The correctness criterion for a single processor core partitions behavior between one correct result and many incorrect alternatives. This is because the processor’s architecture mandates that the execution of a thread transforms a given input state into a single well-defined output state, even on an out-of-order core. Shared memory consistency models, however, concern the loads and stores of multiple threads and usually allow many correct executions while disallowing many incorrect ones. The possibility of multiple correct executions is due to the ISA allowing multiple threads to execute concurrently, often with many possible legal interleavings of instructions from different threads. The multitude of correct executions complicates the erstwhile simple challenge of determining whether an execution is correct. Nevertheless, consistency must be mastered to implement shared memory and, in some cases, to write correct programs that use it.

== Consistency in current-day multiprocessors ==

Today's scalable multiprocessors are mostly built with a distributed shared memory architecture. The memory is physically distributed but logically shared. In other words, the address spaces generated by all processing nodes globally form a single address space. The main advantage of such a system lies in combining the scalability of distributed hardware with the programmability of a shared virtual memory. Representative systems include the Cray T3D, the Stanford Dash, the MIT Alewife, the Teracomputer, and Convex SPP.

A scalable system should be able to hide the long latencies of remote memory accesses. Several latency hiding techniques have been proposed:
* Coherent caches reduce the frequency of remote memory accesses by caching data close to the processor.
* Relaxed memory consistency allows reordering of memory events and buffering or pipelining of remote memory accesses.
* Data prefetching attempts to hide long read latency by issuing read requests well ahead of time, with the expectation that the data will be available in the cache when it is referenced.
* Multithreading attempts to hide the long latency by context switching between several active threads, thus allowing the processor to perform useful work while waiting for remote requests or synchronization faults to complete.

== Consistency models used ==

=== Address translation aware memory consistency ===

These memory consistency models define the behavior of operations (loads, stores, memory barriers, etc.) on physical addresses and virtual addresses. The two important levels of memory consistency that can be classified as address translation aware are described below:

==== Physical address memory consistency (PAMC)====
Correct PAMC is necessary for unmapped code to work correctly. Unmapped software, including the boot code and the part of the system software that manages address translation (AT), relies upon PAMC. It is the responsibility of the hardware to implement PAMC, and this is specified precisely in the architectural manual.
It is not too difficult to adapt an AT-oblivious consistency model as the specification of PAMC.

Example:
The PAMC model could be sequential consistency (SC). In such a case the interface would specify that
(i) there must exist a total order of all loads and stores to physical addresses that respects the program order of each thread and
(ii) the value of each load is equal to the value of the most recent store to that physical address in the total order.

==== Virtual address memory consistency (VAMC)====
Correct VAMC is required for mapped code to work correctly.

Although adapting an AT-oblivious consistency model for PAMC is straightforward, there are a few challenges when adapting an AT-oblivious consistency model for VAMC:
* synonyms - Multiple virtual addresses may map to the same physical address. Suppose two virtual addresses VA1 and VA2 map to the same physical address PA. SC requires that the value of a load is equal to the value of the most recent store to the same address. A naive definition of VAMC would not consider the level of indirection introduced by AT: it would treat VA1 and VA2 as different addresses, so a load from VA2 would not be required to observe an earlier store to VA1 even though both access PA. To overcome this challenge, we re-formulate AT-oblivious consistency models for VAMC by applying the model to synonym sets of virtual addresses rather than individual addresses. For instance, we can define SC for VAMC as follows: there must exist a total order of all loads and stores to virtual addresses that respects program order and in which each load gets the value of the most recent store to any virtual address in the same virtual address synonym set. Incorporating synonyms explicitly in the consistency model allows programmers to reason about the ordering of accesses to virtual addresses.
* mapping and permission changes - Another challenge is that the set of memory operations at the VAMC level is richer than at the PAMC level, because it also includes operations that change address mappings and access permissions.
* load/store side effects - Yet another challenge in specifying VAMC is that loads and stores to virtual addresses have certain side effects. The AT system includes status bits such as accessed and dirty bits for each page table entry. These status bits are part of the architectural state, and the ordering of updates to those bits must thus be specified in VAMC. To achieve this we add two new operations to the specification tables: Ld-sb (load’s impact on status bits) and St-sb (store’s impact on status bits).
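The synonym problem described in the list above can be illustrated with a small sketch. It is only a toy model, not taken from any real architecture: the class name <code>TranslatedMemory</code>, the page table contents, and the default value 0 for unwritten locations are all assumptions made for the example.

```python
from collections import defaultdict


class TranslatedMemory:
    """Toy memory in which several virtual addresses may map to one physical address."""

    def __init__(self, page_table):
        # Hypothetical page table: virtual address -> physical address.
        self.page_table = dict(page_table)
        self.physical = {}

    def store(self, va, value):
        # A store through any synonym updates the single physical location.
        self.physical[self.page_table[va]] = value

    def load(self, va):
        # Unwritten locations read as 0 (an assumption of this sketch).
        return self.physical.get(self.page_table[va], 0)

    def synonym_sets(self):
        # Group virtual addresses that translate to the same physical address.
        groups = defaultdict(set)
        for va, pa in self.page_table.items():
            groups[pa].add(va)
        return list(groups.values())


mem = TranslatedMemory({"VA1": "PA", "VA2": "PA", "VA3": "PB"})
mem.store("VA1", 42)
value_via_synonym = mem.load("VA2")  # observes the store made through VA1
```

Because VA1 and VA2 belong to the same synonym set, a VAMC definition stated over synonym sets requires the load from VA2 to return the value stored through VA1, which is what the toy model does.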

=== Causal consistency ===
[[Image:causal.jpg|thumb|right|300px|Causal Consistency example]]

Hutto and Ahamad <ref>P.W. Hutto and M. Ahamad. Slow memory: Weakening consistency to enhance concurrency in distributed shared memories. In Proceedings of the 10th International Conference on Distributed Computing Systems, pages 302–311, May 1990.</ref> introduced causal consistency (CC). Lamport <ref>Leslie Lamport. Time, clocks, and the ordering of events in a distributed system. Communications of the ACM, 21(7):558–565, 1978.</ref> defined the notion of potential causality to capture the flow of information in a distributed system. This notion can be applied to a memory system by interpreting a write as a message-send event and a read as a message-read event. A memory is causally consistent if all processors agree on the order of causally related events. Causally unrelated events (concurrent events) can be observed in different orders.
The example shown here is a legal execution history under CC but not under SC. Note that W(x)1 and W(x)2 are causally related, as P2 observed the first write by P1. Furthermore, P3 and P4 observe the accesses W(x)2 and W(x)3 in different orders, which would not be legal in SC.
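One common way to decide whether two writes are causally related or concurrent is to tag them with vector clocks. The sketch below is an illustration under assumptions: the clock values are hypothetical, each clock has one entry for P1 and one for P2, and W(x)3 is assumed to be issued by P1 after W(x)1 (the text above does not state which processor issues it).

```python
def happened_before(a, b):
    """True if vector clock a is causally before vector clock b."""
    return all(x <= y for x, y in zip(a, b)) and a != b


def concurrent(a, b):
    """True if neither clock is causally before the other."""
    return a != b and not happened_before(a, b) and not happened_before(b, a)


# Hypothetical clocks, one entry per writer: (P1, P2).
w_x1 = (1, 0)  # P1 writes 1
w_x2 = (1, 1)  # P2 has read the value 1, then writes 2
w_x3 = (2, 0)  # assumed: P1 writes 3 without having seen P2's write

related = happened_before(w_x1, w_x2)
unordered = concurrent(w_x2, w_x3)
```

Under these assumptions W(x)1 precedes W(x)2, so every processor must see them in that order, while W(x)2 and W(x)3 are concurrent and may be seen in different orders by different processors.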

=== Delta consistency ===

The delta consistency model states that after a fixed time period δ, an update is propagated through the system and all replicas will be consistent. In other words, barring a short bounded interval after a modification, the result of any read operation is consistent with a read on the original copy of an object. If an object is modified, the read will not be consistent during the short period of time following its modification. Once the fixed time period elapses, the modification is propagated and the read is now consistent.
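A minimal sketch of this behaviour, assuming a single primary copy, a single replica, logical timestamps supplied by the caller in non-decreasing order, and the worst case in which an update only becomes visible exactly δ time units after the write. The class and method names are hypothetical.

```python
class DeltaReplica:
    """Toy replica that lags the primary copy by at most `delta` time units."""

    def __init__(self, delta, initial=0):
        self.delta = delta
        self.primary = initial
        self.replica = initial
        self.pending = []  # (time at which the update becomes visible, value)

    def write(self, now, value):
        # The primary is updated at once; the replica sees it delta later.
        self.primary = value
        self.pending.append((now + self.delta, value))

    def read_replica(self, now):
        # Apply every update whose delta window has elapsed.
        while self.pending and self.pending[0][0] <= now:
            _, self.replica = self.pending.pop(0)
        return self.replica


store = DeltaReplica(delta=5)
store.write(now=10, value=7)
stale = store.read_replica(now=12)  # inside the window: may differ from the primary
fresh = store.read_replica(now=15)  # window elapsed: agrees with the primary
```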

=== Entry consistency ===

This consistency model has been designed to be used with critical sections. Here, the programmer needs to use acquire and release at the start and end of each critical section, just as in both variants of release consistency. However, it also requires every ordinary shared variable to be associated with a synchronization variable such as a lock or a barrier. If the elements of an array need to be accessed independently in parallel, then each element of the array must be associated with its own lock. When an acquire is done on a synchronization variable, only those ordinary shared variables guarded by that synchronization variable are made consistent. Release consistency, in contrast, does not associate shared variables with locks or barriers and at acquire time has to determine empirically which variables it needs. This is where entry consistency differs from release consistency.

Formally, a memory exhibits entry consistency if the following conditions are met:

* An acquire access of a synchronization variable is not allowed to perform with respect to a process until all updates to the guarded shared data have been performed with respect to that process.
* Before an exclusive mode access to a synchronization variable by a process is allowed to perform with respect to that process, no other process may hold the synchronization variable, not even in nonexclusive mode.
* After an exclusive mode access to a synchronization variable has been performed, any other process's next nonexclusive mode access to that synchronization variable may not be performed until it has performed with respect to that variable's owner.
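The key idea, that an acquire only makes the guarded variables consistent, can be sketched as follows. This is a simplified illustration, not a real protocol: the names <code>EntryConsistentNode</code>, <code>home</code>, and <code>guards</code> are hypothetical, the shared "home" copy is a plain dictionary, and mutual exclusion and the exclusive/nonexclusive modes are not modelled.

```python
class EntryConsistentNode:
    """Toy node that refreshes only the variables guarded by the acquired lock."""

    def __init__(self, home, guards):
        self.home = home      # authoritative copies, shared by all nodes
        self.guards = guards  # lock name -> set of variable names it guards
        self.local = {}       # this node's cached copies

    def acquire(self, lock):
        # Bring in only the data associated with this synchronization variable.
        for var in self.guards[lock]:
            if var in self.home:
                self.local[var] = self.home[var]

    def release(self, lock):
        # Publish only the guarded data back to the authoritative copy.
        for var in self.guards[lock]:
            if var in self.local:
                self.home[var] = self.local[var]


home = {"a": 1, "b": 2}
guards = {"lock_a": {"a"}, "lock_b": {"b"}}
node = EntryConsistentNode(home, guards)

node.acquire("lock_a")
cached_after_acquire = dict(node.local)  # only "a" was fetched
node.local["a"] = 5
node.release("lock_a")
```

The variable <code>b</code> is never transferred, because no lock guarding it was acquired. This is the saving that entry consistency aims for compared with release consistency.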

=== Eventual consistency ===

Given a sufficiently long period of time over which no changes are sent, all updates can be expected to propagate eventually through the system. Thus, all the replicas will be consistent. As the consistency achieved is eventual, the possibility of conflicts is high. The conflicts have to be resolved. There are three types of resolution:
* Read repair: The correction is done when a read finds an inconsistency. This slows down the read operation.
* Write repair: The correction is done when a write operation finds an inconsistency, slowing down the write operation.
* Asynchronous repair: The correction is not part of a read or write operation.
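Read repair, the first strategy in the list above, can be sketched as follows. The sketch assumes that each replica is a dictionary mapping a key to a <code>(version, value)</code> pair and that a higher version number always wins; both the data layout and the function name are assumptions made for illustration.

```python
def read_with_repair(replicas, key):
    """Return the newest value for key and overwrite stale or missing copies.

    replicas: list of dicts mapping key -> (version, value).
    Raises ValueError if no replica holds the key.
    """
    newest = max((r[key] for r in replicas if key in r), key=lambda vv: vv[0])
    for r in replicas:
        if r.get(key) != newest:
            r[key] = newest  # the repair work is paid for by this read
    return newest[1]


replicas = [{"x": (2, "b")}, {"x": (1, "a")}, {}]
result = read_with_repair(replicas, "x")
```

The extra loop over all replicas is the reason a read becomes slower under read repair.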

=== Linearizability ===

This is also known as strict or atomic consistency. An operation is linearizable if it appears to the rest of the system to occur instantaneously. An atomic operation either occurs completely or does not occur at all. In reality, atomic operations do not actually occur instantaneously. Atomicity is enforced by mutual exclusion. At the software level, locks or semaphores are used to achieve this, while at the hardware level, a cache coherency protocol may be used. This makes it appear to the user that the entire operation occurred in a single instruction.

A sequence of invocations and responses made of an object by a set of threads is referred to as a history. Each invocation of a function is followed by a subsequent response.

Example:
Suppose two threads, A and B, attempt to acquire a lock, backing off if it is already taken. This would be modeled as both threads invoking the lock operation, then both threads receiving a response, one successful and one not:
* A calls lock
* B calls lock
* lock returns fail to A
* lock returns success to B

A history in which every invocation is immediately followed by its response is called a sequential history. A history is linearizable if:

* its invocations and responses can be reordered to yield a sequential history
* that sequential history is correct according to the sequential definition of the object
* if a response preceded an invocation in the original history, it must still precede it in the sequential reordering

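Now we can reorder the above example in two ways as follows:

Example:
One way would be:
* A calls lock
* lock returns fail to A
* B calls lock
* lock returns success to B
This is a sequential history, but it is not correct according to the sequential definition of a lock: A is the first caller, so it should have obtained the lock.

The other way would be:
* B calls lock
* lock returns success to B
* A calls lock
* lock returns fail to A
This sequential history is correct, and no response preceded an invocation in the original history, so it is a valid linearization.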

Thus, an object is linearizable if all valid histories of its use can be linearized.
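The three conditions can be checked mechanically for small histories by trying every reordering. The following brute-force sketch does this for the lock example; it assumes a try-lock that is never released, and it represents each call as a hypothetical tuple <code>(thread, invoked_at, responded_at, succeeded)</code> whose timestamps are just positions in the original history.

```python
from itertools import permutations


def matches_lock_spec(order):
    """Sequential definition: the call succeeds exactly when the lock is free."""
    held = False
    for _thread, _invoked, _responded, succeeded in order:
        if succeeded == held:  # success on a held lock, or failure on a free one
            return False
        held = held or succeeded
    return True


def is_linearizable(ops):
    """Try every sequential order of the calls (only practical for tiny histories)."""
    for order in permutations(ops):
        # A call that responded before another was invoked must stay ahead of it.
        respects_time = all(
            not (later[2] < earlier[1])
            for i, earlier in enumerate(order)
            for later in order[i + 1:]
        )
        if respects_time and matches_lock_spec(order):
            return True
    return False


# The history from the example: A calls, B calls, A fails, B succeeds.
overlapping = [("A", 0, 2, False), ("B", 1, 3, True)]
# A variant in which A's failure is returned before B even calls lock.
non_overlapping = [("A", 0, 1, False), ("B", 2, 3, True)]

overlapping_ok = is_linearizable(overlapping)
non_overlapping_ok = is_linearizable(non_overlapping)
```

The first history is linearizable because B's call can be placed before A's. The second is not: A's response precedes B's invocation, so A must stay first, and a failing call on a free lock contradicts the sequential definition.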

=== PRAM consistency ===
[[Image:pram.jpg|thumb|right|300px|PRAM Consistency example]]

PRAM consistency is also known as FIFO consistency. The reasoning that led to this model was as follows: consider a multiprocessor where each processor has a local copy of the shared memory. For the memory to be scalable, the cost of an access should be independent of the time it takes to access the other processors' memories. The model's authors proposed that on a read, a PRAM would simply return the value stored in the local copy of the memory. On a write, it would update the local copy first and broadcast the new value to the other processors. Assuming a constant time for initiating a broadcast operation, the goal of making the cost for a read or write constant is thus achieved. In terms of ordering constraints, this is equivalent to requiring that all processors observe the writes from a single processor in the same order, while they may disagree on the order of writes by different processors.
The example shown is legal under PRAM but not under SC or CC. P3 and P4 observe the writes by P1 and P2 in different orders, although W(x)1 and W(x)2 are potentially causally related.
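The ordering constraint can be expressed as a small check: for every observer, the writes of each individual writer must appear in the order in which that writer issued them. The sketch below assumes that every write carries a unique label; the function name and the dictionary layout are hypothetical.

```python
def is_pram_consistent(program_order, observed):
    """Check that each observer sees every writer's writes in issue order.

    program_order: writer -> list of its writes in the order they were issued.
    observed: observer -> list of writes in the order that observer saw them.
    """
    for seen in observed.values():
        for writes in program_order.values():
            # This writer's writes, in the order the observer saw them ...
            as_seen = [w for w in seen if w in writes]
            # ... must match the order in which the writer issued them.
            as_issued = [w for w in writes if w in seen]
            if as_seen != as_issued:
                return False
    return True


# P3 and P4 disagree on the order of writes by different processors: allowed.
different_writers = is_pram_consistent(
    {"P1": ["W(x)1"], "P2": ["W(x)2"]},
    {"P3": ["W(x)1", "W(x)2"], "P4": ["W(x)2", "W(x)1"]},
)

# P3 sees two writes by the same processor out of order: not allowed.
same_writer = is_pram_consistent(
    {"P1": ["W(x)1", "W(x)2"]},
    {"P3": ["W(x)2", "W(x)1"]},
)
```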

=== Release consistency ===

Release consistency is one of the consistency models used in the domain of concurrent programming (e.g. in distributed shared memory, distributed transactions etc.).

Systems of this kind are characterised by the existence of two special synchronisation operations, release and acquire. Before issuing a write to a memory object, a node must acquire the object via a special operation, and later release it. The code that runs between the acquire and the release therefore constitutes the critical region. The system is said to provide release consistency if all write operations by a certain node are seen by the other nodes after the former releases the object and before the latter acquire it.

There are two kinds of protocols that implement release consistency:
* ''eager'', where all coherence actions are performed on release operations, and
* ''lazy'', where all coherence actions are delayed until after a subsequent acquire

=== Sequential consistency ===
[[Image:sc.jpg|thumb|right|300px|Sequential Consistency example]]

Sequential consistency was first defined by Lamport in 1979. He defined a memory system to be sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.

In a sequentially consistent system, all processors must agree on the order of observed effects. The image to the right shows an example for SC. Note that R(y)2 by processor P3 reads a value that has not been written yet! Of course, this is not possible in any real physical system. However, it shows a surprising flexibility of the SC model. Another reason why this is not a legal history for atomic consistency is that the write operations W(x)1 and W(y)2 appear commuted at processor P3.

Sequential consistency has been the canonical memory consistency model for a long time. However, many multiprocessor machines actually implement a slightly weaker model called processor consistency.
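Lamport's definition can be turned directly into a brute-force check for very small histories: enumerate every interleaving that preserves each processor's program order and test whether every read returns the most recent write. The sketch below is only an illustration. The tuple encoding <code>("W" or "R", variable, value)</code>, the initial value 0, and both sample histories are assumptions; the first history is modelled on the description of the figure, not copied from it.

```python
def interleavings(threads):
    """Yield every merge of the per-processor lists that keeps each list's order."""
    if all(not t for t in threads):
        yield []
        return
    for i, t in enumerate(threads):
        if t:
            rest = threads[:i] + [t[1:]] + threads[i + 1:]
            for tail in interleavings(rest):
                yield [t[0]] + tail


def _step(memory, op, initial):
    """Apply a write, or check that a read returns the latest written value."""
    kind, var, value = op
    if kind == "W":
        memory[var] = value
        return True
    return memory.get(var, initial) == value


def is_sequentially_consistent(threads, initial=0):
    """threads: list of per-processor lists of ("W" | "R", variable, value)."""
    for order in interleavings(threads):
        memory = {}
        if all(_step(memory, op, initial) for op in order):
            return True
    return False


# Assumed reconstruction: P3 sees y == 2 and x == 0 before it sees x == 1.
figure_like = [
    [("W", "x", 1)],
    [("W", "y", 2)],
    [("R", "y", 2), ("R", "x", 0), ("R", "x", 1)],
]
# Each processor writes one variable, then reads 0 from the other one.
store_buffering = [
    [("W", "x", 1), ("R", "y", 0)],
    [("W", "y", 1), ("R", "x", 0)],
]

figure_like_ok = is_sequentially_consistent(figure_like)
store_buffering_ok = is_sequentially_consistent(store_buffering)
```

The first history is accepted because the order W(y)2, R(y)2, R(x)0, W(x)1, R(x)1 explains it. The second is rejected because no interleaving lets both reads return 0. The number of interleavings grows very quickly, so this approach is only useful for reasoning about tiny examples.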

=== Weak consistency ===

A memory system is weakly consistent if it enforces the following restrictions:
* accesses to synchronization variables are sequentially consistent and
* no access to a synchronization variable is issued in a processor before all previous data accesses have been performed and
* no access is issued by a processor before a previous access to a synchronization variable has been performed
Notice that the meaning of “previous” is well-defined because it refers to program order. That is, an access A precedes access B if and only if the processor that executed access B has previously executed access A. Synchronizing accesses work as fences. At the time a synchronizing access performs, all previous accesses by that processor are guaranteed to have performed, and all later accesses by that processor are guaranteed not to have performed. The synchronization model corresponding to these access order constraints is relatively simple. A program executing on a weakly consistent system appears sequentially consistent if the following two constraints are observed:
* There are no data races and
* Synchronization is visible to the memory system.

=== Strong consistency ===

Strong consistency is one of the consistency models used in the domain of concurrent programming (e.g. in distributed shared memory, distributed transactions etc.). Strong consistency is supported if:
* All accesses are seen by all parallel processes (or nodes, processors etc.) in the same order (sequentially)
* Only one consistent state can be observed. This is in contrast to weak consistency, where different parallel processes (or nodes etc.) can perceive variables in different states.

== Practical performance impact ==

[[Image:consistency_models.jpg|thumb|center|700px|Implementation of consistency models]]

Here, we define performance as the processor utilization achieved in an execution.

== Conclusion ==
==References==
<references/>

CSC/ECE 506 Spring 2012/10b sr

2012-04-04T23:24:32Z

Sshanbh: /* Practical performance impact */

'''Use of consistency models in current multiprocessors'''

The memory consistency model of a shared memory system determines the order in which memory operations will appear to execute to the programmer. This article describes how consistency is used in multiprocessors today and later digs into the details of popular consistency models in use today. The impact of these models on the multiprocessor performance is also discussed. The article finishes off with a discussion about how the consistency models perform with larger multiprocessors.

== Introduction ==

Many modern computer systems and most multicore chips support shared memory in hardware. In a shared memory system, each of the processor cores may read and write to
a single shared address space. These designs seek various goodness properties, such as high performance, low power, and low cost. Of course, it is not valuable to provide these goodness properties without first providing correctness. Correct shared memory seems intuitive at a hand-wave level, but, there are subtle issues in even defining what it means for a shared memory system to be correct, as well as many subtle corner cases in designing a correct shared memory implementation. Moreover, these subtleties must be mastered in hardware implementations where bug fixes are expensive.

It is the job of consistency to define shared memory correctness. Consistency definitions provide rules about loads and stores (or memory reads and writes) and how they act upon memory. Ideally, consistency definitions would be simple and easy to understand. However, defining what it means for shared memory to behave correctly is more subtle than defining the correct behavior of, for example, a single-threaded processor core. The correctness criterion for a single processor core partitions behavior between one correct result and many incorrect alternatives. This is because the processor’s architecture mandates that the execution of a thread transforms a given input state into a single well-defined output state, even on an out-of-order core. Shared memory consistency models, however, concern the loads and stores of multiple threads and usually allow many correct executions while disallowing many incorrect ones. The possibility of multiple correct executions is due to the ISA allowing multiple threads to execute concurrently, often with many possible legal interleavings of instructions from different threads. The multitude of correct executions complicates the erstwhile simple challenge of determining whether an execution is correct. Nevertheless, consistency must be mastered to implement shared memory and, in some cases, to write correct programs that use it.

== Consistency in current-day multiprocessors ==

Today's scalable multiprocessors are mostly built with a distributed shared memory architecture. The memory physically distributed but logically shared. In other words, the address spaces generated by all processing nodes globally form a single address space. The main advantage of such a system lies in the scalability with distributed hardware and programmability of a shared virtual memory. Representative systems include the Cray T3D, the Stanford Dash, the MIT Alewife , the Teracomputer, and Convex SPP.

A scalable systems should be able to hide the long latencies of remote memory accesses. Several latency hiding techniques have been proposed: Coherent caches reduce frequency of remote memory accesses by caching data close to the processor. Relaxed memory consistency allows reordering of memory events and buffering or pipelining of remote memory accesses. Data prefetching attempts to hide long read latency by issuing read requests well ahead of time, with the exception that the data will be available in the cache when it is referenced. Multithreading attempts to hide the long latency by context switching between several active threads, thus allowing the processor to perform useful work while waiting for remote requests or synchronization faults to complete.

== Consistency models used ==

=== Address translation aware memory consistency ===

These memory consistency models define the behavior of operations (loads, stores, memory barriers, etc.) on physical addresses and virtual addresses. The two important levels of memory consistency that can be classified as address translation aware are described below:

==== Physical address memory consistency (PAMC)====
It is necessary to have correct PAMC for unmapped code to work correctly. Unmapped software, including the boot code and part of the system software that manages AT, relies upon PAMC. It is the responsibility of the hardware to implement PAMC and this is specified precisely in the architectural manual.
It is not too difficult to adapt an AT-oblivious consistency model as the specification of PAMC.

Example:
The PAMC model could be SC. In such a case the interface would specify that
(i) there must exist a total order of all loads and stores to physical addresses that respects the program order of each thread and
(ii) the value of each load is equal to the value of the most recent store to that physical address in the total order.

==== Virtual address memory consistency (VAMC)====
Correct VAMC is required for mapped code to work correctly.

Although adapting an AT-oblivious consistency model for PAMC is straightforward, there are a few challenges when adapting an AT-oblivious consistency model for VAMC:
* synonyms - Multiple virtual addresses may map to the same physical address. Suppose two virtual addresses VA1 and VA2 map to the same physical address PA. SC requires that the value of a load is equal to the value of the most recent store to the same address. It is possible to have a naive definition of VAMC that does not consider the level of indirection introduced by AT. To overcome this challenge, we re-formulate AT-oblivious consistency models for VAMC by applying the model to synonym sets of virtual addresses rather than individual addresses. For instance, we can define SC for VAMC as follows: there must exist a total order of all loads and stores to virtual addresses that respects program order and in which each load gets the value of the most recent store to any virtual address in the same virtual address synonym set. Incorporating synonyms explicitly in the consistency model allows programmers to reason about the ordering of accesses to virtual addresses.
* mapping and permission changes - Another challenge is that the set of memory operations at the VAMC level is richer than at the PAMC level.
* load/store side effects - Yet another challenge in specifying VAMC is that loads and stores to virtual addresses have certain side effects. The AT system includes status bits such as accessed and dirty bits for each page table entry. These status bits are part of the architectural state, and the ordering of updates to those bits must thus be specified in VAMC. To achieve this we add two new operations to the specification tables: Ld-sb (load’s impact on status bits) and St-sb (store’s impact on status bits).

=== Causal consistency ===
[[Image:causal.jpg|thumb|right|300px|Causal Consistency example]]

Hutto and Ahamad <ref>P.W. Hutto andM. Ahamad. Slowmemory: Weakening consistency to enhance concurrency in distributed shared memories. In
Proceedings of the 10th International Conference on Distributed Computing Systems, pages 302–311,May 1990.</ref> introduced causal consistency. Lamport <ref>Leslie Lamport. Time, clocks, and the ordering of events in a distributed system. Communications of the ACM, 21(7):558–565, 1978.</ref> defined the notion of potential causality to capture the flow of information in a distributed system. This notion can be applied to a memory system by interpreting a write as a message-send event and a read as a message-read event. A memory is causally consistent if all processors agree on the order of causally related events. Causally unrelated events (concurrent events) can be observed in different orders.
The example shown here is a legal execution history under CC but not under SC. Note that W(x)1 and W(x)2 are causally related as P2 observed the first write by P1. Furthermore, P3 and P4 observe the accesses W(x)2 and W(x)3 in different orders, which would not be legal in SC.

=== Delta consistency ===

The delta consistency model states that after a fixed time period δ, an update is propagated through the system and all replicas will be consistent. In other words, barring a short bounded interval after a modification, the result of any read operation is consistent with a read on the original copy of an object. If an object is modified, the read will not be consistent during the short period of time following its modification. Once the fixed time period elapses, the modification is propagated and the read is now consistent.

=== Entry consistency ===

This consistency model has been designed to be used with critical sections. Here, the programmer needs to use acquire and release at the start and end of each critical section just like in both variants of release consistency. However it also required every ordinary shared variable to be associated with a synchronization variable such as a lock or a barrier. If the elements of an array need to be accessed independently in parallel, then each element of the array must be associated with a lock. When an acquire is done on a synchronization variable, only those ordinary shared variables guarded by that synchronization variable are made consistent. Release consistency does not associate shared variables with locks or barriers and at acquire time has to determine empirically which variables it needs. This is where entry consistency differs from release consistency.

Formally, a memory exhibits entry consistency if the following conditions are met:

* An acquire access of a synchronization variable is not allowed to perform with respect to a process until all updates to the guarded shared data have been performed with respect to that process.
* Before an exclusive mode access to a synchronization variable by a process is allowed to perform with respect to that process, no other process may hold the synchronization variable, not even in nonexclusive mode.
* After an exclusive mode access to a synchronization variable has been performed, any other process next nonexclusive mode access to that synchronization variable may not be performed until it has performed with respect to that variable's owner.

=== Eventual consistency ===

Given a sufficiently long period of time over which no changes are sent, all updates can be expected to propagate eventually through the system. Thus, all the replicas will be consistent. As the consistency achieved is eventual, the possibility of conflicts is high. The conflicts have to be resolved. There are three types of resolution:
* Read repair: The correction is done when a read finds an inconsistency. This slows down the read operation.
* Write repair: The correction is done when a write operation finds an inconsistency, slowing down the write operation.
* Asynchronous repair: The correction is not part of a read or write operation.

=== Linearizability ===

This is also known as strict or atomic consistency. An operation is linearizable if it appears to the rest of the system to occur instantaneously. An atomic operation either occurs completely or does not occur at all. In reality, atomic operations do not actually occur instantaneously. Atomicity is enforced by mutual exclusion. At the software level, locks or semaphores are used to achieve this, while at the hardware level, a cache coherency protocol maybe used. This makes it appear to the user that the entire operation occurred in a single instruction.

A sequence of invocations and responses made of an object by a set of threads is referred to as history. When a function is invoked, a subsequent response is generated.

Example:
Suppose two threads, A and B attempt to acquire a lock, backing off if it's already taken.
This would be modeled as both threads invoking the lock operation, then both threads receiving a response, one successful, one not.
A calls lock
B calls lock
lock returns fail to A
lock returns success to B

When all calls make immediate responses, the history is called sequential history. A history that is linearizable is:

* its invocations and responses can be reordered to yield a sequential history
* that sequential history is correct according to the sequential definition of the object
* if a response preceded an invocation in the original history, it must still precede it in the sequential reordering

Now we can reorder the above example in two ways as follows:

Example:
One way would be:
A calls lock
lock returns fail to A
B calls lock
lock returns success to B

The other way would be:
B calls lock
lock returns success to B
A calls lock
lock returns fail to A

Thus, an object is linearizable if all valid histories of its use can be linearized.

=== PRAM consistency ===
[[Image:pram.jpg|thumb|right|300px|PRAM Consistency example]]

It is also known as FIFO consistency. The reasoning that led to this model was as follows: Consider a multi-processor where each processor has a
local copy of the shared memory. For the memory to be scalable, an access should be independent of the time it takes to access the other processors’ memories. They proposed that on a read, a PRAM would simply return the value stored in the local copy of the memory. On a write,
it would update the local copy first and broadcast the new value to the other processors. Assuming a constant time for initiating a broadcast operation, the goal of making the cost for a read or write constant is thus achieved. In terms of ordering constraints, this is equivalent to
requiring that all processors observe the writes from a single processor in the same order while they may disagree on the order of writes by different processors.
The example shown is legal under PRAM but not under SC or CC. P3 and P4 observe the writes by P1 and P2 in different orders, although W(x)1 and W(x)2 are potentially causally related.

=== Release consistency ===

Release consistency is one of the consistency models used in the domain of the concurrent programming (e.g. in distributed shared memory, distributed transactions etc.).

Systems of this kind are characterised by the existence of two special synchronisation operations, release and acquire. Before issuing a write to a memory object a node must acquire the object via a special operation, and later release it. Therefore the application that runs within the operation acquire and release constitutes the critical region. The system is said to provide release consistency, if all write operations by a certain node are seen by the other nodes after the former releases the object and before the latter acquire it.

There are two kinds of protocols that implement release consistency:
* ''eager'', where all coherence actions are performed on release operations, and
* ''lazy'', where all coherence actions are delayed until after a subsequent acquire

=== Sequential consistency ===
[[Image:sc.jpg|thumb|right|300px|PRAM Consistency example]]

Sequential consistency was first defined by Lamport in 1979. He defined a memory system to be sequentially consistent if:
the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations
of each individual processor appear in this sequence in the order specified by it's program.

In a sequentially consistent system, all processors must agree on the order of observed effects. The image to the right shows an example for SC.
Note that R(y)2 by processor P3 reads a value that has not been written yet! Of course, this is not possible in any real physical system. However, it shows a surprising flexibility of the SC model. Another reason why this is not a legal history for atomic consistency is that the
write operations W(x)1 and W(y)2 appear commuted at processor P3.

Sequential consistency has been the canonical memory consistency model for a long time. However, many multiprocessor machines actually implement a slightly weaker model called processor consistency.

=== Weak consistency ===

A memory system is weakly consistent if it enforces the following restrictions:
* accesses to synchronization variables are sequentially consistent and
* no access to a synchronization variable is issued in a processor before all previous data accesses have been performed and
* no access is issued by a processor before a previous access to a synchronization variable has been performed
Notice that the meaning of “previous” is well-defined because it refers to program order. That is, an access A precedes access B if and only if the processor that executed access B has previously executed access A. Synchronizing accesses work as fences. At the time a synchronizing access performs, all previous accesses by that processor are guaranteed to have performed. The synchronization model corresponding to these access order constraints is relatively simple. A program executing on a weakly consistent system appears sequentially consistent if the following two constraints are observed:
* There are no data races and
* Synchronization is visible to the memory system.

=== Strong consistency ===

Strong consistency is one of the consistency models used in the domain of concurrent programming (e.g. in distributed shared memory, distributed transactions etc.). Strong consistency is supported if:
* All accesses are seen by all parallel processes (or nodes, processors etc.) in the same order (sequentially)
* Only one consistent state can be observed. In contrast, under weak consistency different parallel processes (or nodes etc.) may perceive variables in different states.

== Practical performance impact ==

[[Image:consistency_models.jpg|thumb|center|500px|Implementation of consistency models]]

Here, we define performance as the processor utilization achieved in an execution.

== Conclusion ==
==References==
<references/>

CSC/ECE 506 Spring 2012/10b sr

2012-04-04T23:23:53Z

Sshanbh: /* Practical performance impact */

'''Use of consistency models in current multiprocessors'''

The memory consistency model of a shared memory system determines the order in which memory operations will appear to execute to the programmer. This article describes how consistency is used in multiprocessors today and later digs into the details of popular consistency models in use today. The impact of these models on the multiprocessor performance is also discussed. The article finishes off with a discussion about how the consistency models perform with larger multiprocessors.

== Introduction ==

Many modern computer systems and most multicore chips support shared memory in hardware. In a shared memory system, each of the processor cores may read and write to a single shared address space. These designs seek various goodness properties, such as high performance, low power, and low cost. Of course, it is not valuable to provide these goodness properties without first providing correctness. Correct shared memory seems intuitive at a hand-wave level, but there are subtle issues in even defining what it means for a shared memory system to be correct, as well as many subtle corner cases in designing a correct shared memory implementation. Moreover, these subtleties must be mastered in hardware implementations where bug fixes are expensive.

It is the job of consistency to define shared memory correctness. Consistency definitions provide rules about loads and stores (or memory reads and writes) and how they act upon memory. Ideally, consistency definitions would be simple and easy to understand. However, defining what it means for shared memory to behave correctly is more subtle than defining the correct behavior of, for example, a single-threaded processor core. The correctness criterion for a single processor core partitions behavior between one correct result and many incorrect alternatives. This is because the processor’s architecture mandates that the execution of a thread transforms a given input state into a single well-defined output state, even on an out-of-order core. Shared memory consistency models, however, concern the loads and stores of multiple threads and usually allow many correct executions while disallowing many incorrect ones. The possibility of multiple correct executions is due to the ISA allowing multiple threads to execute concurrently, often with many possible legal interleavings of instructions from different threads. The multitude of correct executions complicates the erstwhile simple challenge of determining whether an execution is correct. Nevertheless, consistency must be mastered to implement shared memory and, in some cases, to write correct programs that use it.

== Consistency in current-day multiprocessors ==

Today's scalable multiprocessors are mostly built with a distributed shared memory architecture. The memory is physically distributed but logically shared. In other words, the address spaces generated by all processing nodes globally form a single address space. The main advantage of such a system lies in the scalability with distributed hardware and programmability of a shared virtual memory. Representative systems include the Cray T3D, the Stanford Dash, the MIT Alewife, the Teracomputer, and Convex SPP.

A scalable system should be able to hide the long latencies of remote memory accesses. Several latency hiding techniques have been proposed: Coherent caches reduce the frequency of remote memory accesses by caching data close to the processor. Relaxed memory consistency allows reordering of memory events and buffering or pipelining of remote memory accesses. Data prefetching attempts to hide long read latency by issuing read requests well ahead of time, with the expectation that the data will be available in the cache when it is referenced. Multithreading attempts to hide the long latency by context switching between several active threads, thus allowing the processor to perform useful work while waiting for remote requests or synchronization faults to complete.

== Consistency models used ==

=== Address translation aware memory consistency ===

These memory consistency models define the behavior of operations (loads, stores, memory barriers, etc.) on physical addresses and virtual addresses. The two important levels of memory consistency that can be classified as address translation aware are described below:

==== Physical address memory consistency (PAMC)====
It is necessary to have correct PAMC for unmapped code to work correctly. Unmapped software, including the boot code and part of the system software that manages AT, relies upon PAMC. It is the responsibility of the hardware to implement PAMC and this is specified precisely in the architectural manual.
It is not too difficult to adapt an AT-oblivious consistency model as the specification of PAMC.

Example:
The PAMC model could be SC. In such a case the interface would specify that
(i) there must exist a total order of all loads and stores to physical addresses that respects the program order of each thread and
(ii) the value of each load is equal to the value of the most recent store to that physical address in the total order.
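
The two conditions above can be checked by brute force on a small program. The following is a minimal sketch, not part of any architectural specification: it enumerates every total order that respects program order (condition i) and lets each load return the most recent store in that order (condition ii). The helper names and the two-thread test program are assumptions chosen for illustration.

```python
def interleavings(threads):
    """Yield every total order of operations that respects each thread's program order."""
    if all(len(t) == 0 for t in threads):
        yield []
        return
    for i, t in enumerate(threads):
        if t:
            rest = threads[:i] + [t[1:]] + threads[i + 1:]
            for tail in interleavings(rest):
                yield [t[0]] + tail


def run(order):
    """Execute one total order; each load returns the most recent store to its address."""
    memory = {}
    registers = {}
    for kind, addr, arg in order:
        if kind == "st":
            memory[addr] = arg                     # arg is the value stored
        else:
            registers[arg] = memory.get(addr, 0)   # arg names the destination register
    return registers


def sc_outcomes(threads):
    """Set of final register states reachable under SC (memory initially 0)."""
    return {tuple(sorted(run(order).items())) for order in interleavings(threads)}


# Hypothetical two-thread program: each thread stores to one address and loads the other.
t0 = [("st", "x", 1), ("ld", "y", "r0")]
t1 = [("st", "y", 1), ("ld", "x", "r1")]
outcomes = sc_outcomes([t0, t1])
```

Under these assumptions the outcome in which both loads return 0 never appears in `outcomes`, because no total order that respects program order places both loads before both stores.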

==== Virtual address memory consistency (VAMC)====
Correct VAMC is required for mapped code to work correctly.

Although adapting an AT-oblivious consistency model for PAMC is straightforward, there are a few challenges when adapting an AT-oblivious consistency model for VAMC:
* synonyms - Multiple virtual addresses may map to the same physical address. Suppose two virtual addresses VA1 and VA2 map to the same physical address PA. SC requires that the value of a load is equal to the value of the most recent store to the same address. A naive definition of VAMC that ignores the level of indirection introduced by AT would treat VA1 and VA2 as different addresses, so a load from VA2 would not be required to return the value of the most recent store to VA1. To overcome this challenge, we re-formulate AT-oblivious consistency models for VAMC by applying the model to synonym sets of virtual addresses rather than individual addresses. For instance, we can define SC for VAMC as follows: there must exist a total order of all loads and stores to virtual addresses that respects program order and in which each load gets the value of the most recent store to any virtual address in the same virtual address synonym set. Incorporating synonyms explicitly in the consistency model allows programmers to reason about the ordering of accesses to virtual addresses.
* mapping and permission changes - Another challenge is that the set of memory operations at the VAMC level is richer than at the PAMC level.
* load/store side effects - Yet another challenge in specifying VAMC is that loads and stores to virtual addresses have certain side effects. The AT system includes status bits such as accessed and dirty bits for each page table entry. These status bits are part of the architectural state, and the ordering of updates to those bits must thus be specified in VAMC. To achieve this we add two new operations to the specification tables: Ld-sb (load’s impact on status bits) and St-sb (store’s impact on status bits).
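
The synonym-set formulation can be illustrated with a toy memory. This is only a sketch under simplifying assumptions: the page table is a flat, fixed dictionary from virtual to physical addresses, there are no permission changes or status bits, and the class and method names are hypothetical.

```python
class SynonymMemory:
    """Toy memory in which a load sees the latest store to any synonym of its address."""

    def __init__(self, page_table):
        # page_table: virtual address -> physical address (assumed fixed for this sketch)
        self.page_table = dict(page_table)
        self.physical = {}

    def synonym_set(self, va):
        """All virtual addresses that map to the same physical address as va."""
        pa = self.page_table[va]
        return {v for v, p in self.page_table.items() if p == pa}

    def store(self, va, value):
        self.physical[self.page_table[va]] = value

    def load(self, va):
        # Uninitialised memory reads as 0 in this sketch.
        return self.physical.get(self.page_table[va], 0)


mem = SynonymMemory({"VA1": "PA", "VA2": "PA", "VA3": "PB"})
mem.store("VA1", 7)
value_via_synonym = mem.load("VA2")   # sees the store made through VA1
```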

=== Causal consistency ===
[[Image:causal.jpg|thumb|right|300px|Causal Consistency example]]

Hutto and Ahamad <ref>P.W. Hutto and M. Ahamad. Slow memory: Weakening consistency to enhance concurrency in distributed shared memories. In Proceedings of the 10th International Conference on Distributed Computing Systems, pages 302–311, May 1990.</ref> introduced causal consistency. Lamport <ref>Leslie Lamport. Time, clocks, and the ordering of events in a distributed system. Communications of the ACM, 21(7):558–565, 1978.</ref> defined the notion of potential causality to capture the flow of information in a distributed system. This notion can be applied to a memory system by interpreting a write as a message-send event and a read as a message-receive event. A memory is causally consistent if all processors agree on the order of causally related events. Causally unrelated events (concurrent events) can be observed in different orders.
The example shown here is a legal execution history under CC but not under SC. Note that W(x)1 and W(x)2 are causally related as P2 observed the first write by P1. Furthermore, P3 and P4 observe the accesses W(x)2 and W(x)3 in different orders, which would not be legal in SC.
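
One common way to decide whether two writes are causally related or concurrent is to stamp them with vector clocks. The sketch below uses that technique as an assumption; the article itself only refers to Lamport's potential causality. It also assumes that W(x)3 is issued by P1 without having observed W(x)2, which may not match the figure exactly.

```python
def happened_before(a, b):
    """True if vector timestamp a causally precedes vector timestamp b."""
    return a != b and all(x <= y for x, y in zip(a, b))


def concurrent(a, b):
    """True if neither timestamp causally precedes the other."""
    return a != b and not happened_before(a, b) and not happened_before(b, a)


def tick(clock, me):
    """Advance the entry of process index `me` for a local event such as a write."""
    updated = list(clock)
    updated[me] += 1
    return tuple(updated)


def merge(local, received, me):
    """Clock of process `me` after it reads a value stamped with `received`."""
    merged = tuple(max(x, y) for x, y in zip(local, received))
    return tick(merged, me)


# Clock entries are (P1, P2); only the two writers are tracked in this sketch.
w_x1 = tick((0, 0), me=0)                  # P1 performs W(x)1
p2_after_read = merge((0, 0), w_x1, me=1)  # P2 reads x = 1
w_x2 = tick(p2_after_read, me=1)           # P2 performs W(x)2
w_x3 = tick(w_x1, me=0)                    # P1 performs W(x)3 (assumed, see above)
```

Here `happened_before(w_x1, w_x2)` holds, so all processors must see W(x)1 before W(x)2, while `concurrent(w_x2, w_x3)` holds, so different processors may see those two writes in different orders.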

=== Delta consistency ===

The delta consistency model states that after a fixed time period δ, an update is propagated through the system and all replicas will be consistent. In other words, barring a short bounded interval after a modification, the result of any read operation is consistent with a read on the original copy of an object. If an object is modified, the read will not be consistent during the short period of time following its modification. Once the fixed time period elapses, the modification is propagated and the read is now consistent.
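
A minimal sketch of this behaviour, assuming a single original copy, a single replica, and a simulated clock passed in by the caller (all names are hypothetical, and writes are assumed to arrive in non-decreasing time order):

```python
class DeltaReplicatedObject:
    """One original copy and one replica that lags by at most `delta` time units."""

    def __init__(self, initial, delta):
        self.delta = delta
        self.original = initial
        self.replica = initial
        self.pending = []  # (time at which the update becomes visible, value), in write order

    def write(self, now, value):
        self.original = value
        self.pending.append((now + self.delta, value))

    def read_replica(self, now):
        # Apply every update whose delta interval has elapsed by time `now`.
        while self.pending and self.pending[0][0] <= now:
            self.replica = self.pending.pop(0)[1]
        return self.replica


obj = DeltaReplicatedObject(initial=0, delta=5)
obj.write(now=10, value=42)
stale = obj.read_replica(now=12)   # still inside the delta window
fresh = obj.read_replica(now=15)   # window elapsed, update is visible
```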

=== Entry consistency ===

This consistency model has been designed to be used with critical sections. Here, the programmer needs to use acquire and release at the start and end of each critical section just like in both variants of release consistency. However, it also requires every ordinary shared variable to be associated with a synchronization variable such as a lock or a barrier. If the elements of an array need to be accessed independently in parallel, then each element of the array must be associated with a lock. When an acquire is done on a synchronization variable, only those ordinary shared variables guarded by that synchronization variable are made consistent. Release consistency does not associate shared variables with locks or barriers and at acquire time has to determine empirically which variables it needs. This is where entry consistency differs from release consistency.

Formally, a memory exhibits entry consistency if the following conditions are met:

* An acquire access of a synchronization variable is not allowed to perform with respect to a process until all updates to the guarded shared data have been performed with respect to that process.
* Before an exclusive mode access to a synchronization variable by a process is allowed to perform with respect to that process, no other process may hold the synchronization variable, not even in nonexclusive mode.
* After an exclusive mode access to a synchronization variable has been performed, any other process's next nonexclusive mode access to that synchronization variable may not be performed until it has performed with respect to that variable's owner.
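
The association between synchronization variables and the data they guard can be sketched as follows. This is a single-threaded simulation with hypothetical class names. It models only which variables become consistent on an acquire, not mutual exclusion or the exclusive and nonexclusive access modes.

```python
class GuardedLock:
    """A synchronization variable together with the shared variables it guards."""

    def __init__(self, guarded):
        self.guarded = set(guarded)
        self.latest = {}   # most recently released values of the guarded variables


class Node:
    def __init__(self):
        self.local = {}    # this node's local copies of shared variables

    def acquire(self, lock):
        # Only the variables guarded by this lock are made consistent.
        self.local.update(lock.latest)

    def write(self, lock, name, value):
        if name not in lock.guarded:
            raise KeyError(f"{name} is not guarded by this lock")
        self.local[name] = value

    def release(self, lock):
        for name in lock.guarded:
            if name in self.local:
                lock.latest[name] = self.local[name]


lock_a, lock_b = GuardedLock({"a"}), GuardedLock({"b"})
p, q = Node(), Node()

p.acquire(lock_a); p.write(lock_a, "a", 1); p.release(lock_a)
p.acquire(lock_b); p.write(lock_b, "b", 2); p.release(lock_b)

q.acquire(lock_a)   # q now sees "a" but has learned nothing about "b"
```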

=== Eventual consistency ===

Given a sufficiently long period of time over which no changes are sent, all updates can be expected to propagate eventually through the system. Thus, all the replicas will be consistent. As the consistency achieved is eventual, the possibility of conflicts is high. The conflicts have to be resolved. There are three types of resolution:
* Read repair: The correction is done when a read finds an inconsistency. This slows down the read operation.
* Write repair: The correction is done when a write operation finds an inconsistency, slowing down the write operation.
* Asynchronous repair: The correction is not part of a read or write operation.
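
Read repair can be sketched as below. The sketch assumes that each replica stores a (version, value) pair per key and that the highest version wins. That conflict-resolution rule is an assumption for illustration, not something the model itself prescribes.

```python
def read_with_repair(replicas, key):
    """Return the newest value for `key` and overwrite stale or missing copies.

    replicas: list of dicts mapping key -> (version, value).
    """
    entries = [r[key] for r in replicas if key in r]
    if not entries:
        return None
    newest = max(entries, key=lambda entry: entry[0])
    for r in replicas:
        if r.get(key) != newest:
            r[key] = newest   # the repair happens as part of the read, which slows it down
    return newest[1]


replicas = [{"k": (1, "old")}, {"k": (2, "new")}, {}]
result = read_with_repair(replicas, "k")
```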

=== Linearizability ===

This is also known as strict or atomic consistency. An operation is linearizable if it appears to the rest of the system to occur instantaneously. An atomic operation either occurs completely or does not occur at all. In reality, atomic operations do not actually occur instantaneously. Atomicity is enforced by mutual exclusion. At the software level, locks or semaphores are used to achieve this, while at the hardware level, a cache coherency protocol may be used. This makes it appear to the user that the entire operation occurred in a single instruction.

A sequence of invocations and responses made of an object by a set of threads is referred to as a history. When a function is invoked, a subsequent response is generated.

Example:
Suppose two threads, A and B, attempt to acquire a lock, backing off if it's already taken.
This would be modeled as both threads invoking the lock operation, then both threads receiving a response, one successful, one not.
A calls lock
B calls lock
lock returns fail to A
lock returns success to B

When all calls make immediate responses, the history is called a sequential history. A history is linearizable if:

* its invocations and responses can be reordered to yield a sequential history
* that sequential history is correct according to the sequential definition of the object
* if a response preceded an invocation in the original history, it must still precede it in the sequential reordering

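Now we can reorder the above example in two ways as follows:

Example:
One way would be:
A calls lock
lock returns fail to A
B calls lock
lock returns success to B

This is a sequential history, but it is not correct according to the sequential definition of a lock: A calls lock first while the lock is free, so A should have succeeded and B should have failed.

The other way would be:
B calls lock
lock returns success to B
A calls lock
lock returns fail to A

This sequential history is correct, and it respects the original ordering because no response in the original history preceded an invocation that is now placed before it. The original history is therefore linearizable.

Thus, an object is linearizable if all valid histories of its use can be linearized.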
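
The three conditions for a linearizable history can be checked by brute force on small histories. The following sketch assumes one outstanding call per thread and a lock that is never released, so the sequential definition is simply "the first call succeeds, every later call fails". The event encoding and function names are hypothetical.

```python
from itertools import permutations


def operations(history):
    """Pair each invocation with its response: (thread, call_index, return_index, result)."""
    pending, ops = {}, []
    for index, (kind, thread, result) in enumerate(history):
        if kind == "call":
            pending[thread] = index
        else:
            ops.append((thread, pending.pop(thread), index, result))
    return ops


def legal_for_lock(order):
    """Sequential definition assumed here: first call succeeds, all later calls fail."""
    return all(result == ("success" if position == 0 else "fail")
               for position, (_, _, _, result) in enumerate(order))


def respects_real_time(order):
    """If b returned before a was called in the history, b must not be placed after a."""
    for position, a in enumerate(order):
        for b in order[position + 1:]:
            if b[2] < a[1]:
                return False
    return True


def linearizations(history):
    return [list(order) for order in permutations(operations(history))
            if respects_real_time(order) and legal_for_lock(order)]


overlapping = [("call", "A", None), ("call", "B", None),
               ("return", "A", "fail"), ("return", "B", "success")]
valid_orders = [[op[0] for op in order] for order in linearizations(overlapping)]
```

For the overlapping history used in the example, the only order that survives both checks places B's operation before A's.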

=== PRAM consistency ===
[[Image:pram.jpg|thumb|right|300px|PRAM Consistency example]]

It is also known as FIFO consistency. It was introduced by Lipton and Sandberg, and the reasoning that led to this model was as follows: Consider a multiprocessor where each processor has a local copy of the shared memory. For the memory to be scalable, the cost of an access should be independent of the time it takes to access the other processors’ memories. They proposed that on a read, a PRAM would simply return the value stored in the local copy of the memory. On a write, it would update the local copy first and broadcast the new value to the other processors. Assuming a constant time for initiating a broadcast operation, the goal of making the cost for a read or write constant is thus achieved. In terms of ordering constraints, this is equivalent to requiring that all processors observe the writes from a single processor in the same order, while they may disagree on the order of writes by different processors.
The example shown is legal under PRAM but not under SC or CC. P3 and P4 observe the writes by P1 and P2 in different orders, although W(x)1 and W(x)2 are potentially causally related.
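
The ordering constraint can be written as a small check. In this sketch every observer is assumed to have seen every write, and the data layout and function name are hypothetical.

```python
def pram_consistent(program_order, observed):
    """Check that each observer sees every writer's writes in that writer's issue order.

    program_order: writer -> list of values in the order they were written.
    observed: observer -> list of (writer, value) in the order they were seen.
    """
    for seen in observed.values():
        for writer, writes in program_order.items():
            from_writer = [value for w, value in seen if w == writer]
            if from_writer != writes:
                return False
    return True


# Observers disagree on the order of writes by different processors: allowed.
different_writers = pram_consistent(
    {"P1": [1], "P2": [2]},
    {"P3": [("P1", 1), ("P2", 2)], "P4": [("P2", 2), ("P1", 1)]},
)

# An observer sees two writes by the same processor out of order: not allowed.
same_writer_reordered = pram_consistent(
    {"P1": [1, 2]},
    {"P3": [("P1", 2), ("P1", 1)]},
)
```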

=== Release consistency ===

Release consistency is one of the consistency models used in the domain of concurrent programming (e.g. in distributed shared memory, distributed transactions etc.).

Systems of this kind are characterised by the existence of two special synchronisation operations, release and acquire. Before issuing a write to a memory object, a node must acquire the object via a special operation, and later release it. The code that runs between the acquire and the release therefore constitutes the critical region. The system is said to provide release consistency if all write operations by a certain node are seen by the other nodes after the former releases the object and before the latter acquire it.

There are two kinds of protocols that implement release consistency:
* ''eager'', where all coherence actions are performed on release operations, and
* ''lazy'', where all coherence actions are delayed until after a subsequent acquire.
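
The difference between the two kinds of protocol is when the writer's updates reach the other nodes. The sketch below is a single-threaded simulation with hypothetical class names. It ignores mutual exclusion and models only the propagation of updates.

```python
class Node:
    def __init__(self):
        self.cache = {}   # this node's view of the shared object


class EagerProtocol:
    """All coherence actions are performed when the writer releases."""

    def __init__(self, nodes):
        self.nodes = nodes

    def release(self, updates):
        for node in self.nodes:
            node.cache.update(updates)

    def acquire(self, node):
        pass   # nothing left to do, updates were already pushed


class LazyProtocol:
    """Coherence actions are delayed until another node acquires."""

    def __init__(self):
        self.released = {}

    def release(self, updates):
        self.released.update(updates)

    def acquire(self, node):
        node.cache.update(self.released)


eager_reader, lazy_reader = Node(), Node()

eager = EagerProtocol([eager_reader])
eager.release({"x": 1})                   # reader already holds x = 1

lazy = LazyProtocol()
lazy.release({"x": 1})
before_acquire = dict(lazy_reader.cache)  # still empty
lazy.acquire(lazy_reader)                 # now the reader holds x = 1
```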

=== Sequential consistency ===
[[Image:sc.jpg|thumb|right|300px|Sequential Consistency example]]

Sequential consistency was first defined by Lamport in 1979. He defined a memory system to be sequentially consistent if:
the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.

In a sequentially consistent system, all processors must agree on the order of observed effects. The image to the right shows an example for SC.
Note that R(y)2 by processor P3 reads a value that has not been written yet! Of course, this is not possible in any real physical system. However, it shows a surprising flexibility of the SC model. Another reason why this is not a legal history for atomic consistency is that the
write operations W(x)1 and W(y)2 appear commuted at processor P3.

Sequential consistency has been the canonical memory consistency model for a long time. However, many multiprocessor machines actually implement a slightly weaker model called processor consistency.

=== Weak consistency ===

A memory system is weakly consistent if it enforces the following restrictions:
* accesses to synchronization variables are sequentially consistent and
* no access to a synchronization variable is issued in a processor before all previous data accesses have been performed and
* no access is issued by a processor before a previous access to a synchronization variable has been performed
Notice that the meaning of “previous” is well-defined because it refers to program order. That is, an access A precedes access B if and only if the processor that executed access B has previously executed access A. Synchronizing accesses work as fences. At the time a synchronizing access performs, all previous accesses by that processor are guaranteed to have performed. The synchronization model corresponding to these access order constraints is relatively simple. A program executing on a weakly consistent system appears sequentially consistent if the following two constraints are observed:
* There are no data races and
* Synchronization is visible to the memory system.
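
The fence-like behaviour of a synchronizing access can be sketched with a processor that buffers its data stores and drains the buffer when it synchronizes. This is a simplified, single-threaded illustration with hypothetical names. Real hardware may perform buffered stores at any earlier time as well.

```python
class WeakProcessor:
    """Buffers ordinary stores; a synchronizing access performs all of them first."""

    def __init__(self, memory):
        self.memory = memory   # shared memory, visible to other processors
        self.buffer = []       # data stores issued but not yet performed

    def store(self, addr, value):
        self.buffer.append((addr, value))

    def sync(self, flag):
        # All previous data accesses are performed before the synchronizing access.
        for addr, value in self.buffer:
            self.memory[addr] = value
        self.buffer.clear()
        self.memory[flag] = True


memory = {}
producer = WeakProcessor(memory)
producer.store("data", 123)
visible_before_sync = "data" in memory   # the store may still sit in the buffer
producer.sync("ready")                   # after this, "data" and "ready" are both visible
```

A consumer that first observes `ready` through its own synchronizing access and only then reads `data` is free of data races, which is the condition under which the program appears sequentially consistent.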

=== Strong consistency ===

Strong consistency is one of the consistency models used in the domain of concurrent programming (e.g. in distributed shared memory, distributed transactions etc.). Strong consistency is supported if:
* All accesses are seen by all parallel processes (or nodes, processors etc.) in the same order (sequentially)
* Only one consistent state can be observed. In contrast, under weak consistency different parallel processes (or nodes etc.) may perceive variables in different states.

== Practical performance impact ==

[[Image:consistency_models.jpg|thumb|right|300px|Implementation of consistency models]]

Here, we define performance as the processor utilization achieved in an execution.

== Conclusion ==
==References==
<references/>

File:Consistency models.jpg

2012-04-04T23:22:14Z

Sshanbh: Implementation of consistency models

Implementation of consistency models

CSC/ECE 506 Spring 2012/10b sr

2012-04-04T23:19:45Z

Sshanbh: /* Practical performance impact */

'''Use of consistency models in current multiprocessors'''

The memory consistency model of a shared memory system determines the order in which memory operations will appear to execute to the programmer. This article describes how consistency is used in multiprocessors today and later digs into the details of popular consistency models in use today. The impact of these models on the multiprocessor performance is also discussed. The article finishes off with a discussion about how the consistency models perform with larger multiprocessors.

== Introduction ==

Many modern computer systems and most multicore chips support shared memory in hardware. In a shared memory system, each of the processor cores may read and write to a single shared address space. These designs seek various goodness properties, such as high performance, low power, and low cost. Of course, it is not valuable to provide these goodness properties without first providing correctness. Correct shared memory seems intuitive at a hand-wave level, but there are subtle issues in even defining what it means for a shared memory system to be correct, as well as many subtle corner cases in designing a correct shared memory implementation. Moreover, these subtleties must be mastered in hardware implementations where bug fixes are expensive.

It is the job of consistency to define shared memory correctness. Consistency definitions provide rules about loads and stores (or memory reads and writes) and how they act upon memory. Ideally, consistency definitions would be simple and easy to understand. However, defining what it means for shared memory to behave correctly is more subtle than defining the correct behavior of, for example, a single-threaded processor core. The correctness criterion for a single processor core partitions behavior between one correct result and many incorrect alternatives. This is because the processor’s architecture mandates that the execution of a thread transforms a given input state into a single well-defined output state, even on an out-of-order core. Shared memory consistency models, however, concern the loads and stores of multiple threads and usually allow many correct executions while disallowing many incorrect ones. The possibility of multiple correct executions is due to the ISA allowing multiple threads to execute concurrently, often with many possible legal interleavings of instructions from different threads. The multitude of correct executions complicates the erstwhile simple challenge of determining whether an execution is correct. Nevertheless, consistency must be mastered to implement shared memory and, in some cases, to write correct programs that use it.

== Consistency in current-day multiprocessors ==

Today's scalable multiprocessors are mostly built with a distributed shared memory architecture. The memory is physically distributed but logically shared. In other words, the address spaces generated by all processing nodes globally form a single address space. The main advantage of such a system lies in the scalability with distributed hardware and programmability of a shared virtual memory. Representative systems include the Cray T3D, the Stanford Dash, the MIT Alewife, the Teracomputer, and Convex SPP.

A scalable system should be able to hide the long latencies of remote memory accesses. Several latency hiding techniques have been proposed: Coherent caches reduce the frequency of remote memory accesses by caching data close to the processor. Relaxed memory consistency allows reordering of memory events and buffering or pipelining of remote memory accesses. Data prefetching attempts to hide long read latency by issuing read requests well ahead of time, with the expectation that the data will be available in the cache when it is referenced. Multithreading attempts to hide the long latency by context switching between several active threads, thus allowing the processor to perform useful work while waiting for remote requests or synchronization faults to complete.

== Consistency models used ==

=== Address translation aware memory consistency ===

These memory consistency models define the behavior of operations (loads, stores, memory barriers, etc.) on physical addresses and virtual addresses. The two important levels of memory consistency that can be classified as address translation aware are described below:

==== Physical address memory consistency (PAMC)====
It is necessary to have correct PAMC for unmapped code to work correctly. Unmapped software, including the boot code and part of the system software that manages AT, relies upon PAMC. It is the responsibility of the hardware to implement PAMC and this is specified precisely in the architectural manual.
It is not too difficult to adapt an AT-oblivious consistency model as the specification of PAMC.

Example:
The PAMC model could be SC. In such a case the interface would specify that
(i) there must exist a total order of all loads and stores to physical addresses that respects the program order of each thread and
(ii) the value of each load is equal to the value of the most recent store to that physical address in the total order.

==== Virtual address memory consistency (VAMC)====
Correct VAMC is required for mapped code to work correctly.

Although adapting an AT-oblivious consistency model for PAMC is straightforward, there are a few challenges when adapting an AT-oblivious consistency model for VAMC:
* synonyms - Multiple virtual addresses may map to the same physical address. Suppose two virtual addresses VA1 and VA2 map to the same physical address PA. SC requires that the value of a load is equal to the value of the most recent store to the same address. A naive definition of VAMC that ignores the level of indirection introduced by AT would treat VA1 and VA2 as different addresses, so a load from VA2 would not be required to return the value of the most recent store to VA1. To overcome this challenge, we re-formulate AT-oblivious consistency models for VAMC by applying the model to synonym sets of virtual addresses rather than individual addresses. For instance, we can define SC for VAMC as follows: there must exist a total order of all loads and stores to virtual addresses that respects program order and in which each load gets the value of the most recent store to any virtual address in the same virtual address synonym set. Incorporating synonyms explicitly in the consistency model allows programmers to reason about the ordering of accesses to virtual addresses.
* mapping and permission changes - Another challenge is that the set of memory operations at the VAMC level is richer than at the PAMC level.
* load/store side effects - Yet another challenge in specifying VAMC is that loads and stores to virtual addresses have certain side effects. The AT system includes status bits such as accessed and dirty bits for each page table entry. These status bits are part of the architectural state, and the ordering of updates to those bits must thus be specified in VAMC. To achieve this we add two new operations to the specification tables: Ld-sb (load’s impact on status bits) and St-sb (store’s impact on status bits).

=== Causal consistency ===
[[Image:causal.jpg|thumb|right|300px|Causal Consistency example]]

Hutto and Ahamad <ref>P.W. Hutto and M. Ahamad. Slow memory: Weakening consistency to enhance concurrency in distributed shared memories. In Proceedings of the 10th International Conference on Distributed Computing Systems, pages 302–311, May 1990.</ref> introduced causal consistency. Lamport <ref>Leslie Lamport. Time, clocks, and the ordering of events in a distributed system. Communications of the ACM, 21(7):558–565, 1978.</ref> defined the notion of potential causality to capture the flow of information in a distributed system. This notion can be applied to a memory system by interpreting a write as a message-send event and a read as a message-receive event. A memory is causally consistent if all processors agree on the order of causally related events. Causally unrelated events (concurrent events) can be observed in different orders.
The example shown here is a legal execution history under CC but not under SC. Note that W(x)1 and W(x)2 are causally related as P2 observed the first write by P1. Furthermore, P3 and P4 observe the accesses W(x)2 and W(x)3 in different orders, which would not be legal in SC.

=== Delta consistency ===

The delta consistency model states that after a fixed time period δ, an update is propagated through the system and all replicas will be consistent. In other words, barring a short bounded interval after a modification, the result of any read operation is consistent with a read on the original copy of an object. If an object is modified, the read will not be consistent during the short period of time following its modification. Once the fixed time period elapses, the modification is propagated and the read is now consistent.

=== Entry consistency ===

This consistency model has been designed to be used with critical sections. Here, the programmer needs to use acquire and release at the start and end of each critical section just like in both variants of release consistency. However, it also requires every ordinary shared variable to be associated with a synchronization variable such as a lock or a barrier. If the elements of an array need to be accessed independently in parallel, then each element of the array must be associated with a lock. When an acquire is done on a synchronization variable, only those ordinary shared variables guarded by that synchronization variable are made consistent. Release consistency does not associate shared variables with locks or barriers and at acquire time has to determine empirically which variables it needs. This is where entry consistency differs from release consistency.

Formally, a memory exhibits entry consistency if the following conditions are met:

* An acquire access of a synchronization variable is not allowed to perform with respect to a process until all updates to the guarded shared data have been performed with respect to that process.
* Before an exclusive mode access to a synchronization variable by a process is allowed to perform with respect to that process, no other process may hold the synchronization variable, not even in nonexclusive mode.
* After an exclusive mode access to a synchronization variable has been performed, any other process's next nonexclusive mode access to that synchronization variable may not be performed until it has performed with respect to that variable's owner.

=== Eventual consistency ===

Given a sufficiently long period of time over which no changes are sent, all updates can be expected to propagate eventually through the system. Thus, all the replicas will be consistent. As the consistency achieved is eventual, the possibility of conflicts is high. The conflicts have to be resolved. There are three types of resolution:
* Read repair: The correction is done when a read finds an inconsistency. This slows down the read operation.
* Write repair: The correction is done when a write operation finds an inconsistency, slowing down the write operation.
* Asynchronous repair: The correction is not part of a read or write operation.

=== Linearizability ===

This is also known as strict or atomic consistency. An operation is linearizable if it appears to the rest of the system to occur instantaneously. An atomic operation either occurs completely or does not occur at all. In reality, atomic operations do not actually occur instantaneously. Atomicity is enforced by mutual exclusion. At the software level, locks or semaphores are used to achieve this, while at the hardware level, a cache coherency protocol may be used. This makes it appear to the user that the entire operation occurred in a single instruction.

A sequence of invocations and responses made of an object by a set of threads is referred to as a history. When a function is invoked, a subsequent response is generated.

Example:
Suppose two threads, A and B attempt to acquire a lock, backing off if it's already taken.
This would be modeled as both threads invoking the lock operation, then both threads receiving a response, one successful, one not.
A calls lock
B calls lock
lock returns fail to A
lock returns success to B

When every call receives an immediate response, the history is called a sequential history. A history is linearizable if:

* its invocations and responses can be reordered to yield a sequential history
* that sequential history is correct according to the sequential definition of the object
* if a response preceded an invocation in the original history, it must still precede it in the sequential reordering

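The example above can be reordered into a sequential history in two ways, but only one of them is a linearization:

Example:
One way would be:
A calls lock
lock returns fail to A
B calls lock
lock returns success to B

This history is sequential, but it is not correct according to the sequential definition of a lock: A calls first on a free lock, so A should have succeeded and B should have failed.

The other way would be:
B calls lock
lock returns success to B
A calls lock
lock returns fail to A

This history is sequential and correct, and no response preceded an invocation in the original history, so it is a valid linearization.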

Thus, an object is linearizable if all valid histories of its use can be linearized.
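
The three conditions can be checked mechanically for the lock example. The sketch below is a brute-force illustration only. It assumes each operation is recorded as a tuple of thread, invocation time, response time and result, with hypothetical timestamps 0 to 3 matching the order of events in the example, and it assumes a lock that is never released.

```python
from itertools import permutations

# (thread, invoke_time, response_time, result); timestamps are assumptions
history = [
    ("A", 0, 2, "fail"),
    ("B", 1, 3, "success"),
]


def respects_real_time(order):
    """If p responded before q was invoked, p must not be placed after q."""
    for i, p in enumerate(order):
        for q in order[:i]:  # q is placed before p
            if p[2] < q[1]:
                return False
    return True


def legal_for_lock(order):
    """Sequential lock definition without unlock: only the first call succeeds."""
    held = False
    for _, _, _, result in order:
        expected = "fail" if held else "success"
        if result != expected:
            return False
        held = True
    return True


def linearizations(history):
    return [
        [op[0] for op in order]
        for order in permutations(history)
        if respects_real_time(order) and legal_for_lock(order)
    ]


found = linearizations(history)
```

Here <code>found</code> contains the single ordering in which B's call takes effect before A's. Trying every permutation is exponential in the number of operations, so this is only practical for very small histories.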

=== PRAM consistency ===
[[Image:pram.jpg|thumb|right|300px|PRAM Consistency example]]

It is also known as FIFO consistency. The reasoning that led to this model was as follows: Consider a multiprocessor where each processor has a local copy of the shared memory. For the memory to be scalable, the cost of an access should be independent of the time it takes to access the other processors’ memories. The proposal was that on a read, a PRAM would simply return the value stored in the local copy of the memory. On a write, it would update the local copy first and broadcast the new value to the other processors. Assuming a constant time for initiating a broadcast operation, the goal of making the cost of a read or write constant is thus achieved. In terms of ordering constraints, this is equivalent to requiring that all processors observe the writes from a single processor in the same order, while they may disagree on the order of writes by different processors.

The example shown is legal under PRAM but not under SC or CC. P3 and P4 observe the writes by P1 and P2 in different orders, although W(x)1 and W(x)2 are potentially causally related.
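
The ordering constraint can be expressed as a small check. The sketch below is an illustration under stated assumptions: the function name <code>is_pram_consistent</code>, the data layout, and the sample histories are hypothetical and are not taken from the figure. Each write label is assumed to be unique within its writer.

```python
def is_pram_consistent(program_order, observed):
    """program_order: {writer: [write, ...]} in the order issued.
    observed: {observer: [(writer, write), ...]} in the order seen.
    PRAM only requires that each writer's writes are seen in issue order."""
    for seen in observed.values():
        for writer, issued in program_order.items():
            from_writer = [w for (p, w) in seen if p == writer]
            positions = [issued.index(w) for w in from_writer]
            if positions != sorted(positions):
                return False
    return True


# Two observers disagree on the order of writes by different processors: allowed.
disagree = is_pram_consistent(
    {"P1": ["W(x)1"], "P2": ["W(x)2"]},
    {
        "P3": [("P1", "W(x)1"), ("P2", "W(x)2")],
        "P4": [("P2", "W(x)2"), ("P1", "W(x)1")],
    },
)

# One observer sees a single processor's writes out of order: not allowed.
reordered = is_pram_consistent(
    {"P1": ["W(x)1", "W(x)3"]},
    {"P3": [("P1", "W(x)3"), ("P1", "W(x)1")]},
)
```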

=== Release consistency ===

Release consistency is one of the consistency models used in the domain of concurrent programming (e.g. in distributed shared memory, distributed transactions etc.).

Systems of this kind are characterised by the existence of two special synchronisation operations, release and acquire. Before issuing a write to a memory object, a node must acquire the object via a special operation, and later release it. The code that runs between the acquire and release operations therefore constitutes the critical region. The system is said to provide release consistency if all write operations by a certain node are seen by the other nodes after the former releases the object and before the latter acquire it.

There are two kinds of protocols that implement release consistency:
* ''eager'', where all coherence actions are performed on release operations, and
* ''lazy'', where all coherence actions are delayed until after a subsequent acquire

=== Sequential consistency ===
[[Image:sc.jpg|thumb|right|300px|Sequential Consistency example]]

Sequential consistency was first defined by Lamport in 1979. He defined a memory system to be sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.

In a sequentially consistent system, all processors must agree on the order of observed effects. The image to the right shows an example for SC.
Note that R(y)2 by processor P3 reads a value that has not been written yet! Of course, this is not possible in any real physical system. However, it shows a surprising flexibility of the SC model. Another reason why this is not a legal history for atomic consistency is that the
write operations W(x)1 and W(y)2 appear commuted at processor P3.

Sequential consistency has been the canonical memory consistency model for a long time. However, many multiprocessor machines actually implement a slightly weaker model called processor consistency.
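
Lamport's definition can be explored by enumerating every interleaving of a small program. The sketch below uses a common two-processor test that is not part of the figure: each processor stores to one variable and then loads the other. The operation encoding and helper names are assumptions made for illustration.

```python
def interleavings(a, b):
    """Yield every merge of a and b that preserves the order within each."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def run(schedule):
    """Execute one sequential order against a single shared memory."""
    mem = {"x": 0, "y": 0}
    regs = {}
    for op, target, arg in schedule:
        if op == "store":
            mem[target] = arg
        else:  # load: register `target` receives the value at address `arg`
            regs[target] = mem[arg]
    return regs["r1"], regs["r2"]


p1 = [("store", "x", 1), ("load", "r1", "y")]
p2 = [("store", "y", 1), ("load", "r2", "x")]
outcomes = {run(s) for s in interleavings(p1, p2)}
```

The set <code>outcomes</code> never contains <code>(0, 0)</code>: in any sequential order that respects both program orders, at least one store precedes the other processor's load. A machine that can produce that result for this program is therefore not sequentially consistent.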

=== Weak consistency ===

A memory system is weakly consistent if it enforces the following restrictions:
* accesses to synchronization variables are sequentially consistent and
* no access to a synchronization variable is issued in a processor before all previous data accesses have been performed and
* no access is issued by a processor before a previous access to a synchronization variable has been performed
Notice that the meaning of “previous” is well-defined because it refers to program order. That is, an access A precedes access B if and only if the processor that executed access B has previously executed access A. Synchronizing accesses work as fences: at the time a synchronizing access performs, all previous accesses by that processor are guaranteed to have performed. The synchronization model corresponding to these access order constraints is relatively simple. A program executing on a weakly consistent system appears sequentially consistent if the following two constraints are observed:
* There are no data races and
* Synchronization is visible to the memory system.
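
The fence-like behaviour of a synchronizing access can be pictured with a write buffer. The sketch below is a deliberately simplified model with hypothetical names: data writes sit in a per-processor buffer and are only guaranteed to reach shared memory when the processor performs a synchronizing access.

```python
class Processor:
    def __init__(self, memory):
        self.memory = memory
        self.buffer = []  # data writes issued but not yet performed

    def write(self, addr, value):
        self.buffer.append((addr, value))

    def sync(self):
        """Synchronizing access: all previous data accesses perform first."""
        for addr, value in self.buffer:
            self.memory[addr] = value
        self.buffer.clear()


shared = {}
p = Processor(shared)
p.write("x", 1)
p.write("y", 2)
before_sync = dict(shared)
p.sync()
after_sync = dict(shared)
```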

=== Strong consistency ===

Strong consistency is one of the consistency models used in the domain of concurrent programming (e.g. in distributed shared memory, distributed transactions etc.). Strong consistency is supported if:
* All accesses are seen by all parallel processes (or nodes, processors etc.) in the same order (sequentially)
* Only one consistent state can be observed. In contrast, under weak consistency different parallel processes (or nodes, etc.) may perceive variables in different states.

== Practical performance impact ==

Here, we define performance as the processor utilization achieved in an execution.

== Conclusion ==
==References==
<references/>

CSC/ECE 506 Spring 2012/10b sr

2012-04-04T22:19:41Z

Sshanbh: /* Strong consistency */

'''Use of consistency models in current multiprocessors'''

The memory consistency model of a shared memory system determines the order in which memory operations will appear to execute to the programmer. This article describes how consistency is used in multiprocessors today and later digs into the details of popular consistency models in use today. The impact of these models on the multiprocessor performance is also discussed. The article finishes off with a discussion about how the consistency models perform with larger multiprocessors.

== Introduction ==

Many modern computer systems and most multicore chips support shared memory in hardware. In a shared memory system, each of the processor cores may read and write to a single shared address space. These designs seek various goodness properties, such as high performance, low power, and low cost. Of course, it is not valuable to provide these goodness properties without first providing correctness. Correct shared memory seems intuitive at a hand-wave level, but there are subtle issues in even defining what it means for a shared memory system to be correct, as well as many subtle corner cases in designing a correct shared memory implementation. Moreover, these subtleties must be mastered in hardware implementations where bug fixes are expensive.

It is the job of consistency to define shared memory correctness. Consistency definitions provide rules about loads and stores (or memory reads and writes) and how they act upon memory. Ideally, consistency definitions would be simple and easy to understand. However, defining what it means for shared memory to behave correctly is more subtle than defining the correct behavior of, for example, a single-threaded processor core. The correctness criterion for a single processor core partitions behavior between one correct result and many incorrect alternatives. This is because the processor’s architecture mandates that the execution of a thread transforms a given input state into a single well-defined output state, even on an out-of-order core. Shared memory consistency models, however, concern the loads and stores of multiple threads and usually allow many correct executions while disallowing many incorrect ones. The possibility of multiple correct executions is due to the ISA allowing multiple threads to execute concurrently, often with many possible legal interleavings of instructions from different threads. The multitude of correct executions complicates the erstwhile simple challenge of determining whether an execution is correct. Nevertheless, consistency must be mastered to implement shared memory and, in some cases, to write correct programs that use it.

== Consistency in current-day multiprocessors ==

Today's scalable multiprocessors are mostly built with a distributed shared memory architecture. The memory is physically distributed but logically shared. In other words, the address spaces generated by all processing nodes globally form a single address space. The main advantage of such a system lies in combining the scalability of distributed hardware with the programmability of a shared virtual memory. Representative systems include the Cray T3D, the Stanford Dash, the MIT Alewife, the Tera computer, and Convex SPP.

A scalable system should be able to hide the long latencies of remote memory accesses. Several latency hiding techniques have been proposed. Coherent caches reduce the frequency of remote memory accesses by caching data close to the processor. Relaxed memory consistency allows reordering of memory events and buffering or pipelining of remote memory accesses. Data prefetching attempts to hide long read latency by issuing read requests well ahead of time, with the expectation that the data will be available in the cache when it is referenced. Multithreading attempts to hide the long latency by context switching between several active threads, thus allowing the processor to perform useful work while waiting for remote requests or synchronization faults to complete.

== Consistency models used ==

=== Address translation aware memory consistency ===

These memory consistency models define the behavior of operations (loads, stores, memory barriers, etc.) on physical addresses and virtual addresses. The two important levels of memory consistency that can be classified as address translation aware are described below:

==== Physical address memory consistency (PAMC)====
It is necessary to have correct PAMC for unmapped code to work correctly. Unmapped software, including the boot code and part of the system software that manages address translation (AT), relies upon PAMC. It is the responsibility of the hardware to implement PAMC, and this is specified precisely in the architectural manual.
It is not too difficult to adapt an AT-oblivious consistency model as the specification of PAMC.

Example:
The PAMC model could be SC. In such a case the interface would specify that
(i) there must exist a total order of all loads and stores to physical addresses that respects the program order of each thread and
(ii) the value of each load is equal to the value of the most recent store to that physical address in the total order.
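
Conditions (i) and (ii) can be written as a check on a proposed total order. The sketch below is illustrative only: the function name <code>is_sc_witness</code>, the tuple layout, and the addresses are assumptions, and every location is assumed to start at 0.

```python
def is_sc_witness(total_order, programs, initial=0):
    """total_order: list of (thread, kind, phys_addr, value) tuples.
    programs: {thread: [the same tuples, in program order]}."""
    # (i) the total order, restricted to one thread, equals its program order
    for thread, ops in programs.items():
        if [op for op in total_order if op[0] == thread] != ops:
            return False
    # (ii) each load returns the most recent store to that physical address
    mem = {}
    for _, kind, addr, value in total_order:
        if kind == "store":
            mem[addr] = value
        elif mem.get(addr, initial) != value:
            return False
    return True


st = ("T1", "store", 0x100, 1)
ld = ("T2", "load", 0x100, 1)
programs = {"T1": [st], "T2": [ld]}

store_first = is_sc_witness([st, ld], programs)
load_first = is_sc_witness([ld, st], programs)
```

An execution is allowed under this model if at least one total order passes the check; here the order with the store first does, while the order with the load first does not because the load would have to return the initial value.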

==== Virtual address memory consistency (VAMC)====
Correct VAMC is required for mapped code to work correctly.

Although adapting an AT-oblivious consistency model for PAMC is straightforward, there are a few challenges when adapting an AT-oblivious consistency model for VAMC:
* synonyms - Multiple virtual addresses may map to the same physical address. Suppose two virtual addresses VA1 and VA2 map to the same physical address PA. SC requires that the value of a load is equal to the value of the most recent store to the same address. It is possible to have a naive definition of VAMC that does not consider the level of indirection introduced by AT. To overcome this challenge, we re-formulate AT-oblivious consistency models for VAMC by applying the model to synonym sets of virtual addresses rather than individual addresses. For instance, we can define SC for VAMC as follows: there must exist a total order of all loads and stores to virtual addresses that respects program order and in which each load gets the value of the most recent store to any virtual address in the same virtual address synonym set. Incorporating synonyms explicitly in the consistency model allows programmers to reason about the ordering of accesses to virtual addresses.
* mapping and permission changes - Another challenge is that the set of memory operations at the VAMC level is richer than at the PAMC level.
* load/store side effects - Yet another challenge in specifying VAMC is that loads and stores to virtual addresses have certain side effects. The AT system includes status bits such as accessed and dirty bits for each page table entry. These status bits are part of the architectural state, and the ordering of updates to those bits must thus be specified in VAMC. To achieve this we add two new operations to the specification tables: Ld-sb (load’s impact on status bits) and St-sb (store’s impact on status bits).
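
The synonym problem can be made concrete with a small sketch. The page table contents, addresses and function names below are hypothetical. The sketch contrasts a naive load rule, which matches stores by identical virtual address, with a rule that matches stores to any address in the same synonym set.

```python
PAGE_TABLE = {"VA1": "PA", "VA2": "PA", "VA3": "PB"}  # hypothetical mappings


def synonym_set(va):
    """All virtual addresses that map to the same physical address as `va`."""
    return {v for v, pa in PAGE_TABLE.items() if pa == PAGE_TABLE[va]}


def naive_load(order, va, initial=0):
    """Most recent store to exactly the same virtual address."""
    value = initial
    for kind, addr, val in order:
        if kind == "store" and addr == va:
            value = val
    return value


def synonym_aware_load(order, va, initial=0):
    """Most recent store to any virtual address in the same synonym set."""
    synonyms = synonym_set(va)
    value = initial
    for kind, addr, val in order:
        if kind == "store" and addr in synonyms:
            value = val
    return value


order = [("store", "VA1", 7)]
naive = naive_load(order, "VA2")
aware = synonym_aware_load(order, "VA2")
```

The naive rule misses the store made through VA1 even though VA2 names the same physical location, which is the behaviour the synonym-set formulation is meant to rule out.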

=== Causal consistency ===
[[Image:causal.jpg|thumb|right|300px|Causal Consistency example]]

Hutto and Ahamad <ref>P.W. Hutto and M. Ahamad. Slow memory: Weakening consistency to enhance concurrency in distributed shared memories. In Proceedings of the 10th International Conference on Distributed Computing Systems, pages 302–311, May 1990.</ref> introduced causal consistency. Lamport <ref>Leslie Lamport. Time, clocks, and the ordering of events in a distributed system. Communications of the ACM, 21(7):558–565, 1978.</ref> defined the notion of potential causality to capture the flow of information in a distributed system. This notion can be applied to a memory system by interpreting a write as a message-send event and a read as a message-read event. A memory is causally consistent if all processors agree on the order of causally related events. Causally unrelated events (concurrent events) can be observed in different orders.
The example shown here is a legal execution history under CC but not under SC. Note that W(x)1 and W(x)2 are causally related as P2 observed the first write by P1. Furthermore, P3 and P4 observe the accesses W(x)2 and W(x)3 in different orders, which would not be legal in SC.
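
One common way to decide whether two writes are causally related or concurrent is to tag them with vector clocks. This technique is not described in the article and is used here only as an illustration. The clock values below are assumptions: they model W(x)2 being issued after its writer read W(x)1, and W(x)3 being issued independently of W(x)2.

```python
def happened_before(a, b):
    """True if vector clock a causally precedes vector clock b."""
    return a != b and all(x <= y for x, y in zip(a, b))


def concurrent(a, b):
    """True if neither clock precedes the other."""
    return a != b and not happened_before(a, b) and not happened_before(b, a)


# Hypothetical clocks, one entry per processor
w_x1 = (1, 0, 0)
w_x2 = (1, 1, 0)  # issued after observing W(x)1
w_x3 = (2, 0, 0)  # issued without observing W(x)2

related = happened_before(w_x1, w_x2)
unrelated = concurrent(w_x2, w_x3)
```

Under causal consistency every processor must see the related pair in the same order, while the concurrent pair may be seen in different orders by different processors.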

=== Delta consistency ===

The delta consistency model states that after a fixed time period δ, an update is propagated through the system and all replicas will be consistent. In other words, barring a short bounded interval after a modification, the result of any read operation is consistent with a read on the original copy of an object. If an object is modified, the read will not be consistent during the short period of time following its modification. Once the fixed time period elapses, the modification is propagated and the read is now consistent.
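
The bounded staleness can be modelled with logical timestamps. The sketch below is a worst-case illustration with hypothetical names: it assumes each update becomes visible at a replica exactly δ time units after it was written, whereas a real system would only promise "no later than δ".

```python
class DeltaReplica:
    def __init__(self, delta, initial=None):
        self.delta = delta
        self.value = initial
        self.pending = []  # (visible_at, value), kept in write order

    def deliver(self, write_time, value):
        """Record an update that becomes visible delta time units later."""
        self.pending.append((write_time + self.delta, value))

    def read(self, now):
        # apply every update whose propagation delay has elapsed
        while self.pending and self.pending[0][0] <= now:
            self.value = self.pending.pop(0)[1]
        return self.value


replica = DeltaReplica(delta=5, initial="old")
replica.deliver(write_time=10, value="new")
during_window = replica.read(now=12)
after_window = replica.read(now=15)
```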

=== Entry consistency ===

This consistency model has been designed to be used with critical sections. Here, the programmer needs to use acquire and release at the start and end of each critical section, just as in both variants of release consistency. However, it also requires every ordinary shared variable to be associated with a synchronization variable such as a lock or a barrier. If the elements of an array need to be accessed independently in parallel, then each element of the array must be associated with a lock. When an acquire is done on a synchronization variable, only those ordinary shared variables guarded by that synchronization variable are made consistent. Release consistency does not associate shared variables with locks or barriers and at acquire time has to determine empirically which variables it needs. This is where entry consistency differs from release consistency.

Formally, a memory exhibits entry consistency if the following conditions are met:

* An acquire access of a synchronization variable is not allowed to perform with respect to a process until all updates to the guarded shared data have been performed with respect to that process.
* Before an exclusive mode access to a synchronization variable by a process is allowed to perform with respect to that process, no other process may hold the synchronization variable, not even in nonexclusive mode.
* After an exclusive mode access to a synchronization variable has been performed, any other process's next nonexclusive mode access to that synchronization variable may not be performed until it has performed with respect to that variable's owner.
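
The association between a synchronization variable and the data it guards can be sketched as follows. This is a simplified illustration with hypothetical names; it models only the effect of an acquire on the acquiring process's local view and ignores exclusive and nonexclusive modes.

```python
class GuardedLock:
    """A synchronization variable that records which shared variables it guards."""

    def __init__(self, guarded):
        self.guarded = set(guarded)


def acquire(lock, local_view, home_memory):
    """Bring only the variables guarded by `lock` up to date."""
    for name in lock.guarded:
        local_view[name] = home_memory[name]


home_memory = {"a": 1, "b": 2}  # latest values held elsewhere in the system
local_view = {"a": 0, "b": 0}   # this process's possibly stale copies
acquire(GuardedLock({"a"}), local_view, home_memory)
```

After the acquire, only <code>a</code> is refreshed and <code>b</code> stays stale, which is what allows entry consistency to transfer less data than a model that makes all shared variables consistent.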

=== Eventual consistency ===

Given a sufficiently long period of time over which no changes are sent, all updates can be expected to propagate eventually through the system. Thus, all the replicas will be consistent. As the consistency achieved is eventual, the possibility of conflicts is high. The conflicts have to be resolved. There are three types of resolution:
* Read repair: The correction is done when a read finds an inconsistency. This slows down the read operation.
* Write repair: The correction is done when a write operation finds an inconsistency, slowing down the write operation.
* Asynchronous repair: The correction is not part of a read or write operation.

=== Linearizability ===

This is also known as strict or atomic consistency. An operation is linearizable if it appears to the rest of the system to occur instantaneously. An atomic operation either occurs completely or does not occur at all. In reality, atomic operations do not actually occur instantaneously. Atomicity is enforced by mutual exclusion. At the software level, locks or semaphores are used to achieve this, while at the hardware level, a cache coherency protocol may be used. This makes it appear to the user that the entire operation occurred in a single instruction.

A sequence of invocations and responses made of an object by a set of threads is referred to as a history. When a function is invoked, a subsequent response is generated.

Example:
Suppose two threads, A and B, attempt to acquire a lock, backing off if it's already taken.
This would be modeled as both threads invoking the lock operation, then both threads receiving a response, one successful, one not.
A calls lock
B calls lock
lock returns fail to A
lock returns success to B

When every call receives an immediate response, the history is called a sequential history. A history is linearizable if:

* its invocations and responses can be reordered to yield a sequential history
* that sequential history is correct according to the sequential definition of the object
* if a response preceded an invocation in the original history, it must still precede it in the sequential reordering

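The example above can be reordered into a sequential history in two ways, but only one of them is a linearization:

Example:
One way would be:
A calls lock
lock returns fail to A
B calls lock
lock returns success to B

This history is sequential, but it is not correct according to the sequential definition of a lock: A calls first on a free lock, so A should have succeeded and B should have failed.

The other way would be:
B calls lock
lock returns success to B
A calls lock
lock returns fail to A

This history is sequential and correct, and no response preceded an invocation in the original history, so it is a valid linearization.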

Thus, an object is linearizable if all valid histories of its use can be linearized.

=== PRAM consistency ===
[[Image:pram.jpg|thumb|right|300px|PRAM Consistency example]]

It is also known as FIFO consistency. The reasoning that led to this model was as follows: Consider a multiprocessor where each processor has a local copy of the shared memory. For the memory to be scalable, the cost of an access should be independent of the time it takes to access the other processors’ memories. The proposal was that on a read, a PRAM would simply return the value stored in the local copy of the memory. On a write, it would update the local copy first and broadcast the new value to the other processors. Assuming a constant time for initiating a broadcast operation, the goal of making the cost of a read or write constant is thus achieved. In terms of ordering constraints, this is equivalent to requiring that all processors observe the writes from a single processor in the same order, while they may disagree on the order of writes by different processors.

The example shown is legal under PRAM but not under SC or CC. P3 and P4 observe the writes by P1 and P2 in different orders, although W(x)1 and W(x)2 are potentially causally related.

=== Release consistency ===

Release consistency is one of the consistency models used in the domain of concurrent programming (e.g. in distributed shared memory, distributed transactions etc.).

Systems of this kind are characterised by the existence of two special synchronisation operations, release and acquire. Before issuing a write to a memory object, a node must acquire the object via a special operation, and later release it. The code that runs between the acquire and release operations therefore constitutes the critical region. The system is said to provide release consistency if all write operations by a certain node are seen by the other nodes after the former releases the object and before the latter acquire it.

There are two kinds of protocols that implement release consistency:
* ''eager'', where all coherence actions are performed on release operations, and
* ''lazy'', where all coherence actions are delayed until after a subsequent acquire

=== Sequential consistency ===
[[Image:sc.jpg|thumb|right|300px|Sequential Consistency example]]

Sequential consistency was first defined by Lamport in 1979. He defined a memory system to be sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.

In a sequentially consistent system, all processors must agree on the order of observed effects. The image to the right shows an example for SC.
Note that R(y)2 by processor P3 reads a value that has not been written yet! Of course, this is not possible in any real physical system. However, it shows a surprising flexibility of the SC model. Another reason why this is not a legal history for atomic consistency is that the
write operations W(x)1 and W(y)2 appear commuted at processor P3.

Sequential consistency has been the canonical memory consistency model for a long time. However, many multiprocessor machines actually implement a slightly weaker model called processor consistency.

=== Weak consistency ===

A memory system is weakly consistent if it enforces the following restrictions:
* accesses to synchronization variables are sequentially consistent and
* no access to a synchronization variable is issued in a processor before all previous data accesses have been performed and
* no access is issued by a processor before a previous access to a synchronization variable has been performed
Notice that the meaning of “previous” is well-defined because it refers to program order. That is, an access A precedes access B if and only if the processor that executed access B has previously executed access A. Synchronizing accesses work as fences: at the time a synchronizing access performs, all previous accesses by that processor are guaranteed to have performed. The synchronization model corresponding to these access order constraints is relatively simple. A program executing on a weakly consistent system appears sequentially consistent if the following two constraints are observed:
* There are no data races and
* Synchronization is visible to the memory system.

=== Strong consistency ===

Strong consistency is one of the consistency models used in the domain of concurrent programming (e.g. in distributed shared memory, distributed transactions etc.). Strong consistency is supported if:
* All accesses are seen by all parallel processes (or nodes, processors etc.) in the same order (sequentially)
* Only one consistent state can be observed. In contrast, under weak consistency different parallel processes (or nodes, etc.) may perceive variables in different states.

== Practical performance impact ==
== Conclusion ==
==References==
<references/>

CSC/ECE 506 Spring 2012/10b sr

2012-04-04T22:14:53Z

Sshanbh: /* Consistency models used */

'''Use of consistency models in current multiprocessors'''

The memory consistency model of a shared memory system determines the order in which memory operations will appear to execute to the programmer. This article describes how consistency is used in multiprocessors today and later digs into the details of popular consistency models in use today. The impact of these models on the multiprocessor performance is also discussed. The article finishes off with a discussion about how the consistency models perform with larger multiprocessors.

== Introduction ==

Many modern computer systems and most multicore chips support shared memory in hardware. In a shared memory system, each of the processor cores may read and write to a single shared address space. These designs seek various goodness properties, such as high performance, low power, and low cost. Of course, it is not valuable to provide these goodness properties without first providing correctness. Correct shared memory seems intuitive at a hand-wave level, but there are subtle issues in even defining what it means for a shared memory system to be correct, as well as many subtle corner cases in designing a correct shared memory implementation. Moreover, these subtleties must be mastered in hardware implementations where bug fixes are expensive.

It is the job of consistency to define shared memory correctness. Consistency definitions provide rules about loads and stores (or memory reads and writes) and how they act upon memory. Ideally, consistency definitions would be simple and easy to understand. However, defining what it means for shared memory to behave correctly is more subtle than defining the correct behavior of, for example, a single-threaded processor core. The correctness criterion for a single processor core partitions behavior between one correct result and many incorrect alternatives. This is because the processor’s architecture mandates that the execution of a thread transforms a given input state into a single well-defined output state, even on an out-of-order core. Shared memory consistency models, however, concern the loads and stores of multiple threads and usually allow many correct executions while disallowing many incorrect ones. The possibility of multiple correct executions is due to the ISA allowing multiple threads to execute concurrently, often with many possible legal interleavings of instructions from different threads. The multitude of correct executions complicates the erstwhile simple challenge of determining whether an execution is correct. Nevertheless, consistency must be mastered to implement shared memory and, in some cases, to write correct programs that use it.

== Consistency in current-day multiprocessors ==

Today's scalable multiprocessors are mostly built with a distributed shared memory architecture. The memory is physically distributed but logically shared. In other words, the address spaces generated by all processing nodes globally form a single address space. The main advantage of such a system lies in combining the scalability of distributed hardware with the programmability of a shared virtual memory. Representative systems include the Cray T3D, the Stanford Dash, the MIT Alewife, the Tera computer, and Convex SPP.

A scalable system should be able to hide the long latencies of remote memory accesses. Several latency hiding techniques have been proposed. Coherent caches reduce the frequency of remote memory accesses by caching data close to the processor. Relaxed memory consistency allows reordering of memory events and buffering or pipelining of remote memory accesses. Data prefetching attempts to hide long read latency by issuing read requests well ahead of time, with the expectation that the data will be available in the cache when it is referenced. Multithreading attempts to hide the long latency by context switching between several active threads, thus allowing the processor to perform useful work while waiting for remote requests or synchronization faults to complete.

== Consistency models used ==

=== Address translation aware memory consistency ===

These memory consistency models define the behavior of operations (loads, stores, memory barriers, etc.) on physical addresses and virtual addresses. The two important levels of memory consistency that can be classified as address translation aware are described below:

==== Physical address memory consistency (PAMC)====
It is necessary to have correct PAMC for unmapped code to work correctly. Unmapped software, including the boot code and part of the system software that manages address translation (AT), relies upon PAMC. It is the responsibility of the hardware to implement PAMC, and this is specified precisely in the architectural manual.
It is not too difficult to adapt an AT-oblivious consistency model as the specification of PAMC.

Example:
The PAMC model could be SC. In such a case the interface would specify that
(i) there must exist a total order of all loads and stores to physical addresses that respects the program order of each thread and
(ii) the value of each load is equal to the value of the most recent store to that physical address in the total order.

==== Virtual address memory consistency (VAMC)====
Correct VAMC is required for mapped code to work correctly.

Although adapting an AT-oblivious consistency model for PAMC is straightforward, there are a few challenges when adapting an AT-oblivious consistency model for VAMC:
* synonyms - Multiple virtual addresses may map to the same physical address. Suppose two virtual addresses VA1 and VA2 map to the same physical address PA. SC requires that the value of a load is equal to the value of the most recent store to the same address. It is possible to have a naive definition of VAMC that does not consider the level of indirection introduced by AT. To overcome this challenge, we re-formulate AT-oblivious consistency models for VAMC by applying the model to synonym sets of virtual addresses rather than individual addresses. For instance, we can define SC for VAMC as follows: there must exist a total order of all loads and stores to virtual addresses that respects program order and in which each load gets the value of the most recent store to any virtual address in the same virtual address synonym set. Incorporating synonyms explicitly in the consistency model allows programmers to reason about the ordering of accesses to virtual addresses.
* mapping and permission changes - Another challenge is that the set of memory operations at the VAMC level is richer than at the PAMC level.
* load/store side effects - Yet another challenge in specifying VAMC is that loads and stores to virtual addresses have certain side effects. The AT system includes status bits such as accessed and dirty bits for each page table entry. These status bits are part of the architectural state, and the ordering of updates to those bits must thus be specified in VAMC. To achieve this we add two new operations to the specification tables: Ld-sb (load’s impact on status bits) and St-sb (store’s impact on status bits).

=== Causal consistency ===
[[Image:causal.jpg|thumb|right|300px|Causal Consistency example]]

Hutto and Ahamad <ref>P.W. Hutto and M. Ahamad. Slow memory: Weakening consistency to enhance concurrency in distributed shared memories. In Proceedings of the 10th International Conference on Distributed Computing Systems, pages 302–311, May 1990.</ref> introduced causal consistency. Lamport <ref>Leslie Lamport. Time, clocks, and the ordering of events in a distributed system. Communications of the ACM, 21(7):558–565, 1978.</ref> defined the notion of potential causality to capture the flow of information in a distributed system. This notion can be applied to a memory system by interpreting a write as a message-send event and a read as a message-read event. A memory is causally consistent if all processors agree on the order of causally related events. Causally unrelated events (concurrent events) can be observed in different orders.
The example shown here is a legal execution history under CC but not under SC. Note that W(x)1 and W(x)2 are causally related as P2 observed the first write by P1. Furthermore, P3 and P4 observe the accesses W(x)2 and W(x)3 in different orders, which would not be legal in SC.
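
Potential causality can be computed as the transitive closure of program order and the reads-from relation. The following is a minimal sketch under the assumption that every written value is unique; the history is a hypothetical one that matches the description above (it is not taken from the figure), and the function name <code>causal_order</code> is illustrative.

```python
from itertools import product

# Hypothetical history: each process lists its operations in program order.
# An operation is (kind, variable, value); written values are assumed unique.
history = {
    "P1": [("W", "x", 1), ("W", "x", 3)],
    "P2": [("R", "x", 1), ("W", "x", 2)],
}

def causal_order(history):
    """Return the set of (a, b) pairs where event a potentially causes event b."""
    events = [(p, i) for p, ops in history.items() for i in range(len(ops))]
    edges = set()
    for p, ops in history.items():
        for i in range(len(ops) - 1):
            edges.add(((p, i), (p, i + 1)))  # program order
    for (p, i), (q, j) in product(events, events):
        kind_a, var_a, val_a = history[p][i]
        kind_b, var_b, val_b = history[q][j]
        if kind_a == "W" and kind_b == "R" and (var_a, val_a) == (var_b, val_b):
            edges.add(((p, i), (q, j)))  # the write "sends", the read "receives"
    # Transitive closure; the intermediate event k is the outermost loop.
    for k in events:
        for a in events:
            for b in events:
                if (a, k) in edges and (k, b) in edges:
                    edges.add((a, b))
    return edges

order = causal_order(history)
w1, w3, w2 = ("P1", 0), ("P1", 1), ("P2", 1)
```

In this history W(x)1 causally precedes W(x)2 because P2 read the value 1 before writing, while W(x)2 and W(x)3 are concurrent and may therefore be observed in different orders.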

=== Delta consistency ===

The delta consistency model states that after a fixed time period δ, an update is propagated through the system and all replicas will be consistent. In other words, barring a short bounded interval after a modification, the result of any read operation is consistent with a read on the original copy of an object. If an object is modified, the read will not be consistent during the short period of time following its modification. Once the fixed time period elapses, the modification is propagated and the read is now consistent.
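
A minimal sketch of this behaviour, assuming a logical clock and a single replica that sees every update exactly δ time units after it was made (the class name <code>DeltaReplica</code> is hypothetical):

```python
class DeltaReplica:
    """Replica that sees each update `delta` time units after it is made."""

    def __init__(self, delta, initial=None):
        self.delta = delta
        self.initial = initial
        self.updates = []  # (write_time, value), appended in time order

    def write(self, time, value):
        self.updates.append((time, value))

    def read(self, time):
        # Only updates whose propagation interval has elapsed are visible.
        visible = [v for t, v in self.updates if t + self.delta <= time]
        return visible[-1] if visible else self.initial

replica = DeltaReplica(delta=5, initial="old")
replica.write(time=10, value="new")
early = replica.read(time=12)   # inside the bounded inconsistency window
late = replica.read(time=15)    # the window has elapsed
```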

=== Entry consistency ===

This consistency model has been designed to be used with critical sections. Here, the programmer needs to use acquire and release at the start and end of each critical section, just like in both variants of release consistency. However, it also requires every ordinary shared variable to be associated with a synchronization variable such as a lock or a barrier. If the elements of an array need to be accessed independently in parallel, then each element of the array must be associated with a lock. When an acquire is done on a synchronization variable, only those ordinary shared variables guarded by that synchronization variable are made consistent. Release consistency does not associate shared variables with locks or barriers and at acquire time has to determine empirically which variables it needs. This is where entry consistency differs from release consistency.

Formally, a memory exhibits entry consistency if the following conditions are met:

* An acquire access of a synchronization variable is not allowed to perform with respect to a process until all updates to the guarded shared data have been performed with respect to that process.
* Before an exclusive mode access to a synchronization variable by a process is allowed to perform with respect to that process, no other process may hold the synchronization variable, not even in nonexclusive mode.
* After an exclusive mode access to a synchronization variable has been performed, any other process's next nonexclusive mode access to that synchronization variable may not be performed until it has performed with respect to that variable's owner.
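
The association between a synchronization variable and the data it guards can be sketched as follows. The classes <code>GuardedLock</code> and <code>Process</code> are hypothetical, and the sketch models only which variables are made consistent at acquire time; it does not model mutual exclusion or the exclusive and nonexclusive modes.

```python
class GuardedLock:
    """Synchronization variable that knows which shared variables it guards."""

    def __init__(self, guarded_names):
        self.guarded_names = set(guarded_names)
        self.published = {}  # latest released values of the guarded variables

class Process:
    def __init__(self):
        self.local = {}  # this process's local copy of shared memory

    def acquire(self, lock):
        # Only variables guarded by this lock are made consistent.
        self.local.update(lock.published)

    def release(self, lock):
        for name in lock.guarded_names:
            if name in self.local:
                lock.published[name] = self.local[name]

lock_a = GuardedLock(["a"])
p1, p2 = Process(), Process()
p1.acquire(lock_a)
p1.local["a"] = 1   # guarded by lock_a
p1.local["b"] = 2   # not guarded by lock_a
p1.release(lock_a)
p2.acquire(lock_a)
```

After the second acquire, P2 sees the update to <code>a</code> but not the update to <code>b</code>, because <code>b</code> is not associated with <code>lock_a</code>.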

=== Eventual consistency ===

Given a sufficiently long period of time over which no changes are sent, all updates can be expected to propagate eventually through the system. Thus, all the replicas will be consistent. As the consistency achieved is eventual, the possibility of conflicts is high. The conflicts have to be resolved. There are three types of resolution:
* Read repair: The correction is done when a read finds an inconsistency. This slows down the read operation.
* Write repair: The correction is done when a write operation finds an inconsistency, slowing down the write operation.
* Asynchronous repair: The correction is not part of a read or write operation.
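
Read repair can be sketched as below. The version numbers attached to each value and the "highest version wins" rule are assumptions made for illustration; real systems use a variety of conflict-resolution schemes.

```python
def read_with_repair(replicas, key):
    """Read `key` from all replicas, return the newest value, and fix stale copies."""
    versions = [r[key] for r in replicas if key in r]
    newest = max(versions, key=lambda pair: pair[0])  # pair is (version, value)
    for r in replicas:
        if r.get(key) != newest:
            r[key] = newest  # this extra work is what slows the read down
    return newest[1]

# Three replicas: one up to date, one stale, one that never received the key.
replicas = [{"k": (2, "new")}, {"k": (1, "old")}, {}]
value = read_with_repair(replicas, "k")
```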

=== Linearizability ===

This is also known as strict or atomic consistency. An operation is linearizable if it appears to the rest of the system to occur instantaneously. An atomic operation either occurs completely or does not occur at all. In reality, atomic operations do not actually occur instantaneously. Atomicity is enforced by mutual exclusion. At the software level, locks or semaphores are used to achieve this, while at the hardware level, a cache coherency protocol may be used. This makes it appear to the user that the entire operation occurred in a single instruction.

A sequence of invocations and responses made of an object by a set of threads is referred to as a history. When a function is invoked, a subsequent response is generated.

Example:
Suppose two threads, A and B, attempt to acquire a lock, backing off if it's already taken.
This would be modeled as both threads invoking the lock operation, then both threads receiving a response, one successful, one not.
* A calls lock
* B calls lock
* lock returns fail to A
* lock returns success to B

When every invocation is immediately followed by its response, the history is called a sequential history. A history is linearizable if:

* its invocations and responses can be reordered to yield a sequential history
* that sequential history is correct according to the sequential definition of the object
* if a response preceded an invocation in the original history, it must still precede it in the sequential reordering

Now we can reorder the above example in two ways as follows:

Example:
One way would be:
* A calls lock
* lock returns fail to A
* B calls lock
* lock returns success to B

The other way would be:
* B calls lock
* lock returns success to B
* A calls lock
* lock returns fail to A

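Only the second reordering is a correct sequential history for a lock that starts out free, because in the first one A's attempt would have succeeded. The second reordering is therefore a linearization of the original history. Thus, an object is linearizable if all valid histories of its use can be linearized.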
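
The three conditions can be checked by brute force for small histories. The sketch below encodes the lock example as tuples of (thread, call position, return position, result) and assumes that the lock starts out free and is never released; the function names are illustrative.

```python
from itertools import permutations

# History: A calls, B calls, fail returned to A, success returned to B.
OPS = [("A", 0, 2, "fail"), ("B", 1, 3, "success")]

def is_legal_sequential(order):
    """Sequential specification of try-lock on a lock assumed to start free."""
    held = False
    for _thread, _call, _ret, result in order:
        expected = "fail" if held else "success"
        if result != expected:
            return False
        held = True  # once taken, the lock stays held in this small example
    return True

def respects_real_time(order):
    """An operation that returned before another was called must stay before it."""
    for i, placed_later in enumerate(order):
        for placed_earlier in order[:i]:
            if placed_later[2] < placed_earlier[1]:
                return False
    return True

def linearizations(ops):
    return [o for o in permutations(ops)
            if respects_real_time(o) and is_legal_sequential(o)]

found = linearizations(OPS)
```

Because the two operations overlap in time, both orderings respect the real-time condition, but only the one in which B goes first satisfies the sequential specification of the lock.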

=== PRAM consistency ===
[[Image:pram.jpg|thumb|right|300px|PRAM Consistency example]]

It is also known as FIFO consistency. The reasoning that led to this model was as follows: consider a multiprocessor where each processor has a local copy of the shared memory. For the memory to be scalable, the cost of an access should be independent of the time it takes to access the other processors’ memories. Lipton and Sandberg, who introduced the model, proposed that on a read, a PRAM would simply return the value stored in the local copy of the memory. On a write, it would update the local copy first and broadcast the new value to the other processors. Assuming a constant time for initiating a broadcast operation, the goal of making the cost for a read or write constant is thus achieved. In terms of ordering constraints, this is equivalent to requiring that all processors observe the writes from a single processor in the same order while they may disagree on the order of writes by different processors.
The example shown is legal under PRAM but not under SC or CC. P3 and P4 observe the writes by P1 and P2 in different orders, although W(x)1 and W(x)2 are potentially causally related.
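
The ordering constraint can be expressed as a small check. This is a sketch under the assumption that each write carries a unique label; the function name <code>is_pram_consistent</code> and the histories are hypothetical.

```python
def is_pram_consistent(writes_by_proc, observed_by_proc):
    """True if every observer sees each writer's writes in program order."""
    for observed in observed_by_proc.values():
        for writes in writes_by_proc.values():
            seen = [w for w in observed if w in writes]
            in_program_order = [w for w in writes if w in seen]
            if seen != in_program_order:
                return False
    return True

# Writes by different processors may be seen in different orders.
writes = {"P1": ["x=1"], "P2": ["x=2"]}
pram_ok = {"P3": ["x=1", "x=2"], "P4": ["x=2", "x=1"]}

# Writes by a single processor must be seen in the order they were issued.
writes_single = {"P1": ["x=1", "x=3"]}
pram_bad = {"P3": ["x=3", "x=1"]}
```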

=== Release consistency ===

Release consistency is one of the consistency models used in concurrent programming (e.g. in distributed shared memory, distributed transactions, etc.).

Systems of this kind are characterised by the existence of two special synchronisation operations, release and acquire. Before issuing a write to a memory object, a node must acquire the object via a special operation, and later release it. The code that runs between the acquire and release operations therefore constitutes the critical region. The system is said to provide release consistency if all write operations by a certain node are seen by the other nodes after the former releases the object and before the latter acquire it.

There are two kinds of protocols that implement release consistency:
* ''eager'', where all coherence actions are performed on release operations, and
* ''lazy'', where all coherence actions are delayed until after a subsequent acquire
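
The difference between the two protocols can be sketched as follows. The classes are hypothetical, mutual exclusion is not modelled, and "coherence actions" are reduced to copying the writes made inside the critical region.

```python
class Node:
    def __init__(self):
        self.memory = {}
        self.pending = {}  # writes made inside the current critical region

    def write(self, key, value):
        self.memory[key] = value
        self.pending[key] = value

class EagerObject:
    """Eager protocol: updates are pushed to all nodes at release."""

    def __init__(self, nodes):
        self.nodes = nodes

    def acquire(self, node):
        pass  # nothing to do, updates were pushed at the previous release

    def release(self, node):
        for other in self.nodes:
            other.memory.update(node.pending)
        node.pending = {}

class LazyObject:
    """Lazy protocol: updates are recorded at release and pulled at acquire."""

    def __init__(self):
        self.log = {}

    def acquire(self, node):
        node.memory.update(self.log)

    def release(self, node):
        self.log.update(node.pending)
        node.pending = {}

a, b = Node(), Node()
eager = EagerObject([a, b])
eager.acquire(a)
a.write("x", 1)
eager.release(a)
eager_before_acquire = b.memory.get("x")

c, d = Node(), Node()
lazy = LazyObject()
lazy.acquire(c)
c.write("x", 1)
lazy.release(c)
lazy_before_acquire = d.memory.get("x")
lazy.acquire(d)
lazy_after_acquire = d.memory.get("x")
```

In both cases the write is visible by the time the second node has acquired the object, which is all that release consistency requires; the protocols differ only in when the work is done.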

=== Sequential consistency ===
[[Image:sc.jpg|thumb|right|300px|Sequential Consistency example]]

Sequential consistency was first defined by Lamport in 1979. He defined a memory system to be sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.

In a sequentially consistent system, all processors must agree on the order of observed effects. The image to the right shows an example for SC.
Note that R(y)2 by processor P3 reads a value that has not been written yet! Of course, this is not possible in any real physical system. However, it shows a surprising flexibility of the SC model. Another reason why this is not a legal history for atomic consistency is that the
write operations W(x)1 and W(y)2 appear commuted at processor P3.

Sequential consistency has been the canonical memory consistency model for a long time. However, many multiprocessor machines actually implement a slightly weaker model called processor consistency.
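
For small histories, Lamport's definition can be tested by enumerating every interleaving that preserves program order and checking whether any of them explains the values read. The following is a brute-force sketch; the histories are hypothetical (they are not the one in the figure) and every location is assumed to start at 0.

```python
def interleavings(threads):
    """Yield every merge of the per-thread lists that keeps program order."""
    if all(not t for t in threads):
        yield []
        return
    for i, t in enumerate(threads):
        if t:
            rest = threads[:i] + [t[1:]] + threads[i + 1:]
            for tail in interleavings(rest):
                yield [t[0]] + tail

def is_sequentially_consistent(threads, initial=0):
    """True if some interleaving makes every read return the latest write."""
    for order in interleavings(threads):
        memory, ok = {}, True
        for kind, var, value in order:
            if kind == "W":
                memory[var] = value
            elif memory.get(var, initial) != value:
                ok = False
                break
        if ok:
            return True
    return False

# Operations are (kind, variable, value), one list per processor.
legal = [[("W", "x", 1)], [("R", "x", 0), ("R", "x", 1)]]
illegal = [[("W", "x", 1), ("R", "y", 0)], [("W", "y", 1), ("R", "x", 0)]]
```

The second history is not sequentially consistent: whichever write comes first in the total order, the other processor's read of that variable must come later and so cannot return 0.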

=== Weak consistency ===

A memory system is weakly consistent if it enforces the following restrictions:
* accesses to synchronization variables are sequentially consistent and
* no access to a synchronization variable is issued in a processor before all previous data accesses have been performed and
* no access is issued by a processor before a previous access to a synchronization variable has been performed
Notice that the meaning of “previous” is well-defined because it refers to program order. That is, an access A precedes access B if and only if the processor that executed access B has previously executed access A. Synchronizing accesses work as fences. At the time a synchronizing access performs, all previous accesses by that processor are guaranteed to have performed and all future accesses by that processor are guaranteed not to have performed. The synchronization model corresponding to these access order constraints is relatively simple. A program executing on a weakly consistent system appears sequentially consistent if the following two constraints are observed:
* There are no data races and
* Synchronization is visible to the memory system.
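
The fence-like behaviour of a synchronizing access can be sketched with a processor that buffers its data writes. The class <code>WeakProcessor</code> is hypothetical and models only the second restriction above: buffered data writes are performed before the synchronization access itself.

```python
class WeakProcessor:
    def __init__(self, shared):
        self.shared = shared
        self.buffer = []  # data writes issued but not yet performed

    def write(self, var, value):
        self.buffer.append((var, value))  # may be performed later

    def sync(self, sync_var, value):
        # All previous data accesses perform before the synchronization access.
        for var, val in self.buffer:
            self.shared[var] = val
        self.buffer.clear()
        self.shared[sync_var] = value

shared = {}
p = WeakProcessor(shared)
p.write("data", 42)
visible_before_sync = "data" in shared
p.sync("flag", 1)
```

A reader that observes <code>flag</code> set is therefore guaranteed to find the data write already performed, provided it also accesses <code>flag</code> through a synchronizing access.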

=== Strong consistency ===

== Practical performance impact ==
== Conclusion ==
==References==
<references/>

CSC/ECE 506 Spring 2012/10b sr

2012-04-04T22:03:35Z

Sshanbh: /* Consistency models used */

'''Use of consistency models in current multiprocessors'''

The memory consistency model of a shared memory system determines the order in which memory operations will appear to execute to the programmer. This article describes how consistency is used in multiprocessors today and later digs into the details of popular consistency models in use today. The impact of these models on the multiprocessor performance is also discussed. The article finishes off with a discussion about how the consistency models perform with larger multiprocessors.

== Introduction ==

Many modern computer systems and most multicore chips support shared memory in hardware. In a shared memory system, each of the processor cores may read and write to a single shared address space. These designs seek various goodness properties, such as high performance, low power, and low cost. Of course, it is not valuable to provide these goodness properties without first providing correctness. Correct shared memory seems intuitive at a hand-wave level, but there are subtle issues in even defining what it means for a shared memory system to be correct, as well as many subtle corner cases in designing a correct shared memory implementation. Moreover, these subtleties must be mastered in hardware implementations where bug fixes are expensive.

It is the job of consistency to define shared memory correctness. Consistency definitions provide rules about loads and stores (or memory reads and writes) and how they act upon memory. Ideally, consistency definitions would be simple and easy to understand. However, defining what it means for shared memory to behave correctly is more subtle than defining the correct behavior of, for example, a single-threaded processor core. The correctness criterion for a single processor core partitions behavior between one correct result and many incorrect alternatives. This is because the processor’s architecture mandates that the execution of a thread transforms a given input state into a single well-defined output state, even on an out-of-order core. Shared memory consistency models, however, concern the loads and stores of multiple threads and usually allow many correct executions while disallowing many incorrect ones. The possibility of multiple correct executions is due to the ISA allowing multiple threads to execute concurrently, often with many possible legal interleavings of instructions from different threads. The multitude of correct executions complicates the erstwhile simple challenge of determining whether an execution is correct. Nevertheless, consistency must be mastered to implement shared memory and, in some cases, to write correct programs that use it.

== Consistency in current-day multiprocessors ==

Today's scalable multiprocessors are mostly built with a distributed shared memory architecture. The memory is physically distributed but logically shared. In other words, the address spaces generated by all processing nodes globally form a single address space. The main advantage of such a system lies in the scalability with distributed hardware and programmability of a shared virtual memory. Representative systems include the Cray T3D, the Stanford Dash, the MIT Alewife, the Teracomputer, and Convex SPP.

A scalable system should be able to hide the long latencies of remote memory accesses. Several latency hiding techniques have been proposed: Coherent caches reduce the frequency of remote memory accesses by caching data close to the processor. Relaxed memory consistency allows reordering of memory events and buffering or pipelining of remote memory accesses. Data prefetching attempts to hide long read latency by issuing read requests well ahead of time, with the expectation that the data will be available in the cache when it is referenced. Multithreading attempts to hide the long latency by context switching between several active threads, thus allowing the processor to perform useful work while waiting for remote requests or synchronization faults to complete.

== Consistency models used ==

=== Address translation aware memory consistency ===

These memory consistency models define the behavior of operations (loads, stores, memory barriers, etc.) on physical addresses and virtual addresses. The two important levels of memory consistency that can be classified as address translation aware are described below:

==== Physical address memory consistency (PAMC)====
It is necessary to have correct PAMC for unmapped code to work correctly. Unmapped software, including the boot code and part of the system software that manages address translation (AT), relies upon PAMC. It is the responsibility of the hardware to implement PAMC, and this is specified precisely in the architectural manual.
It is not too difficult to adapt an AT-oblivious consistency model as the specification of PAMC.

Example:
The PAMC model could be SC. In such a case the interface would specify that
(i) there must exist a total order of all loads and stores to physical addresses that respects the program order of each thread and
(ii) the value of each load is equal to the value of the most recent store to that physical address in the total order.

==== Virtual address memory consistency (VAMC)====
Correct VAMC is required for mapped code to work correctly.

Although adapting an AT-oblivious consistency model for PAMC is straightforward, there are a few challenges when adapting an AT-oblivious consistency model for VAMC:
* synonyms - Multiple virtual addresses may map to the same physical address. Suppose two virtual addresses VA1 and VA2 map to the same physical address PA. SC requires that the value of a load is equal to the value of the most recent store to the same address. A naive definition of VAMC that ignores the level of indirection introduced by AT would treat VA1 and VA2 as different addresses, so a load from VA1 would not be required to return the value of the most recent store to VA2. To overcome this challenge, we re-formulate AT-oblivious consistency models for VAMC by applying the model to synonym sets of virtual addresses rather than individual addresses. For instance, we can define SC for VAMC as follows: there must exist a total order of all loads and stores to virtual addresses that respects program order and in which each load gets the value of the most recent store to any virtual address in the same virtual address synonym set. Incorporating synonyms explicitly in the consistency model allows programmers to reason about the ordering of accesses to virtual addresses.
* mapping and permission changes - Another challenge is that the set of memory operations at the VAMC level is richer than at the PAMC level: in addition to loads and stores, it includes operations that change virtual-to-physical mappings and access permissions, and VAMC must specify how these are ordered with respect to ordinary memory accesses.
* load/store side effects - Yet another challenge in specifying VAMC is that loads and stores to virtual addresses have certain side effects. The AT system includes status bits such as accessed and dirty bits for each page table entry. These status bits are part of the architectural state, and the ordering of updates to those bits must thus be specified in VAMC. To achieve this we add two new operations to the specification tables: Ld-sb (load’s impact on status bits) and St-sb (store’s impact on status bits).

=== Causal consistency ===
[[Image:causal.jpg|thumb|right|300px|Causal Consistency example]]

Hutto and Ahamad <ref>P.W. Hutto and M. Ahamad. Slow memory: Weakening consistency to enhance concurrency in distributed shared memories. In Proceedings of the 10th International Conference on Distributed Computing Systems, pages 302–311, May 1990.</ref> introduced causal consistency. Lamport <ref>Leslie Lamport. Time, clocks, and the ordering of events in a distributed system. Communications of the ACM, 21(7):558–565, 1978.</ref> defined the notion of potential causality to capture the flow of information in a distributed system. This notion can be applied to a memory system by interpreting a write as a message-send event and a read as a message-read event. A memory is causally consistent if all processors agree on the order of causally related events. Causally unrelated events (concurrent events) can be observed in different orders.
The example shown here is a legal execution history under CC but not under SC. Note that W(x)1 and W(x)2 are causally related as P2 observed the first write by P1. Furthermore, P3 and P4 observe the accesses W(x)2 and W(x)3 in different orders, which would not be legal in SC.

=== Delta consistency ===

The delta consistency model states that after a fixed time period δ, an update is propagated through the system and all replicas will be consistent. In other words, barring a short bounded interval after a modification, the result of any read operation is consistent with a read on the original copy of an object. If an object is modified, the read will not be consistent during the short period of time following its modification. Once the fixed time period elapses, the modification is propagated and the read is now consistent.

=== Entry consistency ===

This consistency model has been designed to be used with critical sections. Here, the programmer needs to use acquire and release at the start and end of each critical section, just like in both variants of release consistency. However, it also requires every ordinary shared variable to be associated with a synchronization variable such as a lock or a barrier. If the elements of an array need to be accessed independently in parallel, then each element of the array must be associated with a lock. When an acquire is done on a synchronization variable, only those ordinary shared variables guarded by that synchronization variable are made consistent. Release consistency does not associate shared variables with locks or barriers and at acquire time has to determine empirically which variables it needs. This is where entry consistency differs from release consistency.

Formally, a memory exhibits entry consistency if the following conditions are met:

* An acquire access of a synchronization variable is not allowed to perform with respect to a process until all updates to the guarded shared data have been performed with respect to that process.
* Before an exclusive mode access to a synchronization variable by a process is allowed to perform with respect to that process, no other process may hold the synchronization variable, not even in nonexclusive mode.
* After an exclusive mode access to a synchronization variable has been performed, any other process's next nonexclusive mode access to that synchronization variable may not be performed until it has performed with respect to that variable's owner.

=== Eventual consistency ===

Given a sufficiently long period of time over which no changes are sent, all updates can be expected to propagate eventually through the system. Thus, all the replicas will be consistent. As the consistency achieved is eventual, the possibility of conflicts is high. The conflicts have to be resolved. There are three types of resolution:
* Read repair: The correction is done when a read finds an inconsistency. This slows down the read operation.
* Write repair: The correction is done when a write operation finds an inconsistency, slowing down the write operation.
* Asynchronous repair: The correction is not part of a read or write operation.

=== Linearizability ===

This is also known as strict or atomic consistency. An operation is linearizable if it appears to the rest of the system to occur instantaneously. An atomic operation either occurs completely or does not occur at all. In reality, atomic operations do not actually occur instantaneously. Atomicity is enforced by mutual exclusion. At the software level, locks or semaphores are used to achieve this, while at the hardware level, a cache coherency protocol may be used. This makes it appear to the user that the entire operation occurred in a single instruction.

A sequence of invocations and responses made of an object by a set of threads is referred to as a history. When a function is invoked, a subsequent response is generated.

Example:
Suppose two threads, A and B, attempt to acquire a lock, backing off if it's already taken.
This would be modeled as both threads invoking the lock operation, then both threads receiving a response, one successful, one not.
* A calls lock
* B calls lock
* lock returns fail to A
* lock returns success to B

When every invocation is immediately followed by its response, the history is called a sequential history. A history is linearizable if:

* its invocations and responses can be reordered to yield a sequential history
* that sequential history is correct according to the sequential definition of the object
* if a response preceded an invocation in the original history, it must still precede it in the sequential reordering

Now we can reorder the above example in two ways as follows:

Example:
One way would be:
* A calls lock
* lock returns fail to A
* B calls lock
* lock returns success to B

The other way would be:
* B calls lock
* lock returns success to B
* A calls lock
* lock returns fail to A

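Only the second reordering is a correct sequential history for a lock that starts out free, because in the first one A's attempt would have succeeded. The second reordering is therefore a linearization of the original history. Thus, an object is linearizable if all valid histories of its use can be linearized.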

=== PRAM consistency ===
[[Image:pram.jpg|thumb|right|300px|PRAM Consistency example]]

It is also known as FIFO consistency. The reasoning that led to this model was as follows: consider a multiprocessor where each processor has a local copy of the shared memory. For the memory to be scalable, the cost of an access should be independent of the time it takes to access the other processors’ memories. Lipton and Sandberg, who introduced the model, proposed that on a read, a PRAM would simply return the value stored in the local copy of the memory. On a write, it would update the local copy first and broadcast the new value to the other processors. Assuming a constant time for initiating a broadcast operation, the goal of making the cost for a read or write constant is thus achieved. In terms of ordering constraints, this is equivalent to requiring that all processors observe the writes from a single processor in the same order while they may disagree on the order of writes by different processors.
The example shown is legal under PRAM but not under SC or CC. P3 and P4 observe the writes by P1 and P2 in different orders, although W(x)1 and W(x)2 are potentially causally related.

=== Release consistency ===

Release consistency is one of the consistency models used in concurrent programming (e.g. in distributed shared memory, distributed transactions, etc.).

Systems of this kind are characterised by the existence of two special synchronisation operations, release and acquire. Before issuing a write to a memory object, a node must acquire the object via a special operation, and later release it. The code that runs between the acquire and release operations therefore constitutes the critical region. The system is said to provide release consistency if all write operations by a certain node are seen by the other nodes after the former releases the object and before the latter acquire it.

There are two kinds of protocols that implement release consistency:
* ''eager'', where all coherence actions are performed on release operations, and
* ''lazy'', where all coherence actions are delayed until after a subsequent acquire

=== Sequential consistency ===
[[Image:sc.jpg|thumb|right|300px|Sequential Consistency example]]

Sequential consistency was first defined by Lamport in 1979. He defined a memory system to be sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.

In a sequentially consistent system, all processors must agree on the order of observed effects. The image to the right shows an example for SC.
Note that R(y)2 by processor P3 reads a value that has not been written yet! Of course, this is not possible in any real physical system. However, it shows a surprising flexibility of the SC model. Another reason why this is not a legal history for atomic consistency is that the
write operations W(x)1 and W(y)2 appear commuted at processor P3.

Sequential consistency has been the canonical memory consistency model for a long time. However, many multiprocessor machines actually implement a slightly weaker model called processor consistency.

=== Serializability ===
=== Vector-field consistency ===
=== Weak consistency ===

A memory system is weakly consistent if it enforces the following restrictions:
* accesses to synchronization variables are sequentially consistent and
* no access to a synchronization variable is issued in a processor before all previous data accesses have been performed and
* no access is issued by a processor before a previous access to a synchronization variable has been performed
Notice that the meaning of “previous” is well-defined because it refers to program order. That is, an access A precedes access B if and only if the processor that executed access B has previously executed access A. Synchronizing accesses work as fences. At the time a synchronizing access performs, all previous accesses by that processor are guaranteed to have performed and all future accesses by that processor are guaranteed not to have performed. The synchronization model corresponding to these access order constraints is relatively simple. A program executing on a weakly consistent system appears sequentially consistent if the following two constraints are observed:
* There are no data races and
* Synchronization is visible to the memory system.

=== Strong consistency ===

== Practical performance impact ==
== Conclusion ==
==References==
<references/>

CSC/ECE 506 Spring 2012/10b sr

2012-04-04T22:00:47Z

Sshanbh: /* Linearizability */

'''Use of consistency models in current multiprocessors'''

The memory consistency model of a shared memory system determines the order in which memory operations will appear to execute to the programmer. This article describes how consistency is used in multiprocessors today and later digs into the details of popular consistency models in use today. The impact of these models on the multiprocessor performance is also discussed. The article finishes off with a discussion about how the consistency models perform with larger multiprocessors.

== Introduction ==

Many modern computer systems and most multicore chips support shared memory in hardware. In a shared memory system, each of the processor cores may read and write to a single shared address space. These designs seek various goodness properties, such as high performance, low power, and low cost. Of course, it is not valuable to provide these goodness properties without first providing correctness. Correct shared memory seems intuitive at a hand-wave level, but there are subtle issues in even defining what it means for a shared memory system to be correct, as well as many subtle corner cases in designing a correct shared memory implementation. Moreover, these subtleties must be mastered in hardware implementations where bug fixes are expensive.

It is the job of consistency to define shared memory correctness. Consistency definitions provide rules about loads and stores (or memory reads and writes) and how they act upon memory. Ideally, consistency definitions would be simple and easy to understand. However, defining what it means for shared memory to behave correctly is more subtle than defining the correct behavior of, for example, a single-threaded processor core. The correctness criterion for a single processor core partitions behavior between one correct result and many incorrect alternatives. This is because the processor’s architecture mandates that the execution of a thread transforms a given input state into a single well-defined output state, even on an out-of-order core. Shared memory consistency models, however, concern the loads and stores of multiple threads and usually allow many correct executions while disallowing many incorrect ones. The possibility of multiple correct executions is due to the ISA allowing multiple threads to execute concurrently, often with many possible legal interleavings of instructions from different threads. The multitude of correct executions complicates the erstwhile simple challenge of determining whether an execution is correct. Nevertheless, consistency must be mastered to implement shared memory and, in some cases, to write correct programs that use it.

== Consistency in current-day multiprocessors ==

Today's scalable multiprocessors are mostly built with a distributed shared memory architecture. The memory is physically distributed but logically shared. In other words, the address spaces generated by all processing nodes globally form a single address space. The main advantage of such a system lies in the scalability with distributed hardware and programmability of a shared virtual memory. Representative systems include the Cray T3D, the Stanford Dash, the MIT Alewife, the Teracomputer, and Convex SPP.

A scalable system should be able to hide the long latencies of remote memory accesses. Several latency hiding techniques have been proposed: Coherent caches reduce the frequency of remote memory accesses by caching data close to the processor. Relaxed memory consistency allows reordering of memory events and buffering or pipelining of remote memory accesses. Data prefetching attempts to hide long read latency by issuing read requests well ahead of time, with the expectation that the data will be available in the cache when it is referenced. Multithreading attempts to hide the long latency by context switching between several active threads, thus allowing the processor to perform useful work while waiting for remote requests or synchronization faults to complete.

== Consistency models used ==

=== Address translation aware memory consistency ===

These memory consistency models define the behavior of operations (loads, stores, memory barriers, etc.) on physical addresses and virtual addresses. The two important levels of memory consistency that can be classified as address translation aware are described below:

==== Physical address memory consistency (PAMC)====
It is necessary to have correct PAMC for unmapped code to work correctly. Unmapped software, including the boot code and part of the system software that manages address translation (AT), relies upon PAMC. It is the responsibility of the hardware to implement PAMC, and this is specified precisely in the architectural manual.
It is not too difficult to adapt an AT-oblivious consistency model as the specification of PAMC.

Example:
The PAMC model could be SC. In such a case the interface would specify that
(i) there must exist a total order of all loads and stores to physical addresses that respects the program order of each thread and
(ii) the value of each load is equal to the value of the most recent store to that physical address in the total order.
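
The two conditions above can be checked mechanically for a candidate total order. The following is only a minimal sketch, assuming each operation is recorded as a tuple of thread id, position in program order, kind, physical address and value, and assuming memory initially holds 0; the function names are illustrative and not taken from any architecture manual.

```python
from typing import List, Tuple

# (thread id, position in program order, "ld" or "st", physical address, value)
Op = Tuple[int, int, str, int, int]

def respects_program_order(order: List[Op]) -> bool:
    """Condition (i): each thread's operations appear in program order."""
    last_index = {}
    for thread, index, _kind, _addr, _value in order:
        if index <= last_index.get(thread, -1):
            return False
        last_index[thread] = index
    return True

def loads_see_latest_store(order: List[Op], initial: int = 0) -> bool:
    """Condition (ii): each load returns the most recent store to its address."""
    memory = {}
    for _thread, _index, kind, addr, value in order:
        if kind == "st":
            memory[addr] = value
        elif memory.get(addr, initial) != value:
            return False
    return True

def is_sc_witness(order: List[Op]) -> bool:
    """True if the candidate total order satisfies both SC conditions."""
    return respects_program_order(order) and loads_see_latest_store(order)
```

An execution is SC if at least one total order of its operations passes this check; the sketch only validates a given order and does not search for one.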

==== Virtual address memory consistency (VAMC)====
Correct VAMC is required for mapped code to work correctly.

Although adapting an AT-oblivious consistency model for PAMC is straightforward, there are a few challenges when adapting an AT-oblivious consistency model for VAMC:
* synonyms - Multiple virtual addresses may map to the same physical address. Suppose two virtual addresses VA1 and VA2 map to the same physical address PA. SC requires that the value of a load is equal to the value of the most recent store to the same address. A naive definition of VAMC that ignores the level of indirection introduced by AT would treat VA1 and VA2 as different addresses, so a load from VA1 would not be required to return the value of a more recent store to VA2. To overcome this challenge, we re-formulate AT-oblivious consistency models for VAMC by applying the model to synonym sets of virtual addresses rather than individual addresses. For instance, we can define SC for VAMC as follows: there must exist a total order of all loads and stores to virtual addresses that respects program order and in which each load gets the value of the most recent store to any virtual address in the same virtual address synonym set. Incorporating synonyms explicitly in the consistency model allows programmers to reason about the ordering of accesses to virtual addresses.
* mapping and permission changes - Another challenge is that the set of memory operations at the VAMC level is richer than at the PAMC level.
* load/store side effects - Yet another challenge in specifying VAMC is that loads and stores to virtual addresses have certain side effects. The AT system includes status bits such as accessed and dirty bits for each page table entry. These status bits are part of the architectural state, and the ordering of updates to those bits must thus be specified in VAMC. To achieve this we add two new operations to the specification tables: Ld-sb (load’s impact on status bits) and St-sb (store’s impact on status bits).
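
The synonym-set formulation of SC for VAMC can be illustrated with a small replay function. This is a sketch under simplifying assumptions: the mapping from virtual to physical addresses is fixed (no mapping or permission changes), status bits are ignored, and memory initially holds 0. The page table layout and the function name are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

def load_values_by_synonym_set(
    page_table: Dict[str, int],
    order: List[Tuple[str, str, Optional[int]]],
    initial: int = 0,
) -> List[int]:
    """Replay a total order of ("ld" or "st", virtual address, value) operations.

    Returns the value each load must observe when the most recent store is
    tracked per synonym set, i.e. per physical address, not per virtual address.
    """
    latest = {}  # physical address -> last stored value (one entry per synonym set)
    observed = []
    for kind, va, value in order:
        pa = page_table[va]
        if kind == "st":
            latest[pa] = value
        else:
            observed.append(latest.get(pa, initial))
    return observed
```

With `VA1` and `VA2` mapped to the same physical address, a store through `VA1` followed by a load through `VA2` yields the stored value, which is exactly what a per-virtual-address definition would fail to require.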

=== Causal consistency ===
[[Image:causal.jpg|thumb|right|300px|Causal Consistency example]]

Hutto and Ahamad <ref>P.W. Hutto and M. Ahamad. Slow memory: Weakening consistency to enhance concurrency in distributed shared memories. In Proceedings of the 10th International Conference on Distributed Computing Systems, pages 302–311, May 1990.</ref> introduced causal consistency. Lamport <ref>Leslie Lamport. Time, clocks, and the ordering of events in a distributed system. Communications of the ACM, 21(7):558–565, 1978.</ref> defined the notion of potential causality to capture the flow of information in a distributed system. This notion can be applied to a memory system by interpreting a write as a message-send event and a read as a message-read event. A memory is causally consistent if all processors agree on the order of causally related events. Causally unrelated events (concurrent events) can be observed in different orders.
The example shown here is a legal execution history under CC but not under SC. Note that W(x)1 and W(x)2 are causally related as P2 observed the first write by P1. Furthermore, P3 and P4 observe the accesses W(x)2 and W(x)3 in different orders, which would not be legal in SC.

=== Delta consistency ===

The delta consistency model states that after a fixed time period δ, an update is propagated through the system and all replicas will be consistent. In other words, barring a short bounded interval after a modification, the result of any read operation is consistent with a read on the original copy of an object. If an object is modified, the read will not be consistent during the short period of time following its modification. Once the fixed time period elapses, the modification is propagated and the read is now consistent.
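
The bounded staleness described above can be sketched with a toy replica. This is an illustration only: it models the worst case in which every update takes exactly δ time units to reach the replica, it assumes writes are recorded in nondecreasing time order, and the class name is made up for the example.

```python
class DeltaReplica:
    """Toy replica that lags the original copy by at most `delta` time units."""

    def __init__(self, delta, initial=None):
        self.delta = delta
        # (write time, value) pairs in time order; the initial value is always visible
        self.history = [(float("-inf"), initial)]

    def write(self, now, value):
        """Record a modification made to the original copy at time `now`."""
        self.history.append((now, value))

    def read(self, now):
        """Worst case: only writes that are at least delta old have propagated."""
        visible = [value for t, value in self.history if t + self.delta <= now]
        return visible[-1]
```

A read issued less than δ after a write may still return the old value; once δ has elapsed, the read is consistent with the original copy.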

=== Entry consistency ===

This consistency model has been designed to be used with critical sections. Here, the programmer needs to use acquire and release at the start and end of each critical section, just as in both variants of release consistency. However, it also requires every ordinary shared variable to be associated with a synchronization variable such as a lock or a barrier. If the elements of an array need to be accessed independently in parallel, then each element of the array must be associated with its own lock. When an acquire is done on a synchronization variable, only those ordinary shared variables guarded by that synchronization variable are made consistent. This is where entry consistency differs from release consistency, which does not associate shared variables with locks or barriers and at acquire time has to determine empirically which variables it needs.

Formally, a memory exhibits entry consistency if the following conditions are met:

* An acquire access of a synchronization variable is not allowed to perform with respect to a process until all updates to the guarded shared data have been performed with respect to that process.
* Before an exclusive mode access to a synchronization variable by a process is allowed to perform with respect to that process, no other process may hold the synchronization variable, not even in nonexclusive mode.
* After an exclusive mode access to a synchronization variable has been performed, any other process's next nonexclusive mode access to that synchronization variable may not be performed until it has performed with respect to that variable's owner.

=== Eventual consistency ===

Given a sufficiently long period of time over which no changes are sent, all updates can be expected to propagate eventually through the system. Thus, all the replicas will be consistent. As the consistency achieved is eventual, the possibility of conflicts is high. The conflicts have to be resolved. There are three types of resolution:
* Read repair: The correction is done when a read finds an inconsistency. This slows down the read operation.
* Write repair: The correction is done when a write operation finds an inconsistency, slowing down the write operation.
* Asynchronous repair: The correction is not part of a read or write operation.
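
As an illustration of read repair, the sketch below keeps a version number next to each value and lets the highest version win. The versioning scheme and the function name are assumptions made for this example; real systems use a variety of conflict-resolution rules.

```python
from typing import Dict, List, Tuple

Replica = Dict[str, Tuple[int, str]]  # key -> (version, value)

def read_with_repair(replicas: List[Replica], key: str) -> str:
    """Return the newest value for `key` and overwrite stale replicas on the way.

    Assumes at least one replica holds the key.
    """
    newest = max(replica[key] for replica in replicas if key in replica)
    for replica in replicas:
        if replica.get(key) != newest:
            replica[key] = newest  # the repair: extra work on the read path
    return newest[1]
```

The loop over all replicas is the reason read repair slows the read down. Write repair would perform the same comparison on the write path, and asynchronous repair would run it in a background task.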

=== Linearizability ===

This is also known as strict or atomic consistency. An operation is linearizable if it appears to the rest of the system to occur instantaneously. An atomic operation either occurs completely or does not occur at all. In reality, atomic operations do not actually occur instantaneously; atomicity is enforced by mutual exclusion. At the software level, locks or semaphores are used to achieve this, while at the hardware level, a cache coherency protocol may be used. This makes it appear to the user that the entire operation occurred in a single instruction.

A sequence of invocations and responses made of an object by a set of threads is referred to as a history. Each invocation of a function is followed by a subsequent response.

Example:
Suppose two threads, A and B, attempt to acquire a lock, backing off if it is already taken. This is modeled as both threads invoking the lock operation and then both threads receiving a response, one successful and one not:
* A calls lock
* B calls lock
* lock returns fail to A
* lock returns success to B

When every invocation is immediately followed by its response, the history is called a sequential history. A history is linearizable if:

* its invocations and responses can be reordered to yield a sequential history
* that sequential history is correct according to the sequential definition of the object
* if a response preceded an invocation in the original history, it must still precede it in the sequential reordering

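The history in the example above can be reordered into a sequential history in two ways:

Example:
One way would be:
* A calls lock
* lock returns fail to A
* B calls lock
* lock returns success to B

This is a sequential history, but it is not correct according to the sequential definition of a lock: assuming the lock was free at the start, A's call should have succeeded and B's call should have failed.

The other way would be:
* B calls lock
* lock returns success to B
* A calls lock
* lock returns fail to A

This sequential history is correct and does not reorder any response before an invocation that it originally followed, so it is a linearization of the original history.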

Thus, an object is linearizable if all valid histories of its use can be linearized.
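
The three conditions can be checked by brute force for small histories. The sketch below does this for the lock example, assuming the lock is initially free, is never released, and each thread performs exactly one operation. The event encoding and function names are invented for the illustration.

```python
from itertools import permutations

def operations(history):
    """Pair each invocation with its response: thread -> [call pos, return pos, result]."""
    ops = {}
    for pos, event in enumerate(history):
        if event[0] == "call":
            ops[event[1]] = [pos, None, None]
        else:  # ("return", thread, result)
            ops[event[1]][1] = pos
            ops[event[1]][2] = event[2]
    return ops

def legal_for_lock(results):
    """Sequential definition of an initially free, never released lock:
    only the first attempt succeeds."""
    return all(r == ("success" if i == 0 else "fail") for i, r in enumerate(results))

def linearizations(history):
    """Return every sequential order of the operations that is a linearization."""
    ops = operations(history)
    found = []
    for order in permutations(ops):
        # a is placed before b, so b must not have returned before a was called
        keeps_real_time = all(
            not (ops[b][1] < ops[a][0])
            for i, a in enumerate(order)
            for b in order[i + 1:]
        )
        if keeps_real_time and legal_for_lock([ops[t][2] for t in order]):
            found.append(order)
    return found

history = [("call", "A"), ("call", "B"),
           ("return", "A", "fail"), ("return", "B", "success")]
```

For this history, `linearizations(history)` finds a single order, with B's operation first: the two calls overlap, so either order preserves real-time precedence, but only that order matches the sequential definition of the lock.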

=== One-copy serializability ===
=== PRAM consistency ===
[[Image:pram.jpg|thumb|right|300px|PRAM Consistency example]]

PRAM (pipelined RAM) consistency, also known as FIFO consistency, was proposed by Lipton and Sandberg. The reasoning that led to this model was as follows: Consider a multi-processor where each processor has a local copy of the shared memory. For the memory to be scalable, the cost of an access should be independent of the time it takes to access the other processors’ memories. They proposed that on a read, a PRAM would simply return the value stored in the local copy of the memory. On a write, it would update the local copy first and broadcast the new value to the other processors. Assuming a constant time for initiating a broadcast operation, the goal of making the cost for a read or write constant is thus achieved. In terms of ordering constraints, this is equivalent to requiring that all processors observe the writes from a single processor in the same order, while they may disagree on the order of writes by different processors.
The example shown is legal under PRAM but not under SC or CC. P3 and P4 observe the writes by P1 and P2 in different orders, although W(x)1 and W(x)2 are potentially causally related.
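
The ordering constraint can be expressed as a small check over what each processor observes. In this sketch every write is identified by its writer and a per-writer sequence number, and each observer's view lists the writes in the order that observer saw them; the representation is an assumption made for the example.

```python
def is_pram_consistent(views):
    """views: observer -> list of (writer, sequence number) in observed order.

    PRAM only requires that the writes of any single writer are seen in that
    writer's program order; observers may interleave different writers freely.
    """
    for observed in views.values():
        last_seen = {}
        for writer, seq in observed:
            if seq < last_seen.get(writer, -1):
                return False  # two writes by one processor seen out of program order
            last_seen[writer] = seq
    return True
```

Two observers that see the writes of P1 and P2 in opposite orders still pass the check, whereas a single observer that sees a processor's second write before its first does not.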

=== Release consistency ===

Release consistency is one of the consistency models used in the domain of concurrent programming (e.g. in distributed shared memory, distributed transactions etc.).

Systems of this kind are characterised by the existence of two special synchronisation operations, release and acquire. Before issuing a write to a memory object, a node must acquire the object via a special operation, and later release it. The code that runs between the acquire and release operations therefore constitutes the critical region. The system is said to provide release consistency if all write operations by a certain node are seen by the other nodes after the former releases the object and before the latter acquire it.

There are two kinds of protocols that implement release consistency:
* ''eager'', where all coherence actions are performed on release operations, and
* ''lazy'', where all coherence actions are delayed until after a subsequent acquire

=== Sequential consistency ===
[[Image:sc.jpg|thumb|right|300px|Sequential Consistency example]]

Sequential consistency was first defined by Lamport in 1979. He defined a memory system to be sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.

In a sequentially consistent system, all processors must agree on the order of observed effects. The image to the right shows an example for SC. Note that R(y)2 by processor P3 reads a value that has not been written yet! Of course, this is not possible in any real physical system. However, it shows a surprising flexibility of the SC model. Another reason why this is not a legal history for atomic consistency is that the write operations W(x)1 and W(y)2 appear commuted at processor P3.

Sequential consistency has been the canonical memory consistency model for a long time. However, many multiprocessor machines actually implement a slightly weaker model called processor consistency.
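
Lamport's definition can be explored by enumerating every interleaving of two small threads that respects program order. The sketch below uses the well-known store-buffering litmus test (each thread stores to one variable and then loads the other); the encoding of operations and the helper names are assumptions made for this illustration.

```python
def interleavings(a, b):
    """Yield every merge of sequences a and b that keeps each one's internal order."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest

def sc_outcomes(thread0, thread1):
    """Run every interleaving against one shared memory; collect final register values."""
    outcomes = set()
    for schedule in interleavings(thread0, thread1):
        memory, registers = {"x": 0, "y": 0}, {}
        for kind, target, source in schedule:
            if kind == "store":
                memory[target] = source             # source is a constant
            else:
                registers[target] = memory[source]  # load variable `source` into a register
        outcomes.add(tuple(sorted(registers.items())))
    return outcomes

T0 = [("store", "x", 1), ("load", "r0", "y")]
T1 = [("store", "y", 1), ("load", "r1", "x")]
```

Under SC, `sc_outcomes(T0, T1)` never contains the outcome in which both `r0` and `r1` are 0, because some store must come first in the sequential order. Weaker models that let a load bypass an earlier buffered store to a different address can produce that outcome, which is one way the difference between SC and such models becomes visible to programs.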

=== Serializability ===
=== Vector-field consistency ===
=== Weak consistency ===

A memory system is weakly consistent if it enforces the following restrictions:
* accesses to synchronization variables are sequentially consistent and
* no access to a synchronization variable is issued in a processor before all previous data accesses have been performed and
* no access is issued by a processor before a previous access to a synchronization variable has been performed
Notice that the meaning of “previous” is well-defined because it refers to program order. That is, an access A precedes access B if and only if the processor that executed access B has previously executed access A. Synchronizing accesses work as fences. At the time a synchronizing access performs, all previous accesses by that processor are guaranteed to have performed, and all later accesses by that processor are guaranteed not to have performed. The synchronization model corresponding to these access order constraints is relatively simple. A program executing on a weakly consistent system appears sequentially consistent if the following two constraints are observed:
* There are no data races and
* Synchronization is visible to the memory system.

=== Strong consistency ===

== Practical performance impact ==
== Conclusion ==
==References==
<references/>

CSC/ECE 506 Spring 2012/10b sr

2012-04-04T09:41:33Z

Sshanbh: /* Consistency models used */

CSC/ECE 506 Spring 2012/10b sr

2012-04-04T09:40:00Z

Sshanbh: /* Eventual consistency */

CSC/ECE 506 Spring 2012/10b sr

2012-04-04T09:31:03Z

Sshanbh: /* Entry consistency */

CSC/ECE 506 Spring 2012/10b sr

2012-04-04T09:09:11Z

Sshanbh: /* Delta consistency */

CSC/ECE 506 Spring 2012/10b sr

2012-04-04T08:59:30Z

Sshanbh: /* Physical address memory consistency (PAMC) */

CSC/ECE 506 Spring 2012/10b sr

2012-04-04T08:58:48Z

Sshanbh: /* Physical address memory consistency (PAMC) */

CSC/ECE 506 Spring 2012/10b sr

2012-04-04T08:49:55Z

Sshanbh: /* Virtual address memory consistency (VAMC) */

CSC/ECE 506 Spring 2012/10b sr

2012-04-04T08:14:48Z

Sshanbh: /* Address translation aware memory consistency */

CSC/ECE 506 Spring 2012/10b sr

2012-04-04T06:36:32Z

Sshanbh: /* Consistency models used */

CSC/ECE 506 Spring 2012/10b sr

2012-04-04T01:45:24Z

Sshanbh: Created page with "'''Use of consistency models in current multiprocessors''' == Introduction == == Consistency in current-day multiprocessors == == Consistency models used == == Practical perform..."

'''Use of consistency models in current multiprocessors'''

== Introduction ==
== Consistency in current-day multiprocessors ==
== Consistency models used ==
== Practical performance impact ==
== Conclusion ==

CSC/ECE 506 Spring 2012/4b rs

2012-02-14T04:27:18Z

Sshanbh: /* Superlinear speedup */

'''The limits to speedup'''

== Introduction <ref>http://en.wikipedia.org/wiki/Amdahl%27s_law</ref> ==
In parallel computing, speedup refers to how much faster a parallel algorithm is than a corresponding sequential algorithm. Parallel computing is very important in scientific computation because of the speedup it offers, especially for computations that involve large data sets. One argument against the need for parallel computing is that, according to Moore’s Law, the speed of a chip doubles every 18 months, so there is no need to develop parallel computation. This claim, however, has proved to be wrong.

According to [http://en.wikipedia.org/wiki/Amdahl%27s_law Amdahl's law], the speedup of a program using multiple processors in parallel computing is limited by the time needed for the sequential fraction of the program. Amdahl's law, however, assumes a fixed problem that is to be solved in the shortest possible period of time, rather than the largest possible problem (e.g., the most accurate possible approximation) solved in a fixed "reasonable" amount of time. To address this shortcoming, [http://en.wikipedia.org/wiki/John_L._Gustafson John L. Gustafson] and his colleague Edwin H. Barsis described [http://en.wikipedia.org/wiki/Gustafson%27s_law#Derivation_of_Gustafson.27s_Law Gustafson's Law], which provides a counterpoint to the limit that Amdahl's law places on the speedup that parallelization can provide for a fixed data set size.

== Scaled speedup==

Scaled speedup is the speedup achieved when the data size of a problem is increased along with the number of parallel processors used to solve it. In other words, with a larger number of parallel processors at our disposal, we can increase the data size of the same problem and achieve a higher speedup. Gustafson's Law describes how this speedup is achieved.

===Gustafson's Law <ref>http://en.wikipedia.org/wiki/Gustafson%27s_Law</ref>===

[[Image:Gustafson.png|thumb|right|350px|Figure 1. Gustafson's Law]]

Gustafson's Law says that computations involving sufficiently large data sets can be parallelized efficiently. It responds to skepticism regarding the viability of massive parallelism. This skepticism is largely due to Amdahl's law, which says that the maximum speedup achievable for a problem with serial fraction of work s is 1/s, even as the number of processors grows to infinity. For example, if 5% of the computation in a problem is serial, then the maximum achievable speedup is 20 regardless of the number of processors. This is not a very encouraging result.

Amdahl's law does not fully exploit the computing power that becomes available as the number of machines increases. Gustafson's law addresses this limitation. It considers the effect of increasing the problem size. Gustafson reasoned that when a problem is ported onto a multiprocessor system, it is possible to consider larger problem sizes. In other words, the same problem with a larger number of data values can be solved in the same time. The law proposes that programmers tend to set the size of problems to use the available equipment to solve problems within a practical fixed time. Larger problems can be solved in the same time if faster, i.e. more parallel, equipment is available. Therefore, it should be possible to achieve high speedup if we scale the problem size.

Example:
* s (serial fraction of work) = 5%
* p (number of processors) = 20
* speedup (Amdahl's Law) = 1 / (s + (1 - s) / p) = 10.26
* scaled speedup (Gustafson's Law) = s + p (1 - s) = 19.05
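
The two numbers in the example can be reproduced with a few lines of code. This sketch uses the fixed-size formula 1 / (s + (1 - s) / p) and the scaled-size formula s + p (1 - s); the function names are chosen only for illustration.

```python
def amdahl_speedup(s, p):
    """Fixed-size speedup for serial fraction s on p processors."""
    return 1.0 / (s + (1.0 - s) / p)

def gustafson_speedup(s, p):
    """Scaled-size speedup for serial fraction s on p processors."""
    return s + p * (1.0 - s)

print(round(amdahl_speedup(0.05, 20), 2))     # 10.26
print(round(gustafson_speedup(0.05, 20), 2))  # 19.05
```

Increasing `p` in `amdahl_speedup(0.05, p)` moves the result toward, but never past, the limit 1/s = 20, while `gustafson_speedup(0.05, p)` keeps growing linearly in `p`.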

===Derivation of Gustafson's Law<ref>http://www.johngustafson.net/pubs/pub13/amdahl.pdf</ref><ref>http://en.wikipedia.org/wiki/Gustafson%27s_Law#Derivation_of_Gustafson.27s_Law</ref>===

If p is the number of processors, s is the fraction of time spent (by a serial processor) on serial parts of a program and 1-s is the fraction of time spent (by a serial processor) on parts of the program that can be done in parallel, then [http://en.wikipedia.org/wiki/Amdahl%27s_law Amdahl's law] says that speedup is given by:

[[Image:Amdahl.png|center|300px]]

Let us consider a bigger problem size of measure n.

The execution of the program on a parallel computer is decomposed into:

[[Image:Eq2.png|center|250px]]

where s is the sequential fraction, 1-s is the parallel fraction, ignoring overhead for now, and p is the number of processors working in parallel during the parallel stage.

The relative time for sequential processing would be '''''s + p (1 - s)''''', where p is the number of processors in the parallel case.

Speedup is therefore:

[[Image:Eq3.png|center|550px]]

where s := s(n) is the sequential fraction.

Assuming the sequential fraction s(n) diminishes with problem size n, then speedup approaches p as n approaches infinity, as desired.
Thus Gustafson's law seems to rescue parallel processing from Amdahl's law.

Amdahl's law argues that even a massively parallel computer system cannot influence the sequential part of a fixed workload. Since this part is irreducible, the share of the parallel execution time spent in it is a function of p that approaches 1 for large p. In comparison, Gustafson's law is based on the idea that the importance of the sequential part diminishes with a growing workload; if n is allowed to grow along with p, the sequential fraction will not ultimately dominate.

In contrast with Amdahl's Law, the scaled speedup s + p (1 - s), viewed as a function of the serial fraction s, is simply a line, and one with a much more moderate slope: 1 - p.

It is thus much easier to achieve efficient parallel performance than is implied by Amdahl’s paradigm. The two approaches, fixed-sized and scaled-sized, are contrasted and summarized in Figure 2a and b.

[[Image:Amdahl_slope.png|thumb|center|375px|Figure 2a. Fixed-Size Model: Speedup = 1 / (s + (1-s) / p)]]

[[Image:Gus_slope.png|thumb|center|700px|Figure 2b. Scaled-Size Model: Speedup = s + p (1-s)]]

===A Construction Metaphor===

'''Amdahl's Law approximately suggests:'''

Suppose a 60-foot building is under construction, and 100 workers have spent 10 days constructing the first 30 feet at a rate of 3 feet / day. No matter how fast they construct the last 30 feet, it is impossible to achieve an average construction rate of 9 feet / day by the time the building is complete. Since 10 days have already been spent and there are only 60 feet in total, even constructing infinitely fast would only achieve a rate of 6 feet / day.

'''Gustafson's Law approximately states:'''

Suppose 100 workers are constructing a building at a rate of less than 9 feet / day. Given enough additional workers and enough building height still to construct, the average rate of construction can always eventually reach 9 feet / day, no matter how slow construction has been so far. For example, if 100 workers have spent 10 days constructing 30 feet at a rate of 3 feet / day, they could reach this average by adding workers and building at 12 feet / day for 20 additional days, or at 15 feet / day for 10 additional days, and so on.
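
The arithmetic behind both versions of the metaphor can be checked with a short helper. This is only a sketch; the function name and the representation of construction phases are invented for the example.

```python
def average_rate(phases):
    """phases: list of (rate in feet per day, number of days). Returns overall feet per day."""
    total_feet = sum(rate * days for rate, days in phases)
    total_days = sum(days for _rate, days in phases)
    return total_feet / total_days

# Fixed 60-foot building: even an instantaneous second half cannot beat 60 feet / 10 days.
fixed_height_limit = 60 / 10

# Growing building: 10 slow days followed by faster phases.
twenty_more_days = average_rate([(3, 10), (12, 20)])
ten_more_days = average_rate([(3, 10), (15, 10)])
```

Here `fixed_height_limit` is 6 feet / day, while both `twenty_more_days` and `ten_more_days` come out at 9 feet / day, matching the figures in the text.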

===Gordon Bell prize <ref>http://techresearch.intel.com/ResearcherDetails.aspx?Id=182</ref> <ref>http://en.wikipedia.org/wiki/Gordon_Bell_Prize</ref>===

[[Image:John_L_Gustafson_CEO.jpg|thumb|right|130px|John Gustafson, circa 2005]]
The Gordon Bell Prizes are a set of awards presented by the [http://en.wikipedia.org/wiki/Association_for_Computing_Machinery Association for Computing Machinery] in conjunction with the [http://en.wikipedia.org/wiki/Institute_of_Electrical_and_Electronics_Engineers Institute of Electrical and Electronics Engineers] each year at the [http://en.wikipedia.org/wiki/Supercomputing_Conference Supercomputing Conference] to recognize outstanding achievement in high-performance computing applications. The main purpose of the award is to acknowledge, reward, and thereby assess the progress of parallel computing. The awards were established in 1987.

The Prizes were preceded by a similar but much smaller prize (nominally $100) offered by Alan Karp, a numerical analyst then at IBM, who went on to be one of the first Bell Prize judges. Karp's prize, proposed in the Letters to the Editor section of the [http://en.wikipedia.org/wiki/Communications_of_the_ACM Communications of the ACM], challenged claims of MIMD performance improvements and was won by Gustafson and Montry. Cash prizes accompany these recognitions and are funded by the award founder, [http://en.wikipedia.org/wiki/Gordon_Bell Gordon Bell], a pioneer in high-performance and parallel computing.

[http://en.wikipedia.org/wiki/John_L._Gustafson Dr. John L. Gustafson] introduced the first commercial cluster system in 1985 and first demonstrated 1000x scalable parallel performance on real applications in 1988, for which he won the inaugural Gordon Bell Award. That demonstration broke the “Karp Challenge” that claimed speedup of more than 200x was a practical impossibility; it created a watershed that led to the widespread manufacture and use of highly parallel computers.

==Superlinear speedup <ref>http://www.ccs.neu.edu/course/com3620/projects/scalable/jshan/final1.pdf</ref>==

[[Image:Superlinear.jpg|thumb|right|350px|Figure 3. Super-linear speedup]]

Not too long ago, the speedup obtained by solving a given problem on p processors was believed to be no greater than p. However, it was then observed that in some computations the speedup was greater than p. When the speedup is higher than p, it is called super-linear speedup. One thing that could hinder the chances of achieving [http://en.wikipedia.org/wiki/Speedup super-linear speedup] is the cost of inter-process communication during parallel computation, which is not a concern in serial computation. However, super-linear speedup can be achieved by utilizing the resources very efficiently.

===Controversy===

Talk of super-linear speedup always sparks some controversy. Since super-linear speedup is not possible in theory, the traditional research community in particular tends to suspect that unorthodox practices lie behind any reported super-linear result. Hence, reporting super-linear speedup is controversial.

===Reasons for super-linear speedup===
Let us look at the reasons for super-linear speedup. When a problem is executed serially, its data set may be much larger than the cache. In [http://en.wikipedia.org/wiki/Parallel_computing parallel computation], however, the data set is divided among the processors, and each processor's portion may fit in its own cache. In problems that involve searching a data structure, multiple searches can be executed at the same time, which reduces the time to termination. Another reason is the efficient utilization of resources by [http://en.wikipedia.org/wiki/Multiprocessing multiprocessors].

===Parallel Search <ref>http://stackoverflow.com/questions/4332967/where-does-super-linear-speedup-come-from</ref>===
When a search is performed in parallel on multiple processors, the amount of work done can be less than the amount of work done serially. Let us see why this is so:

* If the parallel algorithm uses a search such as a random walk, then the more processors that are walking, the less distance has to be walked in total before reaching the target.
* Modern processors have faster and slower memories. The processor tries to keep the data in use in the fast memory, but the amount of data is most likely larger than the amount of fast memory. If we use n processors, we have n times the amount of fast memory. More data fits in the fast memory, which makes it possible to take less time, and hence less work, to do the same task.
* The original sequential algorithm may simply have been a poor one.
* There are multiple processors at our disposal, and hence much more cache is available than in serial computation on a single processor. The serial algorithm runs out of cache space when the data set is very large.
* Getting a serial algorithm to do this parallel work and obtain better results would not be feasible, because the serial algorithm cannot utilize the resources as efficiently as a parallel algorithm would.

===Super-linearity on a large machine <ref>http://drdobbs.com/article/print?articleId=206903306&siteSectionName=</ref>===

So far we have looked at leveraging the super-linearity that can arise naturally in parallel computation. Now let us think about how else we could achieve super-linear speedup. We could use more parallelism on the same machine. To speed up computational work we can increase the number of cores that we use. However, using more cores will, by itself, give us at most linear speedup.

When you run a program with less parallelism and another with more parallelism on the same modern desktop or server hardware, the one with more parallelism literally runs on a bigger machine — a disproportionately larger share of the available hardware. This happens because the one with more parallelism can use not only additional cores, but additional hardware resources attached to those cores that would not otherwise be available to the program. In particular, using more cores also means getting access to more [http://en.wikipedia.org/wiki/Cache_%28computing%29 cache] and/or more memory.

Let us take a look at Figure 4 to see why this is so. This figure shows a simplified block diagram of the cores and caches on two modern commodity CPUs: the current [http://www.xbitlabs.com/articles/cpu/display/kentsfield-preview.html Intel "Kentsfield" processor], and the upcoming [http://arstechnica.com/hardware/news/2006/12/8363.ars AMD "Barcelona" processor], respectively. The interesting feature in both chips is that each core has access to cache memory that is not available to some or all of the other cores. In the Kentsfield processor, each pair of cores shares a private [http://www.wisegeek.com/what-is-l2-cache.htm L2 cache]; in the Barcelona chip, each core has its own private L2 cache. In both cases, no core by itself has access to all the available L2 cache, and that means that code running on just one core is limited, not only to the one core, but also to just a fraction of the available cache. For code whose performance is memory bound, the amount of available cache can make a significant difference.

[[Image:superlinear_example.jpg|thumb|center|700px|Figure 4. Intel Kentsfield core and cache utilization: 1 thread versus 3 threads]]

Example:

Suppose we have an 8-processor machine in which each processor has a 1 MB cache, and a computation that uses 6 MB of data.
* On a single processor, only 1 MB of the data fits in cache, so the computation spends much of its time moving data between CPU, cache and RAM.
* On 8 processors, the 8 MB of combined cache can hold the whole data set, so the computation mostly has to move data only between CPU and cache.
This way super-linear speedup can be achieved.
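
A toy cost model makes the effect in this example visible. The model below is deliberately crude and rests on assumptions that are not from any measurement: the data is split evenly, every megabyte is touched once, and a megabyte served from RAM costs ten times as much as one served from cache. The function names and cost values are hypothetical.

```python
def modeled_time(data_mb, processors, cache_mb_each, cache_cost=1.0, ram_cost=10.0):
    """Toy cost of touching every megabyte once, with the data split evenly.

    cache_cost and ram_cost are made-up relative costs per megabyte.
    """
    share = data_mb / processors          # data handled by each processor
    in_cache = min(share, cache_mb_each)  # part of the share that fits in cache
    in_ram = share - in_cache             # remainder must come from RAM
    return in_cache * cache_cost + in_ram * ram_cost

def modeled_speedup(data_mb, processors, cache_mb_each):
    """Ratio of the single-processor toy cost to the multi-processor toy cost."""
    return modeled_time(data_mb, 1, cache_mb_each) / modeled_time(data_mb, processors, cache_mb_each)
```

With 6 MB of data and 1 MB of cache per processor, `modeled_speedup(6, 8, 1)` is far larger than 8 under these assumed costs, because the single processor pays the RAM cost for most of the data and the eight processors pay it for none. Real speedups depend on access patterns, communication and actual memory latencies, none of which this sketch captures.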

==Conclusion==

===Scaled speedup <ref>http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=2&sqi=2&ved=0CCkQFjAB&url=http%3A%2F%2Fcoitweb.uncc.edu%2F~abw%2FITCS4145F10%2Fslides1a.ppt&ei=rq05T_zOOpKutwe635jRAg&usg=AFQjCNFMNTKEQ2P9G4zV3fGWyaMbmYtsMQ</ref><ref>http://spartan.cis.temple.edu/shi/public_html/docs/amdahl/amdahl.html</ref>===

Using Amdahl's Law as an argument against massively parallel processing is not valid. This is because the ''serial parts'' of a program can be very close to zero for many practical applications. Thus very high speedups are possible using a very large number of processors. Gustafson's experiments are just examples of such applications.

Gustafson's formulation gives the illusion that the ''number of processors'' can increase indefinitely. A closer look finds that an increase in the serial parts of a program affects speedup negatively. If we translate the scaled percentage to a non-scaled percentage, the rate of speedup decreases as the ''number of processors'' approaches infinity. We cannot observe the impact of the ''number of processors'' on speedup directly from Gustafson's formulation, since the serial part of a program in that formulation is itself a variable that depends on the ''number of processors''.
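
The translation between the scaled and the non-scaled serial percentage can be made concrete. The sketch below assumes the scaled-size model used in the derivation, in which a run with scaled serial fraction s on p processors corresponds to a serial time of s + p (1 - s); under that assumption the serial fraction of the single-processor run is s / (s + p (1 - s)). The function names are illustrative.

```python
def amdahl_speedup(s, p):
    """Fixed-size speedup for non-scaled serial fraction s on p processors."""
    return 1.0 / (s + (1.0 - s) / p)

def gustafson_speedup(s_scaled, p):
    """Scaled-size speedup for scaled serial fraction s_scaled on p processors."""
    return s_scaled + p * (1.0 - s_scaled)

def nonscaled_serial_fraction(s_scaled, p):
    """Serial fraction of the single-processor run that corresponds to
    a serial fraction s_scaled measured on the p-processor run."""
    return s_scaled / (s_scaled + p * (1.0 - s_scaled))

for p in (2, 20, 1000):
    s = nonscaled_serial_fraction(0.05, p)
    # Both formulas give the same speedup once the fraction is translated.
    assert abs(amdahl_speedup(s, p) - gustafson_speedup(0.05, p)) < 1e-6
```

Holding the scaled fraction fixed at 5% while p grows therefore implies a non-scaled serial fraction that shrinks with p, which is the processor-dependent variable hidden inside Gustafson's formulation.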

Even though Amdahl's law is theoretically correct, the serial percentage is not practically obtainable. For example, if the serial percentage is to be derived from computational experiments, i.e. recording the total parallel elapsed time and the parallel-only elapsed time, then it can contain all overheads, such as communication, synchronization, input/output and memory access. The law offers no help to separate these factors. On the other hand, if we obtain the serial percentage by counting the number of total serial and parallel instructions in a program, then all other overheads are excluded. However, in this case the predicted speedup may never agree with the experiments.

The conclusion drawn from Gustafson’s law is that it should be possible to get high speedup if we scale up the problem size.

===Super-linear speedup===

Even though in theory the maximum possible speedup is equal to the number of parallel processors, in practice we do see speedups higher than the number of processors, or in other words, super-linear speedup. It is considered controversial by purists because of the skepticism regarding the means of achieving this high speedup. If super-linear speedup is achievable, it gives an extremely high degree of parallelism and also provides maximum utilization of resources.

== External links ==
1. [http://en.wikipedia.org/wiki/File:John_L_Gustafson_CEO.jpg John Gustafson]

2. [http://en.wikipedia.org/wiki/File:Gustafson.png Figure 1 - Plot of speedup vs processors - Gustafson's Law]

3. [http://drdobbs.com/article/print?articleId=206903306&siteSectionName= Figure 3 - Superlinear speedup]

4. [http://en.wikipedia.org/wiki/Speedup Superlinear speedup wiki]

5. [http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&ved=0CCIQFjAA&url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.80.9410%26rep%3Drep1%26type%3Dpdf&ei=9qM5T7KpE4jEtwe5-JmtAg&usg=AFQjCNGqiAI5QMYuYyp9ePmRKFNBC31g7g Controversy]

6. [http://drdobbs.com/article/print?articleId=206903306&siteSectionName= Figure 4 - Intel Kentsfield core and cache utilization: 1 thread versus 3 threads.]

7. [http://arstechnica.com/hardware/news/2006/12/8363.ars AMD "Barcelona" processor]

8. [http://www.xbitlabs.com/articles/cpu/display/kentsfield-preview.html Intel "Kentsfield" processor]

==References==
<references/>

CSC/ECE 506 Spring 2012/4b rs

2012-02-14T00:01:12Z

Sshanbh: /* Super-linearity on a large machine */

'''The limits to speedup'''

== Introduction <ref>http://en.wikipedia.org/wiki/Amdahl%27s_law</ref> ==
In parallel computing, speedup refers to how much faster a parallel algorithm is than a corresponding sequential algorithm. Parallel computing is very important in scientific computation because of the speedup it offers, especially for computations that involve large-scale data. One argument against the need for parallel computing is that, according to Moore's Law, the speed of a chip doubles every 18 months, so there is no need to develop parallel computation. This claim, however, has been proved wrong.

According to [http://en.wikipedia.org/wiki/Amdahl%27s_law Amdahl's law], the speedup of a program using multiple processors in parallel computing is limited by the time needed for the sequential fraction of the program. Amdahl's law therefore describes a limit on the speedup that parallelization can provide, given a fixed data set size. But this addresses solving a fixed problem in the shortest possible period of time, rather than solving the largest possible problem (e.g., the most accurate possible approximation) in a fixed "reasonable" amount of time. To overcome this shortcoming, [http://en.wikipedia.org/wiki/John_L._Gustafson John L. Gustafson] and his colleague Edwin H. Barsis described [http://en.wikipedia.org/wiki/Gustafson%27s_law#Derivation_of_Gustafson.27s_Law Gustafson's Law], which provides a counterpoint to Amdahl's law.

== Scaled speedup==

Scaled speedup is the speedup that can be achieved by increasing the data size along with the number of processors. In other words, with a larger number of parallel processors at our disposal, we can increase the data size of the same problem and achieve higher speedup. Gustafson's Law formalizes how this speedup is achieved.

===Gustafson's Law <ref>http://en.wikipedia.org/wiki/Gustafson%27s_Law</ref>===

[[Image:Gustafson.png|thumb|right|350px|Figure 1. Gustafson's Law]]

Gustafson's Law says that computations involving sufficiently large data sets can be parallelized efficiently. It responds to skepticism regarding the viability of massive parallelism. This skepticism is largely due to Amdahl's law, which says that the maximum speedup that can be achieved for a problem with serial fraction of work s is 1/s, even when the number of processors increases without bound. For example, if 5% of the computation in a problem is serial, then the maximum achievable speedup is 20 regardless of the number of processors. This is not a very encouraging result.

Amdahl's law does not account for the computing power that becomes available as the number of machines increases. Gustafson's law addresses this limitation by considering the effect of increasing the problem size. Gustafson reasoned that when a problem is ported onto a multiprocessor system, it is possible to consider larger problem sizes. In other words, the same problem with a larger number of data values can be solved in the same time. The law proposes that programmers tend to set the size of problems to use the available equipment to solve problems within a practical fixed time. Larger problems can be solved in the same time if faster, i.e. more parallel, equipment is available. Therefore, it should be possible to achieve high speedup if we scale the problem size.

Example:
* s (serial fraction of work) = 5%
* p (number of processors) = 20
* speedup (Amdahl's Law) = 10.26
* scaled speedup (Gustafson's Law) = 19.05
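The numbers in this example can be reproduced with a short Python sketch. The function names are illustrative; the formulas are the fixed-size and scaled-size models given in the captions of Figure 2a and Figure 2b.

```python
def amdahl_speedup(s, p):
    """Fixed-size model: speedup = 1 / (s + (1 - s) / p)."""
    return 1.0 / (s + (1.0 - s) / p)


def gustafson_scaled_speedup(s, p):
    """Scaled-size model: speedup = s + p * (1 - s)."""
    return s + p * (1.0 - s)


if __name__ == "__main__":
    s = 0.05  # serial fraction of work
    p = 20    # number of processors
    print(f"Amdahl:    {amdahl_speedup(s, p):.2f}")            # 10.26
    print(f"Gustafson: {gustafson_scaled_speedup(s, p):.2f}")  # 19.05
    print(f"Amdahl limit 1/s: {1.0 / s:.0f}")                  # 20
```

Note that the two results answer different questions: the first fixes the problem size, while the second lets the problem grow with the number of processors.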

===Derivation of Gustafson's Law<ref>http://www.johngustafson.net/pubs/pub13/amdahl.pdf</ref><ref>http://en.wikipedia.org/wiki/Gustafson%27s_Law#Derivation_of_Gustafson.27s_Law</ref>===

If p is the number of processors, s is the fraction of time spent (by a serial processor) on serial parts of a program and 1-s is the fraction of time spent (by a serial processor) on parts of the program that can be done in parallel, then [http://en.wikipedia.org/wiki/Amdahl%27s_law Amdahl's law] says that speedup is given by:

[[Image:Amdahl.png|center|300px]]

Let us consider a bigger problem size of measure n.

The execution of the program on a parallel computer is decomposed into:

[[Image:Eq2.png|center|250px]]

where s is the sequential fraction, 1-s is the parallel fraction (ignoring overhead for now), and p is the number of processors working in parallel during the parallel stage.

The relative time for sequential processing would be '''''s + p (1 - s)''''', where p is the number of processors in the parallel case.

Speedup is therefore:

[[Image:Eq3.png|center|550px]]

where s := s(n) is the sequential fraction.

Assuming the sequential fraction s(n) diminishes with problem size n, then speedup approaches p as n approaches infinity, as desired.
Thus Gustafson's law seems to rescue parallel processing from Amdahl's law.
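For readers who cannot view the equation images, the steps above can be restated as follows. This is a sketch assumed to match the scaled-size formula given in the caption of Figure 2b, with the parallel run time normalized to 1:

```latex
\begin{align*}
T_{\text{parallel}} &= s + (1 - s) = 1 \\
T_{\text{serial}}   &= s + p\,(1 - s) \\
\text{Scaled speedup} &= \frac{T_{\text{serial}}}{T_{\text{parallel}}}
                       = s + p\,(1 - s) = p - (p - 1)\,s \\
\lim_{s(n) \to 0} \bigl[\, p - (p - 1)\,s(n) \,\bigr] &= p
\end{align*}
```

The last line is the statement that speedup approaches p when the sequential fraction s(n) diminishes as the problem size n grows.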

Amdahl's law argues that even massively parallel computer systems cannot influence the sequential part of a fixed workload. Since this part is irreducible, the fraction of execution time spent in the sequential part of the fixed workload grows with p and approaches 1 for large p. In comparison, Gustafson's law is based on the idea that the importance of the sequential part diminishes with a growing workload; if n is allowed to grow along with p, the sequential fraction will not ultimately dominate.

In contrast with Amdahl's Law, the scaled speedup s + p (1 - s) = p + (1 - p) s is, as a function of s, simply a line, and one with a much more moderate slope: 1 - p.

It is thus much easier to achieve efficient parallel performance than is implied by Amdahl’s paradigm. The two approaches, fixed-sized and scaled-sized, are contrasted and summarized in Figure 2a and b.

[[Image:Amdahl_slope.png|thumb|center|375px|Figure 2a. Fixed-Size Model: Speedup = 1 / (s + (1-s) / p)]]

[[Image:Gus_slope.png|thumb|center|700px|Figure 2b. Scaled-Size Model: Speedup = s + p (1-s)]]

===A Construction Metaphor===

'''Amdahl's Law approximately suggests:'''

Suppose a 60-foot building is under construction, and 100 workers have spent 10 days constructing the first 30 feet at a rate of 3 feet / day. No matter how fast they construct the last 30 feet, it is impossible to achieve an average construction rate of 9 feet / day by the time the building is complete. Since they have already taken 10 days and there are only 60 feet in total, even constructing infinitely fast would only achieve an average rate of 6 feet / day.

'''Gustafson's Law approximately states:'''

Suppose 100 workers are constructing a building at a rate of less than 9 feet / day. Given enough workers and enough building height to construct, the average rate of construction can always eventually reach 9 feet / day, no matter how slow the construction has been so far. For example, if 100 workers have spent 10 days constructing 30 feet at a rate of 3 feet / day, they could reach a 9 feet / day average by adding workers and building at 12 feet / day for 20 additional days, or at 15 feet / day for 10 additional days, and so on.

===Gordon Bell prize <ref>http://techresearch.intel.com/ResearcherDetails.aspx?Id=182</ref> <ref>http://en.wikipedia.org/wiki/Gordon_Bell_Prize</ref>===

[[Image:John_L_Gustafson_CEO.jpg|thumb|right|130px|John Gustafson, circa 2005]]
The Gordon Bell Prizes are a set of awards presented by the [http://en.wikipedia.org/wiki/Association_for_Computing_Machinery Association for Computing Machinery] in conjunction with the [http://en.wikipedia.org/wiki/Institute_of_Electrical_and_Electronics_Engineers Institute of Electrical and Electronics Engineers] each year at the [http://en.wikipedia.org/wiki/Supercomputing_Conference Supercomputing Conference] to recognize outstanding achievement in high-performance computing applications. The main purpose of the award is to acknowledge, reward, and thereby assess the progress of parallel computing. The awards were established in 1987.

The Prizes were preceded by a similar, much smaller prize (nominally $100) offered by Alan Karp, a numerical analyst then at IBM, who challenged claims of MIMD performance improvements in the Letters to the Editor section of the [http://en.wikipedia.org/wiki/Communications_of_the_ACM Communications of the ACM]. That prize was won by Gustafson and Montry, and Karp went on to be one of the first Bell Prize judges. Cash prizes accompany these recognitions and are funded by the award founder, [http://en.wikipedia.org/wiki/Gordon_Bell Gordon Bell], a pioneer in high-performance and parallel computing.

[http://en.wikipedia.org/wiki/John_L._Gustafson Dr. John L. Gustafson] introduced the first commercial cluster system in 1985 and first demonstrated 1000x scalable parallel performance on real applications in 1988, for which he won the inaugural Gordon Bell Award. That demonstration broke the “Karp Challenge” that claimed speedup of more than 200x was a practical impossibility; it created a watershed that led to the widespread manufacture and use of highly parallel computers.

==Superlinear speedup <ref>http://www.ccs.neu.edu/course/com3620/projects/scalable/jshan/final1.pdf</ref>==

[[Image:Superlinear.jpg|thumb|right|350px|Figure 3. Super-linear speedup]]

Not too long ago, the speedup achieved in solving a given problem using p processors was believed to be no greater than p. However, it was then observed that in some computations the speedup was greater than p. When the speedup is higher than p, it is called super-linear speedup. One thing that could hinder the chances of achieving [http://en.wikipedia.org/wiki/Speedup super-linear speedup] is the cost of inter-process communication during parallel computation, which is not a concern in serial computation. Nevertheless, super-linear speedup can be achieved by utilizing resources very efficiently.

===Controversy===

Talk of super-linear speedup always sparks some controversy. Since super-linear speedup is not possible in theory, reports of it can raise suspicion that unorthodox practices were used to achieve it. This is especially true within the traditional research community. Hence, reporting super-linear speedup is controversial.

===Reasons for super-linear speedup===
Let us look at the reasons for super-linear speedup. When a problem is executed serially, its data set could be much larger than the cache. In parallel computation, however, each processor's share of the data set may fit in that processor's cache. In problems that involve searching a data structure, multiple searches can be executed at the same time, which reduces the time to termination. Another reason is the efficient utilization of resources by multiprocessors.

===Parallel Search===
When a search is performed in parallel on multiple processors, the total amount of work done can be less than the amount of work done serially. Let us see why this is so:

* If the parallel algorithm uses a search such as a random walk, then the more processors that are walking, the less distance has to be walked in total before reaching the target.
* Modern processors have faster and slower memories, and the processor tries to keep the data in use in the fast memory. The amount of data is most likely larger than the amount of fast memory. If we use n processors, we have n times the amount of fast memory. More data fits in the fast memory, which makes it possible to take less time, and hence less work, to do the same task.
* The original sequential algorithm may have been inefficient, which makes the parallel version look better than it should.
* There are multiple processors at our disposal, and hence much more cache is available than in serial computation on a single processor. The serial algorithm runs out of cache space when the data set is very large.
* Getting a serial algorithm to do this parallel work and obtain better results would not be feasible, because the serial algorithm cannot utilize the resources as efficiently as a parallel algorithm.
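The search argument can be illustrated with a small step-counting sketch in Python. It is a simplified model, not a real parallel program: the workers are simulated in lock-step, each scanning one contiguous chunk, and the data, target position, and function names are assumptions made for the example.

```python
def steps_serial(data, target):
    """Elements inspected by one left-to-right scan before the target is found."""
    for i, value in enumerate(data):
        if value == target:
            return i + 1
    return len(data)


def steps_parallel(data, target, p):
    """Lock-step time when p workers each scan one contiguous chunk.

    The search stops as soon as any worker finds the target, so the elapsed
    time is the smallest number of steps any worker needs.
    """
    chunk = (len(data) + p - 1) // p
    best = chunk  # worst case: every worker scans its whole chunk
    for w in range(p):
        part = data[w * chunk:(w + 1) * chunk]
        for i, value in enumerate(part):
            if value == target:
                best = min(best, i + 1)
                break
    return best


if __name__ == "__main__":
    data = list(range(1000))
    p = 2
    target = 510  # sits just past the start of the second worker's chunk
    serial = steps_serial(data, target)
    parallel = steps_parallel(data, target, p)
    print(serial, parallel, serial / parallel)
```

With this target position the serial scan inspects 511 elements while the second worker finds the target after 11, so the ratio is far above p = 2. The effect depends on where the target lies: for a target near the start of the array the two counts are equal, and averaged over all positions the ratio is not super-linear.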

===Super-linearity on a large machine===

So far we have looked at leveraging the super-linearity that can arise naturally in parallel computation. Now let us consider how else we could achieve super-linear speedup by using more parallelism on the same machine. To speed up computational work we can increase the number of cores that we use. However, using more cores by itself gives only linear speedup.

When you run a program with less parallelism and another with more parallelism on the same modern desktop or server hardware, the one with more parallelism literally runs on a bigger machine: it uses a disproportionately larger share of the available hardware. This happens because the one with more parallelism can use not only additional cores, but additional hardware resources attached to those cores that would not otherwise be available to the program. In particular, using more cores also means getting access to more cache and/or more memory.

Let us take a look at Figure 4 to see why this is so. This figure shows a simplified block diagram of the cores and caches on two modern commodity CPUs: the current Intel "Kentsfield" processor, and the upcoming AMD "Barcelona" processor, respectively. The interesting feature in both chips is that each core has access to cache memory that is not available to some or all of the other cores. In the Kentsfield processor, each pair of cores shares a private L2 cache; in the Barcelona chip, each core has its own private L2 cache. In both cases, no core by itself has access to all the available L2 cache, and that means that code running on just one core is limited, not only to the one core, but also to just a fraction of the available cache. For code whose performance is memory bound, the amount of available cache can make a significant difference.

[[Image:superlinear_example.jpg|thumb|center|700px|Figure 4. Intel Kentsfield core and cache utilization: 1 thread versus 3 threads]]

Example:

Suppose we have an 8-processor machine in which each processor has a 1 MB cache, and the computation uses 6 MB of data.
On a single processor, the data does not fit in the 1 MB cache, so the computation does a lot of data movement between CPU, cache and RAM.
On 8 processors, the 8 MB of combined cache can hold all of the data, so the computation only has to move data between CPU and cache.
This way super-linear speedup can be achieved.
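A rough model of this example can be written in Python. Everything numeric here beyond the 8 processors, 1 MB caches and 6 MB of data is an assumption for illustration: the relative access costs (1 unit for cache, 100 units for RAM) and the crude hit-rate estimate (cache size divided by working-set size) are not measurements of any real machine.

```python
def avg_access_cost(working_set_mb, cache_mb, t_cache, t_ram):
    """Average cost per access, assuming the hit rate is the cached share of the data."""
    hit_rate = min(1.0, cache_mb / working_set_mb)
    return hit_rate * t_cache + (1.0 - hit_rate) * t_ram


def modeled_speedup(data_mb, cache_mb_per_cpu, cpus, t_cache=1.0, t_ram=100.0):
    """Ratio of modeled one-processor time to modeled time on `cpus` processors."""
    serial_time = data_mb * avg_access_cost(data_mb, cache_mb_per_cpu, t_cache, t_ram)
    share_mb = data_mb / cpus  # each processor handles an equal share of the data
    parallel_time = share_mb * avg_access_cost(share_mb, cache_mb_per_cpu, t_cache, t_ram)
    return serial_time / parallel_time


if __name__ == "__main__":
    speedup = modeled_speedup(data_mb=6, cache_mb_per_cpu=1, cpus=8)
    print(f"modeled speedup on 8 processors: {speedup:.1f}")
```

Under these assumed costs the modeled speedup is far above 8, because each processor's 0.75 MB share fits entirely in its 1 MB cache. If the data already fits in a single cache, the same model gives exactly linear speedup, which is the expected behavior when cache capacity is not the bottleneck.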

==Conclusion==

== External links ==
1. [http://en.wikipedia.org/wiki/File:John_L_Gustafson_CEO.jpg John Gustafson]

2. [http://en.wikipedia.org/wiki/File:Gustafson.png Plot of speedup vs processors - Gustafson's Law]

3. [http://drdobbs.com/article/print?articleId=206903306&siteSectionName= Superlinear speedup]

4. [http://en.wikipedia.org/wiki/Speedup Superlinear speedup wiki]

==References==
<references/>

CSC/ECE 506 Spring 2012/4b rs

2012-02-13T23:53:38Z

Sshanbh: /* Superlinear speedup http://www.ccs.neu.edu/course/com3620/projects/scalable/jshan/final1.pdf */

===Super-linearity on a large machine===

Besides leveraging the superlinearity that can crop up naturally in parallel algorithms, how else could we ever achieve a superlinear speedup by using more parallelism on the same machine? After all, more parallelism means we can use more cores to speed up compute-bound work, but just using more cores in itself gives us only a linear speedup. Right?

The fallacy is in the question's final words: "...on the same machine." When you run a program with less parallelism and another with more parallelism on the same modern desktop or server hardware, the one with more parallelism literally runs on a bigger machine: it uses a disproportionately larger share of the available hardware. This happens because the one with more parallelism can use not only additional cores, but additional hardware resources attached to those cores that would not otherwise be available to the program. In particular, using more cores typically also means getting access to more cache and/or more memory.

To see why this is so, consider the cores and caches on two modern commodity CPUs: the current Intel "Kentsfield" processor (Figure 4), and the upcoming AMD "Barcelona" processor. The interesting feature in both chips is that each core has access to cache memory that is not available to some or all of the other cores. In the Kentsfield processor, each pair of cores shares a private L2 cache; in the Barcelona chip, each core has its own private L2 cache. In both cases, no core by itself has access to all the available L2 cache, and that means that code running on just one core is limited, not only to the one core, but also to just a fraction of the available cache. For code whose performance is memory bound, the amount of available cache can make a huge difference.

[[Image:superlinear_example.jpg|thumb|center|700px|Figure 4. Intel Kentsfield core and cache utilization: 1 thread versus 3 threads]]

Example:

Suppose we have an 8-processor machine in which each processor has a 1 MB cache, and the computation uses 6 MB of data.
On a single processor, the data does not fit in the 1 MB cache, so the computation does a lot of data movement between CPU, cache and RAM.
On 8 processors, the 8 MB of combined cache can hold all of the data, so the computation only has to move data between CPU and cache.
This way super-linear speedup can be achieved.

==Conclusion==

== External links ==
1. [http://en.wikipedia.org/wiki/File:John_L_Gustafson_CEO.jpg John Gustafson]

2. [http://en.wikipedia.org/wiki/File:Gustafson.png Plot of speedup vs processors - Gustafson's Law]

3. [http://drdobbs.com/article/print?articleId=206903306&siteSectionName= Superlinear speedup]

4. [http://en.wikipedia.org/wiki/Speedup Superlinear speedup wiki]

==References==
<references/>

CSC/ECE 506 Spring 2012/4b rs

2012-02-13T23:33:32Z

Sshanbh: /* Superlinear speedup http://www.ccs.neu.edu/course/com3620/projects/scalable/jshan/final1.pdf */

CSC/ECE 506 Spring 2012/4b rs

2012-02-13T23:24:57Z

Sshanbh: /* Superlinear speedup */

CSC/ECE 506 Spring 2012/4b rs

2012-02-13T23:09:03Z

Sshanbh: /* Superlinear speedup */

CSC/ECE 506 Spring 2012/4b rs

2012-02-13T23:05:16Z

Sshanbh: /* Superlinear speedup */

CSC/ECE 506 Spring 2012/4b rs

2012-02-13T22:48:53Z

Sshanbh: /* Superlinear speedup */

CSC/ECE 506 Spring 2012/4b rs

2012-02-13T22:40:01Z

Sshanbh: /* Superlinear speedup */

CSC/ECE 506 Spring 2012/4b rs

2012-02-13T22:39:21Z

Sshanbh: /* Superlinear speedup */

CSC/ECE 506 Spring 2012/4b rs

2012-02-13T22:38:59Z

Sshanbh: /* Superlinear speedup */

CSC/ECE 506 Spring 2012/4b rs

2012-02-13T22:32:02Z

Sshanbh: /* Superlinear speedup */