CSC/ECE 506 Spring 2013/10b ps



Introduction

In this wiki we survey memory consistency models in modern microprocessors. We begin with a brief synopsis of what memory consistency is and how it can be achieved. We then discuss some legacy systems that used early consistency models and how those models evolved into more sophisticated versions in recent years. With this foundation in place, we move on to the more advanced architectures and implementations typically seen today.

What is Memory Consistency

Before our discussion can begin we must understand what a memory consistency model is. A memory consistency model specifies how memory behaves when multiple processors read and write the same locations. Because this contract plays such a critical role across the whole system, memory consistency affects system design, programming languages, and the underlying hardware.

This is critically important because we want our systems to be as efficient as possible. To achieve this goal, the hardware performs various optimizations that overlap memory operations and allow them to complete out of order. Adding to the challenge, the compiler can also reorder code to increase execution efficiency. It is intuitive to think that instructions execute sequentially, but in reality execution only needs to appear sequential for operations on the same memory location. As multiprocessor systems and parallel computing became a necessity in the technology industry, so did correct and highly efficient memory consistency models. A memory consistency model permits techniques that optimize execution while putting a contract in place between the programmer and the system to ensure correctness. In short, allowing tasks to execute in parallel is great for improving performance, but it introduces new challenges because those tasks can now complete in a different order than the one in which they were written [16, 10].
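To see concretely why completion order matters, consider the classic "store buffering" litmus test. The sketch below is our own illustration using C++11 atomics (it is not taken from this wiki's sources): each thread writes one variable and then reads the other, and with relaxed ordering both reads may return 0, an outcome that would be impossible if the two threads' operations were simply interleaved one at a time.

// Store-buffering litmus test: a minimal sketch, assuming C++11 atomics.
// With relaxed ordering, each store may sit in a write buffer past the
// other thread's load, so r1 == 0 && r2 == 0 can be observed.
#include <atomic>
#include <thread>
#include <cstdio>

std::atomic<int> x{0}, y{0};
int r1 = 0, r2 = 0;

void thread1() {
    x.store(1, std::memory_order_relaxed);   // W x
    r1 = y.load(std::memory_order_relaxed);  // R y (may be satisfied before W x is visible)
}

void thread2() {
    y.store(1, std::memory_order_relaxed);   // W y
    r2 = x.load(std::memory_order_relaxed);  // R x (may be satisfied before W y is visible)
}

int main() {
    std::thread t1(thread1), t2(thread2);
    t1.join(); t2.join();
    // Under an interleaved (sequentially consistent) execution, at least one
    // of r1, r2 must be 1; with relaxed ordering both may be 0.
    std::printf("r1=%d r2=%d\n", r1, r2);
    return 0;
}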

How Memory Consistency Used to be Done

It is intuitive to expect operations executing on a multiprocessor system to behave the same as instructions executing on a single-processor system. This interleaving model is called Sequential Consistency and is the classical approach by which consistency can be achieved. It was formally defined by Lamport as: "The result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program." In other words, every memory access becomes immediately visible to all processors.

[Figure: Conceptual Model for ARM Process Execution]

Systems of this type appeared in the 1980s, for example the 386-based Compaq SystemPro. The key way Sequential Consistency implementations could still achieve overlap was to place writes in a write buffer so that computation could be overlapped with the next read. Processors issue memory operations in program order, and because operations are executed one at a time they appear atomic with respect to each other.

The advantage of a model like this is that it is easy to understand from the programmer's perspective. Unfortunately, the rules it imposes are very restrictive and can hurt performance. Sequential Consistency disallows reordering of shared memory operations between processors. The result of this strict model is high latency with very little benefit over true serial execution, because reads tend to be closely interleaved with writes, leaving the write buffer almost unused.
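To make the Sequential Consistency guarantee concrete, the following sketch (our own illustration, assuming C++11 atomics, whose default memory order is sequentially consistent) shows the message-passing idiom: a reader that observes the flag set must also observe the data written before it, because every processor sees the same global interleaving of the four memory operations.

// Message passing under sequential consistency: a minimal sketch, assuming
// C++11 std::atomic with its default (seq_cst) memory order.
#include <atomic>
#include <thread>
#include <cassert>

std::atomic<int> data{0};
std::atomic<int> flag{0};

void producer() {
    data.store(42);   // seq_cst by default
    flag.store(1);    // seq_cst by default
}

void consumer() {
    while (flag.load() == 0) { /* spin until the flag store becomes visible */ }
    assert(data.load() == 42);   // guaranteed: the data store precedes the flag store in the global order
}

int main() {
    std::thread p(producer), c(consumer);
    p.join(); c.join();
    return 0;
}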

Memory Consistency Today

While Sequential Consistency guarantees correctness, it is very limiting in terms of performance: effectively only one memory operation may be in flight at a time, and every instruction that accesses memory must appear to execute atomically. As a result, widely implemented computer architectures try to reduce the requirements imposed by the Sequential Consistency model while still maintaining program correctness. These Relaxed Consistency models allow the processor to adjust the order of memory accesses so that instructions can execute more efficiently.


Relaxed models can be categorized based on how they relax the program order constraint. Below are three general categories we will examine in this discussion; a short code sketch after the list shows how a programmer can restore the needed ordering when a model relaxes it:

1. Write to Read relaxation: allows a write followed by a read to execute out of program order. Examples of models that provide this relaxation include the IBM-370 and the Sun SPARC V8 Total Store Ordering (TSO).

2. Write to Write relaxation: allows two writes to execute out of program order. An example of a model that provides this relaxation is the Sun SPARC V8 Partial Store Ordering (PSO).

3. Read to Read/Write relaxation: allows reads to execute out of program order with respect to the reads and writes that follow them. Examples of models that provide this relaxation include the DEC Alpha, the Sun SPARC V9 Relaxed Memory Order (RMO), and the IBM PowerPC.

Apart from the program order relaxations, some of these models also relax the write atomicity requirement. Two examples of this type of relaxation are:

4. Read others' write early: allows a read to return the value of another processor's write before that write has been made visible to all other processors (in cache-based systems).

5. Read own write early: a processor is allowed to read the value of its own previous write before the write is made visible to other processors.
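As a concrete illustration of why these relaxations matter to the programmer, the sketch below (our own example in C++11, not drawn from the cited references) shows the message-passing idiom on a model that relaxes write-to-write and read-to-read/write ordering: plain relaxed accesses give no guarantee, while a release store paired with an acquire load restores exactly the ordering that is needed and nothing more.

// Restoring order under a relaxed model: a minimal sketch, assuming C++11
// release/acquire atomics as the programmer-visible ordering mechanism.
#include <atomic>
#include <thread>
#include <cassert>

int payload = 0;                     // ordinary data, published through the flag
std::atomic<bool> ready{false};

void writer() {
    payload = 42;
    // Release store: the earlier write to payload may not be reordered past it,
    // which rules out the write-to-write relaxation for this pair.
    ready.store(true, std::memory_order_release);
}

void reader() {
    // Acquire load: the later read of payload may not be reordered before it,
    // which rules out the read-to-read/write relaxation for this pair.
    while (!ready.load(std::memory_order_acquire)) { }
    assert(payload == 42);           // guaranteed by the release/acquire pairing
}

int main() {
    std::thread w(writer), r(reader);
    w.join(); r.join();
    return 0;
}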

Implementations

Below we look at some Relaxed Consistency models that were implemented in commercial systems. We divide our discussion into two major categories: the more traditional computing architectures found in servers, large clusters, and personal computers, and the newly emerging mobile devices, which are becoming more sophisticated each day.

Note that a single architecture may employ multiple Relaxed Consistency models; we are simply selecting examples that illustrate how each memory consistency model works.

Commercial Computers

Mobile Devices

Processors designed to be fast, low power, low heat, and capable of supporting many applications are becoming very popular in today's technology space. From tablets to smartphones, many of these devices run on the ARM architecture and, to a lesser extent, the IBM POWER architecture. These multiprocessors use highly relaxed memory models, which enables a wide range of processor optimizations: better performance, better power usage, and simpler hardware. The pillars the ARM architecture is built upon to allow as much hardware optimization as possible are:

1. Reads and writes can be performed out of order. This includes speculative reads and writes for branches that have not yet been reached. Many other relaxed memory models only allow local reordering under specific conditions, but in ARM any local reordering is allowed unless the programmer constrains it.
2. The memory system does not guarantee that a write becomes visible to all other hardware threads at the same point in time.

To conceptualize this architecture, think of each thread as effectively having its own copy of memory. A write by one thread may propagate to the other threads in any order, and these propagations may be interleaved arbitrarily. This allows great freedom for hardware optimization, but it also puts the responsibility on the programmer's shoulders to insert barriers wherever this behavior must be constrained.
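As a hedged sketch of what such barriers look like in practice (our own illustration, not taken from this wiki's sources), the C++11 example below uses std::atomic_thread_fence, which compilers typically lower to an ARM DMB barrier, to force the data write to propagate before the flag write and the data read to happen after the flag read.

// Constraining ARM's reordering with explicit barriers: a minimal sketch,
// assuming C++11 fences (typically compiled to DMB on ARM).
#include <atomic>
#include <thread>
#include <cassert>

int message = 0;
std::atomic<bool> published{false};

void sender() {
    message = 7;
    std::atomic_thread_fence(std::memory_order_release); // barrier: keep the data write before the flag write
    published.store(true, std::memory_order_relaxed);
}

void receiver() {
    while (!published.load(std::memory_order_relaxed)) { }
    std::atomic_thread_fence(std::memory_order_acquire); // barrier: keep the data read after the flag read
    assert(message == 7);  // without both barriers, ARM is free to reorder and this could fail
}

int main() {
    std::thread s(sender), r(receiver);
    s.join(); r.join();
    return 0;
}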

[Figure: Conceptual ARM architecture]

Given these properties, a greater focus is placed on programmer-observable behavior: the results a multithreaded program exhibits when it is executed and its output is analyzed. This is not about performance but about the correctness of the results. The looseness of the ARM architecture must be taken into account by programmers, who must realize that their multithreaded application may exhibit different observable behavior depending on the conditions under which it runs.

[Figure: ARM thread-level execution]

Quiz

References

1. http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-3a-part-1-manual.html
2. http://pic.dhe.ibm.com/infocenter/lnxinfo/v3r0m0/topic/liaaw/ordering.2006.03.13a.pdf
3. http://dl.acm.org/citation.cfm?id=1941553.1941594
4. http://home.deib.polimi.it/speziale/tech/memory_consistency/mc.html
5. http://homes.cs.washington.edu/~djg/papers/asf_micro2010.pdf
6. http://people.ee.duke.edu/~sorin/papers/asplos10_consistency.pdf
7. http://www.rdrop.com/users/paulmck/scalability/paper/whymb.2010.06.07c.pdf
8. http://en.wikipedia.org/wiki/Memory_ordering
9. http://www.montefiore.ulg.ac.be/~pw/cours/psfiles/struct-cours12-e.pdf
10. http://www2.engr.arizona.edu/~ece569a/Readings/ppt/memory.ppt
11. http://ipdps.cc.gatech.edu/2000/fmppta/18000989.pdf
12. http://preshing.com/20120930/weak-vs-strong-memory-models
13. http://csg.csail.mit.edu/pubs/memos/Memo-493/memo-493.pdf
14. http://classes.soe.ucsc.edu/cmpe221/Spring05/papers/29multi.pdf
15. http://os.inf.tu-dresden.de/Studium/DOS/SS2009/04-Coherency.pdf
16. http://infolab.stanford.edu/pub/cstr/reports/csl/tr/95/685/CSL-TR-95-685.pdf
17. http://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf
18. http://www.cl.cam.ac.uk/~pes20/ppc-supplemental/pldi105-sarkar.pdf
19. http://www.cs.utah.edu/formal_verification/publications/conferences/pdf/charme03.pdf
20. http://www.cl.cam.ac.uk/~pes20/weakmemory/x86tso-paper.tphols.pdf