CSC/ECE 506 Spring 2011/ch7 ss
Though the migration from uniprocessor systems to multiprocessing systems is not new, the world of parallel computers is undergoing continuous change. Parallel computers, which started as high-end supercomputing systems for carrying out huge calculations, are now ubiquitous and present in all mainstream architectures for servers, desktops, and embedded systems. Designing parallel architectures that meet programmers' needs and expectations more closely presents exciting and challenging problems. The three main areas being considered by researchers today are cache coherence, memory consistency, and synchronization.
This article discusses these three issues and how they can be solved efficiently to meet programmers' requirements. A related topic, TLB coherence, is also dealt with, as are the challenges that Peterson's algorithm illustrates.
Cache Coherence
Here, by cache, we mean the CPU cache: a small memory on or close to the CPU that can be accessed much faster than the far larger main memory. Cache coherence refers to the consistency of data stored in the local caches of a shared resource. The following scenario shows the problem that arises when clients maintain caches of a common memory resource: suppose two processors each hold a copy of the same memory block in their private caches, and one of them writes a new value to its copy. The other processor continues to read its stale copy until the caches are somehow reconciled.

As this example shows, multiple copies of a block can easily become inconsistent. Since caches are critical to modern high-speed processors, the problem must be solved efficiently. The following section discusses the different solutions used to do so.
Cache Coherence Solutions
Cache coherence solutions are mainly classified as software-based or hardware-based.
Software-based solutions are further classified as:
- Compiler-based or with run-time system support
- With or without hardware assist
Hardware-based solutions can be differentiated as:
- Shared caches, snoopy schemes, or directory-based schemes
- Write-through vs write-back protocols
- Update vs invalidation protocols
- Dirty-sharing vs. no-dirty-sharing protocols
The main concern with software-based solutions is that perfect information is needed at all times in the presence of memory aliasing and explicit parallelism. So the focus is more on improving hardware-based solutions, and they are more common. Studies have shown that different snoop-based cache coherence schemes are more sensitive to the write policy than to the specific coherence protocol, and that write-back schemes are more efficient than write-through schemes despite the increased hardware complexity involved in cache coherence support. [1]
Hardware-based cache coherence protocols, though competitive in performance with basic architectures that lack hardware support, incur significant power cost as coherence traffic grows. Thus, as power constraints become tighter and the degree of multiprocessing increases, the viability of hardware-based solutions becomes doubtful.
Cache Coherence Protocols
The two basic methods to utilize the inter-core bus to notify other cores when a core changes something in its cache are update and invalidate. In the update method, if variable 'x' is modified by core 1, core 1 has to send the updated value of 'x' onto the inter-core bus. Each cache listens to the inter-core bus and if a cache sees a variable on the bus which it has a copy of, it will read the updated value. This ensures that all caches have the most up-to-date value of the variable. [2]
In the case of invalidation, an invalidation message is sent onto the inter-core bus when a variable is changed. The other caches read this invalidation signal, and if their cores later attempt to access that variable, the access results in a cache miss and the variable is read again from main memory.
The update method results in a significant amount of traffic on the inter-core bus, as the update signal is sent onto the bus every time the variable is written. The invalidation method only requires that an invalidation signal be sent the first time a variable is altered; this is why invalidation is the preferred method.
In order to improve cache coherence performance, the following protocols were proposed over the years (a minimal state-transition sketch follows the descriptions):
- MSI
MSI stands for Modified, Shared, and Invalid, based on the three states that a line of cache can be in. The Modified state means that a variable in the cache has been modified and therefore has a different value than that found in main memory; the cache is responsible for writing the variable back to main memory. The Shared state means that the variable exists in at least one cache and is not modified; the cache can evict the variable without writing it back to the main memory. The Invalid state means that the value of the variable has been modified by another cache and this value is invalid; the cache must read a new value from main memory (or another cache).
- MESI
MESI stands for Modified, Exclusive, Shared, and Invalid. The Modified and Invalid states are the same as in the MSI protocol. This protocol introduces a new state, the Exclusive state, which means that the variable resides only in this cache and its value matches the value in main memory. The Shared state therefore now indicates that the variable is contained in more than one cache.
- MOSI
The MOSI protocol is identical to the MSI protocol except that it adds an Owned state. The Owned state means that the processor "owns" the variable and will provide its current value to other caches when requested (or at least decide whether to provide it when asked). This is useful because another cache does not have to read the value from main memory; it receives it from the owning cache much faster.
- MOESI
The MOESI protocol is a combination of the MESI and MOSI protocols.
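To make these protocol descriptions concrete, below is a minimal sketch, in C, of the state transitions for a single cache line under the MSI protocol. The state names come from the protocol itself; the event names and the overall structure are illustrative assumptions rather than any particular machine's controller.

<pre>
#include <stdio.h>

/* MSI states for one cache line. */
typedef enum { INVALID, SHARED, MODIFIED } msi_state_t;

/* Events seen by the line's controller: processor-side reads/writes
   and bus-side (snooped) reads and read-exclusives. */
typedef enum { PR_READ, PR_WRITE, BUS_READ, BUS_READ_X } msi_event_t;

/* Return the next state; comments note the bus action implied. */
msi_state_t msi_next(msi_state_t s, msi_event_t e) {
    switch (s) {
    case INVALID:
        if (e == PR_READ)    return SHARED;   /* issue BusRd, fetch block    */
        if (e == PR_WRITE)   return MODIFIED; /* issue BusRdX, gain ownership*/
        return INVALID;                       /* snooped traffic: no change  */
    case SHARED:
        if (e == PR_WRITE)   return MODIFIED; /* issue BusRdX/upgrade        */
        if (e == BUS_READ_X) return INVALID;  /* another core wants to write */
        return SHARED;
    case MODIFIED:
        if (e == BUS_READ)   return SHARED;   /* flush dirty block, demote   */
        if (e == BUS_READ_X) return INVALID;  /* flush, then invalidate      */
        return MODIFIED;
    }
    return s;
}

int main(void) {
    msi_state_t s = INVALID;
    s = msi_next(s, PR_READ);       /* INVALID  -> SHARED   */
    s = msi_next(s, PR_WRITE);      /* SHARED   -> MODIFIED */
    s = msi_next(s, BUS_READ);      /* MODIFIED -> SHARED   */
    printf("final state: %d\n", s); /* prints 1 (SHARED)    */
    return 0;
}
</pre>

MESI, MOSI, and MOESI extend this same state machine with their additional states and transitions.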
Translation Look-aside Buffer (TLB) Coherence
A TLB is an on-chip cache of page table entries (PTEs) used to speed up virtual-to-physical address translation. A PTE can reside in the TLBs of multiple processors, due either to actual sharing or to process migration. PTEs can be modified (for example, during page swapping or when protection is changed), leading to a direct analog of the cache coherence problem.
A variety of solutions have been proposed, depending on whether PTEs are loaded into the TLB directly by hardware or under software control, and on how the relevant state in the TLBs and the operating system is managed. Software solutions (through the operating system) are more widely used, since TLB coherence operations are much less frequent than cache coherence operations. A few systems, in which TLB operations are not visible to software, use hardware solutions instead.
Some of the approaches for implementing TLB coherence are listed below (a simplified shootdown model follows the list):
- Virtually addressed caches
- Software TLB shootdown
- Address space identifiers (ASIDs)
- Hardware TLB coherence
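As an illustration of the software approach, below is a toy model of TLB shootdown in C. Every name and data structure here is a simplifying assumption for illustration only; in a real kernel the invalidation request is delivered by an inter-processor interrupt (IPI) whose handler runs on the target CPU, and the initiator holds the page table lock and waits for acknowledgments before proceeding.

<pre>
#include <stdio.h>

#define NCPUS 4

static int pte = 0x1000;        /* the page table entry in memory      */
static int tlb_valid[NCPUS];    /* does each CPU's TLB cache this PTE? */

/* Stand-in for sending an IPI and running its handler remotely. */
static void ipi_invalidate(int cpu) {
    tlb_valid[cpu] = 0;         /* target drops its stale translation  */
}

/* Shootdown initiated by CPU 'self' when it modifies the PTE. */
static void tlb_shootdown(int self, int new_pte) {
    pte = new_pte;              /* update the PTE in memory            */
    tlb_valid[self] = 0;        /* invalidate the local TLB entry      */
    for (int cpu = 0; cpu < NCPUS; cpu++)
        if (cpu != self && tlb_valid[cpu])
            ipi_invalidate(cpu);   /* real code also waits for acks    */
}

int main(void) {
    for (int cpu = 0; cpu < NCPUS; cpu++)
        tlb_valid[cpu] = 1;     /* every CPU has cached the PTE        */
    tlb_shootdown(0, 0x2000);   /* CPU 0 remaps the page               */
    for (int cpu = 0; cpu < NCPUS; cpu++)
        printf("cpu %d: %s\n", cpu,
               tlb_valid[cpu] ? "stale!" : "invalidated");
    return 0;
}
</pre>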
Memory Consistency
Memory consistency deals with the ordering of all memory operations - loads and stores - to different memory locations. The problem exists even in systems without caches. The code below, from Solihin, shows what can go wrong when memory consistency is not maintained:
<pre>
P0                          P1
S1 : datum = 5              S3 : while(!datumIsReady) {}
S2 : datumIsReady = 1       S4 : print datum
</pre>
In this example, P0 generates a piece of data and then signals (by setting a flag in memory to 1) that the data is ready. P1 spins in a while loop waiting for this flag to become 1 and then prints the data. It is important that the compiler understand what is going on so that program order is preserved. In C, this can be accomplished by declaring such variables volatile, as shown below.
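Below is a runnable C sketch of the example using POSIX threads; datum and datumIsReady are declared volatile so the compiler neither caches them in registers nor reorders their accesses. Note that volatile alone does not constrain hardware reordering on weakly ordered processors; that requires fences or atomic operations, as discussed later.

<pre>
#include <pthread.h>
#include <stdio.h>

volatile int datum = 0;
volatile int datumIsReady = 0;

void *producer(void *arg) {        /* P0 */
    datum = 5;                     /* S1 */
    datumIsReady = 1;              /* S2: must not move before S1 */
    return NULL;
}

void *consumer(void *arg) {        /* P1 */
    while (!datumIsReady)          /* S3: spin on the flag */
        ;
    printf("datum = %d\n", datum); /* S4: expects to print 5 */
    return NULL;
}

int main(void) {
    pthread_t p0, p1;
    pthread_create(&p1, NULL, consumer, NULL);
    pthread_create(&p0, NULL, producer, NULL);
    pthread_join(p0, NULL);
    pthread_join(p1, NULL);
    return 0;
}
</pre>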
Memory Consistency Models
The memory consistency model of a shared-memory multiprocessor is a formal specification of how the memory system appears to the programmer. It eliminates the gap between the behavior expected by the programmer and the actual behavior supported by a system. Effectively, the consistency model places restrictions on the values that can be returned by a read in a shared-memory program execution.
In a single-processor system, maintaining memory consistency only requires that the compiler preserve program order when accessing synchronization variables. In a multiprocessor system, however, the accesses of one processor must appear to execute in program order to all other processors, at least partially.
Memory semantics in uniprocessor systems
Uniprocessor languages use simple sequential semantics for memory operations, which allow the programmer to assume that all memory operations will occur one at a time in the sequential order specified by the program. Thus, the programmer can expect a read to return the value of the last write to the same location before it by the sequential program order. It is sufficient to only maintain uniprocessor data and control dependences. The compiler and hardware can freely reorder operations to different locations if the uniprocessor data and control dependences are respected. This enables compiler optimizations such as register allocation, code motion, and loop transformations, and hardware optimizations, such as pipelining, multiple issue, write buffer bypassing and forwarding, and lockup-free caches, all of which lead to overlapping and reordering of memory operations. [4]
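As a small illustration, the two stores below are to different locations and carry no data or control dependence, so the compiler (or hardware) is free to reorder them; a single-threaded program cannot tell the difference.

<pre>
int a = 0, b = 0;

void update(void) {
    a = 1;   /* independent store #1                      */
    b = 2;   /* independent store #2: may complete first  */
}
/* Any later read in the same thread still sees a == 1 and b == 2,
   so the reordering is invisible to single-threaded code. */
</pre>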
Memory semantics in multiprocessor systems
The programmer's implicit expectations are:
- Memory accesses in a processor take place in program order.
- Each memory access is performed atomically.
A strong consistency model attempting uniprocessor-like semantics can become a global bottleneck, costing performance. Thus, weak consistency models are deployed to improve performance. The advantages of such models are:
- They support out-of-order execution within individual CPUs
- They relax latency constraints on near-simultaneous accesses by different CPUs
The following are the main consistency models; it is the programmer who must take the memory consistency model into account to create correct software (a litmus test follows the list):
- linearizability (also known as strict or atomic consistency)
- sequential consistency
- causal consistency
- release consistency
- eventual consistency
- delta consistency
- PRAM consistency (also known as FIFO consistency)
- weak consistency
- vector-field consistency
- fork consistency
- serializability
- one-copy serializability
- entry consistency
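As an example of what these models guarantee, consider the classic litmus test below, a sketch using plain ints for illustration only. Under sequential consistency, the four memory operations must appear in a single interleaved order consistent with each processor's program order, so at least one of r1 and r2 must be 1. Under weaker models - and on real processors with store buffers - the outcome r1 == 0 && r2 == 0 is also possible.

<pre>
int x = 0, y = 0;
int r1, r2;

void thread0(void) { x = 1; r1 = y; }   /* runs on processor 0 */
void thread1(void) { y = 1; r2 = x; }   /* runs on processor 1 */

/* Sequential consistency forbids r1 == 0 && r2 == 0: whichever
   read executes last in the global order must see the other
   thread's store.  Relaxed models allow both reads to return 0. */
</pre>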
Memory Coherence and Shared Virtual Memory

The memory coherence problem differs between shared virtual memory systems and multicache systems. In a multicache multiprocessor, processors share a physical memory through their private caches. The relatively small size of a cache and the fast bus connection to the shared memory make it feasible for the multicache hardware to use a sophisticated coherence protocol, so that the delay caused by conflicting writes to a memory location is small. [5]
In contrast, a shared virtual memory on a loosely coupled multiprocessor has no physically shared memory and a nontrivial communication cost between processors, so conflicts are unlikely to be resolved with negligible delay; they resemble a “page fault” in a traditional virtual memory system much more closely. Thus, two design choices greatly influence the implementation of a shared virtual memory: the granularity of the memory units (i.e., the “page size”) and the strategy for maintaining coherence.
Memory coherence strategies are classified by how they deal with page synchronization and page ownership. The algorithms for memory coherence depend on the page fault handlers, their servers, and the data structures used, so the page table becomes an important part of these protocols. A simplified write-fault handler is sketched below.
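Below is a toy model, in C, of a write-invalidate strategy in the spirit of [5]: on a write fault, the faulting node obtains the page, invalidates all other copies, and becomes the owner. The data structures are illustrative assumptions; a real system distributes this state across nodes and synchronizes it through page fault handlers and their servers.

<pre>
#include <stdio.h>

#define NODES 3
enum access { NONE, READ, WRITE };

static enum access copy[NODES];  /* each node's access to the page */
static int owner = 0;            /* node currently owning the page */

/* Write fault on 'self': become owner, invalidate all other copies. */
static void write_fault(int self) {
    if (copy[self] == NONE)
        copy[self] = READ;       /* fetch the page from the owner   */
    for (int n = 0; n < NODES; n++)
        if (n != self)
            copy[n] = NONE;      /* invalidate remote copies        */
    owner = self;
    copy[self] = WRITE;          /* upgrade our own access          */
}

int main(void) {
    copy[0] = WRITE; copy[1] = READ; copy[2] = READ; /* initial state */
    write_fault(2);                                  /* node 2 writes */
    for (int n = 0; n < NODES; n++)
        printf("node %d access = %d\n", n, copy[n]); /* only node 2 = 2 (WRITE) */
    printf("owner = %d\n", owner);
    return 0;
}
</pre>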
Synchronization
Process synchronization, which is mainly concerned with different processes committing to a certain sequence of actions, and data synchronization, which deals with maintaining data integrity across the various copies of a dataset, are two distinct but related concepts. Process synchronization primitives can be used to implement data synchronization.
Mutual exclusion is the main requirement to be fulfilled in order to synchronize processes, and it is needed in both single-processor and multiprocessor systems. There are various approaches to providing mutual exclusion in a system (a mutex-based example follows the list):
- Disabling interrupts
- Locks
- Mutex
- Semaphores
- Barriers
- Test and Set
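As an example of the lock-based approaches, the sketch below protects a shared counter with a POSIX mutex; without the lock, concurrent increments could be lost.

<pre>
#include <pthread.h>
#include <stdio.h>

pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
long counter = 0;

void *worker(void *arg) {
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&m);    /* enter critical section       */
        counter++;                 /* exactly one thread at a time */
        pthread_mutex_unlock(&m);  /* leave critical section       */
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter); /* always 200000 */
    return 0;
}
</pre>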
The next section discusses whether mutual exclusion can be implemented without any hardware support. Peterson's algorithm is one such software solution for guaranteeing mutual exclusion.
Peterson's algorithm
This algorithm utilizes the simple idea of combining per-thread flags to indicate the intent to enter the lock and the turn variable to determine which thread should enter in the rare case that both wish to enter at the same time. The algorithm is as shown below: [3]
<pre>
void init() {
    flag[0] = flag[1] = 0;  // 1 -> thread wants to acquire lock (intent)
    turn = 0;               // whose turn is it? (thread 0 or thread 1?)
}

void lock() {
    flag[self] = 1;         // announce intent
    turn = 1 - self;        // be generous: make it the other thread's turn
    while ((flag[1-self] == 1) && (turn == 1 - self))
        ;                   // spin-wait while the other thread has intent
                            // AND it is the other thread's turn
}

void unlock() {
    flag[self] = 0;         // simply undo your intent
}
</pre>
The advantages of this approach are:
- It does not assume much about the underlying hardware; it relies only on load and store instructions being atomic (even across processors) and executing in program order.
- It does not require special instructions.
Though Peterson's algorithm has the above advantages, there are a few problems that render it impractical for use:
- Spin-waiting causes a thread to spend much of its CPU time waiting for another thread to release the lock.
- The algorithm may not work correctly with the out-of-order execution of memory operations performed by most modern processors.
- The algorithm also lacks scalability.
Thus, software solutions alone do not work well; sufficient hardware and OS support is needed to make locking work properly.
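For instance, the reordering problem above can be repaired with such support: if the flag and turn variables are C11 atomics with the default sequentially consistent ordering, the compiler emits the fences needed to keep the store/load sequence intact. A minimal sketch under that assumption:

<pre>
#include <stdatomic.h>

atomic_int flag[2];
atomic_int turn;

void lock(int self) {
    atomic_store(&flag[self], 1);    /* announce intent          */
    atomic_store(&turn, 1 - self);   /* yield to the other side  */
    /* Default seq_cst ordering prevents the loads below from
       being reordered before the stores above. */
    while (atomic_load(&flag[1 - self]) == 1 &&
           atomic_load(&turn) == 1 - self)
        ;                            /* spin */
}

void unlock(int self) {
    atomic_store(&flag[self], 0);    /* withdraw intent */
}
</pre>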
Hardware support
Exclusive locking assumes the worst and proceeds only after acquiring all locks, so that no other thread can interfere. This is a pessimistic approach. In contrast, the optimistic approach proceeds with an update, hoping that it can be completed without interference; this requires collision detection during the update. The optimistic approach is thus more efficient for fine-grained operations.
Processors designed for multiprocessor operation provide special instructions to manage concurrent access to shared variables. Atomic instructions like test-and-set, fetch-and-increment, and swap were sufficient for early processors to implement mutexes for concurrent objects. Today, every modern processor relies on some form of atomic read-modify-write instruction, such as compare-and-swap or LL/SC, for the same purpose.
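For example, a simple test-and-set spinlock can be built from C11's atomic_flag, whose test-and-set operation is typically compiled to an atomic read-modify-write instruction such as x86's xchg:

<pre>
#include <stdatomic.h>

atomic_flag lock_word = ATOMIC_FLAG_INIT;

void spin_lock(void) {
    /* Atomically set the flag and return its previous value;
       keep retrying while another thread already holds the lock. */
    while (atomic_flag_test_and_set(&lock_word))
        ;  /* spin */
}

void spin_unlock(void) {
    atomic_flag_clear(&lock_word);  /* release: allow the next winner */
}
</pre>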
References
[1] Mirko Loghi, Massimo Poncino, and Luca Benini. "Cache Coherence Tradeoffs in Shared-Memory MPSoCs." ACM Digital Library. Web. 18 Mar. 2011. <http://portal.acm.org/citation.cfm?id=1151081>.
[2] http://www.windowsnetworking.com/articles_tutorials/Cache-Coherency.html
[3] http://pages.cs.wisc.edu/~remzi/OSFEP/threads-locks.pdf
[4] Sarita V. Adve and Kourosh Gharachorloo. "Shared Memory Consistency Models: A Tutorial." Digital Western Research Laboratory, Palo Alto. <http://www.hpl.hp.com/techreports/Compaq-DEC/WRL-95-7.pdf>.
[5] Kai Li and Paul Hudak. "Memory Coherence in Shared Virtual Memory Systems." ACM Transactions on Computer Systems (TOCS), Volume 7, Issue 4, Nov. 1989.
See Also
[1] Aho, A. V., Hopcroft, J. E., and Ullman, J. D. The Design and Analysis of Computer Algorithms. Addison-Wesley, Reading, Mass. 1974.
[2] Archibald, J., and Baer, J. An economical solution to the cache coherence problem. In Proceedings of the 11th Annual Symposium on Computer Architecture (Ann Arbor, Mich., June 1984). pp. 355-363.
[3] Censier, L. M., and Feautrier, P. A new solution to coherence problems in multicache systems. IEEE Trans. Comput. C-27, 12 (Dec. 1978), 1112-1118.
[4] Eggers, S. J., and Katz, R. H. A characterization of sharing in parallel programs and its applications to coherence protocol evaluation. In Proceedings of the 15th Annual International Symposium on Computer Architecture (Honolulu, June 1988).