CSC/ECE 506 Spring 2010/ch 7 pl/
Shared-memory multiprocessors run into several problems that are more pronounced than their uniprocessor counterparts. The Solihin text used in this course goes into detail on three of these issues, that is cache coherence, memory consistency and synchronization. It is the goal of this wiki supplement to discuss these three issues and also what can be done to ensure that instructions are handled in both a timely and efficient manner and in a manner that is consistent with what the programmer might desire.
Cache Coherence
This problem arises when each chip in the system has its own separate and discrete cache. The Solihin text gives a good example on how a problem can arise in a multiprocessor when there is no method to ensure cache coherence. In the example given in the text, the code below is executed.
{{{ sum = 0;
- pragma omp parallel for
for (i=0; i<2; i++) { #pragma omp critical { sum = sum + a[i]; } } print sum; }}}
The problem that will arise as discussed in the text is that P0 will calculate a value for sum, and store it in its cache, and since P1 is not aware of the contents of P0's cache, it will read an invalid value for sum from main memory. It will be impossible to obtain the actual value for sum since each processor now believes that it possess the correct value for sum. That is, since they both read the value of sum from main memory to be 0, P0 thinks that the value of sum to be sum + a[0] while P1 thinks the value of sum to be sum + a[1]. It is important to note that placing the summation in a critical section does not fix this problem, since the problem is not that the 2 processors are executing at the same time, but that the main memory is not being updated and that the processors are not aware that the correct value for sum is now present (and only present) in another processors cache.
Memory Consistency
Another important issue in multiprocessor systems is the ordering of loads and stores into memory. The Solihin text gives an example of a signal-wait synchronization. In this example, P0 generates a certain piece of data and then signals (by setting a certain memory location to 1) that this data is now ready. P1 is spinning in a while waiting for this data ready flag to become 1, and then it will print the data. It is important that the complier understand what is going on so that the program order is preserved. In C, this can be accomplished by declaring certain variables as volatile.
Ordering on a Uniprocessor
On a uniprocessor system, memory ordering can be changed by the compiler quite a bit to ensure the best possible performance. The 2008 blog post on flickeringtubelight.net [1] goes into some details about ordering on a uniprocessor vs. a multiprocessor. For example, reordering memory operations to different addresses is fine, and will cause no harm on a uniprocessor system. Some examples of OK changes to ordering are:
(load A, load B) > (load B, load A) (store B, store A) > (store A, store B)
These are OK since we on a uniprocessor system we can be certain that there is not another processor waiting for this information or "could observe this order and infer anything from the order"
Although it appears that we can freely change the order of memory operations on a uniprocessor, even on a uniprocessor we cannot change the order of operations to the same address. For example (load A, store A) --> (store A, load A) would not be OK.
Ordering on a Multiprocessor
On a multiprocessor much more care must be taken to ensure that all of the loads and stores are committed to memory in a valid order. According to the blog referred to above the following must take place on a multiprocessor system:
(store A, store B): maintained in order. (load A, store B): load must be done first, ignoring a younger store in the queue. (store A, load B): load must be not go ahead of the store, even though the store is to a different address. (load A, load B): even if the addresses are different, the older load must go first.
Synchronization
Open MP
The Solihin text uses the "#pragma omp" critical directive in the code given in the cache-coherence section above. In order to use this directive the hardware must have some provision to ensure that only this one thread is going to be accessing the critical section at a given time. According to the text, the problem not only exists on multiprocessor systems but on uniprocessors as well. The problem discussed in the text is based on the implementation of a lock function.
The problem is that with out special directives, there is no way to ensure that only 1 processor obtains that lock, and in turn that there is only one processing element inside the critical section. The text suggests that the one good way to remedy this problem is to use an atomic instruction for the locking function. With an atomic function, all commands must execute successfully or else it will appear that none of the commands were executed. This allows for one processor to obtain the lock and enter the critical section, while the other processor(s) wait until the lock is released before entering the critical section.
Fence Insertion
One method that is commonly used on production shared memory systems and microprocessors it the insertion of fences. These fences can be used by the programmer to force the chip to perform a synchronization, or wait until "all previous memory operations of the processor are guaranteed to have completed" [2] Fence insertion is a method giving the programmer (or compiler) control over the hardware's default ordering of memory operations (such as sequential consistency, processor consistency, weak ordering, etc). It is important from a performance standpoint that the complier be able to quickly and efficiently insert fences automatically, and the algorithms used for this automatic insertion have a large bearing on performance. [3]
Instructions used for fence operations
- SPARC V8 > store barrier
- SPARC V9 > MEMBAR
- Alpha > memory barrier and write memory barrier
- MIPS / PPC > sync
- Intel x86 > lfence (load) sfence (store)
Taken from [4]
Other Methods
There are many other ways to implement synchronization directives as well. The ACM article [5] discusses using a test and set method for synchronization as well as a direct interrupt to another core. As it states, both of these are special hardware solutions that must be implemented in order for the programmer to be able to take advantage of them. The programmer (or complier) is responsible for knowing which synchronization directives are available on a given architecture and implementing them in an efficient manner.
References
- Yan Solihin, Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems, Solihin Books, August 2009.
- Notes on Memory Consistancy and Cache Coherence http://flickeringtubelight.net/blog/wp-content/uploads/2008/06/notesonconsistencyandcoherence.pdf
- Mirko Loghi, Massimo Poncino, Luca Benini, Cache Coherence Tradeoffs in Shared-Memory MPSoCs https://wiki.ittc.ku.edu/ittc/images/0/0f/Loghi.pdf
- Automatic fence insertion for shared memory multiprocessing http://portal.acm.org/citation.cfm?id=782854&dl=GUIDE&coll=GUIDE&CFID=84866326&CFTOKEN=84791790