CSC/ECE 506 Spring 2010/ch 7 pl/

From Expertiza_Wiki
Revision as of 00:15, 26 March 2010 by Pwlane (talk | contribs)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to navigation Jump to search

Shared Memory multiprocessors run into several problems that are more pronounced then on their uniprocessors counterparts. The Solihin text used in this course goes into a small amount of detail on three of these issues, that is Cache Coherence, Memory Consistency and Synchronization. It is the goal of this wiki to discuss these three issues and also what can be done to ensure that instructions are handled in both a timely and efficient manner and in a manner that is consistent with what the programmer might desire.

Cache Coherence

This problem arises when each chip in the system has it's own separate and discrete cache. The Solihin text gives a good example on how a problem can arise in a multiprocessor when there is no method to ensure cache coherence. In the example given in the text, the code below is executed.

{{{ sum = 0;

  1. pragma omp parallel for

for (i=0; i<2; i++) { #pragma omp critical { sum = sum + a[i]; } } print sum; }}}


The problem that will arise as discussed in the text is that P0 will calculate a value for sum, and store it in it's cache, and since P1 is not aware of the contents of P0's cache, it will read an invalid value for sum from main memory. It will be impossible to obtain the actual value for sum since each processor now believes that it possess the correct value for sum. That is, since they both read the value of sum from main memory to be 0, P0 thinks that the value of sum to be sum + a[0] while P1 thinks the value of sum to be sum + a[1]. It is important to note that placing the summation in a critical section does not fix this problem, since the problem is not that the 2 processors are executing at the same time, but that the main memory is not being updated and that the processors are not aware that the correct value for sum is now present (and only present) in another processors cache.

Memory Consistency

Another important issue in multiprocessor systems is the ordering of loads and stores into memory. The Solihin text gives an example of a signal-wait synchronization. In this example, P0 generates a certain piece of data and then signals (by setting a certain memory location to 1) that this data is now ready. P1 is spinning in a while waiting for this data ready flag to become 1, and then it will print the data. It is important that the complier understand what is going on so that the program order is preserved. In C, this can be accomplished by declaring certain variables as volatile.

Ordering on a Uniprocessor

On a uniprocessor system, memory ordering can be changed by the compiler quite a bit to ensure the best possible performance. The 2008 blog post on flickeringtubelight.net goes into some details about ordering on a uniprocessor vs. a multiprocessor. For example, reordering memory operations to different addresses is fine, and will cause no harm on a uniprocessor system. Some examples of OK changes to ordering are:

(load A, load B) > (load B, load A) (store B, store A) > (store A, store B)

These are OK since we on a uniprocessor system we can be certain that there is not another processor waiting for this information or "could observe this order and infer anything from the order"

Although it appears that we can freely change the order of memory operations on a uniprocessor, even on a uniprocessor we cannot change the order of operations to the SAME address. For example (load A, store A) > (store A, load A) would not be OK.

Ordering on a Multiprocessor

On a multiprocessor much more care must be taken to ensure that all of the loads and stores are committed to memory in a valid order. According to the blog referred to above the following must take place on a multiprocessor system:

(store A, store B): maintained in order. (load A, store B): load must be done first, ignoring a younger store in the queue. (store A, load B): load must be not go ahead of the store, even though the store is to a different address. (load A, load B): even if the addresses are different, the older load must go first.

Synchronization

The Solihin text uses the #pragma omp critical directive in the code given in the cache coherency section above. In order to use this directive the hardware must have some provision to ensure that only this one thread is going to be accessing the critical section at a given time. According to the text, the problem not only exists on multiprocessor systems but on uniprocessors as well. The problem discussed in the text is based on the implementation of a lock function.

The problem is that with out special directives, there is no way to ensure that only 1 processor obtains that lock, and in turn that there is only one processing element inside the critical section. The text suggests that the one good way to remedy this problem is to use an atomic instruction for the locking function. With an atomic function, all commands must execute successfully or else it will appear that none of the commands were executed. This allows for one processor to obtain the lock and enter the critical section, while the other processor(s) wait until the lock is released before entering the critical section.

There are many other ways to implement synchronization directives as well. The ACM article discusses using a test and set method for synchronization as well as a direct interrupt to another core. As it states, both of these are special hardware solutions that must be implemented in order for the programmer to be able to take advantage of them.

References

  • Yan Solihin, Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems, Solihin Books, August 2009.