CSC/ECE 506 Fall 2007/wiki 11 e4

From Expertiza_Wiki
Revision as of 16:42, 10 September 2007 by Sbhanna (talk | contribs) (→‎Reordering)
Jump to navigation Jump to search

Sections 1.3.1 and 1.3.2: Communication and programming model.

Ordering

In parallel programming models, ordering is key to coordinating the activity of all threads. Ordering will ensure that dependencies are maintained and that specified threads remain synchronized via explicit synchronization operations.

Parallel Programming models have developed to take advantage of shared memory multiprocessors and distributed memory systems. Nanothreads Programming Model (NPM) is one such programming model that exploits multiple levels of loop and functional parallelism. This model results in a runtime environment that has dynamic program adaptability, allowing the program to adjust the granularity of the generated parallelism to the available resources. The main ordering and scheduling objective of this model is that application scheduling, at the user level, and virtual process scheduling, at the kernel level, be tightly coordinated. The result is high performance.

Other programming models combine loop directives and message passing to maintain order and facilitate reordering. An example of a program using this method is the GeoFEM on the Earth Simulator Supercomputer in Japan. This is part of a parallel finite-element platform for solid earth simulation being studied. A special reordering technique was used to create parallel iterative solvers with localized preconditioning. This allows GeoFEM to attain concurrent local operations, no global dependancy, continuous memory access, and sufficiently long innermost loops to take advantage of the vector processors on Earth Simulator.

Implicit ordering – Operations in a thread are in program order. When multiple threads access the same data, there is no guarantee that they will not reach the data too late/too soon.

Mutual Exclusion – Certain operations on certain data are performed by only one thread/process at a time. The processes are sequential, but without a specific order or events.

Automatic Mutual Exclusion - In a simplified model, Automatic Mutual Exclusion consists of asynchronous method calls and guarantees that the program execution is essentially executing each of the calls in some serialized order. It achieves concurrency because the async construct is similar to forking a thread in thread-like systems. The system attempts to execute the method calls concurrently and is subject to strategies that prevent excessive transaction aborts. In more complicated models, the asynchronous methods may be fragmented, using a Yield function to synchronize the fragments prior to blocks.

Events – Specific events allow other processes to start. An event may trigger a single process or a group of processes.

Explicit Reordering – Programs avoid overhead by explicitly assigning and ordering processes to take advantage of data locality. This approach works well when there is complicated data dependency.

Reverse Cuthill-McKee (RCM) Reordering - RCM is a typical level set reordering method. Elements of a level set are traversed from the nodes of the highest degree to those of the lowest degree, according to dependency relationships. Degree refers to the number of nodes connected to each node.

DJDS reordering- Reorders data to produce arrays of coefficients with continuous memory access. Vector programming is made more efficient by providing sufficient length for innermost loops. Descending-order jagged diagonal storage (DJDS) involves permuting rows into an order of decreasing number of non-zeros. Can be modified to split and permute arrays to be distributed across an SMP node (parallel DJDS or PDJDS) [3] provides a good explanation and comparison of RCM/PDJDS re-ordering techniques.

References:

1. Hadjidoukas, P. et all, Integrating MPI and the Nanothreads Programming Model, http://citeseer.ist.psu.edu/cache/papers/cs/27581/http:zSzzSzwww.hpclab.ceid.upatras.grzSzpgroupzSzmemberszSzpehzSzpubszSzpdp02.pdf/hadjidoukas02integrating.pdf, 2002.

2. Isard, M. et al, Automatic Mutual Exclusion, http://research.microsoft.com/users/misard/papers/hotos2007.pdf, 2007.

3. Nakajima, K. et al, Parallel Iterative Solvers for Finite-Element Methods using a Hybrid Programming Model on SMP Cluster Architectures, http://geofem.tokyo.rist.or.jp/report_common/GeoFEM03_003.pdf, March 2003.

Synchronization

There is a small collection of tools that are commonly used for synchronization in parallel architectures. These include:

Locks - Locks limit access to shared resources by identifying each resource as being available or unavailable. Successful acquisition of a lock means that your code now has the right to use the protected resource.

Semaphores - A semaphore is a flag that signals the presence of a specific condition in your program. Semaphores offer a way to synchronize the execution of code.

Critical regions" - A critical region is a section of code that can be executed by only one thread at a time. You might designate a block of code as a critical region to prevent modification of a common data structure or to prevent race conditions as two threads execute the same code.

Barriers - Deadlock occurs whenever a thread is blocked waiting on a resource of another thread that will never become available.

Atomic operations - The x86 bit test and set is an example of Instruction Set Architecture support for synchronization. They simplify synchronization by eliminating the need to locks altogether.

This list is short, has not changed much in the recent past and is unlikely to change any time soon. Any new mechanisms like that of a monotonic counter(see references) only build on existing process synchronization constructs. Most of the effort and research seems to be going in the way of building threading libraries that will allow software developers to build scalable applications that can optimally utilize massively parallel architectures. These goal of these libraries is to reduce the amount of time spent on dealing with the low-level without compromising on the performance of the applications. Intel, for example, has gone public with an open source threading building block library with promises that it will become the STL(Standard Template Library) for multi-threaded applications. These libraries allow rapid application development of new applications while hiding the low-level thread synchronization mechanics. A few years ago, distributed computing was the talk of the town but with the movement of PC architecture toward multiple cores per processor, these threading libraries will have a big role to play in getting the best performance out of these new architectures.

References

1. John Thornley, K. Mani Chandy, "Monotonic Counters: A New Mechanism for Thread Synchronization," ipdps, p. 573, 14th International Parallel and Distributed Processing Symposium (IPDPS'00), 2000

2. Intel Threading Building Blocks 2.0, http://www.intel.com/cd/software/products/asmo-na/eng/294797.htm?cid=cim:ggl%7Cspd_us_tbb%7Ck7A3E%7Cs

3. Apple Threading Libraries, http://developer.apple.com/documentation/Cocoa/Conceptual/Multithreading/articles/ThreadLocking.html