CSC/ECE 506 Fall 2007/wiki 11 e4
== '''Sections 1.3.1 and 1.3.2: Communication and programming model''' ==
== Ordering ==
In parallel programming models, ordering is key to coordinating the activity of all threads. Ordering ensures that dependencies are maintained and that threads remain synchronized at the points where explicit synchronization operations are specified.
Parallel programming models have been developed to take advantage of shared-memory multiprocessors and distributed-memory systems. The Nanothreads Programming Model (NPM) is one such model; it exploits multiple levels of loop and functional parallelism. The model provides a runtime environment with dynamic program adaptability, allowing a program to adjust the granularity of the parallelism it generates to the available resources. Its main ordering and scheduling objective is that application scheduling, at the user level, and virtual process scheduling, at the kernel level, be tightly coordinated; the result is high performance.
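NPM exposes its own runtime interface, which is not reproduced here. As a rough illustration of the underlying idea of matching the granularity of generated parallelism to the work at hand, the following is a minimal sketch in standard C++, assuming a hypothetical grain-size cutoff below which work is done serially; it is not NPM code.

<pre>
#include <cstddef>
#include <future>
#include <iostream>
#include <numeric>
#include <vector>

// Hypothetical grain-size cutoff: below this, the work is done serially.
const std::size_t GRAIN = 10000;

// Recursively sum a range, spawning a subtask for one half while the calling
// thread handles the other half -- a simple form of "adjust the granularity
// of the generated parallelism to the work and resources available".
long long parallel_sum(const std::vector<int>& v, std::size_t lo, std::size_t hi)
{
    if (hi - lo <= GRAIN)                         // small enough: run serially
        return std::accumulate(v.begin() + lo, v.begin() + hi, 0LL);

    std::size_t mid = lo + (hi - lo) / 2;
    // Spawn the left half as an asynchronous task; compute the right half here.
    auto left = std::async(std::launch::async, parallel_sum, std::cref(v), lo, mid);
    long long right = parallel_sum(v, mid, hi);
    return left.get() + right;                    // join: orders the combine step
}

int main()
{
    std::vector<int> data(1 << 20, 1);
    std::cout << parallel_sum(data, 0, data.size()) << "\n";  // prints 1048576
}
</pre>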
Other programming models combine loop directives and message passing to maintain order and facilitate reordering. An example is GeoFEM, part of a parallel finite-element platform for solid-earth simulation, running on the Earth Simulator supercomputer in Japan. A special reordering technique was used to create parallel iterative solvers with localized preconditioning. This allows GeoFEM to attain concurrent local operations, no global dependency, continuous memory access, and innermost loops long enough to take advantage of the Earth Simulator's vector processors.
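GeoFEM's solvers are far more elaborate than anything shown here; the fragment below is only a minimal sketch, under illustrative assumptions (array size, a simple sum), of the hybrid style just described: loop directives (OpenMP) parallelize the local loop within a node, while message passing (MPI) combines results across nodes.

<pre>
// Minimal sketch of the hybrid "loop directives + message passing" style:
// OpenMP shares the local loop among threads in a node, MPI combines the
// per-node partial results. (Illustrative only -- not GeoFEM code.)
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    std::vector<double> local(100000, 1.0);   // this process's share of the data
    double local_sum = 0.0;

    // Loop directive: threads within the node share the local loop.
    #pragma omp parallel for reduction(+:local_sum)
    for (long i = 0; i < (long)local.size(); ++i)
        local_sum += local[i];

    // Message passing: combine the per-node partial sums across all nodes.
    double global_sum = 0.0;
    MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        std::printf("global sum = %f (from %d processes)\n", global_sum, nprocs);

    MPI_Finalize();
    return 0;
}
</pre>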
The programming model examples above employ different reordering techniques for processes and data. These techniques are listed and explained below.
* Implicit ordering – Operations within a thread execute in program order. When multiple threads access the same data, however, there is no guarantee about when each thread will reach that data relative to the others.
* Explicit reordering – Programs avoid overhead by explicitly assigning and ordering processes to take advantage of data locality. This approach works well when there are complicated data dependencies.
* Reverse Cuthill-McKee (RCM) reordering – RCM is a typical level-set reordering method. Elements of a level set are traversed from the nodes of the highest degree to those of the lowest degree, according to dependency relationships; here, degree refers to the number of nodes connected to a given node. (A small sketch of Cuthill-McKee style reordering appears after this list.)
* DJDS reordering – Reorders data to produce arrays of coefficients with continuous memory access. Descending-order jagged diagonal storage (DJDS) permutes the rows of a sparse matrix into order of decreasing number of non-zeros, which makes vector processing more efficient by providing sufficiently long innermost loops. It can be modified to split and permute arrays for distribution across an SMP node (parallel DJDS, or PDJDS). [3] provides a good explanation and comparison of the RCM and PDJDS reordering techniques.
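The sketch below illustrates the Cuthill-McKee family of level-set orderings on a small, made-up adjacency-list graph (the graph and the starting node are illustrative assumptions, and the code is not taken from [3]). It follows the common textbook formulation, in which a breadth-first traversal visits each node's unvisited neighbors in order of degree and the resulting sequence is then reversed to give RCM.

<pre>
#include <algorithm>
#include <iostream>
#include <queue>
#include <vector>

// Cuthill-McKee style level-set reordering (textbook formulation):
// breadth-first traversal, visiting each node's unvisited neighbors in
// order of degree, then reversing the resulting sequence (RCM).
std::vector<int> rcm_order(const std::vector<std::vector<int>>& adj, int start)
{
    int n = (int)adj.size();
    std::vector<bool> visited(n, false);
    std::vector<int> order;
    std::queue<int> q;

    q.push(start);
    visited[start] = true;

    while (!q.empty()) {
        int u = q.front(); q.pop();
        order.push_back(u);

        // Gather unvisited neighbors and sort them by degree.
        std::vector<int> nbrs;
        for (int v : adj[u])
            if (!visited[v]) { visited[v] = true; nbrs.push_back(v); }
        std::sort(nbrs.begin(), nbrs.end(),
                  [&](int a, int b) { return adj[a].size() < adj[b].size(); });
        for (int v : nbrs) q.push(v);
    }

    std::reverse(order.begin(), order.end());   // the "reverse" in RCM
    return order;
}

int main()
{
    // Made-up 6-node graph (adjacency lists), purely for illustration.
    std::vector<std::vector<int>> adj = {
        {1, 2}, {0, 3}, {0, 3, 4}, {1, 2, 5}, {2, 5}, {3, 4}
    };
    for (int v : rcm_order(adj, 0)) std::cout << v << ' ';
    std::cout << '\n';
}
</pre>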
== Synchronization ==
In order to exploit parallelism, a task must be broken up into several smaller parts, which can then be spread out among many computational resources. These smaller tasks must nonetheless cooperate to produce the final desired data while fully exploiting the resources over which they are spread. The resources can be processors, as in multi-threading, or even collections of personal computers, as in distributed computing. If the tasks fail to cooperate, the result may be data inconsistency. In this section we discuss different synchronization tools and then discuss how the problem of synchronization is being tackled now and in the future.
There is a small collection of tools that are commonly used for synchronization in parallel architectures. These include:
* Locks – Locks limit access to shared resources by marking each resource as either available or unavailable. Successfully acquiring a lock gives a task the right to use the protected resource. The simplest implementation of a lock is a binary semaphore; a semaphore is a flag that signals the presence of a specific condition in a task, and semaphores offer a way to synchronize access to shared resources in a multiprogramming environment.
* Mutual exclusion – Certain operations on certain data are performed by only one task at a time. The operations are serialized, but occur in no particular order. When implemented in code, these operations are called critical sections (or critical regions).
* Barriers – A group of tasks must stop at a certain point and cannot proceed past it until every other task in the group has also reached it. This synchronization point is called a barrier.
* Events – This type of synchronization uses specific events to allow other processes to start. An event may trigger a single process or a group of processes. (A combined sketch of locks, a barrier, and an event-style signal follows this list.)
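The sketch below combines three of the mechanisms above in standard C++: a mutex protecting a critical section (lock / mutual exclusion), a hand-rolled barrier that all threads must reach before any may continue, and a condition variable used as a simple event. The thread count, the shared counter, and the barrier class are illustrative assumptions, not part of any particular library discussed here.

<pre>
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

// --- Lock / mutual exclusion: a mutex protects the critical section ---
std::mutex counter_mx;
long counter = 0;                       // shared data

// --- Barrier: no thread proceeds past wait() until all N have arrived ---
class Barrier {
public:
    explicit Barrier(int n) : remaining_(n), generation_(0), n_(n) {}
    void wait() {
        std::unique_lock<std::mutex> lk(mx_);
        int gen = generation_;
        if (--remaining_ == 0) {        // last thread to arrive releases the rest
            remaining_ = n_;
            ++generation_;
            cv_.notify_all();
        } else {
            cv_.wait(lk, [&] { return gen != generation_; });
        }
    }
private:
    std::mutex mx_;
    std::condition_variable cv_;
    int remaining_, generation_, n_;
};

// --- Event: one thread signals, waiting threads are released ---
std::mutex event_mx;
std::condition_variable event_cv;
bool event_fired = false;

int main() {
    const int N = 4;                    // illustrative thread count
    Barrier barrier(N);
    std::vector<std::thread> threads;

    for (int id = 0; id < N; ++id) {
        threads.emplace_back([&, id] {
            // Critical section: only one thread increments at a time.
            { std::lock_guard<std::mutex> lk(counter_mx); ++counter; }

            barrier.wait();             // everyone must get here before continuing

            // Event: thread 0 fires it, the others wait for it.
            if (id == 0) {
                std::lock_guard<std::mutex> lk(event_mx);
                event_fired = true;
                event_cv.notify_all();
            } else {
                std::unique_lock<std::mutex> lk(event_mx);
                event_cv.wait(lk, [] { return event_fired; });
            }
        });
    }
    for (auto& t : threads) t.join();
    std::cout << "counter = " << counter << "\n";   // prints 4
}
</pre>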
New mechanisms, such as the monotonic counter (see references), build on existing process synchronization constructs. Most of the current effort and research is going into threading libraries that allow software developers to build scalable applications that make optimal use of massively parallel architectures. The goal of these libraries is to reduce the time spent dealing with low-level details without compromising application performance. Intel, for example, has released its open-source Threading Building Blocks library, promoted as a counterpart to the STL (Standard Template Library) for multi-threaded applications. Such libraries allow rapid development of new applications while hiding the low-level thread synchronization mechanics. A few years ago distributed computing was the talk of the town, but with the movement of PC architecture toward multiple cores per processor, these threading libraries will have a big role to play in getting the best performance out of the new architectures.
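As an illustration of the style such libraries encourage, the fragment below uses Threading Building Blocks' parallel_for to scale every element of an array without any explicit thread or lock management. The array, the scaling operation, and the grain size are illustrative assumptions; only the blocked_range/parallel_for pattern comes from the library's documented interface.

<pre>
#include <cstddef>
#include <iostream>
#include <vector>
#include "tbb/blocked_range.h"
#include "tbb/parallel_for.h"

// Body object: TBB splits the index range into chunks and invokes
// operator() on each chunk from its worker threads.
class Scale {
public:
    Scale(std::vector<double>& v, double factor) : v_(v), factor_(factor) {}
    void operator()(const tbb::blocked_range<std::size_t>& r) const {
        for (std::size_t i = r.begin(); i != r.end(); ++i)
            v_[i] *= factor_;
    }
private:
    std::vector<double>& v_;
    double factor_;
};

int main() {
    // Note: TBB 2.0 also requires constructing a tbb::task_scheduler_init
    // object before the first parallel algorithm; later releases initialize
    // the scheduler automatically.
    std::vector<double> data(1000000, 1.0);               // illustrative data
    tbb::parallel_for(tbb::blocked_range<std::size_t>(0, data.size(), 10000),
                      Scale(data, 2.0));                  // 10000 = grain size
    std::cout << data.front() << " ... " << data.back() << "\n";  // 2 ... 2
}
</pre>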
== References: ==
1. Hadjidoukas, P. et al., Integrating MPI and the Nanothreads Programming Model, http://citeseer.ist.psu.edu/cache/papers/cs/27581/http:zSzzSzwww.hpclab.ceid.upatras.grzSzpgroupzSzmemberszSzpehzSzpubszSzpdp02.pdf/hadjidoukas02integrating.pdf, 2002.
2. Isard, M. et al., Automatic Mutual Exclusion, http://research.microsoft.com/users/misard/papers/hotos2007.pdf, 2007.
3. Nakajima, K. et al., Parallel Iterative Solvers for Finite-Element Methods using a Hybrid Programming Model on SMP Cluster Architectures, http://geofem.tokyo.rist.or.jp/report_common/GeoFEM03_003.pdf, March 2003.
4. Thornley, J. and Chandy, K. M., "Monotonic Counters: A New Mechanism for Thread Synchronization," 14th International Parallel and Distributed Processing Symposium (IPDPS'00), p. 573, 2000.
5. Intel Threading Building Blocks 2.0, http://www.intel.com/cd/software/products/asmo-na/eng/294797.htm?cid=cim:ggl|spd_us_tbb|k7A3E|s
6. Apple Threading Libraries, http://developer.apple.com/documentation/Cocoa/Conceptual/Multithreading/articles/ThreadLocking.html
7. Silberschatz, Galvin, Gagne, "Operating System Concepts", Sixth Edition.
== External Links: ==
* [http://www.intel.com/cd/ids/developer/asmo-na/eng/segments/hpc/95223.htm?page=12 Trends in Distributed Computing]
* [http://www.ddj.com/hpc-high-performance-computing/199100439;jsessionid=4WBRUFR1BVGECQSNDLPCKH0CJUNN2JVN DDJ: What's different about multiprocessor programming]
* [http://en.wikipedia.org/wiki/Synchronization_%28computer_science%29 Wikipedia: Synchronization]
* [http://users.actcom.co.il/~choo/lupg/tutorials/parallel-programming-theory/parallel-programming-theory.html Parallel Programming - Basic Theory for the unwary]