CSC/ECE 506 Spring 2010/ch 3 jb/Parallel Programming Model Support
Supplement to Chapter 3: Support for parallel programming models. This article discusses how DOACROSS, DOPIPE, DOALL, and related constructs are implemented in packages such as POSIX threads, Intel Threading Building Blocks, and OpenMP 2.0 and 3.0.
POSIX threads
A POSIX thread, also referred to as a pthread, is used in shared address space architectures for parallel programs. Through the pthread API, various functions can be used to create and manage pthreads. In order to fully understand how pthreads can be used to exploit DOACROSS, DOPIPE, and DOALL parallelism, a brief introduction to creating and terminating pthreads, mutexes, and condition variables is necessary.
Creating and Terminating Pthreads
In order to create a pthread, the API provides the pthread_create() function. pthread_create() accepts four arguments: thread, attr, start_routine, and arg. The thread argument provides a unique identifier for the thread being created. The attr argument specifies a thread attribute object; passing NULL selects the default attributes. For the examples discussed here, the default attributes are sufficient; for more information on setting thread attributes, see the references. The start_routine argument is the subroutine that the new thread will execute. The arg argument passes a single argument to that subroutine (it can be set to NULL if the subroutine takes no argument).
In order to terminate a pthread, the API provides the pthread_exit() function; the thread being terminated (even the main thread) simply calls this function. NOTE: alternate methods for terminating pthreads exist but are not discussed here, for simplicity.
#include <pthread.h>
#include <stdio.h>
#define NUM_THREADS 5

void *hello(void *arg) {
    printf("Hello from thread %ld\n", (long)arg);
    pthread_exit(NULL);
}

int main() {
    pthread_t threads[NUM_THREADS];
    for (long t = 0; t < NUM_THREADS; t++)
        pthread_create(&threads[t], NULL, hello, (void *)t);
    pthread_exit(NULL);   /* main exits, but the worker threads keep running */
}
Mutexes
A mutex (short for mutual exclusion) variable is a lock that at most one thread can hold at a time. The API provides the pthread_mutex_t data type to declare a mutex variable statically, and the pthread_mutex_init() function to create one dynamically.
Mutex variables are used to implement locks, so that multiple pthreads do not access critical data in a program simultaneously. The API provides the pthread_mutex_lock() and pthread_mutex_unlock() functions, which simply lock or unlock the specified mutex variable.
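As a sketch of how these calls fit together, the fragment below has several threads increment a shared counter under a mutex; the function and constant names (run_counter, INCREMENTS) are illustrative choices, not part of the pthread API.

```cpp
#include <pthread.h>
#define NUM_THREADS 4
#define INCREMENTS 100000

long counter = 0;
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;  // statically created mutex

void *increment(void *) {
    for (int i = 0; i < INCREMENTS; i++) {
        pthread_mutex_lock(&lock);    // enter the critical section
        counter++;                    // only one thread at a time runs this
        pthread_mutex_unlock(&lock);  // leave the critical section
    }
    return NULL;
}

long run_counter() {
    pthread_t t[NUM_THREADS];
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_create(&t[i], NULL, increment, NULL);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(t[i], NULL);
    return counter;   // deterministic with the lock; unpredictable without it
}
```

Without the lock and unlock calls, the counter++ read-modify-write sequences of different threads could interleave and lose updates.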
Condition Variables
Condition variables allow for point-to-point synchronization between threads. The API provides two functions for this purpose: pthread_cond_wait() and pthread_cond_signal(). pthread_cond_wait() blocks the calling thread until the specified condition is signaled; it must be called with the associated mutex held, and it releases the mutex while waiting. pthread_cond_signal() wakes up another thread that is waiting on the condition. These pthread-specific functions are analogous to the general wait() and post() functions discussed in the Solihin text.
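A minimal sketch of this wait/signal pattern is shown below: one thread produces a value and signals, while the other waits for it. The producer/consume names and the value 42 are illustrative; the re-check loop around pthread_cond_wait() is the standard idiom, since a thread can be woken before the condition actually holds.

```cpp
#include <pthread.h>

int ready = 0;                 // the "condition", guarded by the mutex
int data = 0;
pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t cv = PTHREAD_COND_INITIALIZER;

void *producer(void *) {
    pthread_mutex_lock(&m);
    data = 42;                 // produce the value first
    ready = 1;                 // then mark the condition satisfied
    pthread_cond_signal(&cv);  // wake the waiting consumer
    pthread_mutex_unlock(&m);
    return NULL;
}

int consume() {
    pthread_t t;
    pthread_create(&t, NULL, producer, NULL);
    pthread_mutex_lock(&m);
    while (!ready)                    // always re-check the condition in a loop
        pthread_cond_wait(&cv, &m);   // atomically releases m and sleeps
    pthread_mutex_unlock(&m);
    pthread_join(t, NULL);
    return data;
}
```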
DOACROSS Parallelism
In order to exploit DOACROSS parallelism using pthreads, condition variables are needed to synchronize the threads. Since iterations are executed across threads and loop-carried data dependences are assumed to exist between iterations (see the Solihin text), the condition variables described above are used to ensure correct execution of the code.
Let's take a simple example where each thread calculates A[i] = A[i-1] + B[i]. Point-to-point synchronization is necessary to make sure A[i-1] is not read before its value is written. This is where the pthread condition variable is useful: we place pthread_cond_wait() before the instruction above and pthread_cond_signal() after it, so that the current thread performs its computation only after the previous thread has signaled that it has completed.
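A sketch of this DOACROSS loop using pthreads follows. One thread is created per iteration, and a done[] flag array (an illustrative device, not part of the pthread API) records which A[i] values have been produced; pthread_cond_broadcast() is used because several iterations may be waiting at once. For simplicity the whole body runs under one mutex, which serializes this tiny example; in a real DOACROSS loop only the dependent part would be synchronized so the independent work can overlap.

```cpp
#include <pthread.h>
#define N 8

int A[N + 1], B[N + 1];
int done[N + 1];               // done[i] == 1 once A[i] has been written
pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t cv = PTHREAD_COND_INITIALIZER;

void *iteration(void *arg) {
    long i = (long)arg;
    pthread_mutex_lock(&m);
    while (!done[i - 1])           // wait until A[i-1] has been produced
        pthread_cond_wait(&cv, &m);
    A[i] = A[i - 1] + B[i];        // loop-carried dependence on A[i-1]
    done[i] = 1;
    pthread_cond_broadcast(&cv);   // wake any iteration waiting on us
    pthread_mutex_unlock(&m);
    return NULL;
}

int doacross() {
    for (int i = 0; i <= N; i++) { B[i] = 1; done[i] = 0; }
    A[0] = 0; done[0] = 1;         // iteration 0 is the base case
    pthread_t t[N + 1];
    for (long i = 1; i <= N; i++)
        pthread_create(&t[i], NULL, iteration, (void *)i);
    for (long i = 1; i <= N; i++)
        pthread_join(t[i], NULL);
    return A[N];                   // with every B[i] == 1, A[N] == N
}
```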
DOPIPE Parallelism
In order to exploit DOPIPE parallelism, condition variables are also needed to synchronize threads. Instead of distributing whole iterations across threads, each statement of the loop body is assigned to its own thread (i.e., statement 1 is executed by thread 1, statement 2 by thread 2, etc.), forming a pipeline. The loop-independent data dependences between these statements require the condition variables.
Here we use the pthreads differently from DOACROSS parallelism: each thread calls a different function, and each function may have a loop-independent dependence on another function. Say function 2 depends on function 1; then function 1 calls pthread_cond_signal() once it finishes an iteration, and function 2 calls pthread_cond_wait() before consuming that result.
The differences between DOPIPE and DOACROSS are that DOPIPE executes a different function on each thread and places the signal and wait calls in different functions, whereas DOACROSS executes the same function on each thread (just on different data) and places the signal and wait calls in the same function.
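The two-stage pipeline described above can be sketched as follows. stage1 produces A[i] and signals; stage2 waits until the element it needs is ready, then computes S[i] from it. The produced counter and the stage/function names are illustrative choices for this sketch.

```cpp
#include <pthread.h>
#define N 8

int A[N], S[N];
int produced = 0;              // number of A[i] values written by stage 1
pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t cv = PTHREAD_COND_INITIALIZER;

void *stage1(void *) {         // first pipeline stage: produces A[i]
    for (int i = 0; i < N; i++) {
        pthread_mutex_lock(&m);
        A[i] = i + 1;          // some per-iteration work
        produced = i + 1;
        pthread_cond_signal(&cv);   // tell stage 2 that A[i] is ready
        pthread_mutex_unlock(&m);
    }
    return NULL;
}

void *stage2(void *) {         // second stage: consumes A[i] once ready
    for (int i = 0; i < N; i++) {
        pthread_mutex_lock(&m);
        while (produced <= i)       // wait for stage 1 to produce A[i]
            pthread_cond_wait(&cv, &m);
        pthread_mutex_unlock(&m);
        S[i] = A[i] * 2;       // loop-independent dependence on A[i]
    }
    return NULL;
}

int dopipe() {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, stage1, NULL);
    pthread_create(&t2, NULL, stage2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    int sum = 0;
    for (int i = 0; i < N; i++) sum += S[i];
    return sum;                // 2 * (1 + 2 + ... + 8)
}
```

Note that the signal comes from stage1 and the wait sits in stage2, matching the DOPIPE pattern described above.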
DOALL Parallelism
Since DOALL parallelism means that all iterations can be executed in parallel and no dependences exist between them, no synchronization is required: all that is necessary is to create the threads and assign each a portion of the iterations.
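A sketch of a DOALL loop with pthreads: each thread squares a disjoint block of the array, with no locks or condition variables needed. The static block partitioning and the function names are illustrative assumptions.

```cpp
#include <pthread.h>
#define N 100
#define NUM_THREADS 4

int A[N], B[N];

void *chunk(void *arg) {           // each thread handles a disjoint block
    long t = (long)arg;
    int per = N / NUM_THREADS;
    for (int i = t * per; i < (t + 1) * per; i++)
        A[i] = B[i] * B[i];        // no cross-iteration dependence: DOALL
    return NULL;
}

int doall() {
    for (int i = 0; i < N; i++) B[i] = i;
    pthread_t t[NUM_THREADS];
    for (long i = 0; i < NUM_THREADS; i++)
        pthread_create(&t[i], NULL, chunk, (void *)i);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(t[i], NULL);  // only synchronization: wait for completion
    int sum = 0;
    for (int i = 0; i < N; i++) sum += A[i];
    return sum;                    // sum of squares 0..99
}
```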
Intel Threading Building Blocks
According to the Intel® Software site, Intel® Threading Building Blocks (Intel® TBB) is a C++ template library that abstracts threads into tasks to create reliable, portable, and scalable parallel applications.
Parallel Loops
A DOALL parallel construct can be specified using the parallel_for() template. It takes two parameters. The first is the range of loop indices that can be run in parallel (e.g., a blocked_range). The second is a body object that applies the loop's operations to any sub-range of indices; because sub-ranges are processed concurrently, the body must be safe to run on different sub-ranges at the same time. An optional third parameter, a partitioner, controls how the range is split into chunks and can carry cache-affinity information.
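A sketch of this construct, assuming TBB is installed and linked (-ltbb), in the body-object style of the TBB 2.x era covered by this article; the Square struct and tbb_doall name are illustrative:

```cpp
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>

// Body object: applies the loop operation to any sub-range of indices.
struct Square {
    int *a;
    void operator()(const tbb::blocked_range<size_t> &r) const {
        for (size_t i = r.begin(); i != r.end(); ++i)
            a[i] = (int)(i * i);   // independent iterations: DOALL
    }
};

int tbb_doall(int *a, size_t n) {
    Square body = { a };
    // First argument: the full index range; second: the body to apply.
    tbb::parallel_for(tbb::blocked_range<size_t>(0, n), body);
    int sum = 0;
    for (size_t i = 0; i < n; i++) sum += a[i];
    return sum;
}
```

TBB recursively splits the blocked_range and schedules the pieces as tasks, so the programmer specifies what may run in parallel rather than how many threads to use.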
OpenMP 2.0
What is OpenMP? OpenMP, or Open Multi-Processing, is a multi-platform API used for shared address space programming. There are versions of OpenMP for both C/C++ and Fortran. The OpenMP libraries provide a set of compiler directives that make it easy to write shared-memory parallel programs. Before explaining how to exploit the different types of parallelism supported by OpenMP 2.0, it is necessary to understand how threads are created.
Parallel Region
In OpenMP 2.0, threads are created with the C/C++ compiler directive #pragma omp parallel, followed by a block enclosed in curly brackets (see the references for the Fortran directive). This directive marks the start of a parallel region, and at the start of a parallel region a team of threads is created. The number of threads created in a parallel region can be specified by calling the function omp_set_num_threads(int n) beforehand.
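A minimal sketch of a parallel region is shown below. The _OPENMP guards are the standard idiom for code that should also compile without OpenMP support (where the pragmas are simply ignored and the region runs on one thread); the region_threads name is an illustrative choice.

```cpp
#include <cstdio>
#ifdef _OPENMP
#include <omp.h>               // OpenMP runtime routines
#endif

int region_threads() {
    int nthreads = 1;          // a serial build runs the region once
#ifdef _OPENMP
    omp_set_num_threads(4);    // request four threads per parallel region
#endif
    #pragma omp parallel       // start of the parallel region: threads fork
    {
#ifdef _OPENMP
        #pragma omp single     // one thread records the team size
        nthreads = omp_get_num_threads();
#endif
        std::printf("hello from inside the parallel region\n");
    }                          // end of region: implicit barrier, threads join
    return nthreads;
}
```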
Inside a parallel region, different compiler directives can be used to exploit different types of parallelism. The two types discussed below are DOALL and function parallelism. With OpenMP 2.0, DOACROSS and DOPIPE parallelism cannot be expressed through compiler directives.
DOALL parallelism
OpenMP exploits DOALL parallelism through a simple directive that tells the compiler to execute the iterations of a loop on multiple threads. The C/C++ directive is #pragma omp parallel for (see the references for the Fortran directive). Placed immediately before an ordinary sequential for loop, it distributes the loop's iterations across the threads so that they execute in parallel.
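A sketch of the directive, including the reduction clause for combining per-thread partial sums (the doall_sum name is illustrative); a serial build ignores the pragmas and produces the same result:

```cpp
#define N 100
int A[N];

int doall_sum() {
    int sum = 0;
    #pragma omp parallel for          // iterations divided among the threads
    for (int i = 0; i < N; i++)
        A[i] = i * i;                 // independent iterations: DOALL

    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += A[i];                  // each thread sums a private copy,
                                      // combined at the end of the loop
    return sum;
}
```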
Function parallelism
OpenMP 2.0 can also exploit function parallelism with compiler directives. The C/C++ directive is #pragma omp section (see the references for the Fortran directive), placed before a block of code that is to be executed by a single thread. For function parallelism, multiple data-independent code blocks are placed inside an enclosing #pragma omp sections region (itself inside a parallel region), with the section directive before each block.
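A sketch of function parallelism with sections; funcA, funcB, and function_parallel are illustrative names, and the combined #pragma omp parallel sections form is used for brevity:

```cpp
int funcA() { return 10; }         // two data-independent functions
int funcB() { return 20; }

int function_parallel() {
    int a = 0, b = 0;
    #pragma omp parallel sections  // each section may run on its own thread
    {
        #pragma omp section
        a = funcA();               // executed by one thread

        #pragma omp section
        b = funcB();               // executed by (possibly) another thread
    }                              // implicit barrier at the end
    return a + b;
}
```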
OpenMP 3.0
In May 2008, OpenMP version 3.0 was released as an upgrade adding features such as tasks and additional synchronization primitives. For the scope of this article, nothing significant changed regarding DOALL, DOACROSS, or DOPIPE parallelism, reduction, or function parallelism; therefore, see the sections above on OpenMP 2.0, since the discussion there applies equally to version 3.0.
References
- Yan Solihin, Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems, Solihin Books, August 2009.
- Intel® Corporation, "Intel® TBB - Intel® Software Network", http://software.intel.com/en-us/intel-tbb/
- Intel® Corporation, "Intel® Threading Building Blocks 2.2 for Open Source", http://www.threadingbuildingblocks.org/
- Mark Bull, University of Edinburgh, "OpenMP 3.0 Overview", http://www.compunity.org/futures/Mark_SC06BOF.pdf