CSC/ECE 506 Spring 2010/ch 3 jb/Parallel Programming Model Support
Parallel programming can reduce execution time relative to sequential programming by exploiting the structure of the code. In practice, several C/C++ libraries offer parallel programming support without requiring the programmer to learn a new language or programming model, including POSIX threads, Intel® Threading Building Blocks, and OpenMP. This article discusses how these libraries support parallel programming models related to loop structure, specifically DOALL, DOACROSS, and DOPIPE parallelism, reduction, and functional parallelism.
POSIX Threads
A POSIX thread, also referred to as a pthread, is used in shared address space architectures for parallel programs. Through the pthread API, various functions are available to create and manage pthreads. In order to fully understand how pthreads can be used to exploit DOACROSS, DOPIPE, and DOALL parallelism, a brief introduction to creating and terminating pthreads, mutexes, and condition variables is necessary.
Creating and Terminating Pthreads
In order to create a pthread, the API provides the pthread_create() function, which accepts four arguments: thread, attr, start_routine, and arg. The thread argument provides a unique identifier for the thread being created. The attr argument specifies a thread attributes object; passing NULL selects the default attributes, which are sufficient for the examples discussed here (for more information on setting thread attributes, please see the references). The start_routine argument is the program subroutine that will be executed by the thread being created. The arg argument passes a single argument to that subroutine (it can be set to NULL if no argument is needed).
In order to terminate a pthread, the API provides the pthread_exit() function; the running thread (even the main thread) simply calls this function. Note that there are alternate methods for terminating pthreads that are not discussed here, for simplicity.
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#define NUM_THREADS 5

void *Foo(void *threadid)
{
   long tid;
   tid = (long) threadid;
   /* ... work performed by the thread ... */
   pthread_exit(NULL);
}

int main(int argc, char *argv[])
{
   pthread_t threads[NUM_THREADS];
   int rc;
   long t;
   for (t = 0; t < NUM_THREADS; t++) {
      printf("In main: creating thread %ld\n", t);
      rc = pthread_create(&threads[t], NULL, Foo, (void *) t); // create pthread
      if (rc) {
         printf("ERROR; return code from pthread_create() is %d\n", rc);
         exit(-1);
      }
   }
   pthread_exit(NULL);
}
The code above shows a very simple example of how to create threads using pthreads. Notice the arguments passed to the pthread_create() function, and note their syntax. For completeness, the example code above shows a simple function being run by every thread.
Mutexes
A mutex (short for mutual exclusion) variable is a variable that only one thread may hold at a time. The API provides the pthread_mutex_t data type to create a mutex variable statically, and the pthread_mutex_init() function to create one dynamically.
Mutex variables are used to implement locks, so that multiple pthreads do not access critical data in a program at the same time. The API provides the pthread_mutex_lock() and pthread_mutex_unlock() functions, which simply lock or unlock the specified mutex variable.
Condition Variables
Condition variables allow for point-to-point synchronization between threads. The API provides two useful functions for synchronizing threads: pthread_cond_wait() and pthread_cond_signal(). pthread_cond_wait() blocks the calling thread until the specified condition is satisfied. pthread_cond_signal() wakes up another thread that is waiting on the condition. These are the pthread-specific functions analogous to the general wait() and post() functions discussed in the Solihin text.
DOACROSS Parallelism
In order to exploit DOACROSS parallelism using pthreads, condition variables are needed to synchronize the threads. Since iterations are executed in parallel and data dependences are assumed to exist across iterations (see the Solihin text), the condition variables shown above are used to ensure the correct execution of the code.
Let's take a simple example where each thread calculates A[i] = A[i-1] + B[i]. Point-to-point synchronization is necessary to make sure A[i-1] is not read before its value is written. This is where the pthread condition variable is useful. We place pthread_cond_wait() and pthread_cond_signal() around the instruction above, so that the current thread performs its computation only after the previous thread has signaled that it is complete.
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#define NUM_THREADS 50
#define NUM_ELEMENTS 50

typedef struct {
   double a[NUM_ELEMENTS];
   double b[NUM_ELEMENTS];
} DATA;

DATA data;                // ASSUME data struct is initialized
long turn = 0;            // index of the next iteration allowed to run
pthread_mutex_t mutexvar; // mutex variable required for condition wait and signal functions
pthread_cond_t condvar;   // condition variable required for condition wait and signal functions

void *Foo(void *threadid)
{
   long tid = (long) threadid;
   pthread_mutex_lock(&mutexvar);
   while (turn != tid)                         // wait until safe to continue
      pthread_cond_wait(&condvar, &mutexvar);
   if (tid > 0)                                // a[0] is already initialized
      data.a[tid] = data.a[tid - 1] + data.b[tid];
   turn++;                                     // this iteration is done
   pthread_cond_broadcast(&condvar);           // signal that it is safe to continue
   pthread_mutex_unlock(&mutexvar);
   pthread_exit(NULL);
}

int main(int argc, char *argv[])
{
   pthread_t threads[NUM_THREADS];
   int rc;
   long t;
   // ASSUME data struct is initialized
   // Initialize mutex and condition variable
   pthread_mutex_init(&mutexvar, NULL);
   pthread_cond_init(&condvar, NULL);
   for (t = 0; t < NUM_THREADS; t++) {
      printf("In main: creating thread %ld\n", t);
      rc = pthread_create(&threads[t], NULL, Foo, (void *) t); // create pthread
      if (rc) {
         printf("ERROR; return code from pthread_create() is %d\n", rc);
         exit(-1);
      }
   }
   pthread_exit(NULL);
}
DOPIPE Parallelism
In order to exploit DOPIPE parallelism, condition variables are also needed to synchronize threads. Instead of iterations being distributed across threads, each statement of the loop body is assigned to its own thread (i.e., statement 1 is executed by thread 1, statement 2 by thread 2, etc.). However, there are loop-independent data dependences between the statements, which require the condition variables.
Here pthreads are used differently than in DOACROSS parallelism: DOPIPE parallelism has each thread call a different function, and each function may have a loop-independent dependence on some other function. Let's say function 2 depends on function 1; then function 1 calls pthread_cond_signal() once it finishes each iteration, and function 2 calls pthread_cond_wait() before consuming the result.
The differences between DOPIPE and DOACROSS are that DOPIPE executes a different function on each thread and calls the signal and wait functions from different functions, whereas DOACROSS executes the same function on each thread (just on different data) and calls the signal and wait functions from the same function.
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#define NUM_THREADS 2
#define NUM_ELEMENTS 50

typedef struct {
   double a[NUM_ELEMENTS];
   double b[NUM_ELEMENTS];
   double c[NUM_ELEMENTS];
} DATA;

DATA data;                // ASSUME data struct is initialized
int produced = 0;         // number of elements of a[] that are ready
pthread_mutex_t mutexvar; // mutex variable required for condition wait and signal functions
pthread_cond_t condvar;   // condition variable required for condition wait and signal functions

// First pipeline stage: produces a[i]
void *Foo(void *arg)
{
   int i;
   for (i = 1; i < NUM_ELEMENTS; i++) {
      data.a[i] = data.a[i - 1] + data.b[i];
      pthread_mutex_lock(&mutexvar);
      produced = i;
      pthread_cond_signal(&condvar); // signal that a[i] is safe to read
      pthread_mutex_unlock(&mutexvar);
   }
   pthread_exit(NULL);
}

// Second pipeline stage: consumes a[i]
void *Bar(void *arg)
{
   int i;
   for (i = 1; i < NUM_ELEMENTS; i++) {
      pthread_mutex_lock(&mutexvar);
      while (produced < i)
         pthread_cond_wait(&condvar, &mutexvar); // wait until a[i] is ready
      pthread_mutex_unlock(&mutexvar);
      data.c[i] = data.a[i];
   }
   pthread_exit(NULL);
}

int main(int argc, char *argv[])
{
   pthread_t threads[NUM_THREADS];
   int rc;
   // ASSUME data struct is initialized
   // Initialize mutex and condition variable
   pthread_mutex_init(&mutexvar, NULL);
   pthread_cond_init(&condvar, NULL);
   rc = pthread_create(&threads[0], NULL, Foo, NULL); // create pthread that runs Foo
   if (rc) {
      printf("ERROR; return code from pthread_create() is %d\n", rc);
      exit(-1);
   }
   rc = pthread_create(&threads[1], NULL, Bar, NULL); // create pthread that runs Bar
   if (rc) {
      printf("ERROR; return code from pthread_create() is %d\n", rc);
      exit(-1);
   }
   pthread_exit(NULL);
}
DOALL Parallelism
Since DOALL parallelism means that all iterations are executed in parallel and no dependences exist, no synchronization is required: the threads simply have to be created, with each thread assigned its share of the iterations.
Intel Threading Building Blocks
According to the Intel® Software site, Intel® Threading Building Blocks (Intel® TBB) is a C++ template library that abstracts threads into tasks to create reliable, portable, and scalable parallel applications.
A goal of the library is similar to that of OpenMP, where the programmer is not required to have an extensive knowledge of thread programming or learn an entirely new language. Also, any compiler that supports ISO C++ can compile Intel® TBB code. However, as a drawback of its simplicity, certain types of parallelism such as DOACROSS and DOPIPE parallelism are not available for explicit usage here.
In order to use TBB, you must always insert the following code at the beginning of every file to include the TBB library and make its functions and variables available for use. In an attempt to simplify the examples below, this overhead will be omitted.
#include "tbb/tbb.h"
using namespace tbb;
DOALL Parallelism
A DOALL parallel loop can be specified using the parallel_for() construct, which takes two parameters. The first is the range of loop indices that can be run in parallel; for a DOALL loop, this range should cover all of the loop's iterations. The second is a body object whose operations can be processed as a unit and are safe to run concurrently over any subrange of those indices. An optional third parameter can be specified to define the chunk size of the loop and to provide cache-affinity information.
// sequential loop
for(i = 0; i < n; i++) {
   a[i] = b[i] + c[i];
}
A certain amount of overhead is necessary when creating a DOALL loop. A class must be defined with a public method named operator() taking a parameter of the form "const blocked_range<size_t>&", and a constructor must also be provided so that parallel_for can operate properly. This is further illustrated by the example DOALL loop below, which is a parallel version of the sequential loop above.
// parallel loop using Intel® TBB (a, b, c, and n are assumed to be globals)
class SimpleLoop {
public:
   void operator()(const blocked_range<size_t>& r) const {
      for(size_t i = r.begin(); i != r.end(); i++)
         a[i] = b[i] + c[i];
   }
   SimpleLoop() {}
};

int main(void) {
   parallel_for(blocked_range<size_t>(0, n), SimpleLoop());
   return 0;
}
OpenMP 2.0
What is OpenMP? OpenMP, or Open Multi-Processing, is a multi-platform API used for shared address space programming. There are versions of OpenMP for both C/C++ and Fortran. The OpenMP libraries provide a set of compiler directives that allow one to easily write shared memory parallel programs. Before explaining how to exploit the different types of parallelism supported by OpenMP 2.0, it is necessary to understand how to create threads.
Parallel Region
In OpenMP 2.0, threads are created using the C/C++ compiler directive #pragma omp parallel, which applies to the block enclosed in curly brackets that follows it (see the references for the Fortran directive). This directive marks the start of a parallel region, and threads are created at the start of the region. To specify the number of threads created in a parallel region, the function omp_set_num_threads(int n) is called before the region.
Inside a parallel region, different compiler directives can be used to exploit different types of parallelism. The two types discussed below are DOALL and function parallelism. With OpenMP 2.0, DOACROSS and DOPIPE parallelism cannot be expressed through compiler directives.
DOALL parallelism
OpenMP exploits DOALL parallelism through a simple directive that tells the compiler to execute a loop on multiple threads. The C/C++ directive is #pragma omp parallel for (see the references for the Fortran directive). If placed immediately before an ordinary sequential for loop in C/C++, this directive distributes the iterations of the loop among the threads so that they execute in parallel.
Function parallelism
OpenMP 2.0 can also exploit function parallelism with compiler directives. The C/C++ directive is #pragma omp section, which must appear inside a #pragma omp sections region (see the references for the Fortran directive). Each section directive is placed before a block of code that is to be executed by a single thread. For function parallelism, multiple data-independent code blocks are placed inside a parallel sections region, with a section directive before each block.
OpenMP 3.0
In May 2008, OpenMP version 3.0 was released as an upgrade adding features such as tasks and additional synchronization primitives. For the scope of this article, nothing significant changed regarding DOALL, DOACROSS, or DOPIPE parallelism, reduction, or function parallelism; therefore, the discussion of OpenMP 2.0 above applies equally to version 3.0.
References
- Yan Solihin, Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems, Solihin Books, August 2009.
- Intel® Corporation, "Intel® TBB - Intel® Software Network", http://software.intel.com/en-us/intel-tbb/
- Intel® Corporation, "Intel® Threading Building Blocks 2.2 for Open Source", http://www.threadingbuildingblocks.org/
- Mark Bull, University of Edinburgh, "OpenMP 3.0 Overview", http://www.compunity.org/futures/Mark_SC06BOF.pdf