CSC/ECE 506 Spring 2010/ch 3 jb/Parallel Programming Model Support


Parallel programming reduces execution time relative to sequential programming by exploiting structure in the code that allows work to proceed concurrently. Parallelism is generally easiest to find by examining the blocks of code that take the longest to execute, and while it can take several different forms, it is most often found in loop structures. The focus of this article is the implementation of parallel programming models related to loop structure, specifically DOALL, DOACROSS, and DOPIPE parallelism, reduction, and functional parallelism.

In practice, there are various C/C++ code libraries that offer parallel programming support without requiring the programmer to learn a new language or programming model. These include POSIX threads, Intel® Threading Building Blocks, and OpenMP. Each of these is discussed in detail below, with references to outside material for more comprehensive information, if desired.


POSIX Threads

A POSIX thread, also referred to as a pthread, is used in shared address space architectures for parallel programs. Through the pthread API, various functions are available to create and manage pthreads. In order to fully understand how pthreads can be used to exploit DOACROSS, DOPIPE, and DOALL parallelism, a brief introduction to creating and terminating pthreads, mutexes, and condition variables is necessary.

Creating and Terminating Pthreads

In order to create a pthread, the API provides the pthread_create() function. The pthread_create() function accepts four arguments: thread, attr, start_routine, and arg. The thread argument provides a unique identifier for the thread being created. The attr argument specifies a thread attribute object, or the default attributes if NULL is passed. For the examples discussed here, the default attributes are sufficient; for more information on setting thread attributes, please see the references. The start_routine argument is the program subroutine that will be executed by the newly created thread. The arg argument passes a single argument to that subroutine (it can be set to NULL if no argument is being passed).

In order to terminate a pthread, the API provides the pthread_exit() function. A thread (including the main thread) simply calls this function when it is finished. NOTE: there are alternate methods for terminating pthreads that are not discussed here for simplicity.

 #include <pthread.h>
 #include <stdio.h>
 #include <stdlib.h>   // for exit()
 #define NUM_THREADS     5

 void *Foo(void *threadid)
 {
    long tid;
    tid = (long)threadid;
    /* ... work performed by thread tid ... */
    pthread_exit(NULL);
 }

 int main (int argc, char *argv[])
 {
    pthread_t threads[NUM_THREADS];
    int rc;
    long t;
    for(t=0; t<NUM_THREADS; t++){
       printf("In main: creating thread %ld\n", t);
       rc = pthread_create(&threads[t], NULL, Foo, (void *)t);  // create pthread
       if (rc){
          printf("ERROR; return code from pthread_create() is %d\n", rc);
          exit(-1);
       }
    }
    pthread_exit(NULL);
 }

The code above shows a very simple example of how to create threads using pthreads. Notice the arguments passed to pthread_create() and their syntax. For completeness, the example shows a simple function, Foo(), being run by every thread.

Mutexes

A mutex (short for mutual exclusion) variable is a lock that can be held by only one thread at a time. The API provides the pthread_mutex_t data type to declare a mutex variable and the pthread_mutex_init() function to initialize it dynamically (a mutex may also be initialized statically with PTHREAD_MUTEX_INITIALIZER).

Mutex variables are used to implement locks so that multiple pthreads do not access critical data in a program at the same time. The API provides the pthread_mutex_lock() and pthread_mutex_unlock() functions, which simply lock or unlock the specified mutex variable.
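As a minimal sketch of this usage (the shared counter and the function name increment() are illustrative only, not part of the pthread API), each thread locks the mutex around its critical section:

 #include <pthread.h>

 pthread_mutex_t count_mutex = PTHREAD_MUTEX_INITIALIZER;  // statically initialized mutex
 long shared_count = 0;                                    // data shared by all threads

 void *increment(void *arg)
 {
    pthread_mutex_lock(&count_mutex);    // only one thread at a time past this point
    shared_count++;                      // critical section: update the shared data
    pthread_mutex_unlock(&count_mutex);  // release the lock so other threads may enter
    pthread_exit(NULL);
 }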

Condition Variables

Condition variables allow for point-to-point synchronization between threads. The API provides two useful functions for synchronizing threads: pthread_cond_wait() and pthread_cond_signal(). pthread_cond_wait() blocks the calling thread until the specified condition variable is signaled, and pthread_cond_signal() wakes up a thread that is waiting on that condition variable. These are the pthread-specific functions analogous to the general wait() and post() primitives discussed in the Solihin text. A condition variable must always be used together with a mutex, and the waiting thread should re-check its condition in a loop, because a wakeup can arrive before the condition it is waiting for actually holds.
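A minimal sketch of that pattern is shown below; the names waiter, signaler, and ready are illustrative only:

 #include <pthread.h>

 pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
 pthread_cond_t  c = PTHREAD_COND_INITIALIZER;
 int ready = 0;                        /* the condition the waiter is interested in */

 void *waiter(void *arg)
 {
    pthread_mutex_lock(&m);
    while (!ready)                     /* re-check: the wakeup may arrive before the condition holds */
       pthread_cond_wait(&c, &m);      /* atomically releases m and sleeps */
    /* ... use the data guarded by the condition ... */
    pthread_mutex_unlock(&m);
    pthread_exit(NULL);
 }

 void *signaler(void *arg)
 {
    pthread_mutex_lock(&m);
    ready = 1;                         /* make the condition true */
    pthread_cond_signal(&c);           /* wake a waiting thread */
    pthread_mutex_unlock(&m);
    pthread_exit(NULL);
 }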

DOACROSS Parallelism

In order to exploit DOACROSS parallelism using pthreads, synchronization is needed between the threads. Since the iterations of the loop are distributed across threads and data dependences exist between iterations (see the Solihin text), the condition variables introduced above are used to ensure that the code executes correctly.

Let's take a simple example where each thread calculates A[i] = A[i-1] + B[i]. Point-to-point synchronization is necessary to make sure A[i-1] is not read before its value has been written. This is where the pthread condition variable is useful: the thread computing A[i] waits with pthread_cond_wait(), guarded by a mutex and a shared counter of completed iterations, until the previous iteration has finished. It then performs its own computation and wakes the waiting threads (pthread_cond_broadcast() is used in the code below because several threads may be waiting on the same condition variable).

 #include <pthread.h>
 #include <stdio.h>
 #include <stdlib.h>
 #define NUM_THREADS     50
 #define NUM_ELEMENTS    50
  
 typedef struct 
 {
    double      a[NUM_ELEMENTS];
    double      b[NUM_ELEMENTS];
 } DATA;
 
 DATA data;                //shared data; assumed to be initialized before the threads start
 
 pthread_mutex_t mutexvar; //mutex required by the condition wait and signal functions
 pthread_cond_t condvar;   //condition variable required for point-to-point synchronization
 long ready = 1;           //a[0] .. a[ready-1] have already been produced
 
 void *Foo(void *threadid)
 {
    long tid = (long) threadid;              //thread tid computes a[tid]
    pthread_mutex_lock(&mutexvar);
    while (ready < tid)                      //wait until a[tid-1] has been written
       pthread_cond_wait(&condvar, &mutexvar);
    data.a[tid] = data.a[tid - 1] + data.b[tid];
    ready = tid + 1;                         //a[tid] is now available
    pthread_cond_broadcast(&condvar);        //signal that it is safe to continue
    pthread_mutex_unlock(&mutexvar);
    pthread_exit(NULL);
 }

 int main (int argc, char *argv[])
 {
    pthread_t threads[NUM_THREADS];
    int rc;
    long t;
  
    //ASSUME the data struct is initialized (a[0] and all of b[] have values)
  
    //Initialize mutex and condition variable
    pthread_mutex_init(&mutexvar, NULL);
    pthread_cond_init (&condvar, NULL);
  
    for(t=1; t<NUM_THREADS; t++){
       printf("In main: creating thread %ld\n", t);
       rc = pthread_create(&threads[t], NULL, Foo, (void *)t);  // create pthread that computes a[t]
       if (rc){
          printf("ERROR; return code from pthread_create() is %d\n", rc);
          exit(-1);
       }
    }
    pthread_exit(NULL);
 }

The code block above shows how to use a condition variable, together with a mutex and a shared counter, to exploit DOACROSS parallelism: each thread waits until the element it depends on has been produced, then computes and publishes its own element.

DOPIPE Parallelism

In order to exploit DOPIPE parallelism, synchronization is also needed between threads. Instead of distributing iterations across threads, the statements of the loop body are distributed across threads (i.e. statement 1 is executed by thread 1, statement 2 by thread 2, etc.), and each thread executes its statement for every iteration. However, there are loop-independent data dependences between the statements, which require the condition variables for synchronization.

Here the pthreads are used differently than in the DOACROSS case. With DOPIPE parallelism, each thread calls a different function, and each function may have a loop-independent dependence on another function. Let's say function 2 depends on function 1: function 1 calls pthread_cond_signal() once it finishes an iteration, and function 2 calls pthread_cond_wait() before consuming that iteration's result.

The differences between DOPIPE and DOACROSS are that DOPIPE executes a different function on each thread and calls the signal and wait functions from different functions, whereas DOACROSS executes the same function on each thread (just on different data) and calls the signal and wait functions from the same function.

 #include <pthread.h>
 #include <stdio.h>
 #include <stdlib.h>
 #define NUM_THREADS     2
 #define NUM_ELEMENTS    50
  
 typedef struct 
 {
    double      a[NUM_ELEMENTS];
    double      b[NUM_ELEMENTS];
    double      c[NUM_ELEMENTS];
 } DATA;
 
 DATA data;                //shared data; assumed to be initialized before the threads start
 
 pthread_mutex_t mutexvar; //mutex required by the condition wait and signal functions
 pthread_cond_t condvar;   //condition variable required for producer/consumer synchronization
 int produced = 0;         //a[1] .. a[produced] have been written by the first stage
 
 //first pipeline stage: produces a[i]
 void *Foo(void *arg)
 {
    int i;
    for(i = 1; i < NUM_ELEMENTS; i++)
    {
       data.a[i] = data.a[i - 1] + data.b[i];
       pthread_mutex_lock(&mutexvar);
       produced = i;                          //a[i] is now available
       pthread_cond_signal(&condvar);         //signal that it is safe to continue
       pthread_mutex_unlock(&mutexvar);
    }
    pthread_exit(NULL);
 }
  
 //second pipeline stage: consumes a[i]
 void *Bar(void *arg)
 {
    int i;
    for(i = 1; i < NUM_ELEMENTS; i++)
    {
       pthread_mutex_lock(&mutexvar);
       while (produced < i)                   //wait until a[i] has been produced
          pthread_cond_wait(&condvar, &mutexvar);
       pthread_mutex_unlock(&mutexvar);
       data.c[i] = data.a[i];
    }
    pthread_exit(NULL);
 }
 
 int main (int argc, char *argv[])
 {
    pthread_t threads[NUM_THREADS];
    int rc;
  
    //ASSUME the data struct is initialized
  
    //Initialize mutex and condition variable
    pthread_mutex_init(&mutexvar, NULL);
    pthread_cond_init (&condvar, NULL);
  
    printf("In main: creating pipeline threads\n");
    rc = pthread_create(&threads[0], NULL, Foo, NULL);  // create pthread that runs Foo
    if (rc){
       printf("ERROR; return code from pthread_create() is %d\n", rc);
       exit(-1);
    }
    rc = pthread_create(&threads[1], NULL, Bar, NULL);  // create pthread that runs Bar
    if (rc){
       printf("ERROR; return code from pthread_create() is %d\n", rc);
       exit(-1);
    }
    pthread_exit(NULL);
 }

DOALL Parallelism

Since DOALL parallelism means that all iterations can be executed in parallel and no dependences exist between them, all that is necessary is to create the threads and have each thread execute the same function on its own portion of the data. The pthread creation example near the top of this article already exploits DOALL parallelism in this sense; a sketch that divides an array among the threads is shown below.
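As a minimal sketch (the array names and chunking logic are illustrative assumptions, not part of the pthread API), each thread below computes its own contiguous chunk of a[i] = b[i] + c[i]:

 #include <pthread.h>
 #define NUM_THREADS   4
 #define NUM_ELEMENTS  1000

 double a[NUM_ELEMENTS], b[NUM_ELEMENTS], c[NUM_ELEMENTS];  // assumed to be initialized

 void *AddChunk(void *threadid)
 {
    long tid   = (long) threadid;
    long chunk = NUM_ELEMENTS / NUM_THREADS;
    long start = tid * chunk;
    long end   = (tid == NUM_THREADS - 1) ? NUM_ELEMENTS : start + chunk;
    long i;
    for (i = start; i < end; i++)   // no dependences: each thread works on its own chunk
       a[i] = b[i] + c[i];
    pthread_exit(NULL);
 }

A main() identical to the thread-creation example above, passing t as the argument, would launch NUM_THREADS such threads with no further synchronization required.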

Intel Threading Building Blocks

According to the Intel® Software site, Intel® Threading Building Blocks (Intel® TBB) is a C++ template library that abstracts threads to tasks to create reliable, portable, and scalable parallel applications.

A goal of the library is similar to that of OpenMP: the programmer is not required to have extensive knowledge of thread programming or to learn an entirely new language. Instead, they learn how to use the API of the TBB library with new or existing C++ code. Also, any compiler that supports ISO C++ can compile Intel® TBB code; the program simply needs to be linked against the TBB library. However, as a drawback of its simplicity, certain types of parallelism, such as DOACROSS parallelism, are not available for explicit use through the library. For more comprehensive information than is covered in this discussion, please see Intel®'s open source site.

Overhead

In order to use TBB, the following lines must be inserted at the beginning of every source file to include the TBB library and make its functions and classes available for use. To keep the examples below short, this overhead is omitted from them.

#include "tbb/tbb.h"

using namespace tbb;

DOALL Parallelism

A DOALL parallel loop can be specified using the parallel_for() construct. This function takes two parameters. The first is the range of loop indices that can be run in parallel. The second is a body object whose operator() performs the loop operations on a chunk of that range; the operations must be safe to run concurrently on different chunks. A simple sequential loop is shown below, followed by the same code executed with parallel iterations using DOALL parallelism. This example is adapted from the Intel® TBB Tutorial on pages 10 and 11.

// sequential loop
for(size_t i = 0; i < n; i++) {
  a[i] = b[i] + 5;
}

A certain amount of overhead is necessary when creating a loop using DOALL parallelism in TBB. First of all, a class must be created with a public member function called operator(). This function takes a parameter of the form "const blocked_range<size_t>&", and a constructor must also be created for any arguments that are passed, so that parallel_for can operate properly. This is further explained by the example below, which is a parallel version of the sequential loop above.

// parallel loop using Intel® TBB
class SimpleLoop {
  float *const my_a;   // local copies of constructor arguments
  float *const my_b;
public:
  // operator() describes the parallel action taken on a chunk of iterations
  void operator()(const blocked_range<size_t>& r) const {
    float *a = my_a;
    float *b = my_b;
    for(size_t i = r.begin(); i != r.end(); i++)
      a[i] = b[i] + 5;
  }
  SimpleLoop(float *a, float *b) :
    my_a(a), my_b(b)   // set values of local vars to arguments
  {}
};

int main() {
  const size_t n = 1000;
  float a[n], b[n];
  ...
  ...
  // calls SimpleLoop::operator() on subranges of [0, n)
  parallel_for(blocked_range<size_t>(0,n), SimpleLoop(a, b));
  return 0;
}

DOPIPE Parallelism

Exploiting DOPIPE parallelism is more complicated than DOALL parallelism in TBB. The pipeline and filter classes are used to implement the pipeline pattern.

// sequential loop amenable to DOPIPE parallelism
for(size_t i = 0; i < n; i++) {
  a[i] = a[i-1] + 5.3;
  b[i] = a[i] + 2.7;
}

In the DOPIPE code below, each section of code that can be separated into a pipeline stage is defined in a different filter class. For the sequential example, there are two separate statements that can be executed as a pipeline, so two filter classes are created. In order to run them, a pipeline instance is created and the filter instances are added to it. The pipeline is then run, with the argument to run() specifying the maximum number of iterations ("tokens") allowed in flight at once. Notice that, as in the previous TBB examples, the operator() function is what actually performs the code execution; here each call processes one item and passes its result on to the next filter. This example is adapted from the Intel® TBB Tutorial on pages 25 - 29.

// first pipeline stage: produces a[i] = a[i-1] + 5.3 and passes it downstream
class Filter1: public filter {
  float* my_a;    // local copy of constructor argument
  size_t my_i;    // next iteration to produce
  size_t my_n;    // total number of iterations
public:
  // operator() describes the pipelined action taken by this stage
  void* operator()(void* /*item*/) {
    if(my_i == my_n) return NULL;        // no more iterations: stop the pipeline
    my_a[my_i] = my_a[my_i - 1] + 5.3f;
    return &my_a[my_i++];                // hand the produced element to the next stage
  }

  Filter1(float *a, size_t n) :
    filter(serial_in_order),  // execute the actions of this filter in order
    my_a(a), my_i(1), my_n(n) // set values of local vars to arguments
  {}
};

// second pipeline stage: consumes a[i] and produces b[i] = a[i] + 2.7
class Filter2: public filter {
  float* my_b;    // local copy of constructor argument
  size_t my_i;    // next iteration to consume
public:
  // operator() describes the pipelined action taken by this stage
  void* operator()(void* item) {
    float a_i = *static_cast<float*>(item);   // element produced by Filter1
    my_b[my_i++] = a_i + 2.7f;
    return NULL;                              // last stage: nothing to pass on
  }

  Filter2(float *b) :
    filter(serial_in_order),  // execute the actions of this filter in order
    my_b(b), my_i(1)          // set values of local vars to arguments
  {}
};

// pipeline creation and execution
int main() {
  const size_t n = 1000;
  float a[n], b[n];
  ...
  pipeline pipe;            // the pipeline that will own the filters

  // first pipeline stage
  Filter1 filter1(a, n);
  pipe.add_filter(filter1);

  // second pipeline stage
  Filter2 filter2(b);
  pipe.add_filter(filter2);

  // execute the pipelined operations, allowing up to 4 iterations in flight
  pipe.run(4);

  pipe.clear();
  return 0;
}

Reduction

Reduction is a way of removing conflicts on a shared accumulator in parallel code: each thread computes a partial result, and the partial results are combined at the end. Reduction can be performed for operations such as addition and multiplication, as well as logical operators such as AND and OR. In TBB, reduction is specified using the parallel_reduce() construct. Like parallel_for() above, this function takes two parameters. The first is the range of loop indices whose operation is to be reduced in parallel. The second is a body object whose operations can be processed as a unit and are safe to run concurrently. The code example in this section computes the sum of the elements of an array. The sequential version is shown below.

// sequential loop summation
float sum = 0;
for(size_t i = 0; i < n; i++)
  sum += a[i];

Instead of using a single sum variable, in TBB partial sums are calculated in parallel and then summed together to form a total sum at the end. This example is adapted from the Intel® TBB Tutorial on pages 18-20.

// parallel reduction using Intel® TBB
class SimpleSum {
  float *my_a;    // local copy of constructor argument
public:
  float my_sum;   // partial sum computed by this body
  // operator() accumulates the partial sum over a chunk of iterations
  void operator()(const blocked_range<size_t>& r) {
    float *a = my_a;
    float sum = my_sum;
    size_t end = r.end();
    for(size_t i = r.begin(); i != end; i++)
      sum += a[i];
    my_sum = sum;
  }

  // splitting constructor: creates a body for a new chunk with an empty partial sum
  SimpleSum(SimpleSum& x, split) : my_a(x.my_a), my_sum(0) {}

  // join performs the final summation of the individual partial sums
  void join(const SimpleSum& y) {
    my_sum += y.my_sum;
  }

  SimpleSum(float a[]) :
    my_a(a), my_sum(0)  // set values of local vars to arguments
  {}
};

int main() {
  const size_t n = 1000;
  float a[n], sum;
  ...
  ...
  SimpleSum ss(a);
  // perform the sum by reduction over the range [0, n)
  parallel_reduce(blocked_range<size_t>(0,n), ss);
  sum = ss.my_sum;
  return 0;
}

In the version of the code using parallel_reduce(), there are a few changes to the format used in the parallel_for() example above that are slightly more complex. A class must be created to perform the parallel operations as before, but this time the splitting and joining operations of the reduction must also be specified. The splitting is done by the splitting constructor (the constructor taking a split argument), and the joining is done by the join() member function of the class.

Flexibility

TBB also boasts the ability to combine its package with other threading packages, such as OpenMP and POSIX threads. This gives the programmer the flexibility to mix and match packages in order to obtain, for example, the functional parallelism benefits of OpenMP 2.0 while still using reduction in TBB.

OpenMP 2.0

What is OpenMP? OpenMP, or Open Multi-Processing, is a multi-platform API used for shared address space programming. There are versions of OpenMP for both C/C++ and Fortran. OpenMP provides a set of compiler directives that make it easy to write shared memory parallel programs. Before explaining how to exploit the different types of parallelism supported by OpenMP 2.0, it is necessary to understand how threads are created. Section 3.7 on page 60 of the Solihin text has an overview of OpenMP 2.0 that can be used to supplement the information covered here.

Parallel Region

In OpenMP 2.0, threads are created with the C/C++ compiler directive #pragma omp parallel, whose body is enclosed in curly brackets (see the references for the Fortran directive). This directive marks the start of a parallel region, and at the start of a parallel region a team of threads is created. To specify the number of threads created in a parallel region, the function omp_set_num_threads(int n) is called before the region begins, as sketched below.
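A minimal sketch of setting the thread count (the value 4 is an arbitrary illustrative choice; the fragment would sit inside a function, as in the full example further down):

 #include <omp.h>
 ...
 omp_set_num_threads(4);      /* request four threads for subsequent parallel regions */
 #pragma omp parallel
 {
    /* this block is executed by each of the four threads */
 }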

Inside a parallel region, different compiler directives can be used to exploit different types of parallelism. The two types discussed below are DOALL and functional parallelism. With OpenMP 2.0, DOACROSS and DOPIPE parallelism cannot be expressed through compiler directives. See the "Did you know?" box in Chapter 3 of the Solihin text, page 63, for more information about DOACROSS and DOPIPE parallelism.

#include <omp.h>
#include <stdio.h>

int main ()  {

int nthreads, tid;

/* Fork a team of threads with each thread having a private tid variable */
#pragma omp parallel private(tid)
 {

 /* Obtain and print thread id */
  tid = omp_get_thread_num();
  printf("Hello World from thread = %d\n", tid);

   /* Only master thread does this */
   if (tid == 0) 
    {
    nthreads = omp_get_num_threads();
    printf("Number of threads = %d\n", nthreads);
    }
 
  }  /* All threads join master thread and terminate */
 
 return 0;
}

DOALL parallelism

OpenMP exploits DOALL parallelism through a simple directive that tells the compiler to execute the iterations of a loop on multiple threads. The C/C++ directive is #pragma omp parallel for (see the references for the Fortran directive). When placed before an ordinary sequential for loop in C/C++, this directive causes the iterations of the loop to be divided among the threads and executed in parallel.

 #include <omp.h>
 ....
 #pragma omp parallel for default(shared)  // DOALL: the iterations are divided among the threads
 for(i=0; i<n; i++)
    A[i] = B[i]+C[i];

The simple code block above shows how OpenMP 2.0 creates a parallel loop that exploits DOALL parallelism. Notice how much this resembles sequential programming: if the compiler directive were removed, the code could be run on a sequential machine.

Functional parallelism

OpenMP 2.0 can also exploit functional parallelism with compiler directives. The C/C++ directives are #pragma omp sections, which encloses the work-sharing region, and #pragma omp section, which is placed before each block of code that is to be executed by a single thread (see the references for the Fortran directives). For functional parallelism, multiple data-independent code blocks are placed inside a sections region within a parallel region, with a section directive before each block.

 #include <omp.h>
 ...
 int i, j;
 #pragma omp parallel private(i, j)
 {
    #pragma omp sections
    {
       #pragma omp section
       for(i=0; i<n; i++)
          A[i] = A[i] + B[i];

       #pragma omp section
       for(j=1; j<n; j++)
          C[j] = C[j-1] + 5;
    }
 }

The code block above shows how to exploit functional parallelism. Each section directive tells the compiler that the following block is to be executed by a single thread. The code above is broken into two sections, which means the two for loops can be handled by two different threads.

OpenMP 3.0

In May 2008, OpenMP version 3.0 was released as an upgrade with new features such as tasks and additional synchronization primitives. Within the scope of this article, nothing significant changed regarding the expression of DOALL, DOACROSS, and DOPIPE parallelism, reduction, or functional parallelism. Therefore, the sections above about OpenMP 2.0 are also applicable to version 3.0.

Summary

A brief comparison of the code libraries discussed in this article is given in the table below. As the table shows, none of the libraries explicitly supports all of the types of parallelism discussed.

 Type of Parallelism      POSIX Threads   Intel® TBB   OpenMP 2.0   OpenMP 3.0
 DOALL                    Yes             Yes          Yes          Yes
 DOACROSS                 Yes             No           No           No
 DOPIPE                   Yes             Yes          No           No
 Reduction                No              Yes          No           No
 Functional Parallelism   No              No           Yes          Yes

References

  • Yan Solihin, Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems, Solihin Books, August 2009.