CSC/ECE 506 Spring 2010/ch 2 maf: Difference between revisions

From Expertiza_Wiki
Jump to navigation Jump to search
Line 39: Line 39:
     send_msg(P0, local_sum);
     send_msg(P0, local_sum);


In the code above, the three 8 element arrays are each divided into two 4 element chunks.  In the data parallel section, the code executed by the two threads is identical, but each thread operates on a different chunk of data.
In the code above, the three 8 element arrays are each divided into two 4 element chunks.  In the data parallel section, the code executed by the two processes is identical, but each process operates on a different chunk of data.


Hillis (1986) points out that a major benefit of data parallel algorithms is that they easily scale to take advantage of additional processors simply by dividing the data into smaller chunks.  Haveraaen (2000) also notes that data parallel codes typically bear a strong resemblance to sequential codes, making them easier to read and write.  Comparison of the data parallel section of code identified above with the sequential Code 2.3 of Solihin (2008), which is reproduced below, supports this assertion.  The only differences between the two codes are the start and end indices and that, in the data parallel example, the variable sum is replaced by a private variable.  Structurally the two codes are identical.
Hillis (1986) points out that a major benefit of data parallel algorithms is that they easily scale to take advantage of additional processing elements simply by dividing the data into smaller chunks.  Haveraaen (2000) also notes that data parallel codes typically bear a strong resemblance to sequential codes, making them easier to read and write.  Comparison of the data parallel section of code identified above with the sequential Code 2.3 of Solihin (2008), which is reproduced below, supports this assertion.  The only differences between the two codes are the start and end indices and that, in the data parallel example, the variable sum is replaced by a private variable.  Structurally the two codes are identical.


  // Sequential code from Solihin 2008, page 25.
  // Sequential code from Solihin 2008, page 25.
Line 79: Line 79:
  }
  }


In the code above, work is divided into two parallel tasks.  The first performs the element-wise addition of arrays ''b'' and ''c'' and stores the result in ''a.''  The other sums the elements of ''a.''  These tasks both operate on all elements of ''a'' (rather than on separate chunks), and the code executed in each thread is different (rather than identical).
In the code above, work is divided into two parallel tasks.  The first performs the element-wise addition of arrays ''b'' and ''c'' and stores the result in ''a.''  The other sums the elements of ''a.''  These tasks both operate on all elements of ''a'' (rather than on separate chunks), and the code executed by each process is different (rather than identical).


Since each parallel task is unique, a major limitation of task parallel algorithms is that the maximum degree of parallelism attainable is limited to the number of tasks that have been formulated.  This is in contrast to data parallel algorithms, for which the number of parallel tasks may be increased by partitioning the data into a greater number of smaller subsets.  Haveraaen (2000) also notes that task parallel algorithms are inherently more complex, requiring more elaborate methods of communication and synchronization.  Haveraaen (2000) claims that data parallel algorithms on the other hand appear much like sequential algorithms and are therefore easier to write and to read.
Since each parallel task is unique, a major limitation of task parallel algorithms is that the maximum degree of parallelism attainable is limited to the number of tasks that have been formulated.  This is in contrast to data parallel algorithms, which can be scaled easily to take advantage of an arbitrary number of processing elements.  Haveraaen (2000) also notes that task parallel algorithms are inherently more complex, requiring more elaborate methods of communication and synchronization.  In the task parallel code above, after thread 0 computes an element of ''a'' it must send it to thread 1.  To support this, sends and receives occur every iteration of the two loops, resulting in a total of 8 messages being sent between the processes.  In contrast, the data parallel code sends only 2 messages, one at the beginning and one at the end.


The shared memory and message passing models do not, in general, impose constraints on what kind of work is to be performed by each task, and the data parallel model does not necessarily require a particular methodology for controlling access to common data.
The shared memory and message passing models do not, in general, impose constraints on what kind of work is to be performed by each task, and the data parallel model does not necessarily require a particular methodology for controlling access to common data.

Revision as of 05:42, 29 January 2010

Supplement to Chapter 2: The Data Parallel Programming Model

Chapter 2 of Solihin (2008) covers the shared memory and message passing parallel programming models. However, it does not address the data parallel model, another commonly recognized parallel programming model covered in other treatments like Foster (1995) and Culler (1999). The goal of this supplement is to provide a treatment of the data parallel model which complements Chapter 2 of Solihin (2008). The task parallel model will also be introduced briefly as a point of contrast.

Overview

Whereas the shared memory and message passing models focus on how parallel tasks access common data, the data parallel model focuses on how to divide up work into parallel tasks. Data parallel algorithms exploit parallelism by dividing a problem into a number of identical tasks which execute on different subsets of common data. An example of a data parallel code can be seen in Code 2.5 from Solohin (2008) which is reproduced below. It has been annotated with comments identifying the region of the code which is data parallel.

// Message passing code, adapted from Solohin (2008), p. 27.

id = getmyid(); // Assume id = 0 for thread 0, id = 1 for thread 1
local_iter = 4;
start_iter = id * local_iter;
end_iter = start_iter + local_iter;

if (id == 0)
    send_msg(P1, b[4..7], c[4..7]);
else
    recv_msg(P0, b[4..7], c[4..7]);

// Begin data parallel section

for (i = start_iter; i < end_iter; i++)
    a[i] = b[i] + c[i];
local_sum = 0;
for (i = start_iter; i < end_iter; i++)
    if (a[i] > 0)
        local_sum = local_sum + a[i];

// End data parallel section

if (id == 0)
{
    recv_msg(P1, &local_sum1);
    sum = local_sum + local_sum1;
    Print sum;
}
else
    send_msg(P0, local_sum);

In the code above, the three 8 element arrays are each divided into two 4 element chunks. In the data parallel section, the code executed by the two processes is identical, but each process operates on a different chunk of data.

Hillis (1986) points out that a major benefit of data parallel algorithms is that they easily scale to take advantage of additional processing elements simply by dividing the data into smaller chunks. Haveraaen (2000) also notes that data parallel codes typically bear a strong resemblance to sequential codes, making them easier to read and write. Comparison of the data parallel section of code identified above with the sequential Code 2.3 of Solihin (2008), which is reproduced below, supports this assertion. The only differences between the two codes are the start and end indices and that, in the data parallel example, the variable sum is replaced by a private variable. Structurally the two codes are identical.

// Sequential code from Solihin 2008, page 25.

for (i = 0; i < 8; i++)
    a[i] = b[i] + c[i];
sum = 0;
for (i = 0; i < 8; i++)
    if (a[i] > 0)
        sum = sum + a[i];
Print sum;

The logical opposite of data parallel is task parallel, in which a number of distinct tasks operate on common data. An example of a task parallel code which is functionally equivalent to the sequential and data parallel codes given above follows below.

// Task parallel code.

int id = getmyid(); // assume id = 0 for thread 0, id = 1 for thread 1

if (id == 0)
{
    for (i = 0; i < 8; i++)
    {
        a[i] = b[i] + c[i];
        send_msg(P1, a[i]);
    }
}
else
{
    sum = 0;
    for (i = 0; i < 8; i++)
    {
        recv_msg(P0, a[i]);
        if (a[i] > 0)
            sum = sum + a[i];
    }
    Print sum;
}

In the code above, work is divided into two parallel tasks. The first performs the element-wise addition of arrays b and c and stores the result in a. The other sums the elements of a. These tasks both operate on all elements of a (rather than on separate chunks), and the code executed by each process is different (rather than identical).

Since each parallel task is unique, a major limitation of task parallel algorithms is that the maximum degree of parallelism attainable is limited to the number of tasks that have been formulated. This is in contrast to data parallel algorithms, which can be scaled easily to take advantage of an arbitrary number of processing elements. Haveraaen (2000) also notes that task parallel algorithms are inherently more complex, requiring more elaborate methods of communication and synchronization. In the task parallel code above, after thread 0 computes an element of a it must send it to thread 1. To support this, sends and receives occur every iteration of the two loops, resulting in a total of 8 messages being sent between the processes. In contrast, the data parallel code sends only 2 messages, one at the beginning and one at the end.

The shared memory and message passing models do not, in general, impose constraints on what kind of work is to be performed by each task, and the data parallel model does not necessarily require a particular methodology for controlling access to common data.

In fact, the data parallel model can be used in conjunction with either the shared memory or the message passing model, as explored by Klaiber (1994).

Comparing the Data Parallel Model with the Shared Memory and Message Passing Models

Comparison between shared memory, message passing, and data parallel programming models (adapted from Solihin 2008, page 22).
Aspects Shared Memory Message Passing Data Parallel
Communication implicit (via loads/stores) explicit messages implicit
Synchronization explicit implicit (via messages) typically implicit
Hardware support typically required none
Development effort lower higher higher
Tuning effort higher lower

A Code Example

// Data parallel implementation in C++ with OpenMP.

#include <omp.h>
#include <iostream>

int main(void)
{
    double a[8], b[8], c[8], localSum[2];
    long s = 4;
    int id, i;

    #pragma omp parallel for private(id, i) reduction(+:s)
    for (i = 0; i < 8; i++)
    {
        a[i] = b[i] + c[i];
    }

    for (i = 0; i < 2; i++) localSum[i] = 0;

    #pragma omp parallel for private(id, i) reduction(+:s)
    for (i = 0; i < 8; i++)
    {
        id = omp_get_thread_num();
        if (a[i] > 0)
            localSum[id] = localSum[id] + a[i];
    }

    double sum = localSum[0] + localSum[1];
    std::cout << sum << std::endl;
}

Hardware Examples

Definitions

  • Data parallel
  • Task parallel

References

  • David E. Culler, Jaswinder Pal Singh, and Anoop Gupta, Parallel Computer Architecture: A Hardware/Software Approach, Morgan-Kauffman, 1999.
  • Ian Foster, Designing and Building Parallel Programs, Addison-Wesley, 1995.
  • Magne Haveraaen, "Machine and collection abstractions for user-implemented data-parallel programming," Scientific Programming, 8(4):231-246, 2000.
  • W. Daniel Hillis and Guy L. Steele, Jr., "Data parallel algorithms," Communications of the ACM, 29(12):1170-1183, December 1986.
  • Alexander C. Klaiber and Henry M. Levy, "A comparison of message passing and shared memory architectures for data parallel programs," in Proceedings of the 21st Annual International Symposium on Computer Architecture, April 1994, pp. 94-105.
  • Yan Solihin, Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems, Solihin Books, 2008.