CSC/ECE 506 Spring 2011/ch4a bm: Difference between revisions

From Expertiza_Wiki
Jump to navigation Jump to search
 
(14 intermediate revisions by 2 users not shown)
Line 3: Line 3:


= Gaussian Elimination =
= Gaussian Elimination =
Guassian Elimination is common method used to solve a system of linear equations.  The method was popularized by Issac Newton and is today taught in most elementary linear algebra textbooks.  The method consists of two steps:  forward reduction and back substitution.  The method is not strictly matrix based, but since any system of equations can be represented in matrix form, we will only work with the matrix forms for convenience.
Gaussian Elimination is a common method used to solve a system of linear equations.  The method was popularized by Issac Newton and is today taught in most elementary linear algebra textbooks.<sup><span id="5body">[[#5foot|[5]]]</span></sup> The method consists of two steps:  forward reduction and back substitution.  The method is not strictly matrix based, but since any system of equations can be represented in matrix form, we will only work with the matrix forms for convenience.


'''Forward Reduction'''
'''Forward Reduction'''


The first step is to reduce the equation matrix to [http://en.wikipedia.org/wiki/Echelon_form row echelon form].  In this form, each row has at least one more zero in a column on the left than the previous row, and the first non-zero element is 1.  A couple of example are best to illustrate:
The first step is to reduce the equation matrix to [http://en.wikipedia.org/wiki/Echelon_form row-echelon form].  In this form, each row has at least one more zero in a column on the left than the previous row, and the first non-zero element is 1.  A couple of examples will help to illustrate row-echelon form:


<blockquote>
<blockquote>
Line 146: Line 146:


== Data Parallel ==
== Data Parallel ==
The following section of code implements Gaussian Elimination via data parallel with HPF.  It is found in the book ''Designing and Building Parallel Programs (Online)'' by Ian Foster.<sup><span id="2body">[[#2foot|[2]]]</span></sup> <br>
  1: subroutine gauss(n, A, X)
  2: integer n
  3: real A(n, n+1)
  5: real X(n), Fac(n), Row(n+1)
  6: integer Indx(n), Itmp(1)
  7: integer i, j, k, max_indx
  8: real maxval
  9:
  10: Indx = 0 ! Initialize mask array
  11: do i = 1, n
  12: Itmp = MAXLOC(ABS(A(:,i)), MASK=Indx .EQ. 0) ! find pivot
  13: max_indx = Itmp(1) ! Extract pivot index
  14: Indx(max_indx) = i ! Update indirection array
  15: Fac = A(:,i)/A(max_indx,i) ! scale factors for column
  16: Row = A(max_indx,:) ! Extract pivot row
  17:
  18: FORALL (j=1:n, k=i:n+1, Indx(j) .EQ. 0) ! Row Update
  19: A(j,k) = A(j,k) - Fac(j) * Row(k)
  20: end do
  21:
  22: FORALL (j=1:n)
  23: A(Indx(j),:) = A(j,:) ! Row exchange
  24:
  25: DO j = n,1,-1 ! Back substitution
  26: X(j) = A(j,n+1) / A(j,j)
  27: A(1:j-1,n+1) = A(1:j-1,n+1) - A(1:j-1,j)*X(j)
  28: ENDDO
  29: end sub
<br>
This data parallel code works by performing operations on entire rows of the equation matrix.  In the first loop, the forward reduction is performed.  The variable <code>i</code> is an index which keeps track of the column (and thus row) that is currently being reduced.
Instead of reducing the rows starting with row 1, we will instead work with the row that has the largest element in the pivot column that hasn't already been used.  We will find the row with which to work at the beginning of each iteration. (lines 12-14)
Next, we divide every element in the current row by the value in the column we are reducing. (lines 15-16)
To finish the forward reduction, we subtract the current row from all remaining rows the needed number of times (so that the current column becomes 0).  This part of the algorithm is extremely parallelizable.  The <code>FORALL</code> statement tells the compiler that this statement is parallelizable, and this statement implies data parallelism. (lines 18-20)
Lines 22-24 exchange rows to put the matrix in row-echelon form, and the <code>FORALL</code> statement again shows that this operation is highly parallelizable.
The last loop does the back substitution.  The loop starts on the end and works its way to the first row.
First, the last row is solved. (line 26)
Then, the just found solution is immediately substituted into all remaining rows and added to the right side of the matrix (the n+1 column). (line 27)


== Shared Memory ==
== Shared Memory ==
The following section of code implements Gaussian Elimination with a shared memory scheme, using HPF.  It was taken from a paper by S.F.McGinn and R.E.Shaw from the University of New Brunswick, New Brunswick, Canada.<sup><span id="1body">[[#1foot|[1]]]</span></sup> <br>
  1: do pivot = 1, (n-1)
  2: !$omp parallel do private(xmult) schedule(runtime)
  3: do i = (pivot+1), n
  4: xmult = a(i,pivot) / a(pivot,pivot)
  5: do j = (pivot+1), n
  6: a(i,j) = a(i,j) - (xmult * a(pivot,j))
  7: end do
  8: b(i) = b(i) - (xmult * b(pivot))
  9: end do
  10:  1: !$omp end parallel do
  11: end do
<br>
As is readily seen, this code is short and simple, so we will analyze it line by line.  To summarize the method, each row is solved serially, with the columns each being normalized and subtracted from the remaining rows in parallel. 
The pivot column is the column currently being reduced (and subsequently, also the row).
  1: do pivot = 1, (n-1)
Loop through each row for forward reduction.
  2: !$omp parallel do private(xmult) schedule(runtime)
Spawn parallel threads here, making <code>xmult</code> private.
  3: do i = (pivot+1), n
Loop through all columns in the current row starting from the pivot.
  4: xmult = a(i,pivot) / a(pivot,pivot)
Divide the i'th column by the pivot column.  This is the normalization step.
  5: do j = (pivot+1), n
  6: a(i,j) = a(i,j) - (xmult * a(pivot,j))
  7: end do
Now, we loop through all other rows and update them with the just normalized column.  This involves subtracting it the needed number of times.
  8: b(i) = b(i) - (xmult * b(pivot))
We also update the solution column, which is in the <code>b</code> array.
  9: end do
  10:  1: !$omp end parallel do
  11: end do
Back substitution would be done next, but is not shown.


== Message Passing ==
== Message Passing ==
Line 262: Line 348:
<span id="3foot">[[#3body|3.]]</span> [http://www.mpi-forum.org/docs/mpi-11-html/mpi-report.html MPI: A Message-Passing Interface Standard] <br>
<span id="3foot">[[#3body|3.]]</span> [http://www.mpi-forum.org/docs/mpi-11-html/mpi-report.html MPI: A Message-Passing Interface Standard] <br>
<span id="4foot">[[#4body|4.]]</span> Wikipedia's [http://en.wikipedia.org/wiki/Fortran FORTRAN] page <br>
<span id="4foot">[[#4body|4.]]</span> Wikipedia's [http://en.wikipedia.org/wiki/Fortran FORTRAN] page <br>
<span id="5foot">[[#4body|4.]]</span> Wikipedia's [http://en.wikipedia.org/wiki/Gaussian_elimination Gaussian Elimination] page <br>
<span id="5foot">[[#5body|5.]]</span> Wikipedia's [http://en.wikipedia.org/wiki/Gaussian_elimination Gaussian Elimination] page <br>

Latest revision as of 05:09, 28 February 2011

Overview

Many algorithms can be parallelized effectively. Some of them can even be parellelized using different parallel models. Gaussian elimination is one such algorithm. It can be implemented in the Data Parallel, Shared Memory, and Message passing models. This article discusses implementations of Gaussian elimination in all three models, using High Performance FORTRAN (HPF), OpenMP, MPI.

Gaussian Elimination

Gaussian Elimination is a common method used to solve a system of linear equations. The method was popularized by Issac Newton and is today taught in most elementary linear algebra textbooks.[5] The method consists of two steps: forward reduction and back substitution. The method is not strictly matrix based, but since any system of equations can be represented in matrix form, we will only work with the matrix forms for convenience.

Forward Reduction

The first step is to reduce the equation matrix to row-echelon form. In this form, each row has at least one more zero in a column on the left than the previous row, and the first non-zero element is 1. A couple of examples will help to illustrate row-echelon form:

1 2 -4 2
0 1 3 = 5
0 0 1 1


1 7 3 0 -4
0 0 1 10 = 0.5
0 0 0 1 6

Back Substitution

In this step, we begin with the last row of the matrix and substitute the result into the previous row. We solve that row and substitute into the previous row, continuing like this until the system is solved. For example, we will solve this matrix:

1 2 -4 2
0 1 3 = 5
0 0 1 1

Substitute 1 for the third element in equation 2, and subtract 3 from both sides:

1 2 -4 2
0 1 0 = 2
0 0 1 1

Substitute 1 for the third element and 2 for the second element in equation 1, and solve:

1 0 0 2
0 1 0 = 2
0 0 1 1

FORTRAN Background

The code samples below are given in FORTRAN. FORTRAN has some differences from C-based languages. They are listed below. Assume that there exists an array A

  • Arrays are 1-based instead of 0-based, and array subscripts are specified using parentheses instead of brackets.
  • All elements of an array can be set to a value by simply setting the array variable equal to a value, as in A = 0
  • A DO loop is not necessary to perform the same action on a set of items in an array. Rather, one can simply specify a subset of the array on which to perform the action, as in A(a:b) = A(a:b) * 2
  • In a multi-dimensional array, using a colon as a range of array elements in one of the dimensions will perform the operation on all elements in that dimension, as in A(1, :) = 2

Parallel Implementations

Data Parallel

The following section of code implements Gaussian Elimination via data parallel with HPF. It is found in the book Designing and Building Parallel Programs (Online) by Ian Foster.[2]

 1: subroutine gauss(n, A, X)
 2: 	integer n
 3: 	real A(n, n+1)
 5: 	real X(n), Fac(n), Row(n+1)
 6: 	integer Indx(n), Itmp(1)
 7: 	integer i, j, k, max_indx
 8: 	real maxval
 9: 
 10: 	Indx = 0						! Initialize mask array
 11: 	do i = 1, n
 12: 		Itmp = MAXLOC(ABS(A(:,i)), MASK=Indx .EQ. 0)	! find pivot
 13: 		max_indx = Itmp(1)				! Extract pivot index
 14: 		Indx(max_indx) = i				! Update indirection array
 15: 		Fac = A(:,i)/A(max_indx,i)			! scale factors for column
 16: 		Row = A(max_indx,:)				! Extract pivot row
 17: 								
 18:		FORALL (j=1:n, k=i:n+1, Indx(j) .EQ. 0)		! Row Update
 19: 			A(j,k) = A(j,k) - Fac(j) * Row(k)
 20: 	end do
 21: 
 22: 	FORALL (j=1:n)
 23: 		A(Indx(j),:) = A(j,:)				! Row exchange
 24: 
 25: 	DO j = n,1,-1						! Back substitution
 26: 		X(j) = A(j,n+1) / A(j,j)
 27: 		A(1:j-1,n+1) = A(1:j-1,n+1) - A(1:j-1,j)*X(j)
 28: 	ENDDO
 29: end sub


This data parallel code works by performing operations on entire rows of the equation matrix. In the first loop, the forward reduction is performed. The variable i is an index which keeps track of the column (and thus row) that is currently being reduced.

Instead of reducing the rows starting with row 1, we will instead work with the row that has the largest element in the pivot column that hasn't already been used. We will find the row with which to work at the beginning of each iteration. (lines 12-14)

Next, we divide every element in the current row by the value in the column we are reducing. (lines 15-16)

To finish the forward reduction, we subtract the current row from all remaining rows the needed number of times (so that the current column becomes 0). This part of the algorithm is extremely parallelizable. The FORALL statement tells the compiler that this statement is parallelizable, and this statement implies data parallelism. (lines 18-20)

Lines 22-24 exchange rows to put the matrix in row-echelon form, and the FORALL statement again shows that this operation is highly parallelizable.

The last loop does the back substitution. The loop starts on the end and works its way to the first row.

First, the last row is solved. (line 26) Then, the just found solution is immediately substituted into all remaining rows and added to the right side of the matrix (the n+1 column). (line 27)

Shared Memory

The following section of code implements Gaussian Elimination with a shared memory scheme, using HPF. It was taken from a paper by S.F.McGinn and R.E.Shaw from the University of New Brunswick, New Brunswick, Canada.[1]

 1: do pivot = 1, (n-1)
 2: !$omp parallel do private(xmult) schedule(runtime)
 3: 	do i = (pivot+1), n
 4: 		xmult = a(i,pivot) / a(pivot,pivot)
 5: 		do j = (pivot+1), n
 6: 			a(i,j) = a(i,j) - (xmult * a(pivot,j))
 7: 		end do
 8: 		b(i) = b(i) - (xmult * b(pivot))
 9: 	end do
 10:   1: !$omp end parallel do
 11: end do


As is readily seen, this code is short and simple, so we will analyze it line by line. To summarize the method, each row is solved serially, with the columns each being normalized and subtracted from the remaining rows in parallel.

The pivot column is the column currently being reduced (and subsequently, also the row).

 1: do pivot = 1, (n-1)

Loop through each row for forward reduction.

 2: !$omp parallel do private(xmult) schedule(runtime)

Spawn parallel threads here, making xmult private.

 3: 	do i = (pivot+1), n

Loop through all columns in the current row starting from the pivot.

 4: 		xmult = a(i,pivot) / a(pivot,pivot)

Divide the i'th column by the pivot column. This is the normalization step.

 5: 		do j = (pivot+1), n
 6: 			a(i,j) = a(i,j) - (xmult * a(pivot,j))
 7: 		end do

Now, we loop through all other rows and update them with the just normalized column. This involves subtracting it the needed number of times.

 8: 		b(i) = b(i) - (xmult * b(pivot))

We also update the solution column, which is in the b array.

 9: 	end do
 10:   1: !$omp end parallel do
 11: end do

Back substitution would be done next, but is not shown.

Message Passing

The following section of code implements Gaussian Elimination via message passing, using MPI. It was taken from a paper by S.F.McGinn and R.E.Shaw from the University of New Brunswick, New Brunswick, Canada.[1]

 1: root = 0
 2: chunk = n**2/p
 3: ! main loop
 4: do pivot = 1, n-1
 5:     ! root maintains communication
 6:     if (my_rank.eq.0) then
 7:        ! adjust the chunk size
 8:        if (MOD(pivot, p).eq.0) then
 9:           chunk = chunk - n
10:        endif
11: 
12:        ! calculate chunk vectors
13:        rem = MOD((n**2-(n*pivot)),chunk)
14:        tmp = 0
15:        do i = 1, p
16:           tmp = tmp + chunk
17:           if (tmp.le.(n**2-(n*pivot))) then
18:              a_chnk_vec(i) = chunk
19:              b_chnk_vec(i) = chunk / n
20:           else
21:              a_chnk_vec(i) = rem
22:              b_chnk_vec(i) = rem / n
23:              rem = 0
24:           endif
25:        continue
26: 
27:        ! calculate displacement vectors
28:        a_disp_vec(1) = (pivot*n)
29:        b_disp_vec(1) = pivot
30:        do i = 2, p
31:           a_disp_vec(i) = a_disp_vec(i-1) + a_chnk_vec(i-1)
32:           b_disp_vec(i) = b_disp_vec(i-1) + b_chnk_vec(i-1)
33:        continue
34:  
35:        ! fetch the pivot equation
36:        do i = 1, n
37:           pivot_eqn(i) = a(n-(i-1),pivot)
38:        continue
39:  
40:        pivot_b = b(pivot)
41:     endif ! my_rank.eq.0
42:  
43:     ! distribute the pivot equation
44:     call MPI_BCAST(pivot_eqn, n,
45:                    MPI_DOUBLE_PRECISION,
46:                    root, MPI_COMM_WORLD, ierr)
47:  
48:     call MPI_BCAST(pivot_b, 1,
49:                    MPI_DOUBLE_PRECISION,
50:                    root, MPI_COMM_WORLD, ierr)
51:  
52:     ! distribute the chunk vector
53:     call MPI_SCATTER(a_chnk_vec, 1, MPI_INTEGER,
54:                      chunk, 1, MPI_INTEGER,
55:                      root, MPI_COMM_WORLD, ierr)
56:  
57:     ! distribute the data
58:     call MPI_SCATTERV(a, a_chnk_vec, a_disp_vec,
59:                       MPI_DOUBLE_PRECISION,
60:                       local_a, chunk,
61:                       MPI_DOUBLE_PRECISION,
62:                       root, MPI_COMM_WORLD,ierr)
63:  
64:     call MPI_SCATTERV(b, b_chnk_vec, b_disp_vec,
65:                       MPI_DOUBLE_PRECISION,
66:                       local_b, chunk/n,
67:                       MPI_DOUBLE_PRECISION,
68:                       root, MPI_COMM_WORLD,ierr)
69:  
70:     ! forward elimination
71:     do j = 1, (chunk/n)
72:        xmult = local_a((n-(pivot-1)),j) / pivot_eqn(pivot)
73:        do i = (n-pivot), 1, -1
74:           local_a(i,j) = local_a(i,j) - (xmult * pivot_eqn(n-(i-1)))
75:        continue
76:  
77:        local_b(j) = local_b(j) - (xmult * pivot_b)
78:     continue
79:  
80:     ! restore the data to root
81:     call MPI_GATHERV(local_a, chunk,
82:                      MPI_DOUBLE_PRECISION,
83:                      a, a_chnk_vec, a_disp_vec,
84:                      MPI_DOUBLE_PRECISION,
85:                      root, MPI_COMM_WORLD, ierr)
86:  
87:     call MPI_GATHERV(local_b, chunk/n,
88:                      MPI_DOUBLE_PRECISION,
89:                      b, b_chnk_vec, b_disp_vec,
90:                      MPI_DOUBLE_PRECISION,
91:                      root, MPI_COMM_WORLD, ierr)
92:  continue ! end of main loop
93: 
94:  ! backwards substitution done in parallel (not shown)


This code lacks some of the declarations for the variables, but most of the variables are self-explanatory. The code also attempts to do some load balancing via the chunk variable. chunk is also used to determine how much data to send, as the amount of data needed in each step gets progressively smaller. Making chunk smaller will therefor decrease the amount of time spent in communication, thus yielding better runtimes. The other variable of note is root, which refers to the root processor, the processor that controls the rest of the processors.

The code effectively begins its parallel section at line 4. Lines 5-41 have the root processor setting the chunk size and setting up the data to be passed to the other processors. In lines 43-68, the root processor sends the necessary data to the other processors. The functions MPI_BCAST, MPI_SCATTER, and MPI_SCATTERV serve as either a "send" or a "receive", depending on which processor is executing them; on the root, they act as a send, while on all other processors, they act as a receive[3]. In lines 70-78, each processor is performing the forward elimination on its chunk of data. Finally, the data from each processor is sent back to the root processor using the MPI_GATHERV function, which also functions as either a "send" or a receive", only the root processor is now the receiver and the other processors are the senders. All of this code is executed for each pivot point in the matrix. Backwards substitution is then done sequentially on the root processor.

The key elements of Message Passing in this code example are the communication via the MPI_ functions and the root processor performing some set-up of data to be passed on its own. This code is using the MPI library to support parallelization.

Definitions

  • HPF - High Performance FORTRAN
  • MPI - Message Passing Interface, an API used for supporting message passing across processes.

References

1. S.F.McGinn and R.E.Shaw, University of New Brunswick, Parallel Gaussian Elimination Using OpenMP and MPI
2. Ian Foster, Argonne National Laboratory, Case Study: Gaussian Elimination
3. MPI: A Message-Passing Interface Standard
4. Wikipedia's FORTRAN page
5. Wikipedia's Gaussian Elimination page