CSC/ECE 506 Spring 2011/ch4a bm: Difference between revisions

From Expertiza_Wiki
Jump to navigation Jump to search
Line 150: Line 150:


   1: subroutine gauss(n, A, X)
   1: subroutine gauss(n, A, X)
integer n
  2: integer n
real A(n, n+1)
  3: real A(n, n+1)
real X(n), Fac(n), Row(n+1)
  5: real X(n), Fac(n), Row(n+1)
integer Indx(n), Itmp(1)
  6: integer Indx(n), Itmp(1)
integer i, j, k, max_indx
  7: integer i, j, k, max_indx
real maxval
  8: real maxval
  9:
  10: Indx = 0 ! Initialize mask array
  11: do i = 1, n
  12: Itmp = MAXLOC(ABS(A(:,i)), MASK=Indx .EQ. 0) ! find pivot
  13: max_indx = Itmp(1) ! Extract pivot index
  14: Indx(max_indx) = i ! Update indirection array
  15: Fac = A(:,i)/A(max_indx,i) ! scale factors for column
  16: Row = A(max_indx,:) ! Extract pivot row
  17:
  18: FORALL (j=1:n, k=i:n+1, Indx(j) .EQ. 0) ! Row Update
  19: A(j,k) = A(j,k) - Fac(j) * Row(k)
  20: end do
  21:
  22: FORALL (j=1:n)
  23: A(Indx(j),:) = A(j,:) ! Row exchange
  24:
  25: DO j = n,1,-1 ! Back substitution
  26: X(j) = A(j,n+1) / A(j,j)
  27: A(1:j-1,n+1) = A(1:j-1,n+1) - A(1:j-1,j)*X(j)
  28: ENDDO
  29: end sub


Indx = 0 ! Initialize mask array
do i = 1, n
Itmp = MAXLOC(ABS(A(:,i)), MASK=Indx .EQ. 0) ! find pivot
max_indx = Itmp(1) ! Extract pivot index
Indx(max_indx) = i ! Update indirection array
Fac = A(:,i)/A(max_indx,i) ! scale factors for column
Row = A(max_indx,:) ! Extract pivot row
FORALL (j=1:n, k=i:n+1, Indx(j) .EQ. 0) ! Row Update
A(j,k) = A(j,k) - Fac(j) * Row(k)
end do
FORALL (j=1:n)
A(Indx(j),:) = A(j,:) ! Row exchange
DO j = n,1,-1 ! Back substitution
X(j) = A(j,n+1) / A(j,j)
A(1:j-1,n+1) = A(1:j-1,n+1) - A(1:j-1,j)*X(j)
ENDDO
end sub
  1: subroutine gauss(n, A, X)
  2: chunk = n**2/p
  3: ! main loop
  4: do pivot = 1, n-1
  5:    ! root maintains communication
  6:    if (my_rank.eq.0) then
  7:        ! adjust the chunk size
  8:        if (MOD(pivot, p).eq.0) then
  9:          chunk = chunk - n
10:        endif
11:
12:        ! calculate chunk vectors
13:        rem = MOD((n**2-(n*pivot)),chunk)
14:        tmp = 0
15:        do i = 1, p
16:          tmp = tmp + chunk
17:          if (tmp.le.(n**2-(n*pivot))) then
18:              a_chnk_vec(i) = chunk
19:              b_chnk_vec(i) = chunk / n
20:          else
21:              a_chnk_vec(i) = rem
22:              b_chnk_vec(i) = rem / n
23:              rem = 0
24:          endif
25:        continue
26:
27:        ! calculate displacement vectors
28:        a_disp_vec(1) = (pivot*n)
29:        b_disp_vec(1) = pivot
30:        do i = 2, p
31:          a_disp_vec(i) = a_disp_vec(i-1) + a_chnk_vec(i-1)
32:          b_disp_vec(i) = b_disp_vec(i-1) + b_chnk_vec(i-1)
33:        continue
34: 
35:        ! fetch the pivot equation
36:        do i = 1, n
37:          pivot_eqn(i) = a(n-(i-1),pivot)
38:        continue
39: 
40:        pivot_b = b(pivot)
41:    endif ! my_rank.eq.0
42: 
43:    ! distribute the pivot equation
44:    call MPI_BCAST(pivot_eqn, n,
45:                    MPI_DOUBLE_PRECISION,
46:                    root, MPI_COMM_WORLD, ierr)
47: 
48:    call MPI_BCAST(pivot_b, 1,
49:                    MPI_DOUBLE_PRECISION,
50:                    root, MPI_COMM_WORLD, ierr)
51: 
52:    ! distribute the chunk vector
53:    call MPI_SCATTER(a_chnk_vec, 1, MPI_INTEGER,
54:                      chunk, 1, MPI_INTEGER,
55:                      root, MPI_COMM_WORLD, ierr)
56: 
57:    ! distribute the data
58:    call MPI_SCATTERV(a, a_chnk_vec, a_disp_vec,
59:                      MPI_DOUBLE_PRECISION,
60:                      local_a, chunk,
61:                      MPI_DOUBLE_PRECISION,
62:                      root, MPI_COMM_WORLD,ierr)
63: 
64:    call MPI_SCATTERV(b, b_chnk_vec, b_disp_vec,
65:                      MPI_DOUBLE_PRECISION,
66:                      local_b, chunk/n,
67:                      MPI_DOUBLE_PRECISION,
68:                      root, MPI_COMM_WORLD,ierr)
69: 
70:    ! forward elimination
71:    do j = 1, (chunk/n)
72:        xmult = local_a((n-(pivot-1)),j) / pivot_eqn(pivot)
73:        do i = (n-pivot), 1, -1
74:          local_a(i,j) = local_a(i,j) - (xmult * pivot_eqn(n-(i-1)))
75:        continue
76: 
77:        local_b(j) = local_b(j) - (xmult * pivot_b)
78:    continue
79: 
80:    ! restore the data to root
81:    call MPI_GATHERV(local_a, chunk,
82:                      MPI_DOUBLE_PRECISION,
83:                      a, a_chnk_vec, a_disp_vec,
84:                      MPI_DOUBLE_PRECISION,
85:                      root, MPI_COMM_WORLD, ierr)
86: 
87:    call MPI_GATHERV(local_b, chunk/n,
88:                      MPI_DOUBLE_PRECISION,
89:                      b, b_chnk_vec, b_disp_vec,
90:                      MPI_DOUBLE_PRECISION,
91:                      root, MPI_COMM_WORLD, ierr)
92:  continue ! end of main loop
93:
94:  ! backwards substitution done in parallel (not shown)
<br>
<br>
This code lacks some of the declarations for the variables, but most of the variables are self-explanatory.  The code also attempts to do some load balancing via the <code>chunk</code> variable.  <code>chunk</code> is also used to determine how much data to send, as the amount of data needed in each step gets progressively smaller.  Making <code>chunk</code> smaller will therefor decrease the amount of time spent in communication, thus yielding better runtimes.  The other variable of note is <code>root</code>, which refers to the root processor, the processor that controls the rest of the processors.
This code lacks some of the declarations for the variables, but most of the variables are self-explanatory.  The code also attempts to do some load balancing via the <code>chunk</code> variable.  <code>chunk</code> is also used to determine how much data to send, as the amount of data needed in each step gets progressively smaller.  Making <code>chunk</code> smaller will therefor decrease the amount of time spent in communication, thus yielding better runtimes.  The other variable of note is <code>root</code>, which refers to the root processor, the processor that controls the rest of the processors.

Revision as of 02:47, 28 February 2011

Overview

Many algorithms can be parallelized effectively. Some of them can even be parellelized using different parallel models. Gaussian elimination is one such algorithm. It can be implemented in the Data Parallel, Shared Memory, and Message passing models. This article discusses implementations of Gaussian elimination in all three models, using High Performance FORTRAN (HPF), OpenMP, MPI.

Gaussian Elimination

Guassian Elimination is common method used to solve a system of linear equations. The method was popularized by Issac Newton and is today taught in most elementary linear algebra textbooks[5]. The method consists of two steps: forward reduction and back substitution. The method is not strictly matrix based, but since any system of equations can be represented in matrix form, we will only work with the matrix forms for convenience.

Forward Reduction

The first step is to reduce the equation matrix to row echelon form. In this form, each row has at least one more zero in a column on the left than the previous row, and the first non-zero element is 1. A couple of example are best to illustrate:

1 2 -4 2
0 1 3 = 5
0 0 1 1


1 7 3 0 -4
0 0 1 10 = 0.5
0 0 0 1 6

Back Substitution

In this step, we begin with the last row of the matrix and substitute the result into the previous row. We solve that row and substitute into the previous row, continuing like this until the system is solved. For example, we will solve this matrix:

1 2 -4 2
0 1 3 = 5
0 0 1 1

Substitute 1 for the third element in equation 2, and subtract 3 from both sides:

1 2 -4 2
0 1 0 = 2
0 0 1 1

Substitute 1 for the third element and 2 for the second element in equation 1, and solve:

1 0 0 2
0 1 0 = 2
0 0 1 1

FORTRAN Background

The code samples below are given in FORTRAN. FORTRAN has some differences from C-based languages. They are listed below. Assume that there exists an array A

  • Arrays are 1-based instead of 0-based, and array subscripts are specified using parentheses instead of brackets.
  • All elements of an array can be set to a value by simply setting the array variable equal to a value, as in A = 0
  • A DO loop is not necessary to perform the same action on a set of items in an array. Rather, one can simply specify a subset of the array on which to perform the action, as in A(a:b) = A(a:b) * 2
  • In a multi-dimensional array, using a colon as a range of array elements in one of the dimensions will perform the operation on all elements in that dimension, as in A(1, :) = 2

Parallel Implementations

Data Parallel

The following section of code implements Gaussian Elimination via message passing, using MPI. It was taken from a paper by S.F.McGinn and R.E.Shaw from the University of New Brunswick, New Brunswick, Canada.[1]

 1: subroutine gauss(n, A, X)
 2: 	integer n
 3: 	real A(n, n+1)
 5: 	real X(n), Fac(n), Row(n+1)
 6: 	integer Indx(n), Itmp(1)
 7: 	integer i, j, k, max_indx
 8: 	real maxval
 9: 
 10: 	Indx = 0						! Initialize mask array
 11: 	do i = 1, n
 12: 		Itmp = MAXLOC(ABS(A(:,i)), MASK=Indx .EQ. 0)	! find pivot
 13: 		max_indx = Itmp(1)				! Extract pivot index
 14: 		Indx(max_indx) = i				! Update indirection array
 15: 		Fac = A(:,i)/A(max_indx,i)			! scale factors for column
 16: 		Row = A(max_indx,:)				! Extract pivot row
 17: 								
 18:		FORALL (j=1:n, k=i:n+1, Indx(j) .EQ. 0)		! Row Update
 19: 			A(j,k) = A(j,k) - Fac(j) * Row(k)
 20: 	end do
 21: 
 22: 	FORALL (j=1:n)
 23: 		A(Indx(j),:) = A(j,:)				! Row exchange
 24: 
 25: 	DO j = n,1,-1						! Back substitution
 26: 		X(j) = A(j,n+1) / A(j,j)
 27: 		A(1:j-1,n+1) = A(1:j-1,n+1) - A(1:j-1,j)*X(j)
 28: 	ENDDO
 29: end sub


This code lacks some of the declarations for the variables, but most of the variables are self-explanatory. The code also attempts to do some load balancing via the chunk variable. chunk is also used to determine how much data to send, as the amount of data needed in each step gets progressively smaller. Making chunk smaller will therefor decrease the amount of time spent in communication, thus yielding better runtimes. The other variable of note is root, which refers to the root processor, the processor that controls the rest of the processors.

The code effectively begins its parallel section at line 4. Lines 5-41 have the root processor setting the chunk size and setting up the data to be passed to the other processors. In lines 43-68, the root processor sends the necessary data to the other processors. The functions MPI_BCAST, MPI_SCATTER, and MPI_SCATTERV serve as either a "send" or a "receive", depending on which processor is executing them; on the root, they act as a send, while on all other processors, they act as a receive[3]. In lines 70-78, each processor is performing the forward elimination on its chunk of data. Finally, the data from each processor is sent back to the root processor using the MPI_GATHERV function, which also functions as either a "send" or a receive", only the root processor is now the receiver and the other processors are the senders. All of this code is executed for each pivot point in the matrix. Backwards substitution is then done sequentially on the root processor.

The key elements of Message Passing in this code example are the communication via the MPI_ functions and the root processor performing some set-up of data to be passed on its own. This code is using the MPI library to support parallelization.

Shared Memory

Message Passing

The following section of code implements Gaussian Elimination via message passing, using MPI. It was taken from a paper by S.F.McGinn and R.E.Shaw from the University of New Brunswick, New Brunswick, Canada.[1]

 1: root = 0
 2: chunk = n**2/p
 3: ! main loop
 4: do pivot = 1, n-1
 5:     ! root maintains communication
 6:     if (my_rank.eq.0) then
 7:        ! adjust the chunk size
 8:        if (MOD(pivot, p).eq.0) then
 9:           chunk = chunk - n
10:        endif
11: 
12:        ! calculate chunk vectors
13:        rem = MOD((n**2-(n*pivot)),chunk)
14:        tmp = 0
15:        do i = 1, p
16:           tmp = tmp + chunk
17:           if (tmp.le.(n**2-(n*pivot))) then
18:              a_chnk_vec(i) = chunk
19:              b_chnk_vec(i) = chunk / n
20:           else
21:              a_chnk_vec(i) = rem
22:              b_chnk_vec(i) = rem / n
23:              rem = 0
24:           endif
25:        continue
26: 
27:        ! calculate displacement vectors
28:        a_disp_vec(1) = (pivot*n)
29:        b_disp_vec(1) = pivot
30:        do i = 2, p
31:           a_disp_vec(i) = a_disp_vec(i-1) + a_chnk_vec(i-1)
32:           b_disp_vec(i) = b_disp_vec(i-1) + b_chnk_vec(i-1)
33:        continue
34:  
35:        ! fetch the pivot equation
36:        do i = 1, n
37:           pivot_eqn(i) = a(n-(i-1),pivot)
38:        continue
39:  
40:        pivot_b = b(pivot)
41:     endif ! my_rank.eq.0
42:  
43:     ! distribute the pivot equation
44:     call MPI_BCAST(pivot_eqn, n,
45:                    MPI_DOUBLE_PRECISION,
46:                    root, MPI_COMM_WORLD, ierr)
47:  
48:     call MPI_BCAST(pivot_b, 1,
49:                    MPI_DOUBLE_PRECISION,
50:                    root, MPI_COMM_WORLD, ierr)
51:  
52:     ! distribute the chunk vector
53:     call MPI_SCATTER(a_chnk_vec, 1, MPI_INTEGER,
54:                      chunk, 1, MPI_INTEGER,
55:                      root, MPI_COMM_WORLD, ierr)
56:  
57:     ! distribute the data
58:     call MPI_SCATTERV(a, a_chnk_vec, a_disp_vec,
59:                       MPI_DOUBLE_PRECISION,
60:                       local_a, chunk,
61:                       MPI_DOUBLE_PRECISION,
62:                       root, MPI_COMM_WORLD,ierr)
63:  
64:     call MPI_SCATTERV(b, b_chnk_vec, b_disp_vec,
65:                       MPI_DOUBLE_PRECISION,
66:                       local_b, chunk/n,
67:                       MPI_DOUBLE_PRECISION,
68:                       root, MPI_COMM_WORLD,ierr)
69:  
70:     ! forward elimination
71:     do j = 1, (chunk/n)
72:        xmult = local_a((n-(pivot-1)),j) / pivot_eqn(pivot)
73:        do i = (n-pivot), 1, -1
74:           local_a(i,j) = local_a(i,j) - (xmult * pivot_eqn(n-(i-1)))
75:        continue
76:  
77:        local_b(j) = local_b(j) - (xmult * pivot_b)
78:     continue
79:  
80:     ! restore the data to root
81:     call MPI_GATHERV(local_a, chunk,
82:                      MPI_DOUBLE_PRECISION,
83:                      a, a_chnk_vec, a_disp_vec,
84:                      MPI_DOUBLE_PRECISION,
85:                      root, MPI_COMM_WORLD, ierr)
86:  
87:     call MPI_GATHERV(local_b, chunk/n,
88:                      MPI_DOUBLE_PRECISION,
89:                      b, b_chnk_vec, b_disp_vec,
90:                      MPI_DOUBLE_PRECISION,
91:                      root, MPI_COMM_WORLD, ierr)
92:  continue ! end of main loop
93: 
94:  ! backwards substitution done in parallel (not shown)


This code lacks some of the declarations for the variables, but most of the variables are self-explanatory. The code also attempts to do some load balancing via the chunk variable. chunk is also used to determine how much data to send, as the amount of data needed in each step gets progressively smaller. Making chunk smaller will therefor decrease the amount of time spent in communication, thus yielding better runtimes. The other variable of note is root, which refers to the root processor, the processor that controls the rest of the processors.

The code effectively begins its parallel section at line 4. Lines 5-41 have the root processor setting the chunk size and setting up the data to be passed to the other processors. In lines 43-68, the root processor sends the necessary data to the other processors. The functions MPI_BCAST, MPI_SCATTER, and MPI_SCATTERV serve as either a "send" or a "receive", depending on which processor is executing them; on the root, they act as a send, while on all other processors, they act as a receive[3]. In lines 70-78, each processor is performing the forward elimination on its chunk of data. Finally, the data from each processor is sent back to the root processor using the MPI_GATHERV function, which also functions as either a "send" or a receive", only the root processor is now the receiver and the other processors are the senders. All of this code is executed for each pivot point in the matrix. Backwards substitution is then done sequentially on the root processor.

The key elements of Message Passing in this code example are the communication via the MPI_ functions and the root processor performing some set-up of data to be passed on its own. This code is using the MPI library to support parallelization.

Definitions

  • HPF - High Performance FORTRAN
  • MPI - Message Passing Interface, an API used for supporting message passing across processes.

References

1. S.F.McGinn and R.E.Shaw, University of New Brunswick, Parallel Gaussian Elimination Using OpenMP and MPI
2. Ian Foster, Argonne National Laboratory, Case Study: Gaussian Elimination
3. MPI: A Message-Passing Interface Standard
4. Wikipedia's FORTRAN page
5. Wikipedia's Gaussian Elimination page