CSC/ECE 506 Spring 2011/ch4a bm
Overview
Many algorithms can be parallelized effectively. Some of them can even be parellelized using different parallel models. Gaussian elimination is one such algorithm. It can be implemented in the Data Parallel, Shared Memory, and Message passing models. This article discusses implementations of Gaussian elimination in all three models, using High Performance FORTRAN (HPF), OpenMP, MPI.
Gaussian Elimination
Gaussian Elimination is a common method used to solve a system of linear equations. The method was popularized by Issac Newton and is today taught in most elementary linear algebra textbooks.[5] The method consists of two steps: forward reduction and back substitution. The method is not strictly matrix based, but since any system of equations can be represented in matrix form, we will only work with the matrix forms for convenience.
Forward Reduction
The first step is to reduce the equation matrix to row-echelon form. In this form, each row has at least one more zero in a column on the left than the previous row, and the first non-zero element is 1. A couple of examples will help to illustrate row-echelon form:
1 2 -4 2 0 1 3 = 5 0 0 1 1
1 7 3 0 -4 0 0 1 10 = 0.5 0 0 0 1 6
Back Substitution
In this step, we begin with the last row of the matrix and substitute the result into the previous row. We solve that row and substitute into the previous row, continuing like this until the system is solved. For example, we will solve this matrix:
1 2 -4 2 0 1 3 = 5 0 0 1 1
Substitute 1 for the third element in equation 2, and subtract 3 from both sides:
1 2 -4 2 0 1 0 = 2 0 0 1 1
Substitute 1 for the third element and 2 for the second element in equation 1, and solve:
1 0 0 2 0 1 0 = 2 0 0 1 1
FORTRAN Background
The code samples below are given in FORTRAN. FORTRAN has some differences from C-based languages. They are listed below. Assume that there exists an array A
- Arrays are 1-based instead of 0-based, and array subscripts are specified using parentheses instead of brackets.
- All elements of an array can be set to a value by simply setting the array variable equal to a value, as in
A = 0
- A
DO
loop is not necessary to perform the same action on a set of items in an array. Rather, one can simply specify a subset of the array on which to perform the action, as inA(a:b) = A(a:b) * 2
- In a multi-dimensional array, using a colon as a range of array elements in one of the dimensions will perform the operation on all elements in that dimension, as in
A(1, :) = 2
Parallel Implementations
Data Parallel
The following section of code implements Gaussian Elimination via data parallel with HPF. It is found in the book Designing and Building Parallel Programs (Online) by Ian Foster.[2]
1: subroutine gauss(n, A, X) 2: integer n 3: real A(n, n+1) 5: real X(n), Fac(n), Row(n+1) 6: integer Indx(n), Itmp(1) 7: integer i, j, k, max_indx 8: real maxval 9: 10: Indx = 0 ! Initialize mask array 11: do i = 1, n 12: Itmp = MAXLOC(ABS(A(:,i)), MASK=Indx .EQ. 0) ! find pivot 13: max_indx = Itmp(1) ! Extract pivot index 14: Indx(max_indx) = i ! Update indirection array 15: Fac = A(:,i)/A(max_indx,i) ! scale factors for column 16: Row = A(max_indx,:) ! Extract pivot row 17: 18: FORALL (j=1:n, k=i:n+1, Indx(j) .EQ. 0) ! Row Update 19: A(j,k) = A(j,k) - Fac(j) * Row(k) 20: end do 21: 22: FORALL (j=1:n) 23: A(Indx(j),:) = A(j,:) ! Row exchange 24: 25: DO j = n,1,-1 ! Back substitution 26: X(j) = A(j,n+1) / A(j,j) 27: A(1:j-1,n+1) = A(1:j-1,n+1) - A(1:j-1,j)*X(j) 28: ENDDO 29: end sub
This data parallel code works by performing operations on entire rows of the equation matrix. In the first loop, the forward reduction is performed. The variable i
is an index which keeps track of the column (and thus row) that is currently being reduced.
Instead of reducing the rows starting with row 1, we will instead work with the row that has the largest element in the pivot column that hasn't already been used. We will find the row with which to work at the beginning of each iteration. (lines 12-14)
Next, we divide every element in the current row by the value in the column we are reducing. (lines 15-16)
To finish the forward reduction, we subtract the current row from all remaining rows the needed number of times (so that the current column becomes 0). This part of the algorithm is extremely parallelizable. The FORALL
statement tells the compiler that this statement is parallelizable, and this statement implies data parallelism. (lines 18-20)
Lines 22-24 exchange rows to put the matrix in row-echelon form, and the FORALL
statement again shows that this operation is highly parallelizable.
The last loop does the back substitution. The loop starts on the end and works its way to the first row.
First, the last row is solved. (line 26) Then, the just found solution is immediately substituted into all remaining rows and added to the right side of the matrix (the n+1 column). (line 27)
The following section of code implements Gaussian Elimination with a shared memory scheme, using HPF. It was taken from a paper by S.F.McGinn and R.E.Shaw from the University of New Brunswick, New Brunswick, Canada.[1]
1: do pivot = 1, (n-1) 2: !$omp parallel do private(xmult) schedule(runtime) 3: do i = (pivot+1), n 4: xmult = a(i,pivot) / a(pivot,pivot) 5: do j = (pivot+1), n 6: a(i,j) = a(i,j) - (xmult * a(pivot,j)) 7: end do 8: b(i) = b(i) - (xmult * b(pivot)) 9: end do 10: 1: !$omp end parallel do 11: end do
As is readily seen, this code is short and simple, so we will analyze it line by line. To summarize the method, each row is solved serially, with the columns each being normalized and subtracted from the remaining rows in parallel.
The pivot column is the column currently being reduced (and subsequently, also the row).
1: do pivot = 1, (n-1)
Loop through each row for forward reduction.
2: !$omp parallel do private(xmult) schedule(runtime)
Spawn parallel threads here, making xmult
private.
3: do i = (pivot+1), n
Loop through all columns in the current row starting from the pivot.
4: xmult = a(i,pivot) / a(pivot,pivot)
Divide the i'th column by the pivot column. This is the normalization step.
5: do j = (pivot+1), n 6: a(i,j) = a(i,j) - (xmult * a(pivot,j)) 7: end do
Now, we loop through all other rows and update them with the just normalized column. This involves subtracting it the needed number of times.
8: b(i) = b(i) - (xmult * b(pivot))
We also update the solution column, which is in the b
array.
9: end do 10: 1: !$omp end parallel do 11: end do
Back substitution would be done next, but is not shown.
Message Passing
The following section of code implements Gaussian Elimination via message passing, using MPI. It was taken from a paper by S.F.McGinn and R.E.Shaw from the University of New Brunswick, New Brunswick, Canada.[1]
1: root = 0 2: chunk = n**2/p 3: ! main loop 4: do pivot = 1, n-1 5: ! root maintains communication 6: if (my_rank.eq.0) then 7: ! adjust the chunk size 8: if (MOD(pivot, p).eq.0) then 9: chunk = chunk - n 10: endif 11: 12: ! calculate chunk vectors 13: rem = MOD((n**2-(n*pivot)),chunk) 14: tmp = 0 15: do i = 1, p 16: tmp = tmp + chunk 17: if (tmp.le.(n**2-(n*pivot))) then 18: a_chnk_vec(i) = chunk 19: b_chnk_vec(i) = chunk / n 20: else 21: a_chnk_vec(i) = rem 22: b_chnk_vec(i) = rem / n 23: rem = 0 24: endif 25: continue 26: 27: ! calculate displacement vectors 28: a_disp_vec(1) = (pivot*n) 29: b_disp_vec(1) = pivot 30: do i = 2, p 31: a_disp_vec(i) = a_disp_vec(i-1) + a_chnk_vec(i-1) 32: b_disp_vec(i) = b_disp_vec(i-1) + b_chnk_vec(i-1) 33: continue 34: 35: ! fetch the pivot equation 36: do i = 1, n 37: pivot_eqn(i) = a(n-(i-1),pivot) 38: continue 39: 40: pivot_b = b(pivot) 41: endif ! my_rank.eq.0 42: 43: ! distribute the pivot equation 44: call MPI_BCAST(pivot_eqn, n, 45: MPI_DOUBLE_PRECISION, 46: root, MPI_COMM_WORLD, ierr) 47: 48: call MPI_BCAST(pivot_b, 1, 49: MPI_DOUBLE_PRECISION, 50: root, MPI_COMM_WORLD, ierr) 51: 52: ! distribute the chunk vector 53: call MPI_SCATTER(a_chnk_vec, 1, MPI_INTEGER, 54: chunk, 1, MPI_INTEGER, 55: root, MPI_COMM_WORLD, ierr) 56: 57: ! distribute the data 58: call MPI_SCATTERV(a, a_chnk_vec, a_disp_vec, 59: MPI_DOUBLE_PRECISION, 60: local_a, chunk, 61: MPI_DOUBLE_PRECISION, 62: root, MPI_COMM_WORLD,ierr) 63: 64: call MPI_SCATTERV(b, b_chnk_vec, b_disp_vec, 65: MPI_DOUBLE_PRECISION, 66: local_b, chunk/n, 67: MPI_DOUBLE_PRECISION, 68: root, MPI_COMM_WORLD,ierr) 69: 70: ! forward elimination 71: do j = 1, (chunk/n) 72: xmult = local_a((n-(pivot-1)),j) / pivot_eqn(pivot) 73: do i = (n-pivot), 1, -1 74: local_a(i,j) = local_a(i,j) - (xmult * pivot_eqn(n-(i-1))) 75: continue 76: 77: local_b(j) = local_b(j) - (xmult * pivot_b) 78: continue 79: 80: ! restore the data to root 81: call MPI_GATHERV(local_a, chunk, 82: MPI_DOUBLE_PRECISION, 83: a, a_chnk_vec, a_disp_vec, 84: MPI_DOUBLE_PRECISION, 85: root, MPI_COMM_WORLD, ierr) 86: 87: call MPI_GATHERV(local_b, chunk/n, 88: MPI_DOUBLE_PRECISION, 89: b, b_chnk_vec, b_disp_vec, 90: MPI_DOUBLE_PRECISION, 91: root, MPI_COMM_WORLD, ierr) 92: continue ! end of main loop 93: 94: ! backwards substitution done in parallel (not shown)
This code lacks some of the declarations for the variables, but most of the variables are self-explanatory. The code also attempts to do some load balancing via the chunk
variable. chunk
is also used to determine how much data to send, as the amount of data needed in each step gets progressively smaller. Making chunk
smaller will therefor decrease the amount of time spent in communication, thus yielding better runtimes. The other variable of note is root
, which refers to the root processor, the processor that controls the rest of the processors.
The code effectively begins its parallel section at line 4. Lines 5-41 have the root processor setting the chunk size and setting up the data to be passed to the other processors. In lines 43-68, the root processor sends the necessary data to the other processors. The functions MPI_BCAST
, MPI_SCATTER
, and MPI_SCATTERV
serve as either a "send" or a "receive", depending on which processor is executing them; on the root, they act as a send, while on all other processors, they act as a receive[3]. In lines 70-78, each processor is performing the forward elimination on its chunk of data. Finally, the data from each processor is sent back to the root processor using the MPI_GATHERV
function, which also functions as either a "send" or a receive", only the root processor is now the receiver and the other processors are the senders. All of this code is executed for each pivot point in the matrix. Backwards substitution is then done sequentially on the root processor.
The key elements of Message Passing in this code example are the communication via the MPI_
functions and the root processor performing some set-up of data to be passed on its own. This code is using the MPI library to support parallelization.
Definitions
- HPF - High Performance FORTRAN
- MPI - Message Passing Interface, an API used for supporting message passing across processes.
References
1. S.F.McGinn and R.E.Shaw, University of New Brunswick, Parallel Gaussian Elimination Using OpenMP and MPI
2. Ian Foster, Argonne National Laboratory, Case Study: Gaussian Elimination
3. MPI: A Message-Passing Interface Standard
4. Wikipedia's FORTRAN page
5. Wikipedia's Gaussian Elimination page