CSC/ECE 506 Spring 2011/ch4a bm
Overview
Many algorithms can be parallelized effectively. Some of them can even be parellelized using different parallel models. Gaussian elimination is one such algorithm. It can be implemented in the Data Parallel, Shared Memory, and Message passing models. This article discusses implementations of Gaussian elimination in all three models, using High Performance FORTRAN (HPF), OpenMP, MPI.
Gaussian Elimination
Guassian Elimination is common method used to solve a system of linear equations. The method was popularized by Issac Newton and is today taught in most elementary linear algebra textbooks[5]. The method consists of two steps: forward reduction and back substitution. The method is not strictly matrix based, but since any system of equations can be represented in matrix form, we will only work with the matrix forms for convenience.
Forward Reduction
The first step is to reduce the equation matrix to row echelon form. In this form, each row has at least one more zero in a column on the left than the previous row, and the first non-zero element is 1. A couple of example are best to illustrate:
1 2 -4 2 0 1 3 = 5 0 0 1 1
1 7 3 0 -4 0 0 1 10 = 0.5 0 0 0 1 6
Back Substitution
In this step, we begin with the last row of the matrix and substitute the result into the previous row. We solve that row and substitute into the previous row, continuing like this until the system is solved. For example, we will solve this matrix:
1 2 -4 2 0 1 3 = 5 0 0 1 1
Substitute 1 for the third element in equation 2, and subtract 3 from both sides:
1 2 -4 2 0 1 0 = 2 0 0 1 1
Substitute 1 for the third element and 2 for the second element in equation 1, and solve:
1 0 0 2 0 1 0 = 2 0 0 1 1
FORTRAN Background
The code samples below are given in FORTRAN. FORTRAN has some differences from C-based languages. They are listed below. Assume that there exists an array A
- Arrays are 1-based instead of 0-based, and array subscripts are specified using parentheses instead of brackets.
- All elements of an array can be set to a value by simply setting the array variable equal to a value, as in
A = 0
- A
DO
loop is not necessary to perform the same action on a set of items in an array. Rather, one can simply specify a subset of the array on which to perform the action, as inA(a:b) = A(a:b) * 2
- In a multi-dimensional array, using a colon as a range of array elements in one of the dimensions will perform the operation on all elements in that dimension, as in
A(1, :) = 2
Parallel Implementations
Data Parallel
The following section of code implements Gaussian Elimination via message passing, using MPI. It was taken from a paper by S.F.McGinn and R.E.Shaw from the University of New Brunswick, New Brunswick, Canada.[1]
1: root = 0 2: chunk = n**2/p 3: ! main loop 4: do pivot = 1, n-1 5: ! root maintains communication 6: if (my_rank.eq.0) then 7: ! adjust the chunk size 8: if (MOD(pivot, p).eq.0) then 9: chunk = chunk - n 10: endif 11: 12: ! calculate chunk vectors 13: rem = MOD((n**2-(n*pivot)),chunk) 14: tmp = 0 15: do i = 1, p 16: tmp = tmp + chunk 17: if (tmp.le.(n**2-(n*pivot))) then 18: a_chnk_vec(i) = chunk 19: b_chnk_vec(i) = chunk / n 20: else 21: a_chnk_vec(i) = rem 22: b_chnk_vec(i) = rem / n 23: rem = 0 24: endif 25: continue 26: 27: ! calculate displacement vectors 28: a_disp_vec(1) = (pivot*n) 29: b_disp_vec(1) = pivot 30: do i = 2, p 31: a_disp_vec(i) = a_disp_vec(i-1) + a_chnk_vec(i-1) 32: b_disp_vec(i) = b_disp_vec(i-1) + b_chnk_vec(i-1) 33: continue 34: 35: ! fetch the pivot equation 36: do i = 1, n 37: pivot_eqn(i) = a(n-(i-1),pivot) 38: continue 39: 40: pivot_b = b(pivot) 41: endif ! my_rank.eq.0 42: 43: ! distribute the pivot equation 44: call MPI_BCAST(pivot_eqn, n, 45: MPI_DOUBLE_PRECISION, 46: root, MPI_COMM_WORLD, ierr) 47: 48: call MPI_BCAST(pivot_b, 1, 49: MPI_DOUBLE_PRECISION, 50: root, MPI_COMM_WORLD, ierr) 51: 52: ! distribute the chunk vector 53: call MPI_SCATTER(a_chnk_vec, 1, MPI_INTEGER, 54: chunk, 1, MPI_INTEGER, 55: root, MPI_COMM_WORLD, ierr) 56: 57: ! distribute the data 58: call MPI_SCATTERV(a, a_chnk_vec, a_disp_vec, 59: MPI_DOUBLE_PRECISION, 60: local_a, chunk, 61: MPI_DOUBLE_PRECISION, 62: root, MPI_COMM_WORLD,ierr) 63: 64: call MPI_SCATTERV(b, b_chnk_vec, b_disp_vec, 65: MPI_DOUBLE_PRECISION, 66: local_b, chunk/n, 67: MPI_DOUBLE_PRECISION, 68: root, MPI_COMM_WORLD,ierr) 69: 70: ! forward elimination 71: do j = 1, (chunk/n) 72: xmult = local_a((n-(pivot-1)),j) / pivot_eqn(pivot) 73: do i = (n-pivot), 1, -1 74: local_a(i,j) = local_a(i,j) - (xmult * pivot_eqn(n-(i-1))) 75: continue 76: 77: local_b(j) = local_b(j) - (xmult * pivot_b) 78: continue 79: 80: ! restore the data to root 81: call MPI_GATHERV(local_a, chunk, 82: MPI_DOUBLE_PRECISION, 83: a, a_chnk_vec, a_disp_vec, 84: MPI_DOUBLE_PRECISION, 85: root, MPI_COMM_WORLD, ierr) 86: 87: call MPI_GATHERV(local_b, chunk/n, 88: MPI_DOUBLE_PRECISION, 89: b, b_chnk_vec, b_disp_vec, 90: MPI_DOUBLE_PRECISION, 91: root, MPI_COMM_WORLD, ierr) 92: continue ! end of main loop 93: 94: ! backwards substitution done in parallel (not shown)
This code lacks some of the declarations for the variables, but most of the variables are self-explanatory. The code also attempts to do some load balancing via the chunk
variable. chunk
is also used to determine how much data to send, as the amount of data needed in each step gets progressively smaller. Making chunk
smaller will therefor decrease the amount of time spent in communication, thus yielding better runtimes. The other variable of note is root
, which refers to the root processor, the processor that controls the rest of the processors.
The code effectively begins its parallel section at line 4. Lines 5-41 have the root processor setting the chunk size and setting up the data to be passed to the other processors. In lines 43-68, the root processor sends the necessary data to the other processors. The functions MPI_BCAST
, MPI_SCATTER
, and MPI_SCATTERV
serve as either a "send" or a "receive", depending on which processor is executing them; on the root, they act as a send, while on all other processors, they act as a receive[3]. In lines 70-78, each processor is performing the forward elimination on its chunk of data. Finally, the data from each processor is sent back to the root processor using the MPI_GATHERV
function, which also functions as either a "send" or a receive", only the root processor is now the receiver and the other processors are the senders. All of this code is executed for each pivot point in the matrix. Backwards substitution is then done sequentially on the root processor.
The key elements of Message Passing in this code example are the communication via the MPI_
functions and the root processor performing some set-up of data to be passed on its own. This code is using the MPI library to support parallelization.
Message Passing
The following section of code implements Gaussian Elimination via message passing, using MPI. It was taken from a paper by S.F.McGinn and R.E.Shaw from the University of New Brunswick, New Brunswick, Canada.[1]
1: root = 0 2: chunk = n**2/p 3: ! main loop 4: do pivot = 1, n-1 5: ! root maintains communication 6: if (my_rank.eq.0) then 7: ! adjust the chunk size 8: if (MOD(pivot, p).eq.0) then 9: chunk = chunk - n 10: endif 11: 12: ! calculate chunk vectors 13: rem = MOD((n**2-(n*pivot)),chunk) 14: tmp = 0 15: do i = 1, p 16: tmp = tmp + chunk 17: if (tmp.le.(n**2-(n*pivot))) then 18: a_chnk_vec(i) = chunk 19: b_chnk_vec(i) = chunk / n 20: else 21: a_chnk_vec(i) = rem 22: b_chnk_vec(i) = rem / n 23: rem = 0 24: endif 25: continue 26: 27: ! calculate displacement vectors 28: a_disp_vec(1) = (pivot*n) 29: b_disp_vec(1) = pivot 30: do i = 2, p 31: a_disp_vec(i) = a_disp_vec(i-1) + a_chnk_vec(i-1) 32: b_disp_vec(i) = b_disp_vec(i-1) + b_chnk_vec(i-1) 33: continue 34: 35: ! fetch the pivot equation 36: do i = 1, n 37: pivot_eqn(i) = a(n-(i-1),pivot) 38: continue 39: 40: pivot_b = b(pivot) 41: endif ! my_rank.eq.0 42: 43: ! distribute the pivot equation 44: call MPI_BCAST(pivot_eqn, n, 45: MPI_DOUBLE_PRECISION, 46: root, MPI_COMM_WORLD, ierr) 47: 48: call MPI_BCAST(pivot_b, 1, 49: MPI_DOUBLE_PRECISION, 50: root, MPI_COMM_WORLD, ierr) 51: 52: ! distribute the chunk vector 53: call MPI_SCATTER(a_chnk_vec, 1, MPI_INTEGER, 54: chunk, 1, MPI_INTEGER, 55: root, MPI_COMM_WORLD, ierr) 56: 57: ! distribute the data 58: call MPI_SCATTERV(a, a_chnk_vec, a_disp_vec, 59: MPI_DOUBLE_PRECISION, 60: local_a, chunk, 61: MPI_DOUBLE_PRECISION, 62: root, MPI_COMM_WORLD,ierr) 63: 64: call MPI_SCATTERV(b, b_chnk_vec, b_disp_vec, 65: MPI_DOUBLE_PRECISION, 66: local_b, chunk/n, 67: MPI_DOUBLE_PRECISION, 68: root, MPI_COMM_WORLD,ierr) 69: 70: ! forward elimination 71: do j = 1, (chunk/n) 72: xmult = local_a((n-(pivot-1)),j) / pivot_eqn(pivot) 73: do i = (n-pivot), 1, -1 74: local_a(i,j) = local_a(i,j) - (xmult * pivot_eqn(n-(i-1))) 75: continue 76: 77: local_b(j) = local_b(j) - (xmult * pivot_b) 78: continue 79: 80: ! restore the data to root 81: call MPI_GATHERV(local_a, chunk, 82: MPI_DOUBLE_PRECISION, 83: a, a_chnk_vec, a_disp_vec, 84: MPI_DOUBLE_PRECISION, 85: root, MPI_COMM_WORLD, ierr) 86: 87: call MPI_GATHERV(local_b, chunk/n, 88: MPI_DOUBLE_PRECISION, 89: b, b_chnk_vec, b_disp_vec, 90: MPI_DOUBLE_PRECISION, 91: root, MPI_COMM_WORLD, ierr) 92: continue ! end of main loop 93: 94: ! backwards substitution done in parallel (not shown)
This code lacks some of the declarations for the variables, but most of the variables are self-explanatory. The code also attempts to do some load balancing via the chunk
variable. chunk
is also used to determine how much data to send, as the amount of data needed in each step gets progressively smaller. Making chunk
smaller will therefor decrease the amount of time spent in communication, thus yielding better runtimes. The other variable of note is root
, which refers to the root processor, the processor that controls the rest of the processors.
The code effectively begins its parallel section at line 4. Lines 5-41 have the root processor setting the chunk size and setting up the data to be passed to the other processors. In lines 43-68, the root processor sends the necessary data to the other processors. The functions MPI_BCAST
, MPI_SCATTER
, and MPI_SCATTERV
serve as either a "send" or a "receive", depending on which processor is executing them; on the root, they act as a send, while on all other processors, they act as a receive[3]. In lines 70-78, each processor is performing the forward elimination on its chunk of data. Finally, the data from each processor is sent back to the root processor using the MPI_GATHERV
function, which also functions as either a "send" or a receive", only the root processor is now the receiver and the other processors are the senders. All of this code is executed for each pivot point in the matrix. Backwards substitution is then done sequentially on the root processor.
The key elements of Message Passing in this code example are the communication via the MPI_
functions and the root processor performing some set-up of data to be passed on its own. This code is using the MPI library to support parallelization.
Definitions
- HPF - High Performance FORTRAN
- MPI - Message Passing Interface, an API used for supporting message passing across processes.
References
1. S.F.McGinn and R.E.Shaw, University of New Brunswick, Parallel Gaussian Elimination Using OpenMP and MPI
2. Ian Foster, Argonne National Laboratory, Case Study: Gaussian Elimination
3. MPI: A Message-Passing Interface Standard
4. Wikipedia's FORTRAN page
5. Wikipedia's Gaussian Elimination page