CSC/ECE 506 Fall 2007/wiki2 4 LA
Topic: Parallelizing an application
Pick another parallel application, not covered in the text, and less than 7 years old, and describe the various steps in parallelizing it (decomposition, assignment, orchestration, and mapping). You may use an example from the peer-reviewed literature, or a Web page. You do not have to go into great detail, but you should describe enough about these four stages to make the algorithm interesting.
LAMMPS Algorithm:
The LAMMPS (Large-scale Atomic/Molecular Massively Parallel Simulator) suite is a classical molecular dynamics code developed at Sandia National Laboratories in New Mexico. The code models an ensemble of particles in a solid, liquid, or gaseous state. It can model atomic, polymeric, biological, metallic, or granular systems using a variety of force fields and boundary conditions, and it can be easily modified and extended. LAMMPS is distributed as open source code and is an integral part of the SPEC MPI 2007 package used to benchmark systems that use the Message-Passing Interface. LAMMPS was created in 2003.
Sequential Algorithm:
The algorithm is performed for every atom. The initialization step sets up parameters such as the number of particles, the initial velocity, and the temperature. Once initialization is complete, the required quantities are calculated (the flow chart gives force/energy as an example). After the quantities are calculated, the necessary boundary conditions are applied and the atom schemes are integrated to obtain the desired results. This step is repeated for all the atom schemes. Once results for all the atom schemes are complete, they are analyzed and presented using visualization schemes for further study.
[Figure: FlowChart.jpg — flow chart of the sequential algorithm]
Decomposition
The LAMMPS suite provides two levels of concurrency, much like the Ocean problem. Function parallelism is available across a grid where quantities such as force, energy, temperature, and pressure are computed; these computations are independent per atom. Each atom could be assigned to its own processor, but such fine granularity imposes heavy communication cost, since the computation for each atom depends on its neighboring atoms.
Instead, the LAMMPS suite decomposes the domain into a set of equal-sized boxes. Since nearby atoms are placed on the same processor, only neighboring atoms that reside on different processors need to be communicated. The decomposition of the LAMMPS algorithm is therefore spatial; the computation cost is O(N/P) and the communication cost is O(N/P). The LAMMPS suite is documented to have a maximum speedup of 7.5 to 8 versus a similar sequential molecular dynamics suite.
Assignment
The LAMMPS suite takes the following approach in assigning the tasks defined above. The atoms and molecules in the system (known as a World in the LAMMPS suite) are divided spatially into equal-sized boxes, and each box is assigned to a processor. This division minimizes communication overhead, since most atom interactions occur within a box. There is still communication for atoms near the borders, and it is handled as a boundary condition. Atoms and molecules in the system can be mobile and move across boxes. Such activity triggers the "Exchange" function, which transfers ownership of atoms from one processor to another. Also, since the division of atoms is spatial, some boxes might be saturated with atoms while other boxes have barely any. This in turn is a cause of the load imbalance that can occur in this suite.
Orchestration
For computational efficiency, LAMMPS uses neighbor lists to keep track of the neighboring atoms. The lists are optimized for systems with particles that are repulsive at short distances, so that the local density of particles never becomes too large. Communication is also minimized by replicating the force computations for boundary atoms. To increase computational efficiency further, the algorithm uses different timescales for different force computations. On parallel machines, LAMMPS uses spatial-decomposition techniques to partition the simulation domain into small three-dimensional sub-domains, one of which is assigned to each processor.
The following snippet of code shows how the temperature is calculated for each sub-domain. A sub-domain is owned by a single processor, and that constitutes a task. The temperature of the sub-domain is derived from the kinetic energy of each atom in the sub-domain. For each atom, the square of the velocity in each dimension is accumulated and multiplied by the mass of the atom. The accumulated value of each sub-domain is then combined across all processors of the world using the MPI_Allreduce function, which sums the contributions and returns the total to every processor, yielding a value for the whole domain. Note the recount function near the end of the snippet, where the atoms are recounted. This happens if the atoms are mobile: it is then possible for some atoms to move between boxes, which triggers the exchange function between processors.
double ComputeTemp::compute_scalar()
{
  double **v = atom->v;
  double *mass = atom->mass;
  double *rmass = atom->rmass;
  int *type = atom->type;
  int *mask = atom->mask;
  int nlocal = atom->nlocal;

  double t = 0.0;

  // Accumulate m*v^2 over the atoms this processor owns; rmass holds
  // per-atom masses when they are not defined per type.
  if (mass) {
    for (int i = 0; i < nlocal; i++)
      if (mask[i] & groupbit)
        t += (v[i][0]*v[i][0] + v[i][1]*v[i][1] + v[i][2]*v[i][2]) *
             mass[type[i]];
  } else {
    for (int i = 0; i < nlocal; i++)
      if (mask[i] & groupbit)
        t += (v[i][0]*v[i][0] + v[i][1]*v[i][1] + v[i][2]*v[i][2]) * rmass[i];
  }

  // Sum the per-processor contributions across the whole world.
  MPI_Allreduce(&t,&scalar,1,MPI_DOUBLE,MPI_SUM,world);
  if (dynamic) recount();
  scalar *= tfactor;
  return scalar;
}
The next snippet shows how the six independent components of the kinetic energy tensor are computed for a sub-domain. For each atom owned by this processor, the mass-weighted products of the velocity components (xx, yy, zz, xy, xz, yz) are accumulated into a six-element vector. The MPI_Allreduce call then sums these vectors across all processors of the World and returns the combined result to every processor. Note that the six items passed to MPI_Allreduce are the six unique components of this symmetric tensor, one per element of the vector.
void ComputeTemp::compute_vector()
{
  int i;

  double **v = atom->v;
  double *mass = atom->mass;
  double *rmass = atom->rmass;
  int *type = atom->type;
  int *mask = atom->mask;
  int nlocal = atom->nlocal;

  // Accumulate the six unique components of the kinetic energy tensor.
  double massone,t[6];
  for (i = 0; i < 6; i++) t[i] = 0.0;

  for (i = 0; i < nlocal; i++)
    if (mask[i] & groupbit) {
      if (mass) massone = mass[type[i]];
      else massone = rmass[i];
      t[0] += massone * v[i][0]*v[i][0];
      t[1] += massone * v[i][1]*v[i][1];
      t[2] += massone * v[i][2]*v[i][2];
      t[3] += massone * v[i][0]*v[i][1];
      t[4] += massone * v[i][0]*v[i][2];
      t[5] += massone * v[i][1]*v[i][2];
    }

  // Sum the per-processor tensors and convert to energy units.
  MPI_Allreduce(t,vector,6,MPI_DOUBLE,MPI_SUM,world);
  for (i = 0; i < 6; i++) vector[i] *= force->mvv2e;
}
Mapping
The LAMMPS suite utilizes the Message-Passing parallel computing model: each processor runs its own copy of the code, operates on the data it owns, and sends, receives, and broadcasts data as necessary. The LAMMPS suite defines a Universe to which all the processors belong, and a number of Worlds in case several unrelated simulations must run. If all available processors are used to tackle a single problem, the Universe is said to contain one World. Each processor knows some information about the Universe, such as its processor ID, the number of processors in the Universe, the World it belongs to, the number of processors in its World, and the total number of Worlds. Each processor also holds the atoms and molecules in its sub-domain and their count. In each World there exists one processor designated the Root processor.
The Message-Passing interface is defined for each processor so that it can communicate with its six neighboring processors in its three-dimensional World. Each processor works within its own three-dimensional box, where it is responsible for a collection of atoms, and every box is the same size as the others. The LAMMPS suite divides the atoms among processors spatially in what is called "Spatial Decomposition." Although this design decision is scalable, it does not guarantee load balancing.