Expertiza_Wiki - User contributions [en]

CSC/ECE 506 Fall 2007/wiki2 4 LA

2007-09-25T02:58:58Z

Laaboue: /* '''References''' */

'''Topic: Parallelizing an application'''

Pick another parallel application, not covered in the text, and less than 7 years old, and describe the various steps in parallelizing it (decomposition, assignment, orchestration, and mapping). You may use an example from the peer-reviewed literature, or a Web page. You do not have to go into great detail, but you should describe enough about these four stages to make the algorithm interesting.

=='''LAMMPS Algorithm:''' ==

The LAMMPS ('''L'''arge-scale '''A'''tomic/'''M'''olecular '''M'''assively '''P'''arallel '''S'''imulator) suite is a classical [http://en.wikipedia.org/wiki/Molecular_dynamics molecular dynamics] code developed at [http://www.sandia.gov/index.html Sandia National Labs], New Mexico. It models an ensemble of atoms or molecules in a solid, liquid, or gaseous material. It can model [http://en.wikipedia.org/wiki/Atom atomic], [http://en.wikipedia.org/wiki/Polymer polymeric], [http://en.wikipedia.org/wiki/Biology biological], [http://en.wikipedia.org/wiki/Metal metallic], or [http://en.wikipedia.org/wiki/Granular_material granular] systems using a variety of [http://en.wikipedia.org/wiki/Force_field_%28physics%29 force fields] and boundary conditions, and it can be easily modified and extended. LAMMPS is distributed as [http://lammps.sandia.gov/download.html open source code]. It is an integral part of the [http://www.spec.org/mpi2007/ SPEC MPI 2007] package used to [http://en.wikipedia.org/wiki/Benchmark_%28computing%29 benchmark] systems using the [http://en.wikipedia.org/wiki/Message_Passing_Interface Message-Passing Interface]. LAMMPS was created in 2003.

==='''Sequential Algorithm:'''===
The sequential algorithm applies the same steps to every atom. The initialization step sets each atom's starting parameters, such as its initial [http://en.wikipedia.org/wiki/Velocity velocity] and [http://en.wikipedia.org/wiki/Temperature temperature]. Once initialization is complete, the required quantities are calculated (the flow chart below shows force/energy as an example). The necessary boundary conditions are then applied and the atom schemes are integrated to obtain the desired results. This step is repeated for all the atom schemes. When all of them are complete, the results are analyzed and presented using visualization schemes for further study.
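The steps described above (initialize, compute forces, integrate, apply boundary conditions, repeat) can be sketched as a small sequential loop. This is only an illustration under stated assumptions, not LAMMPS code: the one-dimensional <code>Particle</code> type, the harmonic force model, the periodic box, and the velocity Verlet integrator are all chosen here for brevity.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical 1D particle: position, velocity, force, mass.
struct Particle { double x, v, f, m; };

// Assumed force model: a harmonic well centered in the box, f = -k (x - box/2).
void compute_forces(std::vector<Particle>& atoms, double k, double box) {
  for (Particle& p : atoms) p.f = -k * (p.x - 0.5 * box);
}

// Periodic boundary condition: wrap every position back into [0, box).
void apply_boundary(std::vector<Particle>& atoms, double box) {
  for (Particle& p : atoms) p.x -= box * std::floor(p.x / box);
}

// One velocity Verlet step per iteration: half kick, drift, new forces, half kick.
void run(std::vector<Particle>& atoms, double k, double box, double dt, int nsteps) {
  assert(dt > 0.0 && box > 0.0 && nsteps >= 0);
  compute_forces(atoms, k, box);
  for (int step = 0; step < nsteps; ++step) {
    for (Particle& p : atoms) {
      p.v += 0.5 * dt * p.f / p.m;
      p.x += dt * p.v;
    }
    apply_boundary(atoms, box);
    compute_forces(atoms, k, box);
    for (Particle& p : atoms) p.v += 0.5 * dt * p.f / p.m;
  }
}

// Analysis step: kinetic plus potential energy of the whole system.
double total_energy(const std::vector<Particle>& atoms, double k, double box) {
  double e = 0.0;
  for (const Particle& p : atoms) {
    const double dx = p.x - 0.5 * box;
    e += 0.5 * p.m * p.v * p.v + 0.5 * k * dx * dx;
  }
  return e;
}
```

Calling <code>run</code> on a single particle displaced from the center of the box and then checking <code>total_energy</code> is a quick sanity test: the energy should stay close to its initial value.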

<center>http://pg.ece.ncsu.edu/mediawiki/images/6/6a/FlowChart.jpg</center>

==='''Decomposition'''===
The LAMMPS suite provides two levels of [http://en.wikipedia.org/wiki/Concurrency_%28computer_science%29 concurrency], just like the ocean problem. The first is function parallelism across the grid, where quantities such as force, energy, temperature, and pressure are computed. The second is across atoms: the same computations are carried out for every atom. Each atom could be assigned to its own processor, but such [http://en.wikipedia.org/wiki/Granularity fine granularity] imposes a heavy communication cost, since the computation for each atom depends on its neighboring atoms.

The LAMMPS suite decomposes the domain into a set of equal-sized boxes. Since nearby atoms are placed on the same processor, only atoms whose neighbors live on a different processor need to be communicated. The decomposition is spatial: for N atoms on P processors, the computation cost is O(N/P) and the communication cost is O(N/P). The LAMMPS suite is documented to have a maximum [http://en.wikipedia.org/wiki/Speedup speedup] of 7.5 to 8 versus a similar sequential molecular dynamics suite.
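As a rough illustration of the equal-sized boxes described above, the following sketch maps an atom position to the rank of the processor that would own it. The <code>Grid</code> type, the function name, and the rank ordering (x varies fastest) are assumptions made for this example and are not taken from the LAMMPS source.

```cpp
#include <array>
#include <cassert>
#include <cmath>

// Hypothetical description of the decomposition: processors per dimension
// and the lower/upper corners of the whole simulation domain.
struct Grid {
  std::array<int, 3> procs;
  std::array<double, 3> lo;
  std::array<double, 3> hi;
};

// Return the rank of the box containing pos, numbering boxes with x fastest.
int owner_rank(const Grid& g, const std::array<double, 3>& pos) {
  int idx[3];
  for (int d = 0; d < 3; ++d) {
    assert(g.procs[d] > 0 && g.hi[d] > g.lo[d]);
    const double frac = (pos[d] - g.lo[d]) / (g.hi[d] - g.lo[d]);
    int i = static_cast<int>(std::floor(frac * g.procs[d]));
    if (i < 0) i = 0;                          // clamp atoms sitting on or
    if (i >= g.procs[d]) i = g.procs[d] - 1;   // outside the domain edge
    idx[d] = i;
  }
  return (idx[2] * g.procs[1] + idx[1]) * g.procs[0] + idx[0];
}
```

With a 2 x 2 x 2 grid over a cube of side 10, an atom at (6, 1, 1) falls in the second box along x and therefore belongs to rank 1 under this numbering.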

==='''Assignment'''===

The LAMMPS suite takes the following approach in assigning the tasks defined above. The atoms and molecules in the system (known as the World in the LAMMPS suite) are divided spatially into equal-sized boxes, and each box is assigned to a processor. This division minimizes communication overhead, since most atom interactions occur within a box. There is still communication between atoms at the borders of neighboring boxes, and that communication is handled by the boundary condition calculators. Atoms and molecules in the system can be mobile and can move across boxes. Such movement triggers an exchange function that transfers ownership of the atoms from one processor to another. Also, since the division of atoms is spatial, some boxes might be saturated with atoms while others contain hardly any. This is a cause of [http://en.wikipedia.org/wiki/Load_balancing_(computing) "Load Imbalance"] in this suite.
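The load imbalance mentioned above can be made concrete with a small sketch that counts atoms per box and compares the busiest box with the average. It is one-dimensional for brevity, and the function names and the imbalance metric (largest count divided by mean count) are assumptions chosen for illustration rather than anything defined by LAMMPS.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Count how many atoms fall in each of nboxes equal-width boxes on [lo, hi).
std::vector<int> count_per_box(const std::vector<double>& x,
                               double lo, double hi, int nboxes) {
  assert(nboxes > 0 && hi > lo);
  std::vector<int> counts(nboxes, 0);
  const double width = (hi - lo) / nboxes;
  for (double xi : x) {
    int b = static_cast<int>((xi - lo) / width);
    b = std::max(0, std::min(b, nboxes - 1));  // clamp stray atoms to an edge box
    ++counts[b];
  }
  return counts;
}

// Ratio of the busiest box to the average box; 1.0 means perfectly balanced.
double imbalance_factor(const std::vector<int>& counts) {
  assert(!counts.empty());
  const int total = std::accumulate(counts.begin(), counts.end(), 0);
  if (total == 0) return 1.0;
  const int busiest = *std::max_element(counts.begin(), counts.end());
  return static_cast<double>(busiest) * counts.size() / total;
}
```

Four atoms spread evenly over four boxes give a factor of 1.0, while four atoms crowded into one box give 4.0: the processor owning that box does all the work while the other three sit idle.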

==='''Orchestration''' ===

For computational efficiency, LAMMPS uses neighbor lists to keep track of the atoms near each atom. The lists are optimized for systems with particles that are repulsive at short distances, so that the local density of particles never becomes too large. Communication is also reduced by replicating the force computations for boundary atoms. To further increase computational efficiency, the algorithm uses different [http://en.wikipedia.org/wiki/Duration timescales] for different force computations. On parallel machines, LAMMPS uses spatial-decomposition techniques to partition the simulation domain into small three-dimensional sub-domains, one of which is assigned to each processor.
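A neighbor list can be sketched as follows. This is a deliberately simple, brute-force version written for this page: the names, the extra <code>skin</code> distance, and the all-pairs search are assumptions, and production codes typically sort atoms into cells first to avoid the quadratic loop.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

using Vec3 = std::array<double, 3>;

// For each atom i, list the atoms j > i that lie within cutoff + skin.
// Storing each pair once (a "half" list) halves the storage and the work.
std::vector<std::vector<std::size_t>> build_neighbor_list(
    const std::vector<Vec3>& pos, double cutoff, double skin) {
  assert(cutoff > 0.0 && skin >= 0.0);
  const double r = cutoff + skin;
  const double r2 = r * r;  // compare squared distances to avoid sqrt
  std::vector<std::vector<std::size_t>> neighbors(pos.size());
  for (std::size_t i = 0; i < pos.size(); ++i) {
    for (std::size_t j = i + 1; j < pos.size(); ++j) {
      double d2 = 0.0;
      for (int d = 0; d < 3; ++d) {
        const double dx = pos[i][d] - pos[j][d];
        d2 += dx * dx;
      }
      if (d2 < r2) neighbors[i].push_back(j);
    }
  }
  return neighbors;
}
```

The force loop then visits only the pairs stored in the list instead of all pairs, and the list is rebuilt only occasionally, which is where the saving comes from.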

The following snippet of code shows how the temperature is calculated. Each sub-domain is owned by a single processor, and that constitutes a task. The temperature is derived from the [http://en.wikipedia.org/wiki/Kinetic_energy kinetic energy] of the atoms. For each local atom in the selected group, the squares of the three velocity components are summed and multiplied by the mass of the atom, and the result is accumulated into a per-processor partial sum. The MPI_Allreduce function then sums the partial values from all the processors in the world and returns the total to every processor, so no single root processor has to collect and redistribute the result. Finally, the total is scaled by tfactor to give a value for the whole domain. Note the recount function near the end of the snippet, where the atoms are recounted. This happens if the atoms are mobile. When the atoms are mobile, some of them can move between boxes, which triggers the exchange function between processors.

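<pre>
double ComputeTemp::compute_scalar()
{
  double **v = atom->v;          // per-atom velocities (x, y, z)
  double *mass = atom->mass;     // per-type masses (null if masses are per-atom)
  double *rmass = atom->rmass;   // per-atom masses
  int *type = atom->type;
  int *mask = atom->mask;
  int nlocal = atom->nlocal;     // number of atoms owned by this processor

  double t = 0.0;                // local sum of m * |v|^2

  if (mass) {
    for (int i = 0; i < nlocal; i++)
      if (mask[i] & groupbit)    // only atoms that belong to this group
        t += (v[i][0]*v[i][0] + v[i][1]*v[i][1] + v[i][2]*v[i][2]) *
          mass[type[i]];
  } else {
    for (int i = 0; i < nlocal; i++)
      if (mask[i] & groupbit)
        t += (v[i][0]*v[i][0] + v[i][1]*v[i][1] + v[i][2]*v[i][2]) * rmass[i];
  }

  // sum the local values over all processors; every processor gets the total
  MPI_Allreduce(&t,&scalar,1,MPI_DOUBLE,MPI_SUM,world);
  if (dynamic) recount();
  scalar *= tfactor;             // scale the summed m * |v|^2 to a temperature
  return scalar;
}
</pre>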
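The reduction pattern in this snippet can be imitated without MPI. The sketch below is a serial stand-in written for this page: each inner vector plays the role of one processor's sub-domain, <code>allreduce_sum</code> mimics what MPI_Allreduce with MPI_SUM delivers to every processor, and dividing by <code>dof * kB</code> is an assumed form of the scaling that tfactor performs. None of these names come from LAMMPS.

```cpp
#include <cassert>
#include <vector>

// Hypothetical atom record: three velocity components and a mass.
struct LocalAtom { double vx, vy, vz, mass; };

// Work done independently on each processor: sum of m * |v|^2 over its atoms.
double local_mv2(const std::vector<LocalAtom>& atoms) {
  double t = 0.0;
  for (const LocalAtom& a : atoms)
    t += (a.vx * a.vx + a.vy * a.vy + a.vz * a.vz) * a.mass;
  return t;
}

// Serial stand-in for MPI_Allreduce(..., MPI_SUM, ...): the value returned
// here is what every processor would receive.
double allreduce_sum(const std::vector<double>& partials) {
  double total = 0.0;
  for (double p : partials) total += p;
  return total;
}

// Temperature of the whole domain, assuming T = sum(m |v|^2) / (dof * kB).
double temperature(const std::vector<std::vector<LocalAtom>>& subdomains,
                   double dof, double kB) {
  assert(dof > 0.0 && kB > 0.0);
  std::vector<double> partials;
  for (const auto& sd : subdomains) partials.push_back(local_mv2(sd));
  return allreduce_sum(partials) / (dof * kB);
}
```

The point of the pattern is that the per-atom loop needs no communication at all; the only data exchanged is a single double per processor.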

The second snippet computes a vector quantity in the same way. For each local atom in the group, it accumulates the six independent components of the symmetric kinetic energy tensor: the xx, yy, and zz terms in t[0] to t[2], and the xy, xz, and yz cross terms in t[3] to t[5]. A single MPI_Allreduce call sums the six-element vectors from all the processors and returns the totals to every processor, after which each entry is scaled by force->mvv2e. The first snippet reduces one value and this one reduces six, which gives us a hint that multiple message sizes are used throughout the LAMMPS suite.

<pre>
void ComputeTemp::compute_vector()
{
  int i;

  double **v = atom->v;          // per-atom velocities (x, y, z)
  double *mass = atom->mass;     // per-type masses (null if masses are per-atom)
  double *rmass = atom->rmass;   // per-atom masses
  int *type = atom->type;
  int *mask = atom->mask;
  int nlocal = atom->nlocal;     // number of atoms owned by this processor

  double massone,t[6];           // t holds the xx, yy, zz, xy, xz, yz components
  for (i = 0; i < 6; i++) t[i] = 0.0;

  for (i = 0; i < nlocal; i++)
    if (mask[i] & groupbit) {
      if (mass) massone = mass[type[i]];
      else massone = rmass[i];
      t[0] += massone * v[i][0]*v[i][0];
      t[1] += massone * v[i][1]*v[i][1];
      t[2] += massone * v[i][2]*v[i][2];
      t[3] += massone * v[i][0]*v[i][1];
      t[4] += massone * v[i][0]*v[i][2];
      t[5] += massone * v[i][1]*v[i][2];
    }

  // sum all six components over all processors in one message
  MPI_Allreduce(t,vector,6,MPI_DOUBLE,MPI_SUM,world);
  for (i = 0; i < 6; i++) vector[i] *= force->mvv2e;
}
</pre>
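To see what the six accumulated values are, the following standalone sketch computes them for a handful of atoms. The <code>TensorAtom</code> type and the function name are made up for this example; the ordering of the components matches the snippet above.

```cpp
#include <array>
#include <cassert>
#include <vector>

// Hypothetical atom record: three velocity components and a mass.
struct TensorAtom { double vx, vy, vz, mass; };

// Six independent components of the symmetric tensor sum(m * v_a * v_b),
// stored in the order xx, yy, zz, xy, xz, yz.
std::array<double, 6> ke_tensor(const std::vector<TensorAtom>& atoms) {
  std::array<double, 6> t{};  // value-initialized to zero
  for (const TensorAtom& a : atoms) {
    assert(a.mass > 0.0);
    t[0] += a.mass * a.vx * a.vx;
    t[1] += a.mass * a.vy * a.vy;
    t[2] += a.mass * a.vz * a.vz;
    t[3] += a.mass * a.vx * a.vy;
    t[4] += a.mass * a.vx * a.vz;
    t[5] += a.mass * a.vy * a.vz;
  }
  return t;
}
```

The three diagonal entries add up to the same sum of m |v|^2 that the scalar routine accumulates, so the scalar quantity can be seen as the trace of this tensor before scaling.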

==='''Mapping''' ===

The LAMMPS suite uses the message-passing parallel computing model. Each processor has its own private memory and holds only its own data; it performs its operations and sends, receives, and broadcasts data as necessary.

The LAMMPS suite defines a Universe to which all the processors belong. The Universe can be split into a number of Worlds so that different, unrelated simulations can run at the same time. If all the available processors are used to tackle a single problem, the Universe contains one World. Each processor runs its own copy of the LAMMPS suite and knows some information about the Universe: its processor ID, the number of processors in the Universe, the World it belongs to, the number of processors in its World, and the total number of Worlds. Each processor also has information about all the atoms and molecules in its sub-domain and their count. Each World has one processor that is designated the Root processor.

The message-passing interface also lets each processor communicate with its six neighboring processors in its three-dimensional World. Each processor works within its own three-dimensional box, where it is responsible for a collection of atoms. The LAMMPS suite divides the atoms among processors spatially in what is called "Spatial Decomposition," and every processor owns a box of the same size. Although this design decision is [http://en.wikipedia.org/wiki/Scalability scalable], it does not guarantee [http://en.wikipedia.org/wiki/Load_balancing_%28computing%29 load balancing].
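The six neighboring processors of a box in a three-dimensional processor grid can be found with simple index arithmetic. The sketch below is an illustration only: the <code>ProcGrid</code> type, the rank numbering (x varies fastest), and the periodic wrap-around at the edges of the grid are assumptions, not details taken from the LAMMPS source.

```cpp
#include <array>
#include <cassert>

// Hypothetical processor grid: number of processors along x, y, and z.
struct ProcGrid { int px, py, pz; };

// Rank of the processor at grid coordinates (ix, iy, iz), wrapping
// out-of-range coordinates around periodically.
int rank_of(const ProcGrid& g, int ix, int iy, int iz) {
  ix = (ix % g.px + g.px) % g.px;
  iy = (iy % g.py + g.py) % g.py;
  iz = (iz % g.pz + g.pz) % g.pz;
  return (iz * g.py + iy) * g.px + ix;
}

// Neighbors in the order -x, +x, -y, +y, -z, +z.
std::array<int, 6> six_neighbors(const ProcGrid& g, int rank) {
  assert(g.px > 0 && g.py > 0 && g.pz > 0);
  assert(rank >= 0 && rank < g.px * g.py * g.pz);
  const int ix = rank % g.px;
  const int iy = (rank / g.px) % g.py;
  const int iz = rank / (g.px * g.py);
  std::array<int, 6> n = {
    rank_of(g, ix - 1, iy, iz), rank_of(g, ix + 1, iy, iz),
    rank_of(g, ix, iy - 1, iz), rank_of(g, ix, iy + 1, iz),
    rank_of(g, ix, iy, iz - 1), rank_of(g, ix, iy, iz + 1)};
  return n;
}
```

In a 3 x 3 x 3 grid, the central processor (rank 13) has neighbors 12, 14, 10, 16, 4, and 22 under this numbering. Each processor only ever exchanges boundary atoms with these six ranks, which keeps the communication pattern local regardless of the total number of processors.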

==='''References'''===

# [http://lammps.sandia.gov/ LAMMPS website]
# [http://etd.lib.fsu.edu/theses/available/etd-07122004-165317/unrestricted/02_JK_RestThesis.pdf Sequential algorithm]
# [http://www.spec.org Standard Performance Evaluation Corporation (SPEC)]
# [http://www.spec.org/mpi2007/ SPEC MPI 2007]
# S. J. Plimpton, Fast Parallel Algorithms for Short-Range Molecular Dynamics, J Comp Phys, 117, 1-19 (1995).
# S. J. Plimpton, R. Pollock, M. Stevens, Particle-Mesh Ewald and rRESPA for Parallel Molecular Dynamics Simulations, in Proc of the Eighth SIAM Conference on Parallel Processing for Scientific Computing, Minneapolis, MN (March 1997).

CSC/ECE 506 Fall 2007/wiki2 4 LA

2007-09-25T02:52:56Z

Laaboue: /* '''Mapping''' */


CSC/ECE 506 Fall 2007/wiki2 4 LA

2007-09-25T02:50:21Z

Laaboue: /* '''Orchestration''' */


CSC/ECE 506 Fall 2007/wiki2 4 LA

2007-09-25T02:44:17Z

Laaboue: /* '''Assignment''' */

'''Topic: Parallelizing an application'''

Pick another parallel application, not covered in the text, and less than 7 years old, and describe the various steps in parallelizing it (decomposition, assignment, orchestration, and mapping). You may use an example from the peer-reviewed literature, or a Web page. You do not have to go into great detail, but you should describe enough about these four stages to make the algorithm interesting.

=='''LAMMPS Algorithm:''' ==

The LAMMPS ('''L'''arge Scale '''A'''tomic/'''M'''olecular '''M'''assively '''P'''arallel '''S'''ystem) suite is a classical [http://en.wikipedia.org/wiki/Molecular_dynamics molecular dynamics] code developed at [http://www.sandia.gov/index.html Sandia National Labs], New Mexico. This algorithm models the ensemble of atoms or molecules in a solid, liquid or gaseous material. It can model [http://en.wikipedia.org/wiki/Atom atomic], [http://en.wikipedia.org/wiki/Polymer polymeric], [http://en.wikipedia.org/wiki/Biology biological], [http://en.wikipedia.org/wiki/Metal metallic], or [http://en.wikipedia.org/wiki/Granular_material granular] systems using a variety of [http://en.wikipedia.org/wiki/Force_field_%28physics%29 force fields] and boundary conditions and can be easily modified and extended. LAMMPS is distributed as an [http://lammps.sandia.gov/download.html open source code]. It is is an integral part of the [http://www.spec.org/mpi2007/ SPEC MPI 2007] package used to [http://en.wikipedia.org/wiki/Benchmark_%28computing%29 benchmark] systems using the [http://en.wikipedia.org/wiki/Message_Passing_Interface Message-Passing Interface]. LAMMPS was created in 2003.

==='''Sequential Algorithm:'''===
This algorithm is performed for every atom. The initialization step sets up the parameters of each atom, such as its initial [http://en.wikipedia.org/wiki/Velocity velocity] and [http://en.wikipedia.org/wiki/Temperature temperature]. Once initialization is complete, the required quantities are calculated (the flow chart below shows force/energy as an example). The necessary boundary conditions are then applied and the atom schemes are integrated to get the desired results. This step is repeated for all the atom schemes. Once all the atom schemes are complete, the results are analyzed and presented using visualization schemes for further study.

<center>http://pg.ece.ncsu.edu/mediawiki/images/6/6a/FlowChart.jpg</center>

==='''Decomposition'''===
The LAMMPS suite provides two levels of [http://en.wikipedia.org/wiki/Concurrency_%28computer_science%29 concurrency], just like the ocean problem. Function parallelism is performed across a grid, where quantities such as the force, energy, temperature, and pressure of each atom are computed. These computations can proceed concurrently for every atom. Each atom could be assigned to its own processor, but such [http://en.wikipedia.org/wiki/Granularity fine granularity] imposes a heavy communication cost, since the computation for each atom depends on its neighboring atoms.

The LAMMPS suite instead decomposes the domain into a set of equal-sized boxes. Since nearby atoms are placed on the same processor, only neighboring atoms that sit on different processors need to be communicated. The decomposition is spatial, with a computation cost of O(N/P) and a communication cost of O(N/P), where N is the number of atoms and P is the number of processors. The LAMMPS suite is documented to have a maximum [http://en.wikipedia.org/wiki/Speedup speedup] of 7.5 to 8 versus a similar sequential molecular dynamics suite.
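The spatial decomposition described above amounts to computing, for each atom position, which box contains it. The following is a minimal sketch under stated assumptions, not the LAMMPS implementation: the `Domain` struct and the `box_of` and `owner_of` functions are hypothetical names, the domain is assumed to be a rectangular block, and boxes are numbered with x varying fastest.

```cpp
#include <array>
#include <cmath>

// Hypothetical description of a rectangular simulation domain cut into
// procs[0] * procs[1] * procs[2] equal-sized boxes, one per processor.
struct Domain {
    std::array<double, 3> lo;   // lower corner of the domain
    std::array<double, 3> hi;   // upper corner of the domain
    std::array<int, 3> procs;   // number of boxes along x, y, z
};

// Grid coordinates of the box that contains position pos.
std::array<int, 3> box_of(const Domain& d, const std::array<double, 3>& pos) {
    std::array<int, 3> box{};
    for (int k = 0; k < 3; ++k) {
        double width = (d.hi[k] - d.lo[k]) / d.procs[k];
        int idx = static_cast<int>(std::floor((pos[k] - d.lo[k]) / width));
        // Clamp so that an atom exactly on the upper face stays in the last box.
        if (idx < 0) idx = 0;
        if (idx >= d.procs[k]) idx = d.procs[k] - 1;
        box[k] = idx;
    }
    return box;
}

// Flatten the grid coordinates into one owner ID (x varies fastest).
int owner_of(const Domain& d, const std::array<double, 3>& pos) {
    std::array<int, 3> b = box_of(d, pos);
    return b[0] + d.procs[0] * (b[1] + d.procs[1] * b[2]);
}
```

Because every box has the same width, finding the owner takes constant time per atom, which is what makes this decomposition cheap to maintain as atoms move.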

==='''Assignment'''===

The LAMMPS suite takes the following approach in assigning the tasks defined above. The atoms and molecules in the system (known as the World in the LAMMPS suite) are divided spatially into equal-sized boxes, and each box is assigned to a processor. This division minimizes communication overhead, since most of the atom interactions occur within a box. Atoms near the borders still interact with atoms in adjacent boxes, and that communication is handled by boundary condition calculators. Atoms and molecules in the system can be mobile and can move across boxes; such a move triggers an exchange function that transfers ownership of the atom from one processor to another. Also, since the division of atoms is spatial, some boxes might be saturated with atoms while others hold barely any. This is a cause of the [http://en.wikipedia.org/wiki/Load_balancing_(computing) "Load Imbalance"] that can occur in this suite.
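One way to quantify the load imbalance mentioned above is to compare the busiest box with the average box. The sketch below is illustrative only; `imbalance_factor` is a hypothetical helper rather than a LAMMPS function, and it assumes that the work per box is proportional to the number of atoms the box holds.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Ratio of the busiest box's atom count to the average count per box.
// 1.0 means perfectly balanced; larger values mean the other processors
// wait longer for the busiest one. Returns 0.0 for no boxes or no atoms.
double imbalance_factor(const std::vector<int>& atoms_per_box) {
    if (atoms_per_box.empty()) return 0.0;
    long total = std::accumulate(atoms_per_box.begin(), atoms_per_box.end(), 0L);
    if (total == 0) return 0.0;
    int busiest = *std::max_element(atoms_per_box.begin(), atoms_per_box.end());
    double mean = static_cast<double>(total) / atoms_per_box.size();
    return busiest / mean;
}
```

Four boxes holding 100 atoms each give a factor of 1, while four boxes holding 400, 0, 0, and 0 atoms give a factor of 4: one processor does all the work and the equal-sized boxes buy no speedup.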

==='''Orchestration''' ===

For computational efficiency, LAMMPS uses neighbor lists to keep track of each atom's neighboring atoms. The lists are optimized for systems with particles that are repulsive at short distances, so that the local density of particles never becomes too large. Communication is also reduced by replicating the force computations of boundary atoms. To further increase computational efficiency, the algorithm uses different timescales for different force computations. On parallel machines, LAMMPS uses spatial-decomposition techniques to partition the simulation domain into small three-dimensional sub-domains, one of which is assigned to each processor.
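The idea of a neighbor list can be sketched as follows. This is a deliberately simple brute-force pair search, swapped in for illustration; it is not the optimized list construction that LAMMPS uses. The `Vec3` alias, the function name `build_neighbor_lists`, and the single `cutoff` distance are assumptions made for the example.

```cpp
#include <array>
#include <cstddef>
#include <vector>

using Vec3 = std::array<double, 3>;

// For every atom, list the indices of the other atoms closer than cutoff.
// Brute-force O(N^2) search over all pairs; illustrative only.
std::vector<std::vector<std::size_t>> build_neighbor_lists(
        const std::vector<Vec3>& x, double cutoff) {
    std::vector<std::vector<std::size_t>> neigh(x.size());
    const double cut2 = cutoff * cutoff;  // compare squared distances
    for (std::size_t i = 0; i < x.size(); ++i) {
        for (std::size_t j = i + 1; j < x.size(); ++j) {
            double dx = x[i][0] - x[j][0];
            double dy = x[i][1] - x[j][1];
            double dz = x[i][2] - x[j][2];
            if (dx * dx + dy * dy + dz * dz < cut2) {
                neigh[i].push_back(j);  // record the pair in both lists
                neigh[j].push_back(i);
            }
        }
    }
    return neigh;
}
```

Once the lists exist, a force loop only visits the listed pairs instead of all pairs, which is where the saving comes from when the local density stays bounded.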

The following snippet of code shows how the temperature is calculated. A sub-domain is owned by a single processor, and that constitutes a task. The temperature is derived from the kinetic energy of the atoms. For each atom in the sub-domain, the squares of the velocity components in the three dimensions are added together and multiplied by the mass of the atom, and the result is accumulated in t. The MPI_Allreduce function then sums the accumulated values of all the sub-domains in the World and returns the total to every processor, not only to the root; scaling that total by tfactor gives the value for the whole domain. Note the recount function near the end of the snippet, where the atoms are recounted. This happens if the atoms are mobile. When the atoms are mobile, some of them can move between boxes, which triggers the exchange function between processors.

double ComputeTemp::compute_scalar()
{
  double **v = atom->v;          // velocities of the atoms in this sub-domain
  double *mass = atom->mass;     // masses indexed by atom type (may be null)
  double *rmass = atom->rmass;   // masses indexed by atom
  int *type = atom->type;
  int *mask = atom->mask;
  int nlocal = atom->nlocal;     // number of atoms owned by this processor

  double t = 0.0;

  // Accumulate m * |v|^2 over the local atoms that belong to the group.
  if (mass) {
    for (int i = 0; i < nlocal; i++)
      if (mask[i] & groupbit)
        t += (v[i][0]*v[i][0] + v[i][1]*v[i][1] + v[i][2]*v[i][2]) *
          mass[type[i]];
  } else {
    for (int i = 0; i < nlocal; i++)
      if (mask[i] & groupbit)
        t += (v[i][0]*v[i][0] + v[i][1]*v[i][1] + v[i][2]*v[i][2]) * rmass[i];
  }

  // Sum the local values over all processors of the World.
  MPI_Allreduce(&t,&scalar,1,MPI_DOUBLE,MPI_SUM,world);
  if (dynamic) recount();
  scalar *= tfactor;
  return scalar;
}
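The data flow of this routine can be imitated without MPI by looping over the per-processor atom sets in a single program. The sketch below is only a serial stand-in for the reduction: the `Atom` struct and the `local_mv2` and `global_scalar` functions are hypothetical, and `tfactor` is treated as an opaque conversion factor supplied by the caller.

```cpp
#include <vector>

// Hypothetical minimal atom record: a mass and three velocity components.
struct Atom {
    double mass;
    double vx, vy, vz;
};

// Local step: sum of m * |v|^2 over the atoms that one processor owns.
double local_mv2(const std::vector<Atom>& atoms) {
    double t = 0.0;
    for (const Atom& a : atoms)
        t += a.mass * (a.vx * a.vx + a.vy * a.vy + a.vz * a.vz);
    return t;
}

// Serial stand-in for the sum reduction: add the partial sums of all
// processors, then scale by the conversion factor.
double global_scalar(const std::vector<std::vector<Atom>>& atoms_by_proc,
                     double tfactor) {
    double sum = 0.0;
    for (const std::vector<Atom>& local : atoms_by_proc) sum += local_mv2(local);
    return sum * tfactor;
}
```

Each processor contributes one number regardless of how many atoms it owns, so the cost of the reduction depends on the number of processors, not on the number of atoms.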

The next snippet computes a vector of six values instead of a single one. For each atom in the sub-domain it accumulates the mass-weighted products of the velocity components: the xx, yy, and zz terms in t[0] to t[2] and the xy, xz, and yz cross terms in t[3] to t[5]. Together these are the six independent components of the symmetric kinetic-energy tensor of the sub-domain. MPI_Allreduce is passed all six items at once; it sums each component over the sub-domains of the World and delivers the totals to every processor, after which each one is scaled by force->mvv2e.

void ComputeTemp::compute_vector()
{
  int i;

  double **v = atom->v;
  double *mass = atom->mass;
  double *rmass = atom->rmass;
  int *type = atom->type;
  int *mask = atom->mask;
  int nlocal = atom->nlocal;

  double massone,t[6];
  for (i = 0; i < 6; i++) t[i] = 0.0;

  for (i = 0; i < nlocal; i++)
    if (mask[i] & groupbit) {
      if (mass) massone = mass[type[i]];
      else massone = rmass[i];
      t[0] += massone * v[i][0]*v[i][0];
      t[1] += massone * v[i][1]*v[i][1];
      t[2] += massone * v[i][2]*v[i][2];
      t[3] += massone * v[i][0]*v[i][1];
      t[4] += massone * v[i][0]*v[i][2];
      t[5] += massone * v[i][1]*v[i][2];
    }

  MPI_Allreduce(t,vector,6,MPI_DOUBLE,MPI_SUM,world);
  for (i = 0; i < 6; i++) vector[i] *= force->mvv2e;
}

==='''Mapping''' ===

The LAMMPS suite uses the message-passing parallel computing model. Each processor therefore holds only its own data in private memory, performs its operations locally, and sends, receives, and broadcasts data as necessary. LAMMPS defines a Universe to which all the processors belong, and it divides the Universe into a number of Worlds when different unrelated simulations should run. If all the available processors are used to tackle a single problem, the Universe contains one World. Each processor runs its own copy of the LAMMPS suite and knows its processor ID, the number of processors in the Universe, the World it belongs to, the number of processors in its World, and the total number of Worlds. In each World, one processor is designated the Root processor.

Each processor works within its own three-dimensional box and is responsible for the atoms inside it; it holds information about all the atoms and molecules in its sub-domain and their count. This spatial division of the atoms among processors is called "Spatial Decomposition." The message-passing interface lets each processor communicate with its six neighboring processors in its three-dimensional World. Every processor owns a box of the same size as the other processors. Although this design decision is scalable, it does not guarantee load balancing.
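The six neighbors of a processor follow directly from its position in the three-dimensional grid of boxes. The sketch below shows one way to compute them; it is an illustration and not the LAMMPS code. The function names are hypothetical, the grid is assumed to wrap around periodically in every dimension, and processor IDs are assumed to be numbered with x varying fastest. In an MPI program, the standard Cartesian topology routines such as MPI_Cart_shift provide this kind of lookup.

```cpp
#include <array>

// Wrap index i into [0, n) so that the processor grid is periodic.
int wrap(int i, int n) { return ((i % n) + n) % n; }

// Flatten grid coordinates into a processor ID (x varies fastest).
int rank_of(const std::array<int, 3>& c, const std::array<int, 3>& procs) {
    return c[0] + procs[0] * (c[1] + procs[1] * c[2]);
}

// IDs of the six face neighbors, ordered -x, +x, -y, +y, -z, +z.
std::array<int, 6> six_neighbors(const std::array<int, 3>& me,
                                 const std::array<int, 3>& procs) {
    std::array<int, 6> out{};
    for (int dim = 0; dim < 3; ++dim) {
        for (int side = 0; side < 2; ++side) {
            std::array<int, 3> c = me;  // start from my own coordinates
            c[dim] = wrap(me[dim] + (side == 0 ? -1 : 1), procs[dim]);
            out[2 * dim + side] = rank_of(c, procs);
        }
    }
    return out;
}
```

Each processor exchanges border atoms only with these six IDs, so the number of communication partners stays fixed as the processor count grows.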

'''References'''

1: [http://lammps.sandia.gov/ LAMMPS website]
2: [http://etd.lib.fsu.edu/theses/available/etd-07122004-165317/unrestricted/02_JK_RestThesis.pdf Sequential algorithm]

CSC/ECE 506 Fall 2007/wiki2 4 LA

2007-09-25T02:42:25Z

Laaboue: /* '''LAMMPS Algorithm:''' */


CSC/ECE 506 Fall 2007/wiki2 4 LA

2007-09-25T02:38:09Z

Laaboue: /* '''LAMMPS Algorithm:''' */


CSC/ECE 506 Fall 2007/wiki2 4 LA

2007-09-25T02:26:38Z

Laaboue: /* '''LAMMPS Algorithm:''' */

'''Topic: Parallelizing an application'''

Pick another parallel application, not covered in the text, and less than 7 years old, and describe the various steps in parallelizing it (decomposition, assignment, orchestration, and mapping). You may use an example from the peer-reviewed literature, or a Web page. You do not have to go into great detail, but you should describe enough about these four stages to make the algorithm interesting.

=='''LAMMPS Algorithm:''' ==

The LAMMPS ('''L'''arge Scale '''A'''tomic/'''M'''olecular '''M'''assively '''P'''arallel '''S'''ystem) suite is a classical [http://en.wikipedia.org/wiki/Molecular_dynamics molecular dynamics] code developed at [http://www.sandia.gov/index.html Sandia National Labs], New Mexico. This algorithm models the ensamble of particles in a solid, liquid or gaseous state. It can model atomic, polymeric, biological, metallic, or granular systems using a variety of force fields and boundary conditions and can be easily modified and extended. LAMMPS is distributed as an [http://lammps.sandia.gov/download.html open source code]. It is is an integral part of the SPEC MPI 2007 package used to benchmark systems using the Message-Passing Interface. LAMMPS was created in 2003.

==='''Sequential Algorithm:'''===
This algorithm is performed for every atoms.The initialization step sets up the various parameters for the atom like number of particles, initial velocity, temperature etc. Once the initialization is completed the various required parameters are calculated(In this flow chart the force/energy is given as an example). After the parameters are calculated the necessary boundary conditions are applied & the atom schemes are integrated to get the desired results. This step is repeated for all the atom schemes. After the results for all the atom schemes are completed the results are analyzed & presented using visualizations schemes for further study.

<center>http://pg.ece.ncsu.edu/mediawiki/images/6/6a/FlowChart.jpg</center>

==='''Decomposition'''===
The LAMMPS suite provides two levels of concurrency just like the ocean problem. The function parallelism is performed across a grid where the parameters like force, energy, temperature and pressure of the atom are computed. These computations are independent per atom. Each atom can be computed by one processor but such fine granularity imposes heavy cost on communication since the computation for each atom depends on neighboring atoms.

The LAMMPS suite decomposes the domain into a set of equal sized boxes. Since near by atoms are placed on the same processor, only neighboring atoms on different processors need to be communicated. The decomposition of the LAMMPS algorithm is spatial and the computation cost is of O(N/P) and the communication cost is of O(N/P). The LAMMPS suite is documented to have a maximum speedup of 7.5 to 8 versus a similar sequential molecular dynamics suite.

==='''Assignment'''===

The LAMMPS suite takes the following approach in assigning the tasks defined above. The atoms and molecules in the system (known as World in the LAMMPS suite) are divided spatially into equal sizes boxes. Each box is assigned a processor. Such division minimizes the overhead of communication since most of the atom interactions occur within a box. There still communication between atoms that occur at the borders and such communication is handled as a boundary condition. Atoms and molecules in the system can be mobil and they can move across boxes. Such activity triggers the "Exchange" function that transfers ownership of atoms from one processor to another. Also, since the division of atoms is spatial, some boxes might be saturated with atoms while other boxes barely have any atoms. This in turn is a cause of "Load Imbalance" that can occur in this suite.

==='''Orchestration''' ===

For computational efficiency LAMMPS uses neighbor lists to keep track of the neighboring atoms. The lists are optimized for systems with particles that are repulsive at short distances, so that the local density of particles never becomes too large. Communication is also minimized to optimal level by replicating force computations of boundary atoms. To increase computational efficiency the algorithm uses different timescales for different force computations. On parallel machines, LAMMPS uses spatial-decomposition techniques to partition the simulation domain into small three-dimensional sub-domains, one of which is assigned to each processor.

The following snipit of code shows how the temperature is calculated for each sub-domain. A sub-domain is owned by a single processor and that constitutes a task. The temperature of the sub-domain is derived from the kinetic energy of each atom in the sub-domain. For each atom in the sub-domain, the square of the velocity in each dimension is accumulated, then multiplied by the mass of the atom. The final accumulated value of the sub-domain is sent to the root processor of the world using the MPI_Allreduce function. There, all the values are summed and a value is derived for the whole domain. Note the recount function near the end of the snipit where the atoms are recounted. This happens if the atoms are mobile. When the atoms are mobile, it is possible for some of the atoms to move between boxes, which triggers the exchange function between processors.

 double ComputeTemp::compute_scalar()
 {
   double **v = atom->v;
   double *mass = atom->mass;
   double *rmass = atom->rmass;
   int *type = atom->type;
   int *mask = atom->mask;
   int nlocal = atom->nlocal;

   double t = 0.0;

   if (mass) {
     for (int i = 0; i < nlocal; i++)
       if (mask[i] & groupbit)
         t += (v[i][0]*v[i][0] + v[i][1]*v[i][1] + v[i][2]*v[i][2]) *
           mass[type[i]];
   } else {
     for (int i = 0; i < nlocal; i++)
       if (mask[i] & groupbit)
         t += (v[i][0]*v[i][0] + v[i][1]*v[i][1] + v[i][2]*v[i][2]) * rmass[i];
   }

   MPI_Allreduce(&t,&scalar,1,MPI_DOUBLE,MPI_SUM,world);
   if (dynamic) recount();
   scalar *= tfactor;
   return scalar;
 }
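The reduction step in compute_scalar can be mimicked without MPI. The sketch below is a serial stand-in, not LAMMPS code: the `Atom` structure and the function names are hypothetical. The temperature formula T = Σ m v² / (dof · kB), with dof = 3N and no unit-conversion factor, is a simplifying assumption standing in for the tfactor scaling.

```cpp
#include <vector>

// One atom with mass m and velocity (vx, vy, vz); hypothetical type.
struct Atom { double m, vx, vy, vz; };

// Local part: sum of m * |v|^2 over the atoms owned by one sub-domain.
double local_mv2(const std::vector<Atom> &atoms) {
  double t = 0.0;
  for (const Atom &a : atoms)
    t += a.m * (a.vx * a.vx + a.vy * a.vy + a.vz * a.vz);
  return t;
}

// Serial stand-in for an all-reduce with a sum operator: every
// "processor" ends up holding the same global total.
std::vector<double> allreduce_sum(const std::vector<double> &partial) {
  double total = 0.0;
  for (double p : partial) total += p;
  return std::vector<double>(partial.size(), total);
}

// Temperature from the global sum, T = sum(m v^2) / (dof * kB), in units
// where no extra conversion factor is needed (an assumption of this sketch).
double temperature(double global_mv2, int natoms, double kB) {
  int dof = 3 * natoms;  // ignores any removed degrees of freedom
  return global_mv2 / (dof * kB);
}
```

Each entry of the vector passed to `allreduce_sum` plays the role of one processor's local `t`. After the call every entry holds the same global sum, so each processor can compute the temperature itself without a further broadcast from a root.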

The next snippet computes a vector of six values rather than a single scalar. For each atom in the sub-domain, the mass is multiplied by the products of the velocity components, and the results are accumulated into the six independent components of the symmetric kinetic energy tensor, in the order xx, yy, zz, xy, xz, yz. The six local sums are then combined with MPI_Allreduce, which adds the vectors from all sub-domains element by element and returns the result to every processor in the World. The count of six passed to MPI_Allreduce is therefore the number of tensor components; it does not refer to the six neighboring sub-domains.

 void ComputeTemp::compute_vector()
 {
   int i;

   double **v = atom->v;
   double *mass = atom->mass;
   double *rmass = atom->rmass;
   int *type = atom->type;
   int *mask = atom->mask;
   int nlocal = atom->nlocal;

   double massone,t[6];
   for (i = 0; i < 6; i++) t[i] = 0.0;

   for (i = 0; i < nlocal; i++)
     if (mask[i] & groupbit) {
       if (mass) massone = mass[type[i]];
       else massone = rmass[i];
       t[0] += massone * v[i][0]*v[i][0];
       t[1] += massone * v[i][1]*v[i][1];
       t[2] += massone * v[i][2]*v[i][2];
       t[3] += massone * v[i][0]*v[i][1];
       t[4] += massone * v[i][0]*v[i][2];
       t[5] += massone * v[i][1]*v[i][2];
     }

   MPI_Allreduce(t,vector,6,MPI_DOUBLE,MPI_SUM,world);
   for (i = 0; i < 6; i++) vector[i] *= force->mvv2e;
 }

==='''Mapping''' ===

The LAMMPS suite uses the message-passing parallel computing model. This implies that each processor has its own private memory and holds only its own portion of the data. It performs its operations and sends, receives and broadcasts data as necessary. The LAMMPS suite defines a Universe to which all the processors belong. The Universe can be partitioned into a number of Worlds so that different, unrelated simulations can run at the same time. However, if all the available processors are used to tackle a single problem, the Universe contains one World. Each processor runs its own copy of the LAMMPS suite and knows some information about the Universe: its processor ID, the number of processors in the Universe, the World it belongs to, the number of processors in its World, and the total number of Worlds. Each processor has information about all the atoms and molecules in its sub-domain and their count. In each World, one processor is designated the Root processor.

The message-passing interface is defined for each processor to enable it to communicate with its six neighboring processors in its three-dimensional World. Each processor works within its own three-dimensional box, where it is responsible for a collection of atoms. The LAMMPS suite divides the atoms among processors spatially in what is called "Spatial Decomposition," and every processor owns a box of the same size. Although this design is scalable, it does not guarantee load balancing.
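The six-neighbor arrangement can be sketched as index arithmetic on a three-dimensional processor grid. The code below is illustrative and does not come from LAMMPS: the function names, the rank ordering (x varying fastest), and the periodic wrap-around at the edges of the grid are assumptions of this example.

```cpp
#include <array>

// Convert between a rank and its (ix, iy, iz) position in a
// px x py x pz processor grid, x varying fastest (assumed ordering).
std::array<int, 3> rank_to_coords(int rank, const std::array<int, 3> &p) {
  return {rank % p[0], (rank / p[0]) % p[1], rank / (p[0] * p[1])};
}

int coords_to_rank(const std::array<int, 3> &c, const std::array<int, 3> &p) {
  return c[0] + p[0] * (c[1] + p[1] * c[2]);
}

// Ranks of the six face neighbors, ordered -x, +x, -y, +y, -z, +z,
// assuming periodic boundaries so the grid wraps around.
std::array<int, 6> face_neighbors(int rank, const std::array<int, 3> &p) {
  std::array<int, 3> c = rank_to_coords(rank, p);
  std::array<int, 6> out{};
  for (int d = 0; d < 3; ++d) {
    for (int s = 0; s < 2; ++s) {
      std::array<int, 3> n = c;
      int step = (s == 0) ? -1 : 1;
      n[d] = (c[d] + step + p[d]) % p[d];  // wrap at the edges
      out[2 * d + s] = coords_to_rank(n, p);
    }
  }
  return out;
}
```

In a 3 x 3 x 3 grid, for example, the processor at the center (rank 13 under this numbering) has the face neighbors 12, 14, 10, 16, 4 and 22. Border atoms and migrating atoms only ever need to be sent to one of these six ranks.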

'''References'''

1: [http://lammps.sandia.gov/ LAMMPS website]
2: [http://etd.lib.fsu.edu/theses/available/etd-07122004-165317/unrestricted/02_JK_RestThesis.pdf Sequential algorithm]

CSC/ECE 506 Fall 2007/wiki2 4 LA

2007-09-25T02:25:11Z

Laaboue: /* '''LAMMPS Algorithm:''' */

'''Topic: Parallelizing an application'''

Pick another parallel application, not covered in the text, and less than 7 years old, and describe the various steps in parallelizing it (decomposition, assignment, orchestration, and mapping). You may use an example from the peer-reviewed literature, or a Web page. You do not have to go into great detail, but you should describe enough about these four stages to make the algorithm interesting.

=='''LAMMPS Algorithm:''' ==

The LAMMPS ('''L'''arge Scale '''A'''tomic/'''M'''olecular '''M'''assively '''P'''arallel '''S'''ystem) suite is a classical [http://en.wikipedia.org/wiki/Molecular_dynamics molecular dynamics] code developed at [http://www.sandia.gov/index.html Sandia National Labs], New Mexico. It models an ensemble of particles in a solid, liquid or gaseous state. It can model atomic, polymeric, biological, metallic, or granular systems using a variety of force fields and boundary conditions, and it can be easily modified and extended. LAMMPS is distributed as [http://lammps.sandia.gov/download.html open source code]. It is an integral part of the SPEC MPI 2007 package used to benchmark systems using the Message-Passing Interface. LAMMPS was created in 2003.

==='''Sequential Algorithm:'''===
This algorithm is performed for every atom. The initialization step sets up the various parameters of the system, such as the number of particles, the initial velocities, and the temperature. Once initialization is complete, the required quantities are calculated (the flow chart gives force/energy as an example). After these quantities are calculated, the necessary boundary conditions are applied and the atom schemes are integrated to get the desired results. This step is repeated for all the atom schemes. Once the results for all the atom schemes are complete, they are analyzed and presented using visualization schemes for further study.

[[Image:FlowChart.jpg]]

==='''Decomposition'''===
The LAMMPS suite provides two levels of concurrency, just like the ocean problem. Function parallelism is available across the grid, where quantities such as the force, energy, temperature and pressure of the atoms are computed. These computations can be carried out atom by atom, so each atom could in principle be handled by its own processor. Such fine granularity, however, imposes a heavy communication cost, since the computation for each atom depends on its neighboring atoms.

The LAMMPS suite therefore decomposes the domain into a set of equal-sized boxes. Since nearby atoms are placed on the same processor, only the data for neighboring atoms that reside on different processors needs to be communicated. The decomposition is spatial; the computation cost is O(N/P) and the communication cost is O(N/P), where N is the number of atoms and P is the number of processors. The LAMMPS suite is documented to have a maximum speedup of 7.5 to 8 versus a similar sequential molecular dynamics suite.

==='''Assignment'''===

The LAMMPS suite assigns the tasks defined above as follows. The atoms and molecules in the system (known as the World in the LAMMPS suite) are divided spatially into equal-sized boxes, and each box is assigned to one processor. This division minimizes communication overhead, since most atom interactions occur within a box. Atoms near the border of a box still interact with atoms in adjacent boxes, and this communication is handled as a boundary condition. Atoms and molecules can be mobile and may move across box boundaries. Such movement triggers the "Exchange" function, which transfers ownership of an atom from one processor to another. Also, because the division is spatial, some boxes might be saturated with atoms while others contain very few. This is a cause of the "Load Imbalance" that can occur in this suite.

==='''Orchestration''' ===

For computational efficiency, LAMMPS uses neighbor lists to keep track of neighboring atoms. The lists are optimized for systems with particles that are repulsive at short distances, so that the local density of particles never becomes too large. Communication is also reduced by replicating the force computations for boundary atoms. To further increase computational efficiency, the algorithm uses different timescales for different force computations. On parallel machines, LAMMPS uses spatial-decomposition techniques to partition the simulation domain into small three-dimensional sub-domains, one of which is assigned to each processor.

The following snippet of code shows how the temperature is calculated. A sub-domain is owned by a single processor, and that constitutes a task. The temperature is derived from the kinetic energy of the atoms. For each atom in the sub-domain that belongs to the selected group, the squares of the velocity components in each dimension are summed, multiplied by the mass of the atom, and added to a local total. The local totals of all sub-domains are then combined with the MPI_Allreduce function, which sums the values from every processor in the world and returns the global sum to all of them, not only to the root processor. The result is scaled by tfactor to give a value for the whole domain. Note the recount function near the end of the snippet, where the atoms are recounted when the dynamic flag is set. When the atoms are mobile, some of them can move between boxes, which triggers the exchange function between processors.

 double ComputeTemp::compute_scalar()
 {
   double **v = atom->v;
   double *mass = atom->mass;
   double *rmass = atom->rmass;
   int *type = atom->type;
   int *mask = atom->mask;
   int nlocal = atom->nlocal;

   double t = 0.0;

   if (mass) {
     for (int i = 0; i < nlocal; i++)
       if (mask[i] & groupbit)
         t += (v[i][0]*v[i][0] + v[i][1]*v[i][1] + v[i][2]*v[i][2]) *
           mass[type[i]];
   } else {
     for (int i = 0; i < nlocal; i++)
       if (mask[i] & groupbit)
         t += (v[i][0]*v[i][0] + v[i][1]*v[i][1] + v[i][2]*v[i][2]) * rmass[i];
   }

   MPI_Allreduce(&t,&scalar,1,MPI_DOUBLE,MPI_SUM,world);
   if (dynamic) recount();
   scalar *= tfactor;
   return scalar;
 }

The next snippet computes a vector of six values rather than a single scalar. For each atom in the sub-domain, the mass is multiplied by the products of the velocity components, and the results are accumulated into the six independent components of the symmetric kinetic energy tensor, in the order xx, yy, zz, xy, xz, yz. The six local sums are then combined with MPI_Allreduce, which adds the vectors from all sub-domains element by element and returns the result to every processor in the World. The count of six passed to MPI_Allreduce is therefore the number of tensor components; it does not refer to the six neighboring sub-domains.

 void ComputeTemp::compute_vector()
 {
   int i;

   double **v = atom->v;
   double *mass = atom->mass;
   double *rmass = atom->rmass;
   int *type = atom->type;
   int *mask = atom->mask;
   int nlocal = atom->nlocal;

   double massone,t[6];
   for (i = 0; i < 6; i++) t[i] = 0.0;

   for (i = 0; i < nlocal; i++)
     if (mask[i] & groupbit) {
       if (mass) massone = mass[type[i]];
       else massone = rmass[i];
       t[0] += massone * v[i][0]*v[i][0];
       t[1] += massone * v[i][1]*v[i][1];
       t[2] += massone * v[i][2]*v[i][2];
       t[3] += massone * v[i][0]*v[i][1];
       t[4] += massone * v[i][0]*v[i][2];
       t[5] += massone * v[i][1]*v[i][2];
     }

   MPI_Allreduce(t,vector,6,MPI_DOUBLE,MPI_SUM,world);
   for (i = 0; i < 6; i++) vector[i] *= force->mvv2e;
 }

==='''Mapping''' ===

The LAMMPS suite uses the message-passing parallel computing model. This implies that each processor has its own private memory and holds only its own portion of the data. It performs its operations and sends, receives and broadcasts data as necessary. The LAMMPS suite defines a Universe to which all the processors belong. The Universe can be partitioned into a number of Worlds so that different, unrelated simulations can run at the same time. However, if all the available processors are used to tackle a single problem, the Universe contains one World. Each processor runs its own copy of the LAMMPS suite and knows some information about the Universe: its processor ID, the number of processors in the Universe, the World it belongs to, the number of processors in its World, and the total number of Worlds. Each processor has information about all the atoms and molecules in its sub-domain and their count. In each World, one processor is designated the Root processor.

The message-passing interface is defined for each processor to enable it to communicate with its six neighboring processors in its three-dimensional World. Each processor works within its own three-dimensional box, where it is responsible for a collection of atoms. The LAMMPS suite divides the atoms among processors spatially in what is called "Spatial Decomposition," and every processor owns a box of the same size. Although this design is scalable, it does not guarantee load balancing.

'''References'''

1: [http://lammps.sandia.gov/ LAMMPS website]
2: [http://etd.lib.fsu.edu/theses/available/etd-07122004-165317/unrestricted/02_JK_RestThesis.pdf Sequential algorithm]

CSC/ECE 506 Fall 2007/wiki2 4 LA

2007-09-25T02:24:35Z

Laaboue: /* '''LAMMPS Algorithm:''' */

'''Topic: Parallelizing an application'''

Pick another parallel application, not covered in the text, and less than 7 years old, and describe the various steps in parallelizing it (decomposition, assignment, orchestration, and mapping). You may use an example from the peer-reviewed literature, or a Web page. You do not have to go into great detail, but you should describe enough about these four stages to make the algorithm interesting.

=='''LAMMPS Algorithm:''' ==

The LAMMPS ('''L'''arge Scale '''A'''tomic/'''M'''olecular '''M'''assively '''P'''arallel '''S'''ystem) suite is a classical [http://en.wikipedia.org/wiki/Molecular_dynamics molecular dynamics] code developed at [http://www.sandia.gov/index.html Sandia National Labs], New Mexico. It models an ensemble of particles in a solid, liquid or gaseous state. It can model atomic, polymeric, biological, metallic, or granular systems using a variety of force fields and boundary conditions, and it can be easily modified and extended. LAMMPS is distributed as [http://lammps.sandia.gov/download.html open source code]. It is an integral part of the SPEC MPI 2007 package used to benchmark systems using the Message-Passing Interface. LAMMPS was created in 2003.

==='''Sequential Algorithm:'''===
This algorithm is performed for every atom. The initialization step sets up the various parameters of the system, such as the number of particles, the initial velocities, and the temperature. Once initialization is complete, the required quantities are calculated (the flow chart gives force/energy as an example). After these quantities are calculated, the necessary boundary conditions are applied and the atom schemes are integrated to get the desired results. This step is repeated for all the atom schemes. Once the results for all the atom schemes are complete, they are analyzed and presented using visualization schemes for further study.

[[Image:FlowChart.jpg]]

==='''Decomposition'''===
The LAMMPS suite provides two levels of concurrency, just like the ocean problem. Function parallelism is available across the grid, where quantities such as the force, energy, temperature and pressure of the atoms are computed. These computations can be carried out atom by atom, so each atom could in principle be handled by its own processor. Such fine granularity, however, imposes a heavy communication cost, since the computation for each atom depends on its neighboring atoms.

The LAMMPS suite therefore decomposes the domain into a set of equal-sized boxes. Since nearby atoms are placed on the same processor, only the data for neighboring atoms that reside on different processors needs to be communicated. The decomposition is spatial; the computation cost is O(N/P) and the communication cost is O(N/P), where N is the number of atoms and P is the number of processors. The LAMMPS suite is documented to have a maximum speedup of 7.5 to 8 versus a similar sequential molecular dynamics suite.

==='''Assignment'''===

The LAMMPS suite assigns the tasks defined above as follows. The atoms and molecules in the system (known as the World in the LAMMPS suite) are divided spatially into equal-sized boxes, and each box is assigned to one processor. This division minimizes communication overhead, since most atom interactions occur within a box. Atoms near the border of a box still interact with atoms in adjacent boxes, and this communication is handled as a boundary condition. Atoms and molecules can be mobile and may move across box boundaries. Such movement triggers the "Exchange" function, which transfers ownership of an atom from one processor to another. Also, because the division is spatial, some boxes might be saturated with atoms while others contain very few. This is a cause of the "Load Imbalance" that can occur in this suite.

==='''Orchestration''' ===

For computational efficiency, LAMMPS uses neighbor lists to keep track of neighboring atoms. The lists are optimized for systems with particles that are repulsive at short distances, so that the local density of particles never becomes too large. Communication is also reduced by replicating the force computations for boundary atoms. To further increase computational efficiency, the algorithm uses different timescales for different force computations. On parallel machines, LAMMPS uses spatial-decomposition techniques to partition the simulation domain into small three-dimensional sub-domains, one of which is assigned to each processor.

The following snippet of code shows how the temperature is calculated. A sub-domain is owned by a single processor, and that constitutes a task. The temperature is derived from the kinetic energy of the atoms. For each atom in the sub-domain that belongs to the selected group, the squares of the velocity components in each dimension are summed, multiplied by the mass of the atom, and added to a local total. The local totals of all sub-domains are then combined with the MPI_Allreduce function, which sums the values from every processor in the world and returns the global sum to all of them, not only to the root processor. The result is scaled by tfactor to give a value for the whole domain. Note the recount function near the end of the snippet, where the atoms are recounted when the dynamic flag is set. When the atoms are mobile, some of them can move between boxes, which triggers the exchange function between processors.

 double ComputeTemp::compute_scalar()
 {
   double **v = atom->v;
   double *mass = atom->mass;
   double *rmass = atom->rmass;
   int *type = atom->type;
   int *mask = atom->mask;
   int nlocal = atom->nlocal;

   double t = 0.0;

   if (mass) {
     for (int i = 0; i < nlocal; i++)
       if (mask[i] & groupbit)
         t += (v[i][0]*v[i][0] + v[i][1]*v[i][1] + v[i][2]*v[i][2]) *
           mass[type[i]];
   } else {
     for (int i = 0; i < nlocal; i++)
       if (mask[i] & groupbit)
         t += (v[i][0]*v[i][0] + v[i][1]*v[i][1] + v[i][2]*v[i][2]) * rmass[i];
   }

   MPI_Allreduce(&t,&scalar,1,MPI_DOUBLE,MPI_SUM,world);
   if (dynamic) recount();
   scalar *= tfactor;
   return scalar;
 }

The next snippet computes a vector of six values rather than a single scalar. For each atom in the sub-domain, the mass is multiplied by the products of the velocity components, and the results are accumulated into the six independent components of the symmetric kinetic energy tensor, in the order xx, yy, zz, xy, xz, yz. The six local sums are then combined with MPI_Allreduce, which adds the vectors from all sub-domains element by element and returns the result to every processor in the World. The count of six passed to MPI_Allreduce is therefore the number of tensor components; it does not refer to the six neighboring sub-domains.

 void ComputeTemp::compute_vector()
 {
   int i;

   double **v = atom->v;
   double *mass = atom->mass;
   double *rmass = atom->rmass;
   int *type = atom->type;
   int *mask = atom->mask;
   int nlocal = atom->nlocal;

   double massone,t[6];
   for (i = 0; i < 6; i++) t[i] = 0.0;

   for (i = 0; i < nlocal; i++)
     if (mask[i] & groupbit) {
       if (mass) massone = mass[type[i]];
       else massone = rmass[i];
       t[0] += massone * v[i][0]*v[i][0];
       t[1] += massone * v[i][1]*v[i][1];
       t[2] += massone * v[i][2]*v[i][2];
       t[3] += massone * v[i][0]*v[i][1];
       t[4] += massone * v[i][0]*v[i][2];
       t[5] += massone * v[i][1]*v[i][2];
     }

   MPI_Allreduce(t,vector,6,MPI_DOUBLE,MPI_SUM,world);
   for (i = 0; i < 6; i++) vector[i] *= force->mvv2e;
 }

==='''Mapping''' ===

The LAMMPS suite uses the message-passing parallel computing model. This implies that each processor has its own private memory and holds only its own portion of the data. It performs its operations and sends, receives and broadcasts data as necessary. The LAMMPS suite defines a Universe to which all the processors belong. The Universe can be partitioned into a number of Worlds so that different, unrelated simulations can run at the same time. However, if all the available processors are used to tackle a single problem, the Universe contains one World. Each processor runs its own copy of the LAMMPS suite and knows some information about the Universe: its processor ID, the number of processors in the Universe, the World it belongs to, the number of processors in its World, and the total number of Worlds. Each processor has information about all the atoms and molecules in its sub-domain and their count. In each World, one processor is designated the Root processor.

The message-passing interface is defined for each processor to enable it to communicate with its six neighboring processors in its three-dimensional World. Each processor works within its own three-dimensional box, where it is responsible for a collection of atoms. The LAMMPS suite divides the atoms among processors spatially in what is called "Spatial Decomposition," and every processor owns a box of the same size. Although this design is scalable, it does not guarantee load balancing.

'''References'''

1: [http://lammps.sandia.gov/ LAMMPS website]
2: [http://etd.lib.fsu.edu/theses/available/etd-07122004-165317/unrestricted/02_JK_RestThesis.pdf Sequential algorithm]

CSC/ECE 506 Fall 2007/wiki2 4 LA

2007-09-25T02:24:16Z

Laaboue: /* '''LAMMPS Algorithm:''' */

'''Topic: Parallelizing an application'''

Pick another parallel application, not covered in the text, and less than 7 years old, and describe the various steps in parallelizing it (decomposition, assignment, orchestration, and mapping). You may use an example from the peer-reviewed literature, or a Web page. You do not have to go into great detail, but you should describe enough about these four stages to make the algorithm interesting.

=='''LAMMPS Algorithm:''' ==

The LAMMPS ('''L'''arge Scale '''A'''tomic/'''M'''olecular '''M'''assively '''P'''arallel '''S'''ystem) suite is a classical [http://en.wikipedia.org/wiki/Molecular_dynamics molecular dynamics] code developed at [http://www.sandia.gov/index.html Sandia National Labs], New Mexico. It models an ensemble of particles in a solid, liquid or gaseous state. It can model atomic, polymeric, biological, metallic, or granular systems using a variety of force fields and boundary conditions, and it can be easily modified and extended. LAMMPS is distributed as [http://lammps.sandia.gov/download.html open source code]. It is an integral part of the SPEC MPI 2007 package used to benchmark systems using the Message-Passing Interface. LAMMPS was created in 2003.

==='''Sequential Algorithm:'''===
This algorithm is performed for every atom. The initialization step sets up the various parameters of the system, such as the number of particles, the initial velocities, and the temperature. Once initialization is complete, the required quantities are calculated (the flow chart gives force/energy as an example). After these quantities are calculated, the necessary boundary conditions are applied and the atom schemes are integrated to get the desired results. This step is repeated for all the atom schemes. Once the results for all the atom schemes are complete, they are analyzed and presented using visualization schemes for further study.

[[Image:FlowChart.jpg]]

==='''Decomposition'''===
The LAMMPS suite provides two levels of concurrency, just like the ocean problem. Function parallelism is available across the grid, where quantities such as the force, energy, temperature and pressure of the atoms are computed. These computations can be carried out atom by atom, so each atom could in principle be handled by its own processor. Such fine granularity, however, imposes a heavy communication cost, since the computation for each atom depends on its neighboring atoms.

The LAMMPS suite therefore decomposes the domain into a set of equal-sized boxes. Since nearby atoms are placed on the same processor, only the data for neighboring atoms that reside on different processors needs to be communicated. The decomposition is spatial; the computation cost is O(N/P) and the communication cost is O(N/P), where N is the number of atoms and P is the number of processors. The LAMMPS suite is documented to have a maximum speedup of 7.5 to 8 versus a similar sequential molecular dynamics suite.

==='''Assignment'''===

The LAMMPS suite assigns the tasks defined above as follows. The atoms and molecules in the system (known as the World in the LAMMPS suite) are divided spatially into equal-sized boxes, and each box is assigned to one processor. This division minimizes communication overhead, since most atom interactions occur within a box. Atoms near the border of a box still interact with atoms in adjacent boxes, and this communication is handled as a boundary condition. Atoms and molecules can be mobile and may move across box boundaries. Such movement triggers the "Exchange" function, which transfers ownership of an atom from one processor to another. Also, because the division is spatial, some boxes might be saturated with atoms while others contain very few. This is a cause of the "Load Imbalance" that can occur in this suite.

==='''Orchestration''' ===

For computational efficiency, LAMMPS uses neighbor lists to keep track of neighboring atoms. The lists are optimized for systems with particles that are repulsive at short distances, so that the local density of particles never becomes too large. Communication is also reduced by replicating the force computations for boundary atoms. To further increase computational efficiency, the algorithm uses different timescales for different force computations. On parallel machines, LAMMPS uses spatial-decomposition techniques to partition the simulation domain into small three-dimensional sub-domains, one of which is assigned to each processor.

The following snippet of code shows how the temperature is calculated. A sub-domain is owned by a single processor, and that constitutes a task. The temperature is derived from the kinetic energy of the atoms. For each atom in the sub-domain that belongs to the selected group, the squares of the velocity components in each dimension are summed, multiplied by the mass of the atom, and added to a local total. The local totals of all sub-domains are then combined with the MPI_Allreduce function, which sums the values from every processor in the world and returns the global sum to all of them, not only to the root processor. The result is scaled by tfactor to give a value for the whole domain. Note the recount function near the end of the snippet, where the atoms are recounted when the dynamic flag is set. When the atoms are mobile, some of them can move between boxes, which triggers the exchange function between processors.

 double ComputeTemp::compute_scalar()
 {
   double **v = atom->v;
   double *mass = atom->mass;
   double *rmass = atom->rmass;
   int *type = atom->type;
   int *mask = atom->mask;
   int nlocal = atom->nlocal;

   double t = 0.0;

   if (mass) {
     for (int i = 0; i < nlocal; i++)
       if (mask[i] & groupbit)
         t += (v[i][0]*v[i][0] + v[i][1]*v[i][1] + v[i][2]*v[i][2]) *
           mass[type[i]];
   } else {
     for (int i = 0; i < nlocal; i++)
       if (mask[i] & groupbit)
         t += (v[i][0]*v[i][0] + v[i][1]*v[i][1] + v[i][2]*v[i][2]) * rmass[i];
   }

   MPI_Allreduce(&t,&scalar,1,MPI_DOUBLE,MPI_SUM,world);
   if (dynamic) recount();
   scalar *= tfactor;
   return scalar;
 }

The next snippet computes a vector of six values rather than a single scalar. For each atom in the sub-domain, the mass is multiplied by the products of the velocity components, and the results are accumulated into the six independent components of the symmetric kinetic energy tensor, in the order xx, yy, zz, xy, xz, yz. The six local sums are then combined with MPI_Allreduce, which adds the vectors from all sub-domains element by element and returns the result to every processor in the World. The count of six passed to MPI_Allreduce is therefore the number of tensor components; it does not refer to the six neighboring sub-domains.

 void ComputeTemp::compute_vector()
 {
   int i;

   double **v = atom->v;
   double *mass = atom->mass;
   double *rmass = atom->rmass;
   int *type = atom->type;
   int *mask = atom->mask;
   int nlocal = atom->nlocal;

   double massone,t[6];
   for (i = 0; i < 6; i++) t[i] = 0.0;

   for (i = 0; i < nlocal; i++)
     if (mask[i] & groupbit) {
       if (mass) massone = mass[type[i]];
       else massone = rmass[i];
       t[0] += massone * v[i][0]*v[i][0];
       t[1] += massone * v[i][1]*v[i][1];
       t[2] += massone * v[i][2]*v[i][2];
       t[3] += massone * v[i][0]*v[i][1];
       t[4] += massone * v[i][0]*v[i][2];
       t[5] += massone * v[i][1]*v[i][2];
     }

   MPI_Allreduce(t,vector,6,MPI_DOUBLE,MPI_SUM,world);
   for (i = 0; i < 6; i++) vector[i] *= force->mvv2e;
 }

==='''Mapping''' ===

The LAMMPS suite uses the message-passing parallel computing model. This implies that each processor has its own private memory and holds only its own portion of the data. It performs its operations and sends, receives and broadcasts data as necessary. The LAMMPS suite defines a Universe to which all the processors belong. The Universe can be partitioned into a number of Worlds so that different, unrelated simulations can run at the same time. However, if all the available processors are used to tackle a single problem, the Universe contains one World. Each processor runs its own copy of the LAMMPS suite and knows some information about the Universe: its processor ID, the number of processors in the Universe, the World it belongs to, the number of processors in its World, and the total number of Worlds. Each processor has information about all the atoms and molecules in its sub-domain and their count. In each World, one processor is designated the Root processor.

The message-passing interface is defined for each processor to enable it to communicate with its six neighboring processors in its three-dimensional World. Each processor works within its own three-dimensional box, where it is responsible for a collection of atoms. The LAMMPS suite divides the atoms among processors spatially in what is called "Spatial Decomposition," and every processor owns a box of the same size. Although this design is scalable, it does not guarantee load balancing.

'''References'''

1: [http://lammps.sandia.gov/ LAMMPS website]
2: [http://etd.lib.fsu.edu/theses/available/etd-07122004-165317/unrestricted/02_JK_RestThesis.pdf Sequential algorithm]

File:FlowChart.jpg

2007-09-25T02:23:21Z

Laaboue: Sequential algorithm for the sequential molecular dynamics simulation.

Sequential algorithm for the sequential molecular dynamics simulation.

CSC/ECE 506 Fall 2007/wiki2 4 LA

2007-09-25T02:20:43Z

Laaboue: /* '''Orchestration''' */

'''Topic: Parallelizing an application'''

Pick another parallel application, not covered in the text, and less than 7 years old, and describe the various steps in parallelizing it (decomposition, assignment, orchestration, and mapping). You may use an example from the peer-reviewed literature, or a Web page. You do not have to go into great detail, but you should describe enough about these four stages to make the algorithm interesting.

=='''LAMMPS Algorithm:''' ==

The LAMMPS ('''L'''arge Scale '''A'''tomic/'''M'''olecular '''M'''assively '''P'''arallel '''S'''ystem) suite is a classical [http://en.wikipedia.org/wiki/Molecular_dynamics molecular dynamics] code developed at [http://www.sandia.gov/index.html Sandia National Labs], New Mexico. It models an ensemble of particles in a solid, liquid or gaseous state. It can model atomic, polymeric, biological, metallic, or granular systems using a variety of force fields and boundary conditions, and it can be easily modified and extended. LAMMPS is distributed as [http://lammps.sandia.gov/download.html open source code]. It is an integral part of the SPEC MPI 2007 package used to benchmark systems using the Message-Passing Interface. LAMMPS was created in 2003.

==='''Sequential Algorithm:'''===
This algorithm is performed for every atom. The initialization step sets up the parameters of the simulation, such as the number of particles, the initial velocities, and the temperature. Once the initialization is complete, the required quantities are calculated (the flow chart uses force and energy as an example). After these quantities are calculated, the necessary boundary conditions are applied and the equations of motion of the atoms are integrated to obtain the desired results. This step is repeated for all the atoms. Once the results for all the atoms are complete, they are analyzed and presented using visualization schemes for further study.

[[Image:FlowChart.jpeg]]
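
The loop in the flow chart can be sketched roughly as follows. This is only an illustration, not code from LAMMPS: the <code>Particle</code> record, the one-dimensional harmonic spring used as a stand-in force field, and the velocity-Verlet integrator are all assumptions made to keep the example short.

```cpp
#include <cassert>
#include <vector>

// Hypothetical particle record; one dimension keeps the sketch short.
struct Particle {
    double x;  // position
    double v;  // velocity
    double f;  // force
    double m;  // mass
};

// Assumed stand-in force field: a harmonic spring of stiffness k pulling each
// particle toward the origin. Returns the total potential energy.
double compute_forces(std::vector<Particle>& atoms, double k) {
    double potential = 0.0;
    for (Particle& p : atoms) {
        p.f = -k * p.x;
        potential += 0.5 * k * p.x * p.x;
    }
    return potential;
}

double kinetic_energy(const std::vector<Particle>& atoms) {
    double ke = 0.0;
    for (const Particle& p : atoms) ke += 0.5 * p.m * p.v * p.v;
    return ke;
}

// Initialize -> (compute forces -> integrate), repeated -> analyze.
// Returns the total energy after nsteps velocity-Verlet steps of size dt.
double run_sequential_md(std::vector<Particle>& atoms, double k, double dt, int nsteps) {
    assert(dt > 0.0 && nsteps >= 0);
    double potential = compute_forces(atoms, k);
    for (int step = 0; step < nsteps; ++step) {
        for (Particle& p : atoms) {
            p.v += 0.5 * dt * p.f / p.m;  // half kick with the old force
            p.x += dt * p.v;              // drift
        }
        potential = compute_forces(atoms, k);
        for (Particle& p : atoms) p.v += 0.5 * dt * p.f / p.m;  // half kick with the new force
    }
    return potential + kinetic_energy(atoms);
}
```

With a small time step the returned total energy should stay close to its initial value, which is a simple sanity check on the "analyze" stage.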

==='''Decomposition'''===
The LAMMPS suite offers two levels of concurrency, much like the ocean problem. Function parallelism is available across the grid, where parameters such as the force, energy, temperature, and pressure of the atoms are computed. These computations can proceed concurrently for different atoms. Each atom could be assigned to its own processor, but such fine granularity imposes a heavy communication cost, since the computation for each atom depends on its neighboring atoms.

The LAMMPS suite decomposes the domain into a set of equal-sized boxes. Since nearby atoms are placed on the same processor, only atoms that have neighbors on a different processor need to be communicated. The decomposition is spatial; the computation cost is O(N/P) and the communication cost is O(N/P), where N is the number of atoms and P is the number of processors. The LAMMPS suite is documented to have a maximum speedup of 7.5 to 8 versus a similar sequential molecular dynamics suite.
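
As a rough illustration of the equal-sized boxes, the sketch below maps an atom position to the box that contains it. The <code>Decomposition</code> struct, the function names, and the rank ordering (x varying fastest) are assumptions for this example and are not taken from the LAMMPS source.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical description of a rectangular simulation domain split into
// procs[0] * procs[1] * procs[2] equal boxes, one per processor.
struct Decomposition {
    double lo[3];   // lower corner of the domain
    double hi[3];   // upper corner of the domain
    int procs[3];   // number of boxes along x, y, z
};

// Index of the box containing coordinate c along dimension d.
int box_index(const Decomposition& dec, int d, double c) {
    assert(c >= dec.lo[d] && c < dec.hi[d]);
    double width = (dec.hi[d] - dec.lo[d]) / dec.procs[d];
    int idx = static_cast<int>(std::floor((c - dec.lo[d]) / width));
    // Guard against rounding at the upper edge.
    return idx < dec.procs[d] ? idx : dec.procs[d] - 1;
}

// Rank of the processor that owns position (x, y, z), with x varying fastest.
// The rank ordering is an assumption of this sketch.
int owner_rank(const Decomposition& dec, double x, double y, double z) {
    int ix = box_index(dec, 0, x);
    int iy = box_index(dec, 1, y);
    int iz = box_index(dec, 2, z);
    return ix + dec.procs[0] * (iy + dec.procs[1] * iz);
}
```

Because ownership depends only on position, two atoms that are close together usually land in the same box, which is why most interactions need no communication.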

==='''Assignment'''===

The LAMMPS suite takes the following approach to assigning the tasks defined above. The atoms and molecules in the system (known as the World in the LAMMPS suite) are divided spatially into equal-sized boxes, and each box is assigned to a processor. This division minimizes communication overhead, since most atom interactions occur within a box. Interactions between atoms at the borders of adjacent boxes still require communication, which is handled as a boundary condition. Atoms and molecules in the system can be mobile and can move across boxes. Such movement triggers the "Exchange" function, which transfers ownership of atoms from one processor to another. Also, since the division of atoms is spatial, some boxes might be saturated with atoms while others contain barely any. This is a cause of the load imbalance that can occur in this suite.
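
The effect of atom movement on ownership and load can be sketched as below. This is a simplified, serial stand-in and not the LAMMPS Exchange routine: it assumes a one-dimensional row of slabs of equal width, clamps atoms at the ends instead of wrapping them, and the names <code>exchange</code> and <code>imbalance</code> are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Moves ownership of atoms whose x coordinate has left the slab of their
// current owner. boxes[p] holds the x coordinates of the atoms owned by p;
// slab p covers [p * width, (p + 1) * width). One-dimensional for brevity.
// Returns the number of atoms that changed owner.
int exchange(std::vector<std::vector<double>>& boxes, double width) {
    assert(width > 0.0);
    const int nprocs = static_cast<int>(boxes.size());
    std::vector<std::vector<double>> updated(boxes.size());
    int moved = 0;
    for (int p = 0; p < nprocs; ++p) {
        for (double x : boxes[p]) {
            int dest = static_cast<int>(x / width);
            if (dest < 0) dest = 0;                 // clamp: no periodic wrap here
            if (dest >= nprocs) dest = nprocs - 1;
            if (dest != p) ++moved;
            updated[dest].push_back(x);
        }
    }
    boxes.swap(updated);
    return moved;
}

// Load imbalance as (largest atom count) / (average atom count); 1.0 is ideal.
double imbalance(const std::vector<std::vector<double>>& boxes) {
    assert(!boxes.empty());
    std::size_t total = 0, largest = 0;
    for (const auto& b : boxes) {
        total += b.size();
        if (b.size() > largest) largest = b.size();
    }
    assert(total > 0);
    return static_cast<double>(largest) * boxes.size() / total;
}
```

An imbalance well above 1.0 means that some processors wait for the most heavily loaded one, which is the cost of keeping the boxes equal in size rather than equal in atom count.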

==='''Orchestration''' ===

For computational efficiency, LAMMPS uses neighbor lists to keep track of the atoms near each atom. The lists are optimized for systems whose particles repel each other at short distances, so that the local density of particles never becomes too large. Communication is also reduced by replicating the force computations for boundary atoms. To further increase computational efficiency, the algorithm uses different timescales for different force computations. On parallel machines, LAMMPS uses spatial-decomposition techniques to partition the simulation domain into small three-dimensional sub-domains, one of which is assigned to each processor.
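
A neighbor list can be illustrated with the minimal sketch below. It is a brute-force version written for this page, not the LAMMPS implementation; the <code>Vec3</code> type, the function name, and the extra <code>skin</code> margin are assumptions of the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Vec3 { double x, y, z; };

// Builds a half neighbor list: for each atom i, the indices j > i that lie
// within (cutoff + skin) of it. The skin is an assumed extra margin so the
// list can be reused for several time steps before atoms drift out of range.
// Brute-force O(N^2) search; a production code would bin atoms into cells first.
std::vector<std::vector<int>> build_neighbor_list(const std::vector<Vec3>& pos,
                                                  double cutoff, double skin) {
    assert(cutoff > 0.0 && skin >= 0.0);
    const double r2max = (cutoff + skin) * (cutoff + skin);
    std::vector<std::vector<int>> neighbors(pos.size());
    for (std::size_t i = 0; i < pos.size(); ++i) {
        for (std::size_t j = i + 1; j < pos.size(); ++j) {
            const double dx = pos[i].x - pos[j].x;
            const double dy = pos[i].y - pos[j].y;
            const double dz = pos[i].z - pos[j].z;
            if (dx * dx + dy * dy + dz * dz <= r2max)
                neighbors[i].push_back(static_cast<int>(j));
        }
    }
    return neighbors;
}
```

Once the list exists, a force loop only visits the listed pairs instead of all pairs. The list stays short as long as the local density is bounded, which is the case for particles that repel at short range.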

The following snippet of code shows how the temperature is calculated. A sub-domain is owned by a single processor, and that constitutes a task. The temperature is derived from the kinetic energy of the atoms. For each atom in the sub-domain, the squares of the velocity components in the three dimensions are summed and multiplied by the mass of the atom, and the products are accumulated into a local total. The MPI_Allreduce function then sums the local totals of all the processors in the World and returns the global sum to every processor, so each one obtains the value for the whole domain. Note the recount call near the end of the snippet: when the dynamic flag is set, the atoms are recounted before the sum is scaled by tfactor. Atoms that are mobile can move between boxes, which triggers the exchange function between processors.

double ComputeTemp::compute_scalar()
{
  double **v = atom->v;
  double *mass = atom->mass;
  double *rmass = atom->rmass;
  int *type = atom->type;
  int *mask = atom->mask;
  int nlocal = atom->nlocal;   // atoms owned by this processor

  double t = 0.0;

  // accumulate m*|v|^2 over the local atoms that belong to the group
  if (mass) {
    // masses stored per atom type
    for (int i = 0; i < nlocal; i++)
      if (mask[i] & groupbit)
        t += (v[i][0]*v[i][0] + v[i][1]*v[i][1] + v[i][2]*v[i][2]) *
          mass[type[i]];
  } else {
    // masses stored per atom
    for (int i = 0; i < nlocal; i++)
      if (mask[i] & groupbit)
        t += (v[i][0]*v[i][0] + v[i][1]*v[i][1] + v[i][2]*v[i][2]) * rmass[i];
  }

  // sum the local totals; every processor receives the global sum
  MPI_Allreduce(&t,&scalar,1,MPI_DOUBLE,MPI_SUM,world);
  if (dynamic) recount();
  scalar *= tfactor;
  return scalar;
}
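
The data flow of the routine above can be imitated without MPI, as in the sketch below. It is a serial stand-in written for illustration: <code>allreduce_sum</code> only mimics what MPI_Allreduce with MPI_SUM delivers, the <code>AtomKE</code> record is hypothetical, and <code>tfactor</code> is treated as a plain input.

```cpp
#include <cassert>
#include <vector>

// Hypothetical per-atom record: velocity components and mass.
struct AtomKE { double vx, vy, vz, mass; };

// Local part: sum of m * |v|^2 over the atoms one processor owns.
double local_mv2(const std::vector<AtomKE>& owned) {
    double t = 0.0;
    for (const AtomKE& a : owned)
        t += (a.vx * a.vx + a.vy * a.vy + a.vz * a.vz) * a.mass;
    return t;
}

// Serial stand-in for MPI_Allreduce with MPI_SUM on one double: every
// "processor" contributes one value and every one receives the same total.
std::vector<double> allreduce_sum(const std::vector<double>& contributions) {
    double total = 0.0;
    for (double c : contributions) total += c;
    return std::vector<double>(contributions.size(), total);
}

// Each sub-domain computes its partial sum, then all of them obtain the
// global sum scaled by tfactor.
std::vector<double> global_temperature(const std::vector<std::vector<AtomKE>>& subdomains,
                                       double tfactor) {
    assert(!subdomains.empty());
    std::vector<double> partial;
    for (const auto& owned : subdomains) partial.push_back(local_mv2(owned));
    std::vector<double> result = allreduce_sum(partial);
    for (double& r : result) r *= tfactor;
    return result;
}
```

Every entry of the returned vector is identical, which reflects the fact that after an all-reduce no processor plays a special role: each one holds the global value.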

The next snippet computes a vector quantity in the same way. For each atom in the sub-domain, the mass is multiplied by each pair of velocity components and the products are accumulated, giving the six independent components (xx, yy, zz, xy, xz, yz) of the symmetric kinetic energy tensor. MPI_Allreduce is passed a vector of six items; it sums the vectors from all the sub-domains element by element and returns the result to every processor. The six items correspond to these tensor components rather than to the six neighboring sub-domains.

void ComputeTemp::compute_vector()
{
  int i;

  double **v = atom->v;
  double *mass = atom->mass;
  double *rmass = atom->rmass;
  int *type = atom->type;
  int *mask = atom->mask;
  int nlocal = atom->nlocal;   // atoms owned by this processor

  double massone,t[6];
  for (i = 0; i < 6; i++) t[i] = 0.0;

  // accumulate m*v_a*v_b for the local atoms that belong to the group
  for (i = 0; i < nlocal; i++)
    if (mask[i] & groupbit) {
      if (mass) massone = mass[type[i]];
      else massone = rmass[i];
      t[0] += massone * v[i][0]*v[i][0];   // xx
      t[1] += massone * v[i][1]*v[i][1];   // yy
      t[2] += massone * v[i][2]*v[i][2];   // zz
      t[3] += massone * v[i][0]*v[i][1];   // xy
      t[4] += massone * v[i][0]*v[i][2];   // xz
      t[5] += massone * v[i][1]*v[i][2];   // yz
    }

  // element-wise sum over all processors; every processor receives the result
  MPI_Allreduce(t,vector,6,MPI_DOUBLE,MPI_SUM,world);
  for (i = 0; i < 6; i++) vector[i] *= force->mvv2e;
}
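
To make the meaning of the six entries concrete, the sketch below accumulates the same six products and reduces them element by element. It is a serial illustration under assumptions: the <code>AtomV</code> record and both function names are hypothetical, and the unit conversion by <code>force->mvv2e</code> is left out.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-atom record: three velocity components and a mass.
struct AtomV { double v[3]; double mass; };

// Six independent entries of the symmetric tensor sum_i m_i * v_a * v_b,
// stored in the order xx, yy, zz, xy, xz, yz used by the routine above.
std::array<double, 6> local_mvv_tensor(const std::vector<AtomV>& owned) {
    static const int a[6] = {0, 1, 2, 0, 0, 1};
    static const int b[6] = {0, 1, 2, 1, 2, 2};
    std::array<double, 6> t{};  // zero-initialized
    for (const AtomV& atom : owned)
        for (int k = 0; k < 6; ++k)
            t[k] += atom.mass * atom.v[a[k]] * atom.v[b[k]];
    return t;
}

// Serial stand-in for an element-wise sum reduction over processors.
std::array<double, 6> reduce_sum(const std::vector<std::array<double, 6>>& parts) {
    assert(!parts.empty());
    std::array<double, 6> total{};
    for (const auto& p : parts)
        for (std::size_t k = 0; k < 6; ++k) total[k] += p[k];
    return total;
}
```

The sum of the first three entries (the trace) equals the quantity that the scalar routine accumulates, which links the two snippets.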

==='''Mapping''' ===

The LAMMPS suite uses the message-passing parallel computing model. This implies that each processor keeps the data it works on in its own private memory; it performs its operations and sends, receives, and broadcasts data as necessary. The LAMMPS suite defines a Universe to which all the processors belong, and it can partition the Universe into a number of Worlds so that different, unrelated simulations can run side by side. If all the available processors are used to tackle a single problem, the Universe contains one World. Each processor runs its own copy of the LAMMPS suite and knows some information about the Universe: its processor ID, the number of processors in the Universe, the World it belongs to, the number of processors in its World, and the total number of Worlds. Each processor also holds the atoms and molecules in its sub-domain and their count. Each World has one processor that is designated the Root processor. The message-passing interface is set up so that each processor can communicate with its six neighboring processors in its three-dimensional World. Each processor works within its own three-dimensional box, where it is responsible for a collection of atoms. The LAMMPS suite divides the atoms among processors spatially, in what is called "spatial decomposition," and every processor owns a box of the same size. Although this design scales well, it does not guarantee load balancing.
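
The six-neighbor layout can be sketched as follows. This is an illustration only: the <code>ProcGrid</code> struct, the rank ordering (x varying fastest), and the periodic wrap at the edges of the World are assumptions of the example, not details taken from the LAMMPS source.

```cpp
#include <array>
#include <cassert>

// Hypothetical processor grid of nx * ny * nz ranks, x varying fastest.
struct ProcGrid { int nx, ny, nz; };

int rank_of(const ProcGrid& g, int ix, int iy, int iz) {
    // Assumed periodic wrap, so boxes on the edge of the World have neighbors too.
    ix = (ix % g.nx + g.nx) % g.nx;
    iy = (iy % g.ny + g.ny) % g.ny;
    iz = (iz % g.nz + g.nz) % g.nz;
    return ix + g.nx * (iy + g.ny * iz);
}

// Ranks of the six face neighbors in the order -x, +x, -y, +y, -z, +z.
std::array<int, 6> six_neighbors(const ProcGrid& g, int rank) {
    assert(rank >= 0 && rank < g.nx * g.ny * g.nz);
    const int ix = rank % g.nx;
    const int iy = (rank / g.nx) % g.ny;
    const int iz = rank / (g.nx * g.ny);
    std::array<int, 6> n = {{
        rank_of(g, ix - 1, iy, iz), rank_of(g, ix + 1, iy, iz),
        rank_of(g, ix, iy - 1, iz), rank_of(g, ix, iy + 1, iz),
        rank_of(g, ix, iy, iz - 1), rank_of(g, ix, iy, iz + 1)}};
    return n;
}
```

Each processor only needs this short list of partners for boundary communication, regardless of how many processors the World contains.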

'''References'''

1: [http://lammps.sandia.gov/ LAMMPS website]
2: [http://etd.lib.fsu.edu/theses/available/etd-07122004-165317/unrestricted/02_JK_RestThesis.pdf Sequential algorithm]

CSC/ECE 506 Fall 2007/wiki2 4 LA

2007-09-25T02:02:26Z

Laaboue: /* '''Decomposition''' */

CSC/ECE 506 Fall 2007/wiki2 4 LA

2007-09-25T02:01:16Z

Laaboue: /* '''Assignment''' */

CSC/ECE 506 Fall 2007/wiki2 4 LA

2007-09-25T01:47:42Z

Laaboue: /* '''LAMMPS Algorithm:''' */

CSC/ECE 506 Fall 2007/wiki2 4 LA

2007-09-25T00:09:45Z

Laaboue: /* '''LAMMPS Algorithm:''' */

'''Topic: Parallelizing an application'''

Pick another parallel application, not covered in the text, and less than 7 years old, and describe the various steps in parallelizing it (decomposition, assignment, orchestration, and mapping). You may use an example from the peer-reviewed literature, or a Web page. You do not have to go into great detail, but you should describe enough about these four stages to make the algorithm interesting.

=='''LAMMPS Algorithm:''' ==

The LAMMPS ('''L'''arge Scale '''A'''tomic/'''M'''olecular '''M'''assively '''P'''arallel '''S'''ystem) suite is a classical [http://en.wikipedia.org/wiki/Molecular_dynamics molecular dynamics] code developed at [http://www.sandia.gov/index.html Sandia National Labs], New Mexico. It models an ensemble of particles in a solid, liquid or gaseous state. It can model atomic, polymeric, biological, metallic, or granular systems using a variety of force fields and boundary conditions, and it can be easily modified and extended. LAMMPS is distributed as [http://lammps.sandia.gov/download.html open source code]. It is an integral part of the SPEC MPI 2007 package used to benchmark systems using the Message-Passing Interface. LAMMPS was created in 2003.

==='''Sequential Algorithm:'''===
This algorithm is performed for every atom. The initialization step sets up the various parameters for the atoms, such as the number of particles, initial velocity and temperature. Once initialization is complete, the required quantities are calculated (the flow chart below shows force/energy as an example). After these are calculated, the necessary boundary conditions are applied and the atom schemes are integrated to get the desired results. This step is repeated for all the atom schemes. Once the results for all the atom schemes are complete, they are analyzed and presented using visualization schemes for further study.

[[Image:FlowChart.jpeg]]

==='''Decomposition & Assignment'''===
The LAMMPS algorithm provides two levels of concurrency in a single time step, just like the ocean problem. Function parallelism is exploited across the grid, where quantities such as the force, energy, temperature and pressure of the atoms are computed. Data parallelism is exploited by applying the same function to different data sets.

The LAMMPS algorithm decomposes the domain into a set of equal-sized boxes. Since nearby atoms are placed on the same processor, only neighboring atoms on different processors need to be communicated. The decomposition is spatial; the computation cost is O(N/P) and the communication cost is O(N/P), where N is the number of atoms and P is the number of processors. Note that load imbalance is possible because the domain is decomposed into equal-sized boxes regardless of how many atoms each box contains.
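The assignment of atoms to equal-sized boxes can be sketched as follows. This is a minimal illustration under stated assumptions, not LAMMPS code: the <code>Domain</code> type and the <code>owner_of</code> function are hypothetical names, the simulation box is assumed to be axis-aligned, and boxes are assumed to be numbered with x varying fastest.

```cpp
#include <array>
#include <cassert>
#include <cmath>

// Hypothetical description of the simulation domain (not a LAMMPS type).
struct Domain {
  std::array<double, 3> lo;  // lower corner of the simulation box
  std::array<double, 3> hi;  // upper corner of the simulation box
  std::array<int, 3> procs;  // number of equal-sized boxes along x, y, z
};

// Return the linear index of the box (and hence the processor) that owns
// the atom at position `x`.
int owner_of(const Domain& d, const std::array<double, 3>& x) {
  int idx[3];
  for (int k = 0; k < 3; ++k) {
    assert(x[k] >= d.lo[k] && x[k] <= d.hi[k]);
    const double width = (d.hi[k] - d.lo[k]) / d.procs[k];
    idx[k] = static_cast<int>(std::floor((x[k] - d.lo[k]) / width));
    // A position exactly on the upper boundary belongs to the last box.
    if (idx[k] >= d.procs[k]) idx[k] = d.procs[k] - 1;
  }
  return idx[0] + d.procs[0] * (idx[1] + d.procs[1] * idx[2]);
}
```

Because every box has the same size, finding the owner is a constant-time computation per atom. A region that is dense with atoms still maps to a single processor, which is the source of the load imbalance mentioned above.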

==='''Orchestration''' ===

For computational efficiency, LAMMPS uses neighbor lists to keep track of neighboring particles. The lists are optimized for systems with particles that are repulsive at short distances, so that the local density of particles never becomes too large. Communication is also kept to a minimum by replicating the force computations of boundary atoms. To increase computational efficiency further, the algorithm uses different timescales for different force computations. On parallel machines, LAMMPS uses spatial-decomposition techniques to partition the simulation domain into small three-dimensional sub-domains, one of which is assigned to each processor.

The following snippet of code shows how the temperature is calculated. A sub-domain is owned by a single processor, and that constitutes a task. The temperature is derived from the kinetic energy of the atoms. For each atom in the sub-domain, the squares of the velocity components in each dimension are summed and multiplied by the mass of the atom, and the result is accumulated over the sub-domain. The MPI_Allreduce function then sums the accumulated values of all the processors in the world and returns the total to every processor, where it is scaled to give the temperature of the whole domain.

 double ComputeTemp::compute_scalar()
 {
   // per-atom arrays owned by this processor
   double **v = atom->v;
   double *mass = atom->mass;
   double *rmass = atom->rmass;
   int *type = atom->type;
   int *mask = atom->mask;
   int nlocal = atom->nlocal;
   // accumulate m*|v|^2 over the local atoms that belong to the group
   double t = 0.0;
   if (mass) {
     // mass is stored per atom type
     for (int i = 0; i < nlocal; i++)
       if (mask[i] & groupbit)
         t += (v[i][0]*v[i][0] + v[i][1]*v[i][1] + v[i][2]*v[i][2]) *
           mass[type[i]];
   } else {
     // mass is stored per atom
     for (int i = 0; i < nlocal; i++)
       if (mask[i] & groupbit)
         t += (v[i][0]*v[i][0] + v[i][1]*v[i][1] + v[i][2]*v[i][2]) * rmass[i];
   }
   // sum the local values of all processors; every processor receives the total
   MPI_Allreduce(&t,&scalar,1,MPI_DOUBLE,MPI_SUM,world);
   if (dynamic) recount();
   scalar *= tfactor;
   return scalar;
 }

The next snippet of code computes the six independent components of the symmetric kinetic energy tensor (xx, yy, zz, xy, xz and yz). For each atom in the sub-domain, the mass is multiplied by the corresponding pair of velocity components and accumulated. The MPI_Allreduce function then sums the six-element vectors from all the sub-domains and returns the totals to every processor. The six items passed to MPI_Allreduce are these six tensor components.

 void ComputeTemp::compute_vector()
 {
   int i;
   // per-atom arrays owned by this processor
   double **v = atom->v;
   double *mass = atom->mass;
   double *rmass = atom->rmass;
   int *type = atom->type;
   int *mask = atom->mask;
   int nlocal = atom->nlocal;
   // t holds the local xx, yy, zz, xy, xz, yz components
   double massone,t[6];
   for (i = 0; i < 6; i++) t[i] = 0.0;
   for (i = 0; i < nlocal; i++)
     if (mask[i] & groupbit) {
       // mass is stored either per atom type or per atom
       if (mass) massone = mass[type[i]];
       else massone = rmass[i];
       t[0] += massone * v[i][0]*v[i][0];
       t[1] += massone * v[i][1]*v[i][1];
       t[2] += massone * v[i][2]*v[i][2];
       t[3] += massone * v[i][0]*v[i][1];
       t[4] += massone * v[i][0]*v[i][2];
       t[5] += massone * v[i][1]*v[i][2];
     }
   // element-wise sum over all processors; every processor receives the totals
   MPI_Allreduce(t,vector,6,MPI_DOUBLE,MPI_SUM,world);
   for (i = 0; i < 6; i++) vector[i] *= force->mvv2e;
 }
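The reduction pattern used by the two routines above can be tried without an MPI installation. The sketch below is an illustration only: <code>Atom</code>, <code>local_mvv</code> and <code>allreduce_sum</code> are hypothetical names, and <code>allreduce_sum</code> is a serial stand-in that mimics what MPI_Allreduce with MPI_SUM computes (an element-wise sum that every processor would receive).

```cpp
#include <array>
#include <cassert>
#include <vector>

using Tensor6 = std::array<double, 6>;  // xx, yy, zz, xy, xz, yz

// Hypothetical per-atom record (not a LAMMPS type).
struct Atom {
  double mass;
  std::array<double, 3> v;  // velocity components
};

// Local part: what one processor accumulates over the atoms it owns.
Tensor6 local_mvv(const std::vector<Atom>& atoms) {
  Tensor6 t{};  // value-initialized to zeros
  for (const Atom& a : atoms) {
    t[0] += a.mass * a.v[0] * a.v[0];
    t[1] += a.mass * a.v[1] * a.v[1];
    t[2] += a.mass * a.v[2] * a.v[2];
    t[3] += a.mass * a.v[0] * a.v[1];
    t[4] += a.mass * a.v[0] * a.v[2];
    t[5] += a.mass * a.v[1] * a.v[2];
  }
  return t;
}

// Serial stand-in for MPI_Allreduce with MPI_SUM: the element-wise sum of
// every processor's contribution. In MPI each rank would receive this result.
Tensor6 allreduce_sum(const std::vector<Tensor6>& per_proc) {
  assert(!per_proc.empty());
  Tensor6 total{};
  for (const Tensor6& t : per_proc)
    for (int k = 0; k < 6; ++k) total[k] += t[k];
  return total;
}
```

Summing the per-processor results gives the same six values as a single pass over all the atoms, which is why each processor only needs to visit the atoms in its own sub-domain.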

==='''Mapping''' ===

The LAMMPS suite uses the message-passing parallel computing model. Each processor has its own private memory and holds only the data for its part of the problem; it performs its operations locally and sends, receives and broadcasts data as necessary. The LAMMPS suite defines a Universe to which all the processors belong. The Universe can be divided into a number of Worlds so that different, unrelated simulations can run at the same time. If all the available processors are used to tackle a single problem, the Universe contains one World. Each processor runs its own copy of the LAMMPS suite and knows some information about the Universe: its processor ID, the number of processors in the Universe, the World it belongs to, the number of processors in its World and the total number of Worlds. Each processor also has information about all the atoms and molecules in its sub-domain and their count. In each World, one processor is designated the Root processor. The message-passing interface lets each processor communicate with its six neighboring processors in its three-dimensional World. Each processor works within its own three-dimensional box, where it is responsible for a collection of atoms. The LAMMPS suite divides the atoms among processors spatially in what is called "spatial decomposition," and every processor owns a box of the same size. Although this design is scalable, it does not guarantee load balancing, because the boxes may hold different numbers of atoms.
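The bookkeeping each processor keeps about the Universe and its World can be sketched as below. This is a hedged illustration rather than the LAMMPS implementation: <code>WorldInfo</code> and <code>locate</code> are hypothetical names, and the sketch assumes that Worlds occupy consecutive ranges of Universe ranks and that the first processor of each World acts as its Root.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Hypothetical record of what one processor knows about the Universe.
struct WorldInfo {
  int universe_rank;  // processor ID within the Universe
  int universe_size;  // number of processors in the Universe
  int world_id;       // which World this processor belongs to
  int world_size;     // number of processors in that World
  int world_rank;     // processor ID within the World
  int num_worlds;     // total number of Worlds
  bool is_root;       // true for the first processor of the World
};

// Locate a processor, assuming Worlds occupy consecutive rank ranges whose
// lengths are given by `world_sizes`.
WorldInfo locate(const std::vector<int>& world_sizes, int universe_rank) {
  const int universe_size =
      std::accumulate(world_sizes.begin(), world_sizes.end(), 0);
  assert(universe_rank >= 0 && universe_rank < universe_size);
  int first = 0;  // Universe rank of the first processor in World w
  int w = 0;
  while (universe_rank >= first + world_sizes[w]) {
    first += world_sizes[w];
    ++w;
  }
  const int world_rank = universe_rank - first;
  return {universe_rank, universe_size, w, world_sizes[w], world_rank,
          static_cast<int>(world_sizes.size()), world_rank == 0};
}
```

With Worlds of 2, 4 and 2 processors, <code>locate({2, 4, 2}, 5)</code> places Universe rank 5 in World 1 as World rank 3. When a single World spans the whole Universe, the World rank equals the Universe rank.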

'''References'''

1: [http://lammps.sandia.gov/ LAMMPS website]
2: [http://etd.lib.fsu.edu/theses/available/etd-07122004-165317/unrestricted/02_JK_RestThesis.pdf Sequential algorithm]

CSC/ECE 506 Fall 2007/wiki2 4 LA

2007-09-25T00:03:20Z

Laaboue: /* '''Mapping''' */

'''Topic: Parallelizing an application'''

Pick another parallel application, not covered in the text, and less than 7 years old, and describe the various steps in parallelizing it (decomposition, assignment, orchestration, and mapping). You may use an example from the peer-reviewed literature, or a Web page. You do not have to go into great detail, but you should describe enough about these four stages to make the algorithm interesting.

=='''LAMMPS Algorithm:''' ==

The LAMMPS ('''L'''arge Scale '''A'''tomic/'''M'''olecular '''M'''assively '''P'''arallel '''S'''ystem) algorithm is a classical [http://en.wikipedia.org/wiki/Molecular_dynamics molecular dynamics] code developed at [http://www.sandia.gov/index.html Sandia National Labs], New Mexico. It models an ensemble of particles in a solid, liquid or gaseous state. It can model atomic, polymeric, biological, metallic, or granular systems using a variety of force fields and boundary conditions, and it can be easily modified and extended. LAMMPS is distributed as [http://lammps.sandia.gov/download.html open source code].

==='''Sequential Algorithm:'''===
This algorithm is performed for every atom. The initialization step sets up the various parameters for the atoms, such as the number of particles, initial velocity and temperature. Once initialization is complete, the required quantities are calculated (the flow chart below shows force/energy as an example). After these are calculated, the necessary boundary conditions are applied and the atom schemes are integrated to get the desired results. This step is repeated for all the atom schemes. Once the results for all the atom schemes are complete, they are analyzed and presented using visualization schemes for further study.

[[Image:FlowChart.jpeg]]

==='''Decomposition & Assignment'''===
The LAMMPS algorithm provides two levels of concurrency in a single time step, just like the ocean problem. Function parallelism is exploited across the grid, where quantities such as the force, energy, temperature and pressure of the atoms are computed. Data parallelism is exploited by applying the same function to different data sets.

The LAMMPS algorithm decomposes the domain into a set of equal-sized boxes. Since nearby atoms are placed on the same processor, only neighboring atoms on different processors need to be communicated. The decomposition is spatial; the computation cost is O(N/P) and the communication cost is O(N/P), where N is the number of atoms and P is the number of processors. Note that load imbalance is possible because the domain is decomposed into equal-sized boxes regardless of how many atoms each box contains.

==='''Orchestration''' ===

For computational efficiency, LAMMPS uses neighbor lists to keep track of neighboring particles. The lists are optimized for systems with particles that are repulsive at short distances, so that the local density of particles never becomes too large. Communication is also kept to a minimum by replicating the force computations of boundary atoms. To increase computational efficiency further, the algorithm uses different timescales for different force computations. On parallel machines, LAMMPS uses spatial-decomposition techniques to partition the simulation domain into small three-dimensional sub-domains, one of which is assigned to each processor.

The following snippet of code shows how the temperature is calculated. A sub-domain is owned by a single processor, and that constitutes a task. The temperature is derived from the kinetic energy of the atoms. For each atom in the sub-domain, the squares of the velocity components in each dimension are summed and multiplied by the mass of the atom, and the result is accumulated over the sub-domain. The MPI_Allreduce function then sums the accumulated values of all the processors in the world and returns the total to every processor, where it is scaled to give the temperature of the whole domain.

 double ComputeTemp::compute_scalar()
 {
   // per-atom arrays owned by this processor
   double **v = atom->v;
   double *mass = atom->mass;
   double *rmass = atom->rmass;
   int *type = atom->type;
   int *mask = atom->mask;
   int nlocal = atom->nlocal;
   // accumulate m*|v|^2 over the local atoms that belong to the group
   double t = 0.0;
   if (mass) {
     // mass is stored per atom type
     for (int i = 0; i < nlocal; i++)
       if (mask[i] & groupbit)
         t += (v[i][0]*v[i][0] + v[i][1]*v[i][1] + v[i][2]*v[i][2]) *
           mass[type[i]];
   } else {
     // mass is stored per atom
     for (int i = 0; i < nlocal; i++)
       if (mask[i] & groupbit)
         t += (v[i][0]*v[i][0] + v[i][1]*v[i][1] + v[i][2]*v[i][2]) * rmass[i];
   }
   // sum the local values of all processors; every processor receives the total
   MPI_Allreduce(&t,&scalar,1,MPI_DOUBLE,MPI_SUM,world);
   if (dynamic) recount();
   scalar *= tfactor;
   return scalar;
 }

The next snippet of code computes the six independent components of the symmetric kinetic energy tensor (xx, yy, zz, xy, xz and yz). For each atom in the sub-domain, the mass is multiplied by the corresponding pair of velocity components and accumulated. The MPI_Allreduce function then sums the six-element vectors from all the sub-domains and returns the totals to every processor. The six items passed to MPI_Allreduce are these six tensor components.

 void ComputeTemp::compute_vector()
 {
   int i;
   // per-atom arrays owned by this processor
   double **v = atom->v;
   double *mass = atom->mass;
   double *rmass = atom->rmass;
   int *type = atom->type;
   int *mask = atom->mask;
   int nlocal = atom->nlocal;
   // t holds the local xx, yy, zz, xy, xz, yz components
   double massone,t[6];
   for (i = 0; i < 6; i++) t[i] = 0.0;
   for (i = 0; i < nlocal; i++)
     if (mask[i] & groupbit) {
       // mass is stored either per atom type or per atom
       if (mass) massone = mass[type[i]];
       else massone = rmass[i];
       t[0] += massone * v[i][0]*v[i][0];
       t[1] += massone * v[i][1]*v[i][1];
       t[2] += massone * v[i][2]*v[i][2];
       t[3] += massone * v[i][0]*v[i][1];
       t[4] += massone * v[i][0]*v[i][2];
       t[5] += massone * v[i][1]*v[i][2];
     }
   // element-wise sum over all processors; every processor receives the totals
   MPI_Allreduce(t,vector,6,MPI_DOUBLE,MPI_SUM,world);
   for (i = 0; i < 6; i++) vector[i] *= force->mvv2e;
 }

==='''Mapping''' ===

The LAMMPS suite uses the message-passing parallel computing model. Each processor has its own private memory and holds only the data for its part of the problem; it performs its operations locally and sends, receives and broadcasts data as necessary. The LAMMPS suite defines a Universe to which all the processors belong. The Universe can be divided into a number of Worlds so that different, unrelated simulations can run at the same time. If all the available processors are used to tackle a single problem, the Universe contains one World. Each processor runs its own copy of the LAMMPS suite and knows some information about the Universe: its processor ID, the number of processors in the Universe, the World it belongs to, the number of processors in its World and the total number of Worlds. Each processor also has information about all the atoms and molecules in its sub-domain and their count. In each World, one processor is designated the Root processor. The message-passing interface lets each processor communicate with its six neighboring processors in its three-dimensional World. Each processor works within its own three-dimensional box, where it is responsible for a collection of atoms. The LAMMPS suite divides the atoms among processors spatially in what is called "spatial decomposition," and every processor owns a box of the same size. Although this design is scalable, it does not guarantee load balancing, because the boxes may hold different numbers of atoms.

'''References'''

1: [http://lammps.sandia.gov/ LAMMPS website]
2: [http://etd.lib.fsu.edu/theses/available/etd-07122004-165317/unrestricted/02_JK_RestThesis.pdf Sequential algorithm]

CSC/ECE 506 Fall 2007/wiki2 3 pa

2007-09-24T06:26:42Z

Laaboue: /* '''Orchestration''' */

CSC/ECE 506 Fall 2007/wiki2 3 pa

2007-09-24T06:25:01Z

Laaboue: /* '''Orchestration''' */

CSC/ECE 506 Fall 2007/wiki2 3 pa

2007-09-24T06:23:33Z

Laaboue: /* '''Orchestration''' */

CSC/ECE 506 Fall 2007/wiki2 3 pa

2007-09-24T03:32:14Z

Laaboue: /* '''Mapping''' */

CSC/ECE 506 Fall 2007/wiki2 3 pa

2007-09-24T02:37:02Z

Laaboue: /* '''Mapping''' */

CSC/ECE 506 Fall 2007/wiki1 4 la

2007-09-10T21:17:56Z

Laaboue:

=Update section 1.1.3: Architectural Trends=

==Microprocessor Design Trends==

Up to 1986, advancements in microprocessors were dominated by ''bit-level parallelism''. It started with 4-bit datapaths, followed by 8-bit, 16-bit and 32-bit wide datapaths. In server design, the established norm has been 64-bit since the start of the millennium. A 128-bit datapath is rarely mentioned for use in microprocessors. However, [http://en.wikipedia.org/wiki/Graphics_processing_unit graphics processors (GPU)] have been using 128-bit and 256-bit wide datapaths and it is possible to see an increase to 512-bit wide datapaths soon, especially with the advancements in computer graphics, animations and gaming.

''[http://en.wikipedia.org/wiki/Instruction_level_parallelism Instruction-level parallelism]'' took off as advancements in ''bit-level parallelism'' receded. After all, the benefits possible by advancements in ''bit-level parallelism'' are limited to the ability to address more storage space and the ability to do more in a single cycle. The latter benefit has been limited to more precise floating point calculation although some microprocessors have the ability to bundle a couple of instructions into one.

The period within the 1980's and 1990's indeed set the stage for the modern microprocessor. [http://en.wikipedia.org/wiki/Superscalar Superscalar] microprocessors were created, which encompassed [http://en.wikipedia.org/wiki/Branch_prediction branch predictors], [http://en.wikipedia.org/wiki/Out_of_order_execution out-of-order execution], deeper and larger levels of [http://en.wikipedia.org/wiki/Cache cache], [http://en.wikipedia.org/wiki/Speculative_execution speculative execution], [http://en.wikipedia.org/wiki/Cache_coherency cache coherency] protocols and the ability to communicate with other microprocessors on chip. Research done in the 1990's and early 2000's set the stage for the next level of parallelism to be exploited: ''[http://en.wikipedia.org/wiki/Thread_level_parallelism thread-level parallelism].''

Two technologies appeared in the 2000's that altered the microprocessor performance race. The first is ''[http://en.wikipedia.org/wiki/Multi-core_%28computing%29 Multi-Core]'' on chip and the second is ''[http://en.wikipedia.org/wiki/Simultaneous_multithreading Simultaneous Multi-Threading (SMT)]'' (also known as ''[http://en.wikipedia.org/wiki/Hyper_threading Hyper-Threading]''). At this point, the industry stopped using [http://en.wikipedia.org/wiki/Clock_speed clock speed] as the performance metric, since a microprocessor encompasses many more intertwined technologies than merely a faster clock. The industry first saw two cores on a single chip, then cores taking advantage of SMT. The number of cores and the number of threads per core in a microprocessor are ever increasing. Dual-core processors and dual-thread processors already exist, with the promise of merging both technologies so that each core can support two threads. Microprocessors with four and eight cores exist today (from AMD and Sun respectively), and sixteen cores on a single chip are expected within a matter of months.

===Clock Speed and Parallelism===

In the PC system world, throughout the 1990's and early 2000's, increasing chip clock speed was the standard way to increase system performance. Desktop processors topped 1GHz clock speeds in 2000, 2GHz in 2001, and topped 3GHz in 2002. But, due to power demands and heat concerns, this trend has since been discontinued. Design obstacles, especially in laptop computers, meant that other methods had to be pursued in order to increase processing power without losing efficiency. The ''Multi-Core'' era was then introduced to the PC world. In the spring of 2005, dual-core chips were introduced by [http://en.wikipedia.org/wiki/Intel Intel] and then by [http://en.wikipedia.org/wiki/Amd AMD]. Quad-core processors have reached the market, and octal-cores may hit the market by 2009.

In 2001, Intel released the [http://en.wikipedia.org/wiki/Itanium Itanium] microprocessor, which takes advantage of explicit ''instruction-level parallelism''. The compiler makes decisions about which instructions to execute in parallel, allowing the processor to execute up to six instructions per clock cycle. Although the original (and several subsequent) Itanium processors contained a single core, Intel released a dual-core Itanium microprocessor in 2006. The future of the Itanium family will follow the trend of most other microprocessors, in that ''thread-level parallelism'' will be exploited via multi-core chips.

===Instruction Sets and Parallelism===

As the focus shifted away from raising clock speed, research in instruction sets took off again in the 1990s to exploit more parallelism with [http://en.wikipedia.org/wiki/Explicitly_Parallel_Instruction_Computing Explicitly Parallel Instruction Computing (EPIC)]. This technology was implemented in the Itanium processor. It relies on software to expose more parallelism among instructions. In the early 2000s, support for [http://en.wikipedia.org/wiki/Multiprocessing multiprocessors] was added to instruction sets. This was done by allowing multiprocessors to communicate gluelessly. Multiprocessors are increasingly able to communicate in a point-to-point fashion without the need for extra hardware or software.

In 1999, the [http://en.wikipedia.org/wiki/Streaming_SIMD_Extensions Streaming SIMD Extensions (SSE)] instruction set was introduced by Intel. This instruction set added eight new 128 bit registers and 70 floating point instructions. In 2000, Intel added a complete complement of integer instructions and 64-bit SIMD floating point instructions to the original SSE registers when they introduced the SSE2 instruction set. In 2004, a revision of Intel's Pentium 4 processor introduced the SSE3 instruction set. This instruction set added specific memory and thread-handling instructions, which improved the performance of Intel's HyperThreading technology.

In an attempt to keep pace with Intel, AMD licensed the SSE3 instruction set and implemented most of its instructions in particular Athlon 64 processors. In the summer of 2007, AMD introduced a new extension of the x86 instruction set: [http://developer.amd.com/sse5.jsp SSE5]. This extension was designed to increase application efficiency and performance by allowing software developers to simplify code and by providing them with additional capabilities.

===Silicon Technologies===

In 1998, IBM announced its first [http://en.wikipedia.org/wiki/Powerpc PowerPC] microprocessor designed using copper wiring. IBM claimed that this technology boosted performance by up to a third. In 2004, it announced that it was developing chips using [http://en.wikipedia.org/wiki/Silicon_on_insulator Silicon-On-Insulator (SOI)] technology, which saved a significant amount of power. Finally, in 2007, Intel and IBM announced that they were able to produce a [http://en.wikipedia.org/wiki/High-k_dielectric high-K] material and electrode metals (instead of [http://en.wikipedia.org/wiki/Polysilicon polysilicon]) that will enable the mass production of chips in 45nm technology. Dual-core and dual-threaded microprocessors have already been designed in 65nm technology. Designing microprocessors in 45nm technology will enable adding more cores and cache to the chip, among other features. Coupled with the technologies mentioned earlier, performance will increase and power consumption will be kept at bay, thus continuing the legacy of [http://en.wikipedia.org/wiki/Moore%27s_law Moore's Law].

==System Design Trends==

System design has become a very diverse field. There are systems that use a single backplane supporting a small number of microprocessors. Although the number of microprocessors has slowly been inching up, such technology has been limited to desktops and workstations. Larger workloads need more microprocessors, and vendors have found different ways to gather those microprocessors into a single system. Some companies took on the challenge of packing many microprocessors into a single system using a shared bus. That challenge has been so tough that only a couple of companies, such as IBM and HP, are pursuing it. Other companies pursued different technologies, such as [http://en.wikipedia.org/wiki/Ccnuma#Cache_coherent_NUMA_.28ccNUMA.29 ccNUMA] and [http://en.wikipedia.org/wiki/Blade_server blade servers], for tight clustering. Larger clusters use computer-to-computer links, such as [http://en.wikipedia.org/wiki/Infiniband Infiniband]. Such clusters enter the realm of [http://en.wikipedia.org/wiki/Supercomputer supercomputing], which deserves its [http://pg.ece.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Fall_2007/wiki1_5_1008 own topic].

===PC Direction===

The number of supported microprocessors in a computer is ever increasing. Since the mid 2000's, the norm has increasingly been to support more than one processor in a desktop computer (with laptops following closely behind). Intel and AMD are in a constant race to provide a stronger chip which provides higher performance (with multiple cores) and higher bandwidth (with faster electrical signaling, wider datapaths, pipelined protocols, multiple paths and software support).

===Server Direction===

Figure 1 shows the number of processors that have been supported in a shared bus (for the past decade). A commonality between the technology appearing this decade and in the last decade is that servers throughout these decades supported either a single core or a dual core microprocessor. The industry has been inching towards supporting 100 microprocessors on a single shared bus. Because the bus has a fixed bandwidth, such an approach was bound to reach a dead end if new levels of indirection were not exploited. Indeed, new technologies have made supporting more microprocessors on a shared bus more feasible. Among these technologies are multiple cores per chip, deeper levels of caching and better addressing schemes. Consider a microprocessor with multiple cores as a node. Nodes communicate, and it is left up to the microprocessor to arbitrate between cores, thus relieving the shared bus from this addressing strain. With the constant improvements in multiple core support within a chip, it is possible to see servers with over two hundred cores as soon as this decade.

<center>http://upload.wikimedia.org/wikipedia/commons/3/32/Procs.JPG</center>
<center>Figure 1. Number of processors in fully configured commercial bus-based shared-memory multiprocessors.</center>
 

A different class of servers is emerging which is neither an SMP nor a cluster. It is called [http://en.wikipedia.org/wiki/Ccnuma#Cache_coherent_NUMA_.28ccNUMA.29 ccNUMA], which stands for Cache-Coherent Non-Uniform Memory Access. Such servers provide faster access to local memory than to remote memory, while the different copies of the same data are kept up to date through cache-coherency protocols. This technology is supported by Intel and AMD. Another server manufacturer supporting this technology is [http://en.wikipedia.org/wiki/Silicon_Graphics SGI], with its [http://en.wikipedia.org/wiki/SGI_Origin_350 Origin 350] server supporting up to 32 microprocessors.

===Shared Memory Bus Direction===

As microprocessors become faster, and more and more microprocessors (all sharing a common bus) are added to a system, the [http://en.wikipedia.org/wiki/Bandwidth bandwidth] of the bus becomes ever more critical. As shown in Figure 2, the shared bus bandwidth of commercial multiprocessors has increased with time. Various technologies and techniques have been implemented to increase bus bandwidth, such as faster electrical signaling, wider datapaths, pipelined protocols, and multiple paths. In 2001, a bidirectional serial/parallel high-bandwidth, low-latency point to point link called [http://en.wikipedia.org/wiki/Hyper_transport HyperTransport (HT)] was introduced. HT runs from 200 MHz to 2.6 GHz. It is used in many processors and in high-performance computing. HT has also been used as an interconnect for NUMA multiprocessor systems (see above).

Techniques have also been implemented to alleviate the strain put on the bus. With the [http://en.wikipedia.org/wiki/Pentium_4 Pentium 4], Intel introduced an instruction designed to reduce bus contention: the PAUSE instruction, which reduces the bus transactions that occur when spin lock code repeatedly tries to test and set a memory location.
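The spin lock pattern that PAUSE targets can be sketched in portable C++. The <code>SpinLock</code> class below is a hypothetical illustration, not Intel's reference code. It uses the "test, then test-and-set" idea: a waiting thread first reads the lock with a plain load and only attempts the atomic write when the lock appears free. The PAUSE hint itself is x86-specific, so it is only indicated by a comment.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical minimal spin lock, for illustration only.
class SpinLock {
 public:
  // Try to take the lock once; returns true on success.
  bool try_lock() {
    // Test first with a plain load, so a held lock causes no write traffic.
    if (locked_.load(std::memory_order_relaxed)) return false;
    // Only then attempt the atomic test-and-set.
    return !locked_.exchange(true, std::memory_order_acquire);
  }

  // Spin until the lock is acquired.
  void lock() {
    while (!try_lock()) {
      // On x86, a PAUSE hint (for example the _mm_pause intrinsic) would be
      // issued here; it is omitted to keep the sketch portable.
    }
  }

  void unlock() { locked_.store(false, std::memory_order_release); }

 private:
  std::atomic<bool> locked_{false};
};
```

A thread calls <code>lock()</code> before entering a critical section and <code>unlock()</code> after leaving it. While the lock is held by another thread, the waiting thread spins on the read-only test rather than repeating the atomic exchange.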

<center>http://upload.wikimedia.org/wikipedia/commons/5/5e/Bandwidth.JPG</center>
 
<center>Figure 2. Bandwidth of the shared memory bus in commercial multiprocessors.</center>

==References==
Culler DE, Singh JP, Gupta A. Parallel Computer Architecture: A Hardware/Software Approach. San Francisco, CA: Morgan Kaufmann Publishers, Inc., 1999. 
http://compoundsemiconductor.net/articles/news/11/1/25 
http://www-03.ibm.com/servers/eserver/pseries/hardware/whitepapers/power/ppc_arch.html 
http://www-05.ibm.com/se/news/sv/2007/05/power-timeline.html 
http://www.amd.com/us-en/assets/content_type/DownloadableAssets/MPF_Hammer_Presentation.PDF 
http://www.demandtech.com/Resources/Papers/Multiprocessor%20scalability.pdf 
http://www.endian.net/details.aspx?ItemNo=655 
http://www.hpcwire.com/hpc/1754487.html 
http://www.hypertransport.org/ 
http://www.mbipr.com/whitepaper5.pdf 
http://www.sgi.com/products/remarketed/offering.html 
http://www.sun.com/processors/ 
http://www.theinquirer.net/?article=9235

CSC/ECE 506 Fall 2007/wiki1 4 la

2007-09-10T01:24:51Z

Laaboue: /* System Design Trends */

=Update section 1.1.3: Architectural Trends=

==Microprocessor Design Trends==

The textbook notes that up to 1986, advancements in microprocessors were dominated by ''bit-level parallelism''. It started with 4-bit datapaths, followed by 8-bit, 16-bit and 32-bit wide datapaths. In server design, the established norm has been 64-bit since the start of the millennium. A 128-bit datapath is rarely mentioned for use in microprocessors. However, graphics processors have been using 128-bit and 256-bit wide datapaths and it is possible to see an increase to 512-bit wide datapaths soon, especially with the advancements in computer graphics, animations and gaming.

''[http://pg.ece.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Fall_2007/wiki1_4_la Instruction-level parallelism]'' took off as advancements in ''bit-level parallelism'' receded. After all, the benefits possible by advancements in ''bit-level parallelism'' are limited to the ability to address more storage space and the ability to do more in a single cycle. The latter benefit has been limited to more precise floating point calculation although some microprocessors have the ability to bundle a couple of instructions into one.

The period within the 1980's and 1990's indeed set the stage for the modern microprocessor. [http://en.wikipedia.org/wiki/Superscalar Superscalar] microprocessors were created, which encompassed [http://en.wikipedia.org/wiki/Branch_prediction branch predictors], [http://en.wikipedia.org/wiki/Out_of_order_execution out-of-order execution], deeper and larger levels of [http://en.wikipedia.org/wiki/Cache cache] on chip, [http://en.wikipedia.org/wiki/Cache_coherency cache coherency] protocols and the ability to communicate with other microprocessors on chip. Research done in the 1990's and early 2000's set the stage for the next level of parallelism to be exploited: ''[http://en.wikipedia.org/wiki/Thread_level_parallelism thread-level parallelism].''

Two technologies appeared in the 2000's that altered the microprocessor performance race. The first is ''[http://en.wikipedia.org/wiki/Multi-core_%28computing%29 Multi-Core]'' on chip and the second is ''[http://en.wikipedia.org/wiki/Simultaneous_multithreading Simultaneous Multi-Threading (SMT)]'' (also known as ''[http://en.wikipedia.org/wiki/Hyper_threading Hyper-Threading]''). At this point, the industry stopped using [http://en.wikipedia.org/wiki/Clock_speed clock speed] as the performance metric, since a microprocessor encompasses many more intertwined technologies than merely a faster clock. The industry first saw two cores on a single chip, then cores taking advantage of SMT. The number of cores and the number of threads per core in a microprocessor are ever increasing. Dual-core processors and dual-thread processors already exist, with the promise of merging both technologies so that each core can support two threads. Microprocessors with four and eight cores exist today, and sixteen cores on a single chip are expected within a matter of months.

===Clock Speed and Parallelism===

In the PC system world, throughout the 1990's and early 2000's, increasing chip clock speed was the standard way to increase system performance. Desktop processors topped 1GHz clock speeds in 2000, 2GHz in 2001, and topped 3GHz in 2002. But, due to power demands and heat concerns, this trend has since been discontinued. Design obstacles, especially in laptop computers, meant that other methods had to be pursued in order to increase processing power without losing efficiency. The ''Multi-Core'' era was then introduced to the PC world. In the spring of 2005, dual-core chips were introduced by [http://en.wikipedia.org/wiki/Intel Intel] and then by [http://en.wikipedia.org/wiki/Amd AMD]. Quad-core processors have reached the market, and octal-cores may hit the market by 2009.

In 2001, Intel released the [http://en.wikipedia.org/wiki/Itanium Itanium] microprocessor, which takes advantage of explicit ''instruction-level parallelism''. The compiler decides which instructions to execute in parallel, allowing the processor to execute up to six instructions per clock cycle. Although the original (and several subsequent) Itanium processors contained a single core, Intel released a dual-core Itanium microprocessor in 2006. The future of the Itanium family will follow the trend of most other microprocessors, in that ''thread-level parallelism'' will be exploited via multi-core chips.

===Instruction Sets and Parallelism===

As the emphasis shifted away from raising clock speed, research in instruction sets took off again in the 1990s to exploit more parallelism with [http://en.wikipedia.org/wiki/Explicitly_Parallel_Instruction_Computing Explicitly Parallel Instruction Computing (EPIC)]. This technology was implemented in the Itanium processor. It relies on software, namely the compiler, to expose parallelism among instructions. In the early 2000s, support for multiprocessors was added to instruction sets. This was done by allowing multiprocessors to communicate gluelessly, that is, without additional interface chips. Multiprocessors are increasingly able to communicate in a point-to-point fashion without the need for extra hardware or software.

===Silicon Technologies===

In 1998, IBM announced its first [http://en.wikipedia.org/wiki/Powerpc PowerPC] microprocessor designed using copper wiring. IBM claimed that this technology boosted performance by up to a third. In 2004, it announced chips utilizing [http://en.wikipedia.org/wiki/Silicon_on_insulator Silicon-On-Insulator (SOI)] technology, which saved a significant amount of power. Finally, in 2007, Intel and IBM announced that they were able to produce a [http://en.wikipedia.org/wiki/High-k_dielectric high-K] material and electrode metals (instead of [http://en.wikipedia.org/wiki/Polysilicon polysilicon]) that will enable the mass production of chips in 45nm technology. Dual-core and dual-threaded microprocessors have already been designed in 65nm technology. Designing microprocessors in 45nm technology will enable adding more cores and cache to the chip, among other features. Coupled with the technologies mentioned earlier, performance will increase and power consumption will be kept at bay, continuing the legacy of [http://en.wikipedia.org/wiki/Moore%27s_law Moore's Law].

==System Design Trends==

System design has become a very diverse field. Some systems use a single backplane that supports a small number of microprocessors. That number has been inching up, and the technology has been limited to desktops and workstations. Larger workloads needed more microprocessors, and designers became creative about how to gather those microprocessors into a single system. Some companies took on the challenge of packing many microprocessors into a single system using a shared bus. That challenge has been so tough that only a couple of companies, such as IBM and HP, are pursuing it. Other companies pursued different technologies such as ccNUMA and blade servers for tight clustering. Larger clusters utilize computer-to-computer links such as [http://en.wikipedia.org/wiki/Infiniband Infiniband]. Such clusters enter the realm of [http://en.wikipedia.org/wiki/Supercomputer supercomputing], which deserves its own topic.

===PC Direction===

The number of supported microprocessors in a computer is ever increasing. Since the mid-2000's, the norm has increasingly been to support more than one processor in a desktop computer (with laptops following closely behind). Intel and AMD are in a constant race to provide a stronger chip that offers higher performance (with multiple cores) and higher bandwidth (with faster electrical signaling, wider datapaths, pipelined protocols, multiple paths and software support).

===Server Direction===

Figure 1 shows the number of processors that have been supported in a shared bus this decade. A commonality between the technology appearing this decade and in the last decade is that servers at these times supported either a single core or a dual core microprocessor. The industry has been inching towards supporting 100 microprocessors on a single shared bus. Because the bus has a fixed bandwidth, such an approach was bound to reach a dead end if new levels of indirection were not exploited. Indeed, new technologies have made supporting more microprocessors on a shared bus more feasible. Among these technologies are multiple cores per chip, deeper levels of caching and better addressing schemes. Consider a microprocessor with multiple cores as a node. Nodes communicate, and it is left up to the microprocessor to arbitrate between cores, thus relieving the shared bus from this addressing strain. With the constant improvements in multiple core support within a chip, it is possible to see servers with over two hundred cores as soon as this decade.

<center>http://upload.wikimedia.org/wikipedia/commons/3/32/Procs.JPG</center>
 
<center>Figure 1. Number of processors in fully configured commercial bus-based shared-memory multiprocessors.</center>

A different class of servers is emerging that is neither an SMP nor a cluster. It is called [http://en.wikipedia.org/wiki/Ccnuma#Cache_coherent_NUMA_.28ccNUMA.29 ccNUMA], for Cache-Coherent Non-Uniform Memory Access. Such servers provide faster access to local memory than to remote memory, while the different copies of the same data are kept up to date through cache-coherency protocols. This technology is supported by Intel and AMD. Another server manufacturer supporting this technology is [http://en.wikipedia.org/wiki/Silicon_Graphics SGI], with its [http://en.wikipedia.org/wiki/SGI_Origin_350 Origin 350] server supporting up to 32 microprocessors.

===Shared Memory Bus Direction===

As microprocessors become faster, and more and more microprocessors (all sharing a common bus) are added to a system, the [http://en.wikipedia.org/wiki/Bandwidth bandwidth] of the bus becomes ever more critical. As shown in Figure 2, the shared bus bandwidth of commercial multiprocessors has increased with time. Various technologies and techniques have been implemented to increase bus bandwidth, such as faster electrical signaling, wider datapaths, pipelined protocols, and multiple paths. In 2001, a bidirectional serial/parallel high-bandwidth, low-latency point to point link called [http://en.wikipedia.org/wiki/Hyper_transport HyperTransport (HT)] was introduced. HT runs from 200 MHz to 2.6 GHz. It is used in many processors and in high-performance computing. HT has also been used as an interconnect for NUMA multiprocessor systems (see above).

Techniques have also been implemented to alleviate the strain put on the bus. With the [http://en.wikipedia.org/wiki/Pentium_4 Pentium 4], Intel introduced the PAUSE instruction to make spin-wait loops cheaper. PAUSE hints to the processor that the code is spinning, which reduces the cost incurred when spin-lock code repeatedly tests a memory location.
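The spin-lock behavior described above can be sketched in a few lines. This is a minimal illustration, not code from any of the cited sources: the names <code>SpinLock</code>, <code>with_lock</code> and <code>count_with_spinlock</code> are hypothetical, and the sketch assumes Rust's standard <code>std::hint::spin_loop()</code>, which on x86 targets is expected to compile to PAUSE. The waiting thread spins on a plain read and only retries the atomic test-and-set once the lock looks free, which is the part that keeps waiting threads from generating repeated write traffic.

```rust
use std::cell::UnsafeCell;
use std::hint;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical test-and-test-and-set spin lock (illustrative only).
pub struct SpinLock<T> {
    locked: AtomicBool,
    value: UnsafeCell<T>,
}

// Sharing is sound because `value` is only touched while `locked` is held.
unsafe impl<T: Send> Sync for SpinLock<T> {}

impl<T> SpinLock<T> {
    pub fn new(value: T) -> Self {
        SpinLock {
            locked: AtomicBool::new(false),
            value: UnsafeCell::new(value),
        }
    }

    /// Runs `f` with exclusive access to the protected value.
    pub fn with_lock<R>(&self, f: impl FnOnce(&mut T) -> R) -> R {
        loop {
            // Atomic test-and-set attempt: this is the operation that writes.
            if self
                .locked
                .compare_exchange_weak(false, true, Ordering::Acquire, Ordering::Relaxed)
                .is_ok()
            {
                break;
            }
            // While the lock is held, spin on a read-only load and issue the
            // spin-wait hint (assumed to map to PAUSE on x86) each iteration.
            while self.locked.load(Ordering::Relaxed) {
                hint::spin_loop();
            }
        }
        // Exclusive access is guaranteed here because we hold the lock.
        let result = f(unsafe { &mut *self.value.get() });
        self.locked.store(false, Ordering::Release);
        result
    }
}

/// Hypothetical driver: `threads` workers each increment a shared counter
/// `iters` times under the spin lock; returns the final count.
pub fn count_with_spinlock(threads: usize, iters: usize) -> usize {
    let lock = Arc::new(SpinLock::new(0usize));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let lock = Arc::clone(&lock);
            thread::spawn(move || {
                for _ in 0..iters {
                    lock.with_lock(|n| *n += 1);
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    lock.with_lock(|n| *n)
}
```

Calling <code>count_with_spinlock(4, 1000)</code> should return 4000 if the lock provides mutual exclusion. A production lock would also release on panic and back off to the scheduler under long waits; both are omitted here to keep the sketch short.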

<center>http://upload.wikimedia.org/wikipedia/commons/5/5e/Bandwidth.JPG</center>
 
<center>Figure 2. Bandwidth of the shared memory bus in commercial multiprocessors.</center>

==References==
Culler DE, Singh JP, Gupta A. Parallel Computer Architecture: A Hardware/Software Approach. San Francisco, CA: Morgan Kaufmann Publishers, Inc., 1999. 
http://compoundsemiconductor.net/articles/news/11/1/25 
http://www-03.ibm.com/servers/eserver/pseries/hardware/whitepapers/power/ppc_arch.html 
http://www-05.ibm.com/se/news/sv/2007/05/power-timeline.html 
http://www.demandtech.com/Resources/Papers/Multiprocessor%20scalability.pdf 
http://www.endian.net/details.aspx?ItemNo=655 
http://www.hypertransport.org/ 
http://www.mbipr.com/whitepaper5.pdf 
http://www.sgi.com/products/remarketed/offering.html 
http://www.sun.com/processors/ 
http://www.theinquirer.net/?article=9235
http://www.amd.com/us-en/assets/content_type/DownloadableAssets/MPF_Hammer_Presentation.PDF

CSC/ECE 506 Fall 2007/wiki1 4 la

2007-09-10T01:22:41Z

Laaboue: /* Update section 1.1.3: Architectural Trends */

=Update section 1.1.3: Architectural Trends=

==Microprocessor Design Trends==

The textbook notes that up to 1986, advancements in microprocessors were dominated by ''bit-level parallelism''. Datapaths started at 4 bits wide, followed by 8-bit, 16-bit and 32-bit datapaths. In server design, 64-bit has been the norm since the start of the millennium. 128-bit datapaths are rarely mentioned for microprocessors. However, graphics processors have been using 128-bit and 256-bit wide datapaths, and an increase to 512-bit wide datapaths may come soon, especially with the advancements in computer graphics, animation and gaming.

''[http://pg.ece.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Fall_2007/wiki1_4_la Instruction-level parallelism]'' took off as advancements in ''bit-level parallelism'' receded. After all, the benefits possible by advancements in ''bit-level parallelism'' are limited to the ability to address more storage space and the ability to do more in a single cycle. The latter benefit has been limited to more precise floating point calculation although some microprocessors have the ability to bundle a couple of instructions into one.

The period within the 1980's and 1990's indeed set the stage for the modern microprocessor. [http://en.wikipedia.org/wiki/Superscalar Superscalar] microprocessors were created, which encompassed [http://en.wikipedia.org/wiki/Branch_prediction branch predictors], [http://en.wikipedia.org/wiki/Out_of_order_execution out-of-order execution], deeper and larger levels of [http://en.wikipedia.org/wiki/Cache cache] on chip, [http://en.wikipedia.org/wiki/Cache_coherency cache coherency] protocols and the ability to communicate with other microprocessors on chip. Research done in the 1990's and early 2000's set the stage for the next level of parallelism to be exploited: ''[http://en.wikipedia.org/wiki/Thread_level_parallelism thread-level parallelism].''

Two technologies appeared in the 2000's that altered the microprocessor performance race. The first is ''[http://en.wikipedia.org/wiki/Multi-core_%28computing%29 Multi-Core]'' on chip and the second is ''[http://en.wikipedia.org/wiki/Simultaneous_multithreading Simultaneous Multi-Threading (SMT)]'' (also known as ''[http://en.wikipedia.org/wiki/Hyper_threading Hyper-Threading]''). At this point the industry moved away from using [http://en.wikipedia.org/wiki/Clock_speed clock speed] as the performance metric, since a microprocessor encompasses many more intertwined technologies than a faster clock. The industry first saw two cores on a single chip, and then cores taking advantage of SMT. The number of cores on a chip and the number of threads each core can support are both ever increasing. Dual-core processors and dual-thread processors already exist, with the promise of merging both technologies so that each core can support two threads. There are microprocessors in existence today with four and eight cores, and sixteen cores on a single chip are expected in a matter of months.

===Clock Speed and Parallelism===

In the PC system world, throughout the 1990's and early 2000's, increasing chip clock speed was the standard way to increase system performance. Desktop processors topped 1GHz clock speeds in 2000, 2GHz in 2001, and topped 3GHz in 2002. But, due to power demands and heat concerns, this trend has since been discontinued. Design obstacles, especially in laptop computers, meant that other methods had to be pursued in order to increase processing power without losing efficiency. The ''Multi-Core'' era was then introduced to the PC world. In the spring of 2005, dual-core chips were introduced by [http://en.wikipedia.org/wiki/Intel Intel] and then by [http://en.wikipedia.org/wiki/Amd AMD]. Quad-core processors have reached the market, and octal-cores may hit the market by 2009.

In 2001, Intel released the [http://en.wikipedia.org/wiki/Itanium Itanium] microprocessor, which takes advantage of explicit ''instruction-level parallelism''. The compiler decides which instructions to execute in parallel, allowing the processor to execute up to six instructions per clock cycle. Although the original (and several subsequent) Itanium processors contained a single core, Intel released a dual-core Itanium microprocessor in 2006. The future of the Itanium family will follow the trend of most other microprocessors, in that ''thread-level parallelism'' will be exploited via multi-core chips.

===Instruction Sets and Parallelism===

As the emphasis shifted away from raising clock speed, research in instruction sets took off again in the 1990s to exploit more parallelism with [http://en.wikipedia.org/wiki/Explicitly_Parallel_Instruction_Computing Explicitly Parallel Instruction Computing (EPIC)]. This technology was implemented in the Itanium processor. It relies on software, namely the compiler, to expose parallelism among instructions. In the early 2000s, support for multiprocessors was added to instruction sets. This was done by allowing multiprocessors to communicate gluelessly, that is, without additional interface chips. Multiprocessors are increasingly able to communicate in a point-to-point fashion without the need for extra hardware or software.

===Silicon Technologies===

In 1998, IBM announced its first [http://en.wikipedia.org/wiki/Powerpc PowerPC] microprocessor designed using copper wiring. IBM claimed that this technology boosted performance by up to a third. In 2004, it announced chips utilizing [http://en.wikipedia.org/wiki/Silicon_on_insulator Silicon-On-Insulator (SOI)] technology, which saved a significant amount of power. Finally, in 2007, Intel and IBM announced that they were able to produce a [http://en.wikipedia.org/wiki/High-k_dielectric high-K] material and electrode metals (instead of [http://en.wikipedia.org/wiki/Polysilicon polysilicon]) that will enable the mass production of chips in 45nm technology. Dual-core and dual-threaded microprocessors have already been designed in 65nm technology. Designing microprocessors in 45nm technology will enable adding more cores and cache to the chip, among other features. Coupled with the technologies mentioned earlier, performance will increase and power consumption will be kept at bay, continuing the legacy of [http://en.wikipedia.org/wiki/Moore%27s_law Moore's Law].

==System Design Trends==

System design has become a very diverse field. Some systems use a single backplane that supports a small number of microprocessors. That number has been inching up, and the technology has been limited to desktops and workstations. Larger workloads needed more microprocessors, and designers became creative about how to gather those microprocessors into a single system. Some companies took on the challenge of packing many microprocessors into a single system using a shared bus. That challenge has been so tough that only a couple of companies, such as IBM and HP, are pursuing it. Other companies pursued different technologies such as ccNUMA and blade servers for tight clustering. Larger clusters utilize computer-to-computer links such as Infiniband. Such clusters enter the realm of [http://en.wikipedia.org/wiki/Supercomputer supercomputing], which deserves its own topic.

===PC Direction===

The number of supported microprocessors in a computer is ever increasing. Since the mid-2000's, the norm has increasingly been to support more than one processor in a desktop computer (with laptops following closely behind). Intel and AMD are in a constant race to provide a stronger chip that offers higher performance (with multiple cores) and higher bandwidth (with faster electrical signaling, wider datapaths, pipelined protocols, multiple paths and software support).

===Server Direction===

Figure 1 shows the number of processors that have been supported in a shared bus this decade. A commonality between the technology appearing this decade and in the last decade is that servers at these times supported either a single core or a dual core microprocessor. The industry has been inching towards supporting 100 microprocessors on a single shared bus. Because the bus has a fixed bandwidth, such an approach was bound to reach a dead end if new levels of indirection were not exploited. Indeed, new technologies have made supporting more microprocessors on a shared bus more feasible. Among these technologies are multiple cores per chip, deeper levels of caching and better addressing schemes. Consider a microprocessor with multiple cores as a node. Nodes communicate, and it is left up to the microprocessor to arbitrate between cores, thus relieving the shared bus from this addressing strain. With the constant improvements in multiple core support within a chip, it is possible to see servers with over two hundred cores as soon as this decade.

<center>http://upload.wikimedia.org/wikipedia/commons/3/32/Procs.JPG</center>
 
<center>Figure 1. Number of processors in fully configured commercial bus-based shared-memory multiprocessors.</center>

A different class of servers is emerging that is neither an SMP nor a cluster. It is called [http://en.wikipedia.org/wiki/Ccnuma#Cache_coherent_NUMA_.28ccNUMA.29 ccNUMA], for Cache-Coherent Non-Uniform Memory Access. Such servers provide faster access to local memory than to remote memory, while the different copies of the same data are kept up to date through cache-coherency protocols. This technology is supported by Intel and AMD. Another server manufacturer supporting this technology is [http://en.wikipedia.org/wiki/Silicon_Graphics SGI], with its [http://en.wikipedia.org/wiki/SGI_Origin_350 Origin 350] server supporting up to 32 microprocessors.

===Shared Memory Bus Direction===

As microprocessors become faster, and more and more microprocessors (all sharing a common bus) are added to a system, the [http://en.wikipedia.org/wiki/Bandwidth bandwidth] of the bus becomes ever more critical. As shown in Figure 2, the shared bus bandwidth of commercial multiprocessors has increased with time. Various technologies and techniques have been implemented to increase bus bandwidth, such as faster electrical signaling, wider datapaths, pipelined protocols, and multiple paths. In 2001, a bidirectional serial/parallel high-bandwidth, low-latency point to point link called [http://en.wikipedia.org/wiki/Hyper_transport HyperTransport (HT)] was introduced. HT runs from 200 MHz to 2.6 GHz. It is used in many processors and in high-performance computing. HT has also been used as an interconnect for NUMA multiprocessor systems (see above).

Techniques have also been implemented to alleviate the strain put on the bus. With the [http://en.wikipedia.org/wiki/Pentium_4 Pentium 4], Intel introduced the PAUSE instruction to make spin-wait loops cheaper. PAUSE hints to the processor that the code is spinning, which reduces the cost incurred when spin-lock code repeatedly tests a memory location.

<center>http://upload.wikimedia.org/wikipedia/commons/5/5e/Bandwidth.JPG</center>
 
<center>Figure 2. Bandwidth of the shared memory bus in commercial multiprocessors.</center>

==References==
Culler DE, Singh JP, Gupta A. Parallel Computer Architecture: A Hardware/Software Approach. San Francisco, CA: Morgan Kaufmann Publishers, Inc., 1999. 
http://compoundsemiconductor.net/articles/news/11/1/25 
http://www-03.ibm.com/servers/eserver/pseries/hardware/whitepapers/power/ppc_arch.html 
http://www-05.ibm.com/se/news/sv/2007/05/power-timeline.html 
http://www.demandtech.com/Resources/Papers/Multiprocessor%20scalability.pdf 
http://www.endian.net/details.aspx?ItemNo=655 
http://www.hypertransport.org/ 
http://www.mbipr.com/whitepaper5.pdf 
http://www.sgi.com/products/remarketed/offering.html 
http://www.sun.com/processors/ 
http://www.theinquirer.net/?article=9235
http://www.amd.com/us-en/assets/content_type/DownloadableAssets/MPF_Hammer_Presentation.PDF

CSC/ECE 506 Fall 2007/wiki1 4 la

2007-09-10T01:05:18Z

Laaboue: /* Update section 1.1.3: Architectural Trends */

=Update section 1.1.3: Architectural Trends=

==Microprocessor Design Trends==

The textbook notes that up to 1986, advancements in microprocessors were dominated by ''bit-level parallelism''. Datapaths started at 4 bits wide, followed by 8-bit, 16-bit and 32-bit datapaths. In server design, 64-bit has been the norm since the start of the millennium. 128-bit datapaths are rarely mentioned for microprocessors. However, graphics processors have been using 128-bit and 256-bit wide datapaths, and an increase to 512-bit wide datapaths may come soon, especially with the advancements in computer graphics, animation and gaming.

''[http://pg.ece.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Fall_2007/wiki1_4_la Instruction-level parallelism]'' took off as advancements in ''bit-level parallelism'' receded. After all, the benefits possible by advancements in ''bit-level parallelism'' are limited to the ability to address more storage space and the ability to do more in a single cycle. The latter benefit has been limited to more precise floating point calculation although some microprocessors have the ability to bundle a couple of instructions into one.

The period within the 1980's and 1990's indeed set the stage for the modern microprocessor. [http://en.wikipedia.org/wiki/Superscalar Superscalar] microprocessors were created, which encompassed [http://en.wikipedia.org/wiki/Branch_prediction branch predictors], [http://en.wikipedia.org/wiki/Out_of_order_execution out-of-order execution], deeper and larger levels of [http://en.wikipedia.org/wiki/Cache cache] on chip, [http://en.wikipedia.org/wiki/Cache_coherency cache coherency] protocols and the ability to communicate with other microprocessors on chip. Research done in the 1990's and early 2000's set the stage for the next level of parallelism to be exploited: ''[http://en.wikipedia.org/wiki/Thread_level_parallelism thread-level parallelism].''

Two technologies appeared in the 2000's that altered the microprocessor performance race. The first is ''[http://en.wikipedia.org/wiki/Multi-core_%28computing%29 Multi-Core]'' on chip and the second is ''[http://en.wikipedia.org/wiki/Simultaneous_multithreading Simultaneous Multi-Threading (SMT)]'' (also known as ''[http://en.wikipedia.org/wiki/Hyper_threading Hyper-Threading]''). At this point the industry moved away from using [http://en.wikipedia.org/wiki/Clock_speed clock speed] as the performance metric, since a microprocessor encompasses many more intertwined technologies than a faster clock. The industry first saw two cores on a single chip, and then cores taking advantage of SMT. The number of cores on a chip and the number of threads each core can support are both ever increasing. Dual-core processors and dual-thread processors already exist, with the promise of merging both technologies so that each core can support two threads. There are microprocessors in existence today with four and eight cores, and sixteen cores on a single chip are expected in a matter of months.

===Clock Speed and Parallelism===

In the PC system world, throughout the 1990's and early 2000's, increasing chip clock speed was the standard way to increase system performance. Desktop processors topped 1GHz clock speeds in 2000, 2GHz in 2001, and topped 3GHz in 2002. But, due to power demands and heat concerns, this trend has since been discontinued. Design obstacles, especially in laptop computers, meant that other methods had to be pursued in order to increase processing power without losing efficiency. The ''Multi-Core'' era was then introduced to the PC world. In the spring of 2005, dual-core chips were introduced by [http://en.wikipedia.org/wiki/Intel Intel] and then by [http://en.wikipedia.org/wiki/Amd AMD]. Quad-core processors have reached the market, and octal-cores may hit the market by 2009.

In 2001, Intel released the [http://en.wikipedia.org/wiki/Itanium Itanium] microprocessor, which takes advantage of explicit ''instruction-level parallelism''. The compiler decides which instructions to execute in parallel, allowing the processor to execute up to six instructions per clock cycle. Although the original (and several subsequent) Itanium processors contained a single core, Intel released a dual-core Itanium microprocessor in 2006. The future of the Itanium family will follow the trend of most other microprocessors, in that ''thread-level parallelism'' will be exploited via multi-core chips.

===Instruction Sets and Parallelism===

As the emphasis shifted away from raising clock speed, research in instruction sets took off again in the 1990s to exploit more parallelism with [http://en.wikipedia.org/wiki/Explicitly_Parallel_Instruction_Computing Explicitly Parallel Instruction Computing (EPIC)]. This technology was implemented in the Itanium processor. It relies on software, namely the compiler, to expose parallelism among instructions. In the early 2000s, support for multiprocessors was added to instruction sets. This was done by allowing multiprocessors to communicate gluelessly, that is, without additional interface chips. Multiprocessors are increasingly able to communicate in a point-to-point fashion without the need for extra hardware or software.

===Silicon Technologies===

In 1998, IBM announced its first [http://en.wikipedia.org/wiki/Powerpc PowerPC] microprocessor designed using copper wiring. IBM claimed that this technology boosted performance by up to a third. In 2004, it announced chips utilizing [http://en.wikipedia.org/wiki/Silicon_on_insulator Silicon-On-Insulator (SOI)] technology, which saved a significant amount of power. Finally, in 2007, Intel and IBM announced that they were able to produce a [http://en.wikipedia.org/wiki/High-k_dielectric high-K] material and electrode metals (instead of [http://en.wikipedia.org/wiki/Polysilicon polysilicon]) that will enable the mass production of chips in 45nm technology. Dual-core and dual-threaded microprocessors have already been designed in 65nm technology. Designing microprocessors in 45nm technology will enable adding more cores and cache to the chip, among other features. Coupled with the technologies mentioned earlier, performance will increase and power consumption will be kept at bay, continuing the legacy of [http://en.wikipedia.org/wiki/Moore%27s_law Moore's Law].

==System Design Trends==

===PC Direction===

The number of supported microprocessors in a computer is ever increasing. Since the mid-2000's, the norm has increasingly been to support more than one processor in a desktop computer (with laptops following closely behind). Intel and AMD are in a constant race to provide a stronger chip that offers higher performance (with multiple cores) and higher bandwidth (with faster electrical signaling, wider datapaths, pipelined protocols, multiple paths and software support).

===Server Direction===

Figure 1 shows the number of processors that have been supported in a shared bus this decade. A commonality between the technology appearing this decade and in the last decade is that servers at these times supported either a single core or a dual core microprocessor. The industry has been inching towards supporting 100 microprocessors on a single shared bus. Because the bus has a fixed bandwidth, such an approach was bound to reach a dead end if new levels of indirection were not exploited. Indeed, new technologies have made supporting more microprocessors on a shared bus more feasible. Among these technologies are multiple cores per chip, deeper levels of caching and better addressing schemes. Consider a microprocessor with multiple cores as a node. Nodes communicate, and it is left up to the microprocessor to arbitrate between cores, thus relieving the shared bus from this addressing strain. With the constant improvements in multiple core support within a chip, it is possible to see servers with over two hundred cores as soon as this decade.

<center>http://upload.wikimedia.org/wikipedia/commons/3/32/Procs.JPG</center>
 
<center>Figure 1. Number of processors in fully configured commercial bus-based shared-memory multiprocessors.</center>

A different class of servers is emerging that is neither an SMP nor a cluster. It is called [http://en.wikipedia.org/wiki/Ccnuma#Cache_coherent_NUMA_.28ccNUMA.29 ccNUMA], for Cache-Coherent Non-Uniform Memory Access. Such servers provide faster access to local memory than to remote memory, while the different copies of the same data are kept up to date through cache-coherency protocols. This technology is supported by Intel and AMD. Another server manufacturer supporting this technology is [http://en.wikipedia.org/wiki/Silicon_Graphics SGI], with its [http://en.wikipedia.org/wiki/SGI_Origin_350 Origin 350] server supporting up to 32 microprocessors.

===Shared Memory Bus Direction===

As microprocessors become faster, and more and more microprocessors (all sharing a common bus) are added to a system, the [http://en.wikipedia.org/wiki/Bandwidth bandwidth] of the bus becomes ever more critical. As shown in Figure 2, the shared bus bandwidth of commercial multiprocessors has increased with time. Various technologies and techniques have been implemented to increase bus bandwidth, such as faster electrical signaling, wider datapaths, pipelined protocols, and multiple paths. In 2001, a bidirectional serial/parallel high-bandwidth, low-latency point to point link called [http://en.wikipedia.org/wiki/Hyper_transport HyperTransport (HT)] was introduced. HT runs from 200 MHz to 2.6 GHz. It is used in many processors and in high-performance computing. HT has also been used as an interconnect for NUMA multiprocessor systems (see above).

Techniques have also been implemented to alleviate the strain put on the bus. With the [http://en.wikipedia.org/wiki/Pentium_4 Pentium 4], Intel introduced the PAUSE instruction to make spin-wait loops cheaper. PAUSE hints to the processor that the code is spinning, which reduces the cost incurred when spin-lock code repeatedly tests a memory location.

<center>http://upload.wikimedia.org/wikipedia/commons/5/5e/Bandwidth.JPG</center>
 
<center>Figure 2. Bandwidth of the shared memory bus in commercial multiprocessors.</center>

==References==
Culler DE, Singh JP, Gupta A. Parallel Computer Architecture: A Hardware/Software Approach. San Francisco, CA: Morgan Kaufmann Publishers, Inc., 1999. 
http://compoundsemiconductor.net/articles/news/11/1/25 
http://www-03.ibm.com/servers/eserver/pseries/hardware/whitepapers/power/ppc_arch.html 
http://www-05.ibm.com/se/news/sv/2007/05/power-timeline.html 
http://www.demandtech.com/Resources/Papers/Multiprocessor%20scalability.pdf 
http://www.endian.net/details.aspx?ItemNo=655 
http://www.hypertransport.org/ 
http://www.mbipr.com/whitepaper5.pdf 
http://www.sgi.com/products/remarketed/offering.html 
http://www.sun.com/processors/ 
http://www.theinquirer.net/?article=9235
http://www.amd.com/us-en/assets/content_type/DownloadableAssets/MPF_Hammer_Presentation.PDF

CSC/ECE 506 Fall 2007/wiki1 4 la

2007-09-09T06:14:31Z

Laaboue: /* Update section 1.1.3: Architectural Trends */

=Update section 1.1.3: Architectural Trends=

==Microprocessor Design Trends==

The textbook discusses that up to 1986, advancements in microprocessors were dominated by ''bit-level parallelism''. It started with 4-bit datapaths, followed by 8-bit, 16-bit and 32-bit wide datapaths. In server design, 64-bit has been the norm since the start of the millennium. 128-bit datapaths are rarely used in general-purpose microprocessors. However, graphics processors have been using 128-bit and 256-bit wide datapaths and it is possible to see an increase to 512-bit wide datapaths soon, especially with the advancements in computer graphics, animations and gaming.

''[http://pg.ece.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Fall_2007/wiki1_4_la Instruction-level parallelism]'' took off as advancements in ''bit-level parallelism'' receded. After all, the benefits made possible by advancements in ''bit-level parallelism'' are limited to the ability to address more storage space and the ability to do more in a single cycle. The latter benefit has been limited to more precise floating point calculation, although some microprocessors have the ability to bundle a couple of instructions into one.

The period within the 1980's and 1990's indeed set the stage for the modern microprocessor. [http://en.wikipedia.org/wiki/Superscalar Superscalar] microprocessors were created, which encompassed [http://en.wikipedia.org/wiki/Branch_prediction branch predictors], [http://en.wikipedia.org/wiki/Out_of_order_execution out-of-order execution], deeper and larger levels of [http://en.wikipedia.org/wiki/Cache cache] on chip, [http://en.wikipedia.org/wiki/Cache_coherency cache coherency] protocols and the ability to communicate with other microprocessors on chip. Research done in the 1990's and early 2000's set the stage for the next level of parallelism to be exploited: ''[http://en.wikipedia.org/wiki/Thread_level_parallelism thread-level parallelism].''

Two technologies appeared in the 2000's that altered the microprocessor performance race. The first is ''[http://en.wikipedia.org/wiki/Multi-core_%28computing%29 Multi-Core]'' on chip and the second is ''[http://en.wikipedia.org/wiki/Simultaneous_multithreading Simultaneous Multi-Threading (SMT)]'' (also known as ''[http://en.wikipedia.org/wiki/Hyper_threading Hyper-Threading]''). Industry refrained at this point from using the [http://en.wikipedia.org/wiki/Clock_speed clock speed] as the performance metric since a microprocessor encompassed many more intertwined technologies than merely speeding up the clock cycle. The industry first saw two cores on a single chip, and then cores taking advantage of SMT. The number of cores and the number of threads per core in a microprocessor are ever increasing. Dual core processors and dual thread processors are already in existence with the promise to merge both technologies so each core can support two threads. There are microprocessors in existence today with four and eight cores, and sixteen cores on a single chip are expected in a matter of months.

===Clock Speed and Parallelism===

In the PC system world, throughout the 1990's and early 2000's, increasing chip clock speed was the standard way to increase system performance. Desktop processors topped 1 GHz clock speeds in 2000, 2 GHz in 2001, and 3 GHz in 2002. But, due to power demands and heat concerns, this trend has since stalled. Design obstacles, especially in laptop computers, meant that other methods had to be pursued in order to increase processing power without losing efficiency. The ''Multi-Core'' era was then introduced to the PC world. In the spring of 2005, dual-core chips were introduced by [http://en.wikipedia.org/wiki/Intel Intel] and then by [http://en.wikipedia.org/wiki/Amd AMD]. Quad-core processors have reached the market, and octal-cores may hit the market by 2009.

In 2001, Intel released the [http://en.wikipedia.org/wiki/Itanium Itanium] microprocessor, which takes advantage of explicit ''instruction-level parallelism''. The compiler makes decisions about which instructions to execute in parallel, allowing the processor to execute up to six instructions per clock cycle. Although the original (and several subsequent) Itanium processors contained a single core, in 2006, Intel released an Itanium dual core microprocessor. The future of the Itanium family will follow the trend of most other microprocessors, in that ''thread-level parallelism'' will be exploited via multi-core chips.

===Silicon Technologies===

In 1998, IBM announced its first [http://en.wikipedia.org/wiki/Powerpc PowerPC] microprocessor designed using copper wiring. IBM claimed that this technology boosted performance by up to a third. In 2004, it announced that it was developing chips utilizing [http://en.wikipedia.org/wiki/Silicon_on_insulator Silicon-On-Insulator (SOI)] technology, which saved a significant amount of power. Finally, in 2007, Intel and IBM announced that they were able to produce a [http://en.wikipedia.org/wiki/High-k_dielectric high-K] material and electrode metals (instead of [http://en.wikipedia.org/wiki/Polysilicon polysilicon]) that will enable the mass production of chips in 45nm technology. Dual core and dual threaded microprocessors have already been designed in 65nm technology. Designing microprocessors in 45nm technology will enable adding more cores and cache to the chip among other features. Coupled with the technologies mentioned earlier, performance will increase and power consumption will be kept at bay, continuing the legacy of [http://en.wikipedia.org/wiki/Moore%27s_law Moore's Law].

==System Design Trends==

===PC Direction===

The number of supported microprocessors in a computer is ever increasing. Since the mid-2000's, the norm has increasingly been to support more than one processor in a desktop computer (with laptops following closely behind). Intel and AMD are in a constant race to provide a stronger chip which provides higher performance (with multiple cores) and higher bandwidth (with faster electrical signaling, wider datapaths, pipelined protocols, multiple paths and software support).

===Server Direction===

Figure 1 shows the number of processors that have been supported on a shared bus this decade. What servers of this decade and the last have in common is that they supported either single-core or dual-core microprocessors. The industry has been inching towards supporting 100 microprocessors on a single shared bus. Because the bus has a fixed bandwidth, such an approach was bound to reach a dead end if new levels of indirection were not exploited. Indeed, new technologies have made supporting more microprocessors on a shared bus more feasible. Among these technologies are multiple cores per chip, deeper levels of caching and better addressing schemes. Consider a microprocessor with multiple cores as a node. Nodes communicate, and it is left up to the microprocessor to arbitrate between cores, thus relieving the shared bus from this addressing strain. With the constant improvements in multiple core support within a chip, it is possible to see servers with over two hundred cores before the end of this decade.
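The fixed-bandwidth argument can be made concrete with a little arithmetic. The sketch below is only illustrative: the function names and the bandwidth figures are assumptions chosen for the example, not measurements of any system in Figure 1. It divides a fixed bus bandwidth evenly among the processors and estimates how many processors fit before the bus saturates.

```python
def bandwidth_per_processor(bus_bandwidth_mb_s, num_processors):
    """Even share of a fixed bus bandwidth (MB/s) available to each processor."""
    if num_processors <= 0:
        raise ValueError("num_processors must be positive")
    return bus_bandwidth_mb_s / num_processors


def max_processors(bus_bandwidth_mb_s, demand_per_processor_mb_s):
    """Largest processor count whose combined bus demand still fits on the bus."""
    if demand_per_processor_mb_s <= 0:
        raise ValueError("demand_per_processor_mb_s must be positive")
    return int(bus_bandwidth_mb_s // demand_per_processor_mb_s)


if __name__ == "__main__":
    # Hypothetical 3200 MB/s bus shared by 16 processors.
    print(bandwidth_per_processor(3200, 16))
    # Hypothetical per-processor bus demand before and after better caching.
    print(max_processors(3200, 150), max_processors(3200, 50))
```

Under these made-up numbers, cutting each processor's bus demand from 150 MB/s to 50 MB/s raises the number of processors the same bus can carry from 21 to 64, which is the kind of relief that deeper caches and multi-core nodes are meant to provide.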

<center>http://upload.wikimedia.org/wikipedia/commons/3/32/Procs.JPG</center>
 
<center>Figure 1. Number of processors in fully configured commercial bus-based shared-memory multiprocessors.</center>

A different class of servers is emerging which is neither an SMP nor a cluster. It is called [http://en.wikipedia.org/wiki/Ccnuma#Cache_coherent_NUMA_.28ccNUMA.29 ccNUMA]. ccNUMA servers utilize Cache-Coherent Non-Uniform Memory Access. Such servers provide faster access to local memory than to remote memory, while the different copies of the same data are kept up to date through cache-coherency protocols. Such technology is being supported by Intel and AMD. Another server manufacturer supporting this technology is [http://en.wikipedia.org/wiki/Silicon_Graphics SGI], with its [http://en.wikipedia.org/wiki/SGI_Origin_350 Origin 350] server supporting up to 32 microprocessors.

===Shared Memory Bus Direction===

As microprocessors become faster, and as more and more microprocessors (all sharing a common bus) are added to a system, the [http://en.wikipedia.org/wiki/Bandwidth bandwidth] of the bus becomes ever more critical. As shown in Figure 2, the shared bus bandwidth of commercial multiprocessors has increased with time. Various technologies and techniques have been implemented to increase bus bandwidth, such as faster electrical signaling, wider datapaths, pipelined protocols, and multiple paths. In 2001, a bidirectional serial/parallel high-bandwidth, low-latency point-to-point link called [http://en.wikipedia.org/wiki/Hyper_transport HyperTransport (HT)] was introduced. HT runs from 200 MHz to 2.6 GHz. It is used in many processors and in high-performance computing. HT has also been used as an interconnect for NUMA multiprocessor systems (see above).
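As a rough guide to how signaling rate and datapath width combine, peak bandwidth can be estimated as width times clock rate times transfers per clock. The sketch below is a back-of-the-envelope calculation, not a vendor formula; the function name is hypothetical, and the 32-bit link width and double-data-rate signaling used in the HyperTransport example are assumptions that the text above does not state.

```python
def peak_bandwidth_bytes_per_s(width_bits, clock_hz, transfers_per_clock=1):
    """Theoretical peak bandwidth of a bus or link, in bytes per second.

    width_bits          -- datapath width in bits
    clock_hz            -- clock frequency in Hz
    transfers_per_clock -- 1 for single data rate, 2 for double data rate
    """
    if width_bits <= 0 or clock_hz <= 0 or transfers_per_clock <= 0:
        raise ValueError("all arguments must be positive")
    return width_bits * clock_hz * transfers_per_clock / 8


if __name__ == "__main__":
    # Hypothetical 64-bit shared bus clocked at 100 MHz, single data rate.
    print(peak_bandwidth_bytes_per_s(64, 100_000_000))
    # Assumed 32-bit HT link at 2.6 GHz with two transfers per clock (one direction).
    print(peak_bandwidth_bytes_per_s(32, 2_600_000_000, transfers_per_clock=2))
```

These are upper bounds only: arbitration, protocol overhead and contention keep sustained bandwidth below the peak figure.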

Techniques have also been implemented to alleviate the strain put on the bus. With the [http://en.wikipedia.org/wiki/Pentium_4 Pentium 4], Intel introduced an instruction designed to reduce bus contention. This is the PAUSE instruction, which cuts down on the bus transactions that occur when spin lock code repeatedly tries to test and set a memory location.
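A toy model can illustrate why slowing down a spin-wait loop relieves the bus. The sketch below is a simplification built on invented costs: it assumes every test-and-set attempt takes one cycle and generates one bus transaction, and that a pause simply adds `pause_cycles` of delay between attempts. The function name and these cycle counts are hypothetical and do not reflect actual PAUSE latencies or cache behavior.

```python
def spin_attempts(hold_cycles, pause_cycles=0):
    """Count test-and-set attempts made while a lock is held by another processor.

    hold_cycles  -- how long the lock stays held, in cycles
    pause_cycles -- extra delay inserted after each failed attempt
    """
    if hold_cycles < 0 or pause_cycles < 0:
        raise ValueError("cycle counts must be non-negative")
    attempts = 0
    t = 0
    while t < hold_cycles:
        attempts += 1          # one failed test-and-set, one bus transaction
        t += 1 + pause_cycles  # the attempt itself plus the inserted delay
    return attempts


if __name__ == "__main__":
    print(spin_attempts(1000))                  # tight loop, no delay
    print(spin_attempts(1000, pause_cycles=9))  # delay between attempts
```

In this model a waiter spinning for 1000 cycles issues 1000 attempts without a delay and 100 with a 9-cycle delay, at the cost of noticing the release up to 9 cycles late.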

<center>http://upload.wikimedia.org/wikipedia/commons/5/5e/Bandwidth.JPG</center>
 
<center>Figure 2. Bandwidth of the shared memory bus in commercial multiprocessors.</center>

==References==
Culler DE, Singh JP, Gupta A. Parallel Computer Architecture: A Hardware/Software Approach. San Francisco, CA: Morgan Kaufmann Publishers, Inc., 1999. 
http://compoundsemiconductor.net/articles/news/11/1/25 
http://www-03.ibm.com/servers/eserver/pseries/hardware/whitepapers/power/ppc_arch.html 
http://www-05.ibm.com/se/news/sv/2007/05/power-timeline.html 
http://www.demandtech.com/Resources/Papers/Multiprocessor%20scalability.pdf 
http://www.endian.net/details.aspx?ItemNo=655 
http://www.hypertransport.org/ 
http://www.mbipr.com/whitepaper5.pdf 
http://www.sgi.com/products/remarketed/offering.html 
http://www.sun.com/processors/ 
http://www.theinquirer.net/?article=9235

CSC/ECE 506 Fall 2007/wiki1 4 la

2007-09-06T00:55:58Z

Laaboue: /* Update section 1.1.3: Architectural Trends */

CSC/ECE 506 Fall 2007/wiki1 4 la

2007-09-06T00:43:40Z

Laaboue: /* Update section 1.1.3: Architectural Trends */

CSC/ECE 506 Fall 2007/wiki1 4 la

2007-09-06T00:42:45Z

Laaboue: /* Update section 1.1.3: Architectural Trends */

CSC/ECE 506 Fall 2007/wiki1 4 la

2007-09-06T00:32:21Z

Laaboue:

CSC/ECE 506 Fall 2007/wiki1 4 la

2007-09-04T23:41:48Z

Laaboue:

=Update section 1.1.3: Architectural Trends=

==Microprocessor Design Trends==

The textbook discusses that up to 1986, advancements in microprocessors were dominated by ''bit-level parallelism''. It started with 4-bit datapaths, followed by 8-bit, 16-bit and 32-bit wide datapaths. In server design, 64-bit has been the norm since the start of the millennium. 128-bit datapaths are rarely used in general-purpose microprocessors. However, graphics processors have been using 128-bit wide datapaths and it is possible to see an increase to 256-bit wide datapaths soon, especially with the advancement in computer graphics, animations and gaming.

''Instruction-level parallelism'' took off as advancements in ''bit-level parallelism'' receded. After all, the benefits made possible by advancements in ''bit-level parallelism'' are limited to the ability to address more storage space and the ability to do more in a single cycle. The latter benefit has been limited to more precise floating point calculation, although some microprocessors have the ability to bundle a couple of instructions into one.

The period within the 1980's and 1990's indeed set the stage for the modern microprocessor. Superscalar microprocessors were created, which encompassed branch predictors, out-of-order execution, deeper and larger levels of cache on chip, cache coherency protocols and the ability to communicate with other microprocessors on chip. Research done in the 1990's and early 2000's set the stage for the next level of parallelism to be exploited: ''thread-level parallelism.''

Two technologies appeared in the 2000's that altered the microprocessor performance race. The first is ''Multiple Cores'' on chip and the second is ''Simultaneous Multi-Threading (SMT)'' (also known as ''Hyper-Threading''). Industry refrained at this point from using the clock speed as the performance metric since a microprocessor encompassed many more intertwined technologies than merely speeding up the clock cycle. The industry first saw two cores on a single chip, and then cores taking advantage of SMT. The number of cores and the number of threads per core in a microprocessor are ever increasing. Dual core processors and dual thread processors are already in existence with the promise to merge both technologies so each core can support two threads. Microprocessors exist with four and eight cores, with the promise of sixteen cores on a chip in a matter of months.

==System Design Trends==

==Figures==
http://upload.wikimedia.org/wikipedia/commons/3/32/Procs.JPG
http://upload.wikimedia.org/wikipedia/commons/5/5e/Bandwidth.JPG

==References==
http://www.endian.net/details.aspx?ItemNo=655
http://www-05.ibm.com/se/news/sv/2007/05/power-timeline.html
http://www-03.ibm.com/servers/eserver/pseries/hardware/whitepapers/power/ppc_arch.html
http://www.theinquirer.net/?article=9235
http://www.sun.com/processors/

CSC/ECE 506 Fall 2007/wiki1 4 la

2007-09-04T23:39:53Z

Laaboue:

CSC/ECE 506 Fall 2007/wiki1 4 la

2007-09-04T23:02:31Z

Laaboue:

CSC/ECE 506 Fall 2007/wiki1 4 la

2007-09-04T22:58:05Z

Laaboue: