CSC/ECE 506 Spring 2011/ch4a zz: Difference between revisions

From Expertiza_Wiki
Jump to navigation Jump to search
 
(262 intermediate revisions by the same user not shown)
Line 1: Line 1:
== Introduction ==
= Introduction =


[[Image:Comp4.jpg|thumb|right|400px|The Cray X1E supercomputer at the Oak Ridge National Laboratory requires a separate array of discs to handle its storage needs. Here are some of those discs, which are arranged all around the supercomputer]]
[[Image:Nasanbody.jpg|thumb|left|N-body Problem. Picture source: NASA]]  
[[Image:Comp5.jpg|thumb|right|400px|Here's a look inside one of the Cray X1E's panels. You can see just one set of the processors and control systems that run what is currently the 175th most powerful computer in the world]]
N-body problem is one of the most important topics in [http://en.wikipedia.org/wiki/Celestial_mechanics celestial mechanics]. The [http://en.wikipedia.org/wiki/N-body_problem#Mathematical_formulation_of_the_n-body_problem mathematical formulation of N-body problem] is a little bit beyond the scope of this book so it will not be discussed here. Based on the mathematics, we can simplify the N-body problem as follows: Select the position and velocity of N celestial bodies as [http://en.wikipedia.org/wiki/State_space_%28controls%29#State_variables state variables]. Given the [http://en.wikipedia.org/wiki/Initial_value_problem initial condition] of N bodies, compute their states at arbitrary time T. Normally a three-dimensional space is considered for N-body problem. There is a simplified N-body problem called restricted N-body problem where the mass of some of the bodies is negligible. Several remarkable three-body simulation can be found in [[#References|[1]]].
[[Image:Comp8.jpg|thumb|right|400px|This is Jaguar, currently rated the fifth most powerful computer in the world, with 54 teraflops. It uses 7,832 AMD Barcelona quad-core Opteron processors]]


A '''supercomputer''' is generally considered to be the front-line “cutting-edge” in terms of processing capacity (number crunching) and computational speed at the time it is built, but with the pace of development yesterdays' supercomputers become regular servers today. A state-of-the-art supercomputer is an extremely powerful computer capable of manipulating massive amounts of data in a relatively short amount of time.  
[[Image:3body.png|thumb|right|The trajectory of restricted three-body system [[#References|[1]]] ]]
Supercomputers are very expensive and are deployed for specialized scientific and engineering applications that must handle very large databases or do a great amount of computation --among them meteorology, animated graphics, fluid dynamic calculations, nuclear energy research, weapons simulation and petroleum exploration.


The United States government has played the key role in the development and use of supercomputers, During World War II, the US Army paid for the construction of ENIAC in order to speed the calculations of artillery tables. In the 30 years after World War II, the US government used high-performance computers to design nuclear weapons, break codes, and perform other security-related applications.
Many mathematicians have proofed that it is impossible to find a [http://en.wikipedia.org/wiki/Closed-form_expression closed-form solution] for n-body problem analytically[[#References|[2]]][[#References|[3]]]. The system could become [http://www.scholarpedia.org/article/Stability unstable] very easily. However, the problem can be solved [http://en.wikipedia.org/wiki/Numerical_analysis numerically]. The most common approach is to iterate over a sequence of small time steps. Within each time step, the acceleration on a body is approximated by the transient acceleration in the previous time step. The transient acceleration on a single body can be directly computed by summing the gravity from each of the other N-1 bodies. While this method is conceptually simple and is the algorithm of choice for many applications, its ''O''(''N''<sup>2</sup>).  


The most powerful supercomputers introduced in the 1960s were designed primarily by Seymour Cray at Control Data Corporation (CDC). They led the market into the 1970s until Cray left to form his own company, Cray Research.
The simulation of N-body system can be used from simulation of celestial bodies ([http://www.mpa-garching.mpg.de/galform/virgo/millennium/index.shtml gravitational interaction])to interactions of a set of particles (electromagnetic interaction).
With Moore’s Law still holding after more than thirty years, the rate at which future mass-market technologies overtake today’s cutting-edge super-duper wonders continues to accelerate.  The effects of this are manifest in the abrupt about-face we have witnessed in the underlying philosophy of building supercomputers.


During the 1970s and all the way through the mid-1980s supercomputers were built using specialized custom vector processors working in parallel. Typically, this meant anywhere between four to sixteen CPUs. The next phase of the supercomputer evolution saw the introduction of massive parallel processing and a drift away from vector-only microprocessors. However, the processors used in the construction of this generation of supercomputers were still primarily highly specialized purpose-specific custom designed and fabricated units.
= Parallel N-body problem =


That is no longer true. No longer is silicon fabricated into the incredibly expensive highly specialized purpose-specific customized microprocessor units to serve as the heart and mind of supercomputers. Advances in mainstream technologies and economies of scale now dictate that “off-the-shelf” multicore server-class CPUs are assembled into great conglomerates, combined with mind-boggling quantities of storage (RAM and HDD), and interconnected using light-speed transports.
As MIT professors Dimitri Bertsekas and John Tsitsiklis stated in their book [http://web.mit.edu/dimitrib/www/pdc.html "Parallel and Distributed Computation: Numerical Methods"], parallel computers have provided an excellent platform for numerical analysis. In the N-body problem, since each particle will interact with others by the force of gravity, the simulation of N-body system is computationally expensive for large numbers of N. There a ''O''(''N''<sup>2</sup>) interactions to compute for every iteration. Furthermore, in order to have a accurate result, the discrete time step must relatively small. Thus, there has been a huge interest in faster parallel algorithm for N-body problem.  


So we now find that instead of using specialized custom-built processors in their design, supercomputers are based on "off the shelf" server-class multicore microprocessors, such as the IBM PowerPC, Intel Itanium, or AMD x86-64. The modern supercomputer is firmly based around massively parallel processing by clustering very large numbers of commodity processors combined with a custom interconnect.
== Data-parallel Implementation ==


Currently the fastest supercomputer is the Blue Gene/L, completed at Lawrence Livermore National Laboratory in 2005 and upgraded in 2007. It utilizes 212,992 processors to execute potentially as many 596 trillion mathematical operations per second. The computer is used to do nuclear weapons safety and reliability analysis. A prototype of Blue Gene/L demonstrated in 2003 was air-cooled, as opposed to many high-performance machines that use water and refrigeration, and used no more power than the average home. In 2003 scientists at Virginia Tech assembled a relatively low-cost supercomputer using 1,100 dual-processor Apple Macintoshes; it was ranked at the time as the third fastest machine in the world.
=== Barnes-Hut Tree (BH tree)  ===
In 1985, Appel took the first step of decomposes the problem by introducing a tree structure[[#References|[4]]]. In the next year, Barnes and Hut extend the tree-based force calculation with logarithmic growth of force terms per particle[[#References|[5]]]. That is, construct a tree hierarchy(BH tree) of bodies based on the partition of entire space. The partition is shown on the left. The empty block have been pruned, thus the traversal time have been reduced from''O''(''N''<sup>2</sup>) to ''O''(''N'' log''N'').


Some of the companies which build supercomputers are Silicon Graphics, Intel, IBM, Cray, Orion, Aspen Systems etc.
[[Image:BHtree.jpg|center|frame|A recursive partition in two dimension and its corresponding BH tree. Picture from [[#References|[6]]] ]]


Here is a list of the top 10 supercomputers [http://royal.pingdom.com/2009/06/11/10-of-the-coolest-and-most-powerful-supercomputers-of-all-time/ top10 supercomputers] as of June 2009.
<pre>
For each time step:
1. Build the BH tree.
2. Compute centers-of-mass bottom-up
3. For each body
  Star a depth-first traversal of the tree;
  Truncating the search at internal nodes where the approximation is applicable;
  Update the contribution of the node to the acceleration of the body.
4. Update the velocity and position of each body.
</pre>


== A Brief History of Supercomputers ==
There are several challenges of this decomposition:
# Unlike the ocean current can be represented evenly by regular grid points in the entire space, the density of the galaxies is varying in different space. In some spaces the density of stars maybe very high, but in some other spaces there might be only few stars at certain time. This implies the computation load for each body will be different.
# The other challenge is that the position of the body is time-varying, which means the static assignment (which has been used in ocean application) may not work well.
# The force calculation of certain a body needs information from other bodies. In order to reduce the communication among processors, the partitions needs to be contiguous in any directions.
# The last challenge is that there is a trade-off between the level of decomposition and communication. That is, the fine grained approach could speed-up the force calculation but it will need more communications between different groups of stars.


=== First Supercomputer ( ENIAC )===
Even though the Barnes-Hut tree decomposition algorithm have these challenges, BH tree is still the one of most efficient methods for solving the n-body problem.


The ENIAC was first developed in 1949 and it took the world by storm. Originally, it was built to solve very complex problems that would take several months or years to solve. Because of this some of us use computers today but ENIAC was built with a single purpose: to solve scientific problems for the entire nation. The military were first to use it, benefiting the country's defenses. Even today, most new supercomputer technology is designed for the military first, and then is redesigned for civilian uses.
=== Orthogonal Recursive Bisection (ORB) ===


This system actually was used to compute the firing tables for White Sands missile range from 1949 until it was replaced in 1957. This allowed the military to synchronize the liftoff of missiles should it be deemed necessary. This was one of the important milestones in military history for the United States, at least on a technological level.
[[Image: orthogonal recursive bisection.gif|right|frame|upright=0.5| [http://en.wikipedia.org/wiki/CUDA Orthogonal Recursive Bisection] ]]


ENIAC was a huge machine that used nineteen thousand vacuum tubes and occupied a massive fifteen thousand square feet of floor space. It weighted nearly thirty tons, making it one of the largest machines of the time. It was considered the greatest scientific invention up to this point because  it took only 2 hours of computation time to do what normally took a team of one hundred engineers a period of a year. That made it almost a miracle in some people's eyes and people got excited about this emerging technology. ENIAC could perform five thousand additions in seconds.  Though it seemed very fast, by today's standards, that is extremely slow. Most computers today do millions of additions per second in comparison.
Unlike BH tree divided the space into square cells, the Orthogonal Recursive Bisection (ORB) divided space into rectangular with same number of bodies in each of them. It is a recursive approach because all processors are associated with the entire space at the very beginning; then a orthogonal line has been drawn to separate the space into bisection; each of the subsection has associated with half of the processors. Until the number of partition is equal to the number of processors.


So what made ENIAC run? That task took a lot of manpower to complete and took hours to set up.The people completing the task used board, plus and wires to program the desired commands into the colossal machine. They also had to input the numbers by turning tons of dials until they matched the correct numbers, much like one has to do on a combination lock.
=== Other Data-parallel Implementation ===


=== Cray ===
When using data-parallel algorithm to solve the N-body problem, one obvious approach for the data-parallel implementation is to divide the interactions into different sets based on the distances from current particle to other "interacting" particles. Thus the force from 'far away' particles can be updated less frequently or even can be ignored.
[[Image: data.jpg|thumb|left|Adaptive decompositions for a given box R, showing examples for separations of 1 and 2. Figure from [[#References|[6]]] ]]


Cray Inc. has a history that extends back to 1972, when the legendary Seymour Cray, the "father of supercomputing," founded Cray Research. R&D and manufacturing were based in his hometown of Chippewa Falls, Wisconsin and business headquarters were in Minneapolis, Minnesota.
By separating the blocks by the Hamming distance to the R block. Three interaction list could be created by the distance: The direct interaction list ''Dr''; the far interaction list ''Fr'' and the intermediate interaction list ''Hr''


The first Cray-1 system was installed at Los Alamos National Laboratory in 1976 for $8.8 million. It boasted a world-record speed of 160 million floating-point operations per second (160 megaflops) and an 8 megabyte (1 million word) main memory. In order to increase the speed of the system, the Cray-1 had a unique "C" shape which made integrated circuits to be closer together and no wire in the system was more than four feet long. To handle the intense heat generated by the computer, Cray developed an innovative refrigeration system using Freon.
The steps for each iteration using the given decomposition of the simulation are [[#References|[6]]]:
# Compute the multipole expansion coefficients for all leaves in the tree decomposition.
# Compute the multipole expansion coefficients for all internal nodes in the tree with depth ≥ 2.
# Compute the local expansion coefficients for a region R by summing R’s parent’s local expansion (shifted from the parent’s center to the center of R) with the sum of all of the multipole expansions in R’s far list, ''Fr''(converting the multipole expansions to local expansions and shifting to the center of R) for all regions with depth ≥ 2.
# For each body, b, in each leaf region, R, compute all the direct forces on b from all the bodies in the regions in R’s direct interaction list (''Dr'').
# For each body, b, in each leaf region, R, compute the far force on b by evaluating the local expansion for region R at b’s position.
# For each body, b, in each leaf region, R, compute the intermediate force by evaluating the multipole expansion at b’s position for each region in R’s intermediate interaction list (''Hr'').
# Sum the 3 components of the force and potential for each body.
# Apply the forces, updating the positions and velocities, and move the bodies to their proper regions as indicated by boundary crossing.


In order to concentrate his efforts more on the design, Cray left the CEO position in 1980 and became an independent contractor. Later he worked on the follow-on to the Cray-1, another group within the company developed the first multiprocessor supercomputer, the Cray X-MP, which was introduced in 1982. The Cray-2 system made its debut in 1985, providing a tenfold increase in performance over the Cray-1. In 1988, Cray Y-MP was introduced, the world's first supercomputer to sustain over 1 gigaflop on many applications. Multiple 333 MFLOPS processors powered the system speeds of up-to 2.3 gigaflops.
Another data-parallel implementation can be found in [[#References|[8]]]. The drawback of this type of approach is obvious: the error may be increased significantly due to the clustering process.


Always a visionary, Seymour Cray had been exploring the use of gallium arsenide in creating a semiconductor faster than silicon. However, the costs and complexities of this material made it difficult for the company to support both the Cray-3 and the Cray C90 development efforts. In 1989, Cray Research spun off the Cray-3 project into a separate company, Cray Computer Corporation, headed by Seymour Cray and based in Colorado Springs, Colorado. Tragically, Seymour Cray died in September 1996 at the age of 71.
== Shared-memory Implementation ==


The 1990s brought a number of transformations to Cray Research. The company continued its leadership in providing the most powerful supercomputers for production applications. The Cray C90 featured a new central processor which produced a performance of 1 gigaflop. Using 16 of these powerful processors and 256 million words of central memory, the system boasted of amazing total performance. The company also produced its first "mini-supercomputer," the Cray XMS system, followed by the Cray Y-MP EL series and the subsequent Cray J90. In 1993, it offered the first massively parallel processing (MPP) system, the Cray T3D supercomputer, and quickly captured MPP market leadership from early MPP companies such as Thinking Machines and MasPar. The Cray T3D proved to be exceptionally robust, reliable, sharable and easy-to-administer, compared with competing MPP systems.
=== Costzones ===


The successor, Cray T3E supercomputer has been the world's best selling MPP system. The Cray T3E-1200E system was the first supercomputer to sustain one teraflop (1 trillion calculations per second) on a real-world application. In November 1998, a joint scientific team from Oak Ridge National Laboratory, the National Energy Research Scientific Computing Center (NERSC), Pittsburgh Supercomputing Center and the University of Bristol (UK) ran a magnetism application at a sustained speed of 1.02 teraflops. In another technological landmark, the Cray T90 became the world's first wireless supercomputer when it was released in 1994. Also the Cray J90 series which was released during the same year has become the world's most popular supercomputer, with over 400 systems sold.
[[Image: Costzone.png|center|frame| [http://www.eecs.berkeley.edu/~demmel/cs267/lecture27/lecture27.html The partitioning of Costzones] ]]


Cray Research merged with SGI (Silicon Graphics, Inc.) in February 1996 then the company was renamed Cray Inc. and the ticker symbol was changed to CRAY. In August 1999, SGI created a separate Cray Research business unit to focus completely and exclusively on the unique requirements of high-end supercomputing customers. Assets of this business unit were sold to Tera Computer Company in March 2000. Tera began software development for the Multithreaded Architecture (MTA) systems in 1988 and hardware design commenced in 1991. The Cray MTA-2 system provides scalable shared memory, in which every processor has equal access to every memory location, which greatly simplifies the programming because it eliminates concerns about the layout of memory.Company received its first order for the MTA from the San Diego Supercomputer Center. The multiprocessor system was accepted by the center in 1998, and has since been upgraded to eight processors.
It is easy to share the BH tree in a shared memory machine. The so-called Costzones is one of the approaches to implement the BH tree algorithm in the shared-memory machine. We know that the processor will accesses more the nearby bodies than distant ones. Since the BH tree already contain the spatial information of bodies (more likely the nearby bodies will have a same parent with me), the Costzones technique simply uses the data structure created by the BH tree.
The link below shows the Historical Timeline of Cray in the field of Supercomputers.
[http://www.cray.com/Assets/PDF/about/CrayTimeline.pdf Historical Timeline of Cray].


=== Supercomputers in Japan ===
The costzones method have a better performance in a shared-memory machine compared with the ORB method. The reason is mainly because the time spent in the space partition is shorter.


In the beginning there were only a few Cray-1s installed in Japan, and until 1983 no Japanese company produced supercomputers. The first models were announced in 1983. Naturally there had been prototypes earlier like the Fujitsu F230-75 APU produced in two copies in 1978, but Fujitsu's VP-200 and Hitachi's S-810 were the first officially announced versions. NEC announced its SX series slightly later.
=== Other Shared-memory Implementation ===


The last decade has rather been a surprise. About three generations of machines have been produced by each of the domestic manufacturers. During the last ten years about 300 supercomputer systems have been shipped and installed in Japan, and a whole infrastructure of supercomputing has been established. All major universities, many of the large industrial companies and research centers have supercomputers.
Besides the Costzones method of implementing the N-body simulation on shard-memory machines, there is a tremendous interests of using [http://en.wikipedia.org/wiki/CUDA CUDA] to solve the N-body problem by modern GPUs [[#References|[10]]][[#References|[11]]].
 
In 1984 the NEC announced the SX-1 and SX-2 and started delivery in 1985. The first two SX-2 systems were domestic deliveries to Osaka University and the Institute for Computational Fluid Dynamics (ICFD). The SX-2 had multiple pipelines with one set of add and multiply floating point units each.It had a cycle time of 6 nanoseconds so each pipelined floating-point unit could peak at 167 Mflop/s. With four pipelines per unit and two floating-point units, the peak performance was about 1.3 Gflop/s. Due to limited memory bandwidth and other issues the performance in benchmark tests was less than half the peak value. The SX-1 had a slightly higher cycle time of 7 ns than the SX-2 and had only half the number of pipelines. The maximum execution rate was 570 Mflop/s.


At the end of 1987, NEC improved its supercomputer family with the introduction of A-series which gave improvements to the memory and I/O bandwidth. The top model, the SX-2A, had the same theoretical peak performance as the SX-2. Several low-range models were also announced but today none of them qualify for the TOP500.
<pre>
global void
calculate_forces(void *devX, void *devA)
{
extern __shared__ float4[] shPosition;
float4 *globalX = (float4 *)devX;
float4 *globalA = (float4 *)devA;
float4 myPosition;
int i, tile;
float3 acc = {0.0f, 0.0f, 0.0f};
int gtid = blockIdx.x * blockDim.x + threadIdx.x;
myPosition = globalX[gtid];
for (i = 0, tile = 0; i < N; i += p, tile++) {
int idx = tile * blockDim.x + threadIdx.x;
shPosition[threadIdx.x] = globalX[idx];
syncthreads();
acc = tile_calculation(myPosition, acc);
syncthreads();
}
// Save the result in global memory for the integration step.
float4 acc4 = {acc.x, acc.y, acc.z, 0.0f};
globalA[gtid] = acc4;
}
</pre>


In 1989 NEC announced a rather aggressive new model, the SX-3, with several important changes. The vector cycle time was brought down to 2.9 ns, the number of pipelines was doubled, but most significantly NEC added multiprocessing capability to its new series. It contained four independent arithmetic processors each with a scalar and a vector processing unit and NEC increased its performance by more than one order of magnitude of 22 Gflop/s from 1.33 on the SX-2A. The combination of these features made SX-3 the most powerful vector processor in the world. The total memory bandwidth was subdivided into two halves which in turn featured two vector load and one vector store paths per pipeline set as well as one scalar load and one scalar store path. This gave a total memory bandwidth to the vector units of about 66 GB/s. Like its predecessors, the SX-3 was to offer the memory bandwidth needed to sustain peak performance unless most operands were contained in the vector registers.
CUDA shows an impressive potential to execute extremely parallel applications such as N-body problem. Twenty years ago, the N-body simulation can only runs on super computers. Because of the development of CUDA makes the classical N-body problem become a term project for some graduate students (see [http://www.evl.uic.edu/sjames/cs525/project2.html 2008 Fall CS 525 Project2 n-body simulation in CUDA]).


In 1992 NEC announced the SX-3R with a couple of improvements compared to the first version. The clock was further reduced to 2.5 ns, so that the peak performance increased to 6.4 Gflop/s per processor
== Message-passing Implementation ==


==== Fujitsu's VP series ====
[[Image: Octtree.png|frame|right|A quad-tree shown along with the binary key coordinates of the nodes. Figure from [[#References|[9]]] ]]


In 1977 Fujitsu produced the first supercomputer prototype called the F230-75 APU which was a pipelined vector processor added to a scalar processor. This attached processor was installed in the Japanese Atomic Energy Commission (JAERI) and the National Aeronautic Lab (NAL).
Implementing the BH tree algorithm is more difficult in a distributed-memory system compared with the shared-memory implementation. Because we need assign the different bodies to different processor dynamically. When the location of bodies have been changed exceed a certain limit, the BH tree needs to be reconstruct. The good news is we don't need reconstruct the BH tree very frequently(compared with the simulation step size).  


In 1983 the company came out with the VP-200 and VP-100 systems. In 1986 VP-400 was released with twice as many pipelines as the VP-200 and during mid-1987 the whole family became the E-series with the addition of an extra (multiply-add) pipelined floating point unit that increased the performance potential by 50%. With the flexible range of systems in this generation (VP-30E to VP-400E), good marketing and a broad range of applications, Fujitsu has became the largest domestic supplier with over 80 systems installed, many of which are named in TOP500.
In message-passing architectures, a processor’s local memory is often used as a software-controlled cache
for communicated data, in addition to holding the processor’s own assigned data partition; that is, data are replicated in the memories of those processors that need to use them.  


Available since 1990, the VP-2000 family can offer a peak performance of 5 Gflop/s due to a vector cycle time of 3.2 ns. The family was initially announced with four vector performance levels (model 2100, 2200, 2400, and 2600) where each level could have either one of two scalar processors, but the VP-2400/40 doubled this limit offering a peak vector performance similar to the VP-2600. Most of these models are now represented in the Japanese TOP500.
Warren and Salmon introduced a parallel hashed Oct-tree N-body algorithm[[#References|[9]]]. The hashed Oct-tree is the message-passing version of BH tree implementation. One of the advantages of using hash table is that the non-local data may be accessed by requesting a key, which is a uniform addressing scheme and easy to be implemented on a message passing architecture. The pseudocode of tree traversal are shown below:


Previous machines wre heavily criticized for the lack of memory throughput. The VP-400 series had only one load/store path to memory that peaked at 4.57 GB/s. This was improved in the VP-2000 series by doubling the paths so that each pipeline set can do two load/store operations per cycle giving a total transfer rate of 20 GB/s. Fujitsu recently decided to use the label, VPX-2x0, for the VP-2x00 systems adapted to their Unix system. Keio Daigaku university now runs such a system.
<pre>
ListTraverse((*MAC)(hcell *))
{
copy root to walk_list;
while (!Empty(walk_list)) {
for (each item on walk_list) {
for (each daughter of item) {
if (MAC(daughter))
copy daughter to interact_list;
else
copy daughter to output_walk_list;
}
}
walk_list = output_walk_list;
}
}
</pre>


==== The VPP-500 series ====
There are two major challenges of using message-passing parallel to solve the N-body problem:
# The nonuniform and dynamic nature of the particle distribution, which implies that the partitions assigned to processors change with time to maintain load balancing and data locality;
# The need for long-range communication which is irregular as a result of the nonuniformity.


In 1993 Fujitsu sprung a surprise to the world by announcing a Vector Parallel Processor (VPP) series that was designed for reaching in the range of hundreds of Gflop/s. At the core of the system is a combined Ga-As/Bi-CMOS processor, based largely on the original design of the VP-200. The processor chips gate delay was made as low as 60 ps in the Ga-As chips by using the most advanced hardware technology available. The resulting cycle time was 9.5 ns. The processor has four independent pipelines each capable of executing two Multiply-Add instructions in parallel resulting in a peak speed of 1.7 Gflop/s per processor. Each processor board is equipped with 256 Megabytes of central memory.
The typical message-passing strategy of allowing communicated data to accumulate in local memory and flushing them out at
certain points in the program does not work well here like it does in regular predictable programs. Thus, the only convenient points at which to flush communicated data in the application program are the boundaries between computational phases. Since the amount of nonlocal data read from the tree during the force computation phase can be quite large, the memory overhead due to data replication can be large when replacement is managed in this typical message-passing style.


The most amazing part of the VPP-500 is the capability to interconnect up to 222 processors via a cross-bar network with two independent (read/write) connections, each operating at 400 MB/s. The total memory is addressed via virtual shared memory primitives. The system is meant to be front-ended by a VP-2x00 system that handles input/output and permanent file store, and job queue logistics.
== Other Parallel Approaches ==
Other than the conventional approaches, there are also some new techniques have been developed to simulate the N-body problem:


As mentioned in the introduction, an early version of this system called the Numeric Wind Tunnel, was developed together with NAL. This early version of the VPP-500 (with 140 processors) is today the fastest supercomputer in the world and stands out at the beginning of the TOP500 due to a value that is twice that of the TMC CM-5/1024 installed at Los Alamo.
* Hardware side: The researcher in University of Tokyo have designed and built a special purpose hardware called GRAPE for the N-body simulation. The chip has been designed to optimize the computation of gravitational interactions.[[#References|[12]]] The GRAPE has been keep developed by researchers, the most updated version is [http://adsabs.harvard.edu/abs/2003PASJ...55.1163M GRAPE 6].  


==== Hitachi's Supercomputers ====
* Software (algorithm) side: On the other hand, the high-level architecture-independent algorithms also have been develop to simulate the N-body problem[[#References|[13]]]. These algorithm allowed researchers test their prototype parallel program under different architecture without using the machine-specified languages.


Hitachi has been producing supercomputers since 1983 but differs from other manufacturers by not exporting them. For this reason, their supercomputers are less well known in the West. After having gone through two generations of supercomputers, the S-810 series started in 1983 and the S-820 series in 1988, Hitachi leapfrogged NEC in 1992 by announcing the most powerful vector supercomputer ever.The top S-820 model consisted of one processor operating at 4 ns and contained 4 vector pipelines with four pipelines and two independent floating-point units. This corresponded to a peak performance of 2 Gflop/s. Hitachi put great emphasis on a fast memory although this meant limiting its size to a maximum of 512 MB. The memory bandwidth of 2 words per pipe per vector cycle, giving a peak rate of 16 GB/s was a respectable achievement, but it was not enough to keep all functional units busy.
=References=


The S-3800 was announced two years ago and is comparable to NEC's SX- 3R in its features. It has up to four scalar processors with a vector processing unit each. These vector units have in turn up to four independent pipelines and two floating point units that can each perform a multiply/add operation per cycle. With a cycle time of 2.0 ns, the whole system achieves a peak performance level of 32 Gflop/s.
[1] [http://faculty.ifmo.ru/butikov/Projects/Collection1.html Collection of remarkable three-body motions]


The S-3600 systems can be seen as the design of the S-820 recast in more modern technology. The system consists of a single scalar processor with an attached vector processor. The 4 models in the range correspond to a successive reduction of the number of pipelines and floating point units installed.
[2] [http://en.wikipedia.org/wiki/Henri_Poincar%C3%A9 About Henri Poincaré]  
Link showing the list of the top 500 super computers
[http://www.top500.org/list/2009/11/100 top 500 super computers].
Link showing the statistics of top 500 supercomputer
[http://www.top500.org/static/lists/2009/11/top500_statistics.pdf statistics]


=== IBM's world's fastest supercomputer ===
[3] Diacu, F (01/01/1996). [http://rutherglen.science.mq.edu.au/math332s207/diacuNbody.pdf "The solution of the n-body problem"] . The Mathematical intelligencer (0343-6993), 18 (3), p. 66.


Sequoia, built for the US department of energy, is almost 20 times more powerful than the previous record holder. IBM today smashed the existing record as it unveiled the world's fastest supercomputer. The new system, dubbed Sequoia, will be able to achieve speeds of up to 20 Petaflops - 20 quadrillion calculations per second - the equivalent of more than 2m laptops. Sequoia will consist of around 1.6m computer chips, giving it the ability to perform an order of magnitude faster than the 1.1 Petaflop Blue Gene/L computer, which is currently recognised as the world's most powerful.
[4]  A. Appel, [http://epubs.siam.org.www.lib.ncsu.edu:2048/sisc/resource/1/sjoce3/v6/i1/p85_s1 "An Efficient Program for Many-Body Simulation"] , SIAM J. Scientific and Statistical Computing, vol. 6, 1985 (available at NCSU library)


It is built by IBM for the US department of energy and should be installed at the Lawrence Livermore National Laboratory in California by 2012. The LLNC is one of the world's leading laboratories dedicated to national security, where teams of scientists work on projects linked to nuclear energy, environmental protection and economic issues. Sequoia will be used to simulate nuclear tests and explosions, alongside a smaller machine, known as Dawn, which is currently being built. Both systems will be used for the ongoing safety and reliability of USA's nuclear stockpile.
[5] Barnes. Josh, Hut. Piet, (12/04/1986). [http://www.nature.com.www.lib.ncsu.edu:2048/nature/journal/v324/n6096/abs/324446a0.html "A hierarchical O(N log N) force-calculation algorithm".] Nature (London) (0028-0836), 324 (6096), p. 446. (available at NCSU library)


Supercomputer speeds are advancing rapidly as manufacturers latch on to new techniques and cheaper prices for computer chips. The first machine to break the teraflop barrier - a trillion calculations per second - was only built in 1996. Two years ago a $59m machine from Sun Microsystems, called Constellation, attempted to take the crown of world's fastest with operating speeds of 421 teraflops. Just two years later, Sequoia was able to achieve nearly 50 times the computing power.
[6] [http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.38.5549 A Data-Parallel Implementation of the Adaptive Fast Multipole Algorithm 1993]
 
[7] Johnsson, SL (01/01/1996). [http://hpc.sagepub.com/content/10/1/3.abstract "A data-parallel implementation of hierarchical N-body methods"]. The international journal of supercomputer applications and high performance computing (1078-3482), 10 (1), p. 3.
 
[8] [http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.20.8494 Implementing N-body Algorithms Efficiently in Data-Parallel Languages (1996)]
 
[9] [http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.50.100 A Parallel Hashed Oct-Tree N-Body Algorithm (1993)]
 
[10] [http://www.cse.buffalo.edu/faculty/miller/Courses/CSE633/Suraj-Balchand-F2010.pdf N-Body Simulation using CUDA]
 
[11] [http://www.google.com/url?sa=t&source=web&cd=3&ved=0CCIQFjAC&url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Bjsessionid%3D55F4FCF50625252377FC37BD08D3A47D%3Fdoi%3D10.1.1.156.7082%26rep%3Drep1%26type%3Dpdf&rct=j&q=n%20body%20shared%20memory%20&ei=tHJsTdaJIdKgtwfq7ZHqBQ&usg=AFQjCNGK_eYtZu2q0VvP0hDNBYQ2izKUAA&cad=rja Fast N-Body Simulation with CUDA]
 
[12] Tomoyoshi Ito, Junichiro Makino, Toshikazu Ebisuzaki, Daiichiro Sugimoto, [http://www.sciencedirect.com/science/article/B6TJ5-46FXBXB-7D/2/6fa9a86939397d2d81b52a6e066f5642 A special-purpose N-body machine GRAPE-1], Computer Physics Communications, Volume 60, Issue 2, September 1990, Pages 187-194.
 
[13] Mills, P.H.; Nyland, L.S.; Prins, J.F.; Reif, J.H.; , [http://www.cs.duke.edu/~reif/paper/proteus/ipps92.pdf "Prototyping N-body simulation in Proteus,"] Parallel Processing Symposium, 1992. Proceedings., Sixth International , vol., no., pp.476-482, 23-26 Mar 1992
 
[14] David E. Culler, Jaswinder Pal Singh and Anoop Gupta. [http://www2.lib.ncsu.edu/catalog/record/NCSU1020944 "Parallel computer architecture : a hardware/software approach"]Morgan Kaufmann Publishers, c1999.
 
= Appendix: Comparison of Performance =
 
The following table is a summary of sequential and parallel implementations of N-body simulation from [[#References|[7]]].
<table border="1" bordercolor="#FFFFFF" style="background-color:#FFFFFF" width="800" cellpadding="3" cellspacing="3">
<tr>
<td>Author</td>
<td>Method and Error</td>
<td>Program Model</td>
<td>N</td>
<td>Times (s)</td>
<td>P</td>
<td>Efficiency</td>
<td>Machine</td>
</tr>
<tr>
<td>Salmon (1990)</td>
<td>BH, quadrupole</td>
<td>Message passing</td>
<td>?</td>
<td>?</td>
<td>?</td>
<td>?</td>
<td>Ncude</td>
</tr>
<tr>
<td>Warren_Salmon (1993)</td>
<td>BH,ϵ = 10<sup>-3</sup></td>
<td>Message passing</td>
<td>8.78 M</td>
<td>114</td>
<td>512</td>
<td>28%</td>
<td>Intel Delta</td>
</tr>
<tr>
<td>Warren_Salmon (1995)</td>
<td>BH,ϵ = 10<sup>-2</sup></td>
<td>Message passing</td>
<td>2 M</td>
<td>10.8</td>
<td>256</td>
<td>?</td>
<td>CM-5E</td>
</tr>
<tr>
<td>Liu-Bhatt (1994)</td>
<td>BH, quadrupole</td>
<td>Message passing</td>
<td>10 M</td>
<td>59</td>
<td>256</td>
<td>30%</td>
<td>CM-5</td>
</tr>
<tr>
<td>Leathrum-Board(1992)</td>
<td>GR,p=8</td>
<td>Shared memory</td>
<td>1 M</td>
<td>1520</td>
<td>32</td>
<td>20%</td>
<td>KSR-1</td>
</tr>
<tr>
<td>Elliott-Board(1994)</td>
<td>GR,FFT,p=8</td>
<td>Shared memory</td>
<td>1 M</td>
<td>1420</td>
<td>32</td>
<td>14%</td>
<td>KSR-1</td>
</tr>
<tr>
<td>Schmidt-Lee(1991)</td>
<td>GR,p=8</td>
<td>?</td>
<td>40,000</td>
<td>94</td>
<td>1</td>
<td>39%</td>
<td>CRAY Y-MP 8/864</td>
</tr>
<tr>
<td>Zhao-Johnsson(1991)</td>
<td>Zhao, p=3</td>
<td>Data parallel</td>
<td>16,000</td>
<td>5</td>
<td>8 K</td>
<td>12%</td>
<td>CM-2</td>
</tr>
<tr>
<td>Hu-Johnsson</td>
<td>Anderson</td>
<td>Data parallel</td>
<td>100 M</td>
<td>180</td>
<td>256</td>
<td>27%-35%</td>
<td>CM-5E</td>
</tr>
<tr>
<td>Singh et al.(1993)</td>
<td>GR, 2-D, adaptive</td>
<td>Shared memory</td>
<td>?</td>
<td>?</td>
<td>?</td>
<td>?</td>
<td>DASH, KSR-1</td>
</tr>
 
</table>
 
All performance numbers are for uniform particle distributions. Methods used are for 3-D, unless other wise stated. "?" imply unavailable data.

Latest revision as of 05:28, 1 April 2011

Introduction

N-body Problem. Picture source: NASA

N-body problem is one of the most important topics in celestial mechanics. The mathematical formulation of N-body problem is a little bit beyond the scope of this book so it will not be discussed here. Based on the mathematics, we can simplify the N-body problem as follows: Select the position and velocity of N celestial bodies as state variables. Given the initial condition of N bodies, compute their states at arbitrary time T. Normally a three-dimensional space is considered for N-body problem. There is a simplified N-body problem called restricted N-body problem where the mass of some of the bodies is negligible. Several remarkable three-body simulation can be found in [1].

The trajectory of restricted three-body system [1]

Many mathematicians have proofed that it is impossible to find a closed-form solution for n-body problem analytically[2][3]. The system could become unstable very easily. However, the problem can be solved numerically. The most common approach is to iterate over a sequence of small time steps. Within each time step, the acceleration on a body is approximated by the transient acceleration in the previous time step. The transient acceleration on a single body can be directly computed by summing the gravity from each of the other N-1 bodies. While this method is conceptually simple and is the algorithm of choice for many applications, its O(N2).

The simulation of N-body system can be used from simulation of celestial bodies (gravitational interaction)to interactions of a set of particles (electromagnetic interaction).

Parallel N-body problem

As MIT professors Dimitri Bertsekas and John Tsitsiklis stated in their book "Parallel and Distributed Computation: Numerical Methods", parallel computers have provided an excellent platform for numerical analysis. In the N-body problem, since each particle will interact with others by the force of gravity, the simulation of N-body system is computationally expensive for large numbers of N. There a O(N2) interactions to compute for every iteration. Furthermore, in order to have a accurate result, the discrete time step must relatively small. Thus, there has been a huge interest in faster parallel algorithm for N-body problem.

Data-parallel Implementation

Barnes-Hut Tree (BH tree)

In 1985, Appel took the first step of decomposes the problem by introducing a tree structure[4]. In the next year, Barnes and Hut extend the tree-based force calculation with logarithmic growth of force terms per particle[5]. That is, construct a tree hierarchy(BH tree) of bodies based on the partition of entire space. The partition is shown on the left. The empty block have been pruned, thus the traversal time have been reduced fromO(N2) to O(N logN).

A recursive partition in two dimension and its corresponding BH tree. Picture from [6]
For each time step:
1. Build the BH tree.
2. Compute centers-of-mass bottom-up
3. For each body
   Star a depth-first traversal of the tree; 
   Truncating the search at internal nodes where the approximation is applicable;
   Update the contribution of the node to the acceleration of the body.
4. Update the velocity and position of each body.

There are several challenges of this decomposition:

  1. Unlike the ocean current can be represented evenly by regular grid points in the entire space, the density of the galaxies is varying in different space. In some spaces the density of stars maybe very high, but in some other spaces there might be only few stars at certain time. This implies the computation load for each body will be different.
  2. The other challenge is that the position of the body is time-varying, which means the static assignment (which has been used in ocean application) may not work well.
  3. The force calculation of certain a body needs information from other bodies. In order to reduce the communication among processors, the partitions needs to be contiguous in any directions.
  4. The last challenge is that there is a trade-off between the level of decomposition and communication. That is, the fine grained approach could speed-up the force calculation but it will need more communications between different groups of stars.

Even though the Barnes-Hut tree decomposition algorithm have these challenges, BH tree is still the one of most efficient methods for solving the n-body problem.

Orthogonal Recursive Bisection (ORB)

Orthogonal Recursive Bisection

Unlike BH tree divided the space into square cells, the Orthogonal Recursive Bisection (ORB) divided space into rectangular with same number of bodies in each of them. It is a recursive approach because all processors are associated with the entire space at the very beginning; then a orthogonal line has been drawn to separate the space into bisection; each of the subsection has associated with half of the processors. Until the number of partition is equal to the number of processors.

Other Data-parallel Implementation

When using data-parallel algorithm to solve the N-body problem, one obvious approach for the data-parallel implementation is to divide the interactions into different sets based on the distances from current particle to other "interacting" particles. Thus the force from 'far away' particles can be updated less frequently or even can be ignored.

Adaptive decompositions for a given box R, showing examples for separations of 1 and 2. Figure from [6]

By separating the blocks by the Hamming distance to the R block. Three interaction list could be created by the distance: The direct interaction list Dr; the far interaction list Fr and the intermediate interaction list Hr

The steps for each iteration using the given decomposition of the simulation are [6]:

  1. Compute the multipole expansion coefficients for all leaves in the tree decomposition.
  2. Compute the multipole expansion coefficients for all internal nodes in the tree with depth ≥ 2.
  3. Compute the local expansion coefficients for a region R by summing R’s parent’s local expansion (shifted from the parent’s center to the center of R) with the sum of all of the multipole expansions in R’s far list, Fr(converting the multipole expansions to local expansions and shifting to the center of R) for all regions with depth ≥ 2.
  4. For each body, b, in each leaf region, R, compute all the direct forces on b from all the bodies in the regions in R’s direct interaction list (Dr).
  5. For each body, b, in each leaf region, R, compute the far force on b by evaluating the local expansion for region R at b’s position.
  6. For each body, b, in each leaf region, R, compute the intermediate force by evaluating the multipole expansion at b’s position for each region in R’s intermediate interaction list (Hr).
  7. Sum the 3 components of the force and potential for each body.
  8. Apply the forces, updating the positions and velocities, and move the bodies to their proper regions as indicated by boundary crossing.

Another data-parallel implementation can be found in [8]. The drawback of this type of approach is obvious: the error may be increased significantly due to the clustering process.

Shared-memory Implementation

Costzones

The partitioning of Costzones

It is easy to share the BH tree in a shared memory machine. The so-called Costzones is one of the approaches to implement the BH tree algorithm in the shared-memory machine. We know that the processor will accesses more the nearby bodies than distant ones. Since the BH tree already contain the spatial information of bodies (more likely the nearby bodies will have a same parent with me), the Costzones technique simply uses the data structure created by the BH tree.

The costzones method have a better performance in a shared-memory machine compared with the ORB method. The reason is mainly because the time spent in the space partition is shorter.

Other Shared-memory Implementation

Besides the Costzones method of implementing the N-body simulation on shard-memory machines, there is a tremendous interests of using CUDA to solve the N-body problem by modern GPUs [10][11].

global void
calculate_forces(void *devX, void *devA)
{
extern __shared__ float4[] shPosition;
float4 *globalX = (float4 *)devX;
float4 *globalA = (float4 *)devA;
float4 myPosition;
int i, tile;
float3 acc = {0.0f, 0.0f, 0.0f};
int gtid = blockIdx.x * blockDim.x + threadIdx.x;
myPosition = globalX[gtid];
for (i = 0, tile = 0; i < N; i += p, tile++) {
int idx = tile * blockDim.x + threadIdx.x;
shPosition[threadIdx.x] = globalX[idx];
syncthreads();
acc = tile_calculation(myPosition, acc);
syncthreads();
}
// Save the result in global memory for the integration step.
float4 acc4 = {acc.x, acc.y, acc.z, 0.0f};
globalA[gtid] = acc4;
}

CUDA shows an impressive potential to execute extremely parallel applications such as N-body problem. Twenty years ago, the N-body simulation can only runs on super computers. Because of the development of CUDA makes the classical N-body problem become a term project for some graduate students (see 2008 Fall CS 525 Project2 n-body simulation in CUDA).

Message-passing Implementation

A quad-tree shown along with the binary key coordinates of the nodes. Figure from [9]

Implementing the BH tree algorithm is more difficult in a distributed-memory system compared with the shared-memory implementation. Because we need assign the different bodies to different processor dynamically. When the location of bodies have been changed exceed a certain limit, the BH tree needs to be reconstruct. The good news is we don't need reconstruct the BH tree very frequently(compared with the simulation step size).

In message-passing architectures, a processor’s local memory is often used as a software-controlled cache for communicated data, in addition to holding the processor’s own assigned data partition; that is, data are replicated in the memories of those processors that need to use them.

Warren and Salmon introduced a parallel hashed Oct-tree N-body algorithm[9]. The hashed Oct-tree is the message-passing version of BH tree implementation. One of the advantages of using hash table is that the non-local data may be accessed by requesting a key, which is a uniform addressing scheme and easy to be implemented on a message passing architecture. The pseudocode of tree traversal are shown below:

ListTraverse((*MAC)(hcell *))
{
copy root to walk_list;
while (!Empty(walk_list)) {
for (each item on walk_list) {
for (each daughter of item) {
if (MAC(daughter))
copy daughter to interact_list;
else
copy daughter to output_walk_list;
}
}
walk_list = output_walk_list;
}
}

There are two major challenges of using message-passing parallel to solve the N-body problem:

  1. The nonuniform and dynamic nature of the particle distribution, which implies that the partitions assigned to processors change with time to maintain load balancing and data locality;
  2. The need for long-range communication which is irregular as a result of the nonuniformity.

The typical message-passing strategy of allowing communicated data to accumulate in local memory and flushing them out at certain points in the program does not work well here like it does in regular predictable programs. Thus, the only convenient points at which to flush communicated data in the application program are the boundaries between computational phases. Since the amount of nonlocal data read from the tree during the force computation phase can be quite large, the memory overhead due to data replication can be large when replacement is managed in this typical message-passing style.

Other Parallel Approaches

Other than the conventional approaches, there are also some new techniques have been developed to simulate the N-body problem:

  • Hardware side: The researcher in University of Tokyo have designed and built a special purpose hardware called GRAPE for the N-body simulation. The chip has been designed to optimize the computation of gravitational interactions.[12] The GRAPE has been keep developed by researchers, the most updated version is GRAPE 6.
  • Software (algorithm) side: On the other hand, the high-level architecture-independent algorithms also have been develop to simulate the N-body problem[13]. These algorithm allowed researchers test their prototype parallel program under different architecture without using the machine-specified languages.

References

[1] Collection of remarkable three-body motions

[2] About Henri Poincaré

[3] Diacu, F (01/01/1996). "The solution of the n-body problem" . The Mathematical intelligencer (0343-6993), 18 (3), p. 66.

[4] A. Appel, "An Efficient Program for Many-Body Simulation" , SIAM J. Scientific and Statistical Computing, vol. 6, 1985 (available at NCSU library)

[5] Barnes. Josh, Hut. Piet, (12/04/1986). "A hierarchical O(N log N) force-calculation algorithm". Nature (London) (0028-0836), 324 (6096), p. 446. (available at NCSU library)

[6] A Data-Parallel Implementation of the Adaptive Fast Multipole Algorithm 1993

[7] Johnsson, SL (01/01/1996). "A data-parallel implementation of hierarchical N-body methods". The international journal of supercomputer applications and high performance computing (1078-3482), 10 (1), p. 3.

[8] Implementing N-body Algorithms Efficiently in Data-Parallel Languages (1996)

[9] A Parallel Hashed Oct-Tree N-Body Algorithm (1993)

[10] N-Body Simulation using CUDA

[11] Fast N-Body Simulation with CUDA

[12] Tomoyoshi Ito, Junichiro Makino, Toshikazu Ebisuzaki, Daiichiro Sugimoto, A special-purpose N-body machine GRAPE-1, Computer Physics Communications, Volume 60, Issue 2, September 1990, Pages 187-194.

[13] Mills, P.H.; Nyland, L.S.; Prins, J.F.; Reif, J.H.; , "Prototyping N-body simulation in Proteus," Parallel Processing Symposium, 1992. Proceedings., Sixth International , vol., no., pp.476-482, 23-26 Mar 1992

[14] David E. Culler, Jaswinder Pal Singh and Anoop Gupta. "Parallel computer architecture : a hardware/software approach"Morgan Kaufmann Publishers, c1999.

Appendix: Comparison of Performance

The following table is a summary of sequential and parallel implementations of N-body simulation from [7].

Author Method and Error Program Model N Times (s) P Efficiency Machine
Salmon (1990) BH, quadrupole Message passing ? ? ? ? Ncude
Warren_Salmon (1993) BH,ϵ = 10-3 Message passing 8.78 M 114 512 28% Intel Delta
Warren_Salmon (1995) BH,ϵ = 10-2 Message passing 2 M 10.8 256 ? CM-5E
Liu-Bhatt (1994) BH, quadrupole Message passing 10 M 59 256 30% CM-5
Leathrum-Board(1992) GR,p=8 Shared memory 1 M 1520 32 20% KSR-1
Elliott-Board(1994) GR,FFT,p=8 Shared memory 1 M 1420 32 14% KSR-1
Schmidt-Lee(1991) GR,p=8 ? 40,000 94 1 39% CRAY Y-MP 8/864
Zhao-Johnsson(1991) Zhao, p=3 Data parallel 16,000 5 8 K 12% CM-2
Hu-Johnsson Anderson Data parallel 100 M 180 256 27%-35% CM-5E
Singh et al.(1993) GR, 2-D, adaptive Shared memory ? ? ? ? DASH, KSR-1

All performance numbers are for uniform particle distributions. Methods used are for 3-D, unless other wise stated. "?" imply unavailable data.