== ''Introduction'' ==

Cache organization plays a key role in modern computers, especially in multiprocessors. As the number of processors scales up, the cache miss rate also tends to increase. A high cache miss rate is a cause for concern, since it can significantly limit multiprocessor performance. Cache misses are broadly categorized into the “Three Cs”: compulsory misses, capacity misses, and conflict misses. Cache-coherent multiprocessors introduce a further category, called coherence misses. These occur when blocks of data are shared among multiple caches, and they are of two types:

• '''True sharing''': When a data word produced by one processor is used by another processor, the sharing is said to be true sharing. True sharing is intrinsic to a particular memory reference stream of a program and does not depend on the block size.

• '''False sharing''': When independent data words used by different processors are placed in the same block, the sharing is called false sharing.

Increasing the cache line size helps reduce the miss rate, because each fetch brings in more neighboring words and so exploits spatial locality. However, long cache lines may cause false sharing when different processors access different words in the same cache line. In essence, the processors share the line without truly sharing the data they access.

==''Problem with False Sharing'' ==

In multiprocessors, accessing a memory location causes a slice of memory (a cache line) containing the requested location to be copied into the cache. Subsequent references to the same location or to those around it can probably be satisfied from the cache, until the system determines that it must restore coherency between cache and memory and writes the cache line back to memory.

However, when multiple processors update individual elements in the same cache line, the entire line is invalidated, even though the updates are independent of each other. Each update of an individual element marks the line as invalid in the other caches. Hence, other processors accessing a different element in the same line see the line marked as invalid. They are forced to fetch a fresh copy of the line from memory, even though the element they access has not been modified. This is because cache coherency is maintained on a cache-line basis, not for individual elements. As a result, interconnect traffic and overhead increase. Also, while the cache-line update is in progress, access to the elements in the line is inhibited. This situation is called false sharing, and it can become a bottleneck for performance and scalability.
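Because coherency is tracked per line, whether two variables can falsely share depends only on whether their addresses fall in the same line. The following C sketch illustrates that check. The 64-byte line size, the <code>counters</code> array, and the helper names are assumptions made for illustration; the real line size depends on the hardware.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed cache-line size in bytes; the real value is hardware-specific. */
#define CACHE_LINE_SIZE 64

/* Hypothetical per-processor counters packed back to back, starting on a
   line boundary. Several neighbouring counters land in the same line. */
static _Alignas(CACHE_LINE_SIZE) int counters[32];

static_assert(sizeof(counters) >= 2 * CACHE_LINE_SIZE,
              "the array must span at least two assumed lines");

/* Index of the cache line that holds address p. */
static uintptr_t line_of(const void *p)
{
    return (uintptr_t)p / CACHE_LINE_SIZE;
}

/* True when a write to *a would invalidate a cached copy of *b,
   because both addresses lie in the same line. */
static bool same_cache_line(const void *a, const void *b)
{
    return line_of(a) == line_of(b);
}
```

If processor 0 updates <code>counters[0]</code> and processor 1 updates <code>counters[1]</code>, <code>same_cache_line</code> reports that the two elements share a line, so each write invalidates the other processor's copy even though no data is actually shared.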

== ''Strategies to combat False Sharing'' ==

Several strategies have been proposed, and are being worked on, to decrease false sharing in multiprocessors. This article introduces a few of the salient ones.

* Reducing False Sharing through Proper Block Sizing
* Reducing False Sharing through Optimizing the Layout of Shared Data in Cache Blocks
* Reducing False Sharing through Compile Time Data Transformations
* Reducing False Sharing through Sectored Caches

=== Reducing False Sharing through Proper Block Sizing ===

An algorithm can be designed to select block sizes that minimize bus traffic through the use of variable (static) size blocks; that is, the block size varies over the memory space of the program, but any given word is assigned to one fixed block size for the entire program execution. Starting with each word in the memory space that is used, neighboring blocks are combined if, when combined, they produce less bus traffic than when left as separate blocks. When neighboring words have similar access patterns and it is useful to prefetch one while demand fetching the other, grouping the words (or blocks) into a single unit reduces traffic because fewer addresses are transmitted over the bus. When false sharing generates excessive traffic, the problem blocks are isolated by not combining them into larger units.

=== Reducing False Sharing through Optimizing the Layout of Shared Data in Cache Blocks ===

A very important parameter that affects false sharing is the cache block size. In uniprocessors, increasing the block size tends to exploit more spatial locality, so the miss rate decreases. In multiprocessors, however, increasing the block size not only exploits more spatial locality but also increases the probability of false sharing. Thus the miss rate can go up or down as the block size increases in multiprocessors. The graph below shows the variation of miss rate as a function of block size for 16 and 32 processors. It depicts a significant increase in the miss rate as the block size increases, due to false sharing.

[[Image:miss rate vs block size.jpg]]

Different data placement optimizations have been suggested to improve the miss rate due to false sharing. They are listed below.

* '''SplitScalar:''' '''''Place scalar variables that cause false sharing in different blocks.'''''

* '''Heap Allocate:''' '''''Allocate shared space from different heap regions according to which processor requests the space.''''' It is common for a slave process to access the shared space that it requests itself. If no action is taken, the space allocated by different processes may share the same cache block and lead to false sharing.

* '''Expand Record:''' '''''Expand records in an array (padding with dummy words) to reduce the sharing of a cache block by different records.''''' While successful prefetching may occur within a record or across records, false sharing usually occurs across records, when more than one of them share the same cache block.

* '''Align Record:''' '''''Choose a layout for arrays of records that minimizes the number of blocks the average record spans.''''' This optimization maximizes prefetching of the rest of the record when one word of a record is accessed, and may also reduce false sharing.

* '''LockScalar:''' '''''Place active scalars that are protected by a lock in the same block as the lock variable.''''' As a result, the scalar is prefetched when the lock is accessed.

False sharing is caused by a mismatch between the memory layout of write-shared data and the cross-processor memory reference pattern to it. Manually changing the placement of this data to better conform to the memory reference pattern can reduce false sharing by up to 75%.
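As an illustration of the Expand Record idea, the C sketch below pads a small record so that each array element occupies a whole cache line. The record fields, the 64-byte line size, and the helper <code>records_per_line</code> are hypothetical and chosen only for the example; this is one possible way to express the padding, not the method used in the cited work.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed cache-line size in bytes; the real value is hardware-specific. */
#define CACHE_LINE_SIZE 64

/* Original record: small enough that several records share one line. */
struct record {
    int hits;
    int misses;
};

/* Expanded record: aligning the first member to the line size makes the
   compiler pad the struct so that each array element owns a full line. */
struct padded_record {
    _Alignas(CACHE_LINE_SIZE) int hits;
    int misses;
};

static_assert(sizeof(struct padded_record) % CACHE_LINE_SIZE == 0,
              "a padded record must fill whole lines");

/* Number of records of the given size that can share one line. */
static size_t records_per_line(size_t record_size)
{
    return record_size >= CACHE_LINE_SIZE ? 1 : CACHE_LINE_SIZE / record_size;
}
```

The cost of this layout is memory: an array of <code>struct padded_record</code> uses a full line per element, so it trades cache capacity for the removal of cross-record false sharing.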

=== Reducing False Sharing through Compile Time Data Transformations ===

This section introduces the concepts of compile time data transformations that reduce false sharing in multiprocessors. Please refer to the paper by Tor E. Jeremiassen and Susan J. Eggers for a detailed explanation of the algorithms.

A series of compiler-directed algorithms and a suite of transformations can be employed to restructure shared data at compile time. These algorithms analyze explicitly parallel programs, producing information about their cross-processor memory reference patterns that identifies data structures susceptible to false sharing, and then choose appropriate transformations to eliminate it.

The compiler analysis comprises three stages:

* Determine the section of code each process executes by computing its control flow graph (per-process control flow analysis).
* Perform non-concurrency analysis by examining the barrier synchronization pattern of the program, delineating the phases that cannot execute in parallel and computing the flow of control between them.
* Perform a summary side-effect analysis on a per-process basis for each phase (determined in stage two).

The per-process control flow analysis and the summary side-effect analysis (stages 1 and 3 respectively) yield the sections of shared data that each processor reads and writes.

The non-concurrency analysis (stage 2) uses synchronization points to determine which portions of the code can execute in parallel and which cannot.

In order to reduce the number of false sharing misses, data must be restructured so that:
* Data that is only, or overwhelmingly, accessed by one processor is grouped together.
* Write shared data objects with no processor locality do not share cache lines.

Two transformations have been devised to achieve the above two conditions.
* '''''group and transpose''''' – To address condition 1.
* '''''padding''''' – To address condition 2.

'''''Group and transpose''''' physically groups data together by changing the layout of the data structures in memory. If each processor’s data is smaller than the cache block size, it may be padded so that no two processors’ data share a cache block. In addition to avoiding false sharing, this also improves spatial locality.

The second transformation, '''''padding''''', pads data that is falsely shared in the short term but eventually write-shared by all processors over time.
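The effect of group and transpose on a two-dimensional array can be sketched in C by comparing byte offsets before and after the transformation. The example assumes that element (i, p) is accessed only by processor p, that the array starts on a line boundary, and that lines are 64 bytes long; the array dimensions and function names are hypothetical, and the sketch shows only the layout change, not the compiler analysis that selects it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NPROCS          4    /* hypothetical number of processors */
#define ITEMS           16   /* elements owned by each processor */
#define CACHE_LINE_SIZE 64   /* assumed line size in bytes */

/* Before: layout a[ITEMS][NPROCS]. Elements owned by different
   processors are interleaved, so neighbours in memory belong to
   different processors. */
static size_t offset_before(size_t i, size_t p)
{
    return (i * NPROCS + p) * sizeof(int32_t);
}

/* After: layout a[NPROCS][ITEMS]. Each processor's elements are grouped
   into one contiguous run of ITEMS * 4 = 64 bytes. */
static size_t offset_after(size_t i, size_t p)
{
    return (p * ITEMS + i) * sizeof(int32_t);
}

/* Cache line index of a byte offset, assuming a line-aligned base. */
static size_t line_of_offset(size_t offset)
{
    return offset / CACHE_LINE_SIZE;
}
```

With these sizes, the original layout places the first elements of processors 0 and 1 in the same line, while the transposed layout gives each processor a line of its own and keeps all of a processor's elements together, which is also where the spatial locality benefit comes from.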

The speedups achieved with and without compile time data transformations for a few test programs are given below.
[[Image:Plots.jpg]]
==''References''==

[http://iacoma.cs.uiuc.edu/iacoma-papers/false_sharing.pdf]
False Sharing and Spatial Locality in Multiprocessor Caches
Josep Torrellas, Member, IEEE, Monica S. Lam, Member, IEEE, and John L. Hennessy, Fellow, IEEE

[http://www.eecs.berkeley.edu/Pubs/TechRpts/1999/CSD-99-1064.pdf]
Analysis of Shared Memory Misses and Reference Patterns
Jeffrey B. Rothman and Alan Jay Smith

[http://coblitz.codeen.org:3125/citeseer.ist.psu.edu/cache/papers/cs/1663/ftp:zSzzSzftp.cs.washington.eduzSztrzSz1994zSz09zSzUW-CSE-94-09-05.pdf/jeremiassen94reducing.pdf]
Reducing False Sharing on Shared Memory Multiprocessors through Compile Time Data Transformations.
Tor E. Jeremiassen and Susan J. Eggers

CSC/ECE 506 Fall 2007/wiki3 1 satkar

2007-10-18T00:11:05Z

Kperi:



2007-09-06T03:42:19Z

Kperi:

[http://en.wikipedia.org/wiki/Blade_server] Wikipedia

[http://www.compactpci-systems.com/columns/software_corner/pdfs/3.03.pdf] www.compactpci-systems.com

[http://www.bladeserverscenter.com/i_technology.shtml] www.bladeserverscenter.com

[http://www.blade.org/techover.cfm] www.blade.org

[http://h41112.www4.hp.com/promo/blades-community/eur/en/library/articles/Both_worldspdf.pdf] www.hp.com

[http://www.terian.com/terianprods.asp?s=Blades] www.terian.com

[http://www.cisco.com/application/pdf/en/us/guest/netsol/ns500/c654/cdccont_0900aecd804ab4ce.pdf]www.cisco.com

[http://www.hpcx.ac.uk/support/training/MPP.html]www.hpcx.ac.uk

David E. Culler, Jaswinder Pal Singh, with Anoop Gupta,
Parallel Computer Architecture: A Hardware/Software Approach, © 1999 Morgan-Kauffman

Modular Systems: The Evolution of Reliability
White Paper #76 by Neil Rasmussen Suzanne Niles

File:Message Passing2.jpg

2007-09-06T03:41:11Z

Kperi:

1.Message passing


2007-09-06T03:40:26Z

Kperi:

== Introduction ==

Parallel programming requires interaction between the processes that run simultaneously on the individual processors, and this is enabled by passing messages between the processors. This important class of parallel machines, called message-passing architectures, employs complete computers as building blocks, including the microprocessor, memory, and I/O system, and provides communication between processors as explicit I/O operations. This style of architecture has much in common with networks of workstations, or clusters, except that the packaging of nodes is typically much tighter and the network is of much higher capability than a standard local area network.

The world's largest supercomputers are used almost exclusively to run applications that are parallelised using message passing. This programming model is directly applicable to almost every parallel computer architecture.

Parallel programming by definition involves co-operation between processes to solve a common task. The programmer has to define the tasks that will be executed by the processors, and also how these tasks are to synchronise and exchange data with one another. In the message-passing model the tasks are separate processes that communicate and synchronise by explicitly sending each other messages. All these parallel operations are performed via calls to some message-passing interface that is entirely responsible for interfacing with the physical communication network linking the actual processors together.

[[Image:Message Passing.jpg]]

== Message Passing ==

In message passing, a substantial distance exists between the programming model and the actual hardware primitives, with user communication performed through operating system or library calls that perform the low-level actions, including the actual communication operation. The most common user-level communication operations in message passing are variants of send and receive. In its simplest form, send specifies a local data buffer that is to be transmitted and a receiving process (typically on a remote processor). Receive specifies a sending process and a local data buffer into which the transmitted data is to be placed. Together, a matching send and receive cause a data transfer from one processor to another. In most message passing systems, the send operation also allows an identifier or tag to be attached to the message, and the receive operation specifies a matching rule (such as a specific tag from a specific processor).
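The matching rule can be illustrated with a toy, single-process mailbox written in C. This is a minimal sketch for exposition only: the names <code>mp_send</code> and <code>mp_recv</code>, the fixed-size mailbox, and the slot-scanning order are all assumptions, and no real message-passing library or network is involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_MSGS 8    /* hypothetical mailbox capacity */
#define MAX_LEN  32   /* hypothetical maximum message size in bytes */

struct message {
    bool   used;      /* slot currently holds an undelivered message */
    int    source;    /* rank of the sending process */
    int    tag;       /* identifier attached by the sender */
    size_t len;
    char   data[MAX_LEN];
};

/* One mailbox per receiving process. */
struct mailbox {
    struct message slots[MAX_MSGS];
};

/* Send: copy the sender's local buffer into a free slot of the
   destination mailbox, labelled with the source rank and tag.
   Returns false if the message is too long or the mailbox is full. */
static bool mp_send(struct mailbox *dest, int source, int tag,
                    const void *buf, size_t len)
{
    if (len > MAX_LEN)
        return false;
    for (size_t i = 0; i < MAX_MSGS; i++) {
        struct message *m = &dest->slots[i];
        if (!m->used) {
            m->used = true;
            m->source = source;
            m->tag = tag;
            m->len = len;
            memcpy(m->data, buf, len);
            return true;
        }
    }
    return false;
}

/* Receive: deliver the first occupied slot whose source and tag both
   match, copying it into the receiver's local buffer. Returns false
   when no pending message satisfies the matching rule. */
static bool mp_recv(struct mailbox *self, int source, int tag,
                    void *buf, size_t cap, size_t *len_out)
{
    for (size_t i = 0; i < MAX_MSGS; i++) {
        struct message *m = &self->slots[i];
        if (m->used && m->source == source && m->tag == tag && m->len <= cap) {
            memcpy(buf, m->data, m->len);
            *len_out = m->len;
            m->used = false;
            return true;
        }
    }
    return false;
}
```

A receive that names source 1 and tag 9 does not match a pending message sent by process 1 with tag 7; the message stays in the mailbox until a receive with the matching source and tag is issued.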

[[Image:Message Passing.jpg]]

The combination of a send and a matching receive accomplishes a memory-to-memory copy, where each end specifies its local data address, and a pairwise synchronization event. There are several possible variants of this synchronization event, depending upon whether the send completes when the receive has been executed, when the send buffer is available for reuse, or when the request has been accepted. Similarly, the receive can potentially wait until a matching send occurs or simply post the receive. Each of these variants has somewhat different semantics and different implementation requirements. Message passing has long been used as a means of communication and synchronization among arbitrary collections of cooperating sequential processes, even on a single processor. Important examples include programming languages, such as CSP and Occam, and common operating system functions, such as sockets. Parallel programs using message passing are typically quite structured, like their shared-memory counterparts. Most often, all nodes execute identical copies of a program, with the same code and private variables. Usually, processes can name each other using a simple linear ordering of the processes comprising a program.

== Typical Structure ==

Early message passing machines provided hardware primitives that were very close to the simple send/receive user-level communication abstraction, with some additional restrictions. A node was connected to a fixed set of neighbors in a regular pattern by point-to-point links that behaved as simple FIFOs. Most early machines were hypercubes, where each node is connected to n other nodes differing by one bit in the binary address, for a total of 2^n nodes, or meshes, where the nodes are connected to neighbors in two or three dimensions. The network topology was especially important in the early message passing machines, because only the neighboring processors could be named in a send or receive operation. The data transfer involved the sender writing into a link, and the sender could not finish writing the message until the receiver started reading it, so the send would block until the receive occurred. In modern terms this is called synchronous message passing, because the two events coincide in time. The details of moving data were hidden from the programmer in a message passing library, forming a layer of software between the send and receive calls and the actual hardware.

[[Image:Hypercube.jpg]] 
Typical structure of an early message passing machine
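The hypercube addressing rule, in which neighbors differ by exactly one bit, can be expressed in a few lines of C. The function names below are hypothetical; the sketch only shows the address arithmetic and assumes node addresses fit in an <code>unsigned</code> value.

```c
#include <assert.h>

/* Address of the neighbour reached along dimension k: flip bit k. */
static unsigned hypercube_neighbor(unsigned node, unsigned k)
{
    return node ^ (1u << k);
}

/* Total number of nodes in an n-dimensional hypercube: 2^n. */
static unsigned hypercube_size(unsigned n)
{
    return 1u << n;
}

/* Minimum number of hops between two nodes: the number of address
   bits in which they differ (their Hamming distance). */
static unsigned hypercube_distance(unsigned a, unsigned b)
{
    unsigned diff = a ^ b;
    unsigned hops = 0;
    while (diff != 0) {
        hops += diff & 1u;
        diff >>= 1;
    }
    return hops;
}
```

In a three-dimensional hypercube, node 5 (binary 101) has the neighbors 4 (100), 7 (111), and 1 (001), one for each bit that can be flipped.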

The direct FIFO design was soon replaced by more versatile and more robust designs, which provided direct memory access (DMA) transfers on either end of the communication event. The use of DMA allowed non-blocking sends, where the sender is able to initiate a send and continue with useful computation (or even perform a receive) while the send completes. On the receiving end, the transfer is accepted via a DMA transfer by the message layer into a buffer and queued until the target process performs a matching receive, at which point the data is copied into the address space of the receiving process. The physical topology of the communication network dominated the programming model of these early machines, and parallel algorithms were often stated in terms of a specific interconnection topology, e.g., a ring, a grid, or a hypercube. However, to make the machines more generally useful, the designers of the message layers provided support for communication between arbitrary processors, rather than only between physical neighbors. This was originally supported by forwarding the data within the message layer along links in the network. Soon this routing function was moved into the hardware, so each node consisted of a processor with memory and a switch that could forward messages, called a router. However, in this store-and-forward approach the time to transfer a message is proportional to the number of hops it takes through the network, so there remained an emphasis on interconnection topology.

The emphasis on network topology was significantly reduced with the introduction of more general purpose networks, which pipelined the message transfer through each of the routers forming the interconnection network. In most modern message passing machines, the incremental delay introduced by each router is small enough that the transfer time is dominated by the time to simply move the data between the processor and the network, not by how far it travels. This greatly simplifies the programming model; typically the processors are viewed as simply forming a linear sequence with uniform communication costs. A processor in a message passing machine can name only the locations in its local memory, and it can name each of the processors, perhaps by number or by route. A user process can only name private addresses and other processes; it can transfer data using the send/receive calls.
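The difference between the two routing styles can be made concrete with a simple latency model, sketched in C below. The model is an assumption for illustration: it ignores contention and software overhead, treats the per-router delay as constant, and the parameter names are hypothetical.

```c
#include <assert.h>

/* Store-and-forward: every router receives the whole message before
   forwarding it, so the full transfer time is paid once per hop. */
static double store_and_forward_time(double bytes, double bandwidth,
                                     double hops, double router_delay)
{
    return hops * (bytes / bandwidth + router_delay);
}

/* Pipelined transfer: the message streams through the routers, so the
   transfer time is paid once and each hop adds only the router delay. */
static double pipelined_time(double bytes, double bandwidth,
                             double hops, double router_delay)
{
    return bytes / bandwidth + hops * router_delay;
}
```

For example, under this model a 1000-byte message sent over 4 hops at 100 bytes per microsecond with a 1-microsecond router delay takes 4 × (10 + 1) = 44 microseconds with store-and-forward, but 10 + 4 × 1 = 14 microseconds when pipelined, which is why distance matters much less in the pipelined case.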

2.Blade Servers

2007-09-06T03:35:09Z

Kperi:

== Introduction ==

Blade servers are a revolutionary new concept for enterprise applications currently using a “stack of PC servers” approach. Blade servers promise to greatly increase compute density, reduce cost, improve reliability, and simplify cabling. Companies such as Dell, Hewlett Packard, IBM, RLX, and Sun offer blade server solutions that reduce operating expense while increasing services density. Blade servers form the basis for a modular computing paradigm.

== Evolution ==
For many years, traditional standalone servers grew larger and faster, taking on more and more tasks as networked computing expanded. New servers were added to data centers as the need arose, often as a quick fix with little coordination or planning; it was not unusual for data center operators to discover that servers had been added without their knowledge. The resulting complexity of boxes and cabling became a growing invitation to confusion, mistakes, and inflexibility.

[[Image:Conventional Servers.jpg]]
Figure: Conventional Servers

Blade servers, first appearing in 2001, are a very simple and pure example of modular architecture: the blades in a blade server chassis are physically identical, with identical processors, ready to be configured and used for any purpose desired by the user. Their introduction brought many benefits of modularity to the server landscape: scalability, ease of duplication, specialization of function, and adaptability. Blade servers were developed in response to a critical and growing need in the data center: the requirement to increase server performance and availability without dramatically increasing the size, cost, and management complexity of an ever-growing data center. To keep up with user demand, and because of the space and power demands of traditional tower and rackmount servers, data centers are being forced to expand their physical plant at an alarming rate.

[[Image:Blade Server.jpg]]

But while these classic modular advantages have given blade servers a growing presence in data centers, their full potential awaits the widespread implementation of one remaining critical capability of modular design: fault tolerance. Fault tolerant blade servers – ones with built-in “failover” logic to transfer operation from failed to healthy blades – have only recently started to become available and affordable. The reliability of such fault tolerant servers will surpass that of current techniques involving redundant software and clusters of single servers, putting blade servers in a position to become the dominant server architecture of data centers. With the emergence of automated fault tolerance, industry observers predict rapid migration to blade servers over the forthcoming years.

The Terian EdgeXPS® 714-132 is powered by the Dual-Core Intel® Xeon® 5100 Series Processor and can support up to 14 dual-processor blade servers (28 processors in total) in a single 7U chassis.

[[Image:Terian EdgeXPS® 714-132 .jpg]]

== General blade server architecture ==

A general blade server architecture is shown in the figure. The hardware components of a blade server are the switch blade, the chassis (with fans, temperature sensors, etc.), and multiple compute blades. Some vendors offer, partner, or plan to partner with companies that provide application-specific blades that perform traffic conditioning, protection, or network processing before the traffic reaches the compute blades. Often, these application-specific blades are functionally positioned between the switch blade and the compute blades. However, these blades reside in a standard compute blade slot.

[[Image:Figure 1.jpg]]

The outside world connects through the rear of the chassis to a switch card in the blade server. The switch card is provisioned to distribute packets to blades within the blade server. All these components are wrapped together with network management system software provided by the blade server vendor. The specifics of the blade server architecture vary from vendor to vendor. But before you discount this as a bunch of proprietary architectures, think again. Remember that IBM and others dramatically advanced and proliferated the PC architecture, changing the face of computing forever.

The blade server industry appears to be headed in the same direction. There are some areas where standardization of blade server components will prove helpful. However, blade server vendors' ability to quickly adapt and advance their architectures to suit specific applications, unencumbered by the standards process, will accelerate proliferation in the near term.

== Blade Enclosure ==

The enclosure (or chassis) performs many of the non-core computing services found in most computers. Non-blade computers require components that are bulky, hot and space-inefficient, and duplicated across many computers that may or may not be performing at capacity. By locating these services in one place and sharing them between the blade computers, the overall utilization is more efficient. The specifics of which services are provided and how vary by vendor.

'''Power'''

Computers operate over a range of DC voltages, yet power is delivered from utilities as AC, and at higher voltages than required within the computer. Converting this current requires power supply units (or PSUs). To ensure that the failure of one power source does not affect the operation of the computer, even entry-level servers have redundant power supplies, again adding to the bulk and heat output of the design.

The blade enclosure's power supply provides a single power source for all blades within the enclosure. This single power source may be in the form of a power supply in the enclosure or a dedicated separate PSU supplying DC to multiple enclosures [1]. This setup not only reduces the number of PSUs required to provide a resilient power supply, but it also improves efficiency because it reduces the number of idle PSUs. In the event of a PSU failure the blade chassis throttles down individual blade server performance until it matches the available power. This is carried out in steps of 12.5% per CPU until power balance is achieved.
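The stepwise throttling described above can be sketched in C as a simple loop. The sketch rests on loud assumptions that are not stated in the text: power draw is modelled as scaling linearly with the performance level, all blades are throttled together, and the function name and units are hypothetical.

```c
#include <assert.h>

/* Performance levels are expressed in tenths of a percent so that the
   12.5% step is an exact integer (125). 1000 means full performance. */
#define FULL_LEVEL    1000
#define THROTTLE_STEP 125

/* Step the performance level down by 12.5% at a time until the
   estimated power demand fits within the available power.
   ASSUMPTION: demand = full_power_watts * level / FULL_LEVEL. */
static int throttle_level(double full_power_watts, double available_watts)
{
    int level = FULL_LEVEL;
    while (level > 0 &&
           full_power_watts * level / FULL_LEVEL > available_watts) {
        level -= THROTTLE_STEP;
    }
    return level;
}
```

Under this linear model, a chassis that draws 1000 W at full performance but has only 800 W available after a PSU failure would settle at the 75% level, since 87.5% would still demand 875 W.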

'''Cooling'''

During operation, electrical and mechanical components produce heat, which must be displaced to ensure the proper functioning of the components. In blade enclosures, as in most computing systems, heat is removed with fans.

A frequently underestimated problem when designing high-performance computer systems is the conflict between the amount of heat a system generates and the ability of its fans to remove the heat. The blade's shared power and cooling means that it does not generate as much heat as traditional servers. Newer blade enclosure designs feature high-speed, adjustable fans and control logic that tune the cooling to the system's requirements.[2]

At the same time, the increased density of blade server configurations can still result in higher overall demands for cooling when a rack is populated at over 50%. This is especially true with early generation blades. In absolute terms, a fully populated rack of blade servers is likely to require more cooling capacity than a fully populated rack of standard 1U servers.

'''Networking'''

Computers are increasingly being produced with high-speed, integrated network interfaces, and most are expandable to allow for the addition of connections that are faster, more resilient, and run over different media (copper and fiber). These may require extra engineering effort in the design and manufacture of the blade and consume space, both for the installed interfaces and for the capacity to install more (empty expansion slots), which adds complexity. High-speed network topologies require expensive, high-speed integrated circuits and media, while most computers do not utilise all the bandwidth available.

The blade enclosure provides one or more network buses to which the blade will connect, and either presents these ports individually in a single location (versus one in each computer chassis), or aggregates them into fewer ports, reducing the cost of connecting the individual devices. These may be presented in the chassis itself, or in networking blades[3].

'''Storage'''

While computers typically need hard-disks to store the operating system, application and data for the computer, these are not necessarily required locally. Many storage connection methods (e.g. FireWire, SATA, SCSI, DAS, Fibre Channel and iSCSI) are readily moved outside the server, though not all are used in enterprise-level installations. Implementing these connection interfaces within the computer presents similar challenges to the networking interfaces (indeed iSCSI runs over the network interface), and similarly these can be removed from the blade and presented individually or aggregated either on the chassis or through other blades.

The ability to boot the blade from a storage area network (SAN) allows for an entirely disk-free blade. This may have higher processor density or better reliability than systems having individual disks on each blade.

== Advantages of Blade Servers ==

'''Reduced Space Requirements''' - greater density provides a 35 to 45 percent space improvement compared to tower or rack-mounted servers. 

'''Reduced Power Consumption and Improved Power Management''' - consolidating power supplies into the blade chassis reduces the number of separate power supplies needed and reduces the power requirements per server. 

'''Lower Management Cost''' - server consolidation and resource centralization simplify server deployment, management, and administration, and improve control. 

'''Simplified Cabling''' - rack mount servers, while helping consolidate servers into a centralized location, create wiring proliferation. Blade servers simplify cabling requirements and reduce wiring by up to 70 percent. Power cabling, operator wiring (keyboard, mouse, etc.), and communications cabling (Ethernet, SAN connections, cluster connection) are greatly reduced. 

'''Future Proofing Through Modularity''' - as new processor, communications, storage and interconnect technology becomes available, it can be implemented in blades that install into existing equipment, upgrading server operation at a minimum cost and with no disruption of basic server functionality. 

'''Easier Physical Deployment''' - once a blade server chassis has been installed, adding servers is merely a matter of sliding more blades into the chassis. Software management tools simplify the management and reporting functions for blade servers. Redundant power modules and consolidated communication bays simplify integration into data centers and increase reliability.

== Are blade servers an extension of message passing ? ==

Blade servers use message passing in order to achieve fast and efficient performance. Parallel computing frequently relies upon message passing to exchange information between computational units. In high-performance computing, the most common message passing technology is the '''Message Passing Interface (MPI)''', for which an open-source implementation is being developed with support from Cisco Systems® and other vendors.

High performance computing (HPC) cluster applications require a high performance interconnect for blade servers to achieve fast and efficient performance for computation-intensive applications. When messages are passed between nodes, some time is spent transmitting these messages, and depending on the frequency of the data synchronization between processes, that factor can have a significant effect on total application run time. It is critically important to understand how the application works with respect to interprocess communication patterns and the frequency of updates, because these affect the performance and design of the parallel application, the design of the interconnecting network, and the choice of network technology.

With traditional transport protocols such as TCP/IP, the CPU is responsible for managing how data is moved between I/O and memory, and for transport protocol processing. The effect of this is that time spent communicating between nodes is time not spent processing the application. Therefore, minimizing communication time is a key consideration for certain classes of applications.

MPI is “middleware” software that sits between the application and the network hardware. It provides a portable mechanism to enable messages to be exchanged between processes regardless of the underlying network or parallel computational environment. As such, implementations of the MPI standard use underlying communication stacks such as TCP or UDP over IP, InfiniBand, or Myrinet to communicate between processes. MPI offers a rich set of functions that can be combined in simple or complex ways to solve a wide range of parallel computations. The ability to exchange messages enables instructions or data to be passed between nodes to distribute data sets for calculation. MPI has been implemented on a wide variety of platforms, operating systems, and cluster and supercomputer architectures.

See Also [http://h41112.www4.hp.com/promo/blades-community/eur/en/library/articles/Both_worldspdf.pdf] '''The best of both worlds'''