<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://wiki.expertiza.ncsu.edu/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Vrmanda</id>
	<title>Expertiza_Wiki - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="https://wiki.expertiza.ncsu.edu/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Vrmanda"/>
	<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=Special:Contributions/Vrmanda"/>
	<updated>2026-05-09T02:27:32Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.41.0</generator>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch12_ar&amp;diff=45177</id>
		<title>CSC/ECE 506 Spring 2011/ch12 ar</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch12_ar&amp;diff=45177"/>
		<updated>2011-04-19T01:09:38Z</updated>

		<summary type="html">&lt;p&gt;Vrmanda: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;h1&amp;gt;Interconnection Network Architecture &amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In a multi-processor system, processors need to communicate with each other and access each other's resources. In order to route data and messages between processors, an interconnection architecture is needed.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Typically, in a multiprocessor system, messages passed between processors are frequent and short&amp;lt;sup&amp;gt;1&amp;lt;/sup&amp;gt;. Therefore, the interconnection network architecture must handle messages quickly by having '''low latency''', and must handle several messages at a time to provide '''high bandwidth'''. The minimum amount of data that can be transmitted in one cycle is called a '''phit'''. A phit is typically determined by the width of the link. However, data is transferred at the granularity of a link-level flow-control unit called a '''flit'''. A flit's worth of data can be accepted or rejected at the receiver, depending on the flow control protocol and the amount of buffering available at the receiver.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In a network, a processor along with its cache and memory is considered a '''node'''. The physical wires that connect two nodes are called a '''link'''. A link can be unidirectional, in which data can only be sent in one direction, or bidirectional, in which data can be sent in both directions. A link, together with its sender and receiver, makes up a channel. The device that routes messages between nodes is called a router. The shape of the network, such as the arrangement of its links and routers, is called the network '''topology'''.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The following metrics are considered when deciding on an interconnection network (a small sketch of how they can be computed follows the list):&lt;br /&gt;
&amp;lt;p&amp;gt;1. &amp;lt;b&amp;gt;Diameter&amp;lt;/b&amp;gt;: The maximum distance between any pair of nodes in the network is defined as the diameter. To compute the average distance, the distances of all pairs of nodes are listed and averaged.&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;2. &amp;lt;b&amp;gt;Bisection Bandwidth&amp;lt;/b&amp;gt;: A network can be partitioned in any number of ways. When the network is divided into two equal partitions, the minimum number of links that must be cut is defined as the bisection bandwidth.&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;3. &amp;lt;b&amp;gt;Number of Links&amp;lt;/b&amp;gt;: The number of links in a network is the total count of wires that connect pairs of nodes in the network.&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;4. &amp;lt;b&amp;gt;Degree&amp;lt;/b&amp;gt;: The number of input/output links connecting to each router is defined as the degree of the network.&amp;lt;/p&amp;gt;&lt;br /&gt;
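&lt;br /&gt;
As a rough illustration of these definitions, the sketch below computes diameter, degree, link count, and a brute-force bisection bandwidth for a small topology described as an adjacency list. The helper names are made up for this example, and the brute-force bisection search is only practical for tiny networks.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
from itertools import combinations&lt;br /&gt;
from collections import deque&lt;br /&gt;
&lt;br /&gt;
def diameter(adj):&lt;br /&gt;
    # Longest shortest-path distance over all node pairs (BFS from every node).&lt;br /&gt;
    def bfs(src):&lt;br /&gt;
        dist = {src: 0}&lt;br /&gt;
        q = deque([src])&lt;br /&gt;
        while q:&lt;br /&gt;
            u = q.popleft()&lt;br /&gt;
            for v in adj[u]:&lt;br /&gt;
                if v not in dist:&lt;br /&gt;
                    dist[v] = dist[u] + 1&lt;br /&gt;
                    q.append(v)&lt;br /&gt;
        return dist&lt;br /&gt;
    return max(max(bfs(n).values()) for n in adj)&lt;br /&gt;
&lt;br /&gt;
def degree(adj):&lt;br /&gt;
    # Degree of the network: maximum number of links at any node.&lt;br /&gt;
    return max(len(neighbors) for neighbors in adj.values())&lt;br /&gt;
&lt;br /&gt;
def num_links(adj):&lt;br /&gt;
    # Each bidirectional link is counted once.&lt;br /&gt;
    return sum(len(neighbors) for neighbors in adj.values()) // 2&lt;br /&gt;
&lt;br /&gt;
def bisection_bandwidth(adj):&lt;br /&gt;
    # Brute force: try every equal split and count the fewest links crossing it.&lt;br /&gt;
    nodes = sorted(adj)&lt;br /&gt;
    best = None&lt;br /&gt;
    for half in combinations(nodes, len(nodes) // 2):&lt;br /&gt;
        half = set(half)&lt;br /&gt;
        cut = sum(1 for u in half for v in adj[u] if v not in half)&lt;br /&gt;
        best = cut if best is None else min(best, cut)&lt;br /&gt;
    return best&lt;br /&gt;
&lt;br /&gt;
# 4-node ring: 0-1-2-3-0&lt;br /&gt;
ring4 = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}&lt;br /&gt;
print(diameter(ring4), degree(ring4), num_links(ring4), bisection_bandwidth(ring4))&lt;br /&gt;
# prints: 2 2 4 2&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;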
&lt;br /&gt;
&amp;lt;h1&amp;gt;Types of Network Topologies &amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Linear Array&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_linear.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The nodes are connected linearly as in an array. This type of topology is simple and low cost. However, it has several shortcomings.  First of all, it does not scale well: the longest distance between two nodes, or the '''diameter''', is p-1 for p nodes, so it grows linearly with the number of nodes.  In addition to not scaling well, this topology can also result in high congestion.  [[#References|&amp;lt;sup&amp;gt;[8]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In a linear array interconnection network with '''p''' nodes, the total number of &amp;lt;b&amp;gt;links&amp;lt;/b&amp;gt; would be &amp;lt;b&amp;gt;p-1&amp;lt;/b&amp;gt; and the &amp;lt;b&amp;gt;degree&amp;lt;/b&amp;gt; of the network would be &amp;lt;b&amp;gt;2&amp;lt;/b&amp;gt;. The aggregate bandwidth is &amp;lt;b&amp;gt;p-1&amp;lt;/b&amp;gt; times the link bandwidth, but the bisection bandwidth is equal to one link bandwidth. Since traffic between the two halves of the network must always cross a single link, the bisection bandwidth summarizes the bandwidth characteristics of the network better than the aggregate bandwidth.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Ring&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_ring.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The ring has a similar structure to the linear array, except that the end nodes connect to each other, forming a circular structure. This topology scales better, since the longest distance between two nodes is cut in half, but it still ends up scaling poorly if enough nodes are added.  Congestion is also cut roughly in half, since there are now two paths a packet can take around the ring.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
In a Ring interconnection network with &amp;lt;b&amp;gt;p&amp;lt;/b&amp;gt; nodes, the total number of links in the network would be &amp;lt;b&amp;gt;p&amp;lt;/b&amp;gt; and the degree of the interconnection network is &amp;lt;b&amp;gt;2&amp;lt;/b&amp;gt;. The maximum distance occurs between nodes that are halfway around the network, hence the diameter would be &amp;lt;b&amp;gt;p/2&amp;lt;/b&amp;gt;. The bisection bandwidth of the interconnection network would thus be 2.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;2-D Mesh&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_2Dmesh.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The 2-D mesh can be thought of as several linear arrays put together to form a 2-dimensional structure. Nodes that are not on the edge have 4 input or output links, or a '''degree''' of 4.  This topology has reasonably low energy dissipation without compromising throughput.  The area overhead for this topology is also rather low, since a node is never farther than a clock cycle away; this means there is no need for repeater insertion in between the nodes.  (Repeaters have a high area overhead, so using them causes the area of the topology to increase.) [[#References|&amp;lt;sup&amp;gt;[8]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
In a 2-D Mesh interconnection network with &amp;lt;b&amp;gt;p&amp;lt;/b&amp;gt; nodes, the degree of the network would be 4 and the total number of links would be &amp;lt;b&amp;gt;2*sqrt(p)*(sqrt(p)-1)&amp;lt;/b&amp;gt;. The diameter of the network would be 2*(sqrt(p)-1). The bisection bandwidth of the interconnection is the total number of links divided by the diameter of the network, which works out to sqrt(p).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Cube&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_cube.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The cube can be thought of as a three-dimensional mesh. This topology continues to improve upon the throughput of the 2-D mesh; however, the power dissipation will be slightly higher.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Hypercube&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_hypercube.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The hypercube is essentially multiple cubes put together.  This topology was developed in 1983.  Originally there were no routers; the next hops were simply programmed into each node.  Today the hypercube topology is used by many companies, including Intel.  It is attractive because of its small diameter.  The nodes are numbered in such a way that every pair of neighboring nodes differs in only one bit, which greatly simplifies routing messages through the network (a small routing sketch follows the formulas below). The biggest drawback of the topology is its lack of scalability: for example, if the dimension is increased by one, a link must be added to every node in the network. [[#References|&amp;lt;sup&amp;gt;[10]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
In an interconnection network of &amp;lt;b&amp;gt;p&amp;lt;/b&amp;gt; nodes with Hypercube topology, the degree of the network would be log&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt;p and the diameter of the network would also be log&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt;p. The total number of links in the network would be (p/2)*log&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt;p and the bisection bandwidth of the network would be p/2.&lt;br /&gt;
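&lt;br /&gt;
Because neighboring hypercube nodes differ in exactly one bit, a message can be routed by repeatedly flipping one of the bits in which the current node and the destination differ. The following is a minimal, illustrative sketch of this bit-fixing idea (the function name and node numbering are made up for this example); the number of hops equals the number of differing bits, which is at most log&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt;p, the diameter.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
def hypercube_route(src, dst, dims):&lt;br /&gt;
    # Route by flipping, one at a time, each bit in which the current node&lt;br /&gt;
    # and the destination differ; neighboring nodes differ in exactly one bit.&lt;br /&gt;
    diff = src ^ dst&lt;br /&gt;
    cur = src&lt;br /&gt;
    path = [src]&lt;br /&gt;
    mask = 1&lt;br /&gt;
    for _ in range(dims):&lt;br /&gt;
        if diff % (mask * 2) // mask == 1:    # this bit differs, so flip it&lt;br /&gt;
            cur = cur ^ mask&lt;br /&gt;
            path.append(cur)&lt;br /&gt;
        mask = mask * 2&lt;br /&gt;
    return path&lt;br /&gt;
&lt;br /&gt;
# Route from node 0 (binary 000) to node 5 (binary 101) in a 3-D hypercube.&lt;br /&gt;
print(hypercube_route(0, 5, 3))   # [0, 1, 5]: two hops, flip bit 0, then bit 2&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;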
&lt;br /&gt;
&amp;lt;h2&amp;gt;2-D Torus&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_2Dtorus.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The 2-D torus takes the structure of the 2-D mesh and connects the nodes on the edges. This decreases the diameter, but the number of links is higher.  Thus it slightly improves upon the throughput of the 2-D mesh, but it also slightly increases the power dissipation relative to the 2-D mesh since the number of links is higher.  The routes connecting the end nodes together can have excessively high delays if the topology is not implemented correctly.  This topology was developed in 1985 because of the design constraints, such as pins and bisection, that the hypercube required. [[#References|&amp;lt;sup&amp;gt;[9]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Tree&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_tree.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The tree has a hierarchical structure, with processing nodes at the bottom and switching nodes at the upper levels. The tree experiences high traffic and high latency if a packet must travel through the upper levels.  Also, because of the high connectivity, this topology has high average energy dissipation. [[#References|&amp;lt;sup&amp;gt;[8]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
In a k-ary Tree interconnection network with &amp;lt;b&amp;gt;p&amp;lt;/b&amp;gt; nodes, the degree of the nodes is k+1 and the total number of links is k*(p-1). The bisection bandwidth of the network is 1 and the diameter of the network is 2*(log&amp;lt;sub&amp;gt;k&amp;lt;/sub&amp;gt;p).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Fat Tree&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_fat_tree.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
In 1985, an MIT professor (Charles Leiserson) invented the fat tree to improve upon the normal “skinny” tree.  The fat tree “fattens” up the links at the upper levels of the tree.  This helps alleviate the traffic at the upper levels and decreases the latency of a message.  However, fattening means that additional links are added in this area, which increases the average energy dissipated by this topology. [[#References|&amp;lt;sup&amp;gt;[11]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In an interconnection network with a k-ary Fat Tree implementation of &amp;lt;b&amp;gt;p&amp;lt;/b&amp;gt; nodes, the degree of the network would be &amp;lt;b&amp;gt;k+1&amp;lt;/b&amp;gt; and the total number of links would be k*(p-1). The diameter of the network would be 2*log&amp;lt;sub&amp;gt;k&amp;lt;/sub&amp;gt;p, and the bisection bandwidth of the network would be p/2.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Butterfly&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_butterfly.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Another attempt to improve upon the “skinny” tree structure is the butterfly.  The butterfly structure is similar to the tree structure, but it replicates the switching node structure of the tree topology and connects the copies together so that there are equal numbers of links and routers at all levels.  There are two problems with this topology.  First, there is no path diversity: there is only one path from a root node to a given downstream node.  This is not ideal in case the network is congested in one area but available in another; there is no way for the network to rebalance the load.  Second, there are some very long routes in this topology.  This requires repeaters in between the nodes, which causes the physical area needed to implement the network to increase dramatically. [[#References|&amp;lt;sup&amp;gt;[12]&amp;lt;/sup&amp;gt;]] &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
In a butterfly interconnection network topology with &amp;lt;b&amp;gt;p&amp;lt;/b&amp;gt; nodes, the degree of the network would be 4 and the total number of links in the network would be 2*p*(log&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt;p). The diameter of the network is log&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt;p. The bisection bandwidth of the network is p/2.&lt;br /&gt;
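&lt;br /&gt;
For quick comparison, the closed-form expressions quoted in the topology sections above can be collected into a few small Python functions of the node count p (and the arity k for the trees). This is only a convenience sketch that restates the formulas above; it assumes p is a perfect square for the mesh and a power of two for the hypercube and butterfly.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
import math&lt;br /&gt;
&lt;br /&gt;
def metrics(degree, links, diameter, bisection):&lt;br /&gt;
    # Keep the four metrics labeled so the comparisons stay readable.&lt;br /&gt;
    return {'degree': degree, 'links': links,&lt;br /&gt;
            'diameter': diameter, 'bisection': bisection}&lt;br /&gt;
&lt;br /&gt;
def linear_array(p):&lt;br /&gt;
    return metrics(2, p - 1, p - 1, 1)&lt;br /&gt;
&lt;br /&gt;
def ring(p):&lt;br /&gt;
    return metrics(2, p, p // 2, 2)&lt;br /&gt;
&lt;br /&gt;
def mesh_2d(p):&lt;br /&gt;
    s = math.isqrt(p)                      # p assumed to be a perfect square&lt;br /&gt;
    return metrics(4, 2 * s * (s - 1), 2 * (s - 1), s)&lt;br /&gt;
&lt;br /&gt;
def hypercube(p):&lt;br /&gt;
    d = int(math.log2(p))                  # p assumed to be a power of two&lt;br /&gt;
    return metrics(d, p * d // 2, d, p // 2)&lt;br /&gt;
&lt;br /&gt;
def k_ary_tree(p, k):&lt;br /&gt;
    return metrics(k + 1, k * (p - 1), 2 * round(math.log(p, k)), 1)&lt;br /&gt;
&lt;br /&gt;
def fat_tree(p, k):&lt;br /&gt;
    return metrics(k + 1, k * (p - 1), 2 * round(math.log(p, k)), p // 2)&lt;br /&gt;
&lt;br /&gt;
def butterfly(p):&lt;br /&gt;
    d = int(math.log2(p))&lt;br /&gt;
    return metrics(4, 2 * p * d, d, p // 2)&lt;br /&gt;
&lt;br /&gt;
print(hypercube(64))   # {'degree': 6, 'links': 192, 'diameter': 6, 'bisection': 32}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;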
&lt;br /&gt;
&amp;lt;h1&amp;gt;Comparison of Network Topologies &amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The following table shows the total number of ports required for each network topology. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_ports.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Number of ports for each topology''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
As the figure above shows, the 6-D hypercube requires the largest number of ports, due to its relatively complex six-dimensional structure. In contrast, the fat tree requires the fewest ports, even though its links have been &amp;quot;fattened&amp;quot; up by using redundant links. The butterfly network requires more than twice the number of ports as the fat tree, since it essentially replicates the switching layer of the fat tree. The number of ports for the mesh and tori structures increases as the dimensionality increases.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Below, the average path length, or average number of hops, and the average link load (GB/s) are shown.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_load.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Average path length and link load for each topology''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Looking at the trends, when the average path length is high, the average link load is also high. In other words, average path length and average link load are proportionally related. It is obvious from the graph that the 2-D mesh has, by far, the worst performance. In a large network such as this, the average path length is just too high, and the average link load suffers. For this type of high-performance network, the 2-D mesh does not scale well. The 2-D torus cuts the average path length and average link load in half by connecting the edge nodes together; however, its performance compared to other topologies is still relatively poor. The butterfly and fat tree have the lowest average path length and average link load. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The figure below shows the cost of the network topologies.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_cost.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Cost of each topology''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Despite using the fewest ports, the fat tree topology has by far the highest cost. Although it uses the fewest ports, they are high-bandwidth ports of 10 GB/s. Over 2400 ports of 10 GB/s are required to provide enough bandwidth at the upper levels of the tree. This pushes the cost up dramatically, and from a cost standpoint the fat tree is impractical. While the total cost of the fat tree is about 15 million dollars, the rest of the network topologies are clustered below 4 million dollars. When the dimensionality of the mesh and tori structures increases, the cost increases. The butterfly network's cost falls between the 2-D mesh/torus and the 6-D hypercube. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
When the cost and the average link load are factored together, the following graph is produced.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_overall.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Overall cost of each topology''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
From the figure above, the 6-D hypercube is the most cost-effective choice for this particular network setup. Although the 6-D hypercube costs more because it needs more links and ports, it provides higher bandwidth, which can offset the higher cost. The high-dimensional tori also perform well, but cannot provide as much bandwidth as the 6-D hypercube. For systems that do not need as much bandwidth, the high-dimensional tori are also a good choice. The butterfly topology is another alternative, but it has lower fault tolerance. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Real-World Implementation of Network Topologies &amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In a research study by Andy Hospodor and Ethan Miller, several network topologies were investigated for a high-performance, high-traffic network&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;, including the fat tree, butterfly, mesh, tori, and hypercube structures. Advantages and disadvantages in terms of cost, performance, and reliability were discussed. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In this experiment, a petabyte-scale network with over 100 GB/s total aggregate bandwidth was investigated. The network consisted of 4096 disks attached to large servers, with routers and switches in between&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The overall structure of the network is shown below. Note that this structure is very susceptible to failure and congestion.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_network.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Basic structure of Hospodor and Miller's experimental network''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Fat Tree&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In large-scale, high-performance applications, the fat tree can be a good choice. However, in order to &amp;quot;fatten&amp;quot; up the links, redundant connections must be used: instead of using one link between switching nodes, several must be used. The problem is that with more input and output links, routers with more input and output ports are needed. Routers with in excess of 100 ports are difficult to build and expensive, so multiple routers would have to be stacked together. Even so, the routers would be expensive, and several of them would be required&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The Japan Agency for Marine-Earth Science and Technology supercomputing system uses the fat tree topology. The system connects 1280 NEC processors&amp;lt;sup&amp;gt;7&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Butterfly&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In high-performance applications, the butterfly structure can be a good choice. The butterfly topology uses fewer links than other topologies; however, each link carries traffic from an entire layer. Fault tolerance is poor: there exists only a single path between a pair of nodes, so should a link break, data cannot be re-routed and communication is broken&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_butterfly.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Butterfly structure''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Meshes and Tori&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The mesh and torus structures used in this application would require a large number of links and an aggregate of several thousand ports. However, since there are so many links, the mesh and torus structures provide alternate paths in case of failures&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
One example of current use of the torus structure is the QPACE SFB TR Cluster in Germany, which uses PowerXCell 8i processors. The system uses a 3-D torus topology with 4608 processors&amp;lt;sup&amp;gt;7&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_mesh.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Mesh structure''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_torus.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Torus structure''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Hypercube&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Similar to the tori structures, the hypercube requires a larger number of links. However, its bandwidth scales better than that of the mesh and tori structures. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The CRAY T3E, CRAY XT3, and SGI Origin 2000 use k-ary n-cubed topologies.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_hypercube.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Hypercube structure''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Routing&amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The '''routing''' algorithm determines what path a packet of data will take from source to destination. Routing can be '''deterministic''', where the path is always the same for a given source and destination, or '''adaptive''', where the path can change. A routing algorithm can also be '''partially adaptive''', where packets have multiple path choices but not all packets are allowed to use the shortest path&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Deadlock&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Packets are in '''deadlock''' when they cannot continue to move through the nodes. The illustration below demonstrates this situation. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_deadlock.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Example of deadlock''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Assume that all of the buffers are full at each node. The packet from Node 1 cannot continue to Node 2, the packet from Node 2 cannot continue to Node 3, and so on. Since no packet can move, the network is deadlocked. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Deadlock occurs because of a cyclic pattern of routing; to avoid deadlock, circular routing patterns must be avoided.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To avoid circular routing patterns, certain turns are disallowed. These are called '''turn restrictions''': some turns are not allowed, so that a circular routing pattern can never form. Some common turn restrictions are described below.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;h2&amp;gt;Dimensional ordered (X-Y) routing&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Turns from the y-dimension to the x-dimension are not allowed; a packet is routed fully along the x-dimension first and then along the y-dimension (a small sketch follows).&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
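&lt;br /&gt;
A minimal sketch of dimension-ordered routing on a 2-D mesh, assuming nodes are addressed by (x, y) coordinates: the packet first corrects its x-coordinate, then its y-coordinate, so a y-to-x turn can never occur and no routing cycle can form.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
def xy_route(src, dst):&lt;br /&gt;
    # Dimension-ordered (X-Y) routing in a 2-D mesh: move along the x-dimension&lt;br /&gt;
    # until the x-coordinates match, then along y, so a turn from the&lt;br /&gt;
    # y-dimension back to the x-dimension never occurs.&lt;br /&gt;
    (x, y), (dx, dy) = src, dst&lt;br /&gt;
    path = [(x, y)]&lt;br /&gt;
    while x != dx:&lt;br /&gt;
        x += (dx - x) // abs(dx - x)      # one hop east or west&lt;br /&gt;
        path.append((x, y))&lt;br /&gt;
    while y != dy:&lt;br /&gt;
        y += (dy - y) // abs(dy - y)      # one hop north or south&lt;br /&gt;
        path.append((x, y))&lt;br /&gt;
    return path&lt;br /&gt;
&lt;br /&gt;
print(xy_route((0, 0), (2, 3)))&lt;br /&gt;
# [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (2, 3)]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;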
&lt;br /&gt;
&amp;lt;h2&amp;gt;West First&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Turns to the west are not allowed.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;North Last&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Turns after a north direction are not allowed. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Negative First&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Turns in the negative direction (-x or -y) are not allowed, except on the first turn.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Odd-Even Turn Model&amp;lt;/h2&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Unfortunately, the above turn-restriction models reduce the degree of adaptiveness and are only partially adaptive. They force some packets to take different, not necessarily minimal, routes. This may cause unfairness and reduces the ability of the system to relieve congestion. Overall performance can suffer&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Ge-Ming Chiu introduced the Odd-Even turn model as an adaptive, deadlock-free turn-restriction model that performs better than the previously mentioned models&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;. The model is designed primarily for 2-D meshes.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
''Turns from the east to north direction from any node on an even column are not allowed.''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Turns from the north to west direction from any node on an odd column are not allowed.''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Turns from the east to south direction from any node on an even column are not allowed.''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Turns from the south to west direction from any node on an odd column are not allowed.''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The illustration below shows the allowed routes for different source and destination nodes. Depending on which column a packet is in, only certain turns are allowed (a small sketch of the four prohibited turns follows the figure). &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_odd_even.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Odd-Even turn restriction model proposed by Ge-Ming Chiu''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
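&lt;br /&gt;
The four prohibited turns can be expressed as a small check. This is only an illustrative sketch: the direction labels and the convention that a node's column parity is taken from its x-coordinate are assumptions made for this example.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
def oe_turn_allowed(incoming, outgoing, column):&lt;br /&gt;
    # Odd-even turn model: returns False for the four prohibited turns.&lt;br /&gt;
    # incoming/outgoing are the directions of travel before and after the&lt;br /&gt;
    # turn; column is the x-coordinate of the node where the turn is made.&lt;br /&gt;
    even = (column % 2 == 0)&lt;br /&gt;
    if even and incoming == 'E' and outgoing in ('N', 'S'):&lt;br /&gt;
        return False          # east-to-north and east-to-south banned in even columns&lt;br /&gt;
    if not even and incoming in ('N', 'S') and outgoing == 'W':&lt;br /&gt;
        return False          # north-to-west and south-to-west banned in odd columns&lt;br /&gt;
    return True&lt;br /&gt;
&lt;br /&gt;
print(oe_turn_allowed('E', 'N', column=4))   # False: east-to-north in an even column&lt;br /&gt;
print(oe_turn_allowed('N', 'W', column=3))   # False: north-to-west in an odd column&lt;br /&gt;
print(oe_turn_allowed('E', 'N', column=5))   # True&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;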
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Comparison of Turn Restriction Models&amp;lt;/h1&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
To simulate the performance of the various turn-restriction models, Chiu simulated a 15 x 15 mesh under various traffic patterns. All channels have a bandwidth of 20 flits/usec and a buffer size of one flit. The dimension-ordered x-y routing, west-first, and negative-first models were compared against the odd-even model. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Traffic patterns including uniform, transpose, and hot spot were simulated. Uniform traffic has each node sending messages to every other node with equal probability. Transpose traffic simulates two opposite nodes sending messages to their respective halves of the mesh. Hot-spot traffic simulates a few &amp;quot;hot spot&amp;quot; nodes that receive high traffic.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_uniform.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Uniform traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The performance of the different routing algorithms is shown above for the uniform traffic. For uniform traffic, the dimension-ordered x-y model outperforms the rest of the models. As the number of messages increases, the x-y model has the &amp;quot;slowest&amp;quot; increase in average communication latency. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_transpose.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''First transpose traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The performance of the different routing algorithms is shown above for the first transpose traffic. The negative-first model has the best performance, while the odd-even model performs better than the west-first and x-y models.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_transpose2.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Second transpose traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
With the second transpose simulation, the odd-even model outperforms the rest.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_hotspot.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Hotspot traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The performance of the different routing algorithms is shown above for the hotspot traffic. Only one hotspot was simulated for this test. The odd-even model outperforms the other models when hotspot traffic is 10%.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_hotspot2.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Second hotspot traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
When the number of hotspots is increased to five, the odd-even model begins to shine: its latency is lowest for both 6 and 8 percent hotspot traffic. Meanwhile, the performance of the x-y model degrades badly. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
While the x-y model performs well in uniform traffic, it lacks adaptiveness. When traffic becomes hotspot, the x-y model suffers from the inability to adapt and re-route traffic to avoid the congestion caused by hotspots. The odd-even model has superior adaptiveness under high congestion. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Router Architecture&amp;lt;/h1&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The '''router''' is a device that routes incoming data to its destination. It does this by having several input ports and several output ports. Data incoming from one of the input ports is routed to one of the output ports. Which output port is chosen depends on the destination of the data and on the routing algorithm. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The internal architecture of a router consists of input and output ports and a '''crossbar switch'''. The crossbar switch connects each input port to the selected output port, acting essentially as a set of multiplexers (a toy sketch of this appears below). &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
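A toy model of this organization, with made-up port names and a dimension-ordered routing function, is sketched below; it only illustrates the idea that the crossbar steers each incoming packet to the output port chosen by the routing algorithm.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Router:&lt;br /&gt;
    # Toy model of a router: input ports feed a crossbar that acts like a&lt;br /&gt;
    # set of multiplexers, steering each incoming packet to one output port&lt;br /&gt;
    # chosen by the routing function.  Port names are made up for this sketch.&lt;br /&gt;
    def __init__(self, my_coord, route_fn):&lt;br /&gt;
        self.my_coord = my_coord          # (x, y) position of this router&lt;br /&gt;
        self.route_fn = route_fn          # e.g. dimension-ordered X-Y routing&lt;br /&gt;
        self.output_queues = {'N': [], 'S': [], 'E': [], 'W': [], 'LOCAL': []}&lt;br /&gt;
&lt;br /&gt;
    def crossbar(self, packet):&lt;br /&gt;
        # Select the output port for this packet and switch it there.&lt;br /&gt;
        port = self.route_fn(self.my_coord, packet['dest'])&lt;br /&gt;
        self.output_queues[port].append(packet)&lt;br /&gt;
        return port&lt;br /&gt;
&lt;br /&gt;
def xy_port(cur, dest):&lt;br /&gt;
    # Dimension-ordered port choice: correct x first, then y, else deliver locally.&lt;br /&gt;
    (x, y), (dx, dy) = cur, dest&lt;br /&gt;
    if dx != x:&lt;br /&gt;
        return 'E' if (dx - x) // abs(dx - x) == 1 else 'W'&lt;br /&gt;
    if dy != y:&lt;br /&gt;
        return 'N' if (dy - y) // abs(dy - y) == 1 else 'S'&lt;br /&gt;
    return 'LOCAL'&lt;br /&gt;
&lt;br /&gt;
r = Router((2, 2), xy_port)&lt;br /&gt;
print(r.crossbar({'dest': (4, 1)}))   # E: the packet heads east first&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;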
Router technology has improved significantly over the years. This has allowed networks with high dimensionality to become feasible. As shown in the real-world example above, high-dimensional tori and hypercubes are excellent choices of topology for high-performance networks. Improvements in high-performance, high-radix routers have contributed to the viability of these types of high-dimensionality networks. As the graph below shows, the bandwidth of routers has improved tremendously over a period of 10 years&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Router_bandwidth.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Bandwidth of various routers over 10 year period''&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Looking at the physical architecture and layout of routers, it is evident that the circuitry has become dramatically denser and more complex.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Router_physical.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Router hardware over period of time''&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Router_radix.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Radix and latency of routers over 10 year period''&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The '''radix''', or number of ports, of routers has also increased. Current routers not only have high radix but also low latency compared to the previous generation. As radix increases, the latency remains steady. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
With high-performance routers, complex topologies are possible. As the router technology improves, more complex, high-dimensionality topologies are possible. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Fault Tolerant Routing&amp;lt;/h1&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Fault-tolerant routing means the successful routing of messages between any pair of non-faulty nodes in the presence of faulty components&amp;lt;sup&amp;gt;6&amp;lt;/sup&amp;gt;. With the increased number of processors in a multiprocessor system and high data rates, reliable transmission of data in the event of a network fault is of great concern; hence, fault-tolerant routing algorithms are important.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Fault Models&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Faults in a network can be categorized in two types:&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
1. '''Transient Faults'''&amp;lt;sup&amp;gt;5&amp;lt;/sup&amp;gt;: A transient fault is a temporary fault that occurs for a very short duration of time. Such a fault can be caused by a change in the output of a flip-flop, leading to the generation of an invalid header. These faults can be minimized using error-controlled coding. They are generally evaluated in terms of bit error rate.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
2. '''Permanent Faults'''&amp;lt;sup&amp;gt;5&amp;lt;/sup&amp;gt;: A permanent fault is a fault that does not go away and causes permanent damage to the network. Such a fault could be due to damaged wires and associated circuitry. These faults are generally evaluated in terms of mean time between failures.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Fault Tolerance Mechanisms (for permanent faults)&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The permanent faults can be handled using one of the two mechanisms:&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
1. '''Static Mechanism''': In the static fault-tolerance model, once a fault is detected, all the processes running in the system are stopped and the routing tables are emptied. Based on the fault information, the routing tables are recalculated to provide fault-free paths.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
2. '''Dynamic Mechanisms''': In the dynamic fault-tolerance model, the operation of the processes in the network is not completely stalled; only the affected regions are handled. Some of the methods to do this are:&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
a. '''Block Faults''': In this method, many of the healthy nodes in the vicinity of the faulty nodes are also marked as faulty so that no routes are created close to the actual faulty nodes. The shape of the region can be convex or non-convex, and it must be ensured that none of the new routes introduce a cyclic dependency in the cyclic dependency graph (CDG).&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
DISADVANTAGE: This method causes a lot of healthy nodes to be declared faulty, leading to a reduction in system capacity.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Fault_pic1.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
b. '''Fault Rings''': This method was introduced by Chalasani and Boppana. A fault ring is the set of nodes and links that are adjacent to the faulty nodes/links. This approach reduces the number of healthy nodes that must be marked as faulty and blocked (a small sketch follows the figure below).&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Fault_pic2.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
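&lt;br /&gt;
A small, illustrative sketch of identifying a fault ring in a 2-D mesh is shown below; the mesh size, coordinates, and the inclusion of diagonal neighbours (so the ring closes around a rectangular fault block) are assumptions made for this example.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
def fault_ring(faulty, width, height):&lt;br /&gt;
    # Given a set of faulty node coordinates in a width-by-height 2-D mesh,&lt;br /&gt;
    # return the healthy nodes adjacent to any faulty node.  These nodes and&lt;br /&gt;
    # the links between them form the fault ring that routes are steered around.&lt;br /&gt;
    valid = {(i, j) for i in range(width) for j in range(height)}&lt;br /&gt;
    offsets = [(1, 0), (-1, 0), (0, 1), (0, -1),&lt;br /&gt;
               (1, 1), (1, -1), (-1, 1), (-1, -1)]&lt;br /&gt;
    ring = set()&lt;br /&gt;
    for (x, y) in faulty:&lt;br /&gt;
        for (ox, oy) in offsets:&lt;br /&gt;
            n = (x + ox, y + oy)&lt;br /&gt;
            if n in valid and n not in faulty:&lt;br /&gt;
                ring.add(n)&lt;br /&gt;
    return ring&lt;br /&gt;
&lt;br /&gt;
# A 2 x 1 block of faulty nodes in a 6 x 6 mesh.&lt;br /&gt;
print(sorted(fault_ring({(2, 2), (3, 2)}, 6, 6)))&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;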
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;References&amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
1 Solihin text&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
2 [http://www.ssrc.ucsc.edu/Papers/hospodor-mss04.pdf Interconnection Architectures for Petabyte-Scale High-Performance Storage Systems]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
3 [http://www.diit.unict.it/~vcatania/COURSES/semm_05-06/DOWNLOAD/noc_routing02.pdf The Odd-Even Turn Model for Adaptive Routing]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
4 [http://www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf Interconnection Topologies:(Historical Trends and Comparisons)]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
5 [http://dspace.upv.es/xmlui/bitstream/handle/10251/2603/tesisUPV2824.pdf?sequence=1 Efficient mechanisms to provide fault tolerance in interconnection networks for PC clusters, José Miguel Montañana Aliaga.]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
6 [http://web.ebscohost.com.www.lib.ncsu.edu:2048/ehost/pdfviewer/pdfviewer?vid=2&amp;amp;hid=15&amp;amp;sid=72e3828d-3cb1-42b9-8198-5c1e974ea53f@sessionmgr4 Adaptive Fault Tolerant Routing Algorithm for Tree-Hypercube Multicomputer, Qatawneh Mohammad]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
7 [http://www.top500.org TOP500 Supercomputing Sites]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
8 [http://cial.csie.ncku.edu.tw/presentation/group_pdf/%5BY2005%5DPerformance%20Evaluation%20and%20Design%20Trade-Offs%20For%20Network-on-Chip%20Interconnect%20Architectures.pdf Interconnection Network Topology Tradeoffs]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
9 [http://www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf Evolution of Interconnection Networks]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
10 [http://www2.engr.arizona.edu/~hpcat/papers/jlt94.pdf Hypercube pros and cons]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
11 [http://courses.csail.mit.edu/6.896/spring04/handouts/papers/fat_trees.pdf History of the Fat Tree Topology]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
12 [http://books.google.com/books?id=uyAg3zu_DYMC&amp;amp;pg=PA75&amp;amp;lpg=PA75&amp;amp;dq=butterfly+interconnection+network+topology&amp;amp;source=bl&amp;amp;ots=5tQm-TVTu0&amp;amp;sig=oKMwSlpqDJqopnYrKv0BbA8k5uU&amp;amp;hl=en&amp;amp;ei=5cmrTYezINDQiAL9uKjvDA&amp;amp;sa=X&amp;amp;oi=book_result&amp;amp;ct=result&amp;amp;resnum=8&amp;amp;ved=0CFMQ6AEwBw  Pros and Cons of the Butterfly Interconnection Network]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
NOTE:  This wiki is based on a [http://pg-server.csc.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Spring_2010/12_EC previous wiki]&lt;/div&gt;</summary>
		<author><name>Vrmanda</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch12_ar&amp;diff=45175</id>
		<title>CSC/ECE 506 Spring 2011/ch12 ar</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch12_ar&amp;diff=45175"/>
		<updated>2011-04-19T01:05:38Z</updated>

		<summary type="html">&lt;p&gt;Vrmanda: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;h1&amp;gt;Interconnection Network Architecture &amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In a mul&lt;br /&gt;
ti-processor system, processors need to communicate with each other and access each other's resources. In order to route data and messages between processors, an interconnection architecture is needed.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Typically, in a multiprocessor system, message passed between processors are frequent and short&amp;lt;sup&amp;gt;1&amp;lt;/sup&amp;gt;. Therefore, the interconnection network architecture must handle messages quickly by having '''low latency''', and must handle several messages at a time and have '''high bandwidth'''. The minimum amount of data that can be transmitted in one cycle is called a '''phit'''. A phit is typically determined by to the width of the link. However, data is transferred at the granularity of link-level flow control unit called a '''flit'''. A flit worth of data can be accepted or rejected at the receiver, depending on the flow control protocol and amount of buffering available at the receiver.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In a network, a processor along with its cache and memory is considered a '''node'''. The physical wires that connect between them is called a '''link'''. A link can be unidirectional, in which data can only be sent in one direction, or bidirectional in which data can be sent in both directions. A link, the receiver and sender, make up a channel. The device that routes messages between nodes is called a router. The shape of the network, such as the number of links and routers, is called the network '''topology'''.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Various metrics that are to be considered in coming up with a decision on the interconnection networks are the following factors:&lt;br /&gt;
&amp;lt;p&amp;gt;1.&amp;lt;b&amp;gt; Diameter &amp;lt;/b&amp;gt;: The maximum distance in the network is defined as the diameter. And to compute the average distance, a list of all pairs of nodes is to be made and the average of those would determine the value.&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;2. &amp;lt;b&amp;gt;Bisection Bandwidth &amp;lt;/b&amp;gt;: A network can be partitioned in any number of ways. When a network is divided in such a fashion that it has two equal partitions then the minimum number of links that need to be cut is defined as the Bisection Bandwidth. &amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;3. &amp;lt;b&amp;gt; No. of Links&amp;lt;/b&amp;gt; : Number of links in a network is the set of wires that connect two different nodes in the network. &amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;4.&amp;lt;b&amp;gt; Degree&amp;lt;/b&amp;gt; : Number of input/output links connecting to each router is defined as the degree of the network.&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Types of Network Topologies &amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Linear Array&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_linear.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The nodes are connected linearly as in an array. This type of topology is simple and low cost. However, it has several shortcoming.  First of all, it does not scale well. The longest distance between two nodes, or the '''diameter''', is equivalent to the number of nodes.  In addition to not scaling well, this topology can also result in high congestion.  [[#References|&amp;lt;sup&amp;gt;[8]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In a linear array interconnection network with '''p''' nodes, the total number of &amp;lt;b&amp;gt;links&amp;lt;/b&amp;gt; would be &amp;lt;b&amp;gt; p-1 &amp;lt;/b&amp;gt; and the &amp;lt;b&amp;gt;degree&amp;lt;/b&amp;gt; of the network would be &amp;lt;b&amp;gt;2 &amp;lt;/b&amp;gt;. The total link bandwidth is &amp;lt;b&amp;gt; p-1 &amp;lt;/b&amp;gt; times the link bandwidth, but bisection bandwidth is equal to one link bandwidth. Since global communication must always travel through one link, bisection bandwidth summarizes the bandwidth characteristic of the network better than the aggregate bandwidth.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Ring&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_ring.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Similar structure as the linear array, except, the ending nodes connect to each other, establishing a circular structure. This topology will scale better since the longest distance between two nodes is cut in half, but it eventually ends up scaling poorly if enough nodes are added.  The congestion will also be cut in half since there is now 2 options for packets to traverse.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
In a Ring interconnection network with &amp;lt;b&amp;gt;p&amp;lt;/b&amp;gt; nodes, the total number of links in the network would be and the degree of the interconnection network is &amp;lt;b&amp;gt;2&amp;lt;/b&amp;gt;. The maximum distance between nodes in the network would be the distance between nodes that are half way through the network, hence the diameter would be &amp;lt;b&amp;gt; p/2 &amp;lt;/b&amp;gt;. The bisection bandwidth of the network of the interconnection would thus become 2.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;2-D Mesh&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_2Dmesh.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The 2-D mesh can be thought of as several linear arrays put together to form a 2-dimensional structure. Nodes that are not on the edge have 4 input or output links, or a '''degree''' of 4.  This topology has reasonably low energy dissipation without compromising throughput.  The area overhead for this topology is also rather low since a node is never farther than a clock cycle away, this means there is no need for repeater insertion in between the nodes.  (the repeaters have a high area overhead, so using them causes the area of the topology to increase) [[#References|&amp;lt;sup&amp;gt;[8]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
In a 2-D Mesh interconnection network with &amp;lt;b&amp;gt; p&amp;lt;/b&amp;gt; nodes, the degree of the network would be 4 and the total number of links would be &amp;lt;b&amp;gt; 2*sqrt(p)(sqrt(p) -1) &amp;lt;/b&amp;gt;. The diameter of the network would be 2*(sqrt(p) - 1). The bisection bandwidth of the interconnection is the total number of links divided by the diameter of the network, which would result to be sqrt(p).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Cube&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_cube.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The cube can be thought of as a three-dimensional mesh. This topology continues to improve upon the throughput in the 2-D mesh, however the power dissipation will be slightly higher.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Hypercube&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_hypercube.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The hypercube is essentially multiple cubes put together.  This topology was developed in 1983.  Originally there were no routers.  The next hops were just programmed into each node.  Today the hypercube topology is used by many companies including Intel.  It is so attractive because of its small diameter.  The nodes are numbered in such a way that every neighboring node is only one bit difference.  This greatly increases the ability to route messages through the network. The biggest drawback of the topology is the lack of scalability.  For example if the dimension size is increased by one, one link will need to be added to every node in the network. [[#References|&amp;lt;sup&amp;gt;[10]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
In an interconnection network of &amp;lt;b&amp;gt;p&amp;lt;/b&amp;gt; nodes with Hypercube topology, the degree of the network would be log&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt;p and the diameter of the network would also be log&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt;p. The total number of links in the network would be (p/2)*log&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt;p and the bisection bandwidth of the network would be p/2.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;2-D Torus&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_2Dtorus.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The 2-D torus takes the structure of the 2-D mesh and connects the nodes on the edges. This decreases the diameter, but the number of links is higher.  Thus it slightly improves upon the throughput of the 2-D Mesh, but it also slightly increases the power dissipation relative to the 2-D Mesh since the number of links is higher.  The delays on the routes connecting the end nodes together can have an excessively high delays if the topology is not implemented correctly.(printout)  This topology was developed in 1985, because of the design constraints, such as pins and bisection, that the hypercube required. [[#References|&amp;lt;sup&amp;gt;[9]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Tree&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_tree.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The tree has a hierarchical structure of nodes on the bottom and switching nodes at the upper levels. The tree experiences high traffic and high latency if the package must travel through the upper levels.  Also because of the high connectivity, this topology has high average energy dissipation.[[#References|&amp;lt;sup&amp;gt;[8]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Fat Tree&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_fat_tree.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
In 1985, an MIT professor invented the fat tree to improve upon the normal “skinny” tree.  To improve upon the “skinny” tree topology, the fat tree “fattens” up the links at the upper levels.  This helps to alleviate the traffic at upper levels and to decrease the latency of the message.  However, by fattening, it is meant that additional links are added to this area, which will increase the average energy dissipated by this topology. [[#References|&amp;lt;sup&amp;gt;[11]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In an interconnection network with k-ary Fat Tree implementation of &amp;lt;b&amp;gt;p&amp;lt;/b&amp;gt; nodes, the degree of the network would be &amp;lt;b&amp;gt;'k+1'&amp;lt;/b&amp;gt; and the total number of links would be k*(p-1). The diameter of the network would be 2*log&amp;lt;sub&amp;gt;k&amp;lt;/sub&amp;gt;p, the bisection bandwidth of the network would be p/2.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Butterfly&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_butterfly.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Another attempt to increase the “skinny” tree structure was the butterfly structure.  The butterfly structure is similar to the tree structure, but it replicates the switching node structure of the tree topology and connects them together so that there are equal links and routers at all levels.  There are 2 problems with this topology.  First of all, there is no path diversity in this topology.  There is only one path from the root to a downstream node.  This is not ideal incase the network is congested in a certain area, but available in another.  There is no way for the network to rebalance the work.  Second of all, there are some very long routes in this topology.  This requires there to be repeaters in between the nodes, which will cause the physical area to implement the network to increase dramatically. [[#References|&amp;lt;sup&amp;gt;[12]&amp;lt;/sup&amp;gt;]] &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
In a butterfly interconnection network topology with &amp;lt;b&amp;gt; p &amp;lt;/b&amp;gt;nodes, the degree of the network would be 4 and the total number of links in the network would be 2*p(log&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt;p). The diameter of the network is log&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt;p. The bisection bandwidth of the network is p/2.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Comparison of Network Topologies &amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The following table shows the total number of ports required for each network topology. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_ports.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Number of ports for each topology''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
As the figure above shows, the 6-D hypercube requires the largest number of ports, due to its relatively complex six-dimensional structure. In contrast, the fat tree requires the fewest ports, even though its links have been &amp;quot;fattened&amp;quot; up by using redundant links. The butterfly network requires more than twice as many ports as the fat tree, since it essentially replicates the switching layer of the fat tree. The number of ports for the mesh and torus structures increases as the dimensionality increases.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Below, the average path length, or average number of hops, and the average link load (GB/s) are shown.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_load.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Average path length and link load for each topology''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Looking at the trends, when the average path length is high, the average link load is also high; in other words, average path length and average link load are proportionally related. It is clear from the graph that the 2-D mesh has, by far, the worst performance. In a large network such as this, the average path length is simply too high, and the average link load suffers. For this type of high-performance network, the 2-D mesh does not scale well. The 2-D torus cuts the average path length and average link load in half by connecting the edge nodes together; however, its performance compared to the other types is still relatively poor. The butterfly and fat tree have the lowest average path length and average link load. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
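This proportionality follows from a simple accounting argument: every hop a message takes occupies one link, so for a fixed injection rate the load per link grows with the average hop count. The sketch below illustrates the relation; the numbers are illustrative assumptions, not figures from the study.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Rough sketch: with n_nodes each injecting inject_gbps of traffic, every message&lt;br /&gt;
# occupies one link per hop, so the aggregate link traffic is&lt;br /&gt;
# n_nodes * inject_gbps * avg_hops, spread over 'links' links.&lt;br /&gt;
def avg_link_load(n_nodes, inject_gbps, avg_hops, links):&lt;br /&gt;
    return n_nodes * inject_gbps * avg_hops / links&lt;br /&gt;
&lt;br /&gt;
# Halving the average hop count halves the per-link load (illustrative values).&lt;br /&gt;
print(avg_link_load(4096, 0.025, 32, 8192))&lt;br /&gt;
print(avg_link_load(4096, 0.025, 16, 8192))&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;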
The figure below shows the cost of the network topologies.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_cost.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Cost of each topology''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Despite using the fewest ports, the fat tree topology has by far the highest cost. Although it uses the fewest ports, those ports are high-bandwidth 10 GB/s ports. Over 2400 ports of 10 GB/s are required to have enough bandwidth at the upper levels of the tree. This pushes the cost up dramatically, and from a cost standpoint it is impractical. While the total cost of the fat tree is about 15 million dollars, the rest of the network topologies are clustered below 4 million dollars. As the dimensionality of the mesh and torus structures increases, the cost increases. The butterfly network's cost falls between the 2-D mesh/torus and the 6-D hypercube. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
When the cost and average link load are factored together, the following graph is produced.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_overall.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Overall cost of each topology''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
From the figure above, the 6-D hypercube is the most cost-effective choice for this particular network setup. Although the 6-D hypercube costs more because it needs more links and ports, it provides higher bandwidth, which can offset the higher cost. The high-dimensional tori also perform well, but cannot provide as much bandwidth as the 6-D hypercube. For systems that do not need as much bandwidth, the high-dimensional tori are also a good choice. The butterfly topology is another alternative, but has lower fault tolerance. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Real-World Implementation of Network Topologies &amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In a research study by Andy Hospodor and Ethan Miller, several network topologies were investigated in a high-performance, high-traffic network&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;. The topologies investigated included the fat tree, butterfly, mesh, torus, and hypercube structures, and their advantages and disadvantages in terms of cost, performance, and reliability were discussed. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In this experiment, a petabyte-scale network with over 100 GB/s of total aggregate bandwidth was investigated. The network consisted of 4096 disks attached to large servers, with routers and switches in between&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The overall structure of the network is shown below. Note that this structure is very susceptible to failure and congestion.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_network.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Basic structure of Hospodor and Miller's experimental network''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Fat Tree&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In large-scale, high-performance applications, the fat tree can be a good choice. However, in order to &amp;quot;fatten&amp;quot; up the links, redundant connections must be used: instead of one link between switching nodes, several must be used. The problem is that with more input and output links, one needs routers with more input and output ports. Routers with in excess of 100 ports are difficult to build and expensive, so multiple routers would have to be stacked together. Even so, the routers would be expensive, and several of them would be required&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The Japan Agency for Marine-Earth Science and Technology supercomputing system uses the fat tree topology. The system connects 1280 NEC processors&amp;lt;sup&amp;gt;7&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Butterfly&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In high-performance applications, the butterfly structure is a good choice. The butterfly topology uses fewer links than other topologies; however, each link carries traffic from the entire layer. Fault tolerance is poor, since only a single path exists between any pair of nodes. Should a link break, data cannot be re-routed and communication is broken&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_butterfly.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Butterfly structure''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Meshes and Tori&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The mesh and torus structures used in this application would require a large number of links and a total of several thousand ports. However, since there are so many links, the mesh and torus structures provide alternate paths in case of failures&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
One example of current use of the torus structure is the QPACE SFB TR cluster in Germany, which uses PowerXCell 8i processors. The system uses a 3-D torus topology with 4608 processors&amp;lt;sup&amp;gt;7&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_mesh.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Mesh structure''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_torus.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Torus structure''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Hypercube&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Similar to the torus structures, the hypercube requires a larger number of links. However, its bandwidth scales better than the mesh and torus structures. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The CRAY T3E, CRAY XT3, and SGI Origin 2000 use k-ary n-cube topologies.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_hypercube.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Hypercube structure''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Routing&amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The '''routing''' algorithm determines what path a packet of data will take from source to destination. Routing can be '''deterministic''', where the path is always the same for a given source and destination, or '''adaptive''', where the path can change. The routing algorithm can also be '''partially adaptive''', where packets have multiple path choices but not all packets are allowed to use the shortest path&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Deadlock&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Packets are in '''deadlock''' when they cannot continue to move through the nodes. The illustration below demonstrates this event. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_deadlock.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Example of deadlock''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Assume that all of the buffers are full at each node. The packet from Node 1 cannot continue to Node 2, the packet from Node 2 cannot continue to Node 3, and so on. Since no packet can move, the network is deadlocked. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Deadlock occurs because of a cyclic pattern of routing; to avoid deadlock, circular routing patterns must be avoided.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To avoid circular routing patterns, certain turns are disallowed. These '''turn restrictions''' prevent packets from completing a cycle. Some common turn restrictions are described below.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;h2&amp;gt;Dimensional ordered (X-Y) routing&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Turns from the y-dimension to the x-dimension are not allowed.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
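&amp;lt;p&amp;gt;&lt;br /&gt;
As an illustration, a minimal dimension-ordered routing function for a 2-D mesh could look like the sketch below; because the x coordinate is corrected completely before the y coordinate, a turn from the y dimension back into the x dimension can never occur. The function and coordinate convention are our own illustration, not any particular router's code.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Sketch of dimension-ordered (X-Y) routing on a 2-D mesh: finish all x hops&lt;br /&gt;
# before any y hop, so no y-to-x turn (and hence no routing cycle) can form.&lt;br /&gt;
def step_toward(cur, target):&lt;br /&gt;
    diff = target - cur&lt;br /&gt;
    return 0 if diff == 0 else diff // abs(diff)   # -1, 0, or +1&lt;br /&gt;
&lt;br /&gt;
def xy_route(src, dst):&lt;br /&gt;
    (x, y), (dx, dy) = src, dst&lt;br /&gt;
    hops = []&lt;br /&gt;
    while x != dx:                  # correct the x coordinate first&lt;br /&gt;
        x += step_toward(x, dx)&lt;br /&gt;
        hops.append((x, y))&lt;br /&gt;
    while y != dy:                  # only then correct the y coordinate&lt;br /&gt;
        y += step_toward(y, dy)&lt;br /&gt;
        hops.append((x, y))&lt;br /&gt;
    return hops&lt;br /&gt;
&lt;br /&gt;
print(xy_route((0, 0), (2, 3)))     # (1,0), (2,0), (2,1), (2,2), (2,3)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;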
&lt;br /&gt;
&amp;lt;h2&amp;gt;West First&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Turns to the west are not allowed.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;North Last&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Turns after a north direction are not allowed. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Negative First&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Turns into a negative direction (-x or -y) are not allowed; a packet must make all of its negative-direction hops first, before traveling in a positive direction.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
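&amp;lt;p&amp;gt;&lt;br /&gt;
To make the three restrictions above concrete, the sketch below checks whether a given turn (previous direction, next direction) is allowed under each model. The encoding of the rules is our paraphrase of the descriptions above, not code from any router implementation.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Sketch of the turn restrictions above, using compass directions 'N', 'S', 'E', 'W'.&lt;br /&gt;
def turn_allowed(model, prev, nxt):&lt;br /&gt;
    if model == 'west-first':&lt;br /&gt;
        # all westward hops must come first, so turning into W is forbidden&lt;br /&gt;
        return not (nxt == 'W' and prev != 'W')&lt;br /&gt;
    if model == 'north-last':&lt;br /&gt;
        # once heading north, no further turns are allowed&lt;br /&gt;
        return not (prev == 'N' and nxt != 'N')&lt;br /&gt;
    if model == 'negative-first':&lt;br /&gt;
        # negative-direction hops (W, S) must come first, so turning from a&lt;br /&gt;
        # positive direction (E, N) into a negative one is forbidden&lt;br /&gt;
        return not (prev in ('E', 'N') and nxt in ('W', 'S'))&lt;br /&gt;
    raise ValueError('unknown model')&lt;br /&gt;
&lt;br /&gt;
print(turn_allowed('west-first', 'N', 'W'))    # False: turning west is forbidden&lt;br /&gt;
print(turn_allowed('north-last', 'N', 'E'))    # False: no turn after heading north&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;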
&lt;br /&gt;
&amp;lt;h2&amp;gt;Odd-Even Turn Model&amp;lt;/h2&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Unfortunately, the above turn-restriction models reduce the degree of adaptiveness and are only partially adaptive. They force some packets to take different, not necessarily minimal, routes. This may cause unfairness and reduces the ability of the system to relieve congestion, so overall performance can suffer&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Ge-Ming Chiu introduced the Odd-Even turn model as a deadlock-free turn-restriction model with better adaptiveness and performance than the previously mentioned models&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;. The model is designed primarily for 2-D meshes.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
''Turns from the east to north direction from any node on an even column are not allowed.''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Turns from the north to west direction from any node on an odd column are not allowed.''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Turns from the east to south direction from any node on an even column are not allowed.''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Turns from the south to west direction from any node on an odd column are not allowed.''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The illustration below shows allowed routing for different source and destination nodes. Depending on which column the packet is in, only certain directions are allowed. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_odd_even.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Odd-Even turn restriction model proposed by Ge-Ming Chiu''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
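&amp;lt;p&amp;gt;&lt;br /&gt;
A minimal sketch of the four odd-even rules is shown below, assuming columns are numbered from zero so that an even column means an even x coordinate (that numbering is our assumption).&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Sketch of the four odd-even restrictions: a turn is a (previous direction,&lt;br /&gt;
# next direction) pair made at a node whose column is its x coordinate.&lt;br /&gt;
def odd_even_turn_allowed(prev, nxt, column):&lt;br /&gt;
    even = (column % 2 == 0)&lt;br /&gt;
    if even and prev == 'E' and nxt in ('N', 'S'):&lt;br /&gt;
        return False    # no east-to-north or east-to-south turn in an even column&lt;br /&gt;
    if (not even) and nxt == 'W' and prev in ('N', 'S'):&lt;br /&gt;
        return False    # no north-to-west or south-to-west turn in an odd column&lt;br /&gt;
    return True&lt;br /&gt;
&lt;br /&gt;
print(odd_even_turn_allowed('E', 'N', 4))   # False: even column&lt;br /&gt;
print(odd_even_turn_allowed('N', 'W', 3))   # False: odd column&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;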
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Comparison of Turn Restriction Models&amp;lt;/h1&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
To compare the performance of the various turn restriction models, Chiu simulated a 15 x 15 mesh under various traffic patterns. All channels have a bandwidth of 20 flits/usec and a buffer size of one flit. The dimension-ordered x-y, west-first, and negative-first models were compared against the odd-even model. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Uniform, transpose, and hot-spot traffic patterns were simulated. Under uniform traffic, each node sends messages to every other node with equal probability. Under transpose traffic, each node sends messages to the node at its mirrored (transposed) position in the mesh. Hot-spot traffic designates a few &amp;quot;hot spot&amp;quot; nodes that receive a disproportionately high share of the traffic.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
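&amp;lt;p&amp;gt;&lt;br /&gt;
The sketch below shows how destinations could be chosen for the uniform and transpose patterns on a 15 x 15 mesh. The exact definitions are assumptions based on the descriptions above, not Chiu's simulator.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Sketch of destination selection for two of the traffic patterns described above.&lt;br /&gt;
import random&lt;br /&gt;
&lt;br /&gt;
SIZE = 15   # 15 x 15 mesh, as in Chiu's simulation&lt;br /&gt;
&lt;br /&gt;
def uniform_dest(src):&lt;br /&gt;
    # any node other than the source, chosen with equal probability&lt;br /&gt;
    while True:&lt;br /&gt;
        dst = (random.randrange(SIZE), random.randrange(SIZE))&lt;br /&gt;
        if dst != src:&lt;br /&gt;
            return dst&lt;br /&gt;
&lt;br /&gt;
def transpose_dest(src):&lt;br /&gt;
    # node (x, y) sends to its mirrored position (y, x)&lt;br /&gt;
    x, y = src&lt;br /&gt;
    return (y, x)&lt;br /&gt;
&lt;br /&gt;
print(uniform_dest((3, 7)), transpose_dest((3, 7)))&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;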
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_uniform.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Uniform traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The performance of the different routing algorithms under uniform traffic is shown above. For uniform traffic, the dimension-ordered x-y model outperforms the rest of the models: as the number of messages increases, the x-y model has the &amp;quot;slowest&amp;quot; increase in average communication latency. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_transpose.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''First transpose traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The performance of the different routing algorithms under the first transpose traffic pattern is shown above. The negative-first model has the best performance, while the odd-even model performs better than the west-first and x-y models.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_transpose2.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Second transpose traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
With the second transpose simulation, the odd-even model outperforms the rest.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_hotspot.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Hotspot traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The performance of the different routing algorithms under hotspot traffic is shown above. Only one hotspot was simulated for this test. The odd-even model outperforms the other models when the hotspot traffic is 10%.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_hotspot2.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Second hotspot traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
When the number of hotspots is increased to five, the odd-even model begins to shine: its latency is the lowest at both 6 and 8 percent hotspot traffic, while the x-y model performs by far the worst. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
While the x-y model performs well under uniform traffic, it lacks adaptiveness. When traffic becomes concentrated in hotspots, the x-y model suffers from its inability to adapt and re-route traffic around the congestion. The odd-even model has superior adaptiveness under high congestion. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Router Architecture&amp;lt;/h1&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The '''router''' is a device that routes incoming data to its destination. It has several input ports and several output ports; data arriving on one of the input ports is routed to one of the output ports. Which output port is chosen depends on the destination of the data and on the routing algorithm. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The internal architecture of a router consists of input and output ports and a '''crossbar switch'''. The crossbar switch connects an input port to the selected output port, acting essentially as a multiplexer. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Router technology has improved significantly over the years, which has allowed networks with high dimensionality to become feasible. As shown in the real-world example above, high-dimensional tori and hypercubes are excellent choices of topology for high-performance networks. The falling cost of high-performance, high-radix routers has contributed to the viability of these high-dimensionality networks. As the graph below shows, the bandwidth of routers has improved tremendously over a period of 10 years&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
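&amp;lt;p&amp;gt;&lt;br /&gt;
Conceptually, the crossbar can be pictured as a per-cycle mapping from input ports to output ports. The class below is an illustration of that idea under our own naming, not a model of any particular router.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Minimal crossbar sketch: each input port is connected to at most one output&lt;br /&gt;
# port per cycle, and flits are simply forwarded along that connection.&lt;br /&gt;
class Crossbar:&lt;br /&gt;
    def __init__(self, ports):&lt;br /&gt;
        self.ports = ports&lt;br /&gt;
        self.connections = {}           # input port number to output port number&lt;br /&gt;
&lt;br /&gt;
    def connect(self, in_port, out_port):&lt;br /&gt;
        # the routing algorithm decides out_port before this is called&lt;br /&gt;
        if out_port in self.connections.values():&lt;br /&gt;
            raise ValueError('output port already in use this cycle')&lt;br /&gt;
        self.connections[in_port] = out_port&lt;br /&gt;
&lt;br /&gt;
    def forward(self, in_port, flit):&lt;br /&gt;
        return (self.connections[in_port], flit)&lt;br /&gt;
&lt;br /&gt;
xb = Crossbar(4)&lt;br /&gt;
xb.connect(0, 2)&lt;br /&gt;
print(xb.forward(0, 'flit-A'))          # the flit leaves on output port 2&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;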
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Router_bandwidth.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Bandwidth of various routers over 10 year period''&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Looking at the physical architecture and layout of routers, it is evident that the circuitry has become dramatically denser and more complex.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Router_physical.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Router hardware over period of time''&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Router_radix.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Radix and latency of routers over 10 year period''&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The '''radix''', or number of ports, of routers has also increased. Current routers not only have a high radix but also lower latency than the previous generation, and as the radix increases, the latency remains steady. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
With high-performance routers, complex topologies become practical; as router technology continues to improve, ever more complex, high-dimensionality topologies become possible. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Fault Tolerant Routing&amp;lt;/h1&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Fault-tolerant routing means the successful routing of messages between any pair of non-faulty nodes in the presence of faulty components&amp;lt;sup&amp;gt;6&amp;lt;/sup&amp;gt;. With the increasing number of processors in multiprocessor systems and ever higher data rates, reliable transmission of data in the event of a network fault is a major concern, and hence fault-tolerant routing algorithms are important.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Fault Models&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Faults in a network can be categorized into two types:&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
1. '''Transient Faults'''&amp;lt;sup&amp;gt;5&amp;lt;/sup&amp;gt;: A transient fault is a temporary fault that occurs for a very short duration of time. Such a fault can be caused, for example, by a change in the output of a flip-flop that leads to the generation of an invalid header. These faults can be minimized using error-control coding, and they are generally evaluated in terms of the bit error rate.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
2. '''Permanent Faults'''&amp;lt;sup&amp;gt;5&amp;lt;/sup&amp;gt;: A permanent fault is a fault that does not go away and causes permanent damage to the network, for example due to damaged wires and associated circuitry. These faults are generally evaluated in terms of the mean time between failures.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Fault Tolerance Mechanisms (for permanent faults)&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Permanent faults can be handled using one of two mechanisms:&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
1. '''Static Mechanism''': In the static fault-tolerance model, once a fault is detected, all processes running in the system are stopped and the routing tables are emptied. Based on the information about the faults, the routing tables are then recalculated to provide fault-free paths.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
2. '''Dynamic Mechanisms''': In the dynamic fault-tolerance model, the operation of the processes in the network is not completely stalled; only the affected regions are repaired. Some of the methods used to do this are:&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
a. '''Block Faults''': In this method, many of the healthy nodes in the vicinity of the faulty nodes are also marked as faulty so that no routes are created close to the actual faulty nodes. The fault region can be convex or non-convex, and it must be ensured that none of the new routes introduce a cycle into the channel dependency graph (CDG).&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
DISADVANTAGE: This method causes many healthy nodes to be declared faulty, reducing system capacity.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Fault_pic1.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
b. '''Fault Rings''': This method was introduced by Chalasani and Boppana. A fault ring is the set of nodes and links adjacent to the faulty nodes/links. This approach reduces the number of healthy nodes that must be marked as faulty and blocked.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Fault_pic2.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
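&amp;lt;p&amp;gt;&lt;br /&gt;
The fault-ring idea can be sketched as computing the set of healthy nodes adjacent (including diagonally) to any faulty node in a 2-D mesh. This is an illustration of the concept only, not Chalasani and Boppana's algorithm.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Sketch: the fault ring is the set of healthy nodes surrounding the faulty ones.&lt;br /&gt;
def fault_ring(faulty, width, height):&lt;br /&gt;
    ring = set()&lt;br /&gt;
    for (fx, fy) in faulty:&lt;br /&gt;
        for dx in (-1, 0, 1):&lt;br /&gt;
            for dy in (-1, 0, 1):&lt;br /&gt;
                n = (fx + dx, fy + dy)&lt;br /&gt;
                # keep neighbours that are healthy and inside the mesh&lt;br /&gt;
                if n not in faulty and n[0] in range(width) and n[1] in range(height):&lt;br /&gt;
                    ring.add(n)&lt;br /&gt;
    return ring&lt;br /&gt;
&lt;br /&gt;
print(sorted(fault_ring({(2, 2)}, 5, 5)))   # the eight nodes around (2, 2)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;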
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;References&amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
1 Solihin text&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
2 [http://www.ssrc.ucsc.edu/Papers/hospodor-mss04.pdf Interconnection Architectures for Petabyte-Scale High-Performance Storage Systems]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
3 [http://www.diit.unict.it/~vcatania/COURSES/semm_05-06/DOWNLOAD/noc_routing02.pdf The Odd-Even Turn Model for Adaptive Routing]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
4 [http://www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf Interconnection Topologies:(Historical Trends and Comparisons)]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
5 [http://dspace.upv.es/xmlui/bitstream/handle/10251/2603/tesisUPV2824.pdf?sequence=1 Efficient mechanisms to provide fault tolerance in interconnection networks for PC clusters, José Miguel Montañana Aliaga.]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
6 [http://web.ebscohost.com.www.lib.ncsu.edu:2048/ehost/pdfviewer/pdfviewer?vid=2&amp;amp;hid=15&amp;amp;sid=72e3828d-3cb1-42b9-8198-5c1e974ea53f@sessionmgr4 Adaptive Fault Tolerant Routing Algorithm for Tree-Hypercube Multicomputer, Qatawneh Mohammad]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
7 [http://www.top500.org TOP500 Supercomputing Sites]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
8 [http://cial.csie.ncku.edu.tw/presentation/group_pdf/%5BY2005%5DPerformance%20Evaluation%20and%20Design%20Trade-Offs%20For%20Network-on-Chip%20Interconnect%20Architectures.pdf Interconnection Network Topology Tradeoffs]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
9 [http://www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf Evolution of Interconnection Networks]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
10 [http://www2.engr.arizona.edu/~hpcat/papers/jlt94.pdf Hypercube pros and cons]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
11 [http://courses.csail.mit.edu/6.896/spring04/handouts/papers/fat_trees.pdf History of the Fat Tree Topology]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
12 [http://books.google.com/books?id=uyAg3zu_DYMC&amp;amp;pg=PA75&amp;amp;lpg=PA75&amp;amp;dq=butterfly+interconnection+network+topology&amp;amp;source=bl&amp;amp;ots=5tQm-TVTu0&amp;amp;sig=oKMwSlpqDJqopnYrKv0BbA8k5uU&amp;amp;hl=en&amp;amp;ei=5cmrTYezINDQiAL9uKjvDA&amp;amp;sa=X&amp;amp;oi=book_result&amp;amp;ct=result&amp;amp;resnum=8&amp;amp;ved=0CFMQ6AEwBw  Pros and Cons of the Butterfly Interconnection Network]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
NOTE: This wiki is based on a [http://pg-server.csc.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Spring_2010/12_EC previous wiki]&lt;/div&gt;</summary>
		<author><name>Vrmanda</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch12_ar&amp;diff=45172</id>
		<title>CSC/ECE 506 Spring 2011/ch12 ar</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch12_ar&amp;diff=45172"/>
		<updated>2011-04-19T01:01:37Z</updated>

		<summary type="html">&lt;p&gt;Vrmanda: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;h1&amp;gt;Interconnection Network Architecture &amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In a mul&lt;br /&gt;
ti-processor system, processors need to communicate with each other and access each other's resources. In order to route data and messages between processors, an interconnection architecture is needed.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Typically, in a multiprocessor system, message passed between processors are frequent and short&amp;lt;sup&amp;gt;1&amp;lt;/sup&amp;gt;. Therefore, the interconnection network architecture must handle messages quickly by having '''low latency''', and must handle several messages at a time and have '''high bandwidth'''. The minimum amount of data that can be transmitted in one cycle is called a '''phit'''. A phit is typically determined by to the width of the link. However, data is transferred at the granularity of link-level flow control unit called a '''flit'''. A flit worth of data can be accepted or rejected at the receiver, depending on the flow control protocol and amount of buffering available at the receiver.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In a network, a processor along with its cache and memory is considered a '''node'''. The physical wires that connect between them is called a '''link'''. A link can be unidirectional, in which data can only be sent in one direction, or bidirectional in which data can be sent in both directions. A link, the receiver and sender, make up a channel. The device that routes messages between nodes is called a router. The shape of the network, such as the number of links and routers, is called the network '''topology'''.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Various metrics that are to be considered in coming up with a decision on the interconnection networks are the following factors:&lt;br /&gt;
&amp;lt;p&amp;gt;1.&amp;lt;b&amp;gt; Diameter &amp;lt;/b&amp;gt;: The maximum distance in the network is defined as the diameter. And to compute the average distance, a list of all pairs of nodes is to be made and the average of those would determine the value.&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;2. &amp;lt;b&amp;gt;Bisection Bandwidth &amp;lt;/b&amp;gt;: A network can be partitioned in any number of ways. When a network is divided in such a fashion that it has two equal partitions then the minimum number of links that need to be cut is defined as the Bisection Bandwidth. &amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;3. &amp;lt;b&amp;gt; No. of Links&amp;lt;/b&amp;gt; : Number of links in a network is the set of wires that connect two different nodes in the network. &amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;4.&amp;lt;b&amp;gt; Degree&amp;lt;/b&amp;gt; : Number of input/output links connecting to each router is defined as the degree of the network.&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Types of Network Topologies &amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Linear Array&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_linear.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The nodes are connected linearly as in an array. This type of topology is simple and low cost. However, it has several shortcoming.  First of all, it does not scale well. The longest distance between two nodes, or the '''diameter''', is equivalent to the number of nodes.  In addition to not scaling well, this topology can also result in high congestion.  [[#References|&amp;lt;sup&amp;gt;[8]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In a linear array interconnection network with '''p''' nodes, the total number of &amp;lt;b&amp;gt;links&amp;lt;/b&amp;gt; would be &amp;lt;b&amp;gt; p-1 &amp;lt;/b&amp;gt; and the &amp;lt;b&amp;gt;degree&amp;lt;/b&amp;gt; of the network would be &amp;lt;b&amp;gt;2 &amp;lt;/b&amp;gt;. The total link bandwidth is &amp;lt;b&amp;gt; p-1 &amp;lt;/b&amp;gt; times the link bandwidth, but bisection bandwidth is equal to one link bandwidth. Since global communication must always travel through one link, bisection bandwidth summarizes the bandwidth characteristic of the network better than the aggregate bandwidth.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Ring&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_ring.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Similar structure as the linear array, except, the ending nodes connect to each other, establishing a circular structure. This topology will scale better since the longest distance between two nodes is cut in half, but it eventually ends up scaling poorly if enough nodes are added.  The congestion will also be cut in half since there is now 2 options for packets to traverse.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
In a Ring interconnection network with &amp;lt;b&amp;gt;p&amp;lt;/b&amp;gt; nodes, the total number of links in the network would be and the degree of the interconnection network is &amp;lt;b&amp;gt;2&amp;lt;/b&amp;gt;. The maximum distance between nodes in the network would be the distance between nodes that are half way through the network, hence the diameter would be &amp;lt;b&amp;gt; p/2 &amp;lt;/b&amp;gt;. The bisection bandwidth of the network of the interconnection would thus become 2.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;2-D Mesh&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_2Dmesh.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The 2-D mesh can be thought of as several linear arrays put together to form a 2-dimensional structure. Nodes that are not on the edge have 4 input or output links, or a '''degree''' of 4.  This topology has reasonably low energy dissipation without compromising throughput.  The area overhead for this topology is also rather low since a node is never farther than a clock cycle away, this means there is no need for repeater insertion in between the nodes.  (the repeaters have a high area overhead, so using them causes the area of the topology to increase) [[#References|&amp;lt;sup&amp;gt;[8]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
In a 2-D Mesh interconnection network with &amp;lt;b&amp;gt; p&amp;lt;/b&amp;gt; nodes, the degree of the network would be 4 and the total number of links would be &amp;lt;b&amp;gt; 2*sqrt(p)(sqrt(p) -1) &amp;lt;/b&amp;gt;. The diameter of the network would be 2*(sqrt(p) - 1). The bisection bandwidth of the interconnection is the total number of links divided by the diameter of the network, which would result to be sqrt(p).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Cube&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_cube.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The cube can be thought of as a three-dimensional mesh. This topology continues to improve upon the throughput in the 2-D mesh, however the power dissipation will be slightly higher.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Hypercube&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_hypercube.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The hypercube is essentially multiple cubes put together.  This topology was developed in 1983.  Originally there were no routers.  The next hops were just programmed into each node.  Today the hypercube topology is used by many companies including Intel.  It is so attractive because of its small diameter.  The nodes are numbered in such a way that every neighboring node is only one bit difference.  This greatly increases the ability to route messages through the network. The biggest drawback of the topology is the lack of scalability.  For example if the dimension size is increased by one, one link will need to be added to every node in the network. [[#References|&amp;lt;sup&amp;gt;[10]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
In an interconnection network of &amp;lt;b&amp;gt;p&amp;lt;/b&amp;gt; nodes with Hypercube topology, the degree of the network would be log&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt;p and the diameter of the network would also be log&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt;p. The total number of links in the network would be (p/2)*log&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt;p and the bisection bandwidth of the network would be p/2.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;2-D Torus&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_2Dtorus.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The 2-D torus takes the structure of the 2-D mesh and connects the nodes on the edges. This decreases the diameter, but the number of links is higher.  Thus it slightly improves upon the throughput of the 2-D Mesh, but it also slightly increases the power dissipation relative to the 2-D Mesh since the number of links is higher.  The delays on the routes connecting the end nodes together can have an excessively high delays if the topology is not implemented correctly.(printout)  This topology was developed in 1985, because of the design constraints, such as pins and bisection, that the hypercube required. [[#References|&amp;lt;sup&amp;gt;[9]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Tree&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_tree.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The tree has a hierarchical structure of nodes on the bottom and switching nodes at the upper levels. The tree experiences high traffic and high latency if the package must travel through the upper levels.  Also because of the high connectivity, this topology has high average energy dissipation.[[#References|&amp;lt;sup&amp;gt;[8]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Fat Tree&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_fat_tree.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
In 1985, an MIT professor invented the fat tree to improve upon the normal “skinny” tree.  To improve upon the “skinny” tree topology, the fat tree “fattens” up the links at the upper levels.  This helps to alleviate the traffic at upper levels and to decrease the latency of the message.  However, by fattening, it is meant that additional links are added to this area, which will increase the average energy dissipated by this topology. [[#References|&amp;lt;sup&amp;gt;[11]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Butterfly&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_butterfly.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Another attempt to increase the “skinny” tree structure was the butterfly structure.  The butterfly structure is similar to the tree structure, but it replicates the switching node structure of the tree topology and connects them together so that there are equal links and routers at all levels.  There are 2 problems with this topology.  First of all, there is no path diversity in this topology.  There is only one path from the root to a downstream node.  This is not ideal incase the network is congested in a certain area, but available in another.  There is no way for the network to rebalance the work.  Second of all, there are some very long routes in this topology.  This requires there to be repeaters in between the nodes, which will cause the physical area to implement the network to increase dramatically. [[#References|&amp;lt;sup&amp;gt;[12]&amp;lt;/sup&amp;gt;]] &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
In a butterfly interconnection network topology with &amp;lt;b&amp;gt; p &amp;lt;/b&amp;gt;nodes, the degree of the network would be 4 and the total number of links in the network would be 2*p(log&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt;p). The diameter of the network is log&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt;p. The bisection bandwidth of the network is p/2.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Comparison of Network Topologies &amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The following table shows the total number of ports required for each network topology. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_ports.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Number of ports for each topology''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
As the figure above shows, the 6-D hypercube requires the largest number of ports, due to its relatively complex six-dimensional structure. In contrast, the fat tree requires the least number of ports, even though links have been &amp;quot;fattened&amp;quot; up by using redundant links. The butterfly network requires more than twice the number of ports as the fat tree, since it essentially replicates the switching layer of the fat tree. The number of ports for the mesh and torii structures increase as the dimensionality increases.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Below the average path length, or average number of hops, and the average link load (GB/s) is shown.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_load.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Average path length and link load for each topology''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Looking at the trends, when average path length is high, the average link load is also high. In other words, average path length and average link load are proportionally related. It is obvious from the graph that 2-D mesh has, by far, the worst performance. In a large network such as this, the average path length is just too high, and the average link load suffers. For this type of high-performance network, the 2-D mesh does not scale well. Likewise the 2-D torus cuts the average path length and average link load in half by connected the edge nodes together, however, the performance compared to other types is relatively poor. The butterfly and fat-tree have the least average path length and average link load. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The figure below shows the cost of the network topologies.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_cost.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Cost of each topology''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Despite using the fewest number of ports, the fat tree topology has the highest cost, by far. Although it uses the fewest ports, the ports are high bandwidth ports of 10 GB/s. Over 2400, ports of 10 GB/s are required have enough bandwidth at the upper levels of the tree. This pushes the cost up dramatically, and from a cost standpoint is impractical. While the total cost of fat tree is about 15 million dollars, the rest of the network topologies are clustered below 4 million dollars. When the dimensionalality of the mesh and torii structures increase, the cost increases. The butterfly network costs between the 2-D mesh/torii and the 6-D hypercube. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
When the cost and average link load is factored the following graph is produced.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_overall.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Overall cost of each topology''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
From the figure above, the 6-D hypercube demonstrates the most cost effective choice on this particular network setup. Although the 6-D hypercube costs more because it needs more links and ports, it provides higher bandwidth, which can offset the higher cost. The high dimensional torii also perform well, but cannot provide as much bandwidth as the 6-D hypercube. For systems that do not need as much bandwidth, the high-dimensional torii is also a good choice. The butterfly topology is also an alternative, but has lower fault tolerance. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Real-World Implementation of Network Topologies &amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In a research study by Andy Hospodor and Ethan Miller, several network topologies were investigated in a high-performance, high-traffic network&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;. Several topologies were investigated including the fat tree, butterfly, mesh, torii, and hypercube structures. Advantages and disadvantages including cost, performance, and reliability were discussed. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In this experiment, a petabyte-scale network with over 100 GB/s total aggregate bandwidth was investigated. The network consisted of 4096 disks with large servers with routers and switches in between&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The overall structure of the network is shown below. Note that this structure is very susceptible to failure and congestion.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_network.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Basic structure of Hospodor and Miller's experimental network''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Fat Tree&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In large scale, high performance applications, fat tree can be a choice. However, in order to &amp;quot;fatten&amp;quot; up the links, redundant connections must be used. Instead of using one link between switching nodes, several must be used. The problem with this is that with more input and output links, one would need routers with more input and output ports. Router with excess of 100 ports are difficult to build and expensive, so multiple routers would have to be stacked together. Still, the routers would be expensive and would require several of them&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The Japan Agency for Marine-Earth Science and Technology supercomputing system uses the fat tree topology. The system connects 1280 processors using NEC processors&amp;lt;sup&amp;gt;7&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Butterfly&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In high performance applications, the butterfly structure is a good choice. The butterfly topology uses fewer links than other topologies, however, each link carries traffic from the entire layer. Fault tolerance is poor. There exists only a single path between pairs of nodes. Should the link break, data cannot be re-routed, and communication is broken&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_butterfly.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Butterfly structure''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Meshes and Tori&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The mesh and torus structure used in this application would require a large number of links and total aggregate of several thousands of ports. However, since there are so many links, the mesh and torus structures provide alternates paths in case of failures&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Some examples of current use of torus structure include the QPACE SFB TR Cluster in Germany using the PowerXCell 8i processors. The systems uses 3-D torus topology with 4608 processors&amp;lt;sup&amp;gt;7&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_mesh.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Mesh structure''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_torus.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Torus structure''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Hypercube&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Similar to the torii structures, the hypercube requires larger number of links. However, the bandwidth scales better than mesh and torii structures. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The CRAY T3E, CRAY XT3, and SGI Origin 2000 use k-ary n-cubed topologies.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_hypercube.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Hypercube structure''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Routing&amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The '''routing''' algorithm determines what path a packet of data will take from source to destination. Routing can be '''deterministic''', where the path is the same given a source and destination, or '''adaptive''', where the path can change. The routing algorithm can also be '''partially adaptive''' where packets have multiple choices, but does not allow all packets to use the shortest path&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Deadlock&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
When packets are in '''deadlock''' when they cannot continue to move through the nodes. The illustration below demonstrates this event. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_deadlock.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Example of deadlock''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Assume that all of the buffers are full at each node. Packet from Node 1 cannot continue to Node 2. The packet from Node 2 cannot continue to Node 3, and so on. Since packet cannot move, it is deadlocked. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The deadlock occurs from cyclic pattern of routing. To avoid deadlock, avoid circular routing pattern.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To avoid circular patterns of routing, some routing patterns are disallowed. These are called '''turn restrictions''', where some turns are not allowed in order to avoid making a circular routing pattern. Some of these turn restrictions are mentioned below.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;h2&amp;gt;Dimensional ordered (X-Y) routing&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Turns from the y-dimension to the x-dimension are not allowed.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;West First&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Turns to the west are not allowed.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;North Last&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Turns after a north direction are not allowed. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Negative First&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Turns in the negative direction (-x or -y) are not allowed, except on the first turn.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Odd-Even Turn Model&amp;lt;/h2&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Unfortunately, the above turn-restriction models reduce the degree of adaptiveness and are partially adaptive. The models cause some packets to take different routes, and not necessarily the minimal paths. This may cause unfairness but reduces the ability of the system to reduce congestion. Overall performance could suffer&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Ge-Ming Chiu introduces the Odd-Even turn model as an adaptive turn restriction, deadlock-free model that has better performance than the previously mentioned models&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;. The model is designed primarily for 2-D meshes.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
''Turns from the east to north direction from any node on an even column are not allowed.''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Turns from the north to west direction from any node on an odd column are not allowed.''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Turns from the east to south direction from any node on an even column are not allowed.''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Turns from the south to west direction from any node on an odd column are not allowed.''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The illustration below shows allowed routing for different source and destination nodes. Depending on which column the packet is in, only certain directions are allowed. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_odd_even.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Odd-Even turn restriction model proposed by Ge-Ming Chiu''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Comparison of Turn Restriction Models&amp;lt;/h1&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
To simulate the performance of various turn restriction models, Chiu simulated a 15 x 15 mesh under various traffic patterns. All channels have bandwidth of 20 flits/usec and has a buffer size of one flit. The dimension-ordered x-y routing, west-first, and negative-first models were compared against the odd-even model. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Traffic patterns including uniform, transpose, and hot spot were conducted. Uniform simulates one node send messages to any other node with equal probability. Transpose simulates two opposite nodes sending messages to their respective halves of the mesh. Hot spot simulates a few &amp;quot;hot spot&amp;quot; nodes that receive high traffic.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_uniform.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Uniform traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The performance of the different routing algorithms under uniform traffic is shown above. For uniform traffic, the dimension-ordered x-y model outperforms the rest of the models: as the message rate increases, it shows the slowest increase in average communication latency. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_transpose.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''First transpose traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The performance of the different routing algorithms under the first transpose traffic is shown above. The negative-first model performs best, while the odd-even model performs better than the west-first and x-y models.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_transpose2.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Second transpose traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
With the second transpose simulation, the odd-even model outperforms the rest.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_hotspot.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Hotspot traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The performance of the different routing algorithms under hotspot traffic is shown above. Only one hotspot was simulated for this test. The odd-even model outperforms the other models when the hotspot traffic fraction is 10%.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_hotspot2.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Second hotspot traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
When the number of hotspots is increased to five, the odd-even model begins to shine: its latency is the lowest at both 6 and 8 percent hotspot traffic. Meanwhile, the x-y model performs far worse than all the others. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
While the x-y model performs well under uniform traffic, it lacks adaptiveness. When traffic concentrates at hotspots, the x-y model cannot adapt and re-route packets around the resulting congestion. The odd-even model, with its greater adaptiveness, holds up much better under such congestion. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Router Architecture&amp;lt;/h1&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The '''router''' is the device that forwards incoming data toward its destination. It has several input ports and several output ports; data arriving on one of the input ports is routed to one of the output ports. Which output port is chosen depends on the destination of the data and on the routing algorithm. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The internal architecture of a router consists of input and output ports and a '''crossbar switch'''. The crossbar switch connects each input port to the selected output port, acting essentially as a multiplexer (see the sketch below). &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
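As a rough illustration, the data path can be modeled as a routing function that picks an output port for each incoming flit, with the crossbar simply connecting the input to that port. The sketch below is a simplified, hypothetical model that assumes dimension-ordered (x-y) routing on a 2-D mesh; the port names are made up for the example.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Simplified model of a 2-D mesh router: choose an output port for a flit&lt;br /&gt;
# using dimension-ordered (x before y) routing; the crossbar then acts as&lt;br /&gt;
# a multiplexer that connects the input port to the chosen output port.&lt;br /&gt;
def route_output_port(here, dest):&lt;br /&gt;
    hx, hy = here&lt;br /&gt;
    dx, dy = dest&lt;br /&gt;
    if dx != hx:&lt;br /&gt;
        return 'EAST' if dx &amp;gt; hx else 'WEST'&lt;br /&gt;
    if dy != hy:&lt;br /&gt;
        return 'NORTH' if dy &amp;gt; hy else 'SOUTH'&lt;br /&gt;
    return 'LOCAL'  # the flit has arrived; deliver it to the local node&lt;br /&gt;
&lt;br /&gt;
def crossbar_forward(flit, here):&lt;br /&gt;
    port = route_output_port(here, flit['dest'])&lt;br /&gt;
    return port, flit  # the crossbar connects the input to this output&lt;br /&gt;
&lt;br /&gt;
# A flit at node (2, 2) headed for (5, 1) leaves on the EAST port first.&lt;br /&gt;
assert route_output_port((2, 2), (5, 1)) == 'EAST'&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;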
Router technology has improved significantly over the years, which has made networks with high dimensionality feasible. As shown in the real-world example above, high-dimensional tori and hypercubes are excellent topology choices for high-performance networks, and the availability of high-performance, high-radix routers has contributed to the viability of these high-dimensionality networks. As the graph below shows, the bandwidth of routers has improved tremendously over a period of 10 years&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Router_bandwidth.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Bandwidth of various routers over 10 year period''&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Looking at the physical architecture and layout of routers, it is evident that the circuitry has become dramatically denser and more complex.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Router_physical.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Router hardware over period of time''&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Router_radix.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Radix and latency of routers over 10 year period''&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The '''radix''', or number of ports, of routers has also increased. Current routers not only have a higher radix but also lower latency than the previous generation, and latency remains roughly steady as radix increases. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
With high-performance routers, complex topologies become practical: as router technology improves, more complex, higher-dimensionality topologies can be built. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Fault Tolerant Routing&amp;lt;/h1&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Fault-tolerant routing means the successful routing of messages between any pair of non-faulty nodes in the presence of faulty components&amp;lt;sup&amp;gt;6&amp;lt;/sup&amp;gt;. As the number of processors in a multiprocessor system and the data rates grow, reliable transmission of data in the event of a network fault becomes a major concern, and hence fault-tolerant routing algorithms are important.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Fault Models&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Faults in a network can be categorized in two types:&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
1. '''Transient Faults'''&amp;lt;sup&amp;gt;5&amp;lt;/sup&amp;gt;: A transient fault is a temporary fault that lasts for a very short duration. It can be caused, for example, by a spurious change in the output of a flip-flop that leads to the generation of an invalid header. These faults can be mitigated using error-controlled coding and are generally evaluated in terms of the Bit Error Rate.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
2. '''Permanent Faults'''&amp;lt;sup&amp;gt;5&amp;lt;/sup&amp;gt;: A permanent fault is a fault that does not go away and causes permanent damage to the network, for example due to damaged wires or associated circuitry. These faults are generally evaluated in terms of the Mean Time Between Failures.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
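&amp;lt;p&amp;gt;&lt;br /&gt;
As a small illustration of error-controlled coding against transient faults, even a single parity bit lets the receiver detect any odd number of flipped bits in a flit and request a retransmission. The sketch below is purely illustrative; real networks typically use stronger codes such as CRCs.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Minimal sketch: single-bit even parity over the bits of a flit.&lt;br /&gt;
def parity(bits):&lt;br /&gt;
    return sum(bits) % 2&lt;br /&gt;
&lt;br /&gt;
def encode(bits):&lt;br /&gt;
    return bits + [parity(bits)]  # append the parity bit to the payload&lt;br /&gt;
&lt;br /&gt;
def check(coded):&lt;br /&gt;
    return parity(coded) == 0  # even parity over payload plus parity bit&lt;br /&gt;
&lt;br /&gt;
flit = [1, 0, 1, 1, 0, 0, 1, 0]&lt;br /&gt;
coded = encode(flit)&lt;br /&gt;
assert check(coded)&lt;br /&gt;
coded[3] ^= 1          # a transient fault flips one bit in transit&lt;br /&gt;
assert not check(coded)  # the receiver detects the corruption&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;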
&lt;br /&gt;
&amp;lt;h2&amp;gt;Fault Tolerance Mechanisms (for permanent faults)&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The permanent faults can be handled using one of the two mechanisms:&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
1. '''Static Mechanism''': In the static fault-tolerance model, once a fault is detected, all processes running in the system are stopped and the routing tables are emptied. Based on the fault information, the routing tables are then re-calculated to provide fault-free paths.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
2. '''Dynamic Mechanisms''': In the dynamic fault-tolerance model, the operation of the processes in the network is not completely stalled; instead, only the affected regions are repaired or routed around. Some of the methods to do this are:&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
a. '''Block Faults''': In this method, many of the healthy nodes in the vicinity of the faulty nodes are also marked as faulty so that no routes are created close to the actual faulty nodes. The fault region can be convex or non-convex, and it is ensured that none of the new routes introduce a cycle in the cyclic dependency graph (CDG).&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
DISADVANTAGE: This method causes a lot of healthy nodes to be declared faulty, leading to a reduction in system capacity.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Fault_pic1.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
b. '''Fault Rings''': This method was introduced by Chalasani and Boppana. A fault ring is the set of nodes and links that are adjacent to the faulty nodes/links. This approach reduces the number of healthy nodes that must be marked as faulty and blocked.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Fault_pic2.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
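&amp;lt;p&amp;gt;&lt;br /&gt;
The fault-ring idea can be made concrete with a small sketch that, given a set of faulty nodes in a 2-D mesh, collects the healthy nodes adjacent to the fault region. The mesh size, coordinates, and 4-neighbour adjacency below are assumptions made only for the example.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Minimal sketch: the fault ring is the set of healthy nodes adjacent to&lt;br /&gt;
# at least one faulty node in a SIZE x SIZE 2-D mesh.&lt;br /&gt;
SIZE = 8&lt;br /&gt;
&lt;br /&gt;
def neighbours(node):&lt;br /&gt;
    x, y = node&lt;br /&gt;
    steps = [(1, 0), (-1, 0), (0, 1), (0, -1)]&lt;br /&gt;
    return [(x + dx, y + dy) for dx, dy in steps&lt;br /&gt;
            if 0 &amp;lt;= x + dx &amp;lt; SIZE and 0 &amp;lt;= y + dy &amp;lt; SIZE]&lt;br /&gt;
&lt;br /&gt;
def fault_ring(faulty):&lt;br /&gt;
    ring = set()&lt;br /&gt;
    for f in faulty:&lt;br /&gt;
        for n in neighbours(f):&lt;br /&gt;
            if n not in faulty:&lt;br /&gt;
                ring.add(n)&lt;br /&gt;
    return ring&lt;br /&gt;
&lt;br /&gt;
# A single faulty node has its four healthy neighbours as its fault ring.&lt;br /&gt;
assert fault_ring({(3, 3)}) == {(2, 3), (4, 3), (3, 2), (3, 4)}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;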
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;References&amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
1 Solihin text&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
2 [http://www.ssrc.ucsc.edu/Papers/hospodor-mss04.pdf Interconnection Architectures for Petabyte-Scale High-Performance Storage Systems]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
3 [http://www.diit.unict.it/~vcatania/COURSES/semm_05-06/DOWNLOAD/noc_routing02.pdf The Odd-Even Turn Model for Adaptive Routing]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
4 [http://www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf Interconnection Topologies:(Historical Trends and Comparisons)]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
5 [http://dspace.upv.es/xmlui/bitstream/handle/10251/2603/tesisUPV2824.pdf?sequence=1 Efficient mechanisms to provide fault tolerance in interconnection networks for PC clusters, José Miguel Montañana Aliaga.]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
6 [http://web.ebscohost.com.www.lib.ncsu.edu:2048/ehost/pdfviewer/pdfviewer?vid=2&amp;amp;hid=15&amp;amp;sid=72e3828d-3cb1-42b9-8198-5c1e974ea53f@sessionmgr4 Adaptive Fault Tolerant Routing Algorithm for Tree-Hypercube Multicomputer, Qatawneh Mohammad]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
7 [http://www.top500.org TOP500 Supercomputing Sites]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
8 [http://cial.csie.ncku.edu.tw/presentation/group_pdf/%5BY2005%5DPerformance%20Evaluation%20and%20Design%20Trade-Offs%20For%20Network-on-Chip%20Interconnect%20Architectures.pdf Interconnection Network Topology Tradeoffs]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
9 [http://www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf Evolution of Interconnection Networks]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
10 [http://www2.engr.arizona.edu/~hpcat/papers/jlt94.pdf Hypercube pros and cons]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
11 [http://courses.csail.mit.edu/6.896/spring04/handouts/papers/fat_trees.pdf History of the Fat Tree Topology]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
12 [http://books.google.com/books?id=uyAg3zu_DYMC&amp;amp;pg=PA75&amp;amp;lpg=PA75&amp;amp;dq=butterfly+interconnection+network+topology&amp;amp;source=bl&amp;amp;ots=5tQm-TVTu0&amp;amp;sig=oKMwSlpqDJqopnYrKv0BbA8k5uU&amp;amp;hl=en&amp;amp;ei=5cmrTYezINDQiAL9uKjvDA&amp;amp;sa=X&amp;amp;oi=book_result&amp;amp;ct=result&amp;amp;resnum=8&amp;amp;ved=0CFMQ6AEwBw  Pros and Cons of the Butterfly Interconnection Network]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
NOTE: This wiki is based on a [http://pg-server.csc.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Spring_2010/12_EC previous wiki]&lt;/div&gt;</summary>
		<author><name>Vrmanda</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch12_ar&amp;diff=45157</id>
		<title>CSC/ECE 506 Spring 2011/ch12 ar</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch12_ar&amp;diff=45157"/>
		<updated>2011-04-19T00:45:38Z</updated>

		<summary type="html">&lt;p&gt;Vrmanda: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;h1&amp;gt;Interconnection Network Architecture &amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In a mul&lt;br /&gt;
ti-processor system, processors need to communicate with each other and access each other's resources. In order to route data and messages between processors, an interconnection architecture is needed.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Typically, in a multiprocessor system, message passed between processors are frequent and short&amp;lt;sup&amp;gt;1&amp;lt;/sup&amp;gt;. Therefore, the interconnection network architecture must handle messages quickly by having '''low latency''', and must handle several messages at a time and have '''high bandwidth'''. The minimum amount of data that can be transmitted in one cycle is called a '''phit'''. A phit is typically determined by to the width of the link. However, data is transferred at the granularity of link-level flow control unit called a '''flit'''. A flit worth of data can be accepted or rejected at the receiver, depending on the flow control protocol and amount of buffering available at the receiver.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In a network, a processor along with its cache and memory is considered a '''node'''. The physical wires that connect between them is called a '''link'''. A link can be unidirectional, in which data can only be sent in one direction, or bidirectional in which data can be sent in both directions. A link, the receiver and sender, make up a channel. The device that routes messages between nodes is called a router. The shape of the network, such as the number of links and routers, is called the network '''topology'''.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Several metrics are considered when choosing an interconnection network; a small sketch that computes them for an example network follows this list:&lt;br /&gt;
&amp;lt;p&amp;gt;1. &amp;lt;b&amp;gt;Diameter&amp;lt;/b&amp;gt;: the maximum shortest-path distance (in hops) between any pair of nodes. The average distance is obtained by averaging the shortest-path distance over all pairs of nodes.&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;2. &amp;lt;b&amp;gt;Bisection Bandwidth&amp;lt;/b&amp;gt;: a network can be partitioned in many ways. The minimum number of links that must be cut to divide the network into two equal halves defines the bisection bandwidth.&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;3. &amp;lt;b&amp;gt;Number of Links&amp;lt;/b&amp;gt;: the total count of links, where a link is the set of wires connecting two different nodes in the network.&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;4. &amp;lt;b&amp;gt;Degree&amp;lt;/b&amp;gt;: the number of input/output links connecting to each router.&amp;lt;/p&amp;gt;&lt;br /&gt;
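&amp;lt;p&amp;gt;&lt;br /&gt;
As a concrete illustration of these definitions, the sketch below computes the diameter, degree, number of links, and bisection bandwidth for a tiny example network described as a Python adjacency list. The helper names, the brute-force bisection search, and the 6-node ring example are assumptions made for illustration only.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
from itertools import combinations&lt;br /&gt;
from collections import deque&lt;br /&gt;
&lt;br /&gt;
def shortest_hops(adj, src):&lt;br /&gt;
    # Breadth-first search: hop count from src to every reachable node.&lt;br /&gt;
    dist = {src: 0}&lt;br /&gt;
    queue = deque([src])&lt;br /&gt;
    while queue:&lt;br /&gt;
        u = queue.popleft()&lt;br /&gt;
        for v in adj[u]:&lt;br /&gt;
            if v not in dist:&lt;br /&gt;
                dist[v] = dist[u] + 1&lt;br /&gt;
                queue.append(v)&lt;br /&gt;
    return dist&lt;br /&gt;
&lt;br /&gt;
def metrics(adj):&lt;br /&gt;
    nodes = list(adj)&lt;br /&gt;
    links = {frozenset((u, v)) for u in adj for v in adj[u]}&lt;br /&gt;
    dmaps = {u: shortest_hops(adj, u) for u in nodes}&lt;br /&gt;
    diameter = max(dmaps[u][v] for u, v in combinations(nodes, 2))&lt;br /&gt;
    degree = max(len(adj[u]) for u in adj)&lt;br /&gt;
    half = len(nodes) // 2&lt;br /&gt;
    # Bisection bandwidth: fewest links cut over all equal halves&lt;br /&gt;
    # (brute force, feasible only for very small networks).&lt;br /&gt;
    bisection = min(&lt;br /&gt;
        sum(1 for link in links if len(link.intersection(part)) == 1)&lt;br /&gt;
        for part in combinations(nodes, half))&lt;br /&gt;
    return {'diameter': diameter, 'degree': degree,&lt;br /&gt;
            'links': len(links), 'bisection': bisection}&lt;br /&gt;
&lt;br /&gt;
# A 6-node ring: expect diameter 3, degree 2, 6 links, bisection 2.&lt;br /&gt;
ring = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}&lt;br /&gt;
print(metrics(ring))&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;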
&lt;br /&gt;
&amp;lt;h1&amp;gt;Types of Network Topologies &amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Linear Array&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_linear.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The nodes are connected linearly, as in an array. This type of topology is simple and low cost. However, it has several shortcomings. First of all, it does not scale well: the longest distance between two nodes, or the '''diameter''', grows linearly with the number of nodes (p-1 hops for p nodes). In addition to not scaling well, this topology can also result in high congestion. [[#References|&amp;lt;sup&amp;gt;[8]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In a linear array interconnection network with '''p''' nodes, the total number of &amp;lt;b&amp;gt;links&amp;lt;/b&amp;gt; is &amp;lt;b&amp;gt;p-1&amp;lt;/b&amp;gt; and the &amp;lt;b&amp;gt;degree&amp;lt;/b&amp;gt; of the network is &amp;lt;b&amp;gt;2&amp;lt;/b&amp;gt;. The aggregate bandwidth is &amp;lt;b&amp;gt;p-1&amp;lt;/b&amp;gt; times the bandwidth of a single link, but the bisection bandwidth is equal to just one link's bandwidth. Since traffic between the two halves of the network must always travel through a single link, the bisection bandwidth summarizes the bandwidth characteristics of the network better than the aggregate bandwidth does.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Ring&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_ring.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The ring has the same structure as the linear array, except that the end nodes connect to each other, establishing a circular structure. This topology scales better, since the longest distance between two nodes is cut in half, but it still ends up scaling poorly if enough nodes are added. Congestion is also roughly cut in half, since packets now have two directions in which they can traverse the ring.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
In a Ring interconnection network with &amp;lt;b&amp;gt;p&amp;lt;/b&amp;gt; nodes, the total number of links in the network is &amp;lt;b&amp;gt;p&amp;lt;/b&amp;gt; and the degree of the interconnection network is &amp;lt;b&amp;gt;2&amp;lt;/b&amp;gt;. The maximum distance between nodes in the network is the distance between nodes that are halfway around the ring, hence the diameter is &amp;lt;b&amp;gt;p/2&amp;lt;/b&amp;gt;. Cutting the ring into two equal halves severs two links, so the bisection bandwidth of the interconnection is 2.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;2-D Mesh&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_2Dmesh.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The 2-D mesh can be thought of as several linear arrays put together to form a 2-dimensional structure. Nodes that are not on the edge have 4 input/output links, or a '''degree''' of 4. This topology has reasonably low energy dissipation without compromising throughput. The area overhead for this topology is also rather low: since a neighboring node is never farther than a clock cycle away, there is no need for repeater insertion between nodes (repeaters have a high area overhead, so using them increases the area of the topology). [[#References|&amp;lt;sup&amp;gt;[8]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
In a 2-D Mesh interconnection network with &amp;lt;b&amp;gt;p&amp;lt;/b&amp;gt; nodes, the degree of the network is 4 and the total number of links is &amp;lt;b&amp;gt;2*sqrt(p)*(sqrt(p)-1)&amp;lt;/b&amp;gt;. The diameter of the network is 2*(sqrt(p)-1). Cutting the mesh into two equal halves severs sqrt(p) links, so the bisection bandwidth of the interconnection is sqrt(p).&lt;br /&gt;
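&amp;lt;p&amp;gt;&lt;br /&gt;
The closed-form values quoted above for the linear array, the ring, and a square 2-D mesh can be collected into a small helper. This is only a sketch; the function name, the dictionary keys, and the assumption that p is a perfect square for the mesh are illustrative assumptions.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
import math&lt;br /&gt;
&lt;br /&gt;
def topology_metrics(kind, p):&lt;br /&gt;
    # Closed-form metrics for p nodes, following the formulas in the text.&lt;br /&gt;
    if kind == 'linear':&lt;br /&gt;
        return {'links': p - 1, 'degree': 2,&lt;br /&gt;
                'diameter': p - 1, 'bisection': 1}&lt;br /&gt;
    if kind == 'ring':&lt;br /&gt;
        return {'links': p, 'degree': 2,&lt;br /&gt;
                'diameter': p // 2, 'bisection': 2}&lt;br /&gt;
    if kind == 'mesh2d':&lt;br /&gt;
        k = math.isqrt(p)          # side length; k*k == p is assumed&lt;br /&gt;
        return {'links': 2 * k * (k - 1), 'degree': 4,&lt;br /&gt;
                'diameter': 2 * (k - 1), 'bisection': k}&lt;br /&gt;
    raise ValueError('unknown topology')&lt;br /&gt;
&lt;br /&gt;
# A 16-node mesh: 24 links, diameter 6, bisection bandwidth 4.&lt;br /&gt;
print(topology_metrics('mesh2d', 16))&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;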
&lt;br /&gt;
&amp;lt;h2&amp;gt;Cube&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_cube.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The cube can be thought of as a three-dimensional mesh. This topology continues to improve upon the throughput of the 2-D mesh; however, the power dissipation will be slightly higher.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Hypercube&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_hypercube.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The hypercube is essentially multiple cubes put together. This topology was developed in 1983. Originally there were no routers; the next hops were simply programmed into each node. Today the hypercube topology is used by many companies, including Intel. It is attractive because of its small diameter: the nodes are numbered in such a way that neighboring nodes differ in only one bit, which greatly simplifies routing messages through the network. The biggest drawback of the topology is its lack of scalability; for example, if the dimension is increased by one, a link must be added to every node in the network. [[#References|&amp;lt;sup&amp;gt;[10]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
In an interconnection network of &amp;lt;b&amp;gt;p&amp;lt;/b&amp;gt; nodes with the Hypercube topology, the degree of the network is log&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt;p, which is also its diameter. The total number of links is (p/2)*log&amp;lt;sub&amp;gt;2&amp;lt;/sub&amp;gt;p, and the bisection bandwidth is p/2.&lt;br /&gt;
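&amp;lt;p&amp;gt;&lt;br /&gt;
The one-bit-difference numbering described above can be sketched directly: flipping each address bit gives a node's neighbors, and the hop count between two nodes is simply the number of bits in which their addresses differ. The function names and the 3-dimensional example are assumptions made for illustration.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
def hypercube_neighbors(node, d):&lt;br /&gt;
    # In a d-dimensional hypercube (p = 2**d nodes), flipping each of&lt;br /&gt;
    # the d address bits yields the d neighbors, so the degree is d.&lt;br /&gt;
    return [node ^ (2 ** i) for i in range(d)]&lt;br /&gt;
&lt;br /&gt;
def hypercube_hops(a, b):&lt;br /&gt;
    # Minimal hop count equals the number of differing address bits.&lt;br /&gt;
    return bin(a ^ b).count('1')&lt;br /&gt;
&lt;br /&gt;
d = 3                                  # an 8-node cube&lt;br /&gt;
print(hypercube_neighbors(0b000, d))   # [1, 2, 4]&lt;br /&gt;
print(hypercube_hops(0b000, 0b111))    # 3, which is also the diameter&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;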
&lt;br /&gt;
&amp;lt;h2&amp;gt;2-D Torus&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_2Dtorus.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The 2-D torus takes the structure of the 2-D mesh and connects the nodes on the edges. This decreases the diameter, but the number of links is higher. Thus it slightly improves upon the throughput of the 2-D mesh, but it also slightly increases the power dissipation relative to the 2-D mesh, since the number of links is higher. The wrap-around routes connecting the edge nodes can have excessively high delays if the topology is not laid out carefully. This topology was developed in 1985 because of the design constraints, such as pin count and bisection, that the hypercube imposed. [[#References|&amp;lt;sup&amp;gt;[9]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Tree&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_tree.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The tree has a hierarchical structure, with processing nodes at the bottom and switching nodes at the upper levels. The tree experiences high traffic and high latency when a packet must travel through the upper levels. Also, because of the high connectivity, this topology has high average energy dissipation.[[#References|&amp;lt;sup&amp;gt;[8]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Fat Tree&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_fat_tree.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
In 1985, MIT professor Charles Leiserson introduced the fat tree to improve upon the normal “skinny” tree. The fat tree “fattens” the links at the upper levels, which helps to alleviate traffic at those levels and decrease message latency. However, fattening means that additional links are added in this area, which increases the average energy dissipated by this topology. [[#References|&amp;lt;sup&amp;gt;[11]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Butterfly&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_butterfly.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Another attempt to improve upon the “skinny” tree structure is the butterfly. The butterfly structure is similar to the tree structure, but it replicates the switching-node layer of the tree topology and connects the copies together so that there are an equal number of links and routers at every level. There are two problems with this topology. First, there is no path diversity: there is only one path between a given pair of source and destination nodes. This is not ideal in case the network is congested in one area but idle in another, because there is no way for the network to rebalance the load. Second, there are some very long routes in this topology. These require repeaters between nodes, which dramatically increases the physical area needed to implement the network. [[#References|&amp;lt;sup&amp;gt;[12]&amp;lt;/sup&amp;gt;]] &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Comparison of Network Topologies &amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The following figure shows the total number of ports required for each network topology. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_ports.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Number of ports for each topology''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
As the figure above shows, the 6-D hypercube requires the largest number of ports, due to its relatively complex six-dimensional structure. In contrast, the fat tree requires the fewest ports, even though its links have been &amp;quot;fattened&amp;quot; with redundant links. The butterfly network requires more than twice as many ports as the fat tree, since it essentially replicates the switching layer of the fat tree. The number of ports for the mesh and torus structures increases as the dimensionality increases.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Below, the average path length, or average number of hops, and the average link load (GB/s) are shown.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_load.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Average path length and link load for each topology''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Looking at the trends, when the average path length is high, the average link load is also high; in other words, average path length and average link load are proportionally related. It is obvious from the graph that the 2-D mesh has, by far, the worst performance. In a large network such as this, the average path length is simply too high, and the average link load suffers. For this type of high-performance network, the 2-D mesh does not scale well. The 2-D torus cuts the average path length and average link load in half by connecting the edge nodes together; however, its performance compared to the other types is still relatively poor. The butterfly and fat tree have the lowest average path length and average link load. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The figure below shows the cost of the network topologies.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_cost.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Cost of each topology''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Despite using the fewest ports, the fat tree topology has by far the highest cost. Although it uses the fewest ports, those ports are high-bandwidth 10 GB/s ports, and over 2400 of them are required to provide enough bandwidth at the upper levels of the tree. This pushes the cost up dramatically and makes the fat tree impractical from a cost standpoint. While the total cost of the fat tree is about 15 million dollars, the rest of the network topologies are clustered below 4 million dollars. As the dimensionality of the mesh and torus structures increases, the cost increases. The butterfly network's cost falls between the 2-D mesh/torus and the 6-D hypercube. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
When the cost and the average link load are factored together, the following graph is produced.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_overall.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Overall cost of each topology''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
From the figure above, the 6-D hypercube is the most cost-effective choice for this particular network setup. Although the 6-D hypercube costs more because it needs more links and ports, it provides higher bandwidth, which can offset the higher cost. The high-dimensional tori also perform well, but cannot provide as much bandwidth as the 6-D hypercube; for systems that do not need as much bandwidth, a high-dimensional torus is also a good choice. The butterfly topology is another alternative, but it has lower fault tolerance. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Real-World Implementation of Network Topologies &amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In a research study by Andy Hospodor and Ethan Miller, several network topologies, including the fat tree, butterfly, mesh, torus, and hypercube structures, were investigated in a high-performance, high-traffic network&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;. Advantages and disadvantages, including cost, performance, and reliability, were discussed. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In this experiment, a petabyte-scale network with over 100 GB/s of total aggregate bandwidth was investigated. The network consisted of 4096 disks attached to large servers, with routers and switches in between&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The overall structure of the network is shown below. Note that this structure is very susceptible to failure and congestion.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_network.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Basic structure of Hospodor and Miller's experimental network''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Fat Tree&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In large-scale, high-performance applications, the fat tree can be a good choice. However, in order to &amp;quot;fatten&amp;quot; the links, redundant connections must be used: instead of one link between switching nodes, several are needed. The problem is that with more input and output links, routers need more input and output ports. Routers with in excess of 100 ports are difficult to build and expensive, so multiple routers have to be stacked together. Even so, the routers are expensive, and several of them are required&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The Japan Agency for Marine-Earth Science and Technology supercomputing system uses the fat tree topology. The system connects 1280 NEC processors&amp;lt;sup&amp;gt;7&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Butterfly&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In high-performance applications, the butterfly structure is a good choice. The butterfly topology uses fewer links than other topologies; however, each link carries traffic from the entire layer. Fault tolerance is poor: there exists only a single path between any pair of nodes, so should a link break, data cannot be re-routed and communication is broken&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_butterfly.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Butterfly structure''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Meshes and Tori&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The mesh and torus structures used in this application would require a large number of links and an aggregate total of several thousand ports. However, since there are so many links, the mesh and torus structures provide alternate paths in case of failures&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
One current example of the torus structure is the QPACE SFB TR cluster in Germany, which uses PowerXCell 8i processors. The system uses a 3-D torus topology with 4608 processors&amp;lt;sup&amp;gt;7&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_mesh.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Mesh structure''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_torus.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Torus structure''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Hypercube&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Similar to the torus structures, the hypercube requires a large number of links. However, its bandwidth scales better than that of the mesh and torus structures. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The CRAY T3E, CRAY XT3, and SGI Origin 2000 use k-ary n-cube topologies.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_hypercube.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Hypercube structure''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Routing&amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The '''routing''' algorithm determines what path a packet of data takes from source to destination. Routing can be '''deterministic''', where the path is always the same for a given source and destination, or '''adaptive''', where the path can change. A routing algorithm can also be '''partially adaptive''', where packets have multiple path choices but not all packets are allowed to use the shortest path&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Deadlock&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Packets are in '''deadlock''' when they cannot continue to move through the nodes. The illustration below demonstrates this situation. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_deadlock.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Example of deadlock''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Assume that all of the buffers are full at each node. The packet from Node 1 cannot continue to Node 2, the packet from Node 2 cannot continue to Node 3, and so on. Since no packet can move, the network is deadlocked. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The deadlock arises from a cyclic pattern of routing; to avoid deadlock, circular routing patterns must be avoided.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To avoid circular routing patterns, some turns are disallowed. These are called '''turn restrictions'''. Some of these turn-restriction models are described below.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
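&amp;lt;p&amp;gt;&lt;br /&gt;
One informal way to see whether a set of routing choices can deadlock is to build a dependency graph between channels, where an edge means that a packet holding one channel may wait for the next, and then check that graph for a cycle. The sketch below is a minimal illustration of this idea; the graph representation, the names, and the four-channel example are assumptions, not taken from the text or the cited papers.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
def has_cycle(deps):&lt;br /&gt;
    # deps maps each channel to the channels it may wait on next.&lt;br /&gt;
    # A cycle in this graph means deadlock is possible.&lt;br /&gt;
    WHITE, GREY, BLACK = 0, 1, 2&lt;br /&gt;
    color = {c: WHITE for c in deps}&lt;br /&gt;
&lt;br /&gt;
    def visit(c):&lt;br /&gt;
        color[c] = GREY&lt;br /&gt;
        for nxt in deps.get(c, []):&lt;br /&gt;
            if color.get(nxt, WHITE) == GREY:&lt;br /&gt;
                return True            # back edge: a cycle exists&lt;br /&gt;
            if color.get(nxt, WHITE) == WHITE and visit(nxt):&lt;br /&gt;
                return True&lt;br /&gt;
        color[c] = BLACK&lt;br /&gt;
        return False&lt;br /&gt;
&lt;br /&gt;
    return any(color[c] == WHITE and visit(c) for c in deps)&lt;br /&gt;
&lt;br /&gt;
# Four channels around a square, each waiting on the next: deadlock.&lt;br /&gt;
cyclic = {'c01': ['c12'], 'c12': ['c23'], 'c23': ['c30'], 'c30': ['c01']}&lt;br /&gt;
print(has_cycle(cyclic))               # True&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;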
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;h2&amp;gt;Dimension-ordered (X-Y) routing&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
A packet is routed completely in the x-dimension first and then in the y-dimension; turns from the y-dimension back to the x-dimension are not allowed.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
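&amp;lt;p&amp;gt;&lt;br /&gt;
A minimal sketch of dimension-ordered X-Y routing on a 2-D mesh is shown below: the packet resolves its x offset first and only then its y offset, so a turn from the y-dimension back to the x-dimension can never occur. The coordinate convention and function name are illustrative assumptions.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
def xy_next_hop(cur, dst):&lt;br /&gt;
    # cur and dst are (x, y) coordinates on the mesh.&lt;br /&gt;
    (cx, cy), (dx, dy) = cur, dst&lt;br /&gt;
    if cx != dx:                       # resolve the x dimension first&lt;br /&gt;
        step = (dx - cx) // abs(dx - cx)&lt;br /&gt;
        return (cx + step, cy)&lt;br /&gt;
    if cy != dy:                       # then resolve the y dimension&lt;br /&gt;
        step = (dy - cy) // abs(dy - cy)&lt;br /&gt;
        return (cx, cy + step)&lt;br /&gt;
    return cur                         # already at the destination&lt;br /&gt;
&lt;br /&gt;
hop, dst = (0, 0), (2, 3)&lt;br /&gt;
while hop != dst:&lt;br /&gt;
    hop = xy_next_hop(hop, dst)&lt;br /&gt;
    print(hop)   # (1,0) (2,0) (2,1) (2,2) (2,3)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;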
&lt;br /&gt;
&amp;lt;h2&amp;gt;West First&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Any required westward hops are taken first; after that, turns to the west are not allowed.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;North Last&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
A packet travels north only as its final direction; turns after heading north are not allowed. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Negative First&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Hops in the negative directions (-x or -y) are taken first; after that, turns into a negative direction are not allowed.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
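&amp;lt;p&amp;gt;&lt;br /&gt;
The west-first, north-last, and negative-first restrictions above can be read as simple predicates over a pair of directions: the direction a packet is currently travelling and the direction it wants to turn into. The sketch below is one hedged interpretation, using the convention east = +x, west = -x, north = +y, south = -y; it is not code from the cited paper.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
def west_first_ok(cur, new):&lt;br /&gt;
    # Westward hops must come first, so turning into 'west' later is illegal.&lt;br /&gt;
    return not (new == 'west' and cur != 'west')&lt;br /&gt;
&lt;br /&gt;
def north_last_ok(cur, new):&lt;br /&gt;
    # Once a packet heads north, no further turns are allowed.&lt;br /&gt;
    return not (cur == 'north' and new != 'north')&lt;br /&gt;
&lt;br /&gt;
def negative_first_ok(cur, new):&lt;br /&gt;
    # Negative hops (west, south) must precede positive ones.&lt;br /&gt;
    negative = {'west', 'south'}&lt;br /&gt;
    return not (new in negative and cur not in negative)&lt;br /&gt;
&lt;br /&gt;
print(west_first_ok('north', 'west'))     # False: a turn to the west&lt;br /&gt;
print(negative_first_ok('west', 'south')) # True: still in the negative phase&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;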
&lt;br /&gt;
&amp;lt;h2&amp;gt;Odd-Even Turn Model&amp;lt;/h2&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Unfortunately, the above turn-restriction models reduce the degree of adaptiveness and are only partially adaptive. They force some packets onto different, not necessarily minimal, paths. This can cause unfairness and reduces the ability of the system to relieve congestion, so overall performance can suffer&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Ge-Ming Chiu introduced the Odd-Even turn model, a deadlock-free, adaptive turn-restriction model that performs better than the previously mentioned models&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;. The model is designed primarily for 2-D meshes.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
''Turns from the east to north direction from any node on an even column are not allowed.''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Turns from the north to west direction from any node on an odd column are not allowed.''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Turns from the east to south direction from any node on an even column are not allowed.''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Turns from the south to west direction from any node on an odd column are not allowed.''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The illustration below shows allowed routing for different source and destination nodes. Depending on which column the packet is in, only certain directions are allowed. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_odd_even.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Odd-Even turn restriction model proposed by Ge-Ming Chiu''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
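&amp;lt;p&amp;gt;&lt;br /&gt;
A minimal, hedged sketch of the four rules above follows: a proposed turn is checked against the parity of the column at which it would be taken. Treating the x-coordinate as the column number, and the direction names themselves, are assumptions made for illustration.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
def odd_even_turn_ok(col, cur, new):&lt;br /&gt;
    # cur is the incoming direction, new is the outgoing direction.&lt;br /&gt;
    even = (col % 2 == 0)&lt;br /&gt;
    if even and cur == 'east' and new in ('north', 'south'):&lt;br /&gt;
        return False   # east-to-north and east-to-south banned on even columns&lt;br /&gt;
    if not even and new == 'west' and cur in ('north', 'south'):&lt;br /&gt;
        return False   # north-to-west and south-to-west banned on odd columns&lt;br /&gt;
    return True&lt;br /&gt;
&lt;br /&gt;
print(odd_even_turn_ok(2, 'east', 'north'))   # False: EN turn on an even column&lt;br /&gt;
print(odd_even_turn_ok(3, 'north', 'west'))   # False: NW turn on an odd column&lt;br /&gt;
print(odd_even_turn_ok(3, 'east', 'north'))   # True&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;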
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Comparison of Turn Restriction Models&amp;lt;/h1&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
To evaluate the performance of various turn-restriction models, Chiu simulated a 15 x 15 mesh under various traffic patterns. All channels have a bandwidth of 20 flits/usec and a buffer size of one flit. The dimension-ordered x-y routing, west-first, and negative-first models were compared against the odd-even model. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Traffic patterns including uniform, transpose, and hot spot were simulated. Uniform traffic has each node sending messages to every other node with equal probability. Transpose traffic has two opposite nodes sending messages to their respective halves of the mesh. Hot-spot traffic has a few &amp;quot;hot spot&amp;quot; nodes that receive high traffic.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_uniform.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Uniform traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The performance of the different routing algorithms under uniform traffic is shown above. For uniform traffic, the dimension-ordered x-y model outperforms the rest of the models: as the number of messages increases, the x-y model has the slowest increase in average communication latency. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_transpose.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''First transpose traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The performance of the different routing algorithms is shown above for the first transpose traffic. The negative-first model has the best performance, while the odd-even model performs better than the west-first and x-y models.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_transpose2.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Second transpose traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
With the second transpose simulation, the odd-even model outperforms the rest.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_hotspot.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Hotspot traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The performance of the different routing algorithms is shown above for the hotspot traffic. Only one hotspot was simulated for this test. The odd-even model outperforms the other models when the hotspot traffic is 10%.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_hotspot2.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Second hotspot traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
When the number of hotspots is increased to five, the odd-even model begins to shine: its latency is the lowest at both 6 and 8 percent hotspot traffic. Meanwhile, the performance of the x-y model degrades badly. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
While the x-y model performs well in uniform traffic, it lacks adaptiveness. When traffic becomes hotspot, the x-y model suffers from the inability to adapt and re-route traffic to avoid the congestion caused by hotspots. The odd-even model has superior adaptiveness under high congestion. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Router Architecture&amp;lt;/h1&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The '''router''' is a device that routes incoming data to its destination. It has several input ports and several output ports; data arriving on one of the input ports is routed to one of the output ports. Which output port is chosen depends on the destination of the data and on the routing algorithm. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The internal architecture of a router consists of input and output ports connected by a '''crossbar switch'''. The crossbar selects which output port each input is connected to, acting essentially as a set of multiplexers. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Router technology has improved significantly over the years, which has made networks with high dimensionality feasible. As shown in the real-world example above, high-dimensional tori and hypercubes are excellent topology choices for high-performance networks. The availability of affordable high-performance, high-radix routers has contributed to the viability of these kinds of high-dimensionality networks. As the graph below shows, the bandwidth of routers has improved tremendously over a period of 10 years&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
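&amp;lt;p&amp;gt;&lt;br /&gt;
As a rough illustration of the crossbar's role, the sketch below pairs input ports with requested output ports, granting at most one input per output in a cycle, much like a set of per-output multiplexers. The port numbering and the greedy grant policy are assumptions for illustration; real routers use more sophisticated arbitration.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
def crossbar_schedule(requests, num_outputs):&lt;br /&gt;
    # requests: {input_port: desired_output_port} for one cycle.&lt;br /&gt;
    grants = {}&lt;br /&gt;
    taken = set()&lt;br /&gt;
    for in_port, out_port in sorted(requests.items()):&lt;br /&gt;
        if out_port in range(num_outputs) and out_port not in taken:&lt;br /&gt;
            grants[in_port] = out_port&lt;br /&gt;
            taken.add(out_port)&lt;br /&gt;
    return grants&lt;br /&gt;
&lt;br /&gt;
# Inputs 0 and 2 both want output 1; only one is granted this cycle.&lt;br /&gt;
print(crossbar_schedule({0: 1, 1: 3, 2: 1}, num_outputs=4))   # {0: 1, 1: 3}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;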
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Router_bandwidth.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Bandwidth of various routers over 10 year period''&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Looking at the physical architecture and layout of routers, it is evident that the circuitry has become dramatically denser and more complex.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Router_physical.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Router hardware over period of time''&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Router_radix.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Radix and latency of routers over 10 year period''&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The '''radix''', or number of ports, of routers has also increased. Current routers not only have a high radix but also a lower latency than the previous generation, and as the radix increases, the latency remains steady. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
With high-performance routers, complex topologies become practical; as router technology continues to improve, ever more complex, higher-dimensionality topologies become possible. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Fault Tolerant Routing&amp;lt;/h1&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Fault-tolerant routing means the successful routing of messages between any pair of non-faulty nodes in the presence of faulty components&amp;lt;sup&amp;gt;6&amp;lt;/sup&amp;gt;. With the increasing number of processors in multiprocessor systems and high data rates, reliable transmission of data in the event of a network fault is a major concern; hence fault-tolerant routing algorithms are important.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Fault Models&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Faults in a network can be categorized into two types:&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
1. '''Transient Faults'''&amp;lt;sup&amp;gt;5&amp;lt;/sup&amp;gt;: A transient fault is a temporary fault that occurs for a very short duration of time. Such a fault can be caused, for example, by a spurious change in the output of a flip-flop that leads to the generation of an invalid header. These faults can be minimized using error-control coding, and they are generally evaluated in terms of the bit error rate (BER).&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
2. '''Permanent Faults'''&amp;lt;sup&amp;gt;5&amp;lt;/sup&amp;gt;: A permanent fault is a fault that does not go away and causes permanent damage to the network. It can be due to damaged wires and the associated circuitry. These faults are generally evaluated in terms of the Mean Time Between Failures (MTBF).&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
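&amp;lt;p&amp;gt;&lt;br /&gt;
As a small illustration of why the bit error rate matters for transient faults, the sketch below computes the probability that an n-bit flit arrives with no corrupted bits, assuming independent bit errors. The numbers used are made-up examples.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
def flit_ok_probability(ber, bits):&lt;br /&gt;
    # Probability that every one of 'bits' bits arrives uncorrupted.&lt;br /&gt;
    return (1.0 - ber) ** bits&lt;br /&gt;
&lt;br /&gt;
print(flit_ok_probability(1e-12, 128))   # roughly 0.999999999872&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;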
&lt;br /&gt;
&amp;lt;h2&amp;gt;Fault Tolerance Mechanisms (for permanent faults)&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The permanent faults can be handled using one of the two mechanisms:&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
1. '''Static Mechanism''': In the static fault-tolerance model, once a fault is detected, all the processes running in the system are stopped and the routing tables are emptied. Based on the fault information, the routing tables are then recalculated to provide fault-free paths.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
2. '''Dynamic Mechanisms''': In the dynamic fault-tolerance model, the operation of the processes in the network is not completely stalled; only the affected regions are repaired. Some of the methods used are:&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
a. '''Block Faults''': In this method, many of the healthy nodes in the vicinity of the faulty nodes are also marked as faulty so that no routes are created close to the actual faulty nodes. The shape of the blocked region can be convex or non-convex, and it is ensured that none of the new routes introduces a cyclic dependency in the channel dependency graph (CDG).&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
DISADVANTAGE: This method causes a lot of healthy nodes to be declared faulty, leading to a reduction in system capacity.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Fault_pic1.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
b. '''Fault Rings''': This method was introduced by Chalasani and Boppana. A fault ring is the set of nodes and links that are adjacent to the faulty nodes/links. This approach reduces the number of healthy nodes that must be marked as faulty and blocked.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Fault_pic2.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
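&amp;lt;p&amp;gt;&lt;br /&gt;
A hedged sketch of the fault-ring idea on a 2-D mesh follows: the healthy nodes adjacent (including diagonally) to any faulty node form the ring that traffic can be steered around. The mesh dimensions, the faulty set, and the use of diagonal adjacency are assumptions for illustration.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
def fault_ring(faulty, width, height):&lt;br /&gt;
    # Collect the healthy neighbours (including diagonals) of faulty nodes.&lt;br /&gt;
    ring = set()&lt;br /&gt;
    for (fx, fy) in faulty:&lt;br /&gt;
        for ox in (-1, 0, 1):&lt;br /&gt;
            for oy in (-1, 0, 1):&lt;br /&gt;
                n = (fx + ox, fy + oy)&lt;br /&gt;
                if n not in faulty and n[0] in range(width) and n[1] in range(height):&lt;br /&gt;
                    ring.add(n)&lt;br /&gt;
    return ring&lt;br /&gt;
&lt;br /&gt;
# A single faulty node at (2, 2) in a 5 x 5 mesh: the 8 surrounding nodes.&lt;br /&gt;
print(sorted(fault_ring({(2, 2)}, width=5, height=5)))&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;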
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;References&amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
1 Solihin text&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
2 [http://www.ssrc.ucsc.edu/Papers/hospodor-mss04.pdf Interconnection Architectures for Petabyte-Scale High-Performance Storage Systems]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
3 [http://www.diit.unict.it/~vcatania/COURSES/semm_05-06/DOWNLOAD/noc_routing02.pdf The Odd-Even Turn Model for Adaptive Routing]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
4 [http://www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf Interconnection Topologies:(Historical Trends and Comparisons)]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
5 [http://dspace.upv.es/xmlui/bitstream/handle/10251/2603/tesisUPV2824.pdf?sequence=1 Efficient mechanisms to provide fault tolerance in interconnection networks for PC clusters, José Miguel Montañana Aliaga.]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
6 [http://web.ebscohost.com.www.lib.ncsu.edu:2048/ehost/pdfviewer/pdfviewer?vid=2&amp;amp;hid=15&amp;amp;sid=72e3828d-3cb1-42b9-8198-5c1e974ea53f@sessionmgr4 Adaptive Fault Tolerant Routing Algorithm for Tree-Hypercube Multicomputer, Qatawneh Mohammad]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
7 [http://www.top500.org TOP500 Supercomputing Sites]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
8 [http://cial.csie.ncku.edu.tw/presentation/group_pdf/%5BY2005%5DPerformance%20Evaluation%20and%20Design%20Trade-Offs%20For%20Network-on-Chip%20Interconnect%20Architectures.pdf Interconnection Network Topology Tradeoffs]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
9 [http://www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf Evolution of Interconnection Networks]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
10 [http://www2.engr.arizona.edu/~hpcat/papers/jlt94.pdf Hypercube pros and cons]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
11 [http://courses.csail.mit.edu/6.896/spring04/handouts/papers/fat_trees.pdf History of the Fat Tree Topology]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
12 [http://books.google.com/books?id=uyAg3zu_DYMC&amp;amp;pg=PA75&amp;amp;lpg=PA75&amp;amp;dq=butterfly+interconnection+network+topology&amp;amp;source=bl&amp;amp;ots=5tQm-TVTu0&amp;amp;sig=oKMwSlpqDJqopnYrKv0BbA8k5uU&amp;amp;hl=en&amp;amp;ei=5cmrTYezINDQiAL9uKjvDA&amp;amp;sa=X&amp;amp;oi=book_result&amp;amp;ct=result&amp;amp;resnum=8&amp;amp;ved=0CFMQ6AEwBw  Pros and Cons of the Butterfly Interconnection Network]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
NOTE: This wiki is based on a [http://pg-server.csc.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Spring_2010/12_EC previous wiki]&lt;/div&gt;</summary>
		<author><name>Vrmanda</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch12_ar&amp;diff=45155</id>
		<title>CSC/ECE 506 Spring 2011/ch12 ar</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch12_ar&amp;diff=45155"/>
		<updated>2011-04-19T00:38:44Z</updated>

		<summary type="html">&lt;p&gt;Vrmanda: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;h1&amp;gt;Interconnection Network Architecture &amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In a multiprocessor system, processors need to communicate with each other and access each other's resources. In order to route data and messages between processors, an interconnection architecture is needed.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Typically, in a multiprocessor system, messages passed between processors are frequent and short&amp;lt;sup&amp;gt;1&amp;lt;/sup&amp;gt;. Therefore, the interconnection network architecture must handle messages quickly by having '''low latency''', and must handle several messages at a time by having '''high bandwidth'''. The minimum amount of data that can be transmitted in one cycle is called a '''phit'''. A phit is typically determined by the width of the link. However, data is transferred at the granularity of the link-level flow-control unit, called a '''flit'''. A flit's worth of data can be accepted or rejected at the receiver, depending on the flow-control protocol and the amount of buffering available at the receiver.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In a network, a processor along with its cache and memory is considered a '''node'''. The physical wires that connect two nodes are called a '''link'''. A link can be unidirectional, in which data can only be sent in one direction, or bidirectional, in which data can be sent in both directions. A link, together with its sender and receiver, makes up a channel. The device that routes messages between nodes is called a router. The shape of the network, such as the arrangement of links and routers, is called the network '''topology'''.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Several metrics are considered when choosing an interconnection network:&lt;br /&gt;
&amp;lt;p&amp;gt;1. &amp;lt;b&amp;gt;Diameter&amp;lt;/b&amp;gt;: the maximum shortest-path distance (in hops) between any pair of nodes. The average distance is obtained by averaging the shortest-path distance over all pairs of nodes.&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;2. &amp;lt;b&amp;gt;Bisection Bandwidth&amp;lt;/b&amp;gt;: a network can be partitioned in many ways. The minimum number of links that must be cut to divide the network into two equal halves defines the bisection bandwidth.&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;3. &amp;lt;b&amp;gt;Number of Links&amp;lt;/b&amp;gt;: the total count of links, where a link is the set of wires connecting two different nodes in the network.&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;4. &amp;lt;b&amp;gt;Degree&amp;lt;/b&amp;gt;: the number of input/output links connecting to each router.&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Types of Network Topologies &amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Linear Array&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_linear.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The nodes are connected linearly, as in an array. This type of topology is simple and low cost. However, it has several shortcomings. First of all, it does not scale well: the longest distance between two nodes, or the '''diameter''', grows linearly with the number of nodes (p-1 hops for p nodes). In addition to not scaling well, this topology can also result in high congestion. [[#References|&amp;lt;sup&amp;gt;[8]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In a linear array interconnection network with '''p''' nodes, the total number of &amp;lt;b&amp;gt;links&amp;lt;/b&amp;gt; is &amp;lt;b&amp;gt;p-1&amp;lt;/b&amp;gt; and the &amp;lt;b&amp;gt;degree&amp;lt;/b&amp;gt; of the network is &amp;lt;b&amp;gt;2&amp;lt;/b&amp;gt;. The aggregate bandwidth is &amp;lt;b&amp;gt;p-1&amp;lt;/b&amp;gt; times the bandwidth of a single link, but the bisection bandwidth is equal to just one link's bandwidth. Since traffic between the two halves of the network must always travel through a single link, the bisection bandwidth summarizes the bandwidth characteristics of the network better than the aggregate bandwidth does.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Ring&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_ring.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The ring has the same structure as the linear array, except that the end nodes connect to each other, establishing a circular structure. This topology scales better, since the longest distance between two nodes is cut in half, but it still ends up scaling poorly if enough nodes are added. Congestion is also roughly cut in half, since packets now have two directions in which they can traverse the ring.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
In a Ring interconnection network with &amp;lt;b&amp;gt;p&amp;lt;/b&amp;gt; nodes, the total number of links in the network is &amp;lt;b&amp;gt;p&amp;lt;/b&amp;gt; and the degree of the interconnection network is &amp;lt;b&amp;gt;2&amp;lt;/b&amp;gt;. The maximum distance between nodes in the network is the distance between nodes that are halfway around the ring, hence the diameter is &amp;lt;b&amp;gt;p/2&amp;lt;/b&amp;gt;. Cutting the ring into two equal halves severs two links, so the bisection bandwidth of the interconnection is 2.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;2-D Mesh&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_2Dmesh.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The 2-D mesh can be thought of as several linear arrays put together to form a 2-dimensional structure. Nodes that are not on the edge have 4 input/output links, or a '''degree''' of 4. This topology has reasonably low energy dissipation without compromising throughput. The area overhead for this topology is also rather low: since a neighboring node is never farther than a clock cycle away, there is no need for repeater insertion between nodes (repeaters have a high area overhead, so using them increases the area of the topology). [[#References|&amp;lt;sup&amp;gt;[8]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
In a 2-D Mesh interconnection network with &amp;lt;b&amp;gt;p&amp;lt;/b&amp;gt; nodes, the degree of the network is 4 and the total number of links is &amp;lt;b&amp;gt;2*sqrt(p)*(sqrt(p)-1)&amp;lt;/b&amp;gt;. The diameter of the network is 2*(sqrt(p)-1). Cutting the mesh into two equal halves severs sqrt(p) links, so the bisection bandwidth of the interconnection is sqrt(p).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Cube&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_cube.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The cube can be thought of as a three-dimensional mesh. This topology continues to improve upon the throughput in the 2-D mesh, however the power dissipation will be slightly higher.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Hypercube&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_hypercube.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The hypercube is essentially multiple cubes put together. This topology was developed in 1983. Originally there were no routers; the next hops were simply programmed into each node. Today the hypercube topology is used by many companies, including Intel. It is attractive because of its small diameter: the nodes are numbered in such a way that neighboring nodes differ in only one bit, which greatly simplifies routing messages through the network. The biggest drawback of the topology is its lack of scalability; for example, if the dimension is increased by one, a link must be added to every node in the network. [[#References|&amp;lt;sup&amp;gt;[10]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;2-D Torus&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_2Dtorus.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The 2-D torus takes the structure of the 2-D mesh and connects the nodes on the edges. This decreases the diameter, but the number of links is higher. Thus it slightly improves upon the throughput of the 2-D mesh, but it also slightly increases the power dissipation relative to the 2-D mesh, since the number of links is higher. The wrap-around routes connecting the edge nodes can have excessively high delays if the topology is not laid out carefully. This topology was developed in 1985 because of the design constraints, such as pin count and bisection, that the hypercube imposed. [[#References|&amp;lt;sup&amp;gt;[9]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Tree&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_tree.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The tree has a hierarchical structure, with processing nodes at the bottom and switching nodes at the upper levels. The tree experiences high traffic and high latency when a packet must travel through the upper levels. Also, because of the high connectivity, this topology has high average energy dissipation.[[#References|&amp;lt;sup&amp;gt;[8]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Fat Tree&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_fat_tree.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
In 1985, an MIT professor invented the fat tree to improve upon the normal “skinny” tree.  To improve upon the “skinny” tree topology, the fat tree “fattens” up the links at the upper levels.  This helps to alleviate the traffic at upper levels and to decrease the latency of the message.  However, by fattening, it is meant that additional links are added to this area, which will increase the average energy dissipated by this topology. [[#References|&amp;lt;sup&amp;gt;[11]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Butterfly&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_butterfly.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Another attempt to improve upon the “skinny” tree structure is the butterfly. The butterfly structure is similar to the tree structure, but it replicates the switching-node layer of the tree topology and connects the copies together so that there are an equal number of links and routers at every level. There are two problems with this topology. First, there is no path diversity: there is only one path between a given pair of source and destination nodes. This is not ideal in case the network is congested in one area but idle in another, because there is no way for the network to rebalance the load. Second, there are some very long routes in this topology. These require repeaters between nodes, which dramatically increases the physical area needed to implement the network. [[#References|&amp;lt;sup&amp;gt;[12]&amp;lt;/sup&amp;gt;]] &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Comparison of Network Topologies &amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The following table shows the total number of ports required for each network topology. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_ports.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Number of ports for each topology''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
As the figure above shows, the 6-D hypercube requires the largest number of ports, due to its relatively complex six-dimensional structure. In contrast, the fat tree requires the fewest ports, even though its links have been &amp;quot;fattened&amp;quot; with redundant links. The butterfly network requires more than twice as many ports as the fat tree, since it essentially replicates the switching layer of the fat tree. The number of ports for the mesh and torus structures increases as the dimensionality increases.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Below, the average path length, or average number of hops, and the average link load (GB/s) are shown.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_load.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Average path length and link load for each topology''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Looking at the trends, when the average path length is high, the average link load is also high; in other words, average path length and average link load are proportionally related. It is obvious from the graph that the 2-D mesh has, by far, the worst performance. In a large network such as this, the average path length is simply too high, and the average link load suffers. For this type of high-performance network, the 2-D mesh does not scale well. The 2-D torus cuts the average path length and average link load in half by connecting the edge nodes together; however, its performance compared to the other types is still relatively poor. The butterfly and fat tree have the lowest average path length and average link load. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The figure below shows the cost of the network topologies.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_cost.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Cost of each topology''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Despite using the fewest ports, the fat tree topology has by far the highest cost. Although it uses the fewest ports, those ports are high-bandwidth 10 GB/s ports, and over 2400 of them are required to provide enough bandwidth at the upper levels of the tree. This pushes the cost up dramatically and makes the fat tree impractical from a cost standpoint. While the total cost of the fat tree is about 15 million dollars, the rest of the network topologies are clustered below 4 million dollars. As the dimensionality of the mesh and torus structures increases, the cost increases. The butterfly network's cost falls between the 2-D mesh/torus and the 6-D hypercube. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
When the cost and the average link load are factored together, the following graph is produced.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_overall.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Overall cost of each topology''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
From the figure above, the 6-D hypercube is the most cost-effective choice for this particular network setup. Although the 6-D hypercube costs more because it needs more links and ports, it provides higher bandwidth, which can offset the higher cost. The high-dimensional tori also perform well, but cannot provide as much bandwidth as the 6-D hypercube; for systems that do not need as much bandwidth, a high-dimensional torus is also a good choice. The butterfly topology is another alternative, but it has lower fault tolerance. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Real-World Implementation of Network Topologies &amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In a research study by Andy Hospodor and Ethan Miller, several network topologies were investigated in a high-performance, high-traffic network&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;. Several topologies were investigated including the fat tree, butterfly, mesh, torii, and hypercube structures. Advantages and disadvantages including cost, performance, and reliability were discussed. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In this experiment, a petabyte-scale network with over 100 GB/s total aggregate bandwidth was investigated. The network consisted of 4096 disks with large servers with routers and switches in between&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The overall structure of the network is shown below. Note that this structure is very susceptible to failure and congestion.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_network.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Basic structure of Hospodor and Miller's experimental network''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Fat Tree&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In large-scale, high-performance applications, the fat tree can be a choice. However, in order to &amp;quot;fatten&amp;quot; up the links, redundant connections must be used: instead of one link between switching nodes, several must be used. The problem is that with more input and output links, routers with more input and output ports are needed. Routers with in excess of 100 ports are difficult to build and expensive, so multiple routers would have to be stacked together. Even so, many routers would be required, and they would remain expensive&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The Japan Agency for Marine-Earth Science and Technology supercomputing system uses the fat tree topology. The system connects 1280 NEC processors&amp;lt;sup&amp;gt;7&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Butterfly&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In high-performance applications, the butterfly structure is a good choice. The butterfly topology uses fewer links than other topologies; however, each link carries the traffic of an entire layer. Fault tolerance is poor: there is only a single path between any pair of nodes, so should a link break, data cannot be re-routed and communication is broken&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_butterfly.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Butterfly structure''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Meshes and Tori&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The mesh and torus structures used in this application would require a large number of links and an aggregate of several thousand ports. However, since there are so many links, the mesh and torus structures provide alternate paths in case of failures&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
One example of a current use of the torus structure is the QPACE SFB TR cluster in Germany, built from PowerXCell 8i processors. The system uses a 3-D torus topology with 4608 processors&amp;lt;sup&amp;gt;7&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_mesh.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Mesh structure''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_torus.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Torus structure''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Hypercube&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Similar to the torus structures, the hypercube requires a larger number of links. However, its bandwidth scales better than that of the mesh and torus structures. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The Cray T3E, Cray XT3, and SGI Origin 2000 use k-ary n-cube topologies.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_hypercube.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Hypercube structure''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Routing&amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The '''routing''' algorithm determines what path a packet of data takes from source to destination. Routing can be '''deterministic''', where the path is always the same for a given source and destination, or '''adaptive''', where the path can change. A routing algorithm can also be '''partially adaptive''', where packets have multiple path choices but not all packets are allowed to use the shortest path&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Deadlock&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Packets are in '''deadlock''' when they cannot continue to move through the nodes. The illustration below demonstrates this situation. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_deadlock.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Example of deadlock''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Assume that all of the buffers are full at each node. The packet from Node 1 cannot continue to Node 2, the packet from Node 2 cannot continue to Node 3, and so on. Since no packet can move, the network is deadlocked. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The deadlock arises from a cyclic pattern of routing; to avoid deadlock, circular routing patterns must be avoided.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To avoid circular routing patterns, certain turns are disallowed. These '''turn restrictions''' eliminate the turns that could combine into a routing cycle. Some common turn restriction schemes are described below.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
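&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The circular-wait condition described above can be made concrete with a small sketch. The following Python fragment (purely illustrative; the node numbering and the waits-for map are assumptions rather than material from the referenced papers) models each blocked packet as waiting on the buffer of the next node and reports deadlock when that waits-for relation contains a cycle.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Detect deadlock as a cycle in a waits-for graph.&lt;br /&gt;
# Each packet occupies a buffer and waits for the buffer at the next node.&lt;br /&gt;
def has_cycle(waits_for):&lt;br /&gt;
    # waits_for maps a node to the node whose buffer it is waiting on&lt;br /&gt;
    for start in waits_for:&lt;br /&gt;
        seen = set()&lt;br /&gt;
        node = start&lt;br /&gt;
        while node in waits_for:&lt;br /&gt;
            if node in seen:&lt;br /&gt;
                return True      # came back around: circular wait&lt;br /&gt;
            seen.add(node)&lt;br /&gt;
            node = waits_for[node]&lt;br /&gt;
    return False&lt;br /&gt;
&lt;br /&gt;
# Four packets routed in a ring, every buffer full: the classic deadlock.&lt;br /&gt;
waits_for = {1: 2, 2: 3, 3: 4, 4: 1}&lt;br /&gt;
print(has_cycle(waits_for))      # True: no packet can ever advance&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;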
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;h2&amp;gt;Dimensional ordered (X-Y) routing&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Turns from the y-dimension to the x-dimension are not allowed.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
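&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
As a concrete illustration of dimension-ordered routing, the minimal Python sketch below (the coordinate scheme and helper names are assumptions made for illustration) resolves the x offset completely before touching y, so a packet never turns from the y dimension back into the x dimension.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Dimension-ordered (X-Y) routing on a 2-D mesh: finish x first, then y.&lt;br /&gt;
# Because a packet never turns from y back to x, no routing cycle can form.&lt;br /&gt;
def sign(d):&lt;br /&gt;
    return d // abs(d) if d else 0&lt;br /&gt;
&lt;br /&gt;
def xy_route(src, dst):&lt;br /&gt;
    (x, y), (tx, ty) = src, dst&lt;br /&gt;
    path = [(x, y)]&lt;br /&gt;
    while x != tx:               # resolve the x offset first&lt;br /&gt;
        x += sign(tx - x)&lt;br /&gt;
        path.append((x, y))&lt;br /&gt;
    while y != ty:               # then y; never turn back to x&lt;br /&gt;
        y += sign(ty - y)&lt;br /&gt;
        path.append((x, y))&lt;br /&gt;
    return path&lt;br /&gt;
&lt;br /&gt;
print(xy_route((0, 0), (3, 2)))&lt;br /&gt;
# [(0, 0), (1, 0), (2, 0), (3, 0), (3, 1), (3, 2)]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;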
&lt;br /&gt;
&amp;lt;h2&amp;gt;West First&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Turns to the west are not allowed.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;North Last&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Turns after a north direction are not allowed. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Negative First&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Turns in the negative direction (-x or -y) are not allowed, except on the first turn.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Odd-Even Turn Model&amp;lt;/h2&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Unfortunately, the above turn-restriction models reduce the degree of adaptiveness and are only partially adaptive. They force some packets onto different, not necessarily minimal, routes. This can cause unfairness and reduces the system's ability to relieve congestion, so overall performance can suffer&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Ge-Ming Chiu introduced the Odd-Even turn model as an adaptive, deadlock-free turn-restriction model that performs better than the previously mentioned models&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;. The model is designed primarily for 2-D meshes.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
''Turns from the east to north direction from any node on an even column are not allowed.''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Turns from the north to west direction from any node on an odd column are not allowed.''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Turns from the east to south direction from any node on an even column are not allowed.''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Turns from the south to west direction from any node on an odd column are not allowed.''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The illustration below shows allowed routing for different source and destination nodes. Depending on which column the packet is in, only certain directions are allowed. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_odd_even.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Odd-Even turn restriction model proposed by Ge-Ming Chiu''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
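&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The four rules above can be written directly as a predicate that a router would consult before taking a turn. The Python sketch below is only an illustration of the rules as quoted; the function name and the direction encoding are assumptions, not Chiu's notation.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
def turn_allowed(col, frm, to):&lt;br /&gt;
    even = (col % 2 == 0)&lt;br /&gt;
    if even and frm == 'E' and to == 'N':&lt;br /&gt;
        return False     # no east-to-north turn in an even column&lt;br /&gt;
    if (not even) and frm == 'N' and to == 'W':&lt;br /&gt;
        return False     # no north-to-west turn in an odd column&lt;br /&gt;
    if even and frm == 'E' and to == 'S':&lt;br /&gt;
        return False     # no east-to-south turn in an even column&lt;br /&gt;
    if (not even) and frm == 'S' and to == 'W':&lt;br /&gt;
        return False     # no south-to-west turn in an odd column&lt;br /&gt;
    return True&lt;br /&gt;
&lt;br /&gt;
print(turn_allowed(2, 'E', 'N'))   # False: even column, east-to-north&lt;br /&gt;
print(turn_allowed(3, 'E', 'N'))   # True: odd column, the turn is permitted&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;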
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Comparison of Turn Restriction Models&amp;lt;/h1&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
To evaluate the performance of the various turn restriction models, Chiu simulated a 15 x 15 mesh under several traffic patterns. All channels have a bandwidth of 20 flits/usec and a buffer size of one flit. The dimension-ordered x-y routing, west-first, and negative-first models were compared against the odd-even model. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Uniform, transpose, and hot-spot traffic patterns were simulated. Uniform traffic simulates each node sending messages to any other node with equal probability. Transpose traffic simulates two opposite nodes sending messages to their respective halves of the mesh. Hot-spot traffic simulates a few &amp;quot;hot spot&amp;quot; nodes that receive high traffic.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
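&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The three traffic patterns can be sketched as destination-selection functions for a 15 x 15 mesh. The Python fragment below is a rough illustration only; the exact transpose and hot-spot definitions used in Chiu's simulations may differ, and the 10% hot-spot probability is simply the figure mentioned later in the text.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
import random&lt;br /&gt;
&lt;br /&gt;
SIZE = 15&lt;br /&gt;
nodes = [(x, y) for x in range(SIZE) for y in range(SIZE)]&lt;br /&gt;
&lt;br /&gt;
def uniform(src):&lt;br /&gt;
    # every other node is an equally likely destination&lt;br /&gt;
    return random.choice([n for n in nodes if n != src])&lt;br /&gt;
&lt;br /&gt;
def transpose(src):&lt;br /&gt;
    # mirror a node's coordinates across the diagonal of the mesh&lt;br /&gt;
    return (src[1], src[0])&lt;br /&gt;
&lt;br /&gt;
def hotspot(src, hot_nodes, extra=0.10):&lt;br /&gt;
    # with probability 'extra', target one of the designated hot nodes&lt;br /&gt;
    if random.choices([True, False], weights=[extra, 1 - extra])[0]:&lt;br /&gt;
        return random.choice(hot_nodes)&lt;br /&gt;
    return uniform(src)&lt;br /&gt;
&lt;br /&gt;
print(uniform((0, 0)), transpose((2, 7)), hotspot((0, 0), [(7, 7)]))&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;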
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_uniform.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Uniform traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The performance of the different routing algorithms under uniform traffic is shown above. For uniform traffic, the dimension-ordered x-y model outperforms the rest of the models: as the number of messages increases, the x-y model has the slowest increase in average communication latency. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_transpose.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''First transpose traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The performance of the different routing algorithms is shown above for the first transpose traffic. The negative-first model has the best performance, while the odd-even model performs better than the west-first and x-y models.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_transpose2.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Second transpose traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
With the second transpose simulation, the odd-even model outperforms the rest.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_hotspot.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Hotspot traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The performance of the different routing algorithms under hotspot traffic is shown above. Only one hotspot was simulated for this test. The odd-even model outperforms the other models when hotspot traffic is 10%.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_hotspot2.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Second hotspot traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
When the number of hotspots is increased to five, the odd-even model begins to shine: its latency is the lowest at both 6 and 8 percent hotspot traffic. Meanwhile, the x-y model performs very poorly. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
While the x-y model performs well under uniform traffic, it lacks adaptiveness. When the traffic contains hotspots, the x-y model suffers from its inability to adapt and re-route traffic around the resulting congestion. The odd-even model shows superior adaptiveness under high congestion. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Router Architecture&amp;lt;/h1&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The '''router''' is the device that routes incoming data toward its destination. It has several input ports and several output ports; data arriving on one of the input ports is forwarded to one of the output ports. Which output port is chosen depends on the destination of the data and on the routing algorithm. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The internal architecture of a router consists of input ports, output ports, and a '''crossbar switch'''. The crossbar switch connects each input port to the selected output port, acting essentially as a set of multiplexers. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Router technology has improved significantly over the years, which has allowed networks with high dimensionality to become feasible. As the real-world example above shows, high-dimensional tori and hypercubes are excellent choices of topology for high-performance networks. The decreasing cost of high-performance, high-radix routers has contributed to the viability of these high-dimensionality networks. As the graph below shows, the bandwidth of routers has improved tremendously over a period of 10 years&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
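&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
A rough sketch of that multiplexer view is shown below: each output port of the crossbar accepts at most one input port per cycle. The request format and the first-come arbitration are assumptions made purely for illustration; real routers use more sophisticated allocators.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Crossbar as a set of per-output multiplexers: one input per output per cycle.&lt;br /&gt;
def crossbar_schedule(requests):&lt;br /&gt;
    # requests: list of (input_port, output_port) pairs wanting service&lt;br /&gt;
    chosen = {}                      # output_port mapped to the winning input&lt;br /&gt;
    for inp, out in requests:&lt;br /&gt;
        if out not in chosen:        # first requester wins this cycle&lt;br /&gt;
            chosen[out] = inp&lt;br /&gt;
    return chosen&lt;br /&gt;
&lt;br /&gt;
# Three inputs contend for two outputs; input 2 loses and must wait a cycle.&lt;br /&gt;
print(crossbar_schedule([(0, 1), (1, 0), (2, 1)]))   # {1: 0, 0: 1}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;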
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Router_bandwidth.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Bandwidth of various routers over 10 year period''&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Looking at the physical architecture and layout of routers, it is evident that the circuitry has become dramatically denser and more complex.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Router_physical.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Router hardware over period of time''&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Router_radix.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Radix and latency of routers over 10 year period''&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The '''radix''', or number of ports, of routers has also increased. Current technology not only has a higher radix but also lower latency than the previous generation, and as the radix increases the latency remains steady. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
With high-performance routers, complex topologies become practical, and as router technology continues to improve, even more complex, high-dimensionality topologies become feasible. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Fault Tolerant Routing&amp;lt;/h1&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Fault-tolerant routing means the successful routing of messages between any pair of non-faulty nodes in the presence of faulty components&amp;lt;sup&amp;gt;6&amp;lt;/sup&amp;gt;. With the increasing number of processors in multiprocessor systems and the high data rates involved, reliable transmission of data in the event of a network fault is of great concern; hence fault-tolerant routing algorithms are important.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Fault Models&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Faults in a network can be categorized in two types:&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
1. '''Transient Faults'''&amp;lt;sup&amp;gt;5&amp;lt;/sup&amp;gt;: A transient fault is a temporary fault that lasts for a very short duration. Such a fault can be caused, for example, by a change in the output of a flip-flop that leads to the generation of an invalid header. These faults can be minimized using error-control coding, and they are generally evaluated in terms of bit error rate.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
2. '''Permanent Faults'''&amp;lt;sup&amp;gt;5&amp;lt;/sup&amp;gt;: A permanent fault is a fault that does not go away and causes permanent damage to the network, for example due to damaged wires and associated circuitry. These faults are generally evaluated in terms of mean time between failures.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Fault Tolerance Mechanisms (for permanent faults)&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Permanent faults can be handled using one of two mechanisms:&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
1. '''Static Mechanism''': In the static fault-tolerance model, once a fault is detected, all the processes running in the system are stopped and the routing tables are emptied. Based on the fault information, the routing tables are then recalculated to provide fault-free paths.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
2. '''Dynamic Mechanisms''': In the dynamic fault-tolerance model, the operation of the processes in the network is not completely stalled; only the affected regions are repaired. Some of the methods to do this are:&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
a. '''Block Faults''': In this method, many of the healthy nodes in the vicinity of the faulty nodes are also marked as faulty so that no routes are created close to the actual faulty nodes. The shape of the blocked region can be convex or non-convex, and it is ensured that none of the new routes introduce a cyclic dependency in the channel dependency graph (CDG).&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
DISADVANTAGE: This method causes a large number of healthy nodes to be declared faulty, leading to a reduction in system capacity.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Fault_pic1.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
b. '''Fault Rings''': This method was introduced by Chalasani and Boppana. A fault ring is the set of nodes and links that are adjacent to the faulty nodes/links. This approach reduces the number of healthy nodes that must be marked as faulty and blocked.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Fault_pic2.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
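&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The difference between the two dynamic methods can be sketched on a small 2-D mesh. In the Python fragment below, the mesh size, the fault positions, and the use of diagonal adjacency for the ring are all assumptions made for illustration: block-fault marking would take the ring and the faulty nodes out of service, whereas fault-ring routing keeps the ring usable as a detour around the fault region.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
SIZE = 8&lt;br /&gt;
NODES = {(x, y) for x in range(SIZE) for y in range(SIZE)}&lt;br /&gt;
&lt;br /&gt;
def ring_around(faulty):&lt;br /&gt;
    # healthy nodes orthogonally or diagonally adjacent to any faulty node&lt;br /&gt;
    ring = set()&lt;br /&gt;
    for (x, y) in faulty:&lt;br /&gt;
        for dx in (-1, 0, 1):&lt;br /&gt;
            for dy in (-1, 0, 1):&lt;br /&gt;
                n = (x + dx, y + dy)&lt;br /&gt;
                if n in NODES and n not in faulty:&lt;br /&gt;
                    ring.add(n)&lt;br /&gt;
    return ring&lt;br /&gt;
&lt;br /&gt;
faulty = {(3, 3), (3, 4)}&lt;br /&gt;
ring = ring_around(faulty)&lt;br /&gt;
# Block-fault marking would disable ring | faulty; fault-ring routing&lt;br /&gt;
# marks only 'faulty' as unusable and routes detours along 'ring'.&lt;br /&gt;
print(sorted(ring))&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;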
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;References&amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
1 Yan Solihin, ''Fundamentals of Parallel Computer Architecture''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
2 [http://www.ssrc.ucsc.edu/Papers/hospodor-mss04.pdf Interconnection Architectures for Petabyte-Scale High-Performance Storage Systems]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
3 [http://www.diit.unict.it/~vcatania/COURSES/semm_05-06/DOWNLOAD/noc_routing02.pdf The Odd-Even Turn Model for Adaptive Routing]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
4 [http://www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf Interconnection Topologies:(Historical Trends and Comparisons)]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
5 [http://dspace.upv.es/xmlui/bitstream/handle/10251/2603/tesisUPV2824.pdf?sequence=1 Efficient mechanisms to provide fault tolerance in interconnection networks for PC clusters, José Miguel Montañana Aliaga.]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
6 [http://web.ebscohost.com.www.lib.ncsu.edu:2048/ehost/pdfviewer/pdfviewer?vid=2&amp;amp;hid=15&amp;amp;sid=72e3828d-3cb1-42b9-8198-5c1e974ea53f@sessionmgr4 Adaptive Fault Tolerant Routing Algorithm for Tree-Hypercube Multicomputer, Qatawneh Mohammad]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
7 [http://www.top500.org TOP500 Supercomputing Sites]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
8 [http://cial.csie.ncku.edu.tw/presentation/group_pdf/%5BY2005%5DPerformance%20Evaluation%20and%20Design%20Trade-Offs%20For%20Network-on-Chip%20Interconnect%20Architectures.pdf Interconnection Network Topology Tradeoffs]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
9 [http://www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf Evolution of Interconnection Networks]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
10 [http://www2.engr.arizona.edu/~hpcat/papers/jlt94.pdf Hypercube pros and cons]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
11 [http://courses.csail.mit.edu/6.896/spring04/handouts/papers/fat_trees.pdf History of the Fat Tree Topology]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
12 [http://books.google.com/books?id=uyAg3zu_DYMC&amp;amp;pg=PA75&amp;amp;lpg=PA75&amp;amp;dq=butterfly+interconnection+network+topology&amp;amp;source=bl&amp;amp;ots=5tQm-TVTu0&amp;amp;sig=oKMwSlpqDJqopnYrKv0BbA8k5uU&amp;amp;hl=en&amp;amp;ei=5cmrTYezINDQiAL9uKjvDA&amp;amp;sa=X&amp;amp;oi=book_result&amp;amp;ct=result&amp;amp;resnum=8&amp;amp;ved=0CFMQ6AEwBw  Pros and Cons of the Butterfly Interconnection Network]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
NOTE: This wiki is based on a [http://pg-server.csc.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Spring_2010/12_EC previous wiki]&lt;/div&gt;</summary>
		<author><name>Vrmanda</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch12_ar&amp;diff=45153</id>
		<title>CSC/ECE 506 Spring 2011/ch12 ar</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch12_ar&amp;diff=45153"/>
		<updated>2011-04-19T00:32:51Z</updated>

		<summary type="html">&lt;p&gt;Vrmanda: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;h1&amp;gt;Interconnection Network Architecture &amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In a multi-processor system, processors need to communicate with each other and access each other's resources. In order to route data and messages between processors, an interconnection architecture is needed.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Typically, in a multiprocessor system, message passed between processors are frequent and short&amp;lt;sup&amp;gt;1&amp;lt;/sup&amp;gt;. Therefore, the interconnection network architecture must handle messages quickly by having '''low latency''', and must handle several messages at a time and have '''high bandwidth'''. The minimum amount of data that can be transmitted in one cycle is called a '''phit'''. A phit is typically determined by to the width of the link. However, data is transferred at the granularity of link-level flow control unit called a '''flit'''. A flit worth of data can be accepted or rejected at the receiver, depending on the flow control protocol and amount of buffering available at the receiver.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In a network, a processor along with its cache and memory is considered a '''node'''. The physical wires that connect between them is called a '''link'''. A link can be unidirectional, in which data can only be sent in one direction, or bidirectional in which data can be sent in both directions. A link, the receiver and sender, make up a channel. The device that routes messages between nodes is called a router. The shape of the network, such as the number of links and routers, is called the network '''topology'''.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Various metrics that are to be considered in coming up with a decision on the interconnection networks are the following factors:&lt;br /&gt;
&amp;lt;p&amp;gt;1.&amp;lt;b&amp;gt; Diameter &amp;lt;/b&amp;gt;: The maximum distance in the network is defined as the diameter. And to compute the average distance, a list of all pairs of nodes is to be made and the average of those would determine the value.&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;2. &amp;lt;b&amp;gt;Bisection Bandwidth &amp;lt;/b&amp;gt;: A network can be partitioned in any number of ways. When a network is divided in such a fashion that it has two equal partitions then the minimum number of links that need to be cut is defined as the Bisection Bandwidth. &amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;3. &amp;lt;b&amp;gt;No. of Links&amp;lt;/b&amp;gt;: The number of links is the total number of point-to-point connections (sets of wires) between pairs of nodes in the network. &amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;4. &amp;lt;b&amp;gt;Degree&amp;lt;/b&amp;gt;: The number of input/output links connected to each router is defined as the degree of the network.&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Types of Network Topologies &amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Linear Array&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_linear.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The nodes are connected linearly, as in an array. This type of topology is simple and low cost, but it has several shortcomings. First, it does not scale well: the longest distance between two nodes, or the '''diameter''', grows linearly with the number of nodes (it is p-1 for p nodes). This topology can also result in high congestion.  [[#References|&amp;lt;sup&amp;gt;[8]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In a linear array interconnection network with '''p''' nodes, the total number of &amp;lt;b&amp;gt;links&amp;lt;/b&amp;gt; would be &amp;lt;b&amp;gt; p-1 &amp;lt;/b&amp;gt; and the &amp;lt;b&amp;gt;degree&amp;lt;/b&amp;gt; of the network would be &amp;lt;b&amp;gt;2&amp;lt;/b&amp;gt;. The aggregate bandwidth is &amp;lt;b&amp;gt; p-1 &amp;lt;/b&amp;gt; times the link bandwidth, but the bisection bandwidth is equal to only one link bandwidth. Since global communication must always travel through the single link at the bisection, the bisection bandwidth summarizes the bandwidth characteristics of this network better than the aggregate bandwidth.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Ring&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_ring.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The ring has a similar structure to the linear array, except that the end nodes connect to each other, establishing a circular structure. This topology scales somewhat better, since the longest distance between two nodes is cut in half, but it still ends up scaling poorly if enough nodes are added. The congestion is also cut in half, since there are now two directions in which packets can traverse the network.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
In a ring interconnection network with &amp;lt;b&amp;gt;p&amp;lt;/b&amp;gt; nodes, the total number of links in the network would be &amp;lt;b&amp;gt;p&amp;lt;/b&amp;gt; and the degree of the interconnection network is &amp;lt;b&amp;gt;2&amp;lt;/b&amp;gt;. The maximum distance between nodes is the distance between nodes that are halfway around the ring, hence the diameter would be &amp;lt;b&amp;gt; p/2 &amp;lt;/b&amp;gt;. The bisection bandwidth of the ring would thus be 2 link bandwidths.&lt;br /&gt;
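&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The figures quoted above for the linear array and the ring can be written out as small functions of p, which makes the scaling behaviour easy to see. The Python sketch below simply restates the formulas in the two preceding paragraphs; the function names are illustrative.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
def linear_array(p):&lt;br /&gt;
    return {'links': p - 1, 'degree': 2, 'diameter': p - 1,&lt;br /&gt;
            'bisection_links': 1}&lt;br /&gt;
&lt;br /&gt;
def ring(p):&lt;br /&gt;
    return {'links': p, 'degree': 2, 'diameter': p // 2,&lt;br /&gt;
            'bisection_links': 2}&lt;br /&gt;
&lt;br /&gt;
for p in (8, 64, 1024):&lt;br /&gt;
    print(p, linear_array(p), ring(p))&lt;br /&gt;
# Doubling p doubles the diameter of both topologies, which is why&lt;br /&gt;
# neither scales well to large machines.&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;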
&lt;br /&gt;
&amp;lt;h2&amp;gt;2-D Mesh&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_2Dmesh.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The 2-D mesh can be thought of as several linear arrays put together to form a two-dimensional structure. Nodes that are not on an edge have 4 input/output links, or a '''degree''' of 4. This topology has reasonably low energy dissipation without compromising throughput. The area overhead is also rather low: since a neighboring node is never farther than a clock cycle away, there is no need for repeater insertion between the nodes (repeaters have a high area overhead, so using them increases the area of the topology). [[#References|&amp;lt;sup&amp;gt;[8]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Cube&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_cube.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The cube can be thought of as a three-dimensional mesh. This topology continues to improve upon the throughput of the 2-D mesh; however, the power dissipation is slightly higher.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Hypercube&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_hypercube.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The hypercube is essentially multiple cubes put together. The topology was developed in 1983; originally there were no routers, and the next hops were simply programmed into each node. Today the hypercube topology is used by many companies, including Intel. It is attractive because of its small diameter: the nodes are numbered in such a way that the addresses of neighboring nodes differ in only one bit, which greatly simplifies routing messages through the network. The biggest drawback of the topology is its lack of scalability; for example, if the dimension is increased by one, a link must be added to every node in the network. [[#References|&amp;lt;sup&amp;gt;[10]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
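&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The one-bit-difference numbering described above is what makes hypercube routing simple: a route can be built by flipping the differing address bits one at a time. The Python sketch below assumes a 4-dimensional hypercube purely for illustration.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
DIM = 4   # a 4-dimensional hypercube with 16 nodes (assumption)&lt;br /&gt;
&lt;br /&gt;
def hypercube_route(src, dst):&lt;br /&gt;
    path, node = [src], src&lt;br /&gt;
    for k in range(DIM):&lt;br /&gt;
        if (src // (2 ** k)) % 2 != (dst // (2 ** k)) % 2:&lt;br /&gt;
            node = node ^ (2 ** k)     # flip bit k: move to that neighbor&lt;br /&gt;
            path.append(node)&lt;br /&gt;
    return path&lt;br /&gt;
&lt;br /&gt;
print(hypercube_route(0b0000, 0b1011))&lt;br /&gt;
# [0, 1, 3, 11]: three differing bits, so exactly three hops&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;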
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;2-D Torus&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_2Dtorus.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The 2-D torus takes the structure of the 2-D mesh and connects the nodes on opposite edges. This decreases the diameter, but the number of links is higher. Thus it slightly improves upon the throughput of the 2-D mesh, but it also slightly increases the power dissipation relative to the 2-D mesh, since the number of links is higher. The wrap-around routes connecting the edge nodes can have excessively high delays if the topology is not laid out carefully. This topology was developed in 1985 because of the design constraints, such as pins and bisection, that the hypercube required. [[#References|&amp;lt;sup&amp;gt;[9]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Tree&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_tree.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The tree has a hierarchical structure, with processing nodes at the bottom and switching nodes at the upper levels. The tree experiences high traffic and high latency when packets must travel through the upper levels. Also, because of the high connectivity, this topology has high average energy dissipation.[[#References|&amp;lt;sup&amp;gt;[8]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Fat Tree&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_fat_tree.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
In 1985, an MIT professor invented the fat tree to improve upon the normal “skinny” tree. The fat tree “fattens” up the links at the upper levels, which helps to alleviate the traffic there and decreases message latency. However, fattening means that additional links are added in this area, which increases the average energy dissipated by the topology. [[#References|&amp;lt;sup&amp;gt;[11]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Butterfly&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_butterfly.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Another attempt to improve upon the “skinny” tree structure is the butterfly structure. The butterfly is similar to the tree, but it replicates the switching-node structure of the tree topology and connects the copies together so that there are equal numbers of links and routers at all levels. There are two problems with this topology. First, there is no path diversity: there is only one path from the root to a downstream node. This is not ideal in case the network is congested in one area but has capacity available in another, since there is no way for the network to rebalance the load. Second, some routes in this topology are very long, which requires repeaters between nodes and causes the physical area needed to implement the network to increase dramatically. [[#References|&amp;lt;sup&amp;gt;[12]&amp;lt;/sup&amp;gt;]] &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Comparison of Network Topologies &amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The following table shows the total number of ports required for each network topology. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_ports.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Number of ports for each topology''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
As the figure above shows, the 6-D hypercube requires the largest number of ports, due to its relatively complex six-dimensional structure. In contrast, the fat tree requires the fewest ports, even though its links have been &amp;quot;fattened&amp;quot; up by using redundant links. The butterfly network requires more than twice as many ports as the fat tree, since it essentially replicates the switching layer of the fat tree. The number of ports for the mesh and torus structures increases as the dimensionality increases.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Below, the average path length (average number of hops) and the average link load (GB/s) are shown.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_load.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Average path length and link load for each topology''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Looking at the trends, when the average path length is high, the average link load is also high; in other words, average path length and average link load are proportionally related. It is obvious from the graph that the 2-D mesh has, by far, the worst performance. In a large network such as this, the average path length is simply too high, and the average link load suffers; the 2-D mesh does not scale well to this type of high-performance network. The 2-D torus cuts the average path length and average link load in half by connecting the edge nodes together, but its performance is still relatively poor compared to the other topologies. The butterfly and fat tree have the lowest average path length and average link load. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The figure below shows the cost of the network topologies.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_cost.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Cost of each topology''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Despite using the fewest ports, the fat tree topology has by far the highest cost. Although it uses the fewest ports, those ports are high-bandwidth 10 GB/s ports, and over 2400 of them are required to provide enough bandwidth at the upper levels of the tree. This pushes the cost up dramatically and makes the fat tree impractical from a cost standpoint. While the total cost of the fat tree is about 15 million dollars, the rest of the network topologies are clustered below 4 million dollars. As the dimensionality of the mesh and torus structures increases, the cost increases. The butterfly network costs more than the 2-D mesh/torus but less than the 6-D hypercube. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
When the cost and the average link load are factored together, the following graph is produced.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_overall.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Overall cost of each topology''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
From the figure above, the 6-D hypercube is the most cost-effective choice for this particular network setup. Although the 6-D hypercube costs more because it needs more links and ports, it provides higher bandwidth, which can offset the higher cost. The high-dimensional tori also perform well, but cannot provide as much bandwidth as the 6-D hypercube; for systems that do not need as much bandwidth, they are also a good choice. The butterfly topology is another alternative, but it has lower fault tolerance. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Real-World Implementation of Network Topologies &amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In a research study by Andy Hospodor and Ethan Miller, several network topologies were investigated for a high-performance, high-traffic network&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;. The topologies studied include the fat tree, butterfly, mesh, torus, and hypercube structures, and their advantages and disadvantages in terms of cost, performance, and reliability were discussed. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In this experiment, a petabyte-scale network with over 100 GB/s of total aggregate bandwidth was investigated. The network consisted of 4096 disks attached to large servers, with routers and switches in between&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The overall structure of the network is shown below. Note that this structure is very susceptible to failure and congestion.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_network.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Basic structure of Hospodor and Miller's experimental network''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Fat Tree&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In large-scale, high-performance applications, the fat tree can be a choice. However, in order to &amp;quot;fatten&amp;quot; up the links, redundant connections must be used: instead of one link between switching nodes, several must be used. The problem is that with more input and output links, routers with more input and output ports are needed. Routers with in excess of 100 ports are difficult to build and expensive, so multiple routers would have to be stacked together. Even so, many routers would be required, and they would remain expensive&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The Japan Agency for Marine-Earth Science and Technology supercomputing system uses the fat tree topology. The system connects 1280 NEC processors&amp;lt;sup&amp;gt;7&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Butterfly&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In high-performance applications, the butterfly structure is a good choice. The butterfly topology uses fewer links than other topologies; however, each link carries the traffic of an entire layer. Fault tolerance is poor: there is only a single path between any pair of nodes, so should a link break, data cannot be re-routed and communication is broken&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_butterfly.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Butterfly structure''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Meshes and Tori&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The mesh and torus structures used in this application would require a large number of links and an aggregate of several thousand ports. However, since there are so many links, the mesh and torus structures provide alternate paths in case of failures&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
One example of a current use of the torus structure is the QPACE SFB TR cluster in Germany, built from PowerXCell 8i processors. The system uses a 3-D torus topology with 4608 processors&amp;lt;sup&amp;gt;7&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_mesh.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Mesh structure''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_torus.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Torus structure''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Hypercube&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Similar to the torus structures, the hypercube requires a larger number of links. However, its bandwidth scales better than that of the mesh and torus structures. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The Cray T3E, Cray XT3, and SGI Origin 2000 use k-ary n-cube topologies.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_hypercube.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Hypercube structure''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Routing&amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The '''routing''' algorithm determines what path a packet of data takes from source to destination. Routing can be '''deterministic''', where the path is always the same for a given source and destination, or '''adaptive''', where the path can change. A routing algorithm can also be '''partially adaptive''', where packets have multiple path choices but not all packets are allowed to use the shortest path&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Deadlock&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Packets are in '''deadlock''' when they cannot continue to move through the nodes. The illustration below demonstrates this situation. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_deadlock.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Example of deadlock''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Assume that all of the buffers are full at each node. The packet from Node 1 cannot continue to Node 2, the packet from Node 2 cannot continue to Node 3, and so on. Since no packet can move, the network is deadlocked. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The deadlock arises from a cyclic pattern of routing; to avoid deadlock, circular routing patterns must be avoided.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To avoid circular routing patterns, certain turns are disallowed. These '''turn restrictions''' eliminate the turns that could combine into a routing cycle. Some common turn restriction schemes are described below.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;h2&amp;gt;Dimensional ordered (X-Y) routing&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Turns from the y-dimension to the x-dimension are not allowed.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;West First&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Turns to the west are not allowed.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;North Last&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Turns after a north direction are not allowed. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Negative First&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Turns in the negative direction (-x or -y) are not allowed, except on the first turn.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Odd-Even Turn Model&amp;lt;/h2&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Unfortunately, the above turn-restriction models reduce the degree of adaptiveness and are only partially adaptive. They force some packets onto different, not necessarily minimal, routes. This can cause unfairness and reduces the system's ability to relieve congestion, so overall performance can suffer&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Ge-Ming Chiu introduced the Odd-Even turn model as an adaptive, deadlock-free turn-restriction model that performs better than the previously mentioned models&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;. The model is designed primarily for 2-D meshes.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
''Turns from the east to north direction from any node on an even column are not allowed.''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Turns from the north to west direction from any node on an odd column are not allowed.''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Turns from the east to south direction from any node on an even column are not allowed.''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Turns from the south to west direction from any node on an odd column are not allowed.''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The illustration below shows allowed routing for different source and destination nodes. Depending on which column the packet is in, only certain directions are allowed. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_odd_even.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Odd-Even turn restriction model proposed by Ge-Ming Chiu''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Comparison of Turn Restriction Models&amp;lt;/h1&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
To evaluate the performance of the various turn restriction models, Chiu simulated a 15 x 15 mesh under several traffic patterns. All channels have a bandwidth of 20 flits/usec and a buffer size of one flit. The dimension-ordered x-y routing, west-first, and negative-first models were compared against the odd-even model. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Uniform, transpose, and hot-spot traffic patterns were simulated. Uniform traffic simulates each node sending messages to any other node with equal probability. Transpose traffic simulates two opposite nodes sending messages to their respective halves of the mesh. Hot-spot traffic simulates a few &amp;quot;hot spot&amp;quot; nodes that receive high traffic.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_uniform.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Uniform traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The performance of the different routing algorithms under uniform traffic is shown above. For uniform traffic, the dimension-ordered x-y model outperforms the rest of the models: as the number of messages increases, the x-y model has the slowest increase in average communication latency. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_transpose.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''First transpose traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The performance of the different routing algorithms is shown above for the first transpose traffic. The negative-first model has the best performance, while the odd-even model performs better than the west-first and x-y models.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_transpose2.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Second transpose traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
With the second transpose simulation, the odd-even model outperforms the rest.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_hotspot.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Hotspot traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The performance of the different routing algorithms under hotspot traffic is shown above. Only one hotspot was simulated for this test. The odd-even model outperforms the other models when hotspot traffic is 10%.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_hotspot2.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Second hotspot traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
When the number of hotspots is increased to five, the odd-even model begins to shine: its latency is the lowest at both 6 and 8 percent hotspot traffic. Meanwhile, the x-y model performs very poorly. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
While the x-y model performs well under uniform traffic, it lacks adaptiveness. When the traffic contains hotspots, the x-y model suffers from its inability to adapt and re-route traffic around the resulting congestion. The odd-even model shows superior adaptiveness under high congestion. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Router Architecture&amp;lt;/h1&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The '''router''' is the device that routes incoming data toward its destination. It has several input ports and several output ports; data arriving on one of the input ports is forwarded to one of the output ports. Which output port is chosen depends on the destination of the data and on the routing algorithm. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The internal architecture of a router consists of input ports, output ports, and a '''crossbar switch'''. The crossbar switch connects each input port to the selected output port, acting essentially as a set of multiplexers. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Router technology has improved significantly over the years, which has allowed networks with high dimensionality to become feasible. As the real-world example above shows, high-dimensional tori and hypercubes are excellent choices of topology for high-performance networks. The decreasing cost of high-performance, high-radix routers has contributed to the viability of these high-dimensionality networks. As the graph below shows, the bandwidth of routers has improved tremendously over a period of 10 years&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Router_bandwidth.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Bandwidth of various routers over 10 year period''&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Looking at the physical architecture and layout of routers, it is evident that the circuitry has become dramatically denser and more complex.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Router_physical.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Router hardware over period of time''&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Router_radix.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Radix and latency of routers over 10 year period''&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The '''radix''', or number of ports, of routers has also increased. Current technology not only has a higher radix but also lower latency than the previous generation, and as the radix increases the latency remains steady. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
With high-performance routers, complex topologies become practical, and as router technology continues to improve, even more complex, high-dimensionality topologies become feasible. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Fault Tolerant Routing&amp;lt;/h1&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Fault-tolerant routing means the successful routing of messages between any pair of non-faulty nodes in the presence of faulty components&amp;lt;sup&amp;gt;6&amp;lt;/sup&amp;gt;. With the increasing number of processors in multiprocessor systems and ever higher data rates, reliable transmission of data in the event of a network fault is a major concern, and hence fault-tolerant routing algorithms are important.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Fault Models&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Faults in a network can be categorized in two types:&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
1.'''Transient Faults'''&amp;lt;sup&amp;gt;5&amp;lt;/sup&amp;gt;: A transient fault is a temporary fault that occurs for a very short duration of time. Such a fault can be caused by a change in the output of a flip-flop leading to the generation of an invalid header. These faults can be minimized using error-controlled coding, and they are generally evaluated in terms of Bit Error Rate.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
2.'''Permanent Faults'''&amp;lt;sup&amp;gt;5&amp;lt;/sup&amp;gt;: A permanent fault is a fault that does not go away and causes permanent damage to the network. Such a fault can be due to damaged wires and associated circuitry. These faults are generally evaluated in terms of Mean Time Between Failures.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Fault Tolerance Mechanisms (for permanent faults)&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Permanent faults can be handled using one of two mechanisms:&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
1.'''Static Mechanism''': In the static fault-tolerance model, once a fault is detected, all processes running in the system are stopped and the routing tables are emptied. Based on the fault information, the routing tables are then recalculated to provide fault-free paths.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
2.'''Dynamic Mechanisms''': In the dynamic fault-tolerance model, operation of the processes in the network is not completely stalled; only the affected regions are repaired. Some of the methods to do this are:&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
a.'''Block Faults''': In this method, many of the healthy nodes in the vicinity of the faulty nodes are also marked as faulty so that no routes are created close to the actual faulty nodes. The shape of the blocked region can be convex or non-convex, and it is ensured that none of the new routes introduce a cyclic dependency in the channel dependency graph (CDG).&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
DISADVANTAGE: This method causes a lot of healthy nodes to be declared faulty, leading to a reduction in system capacity.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Fault_pic1.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
b.'''Fault Rings''': This method was introduced by Chalasani and Boppana. A fault ring is the set of nodes and links that are adjacent to faulty nodes/links. This approach reduces the number of healthy nodes that must be marked as faulty and blocked.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Fault_pic2.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
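To make the idea concrete, here is a rough Python sketch (not taken from the cited papers) that collects the healthy nodes surrounding a faulty region of a 2-D mesh; these surrounding nodes are what a fault ring consists of. The coordinate scheme and function name are illustrative assumptions.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Sketch: gather the healthy nodes adjacent (including diagonally) to a&lt;br /&gt;
# set of faulty nodes in a width x height mesh -- they form the fault ring.&lt;br /&gt;
def fault_ring(faulty, width, height):&lt;br /&gt;
    ring = set()&lt;br /&gt;
    for (fx, fy) in faulty:&lt;br /&gt;
        for dx in (-1, 0, 1):&lt;br /&gt;
            for dy in (-1, 0, 1):&lt;br /&gt;
                nx, ny = fx + dx, fy + dy&lt;br /&gt;
                inside = nx in range(width) and ny in range(height)&lt;br /&gt;
                if inside and (nx, ny) not in faulty:&lt;br /&gt;
                    ring.add((nx, ny))&lt;br /&gt;
    return ring&lt;br /&gt;
&lt;br /&gt;
# A single faulty node (2, 2) in a 5 x 5 mesh: its eight surrounding nodes&lt;br /&gt;
# form the ring and stay usable instead of being blocked as in block faults.&lt;br /&gt;
print(sorted(fault_ring({(2, 2)}, 5, 5)))&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;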
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;References&amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
1 Solihin text&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
2 [http://www.ssrc.ucsc.edu/Papers/hospodor-mss04.pdf Interconnection Architectures for Petabyte-Scale High-Performance Storage Systems]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
3 [http://www.diit.unict.it/~vcatania/COURSES/semm_05-06/DOWNLOAD/noc_routing02.pdf The Odd-Even Turn Model for Adaptive Routing]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
4 [http://www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf Interconnection Topologies:(Historical Trends and Comparisons)]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
5 [http://dspace.upv.es/xmlui/bitstream/handle/10251/2603/tesisUPV2824.pdf?sequence=1 Efficient mechanisms to provide fault tolerance in interconnection networks for PC clusters, José Miguel Montañana Aliaga.]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
6 [http://web.ebscohost.com.www.lib.ncsu.edu:2048/ehost/pdfviewer/pdfviewer?vid=2&amp;amp;hid=15&amp;amp;sid=72e3828d-3cb1-42b9-8198-5c1e974ea53f@sessionmgr4 Adaptive Fault Tolerant Routing Algorithm for Tree-Hypercube Multicomputer, Qatawneh Mohammad]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
7 [http://www.top500.org TOP500 Supercomputing Sites]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
8 [http://cial.csie.ncku.edu.tw/presentation/group_pdf/%5BY2005%5DPerformance%20Evaluation%20and%20Design%20Trade-Offs%20For%20Network-on-Chip%20Interconnect%20Architectures.pdf Interconnection Network Topology Tradeoffs]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
9 [http://www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf Evolution of Interconnection Networks]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
10 [http://www2.engr.arizona.edu/~hpcat/papers/jlt94.pdf Hypercube pros and cons]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
11 [http://courses.csail.mit.edu/6.896/spring04/handouts/papers/fat_trees.pdf History of the Fat Tree Topology]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
12 [http://books.google.com/books?id=uyAg3zu_DYMC&amp;amp;pg=PA75&amp;amp;lpg=PA75&amp;amp;dq=butterfly+interconnection+network+topology&amp;amp;source=bl&amp;amp;ots=5tQm-TVTu0&amp;amp;sig=oKMwSlpqDJqopnYrKv0BbA8k5uU&amp;amp;hl=en&amp;amp;ei=5cmrTYezINDQiAL9uKjvDA&amp;amp;sa=X&amp;amp;oi=book_result&amp;amp;ct=result&amp;amp;resnum=8&amp;amp;ved=0CFMQ6AEwBw  Pros and Cons of the Butterfly Interconnection Network]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
NOTE: This wiki is based on a [http://pg-server.csc.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Spring_2010/12_EC previous wiki]&lt;/div&gt;</summary>
		<author><name>Vrmanda</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch12_ar&amp;diff=45151</id>
		<title>CSC/ECE 506 Spring 2011/ch12 ar</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch12_ar&amp;diff=45151"/>
		<updated>2011-04-19T00:32:00Z</updated>

		<summary type="html">&lt;p&gt;Vrmanda: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;h1&amp;gt;Interconnection Network Architecture &amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In a multi-processor system, processors need to communicate with each other and access each other's resources. In order to route data and messages between processors, an interconnection architecture is needed.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Typically, in a multiprocessor system, messages passed between processors are frequent and short&amp;lt;sup&amp;gt;1&amp;lt;/sup&amp;gt;. Therefore, the interconnection network architecture must handle messages quickly by having '''low latency''', and must handle several messages at a time, i.e., have '''high bandwidth'''. The minimum amount of data that can be transmitted in one cycle is called a '''phit'''. A phit is typically determined by the width of the link. However, data is transferred at the granularity of the link-level flow control unit, called a '''flit'''. A flit worth of data can be accepted or rejected at the receiver, depending on the flow control protocol and the amount of buffering available at the receiver.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In a network, a processor along with its cache and memory is considered a '''node'''. The physical wires that connect two nodes are called a '''link'''. A link can be unidirectional, in which data can only be sent in one direction, or bidirectional, in which data can be sent in both directions. A link, together with its receiver and sender, makes up a channel. The device that routes messages between nodes is called a router. The shape of the network, such as the number of links and routers, is called the network '''topology'''.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Various metrics to consider when deciding on an interconnection network are the following:&lt;br /&gt;
&amp;lt;p&amp;gt;1.&amp;lt;b&amp;gt; Diameter &amp;lt;/b&amp;gt;: The maximum distance between any pair of nodes in the network is defined as the diameter. To compute the average distance, list all pairs of nodes and take the average of their distances.&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;2. &amp;lt;b&amp;gt;Bisection Bandwidth &amp;lt;/b&amp;gt;: A network can be partitioned in any number of ways. When a network is divided in such a fashion that it has two equal partitions then the minimum number of links that need to be cut is defined as the Bisection Bandwidth. &amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;3. &amp;lt;b&amp;gt; No. of Links&amp;lt;/b&amp;gt;: The number of links is the total count of connections in the network, where each link is the set of wires joining two different nodes. &amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;4.&amp;lt;b&amp;gt; Degree&amp;lt;/b&amp;gt; : Number of input/output links connecting to each router is defined as the degree of the network.&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Types of Network Topologies &amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Linear Array&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_linear.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The nodes are connected linearly as in an array. This type of topology is simple and low cost. However, it has several shortcomings.  First of all, it does not scale well: the longest distance between two nodes, or the '''diameter''', grows linearly with the number of nodes (p-1 for p nodes).  In addition to not scaling well, this topology can also result in high congestion.  [[#References|&amp;lt;sup&amp;gt;[8]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In a linear array interconnection network with '''p''' nodes, the total number of &amp;lt;b&amp;gt;links&amp;lt;/b&amp;gt; would be &amp;lt;b&amp;gt; p-1 &amp;lt;/b&amp;gt; and the &amp;lt;b&amp;gt;degree&amp;lt;/b&amp;gt; of the network would be &amp;lt;b&amp;gt;2 &amp;lt;/b&amp;gt;. The total link bandwidth is &amp;lt;b&amp;gt; p-1 &amp;lt;/b&amp;gt; times the link bandwidth, but bisection bandwidth is equal to one link bandwidth. Since global communication must always travel through one link, bisection bandwidth summarizes the bandwidth characteristic of the network better than the aggregate bandwidth.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Ring&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_ring.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The ring has a similar structure to the linear array, except that the end nodes connect to each other, establishing a circular structure. This topology scales better since the longest distance between two nodes is cut in half, but it still ends up scaling poorly if enough nodes are added.  Congestion is also cut in half since there are now 2 directions a packet can traverse.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
In a ring interconnection network with &amp;lt;b&amp;gt;p&amp;lt;/b&amp;gt; nodes, the total number of links in the network would be &amp;lt;b&amp;gt;p&amp;lt;/b&amp;gt; and the degree of the interconnection network is &amp;lt;b&amp;gt;2&amp;lt;/b&amp;gt;. The maximum distance between nodes in the network is the distance between nodes that are half way around the ring, hence the diameter would be &amp;lt;b&amp;gt; p/2 &amp;lt;/b&amp;gt;. The bisection bandwidth of the ring is thus 2 link bandwidths.&lt;br /&gt;
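As a quick sanity check of these formulas, the short Python sketch below computes the metrics discussed so far (links, degree, diameter, bisection width) for a linear array and a ring of p nodes; the function names are only illustrative.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Metrics for a linear array and a ring of p nodes, following the&lt;br /&gt;
# conventions above (diameter in hops, bisection width in links cut).&lt;br /&gt;
def linear_array_metrics(p):&lt;br /&gt;
    return {'links': p - 1, 'degree': 2, 'diameter': p - 1, 'bisection': 1}&lt;br /&gt;
&lt;br /&gt;
def ring_metrics(p):&lt;br /&gt;
    return {'links': p, 'degree': 2, 'diameter': p // 2, 'bisection': 2}&lt;br /&gt;
&lt;br /&gt;
print(linear_array_metrics(8))   # {'links': 7, 'degree': 2, 'diameter': 7, 'bisection': 1}&lt;br /&gt;
print(ring_metrics(8))           # {'links': 8, 'degree': 2, 'diameter': 4, 'bisection': 2}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;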
&lt;br /&gt;
&amp;lt;h2&amp;gt;2-D Mesh&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_2Dmesh.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The 2-D mesh can be thought of as several linear arrays put together to form a 2-dimensional structure. Nodes that are not on the edge have 4 input or output links, or a '''degree''' of 4.  This topology has reasonably low energy dissipation without compromising throughput.  The area overhead for this topology is also rather low: since a node is never farther than a clock cycle away, there is no need for repeater insertion between the nodes (repeaters have a high area overhead, so using them increases the area of the topology). [[#References|&amp;lt;sup&amp;gt;[8]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Cube&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_cube.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The cube can be thought of as a three-dimensional mesh. This topology continues to improve upon the throughput of the 2-D mesh; however, the power dissipation is slightly higher.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Hypercube&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_hypercube.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The hypercube is essentially multiple cubes put together.  This topology was developed in 1983.  Originally there were no routers; the next hops were simply programmed into each node.  Today the hypercube topology is used by many companies, including Intel.  It is attractive because of its small diameter.  The nodes are numbered in such a way that the labels of any two neighboring nodes differ by only one bit, which greatly simplifies routing messages through the network. The biggest drawback of the topology is its lack of scalability:  if the dimension is increased by one, for example, a link must be added to every node in the network. [[#References|&amp;lt;sup&amp;gt;[10]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
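The one-bit-difference numbering can be illustrated with a small Python sketch (an informal illustration, not any vendor's routing code): a node's neighbors are found by flipping single bits of its label, and routing can simply correct one differing bit per hop.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# In an n-dimensional hypercube, node labels are n-bit numbers and two&lt;br /&gt;
# nodes are linked exactly when their labels differ in a single bit.&lt;br /&gt;
def neighbors(node, dimensions):&lt;br /&gt;
    # Flipping each of the n bits yields the n directly connected nodes.&lt;br /&gt;
    return [node ^ 2 ** bit for bit in range(dimensions)]&lt;br /&gt;
&lt;br /&gt;
def route(src, dst, dimensions):&lt;br /&gt;
    # Correct one differing bit per hop, so at most n hops are needed&lt;br /&gt;
    # and the diameter of the topology is n.&lt;br /&gt;
    path, node = [src], src&lt;br /&gt;
    for bit in range(dimensions):&lt;br /&gt;
        if (node ^ dst) // 2 ** bit % 2:     # this bit still differs&lt;br /&gt;
            node = node ^ 2 ** bit           # flip it: one hop&lt;br /&gt;
            path.append(node)&lt;br /&gt;
    return path&lt;br /&gt;
&lt;br /&gt;
print(neighbors(5, 3))   # node 101 links to 100, 111, 001 -- i.e. [4, 7, 1]&lt;br /&gt;
print(route(0, 5, 3))    # [0, 1, 5]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;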
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;2-D Torus&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_2Dtorus.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The 2-D torus takes the structure of the 2-D mesh and connects the nodes on the edges. This decreases the diameter, but the number of links is higher.  Thus it slightly improves upon the throughput of the 2-D mesh, but it also slightly increases the power dissipation relative to the 2-D mesh since the number of links is higher.  The routes connecting the end nodes together can have excessively high delays if the topology is not implemented carefully.  This topology was developed in 1985 because of the design constraints, such as pins and bisection, that the hypercube imposed. [[#References|&amp;lt;sup&amp;gt;[9]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Tree&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_tree.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The tree has a hierarchical structure with processing nodes at the bottom and switching nodes at the upper levels. The tree experiences high traffic and high latency when packets must travel through the upper levels.  Also, because of the high connectivity, this topology has high average energy dissipation.[[#References|&amp;lt;sup&amp;gt;[8]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Fat Tree&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_fat_tree.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
In 1985, an MIT professor invented the fat tree to improve upon the normal “skinny” tree.  The fat tree “fattens” up the links at the upper levels, which helps to alleviate the traffic at the upper levels and to decrease message latency.  However, fattening means that additional links are added in this area, which increases the average energy dissipated by this topology. [[#References|&amp;lt;sup&amp;gt;[11]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Butterfly&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_butterfly.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Another attempt to improve upon the “skinny” tree structure was the butterfly structure.  The butterfly structure is similar to the tree structure, but it replicates the switching-node structure of the tree topology and connects the copies together so that there are equal numbers of links and routers at all levels.  There are 2 problems with this topology.  First, there is no path diversity: there is only one path from the root to a downstream node.  This is not ideal in case the network is congested in a certain area but available in another, since there is no way for the network to rebalance the load.  Second, there are some very long routes in this topology.  These require repeaters between the nodes, which cause the physical area needed to implement the network to increase dramatically. [[#References|&amp;lt;sup&amp;gt;[12]&amp;lt;/sup&amp;gt;]] &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Comparison of Network Topologies &amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The following table shows the total number of ports required for each network topology. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_ports.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Number of ports for each topology''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
As the figure above shows, the 6-D hypercube requires the largest number of ports, due to its relatively complex six-dimensional structure. In contrast, the fat tree requires the fewest ports, even though its links have been &amp;quot;fattened&amp;quot; up by using redundant links. The butterfly network requires more than twice the number of ports as the fat tree, since it essentially replicates the switching layer of the fat tree. The number of ports for the mesh and torii structures increases as the dimensionality increases.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The average path length, or average number of hops, and the average link load (GB/s) are shown below.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_load.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Average path length and link load for each topology''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Looking at the trends, when the average path length is high, the average link load is also high. In other words, average path length and average link load are proportionally related. It is obvious from the graph that the 2-D mesh has, by far, the worst performance. In a large network such as this, the average path length is just too high, and the average link load suffers. For this type of high-performance network, the 2-D mesh does not scale well. The 2-D torus cuts the average path length and average link load in half by connecting the edge nodes together; however, its performance compared to the other types is still relatively poor. The butterfly and fat tree have the lowest average path length and average link load. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The figure below shows the cost of the network topologies.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_cost.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Cost of each topology''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Despite using the fewest ports, the fat tree topology has by far the highest cost. Although it uses the fewest ports, they are high-bandwidth 10 GB/s ports, and over 2400 of them are required to provide enough bandwidth at the upper levels of the tree. This pushes the cost up dramatically and makes the fat tree impractical from a cost standpoint. While its total cost is about 15 million dollars, the rest of the network topologies are clustered below 4 million dollars. As the dimensionality of the mesh and torii structures increases, their cost increases. The butterfly network costs somewhere between the 2-D mesh/torii and the 6-D hypercube. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
When the cost and average link load is factored the following graph is produced.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_overall.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Overall cost of each topology''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
From the figure above, the 6-D hypercube is the most cost-effective choice for this particular network setup. Although the 6-D hypercube costs more because it needs more links and ports, it provides higher bandwidth, which can offset the higher cost. The high-dimensional torii also perform well, but cannot provide as much bandwidth as the 6-D hypercube. For systems that do not need as much bandwidth, the high-dimensional torii are also a good choice. The butterfly topology is an alternative as well, but it has lower fault tolerance. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Real-World Implementation of Network Topologies &amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In a research study by Andy Hospodor and Ethan Miller, several network topologies were investigated in a high-performance, high-traffic network&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;. Several topologies were investigated including the fat tree, butterfly, mesh, torii, and hypercube structures. Advantages and disadvantages including cost, performance, and reliability were discussed. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In this experiment, a petabyte-scale network with over 100 GB/s total aggregate bandwidth was investigated. The network consisted of 4096 disks with large servers with routers and switches in between&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The overall structure of the network is shown below. Note that this structure is very susceptible to failure and congestion.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_network.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Basic structure of Hospodor and Miller's experimental network''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Fat Tree&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In large-scale, high-performance applications, the fat tree can be a good choice. However, in order to &amp;quot;fatten&amp;quot; up the links, redundant connections must be used: instead of one link between switching nodes, several must be used. The problem is that with more input and output links, routers with more input and output ports are needed. Routers with over 100 ports are difficult to build and expensive, so multiple routers would have to be stacked together. Even so, the routers would be expensive, and several of them would be required&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The Japan Agency for Marine-Earth Science and Technology supercomputing system uses the fat tree topology. The system connects 1280 NEC processors&amp;lt;sup&amp;gt;7&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Butterfly&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In high-performance applications, the butterfly structure is a good choice. The butterfly topology uses fewer links than other topologies; however, each link carries traffic from an entire layer. Fault tolerance is poor: there exists only a single path between any pair of nodes, so should a link break, data cannot be re-routed and communication is broken&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_butterfly.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Butterfly structure''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Meshes and Tori&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The mesh and torus structures used in this application would require a large number of links and an aggregate of several thousand ports. However, since there are so many links, the mesh and torus structures provide alternate paths in case of failures&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
An example of current use of the torus structure is the QPACE SFB TR cluster in Germany, which uses PowerXCell 8i processors. The system uses a 3-D torus topology with 4608 processors&amp;lt;sup&amp;gt;7&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_mesh.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Mesh structure''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_torus.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Torus structure''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Hypercube&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Similar to the torii structures, the hypercube requires a larger number of links. However, its bandwidth scales better than that of the mesh and torii structures. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The CRAY T3E, CRAY XT3, and SGI Origin 2000 use k-ary n-cube topologies.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_hypercube.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Hypercube structure''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Routing&amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The '''routing''' algorithm determines what path a packet of data takes from source to destination. Routing can be '''deterministic''', where the path is always the same for a given source and destination, or '''adaptive''', where the path can change. A routing algorithm can also be '''partially adaptive''', where packets have multiple path choices but not all packets are allowed to use the shortest path&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Deadlock&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Packets are in '''deadlock''' when they can no longer continue to move through the nodes. The illustration below demonstrates this situation. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_deadlock.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Example of deadlock''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Assume that all of the buffers are full at each node. The packet from Node 1 cannot continue to Node 2, the packet from Node 2 cannot continue to Node 3, and so on. Since no packet can move, they are deadlocked. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The deadlock arises from a cyclic pattern of routing; to avoid deadlock, circular routing patterns must be avoided.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To avoid circular routing patterns, certain turns are disallowed. These '''turn restrictions''' prevent packets from completing a circular route. Some common turn-restriction schemes are described below.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
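To make the circular wait concrete, the following Python sketch (purely illustrative) models which channel each blocked packet is waiting for and checks whether those waits form a cycle; a cycle in this dependency graph is exactly the deadlock pictured above.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Sketch: waits_for maps a channel to the channels its packet wants next.&lt;br /&gt;
# A cycle in this "waits-for" graph means the packets can never drain.&lt;br /&gt;
def has_deadlock(waits_for):&lt;br /&gt;
    visited, on_stack = set(), set()&lt;br /&gt;
&lt;br /&gt;
    def dfs(channel):&lt;br /&gt;
        visited.add(channel)&lt;br /&gt;
        on_stack.add(channel)&lt;br /&gt;
        for nxt in waits_for.get(channel, []):&lt;br /&gt;
            if nxt in on_stack:&lt;br /&gt;
                return True                  # found a cycle&lt;br /&gt;
            if nxt not in visited and dfs(nxt):&lt;br /&gt;
                return True&lt;br /&gt;
        on_stack.discard(channel)&lt;br /&gt;
        return False&lt;br /&gt;
&lt;br /&gt;
    return any(dfs(c) for c in waits_for if c not in visited)&lt;br /&gt;
&lt;br /&gt;
# Four channels each waiting on the next one, as in the figure: deadlock.&lt;br /&gt;
print(has_deadlock({'c1': ['c2'], 'c2': ['c3'], 'c3': ['c4'], 'c4': ['c1']}))&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;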
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;h2&amp;gt;Dimensional ordered (X-Y) routing&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Turns from the y-dimension to the x-dimension are not allowed.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
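For example, here is a minimal Python sketch of dimension-ordered (x-y) routing on a 2-D mesh (coordinates and function name are illustrative): the packet is routed completely in the x-dimension first and only then in the y-dimension, so a y-to-x turn can never occur.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Dimension-ordered (x-y) routing: correct the x offset first, then y.&lt;br /&gt;
# Since a packet never returns to the x dimension after moving in y,&lt;br /&gt;
# the turns that could close a routing cycle are ruled out by construction.&lt;br /&gt;
def xy_next_hop(current, destination):&lt;br /&gt;
    (cx, cy), (dx, dy) = current, destination&lt;br /&gt;
    if cx != dx:&lt;br /&gt;
        step = (dx - cx) // abs(dx - cx)     # +1 east, -1 west&lt;br /&gt;
        return (cx + step, cy)&lt;br /&gt;
    if cy != dy:&lt;br /&gt;
        step = (dy - cy) // abs(dy - cy)     # +1 north, -1 south&lt;br /&gt;
        return (cx, cy + step)&lt;br /&gt;
    return current                           # already at the destination&lt;br /&gt;
&lt;br /&gt;
hop = (0, 0)&lt;br /&gt;
while hop != (3, 2):                         # walk from (0,0) to (3,2)&lt;br /&gt;
    hop = xy_next_hop(hop, (3, 2))&lt;br /&gt;
    print(hop)                               # x moves first, then y&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;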
&lt;br /&gt;
&amp;lt;h2&amp;gt;West First&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Turns to the west are not allowed.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;North Last&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Turns after a north direction are not allowed. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Negative First&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Turns in the negative direction (-x or -y) are not allowed, except on the first turn.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Odd-Even Turn Model&amp;lt;/h2&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Unfortunately, the above turn-restriction models reduce the degree of adaptiveness and are only partially adaptive. They force some packets to take different routes, not necessarily the minimal paths. This can cause unfairness and reduces the system's ability to relieve congestion, so overall performance can suffer&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Ge-Ming Chiu introduced the odd-even turn model as a partially adaptive, deadlock-free turn-restriction model that performs better than the previously mentioned models&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;. The model is designed primarily for 2-D meshes.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
''Turns from the east to north direction from any node on an even column are not allowed.''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Turns from the north to west direction from any node on an odd column are not allowed.''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Turns from the east to south direction from any node on an even column are not allowed.''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Turns from the south to west direction from any node on an odd column are not allowed.''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The illustration below shows allowed routing for different source and destination nodes. Depending on which column the packet is in, only certain directions are allowed. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_odd_even.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Odd-Even turn restriction model proposed by Ge-Ming Chiu''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
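The four restrictions above can be summarized in a few lines of Python; this sketch (with illustrative direction names) simply answers whether a given turn is allowed at a node in a given column.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Odd-even turn restrictions: a turn is named by the incoming direction,&lt;br /&gt;
# the outgoing direction, and the column of the node where it is taken.&lt;br /&gt;
# Columns are numbered from 0, so columns 0, 2, 4, ... are the even ones.&lt;br /&gt;
def turn_allowed(incoming, outgoing, column):&lt;br /&gt;
    even = (column % 2 == 0)&lt;br /&gt;
    if even and (incoming, outgoing) == ('E', 'N'):&lt;br /&gt;
        return False     # no east-to-north turn in an even column&lt;br /&gt;
    if even and (incoming, outgoing) == ('E', 'S'):&lt;br /&gt;
        return False     # no east-to-south turn in an even column&lt;br /&gt;
    if not even and (incoming, outgoing) == ('N', 'W'):&lt;br /&gt;
        return False     # no north-to-west turn in an odd column&lt;br /&gt;
    if not even and (incoming, outgoing) == ('S', 'W'):&lt;br /&gt;
        return False     # no south-to-west turn in an odd column&lt;br /&gt;
    return True          # every other turn is permitted&lt;br /&gt;
&lt;br /&gt;
print(turn_allowed('E', 'N', 4))   # False: east-to-north in an even column&lt;br /&gt;
print(turn_allowed('E', 'N', 5))   # True:  the same turn in an odd column&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;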
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Comparison of Turn Restriction Models&amp;lt;/h1&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
To evaluate the performance of the various turn-restriction models, Chiu simulated a 15 x 15 mesh under various traffic patterns. All channels have a bandwidth of 20 flits/usec and a buffer size of one flit. The dimension-ordered x-y routing, west-first, and negative-first models were compared against the odd-even model. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Traffic patterns including uniform, transpose, and hot spot were simulated. Uniform traffic has each node sending messages to every other node with equal probability. Transpose simulates two opposite nodes sending messages to their respective halves of the mesh. Hot spot simulates a few &amp;quot;hot spot&amp;quot; nodes that receive high traffic.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_uniform.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Uniform traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The performance of the different routing algorithms under uniform traffic is shown above. For uniform traffic, the dimension-ordered x-y model outperforms the rest of the models. As the number of messages increases, the x-y model has the &amp;quot;slowest&amp;quot; increase in average communication latency. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_transpose.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''First transpose traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The performance of the different routing algorithms is shown above for the first transpose traffic. The negative-first model has the best performance, while the odd-even model performs better than the west-first and x-y models.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_transpose2.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Second transpose traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
With the second transpose simulation, the odd-even model outperforms the rest.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_hotspot.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Hotspot traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The performance of the different routing algorithms under hotspot traffic is shown above. Only one hotspot was simulated for this test. The odd-even model outperforms the other models when hotspot traffic reaches 10%.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_hotspot2.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Second hotspot traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
When the number of hotspots is increased to five, the odd-even model begins to shine: its latency is the lowest at both 6 and 8 percent hotspot traffic. Meanwhile, the performance of the x-y model degrades drastically. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
While the x-y model performs well under uniform traffic, it lacks adaptiveness. When hotspots develop, the x-y model cannot adapt and re-route traffic to avoid the congestion they cause. The odd-even model has superior adaptiveness under high congestion. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Router Architecture&amp;lt;/h1&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The '''router''' is a device that routes incoming data to its destination. It does this by having several input ports and several output ports. Data incoming from one of the input ports is routed to one of the output ports. Which output port is chosen depends on the destination of the data and the routing algorithm. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The internal architecture of a router consists of input and output ports and a '''crossbar switch'''. The crossbar switch connects each input port to the output port selected for its data, acting essentially as a set of multiplexers. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Router technology has improved significantly over the years. This has allowed networks with high dimensionality to become feasible. As shown in the real-world example above, high-dimensional torii and hypercubes are excellent choices of topology for high-performance networks. The falling cost of high-performance, high-radix routers has contributed to the viability of these high-dimensionality networks. As the graph below shows, the bandwidth of routers has improved tremendously over a period of 10 years&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Router_bandwidth.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Bandwidth of various routers over 10 year period''&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Looking at the physical architecture and layout of routers, it is evident that the circuitry has become dramatically denser and more complex.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Router_physical.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Router hardware over period of time''&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Router_radix.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Radix and latency of routers over 10 year period''&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The '''radix''', or number of ports, of routers has also increased. Current routers not only have a higher radix but also lower latency than the previous generation, and the latency remains steady as the radix increases. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
With high-performance routers, complex topologies become practical; as router technology continues to improve, ever more complex, high-dimensionality topologies become feasible. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Fault Tolerant Routing&amp;lt;/h1&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Fault-tolerant routing means the successful routing of messages between any pair of non-faulty nodes in the presence of faulty components&amp;lt;sup&amp;gt;6&amp;lt;/sup&amp;gt;. With the increasing number of processors in multiprocessor systems and ever higher data rates, reliable transmission of data in the event of a network fault is a major concern, and hence fault-tolerant routing algorithms are important.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Fault Models&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Faults in a network can be categorized in two types:&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
1.'''Transient Faults'''&amp;lt;sup&amp;gt;5&amp;lt;/sup&amp;gt;: A transient fault is a temporary fault that occurs for a very short duration of time. Such a fault can be caused by a change in the output of a flip-flop leading to the generation of an invalid header. These faults can be minimized using error-controlled coding, and they are generally evaluated in terms of Bit Error Rate.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
2.'''Permanent Faults'''&amp;lt;sup&amp;gt;5&amp;lt;/sup&amp;gt;: A permanent fault is a fault that does not go away and causes permanent damage to the network. Such a fault can be due to damaged wires and associated circuitry. These faults are generally evaluated in terms of Mean Time Between Failures.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Fault Tolerance Mechanisms (for permanent faults)&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Permanent faults can be handled using one of two mechanisms:&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
1.'''Static Mechanism''': In the static fault-tolerance model, once a fault is detected, all processes running in the system are stopped and the routing tables are emptied. Based on the fault information, the routing tables are then recalculated to provide fault-free paths.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
2.'''Dynamic Mechanisms''': In the dynamic fault-tolerance model, operation of the processes in the network is not completely stalled; only the affected regions are repaired. Some of the methods to do this are:&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
a.'''Block Faults''': In this method, many of the healthy nodes in the vicinity of the faulty nodes are also marked as faulty so that no routes are created close to the actual faulty nodes. The shape of the blocked region can be convex or non-convex, and it is ensured that none of the new routes introduce a cyclic dependency in the channel dependency graph (CDG).&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
DISADVANTAGE: This method causes a lot of healthy nodes to be declared faulty, leading to a reduction in system capacity.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Fault_pic1.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
b.'''Fault Rings''': This method was introduced by Chalasani and Boppana. A fault ring is the set of nodes and links that are adjacent to faulty nodes/links. This approach reduces the number of healthy nodes that must be marked as faulty and blocked.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Fault_pic2.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;References&amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
1 Solihin text&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
2 [http://www.ssrc.ucsc.edu/Papers/hospodor-mss04.pdf Interconnection Architectures for Petabyte-Scale High-Performance Storage Systems]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
3 [http://www.diit.unict.it/~vcatania/COURSES/semm_05-06/DOWNLOAD/noc_routing02.pdf The Odd-Even Turn Model for Adaptive Routing]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
4 [http://www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf Interconnection Topologies:(Historical Trends and Comparisons)]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
5 [http://dspace.upv.es/xmlui/bitstream/handle/10251/2603/tesisUPV2824.pdf?sequence=1 Efficient mechanisms to provide fault tolerance in interconnection networks for PC clusters, José Miguel Montañana Aliaga.]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
6 [http://web.ebscohost.com.www.lib.ncsu.edu:2048/ehost/pdfviewer/pdfviewer?vid=2&amp;amp;hid=15&amp;amp;sid=72e3828d-3cb1-42b9-8198-5c1e974ea53f@sessionmgr4 Adaptive Fault Tolerant Routing Algorithm for Tree-Hypercube Multicomputer, Qatawneh Mohammad]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
7 [http://www.top500.org TOP500 Supercomputing Sites]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
8 [http://cial.csie.ncku.edu.tw/presentation/group_pdf/%5BY2005%5DPerformance%20Evaluation%20and%20Design%20Trade-Offs%20For%20Network-on-Chip%20Interconnect%20Architectures.pdf Interconnection Network Topology Tradeoffs]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
9 [http://www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf Evolution of Interconnection Networks]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
10 [http://www2.engr.arizona.edu/~hpcat/papers/jlt94.pdf Hypercube pros and cons]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
11 [http://courses.csail.mit.edu/6.896/spring04/handouts/papers/fat_trees.pdf History of the Fat Tree Topology]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
12 [http://books.google.com/books?id=uyAg3zu_DYMC&amp;amp;pg=PA75&amp;amp;lpg=PA75&amp;amp;dq=butterfly+interconnection+network+topology&amp;amp;source=bl&amp;amp;ots=5tQm-TVTu0&amp;amp;sig=oKMwSlpqDJqopnYrKv0BbA8k5uU&amp;amp;hl=en&amp;amp;ei=5cmrTYezINDQiAL9uKjvDA&amp;amp;sa=X&amp;amp;oi=book_result&amp;amp;ct=result&amp;amp;resnum=8&amp;amp;ved=0CFMQ6AEwBw  Pros and Cons of the Butterfly Interconnection Network]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
NOTE: This wiki is based on a [http://pg-server.csc.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Spring_2010/12_EC previous wiki]&lt;/div&gt;</summary>
		<author><name>Vrmanda</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch12_ar&amp;diff=45149</id>
		<title>CSC/ECE 506 Spring 2011/ch12 ar</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch12_ar&amp;diff=45149"/>
		<updated>2011-04-19T00:26:23Z</updated>

		<summary type="html">&lt;p&gt;Vrmanda: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;h1&amp;gt;Interconnection Network Architecture &amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In a multi-processor system, processors need to communicate with each other and access each other's resources. In order to route data and messages between processors, an interconnection architecture is needed.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Typically, in a multiprocessor system, messages passed between processors are frequent and short&amp;lt;sup&amp;gt;1&amp;lt;/sup&amp;gt;. Therefore, the interconnection network architecture must handle messages quickly by having '''low latency''', and must handle several messages at a time, i.e., have '''high bandwidth'''. The minimum amount of data that can be transmitted in one cycle is called a '''phit'''. A phit is typically determined by the width of the link. However, data is transferred at the granularity of the link-level flow control unit, called a '''flit'''. A flit worth of data can be accepted or rejected at the receiver, depending on the flow control protocol and the amount of buffering available at the receiver.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In a network, a processor along with its cache and memory is considered a '''node'''. The physical wires that connect two nodes are called a '''link'''. A link can be unidirectional, in which data can only be sent in one direction, or bidirectional, in which data can be sent in both directions. A link, together with its receiver and sender, makes up a channel. The device that routes messages between nodes is called a router. The shape of the network, such as the number of links and routers, is called the network '''topology'''.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Various metrics to consider when deciding on an interconnection network are the following:&lt;br /&gt;
&amp;lt;p&amp;gt;1.&amp;lt;b&amp;gt; Diameter &amp;lt;/b&amp;gt;: The maximum distance between any pair of nodes in the network is defined as the diameter. To compute the average distance, list all pairs of nodes and take the average of their distances.&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;2. &amp;lt;b&amp;gt;Bisection Bandwidth &amp;lt;/b&amp;gt;: A network can be partitioned in any number of ways. When a network is divided in such a fashion that it has two equal partitions then the minimum number of links that need to be cut is defined as the Bisection Bandwidth. &amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;3. &amp;lt;b&amp;gt; No. of Links&amp;lt;/b&amp;gt;: The number of links is the total count of connections in the network, where each link is the set of wires joining two different nodes. &amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;4.&amp;lt;b&amp;gt; Degree&amp;lt;/b&amp;gt; : Number of input/output links connecting to each router is defined as the degree of the network.&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Types of Network Topologies &amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Linear Array&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_linear.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The nodes are connected linearly as in an array. This type of topology is simple and low cost. However, it has several shortcomings.  First of all, it does not scale well: the longest distance between two nodes, or the '''diameter''', grows linearly with the number of nodes (p-1 for p nodes).  In addition to not scaling well, this topology can also result in high congestion.  [[#References|&amp;lt;sup&amp;gt;[8]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In a linear array interconnection network with '''p''' nodes, the total number of &amp;lt;b&amp;gt;links&amp;lt;/b&amp;gt; would be &amp;lt;b&amp;gt; p-1 &amp;lt;/b&amp;gt; and the &amp;lt;b&amp;gt;degree&amp;lt;/b&amp;gt; of the network would be &amp;lt;b&amp;gt;2 &amp;lt;/b&amp;gt;. The total link bandwidth is &amp;lt;b&amp;gt; p-1 &amp;lt;/b&amp;gt; times the link bandwidth, but bisection bandwidth is equal to one link bandwidth. Since global communication must always travel through one link, bisection bandwidth summarizes the bandwidth characteristic of the network better than the aggregate bandwidth.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Ring&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_ring.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The ring has a similar structure to the linear array, except that the end nodes connect to each other, establishing a circular structure. This topology scales better since the longest distance between two nodes is cut in half, but it still ends up scaling poorly if enough nodes are added.  Congestion is also cut in half since there are now 2 directions a packet can traverse.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;2-D Mesh&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_2Dmesh.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The 2-D mesh can be thought of as several linear arrays put together to form a 2-dimensional structure. Nodes that are not on the edge have 4 input or output links, or a '''degree''' of 4.  This topology has reasonably low energy dissipation without compromising throughput.  The area overhead for this topology is also rather low: since a node is never farther than a clock cycle away, there is no need for repeater insertion between the nodes (repeaters have a high area overhead, so using them increases the area of the topology). [[#References|&amp;lt;sup&amp;gt;[8]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Cube&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_cube.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The cube can be thought of as a three-dimensional mesh. This topology continues to improve upon the throughput of the 2-D mesh; however, the power dissipation is slightly higher.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Hypercube&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_hypercube.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The hypercube is essentially multiple cubes put together.  This topology was developed in 1983.  Originally there were no routers; the next hops were simply programmed into each node.  Today the hypercube topology is used by many companies, including Intel.  It is attractive because of its small diameter.  The nodes are numbered in such a way that the labels of any two neighboring nodes differ by only one bit, which greatly simplifies routing messages through the network. The biggest drawback of the topology is its lack of scalability:  if the dimension is increased by one, for example, a link must be added to every node in the network. [[#References|&amp;lt;sup&amp;gt;[10]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;2-D Torus&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_2Dtorus.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The 2-D torus takes the structure of the 2-D mesh and connects the nodes on the edges. This decreases the diameter, but the number of links is higher.  Thus it slightly improves upon the throughput of the 2-D mesh, but it also slightly increases the power dissipation relative to the 2-D mesh since the number of links is higher.  The routes connecting the end nodes together can have excessively high delays if the topology is not implemented carefully.  This topology was developed in 1985 because of the design constraints, such as pins and bisection, that the hypercube imposed. [[#References|&amp;lt;sup&amp;gt;[9]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Tree&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_tree.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The tree has a hierarchical structure with processing nodes at the bottom and switching nodes at the upper levels. The tree experiences high traffic and high latency when packets must travel through the upper levels.  Also, because of the high connectivity, this topology has high average energy dissipation.[[#References|&amp;lt;sup&amp;gt;[8]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Fat Tree&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_fat_tree.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
In 1985, an MIT professor invented the fat tree to improve upon the normal “skinny” tree.  The fat tree “fattens” up the links at the upper levels, which helps to alleviate the traffic at the upper levels and to decrease message latency.  However, fattening means that additional links are added in this area, which increases the average energy dissipated by this topology. [[#References|&amp;lt;sup&amp;gt;[11]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Butterfly&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_butterfly.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Another attempt to improve upon the “skinny” tree structure is the butterfly.  The butterfly structure is similar to the tree structure, but it replicates the switching-node structure of the tree and connects the copies together so that there are equal numbers of links and routers at all levels.  There are two problems with this topology.  First, there is no path diversity: there is only one path from the root to a downstream node.  This is not ideal in case the network is congested in one area but has capacity available in another, since there is no way for the network to rebalance the load.  Second, there are some very long routes in this topology.  These require repeaters between the nodes, which dramatically increases the physical area needed to implement the network. [[#References|&amp;lt;sup&amp;gt;[12]&amp;lt;/sup&amp;gt;]] &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Comparison of Network Topologies &amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The following figure shows the total number of ports required for each network topology. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_ports.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Number of ports for each topology''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
As the figure above shows, the 6-D hypercube requires the largest number of ports, due to its relatively complex six-dimensional structure. In contrast, the fat tree requires the fewest ports, even though its links have been &amp;quot;fattened&amp;quot; up by using redundant links. The butterfly network requires more than twice the number of ports of the fat tree, since it essentially replicates the switching layer of the fat tree. The number of ports for the mesh and torus structures increases as the dimensionality increases.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Below, the average path length, or average number of hops, and the average link load (GB/s) are shown.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_load.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Average path length and link load for each topology''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Looking at the trends, when the average path length is high, the average link load is also high; in other words, average path length and average link load are proportionally related. It is obvious from the graph that the 2-D mesh has, by far, the worst performance. In a large network such as this, the average path length is simply too high, and the average link load suffers. For this type of high-performance network, the 2-D mesh does not scale well. The 2-D torus cuts the average path length and average link load in half by connecting the edge nodes together; however, its performance compared to the other types is still relatively poor. The butterfly and fat tree have the lowest average path length and average link load. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The figure below shows the cost of the network topologies.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_cost.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Cost of each topology''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Despite using the fewest ports, the fat tree topology has by far the highest cost. Although it uses the fewest ports, they are high-bandwidth ports of 10 GB/s. Over 2400 ports of 10 GB/s are required to provide enough bandwidth at the upper levels of the tree. This pushes the cost up dramatically and makes the fat tree impractical from a cost standpoint. While the total cost of the fat tree is about 15 million dollars, the rest of the network topologies are clustered below 4 million dollars. As the dimensionality of the mesh and torus structures increases, the cost increases. The cost of the butterfly network falls between that of the 2-D mesh/torus and the 6-D hypercube. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
When the cost and the average link load are factored together, the following graph is produced.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_overall.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Overall cost of each topology''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
From the figure above, the 6-D hypercube is the most cost-effective choice for this particular network setup. Although the 6-D hypercube costs more because it needs more links and ports, it provides higher bandwidth, which can offset the higher cost. The high-dimensional tori also perform well, but cannot provide as much bandwidth as the 6-D hypercube; for systems that do not need as much bandwidth, they are also a good choice. The butterfly topology is another alternative, but it has lower fault tolerance. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Real-World Implementation of Network Topologies &amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In a research study, Andy Hospodor and Ethan Miller investigated several network topologies in a high-performance, high-traffic network&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;, including the fat tree, butterfly, mesh, torus, and hypercube structures. Advantages and disadvantages, including cost, performance, and reliability, were discussed. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In this experiment, a petabyte-scale network with over 100 GB/s of total aggregate bandwidth was investigated. The network consisted of 4096 disks attached to large servers, with routers and switches in between&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The overall structure of the network is shown below. Note that this structure is very susceptible to failure and congestion.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_network.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Basic structure of Hospodor and Miller's experimental network''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Fat Tree&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In large-scale, high-performance applications, the fat tree can be a viable choice. However, in order to &amp;quot;fatten&amp;quot; up the links, redundant connections must be used: instead of one link between switching nodes, several must be used. The problem is that with more input and output links, one needs routers with more input and output ports. Routers with in excess of 100 ports are difficult to build and expensive, so multiple routers would have to be stacked together. Even then, the routers would be expensive, and several of them would be required&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The Japan Agency for Marine-Earth Science and Technology supercomputing system uses the fat tree topology. The system connects 1280 NEC processors&amp;lt;sup&amp;gt;7&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Butterfly&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In high-performance applications, the butterfly structure is a good choice. The butterfly topology uses fewer links than the other topologies; however, each link carries the traffic of an entire layer. Fault tolerance is poor: there exists only a single path between each pair of nodes, so should a link break, data cannot be re-routed and communication is broken&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_butterfly.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Butterfly structure''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Meshes and Tori&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The mesh and torus structures used in this application would require a large number of links and an aggregate of several thousand ports. However, since there are so many links, the mesh and torus structures provide alternate paths in case of failures&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
One example of current use of the torus structure is the QPACE SFB TR cluster in Germany, built from PowerXCell 8i processors. The system uses a 3-D torus topology with 4608 processors&amp;lt;sup&amp;gt;7&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_mesh.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Mesh structure''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_torus.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Torus structure''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Hypercube&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Similar to the torus structures, the hypercube requires a larger number of links. However, its bandwidth scales better than that of the mesh and torus structures. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The CRAY T3E, CRAY XT3, and SGI Origin 2000 use k-ary n-cube topologies.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_hypercube.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Hypercube structure''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Routing&amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The '''routing''' algorithm determines what path a packet of data will take from source to destination. Routing can be '''deterministic''', where the path is always the same for a given source and destination, or '''adaptive''', where the path can change. The routing algorithm can also be '''partially adaptive''', where packets have multiple path choices but not all packets are allowed to use the shortest path&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Deadlock&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Packets are in '''deadlock''' when they cannot continue to move through the nodes. The illustration below demonstrates this situation. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_deadlock.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Example of deadlock''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Assume that all of the buffers at each node are full. The packet from Node 1 cannot continue to Node 2, the packet from Node 2 cannot continue to Node 3, and so on. Since no packet can move, the network is deadlocked. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Deadlock arises from a cyclic pattern of routing. To avoid deadlock, circular routing patterns must be avoided.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To avoid circular routing patterns, certain turns are disallowed. These '''turn restrictions''' prevent packets from forming a circular routing pattern. Some common turn-restriction schemes are described below, and a sketch for checking cyclic dependencies follows this paragraph.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
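As a minimal sketch of why cyclic routing patterns matter (the channel names and graph encoding are illustrative assumptions), the code below builds a dependency graph in which an edge means a packet holding one channel may request the next, and checks it for a cycle; a cycle means deadlock is possible.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
def has_cycle(deps):&lt;br /&gt;
    # deps maps each channel to the list of channels that can be requested&lt;br /&gt;
    # while it is held.  A cycle in this graph means deadlock is possible.&lt;br /&gt;
    WHITE, GRAY, BLACK = 0, 1, 2&lt;br /&gt;
    color = {c: WHITE for c in deps}&lt;br /&gt;
&lt;br /&gt;
    def visit(c):&lt;br /&gt;
        color[c] = GRAY&lt;br /&gt;
        for nxt in deps.get(c, []):&lt;br /&gt;
            if color.get(nxt, WHITE) == GRAY:&lt;br /&gt;
                return True                    # back edge: a cycle exists&lt;br /&gt;
            if color.get(nxt, WHITE) == WHITE and visit(nxt):&lt;br /&gt;
                return True&lt;br /&gt;
        color[c] = BLACK&lt;br /&gt;
        return False&lt;br /&gt;
&lt;br /&gt;
    return any(color[c] == WHITE and visit(c) for c in deps)&lt;br /&gt;
&lt;br /&gt;
# Four channels routed in a ring, as in the deadlock figure above: each&lt;br /&gt;
# channel waits on the next, so the dependency graph contains a cycle.&lt;br /&gt;
ring = {'c1': ['c2'], 'c2': ['c3'], 'c3': ['c4'], 'c4': ['c1']}&lt;br /&gt;
print(has_cycle(ring))    # True&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;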
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;h2&amp;gt;Dimension-ordered (X-Y) routing&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Turns from the y-dimension to the x-dimension are not allowed: a packet first travels the full distance in the x-dimension, then in the y-dimension (see the sketch below).&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
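A minimal sketch of dimension-ordered routing on a 2-D mesh follows (the coordinate convention and function names are illustrative assumptions, not from the cited sources).&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
def step_toward(cur, goal):&lt;br /&gt;
    # Unit step (+1 or -1) from cur toward goal; only called when cur != goal.&lt;br /&gt;
    return (goal - cur) // abs(goal - cur)&lt;br /&gt;
&lt;br /&gt;
def xy_route(src, dst):&lt;br /&gt;
    # Dimension-ordered (X-Y) routing: move along x until the column matches&lt;br /&gt;
    # the destination, then move along y.  Because a packet never turns from&lt;br /&gt;
    # the y-dimension back to the x-dimension, no cyclic channel dependency&lt;br /&gt;
    # can form, so the scheme is deadlock-free.&lt;br /&gt;
    (x, y), (dst_x, dst_y) = src, dst&lt;br /&gt;
    path = [(x, y)]&lt;br /&gt;
    while x != dst_x:&lt;br /&gt;
        x = x + step_toward(x, dst_x)&lt;br /&gt;
        path.append((x, y))&lt;br /&gt;
    while y != dst_y:&lt;br /&gt;
        y = y + step_toward(y, dst_y)&lt;br /&gt;
        path.append((x, y))&lt;br /&gt;
    return path&lt;br /&gt;
&lt;br /&gt;
# Example: route from node (0, 0) to node (2, 3) of a mesh.&lt;br /&gt;
print(xy_route((0, 0), (2, 3)))&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;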
&lt;br /&gt;
&amp;lt;h2&amp;gt;West First&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Turns to the west are not allowed; any westward travel must be done first, at the beginning of the route.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;North Last&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Turns after traveling in the north direction are not allowed; northward travel must come last. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Negative First&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Turns into a negative direction (-x or -y) are not allowed except at the start; all travel in the negative directions must be done first.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Odd-Even Turn Model&amp;lt;/h2&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Unfortunately, the above turn-restriction models reduce the degree of adaptiveness and are only partially adaptive. They force some packets onto different, not necessarily minimal, routes. This can cause unfairness and reduces the system's ability to relieve congestion, so overall performance may suffer&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Ge-Ming Chiu introduced the Odd-Even turn model as an adaptive, deadlock-free turn-restriction model that has better performance than the previously mentioned models&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;. The model is designed primarily for 2-D meshes; its rules are listed below, followed by a small illustrative checker.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
''Turns from the east to north direction from any node on an even column are not allowed.''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Turns from the north to west direction from any node on an odd column are not allowed.''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Turns from the east to south direction from any node on an even column are not allowed.''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Turns from the south to west direction from any node on an odd column are not allowed.''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The illustration below shows allowed routing for different source and destination nodes. Depending on which column the packet is in, only certain directions are allowed. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_odd_even.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Odd-Even turn restriction model proposed by Ge-Ming Chiu''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
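The sketch below encodes the four rules above as a simple check (the turn and column conventions are illustrative assumptions): given a turn and the column of the node where it would be made, it reports whether the odd-even model allows the turn.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
def odd_even_turn_allowed(turn, column):&lt;br /&gt;
    # turn is a pair of headings, e.g. ('E', 'N') for an east-to-north turn;&lt;br /&gt;
    # column is the x-coordinate of the node where the turn is made.&lt;br /&gt;
    even = (column % 2 == 0)&lt;br /&gt;
    if even and turn in (('E', 'N'), ('E', 'S')):&lt;br /&gt;
        return False    # east-to-north and east-to-south are forbidden on even columns&lt;br /&gt;
    if (not even) and turn in (('N', 'W'), ('S', 'W')):&lt;br /&gt;
        return False    # north-to-west and south-to-west are forbidden on odd columns&lt;br /&gt;
    return True&lt;br /&gt;
&lt;br /&gt;
# Example: an east-to-north turn is legal on column 3 but not on column 2.&lt;br /&gt;
print(odd_even_turn_allowed(('E', 'N'), 3))   # True&lt;br /&gt;
print(odd_even_turn_allowed(('E', 'N'), 2))   # False&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;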
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Comparison of Turn Restriction Models&amp;lt;/h1&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
To evaluate the performance of the various turn-restriction models, Chiu simulated a 15 x 15 mesh under various traffic patterns. All channels have a bandwidth of 20 flits/usec and a buffer size of one flit. The dimension-ordered x-y, west-first, and negative-first models were compared against the odd-even model. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Simulations were run with uniform, transpose, and hot-spot traffic patterns. Under uniform traffic, each node sends messages to every other node with equal probability. Under transpose traffic, nodes in one half of the mesh send messages to the corresponding nodes in the opposite half. Hot-spot traffic designates a few &amp;quot;hot spot&amp;quot; nodes that receive a disproportionate share of the traffic.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_uniform.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Uniform traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The performance of the different routing algorithms under uniform traffic is shown above. For uniform traffic, the dimension-ordered x-y model outperforms the rest of the models: as the number of messages increases, the x-y model has the &amp;quot;slowest&amp;quot; increase in average communication latency. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_transpose.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''First transpose traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The performance of the different routing algorithms is shown above for the first transpose traffic. The negative-first model has the best performance, while the odd-even model performs better than the west-first and x-y models.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_transpose2.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Second transpose traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
With the second transpose simulation, the odd-even model outperforms the rest.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_hotspot.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Hotspot traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The performance of the different routing algorithms under hot-spot traffic is shown above. Only one hotspot was simulated for this test. The odd-even model outperforms the other models when hotspot traffic is 10%.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_hotspot2.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Second hotspot traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
When the number of hotspots is increased to five, the odd-even model begins to shine: its latency is the lowest for both 6 and 8 percent hotspot traffic. Meanwhile, the performance of the x-y model degrades dramatically. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
While the x-y model performs well under uniform traffic, it lacks adaptiveness. When the traffic contains hotspots, the x-y model suffers from its inability to adapt and re-route traffic around the congestion that the hotspots cause. The odd-even model has superior adaptiveness under high congestion. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Router Architecture&amp;lt;/h1&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The '''router''' is a device that routes incoming data to its destination. It has several input ports and several output ports; data arriving on one of the input ports is routed to one of the output ports. Which output port is chosen depends on the destination of the data and on the routing algorithm. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The internal architecture of a router consists of input and output ports and a '''crossbar switch'''. The crossbar switch selects which output port each input is connected to, acting essentially as a set of multiplexers (a small sketch follows this paragraph). &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Router technology has improved significantly over the years, which has made networks with high dimensionality feasible. As the real-world example above shows, high-dimensional tori and hypercubes are excellent choices of topology for high-performance networks. The availability of high-performance, high-radix routers has contributed to the viability of these high-dimensionality networks. As the graph below shows, the bandwidth of routers has improved tremendously over a period of 10 years&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
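Before turning to the graphs, here is a minimal sketch of the crossbar idea mentioned above (purely illustrative; the class and port naming are assumptions): each input port is mapped to at most one output port, and a flit presented at an input is forwarded to the output selected for it.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Crossbar:&lt;br /&gt;
    # A k x k crossbar: each input port connects to at most one output port,&lt;br /&gt;
    # and each output port serves at most one input port per cycle.&lt;br /&gt;
    def __init__(self, ports):&lt;br /&gt;
        self.ports = ports&lt;br /&gt;
        self.map = {}                 # maps an input port to its selected output port&lt;br /&gt;
&lt;br /&gt;
    def connect(self, in_port, out_port):&lt;br /&gt;
        # Refuse the connection if the output is already claimed this cycle.&lt;br /&gt;
        if out_port in self.map.values():&lt;br /&gt;
            return False&lt;br /&gt;
        self.map[in_port] = out_port&lt;br /&gt;
        return True&lt;br /&gt;
&lt;br /&gt;
    def forward(self, in_port, flit):&lt;br /&gt;
        # Deliver a flit from an input port to its currently selected output.&lt;br /&gt;
        return (self.map[in_port], flit)&lt;br /&gt;
&lt;br /&gt;
xb = Crossbar(4)&lt;br /&gt;
xb.connect(0, 2)                      # routing logic sends input 0 to output 2&lt;br /&gt;
print(xb.forward(0, 'flit-A'))        # (2, 'flit-A')&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;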
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Router_bandwidth.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Bandwidth of various routers over 10 year period''&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Looking at the physical architecture and layout of routers, it is evident that the circuitry has become dramatically more dense and complex.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Router_physical.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Router hardware over period of time''&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Router_radix.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Radix and latency of routers over 10 year period''&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The '''radix''', or number of ports, of routers has also increased. Current routers not only have a high radix but also a lower latency than the previous generation; as radix increases, latency remains steady. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
With high-performance routers, complex topologies become practical; as router technology continues to improve, even more complex, high-dimensionality topologies become possible. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Fault Tolerant Routing&amp;lt;/h1&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Fault-tolerant routing means the successful routing of messages between any pair of non-faulty nodes in the presence of faulty components&amp;lt;sup&amp;gt;6&amp;lt;/sup&amp;gt;. With the increasing number of processors in multiprocessor systems and high data rates, reliable transmission of data in the event of a network fault is a great concern; hence fault-tolerant routing algorithms are important.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Fault Models&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Faults in a network can be categorized into two types:&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
1. '''Transient Faults'''&amp;lt;sup&amp;gt;5&amp;lt;/sup&amp;gt;: A transient fault is a temporary fault that occurs for a very short duration of time. It can be caused, for example, by a change in the output of a flip-flop that leads to the generation of an invalid header. Transient faults can be minimized using error-controlled coding, and they are generally evaluated in terms of the bit error rate.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
2. '''Permanent Faults'''&amp;lt;sup&amp;gt;5&amp;lt;/sup&amp;gt;: A permanent fault is a fault that does not go away and causes permanent damage to the network, for example due to damaged wires and associated circuitry. These faults are generally evaluated in terms of the mean time between failures.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Fault Tolerance Mechanisms (for permanent faults)&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Permanent faults can be handled using one of two mechanisms:&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
1. '''Static Mechanism''': In the static fault-tolerance model, once a fault is detected, all processes running in the system are stopped and the routing tables are emptied. Based on the fault information, the routing tables are recalculated to provide a fault-free path.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
2. '''Dynamic Mechanisms''': In the dynamic fault-tolerance model, the operation of the processes in the network is not completely stalled; only the affected regions are repaired. Some of the methods for doing this are:&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
a. '''Block Faults''': In this method, many of the healthy nodes in the vicinity of the faulty nodes are also marked as faulty, so that no routes are created close to the actual faulty nodes. The shape of the blocked region can be convex or non-convex, and it is ensured that none of the new routes introduces a cycle into the channel dependency graph (CDG).&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
DISADVANTAGE: This method causes a lot of healthy nodes to be declared faulty, leading to a reduction in system capacity.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Fault_pic1.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
b. '''Fault Rings''': This method was introduced by Chalasani and Boppana. A fault ring is the set of nodes and links adjacent to the faulty nodes/links. This approach reduces the number of healthy nodes that must be marked as faulty and blocked; a small illustrative sketch follows the figure below.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Fault_pic2.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
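As a rough sketch of the fault-ring idea (an illustrative example; the 2-D mesh coordinates and function name are assumptions, not from the cited work), the function below collects the healthy nodes that surround a set of faulty nodes, i.e. the nodes that would form the fault ring.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
def fault_ring(faulty, width, height):&lt;br /&gt;
    # Returns the healthy nodes adjacent (including diagonals) to any faulty&lt;br /&gt;
    # node of a width x height mesh.  These nodes form the fault ring that&lt;br /&gt;
    # packets are routed around instead of entering the faulty region.&lt;br /&gt;
    faulty = set(faulty)&lt;br /&gt;
    ring = set()&lt;br /&gt;
    for (fx, fy) in faulty:&lt;br /&gt;
        for dx in (-1, 0, 1):&lt;br /&gt;
            for dy in (-1, 0, 1):&lt;br /&gt;
                node = (fx + dx, fy + dy)&lt;br /&gt;
                on_mesh = node[0] in range(width) and node[1] in range(height)&lt;br /&gt;
                if on_mesh and node not in faulty:&lt;br /&gt;
                    ring.add(node)&lt;br /&gt;
    return ring&lt;br /&gt;
&lt;br /&gt;
# Example: a single faulty node in the middle of a 5 x 5 mesh is surrounded&lt;br /&gt;
# by a fault ring of 8 healthy nodes.&lt;br /&gt;
print(sorted(fault_ring({(2, 2)}, 5, 5)))&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;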
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;References&amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
1 Solihin text&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
2 [http://www.ssrc.ucsc.edu/Papers/hospodor-mss04.pdf Interconnection Architectures for Petabyte-Scale High-Performance Storage Systems]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
3 [http://www.diit.unict.it/~vcatania/COURSES/semm_05-06/DOWNLOAD/noc_routing02.pdf The Odd-Even Turn Model for Adaptive Routing]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
4 [http://www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf Interconnection Topologies:(Historical Trends and Comparisons)]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
5 [http://dspace.upv.es/xmlui/bitstream/handle/10251/2603/tesisUPV2824.pdf?sequence=1 Efficient mechanisms to provide fault tolerance in interconnection networks for PC clusters, José Miguel Montañana Aliaga.]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
6 [http://web.ebscohost.com.www.lib.ncsu.edu:2048/ehost/pdfviewer/pdfviewer?vid=2&amp;amp;hid=15&amp;amp;sid=72e3828d-3cb1-42b9-8198-5c1e974ea53f@sessionmgr4 Adaptive Fault Tolerant Routing Algorithm for Tree-Hypercube Multicomputer, Qatawneh Mohammad]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
7 [http://www.top500.org TOP500 Supercomputing Sites]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
8 [http://cial.csie.ncku.edu.tw/presentation/group_pdf/%5BY2005%5DPerformance%20Evaluation%20and%20Design%20Trade-Offs%20For%20Network-on-Chip%20Interconnect%20Architectures.pdf Interconnection Network Topology Tradeoffs]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
9 [http://www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf Evolution of Interconnection Networks]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
10 [http://www2.engr.arizona.edu/~hpcat/papers/jlt94.pdf Hypercube pros and cons]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
11 [http://courses.csail.mit.edu/6.896/spring04/handouts/papers/fat_trees.pdf History of the Fat Tree Topology]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
12 [http://books.google.com/books?id=uyAg3zu_DYMC&amp;amp;pg=PA75&amp;amp;lpg=PA75&amp;amp;dq=butterfly+interconnection+network+topology&amp;amp;source=bl&amp;amp;ots=5tQm-TVTu0&amp;amp;sig=oKMwSlpqDJqopnYrKv0BbA8k5uU&amp;amp;hl=en&amp;amp;ei=5cmrTYezINDQiAL9uKjvDA&amp;amp;sa=X&amp;amp;oi=book_result&amp;amp;ct=result&amp;amp;resnum=8&amp;amp;ved=0CFMQ6AEwBw  Pros and Cons of the Butterfly Interconnection Network]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
NOTE:  This wiki is based on a [http://pg-server.csc.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Spring_2010/12_EC previous wiki]&lt;/div&gt;</summary>
		<author><name>Vrmanda</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch12_ar&amp;diff=45148</id>
		<title>CSC/ECE 506 Spring 2011/ch12 ar</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch12_ar&amp;diff=45148"/>
		<updated>2011-04-19T00:26:07Z</updated>

		<summary type="html">&lt;p&gt;Vrmanda: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;h1&amp;gt;Interconnection Network Architecture &amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In a mul&lt;br /&gt;
ti-processor system, processors need to communicate with each other and access each other's resources. In order to route data and messages between processors, an interconnection architecture is needed.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Typically, in a multiprocessor system, message passed between processors are frequent and short&amp;lt;sup&amp;gt;1&amp;lt;/sup&amp;gt;. Therefore, the interconnection network architecture must handle messages quickly by having '''low latency''', and must handle several messages at a time and have '''high bandwidth'''. The minimum amount of data that can be transmitted in one cycle is called a '''phit'''. A phit is typically determined by to the width of the link. However, data is transferred at the granularity of link-level flow control unit called a '''flit'''. A flit worth of data can be accepted or rejected at the receiver, depending on the flow control protocol and amount of buffering available at the receiver.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In a network, a processor along with its cache and memory is considered a '''node'''. The physical wires that connect between them is called a '''link'''. A link can be unidirectional, in which data can only be sent in one direction, or bidirectional in which data can be sent in both directions. A link, the receiver and sender, make up a channel. The device that routes messages between nodes is called a router. The shape of the network, such as the number of links and routers, is called the network '''topology'''.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Various metrics that are to be considered in coming up with a decision on the interconnection networks are the following factors:&lt;br /&gt;
&amp;lt;p&amp;gt;1.&amp;lt;b&amp;gt; Diameter &amp;lt;/b&amp;gt;: The maximum distance in the network is defined as the diameter. And to compute the average distance, a list of all pairs of nodes is to be made and the average of those would determine the value.&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;2. &amp;lt;b&amp;gt;Bisection Bandwidth &amp;lt;/b&amp;gt;: A network can be partitioned in any number of ways. When a network is divided in such a fashion that it has two equal partitions then the minimum number of links that need to be cut is defined as the Bisection Bandwidth. &amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;3. &amp;lt;b&amp;gt; No. of Links&amp;lt;/b&amp;gt; : Number of links in a network is the set of wires that connect two different nodes in the network. &amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;4.&amp;lt;b&amp;gt; Degree&amp;lt;/b&amp;gt; : Number of input/output links connecting to each router is defined as the degree of the network.&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Types of Network Topologies &amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Linear Array&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_linear.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The nodes are connected linearly as in an array. This type of topology is simple and low cost. However, it has several shortcoming.  First of all, it does not scale well. The longest distance between two nodes, or the '''diameter''', is equivalent to the number of nodes.  In addition to not scaling well, this topology can also result in high congestion.  [[#References|&amp;lt;sup&amp;gt;[8]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In a linear array interconnection network with '''p''' nodes, the total number of &amp;lt;b&amp;gt;links&amp;lt;/b&amp;gt; would be &amp;lt;b&amp;gt; p-1 &amp;lt;/b&amp;gt; and the &amp;lt;b&amp;gt;degree&amp;lt;/b&amp;gt; of the network would be &amp;lt;b&amp;gt;2 &amp;lt;/b&amp;gt;. The total link bandwidth is &amp;lt;b&amp;gt; p-1 &amp;lt;/b&amp;gt; times the link bandwidth, but bisection bandwidth is equal to one link bandwidth. Since global communication must always travel through one link, bisection bandwidth summarizes the bandwidth characteristic of the network better than the aggregate bandwidth.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Ring&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_ring.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Similar structure as the linear array, except, the ending nodes connect to each other, establishing a circular structure. This topology will scale better since the longest distance between two nodes is cut in half, but it eventually ends up scaling poorly if enough nodes are added.  The congestion will also be cut in half since there is now 2 options for packets to traverse.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;2-D Mesh&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_2Dmesh.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The 2-D mesh can be thought of as several linear arrays put together to form a 2-dimensional structure. Nodes that are not on the edge have 4 input or output links, or a '''degree''' of 4.  This topology has reasonably low energy dissipation without compromising throughput.  The area overhead for this topology is also rather low since a node is never farther than a clock cycle away, this means there is no need for repeater insertion in between the nodes.  (the repeaters have a high area overhead, so using them causes the area of the topology to increase) [[#References|&amp;lt;sup&amp;gt;[8]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Cube&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_cube.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The cube can be thought of as a three-dimensional mesh. This topology continues to improve upon the throughput in the 2-D mesh, however the power dissipation will be slightly higher.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Hypercube&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_hypercube.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The hypercube is essentially multiple cubes put together.  This topology was developed in 1983.  Originally there were no routers.  The next hops were just programmed into each node.  Today the hypercube topology is used by many companies including Intel.  It is so attractive because of its small diameter.  The nodes are numbered in such a way that every neighboring node is only one bit difference.  This greatly increases the ability to route messages through the network. The biggest drawback of the topology is the lack of scalability.  For example if the dimension size is increased by one, one link will need to be added to every node in the network. [[#References|&amp;lt;sup&amp;gt;[10]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;2-D Torus&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_2Dtorus.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The 2-D torus takes the structure of the 2-D mesh and connects the nodes on the edges. This decreases the diameter, but the number of links is higher.  Thus it slightly improves upon the throughput of the 2-D Mesh, but it also slightly increases the power dissipation relative to the 2-D Mesh since the number of links is higher.  The delays on the routes connecting the end nodes together can have an excessively high delays if the topology is not implemented correctly.(printout)  This topology was developed in 1985, because of the design constraints, such as pins and bisection, that the hypercube required. [[#References|&amp;lt;sup&amp;gt;[9]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Tree&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_tree.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The tree has a hierarchical structure of nodes on the bottom and switching nodes at the upper levels. The tree experiences high traffic and high latency if the package must travel through the upper levels.  Also because of the high connectivity, this topology has high average energy dissipation.[[#References|&amp;lt;sup&amp;gt;[8]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Fat Tree&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_fat_tree.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
In 1985, an MIT professor invented the fat tree to improve upon the normal “skinny” tree.  To improve upon the “skinny” tree topology, the fat tree “fattens” up the links at the upper levels.  This helps to alleviate the traffic at upper levels and to decrease the latency of the message.  However, by fattening, it is meant that additional links are added to this area, which will increase the average energy dissipated by this topology. [[#References|&amp;lt;sup&amp;gt;[11]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Butterfly&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_butterfly.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Another attempt to increase the “skinny” tree structure was the butterfly structure.  The butterfly structure is similar to the tree structure, but it replicates the switching node structure of the tree topology and connects them together so that there are equal links and routers at all levels.  There are 2 problems with this topology.  First of all, there is no path diversity in this topology.  There is only one path from the root to a downstream node.  This is not ideal incase the network is congested in a certain area, but available in another.  There is no way for the network to rebalance the work.  Second of all, there are some very long routes in this topology.  This requires there to be repeaters in between the nodes, which will cause the physical area to implement the network to increase dramatically. [[#References|&amp;lt;sup&amp;gt;[12]&amp;lt;/sup&amp;gt;]] &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Comparison of Network Topologies &amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The following table shows the total number of ports required for each network topology. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_ports.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Number of ports for each topology''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
As the figure above shows, the 6-D hypercube requires the largest number of ports, due to its relatively complex six-dimensional structure. In contrast, the fat tree requires the least number of ports, even though links have been &amp;quot;fattened&amp;quot; up by using redundant links. The butterfly network requires more than twice the number of ports as the fat tree, since it essentially replicates the switching layer of the fat tree. The number of ports for the mesh and torii structures increase as the dimensionality increases.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Below the average path length, or average number of hops, and the average link load (GB/s) is shown.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_load.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Average path length and link load for each topology''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Looking at the trends, when average path length is high, the average link load is also high. In other words, average path length and average link load are proportionally related. It is obvious from the graph that 2-D mesh has, by far, the worst performance. In a large network such as this, the average path length is just too high, and the average link load suffers. For this type of high-performance network, the 2-D mesh does not scale well. Likewise the 2-D torus cuts the average path length and average link load in half by connected the edge nodes together, however, the performance compared to other types is relatively poor. The butterfly and fat-tree have the least average path length and average link load. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The figure below shows the cost of the network topologies.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_cost.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Cost of each topology''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Despite using the fewest number of ports, the fat tree topology has the highest cost, by far. Although it uses the fewest ports, the ports are high bandwidth ports of 10 GB/s. Over 2400, ports of 10 GB/s are required have enough bandwidth at the upper levels of the tree. This pushes the cost up dramatically, and from a cost standpoint is impractical. While the total cost of fat tree is about 15 million dollars, the rest of the network topologies are clustered below 4 million dollars. When the dimensionalality of the mesh and torii structures increase, the cost increases. The butterfly network costs between the 2-D mesh/torii and the 6-D hypercube. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
When the cost and average link load is factored the following graph is produced.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_overall.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Overall cost of each topology''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
From the figure above, the 6-D hypercube demonstrates the most cost effective choice on this particular network setup. Although the 6-D hypercube costs more because it needs more links and ports, it provides higher bandwidth, which can offset the higher cost. The high dimensional torii also perform well, but cannot provide as much bandwidth as the 6-D hypercube. For systems that do not need as much bandwidth, the high-dimensional torii is also a good choice. The butterfly topology is also an alternative, but has lower fault tolerance. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Real-World Implementation of Network Topologies &amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In a research study by Andy Hospodor and Ethan Miller, several network topologies were investigated in a high-performance, high-traffic network&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;. Several topologies were investigated including the fat tree, butterfly, mesh, torii, and hypercube structures. Advantages and disadvantages including cost, performance, and reliability were discussed. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In this experiment, a petabyte-scale network with over 100 GB/s total aggregate bandwidth was investigated. The network consisted of 4096 disks with large servers with routers and switches in between&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The overall structure of the network is shown below. Note that this structure is very susceptible to failure and congestion.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_network.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Basic structure of Hospodor and Miller's experimental network''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Fat Tree&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In large scale, high performance applications, fat tree can be a choice. However, in order to &amp;quot;fatten&amp;quot; up the links, redundant connections must be used. Instead of using one link between switching nodes, several must be used. The problem with this is that with more input and output links, one would need routers with more input and output ports. Router with excess of 100 ports are difficult to build and expensive, so multiple routers would have to be stacked together. Still, the routers would be expensive and would require several of them&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The Japan Agency for Marine-Earth Science and Technology supercomputing system uses the fat tree topology. The system connects 1280 processors using NEC processors&amp;lt;sup&amp;gt;7&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Butterfly&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In high performance applications, the butterfly structure is a good choice. The butterfly topology uses fewer links than other topologies, however, each link carries traffic from the entire layer. Fault tolerance is poor. There exists only a single path between pairs of nodes. Should the link break, data cannot be re-routed, and communication is broken&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_butterfly.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Butterfly structure''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Meshes and Tori&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The mesh and torus structure used in this application would require a large number of links and total aggregate of several thousands of ports. However, since there are so many links, the mesh and torus structures provide alternates paths in case of failures&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Some examples of current use of torus structure include the QPACE SFB TR Cluster in Germany using the PowerXCell 8i processors. The systems uses 3-D torus topology with 4608 processors&amp;lt;sup&amp;gt;7&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_mesh.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Mesh structure''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_torus.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Torus structure''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Hypercube&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Similar to the torii structures, the hypercube requires larger number of links. However, the bandwidth scales better than mesh and torii structures. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The CRAY T3E, CRAY XT3, and SGI Origin 2000 use k-ary n-cubed topologies.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_hypercube.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Hypercube structure''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Routing&amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The '''routing''' algorithm determines what path a packet of data will take from source to destination. Routing can be '''deterministic''', where the path is the same given a source and destination, or '''adaptive''', where the path can change. The routing algorithm can also be '''partially adaptive''' where packets have multiple choices, but does not allow all packets to use the shortest path&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Deadlock&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
When packets are in '''deadlock''' when they cannot continue to move through the nodes. The illustration below demonstrates this event. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_deadlock.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Example of deadlock''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Assume that all of the buffers are full at each node. Packet from Node 1 cannot continue to Node 2. The packet from Node 2 cannot continue to Node 3, and so on. Since packet cannot move, it is deadlocked. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The deadlock occurs from cyclic pattern of routing. To avoid deadlock, avoid circular routing pattern.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To avoid circular patterns of routing, some routing patterns are disallowed. These are called '''turn restrictions''', where some turns are not allowed in order to avoid making a circular routing pattern. Some of these turn restrictions are mentioned below.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;h2&amp;gt;Dimensional ordered (X-Y) routing&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Turns from the y-dimension to the x-dimension are not allowed.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;West First&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Turns to the west are not allowed.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;North Last&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Turns after a north direction are not allowed. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Negative First&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Turns in the negative direction (-x or -y) are not allowed, except on the first turn.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Odd-Even Turn Model&amp;lt;/h2&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Unfortunately, the above turn-restriction models reduce the degree of adaptiveness and are partially adaptive. The models cause some packets to take different routes, and not necessarily the minimal paths. This may cause unfairness but reduces the ability of the system to reduce congestion. Overall performance could suffer&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Ge-Ming Chiu introduces the Odd-Even turn model as an adaptive turn restriction, deadlock-free model that has better performance than the previously mentioned models&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;. The model is designed primarily for 2-D meshes.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
''Turns from the east to north direction from any node on an even column are not allowed.''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Turns from the north to west direction from any node on an odd column are not allowed.''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Turns from the east to south direction from any node on an even column are not allowed.''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Turns from the south to west direction from any node on an odd column are not allowed.''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The illustration below shows allowed routing for different source and destination nodes. Depending on which column the packet is in, only certain directions are allowed. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_odd_even.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Odd-Even turn restriction model proposed by Ge-Ming Chiu''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Comparison of Turn Restriction Models&amp;lt;/h1&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
To simulate the performance of various turn restriction models, Chiu simulated a 15 x 15 mesh under various traffic patterns. All channels have bandwidth of 20 flits/usec and has a buffer size of one flit. The dimension-ordered x-y routing, west-first, and negative-first models were compared against the odd-even model. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Traffic patterns including uniform, transpose, and hot spot were conducted. Uniform simulates one node send messages to any other node with equal probability. Transpose simulates two opposite nodes sending messages to their respective halves of the mesh. Hot spot simulates a few &amp;quot;hot spot&amp;quot; nodes that receive high traffic.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
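&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
For readers who want to reproduce the flavour of these workloads, the sketch below shows one simple way to draw destinations for the three patterns on an n x n mesh. It is an assumption-based approximation of the patterns described above rather than Chiu's actual traffic generator; in particular, the 10% hot-spot probability and the transpose mapping are illustrative choices.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
import random&lt;br /&gt;
&lt;br /&gt;
# Destination generators for an n x n mesh; nodes are (x, y) pairs.&lt;br /&gt;
def uniform_dest(src, n):&lt;br /&gt;
    # every node other than the source is equally likely&lt;br /&gt;
    dst = src&lt;br /&gt;
    while dst == src:&lt;br /&gt;
        dst = (random.randrange(n), random.randrange(n))&lt;br /&gt;
    return dst&lt;br /&gt;
&lt;br /&gt;
def transpose_dest(src, n):&lt;br /&gt;
    # node (x, y) sends to its mirrored position (y, x)&lt;br /&gt;
    x, y = src&lt;br /&gt;
    return (y, x)&lt;br /&gt;
&lt;br /&gt;
def hotspot_dest(src, n, hotspots, p_hot=0.10):&lt;br /&gt;
    # with probability p_hot the message targets a designated hot-spot node,&lt;br /&gt;
    # otherwise the destination is chosen uniformly&lt;br /&gt;
    if random.random() &amp;lt; p_hot:&lt;br /&gt;
        return random.choice(hotspots)&lt;br /&gt;
    return uniform_dest(src, n)&lt;br /&gt;
&lt;br /&gt;
# Example: a 15 x 15 mesh, as in the simulations above, with one hot spot.&lt;br /&gt;
print(hotspot_dest((0, 0), 15, hotspots=[(7, 7)]))&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;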
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_uniform.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Uniform traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The performance of the different routing algorithms under uniform traffic is shown above. For uniform traffic, the dimension-ordered x-y model outperforms the rest of the models: as the number of messages increases, it shows the slowest growth in average communication latency.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_transpose.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''First transpose traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The performance of the different routing algorithms is shown above for the first transpose traffic. The negative-first model has the best performance, while the odd-even model performs better than the west-first and x-y models.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_transpose2.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Second transpose traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
With the second transpose simulation, the odd-even model outperforms the rest.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_hotspot.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Hotspot traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The performance of the different routing algorithms under hot-spot traffic is shown above. Only one hotspot was simulated for this test. The odd-even model outperforms the other models when hotspot traffic is 10%.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_hotspot2.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Second hotspot traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
When the number of hotspots is increased to five, the odd-even model clearly stands out: its latency is the lowest at both 6 and 8 percent hotspot traffic, while the x-y model performs worst by a wide margin.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
While the x-y model performs well under uniform traffic, it lacks adaptiveness. When the traffic contains hotspots, the x-y model cannot adapt and re-route traffic to avoid the resulting congestion. The odd-even model shows superior adaptiveness under high congestion.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Router Architecture&amp;lt;/h1&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The '''router''' is a device that routes incoming data to its destination. It has several input ports and several output ports; data arriving at one of the input ports is routed to one of the output ports. Which output port is chosen depends on the destination of the data and on the routing algorithm.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The internal architecture of a router consists of input and output ports and a '''crossbar switch'''. The crossbar switch can connect any input port to any output port and selects which output each input is forwarded to, acting essentially as a set of multiplexers.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
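A toy model of that idea is sketched below: a router with a set of input queues and a crossbar that, in each cycle, grants each requested output port to at most one waiting input. The class name, the routing interface, and the first-come tie-break are illustrative assumptions for this article, not a description of any particular router.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
from collections import deque&lt;br /&gt;
&lt;br /&gt;
# Toy crossbar router: each input port holds a queue of flits, every flit&lt;br /&gt;
# carries the output port it wants, and the crossbar grants each output to&lt;br /&gt;
# at most one input per cycle (first match wins in this sketch).&lt;br /&gt;
class CrossbarRouter:&lt;br /&gt;
    def __init__(self, num_ports):&lt;br /&gt;
        self.num_ports = num_ports&lt;br /&gt;
        self.inputs = [deque() for _ in range(num_ports)]&lt;br /&gt;
&lt;br /&gt;
    def accept(self, in_port, flit, out_port):&lt;br /&gt;
        self.inputs[in_port].append((flit, out_port))&lt;br /&gt;
&lt;br /&gt;
    def cycle(self):&lt;br /&gt;
        granted = {}                  # maps out_port to (in_port, flit)&lt;br /&gt;
        for in_port, queue in enumerate(self.inputs):&lt;br /&gt;
            if queue and queue[0][1] not in granted:&lt;br /&gt;
                flit, out_port = queue[0]&lt;br /&gt;
                granted[out_port] = (in_port, flit)&lt;br /&gt;
        for out_port, (in_port, _) in granted.items():&lt;br /&gt;
            self.inputs[in_port].popleft()&lt;br /&gt;
        return granted                # flits forwarded across the crossbar this cycle&lt;br /&gt;
&lt;br /&gt;
# Example: a radix-4 router forwarding two flits that want different outputs.&lt;br /&gt;
r = CrossbarRouter(4)&lt;br /&gt;
r.accept(0, 'flit-A', out_port=2)&lt;br /&gt;
r.accept(1, 'flit-B', out_port=3)&lt;br /&gt;
print(r.cycle())&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;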
&lt;br /&gt;
Router technology has improved significantly over the years, which has made networks with high dimensionality feasible. As shown in the real-world example above, high-dimensional torii and hypercubes are an excellent choice of topology for high-performance networks, and the availability of high-performance, high-radix routers has contributed to the viability of these high-dimensionality networks. As the graph below shows, the bandwidth of routers has improved tremendously over a period of 10 years&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Router_bandwidth.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Bandwidth of various routers over 10 year period''&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Looking at the physical architecture and layout of routers, it is evident that the circuitry has become dramatically denser and more complex.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Router_physical.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Router hardware over period of time''&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Router_radix.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Radix and latency of routers over 10 year period''&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The '''radix''', or number of ports, of routers has also increased. Current routers not only have a higher radix but also lower latency than the previous generation, and the latency remains roughly steady as the radix increases.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
As router technology improves, increasingly complex, high-dimensionality topologies become practical.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Fault Tolerant Routing&amp;lt;/h1&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Fault-tolerant routing means the successful routing of messages between any pair of non-faulty nodes in the presence of faulty components&amp;lt;sup&amp;gt;6&amp;lt;/sup&amp;gt;. With the growing number of processors in multiprocessor systems and ever higher data rates, reliable transmission of data in the event of a network fault is a major concern, which makes fault-tolerant routing algorithms important.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Fault Models&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Faults in a network can be categorized in two types:&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
1. '''Transient Faults'''&amp;lt;sup&amp;gt;5&amp;lt;/sup&amp;gt;: A transient fault is a temporary fault that occurs for a very short duration of time. It can be caused, for example, by a change in the output of a flip-flop that leads to the generation of an invalid header. These faults can be mitigated using error-control coding, and they are generally quantified in terms of the Bit Error Rate (BER).&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
2. '''Permanent Faults'''&amp;lt;sup&amp;gt;5&amp;lt;/sup&amp;gt;: A permanent fault is a fault that does not go away and causes permanent damage to the network, for example due to damaged wires or associated circuitry. These faults are generally quantified in terms of the Mean Time Between Failures (MTBF).&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Fault Tolerance Mechanisms (for permanent faults)&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The permanent faults can be handled using one of the two mechanisms:&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
1. '''Static Mechanism''': In the static fault-tolerance model, once a fault is detected, all processes running in the system are stopped and the routing tables are emptied. Based on the fault information, the routing tables are then recalculated to provide fault-free paths.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
2. '''Dynamic Mechanisms''': In the dynamic fault-tolerance model, the processes in the network are not completely stalled; only the affected regions are repaired while the rest of the system continues to operate. Some of the methods to do this are:&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
a. '''Block Faults''': In this method, many of the healthy nodes in the vicinity of the faulty nodes are also marked as faulty so that no routes are created close to the actual faulty nodes. The blocked region can be convex or non-convex, and it is ensured that none of the new routes introduce a cyclic dependency in the channel dependency graph (CDG).&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
DISADVANTAGE: This method causes a lot of healthy nodes to be declared faulty, leading to a reduction in system capacity.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Fault_pic1.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
b. '''Fault Rings''': This method was introduced by Chalasani and Boppana. A fault ring is the set of nodes and links that are adjacent to faulty nodes or links. This approach reduces the number of healthy nodes that must be marked as faulty and blocked.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Fault_pic2.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
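&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
To illustrate the fault-ring idea, the sketch below marks, for an n x n 2-D mesh, the healthy nodes immediately surrounding a set of faulty nodes (including diagonal neighbours). The mesh representation and the neighbourhood choice are illustrative assumptions rather than the exact construction of Chalasani and Boppana.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Nodes forming a fault ring around faulty nodes in an n x n 2-D mesh.&lt;br /&gt;
# Here the ring is the set of healthy nodes that touch a faulty node&lt;br /&gt;
# horizontally, vertically, or diagonally.&lt;br /&gt;
def fault_ring(n, faulty):&lt;br /&gt;
    faulty = set(faulty)&lt;br /&gt;
    ring = set()&lt;br /&gt;
    for (fx, fy) in faulty:&lt;br /&gt;
        for dx in (-1, 0, 1):&lt;br /&gt;
            for dy in (-1, 0, 1):&lt;br /&gt;
                nx, ny = fx + dx, fy + dy&lt;br /&gt;
                inside = 0 &amp;lt;= nx &amp;lt; n and 0 &amp;lt;= ny &amp;lt; n&lt;br /&gt;
                if inside and (nx, ny) not in faulty:&lt;br /&gt;
                    ring.add((nx, ny))&lt;br /&gt;
    return ring&lt;br /&gt;
&lt;br /&gt;
# Example: one faulty node at the centre of a 5 x 5 mesh is enclosed by the&lt;br /&gt;
# eight healthy nodes of its fault ring.&lt;br /&gt;
print(sorted(fault_ring(5, [(2, 2)])))&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;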
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;References&amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
1 Solihin text&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
2 [http://www.ssrc.ucsc.edu/Papers/hospodor-mss04.pdf Interconnection Architectures for Petabyte-Scale High-Performance Storage Systems]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
3 [http://www.diit.unict.it/~vcatania/COURSES/semm_05-06/DOWNLOAD/noc_routing02.pdf The Odd-Even Turn Model for Adaptive Routing]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
4 [http://www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf Interconnection Topologies:(Historical Trends and Comparisons)]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
5 [http://dspace.upv.es/xmlui/bitstream/handle/10251/2603/tesisUPV2824.pdf?sequence=1 Efficient mechanisms to provide fault tolerance in interconnection networks for PC clusters, José Miguel Montañana Aliaga.]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
6 [http://web.ebscohost.com.www.lib.ncsu.edu:2048/ehost/pdfviewer/pdfviewer?vid=2&amp;amp;hid=15&amp;amp;sid=72e3828d-3cb1-42b9-8198-5c1e974ea53f@sessionmgr4 Adaptive Fault Tolerant Routing Algorithm for Tree-Hypercube Multicomputer, Qatawneh Mohammad]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
7 [http://www.top500.org TOP500 Supercomputing Sites]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
8 [http://cial.csie.ncku.edu.tw/presentation/group_pdf/%5BY2005%5DPerformance%20Evaluation%20and%20Design%20Trade-Offs%20For%20Network-on-Chip%20Interconnect%20Architectures.pdf Interconnection Network Topology Tradeoffs]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
9 [http://www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf Evolution of Interconnection Networks]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
10 [http://www2.engr.arizona.edu/~hpcat/papers/jlt94.pdf Hypercube pros and cons]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
11 [http://courses.csail.mit.edu/6.896/spring04/handouts/papers/fat_trees.pdf History of the Fat Tree Topology]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
12 [http://books.google.com/books?id=uyAg3zu_DYMC&amp;amp;pg=PA75&amp;amp;lpg=PA75&amp;amp;dq=butterfly+interconnection+network+topology&amp;amp;source=bl&amp;amp;ots=5tQm-TVTu0&amp;amp;sig=oKMwSlpqDJqopnYrKv0BbA8k5uU&amp;amp;hl=en&amp;amp;ei=5cmrTYezINDQiAL9uKjvDA&amp;amp;sa=X&amp;amp;oi=book_result&amp;amp;ct=result&amp;amp;resnum=8&amp;amp;ved=0CFMQ6AEwBw  Pros and Cons of the Butterfly Interconnection Network]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
NOTE: This wiki is based on a [http://pg-server.csc.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Spring_2010/12_EC previous wiki]&lt;/div&gt;</summary>
		<author><name>Vrmanda</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch12_ar&amp;diff=45129</id>
		<title>CSC/ECE 506 Spring 2011/ch12 ar</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch12_ar&amp;diff=45129"/>
		<updated>2011-04-19T00:00:01Z</updated>

		<summary type="html">&lt;p&gt;Vrmanda: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;h1&amp;gt;Interconnection Network Architecture &amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In a mul&lt;br /&gt;
ti-processor system, processors need to communicate with each other and access each other's resources. In order to route data and messages between processors, an interconnection architecture is needed.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Typically, in a multiprocessor system, message passed between processors are frequent and short&amp;lt;sup&amp;gt;1&amp;lt;/sup&amp;gt;. Therefore, the interconnection network architecture must handle messages quickly by having '''low latency''', and must handle several messages at a time and have '''high bandwidth'''. The minimum amount of data that can be transmitted in one cycle is called a '''phit'''. A phit is typically determined by to the width of the link. However, data is transferred at the granularity of link-level flow control unit called a '''flit'''. A flit worth of data can be accepted or rejected at the receiver, depending on the flow control protocol and amount of buffering available at the receiver.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In a network, a processor along with its cache and memory is considered a '''node'''. The physical wires that connect between them is called a '''link'''. A link can be unidirectional, in which data can only be sent in one direction, or bidirectional in which data can be sent in both directions. A link, the receiver and sender, make up a channel. The device that routes messages between nodes is called a router. The shape of the network, such as the number of links and routers, is called the network '''topology'''.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Various metrics that are to be considered in coming up with a decision on the interconnection networks are the following factors:&lt;br /&gt;
&amp;lt;p&amp;gt;1.&amp;lt;b&amp;gt; Diameter &amp;lt;/b&amp;gt;: The maximum distance in the network is defined as the diameter. And to compute the average distance, a list of all pairs of nodes is to be made and the average of those would determine the value.&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;2. &amp;lt;b&amp;gt;Bisection Bandwidth &amp;lt;/b&amp;gt;: A network can be partitioned in any number of ways. When a network is divided in such a fashion that it has two equal partitions then the minimum number of links that need to be cut is defined as the Bisection Bandwidth. &amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;3. &amp;lt;b&amp;gt; No. of Links&amp;lt;/b&amp;gt; : Number of links in a network is the set of wires that connect two different nodes in the network. &amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;4.&amp;lt;b&amp;gt; Degree&amp;lt;/b&amp;gt; : Number of input/output links connecting to each router is defined as the degree of the network.&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Types of Network Topologies &amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Linear Array&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_linear.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The nodes are connected linearly, as in an array. This type of topology is simple and low cost. However, it has several shortcomings. First of all, it does not scale well: the longest distance between two nodes, or the '''diameter''', is proportional to the number of nodes. In addition to not scaling well, this topology can also result in high congestion.  [[#References|&amp;lt;sup&amp;gt;[8]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Ring&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_ring.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The ring has a similar structure to the linear array, except that the end nodes connect to each other, forming a circular structure. This topology scales better, since the longest distance between two nodes is cut in half, but it still ends up scaling poorly if enough nodes are added. Congestion is also roughly halved, since there are now two directions in which packets can traverse the ring.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
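As a rough illustration of the halving mentioned above, the closed-form diameters of the two topologies can be compared directly (a minimal sketch assuming N nodes and bidirectional links; not taken from the references).&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
def linear_array_diameter(n):&lt;br /&gt;
    # Worst case is end to end across the whole array.&lt;br /&gt;
    return n - 1&lt;br /&gt;
&lt;br /&gt;
def ring_diameter(n):&lt;br /&gt;
    # Worst case is half-way around the ring.&lt;br /&gt;
    return n // 2&lt;br /&gt;
&lt;br /&gt;
for n in (4, 8, 16):&lt;br /&gt;
    print(n, linear_array_diameter(n), ring_diameter(n))&lt;br /&gt;
# 4: 3 vs 2,   8: 7 vs 4,   16: 15 vs 8&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;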
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;2-D Mesh&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_2Dmesh.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The 2-D mesh can be thought of as several linear arrays put together to form a two-dimensional structure. Nodes that are not on the edge have 4 input/output links, or a '''degree''' of 4. This topology has reasonably low energy dissipation without compromising throughput. The area overhead for this topology is also rather low: since a node is never farther than a clock cycle away, there is no need for repeater insertion between nodes (repeaters have a high area overhead, so using them increases the area of the topology). [[#References|&amp;lt;sup&amp;gt;[8]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
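A minimal sketch of the basic 2-D mesh metrics, assuming a square k x k mesh with bidirectional links (an illustrative assumption, not a figure from the reference):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
def mesh_diameter(k):&lt;br /&gt;
    # Opposite corners of a k x k mesh are 2 * (k - 1) hops apart.&lt;br /&gt;
    return 2 * (k - 1)&lt;br /&gt;
&lt;br /&gt;
def mesh_degree(row, col, k):&lt;br /&gt;
    # Interior nodes have 4 links; edge and corner nodes have fewer.&lt;br /&gt;
    candidates = ((row - 1, col), (row + 1, col), (row, col - 1), (row, col + 1))&lt;br /&gt;
    return sum(1 for r, c in candidates if r in range(k) and c in range(k))&lt;br /&gt;
&lt;br /&gt;
print(mesh_diameter(4))      # 6 hops, corner to corner in a 4 x 4 mesh&lt;br /&gt;
print(mesh_degree(1, 1, 4))  # 4 (interior node)&lt;br /&gt;
print(mesh_degree(0, 0, 4))  # 2 (corner node)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;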
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Cube&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_cube.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The cube can be thought of as a three-dimensional mesh. This topology continues to improve upon the throughput of the 2-D mesh; however, the power dissipation is slightly higher.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Hypercube&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_hypercube.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The hypercube is essentially multiple cubes put together. This topology was developed in 1983. Originally there were no routers; the next hops were simply programmed into each node. Today the hypercube topology is used by many companies, including Intel. It is attractive because of its small diameter. The nodes are numbered in such a way that every pair of neighboring nodes differs in only one bit, which greatly simplifies routing messages through the network. The biggest drawback of the topology is its lack of scalability: if the dimensionality is increased by one, a link must be added to every node in the network. [[#References|&amp;lt;sup&amp;gt;[10]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
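Because neighboring hypercube nodes differ in exactly one address bit, neighbor discovery and routing reduce to simple bit manipulation. The sketch below is illustrative only (the function names and the dimension-ordered routing order are assumptions, not taken from the cited reference):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
def hypercube_neighbors(node, dim):&lt;br /&gt;
    # Each neighbor of a node in a dim-dimensional hypercube differs in one bit.&lt;br /&gt;
    return [node ^ (2 ** d) for d in range(dim)]&lt;br /&gt;
&lt;br /&gt;
def hypercube_route(src, dst, dim):&lt;br /&gt;
    # Correct the differing address bits one dimension at a time.&lt;br /&gt;
    path = [src]&lt;br /&gt;
    current = src&lt;br /&gt;
    for d in range(dim):&lt;br /&gt;
        if (current ^ dst) // (2 ** d) % 2 == 1:&lt;br /&gt;
            current = current ^ (2 ** d)&lt;br /&gt;
            path.append(current)&lt;br /&gt;
    return path&lt;br /&gt;
&lt;br /&gt;
print(hypercube_neighbors(0b000, 3))     # [1, 2, 4]&lt;br /&gt;
print(hypercube_route(0b000, 0b101, 3))  # [0, 1, 5]: two hops for two differing bits&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;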
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;2-D Torus&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_2Dtorus.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The 2-D torus takes the structure of the 2-D mesh and connects the nodes on opposite edges. This decreases the diameter, but the number of links is higher. It therefore slightly improves upon the throughput of the 2-D mesh, but also slightly increases the power dissipation, since the number of links is higher. The wrap-around routes connecting the end nodes can have excessively high delay if the topology is not laid out carefully. This topology was developed in 1985 because of the design constraints, such as pin count and bisection, that the hypercube imposed. [[#References|&amp;lt;sup&amp;gt;[9]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
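To quantify the diameter reduction described above, the same k x k assumption as in the mesh sketch can be reused (illustrative formulas, not taken from the reference):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
def mesh_vs_torus_diameter(k):&lt;br /&gt;
    # The wrap-around links of the torus halve the worst-case distance per dimension.&lt;br /&gt;
    return 2 * (k - 1), 2 * (k // 2)&lt;br /&gt;
&lt;br /&gt;
print(mesh_vs_torus_diameter(8))   # (14, 8): the 8 x 8 torus needs 8 hops worst case&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;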
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Tree&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_tree.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The tree has a hierarchical structure, with processing nodes at the bottom and switching nodes at the upper levels. The tree experiences high traffic and high latency if a packet must travel through the upper levels. Also, because of the high connectivity, this topology has high average energy dissipation.[[#References|&amp;lt;sup&amp;gt;[8]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Fat Tree&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_fat_tree.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
In 1985, an MIT professor invented the fat tree to improve upon the normal “skinny” tree. The fat tree “fattens” up the links at the upper levels, which helps to alleviate traffic at those levels and to decrease message latency. However, fattening means that additional links are added in this area, which increases the average energy dissipated by the topology. [[#References|&amp;lt;sup&amp;gt;[11]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Butterfly&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_butterfly.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Another attempt to improve upon the “skinny” tree structure was the butterfly. The butterfly structure is similar to the tree structure, but it replicates the switching-node structure of the tree topology and connects the copies together so that there are equal numbers of links and routers at all levels. There are two problems with this topology. First, there is no path diversity: there is only one path from the root to a downstream node. This is not ideal in case the network is congested in one area but available in another; there is no way for the network to rebalance the load. Second, there are some very long routes in this topology, which require repeaters between the nodes and cause the physical area needed to implement the network to increase dramatically. [[#References|&amp;lt;sup&amp;gt;[12]&amp;lt;/sup&amp;gt;]] &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Comparison of Network Topologies &amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The following figure shows the total number of ports required for each network topology. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_ports.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Number of ports for each topology''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
As the figure above shows, the 6-D hypercube requires the largest number of ports, due to its relatively complex six-dimensional structure. In contrast, the fat tree requires the fewest ports, even though its links have been &amp;quot;fattened&amp;quot; up by using redundant links. The butterfly network requires more than twice the number of ports of the fat tree, since it essentially replicates the switching layer of the fat tree. The number of ports for the mesh and torus structures increases as the dimensionality increases.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Below, the average path length, or average number of hops, and the average link load (GB/s) are shown.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_load.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Average path length and link load for each topology''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Looking at the trends, when the average path length is high, the average link load is also high; in other words, average path length and average link load are proportionally related (a back-of-the-envelope sketch of this relationship follows). It is clear from the graph that the 2-D mesh has, by far, the worst performance. In a network this large, the average path length is simply too high, and the average link load suffers; for this type of high-performance network, the 2-D mesh does not scale well. The 2-D torus cuts the average path length and average link load in half by connecting the edge nodes together, but its performance compared to the other types is still relatively poor. The butterfly and fat tree have the lowest average path length and average link load. &lt;br /&gt;
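One rough way to see why link load tracks path length is that every byte injected into the network occupies, on average, one link per hop along its path. The sketch below uses hypothetical numbers (the 5000-link count is an assumption, not a figure from the study):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
def average_link_load(aggregate_gbps, avg_hops, num_links):&lt;br /&gt;
    # Total carried traffic is the injected traffic multiplied by the average path&lt;br /&gt;
    # length, spread over all links in the network.&lt;br /&gt;
    return aggregate_gbps * avg_hops / num_links&lt;br /&gt;
&lt;br /&gt;
# Two hypothetical topologies with 5000 links each and 100 GB/s of injected traffic:&lt;br /&gt;
print(average_link_load(100, 10, 5000))  # 0.2 GB/s per link with long paths&lt;br /&gt;
print(average_link_load(100, 4, 5000))   # 0.08 GB/s per link with short paths&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;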
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The figure below shows the cost of the network topologies.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_cost.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Cost of each topology''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Despite using the fewest ports, the fat tree topology has by far the highest cost. Although it uses the fewest ports, they are high-bandwidth ports of 10 GB/s, and over 2,400 such ports are required to have enough bandwidth at the upper levels of the tree. This pushes the cost up dramatically, and from a cost standpoint it is impractical. While the total cost of the fat tree is about 15 million dollars, the rest of the network topologies are clustered below 4 million dollars. As the dimensionality of the mesh and torus structures increases, the cost increases. The butterfly network costs between the 2-D mesh/torus and the 6-D hypercube. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
When the cost and average link load is factored the following graph is produced.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_overall.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Overall cost of each topology''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
From the figure above, the 6-D hypercube is the most cost-effective choice for this particular network setup. Although the 6-D hypercube costs more because it needs more links and ports, it provides higher bandwidth, which can offset the higher cost. The high-dimensional tori also perform well, but cannot provide as much bandwidth as the 6-D hypercube; for systems that do not need as much bandwidth, they are also a good choice. The butterfly topology is another alternative, but it has lower fault tolerance. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Real-World Implementation of Network Topologies &amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In a research study by Andy Hospodor and Ethan Miller, several network topologies were investigated for a high-performance, high-traffic network&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;, including the fat tree, butterfly, mesh, torus, and hypercube structures. Advantages and disadvantages in terms of cost, performance, and reliability were discussed. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In this experiment, a petabyte-scale network with over 100 GB/s of total aggregate bandwidth was investigated. The network consisted of 4096 disks attached to large servers, with routers and switches in between&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The overall structure of the network is shown below. Note that this structure is very susceptible to failure and congestion.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_network.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Basic structure of Hospodor and Miller's experimental network''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Fat Tree&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In large-scale, high-performance applications, the fat tree can be a reasonable choice. However, in order to &amp;quot;fatten&amp;quot; up the links, redundant connections must be used: instead of one link between switching nodes, several are needed. The problem is that with more input and output links, routers with more input and output ports are required. Routers with in excess of 100 ports are difficult to build and expensive, so multiple routers would have to be stacked together. Even so, several such routers would be needed, and they would be expensive&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The Japan Agency for Marine-Earth Science and Technology supercomputing system uses the fat tree topology. The system connects 1280 NEC processors&amp;lt;sup&amp;gt;7&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Butterfly&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In high-performance applications, the butterfly structure can be a good choice. The butterfly topology uses fewer links than other topologies; however, each link carries the traffic of an entire layer. Fault tolerance is poor: there exists only a single path between a pair of nodes, so should a link break, data cannot be re-routed and communication is broken&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_butterfly.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Butterfly structure''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Meshes and Tori&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The mesh and torus structures used in this application would require a large number of links and an aggregate total of several thousand ports. However, since there are so many links, the mesh and torus structures provide alternate paths in case of failures&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
One example of current use of the torus structure is the QPACE SFB TR cluster in Germany, which uses PowerXCell 8i processors. The system uses a 3-D torus topology with 4608 processors&amp;lt;sup&amp;gt;7&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_mesh.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Mesh structure''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_torus.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Torus structure''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Hypercube&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Similar to the torus structures, the hypercube requires a larger number of links. However, its bandwidth scales better than that of the mesh and torus structures. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The CRAY T3E, CRAY XT3, and SGI Origin 2000 use k-ary n-cube topologies.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_hypercube.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Hypercube structure''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Routing&amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The '''routing''' algorithm determines what path a packet of data takes from source to destination. Routing can be '''deterministic''', where the path is always the same for a given source and destination, or '''adaptive''', where the path can change. A routing algorithm can also be '''partially adaptive''', where packets have multiple path choices but not all packets are allowed to use the shortest path&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Deadlock&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Packets are in '''deadlock''' when they cannot continue to move through the nodes. The illustration below demonstrates this situation. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_deadlock.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Example of deadlock''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Assume that all of the buffers are full at each node. The packet from Node 1 cannot continue to Node 2, the packet from Node 2 cannot continue to Node 3, and so on. Since no packet can move, the network is deadlocked. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Deadlock arises from a cyclic pattern of routing; to avoid deadlock, circular routing patterns must be avoided.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To break such cycles, certain routing patterns are disallowed. These are called '''turn restrictions''': some turns are forbidden so that no circular routing pattern can form. Several turn-restriction models are described below.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;h2&amp;gt;Dimensional ordered (X-Y) routing&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Packets are routed first along the x-dimension and then along the y-dimension; turns from the y-dimension back to the x-dimension are not allowed.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
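A minimal sketch of dimension-ordered X-Y routing (illustrative only): because a packet finishes all of its x-dimension movement before moving in the y-dimension, a y-to-x turn never occurs and no routing cycle can form.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
def xy_route(src, dst):&lt;br /&gt;
    # src and dst are (x, y) coordinates on a 2-D mesh.&lt;br /&gt;
    x, y = src&lt;br /&gt;
    path = [(x, y)]&lt;br /&gt;
    while x != dst[0]:                  # move along the x-dimension first&lt;br /&gt;
        x += (dst[0] - x) // abs(dst[0] - x)&lt;br /&gt;
        path.append((x, y))&lt;br /&gt;
    while y != dst[1]:                  # then along y; never turn back to x&lt;br /&gt;
        y += (dst[1] - y) // abs(dst[1] - y)&lt;br /&gt;
        path.append((x, y))&lt;br /&gt;
    return path&lt;br /&gt;
&lt;br /&gt;
print(xy_route((0, 0), (2, 1)))  # [(0, 0), (1, 0), (2, 0), (2, 1)]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;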
&lt;br /&gt;
&amp;lt;h2&amp;gt;West First&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Turns to the west are not allowed.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;North Last&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Turns after a north direction are not allowed. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Negative First&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Turns in the negative direction (-x or -y) are not allowed, except on the first turn.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Odd-Even Turn Model&amp;lt;/h2&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Unfortunately, the above turn-restriction models reduce the degree of adaptiveness and are only partially adaptive. They force some packets onto different, not necessarily minimal, paths. This can cause unfairness and reduces the system's ability to relieve congestion, so overall performance may suffer&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Ge-Ming Chiu introduced the Odd-Even turn model as a deadlock-free turn-restriction model that is more adaptive and performs better than the previously mentioned models&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;. The model is designed primarily for 2-D meshes.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
''Turns from the east to north direction from any node on an even column are not allowed.''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Turns from the north to west direction from any node on an odd column are not allowed.''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Turns from the east to south direction from any node on an even column are not allowed.''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Turns from the south to west direction from any node on an odd column are not allowed.''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The illustration below shows allowed routing for different source and destination nodes. Depending on which column the packet is in, only certain directions are allowed. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_odd_even.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Odd-Even turn restriction model proposed by Ge-Ming Chiu''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
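The four restrictions quoted above can be expressed as a single predicate. The sketch below is an illustrative encoding (the column numbering convention and direction names are assumptions, not taken from Chiu's paper):&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
def odd_even_turn_allowed(column, incoming, outgoing):&lt;br /&gt;
    # incoming and outgoing are directions of travel: 'N', 'S', 'E', or 'W'.&lt;br /&gt;
    even = (column % 2 == 0)&lt;br /&gt;
    if even and incoming == 'E' and outgoing in ('N', 'S'):&lt;br /&gt;
        return False   # no east-to-north or east-to-south turns in even columns&lt;br /&gt;
    if not even and incoming in ('N', 'S') and outgoing == 'W':&lt;br /&gt;
        return False   # no north-to-west or south-to-west turns in odd columns&lt;br /&gt;
    return True&lt;br /&gt;
&lt;br /&gt;
print(odd_even_turn_allowed(2, 'E', 'N'))  # False: even column, east-to-north&lt;br /&gt;
print(odd_even_turn_allowed(3, 'N', 'W'))  # False: odd column, north-to-west&lt;br /&gt;
print(odd_even_turn_allowed(3, 'E', 'N'))  # True&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;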
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Comparison of Turn Restriction Models&amp;lt;/h1&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
To evaluate the performance of the various turn-restriction models, Chiu simulated a 15 x 15 mesh under various traffic patterns. All channels have a bandwidth of 20 flits/usec and a buffer size of one flit. The dimension-ordered x-y, west-first, and negative-first models were compared against the odd-even model. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Traffic patterns including uniform, transpose, and hot-spot traffic were simulated. Under uniform traffic, each node sends messages to every other node with equal probability. Under transpose traffic, nodes send messages to the nodes at the transposed positions in the mesh. Hot-spot traffic designates a few &amp;quot;hot spot&amp;quot; nodes that receive a disproportionate share of the traffic.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_uniform.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Uniform traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The performance of the different routing algorithms under uniform traffic is shown above. For uniform traffic, the dimension-ordered x-y model outperforms the rest: as the number of messages increases, it has the slowest growth in average communication latency. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_transpose.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''First transpose traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The performance of the different routing algorithms is shown above for the first transpose traffic. The negative-first model has the best performance, while the odd-even model performs better than the west-first and x-y models.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_transpose2.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Second transpose traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
With the second transpose simulation, the odd-even model outperforms the rest.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_hotspot.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Hotspot traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The performance of the different routing algorithms under hotspot traffic is shown above. Only one hotspot was simulated in this test. The odd-even model outperforms the other models when the hotspot traffic is 10%.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_hotspot2.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Second hotspot traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
When the number of hotspots is increased to five, the odd-even model's advantage becomes clear: its latency is the lowest for both the 6 and 8 percent hotspot cases. Meanwhile, the x-y model performs worst by a wide margin. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
While the x-y model performs well under uniform traffic, it lacks adaptiveness. When the traffic contains hotspots, the x-y model suffers from its inability to adapt and re-route traffic around the congestion caused by the hotspots. The odd-even model's adaptiveness gives it superior performance under such congestion. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Router Architecture&amp;lt;/h1&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The '''router''' is the device that routes incoming data to its destination. It has several input ports and several output ports; data arriving on one of the input ports is forwarded to one of the output ports. Which output port is chosen depends on the destination of the data and on the routing algorithm. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The internal architecture of a router consists of input and output ports connected by a '''crossbar switch'''. The crossbar switch selects which output port each input should be connected to, acting essentially as a set of multiplexers. &lt;br /&gt;
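A minimal sketch of the crossbar idea (illustrative, not a model of any particular router): each output port is driven by at most one granted input port per cycle, much like a bank of multiplexers.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Crossbar:&lt;br /&gt;
    def __init__(self, num_ports):&lt;br /&gt;
        self.num_ports = num_ports&lt;br /&gt;
        self.selection = {}     # maps each output port to its granted input port&lt;br /&gt;
&lt;br /&gt;
    def connect(self, in_port, out_port):&lt;br /&gt;
        # Grant the connection only if the output port is not already taken.&lt;br /&gt;
        if out_port in self.selection:&lt;br /&gt;
            return False&lt;br /&gt;
        self.selection[out_port] = in_port&lt;br /&gt;
        return True&lt;br /&gt;
&lt;br /&gt;
    def transfer(self, flits_at_inputs):&lt;br /&gt;
        # Forward one flit from each granted input to its selected output port.&lt;br /&gt;
        return {out: flits_at_inputs[inp] for out, inp in self.selection.items()}&lt;br /&gt;
&lt;br /&gt;
xbar = Crossbar(4)&lt;br /&gt;
xbar.connect(0, 2)&lt;br /&gt;
xbar.connect(1, 2)                                 # denied: output 2 already claimed&lt;br /&gt;
print(xbar.transfer({0: 'flit-A', 1: 'flit-B'}))   # {2: 'flit-A'}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;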
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Router technology has improved significantly over the years, which has made networks with high dimensionality feasible. As shown in the real-world example above, high-dimensional tori and hypercubes are excellent choices of topology for high-performance networks. The improving cost and performance of high-radix routers has contributed to the viability of these high-dimensionality networks. As the graph below shows, the bandwidth of routers has improved tremendously over a period of 10 years&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Router_bandwidth.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Bandwidth of various routers over 10 year period''&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Looking at the physical architecture and layout of routers, it is evident that the circuitry has become dramatically more dense and complex.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Router_physical.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Router hardware over period of time''&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Router_radix.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Radix and latency of routers over 10 year period''&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The '''radix''', or number of ports, of routers has also increased. Current routers not only have a high radix but also lower latency than the previous generation; as the radix increases, the latency remains steady. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
With high-performance routers, complex topologies become practical; as router technology continues to improve, even more complex, higher-dimensionality topologies become possible. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Fault Tolerant Routing&amp;lt;/h1&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Fault-tolerant routing means the successful routing of messages between any pair of non-faulty nodes in the presence of faulty components&amp;lt;sup&amp;gt;6&amp;lt;/sup&amp;gt;. With the increased number of processors in multiprocessor systems and the high data rates involved, reliable transmission of data in the event of a network fault is a great concern; hence, fault-tolerant routing algorithms are important.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Fault Models&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Faults in a network can be categorized in two types:&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
1. '''Transient Faults'''&amp;lt;sup&amp;gt;5&amp;lt;/sup&amp;gt;: A transient fault is a temporary fault that occurs for a very short duration of time. Such a fault can be caused, for example, by a change in the output of a flip-flop leading to the generation of an invalid header. These faults can be minimized using error-control coding, and they are generally evaluated in terms of the bit error rate.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
2. '''Permanent Faults'''&amp;lt;sup&amp;gt;5&amp;lt;/sup&amp;gt;: A permanent fault is a fault that does not go away and causes permanent damage to the network, for example due to damaged wires and associated circuitry. These faults are generally evaluated in terms of the mean time between failures.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Fault Tolerance Mechanisms (for permanent faults)&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The permanent faults can be handled using one of the two mechanisms:&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
1. '''Static Mechanism''': In the static fault-tolerance model, once a fault is detected, all processes running in the system are stopped and the routing tables are emptied. Based on the fault information, the routing tables are then recalculated to provide fault-free paths.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
2. '''Dynamic Mechanisms''': In the dynamic fault-tolerance model, the operation of the processes in the network is not completely stalled; only the affected regions are reconfigured. Some of the methods used to do this are:&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
a. '''Block Faults''': In this method, many of the healthy nodes in the vicinity of the faulty nodes are also marked as faulty so that no routes are created close to the actual faulty nodes. The shape of the blocked region can be convex or non-convex, and it is ensured that none of the new routes introduces a cyclic dependency in the cyclic dependency graph (CDG).&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
DISADVANTAGE: This method causes a lot of healthy nodes to be declared faulty, leading to a reduction in system capacity.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Fault_pic1.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
b. '''Fault Rings''': This method was introduced by Chalasani and Boppana. A fault ring is the set of nodes and links that are adjacent to faulty nodes/links. This approach reduces the number of healthy nodes that must be marked as faulty and blocked.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Fault_pic2.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;References&amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
1 Solihin text&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
2 [http://www.ssrc.ucsc.edu/Papers/hospodor-mss04.pdf Interconnection Architectures for Petabyte-Scale High-Performance Storage Systems]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
3 [http://www.diit.unict.it/~vcatania/COURSES/semm_05-06/DOWNLOAD/noc_routing02.pdf The Odd-Even Turn Model for Adaptive Routing]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
4 [http://www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf Interconnection Topologies:(Historical Trends and Comparisons)]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
5 [http://dspace.upv.es/xmlui/bitstream/handle/10251/2603/tesisUPV2824.pdf?sequence=1 Efficient mechanisms to provide fault tolerance in interconnection networks for PC clusters, José Miguel Montañana Aliaga.]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
6 [http://web.ebscohost.com.www.lib.ncsu.edu:2048/ehost/pdfviewer/pdfviewer?vid=2&amp;amp;hid=15&amp;amp;sid=72e3828d-3cb1-42b9-8198-5c1e974ea53f@sessionmgr4 Adaptive Fault Tolerant Routing Algorithm for Tree-Hypercube Multicomputer, Qatawneh Mohammad]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
7 [http://www.top500.org TOP500 Supercomputing Sites]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
8 [http://cial.csie.ncku.edu.tw/presentation/group_pdf/%5BY2005%5DPerformance%20Evaluation%20and%20Design%20Trade-Offs%20For%20Network-on-Chip%20Interconnect%20Architectures.pdf Interconnection Network Topology Tradeoffs]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
9 [http://www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf Evolution of Interconnection Networks]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
10 [http://www2.engr.arizona.edu/~hpcat/papers/jlt94.pdf Hypercube pros and cons]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
11 [http://courses.csail.mit.edu/6.896/spring04/handouts/papers/fat_trees.pdf History of the Fat Tree Topology]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
12 [http://books.google.com/books?id=uyAg3zu_DYMC&amp;amp;pg=PA75&amp;amp;lpg=PA75&amp;amp;dq=butterfly+interconnection+network+topology&amp;amp;source=bl&amp;amp;ots=5tQm-TVTu0&amp;amp;sig=oKMwSlpqDJqopnYrKv0BbA8k5uU&amp;amp;hl=en&amp;amp;ei=5cmrTYezINDQiAL9uKjvDA&amp;amp;sa=X&amp;amp;oi=book_result&amp;amp;ct=result&amp;amp;resnum=8&amp;amp;ved=0CFMQ6AEwBw  Pros and Cons of the Butterfly Interconnection Network]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
NOTE: This wiki is based on a [http://pg-server.csc.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Spring_2010/12_EC previous wiki]&lt;/div&gt;</summary>
		<author><name>Vrmanda</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch12_ar&amp;diff=45128</id>
		<title>CSC/ECE 506 Spring 2011/ch12 ar</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch12_ar&amp;diff=45128"/>
		<updated>2011-04-18T23:57:53Z</updated>

		<summary type="html">&lt;p&gt;Vrmanda: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;h1&amp;gt;Interconnection Network Architecture &amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In a mul&lt;br /&gt;
ti-processor system, processors need to communicate with each other and access each other's resources. In order to route data and messages between processors, an interconnection architecture is needed.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Typically, in a multiprocessor system, message passed between processors are frequent and short&amp;lt;sup&amp;gt;1&amp;lt;/sup&amp;gt;. Therefore, the interconnection network architecture must handle messages quickly by having '''low latency''', and must handle several messages at a time and have '''high bandwidth'''. The minimum amount of data that can be transmitted in one cycle is called a '''phit'''. A phit is typically determined by to the width of the link. However, data is transferred at the granularity of link-level flow control unit called a '''flit'''. A flit worth of data can be accepted or rejected at the receiver, depending on the flow control protocol and amount of buffering available at the receiver.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In a network, a processor along with its cache and memory is considered a '''node'''. The physical wires that connect between them is called a '''link'''. A link can be unidirectional, in which data can only be sent in one direction, or bidirectional in which data can be sent in both directions. A link, the receiver and sender, make up a channel. The device that routes messages between nodes is called a router. The shape of the network, such as the number of links and routers, is called the network '''topology'''.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Various metrics that are to be considered in coming up with a decision on the interconnection networks are the following factors:&lt;br /&gt;
&amp;lt;p&amp;gt;1.&amp;lt;b&amp;gt; Diameter &amp;lt;/b&amp;gt;: The maximum distance in the network is defined as the diameter. And to compute the average distance, a list of all pairs of nodes is to be made and the average of those would determine the value.&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;2. &amp;lt;b&amp;gt;Bisection Bandwidth &amp;lt;/b&amp;gt;: A network can be partitioned in any number of ways. When a network is divided in such a fashion that it has two equal partitions then the minimum number of links that need to be cut is defined as the Bisection Bandwidth. &amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;3. &amp;lt;b&amp;gt; No. of Links&amp;lt;/b&amp;gt; : Number of links in a network is the set of wires that connect two different nodes in the network. &amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;4.&amp;lt;b&amp;gt; Degree&amp;lt;/b&amp;gt; : Number of input/output links connecting to each router is defined as the degree of the network.&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Types of Network Topologies &amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Linear Array&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_linear.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The nodes are connected linearly as in an array. This type of topology is simple and low cost. However, it has several shortcoming.  First of all, it does not scale well. The longest distance between two nodes, or the '''diameter''', is equivalent to the number of nodes.  In addition to not scaling well, this topology can also result in high congestion.  [[#References|&amp;lt;sup&amp;gt;[8]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Ring&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_ring.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Similar structure as the linear array, except, the ending nodes connect to each other, establishing a circular structure. This topology will scale better since the longest distance between two nodes is cut in half, but it eventually ends up scaling poorly if enough nodes are added.  The congestion will also be cut in half since there is now 2 options for packets to traverse.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;2-D Mesh&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_2Dmesh.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The 2-D mesh can be thought of as several linear arrays put together to form a 2-dimensional structure. Nodes that are not on the edge have 4 input or output links, or a '''degree''' of 4.  This topology has reasonably low energy dissipation without compromising throughput.  The area overhead for this topology is also rather low since a node is never farther than a clock cycle away, this means there is no need for repeater insertion in between the nodes.  (the repeaters have a high area overhead, so using them causes the area of the topology to increase) [[#References|&amp;lt;sup&amp;gt;[8]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Cube&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_cube.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The cube can be thought of as a three-dimensional mesh. This topology continues to improve upon the throughput in the 2-D mesh, however the power dissipation will be slightly higher.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Hypercube&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_hypercube.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The hypercube is essentially multiple cubes put together.  This topology was developed in 1983.  Originally there were no routers.  The next hops were just programmed into each node.  Today the hypercube topology is used by many companies including Intel.  It is so attractive because of its small diameter.  The nodes are numbered in such a way that every neighboring node is only one bit difference.  This greatly increases the ability to route messages through the network. The biggest drawback of the topology is the lack of scalability.  For example if the dimension size is increased by one, one link will need to be added to every node in the network. [[#References|&amp;lt;sup&amp;gt;[10]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;2-D Torus&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_2Dtorus.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The 2-D torus takes the structure of the 2-D mesh and connects the nodes on the edges. This decreases the diameter, but the number of links is higher.  Thus it slightly improves upon the throughput of the 2-D Mesh, but it also slightly increases the power dissipation relative to the 2-D Mesh since the number of links is higher.  The delays on the routes connecting the end nodes together can have an excessively high delays if the topology is not implemented correctly.(printout)  This topology was developed in 1985, because of the design constraints, such as pins and bisection, that the hypercube required. [[#References|&amp;lt;sup&amp;gt;[9]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Tree&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_tree.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The tree has a hierarchical structure of nodes on the bottom and switching nodes at the upper levels. The tree experiences high traffic and high latency if the package must travel through the upper levels.  Also because of the high connectivity, this topology has high average energy dissipation.[[#References|&amp;lt;sup&amp;gt;[8]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Fat Tree&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_fat_tree.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
In 1985, an MIT professor invented the fat tree to improve upon the normal “skinny” tree.  To improve upon the “skinny” tree topology, the fat tree “fattens” up the links at the upper levels.  This helps to alleviate the traffic at upper levels and to decrease the latency of the message.  However, by fattening, it is meant that additional links are added to this area, which will increase the average energy dissipated by this topology. [[#References|&amp;lt;sup&amp;gt;[11]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Butterfly&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_butterfly.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Another attempt to increase the “skinny” tree structure was the butterfly structure.  The butterfly structure is similar to the tree structure, but it replicates the switching node structure of the tree topology and connects them together so that there are equal links and routers at all levels.  There are 2 problems with this topology.  First of all, there is no path diversity in this topology.  There is only one path from the root to a downstream node.  This is not ideal incase the network is congested in a certain area, but available in another.  There is no way for the network to rebalance the work.  Second of all, there are some very long routes in this topology.  This requires there to be repeaters in between the nodes, which will cause the physical area to implement the network to increase dramatically. [[#References|&amp;lt;sup&amp;gt;[12]&amp;lt;/sup&amp;gt;]] &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Comparison of Network Topologies &amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The following table shows the total number of ports required for each network topology. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_ports.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Number of ports for each topology''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
As the figure above shows, the 6-D hypercube requires the largest number of ports, due to its relatively complex six-dimensional structure. In contrast, the fat tree requires the least number of ports, even though links have been &amp;quot;fattened&amp;quot; up by using redundant links. The butterfly network requires more than twice the number of ports as the fat tree, since it essentially replicates the switching layer of the fat tree. The number of ports for the mesh and torii structures increase as the dimensionality increases.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Below the average path length, or average number of hops, and the average link load (GB/s) is shown.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_load.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Average path length and link load for each topology''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Looking at the trends, when average path length is high, the average link load is also high. In other words, average path length and average link load are proportionally related. It is obvious from the graph that 2-D mesh has, by far, the worst performance. In a large network such as this, the average path length is just too high, and the average link load suffers. For this type of high-performance network, the 2-D mesh does not scale well. Likewise the 2-D torus cuts the average path length and average link load in half by connected the edge nodes together, however, the performance compared to other types is relatively poor. The butterfly and fat-tree have the least average path length and average link load. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The figure below shows the cost of the network topologies.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_cost.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Cost of each topology''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Despite using the fewest number of ports, the fat tree topology has the highest cost, by far. Although it uses the fewest ports, the ports are high bandwidth ports of 10 GB/s. Over 2400, ports of 10 GB/s are required have enough bandwidth at the upper levels of the tree. This pushes the cost up dramatically, and from a cost standpoint is impractical. While the total cost of fat tree is about 15 million dollars, the rest of the network topologies are clustered below 4 million dollars. When the dimensionalality of the mesh and torii structures increase, the cost increases. The butterfly network costs between the 2-D mesh/torii and the 6-D hypercube. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
When the cost and average link load is factored the following graph is produced.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_overall.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Overall cost of each topology''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
From the figure above, the 6-D hypercube demonstrates the most cost effective choice on this particular network setup. Although the 6-D hypercube costs more because it needs more links and ports, it provides higher bandwidth, which can offset the higher cost. The high dimensional torii also perform well, but cannot provide as much bandwidth as the 6-D hypercube. For systems that do not need as much bandwidth, the high-dimensional torii is also a good choice. The butterfly topology is also an alternative, but has lower fault tolerance. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Importance and Usage of 2D Mesh Networks:&lt;br /&gt;
Various metrics that are to be considered in coming up with a decision on the interconnection networks are the following factors:&lt;br /&gt;
1. Diameter : The maximum distance in the network is defined as the diameter. And to compute the average distance, a list of all pairs of nodes is to be made and the average of those would determine the value.&lt;br /&gt;
2. Bisection Bandwidth : A network can be partitioned in any number of ways. When a network is divided in such a fashion that it has two equal partitions then the minimum number of links that need to be cut is defined as the Bisection Bandwidth. &lt;br /&gt;
3. No. of Links : Number of links in a network is the set of wires that connect two different nodes in the network.&lt;br /&gt;
4. Degree : Number of input/output links connecting to each router is defined as the degree of the network.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Real-World Implementation of Network Topologies &amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In a research study by Andy Hospodor and Ethan Miller, several network topologies were investigated in a high-performance, high-traffic network&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;. Several topologies were investigated including the fat tree, butterfly, mesh, torii, and hypercube structures. Advantages and disadvantages including cost, performance, and reliability were discussed. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In this experiment, a petabyte-scale network with over 100 GB/s total aggregate bandwidth was investigated. The network consisted of 4096 disks with large servers with routers and switches in between&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The overall structure of the network is shown below. Note that this structure is very susceptible to failure and congestion.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_network.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Basic structure of Hospodor and Miller's experimental network''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Fat Tree&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In large scale, high performance applications, fat tree can be a choice. However, in order to &amp;quot;fatten&amp;quot; up the links, redundant connections must be used. Instead of using one link between switching nodes, several must be used. The problem with this is that with more input and output links, one would need routers with more input and output ports. Router with excess of 100 ports are difficult to build and expensive, so multiple routers would have to be stacked together. Still, the routers would be expensive and would require several of them&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The Japan Agency for Marine-Earth Science and Technology supercomputing system uses the fat tree topology. The system connects 1280 processors using NEC processors&amp;lt;sup&amp;gt;7&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Butterfly&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In high performance applications, the butterfly structure is a good choice. The butterfly topology uses fewer links than other topologies, however, each link carries traffic from the entire layer. Fault tolerance is poor. There exists only a single path between pairs of nodes. Should the link break, data cannot be re-routed, and communication is broken&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_butterfly.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Butterfly structure''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Meshes and Tori&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The mesh and torus structure used in this application would require a large number of links and total aggregate of several thousands of ports. However, since there are so many links, the mesh and torus structures provide alternates paths in case of failures&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Some examples of current use of torus structure include the QPACE SFB TR Cluster in Germany using the PowerXCell 8i processors. The systems uses 3-D torus topology with 4608 processors&amp;lt;sup&amp;gt;7&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_mesh.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Mesh structure''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_torus.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Torus structure''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
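&amp;lt;p&amp;gt;&lt;br /&gt;
The benefit of the wrap-around links can be seen from a simple hop-count comparison. The snippet below is only an illustrative calculation on an assumed 8 x 8 network; the sizes are not taken from the referenced study.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Sketch: minimum hop count between two nodes in a k x k 2-D mesh&lt;br /&gt;
# versus the same pair of nodes in a k x k 2-D torus (wrap-around links).&lt;br /&gt;
&lt;br /&gt;
def mesh_hops(a, b):&lt;br /&gt;
    # Manhattan distance between nodes a = (x1, y1) and b = (x2, y2).&lt;br /&gt;
    return abs(a[0] - b[0]) + abs(a[1] - b[1])&lt;br /&gt;
&lt;br /&gt;
def torus_hops(a, b, k):&lt;br /&gt;
    # Each dimension may also wrap around, so take the shorter way round.&lt;br /&gt;
    dx = abs(a[0] - b[0])&lt;br /&gt;
    dy = abs(a[1] - b[1])&lt;br /&gt;
    return min(dx, k - dx) + min(dy, k - dy)&lt;br /&gt;
&lt;br /&gt;
if __name__ == '__main__':&lt;br /&gt;
    src, dst, k = (0, 0), (7, 7), 8      # opposite corners of an 8 x 8 network&lt;br /&gt;
    print(mesh_hops(src, dst))           # 14 hops in the mesh&lt;br /&gt;
    print(torus_hops(src, dst, k))       # 2 hops in the torus&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;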
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Hypercube&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Similar to the torus structures, the hypercube requires a larger number of links. However, its bandwidth scales better than that of the mesh and torus structures. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The CRAY T3E, CRAY XT3, and SGI Origin 2000 use k-ary n-cube topologies.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_hypercube.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Hypercube structure''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
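&amp;lt;p&amp;gt;&lt;br /&gt;
The favorable scaling of the hypercube comes from its addressing: an n-dimensional hypercube has 2^n nodes, neighboring nodes differ in exactly one address bit, and the shortest path between any two nodes is their Hamming distance, so the diameter is only n. The snippet below is a small, self-contained illustration of this property.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Sketch: hop count between two nodes of an n-dimensional hypercube.&lt;br /&gt;
# Nodes are numbered so that neighbors differ in exactly one address bit;&lt;br /&gt;
# the shortest path length equals the Hamming distance of the addresses.&lt;br /&gt;
&lt;br /&gt;
def hypercube_hops(src, dst):&lt;br /&gt;
    return bin(src ^ dst).count('1')&lt;br /&gt;
&lt;br /&gt;
if __name__ == '__main__':&lt;br /&gt;
    # In a 6-D hypercube (64 nodes) no two nodes are more than 6 hops apart.&lt;br /&gt;
    print(hypercube_hops(0b000000, 0b101101))   # 4 hops&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;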
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Routing&amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The '''routing''' algorithm determines what path a packet of data will take from source to destination. Routing can be '''deterministic''', where the path is always the same for a given source and destination, or '''adaptive''', where the path can change. A routing algorithm can also be '''partially adaptive''', where packets have multiple path choices but not all packets are allowed to use the shortest path&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Deadlock&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Packets are in '''deadlock''' when they can no longer move through the nodes. The illustration below demonstrates this situation. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_deadlock.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Example of deadlock''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Assume that all of the buffers at each node are full. The packet from Node 1 cannot continue to Node 2, the packet from Node 2 cannot continue to Node 3, and so on. Since no packet can move, the network is deadlocked. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Deadlock arises from a cyclic pattern of routing; to avoid deadlock, circular routing patterns must be avoided.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To prevent circular routing patterns, certain turns are disallowed. These prohibitions are called '''turn restrictions''': by forbidding particular turns, no circular routing pattern can form. Some common turn restrictions are described below.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
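&amp;lt;p&amp;gt;&lt;br /&gt;
Before looking at the individual restrictions, note that the link between deadlock and cycles can be checked mechanically: if the dependency graph formed by the channels a routing algorithm may use contains a cycle, deadlock is possible; if it is acyclic, deadlock cannot occur. The snippet below is a minimal cycle check over an assumed, hand-built dependency graph; it illustrates the idea and is not code from any of the cited sources.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Sketch: detect a cycle in a channel dependency graph using DFS.&lt;br /&gt;
# Each key is a channel; its value lists the channels that a packet&lt;br /&gt;
# holding the key channel may wait for next.&lt;br /&gt;
&lt;br /&gt;
def has_cycle(deps):&lt;br /&gt;
    WHITE, GREY, BLACK = 0, 1, 2&lt;br /&gt;
    color = {c: WHITE for c in deps}&lt;br /&gt;
&lt;br /&gt;
    def visit(c):&lt;br /&gt;
        color[c] = GREY&lt;br /&gt;
        for nxt in deps.get(c, []):&lt;br /&gt;
            if color.get(nxt, WHITE) == GREY:&lt;br /&gt;
                return True               # back edge: cyclic dependency&lt;br /&gt;
            if color.get(nxt, WHITE) == WHITE and visit(nxt):&lt;br /&gt;
                return True&lt;br /&gt;
        color[c] = BLACK&lt;br /&gt;
        return False&lt;br /&gt;
&lt;br /&gt;
    return any(color[c] == WHITE and visit(c) for c in deps)&lt;br /&gt;
&lt;br /&gt;
if __name__ == '__main__':&lt;br /&gt;
    # Four channels waiting on each other in a ring, as in the figure above.&lt;br /&gt;
    ring = {'c1': ['c2'], 'c2': ['c3'], 'c3': ['c4'], 'c4': ['c1']}&lt;br /&gt;
    print(has_cycle(ring))   # True, so deadlock is possible&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;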
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;h2&amp;gt;Dimensional ordered (X-Y) routing&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Turns from the y-dimension to the x-dimension are not allowed.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
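&amp;lt;p&amp;gt;&lt;br /&gt;
In other words, a packet first travels along the x-dimension until its x-coordinate matches the destination and only then travels along the y-dimension, so a y-to-x turn can never occur and the route is fully deterministic. The snippet below is a minimal sketch of this rule for a 2-D mesh; the coordinate and direction conventions are assumptions.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Sketch: deterministic dimension-ordered (X-Y) routing in a 2-D mesh.&lt;br /&gt;
# All x-direction hops are taken before any y-direction hop, so a turn&lt;br /&gt;
# from the y-dimension back into the x-dimension never happens.&lt;br /&gt;
&lt;br /&gt;
def xy_route(src, dst):&lt;br /&gt;
    (x, y), (dx, dy) = src, dst&lt;br /&gt;
    hops = []&lt;br /&gt;
    while x != dx:                       # resolve the x-dimension first&lt;br /&gt;
        step = 1 if dx &amp;gt; x else -1&lt;br /&gt;
        x += step&lt;br /&gt;
        hops.append('east' if step == 1 else 'west')&lt;br /&gt;
    while y != dy:                       # then resolve the y-dimension&lt;br /&gt;
        step = 1 if dy &amp;gt; y else -1&lt;br /&gt;
        y += step&lt;br /&gt;
        hops.append('north' if step == 1 else 'south')&lt;br /&gt;
    return hops&lt;br /&gt;
&lt;br /&gt;
if __name__ == '__main__':&lt;br /&gt;
    print(xy_route((0, 0), (2, 3)))&lt;br /&gt;
    # ['east', 'east', 'north', 'north', 'north'], the same path every time&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;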
&lt;br /&gt;
&amp;lt;h2&amp;gt;West First&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Turns to the west are not allowed; any westward hops must be taken first.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;North Last&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Turns after traveling in the north direction are not allowed; northward movement must come last. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Negative First&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Turns from a positive direction into a negative direction (-x or -y) are not allowed; travel in the negative directions must be completed first.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Odd-Even Turn Model&amp;lt;/h2&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Unfortunately, the above turn-restriction models reduce the degree of adaptiveness and are only partially adaptive. They force some packets to take different, not necessarily minimal, paths. This may cause unfairness and reduces the system's ability to relieve congestion, so overall performance could suffer&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Ge-Ming Chiu introduces the Odd-Even turn model as an adaptive, deadlock-free turn-restriction model that performs better than the previously mentioned models&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;. The model is designed primarily for 2-D meshes.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
''Turns from the east to north direction from any node on an even column are not allowed.''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Turns from the north to west direction from any node on an odd column are not allowed.''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Turns from the east to south direction from any node on an even column are not allowed.''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Turns from the south to west direction from any node on an odd column are not allowed.''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The illustration below shows the allowed routes for different source and destination nodes. Depending on which column a packet is in, only certain turns are allowed. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_odd_even.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Odd-Even turn restriction model proposed by Ge-Ming Chiu''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
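&amp;lt;p&amp;gt;&lt;br /&gt;
The four restrictions quoted above translate directly into a predicate that a router can evaluate for every candidate turn. The snippet below is a simplified illustration of Chiu's rules; the column numbering and direction names reflect an assumed coordinate convention, and it is not code from the paper.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Sketch: check whether a turn is permitted under the odd-even turn model.&lt;br /&gt;
# A turn is given by the current travel direction, the new travel direction,&lt;br /&gt;
# and the column (x-coordinate) of the node where the turn would be made.&lt;br /&gt;
&lt;br /&gt;
def odd_even_turn_allowed(incoming, outgoing, column):&lt;br /&gt;
    even = (column % 2 == 0)&lt;br /&gt;
    if incoming == 'east' and outgoing == 'north' and even:&lt;br /&gt;
        return False          # no east-to-north turn in an even column&lt;br /&gt;
    if incoming == 'north' and outgoing == 'west' and not even:&lt;br /&gt;
        return False          # no north-to-west turn in an odd column&lt;br /&gt;
    if incoming == 'east' and outgoing == 'south' and even:&lt;br /&gt;
        return False          # no east-to-south turn in an even column&lt;br /&gt;
    if incoming == 'south' and outgoing == 'west' and not even:&lt;br /&gt;
        return False          # no south-to-west turn in an odd column&lt;br /&gt;
    return True&lt;br /&gt;
&lt;br /&gt;
if __name__ == '__main__':&lt;br /&gt;
    print(odd_even_turn_allowed('east', 'north', column=4))   # False&lt;br /&gt;
    print(odd_even_turn_allowed('east', 'north', column=5))   # True&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;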
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Comparison of Turn Restriction Models&amp;lt;/h1&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
To evaluate the performance of the various turn restriction models, Chiu simulated a 15 x 15 mesh under several traffic patterns. All channels have a bandwidth of 20 flits/usec and a buffer size of one flit. The dimension-ordered x-y routing, west-first, and negative-first models were compared against the odd-even model. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Traffic patterns including uniform, transpose, and hot spot were simulated. Under uniform traffic, each node sends messages to every other node with equal probability. Under transpose traffic, each node sends messages to the node at its mirrored (transposed) position in the mesh. Hot-spot traffic designates a few &amp;quot;hot spot&amp;quot; nodes that receive a disproportionately high share of the traffic.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
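&amp;lt;p&amp;gt;&lt;br /&gt;
To make the three workloads concrete, the snippet below shows one simple way of generating a destination for each pattern on a 15 x 15 mesh. The injection probability and the exact form of the transpose pattern are assumptions for illustration, not Chiu's precise settings.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Sketch: destination selection for uniform, transpose, and hot-spot traffic&lt;br /&gt;
# on a k x k mesh. Parameters are illustrative only.&lt;br /&gt;
import random&lt;br /&gt;
&lt;br /&gt;
K = 15   # a 15 x 15 mesh, matching the simulations described above&lt;br /&gt;
&lt;br /&gt;
def uniform_dest(src):&lt;br /&gt;
    # Any node other than the source, chosen with equal probability.&lt;br /&gt;
    while True:&lt;br /&gt;
        d = (random.randrange(K), random.randrange(K))&lt;br /&gt;
        if d != src:&lt;br /&gt;
            return d&lt;br /&gt;
&lt;br /&gt;
def transpose_dest(src):&lt;br /&gt;
    # Node (i, j) sends to the node at the transposed position (j, i).&lt;br /&gt;
    return (src[1], src[0])&lt;br /&gt;
&lt;br /&gt;
def hotspot_dest(src, hotspots, p_hot=0.10):&lt;br /&gt;
    # With probability p_hot the packet targets one of the hot-spot nodes.&lt;br /&gt;
    if random.random() &amp;lt; p_hot:&lt;br /&gt;
        return random.choice(hotspots)&lt;br /&gt;
    return uniform_dest(src)&lt;br /&gt;
&lt;br /&gt;
if __name__ == '__main__':&lt;br /&gt;
    print(transpose_dest((2, 11)))                  # (11, 2)&lt;br /&gt;
    print(hotspot_dest((0, 0), hotspots=[(7, 7)]))&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;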
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_uniform.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Uniform traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The performance of the different routing algorithms under uniform traffic is shown above. For uniform traffic, the dimension-ordered x-y model outperforms the rest of the models. As the number of messages increases, the x-y model has the slowest increase in average communication latency. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_transpose.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''First transpose traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The performance of the different routing algorithms is shown above for the first transpose traffic. The negative-first model has the best performance, while the odd-even model performs better than the west-first and x-y models.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_transpose2.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Second transpose traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
With the second transpose simulation, the odd-even model outperforms the rest.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_hotspot.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Hotspot traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The performance of the different routing algorithms under hotspot traffic is shown above. Only one hotspot was simulated for this test. The odd-even model outperforms the other models when hotspot traffic is 10%.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_hotspot2.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Second hotspot traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
When the number of hotspots is increased to five, the odd-even model begins to shine. Its latency is the lowest at both 6 and 8 percent hotspot traffic. Meanwhile, the performance of the x-y model degrades severely. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
While the x-y model performs well under uniform traffic, it lacks adaptiveness. When traffic concentrates on hotspots, the x-y model suffers from its inability to adapt and re-route traffic around the congestion the hotspots cause. The odd-even model shows superior adaptiveness under high congestion. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Router Architecture&amp;lt;/h1&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The '''router''' is a device that routes incoming data to its destination. It has several input ports and several output ports; data arriving on one of the input ports is routed to one of the output ports. Which output port is chosen depends on the destination of the data and on the routing algorithm. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The internal architecture of a router consists of input and output ports and a '''crossbar switch'''. The crossbar switch connects each input port to the output port selected for it, acting essentially as a set of multiplexers. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
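&amp;lt;p&amp;gt;&lt;br /&gt;
The port-selection role of the crossbar can be pictured as a set of per-output multiplexers driven by the routing and arbitration logic. The snippet below is only a behavioral sketch of that idea with assumed port names; it is not a hardware description and is not taken from reference 4.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Sketch: behavioral model of a router crossbar. Each output port acts as&lt;br /&gt;
# a multiplexer that forwards the flit from whichever input port the&lt;br /&gt;
# routing/arbitration logic granted to it in this cycle.&lt;br /&gt;
&lt;br /&gt;
def crossbar(inputs, grants):&lt;br /&gt;
    # inputs: the flit waiting at each input port (or None).&lt;br /&gt;
    # grants: maps each output port to the input port selected for it.&lt;br /&gt;
    outputs = {}&lt;br /&gt;
    for out_port, in_port in grants.items():&lt;br /&gt;
        outputs[out_port] = inputs[in_port]&lt;br /&gt;
    return outputs&lt;br /&gt;
&lt;br /&gt;
if __name__ == '__main__':&lt;br /&gt;
    inputs = {0: 'flit-A', 1: 'flit-B', 2: None, 3: 'flit-C'}&lt;br /&gt;
    grants = {2: 0, 0: 3}     # input 0 goes to output 2, input 3 to output 0&lt;br /&gt;
    print(crossbar(inputs, grants))   # {2: 'flit-A', 0: 'flit-C'}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;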
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Router technology has improved significantly over the years, which has made networks with high dimensionality feasible. As shown in the real-world example above, high-dimensional torus and hypercube networks are excellent choices of topology for high-performance systems. The availability of high-performance, high-radix routers has contributed to the viability of these high-dimensionality networks. As the graph below shows, the bandwidth of routers has improved tremendously over a period of 10 years&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Router_bandwidth.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Bandwidth of various routers over 10 year period''&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Looking at the physical architecture and layout of routers, it is evident that the circuitry has become dramatically denser and more complex.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Router_physical.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Router hardware over period of time''&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Router_radix.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Radix and latency of routers over 10 year period''&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The '''radix''', or number of ports, of routers has also increased. Current routers not only have a higher radix but also lower latency than the previous generation, and as the radix increases, the latency remains steady. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
With high-performance routers, complex topologies become practical; as router technology continues to improve, even more complex, higher-dimensionality topologies become feasible. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Fault Tolerant Routing&amp;lt;/h1&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Fault-tolerant routing means the successful routing of messages between any pair of non-faulty nodes in the presence of faulty components&amp;lt;sup&amp;gt;6&amp;lt;/sup&amp;gt;. With the growing number of processors in multiprocessor systems and ever higher data rates, reliable transmission of data in the event of a network fault is a major concern; hence, fault-tolerant routing algorithms are important.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Fault Models&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Faults in a network can be categorized in two types:&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
1. '''Transient Faults'''&amp;lt;sup&amp;gt;5&amp;lt;/sup&amp;gt;: A transient fault is a temporary fault that lasts for a very short duration. It can be caused, for example, by a spurious change in the output of a flip-flop that leads to the generation of an invalid header. Transient faults can be minimized using error control coding and are generally evaluated in terms of the Bit Error Rate (BER).&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
2. '''Permanent Faults'''&amp;lt;sup&amp;gt;5&amp;lt;/sup&amp;gt;: A permanent fault does not go away and causes permanent damage to the network, for example due to damaged wires or associated circuitry. Permanent faults are generally evaluated in terms of the Mean Time Between Failures (MTBF).&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Fault Tolerance Mechanisms (for permanent faults)&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Permanent faults can be handled using one of two mechanisms:&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
1. '''Static Mechanism''': In the static fault-tolerance model, once a fault is detected, all processes running in the system are stopped and the routing tables are emptied. Based on the fault information, the routing tables are recalculated to provide fault-free paths.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
2. '''Dynamic Mechanisms''': In the dynamic fault-tolerance model, the operation of the processes in the network is not completely stalled; only the affected regions are treated. Some of the methods to do this are:&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
a. '''Block Faults''': In this method, many of the healthy nodes in the vicinity of the faulty nodes are also marked as faulty so that no routes are created close to the actual faulty nodes. The fault region may be convex or non-convex, and it is ensured that none of the new routes introduce a cyclic dependency into the channel dependency graph (CDG).&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
DISADVANTAGE: This method causes a lot of healthy nodes to be declared faulty, leading to a reduction in system capacity.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Fault_pic1.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
b. '''Fault Rings''': This method was introduced by Chalasani and Boppana. A fault ring is the set of nodes and links adjacent to the faulty nodes and links. This approach reduces the number of healthy nodes that must be marked as faulty and blocked.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Fault_pic2.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
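&amp;lt;p&amp;gt;&lt;br /&gt;
The fault-ring idea can be sketched by computing, for a given set of faulty nodes in a 2-D mesh, the healthy nodes immediately surrounding the fault region; only these ring nodes take part in detouring packets, while all other healthy nodes keep their normal routes. The snippet below is an illustrative approximation of the f-ring construction under assumed coordinates, not Chalasani and Boppana's actual algorithm.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Sketch: nodes of the fault ring around a block of faulty nodes in a&lt;br /&gt;
# k x k 2-D mesh. Ring nodes are the healthy nodes adjacent (including&lt;br /&gt;
# diagonally) to at least one faulty node.&lt;br /&gt;
&lt;br /&gt;
def fault_ring(faulty, k):&lt;br /&gt;
    ring = set()&lt;br /&gt;
    for (fx, fy) in faulty:&lt;br /&gt;
        for dx in (-1, 0, 1):&lt;br /&gt;
            for dy in (-1, 0, 1):&lt;br /&gt;
                n = (fx + dx, fy + dy)&lt;br /&gt;
                if n not in faulty and n[0] in range(k) and n[1] in range(k):&lt;br /&gt;
                    ring.add(n)&lt;br /&gt;
    return ring&lt;br /&gt;
&lt;br /&gt;
if __name__ == '__main__':&lt;br /&gt;
    faulty = {(3, 3), (3, 4), (4, 3), (4, 4)}     # a 2 x 2 block of failures&lt;br /&gt;
    print(sorted(fault_ring(faulty, k=8)))        # the 12 surrounding nodes&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;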
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;References&amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
1 Yan Solihin, ''Fundamentals of Parallel Computer Architecture''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
2 [http://www.ssrc.ucsc.edu/Papers/hospodor-mss04.pdf Interconnection Architectures for Petabyte-Scale High-Performance Storage Systems]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
3 [http://www.diit.unict.it/~vcatania/COURSES/semm_05-06/DOWNLOAD/noc_routing02.pdf The Odd-Even Turn Model for Adaptive Routing]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
4 [http://www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf Interconnection Topologies:(Historical Trends and Comparisons)]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
5 [http://dspace.upv.es/xmlui/bitstream/handle/10251/2603/tesisUPV2824.pdf?sequence=1 Efficient mechanisms to provide fault tolerance in interconnection networks for PC clusters, José Miguel Montañana Aliaga.]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
6 [http://web.ebscohost.com.www.lib.ncsu.edu:2048/ehost/pdfviewer/pdfviewer?vid=2&amp;amp;hid=15&amp;amp;sid=72e3828d-3cb1-42b9-8198-5c1e974ea53f@sessionmgr4 Adaptive Fault Tolerant Routing Algorithm for Tree-Hypercube Multicomputer, Qatawneh Mohammad]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
7 [http://www.top500.org TOP500 Supercomputing Sites]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
8 [http://cial.csie.ncku.edu.tw/presentation/group_pdf/%5BY2005%5DPerformance%20Evaluation%20and%20Design%20Trade-Offs%20For%20Network-on-Chip%20Interconnect%20Architectures.pdf Interconnection Network Topology Tradeoffs]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
9 [http://www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf Evolution of Interconnection Networks]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
10 [http://www2.engr.arizona.edu/~hpcat/papers/jlt94.pdf Hypercube pros and cons]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
11 [http://courses.csail.mit.edu/6.896/spring04/handouts/papers/fat_trees.pdf History of the Fat Tree Topology]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
12 [http://books.google.com/books?id=uyAg3zu_DYMC&amp;amp;pg=PA75&amp;amp;lpg=PA75&amp;amp;dq=butterfly+interconnection+network+topology&amp;amp;source=bl&amp;amp;ots=5tQm-TVTu0&amp;amp;sig=oKMwSlpqDJqopnYrKv0BbA8k5uU&amp;amp;hl=en&amp;amp;ei=5cmrTYezINDQiAL9uKjvDA&amp;amp;sa=X&amp;amp;oi=book_result&amp;amp;ct=result&amp;amp;resnum=8&amp;amp;ved=0CFMQ6AEwBw  Pros and Cons of the Butterfly Interconnection Network]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
NOTE:  This wiki is based on a [http://pg-server.csc.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Spring_2010/12_EC previous wiki]&lt;/div&gt;</summary>
		<author><name>Vrmanda</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch12_ar&amp;diff=45127</id>
		<title>CSC/ECE 506 Spring 2011/ch12 ar</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch12_ar&amp;diff=45127"/>
		<updated>2011-04-18T23:46:28Z</updated>

		<summary type="html">&lt;p&gt;Vrmanda: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;h1&amp;gt;Interconnection Network Architecture &amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In a mul&lt;br /&gt;
ti-processor system, processors need to communicate with each other and access each other's resources. In order to route data and messages between processors, an interconnection architecture is needed.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Typically, in a multiprocessor system, message passed between processors are frequent and short&amp;lt;sup&amp;gt;1&amp;lt;/sup&amp;gt;. Therefore, the interconnection network architecture must handle messages quickly by having '''low latency''', and must handle several messages at a time and have '''high bandwidth'''. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In a network, a processor along with its cache and memory is considered a '''node'''. The physical wires that connect between them is called a '''link'''. A link can be unidirectional, in which data can only be sent in one direction, or bidirectional in which data can be sent in both directions. The device that routes messages between nodes is called a router. The shape of the network, such as the number of links and routers, is called the network '''topology'''.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Various metrics that are to be considered in coming up with a decision on the interconnection networks are the following factors:&lt;br /&gt;
&amp;lt;p&amp;gt;1.&amp;lt;b&amp;gt; Diameter &amp;lt;/b&amp;gt;: The maximum distance in the network is defined as the diameter. And to compute the average distance, a list of all pairs of nodes is to be made and the average of those would determine the value.&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;2. &amp;lt;b&amp;gt;Bisection Bandwidth &amp;lt;/b&amp;gt;: A network can be partitioned in any number of ways. When a network is divided in such a fashion that it has two equal partitions then the minimum number of links that need to be cut is defined as the Bisection Bandwidth. &amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;3. &amp;lt;b&amp;gt; No. of Links&amp;lt;/b&amp;gt; : Number of links in a network is the set of wires that connect two different nodes in the network. &amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;4.&amp;lt;b&amp;gt; Degree&amp;lt;/b&amp;gt; : Number of input/output links connecting to each router is defined as the degree of the network.&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Types of Network Topologies &amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Linear Array&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_linear.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The nodes are connected linearly as in an array. This type of topology is simple and low cost. However, it has several shortcoming.  First of all, it does not scale well. The longest distance between two nodes, or the '''diameter''', is equivalent to the number of nodes.  In addition to not scaling well, this topology can also result in high congestion.  [[#References|&amp;lt;sup&amp;gt;[8]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Ring&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_ring.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Similar structure as the linear array, except, the ending nodes connect to each other, establishing a circular structure. This topology will scale better since the longest distance between two nodes is cut in half, but it eventually ends up scaling poorly if enough nodes are added.  The congestion will also be cut in half since there is now 2 options for packets to traverse.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;2-D Mesh&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_2Dmesh.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The 2-D mesh can be thought of as several linear arrays put together to form a 2-dimensional structure. Nodes that are not on the edge have 4 input or output links, or a '''degree''' of 4.  This topology has reasonably low energy dissipation without compromising throughput.  The area overhead for this topology is also rather low since a node is never farther than a clock cycle away, this means there is no need for repeater insertion in between the nodes.  (the repeaters have a high area overhead, so using them causes the area of the topology to increase) [[#References|&amp;lt;sup&amp;gt;[8]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Cube&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_cube.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The cube can be thought of as a three-dimensional mesh. This topology continues to improve upon the throughput in the 2-D mesh, however the power dissipation will be slightly higher.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Hypercube&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_hypercube.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The hypercube is essentially multiple cubes put together.  This topology was developed in 1983.  Originally there were no routers.  The next hops were just programmed into each node.  Today the hypercube topology is used by many companies including Intel.  It is so attractive because of its small diameter.  The nodes are numbered in such a way that every neighboring node is only one bit difference.  This greatly increases the ability to route messages through the network. The biggest drawback of the topology is the lack of scalability.  For example if the dimension size is increased by one, one link will need to be added to every node in the network. [[#References|&amp;lt;sup&amp;gt;[10]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;2-D Torus&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_2Dtorus.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The 2-D torus takes the structure of the 2-D mesh and connects the nodes on the edges. This decreases the diameter, but the number of links is higher.  Thus it slightly improves upon the throughput of the 2-D Mesh, but it also slightly increases the power dissipation relative to the 2-D Mesh since the number of links is higher.  The delays on the routes connecting the end nodes together can have an excessively high delays if the topology is not implemented correctly.(printout)  This topology was developed in 1985, because of the design constraints, such as pins and bisection, that the hypercube required. [[#References|&amp;lt;sup&amp;gt;[9]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Tree&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_tree.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The tree has a hierarchical structure of nodes on the bottom and switching nodes at the upper levels. The tree experiences high traffic and high latency if the package must travel through the upper levels.  Also because of the high connectivity, this topology has high average energy dissipation.[[#References|&amp;lt;sup&amp;gt;[8]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Fat Tree&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_fat_tree.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
In 1985, an MIT professor invented the fat tree to improve upon the normal “skinny” tree.  To improve upon the “skinny” tree topology, the fat tree “fattens” up the links at the upper levels.  This helps to alleviate the traffic at upper levels and to decrease the latency of the message.  However, by fattening, it is meant that additional links are added to this area, which will increase the average energy dissipated by this topology. [[#References|&amp;lt;sup&amp;gt;[11]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Butterfly&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_butterfly.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Another attempt to increase the “skinny” tree structure was the butterfly structure.  The butterfly structure is similar to the tree structure, but it replicates the switching node structure of the tree topology and connects them together so that there are equal links and routers at all levels.  There are 2 problems with this topology.  First of all, there is no path diversity in this topology.  There is only one path from the root to a downstream node.  This is not ideal incase the network is congested in a certain area, but available in another.  There is no way for the network to rebalance the work.  Second of all, there are some very long routes in this topology.  This requires there to be repeaters in between the nodes, which will cause the physical area to implement the network to increase dramatically. [[#References|&amp;lt;sup&amp;gt;[12]&amp;lt;/sup&amp;gt;]] &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Comparison of Network Topologies &amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The following table shows the total number of ports required for each network topology. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_ports.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Number of ports for each topology''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
As the figure above shows, the 6-D hypercube requires the largest number of ports, due to its relatively complex six-dimensional structure. In contrast, the fat tree requires the least number of ports, even though links have been &amp;quot;fattened&amp;quot; up by using redundant links. The butterfly network requires more than twice the number of ports as the fat tree, since it essentially replicates the switching layer of the fat tree. The number of ports for the mesh and torii structures increase as the dimensionality increases.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Below the average path length, or average number of hops, and the average link load (GB/s) is shown.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_load.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Average path length and link load for each topology''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Looking at the trends, when average path length is high, the average link load is also high. In other words, average path length and average link load are proportionally related. It is obvious from the graph that 2-D mesh has, by far, the worst performance. In a large network such as this, the average path length is just too high, and the average link load suffers. For this type of high-performance network, the 2-D mesh does not scale well. Likewise the 2-D torus cuts the average path length and average link load in half by connected the edge nodes together, however, the performance compared to other types is relatively poor. The butterfly and fat-tree have the least average path length and average link load. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The figure below shows the cost of the network topologies.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_cost.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Cost of each topology''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Despite using the fewest number of ports, the fat tree topology has the highest cost, by far. Although it uses the fewest ports, the ports are high bandwidth ports of 10 GB/s. Over 2400, ports of 10 GB/s are required have enough bandwidth at the upper levels of the tree. This pushes the cost up dramatically, and from a cost standpoint is impractical. While the total cost of fat tree is about 15 million dollars, the rest of the network topologies are clustered below 4 million dollars. When the dimensionalality of the mesh and torii structures increase, the cost increases. The butterfly network costs between the 2-D mesh/torii and the 6-D hypercube. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
When the cost and average link load is factored the following graph is produced.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_overall.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Overall cost of each topology''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
From the figure above, the 6-D hypercube demonstrates the most cost effective choice on this particular network setup. Although the 6-D hypercube costs more because it needs more links and ports, it provides higher bandwidth, which can offset the higher cost. The high dimensional torii also perform well, but cannot provide as much bandwidth as the 6-D hypercube. For systems that do not need as much bandwidth, the high-dimensional torii is also a good choice. The butterfly topology is also an alternative, but has lower fault tolerance. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Importance and Usage of 2D Mesh Networks:&lt;br /&gt;
Various metrics that are to be considered in coming up with a decision on the interconnection networks are the following factors:&lt;br /&gt;
1. Diameter : The maximum distance in the network is defined as the diameter. And to compute the average distance, a list of all pairs of nodes is to be made and the average of those would determine the value.&lt;br /&gt;
2. Bisection Bandwidth : A network can be partitioned in any number of ways. When a network is divided in such a fashion that it has two equal partitions then the minimum number of links that need to be cut is defined as the Bisection Bandwidth. &lt;br /&gt;
3. No. of Links : Number of links in a network is the set of wires that connect two different nodes in the network.&lt;br /&gt;
4. Degree : Number of input/output links connecting to each router is defined as the degree of the network.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Real-World Implementation of Network Topologies &amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In a research study by Andy Hospodor and Ethan Miller, several network topologies were investigated in a high-performance, high-traffic network&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;. Several topologies were investigated including the fat tree, butterfly, mesh, torii, and hypercube structures. Advantages and disadvantages including cost, performance, and reliability were discussed. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In this experiment, a petabyte-scale network with over 100 GB/s total aggregate bandwidth was investigated. The network consisted of 4096 disks with large servers with routers and switches in between&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The overall structure of the network is shown below. Note that this structure is very susceptible to failure and congestion.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_network.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Basic structure of Hospodor and Miller's experimental network''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Fat Tree&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In large scale, high performance applications, fat tree can be a choice. However, in order to &amp;quot;fatten&amp;quot; up the links, redundant connections must be used. Instead of using one link between switching nodes, several must be used. The problem with this is that with more input and output links, one would need routers with more input and output ports. Router with excess of 100 ports are difficult to build and expensive, so multiple routers would have to be stacked together. Still, the routers would be expensive and would require several of them&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The Japan Agency for Marine-Earth Science and Technology supercomputing system uses the fat tree topology. The system connects 1280 processors using NEC processors&amp;lt;sup&amp;gt;7&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Butterfly&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In high performance applications, the butterfly structure is a good choice. The butterfly topology uses fewer links than other topologies, however, each link carries traffic from the entire layer. Fault tolerance is poor. There exists only a single path between pairs of nodes. Should the link break, data cannot be re-routed, and communication is broken&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_butterfly.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Butterfly structure''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Meshes and Tori&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The mesh and torus structure used in this application would require a large number of links and total aggregate of several thousands of ports. However, since there are so many links, the mesh and torus structures provide alternates paths in case of failures&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Some examples of current use of torus structure include the QPACE SFB TR Cluster in Germany using the PowerXCell 8i processors. The systems uses 3-D torus topology with 4608 processors&amp;lt;sup&amp;gt;7&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_mesh.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Mesh structure''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_torus.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Torus structure''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Hypercube&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Similar to the torii structures, the hypercube requires larger number of links. However, the bandwidth scales better than mesh and torii structures. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The CRAY T3E, CRAY XT3, and SGI Origin 2000 use k-ary n-cubed topologies.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_hypercube.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Hypercube structure''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Routing&amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The '''routing''' algorithm determines what path a packet of data will take from source to destination. Routing can be '''deterministic''', where the path is the same given a source and destination, or '''adaptive''', where the path can change. The routing algorithm can also be '''partially adaptive''' where packets have multiple choices, but does not allow all packets to use the shortest path&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Deadlock&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
When packets are in '''deadlock''' when they cannot continue to move through the nodes. The illustration below demonstrates this event. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_deadlock.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Example of deadlock''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Assume that all of the buffers are full at each node. Packet from Node 1 cannot continue to Node 2. The packet from Node 2 cannot continue to Node 3, and so on. Since packet cannot move, it is deadlocked. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The deadlock occurs from cyclic pattern of routing. To avoid deadlock, avoid circular routing pattern.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To avoid circular patterns of routing, some routing patterns are disallowed. These are called '''turn restrictions''', where some turns are not allowed in order to avoid making a circular routing pattern. Some of these turn restrictions are mentioned below.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;h2&amp;gt;Dimensional ordered (X-Y) routing&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Turns from the y-dimension to the x-dimension are not allowed.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;West First&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Turns to the west are not allowed.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;North Last&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Turns after a north direction are not allowed. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Negative First&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Turns in the negative direction (-x or -y) are not allowed, except on the first turn.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Odd-Even Turn Model&amp;lt;/h2&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Unfortunately, the above turn-restriction models reduce the degree of adaptiveness and are partially adaptive. The models cause some packets to take different routes, and not necessarily the minimal paths. This may cause unfairness but reduces the ability of the system to reduce congestion. Overall performance could suffer&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Ge-Ming Chiu introduces the Odd-Even turn model as an adaptive turn restriction, deadlock-free model that has better performance than the previously mentioned models&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;. The model is designed primarily for 2-D meshes.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
''Turns from the east to north direction from any node on an even column are not allowed.''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Turns from the north to west direction from any node on an odd column are not allowed.''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Turns from the east to south direction from any node on an even column are not allowed.''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Turns from the south to west direction from any node on an odd column are not allowed.''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The illustration below shows allowed routing for different source and destination nodes. Depending on which column the packet is in, only certain directions are allowed. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_odd_even.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Odd-Even turn restriction model proposed by Ge-Ming Chiu''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Comparison of Turn Restriction Models&amp;lt;/h1&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
To simulate the performance of various turn restriction models, Chiu simulated a 15 x 15 mesh under various traffic patterns. All channels have bandwidth of 20 flits/usec and has a buffer size of one flit. The dimension-ordered x-y routing, west-first, and negative-first models were compared against the odd-even model. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Traffic patterns including uniform, transpose, and hot spot were conducted. Uniform simulates one node send messages to any other node with equal probability. Transpose simulates two opposite nodes sending messages to their respective halves of the mesh. Hot spot simulates a few &amp;quot;hot spot&amp;quot; nodes that receive high traffic.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_uniform.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Uniform traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The performance of the different routing algorithms is shown above for the uniform traffic. For uniform traffic, the dimensional ordered x-y model outperforms the rest of the models. As the number of messages increase, the x-y model has the &amp;quot;slowest&amp;quot; increase in average communication latency. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_transpose.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''First transpose traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The performance of the different routing algorithms is shown above for the first transpose traffic. The negative-first model has the best performance, while the odd-even model performs better than the west-first and x-y models.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_transpose2.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Second transpose traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
With the second transpose simulation, the odd-even model outperforms the rest.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_hotspot.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Hotspot traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The performance of the different routing algorithms is shown above for the hotspot traffic. Only one hotspot was simulated for this test. The performance of the odd-even model outperforms other models when hotspot traffic is 10%.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_hotspot2.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Second hotspot traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
When the number of hotspots is increased to five, the performance of the odd-even begins to shine. The latency is lowest for both 6 and 8 percent hotspot. Meanwhile, the performance of x-y model is horrendous. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
While the x-y model performs well in uniform traffic, it lacks adaptiveness. When traffic becomes hotspot, the x-y model suffers from the inability to adapt and re-route traffic to avoid the congestion caused by hotspots. The odd-even model has superior adaptiveness under high congestion. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Router Architecture&amp;lt;/h1&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The '''router''' is a device that routes incoming data to its destination. It does this by having several input ports and several output ports. Data incoming from one of the inputs ports is routed to one of the output ports. Which output port is chosen depends on the destination of the data, and the routing algorithms. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The internal architecture of a router consists of input and output ports and a '''crossbar switch'''. The crossbar switch connects the selects which output should be selected, acting essentially as a multiplexer. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Router technology has improved significantly over the years. This has allowed networks with high dimensionality to become feasible. As shown in the real-world example above, high dimensional torii and hypercube are excellent choice of topology for high-performance networks. The cost of high-performance, high-radix routers has contributed to the viability of these types of high dimensionality networks. As the graph below shows, the bandwidth of routers has improved tremendously over a period of 10 years&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Router_bandwidth.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Bandwidth of various routers over 10 year period''&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Looking at the physical architecture and layout of router, it is evident that the circuitry has been dramatically more dense and complex.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Router_physical.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Router hardware over period of time''&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Router_radix.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Radix and latency of routers over 10 year period''&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The '''radix''', or the number of ports of routers has also increased. The current technology not only has high radix, but also low latency compared to last generation. As radix increases, the latency remains steady. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
With high-performance routers, complex topologies are possible. As the router technology improves, more complex, high-dimensionality topologies are possible. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Fault Tolerant Routing&amp;lt;/h1&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Fault-tolerant routing means the successful routing of messages between any pair of non faulty nodes in the presence of faulty components&amp;lt;sup&amp;gt;6&amp;lt;/sup&amp;gt;. With increased number of processors in a multiprocessor system and high data rates reliable transmission of data in event of network fault is of great concern and hence fault tolerant routing algorithms are important.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Fault Models&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Faults in a network can be categorized in two types:&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
1.'''Transient Faults'''&amp;lt;sup&amp;gt;5&amp;lt;/sup&amp;gt; : A transient fault is a temporary fault that occurs for a very short duration of time. This fault can be caused due to change in output of flip-flop leading to generation of invalid header. These faults can be minimized using error controlled coding. These errors are generally evaluated in terms of Bit Error Rate.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
2.'''Permanent Faults'''&amp;lt;sup&amp;gt;5&amp;lt;/sup&amp;gt;: A permanent fault is a fault that does not go away and causes a permanent damage to the network. This fault could be due to damaged wires and associated circuitry. These faults are generally evaluated in terms of Mean Time between Failures.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Fault Tolerance Mechanisms (for permanent faults)&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The permanent faults can be handled using one of the two mechanisms:&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
1.'''Static Mechanism''': In static fault tolerance model, once the fault is detected all the processes running in the system are stopped and the routing tables are emptied. Based on the information of faults the routing tables are re-calculated to provide a fault free path.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
2.'''Dynamic Mechanisms''': In dynamic fault tolerance model, it is made sure that the operation of the processes in the network is not completely stalled and only the affected regions are provided cure. Some of the methods to do this are:&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
a.'''Block Faults''': In this method many of the healthy nodes in vicinity of the faulty nodes are marked as faulty nodes so that no routes are created close to the actual faulty nodes. The shape of the region could be convex or non-convex, and is made sure that none of the new routes introduce cyclic dependency in the cyclic dependency graph (CDG).&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
DISADVANTAGE: This method causes lot of healthy nodes to be declared as faulty leading to reduction in system capacity.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Fault_pic1.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
b.'''Fault Rings''': This method was introduced by Chalasani and Boppana. A fault tolerant ring is a set of nodes and links that are adjunct to faulty nodes/links. This approach reduces the number of healthy nodes to be marked as faulty and blocking them.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Fault_pic2.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
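&amp;lt;p&amp;gt;&lt;br /&gt;
To make the static mechanism above concrete, the following is a minimal, purely illustrative sketch (it is not taken from reference 5 or 6, and the function names are invented for this example): it rebuilds a next-hop routing table for a small 2-D mesh by a breadth-first search that simply ignores the nodes reported as faulty.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;
from collections import deque

def mesh_neighbors(node, width, height):
    # Yield the up, down, left and right neighbors of a node in a width x height mesh.
    x, y = node
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nx, ny = x + dx, y + dy
        if nx in range(width) and ny in range(height):
            yield (nx, ny)

def rebuild_routing_table(source, faulty, width, height):
    # Static fault tolerance: recompute the next-hop entry from 'source'
    # to every reachable node, treating 'faulty' nodes as unusable.
    next_hop = {}              # maps destination to the first hop taken from 'source'
    visited = {source}
    queue = deque()
    for n in mesh_neighbors(source, width, height):
        if n not in faulty:
            visited.add(n)
            queue.append((n, n))       # (current node, first hop that led here)
    while queue:
        node, first = queue.popleft()
        next_hop[node] = first
        for n in mesh_neighbors(node, width, height):
            if n not in visited and n not in faulty:
                visited.add(n)
                queue.append((n, first))
    return next_hop

# Example: a 4 x 4 mesh with one faulty node; routes are recomputed around it.
table = rebuild_routing_table(source=(0, 0), faulty={(1, 0)}, width=4, height=4)
print(table[(3, 0)])   # prints (0, 1): the first hop avoids the faulty node (1, 0)
&amp;lt;/pre&amp;gt;&lt;br /&gt;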
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;References&amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
1 Solihin text&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
2 [http://www.ssrc.ucsc.edu/Papers/hospodor-mss04.pdf Interconnection Architectures for Petabyte-Scale High-Performance Storage Systems]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
3 [http://www.diit.unict.it/~vcatania/COURSES/semm_05-06/DOWNLOAD/noc_routing02.pdf The Odd-Even Turn Model for Adaptive Routing]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
4 [http://www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf Interconnection Topologies:(Historical Trends and Comparisons)]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
5 [http://dspace.upv.es/xmlui/bitstream/handle/10251/2603/tesisUPV2824.pdf?sequence=1 Efficient mechanisms to provide fault tolerance in interconnection networks for PC clusters, José Miguel Montañana Aliaga.]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
6 [http://web.ebscohost.com.www.lib.ncsu.edu:2048/ehost/pdfviewer/pdfviewer?vid=2&amp;amp;hid=15&amp;amp;sid=72e3828d-3cb1-42b9-8198-5c1e974ea53f@sessionmgr4 Adaptive Fault Tolerant Routing Algorithm for Tree-Hypercube Multicomputer, Qatawneh Mohammad]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
7 [http://www.top500.org TOP500 Supercomputing Sites]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
8 [http://cial.csie.ncku.edu.tw/presentation/group_pdf/%5BY2005%5DPerformance%20Evaluation%20and%20Design%20Trade-Offs%20For%20Network-on-Chip%20Interconnect%20Architectures.pdf Interconnection Network Topology Tradeoffs]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
9 [http://www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf Evolution of Interconnection Networks]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
10 [http://www2.engr.arizona.edu/~hpcat/papers/jlt94.pdf Hypercube pros and cons]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
11 [http://courses.csail.mit.edu/6.896/spring04/handouts/papers/fat_trees.pdf History of the Fat Tree Topology]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
12 [http://books.google.com/books?id=uyAg3zu_DYMC&amp;amp;pg=PA75&amp;amp;lpg=PA75&amp;amp;dq=butterfly+interconnection+network+topology&amp;amp;source=bl&amp;amp;ots=5tQm-TVTu0&amp;amp;sig=oKMwSlpqDJqopnYrKv0BbA8k5uU&amp;amp;hl=en&amp;amp;ei=5cmrTYezINDQiAL9uKjvDA&amp;amp;sa=X&amp;amp;oi=book_result&amp;amp;ct=result&amp;amp;resnum=8&amp;amp;ved=0CFMQ6AEwBw  Pros and Cons of the Butterfly Interconnection Network]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
NOTE:  This wiki is based off of a [http://pg-server.csc.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Spring_2010/12_EC previous wiki]&lt;/div&gt;</summary>
		<author><name>Vrmanda</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch12_ar&amp;diff=45126</id>
		<title>CSC/ECE 506 Spring 2011/ch12 ar</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch12_ar&amp;diff=45126"/>
		<updated>2011-04-18T23:43:51Z</updated>

		<summary type="html">&lt;p&gt;Vrmanda: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;h1&amp;gt;Interconnection Network Architecture &amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In a multi-processor system, processors need to communicate with each other and access each other's resources. In order to route data and messages between processors, an interconnection architecture is needed.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Typically, in a multiprocessor system, messages passed between processors are frequent and short&amp;lt;sup&amp;gt;1&amp;lt;/sup&amp;gt;. Therefore, the interconnection network architecture must handle messages quickly by having '''low latency''', and must handle several messages at a time by having '''high bandwidth'''. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In a network, a processor along with its cache and memory is considered a '''node'''. The physical wires that connect two nodes are called a '''link'''. The device that routes messages between nodes is called a '''router'''. The arrangement of the network, such as the number of links and routers and how they are connected, is called the network '''topology'''.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Various metrics to consider when choosing an interconnection network are the following:&lt;br /&gt;
&amp;lt;p&amp;gt;1. &amp;lt;b&amp;gt;Diameter&amp;lt;/b&amp;gt;: The maximum distance between any pair of nodes in the network is the diameter. The average distance is computed by listing all pairs of nodes and averaging their distances (see the sketch after this list).&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;2. &amp;lt;b&amp;gt;Bisection Bandwidth &amp;lt;/b&amp;gt;: A network can be partitioned in any number of ways. When a network is divided in such a fashion that it has two equal partitions then the minimum number of links that need to be cut is defined as the Bisection Bandwidth. &amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;3. &amp;lt;b&amp;gt;No. of Links&amp;lt;/b&amp;gt;: The number of links is the total count of wires, each of which connects two different nodes in the network.&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;4.&amp;lt;b&amp;gt; Degree&amp;lt;/b&amp;gt; : Number of input/output links connecting to each router is defined as the degree of the network.&amp;lt;/p&amp;gt;&lt;br /&gt;
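&amp;lt;p&amp;gt;&lt;br /&gt;
The first two metrics can be computed directly from a topology's adjacency structure. The sketch below is a small, hypothetical example (the helper names and the 8-node ring are assumptions made only for illustration) that computes the diameter and the average distance by breadth-first search over all node pairs.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;
from collections import deque

def ring_adjacency(n):
    # Adjacency list for a bidirectional ring of n nodes.
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}

def distances_from(adj, start):
    # Hop counts from 'start' to every node, by breadth-first search.
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def diameter_and_average(adj):
    # Diameter is the longest shortest path; the average is over all ordered pairs.
    total, pairs, longest = 0, 0, 0
    for s in adj:
        for t, hops in distances_from(adj, s).items():
            if t != s:
                total += hops
                pairs += 1
                longest = max(longest, hops)
    return longest, total / pairs

print(diameter_and_average(ring_adjacency(8)))   # prints (4, 2.2857...) for an 8-node ring
&amp;lt;/pre&amp;gt;&lt;br /&gt;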
&lt;br /&gt;
&amp;lt;h1&amp;gt;Types of Network Topologies &amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Linear Array&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_linear.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The nodes are connected linearly as in an array. This type of topology is simple and low cost. However, it has several shortcomings. First of all, it does not scale well: the longest distance between two nodes, or the '''diameter''', grows with the number of nodes. In addition, this topology can also result in high congestion, since all traffic between the two halves of the array must cross a single link.  [[#References|&amp;lt;sup&amp;gt;[8]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Ring&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_ring.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The ring has a similar structure to the linear array, except that the end nodes connect to each other, forming a circular structure. This topology scales better since the longest distance between two nodes is cut in half, but it still ends up scaling poorly if enough nodes are added. The congestion is also roughly cut in half, since there are now two directions a packet can take.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;2-D Mesh&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_2Dmesh.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The 2-D mesh can be thought of as several linear arrays put together to form a 2-dimensional structure. Nodes that are not on the edge have 4 input or output links, or a '''degree''' of 4.  This topology has reasonably low energy dissipation without compromising throughput.  The area overhead is also rather low: since a neighboring node is never farther than a clock cycle away, there is no need for repeater insertion between the nodes (repeaters have a high area overhead, so using them increases the area of the topology). [[#References|&amp;lt;sup&amp;gt;[8]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Cube&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_cube.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The cube can be thought of as a three-dimensional mesh. This topology improves upon the throughput of the 2-D mesh; however, the power dissipation is slightly higher.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Hypercube&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_hypercube.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The hypercube is essentially multiple cubes put together.  This topology was developed in 1983.  Originally there were no routers; the next hops were simply programmed into each node.  Today the hypercube topology is used by many companies, including Intel.  It is attractive because of its small diameter.  The nodes are numbered in such a way that the addresses of neighboring nodes differ by only one bit, which greatly simplifies routing messages through the network. The biggest drawback of the topology is its lack of scalability: if the number of dimensions is increased by one, a link must be added to every node in the network. [[#References|&amp;lt;sup&amp;gt;[10]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
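&amp;lt;p&amp;gt;&lt;br /&gt;
The one-bit-difference numbering described above is what makes hypercube routing simple: a node's neighbors are obtained by flipping single address bits, and a message can be forwarded by correcting one differing bit per hop. The sketch below is only an illustration of that idea; the function names are assumptions and the code is not drawn from any of the cited systems.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;
def neighbors(node, dims):
    # In a dims-dimensional hypercube, each neighbor differs in exactly one address bit.
    return [node ^ 2**bit for bit in range(dims)]

def hypercube_route(src, dst, dims):
    # Greedy hypercube routing: correct one differing address bit per hop,
    # so the hop count equals the Hamming distance between src and dst.
    path = [src]
    current = src
    for bit in range(dims):
        if (current ^ dst) // (2**bit) % 2 == 1:   # does this bit still differ?
            current ^= 2**bit
            path.append(current)
    return path

# A 4-D hypercube has 16 nodes; node 0 reaches node 11 (binary 1011) in 3 hops.
print(neighbors(0, 4))               # prints [1, 2, 4, 8]
print(hypercube_route(0, 11, 4))     # prints [0, 1, 3, 11]
&amp;lt;/pre&amp;gt;&lt;br /&gt;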
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;2-D Torus&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_2Dtorus.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The 2-D torus takes the structure of the 2-D mesh and connects the nodes on the edges. This decreases the diameter, but the number of links is higher.  It therefore slightly improves upon the throughput of the 2-D mesh, but it also slightly increases the power dissipation since the number of links is higher.  The wrap-around routes connecting the end nodes can have excessively high delays if the topology is not laid out carefully.  This topology was developed in 1985 because of the design constraints, such as pin count and bisection, that the hypercube imposed. [[#References|&amp;lt;sup&amp;gt;[9]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Tree&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_tree.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The tree has a hierarchical structure with processing nodes at the bottom and switching nodes at the upper levels. The tree experiences high traffic and high latency when a packet must travel through the upper levels.  Also, because of the high connectivity, this topology has high average energy dissipation. [[#References|&amp;lt;sup&amp;gt;[8]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Fat Tree&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_fat_tree.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
In 1985, an MIT professor (Charles Leiserson) invented the fat tree to improve upon the normal “skinny” tree.  The fat tree “fattens” up the links at the upper levels, which helps to alleviate the traffic at those levels and to decrease message latency.  However, fattening means that additional links are added near the top of the tree, which increases the average energy dissipated by this topology. [[#References|&amp;lt;sup&amp;gt;[11]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Butterfly&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_butterfly.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Another attempt to improve upon the “skinny” tree structure is the butterfly.  The butterfly is similar to the tree, but it replicates the switching-node structure of the tree topology and connects the copies together so that there are equal numbers of links and routers at all levels.  There are two problems with this topology.  First, there is no path diversity: there is only one path between a given pair of endpoints, which is not ideal in case the network is congested in one area but idle in another, because there is no way for the network to rebalance the load.  Second, there are some very long routes in this topology, which require repeaters between nodes and cause the physical area needed to implement the network to increase dramatically. [[#References|&amp;lt;sup&amp;gt;[12]&amp;lt;/sup&amp;gt;]] &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Comparison of Network Topologies &amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The following table shows the total number of ports required for each network topology. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_ports.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Number of ports for each topology''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
As the figure above shows, the 6-D hypercube requires the largest number of ports, due to its relatively complex six-dimensional structure. In contrast, the fat tree requires the fewest ports, even though links have been &amp;quot;fattened&amp;quot; up by using redundant links. The butterfly network requires more than twice the number of ports of the fat tree, since it essentially replicates the switching layer of the fat tree. The number of ports for the mesh and torii structures increases as the dimensionality increases.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The average path length, or average number of hops, and the average link load (GB/s) are shown below.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_load.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Average path length and link load for each topology''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Looking at the trends, when the average path length is high, the average link load is also high; in other words, average path length and average link load are proportionally related. It is clear from the graph that the 2-D mesh has, by far, the worst performance. In a network as large as this, the average path length is simply too high, and the average link load suffers. For this type of high-performance network, the 2-D mesh does not scale well. The 2-D torus cuts the average path length and average link load roughly in half by connecting the edge nodes together, but its performance compared to the other types is still relatively poor. The butterfly and fat tree have the lowest average path length and average link load. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The figure below shows the cost of the network topologies.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_cost.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Cost of each topology''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Despite using the fewest ports, the fat tree topology has by far the highest cost. Although it uses the fewest ports, they are high-bandwidth 10 GB/s ports, and over 2400 of them are required to provide enough bandwidth at the upper levels of the tree. This pushes the cost up dramatically and makes the fat tree impractical from a cost standpoint. While the total cost of the fat tree is about 15 million dollars, the rest of the network topologies are clustered below 4 million dollars. As the dimensionality of the mesh and torii structures increases, the cost increases. The butterfly network's cost falls between the 2-D mesh/torii and the 6-D hypercube. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
When the cost and the average link load are factored together, the following graph is produced.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_overall.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Overall cost of each topology''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
From the figure above, the 6-D hypercube is the most cost-effective choice for this particular network setup. Although the 6-D hypercube costs more because it needs more links and ports, it provides higher bandwidth, which can offset the higher cost. The high-dimensional torii also perform well, but cannot provide as much bandwidth as the 6-D hypercube; for systems that do not need as much bandwidth, they are also a good choice. The butterfly topology is another alternative, but it has lower fault tolerance. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Real-World Implementation of Network Topologies &amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In a research study by Andy Hospodor and Ethan Miller, several network topologies were investigated in a high-performance, high-traffic network&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;, including the fat tree, butterfly, mesh, torii, and hypercube structures. Advantages and disadvantages, including cost, performance, and reliability, were discussed. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In this experiment, a petabyte-scale network with over 100 GB/s total aggregate bandwidth was investigated. The network consisted of 4096 disks attached to large servers, with routers and switches in between&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The overall structure of the network is shown below. Note that this structure is very susceptible to failure and congestion.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_network.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Basic structure of Hospodor and Miller's experimental network''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Fat Tree&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In large-scale, high-performance applications, the fat tree can be a reasonable choice. However, in order to &amp;quot;fatten&amp;quot; up the links, redundant connections must be used: instead of one link between switching nodes, several must be used. The problem is that with more input and output links, routers with more input and output ports are needed. Routers with in excess of 100 ports are difficult to build and expensive, so multiple routers would have to be stacked together. Even so, the routers would be expensive and many of them would be required&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The Japan Agency for Marine-Earth Science and Technology supercomputing system uses the fat tree topology. The system connects 1280 NEC processors&amp;lt;sup&amp;gt;7&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Butterfly&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In high-performance applications, the butterfly structure can be a good choice. The butterfly topology uses fewer links than other topologies; however, each link carries the traffic of an entire layer. Fault tolerance is poor: there exists only a single path between each pair of nodes, so if a link breaks, data cannot be re-routed and communication is broken&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_butterfly.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Butterfly structure''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Meshes and Tori&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The mesh and torus structures used in this application would require a large number of links and an aggregate of several thousand ports. However, since there are so many links, the mesh and torus structures provide alternate paths in case of failures&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
One example of current use of the torus structure is the QPACE SFB TR cluster in Germany, built from PowerXCell 8i processors. The system uses a 3-D torus topology with 4608 processors&amp;lt;sup&amp;gt;7&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_mesh.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Mesh structure''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_torus.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Torus structure''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Hypercube&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Similar to the torii structures, the hypercube requires a large number of links. However, its bandwidth scales better than that of the mesh and torii structures. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The CRAY T3E, CRAY XT3, and SGI Origin 2000 use k-ary n-cube topologies.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_hypercube.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Hypercube structure''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Routing&amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The '''routing''' algorithm determines what path a packet of data will take from source to destination. Routing can be '''deterministic''', where the path is always the same for a given source and destination, or '''adaptive''', where the path can change. A routing algorithm can also be '''partially adaptive''', where packets have multiple path choices but not all packets are allowed to use the shortest path&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Deadlock&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Packets are in '''deadlock''' when they cannot continue to move through the nodes. The illustration below demonstrates this situation. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_deadlock.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Example of deadlock''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Assume that all of the buffers are full at each node. The packet from Node 1 cannot continue to Node 2, the packet from Node 2 cannot continue to Node 3, and so on. Since no packet can move, the network is deadlocked. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The deadlock arises from a cyclic pattern of routing; to avoid deadlock, circular routing patterns must be avoided (the sketch after this section shows a simple check for such a cycle).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To avoid circular routing patterns, certain turns are disallowed. These are called '''turn restrictions''': some turns are forbidden so that a circular routing pattern can never form. Some of these turn restriction models are described below.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
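&amp;lt;p&amp;gt;&lt;br /&gt;
The four-node example above is a cycle in the waits-for relation: each packet waits for the buffer held by the next. A simple way to see (or rule out) deadlock is to check that relation for cycles, as in the following illustrative sketch (a toy check invented for this example, not a mechanism described in the cited references).&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;
def has_deadlock(waits_for):
    # 'waits_for' maps a node to the node whose buffer it is waiting on, or None.
    # Follow the chain from every node; coming back to a node already seen on
    # the same chain means the waits-for relation contains a cycle, i.e. deadlock.
    for start in waits_for:
        seen = set()
        node = start
        while node is not None and node not in seen:
            seen.add(node)
            node = waits_for.get(node)
        if node is not None:      # the walk stopped because it revisited a node
            return True
    return False

# Node 1 waits on 2, 2 on 3, 3 on 4, and 4 on 1: a circular pattern, so deadlock.
print(has_deadlock({1: 2, 2: 3, 3: 4, 4: 1}))   # prints True
print(has_deadlock({1: 2, 2: 3, 3: None}))      # prints False
&amp;lt;/pre&amp;gt;&lt;br /&gt;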
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;h2&amp;gt;Dimensional ordered (X-Y) routing&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Packets are first routed along the x-dimension and then along the y-dimension; turns from the y-dimension back to the x-dimension are not allowed (a minimal sketch follows).&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
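&amp;lt;p&amp;gt;&lt;br /&gt;
Dimension-ordered routing is easy to state in code: a packet travels entirely along x first and then entirely along y, so a y-to-x turn can never occur. The sketch below is a minimal illustration of that rule for a 2-D mesh (names and coordinates are assumptions for the example, not code from reference 3).&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;
def xy_route(src, dst):
    # Dimension-ordered (X-Y) routing on a 2-D mesh: exhaust the x offset first,
    # then the y offset. A turn from y back to x never happens, so no routing
    # cycle (and hence no deadlock) can form among the packets.
    x, y = src
    tx, ty = dst
    hops = [(x, y)]
    while x != tx:                       # travel along the x dimension first
        x += (tx - x) // abs(tx - x)     # step one hop toward the destination column
        hops.append((x, y))
    while y != ty:                       # then travel along the y dimension
        y += (ty - y) // abs(ty - y)
        hops.append((x, y))
    return hops

print(xy_route((0, 3), (2, 1)))
# prints [(0, 3), (1, 3), (2, 3), (2, 2), (2, 1)], x first and then y
&amp;lt;/pre&amp;gt;&lt;br /&gt;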
&lt;br /&gt;
&amp;lt;h2&amp;gt;West First&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Turns to the west are not allowed.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;North Last&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Turns after a north direction are not allowed. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Negative First&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Turns in the negative direction (-x or -y) are not allowed, except on the first turn.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Odd-Even Turn Model&amp;lt;/h2&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Unfortunately, the above turn-restriction models reduce the degree of adaptiveness and are only partially adaptive. They force some packets to take longer, non-minimal routes, which can cause unfairness and reduces the system's ability to relieve congestion. Overall performance can suffer&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Ge-Ming Chiu introduced the Odd-Even turn model as a deadlock-free, partially adaptive turn-restriction model with better performance than the previously mentioned models&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;. The model is designed primarily for 2-D meshes.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
''Turns from the east to north direction from any node on an even column are not allowed.''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Turns from the north to west direction from any node on an odd column are not allowed.''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Turns from the east to south direction from any node on an even column are not allowed.''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Turns from the south to west direction from any node on an odd column are not allowed.''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The illustration below shows allowed routing for different source and destination nodes. Depending on which column the packet is in, only certain directions are allowed. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_odd_even.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Odd-Even turn restriction model proposed by Ge-Ming Chiu''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
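&amp;lt;p&amp;gt;&lt;br /&gt;
The four rules above can be expressed as a per-node check: given the column of the current router and the turn a packet wants to make, decide whether the turn is permitted. The sketch below encodes just that check (a hypothetical helper, not code from Chiu's paper); columns are numbered by the node's x coordinate, so even and odd refer to that coordinate.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;
def odd_even_turn_allowed(column, incoming, outgoing):
    # Decide whether a packet travelling in direction 'incoming' may turn to
    # direction 'outgoing' at a router in the given column, under the odd-even
    # turn model. Directions are 'N', 'S', 'E', 'W'.
    if column % 2 == 0:
        # East-to-north and east-to-south turns are forbidden in even columns.
        if incoming == 'E' and outgoing in ('N', 'S'):
            return False
    else:
        # North-to-west and south-to-west turns are forbidden in odd columns.
        if incoming in ('N', 'S') and outgoing == 'W':
            return False
    return True

print(odd_even_turn_allowed(2, 'E', 'N'))   # prints False: even column
print(odd_even_turn_allowed(3, 'N', 'W'))   # prints False: odd column
print(odd_even_turn_allowed(3, 'E', 'N'))   # prints True
&amp;lt;/pre&amp;gt;&lt;br /&gt;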
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Comparison of Turn Restriction Models&amp;lt;/h1&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
To evaluate the performance of the various turn-restriction models, Chiu simulated a 15 x 15 mesh under various traffic patterns. All channels have a bandwidth of 20 flits/usec and a buffer size of one flit. The dimension-ordered x-y routing, west-first, and negative-first models were compared against the odd-even model. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Traffic patterns including uniform, transpose, and hot spot were simulated. In uniform traffic, each node sends messages to every other node with equal probability. In transpose traffic, nodes send messages to nodes in the opposite half of the mesh. Hot-spot traffic designates a few &amp;quot;hot spot&amp;quot; nodes that receive a disproportionately high share of the traffic.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_uniform.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Uniform traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The performance of the different routing algorithms under uniform traffic is shown above. For uniform traffic, the dimension-ordered x-y model outperforms the rest of the models: as the number of messages increases, the x-y model shows the slowest increase in average communication latency. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_transpose.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''First transpose traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The performance of the different routing algorithms is shown above for the first transpose traffic. The negative-first model has the best performance, while the odd-even model performs better than the west-first and x-y models.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_transpose2.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Second transpose traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
With the second transpose simulation, the odd-even model outperforms the rest.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_hotspot.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Hotspot traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The performance of the different routing algorithms is shown above for the hotspot traffic. Only one hotspot was simulated for this test. The performance of the odd-even model outperforms other models when hotspot traffic is 10%.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_hotspot2.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Second hotspot traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
When the number of hotspots is increased to five, the advantage of the odd-even model becomes clear: its latency is the lowest at both 6 and 8 percent hotspot traffic, while the x-y model performs very poorly. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
While the x-y model performs well in uniform traffic, it lacks adaptiveness. When traffic becomes hotspot, the x-y model suffers from the inability to adapt and re-route traffic to avoid the congestion caused by hotspots. The odd-even model has superior adaptiveness under high congestion. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Router Architecture&amp;lt;/h1&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The '''router''' is the device that routes incoming data toward its destination. It has several input ports and several output ports; data arriving on one of the input ports is routed to one of the output ports. Which output port is chosen depends on the destination of the data and on the routing algorithm. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The internal architecture of a router consists of input and output ports and a '''crossbar switch'''. The crossbar switch selects which output port each input is connected to, acting essentially as a set of multiplexers. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Router technology has improved significantly over the years, which has made networks with high dimensionality feasible. As shown in the real-world example above, high-dimensional torii and hypercubes are excellent topology choices for high-performance networks, and the availability of high-performance, high-radix routers has contributed to the viability of these high-dimensionality networks. As the graph below shows, the bandwidth of routers has improved tremendously over a period of 10 years&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
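&amp;lt;p&amp;gt;&lt;br /&gt;
Functionally, the crossbar described above behaves like a set of multiplexers: each output port forwards data from at most one selected input port per cycle. The following is a minimal, purely illustrative model of that behaviour, not the implementation of any real router.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;
def crossbar_cycle(inputs, selection):
    # Model one cycle of an N x N crossbar. 'inputs' maps each input-port index
    # to the flit waiting there (or None if the port is idle). 'selection' maps
    # an output-port index to the input port it listens to this cycle, so each
    # output behaves like a multiplexer over the inputs.
    return {out_port: inputs.get(in_port) for out_port, in_port in selection.items()}

# A 4 x 4 crossbar: output 0 listens to input 2, and output 3 listens to input 0.
inputs = {0: 'flit-A', 1: None, 2: 'flit-B', 3: None}
print(crossbar_cycle(inputs, {0: 2, 3: 0}))   # prints {0: 'flit-B', 3: 'flit-A'}
&amp;lt;/pre&amp;gt;&lt;br /&gt;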
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Router_bandwidth.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Bandwidth of various routers over 10 year period''&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Looking at the physical architecture and layout of routers over time, it is evident that the circuitry has become dramatically more dense and complex.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Router_physical.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Router hardware over period of time''&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Router_radix.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Radix and latency of routers over 10 year period''&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The '''radix''', or number of ports, of routers has also increased. Current routers not only have a high radix but also lower latency than the previous generation; as the radix increases, the latency remains roughly steady. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
With high-performance routers, complex topologies become practical; as router technology continues to improve, even more complex, higher-dimensionality topologies become possible. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Fault Tolerant Routing&amp;lt;/h1&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Fault-tolerant routing means the successful routing of messages between any pair of non-faulty nodes in the presence of faulty components&amp;lt;sup&amp;gt;6&amp;lt;/sup&amp;gt;. As the number of processors in a multiprocessor system grows and data rates increase, reliable transmission of data in the event of a network fault becomes a major concern, which makes fault-tolerant routing algorithms important.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Fault Models&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Faults in a network can be categorized in two types:&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
1. '''Transient Faults'''&amp;lt;sup&amp;gt;5&amp;lt;/sup&amp;gt;: A transient fault is a temporary fault that lasts for a very short duration. It can be caused, for example, by a flip-flop changing its output and generating an invalid header. Such faults can be minimized using error-control coding, and they are generally evaluated in terms of the Bit Error Rate (BER).&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
2. '''Permanent Faults'''&amp;lt;sup&amp;gt;5&amp;lt;/sup&amp;gt;: A permanent fault does not go away and causes lasting damage to the network, for example through damaged wires or associated circuitry. These faults are generally evaluated in terms of the Mean Time Between Failures (MTBF).&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Fault Tolerance Mechanisms (for permanent faults)&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The permanent faults can be handled using one of the two mechanisms:&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
1. '''Static Mechanism''': In the static fault-tolerance model, once a fault is detected, all processes running in the system are stopped and the routing tables are emptied. The routing tables are then recalculated from the fault information to provide fault-free paths.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
2. '''Dynamic Mechanisms''': In the dynamic fault-tolerance model, the processes in the network are not completely stalled; only the affected regions are repaired or routed around. Some of the methods used are:&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
a. '''Block Faults''': In this method, many of the healthy nodes in the vicinity of the faulty nodes are also marked as faulty, so that no routes are created close to the actual faulty nodes. The blocked region may be convex or non-convex, and care is taken that none of the new routes introduces a cyclic dependency in the channel dependency graph (CDG).&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
DISADVANTAGE: This method causes a lot of healthy nodes to be declared faulty, reducing the usable capacity of the system.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Fault_pic1.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
b. '''Fault Rings''': This method was introduced by Chalasani and Boppana. A fault ring is the set of nodes and links that are adjacent to the faulty nodes and links; messages are routed around the ring rather than through the faulty region. This approach reduces the number of healthy nodes that must be marked as faulty and blocked.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Fault_pic2.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;References&amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
1 Solihin text&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
2 [http://www.ssrc.ucsc.edu/Papers/hospodor-mss04.pdf Interconnection Architectures for Petabyte-Scale High-Performance Storage Systems]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
3 [http://www.diit.unict.it/~vcatania/COURSES/semm_05-06/DOWNLOAD/noc_routing02.pdf The Odd-Even Turn Model for Adaptive Routing]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
4 [http://www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf Interconnection Topologies:(Historical Trends and Comparisons)]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
5 [http://dspace.upv.es/xmlui/bitstream/handle/10251/2603/tesisUPV2824.pdf?sequence=1 Efficient mechanisms to provide fault tolerance in interconnection networks for PC clusters, José Miguel Montañana Aliaga.]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
6 [http://web.ebscohost.com.www.lib.ncsu.edu:2048/ehost/pdfviewer/pdfviewer?vid=2&amp;amp;hid=15&amp;amp;sid=72e3828d-3cb1-42b9-8198-5c1e974ea53f@sessionmgr4 Adaptive Fault Tolerant Routing Algorithm for Tree-Hypercube Multicomputer, Qatawneh Mohammad]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
7 [http://www.top500.org TOP500 Supercomputing Sites]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
8 [http://cial.csie.ncku.edu.tw/presentation/group_pdf/%5BY2005%5DPerformance%20Evaluation%20and%20Design%20Trade-Offs%20For%20Network-on-Chip%20Interconnect%20Architectures.pdf Interconnection Network Topology Tradeoffs]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
9 [http://www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf Evolution of Interconnection Networks]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
10 [http://www2.engr.arizona.edu/~hpcat/papers/jlt94.pdf Hypercube pros and cons]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
11 [http://courses.csail.mit.edu/6.896/spring04/handouts/papers/fat_trees.pdf History of the Fat Tree Topology]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
12 [http://books.google.com/books?id=uyAg3zu_DYMC&amp;amp;pg=PA75&amp;amp;lpg=PA75&amp;amp;dq=butterfly+interconnection+network+topology&amp;amp;source=bl&amp;amp;ots=5tQm-TVTu0&amp;amp;sig=oKMwSlpqDJqopnYrKv0BbA8k5uU&amp;amp;hl=en&amp;amp;ei=5cmrTYezINDQiAL9uKjvDA&amp;amp;sa=X&amp;amp;oi=book_result&amp;amp;ct=result&amp;amp;resnum=8&amp;amp;ved=0CFMQ6AEwBw  Pros and Cons of the Butterfly Interconnection Network]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
NOTE:  This wiki is based off of a [http://pg-server.csc.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Spring_2010/12_EC previous wiki]&lt;/div&gt;</summary>
		<author><name>Vrmanda</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch12_ar&amp;diff=45095</id>
		<title>CSC/ECE 506 Spring 2011/ch12 ar</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch12_ar&amp;diff=45095"/>
		<updated>2011-04-18T22:22:38Z</updated>

		<summary type="html">&lt;p&gt;Vrmanda: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;h1&amp;gt;Interconnection Network Architecture &amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In a multi-processor system, processors need to communicate with each other and access each other's resources. In order to route data and messages between processors, an interconnection architecture is needed.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Typically, in a multiprocessor system, message passed between processors are frequent and short&amp;lt;sup&amp;gt;1&amp;lt;/sup&amp;gt;. Therefore, the interconnection network architecture must handle messages quickly by having '''low latency''', and must handle several messages at a time and have '''high bandwidth'''. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In a network, a processor along with its cache and memory is considered a '''node'''. The physical wires that connect between them is called a '''link'''. The device that routes messages between nodes is called a router. The shape of the network, such as the number of links and routers, is called the network '''topology'''.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Types of Network Topologies &amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Linear Array&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_linear.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The nodes are connected linearly as in an array. This type of topology is simple and low cost. However, it has several shortcomings. First of all, it does not scale well: the longest distance between two nodes, or the '''diameter''', grows with the number of nodes. In addition, this topology can also result in high congestion, since all traffic between the two halves of the array must cross a single link.  [[#References|&amp;lt;sup&amp;gt;[8]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Ring&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_ring.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Similar structure as the linear array, except, the ending nodes connect to each other, establishing a circular structure. This topology will scale better since the longest distance between two nodes is cut in half, but it eventually ends up scaling poorly if enough nodes are added.  The congestion will also be cut in half since there is now 2 options for packets to traverse.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;2-D Mesh&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_2Dmesh.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The 2-D mesh can be thought of as several linear arrays put together to form a 2-dimensional structure. Nodes that are not on the edge have 4 input or output links, or a '''degree''' of 4.  This topology has reasonably low energy dissipation without compromising throughput.  The area overhead is also rather low: since a neighboring node is never farther than a clock cycle away, there is no need for repeater insertion between the nodes (repeaters have a high area overhead, so using them increases the area of the topology). [[#References|&amp;lt;sup&amp;gt;[8]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Cube&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_cube.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The cube can be thought of as a three-dimensional mesh. This topology continues to improve upon the throughput in the 2-D mesh, however the power dissipation will be slightly higher.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Hypercube&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_hypercube.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The hypercube is essentially multiple cubes put together.  This topology was developed in 1983.  Originally there were no routers; the next hops were simply programmed into each node.  Today the hypercube topology is used by many companies, including Intel.  It is attractive because of its small diameter.  The nodes are numbered in such a way that the addresses of neighboring nodes differ by only one bit, which greatly simplifies routing messages through the network. The biggest drawback of the topology is its lack of scalability: if the number of dimensions is increased by one, a link must be added to every node in the network. [[#References|&amp;lt;sup&amp;gt;[10]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;2-D Torus&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_2Dtorus.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The 2-D torus takes the structure of the 2-D mesh and connects the nodes on the edges. This decreases the diameter, but the number of links is higher.  It therefore slightly improves upon the throughput of the 2-D mesh, but it also slightly increases the power dissipation since the number of links is higher.  The wrap-around routes connecting the end nodes can have excessively high delays if the topology is not laid out carefully.  This topology was developed in 1985 because of the design constraints, such as pin count and bisection, that the hypercube imposed. [[#References|&amp;lt;sup&amp;gt;[9]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Tree&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_tree.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The tree has a hierarchical structure of nodes on the bottom and switching nodes at the upper levels. The tree experiences high traffic and high latency if the package must travel through the upper levels.  Also because of the high connectivity, this topology has high average energy dissipation.[[#References|&amp;lt;sup&amp;gt;[8]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Fat Tree&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_fat_tree.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
In 1985, an MIT professor invented the fat tree to improve upon the normal “skinny” tree.  To improve upon the “skinny” tree topology, the fat tree “fattens” up the links at the upper levels.  This helps to alleviate the traffic at upper levels and to decrease the latency of the message.  However, by fattening, it is meant that additional links are added to this area, which will increase the average energy dissipated by this topology. [[#References|&amp;lt;sup&amp;gt;[11]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Butterfly&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_butterfly.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Another attempt to improve upon the “skinny” tree structure is the butterfly.  The butterfly is similar to the tree, but it replicates the switching-node structure of the tree topology and connects the copies together so that there are equal numbers of links and routers at all levels.  There are two problems with this topology.  First, there is no path diversity: there is only one path between a given pair of endpoints, which is not ideal in case the network is congested in one area but idle in another, because there is no way for the network to rebalance the load.  Second, there are some very long routes in this topology, which require repeaters between nodes and cause the physical area needed to implement the network to increase dramatically. [[#References|&amp;lt;sup&amp;gt;[12]&amp;lt;/sup&amp;gt;]] &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Comparison of Network Topologies &amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The following table shows the total number of ports required for each network topology. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_ports.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Number of ports for each topology''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
As the figure above shows, the 6-D hypercube requires the largest number of ports, due to its relatively complex six-dimensional structure. In contrast, the fat tree requires the least number of ports, even though links have been &amp;quot;fattened&amp;quot; up by using redundant links. The butterfly network requires more than twice the number of ports as the fat tree, since it essentially replicates the switching layer of the fat tree. The number of ports for the mesh and torii structures increase as the dimensionality increases.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Below the average path length, or average number of hops, and the average link load (GB/s) is shown.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_load.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Average path length and link load for each topology''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Looking at the trends, when the average path length is high, the average link load is also high; in other words, average path length and average link load are proportionally related. It is clear from the graph that the 2-D mesh has, by far, the worst performance. In a network as large as this, the average path length is simply too high, and the average link load suffers. For this type of high-performance network, the 2-D mesh does not scale well. The 2-D torus cuts the average path length and average link load roughly in half by connecting the edge nodes together, but its performance compared to the other types is still relatively poor. The butterfly and fat tree have the lowest average path length and average link load. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The figure below shows the cost of the network topologies.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_cost.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Cost of each topology''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Despite using the fewest ports, the fat tree topology has by far the highest cost. Although it uses the fewest ports, they are high-bandwidth 10 GB/s ports, and over 2400 of them are required to provide enough bandwidth at the upper levels of the tree. This pushes the cost up dramatically and makes the fat tree impractical from a cost standpoint. While the total cost of the fat tree is about 15 million dollars, the rest of the network topologies are clustered below 4 million dollars. As the dimensionality of the mesh and torii structures increases, the cost increases. The butterfly network's cost falls between the 2-D mesh/torii and the 6-D hypercube. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
When the cost and average link load is factored the following graph is produced.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_overall.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Overall cost of each topology''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
From the figure above, the 6-D hypercube demonstrates the most cost effective choice on this particular network setup. Although the 6-D hypercube costs more because it needs more links and ports, it provides higher bandwidth, which can offset the higher cost. The high dimensional torii also perform well, but cannot provide as much bandwidth as the 6-D hypercube. For systems that do not need as much bandwidth, the high-dimensional torii is also a good choice. The butterfly topology is also an alternative, but has lower fault tolerance. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Importance and Usage of 2-D Mesh Networks:&lt;br /&gt;
The following metrics should be considered when choosing an interconnection network (a small sketch that computes them for a 2-D mesh follows this list):&lt;br /&gt;
1. Diameter: the maximum distance between any two nodes in the network. The average distance is computed by listing all pairs of nodes and averaging their distances.&lt;br /&gt;
2. Bisection Bandwidth: when the network is divided into two equal partitions, the minimum number of links that must be cut is the bisection bandwidth. &lt;br /&gt;
3. Number of Links: the set of wires that connect pairs of nodes in the network.&lt;br /&gt;
4. Degree: the number of input/output links connected to each router.&lt;br /&gt;
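The sketch below is a minimal, illustrative Python example (the function names and the 4 x 4 grid are assumptions, not taken from any cited study) that computes the diameter, average distance, and degree of a small 2-D mesh from its adjacency list using breadth-first search; the bisection width of the even-sided mesh is simply counted by hand.&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
from collections import deque&lt;br /&gt;
&lt;br /&gt;
def mesh_adjacency(rows, cols):&lt;br /&gt;
    """Adjacency list of a rows x cols 2-D mesh; nodes are (row, col) tuples."""&lt;br /&gt;
    adj = {}&lt;br /&gt;
    for r in range(rows):&lt;br /&gt;
        for c in range(cols):&lt;br /&gt;
            nbrs = []&lt;br /&gt;
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):&lt;br /&gt;
                nr, nc = r + dr, c + dc&lt;br /&gt;
                if nr in range(rows) and nc in range(cols):&lt;br /&gt;
                    nbrs.append((nr, nc))&lt;br /&gt;
            adj[(r, c)] = nbrs&lt;br /&gt;
    return adj&lt;br /&gt;
&lt;br /&gt;
def hops_from(adj, src):&lt;br /&gt;
    """Hop count from src to every node, by breadth-first search."""&lt;br /&gt;
    dist = {src: 0}&lt;br /&gt;
    queue = deque([src])&lt;br /&gt;
    while queue:&lt;br /&gt;
        u = queue.popleft()&lt;br /&gt;
        for v in adj[u]:&lt;br /&gt;
            if v not in dist:&lt;br /&gt;
                dist[v] = dist[u] + 1&lt;br /&gt;
                queue.append(v)&lt;br /&gt;
    return dist&lt;br /&gt;
&lt;br /&gt;
adj = mesh_adjacency(4, 4)&lt;br /&gt;
all_hops = [d for s in adj for d in hops_from(adj, s).values()]&lt;br /&gt;
diameter = max(all_hops)                          # maximum distance in the network&lt;br /&gt;
average = sum(all_hops) / len(all_hops)           # average over all source/destination pairs&lt;br /&gt;
degree = max(len(nbrs) for nbrs in adj.values())  # input/output links per router&lt;br /&gt;
bisection = 4   # links crossing a vertical cut through the middle of a 4x4 mesh&lt;br /&gt;
print(diameter, average, degree, bisection)       # 6 2.5 4 4&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;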
&lt;br /&gt;
&amp;lt;h1&amp;gt;Real-World Implementation of Network Topologies &amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In a research study by Andy Hospodor and Ethan Miller, several network topologies were investigated in a high-performance, high-traffic network&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;. Several topologies were investigated including the fat tree, butterfly, mesh, torii, and hypercube structures. Advantages and disadvantages including cost, performance, and reliability were discussed. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In this experiment, a petabyte-scale network with over 100 GB/s total aggregate bandwidth was investigated. The network consisted of 4096 disks with large servers with routers and switches in between&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The overall structure of the network is shown below. Note that this structure is very susceptible to failure and congestion.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_network.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Basic structure of Hospodor and Miller's experimental network''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Fat Tree&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In large-scale, high-performance applications, the fat tree can be a good choice. However, in order to &amp;quot;fatten&amp;quot; up the links, redundant connections must be used: instead of one link between switching nodes, several are needed. The problem is that with more input and output links, routers need more input and output ports. Routers with more than 100 ports are difficult to build and expensive, so multiple routers have to be stacked together; even then, many expensive routers are required&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The Japan Agency for Marine-Earth Science and Technology supercomputing system uses the fat tree topology; it connects 1280 NEC processors&amp;lt;sup&amp;gt;7&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
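&amp;lt;p&amp;gt;&lt;br /&gt;
As a rough illustration of why the upper links must be fattened, the short Python sketch below is an assumption-laden toy model (not taken from the cited study): it keeps the aggregate bandwidth constant at every level of a binary tree, so the number of parallel physical links or ports per logical link doubles at each level toward the root.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
def fat_tree_level_links(levels, leaf_link_gbps=1.0):&lt;br /&gt;
    """Toy model: every level must carry the same total bandwidth as the leaves."""&lt;br /&gt;
    leaves = 2 ** levels&lt;br /&gt;
    total_bw = leaves * leaf_link_gbps            # aggregate bandwidth at the leaves&lt;br /&gt;
    for lvl in range(levels):&lt;br /&gt;
        logical_links = 2 ** (levels - lvl)       # links between level lvl and lvl + 1&lt;br /&gt;
        per_link_bw = total_bw / logical_links    # bandwidth each logical link must carry&lt;br /&gt;
        parallel = int(per_link_bw / leaf_link_gbps)&lt;br /&gt;
        print(f"level {lvl}: {logical_links} links, x{parallel} parallel ports each")&lt;br /&gt;
&lt;br /&gt;
fat_tree_level_links(4)&lt;br /&gt;
# level 0: 16 links, x1 parallel ports each ... level 3: 2 links, x8 parallel ports each&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;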
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Butterfly&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In high-performance applications, the butterfly structure is a good choice. The butterfly topology uses fewer links than other topologies; however, each link carries traffic from an entire layer, and fault tolerance is poor. Only a single path exists between each pair of nodes, so if a link breaks, data cannot be re-routed and communication is broken&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_butterfly.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Butterfly structure''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Meshes and Tori&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The mesh and torus structures used in this application would require a large number of links and several thousand ports in total. However, since there are so many links, the mesh and torus structures provide alternate paths in case of failures&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
One example of the torus structure in current use is the QPACE SFB TR cluster in Germany, built from PowerXCell 8i processors; the system uses a 3-D torus topology with 4608 processors&amp;lt;sup&amp;gt;7&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_mesh.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Mesh structure''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_torus.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Torus structure''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Hypercube&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Similar to the torus structures, the hypercube requires a larger number of links. However, its bandwidth scales better than that of the mesh and torus structures. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The CRAY T3E, CRAY XT3, and SGI Origin 2000 use k-ary n-cube topologies.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_hypercube.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Hypercube structure''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Routing&amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The '''routing''' algorithm determines what path a packet of data takes from source to destination. Routing can be '''deterministic''', where the path is always the same for a given source and destination, or '''adaptive''', where the path can change. A routing algorithm can also be '''partially adaptive''', where packets have multiple route choices but not every packet is allowed to use the shortest path&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Deadlock&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Packets are in '''deadlock''' when they cannot continue to move through the nodes. The illustration below demonstrates this situation. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_deadlock.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Example of deadlock''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Assume that all of the buffers at each node are full. The packet from Node 1 cannot continue to Node 2, the packet from Node 2 cannot continue to Node 3, and so on. Since no packet can move, the network is deadlocked. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The deadlock arises from a cyclic pattern of routing; to avoid deadlock, circular routing patterns must be avoided.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To avoid circular routing patterns, some turns are disallowed. These are called '''turn restrictions''': certain turns are forbidden so that a circular routing pattern can never form. Some of these turn restriction models are described below.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
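&amp;lt;p&amp;gt;&lt;br /&gt;
To make the cyclic-wait condition concrete, the following minimal Python sketch (the node numbers and the waits-for relation are illustrative, mirroring the four-node figure above) builds the waits-for graph of the deadlocked example and detects the cycle that turn restrictions are designed to break.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
# Each node's packet waits for buffer space at the next node, as in the figure above.&lt;br /&gt;
waits_for = {1: 2, 2: 3, 3: 4, 4: 1}&lt;br /&gt;
&lt;br /&gt;
def find_cycle(start, waits_for):&lt;br /&gt;
    """Follow the waits-for edges; if a node is revisited, a deadlock cycle exists."""&lt;br /&gt;
    seen = []&lt;br /&gt;
    node = start&lt;br /&gt;
    while node not in seen:&lt;br /&gt;
        seen.append(node)&lt;br /&gt;
        node = waits_for.get(node)&lt;br /&gt;
        if node is None:&lt;br /&gt;
            return None          # some packet can make progress, so no deadlock&lt;br /&gt;
    return seen[seen.index(node):]   # the circular portion of the wait chain&lt;br /&gt;
&lt;br /&gt;
print(find_cycle(1, waits_for))  # [1, 2, 3, 4] -- every packet waits on another&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;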
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;h2&amp;gt;Dimensional ordered (X-Y) routing&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Turns from the y-dimension to the x-dimension are not allowed.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
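&amp;lt;p&amp;gt;&lt;br /&gt;
As an illustration of dimension-ordered routing, the minimal Python sketch below (coordinates and helper names are illustrative) routes a packet on a 2-D mesh by fully traversing the x-dimension before the y-dimension, so a y-to-x turn never occurs and no routing cycle can form.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
def sign(v):&lt;br /&gt;
    """Return -1, 0, or +1 for negative, zero, or positive v."""&lt;br /&gt;
    return v // abs(v) if v else 0&lt;br /&gt;
&lt;br /&gt;
def xy_route(src, dst):&lt;br /&gt;
    """Dimension-ordered (X-Y) routing on a 2-D mesh: finish x hops, then y hops."""&lt;br /&gt;
    x, y = src&lt;br /&gt;
    dx_step = sign(dst[0] - x)&lt;br /&gt;
    dy_step = sign(dst[1] - y)&lt;br /&gt;
    path = [(x, y)]&lt;br /&gt;
    while x != dst[0]:           # move only in x first&lt;br /&gt;
        x += dx_step&lt;br /&gt;
        path.append((x, y))&lt;br /&gt;
    while y != dst[1]:           # then move only in y; never turn back to x&lt;br /&gt;
        y += dy_step&lt;br /&gt;
        path.append((x, y))&lt;br /&gt;
    return path&lt;br /&gt;
&lt;br /&gt;
print(xy_route((0, 3), (2, 1)))  # [(0, 3), (1, 3), (2, 3), (2, 2), (2, 1)]&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;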
&lt;br /&gt;
&amp;lt;h2&amp;gt;West First&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Turns to the west are not allowed.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;North Last&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Turns after a north direction are not allowed. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Negative First&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Turns in the negative direction (-x or -y) are not allowed, except on the first turn.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Odd-Even Turn Model&amp;lt;/h2&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Unfortunately, the turn-restriction models above reduce the degree of adaptiveness and are only partially adaptive. They force some packets onto different, and not necessarily minimal, routes. This can cause unfairness and also reduces the system's ability to relieve congestion, so overall performance can suffer&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Ge-Ming Chiu introduced the Odd-Even turn model, an adaptive, deadlock-free turn-restriction model with better performance than the previously mentioned models&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;. The model is designed primarily for 2-D meshes.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
''Turns from the east to north direction from any node on an even column are not allowed.''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Turns from the north to west direction from any node on an odd column are not allowed.''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Turns from the east to south direction from any node on an even column are not allowed.''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Turns from the south to west direction from any node on an odd column are not allowed.''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The illustration below shows allowed routing for different source and destination nodes. Depending on which column the packet is in, only certain directions are allowed. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_odd_even.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Odd-Even turn restriction model proposed by Ge-Ming Chiu''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
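&amp;lt;p&amp;gt;&lt;br /&gt;
The four rules can be expressed as a simple legality check. The sketch below is a minimal Python illustration (the direction names and function signature are illustrative, not from Chiu's paper) that returns whether a turn is permitted at a node, given the column in which the node lies.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
def odd_even_turn_allowed(turn, column):&lt;br /&gt;
    """turn is a (from_direction, to_direction) pair, e.g. ('E', 'N')."""&lt;br /&gt;
    even = (column % 2 == 0)&lt;br /&gt;
    # Rule 1: no east-to-north turn at nodes in even columns.&lt;br /&gt;
    if turn == ('E', 'N') and even:&lt;br /&gt;
        return False&lt;br /&gt;
    # Rule 2: no north-to-west turn at nodes in odd columns.&lt;br /&gt;
    if turn == ('N', 'W') and not even:&lt;br /&gt;
        return False&lt;br /&gt;
    # Rule 3: no east-to-south turn at nodes in even columns.&lt;br /&gt;
    if turn == ('E', 'S') and even:&lt;br /&gt;
        return False&lt;br /&gt;
    # Rule 4: no south-to-west turn at nodes in odd columns.&lt;br /&gt;
    if turn == ('S', 'W') and not even:&lt;br /&gt;
        return False&lt;br /&gt;
    return True&lt;br /&gt;
&lt;br /&gt;
print(odd_even_turn_allowed(('E', 'N'), column=4))  # False (even column)&lt;br /&gt;
print(odd_even_turn_allowed(('E', 'N'), column=5))  # True  (odd column)&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;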
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Comparison of Turn Restriction Models&amp;lt;/h1&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
To evaluate the performance of the various turn restriction models, Chiu simulated a 15 x 15 mesh under several traffic patterns. All channels have a bandwidth of 20 flits/usec and a buffer size of one flit. The dimension-ordered x-y, west-first, and negative-first models were compared against the odd-even model. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Simulations were run with uniform, transpose, and hot-spot traffic patterns; a sketch that generates each pattern follows this paragraph. Uniform traffic has each node send messages to every other node with equal probability. Transpose traffic has each node send messages to the node at the mirrored (transposed) position in the mesh. Hot-spot traffic designates a few &amp;quot;hot spot&amp;quot; nodes that receive a disproportionate share of the traffic.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
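&amp;lt;p&amp;gt;&lt;br /&gt;
The following minimal Python sketch is an illustrative approximation of the three patterns (the destination-selection logic, hot-spot location, and percentages are assumptions, not Chiu's exact generator); it picks a destination for a given source node on a 15 x 15 mesh.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
import random&lt;br /&gt;
&lt;br /&gt;
SIZE = 15                      # 15 x 15 mesh, as in the simulation above&lt;br /&gt;
HOTSPOTS = [(7, 7)]            # illustrative hot-spot node(s)&lt;br /&gt;
HOTSPOT_PERCENT = 10           # share of traffic directed at the hot spots&lt;br /&gt;
&lt;br /&gt;
def uniform_dest(src):&lt;br /&gt;
    """Any node other than the source, with equal probability."""&lt;br /&gt;
    choices = [(r, c) for r in range(SIZE) for c in range(SIZE) if (r, c) != src]&lt;br /&gt;
    return random.choice(choices)&lt;br /&gt;
&lt;br /&gt;
def transpose_dest(src):&lt;br /&gt;
    """Send to the mirrored (transposed) position in the mesh."""&lt;br /&gt;
    r, c = src&lt;br /&gt;
    return (c, r)&lt;br /&gt;
&lt;br /&gt;
def hotspot_dest(src):&lt;br /&gt;
    """With a fixed probability target a hot spot, otherwise fall back to uniform."""&lt;br /&gt;
    to_hotspot = random.choices(&lt;br /&gt;
        [True, False], weights=[HOTSPOT_PERCENT, 100 - HOTSPOT_PERCENT])[0]&lt;br /&gt;
    return random.choice(HOTSPOTS) if to_hotspot else uniform_dest(src)&lt;br /&gt;
&lt;br /&gt;
src = (2, 11)&lt;br /&gt;
print(uniform_dest(src), transpose_dest(src), hotspot_dest(src))&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;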
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_uniform.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Uniform traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The performance of the different routing algorithms under uniform traffic is shown above. For uniform traffic, the dimension-ordered x-y model outperforms the rest of the models: as the number of messages increases, the x-y model has the slowest increase in average communication latency. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_transpose.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''First transpose traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The performance of the different routing algorithms is shown above for the first transpose traffic. The negative-first model has the best performance, while the odd-even model performs better than the west-first and x-y models.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_transpose2.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Second transpose traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
With the second transpose simulation, the odd-even model outperforms the rest.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_hotspot.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Hotspot traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The performance of the different routing algorithms under hotspot traffic is shown above; only one hotspot was simulated for this test. The odd-even model outperforms the other models when the hotspot traffic is 10%.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_hotspot2.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Second hotspot traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
When the number of hotspots is increased to five, the odd-even model begins to shine: its latency is the lowest for both 6 and 8 percent hotspot traffic. Meanwhile, the x-y model performs far worse than the others. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
While the x-y model performs well under uniform traffic, it lacks adaptiveness. Under hotspot traffic, the x-y model suffers because it cannot adapt and re-route traffic around the congestion caused by the hotspots. The odd-even model has superior adaptiveness under high congestion. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Router Architecture&amp;lt;/h1&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The '''router''' is a device that routes incoming data to its destination. It has several input ports and several output ports, and data arriving on one of the input ports is routed to one of the output ports. Which output port is chosen depends on the destination of the data and on the routing algorithm. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The internal architecture of a router consists of input and output ports and a '''crossbar switch'''. The crossbar switch selects which output port each input is connected to, acting essentially as a set of multiplexers. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Router technology has improved significantly over the years, which has made networks with high dimensionality feasible. As shown in the real-world example above, high-dimensional tori and hypercubes are excellent topology choices for high-performance networks, and the cost of high-performance, high-radix routers has contributed to the viability of these high-dimensionality networks. As the graph below shows, the bandwidth of routers has improved tremendously over a period of 10 years&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
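&amp;lt;p&amp;gt;&lt;br /&gt;
The following minimal Python sketch (class and method names are illustrative) models the crossbar described above as a mapping from input ports to output ports, refusing to connect an input to an output that is already claimed by another input.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
class Crossbar:&lt;br /&gt;
    """Toy crossbar: each output port can be driven by at most one input port."""&lt;br /&gt;
&lt;br /&gt;
    def __init__(self, n_ports):&lt;br /&gt;
        self.n_ports = n_ports&lt;br /&gt;
        self.connections = {}            # maps input port to output port&lt;br /&gt;
&lt;br /&gt;
    def connect(self, in_port, out_port):&lt;br /&gt;
        if out_port in self.connections.values():&lt;br /&gt;
            return False                 # output already claimed; arbitration needed&lt;br /&gt;
        self.connections[in_port] = out_port&lt;br /&gt;
        return True&lt;br /&gt;
&lt;br /&gt;
    def forward(self, in_port, flit):&lt;br /&gt;
        out_port = self.connections.get(in_port)&lt;br /&gt;
        return (out_port, flit)          # deliver the flit on the selected output&lt;br /&gt;
&lt;br /&gt;
xbar = Crossbar(n_ports=4)&lt;br /&gt;
xbar.connect(0, 2)                       # routing algorithm chose output 2 for input 0&lt;br /&gt;
print(xbar.connect(3, 2))                # False: output 2 is busy&lt;br /&gt;
print(xbar.forward(0, "flit-A"))         # (2, 'flit-A')&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;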
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Router_bandwidth.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Bandwidth of various routers over 10 year period''&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Looking at the physical architecture and layout of routers, it is evident that the circuitry has become dramatically denser and more complex.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Router_physical.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Router hardware over period of time''&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Router_radix.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Radix and latency of routers over 10 year period''&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The '''radix''', or number of ports, of routers has also increased. Current routers not only have a high radix but also lower latency than the previous generation, and the latency remains steady as the radix increases. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
With high-performance routers, complex topologies become possible; as router technology continues to improve, ever more complex, high-dimensionality topologies become feasible. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Fault Tolerant Routing&amp;lt;/h1&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Fault-tolerant routing means the successful routing of messages between any pair of non-faulty nodes in the presence of faulty components&amp;lt;sup&amp;gt;6&amp;lt;/sup&amp;gt;. With the increasing number of processors in multiprocessor systems and ever-higher data rates, reliable transmission of data in the event of a network fault is a major concern, which makes fault-tolerant routing algorithms important.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Fault Models&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Faults in a network can be categorized in two types:&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
1. '''Transient Faults'''&amp;lt;sup&amp;gt;5&amp;lt;/sup&amp;gt;: A transient fault is a temporary fault that occurs for a very short duration. It can be caused, for example, by a change in the output of a flip-flop that generates an invalid header. Such faults can be minimized using error-control coding, and they are generally evaluated in terms of the Bit Error Rate.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
2. '''Permanent Faults'''&amp;lt;sup&amp;gt;5&amp;lt;/sup&amp;gt;: A permanent fault is a fault that does not go away and causes permanent damage to the network, for example through damaged wires and their associated circuitry. These faults are generally evaluated in terms of the Mean Time Between Failures.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Fault Tolerance Mechanisms (for permanent faults)&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Permanent faults can be handled using one of two mechanisms:&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
1. '''Static Mechanism''': In the static fault-tolerance model, once a fault is detected, all processes running in the system are stopped and the routing tables are emptied. Based on the fault information, the routing tables are then recalculated to provide fault-free paths.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
2. '''Dynamic Mechanisms''': In the dynamic fault-tolerance model, the processes in the network are not completely stalled; only the affected regions are reconfigured. Some of the methods for doing this are:&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
a. '''Block Faults''': In this method, many of the healthy nodes in the vicinity of the faulty nodes are also marked as faulty, so that no routes are created close to the actual faulty nodes. The blocked region may be convex or non-convex, and it is ensured that none of the new routes introduces a cyclic dependency into the channel dependency graph (CDG).&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
DISADVANTAGE: This method causes a large number of healthy nodes to be declared faulty, reducing system capacity.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Fault_pic1.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
b. '''Fault Rings''': This method was introduced by Chalasani and Boppana. A fault ring is the set of nodes and links adjacent to the faulty nodes/links. This approach reduces the number of healthy nodes that must be marked as faulty and blocked; a sketch of building such a ring follows below.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Fault_pic2.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
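&amp;lt;p&amp;gt;&lt;br /&gt;
The sketch below is a minimal, illustrative Python example (the grid size and function names are assumptions) that computes a fault ring around a set of faulty nodes on a 2-D mesh as the healthy nodes immediately adjacent to the faulty region.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
def fault_ring(faulty, rows, cols):&lt;br /&gt;
    """Healthy nodes bordering the faulty region of a rows x cols 2-D mesh."""&lt;br /&gt;
    ring = set()&lt;br /&gt;
    for (r, c) in faulty:&lt;br /&gt;
        # examine the eight surrounding positions of each faulty node&lt;br /&gt;
        for dr in (-1, 0, 1):&lt;br /&gt;
            for dc in (-1, 0, 1):&lt;br /&gt;
                nr, nc = r + dr, c + dc&lt;br /&gt;
                inside = nr in range(rows) and nc in range(cols)&lt;br /&gt;
                if inside and (nr, nc) not in faulty:&lt;br /&gt;
                    ring.add((nr, nc))&lt;br /&gt;
    return ring&lt;br /&gt;
&lt;br /&gt;
# A single faulty node at (2, 2) on a 5 x 5 mesh yields its eight neighbours.&lt;br /&gt;
print(sorted(fault_ring({(2, 2)}, 5, 5)))&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;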
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;References&amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
1 Solihin text&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
2 [http://www.ssrc.ucsc.edu/Papers/hospodor-mss04.pdf Interconnection Architectures for Petabyte-Scale High-Performance Storage Systems]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
3 [http://www.diit.unict.it/~vcatania/COURSES/semm_05-06/DOWNLOAD/noc_routing02.pdf The Odd-Even Turn Model for Adaptive Routing]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
4 [http://www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf Interconnection Topologies:(Historical Trends and Comparisons)]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
5 [http://dspace.upv.es/xmlui/bitstream/handle/10251/2603/tesisUPV2824.pdf?sequence=1 Efficient mechanisms to provide fault tolerance in interconnection networks for PC clusters, José Miguel Montañana Aliaga.]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
6 [http://web.ebscohost.com.www.lib.ncsu.edu:2048/ehost/pdfviewer/pdfviewer?vid=2&amp;amp;hid=15&amp;amp;sid=72e3828d-3cb1-42b9-8198-5c1e974ea53f@sessionmgr4 Adaptive Fault Tolerant Routing Algorithm for Tree-Hypercube Multicomputer, Qatawneh Mohammad]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
7 [http://www.top500.org TOP500 Supercomputing Sites]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
8 [http://cial.csie.ncku.edu.tw/presentation/group_pdf/%5BY2005%5DPerformance%20Evaluation%20and%20Design%20Trade-Offs%20For%20Network-on-Chip%20Interconnect%20Architectures.pdf Interconnection Network Topology Tradeoffs]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
9 [http://www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf Evolution of Interconnection Networks]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
10 [http://www2.engr.arizona.edu/~hpcat/papers/jlt94.pdf Hypercube pros and cons]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
11 [http://courses.csail.mit.edu/6.896/spring04/handouts/papers/fat_trees.pdf History of the Fat Tree Topology]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
12 [http://books.google.com/books?id=uyAg3zu_DYMC&amp;amp;pg=PA75&amp;amp;lpg=PA75&amp;amp;dq=butterfly+interconnection+network+topology&amp;amp;source=bl&amp;amp;ots=5tQm-TVTu0&amp;amp;sig=oKMwSlpqDJqopnYrKv0BbA8k5uU&amp;amp;hl=en&amp;amp;ei=5cmrTYezINDQiAL9uKjvDA&amp;amp;sa=X&amp;amp;oi=book_result&amp;amp;ct=result&amp;amp;resnum=8&amp;amp;ved=0CFMQ6AEwBw  Pros and Cons of the Butterfly Interconnection Network]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
NOTE: This wiki is based on a [http://pg-server.csc.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Spring_2010/12_EC previous wiki]&lt;/div&gt;
		<author><name>Vrmanda</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch12_ar&amp;diff=45094</id>
		<title>CSC/ECE 506 Spring 2011/ch12 ar</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch12_ar&amp;diff=45094"/>
		<updated>2011-04-18T22:21:23Z</updated>

		<summary type="html">&lt;p&gt;Vrmanda: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&amp;lt;h1&amp;gt;Interconnection Network Architecture &amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In a multi-processor system, processors need to communicate with each other and access each other's resources. In order to route data and messages between processors, an interconnection architecture is needed.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Typically, in a multiprocessor system, messages passed between processors are frequent and short&amp;lt;sup&amp;gt;1&amp;lt;/sup&amp;gt;. Therefore, the interconnection network architecture must handle messages quickly by having '''low latency''', and must handle several messages at a time by having '''high bandwidth'''. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In a network, a processor along with its cache and memory is considered a '''node'''. The physical wires that connect between them is called a '''link'''. The device that routes messages between nodes is called a router. The shape of the network, such as the number of links and routers, is called the network '''topology'''.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Types of Network Topologies &amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Linear Array&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_linear.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The nodes are connected linearly as in an array. This type of topology is simple and low cost. However, it has several shortcomings. First of all, it does not scale well: the longest distance between two nodes, or the '''diameter''', grows linearly with the number of nodes. In addition to scaling poorly, this topology can also suffer from high congestion.  [[#References|&amp;lt;sup&amp;gt;[8]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Ring&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_ring.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The ring has a similar structure to the linear array, except that the end nodes are connected to each other, forming a circular structure. This topology scales better since the longest distance between two nodes is cut in half, but it still ends up scaling poorly once enough nodes are added. Congestion is also cut roughly in half, since packets now have two directions in which to traverse the ring.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;2-D Mesh&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_2Dmesh.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The 2-D mesh can be thought of as several linear arrays put together to form a two-dimensional structure. Nodes that are not on an edge have 4 input or output links, or a '''degree''' of 4.  This topology has reasonably low energy dissipation without compromising throughput.  The area overhead for this topology is also rather low: since a neighboring node is never farther than a clock cycle away, there is no need to insert repeaters between the nodes (repeaters have a high area overhead, so using them increases the area of the topology). [[#References|&amp;lt;sup&amp;gt;[8]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Cube&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_cube.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The cube can be thought of as a three-dimensional mesh. This topology continues to improve upon the throughput of the 2-D mesh; however, the power dissipation is slightly higher.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Hypercube&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_hypercube.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The hypercube is essentially multiple cubes put together. This topology was developed in 1983. Originally there were no routers; the next hops were simply programmed into each node. Today the hypercube topology is used by many companies, including Intel. It is attractive because of its small diameter: the nodes are numbered in such a way that neighboring nodes differ in only one address bit, which greatly simplifies routing messages through the network (a small sketch follows this paragraph). The biggest drawback of the topology is its lack of scalability; for example, if the dimension is increased by one, a link must be added to every node in the network. [[#References|&amp;lt;sup&amp;gt;[10]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
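&amp;lt;p&amp;gt;&lt;br /&gt;
As a minimal illustration of the one-bit-difference numbering (function names are illustrative), the Python sketch below lists a node's neighbors by flipping each address bit and routes between two nodes by fixing the differing bits one at a time, so the hop count equals the number of differing bits.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
def neighbors(node, dimensions):&lt;br /&gt;
    """Neighbors of a hypercube node: flip each of its address bits in turn."""&lt;br /&gt;
    return [node ^ (2 ** bit) for bit in range(dimensions)]&lt;br /&gt;
&lt;br /&gt;
def route(src, dst, dimensions):&lt;br /&gt;
    """Fix the differing address bits one at a time (dimension-order routing)."""&lt;br /&gt;
    path = [src]&lt;br /&gt;
    node = src&lt;br /&gt;
    for bit in range(dimensions):&lt;br /&gt;
        mask = 2 ** bit&lt;br /&gt;
        if (node ^ dst) // mask % 2:     # this bit still differs from the destination&lt;br /&gt;
            node ^= mask&lt;br /&gt;
            path.append(node)&lt;br /&gt;
    return path&lt;br /&gt;
&lt;br /&gt;
print(neighbors(0b000, 3))           # [1, 2, 4] -- one-bit flips of node 0&lt;br /&gt;
print(route(0b000, 0b101, 3))        # [0, 1, 5] -- two differing bits, two hops&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;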
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;2-D Torus&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_2Dtorus.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The 2-D torus takes the structure of the 2-D mesh and connects the nodes on opposite edges. This decreases the diameter, but the number of links is higher. It therefore slightly improves upon the throughput of the 2-D mesh, but also slightly increases the power dissipation because of the extra links. The wrap-around routes connecting the end nodes can have excessively high delays if the topology is not laid out carefully. This topology was developed in 1985 in response to the design constraints, such as pin counts and bisection, that the hypercube imposed. [[#References|&amp;lt;sup&amp;gt;[9]&amp;lt;/sup&amp;gt;]] A short sketch of the wrap-around distance calculation follows this paragraph.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
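&amp;lt;p&amp;gt;&lt;br /&gt;
The sketch below is a minimal Python illustration (names are illustrative) of why wrapping the edges shrinks the diameter: along each dimension a packet may go either directly or around the wrap-around link, so the hop count is the smaller of the two.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
def torus_hops(src, dst, size):&lt;br /&gt;
    """Hop count between two nodes of a size x size 2-D torus."""&lt;br /&gt;
    total = 0&lt;br /&gt;
    for s, d in zip(src, dst):&lt;br /&gt;
        direct = abs(d - s)&lt;br /&gt;
        total += min(direct, size - direct)   # direct route vs. wrap-around route&lt;br /&gt;
    return total&lt;br /&gt;
&lt;br /&gt;
def mesh_hops(src, dst):&lt;br /&gt;
    """Hop count on a plain 2-D mesh (no wrap-around links)."""&lt;br /&gt;
    return abs(dst[0] - src[0]) + abs(dst[1] - src[1])&lt;br /&gt;
&lt;br /&gt;
# Corner-to-corner on an 8 x 8 network: the torus sharply reduces the distance.&lt;br /&gt;
print(mesh_hops((0, 0), (7, 7)))        # 14&lt;br /&gt;
print(torus_hops((0, 0), (7, 7), 8))    # 2&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;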
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Tree&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_tree.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The tree has a hierarchical structure, with nodes at the bottom and switching nodes at the upper levels. The tree experiences high traffic and high latency when packets must travel through the upper levels, and because of its high connectivity this topology has a high average energy dissipation.[[#References|&amp;lt;sup&amp;gt;[8]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Fat Tree&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_fat_tree.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
In 1985, an MIT professor invented the fat tree to improve upon the normal “skinny” tree: the fat tree “fattens” up the links at the upper levels, which helps alleviate the traffic at those levels and decreases message latency. Fattening, however, means that additional links are added in that area, which increases the average energy dissipated by the topology. [[#References|&amp;lt;sup&amp;gt;[11]&amp;lt;/sup&amp;gt;]]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Butterfly&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
[[Image:Top_butterfly.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Another attempt to improve the “skinny” tree structure is the butterfly. The butterfly is similar to the tree, but it replicates the switching-node structure of the tree topology and connects the copies together so that there are equal numbers of links and routers at all levels. There are two problems with this topology. First, there is no path diversity: there is only one path from the root to a given downstream node. This is not ideal in case the network is congested in one area but has capacity in another, because there is no way for the network to rebalance the work. Second, there are some very long routes in this topology, which require repeaters between the nodes and dramatically increase the physical area needed to implement the network. [[#References|&amp;lt;sup&amp;gt;[12]&amp;lt;/sup&amp;gt;]] &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Comparison of Network Topologies &amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The following table shows the total number of ports required for each network topology. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_ports.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Number of ports for each topology''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
As the figure above shows, the 6-D hypercube requires the largest number of ports, due to its relatively complex six-dimensional structure. In contrast, the fat tree requires the fewest ports, even though its links have been &amp;quot;fattened&amp;quot; up by using redundant links. The butterfly network requires more than twice as many ports as the fat tree, since it essentially replicates the switching layer of the fat tree. The number of ports for the mesh and torus structures increases as the dimensionality increases.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
Below, the average path length (average number of hops) and the average link load (GB/s) are shown.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_load.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Average path length and link load for each topology''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Looking at the trends, when the average path length is high, the average link load is also high; the two metrics are proportionally related. The graph shows that the 2-D mesh has, by far, the worst performance. In a network this large, the average path length is simply too high, so the average link load suffers; the 2-D mesh does not scale well for this type of high-performance network. The 2-D torus cuts the average path length and average link load in half by connecting the edge nodes together, but its performance is still relatively poor compared to the other topologies. The butterfly and fat tree have the lowest average path length and average link load. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The figure below shows the cost of the network topologies.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_cost.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Cost of each topology''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Despite using the fewest ports, the fat tree topology has by far the highest cost. Although it uses the fewest ports, those ports are high-bandwidth 10 GB/s ports, and over 2400 of them are required to provide enough bandwidth at the upper levels of the tree. This pushes the cost up dramatically and makes the fat tree impractical from a cost standpoint. While the total cost of the fat tree is about 15 million dollars, the rest of the topologies are clustered below 4 million dollars. As the dimensionality of the mesh and torus structures increases, the cost increases. The cost of the butterfly network falls between the 2-D mesh/torus and the 6-D hypercube. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
When the cost and the average link load are factored together, the following graph is produced.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_overall.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Overall cost of each topology''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
From the figure above, the 6-D hypercube is the most cost-effective choice for this particular network. Although the 6-D hypercube costs more because it needs more links and ports, it provides higher bandwidth, which can offset the higher cost. The high-dimensional tori also perform well, but cannot provide as much bandwidth as the 6-D hypercube; for systems that do not need as much bandwidth, they are also a good choice. The butterfly topology is another alternative, but it has lower fault tolerance. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Importance and Usage of 2-D Mesh Networks:&lt;br /&gt;
The following metrics should be considered when choosing an interconnection network:&lt;br /&gt;
1. Diameter: the maximum distance between any two nodes in the network. The average distance is computed by listing all pairs of nodes and averaging their distances.&lt;br /&gt;
2. Bisection Bandwidth: when the network is divided into two equal partitions, the minimum number of links that must be cut is the bisection bandwidth. &lt;br /&gt;
3. Number of Links: the set of wires that connect pairs of nodes in the network.&lt;br /&gt;
4. Degree: the number of input/output links connected to each router.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Real-World Implementation of Network Topologies &amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In a research study by Andy Hospodor and Ethan Miller, several network topologies were investigated in a high-performance, high-traffic network&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;. Several topologies were investigated including the fat tree, butterfly, mesh, torii, and hypercube structures. Advantages and disadvantages including cost, performance, and reliability were discussed. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In this experiment, a petabyte-scale network with over 100 GB/s total aggregate bandwidth was investigated. The network consisted of 4096 disks with large servers with routers and switches in between&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The overall structure of the network is shown below. Note that this structure is very susceptible to failure and congestion.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_network.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Basic structure of Hospodor and Miller's experimental network''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Fat Tree&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In large-scale, high-performance applications, the fat tree can be a good choice. However, in order to &amp;quot;fatten&amp;quot; up the links, redundant connections must be used: instead of one link between switching nodes, several are needed. The problem is that with more input and output links, routers need more input and output ports. Routers with more than 100 ports are difficult to build and expensive, so multiple routers have to be stacked together; even then, many expensive routers are required&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The Japan Agency for Marine-Earth Science and Technology supercomputing system uses the fat tree topology; it connects 1280 NEC processors&amp;lt;sup&amp;gt;7&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Butterfly&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
In high performance applications, the butterfly structure is a good choice. The butterfly topology uses fewer links than other topologies, however, each link carries traffic from the entire layer. Fault tolerance is poor. There exists only a single path between pairs of nodes. Should the link break, data cannot be re-routed, and communication is broken&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_butterfly.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Butterfly structure''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Meshes and Tori&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The mesh and torus structures used in this application would require a large number of links and several thousand ports in total. However, since there are so many links, the mesh and torus structures provide alternate paths in case of failures&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
One example of the torus structure in current use is the QPACE SFB TR cluster in Germany, built from PowerXCell 8i processors; the system uses a 3-D torus topology with 4608 processors&amp;lt;sup&amp;gt;7&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_mesh.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Mesh structure''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_torus.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Torus structure''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Hypercube&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Similar to the torus structures, the hypercube requires a larger number of links. However, its bandwidth scales better than that of the mesh and torus structures. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The CRAY T3E, CRAY XT3, and SGI Origin 2000 use k-ary n-cube topologies.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Disknet_hypercube.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Hypercube structure''&amp;lt;sup&amp;gt;2&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Routing&amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The '''routing''' algorithm determines what path a packet of data will take from source to destination. Routing can be '''deterministic''', where the path is the same given a source and destination, or '''adaptive''', where the path can change. The routing algorithm can also be '''partially adaptive''' where packets have multiple choices, but does not allow all packets to use the shortest path&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Deadlock&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Packets are in '''deadlock''' when they cannot continue to move through the nodes. The illustration below demonstrates this situation. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_deadlock.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Example of deadlock''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Assume that all of the buffers at each node are full. The packet from Node 1 cannot continue to Node 2, the packet from Node 2 cannot continue to Node 3, and so on. Since no packet can move, the network is deadlocked. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The deadlock arises from a cyclic pattern of routing; to avoid deadlock, circular routing patterns must be avoided.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To avoid circular patterns of routing, some routing patterns are disallowed. These are called '''turn restrictions''', where some turns are not allowed in order to avoid making a circular routing pattern. Some of these turn restrictions are mentioned below.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;h2&amp;gt;Dimensional ordered (X-Y) routing&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Turns from the y-dimension to the x-dimension are not allowed.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;West First&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Turns to the west are not allowed.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;North Last&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Turns after a north direction are not allowed. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Negative First&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Turns in the negative direction (-x or -y) are not allowed, except on the first turn.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Odd-Even Turn Model&amp;lt;/h2&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Unfortunately, the turn-restriction models above reduce the degree of adaptiveness and are only partially adaptive. They force some packets onto different, and not necessarily minimal, routes. This can cause unfairness and also reduces the system's ability to relieve congestion, so overall performance can suffer&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Ge-Ming Chiu introduces the Odd-Even turn model as an adaptive turn restriction, deadlock-free model that has better performance than the previously mentioned models&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;. The model is designed primarily for 2-D meshes.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
''Turns from the east to north direction from any node on an even column are not allowed.''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Turns from the north to west direction from any node on an odd column are not allowed.''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Turns from the east to south direction from any node on an even column are not allowed.''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Turns from the south to west direction from any node on an odd column are not allowed.''&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The illustration below shows allowed routing for different source and destination nodes. Depending on which column the packet is in, only certain directions are allowed. &lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_odd_even.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Odd-Even turn restriction model proposed by Ge-Ming Chiu''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Comparison of Turn Restriction Models&amp;lt;/h1&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
To evaluate the performance of the various turn restriction models, Chiu simulated a 15 x 15 mesh under several traffic patterns. All channels have a bandwidth of 20 flits/usec and a buffer size of one flit. The dimension-ordered x-y, west-first, and negative-first models were compared against the odd-even model. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Simulations were run with uniform, transpose, and hot-spot traffic patterns. Uniform traffic has each node send messages to every other node with equal probability. Transpose traffic has each node send messages to the node at the mirrored (transposed) position in the mesh. Hot-spot traffic designates a few &amp;quot;hot spot&amp;quot; nodes that receive a disproportionate share of the traffic.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_uniform.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Uniform traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The performance of the different routing algorithms is shown above for the uniform traffic. For uniform traffic, the dimensional ordered x-y model outperforms the rest of the models. As the number of messages increase, the x-y model has the &amp;quot;slowest&amp;quot; increase in average communication latency. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_transpose.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''First transpose traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The performance of the different routing algorithms is shown above for the first transpose traffic. The negative-first model has the best performance, while the odd-even model performs better than the west-first and x-y models.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_transpose2.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Second transpose traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
With the second transpose simulation, the odd-even model outperforms the rest.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_hotspot.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Hotspot traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
The performance of the different routing algorithms is shown above for the hotspot traffic. Only one hotspot was simulated for this test. The performance of the odd-even model outperforms other models when hotspot traffic is 10%.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Routing_hotspot2.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Second hotspot traffic simulation of various turn restriction models''&amp;lt;sup&amp;gt;3&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
When the number of hotspots is increased to five, the performance of the odd-even begins to shine. The latency is lowest for both 6 and 8 percent hotspot. Meanwhile, the performance of x-y model is horrendous. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
While the x-y model performs well in uniform traffic, it lacks adaptiveness. When traffic becomes hotspot, the x-y model suffers from the inability to adapt and re-route traffic to avoid the congestion caused by hotspots. The odd-even model has superior adaptiveness under high congestion. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Router Architecture&amp;lt;/h1&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The '''router''' is a device that routes incoming data to its destination. It has several input ports and several output ports, and data arriving on one of the input ports is routed to one of the output ports. Which output port is chosen depends on the destination of the data and on the routing algorithm. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
The internal architecture of a router consists of input and output ports and a '''crossbar switch'''. The crossbar switch selects which output port each input is connected to, acting essentially as a set of multiplexers. &lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Router technology has improved significantly over the years, which has made networks with high dimensionality feasible. As shown in the real-world example above, high-dimensional tori and hypercubes are excellent topology choices for high-performance networks, and the cost of high-performance, high-radix routers has contributed to the viability of these high-dimensionality networks. As the graph below shows, the bandwidth of routers has improved tremendously over a period of 10 years&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Router_bandwidth.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Bandwidth of various routers over 10 year period''&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Looking at the physical architecture and layout of routers, it is evident that the circuitry has become dramatically denser and more complex.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Router_physical.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Router hardware over period of time''&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Router_radix.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
''Radix and latency of routers over 10 year period''&amp;lt;sup&amp;gt;4&amp;lt;/sup&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The '''radix''', or the number of ports of routers has also increased. The current technology not only has high radix, but also low latency compared to last generation. As radix increases, the latency remains steady. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
With high-performance routers, complex topologies are possible. As the router technology improves, more complex, high-dimensionality topologies are possible. &lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;Fault Tolerant Routing&amp;lt;/h1&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Fault-tolerant routing means the successful routing of messages between any pair of non-faulty nodes in the presence of faulty components&amp;lt;sup&amp;gt;6&amp;lt;/sup&amp;gt;. With the growing number of processors in multiprocessor systems and increasing data rates, reliable transmission of data in the event of a network fault is a major concern; hence fault-tolerant routing algorithms are important.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Fault Models&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Faults in a network can be categorized into two types:&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
1. '''Transient Faults'''&amp;lt;sup&amp;gt;5&amp;lt;/sup&amp;gt;: A transient fault is a temporary fault that lasts for a very short duration of time. It can be caused, for example, by a spurious change in the output of a flip-flop that leads to the generation of an invalid header. These faults can be minimized using error-control coding and are generally evaluated in terms of the Bit Error Rate.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
2. '''Permanent Faults'''&amp;lt;sup&amp;gt;5&amp;lt;/sup&amp;gt;: A permanent fault is a fault that does not go away and causes permanent damage to the network, for example due to damaged wires or the associated circuitry. These faults are generally evaluated in terms of the Mean Time Between Failures.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h2&amp;gt;Fault Tolerance Mechanisms (for permanent faults)&amp;lt;/h2&amp;gt;&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
Permanent faults can be handled using one of two mechanisms:&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
1. '''Static Mechanism''': In the static fault-tolerance model, once a fault is detected, all processes running in the system are stopped and the routing tables are emptied. Based on the fault information, the routing tables are then recalculated to provide fault-free paths.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
2. '''Dynamic Mechanisms''': In the dynamic fault-tolerance model, operation of the processes in the network is not completely stalled; only the affected regions are repaired or routed around. Some of the methods to do this are:&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
a. '''Block Faults''': In this method, many of the healthy nodes in the vicinity of the faulty nodes are also marked as faulty so that no routes are created close to the actual faulty nodes. The shape of the fault region can be convex or non-convex, and it is ensured that none of the new routes introduces a cyclic dependency into the cyclic dependency graph (CDG). A small sketch of how such a region might be marked is given after this list.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
DISADVANTAGE: This method causes many healthy nodes to be declared faulty, which reduces the usable capacity of the system.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Fault_pic1.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
b. '''Fault Rings''': This method was introduced by Chalasani and Boppana. A fault ring is the set of nodes and links that are adjacent to the faulty nodes/links. This approach reduces the number of healthy nodes that must be marked as faulty and blocked.&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
[[Image:Fault_pic2.jpg]]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
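&amp;lt;p&amp;gt;&lt;br /&gt;
As noted under block faults above, the following is a minimal, hypothetical sketch (in the same pseudo-code style used elsewhere in this article) of how a convex block-fault region might be formed: the smallest enclosing rectangle around the known faulty nodes of an X-by-Y mesh is marked as faulty, and routes are then computed only through unmarked nodes. Actual schemes differ in detail; the variable names and helper routines here are assumptions for illustration only.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 // hedged sketch: mark a convex (rectangular) block-fault region in an X-by-Y mesh&lt;br /&gt;
 min_x = X; min_y = Y; max_x = 0; max_y = 0;&lt;br /&gt;
 '''for''' (each faulty node f)                    // find the bounding box of all faulty nodes&lt;br /&gt;
 {&lt;br /&gt;
    min_x = min(min_x, f.x);  max_x = max(max_x, f.x);&lt;br /&gt;
    min_y = min(min_y, f.y);  max_y = max(max_y, f.y);&lt;br /&gt;
 }&lt;br /&gt;
 '''for''' (x = min_x; x &amp;lt;= max_x; x++)          // mark every node in the box as faulty, including&lt;br /&gt;
    '''for''' (y = min_y; y &amp;lt;= max_y; y++)       // healthy ones (the disadvantage noted above)&lt;br /&gt;
       mark_faulty(x, y);&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
The fault-ring approach avoids much of this over-marking by restricting attention to the nodes and links immediately adjacent to the faults.&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;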
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;h1&amp;gt;References&amp;lt;/h1&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;p&amp;gt;&lt;br /&gt;
1 Yan Solihin, ''Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems'', Solihin Books, 2008&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
2 [http://www.ssrc.ucsc.edu/Papers/hospodor-mss04.pdf Interconnection Architectures for Petabyte-Scale High-Performance Storage Systems]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
3 [http://www.diit.unict.it/~vcatania/COURSES/semm_05-06/DOWNLOAD/noc_routing02.pdf The Odd-Even Turn Model for Adaptive Routing]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
4 [http://www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf Interconnection Topologies:(Historical Trends and Comparisons)]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
5 [http://dspace.upv.es/xmlui/bitstream/handle/10251/2603/tesisUPV2824.pdf?sequence=1 Efficient mechanisms to provide fault tolerance in interconnection networks for PC clusters, José Miguel Montañana Aliaga.]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
6 [http://web.ebscohost.com.www.lib.ncsu.edu:2048/ehost/pdfviewer/pdfviewer?vid=2&amp;amp;hid=15&amp;amp;sid=72e3828d-3cb1-42b9-8198-5c1e974ea53f@sessionmgr4 Adaptive Fault Tolerant Routing Algorithm for Tree-Hypercube Multicomputer, Qatawneh Mohammad]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
7 [http://www.top500.org TOP500 Supercomputing Sites]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
8 [http://cial.csie.ncku.edu.tw/presentation/group_pdf/%5BY2005%5DPerformance%20Evaluation%20and%20Design%20Trade-Offs%20For%20Network-on-Chip%20Interconnect%20Architectures.pdf Interconnection Network Topology Tradeoffs]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
9 [http://www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf Evolution of Interconnection Networks]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
10 [http://www2.engr.arizona.edu/~hpcat/papers/jlt94.pdf Hypercube pros and cons]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
11 [http://courses.csail.mit.edu/6.896/spring04/handouts/papers/fat_trees.pdf History of the Fat Tree Topology]&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
12 [http://books.google.com/books?id=uyAg3zu_DYMC&amp;amp;pg=PA75&amp;amp;lpg=PA75&amp;amp;dq=butterfly+interconnection+network+topology&amp;amp;source=bl&amp;amp;ots=5tQm-TVTu0&amp;amp;sig=oKMwSlpqDJqopnYrKv0BbA8k5uU&amp;amp;hl=en&amp;amp;ei=5cmrTYezINDQiAL9uKjvDA&amp;amp;sa=X&amp;amp;oi=book_result&amp;amp;ct=result&amp;amp;resnum=8&amp;amp;ved=0CFMQ6AEwBw  Pros and Cons of the Butterfly Interconnection Network]&lt;br /&gt;
&amp;lt;/p&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
&amp;lt;br&amp;gt;&lt;br /&gt;
NOTE: This wiki is based on a [http://pg-server.csc.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Spring_2010/12_EC previous wiki]&lt;/div&gt;</summary>
		<author><name>Vrmanda</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch2_JR&amp;diff=43564</id>
		<title>CSC/ECE 506 Spring 2011/ch2 JR</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch2_JR&amp;diff=43564"/>
		<updated>2011-01-31T02:46:16Z</updated>

		<summary type="html">&lt;p&gt;Vrmanda: /* Comparison with Message Passing and Shared Memory */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Supplement to Chapter 2: The Data Parallel Programming Model=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= History =&lt;br /&gt;
As computer architectures have evolved, so have parallel programming models. The earliest advancements in parallel computers took advantage of bit-level parallelism.  These computers used vector processing, which required a shared memory programming model.  As performance returns from this architecture diminished, the emphasis was placed on instruction-level parallelism and the message passing model began to dominate.  Most recently, with the move to cluster-based machines, there has been an increased emphasis on thread-level parallelism. This has corresponded to an increased interest in the data parallel programming model.&lt;br /&gt;
&lt;br /&gt;
== Bit-level parallelism in the 1970's ==&lt;br /&gt;
The major performance improvements from computers during this time were due to the ability to execute 32-bit word size operations at one time ([[#References|Culler (1999), p. 15.]]).  The dominant supercomputers of the time, like the Cray and the ILLIAC IV, were mainly Single Instruction Multiple Data architectures and used a shared memory programming model.  They each used different forms of vector processing ([[#References|Culler (1999), p. 21.]]). &lt;br /&gt;
Development of the ILLIAC IV began in 1964 and wasn't finished until 1975 [http://en.wikipedia.org/wiki/ILLIAC_IV].  A central processor was connected to the main memory and delegated tasks to individual PEs, each of which had its own memory cache [http://archive.computerhistory.org/resources/text/Burroughs/Burroughs.ILLIAC%20IV.1974.102624911.pdf].  Each PE could operate on either an 8-, 32- or 64-bit operand at a given time [http://archive.computerhistory.org/resources/text/Burroughs/Burroughs.ILLIAC%20IV.1974.102624911.pdf].&lt;br /&gt;
&lt;br /&gt;
The Cray-1, built by Cray Research, was installed at Los Alamos National Laboratory in 1976 and had similar performance to the ILLIAC IV [http://en.wikipedia.org/wiki/ILLIAC_IV].  The Cray machine relied heavily on the use of registers instead of individual processing elements as in the ILLIAC IV.  Each processor was connected to main memory and had a number of 64-bit registers used to perform operations [http://www.eecg.toronto.edu/~moshovos/ACA05/read/cray1.pdf].&lt;br /&gt;
&lt;br /&gt;
== Move to instruction-level parallelism in the 1980's ==&lt;br /&gt;
&lt;br /&gt;
Increasing the word size above 32 bits offered diminishing returns in terms of performance ([[#References|Culler (1999), p. 15.]]). In the mid-1980's the emphasis changed from bit-level parallelism to instruction-level parallelism, which involved increasing the number of instructions that could be executed at one time ([[#References|Culler (1999), p. 15.]]).  The message passing model gave programmers the ability to divide up instructions in order to take advantage of this architecture. &lt;br /&gt;
&lt;br /&gt;
== Thread-level parallelism ==&lt;br /&gt;
The move to cluster-based machines in the past decade has added another layer of complexity to parallelism.  Since computers could be located across a network from each other, there is more emphasis on software acting as a bridge [http://cobweb.ecn.purdue.edu/~pplinux/ppcluster.html]. This has led to a greater emphasis on thread- or task-level parallelism [http://en.wikipedia.org/wiki/Thread-level_parallelism]  and the addition of the data parallel programming model to existing message passing or shared memory models [http://en.wikipedia.org/wiki/Thread-level_parallelism].  &lt;br /&gt;
&lt;br /&gt;
= Data Parallel Model =&lt;br /&gt;
One important feature of the data-parallel programming model, or data parallelism (SIMD), is its single control flow. In Flynn's taxonomy, SIMD corresponds to performing the same operation repeatedly over a large data set. There is only one control processor that directs the activities of all the processing elements. In a multiprocessor system executing a single set of instructions (SIMD), data parallelism is achieved when each processor performs the same task on different pieces of distributed data. In some situations, a single execution thread controls operations on all pieces of data. In others, different threads control the operation, but they execute the same code.&lt;br /&gt;
&lt;br /&gt;
== Description and Example ==&lt;br /&gt;
&lt;br /&gt;
This section shows a simple example, adapted from the Solihin textbook (pp. 24-27), that illustrates the data-parallel programming model. Each of the code fragments below is written in pseudo-code style.&lt;br /&gt;
&lt;br /&gt;
Suppose we want to perform the following task on an array &amp;lt;code&amp;gt;a&amp;lt;/code&amp;gt;: updating each element of &amp;lt;code&amp;gt;a&amp;lt;/code&amp;gt; by the product of itself and its index, and adding together the elements of &amp;lt;code&amp;gt;a&amp;lt;/code&amp;gt; into the variable &amp;lt;code&amp;gt;sum&amp;lt;/code&amp;gt;. The corresponding code is shown below.&lt;br /&gt;
&lt;br /&gt;
 // simple sequential task&lt;br /&gt;
 sum = 0;&lt;br /&gt;
 '''for''' (i = 0; i &amp;lt; a.length; i++)&lt;br /&gt;
 {&lt;br /&gt;
    a[i] = a[i] * i;&lt;br /&gt;
    sum = sum + a[i];&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
When we orchestrate the task using the data-parallel programming model, the program can be divided into two parts. The first part performs the same operations on separate elements of the array for each processing element (sometimes referred to as PE or pe), and the second part reorganizes data among all processing elements (in our example, data reorganization means summing up values across the different processing elements). Since the data-parallel programming model only defines the overall effects of parallel steps, the second part can be accomplished either through shared memory or message passing. The three code fragments below are examples of the first part of the program, a shared-memory version of the second part, and a message-passing version of the second part, respectively.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 // data parallel programming: let each PE perform the same task on different pieces of distributed data&lt;br /&gt;
 pe_id = getid();&lt;br /&gt;
 my_sum = 0;&lt;br /&gt;
 '''for''' (i = pe_id; i &amp;lt; a.length; i += number_of_pe)         //separate elements of the array are assigned to each PE &lt;br /&gt;
 {&lt;br /&gt;
    a[i] = a[i] * i;&lt;br /&gt;
    my_sum = my_sum + a[i];                               //all PEs accumulate elements assigned to them into local variable my_sum&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In the above code, data parallelism is achieved by letting each processing element perform actions on separate elements of the array, which are identified using the PE's id. For instance, if three processing elements are used, then one processing element would start at i = 0, one would start at i = 1, and the last would start at i = 2. Since there are three processing elements, the array index for each will increase by three on each iteration until the task is complete (note that in our example the elements assigned to each PE are interleaved rather than contiguous). If the length of the array is a multiple of three, then each processing element takes the same amount of time to execute its portion of the task.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The picture below illustrates how elements of the array are assigned among different PEs for the specific case: length of the array is 7 and there are 3 PEs available. Elements in the array are marked by their indexes (0 to 6). As shown in the picture, PE0 will work on elements with index 0, 3, 6; PE1 is in charge of elements with index 1, 4; and elements with index 2, 5 are assigned to PE2. In this way, these 3 PEs work collectively on the array, while each PE works on different elements. Thus, data parallelism is achieved.&lt;br /&gt;
&lt;br /&gt;
[[Image:506wiki1.png|frame|center|150px|Illustration of data parallel programming(adapted from [http://computing.llnl.gov/tutorials/parallel_comp/#ModelsData Introduction to Parallel Computing])]]&lt;br /&gt;
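&lt;br /&gt;
The fragment above covers only the first part of the program. For completeness, the two fragments below are minimal sketches of the second part, which combines the per-PE partial sums into &amp;lt;code&amp;gt;sum&amp;lt;/code&amp;gt;: one uses shared memory and one uses message passing. The primitives &amp;lt;code&amp;gt;lock()&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;unlock()&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;barrier()&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;send()&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;recv()&amp;lt;/code&amp;gt; are assumed for illustration and are not taken from any particular library.&lt;br /&gt;
&lt;br /&gt;
 // second part, shared-memory version (sketch): accumulate the local sums into the shared variable sum&lt;br /&gt;
 lock(sum_lock);                    // make sure only one PE updates sum at a time&lt;br /&gt;
 sum = sum + my_sum;&lt;br /&gt;
 unlock(sum_lock);&lt;br /&gt;
 barrier();                         // wait until every PE has added its contribution&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 // second part, message-passing version (sketch): PE 0 collects the partial sums&lt;br /&gt;
 '''if''' (pe_id != 0)&lt;br /&gt;
    send(my_sum, 0);                // every other PE sends its partial sum to PE 0&lt;br /&gt;
 '''else'''&lt;br /&gt;
 {&lt;br /&gt;
    sum = my_sum;&lt;br /&gt;
    '''for''' (p = 1; p &amp;lt; number_of_pe; p++)&lt;br /&gt;
    {&lt;br /&gt;
       recv(partial, p);            // receive one partial sum from each of the other PEs&lt;br /&gt;
       sum = sum + partial;&lt;br /&gt;
    }&lt;br /&gt;
 }&lt;br /&gt;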
&lt;br /&gt;
== Comparison with Message Passing and Shared Memory ==&lt;br /&gt;
&lt;br /&gt;
Although the shared memory and message passing models may be combined into hybrid approaches, the two models are fundamentally different ways of addressing the same problem (of access control to common data). In contrast, the data parallel model is concerned with a fundamentally different problem (how to divide work into parallel tasks). As such, the data parallel model may be used in conjunction with either the shared memory or the message passing model without conflict. In fact, Klaiber (1994) compares the performance of a number of data parallel programs implemented with both shared memory and message passing models.&lt;br /&gt;
One of the major advantages of combining the data parallel and message passing models is a reduction in the amount and complexity of communication required relative to a task parallel approach. Similarly, combining the data parallel and shared memory models tends to simplify and reduce the amount of synchronization required. If the task parallel code given in the Task Parallel Model section below were modified from a message passing model to a shared memory model, the two threads would require 8 signals to be sent between the threads (instead of 8 messages). In contrast, the data parallel code would require only a single barrier before the local sums are added to compute the full sum.&lt;br /&gt;
Much as the shared memory model can benefit from specialized hardware, the data parallel programming model can as well. SIMD (single-instruction-multiple-data) processors are specifically designed to run data parallel algorithms. These processors perform a single instruction on many different data locations simultaneously. Modern examples include CUDA processors developed by nVidia and Cell processors developed by STI (Sony, Toshiba, and IBM). For the curious, example code for CUDA processors is provided in the Appendix. However, whereas the shared memory model can be a difficult and costly abstraction in the absence of hardware support, the data parallel model—like the message passing model—does not require hardware support.&lt;br /&gt;
Since data parallel code tends to simplify communication and synchronization, data parallel code may be easier to develop than a more task parallel approach. However, data parallel code also requires writing code to split program data into chunks and assign it to different threads. In addition, it is possible that a problem may not decompose easily into subproblems relying on largely independent chunks of data. In this case, it may be impractical or impossible to apply the data parallel model.&lt;br /&gt;
Once written, data parallel programs can scale easily to large numbers of processors. The data parallel model implicitly encourages data locality by having each thread work on a chunk of data. The regular data chunks also make it easier to reason about where to locate data and how to organize it.&lt;br /&gt;
&lt;br /&gt;
= Task Parallel Model =&lt;br /&gt;
Task parallelism is a form of parallelization in which multiple instructions are executed, either on the same data or on different data. It focuses on distributing the execution of processes (threads) across different parallel computing nodes. As part of the workflow, different execution threads communicate with one another as they work in order to share data.&lt;br /&gt;
&lt;br /&gt;
== Description and Example ==&lt;br /&gt;
Suppose the task to be accomplished is to compute the sum of the results of executing instructions 'A' and 'B'. The following example illustrates how task parallelism can be achieved.&lt;br /&gt;
&lt;br /&gt;
The pseudo code below illustrates task parallelism:&lt;br /&gt;
&amp;lt;pre&amp;gt;program:&lt;br /&gt;
do &lt;br /&gt;
...&lt;br /&gt;
&lt;br /&gt;
if CPU=&amp;quot;a&amp;quot; then&lt;br /&gt;
   do task &amp;quot;A&amp;quot;&lt;br /&gt;
else if CPU=&amp;quot;b&amp;quot; then&lt;br /&gt;
   do task &amp;quot;B&amp;quot;&lt;br /&gt;
end if&lt;br /&gt;
&lt;br /&gt;
end program&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
If we write the code as above and launch it on a 2-processor system, then the runtime environment will execute it accordingly.&lt;br /&gt;
In an SPMD system, both CPUs will execute the code. In a parallel environment, both will have access to the same data. The &amp;quot;if&amp;quot; clause differentiates between the CPUs: CPU &amp;quot;a&amp;quot; will evaluate the &amp;quot;if&amp;quot; as true and CPU &amp;quot;b&amp;quot; will evaluate the &amp;quot;else if&amp;quot; as true, so each has its own task. Both CPUs then execute separate code blocks simultaneously, performing different tasks.&lt;br /&gt;
Code executed by CPU &amp;quot;a&amp;quot;:&lt;br /&gt;
 program:&lt;br /&gt;
 ...&lt;br /&gt;
 do task &amp;quot;A&amp;quot;&lt;br /&gt;
 ...&lt;br /&gt;
 end program&lt;br /&gt;
Code executed by CPU &amp;quot;b&amp;quot;:&lt;br /&gt;
 program:&lt;br /&gt;
 ...&lt;br /&gt;
 do task &amp;quot;B&amp;quot;&lt;br /&gt;
 ...&lt;br /&gt;
 end program&lt;br /&gt;
This concept can now be generalized to any number of processors.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; &lt;br /&gt;
program:&lt;br /&gt;
...&lt;br /&gt;
if CPU=&amp;quot;a&amp;quot; then&lt;br /&gt;
   do task &amp;quot;A&amp;quot;&lt;br /&gt;
else if CPU=&amp;quot;b&amp;quot; then&lt;br /&gt;
   do task &amp;quot;B&amp;quot;&lt;br /&gt;
...&lt;br /&gt;
else if CPU=&amp;quot;n&amp;quot; then&lt;br /&gt;
   do task &amp;quot;N&amp;quot;&lt;br /&gt;
end if&lt;br /&gt;
...&lt;br /&gt;
end program&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Data Parallel Model vs Task Parallel Model =&lt;br /&gt;
One important feature of the data-parallel programming model, or data parallelism (SIMD), is its single control flow: there is only one control processor that directs the activities of all the processing elements. In stark contrast is task parallelism (MIMD: Multiple Instruction, Multiple Data): characterized by multiple control flows, it allows the concurrent execution of multiple instruction streams, each manipulating its own data and serving a separate function. Below is a contrast between the data parallelism and task parallelism models from Wikipedia: [http://en.wikipedia.org/wiki/SIMD SIMD] and [http://en.wikipedia.org/wiki/MIMD MIMD]. In the following subsections we continue to compare and contrast features of the data-parallel and task-parallel models to help the reader understand the unique characteristics of the data-parallel programming model.&lt;br /&gt;
[[Image:Smid.png|frame|center|425px|contrast between data parallelism and task parallelism]]&lt;br /&gt;
&lt;br /&gt;
== Synchronous vs Asynchronous ==&lt;br /&gt;
While the [http://en.wikipedia.org/wiki/Lockstep_(computing) lockstep] imposed by data parallelism on all data streams ensures synchronous computation (all PEs perform their tasks at the exact same pace), every processor in task parallelism performs its task at its own pace, which we call asynchronous computation. Thus, at certain points of a task parallel program's execution, communication and synchronization primitives are needed to allow different instruction streams to coordinate their efforts, and that is where variable-sharing and message-passing come into play.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Determinism vs. Non-Determinism ==&lt;br /&gt;
Data parallelism's synchronous nature and task parallelism's asynchrony give rise to another pair of features that distinguish these two models: determinism versus non-determinism. Data parallelism is deterministic, i.e., computing with the same input will always yield the same result, since its synchronism ensures that issues such as the relative timing between PEs do not arise. In contrast, task parallelism's asynchronous updates of common data can give rise to non-determinism, i.e., the same input will not always yield the same computation result (the result of a computation also depends on factors outside the program's control, such as the scheduling and timing of other PEs). Obviously, non-determinism makes it harder to write and maintain correct programs. This partially explains the advantage of the data parallel programming model over task parallelism in terms of development effort (also discussed in section 4.2).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Summary of major differences between the data parallel and task parallel models ==&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot; align=&amp;quot;center&amp;quot;&lt;br /&gt;
|+ '''Comparison between data parallel and task parallel programming models.'''&lt;br /&gt;
|-&lt;br /&gt;
! Aspects&lt;br /&gt;
! Data Parallel&lt;br /&gt;
! Task Parallel&lt;br /&gt;
|-&lt;br /&gt;
| Decomposition&lt;br /&gt;
| Partition data into subsets&lt;br /&gt;
| Partition program into subtasks&lt;br /&gt;
|-&lt;br /&gt;
| Parallel tasks&lt;br /&gt;
| Identical&lt;br /&gt;
| Unique&lt;br /&gt;
|-&lt;br /&gt;
| Degree of parallelism&lt;br /&gt;
| Scales easily&lt;br /&gt;
| Fixed&lt;br /&gt;
|-&lt;br /&gt;
| Load balancing&lt;br /&gt;
| Easier&lt;br /&gt;
| Harder&lt;br /&gt;
|-&lt;br /&gt;
| Communication overhead&lt;br /&gt;
| Lower&lt;br /&gt;
| Higher&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= Definitions =&lt;br /&gt;
&lt;br /&gt;
* ''Data parallel.''  A data parallel algorithm is composed of a set of identical tasks which operate on different subsets of common data.&lt;br /&gt;
* ''Task parallel.''  A task parallel algorithm is composed of a set of differing tasks which operate on common data.&lt;br /&gt;
* ''SIMD (single-instruction-multiple-data).''  A processor which executes a single instruction simultaneously on multiple data locations.&lt;br /&gt;
* ''MIMD (multiple-instruction-multiple-data).'' A processor architecture which can execute multiple instructions across multiple data elements simultaneously.&lt;br /&gt;
&lt;br /&gt;
= References =&lt;br /&gt;
* David E. Culler, Jaswinder Pal Singh, and Anoop Gupta, [http://portal.acm.org/citation.cfm?id=550071 ''Parallel Computer Architecture: A Hardware/Software Approach,''] Morgan Kaufmann, 1999.&lt;br /&gt;
* Ian Foster, [http://www.mcs.anl.gov/~itf/dbpp/ ''Designing and Building Parallel Programs,''] Addison-Wesley, 1995.&lt;br /&gt;
* Magne Haveraaen, [http://portal.acm.org/citation.cfm?id=1239917 &amp;quot;Machine and collection abstractions for user-implemented data-parallel programming,&amp;quot;] ''Scientific Programming,'' 8(4):231-246, 2000.&lt;br /&gt;
* W. Daniel Hillis and Guy L. Steele, Jr., [http://portal.acm.org/citation.cfm?id=7903 &amp;quot;Data parallel algorithms,&amp;quot;] ''Communications of the ACM,'' 29(12):1170-1183, December 1986.&lt;br /&gt;
* Alexander C. Klaiber and Henry M. Levy, [http://portal.acm.org/citation.cfm?id=192020 &amp;quot;A comparison of message passing and shared memory architectures for data parallel programs,&amp;quot;] in ''Proceedings of the 21st Annual International Symposium on Computer Architecture,'' April 1994, pp. 94-105.&lt;br /&gt;
* Yan Solihin, ''Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems,'' Solihin Books, 2008.&lt;br /&gt;
* Philip J. Hatcher, Michael Jay Quinn, ''Data-Parallel Programming on MIMD Computers'', The MIT Press, 1991.&lt;br /&gt;
* Blaise Barney, &amp;quot;Introduction to Parallel Computing: Data Parallel Model&amp;quot;, Lawrence Livermore National Laboratory, [https://computing.llnl.gov/tutorials/parallel_comp/#ModelsData https://computing.llnl.gov/tutorials/parallel_comp/#ModelsData], January 2009.&lt;br /&gt;
* Guy Blelloch, &amp;quot;Is Parallel Programming Hard?&amp;quot;, Carnegie Mellon University, [http://www.cilk.com/multicore-blog/bid/9108/Is-Parallel-Programming-Hard http://www.cilk.com/multicore-blog/bid/9108/Is-Parallel-Programming-Hard], April 2009.&lt;br /&gt;
* Björn Lisper, ''Data parallelism and functional programming'', Lecture Notes in Computer Science, Volume 1132/1996, pp. 220-251, Springer Berlin, 1996.&lt;br /&gt;
* ''SIMD'', Wikipedia, the free encyclopedia, [http://en.wikipedia.org/wiki/SIMD http://en.wikipedia.org/wiki/SIMD].&lt;br /&gt;
* ''MIMD'', Wikipedia, the free encyclopedia, [http://en.wikipedia.org/wiki/MIMD http://en.wikipedia.org/wiki/MIMD].&lt;br /&gt;
* ''Lockstep'', Wikipedia, the free encyclopedia, [http://en.wikipedia.org/wiki/Lockstep_(computing) http://en.wikipedia.org/wiki/Lockstep_(computing)].&lt;br /&gt;
* ''SPMD'', Wikipedia, the free encyclopedia, [http://en.wikipedia.org/wiki/SPMD http://en.wikipedia.org/wiki/SPMD].&lt;/div&gt;</summary>
		<author><name>Vrmanda</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch2_JR&amp;diff=43562</id>
		<title>CSC/ECE 506 Spring 2011/ch2 JR</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch2_JR&amp;diff=43562"/>
		<updated>2011-01-31T02:42:52Z</updated>

		<summary type="html">&lt;p&gt;Vrmanda: /* References */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Supplement to Chapter 2: The Data Parallel Programming Model=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= History =&lt;br /&gt;
As computer architectures have evolved, so have parallel programming models. The earliest advancements in parallel computers took advantage of bit-level parallelism.  These computers used vector processing, which required a shared memory programming model.  As performance returns from this architecture diminished, the emphasis was placed on instruction-level parallelism and the message passing model began to dominate.  Most recently, with the move to cluster-based machines, there has been an increased emphasis on thread-level parallelism. This has corresponded to an increase interest in the data parallel programming model.&lt;br /&gt;
&lt;br /&gt;
== Bit-level parallelism in the 1970's ==&lt;br /&gt;
The major performance improvements from computers during this time were due to the ability to execute 32-bit word size operations at one time ([[#References|Culler (1999), p. 15.]]).  The dominant supercomputers of the time, like the Cray and the ILLIAC IV, were mainly Single Instruction Multiple Data architectures and used a shared memory programming model.  They each used different forms of vector processing ([[#References|Culler (1999), p. 21.]]). &lt;br /&gt;
Development of the ILLIAC IV began in 1964 and wasn't finished until 1975 [http://en.wikipedia.org/wiki/ILLIAC_IV].  A central processor was connected to the main memory and delegated tasks to individual PE's, which each had their own memory cache. [http://archive.computerhistory.org/resources/text/Burroughs/Burroughs.ILLIAC%20IV.1974.102624911.pdf].  Each PE could operate either an 8-, 32- or 64-bit operand at a given time [http://archive.computerhistory.org/resources/text/Burroughs/Burroughs.ILLIAC%20IV.1974.102624911.pdf].&lt;br /&gt;
&lt;br /&gt;
The Cray machine was installed at Los Alamos National Laborartory in1976 by Control Data Corporation and had similar performance to the ILLIAC IV [http://en.wikipedia.org/wiki/ILLIAC_IV].  The Cray machine relied heavily on the use of registers instead of individual processors like the ILLIAC IV.  Each processor was connected to main memory and had a number of 64-bit registers used to perform operations [http://www.eecg.toronto.edu/~moshovos/ACA05/read/cray1.pdf].&lt;br /&gt;
&lt;br /&gt;
== Move to instruction-level parallelism in the 1980's ==&lt;br /&gt;
&lt;br /&gt;
Increasing the word size above 32-bits offered diminishing returns in terms of performance ([[#References|Culler (1999), p. 15.]]). In the mid-1980's the emphasis changed from bit-level parallelism to instruction-level parallelism, which involved increasing the number of instructions that could be executed at one time ([[#References|Culler (1999), p. 15.]]).  The message passing model allowed programmers the ability to divide up instructions in order to take advantage of this architecture. &lt;br /&gt;
&lt;br /&gt;
== Thread-level parallelism ==&lt;br /&gt;
The move to cluster-based machines in the past decade, has added another layer of complexity to parallelism.  Since computers could be located across a network from each other, there is more emphasis on software acting as a bridge [http://cobweb.ecn.purdue.edu/~pplinux/ppcluster.html]. This has led to a greater emphasis on thread- or task-level parallelism [http://en.wikipedia.org/wiki/Thread-level_parallelism]  and the addition of the data parallelism programming model to existing message passing or shared memory models [http://en.wikipedia.org/wiki/Thread-level_parallelism].  &lt;br /&gt;
&lt;br /&gt;
= Data Parallel Model =&lt;br /&gt;
One important feature of data-parallel programming model or data parallelism (SIMD) is the single control flow. Flynn's taxonomy classifies SIMD to be analogous to doing the same operation repeatedly over a large data set. There is only one control processor that directs the activities of all the processing elements. In a multiprocessor system executing a single set of instructions (SIMD), data parallelism is achieved when each processor performs the same task on different pieces of distributed data. In some situations, a single execution thread controls operations on all pieces of data. In others, different threads control the operation, but they execute the same code.&lt;br /&gt;
&lt;br /&gt;
== Description and Example ==&lt;br /&gt;
&lt;br /&gt;
This section shows a simple example adapted from Solihin textbook (pp. 24 - 27) that illustrates  the data-parallel programming model. Each of the codes below are written in pseudo-code style.&lt;br /&gt;
&lt;br /&gt;
Suppose we want to perform the following task on an array &amp;lt;code&amp;gt;a&amp;lt;/code&amp;gt;: updating each element of &amp;lt;code&amp;gt;a&amp;lt;/code&amp;gt; by the product of itself and its index, and adding together the elements of &amp;lt;code&amp;gt;a&amp;lt;/code&amp;gt; into the variable &amp;lt;code&amp;gt;sum&amp;lt;/code&amp;gt;. The corresponding code is shown below.&lt;br /&gt;
&lt;br /&gt;
 // simple sequential task&lt;br /&gt;
 sum = 0;&lt;br /&gt;
 '''for''' (i = 0; i &amp;lt; a.length; i++)&lt;br /&gt;
 {&lt;br /&gt;
    a[i] = a[i] * i;&lt;br /&gt;
    sum = sum + a[i];&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
When we orchestrate the task using the data-parallel programming model, the program can be divided into two parts. The first part performs the same operations on separate elements of the array for each processing element (sometimes referred to as PE or pe), and the second part reorganizes data among all processing elements (In our example data reorganization is summing up values across different processing elements). Since data-parallel programming model only defines the overall effects of parallel steps, the second part can be accomplished either through shared memory or message passing. The three code fragments below are examples for the first part of the program, shared-memory version of the second part, and message passing for the second part, respectively.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 // data parallel programming: let each PE perform the same task on different pieces of distributed data&lt;br /&gt;
 pe_id = getid();&lt;br /&gt;
 my_sum = 0;&lt;br /&gt;
 '''for''' (i = pe_id; i &amp;lt; a.length; i += number_of_pe)         //separate elements of the array are assigned to each PE &lt;br /&gt;
 {&lt;br /&gt;
    a[i] = a[i] * i;&lt;br /&gt;
    my_sum = my_sum + a[i];                               //all PEs accumulate elements assigned to them into local variable my_sum&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In the above code, data parallelism is achieved by letting each processing element perform actions on array's separate elements, which are identified using the PE's id. For instance, if three processing elements are used then one processing element would start at i = 0, one would start at i = 1, and the last would start at i = 2. Since there are three processing elements then the index of the array for each will increase by three on each iteration until the task is complete (note that in our example elements assigned to each PE are interleaved instead of continuous). If the length of the array is a multiple of three then each processing element takes the same amount of time to execute its portion of the task.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The picture below illustrates how the elements of the array are assigned among the different PEs for a specific case: the array has length 7 and 3 PEs are available. Elements of the array are marked by their indexes (0 to 6). As shown in the picture, PE0 works on the elements with indexes 0, 3 and 6; PE1 is in charge of elements 1 and 4; and elements 2 and 5 are assigned to PE2. In this way the 3 PEs work collectively on the array, while each PE works on different elements. Thus, data parallelism is achieved.&lt;br /&gt;
&lt;br /&gt;
[[Image:506wiki1.png|frame|center|150px|Illustration of data parallel programming (adapted from [http://computing.llnl.gov/tutorials/parallel_comp/#ModelsData Introduction to Parallel Computing])]]&lt;br /&gt;
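&lt;br /&gt;
The shared-memory and message-passing versions of the second part (combining the local sums) are not reproduced above, so the two fragments below sketch one possible rendering of them in real code rather than pseudo-code. They are illustrative sketches, not the textbook's fragments: the first uses C with OpenMP, whose &amp;lt;code&amp;gt;reduction&amp;lt;/code&amp;gt; clause plays the role of the shared-memory accumulation (note that OpenMP typically assigns loop iterations to threads in contiguous blocks rather than interleaving them); the second uses C with MPI, where &amp;lt;code&amp;gt;MPI_Reduce&amp;lt;/code&amp;gt; plays the role of the message-passing accumulation. The array size and initialization are placeholders.&lt;br /&gt;
&lt;br /&gt;
 // shared-memory sketch (C + OpenMP): the reduction clause combines per-thread partial sums&lt;br /&gt;
 #include &amp;lt;stdio.h&amp;gt;&lt;br /&gt;
 #define N 8&lt;br /&gt;
 &lt;br /&gt;
 int main(void)&lt;br /&gt;
 {&lt;br /&gt;
     double a[N];&lt;br /&gt;
     double sum = 0.0;&lt;br /&gt;
     int i;&lt;br /&gt;
 &lt;br /&gt;
     for (i = 0; i &amp;lt; N; i++)          // placeholder initialization&lt;br /&gt;
         a[i] = i + 1.0;&lt;br /&gt;
 &lt;br /&gt;
     // first part: each thread updates its own subset of elements;&lt;br /&gt;
     // second part: the reduction clause merges the partial sums into sum&lt;br /&gt;
     #pragma omp parallel for reduction(+:sum)&lt;br /&gt;
     for (i = 0; i &amp;lt; N; i++) {&lt;br /&gt;
         a[i] = a[i] * i;&lt;br /&gt;
         sum = sum + a[i];&lt;br /&gt;
     }&lt;br /&gt;
 &lt;br /&gt;
     printf(&amp;quot;sum = %f\n&amp;quot;, sum);&lt;br /&gt;
     return 0;&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
In the message-passing sketch, every PE holds its own copy of the array, performs the same interleaved loop as the pseudo-code above, and then the local sums are combined on PE 0.&lt;br /&gt;
&lt;br /&gt;
 // message-passing sketch (C + MPI): MPI_Reduce combines the local sums on PE 0&lt;br /&gt;
 #include &amp;lt;stdio.h&amp;gt;&lt;br /&gt;
 #include &amp;lt;mpi.h&amp;gt;&lt;br /&gt;
 #define N 8&lt;br /&gt;
 &lt;br /&gt;
 int main(int argc, char *argv[])&lt;br /&gt;
 {&lt;br /&gt;
     double a[N], my_sum = 0.0, sum = 0.0;&lt;br /&gt;
     int i, pe_id, number_of_pe;&lt;br /&gt;
 &lt;br /&gt;
     MPI_Init(&amp;amp;argc, &amp;amp;argv);&lt;br /&gt;
     MPI_Comm_rank(MPI_COMM_WORLD, &amp;amp;pe_id);&lt;br /&gt;
     MPI_Comm_size(MPI_COMM_WORLD, &amp;amp;number_of_pe);&lt;br /&gt;
 &lt;br /&gt;
     for (i = 0; i &amp;lt; N; i++)              // every PE holds its own copy of a&lt;br /&gt;
         a[i] = i + 1.0;&lt;br /&gt;
 &lt;br /&gt;
     // first part: interleaved assignment of elements to PEs, as in the pseudo-code&lt;br /&gt;
     for (i = pe_id; i &amp;lt; N; i += number_of_pe) {&lt;br /&gt;
         a[i] = a[i] * i;&lt;br /&gt;
         my_sum = my_sum + a[i];&lt;br /&gt;
     }&lt;br /&gt;
 &lt;br /&gt;
     // second part: combine the local partial sums on PE 0&lt;br /&gt;
     MPI_Reduce(&amp;amp;my_sum, &amp;amp;sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);&lt;br /&gt;
 &lt;br /&gt;
     if (pe_id == 0)&lt;br /&gt;
         printf(&amp;quot;sum = %f\n&amp;quot;, sum);&lt;br /&gt;
 &lt;br /&gt;
     MPI_Finalize();&lt;br /&gt;
     return 0;&lt;br /&gt;
 }&lt;br /&gt;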
&lt;br /&gt;
== Comparison with Message Passing and Shared Memory ==&lt;br /&gt;
&lt;br /&gt;
Although the shared memory and message passing models may be combined into hybrid approaches, the two models are fundamentally different ways of addressing the same problem (of access control to common data). In contrast, the data parallel model is concerned with a fundamentally different problem (how to divide work into parallel tasks). As such, the data parallel model may be used in conjunction with either the shared memory or the message passing model without conflict. In fact, Klaiber (1994) compares the performance of a number of data parallel programs implemented with both shared memory and message passing models.&lt;br /&gt;
One of the major advantages of combining the data parallel and message passing models is a reduction in the amount and complexity of communication required relative to a task parallel approach. Similarly, combining the data parallel and shared memory models tends to simplify and reduce the amount of synchronization required. If the task parallel code for this example were modified from a message passing model to a shared memory model, the two threads would require 8 signals to be sent between them (instead of 8 messages). In contrast, the data parallel code requires only a single barrier before the local sums are added to compute the full sum.&lt;br /&gt;
Much as the shared memory model can benefit from specialized hardware, the data parallel programming model can as well. SIMD (single-instruction-multiple-data) processors are specifically designed to run data parallel algorithms. These processors perform a single instruction on many different data locations simultaneously. Modern examples include CUDA processors developed by nVidia and Cell processors developed by STI (Sony, Toshiba, and IBM). For the curious, example code for CUDA processors is provided in the Appendix. However, whereas the shared memory model can be a difficult and costly abstraction in the absence of hardware support, the data parallel model—like the message passing model—does not require hardware support.&lt;br /&gt;
Since data parallel code tends to simplify communication and synchronization, data parallel code may be easier to develop than a more task parallel approach. However, data parallel code also requires writing code to split program data into chunks and assign it to different threads. In addition, it is possible that a problem may not decompose easily into subproblems relying on largely independent chunks of data. In this case, it may be impractical or impossible to apply the data parallel model.&lt;br /&gt;
Once written, data parallel programs can scale easily to large numbers of processors. The data parallel model implicitly encourages data locality by having each thread work on a chunk of data. The regular data chunks also make it easier to reason about where to locate data and how to organize it.&lt;br /&gt;
&lt;br /&gt;
= Task Parallel Model =&lt;br /&gt;
Task parallelism is a form of parallelization in which multiple instructions are executed, either on the same data or on different data. It focuses on distributing the execution of processes (threads) across different parallel computing nodes. As part of the workflow, the different execution threads communicate with one another in order to share data.&lt;br /&gt;
&lt;br /&gt;
== Description and Example ==&lt;br /&gt;
Suppose the task to be accomplished is to compute the sum of the results of executing instruction 'A' and instruction 'B'. The following example illustrates how task parallelism can be achieved.&lt;br /&gt;
&lt;br /&gt;
The pseudo code below illustrates task parallelism:&lt;br /&gt;
&amp;lt;pre&amp;gt;program:&lt;br /&gt;
do &lt;br /&gt;
...&lt;br /&gt;
&lt;br /&gt;
if CPU=&amp;quot;a&amp;quot; then&lt;br /&gt;
   do task &amp;quot;A&amp;quot;&lt;br /&gt;
else if CPU=&amp;quot;b&amp;quot; then&lt;br /&gt;
   do task &amp;quot;B&amp;quot;&lt;br /&gt;
end if&lt;br /&gt;
&lt;br /&gt;
end program&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
If we write the code as above and launch it on a 2-processor system, then the runtime environment will execute it accordingly.&lt;br /&gt;
In an SPMD system, both CPUs execute the same code, and in a parallel environment both have access to the same data. The &amp;quot;if&amp;quot; clause differentiates between the CPUs: CPU &amp;quot;a&amp;quot; evaluates the &amp;quot;if&amp;quot; condition as true and CPU &amp;quot;b&amp;quot; evaluates the &amp;quot;else if&amp;quot; condition as true, so each has its own task. The two CPUs thus execute separate code blocks simultaneously, performing different tasks.&lt;br /&gt;
Code executed by CPU &amp;quot;a&amp;quot;:&lt;br /&gt;
&amp;lt;pre&amp;gt;program:&lt;br /&gt;
...&lt;br /&gt;
do task &amp;quot;A&amp;quot;&lt;br /&gt;
...&lt;br /&gt;
end program&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Code executed by CPU &amp;quot;b&amp;quot;:&lt;br /&gt;
&amp;lt;pre&amp;gt;program:&lt;br /&gt;
...&lt;br /&gt;
do task &amp;quot;B&amp;quot;&lt;br /&gt;
...&lt;br /&gt;
end program&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This concept can now be generalized to any number of processors.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; &lt;br /&gt;
program:&lt;br /&gt;
...&lt;br /&gt;
if CPU=&amp;quot;a&amp;quot; then&lt;br /&gt;
   do task &amp;quot;A&amp;quot;&lt;br /&gt;
else if CPU=&amp;quot;b&amp;quot; then&lt;br /&gt;
   do task &amp;quot;B&amp;quot;&lt;br /&gt;
...&lt;br /&gt;
else if CPU=&amp;quot;n&amp;quot; then&lt;br /&gt;
   do task &amp;quot;N&amp;quot;&lt;br /&gt;
end if&lt;br /&gt;
...&lt;br /&gt;
end program&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
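&lt;br /&gt;
As a concrete (and purely illustrative) rendering of this pattern, the C + MPI sketch below uses process ranks in place of the CPU labels &amp;quot;a&amp;quot; and &amp;quot;b&amp;quot;: every process runs the same program (SPMD), and the branch on the rank gives each process a different task. The functions &amp;lt;code&amp;gt;task_A&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;task_B&amp;lt;/code&amp;gt; are placeholders for the real work.&lt;br /&gt;
&lt;br /&gt;
 // task-parallel sketch (C + MPI): each rank selects a different task&lt;br /&gt;
 #include &amp;lt;stdio.h&amp;gt;&lt;br /&gt;
 #include &amp;lt;mpi.h&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
 static void task_A(void) { printf(&amp;quot;task A\n&amp;quot;); }   // placeholder task&lt;br /&gt;
 static void task_B(void) { printf(&amp;quot;task B\n&amp;quot;); }   // placeholder task&lt;br /&gt;
 &lt;br /&gt;
 int main(int argc, char *argv[])&lt;br /&gt;
 {&lt;br /&gt;
     int rank;&lt;br /&gt;
 &lt;br /&gt;
     MPI_Init(&amp;amp;argc, &amp;amp;argv);&lt;br /&gt;
     MPI_Comm_rank(MPI_COMM_WORLD, &amp;amp;rank);&lt;br /&gt;
 &lt;br /&gt;
     // every process executes the same program; the branch on the rank&lt;br /&gt;
     // assigns each process its own task, as in the pseudo-code above&lt;br /&gt;
     if (rank == 0)&lt;br /&gt;
         task_A();&lt;br /&gt;
     else if (rank == 1)&lt;br /&gt;
         task_B();&lt;br /&gt;
 &lt;br /&gt;
     MPI_Finalize();&lt;br /&gt;
     return 0;&lt;br /&gt;
 }&lt;br /&gt;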
&lt;br /&gt;
= Data Parallel Model vs Task Parallel Model =&lt;br /&gt;
One important feature of the data-parallel programming model, or data parallelism (SIMD), is its single control flow: there is only one control processor, which directs the activities of all the processing elements. In stark contrast, task parallelism (MIMD: Multiple Instruction, Multiple Data) is characterized by multiple control flows: it allows the concurrent execution of multiple instruction streams, each of which manipulates its own data and serves a separate function. Below is a contrast between the data parallelism and task parallelism models, based on the Wikipedia articles on [http://en.wikipedia.org/wiki/SIMD SIMD] and [http://en.wikipedia.org/wiki/MIMD MIMD]. In the following subsections we continue to compare and contrast features of the data-parallel and task-parallel models to help the reader understand the unique characteristics of the data-parallel programming model.&lt;br /&gt;
[[Image:Smid.png|frame|center|425px|contrast between data parallelism and task parallelism]]&lt;br /&gt;
&lt;br /&gt;
== Synchronous vs Asynchronous ==&lt;br /&gt;
While the [http://en.wikipedia.org/wiki/Lockstep_(computing) lockstep] imposed by data parallelism on all data streams ensures synchronous computation (all PEs perform their tasks at exactly the same pace), in task parallelism every processor performs its task at its own pace, which we call asynchronous computation. Thus, at certain points in a task parallel program's execution, communication and synchronization primitives are needed to allow the different instruction streams to coordinate their efforts, and that is where variable sharing and message passing come into play.&lt;br /&gt;
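&lt;br /&gt;
As a small, illustrative sketch of such coordination (not part of the original text), the C + MPI fragment below lets two processes work asynchronously and uses a blocking receive as the synchronization point: rank 0 cannot proceed until rank 1 has finished its work and sent its result. The value 42.0 simply stands in for some independent computation.&lt;br /&gt;
&lt;br /&gt;
 // two asynchronous tasks that synchronize through a message (C + MPI)&lt;br /&gt;
 #include &amp;lt;stdio.h&amp;gt;&lt;br /&gt;
 #include &amp;lt;mpi.h&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
 int main(int argc, char *argv[])&lt;br /&gt;
 {&lt;br /&gt;
     int rank;&lt;br /&gt;
     double partial = 0.0;&lt;br /&gt;
 &lt;br /&gt;
     MPI_Init(&amp;amp;argc, &amp;amp;argv);&lt;br /&gt;
     MPI_Comm_rank(MPI_COMM_WORLD, &amp;amp;rank);&lt;br /&gt;
 &lt;br /&gt;
     if (rank == 1) {&lt;br /&gt;
         partial = 42.0;                      // placeholder for independent work&lt;br /&gt;
         MPI_Send(&amp;amp;partial, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);&lt;br /&gt;
     } else if (rank == 0) {&lt;br /&gt;
         // the blocking receive is the synchronization point&lt;br /&gt;
         MPI_Recv(&amp;amp;partial, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);&lt;br /&gt;
         printf(&amp;quot;received %f\n&amp;quot;, partial);&lt;br /&gt;
     }&lt;br /&gt;
 &lt;br /&gt;
     MPI_Finalize();&lt;br /&gt;
     return 0;&lt;br /&gt;
 }&lt;br /&gt;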
&lt;br /&gt;
&lt;br /&gt;
== Determinism vs. Non-Determinism ==&lt;br /&gt;
Data parallelism's synchronous nature and task parallelism's asynchrony give rise to another pair of distinguishing features: determinism versus non-determinism. Data parallelism is deterministic, i.e., computing with the same input always yields the same result, since its synchronism ensures that issues such as relative timing between PEs do not arise. In contrast, task parallelism's asynchronous updates of common data can give rise to non-determinism, i.e., the same input will not always yield the same result (the result of a computation also depends on factors outside the program's control, such as the scheduling and timing of other PEs). Non-determinism makes it harder to write and maintain correct programs. This partially explains the advantage of the data parallel programming model over task parallelism in terms of development effort (also discussed in section 4.2).&lt;br /&gt;
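&lt;br /&gt;
As an illustrative sketch of this kind of non-determinism (not from the original text), the C + OpenMP fragment below lets several threads update a shared counter without any synchronization. Because the relative timing of the threads is not controlled, updates can be lost and repeated runs can print different totals; adding &amp;lt;code&amp;gt;reduction(+:counter)&amp;lt;/code&amp;gt; (or an atomic update) to the pragma would make the result deterministic again.&lt;br /&gt;
&lt;br /&gt;
 // non-determinism from unsynchronized updates to shared data (C + OpenMP)&lt;br /&gt;
 #include &amp;lt;stdio.h&amp;gt;&lt;br /&gt;
 #define N 1000000&lt;br /&gt;
 &lt;br /&gt;
 int main(void)&lt;br /&gt;
 {&lt;br /&gt;
     long counter = 0;&lt;br /&gt;
     int i;&lt;br /&gt;
 &lt;br /&gt;
     // counter is shared and updated without synchronization: a data race&lt;br /&gt;
     #pragma omp parallel for&lt;br /&gt;
     for (i = 0; i &amp;lt; N; i++)&lt;br /&gt;
         counter = counter + 1;&lt;br /&gt;
 &lt;br /&gt;
     printf(&amp;quot;counter = %ld (expected %d)\n&amp;quot;, counter, N);&lt;br /&gt;
     return 0;&lt;br /&gt;
 }&lt;br /&gt;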
&lt;br /&gt;
&lt;br /&gt;
== Major differences between the data parallel and task parallel models ==&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot; align=&amp;quot;center&amp;quot;&lt;br /&gt;
|+ '''Comparison between data parallel and task parallel programming models.'''&lt;br /&gt;
|-&lt;br /&gt;
! Aspects&lt;br /&gt;
! Data Parallel&lt;br /&gt;
! Task Parallel&lt;br /&gt;
|-&lt;br /&gt;
| Decomposition&lt;br /&gt;
| Partition data into subsets&lt;br /&gt;
| Partition program into subtasks&lt;br /&gt;
|-&lt;br /&gt;
| Parallel tasks&lt;br /&gt;
| Identical&lt;br /&gt;
| Unique&lt;br /&gt;
|-&lt;br /&gt;
| Degree of parallelism&lt;br /&gt;
| Scales easily&lt;br /&gt;
| Fixed&lt;br /&gt;
|-&lt;br /&gt;
| Load balancing&lt;br /&gt;
| Easier&lt;br /&gt;
| Harder&lt;br /&gt;
|-&lt;br /&gt;
| Communication overhead&lt;br /&gt;
| Lower&lt;br /&gt;
| Higher&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= Definitions =&lt;br /&gt;
&lt;br /&gt;
* ''Data parallel.''  A data parallel algorithm is composed of a set of identical tasks which operate on different subsets of common data.&lt;br /&gt;
* ''Task parallel.''  A task parallel algorithm is composed of a set of differing tasks which operate on common data.&lt;br /&gt;
* ''SIMD (single-instruction-multiple-data).''  A processor which executes a single instruction simultaneously on multiple data locations.&lt;br /&gt;
* ''MIMD (multiple-instruction-multiple-data).'' A processor architecture which can execute multiple instructions across multiple data elements simultaneously.&lt;br /&gt;
&lt;br /&gt;
= References =&lt;br /&gt;
* David E. Culler, Jaswinder Pal Singh, and Anoop Gupta, [http://portal.acm.org/citation.cfm?id=550071 ''Parallel Computer Architecture: A Hardware/Software Approach,''] Morgan-Kauffman, 1999.&lt;br /&gt;
* Ian Foster, [http://www.mcs.anl.gov/~itf/dbpp/ ''Designing and Building Parallel Programs,''] Addison-Wesley, 1995.&lt;br /&gt;
* Magne Haveraaen, [http://portal.acm.org/citation.cfm?id=1239917 &amp;quot;Machine and collection abstractions for user-implemented data-parallel programming,&amp;quot;] ''Scientific Programming,'' 8(4):231-246, 2000.&lt;br /&gt;
* W. Daniel Hillis and Guy L. Steele, Jr., [http://portal.acm.org/citation.cfm?id=7903 &amp;quot;Data parallel algorithms,&amp;quot;] ''Communications of the ACM,'' 29(12):1170-1183, December 1986.&lt;br /&gt;
* Alexander C. Klaiber and Henry M. Levy, [http://portal.acm.org/citation.cfm?id=192020 &amp;quot;A comparison of message passing and shared memory architectures for data parallel programs,&amp;quot;] in ''Proceedings of the 21st Annual International Symposium on Computer Architecture,'' April 1994, pp. 94-105.&lt;br /&gt;
* Yan Solihin, ''Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems,'' Solihin Books, 2008.&lt;br /&gt;
* Philip J. Hatcher, Michael Jay Quinn, ''Data-Parallel Programming on MIMD Computers'', The MIT Press, 1991.&lt;br /&gt;
* Blaise Barney, &amp;quot;Introduction to Parallel Computing: Data Parallel Model&amp;quot;, Lawrence Livermore National Laboratory, [https://computing.llnl.gov/tutorials/parallel_comp/#ModelsData https://computing.llnl.gov/tutorials/parallel_comp/#ModelsData], January 2009.&lt;br /&gt;
* Guy Blelloch, &amp;quot;Is Parallel Programming Hard?&amp;quot;, Carnegie Mellon University, [http://www.cilk.com/multicore-blog/bid/9108/Is-Parallel-Programming-Hard http://www.cilk.com/multicore-blog/bid/9108/Is-Parallel-Programming-Hard], April 2009.&lt;br /&gt;
* Björn Lisper, ''Data parallelism and functional programming'', Lecture Notes in Computer Science, Volume 1132/1996, pp. 220-251, Springer Berlin, 1996.&lt;br /&gt;
* ''SIMD'', Wikipedia, the free encyclopedia, [http://en.wikipedia.org/wiki/SIMD http://en.wikipedia.org/wiki/SIMD].&lt;br /&gt;
* ''MIMD'', Wikipedia, the free encyclopedia, [http://en.wikipedia.org/wiki/MIMD http://en.wikipedia.org/wiki/MIMD].&lt;br /&gt;
* ''Lockstep'', Wikipedia, the free encyclopedia, [http://en.wikipedia.org/wiki/Lockstep_(computing) http://en.wikipedia.org/wiki/Lockstep_(computing)].&lt;br /&gt;
* ''SPMD'', Wikipedia, the free encyclopedia, [http://en.wikipedia.org/wiki/SPMD http://en.wikipedia.org/wiki/SPMD].&lt;/div&gt;</summary>
		<author><name>Vrmanda</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch2_JR&amp;diff=43560</id>
		<title>CSC/ECE 506 Spring 2011/ch2 JR</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch2_JR&amp;diff=43560"/>
		<updated>2011-01-31T02:39:09Z</updated>

		<summary type="html">&lt;p&gt;Vrmanda: /* Data Parallel Model vs Task Parallel Model */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Supplement to Chapter 2: The Data Parallel Programming Model=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= History =&lt;br /&gt;
As computer architectures have evolved, so have parallel programming models. The earliest advancements in parallel computers took advantage of bit-level parallelism.  These computers used vector processing, which required a shared memory programming model.  As performance returns from this architecture diminished, the emphasis was placed on instruction-level parallelism and the message passing model began to dominate.  Most recently, with the move to cluster-based machines, there has been an increased emphasis on thread-level parallelism. This has corresponded to an increase interest in the data parallel programming model.&lt;br /&gt;
&lt;br /&gt;
== Bit-level parallelism in the 1970's ==&lt;br /&gt;
The major performance improvements from computers during this time were due to the ability to execute 32-bit word size operations at one time ([[#References|Culler (1999), p. 15.]]).  The dominant supercomputers of the time, like the Cray and the ILLIAC IV, were mainly Single Instruction Multiple Data architectures and used a shared memory programming model.  They each used different forms of vector processing ([[#References|Culler (1999), p. 21.]]). &lt;br /&gt;
Development of the ILLIAC IV began in 1964 and wasn't finished until 1975 [http://en.wikipedia.org/wiki/ILLIAC_IV].  A central processor was connected to the main memory and delegated tasks to individual PE's, which each had their own memory cache. [http://archive.computerhistory.org/resources/text/Burroughs/Burroughs.ILLIAC%20IV.1974.102624911.pdf].  Each PE could operate either an 8-, 32- or 64-bit operand at a given time [http://archive.computerhistory.org/resources/text/Burroughs/Burroughs.ILLIAC%20IV.1974.102624911.pdf].&lt;br /&gt;
&lt;br /&gt;
The Cray machine was installed at Los Alamos National Laborartory in1976 by Control Data Corporation and had similar performance to the ILLIAC IV [http://en.wikipedia.org/wiki/ILLIAC_IV].  The Cray machine relied heavily on the use of registers instead of individual processors like the ILLIAC IV.  Each processor was connected to main memory and had a number of 64-bit registers used to perform operations [http://www.eecg.toronto.edu/~moshovos/ACA05/read/cray1.pdf].&lt;br /&gt;
&lt;br /&gt;
== Move to instruction-level parallelism in the 1980's ==&lt;br /&gt;
&lt;br /&gt;
Increasing the word size above 32-bits offered diminishing returns in terms of performance ([[#References|Culler (1999), p. 15.]]). In the mid-1980's the emphasis changed from bit-level parallelism to instruction-level parallelism, which involved increasing the number of instructions that could be executed at one time ([[#References|Culler (1999), p. 15.]]).  The message passing model allowed programmers the ability to divide up instructions in order to take advantage of this architecture. &lt;br /&gt;
&lt;br /&gt;
== Thread-level parallelism ==&lt;br /&gt;
The move to cluster-based machines in the past decade, has added another layer of complexity to parallelism.  Since computers could be located across a network from each other, there is more emphasis on software acting as a bridge [http://cobweb.ecn.purdue.edu/~pplinux/ppcluster.html]. This has led to a greater emphasis on thread- or task-level parallelism [http://en.wikipedia.org/wiki/Thread-level_parallelism]  and the addition of the data parallelism programming model to existing message passing or shared memory models [http://en.wikipedia.org/wiki/Thread-level_parallelism].  &lt;br /&gt;
&lt;br /&gt;
= Data Parallel Model =&lt;br /&gt;
One important feature of data-parallel programming model or data parallelism (SIMD) is the single control flow. Flynn's taxonomy classifies SIMD to be analogous to doing the same operation repeatedly over a large data set. There is only one control processor that directs the activities of all the processing elements. In a multiprocessor system executing a single set of instructions (SIMD), data parallelism is achieved when each processor performs the same task on different pieces of distributed data. In some situations, a single execution thread controls operations on all pieces of data. In others, different threads control the operation, but they execute the same code.&lt;br /&gt;
&lt;br /&gt;
== Description and Example ==&lt;br /&gt;
&lt;br /&gt;
This section shows a simple example adapted from Solihin textbook (pp. 24 - 27) that illustrates  the data-parallel programming model. Each of the codes below are written in pseudo-code style.&lt;br /&gt;
&lt;br /&gt;
Suppose we want to perform the following task on an array &amp;lt;code&amp;gt;a&amp;lt;/code&amp;gt;: updating each element of &amp;lt;code&amp;gt;a&amp;lt;/code&amp;gt; by the product of itself and its index, and adding together the elements of &amp;lt;code&amp;gt;a&amp;lt;/code&amp;gt; into the variable &amp;lt;code&amp;gt;sum&amp;lt;/code&amp;gt;. The corresponding code is shown below.&lt;br /&gt;
&lt;br /&gt;
 // simple sequential task&lt;br /&gt;
 sum = 0;&lt;br /&gt;
 '''for''' (i = 0; i &amp;lt; a.length; i++)&lt;br /&gt;
 {&lt;br /&gt;
    a[i] = a[i] * i;&lt;br /&gt;
    sum = sum + a[i];&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
When we orchestrate the task using the data-parallel programming model, the program can be divided into two parts. The first part performs the same operations on separate elements of the array for each processing element (sometimes referred to as PE or pe), and the second part reorganizes data among all processing elements (In our example data reorganization is summing up values across different processing elements). Since data-parallel programming model only defines the overall effects of parallel steps, the second part can be accomplished either through shared memory or message passing. The three code fragments below are examples for the first part of the program, shared-memory version of the second part, and message passing for the second part, respectively.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 // data parallel programming: let each PE perform the same task on different pieces of distributed data&lt;br /&gt;
 pe_id = getid();&lt;br /&gt;
 my_sum = 0;&lt;br /&gt;
 '''for''' (i = pe_id; i &amp;lt; a.length; i += number_of_pe)         //separate elements of the array are assigned to each PE &lt;br /&gt;
 {&lt;br /&gt;
    a[i] = a[i] * i;&lt;br /&gt;
    my_sum = my_sum + a[i];                               //all PEs accumulate elements assigned to them into local variable my_sum&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In the above code, data parallelism is achieved by letting each processing element perform actions on array's separate elements, which are identified using the PE's id. For instance, if three processing elements are used then one processing element would start at i = 0, one would start at i = 1, and the last would start at i = 2. Since there are three processing elements then the index of the array for each will increase by three on each iteration until the task is complete (note that in our example elements assigned to each PE are interleaved instead of continuous). If the length of the array is a multiple of three then each processing element takes the same amount of time to execute its portion of the task.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The picture below illustrates how elements of the array are assigned among different PEs for the specific case: length of the array is 7 and there are 3 PEs available. Elements in the array are marked by their indexes (0 to 6). As shown in the picture, PE0 will work on elements with index 0, 3, 6; PE1 is in charge of elements with index 1, 4; and elements with index 2, 5 are assigned to PE2. In this way, these 3 PEs work collectively on the array, while each PE works on different elements. Thus, data parallelism is achieved.&lt;br /&gt;
&lt;br /&gt;
[[Image:506wiki1.png|frame|center|150px|Illustration of data parallel programming(adapted from [http://computing.llnl.gov/tutorials/parallel_comp/#ModelsData Introduction to Parallel Computing])]]&lt;br /&gt;
&lt;br /&gt;
== Comparison with Message Passing and Shared Memory ==&lt;br /&gt;
&lt;br /&gt;
Although the shared memory and message passing models may be combined into hybrid approaches, the two models are fundamentally different ways of addressing the same problem (of access control to common data). In contrast, the data parallel model is concerned with a fundamentally different problem (how to divide work into parallel tasks). As such, the data parallel model may be used in conjunction with either the shared memory or the message passing model without conflict. In fact, Klaiber (1994) compares the performance of a number of data parallel programs implemented with both shared memory and message passing models.&lt;br /&gt;
As discussed in the previous section, one of the major advantages of combining the data parallel and message passing models is a reduction in the amount and complexity of communication required relative to a task parallel approach. Similarly, combining the data parallel and shared memory models tends to simplify and reduce the amount of synchronization required. If the task parallel code given above were modified from a message passing model to a shared memory model, the two threads would require 8 signals be sent between the threads (instead of 8 messages). In contrast, the data parallel code would require a single barrier before the local sums are added to compute the full sum.&lt;br /&gt;
Much as the shared memory model can benefit from specialized hardware, the data parallel programming model can as well. SIMD (single-instruction-multiple-data) processors are specifically designed to run data parallel algorithms. These processors perform a single instruction on many different data locations simultaneously. Modern examples include CUDA processors developed by nVidia and Cell processors developed by STI (Sony, Toshiba, and IBM). For the curious, example code for CUDA processors is provided in the Appendix. However, whereas the shared memory model can be a difficult and costly abstraction in the absence of hardware support, the data parallel model—like the message passing model—does not require hardware support.&lt;br /&gt;
Since data parallel code tends to simplify communication and synchronization, data parallel code may be easier to develop than a more task parallel approach. However, data parallel code also requires writing code to split program data into chunks and assign it to different threads. In addition, it is possible that a problem may not decompose easily into subproblems relying on largely independent chunks of data. In this case, it may be impractical or impossible to apply the data parallel model.&lt;br /&gt;
Once written, data parallel programs can scale easily to large numbers of processors. The data parallel model implicitly encourages data locality by having each thread work on a chunk of data. The regular data chunks also make it easier to reason about where to locate data and how to organize it.&lt;br /&gt;
&lt;br /&gt;
= Task Parallel Model =&lt;br /&gt;
Task Parallelism is a form of parallelization where multiple instructions are executed either on same data or multiple data. It focuses on distributing execution of processes(threads) across different parallel computing nodes. As a part of workflow, different execution threads communicate with one another as they work to share data.&lt;br /&gt;
&lt;br /&gt;
== Description and Example ==&lt;br /&gt;
If the task to be accomplished is to compute the sum of the results associated with the execution of instruction 'A' and instructions 'B'. The following example illustrates, how task parallelism can be achieved.&lt;br /&gt;
&lt;br /&gt;
The pseudo code below illustrates task parallelism:&lt;br /&gt;
&amp;lt;pre&amp;gt;program:&lt;br /&gt;
do &lt;br /&gt;
...&lt;br /&gt;
&lt;br /&gt;
if CPU=&amp;quot;a&amp;quot; then&lt;br /&gt;
   do task &amp;quot;A&amp;quot;&lt;br /&gt;
else if CPU=&amp;quot;b&amp;quot; then&lt;br /&gt;
   do task &amp;quot;B&amp;quot;&lt;br /&gt;
end if&lt;br /&gt;
&lt;br /&gt;
end program&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
If we write the code as above and launch it on a 2-processor system, then the runtime environment will execute it accordingly.&lt;br /&gt;
In an SPMD system, both CPUs will execute the code. In a parallel environment, both will have access to the same data. The &amp;quot;if&amp;quot; clause differentiates between the CPU's. CPU &amp;quot;a&amp;quot; will read true on the &amp;quot;if&amp;quot; and CPU &amp;quot;b&amp;quot; will read true on the &amp;quot;else if&amp;quot;, thus having their own task. Now, both CPU's execute separate code blocks simultaneously, performing different tasks simultaneously.&lt;br /&gt;
Code executed by CPU &amp;quot;a&amp;quot;:&lt;br /&gt;
program:&lt;br /&gt;
...&lt;br /&gt;
do task &amp;quot;A&amp;quot;&lt;br /&gt;
...&lt;br /&gt;
end program&lt;br /&gt;
Code executed by CPU &amp;quot;b&amp;quot;:&lt;br /&gt;
program:&lt;br /&gt;
...&lt;br /&gt;
do task &amp;quot;B&amp;quot;&lt;br /&gt;
...&lt;br /&gt;
end program&lt;br /&gt;
This concept can now be generalized to any number of processors.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; &lt;br /&gt;
program:&lt;br /&gt;
...&lt;br /&gt;
if CPU=&amp;quot;a&amp;quot; then&lt;br /&gt;
   do task &amp;quot;A&amp;quot;&lt;br /&gt;
else if CPU=&amp;quot;b&amp;quot; then&lt;br /&gt;
   do task &amp;quot;B&amp;quot;&lt;br /&gt;
&lt;br /&gt;
...&lt;br /&gt;
&lt;br /&gt;
if CPU =&amp;quot;n&amp;quot; then&lt;br /&gt;
   do task &amp;quot;N&amp;quot;&lt;br /&gt;
&lt;br /&gt;
end if&lt;br /&gt;
...&lt;br /&gt;
end program&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Data Parallel Model vs Task Parallel Model =&lt;br /&gt;
One important feature of data-parallel programming model or data parallelism (SIMD) is the single control flow: there is only one control processor that directs the activities of all the processing elements. In stark contrast to this is task parallelism (MIMD: Multiple Instruction, Multiple Data): characterized by its multiple control flows, it allows the concurrent execution of multiple instruction streams, each manipulates its own data and services separate functions. Below is a contrast between the data parallelism and task parallelism models from wikipedia: [http://en.wikipedia.org/wiki/SIMD SIMD] and [http://en.wikipedia.org/wiki/MIMD MIMD]. In the following subsections we continue to compare and contrast different features of data-parallel model and task-parallel model to help reader understand the unique characteristics of data-parallel programming model.&lt;br /&gt;
[[Image:Smid.png|frame|center|425px|contrast between data parallelism and task parallelism]]&lt;br /&gt;
&lt;br /&gt;
== Synchronous vs Asynchronous ==&lt;br /&gt;
While the [http://en.wikipedia.org/wiki/Lockstep_(computing) lockstep] imposed by data parallelism on all data streams ensures synchronous computation (all PEs perform their tasks at the exact same pace), every processor in task parallelism performs its task at their own pace, which we call asynchronous computation. Thus, at a certain point of a task parallel program's execution, communication and synchronization primitives are needed to allow different instruction streams to coordinate their efforts, and that is where variable-sharing and message-passing come into play.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Major differences between data parallel and task parallel models can broadly be classified as the following:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot; align=&amp;quot;center&amp;quot;&lt;br /&gt;
|+ '''Comparison between data parallel and task parallel programming models.'''&lt;br /&gt;
|-&lt;br /&gt;
! Aspects&lt;br /&gt;
! Data Parallel&lt;br /&gt;
! Task Parallel&lt;br /&gt;
|-&lt;br /&gt;
| Decomposition&lt;br /&gt;
| Partition data into subsets&lt;br /&gt;
| Partition program into subtasks&lt;br /&gt;
|-&lt;br /&gt;
| Parallel tasks&lt;br /&gt;
| Identical&lt;br /&gt;
| Unique&lt;br /&gt;
|-&lt;br /&gt;
| Degree of parallelism&lt;br /&gt;
| Scales easily&lt;br /&gt;
| Fixed&lt;br /&gt;
|-&lt;br /&gt;
| Load balancing&lt;br /&gt;
| Easier&lt;br /&gt;
| Harder&lt;br /&gt;
|-&lt;br /&gt;
| Communication overhead&lt;br /&gt;
| Lower&lt;br /&gt;
| Higher&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= Definitions =&lt;br /&gt;
&lt;br /&gt;
* ''Data parallel.''  A data parallel algorithm is composed of a set of identical tasks which operate on different subsets of common data.&lt;br /&gt;
* ''Task parallel.''  A task parallel algorithm is composed of a set of differing tasks which operate on common data.&lt;br /&gt;
* ''SIMD (single-instruction-multiple-data).''  A processor which executes a single instruction simultaneously on multiple data locations.&lt;br /&gt;
* '' MIMD (multiple-instruction-multiple-data).'' A processor architecture which can execute multiple instruction across multiple data elements simultaneously.&lt;br /&gt;
&lt;br /&gt;
= References =&lt;br /&gt;
* David E. Culler, Jaswinder Pal Singh, and Anoop Gupta, [http://portal.acm.org/citation.cfm?id=550071 ''Parallel Computer Architecture: A Hardware/Software Approach,''] Morgan-Kauffman, 1999.&lt;br /&gt;
* Ian Foster, [http://www.mcs.anl.gov/~itf/dbpp/ ''Designing and Building Parallel Programs,''] Addison-Wesley, 1995.&lt;br /&gt;
* Magne Haveraaen, [http://portal.acm.org/citation.cfm?id=1239917 &amp;quot;Machine and collection abstractions for user-implemented data-parallel programming,&amp;quot;] ''Scientific Programming,'' 8(4):231-246, 2000.&lt;br /&gt;
* W. Daniel Hillis and Guy L. Steele, Jr., [http://portal.acm.org/citation.cfm?id=7903 &amp;quot;Data parallel algorithms,&amp;quot;] ''Communications of the ACM,'' 29(12):1170-1183, December 1986.&lt;br /&gt;
* Alexander C. Klaiber and Henry M. Levy, [http://portal.acm.org/citation.cfm?id=192020 &amp;quot;A comparison of message passing and shared memory architectures for data parallel programs,&amp;quot;] in ''Proceedings of the 21st Annual International Symposium on Computer Architecture,'' April 1994, pp. 94-105.&lt;br /&gt;
* Yan Solihin, ''Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems,'' Solihin Books, 2008.&lt;/div&gt;</summary>
		<author><name>Vrmanda</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch2_JR&amp;diff=43559</id>
		<title>CSC/ECE 506 Spring 2011/ch2 JR</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch2_JR&amp;diff=43559"/>
		<updated>2011-01-31T02:38:27Z</updated>

		<summary type="html">&lt;p&gt;Vrmanda: /* Data Parallel Model vs Task Parallel Model */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Supplement to Chapter 2: The Data Parallel Programming Model=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= History =&lt;br /&gt;
As computer architectures have evolved, so have parallel programming models. The earliest advancements in parallel computers took advantage of bit-level parallelism.  These computers used vector processing, which required a shared memory programming model.  As performance returns from this architecture diminished, the emphasis was placed on instruction-level parallelism and the message passing model began to dominate.  Most recently, with the move to cluster-based machines, there has been an increased emphasis on thread-level parallelism. This has corresponded to an increase interest in the data parallel programming model.&lt;br /&gt;
&lt;br /&gt;
== Bit-level parallelism in the 1970's ==&lt;br /&gt;
The major performance improvements from computers during this time were due to the ability to execute 32-bit word size operations at one time ([[#References|Culler (1999), p. 15.]]).  The dominant supercomputers of the time, like the Cray and the ILLIAC IV, were mainly Single Instruction Multiple Data architectures and used a shared memory programming model.  They each used different forms of vector processing ([[#References|Culler (1999), p. 21.]]). &lt;br /&gt;
Development of the ILLIAC IV began in 1964 and wasn't finished until 1975 [http://en.wikipedia.org/wiki/ILLIAC_IV].  A central processor was connected to the main memory and delegated tasks to individual PE's, which each had their own memory cache. [http://archive.computerhistory.org/resources/text/Burroughs/Burroughs.ILLIAC%20IV.1974.102624911.pdf].  Each PE could operate either an 8-, 32- or 64-bit operand at a given time [http://archive.computerhistory.org/resources/text/Burroughs/Burroughs.ILLIAC%20IV.1974.102624911.pdf].&lt;br /&gt;
&lt;br /&gt;
The Cray machine was installed at Los Alamos National Laborartory in1976 by Control Data Corporation and had similar performance to the ILLIAC IV [http://en.wikipedia.org/wiki/ILLIAC_IV].  The Cray machine relied heavily on the use of registers instead of individual processors like the ILLIAC IV.  Each processor was connected to main memory and had a number of 64-bit registers used to perform operations [http://www.eecg.toronto.edu/~moshovos/ACA05/read/cray1.pdf].&lt;br /&gt;
&lt;br /&gt;
== Move to instruction-level parallelism in the 1980's ==&lt;br /&gt;
&lt;br /&gt;
Increasing the word size above 32-bits offered diminishing returns in terms of performance ([[#References|Culler (1999), p. 15.]]). In the mid-1980's the emphasis changed from bit-level parallelism to instruction-level parallelism, which involved increasing the number of instructions that could be executed at one time ([[#References|Culler (1999), p. 15.]]).  The message passing model allowed programmers the ability to divide up instructions in order to take advantage of this architecture. &lt;br /&gt;
&lt;br /&gt;
== Thread-level parallelism ==&lt;br /&gt;
The move to cluster-based machines in the past decade, has added another layer of complexity to parallelism.  Since computers could be located across a network from each other, there is more emphasis on software acting as a bridge [http://cobweb.ecn.purdue.edu/~pplinux/ppcluster.html]. This has led to a greater emphasis on thread- or task-level parallelism [http://en.wikipedia.org/wiki/Thread-level_parallelism]  and the addition of the data parallelism programming model to existing message passing or shared memory models [http://en.wikipedia.org/wiki/Thread-level_parallelism].  &lt;br /&gt;
&lt;br /&gt;
= Data Parallel Model =&lt;br /&gt;
One important feature of data-parallel programming model or data parallelism (SIMD) is the single control flow. Flynn's taxonomy classifies SIMD to be analogous to doing the same operation repeatedly over a large data set. There is only one control processor that directs the activities of all the processing elements. In a multiprocessor system executing a single set of instructions (SIMD), data parallelism is achieved when each processor performs the same task on different pieces of distributed data. In some situations, a single execution thread controls operations on all pieces of data. In others, different threads control the operation, but they execute the same code.&lt;br /&gt;
&lt;br /&gt;
== Description and Example ==&lt;br /&gt;
&lt;br /&gt;
This section shows a simple example, adapted from the Solihin textbook (pp. 24-27), that illustrates the data-parallel programming model. Each of the code fragments below is written in a pseudo-code style.&lt;br /&gt;
&lt;br /&gt;
Suppose we want to perform the following task on an array &amp;lt;code&amp;gt;a&amp;lt;/code&amp;gt;: updating each element of &amp;lt;code&amp;gt;a&amp;lt;/code&amp;gt; by the product of itself and its index, and adding together the elements of &amp;lt;code&amp;gt;a&amp;lt;/code&amp;gt; into the variable &amp;lt;code&amp;gt;sum&amp;lt;/code&amp;gt;. The corresponding code is shown below.&lt;br /&gt;
&lt;br /&gt;
 // simple sequential task&lt;br /&gt;
 sum = 0;&lt;br /&gt;
 '''for''' (i = 0; i &amp;lt; a.length; i++)&lt;br /&gt;
 {&lt;br /&gt;
    a[i] = a[i] * i;&lt;br /&gt;
    sum = sum + a[i];&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
When we orchestrate the task using the data-parallel programming model, the program can be divided into two parts. The first part performs the same operations on separate elements of the array for each processing element (sometimes referred to as a PE), and the second part reorganizes data among all processing elements (in our example, data reorganization means summing up the values across the different processing elements). Since the data-parallel programming model only defines the overall effects of parallel steps, the second part can be accomplished either through shared memory or through message passing. The three code fragments below show the first part of the program, a shared-memory version of the second part, and a message-passing version of the second part, respectively.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 // data parallel programming: let each PE perform the same task on different pieces of distributed data&lt;br /&gt;
 pe_id = getid();&lt;br /&gt;
 my_sum = 0;&lt;br /&gt;
 '''for''' (i = pe_id; i &amp;lt; a.length; i += number_of_pe)         //separate elements of the array are assigned to each PE &lt;br /&gt;
 {&lt;br /&gt;
    a[i] = a[i] * i;&lt;br /&gt;
    my_sum = my_sum + a[i];                               //all PEs accumulate elements assigned to them into local variable my_sum&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In the above code, data parallelism is achieved by letting each processing element perform actions on separate elements of the array, which are identified using the PE's id. For instance, if three processing elements are used, one processing element would start at i = 0, one would start at i = 1, and the last would start at i = 2. Since there are three processing elements, the array index used by each will increase by three on each iteration until the task is complete (note that in our example the elements assigned to each PE are interleaved rather than contiguous). If the length of the array is a multiple of three, each processing element takes the same amount of time to execute its portion of the task.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The picture below illustrates how elements of the array are assigned among the different PEs for the specific case where the length of the array is 7 and there are 3 PEs available. Elements in the array are labeled by their indices (0 to 6). As shown in the picture, PE0 works on the elements with indices 0, 3, and 6; PE1 is in charge of the elements with indices 1 and 4; and the elements with indices 2 and 5 are assigned to PE2. In this way, the 3 PEs work collectively on the array, while each PE works on different elements. Thus, data parallelism is achieved.&lt;br /&gt;
&lt;br /&gt;
[[Image:506wiki1.png|frame|center|150px|Illustration of data parallel programming (adapted from [http://computing.llnl.gov/tutorials/parallel_comp/#ModelsData Introduction to Parallel Computing])]]&lt;br /&gt;
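&lt;br /&gt;
The remaining two fragments sketch the second part of the program: reorganizing the per-PE partial sums into the single variable &amp;lt;code&amp;gt;sum&amp;lt;/code&amp;gt;. They are written in the same pseudo-code style and assume a few primitives and helper names that are not defined in this article (&amp;lt;code&amp;gt;lock&amp;lt;/code&amp;gt;/&amp;lt;code&amp;gt;unlock&amp;lt;/code&amp;gt; on a lock variable, a &amp;lt;code&amp;gt;barrier&amp;lt;/code&amp;gt;, and &amp;lt;code&amp;gt;send&amp;lt;/code&amp;gt;/&amp;lt;code&amp;gt;recv&amp;lt;/code&amp;gt;), so they should be read as illustrative sketches rather than as the exact code from the Solihin textbook.&lt;br /&gt;
&lt;br /&gt;
 // data reorganization, shared-memory version: each PE adds its partial sum into the shared variable sum&lt;br /&gt;
 // (sum is assumed to be a shared variable initialized to 0 before this step)&lt;br /&gt;
 lock(sum_lock);                                          //only one PE updates sum at a time&lt;br /&gt;
 sum = sum + my_sum;&lt;br /&gt;
 unlock(sum_lock);&lt;br /&gt;
 barrier;                                                 //wait until every PE has added its my_sum&lt;br /&gt;
&lt;br /&gt;
 // data reorganization, message-passing version: every PE sends its partial sum to PE 0, which accumulates them&lt;br /&gt;
 '''if''' (pe_id != 0)&lt;br /&gt;
    send(my_sum, 0);                                      //non-zero PEs send their partial sum to PE 0&lt;br /&gt;
 '''else'''&lt;br /&gt;
 {&lt;br /&gt;
    sum = my_sum;&lt;br /&gt;
    '''for''' (p = 1; p &amp;lt; number_of_pe; p++)&lt;br /&gt;
    {&lt;br /&gt;
       recv(partial_sum, p);                              //PE 0 receives each partial sum in turn&lt;br /&gt;
       sum = sum + partial_sum;&lt;br /&gt;
    }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Either version produces the same final value of &amp;lt;code&amp;gt;sum&amp;lt;/code&amp;gt;; which one is used depends only on whether the underlying communication abstraction is shared memory or message passing.&lt;br /&gt;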
&lt;br /&gt;
== Comparison with Message Passing and Shared Memory ==&lt;br /&gt;
&lt;br /&gt;
Although the shared memory and message passing models may be combined into hybrid approaches, the two models are fundamentally different ways of addressing the same problem (of access control to common data). In contrast, the data parallel model is concerned with a fundamentally different problem (how to divide work into parallel tasks). As such, the data parallel model may be used in conjunction with either the shared memory or the message passing model without conflict. In fact, Klaiber (1994) compares the performance of a number of data parallel programs implemented with both shared memory and message passing models.&lt;br /&gt;
As discussed in the previous section, one of the major advantages of combining the data parallel and message passing models is a reduction in the amount and complexity of communication required relative to a task parallel approach. Similarly, combining the data parallel and shared memory models tends to simplify and reduce the amount of synchronization required. If the task parallel code given later in this article were modified from a message passing model to a shared memory model, the two threads would require 8 signals to be sent between the threads (instead of 8 messages). In contrast, the data parallel code would require only a single barrier before the local sums are added to compute the full sum.&lt;br /&gt;
Much as the shared memory model can benefit from specialized hardware, the data parallel programming model can as well. SIMD (single-instruction-multiple-data) processors are specifically designed to run data parallel algorithms. These processors perform a single instruction on many different data locations simultaneously. Modern examples include CUDA processors developed by nVidia and Cell processors developed by STI (Sony, Toshiba, and IBM). For the curious, example code for CUDA processors is provided in the Appendix. However, whereas the shared memory model can be a difficult and costly abstraction in the absence of hardware support, the data parallel model—like the message passing model—does not require hardware support.&lt;br /&gt;
Since data parallel code tends to simplify communication and synchronization, it may be easier to develop than a more task parallel approach. However, data parallel code also requires writing code to split the program data into chunks and assign them to different threads. In addition, a problem may not decompose easily into subproblems that rely on largely independent chunks of data. In such cases, it may be impractical or impossible to apply the data parallel model.&lt;br /&gt;
Once written, data parallel programs can scale easily to large numbers of processors. The data parallel model implicitly encourages data locality by having each thread work on a chunk of data. The regular data chunks also make it easier to reason about where to locate data and how to organize it.&lt;br /&gt;
&lt;br /&gt;
= Task Parallel Model =&lt;br /&gt;
Task parallelism is a form of parallelization in which different instruction streams are executed, either on the same data or on different data. It focuses on distributing the execution of processes (threads) across different parallel computing nodes. As part of the workflow, the different execution threads communicate with one another to share data as they work.&lt;br /&gt;
&lt;br /&gt;
== Description and Example ==&lt;br /&gt;
Suppose the task to be accomplished is to compute the sum of the results produced by task 'A' and task 'B'. The following example illustrates how task parallelism can be achieved.&lt;br /&gt;
&lt;br /&gt;
The pseudo code below illustrates task parallelism:&lt;br /&gt;
&amp;lt;pre&amp;gt;program:&lt;br /&gt;
...&lt;br /&gt;
&lt;br /&gt;
if CPU=&amp;quot;a&amp;quot; then&lt;br /&gt;
   do task &amp;quot;A&amp;quot;&lt;br /&gt;
else if CPU=&amp;quot;b&amp;quot; then&lt;br /&gt;
   do task &amp;quot;B&amp;quot;&lt;br /&gt;
end if&lt;br /&gt;
&lt;br /&gt;
end program&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
If we write the code as above and launch it on a 2-processor system, then the runtime environment will execute it accordingly.&lt;br /&gt;
In an SPMD system, both CPUs execute the same code. In a parallel environment, both have access to the same data. The &amp;quot;if&amp;quot; clause differentiates between the CPUs: CPU &amp;quot;a&amp;quot; evaluates the &amp;quot;if&amp;quot; condition as true and CPU &amp;quot;b&amp;quot; evaluates the &amp;quot;else if&amp;quot; condition as true, so each ends up with its own task. Both CPUs then execute separate code blocks simultaneously, performing different tasks.&lt;br /&gt;
Code executed by CPU &amp;quot;a&amp;quot;:&lt;br /&gt;
 program:&lt;br /&gt;
 ...&lt;br /&gt;
 do task &amp;quot;A&amp;quot;&lt;br /&gt;
 ...&lt;br /&gt;
 end program&lt;br /&gt;
&lt;br /&gt;
Code executed by CPU &amp;quot;b&amp;quot;:&lt;br /&gt;
 program:&lt;br /&gt;
 ...&lt;br /&gt;
 do task &amp;quot;B&amp;quot;&lt;br /&gt;
 ...&lt;br /&gt;
 end program&lt;br /&gt;
This concept can now be generalized to any number of processors.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; &lt;br /&gt;
program:&lt;br /&gt;
...&lt;br /&gt;
if CPU=&amp;quot;a&amp;quot; then&lt;br /&gt;
   do task &amp;quot;A&amp;quot;&lt;br /&gt;
else if CPU=&amp;quot;b&amp;quot; then&lt;br /&gt;
   do task &amp;quot;B&amp;quot;&lt;br /&gt;
...&lt;br /&gt;
else if CPU=&amp;quot;n&amp;quot; then&lt;br /&gt;
   do task &amp;quot;N&amp;quot;&lt;br /&gt;
end if&lt;br /&gt;
...&lt;br /&gt;
end program&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
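&lt;br /&gt;
To complete the task stated at the beginning of this section (computing the sum of the results of task 'A' and task 'B'), the partial results computed on the two CPUs still have to be combined. The sketch below shows one possible way to do this with message passing; the &amp;lt;code&amp;gt;send&amp;lt;/code&amp;gt;/&amp;lt;code&amp;gt;receive&amp;lt;/code&amp;gt; primitives and the variable names are illustrative assumptions rather than part of the original example.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
program:&lt;br /&gt;
...&lt;br /&gt;
if CPU=&amp;quot;a&amp;quot; then&lt;br /&gt;
   result_a = do task &amp;quot;A&amp;quot;&lt;br /&gt;
   receive result_b from CPU &amp;quot;b&amp;quot;      // wait for the other CPU's partial result&lt;br /&gt;
   sum = result_a + result_b&lt;br /&gt;
else if CPU=&amp;quot;b&amp;quot; then&lt;br /&gt;
   result_b = do task &amp;quot;B&amp;quot;&lt;br /&gt;
   send result_b to CPU &amp;quot;a&amp;quot;           // hand the partial result to CPU &amp;quot;a&amp;quot; for the final sum&lt;br /&gt;
end if&lt;br /&gt;
...&lt;br /&gt;
end program&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Note that only the combining step requires communication; the two tasks themselves run completely independently, which is the essential property of task parallelism.&lt;br /&gt;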
&lt;br /&gt;
= Data Parallel Model vs Task Parallel Model =&lt;br /&gt;
One important feature of the data-parallel programming model, or data parallelism (SIMD), is its single control flow: there is only one control processor that directs the activities of all the processing elements. In stark contrast to this is task parallelism (MIMD: Multiple Instruction, Multiple Data): characterized by multiple control flows, it allows the concurrent execution of multiple instruction streams, each manipulating its own data and serving separate functions. Below is a contrast between the data parallelism and task parallelism models from Wikipedia: [http://en.wikipedia.org/wiki/SIMD SIMD] and [http://en.wikipedia.org/wiki/MIMD MIMD]. The table that follows compares and contrasts different features of the data-parallel and task-parallel models to help the reader understand the unique characteristics of the data-parallel programming model.&lt;br /&gt;
[[Image:Smid.png|frame|center|425px|contrast between data parallelism and task parallelism]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The major differences between the data parallel and task parallel models can be summarized as follows:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot; align=&amp;quot;center&amp;quot;&lt;br /&gt;
|+ '''Comparison between data parallel and task parallel programming models.'''&lt;br /&gt;
|-&lt;br /&gt;
! Aspects&lt;br /&gt;
! Data Parallel&lt;br /&gt;
! Task Parallel&lt;br /&gt;
|-&lt;br /&gt;
| Decomposition&lt;br /&gt;
| Partition data into subsets&lt;br /&gt;
| Partition program into subtasks&lt;br /&gt;
|-&lt;br /&gt;
| Parallel tasks&lt;br /&gt;
| Identical&lt;br /&gt;
| Unique&lt;br /&gt;
|-&lt;br /&gt;
| Degree of parallelism&lt;br /&gt;
| Scales easily&lt;br /&gt;
| Fixed&lt;br /&gt;
|-&lt;br /&gt;
| Load balancing&lt;br /&gt;
| Easier&lt;br /&gt;
| Harder&lt;br /&gt;
|-&lt;br /&gt;
| Communication overhead&lt;br /&gt;
| Lower&lt;br /&gt;
| Higher&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= Definitions =&lt;br /&gt;
&lt;br /&gt;
* ''Data parallel.''  A data parallel algorithm is composed of a set of identical tasks which operate on different subsets of common data.&lt;br /&gt;
* ''Task parallel.''  A task parallel algorithm is composed of a set of differing tasks which operate on common data.&lt;br /&gt;
* ''SIMD (single-instruction-multiple-data).''  A processor which executes a single instruction simultaneously on multiple data locations.&lt;br /&gt;
* ''MIMD (multiple-instruction-multiple-data).'' A processor architecture which can execute multiple instruction streams across multiple data elements simultaneously.&lt;br /&gt;
&lt;br /&gt;
= References =&lt;br /&gt;
* David E. Culler, Jaswinder Pal Singh, and Anoop Gupta, [http://portal.acm.org/citation.cfm?id=550071 ''Parallel Computer Architecture: A Hardware/Software Approach,''] Morgan Kaufmann, 1999.&lt;br /&gt;
* Ian Foster, [http://www.mcs.anl.gov/~itf/dbpp/ ''Designing and Building Parallel Programs,''] Addison-Wesley, 1995.&lt;br /&gt;
* Magne Haveraaen, [http://portal.acm.org/citation.cfm?id=1239917 &amp;quot;Machine and collection abstractions for user-implemented data-parallel programming,&amp;quot;] ''Scientific Programming,'' 8(4):231-246, 2000.&lt;br /&gt;
* W. Daniel Hillis and Guy L. Steele, Jr., [http://portal.acm.org/citation.cfm?id=7903 &amp;quot;Data parallel algorithms,&amp;quot;] ''Communications of the ACM,'' 29(12):1170-1183, December 1986.&lt;br /&gt;
* Alexander C. Klaiber and Henry M. Levy, [http://portal.acm.org/citation.cfm?id=192020 &amp;quot;A comparison of message passing and shared memory architectures for data parallel programs,&amp;quot;] in ''Proceedings of the 21st Annual International Symposium on Computer Architecture,'' April 1994, pp. 94-105.&lt;br /&gt;
* Yan Solihin, ''Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems,'' Solihin Books, 2008.&lt;/div&gt;</summary>
		<author><name>Vrmanda</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch2_JR&amp;diff=43558</id>
		<title>CSC/ECE 506 Spring 2011/ch2 JR</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch2_JR&amp;diff=43558"/>
		<updated>2011-01-31T02:35:51Z</updated>

		<summary type="html">&lt;p&gt;Vrmanda: /* Comparison with Message Passing and Shared Memory */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Supplement to Chapter 2: The Data Parallel Programming Model=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= History =&lt;br /&gt;
As computer architectures have evolved, so have parallel programming models. The earliest advancements in parallel computers took advantage of bit-level parallelism.  These computers used vector processing, which required a shared memory programming model.  As performance returns from this architecture diminished, the emphasis was placed on instruction-level parallelism and the message passing model began to dominate.  Most recently, with the move to cluster-based machines, there has been an increased emphasis on thread-level parallelism. This has corresponded to an increase interest in the data parallel programming model.&lt;br /&gt;
&lt;br /&gt;
== Bit-level parallelism in the 1970's ==&lt;br /&gt;
The major performance improvements from computers during this time were due to the ability to execute 32-bit word size operations at one time ([[#References|Culler (1999), p. 15.]]).  The dominant supercomputers of the time, like the Cray and the ILLIAC IV, were mainly Single Instruction Multiple Data architectures and used a shared memory programming model.  They each used different forms of vector processing ([[#References|Culler (1999), p. 21.]]). &lt;br /&gt;
Development of the ILLIAC IV began in 1964 and wasn't finished until 1975 [http://en.wikipedia.org/wiki/ILLIAC_IV].  A central processor was connected to the main memory and delegated tasks to individual PE's, which each had their own memory cache. [http://archive.computerhistory.org/resources/text/Burroughs/Burroughs.ILLIAC%20IV.1974.102624911.pdf].  Each PE could operate either an 8-, 32- or 64-bit operand at a given time [http://archive.computerhistory.org/resources/text/Burroughs/Burroughs.ILLIAC%20IV.1974.102624911.pdf].&lt;br /&gt;
&lt;br /&gt;
The Cray machine was installed at Los Alamos National Laborartory in1976 by Control Data Corporation and had similar performance to the ILLIAC IV [http://en.wikipedia.org/wiki/ILLIAC_IV].  The Cray machine relied heavily on the use of registers instead of individual processors like the ILLIAC IV.  Each processor was connected to main memory and had a number of 64-bit registers used to perform operations [http://www.eecg.toronto.edu/~moshovos/ACA05/read/cray1.pdf].&lt;br /&gt;
&lt;br /&gt;
== Move to instruction-level parallelism in the 1980's ==&lt;br /&gt;
&lt;br /&gt;
Increasing the word size above 32-bits offered diminishing returns in terms of performance ([[#References|Culler (1999), p. 15.]]). In the mid-1980's the emphasis changed from bit-level parallelism to instruction-level parallelism, which involved increasing the number of instructions that could be executed at one time ([[#References|Culler (1999), p. 15.]]).  The message passing model allowed programmers the ability to divide up instructions in order to take advantage of this architecture. &lt;br /&gt;
&lt;br /&gt;
== Thread-level parallelism ==&lt;br /&gt;
The move to cluster-based machines in the past decade, has added another layer of complexity to parallelism.  Since computers could be located across a network from each other, there is more emphasis on software acting as a bridge [http://cobweb.ecn.purdue.edu/~pplinux/ppcluster.html]. This has led to a greater emphasis on thread- or task-level parallelism [http://en.wikipedia.org/wiki/Thread-level_parallelism]  and the addition of the data parallelism programming model to existing message passing or shared memory models [http://en.wikipedia.org/wiki/Thread-level_parallelism].  &lt;br /&gt;
&lt;br /&gt;
= Data Parallel Model =&lt;br /&gt;
One important feature of data-parallel programming model or data parallelism (SIMD) is the single control flow. Flynn's taxonomy classifies SIMD to be analogous to doing the same operation repeatedly over a large data set. There is only one control processor that directs the activities of all the processing elements. In a multiprocessor system executing a single set of instructions (SIMD), data parallelism is achieved when each processor performs the same task on different pieces of distributed data. In some situations, a single execution thread controls operations on all pieces of data. In others, different threads control the operation, but they execute the same code.&lt;br /&gt;
&lt;br /&gt;
== Description and Example ==&lt;br /&gt;
&lt;br /&gt;
This section shows a simple example adapted from Solihin textbook (pp. 24 - 27) that illustrates  the data-parallel programming model. Each of the codes below are written in pseudo-code style.&lt;br /&gt;
&lt;br /&gt;
Suppose we want to perform the following task on an array &amp;lt;code&amp;gt;a&amp;lt;/code&amp;gt;: updating each element of &amp;lt;code&amp;gt;a&amp;lt;/code&amp;gt; by the product of itself and its index, and adding together the elements of &amp;lt;code&amp;gt;a&amp;lt;/code&amp;gt; into the variable &amp;lt;code&amp;gt;sum&amp;lt;/code&amp;gt;. The corresponding code is shown below.&lt;br /&gt;
&lt;br /&gt;
 // simple sequential task&lt;br /&gt;
 sum = 0;&lt;br /&gt;
 '''for''' (i = 0; i &amp;lt; a.length; i++)&lt;br /&gt;
 {&lt;br /&gt;
    a[i] = a[i] * i;&lt;br /&gt;
    sum = sum + a[i];&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
When we orchestrate the task using the data-parallel programming model, the program can be divided into two parts. The first part performs the same operations on separate elements of the array for each processing element (sometimes referred to as PE or pe), and the second part reorganizes data among all processing elements (In our example data reorganization is summing up values across different processing elements). Since data-parallel programming model only defines the overall effects of parallel steps, the second part can be accomplished either through shared memory or message passing. The three code fragments below are examples for the first part of the program, shared-memory version of the second part, and message passing for the second part, respectively.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 // data parallel programming: let each PE perform the same task on different pieces of distributed data&lt;br /&gt;
 pe_id = getid();&lt;br /&gt;
 my_sum = 0;&lt;br /&gt;
 '''for''' (i = pe_id; i &amp;lt; a.length; i += number_of_pe)         //separate elements of the array are assigned to each PE &lt;br /&gt;
 {&lt;br /&gt;
    a[i] = a[i] * i;&lt;br /&gt;
    my_sum = my_sum + a[i];                               //all PEs accumulate elements assigned to them into local variable my_sum&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In the above code, data parallelism is achieved by letting each processing element perform actions on array's separate elements, which are identified using the PE's id. For instance, if three processing elements are used then one processing element would start at i = 0, one would start at i = 1, and the last would start at i = 2. Since there are three processing elements then the index of the array for each will increase by three on each iteration until the task is complete (note that in our example elements assigned to each PE are interleaved instead of continuous). If the length of the array is a multiple of three then each processing element takes the same amount of time to execute its portion of the task.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The picture below illustrates how elements of the array are assigned among different PEs for the specific case: length of the array is 7 and there are 3 PEs available. Elements in the array are marked by their indexes (0 to 6). As shown in the picture, PE0 will work on elements with index 0, 3, 6; PE1 is in charge of elements with index 1, 4; and elements with index 2, 5 are assigned to PE2. In this way, these 3 PEs work collectively on the array, while each PE works on different elements. Thus, data parallelism is achieved.&lt;br /&gt;
&lt;br /&gt;
[[Image:506wiki1.png|frame|center|150px|Illustration of data parallel programming(adapted from [http://computing.llnl.gov/tutorials/parallel_comp/#ModelsData Introduction to Parallel Computing])]]&lt;br /&gt;
&lt;br /&gt;
== Comparison with Message Passing and Shared Memory ==&lt;br /&gt;
&lt;br /&gt;
Although the shared memory and message passing models may be combined into hybrid approaches, the two models are fundamentally different ways of addressing the same problem (of access control to common data). In contrast, the data parallel model is concerned with a fundamentally different problem (how to divide work into parallel tasks). As such, the data parallel model may be used in conjunction with either the shared memory or the message passing model without conflict. In fact, Klaiber (1994) compares the performance of a number of data parallel programs implemented with both shared memory and message passing models.&lt;br /&gt;
As discussed in the previous section, one of the major advantages of combining the data parallel and message passing models is a reduction in the amount and complexity of communication required relative to a task parallel approach. Similarly, combining the data parallel and shared memory models tends to simplify and reduce the amount of synchronization required. If the task parallel code given above were modified from a message passing model to a shared memory model, the two threads would require 8 signals be sent between the threads (instead of 8 messages). In contrast, the data parallel code would require a single barrier before the local sums are added to compute the full sum.&lt;br /&gt;
Much as the shared memory model can benefit from specialized hardware, the data parallel programming model can as well. SIMD (single-instruction-multiple-data) processors are specifically designed to run data parallel algorithms. These processors perform a single instruction on many different data locations simultaneously. Modern examples include CUDA processors developed by nVidia and Cell processors developed by STI (Sony, Toshiba, and IBM). For the curious, example code for CUDA processors is provided in the Appendix. However, whereas the shared memory model can be a difficult and costly abstraction in the absence of hardware support, the data parallel model—like the message passing model—does not require hardware support.&lt;br /&gt;
Since data parallel code tends to simplify communication and synchronization, data parallel code may be easier to develop than a more task parallel approach. However, data parallel code also requires writing code to split program data into chunks and assign it to different threads. In addition, it is possible that a problem may not decompose easily into subproblems relying on largely independent chunks of data. In this case, it may be impractical or impossible to apply the data parallel model.&lt;br /&gt;
Once written, data parallel programs can scale easily to large numbers of processors. The data parallel model implicitly encourages data locality by having each thread work on a chunk of data. The regular data chunks also make it easier to reason about where to locate data and how to organize it.&lt;br /&gt;
&lt;br /&gt;
= Task Parallel Model =&lt;br /&gt;
Task Parallelism is a form of parallelization where multiple instructions are executed either on same data or multiple data. It focuses on distributing execution of processes(threads) across different parallel computing nodes. As a part of workflow, different execution threads communicate with one another as they work to share data.&lt;br /&gt;
&lt;br /&gt;
== Description and Example ==&lt;br /&gt;
If the task to be accomplished is to compute the sum of the results associated with the execution of instruction 'A' and instructions 'B'. The following example illustrates, how task parallelism can be achieved.&lt;br /&gt;
&lt;br /&gt;
The pseudo code below illustrates task parallelism:&lt;br /&gt;
&amp;lt;pre&amp;gt;program:&lt;br /&gt;
do &lt;br /&gt;
...&lt;br /&gt;
&lt;br /&gt;
if CPU=&amp;quot;a&amp;quot; then&lt;br /&gt;
   do task &amp;quot;A&amp;quot;&lt;br /&gt;
else if CPU=&amp;quot;b&amp;quot; then&lt;br /&gt;
   do task &amp;quot;B&amp;quot;&lt;br /&gt;
end if&lt;br /&gt;
&lt;br /&gt;
end program&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
If we write the code as above and launch it on a 2-processor system, then the runtime environment will execute it accordingly.&lt;br /&gt;
In an SPMD system, both CPUs will execute the code. In a parallel environment, both will have access to the same data. The &amp;quot;if&amp;quot; clause differentiates between the CPU's. CPU &amp;quot;a&amp;quot; will read true on the &amp;quot;if&amp;quot; and CPU &amp;quot;b&amp;quot; will read true on the &amp;quot;else if&amp;quot;, thus having their own task. Now, both CPU's execute separate code blocks simultaneously, performing different tasks simultaneously.&lt;br /&gt;
Code executed by CPU &amp;quot;a&amp;quot;:&lt;br /&gt;
program:&lt;br /&gt;
...&lt;br /&gt;
do task &amp;quot;A&amp;quot;&lt;br /&gt;
...&lt;br /&gt;
end program&lt;br /&gt;
Code executed by CPU &amp;quot;b&amp;quot;:&lt;br /&gt;
program:&lt;br /&gt;
...&lt;br /&gt;
do task &amp;quot;B&amp;quot;&lt;br /&gt;
...&lt;br /&gt;
end program&lt;br /&gt;
This concept can now be generalized to any number of processors.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; &lt;br /&gt;
program:&lt;br /&gt;
...&lt;br /&gt;
if CPU=&amp;quot;a&amp;quot; then&lt;br /&gt;
   do task &amp;quot;A&amp;quot;&lt;br /&gt;
else if CPU=&amp;quot;b&amp;quot; then&lt;br /&gt;
   do task &amp;quot;B&amp;quot;&lt;br /&gt;
&lt;br /&gt;
...&lt;br /&gt;
&lt;br /&gt;
if CPU =&amp;quot;n&amp;quot; then&lt;br /&gt;
   do task &amp;quot;N&amp;quot;&lt;br /&gt;
&lt;br /&gt;
end if&lt;br /&gt;
...&lt;br /&gt;
end program&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Data Parallel Model vs Task Parallel Model =&lt;br /&gt;
One important feature of data-parallel programming model or data parallelism (SIMD) is the single control flow: there is only one control processor that directs the activities of all the processing elements. In stark contrast to this is task parallelism (MIMD: Multiple Instruction, Multiple Data): characterized by its multiple control flows, it allows the concurrent execution of multiple instruction streams, each manipulates its own data and services separate functions. Below is a contrast between the data parallelism and task parallelism models from wikipedia: [http://en.wikipedia.org/wiki/SIMD SIMD] and [http://en.wikipedia.org/wiki/MIMD MIMD]. In the following subsections we continue to compare and contrast different features of data-parallel model and task-parallel model to help reader understand the unique characteristics of data-parallel programming model.&lt;br /&gt;
[[Image:Smid.png|frame|center|425px|contrast between data parallelism and task parallelism]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot; align=&amp;quot;center&amp;quot;&lt;br /&gt;
|+ '''Comparison between data parallel and task parallel programming models.'''&lt;br /&gt;
|-&lt;br /&gt;
! Aspects&lt;br /&gt;
! Data Parallel&lt;br /&gt;
! Task Parallel&lt;br /&gt;
|-&lt;br /&gt;
| Decomposition&lt;br /&gt;
| Partition data into subsets&lt;br /&gt;
| Partition program into subtasks&lt;br /&gt;
|-&lt;br /&gt;
| Parallel tasks&lt;br /&gt;
| Identical&lt;br /&gt;
| Unique&lt;br /&gt;
|-&lt;br /&gt;
| Degree of parallelism&lt;br /&gt;
| Scales easily&lt;br /&gt;
| Fixed&lt;br /&gt;
|-&lt;br /&gt;
| Load balancing&lt;br /&gt;
| Easier&lt;br /&gt;
| Harder&lt;br /&gt;
|-&lt;br /&gt;
| Communication overhead&lt;br /&gt;
| Lower&lt;br /&gt;
| Higher&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= Definitions =&lt;br /&gt;
&lt;br /&gt;
* ''Data parallel.''  A data parallel algorithm is composed of a set of identical tasks which operate on different subsets of common data.&lt;br /&gt;
* ''Task parallel.''  A task parallel algorithm is composed of a set of differing tasks which operate on common data.&lt;br /&gt;
* ''SIMD (single-instruction-multiple-data).''  A processor which executes a single instruction simultaneously on multiple data locations.&lt;br /&gt;
* '' MIMD (multiple-instruction-multiple-data).'' A processor architecture which can execute multiple instruction across multiple data elements simultaneously.&lt;br /&gt;
&lt;br /&gt;
= References =&lt;br /&gt;
* David E. Culler, Jaswinder Pal Singh, and Anoop Gupta, [http://portal.acm.org/citation.cfm?id=550071 ''Parallel Computer Architecture: A Hardware/Software Approach,''] Morgan-Kauffman, 1999.&lt;br /&gt;
* Ian Foster, [http://www.mcs.anl.gov/~itf/dbpp/ ''Designing and Building Parallel Programs,''] Addison-Wesley, 1995.&lt;br /&gt;
* Magne Haveraaen, [http://portal.acm.org/citation.cfm?id=1239917 &amp;quot;Machine and collection abstractions for user-implemented data-parallel programming,&amp;quot;] ''Scientific Programming,'' 8(4):231-246, 2000.&lt;br /&gt;
* W. Daniel Hillis and Guy L. Steele, Jr., [http://portal.acm.org/citation.cfm?id=7903 &amp;quot;Data parallel algorithms,&amp;quot;] ''Communications of the ACM,'' 29(12):1170-1183, December 1986.&lt;br /&gt;
* Alexander C. Klaiber and Henry M. Levy, [http://portal.acm.org/citation.cfm?id=192020 &amp;quot;A comparison of message passing and shared memory architectures for data parallel programs,&amp;quot;] in ''Proceedings of the 21st Annual International Symposium on Computer Architecture,'' April 1994, pp. 94-105.&lt;br /&gt;
* Yan Solihin, ''Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems,'' Solihin Books, 2008.&lt;/div&gt;</summary>
		<author><name>Vrmanda</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch2_JR&amp;diff=43557</id>
		<title>CSC/ECE 506 Spring 2011/ch2 JR</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch2_JR&amp;diff=43557"/>
		<updated>2011-01-31T02:34:44Z</updated>

		<summary type="html">&lt;p&gt;Vrmanda: /* Definitions */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Supplement to Chapter 2: The Data Parallel Programming Model=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= History =&lt;br /&gt;
As computer architectures have evolved, so have parallel programming models. The earliest advancements in parallel computers took advantage of bit-level parallelism.  These computers used vector processing, which required a shared memory programming model.  As performance returns from this architecture diminished, the emphasis was placed on instruction-level parallelism and the message passing model began to dominate.  Most recently, with the move to cluster-based machines, there has been an increased emphasis on thread-level parallelism. This has corresponded to an increase interest in the data parallel programming model.&lt;br /&gt;
&lt;br /&gt;
== Bit-level parallelism in the 1970's ==&lt;br /&gt;
The major performance improvements from computers during this time were due to the ability to execute 32-bit word size operations at one time ([[#References|Culler (1999), p. 15.]]).  The dominant supercomputers of the time, like the Cray and the ILLIAC IV, were mainly Single Instruction Multiple Data architectures and used a shared memory programming model.  They each used different forms of vector processing ([[#References|Culler (1999), p. 21.]]). &lt;br /&gt;
Development of the ILLIAC IV began in 1964 and wasn't finished until 1975 [http://en.wikipedia.org/wiki/ILLIAC_IV].  A central processor was connected to the main memory and delegated tasks to individual PE's, which each had their own memory cache. [http://archive.computerhistory.org/resources/text/Burroughs/Burroughs.ILLIAC%20IV.1974.102624911.pdf].  Each PE could operate either an 8-, 32- or 64-bit operand at a given time [http://archive.computerhistory.org/resources/text/Burroughs/Burroughs.ILLIAC%20IV.1974.102624911.pdf].&lt;br /&gt;
&lt;br /&gt;
The Cray machine was installed at Los Alamos National Laborartory in1976 by Control Data Corporation and had similar performance to the ILLIAC IV [http://en.wikipedia.org/wiki/ILLIAC_IV].  The Cray machine relied heavily on the use of registers instead of individual processors like the ILLIAC IV.  Each processor was connected to main memory and had a number of 64-bit registers used to perform operations [http://www.eecg.toronto.edu/~moshovos/ACA05/read/cray1.pdf].&lt;br /&gt;
&lt;br /&gt;
== Move to instruction-level parallelism in the 1980's ==&lt;br /&gt;
&lt;br /&gt;
Increasing the word size above 32-bits offered diminishing returns in terms of performance ([[#References|Culler (1999), p. 15.]]). In the mid-1980's the emphasis changed from bit-level parallelism to instruction-level parallelism, which involved increasing the number of instructions that could be executed at one time ([[#References|Culler (1999), p. 15.]]).  The message passing model allowed programmers the ability to divide up instructions in order to take advantage of this architecture. &lt;br /&gt;
&lt;br /&gt;
== Thread-level parallelism ==&lt;br /&gt;
The move to cluster-based machines in the past decade, has added another layer of complexity to parallelism.  Since computers could be located across a network from each other, there is more emphasis on software acting as a bridge [http://cobweb.ecn.purdue.edu/~pplinux/ppcluster.html]. This has led to a greater emphasis on thread- or task-level parallelism [http://en.wikipedia.org/wiki/Thread-level_parallelism]  and the addition of the data parallelism programming model to existing message passing or shared memory models [http://en.wikipedia.org/wiki/Thread-level_parallelism].  &lt;br /&gt;
&lt;br /&gt;
= Data Parallel Model =&lt;br /&gt;
One important feature of data-parallel programming model or data parallelism (SIMD) is the single control flow. Flynn's taxonomy classifies SIMD to be analogous to doing the same operation repeatedly over a large data set. There is only one control processor that directs the activities of all the processing elements. In a multiprocessor system executing a single set of instructions (SIMD), data parallelism is achieved when each processor performs the same task on different pieces of distributed data. In some situations, a single execution thread controls operations on all pieces of data. In others, different threads control the operation, but they execute the same code.&lt;br /&gt;
&lt;br /&gt;
== Description and Example ==&lt;br /&gt;
&lt;br /&gt;
This section shows a simple example adapted from Solihin textbook (pp. 24 - 27) that illustrates  the data-parallel programming model. Each of the codes below are written in pseudo-code style.&lt;br /&gt;
&lt;br /&gt;
Suppose we want to perform the following task on an array &amp;lt;code&amp;gt;a&amp;lt;/code&amp;gt;: updating each element of &amp;lt;code&amp;gt;a&amp;lt;/code&amp;gt; by the product of itself and its index, and adding together the elements of &amp;lt;code&amp;gt;a&amp;lt;/code&amp;gt; into the variable &amp;lt;code&amp;gt;sum&amp;lt;/code&amp;gt;. The corresponding code is shown below.&lt;br /&gt;
&lt;br /&gt;
 // simple sequential task&lt;br /&gt;
 sum = 0;&lt;br /&gt;
 '''for''' (i = 0; i &amp;lt; a.length; i++)&lt;br /&gt;
 {&lt;br /&gt;
    a[i] = a[i] * i;&lt;br /&gt;
    sum = sum + a[i];&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
When we orchestrate the task using the data-parallel programming model, the program can be divided into two parts. The first part performs the same operations on separate elements of the array for each processing element (sometimes referred to as PE or pe), and the second part reorganizes data among all processing elements (In our example data reorganization is summing up values across different processing elements). Since data-parallel programming model only defines the overall effects of parallel steps, the second part can be accomplished either through shared memory or message passing. The three code fragments below are examples for the first part of the program, shared-memory version of the second part, and message passing for the second part, respectively.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 // data parallel programming: let each PE perform the same task on different pieces of distributed data&lt;br /&gt;
 pe_id = getid();&lt;br /&gt;
 my_sum = 0;&lt;br /&gt;
 '''for''' (i = pe_id; i &amp;lt; a.length; i += number_of_pe)         //separate elements of the array are assigned to each PE &lt;br /&gt;
 {&lt;br /&gt;
    a[i] = a[i] * i;&lt;br /&gt;
    my_sum = my_sum + a[i];                               //all PEs accumulate elements assigned to them into local variable my_sum&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In the above code, data parallelism is achieved by letting each processing element perform actions on array's separate elements, which are identified using the PE's id. For instance, if three processing elements are used then one processing element would start at i = 0, one would start at i = 1, and the last would start at i = 2. Since there are three processing elements then the index of the array for each will increase by three on each iteration until the task is complete (note that in our example elements assigned to each PE are interleaved instead of continuous). If the length of the array is a multiple of three then each processing element takes the same amount of time to execute its portion of the task.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The picture below illustrates how elements of the array are assigned among different PEs for the specific case: length of the array is 7 and there are 3 PEs available. Elements in the array are marked by their indexes (0 to 6). As shown in the picture, PE0 will work on elements with index 0, 3, 6; PE1 is in charge of elements with index 1, 4; and elements with index 2, 5 are assigned to PE2. In this way, these 3 PEs work collectively on the array, while each PE works on different elements. Thus, data parallelism is achieved.&lt;br /&gt;
&lt;br /&gt;
[[Image:506wiki1.png|frame|center|150px|Illustration of data parallel programming(adapted from [http://computing.llnl.gov/tutorials/parallel_comp/#ModelsData Introduction to Parallel Computing])]]&lt;br /&gt;
&lt;br /&gt;
== Comparison with Message Passing and Shared Memory ==&lt;br /&gt;
&lt;br /&gt;
All the differences that exist between a data parallel programming paradigm and task &lt;br /&gt;
&lt;br /&gt;
Although the shared memory and message passing models may be combined into hybrid approaches, the two models are fundamentally different ways of addressing the same problem (of access control to common data). In contrast, the data parallel model is concerned with a fundamentally different problem (how to divide work into parallel tasks). As such, the data parallel model may be used in conjunction with either the shared memory or the message passing model without conflict. In fact, Klaiber (1994) compares the performance of a number of data parallel programs implemented with both shared memory and message passing models.&lt;br /&gt;
As discussed in the previous section, one of the major advantages of combining the data parallel and message passing models is a reduction in the amount and complexity of communication required relative to a task parallel approach. Similarly, combining the data parallel and shared memory models tends to simplify and reduce the amount of synchronization required. If the task parallel code given above were modified from a message passing model to a shared memory model, the two threads would require 8 signals be sent between the threads (instead of 8 messages). In contrast, the data parallel code would require a single barrier before the local sums are added to compute the full sum.&lt;br /&gt;
Much as the shared memory model can benefit from specialized hardware, the data parallel programming model can as well. SIMD (single-instruction-multiple-data) processors are specifically designed to run data parallel algorithms. These processors perform a single instruction on many different data locations simultaneously. Modern examples include CUDA processors developed by nVidia and Cell processors developed by STI (Sony, Toshiba, and IBM). For the curious, example code for CUDA processors is provided in the Appendix. However, whereas the shared memory model can be a difficult and costly abstraction in the absence of hardware support, the data parallel model—like the message passing model—does not require hardware support.&lt;br /&gt;
Since data parallel code tends to simplify communication and synchronization, data parallel code may be easier to develop than a more task parallel approach. However, data parallel code also requires writing code to split program data into chunks and assign it to different threads. In addition, it is possible that a problem may not decompose easily into subproblems relying on largely independent chunks of data. In this case, it may be impractical or impossible to apply the data parallel model.&lt;br /&gt;
Once written, data parallel programs can scale easily to large numbers of processors. The data parallel model implicitly encourages data locality by having each thread work on a chunk of data. The regular data chunks also make it easier to reason about where to locate data and how to organize it.&lt;br /&gt;
&lt;br /&gt;
= Task Parallel Model =&lt;br /&gt;
Task Parallelism is a form of parallelization where multiple instructions are executed either on same data or multiple data. It focuses on distributing execution of processes(threads) across different parallel computing nodes. As a part of workflow, different execution threads communicate with one another as they work to share data.&lt;br /&gt;
&lt;br /&gt;
== Description and Example ==&lt;br /&gt;
If the task to be accomplished is to compute the sum of the results associated with the execution of instruction 'A' and instructions 'B'. The following example illustrates, how task parallelism can be achieved.&lt;br /&gt;
&lt;br /&gt;
The pseudo code below illustrates task parallelism:&lt;br /&gt;
&amp;lt;pre&amp;gt;program:&lt;br /&gt;
do &lt;br /&gt;
...&lt;br /&gt;
&lt;br /&gt;
if CPU=&amp;quot;a&amp;quot; then&lt;br /&gt;
   do task &amp;quot;A&amp;quot;&lt;br /&gt;
else if CPU=&amp;quot;b&amp;quot; then&lt;br /&gt;
   do task &amp;quot;B&amp;quot;&lt;br /&gt;
end if&lt;br /&gt;
&lt;br /&gt;
end program&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
If we write the code as above and launch it on a 2-processor system, then the runtime environment will execute it accordingly.&lt;br /&gt;
In an SPMD system, both CPUs will execute the code. In a parallel environment, both will have access to the same data. The &amp;quot;if&amp;quot; clause differentiates between the CPU's. CPU &amp;quot;a&amp;quot; will read true on the &amp;quot;if&amp;quot; and CPU &amp;quot;b&amp;quot; will read true on the &amp;quot;else if&amp;quot;, thus having their own task. Now, both CPU's execute separate code blocks simultaneously, performing different tasks simultaneously.&lt;br /&gt;
Code executed by CPU &amp;quot;a&amp;quot;:&lt;br /&gt;
program:&lt;br /&gt;
...&lt;br /&gt;
do task &amp;quot;A&amp;quot;&lt;br /&gt;
...&lt;br /&gt;
end program&lt;br /&gt;
Code executed by CPU &amp;quot;b&amp;quot;:&lt;br /&gt;
program:&lt;br /&gt;
...&lt;br /&gt;
do task &amp;quot;B&amp;quot;&lt;br /&gt;
...&lt;br /&gt;
end program&lt;br /&gt;
This concept can now be generalized to any number of processors.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; &lt;br /&gt;
program:&lt;br /&gt;
...&lt;br /&gt;
if CPU=&amp;quot;a&amp;quot; then&lt;br /&gt;
   do task &amp;quot;A&amp;quot;&lt;br /&gt;
else if CPU=&amp;quot;b&amp;quot; then&lt;br /&gt;
   do task &amp;quot;B&amp;quot;&lt;br /&gt;
&lt;br /&gt;
...&lt;br /&gt;
&lt;br /&gt;
if CPU =&amp;quot;n&amp;quot; then&lt;br /&gt;
   do task &amp;quot;N&amp;quot;&lt;br /&gt;
&lt;br /&gt;
end if&lt;br /&gt;
...&lt;br /&gt;
end program&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Data Parallel Model vs Task Parallel Model =&lt;br /&gt;
One important feature of data-parallel programming model or data parallelism (SIMD) is the single control flow: there is only one control processor that directs the activities of all the processing elements. In stark contrast to this is task parallelism (MIMD: Multiple Instruction, Multiple Data): characterized by its multiple control flows, it allows the concurrent execution of multiple instruction streams, each manipulates its own data and services separate functions. Below is a contrast between the data parallelism and task parallelism models from wikipedia: [http://en.wikipedia.org/wiki/SIMD SIMD] and [http://en.wikipedia.org/wiki/MIMD MIMD]. In the following subsections we continue to compare and contrast different features of data-parallel model and task-parallel model to help reader understand the unique characteristics of data-parallel programming model.&lt;br /&gt;
[[Image:Smid.png|frame|center|425px|contrast between data parallelism and task parallelism]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot; align=&amp;quot;center&amp;quot;&lt;br /&gt;
|+ '''Comparison between data parallel and task parallel programming models.'''&lt;br /&gt;
|-&lt;br /&gt;
! Aspects&lt;br /&gt;
! Data Parallel&lt;br /&gt;
! Task Parallel&lt;br /&gt;
|-&lt;br /&gt;
| Decomposition&lt;br /&gt;
| Partition data into subsets&lt;br /&gt;
| Partition program into subtasks&lt;br /&gt;
|-&lt;br /&gt;
| Parallel tasks&lt;br /&gt;
| Identical&lt;br /&gt;
| Unique&lt;br /&gt;
|-&lt;br /&gt;
| Degree of parallelism&lt;br /&gt;
| Scales easily&lt;br /&gt;
| Fixed&lt;br /&gt;
|-&lt;br /&gt;
| Load balancing&lt;br /&gt;
| Easier&lt;br /&gt;
| Harder&lt;br /&gt;
|-&lt;br /&gt;
| Communication overhead&lt;br /&gt;
| Lower&lt;br /&gt;
| Higher&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= Definitions =&lt;br /&gt;
&lt;br /&gt;
* ''Data parallel.''  A data parallel algorithm is composed of a set of identical tasks which operate on different subsets of common data.&lt;br /&gt;
* ''Task parallel.''  A task parallel algorithm is composed of a set of differing tasks which operate on common data.&lt;br /&gt;
* ''SIMD (single-instruction-multiple-data).''  A processor which executes a single instruction simultaneously on multiple data locations.&lt;br /&gt;
* '' MIMD (multiple-instruction-multiple-data).'' A processor architecture which can execute multiple instruction across multiple data elements simultaneously.&lt;br /&gt;
&lt;br /&gt;
= References =&lt;br /&gt;
* David E. Culler, Jaswinder Pal Singh, and Anoop Gupta, [http://portal.acm.org/citation.cfm?id=550071 ''Parallel Computer Architecture: A Hardware/Software Approach,''] Morgan-Kauffman, 1999.&lt;br /&gt;
* Ian Foster, [http://www.mcs.anl.gov/~itf/dbpp/ ''Designing and Building Parallel Programs,''] Addison-Wesley, 1995.&lt;br /&gt;
* Magne Haveraaen, [http://portal.acm.org/citation.cfm?id=1239917 &amp;quot;Machine and collection abstractions for user-implemented data-parallel programming,&amp;quot;] ''Scientific Programming,'' 8(4):231-246, 2000.&lt;br /&gt;
* W. Daniel Hillis and Guy L. Steele, Jr., [http://portal.acm.org/citation.cfm?id=7903 &amp;quot;Data parallel algorithms,&amp;quot;] ''Communications of the ACM,'' 29(12):1170-1183, December 1986.&lt;br /&gt;
* Alexander C. Klaiber and Henry M. Levy, [http://portal.acm.org/citation.cfm?id=192020 &amp;quot;A comparison of message passing and shared memory architectures for data parallel programs,&amp;quot;] in ''Proceedings of the 21st Annual International Symposium on Computer Architecture,'' April 1994, pp. 94-105.&lt;br /&gt;
* Yan Solihin, ''Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems,'' Solihin Books, 2008.&lt;/div&gt;</summary>
		<author><name>Vrmanda</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch2_JR&amp;diff=43556</id>
		<title>CSC/ECE 506 Spring 2011/ch2 JR</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch2_JR&amp;diff=43556"/>
		<updated>2011-01-31T02:33:02Z</updated>

		<summary type="html">&lt;p&gt;Vrmanda: /* Data Parallel Model vs Task Parallel Model */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Supplement to Chapter 2: The Data Parallel Programming Model=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= History =&lt;br /&gt;
As computer architectures have evolved, so have parallel programming models. The earliest advancements in parallel computers took advantage of bit-level parallelism.  These computers used vector processing, which required a shared memory programming model.  As performance returns from this architecture diminished, the emphasis was placed on instruction-level parallelism and the message passing model began to dominate.  Most recently, with the move to cluster-based machines, there has been an increased emphasis on thread-level parallelism. This has corresponded to an increase interest in the data parallel programming model.&lt;br /&gt;
&lt;br /&gt;
== Bit-level parallelism in the 1970's ==&lt;br /&gt;
The major performance improvements from computers during this time were due to the ability to execute 32-bit word size operations at one time ([[#References|Culler (1999), p. 15.]]).  The dominant supercomputers of the time, like the Cray and the ILLIAC IV, were mainly Single Instruction Multiple Data architectures and used a shared memory programming model.  They each used different forms of vector processing ([[#References|Culler (1999), p. 21.]]). &lt;br /&gt;
Development of the ILLIAC IV began in 1964 and wasn't finished until 1975 [http://en.wikipedia.org/wiki/ILLIAC_IV].  A central processor was connected to the main memory and delegated tasks to individual PE's, which each had their own memory cache. [http://archive.computerhistory.org/resources/text/Burroughs/Burroughs.ILLIAC%20IV.1974.102624911.pdf].  Each PE could operate either an 8-, 32- or 64-bit operand at a given time [http://archive.computerhistory.org/resources/text/Burroughs/Burroughs.ILLIAC%20IV.1974.102624911.pdf].&lt;br /&gt;
&lt;br /&gt;
The Cray machine was installed at Los Alamos National Laborartory in1976 by Control Data Corporation and had similar performance to the ILLIAC IV [http://en.wikipedia.org/wiki/ILLIAC_IV].  The Cray machine relied heavily on the use of registers instead of individual processors like the ILLIAC IV.  Each processor was connected to main memory and had a number of 64-bit registers used to perform operations [http://www.eecg.toronto.edu/~moshovos/ACA05/read/cray1.pdf].&lt;br /&gt;
&lt;br /&gt;
== Move to instruction-level parallelism in the 1980's ==&lt;br /&gt;
&lt;br /&gt;
Increasing the word size above 32-bits offered diminishing returns in terms of performance ([[#References|Culler (1999), p. 15.]]). In the mid-1980's the emphasis changed from bit-level parallelism to instruction-level parallelism, which involved increasing the number of instructions that could be executed at one time ([[#References|Culler (1999), p. 15.]]).  The message passing model allowed programmers the ability to divide up instructions in order to take advantage of this architecture. &lt;br /&gt;
&lt;br /&gt;
== Thread-level parallelism ==&lt;br /&gt;
The move to cluster-based machines in the past decade, has added another layer of complexity to parallelism.  Since computers could be located across a network from each other, there is more emphasis on software acting as a bridge [http://cobweb.ecn.purdue.edu/~pplinux/ppcluster.html]. This has led to a greater emphasis on thread- or task-level parallelism [http://en.wikipedia.org/wiki/Thread-level_parallelism]  and the addition of the data parallelism programming model to existing message passing or shared memory models [http://en.wikipedia.org/wiki/Thread-level_parallelism].  &lt;br /&gt;
&lt;br /&gt;
= Data Parallel Model =&lt;br /&gt;
One important feature of data-parallel programming model or data parallelism (SIMD) is the single control flow. Flynn's taxonomy classifies SIMD to be analogous to doing the same operation repeatedly over a large data set. There is only one control processor that directs the activities of all the processing elements. In a multiprocessor system executing a single set of instructions (SIMD), data parallelism is achieved when each processor performs the same task on different pieces of distributed data. In some situations, a single execution thread controls operations on all pieces of data. In others, different threads control the operation, but they execute the same code.&lt;br /&gt;
&lt;br /&gt;
== Description and Example ==&lt;br /&gt;
&lt;br /&gt;
This section shows a simple example adapted from Solihin textbook (pp. 24 - 27) that illustrates  the data-parallel programming model. Each of the codes below are written in pseudo-code style.&lt;br /&gt;
&lt;br /&gt;
Suppose we want to perform the following task on an array &amp;lt;code&amp;gt;a&amp;lt;/code&amp;gt;: updating each element of &amp;lt;code&amp;gt;a&amp;lt;/code&amp;gt; by the product of itself and its index, and adding together the elements of &amp;lt;code&amp;gt;a&amp;lt;/code&amp;gt; into the variable &amp;lt;code&amp;gt;sum&amp;lt;/code&amp;gt;. The corresponding code is shown below.&lt;br /&gt;
&lt;br /&gt;
 // simple sequential task&lt;br /&gt;
 sum = 0;&lt;br /&gt;
 '''for''' (i = 0; i &amp;lt; a.length; i++)&lt;br /&gt;
 {&lt;br /&gt;
    a[i] = a[i] * i;&lt;br /&gt;
    sum = sum + a[i];&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
When we orchestrate the task using the data-parallel programming model, the program can be divided into two parts. The first part performs the same operations on separate elements of the array for each processing element (sometimes referred to as PE or pe), and the second part reorganizes data among all processing elements (In our example data reorganization is summing up values across different processing elements). Since data-parallel programming model only defines the overall effects of parallel steps, the second part can be accomplished either through shared memory or message passing. The three code fragments below are examples for the first part of the program, shared-memory version of the second part, and message passing for the second part, respectively.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 // data parallel programming: let each PE perform the same task on different pieces of distributed data&lt;br /&gt;
 pe_id = getid();&lt;br /&gt;
 my_sum = 0;&lt;br /&gt;
 '''for''' (i = pe_id; i &amp;lt; a.length; i += number_of_pe)         //separate elements of the array are assigned to each PE &lt;br /&gt;
 {&lt;br /&gt;
    a[i] = a[i] * i;&lt;br /&gt;
    my_sum = my_sum + a[i];                               //all PEs accumulate elements assigned to them into local variable my_sum&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In the above code, data parallelism is achieved by letting each processing element perform actions on separate elements of the array, which are identified using the PE's id. For instance, if three processing elements are used, then one processing element would start at i = 0, one would start at i = 1, and the last would start at i = 2. Since there are three processing elements, the array index for each increases by three on each iteration until the task is complete (note that in our example the elements assigned to each PE are interleaved rather than contiguous). If the length of the array is a multiple of three, then each processing element takes the same amount of time to execute its portion of the task.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The picture below illustrates how elements of the array are assigned among the different PEs for a specific case: the length of the array is 7 and there are 3 PEs available. Elements in the array are marked by their indices (0 to 6). As shown in the picture, PE0 works on elements with indices 0, 3, and 6; PE1 is in charge of elements with indices 1 and 4; and elements with indices 2 and 5 are assigned to PE2. In this way, the 3 PEs work collectively on the array, while each PE works on different elements. Thus, data parallelism is achieved.&lt;br /&gt;
&lt;br /&gt;
[[Image:506wiki1.png|frame|center|150px|Illustration of data parallel programming (adapted from [http://computing.llnl.gov/tutorials/parallel_comp/#ModelsData Introduction to Parallel Computing])]]&lt;br /&gt;
&lt;br /&gt;
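The two code fragments below are minimal sketches, not taken from the Solihin text, of how the second part (combining each PE's &amp;lt;code&amp;gt;my_sum&amp;lt;/code&amp;gt; into the global &amp;lt;code&amp;gt;sum&amp;lt;/code&amp;gt;) might be written under the two models. The primitives &amp;lt;code&amp;gt;lock()&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;unlock()&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;barrier()&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;send()&amp;lt;/code&amp;gt;, and &amp;lt;code&amp;gt;recv()&amp;lt;/code&amp;gt;, as well as the lock variable &amp;lt;code&amp;gt;sum_lock&amp;lt;/code&amp;gt;, are assumed for illustration and are not defined in the code above.&lt;br /&gt;
&lt;br /&gt;
 // second part, shared-memory sketch: every PE adds its local sum into the shared variable sum&lt;br /&gt;
 lock(sum_lock);                                  // assumed lock guarding the shared variable sum&lt;br /&gt;
 sum = sum + my_sum;&lt;br /&gt;
 unlock(sum_lock);&lt;br /&gt;
 barrier();                                       // wait until every PE has contributed before sum is used&lt;br /&gt;
&lt;br /&gt;
 // second part, message-passing sketch: every other PE sends its local sum to PE 0, which accumulates them&lt;br /&gt;
 '''if''' (pe_id != 0)&lt;br /&gt;
    send(my_sum, 0);                              // send this PE's partial sum to PE 0&lt;br /&gt;
 '''else'''&lt;br /&gt;
 {&lt;br /&gt;
    sum = my_sum;&lt;br /&gt;
    '''for''' (p = 1; p &amp;lt; number_of_pe; p++)&lt;br /&gt;
       sum = sum + recv(p);                       // receive and add each remote partial sum&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;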
== Comparison with Message Passing and Shared Memory ==&lt;br /&gt;
&lt;br /&gt;
The main differences between a data-parallel programming paradigm and a task-parallel one concern how work is divided; that contrast is taken up in a later section. Here we compare the data parallel model with the message passing and shared memory models.&lt;br /&gt;
&lt;br /&gt;
Although the shared memory and message passing models may be combined into hybrid approaches, the two models are fundamentally different ways of addressing the same problem (of access control to common data). In contrast, the data parallel model is concerned with a fundamentally different problem (how to divide work into parallel tasks). As such, the data parallel model may be used in conjunction with either the shared memory or the message passing model without conflict. In fact, Klaiber (1994) compares the performance of a number of data parallel programs implemented with both shared memory and message passing models.&lt;br /&gt;
As discussed in the previous section, one of the major advantages of combining the data parallel and message passing models is a reduction in the amount and complexity of communication required relative to a task parallel approach. Similarly, combining the data parallel and shared memory models tends to simplify and reduce the amount of synchronization required. If the task parallel code given in the next section were implemented with a shared memory model rather than a message passing model, the two threads would require 8 signals to be sent between the threads (instead of 8 messages). In contrast, the data parallel code would require only a single barrier before the local sums are added to compute the full sum.&lt;br /&gt;
Much as the shared memory model can benefit from specialized hardware, the data parallel programming model can as well. SIMD (single instruction, multiple data) processors are specifically designed to run data parallel algorithms. These processors perform a single instruction on many different data locations simultaneously. Modern examples include CUDA processors developed by nVidia and Cell processors developed by STI (Sony, Toshiba, and IBM). However, whereas the shared memory model can be a difficult and costly abstraction in the absence of hardware support, the data parallel model, like the message passing model, does not require hardware support.&lt;br /&gt;
Since data parallel code tends to simplify communication and synchronization, data parallel code may be easier to develop than a more task parallel approach. However, data parallel code also requires writing code to split program data into chunks and assign it to different threads. In addition, it is possible that a problem may not decompose easily into subproblems relying on largely independent chunks of data. In this case, it may be impractical or impossible to apply the data parallel model.&lt;br /&gt;
Once written, data parallel programs can scale easily to large numbers of processors. The data parallel model implicitly encourages data locality by having each thread work on a chunk of data. The regular data chunks also make it easier to reason about where to locate data and how to organize it.&lt;br /&gt;
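&lt;br /&gt;
As an illustration of the locality point above, here is a minimal sketch (an assumed variation, not from the text) of the earlier loop with a contiguous, blocked assignment instead of an interleaved one, so that each PE works on one chunk of the array. It reuses &amp;lt;code&amp;gt;pe_id&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;number_of_pe&amp;lt;/code&amp;gt; from the earlier fragment and assumes the array length divides evenly.&lt;br /&gt;
&lt;br /&gt;
 // data parallel with blocked assignment: each PE works on one contiguous chunk of the array&lt;br /&gt;
 chunk = a.length / number_of_pe;                        // assumes a.length is a multiple of number_of_pe&lt;br /&gt;
 my_sum = 0;&lt;br /&gt;
 '''for''' (i = pe_id * chunk; i &amp;lt; (pe_id + 1) * chunk; i++)&lt;br /&gt;
 {&lt;br /&gt;
    a[i] = a[i] * i;&lt;br /&gt;
    my_sum = my_sum + a[i];                              // partial sums are combined as in the sketches above&lt;br /&gt;
 }&lt;br /&gt;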
&lt;br /&gt;
= Task Parallel Model =&lt;br /&gt;
Task parallelism is a form of parallelization in which different instructions are executed concurrently, either on the same data or on different data. It focuses on distributing the execution of processes (threads) across different parallel computing nodes. As part of the workflow, the different execution threads communicate with one another to share data as they work.&lt;br /&gt;
&lt;br /&gt;
== Description and Example ==&lt;br /&gt;
Suppose the task to be accomplished is to compute the sum of the results produced by executing instructions 'A' and 'B'. The following example illustrates how task parallelism can be achieved.&lt;br /&gt;
&lt;br /&gt;
The pseudo code below illustrates task parallelism:&lt;br /&gt;
&amp;lt;pre&amp;gt;program:&lt;br /&gt;
...&lt;br /&gt;
&lt;br /&gt;
if CPU=&amp;quot;a&amp;quot; then&lt;br /&gt;
   do task &amp;quot;A&amp;quot;&lt;br /&gt;
else if CPU=&amp;quot;b&amp;quot; then&lt;br /&gt;
   do task &amp;quot;B&amp;quot;&lt;br /&gt;
end if&lt;br /&gt;
&lt;br /&gt;
end program&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
If we write the code as above and launch it on a 2-processor system, then the runtime environment will execute it accordingly.&lt;br /&gt;
In an SPMD system, both CPUs will execute the code. In a parallel environment, both will have access to the same data. The &amp;quot;if&amp;quot; clause differentiates between the CPUs: CPU &amp;quot;a&amp;quot; will evaluate the &amp;quot;if&amp;quot; condition as true and CPU &amp;quot;b&amp;quot; will evaluate the &amp;quot;else if&amp;quot; condition as true, so each ends up with its own task. The two CPUs then execute separate code blocks simultaneously, performing different tasks.&lt;br /&gt;
Code executed by CPU &amp;quot;a&amp;quot;:&lt;br /&gt;
&amp;lt;pre&amp;gt;program:&lt;br /&gt;
...&lt;br /&gt;
do task &amp;quot;A&amp;quot;&lt;br /&gt;
...&lt;br /&gt;
end program&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Code executed by CPU &amp;quot;b&amp;quot;:&lt;br /&gt;
&amp;lt;pre&amp;gt;program:&lt;br /&gt;
...&lt;br /&gt;
do task &amp;quot;B&amp;quot;&lt;br /&gt;
...&lt;br /&gt;
end program&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
This concept can now be generalized to any number of processors.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
program:&lt;br /&gt;
...&lt;br /&gt;
if CPU=&amp;quot;a&amp;quot; then&lt;br /&gt;
   do task &amp;quot;A&amp;quot;&lt;br /&gt;
else if CPU=&amp;quot;b&amp;quot; then&lt;br /&gt;
   do task &amp;quot;B&amp;quot;&lt;br /&gt;
...&lt;br /&gt;
else if CPU=&amp;quot;n&amp;quot; then&lt;br /&gt;
   do task &amp;quot;N&amp;quot;&lt;br /&gt;
end if&lt;br /&gt;
...&lt;br /&gt;
end program&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
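The example above only divides the work; the sum of the results of 'A' and 'B' mentioned at the start of this section still has to be formed. The sketch below shows one assumed way to do this final step using shared variables (&amp;lt;code&amp;gt;result_a&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;result_b&amp;lt;/code&amp;gt; are hypothetical names) and an assumed &amp;lt;code&amp;gt;barrier()&amp;lt;/code&amp;gt; primitive; with message passing, CPU &amp;quot;b&amp;quot; would instead send its result to CPU &amp;quot;a&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
program:&lt;br /&gt;
...&lt;br /&gt;
if CPU=&amp;quot;a&amp;quot; then&lt;br /&gt;
   result_a = do task &amp;quot;A&amp;quot;      // each CPU stores its result in a shared variable&lt;br /&gt;
else if CPU=&amp;quot;b&amp;quot; then&lt;br /&gt;
   result_b = do task &amp;quot;B&amp;quot;&lt;br /&gt;
end if&lt;br /&gt;
barrier()                        // wait until both results are available&lt;br /&gt;
if CPU=&amp;quot;a&amp;quot; then&lt;br /&gt;
   sum = result_a + result_b     // one CPU combines the two results&lt;br /&gt;
end if&lt;br /&gt;
end program&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;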
= Data Parallel Model vs Task Parallel Model =&lt;br /&gt;
One important feature of the data-parallel programming model, or data parallelism (SIMD), is its single control flow: there is only one control processor that directs the activities of all the processing elements. In stark contrast, task parallelism (MIMD: Multiple Instruction, Multiple Data) is characterized by multiple control flows: it allows the concurrent execution of multiple instruction streams, each manipulating its own data and serving separate functions. Below is a contrast between the data parallelism and task parallelism models, drawn from Wikipedia: [http://en.wikipedia.org/wiki/SIMD SIMD] and [http://en.wikipedia.org/wiki/MIMD MIMD]. The table that follows compares and contrasts different features of the data-parallel and task-parallel models to help the reader understand the unique characteristics of the data-parallel programming model.&lt;br /&gt;
[[Image:Smid.png|frame|center|425px|contrast between data parallelism and task parallelism]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; border=&amp;quot;1&amp;quot; align=&amp;quot;center&amp;quot;&lt;br /&gt;
|+ '''Comparison between data parallel and task parallel programming models.'''&lt;br /&gt;
|-&lt;br /&gt;
! Aspects&lt;br /&gt;
! Data Parallel&lt;br /&gt;
! Task Parallel&lt;br /&gt;
|-&lt;br /&gt;
| Decomposition&lt;br /&gt;
| Partition data into subsets&lt;br /&gt;
| Partition program into subtasks&lt;br /&gt;
|-&lt;br /&gt;
| Parallel tasks&lt;br /&gt;
| Identical&lt;br /&gt;
| Unique&lt;br /&gt;
|-&lt;br /&gt;
| Degree of parallelism&lt;br /&gt;
| Scales easily&lt;br /&gt;
| Fixed&lt;br /&gt;
|-&lt;br /&gt;
| Load balancing&lt;br /&gt;
| Easier&lt;br /&gt;
| Harder&lt;br /&gt;
|-&lt;br /&gt;
| Communication overhead&lt;br /&gt;
| Lower&lt;br /&gt;
| Higher&lt;br /&gt;
|}&lt;br /&gt;
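&lt;br /&gt;
To make the decomposition rows of the table concrete, the sketch below condenses the two earlier examples side by side: under data parallelism every PE runs the identical loop body on its own subset of the data, while under task parallelism each CPU runs a different subtask. The variable names are reused from the earlier examples.&lt;br /&gt;
&lt;br /&gt;
 // data parallel: identical code on every PE, data partitioned across PEs&lt;br /&gt;
 '''for''' (i = pe_id; i &amp;lt; a.length; i += number_of_pe)&lt;br /&gt;
    a[i] = a[i] * i;&lt;br /&gt;
 &lt;br /&gt;
 // task parallel: program partitioned into different subtasks, one per CPU&lt;br /&gt;
 if CPU=&amp;quot;a&amp;quot; then do task &amp;quot;A&amp;quot;&lt;br /&gt;
 else if CPU=&amp;quot;b&amp;quot; then do task &amp;quot;B&amp;quot;&lt;br /&gt;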
&lt;br /&gt;
= Definitions =&lt;br /&gt;
&lt;br /&gt;
= References =&lt;br /&gt;
* David E. Culler, Jaswinder Pal Singh, and Anoop Gupta, [http://portal.acm.org/citation.cfm?id=550071 ''Parallel Computer Architecture: A Hardware/Software Approach,''] Morgan Kaufmann, 1999.&lt;br /&gt;
* Ian Foster, [http://www.mcs.anl.gov/~itf/dbpp/ ''Designing and Building Parallel Programs,''] Addison-Wesley, 1995.&lt;br /&gt;
* Magne Haveraaen, [http://portal.acm.org/citation.cfm?id=1239917 &amp;quot;Machine and collection abstractions for user-implemented data-parallel programming,&amp;quot;] ''Scientific Programming,'' 8(4):231-246, 2000.&lt;br /&gt;
* W. Daniel Hillis and Guy L. Steele, Jr., [http://portal.acm.org/citation.cfm?id=7903 &amp;quot;Data parallel algorithms,&amp;quot;] ''Communications of the ACM,'' 29(12):1170-1183, December 1986.&lt;br /&gt;
* Alexander C. Klaiber and Henry M. Levy, [http://portal.acm.org/citation.cfm?id=192020 &amp;quot;A comparison of message passing and shared memory architectures for data parallel programs,&amp;quot;] in ''Proceedings of the 21st Annual International Symposium on Computer Architecture,'' April 1994, pp. 94-105.&lt;br /&gt;
* Yan Solihin, ''Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems,'' Solihin Books, 2008.&lt;/div&gt;</summary>
		<author><name>Vrmanda</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch2_JR&amp;diff=43553</id>
		<title>CSC/ECE 506 Spring 2011/ch2 JR</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch2_JR&amp;diff=43553"/>
		<updated>2011-01-30T01:08:53Z</updated>

		<summary type="html">&lt;p&gt;Vrmanda: /* Description and Example */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Supplement to Chapter 2: The Data Parallel Programming Model=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= History =&lt;br /&gt;
As computer architectures have evolved, so have parallel programming models. The earliest advancements in parallel computers took advantage of bit-level parallelism.  These computers used vector processing, which required a shared memory programming model.  As performance returns from this architecture diminished, the emphasis was placed on instruction-level parallelism and the message passing model began to dominate.  Most recently, with the move to cluster-based machines, there has been an increased emphasis on thread-level parallelism. This has corresponded to an increase interest in the data parallel programming model.&lt;br /&gt;
&lt;br /&gt;
== Bit-level parallelism in the 1970's ==&lt;br /&gt;
The major performance improvements from computers during this time were due to the ability to execute 32-bit word size operations at one time ([[#References|Culler (1999), p. 15.]]).  The dominant supercomputers of the time, like the Cray and the ILLIAC IV, were mainly Single Instruction Multiple Data architectures and used a shared memory programming model.  They each used different forms of vector processing ([[#References|Culler (1999), p. 21.]]). &lt;br /&gt;
Development of the ILLIAC IV began in 1964 and wasn't finished until 1975 [http://en.wikipedia.org/wiki/ILLIAC_IV].  A central processor was connected to the main memory and delegated tasks to individual PE's, which each had their own memory cache. [http://archive.computerhistory.org/resources/text/Burroughs/Burroughs.ILLIAC%20IV.1974.102624911.pdf].  Each PE could operate either an 8-, 32- or 64-bit operand at a given time [http://archive.computerhistory.org/resources/text/Burroughs/Burroughs.ILLIAC%20IV.1974.102624911.pdf].&lt;br /&gt;
&lt;br /&gt;
The Cray machine was installed at Los Alamos National Laborartory in1976 by Control Data Corporation and had similar performance to the ILLIAC IV [http://en.wikipedia.org/wiki/ILLIAC_IV].  The Cray machine relied heavily on the use of registers instead of individual processors like the ILLIAC IV.  Each processor was connected to main memory and had a number of 64-bit registers used to perform operations [http://www.eecg.toronto.edu/~moshovos/ACA05/read/cray1.pdf].&lt;br /&gt;
&lt;br /&gt;
== Move to instruction-level parallelism in the 1980's ==&lt;br /&gt;
&lt;br /&gt;
Increasing the word size above 32-bits offered diminishing returns in terms of performance ([[#References|Culler (1999), p. 15.]]). In the mid-1980's the emphasis changed from bit-level parallelism to instruction-level parallelism, which involved increasing the number of instructions that could be executed at one time ([[#References|Culler (1999), p. 15.]]).  The message passing model allowed programmers the ability to divide up instructions in order to take advantage of this architecture. &lt;br /&gt;
&lt;br /&gt;
== Thread-level parallelism ==&lt;br /&gt;
The move to cluster-based machines in the past decade, has added another layer of complexity to parallelism.  Since computers could be located across a network from each other, there is more emphasis on software acting as a bridge [http://cobweb.ecn.purdue.edu/~pplinux/ppcluster.html]. This has led to a greater emphasis on thread- or task-level parallelism [http://en.wikipedia.org/wiki/Thread-level_parallelism]  and the addition of the data parallelism programming model to existing message passing or shared memory models [http://en.wikipedia.org/wiki/Thread-level_parallelism].  &lt;br /&gt;
&lt;br /&gt;
= Data Parallel Model =&lt;br /&gt;
One important feature of data-parallel programming model or data parallelism (SIMD) is the single control flow. Flynn's taxonomy classifies SIMD to be analogous to doing the same operation repeatedly over a large data set. There is only one control processor that directs the activities of all the processing elements. In a multiprocessor system executing a single set of instructions (SIMD), data parallelism is achieved when each processor performs the same task on different pieces of distributed data. In some situations, a single execution thread controls operations on all pieces of data. In others, different threads control the operation, but they execute the same code.&lt;br /&gt;
&lt;br /&gt;
== Description and Example ==&lt;br /&gt;
&lt;br /&gt;
This section shows a simple example adapted from Solihin textbook (pp. 24 - 27) that illustrates  the data-parallel programming model. Each of the codes below are written in pseudo-code style.&lt;br /&gt;
&lt;br /&gt;
Suppose we want to perform the following task on an array &amp;lt;code&amp;gt;a&amp;lt;/code&amp;gt;: updating each element of &amp;lt;code&amp;gt;a&amp;lt;/code&amp;gt; by the product of itself and its index, and adding together the elements of &amp;lt;code&amp;gt;a&amp;lt;/code&amp;gt; into the variable &amp;lt;code&amp;gt;sum&amp;lt;/code&amp;gt;. The corresponding code is shown below.&lt;br /&gt;
&lt;br /&gt;
 // simple sequential task&lt;br /&gt;
 sum = 0;&lt;br /&gt;
 '''for''' (i = 0; i &amp;lt; a.length; i++)&lt;br /&gt;
 {&lt;br /&gt;
    a[i] = a[i] * i;&lt;br /&gt;
    sum = sum + a[i];&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
When we orchestrate the task using the data-parallel programming model, the program can be divided into two parts. The first part performs the same operations on separate elements of the array for each processing element (sometimes referred to as PE or pe), and the second part reorganizes data among all processing elements (In our example data reorganization is summing up values across different processing elements). Since data-parallel programming model only defines the overall effects of parallel steps, the second part can be accomplished either through shared memory or message passing. The three code fragments below are examples for the first part of the program, shared-memory version of the second part, and message passing for the second part, respectively.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 // data parallel programming: let each PE perform the same task on different pieces of distributed data&lt;br /&gt;
 pe_id = getid();&lt;br /&gt;
 my_sum = 0;&lt;br /&gt;
 '''for''' (i = pe_id; i &amp;lt; a.length; i += number_of_pe)         //separate elements of the array are assigned to each PE &lt;br /&gt;
 {&lt;br /&gt;
    a[i] = a[i] * i;&lt;br /&gt;
    my_sum = my_sum + a[i];                               //all PEs accumulate elements assigned to them into local variable my_sum&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In the above code, data parallelism is achieved by letting each processing element perform actions on array's separate elements, which are identified using the PE's id. For instance, if three processing elements are used then one processing element would start at i = 0, one would start at i = 1, and the last would start at i = 2. Since there are three processing elements then the index of the array for each will increase by three on each iteration until the task is complete (note that in our example elements assigned to each PE are interleaved instead of continuous). If the length of the array is a multiple of three then each processing element takes the same amount of time to execute its portion of the task.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The picture below illustrates how elements of the array are assigned among different PEs for the specific case: length of the array is 7 and there are 3 PEs available. Elements in the array are marked by their indexes (0 to 6). As shown in the picture, PE0 will work on elements with index 0, 3, 6; PE1 is in charge of elements with index 1, 4; and elements with index 2, 5 are assigned to PE2. In this way, these 3 PEs work collectively on the array, while each PE works on different elements. Thus, data parallelism is achieved.&lt;br /&gt;
&lt;br /&gt;
[[Image:506wiki1.png|frame|center|150px|Illustration of data parallel programming(adapted from [http://computing.llnl.gov/tutorials/parallel_comp/#ModelsData Introduction to Parallel Computing])]]&lt;br /&gt;
&lt;br /&gt;
== Comparison with Message Passing and Shared Memory ==&lt;br /&gt;
&lt;br /&gt;
All the differences that exist between a data parallel programming paradigm and task &lt;br /&gt;
&lt;br /&gt;
Although the shared memory and message passing models may be combined into hybrid approaches, the two models are fundamentally different ways of addressing the same problem (of access control to common data). In contrast, the data parallel model is concerned with a fundamentally different problem (how to divide work into parallel tasks). As such, the data parallel model may be used in conjunction with either the shared memory or the message passing model without conflict. In fact, Klaiber (1994) compares the performance of a number of data parallel programs implemented with both shared memory and message passing models.&lt;br /&gt;
As discussed in the previous section, one of the major advantages of combining the data parallel and message passing models is a reduction in the amount and complexity of communication required relative to a task parallel approach. Similarly, combining the data parallel and shared memory models tends to simplify and reduce the amount of synchronization required. If the task parallel code given above were modified from a message passing model to a shared memory model, the two threads would require 8 signals be sent between the threads (instead of 8 messages). In contrast, the data parallel code would require a single barrier before the local sums are added to compute the full sum.&lt;br /&gt;
Much as the shared memory model can benefit from specialized hardware, the data parallel programming model can as well. SIMD (single-instruction-multiple-data) processors are specifically designed to run data parallel algorithms. These processors perform a single instruction on many different data locations simultaneously. Modern examples include CUDA processors developed by nVidia and Cell processors developed by STI (Sony, Toshiba, and IBM). For the curious, example code for CUDA processors is provided in the Appendix. However, whereas the shared memory model can be a difficult and costly abstraction in the absence of hardware support, the data parallel model—like the message passing model—does not require hardware support.&lt;br /&gt;
Since data parallel code tends to simplify communication and synchronization, data parallel code may be easier to develop than a more task parallel approach. However, data parallel code also requires writing code to split program data into chunks and assign it to different threads. In addition, it is possible that a problem may not decompose easily into subproblems relying on largely independent chunks of data. In this case, it may be impractical or impossible to apply the data parallel model.&lt;br /&gt;
Once written, data parallel programs can scale easily to large numbers of processors. The data parallel model implicitly encourages data locality by having each thread work on a chunk of data. The regular data chunks also make it easier to reason about where to locate data and how to organize it.&lt;br /&gt;
&lt;br /&gt;
= Task Parallel Model =&lt;br /&gt;
Task parallelism is a form of parallelization in which different instructions are executed, either on the same data or on different data. It focuses on distributing the execution of processes (threads) across different parallel computing nodes. As part of the workflow, the execution threads communicate with one another as they work in order to share data.&lt;br /&gt;
&lt;br /&gt;
== Description and Example ==&lt;br /&gt;
Suppose the task to be accomplished is to compute the sum of the results produced by executing task 'A' and task 'B'. The following example illustrates how task parallelism can be achieved.&lt;br /&gt;
&lt;br /&gt;
The pseudo code below illustrates task parallelism:&lt;br /&gt;
&amp;lt;pre&amp;gt;program:&lt;br /&gt;
do &lt;br /&gt;
...&lt;br /&gt;
&lt;br /&gt;
if CPU=&amp;quot;a&amp;quot; then&lt;br /&gt;
   do task &amp;quot;A&amp;quot;&lt;br /&gt;
else if CPU=&amp;quot;b&amp;quot; then&lt;br /&gt;
   do task &amp;quot;B&amp;quot;&lt;br /&gt;
end if&lt;br /&gt;
&lt;br /&gt;
end program&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
If we write the code as above and launch it on a 2-processor system, then the runtime environment will execute it accordingly.&lt;br /&gt;
In an SPMD system, both CPUs will execute the code. In a parallel environment, both will have access to the same data. The &amp;quot;if&amp;quot; clause differentiates between the CPUs: CPU &amp;quot;a&amp;quot; will evaluate the &amp;quot;if&amp;quot; as true and CPU &amp;quot;b&amp;quot; will evaluate the &amp;quot;else if&amp;quot; as true, so each has its own task. Both CPUs then execute separate code blocks concurrently, each performing a different task.&lt;br /&gt;
Code executed by CPU &amp;quot;a&amp;quot;:&lt;br /&gt;
program:&lt;br /&gt;
...&lt;br /&gt;
do task &amp;quot;A&amp;quot;&lt;br /&gt;
...&lt;br /&gt;
end program&lt;br /&gt;
Code executed by CPU &amp;quot;b&amp;quot;:&lt;br /&gt;
program:&lt;br /&gt;
...&lt;br /&gt;
do task &amp;quot;B&amp;quot;&lt;br /&gt;
...&lt;br /&gt;
end program&lt;br /&gt;
This concept can now be generalized to any number of processors.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; &lt;br /&gt;
program:&lt;br /&gt;
...&lt;br /&gt;
if CPU=&amp;quot;a&amp;quot; then&lt;br /&gt;
   do task &amp;quot;A&amp;quot;&lt;br /&gt;
else if CPU=&amp;quot;b&amp;quot; then&lt;br /&gt;
   do task &amp;quot;B&amp;quot;&lt;br /&gt;
end if&lt;br /&gt;
...&lt;br /&gt;
end program&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
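The listing above still spells out the two-CPU case; the fragment below is a minimal C sketch of one way the generalization to any number of processors might look, assuming POSIX threads stand in for the CPUs and a hypothetical &amp;lt;code&amp;gt;task()&amp;lt;/code&amp;gt; function represents the per-processor work (both are illustrative choices, not part of the original example).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#include &amp;lt;stdio.h&amp;gt;&lt;br /&gt;
#include &amp;lt;pthread.h&amp;gt;&lt;br /&gt;
&lt;br /&gt;
#define NUM_CPUS 4&lt;br /&gt;
&lt;br /&gt;
/* hypothetical per-processor work: &amp;quot;CPU&amp;quot; i performs task i */&lt;br /&gt;
static void task(int id)&lt;br /&gt;
{&lt;br /&gt;
    printf(&amp;quot;task %c running on CPU %d\n&amp;quot;, 'A' + id, id);&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
static void *worker(void *arg)&lt;br /&gt;
{&lt;br /&gt;
    int id = *(int *)arg;   /* each thread plays the role of one CPU       */&lt;br /&gt;
    task(id);               /* the id selects the task, replacing the ifs  */&lt;br /&gt;
    return NULL;&lt;br /&gt;
}&lt;br /&gt;
&lt;br /&gt;
int main(void)&lt;br /&gt;
{&lt;br /&gt;
    pthread_t tid[NUM_CPUS];&lt;br /&gt;
    int id[NUM_CPUS];&lt;br /&gt;
&lt;br /&gt;
    for (int i = 0; i &amp;lt; NUM_CPUS; i++)&lt;br /&gt;
    {&lt;br /&gt;
        id[i] = i;&lt;br /&gt;
        pthread_create(&amp;amp;tid[i], NULL, worker, &amp;amp;id[i]);&lt;br /&gt;
    }&lt;br /&gt;
    for (int i = 0; i &amp;lt; NUM_CPUS; i++)&lt;br /&gt;
        pthread_join(tid[i], NULL);&lt;br /&gt;
    return 0;&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Indexing on the processor id is what makes the number of processors a parameter rather than something hard-coded into a chain of if/else branches.&lt;br /&gt;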
&lt;br /&gt;
= Data Parallel Model vs Task Parallel Model =&lt;br /&gt;
One important feature of the data-parallel programming model, or data parallelism (SIMD), is its single control flow: there is only one control processor that directs the activities of all the processing elements. In stark contrast to this is task parallelism (MIMD: Multiple Instruction, Multiple Data): characterized by multiple control flows, it allows the concurrent execution of multiple instruction streams, each manipulating its own data and serving a separate function. Below is a contrast between the data parallelism and task parallelism models from Wikipedia: [http://en.wikipedia.org/wiki/SIMD SIMD] and [http://en.wikipedia.org/wiki/MIMD MIMD]. In the following subsections we continue to compare and contrast features of the data-parallel and task-parallel models to help the reader understand the unique characteristics of the data-parallel programming model.&lt;br /&gt;
[[Image:Smid.png|frame|center|425px|contrast between data parallelism and task parallelism]]&lt;br /&gt;
&lt;br /&gt;
= Definitions =&lt;br /&gt;
&lt;br /&gt;
= References =&lt;br /&gt;
* David E. Culler, Jaswinder Pal Singh, and Anoop Gupta, [http://portal.acm.org/citation.cfm?id=550071 ''Parallel Computer Architecture: A Hardware/Software Approach,''] Morgan Kaufmann, 1999.&lt;br /&gt;
* Ian Foster, [http://www.mcs.anl.gov/~itf/dbpp/ ''Designing and Building Parallel Programs,''] Addison-Wesley, 1995.&lt;br /&gt;
* Magne Haveraaen, [http://portal.acm.org/citation.cfm?id=1239917 &amp;quot;Machine and collection abstractions for user-implemented data-parallel programming,&amp;quot;] ''Scientific Programming,'' 8(4):231-246, 2000.&lt;br /&gt;
* W. Daniel Hillis and Guy L. Steele, Jr., [http://portal.acm.org/citation.cfm?id=7903 &amp;quot;Data parallel algorithms,&amp;quot;] ''Communications of the ACM,'' 29(12):1170-1183, December 1986.&lt;br /&gt;
* Alexander C. Klaiber and Henry M. Levy, [http://portal.acm.org/citation.cfm?id=192020 &amp;quot;A comparison of message passing and shared memory architectures for data parallel programs,&amp;quot;] in ''Proceedings of the 21st Annual International Symposium on Computer Architecture,'' April 1994, pp. 94-105.&lt;br /&gt;
* Yan Solihin, ''Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems,'' Solihin Books, 2008.&lt;/div&gt;</summary>
		<author><name>Vrmanda</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch2_JR&amp;diff=43552</id>
		<title>CSC/ECE 506 Spring 2011/ch2 JR</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch2_JR&amp;diff=43552"/>
		<updated>2011-01-30T01:06:04Z</updated>

		<summary type="html">&lt;p&gt;Vrmanda: /* Data Parallel Model vs Task Parallel Model */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Supplement to Chapter 2: The Data Parallel Programming Model=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= History =&lt;br /&gt;
As computer architectures have evolved, so have parallel programming models. The earliest advancements in parallel computers took advantage of bit-level parallelism.  These computers used vector processing, which required a shared memory programming model.  As performance returns from this architecture diminished, the emphasis was placed on instruction-level parallelism and the message passing model began to dominate.  Most recently, with the move to cluster-based machines, there has been an increased emphasis on thread-level parallelism. This has corresponded to an increased interest in the data parallel programming model.&lt;br /&gt;
&lt;br /&gt;
== Bit-level parallelism in the 1970's ==&lt;br /&gt;
The major performance improvements from computers during this time were due to the ability to execute 32-bit word size operations at one time ([[#References|Culler (1999), p. 15.]]).  The dominant supercomputers of the time, like the Cray and the ILLIAC IV, were mainly Single Instruction Multiple Data architectures and used a shared memory programming model.  They each used different forms of vector processing ([[#References|Culler (1999), p. 21.]]). &lt;br /&gt;
Development of the ILLIAC IV began in 1964 and wasn't finished until 1975 [http://en.wikipedia.org/wiki/ILLIAC_IV].  A central processor was connected to the main memory and delegated tasks to individual PEs, each of which had its own memory cache [http://archive.computerhistory.org/resources/text/Burroughs/Burroughs.ILLIAC%20IV.1974.102624911.pdf].  Each PE could operate on either an 8-, 32-, or 64-bit operand at a given time [http://archive.computerhistory.org/resources/text/Burroughs/Burroughs.ILLIAC%20IV.1974.102624911.pdf].&lt;br /&gt;
&lt;br /&gt;
The Cray machine was installed at Los Alamos National Laboratory in 1976 by Cray Research and had similar performance to the ILLIAC IV [http://en.wikipedia.org/wiki/ILLIAC_IV].  The Cray machine relied heavily on the use of registers instead of the array of individual processors used by the ILLIAC IV.  Each processor was connected to main memory and had a number of 64-bit registers used to perform operations [http://www.eecg.toronto.edu/~moshovos/ACA05/read/cray1.pdf].&lt;br /&gt;
&lt;br /&gt;
== Move to instruction-level parallelism in the 1980's ==&lt;br /&gt;
&lt;br /&gt;
Increasing the word size above 32-bits offered diminishing returns in terms of performance ([[#References|Culler (1999), p. 15.]]). In the mid-1980's the emphasis changed from bit-level parallelism to instruction-level parallelism, which involved increasing the number of instructions that could be executed at one time ([[#References|Culler (1999), p. 15.]]).  The message passing model allowed programmers the ability to divide up instructions in order to take advantage of this architecture. &lt;br /&gt;
&lt;br /&gt;
== Thread-level parallelism ==&lt;br /&gt;
The move to cluster-based machines in the past decade has added another layer of complexity to parallelism.  Since computers may be located across a network from one another, there is more emphasis on software acting as a bridge [http://cobweb.ecn.purdue.edu/~pplinux/ppcluster.html]. This has led to a greater emphasis on thread- or task-level parallelism [http://en.wikipedia.org/wiki/Thread-level_parallelism] and the addition of the data parallel programming model to existing message passing or shared memory models [http://en.wikipedia.org/wiki/Thread-level_parallelism].&lt;br /&gt;
&lt;br /&gt;
= Data Parallel Model =&lt;br /&gt;
One important feature of the data-parallel programming model, or data parallelism (SIMD), is its single control flow. Under Flynn's taxonomy, SIMD is analogous to performing the same operation repeatedly over a large data set. There is only one control processor that directs the activities of all the processing elements. In a multiprocessor system executing a single set of instructions (SIMD), data parallelism is achieved when each processor performs the same task on different pieces of distributed data. In some situations, a single execution thread controls operations on all pieces of data. In others, different threads control the operation, but they execute the same code.&lt;br /&gt;
&lt;br /&gt;
== Description and Example ==&lt;br /&gt;
&lt;br /&gt;
This section shows a simple example adapted from the Solihin textbook (pp. 24-27) that illustrates the data-parallel programming model. Each of the code fragments below is written in a pseudo-code style.&lt;br /&gt;
&lt;br /&gt;
Suppose we want to perform the following task on an array &amp;lt;code&amp;gt;a&amp;lt;/code&amp;gt;: updating each element of &amp;lt;code&amp;gt;a&amp;lt;/code&amp;gt; by the product of itself and its index, and adding together the elements of &amp;lt;code&amp;gt;a&amp;lt;/code&amp;gt; into the variable &amp;lt;code&amp;gt;sum&amp;lt;/code&amp;gt;. The corresponding code is shown below.&lt;br /&gt;
&lt;br /&gt;
 // simple sequential task&lt;br /&gt;
 sum = 0;&lt;br /&gt;
 '''for''' (i = 0; i &amp;lt; a.length; i++)&lt;br /&gt;
 {&lt;br /&gt;
    a[i] = a[i] * i;&lt;br /&gt;
    sum = sum + a[i];&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
When we orchestrate the task using the data-parallel programming model, the program can be divided into two parts. The first part performs the same operations on separate elements of the array for each processing element (sometimes referred to as PE or pe), and the second part reorganizes data among all processing elements (in our example, data reorganization means summing up the values across the different processing elements). Since the data-parallel programming model only defines the overall effects of parallel steps, the second part can be accomplished either through shared memory or through message passing. The code fragment below shows the first part of the program; a sketch of a message-passing version of the second part appears after the illustration below.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 // data parallel programming: let each PE perform the same task on different pieces of distributed data&lt;br /&gt;
 pe_id = getid();&lt;br /&gt;
 my_sum = 0;&lt;br /&gt;
 '''for''' (i = pe_id; i &amp;lt; a.length; i += number_of_pe)         //separate elements of the array are assigned to each PE &lt;br /&gt;
 {&lt;br /&gt;
    a[i] = a[i] * i;&lt;br /&gt;
    my_sum = my_sum + a[i];                               //all PEs accumulate elements assigned to them into local variable my_sum&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In the above code, data parallelism is achieved by letting each processing element perform actions on separate elements of the array, which are identified using the PE's id. For instance, if three processing elements are used, then one processing element would start at i = 0, one would start at i = 1, and the last would start at i = 2. Since there are three processing elements, the index of the array for each will increase by three on each iteration until the task is complete (note that in our example the elements assigned to each PE are interleaved rather than contiguous). If the length of the array is a multiple of three, then each processing element takes the same amount of time to execute its portion of the task.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The picture below illustrates how elements of the array are assigned among different PEs for the specific case: length of the array is 7 and there are 3 PEs available. Elements in the array are marked by their indexes (0 to 6). As shown in the picture, PE0 will work on elements with index 0, 3, 6; PE1 is in charge of elements with index 1, 4; and elements with index 2, 5 are assigned to PE2. In this way, these 3 PEs work collectively on the array, while each PE works on different elements. Thus, data parallelism is achieved.&lt;br /&gt;
&lt;br /&gt;
[[Image:506wiki1.png|frame|center|150px|Illustration of data parallel programming (adapted from [http://computing.llnl.gov/tutorials/parallel_comp/#ModelsData Introduction to Parallel Computing])]]&lt;br /&gt;
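&lt;br /&gt;
The second part of the program, combining the per-PE values of &amp;lt;code&amp;gt;my_sum&amp;lt;/code&amp;gt; into &amp;lt;code&amp;gt;sum&amp;lt;/code&amp;gt;, is not listed in this revision; the fragment below is a minimal message-passing sketch of it in C, assuming MPI is available (the use of MPI_Reduce, the array contents, and rank 0 as the collector are illustrative choices, not the textbook's code).&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
#include &amp;lt;stdio.h&amp;gt;&lt;br /&gt;
#include &amp;lt;mpi.h&amp;gt;&lt;br /&gt;
&lt;br /&gt;
#define N 7&lt;br /&gt;
&lt;br /&gt;
int main(int argc, char **argv)&lt;br /&gt;
{&lt;br /&gt;
    int a[N] = {3, 1, 4, 1, 5, 9, 2};&lt;br /&gt;
    int pe_id, number_of_pe, my_sum = 0, sum = 0;&lt;br /&gt;
&lt;br /&gt;
    MPI_Init(&amp;amp;argc, &amp;amp;argv);&lt;br /&gt;
    MPI_Comm_rank(MPI_COMM_WORLD, &amp;amp;pe_id);&lt;br /&gt;
    MPI_Comm_size(MPI_COMM_WORLD, &amp;amp;number_of_pe);&lt;br /&gt;
&lt;br /&gt;
    /* first part: each PE updates its interleaved share of the array */&lt;br /&gt;
    for (int i = pe_id; i &amp;lt; N; i += number_of_pe)&lt;br /&gt;
    {&lt;br /&gt;
        a[i] = a[i] * i;&lt;br /&gt;
        my_sum = my_sum + a[i];&lt;br /&gt;
    }&lt;br /&gt;
&lt;br /&gt;
    /* second part: the local sums are combined by message passing;   */&lt;br /&gt;
    /* MPI_Reduce collects every my_sum and leaves the total on PE 0  */&lt;br /&gt;
    MPI_Reduce(&amp;amp;my_sum, &amp;amp;sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);&lt;br /&gt;
&lt;br /&gt;
    if (pe_id == 0)&lt;br /&gt;
        printf(&amp;quot;sum = %d\n&amp;quot;, sum);&lt;br /&gt;
&lt;br /&gt;
    MPI_Finalize();&lt;br /&gt;
    return 0;&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
A shared-memory version of the second part would instead have each PE add its &amp;lt;code&amp;gt;my_sum&amp;lt;/code&amp;gt; into a shared &amp;lt;code&amp;gt;sum&amp;lt;/code&amp;gt; variable inside a critical section, typically after a barrier; the message-passing version above needs no shared variable at all.&lt;br /&gt;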
&lt;br /&gt;
== Comparison with Message Passing and Shared Memory ==&lt;br /&gt;
&lt;br /&gt;
The differences between the data parallel and task parallel programming paradigms are taken up in the Task Parallel Model section below; here we compare the data parallel model with the message passing and shared memory models.&lt;br /&gt;
&lt;br /&gt;
Although the shared memory and message passing models may be combined into hybrid approaches, the two models are fundamentally different ways of addressing the same problem (of access control to common data). In contrast, the data parallel model is concerned with a fundamentally different problem (how to divide work into parallel tasks). As such, the data parallel model may be used in conjunction with either the shared memory or the message passing model without conflict. In fact, Klaiber (1994) compares the performance of a number of data parallel programs implemented with both shared memory and message passing models.&lt;br /&gt;
As discussed in the previous section, one of the major advantages of combining the data parallel and message passing models is a reduction in the amount and complexity of communication required relative to a task parallel approach. Similarly, combining the data parallel and shared memory models tends to simplify and reduce the amount of synchronization required. If the task parallel code given above were modified from a message passing model to a shared memory model, the two threads would require 8 signals be sent between the threads (instead of 8 messages). In contrast, the data parallel code would require a single barrier before the local sums are added to compute the full sum.&lt;br /&gt;
Much as the shared memory model can benefit from specialized hardware, the data parallel programming model can as well. SIMD (single-instruction-multiple-data) processors are specifically designed to run data parallel algorithms. These processors perform a single instruction on many different data locations simultaneously. Modern examples include CUDA processors developed by nVidia and Cell processors developed by STI (Sony, Toshiba, and IBM). For the curious, example code for CUDA processors is provided in the Appendix. However, whereas the shared memory model can be a difficult and costly abstraction in the absence of hardware support, the data parallel model—like the message passing model—does not require hardware support.&lt;br /&gt;
Since data parallel code tends to simplify communication and synchronization, data parallel code may be easier to develop than a more task parallel approach. However, data parallel code also requires writing code to split program data into chunks and assign it to different threads. In addition, it is possible that a problem may not decompose easily into subproblems relying on largely independent chunks of data. In this case, it may be impractical or impossible to apply the data parallel model.&lt;br /&gt;
Once written, data parallel programs can scale easily to large numbers of processors. The data parallel model implicitly encourages data locality by having each thread work on a chunk of data. The regular data chunks also make it easier to reason about where to locate data and how to organize it.&lt;br /&gt;
&lt;br /&gt;
= Task Parallel Model =&lt;br /&gt;
Task parallelism is a form of parallelization in which different instructions are executed, either on the same data or on different data. It focuses on distributing the execution of processes (threads) across different parallel computing nodes. As part of the workflow, the execution threads communicate with one another as they work in order to share data.&lt;br /&gt;
&lt;br /&gt;
== Description and Example ==&lt;br /&gt;
Suppose the task to be accomplished is to compute the sum of the results produced by executing task 'A' and task 'B'. The following example illustrates how task parallelism can be achieved.&lt;br /&gt;
&lt;br /&gt;
The pseudo code below illustrates task parallelism:&lt;br /&gt;
&amp;lt;pre&amp;gt;program:&lt;br /&gt;
do &lt;br /&gt;
...&lt;br /&gt;
&lt;br /&gt;
if CPU=&amp;quot;a&amp;quot; then&lt;br /&gt;
   do task &amp;quot;A&amp;quot;&lt;br /&gt;
else if CPU=&amp;quot;b&amp;quot; then&lt;br /&gt;
   do task &amp;quot;B&amp;quot;&lt;br /&gt;
end if&lt;br /&gt;
&lt;br /&gt;
end program&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
If we write the code as above and launch it on a 2-processor system, then the runtime environment will execute it accordingly.&lt;br /&gt;
In an SPMD system, both CPUs will execute the code. In a parallel environment, both will have access to the same data. The &amp;quot;if&amp;quot; clause differentiates between the CPUs: CPU &amp;quot;a&amp;quot; will evaluate the &amp;quot;if&amp;quot; as true and CPU &amp;quot;b&amp;quot; will evaluate the &amp;quot;else if&amp;quot; as true, so each has its own task. Both CPUs then execute separate code blocks concurrently, each performing a different task.&lt;br /&gt;
Code executed by CPU &amp;quot;a&amp;quot;:&lt;br /&gt;
program:&lt;br /&gt;
...&lt;br /&gt;
do task &amp;quot;A&amp;quot;&lt;br /&gt;
...&lt;br /&gt;
end program&lt;br /&gt;
Code executed by CPU &amp;quot;b&amp;quot;:&lt;br /&gt;
program:&lt;br /&gt;
...&lt;br /&gt;
do task &amp;quot;B&amp;quot;&lt;br /&gt;
...&lt;br /&gt;
end program&lt;br /&gt;
This concept can now be generalized to any number of processors.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; &lt;br /&gt;
program:&lt;br /&gt;
...&lt;br /&gt;
if CPU=&amp;quot;a&amp;quot; then&lt;br /&gt;
   do task &amp;quot;A&amp;quot;&lt;br /&gt;
else if CPU=&amp;quot;b&amp;quot; then&lt;br /&gt;
   do task &amp;quot;B&amp;quot;&lt;br /&gt;
end if&lt;br /&gt;
...&lt;br /&gt;
end program&lt;br /&gt;
&amp;lt;/pre&amp;gt; &lt;br /&gt;
&amp;lt;p&amp;gt;The goal of the program is to accomplish some net total task (&amp;quot;A+B&amp;quot;). If we write the code as above and launch it on a 2-processor system, then the runtime environment will execute it as described above, with each CPU taking one of the two tasks.&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Data Parallel Model vs Task Parallel Model =&lt;br /&gt;
One important feature of the data-parallel programming model, or data parallelism (SIMD), is its single control flow: there is only one control processor that directs the activities of all the processing elements. In stark contrast to this is task parallelism (MIMD: Multiple Instruction, Multiple Data): characterized by multiple control flows, it allows the concurrent execution of multiple instruction streams, each manipulating its own data and serving a separate function. Below is a contrast between the data parallelism and task parallelism models from Wikipedia: [http://en.wikipedia.org/wiki/SIMD SIMD] and [http://en.wikipedia.org/wiki/MIMD MIMD]. In the following subsections we continue to compare and contrast features of the data-parallel and task-parallel models to help the reader understand the unique characteristics of the data-parallel programming model.&lt;br /&gt;
[[Image:Smid.png|frame|center|425px|contrast between data parallelism and task parallelism]]&lt;br /&gt;
&lt;br /&gt;
= Definitions =&lt;br /&gt;
&lt;br /&gt;
= References =&lt;br /&gt;
* David E. Culler, Jaswinder Pal Singh, and Anoop Gupta, [http://portal.acm.org/citation.cfm?id=550071 ''Parallel Computer Architecture: A Hardware/Software Approach,''] Morgan Kaufmann, 1999.&lt;br /&gt;
* Ian Foster, [http://www.mcs.anl.gov/~itf/dbpp/ ''Designing and Building Parallel Programs,''] Addison-Wesley, 1995.&lt;br /&gt;
* Magne Haveraaen, [http://portal.acm.org/citation.cfm?id=1239917 &amp;quot;Machine and collection abstractions for user-implemented data-parallel programming,&amp;quot;] ''Scientific Programming,'' 8(4):231-246, 2000.&lt;br /&gt;
* W. Daniel Hillis and Guy L. Steele, Jr., [http://portal.acm.org/citation.cfm?id=7903 &amp;quot;Data parallel algorithms,&amp;quot;] ''Communications of the ACM,'' 29(12):1170-1183, December 1986.&lt;br /&gt;
* Alexander C. Klaiber and Henry M. Levy, [http://portal.acm.org/citation.cfm?id=192020 &amp;quot;A comparison of message passing and shared memory architectures for data parallel programs,&amp;quot;] in ''Proceedings of the 21st Annual International Symposium on Computer Architecture,'' April 1994, pp. 94-105.&lt;br /&gt;
* Yan Solihin, ''Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems,'' Solihin Books, 2008.&lt;/div&gt;</summary>
		<author><name>Vrmanda</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch2_JR&amp;diff=43551</id>
		<title>CSC/ECE 506 Spring 2011/ch2 JR</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch2_JR&amp;diff=43551"/>
		<updated>2011-01-30T01:04:07Z</updated>

		<summary type="html">&lt;p&gt;Vrmanda: /* Description and Example */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Supplement to Chapter 2: The Data Parallel Programming Model=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= History =&lt;br /&gt;
As computer architectures have evolved, so have parallel programming models. The earliest advancements in parallel computers took advantage of bit-level parallelism.  These computers used vector processing, which required a shared memory programming model.  As performance returns from this architecture diminished, the emphasis was placed on instruction-level parallelism and the message passing model began to dominate.  Most recently, with the move to cluster-based machines, there has been an increased emphasis on thread-level parallelism. This has corresponded to an increased interest in the data parallel programming model.&lt;br /&gt;
&lt;br /&gt;
== Bit-level parallelism in the 1970's ==&lt;br /&gt;
The major performance improvements from computers during this time were due to the ability to execute 32-bit word size operations at one time ([[#References|Culler (1999), p. 15.]]).  The dominant supercomputers of the time, like the Cray and the ILLIAC IV, were mainly Single Instruction Multiple Data architectures and used a shared memory programming model.  They each used different forms of vector processing ([[#References|Culler (1999), p. 21.]]). &lt;br /&gt;
Development of the ILLIAC IV began in 1964 and wasn't finished until 1975 [http://en.wikipedia.org/wiki/ILLIAC_IV].  A central processor was connected to the main memory and delegated tasks to individual PEs, each of which had its own memory cache [http://archive.computerhistory.org/resources/text/Burroughs/Burroughs.ILLIAC%20IV.1974.102624911.pdf].  Each PE could operate on either an 8-, 32-, or 64-bit operand at a given time [http://archive.computerhistory.org/resources/text/Burroughs/Burroughs.ILLIAC%20IV.1974.102624911.pdf].&lt;br /&gt;
&lt;br /&gt;
The Cray machine was installed at Los Alamos National Laboratory in 1976 by Cray Research and had similar performance to the ILLIAC IV [http://en.wikipedia.org/wiki/ILLIAC_IV].  The Cray machine relied heavily on the use of registers instead of the array of individual processors used by the ILLIAC IV.  Each processor was connected to main memory and had a number of 64-bit registers used to perform operations [http://www.eecg.toronto.edu/~moshovos/ACA05/read/cray1.pdf].&lt;br /&gt;
&lt;br /&gt;
== Move to instruction-level parallelism in the 1980's ==&lt;br /&gt;
&lt;br /&gt;
Increasing the word size above 32-bits offered diminishing returns in terms of performance ([[#References|Culler (1999), p. 15.]]). In the mid-1980's the emphasis changed from bit-level parallelism to instruction-level parallelism, which involved increasing the number of instructions that could be executed at one time ([[#References|Culler (1999), p. 15.]]).  The message passing model allowed programmers the ability to divide up instructions in order to take advantage of this architecture. &lt;br /&gt;
&lt;br /&gt;
== Thread-level parallelism ==&lt;br /&gt;
The move to cluster-based machines in the past decade, has added another layer of complexity to parallelism.  Since computers could be located across a network from each other, there is more emphasis on software acting as a bridge [http://cobweb.ecn.purdue.edu/~pplinux/ppcluster.html]. This has led to a greater emphasis on thread- or task-level parallelism [http://en.wikipedia.org/wiki/Thread-level_parallelism]  and the addition of the data parallelism programming model to existing message passing or shared memory models [http://en.wikipedia.org/wiki/Thread-level_parallelism].  &lt;br /&gt;
&lt;br /&gt;
= Data Parallel Model =&lt;br /&gt;
One important feature of data-parallel programming model or data parallelism (SIMD) is the single control flow. Flynn's taxonomy classifies SIMD to be analogous to doing the same operation repeatedly over a large data set. There is only one control processor that directs the activities of all the processing elements. In a multiprocessor system executing a single set of instructions (SIMD), data parallelism is achieved when each processor performs the same task on different pieces of distributed data. In some situations, a single execution thread controls operations on all pieces of data. In others, different threads control the operation, but they execute the same code.&lt;br /&gt;
&lt;br /&gt;
== Description and Example ==&lt;br /&gt;
&lt;br /&gt;
This section shows a simple example adapted from the Solihin textbook (pp. 24-27) that illustrates the data-parallel programming model. Each of the code fragments below is written in a pseudo-code style.&lt;br /&gt;
&lt;br /&gt;
Suppose we want to perform the following task on an array &amp;lt;code&amp;gt;a&amp;lt;/code&amp;gt;: updating each element of &amp;lt;code&amp;gt;a&amp;lt;/code&amp;gt; by the product of itself and its index, and adding together the elements of &amp;lt;code&amp;gt;a&amp;lt;/code&amp;gt; into the variable &amp;lt;code&amp;gt;sum&amp;lt;/code&amp;gt;. The corresponding code is shown below.&lt;br /&gt;
&lt;br /&gt;
 // simple sequential task&lt;br /&gt;
 sum = 0;&lt;br /&gt;
 '''for''' (i = 0; i &amp;lt; a.length; i++)&lt;br /&gt;
 {&lt;br /&gt;
    a[i] = a[i] * i;&lt;br /&gt;
    sum = sum + a[i];&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
When we orchestrate the task using the data-parallel programming model, the program can be divided into two parts. The first part performs the same operations on separate elements of the array for each processing element (sometimes referred to as PE or pe), and the second part reorganizes data among all processing elements (In our example data reorganization is summing up values across different processing elements). Since data-parallel programming model only defines the overall effects of parallel steps, the second part can be accomplished either through shared memory or message passing. The three code fragments below are examples for the first part of the program, shared-memory version of the second part, and message passing for the second part, respectively.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 // data parallel programming: let each PE perform the same task on different pieces of distributed data&lt;br /&gt;
 pe_id = getid();&lt;br /&gt;
 my_sum = 0;&lt;br /&gt;
 '''for''' (i = pe_id; i &amp;lt; a.length; i += number_of_pe)         //separate elements of the array are assigned to each PE &lt;br /&gt;
 {&lt;br /&gt;
    a[i] = a[i] * i;&lt;br /&gt;
    my_sum = my_sum + a[i];                               //all PEs accumulate elements assigned to them into local variable my_sum&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In the above code, data parallelism is achieved by letting each processing element perform actions on array's separate elements, which are identified using the PE's id. For instance, if three processing elements are used then one processing element would start at i = 0, one would start at i = 1, and the last would start at i = 2. Since there are three processing elements then the index of the array for each will increase by three on each iteration until the task is complete (note that in our example elements assigned to each PE are interleaved instead of continuous). If the length of the array is a multiple of three then each processing element takes the same amount of time to execute its portion of the task.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The picture below illustrates how elements of the array are assigned among different PEs for the specific case: length of the array is 7 and there are 3 PEs available. Elements in the array are marked by their indexes (0 to 6). As shown in the picture, PE0 will work on elements with index 0, 3, 6; PE1 is in charge of elements with index 1, 4; and elements with index 2, 5 are assigned to PE2. In this way, these 3 PEs work collectively on the array, while each PE works on different elements. Thus, data parallelism is achieved.&lt;br /&gt;
&lt;br /&gt;
[[Image:506wiki1.png|frame|center|150px|Illustration of data parallel programming (adapted from [http://computing.llnl.gov/tutorials/parallel_comp/#ModelsData Introduction to Parallel Computing])]]&lt;br /&gt;
&lt;br /&gt;
== Comparison with Message Passing and Shared Memory ==&lt;br /&gt;
&lt;br /&gt;
The differences between the data parallel and task parallel programming paradigms are taken up in the Task Parallel Model section below; here we compare the data parallel model with the message passing and shared memory models.&lt;br /&gt;
&lt;br /&gt;
Although the shared memory and message passing models may be combined into hybrid approaches, the two models are fundamentally different ways of addressing the same problem (of access control to common data). In contrast, the data parallel model is concerned with a fundamentally different problem (how to divide work into parallel tasks). As such, the data parallel model may be used in conjunction with either the shared memory or the message passing model without conflict. In fact, Klaiber (1994) compares the performance of a number of data parallel programs implemented with both shared memory and message passing models.&lt;br /&gt;
As discussed in the previous section, one of the major advantages of combining the data parallel and message passing models is a reduction in the amount and complexity of communication required relative to a task parallel approach. Similarly, combining the data parallel and shared memory models tends to simplify and reduce the amount of synchronization required. If the task parallel code given above were modified from a message passing model to a shared memory model, the two threads would require 8 signals be sent between the threads (instead of 8 messages). In contrast, the data parallel code would require a single barrier before the local sums are added to compute the full sum.&lt;br /&gt;
Much as the shared memory model can benefit from specialized hardware, the data parallel programming model can as well. SIMD (single-instruction-multiple-data) processors are specifically designed to run data parallel algorithms. These processors perform a single instruction on many different data locations simultaneously. Modern examples include CUDA processors developed by nVidia and Cell processors developed by STI (Sony, Toshiba, and IBM). For the curious, example code for CUDA processors is provided in the Appendix. However, whereas the shared memory model can be a difficult and costly abstraction in the absence of hardware support, the data parallel model—like the message passing model—does not require hardware support.&lt;br /&gt;
Since data parallel code tends to simplify communication and synchronization, data parallel code may be easier to develop than a more task parallel approach. However, data parallel code also requires writing code to split program data into chunks and assign it to different threads. In addition, it is possible that a problem may not decompose easily into subproblems relying on largely independent chunks of data. In this case, it may be impractical or impossible to apply the data parallel model.&lt;br /&gt;
Once written, data parallel programs can scale easily to large numbers of processors. The data parallel model implicitly encourages data locality by having each thread work on a chunk of data. The regular data chunks also make it easier to reason about where to locate data and how to organize it.&lt;br /&gt;
&lt;br /&gt;
= Task Parallel Model =&lt;br /&gt;
Task parallelism is a form of parallelization in which different instructions are executed, either on the same data or on different data. It focuses on distributing the execution of processes (threads) across different parallel computing nodes. As part of the workflow, the execution threads communicate with one another as they work in order to share data.&lt;br /&gt;
&lt;br /&gt;
== Description and Example ==&lt;br /&gt;
Suppose the task to be accomplished is to compute the sum of the results produced by executing task 'A' and task 'B'. The following example illustrates how task parallelism can be achieved.&lt;br /&gt;
&lt;br /&gt;
The pseudo code below illustrates task parallelism:&lt;br /&gt;
&amp;lt;pre&amp;gt;program:&lt;br /&gt;
do &lt;br /&gt;
...&lt;br /&gt;
&lt;br /&gt;
if CPU=&amp;quot;a&amp;quot; then&lt;br /&gt;
   do task &amp;quot;A&amp;quot;&lt;br /&gt;
else if CPU=&amp;quot;b&amp;quot; then&lt;br /&gt;
   do task &amp;quot;B&amp;quot;&lt;br /&gt;
end if&lt;br /&gt;
&lt;br /&gt;
end program&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
If we write the code as above and launch it on a 2-processor system, then the runtime environment will execute it accordingly.&lt;br /&gt;
In an SPMD system, both CPUs will execute the code. In a parallel environment, both will have access to the same data. The &amp;quot;if&amp;quot; clause differentiates between the CPUs: CPU &amp;quot;a&amp;quot; will evaluate the &amp;quot;if&amp;quot; as true and CPU &amp;quot;b&amp;quot; will evaluate the &amp;quot;else if&amp;quot; as true, so each has its own task. Both CPUs then execute separate code blocks concurrently, each performing a different task.&lt;br /&gt;
Code executed by CPU &amp;quot;a&amp;quot;:&lt;br /&gt;
program:&lt;br /&gt;
...&lt;br /&gt;
do task &amp;quot;A&amp;quot;&lt;br /&gt;
...&lt;br /&gt;
end program&lt;br /&gt;
Code executed by CPU &amp;quot;b&amp;quot;:&lt;br /&gt;
program:&lt;br /&gt;
...&lt;br /&gt;
do task &amp;quot;B&amp;quot;&lt;br /&gt;
...&lt;br /&gt;
end program&lt;br /&gt;
This concept can now be generalized to any number of processors.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; &lt;br /&gt;
program:&lt;br /&gt;
...&lt;br /&gt;
if CPU=&amp;quot;a&amp;quot; then&lt;br /&gt;
   do task &amp;quot;A&amp;quot;&lt;br /&gt;
else if CPU=&amp;quot;b&amp;quot; then&lt;br /&gt;
   do task &amp;quot;B&amp;quot;&lt;br /&gt;
end if&lt;br /&gt;
...&lt;br /&gt;
end program&lt;br /&gt;
&amp;lt;/pre&amp;gt; &lt;br /&gt;
&amp;lt;p&amp;gt;The goal of the program is to do some net total task (&amp;quot;A+B&amp;quot;). If we write the code as above and launch it on a 2-processor system, then the runtime environment will execute it as follows.&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Data Parallel Model vs Task Parallel Model =&lt;br /&gt;
&lt;br /&gt;
= Definitions =&lt;br /&gt;
&lt;br /&gt;
= References =&lt;br /&gt;
* David E. Culler, Jaswinder Pal Singh, and Anoop Gupta, [http://portal.acm.org/citation.cfm?id=550071 ''Parallel Computer Architecture: A Hardware/Software Approach,''] Morgan Kaufmann, 1999.&lt;br /&gt;
* Ian Foster, [http://www.mcs.anl.gov/~itf/dbpp/ ''Designing and Building Parallel Programs,''] Addison-Wesley, 1995.&lt;br /&gt;
* Magne Haveraaen, [http://portal.acm.org/citation.cfm?id=1239917 &amp;quot;Machine and collection abstractions for user-implemented data-parallel programming,&amp;quot;] ''Scientific Programming,'' 8(4):231-246, 2000.&lt;br /&gt;
* W. Daniel Hillis and Guy L. Steele, Jr., [http://portal.acm.org/citation.cfm?id=7903 &amp;quot;Data parallel algorithms,&amp;quot;] ''Communications of the ACM,'' 29(12):1170-1183, December 1986.&lt;br /&gt;
* Alexander C. Klaiber and Henry M. Levy, [http://portal.acm.org/citation.cfm?id=192020 &amp;quot;A comparison of message passing and shared memory architectures for data parallel programs,&amp;quot;] in ''Proceedings of the 21st Annual International Symposium on Computer Architecture,'' April 1994, pp. 94-105.&lt;br /&gt;
* Yan Solihin, ''Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems,'' Solihin Books, 2008.&lt;/div&gt;</summary>
		<author><name>Vrmanda</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch2_JR&amp;diff=43550</id>
		<title>CSC/ECE 506 Spring 2011/ch2 JR</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch2_JR&amp;diff=43550"/>
		<updated>2011-01-30T01:03:01Z</updated>

		<summary type="html">&lt;p&gt;Vrmanda: /* Comparison with Message Passing and Shared Memory */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Supplement to Chapter 2: The Data Parallel Programming Model=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= History =&lt;br /&gt;
As computer architectures have evolved, so have parallel programming models. The earliest advancements in parallel computers took advantage of bit-level parallelism.  These computers used vector processing, which required a shared memory programming model.  As performance returns from this architecture diminished, the emphasis was placed on instruction-level parallelism and the message passing model began to dominate.  Most recently, with the move to cluster-based machines, there has been an increased emphasis on thread-level parallelism. This has corresponded to an increased interest in the data parallel programming model.&lt;br /&gt;
&lt;br /&gt;
== Bit-level parallelism in the 1970's ==&lt;br /&gt;
The major performance improvements from computers during this time were due to the ability to execute 32-bit word size operations at one time ([[#References|Culler (1999), p. 15.]]).  The dominant supercomputers of the time, like the Cray and the ILLIAC IV, were mainly Single Instruction Multiple Data architectures and used a shared memory programming model.  They each used different forms of vector processing ([[#References|Culler (1999), p. 21.]]). &lt;br /&gt;
Development of the ILLIAC IV began in 1964 and wasn't finished until 1975 [http://en.wikipedia.org/wiki/ILLIAC_IV].  A central processor was connected to the main memory and delegated tasks to individual PEs, each of which had its own memory cache [http://archive.computerhistory.org/resources/text/Burroughs/Burroughs.ILLIAC%20IV.1974.102624911.pdf].  Each PE could operate on either an 8-, 32-, or 64-bit operand at a given time [http://archive.computerhistory.org/resources/text/Burroughs/Burroughs.ILLIAC%20IV.1974.102624911.pdf].&lt;br /&gt;
&lt;br /&gt;
The Cray machine was installed at Los Alamos National Laboratory in 1976 by Cray Research and had similar performance to the ILLIAC IV [http://en.wikipedia.org/wiki/ILLIAC_IV].  The Cray machine relied heavily on the use of registers instead of the array of individual processors used by the ILLIAC IV.  Each processor was connected to main memory and had a number of 64-bit registers used to perform operations [http://www.eecg.toronto.edu/~moshovos/ACA05/read/cray1.pdf].&lt;br /&gt;
&lt;br /&gt;
== Move to instruction-level parallelism in the 1980's ==&lt;br /&gt;
&lt;br /&gt;
Increasing the word size above 32-bits offered diminishing returns in terms of performance ([[#References|Culler (1999), p. 15.]]). In the mid-1980's the emphasis changed from bit-level parallelism to instruction-level parallelism, which involved increasing the number of instructions that could be executed at one time ([[#References|Culler (1999), p. 15.]]).  The message passing model allowed programmers the ability to divide up instructions in order to take advantage of this architecture. &lt;br /&gt;
&lt;br /&gt;
== Thread-level parallelism ==&lt;br /&gt;
The move to cluster-based machines in the past decade, has added another layer of complexity to parallelism.  Since computers could be located across a network from each other, there is more emphasis on software acting as a bridge [http://cobweb.ecn.purdue.edu/~pplinux/ppcluster.html]. This has led to a greater emphasis on thread- or task-level parallelism [http://en.wikipedia.org/wiki/Thread-level_parallelism]  and the addition of the data parallelism programming model to existing message passing or shared memory models [http://en.wikipedia.org/wiki/Thread-level_parallelism].  &lt;br /&gt;
&lt;br /&gt;
= Data Parallel Model =&lt;br /&gt;
One important feature of data-parallel programming model or data parallelism (SIMD) is the single control flow. Flynn's taxonomy classifies SIMD to be analogous to doing the same operation repeatedly over a large data set. There is only one control processor that directs the activities of all the processing elements. In a multiprocessor system executing a single set of instructions (SIMD), data parallelism is achieved when each processor performs the same task on different pieces of distributed data. In some situations, a single execution thread controls operations on all pieces of data. In others, different threads control the operation, but they execute the same code.&lt;br /&gt;
&lt;br /&gt;
== Description and Example ==&lt;br /&gt;
&lt;br /&gt;
This section shows a simple example adapted from the Solihin textbook (pp. 24-27) that illustrates the data-parallel programming model. Each of the code fragments below is written in a pseudo-code style.&lt;br /&gt;
&lt;br /&gt;
Suppose we want to perform the following task on an array &amp;lt;code&amp;gt;a&amp;lt;/code&amp;gt;: updating each element of &amp;lt;code&amp;gt;a&amp;lt;/code&amp;gt; by the product of itself and its index, and adding together the elements of &amp;lt;code&amp;gt;a&amp;lt;/code&amp;gt; into the variable &amp;lt;code&amp;gt;sum&amp;lt;/code&amp;gt;. The corresponding code is shown below.&lt;br /&gt;
&lt;br /&gt;
 // simple sequential task&lt;br /&gt;
 sum = 0;&lt;br /&gt;
 '''for''' (i = 0; i &amp;lt; a.length; i++)&lt;br /&gt;
 {&lt;br /&gt;
    a[i] = a[i] * i;&lt;br /&gt;
    sum = sum + a[i];&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
When we orchestrate the task using the data-parallel programming model, the program can be divided into two parts. The first part performs the same operations on separate elements of the array for each processing element (sometimes referred to as PE or pe), and the second part reorganizes data among all processing elements (In our example data reorganization is summing up values across different processing elements). Since data-parallel programming model only defines the overall effects of parallel steps, the second part can be accomplished either through shared memory or message passing. The three code fragments below are examples for the first part of the program, shared-memory version of the second part, and message passing for the second part, respectively.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 // data parallel programming: let each PE perform the same task on different pieces of distributed data&lt;br /&gt;
 pe_id = getid();&lt;br /&gt;
 my_sum = 0;&lt;br /&gt;
 '''for''' (i = pe_id; i &amp;lt; a.length; i += number_of_pe)         //separate elements of the array are assigned to each PE &lt;br /&gt;
 {&lt;br /&gt;
    a[i] = a[i] * i;&lt;br /&gt;
    my_sum = my_sum + a[i];                               //all PEs accumulate elements assigned to them into local variable my_sum&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In the above code, data parallelism is achieved by letting each processing element perform actions on array's separate elements, which are identified using the PE's id. For instance, if three processing elements are used then one processing element would start at i = 0, one would start at i = 1, and the last would start at i = 2. Since there are three processing elements then the index of the array for each will increase by three on each iteration until the task is complete (note that in our example elements assigned to each PE are interleaved instead of continuous). If the length of the array is a multiple of three then each processing element takes the same amount of time to execute its portion of the task.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The picture below illustrates how elements of the array are assigned among different PEs for the specific case: length of the array is 7 and there are 3 PEs available. Elements in the array are marked by their indexes (0 to 6). As shown in the picture, PE0 will work on elements with index 0, 3, 6; PE1 is in charge of elements with index 1, 4; and elements with index 2, 5 are assigned to PE2. In this way, these 3 PEs work collectively on the array, while each PE works on different elements. Thus, data parallelism is achieved.&lt;br /&gt;
&lt;br /&gt;
[[Image:506wiki1.png|frame|center|150px|Illustration of data parallel programming (adapted from [http://computing.llnl.gov/tutorials/parallel_comp/#ModelsData Introduction to Parallel Computing])]]&lt;br /&gt;
&lt;br /&gt;
== Comparison with Message Passing and Shared Memory ==&lt;br /&gt;
&lt;br /&gt;
The differences between the data parallel and task parallel programming paradigms are taken up in the Task Parallel Model section below; here we compare the data parallel model with the message passing and shared memory models.&lt;br /&gt;
&lt;br /&gt;
Although the shared memory and message passing models may be combined into hybrid approaches, the two models are fundamentally different ways of addressing the same problem (of access control to common data). In contrast, the data parallel model is concerned with a fundamentally different problem (how to divide work into parallel tasks). As such, the data parallel model may be used in conjunction with either the shared memory or the message passing model without conflict. In fact, Klaiber (1994) compares the performance of a number of data parallel programs implemented with both shared memory and message passing models.&lt;br /&gt;
As discussed in the previous section, one of the major advantages of combining the data parallel and message passing models is a reduction in the amount and complexity of communication required relative to a task parallel approach. Similarly, combining the data parallel and shared memory models tends to simplify and reduce the amount of synchronization required. If the task parallel code given above were modified from a message passing model to a shared memory model, the two threads would require 8 signals be sent between the threads (instead of 8 messages). In contrast, the data parallel code would require a single barrier before the local sums are added to compute the full sum.&lt;br /&gt;
Much as the shared memory model can benefit from specialized hardware, the data parallel programming model can as well. SIMD (single-instruction-multiple-data) processors are specifically designed to run data parallel algorithms. These processors perform a single instruction on many different data locations simultaneously. Modern examples include CUDA processors developed by nVidia and Cell processors developed by STI (Sony, Toshiba, and IBM). For the curious, example code for CUDA processors is provided in the Appendix. However, whereas the shared memory model can be a difficult and costly abstraction in the absence of hardware support, the data parallel model—like the message passing model—does not require hardware support.&lt;br /&gt;
Since data parallel code tends to simplify communication and synchronization, data parallel code may be easier to develop than a more task parallel approach. However, data parallel code also requires writing code to split program data into chunks and assign it to different threads. In addition, it is possible that a problem may not decompose easily into subproblems relying on largely independent chunks of data. In this case, it may be impractical or impossible to apply the data parallel model.&lt;br /&gt;
Once written, data parallel programs can scale easily to large numbers of processors. The data parallel model implicitly encourages data locality by having each thread work on a chunk of data. The regular data chunks also make it easier to reason about where to locate data and how to organize it.&lt;br /&gt;
&lt;br /&gt;
= Task Parallel Model =&lt;br /&gt;
Task parallelism is a form of parallelization in which different instructions are executed, either on the same data or on different data. It focuses on distributing the execution of processes (threads) across different parallel computing nodes. As part of the workflow, the execution threads communicate with one another as they work in order to share data.&lt;br /&gt;
&lt;br /&gt;
== Description and Example ==&lt;br /&gt;
Suppose the task to be accomplished is to compute the sum of the results produced by executing task 'A' and task 'B'. The following example illustrates how task parallelism can be achieved.&lt;br /&gt;
&lt;br /&gt;
The pseudo code below illustrates task parallelism:&lt;br /&gt;
program:&lt;br /&gt;
do &lt;br /&gt;
...&lt;br /&gt;
&lt;br /&gt;
if CPU=&amp;quot;a&amp;quot; then&lt;br /&gt;
   do task &amp;quot;A&amp;quot;&lt;br /&gt;
else if CPU=&amp;quot;b&amp;quot; then&lt;br /&gt;
   do task &amp;quot;B&amp;quot;&lt;br /&gt;
end if&lt;br /&gt;
&lt;br /&gt;
end program&lt;br /&gt;
&lt;br /&gt;
If we write the code as above and launch it on a 2-processor system, then the runtime environment will execute it accordingly.&lt;br /&gt;
In an SPMD system, both CPUs will execute the code. In a parallel environment, both will have access to the same data. The &amp;quot;if&amp;quot; clause differentiates between the CPUs: CPU &amp;quot;a&amp;quot; will evaluate the &amp;quot;if&amp;quot; as true and CPU &amp;quot;b&amp;quot; will evaluate the &amp;quot;else if&amp;quot; as true, so each has its own task. Both CPUs then execute separate code blocks concurrently, each performing a different task.&lt;br /&gt;
Code executed by CPU &amp;quot;a&amp;quot;:&lt;br /&gt;
program:&lt;br /&gt;
...&lt;br /&gt;
do task &amp;quot;A&amp;quot;&lt;br /&gt;
...&lt;br /&gt;
end program&lt;br /&gt;
Code executed by CPU &amp;quot;b&amp;quot;:&lt;br /&gt;
program:&lt;br /&gt;
...&lt;br /&gt;
do task &amp;quot;B&amp;quot;&lt;br /&gt;
...&lt;br /&gt;
end program&lt;br /&gt;
This concept can now be generalized to any number of processors.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; &lt;br /&gt;
program:&lt;br /&gt;
...&lt;br /&gt;
if CPU=&amp;quot;a&amp;quot; then&lt;br /&gt;
   do task &amp;quot;A&amp;quot;&lt;br /&gt;
else if CPU=&amp;quot;b&amp;quot; then&lt;br /&gt;
   do task &amp;quot;B&amp;quot;&lt;br /&gt;
end if&lt;br /&gt;
...&lt;br /&gt;
end program&lt;br /&gt;
&amp;lt;/pre&amp;gt; &lt;br /&gt;
&amp;lt;p&amp;gt;The goal of the program is to do some net total task (&amp;quot;A+B&amp;quot;). If we write the code as above and launch it on a 2-processor system, then the runtime environment will execute it as follows.&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Data Parallel Model vs Task Parallel Model =&lt;br /&gt;
&lt;br /&gt;
= Definitions =&lt;br /&gt;
&lt;br /&gt;
= References =&lt;br /&gt;
* David E. Culler, Jaswinder Pal Singh, and Anoop Gupta, [http://portal.acm.org/citation.cfm?id=550071 ''Parallel Computer Architecture: A Hardware/Software Approach,''] Morgan Kaufmann, 1999.&lt;br /&gt;
* Ian Foster, [http://www.mcs.anl.gov/~itf/dbpp/ ''Designing and Building Parallel Programs,''] Addison-Wesley, 1995.&lt;br /&gt;
* Magne Haveraaen, [http://portal.acm.org/citation.cfm?id=1239917 &amp;quot;Machine and collection abstractions for user-implemented data-parallel programming,&amp;quot;] ''Scientific Programming,'' 8(4):231-246, 2000.&lt;br /&gt;
* W. Daniel Hillis and Guy L. Steele, Jr., [http://portal.acm.org/citation.cfm?id=7903 &amp;quot;Data parallel algorithms,&amp;quot;] ''Communications of the ACM,'' 29(12):1170-1183, December 1986.&lt;br /&gt;
* Alexander C. Klaiber and Henry M. Levy, [http://portal.acm.org/citation.cfm?id=192020 &amp;quot;A comparison of message passing and shared memory architectures for data parallel programs,&amp;quot;] in ''Proceedings of the 21st Annual International Symposium on Computer Architecture,'' April 1994, pp. 94-105.&lt;br /&gt;
* Yan Solihin, ''Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems,'' Solihin Books, 2008.&lt;/div&gt;</summary>
		<author><name>Vrmanda</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch2_JR&amp;diff=43549</id>
		<title>CSC/ECE 506 Spring 2011/ch2 JR</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch2_JR&amp;diff=43549"/>
		<updated>2011-01-30T01:02:06Z</updated>

		<summary type="html">&lt;p&gt;Vrmanda: /* Comparison with Message Passing and Shared Memory */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Supplement to Chapter 2: The Data Parallel Programming Model=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= History =&lt;br /&gt;
As computer architectures have evolved, so have parallel programming models. The earliest advancements in parallel computers took advantage of bit-level parallelism.  These computers used vector processing, which required a shared memory programming model.  As performance returns from this architecture diminished, the emphasis was placed on instruction-level parallelism and the message passing model began to dominate.  Most recently, with the move to cluster-based machines, there has been an increased emphasis on thread-level parallelism. This has corresponded to an increased interest in the data parallel programming model.&lt;br /&gt;
&lt;br /&gt;
== Bit-level parallelism in the 1970's ==&lt;br /&gt;
The major performance improvements from computers during this time were due to the ability to execute 32-bit word size operations at one time ([[#References|Culler (1999), p. 15.]]).  The dominant supercomputers of the time, like the Cray and the ILLIAC IV, were mainly Single Instruction Multiple Data architectures and used a shared memory programming model.  They each used different forms of vector processing ([[#References|Culler (1999), p. 21.]]). &lt;br /&gt;
Development of the ILLIAC IV began in 1964 and wasn't finished until 1975 [http://en.wikipedia.org/wiki/ILLIAC_IV].  A central processor was connected to the main memory and delegated tasks to individual PEs, each of which had its own memory cache [http://archive.computerhistory.org/resources/text/Burroughs/Burroughs.ILLIAC%20IV.1974.102624911.pdf].  Each PE could operate on either an 8-, 32-, or 64-bit operand at a given time [http://archive.computerhistory.org/resources/text/Burroughs/Burroughs.ILLIAC%20IV.1974.102624911.pdf].&lt;br /&gt;
&lt;br /&gt;
The Cray machine was installed at Los Alamos National Laboratory in 1976 by Cray Research and had similar performance to the ILLIAC IV [http://en.wikipedia.org/wiki/ILLIAC_IV].  The Cray machine relied heavily on the use of registers instead of the array of individual processors used by the ILLIAC IV.  Each processor was connected to main memory and had a number of 64-bit registers used to perform operations [http://www.eecg.toronto.edu/~moshovos/ACA05/read/cray1.pdf].&lt;br /&gt;
&lt;br /&gt;
== Move to instruction-level parallelism in the 1980's ==&lt;br /&gt;
&lt;br /&gt;
Increasing the word size above 32-bits offered diminishing returns in terms of performance ([[#References|Culler (1999), p. 15.]]). In the mid-1980's the emphasis changed from bit-level parallelism to instruction-level parallelism, which involved increasing the number of instructions that could be executed at one time ([[#References|Culler (1999), p. 15.]]).  The message passing model allowed programmers the ability to divide up instructions in order to take advantage of this architecture. &lt;br /&gt;
&lt;br /&gt;
== Thread-level parallelism ==&lt;br /&gt;
The move to cluster-based machines in the past decade has added another layer of complexity to parallelism.  Since computers may be located across a network from each other, there is more emphasis on software acting as a bridge [http://cobweb.ecn.purdue.edu/~pplinux/ppcluster.html]. This has led to a greater emphasis on thread- or task-level parallelism [http://en.wikipedia.org/wiki/Thread-level_parallelism] and to the addition of the data parallel programming model on top of existing message passing or shared memory models [http://en.wikipedia.org/wiki/Thread-level_parallelism].  &lt;br /&gt;
&lt;br /&gt;
= Data Parallel Model =&lt;br /&gt;
One important feature of the data-parallel programming model, or data parallelism (SIMD), is its single control flow. In Flynn's taxonomy, SIMD is analogous to performing the same operation repeatedly over a large data set: a single control processor directs the activities of all the processing elements. In a multiprocessor system executing a single set of instructions (SIMD), data parallelism is achieved when each processor performs the same task on a different piece of distributed data. In some situations, a single execution thread controls operations on all pieces of data. In others, different threads control the operation, but they execute the same code.&lt;br /&gt;
&lt;br /&gt;
== Description and Example ==&lt;br /&gt;
&lt;br /&gt;
This section shows a simple example, adapted from the Solihin textbook (pp. 24-27), that illustrates the data-parallel programming model. Each of the code fragments below is written in pseudocode.&lt;br /&gt;
&lt;br /&gt;
Suppose we want to perform the following task on an array &amp;lt;code&amp;gt;a&amp;lt;/code&amp;gt;: updating each element of &amp;lt;code&amp;gt;a&amp;lt;/code&amp;gt; by the product of itself and its index, and adding together the elements of &amp;lt;code&amp;gt;a&amp;lt;/code&amp;gt; into the variable &amp;lt;code&amp;gt;sum&amp;lt;/code&amp;gt;. The corresponding code is shown below.&lt;br /&gt;
&lt;br /&gt;
 // simple sequential task&lt;br /&gt;
 sum = 0;&lt;br /&gt;
 '''for''' (i = 0; i &amp;lt; a.length; i++)&lt;br /&gt;
 {&lt;br /&gt;
    a[i] = a[i] * i;&lt;br /&gt;
    sum = sum + a[i];&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
When we orchestrate the task using the data-parallel programming model, the program can be divided into two parts. The first part performs the same operations on separate elements of the array for each processing element (sometimes referred to as PE or pe), and the second part reorganizes data among all processing elements (in our example, the data reorganization is summing up values across the different processing elements). Since the data-parallel programming model only defines the overall effects of parallel steps, the second part can be accomplished either through shared memory or message passing. The three code fragments below show the first part of the program, a shared-memory version of the second part, and a message-passing version of the second part, respectively.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 // data parallel programming: let each PE perform the same task on different pieces of distributed data&lt;br /&gt;
 pe_id = getid();&lt;br /&gt;
 my_sum = 0;&lt;br /&gt;
 '''for''' (i = pe_id; i &amp;lt; a.length; i += number_of_pe)         //separate elements of the array are assigned to each PE &lt;br /&gt;
 {&lt;br /&gt;
    a[i] = a[i] * i;&lt;br /&gt;
    my_sum = my_sum + a[i];                               //all PEs accumulate elements assigned to them into local variable my_sum&lt;br /&gt;
 }&lt;br /&gt;
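&lt;br /&gt;
The second part of the program gathers the per-PE partial sums into &amp;lt;code&amp;gt;sum&amp;lt;/code&amp;gt;. The two fragments below are minimal sketches of how this reduction might look; &amp;lt;code&amp;gt;critical_section&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;barrier&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;send_msg&amp;lt;/code&amp;gt;, and &amp;lt;code&amp;gt;recv_msg&amp;lt;/code&amp;gt; are assumed pseudocode primitives rather than the API of any particular library.&lt;br /&gt;
&lt;br /&gt;
 // shared-memory sketch of the second part: each PE adds its partial sum to the shared variable sum&lt;br /&gt;
 begin critical_section;                                 //only one PE updates sum at a time&lt;br /&gt;
    sum = sum + my_sum;&lt;br /&gt;
 end critical_section;&lt;br /&gt;
 barrier;                                                //wait until every PE has contributed before sum is read&lt;br /&gt;
&lt;br /&gt;
 // message-passing sketch of the second part: PE 0 collects the partial sums from the other PEs&lt;br /&gt;
 if (pe_id != 0)&lt;br /&gt;
    send_msg(0, my_sum);                                 //every other PE sends its partial sum to PE 0&lt;br /&gt;
 else&lt;br /&gt;
 {&lt;br /&gt;
    for (p = 1; p &amp;lt; number_of_pe; p++)&lt;br /&gt;
    {&lt;br /&gt;
       recv_msg(p, temp);                                //receive one partial sum from PE p&lt;br /&gt;
       my_sum = my_sum + temp;&lt;br /&gt;
    }&lt;br /&gt;
    sum = my_sum;                                        //PE 0 now holds the complete sum&lt;br /&gt;
 }&lt;br /&gt;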
&lt;br /&gt;
&lt;br /&gt;
In the first code fragment above, data parallelism is achieved by letting each processing element perform actions on separate elements of the array, which are identified using the PE's id. For instance, if three processing elements are used, one processing element would start at i = 0, one would start at i = 1, and the last would start at i = 2. Since there are three processing elements, the index of the array for each increases by three on each iteration until the task is complete (note that in our example the elements assigned to each PE are interleaved rather than contiguous). If the length of the array is a multiple of three, each processing element takes the same amount of time to execute its portion of the task.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The picture below illustrates how elements of the array are assigned among the different PEs for the specific case where the length of the array is 7 and 3 PEs are available. Elements in the array are marked by their indices (0 to 6). As shown in the picture, PE0 works on elements with indices 0, 3, 6; PE1 is in charge of elements with indices 1, 4; and elements with indices 2, 5 are assigned to PE2. In this way, the 3 PEs work collectively on the array, while each PE works on different elements. Thus, data parallelism is achieved.&lt;br /&gt;
&lt;br /&gt;
[[Image:506wiki1.png|frame|center|150px|Illustration of data parallel programming (adapted from [http://computing.llnl.gov/tutorials/parallel_comp/#ModelsData Introduction to Parallel Computing])]]&lt;br /&gt;
&lt;br /&gt;
== Comparison with Message Passing and Shared Memory ==&lt;br /&gt;
&lt;br /&gt;
The differences between the data parallel paradigm and the message passing and shared memory paradigms come down to the problem each one addresses.&lt;br /&gt;
&lt;br /&gt;
Although the shared memory and message passing models may be combined into hybrid approaches, the two models are fundamentally different ways of addressing the same problem: controlling access to data that is common to several threads. Access control is the problem of coordinating reads and writes to shared data so that every thread sees a consistent view of it.&lt;br /&gt;
&lt;br /&gt;
In contrast, the data parallel model is concerned with a fundamentally different problem: how to divide work into parallel tasks. As such, the data parallel model may be used in conjunction with either the shared memory or the message passing model without conflict. In fact, Klaiber (1994) compares the performance of a number of data parallel programs implemented with both shared memory and message passing models.&lt;br /&gt;
One of the major advantages of combining the data parallel and message passing models is a reduction in the amount and complexity of communication required relative to a task parallel approach. Similarly, combining the data parallel and shared memory models tends to simplify and reduce the amount of synchronization required. If a task parallel version of this example were moved from a message passing model to a shared memory model, the two threads would require 8 signals to be sent between them (instead of 8 messages). In contrast, the data parallel code requires only a single barrier before the local sums are added to compute the full sum.&lt;br /&gt;
Much as the shared memory model can benefit from specialized hardware, so can the data parallel programming model. SIMD (single-instruction-multiple-data) processors are specifically designed to run data parallel algorithms: they perform a single instruction on many different data locations simultaneously. Modern examples include CUDA processors developed by NVIDIA and Cell processors developed by STI (Sony, Toshiba, and IBM). However, whereas the shared memory model can be a difficult and costly abstraction in the absence of hardware support, the data parallel model, like the message passing model, does not require hardware support.&lt;br /&gt;
Since data parallel code tends to simplify communication and synchronization, it may be easier to develop than a more task parallel approach. However, data parallel programming also requires writing code to split the program data into chunks and assign them to different threads. In addition, a problem may not decompose easily into subproblems that rely on largely independent chunks of data; in that case, it may be impractical or impossible to apply the data parallel model.&lt;br /&gt;
Once written, data parallel programs can scale easily to large numbers of processors. The data parallel model implicitly encourages data locality by having each thread work on a chunk of data. The regular data chunks also make it easier to reason about where to locate data and how to organize it.&lt;br /&gt;
&lt;br /&gt;
= Task Parallel Model =&lt;br /&gt;
Task parallelism is a form of parallelization in which different instructions are executed, either on the same data or on different data. It focuses on distributing the execution of processes (threads) across different parallel computing nodes. As part of the workflow, the different execution threads communicate with one another, sharing data as they work.&lt;br /&gt;
&lt;br /&gt;
== Description and Example ==&lt;br /&gt;
Suppose the task to be accomplished is to compute the sum of the results produced by executing instruction 'A' and instruction 'B' (a net task of &amp;quot;A+B&amp;quot;). The following example illustrates how task parallelism can be achieved.&lt;br /&gt;
&lt;br /&gt;
The pseudocode below illustrates task parallelism:&lt;br /&gt;
&lt;br /&gt;
 program:&lt;br /&gt;
 ...&lt;br /&gt;
 if CPU=&amp;quot;a&amp;quot; then&lt;br /&gt;
    do task &amp;quot;A&amp;quot;&lt;br /&gt;
 else if CPU=&amp;quot;b&amp;quot; then&lt;br /&gt;
    do task &amp;quot;B&amp;quot;&lt;br /&gt;
 end if&lt;br /&gt;
 ...&lt;br /&gt;
 end program&lt;br /&gt;
&lt;br /&gt;
If we write the code as above and launch it on a 2-processor system, the runtime environment executes it as follows.&lt;br /&gt;
In an SPMD system, both CPUs execute the code. In a parallel environment, both have access to the same data. The &amp;quot;if&amp;quot; clause differentiates between the CPUs. CPU &amp;quot;a&amp;quot; will evaluate the &amp;quot;if&amp;quot; as true and CPU &amp;quot;b&amp;quot; will evaluate the &amp;quot;else if&amp;quot; as true, so each has its own task. Both CPUs then execute separate code blocks simultaneously, performing different tasks.&lt;br /&gt;
Code executed by CPU &amp;quot;a&amp;quot;:&lt;br /&gt;
 program:&lt;br /&gt;
 ...&lt;br /&gt;
 do task &amp;quot;A&amp;quot;&lt;br /&gt;
 ...&lt;br /&gt;
 end program&lt;br /&gt;
&lt;br /&gt;
Code executed by CPU &amp;quot;b&amp;quot;:&lt;br /&gt;
 program:&lt;br /&gt;
 ...&lt;br /&gt;
 do task &amp;quot;B&amp;quot;&lt;br /&gt;
 ...&lt;br /&gt;
 end program&lt;br /&gt;
This concept can now be generalized to any number of processors.&lt;br /&gt;
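&lt;br /&gt;
One possible sketch of such a generalization is shown below, in the same pseudocode style as the earlier fragments; the task table &amp;lt;code&amp;gt;task_list&amp;lt;/code&amp;gt; and the id returned by &amp;lt;code&amp;gt;getid()&amp;lt;/code&amp;gt; are assumed names, not part of any particular runtime.&lt;br /&gt;
&lt;br /&gt;
 program:&lt;br /&gt;
 cpu_id = getid();                                       //each CPU learns its own id&lt;br /&gt;
 ...&lt;br /&gt;
 do task task_list[cpu_id]                               //run the task that the (assumed) task table assigns to this CPU&lt;br /&gt;
 ...&lt;br /&gt;
 end program&lt;br /&gt;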
&lt;br /&gt;
&lt;br /&gt;
= Data Parallel Model vs Task Parallel Model =&lt;br /&gt;
&lt;br /&gt;
= Definitions =&lt;br /&gt;
&lt;br /&gt;
= References =&lt;br /&gt;
* David E. Culler, Jaswinder Pal Singh, and Anoop Gupta, [http://portal.acm.org/citation.cfm?id=550071 ''Parallel Computer Architecture: A Hardware/Software Approach,''] Morgan Kaufmann, 1999.&lt;br /&gt;
* Ian Foster, [http://www.mcs.anl.gov/~itf/dbpp/ ''Designing and Building Parallel Programs,''] Addison-Wesley, 1995.&lt;br /&gt;
* Magne Haveraaen, [http://portal.acm.org/citation.cfm?id=1239917 &amp;quot;Machine and collection abstractions for user-implemented data-parallel programming,&amp;quot;] ''Scientific Programming,'' 8(4):231-246, 2000.&lt;br /&gt;
* W. Daniel Hillis and Guy L. Steele, Jr., [http://portal.acm.org/citation.cfm?id=7903 &amp;quot;Data parallel algorithms,&amp;quot;] ''Communications of the ACM,'' 29(12):1170-1183, December 1986.&lt;br /&gt;
* Alexander C. Klaiber and Henry M. Levy, [http://portal.acm.org/citation.cfm?id=192020 &amp;quot;A comparison of message passing and shared memory architectures for data parallel programs,&amp;quot;] in ''Proceedings of the 21st Annual International Symposium on Computer Architecture,'' April 1994, pp. 94-105.&lt;br /&gt;
* Yan Solihin, ''Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems,'' Solihin Books, 2008.&lt;/div&gt;</summary>
		<author><name>Vrmanda</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch2_JR&amp;diff=43548</id>
		<title>CSC/ECE 506 Spring 2011/ch2 JR</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch2_JR&amp;diff=43548"/>
		<updated>2011-01-30T01:01:20Z</updated>

		<summary type="html">&lt;p&gt;Vrmanda: /* Description and Example */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Supplement to Chapter 2: The Data Parallel Programming Model=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= History =&lt;br /&gt;
As computer architectures have evolved, so have parallel programming models. The earliest advancements in parallel computers took advantage of bit-level parallelism.  These computers used vector processing, which required a shared memory programming model.  As performance returns from this architecture diminished, the emphasis was placed on instruction-level parallelism and the message passing model began to dominate.  Most recently, with the move to cluster-based machines, there has been an increased emphasis on thread-level parallelism. This has corresponded to an increase interest in the data parallel programming model.&lt;br /&gt;
&lt;br /&gt;
== Bit-level parallelism in the 1970's ==&lt;br /&gt;
The major performance improvements from computers during this time were due to the ability to execute 32-bit word size operations at one time ([[#References|Culler (1999), p. 15.]]).  The dominant supercomputers of the time, like the Cray and the ILLIAC IV, were mainly Single Instruction Multiple Data architectures and used a shared memory programming model.  They each used different forms of vector processing ([[#References|Culler (1999), p. 21.]]). &lt;br /&gt;
Development of the ILLIAC IV began in 1964 and wasn't finished until 1975 [http://en.wikipedia.org/wiki/ILLIAC_IV].  A central processor was connected to the main memory and delegated tasks to individual PE's, which each had their own memory cache. [http://archive.computerhistory.org/resources/text/Burroughs/Burroughs.ILLIAC%20IV.1974.102624911.pdf].  Each PE could operate either an 8-, 32- or 64-bit operand at a given time [http://archive.computerhistory.org/resources/text/Burroughs/Burroughs.ILLIAC%20IV.1974.102624911.pdf].&lt;br /&gt;
&lt;br /&gt;
The Cray machine was installed at Los Alamos National Laborartory in1976 by Control Data Corporation and had similar performance to the ILLIAC IV [http://en.wikipedia.org/wiki/ILLIAC_IV].  The Cray machine relied heavily on the use of registers instead of individual processors like the ILLIAC IV.  Each processor was connected to main memory and had a number of 64-bit registers used to perform operations [http://www.eecg.toronto.edu/~moshovos/ACA05/read/cray1.pdf].&lt;br /&gt;
&lt;br /&gt;
== Move to instruction-level parallelism in the 1980's ==&lt;br /&gt;
&lt;br /&gt;
Increasing the word size above 32-bits offered diminishing returns in terms of performance ([[#References|Culler (1999), p. 15.]]). In the mid-1980's the emphasis changed from bit-level parallelism to instruction-level parallelism, which involved increasing the number of instructions that could be executed at one time ([[#References|Culler (1999), p. 15.]]).  The message passing model allowed programmers the ability to divide up instructions in order to take advantage of this architecture. &lt;br /&gt;
&lt;br /&gt;
== Thread-level parallelism ==&lt;br /&gt;
The move to cluster-based machines in the past decade, has added another layer of complexity to parallelism.  Since computers could be located across a network from each other, there is more emphasis on software acting as a bridge [http://cobweb.ecn.purdue.edu/~pplinux/ppcluster.html]. This has led to a greater emphasis on thread- or task-level parallelism [http://en.wikipedia.org/wiki/Thread-level_parallelism]  and the addition of the data parallelism programming model to existing message passing or shared memory models [http://en.wikipedia.org/wiki/Thread-level_parallelism].  &lt;br /&gt;
&lt;br /&gt;
= Data Parallel Model =&lt;br /&gt;
One important feature of data-parallel programming model or data parallelism (SIMD) is the single control flow. Flynn's taxonomy classifies SIMD to be analogous to doing the same operation repeatedly over a large data set. There is only one control processor that directs the activities of all the processing elements. In a multiprocessor system executing a single set of instructions (SIMD), data parallelism is achieved when each processor performs the same task on different pieces of distributed data. In some situations, a single execution thread controls operations on all pieces of data. In others, different threads control the operation, but they execute the same code.&lt;br /&gt;
&lt;br /&gt;
== Description and Example ==&lt;br /&gt;
&lt;br /&gt;
This section shows a simple example adapted from Solihin textbook (pp. 24 - 27) that illustrates  the data-parallel programming model. Each of the codes below are written in pseudo-code style.&lt;br /&gt;
&lt;br /&gt;
Suppose we want to perform the following task on an array &amp;lt;code&amp;gt;a&amp;lt;/code&amp;gt;: updating each element of &amp;lt;code&amp;gt;a&amp;lt;/code&amp;gt; by the product of itself and its index, and adding together the elements of &amp;lt;code&amp;gt;a&amp;lt;/code&amp;gt; into the variable &amp;lt;code&amp;gt;sum&amp;lt;/code&amp;gt;. The corresponding code is shown below.&lt;br /&gt;
&lt;br /&gt;
 // simple sequential task&lt;br /&gt;
 sum = 0;&lt;br /&gt;
 '''for''' (i = 0; i &amp;lt; a.length; i++)&lt;br /&gt;
 {&lt;br /&gt;
    a[i] = a[i] * i;&lt;br /&gt;
    sum = sum + a[i];&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
When we orchestrate the task using the data-parallel programming model, the program can be divided into two parts. The first part performs the same operations on separate elements of the array for each processing element (sometimes referred to as PE or pe), and the second part reorganizes data among all processing elements (In our example data reorganization is summing up values across different processing elements). Since data-parallel programming model only defines the overall effects of parallel steps, the second part can be accomplished either through shared memory or message passing. The three code fragments below are examples for the first part of the program, shared-memory version of the second part, and message passing for the second part, respectively.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 // data parallel programming: let each PE perform the same task on different pieces of distributed data&lt;br /&gt;
 pe_id = getid();&lt;br /&gt;
 my_sum = 0;&lt;br /&gt;
 '''for''' (i = pe_id; i &amp;lt; a.length; i += number_of_pe)         //separate elements of the array are assigned to each PE &lt;br /&gt;
 {&lt;br /&gt;
    a[i] = a[i] * i;&lt;br /&gt;
    my_sum = my_sum + a[i];                               //all PEs accumulate elements assigned to them into local variable my_sum&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In the above code, data parallelism is achieved by letting each processing element perform actions on array's separate elements, which are identified using the PE's id. For instance, if three processing elements are used then one processing element would start at i = 0, one would start at i = 1, and the last would start at i = 2. Since there are three processing elements then the index of the array for each will increase by three on each iteration until the task is complete (note that in our example elements assigned to each PE are interleaved instead of continuous). If the length of the array is a multiple of three then each processing element takes the same amount of time to execute its portion of the task.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The picture below illustrates how elements of the array are assigned among different PEs for the specific case: length of the array is 7 and there are 3 PEs available. Elements in the array are marked by their indexes (0 to 6). As shown in the picture, PE0 will work on elements with index 0, 3, 6; PE1 is in charge of elements with index 1, 4; and elements with index 2, 5 are assigned to PE2. In this way, these 3 PEs work collectively on the array, while each PE works on different elements. Thus, data parallelism is achieved.&lt;br /&gt;
&lt;br /&gt;
[[Image:506wiki1.png|frame|center|150px|Illustration of data parallel programming(adapted from [http://computing.llnl.gov/tutorials/parallel_comp/#ModelsData Introduction to Parallel Computing])]]&lt;br /&gt;
&lt;br /&gt;
== Comparison with Message Passing and Shared Memory ==&lt;br /&gt;
&lt;br /&gt;
All the differences that exist between a data parallel programming paradigm and task &lt;br /&gt;
&lt;br /&gt;
Although the shared memory and message passing models may be combined into hybrid approaches, the two models are fundamentally different ways of addressing the same problem (of access control to common data).&lt;br /&gt;
Access control is the &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 In contrast, the data parallel model is concerned with a fundamentally different problem (how to divide work into parallel tasks). As such, the data parallel model may be used in conjunction with either the shared memory or the message passing model without conflict. In fact, Klaiber (1994) compares the performance of a number of data parallel programs implemented with both shared memory and message passing models.&lt;br /&gt;
As discussed in the previous section, one of the major advantages of combining the data parallel and message passing models is a reduction in the amount and complexity of communication required relative to a task parallel approach. Similarly, combining the data parallel and shared memory models tends to simplify and reduce the amount of synchronization required. If the task parallel code given above were modified from a message passing model to a shared memory model, the two threads would require 8 signals be sent between the threads (instead of 8 messages). In contrast, the data parallel code would require a single barrier before the local sums are added to compute the full sum.&lt;br /&gt;
Much as the shared memory model can benefit from specialized hardware, the data parallel programming model can as well. SIMD (single-instruction-multiple-data) processors are specifically designed to run data parallel algorithms. These processors perform a single instruction on many different data locations simultaneously. Modern examples include CUDA processors developed by nVidia and Cell processors developed by STI (Sony, Toshiba, and IBM). For the curious, example code for CUDA processors is provided in the Appendix. However, whereas the shared memory model can be a difficult and costly abstraction in the absence of hardware support, the data parallel model—like the message passing model—does not require hardware support.&lt;br /&gt;
Since data parallel code tends to simplify communication and synchronization, data parallel code may be easier to develop than a more task parallel approach. However, data parallel code also requires writing code to split program data into chunks and assign it to different threads. In addition, it is possible that a problem may not decompose easily into subproblems relying on largely independent chunks of data. In this case, it may be impractical or impossible to apply the data parallel model.&lt;br /&gt;
Once written, data parallel programs can scale easily to large numbers of processors. The data parallel model implicitly encourages data locality by having each thread work on a chunk of data. The regular data chunks also make it easier to reason about where to locate data and how to organize it.&lt;br /&gt;
&lt;br /&gt;
= Task Parallel Model =&lt;br /&gt;
Task Parallelism is a form of parallelization where multiple instructions are executed either on same data or multiple data. It focuses on distributing execution of processes(threads) across different parallel computing nodes. As a part of workflow, different execution threads communicate with one another as they work to share data.&lt;br /&gt;
&lt;br /&gt;
== Description and Example ==&lt;br /&gt;
If the task to be accomplished is to compute the sum of the results associated with the execution of instruction 'A' and instructions 'B'. The following example illustrates, how task parallelism can be achieved.&lt;br /&gt;
&lt;br /&gt;
//The pseudo code below illustrates task parallelism:&lt;br /&gt;
program:&lt;br /&gt;
do &lt;br /&gt;
...&lt;br /&gt;
&lt;br /&gt;
if CPU=&amp;quot;a&amp;quot; then&lt;br /&gt;
   do task &amp;quot;A&amp;quot;&lt;br /&gt;
else if CPU=&amp;quot;b&amp;quot; then&lt;br /&gt;
   do task &amp;quot;B&amp;quot;&lt;br /&gt;
end if&lt;br /&gt;
&lt;br /&gt;
end program&lt;br /&gt;
&lt;br /&gt;
If we write the code as above and launch it on a 2-processor system, then the runtime environment will execute it accordingly.&lt;br /&gt;
In an SPMD system, both CPUs will execute the code. In a parallel environment, both will have access to the same data. The &amp;quot;if&amp;quot; clause differentiates between the CPU's. CPU &amp;quot;a&amp;quot; will read true on the &amp;quot;if&amp;quot; and CPU &amp;quot;b&amp;quot; will read true on the &amp;quot;else if&amp;quot;, thus having their own task. Now, both CPU's execute separate code blocks simultaneously, performing different tasks simultaneously.&lt;br /&gt;
Code executed by CPU &amp;quot;a&amp;quot;:&lt;br /&gt;
program:&lt;br /&gt;
...&lt;br /&gt;
do task &amp;quot;A&amp;quot;&lt;br /&gt;
...&lt;br /&gt;
end program&lt;br /&gt;
Code executed by CPU &amp;quot;b&amp;quot;:&lt;br /&gt;
program:&lt;br /&gt;
...&lt;br /&gt;
do task &amp;quot;B&amp;quot;&lt;br /&gt;
...&lt;br /&gt;
end program&lt;br /&gt;
This concept can now be generalized to any number of processors.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; &lt;br /&gt;
program:&lt;br /&gt;
...&lt;br /&gt;
if CPU=&amp;quot;a&amp;quot; then&lt;br /&gt;
   do task &amp;quot;A&amp;quot;&lt;br /&gt;
else if CPU=&amp;quot;b&amp;quot; then&lt;br /&gt;
   do task &amp;quot;B&amp;quot;&lt;br /&gt;
end if&lt;br /&gt;
...&lt;br /&gt;
end program&lt;br /&gt;
&amp;lt;/pre&amp;gt; &lt;br /&gt;
&amp;lt;p&amp;gt;The goal of the program is to do some net total task (&amp;quot;A+B&amp;quot;). If we write the code as above and launch it on a 2-processor system, then the runtime environment will execute it as follows.&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Data Parallel Model vs Task Parallel Model =&lt;br /&gt;
&lt;br /&gt;
= Definitions =&lt;br /&gt;
&lt;br /&gt;
= References =&lt;br /&gt;
* David E. Culler, Jaswinder Pal Singh, and Anoop Gupta, [http://portal.acm.org/citation.cfm?id=550071 ''Parallel Computer Architecture: A Hardware/Software Approach,''] Morgan-Kauffman, 1999.&lt;br /&gt;
* Ian Foster, [http://www.mcs.anl.gov/~itf/dbpp/ ''Designing and Building Parallel Programs,''] Addison-Wesley, 1995.&lt;br /&gt;
* Magne Haveraaen, [http://portal.acm.org/citation.cfm?id=1239917 &amp;quot;Machine and collection abstractions for user-implemented data-parallel programming,&amp;quot;] ''Scientific Programming,'' 8(4):231-246, 2000.&lt;br /&gt;
* W. Daniel Hillis and Guy L. Steele, Jr., [http://portal.acm.org/citation.cfm?id=7903 &amp;quot;Data parallel algorithms,&amp;quot;] ''Communications of the ACM,'' 29(12):1170-1183, December 1986.&lt;br /&gt;
* Alexander C. Klaiber and Henry M. Levy, [http://portal.acm.org/citation.cfm?id=192020 &amp;quot;A comparison of message passing and shared memory architectures for data parallel programs,&amp;quot;] in ''Proceedings of the 21st Annual International Symposium on Computer Architecture,'' April 1994, pp. 94-105.&lt;br /&gt;
* Yan Solihin, ''Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems,'' Solihin Books, 2008.&lt;/div&gt;</summary>
		<author><name>Vrmanda</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch2_JR&amp;diff=43547</id>
		<title>CSC/ECE 506 Spring 2011/ch2 JR</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch2_JR&amp;diff=43547"/>
		<updated>2011-01-30T01:00:34Z</updated>

		<summary type="html">&lt;p&gt;Vrmanda: /* Description and Example */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Supplement to Chapter 2: The Data Parallel Programming Model=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= History =&lt;br /&gt;
As computer architectures have evolved, so have parallel programming models. The earliest advancements in parallel computers took advantage of bit-level parallelism.  These computers used vector processing, which required a shared memory programming model.  As performance returns from this architecture diminished, the emphasis was placed on instruction-level parallelism and the message passing model began to dominate.  Most recently, with the move to cluster-based machines, there has been an increased emphasis on thread-level parallelism. This has corresponded to an increase interest in the data parallel programming model.&lt;br /&gt;
&lt;br /&gt;
== Bit-level parallelism in the 1970's ==&lt;br /&gt;
The major performance improvements from computers during this time were due to the ability to execute 32-bit word size operations at one time ([[#References|Culler (1999), p. 15.]]).  The dominant supercomputers of the time, like the Cray and the ILLIAC IV, were mainly Single Instruction Multiple Data architectures and used a shared memory programming model.  They each used different forms of vector processing ([[#References|Culler (1999), p. 21.]]). &lt;br /&gt;
Development of the ILLIAC IV began in 1964 and wasn't finished until 1975 [http://en.wikipedia.org/wiki/ILLIAC_IV].  A central processor was connected to the main memory and delegated tasks to individual PE's, which each had their own memory cache. [http://archive.computerhistory.org/resources/text/Burroughs/Burroughs.ILLIAC%20IV.1974.102624911.pdf].  Each PE could operate either an 8-, 32- or 64-bit operand at a given time [http://archive.computerhistory.org/resources/text/Burroughs/Burroughs.ILLIAC%20IV.1974.102624911.pdf].&lt;br /&gt;
&lt;br /&gt;
The Cray machine was installed at Los Alamos National Laborartory in1976 by Control Data Corporation and had similar performance to the ILLIAC IV [http://en.wikipedia.org/wiki/ILLIAC_IV].  The Cray machine relied heavily on the use of registers instead of individual processors like the ILLIAC IV.  Each processor was connected to main memory and had a number of 64-bit registers used to perform operations [http://www.eecg.toronto.edu/~moshovos/ACA05/read/cray1.pdf].&lt;br /&gt;
&lt;br /&gt;
== Move to instruction-level parallelism in the 1980's ==&lt;br /&gt;
&lt;br /&gt;
Increasing the word size above 32-bits offered diminishing returns in terms of performance ([[#References|Culler (1999), p. 15.]]). In the mid-1980's the emphasis changed from bit-level parallelism to instruction-level parallelism, which involved increasing the number of instructions that could be executed at one time ([[#References|Culler (1999), p. 15.]]).  The message passing model allowed programmers the ability to divide up instructions in order to take advantage of this architecture. &lt;br /&gt;
&lt;br /&gt;
== Thread-level parallelism ==&lt;br /&gt;
The move to cluster-based machines in the past decade, has added another layer of complexity to parallelism.  Since computers could be located across a network from each other, there is more emphasis on software acting as a bridge [http://cobweb.ecn.purdue.edu/~pplinux/ppcluster.html]. This has led to a greater emphasis on thread- or task-level parallelism [http://en.wikipedia.org/wiki/Thread-level_parallelism]  and the addition of the data parallelism programming model to existing message passing or shared memory models [http://en.wikipedia.org/wiki/Thread-level_parallelism].  &lt;br /&gt;
&lt;br /&gt;
= Data Parallel Model =&lt;br /&gt;
One important feature of data-parallel programming model or data parallelism (SIMD) is the single control flow. Flynn's taxonomy classifies SIMD to be analogous to doing the same operation repeatedly over a large data set. There is only one control processor that directs the activities of all the processing elements. In a multiprocessor system executing a single set of instructions (SIMD), data parallelism is achieved when each processor performs the same task on different pieces of distributed data. In some situations, a single execution thread controls operations on all pieces of data. In others, different threads control the operation, but they execute the same code.&lt;br /&gt;
&lt;br /&gt;
== Description and Example ==&lt;br /&gt;
&lt;br /&gt;
This section shows a simple example adapted from Solihin textbook (pp. 24 - 27) that illustrates  the data-parallel programming model. Each of the codes below are written in pseudo-code style.&lt;br /&gt;
&lt;br /&gt;
Suppose we want to perform the following task on an array &amp;lt;code&amp;gt;a&amp;lt;/code&amp;gt;: updating each element of &amp;lt;code&amp;gt;a&amp;lt;/code&amp;gt; by the product of itself and its index, and adding together the elements of &amp;lt;code&amp;gt;a&amp;lt;/code&amp;gt; into the variable &amp;lt;code&amp;gt;sum&amp;lt;/code&amp;gt;. The corresponding code is shown below.&lt;br /&gt;
&lt;br /&gt;
 // simple sequential task&lt;br /&gt;
 sum = 0;&lt;br /&gt;
 '''for''' (i = 0; i &amp;lt; a.length; i++)&lt;br /&gt;
 {&lt;br /&gt;
    a[i] = a[i] * i;&lt;br /&gt;
    sum = sum + a[i];&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
When we orchestrate the task using the data-parallel programming model, the program can be divided into two parts. The first part performs the same operations on separate elements of the array for each processing element (sometimes referred to as PE or pe), and the second part reorganizes data among all processing elements (In our example data reorganization is summing up values across different processing elements). Since data-parallel programming model only defines the overall effects of parallel steps, the second part can be accomplished either through shared memory or message passing. The three code fragments below are examples for the first part of the program, shared-memory version of the second part, and message passing for the second part, respectively.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 // data parallel programming: let each PE perform the same task on different pieces of distributed data&lt;br /&gt;
 pe_id = getid();&lt;br /&gt;
 my_sum = 0;&lt;br /&gt;
 '''for''' (i = pe_id; i &amp;lt; a.length; i += number_of_pe)         //separate elements of the array are assigned to each PE &lt;br /&gt;
 {&lt;br /&gt;
    a[i] = a[i] * i;&lt;br /&gt;
    my_sum = my_sum + a[i];                               //all PEs accumulate elements assigned to them into local variable my_sum&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In the above code, data parallelism is achieved by letting each processing element perform actions on array's separate elements, which are identified using the PE's id. For instance, if three processing elements are used then one processing element would start at i = 0, one would start at i = 1, and the last would start at i = 2. Since there are three processing elements then the index of the array for each will increase by three on each iteration until the task is complete (note that in our example elements assigned to each PE are interleaved instead of continuous). If the length of the array is a multiple of three then each processing element takes the same amount of time to execute its portion of the task.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The picture below illustrates how elements of the array are assigned among different PEs for the specific case: length of the array is 7 and there are 3 PEs available. Elements in the array are marked by their indexes (0 to 6). As shown in the picture, PE0 will work on elements with index 0, 3, 6; PE1 is in charge of elements with index 1, 4; and elements with index 2, 5 are assigned to PE2. In this way, these 3 PEs work collectively on the array, while each PE works on different elements. Thus, data parallelism is achieved.&lt;br /&gt;
&lt;br /&gt;
[[Image:506wiki1.png|frame|center|150px|Illustration of data parallel programming(adapted from [http://computing.llnl.gov/tutorials/parallel_comp/#ModelsData Introduction to Parallel Computing])]]&lt;br /&gt;
&lt;br /&gt;
== Comparison with Message Passing and Shared Memory ==&lt;br /&gt;
&lt;br /&gt;
All the differences that exist between a data parallel programming paradigm and task &lt;br /&gt;
&lt;br /&gt;
Although the shared memory and message passing models may be combined into hybrid approaches, the two models are fundamentally different ways of addressing the same problem (of access control to common data).&lt;br /&gt;
Access control is the &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 In contrast, the data parallel model is concerned with a fundamentally different problem (how to divide work into parallel tasks). As such, the data parallel model may be used in conjunction with either the shared memory or the message passing model without conflict. In fact, Klaiber (1994) compares the performance of a number of data parallel programs implemented with both shared memory and message passing models.&lt;br /&gt;
As discussed in the previous section, one of the major advantages of combining the data parallel and message passing models is a reduction in the amount and complexity of communication required relative to a task parallel approach. Similarly, combining the data parallel and shared memory models tends to simplify and reduce the amount of synchronization required. If the task parallel code given above were modified from a message passing model to a shared memory model, the two threads would require 8 signals be sent between the threads (instead of 8 messages). In contrast, the data parallel code would require a single barrier before the local sums are added to compute the full sum.&lt;br /&gt;
Much as the shared memory model can benefit from specialized hardware, the data parallel programming model can as well. SIMD (single-instruction-multiple-data) processors are specifically designed to run data parallel algorithms. These processors perform a single instruction on many different data locations simultaneously. Modern examples include CUDA processors developed by nVidia and Cell processors developed by STI (Sony, Toshiba, and IBM). For the curious, example code for CUDA processors is provided in the Appendix. However, whereas the shared memory model can be a difficult and costly abstraction in the absence of hardware support, the data parallel model—like the message passing model—does not require hardware support.&lt;br /&gt;
Since data parallel code tends to simplify communication and synchronization, data parallel code may be easier to develop than a more task parallel approach. However, data parallel code also requires writing code to split program data into chunks and assign it to different threads. In addition, it is possible that a problem may not decompose easily into subproblems relying on largely independent chunks of data. In this case, it may be impractical or impossible to apply the data parallel model.&lt;br /&gt;
Once written, data parallel programs can scale easily to large numbers of processors. The data parallel model implicitly encourages data locality by having each thread work on a chunk of data. The regular data chunks also make it easier to reason about where to locate data and how to organize it.&lt;br /&gt;
&lt;br /&gt;
= Task Parallel Model =&lt;br /&gt;
Task Parallelism is a form of parallelization where multiple instructions are executed either on same data or multiple data. It focuses on distributing execution of processes(threads) across different parallel computing nodes. As a part of workflow, different execution threads communicate with one another as they work to share data.&lt;br /&gt;
&lt;br /&gt;
== Description and Example ==&lt;br /&gt;
If the task to be accomplished is to compute the sum of the results associated with the execution of instruction 'A' and instructions 'B'. The following example illustrates, how task parallelism can be achieved.&lt;br /&gt;
&lt;br /&gt;
//The pseudo code below illustrates task parallelism:&lt;br /&gt;
program:&lt;br /&gt;
&lt;br /&gt;
if CPU=&amp;quot;a&amp;quot; then&lt;br /&gt;
   do task &amp;quot;A&amp;quot;&lt;br /&gt;
else if CPU=&amp;quot;b&amp;quot; then&lt;br /&gt;
   do task &amp;quot;B&amp;quot;&lt;br /&gt;
end if&lt;br /&gt;
&lt;br /&gt;
end program&lt;br /&gt;
&lt;br /&gt;
If we write the code as above and launch it on a 2-processor system, then the runtime environment will execute it accordingly.&lt;br /&gt;
In an SPMD system, both CPUs will execute the code. In a parallel environment, both will have access to the same data. The &amp;quot;if&amp;quot; clause differentiates between the CPU's. CPU &amp;quot;a&amp;quot; will read true on the &amp;quot;if&amp;quot; and CPU &amp;quot;b&amp;quot; will read true on the &amp;quot;else if&amp;quot;, thus having their own task. Now, both CPU's execute separate code blocks simultaneously, performing different tasks simultaneously.&lt;br /&gt;
Code executed by CPU &amp;quot;a&amp;quot;:&lt;br /&gt;
program:&lt;br /&gt;
...&lt;br /&gt;
do task &amp;quot;A&amp;quot;&lt;br /&gt;
...&lt;br /&gt;
end program&lt;br /&gt;
Code executed by CPU &amp;quot;b&amp;quot;:&lt;br /&gt;
program:&lt;br /&gt;
...&lt;br /&gt;
do task &amp;quot;B&amp;quot;&lt;br /&gt;
...&lt;br /&gt;
end program&lt;br /&gt;
This concept can now be generalized to any number of processors.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt; &lt;br /&gt;
program:&lt;br /&gt;
...&lt;br /&gt;
if CPU=&amp;quot;a&amp;quot; then&lt;br /&gt;
   do task &amp;quot;A&amp;quot;&lt;br /&gt;
else if CPU=&amp;quot;b&amp;quot; then&lt;br /&gt;
   do task &amp;quot;B&amp;quot;&lt;br /&gt;
end if&lt;br /&gt;
...&lt;br /&gt;
end program&lt;br /&gt;
&amp;lt;/pre&amp;gt; &lt;br /&gt;
&amp;lt;p&amp;gt;The goal of the program is to do some net total task (&amp;quot;A+B&amp;quot;). If we write the code as above and launch it on a 2-processor system, then the runtime environment will execute it as follows.&amp;lt;/p&amp;gt;&lt;br /&gt;
&lt;br /&gt;
= Data Parallel Model vs Task Parallel Model =&lt;br /&gt;
&lt;br /&gt;
= Definitions =&lt;br /&gt;
&lt;br /&gt;
= References =&lt;br /&gt;
* David E. Culler, Jaswinder Pal Singh, and Anoop Gupta, [http://portal.acm.org/citation.cfm?id=550071 ''Parallel Computer Architecture: A Hardware/Software Approach,''] Morgan-Kauffman, 1999.&lt;br /&gt;
* Ian Foster, [http://www.mcs.anl.gov/~itf/dbpp/ ''Designing and Building Parallel Programs,''] Addison-Wesley, 1995.&lt;br /&gt;
* Magne Haveraaen, [http://portal.acm.org/citation.cfm?id=1239917 &amp;quot;Machine and collection abstractions for user-implemented data-parallel programming,&amp;quot;] ''Scientific Programming,'' 8(4):231-246, 2000.&lt;br /&gt;
* W. Daniel Hillis and Guy L. Steele, Jr., [http://portal.acm.org/citation.cfm?id=7903 &amp;quot;Data parallel algorithms,&amp;quot;] ''Communications of the ACM,'' 29(12):1170-1183, December 1986.&lt;br /&gt;
* Alexander C. Klaiber and Henry M. Levy, [http://portal.acm.org/citation.cfm?id=192020 &amp;quot;A comparison of message passing and shared memory architectures for data parallel programs,&amp;quot;] in ''Proceedings of the 21st Annual International Symposium on Computer Architecture,'' April 1994, pp. 94-105.&lt;br /&gt;
* Yan Solihin, ''Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems,'' Solihin Books, 2008.&lt;/div&gt;</summary>
		<author><name>Vrmanda</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch2_JR&amp;diff=43546</id>
		<title>CSC/ECE 506 Spring 2011/ch2 JR</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch2_JR&amp;diff=43546"/>
		<updated>2011-01-30T00:58:07Z</updated>

		<summary type="html">&lt;p&gt;Vrmanda: /* Description and Example */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Supplement to Chapter 2: The Data Parallel Programming Model=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= History =&lt;br /&gt;
As computer architectures have evolved, so have parallel programming models. The earliest advancements in parallel computers took advantage of bit-level parallelism.  These computers used vector processing, which required a shared memory programming model.  As performance returns from this architecture diminished, the emphasis was placed on instruction-level parallelism and the message passing model began to dominate.  Most recently, with the move to cluster-based machines, there has been an increased emphasis on thread-level parallelism. This has corresponded to an increase interest in the data parallel programming model.&lt;br /&gt;
&lt;br /&gt;
== Bit-level parallelism in the 1970's ==&lt;br /&gt;
The major performance improvements from computers during this time were due to the ability to execute 32-bit word size operations at one time ([[#References|Culler (1999), p. 15.]]).  The dominant supercomputers of the time, like the Cray and the ILLIAC IV, were mainly Single Instruction Multiple Data architectures and used a shared memory programming model.  They each used different forms of vector processing ([[#References|Culler (1999), p. 21.]]). &lt;br /&gt;
Development of the ILLIAC IV began in 1964 and wasn't finished until 1975 [http://en.wikipedia.org/wiki/ILLIAC_IV].  A central processor was connected to the main memory and delegated tasks to individual PE's, which each had their own memory cache. [http://archive.computerhistory.org/resources/text/Burroughs/Burroughs.ILLIAC%20IV.1974.102624911.pdf].  Each PE could operate either an 8-, 32- or 64-bit operand at a given time [http://archive.computerhistory.org/resources/text/Burroughs/Burroughs.ILLIAC%20IV.1974.102624911.pdf].&lt;br /&gt;
&lt;br /&gt;
The Cray machine was installed at Los Alamos National Laborartory in1976 by Control Data Corporation and had similar performance to the ILLIAC IV [http://en.wikipedia.org/wiki/ILLIAC_IV].  The Cray machine relied heavily on the use of registers instead of individual processors like the ILLIAC IV.  Each processor was connected to main memory and had a number of 64-bit registers used to perform operations [http://www.eecg.toronto.edu/~moshovos/ACA05/read/cray1.pdf].&lt;br /&gt;
&lt;br /&gt;
== Move to instruction-level parallelism in the 1980's ==&lt;br /&gt;
&lt;br /&gt;
Increasing the word size above 32-bits offered diminishing returns in terms of performance ([[#References|Culler (1999), p. 15.]]). In the mid-1980's the emphasis changed from bit-level parallelism to instruction-level parallelism, which involved increasing the number of instructions that could be executed at one time ([[#References|Culler (1999), p. 15.]]).  The message passing model allowed programmers the ability to divide up instructions in order to take advantage of this architecture. &lt;br /&gt;
&lt;br /&gt;
== Thread-level parallelism ==&lt;br /&gt;
The move to cluster-based machines in the past decade, has added another layer of complexity to parallelism.  Since computers could be located across a network from each other, there is more emphasis on software acting as a bridge [http://cobweb.ecn.purdue.edu/~pplinux/ppcluster.html]. This has led to a greater emphasis on thread- or task-level parallelism [http://en.wikipedia.org/wiki/Thread-level_parallelism]  and the addition of the data parallelism programming model to existing message passing or shared memory models [http://en.wikipedia.org/wiki/Thread-level_parallelism].  &lt;br /&gt;
&lt;br /&gt;
= Data Parallel Model =&lt;br /&gt;
One important feature of data-parallel programming model or data parallelism (SIMD) is the single control flow. Flynn's taxonomy classifies SIMD to be analogous to doing the same operation repeatedly over a large data set. There is only one control processor that directs the activities of all the processing elements. In a multiprocessor system executing a single set of instructions (SIMD), data parallelism is achieved when each processor performs the same task on different pieces of distributed data. In some situations, a single execution thread controls operations on all pieces of data. In others, different threads control the operation, but they execute the same code.&lt;br /&gt;
&lt;br /&gt;
== Description and Example ==&lt;br /&gt;
&lt;br /&gt;
This section shows a simple example adapted from Solihin textbook (pp. 24 - 27) that illustrates  the data-parallel programming model. Each of the codes below are written in pseudo-code style.&lt;br /&gt;
&lt;br /&gt;
Suppose we want to perform the following task on an array &amp;lt;code&amp;gt;a&amp;lt;/code&amp;gt;: updating each element of &amp;lt;code&amp;gt;a&amp;lt;/code&amp;gt; by the product of itself and its index, and adding together the elements of &amp;lt;code&amp;gt;a&amp;lt;/code&amp;gt; into the variable &amp;lt;code&amp;gt;sum&amp;lt;/code&amp;gt;. The corresponding code is shown below.&lt;br /&gt;
&lt;br /&gt;
 // simple sequential task&lt;br /&gt;
 sum = 0;&lt;br /&gt;
 '''for''' (i = 0; i &amp;lt; a.length; i++)&lt;br /&gt;
 {&lt;br /&gt;
    a[i] = a[i] * i;&lt;br /&gt;
    sum = sum + a[i];&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
When we orchestrate the task using the data-parallel programming model, the program can be divided into two parts. The first part performs the same operations on separate elements of the array for each processing element (sometimes referred to as PE or pe), and the second part reorganizes data among all processing elements (In our example data reorganization is summing up values across different processing elements). Since data-parallel programming model only defines the overall effects of parallel steps, the second part can be accomplished either through shared memory or message passing. The three code fragments below are examples for the first part of the program, shared-memory version of the second part, and message passing for the second part, respectively.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 // data parallel programming: let each PE perform the same task on different pieces of distributed data&lt;br /&gt;
 pe_id = getid();&lt;br /&gt;
 my_sum = 0;&lt;br /&gt;
 '''for''' (i = pe_id; i &amp;lt; a.length; i += number_of_pe)         //separate elements of the array are assigned to each PE &lt;br /&gt;
 {&lt;br /&gt;
    a[i] = a[i] * i;&lt;br /&gt;
    my_sum = my_sum + a[i];                               //all PEs accumulate elements assigned to them into local variable my_sum&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In the above code, data parallelism is achieved by letting each processing element perform actions on array's separate elements, which are identified using the PE's id. For instance, if three processing elements are used then one processing element would start at i = 0, one would start at i = 1, and the last would start at i = 2. Since there are three processing elements then the index of the array for each will increase by three on each iteration until the task is complete (note that in our example elements assigned to each PE are interleaved instead of continuous). If the length of the array is a multiple of three then each processing element takes the same amount of time to execute its portion of the task.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The picture below illustrates how elements of the array are assigned among different PEs for the specific case: length of the array is 7 and there are 3 PEs available. Elements in the array are marked by their indexes (0 to 6). As shown in the picture, PE0 will work on elements with index 0, 3, 6; PE1 is in charge of elements with index 1, 4; and elements with index 2, 5 are assigned to PE2. In this way, these 3 PEs work collectively on the array, while each PE works on different elements. Thus, data parallelism is achieved.&lt;br /&gt;
&lt;br /&gt;
[[Image:506wiki1.png|frame|center|150px|Illustration of data parallel programming(adapted from [http://computing.llnl.gov/tutorials/parallel_comp/#ModelsData Introduction to Parallel Computing])]]&lt;br /&gt;
&lt;br /&gt;
== Comparison with Message Passing and Shared Memory ==&lt;br /&gt;
&lt;br /&gt;
All the differences that exist between a data parallel programming paradigm and task &lt;br /&gt;
&lt;br /&gt;
Although the shared memory and message passing models may be combined into hybrid approaches, the two models are fundamentally different ways of addressing the same problem (of access control to common data).&lt;br /&gt;
Access control is the &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 In contrast, the data parallel model is concerned with a fundamentally different problem (how to divide work into parallel tasks). As such, the data parallel model may be used in conjunction with either the shared memory or the message passing model without conflict. In fact, Klaiber (1994) compares the performance of a number of data parallel programs implemented with both shared memory and message passing models.&lt;br /&gt;
As discussed in the previous section, one of the major advantages of combining the data parallel and message passing models is a reduction in the amount and complexity of communication required relative to a task parallel approach. Similarly, combining the data parallel and shared memory models tends to simplify and reduce the amount of synchronization required. If the task parallel code given above were modified from a message passing model to a shared memory model, the two threads would require 8 signals be sent between the threads (instead of 8 messages). In contrast, the data parallel code would require a single barrier before the local sums are added to compute the full sum.&lt;br /&gt;
Much as the shared memory model can benefit from specialized hardware, the data parallel programming model can as well. SIMD (single-instruction-multiple-data) processors are specifically designed to run data parallel algorithms. These processors perform a single instruction on many different data locations simultaneously. Modern examples include CUDA processors developed by nVidia and Cell processors developed by STI (Sony, Toshiba, and IBM). For the curious, example code for CUDA processors is provided in the Appendix. However, whereas the shared memory model can be a difficult and costly abstraction in the absence of hardware support, the data parallel model—like the message passing model—does not require hardware support.&lt;br /&gt;
Since data parallel code tends to simplify communication and synchronization, data parallel code may be easier to develop than a more task parallel approach. However, data parallel code also requires writing code to split program data into chunks and assign it to different threads. In addition, it is possible that a problem may not decompose easily into subproblems relying on largely independent chunks of data. In this case, it may be impractical or impossible to apply the data parallel model.&lt;br /&gt;
Once written, data parallel programs can scale easily to large numbers of processors. The data parallel model implicitly encourages data locality by having each thread work on a chunk of data. The regular data chunks also make it easier to reason about where to locate data and how to organize it.&lt;br /&gt;
&lt;br /&gt;
= Task Parallel Model =&lt;br /&gt;
Task Parallelism is a form of parallelization where multiple instructions are executed either on same data or multiple data. It focuses on distributing execution of processes(threads) across different parallel computing nodes. As a part of workflow, different execution threads communicate with one another as they work to share data.&lt;br /&gt;
&lt;br /&gt;
== Description and Example ==&lt;br /&gt;
If the task to be accomplished is to compute the sum of the results associated with the execution of instruction 'A' and instructions 'B'. The following example illustrates, how task parallelism can be achieved.&lt;br /&gt;
&lt;br /&gt;
The pseudo code below illustrates task parallelism:&lt;br /&gt;
program:&lt;br /&gt;
&lt;br /&gt;
if CPU=&amp;quot;a&amp;quot; then&lt;br /&gt;
   do task &amp;quot;A&amp;quot;&lt;br /&gt;
else if CPU=&amp;quot;b&amp;quot; then&lt;br /&gt;
   do task &amp;quot;B&amp;quot;&lt;br /&gt;
end if&lt;br /&gt;
&lt;br /&gt;
end program&lt;br /&gt;
&lt;br /&gt;
The goal of the program is to accomplish some net total task (&amp;quot;A+B&amp;quot;). If we write the code as above and launch it on a 2-processor system, the runtime environment will execute it as follows.&lt;br /&gt;
In an SPMD system, both CPUs execute the code, and in a parallel environment both have access to the same data. The &amp;quot;if&amp;quot; clause differentiates between the CPUs: CPU &amp;quot;a&amp;quot; evaluates the &amp;quot;if&amp;quot; as true, and CPU &amp;quot;b&amp;quot; evaluates the &amp;quot;else if&amp;quot; as true, so each has its own task. Both CPUs then execute their separate code blocks simultaneously, performing different tasks at the same time.&lt;br /&gt;
Code executed by CPU &amp;quot;a&amp;quot;:&lt;br /&gt;
program:&lt;br /&gt;
...&lt;br /&gt;
do task &amp;quot;A&amp;quot;&lt;br /&gt;
...&lt;br /&gt;
end program&lt;br /&gt;
Code executed by CPU &amp;quot;b&amp;quot;:&lt;br /&gt;
program:&lt;br /&gt;
...&lt;br /&gt;
do task &amp;quot;B&amp;quot;&lt;br /&gt;
...&lt;br /&gt;
end program&lt;br /&gt;
This concept can now be generalized to any number of processors.&lt;br /&gt;
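&lt;br /&gt;
A minimal C sketch of the same idea, using POSIX threads, is given below. It is only an illustration: the task bodies, their return values, and the helper names are invented here, and the branch on the thread id plays the role of the &amp;quot;if CPU&amp;quot; test in the pseudocode.&lt;br /&gt;
&lt;br /&gt;
 // sketch: two threads perform different tasks, and their results are combined&lt;br /&gt;
 #include &amp;lt;pthread.h&amp;gt;&lt;br /&gt;
 #include &amp;lt;stdio.h&amp;gt;&lt;br /&gt;
 static int task_A(void) { return 40; }    // placeholder for task &amp;quot;A&amp;quot;&lt;br /&gt;
 static int task_B(void) { return 2; }     // placeholder for task &amp;quot;B&amp;quot;&lt;br /&gt;
 static int result[2];&lt;br /&gt;
 static void *run(void *arg)&lt;br /&gt;
 {&lt;br /&gt;
     int cpu = *(int *)arg;&lt;br /&gt;
     if (cpu == 0)&lt;br /&gt;
         result[cpu] = task_A();           // CPU &amp;quot;a&amp;quot; does task &amp;quot;A&amp;quot;&lt;br /&gt;
     else&lt;br /&gt;
         result[cpu] = task_B();           // CPU &amp;quot;b&amp;quot; does task &amp;quot;B&amp;quot;&lt;br /&gt;
     return NULL;&lt;br /&gt;
 }&lt;br /&gt;
 int main(void)&lt;br /&gt;
 {&lt;br /&gt;
     pthread_t t[2];&lt;br /&gt;
     int id[2] = {0, 1};&lt;br /&gt;
     int i;&lt;br /&gt;
     for (i = 0; i &amp;lt; 2; i++)&lt;br /&gt;
         pthread_create(&amp;amp;t[i], NULL, run, &amp;amp;id[i]);&lt;br /&gt;
     for (i = 0; i &amp;lt; 2; i++)&lt;br /&gt;
         pthread_join(t[i], NULL);&lt;br /&gt;
     printf(&amp;quot;A+B = %d\n&amp;quot;, result[0] + result[1]);   // the net total task&lt;br /&gt;
     return 0;&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Generalizing to more processors amounts to adding more branches, or to replacing the branch with a table of task functions indexed by the thread id.&lt;br /&gt;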
&lt;br /&gt;
&lt;br /&gt;
= Data Parallel Model vs Task Parallel Model =&lt;br /&gt;
&lt;br /&gt;
= Definitions =&lt;br /&gt;
&lt;br /&gt;
= References =&lt;br /&gt;
* David E. Culler, Jaswinder Pal Singh, and Anoop Gupta, [http://portal.acm.org/citation.cfm?id=550071 ''Parallel Computer Architecture: A Hardware/Software Approach,''] Morgan-Kauffman, 1999.&lt;br /&gt;
* Ian Foster, [http://www.mcs.anl.gov/~itf/dbpp/ ''Designing and Building Parallel Programs,''] Addison-Wesley, 1995.&lt;br /&gt;
* Magne Haveraaen, [http://portal.acm.org/citation.cfm?id=1239917 &amp;quot;Machine and collection abstractions for user-implemented data-parallel programming,&amp;quot;] ''Scientific Programming,'' 8(4):231-246, 2000.&lt;br /&gt;
* W. Daniel Hillis and Guy L. Steele, Jr., [http://portal.acm.org/citation.cfm?id=7903 &amp;quot;Data parallel algorithms,&amp;quot;] ''Communications of the ACM,'' 29(12):1170-1183, December 1986.&lt;br /&gt;
* Alexander C. Klaiber and Henry M. Levy, [http://portal.acm.org/citation.cfm?id=192020 &amp;quot;A comparison of message passing and shared memory architectures for data parallel programs,&amp;quot;] in ''Proceedings of the 21st Annual International Symposium on Computer Architecture,'' April 1994, pp. 94-105.&lt;br /&gt;
* Yan Solihin, ''Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems,'' Solihin Books, 2008.&lt;/div&gt;</summary>
		<author><name>Vrmanda</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch2_JR&amp;diff=43545</id>
		<title>CSC/ECE 506 Spring 2011/ch2 JR</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch2_JR&amp;diff=43545"/>
		<updated>2011-01-30T00:57:36Z</updated>

		<summary type="html">&lt;p&gt;Vrmanda: /* Description and Example */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Supplement to Chapter 2: The Data Parallel Programming Model=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= History =&lt;br /&gt;
As computer architectures have evolved, so have parallel programming models. The earliest advancements in parallel computers took advantage of bit-level parallelism.  These computers used vector processing, which required a shared memory programming model.  As performance returns from this architecture diminished, the emphasis was placed on instruction-level parallelism and the message passing model began to dominate.  Most recently, with the move to cluster-based machines, there has been an increased emphasis on thread-level parallelism. This has corresponded to an increased interest in the data parallel programming model.&lt;br /&gt;
&lt;br /&gt;
== Bit-level parallelism in the 1970's ==&lt;br /&gt;
The major performance improvements from computers during this time were due to the ability to execute 32-bit word size operations at one time ([[#References|Culler (1999), p. 15.]]).  The dominant supercomputers of the time, like the Cray and the ILLIAC IV, were mainly Single Instruction Multiple Data architectures and used a shared memory programming model.  They each used different forms of vector processing ([[#References|Culler (1999), p. 21.]]). &lt;br /&gt;
Development of the ILLIAC IV began in 1964 and wasn't finished until 1975 [http://en.wikipedia.org/wiki/ILLIAC_IV].  A central processor was connected to the main memory and delegated tasks to individual PE's, which each had their own memory cache. [http://archive.computerhistory.org/resources/text/Burroughs/Burroughs.ILLIAC%20IV.1974.102624911.pdf].  Each PE could operate either an 8-, 32- or 64-bit operand at a given time [http://archive.computerhistory.org/resources/text/Burroughs/Burroughs.ILLIAC%20IV.1974.102624911.pdf].&lt;br /&gt;
&lt;br /&gt;
The Cray machine was installed at Los Alamos National Laboratory in 1976 by Cray Research and had similar performance to the ILLIAC IV [http://en.wikipedia.org/wiki/ILLIAC_IV].  The Cray machine relied heavily on the use of registers rather than individual processing elements as in the ILLIAC IV.  Each processor was connected to main memory and had a number of 64-bit registers used to perform operations [http://www.eecg.toronto.edu/~moshovos/ACA05/read/cray1.pdf].&lt;br /&gt;
&lt;br /&gt;
== Move to instruction-level parallelism in the 1980's ==&lt;br /&gt;
&lt;br /&gt;
Increasing the word size above 32-bits offered diminishing returns in terms of performance ([[#References|Culler (1999), p. 15.]]). In the mid-1980's the emphasis changed from bit-level parallelism to instruction-level parallelism, which involved increasing the number of instructions that could be executed at one time ([[#References|Culler (1999), p. 15.]]).  The message passing model allowed programmers the ability to divide up instructions in order to take advantage of this architecture. &lt;br /&gt;
&lt;br /&gt;
== Thread-level parallelism ==&lt;br /&gt;
The move to cluster-based machines in the past decade, has added another layer of complexity to parallelism.  Since computers could be located across a network from each other, there is more emphasis on software acting as a bridge [http://cobweb.ecn.purdue.edu/~pplinux/ppcluster.html]. This has led to a greater emphasis on thread- or task-level parallelism [http://en.wikipedia.org/wiki/Thread-level_parallelism]  and the addition of the data parallelism programming model to existing message passing or shared memory models [http://en.wikipedia.org/wiki/Thread-level_parallelism].  &lt;br /&gt;
&lt;br /&gt;
= Data Parallel Model =&lt;br /&gt;
One important feature of data-parallel programming model or data parallelism (SIMD) is the single control flow. Flynn's taxonomy classifies SIMD to be analogous to doing the same operation repeatedly over a large data set. There is only one control processor that directs the activities of all the processing elements. In a multiprocessor system executing a single set of instructions (SIMD), data parallelism is achieved when each processor performs the same task on different pieces of distributed data. In some situations, a single execution thread controls operations on all pieces of data. In others, different threads control the operation, but they execute the same code.&lt;br /&gt;
&lt;br /&gt;
== Description and Example ==&lt;br /&gt;
&lt;br /&gt;
This section shows a simple example, adapted from the Solihin textbook (pp. 24-27), that illustrates the data-parallel programming model. Each of the code fragments below is written in pseudo-code style.&lt;br /&gt;
&lt;br /&gt;
Suppose we want to perform the following task on an array &amp;lt;code&amp;gt;a&amp;lt;/code&amp;gt;: updating each element of &amp;lt;code&amp;gt;a&amp;lt;/code&amp;gt; by the product of itself and its index, and adding together the elements of &amp;lt;code&amp;gt;a&amp;lt;/code&amp;gt; into the variable &amp;lt;code&amp;gt;sum&amp;lt;/code&amp;gt;. The corresponding code is shown below.&lt;br /&gt;
&lt;br /&gt;
 // simple sequential task&lt;br /&gt;
 sum = 0;&lt;br /&gt;
 '''for''' (i = 0; i &amp;lt; a.length; i++)&lt;br /&gt;
 {&lt;br /&gt;
    a[i] = a[i] * i;&lt;br /&gt;
    sum = sum + a[i];&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
When we orchestrate the task using the data-parallel programming model, the program can be divided into two parts. The first part performs the same operations on separate elements of the array for each processing element (sometimes referred to as PE or pe), and the second part reorganizes data among all processing elements (In our example data reorganization is summing up values across different processing elements). Since data-parallel programming model only defines the overall effects of parallel steps, the second part can be accomplished either through shared memory or message passing. The three code fragments below are examples for the first part of the program, shared-memory version of the second part, and message passing for the second part, respectively.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 // data parallel programming: let each PE perform the same task on different pieces of distributed data&lt;br /&gt;
 pe_id = getid();&lt;br /&gt;
 my_sum = 0;&lt;br /&gt;
 '''for''' (i = pe_id; i &amp;lt; a.length; i += number_of_pe)         //separate elements of the array are assigned to each PE &lt;br /&gt;
 {&lt;br /&gt;
    a[i] = a[i] * i;&lt;br /&gt;
    my_sum = my_sum + a[i];                               //all PEs accumulate elements assigned to them into local variable my_sum&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In the above code, data parallelism is achieved by letting each processing element perform actions on separate elements of the array, which are identified using the PE's id. For instance, if three processing elements are used, then one processing element would start at i = 0, one would start at i = 1, and the last would start at i = 2. Since there are three processing elements, the array index for each will increase by three on each iteration until the task is complete (note that in our example the elements assigned to each PE are interleaved rather than contiguous). If the length of the array is a multiple of three, each processing element takes the same amount of time to execute its portion of the task.&lt;br /&gt;
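&lt;br /&gt;
As a quick check of this assignment, the short C sketch below (written for this page rather than taken from the Solihin text) prints which indices each PE would touch for an array of length 7 and 3 PEs.&lt;br /&gt;
&lt;br /&gt;
 // sketch: print the interleaved assignment of array indices to PEs&lt;br /&gt;
 #include &amp;lt;stdio.h&amp;gt;&lt;br /&gt;
 int main(void)&lt;br /&gt;
 {&lt;br /&gt;
     int length = 7, number_of_pe = 3, pe_id, i;&lt;br /&gt;
     for (pe_id = 0; pe_id &amp;lt; number_of_pe; pe_id++) {&lt;br /&gt;
         printf(&amp;quot;PE%d:&amp;quot;, pe_id);&lt;br /&gt;
         for (i = pe_id; i &amp;lt; length; i += number_of_pe)   // same loop bounds as the pseudocode&lt;br /&gt;
             printf(&amp;quot; %d&amp;quot;, i);&lt;br /&gt;
         printf(&amp;quot;\n&amp;quot;);&lt;br /&gt;
     }&lt;br /&gt;
     return 0;&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Running it prints indices 0, 3, 6 for PE0; 1, 4 for PE1; and 2, 5 for PE2, which is the distribution shown in the picture below.&lt;br /&gt;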
&lt;br /&gt;
&lt;br /&gt;
The picture below illustrates how elements of the array are assigned among different PEs for the specific case: length of the array is 7 and there are 3 PEs available. Elements in the array are marked by their indexes (0 to 6). As shown in the picture, PE0 will work on elements with index 0, 3, 6; PE1 is in charge of elements with index 1, 4; and elements with index 2, 5 are assigned to PE2. In this way, these 3 PEs work collectively on the array, while each PE works on different elements. Thus, data parallelism is achieved.&lt;br /&gt;
&lt;br /&gt;
[[Image:506wiki1.png|frame|center|150px|Illustration of data parallel programming(adapted from [http://computing.llnl.gov/tutorials/parallel_comp/#ModelsData Introduction to Parallel Computing])]]&lt;br /&gt;
&lt;br /&gt;
== Comparison with Message Passing and Shared Memory ==&lt;br /&gt;
&lt;br /&gt;
The differences between the data parallel paradigm and the task parallel paradigm are taken up later in this article; this section compares the data parallel model with the shared memory and message passing models.&lt;br /&gt;
&lt;br /&gt;
Although the shared memory and message passing models may be combined into hybrid approaches, the two models are fundamentally different ways of addressing the same problem (of access control to common data).&lt;br /&gt;
Access control determines how concurrent threads coordinate their reads and writes of common data: the shared memory model relies on synchronization primitives such as locks and barriers, while the message passing model makes every interaction explicit through the exchange of messages.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In contrast, the data parallel model is concerned with a fundamentally different problem (how to divide work into parallel tasks). As such, the data parallel model may be used in conjunction with either the shared memory or the message passing model without conflict. In fact, Klaiber (1994) compares the performance of a number of data parallel programs implemented with both shared memory and message passing models.&lt;br /&gt;
As discussed in the previous section, one of the major advantages of combining the data parallel and message passing models is a reduction in the amount and complexity of communication required relative to a task parallel approach. Similarly, combining the data parallel and shared memory models tends to simplify and reduce the amount of synchronization required. If the task parallel code given above were modified from a message passing model to a shared memory model, the two threads would require 8 signals to be sent between the threads (instead of 8 messages). In contrast, the data parallel code would require only a single barrier before the local sums are added to compute the full sum.&lt;br /&gt;
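&lt;br /&gt;
To illustrate the single barrier, the following C sketch uses POSIX threads. It is not taken from the referenced texts: the array size, its contents, and the fixed thread count are invented for the example, and the only synchronization is the one pthread_barrier_wait call before the partial sums are combined.&lt;br /&gt;
&lt;br /&gt;
 // sketch: shared memory version of the data parallel sum with one barrier&lt;br /&gt;
 #include &amp;lt;pthread.h&amp;gt;&lt;br /&gt;
 #include &amp;lt;stdio.h&amp;gt;&lt;br /&gt;
 #define N 8&lt;br /&gt;
 #define NUM_PE 2&lt;br /&gt;
 static int a[N];&lt;br /&gt;
 static int partial[NUM_PE];&lt;br /&gt;
 static pthread_barrier_t barrier;&lt;br /&gt;
 static void *work(void *arg)&lt;br /&gt;
 {&lt;br /&gt;
     int pe_id = *(int *)arg, my_sum = 0, i, p;&lt;br /&gt;
     for (i = pe_id; i &amp;lt; N; i += NUM_PE) {    // interleaved assignment, as in the pseudocode&lt;br /&gt;
         a[i] = a[i] * i;&lt;br /&gt;
         my_sum = my_sum + a[i];&lt;br /&gt;
     }&lt;br /&gt;
     partial[pe_id] = my_sum;&lt;br /&gt;
     pthread_barrier_wait(&amp;amp;barrier);          // the single synchronization point&lt;br /&gt;
     if (pe_id == 0) {                         // one PE adds the local sums&lt;br /&gt;
         int sum = 0;&lt;br /&gt;
         for (p = 0; p &amp;lt; NUM_PE; p++)&lt;br /&gt;
             sum = sum + partial[p];&lt;br /&gt;
         printf(&amp;quot;sum = %d\n&amp;quot;, sum);&lt;br /&gt;
     }&lt;br /&gt;
     return NULL;&lt;br /&gt;
 }&lt;br /&gt;
 int main(void)&lt;br /&gt;
 {&lt;br /&gt;
     pthread_t t[NUM_PE];&lt;br /&gt;
     int id[NUM_PE], i;&lt;br /&gt;
     for (i = 0; i &amp;lt; N; i++)&lt;br /&gt;
         a[i] = 1;                             // made-up input data&lt;br /&gt;
     pthread_barrier_init(&amp;amp;barrier, NULL, NUM_PE);&lt;br /&gt;
     for (i = 0; i &amp;lt; NUM_PE; i++) {&lt;br /&gt;
         id[i] = i;&lt;br /&gt;
         pthread_create(&amp;amp;t[i], NULL, work, &amp;amp;id[i]);&lt;br /&gt;
     }&lt;br /&gt;
     for (i = 0; i &amp;lt; NUM_PE; i++)&lt;br /&gt;
         pthread_join(t[i], NULL);&lt;br /&gt;
     pthread_barrier_destroy(&amp;amp;barrier);&lt;br /&gt;
     return 0;&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Compared with a task parallel version that exchanges eight messages or signals, the only coordination here is the barrier and the final read of the partial sums.&lt;br /&gt;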
Much as the shared memory model can benefit from specialized hardware, the data parallel programming model can as well. SIMD (single-instruction-multiple-data) processors are specifically designed to run data parallel algorithms. These processors perform a single instruction on many different data locations simultaneously. Modern examples include CUDA processors developed by nVidia and Cell processors developed by STI (Sony, Toshiba, and IBM). For the curious, example code for CUDA processors is provided in the Appendix. However, whereas the shared memory model can be a difficult and costly abstraction in the absence of hardware support, the data parallel model—like the message passing model—does not require hardware support.&lt;br /&gt;
Since data parallel code tends to simplify communication and synchronization, data parallel code may be easier to develop than a more task parallel approach. However, data parallel code also requires writing code to split program data into chunks and assign it to different threads. In addition, it is possible that a problem may not decompose easily into subproblems relying on largely independent chunks of data. In this case, it may be impractical or impossible to apply the data parallel model.&lt;br /&gt;
Once written, data parallel programs can scale easily to large numbers of processors. The data parallel model implicitly encourages data locality by having each thread work on a chunk of data. The regular data chunks also make it easier to reason about where to locate data and how to organize it.&lt;br /&gt;
&lt;br /&gt;
= Task Parallel Model =&lt;br /&gt;
Task parallelism is a form of parallelization in which different instructions are executed, either on the same data or on different data. It focuses on distributing the execution of processes (threads) across parallel computing nodes. As part of the workflow, the execution threads communicate with one another as they work, in order to share data.&lt;br /&gt;
&lt;br /&gt;
== Description and Example ==&lt;br /&gt;
Suppose the task to be accomplished is to compute the sum of the results produced by executing instruction 'A' and instruction 'B'. The following example illustrates how task parallelism can achieve this.&lt;br /&gt;
&lt;br /&gt;
The pseudo code below illustrates task parallelism:&lt;br /&gt;
program:&lt;br /&gt;
...&lt;br /&gt;
if CPU=&amp;quot;a&amp;quot; then&lt;br /&gt;
   do task &amp;quot;A&amp;quot;&lt;br /&gt;
else if CPU=&amp;quot;b&amp;quot; then&lt;br /&gt;
   do task &amp;quot;B&amp;quot;&lt;br /&gt;
end if&lt;br /&gt;
...&lt;br /&gt;
end program&lt;br /&gt;
&lt;br /&gt;
The goal of the program is to accomplish some net total task (&amp;quot;A+B&amp;quot;). If we write the code as above and launch it on a 2-processor system, the runtime environment will execute it as follows.&lt;br /&gt;
In an SPMD system, both CPUs execute the code, and in a parallel environment both have access to the same data. The &amp;quot;if&amp;quot; clause differentiates between the CPUs: CPU &amp;quot;a&amp;quot; evaluates the &amp;quot;if&amp;quot; as true, and CPU &amp;quot;b&amp;quot; evaluates the &amp;quot;else if&amp;quot; as true, so each has its own task. Both CPUs then execute their separate code blocks simultaneously, performing different tasks at the same time.&lt;br /&gt;
Code executed by CPU &amp;quot;a&amp;quot;:&lt;br /&gt;
program:&lt;br /&gt;
...&lt;br /&gt;
do task &amp;quot;A&amp;quot;&lt;br /&gt;
...&lt;br /&gt;
end program&lt;br /&gt;
Code executed by CPU &amp;quot;b&amp;quot;:&lt;br /&gt;
program:&lt;br /&gt;
...&lt;br /&gt;
do task &amp;quot;B&amp;quot;&lt;br /&gt;
...&lt;br /&gt;
end program&lt;br /&gt;
This concept can now be generalized to any number of processors.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Data Parallel Model vs Task Parallel Model =&lt;br /&gt;
&lt;br /&gt;
= Definitions =&lt;br /&gt;
&lt;br /&gt;
= References =&lt;br /&gt;
* David E. Culler, Jaswinder Pal Singh, and Anoop Gupta, [http://portal.acm.org/citation.cfm?id=550071 ''Parallel Computer Architecture: A Hardware/Software Approach,''] Morgan-Kauffman, 1999.&lt;br /&gt;
* Ian Foster, [http://www.mcs.anl.gov/~itf/dbpp/ ''Designing and Building Parallel Programs,''] Addison-Wesley, 1995.&lt;br /&gt;
* Magne Haveraaen, [http://portal.acm.org/citation.cfm?id=1239917 &amp;quot;Machine and collection abstractions for user-implemented data-parallel programming,&amp;quot;] ''Scientific Programming,'' 8(4):231-246, 2000.&lt;br /&gt;
* W. Daniel Hillis and Guy L. Steele, Jr., [http://portal.acm.org/citation.cfm?id=7903 &amp;quot;Data parallel algorithms,&amp;quot;] ''Communications of the ACM,'' 29(12):1170-1183, December 1986.&lt;br /&gt;
* Alexander C. Klaiber and Henry M. Levy, [http://portal.acm.org/citation.cfm?id=192020 &amp;quot;A comparison of message passing and shared memory architectures for data parallel programs,&amp;quot;] in ''Proceedings of the 21st Annual International Symposium on Computer Architecture,'' April 1994, pp. 94-105.&lt;br /&gt;
* Yan Solihin, ''Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems,'' Solihin Books, 2008.&lt;/div&gt;</summary>
		<author><name>Vrmanda</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch2_JR&amp;diff=43544</id>
		<title>CSC/ECE 506 Spring 2011/ch2 JR</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch2_JR&amp;diff=43544"/>
		<updated>2011-01-30T00:55:47Z</updated>

		<summary type="html">&lt;p&gt;Vrmanda: /* Description and Example */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Supplement to Chapter 2: The Data Parallel Programming Model=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= History =&lt;br /&gt;
As computer architectures have evolved, so have parallel programming models. The earliest advancements in parallel computers took advantage of bit-level parallelism.  These computers used vector processing, which required a shared memory programming model.  As performance returns from this architecture diminished, the emphasis was placed on instruction-level parallelism and the message passing model began to dominate.  Most recently, with the move to cluster-based machines, there has been an increased emphasis on thread-level parallelism. This has corresponded to an increased interest in the data parallel programming model.&lt;br /&gt;
&lt;br /&gt;
== Bit-level parallelism in the 1970's ==&lt;br /&gt;
The major performance improvements from computers during this time were due to the ability to execute 32-bit word size operations at one time ([[#References|Culler (1999), p. 15.]]).  The dominant supercomputers of the time, like the Cray and the ILLIAC IV, were mainly Single Instruction Multiple Data architectures and used a shared memory programming model.  They each used different forms of vector processing ([[#References|Culler (1999), p. 21.]]). &lt;br /&gt;
Development of the ILLIAC IV began in 1964 and wasn't finished until 1975 [http://en.wikipedia.org/wiki/ILLIAC_IV].  A central processor was connected to the main memory and delegated tasks to individual PE's, which each had their own memory cache. [http://archive.computerhistory.org/resources/text/Burroughs/Burroughs.ILLIAC%20IV.1974.102624911.pdf].  Each PE could operate either an 8-, 32- or 64-bit operand at a given time [http://archive.computerhistory.org/resources/text/Burroughs/Burroughs.ILLIAC%20IV.1974.102624911.pdf].&lt;br /&gt;
&lt;br /&gt;
The Cray machine was installed at Los Alamos National Laboratory in 1976 by Cray Research and had similar performance to the ILLIAC IV [http://en.wikipedia.org/wiki/ILLIAC_IV].  The Cray machine relied heavily on the use of registers rather than individual processing elements as in the ILLIAC IV.  Each processor was connected to main memory and had a number of 64-bit registers used to perform operations [http://www.eecg.toronto.edu/~moshovos/ACA05/read/cray1.pdf].&lt;br /&gt;
&lt;br /&gt;
== Move to instruction-level parallelism in the 1980's ==&lt;br /&gt;
&lt;br /&gt;
Increasing the word size above 32-bits offered diminishing returns in terms of performance ([[#References|Culler (1999), p. 15.]]). In the mid-1980's the emphasis changed from bit-level parallelism to instruction-level parallelism, which involved increasing the number of instructions that could be executed at one time ([[#References|Culler (1999), p. 15.]]).  The message passing model allowed programmers the ability to divide up instructions in order to take advantage of this architecture. &lt;br /&gt;
&lt;br /&gt;
== Thread-level parallelism ==&lt;br /&gt;
The move to cluster-based machines in the past decade, has added another layer of complexity to parallelism.  Since computers could be located across a network from each other, there is more emphasis on software acting as a bridge [http://cobweb.ecn.purdue.edu/~pplinux/ppcluster.html]. This has led to a greater emphasis on thread- or task-level parallelism [http://en.wikipedia.org/wiki/Thread-level_parallelism]  and the addition of the data parallelism programming model to existing message passing or shared memory models [http://en.wikipedia.org/wiki/Thread-level_parallelism].  &lt;br /&gt;
&lt;br /&gt;
= Data Parallel Model =&lt;br /&gt;
One important feature of data-parallel programming model or data parallelism (SIMD) is the single control flow. Flynn's taxonomy classifies SIMD to be analogous to doing the same operation repeatedly over a large data set. There is only one control processor that directs the activities of all the processing elements. In a multiprocessor system executing a single set of instructions (SIMD), data parallelism is achieved when each processor performs the same task on different pieces of distributed data. In some situations, a single execution thread controls operations on all pieces of data. In others, different threads control the operation, but they execute the same code.&lt;br /&gt;
&lt;br /&gt;
== Description and Example ==&lt;br /&gt;
&lt;br /&gt;
This section shows a simple example, adapted from the Solihin textbook (pp. 24-27), that illustrates the data-parallel programming model. Each of the code fragments below is written in pseudo-code style.&lt;br /&gt;
&lt;br /&gt;
Suppose we want to perform the following task on an array &amp;lt;code&amp;gt;a&amp;lt;/code&amp;gt;: updating each element of &amp;lt;code&amp;gt;a&amp;lt;/code&amp;gt; by the product of itself and its index, and adding together the elements of &amp;lt;code&amp;gt;a&amp;lt;/code&amp;gt; into the variable &amp;lt;code&amp;gt;sum&amp;lt;/code&amp;gt;. The corresponding code is shown below.&lt;br /&gt;
&lt;br /&gt;
 // simple sequential task&lt;br /&gt;
 sum = 0;&lt;br /&gt;
 '''for''' (i = 0; i &amp;lt; a.length; i++)&lt;br /&gt;
 {&lt;br /&gt;
    a[i] = a[i] * i;&lt;br /&gt;
    sum = sum + a[i];&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
When we orchestrate the task using the data-parallel programming model, the program can be divided into two parts. The first part performs the same operations on separate elements of the array for each processing element (sometimes referred to as PE or pe), and the second part reorganizes data among all processing elements (In our example data reorganization is summing up values across different processing elements). Since data-parallel programming model only defines the overall effects of parallel steps, the second part can be accomplished either through shared memory or message passing. The three code fragments below are examples for the first part of the program, shared-memory version of the second part, and message passing for the second part, respectively.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 // data parallel programming: let each PE perform the same task on different pieces of distributed data&lt;br /&gt;
 pe_id = getid();&lt;br /&gt;
 my_sum = 0;&lt;br /&gt;
 '''for''' (i = pe_id; i &amp;lt; a.length; i += number_of_pe)         //separate elements of the array are assigned to each PE &lt;br /&gt;
 {&lt;br /&gt;
    a[i] = a[i] * i;&lt;br /&gt;
    my_sum = my_sum + a[i];                               //all PEs accumulate elements assigned to them into local variable my_sum&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In the above code, data parallelism is achieved by letting each processing element perform actions on separate elements of the array, which are identified using the PE's id. For instance, if three processing elements are used, then one processing element would start at i = 0, one would start at i = 1, and the last would start at i = 2. Since there are three processing elements, the array index for each will increase by three on each iteration until the task is complete (note that in our example the elements assigned to each PE are interleaved rather than contiguous). If the length of the array is a multiple of three, each processing element takes the same amount of time to execute its portion of the task.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The picture below illustrates how elements of the array are assigned among different PEs for the specific case: length of the array is 7 and there are 3 PEs available. Elements in the array are marked by their indexes (0 to 6). As shown in the picture, PE0 will work on elements with index 0, 3, 6; PE1 is in charge of elements with index 1, 4; and elements with index 2, 5 are assigned to PE2. In this way, these 3 PEs work collectively on the array, while each PE works on different elements. Thus, data parallelism is achieved.&lt;br /&gt;
&lt;br /&gt;
[[Image:506wiki1.png|frame|center|150px|Illustration of data parallel programming(adapted from [http://computing.llnl.gov/tutorials/parallel_comp/#ModelsData Introduction to Parallel Computing])]]&lt;br /&gt;
&lt;br /&gt;
== Comparison with Message Passing and Shared Memory ==&lt;br /&gt;
&lt;br /&gt;
The differences between the data parallel paradigm and the task parallel paradigm are taken up later in this article; this section compares the data parallel model with the shared memory and message passing models.&lt;br /&gt;
&lt;br /&gt;
Although the shared memory and message passing models may be combined into hybrid approaches, the two models are fundamentally different ways of addressing the same problem (of access control to common data).&lt;br /&gt;
Access control determines how concurrent threads coordinate their reads and writes of common data: the shared memory model relies on synchronization primitives such as locks and barriers, while the message passing model makes every interaction explicit through the exchange of messages.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In contrast, the data parallel model is concerned with a fundamentally different problem (how to divide work into parallel tasks). As such, the data parallel model may be used in conjunction with either the shared memory or the message passing model without conflict. In fact, Klaiber (1994) compares the performance of a number of data parallel programs implemented with both shared memory and message passing models.&lt;br /&gt;
As discussed in the previous section, one of the major advantages of combining the data parallel and message passing models is a reduction in the amount and complexity of communication required relative to a task parallel approach. Similarly, combining the data parallel and shared memory models tends to simplify and reduce the amount of synchronization required. If the task parallel code given above were modified from a message passing model to a shared memory model, the two threads would require 8 signals to be sent between the threads (instead of 8 messages). In contrast, the data parallel code would require only a single barrier before the local sums are added to compute the full sum.&lt;br /&gt;
Much as the shared memory model can benefit from specialized hardware, the data parallel programming model can as well. SIMD (single-instruction-multiple-data) processors are specifically designed to run data parallel algorithms. These processors perform a single instruction on many different data locations simultaneously. Modern examples include CUDA processors developed by nVidia and Cell processors developed by STI (Sony, Toshiba, and IBM). For the curious, example code for CUDA processors is provided in the Appendix. However, whereas the shared memory model can be a difficult and costly abstraction in the absence of hardware support, the data parallel model—like the message passing model—does not require hardware support.&lt;br /&gt;
Since data parallel code tends to simplify communication and synchronization, data parallel code may be easier to develop than a more task parallel approach. However, data parallel code also requires writing code to split program data into chunks and assign it to different threads. In addition, it is possible that a problem may not decompose easily into subproblems relying on largely independent chunks of data. In this case, it may be impractical or impossible to apply the data parallel model.&lt;br /&gt;
Once written, data parallel programs can scale easily to large numbers of processors. The data parallel model implicitly encourages data locality by having each thread work on a chunk of data. The regular data chunks also make it easier to reason about where to locate data and how to organize it.&lt;br /&gt;
&lt;br /&gt;
= Task Parallel Model =&lt;br /&gt;
Task parallelism is a form of parallelization in which different instructions are executed, either on the same data or on different data. It focuses on distributing the execution of processes (threads) across parallel computing nodes. As part of the workflow, the execution threads communicate with one another as they work, in order to share data.&lt;br /&gt;
&lt;br /&gt;
== Description and Example ==&lt;br /&gt;
Suppose the task to be accomplished is to compute the sum of the results produced by executing instruction 'A' and instruction 'B'. The following example illustrates how task parallelism can achieve this.&lt;br /&gt;
&lt;br /&gt;
The pseudo code below illustrates task parallelism:&lt;br /&gt;
program:&lt;br /&gt;
...&lt;br /&gt;
if CPU=&amp;quot;a&amp;quot; then&lt;br /&gt;
   do task &amp;quot;A&amp;quot;&lt;br /&gt;
else if CPU=&amp;quot;b&amp;quot; then&lt;br /&gt;
   do task &amp;quot;B&amp;quot;&lt;br /&gt;
end if&lt;br /&gt;
...&lt;br /&gt;
end program&lt;br /&gt;
&lt;br /&gt;
The goal of the program is to accomplish some net total task (&amp;quot;A+B&amp;quot;). If we write the code as above and launch it on a 2-processor system, the runtime environment will execute it as follows.&lt;br /&gt;
In an SPMD system, both CPUs execute the code, and in a parallel environment both have access to the same data. The &amp;quot;if&amp;quot; clause differentiates between the CPUs: CPU &amp;quot;a&amp;quot; evaluates the &amp;quot;if&amp;quot; as true, and CPU &amp;quot;b&amp;quot; evaluates the &amp;quot;else if&amp;quot; as true, so each has its own task. Both CPUs then execute their separate code blocks simultaneously, performing different tasks at the same time.&lt;br /&gt;
Code executed by CPU &amp;quot;a&amp;quot;:&lt;br /&gt;
program:&lt;br /&gt;
...&lt;br /&gt;
do task &amp;quot;A&amp;quot;&lt;br /&gt;
...&lt;br /&gt;
end program&lt;br /&gt;
Code executed by CPU &amp;quot;b&amp;quot;:&lt;br /&gt;
program:&lt;br /&gt;
...&lt;br /&gt;
do task &amp;quot;B&amp;quot;&lt;br /&gt;
...&lt;br /&gt;
end program&lt;br /&gt;
This concept can now be generalized to any number of processors.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Data Parallel Model vs Task Parallel Model =&lt;br /&gt;
&lt;br /&gt;
= Definitions =&lt;br /&gt;
&lt;br /&gt;
= References =&lt;br /&gt;
* David E. Culler, Jaswinder Pal Singh, and Anoop Gupta, [http://portal.acm.org/citation.cfm?id=550071 ''Parallel Computer Architecture: A Hardware/Software Approach,''] Morgan-Kauffman, 1999.&lt;br /&gt;
* Ian Foster, [http://www.mcs.anl.gov/~itf/dbpp/ ''Designing and Building Parallel Programs,''] Addison-Wesley, 1995.&lt;br /&gt;
* Magne Haveraaen, [http://portal.acm.org/citation.cfm?id=1239917 &amp;quot;Machine and collection abstractions for user-implemented data-parallel programming,&amp;quot;] ''Scientific Programming,'' 8(4):231-246, 2000.&lt;br /&gt;
* W. Daniel Hillis and Guy L. Steele, Jr., [http://portal.acm.org/citation.cfm?id=7903 &amp;quot;Data parallel algorithms,&amp;quot;] ''Communications of the ACM,'' 29(12):1170-1183, December 1986.&lt;br /&gt;
* Alexander C. Klaiber and Henry M. Levy, [http://portal.acm.org/citation.cfm?id=192020 &amp;quot;A comparison of message passing and shared memory architectures for data parallel programs,&amp;quot;] in ''Proceedings of the 21st Annual International Symposium on Computer Architecture,'' April 1994, pp. 94-105.&lt;br /&gt;
* Yan Solihin, ''Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems,'' Solihin Books, 2008.&lt;/div&gt;</summary>
		<author><name>Vrmanda</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch2_JR&amp;diff=43543</id>
		<title>CSC/ECE 506 Spring 2011/ch2 JR</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch2_JR&amp;diff=43543"/>
		<updated>2011-01-30T00:47:11Z</updated>

		<summary type="html">&lt;p&gt;Vrmanda: /* Description and Example */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Supplement to Chapter 2: The Data Parallel Programming Model=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= History =&lt;br /&gt;
As computer architectures have evolved, so have parallel programming models. The earliest advancements in parallel computers took advantage of bit-level parallelism.  These computers used vector processing, which required a shared memory programming model.  As performance returns from this architecture diminished, the emphasis was placed on instruction-level parallelism and the message passing model began to dominate.  Most recently, with the move to cluster-based machines, there has been an increased emphasis on thread-level parallelism. This has corresponded to an increased interest in the data parallel programming model.&lt;br /&gt;
&lt;br /&gt;
== Bit-level parallelism in the 1970's ==&lt;br /&gt;
The major performance improvements from computers during this time were due to the ability to execute 32-bit word size operations at one time ([[#References|Culler (1999), p. 15.]]).  The dominant supercomputers of the time, like the Cray and the ILLIAC IV, were mainly Single Instruction Multiple Data architectures and used a shared memory programming model.  They each used different forms of vector processing ([[#References|Culler (1999), p. 21.]]). &lt;br /&gt;
Development of the ILLIAC IV began in 1964 and wasn't finished until 1975 [http://en.wikipedia.org/wiki/ILLIAC_IV].  A central processor was connected to the main memory and delegated tasks to individual PE's, which each had their own memory cache. [http://archive.computerhistory.org/resources/text/Burroughs/Burroughs.ILLIAC%20IV.1974.102624911.pdf].  Each PE could operate either an 8-, 32- or 64-bit operand at a given time [http://archive.computerhistory.org/resources/text/Burroughs/Burroughs.ILLIAC%20IV.1974.102624911.pdf].&lt;br /&gt;
&lt;br /&gt;
The Cray machine was installed at Los Alamos National Laboratory in 1976 by Cray Research and had similar performance to the ILLIAC IV [http://en.wikipedia.org/wiki/ILLIAC_IV].  The Cray machine relied heavily on the use of registers rather than individual processing elements as in the ILLIAC IV.  Each processor was connected to main memory and had a number of 64-bit registers used to perform operations [http://www.eecg.toronto.edu/~moshovos/ACA05/read/cray1.pdf].&lt;br /&gt;
&lt;br /&gt;
== Move to instruction-level parallelism in the 1980's ==&lt;br /&gt;
&lt;br /&gt;
Increasing the word size above 32-bits offered diminishing returns in terms of performance ([[#References|Culler (1999), p. 15.]]). In the mid-1980's the emphasis changed from bit-level parallelism to instruction-level parallelism, which involved increasing the number of instructions that could be executed at one time ([[#References|Culler (1999), p. 15.]]).  The message passing model allowed programmers the ability to divide up instructions in order to take advantage of this architecture. &lt;br /&gt;
&lt;br /&gt;
== Thread-level parallelism ==&lt;br /&gt;
The move to cluster-based machines in the past decade, has added another layer of complexity to parallelism.  Since computers could be located across a network from each other, there is more emphasis on software acting as a bridge [http://cobweb.ecn.purdue.edu/~pplinux/ppcluster.html]. This has led to a greater emphasis on thread- or task-level parallelism [http://en.wikipedia.org/wiki/Thread-level_parallelism]  and the addition of the data parallelism programming model to existing message passing or shared memory models [http://en.wikipedia.org/wiki/Thread-level_parallelism].  &lt;br /&gt;
&lt;br /&gt;
= Data Parallel Model =&lt;br /&gt;
One important feature of data-parallel programming model or data parallelism (SIMD) is the single control flow. Flynn's taxonomy classifies SIMD to be analogous to doing the same operation repeatedly over a large data set. There is only one control processor that directs the activities of all the processing elements. In a multiprocessor system executing a single set of instructions (SIMD), data parallelism is achieved when each processor performs the same task on different pieces of distributed data. In some situations, a single execution thread controls operations on all pieces of data. In others, different threads control the operation, but they execute the same code.&lt;br /&gt;
&lt;br /&gt;
== Description and Example ==&lt;br /&gt;
&lt;br /&gt;
This section shows a simple example, adapted from the Solihin textbook (pp. 24-27), that illustrates the data-parallel programming model. Each of the code fragments below is written in pseudo-code style.&lt;br /&gt;
&lt;br /&gt;
Suppose we want to perform the following task on an array &amp;lt;code&amp;gt;a&amp;lt;/code&amp;gt;: updating each element of &amp;lt;code&amp;gt;a&amp;lt;/code&amp;gt; by the product of itself and its index, and adding together the elements of &amp;lt;code&amp;gt;a&amp;lt;/code&amp;gt; into the variable &amp;lt;code&amp;gt;sum&amp;lt;/code&amp;gt;. The corresponding code is shown below.&lt;br /&gt;
&lt;br /&gt;
 // simple sequential task&lt;br /&gt;
 sum = 0;&lt;br /&gt;
 '''for''' (i = 0; i &amp;lt; a.length; i++)&lt;br /&gt;
 {&lt;br /&gt;
    a[i] = a[i] * i;&lt;br /&gt;
    sum = sum + a[i];&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
When we orchestrate the task using the data-parallel programming model, the program can be divided into two parts. The first part performs the same operations on separate elements of the array for each processing element (sometimes referred to as PE or pe), and the second part reorganizes data among all processing elements (In our example data reorganization is summing up values across different processing elements). Since data-parallel programming model only defines the overall effects of parallel steps, the second part can be accomplished either through shared memory or message passing. The three code fragments below are examples for the first part of the program, shared-memory version of the second part, and message passing for the second part, respectively.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 // data parallel programming: let each PE perform the same task on different pieces of distributed data&lt;br /&gt;
 pe_id = getid();&lt;br /&gt;
 my_sum = 0;&lt;br /&gt;
 '''for''' (i = pe_id; i &amp;lt; a.length; i += number_of_pe)         //separate elements of the array are assigned to each PE &lt;br /&gt;
 {&lt;br /&gt;
    a[i] = a[i] * i;&lt;br /&gt;
    my_sum = my_sum + a[i];                               //all PEs accumulate elements assigned to them into local variable my_sum&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In the above code, data parallelism is achieved by letting each processing element perform actions on separate elements of the array, which are identified using the PE's id. For instance, if three processing elements are used, then one processing element would start at i = 0, one would start at i = 1, and the last would start at i = 2. Since there are three processing elements, the array index for each will increase by three on each iteration until the task is complete (note that in our example the elements assigned to each PE are interleaved rather than contiguous). If the length of the array is a multiple of three, each processing element takes the same amount of time to execute its portion of the task.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The picture below illustrates how elements of the array are assigned among different PEs for the specific case: length of the array is 7 and there are 3 PEs available. Elements in the array are marked by their indexes (0 to 6). As shown in the picture, PE0 will work on elements with index 0, 3, 6; PE1 is in charge of elements with index 1, 4; and elements with index 2, 5 are assigned to PE2. In this way, these 3 PEs work collectively on the array, while each PE works on different elements. Thus, data parallelism is achieved.&lt;br /&gt;
&lt;br /&gt;
[[Image:506wiki1.png|frame|center|150px|Illustration of data parallel programming(adapted from [http://computing.llnl.gov/tutorials/parallel_comp/#ModelsData Introduction to Parallel Computing])]]&lt;br /&gt;
&lt;br /&gt;
== Comparison with Message Passing and Shared Memory ==&lt;br /&gt;
&lt;br /&gt;
The differences between the data parallel paradigm and the task parallel paradigm are taken up later in this article; this section compares the data parallel model with the shared memory and message passing models.&lt;br /&gt;
&lt;br /&gt;
Although the shared memory and message passing models may be combined into hybrid approaches, the two models are fundamentally different ways of addressing the same problem (of access control to common data).&lt;br /&gt;
Access control determines how concurrent threads coordinate their reads and writes of common data: the shared memory model relies on synchronization primitives such as locks and barriers, while the message passing model makes every interaction explicit through the exchange of messages.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In contrast, the data parallel model is concerned with a fundamentally different problem (how to divide work into parallel tasks). As such, the data parallel model may be used in conjunction with either the shared memory or the message passing model without conflict. In fact, Klaiber (1994) compares the performance of a number of data parallel programs implemented with both shared memory and message passing models.&lt;br /&gt;
As discussed in the previous section, one of the major advantages of combining the data parallel and message passing models is a reduction in the amount and complexity of communication required relative to a task parallel approach. Similarly, combining the data parallel and shared memory models tends to simplify and reduce the amount of synchronization required. If the task parallel code given above were modified from a message passing model to a shared memory model, the two threads would require 8 signals to be sent between the threads (instead of 8 messages). In contrast, the data parallel code would require only a single barrier before the local sums are added to compute the full sum.&lt;br /&gt;
Much as the shared memory model can benefit from specialized hardware, the data parallel programming model can as well. SIMD (single-instruction-multiple-data) processors are specifically designed to run data parallel algorithms. These processors perform a single instruction on many different data locations simultaneously. Modern examples include CUDA processors developed by nVidia and Cell processors developed by STI (Sony, Toshiba, and IBM). For the curious, example code for CUDA processors is provided in the Appendix. However, whereas the shared memory model can be a difficult and costly abstraction in the absence of hardware support, the data parallel model—like the message passing model—does not require hardware support.&lt;br /&gt;
Since data parallel code tends to simplify communication and synchronization, data parallel code may be easier to develop than a more task parallel approach. However, data parallel code also requires writing code to split program data into chunks and assign it to different threads. In addition, it is possible that a problem may not decompose easily into subproblems relying on largely independent chunks of data. In this case, it may be impractical or impossible to apply the data parallel model.&lt;br /&gt;
Once written, data parallel programs can scale easily to large numbers of processors. The data parallel model implicitly encourages data locality by having each thread work on a chunk of data. The regular data chunks also make it easier to reason about where to locate data and how to organize it.&lt;br /&gt;
&lt;br /&gt;
= Task Parallel Model =&lt;br /&gt;
Task parallelism is a form of parallelization in which different instructions are executed, either on the same data or on different data. It focuses on distributing the execution of processes (threads) across parallel computing nodes. As part of the workflow, the execution threads communicate with one another as they work, in order to share data.&lt;br /&gt;
&lt;br /&gt;
== Description and Example ==&lt;br /&gt;
The pseudocode below illustrates task parallelism:&lt;br /&gt;
program:&lt;br /&gt;
...&lt;br /&gt;
if CPU=&amp;quot;a&amp;quot; then&lt;br /&gt;
   do task &amp;quot;A&amp;quot;&lt;br /&gt;
else if CPU=&amp;quot;b&amp;quot; then&lt;br /&gt;
   do task &amp;quot;B&amp;quot;&lt;br /&gt;
end if&lt;br /&gt;
...&lt;br /&gt;
end program&lt;br /&gt;
The goal of the program is to do some net total task (&amp;quot;A+B&amp;quot;). If we write the code as above and launch it on a 2-processor system, then the runtime environment will execute it as follows.&lt;br /&gt;
In an SPMD system, both CPUs will execute the code.&lt;br /&gt;
In a parallel environment, both will have access to the same data.&lt;br /&gt;
The &amp;quot;if&amp;quot; clause differentiates between the CPUs: CPU &amp;quot;a&amp;quot; evaluates the &amp;quot;if&amp;quot; as true, and CPU &amp;quot;b&amp;quot; evaluates the &amp;quot;else if&amp;quot; as true, so each has its own task.&lt;br /&gt;
Both CPUs then execute their separate code blocks simultaneously, performing different tasks at the same time.&lt;br /&gt;
Code executed by CPU &amp;quot;a&amp;quot;:&lt;br /&gt;
program:&lt;br /&gt;
...&lt;br /&gt;
do task &amp;quot;A&amp;quot;&lt;br /&gt;
...&lt;br /&gt;
end program&lt;br /&gt;
Code executed by CPU &amp;quot;b&amp;quot;:&lt;br /&gt;
program:&lt;br /&gt;
...&lt;br /&gt;
do task &amp;quot;B&amp;quot;&lt;br /&gt;
...&lt;br /&gt;
end program&lt;br /&gt;
This concept can now be generalized to any number of processors.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= Data Parallel Model vs Task Parallel Model =&lt;br /&gt;
&lt;br /&gt;
= Definitions =&lt;br /&gt;
&lt;br /&gt;
= References =&lt;br /&gt;
* David E. Culler, Jaswinder Pal Singh, and Anoop Gupta, [http://portal.acm.org/citation.cfm?id=550071 ''Parallel Computer Architecture: A Hardware/Software Approach,''] Morgan-Kauffman, 1999.&lt;br /&gt;
* Ian Foster, [http://www.mcs.anl.gov/~itf/dbpp/ ''Designing and Building Parallel Programs,''] Addison-Wesley, 1995.&lt;br /&gt;
* Magne Haveraaen, [http://portal.acm.org/citation.cfm?id=1239917 &amp;quot;Machine and collection abstractions for user-implemented data-parallel programming,&amp;quot;] ''Scientific Programming,'' 8(4):231-246, 2000.&lt;br /&gt;
* W. Daniel Hillis and Guy L. Steele, Jr., [http://portal.acm.org/citation.cfm?id=7903 &amp;quot;Data parallel algorithms,&amp;quot;] ''Communications of the ACM,'' 29(12):1170-1183, December 1986.&lt;br /&gt;
* Alexander C. Klaiber and Henry M. Levy, [http://portal.acm.org/citation.cfm?id=192020 &amp;quot;A comparison of message passing and shared memory architectures for data parallel programs,&amp;quot;] in ''Proceedings of the 21st Annual International Symposium on Computer Architecture,'' April 1994, pp. 94-105.&lt;br /&gt;
* Yan Solihin, ''Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems,'' Solihin Books, 2008.&lt;/div&gt;</summary>
		<author><name>Vrmanda</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch2_JR&amp;diff=43542</id>
		<title>CSC/ECE 506 Spring 2011/ch2 JR</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch2_JR&amp;diff=43542"/>
		<updated>2011-01-30T00:45:18Z</updated>

		<summary type="html">&lt;p&gt;Vrmanda: /* Description and Example */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Supplement to Chapter 2: The Data Parallel Programming Model=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= History =&lt;br /&gt;
As computer architectures have evolved, so have parallel programming models. The earliest advancements in parallel computers took advantage of bit-level parallelism.  These computers used vector processing, which required a shared memory programming model.  As performance returns from this architecture diminished, the emphasis was placed on instruction-level parallelism and the message passing model began to dominate.  Most recently, with the move to cluster-based machines, there has been an increased emphasis on thread-level parallelism. This has corresponded to an increased interest in the data parallel programming model.&lt;br /&gt;
&lt;br /&gt;
== Bit-level parallelism in the 1970's ==&lt;br /&gt;
The major performance improvements from computers during this time were due to the ability to execute 32-bit word size operations at one time ([[#References|Culler (1999), p. 15.]]).  The dominant supercomputers of the time, like the Cray and the ILLIAC IV, were mainly Single Instruction Multiple Data architectures and used a shared memory programming model.  They each used different forms of vector processing ([[#References|Culler (1999), p. 21.]]). &lt;br /&gt;
Development of the ILLIAC IV began in 1964 and wasn't finished until 1975 [http://en.wikipedia.org/wiki/ILLIAC_IV].  A central processor was connected to the main memory and delegated tasks to individual PE's, which each had their own memory cache. [http://archive.computerhistory.org/resources/text/Burroughs/Burroughs.ILLIAC%20IV.1974.102624911.pdf].  Each PE could operate either an 8-, 32- or 64-bit operand at a given time [http://archive.computerhistory.org/resources/text/Burroughs/Burroughs.ILLIAC%20IV.1974.102624911.pdf].&lt;br /&gt;
&lt;br /&gt;
The Cray machine was installed at Los Alamos National Laboratory in 1976 by Cray Research and had similar performance to the ILLIAC IV [http://en.wikipedia.org/wiki/ILLIAC_IV].  The Cray machine relied heavily on the use of registers rather than individual processing elements as in the ILLIAC IV.  Each processor was connected to main memory and had a number of 64-bit registers used to perform operations [http://www.eecg.toronto.edu/~moshovos/ACA05/read/cray1.pdf].&lt;br /&gt;
&lt;br /&gt;
== Move to instruction-level parallelism in the 1980's ==&lt;br /&gt;
&lt;br /&gt;
Increasing the word size above 32-bits offered diminishing returns in terms of performance ([[#References|Culler (1999), p. 15.]]). In the mid-1980's the emphasis changed from bit-level parallelism to instruction-level parallelism, which involved increasing the number of instructions that could be executed at one time ([[#References|Culler (1999), p. 15.]]).  The message passing model allowed programmers the ability to divide up instructions in order to take advantage of this architecture. &lt;br /&gt;
&lt;br /&gt;
== Thread-level parallelism ==&lt;br /&gt;
The move to cluster-based machines in the past decade, has added another layer of complexity to parallelism.  Since computers could be located across a network from each other, there is more emphasis on software acting as a bridge [http://cobweb.ecn.purdue.edu/~pplinux/ppcluster.html]. This has led to a greater emphasis on thread- or task-level parallelism [http://en.wikipedia.org/wiki/Thread-level_parallelism]  and the addition of the data parallelism programming model to existing message passing or shared memory models [http://en.wikipedia.org/wiki/Thread-level_parallelism].  &lt;br /&gt;
&lt;br /&gt;
= Data Parallel Model =&lt;br /&gt;
One important feature of the data-parallel programming model, or data parallelism (SIMD), is its single control flow. In Flynn's taxonomy, SIMD corresponds to performing the same operation repeatedly over a large data set. There is only one control processor that directs the activities of all the processing elements. In a multiprocessor system executing a single set of instructions (SIMD), data parallelism is achieved when each processor performs the same task on different pieces of distributed data. In some situations, a single execution thread controls operations on all pieces of data. In others, different threads control the operation, but they execute the same code.&lt;br /&gt;
&lt;br /&gt;
== Description and Example ==&lt;br /&gt;
&lt;br /&gt;
This section shows a simple example, adapted from the Solihin textbook (pp. 24-27), that illustrates the data-parallel programming model. Each of the code fragments below is written in pseudo-code.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Suppose we want to perform the following task on an array &amp;lt;code&amp;gt;a&amp;lt;/code&amp;gt;: updating each element of &amp;lt;code&amp;gt;a&amp;lt;/code&amp;gt; by the product of itself and its index, and adding together the elements of &amp;lt;code&amp;gt;a&amp;lt;/code&amp;gt; into the variable &amp;lt;code&amp;gt;sum&amp;lt;/code&amp;gt;. The corresponding code is shown below.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 // simple sequential task&lt;br /&gt;
 sum = 0;&lt;br /&gt;
 '''for''' (i = 0; i &amp;lt; a.length; i++)&lt;br /&gt;
 {&lt;br /&gt;
    a[i] = a[i] * i;&lt;br /&gt;
    sum = sum + a[i];&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
When we orchestrate the task using the data-parallel programming model, the program can be divided into two parts. The first part performs the same operations on separate elements of the array for each processing element (sometimes referred to as PE or pe), and the second part reorganizes data among all processing elements (in our example, the data reorganization is summing up values across different processing elements). Since the data-parallel programming model only defines the overall effects of parallel steps, the second part can be accomplished either through shared memory or message passing. The three code fragments below are examples of the first part of the program, a shared-memory version of the second part, and a message-passing version of the second part, respectively.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 // data parallel programming: let each PE perform the same task on different pieces of distributed data&lt;br /&gt;
 pe_id = getid();&lt;br /&gt;
 my_sum = 0;&lt;br /&gt;
 '''for''' (i = pe_id; i &amp;lt; a.length; i += number_of_pe)         //separate elements of the array are assigned to each PE &lt;br /&gt;
 {&lt;br /&gt;
    a[i] = a[i] * i;&lt;br /&gt;
    my_sum = my_sum + a[i];                               //all PEs accumulate elements assigned to them into local variable my_sum&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In the above code, data parallelism is achieved by letting each processing element perform actions on separate elements of the array, which are identified using the PE's id. For instance, if three processing elements are used, then one processing element would start at i = 0, one would start at i = 1, and the last would start at i = 2. Since there are three processing elements, the array index for each will increase by three on each iteration until the task is complete (note that in our example the elements assigned to each PE are interleaved rather than contiguous). If the length of the array is a multiple of three, then each processing element takes the same amount of time to execute its portion of the task.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The picture below illustrates how elements of the array are assigned among different PEs for the specific case in which the length of the array is 7 and 3 PEs are available. Elements in the array are marked by their indices (0 to 6). As shown in the picture, PE0 works on the elements with indices 0, 3, and 6; PE1 is in charge of the elements with indices 1 and 4; and the elements with indices 2 and 5 are assigned to PE2. In this way, the 3 PEs work collectively on the array, while each PE works on different elements. Thus, data parallelism is achieved.&lt;br /&gt;
&lt;br /&gt;
[[Image:506wiki1.png|frame|center|150px|Illustration of data parallel programming (adapted from [http://computing.llnl.gov/tutorials/parallel_comp/#ModelsData Introduction to Parallel Computing])]]&lt;br /&gt;
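&lt;br /&gt;
The fragment below is a minimal sketch of a shared-memory version of the second part of the task, in which every PE adds its local &amp;lt;code&amp;gt;my_sum&amp;lt;/code&amp;gt; into a shared variable &amp;lt;code&amp;gt;sum&amp;lt;/code&amp;gt; (initialized to 0 before the parallel section). The &amp;lt;code&amp;gt;lock()&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;unlock()&amp;lt;/code&amp;gt;, and &amp;lt;code&amp;gt;barrier()&amp;lt;/code&amp;gt; primitives are assumed for illustration and are not part of the original example.&lt;br /&gt;
&lt;br /&gt;
 // shared-memory sketch of the second part: combine the local sums into the shared variable sum&lt;br /&gt;
 lock(sum_lock);                                           //only one PE may update sum at a time&lt;br /&gt;
 sum = sum + my_sum;                                       //add this PE's local sum into the global sum&lt;br /&gt;
 unlock(sum_lock);&lt;br /&gt;
 barrier();                                                //wait until every PE has contributed before sum is read&lt;br /&gt;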
&lt;br /&gt;
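The fragment below sketches a message-passing version of the second part, in which every PE except PE 0 sends its local sum to PE 0, and PE 0 accumulates the values it receives. The &amp;lt;code&amp;gt;send()&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;recv()&amp;lt;/code&amp;gt; primitives are likewise assumed for illustration.&lt;br /&gt;
&lt;br /&gt;
 // message-passing sketch of the second part: PE 0 gathers and adds the local sums&lt;br /&gt;
 '''if''' (pe_id != 0)&lt;br /&gt;
 {&lt;br /&gt;
    send(my_sum, 0);                                       //every PE except PE 0 sends its local sum to PE 0&lt;br /&gt;
 }&lt;br /&gt;
 '''else'''&lt;br /&gt;
 {&lt;br /&gt;
    sum = my_sum;&lt;br /&gt;
    '''for''' (p = 1; p &amp;lt; number_of_pe; p++)&lt;br /&gt;
    {&lt;br /&gt;
       recv(tmp, p);                                       //PE 0 receives the local sum from PE p&lt;br /&gt;
       sum = sum + tmp;                                    //and accumulates it into the global sum&lt;br /&gt;
    }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;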
== Comparison with Message Passing and Shared Memory ==&lt;br /&gt;
&lt;br /&gt;
This section contrasts the data parallel programming paradigm with the task parallel, shared memory, and message passing approaches.&lt;br /&gt;
&lt;br /&gt;
Although the shared memory and message passing models may be combined into hybrid approaches, the two models are fundamentally different ways of addressing the same problem (of access control to common data).&lt;br /&gt;
Access control is the problem of coordinating how multiple threads read and write data that they hold in common.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In contrast, the data parallel model is concerned with a fundamentally different problem (how to divide work into parallel tasks). As such, the data parallel model may be used in conjunction with either the shared memory or the message passing model without conflict. In fact, Klaiber (1994) compares the performance of a number of data parallel programs implemented with both shared memory and message passing models.&lt;br /&gt;
One of the major advantages of combining the data parallel and message passing models is a reduction in the amount and complexity of communication required relative to a task parallel approach. Similarly, combining the data parallel and shared memory models tends to simplify and reduce the amount of synchronization required. If the task parallel code given in the Task Parallel Model section below were modified from a message passing model to a shared memory model, the two threads would require 8 signals to be sent between them (instead of 8 messages). In contrast, the data parallel code would require a single barrier before the local sums are added to compute the full sum.&lt;br /&gt;
Much as the shared memory model can benefit from specialized hardware, the data parallel programming model can as well. SIMD (single-instruction-multiple-data) processors are specifically designed to run data parallel algorithms. These processors perform a single instruction on many different data locations simultaneously. Modern examples include CUDA processors developed by nVidia and Cell processors developed by STI (Sony, Toshiba, and IBM). For the curious, example code for CUDA processors is provided in the Appendix. However, whereas the shared memory model can be a difficult and costly abstraction in the absence of hardware support, the data parallel model—like the message passing model—does not require hardware support.&lt;br /&gt;
Since data parallel code tends to simplify communication and synchronization, data parallel code may be easier to develop than a more task parallel approach. However, data parallel code also requires writing code to split program data into chunks and assign it to different threads. In addition, it is possible that a problem may not decompose easily into subproblems relying on largely independent chunks of data. In this case, it may be impractical or impossible to apply the data parallel model.&lt;br /&gt;
Once written, data parallel programs can scale easily to large numbers of processors. The data parallel model implicitly encourages data locality by having each thread work on a chunk of data. The regular data chunks also make it easier to reason about where to locate data and how to organize it.&lt;br /&gt;
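&lt;br /&gt;
For example, instead of the interleaved assignment used in the earlier fragment, each PE can be given one contiguous chunk of the array, which improves spatial locality. The sketch below illustrates this; the &amp;lt;code&amp;gt;ceiling()&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;min()&amp;lt;/code&amp;gt; helpers are assumed and do not appear in the original example.&lt;br /&gt;
&lt;br /&gt;
 // block (contiguous chunk) assignment of array elements to PEs (a sketch)&lt;br /&gt;
 chunk = ceiling(a.length / number_of_pe);                 //number of elements per PE&lt;br /&gt;
 start = pe_id * chunk;                                    //first element owned by this PE&lt;br /&gt;
 end   = min(start + chunk, a.length);                     //one past the last element owned by this PE&lt;br /&gt;
 '''for''' (i = start; i &amp;lt; end; i++)&lt;br /&gt;
 {&lt;br /&gt;
    a[i] = a[i] * i;&lt;br /&gt;
    my_sum = my_sum + a[i];&lt;br /&gt;
 }&lt;br /&gt;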
&lt;br /&gt;
= Task Parallel Model =&lt;br /&gt;
Task parallelism is a form of parallelization in which different computations (tasks) are executed concurrently, either on the same data or on different data. It focuses on distributing the execution of processes (threads) across different parallel computing nodes. As part of the workflow, the different execution threads communicate with one another to share data.&lt;br /&gt;
&lt;br /&gt;
== Description and Example ==&lt;br /&gt;
The pseudocode below illustrates task parallelism:&lt;br /&gt;
&lt;br /&gt;
 // task parallel programming: each CPU runs a different task&lt;br /&gt;
 program:&lt;br /&gt;
 ...&lt;br /&gt;
 '''if''' CPU=&amp;quot;a&amp;quot; '''then'''&lt;br /&gt;
    do task &amp;quot;A&amp;quot;&lt;br /&gt;
 '''else if''' CPU=&amp;quot;b&amp;quot; '''then'''&lt;br /&gt;
    do task &amp;quot;B&amp;quot;&lt;br /&gt;
 '''end if'''&lt;br /&gt;
 ...&lt;br /&gt;
 end program&lt;br /&gt;
&lt;br /&gt;
The goal of the program is to perform some net total task (&amp;quot;A+B&amp;quot;). If we write the code as above and launch it on a 2-processor system, the runtime environment will execute it as follows.&lt;br /&gt;
In an SPMD system, both CPUs will execute the code.&lt;br /&gt;
In a parallel environment, both will have access to the same data.&lt;br /&gt;
The &amp;quot;if&amp;quot; clause differentiates between the CPUs: CPU &amp;quot;a&amp;quot; will evaluate the &amp;quot;if&amp;quot; as true and CPU &amp;quot;b&amp;quot; will evaluate the &amp;quot;else if&amp;quot; as true, so each CPU takes on its own task.&lt;br /&gt;
Both CPUs then execute separate code blocks, performing different tasks simultaneously.&lt;br /&gt;
&lt;br /&gt;
Code executed by CPU &amp;quot;a&amp;quot;:&lt;br /&gt;
 program:&lt;br /&gt;
 ...&lt;br /&gt;
 do task &amp;quot;A&amp;quot;&lt;br /&gt;
 ...&lt;br /&gt;
 end program&lt;br /&gt;
&lt;br /&gt;
Code executed by CPU &amp;quot;b&amp;quot;:&lt;br /&gt;
 program:&lt;br /&gt;
 ...&lt;br /&gt;
 do task &amp;quot;B&amp;quot;&lt;br /&gt;
 ...&lt;br /&gt;
 end program&lt;br /&gt;
&lt;br /&gt;
This concept can now be generalized to any number of processors, as sketched below.&lt;br /&gt;
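&lt;br /&gt;
A minimal sketch of that generalization is shown below. It assumes a hypothetical &amp;lt;code&amp;gt;task[]&amp;lt;/code&amp;gt; table that maps each CPU id to the task it should run, and a &amp;lt;code&amp;gt;number_of_tasks&amp;lt;/code&amp;gt; count; only &amp;lt;code&amp;gt;getid()&amp;lt;/code&amp;gt; appears in the earlier data-parallel example.&lt;br /&gt;
&lt;br /&gt;
 // generalized task parallel dispatch (a sketch; the task[] table and number_of_tasks are assumed)&lt;br /&gt;
 cpu_id = getid();                                         //each CPU discovers its own id&lt;br /&gt;
 '''if''' (cpu_id &amp;lt; number_of_tasks)&lt;br /&gt;
 {&lt;br /&gt;
    do task[cpu_id];                                       //CPU i performs the i-th task&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
With this structure, supporting an additional processor only requires adding a corresponding entry to the task table.&lt;br /&gt;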
&lt;br /&gt;
= Data Parallel Model vs Task Parallel Model =&lt;br /&gt;
&lt;br /&gt;
= Definitions =&lt;br /&gt;
&lt;br /&gt;
= References =&lt;br /&gt;
* David E. Culler, Jaswinder Pal Singh, and Anoop Gupta, [http://portal.acm.org/citation.cfm?id=550071 ''Parallel Computer Architecture: A Hardware/Software Approach,''] Morgan Kaufmann, 1999.&lt;br /&gt;
* Ian Foster, [http://www.mcs.anl.gov/~itf/dbpp/ ''Designing and Building Parallel Programs,''] Addison-Wesley, 1995.&lt;br /&gt;
* Magne Haveraaen, [http://portal.acm.org/citation.cfm?id=1239917 &amp;quot;Machine and collection abstractions for user-implemented data-parallel programming,&amp;quot;] ''Scientific Programming,'' 8(4):231-246, 2000.&lt;br /&gt;
* W. Daniel Hillis and Guy L. Steele, Jr., [http://portal.acm.org/citation.cfm?id=7903 &amp;quot;Data parallel algorithms,&amp;quot;] ''Communications of the ACM,'' 29(12):1170-1183, December 1986.&lt;br /&gt;
* Alexander C. Klaiber and Henry M. Levy, [http://portal.acm.org/citation.cfm?id=192020 &amp;quot;A comparison of message passing and shared memory architectures for data parallel programs,&amp;quot;] in ''Proceedings of the 21st Annual International Symposium on Computer Architecture,'' April 1994, pp. 94-105.&lt;br /&gt;
* Yan Solihin, ''Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems,'' Solihin Books, 2008.&lt;/div&gt;</summary>
		<author><name>Vrmanda</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch2_JR&amp;diff=43539</id>
		<title>CSC/ECE 506 Spring 2011/ch2 JR</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch2_JR&amp;diff=43539"/>
		<updated>2011-01-30T00:35:27Z</updated>

		<summary type="html">&lt;p&gt;Vrmanda: /* Comparison with Message Passing and Shared Memory */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Supplement to Chapter 2: The Data Parallel Programming Model=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= History =&lt;br /&gt;
As computer architectures have evolved, so have parallel programming models. The earliest advancements in parallel computers took advantage of bit-level parallelism.  These computers used vector processing, which required a shared memory programming model.  As performance returns from this architecture diminished, the emphasis was placed on instruction-level parallelism and the message passing model began to dominate.  Most recently, with the move to cluster-based machines, there has been an increased emphasis on thread-level parallelism. This has corresponded to an increased interest in the data parallel programming model.&lt;br /&gt;
&lt;br /&gt;
== Bit-level parallelism in the 1970's ==&lt;br /&gt;
The major performance improvements from computers during this time were due to the ability to execute 32-bit word size operations at one time ([[#References|Culler (1999), p. 15.]]).  The dominant supercomputers of the time, like the Cray and the ILLIAC IV, were mainly Single Instruction Multiple Data architectures and used a shared memory programming model.  They each used different forms of vector processing ([[#References|Culler (1999), p. 21.]]). &lt;br /&gt;
Development of the ILLIAC IV began in 1964 and was not finished until 1975 [http://en.wikipedia.org/wiki/ILLIAC_IV].  A central processor was connected to the main memory and delegated tasks to individual PEs, each of which had its own memory cache [http://archive.computerhistory.org/resources/text/Burroughs/Burroughs.ILLIAC%20IV.1974.102624911.pdf].  Each PE could operate on an 8-, 32-, or 64-bit operand at a given time [http://archive.computerhistory.org/resources/text/Burroughs/Burroughs.ILLIAC%20IV.1974.102624911.pdf].&lt;br /&gt;
&lt;br /&gt;
The Cray machine was installed at Los Alamos National Laboratory in 1976 by Cray Research and had similar performance to the ILLIAC IV [http://en.wikipedia.org/wiki/ILLIAC_IV].  The Cray machine relied heavily on registers rather than an array of individual processors like the ILLIAC IV.  Each processor was connected to main memory and had a number of 64-bit registers used to perform operations [http://www.eecg.toronto.edu/~moshovos/ACA05/read/cray1.pdf].&lt;br /&gt;
&lt;br /&gt;
== Move to instruction-level parallelism in the 1980's ==&lt;br /&gt;
&lt;br /&gt;
Increasing the word size above 32-bits offered diminishing returns in terms of performance ([[#References|Culler (1999), p. 15.]]). In the mid-1980's the emphasis changed from bit-level parallelism to instruction-level parallelism, which involved increasing the number of instructions that could be executed at one time ([[#References|Culler (1999), p. 15.]]).  The message passing model allowed programmers the ability to divide up instructions in order to take advantage of this architecture. &lt;br /&gt;
&lt;br /&gt;
== Thread-level parallelism ==&lt;br /&gt;
The move to cluster-based machines in the past decade, has added another layer of complexity to parallelism.  Since computers could be located across a network from each other, there is more emphasis on software acting as a bridge [http://cobweb.ecn.purdue.edu/~pplinux/ppcluster.html]. This has led to a greater emphasis on thread- or task-level parallelism [http://en.wikipedia.org/wiki/Thread-level_parallelism]  and the addition of the data parallelism programming model to existing message passing or shared memory models [http://en.wikipedia.org/wiki/Thread-level_parallelism].  &lt;br /&gt;
&lt;br /&gt;
= Data Parallel Model =&lt;br /&gt;
One important feature of data-parallel programming model or data parallelism (SIMD) is the single control flow. Flynn's taxonomy classifies SIMD to be analogous to doing the same operation repeatedly over a large data set. There is only one control processor that directs the activities of all the processing elements. In a multiprocessor system executing a single set of instructions (SIMD), data parallelism is achieved when each processor performs the same task on different pieces of distributed data. In some situations, a single execution thread controls operations on all pieces of data. In others, different threads control the operation, but they execute the same code.&lt;br /&gt;
&lt;br /&gt;
== Description and Example ==&lt;br /&gt;
&lt;br /&gt;
This section shows a simple example adapted from Solihin textbook (pp. 24 - 27) that illustrates  the data-parallel programming model. Each of the codes below are written in pseudo-code style.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Suppose we want to perform the following task on an array &amp;lt;code&amp;gt;a&amp;lt;/code&amp;gt;: updating each element of &amp;lt;code&amp;gt;a&amp;lt;/code&amp;gt; by the product of itself and its index, and adding together the elements of &amp;lt;code&amp;gt;a&amp;lt;/code&amp;gt; into the variable &amp;lt;code&amp;gt;sum&amp;lt;/code&amp;gt;. The corresponding code is shown below.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 // simple sequential task&lt;br /&gt;
 sum = 0;&lt;br /&gt;
 '''for''' (i = 0; i &amp;lt; a.length; i++)&lt;br /&gt;
 {&lt;br /&gt;
    a[i] = a[i] * i;&lt;br /&gt;
    sum = sum + a[i];&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
When we orchestrate the task using the data-parallel programming model, the program can be divided into two parts. The first part performs the same operations on separate elements of the array for each processing element (sometimes referred to as PE or pe), and the second part reorganizes data among all processing elements (In our example data reorganization is summing up values across different processing elements). Since data-parallel programming model only defines the overall effects of parallel steps, the second part can be accomplished either through shared memory or message passing. The three code fragments below are examples for the first part of the program, shared-memory version of the second part, and message passing for the second part, respectively.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 // data parallel programming: let each PE perform the same task on different pieces of distributed data&lt;br /&gt;
 pe_id = getid();&lt;br /&gt;
 my_sum = 0;&lt;br /&gt;
 '''for''' (i = pe_id; i &amp;lt; a.length; i += number_of_pe)         //separate elements of the array are assigned to each PE &lt;br /&gt;
 {&lt;br /&gt;
    a[i] = a[i] * i;&lt;br /&gt;
    my_sum = my_sum + a[i];                               //all PEs accumulate elements assigned to them into local variable my_sum&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In the above code, data parallelism is achieved by letting each processing element perform actions on array's separate elements, which are identified using the PE's id. For instance, if three processing elements are used then one processing element would start at i = 0, one would start at i = 1, and the last would start at i = 2. Since there are three processing elements then the index of the array for each will increase by three on each iteration until the task is complete (note that in our example elements assigned to each PE are interleaved instead of continuous). If the length of the array is a multiple of three then each processing element takes the same amount of time to execute its portion of the task.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The picture below illustrates how elements of the array are assigned among different PEs for the specific case: length of the array is 7 and there are 3 PEs available. Elements in the array are marked by their indexes (0 to 6). As shown in the picture, PE0 will work on elements with index 0, 3, 6; PE1 is in charge of elements with index 1, 4; and elements with index 2, 5 are assigned to PE2. In this way, these 3 PEs work collectively on the array, while each PE works on different elements. Thus, data parallelism is achieved.&lt;br /&gt;
&lt;br /&gt;
[[Image:506wiki1.png|frame|center|150px|Illustration of data parallel programming(adapted from [http://computing.llnl.gov/tutorials/parallel_comp/#ModelsData Introduction to Parallel Computing])]]&lt;br /&gt;
&lt;br /&gt;
== Comparison with Message Passing and Shared Memory ==&lt;br /&gt;
&lt;br /&gt;
All the differences that exist between a data parallel programming paradigm and task &lt;br /&gt;
&lt;br /&gt;
Although the shared memory and message passing models may be combined into hybrid approaches, the two models are fundamentally different ways of addressing the same problem (of access control to common data).&lt;br /&gt;
Access control is the &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 In contrast, the data parallel model is concerned with a fundamentally different problem (how to divide work into parallel tasks). As such, the data parallel model may be used in conjunction with either the shared memory or the message passing model without conflict. In fact, Klaiber (1994) compares the performance of a number of data parallel programs implemented with both shared memory and message passing models.&lt;br /&gt;
As discussed in the previous section, one of the major advantages of combining the data parallel and message passing models is a reduction in the amount and complexity of communication required relative to a task parallel approach. Similarly, combining the data parallel and shared memory models tends to simplify and reduce the amount of synchronization required. If the task parallel code given above were modified from a message passing model to a shared memory model, the two threads would require 8 signals be sent between the threads (instead of 8 messages). In contrast, the data parallel code would require a single barrier before the local sums are added to compute the full sum.&lt;br /&gt;
Much as the shared memory model can benefit from specialized hardware, the data parallel programming model can as well. SIMD (single-instruction-multiple-data) processors are specifically designed to run data parallel algorithms. These processors perform a single instruction on many different data locations simultaneously. Modern examples include CUDA processors developed by nVidia and Cell processors developed by STI (Sony, Toshiba, and IBM). For the curious, example code for CUDA processors is provided in the Appendix. However, whereas the shared memory model can be a difficult and costly abstraction in the absence of hardware support, the data parallel model—like the message passing model—does not require hardware support.&lt;br /&gt;
Since data parallel code tends to simplify communication and synchronization, data parallel code may be easier to develop than a more task parallel approach. However, data parallel code also requires writing code to split program data into chunks and assign it to different threads. In addition, it is possible that a problem may not decompose easily into subproblems relying on largely independent chunks of data. In this case, it may be impractical or impossible to apply the data parallel model.&lt;br /&gt;
Once written, data parallel programs can scale easily to large numbers of processors. The data parallel model implicitly encourages data locality by having each thread work on a chunk of data. The regular data chunks also make it easier to reason about where to locate data and how to organize it.&lt;br /&gt;
&lt;br /&gt;
= Task Parallel Model =&lt;br /&gt;
&lt;br /&gt;
== Description and Example ==&lt;br /&gt;
&lt;br /&gt;
= Data Parallel Model vs Task Parallel Model =&lt;br /&gt;
&lt;br /&gt;
= Definitions =&lt;br /&gt;
&lt;br /&gt;
= References =&lt;br /&gt;
* David E. Culler, Jaswinder Pal Singh, and Anoop Gupta, [http://portal.acm.org/citation.cfm?id=550071 ''Parallel Computer Architecture: A Hardware/Software Approach,''] Morgan-Kauffman, 1999.&lt;br /&gt;
* Ian Foster, [http://www.mcs.anl.gov/~itf/dbpp/ ''Designing and Building Parallel Programs,''] Addison-Wesley, 1995.&lt;br /&gt;
* Magne Haveraaen, [http://portal.acm.org/citation.cfm?id=1239917 &amp;quot;Machine and collection abstractions for user-implemented data-parallel programming,&amp;quot;] ''Scientific Programming,'' 8(4):231-246, 2000.&lt;br /&gt;
* W. Daniel Hillis and Guy L. Steele, Jr., [http://portal.acm.org/citation.cfm?id=7903 &amp;quot;Data parallel algorithms,&amp;quot;] ''Communications of the ACM,'' 29(12):1170-1183, December 1986.&lt;br /&gt;
* Alexander C. Klaiber and Henry M. Levy, [http://portal.acm.org/citation.cfm?id=192020 &amp;quot;A comparison of message passing and shared memory architectures for data parallel programs,&amp;quot;] in ''Proceedings of the 21st Annual International Symposium on Computer Architecture,'' April 1994, pp. 94-105.&lt;br /&gt;
* Yan Solihin, ''Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems,'' Solihin Books, 2008.&lt;/div&gt;</summary>
		<author><name>Vrmanda</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch2_JR&amp;diff=43538</id>
		<title>CSC/ECE 506 Spring 2011/ch2 JR</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch2_JR&amp;diff=43538"/>
		<updated>2011-01-30T00:28:32Z</updated>

		<summary type="html">&lt;p&gt;Vrmanda: /* Data Parallel Model */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Supplement to Chapter 2: The Data Parallel Programming Model=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= History =&lt;br /&gt;
As computer architectures have evolved, so have parallel programming models. The earliest advancements in parallel computers took advantage of bit-level parallelism.  These computers used vector processing, which required a shared memory programming model.  As performance returns from this architecture diminished, the emphasis was placed on instruction-level parallelism and the message passing model began to dominate.  Most recently, with the move to cluster-based machines, there has been an increased emphasis on thread-level parallelism. This has corresponded to an increase interest in the data parallel programming model.&lt;br /&gt;
&lt;br /&gt;
== Bit-level parallelism in the 1970's ==&lt;br /&gt;
The major performance improvements in computers during this time came from the ability to execute operations on 32-bit words at one time ([[#References|Culler (1999), p. 15.]]).  The dominant supercomputers of the time, like the Cray and the ILLIAC IV, were mainly Single Instruction Multiple Data architectures and used a shared memory programming model.  They each used different forms of vector processing ([[#References|Culler (1999), p. 21.]]). &lt;br /&gt;
Development of the ILLIAC IV began in 1964 and wasn't finished until 1975 [http://en.wikipedia.org/wiki/ILLIAC_IV].  A central processor was connected to the main memory and delegated tasks to individual PEs, each of which had its own memory cache [http://archive.computerhistory.org/resources/text/Burroughs/Burroughs.ILLIAC%20IV.1974.102624911.pdf].  Each PE could operate on an 8-, 32-, or 64-bit operand at a given time [http://archive.computerhistory.org/resources/text/Burroughs/Burroughs.ILLIAC%20IV.1974.102624911.pdf].&lt;br /&gt;
&lt;br /&gt;
The Cray-1, built by Cray Research, was installed at Los Alamos National Laboratory in 1976 and offered performance similar to that of the ILLIAC IV [http://en.wikipedia.org/wiki/ILLIAC_IV].  The Cray machine relied heavily on registers rather than the array of individual processing elements used in the ILLIAC IV.  Each processor was connected to main memory and had a number of 64-bit registers used to perform operations [http://www.eecg.toronto.edu/~moshovos/ACA05/read/cray1.pdf].&lt;br /&gt;
&lt;br /&gt;
== Move to instruction-level parallelism in the 1980's ==&lt;br /&gt;
&lt;br /&gt;
Increasing the word size above 32 bits offered diminishing returns in terms of performance ([[#References|Culler (1999), p. 15.]]). In the mid-1980's the emphasis changed from bit-level parallelism to instruction-level parallelism, which involved increasing the number of instructions that could be executed at one time ([[#References|Culler (1999), p. 15.]]).  The message passing model allowed programmers to divide up instructions in order to take advantage of this architecture. &lt;br /&gt;
&lt;br /&gt;
== Thread-level parallelism ==&lt;br /&gt;
The move to cluster-based machines in the past decade has added another layer of complexity to parallelism.  Since computers may be located across a network from each other, there is more emphasis on software acting as a bridge [http://cobweb.ecn.purdue.edu/~pplinux/ppcluster.html]. This has led to a greater emphasis on thread- or task-level parallelism [http://en.wikipedia.org/wiki/Thread-level_parallelism] and to the addition of the data parallel programming model to existing message passing or shared memory models [http://en.wikipedia.org/wiki/Thread-level_parallelism].  &lt;br /&gt;
&lt;br /&gt;
= Data Parallel Model =&lt;br /&gt;
One important feature of the data-parallel programming model, or data parallelism (SIMD), is its single control flow. In Flynn's taxonomy, SIMD corresponds to performing the same operation repeatedly over a large data set. There is only one control processor that directs the activities of all the processing elements. In a multiprocessor system executing a single set of instructions (SIMD), data parallelism is achieved when each processor performs the same task on different pieces of distributed data. In some situations, a single execution thread controls operations on all pieces of data. In others, different threads control the operation, but they execute the same code.&lt;br /&gt;
&lt;br /&gt;
== Description and Example ==&lt;br /&gt;
&lt;br /&gt;
This section shows a simple example, adapted from the Solihin textbook (pp. 24 - 27), that illustrates the data-parallel programming model. Each of the code fragments below is written in a pseudo-code style.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Suppose we want to perform the following task on an array &amp;lt;code&amp;gt;a&amp;lt;/code&amp;gt;: updating each element of &amp;lt;code&amp;gt;a&amp;lt;/code&amp;gt; to the product of itself and its index, and adding the elements of &amp;lt;code&amp;gt;a&amp;lt;/code&amp;gt; together into the variable &amp;lt;code&amp;gt;sum&amp;lt;/code&amp;gt;. The corresponding code is shown below.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 // simple sequential task&lt;br /&gt;
 sum = 0;&lt;br /&gt;
 '''for''' (i = 0; i &amp;lt; a.length; i++)&lt;br /&gt;
 {&lt;br /&gt;
    a[i] = a[i] * i;&lt;br /&gt;
    sum = sum + a[i];&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
When we orchestrate the task using the data-parallel programming model, the program can be divided into two parts. The first part performs the same operations on separate elements of the array for each processing element (sometimes referred to as PE or pe), and the second part reorganizes data among all processing elements (in our example, the data reorganization is the summing of values across the different processing elements). Since the data-parallel programming model only defines the overall effects of the parallel steps, the second part can be accomplished either through shared memory or message passing. The code fragment immediately below implements the first part of the program; sketches of a shared-memory version and a message-passing version of the second part follow the illustration further down.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 // data parallel programming: let each PE perform the same task on different pieces of distributed data&lt;br /&gt;
 pe_id = getid();&lt;br /&gt;
 my_sum = 0;&lt;br /&gt;
 '''for''' (i = pe_id; i &amp;lt; a.length; i += number_of_pe)         //separate elements of the array are assigned to each PE &lt;br /&gt;
 {&lt;br /&gt;
    a[i] = a[i] * i;&lt;br /&gt;
    my_sum = my_sum + a[i];                               //all PEs accumulate elements assigned to them into local variable my_sum&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In the above code, data parallelism is achieved by letting each processing element perform actions on separate elements of the array, which are identified using the PE's id. For instance, if three processing elements are used, then one processing element starts at i = 0, one starts at i = 1, and the last starts at i = 2. Since there are three processing elements, the array index for each one increases by three on each iteration until the task is complete (note that in our example the elements assigned to each PE are interleaved rather than contiguous). If the length of the array is a multiple of three, each processing element takes the same amount of time to execute its portion of the task.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The picture below illustrates how elements of the array are assigned among the different PEs for the specific case in which the length of the array is 7 and 3 PEs are available. Elements in the array are marked by their indexes (0 to 6). As shown in the picture, PE0 works on the elements with indexes 0, 3, and 6; PE1 is in charge of the elements with indexes 1 and 4; and the elements with indexes 2 and 5 are assigned to PE2. In this way, the 3 PEs work collectively on the array, while each PE works on different elements. Thus, data parallelism is achieved.&lt;br /&gt;
&lt;br /&gt;
[[Image:506wiki1.png|frame|center|150px|Illustration of data parallel programming (adapted from [http://computing.llnl.gov/tutorials/parallel_comp/#ModelsData Introduction to Parallel Computing])]]&lt;br /&gt;
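&lt;br /&gt;
&lt;br /&gt;
The second part of the program gathers the per-PE partial sums into the single variable sum. The two fragments below are minimal sketches of how this data reorganization might look, first with shared memory and then with message passing. The primitives lock(), unlock(), barrier(), send(), and recv(), as well as the variables sum_lock, temp, and p, are assumed here for illustration and are not taken from the original example.&lt;br /&gt;
&lt;br /&gt;
 // shared-memory sketch of the second part: each PE adds its partial sum into the shared variable sum&lt;br /&gt;
 lock(sum_lock);              // mutual exclusion while updating the shared sum&lt;br /&gt;
 sum = sum + my_sum;&lt;br /&gt;
 unlock(sum_lock);&lt;br /&gt;
 barrier();                   // wait until every PE has contributed its partial sum&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 // message-passing sketch of the second part: each PE sends its partial sum to PE 0, which accumulates the total&lt;br /&gt;
 '''if''' (pe_id != 0)&lt;br /&gt;
    send(my_sum, 0);          // non-root PEs send their partial sum to PE 0&lt;br /&gt;
 '''else'''&lt;br /&gt;
 {&lt;br /&gt;
    sum = my_sum;&lt;br /&gt;
    '''for''' (p = 1; p &amp;lt; number_of_pe; p++)&lt;br /&gt;
    {&lt;br /&gt;
       recv(temp, p);         // receive a partial sum from PE p&lt;br /&gt;
       sum = sum + temp;&lt;br /&gt;
    }&lt;br /&gt;
 }&lt;br /&gt;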
&lt;br /&gt;
== Comparison with Message Passing and Shared Memory ==&lt;br /&gt;
&lt;br /&gt;
= Task Parallel Model =&lt;br /&gt;
&lt;br /&gt;
== Description and Example ==&lt;br /&gt;
&lt;br /&gt;
= Data Parallel Model vs Task Parallel Model =&lt;br /&gt;
&lt;br /&gt;
= Definitions =&lt;br /&gt;
&lt;br /&gt;
= References =&lt;br /&gt;
* David E. Culler, Jaswinder Pal Singh, and Anoop Gupta, [http://portal.acm.org/citation.cfm?id=550071 ''Parallel Computer Architecture: A Hardware/Software Approach,''] Morgan Kaufmann, 1999.&lt;br /&gt;
* Ian Foster, [http://www.mcs.anl.gov/~itf/dbpp/ ''Designing and Building Parallel Programs,''] Addison-Wesley, 1995.&lt;br /&gt;
* Magne Haveraaen, [http://portal.acm.org/citation.cfm?id=1239917 &amp;quot;Machine and collection abstractions for user-implemented data-parallel programming,&amp;quot;] ''Scientific Programming,'' 8(4):231-246, 2000.&lt;br /&gt;
* W. Daniel Hillis and Guy L. Steele, Jr., [http://portal.acm.org/citation.cfm?id=7903 &amp;quot;Data parallel algorithms,&amp;quot;] ''Communications of the ACM,'' 29(12):1170-1183, December 1986.&lt;br /&gt;
* Alexander C. Klaiber and Henry M. Levy, [http://portal.acm.org/citation.cfm?id=192020 &amp;quot;A comparison of message passing and shared memory architectures for data parallel programs,&amp;quot;] in ''Proceedings of the 21st Annual International Symposium on Computer Architecture,'' April 1994, pp. 94-105.&lt;br /&gt;
* Yan Solihin, ''Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems,'' Solihin Books, 2008.&lt;/div&gt;</summary>
		<author><name>Vrmanda</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch2_JR&amp;diff=43537</id>
		<title>CSC/ECE 506 Spring 2011/ch2 JR</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch2_JR&amp;diff=43537"/>
		<updated>2011-01-30T00:26:44Z</updated>

		<summary type="html">&lt;p&gt;Vrmanda: /* Data Parallel Model */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Supplement to Chapter 2: The Data Parallel Programming Model=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= History =&lt;br /&gt;
As computer architectures have evolved, so have parallel programming models. The earliest advancements in parallel computers took advantage of bit-level parallelism.  These computers used vector processing, which required a shared memory programming model.  As performance returns from this architecture diminished, the emphasis shifted to instruction-level parallelism and the message passing model began to dominate.  Most recently, with the move to cluster-based machines, there has been an increased emphasis on thread-level parallelism. This has corresponded to an increased interest in the data parallel programming model.&lt;br /&gt;
&lt;br /&gt;
== Bit-level parallelism in the 1970's ==&lt;br /&gt;
The major performance improvements in computers during this time came from the ability to execute operations on 32-bit words at one time ([[#References|Culler (1999), p. 15.]]).  The dominant supercomputers of the time, like the Cray and the ILLIAC IV, were mainly Single Instruction Multiple Data architectures and used a shared memory programming model.  They each used different forms of vector processing ([[#References|Culler (1999), p. 21.]]). &lt;br /&gt;
Development of the ILLIAC IV began in 1964 and wasn't finished until 1975 [http://en.wikipedia.org/wiki/ILLIAC_IV].  A central processor was connected to the main memory and delegated tasks to individual PEs, each of which had its own memory cache [http://archive.computerhistory.org/resources/text/Burroughs/Burroughs.ILLIAC%20IV.1974.102624911.pdf].  Each PE could operate on an 8-, 32-, or 64-bit operand at a given time [http://archive.computerhistory.org/resources/text/Burroughs/Burroughs.ILLIAC%20IV.1974.102624911.pdf].&lt;br /&gt;
&lt;br /&gt;
The Cray-1, built by Cray Research, was installed at Los Alamos National Laboratory in 1976 and offered performance similar to that of the ILLIAC IV [http://en.wikipedia.org/wiki/ILLIAC_IV].  The Cray machine relied heavily on registers rather than the array of individual processing elements used in the ILLIAC IV.  Each processor was connected to main memory and had a number of 64-bit registers used to perform operations [http://www.eecg.toronto.edu/~moshovos/ACA05/read/cray1.pdf].&lt;br /&gt;
&lt;br /&gt;
== Move to instruction-level parallelism in the 1980's ==&lt;br /&gt;
&lt;br /&gt;
Increasing the word size above 32 bits offered diminishing returns in terms of performance ([[#References|Culler (1999), p. 15.]]). In the mid-1980's the emphasis changed from bit-level parallelism to instruction-level parallelism, which involved increasing the number of instructions that could be executed at one time ([[#References|Culler (1999), p. 15.]]).  The message passing model allowed programmers to divide up instructions in order to take advantage of this architecture. &lt;br /&gt;
&lt;br /&gt;
== Thread-level parallelism ==&lt;br /&gt;
The move to cluster-based machines in the past decade has added another layer of complexity to parallelism.  Since computers may be located across a network from each other, there is more emphasis on software acting as a bridge [http://cobweb.ecn.purdue.edu/~pplinux/ppcluster.html]. This has led to a greater emphasis on thread- or task-level parallelism [http://en.wikipedia.org/wiki/Thread-level_parallelism] and to the addition of the data parallel programming model to existing message passing or shared memory models [http://en.wikipedia.org/wiki/Thread-level_parallelism].  &lt;br /&gt;
&lt;br /&gt;
= Data Parallel Model =&lt;br /&gt;
One important feature of the data-parallel programming model, or data parallelism (SIMD), is its single control flow. In Flynn's taxonomy, SIMD corresponds to performing the same operation repeatedly over a large data set. There is only one control processor that directs the activities of all the processing elements. In a multiprocessor system executing a single set of instructions (SIMD), data parallelism is achieved when each processor performs the same task on different pieces of distributed data. In some situations, a single execution thread controls operations on all pieces of data. In others, different threads control the operation, but they execute the same code.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Description and Example ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
This section shows a simple example, adapted from the Solihin textbook (pp. 24 - 27), that illustrates the data-parallel programming model. Each of the code fragments below is written in a pseudo-code style.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
Suppose we want to perform the following task on an array &amp;lt;code&amp;gt;a&amp;lt;/code&amp;gt;: updating each element of &amp;lt;code&amp;gt;a&amp;lt;/code&amp;gt; to the product of itself and its index, and adding the elements of &amp;lt;code&amp;gt;a&amp;lt;/code&amp;gt; together into the variable &amp;lt;code&amp;gt;sum&amp;lt;/code&amp;gt;. The corresponding code is shown below.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 // simple sequential task&lt;br /&gt;
 sum = 0;&lt;br /&gt;
 '''for''' (i = 0; i &amp;lt; a.length; i++)&lt;br /&gt;
 {&lt;br /&gt;
    a[i] = a[i] * i;&lt;br /&gt;
    sum = sum + a[i];&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
When we orchestrate the task using the data-parallel programming model, the program can be divided into two parts. The first part performs the same operations on separate elements of the array for each processing element (sometimes referred to as PE or pe), and the second part reorganizes data among all processing elements (in our example, the data reorganization is the summing of values across the different processing elements). Since the data-parallel programming model only defines the overall effects of the parallel steps, the second part can be accomplished either through shared memory or message passing. The code fragment immediately below implements the first part of the program; sketches of a shared-memory version and a message-passing version of the second part follow the illustration further down.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 // data parallel programming: let each PE perform the same task on different pieces of distributed data&lt;br /&gt;
 pe_id = getid();&lt;br /&gt;
 my_sum = 0;&lt;br /&gt;
 '''for''' (i = pe_id; i &amp;lt; a.length; i += number_of_pe)         //separate elements of the array are assigned to each PE &lt;br /&gt;
 {&lt;br /&gt;
    a[i] = a[i] * i;&lt;br /&gt;
    my_sum = my_sum + a[i];                               //all PEs accumulate elements assigned to them into local variable my_sum&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In the above code, data parallelism is achieved by letting each processing element perform actions on separate elements of the array, which are identified using the PE's id. For instance, if three processing elements are used, then one processing element starts at i = 0, one starts at i = 1, and the last starts at i = 2. Since there are three processing elements, the array index for each one increases by three on each iteration until the task is complete (note that in our example the elements assigned to each PE are interleaved rather than contiguous). If the length of the array is a multiple of three, each processing element takes the same amount of time to execute its portion of the task.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
The picture below illustrates how elements of the array are assigned among the different PEs for the specific case in which the length of the array is 7 and 3 PEs are available. Elements in the array are marked by their indexes (0 to 6). As shown in the picture, PE0 works on the elements with indexes 0, 3, and 6; PE1 is in charge of the elements with indexes 1 and 4; and the elements with indexes 2 and 5 are assigned to PE2. In this way, the 3 PEs work collectively on the array, while each PE works on different elements. Thus, data parallelism is achieved.&lt;br /&gt;
&lt;br /&gt;
[[Image:506wiki1.png|frame|center|150px|Illustration of data parallel programming (adapted from [http://computing.llnl.gov/tutorials/parallel_comp/#ModelsData Introduction to Parallel Computing])]]&lt;br /&gt;
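&lt;br /&gt;
&lt;br /&gt;
The second part of the program gathers the per-PE partial sums into the single variable sum. The two fragments below are minimal sketches of how this data reorganization might look, first with shared memory and then with message passing. The primitives lock(), unlock(), barrier(), send(), and recv(), as well as the variables sum_lock, temp, and p, are assumed here for illustration and are not taken from the original example.&lt;br /&gt;
&lt;br /&gt;
 // shared-memory sketch of the second part: each PE adds its partial sum into the shared variable sum&lt;br /&gt;
 lock(sum_lock);              // mutual exclusion while updating the shared sum&lt;br /&gt;
 sum = sum + my_sum;&lt;br /&gt;
 unlock(sum_lock);&lt;br /&gt;
 barrier();                   // wait until every PE has contributed its partial sum&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 // message-passing sketch of the second part: each PE sends its partial sum to PE 0, which accumulates the total&lt;br /&gt;
 '''if''' (pe_id != 0)&lt;br /&gt;
    send(my_sum, 0);          // non-root PEs send their partial sum to PE 0&lt;br /&gt;
 '''else'''&lt;br /&gt;
 {&lt;br /&gt;
    sum = my_sum;&lt;br /&gt;
    '''for''' (p = 1; p &amp;lt; number_of_pe; p++)&lt;br /&gt;
    {&lt;br /&gt;
       recv(temp, p);         // receive a partial sum from PE p&lt;br /&gt;
       sum = sum + temp;&lt;br /&gt;
    }&lt;br /&gt;
 }&lt;br /&gt;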
&lt;br /&gt;
== Comparison with Message Passing and Shared Memory ==&lt;br /&gt;
&lt;br /&gt;
= Task Parallel Model =&lt;br /&gt;
&lt;br /&gt;
== Description and Example ==&lt;br /&gt;
&lt;br /&gt;
= Data Parallel Model vs Task Parallel Model =&lt;br /&gt;
&lt;br /&gt;
= Definitions =&lt;br /&gt;
&lt;br /&gt;
= References =&lt;br /&gt;
* David E. Culler, Jaswinder Pal Singh, and Anoop Gupta, [http://portal.acm.org/citation.cfm?id=550071 ''Parallel Computer Architecture: A Hardware/Software Approach,''] Morgan Kaufmann, 1999.&lt;br /&gt;
* Ian Foster, [http://www.mcs.anl.gov/~itf/dbpp/ ''Designing and Building Parallel Programs,''] Addison-Wesley, 1995.&lt;br /&gt;
* Magne Haveraaen, [http://portal.acm.org/citation.cfm?id=1239917 &amp;quot;Machine and collection abstractions for user-implemented data-parallel programming,&amp;quot;] ''Scientific Programming,'' 8(4):231-246, 2000.&lt;br /&gt;
* W. Daniel Hillis and Guy L. Steele, Jr., [http://portal.acm.org/citation.cfm?id=7903 &amp;quot;Data parallel algorithms,&amp;quot;] ''Communications of the ACM,'' 29(12):1170-1183, December 1986.&lt;br /&gt;
* Alexander C. Klaiber and Henry M. Levy, [http://portal.acm.org/citation.cfm?id=192020 &amp;quot;A comparison of message passing and shared memory architectures for data parallel programs,&amp;quot;] in ''Proceedings of the 21st Annual International Symposium on Computer Architecture,'' April 1994, pp. 94-105.&lt;br /&gt;
* Yan Solihin, ''Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems,'' Solihin Books, 2008.&lt;/div&gt;</summary>
		<author><name>Vrmanda</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch2_JR&amp;diff=43536</id>
		<title>CSC/ECE 506 Spring 2011/ch2 JR</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch2_JR&amp;diff=43536"/>
		<updated>2011-01-30T00:04:23Z</updated>

		<summary type="html">&lt;p&gt;Vrmanda: /* Data Parallel Model */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Supplement to Chapter 2: The Data Parallel Programming Model=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= History =&lt;br /&gt;
As computer architectures have evolved, so have parallel programming models. The earliest advancements in parallel computers took advantage of bit-level parallelism.  These computers used vector processing, which required a shared memory programming model.  As performance returns from this architecture diminished, the emphasis shifted to instruction-level parallelism and the message passing model began to dominate.  Most recently, with the move to cluster-based machines, there has been an increased emphasis on thread-level parallelism. This has corresponded to an increased interest in the data parallel programming model.&lt;br /&gt;
&lt;br /&gt;
== Bit-level parallelism in the 1970's ==&lt;br /&gt;
The major performance improvements in computers during this time came from the ability to execute operations on 32-bit words at one time ([[#References|Culler (1999), p. 15.]]).  The dominant supercomputers of the time, like the Cray and the ILLIAC IV, were mainly Single Instruction Multiple Data architectures and used a shared memory programming model.  They each used different forms of vector processing ([[#References|Culler (1999), p. 21.]]). &lt;br /&gt;
Development of the ILLIAC IV began in 1964 and wasn't finished until 1975 [http://en.wikipedia.org/wiki/ILLIAC_IV].  A central processor was connected to the main memory and delegated tasks to individual PEs, each of which had its own memory cache [http://archive.computerhistory.org/resources/text/Burroughs/Burroughs.ILLIAC%20IV.1974.102624911.pdf].  Each PE could operate on an 8-, 32-, or 64-bit operand at a given time [http://archive.computerhistory.org/resources/text/Burroughs/Burroughs.ILLIAC%20IV.1974.102624911.pdf].&lt;br /&gt;
&lt;br /&gt;
The Cray-1, built by Cray Research, was installed at Los Alamos National Laboratory in 1976 and offered performance similar to that of the ILLIAC IV [http://en.wikipedia.org/wiki/ILLIAC_IV].  The Cray machine relied heavily on registers rather than the array of individual processing elements used in the ILLIAC IV.  Each processor was connected to main memory and had a number of 64-bit registers used to perform operations [http://www.eecg.toronto.edu/~moshovos/ACA05/read/cray1.pdf].&lt;br /&gt;
&lt;br /&gt;
== Move to instruction-level parallelism in the 1980's ==&lt;br /&gt;
&lt;br /&gt;
Increasing the word size above 32 bits offered diminishing returns in terms of performance ([[#References|Culler (1999), p. 15.]]). In the mid-1980's the emphasis changed from bit-level parallelism to instruction-level parallelism, which involved increasing the number of instructions that could be executed at one time ([[#References|Culler (1999), p. 15.]]).  The message passing model allowed programmers to divide up instructions in order to take advantage of this architecture. &lt;br /&gt;
&lt;br /&gt;
== Thread-level parallelism ==&lt;br /&gt;
The move to cluster-based machines in the past decade has added another layer of complexity to parallelism.  Since computers may be located across a network from each other, there is more emphasis on software acting as a bridge [http://cobweb.ecn.purdue.edu/~pplinux/ppcluster.html]. This has led to a greater emphasis on thread- or task-level parallelism [http://en.wikipedia.org/wiki/Thread-level_parallelism] and to the addition of the data parallel programming model to existing message passing or shared memory models [http://en.wikipedia.org/wiki/Thread-level_parallelism].  &lt;br /&gt;
&lt;br /&gt;
= Data Parallel Model =&lt;br /&gt;
One important feature of the data-parallel programming model, or data parallelism (SIMD), is its single control flow. In Flynn's taxonomy, SIMD corresponds to performing the same operation repeatedly over a large data set. There is only one control processor that directs the activities of all the processing elements. &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
In stark contrast to this is task parallelism (MIMD: Multiple Instruction, Multiple Data): characterized by multiple control flows, it allows the concurrent execution of multiple instruction streams, each of which manipulates its own data and serves separate functions. Below is a contrast between the data parallelism and task parallelism models, based on the Wikipedia articles on SIMD and MIMD. In the following subsections we continue to compare and contrast features of the data-parallel and task-parallel models to help the reader understand the unique characteristics of the data-parallel programming model.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Description and Example ==&lt;br /&gt;
&lt;br /&gt;
== Comparison with Message Passing and Shared Memory ==&lt;br /&gt;
&lt;br /&gt;
= Task Parallel Model =&lt;br /&gt;
&lt;br /&gt;
== Description and Example ==&lt;br /&gt;
&lt;br /&gt;
= Data Parallel Model vs Task Parallel Model =&lt;br /&gt;
&lt;br /&gt;
= Definitions =&lt;br /&gt;
&lt;br /&gt;
= References =&lt;br /&gt;
* David E. Culler, Jaswinder Pal Singh, and Anoop Gupta, [http://portal.acm.org/citation.cfm?id=550071 ''Parallel Computer Architecture: A Hardware/Software Approach,''] Morgan Kaufmann, 1999.&lt;br /&gt;
* Ian Foster, [http://www.mcs.anl.gov/~itf/dbpp/ ''Designing and Building Parallel Programs,''] Addison-Wesley, 1995.&lt;br /&gt;
* Magne Haveraaen, [http://portal.acm.org/citation.cfm?id=1239917 &amp;quot;Machine and collection abstractions for user-implemented data-parallel programming,&amp;quot;] ''Scientific Programming,'' 8(4):231-246, 2000.&lt;br /&gt;
* W. Daniel Hillis and Guy L. Steele, Jr., [http://portal.acm.org/citation.cfm?id=7903 &amp;quot;Data parallel algorithms,&amp;quot;] ''Communications of the ACM,'' 29(12):1170-1183, December 1986.&lt;br /&gt;
* Alexander C. Klaiber and Henry M. Levy, [http://portal.acm.org/citation.cfm?id=192020 &amp;quot;A comparison of message passing and shared memory architectures for data parallel programs,&amp;quot;] in ''Proceedings of the 21st Annual International Symposium on Computer Architecture,'' April 1994, pp. 94-105.&lt;br /&gt;
* Yan Solihin, ''Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems,'' Solihin Books, 2008.&lt;/div&gt;</summary>
		<author><name>Vrmanda</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch2_JR&amp;diff=43535</id>
		<title>CSC/ECE 506 Spring 2011/ch2 JR</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch2_JR&amp;diff=43535"/>
		<updated>2011-01-29T23:39:31Z</updated>

		<summary type="html">&lt;p&gt;Vrmanda: /* Data Parallel Model */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Supplement to Chapter 2: The Data Parallel Programming Model=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= History =&lt;br /&gt;
As computer architectures have evolved, so have parallel programming models. The earliest advancements in parallel computers took advantage of bit-level parallelism.  These computers used vector processing, which required a shared memory programming model.  As performance returns from this architecture diminished, the emphasis shifted to instruction-level parallelism and the message passing model began to dominate.  Most recently, with the move to cluster-based machines, there has been an increased emphasis on thread-level parallelism. This has corresponded to an increased interest in the data parallel programming model.&lt;br /&gt;
&lt;br /&gt;
== Bit-level parallelism in the 1970's ==&lt;br /&gt;
The major performance improvements in computers during this time came from the ability to execute operations on 32-bit words at one time ([[#References|Culler (1999), p. 15.]]).  The dominant supercomputers of the time, like the Cray and the ILLIAC IV, were mainly Single Instruction Multiple Data architectures and used a shared memory programming model.  They each used different forms of vector processing ([[#References|Culler (1999), p. 21.]]). &lt;br /&gt;
Development of the ILLIAC IV began in 1964 and wasn't finished until 1975 [http://en.wikipedia.org/wiki/ILLIAC_IV].  A central processor was connected to the main memory and delegated tasks to individual PEs, each of which had its own memory cache [http://archive.computerhistory.org/resources/text/Burroughs/Burroughs.ILLIAC%20IV.1974.102624911.pdf].  Each PE could operate on an 8-, 32-, or 64-bit operand at a given time [http://archive.computerhistory.org/resources/text/Burroughs/Burroughs.ILLIAC%20IV.1974.102624911.pdf].&lt;br /&gt;
&lt;br /&gt;
The Cray-1, built by Cray Research, was installed at Los Alamos National Laboratory in 1976 and offered performance similar to that of the ILLIAC IV [http://en.wikipedia.org/wiki/ILLIAC_IV].  The Cray machine relied heavily on registers rather than the array of individual processing elements used in the ILLIAC IV.  Each processor was connected to main memory and had a number of 64-bit registers used to perform operations [http://www.eecg.toronto.edu/~moshovos/ACA05/read/cray1.pdf].&lt;br /&gt;
&lt;br /&gt;
== Move to instruction-level parallelism in the 1980's ==&lt;br /&gt;
&lt;br /&gt;
Increasing the word size above 32 bits offered diminishing returns in terms of performance ([[#References|Culler (1999), p. 15.]]). In the mid-1980's the emphasis changed from bit-level parallelism to instruction-level parallelism, which involved increasing the number of instructions that could be executed at one time ([[#References|Culler (1999), p. 15.]]).  The message passing model allowed programmers to divide up instructions in order to take advantage of this architecture. &lt;br /&gt;
&lt;br /&gt;
== Thread-level parallelism ==&lt;br /&gt;
The move to cluster-based machines in the past decade has added another layer of complexity to parallelism.  Since computers may be located across a network from each other, there is more emphasis on software acting as a bridge [http://cobweb.ecn.purdue.edu/~pplinux/ppcluster.html]. This has led to a greater emphasis on thread- or task-level parallelism [http://en.wikipedia.org/wiki/Thread-level_parallelism] and to the addition of the data parallel programming model to existing message passing or shared memory models [http://en.wikipedia.org/wiki/Thread-level_parallelism].  &lt;br /&gt;
&lt;br /&gt;
= Data Parallel Model =&lt;br /&gt;
One important feature of the data-parallel programming model, or data parallelism (SIMD), is its single control flow: there is only one control processor that directs the activities of all the processing elements. In stark contrast to this is task parallelism (MIMD: Multiple Instruction, Multiple Data): characterized by multiple control flows, it allows the concurrent execution of multiple instruction streams, each of which manipulates its own data and serves separate functions. Below is a contrast between the data parallelism and task parallelism models, based on the Wikipedia articles on SIMD and MIMD. In the following subsections we continue to compare and contrast features of the data-parallel and task-parallel models to help the reader understand the unique characteristics of the data-parallel programming model.&lt;br /&gt;
&lt;br /&gt;
[[Image:Smid.png|frame|center|425px|contrast between data parallelism and task parallelism]]&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Description and Example ==&lt;br /&gt;
&lt;br /&gt;
== Comparison with Message Passing and Shared Memory ==&lt;br /&gt;
&lt;br /&gt;
= Task Parallel Model =&lt;br /&gt;
&lt;br /&gt;
== Description and Example ==&lt;br /&gt;
&lt;br /&gt;
= Data Parallel Model vs Task Parallel Model =&lt;br /&gt;
&lt;br /&gt;
= Definitions =&lt;br /&gt;
&lt;br /&gt;
= References =&lt;br /&gt;
* David E. Culler, Jaswinder Pal Singh, and Anoop Gupta, [http://portal.acm.org/citation.cfm?id=550071 ''Parallel Computer Architecture: A Hardware/Software Approach,''] Morgan Kaufmann, 1999.&lt;br /&gt;
* Ian Foster, [http://www.mcs.anl.gov/~itf/dbpp/ ''Designing and Building Parallel Programs,''] Addison-Wesley, 1995.&lt;br /&gt;
* Magne Haveraaen, [http://portal.acm.org/citation.cfm?id=1239917 &amp;quot;Machine and collection abstractions for user-implemented data-parallel programming,&amp;quot;] ''Scientific Programming,'' 8(4):231-246, 2000.&lt;br /&gt;
* W. Daniel Hillis and Guy L. Steele, Jr., [http://portal.acm.org/citation.cfm?id=7903 &amp;quot;Data parallel algorithms,&amp;quot;] ''Communications of the ACM,'' 29(12):1170-1183, December 1986.&lt;br /&gt;
* Alexander C. Klaiber and Henry M. Levy, [http://portal.acm.org/citation.cfm?id=192020 &amp;quot;A comparison of message passing and shared memory architectures for data parallel programs,&amp;quot;] in ''Proceedings of the 21st Annual International Symposium on Computer Architecture,'' April 1994, pp. 94-105.&lt;br /&gt;
* Yan Solihin, ''Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems,'' Solihin Books, 2008.&lt;/div&gt;</summary>
		<author><name>Vrmanda</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch2_JR&amp;diff=43534</id>
		<title>CSC/ECE 506 Spring 2011/ch2 JR</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch2_JR&amp;diff=43534"/>
		<updated>2011-01-29T23:27:44Z</updated>

		<summary type="html">&lt;p&gt;Vrmanda: /* Data Parallel Model */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Supplement to Chapter 2: The Data Parallel Programming Model=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= History =&lt;br /&gt;
As computer architectures have evolved, so have parallel programming models. The earliest advancements in parallel computers took advantage of bit-level parallelism.  These computers used vector processing, which required a shared memory programming model.  As performance returns from this architecture diminished, the emphasis shifted to instruction-level parallelism and the message passing model began to dominate.  Most recently, with the move to cluster-based machines, there has been an increased emphasis on thread-level parallelism. This has corresponded to an increased interest in the data parallel programming model.&lt;br /&gt;
&lt;br /&gt;
== Bit-level parallelism in the 1970's ==&lt;br /&gt;
The major performance improvements in computers during this time came from the ability to execute operations on 32-bit words at one time ([[#References|Culler (1999), p. 15.]]).  The dominant supercomputers of the time, like the Cray and the ILLIAC IV, were mainly Single Instruction Multiple Data architectures and used a shared memory programming model.  They each used different forms of vector processing ([[#References|Culler (1999), p. 21.]]). &lt;br /&gt;
Development of the ILLIAC IV began in 1964 and wasn't finished until 1975 [http://en.wikipedia.org/wiki/ILLIAC_IV].  A central processor was connected to the main memory and delegated tasks to individual PEs, each of which had its own memory cache [http://archive.computerhistory.org/resources/text/Burroughs/Burroughs.ILLIAC%20IV.1974.102624911.pdf].  Each PE could operate on an 8-, 32-, or 64-bit operand at a given time [http://archive.computerhistory.org/resources/text/Burroughs/Burroughs.ILLIAC%20IV.1974.102624911.pdf].&lt;br /&gt;
&lt;br /&gt;
The Cray-1, built by Cray Research, was installed at Los Alamos National Laboratory in 1976 and offered performance similar to that of the ILLIAC IV [http://en.wikipedia.org/wiki/ILLIAC_IV].  The Cray machine relied heavily on registers rather than the array of individual processing elements used in the ILLIAC IV.  Each processor was connected to main memory and had a number of 64-bit registers used to perform operations [http://www.eecg.toronto.edu/~moshovos/ACA05/read/cray1.pdf].&lt;br /&gt;
&lt;br /&gt;
== Move to instruction-level parallelism in the 1980's ==&lt;br /&gt;
&lt;br /&gt;
Increasing the word size above 32 bits offered diminishing returns in terms of performance ([[#References|Culler (1999), p. 15.]]). In the mid-1980's the emphasis changed from bit-level parallelism to instruction-level parallelism, which involved increasing the number of instructions that could be executed at one time ([[#References|Culler (1999), p. 15.]]).  The message passing model allowed programmers to divide up instructions in order to take advantage of this architecture. &lt;br /&gt;
&lt;br /&gt;
== Thread-level parallelism ==&lt;br /&gt;
The move to cluster-based machines in the past decade has added another layer of complexity to parallelism.  Since computers may be located across a network from each other, there is more emphasis on software acting as a bridge [http://cobweb.ecn.purdue.edu/~pplinux/ppcluster.html]. This has led to a greater emphasis on thread- or task-level parallelism [http://en.wikipedia.org/wiki/Thread-level_parallelism] and to the addition of the data parallel programming model to existing message passing or shared memory models [http://en.wikipedia.org/wiki/Thread-level_parallelism].  &lt;br /&gt;
&lt;br /&gt;
= Data Parallel Model =&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Description and Example ==&lt;br /&gt;
&lt;br /&gt;
== Comparison with Message Passing and Shared Memory ==&lt;br /&gt;
&lt;br /&gt;
= Task Parallel Model =&lt;br /&gt;
&lt;br /&gt;
== Description and Example ==&lt;br /&gt;
&lt;br /&gt;
= Data Parallel Model vs Task Parallel Model =&lt;br /&gt;
&lt;br /&gt;
= Definitions =&lt;br /&gt;
&lt;br /&gt;
= References =&lt;br /&gt;
* David E. Culler, Jaswinder Pal Singh, and Anoop Gupta, [http://portal.acm.org/citation.cfm?id=550071 ''Parallel Computer Architecture: A Hardware/Software Approach,''] Morgan Kaufmann, 1999.&lt;br /&gt;
* Ian Foster, [http://www.mcs.anl.gov/~itf/dbpp/ ''Designing and Building Parallel Programs,''] Addison-Wesley, 1995.&lt;br /&gt;
* Magne Haveraaen, [http://portal.acm.org/citation.cfm?id=1239917 &amp;quot;Machine and collection abstractions for user-implemented data-parallel programming,&amp;quot;] ''Scientific Programming,'' 8(4):231-246, 2000.&lt;br /&gt;
* W. Daniel Hillis and Guy L. Steele, Jr., [http://portal.acm.org/citation.cfm?id=7903 &amp;quot;Data parallel algorithms,&amp;quot;] ''Communications of the ACM,'' 29(12):1170-1183, December 1986.&lt;br /&gt;
* Alexander C. Klaiber and Henry M. Levy, [http://portal.acm.org/citation.cfm?id=192020 &amp;quot;A comparison of message passing and shared memory architectures for data parallel programs,&amp;quot;] in ''Proceedings of the 21st Annual International Symposium on Computer Architecture,'' April 1994, pp. 94-105.&lt;br /&gt;
* Yan Solihin, ''Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems,'' Solihin Books, 2008.&lt;/div&gt;</summary>
		<author><name>Vrmanda</name></author>
	</entry>
	<entry>
		<id>https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch2_JR&amp;diff=43533</id>
		<title>CSC/ECE 506 Spring 2011/ch2 JR</title>
		<link rel="alternate" type="text/html" href="https://wiki.expertiza.ncsu.edu/index.php?title=CSC/ECE_506_Spring_2011/ch2_JR&amp;diff=43533"/>
		<updated>2011-01-29T23:26:52Z</updated>

		<summary type="html">&lt;p&gt;Vrmanda: /* Data Parallel Model */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;=Supplement to Chapter 2: The Data Parallel Programming Model=&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
= History =&lt;br /&gt;
As computer architectures have evolved, so have parallel programming models. The earliest advancements in parallel computers took advantage of bit-level parallelism.  These computers used vector processing, which required a shared memory programming model.  As performance returns from this architecture diminished, the emphasis shifted to instruction-level parallelism and the message passing model began to dominate.  Most recently, with the move to cluster-based machines, there has been an increased emphasis on thread-level parallelism. This has corresponded to an increased interest in the data parallel programming model.&lt;br /&gt;
&lt;br /&gt;
== Bit-level parallelism in the 1970's ==&lt;br /&gt;
The major performance improvements in computers during this time came from the ability to execute operations on 32-bit words at one time ([[#References|Culler (1999), p. 15.]]).  The dominant supercomputers of the time, like the Cray and the ILLIAC IV, were mainly Single Instruction Multiple Data architectures and used a shared memory programming model.  They each used different forms of vector processing ([[#References|Culler (1999), p. 21.]]). &lt;br /&gt;
Development of the ILLIAC IV began in 1964 and wasn't finished until 1975 [http://en.wikipedia.org/wiki/ILLIAC_IV].  A central processor was connected to the main memory and delegated tasks to individual PEs, each of which had its own memory cache [http://archive.computerhistory.org/resources/text/Burroughs/Burroughs.ILLIAC%20IV.1974.102624911.pdf].  Each PE could operate on an 8-, 32-, or 64-bit operand at a given time [http://archive.computerhistory.org/resources/text/Burroughs/Burroughs.ILLIAC%20IV.1974.102624911.pdf].&lt;br /&gt;
&lt;br /&gt;
The Cray-1, built by Cray Research, was installed at Los Alamos National Laboratory in 1976 and offered performance similar to that of the ILLIAC IV [http://en.wikipedia.org/wiki/ILLIAC_IV].  The Cray machine relied heavily on registers rather than the array of individual processing elements used in the ILLIAC IV.  Each processor was connected to main memory and had a number of 64-bit registers used to perform operations [http://www.eecg.toronto.edu/~moshovos/ACA05/read/cray1.pdf].&lt;br /&gt;
&lt;br /&gt;
== Move to instruction-level parallelism in the 1980's ==&lt;br /&gt;
&lt;br /&gt;
Increasing the word size above 32 bits offered diminishing returns in terms of performance ([[#References|Culler (1999), p. 15.]]). In the mid-1980's the emphasis changed from bit-level parallelism to instruction-level parallelism, which involved increasing the number of instructions that could be executed at one time ([[#References|Culler (1999), p. 15.]]).  The message passing model allowed programmers to divide up instructions in order to take advantage of this architecture. &lt;br /&gt;
&lt;br /&gt;
== Thread-level parallelism ==&lt;br /&gt;
The move to cluster-based machines in the past decade has added another layer of complexity to parallelism.  Since computers may be located across a network from each other, there is more emphasis on software acting as a bridge [http://cobweb.ecn.purdue.edu/~pplinux/ppcluster.html]. This has led to a greater emphasis on thread- or task-level parallelism [http://en.wikipedia.org/wiki/Thread-level_parallelism] and to the addition of the data parallel programming model to existing message passing or shared memory models [http://en.wikipedia.org/wiki/Thread-level_parallelism].  &lt;br /&gt;
&lt;br /&gt;
= Data Parallel Model =&lt;br /&gt;
&lt;br /&gt;
Data Parallel Model&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Description and Example ==&lt;br /&gt;
&lt;br /&gt;
== Comparison with Message Passing and Shared Memory ==&lt;br /&gt;
&lt;br /&gt;
= Task Parallel Model =&lt;br /&gt;
&lt;br /&gt;
== Description and Example ==&lt;br /&gt;
&lt;br /&gt;
= Data Parallel Model vs Task Parallel Model =&lt;br /&gt;
&lt;br /&gt;
= Definitions =&lt;br /&gt;
&lt;br /&gt;
= References =&lt;br /&gt;
* David E. Culler, Jaswinder Pal Singh, and Anoop Gupta, [http://portal.acm.org/citation.cfm?id=550071 ''Parallel Computer Architecture: A Hardware/Software Approach,''] Morgan Kaufmann, 1999.&lt;br /&gt;
* Ian Foster, [http://www.mcs.anl.gov/~itf/dbpp/ ''Designing and Building Parallel Programs,''] Addison-Wesley, 1995.&lt;br /&gt;
* Magne Haveraaen, [http://portal.acm.org/citation.cfm?id=1239917 &amp;quot;Machine and collection abstractions for user-implemented data-parallel programming,&amp;quot;] ''Scientific Programming,'' 8(4):231-246, 2000.&lt;br /&gt;
* W. Daniel Hillis and Guy L. Steele, Jr., [http://portal.acm.org/citation.cfm?id=7903 &amp;quot;Data parallel algorithms,&amp;quot;] ''Communications of the ACM,'' 29(12):1170-1183, December 1986.&lt;br /&gt;
* Alexander C. Klaiber and Henry M. Levy, [http://portal.acm.org/citation.cfm?id=192020 &amp;quot;A comparison of message passing and shared memory architectures for data parallel programs,&amp;quot;] in ''Proceedings of the 21st Annual International Symposium on Computer Architecture,'' April 1994, pp. 94-105.&lt;br /&gt;
* Yan Solihin, ''Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems,'' Solihin Books, 2008.&lt;/div&gt;</summary>
		<author><name>Vrmanda</name></author>
	</entry>
</feed>