Expertiza_Wiki - User contributions [en]

CSC/ECE 506 Spring 2011/ch12 aj

2011-04-26T04:23:02Z

Asbransc:

<h1>Interconnection Network Architecture </h1>


In a multi-processor system, processors need to communicate with each other and access each other's resources. In order to route data and messages between processors, an interconnection architecture is needed.


 


Typically, in a multiprocessor system, messages passed between processors are frequent and short [1]. The interconnection network must therefore deliver each message quickly ('''low latency''') and carry many messages at the same time ('''high bandwidth''').


 


In a network, a processor together with its cache and memory is considered a '''node'''. The physical wires that connect two nodes are called a '''link'''. The device that routes messages between nodes is called a router. The shape of the network, that is, how the nodes, links, and routers are arranged, is called the network '''topology'''.


<h1>History of Network Topologies</h1>

Hypercube topologies were invented in the 1980s. They had desirable characteristics when the number of nodes was small (about 1000 at most, often fewer than 100) and every processor had to stop working to receive and forward each message [4]. The low-radix era began in 1985 and was defined by routers with between 4 and 8 ports, using torus, mesh, or fat-tree topologies and wormhole routing. This era lasted about 20 years, until it was determined that routers with dozens of ports offered superior performance. Two topologies were developed to take advantage of the new high-radix routers: the flattened butterfly and the dragonfly. Both fall somewhere between a mesh in which each point is a router (or a virtual router, in the case of the dragonfly) with dozens or hundreds of nodes attached, and a fat tree whose arity is high enough that it has only two levels.

<h2>Interconnection evolution in the Top500 List</h2>
[[Image:Top500interconnect.png|thumbnail|300px|left|]]

This chart shows how the different interconnect types have risen and fallen in the Top500 list of supercomputers [7]. As one can see, many technologies came into vogue briefly before losing performance share and disappearing. In the early days of the list, most of the computers report the interconnect type as not applicable. However, the trailing end of the hypercube phase is clear in burnt orange. The dark blue at the top is "other" and the dark red in the middle is "proprietary", so we can only speculate about which topologies those systems employ. The toroidal mesh appears briefly at the start in a cream color and slightly outlasts the hypercube. The two crossbar technologies (blue and olive) followed the toroidal mesh: the fully distributed crossbar died out quickly, while the multi-stage crossbar lasted longer but was never dominant.

The 3-D torus (purple) dominates much of the 1990s, with hypercube topologies (dark pink) enjoying a short comeback in the later part of the decade. SP Switch (light olive), an IBM interconnect technology that uses a multi-stage crossbar switch, replaced the 3-D torus [8]. Myrinet, Quadrics, and Federation shared the spotlight in the mid-2000s; each used a similar fat-tree topology [9,10].

The current class of supercomputers is dominated by nodes connected with either InfiniBand or Gigabit Ethernet. Both can be connected in either a fat-tree or a 2-D mesh topology. The primary difference between them is speed: InfiniBand is considerably faster per link and allows links to be ganged into groups of 4 or 12. Gigabit Ethernet is far less expensive, however, and some supercomputer designers have apparently chosen to save money on the interconnect in order to afford faster nodes [11,12].

 
 
 
 

<h1>Types of Network Topologies </h1>

Several metrics are commonly used to represent the cost and performance of a topology. In this section, the degree, number of links, diameter, and bisection bandwidth are given for each topology, where p is the number of nodes.

<h2>Linear Array</h2>

[[Image:Top_linear.jpg|thumbnail|frame|right|]]

The nodes are connected linearly, as in an array. This topology is simple, but it does not scale well.
 
* ''Diameter:'' p-1
* ''Bisection BW:'' 1
* ''# Links:'' p-1
* ''Degree:'' 2
 
A linear array is the cheapest way to connect a group of nodes. Its number of links and its degree are the smallest of any topology. The drawback is equally obvious: the two end nodes are the farthest apart, which makes the diameter p-1. The topology is also unreliable, because the bisection bandwidth is 1 and a single broken link splits the network in two.

<h2>Ring</h2>
[[Image:Top_ring.jpg|thumbnail|right|]]
The ring has the same structure as the linear array, except that the two end nodes are connected to each other, forming a circle.
 
* ''Diameter:'' p/2
* ''Bisection BW:'' 2
* ''# Links:'' p
* ''Degree:'' 2
 
Compared with the linear array, the ring gains a relatively large improvement for very little effort: only one link is added. The longest distance between two nodes is cut in half, and the bisection bandwidth increases to 2.
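The diameters quoted for the linear array and the ring can be checked by brute force. The sketch below is only an illustration: the helper names (`diameter`, `linear_array`, `ring`) and the choice of p = 8 are assumptions made for this example, not part of the original text.

```python
from collections import deque

def diameter(adj):
    """Longest shortest-path length (in hops) over all node pairs."""
    best = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:                      # breadth-first search from src
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        best = max(best, max(dist.values()))
    return best

def linear_array(p):
    """Adjacency lists for p nodes in a line: node i links to i-1 and i+1."""
    return {i: [j for j in (i - 1, i + 1) if 0 <= j < p] for i in range(p)}

def ring(p):
    """Same as the linear array, plus one link joining the two end nodes."""
    return {i: [(i - 1) % p, (i + 1) % p] for i in range(p)}

p = 8
print(diameter(linear_array(p)))  # p - 1 = 7
print(diameter(ring(p)))          # p / 2 = 4
```

Adding the single wrap-around link is the only difference between the two builders, which makes the halving of the diameter easy to see.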

<h2>2-D Mesh</h2>

[[Image:Top_2Dmesh.jpg|thumbnail|right|]]

The 2-D mesh can be thought of as several linear arrays put together to form a 2-dimensional structure. This topology suits applications such as ocean simulation and matrix computation.
 
* ''Diameter:'' 2(sqrt(p)-1)
* ''Bisection BW:'' sqrt(p)
* ''# Links:'' 2sqrt(p)(sqrt(p)-1)
* ''Degree:'' 4
 
Nodes that are not on the edge have a '''degree''' of 4. To calculate the number of links, add the number of vertical links, sqrt(p)(sqrt(p)-1), to the number of horizontal links, also sqrt(p)(sqrt(p)-1), to get 2sqrt(p)(sqrt(p)-1). The diameter is the distance between two diagonally opposite corner nodes, which is the sum of two edges of length sqrt(p)-1.
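As a worked example of the formulas above, the following sketch evaluates them for p = 16 and cross-checks the link count by visiting every node. The function names are hypothetical and chosen only for this illustration.

```python
import math

def mesh_2d_metrics(p):
    """Closed-form metrics for a sqrt(p) x sqrt(p) mesh (p must be a perfect square)."""
    k = math.isqrt(p)
    if k * k != p:
        raise ValueError("p must be a perfect square")
    return {
        "diameter": 2 * (k - 1),
        "bisection_bw": k,
        "links": 2 * k * (k - 1),
        "degree": 4,
    }

def count_mesh_links(k):
    """Count links directly: each node owns the link to its right and the link below it."""
    links = 0
    for row in range(k):
        for col in range(k):
            if col + 1 < k:   # horizontal link to the right neighbour
                links += 1
            if row + 1 < k:   # vertical link to the neighbour below
                links += 1
    return links

m = mesh_2d_metrics(16)
print(m)                                   # diameter 6, bisection 4, links 24, degree 4
print(count_mesh_links(4) == m["links"])   # True
```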
 

<h2>2-D Torus</h2>

[[Image:Top_2Dtorus.jpg|thumbnail|right|]]

Just as the ring is formed from the linear array, the 2-D torus takes the structure of the 2-D mesh and connects the nodes on opposite edges.
 
* ''Diameter:'' 2*floor(sqrt(p)/2) -- sqrt(p) when sqrt(p) is even, sqrt(p)-1 when it is odd
* ''Bisection BW:'' 2sqrt(p)
* ''# Links:'' 2p
* ''Degree:'' 4
 
With the end-around connections, the longest distance is cut roughly in half and the bisection bandwidth doubles. In link count the extra cost is modest: the torus has 2p links against 2sqrt(p)(sqrt(p)-1) for the mesh, a difference of 2sqrt(p) wrap-around links.
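The effect of the wrap-around links on distance can be sketched as follows. In each dimension a packet on the torus may go either way around, so the per-dimension distance is the smaller of d and k - d, where k = sqrt(p). The function names and the sizes k = 4 and k = 5 are assumptions for illustration only.

```python
def mesh_distance(a, b):
    """Hop count between nodes a and b (row, col) on a 2-D mesh."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def torus_distance(a, b, k):
    """Hop count on a k x k torus: each dimension may wrap around."""
    dr = abs(a[0] - b[0])
    dc = abs(a[1] - b[1])
    return min(dr, k - dr) + min(dc, k - dc)

def diameters(k):
    """Return (mesh diameter, torus diameter) for a k x k network by exhaustive search."""
    nodes = [(r, c) for r in range(k) for c in range(k)]
    mesh = max(mesh_distance(a, b) for a in nodes for b in nodes)
    torus = max(torus_distance(a, b, k) for a in nodes for b in nodes)
    return mesh, torus

for k in (4, 5):
    print(k, diameters(k))   # (6, 4) for k = 4 and (8, 4) for k = 5
```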
 

<h2>Cube (3-D Mesh)</h2>
[[Image:Top_cube.jpg|thumbnail|right|]]

If we add two more neighbors to each node of a 2-D mesh, we get a cube. The cube can be thought of as a three-dimensional mesh.
 
* ''Diameter:'' 3(p^(1/3)-1) -- this is the corner-to-corner distance, analogous to the 2-D mesh formula
* ''Bisection BW:'' p^(2/3) -- p^(1/3) rows of p^(1/3) links must be cut to bisect a cube
* ''# Links:'' 3p^(2/3)(p^(1/3)-1) -- in each of the 3 dimensions there are p^(2/3) lines of p^(1/3)-1 links, analogous to the 2-D mesh formula
* ''Degree:'' 6 (from the inside nodes)
 

<h2>Hypercube</h2>

[[Image:Top_hypercube.jpg|thumbnail|right|Hypercube]]

In an N-dimensional mesh, the boundary nodes are usually what hurt the performance of the entire network. We can fix this by connecting the boundary nodes together. The hypercube is essentially multiple cubes put together: a hypercube of dimension N is two hypercubes of dimension N-1 with their corresponding nodes linked.
 
* ''Diameter:'' log2(p)
* ''Bisection BW:'' p/2 -- p/2 links run from one N-1 cube to the other.
* ''# Links:'' p/2 * log2(p) -- each node has a degree of log2(p). Multiply by p nodes and divide by 2 nodes per link.
* ''Degree:'' log2(p)
 
As the metrics show, the diameter and bisection bandwidth are significantly better for the higher-order topologies. Each node is numbered with a bitstring that is log2(p) bits long, and two nodes are linked when their bitstrings differ in exactly one bit. The farthest node from any given node is the one whose bitstring is the complement. One bit can be flipped per hop, so the diameter is log2(p).
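The bit-flipping argument can be sketched in a few lines. The function below is a hypothetical illustration (the name `hypercube_route` and the lowest-bit-first order are assumptions); it corrects one differing bit per hop, so the number of hops equals the number of bits in which source and destination differ.

```python
def hypercube_route(src, dst, dims):
    """Return the node sequence from src to dst, flipping one differing bit per hop."""
    path = [src]
    current = src
    for bit in range(dims):           # assumed order: lowest bit first
        mask = 1 << bit
        if (current ^ dst) & mask:    # this bit still differs from the destination
            current ^= mask
            path.append(current)
    return path

dims = 4                              # log2(p) for p = 16 nodes
src = 0b0101
dst = src ^ ((1 << dims) - 1)         # bitwise complement: the farthest node
path = hypercube_route(src, dst, dims)
print([format(n, "04b") for n in path])  # ['0101', '0100', '0110', '0010', '1010']
print(len(path) - 1)                     # 4 hops = log2(p)
```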

<h2>Tree</h2>

[[Image:Top_tree.jpg|thumbnail|right|Tree]]

The tree is a hierarchical structure with nodes at the bottom and switching nodes at the upper levels.
 
* ''Diameter:'' 2log2(p) -- the path from a leaf through the root to the farthest leaf on the other side
* ''Bisection BW:'' 1 -- breaking either link to the root bisects the tree
* ''# Links:'' 2(p-1) -- each router has 2 links to its children, and there are p-1 routers if p is a power of 2.
* ''Degree:'' 3 -- interior routers have a degree of 3.
 
The tree experiences high traffic at the upper levels. Since about half of the messages must pass through the root node, the root becomes the bottleneck of the tree topology. The other disadvantage is that the bisection bandwidth is only 1.
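The share of traffic that crosses the root can be estimated with a small sketch. It assumes a binary tree with p leaves, uniform traffic (every ordered pair of distinct leaves is equally likely), and leaves numbered so that the first half sit under the root's left child; all of these are assumptions made for the example.

```python
from itertools import permutations

def crosses_root(src, dst, p):
    """A message passes through the root when src and dst are in different halves."""
    half = p // 2
    return (src < half) != (dst < half)

def root_traffic_fraction(p):
    """Fraction of ordered (src, dst) leaf pairs whose path includes the root."""
    pairs = list(permutations(range(p), 2))
    through_root = sum(crosses_root(s, d, p) for s, d in pairs)
    return through_root / len(pairs)

for p in (8, 64, 256):
    print(p, round(root_traffic_fraction(p), 3))   # 0.571, 0.508, 0.502
```

Under these assumptions the fraction is p / (2(p-1)), which approaches one half as p grows.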

<h2>Fat Tree</h2>

[[Image:Top_fat_tree.jpg|thumbnail|right|Fat Tree]]
In order to improve the performance of the tree topology, the fat tree alleviates the traffic at upper levels by "fattening" up the links at the upper levels.
 
* ''Diameter:'' 2log2(p) -- same as a regular tree
* ''Bisection BW:'' p/2 -- all links to (one side of) the root must be cut to bisect the tree
* ''# Links:'' plog2(p) -- there are p links at each of log2(p) levels
* ''Degree:'' p -- the root node has p links through it.
 
The fat tree relieves the pressure on the root node, and the bisection bandwidth increases as well.

<h2>Butterfly</h2>

[[Image:Top_butterfly.jpg|thumbnail|right|Butterfly]]
 

The butterfly structure is similar to the tree structure, but it replicates the switching node structure of the tree topology and connects them together so that there are equal links and routers at all levels.
* ''Diameter:'' 2log2(p) -- same as a tree since butterfly has the same depth as a tree (just with p nodes at each level)
* ''Bisection BW:'' p -- p links connect the two halves at the top level
* ''# Links:'' 2p*log2(p) -- there are 2*p links at each level times log2(p) levels.
* ''Degree:'' 4 -- the routers in the middle levels all have 4 links. The leaves and routers at the top level each have 2 links.

 

Butterfly has similar performance to Hypercube. In terms of cost, butterfly has a smaller degree (so cheaper routers can be used) but hypercube has fewer links.
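To make the cost comparison concrete, the sketch below simply evaluates the link and degree formulas listed in this section for the hypercube and the butterfly. The function names and the sample size p = 64 are assumptions for illustration; the formulas are the ones given in the lists above.

```python
def log2_exact(p):
    """log2(p) for p a power of two, computed without floating point."""
    if p < 1 or p & (p - 1):
        raise ValueError("p must be a power of two")
    return p.bit_length() - 1

def hypercube_cost(p):
    d = log2_exact(p)
    return {"links": p // 2 * d, "degree": d}   # p/2 * log2(p) links, degree log2(p)

def butterfly_cost(p):
    d = log2_exact(p)
    return {"links": 2 * p * d, "degree": 4}    # 2p * log2(p) links, degree 4

p = 64
print(hypercube_cost(p))   # {'links': 192, 'degree': 6}
print(butterfly_cost(p))   # {'links': 768, 'degree': 4}
```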

<h1>Real-World Implementation of Network Topologies </h1>

In a research study by Andy Hospodor and Ethan Miller, several network topologies were investigated for a high-performance, high-traffic network [2]. The topologies included the fat tree, butterfly, mesh, torus, and hypercube structures. Advantages and disadvantages including cost, performance, and reliability were discussed.

In this experiment, a petabyte-scale network with over 100 GB/s total aggregate bandwidth was investigated. The network consisted of 4096 disks and large servers, with routers and switches in between [2].

[[Image:Disknet_network.jpg|frame|center|''Basic structure of Hospodor and Miller's experimental network''2]]

The overall structure of the network is shown above. Note that this structure is very susceptible to failure and congestion.

<h2>Fat Tree</h2>

In large-scale, high-performance applications, the fat tree can be a choice. However, in order to "fatten" the links, redundant connections must be used: instead of one link between switching nodes, several are needed. With more input and output links, one needs routers with more input and output ports. Routers with more than 100 ports are difficult to build and expensive, so multiple routers would have to be stacked together. Even then, the routers would be expensive, and several of them would be required [2].
 
 
The Japan Agency for Marine-Earth Science and Technology supercomputing system uses the fat tree topology. The system connects 1280 NEC processors [7].

 

<h2>Butterfly</h2>

In high-performance applications, the butterfly structure is a good choice. The butterfly topology uses fewer links than other topologies; however, each link carries traffic from the entire layer. Fault tolerance is poor, because only a single path exists between any pair of nodes. Should a link break, data cannot be re-routed, and communication is broken [2].

[[Image:Disknet_butterfly.jpg|frame|center|''Butterfly structure''2]]

<h2>Meshes and Tori</h2>

The mesh and torus structures used in this application would require a large number of links and a total of several thousand ports. However, because there are so many links, the mesh and torus structures provide alternate paths in case of failures [2].

[[Image:Disknet_mesh.jpg|frame|center|''Mesh structure''2]]

[[Image:Disknet_torus.jpg|frame|center|''Torus structure''2]]

One example of current use of the torus structure is the QPACE SFB TR Cluster in Germany, which uses PowerXCell 8i processors. The system uses a 3-D torus topology with 4608 processors [7].

<h2>Hypercube</h2>

Similar to the torus structures, the hypercube requires a larger number of links. However, its bandwidth scales better than that of the mesh and torus structures.

The CRAY T3E, CRAY XT3, and SGI Origin 2000 use k-ary n-cube topologies.

[[Image:Disknet_hypercube.jpg|frame|center|''Hypercube structure''2]]

It is worth mentioning that even though many topologies perform much better than the 2-D mesh, the cost of these advanced topologies is also high. Since most chips are laid out in 2-D space, it is very expensive to implement a high-dimensional topology on a 2-D chip. For the hypercube topology, increasing the number of nodes raises the degree of every node. For the butterfly topology, the degree grows relatively slowly, but the required number of links and switches increases rapidly.

<h1>Comparison of Network Topologies </h1>


The following table shows the total number of ports required for each network topology.
 
 
[[Image:Disknet_ports.jpg]]
 
''Number of ports for each topology'' [2]

 

As the figure above shows, the 6-D hypercube requires the largest number of ports, due to its relatively complex six-dimensional structure. In contrast, the fat tree requires the fewest ports, even though its links have been "fattened" with redundant links. The butterfly network requires more than twice as many ports as the fat tree, since it essentially replicates the switching layer of the fat tree. The number of ports for the mesh and torus structures increases as the dimensionality increases. However, with modern router technology, the number of ports is a less important consideration.

 
 
The average path length, or average number of hops, and the average link load (GB/s) are shown below.
 
 
[[Image:Disknet_load.jpg]]
 
''Average path length and link load for each topology'' [2]

 

Looking at the trends, when the average path length is high, the average link load is also high. In other words, average path length and average link load rise and fall together. It is obvious from the graph that the 2-D mesh has, by far, the worst performance. In a large network such as this, the average path length is simply too high, and the average link load suffers. For this type of high-performance network, the 2-D mesh does not scale well. The 2-D torus cuts the average path length and average link load in half by connecting the edge nodes together; however, its performance is still relatively poor compared to the other types. The butterfly and fat tree have the lowest average path length and average link load.

 
 
The figure below shows the cost of the network topologies.
 
 
[[Image:Disknet_cost.jpg]]
 
''Cost of each topology'' [2]

 

Despite using the fewest ports, the fat tree topology has the highest cost, by far. The ports it does use are high-bandwidth ports of 10 GB/s, and over 2400 of them are required to provide enough bandwidth at the upper levels of the tree. This pushes the cost up dramatically and makes the fat tree impractical from a cost standpoint. While the total cost of the fat tree is about 15 million dollars, the rest of the network topologies are clustered below 4 million dollars. As the dimensionality of the mesh and torus structures increases, the cost increases. The cost of the butterfly network falls between that of the 2-D mesh and torus and that of the 6-D hypercube.

 
 
When the cost and average link load are factored together, the following graph is produced.
 
 
[[Image:Disknet_overall.jpg]]
 
''Overall cost of each topology'' [2]

 

From the figure above, the 6-D hypercube is the most cost-effective choice for this particular network setup. Although the 6-D hypercube costs more because it needs more links and ports, it provides higher bandwidth, which can offset the higher cost. The high-dimensional tori also perform well, but cannot provide as much bandwidth as the 6-D hypercube. For systems that do not need as much bandwidth, a high-dimensional torus is also a good choice. The butterfly topology is another alternative, but it has lower fault tolerance.

However, when the number of nodes increases, the relative cost of the higher-dimensional topologies increases far faster than their relative performance when compared to a 2-D mesh. This is because the 2-D mesh only uses low-cost, short links. The higher-dimensional structures must be projected onto our 3-dimensional world, and thus require many long, expensive links that wrap around the outside of the system like an impenetrable tangle of jungle vines. Maintaining such a network is also quite slow and tedious.


<h1>Packet Routing</h1>


The '''routing''' algorithm determines what path a packet of data will take from source to destination. Routing can be '''deterministic''', where the path is always the same for a given source and destination, or '''adaptive''', where the path can change. A routing algorithm can also be '''partially adaptive''', where packets have multiple choices of path but not all packets are allowed to use the shortest path [3].


<h2>Deadlock</h2>

Packets are in '''deadlock''' when they cannot continue to move through the nodes. The illustration below demonstrates this situation.
 
 
[[Image:Routing_deadlock.jpg]]
 
''Example of deadlock''
 
 

Assume that all of the buffers are full at each node. The packet from Node 1 cannot continue to Node 2, the packet from Node 2 cannot continue to Node 3, and so on. Since no packet can move, the packets are deadlocked.

 
 

Deadlock arises from a cyclic pattern of routing. To avoid deadlock, avoid circular routing patterns.
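The link between deadlock and cycles can be illustrated with a small "waits-for" graph, in which an edge from node A to node B means the packet held at A is waiting for a buffer at B. The sketch below is a minimal cycle check, not a router implementation; the four-node example and the name `has_cycle` are assumptions for illustration.

```python
def has_cycle(waits_for):
    """Detect a cycle in a directed graph given as {node: [nodes it waits on]}."""
    WHITE, GREY, BLACK = 0, 1, 2          # unvisited, on the current path, finished
    colour = {n: WHITE for n in waits_for}

    def visit(n):
        colour[n] = GREY
        for m in waits_for.get(n, []):
            state = colour.get(m, WHITE)
            if state == GREY:             # reached a node already on this path
                return True
            if state == WHITE and visit(m):
                return True
        colour[n] = BLACK
        return False

    return any(colour[n] == WHITE and visit(n) for n in waits_for)

# Each node waits on the next one around the loop: deadlock.
print(has_cycle({1: [2], 2: [3], 3: [4], 4: [1]}))   # True
# Node 4 can deliver its packet, so the chain drains: no deadlock.
print(has_cycle({1: [2], 2: [3], 3: [4], 4: []}))    # False
```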

 
 

To avoid circular patterns of routing, some routing patterns are disallowed. These are called '''turn restrictions''', where some turns are not allowed in order to avoid making a circular routing pattern. Some of these turn restrictions are mentioned below.


 
<h2>Dimensional ordered (X-Y) routing</h2>

Turns from the y-dimension to the x-dimension are not allowed.
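A minimal sketch of dimension-ordered routing on a 2-D mesh follows. The packet finishes all of its x hops before taking any y hop, so it never turns from the y-dimension back to the x-dimension. The function name and the (x, y) coordinate convention are assumptions for this example.

```python
def xy_route(src, dst):
    """Dimension-ordered route on a 2-D mesh: all x hops first, then all y hops."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:                 # travel along x until the column matches
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:                 # then travel along y until the row matches
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path

print(xy_route((0, 0), (2, 3)))
# [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (2, 3)]
```

Because the route depends only on source and destination, this scheme is deterministic.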


<h2>West First</h2>

Turns to the west are not allowed, so a packet that needs to travel west must do so first.


<h2>North Last</h2>

Turns after a north direction are not allowed, so a packet that needs to travel north must do so last.


<h2>Negative First</h2>

Turns from a positive direction (+x or +y) to a negative direction (-x or -y) are not allowed, so a packet must make all of its negative-direction hops first.
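The west-first, north-last, and negative-first models can each be written as a set of prohibited turns. The sketch below assumes that east is +x and north is +y, represents a turn as the pair (current travel direction, next travel direction), and considers only 90-degree turns; the names used are hypothetical.

```python
# Prohibited (from, to) turns for each model, with E = +x, W = -x, N = +y, S = -y.
PROHIBITED = {
    "west-first": {("N", "W"), ("S", "W")},       # no turning into west
    "north-last": {("N", "E"), ("N", "W")},       # no turning once heading north
    "negative-first": {("E", "S"), ("N", "W")},   # no positive-to-negative turns
}

def route_is_legal(hops, model):
    """Check every pair of consecutive hop directions against the model's banned turns."""
    banned = PROHIBITED[model]
    return all((a, b) not in banned for a, b in zip(hops, hops[1:]))

route = ["W", "N", "E"]   # west, then north, then east
for model in PROHIBITED:
    print(model, route_is_legal(route, model))
# west-first True, north-last False, negative-first True
```

Each model removes two of the eight possible 90-degree turns, which is enough to break every cycle of turns in the mesh.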


<h2>Odd-Even Turn Model</h2>


Unfortunately, the above turn-restriction models reduce the degree of adaptiveness, making the routing only partially adaptive. Some packets are forced onto different routes, not necessarily minimal paths. This may cause unfairness and reduces the ability of the system to relieve congestion. Overall performance could suffer [3].


 


Ge-Ming Chiu introduced the Odd-Even turn model, a deadlock-free adaptive turn-restriction model that performs better than the previously mentioned models [3]. The model is designed primarily for 2-D meshes.

 

''Turns from the east to north direction from any node on an even column are not allowed.''
 
''Turns from the north to west direction from any node on an odd column are not allowed.''
 
 
''Turns from the east to south direction from any node on an even column are not allowed.''
 
''Turns from the south to west direction from any node on an odd column are not allowed.''
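The four rules above can be expressed as a single check. In this sketch a turn is described by the packet's current travel direction, the direction it wants to take next, and the column of the node where the turn happens; numbering columns from 0 is an assumption, as is the function name.

```python
def odd_even_turn_allowed(incoming, outgoing, column):
    """Return False for the turns the odd-even model prohibits at this column.

    incoming/outgoing are travel directions 'E', 'W', 'N', 'S'.
    """
    even = column % 2 == 0
    if even and incoming == "E" and outgoing in ("N", "S"):
        return False        # east-to-north and east-to-south banned in even columns
    if not even and incoming in ("N", "S") and outgoing == "W":
        return False        # north-to-west and south-to-west banned in odd columns
    return True

print(odd_even_turn_allowed("E", "N", column=2))   # False
print(odd_even_turn_allowed("E", "N", column=3))   # True
print(odd_even_turn_allowed("S", "W", column=3))   # False
```

Unlike the earlier models, which ban the same turns everywhere, the restriction here depends on where in the mesh the turn is made.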
 
 
 
The illustration below shows allowed routing for different source and destination nodes. Depending on which column the packet is in, only certain directions are allowed.
 
 
[[Image:Routing_odd_even.jpg]]
 
''Odd-Even turn restriction model proposed by Ge-Ming Chiu'' [3]
 

 

<h1>Comparison of Turn Restriction Models</h1>

To compare the performance of various turn restriction models, Chiu simulated a 15 x 15 mesh under various traffic patterns. Each channel has a bandwidth of 20 flits/usec and a buffer size of one flit. The dimension-ordered x-y routing, west-first, and negative-first models were compared against the odd-even model.

 

Simulations were run with uniform, transpose, and hot spot traffic patterns. Uniform traffic has each node send messages to every other node with equal probability. Transpose simulates two opposite nodes sending messages to their respective halves of the mesh. Hot spot simulates a few "hot spot" nodes that receive high traffic.
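A rough sketch of how such traffic patterns might be generated is shown below. It is not taken from Chiu's simulator: the function names are hypothetical, and the transpose rule used here (node (i, j) sends to node (j, i)) is one common definition that is assumed for illustration, so the patterns in the paper may differ.

```python
import random

def uniform_dest(src, n, rng):
    """Pick any node of the n x n mesh other than src, all equally likely."""
    while True:
        dst = (rng.randrange(n), rng.randrange(n))
        if dst != src:
            return dst

def transpose_dest(src):
    """Assumed definition: node (i, j) always sends to node (j, i)."""
    return (src[1], src[0])

def hotspot_dest(src, n, rng, hotspots, hot_fraction):
    """Send to a hot-spot node with probability hot_fraction, otherwise uniformly."""
    candidates = [h for h in hotspots if h != src]
    if candidates and rng.random() < hot_fraction:
        return rng.choice(candidates)
    return uniform_dest(src, n, rng)

rng = random.Random(0)   # fixed seed so a run is repeatable
n = 15                   # 15 x 15 mesh, as in the simulation described above
print(uniform_dest((3, 7), n, rng))
print(transpose_dest((3, 7)))   # (7, 3)
print(hotspot_dest((3, 7), n, rng, hotspots=[(7, 7)], hot_fraction=0.10))
```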



 
 
[[Image:Routing_uniform.jpg]]
 
''Uniform traffic simulation of various turn restriction models'' [3]
 
 
The performance of the different routing algorithms is shown above for the uniform traffic. For uniform traffic, the dimension-ordered x-y model outperforms the rest of the models. As the number of messages increases, the x-y model shows the slowest increase in average communication latency.



 
 
[[Image:Routing_transpose.jpg]]
 
''First transpose traffic simulation of various turn restriction models'' [3]
 
 
The performance of the different routing algorithms is shown above for the first transpose traffic. The negative-first model has the best performance, while the odd-even model performs better than the west-first and x-y models.



 
 
[[Image:Routing_transpose2.jpg]]
 
''Second transpose traffic simulation of various turn restriction models'' [3]
 
 
With the second transpose simulation, the odd-even model outperforms the rest.



 
 
[[Image:Routing_hotspot.jpg]]
 
''Hotspot traffic simulation of various turn restriction models'' [3]
 
 
The performance of the different routing algorithms is shown above for the hotspot traffic. Only one hotspot was simulated for this test. The odd-even model outperforms the other models when hotspot traffic is 10%.



 
 
[[Image:Routing_hotspot2.jpg]]
 
''Second hotspot traffic simulation of various turn restriction models'' [3]
 
 
When the number of hotspots is increased to five, the odd-even model begins to shine. Its latency is the lowest at both 6 and 8 percent hotspot traffic. Meanwhile, the performance of the x-y model is horrendous.


 
 

While the x-y model performs well under uniform traffic, it lacks adaptiveness. When traffic concentrates on hotspots, the x-y model cannot adapt and re-route traffic to avoid the congestion the hotspots cause. The odd-even model has superior adaptiveness under high congestion.


 

<h1>Router Architecture</h1>

The '''router''' is a device that routes incoming data to its destination. It has several input ports and several output ports, and data arriving on one of the input ports is routed to one of the output ports. Which output port is chosen depends on the destination of the data and on the routing algorithm.

 
 

The internal architecture of a router consists of input and output ports and a '''crossbar switch'''. The crossbar switch connects each input to the output port selected for it, acting essentially as a multiplexer.
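The multiplexer view of a crossbar can be sketched as one selector per output port: each output names the input port that currently drives it. This is a simplified software model with hypothetical names, not a description of any particular router's hardware.

```python
def crossbar(inputs, select):
    """Model a crossbar as one multiplexer per output port.

    inputs[i] is the data on input port i; select[o] is the index of the
    input driving output port o, or None if that output is idle.
    """
    return [None if s is None else inputs[s] for s in select]

inputs = ["A", "B", "C", "D"]    # data waiting on four input ports
select = [2, 0, None, 1]         # output 0 takes input 2, output 1 takes input 0, ...
print(crossbar(inputs, select))  # ['C', 'A', None, 'B']
```

In a real router the `select` values would come from the routing algorithm and an arbiter that resolves requests for the same output.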

 
 

Router technology has improved significantly over the years, which has made networks with high dimensionality feasible. As shown in the real-world example above, high-dimensional tori and hypercubes are excellent choices of topology for high-performance networks. The cost of high-performance, high-radix routers has contributed to the viability of these high-dimensionality networks. As the graph below shows, the bandwidth of routers has improved tremendously over a period of 10 years [4].


 
[[Image:Router_bandwidth.jpg]]
 
''Bandwidth of various routers over 10 year period'' [4]

 

Looking at the physical architecture and layout of routers, it is evident that the circuitry has become dramatically denser and more complex.


 
[[Image:Router_physical.jpg]]
 
''Router hardware over period of time'' [4]
 

 
[[Image:Router_radix.jpg]]
 
''Radix and latency of routers over 10 year period'' [4]
 
 

The '''radix''', or number of ports, of routers has also increased. Current routers have not only a higher radix but also lower latency than the previous generation. As the radix increases, the latency remains steady.


 


As router technology improves, more complex, high-dimensionality topologies become possible.


 

<h1>Fault Tolerant Routing</h1>

Fault-tolerant routing means the successful routing of messages between any pair of non-faulty nodes in the presence of faulty components [6]. As the number of processors in a multiprocessor system and the data rates increase, reliable transmission of data in the event of a network fault becomes a great concern, and hence fault-tolerant routing algorithms are important.


<h2>Fault Models</h2>

Faults in a network can be categorized in two types:
 
 
1. '''Transient Faults''' [5]: A transient fault is a temporary fault that occurs for a very short duration of time. It can be caused by a change in the output of a flip-flop that leads to the generation of an invalid header. These faults can be minimized using error control coding. They are generally evaluated in terms of Bit Error Rate.
 
 
2. '''Permanent Faults''' [5]: A permanent fault is a fault that does not go away and causes permanent damage to the network. It could be due to damaged wires and associated circuitry. These faults are generally evaluated in terms of Mean Time Between Failures.


<h2>Fault Tolerance Mechanisms (for permanent faults)</h2>

Permanent faults can be handled using one of two mechanisms:
 
 
1. '''Static Mechanism''': In the static fault tolerance model, once a fault is detected, all the processes running in the system are stopped and the routing tables are emptied. The routing tables are then recalculated from the fault information to provide a fault-free path.
 
 
2. '''Dynamic Mechanisms''': In the dynamic fault tolerance model, the processes in the network are not completely stalled; only the affected regions are repaired. Some of the methods for doing this are:
 
 
a. '''Block Faults''': In this method, many of the healthy nodes in the vicinity of the faulty nodes are also marked as faulty, so that no routes are created close to the actual faulty nodes. The shape of the region can be convex or non-convex, and care is taken that none of the new routes introduces a cyclic dependency in the channel dependency graph (CDG).
 
 
DISADVANTAGE: This method causes many healthy nodes to be declared faulty, which reduces system capacity.
 
 
[[Image:Fault_pic1.jpg]]
 
 
b. '''Fault Rings''': This method was introduced by Chalasani and Boppana. A fault ring is a set of nodes and links that are adjacent to faulty nodes or links. This approach reduces the number of healthy nodes that must be marked as faulty and blocked.
 
 
[[Image:Fault_pic2.jpg]]
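The two ideas can be sketched on a 2-D mesh. The code below is a simplified illustration under stated assumptions, not the published algorithms: the block-fault region is taken to be the smallest rectangle covering the faulty nodes, and the ring is taken to be the healthy nodes immediately surrounding that rectangle. All function names are hypothetical.

```python
def block_fault_region(faulty):
    """Bounds (x0, x1, y0, y1) of the smallest rectangle containing every faulty node."""
    xs = [x for x, _ in faulty]
    ys = [y for _, y in faulty]
    return min(xs), max(xs), min(ys), max(ys)

def sacrificed_nodes(faulty):
    """Healthy nodes that fall inside the rectangle and are disabled with it."""
    x0, x1, y0, y1 = block_fault_region(faulty)
    region = {(x, y) for x in range(x0, x1 + 1) for y in range(y0, y1 + 1)}
    return region - set(faulty)

def fault_ring(faulty, width, height):
    """Nodes bordering the rectangle, clipped to a width x height mesh."""
    x0, x1, y0, y1 = block_fault_region(faulty)
    ring = set()
    for x in range(x0 - 1, x1 + 2):
        for y in range(y0 - 1, y1 + 2):
            on_border = x in (x0 - 1, x1 + 1) or y in (y0 - 1, y1 + 1)
            if on_border and 0 <= x < width and 0 <= y < height:
                ring.add((x, y))
    return ring

faulty = {(2, 2), (3, 3)}
print(sorted(sacrificed_nodes(faulty)))   # [(2, 3), (3, 2)]
print(len(fault_ring(faulty, 6, 6)))      # 12 nodes surround the 2 x 2 block
```

In this example two healthy nodes are lost to the block, which is the capacity cost the disadvantage above refers to.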
 


 

<h1>References</h1>


1 Y. Solihin, ''Fundamentals of Parallel Computer Architecture''. Madison: OmniPress, 2009.
 
2 [http://www.ssrc.ucsc.edu/Papers/hospodor-mss04.pdf Interconnection Architectures for Petabyte-Scale High-Performance Storage Systems]
 
3 [http://www.diit.unict.it/~vcatania/COURSES/semm_05-06/DOWNLOAD/noc_routing02.pdf The Odd-Even Turn Model for Adaptive Routing]
 
4 [http://www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf Interconnection Topologies:(Historical Trends and Comparisons)]
This link is down at this time. [http://webcache.googleusercontent.com/search?q=cache:2f2KNFWJCsQJ:www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf+history+of+interconnection+topologies&cd=9&hl=en&ct=clnk&gl=us&client=ubuntu&source=www.google.com Google Cache]
 
5 [http://dspace.upv.es/xmlui/bitstream/handle/10251/2603/tesisUPV2824.pdf?sequence=1 Efficient mechanisms to provide fault tolerance in interconnection networks for PC clusters, José Miguel Montañana Aliaga.]
 
6 [http://web.ebscohost.com.www.lib.ncsu.edu:2048/ehost/pdfviewer/pdfviewer?vid=2&hid=15&sid=72e3828d-3cb1-42b9-8198-5c1e974ea53f@sessionmgr4 Adaptive Fault Tolerant Routing Algorithm for Tree-Hypercube Multicomputer, Qatawneh Mohammad]
 
7 [http://www.top500.org TOP500 Supercomputing Sites]
 
8 [http://www.redbooks.ibm.com/abstracts/sg245161.html?Open Understanding and Using the SP Switch]
 
9 [http://www.myri.com/myrinet/overview/ Myrinet Overview]
 
10 [http://en.wikipedia.org/wiki/QsNet QsNet (Quadrics' network)]
 
11 [http://www.google.com/research/pubs/pub35155.html Dragonfly Topology]
 
12 [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.95.573&rep=rep1&type=pdf Flattened Butterfly Topology]

CSC/ECE 506 Spring 2011/ch12 aj

2011-04-26T04:13:34Z

Asbransc:

<h1>Interconnection Network Architecture </h1>


In a multi-processor system, processors need to communicate with each other and access each other's resources. In order to route data and messages between processors, an interconnection architecture is needed.


 


Typically, in a multiprocessor system, message passed between processors are frequent and short1. Therefore, the interconnection network architecture must handle messages quickly by having '''low latency''', and must handle several messages at a time and have '''high bandwidth'''.


 


In a network, a processor along with its cache and memory is considered a '''node'''. The physical wires that connect between them is called a '''link'''. The device that routes messages between nodes is called a router. The shape of the network, such as the number of links and routers, is called the network '''topology'''.


<h1>History of Network Topologies</h1>

Hypercube topologies were invented in the 80s and had desirable characteristics when the number of nodes is small (~1000 maximum, often <100) and every processor must stop working to receive and forward the message 4. The low-radix era began in 1985 and was defined by routers with between 4 and 8 ports using toroidal, mesh or fat-tree topologies and wormhole routing. This era lasted about 20 years until it was determined that routers with dozens of ports offered superior performance. Two topologies were developed to take advantage of the newly developed high-radix routers. These are flattened butterfly and dragonfly, which are somewhere between a mesh with each point on the mesh being a router (or virtual router in the case of dragonfly) with dozens or hundreds of nodes attached and a fat tree with sufficiently high arity as to only have two levels.

<h2>Interconnection evolution in the Top500 List</h2>
[[Image:Top500interconnect.png|thumbnail|300px|left|]]

This chart shows the evolution over time of the different interconnect topologies by their dominance in the top500 list of supercomputers 7. As one can see, many technologies came into vogue briefly before losing performance share and disappearing. In the early days of the list, most of the computers list that the interconnect type is not applicable. However, the trailing end of the hypercube phase is clear in burnt orange. The dark blue at the top is "other" and the dark red in the middle is "proprietary", so we can only speculate about what topologies they might employ. The toroidal mesh appears briefly at the start in a cream color, and slightly outlasts the hypercube. The two crossbar technologies (blue and olive) followed the toroidal mesh. The fully-distributed crossbar died out quickly, but the multi-stage crossbar lasted longer but wasn't ever dominant. The 3-D torus (purple) dominates much of the 90s with hypercube topologies (dark pink) enjoying a short comeback in the later part of the decade. SP Switch (light olive), an IBM interconnect technology which uses a multi-stage crossbar switch replaced the 3-D torus8. Myrinet, Quadrics, and Federation all shared the spotlight in the mid 00s each used a similar fat-tree topology9,10. The current class of supercomputers is dominated by nodes connected with either Infiniband or gigabit ethernet. Both can be connected in either a fat-tree or 2-D mesh topology. The primary difference between them is speed. Infiniband is considerably faster per link and allows links to be ganged into groups of 4 or 12. Gigabit ethernet is vastly less expensive, however, and some supercomputer designers have apparently chosen to save money on the interconnect technology in order to allow the use of faster nodes 11,12.

 
 
 
 

<h1>Types of Network Topologies </h1>

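Several metrics are normally chosen to represent the cost and performance of a topology. In this section, the degree (the number of links attached to a node), the number of links, the diameter (the longest shortest-path distance between any two nodes, in hops), and the bisection bandwidth (the minimum number of links that must be cut to split the network into two equal halves) are given for each topology, where p is the number of nodes.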

<h2>Linear Array</h2>

[[Image:Top_linear.jpg|thumbnail|frame|right|]]

The nodes are connected in a line, as in an array. This topology is simple, but it does not scale well.
 
* ''Diameter:'' p-1
* ''Bisection BW:'' 1
* ''# Links:'' p-1
* ''Degree:'' 2
 
A linear array is the cheapest way to connect a group of nodes. Its number of links and its degree are the smallest of any topology. However, the drawback is also obvious: the two end points are p-1 hops apart, which is the diameter. The topology is also unreliable, because the bisection bandwidth is 1: a single failed link splits the network in two.

<h2>Ring</h2>
[[Image:Top_ring.jpg|thumbnail|right|]]
The ring has the same structure as the linear array, except that the two end nodes are connected to each other, forming a circle.
 
* ''Diameter:'' p/2
* ''Bisection BW:'' 2
* ''# Links:'' p
* ''Degree:'' 2
 
Compared with the linear array, the ring gets a relatively big improvement for very little effort: only one link is added. The longest distance between two nodes is cut roughly in half, and the bisection bandwidth increases to 2.
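The linear array and ring figures can be checked by building both networks and measuring them. The following is a minimal sketch, not part of the original text; the helper names (<code>linear_array</code>, <code>ring</code>, <code>diameter</code>, <code>num_links</code>) are made up for this illustration, and the diameter is found by breadth-first search from every node.

```python
from collections import deque

def linear_array(p):
    """Adjacency list of a p-node linear array (nodes 0..p-1)."""
    return {i: [j for j in (i - 1, i + 1) if 0 <= j < p] for i in range(p)}

def ring(p):
    """Adjacency list of a p-node ring: a linear array plus one wrap-around link."""
    return {i: sorted({(i - 1) % p, (i + 1) % p}) for i in range(p)}

def diameter(adj):
    """Longest shortest-path distance (in hops) over all node pairs, found by BFS."""
    best = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        best = max(best, max(dist.values()))
    return best

def num_links(adj):
    """Each link appears in two adjacency lists, so halve the total."""
    return sum(len(nbrs) for nbrs in adj.values()) // 2

p = 8
print(diameter(linear_array(p)), num_links(linear_array(p)))  # 7 7
print(diameter(ring(p)), num_links(ring(p)))                  # 4 8
```

For p = 8 the measured values match p-1 and p/2 for the diameter, and p-1 and p for the number of links.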

<h2>2-D Mesh</h2>

[[Image:Top_2Dmesh.jpg|thumbnail|right|]]

The 2-D mesh can be thought of as several linear arrays put together to form a 2-dimensional structure. This topology is well suited to applications such as ocean simulation and matrix calculation.
 
* ''Diameter:'' 2(sqrt(p)-1)
* ''Bisection BW:'' sqrt(p)
* ''# Links:'' 2sqrt(p)(sqrt(p)-1)
* ''Degree:'' 4
 
Nodes that are not on the edge have a '''degree''' of 4. To calculate the number of links, add the number of vertical links, sqrt(p)(sqrt(p)-1), to the number of horizontal links, also sqrt(p)(sqrt(p)-1), to get 2sqrt(p)(sqrt(p)-1). The diameter is the distance between two diagonally opposite corner nodes, which is the sum of two edges of length sqrt(p)-1.
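The link count above can be reproduced by enumerating the links of a small mesh instead of applying the formula. This is a hedged sketch: the function name <code>mesh_2d_metrics</code> is an assumption for illustration, p is assumed to be a perfect square, and the bisection value assumes a cut straight down the middle of a mesh with an even side length.

```python
import math

def mesh_2d_metrics(p):
    """Metrics of a sqrt(p) x sqrt(p) mesh, with links counted by enumeration."""
    k = math.isqrt(p)
    assert k * k == p, "p must be a perfect square"
    horizontal = sum(1 for r in range(k) for c in range(k - 1))  # links (r,c)-(r,c+1)
    vertical = sum(1 for r in range(k - 1) for c in range(k))    # links (r,c)-(r+1,c)
    diameter = (k - 1) + (k - 1)   # Manhattan distance between opposite corners
    bisection = k                  # one link per row crosses a vertical cut down the middle
    return {"links": horizontal + vertical, "diameter": diameter, "bisection": bisection}

print(mesh_2d_metrics(16))  # {'links': 24, 'diameter': 6, 'bisection': 4}
```

With p = 16 the enumeration gives 24 links, which agrees with 2sqrt(p)(sqrt(p)-1) = 2*4*3.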
 

<h2>2-D Torus</h2>

[[Image:Top_2Dtorus.jpg|thumbnail|right|]]

As with the step from linear array to ring, the 2-D torus takes the structure of the 2-D mesh and connects the nodes on opposite edges with wrap-around links.
 
* ''Diameter:'' sqrt(p)-1 -- exact when sqrt(p) is odd; in general each dimension is a ring of sqrt(p) nodes, so the diameter is 2*floor(sqrt(p)/2)
* ''Bisection BW:'' 2sqrt(p)
* ''# Links:'' 2p
* ''Degree:'' 4
 
With the end-around connections, the longest distance is roughly halved and the bisection bandwidth is doubled. The cost is the 2sqrt(p) additional wrap-around links, which raise the link count from 2sqrt(p)(sqrt(p)-1) to 2p.
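As a hedged check of the torus diameter, the sketch below measures it by breadth-first search on small tori. Each dimension of the torus is a ring of sqrt(p) nodes with diameter floor(sqrt(p)/2), so the measured value is 2*floor(sqrt(p)/2), which equals sqrt(p)-1 when sqrt(p) is odd. The function name <code>torus_diameter</code> and the parameter <code>k</code> (nodes per side) are assumptions for this illustration.

```python
from collections import deque

def torus_diameter(k):
    """Diameter (hops) of a k x k torus, measured by BFS from node (0, 0).

    The torus is symmetric, so the farthest node from (0, 0) gives the diameter.
    """
    dist = {(0, 0): 0}
    queue = deque([(0, 0)])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = ((r + dr) % k, (c + dc) % k)   # modulo implements the wrap-around links
            if nxt not in dist:
                dist[nxt] = dist[(r, c)] + 1
                queue.append(nxt)
    return max(dist.values())

for k in (3, 4, 5, 6):
    # side length, measured diameter, 2*floor(k/2)
    print(k, torus_diameter(k), 2 * (k // 2))
```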
 

<h2>Cube (3-D Mesh)</h2>
[[Image:Top_cube.jpg|thumbnail|right|]]

If we add two more neighbors to each node, we get a cube. The cube can be thought of as a three-dimensional mesh.
 
* ''Diameter:'' 3(p^(1/3)-1) -- this is the corner-to-corner distance, analogous to the 2-D mesh formula
* ''Bisection BW:'' p^(2/3) -- p^(1/3) rows of p^(1/3) links must be cut to bisect a cube
* ''# Links:'' 3p^(2/3)(p^(1/3)-1) -- there are p^(2/3)(p^(1/3)-1) links in each of 3 dimensions, analogous to the 2-D mesh formula
* ''Degree:'' 6 (from the inside nodes)
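The number of links in a 3-D mesh can be confirmed by enumeration. In the sketch below, which is an illustration rather than part of the original text, <code>k</code> stands for the number of nodes per side (so p = k^3). Counting only the links toward the +x, +y, and +z neighbors gives k^2(k-1) links per dimension, or 3k^2(k-1) in total.

```python
def mesh_3d_links(k):
    """Count the links of a k x k x k mesh by enumerating neighbor pairs."""
    links = 0
    for x in range(k):
        for y in range(k):
            for z in range(k):
                # Count only the link toward the +x, +y and +z neighbor so that
                # every link is counted exactly once.
                links += (x + 1 < k) + (y + 1 < k) + (z + 1 < k)
    return links

for k in (2, 3, 4):
    # number of nodes, enumerated links, closed-form 3k^2(k-1)
    print(k ** 3, mesh_3d_links(k), 3 * k * k * (k - 1))
```

For k = 2 this gives the 12 edges of an ordinary cube; for k = 3 it gives 54 links.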
 

<h2>Hypercube</h2>

[[Image:Top_hypercube.jpg|thumbnail|right|Hypercube]]

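In an N-dimensional mesh, the boundary nodes are normally the ones that hurt the performance of the entire network, because they have fewer links than the interior nodes. The hypercube avoids this: it has only two nodes along each of its log2(p) dimensions, so every node has the same degree. A hypercube can be built by taking two cubes of the next lower dimension and linking their corresponding nodes.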
 
* ''Diameter:'' log2(p)
* ''Bisection BW:'' p/2 -- p/2 links run from one N-1 cube to the other.
* ''# Links:'' p/2 * log2(p) -- each node has a degree of log2(p). Multiply by p nodes and divide by 2 nodes per link.
* ''Degree:'' log2(p)
 
The metrics show that the diameter and bisection bandwidth are significantly better for high-dimensional topologies. The diameter follows from the node numbering: each node is labeled with a bitstring that is log2(p) bits long, neighbors differ in exactly one bit, and the farthest node is the bitstring's complement. One bit can be flipped per hop, so the diameter is log2(p).
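The bit-flipping argument can be illustrated with a small routing sketch. It assumes nodes are labeled with integers whose binary form is the bitstring, and it fixes differing bits from the lowest bit upward; that bit order and the name <code>hypercube_route</code> are arbitrary choices for this example, not something the text prescribes.

```python
def hypercube_route(src, dst):
    """Route from src to dst by fixing one differing address bit per hop
    (lowest bit first; the bit order is an arbitrary choice for this sketch)."""
    path = [src]
    node = src
    diff = src ^ dst          # bits set where the two addresses differ
    bit = 0
    while diff:
        if diff & 1:
            node ^= 1 << bit  # flip this bit: one hop along dimension `bit`
            path.append(node)
        diff >>= 1
        bit += 1
    return path

n = 4                               # 4-dimensional hypercube, p = 16 nodes
src = 0b0101
dst = src ^ ((1 << n) - 1)          # bitwise complement: the farthest node
path = hypercube_route(src, dst)
print([format(x, "04b") for x in path])  # ['0101', '0100', '0110', '0010', '1010']
print(len(path) - 1)                     # 4 hops = log2(16)
```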

<h2>Tree</h2>

[[Image:Top_tree.jpg|thumbnail|right|Tree]]

The tree is a hierarchical structure with nodes at the bottom and switching nodes at the upper levels.
 
* ''Diameter:'' 2log2(p) -- the path from a leaf through the root to the farthest leaf on the other side
* ''Bisection BW:'' 1 -- breaking either link to the root bisects the tree
* ''# Links:'' 2(p-1) -- each router has 2 links to the level below, and there are p-1 routers if p is a power of 2.
* ''Degree:'' 3 -- interior routers have a degree of 3.
 
The tree experiences high traffic at the upper levels. Since almost half of the messages need to go through the root node, the root becomes the bottleneck of the tree topology. The other disadvantage of the tree topology is that the bisection bandwidth is only 1.
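The tree figures can be tallied level by level. The sketch below assumes a complete binary tree whose p leaves are the processing nodes and whose interior vertices are routers; the name <code>binary_tree_metrics</code> is made up for this illustration.

```python
def binary_tree_metrics(p):
    """Metrics of a complete binary tree with p leaf nodes (p a power of 2)."""
    levels = p.bit_length() - 1          # log2(p) levels of routers
    assert 1 << levels == p, "p must be a power of 2"
    routers = sum(2 ** lvl for lvl in range(levels))  # 1 + 2 + ... + p/2 = p - 1
    links = 2 * routers                  # every router has 2 links to the level below
    diameter = 2 * levels                # leaf -> root -> leaf on the other side
    return {"routers": routers, "links": links, "diameter": diameter}

print(binary_tree_metrics(8))  # {'routers': 7, 'links': 14, 'diameter': 6}
```

With p = 8 the tally gives 7 routers and 14 links, which agrees with 2(p-1).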

<h2>Fat Tree</h2>

[[Image:Top_fat_tree.jpg|thumbnail|right|Fat Tree]]
In order to improve the performance of the tree topology, the fat tree alleviates the traffic at upper levels by "fattening" up the links at the upper levels.
 
* ''Diameter:'' 2log2(p) -- same as a regular tree
* ''Bisection BW:'' p/2 -- all links to (one side of) the root must be cut to bisect the tree
* ''# Links:'' plog2(p) -- there are p links at each of log2(p) levels
* ''Degree:'' p -- the root node has p links through it.
 
The fat tree relieves the pressure on the root node, and the bisection bandwidth is increased as well.

<h2>Butterfly</h2>

[[Image:Top_butterfly.jpg|thumbnail|right|Butterfly]]
 

The butterfly structure is similar to the tree structure, but it replicates the switching node structure of the tree topology and connects them together so that there are equal links and routers at all levels.
* ''Diameter:'' 2log2(p) -- same as a tree since butterfly has the same depth as a tree (just with p nodes at each level)
* ''Bisection BW:'' p -- p links connect the two halves at the top level
* ''# Links:'' 2p*log2(p) -- there are 2*p links at each level times log2(p) levels.
* ''Degree:'' 4 -- the routers in the middle levels all have 4 links. The leaves and routers at the top level each have 2 links.

 

The butterfly has performance similar to the hypercube. In terms of cost, the butterfly has a smaller degree (so cheaper routers can be used), but the hypercube has fewer links.
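To make the cost comparison concrete, the sketch below simply evaluates the degree and link formulas listed above for the hypercube and the butterfly. It takes those formulas at face value rather than deriving them, and the function names are hypothetical.

```python
def hypercube_cost(p):
    """Degree and link count of a hypercube, using the formulas listed above."""
    n = p.bit_length() - 1               # log2(p), assuming p is a power of 2
    return {"degree": n, "links": p * n // 2}

def butterfly_cost(p):
    """Degree and link count of a butterfly, using the formulas listed above."""
    n = p.bit_length() - 1
    return {"degree": 4, "links": 2 * p * n}

for p in (16, 64, 1024):
    print(p, hypercube_cost(p), butterfly_cost(p))
```

For p = 64, for example, the hypercube needs degree-6 routers and 192 links, while the butterfly needs degree-4 routers and 768 links.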

<h1>Real-World Implementation of Network Topologies </h1>

In a research study by Andy Hospodor and Ethan Miller, several network topologies were investigated for a high-performance, high-traffic network [2], including fat tree, butterfly, mesh, torus, and hypercube structures. Advantages and disadvantages in terms of cost, performance, and reliability were discussed.

In this experiment, a petabyte-scale network with over 100 GB/s of total aggregate bandwidth was investigated. The network consisted of 4096 disks and large servers, with routers and switches in between [2].

[[Image:Disknet_network.jpg|frame|center|''Basic structure of Hospodor and Miller's experimental network'' [2]]]

The overall structure of the network is shown above. Note that this structure is very susceptible to failure and congestion.

<h2>Fat Tree</h2>

In large-scale, high-performance applications, the fat tree can be a choice. However, in order to "fatten" the links, redundant connections must be used: instead of one link between switching nodes, several are needed. The problem is that more input and output links require routers with more input and output ports. Routers with in excess of 100 ports are difficult to build and expensive, so multiple routers would have to be stacked together. Even then, the routers would be expensive, and several of them would be required [2].
 
 
The Japan Agency for Marine-Earth Science and Technology supercomputing system uses the fat tree topology. The system connects 1280 NEC processors [7].

 

<h2>Butterfly</h2>

In high-performance applications, the butterfly structure is a good choice. The butterfly topology uses fewer links than other topologies; however, each link carries traffic from the entire layer. Fault tolerance is poor: there is only a single path between any pair of nodes, so if a link breaks, data cannot be re-routed and communication is broken [2].

[[Image:Disknet_butterfly.jpg|frame|center|''Butterfly structure'' [2]]]

<h2>Meshes and Tori</h2>

The mesh and torus structures used in this application would require a large number of links and an aggregate of several thousand ports. However, because there are so many links, the mesh and torus structures provide alternate paths in case of failures [2].

[[Image:Disknet_mesh.jpg|frame|center|''Mesh structure'' [2]]]

[[Image:Disknet_torus.jpg|frame|center|''Torus structure'' [2]]]

One example of current use of the torus structure is the QPACE SFB TR Cluster in Germany, which uses PowerXCell 8i processors. The system uses a 3-D torus topology with 4608 processors [7].

<h2>Hypercube</h2>

Similar to the torus structures, the hypercube requires a larger number of links. However, its bandwidth scales better than that of the mesh and torus structures.

The CRAY T3E, CRAY XT3, and SGI Origin 2000 use k-ary n-cube topologies.

[[Image:Disknet_hypercube.jpg|frame|center|''Hypercube structure'' [2]]]

It is worth mentioning that even though many topologies perform much better than the 2-D mesh, the cost of these advanced topologies is also high. Since most chips are laid out in 2-D space, it is very expensive to implement a high-dimensional topology on a 2-D chip. For the hypercube topology, increasing the number of nodes raises the degree of every node. For the butterfly topology, the degree grows relatively slowly, but the required number of links and switches increases rapidly.

<h1>Comparison of Network Topologies </h1>


The following table shows the total number of ports required for each network topology.
 
 
[[Image:Disknet_ports.jpg]]
 
''Number of ports for each topology'' [2]

 

As shown above, the 6-D hypercube requires the largest number of ports, due to its relatively complex six-dimensional structure. In contrast, the fat tree requires the fewest ports, even though its links have been "fattened" with redundant links. The butterfly network requires more than twice as many ports as the fat tree, since it essentially replicates the switching layer of the fat tree. The number of ports for the mesh and torus structures increases as the dimensionality increases. However, with modern router technology, the number of ports is a less important consideration.

 
 
The average path length, or average number of hops, and the average link load (GB/s) are shown below.
 
 
[[Image:Disknet_load.jpg]]
 
''Average path length and link load for each topology'' [2]

 

Looking at the trends, when the average path length is high, the average link load is also high; in other words, the two are roughly proportional. It is obvious from the graph that the 2-D mesh has, by far, the worst performance. In a network this large, its average path length is simply too high, and the average link load suffers; for this type of high-performance network, the 2-D mesh does not scale well. The 2-D torus cuts the average path length and average link load in half by connecting the edge nodes together, but its performance is still relatively poor compared to the other types. The butterfly and fat tree have the lowest average path length and average link load.

 
 
The figure below shows the cost of the network topologies.
 
 
[[Image:Disknet_cost.jpg]]
 
''Cost of each topology'' [2]

 

Despite using the fewest ports, the fat tree topology has by far the highest cost. The ports it does use are high-bandwidth 10 GB/s ports: over 2400 of them are required to provide enough bandwidth at the upper levels of the tree. This pushes the cost up dramatically and makes the fat tree impractical from a cost standpoint. While the total cost of the fat tree is about 15 million dollars, the rest of the network topologies are clustered below 4 million dollars. As the dimensionality of the mesh and torus structures increases, so does the cost. The cost of the butterfly network falls between that of the 2-D mesh and torus and that of the 6-D hypercube.

 
 
When cost and average link load are both factored in, the following graph is produced.
 
 
[[Image:Disknet_overall.jpg]]
 
''Overall cost of each topology'' [2]

 

From the figure above, the 6-D hypercube is the most cost-effective choice for this particular network setup. Although the 6-D hypercube costs more because it needs more links and ports, it provides higher bandwidth, which can offset the higher cost. The high-dimensional tori also perform well, but cannot provide as much bandwidth as the 6-D hypercube. For systems that do not need as much bandwidth, a high-dimensional torus is also a good choice. The butterfly topology is another alternative, but it has lower fault tolerance.

However, when the number of nodes increases, the relative cost of the higher-dimensional topologies increases far faster than their relative performance when compared to a 2-D mesh. This is because the 2-D mesh only uses low-cost, short links. The higher-dimensional structures must be projected onto our 3-dimensional world, and thus require many long, expensive links that wrap around the outside of the system like an impenetrable tangle of jungle vines. Maintaining such a network is also quite slow and tedious.


<h1>Packet Routing</h1>


The '''routing''' algorithm determines the path a packet of data takes from source to destination. Routing can be '''deterministic''', where the path is always the same for a given source and destination, or '''adaptive''', where the path can change. A routing algorithm can also be '''partially adaptive''', where packets have a choice of paths but not every packet is allowed to use every shortest path [3].


<h2>Deadlock</h2>

Packets are in '''deadlock''' when they cannot continue to move through the nodes. The illustration below demonstrates this event.
 
 
[[Image:Routing_deadlock.jpg]]
 
''Example of deadlock''
 
 

Assume that all of the buffers at each node are full. The packet from Node 1 cannot continue to Node 2, the packet from Node 2 cannot continue to Node 3, and so on. Since no packet can move, the packets are deadlocked.

 
 

Deadlock arises from a cyclic pattern of routing. To avoid deadlock, avoid circular routing patterns.
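The cyclic condition can be expressed as a cycle in a "waits-for" graph, where an edge from one node to another means the first node's packet is waiting for a buffer at the second. The sketch below is a generic depth-first cycle check, not a method taken from the cited sources; the four-node example is an assumption made for illustration.

```python
def has_cycle(waits_for):
    """Return True if the waits-for graph contains a cycle (depth-first search)."""
    WHITE, GREY, BLACK = 0, 1, 2         # unvisited, on the current path, finished
    color = {n: WHITE for n in waits_for}

    def visit(n):
        color[n] = GREY
        for m in waits_for.get(n, ()):
            if color.get(m, WHITE) == GREY:
                return True              # reached a node already on the current path
            if color.get(m, WHITE) == WHITE and visit(m):
                return True
        color[n] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in waits_for)

# Each node's packet waits for the next node's buffer, and the last waits for the first.
deadlocked = {1: [2], 2: [3], 3: [4], 4: [1]}
# Same chain, but Node 4 can deliver its packet, so the chain eventually drains.
draining = {1: [2], 2: [3], 3: [4], 4: []}

print(has_cycle(deadlocked))  # True
print(has_cycle(draining))    # False
```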

 
 

To avoid circular routing patterns, some turns are disallowed. These '''turn restrictions''' prevent packets from completing a cycle. Some of these turn restrictions are described below.


 
<h2>Dimensional ordered (X-Y) routing</h2>

Turns from the y-dimension to the x-dimension are not allowed, so a packet travels its full distance in the x-dimension first and then in the y-dimension.
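A minimal sketch of dimension-ordered routing on a 2-D mesh follows. Nodes are assumed to be identified by (x, y) coordinates, and <code>xy_route</code> is a name chosen for this illustration.

```python
def xy_route(src, dst):
    """Dimension-ordered routing on a 2-D mesh: finish all x hops, then all y hops."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:                   # travel along the x-dimension first
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:                   # then along the y-dimension; never back to x
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path

print(xy_route((0, 0), (2, 3)))
# [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (2, 3)]
```

Because the route depends only on the source and destination, this scheme is deterministic.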


<h2>West First</h2>

Turns to the west are not allowed, so a packet that needs to travel west must do so first.


<h2>North Last</h2>

Turns after a north direction are not allowed, so a packet that needs to travel north must do so last.


<h2>Negative First</h2>

Turns from a positive direction to a negative direction (-x or -y) are not allowed, so a packet must make its hops in the negative directions first.
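The turn restrictions of these models can be written down as sets of prohibited turns and used to check a path. The sketch below is an illustration under stated assumptions: east is taken as +x and north as +y, a turn is written as (current heading, new heading), and the table reflects the usual reading of each model (x-y, west-first, north-last, negative-first). The names <code>PROHIBITED</code> and <code>path_allowed</code> are made up here.

```python
# Prohibited turns per model, written as (current heading, new heading).
# Assumption for this sketch: east = +x, west = -x, north = +y, south = -y.
PROHIBITED = {
    "x-y":            {("N", "E"), ("N", "W"), ("S", "E"), ("S", "W")},  # no y-to-x turns
    "west-first":     {("N", "W"), ("S", "W")},                          # no turns to the west
    "north-last":     {("N", "E"), ("N", "W")},                          # no turns after north
    "negative-first": {("N", "W"), ("E", "S")},                          # no positive-to-negative turns
}

def path_allowed(model, directions):
    """Check a sequence of hop directions against a turn model's prohibited turns."""
    banned = PROHIBITED[model]
    return all((a, b) not in banned for a, b in zip(directions, directions[1:]))

print(path_allowed("west-first", ["W", "W", "N", "E"]))  # True: west hops come first
print(path_allowed("west-first", ["E", "E", "N", "W"]))  # False: turns west after north
print(path_allowed("north-last", ["N", "E"]))            # False: turns after heading north
```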


<h2>Odd-Even Turn Model</h2>


Unfortunately, the above turn-restriction models reduce the degree of adaptiveness and are only partially adaptive. The models force some packets onto different routes, which are not necessarily minimal paths. This may cause unfairness and reduces the ability of the system to relieve congestion, so overall performance could suffer [3].


 


Ge-Ming Chiu introduced the Odd-Even turn model, an adaptive, deadlock-free turn-restriction model that performs better than the previously mentioned models [3]. The model is designed primarily for 2-D meshes.

 

''Turns from the east to north direction from any node on an even column are not allowed.''
 
''Turns from the north to west direction from any node on an odd column are not allowed.''
 
 
''Turns from the east to south direction from any node on an even column are not allowed.''
 
''Turns from the south to west direction from any node on an odd column are not allowed.''
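The four rules above translate directly into a small check. This is a hedged sketch: columns are assumed to be numbered from 0, so column 0 counts as even, and a turn is written as (current heading, new heading); the function name is hypothetical.

```python
def odd_even_turn_allowed(column, heading, new_heading):
    """Apply the four odd-even rules to a turn taken at a node in `column`.

    Assumption for this sketch: columns are numbered from 0, so column 0 is even.
    """
    turn = (heading, new_heading)
    if column % 2 == 0:
        # Even column: east-to-north and east-to-south turns are not allowed.
        return turn not in {("E", "N"), ("E", "S")}
    # Odd column: north-to-west and south-to-west turns are not allowed.
    return turn not in {("N", "W"), ("S", "W")}

print(odd_even_turn_allowed(2, "E", "N"))  # False: east-to-north in an even column
print(odd_even_turn_allowed(3, "E", "N"))  # True: the same turn in an odd column
print(odd_even_turn_allowed(3, "S", "W"))  # False: south-to-west in an odd column
```

Unlike the models above, the set of prohibited turns depends on where the packet is, which is what leaves more routes open overall.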
 
 
 
The illustration below shows allowed routing for different source and destination nodes. Depending on which column the packet is in, only certain directions are allowed.
 
 
[[Image:Routing_odd_even.jpg]]
 
''Odd-Even turn restriction model proposed by Ge-Ming Chiu'' [3]
 

 

<h1>Comparison of Turn Restriction Models</h1>

To compare the performance of the turn restriction models, Chiu simulated a 15 x 15 mesh under various traffic patterns. All channels have a bandwidth of 20 flits/usec and a buffer size of one flit. The dimension-ordered x-y routing, west-first, and negative-first models were compared against the odd-even model.

 

The traffic patterns simulated were uniform, transpose, and hot spot. Uniform simulates each node sending messages to every other node with equal probability. Transpose simulates two opposite nodes sending messages to their respective halves of the mesh. Hot spot simulates a few "hot spot" nodes that receive high traffic.



 
 
[[Image:Routing_uniform.jpg]]
 
''Uniform traffic simulation of various turn restriction models'' [3]
 
 
The performance of the different routing algorithms is shown above for uniform traffic. Here the dimension-ordered x-y model outperforms the rest of the models: as the number of messages increases, the x-y model has the slowest increase in average communication latency.



 
 
[[Image:Routing_transpose.jpg]]
 
''First transpose traffic simulation of various turn restriction models'' [3]
 
 
The performance of the different routing algorithms is shown above for the first transpose traffic. The negative-first model has the best performance, while the odd-even model performs better than the west-first and x-y models.



 
 
[[Image:Routing_transpose2.jpg]]
 
''Second transpose traffic simulation of various turn restriction models'' [3]
 
 
With the second transpose simulation, the odd-even model outperforms the rest.



 
 
[[Image:Routing_hotspot.jpg]]
 
''Hotspot traffic simulation of various turn restriction models'' [3]
 
 
The performance of the different routing algorithms is shown above for hotspot traffic. Only one hotspot was simulated for this test. The odd-even model outperforms the other models when hotspot traffic is 10%.



 
 
[[Image:Routing_hotspot2.jpg]]
 
''Second hotspot traffic simulation of various turn restriction models'' [3]
 
 
When the number of hotspots is increased to five, the odd-even model begins to shine: its latency is the lowest at both 6 and 8 percent hotspot traffic. Meanwhile, the x-y model performs very poorly.


 
 

While the x-y model performs well under uniform traffic, it lacks adaptiveness. When traffic concentrates on hotspots, the x-y model cannot adapt and re-route traffic around the congestion they cause. The odd-even model has superior adaptiveness under high congestion.


 

<h1>Router Architecture</h1>

The '''router''' is a device that routes incoming data to its destination. It has several input ports and several output ports; data arriving on one of the input ports is routed to one of the output ports. Which output port is chosen depends on the destination of the data and on the routing algorithm.

 
 

The internal architecture of a router consists of input and output ports and a '''crossbar switch'''. The crossbar switch connects each input port to the output port selected for its data, acting essentially as a set of multiplexers.
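The multiplexer view of a crossbar can be sketched as a toy model in which every output port records which input port it currently selects. The class name <code>Crossbar</code> and its methods are hypothetical and exist only for this illustration; a real router also has buffering and arbitration, which are omitted here.

```python
class Crossbar:
    """Toy crossbar: each output port is a multiplexer that selects one input port."""

    def __init__(self, num_inputs, num_outputs):
        self.num_inputs = num_inputs
        self.select = [None] * num_outputs   # select[o] = input port feeding output o

    def connect(self, inp, out):
        """Set output `out` to select input `inp`; one connection per port."""
        if not 0 <= inp < self.num_inputs:
            raise ValueError("no such input port")
        if self.select[out] is not None:
            raise ValueError(f"output {out} is already in use")
        if inp in self.select:
            raise ValueError(f"input {inp} is already connected")
        self.select[out] = inp

    def transfer(self, inputs):
        """Move one flit per connected input to its output; idle outputs carry None."""
        return [None if s is None else inputs[s] for s in self.select]

xb = Crossbar(4, 4)
xb.connect(0, 2)      # input 0 -> output 2
xb.connect(3, 0)      # input 3 -> output 0
print(xb.transfer(["a", "b", "c", "d"]))  # ['d', None, 'a', None]
```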

 
 

Router technology has improved significantly over the years, which has made networks with high dimensionality feasible. As shown in the real-world example above, high-dimensional tori and hypercubes are excellent topology choices for high-performance networks. The affordability of high-performance, high-radix routers has contributed to the viability of these high-dimensionality networks. As the graph below shows, the bandwidth of routers has improved tremendously over a period of 10 years [4].


 
[[Image:Router_bandwidth.jpg]]
 
''Bandwidth of various routers over 10 year period'' [4]

 

Looking at the physical architecture and layout of routers, it is evident that the circuitry has become dramatically denser and more complex.


 
[[Image:Router_physical.jpg]]
 
''Router hardware over period of time'' [4]
 

 
[[Image:Router_radix.jpg]]
 
''Radix and latency of routers over 10 year period'' [4]
 
 

The '''radix''', or the number of ports, of routers has also increased. Current technology not only has a high radix but also lower latency than the previous generation. As the radix increases, the latency remains steady.


 


As router technology improves, more complex, high-dimensionality topologies become possible.


 

<h1>Fault Tolerant Routing</h1>

Fault-tolerant routing means the successful routing of messages between any pair of non-faulty nodes in the presence of faulty components [6]. With the increasing number of processors in multiprocessor systems and high data rates, reliable transmission of data in the event of a network fault is of great concern, and hence fault-tolerant routing algorithms are important.


<h2>Fault Models</h2>

Faults in a network can be categorized into two types:
 
 
1. '''Transient Faults''' [5]: A transient fault is a temporary fault that occurs for a very short duration. It can be caused, for example, by a change in the output of a flip-flop that leads to an invalid header. These faults can be minimized using error control coding and are generally evaluated in terms of bit error rate.
 
 
2. '''Permanent Faults''' [5]: A permanent fault is a fault that does not go away and causes permanent damage to the network. It could be due to damaged wires and associated circuitry. These faults are generally evaluated in terms of mean time between failures.


<h2>Fault Tolerance Mechanisms (for permanent faults)</h2>

Permanent faults can be handled using one of two mechanisms:
 
 
1. '''Static Mechanism''': In the static fault tolerance model, once a fault is detected, all processes running in the system are stopped and the routing tables are emptied. The routing tables are then recalculated from the fault information to provide fault-free paths.
 
 
2. '''Dynamic Mechanisms''': In the dynamic fault tolerance model, the processes in the network are not completely stalled; only the affected regions are repaired. Some of the methods to do this are:
 
 
a. '''Block Faults''': In this method, many of the healthy nodes in the vicinity of the faulty nodes are also marked as faulty, so that no routes are created close to the actual faulty nodes. The shape of the region can be convex or non-convex, and it is ensured that none of the new routes introduces a cyclic dependency in the channel dependency graph (CDG).
 
 
Disadvantage: This method causes many healthy nodes to be declared faulty, which reduces system capacity.
 
 
[[Image:Fault_pic1.jpg]]
 
 
b. '''Fault Rings''': This method was introduced by Chalasani and Boppana. A fault ring is a set of nodes and links that are adjacent to faulty nodes or links. This approach reduces the number of healthy nodes that must be marked as faulty and blocked.
 
 
[[Image:Fault_pic2.jpg]]
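The recalculation step of the static mechanism can be sketched as a shortest-path search that skips faulty nodes. The example below assumes a k x k mesh with nodes named by (x, y) coordinates; it is a plain breadth-first search written for this illustration, not the algorithm of any cited work, and it does not by itself guarantee deadlock freedom.

```python
from collections import deque

def route_avoiding_faults(k, src, dst, faulty):
    """Shortest path on a k x k mesh that skips faulty nodes, or None if unreachable."""
    if src in faulty or dst in faulty:
        return None
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:      # walk the parent links back to the source
                path.append(node)
                node = parent[node]
            return path[::-1]
        x, y = node
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if (0 <= nxt[0] < k and 0 <= nxt[1] < k
                    and nxt not in faulty and nxt not in parent):
                parent[nxt] = node
                queue.append(nxt)
    return None

# Node (1, 0) has failed, so the direct 2-hop route from (0, 0) to (2, 0) is gone.
path = route_avoiding_faults(3, (0, 0), (2, 0), faulty={(1, 0)})
print(path, len(path) - 1)   # the detour around the fault takes 4 hops
```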
 


 

<h1>References</h1>


[1] Solihin text
 
[2] [http://www.ssrc.ucsc.edu/Papers/hospodor-mss04.pdf Interconnection Architectures for Petabyte-Scale High-Performance Storage Systems]
 
[3] [http://www.diit.unict.it/~vcatania/COURSES/semm_05-06/DOWNLOAD/noc_routing02.pdf The Odd-Even Turn Model for Adaptive Routing]
 
[4] [http://www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf Interconnection Topologies: (Historical Trends and Comparisons)]
This link is down at this time. [http://webcache.googleusercontent.com/search?q=cache:2f2KNFWJCsQJ:www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf+history+of+interconnection+topologies&cd=9&hl=en&ct=clnk&gl=us&client=ubuntu&source=www.google.com Google Cache]
 
[5] [http://dspace.upv.es/xmlui/bitstream/handle/10251/2603/tesisUPV2824.pdf?sequence=1 Efficient mechanisms to provide fault tolerance in interconnection networks for PC clusters, José Miguel Montañana Aliaga]
 
[6] [http://web.ebscohost.com.www.lib.ncsu.edu:2048/ehost/pdfviewer/pdfviewer?vid=2&hid=15&sid=72e3828d-3cb1-42b9-8198-5c1e974ea53f@sessionmgr4 Adaptive Fault Tolerant Routing Algorithm for Tree-Hypercube Multicomputer, Qatawneh Mohammad]
 
[7] [http://www.top500.org TOP500 Supercomputing Sites]
 
[8] [http://www.redbooks.ibm.com/abstracts/sg245161.html?Open Understanding and Using the SP Switch]
 
[9] [http://www.myri.com/myrinet/overview/ Myrinet Overview]
 
[10] [http://en.wikipedia.org/wiki/QsNet QsNet (Quadrics' network)]
 
[11] [http://www.google.com/research/pubs/pub35155.html Dragonfly Topology]
 
[12] [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.95.573&rep=rep1&type=pdf Flattened Butterfly Topology]

CSC/ECE 506 Spring 2011/ch12 aj

2011-04-26T04:12:45Z

Asbransc:

<h1>Interconnection Network Architecture </h1>


In a multi-processor system, processors need to communicate with each other and access each other's resources. In order to route data and messages between processors, an interconnection architecture is needed.


 


Typically, in a multiprocessor system, message passed between processors are frequent and short1. Therefore, the interconnection network architecture must handle messages quickly by having '''low latency''', and must handle several messages at a time and have '''high bandwidth'''.


 


In a network, a processor along with its cache and memory is considered a '''node'''. The physical wires that connect between them is called a '''link'''. The device that routes messages between nodes is called a router. The shape of the network, such as the number of links and routers, is called the network '''topology'''.


<h1>History of Network Topologies</h1>

Hypercube topologies were invented in the 80s and had desirable characteristics when the number of nodes is small (~1000 maximum, often <100) and every processor must stop working to receive and forward the message 4. The low-radix era began in 1985 and was defined by routers with between 4 and 8 ports using toroidal, mesh or fat-tree topologies and wormhole routing. This era lasted about 20 years until it was determined that routers with dozens of ports offered superior performance. Two topologies were developed to take advantage of the newly developed high-radix routers. These are flattened butterfly and dragonfly, which are somewhere between a mesh with each point on the mesh being a router (or virtual router in the case of dragonfly) with dozens or hundreds of nodes attached and a fat tree with sufficiently high arity as to only have two levels.

<h2>Interconnection evolution in the Top500 List</h2>
[[Image:Top500interconnect.png|thumbnail|300px|left|]]

This chart shows the evolution over time of the different interconnect topologies by their dominance in the top500 list of supercomputers 7. As one can see, many technologies came into vogue briefly before losing performance share and disappearing. In the early days of the list, most of the computers list that the interconnect type is not applicable. However, the trailing end of the hypercube phase is clear in burnt orange. The dark blue at the top is "other" and the dark red in the middle is "proprietary", so we can only speculate about what topologies they might employ. The toroidal mesh appears briefly at the start in a cream color, and slightly outlasts the hypercube. The two crossbar technologies (blue and olive) followed the toroidal mesh. The fully-distributed crossbar died out quickly, but the multi-stage crossbar lasted longer but wasn't ever dominant. The 3-D torus (purple) dominates much of the 90s with hypercube topologies (dark pink) enjoying a short comeback in the later part of the decade. SP Switch (light olive), an IBM interconnect technology which uses a multi-stage crossbar switch replaced the 3-D torus8. Myrinet, Quadrics, and Federation all shared the spotlight in the mid 00s each used a similar fat-tree topology9,10. The current class of supercomputers is dominated by nodes connected with either Infiniband or gigabit ethernet. Both can be connected in either a fat-tree or 2-D mesh topology. The primary difference between them is speed. Infiniband is considerably faster per link and allows links to be ganged into groups of 4 or 12. Gigabit ethernet is vastly less expensive, however, and some supercomputer designers have apparently chosen to save money on the interconnect technology in order to allow the use of faster nodes 11,12.

 
 
 
 

<h1>Types of Network Topologies </h1>

Several metrics are normally choose to represent the cost and performance for a certain topology. In this section, degree, number of links, diameter and bisection width will be calculated for each topology.

<h2>Linear Array</h2>

[[Image:Top_linear.jpg|thumbnail|frame|right|]]

The nodes are connected linearly as in an array. This type of topology is simple, however, it does not scale well.
 
* ''Diameter:'' p-1
* ''Bisection BW:'' 1
* ''# Links:'' p-1
* ''Degree:'' 2
 
A linear array is the cheapest way to connect a group of nodes together. The number of links and degree of linear array have the smallest value of any topology. However, the draw back of this topology is also obvious: the two end points suffer the longest distance between each other, which makes the diameter p-1. This topology is also not reliable since the bisection bandwidth is 1.

<h2>Ring</h2>
[[Image:Top_ring.jpg|thumbnail|right|]]
Similar structure as the linear array, except, the ending nodes connect to each other, establishing a circular structure.
 
* ''Diameter:'' p/2
* ''Bisection BW:'' 2
* ''# Links:'' p
* ''Degree:'' 2
 
Compared with the cheapest linear array topology, the ring topology uses least effort (only add one link) to get a relatively big improvement. The longest distance between two nodes is cut into half. And the biseciton bandwidth has increased to 2.

<h2>2-D Mesh</h2>

[[Image:Top_2Dmesh.jpg|thumbnail|right|]]

The 2-D mesh can be thought of as several linear arrays put together to form a 2-dimensional structure. This topology is very suitable for some of the applications such as the ocean application and matrix calculation.
 
* ''Diameter:'' 2(sqrt(p)-1)
* ''Bisection BW:'' sqrt(p)
* ''# Links:'' 2sqrt(p)(sqrt(p)-1)
* ''Degree:'' 4
 
Nodes that are not on the edge have a '''degree''' of 4. To calculate the number of links, add the number of vertical links, sqrt(p)(sqrt(p)-1), to the number of horizontal links, also sqrt(p)(sqrt(p)-1), to get 2sqrt(p)(sqrt(p)-1). The diameter is calculated by the distance between two diagonal nodes which is the sum of 2 edges of length sqrt(p)-1.
 

<h2>2-D Torus</h2>

[[Image:Top_2Dtorus.jpg|thumbnail|right|]]

Similarly as the trick we did from linear array to ring topology, the 2-D torus takes the structure of the 2-D mesh and connects the nodes on the edges.
 
* Diameter sqrt(p)–1
* Bisection BW 2sqrt(p)
* # Links 2p
* Degree 4
 
With end-around connection, the longest distance has been cut. And the biseciton bandwidth also increased. Of course, the cost from 2-D mesh to 2-D torus almost increased twice.
 

<h2>Cube</h2>
[[Image:Top_cube.jpg|thumbnail|right|]]

If we add two more neighbor to each node, we can get a cube. The cube can be thought of as a three-dimensional mesh.
 
* ''Diameter:'' 3(p1/3-1) --this is the corner-to-corner distance, analogous to the 2-d mesh formula
* ''Bisection BW:'' p2/3 -- p1/3 rows of p1/3 links must be cut to bisect a cube
* ''# Links:'' 3*p2/3 -- there are p2/3 links in each of 3 dimensions.
* ''Degree:'' 6 (from the inside nodes)
 

<h2>Hypercube</h2>

[[Image:Top_hypercube.jpg|thumbnail|right|Hypercube]]

In the N-dimensional cube, the boundary nodes are normally the one who hurts the performance of entire network. Thus, we can fix it by connecting those broundary nodes together. The hypercube is essentially multiple cubes put together.
 
* ''Diameter:'' log2(p)
* ''Bisection BW:'' p/2 -- p/2 links run from one N-1 cube to the other.
* ''# Links:'' p/2 * log2(p) -- each node has a degree of log2(p). Multiply by p nodes and divide by 2 nodes per link.
* ''Degree:'' log2(p)
 
As the metrics show, the diameter and bisection bandwidth are significantly improved for the higher-order topologies. Each node is numbered with a bitstring that is log2(p) bits long. The farthest node from any given node is the one whose bitstring is its complement. One bit can be flipped per hop, so the diameter is log2(p).
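
The bit-flipping argument can be made concrete with a short sketch. The function name and the choice to fix the lowest differing bit first are assumptions made for illustration; any order of bit flips gives a path of the same length.

```python
def hypercube_route(src, dst, dims):
    """Return the node labels visited from src to dst in a dims-dimensional
    hypercube, flipping one differing bit per hop (lowest bit first)."""
    path = [src]
    cur = src
    for bit in range(dims):
        mask = 1 << bit
        if (cur ^ dst) & mask:  # this bit still differs from the destination
            cur ^= mask         # one hop along this dimension
            path.append(cur)
    return path


if __name__ == "__main__":
    # Node 000 to its complement 111 in a 3-dimensional hypercube (p = 8).
    route = hypercube_route(0b000, 0b111, 3)
    print([format(n, "03b") for n in route])
```

The route from a node to its complement takes exactly log2(p) hops, which matches the diameter given above.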

<h2>Tree</h2>

[[Image:Top_tree.jpg|thumbnail|right|Tree]]

The tree is a hierarchical structure with nodes at the bottom and switching nodes at the upper levels.
 
* ''Diameter:'' 2log2(p) -- the path from a leaf through the root to the farthest leaf on the other side
* ''Bisection BW:'' 1 -- breaking either link to the root bisects the tree
* ''# Links:'' 2(p-1) -- there are 2 links below each router and there are p-1 routers if p is a power of 2.
* ''Degree:'' 3 -- interior routers have a degree of 3.
 
The tree experiences high traffic at the upper levels. Since almost half of the messages need to go through the root node, the root becomes the bottleneck of the tree topology. The other disadvantage of the tree topology is that the bisection bandwidth is only 1.

<h2>Fat Tree</h2>

[[Image:Top_fat_tree.jpg|thumbnail|right|Fat Tree]]
In order to improve the performance of the tree topology, the fat tree alleviates the traffic at upper levels by "fattening" up the links at the upper levels.
 
* ''Diameter:'' 2log2(p) -- same as a regular tree
* ''Bisection BW:'' p/2 -- all links to (one side of) the root must be cut to bisect the tree
* ''# Links:'' plog2(p) -- there are p links at each of log2(p) levels
* ''Degree:'' p -- the root node has p links through it.
 
The fat tree relieves the pressure on the root node, and the bisection bandwidth is also increased.

<h2>Butterfly</h2>

[[Image:Top_butterfly.jpg|thumbnail|right|Butterfly]]
 

The butterfly structure is similar to the tree structure, but it replicates the switching node structure of the tree topology and connects them together so that there are equal links and routers at all levels.
* ''Diameter:'' 2log2(p) -- same as a tree since butterfly has the same depth as a tree (just with p nodes at each level)
* ''Bisection BW:'' p -- p links connect the two halves at the top level
* ''# Links:'' 2p*log2(p) -- there are 2*p links at each level times log2(p) levels.
* ''Degree:'' 4 -- the routers in the middle levels all have 4 links. The leaves and routers at the top level each have 2 links.

 

Butterfly has similar performance to Hypercube. In terms of cost, butterfly has a smaller degree (so cheaper routers can be used) but hypercube has fewer links.
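
To put numbers on this trade-off, the sketch below evaluates the formulas listed in this section for a given p. The function names are hypothetical, and p is assumed to be a power of 2.

```python
def hypercube_metrics(p):
    """Metrics of a hypercube with p nodes, using the formulas above."""
    d = p.bit_length() - 1  # log2(p) for a power of 2
    return {"diameter": d, "bisection": p // 2, "links": p // 2 * d, "degree": d}


def butterfly_metrics(p):
    """Metrics of a butterfly with p nodes, using the formulas above."""
    d = p.bit_length() - 1  # log2(p) for a power of 2
    return {"diameter": 2 * d, "bisection": p, "links": 2 * p * d, "degree": 4}


if __name__ == "__main__":
    for p in (16, 64, 1024):
        print(p, hypercube_metrics(p), butterfly_metrics(p))
```

For p = 64, the hypercube needs routers of degree 6 and 192 links, while the butterfly keeps the degree at 4 but needs 768 links.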

<h1>Real-World Implementation of Network Topologies </h1>

In a research study, Andy Hospodor and Ethan Miller investigated several network topologies for a high-performance, high-traffic network<sup>2</sup>. The topologies included the fat tree, butterfly, mesh, torus, and hypercube structures. Advantages and disadvantages including cost, performance, and reliability were discussed.

In this experiment, a petabyte-scale network with over 100 GB/s total aggregate bandwidth was investigated. The network consisted of 4096 disks and large servers, with routers and switches in between<sup>2</sup>.

[[Image:Disknet_network.jpg|frame|center|''Basic structure of Hospodor and Miller's experimental network''2]]

The overall structure of the network is shown above. Note that this structure is very susceptible to failure and congestion.

<h2>Fat Tree</h2>

In large-scale, high-performance applications, the fat tree can be a choice. However, in order to "fatten" up the links, redundant connections must be used. Instead of using one link between switching nodes, several must be used. The problem is that more input and output links require routers with more input and output ports. Routers with in excess of 100 ports are difficult to build and expensive, so multiple routers would have to be stacked together. Even then, the routers would be expensive and several of them would be required<sup>2</sup>.
 
 
The Japan Agency for Marine-Earth Science and Technology supercomputing system uses the fat tree topology. The system connects 1280 NEC processors<sup>7</sup>.

 

<h2>Butterfly</h2>

In high-performance applications, the butterfly structure is a good choice. The butterfly topology uses fewer links than other topologies; however, each link carries traffic from the entire layer. Fault tolerance is poor because there is only a single path between any pair of nodes. Should a link break, data cannot be re-routed, and communication is broken<sup>2</sup>.

[[Image:Disknet_butterfly.jpg|frame|center|''Butterfly structure''2]]

<h2>Meshes and Tori</h2>

The mesh and torus structures used in this application would require a large number of links and a total of several thousand ports. However, since there are so many links, the mesh and torus structures provide alternate paths in case of failures<sup>2</sup>.

[[Image:Disknet_mesh.jpg|frame|center|''Mesh structure''2]]

[[Image:Disknet_torus.jpg|frame|center|''Torus structure''2]]

One example of the torus structure in current use is the QPACE SFB TR Cluster in Germany, which uses PowerXCell 8i processors. The system uses a 3-D torus topology with 4608 processors<sup>7</sup>.

<h2>Hypercube</h2>

Similar to the torus structures, the hypercube requires a larger number of links. However, its bandwidth scales better than that of the mesh and torus structures.

The CRAY T3E, CRAY XT3, and SGI Origin 2000 use k-ary n-cube topologies.

[[Image:Disknet_hypercube.jpg|frame|center|''Hypercube structure''2]]

It is worth mentioning that even though many topologies perform much better than the 2-D mesh, the cost of these advanced topologies is also high. Since most chips are laid out in 2-D space, it is very expensive to implement a high-dimensional topology on a 2-D chip. For the hypercube topology, increasing the number of nodes raises the degree of each node. For the butterfly topology, the degree grows relatively slowly, but the required number of links and switches increases rapidly.

<h1>Comparison of Network Topologies </h1>


The following figure shows the total number of ports required for each network topology.
 
 
[[Image:Disknet_ports.jpg]]
 
''Number of ports for each topology''2

 

As the figure above shows, the 6-D hypercube requires the largest number of ports, due to its relatively complex six-dimensional structure. In contrast, the fat tree requires the fewest ports, even though its links have been "fattened" up by using redundant links. The butterfly network requires more than twice as many ports as the fat tree, since it essentially replicates the switching layer of the fat tree. The number of ports for the mesh and torus structures increases as the dimensionality increases. However, with modern router technology, the number of ports is a less important consideration.

 
 
The average path length, or average number of hops, and the average link load (GB/s) are shown below.
 
 
[[Image:Disknet_load.jpg]]
 
''Average path length and link load for each topology''2

 

Looking at the trends, when the average path length is high, the average link load is also high; the two rise and fall together. It is clear from the graph that the 2-D mesh has, by far, the worst performance. In a network this large, the average path length is simply too high, and the average link load suffers, so the 2-D mesh does not scale well for this type of high-performance network. The 2-D torus cuts the average path length and average link load in half by connecting the edge nodes together, but its performance is still relatively poor compared to the other topologies. The butterfly and fat tree have the lowest average path length and average link load.

 
 
The figure below shows the cost of the network topologies.
 
 
[[Image:Disknet_cost.jpg]]
 
''Cost of each topology''2

 

Despite using the fewest ports, the fat tree topology has the highest cost by far, because its ports are high-bandwidth 10 GB/s ports. Over 2400 ports of 10 GB/s are required to provide enough bandwidth at the upper levels of the tree. This pushes the cost up dramatically and makes the fat tree impractical from a cost standpoint. While the total cost of the fat tree is about 15 million dollars, the rest of the network topologies are clustered below 4 million dollars. As the dimensionality of the mesh and torus structures increases, the cost increases. The cost of the butterfly network falls between that of the 2-D mesh/torus and the 6-D hypercube.

 
 
When both the cost and the average link load are factored in, the following graph is produced.
 
 
[[Image:Disknet_overall.jpg]]
 
''Overall cost of each topology''2

 

From the figure above, the 6-D hypercube is the most cost-effective choice for this particular network setup. Although the 6-D hypercube costs more because it needs more links and ports, it provides higher bandwidth, which can offset the higher cost. The high-dimensional tori also perform well, but cannot provide as much bandwidth as the 6-D hypercube. For systems that do not need as much bandwidth, a high-dimensional torus is also a good choice. The butterfly topology is also an alternative, but it has lower fault tolerance.

However, when the number of nodes increases, the relative cost of the higher-dimensional topologies increases far faster than their relative performance when compared to a 2-D mesh. This is because the 2-D mesh only uses low-cost, short links. The higher-dimensional structures must be projected onto our 3-dimensional world, and thus require many long, expensive links that wrap around the outside of the system like an impenetrable tangle of jungle vines. Maintaining such a network is also quite slow and tedious.


<h1>Packet Routing</h1>


The '''routing''' algorithm determines what path a packet of data will take from source to destination. Routing can be '''deterministic''', where the path is always the same for a given source and destination, or '''adaptive''', where the path can change. The routing algorithm can also be '''partially adaptive''', where packets have multiple path choices but not every packet is allowed to use a shortest path<sup>3</sup>.


<h2>Deadlock</h2>

Packets are in '''deadlock''' when they cannot continue to move through the nodes. The illustration below demonstrates this event.
 
 
[[Image:Routing_deadlock.jpg]]
 
''Example of deadlock''
 
 

Assume that all of the buffers are full at each node. The packet from Node 1 cannot continue to Node 2. The packet from Node 2 cannot continue to Node 3, and so on. Since no packet can move, the packets are deadlocked.

 
 

Deadlock arises from a cyclic pattern of routing. To avoid deadlock, avoid circular routing patterns.
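
One way to reason about circular routing patterns is to record which channel each channel may have to wait on and then look for a cycle. The sketch below is a generic depth-first cycle check, not code from the source; the channel names and the four-node ring are assumptions made for illustration.

```python
def has_cycle(dependencies):
    """Return True if the wait-for relation between channels contains a cycle.

    dependencies maps each channel to the channels it may have to wait on.
    """
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {}

    def visit(channel):
        color[channel] = GRAY
        for nxt in dependencies.get(channel, ()):
            state = color.get(nxt, WHITE)
            if state == GRAY:  # reached a channel already on this path
                return True
            if state == WHITE and visit(nxt):
                return True
        color[channel] = BLACK
        return False

    return any(color.get(ch, WHITE) == WHITE and visit(ch)
               for ch in dependencies)


if __name__ == "__main__":
    # Hypothetical ring of four nodes in which every channel waits on the next.
    ring = {"1->2": ["2->3"], "2->3": ["3->4"],
            "3->4": ["4->1"], "4->1": ["1->2"]}
    print(has_cycle(ring))  # the circular pattern is detected

    # Disallowing one dependency breaks the circle.
    restricted = dict(ring, **{"4->1": []})
    print(has_cycle(restricted))
```

Removing a single dependency from the ring is enough to break the cycle.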

 
 

To avoid circular patterns of routing, some routing patterns are disallowed. These are called '''turn restrictions''', where some turns are not allowed in order to avoid making a circular routing pattern. Some of these turn restrictions are mentioned below.


 
<h2>Dimensional ordered (X-Y) routing</h2>

Turns from the y-dimension to the x-dimension are not allowed.
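
A minimal sketch of dimension-ordered routing follows, assuming nodes are labelled with hypothetical (x, y) coordinates. The packet finishes all of its x-dimension hops before making any y-dimension hop, so a turn from y to x never occurs.

```python
def xy_route(src, dst):
    """Return the nodes visited from src to dst, resolving x fully before y."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:           # travel along the x-dimension first
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:           # then travel along the y-dimension
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


if __name__ == "__main__":
    print(xy_route((0, 0), (2, 1)))
```

Because the path depends only on the source and destination, this routing is deterministic.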


<h2>West First</h2>

Turns to the west are not allowed, so a packet that needs to travel west must make its westward hops first.


<h2>North Last</h2>

Turns are not allowed once a packet is travelling north, so any northward hops must be made last.


<h2>Negative First</h2>

Turns from a positive direction (+x or +y) to a negative direction (-x or -y) are not allowed, so a packet must make its hops in the negative directions first.


<h2>Odd-Even Turn Model</h2>


Unfortunately, the above turn-restriction models are only partially adaptive, which reduces their degree of adaptiveness. The models cause some packets to take different routes that are not necessarily minimal paths. This may cause unfairness and also reduces the ability of the system to relieve congestion. Overall performance could suffer<sup>3</sup>.


 


Ge-Ming Chiu introduces the Odd-Even turn model as an adaptive, deadlock-free turn-restriction model that has better performance than the previously mentioned models<sup>3</sup>. The model is designed primarily for 2-D meshes.

 

''Turns from the east to north direction from any node on an even column are not allowed.''
 
''Turns from the north to west direction from any node on an odd column are not allowed.''
 
 
''Turns from the east to south direction from any node on an even column are not allowed.''
 
''Turns from the south to west direction from any node on an odd column are not allowed.''
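
The four rules above can be written as a small lookup. This is a sketch under stated assumptions: the function and table names are hypothetical, headings are the letters N, E, S and W, a turn "from east to north" is read as a packet travelling east that turns to travel north, and columns are numbered from 0.

```python
# (current heading, new heading) -> parity of the columns where it is forbidden
FORBIDDEN_TURNS = {
    ("E", "N"): "even",
    ("E", "S"): "even",
    ("N", "W"): "odd",
    ("S", "W"): "odd",
}


def turn_allowed(heading, new_heading, column):
    """Return True if the odd-even rules permit this turn in this column."""
    parity = "even" if column % 2 == 0 else "odd"
    return FORBIDDEN_TURNS.get((heading, new_heading)) != parity


if __name__ == "__main__":
    print(turn_allowed("E", "N", 2))  # east-to-north in an even column
    print(turn_allowed("E", "N", 3))  # east-to-north in an odd column
    print(turn_allowed("N", "W", 3))  # north-to-west in an odd column
```

Every turn that is not listed in the table is allowed in every column.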
 
 
 
The illustration below shows allowed routing for different source and destination nodes. Depending on which column the packet is in, only certain directions are allowed.
 
 
[[Image:Routing_odd_even.jpg]]
 
''Odd-Even turn restriction model proposed by Ge-Ming Chiu''3
 

 

<h1>Comparison of Turn Restriction Models</h1>

To compare the performance of various turn restriction models, Chiu simulated a 15 x 15 mesh under various traffic patterns. All channels have a bandwidth of 20 flits/usec and a buffer size of one flit. The dimension-ordered x-y routing, west-first, and negative-first models were compared against the odd-even model.

 

Uniform, transpose, and hot spot traffic patterns were simulated. Under uniform traffic, a node sends messages to every other node with equal probability. Transpose simulates two opposite nodes sending messages to their respective halves of the mesh. Hot spot simulates a few "hot spot" nodes that receive high traffic.



 
 
[[Image:Routing_uniform.jpg]]
 
''Uniform traffic simulation of various turn restriction models''3
 
 
The performance of the different routing algorithms is shown above for the uniform traffic. For uniform traffic, the dimension-ordered x-y model outperforms the rest of the models. As the number of messages increases, the x-y model has the "slowest" increase in average communication latency.



 
 
[[Image:Routing_transpose.jpg]]
 
''First transpose traffic simulation of various turn restriction models''3
 
 
The performance of the different routing algorithms is shown above for the first transpose traffic. The negative-first model has the best performance, while the odd-even model performs better than the west-first and x-y models.



 
 
[[Image:Routing_transpose2.jpg]]
 
''Second transpose traffic simulation of various turn restriction models''3
 
 
With the second transpose simulation, the odd-even model outperforms the rest.



 
 
[[Image:Routing_hotspot.jpg]]
 
''Hotspot traffic simulation of various turn restriction models''3
 
 
The performance of the different routing algorithms is shown above for the hotspot traffic. Only one hotspot was simulated for this test. The odd-even model outperforms the other models when hotspot traffic is 10%.



 
 
[[Image:Routing_hotspot2.jpg]]
 
''Second hotspot traffic simulation of various turn restriction models''3
 
 
When the number of hotspots is increased to five, the odd-even model begins to shine. Its latency is the lowest for both 6 and 8 percent hotspot traffic. Meanwhile, the x-y model performs far worse than the others.


 
 

While the x-y model performs well in uniform traffic, it lacks adaptiveness. When traffic concentrates on hotspots, the x-y model suffers from its inability to adapt and re-route traffic around the congestion that the hotspots cause. The odd-even model has superior adaptiveness under high congestion.


 

<h1>Router Architecture</h1>

The '''router''' is a device that routes incoming data to its destination. It does this by having several input ports and several output ports. Data arriving on one of the input ports is routed to one of the output ports. Which output port is chosen depends on the destination of the data and on the routing algorithm.

 
 

The internal architecture of a router consists of input and output ports and a '''crossbar switch'''. The crossbar switch connects each input to the output that has been selected for it, acting essentially as a multiplexer.
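
As a rough illustration of the crossbar's role, the sketch below models one switching cycle. It is a simplification and not a description of any particular router: the function name is hypothetical, and the fixed-priority rule (the lowest-numbered input wins a contested output) is an assumption chosen only to keep the example short.

```python
def crossbar_cycle(requests):
    """Grant at most one input per output port for a single cycle.

    requests maps an input port number to the output port it wants.
    Returns the granted input -> output connections.
    """
    grants = {}
    taken = set()
    for inp in sorted(requests):  # assumed fixed priority: lowest input first
        out = requests[inp]
        if out not in taken:      # an output can serve only one input at a time
            taken.add(out)
            grants[inp] = out
    return grants


if __name__ == "__main__":
    # Inputs 0 and 1 both want output 2; input 2 wants output 0.
    print(crossbar_cycle({0: 2, 1: 2, 2: 0}))
```

An input that loses the contest, such as input 1 here, would have to wait in its buffer for a later cycle.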

 
 

Router technology has improved significantly over the years, which has made networks with high dimensionality feasible. As shown in the real-world example above, high-dimensional tori and hypercubes are excellent choices of topology for high-performance networks. The cost of high-performance, high-radix routers has contributed to the viability of these high-dimensionality networks. As the graph below shows, the bandwidth of routers has improved tremendously over a period of 10 years<sup>4</sup>.


 
[[Image:Router_bandwidth.jpg]]
 
''Bandwidth of various routers over 10 year period''4

 

Looking at the physical architecture and layout of routers, it is evident that the circuitry has become dramatically more dense and complex.


 
[[Image:Router_physical.jpg]]
 
''Router hardware over period of time''4
 

 
[[Image:Router_radix.jpg]]
 
''Radix and latency of routers over 10 year period''4
 
 

The '''radix''', or the number of ports of a router, has also increased. Current technology not only has a high radix, but also low latency compared to the last generation. As the radix increases, the latency remains steady.


 


As router technology improves, more complex, higher-dimensional topologies become possible.


 

<h1>Fault Tolerant Routing</h1>

Fault-tolerant routing means the successful routing of messages between any pair of non-faulty nodes in the presence of faulty components<sup>6</sup>. As the number of processors in a multiprocessor system and the data rates increase, reliable transmission of data in the event of a network fault becomes a great concern, which makes fault-tolerant routing algorithms important.


<h2>Fault Models</h2>

Faults in a network can be categorized in two types:
 
 
1. '''Transient Faults'''<sup>5</sup>: A transient fault is a temporary fault that occurs for a very short duration of time. It can be caused by a change in the output of a flip-flop that leads to the generation of an invalid header. These faults can be minimized using error control coding. They are generally evaluated in terms of Bit Error Rate.
 
 
2. '''Permanent Faults'''<sup>5</sup>: A permanent fault is a fault that does not go away and causes permanent damage to the network. It could be due to damaged wires and associated circuitry. These faults are generally evaluated in terms of Mean Time Between Failures.


<h2>Fault Tolerance Mechanisms (for permanent faults)</h2>

Permanent faults can be handled using one of two mechanisms:
 
 
1. '''Static Mechanism''': In the static fault tolerance model, once a fault is detected, all the processes running in the system are stopped and the routing tables are emptied. The routing tables are then recalculated from the fault information to provide fault-free paths.
 
 
2. '''Dynamic Mechanisms''': In the dynamic fault tolerance model, the processes in the network are not completely stalled; only the affected regions are repaired. Some of the methods to do this are:
 
 
a. '''Block Faults''': In this method, many of the healthy nodes in the vicinity of the faulty nodes are also marked as faulty so that no routes are created close to the actual faulty nodes. The shape of the region can be convex or non-convex, and it is chosen so that none of the new routes introduces a cyclic dependency in the channel dependency graph (CDG).
 
 
DISADVANTAGE: This method causes many healthy nodes to be declared faulty, which reduces system capacity.
 
 
[[Image:Fault_pic1.jpg]]
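
The capacity cost of block faults can be illustrated with a small sketch. It assumes the simplest convex region, the bounding rectangle of the faulty nodes on a 2-D mesh with (row, column) labels; that choice and the function names are assumptions made for illustration, not the specific method of the cited work.

```python
def block_fault_region(faulty):
    """Return every node inside the bounding rectangle of the faulty nodes."""
    rows = [r for r, _ in faulty]
    cols = [c for _, c in faulty]
    return {(r, c)
            for r in range(min(rows), max(rows) + 1)
            for c in range(min(cols), max(cols) + 1)}


def sacrificed_nodes(faulty):
    """Healthy nodes that are marked faulty only because of the block."""
    return block_fault_region(faulty) - set(faulty)


if __name__ == "__main__":
    faulty = {(1, 1), (2, 3)}
    print(sorted(block_fault_region(faulty)))
    print(sorted(sacrificed_nodes(faulty)))
```

With just two faulty nodes at (1, 1) and (2, 3), the rectangle covers six nodes, so four healthy nodes are taken out of service.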
 
 
b. '''Fault Rings''': This method was introduced by Chalasani and Boppana. A fault-tolerant ring is a set of nodes and links that are adjacent to faulty nodes/links. This approach reduces the number of healthy nodes that must be marked as faulty and blocked.
 
 
[[Image:Fault_pic2.jpg]]
 


 

<h1>References</h1>


1 Solihin text
 
2 [http://www.ssrc.ucsc.edu/Papers/hospodor-mss04.pdf Interconnection Architectures for Petabyte-Scale High-Performance Storage Systems]
 
3 [http://www.diit.unict.it/~vcatania/COURSES/semm_05-06/DOWNLOAD/noc_routing02.pdf The Odd-Even Turn Model for Adaptive Routing]
 
4 [http://www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf Interconnection Topologies:(Historical Trends and Comparisons)]
This link is down at this time. [http://webcache.googleusercontent.com/search?q=cache:2f2KNFWJCsQJ:www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf+history+of+interconnection+topologies&cd=9&hl=en&ct=clnk&gl=us&client=ubuntu&source=www.google.com Google Cache]
 
5 [http://dspace.upv.es/xmlui/bitstream/handle/10251/2603/tesisUPV2824.pdf?sequence=1 Efficient mechanisms to provide fault tolerance in interconnection networks for PC clusters, José Miguel Montañana Aliaga.]
 
6 [http://web.ebscohost.com.www.lib.ncsu.edu:2048/ehost/pdfviewer/pdfviewer?vid=2&hid=15&sid=72e3828d-3cb1-42b9-8198-5c1e974ea53f@sessionmgr4 Adaptive Fault Tolerant Routing Algorithm for Tree-Hypercube Multicomputer, Qatawneh Mohammad]
 
7 [http://www.top500.org TOP500 Supercomputing Sites]
 
8 [http://www.redbooks.ibm.com/abstracts/sg245161.html?Open Understanding and Using the SP Switch]
 
9 [http://www.myri.com/myrinet/overview/ Myrinet Overview]
 
10 [http://en.wikipedia.org/wiki/QsNet QsNet (Quadrics' network)]
 
11 [http://www.google.com/research/pubs/pub35155.html Dragonfly Topology]
 
12 [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.95.573&rep=rep1&type=pdf Flattened Butterfly Topology]

CSC/ECE 506 Spring 2011/ch12 aj

2011-04-26T04:10:55Z

Asbransc:

<h1>Interconnection Network Architecture </h1>


In a multi-processor system, processors need to communicate with each other and access each other's resources. In order to route data and messages between processors, an interconnection architecture is needed.


 


Typically, in a multiprocessor system, messages passed between processors are frequent and short<sup>1</sup>. Therefore, the interconnection network architecture must handle messages quickly by having '''low latency''', and must handle several messages at a time by having '''high bandwidth'''.


 


In a network, a processor along with its cache and memory is considered a '''node'''. The physical wires that connect nodes are called '''links'''. The device that routes messages between nodes is called a router. The shape of the network, such as the number of links and routers, is called the network '''topology'''.


<h1>History of Network Topologies</h1>

Hypercube topologies were invented in the 80s and had desirable characteristics when the number of nodes is small (~1000 maximum, often <100) and every processor must stop working to receive and forward the message 4. The low-radix era began in 1985 and was defined by routers with between 4 and 8 ports using toroidal, mesh or fat-tree topologies and wormhole routing. This era lasted about 20 years until it was determined that routers with dozens of ports offered superior performance. Two topologies were developed to take advantage of the newly developed high-radix routers. These are flattened butterfly and dragonfly, which are somewhere between a mesh with each point on the mesh being a router (or virtual router in the case of dragonfly) with dozens or hundreds of nodes attached and a fat tree with sufficiently high arity as to only have two levels.

<h2>Interconnection evolution in the Top500 List</h2>
[[Image:Top500interconnect.png|thumbnail|300px|left|]]

This chart shows how the different interconnect topologies have evolved over time, measured by their dominance in the Top500 list of supercomputers<sup>7</sup>. As one can see, many technologies came into vogue briefly before losing performance share and disappearing. In the early days of the list, most of the computers report the interconnect type as not applicable. However, the trailing end of the hypercube phase is clear in burnt orange. The dark blue at the top is "other" and the dark red in the middle is "proprietary", so we can only speculate about what topologies they might employ. The toroidal mesh appears briefly at the start in a cream color and slightly outlasts the hypercube. The two crossbar technologies (blue and olive) followed the toroidal mesh. The fully-distributed crossbar died out quickly, while the multi-stage crossbar lasted longer but was never dominant. The 3-D torus (purple) dominates much of the 90s, with hypercube topologies (dark pink) enjoying a short comeback in the later part of the decade. SP Switch (light olive), an IBM interconnect technology that uses a multi-stage crossbar switch, replaced the 3-D torus<sup>8</sup>. Myrinet, Quadrics, and Federation shared the spotlight in the mid 00s; each used a similar fat-tree topology<sup>9,10</sup>. The current class of supercomputers is dominated by nodes connected with either Infiniband or gigabit ethernet. Both can be connected in either a fat-tree or 2-D mesh topology. The primary difference between them is speed. Infiniband is considerably faster per link and allows links to be ganged into groups of 4 or 12. Gigabit ethernet is vastly less expensive, however, and some supercomputer designers have apparently chosen to save money on the interconnect technology in order to allow the use of faster nodes<sup>11,12</sup>.

 
 
 
 

<h1>Types of Network Topologies </h1>

Several metrics are normally chosen to represent the cost and performance of a topology. In this section, the degree, number of links, diameter, and bisection bandwidth are calculated for each topology.

<h2>Linear Array</h2>

[[Image:Top_linear.jpg|thumbnail|frame|right|]]

The nodes are connected linearly as in an array. This type of topology is simple, but it does not scale well.
 
* ''Diameter:'' p-1
* ''Bisection BW:'' 1
* ''# Links:'' p-1
* ''Degree:'' 2
 
A linear array is the cheapest way to connect a group of nodes together. Its number of links and its degree are the smallest of any topology. However, the drawback of this topology is also obvious: the two end points are the farthest apart, which makes the diameter p-1. This topology is also not reliable, since the bisection bandwidth is 1.

<h2>Ring</h2>
[[Image:Top_ring.jpg|thumbnail|right|]]
The ring has a similar structure to the linear array, except that the two end nodes connect to each other, establishing a circular structure.
 
* ''Diameter:'' p/2
* ''Bisection BW:'' 2
* ''# Links:'' p
* ''Degree:'' 2
 
Compared with the cheapest topology, the linear array, the ring gets a relatively big improvement for the least effort (adding only one link). The longest distance between two nodes is cut in half, and the bisection bandwidth increases to 2.
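
The effect of the one extra link can be seen by comparing hop counts. The sketch below uses hypothetical function names and assumes nodes are numbered 0 to p-1 along the array or around the ring.

```python
def linear_distance(a, b):
    """Hop count between nodes a and b in a linear array."""
    return abs(a - b)


def ring_distance(a, b, p):
    """Hop count between nodes a and b in a ring of p nodes."""
    d = abs(a - b)
    return min(d, p - d)  # go whichever way around the ring is shorter


if __name__ == "__main__":
    p = 8
    print(max(linear_distance(0, n) for n in range(p)))   # diameter p-1
    print(max(ring_distance(0, n, p) for n in range(p)))  # diameter p/2
```

For p = 8 the longest path drops from 7 hops in the linear array to 4 hops in the ring.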

<h2>2-D Mesh</h2>

[[Image:Top_2Dmesh.jpg|thumbnail|right|]]

The 2-D mesh can be thought of as several linear arrays put together to form a 2-dimensional structure. This topology is well suited to applications that work on a grid, such as the ocean application and matrix calculation.
 
* ''Diameter:'' 2(sqrt(p)-1)
* ''Bisection BW:'' sqrt(p)
* ''# Links:'' 2sqrt(p)(sqrt(p)-1)
* ''Degree:'' 4
 
Nodes that are not on the edge have a '''degree''' of 4. To calculate the number of links, add the number of vertical links, sqrt(p)(sqrt(p)-1), to the number of horizontal links, also sqrt(p)(sqrt(p)-1), to get 2sqrt(p)(sqrt(p)-1). The diameter is the distance between two diagonally opposite corner nodes, which is the sum of 2 edges of length sqrt(p)-1.
 

<h2>2-D Torus</h2>

[[Image:Top_2Dtorus.jpg|thumbnail|right|]]

Just as the ring is formed by joining the two ends of a linear array, the 2-D torus takes the structure of the 2-D mesh and connects the nodes on opposite edges.
 
* ''Diameter:'' 2*floor(sqrt(p)/2) -- at most floor(sqrt(p)/2) hops around the ring in each of the 2 dimensions
* ''Bisection BW:'' 2sqrt(p)
* ''# Links:'' 2p
* ''Degree:'' 4
 
With the end-around connections, the longest distance is roughly halved and the bisection bandwidth doubles compared with the 2-D mesh. The cost is 2sqrt(p) additional wraparound links, for a total of 2p links instead of 2sqrt(p)(sqrt(p)-1).
 

<h2>Cube</h2>
[[Image:Top_cube.jpg|thumbnail|right|]]

If we add two more neighbors to each node, we get a cube. The cube can be thought of as a three-dimensional mesh.
 
* ''Diameter:'' 3(p^(1/3)-1) -- this is the corner-to-corner distance, analogous to the 2-D mesh formula
* ''Bisection BW:'' p^(2/3) -- p^(1/3) rows of p^(1/3) links must be cut to bisect a cube
* ''# Links:'' 3p^(2/3)(p^(1/3)-1) -- there are p^(2/3)(p^(1/3)-1) links in each of 3 dimensions
* ''Degree:'' 6 (for the interior nodes)
 

<h2>Hypercube</h2>

[[Image:Top_hypercube.jpg|thumbnail|right|Hypercube]]

In an N-dimensional mesh, the boundary nodes are usually the ones that hurt the performance of the entire network, so we can improve it by connecting those boundary nodes together. The hypercube is essentially multiple cubes put together.
 
* ''Diameter:'' log2(p)
* ''Bisection BW:'' p/2 -- p/2 links run from one N-1 cube to the other.
* ''# Links:'' p/2 * log2(p) -- each node has a degree of log2(p). Multiply by p nodes and divide by 2 nodes per link.
* ''Degree:'' log2(p)
 
As the metrics show, the diameter and bisection bandwidth are significantly improved for the higher-order topologies. Each node is numbered with a bitstring that is log2(p) bits long. The farthest node from any given node is the one whose bitstring is its complement. One bit can be flipped per hop, so the diameter is log2(p).

<h2>Tree</h2>

[[Image:Top_tree.jpg|thumbnail|right|Tree]]

The tree is a hierarchical structure with nodes at the bottom and switching nodes at the upper levels.
 
* ''Diameter:'' 2log2(p) -- the path from a leaf through the root to the farthest leaf on the other side
* ''Bisection BW:'' 1 -- breaking either link to the root bisects the tree
* ''# Links:'' 2(p-1) -- there are 2 links below each router and there are p-1 routers if p is a power of 2.
* ''Degree:'' 3 -- interior routers have a degree of 3.
 
The tree experiences high traffic at the upper levels. Since almost half of the messages need to go through the root node, the root becomes the bottleneck of the tree topology. The other disadvantage of the tree topology is that the bisection bandwidth is only 1.

<h2>Fat Tree</h2>

[[Image:Top_fat_tree.jpg|thumbnail|right|Fat Tree]]
In order to improve the performance of the tree topology, the fat tree alleviates the traffic at upper levels by "fattening" up the links at the upper levels.
 
* ''Diameter:'' 2log2(p) -- same as a regular tree
* ''Bisection BW:'' p/2 -- all links to (one side of) the root must be cut to bisect the tree
* ''# Links:'' plog2(p) -- there are p links at each of log2(p) levels
* ''Degree:'' p -- the root node has p links through it.
 
The fat tree relieves the pressure on the root node, and the bisection bandwidth is also increased.

<h2>Butterfly</h2>

[[Image:Top_butterfly.jpg|thumbnail|right|Butterfly]]
 

The butterfly structure is similar to the tree structure, but it replicates the switching node structure of the tree topology and connects them together so that there are equal links and routers at all levels.
* ''Diameter:'' 2log2(p) -- same as a tree since butterfly has the same depth as a tree (just with p nodes at each level)
* ''Bisection BW:'' p -- p links connect the two halves at the top level
* ''# Links:'' 2p*log2(p) -- there are 2*p links at each level times log2(p) levels.
* ''Degree:'' 4 -- the routers in the middle levels all have 4 links. The leaves and routers at the top level each have 2 links.

 

Butterfly has similar performance to Hypercube. In terms of cost, butterfly has a smaller degree (so cheaper routers can be used) but hypercube has fewer links.

<h1>Real-World Implementation of Network Topologies </h1>

In a research study, Andy Hospodor and Ethan Miller investigated several network topologies for a high-performance, high-traffic network<sup>2</sup>. The topologies included the fat tree, butterfly, mesh, torus, and hypercube structures. Advantages and disadvantages including cost, performance, and reliability were discussed.

In this experiment, a petabyte-scale network with over 100 GB/s total aggregate bandwidth was investigated. The network consisted of 4096 disks and large servers, with routers and switches in between<sup>2</sup>.

[[Image:Disknet_network.jpg|frame|center|''Basic structure of Hospodor and Miller's experimental network''2]]

The overall structure of the network is shown above. Note that this structure is very susceptible to failure and congestion.

<h2>Fat Tree</h2>

In large-scale, high-performance applications, the fat tree can be a choice. However, in order to "fatten" up the links, redundant connections must be used: instead of one link between switching nodes, several must be used. The problem is that more input and output links require routers with more input and output ports. Routers with in excess of 100 ports are difficult to build and expensive, so multiple routers would have to be stacked together. Even then, several routers would be required, and they would be expensive2.
 
 
The Japan Agency for Marine-Earth Science and Technology supercomputing system uses the fat tree topology. The system connects 1280 NEC processors7.

 

<h2>Butterfly</h2>

In high-performance applications, the butterfly structure is a good choice. The butterfly topology uses fewer links than other topologies; however, each link carries traffic from the entire layer. Fault tolerance is poor because only a single path exists between any pair of nodes. Should a link break, data cannot be re-routed, and communication is broken2.

[[Image:Disknet_butterfly.jpg|frame|center|''Butterfly structure''2]]

<h2>Meshes and Tori</h2>

The mesh and torus structures used in this application would require a large number of links and an aggregate of several thousand ports. However, because there are so many links, the mesh and torus structures provide alternate paths in case of failures2.

[[Image:Disknet_mesh.jpg|frame|center|''Mesh structure''2]]

[[Image:Disknet_torus.jpg|frame|center|''Torus structure''2]]

One example of current use of the torus structure is the QPACE SFB TR Cluster in Germany, which uses PowerXCell 8i processors. The system uses a 3-D torus topology with 4608 processors7.

<h2>Hypercube</h2>

Similar to the tori, the hypercube requires a larger number of links. However, its bandwidth scales better than that of the mesh and torus structures.

The CRAY T3E, CRAY XT3, and SGI Origin 2000 use k-ary n-cube topologies.

[[Image:Disknet_hypercube.jpg|frame|center|''Hypercube structure''2]]

It is worth mentioning that even though many topologies perform much better than the 2-D mesh, the cost of these advanced topologies is also high. Since most chips are laid out in 2-D space, it is very expensive to implement a high-dimensional topology on a 2-D chip. For the hypercube topology, increasing the number of nodes increases the degree of every node. For the butterfly topology, the degree increases relatively slowly, but the required number of links and switches increases rapidly.

<h1>Comparison of Network Topologies </h1>


The following figure shows the total number of ports required for each network topology.
 
 
[[Image:Disknet_ports.jpg]]
 
''Number of ports for each topology''2

 

As the figure above shows, the 6-D hypercube requires the largest number of ports, due to its relatively complex six-dimensional structure. In contrast, the fat tree requires the fewest ports, even though its links have been "fattened" up by using redundant links. The butterfly network requires more than twice as many ports as the fat tree, since it essentially replicates the switching layer of the fat tree. The number of ports for the mesh and torus structures increases as the dimensionality increases. However, with modern router technology, the number of ports is a less important consideration.

 
 
The average path length, or average number of hops, and the average link load (GB/s) are shown below.
 
 
[[Image:Disknet_load.jpg]]
 
''Average path length and link load for each topology''2

 

Looking at the trends, when the average path length is high, the average link load is also high. In other words, the average link load increases with the average path length. It is obvious from the graph that the 2-D mesh has, by far, the worst performance. In a large network such as this, the average path length is just too high, and the average link load suffers. For this type of high-performance network, the 2-D mesh does not scale well. The 2-D torus cuts the average path length and average link load in half by connecting the edge nodes together; however, its performance is still relatively poor compared to the other types. The butterfly and fat tree have the lowest average path length and average link load.

 
 
The figure below shows the cost of the network topologies.
 
 
[[Image:Disknet_cost.jpg]]
 
''Cost of each topology''2

 

Despite using the fewest ports, the fat tree topology has the highest cost, by far. The ports it uses are high-bandwidth ports of 10 GB/s, and over 2400 of them are required to have enough bandwidth at the upper levels of the tree. This pushes the cost up dramatically and makes the fat tree impractical from a cost standpoint. While the total cost of the fat tree is about 15 million dollars, the rest of the network topologies are clustered below 4 million dollars. As the dimensionality of the mesh and torus structures increases, the cost increases. The cost of the butterfly network falls between that of the 2-D mesh/torus and the 6-D hypercube.

 
 
When the cost and average link load are combined, the following graph is produced.
 
 
[[Image:Disknet_overall.jpg]]
 
''Overall cost of each topology''2

 

From the figure above, the 6-D hypercube is the most cost-effective choice for this particular network setup. Although the 6-D hypercube costs more because it needs more links and ports, it provides higher bandwidth, which can offset the higher cost. The high-dimensional tori also perform well, but cannot provide as much bandwidth as the 6-D hypercube. For systems that do not need as much bandwidth, a high-dimensional torus is also a good choice. The butterfly topology is also an alternative, but it has lower fault tolerance.

However, when the number of nodes increases, the relative cost of the higher-dimensional topologies increases far faster than their relative performance when compared to a 2-D mesh. This is because the 2-D mesh only uses low-cost, short links. The higher-dimensional structures must be projected onto our 3-dimensional world, and thus require many long, expensive links that wrap around the outside of the system like an impenetrable tangle of jungle vines. Maintaining such a network is also quite slow and tedious.


<h1>Packet Routing</h1>


The '''routing''' algorithm determines what path a packet of data will take from source to destination. Routing can be '''deterministic''', where the path is always the same for a given source and destination, or '''adaptive''', where the path can change. The routing algorithm can also be '''partially adaptive''', where packets have multiple choices of path but not all packets are allowed to use the shortest path3.


<h2>Deadlock</h2>

Packets are in '''deadlock''' when they cannot continue to move through the nodes. The illustration below demonstrates this event.
 
 
[[Image:Routing_deadlock.jpg]]
 
''Example of deadlock''
 
 

Assume that all of the buffers are full at each node. The packet from Node 1 cannot continue to Node 2, the packet from Node 2 cannot continue to Node 3, and so on. Since none of the packets can move, they are deadlocked.

 
 

Deadlock arises from a cyclic pattern of routing. To avoid deadlock, circular routing patterns must be avoided.
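
One way to picture this is as a "waits-for" graph in which an edge from node A to node B means that the packet held at A is waiting for buffer space at B. The sketch below is only an illustration of that idea: the function name `has_cycle` and the four-node example are assumptions, not taken from the cited sources. It reports a potential deadlock when the graph contains a cycle.

```python
def has_cycle(waits_for):
    """Return True if the directed graph {node: [nodes it waits on]} has a cycle."""
    WHITE, GRAY, BLACK = 0, 1, 2      # unvisited, on the current DFS path, finished
    color = {n: WHITE for n in waits_for}

    def visit(n):
        color[n] = GRAY
        for m in waits_for.get(n, []):
            state = color.get(m, WHITE)
            if state == GRAY:         # reached a node already on the current path
                return True
            if state == WHITE and visit(m):
                return True
        color[n] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in list(waits_for))

# Hypothetical four-node example: each node waits on the next one around a ring.
circular = {1: [2], 2: [3], 3: [4], 4: [1]}
# Removing one dependency breaks the cycle.
broken = {1: [2], 2: [3], 3: [4], 4: []}
print(has_cycle(circular), has_cycle(broken))
```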

 
 

To avoid circular patterns of routing, some routing patterns are disallowed. These are called '''turn restrictions''', where some turns are not allowed in order to avoid making a circular routing pattern. Some of these turn restrictions are mentioned below.


 
<h2>Dimensional ordered (X-Y) routing</h2>

Turns from the y-dimension to the x-dimension are not allowed.
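
A minimal sketch of this rule on a 2-D mesh, assuming nodes are addressed by (x, y) coordinates (the function name `xy_route` is illustrative): the packet first corrects its x coordinate completely, then its y coordinate, so it never turns from the y-dimension back to the x-dimension.

```python
def xy_route(src, dst):
    """Return the list of (x, y) nodes visited from src to dst under X-Y routing."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:                 # travel along the x-dimension first
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:                 # then travel along the y-dimension
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path

print(xy_route((0, 0), (2, 1)))
```

Because the path depends only on the source and destination, this is a deterministic routing algorithm.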


<h2>West First</h2>

Turns to the west are not allowed, so a packet that needs to travel west must do so first.


<h2>North Last</h2>

Turns after travelling in the north direction are not allowed, so a packet travels north only as its last direction.


<h2>Negative First</h2>

Turns from a positive direction to a negative direction (-x or -y) are not allowed, so a packet must make all of its hops in the negative directions first.


<h2>Odd-Even Turn Model</h2>


Unfortunately, the above turn-restriction models reduce the degree of adaptiveness and are only partially adaptive. The models cause some packets to take different routes, and not necessarily the minimal paths. This may cause unfairness and reduces the ability of the system to relieve congestion. Overall performance could suffer3.


 


Ge-Ming Chiu introduces the Odd-Even turn model as an adaptive, deadlock-free turn-restriction model that has better performance than the previously mentioned models3. The model is designed primarily for 2-D meshes.

 

''Turns from the east to north direction from any node on an even column are not allowed.''
 
''Turns from the north to west direction from any node on an odd column are not allowed.''
 
 
''Turns from the east to south direction from any node on an even column are not allowed.''
 
''Turns from the south to west direction from any node on an odd column are not allowed.''
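
The four rules above can be written as a small lookup. This is a sketch under stated assumptions: headings are the single letters E, W, N, S for the direction of travel, columns are numbered from 0, and the names `FORBIDDEN` and `turn_allowed` are made up for this example.

```python
# (current heading, new heading) pairs forbidden by column parity.
FORBIDDEN = {
    0: {("E", "N"), ("E", "S")},   # even columns: no east-to-north or east-to-south turns
    1: {("N", "W"), ("S", "W")},   # odd columns: no north-to-west or south-to-west turns
}

def turn_allowed(heading, new_heading, column):
    """Return True if a packet travelling `heading` may turn to `new_heading` in `column`."""
    return (heading, new_heading) not in FORBIDDEN[column % 2]

print(turn_allowed("E", "N", 2))   # east-to-north turn in an even column
print(turn_allowed("E", "N", 3))   # the same turn in an odd column
```

A router could consult such a check when choosing among candidate output ports, discarding the ones that would require a forbidden turn.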
 
 
 
The illustration below shows allowed routing for different source and destination nodes. Depending on which column the packet is in, only certain directions are allowed.
 
 
[[Image:Routing_odd_even.jpg]]
 
''Odd-Even turn restriction model proposed by Ge-Ming Chiu''3
 

 

<h1>Comparison of Turn Restriction Models</h1>

To compare the performance of various turn restriction models, Chiu simulated a 15 x 15 mesh under various traffic patterns. All channels have a bandwidth of 20 flits/usec and a buffer size of one flit. The dimension-ordered x-y routing, west-first, and negative-first models were compared against the odd-even model.

 

Uniform, transpose, and hot spot traffic patterns were simulated. In uniform traffic, each node sends messages to every other node with equal probability. Transpose simulates two opposite nodes sending messages to their respective halves of the mesh. Hot spot simulates a few "hot spot" nodes that receive high traffic.
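
To make the uniform and hot spot patterns concrete, here is a rough sketch of how a simulator might pick a destination for each message. It is not Chiu's simulator: the function `pick_destination`, its parameters, and the way the hot spot fraction is applied are all assumptions made for illustration.

```python
import random

def pick_destination(src, nodes, hotspots=(), hotspot_fraction=0.0, rng=random):
    """Pick a destination other than src.

    With probability hotspot_fraction the destination is one of the hot spot
    nodes; otherwise it is chosen uniformly among all other nodes.
    """
    candidates = []
    if hotspots and rng.random() < hotspot_fraction:
        candidates = [h for h in hotspots if h != src]
    if not candidates:                       # uniform case, or src is the only hot spot
        candidates = [n for n in nodes if n != src]
    return rng.choice(candidates)

# 15 x 15 mesh as in the simulation described above; the seed is arbitrary.
nodes = [(x, y) for x in range(15) for y in range(15)]
rng = random.Random(0)
uniform_dst = pick_destination((0, 0), nodes, rng=rng)
hotspot_dst = pick_destination((0, 0), nodes, hotspots=[(7, 7)],
                               hotspot_fraction=0.10, rng=rng)
```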



 
 
[[Image:Routing_uniform.jpg]]
 
''Uniform traffic simulation of various turn restriction models''3
 
 
The performance of the different routing algorithms is shown above for the uniform traffic. For uniform traffic, the dimension-ordered x-y model outperforms the rest of the models. As the number of messages increases, the x-y model has the slowest increase in average communication latency.



 
 
[[Image:Routing_transpose.jpg]]
 
''First transpose traffic simulation of various turn restriction models''3
 
 
The performance of the different routing algorithms is shown above for the first transpose traffic. The negative-first model has the best performance, while the odd-even model performs better than the west-first and x-y models.



 
 
[[Image:Routing_transpose2.jpg]]
 
''Second transpose traffic simulation of various turn restriction models''3
 
 
With the second transpose simulation, the odd-even model outperforms the rest.



 
 
[[Image:Routing_hotspot.jpg]]
 
''Hotspot traffic simulation of various turn restriction models''3
 
 
The performance of the different routing algorithms is shown above for the hotspot traffic. Only one hotspot was simulated for this test. The odd-even model outperforms the other models when hotspot traffic is 10%.



 
 
[[Image:Routing_hotspot2.jpg]]
 
''Second hotspot traffic simulation of various turn restriction models''3
 
 
When the number of hotspots is increased to five, the performance of the odd-even model begins to shine. Its latency is the lowest for both 6 and 8 percent hotspot traffic. Meanwhile, the performance of the x-y model is far worse.


 
 

While the x-y model performs well in uniform traffic, it lacks adaptiveness. When traffic is concentrated on hotspots, the x-y model suffers from the inability to adapt and re-route traffic to avoid the congestion caused by the hotspots. The odd-even model has superior adaptiveness under high congestion.


 

<h1>Router Architecture</h1>

The '''router''' is a device that routes incoming data to its destination. It does this by having several input ports and several output ports. Data incoming from one of the input ports is routed to one of the output ports. Which output port is chosen depends on the destination of the data and on the routing algorithm.

 
 

The internal architecture of a router consists of input and output ports and a '''crossbar switch'''. The crossbar switch connects the inputs to the outputs and selects which output each input is routed to, acting essentially as a multiplexer.
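
As a toy model of the crossbar described above (not a description of any real router; the class `Crossbar` and its methods are hypothetical), each output port can be driven by at most one input port at a time, and a request for a busy output is refused:

```python
class Crossbar:
    """Toy crossbar: each output port is driven by at most one input port."""

    def __init__(self, n_in, n_out):
        self.n_in, self.n_out = n_in, n_out
        self.owner = {}                       # output port -> input port connected to it

    def connect(self, inp, out):
        """Try to connect input `inp` to output `out`; return False if `out` is busy."""
        if not (0 <= inp < self.n_in and 0 <= out < self.n_out):
            raise ValueError("port out of range")
        if out in self.owner and self.owner[out] != inp:
            return False                      # another input already holds this output
        for o, i in list(self.owner.items()):
            if i == inp:                      # in this model an input drives one output only
                del self.owner[o]
        self.owner[out] = inp
        return True

xb = Crossbar(4, 4)
print(xb.connect(0, 2))   # input 0 takes output 2
print(xb.connect(1, 2))   # output 2 is busy, so input 1 is refused
print(xb.connect(1, 3))   # input 1 takes output 3 instead
```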

 
 

Router technology has improved significantly over the years. This has allowed networks with high dimensionality to become feasible. As shown in the real-world example above, high-dimensional tori and hypercubes are excellent choices of topology for high-performance networks. The cost of high-performance, high-radix routers has contributed to the viability of these types of high-dimensionality networks. As the graph below shows, the bandwidth of routers has improved tremendously over a period of 10 years4.


 
[[Image:Router_bandwidth.jpg]]
 
''Bandwidth of various routers over 10 year period''4

 

Looking at the physical architecture and layout of routers, it is evident that the circuitry has become dramatically denser and more complex.


 
[[Image:Router_physical.jpg]]
 
''Router hardware over period of time''4
 

 
[[Image:Router_radix.jpg]]
 
''Radix and latency of routers over 10 year period''4
 
 

The '''radix''', or the number of ports, of routers has also increased. Current technology not only has a high radix, but also lower latency than the last generation. As the radix increases, the latency remains steady.


 


With high-performance routers, complex topologies are possible. As router technology improves, more complex, higher-dimensionality topologies become feasible.


 

<h1>Fault Tolerant Routing</h1>

Fault-tolerant routing means the successful routing of messages between any pair of non-faulty nodes in the presence of faulty components6. With the increasing number of processors in multiprocessor systems and high data rates, reliable transmission of data in the event of a network fault is of great concern, and hence fault-tolerant routing algorithms are important.


<h2>Fault Models</h2>

Faults in a network can be categorized in two types:
 
 
1. '''Transient Faults'''5: A transient fault is a temporary fault that occurs for a very short duration of time. Such a fault can be caused by a change in the output of a flip-flop, leading to the generation of an invalid header. These faults can be minimized using error control coding. They are generally evaluated in terms of Bit Error Rate.
 
 
2. '''Permanent Faults'''5: A permanent fault is a fault that does not go away and causes permanent damage to the network. Such a fault could be due to damaged wires and associated circuitry. These faults are generally evaluated in terms of Mean Time Between Failures.


<h2>Fault Tolerance Mechanisms (for permanent faults)</h2>

Permanent faults can be handled using one of two mechanisms:
 
 
1. '''Static Mechanism''': In the static fault tolerance model, once a fault is detected, all the processes running in the system are stopped and the routing tables are emptied. Based on the information about the faults, the routing tables are re-calculated to provide a fault-free path.
 
 
2. '''Dynamic Mechanisms''': In the dynamic fault tolerance model, the operation of the processes in the network is not completely stalled, and only the affected regions are repaired. Some of the methods to do this are:
 
 
a. '''Block Faults''': In this method, many of the healthy nodes in the vicinity of the faulty nodes are marked as faulty so that no routes are created close to the actual faulty nodes. The shape of the region can be convex or non-convex, and it is ensured that none of the new routes introduces a cyclic dependency in the channel dependency graph (CDG).
 
 
DISADVANTAGE: This method causes many healthy nodes to be declared faulty, reducing system capacity.
 
 
[[Image:Fault_pic1.jpg]]
 
 
b. '''Fault Rings''': This method was introduced by Chalasani and Boppana. A fault ring is a set of nodes and links that are adjacent to faulty nodes/links. This approach reduces the number of healthy nodes that must be marked as faulty and blocked.
 
 
[[Image:Fault_pic2.jpg]]
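
The cost of the block-fault approach can be illustrated with a small sketch. It assumes a 2-D mesh with (x, y) node coordinates and uses the simplest convex region, the bounding rectangle of the faulty nodes; the function name `block_fault_region` and the example coordinates are made up for illustration and are not taken from the cited work.

```python
def block_fault_region(faulty):
    """Return the rectangular block of nodes that encloses all faulty nodes."""
    xs = [x for x, _ in faulty]
    ys = [y for _, y in faulty]
    return {(x, y)
            for x in range(min(xs), max(xs) + 1)
            for y in range(min(ys), max(ys) + 1)}

faulty = {(2, 2), (4, 3)}                 # two hypothetical faulty nodes
region = block_fault_region(faulty)       # every node treated as faulty by routing
sacrificed = region - faulty              # healthy nodes lost to the block
print(len(region), len(sacrificed))
```

In this example two real faults disable a block of six nodes, which is the capacity loss that fault rings aim to reduce.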
 


 

<h1>References</h1>


1 Solihin text
 
2 [http://www.ssrc.ucsc.edu/Papers/hospodor-mss04.pdf Interconnection Architectures for Petabyte-Scale High-Performance Storage Systems]
 
3 [http://www.diit.unict.it/~vcatania/COURSES/semm_05-06/DOWNLOAD/noc_routing02.pdf The Odd-Even Turn Model for Adaptive Routing]
 
4 [http://www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf Interconnection Topologies:(Historical Trends and Comparisons)]
This link is down at this time. [http://webcache.googleusercontent.com/search?q=cache:2f2KNFWJCsQJ:www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf+history+of+interconnection+topologies&cd=9&hl=en&ct=clnk&gl=us&client=ubuntu&source=www.google.com Google Cache]
 
5 [http://dspace.upv.es/xmlui/bitstream/handle/10251/2603/tesisUPV2824.pdf?sequence=1 Efficient mechanisms to provide fault tolerance in interconnection networks for PC clusters, José Miguel Montañana Aliaga.]
 
6 [http://web.ebscohost.com.www.lib.ncsu.edu:2048/ehost/pdfviewer/pdfviewer?vid=2&hid=15&sid=72e3828d-3cb1-42b9-8198-5c1e974ea53f@sessionmgr4 Adaptive Fault Tolerant Routing Algorithm for Tree-Hypercube Multicomputer, Qatawneh Mohammad]
 
7 [http://www.top500.org TOP500 Supercomputing Sites]
 
8 [http://www.redbooks.ibm.com/abstracts/sg245161.html?Open Understanding and Using the SP Switch]
 
9 [http://www.myri.com/myrinet/overview/ Myrinet Overview]
 
10 [http://en.wikipedia.org/wiki/QsNet QsNet (Quadrics' network)]
 
11 [http://www.google.com/research/pubs/pub35155.html Dragonfly Topology]
 
12 [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.95.573&rep=rep1&type=pdf Flattened Butterfly Topology]

CSC/ECE 506 Spring 2011/ch12 aj

2011-04-26T04:09:28Z

Asbransc:

<h1>Interconnection Network Architecture </h1>


In a multi-processor system, processors need to communicate with each other and access each other's resources. In order to route data and messages between processors, an interconnection architecture is needed.


 


Typically, in a multiprocessor system, messages passed between processors are frequent and short1. Therefore, the interconnection network architecture must handle messages quickly by having '''low latency''', and must handle several messages at a time by having '''high bandwidth'''.


 


In a network, a processor along with its cache and memory is considered a '''node'''. The physical wire that connects two nodes is called a '''link'''. The device that routes messages between nodes is called a router. The shape of the network, such as the number of links and routers, is called the network '''topology'''.


<h1>History of Network Topologies</h1>

Hypercube topologies were invented in the 80s and had desirable characteristics when the number of nodes was small (~1000 maximum, often <100) and every processor had to stop working to receive and forward messages4. The low-radix era began in 1985 and was defined by routers with between 4 and 8 ports using toroidal, mesh or fat-tree topologies and wormhole routing. This era lasted about 20 years until it was determined that routers with dozens of ports offered superior performance. Two topologies were developed to take advantage of the newly developed high-radix routers. These are the flattened butterfly and the dragonfly, which fall somewhere between a mesh in which each point is a router (or virtual router in the case of dragonfly) with dozens or hundreds of nodes attached, and a fat tree with sufficiently high arity as to have only two levels.

<h2>Interconnection evolution in the Top500 List</h2>
[[Image:Top500interconnect.png|thumbnail|300px|left|]]

This chart shows the evolution over time of the different interconnect topologies by their dominance in the Top500 list of supercomputers7. As one can see, many technologies came into vogue briefly before losing performance share and disappearing. In the early days of the list, most of the computers report that the interconnect type is not applicable. However, the trailing end of the hypercube phase is clear in burnt orange. The dark blue at the top is "other" and the dark red in the middle is "proprietary", so we can only speculate about what topologies they might employ. The toroidal mesh appears briefly at the start in a cream color, and slightly outlasts the hypercube. The two crossbar technologies (blue and olive) followed the toroidal mesh. The fully-distributed crossbar died out quickly, while the multi-stage crossbar lasted longer but was never dominant. The 3-D torus (purple) dominates much of the 90s, with hypercube topologies (dark pink) enjoying a short comeback in the later part of the decade. SP Switch (light olive), an IBM interconnect technology that uses a multi-stage crossbar switch, replaced the 3-D torus8. Myrinet, Quadrics, and Federation shared the spotlight in the mid 00s; each used a similar fat-tree topology9,10. The current class of supercomputers is dominated by nodes connected with either Infiniband or gigabit ethernet. Both can be connected in either a fat-tree or 2-D mesh topology. The primary difference between them is speed. Infiniband is considerably faster per link and allows links to be ganged into groups of 4 or 12. Gigabit ethernet is vastly less expensive, however, and some supercomputer designers have apparently chosen to save money on the interconnect technology in order to allow the use of faster nodes11,12.

 
 
 
 

<h1>Types of Network Topologies </h1>

Several metrics are normally chosen to represent the cost and performance of a given topology. In this section, the degree, number of links, diameter, and bisection bandwidth are calculated for each topology.

<h2>Linear Array</h2>

[[Image:Top_linear.jpg|thumbnail|frame|right|]]

The nodes are connected linearly as in an array. This type of topology is simple; however, it does not scale well.
 
* ''Diameter:'' p-1
* ''Bisection BW:'' 1
* ''# Links:'' p-1
* ''Degree:'' 2
 
A linear array is the cheapest way to connect a group of nodes together. The number of links and the degree of the linear array are the smallest of any topology. However, the drawback of this topology is also obvious: the two end points are the farthest apart, which makes the diameter p-1. This topology is also not reliable, since the bisection bandwidth is 1.

<h2>Ring</h2>
[[Image:Top_ring.jpg|thumbnail|right|]]
The ring has a structure similar to the linear array, except that the end nodes connect to each other, establishing a circular structure.
 
* ''Diameter:'' p/2
* ''Bisection BW:'' 2
* ''# Links:'' p
* ''Degree:'' 2
 
Compared with the linear array, the ring topology gains a relatively big improvement for little effort (only one added link). The longest distance between two nodes is cut in half, and the bisection bandwidth is increased to 2.
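
As a sanity check on the diameters and link counts listed for the linear array and the ring, the sketch below builds both topologies as adjacency lists and measures them by breadth-first search. The helper names (`linear_array`, `ring`, `diameter`, `link_count`) are made up for this illustration.

```python
from collections import deque

def linear_array(p):
    """Adjacency list of a p-node linear array."""
    return {i: [j for j in (i - 1, i + 1) if 0 <= j < p] for i in range(p)}

def ring(p):
    """Adjacency list of a p-node ring (p >= 3)."""
    return {i: [(i - 1) % p, (i + 1) % p] for i in range(p)}

def diameter(adj):
    """Longest shortest-path distance between any two nodes, found by BFS."""
    best = 0
    for s in adj:
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        best = max(best, max(dist.values()))
    return best

def link_count(adj):
    """Each link appears in two adjacency lists, so halve the total."""
    return sum(len(neigh) for neigh in adj.values()) // 2

p = 8
print(diameter(linear_array(p)), link_count(linear_array(p)))   # p-1 and p-1
print(diameter(ring(p)), link_count(ring(p)))                   # p/2 and p
```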

<h2>2-D Mesh</h2>

[[Image:Top_2Dmesh.jpg|thumbnail|right|]]

The 2-D mesh can be thought of as several linear arrays put together to form a 2-dimensional structure. This topology is well suited to applications such as the ocean application and matrix calculation.
 
* ''Diameter:'' 2(sqrt(p)-1)
* ''Bisection BW:'' sqrt(p)
* ''# Links:'' 2sqrt(p)(sqrt(p)-1)
* ''Degree:'' 4
 
Nodes that are not on the edge have a '''degree''' of 4. To calculate the number of links, add the number of vertical links, sqrt(p)(sqrt(p)-1), to the number of horizontal links, also sqrt(p)(sqrt(p)-1), to get 2sqrt(p)(sqrt(p)-1). The diameter is calculated by the distance between two diagonal nodes which is the sum of 2 edges of length sqrt(p)-1.
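
The link-count argument above can be checked by brute force. The following sketch evaluates the formulas from this section and compares the link count with an explicit enumeration of horizontal and vertical links; the function names are illustrative, and p is assumed to be a perfect square.

```python
from math import isqrt

def mesh_metrics(p):
    """Metrics of a sqrt(p) x sqrt(p) mesh, using the formulas from this section."""
    k = isqrt(p)
    if k * k != p:
        raise ValueError("p must be a perfect square")
    return {"diameter": 2 * (k - 1),
            "bisection_bw": k,
            "links": 2 * k * (k - 1),
            "degree": 4}

def count_mesh_links(k):
    """Count links in a k x k mesh by enumerating them."""
    horizontal = sum(1 for x in range(k - 1) for y in range(k))   # (x,y)-(x+1,y)
    vertical = sum(1 for x in range(k) for y in range(k - 1))     # (x,y)-(x,y+1)
    return horizontal + vertical

print(mesh_metrics(16), count_mesh_links(4))
```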
 

<h2>2-D Torus</h2>

[[Image:Top_2Dtorus.jpg|thumbnail|right|]]

Just as the ring is formed from the linear array, the 2-D torus takes the structure of the 2-D mesh and connects the nodes on the edges.
 
* ''Diameter:'' sqrt(p)-1
* ''Bisection BW:'' 2sqrt(p)
* ''# Links:'' 2p
* ''Degree:'' 4
 
With the end-around connections, the longest distance is cut and the bisection bandwidth is increased. Of course, the cost also rises: the number of links grows from 2sqrt(p)(sqrt(p)-1) in the 2-D mesh to 2p in the 2-D torus.
 

<h2>Cube</h2>
[[Image:Top_cube.jpg|thumbnail|right|]]

If we add two more neighbors to each node, we get a cube. The cube can be thought of as a three-dimensional mesh.
 
* ''Diameter:'' 3(p^(1/3)-1) -- this is the corner-to-corner distance, analogous to the 2-D mesh formula
* ''Bisection BW:'' p^(2/3) -- p^(1/3) rows of p^(1/3) links must be cut to bisect a cube
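* ''# Links:'' 3p^(2/3)(p^(1/3)-1) -- there are p^(2/3)(p^(1/3)-1) links in each of the 3 dimensions, analogous to the 2-D mesh formula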
* ''Degree:'' 6 (from the inside nodes)
 

<h2>Hypercube</h2>

[[Image:Top_hypercube.jpg|thumbnail|right|Hypercube]]

In the N-dimensional cube, the boundary nodes are normally the ones that hurt the performance of the entire network. We can fix this by connecting those boundary nodes together. The hypercube is essentially multiple cubes put together.
 
* ''Diameter:'' log2(p)
* ''Bisection BW:'' p/2 -- p/2 links run from one N-1 cube to the other.
* ''# Links:'' p/2 * log2(p) -- each node has a degree of log2(p). Multiply by p nodes and divide by 2 nodes per link.
* ''Degree:'' log2(p)
 
From the metrics we can see that the diameter and bisection bandwidth are significantly improved for the higher-order topologies. Each node is numbered with a bitstring that is log2(p) bits long. The farthest node from a given node is the one numbered with the complement of its bitstring. One bit can be flipped per hop, so the diameter is log2(p).
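
The bit-flipping argument can be turned into a short routing sketch. It assumes nodes are numbered with integers whose binary representation is the bitstring described above, and it flips the differing bits from least significant to most significant; that order, and the name `hypercube_route`, are arbitrary choices for this example.

```python
def hypercube_route(src, dst):
    """Return the node numbers visited from src to dst, flipping one bit per hop."""
    path = [src]
    cur = src
    diff = src ^ dst          # bits that differ between source and destination
    bit = 0
    while diff:
        if diff & 1:
            cur ^= 1 << bit   # one hop along dimension `bit`
            path.append(cur)
        diff >>= 1
        bit += 1
    return path

# In an 8-node (3-D) hypercube, node 000 and its complement 111 are 3 hops apart.
route = hypercube_route(0b000, 0b111)
print([format(n, "03b") for n in route], len(route) - 1)
```

The number of hops equals the number of differing bits, which is at most log2(p).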

<h2>Tree</h2>

[[Image:Top_tree.jpg|thumbnail|right|Tree]]

The tree is a hierarchical structure with nodes at the bottom and switching nodes at the upper levels.
 
* ''Diameter:'' 2log2(p) -- the path from a leaf through the root to the farthest leaf on the other side
* ''Bisection BW:'' 1 -- breaking either link to the root bisects the tree
* ''# Links:'' 2(p-1) -- there are 2 links for each router and there are p-1 routers if p is a power of 2.
* ''Degree:'' 3 -- interior routers have a degree of 3.
 
The tree experiences high traffic at the upper levels. Since almost half of the messages need to go through the root node, the root becomes the bottleneck of the tree topology. Another disadvantage of the tree topology is that the bisection bandwidth is only 1.

<h2>Fat Tree</h2>

[[Image:Top_fat_tree.jpg|thumbnail|right|Fat Tree]]
In order to improve the performance of the tree topology, the fat tree alleviates the traffic at upper levels by "fattening" up the links at the upper levels.
 
* ''Diameter:'' 2log2(p) -- same as a regular tree
* ''Bisection BW:'' p/2 -- all links to (one side of) the root must be cut to bisect the tree
* ''# Links:'' plog2(p) -- there are p links at each of log2(p) levels
* ''Degree:'' p -- the root node has p links through it.
 
The fat tree relieves the pressure on the root node, and the bisection bandwidth is also increased.

<h2>Butterfly</h2>

[[Image:Top_butterfly.jpg|thumbnail|right|Butterfly]]
 

The butterfly structure is similar to the tree structure, but it replicates the switching node structure of the tree topology and connects the copies together so that there are equal numbers of links and routers at all levels.
* ''Diameter:'' 2log2(p) -- same as a tree since butterfly has the same depth as a tree (just with p nodes at each level)
* ''Bisection BW:'' p -- p links connect the two halves at the top level
* ''# Links:'' 2p*log2(p) -- there are 2*p links at each level times log2(p) levels.
* ''Degree:'' 4 -- the routers in the middle levels all have 4 links. The leaves and routers at the top level each have 2 links.

 

Butterfly has similar performance to Hypercube. In terms of cost, butterfly has a smaller degree (so cheaper routers can be used) but hypercube has fewer links.

<h1>Real-World Implementation of Network Topologies </h1>

In a research study, Andy Hospodor and Ethan Miller investigated several network topologies for a high-performance, high-traffic network2. The topologies included the fat tree, butterfly, mesh, tori, and hypercube structures. Advantages and disadvantages in terms of cost, performance, and reliability were discussed.

In this experiment, a petabyte-scale network with over 100 GB/s total aggregate bandwidth was investigated. The network consisted of 4096 disks and large servers, with routers and switches in between2.

[[Image:Disknet_network.jpg|frame|center|''Basic structure of Hospodor and Miller's experimental network''2]]

The overall structure of the network is shown above. Note that this structure is very susceptible to failure and congestion.

<h2>Fat Tree</h2>

In large-scale, high-performance applications, the fat tree can be a choice. However, in order to "fatten" up the links, redundant connections must be used: instead of one link between switching nodes, several must be used. The problem is that more input and output links require routers with more input and output ports. Routers with in excess of 100 ports are difficult to build and expensive, so multiple routers would have to be stacked together. Even then, several routers would be required, and they would be expensive2.
 
 
The Japan Agency for Marine-Earth Science and Technology supercomputing system uses the fat tree topology. The system connects 1280 NEC processors7.

 

<h2>Butterfly</h2>

In high-performance applications, the butterfly structure is a good choice. The butterfly topology uses fewer links than other topologies; however, each link carries traffic from the entire layer. Fault tolerance is poor because only a single path exists between any pair of nodes. Should a link break, data cannot be re-routed, and communication is broken2.

[[Image:Disknet_butterfly.jpg|frame|center|''Butterfly structure''2]]

<h2>Meshes and Tori</h2>

The mesh and torus structures used in this application would require a large number of links and an aggregate of several thousand ports. However, because there are so many links, the mesh and torus structures provide alternate paths in case of failures2.

[[Image:Disknet_mesh.jpg|frame|center|''Mesh structure''2]]

[[Image:Disknet_torus.jpg|frame|center|''Torus structure''2]]

One example of current use of the torus structure is the QPACE SFB TR Cluster in Germany, which uses PowerXCell 8i processors. The system uses a 3-D torus topology with 4608 processors7.

<h2>Hypercube</h2>

Similar to the tori, the hypercube requires a larger number of links. However, its bandwidth scales better than that of the mesh and torus structures.

The CRAY T3E, CRAY XT3, and SGI Origin 2000 use k-ary n-cube topologies.

[[Image:Disknet_hypercube.jpg|frame|center|''Hypercube structure''2]]

It is worth mentioning that even though many topologies perform much better than the 2-D mesh, the cost of these advanced topologies is also high. Since most chips are laid out in 2-D space, it is very expensive to implement a high-dimensional topology on a 2-D chip. For the hypercube topology, increasing the number of nodes increases the degree of every node. For the butterfly topology, the degree increases relatively slowly, but the required number of links and switches increases rapidly.

<h1>Comparison of Network Topologies </h1>


The following figure shows the total number of ports required for each network topology.
 
 
[[Image:Disknet_ports.jpg]]
 
''Number of ports for each topology''2

 

As the figure above shows, the 6-D hypercube requires the largest number of ports, due to its relatively complex six-dimensional structure. In contrast, the fat tree requires the fewest ports, even though its links have been "fattened" up by using redundant links. The butterfly network requires more than twice as many ports as the fat tree, since it essentially replicates the switching layer of the fat tree. The number of ports for the mesh and torus structures increases as the dimensionality increases. However, with modern router technology, the number of ports is a less important consideration.

 
 
The average path length (the average number of hops) and the average link load (GB/s) are shown below.
 
 
[[Image:Disknet_load.jpg]]
 
''Average path length and link load for each topology''2

 

Looking at the trends, when the average path length is high, the average link load is also high. In other words, average link load rises with average path length. The graph shows that the 2-D mesh has, by far, the worst performance. In a large network such as this, the average path length is simply too high, and the average link load suffers. For this type of high-performance network, the 2-D mesh does not scale well. The 2-D torus cuts the average path length and average link load in half by connecting the edge nodes together; however, its performance is still relatively poor compared to the other types. The butterfly and fat tree have the lowest average path length and average link load.

 
 
The figure below shows the cost of the network topologies.
 
 
[[Image:Disknet_cost.jpg]]
 
''Cost of each topology''2

 

Despite using the fewest ports, the fat tree topology has by far the highest cost. The ports it does use are high-bandwidth ports of 10 GB/s, and over 2400 of them are required to provide enough bandwidth at the upper levels of the tree. This pushes the cost up dramatically and makes the fat tree impractical from a cost standpoint. While the total cost of the fat tree is about 15 million dollars, the rest of the network topologies are clustered below 4 million dollars. As the dimensionality of the mesh and torus structures increases, the cost increases. The cost of the butterfly network falls between that of the 2-D mesh/torus and the 6-D hypercube.

 
 
When the cost and average link load are considered together, the following graph is produced.
 
 
[[Image:Disknet_overall.jpg]]
 
''Overall cost of each topology''2

 

From the figure above, the 6-D hypercube is the most cost-effective choice for this particular network setup. Although the 6-D hypercube costs more because it needs more links and ports, it provides higher bandwidth, which can offset the higher cost. The high-dimensional tori also perform well, but cannot provide as much bandwidth as the 6-D hypercube. For systems that do not need as much bandwidth, a high-dimensional torus is also a good choice. The butterfly topology is another alternative, but it has lower fault tolerance.

However, when the number of nodes increases, the relative cost of the higher-dimensional topologies increases far faster than their relative performance when compared to a 2-D mesh. This is because the 2-D mesh only uses low-cost, short links. The higher-dimensional structures must be projected onto our 3-dimensional world, and thus require many long, expensive links that wrap around the outside of the system like an impenetrable tangle of jungle vines. Maintaining such a network is also quite slow and tedious.


<h1>Packet Routing</h1>


The '''routing''' algorithm determines what path a packet of data will take from source to destination. Routing can be '''deterministic''', where the path is always the same for a given source and destination, or '''adaptive''', where the path can change. A routing algorithm can also be '''partially adaptive''', where packets have multiple choices of path but not every packet is allowed to use every shortest path3.


<h2>Deadlock</h2>

Packets are in '''deadlock''' when they cannot continue to move through the nodes. The illustration below demonstrates this event.
 
 
[[Image:Routing_deadlock.jpg]]
 
''Example of deadlock''
 
 

Assume that all of the buffers are full at each node. The packet from Node 1 cannot continue to Node 2. The packet from Node 2 cannot continue to Node 3, and so on. Since no packet can move, the packets are deadlocked.

 
 

Deadlock arises from a cyclic pattern of routing. To avoid deadlock, circular routing patterns must be avoided.
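
The cyclic condition can be checked mechanically. The following is a minimal sketch, assuming the blocked packets are described by a hypothetical dictionary <code>waits_for</code> that maps each node to the nodes whose buffers it is waiting on; the function name and data layout are illustrative and do not come from the cited papers. A depth-first search reports whether any chain of waiting nodes loops back on itself.

```python
def has_cycle(waits_for):
    """Return True if the directed graph {node: [nodes it waits on]} contains a cycle."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, finished
    color = {}

    def visit(node):
        color[node] = GRAY
        for nxt in waits_for.get(node, []):
            state = color.get(nxt, WHITE)
            if state == GRAY:
                return True  # reached a node already on the current path: a cycle
            if state == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color.get(n, WHITE) == WHITE and visit(n) for n in list(waits_for))


# Four nodes each waiting on the next one in a ring (the situation in the illustration).
ring = {1: [2], 2: [3], 3: [4], 4: [1]}
# The same nodes with the last dependency removed.
chain = {1: [2], 2: [3], 3: [4], 4: []}
ring_deadlocked = has_cycle(ring)
chain_deadlocked = has_cycle(chain)
```

In this sketch the ring of four waiting nodes is reported as cyclic, while the chain, where Node 4 can always drain, is not.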

 
 

To avoid circular patterns of routing, some routing patterns are disallowed. These are called '''turn restrictions''', where some turns are not allowed in order to avoid making a circular routing pattern. Some of these turn restrictions are mentioned below.


 
<h2>Dimensional ordered (X-Y) routing</h2>

Turns from the y-dimension to the x-dimension are not allowed.
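
Because a packet may never turn from the y-dimension back to the x-dimension, it must finish all of its x-direction hops before taking any y-direction hop. A minimal sketch of that behavior follows, assuming nodes are addressed by hypothetical integer <code>(x, y)</code> coordinates on a 2-D mesh; the function name is illustrative only.

```python
def xy_route(src, dst):
    """Return the list of (x, y) nodes visited from src to dst, x-dimension first."""
    x, y = src
    path = [(x, y)]
    # Travel along the x-dimension until the destination column is reached.
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    # Only then travel along the y-dimension; no y-to-x turn can ever occur.
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


route = xy_route((0, 0), (2, 1))
```

Since the path is fully determined by the source and destination, this is a deterministic routing algorithm.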


<h2>West First</h2>

Turns to the west are not allowed.


<h2>North Last</h2>

Turns after a north direction are not allowed.


<h2>Negative First</h2>

Turns in the negative direction (-x or -y) are not allowed, except on the first turn.


<h2>Odd-Even Turn Model</h2>


Unfortunately, the above turn-restriction models reduce the degree of adaptiveness and are only partially adaptive. The models cause some packets to take different routes, and not necessarily the minimal paths. This may cause unfairness and reduces the ability of the system to relieve congestion. Overall performance could suffer3.


 


Ge-Ming Chiu introduced the Odd-Even turn model as an adaptive, deadlock-free turn-restriction model that has better performance than the previously mentioned models3. The model is designed primarily for 2-D meshes.

 

''Turns from the east to north direction from any node on an even column are not allowed.''
 
''Turns from the north to west direction from any node on an odd column are not allowed.''
 
 
''Turns from the east to south direction from any node on an even column are not allowed.''
 
''Turns from the south to west direction from any node on an odd column are not allowed.''
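
The four rules above can be expressed as a small lookup. This is a minimal sketch, assuming columns are numbered from 0 and headings are written as the hypothetical one-letter codes "N", "S", "E", and "W"; it only encodes the rules as listed here, treats continuing straight as always permitted, and does not model 180-degree reversals.

```python
# Turns forbidden at nodes in even columns: east-to-north and east-to-south.
FORBIDDEN_EVEN = {("E", "N"), ("E", "S")}
# Turns forbidden at nodes in odd columns: north-to-west and south-to-west.
FORBIDDEN_ODD = {("N", "W"), ("S", "W")}


def turn_allowed(column, heading, new_heading):
    """Return True if a packet travelling in `heading` may switch to `new_heading`."""
    if heading == new_heading:
        return True  # going straight is not a turn
    turn = (heading, new_heading)
    if column % 2 == 0:
        return turn not in FORBIDDEN_EVEN
    return turn not in FORBIDDEN_ODD


east_to_north_even = turn_allowed(2, "E", "N")
east_to_north_odd = turn_allowed(3, "E", "N")
north_to_west_odd = turn_allowed(3, "N", "W")
north_to_west_even = turn_allowed(2, "N", "W")
```

Unlike the earlier models, no turn is banned everywhere: each restricted turn remains available in the columns of the other parity.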
 
 
 
The illustration below shows allowed routing for different source and destination nodes. Depending on which column the packet is in, only certain directions are allowed.
 
 
[[Image:Routing_odd_even.jpg]]
 
''Odd-Even turn restriction model proposed by Ge-Ming Chiu''3
 

 

<h1>Comparison of Turn Restriction Models</h1>

To compare the performance of various turn restriction models, Chiu simulated a 15 x 15 mesh under various traffic patterns. All channels have a bandwidth of 20 flits/usec and a buffer size of one flit. The dimension-ordered x-y routing, west-first, and negative-first models were compared against the odd-even model.

 

Uniform, transpose, and hot spot traffic patterns were simulated. Uniform simulates a node sending messages to any other node with equal probability. Transpose simulates two opposite nodes sending messages to their respective halves of the mesh. Hot spot simulates a few "hot spot" nodes that receive high traffic.



 
 
[[Image:Routing_uniform.jpg]]
 
''Uniform traffic simulation of various turn restriction models''3
 
 
The performance of the different routing algorithms is shown above for the uniform traffic. For uniform traffic, the dimension-ordered x-y model outperforms the rest of the models. As the number of messages increases, the x-y model has the slowest increase in average communication latency.



 
 
[[Image:Routing_transpose.jpg]]
 
''First transpose traffic simulation of various turn restriction models''3
 
 
The performance of the different routing algorithms is shown above for the first transpose traffic. The negative-first model has the best performance, while the odd-even model performs better than the west-first and x-y models.



 
 
[[Image:Routing_transpose2.jpg]]
 
''Second transpose traffic simulation of various turn restriction models''3
 
 
With the second transpose simulation, the odd-even model outperforms the rest.



 
 
[[Image:Routing_hotspot.jpg]]
 
''Hotspot traffic simulation of various turn restriction models''3
 
 
The performance of the different routing algorithms is shown above for the hotspot traffic. Only one hotspot was simulated for this test. The odd-even model outperforms the other models when hotspot traffic is 10%.



 
 
[[Image:Routing_hotspot2.jpg]]
 
''Second hotspot traffic simulation of various turn restriction models''3
 
 
When the number of hotspots is increased to five, the odd-even model begins to shine. Its latency is the lowest for both 6 and 8 percent hotspot traffic. Meanwhile, the x-y model performs very poorly.


 
 

While the x-y model performs well under uniform traffic, it lacks adaptiveness. When traffic concentrates on hotspots, the x-y model cannot adapt and re-route traffic around the resulting congestion. The odd-even model has superior adaptiveness under high congestion.


 

<h1>Router Architecture</h1>

The '''router''' is a device that routes incoming data to its destination. It does this by having several input ports and several output ports. Data arriving on one of the input ports is routed to one of the output ports. Which output port is chosen depends on the destination of the data and the routing algorithm.

 
 

The internal architecture of a router consists of input and output ports and a '''crossbar switch'''. The crossbar switch connects each input to the selected output, acting essentially as a multiplexer.

 
 

Router technology has improved significantly over the years. This has allowed networks with high dimensionality to become feasible. As shown in the real-world example above, high-dimensional tori and hypercubes are excellent choices of topology for high-performance networks. The cost of high-performance, high-radix routers has contributed to the viability of these types of high-dimensionality networks. As the graph below shows, the bandwidth of routers has improved tremendously over a period of 10 years4.


 
[[Image:Router_bandwidth.jpg]]
 
''Bandwidth of various routers over 10 year period''4

 

Looking at the physical architecture and layout of routers, it is evident that the circuitry has become dramatically denser and more complex.


 
[[Image:Router_physical.jpg]]
 
''Router hardware over period of time''4
 

 
[[Image:Router_radix.jpg]]
 
''Radix and latency of routers over 10 year period''4
 
 

The '''radix''', or number of ports, of routers has also increased. Current technology not only has a high radix, but also lower latency than the previous generation. As radix increases, the latency remains steady.


 


As router technology improves, more complex, high-dimensionality topologies become possible.


 

<h1>Fault Tolerant Routing</h1>

Fault-tolerant routing means the successful routing of messages between any pair of non-faulty nodes in the presence of faulty components6. With the increased number of processors in a multiprocessor system and high data rates, reliable transmission of data in the event of a network fault is of great concern, and hence fault-tolerant routing algorithms are important.


<h2>Fault Models</h2>

Faults in a network can be categorized in two types:
 
 
1. '''Transient Faults'''5: A transient fault is a temporary fault that occurs for a very short duration of time. It can be caused by a change in the output of a flip-flop, leading to the generation of an invalid header. These faults can be minimized using error control coding. They are generally evaluated in terms of Bit Error Rate.
 
 
2. '''Permanent Faults'''5: A permanent fault is a fault that does not go away and causes permanent damage to the network. It could be due to damaged wires and associated circuitry. These faults are generally evaluated in terms of Mean Time Between Failures.


<h2>Fault Tolerance Mechanisms (for permanent faults)</h2>

Permanent faults can be handled using one of two mechanisms:
 
 
1. '''Static Mechanism''': In the static fault tolerance model, once a fault is detected, all the processes running in the system are stopped and the routing tables are emptied. Based on the fault information, the routing tables are recalculated to provide a fault-free path.
 
 
2. '''Dynamic Mechanisms''': In the dynamic fault tolerance model, the processes in the network are not completely stalled, and only the affected regions are repaired. Some of the methods to do this are:
 
 
a. '''Block Faults''': In this method, many of the healthy nodes in the vicinity of the faulty nodes are also marked as faulty so that no routes are created close to the actual faulty nodes. The shape of the region can be convex or non-convex, and it is ensured that none of the new routes introduces a cyclic dependency in the channel dependency graph (CDG).
 
 
DISADVANTAGE: This method causes many healthy nodes to be declared faulty, reducing system capacity.
 
 
[[Image:Fault_pic1.jpg]]
 
 
b. '''Fault Rings''': This method was introduced by Chalasani and Boppana. A fault ring is a set of nodes and links that are adjacent to faulty nodes/links. This approach reduces the number of healthy nodes that must be marked as faulty and blocked.
 
 
[[Image:Fault_pic2.jpg]]
 


 

<h1>References</h1>


1 Solihin text
 
2 [http://www.ssrc.ucsc.edu/Papers/hospodor-mss04.pdf Interconnection Architectures for Petabyte-Scale High-Performance Storage Systems]
 
3 [http://www.diit.unict.it/~vcatania/COURSES/semm_05-06/DOWNLOAD/noc_routing02.pdf The Odd-Even Turn Model for Adaptive Routing]
 
4 [http://www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf Interconnection Topologies:(Historical Trends and Comparisons)]
This link is down at this time. [http://webcache.googleusercontent.com/search?q=cache:2f2KNFWJCsQJ:www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf+history+of+interconnection+topologies&cd=9&hl=en&ct=clnk&gl=us&client=ubuntu&source=www.google.com Google Cache]
 
5 [http://dspace.upv.es/xmlui/bitstream/handle/10251/2603/tesisUPV2824.pdf?sequence=1 Efficient mechanisms to provide fault tolerance in interconnection networks for PC clusters, José Miguel Montañana Aliaga.]
 
6 [http://web.ebscohost.com.www.lib.ncsu.edu:2048/ehost/pdfviewer/pdfviewer?vid=2&hid=15&sid=72e3828d-3cb1-42b9-8198-5c1e974ea53f@sessionmgr4 Adaptive Fault Tolerant Routing Algorithm for Tree-Hypercube Multicomputer, Qatawneh Mohammad]
 
7 [http://www.top500.org TOP500 Supercomputing Sites]
 
8 [http://www.redbooks.ibm.com/abstracts/sg245161.html?Open Understanding and Using the SP Switch]
 
9 [http://www.myri.com/myrinet/overview/ Myrinet Overview]
 
10 [http://en.wikipedia.org/wiki/QsNet QsNet (Quadrics' network)]
 
11 [http://www.google.com/research/pubs/pub35155.html Dragonfly Topology]
 
12 [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.95.573&rep=rep1&type=pdf Flattened Butterfly Topology]

CSC/ECE 506 Spring 2011/ch12 aj

2011-04-26T04:03:29Z

Asbransc:

<h1>Interconnection Network Architecture </h1>


In a multi-processor system, processors need to communicate with each other and access each other's resources. In order to route data and messages between processors, an interconnection architecture is needed.


 


Typically, in a multiprocessor system, messages passed between processors are frequent and short1. Therefore, the interconnection network architecture must handle messages quickly by having '''low latency''', and must handle several messages at a time and have '''high bandwidth'''.


 


In a network, a processor along with its cache and memory is considered a '''node'''. The physical wires that connect nodes are called '''links'''. The device that routes messages between nodes is called a router. The shape of the network, such as the number of links and routers, is called the network '''topology'''.


<h1>History of Network Topologies</h1>

Hypercube topologies were invented in the 80s and had desirable characteristics when the number of nodes is small (~1000 maximum, often <100) and every processor must stop working to receive and forward the message 4. The low-radix era began in 1985 and was defined by routers with between 4 and 8 ports using toroidal, mesh or fat-tree topologies and wormhole routing. This era lasted about 20 years until it was determined that routers with dozens of ports offered superior performance. Two topologies were developed to take advantage of the newly developed high-radix routers. These are flattened butterfly and dragonfly, which are somewhere between a mesh with each point on the mesh being a router (or virtual router in the case of dragonfly) with dozens or hundreds of nodes attached and a fat tree with sufficiently high arity as to only have two levels.

<h2>Interconnection evolution in the Top500 List</h2>
[[Image:Top500interconnect.png|thumbnail|300px|left|]]

This chart shows the evolution over time of the different interconnect topologies by their dominance in the top500 list of supercomputers 7. As one can see, many technologies came into vogue briefly before losing performance share and disappearing. In the early days of the list, most of the computers report the interconnect type as not applicable. However, the trailing end of the hypercube phase is clear in burnt orange. The dark blue at the top is "other" and the dark red in the middle is "proprietary", so we can only speculate about what topologies they might employ. The toroidal mesh appears briefly at the start in a cream color, and slightly outlasts the hypercube. The two crossbar technologies (blue and olive) followed the toroidal mesh. The fully-distributed crossbar died out quickly, while the multi-stage crossbar lasted longer but was never dominant. The 3-D torus (purple) dominates much of the 90s, with hypercube topologies (dark pink) enjoying a short comeback in the later part of the decade. SP Switch (light olive), an IBM interconnect technology that uses a multi-stage crossbar switch, replaced the 3-D torus8. Myrinet, Quadrics, and Federation all shared the spotlight in the mid 00s; each used a similar fat-tree topology9,10. The current class of supercomputers is dominated by nodes connected with either Infiniband or gigabit ethernet. Both can be connected in either a fat-tree or 2-D mesh topology. The primary difference between them is speed. Infiniband is considerably faster per link and allows links to be ganged into groups of 4 or 12. Gigabit ethernet is vastly less expensive, however, and some supercomputer designers have apparently chosen to save money on the interconnect technology in order to allow the use of faster nodes 11,12.

 
 
 
 

<h1>Types of Network Topologies </h1>

Several metrics are normally chosen to represent the cost and performance of a topology. In this section, the degree, number of links, diameter, and bisection width are calculated for each topology.

<h2>Linear Array</h2>

[[Image:Top_linear.jpg|thumbnail|frame|right|]]

The nodes are connected linearly as in an array. This type of topology is simple; however, it does not scale well.
 
* ''Diameter:'' p-1
* ''Bisection BW:'' 1
* ''# Links:'' p-1
* ''Degree:'' 2
 
A linear array is the cheapest way to connect a group of nodes together. The number of links and the degree of the linear array are the smallest of any topology. However, the drawback of this topology is also obvious: the two end points are the farthest apart, which makes the diameter p-1. This topology is also not reliable since the bisection bandwidth is 1.

<h2>Ring</h2>
[[Image:Top_ring.jpg|thumbnail|right|]]
The ring has a similar structure to the linear array, except that the end nodes connect to each other, establishing a circular structure.
 
* ''Diameter:'' p/2
* ''Bisection BW:'' 2
* ''# Links:'' p
* ''Degree:'' 2
 
Compared with the linear array, the ring topology achieves a relatively big improvement for very little effort (only one added link). The longest distance between two nodes is cut in half, and the bisection bandwidth increases to 2.

<h2>2-D Mesh</h2>

[[Image:Top_2Dmesh.jpg|thumbnail|right|]]

The 2-D mesh can be thought of as several linear arrays put together to form a 2-dimensional structure. This topology is well suited to applications such as the ocean application and matrix calculation.
 
* ''Diameter:'' 2(sqrt(p)-1)
* ''Bisection BW:'' sqrt(p)
* ''# Links:'' 2sqrt(p)(sqrt(p)-1)
* ''Degree:'' 4
 
Nodes that are not on the edge have a '''degree''' of 4. To calculate the number of links, add the number of vertical links, sqrt(p)(sqrt(p)-1), to the number of horizontal links, also sqrt(p)(sqrt(p)-1), to get 2sqrt(p)(sqrt(p)-1). The diameter is calculated by the distance between two diagonal nodes which is the sum of 2 edges of length sqrt(p)-1.
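
These formulas can be checked by building a small mesh and measuring it directly. The following is a minimal sketch, assuming a square mesh with a hypothetical side length <code>k</code> (so p = k*k) and nodes addressed by <code>(x, y)</code> coordinates; the helper names are illustrative. It counts the links, finds the largest degree, and uses breadth-first search to find the diameter.

```python
from collections import deque


def build_mesh(k):
    """Return an adjacency map for a k x k mesh with no wrap-around links."""
    adj = {(x, y): [] for x in range(k) for y in range(k)}
    for (x, y) in adj:
        # Link each node to its right-hand and upper neighbor, if they exist.
        for nx, ny in ((x + 1, y), (x, y + 1)):
            if nx < k and ny < k:
                adj[(x, y)].append((nx, ny))
                adj[(nx, ny)].append((x, y))
    return adj


def eccentricity(adj, start):
    """Return the hop count from start to the node farthest away from it."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adj[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return max(dist.values())


k = 4                                                # p = 16 nodes
mesh = build_mesh(k)
num_links = sum(len(n) for n in mesh.values()) // 2  # each link is counted twice
max_degree = max(len(n) for n in mesh.values())
diameter = max(eccentricity(mesh, node) for node in mesh)
```

For k = 4 the measured values are 24 links, a maximum degree of 4, and a diameter of 6, which agree with 2sqrt(p)(sqrt(p)-1), 4, and 2(sqrt(p)-1) for p = 16.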
 

<h2>2-D Torus</h2>

[[Image:Top_2Dtorus.jpg|thumbnail|right|]]

Just as the ring is formed from the linear array, the 2-D torus takes the structure of the 2-D mesh and connects the nodes on the edges.
 
* ''Diameter:'' sqrt(p)-1
* ''Bisection BW:'' 2sqrt(p)
* ''# Links:'' 2p
* ''Degree:'' 4
 
With the end-around connections, the longest distance is cut and the bisection bandwidth increases. Of course, the cost also rises: the 2-D torus uses 2p links, 2sqrt(p) more than the 2-D mesh.
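
The effect of the end-around links on distance can be illustrated with hop counts. This is a minimal sketch, assuming a square k x k layout with hypothetical <code>(x, y)</code> node coordinates and minimal routing; in each dimension a torus packet may travel either directly or around the wrap-around link, whichever is shorter.

```python
def mesh_hops(a, b):
    """Minimal hop count between nodes a and b on a 2-D mesh."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def torus_hops(a, b, k):
    """Minimal hop count between nodes a and b on a k x k torus."""
    dx = abs(a[0] - b[0])
    dy = abs(a[1] - b[1])
    # In each dimension, take the shorter of the direct and wrap-around paths.
    return min(dx, k - dx) + min(dy, k - dy)


k = 5  # p = 25 nodes
corner_mesh = mesh_hops((0, 0), (k - 1, k - 1))
corner_torus = torus_hops((0, 0), (k - 1, k - 1), k)
# Every node of a torus sees the same distances, so one source is enough.
torus_diameter = max(torus_hops((0, 0), (x, y), k)
                     for x in range(k) for y in range(k))
```

With k = 5, opposite corners are 8 hops apart on the mesh but only 2 hops apart on the torus, and the torus diameter comes out as 4, matching sqrt(p)-1. For an even side length the same calculation gives sqrt(p).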
 

<h2>Cube</h2>
[[Image:Top_cube.jpg|thumbnail|right|]]

If we add two more neighbors to each node, we get a cube. The cube can be thought of as a three-dimensional mesh.
 
* ''Diameter:'' 3(p^(1/3)-1)
* ''Bisection BW:'' p^(2/3)
* ''# Links:'' 3p^(2/3)(p^(1/3)-1)
* ''Degree:'' 6 (from the inside nodes)
 
We can also extend some of the metrics to the N-dimensional mesh:
 
* ''Diameter:'' N(p^(1/N)-1)
* ''Bisection BW:'' p^((N-1)/N)
* ''# Links:'' N*p^((N-1)/N)*(p^(1/N)-1)
 
The degree varies with a node's position, since boundary nodes have fewer neighbors. For interior nodes, the degree of the cube is 6, which is larger than the degree of the two-dimensional mesh, which is 4.

<h2>Hypercube</h2>

[[Image:Top_hypercube.jpg|thumbnail|right|Hypercube]]

In the N-dimensional cube, the boundary nodes are normally the ones that hurt the performance of the entire network. We can fix this by connecting those boundary nodes together. The hypercube is essentially multiple cubes put together.
 
* ''Diameter:'' log2(p)
* ''Bisection BW:'' p/2 -- p/2 links run from one N-1 cube to the other.
* ''# Links:'' p/2 * log2(p) -- each node has a degree of log2(p). Multiply by p nodes and divide by 2 nodes per link.
* ''Degree:'' log2(p)
 
As the metrics show, the diameter and bisection bandwidth are significantly improved for the higher-order topologies. Each node is numbered with a bitstring that is log2(p) bits long. The farthest node is this bitstring's complement. One bit can be flipped per hop, so the diameter is log2(p).
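
The bit-flipping argument translates directly into a route. The following is a minimal sketch, assuming nodes are numbered with integers whose binary form is the bitstring described above; the function name and the choice to correct bits from least significant to most significant are illustrative assumptions, not something the source prescribes.

```python
def hypercube_route(src, dst, dims):
    """Return the nodes visited from src to dst, flipping one differing bit per hop."""
    path = [src]
    current = src
    for bit in range(dims):
        mask = 1 << bit
        if (current ^ dst) & mask:  # this bit still differs from the destination
            current ^= mask         # one hop along that dimension
            path.append(current)
    return path


# Farthest pair in a 3-dimensional hypercube (p = 8): a node and its complement.
route = hypercube_route(0b000, 0b111, 3)
hops = len(route) - 1
```

The route from 000 to 111 visits 000, 001, 011, 111, which is 3 hops, equal to log2(8). The number of hops between any two nodes is the number of bit positions in which their labels differ.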

<h2>Tree</h2>

[[Image:Top_tree.jpg|thumbnail|right|Tree]]

The tree is a hierarchical structure with nodes at the bottom and switching nodes at the upper levels.
 
* ''Diameter:'' 2log2(p) -- the path from a leaf through the root to the farthest leaf on the other side
* ''Bisection BW:'' 1 -- breaking either link to the root bisects the tree
* ''# Links:'' 2(p-1) -- there are 2 links for each router and there are p-1 routers if p is a power of 2.
* ''Degree:'' 3 -- interior routers have a degree of 3.
 
The tree experiences high traffic at the upper levels. Since almost half of the messages need to go through the root node, the root becomes the bottleneck of the tree topology. The other disadvantage of the tree topology is that the bisection bandwidth is only 1.

<h2>Fat Tree</h2>

[[Image:Top_fat_tree.jpg|thumbnail|right|Fat Tree]]
In order to improve the performance of the tree topology, the fat tree alleviates the traffic at upper levels by "fattening" up the links at the upper levels.
 
* ''Diameter:'' 2log2(p) -- same as a regular tree
* ''Bisection BW:'' p/2 -- all links to (one side of) the root must be cut to bisect the tree
* ''# Links:'' plog2(p) -- there are p links at each of log2(p) levels
* ''Degree:'' p -- the root node has p links through it.
 
The fat tree relieves the pressure on the root node, and the bisection bandwidth is also increased.

<h2>Butterfly</h2>

[[Image:Top_butterfly.jpg|thumbnail|right|Butterfly]]
 

The butterfly structure is similar to the tree structure, but it replicates the switching node structure of the tree topology and connects them together so that there are equal links and routers at all levels.
* ''Diameter:'' 2log2(p) -- same as a tree since butterfly has the same depth as a tree (just with p nodes at each level)
* ''Bisection BW:'' p -- p links connect the two halves at the top level
* ''# Links:'' 2p*log2(p) -- there are 2*p links at each level times log2(p) levels.
* ''Degree:'' 4 -- the routers in the middle levels all have 4 links. The leaves and routers at the top level each have 2 links.

 

Butterfly has similar performance to Hypercube. In terms of cost, butterfly has a smaller degree (so cheaper routers can be used) but hypercube has fewer links.
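
The cost trade-off can be made concrete by evaluating the link and degree formulas listed in the Hypercube and Butterfly sections for a given number of nodes. This is a minimal sketch; the function names are hypothetical, and p is assumed to be a power of two so that log2(p) is an integer.

```python
def log2_exact(p):
    """Return log2(p) for a power of two, and reject any other value."""
    if p < 2 or p & (p - 1):
        raise ValueError("p must be a power of two")
    return p.bit_length() - 1


def hypercube_cost(p):
    """Links and degree of a hypercube: p/2 * log2(p) links, degree log2(p)."""
    n = log2_exact(p)
    return {"links": p // 2 * n, "degree": n}


def butterfly_cost(p):
    """Links and degree of a butterfly: 2p * log2(p) links, degree 4."""
    n = log2_exact(p)
    return {"links": 2 * p * n, "degree": 4}


p = 64
hypercube = hypercube_cost(p)
butterfly = butterfly_cost(p)
```

For p = 64 the hypercube needs 192 links with degree-6 routers, while the butterfly needs 768 links but only degree-4 routers.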

<h1>Real-World Implementation of Network Topologies </h1>

In a research study by Andy Hospodor and Ethan Miller, several network topologies were investigated in a high-performance, high-traffic network2. Several topologies were investigated including the fat tree, butterfly, mesh, torii, and hypercube structures. Advantages and disadvantages including cost, performance, and reliability were discussed.

In this experiment, a petabyte-scale network with over 100 GB/s total aggregate bandwidth was investigated. The network consisted of 4096 disks and large servers, with routers and switches in between2.

[[Image:Disknet_network.jpg|frame|center|''Basic structure of Hospodor and Miller's experimental network''2]]

The overall structure of the network is shown above. Note that this structure is very susceptible to failure and congestion.

<h2>Fat Tree</h2>

In large-scale, high-performance applications, the fat tree can be a choice. However, in order to "fatten" up the links, redundant connections must be used. Instead of using one link between switching nodes, several must be used. The problem is that with more input and output links, one needs routers with more input and output ports. Routers with in excess of 100 ports are difficult to build and expensive, so multiple routers would have to be stacked together. Even so, the routers would be expensive and several of them would be required2.
 
 
The Japan Agency for Marine-Earth Science and Technology supercomputing system uses the fat tree topology. The system connects 1280 NEC processors7.

 

<h2>Butterfly</h2>

In high performance applications, the butterfly structure is a good choice. The butterfly topology uses fewer links than other topologies; however, each link carries traffic from the entire layer. Fault tolerance is poor: there exists only a single path between any pair of nodes. Should a link break, data cannot be re-routed, and communication is broken2.

[[Image:Disknet_butterfly.jpg|frame|center|''Butterfly structure''2]]

<h2>Meshes and Tori</h2>

The mesh and torus structures used in this application would require a large number of links and an aggregate of several thousand ports. However, because there are so many links, the mesh and torus structures provide alternate paths in case of failures2.

[[Image:Disknet_mesh.jpg|frame|center|''Mesh structure''2]]

[[Image:Disknet_torus.jpg|frame|center|''Torus structure''2]]

One current example of the torus structure is the QPACE SFB TR Cluster in Germany, which uses PowerXCell 8i processors. The system uses a 3-D torus topology with 4608 processors7.

<h2>Hypercube</h2>

Like the torus structures, the hypercube requires a large number of links. However, its bandwidth scales better than that of the mesh and torus structures.

The CRAY T3E, CRAY XT3, and SGI Origin 2000 use k-ary n-cube topologies.

[[Image:Disknet_hypercube.jpg|frame|center|''Hypercube structure''2]]

It is worth mentioning that even though many topologies perform much better than the 2-D mesh, these advanced topologies also cost more. Since most chips are laid out in 2-D space, it is very expensive to implement a high-dimensional topology on a 2-D chip. For the hypercube topology, increasing the number of nodes raises the degree of every node. For the butterfly topology, the degree grows relatively slowly, but the required number of links and switches increases rapidly.

<h1>Comparison of Network Topologies </h1>


The following figure shows the total number of ports required for each network topology.
 
 
[[Image:Disknet_ports.jpg]]
 
''Number of ports for each topology''2

 

As the figure above shows, the 6-D hypercube requires the largest number of ports, due to its relatively complex six-dimensional structure. In contrast, the fat tree requires the fewest ports, even though its links have been "fattened" up by using redundant links. The butterfly network requires more than twice as many ports as the fat tree, since it essentially replicates the switching layer of the fat tree. The number of ports for the mesh and torus structures increases as the dimensionality increases. However, with modern router technology, the number of ports is a less important consideration.

 
 
The average path length (the average number of hops) and the average link load (GB/s) are shown below.
 
 
[[Image:Disknet_load.jpg]]
 
''Average path length and link load for each topology''2

 

Looking at the trends, when the average path length is high, the average link load is also high. In other words, average link load rises with average path length. The graph shows that the 2-D mesh has, by far, the worst performance. In a large network such as this, the average path length is simply too high, and the average link load suffers. For this type of high-performance network, the 2-D mesh does not scale well. The 2-D torus cuts the average path length and average link load in half by connecting the edge nodes together; however, its performance is still relatively poor compared to the other types. The butterfly and fat tree have the lowest average path length and average link load.

 
 
The figure below shows the cost of the network topologies.
 
 
[[Image:Disknet_cost.jpg]]
 
''Cost of each topology''2

 

Despite using the fewest ports, the fat tree topology has by far the highest cost. The ports it does use are high-bandwidth ports of 10 GB/s, and over 2400 of them are required to provide enough bandwidth at the upper levels of the tree. This pushes the cost up dramatically and makes the fat tree impractical from a cost standpoint. While the total cost of the fat tree is about 15 million dollars, the rest of the network topologies are clustered below 4 million dollars. As the dimensionality of the mesh and torus structures increases, the cost increases. The cost of the butterfly network falls between that of the 2-D mesh/torus and the 6-D hypercube.

 
 
When the cost and average link load are considered together, the following graph is produced.
 
 
[[Image:Disknet_overall.jpg]]
 
''Overall cost of each topology''2

 

From the figure above, the 6-D hypercube is the most cost-effective choice for this particular network setup. Although the 6-D hypercube costs more because it needs more links and ports, it provides higher bandwidth, which can offset the higher cost. The high-dimensional tori also perform well, but cannot provide as much bandwidth as the 6-D hypercube. For systems that do not need as much bandwidth, a high-dimensional torus is also a good choice. The butterfly topology is another alternative, but it has lower fault tolerance.

However, when the number of nodes increases, the relative cost of the higher-dimensional topologies increases far faster than their relative performance when compared to a 2-D mesh. This is because the 2-D mesh only uses low-cost, short links. The higher-dimensional structures must be projected onto our 3-dimensional world, and thus require many long, expensive links that wrap around the outside of the system like an impenetrable tangle of jungle vines. Maintaining such a network is also quite slow and tedious.


<h1>Packet Routing</h1>


The '''routing''' algorithm determines what path a packet of data will take from source to destination. Routing can be '''deterministic''', where the path is always the same for a given source and destination, or '''adaptive''', where the path can change. A routing algorithm can also be '''partially adaptive''', where packets have multiple choices of path but not every packet is allowed to use every shortest path3.


<h2>Deadlock</h2>

Packets are in '''deadlock''' when they cannot continue to move through the nodes. The illustration below demonstrates this event.
 
 
[[Image:Routing_deadlock.jpg]]
 
''Example of deadlock''
 
 

Assume that all of the buffers are full at each node. The packet from Node 1 cannot continue to Node 2, the packet from Node 2 cannot continue to Node 3, and so on. Since no packet can move, the packets are deadlocked.

 
 

Deadlock arises from a cyclic pattern of routing. To avoid deadlock, circular routing patterns must be avoided.
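
The cyclic dependency behind a deadlock can be checked mechanically. The following is a minimal sketch, assuming the waiting relationships are given as a dictionary that maps each node to the nodes whose buffers its packet is waiting for; the name <code>has_cycle</code> and this representation are illustrative choices, not part of the cited material.

```python
def has_cycle(waits_for):
    """Return True if the 'waits for' relation contains a cycle.

    waits_for maps a node to the nodes whose buffers its packet needs next.
    A cycle means no packet in the cycle can ever advance (deadlock).
    """
    WHITE, GREY, BLACK = 0, 1, 2   # unvisited, on the current DFS path, finished
    color = {}

    def visit(node):
        color[node] = GREY
        for nxt in waits_for.get(node, ()):
            state = color.get(nxt, WHITE)
            if state == GREY:          # reached a node already on the path
                return True
            if state == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color.get(n, WHITE) == WHITE and visit(n) for n in list(waits_for))


# Four nodes each waiting on the next one around a ring: deadlocked.
ring = {1: [2], 2: [3], 3: [4], 4: [1]}
# A simple chain has no cycle, so the packet at the end can drain first.
chain = {1: [2], 2: [3], 3: []}
print(has_cycle(ring), has_cycle(chain))
```

The depth-first search marks nodes on the current path, so finding a marked node again indicates a circular wait of the kind shown in the illustration.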

 
 

To prevent circular routing patterns, certain turns are disallowed. These are called '''turn restrictions''': some turns are forbidden so that a circular routing pattern can never form. Some turn-restriction models are described below.


 
<h2>Dimensional ordered (X-Y) routing</h2>

Turns from the y-dimension to the x-dimension are not allowed.
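
As a minimal sketch of this rule, the function below routes a packet on a 2-D mesh by first correcting its x coordinate and only then its y coordinate, so a turn from the y-dimension back to the x-dimension never occurs. The function name <code>xy_route</code> and the (x, y) tuple representation of nodes are assumptions made for illustration.

```python
def xy_route(src, dst):
    """Return the nodes visited from src to dst under X-Y routing.

    Nodes are (x, y) tuples. All x movement happens before any y movement,
    so the route contains at most one turn (from x to y).
    """
    x, y = src
    path = [(x, y)]
    while x != dst[0]:                 # travel along the x-dimension first
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:                 # then travel along the y-dimension
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


print(xy_route((0, 0), (2, 1)))
```

Because the path depends only on the source and destination, this is a deterministic routing algorithm in the sense defined earlier.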


<h2>West First</h2>

Turns to the west are not allowed, so a packet that needs to travel west must do so first.


<h2>North Last</h2>

Turns away from the north direction are not allowed, so a packet may travel north only as the last leg of its route.


<h2>Negative First</h2>

Turns from a positive direction (+x or +y) to a negative direction (-x or -y) are not allowed, so a packet must complete all of its travel in the negative directions first.


<h2>Odd-Even Turn Model</h2>


Unfortunately, the turn-restriction models above are only partially adaptive. The models cause some packets to take different routes, and not necessarily the minimal paths. This may cause unfairness and reduces the ability of the system to relieve congestion, so overall performance could suffer3.


 


Ge-Ming Chiu introduced the Odd-Even turn model as an adaptive, deadlock-free turn-restriction model that performs better than the previously mentioned models3. The model is designed primarily for 2-D meshes.

 

''Turns from the east to north direction from any node on an even column are not allowed.''
 
''Turns from the north to west direction from any node on an odd column are not allowed.''
 
 
''Turns from the east to south direction from any node on an even column are not allowed.''
 
''Turns from the south to west direction from any node on an odd column are not allowed.''
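
The four rules above can be expressed as a small lookup table. This is a minimal sketch, assuming that a turn "from the east to north direction" means a packet travelling east that turns to travel north, and that columns are numbered from 0; the names <code>FORBIDDEN</code> and <code>turn_allowed</code> are illustrative.

```python
# (current heading, new heading) -> column parity in which the turn is forbidden
FORBIDDEN = {
    ("E", "N"): "even",
    ("N", "W"): "odd",
    ("E", "S"): "even",
    ("S", "W"): "odd",
}


def turn_allowed(column, heading, new_heading):
    """Return True if the odd-even rules permit this turn in this column."""
    parity = "even" if column % 2 == 0 else "odd"
    # Turns not listed in the table are never restricted by the model.
    return FORBIDDEN.get((heading, new_heading)) != parity


print(turn_allowed(2, "E", "N"))   # east-to-north in an even column
print(turn_allowed(3, "E", "N"))   # the same turn in an odd column
```

A router using this model would consult such a check for each candidate output direction and choose among the directions that remain.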
 
 
 
The illustration below shows allowed routing for different source and destination nodes. Depending on which column the packet is in, only certain directions are allowed.
 
 
[[Image:Routing_odd_even.jpg]]
 
''Odd-Even turn restriction model proposed by Ge-Ming Chiu''3
 

 

<h1>Comparison of Turn Restriction Models</h1>

To compare the performance of various turn-restriction models, Chiu simulated a 15 x 15 mesh under various traffic patterns. All channels have a bandwidth of 20 flits/usec and a buffer size of one flit. The dimension-ordered x-y routing, west-first, and negative-first models were compared against the odd-even model.

 

Simulations were run with uniform, transpose, and hot spot traffic patterns. Uniform traffic simulates each node sending messages to any other node with equal probability. Transpose traffic simulates two opposite nodes sending messages to their respective halves of the mesh. Hot spot traffic simulates a few "hot spot" nodes that receive high traffic.



 
 
[[Image:Routing_uniform.jpg]]
 
''Uniform traffic simulation of various turn restriction models''3
 
 
The performance of the different routing algorithms under uniform traffic is shown above. For uniform traffic, the dimension-ordered x-y model outperforms the rest of the models: as the number of messages increases, the x-y model has the slowest increase in average communication latency.



 
 
[[Image:Routing_transpose.jpg]]
 
''First transpose traffic simulation of various turn restriction models''3
 
 
The performance of the different routing algorithms is shown above for the first transpose traffic. The negative-first model has the best performance, while the odd-even model performs better than the west-first and x-y models.



 
 
[[Image:Routing_transpose2.jpg]]
 
''Second transpose traffic simulation of various turn restriction models''3
 
 
With the second transpose simulation, the odd-even model outperforms the rest.



 
 
[[Image:Routing_hotspot.jpg]]
 
''Hotspot traffic simulation of various turn restriction models''3
 
 
The performance of the different routing algorithms under hotspot traffic is shown above. Only one hotspot was simulated for this test. The odd-even model outperforms the other models when hotspot traffic is 10%.



 
 
[[Image:Routing_hotspot2.jpg]]
 
''Second hotspot traffic simulation of various turn restriction models''3
 
 
When the number of hotspots is increased to five, the advantage of the odd-even model becomes clear: its latency is the lowest at both 6 and 8 percent hotspot traffic. Meanwhile, the x-y model performs very poorly.


 
 

While the x-y model performs well under uniform traffic, it lacks adaptiveness. Under hotspot traffic, the x-y model cannot adapt and re-route traffic around the congestion caused by the hotspots. The odd-even model has superior adaptiveness under high congestion.


 

<h1>Router Architecture</h1>

The '''router''' is a device that routes incoming data to its destination. It has several input ports and several output ports, and data arriving at one of the input ports is routed to one of the output ports. Which output port is chosen depends on the destination of the data and on the routing algorithm.

 
 

The internal architecture of a router consists of input and output ports and a '''crossbar switch'''. The crossbar switch connects each input to the output selected for it, acting essentially as a multiplexer.
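
The role of the crossbar can be sketched as a mapping from input ports to output ports in which each output is granted to at most one input per cycle. The sketch below is illustrative only: the function name <code>crossbar</code>, the request format, and the fixed-priority arbitration (lowest-numbered input wins) are assumptions, not a description of any particular router.

```python
def crossbar(requests, num_outputs):
    """Connect inputs to outputs for one cycle.

    requests maps an input port number to the output port it wants.
    Each output is granted to at most one input; an input that loses
    arbitration is left out of the result and must retry later.
    """
    granted = {}
    taken = set()
    for inp in sorted(requests):       # assumed fixed priority: lowest input first
        out = requests[inp]
        if not 0 <= out < num_outputs:
            raise ValueError(f"output port {out} does not exist")
        if out not in taken:
            taken.add(out)
            granted[inp] = out
    return granted


# Inputs 0 and 1 both want output 2; only one of them can be connected.
print(crossbar({0: 2, 1: 2, 2: 0}, num_outputs=4))
```

Real routers typically use fairer arbitration schemes; the fixed priority here only keeps the example short.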

 
 

Router technology has improved significantly over the years. This has allowed networks with high dimensionality to become feasible. As shown in the real-world example above, high-dimensional tori and hypercubes are excellent choices of topology for high-performance networks. The cost of high-performance, high-radix routers has contributed to the viability of these types of high-dimensionality networks. As the graph below shows, the bandwidth of routers has improved tremendously over a period of 10 years4.


 
[[Image:Router_bandwidth.jpg]]
 
''Bandwidth of various routers over 10 year period''4

 

Looking at the physical architecture and layout of routers, it is evident that the circuitry has become dramatically denser and more complex.


 
[[Image:Router_physical.jpg]]
 
''Router hardware over period of time''4
 

 
[[Image:Router_radix.jpg]]
 
''Radix and latency of routers over 10 year period''4
 
 

The '''radix''', or number of ports, of routers has also increased. Current technology not only has a high radix, but also lower latency than the previous generation. As radix increases, the latency remains steady.


 


With high-performance routers, complex topologies become possible. As router technology improves, even more complex, high-dimensionality topologies become feasible.


 

<h1>Fault Tolerant Routing</h1>

Fault-tolerant routing means the successful routing of messages between any pair of non-faulty nodes in the presence of faulty components6. With the increasing number of processors in a multiprocessor system and high data rates, reliable transmission of data in the event of a network fault is of great concern, and hence fault-tolerant routing algorithms are important.


<h2>Fault Models</h2>

Faults in a network can be categorized into two types:
 
 
1. '''Transient Faults'''5: A transient fault is a temporary fault that occurs for a very short duration of time. Such a fault can be caused by a change in the output of a flip-flop that leads to the generation of an invalid header. These faults can be minimized using error control coding. They are generally evaluated in terms of Bit Error Rate.
 
 
2. '''Permanent Faults'''5: A permanent fault is a fault that does not go away and causes permanent damage to the network. Such a fault could be due to damaged wires and associated circuitry. These faults are generally evaluated in terms of Mean Time Between Failures.


<h2>Fault Tolerance Mechanisms (for permanent faults)</h2>

Permanent faults can be handled using one of two mechanisms:
 
 
1. '''Static Mechanism''': In the static fault tolerance model, once a fault is detected, all the processes running in the system are stopped and the routing tables are emptied. Based on the fault information, the routing tables are recalculated to provide fault-free paths.
 
 
2. '''Dynamic Mechanisms''': In the dynamic fault tolerance model, the operation of the processes in the network is not completely stalled, and only the affected regions are repaired. Some of the methods for doing this are:
 
 
a. '''Block Faults''': In this method, many of the healthy nodes in the vicinity of the faulty nodes are also marked as faulty so that no routes are created close to the actual faulty nodes. The shape of the resulting region can be convex or non-convex, and care is taken that none of the new routes introduces a cyclic dependency in the channel dependency graph (CDG).
 
 
DISADVANTAGE: This method causes many healthy nodes to be declared faulty, which reduces system capacity.
 
 
[[Image:Fault_pic1.jpg]]
 
 
b. '''Fault Rings''': This method was introduced by Chalasani and Boppana. A fault ring is a set of nodes and links that are adjacent to faulty nodes/links. This approach reduces the number of healthy nodes that must be marked as faulty and blocked.
 
 
[[Image:Fault_pic2.jpg]]
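
The recalculation step of the static mechanism described above can be illustrated with a breadth-first search that finds a shortest fault-free path. This is a minimal sketch under stated assumptions: the network is a k x k mesh, nodes are (x, y) tuples, and the set of faulty nodes is already known. The function name <code>route_avoiding_faults</code> is hypothetical, and the search does not by itself guarantee deadlock freedom, which the cited fault-tolerant schemes address separately.

```python
from collections import deque


def route_avoiding_faults(k, src, dst, faulty):
    """Return a shortest path on a k x k mesh that avoids faulty nodes.

    Returns a list of (x, y) nodes from src to dst, or None when no
    fault-free path exists.
    """
    if src in faulty or dst in faulty:
        return None
    parent = {src: None}               # also serves as the visited set
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:    # walk back to the source
                path.append(node)
                node = parent[node]
            return path[::-1]
        x, y = node
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            inside = 0 <= nxt[0] < k and 0 <= nxt[1] < k
            if inside and nxt not in faulty and nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None


# Node (1, 0) has failed, so the route from (0, 0) to (2, 0) must detour.
print(route_avoiding_faults(3, (0, 0), (2, 0), faulty={(1, 0)}))
```

Running such a search for every source and destination pair would rebuild the routing tables after the faulty nodes are known.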
 


 

<h1>References</h1>


1 Solihin text
 
2 [http://www.ssrc.ucsc.edu/Papers/hospodor-mss04.pdf Interconnection Architectures for Petabyte-Scale High-Performance Storage Systems]
 
3 [http://www.diit.unict.it/~vcatania/COURSES/semm_05-06/DOWNLOAD/noc_routing02.pdf The Odd-Even Turn Model for Adaptive Routing]
 
4 [http://www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf Interconnection Topologies:(Historical Trends and Comparisons)]
This link is down at this time. [http://webcache.googleusercontent.com/search?q=cache:2f2KNFWJCsQJ:www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf+history+of+interconnection+topologies&cd=9&hl=en&ct=clnk&gl=us&client=ubuntu&source=www.google.com Google Cache]
 
5 [http://dspace.upv.es/xmlui/bitstream/handle/10251/2603/tesisUPV2824.pdf?sequence=1 Efficient mechanisms to provide fault tolerance in interconnection networks for PC clusters, José Miguel Montañana Aliaga.]
 
6 [http://web.ebscohost.com.www.lib.ncsu.edu:2048/ehost/pdfviewer/pdfviewer?vid=2&hid=15&sid=72e3828d-3cb1-42b9-8198-5c1e974ea53f@sessionmgr4 Adaptive Fault Tolerant Routing Algorithm for Tree-Hypercube Multicomputer, Qatawneh Mohammad]
 
7 [http://www.top500.org TOP500 Supercomputing Sites]
 
8 [http://www.redbooks.ibm.com/abstracts/sg245161.html?Open Understanding and Using the SP Switch]
 
9 [http://www.myri.com/myrinet/overview/ Myrinet Overview]
 
10 [http://en.wikipedia.org/wiki/QsNet QsNet (Quadrics' network)]
 
11 [http://www.google.com/research/pubs/pub35155.html Dragonfly Topology]
 
12 [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.95.573&rep=rep1&type=pdf Flattened Butterfly Topology]

ECE506 Main Page

2011-04-26T03:57:28Z

Asbransc:

This page serves as a portal for all wiki material related to CSC506 and ECE506. Link to any new wiki pages from this page, and add links to any current pages.

=Supplements to Solihin Text=

Post links to the textbook supplements in this section.
*Chapter 2 [[CSC/ECE 506 Spring 2011/ch2 dm | CSC/ECE 506 Spring 2011/ch2 dm]]
*Chapter 2 [[Parallel_Programming_Models | Parallel Programming Models]]
*Chapter 2 (Still being revised) [[CSC/ECE 506 Spring 2011/ch2 cl | CSC/ECE 506 Spring 2011/ch2 cl]]
*Chapter 2a [[ CSC/ECE 506 Spring 2011/ch2a mc | Current Data-Parallel Architectures ]]
*Chapter 3 (Final Revision) [[ CSC/ECE 506 Spring 2011/ch3 ab | Parallel Architecture Mechanisms and Programming Models ]]
*Chapter 4a [[ CSC/ECE 506 Spring 2011/ch4a ob | Parallelization of Nelder Mead Algorithm ]]
*Chapter 4a (Under Construction) [[ CSC/ECE_506_Spring_2011/ch4a_bm | Parallelization of Algorithms ]]
*Chapter 4a [[ CSC/ECE 506 Spring 2011/ch4a zz | CSC/ECE 506 Spring 2011/ch4a zz ]]
*Chapter 4b [[Chapter 4b CSC/ECE 506 Spring 2011 / ch4b]]
*Chapter 6a (Under Construction) [[ CSC/ECE 506 Spring 2011/ch6a jp | CSC/ECE 506 Spring 2011/ch6a jp ]]
*Chapter 6a (Under Construction) [[ CSC/ECE 506 Spring 2011/ch6a ep | CSC/ECE 506 Spring 2011/ch6a ep ]]
*Chapter 6b (Ready for First Review) [[CSC/ECE 506 Spring 2011/ch6b ab | CSC/ECE 506 Spring 2011/ch6b ab]]
*Chapter 7 (Under Construction) [[CSC/ECE 506 Spring 2011/ch7 jp | CSC/ECE 506 Spring 2011/ch7 jp]]
*Chapter 8 [[CSC/ECE 506 Spring 2011/ch8 mc | CSC/ECE 506 Spring 2011/ch8 mc]]
*Chapter 10 (Under Construction) [[CSC/ECE 506 Spring 2011/ch10 sb | CSC/ECE 506 Spring 2011/ch10 sb]]
*Chapter 10a [[CSC/ECE_506_Spring_2011/ch10a_dc | CSC/ECE_506_Spring_2011/ch10a_dc]]
*Chapter 11 [[CSC/ECE_506_Spring_2011/ch11_BB_EP | Chapter 11 Supplement]]
*Chapter 11 [[Scalable_Coherent_Interface | SCI (Scalable Coherent Interface) ]]
*Chapter 12 [[ CSC/ECE 506 Spring 2011/ch12 ob | Interconnection Network Topologies and Routing Algorithms]]
*Chapter 12 (Ready for Final Review) [[ CSC/ECE 506 Spring 2011/ch12 aj | Interconnection Network Topologies and Routing Algorithms]]
*Chapter 12 [[ CSC/ECE 506 Spring 2011/ch12 | Interconnection Network Topologies]]

CSC/ECE 506 Spring 2011/ch12 aj

2011-04-26T03:56:56Z

Asbransc:

<h1>Interconnection Network Architecture </h1>


In a multi-processor system, processors need to communicate with each other and access each other's resources. In order to route data and messages between processors, an interconnection architecture is needed.


 


Typically, in a multiprocessor system, messages passed between processors are frequent and short1. Therefore, the interconnection network architecture must handle messages quickly by having '''low latency''', and must handle several messages at a time by having '''high bandwidth'''.


 


In a network, a processor along with its cache and memory is considered a '''node'''. The physical wires that connect nodes are called '''links'''. The device that routes messages between nodes is called a router. The shape of the network, such as the number of links and routers, is called the network '''topology'''.


<h1>History of Network Topologies</h1>

Hypercube topologies were invented in the 80s and had desirable characteristics when the number of nodes was small (~1000 maximum, often <100) and every processor had to stop working to receive and forward a message 4. The low-radix era began in 1985 and was defined by routers with between 4 and 8 ports using toroidal, mesh or fat-tree topologies and wormhole routing. This era lasted about 20 years until it was determined that routers with dozens of ports offered superior performance. Two topologies were developed to take advantage of the newly developed high-radix routers. These are the flattened butterfly and the dragonfly, which are somewhere between a mesh in which each point is a router (or virtual router in the case of dragonfly) with dozens or hundreds of nodes attached, and a fat tree with sufficiently high arity as to have only two levels.

<h2>Interconnection evolution in the Top500 List</h2>
[[Image:Top500interconnect.png|thumbnail|300px|left|]]

This chart shows the evolution over time of the different interconnect topologies by their dominance in the top500 list of supercomputers 7. As one can see, many technologies came into vogue briefly before losing performance share and disappearing. In the early days of the list, most of the computers list the interconnect type as not applicable. However, the trailing end of the hypercube phase is clear in burnt orange. The dark blue at the top is "other" and the dark red in the middle is "proprietary", so we can only speculate about what topologies they might employ. The toroidal mesh appears briefly at the start in a cream color, and slightly outlasts the hypercube. The two crossbar technologies (blue and olive) followed the toroidal mesh. The fully-distributed crossbar died out quickly, while the multi-stage crossbar lasted longer but was never dominant. The 3-D torus (purple) dominates much of the 90s, with hypercube topologies (dark pink) enjoying a short comeback in the later part of the decade. SP Switch (light olive), an IBM interconnect technology that uses a multi-stage crossbar switch, replaced the 3-D torus8. Myrinet, Quadrics, and Federation all shared the spotlight in the mid 00s, and each used a similar fat-tree topology9,10. The current class of supercomputers is dominated by nodes connected with either InfiniBand or Gigabit Ethernet. Both can be connected in either a fat-tree or 2-D mesh topology. The primary difference between them is speed. InfiniBand is considerably faster per link and allows links to be ganged into groups of 4 or 12. Gigabit Ethernet is vastly less expensive, however, and some supercomputer designers have apparently chosen to save money on the interconnect technology in order to allow the use of faster nodes 11,12.

 
 
 
 

<h1>Types of Network Topologies </h1>

Several metrics are normally chosen to represent the cost and performance of a topology. In this section, the degree, number of links, diameter, and bisection width are calculated for each topology.

<h2>Linear Array</h2>

[[Image:Top_linear.jpg|thumbnail|frame|right|]]

The nodes are connected linearly as in an array. This type of topology is simple; however, it does not scale well.
 
* ''Diameter:'' p-1
* ''Bisection BW:'' 1
* ''# Links:'' p-1
* ''Degree:'' 2
 
A linear array is the cheapest way to connect a group of nodes together. The number of links and the degree of a linear array are the smallest of any topology. However, the drawback of this topology is also obvious: the two end points are the farthest from each other, which makes the diameter p-1. This topology is also not reliable, since the bisection bandwidth is 1.

<h2>Ring</h2>
[[Image:Top_ring.jpg|thumbnail|right|]]
The ring has a similar structure to the linear array, except that the end nodes connect to each other, establishing a circular structure.
 
* ''Diameter:'' p/2
* ''Bisection BW:'' 2
* ''# Links:'' p
* ''Degree:'' 2
 
Compared with the linear array, the ring topology achieves a relatively large improvement with very little effort (only one added link). The longest distance between two nodes is cut in half, and the bisection bandwidth increases to 2.

<h2>2-D Mesh</h2>

[[Image:Top_2Dmesh.jpg|thumbnail|right|]]

The 2-D mesh can be thought of as several linear arrays put together to form a 2-dimensional structure. This topology is well suited to applications such as the ocean application and matrix calculations.
 
* ''Diameter:'' 2(sqrt(p)-1)
* ''Bisection BW:'' sqrt(p)
* ''# Links:'' 2sqrt(p)(sqrt(p)-1)
* ''Degree:'' 4
 
Nodes that are not on the edge have a '''degree''' of 4. To calculate the number of links, add the number of vertical links, sqrt(p)(sqrt(p)-1), to the number of horizontal links, also sqrt(p)(sqrt(p)-1), to get 2sqrt(p)(sqrt(p)-1). The diameter is calculated by the distance between two diagonal nodes which is the sum of 2 edges of length sqrt(p)-1.
 

<h2>2-D Torus</h2>

[[Image:Top_2Dtorus.jpg|thumbnail|right|]]

Just as the ring is formed from the linear array, the 2-D torus takes the structure of the 2-D mesh and connects the nodes on opposite edges.
 
* ''Diameter:'' sqrt(p)-1
* ''Bisection BW:'' 2sqrt(p)
* ''# Links:'' 2p
* ''Degree:'' 4
 
With the end-around connections, the longest distance is cut and the bisection bandwidth is doubled. Of course, the cost also rises: the 2-D torus needs 2p links, compared with 2sqrt(p)(sqrt(p)-1) for the 2-D mesh.
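
The mesh and torus figures can be checked by building the graphs explicitly and measuring them. The following is a minimal sketch, assuming a k x k arrangement with p = k*k nodes; the helper names <code>build_grid</code>, <code>diameter</code>, and <code>link_count</code> are illustrative. It uses k = 5, since the torus diameter of sqrt(p)-1 quoted above holds when sqrt(p) is odd; for even sqrt(p) the measured diameter is sqrt(p).

```python
from collections import deque


def build_grid(k, wrap):
    """Adjacency sets for a k x k mesh (wrap=False) or torus (wrap=True)."""
    adj = {(x, y): set() for x in range(k) for y in range(k)}
    for x in range(k):
        for y in range(k):
            for dx, dy in ((1, 0), (0, 1)):
                nx, ny = x + dx, y + dy
                if wrap:
                    nx, ny = nx % k, ny % k      # end-around connection
                elif nx >= k or ny >= k:
                    continue                     # mesh edge: no link
                adj[(x, y)].add((nx, ny))
                adj[(nx, ny)].add((x, y))
    return adj


def diameter(adj):
    """Longest shortest-path distance, found by a BFS from every node."""
    best = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        best = max(best, max(dist.values()))
    return best


def link_count(adj):
    """Each link appears in two adjacency sets, so halve the total."""
    return sum(len(neighbors) for neighbors in adj.values()) // 2


k = 5
mesh, torus = build_grid(k, wrap=False), build_grid(k, wrap=True)
print("mesh :", diameter(mesh), link_count(mesh))    # compare 2(sqrt(p)-1), 2sqrt(p)(sqrt(p)-1)
print("torus:", diameter(torus), link_count(torus))  # compare sqrt(p)-1, 2p
```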
 

<h2>Cube</h2>
[[Image:Top_cube.jpg|thumbnail|right|]]

If we add two more neighbors to each node, we get a cube. The cube can be thought of as a three-dimensional mesh.
 
* ''Diameter:'' 3(p^(1/3)-1)
* ''Bisection BW:'' p^(2/3)
* ''# Links:'' 3p^(2/3)(p^(1/3)-1)
* ''Degree:'' 6 (from the inside nodes)
 
We can also extend some of the metrics to the N-dimensional mesh:
 
* ''Diameter:'' N(p^(1/N)-1)
* ''Bisection BW:'' p^((N-1)/N)
* ''# Links:'' N*p^((N-1)/N)*(p^(1/N)-1), which reduces to N*2^N/2 when each dimension has two nodes
 
Interior nodes of an N-dimensional mesh have a degree of 2N, so the degree of the cube is 6, which is larger than the degree of 4 for the two-dimensional mesh.

<h2>Hypercube</h2>

[[Image:Top_hypercube.jpg|thumbnail|right|Hypercube]]

In the N-dimensional cube, the boundary nodes are normally the ones that hurt the performance of the entire network. We can fix this by connecting those boundary nodes together. The hypercube is essentially multiple cubes put together.
 
* ''Diameter:'' log2(p)
* ''Bisection BW:'' p/2
* ''# Links:'' p/2 * log2(p)
* ''Degree:'' log2(p)
 
From the metrics we can see that the diameter and bisection bandwidth are significantly improved for the higher-order topologies.
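
To make the comparison concrete, the formulas listed in this section for the linear array, ring, 2-D mesh, and hypercube can be evaluated for a given number of nodes p. This is a minimal sketch; the function name <code>metrics</code> and the topology labels are illustrative, and the values are simply the formulas from the lists above.

```python
import math


def metrics(topology, p):
    """Return (diameter, bisection BW, links, degree) for p nodes."""
    if topology == "linear":
        return (p - 1, 1, p - 1, 2)
    if topology == "ring":
        return (p // 2, 2, p, 2)
    if topology == "mesh2d":
        k = math.isqrt(p)
        if k * k != p:
            raise ValueError("a 2-D mesh needs p to be a perfect square")
        return (2 * (k - 1), k, 2 * k * (k - 1), 4)
    if topology == "hypercube":
        n = p.bit_length() - 1          # log2(p) when p is a power of two
        if p < 1 or 1 << n != p:
            raise ValueError("a hypercube needs p to be a power of two")
        return (n, p // 2, (p // 2) * n, n)
    raise ValueError(f"unknown topology: {topology}")


for name in ("linear", "ring", "mesh2d", "hypercube"):
    print(f"{name:10s}", metrics(name, 64))
```

For p = 64 the hypercube has both the smallest diameter and the largest bisection bandwidth of the four, at the price of the most links and the highest degree.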

<h2>Tree</h2>

[[Image:Top_tree.jpg|thumbnail|right|Tree]]

The tree is a hierarchical structure with nodes at the bottom and switching nodes at the upper levels.
 
* ''Diameter:'' 2log2(p) -- the path from a leaf through the root to the farthest leaf on the other side
* ''Bisection BW:'' 1 -- breaking either link to the root bisects the tree
* ''# Links:'' 2(p-1) -- there are 2 links for each router and there are p-1 routers if p is a power of 2.
* ''Degree:'' 3 -- interior routers have a degree of 3.
 
The tree experiences high traffic at the upper levels. Since almost half of the messages need to go through the root node, the root becomes the bottleneck of the tree topology. Another disadvantage of the tree topology is that the bisection bandwidth is only 1.

<h2>Fat Tree</h2>

[[Image:Top_fat_tree.jpg|thumbnail|right|Fat Tree]]
In order to improve the performance of the tree topology, the fat tree alleviates the traffic at upper levels by "fattening" up the links at the upper levels.
 
* ''Diameter:'' 2log2(p) -- same as a regular tree
* ''Bisection BW:'' p/2 -- all links to (one side of) the root must be cut to bisect the tree
* ''# Links:'' plog2(p) -- there are p links at each of log2(p) levels
* ''Degree:'' p -- the root node has p links through it.
 
The fat tree relieves the pressure on the root node, and the bisection bandwidth is also increased.

<h2>Butterfly</h2>

[[Image:Top_butterfly.jpg|thumbnail|right|Butterfly]]
 

The butterfly structure is similar to the tree structure, but it replicates the switching node structure of the tree topology and connects them together so that there are equal links and routers at all levels.
* ''Diameter:'' 2log2(p) -- same as a tree since butterfly has the same depth as a tree (just with p nodes at each level)
* ''Bisection BW:'' p -- p links connect the two halves at the top level
* ''# Links:'' 2p*log2(p) -- there are 2*p links at each level times log2(p) levels.
* ''Degree:'' 4 -- the routers in the middle levels all have 4 links. The leaves and routers at the top level each have 2 links.

 

Butterfly has similar performance to Hypercube. In terms of cost, butterfly has a smaller degree (so cheaper routers can be used) but hypercube has fewer links.

<h1>Real-World Implementation of Network Topologies </h1>

In a research study by Andy Hospodor and Ethan Miller, several network topologies were investigated in a high-performance, high-traffic network2. The topologies investigated included the fat tree, butterfly, mesh, torus, and hypercube structures. Advantages and disadvantages including cost, performance, and reliability were discussed.

In this experiment, a petabyte-scale network with over 100 GB/s total aggregate bandwidth was investigated. The network consisted of 4096 disks with large servers with routers and switches in between2.

[[Image:Disknet_network.jpg|frame|center|''Basic structure of Hospodor and Miller's experimental network''2]]

The overall structure of the network is shown above. Note that this structure is very susceptible to failure and congestion.

<h2>Fat Tree</h2>

In large-scale, high-performance applications, the fat tree can be a choice. However, in order to "fatten" up the links, redundant connections must be used: instead of one link between switching nodes, several must be used. The problem is that with more input and output links, one needs routers with more input and output ports. Routers with in excess of 100 ports are difficult to build and expensive, so multiple routers would have to be stacked together. Even so, several routers would be required, and they would be expensive2.
 
 
The Japan Agency for Marine-Earth Science and Technology supercomputing system uses the fat tree topology. The system connects 1280 NEC processors7.

 

<h2>Butterfly</h2>

In high-performance applications, the butterfly structure is a good choice. The butterfly topology uses fewer links than other topologies; however, each link carries traffic from the entire layer. Fault tolerance is poor: there exists only a single path between any pair of nodes, so should a link break, data cannot be re-routed and communication is broken2.

[[Image:Disknet_butterfly.jpg|frame|center|''Butterfly structure''2]]

<h2>Meshes and Tori</h2>

The mesh and torus structures used in this application would require a large number of links and an aggregate of several thousand ports. However, since there are so many links, the mesh and torus structures provide alternate paths in case of failures2.

[[Image:Disknet_mesh.jpg|frame|center|''Mesh structure''2]]

[[Image:Disknet_torus.jpg|frame|center|''Torus structure''2]]

One example of current use of the torus structure is the QPACE SFB TR Cluster in Germany, which uses PowerXCell 8i processors. The system uses a 3-D torus topology with 4608 processors7.

<h2>Hypercube</h2>

Similar to the torus structures, the hypercube requires a larger number of links. However, its bandwidth scales better than that of the mesh and torus structures.

The CRAY T3E, CRAY XT3, and SGI Origin 2000 use k-ary n-cube topologies.

[[Image:Disknet_hypercube.jpg|frame|center|''Hypercube structure''2]]

It is worth mentioning that even though many topologies have much better performance than the 2-D mesh, the cost of these advanced topologies is also high. Since most chips are laid out in 2-D space, it is very expensive to implement a high-dimensional topology on a 2-D chip. For the hypercube topology, increasing the number of nodes increases the degree of each node. For the butterfly topology, the degree increases relatively slowly, but the required number of links and switches increases rapidly.

<h1>Comparison of Network Topologies </h1>


The following table shows the total number of ports required for each network topology.
 
 
[[Image:Disknet_ports.jpg]]
 
''Number of ports for each topology''2

 

As the figure above shows, the 6-D hypercube requires the largest number of ports, due to its relatively complex six-dimensional structure. In contrast, the fat tree requires the fewest ports, even though its links have been "fattened" up with redundant links. The butterfly network requires more than twice as many ports as the fat tree, since it essentially replicates the switching layer of the fat tree. The number of ports for the mesh and torus structures increases as the dimensionality increases. However, with modern router technology, the number of ports is a less important consideration.

 
 
The average path length, or average number of hops, and the average link load (GB/s) are shown below.
 
 
[[Image:Disknet_load.jpg]]
 
''Average path length and link load for each topology''2

 

Looking at the trends, when the average path length is high, the average link load is also high. In other words, average path length and average link load are proportionally related. It is obvious from the graph that the 2-D mesh has, by far, the worst performance. In a large network such as this, the average path length is simply too high, and the average link load suffers. For this type of high-performance network, the 2-D mesh does not scale well. The 2-D torus cuts the average path length and average link load in half by connecting the edge nodes together; however, its performance is still relatively poor compared to the other types. The butterfly and fat tree have the lowest average path length and average link load.

 
 
The figure below shows the cost of the network topologies.
 
 
[[Image:Disknet_cost.jpg]]
 
''Cost of each topology''2

 

Despite using the fewest ports, the fat tree topology has by far the highest cost. Its ports are high-bandwidth 10 GB/s ports, and over 2400 of them are required to provide enough bandwidth at the upper levels of the tree. This pushes the cost up dramatically and makes the fat tree impractical from a cost standpoint. While the total cost of the fat tree is about 15 million dollars, the other network topologies are clustered below 4 million dollars. As the dimensionality of the mesh and torus structures increases, the cost increases. The cost of the butterfly network falls between that of the 2-D mesh/torus and the 6-D hypercube.

 
 
When cost and average link load are both factored in, the following graph is produced.
 
 
[[Image:Disknet_overall.jpg]]
 
''Overall cost of each topology''2

 

From the figure above, the 6-D hypercube is the most cost-effective choice for this particular network setup. Although the 6-D hypercube costs more because it needs more links and ports, it provides higher bandwidth, which can offset the higher cost. The high-dimensional tori also perform well, but cannot provide as much bandwidth as the 6-D hypercube. For systems that do not need as much bandwidth, a high-dimensional torus is also a good choice. The butterfly topology is another alternative, but has lower fault tolerance.

However, when the number of nodes increases, the relative cost of the higher-dimensional topologies increases far faster than their relative performance when compared to a 2-D mesh. This is because the 2-D mesh only uses low-cost, short links. The higher-dimensional structures must be projected onto our 3-dimensional world, and thus require many long, expensive links that wrap around the outside of the system like an impenetrable tangle of jungle vines. Maintaining such a network is also quite slow and tedious.


<h1>Packet Routing</h1>


The '''routing''' algorithm determines what path a packet of data will take from source to destination. Routing can be '''deterministic''', where the path is always the same for a given source and destination, or '''adaptive''', where the path can change. A routing algorithm can also be '''partially adaptive''', where packets have multiple choices of path but not all packets are allowed to use a shortest path3.


<h2>Deadlock</h2>

Packets are in '''deadlock''' when they cannot continue to move through the nodes. The illustration below demonstrates this event.
 
 
[[Image:Routing_deadlock.jpg]]
 
''Example of deadlock''
 
 

Assume that all of the buffers are full at each node. The packet from Node 1 cannot continue to Node 2, the packet from Node 2 cannot continue to Node 3, and so on. Since no packet can move, the packets are deadlocked.

 
 

Deadlock arises from a cyclic pattern of routing. To avoid deadlock, circular routing patterns must be avoided.

 
 

To prevent circular routing patterns, certain turns are disallowed. These are called '''turn restrictions''': some turns are forbidden so that a circular routing pattern can never form. Some turn-restriction models are described below.


 
<h2>Dimensional ordered (X-Y) routing</h2>

Turns from the y-dimension to the x-dimension are not allowed.


<h2>West First</h2>

Turns to the west are not allowed, so a packet that needs to travel west must do so first.


<h2>North Last</h2>

Turns away from the north direction are not allowed, so a packet may travel north only as the last leg of its route.


<h2>Negative First</h2>

Turns from a positive direction (+x or +y) to a negative direction (-x or -y) are not allowed, so a packet must complete all of its travel in the negative directions first.


<h2>Odd-Even Turn Model</h2>


Unfortunately, the above turn-restriction models reduce the degree of adaptiveness and are only partially adaptive. The models force some packets onto particular routes, which are not necessarily the minimal paths. This may cause unfairness and reduces the ability of the system to relieve congestion. Overall performance could suffer3.


 


Ge-Ming Chiu introduced the Odd-Even turn model as an adaptive, deadlock-free turn-restriction model that has better performance than the previously mentioned models3. The model is designed primarily for 2-D meshes.

 

''Turns from the east to north direction from any node on an even column are not allowed.''
 
''Turns from the north to west direction from any node on an odd column are not allowed.''
 
 
''Turns from the east to south direction from any node on an even column are not allowed.''
 
''Turns from the south to west direction from any node on an odd column are not allowed.''
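The four rules above can be written as a small lookup. This is only a sketch: it assumes columns are numbered from 0 so that parity is <code>column % 2</code>, and it encodes directions as the letters E, W, N, S. The names <code>turn_allowed</code>, <code>PROHIBITED_EVEN</code> and <code>PROHIBITED_ODD</code> are hypothetical.

```python
# (current heading, new heading) pairs forbidden by the rules listed above.
PROHIBITED_EVEN = {("E", "N"), ("E", "S")}  # at nodes in even columns
PROHIBITED_ODD = {("N", "W"), ("S", "W")}   # at nodes in odd columns


def turn_allowed(column, heading, new_heading):
    """Return True if the odd-even rules permit this turn at this column."""
    if heading == new_heading:
        return True  # continuing straight is not a turn
    banned = PROHIBITED_EVEN if column % 2 == 0 else PROHIBITED_ODD
    return (heading, new_heading) not in banned
```

For example, <code>turn_allowed(2, "E", "N")</code> is false while <code>turn_allowed(3, "E", "N")</code> is true, so an eastbound packet that wants to head north must make that turn in an odd column.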
 
 
 
The illustration below shows allowed routing for different source and destination nodes. Depending on which column the packet is in, only certain directions are allowed.
 
 
[[Image:Routing_odd_even.jpg]]
 
''Odd-Even turn restriction model proposed by Ge-Ming Chiu''3
 

 

<h1>Comparison of Turn Restriction Models</h1>

To compare the performance of various turn-restriction models, Chiu simulated a 15 x 15 mesh under various traffic patterns. All channels have a bandwidth of 20 flits/usec and a buffer size of one flit. The dimension-ordered x-y routing, west-first, and negative-first models were compared against the odd-even model.

 

Simulations were conducted with uniform, transpose, and hot spot traffic patterns. Uniform traffic simulates each node sending messages to any other node with equal probability. Transpose traffic simulates each node sending messages to the node at its transposed position in the mesh. Hot spot traffic simulates a few "hot spot" nodes that receive high traffic.
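To make the three patterns concrete, here is a hedged sketch of how a simulator might pick a destination for each message on an n x n mesh. All function names are hypothetical. The transpose mapping (i, j) to (j, i) is one common definition and is an assumption here, since the text does not state the exact mappings used in the two transpose experiments. The hot spot function also assumes that the extra traffic is a fixed probability of choosing a hot spot node.

```python
import random


def uniform_dest(src, n, rng):
    """Pick any node other than src with equal probability."""
    while True:
        dst = (rng.randrange(n), rng.randrange(n))
        if dst != src:
            return dst


def transpose_dest(src):
    """Assumed mapping: node (i, j) sends to node (j, i)."""
    i, j = src
    return (j, i)


def hotspot_dest(src, n, hotspots, extra, rng):
    """With probability `extra` send to a hot spot, otherwise uniformly."""
    if rng.random() < extra:
        return rng.choice(hotspots)
    return uniform_dest(src, n, rng)


rng = random.Random(0)  # fixed seed so a run can be repeated
```

Passing an explicit <code>random.Random</code> instance keeps runs repeatable, which matters when comparing routing models on identical traffic.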



 
 
[[Image:Routing_uniform.jpg]]
 
''Uniform traffic simulation of various turn restriction models''3
 
 
The performance of the different routing algorithms is shown above for uniform traffic. For uniform traffic, the dimension-ordered x-y model outperforms the rest of the models. As the number of messages increases, the x-y model has the slowest increase in average communication latency.



 
 
[[Image:Routing_transpose.jpg]]
 
''First transpose traffic simulation of various turn restriction models''3
 
 
The performance of the different routing algorithms is shown above for the first transpose traffic. The negative-first model has the best performance, while the odd-even model performs better than the west-first and x-y models.



 
 
[[Image:Routing_transpose2.jpg]]
 
''Second transpose traffic simulation of various turn restriction models''3
 
 
With the second transpose simulation, the odd-even model outperforms the rest.



 
 
[[Image:Routing_hotspot.jpg]]
 
''Hotspot traffic simulation of various turn restriction models''3
 
 
The performance of the different routing algorithms is shown above for hotspot traffic. Only one hotspot was simulated for this test. The odd-even model outperforms the other models when hotspot traffic is 10%.



 
 
[[Image:Routing_hotspot2.jpg]]
 
''Second hotspot traffic simulation of various turn restriction models''3
 
 
When the number of hotspots is increased to five, the odd-even model begins to shine. Its latency is the lowest for both 6 and 8 percent hotspot traffic. Meanwhile, the x-y model performs very poorly.


 
 

While the x-y model performs well under uniform traffic, it lacks adaptiveness. When traffic is concentrated on hotspots, the x-y model suffers from its inability to adapt and re-route traffic around the congestion that the hotspots cause. The odd-even model has superior adaptiveness under high congestion.


 

<h1>Router Architecture</h1>

The '''router''' is a device that routes incoming data to its destination. It does this by having several input ports and several output ports. Data arriving on one of the input ports is routed to one of the output ports. Which output port is chosen depends on the destination of the data and on the routing algorithm.

 
 

The internal architecture of a router consists of input and output ports and a '''crossbar switch'''. The crossbar switch connects each input to the output port that has been selected for it, acting essentially as a multiplexer.
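The multiplexer view of the crossbar can be sketched in a few lines. This is an illustrative model only, assuming that each output port has its own selector naming the input port that drives it. The function name <code>crossbar</code> and the list-based port representation are hypothetical.

```python
def crossbar(inputs, select):
    """Model a crossbar as one multiplexer per output port.

    inputs: the value present at each input port.
    select: for each output port, the index of the input port that
            feeds it, or None if the output is idle.
    """
    return [None if s is None else inputs[s] for s in select]


# Output 0 takes input 2, output 1 is idle, output 2 takes input 0.
print(crossbar(["a", "b", "c"], [2, None, 0]))  # ['c', None, 'a']
```

In a real router the selectors would be set by the routing algorithm and an arbiter. This sketch leaves both out.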

 
 

Router technology has improved significantly over the years. This has allowed networks with high dimensionality to become feasible. As shown in the real-world example above, high-dimensional tori and hypercubes are excellent choices of topology for high-performance networks. The cost of high-performance, high-radix routers has contributed to the viability of these types of high-dimensionality networks. As the graph below shows, the bandwidth of routers has improved tremendously over a period of 10 years4.


 
[[Image:Router_bandwidth.jpg]]
 
''Bandwidth of various routers over 10 year period''4

 

Looking at the physical architecture and layout of routers, it is evident that the circuitry has become dramatically more dense and complex.


 
[[Image:Router_physical.jpg]]
 
''Router hardware over period of time''4
 

 
[[Image:Router_radix.jpg]]
 
''Radix and latency of routers over 10 year period''4
 
 

The '''radix''', or the number of ports, of routers has also increased. Current technology not only has a high radix, but also lower latency than the previous generation. As radix increases, the latency remains steady.


 


With high-performance routers, complex topologies become possible: as router technology improves, more complex, higher-dimensionality topologies become feasible.


 

<h1>Fault Tolerant Routing</h1>

Fault-tolerant routing means the successful routing of messages between any pair of non-faulty nodes in the presence of faulty components6. With the increased number of processors in a multiprocessor system and high data rates, reliable transmission of data in the event of a network fault is of great concern, and hence fault-tolerant routing algorithms are important.


<h2>Fault Models</h2>

Faults in a network can be categorized in two types:
 
 
1. '''Transient Faults'''5: A transient fault is a temporary fault that occurs for a very short duration of time. Such a fault can be caused by a change in the output of a flip-flop, leading to the generation of an invalid header. These faults can be minimized using error-control coding. They are generally evaluated in terms of Bit Error Rate.
 
 
2. '''Permanent Faults'''5: A permanent fault is a fault that does not go away and causes permanent damage to the network. Such a fault could be due to damaged wires and associated circuitry. These faults are generally evaluated in terms of Mean Time Between Failures.


<h2>Fault Tolerance Mechanisms (for permanent faults)</h2>

Permanent faults can be handled using one of two mechanisms:
 
 
1. '''Static Mechanism''': In the static fault tolerance model, once a fault is detected, all the processes running in the system are stopped and the routing tables are emptied. Based on the fault information, the routing tables are recalculated to provide fault-free paths.
 
 
2. '''Dynamic Mechanisms''': In the dynamic fault tolerance model, the operation of the processes in the network is not completely stalled, and only the affected regions are repaired. Some of the methods to do this are:
 
 
a. '''Block Faults''': In this method, many of the healthy nodes in the vicinity of the faulty nodes are marked as faulty so that no routes are created close to the actual faulty nodes. The shape of the region could be convex or non-convex, and care is taken that none of the new routes introduces a cyclic dependency in the channel dependency graph (CDG).
 
 
DISADVANTAGE: This method causes many healthy nodes to be declared faulty, which reduces system capacity.
 
 
[[Image:Fault_pic1.jpg]]
 
 
b. '''Fault Rings''': This method was introduced by Chalasani and Boppana. A fault-tolerant ring is a set of nodes and links that are adjacent to faulty nodes/links. This approach reduces the number of healthy nodes that must be marked as faulty and blocked.
 
 
[[Image:Fault_pic2.jpg]]
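As a simplified illustration of recalculating routes from fault information, the sketch below finds a shortest path on an n x n mesh that avoids a set of permanently faulty nodes, using a breadth-first search. It is not the block-fault or fault-ring algorithm, and it does not check deadlock freedom. The function name <code>shortest_path_avoiding</code> and the set-of-tuples fault representation are assumptions for this example.

```python
from collections import deque


def shortest_path_avoiding(n, src, dst, faulty):
    """Breadth-first search on an n x n mesh that skips faulty nodes.

    Returns the list of nodes from src to dst, or None if the faults
    disconnect the two nodes.
    """
    if src in faulty or dst in faulty:
        return None
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            # Walk the parent links back to the source.
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        x, y = node
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if (0 <= nxt[0] < n and 0 <= nxt[1] < n
                    and nxt not in faulty and nxt not in parent):
                parent[nxt] = node
                queue.append(nxt)
    return None
```

On a 3 x 3 mesh with node (1, 0) faulty, a route from (0, 0) to (2, 0) has to detour through the middle row, so it takes four hops instead of two.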
 


 

<h1>References</h1>


1 Solihin text
 
2 [http://www.ssrc.ucsc.edu/Papers/hospodor-mss04.pdf Interconnection Architectures for Petabyte-Scale High-Performance Storage Systems]
 
3 [http://www.diit.unict.it/~vcatania/COURSES/semm_05-06/DOWNLOAD/noc_routing02.pdf The Odd-Even Turn Model for Adaptive Routing]
 
4 [http://www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf Interconnection Topologies:(Historical Trends and Comparisons)]
This link is down at this time. [http://webcache.googleusercontent.com/search?q=cache:2f2KNFWJCsQJ:www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf+history+of+interconnection+topologies&cd=9&hl=en&ct=clnk&gl=us&client=ubuntu&source=www.google.com Google Cache]
 
5 [http://dspace.upv.es/xmlui/bitstream/handle/10251/2603/tesisUPV2824.pdf?sequence=1 Efficient mechanisms to provide fault tolerance in interconnection networks for PC clusters, José Miguel Montañana Aliaga.]
 
6 [http://web.ebscohost.com.www.lib.ncsu.edu:2048/ehost/pdfviewer/pdfviewer?vid=2&hid=15&sid=72e3828d-3cb1-42b9-8198-5c1e974ea53f@sessionmgr4 Adaptive Fault Tolerant Routing Algorithm for Tree-Hypercube Multicomputer, Qatawneh Mohammad]
 
7 [http://www.top500.org TOP500 Supercomputing Sites]
 
8 [http://www.redbooks.ibm.com/abstracts/sg245161.html?Open Understanding and Using the SP Switch]
 
9 [http://www.myri.com/myrinet/overview/ Myrinet Overview]
 
10 [http://en.wikipedia.org/wiki/QsNet QsNet (Quadrics' network)]
 
11 [http://www.google.com/research/pubs/pub35155.html Dragonfly Topology]
 
12 [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.95.573&rep=rep1&type=pdf Flattened Butterfly Topology]

CSC/ECE 506 Spring 2011/ch12 aj

2011-04-26T03:56:11Z

Asbransc:

<h1>Interconnection Network Architecture </h1>


In a multi-processor system, processors need to communicate with each other and access each other's resources. In order to route data and messages between processors, an interconnection architecture is needed.


 


Typically, in a multiprocessor system, messages passed between processors are frequent and short1. Therefore, the interconnection network architecture must handle messages quickly by having '''low latency''', and must handle several messages at a time by having '''high bandwidth'''.


 


In a network, a processor along with its cache and memory is considered a '''node'''. The physical wires that connect nodes are called '''links'''. The device that routes messages between nodes is called a '''router'''. The shape of the network, such as the number of links and routers, is called the network '''topology'''.


<h1>History of Network Topologies</h1>

Hypercube topologies were invented in the 80s and had desirable characteristics when the number of nodes is small (~1000 maximum, often <100) and every processor must stop working to receive and forward the message 4. The low-radix era began in 1985 and was defined by routers with between 4 and 8 ports using toroidal, mesh or fat-tree topologies and wormhole routing. This era lasted about 20 years until it was determined that routers with dozens of ports offered superior performance. Two topologies were developed to take advantage of the newly developed high-radix routers. These are flattened butterfly and dragonfly, which are somewhere between a mesh with each point on the mesh being a router (or virtual router in the case of dragonfly) with dozens or hundreds of nodes attached and a fat tree with sufficiently high arity as to only have two levels.

<h2>Interconnection evolution in the Top500 List</h2>
[[Image:Top500interconnect.png|thumbnail|300px|left|]]

This chart shows the evolution over time of the different interconnect topologies by their dominance in the Top500 list of supercomputers 7. As one can see, many technologies came into vogue briefly before losing performance share and disappearing. In the early days of the list, the interconnect type is listed as not applicable for most of the computers. However, the trailing end of the hypercube phase is clear in burnt orange. The dark blue at the top is "other" and the dark red in the middle is "proprietary", so we can only speculate about what topologies they might employ. The toroidal mesh appears briefly at the start in a cream color, and slightly outlasts the hypercube. The two crossbar technologies (blue and olive) followed the toroidal mesh. The fully-distributed crossbar died out quickly; the multi-stage crossbar lasted longer but was never dominant. The 3-D torus (purple) dominates much of the 90s, with hypercube topologies (dark pink) enjoying a short comeback in the later part of the decade. SP Switch (light olive), an IBM interconnect technology that uses a multi-stage crossbar switch, replaced the 3-D torus8. Myrinet, Quadrics, and Federation all shared the spotlight in the mid 00s, and each used a similar fat-tree topology9,10. The current class of supercomputers is dominated by nodes connected with either Infiniband or gigabit ethernet. Both can be connected in either a fat-tree or 2-D mesh topology. The primary difference between them is speed. Infiniband is considerably faster per link and allows links to be ganged into groups of 4 or 12. Gigabit ethernet is vastly less expensive, however, and some supercomputer designers have apparently chosen to save money on the interconnect technology in order to allow the use of faster nodes 11,12.

 
 
 
 

<h1>Types of Network Topologies </h1>

Several metrics are normally chosen to represent the cost and performance of a given topology. In this section, the degree, number of links, diameter, and bisection width are given for each topology.

<h2>Linear Array</h2>

[[Image:Top_linear.jpg|thumbnail|frame|right|]]

The nodes are connected linearly as in an array. This type of topology is simple, however, it does not scale well.
 
* ''Diameter:'' p-1
* ''Bisection BW:'' 1
* ''# Links:'' p-1
* ''Degree:'' 2
 
A linear array is the cheapest way to connect a group of nodes together. The number of links and the degree of a linear array are the smallest of any topology. However, the drawback of this topology is also obvious: the two end points are the farthest apart, which makes the diameter p-1. This topology is also not reliable, since the bisection bandwidth is 1 and a single broken link splits the network.

<h2>Ring</h2>
[[Image:Top_ring.jpg|thumbnail|right|]]
The ring has a similar structure to the linear array, except that the end nodes connect to each other, establishing a circular structure.
 
* ''Diameter:'' p/2
* ''Bisection BW:'' 2
* ''# Links:'' p
* ''Degree:'' 2
 
Compared with the linear array, the ring topology gets a relatively big improvement for the least effort (adding only one link). The longest distance between two nodes is cut in half, and the bisection bandwidth increases to 2.

<h2>2-D Mesh</h2>

[[Image:Top_2Dmesh.jpg|thumbnail|right|]]

The 2-D mesh can be thought of as several linear arrays put together to form a 2-dimensional structure. This topology is very suitable for some of the applications such as the ocean application and matrix calculation.
 
* ''Diameter:'' 2(sqrt(p)-1)
* ''Bisection BW:'' sqrt(p)
* ''# Links:'' 2sqrt(p)(sqrt(p)-1)
* ''Degree:'' 4
 
Nodes that are not on the edge have a '''degree''' of 4. To calculate the number of links, add the number of vertical links, sqrt(p)(sqrt(p)-1), to the number of horizontal links, also sqrt(p)(sqrt(p)-1), to get 2sqrt(p)(sqrt(p)-1). The diameter is calculated by the distance between two diagonal nodes which is the sum of 2 edges of length sqrt(p)-1.
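The formulas above can be checked by building a small mesh and measuring it directly. The sketch below is one way to do that, assuming p is a perfect square with side k = sqrt(p). The function name <code>mesh_metrics</code> is hypothetical.

```python
from collections import deque


def mesh_metrics(k):
    """Build a k x k mesh and return (links, diameter, max_degree)."""
    nodes = [(x, y) for x in range(k) for y in range(k)]

    def neighbors(node):
        x, y = node
        candidates = ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1))
        return [(a, b) for a, b in candidates if 0 <= a < k and 0 <= b < k]

    # Every link is seen from both of its end points, so halve the sum.
    links = sum(len(neighbors(n)) for n in nodes) // 2
    max_degree = max(len(neighbors(n)) for n in nodes)

    def farthest(src):
        """Hop count from src to the node farthest from it (BFS)."""
        dist = {src: 0}
        queue = deque([src])
        while queue:
            n = queue.popleft()
            for m in neighbors(n):
                if m not in dist:
                    dist[m] = dist[n] + 1
                    queue.append(m)
        return max(dist.values())

    diameter = max(farthest(n) for n in nodes)
    return links, diameter, max_degree


print(mesh_metrics(4))  # (24, 6, 4)
```

For p = 16 (k = 4) the formulas give 2sqrt(p)(sqrt(p)-1) = 24 links and a diameter of 2(sqrt(p)-1) = 6, which matches the measured values.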
 

<h2>2-D Torus</h2>

[[Image:Top_2Dtorus.jpg|thumbnail|right|]]

Just as the ring is formed from the linear array, the 2-D torus takes the structure of the 2-D mesh and connects the nodes on the edges.
 
* ''Diameter:'' sqrt(p)-1
* ''Bisection BW:'' 2sqrt(p)
* ''# Links:'' 2p
* ''Degree:'' 4
 
With the end-around connections, the longest distance is cut and the bisection bandwidth doubles. Of course, the cost also increases: the torus adds 2sqrt(p) wrap-around links to the 2-D mesh, for a total of 2p links.
 

<h2>Cube</h2>
[[Image:Top_cube.jpg|thumbnail|right|]]

If we add two more neighbors to each node, we get a cube. The cube can be thought of as a three-dimensional mesh.
 
* ''Diameter:'' 3(p^(1/3)-1)
* ''Bisection BW:'' p^(2/3)
* ''# Links:'' 3p^(2/3)(p^(1/3)-1), which is 3*2^3/2 = 12 for the 8-node cube
* ''Degree:'' 6 (from the inside nodes)
 
We can also extend some of the metrics to the N-dimensional mesh:
 
* ''Diameter:'' N(p^(1/N)-1)
* ''Bisection BW:'' p^((N-1)/N)
* ''# Links:'' N*p^((N-1)/N)*(p^(1/N)-1), which is N*2^N/2 when each dimension has only two nodes
 
The degree of an inside node is 2N, so it grows with the number of dimensions: the degree of the cube is 6, which is larger than the degree of 4 for the two-dimensional mesh.

<h2>Hypercube</h2>

[[Image:Top_hypercube.jpg|thumbnail|right|Hypercube]]

In the N-dimensional cube, the boundary nodes are normally the ones that hurt the performance of the entire network. We can fix this by connecting those boundary nodes together. The hypercube is essentially multiple cubes put together.
 
* ''Diameter:'' log2(p)
* ''Bisection BW:'' p/2
* ''# Links:'' p/2 * log2(p)
* ''Degree:'' log2(p)
 
From the metrics we can see, the diameter and bisection bandwidth are significantly improved for the high order topologies.
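The hypercube metrics follow from a simple labeling: if each of the p = 2^d nodes is given a d-bit number, two nodes are linked exactly when their labels differ in one bit. The sketch below uses that labeling. The function names are hypothetical, and the parameter <code>dims</code> stands for log2(p).

```python
def hypercube_neighbors(node, dims):
    """Neighbors of a node: flip each of the dims label bits in turn."""
    return [node ^ (1 << d) for d in range(dims)]


def hypercube_distance(a, b):
    """Hops between two nodes: the number of label bits that differ."""
    return bin(a ^ b).count("1")


def hypercube_links(dims):
    """Count links by summing degrees; each link is counted twice."""
    p = 1 << dims
    return sum(len(hypercube_neighbors(n, dims)) for n in range(p)) // 2


print(hypercube_neighbors(0, 3))       # [1, 2, 4]
print(hypercube_distance(0b0000, 0b1111))  # 4
print(hypercube_links(4))              # 32
```

With p = 16 the degree and diameter are both log2(p) = 4, and the link count is p/2 * log2(p) = 32, in agreement with the list above.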

<h2>Tree</h2>

[[Image:Top_tree.jpg|thumbnail|right|Tree]]

The tree is a hierarchical structure with nodes at the bottom and switching nodes at the upper levels.
 
* ''Diameter:'' 2log2(p) -- the path from a leaf through the root to the farthest leaf on the other side
* ''Bisection BW:'' 1 -- breaking either link to the root bisects the tree
* ''# Links:'' 2(p-1) -- there are 2 links below each router and there are p-1 routers if p is a power of 2.
* ''Degree:'' 3 -- interior routers have a degree of 3.
 
The tree experiences high traffic at the upper levels. Since almost half of the messages need to go through the root node, the root becomes the bottleneck of the tree topology. The other disadvantage of the tree topology is that the bisection bandwidth is only 1.

<h2>Fat Tree</h2>

[[Image:Top_fat_tree.jpg|thumbnail|right|Fat Tree]]
In order to improve the performance of the tree topology, the fat tree alleviates the traffic at upper levels by "fattening" up the links at the upper levels.
 
* ''Diameter:'' 2log2(p) -- same as a regular tree
* ''Bisection BW:'' p/2 -- all links to (one side of) the root must be cut to bisect the tree
* ''# Links:'' plog2(p) -- there are p links at each of log2(p) levels
* ''Degree:'' p -- the root node has p links through it.
 
The fat tree relieves the pressure on the root node, and the bisection bandwidth is also increased.

<h2>Butterfly</h2>

[[Image:Top_butterfly.jpg|thumbnail|right|Butterfly]]
 

The butterfly structure is similar to the tree structure, but it replicates the switching node structure of the tree topology and connects them together so that there are equal links and routers at all levels.
* ''Diameter:'' 2log2(p) -- same as a tree since butterfly has the same depth as a tree (just with p nodes at each level)
* ''Bisection BW:'' p -- p links connect the two halves at the top level
* ''# Links:'' 2p*log2(p) -- there are 2*p links at each level times log2(p) levels.
* ''Degree:'' 4 -- the routers in the middle levels all have 4 links. The leaves and routers at the top level each have 2 links.

 

Butterfly has similar performance to Hypercube. In terms of cost, butterfly has a smaller degree (so cheaper routers can be used) but hypercube has fewer links.

<h1>Real-World Implementation of Network Topologies </h1>

In a research study by Andy Hospodor and Ethan Miller, several network topologies were investigated in a high-performance, high-traffic network2. The topologies investigated included the fat tree, butterfly, mesh, torus, and hypercube structures. Advantages and disadvantages including cost, performance, and reliability were discussed.

In this experiment, a petabyte-scale network with over 100 GB/s total aggregate bandwidth was investigated. The network consisted of 4096 disks with large servers with routers and switches in between2.

[[Image:Disknet_network.jpg|frame|center|''Basic structure of Hospodor and Miller's experimental network''2]]

The overall structure of the network is shown above. Note that this structure is very susceptible to failure and congestion.

<h2>Fat Tree</h2>

In large-scale, high-performance applications, the fat tree can be a choice. However, in order to "fatten" up the links, redundant connections must be used. Instead of using one link between switching nodes, several must be used. The problem with this is that with more input and output links, one would need routers with more input and output ports. Routers with in excess of 100 ports are difficult to build and expensive, so multiple routers would have to be stacked together. Even then, the routers would be expensive and several of them would be required2.
 
 
The Japan Agency for Marine-Earth Science and Technology supercomputing system uses the fat tree topology. The system connects 1280 NEC processors7.

 

<h2>Butterfly</h2>

In high performance applications, the butterfly structure is a good choice. The butterfly topology uses fewer links than other topologies, however, each link carries traffic from the entire layer. Fault tolerance is poor. There exists only a single path between pairs of nodes. Should the link break, data cannot be re-routed, and communication is broken2.

[[Image:Disknet_butterfly.jpg|frame|center|''Butterfly structure''2]]

<h2>Meshes and Tori</h2>

The mesh and torus structures used in this application would require a large number of links and a total of several thousand ports. However, since there are so many links, the mesh and torus structures provide alternate paths in case of failures2.

[[Image:Disknet_mesh.jpg|frame|center|''Mesh structure''2]]

[[Image:Disknet_torus.jpg|frame|center|''Torus structure''2]]

Some examples of current use of the torus structure include the QPACE SFB TR Cluster in Germany, which uses PowerXCell 8i processors. The system uses a 3-D torus topology with 4608 processors7.

<h2>Hypercube</h2>

Similar to the torus structures, the hypercube requires a larger number of links. However, its bandwidth scales better than that of the mesh and torus structures.

The CRAY T3E, CRAY XT3, and SGI Origin 2000 use k-ary n-cube topologies.

[[Image:Disknet_hypercube.jpg|frame|center|''Hypercube structure''2]]

It is worth mentioning that even though many topologies have much better performance than the 2-D mesh, the cost of these advanced topologies is also high. Since most chips are laid out in 2-D space, it is very expensive to implement a high-dimensional topology on a 2-D chip. For the hypercube topology, increasing the number of nodes raises the degree of each node. For the butterfly topology, the degree increases relatively slowly, but the required number of links and switches increases rapidly.

<h1>Comparison of Network Topologies </h1>


The following table shows the total number of ports required for each network topology.
 
 
[[Image:Disknet_ports.jpg]]
 
''Number of ports for each topology''2

 

As the figure above shows, the 6-D hypercube requires the largest number of ports, due to its relatively complex six-dimensional structure. In contrast, the fat tree requires the fewest ports, even though its links have been "fattened" up by using redundant links. The butterfly network requires more than twice as many ports as the fat tree, since it essentially replicates the switching layer of the fat tree. The number of ports for the mesh and torus structures increases as the dimensionality increases. However, with modern router technology, the number of ports is a less important consideration.

 
 
The average path length, or average number of hops, and the average link load (GB/s) are shown below.
 
 
[[Image:Disknet_load.jpg]]
 
''Average path length and link load for each topology''2

 

Looking at the trends, when the average path length is high, the average link load is also high. In other words, average path length and average link load are proportionally related. It is obvious from the graph that the 2-D mesh has, by far, the worst performance. In a large network such as this, the average path length is just too high, and the average link load suffers. For this type of high-performance network, the 2-D mesh does not scale well. The 2-D torus cuts the average path length and average link load in half by connecting the edge nodes together; however, its performance compared to the other types is still relatively poor. The butterfly and fat tree have the lowest average path length and average link load.

 
 
The figure below shows the cost of the network topologies.
 
 
[[Image:Disknet_cost.jpg]]
 
''Cost of each topology''2

 

Despite using the fewest ports, the fat tree topology has the highest cost, by far. The ports it uses are high-bandwidth ports of 10 GB/s, and over 2400 of them are required to have enough bandwidth at the upper levels of the tree. This pushes the cost up dramatically, and makes the fat tree impractical from a cost standpoint. While the total cost of the fat tree is about 15 million dollars, the rest of the network topologies are clustered below 4 million dollars. As the dimensionality of the mesh and torus structures increases, the cost increases. The cost of the butterfly network falls between that of the 2-D mesh/torus and the 6-D hypercube.

 
 
When the cost and average link load are both factored in, the following graph is produced.
 
 
[[Image:Disknet_overall.jpg]]
 
''Overall cost of each topology''2

 

From the figure above, the 6-D hypercube is the most cost-effective choice for this particular network setup. Although the 6-D hypercube costs more because it needs more links and ports, it provides higher bandwidth, which can offset the higher cost. The high-dimensional tori also perform well, but cannot provide as much bandwidth as the 6-D hypercube. For systems that do not need as much bandwidth, a high-dimensional torus is also a good choice. The butterfly topology is also an alternative, but has lower fault tolerance.

However, when the number of nodes increases, the relative cost of the higher-dimensional topologies increases far faster than their relative performance when compared to a 2-D mesh. This is because the 2-D mesh only uses low-cost, short links. The higher-dimensional structures must be projected onto our 3-dimensional world, and thus require many long, expensive links that wrap around the outside of the system like an impenetrable tangle of jungle vines. Maintaining such a network is also quite slow and tedious.


<h1>Routing</h1>


The '''routing''' algorithm determines what path a packet of data will take from source to destination. Routing can be '''deterministic''', where the path is always the same for a given source and destination, or '''adaptive''', where the path can change. A routing algorithm can also be '''partially adaptive''', where packets have multiple choices but not all packets are allowed to use the shortest path3.


<h2>Deadlock</h2>

Packets are in '''deadlock''' when they cannot continue to move through the nodes. The illustration below demonstrates this event.
 
 
[[Image:Routing_deadlock.jpg]]
 
''Example of deadlock''
 
 

Assume that all of the buffers are full at each node. The packet from Node 1 cannot continue to Node 2, the packet from Node 2 cannot continue to Node 3, and so on. Since none of the packets can move, they are deadlocked.

 
 

Deadlock arises from a cyclic pattern of routing, in which each packet waits for a buffer held by the next packet in the cycle. To avoid deadlock, circular routing patterns must be avoided.

 
 

To avoid circular patterns of routing, some routing patterns are disallowed. These are called '''turn restrictions''', where some turns are not allowed in order to avoid making a circular routing pattern. Some of these turn restrictions are mentioned below.


 
<h2>Dimensional ordered (X-Y) routing</h2>

Turns from the y-dimension to the x-dimension are not allowed.


<h2>West First</h2>

Turns to the west are not allowed.


<h2>North Last</h2>

Turns after a north direction are not allowed.


<h2>Negative First</h2>

Turns from a positive direction (+x or +y) to a negative direction (-x or -y) are not allowed, so a packet must make all of its moves in the negative directions first.


<h2>Odd-Even Turn Model</h2>


Unfortunately, the above turn-restriction models reduce the degree of adaptiveness and are only partially adaptive. The models force some packets onto particular routes, which are not necessarily the minimal paths. This may cause unfairness and reduces the ability of the system to relieve congestion. Overall performance could suffer3.


 


Ge-Ming Chiu introduced the Odd-Even turn model as an adaptive, deadlock-free turn-restriction model that has better performance than the previously mentioned models3. The model is designed primarily for 2-D meshes.

 

''Turns from the east to north direction from any node on an even column are not allowed.''
 
''Turns from the north to west direction from any node on an odd column are not allowed.''
 
 
''Turns from the east to south direction from any node on an even column are not allowed.''
 
''Turns from the south to west direction from any node on an odd column are not allowed.''
 
 
 
The illustration below shows allowed routing for different source and destination nodes. Depending on which column the packet is in, only certain directions are allowed.
 
 
[[Image:Routing_odd_even.jpg]]
 
''Odd-Even turn restriction model proposed by Ge-Ming Chiu''3
 

 

<h1>Comparison of Turn Restriction Models</h1>

To compare the performance of various turn-restriction models, Chiu simulated a 15 x 15 mesh under various traffic patterns. All channels have a bandwidth of 20 flits/usec and a buffer size of one flit. The dimension-ordered x-y routing, west-first, and negative-first models were compared against the odd-even model.

 

Simulations were conducted with uniform, transpose, and hot spot traffic patterns. Uniform traffic simulates each node sending messages to any other node with equal probability. Transpose traffic simulates each node sending messages to the node at its transposed position in the mesh. Hot spot traffic simulates a few "hot spot" nodes that receive high traffic.



 
 
[[Image:Routing_uniform.jpg]]
 
''Uniform traffic simulation of various turn restriction models''3
 
 
The performance of the different routing algorithms is shown above for uniform traffic. For uniform traffic, the dimension-ordered x-y model outperforms the rest of the models. As the number of messages increases, the x-y model has the slowest increase in average communication latency.



 
 
[[Image:Routing_transpose.jpg]]
 
''First transpose traffic simulation of various turn restriction models''3
 
 
The performance of the different routing algorithms is shown above for the first transpose traffic. The negative-first model has the best performance, while the odd-even model performs better than the west-first and x-y models.



 
 
[[Image:Routing_transpose2.jpg]]
 
''Second transpose traffic simulation of various turn restriction models''3
 
 
With the second transpose simulation, the odd-even model outperforms the rest.



 
 
[[Image:Routing_hotspot.jpg]]
 
''Hotspot traffic simulation of various turn restriction models''3
 
 
The performance of the different routing algorithms is shown above for hotspot traffic. Only one hotspot was simulated for this test. The odd-even model outperforms the other models when hotspot traffic is 10%.



 
 
[[Image:Routing_hotspot2.jpg]]
 
''Second hotspot traffic simulation of various turn restriction models''3
 
 
When the number of hotspots is increased to five, the odd-even model begins to shine. Its latency is the lowest for both 6 and 8 percent hotspot traffic. Meanwhile, the x-y model performs very poorly.


 
 

While the x-y model performs well in uniform traffic, it lacks adaptiveness. When traffic concentrates on hotspots, the x-y model cannot adapt and re-route traffic to avoid the congestion that the hotspots cause. The odd-even model has superior adaptiveness under high congestion.


 

<h1>Router Architecture</h1>

The '''router''' is a device that routes incoming data to its destination. It has several input ports and several output ports. Data arriving on one of the input ports is routed to one of the output ports. Which output port is chosen depends on the destination of the data and on the routing algorithm.

 
 

The internal architecture of a router consists of input and output ports and a '''crossbar switch'''. The crossbar switch connects each input port to the output port selected for its data, acting essentially as a multiplexer.
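The role of the crossbar can be sketched in a few lines. This is a minimal model under stated assumptions: the function name <code>crossbar_switch</code> is hypothetical, and the fixed-priority arbitration (the lowest-numbered input wins a contested output) is an assumption made for illustration, not a description of any particular router.

```python
def crossbar_switch(requests, inputs):
    """Forward data from input ports to the output ports they request.

    requests maps input port -> requested output port.
    inputs maps input port -> data waiting at that port.
    Each output port can be driven by only one input at a time.
    """
    outputs = {}
    for in_port in sorted(requests):      # assumed fixed-priority arbitration
        out_port = requests[in_port]
        if out_port not in outputs:       # output still free: connect it
            outputs[out_port] = inputs[in_port]
    return outputs

# Inputs 0 and 2 both want output 2; input 1 wants output 0.
result = crossbar_switch({0: 2, 1: 0, 2: 2}, {0: "a", 1: "b", 2: "c"})
```

In this example input 2 loses the arbitration for output 2 and its data would have to wait in the input buffer for a later cycle.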

 
 

Router technology has improved significantly over the years. This has allowed networks with high dimensionality to become feasible. As shown in the real-world example above, high-dimensional tori and hypercubes are excellent topology choices for high-performance networks. The cost of high-performance, high-radix routers has contributed to the viability of these types of high-dimensionality networks. As the graph below shows, the bandwidth of routers has improved tremendously over a period of 10 years4.


 
[[Image:Router_bandwidth.jpg]]
 
''Bandwidth of various routers over 10 year period''4

 

Looking at the physical architecture and layout of routers, it is evident that the circuitry has become dramatically denser and more complex.


 
[[Image:Router_physical.jpg]]
 
''Router hardware over period of time''4
 

 
[[Image:Router_radix.jpg]]
 
''Radix and latency of routers over 10 year period''4
 
 

The '''radix''', or the number of ports of a router, has also increased. Current routers not only have a high radix but also lower latency than the previous generation. As radix increases, the latency remains steady.


 


As router technology improves, more complex, high-dimensionality topologies become possible.


 

<h1>Fault Tolerant Routing</h1>

Fault-tolerant routing means the successful routing of messages between any pair of non-faulty nodes in the presence of faulty components6. As the number of processors in a multiprocessor system and the data rates increase, reliable transmission of data in the event of a network fault becomes a major concern, which makes fault-tolerant routing algorithms important.


<h2>Fault Models</h2>

Faults in a network can be categorized in two types:
 
 
1.'''Transient Faults'''5: A transient fault is a temporary fault that lasts for a very short time. It can be caused, for example, by a change in the output of a flip-flop that leads to an invalid header. These faults can be minimized using error control coding. They are generally evaluated in terms of Bit Error Rate.
 
 
2.'''Permanent Faults'''5: A permanent fault is a fault that does not go away and causes permanent damage to the network. It could be due to damaged wires and associated circuitry. These faults are generally evaluated in terms of Mean Time Between Failures.


<h2>Fault Tolerance Mechanisms (for permanent faults)</h2>

Permanent faults can be handled using one of two mechanisms:
 
 
1.'''Static Mechanism''': In the static fault tolerance model, once a fault is detected, all the processes running in the system are stopped and the routing tables are emptied. The routing tables are then recalculated from the fault information to provide fault-free paths.
 
 
2.'''Dynamic Mechanisms''': In the dynamic fault tolerance model, the processes in the network are not completely stalled, and only the affected regions are repaired. Some of the methods to do this are:
 
 
a.'''Block Faults''': In this method, many of the healthy nodes in the vicinity of the faulty nodes are also marked as faulty so that no routes are created close to the actual faulty nodes. The shape of the region can be convex or non-convex, and care is taken that none of the new routes introduces a cyclic dependency in the channel dependency graph (CDG).
 
 
DISADVANTAGE: This method causes many healthy nodes to be declared faulty, which reduces system capacity.
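The capacity loss can be illustrated with a small sketch. It makes a simplifying assumption that is not taken from the cited work: all faulty nodes of a 2-D mesh are enclosed in a single rectangular (convex) block, and every node inside the block is treated as faulty. The function name <code>block_fault_region</code> is hypothetical.

```python
def block_fault_region(faulty):
    """Return every node inside the bounding rectangle of the faulty nodes."""
    xs = [x for x, _ in faulty]
    ys = [y for _, y in faulty]
    return {(x, y)
            for x in range(min(xs), max(xs) + 1)
            for y in range(min(ys), max(ys) + 1)}

faulty = {(1, 1), (2, 3)}             # two nodes that actually failed
region = block_fault_region(faulty)   # nodes the routing must avoid
sacrificed = region - faulty          # healthy nodes marked as faulty
```

Here two real faults disable a block of six nodes, so four healthy nodes are lost to the fault region.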
 
 
[[Image:Fault_pic1.jpg]]
 
 
b.'''Fault Rings''': This method was introduced by Chalasani and Boppana. A fault ring is a set of nodes and links that are adjacent to faulty nodes/links. This approach reduces the number of healthy nodes that must be marked as faulty and blocked.
 
 
[[Image:Fault_pic2.jpg]]
 


 

<h1>References</h1>


1 Solihin text
 
2 [http://www.ssrc.ucsc.edu/Papers/hospodor-mss04.pdf Interconnection Architectures for Petabyte-Scale High-Performance Storage Systems]
 
3 [http://www.diit.unict.it/~vcatania/COURSES/semm_05-06/DOWNLOAD/noc_routing02.pdf The Odd-Even Turn Model for Adaptive Routing]
 
4 [http://www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf Interconnection Topologies:(Historical Trends and Comparisons)]
This link is down at this time. [http://webcache.googleusercontent.com/search?q=cache:2f2KNFWJCsQJ:www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf+history+of+interconnection+topologies&cd=9&hl=en&ct=clnk&gl=us&client=ubuntu&source=www.google.com Google Cache]
 
5 [http://dspace.upv.es/xmlui/bitstream/handle/10251/2603/tesisUPV2824.pdf?sequence=1 Efficient mechanisms to provide fault tolerance in interconnection networks for PC clusters, José Miguel Montañana Aliaga.]
 
6 [http://web.ebscohost.com.www.lib.ncsu.edu:2048/ehost/pdfviewer/pdfviewer?vid=2&hid=15&sid=72e3828d-3cb1-42b9-8198-5c1e974ea53f@sessionmgr4 Adaptive Fault Tolerant Routing Algorithm for Tree-Hypercube Multicomputer, Qatawneh Mohammad]
 
7 [http://www.top500.org TOP500 Supercomputing Sites]
 
8 [http://www.redbooks.ibm.com/abstracts/sg245161.html?Open Understanding and Using the SP Switch]
 
9 [http://www.myri.com/myrinet/overview/ Myrinet Overview]
 
10 [http://en.wikipedia.org/wiki/QsNet QsNet (Quadrics' network)]
 
11 [http://www.google.com/research/pubs/pub35155.html Dragonfly Topology]
 
12 [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.95.573&rep=rep1&type=pdf Flattened Butterfly Topology]

CSC/ECE 506 Spring 2011/ch12 aj

2011-04-26T03:55:29Z

Asbransc:

<h1>Interconnection Network Architecture </h1>


In a multi-processor system, processors need to communicate with each other and access each other's resources. In order to route data and messages between processors, an interconnection architecture is needed.


 


Typically, in a multiprocessor system, messages passed between processors are frequent and short1. Therefore, the interconnection network architecture must handle messages quickly, with '''low latency''', and must handle several messages at a time, with '''high bandwidth'''.


 


In a network, a processor along with its cache and memory is considered a '''node'''. The physical wires that connect nodes are called '''links'''. The device that routes messages between nodes is called a router. The shape of the network, such as the number of links and routers, is called the network '''topology'''.


<h1>History of Network Topologies</h1>

Hypercube topologies were invented in the 1980s and had desirable characteristics when the number of nodes was small (~1000 maximum, often <100) and every processor had to stop working to receive and forward a message4. The low-radix era began in 1985 and was defined by routers with between 4 and 8 ports using toroidal, mesh, or fat-tree topologies and wormhole routing. This era lasted about 20 years, until it was determined that routers with dozens of ports offered superior performance. Two topologies, the flattened butterfly and the dragonfly, were developed to take advantage of the new high-radix routers. Both sit somewhere between a mesh in which each point is a router (or a virtual router in the case of dragonfly) with dozens or hundreds of nodes attached, and a fat tree with arity high enough to need only two levels.

<h2>Interconnection evolution in the Top500 List</h2>
[[Image:Top500interconnect.png|thumbnail|300px|left|]]

This chart shows the evolution over time of the different interconnect topologies by their dominance in the Top500 list of supercomputers7. As one can see, many technologies came into vogue briefly before losing performance share and disappearing. In the early days of the list, most of the computers list the interconnect type as not applicable. However, the trailing end of the hypercube phase is clear in burnt orange. The dark blue at the top is "other" and the dark red in the middle is "proprietary", so we can only speculate about what topologies they might employ. The toroidal mesh appears briefly at the start in a cream color and slightly outlasts the hypercube. The two crossbar technologies (blue and olive) followed the toroidal mesh. The fully-distributed crossbar died out quickly; the multi-stage crossbar lasted longer but was never dominant. The 3-D torus (purple) dominates much of the 90s, with hypercube topologies (dark pink) enjoying a short comeback in the later part of the decade. SP Switch (light olive), an IBM interconnect technology that uses a multi-stage crossbar switch, replaced the 3-D torus8. Myrinet, Quadrics, and Federation all shared the spotlight in the mid-2000s, and each used a similar fat-tree topology9,10. The current class of supercomputers is dominated by nodes connected with either Infiniband or gigabit ethernet. Both can be connected in either a fat-tree or 2-D mesh topology. The primary difference between them is speed. Infiniband is considerably faster per link and allows links to be ganged into groups of 4 or 12. Gigabit ethernet is vastly less expensive, however, and some supercomputer designers have apparently chosen to save money on the interconnect technology in order to allow the use of faster nodes11,12.

 
 
 
 

<h1>Types of Network Topologies </h1>

Several metrics are commonly used to represent the cost and performance of a topology. In this section, the degree, number of links, diameter, and bisection width are given for each topology.

<h2>Linear Array</h2>

[[Image:Top_linear.jpg|thumbnail|frame|right|]]

The nodes are connected linearly as in an array. This type of topology is simple; however, it does not scale well.
 
* ''Diameter:'' p-1
* ''Bisection BW:'' 1
* ''# Links:'' p-1
* ''Degree:'' 2
 
A linear array is the cheapest way to connect a group of nodes together. Its number of links and its degree are the smallest of any topology. However, the drawback of this topology is also obvious: the two end points are the farthest apart, which makes the diameter p-1. This topology is also not reliable, since the bisection bandwidth is 1 and cutting a single link partitions the network.

<h2>Ring</h2>
[[Image:Top_ring.jpg|thumbnail|right|]]
The ring has a structure similar to the linear array, except that the end nodes connect to each other, establishing a circular structure.
 
* ''Diameter:'' p/2
* ''Bisection BW:'' 2
* ''# Links:'' p
* ''Degree:'' 2
 
Compared with the linear array, the ring topology gains a relatively large improvement for little effort (only one added link). The longest distance between two nodes is cut in half, and the bisection bandwidth increases to 2.

<h2>2-D Mesh</h2>

[[Image:Top_2Dmesh.jpg|thumbnail|right|]]

The 2-D mesh can be thought of as several linear arrays put together to form a 2-dimensional structure. This topology is well suited to applications such as the ocean application and matrix calculations.
 
* ''Diameter:'' 2(sqrt(p)-1)
* ''Bisection BW:'' sqrt(p)
* ''# Links:'' 2sqrt(p)(sqrt(p)-1)
* ''Degree:'' 4
 
Nodes that are not on the edge have a '''degree''' of 4. To calculate the number of links, add the number of vertical links, sqrt(p)(sqrt(p)-1), to the number of horizontal links, also sqrt(p)(sqrt(p)-1), to get 2sqrt(p)(sqrt(p)-1). The diameter is the distance between two diagonally opposite corner nodes, which is the sum of two edges of length sqrt(p)-1.
 

<h2>2-D Torus</h2>

[[Image:Top_2Dtorus.jpg|thumbnail|right|]]

Just as the ring is formed from the linear array, the 2-D torus takes the structure of the 2-D mesh and connects the nodes on opposite edges.
 
* ''Diameter:'' sqrt(p)-1
* ''Bisection BW:'' 2sqrt(p)
* ''# Links:'' 2p
* ''Degree:'' 4
 
With the end-around connections, the longest distance is reduced and the bisection bandwidth is doubled. Of course, the cost also rises: the number of links grows from 2sqrt(p)(sqrt(p)-1) in the 2-D mesh to 2p in the 2-D torus.
 

<h2>Cube</h2>
[[Image:Top_cube.jpg|thumbnail|right|]]

If we add two more neighbors to each node, we get a cube. The cube can be thought of as a three-dimensional mesh.
 
* ''Diameter:'' 3(p^(1/3)-1)
* ''Bisection BW:'' p^(2/3)
* ''# Links:'' 3p^(2/3)(p^(1/3)-1), which is 3*2^3/2 = 12 for an 8-node cube
* ''Degree:'' 6 (from the inside nodes)
 
We can also extend some of the metrics to the N-dimensional mesh:
 
* ''Diameter:'' N(p^(1/N)-1)
* ''Bisection BW:'' p^((N-1)/N)
* ''# Links:'' N*p^((N-1)/N)*(p^(1/N)-1), which reduces to N*2^N/2 when each dimension has two nodes
 
The degree is harder to state in this case because it depends on a node's position in the mesh. Interestingly, the degree of the cube is 6, which is larger than the degree of 4 for the two-dimensional mesh.

<h2>Hypercube</h2>

[[Image:Top_hypercube.jpg|thumbnail|right|Hypercube]]

In the N-dimensional cube, the boundary nodes are normally the ones that hurt the performance of the entire network. We can fix this by connecting those boundary nodes together. The hypercube is essentially multiple cubes put together.
 
* ''Diameter:'' log2(p)
* ''Bisection BW:'' p/2
* ''# Links:'' p/2 * log2(p)
* ''Degree:'' log2(p)
 
From the metrics we can see that the diameter and bisection bandwidth are significantly improved for the higher-order topologies.
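To compare the topologies numerically, the formulas listed in the sections above can be collected into one helper. This is a sketch for illustration: the function name <code>topology_metrics</code> and the topology labels are assumptions, the 2-D mesh case assumes p is a perfect square, and the hypercube case assumes p is a power of two.

```python
import math

def topology_metrics(name, p):
    """Return (diameter, bisection width, number of links, degree) for p nodes."""
    if name == "linear":
        return (p - 1, 1, p - 1, 2)
    if name == "ring":
        return (p // 2, 2, p, 2)
    if name == "mesh2d":
        k = math.isqrt(p)              # nodes per side, assumes p = k * k
        return (2 * (k - 1), k, 2 * k * (k - 1), 4)
    if name == "hypercube":
        n = p.bit_length() - 1         # log2(p), assumes p is a power of two
        return (n, p // 2, n * p // 2, n)
    raise ValueError("unknown topology: " + name)

# Compare the four topologies for a 64-node system.
table = {name: topology_metrics(name, 64)
         for name in ("linear", "ring", "mesh2d", "hypercube")}
```

For 64 nodes the diameter falls from 63 for the linear array to 6 for the hypercube, while the number of links rises from 63 to 192, which is the cost and performance trade-off the metrics are meant to expose.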

<h2>Tree</h2>

[[Image:Top_tree.jpg|thumbnail|right|Tree]]

The tree is a hierarchical structure with nodes at the bottom and switching nodes at the upper levels.
 
* ''Diameter:'' 2log2(p) -- the path from a leaf through the root to the farthest leaf on the other side
* ''Bisection BW:'' 1 -- breaking either link to the root bisects the tree
* ''# Links:'' 2*p -- there are 2 links for each router and there are p routers.
* ''Degree:'' 3 -- interior routers have a degree of 3.
 
The tree experiences high traffic at the upper levels. Since almost half of the messages need to go through the root node, the root becomes the bottleneck of the tree topology. The other disadvantage of the tree topology is that its bisection bandwidth is only 1.

<h2>Fat Tree</h2>

[[Image:Top_fat_tree.jpg|thumbnail|right|Fat Tree]]
In order to improve the performance of the tree topology, the fat tree alleviates the traffic at upper levels by "fattening" up the links at the upper levels.
 
* ''Diameter:'' 2log2(p) -- same as a regular tree
* ''Bisection BW:'' p/2 -- all links to (one side of) the root must be cut to bisect the tree
* ''# Links:'' plog2(p) -- there are p links at each of log2(p) levels
* ''Degree:'' p -- the root node has p links through it.
 
The fat tree relieves the pressure on the root node, and the bisection bandwidth is also increased.

<h2>Butterfly</h2>

[[Image:Top_butterfly.jpg|thumbnail|right|Butterfly]]
 

The butterfly structure is similar to the tree structure, but it replicates the switching node structure of the tree topology and connects the copies together so that there are equal numbers of links and routers at all levels.
* ''Diameter:'' 2log2(p) -- same as a tree since butterfly has the same depth as a tree (just with p nodes at each level)
* ''Bisection BW:'' p -- p links connect the two halves at the top level
* ''# Links:'' 2p*log2(p) -- there are 2*p links at each level times log2(p) levels.
* ''Degree:'' 4 -- the routers in the middle levels all have 4 links. The leaves and routers at the top level each have 2 links.

 

Butterfly has similar performance to Hypercube. In terms of cost, butterfly has a smaller degree (so cheaper routers can be used) but hypercube has fewer links.

<h1>Real-World Implementation of Network Topologies </h1>

In a research study, Andy Hospodor and Ethan Miller investigated several network topologies for a high-performance, high-traffic network2. The topologies included the fat tree, butterfly, mesh, tori, and hypercube structures. Advantages and disadvantages including cost, performance, and reliability were discussed.

In this experiment, a petabyte-scale network with over 100 GB/s total aggregate bandwidth was investigated. The network consisted of 4096 disks and large servers, with routers and switches in between2.

[[Image:Disknet_network.jpg|frame|center|''Basic structure of Hospodor and Miller's experimental network''2]]

The overall structure of the network is shown above. Note that this structure is very susceptible to failure and congestion.

<h2>Fat Tree</h2>

In large-scale, high-performance applications, the fat tree can be a choice. However, in order to "fatten" up the links, redundant connections must be used: instead of one link between switching nodes, several must be used. The problem is that more input and output links require routers with more input and output ports. Routers with in excess of 100 ports are difficult to build and expensive, so multiple routers would have to be stacked together. Even then, the routers would be expensive, and several of them would be required2.
 
 
The Japan Agency for Marine-Earth Science and Technology supercomputing system uses the fat tree topology. The system connects 1280 NEC processors7.

 

<h2>Butterfly</h2>

In high-performance applications, the butterfly structure is a good choice. The butterfly topology uses fewer links than other topologies; however, each link carries traffic from the entire layer. Fault tolerance is poor because only a single path exists between any pair of nodes. Should a link break, data cannot be re-routed, and communication is broken2.

[[Image:Disknet_butterfly.jpg|frame|center|''Butterfly structure''2]]

<h2>Meshes and Tori</h2>

The mesh and torus structures used in this application would require a large number of links and a total of several thousand ports. However, since there are so many links, the mesh and torus structures provide alternate paths in case of failures2.

[[Image:Disknet_mesh.jpg|frame|center|''Mesh structure''2]]

[[Image:Disknet_torus.jpg|frame|center|''Torus structure''2]]

One example of current use of the torus structure is the QPACE SFB TR Cluster in Germany, which uses PowerXCell 8i processors. The system uses a 3-D torus topology with 4608 processors7.

<h2>Hypercube</h2>

Similar to the torus structures, the hypercube requires a larger number of links. However, its bandwidth scales better than that of the mesh and torus structures.

The CRAY T3E, CRAY XT3, and SGI Origin 2000 use k-ary n-cube topologies.

[[Image:Disknet_hypercube.jpg|frame|center|''Hypercube structure''2]]

It is worth mentioning that even though many topologies perform much better than the 2-D mesh, the cost of these advanced topologies is also high. Since most chips are laid out in 2-D space, it is very expensive to implement a high-dimensional topology on a 2-D chip. For the hypercube topology, increasing the number of nodes raises the degree of each node. For the butterfly topology, the degree grows relatively slowly, but the required number of links and switches increases rapidly.

<h1>Comparison of Network Topologies </h1>


The following table shows the total number of ports required for each network topology.
 
 
[[Image:Disknet_ports.jpg]]
 
''Number of ports for each topology''2

 

As the figure above shows, the 6-D hypercube requires the largest number of ports, due to its relatively complex six-dimensional structure. In contrast, the fat tree requires the fewest ports, even though links have been "fattened" up by using redundant links. The butterfly network requires more than twice the number of ports as the fat tree, since it essentially replicates the switching layer of the fat tree. The number of ports for the mesh and torus structures increases as the dimensionality increases. However, with modern router technology, the number of ports is a less important consideration.

 
 
The average path length (average number of hops) and the average link load (GB/s) are shown below.
 
 
[[Image:Disknet_load.jpg]]
 
''Average path length and link load for each topology''2

 

Looking at the trends, when the average path length is high, the average link load is also high. In other words, average link load rises with average path length. It is obvious from the graph that the 2-D mesh has, by far, the worst performance. In a large network such as this, the average path length is just too high, and the average link load suffers. For this type of high-performance network, the 2-D mesh does not scale well. The 2-D torus cuts the average path length and average link load in half by connecting the edge nodes together; however, its performance is still relatively poor compared to the other types. The butterfly and fat tree have the lowest average path length and average link load.

 
 
The figure below shows the cost of the network topologies.
 
 
[[Image:Disknet_cost.jpg]]
 
''Cost of each topology''2

 

Despite using the fewest ports, the fat tree topology has the highest cost, by far. The ports it uses are high-bandwidth ports of 10 GB/s, and over 2400 of them are required to have enough bandwidth at the upper levels of the tree. This pushes the cost up dramatically and makes the fat tree impractical from a cost standpoint. While the total cost of the fat tree is about 15 million dollars, the rest of the network topologies are clustered below 4 million dollars. As the dimensionality of the mesh and torus structures increases, the cost increases. The cost of the butterfly network falls between that of the 2-D mesh/torus and the 6-D hypercube.

 
 
When cost and average link load are combined, the following graph is produced.
 
 
[[Image:Disknet_overall.jpg]]
 
''Overall cost of each topology''2

 

From the figure above, the 6-D hypercube is the most cost-effective choice for this particular network setup. Although the 6-D hypercube costs more because it needs more links and ports, it provides higher bandwidth, which can offset the higher cost. The high-dimensional tori also perform well, but cannot provide as much bandwidth as the 6-D hypercube. For systems that do not need as much bandwidth, a high-dimensional torus is also a good choice. The butterfly topology is also an alternative, but has lower fault tolerance.

However, when the number of nodes increases, the relative cost of the higher-dimensional topologies increases far faster than their relative performance when compared to a 2-D mesh. This is because the 2-D mesh only uses low-cost, short links. The higher-dimensional structures must be projected onto our 3-dimensional world, and thus require many long, expensive links that wrap around the outside of the system like an impenetrable tangle of jungle vines. Maintaining such a network is also quite slow and tedious.


<h1>Routing</h1>


The '''routing''' algorithm determines what path a packet of data will take from source to destination. Routing can be '''deterministic''', where the path is always the same for a given source and destination, or '''adaptive''', where the path can change. The routing algorithm can also be '''partially adaptive''', where packets have multiple path choices but not all packets are allowed to use the shortest path3.


<h2>Deadlock</h2>

Packets are in '''deadlock''' when they cannot continue to move through the nodes. The illustration below demonstrates this event.
 
 
[[Image:Routing_deadlock.jpg]]
 
''Example of deadlock''
 
 

Assume that all of the buffers are full at each node. The packet from Node 1 cannot continue to Node 2, the packet from Node 2 cannot continue to Node 3, and so on. Since no packet can move, the packets are deadlocked.

 
 

Deadlock arises from a cyclic pattern of routing. To avoid deadlock, circular routing patterns must be avoided.
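The circular wait can be checked mechanically. The sketch below represents "the packet at node A is waiting for a buffer at node B" as a directed graph and looks for a cycle with a depth-first search. The function name <code>has_cycle</code> and the four-node example are assumptions made for illustration.

```python
def has_cycle(waits_for):
    """Return True if the waits-for graph contains a cycle (a deadlock).

    waits_for maps each node to the list of nodes whose buffers it needs.
    """
    IN_PROGRESS, DONE = 1, 2
    state = {}

    def visit(node):
        state[node] = IN_PROGRESS
        for nxt in waits_for.get(node, ()):
            if state.get(nxt) == IN_PROGRESS:   # reached a node on the current path
                return True
            if nxt not in state and visit(nxt):
                return True
        state[node] = DONE
        return False

    return any(node not in state and visit(node) for node in list(waits_for))

# Node 1 waits on 2, 2 on 3, 3 on 4, and 4 on 1: a circular wait.
deadlocked = has_cycle({1: [2], 2: [3], 3: [4], 4: [1]})
# Removing the last dependency breaks the cycle.
free = has_cycle({1: [2], 2: [3], 3: [4], 4: []})
```

Breaking any one dependency in the ring removes the deadlock, which is the effect that restricting turns is meant to achieve.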

 
 

To avoid circular patterns of routing, some routing patterns are disallowed. These are called '''turn restrictions''', where some turns are not allowed in order to avoid making a circular routing pattern. Some of these turn restrictions are mentioned below.


 
<h2>Dimensional ordered (X-Y) routing</h2>

Turns from the y-dimension to the x-dimension are not allowed.
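A minimal sketch of this rule on a 2-D mesh: the packet first travels along the x-dimension until its column matches the destination and only then travels along the y-dimension, so it never turns from y back to x. The function name <code>xy_route</code> and the (x, y) coordinate tuples are assumptions made for illustration.

```python
def xy_route(src, dst):
    """Return the list of nodes visited from src to dst using x-y routing."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:                 # finish all x-dimension hops first
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:                 # then move only in the y-dimension
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path

route = xy_route((0, 0), (2, 1))
```

Because the path is fully determined by the source and destination, this routing is deterministic.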


<h2>West First</h2>

Turns to the west are not allowed.


<h2>North Last</h2>

Turns after a north direction are not allowed.


<h2>Negative First</h2>

Turns in the negative direction (-x or -y) are not allowed, except on the first turn.


<h2>Odd-Even Turn Model</h2>


Unfortunately, the above turn-restriction models reduce the degree of adaptiveness and are only partially adaptive. The models cause some packets to take different routes, and not necessarily the minimal paths. This may cause unfairness and reduces the ability of the system to relieve congestion. Overall performance could suffer3.


 


Ge-Ming Chiu introduces the Odd-Even turn model as an adaptive, deadlock-free turn-restriction model that has better performance than the previously mentioned models3. The model is designed primarily for 2-D meshes.

 

''Turns from the east to north direction from any node on an even column are not allowed.''
 
''Turns from the north to west direction from any node on an odd column are not allowed.''
 
 
''Turns from the east to south direction from any node on an even column are not allowed.''
 
''Turns from the south to west direction from any node on an odd column are not allowed.''
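The four rules above can be encoded as a small lookup table. This is a sketch only: the function name <code>turn_allowed</code>, the single-letter direction codes, and the convention that columns are numbered from 0 are assumptions made for illustration.

```python
# (current heading, new heading) -> column parity at which the turn is forbidden
FORBIDDEN_TURNS = {
    ("E", "N"): 0,   # east to north: not allowed on even columns
    ("E", "S"): 0,   # east to south: not allowed on even columns
    ("N", "W"): 1,   # north to west: not allowed on odd columns
    ("S", "W"): 1,   # south to west: not allowed on odd columns
}

def turn_allowed(column, heading, new_heading):
    """Return True if a packet in the given column may make this turn."""
    parity = FORBIDDEN_TURNS.get((heading, new_heading))
    return parity is None or column % 2 != parity

checks = [
    turn_allowed(2, "E", "N"),   # even column, east to north
    turn_allowed(3, "E", "N"),   # odd column, east to north
    turn_allowed(3, "S", "W"),   # odd column, south to west
    turn_allowed(2, "W", "N"),   # a turn no rule restricts
]
```

A router could consult such a table for each candidate output direction and discard the directions that would require a forbidden turn.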
 
 
 
The illustration below shows allowed routing for different source and destination nodes. Depending on which column the packet is in, only certain directions are allowed.
 
 
[[Image:Routing_odd_even.jpg]]
 
''Odd-Even turn restriction model proposed by Ge-Ming Chiu''3
 

 

<h1>Comparison of Turn Restriction Models</h1>

To compare the performance of various turn restriction models, Chiu simulated a 15 x 15 mesh under various traffic patterns. All channels have a bandwidth of 20 flits/usec and a buffer size of one flit. The dimension-ordered x-y routing, west-first, and negative-first models were compared against the odd-even model.

 

Uniform, transpose, and hot spot traffic patterns were simulated. In uniform traffic, each node sends messages to any other node with equal probability. Transpose simulates two opposite nodes sending messages to their respective halves of the mesh. Hot spot simulates a few "hot spot" nodes that receive high traffic.



 
 
[[Image:Routing_uniform.jpg]]
 
''Uniform traffic simulation of various turn restriction models''3
 
 
The performance of the different routing algorithms under uniform traffic is shown above. For uniform traffic, the dimension-ordered x-y model outperforms the rest of the models. As the number of messages increases, the x-y model has the slowest increase in average communication latency.



 
 
[[Image:Routing_transpose.jpg]]
 
''First transpose traffic simulation of various turn restriction models''3
 
 
The performance of the different routing algorithms is shown above for the first transpose traffic. The negative-first model has the best performance, while the odd-even model performs better than the west-first and x-y models.



 
 
[[Image:Routing_transpose2.jpg]]
 
''Second transpose traffic simulation of various turn restriction models''3
 
 
With the second transpose simulation, the odd-even model outperforms the rest.



 
 
[[Image:Routing_hotspot.jpg]]
 
''Hotspot traffic simulation of various turn restriction models''3
 
 
The performance of the different routing algorithms under hotspot traffic is shown above. Only one hotspot was simulated for this test. The odd-even model outperforms the other models when hotspot traffic is 10%.



 
 
[[Image:Routing_hotspot2.jpg]]
 
''Second hotspot traffic simulation of various turn restriction models''3
 
 
When the number of hotspots is increased to five, the odd-even model begins to shine. Its latency is the lowest at both 6 and 8 percent hotspot traffic. Meanwhile, the x-y model performs very poorly.


 
 

While the x-y model performs well in uniform traffic, it lacks adaptiveness. When traffic concentrates on hotspots, the x-y model cannot adapt and re-route traffic to avoid the congestion that the hotspots cause. The odd-even model has superior adaptiveness under high congestion.


 

<h1>Router Architecture</h1>

The '''router''' is a device that routes incoming data to its destination. It has several input ports and several output ports. Data arriving on one of the input ports is routed to one of the output ports. Which output port is chosen depends on the destination of the data and on the routing algorithm.

 
 

The internal architecture of a router consists of input and output ports and a '''crossbar switch'''. The crossbar switch connects each input port to the output port selected for its data, acting essentially as a multiplexer.

 
 

Router technology has improved significantly over the years. This has allowed networks with high dimensionality to become feasible. As shown in the real-world example above, high-dimensional tori and hypercubes are excellent topology choices for high-performance networks. The cost of high-performance, high-radix routers has contributed to the viability of these types of high-dimensionality networks. As the graph below shows, the bandwidth of routers has improved tremendously over a period of 10 years4.


 
[[Image:Router_bandwidth.jpg]]
 
''Bandwidth of various routers over 10 year period''4

 

Looking at the physical architecture and layout of routers, it is evident that the circuitry has become dramatically denser and more complex.


 
[[Image:Router_physical.jpg]]
 
''Router hardware over period of time''4
 

 
[[Image:Router_radix.jpg]]
 
''Radix and latency of routers over 10 year period''4
 
 

The '''radix''', or the number of ports of a router, has also increased. Current routers not only have a high radix but also lower latency than the previous generation. As radix increases, the latency remains steady.


 


As router technology improves, more complex, high-dimensionality topologies become possible.


 

<h1>Fault Tolerant Routing</h1>

Fault-tolerant routing means the successful routing of messages between any pair of non-faulty nodes in the presence of faulty components6. As the number of processors in a multiprocessor system and the data rates increase, reliable transmission of data in the event of a network fault becomes a major concern, which makes fault-tolerant routing algorithms important.


<h2>Fault Models</h2>

Faults in a network can be categorized in two types:
 
 
1.'''Transient Faults'''5: A transient fault is a temporary fault that lasts for a very short time. It can be caused, for example, by a change in the output of a flip-flop that leads to an invalid header. These faults can be minimized using error control coding. They are generally evaluated in terms of Bit Error Rate.
 
 
2.'''Permanent Faults'''5: A permanent fault is a fault that does not go away and causes permanent damage to the network. It could be due to damaged wires and associated circuitry. These faults are generally evaluated in terms of Mean Time Between Failures.


<h2>Fault Tolerance Mechanisms (for permanent faults)</h2>

Permanent faults can be handled using one of two mechanisms:
 
 
1.'''Static Mechanism''': In the static fault tolerance model, once a fault is detected, all the processes running in the system are stopped and the routing tables are emptied. The routing tables are then recalculated from the fault information to provide fault-free paths.
 
 
2.'''Dynamic Mechanisms''': In dynamic fault tolerance model, it is made sure that the operation of the processes in the network is not completely stalled and only the affected regions are provided cure. Some of the methods to do this are:
 
 
a.'''Block Faults''': In this method many of the healthy nodes in vicinity of the faulty nodes are marked as faulty nodes so that no routes are created close to the actual faulty nodes. The shape of the region could be convex or non-convex, and is made sure that none of the new routes introduce cyclic dependency in the cyclic dependency graph (CDG).
 
 
DISADVANTAGE: This method causes lot of healthy nodes to be declared as faulty leading to reduction in system capacity.
 
 
[[Image:Fault_pic1.jpg]]
 
 
b. '''Fault rings''': This method was introduced by Chalasani and Boppana. A fault ring is the set of nodes and links adjacent to the faulty nodes or links. This approach reduces the number of healthy nodes that must be marked as faulty and blocked.
 
 
[[Image:Fault_pic2.jpg]]
 


 

<h1>References</h1>


1 Solihin text
 
2 [http://www.ssrc.ucsc.edu/Papers/hospodor-mss04.pdf Interconnection Architectures for Petabyte-Scale High-Performance Storage Systems]
 
3 [http://www.diit.unict.it/~vcatania/COURSES/semm_05-06/DOWNLOAD/noc_routing02.pdf The Odd-Even Turn Model for Adaptive Routing]
 
4 [http://www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf Interconnection Topologies:(Historical Trends and Comparisons)]
This link is down at this time. [http://webcache.googleusercontent.com/search?q=cache:2f2KNFWJCsQJ:www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf+history+of+interconnection+topologies&cd=9&hl=en&ct=clnk&gl=us&client=ubuntu&source=www.google.com Google Cache]
 
5 [http://dspace.upv.es/xmlui/bitstream/handle/10251/2603/tesisUPV2824.pdf?sequence=1 Efficient mechanisms to provide fault tolerance in interconnection networks for PC clusters, José Miguel Montañana Aliaga.]
 
6 [http://web.ebscohost.com.www.lib.ncsu.edu:2048/ehost/pdfviewer/pdfviewer?vid=2&hid=15&sid=72e3828d-3cb1-42b9-8198-5c1e974ea53f@sessionmgr4 Adaptive Fault Tolerant Routing Algorithm for Tree-Hypercube Multicomputer, Qatawneh Mohammad]
 
7 [http://www.top500.org TOP500 Supercomputing Sites]
 
8 [http://www.redbooks.ibm.com/abstracts/sg245161.html?Open Understanding and Using the SP Switch]
 
9 [http://www.myri.com/myrinet/overview/ Myrinet Overview]
 
10 [http://en.wikipedia.org/wiki/QsNet QsNet (Quadrics' network)]
 
11 [http://www.google.com/research/pubs/pub35155.html Dragonfly Topology]
 
12 [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.95.573&rep=rep1&type=pdf Flattened Butterfly Topology]

CSC/ECE 506 Spring 2011/ch12 aj

2011-04-26T03:54:58Z

Asbransc:

<h1>Interconnection Network Architecture </h1>


In a multi-processor system, processors need to communicate with each other and access each other's resources. In order to route data and messages between processors, an interconnection architecture is needed.


 


Typically, in a multiprocessor system, messages passed between processors are frequent and short<sup>1</sup>. Therefore, the interconnection network must handle each message quickly, with '''low latency''', and must handle several messages at a time, with '''high bandwidth'''.


 


In a network, a processor along with its cache and memory is considered a '''node'''. The physical wires that connect nodes are called '''links'''. The device that routes messages between nodes is called a '''router'''. The shape of the network, such as the number of links and routers, is called the network '''topology'''.


<h1>History of Network Topologies</h1>

Hypercube topologies were invented in the 1980s and had desirable characteristics when the number of nodes was small (~1000 maximum, often <100) and every processor had to stop working to receive and forward each message<sup>4</sup>. The low-radix era began in 1985 and was defined by routers with between 4 and 8 ports using toroidal, mesh, or fat-tree topologies and wormhole routing. This era lasted about 20 years, until it was determined that routers with dozens of ports offered superior performance. Two topologies were developed to take advantage of the new high-radix routers: the flattened butterfly and the dragonfly. Both fall somewhere between a mesh in which each point is a router (or a virtual router, in the case of the dragonfly) with dozens or hundreds of nodes attached, and a fat tree with high enough arity to need only two levels.

<h2>Interconnection evolution in the Top500 List</h2>
[[Image:Top500interconnect.png|thumbnail|300px|left|]]

This chart shows the evolution over time of the different interconnect topologies by their dominance in the Top500 list of supercomputers<sup>7</sup>. As one can see, many technologies came into vogue briefly before losing performance share and disappearing. In the early days of the list, most of the computers listed their interconnect type as not applicable. However, the trailing end of the hypercube phase is clear in burnt orange. The dark blue at the top is "other" and the dark red in the middle is "proprietary", so we can only speculate about what topologies they might employ. The toroidal mesh appears briefly at the start in a cream color and slightly outlasts the hypercube. The two crossbar technologies (blue and olive) followed the toroidal mesh. The fully-distributed crossbar died out quickly, while the multi-stage crossbar lasted longer but was never dominant.

The 3-D torus (purple) dominated much of the 1990s, with hypercube topologies (dark pink) enjoying a short comeback in the later part of the decade. SP Switch (light olive), an IBM interconnect technology that uses a multi-stage crossbar switch, replaced the 3-D torus<sup>8</sup>. Myrinet, Quadrics, and Federation shared the spotlight in the mid-2000s; each used a similar fat-tree topology<sup>9,10</sup>.

The current class of supercomputers is dominated by nodes connected with either InfiniBand or Gigabit Ethernet. Both can be connected in either a fat-tree or 2-D mesh topology. The primary difference between them is speed: InfiniBand is considerably faster per link and allows links to be ganged into groups of 4 or 12. Gigabit Ethernet is vastly less expensive, however, and some supercomputer designers have apparently chosen to save money on the interconnect in order to allow the use of faster nodes<sup>11,12</sup>.

 
 
 
 

<h1>Types of Network Topologies </h1>

Several metrics are commonly used to represent the cost and performance of a topology. In this section, the degree, number of links, diameter, and bisection width are given for each topology.

<h2>Linear Array</h2>

[[Image:Top_linear.jpg|thumbnail|frame|right|]]

The nodes are connected linearly, as in an array. This topology is simple, but it does not scale well.
 
* ''Diameter:'' p-1
* ''Bisection BW:'' 1
* ''# Links:'' p-1
* ''Degree:'' 2
 
A linear array is the cheapest way to connect a group of nodes. Its number of links and its degree are the smallest of any topology. The drawback is equally clear: the two end nodes are the farthest apart, which makes the diameter p-1. The topology is also unreliable, since with a bisection bandwidth of 1 a single link failure splits the network.

<h2>Ring</h2>
[[Image:Top_ring.jpg|thumbnail|right|]]
The ring has the same structure as the linear array, except that the two end nodes are connected to each other, forming a circle.
 
* ''Diameter:'' p/2
* ''Bisection BW:'' 2
* ''# Links:'' p
* ''Degree:'' 2
 
Compared with the linear array, the ring achieves a relatively large improvement for very little effort (one added link). The longest distance between two nodes is cut in half, and the bisection bandwidth increases to 2.

<h2>2-D Mesh</h2>

[[Image:Top_2Dmesh.jpg|thumbnail|right|]]

The 2-D mesh can be thought of as several linear arrays put together to form a 2-dimensional structure. This topology is well suited to applications such as the ocean application and matrix calculations.
 
* ''Diameter:'' 2(sqrt(p)-1)
* ''Bisection BW:'' sqrt(p)
* ''# Links:'' 2sqrt(p)(sqrt(p)-1)
* ''Degree:'' 4
 
Nodes that are not on the edge have a '''degree''' of 4. To calculate the number of links, add the number of vertical links, sqrt(p)(sqrt(p)-1), to the number of horizontal links, also sqrt(p)(sqrt(p)-1), to get 2sqrt(p)(sqrt(p)-1). The diameter is the distance between two diagonally opposite corner nodes, which is the sum of two edges of length sqrt(p)-1.
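
The mesh formulas can be sanity-checked by building a small mesh and measuring it directly. The following Python sketch is illustrative only; the names <code>mesh_links</code>, <code>diameter</code>, and <code>mesh_metrics</code> are hypothetical helpers written for this example and do not come from any library.

```python
from collections import deque
from math import isqrt


def mesh_links(k):
    """Return the set of undirected links of a k x k mesh (no wraparound)."""
    links = set()
    for x in range(k):
        for y in range(k):
            if x + 1 < k:
                links.add(((x, y), (x + 1, y)))  # horizontal link
            if y + 1 < k:
                links.add(((x, y), (x, y + 1)))  # vertical link
    return links


def diameter(nodes, links):
    """Longest shortest path, in hops, found by a BFS from every node."""
    adj = {n: [] for n in nodes}
    for a, b in links:
        adj[a].append(b)
        adj[b].append(a)
    best = 0
    for src in nodes:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            cur = queue.popleft()
            for nxt in adj[cur]:
                if nxt not in dist:
                    dist[nxt] = dist[cur] + 1
                    queue.append(nxt)
        best = max(best, max(dist.values()))
    return best


def mesh_metrics(p):
    """Measure (diameter, link count, maximum degree) of a sqrt(p) x sqrt(p) mesh."""
    k = isqrt(p)
    assert k * k == p, "p must be a perfect square"
    nodes = [(x, y) for x in range(k) for y in range(k)]
    links = mesh_links(k)
    degree = {n: 0 for n in nodes}
    for a, b in links:
        degree[a] += 1
        degree[b] += 1
    return diameter(nodes, links), len(links), max(degree.values())
```

For p = 16, <code>mesh_metrics(16)</code> returns <code>(6, 24, 4)</code>, which agrees with 2(sqrt(p)-1) = 6, 2sqrt(p)(sqrt(p)-1) = 24, and an interior degree of 4.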
 

<h2>2-D Torus</h2>

[[Image:Top_2Dtorus.jpg|thumbnail|right|]]

As with the step from the linear array to the ring, the 2-D torus takes the structure of the 2-D mesh and connects the nodes on opposite edges.
 
* ''Diameter:'' sqrt(p)-1 (for odd sqrt(p); sqrt(p) when sqrt(p) is even)
* ''Bisection BW:'' 2sqrt(p)
* ''# Links:'' 2p
* ''Degree:'' 4
 
With the end-around connections, the longest distance is cut roughly in half and the bisection bandwidth doubles. The cost in links is modest: the torus adds only 2sqrt(p) links to the 2sqrt(p)(sqrt(p)-1) links of the mesh, although the wraparound links are longer than the others.
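
The effect of the wraparound links on distance can be sketched in a few lines of Python. This is a minimal illustration, assuming a k x k torus with k = sqrt(p) and nodes addressed by (x, y) coordinates; the function names are hypothetical. In each dimension a packet may travel either way around the ring, so the hop count is the shorter of the two directions.

```python
def ring_distance(a, b, k):
    """Hops between positions a and b on a ring of k nodes (shorter way around)."""
    d = abs(a - b) % k
    return min(d, k - d)


def torus_distance(src, dst, k):
    """Minimal hop count between two (x, y) nodes of a k x k torus."""
    return ring_distance(src[0], dst[0], k) + ring_distance(src[1], dst[1], k)


def torus_diameter(k):
    """Largest minimal hop count; by symmetry it is enough to start at (0, 0)."""
    return max(torus_distance((0, 0), (x, y), k)
               for x in range(k) for y in range(k))
```

For a 5 x 5 torus (p = 25) the diameter is 4, that is sqrt(p)-1, compared with 2(sqrt(p)-1) = 8 for the 5 x 5 mesh. In general the measured value is 2*floor(k/2), which equals sqrt(p)-1 when k is odd and sqrt(p) when k is even.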
 

<h2>Cube</h2>
[[Image:Top_cube.jpg|thumbnail|right|]]

If we add two more neighbors to each node, we get a cube. The cube can be thought of as a three-dimensional mesh.
 
* ''Diameter:'' 3(p^(1/3)-1)
* ''Bisection BW:'' p^(2/3)
* ''# Links:'' 3p^(2/3)(p^(1/3)-1)
* ''Degree:'' 6 (for the interior nodes)
 
We can also extend these metrics to the N-dimensional mesh:
 
* ''Diameter:'' N(p^(1/N)-1)
* ''Bisection BW:'' p^((N-1)/N)
* ''# Links:'' Np^((N-1)/N)(p^(1/N)-1)
 
Interior nodes of an N-dimensional mesh have two neighbors in each dimension, for a degree of 2N. The degree of the cube is therefore 6, which is larger than the degree of 4 of the two-dimensional mesh.

<h2>Hypercube</h2>

[[Image:Top_hypercube.jpg|thumbnail|right|Hypercube]]

In the N-dimensional cube, the boundary nodes are normally the ones that hurt the performance of the entire network. We can fix this by connecting the boundary nodes together. The hypercube is essentially multiple cubes put together.
 
* ''Diameter:'' log2(p)
* ''Bisection BW:'' p/2
* ''# Links:'' p/2 * log2(p)
* ''Degree:'' log2(p)
 
As the metrics show, the diameter and bisection bandwidth are significantly better for the higher-order topologies.
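
The hypercube metrics follow directly from binary node addresses. The sketch below is a minimal illustration, assuming p = 2^dim nodes labeled 0 to p-1 so that two nodes are linked exactly when their labels differ in one bit; the function names are hypothetical.

```python
def hypercube_neighbors(node, dim):
    """Neighbors of a node: flip each of the dim address bits in turn."""
    return [node ^ (1 << bit) for bit in range(dim)]


def hypercube_distance(a, b):
    """Minimal hop count: the number of address bits in which a and b differ."""
    return bin(a ^ b).count("1")


def hypercube_link_count(dim):
    """Count links by summing node degrees and halving (each link is seen twice)."""
    p = 1 << dim
    return sum(len(hypercube_neighbors(n, dim)) for n in range(p)) // 2
```

With dim = 4 (p = 16), every node has log2(p) = 4 neighbors, the farthest pair of nodes (0 and 15) is 4 hops apart, and there are p/2 * log2(p) = 32 links.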

<h2>Tree</h2>

[[Image:Top_tree.jpg|thumbnail|right|Tree]]

The tree is a hierarchical structure with nodes at the bottom and switching nodes at the upper levels.
 
* ''Diameter:'' 2log2(p) -- the path from a leaf through the root to the farthest leaf on the other side
* ''Bisection BW:'' 1 -- breaking either link to the root bisects the tree
* ''# Links:'' 2(p-1) -- a binary tree with p leaf nodes has p-1 routers, each with 2 links to its children.
* ''Degree:'' 3 -- interior routers have a degree of 3.
 
The tree experiences high traffic at the upper levels. Since almost half of the messages must go through the root node, the root becomes the bottleneck of the tree topology. The other disadvantage of the tree topology is that its bisection bandwidth is only 1.

<h2>Fat Tree</h2>

[[Image:Top_fat_tree.jpg|thumbnail|right|Fat Tree]]
In order to improve the performance of the tree topology, the fat tree alleviates the traffic at upper levels by "fattening" up the links at the upper levels.
 
* ''Diameter:'' 2log2(p) -- same as a regular tree
* ''Bisection BW:'' p/2 -- all links to (one side of) the root must be cut to bisect the tree
* ''# Links:'' plog2(p) -- there are p links at each of log2(p) levels
* ''Degree:'' p -- the root node has p links through it.
 
The fat tree relieves the pressure on the root node, and the bisection bandwidth also increases.

<h2>Butterfly</h2>

[[Image:Top_butterfly.jpg|thumbnail|right|Butterfly]]
 

The butterfly structure is similar to the tree structure, but it replicates the switching node structure of the tree topology and connects the copies together so that there are equal numbers of links and routers at every level.
* ''Diameter:'' 2log2(p) -- same as a tree since butterfly has the same depth as a tree (just with p nodes at each level)
* ''Bisection BW:'' p -- p links connect the two halves at the top level
* ''# Links:'' 2p*log2(p) -- there are 2*p links at each level times log2(p) levels.
* ''Degree:'' 4 -- the routers in the middle levels all have 4 links. The leaves and routers at the top level each have 2 links.

 

The butterfly has similar performance to the hypercube. In terms of cost, the butterfly has a smaller degree (so cheaper routers can be used), but the hypercube has fewer links.

<h1>Real-World Implementation of Network Topologies </h1>

In a research study by Andy Hospodor and Ethan Miller, several network topologies were investigated for a high-performance, high-traffic network<sup>2</sup>, including the fat tree, butterfly, mesh, torus, and hypercube structures. Their advantages and disadvantages, including cost, performance, and reliability, were discussed.

In this study, a petabyte-scale network with over 100 GB/s total aggregate bandwidth was investigated. The network consisted of 4096 disks and large servers, with routers and switches in between<sup>2</sup>.

[[Image:Disknet_network.jpg|frame|center|''Basic structure of Hospodor and Miller's experimental network''2]]

The overall structure of the network is shown above. Note that this structure is very susceptible to failure and congestion.

<h2>Fat Tree</h2>

In large-scale, high-performance applications, the fat tree can be a choice. However, in order to "fatten" the links, redundant connections must be used: instead of one link between switching nodes, several are needed. With more input and output links, routers need more input and output ports. Routers with in excess of 100 ports are difficult to build and expensive, so multiple routers would have to be stacked together. Even then, the routers would be expensive, and several of them would be required<sup>2</sup>.
 
 
The Japan Agency for Marine-Earth Science and Technology supercomputing system uses the fat-tree topology. The system connects 1280 NEC processors<sup>7</sup>.

 

<h2>Butterfly</h2>

In high-performance applications, the butterfly structure is a good choice. The butterfly topology uses fewer links than other topologies; however, each link carries traffic from the entire layer. Fault tolerance is poor: there is only a single path between any pair of nodes, so if a link breaks, data cannot be re-routed and communication is broken<sup>2</sup>.

[[Image:Disknet_butterfly.jpg|frame|center|''Butterfly structure''2]]

<h2>Meshes and Tori</h2>

The mesh and torus structures used in this application would require a large number of links and a total of several thousand ports. However, because there are so many links, the mesh and torus structures provide alternate paths in case of failures<sup>2</sup>.

[[Image:Disknet_mesh.jpg|frame|center|''Mesh structure''2]]

[[Image:Disknet_torus.jpg|frame|center|''Torus structure''2]]

One current example of the torus structure is the QPACE SFB TR Cluster in Germany, which uses PowerXCell 8i processors. The system uses a 3-D torus topology with 4608 processors<sup>7</sup>.

<h2>Hypercube</h2>

Similar to the torus structures, the hypercube requires a larger number of links. However, its bandwidth scales better than that of the mesh and torus structures.

The CRAY T3E, CRAY XT3, and SGI Origin 2000 use k-ary n-cube topologies.

[[Image:Disknet_hypercube.jpg|frame|center|''Hypercube structure''2]]

It is worth mentioning that even though many topologies perform much better than the 2-D mesh, the cost of these advanced topologies is also high. Since most chips are laid out in 2-D space, it is very expensive to implement a high-dimensional topology on a 2-D chip. For the hypercube topology, increasing the number of nodes increases the degree of every node. For the butterfly topology, the degree grows relatively slowly, but the required number of links and switches increases rapidly.

<h1>Comparison of Network Topologies </h1>


The following figure shows the total number of ports required for each network topology.
 
 
[[Image:Disknet_ports.jpg]]
 
''Number of ports for each topology''2

 

As the figure above shows, the 6-D hypercube requires the largest number of ports, due to its relatively complex six-dimensional structure. In contrast, the fat tree requires the fewest ports, even though its links have been "fattened" with redundant links. The butterfly network requires more than twice as many ports as the fat tree, since it essentially replicates the switching layer of the fat tree. The number of ports for the mesh and torus structures increases as the dimensionality increases. However, with modern router technology, the number of ports is a less important consideration.

 
 
The average path length (the average number of hops) and the average link load (GB/s) are shown below.
 
 
[[Image:Disknet_load.jpg]]
 
''Average path length and link load for each topology''2

 

Looking at the trends, when the average path length is high, the average link load is also high; in other words, the two are roughly proportional. It is clear from the graph that the 2-D mesh has, by far, the worst performance. In a network this large, its average path length is simply too high, and the average link load suffers. For this type of high-performance network, the 2-D mesh does not scale well. The 2-D torus cuts the average path length and average link load in half by connecting the edge nodes together, but its performance is still relatively poor compared to the other types. The butterfly and fat tree have the lowest average path length and average link load.

 
 
The figure below shows the cost of the network topologies.
 
 
[[Image:Disknet_cost.jpg]]
 
''Cost of each topology''2

 

Despite using the fewest ports, the fat-tree topology has by far the highest cost. Its ports are high-bandwidth, 10 GB/s ports, and over 2400 of them are required to provide enough bandwidth at the upper levels of the tree. This pushes the cost up dramatically and makes the fat tree impractical from a cost standpoint. While the total cost of the fat tree is about 15 million dollars, the other network topologies are clustered below 4 million dollars. As the dimensionality of the mesh and torus structures increases, the cost increases. The cost of the butterfly network falls between that of the 2-D mesh and torus and that of the 6-D hypercube.

 
 
When cost and average link load are both factored in, the following graph is produced.
 
 
[[Image:Disknet_overall.jpg]]
 
''Overall cost of each topology''2

 

From the figure above, the 6-D hypercube is the most cost-effective choice for this particular network setup. Although the 6-D hypercube costs more because it needs more links and ports, it provides higher bandwidth, which can offset the higher cost. The high-dimensional tori also perform well but cannot provide as much bandwidth as the 6-D hypercube. For systems that do not need as much bandwidth, a high-dimensional torus is also a good choice. The butterfly topology is another alternative, but it has lower fault tolerance.

However, when the number of nodes increases, the relative cost of the higher-dimensional topologies increases far faster than their relative performance when compared to a 2-D mesh. This is because the 2-D mesh only uses low-cost, short links. The higher-dimensional structures must be projected onto our 3-dimensional world, and thus require many long, expensive links that wrap around the outside of the system like an impenetrable tangle of jungle vines. Maintaining such a network is also quite slow and tedious.


<h1>Routing</h1>


The '''routing''' algorithm determines the path a packet of data takes from source to destination. Routing can be '''deterministic''', where the path is always the same for a given source and destination, or '''adaptive''', where the path can change. A routing algorithm can also be '''partially adaptive''', where packets have multiple choices of path but not all packets are allowed to use the shortest path<sup>3</sup>.


<h2>Deadlock</h2>

Packets are in '''deadlock''' when they cannot continue to move through the nodes. The illustration below demonstrates this situation.
 
 
[[Image:Routing_deadlock.jpg]]
 
''Example of deadlock''
 
 

Assume that all of the buffers are full at each node. The packet from Node 1 cannot continue to Node 2, the packet from Node 2 cannot continue to Node 3, and so on. Since no packet can move, the packets are deadlocked.




Deadlock arises from a cyclic pattern of routing. To avoid deadlock, circular routing patterns must be avoided.
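
Whether a set of waiting packets can deadlock can be checked by looking for a cycle in a "waits for" graph. The following Python sketch is a minimal illustration, not a routing implementation: the dictionary <code>waits_for</code> is an assumed representation in which each key is a node whose full buffer is waiting on the nodes listed as its value, and <code>has_cycle</code> is a hypothetical helper that runs a depth-first search.

```python
def has_cycle(waits_for):
    """Return True if the directed 'waits for' graph contains a cycle."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, finished
    color = {}

    def visit(node):
        color[node] = GREY
        for nxt in waits_for.get(node, ()):
            state = color.get(nxt, WHITE)
            if state == GREY:
                return True  # reached a node already on the current path
            if state == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color.get(n, WHITE) == WHITE and visit(n) for n in waits_for)


# Four nodes, each waiting on the next: a circular wait, hence deadlock.
ring_wait = {1: [2], 2: [3], 3: [4], 4: [1]}
# Same chain, but node 4 can drain its buffer, so no circular wait exists.
chain_wait = {1: [2], 2: [3], 3: [4], 4: []}
```

Here <code>has_cycle(ring_wait)</code> is <code>True</code> and <code>has_cycle(chain_wait)</code> is <code>False</code>. Breaking any one dependency in the ring is enough to remove the deadlock.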

 
 

To avoid circular routing patterns, some turns are disallowed. These '''turn restrictions''' prevent packets from completing a circular routing pattern. Some turn-restriction models are described below.


 
<h2>Dimensional ordered (X-Y) routing</h2>

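Turns from the y-dimension to the x-dimension are not allowed. A packet therefore completes all of its hops in the x-dimension before making any hops in the y-dimension.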
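
A minimal Python sketch of dimension-ordered routing on a 2-D mesh follows. It assumes nodes are addressed by (x, y) coordinates; the function name <code>xy_route</code> is hypothetical. Because the x offset is fully corrected before the y offset is touched, the path never contains a turn from the y-dimension back to the x-dimension.

```python
def xy_route(src, dst):
    """Return the list of nodes visited from src to dst, x-dimension first."""
    x, y = src
    path = [src]
    while x != dst[0]:  # correct the x offset first
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:  # then correct the y offset
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path
```

For example, <code>xy_route((0, 0), (2, 1))</code> returns <code>[(0, 0), (1, 0), (2, 0), (2, 1)]</code>. The route is deterministic: the same source and destination always produce the same path.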


<h2>West First</h2>

Turns to the west are not allowed, so a packet that needs to travel west must do so first.


<h2>North Last</h2>

Turns are not allowed once a packet is traveling north, so any hops to the north must come last.


<h2>Negative First</h2>

Turns from a positive direction to a negative direction (-x or -y) are not allowed, so a packet must make all of its hops in the negative directions first.


<h2>Odd-Even Turn Model</h2>


Unfortunately, the above turn-restriction models reduce the degree of adaptiveness and are only partially adaptive. The models cause some packets to take different routes, and not necessarily the minimal paths. This may cause unfairness and reduces the ability of the system to relieve congestion. Overall performance could suffer<sup>3</sup>.


 


Ge-Ming Chiu introduced the Odd-Even turn model, an adaptive, deadlock-free turn-restriction model that performs better than the previously mentioned models<sup>3</sup>. The model is designed primarily for 2-D meshes.

 

''Turns from the east to north direction from any node on an even column are not allowed.''
 
''Turns from the north to west direction from any node on an odd column are not allowed.''
 
 
''Turns from the east to south direction from any node on an even column are not allowed.''
 
''Turns from the south to west direction from any node on an odd column are not allowed.''
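
The four rules above can be encoded as a small lookup. The Python sketch below is an illustration under stated assumptions: a turn is written as a pair (heading before the turn, heading after the turn) using the letters N, S, E, and W, columns are numbered from 0, and the names <code>FORBIDDEN_EVEN</code>, <code>FORBIDDEN_ODD</code>, and <code>odd_even_turn_allowed</code> are hypothetical.

```python
# Turns forbidden at nodes in even columns: east-to-north and east-to-south.
FORBIDDEN_EVEN = {("E", "N"), ("E", "S")}
# Turns forbidden at nodes in odd columns: north-to-west and south-to-west.
FORBIDDEN_ODD = {("N", "W"), ("S", "W")}


def odd_even_turn_allowed(column, heading_in, heading_out):
    """Return True if the turn is permitted at a node in the given column."""
    forbidden = FORBIDDEN_EVEN if column % 2 == 0 else FORBIDDEN_ODD
    return (heading_in, heading_out) not in forbidden
```

For instance, an east-to-north turn is rejected in column 2 but accepted in column 3, while a north-to-west turn is accepted in column 2 but rejected in column 3. An adaptive router could use such a check to filter its candidate output ports before choosing among them.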
 
 
 
The illustration below shows allowed routing for different source and destination nodes. Depending on which column the packet is in, only certain directions are allowed.
 
 
[[Image:Routing_odd_even.jpg]]
 
''Odd-Even turn restriction model proposed by Ge-Ming Chiu''3
 

 

<h1>Comparison of Turn Restriction Models</h1>

To compare the performance of various turn-restriction models, Chiu simulated a 15 x 15 mesh under various traffic patterns. All channels have a bandwidth of 20 flits/usec and a buffer size of one flit. The dimension-ordered x-y routing, west-first, and negative-first models were compared against the odd-even model.

 

Uniform, transpose, and hot-spot traffic patterns were simulated. Uniform traffic has each node send messages to every other node with equal probability. Transpose simulates two opposite nodes sending messages to their respective halves of the mesh. Hot spot simulates a few "hot spot" nodes that receive high traffic.



 
 
[[Image:Routing_uniform.jpg]]
 
''Uniform traffic simulation of various turn restriction models''3
 
 
The performance of the different routing algorithms is shown above for uniform traffic. For uniform traffic, the dimension-ordered x-y model outperforms the rest of the models. As the number of messages increases, the x-y model has the slowest increase in average communication latency.



 
 
[[Image:Routing_transpose.jpg]]
 
''First transpose traffic simulation of various turn restriction models''3
 
 
The performance of the different routing algorithms is shown above for the first transpose traffic. The negative-first model has the best performance, while the odd-even model performs better than the west-first and x-y models.



 
 
[[Image:Routing_transpose2.jpg]]
 
''Second transpose traffic simulation of various turn restriction models''3
 
 
With the second transpose simulation, the odd-even model outperforms the rest.



 
 
[[Image:Routing_hotspot.jpg]]
 
''Hotspot traffic simulation of various turn restriction models''3
 
 
The performance of the different routing algorithms is shown above for hotspot traffic. Only one hotspot was simulated for this test. The odd-even model outperforms the other models when hotspot traffic is 10%.



 
 
[[Image:Routing_hotspot2.jpg]]
 
''Second hotspot traffic simulation of various turn restriction models''3
 
 
When the number of hotspots is increased to five, the odd-even model begins to shine. Its latency is the lowest at both 6 and 8 percent hotspot traffic. Meanwhile, the x-y model performs very poorly.


 
 

While the x-y model performs well in uniform traffic, it lacks adaptiveness. When traffic concentrates on hotspots, the x-y model suffers from its inability to adapt and re-route traffic around the congestion that the hotspots cause. The odd-even model has superior adaptiveness under high congestion.


 

<h1>Router Architecture</h1>

The '''router''' is a device that routes incoming data to its destination. It does this by having several input ports and several output ports. Data arriving on one of the input ports is routed to one of the output ports. Which output port is chosen depends on the destination of the data and on the routing algorithm.

 
 

The internal architecture of a router consists of input and output ports and a '''crossbar switch'''. The crossbar switch connects each input to the output selected for it, acting essentially as a set of multiplexers.
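
The multiplexer view of the crossbar can be modeled in a few lines of Python. This is a deliberately simplified sketch, assuming one value per input port per cycle and ignoring buffering and arbitration; the function name <code>crossbar</code> and the <code>select</code> list are hypothetical. Each output behaves as a multiplexer whose select signal names the input that drives it.

```python
def crossbar(inputs, select):
    """Model one cycle of a crossbar switch.

    inputs: list with the value present on each input port.
    select: for each output port, the index of the input that drives it,
            or None if the output is idle this cycle.
    Returns the list of values appearing on the output ports.
    """
    return [None if s is None else inputs[s] for s in select]
```

For example, <code>crossbar(["a", "b", "c"], [2, None, 0])</code> returns <code>["c", None, "a"]</code>: output 0 is driven by input 2, output 1 is idle, and output 2 is driven by input 0. In a real router the select values would come from the routing algorithm and an arbiter.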

 
 

Router technology has improved significantly over the years, which has made networks with high dimensionality feasible. As shown in the real-world example above, high-dimensional tori and hypercubes are excellent topology choices for high-performance networks. The availability of high-performance, high-radix routers has contributed to the viability of these high-dimensionality networks. As the graph below shows, the bandwidth of routers has improved tremendously over a period of 10 years<sup>4</sup>.


 
[[Image:Router_bandwidth.jpg]]
 
''Bandwidth of various routers over 10 year period''4

 

Looking at the physical architecture and layout of routers, it is evident that the circuitry has become dramatically more dense and complex.


 
[[Image:Router_physical.jpg]]
 
''Router hardware over period of time''4
 

 
[[Image:Router_radix.jpg]]
 
''Radix and latency of routers over 10 year period''4
 
 

The '''radix''', or number of ports, of routers has also increased. Current routers not only have a higher radix but also lower latency than the previous generation, and latency remains steady as radix increases.


 


As router technology improves, more complex, higher-dimensional topologies become practical.


 

<h1>Fault Tolerant Routing</h1>

Fault-tolerant routing means successfully routing messages between any pair of non-faulty nodes in the presence of faulty components<sup>6</sup>. As the number of processors in a multiprocessor system and the data rates increase, reliable transmission of data in the event of a network fault becomes a major concern, which makes fault-tolerant routing algorithms important.


<h2>Fault Models</h2>

Faults in a network fall into two categories:


1. '''Transient faults'''<sup>5</sup>: A transient fault is a temporary fault that lasts for a very short time. It can be caused, for example, by a change in the output of a flip-flop that produces an invalid header. Transient faults can be reduced using error control coding and are generally measured in terms of bit error rate.


2. '''Permanent faults'''<sup>5</sup>: A permanent fault does not go away and causes permanent damage to the network, for example through damaged wires and their associated circuitry. Permanent faults are generally measured in terms of mean time between failures.


<h2>Fault Tolerance Mechanisms (for permanent faults)</h2>

Permanent faults can be handled using one of two mechanisms:


1. '''Static mechanism''': In the static fault-tolerance model, once a fault is detected, all processes running in the system are stopped and the routing tables are emptied. The routing tables are then recalculated from the fault information to provide fault-free paths.


2. '''Dynamic mechanisms''': In the dynamic fault-tolerance model, the processes in the network are not completely stalled, and only the affected regions are repaired. Some methods for doing this are:


a. '''Block faults''': In this method, healthy nodes in the vicinity of the faulty nodes are also marked as faulty so that no routes are created close to the actual faulty nodes. The resulting region can be convex or non-convex, and care is taken that none of the new routes introduces a cyclic dependency in the channel dependency graph (CDG).


Disadvantage: This method causes many healthy nodes to be declared faulty, which reduces system capacity.
 
 
[[Image:Fault_pic1.jpg]]
 
 
b. '''Fault rings''': This method was introduced by Chalasani and Boppana. A fault ring is the set of nodes and links adjacent to the faulty nodes or links. This approach reduces the number of healthy nodes that must be marked as faulty and blocked.
 
 
[[Image:Fault_pic2.jpg]]
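
The static mechanism described above, in which routing tables are emptied and recalculated around known faults, can be sketched as follows. This Python example is an illustration under stated assumptions, not any of the cited algorithms: it assumes a k x k mesh, a known set of faulty nodes, and a fault-free destination, and it rebuilds a next-hop table with a breadth-first search that simply skips the faulty nodes. The names <code>rebuild_routes</code> and <code>path_to</code> are hypothetical, and the sketch does not check the resulting routes for deadlock freedom.

```python
from collections import deque


def rebuild_routes(k, faulty, dst):
    """Return {node: next_hop} toward dst on a k x k mesh, avoiding faulty nodes."""
    next_hop = {dst: dst}
    queue = deque([dst])
    while queue:
        cur = queue.popleft()
        x, y = cur
        for nb in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if not (0 <= nb[0] < k and 0 <= nb[1] < k):
                continue  # outside the mesh
            if nb in faulty or nb in next_hop:
                continue  # faulty, or already has a shortest route
            next_hop[nb] = cur  # first hop from nb toward dst
            queue.append(nb)
    return next_hop


def path_to(next_hop, src, dst):
    """Follow the rebuilt table from src until dst is reached."""
    path = [src]
    while path[-1] != dst:
        path.append(next_hop[path[-1]])
    return path


# 3 x 3 mesh with a faulty center node; routes lead to the corner (2, 2).
routes = rebuild_routes(3, {(1, 1)}, (2, 2))
detour = path_to(routes, (0, 0), (2, 2))
```

In this example the route from (0, 0) to (2, 2) still takes 4 hops but goes around the faulty center node, and the faulty node itself receives no routing entry. Nodes that are cut off from the destination by faults are likewise absent from the table.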
 


 

<h1>References</h1>


1 Solihin text
 
2 [http://www.ssrc.ucsc.edu/Papers/hospodor-mss04.pdf Interconnection Architectures for Petabyte-Scale High-Performance Storage Systems]
 
3 [http://www.diit.unict.it/~vcatania/COURSES/semm_05-06/DOWNLOAD/noc_routing02.pdf The Odd-Even Turn Model for Adaptive Routing]
 
4 [http://www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf Interconnection Topologies:(Historical Trends and Comparisons)]
This link is down at this time. [http://webcache.googleusercontent.com/search?q=cache:2f2KNFWJCsQJ:www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf+history+of+interconnection+topologies&cd=9&hl=en&ct=clnk&gl=us&client=ubuntu&source=www.google.com Google Cache]
 
5 [http://dspace.upv.es/xmlui/bitstream/handle/10251/2603/tesisUPV2824.pdf?sequence=1 Efficient mechanisms to provide fault tolerance in interconnection networks for PC clusters, José Miguel Montañana Aliaga.]
 
6 [http://web.ebscohost.com.www.lib.ncsu.edu:2048/ehost/pdfviewer/pdfviewer?vid=2&hid=15&sid=72e3828d-3cb1-42b9-8198-5c1e974ea53f@sessionmgr4 Adaptive Fault Tolerant Routing Algorithm for Tree-Hypercube Multicomputer, Qatawneh Mohammad]
 
7 [http://www.top500.org TOP500 Supercomputing Sites]
 
8 [http://www.redbooks.ibm.com/abstracts/sg245161.html?Open Understanding and Using the SP Switch]
 
9 [http://www.myri.com/myrinet/overview/ Myrinet Overview]
 
10 [http://en.wikipedia.org/wiki/QsNet QsNet (Quadrics' network)]
 
11 [http://www.google.com/research/pubs/pub35155.html Dragonfly Topology]
 
12 [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.95.573&rep=rep1&type=pdf Flattened Butterfly Topology]

CSC/ECE 506 Spring 2011/ch12 aj

2011-04-26T03:52:08Z

Asbransc:

<h1>Interconnection Network Architecture </h1>


In a multi-processor system, processors need to communicate with each other and access each other's resources. In order to route data and messages between processors, an interconnection architecture is needed.


 


Typically, in a multiprocessor system, messages passed between processors are frequent and short<sup>1</sup>. Therefore, the interconnection network must handle each message quickly, with '''low latency''', and must handle several messages at a time, with '''high bandwidth'''.


 


In a network, a processor along with its cache and memory is considered a '''node'''. The physical wires that connect nodes are called '''links'''. The device that routes messages between nodes is called a '''router'''. The shape of the network, such as the number of links and routers, is called the network '''topology'''.


<h1>History of Network Topologies</h1>

Hypercube topologies were invented in the 1980s. They had desirable characteristics when the number of nodes was small (about 1000 at most, often fewer than 100) and when every processor had to stop working to receive and forward each message [4]. The low-radix era began in 1985 and was defined by routers with between 4 and 8 ports, by toroidal, mesh, or fat-tree topologies, and by wormhole routing. This era lasted about 20 years, until it was determined that routers with dozens of ports offered superior performance. Two topologies were developed to take advantage of the new high-radix routers: the flattened butterfly and the dragonfly. Both sit somewhere between a mesh in which each point is a router (or a virtual router, in the case of the dragonfly) with dozens or hundreds of nodes attached, and a fat tree with arity high enough that it has only two levels.

<h2>Interconnection evolution in the Top500 List</h2>
[[Image:Top500interconnect.png|thumbnail|300px|left|]]

This chart shows how the dominance of different interconnect topologies in the Top500 list of supercomputers has changed over time [7]. Many technologies came into vogue briefly before losing performance share and disappearing. In the early days of the list, most of the computers report the interconnect type as not applicable. The trailing end of the hypercube phase is nevertheless clear in burnt orange. The dark blue at the top is "other" and the dark red in the middle is "proprietary", so we can only speculate about which topologies they employ. The toroidal mesh appears briefly at the start in a cream color and slightly outlasts the hypercube. The two crossbar technologies (blue and olive) followed the toroidal mesh: the fully distributed crossbar died out quickly, while the multi-stage crossbar lasted longer but was never dominant.

The 3-D torus (purple) dominates much of the 1990s, with hypercube topologies (dark pink) enjoying a short comeback late in the decade. SP Switch (light olive), an IBM interconnect technology that uses a multi-stage crossbar switch, replaced the 3-D torus [8]. Myrinet, Quadrics, and Federation shared the spotlight in the mid-2000s; each used a similar fat-tree topology [9, 10].

The current class of supercomputers is dominated by nodes connected with either InfiniBand or Gigabit Ethernet. Both can be connected in either a fat-tree or a 2-D mesh topology. The primary difference between them is speed: InfiniBand is considerably faster per link and allows links to be ganged into groups of 4 or 12. Gigabit Ethernet is far less expensive, however, and some supercomputer designers have apparently chosen to save money on the interconnect in order to afford faster nodes [11, 12].

 
 
 
 

<h1>Types of Network Topologies </h1>

Several metrics are commonly used to represent the cost and performance of a topology. In this section, the degree, number of links, diameter, and bisection bandwidth are calculated for each topology, where p is the number of nodes.

<h2>Linear Array</h2>

[[Image:Top_linear.jpg|thumbnail|frame|right|]]

The nodes are connected linearly, as in an array. This topology is simple, but it does not scale well.
 
* ''Diameter:'' p-1
* ''Bisection BW:'' 1
* ''# Links:'' p-1
* ''Degree:'' 2
 
A linear array is the cheapest way to connect a group of nodes. Its number of links and its degree are the smallest of any topology. The drawback is equally obvious: the two end nodes are the farthest apart, which makes the diameter p-1. The topology is also unreliable: because the bisection bandwidth is 1, a single failed link splits the network in two.

<h2>Ring</h2>
[[Image:Top_ring.jpg|thumbnail|right|]]
The ring has the same structure as the linear array, except that the two end nodes are connected to each other, forming a circle.
 
* ''Diameter:'' p/2
* ''Bisection BW:'' 2
* ''# Links:'' p
* ''Degree:'' 2
 
Compared with the linear array, the ring gains a relatively large improvement for very little effort: adding a single link cuts the longest distance between two nodes in half and raises the bisection bandwidth to 2.
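The diameters quoted for the linear array and the ring can be checked by brute force. The sketch below is only an illustration: the function names (`diameter`, `linear_array`, `ring`) and the adjacency-list representation are assumptions made for this example, not something taken from the cited text.

```python
from collections import deque

def diameter(adj):
    """Longest shortest-path distance (in hops) over all node pairs, via BFS."""
    best = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        best = max(best, max(dist.values()))
    return best

def linear_array(p):
    """Node i is linked to i-1 and i+1 when those nodes exist."""
    return {i: [j for j in (i - 1, i + 1) if 0 <= j < p] for i in range(p)}

def ring(p):
    """Same as the linear array, plus one link closing the two ends."""
    return {i: [(i - 1) % p, (i + 1) % p] for i in range(p)}

p = 8
print(diameter(linear_array(p)))  # 7, i.e. p - 1
print(diameter(ring(p)))          # 4, i.e. p / 2
```

For p = 8 the single extra link of the ring reduces the diameter from 7 to 4.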

<h2>2-D Mesh</h2>

[[Image:Top_2Dmesh.jpg|thumbnail|right|]]

The 2-D mesh can be thought of as several linear arrays put together to form a 2-dimensional structure. This topology is well suited to applications such as ocean simulation and matrix computation.
 
* ''Diameter:'' 2(sqrt(p)-1)
* ''Bisection BW:'' sqrt(p)
* ''# Links:'' 2sqrt(p)(sqrt(p)-1)
* ''Degree:'' 4
 
Nodes that are not on the edge have a '''degree''' of 4. To calculate the number of links, add the number of vertical links, sqrt(p)(sqrt(p)-1), to the number of horizontal links, also sqrt(p)(sqrt(p)-1), to get 2sqrt(p)(sqrt(p)-1). The diameter is the distance between two diagonally opposite corner nodes, which is the sum of two edges of length sqrt(p)-1.
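As a quick check of the link-count argument, the following sketch enumerates the links of a square mesh and compares the result with the closed-form expressions. The function name `mesh_metrics` and the returned dictionary are hypothetical and exist only for this illustration; p is assumed to be a perfect square.

```python
import math

def mesh_metrics(p):
    """Metrics of a sqrt(p) x sqrt(p) mesh; links are counted by enumeration."""
    k = math.isqrt(p)
    if k * k != p:
        raise ValueError("p must be a perfect square")
    links = 0
    for row in range(k):
        for col in range(k):
            if col + 1 < k:   # horizontal link to the right-hand neighbor
                links += 1
            if row + 1 < k:   # vertical link to the neighbor below
                links += 1
    return {
        "links": links,
        "diameter": 2 * (k - 1),   # corner to opposite corner
        "bisection": k,            # links cut when the mesh is split in half
    }

m = mesh_metrics(16)
print(m)  # {'links': 24, 'diameter': 6, 'bisection': 4}
assert m["links"] == 2 * 4 * (4 - 1)
```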
 

<h2>2-D Torus</h2>

[[Image:Top_2Dtorus.jpg|thumbnail|right|]]

Just as the ring closes the ends of the linear array, the 2-D torus takes the structure of the 2-D mesh and connects the nodes on opposite edges with wraparound links.
 
* ''Diameter:'' 2*floor(sqrt(p)/2), which equals sqrt(p)-1 when sqrt(p) is odd
* ''Bisection BW:'' 2sqrt(p)
* ''# Links:'' 2p
* ''Degree:'' 4
 
With the wraparound connections, the longest distance is roughly halved and the bisection bandwidth is doubled compared with the 2-D mesh. The price is 2sqrt(p) additional links: 2p in total instead of 2sqrt(p)(sqrt(p)-1).
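The effect of the wraparound links on distance can be sketched as follows. The helper names `torus_distance` and `torus_diameter` and the (row, column) coordinate convention are assumptions for this example. In each dimension a packet may go either way around, so the hop count per dimension is the shorter of the two directions. The brute-force diameter comes out as 2*floor(k/2) for a k x k torus, which equals sqrt(p)-1 when k is odd.

```python
def torus_distance(a, b, k):
    """Minimum hop count between nodes a and b, given as (row, col), on a k x k torus."""
    d_row = abs(a[0] - b[0])
    d_col = abs(a[1] - b[1])
    # In each dimension, take the shorter way around the ring.
    return min(d_row, k - d_row) + min(d_col, k - d_col)

def torus_diameter(k):
    """Largest distance from node (0, 0); by symmetry this is the diameter."""
    return max(torus_distance((0, 0), (r, c), k)
               for r in range(k) for c in range(k))

print(torus_distance((0, 0), (0, 3), 4))  # 1, using the wraparound link
print(torus_diameter(4))                  # 4, versus 6 for the 4 x 4 mesh
print(torus_diameter(5))                  # 4, versus 8 for the 5 x 5 mesh
```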
 

<h2>Cube</h2>
[[Image:Top_cube.jpg|thumbnail|right|]]

If we add two more neighbors to each node of the 2-D mesh, we get a cube. The cube can be thought of as a three-dimensional mesh.
 
* ''Diameter:'' 3(p<sup>1/3</sup>-1)
* ''Bisection BW:'' p<sup>2/3</sup>
* ''# Links:'' 3p<sup>2/3</sup>(p<sup>1/3</sup>-1)
* ''Degree:'' 6 (for the interior nodes)
 
We can also extend some of the metrics to the N-dimensional mesh:
 
* ''Diameter:'' N(p<sup>1/N</sup>-1)
* ''Bisection BW:'' p<sup>(N-1)/N</sup>
* ''# Links:'' Np<sup>(N-1)/N</sup>(p<sup>1/N</sup>-1)
 
Interior nodes have two neighbors in each dimension, so their degree is 2N. The cube therefore has a degree of 6, which is larger than the degree of 4 of the two-dimensional mesh.

<h2>Hypercube</h2>

[[Image:Top_hypercube.jpg|thumbnail|right|Hypercube]]

In the N-dimensional cube, the boundary nodes are normally what hurts the performance of the entire network. We can fix this by connecting the boundary nodes together. The hypercube is essentially multiple cubes put together.
 
* ''Diameter:'' log2(p)
* ''Bisection BW:'' p/2
* ''# Links:'' p/2 * log2(p)
* ''Degree:'' log2(p)
 
As the metrics show, the diameter and bisection bandwidth improve significantly for the higher-order topologies.
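The hypercube metrics follow from labelling each node with a log2(p)-bit number and linking nodes whose labels differ in exactly one bit. This labelling is the usual convention rather than something stated above, and the function names are hypothetical. Under that assumption, a minimal sketch:

```python
def hypercube_neighbors(node, dims):
    """Neighbors of a node: flip each of the dims address bits in turn."""
    return [node ^ (1 << d) for d in range(dims)]

def hypercube_distance(a, b):
    """Hop count equals the number of differing address bits (Hamming distance)."""
    return bin(a ^ b).count("1")

dims = 4
p = 1 << dims  # 16 nodes
links = sum(len(hypercube_neighbors(n, dims)) for n in range(p)) // 2
diameter = max(hypercube_distance(0, n) for n in range(p))

print(links)     # 32, i.e. p/2 * log2(p)
print(diameter)  # 4, i.e. log2(p)
```

Each link is seen from both of its end nodes, which is why the neighbor count is halved.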

<h2>Tree</h2>

[[Image:Top_tree.jpg|thumbnail|right|Tree]]

The tree is a hierarchical structure with processing nodes at the bottom and switching nodes at the upper levels.
 
* ''Diameter:'' 2log2(p) -- the path from a leaf through the root to the farthest leaf on the other side
* ''Bisection BW:'' 1 -- breaking either link to the root bisects the tree
* ''# Links:'' 2(p-1)
* ''Degree:'' 3
 
The tree experiences heavy traffic at the upper levels. Since about half of the messages must pass through the root node, the root becomes the bottleneck of the tree topology. Another disadvantage is that the bisection bandwidth is only 1.
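The share of messages that cross the root can be estimated with a small count. The sketch assumes a binary tree whose p leaves are numbered left to right, uniform traffic between distinct leaves, and a hypothetical function name `root_traffic_fraction`. A message passes through the root exactly when its source and destination lie in different halves of the tree.

```python
def root_traffic_fraction(p):
    """Fraction of ordered (source, destination) leaf pairs routed through the root."""
    half = p // 2
    crossing = sum(1
                   for src in range(p)
                   for dst in range(p)
                   if src != dst and (src < half) != (dst < half))
    return crossing / (p * (p - 1))

print(round(root_traffic_fraction(8), 3))   # 0.571
print(round(root_traffic_fraction(64), 3))  # 0.508
```

Under these assumptions the fraction is p / (2(p-1)), slightly above one half and approaching one half as p grows.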

<h2>Fat Tree</h2>

[[Image:Top_fat_tree.jpg|thumbnail|right|Fat Tree]]
In order to improve the performance of the tree topology, the fat tree alleviates the traffic at upper levels by "fattening" up the links at the upper levels.
 
* ''Diameter:'' 2log2(p) -- same as a regular tree
* ''Bisection BW:'' p/2 -- all links to (one side of) the root must be cut to bisect the tree
* ''# Links:'' plog2(p) -- there are p links at each of log2(p) levels
* ''Degree:'' p -- the root node has p links through it.
 
The fat tree relieves the pressure on the root node, and the bisection bandwidth is increased as well.

<h2>Butterfly</h2>

[[Image:Top_butterfly.jpg|thumbnail|right|Butterfly]]
 

The butterfly structure is similar to the tree structure, but it replicates the switching node structure of the tree topology and connects them together so that there are equal links and routers at all levels.
* ''Diameter:'' 2log2(p) -- same as a tree since butterfly has the same depth as a tree (just with p nodes at each level)
* ''Bisection BW:'' p -- p links connect the two halves at the top level
* ''# Links:'' 2p*log2(p) -- there are 2*p links at each level times log2(p) levels.
* ''Degree:'' 4 -- the routers in the middle levels all have 4 links. The leaves and routers at the top level each have 2 links.

 

The butterfly has similar performance to the hypercube. In terms of cost, the butterfly has a smaller degree (so cheaper routers can be used), but the hypercube has fewer links.
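The cost trade-off between the two topologies can be tabulated directly from the formulas listed in this section. The function names are hypothetical, and p is assumed to be a power of two.

```python
def hypercube_cost(p):
    """Links and router degree of a hypercube, from the formulas above."""
    n = p.bit_length() - 1  # log2(p) for a power of two
    return {"links": p // 2 * n, "degree": n}

def butterfly_cost(p):
    """Links and router degree of a butterfly, from the formulas above."""
    n = p.bit_length() - 1
    return {"links": 2 * p * n, "degree": 4}

for p in (16, 64, 1024):
    print(p, hypercube_cost(p), butterfly_cost(p))
# 16 {'links': 32, 'degree': 4} {'links': 128, 'degree': 4}
# 64 {'links': 192, 'degree': 6} {'links': 768, 'degree': 4}
# 1024 {'links': 5120, 'degree': 10} {'links': 20480, 'degree': 4}
```

With these formulas the butterfly always uses four times as many links, while its router degree stays at 4 as the hypercube degree grows with log2(p).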

<h1>Real-World Implementation of Network Topologies </h1>

In a research study by Andy Hospodor and Ethan Miller, several network topologies were investigated for a high-performance, high-traffic network [2]. The topologies included the fat tree, butterfly, mesh, torus, and hypercube structures. Advantages and disadvantages in terms of cost, performance, and reliability were discussed.

In this study, a petabyte-scale network with over 100 GB/s of total aggregate bandwidth was investigated. The network consisted of 4096 disks and large servers, with routers and switches in between [2].

[[Image:Disknet_network.jpg|frame|center|''Basic structure of Hospodor and Miller's experimental network''2]]

The overall structure of the network is shown above. Note that this structure is very susceptible to failure and congestion.

<h2>Fat Tree</h2>

In large-scale, high-performance applications, the fat tree can be a good choice. However, in order to "fatten" the links, redundant connections must be used: instead of one link between switching nodes, several are needed. With more input and output links, routers need more input and output ports. Routers with more than 100 ports are difficult to build and expensive, so multiple routers would have to be stacked together. Even then, several routers would be required, and they would still be expensive [2].
 
 
The Japan Agency for Marine-Earth Science and Technology supercomputing system uses the fat tree topology. The system connects 1280 NEC processors [7].

 

<h2>Butterfly</h2>

In high-performance applications, the butterfly structure is a good choice. The butterfly topology uses fewer links than other topologies; however, each link carries traffic from the entire layer. Fault tolerance is poor: there is only a single path between any pair of nodes, so if a link breaks, data cannot be re-routed and communication is broken [2].

[[Image:Disknet_butterfly.jpg|frame|center|''Butterfly structure''2]]

<h2>Meshes and Tori</h2>

The mesh and torus structures used in this application would require a large number of links and an aggregate of several thousand ports. However, since there are so many links, the mesh and torus structures provide alternate paths in case of failures [2].

[[Image:Disknet_mesh.jpg|frame|center|''Mesh structure''2]]

[[Image:Disknet_torus.jpg|frame|center|''Torus structure''2]]

One example of the torus structure in current use is the QPACE SFB TR Cluster in Germany, which uses PowerXCell 8i processors. The system uses a 3-D torus topology with 4608 processors [7].

<h2>Hypercube</h2>

Similar to the torus structures, the hypercube requires a larger number of links. However, its bandwidth scales better than that of the mesh and torus structures.

The CRAY T3E, CRAY XT3, and SGI Origin 2000 use k-ary n-cube topologies.

[[Image:Disknet_hypercube.jpg|frame|center|''Hypercube structure''2]]

It is worth mentioning that although many topologies perform much better than the 2-D mesh, these advanced topologies also cost more. Since most chips are laid out in two dimensions, implementing a high-dimensional topology on a 2-D chip is very expensive. For the hypercube, increasing the number of nodes increases the degree of every node. For the butterfly, the degree increases relatively slowly, but the required numbers of links and switches increase rapidly.

<h1>Comparison of Network Topologies </h1>


The following figure shows the total number of ports required for each network topology.
 
 
[[Image:Disknet_ports.jpg]]
 
''Number of ports for each topology''2

 

As the figure above shows, the 6-D hypercube requires the largest number of ports because of its relatively complex six-dimensional structure. In contrast, the fat tree requires the fewest ports, even though its links have been "fattened" with redundant links. The butterfly network requires more than twice as many ports as the fat tree, since it essentially replicates the switching layer of the fat tree. The number of ports for the mesh and torus structures increases with the dimensionality. With modern router technology, however, the number of ports is a less important consideration.

 
 
The average path length (average number of hops) and the average link load (GB/s) are shown below.
 
 
[[Image:Disknet_load.jpg]]
 
''Average path length and link load for each topology''2

 

Looking at the trends, when the average path length is high, the average link load is also high; the two rise and fall together. The graph makes clear that the 2-D mesh has by far the worst performance. In a network this large, its average path length is simply too high, and the average link load suffers; for this type of high-performance network, the 2-D mesh does not scale well. The 2-D torus cuts the average path length and average link load in half by connecting the edge nodes together, but its performance is still relatively poor compared with the other types. The butterfly and fat tree have the lowest average path length and average link load.

 
 
The figure below shows the cost of the network topologies.
 
 
[[Image:Disknet_cost.jpg]]
 
''Cost of each topology''2

 

Despite using the fewest ports, the fat tree topology has by far the highest cost, because its ports are high-bandwidth 10 GB/s ports. Over 2400 ports of 10 GB/s are required to provide enough bandwidth at the upper levels of the tree. This pushes the cost up dramatically and makes the fat tree impractical from a cost standpoint. While the total cost of the fat tree is about 15 million dollars, the other topologies are clustered below 4 million dollars. As the dimensionality of the mesh and torus structures increases, the cost increases. The cost of the butterfly network falls between that of the 2-D mesh/torus and the 6-D hypercube.

 
 
When cost and average link load are both factored in, the following graph is produced.
 
 
[[Image:Disknet_overall.jpg]]
 
''Overall cost of each topology''2

 

From the figure above, the 6-D hypercube is the most cost-effective choice for this particular network setup. Although the 6-D hypercube costs more because it needs more links and ports, it provides higher bandwidth, which can offset the higher cost. The high-dimensional tori also perform well but cannot provide as much bandwidth as the 6-D hypercube. For systems that do not need as much bandwidth, a high-dimensional torus is also a good choice. The butterfly topology is another alternative but has lower fault tolerance.

However, when the number of nodes increases, the relative cost of the higher-dimensional topologies increases far faster than their relative performance when compared to a 2-D mesh. This is because the 2-D mesh only uses low-cost, short links. The higher-dimensional structures must be projected onto our 3-dimensional world, and thus require many long, expensive links that wrap around the outside of the system like an impenetrable tangle of jungle vines. Maintaining such a network is also quite slow and tedious.


<h1>Routing</h1>


The '''routing''' algorithm determines the path a packet of data takes from source to destination. Routing can be '''deterministic''', where the path is always the same for a given source and destination, or '''adaptive''', where the path can change. A routing algorithm can also be '''partially adaptive''', where packets have multiple choices but not every packet is allowed to use a shortest path [3].


<h2>Deadlock</h2>

Packets are in '''deadlock''' when they cannot continue to move through the nodes. The illustration below demonstrates this situation.
 
 
[[Image:Routing_deadlock.jpg]]
 
''Example of deadlock''
 
 

Assume that all of the buffers at every node are full. The packet from Node 1 cannot continue to Node 2, the packet from Node 2 cannot continue to Node 3, and so on. Since no packet can move, the packets are deadlocked.

 
 

Deadlock arises from a cyclic pattern of routing. To avoid deadlock, circular routing patterns must be avoided.
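One way to make the cyclic condition concrete is to record which node each blocked packet is waiting on and then look for a cycle in that "waits-for" graph. The sketch below uses a plain depth-first search. The graph encoding and the function name `has_cycle` are assumptions made for this illustration, not part of any cited algorithm.

```python
def has_cycle(waits_for):
    """Return True if the directed graph {node: [nodes it waits on]} contains a cycle."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, finished
    color = {node: WHITE for node in waits_for}

    def visit(node):
        color[node] = GRAY
        for nxt in waits_for.get(node, ()):
            state = color.get(nxt, WHITE)
            if state == GRAY:  # reached a node already on the current path
                return True
            if state == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color[node] == WHITE and visit(node) for node in waits_for)

# Four nodes with full buffers, each waiting on the next: a deadlock.
print(has_cycle({1: [2], 2: [3], 3: [4], 4: [1]}))  # True
# Removing the last dependency breaks the cycle.
print(has_cycle({1: [2], 2: [3], 3: [4], 4: []}))   # False
```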

 
 

To avoid circular patterns of routing, some routing patterns are disallowed. These are called '''turn restrictions''', where some turns are not allowed in order to avoid making a circular routing pattern. Some of these turn restrictions are mentioned below.


 
<h2>Dimensional ordered (X-Y) routing</h2>

Turns from the y-dimension to the x-dimension are not allowed, so a packet completes all of its travel in the x-dimension before it turns into the y-dimension.
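A minimal sketch of dimension-ordered routing on a 2-D mesh follows. It assumes nodes are addressed by (x, y) coordinates; the function name `xy_route` is hypothetical. The packet first moves along x until its column matches the destination and only then moves along y, so it never turns from y back to x.

```python
def xy_route(src, dst):
    """Return the list of (x, y) nodes visited from src to dst, x-dimension first."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:   # finish all travel in the x-dimension
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:   # then travel in the y-dimension
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path

print(xy_route((0, 0), (2, 1)))  # [(0, 0), (1, 0), (2, 0), (2, 1)]
print(xy_route((3, 2), (1, 0)))  # [(3, 2), (2, 2), (1, 2), (1, 1), (1, 0)]
```

Because the path depends only on the source and destination, this is a deterministic routing algorithm.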


<h2>West First</h2>

Turns to the west are not allowed, so a packet that needs to travel west must do so first.


<h2>North Last</h2>

Turns after travelling in the north direction are not allowed, so north must be the last direction a packet takes.


<h2>Negative First</h2>

Turns from a positive direction (+x or +y) to a negative direction (-x or -y) are not allowed, so any travel in a negative direction must come first.


<h2>Odd-Even Turn Model</h2>


Unfortunately, the above turn-restriction models reduce the degree of adaptiveness and are only partially adaptive. The models force some packets onto different routes that are not necessarily minimal paths. This may cause unfairness and reduces the ability of the system to relieve congestion, so overall performance could suffer [3].


 


Ge-Ming Chiu introduces the odd-even turn model, a deadlock-free, adaptive turn-restriction model that performs better than the previously mentioned models [3]. The model is designed primarily for 2-D meshes.

 

''Turns from the east to north direction from any node on an even column are not allowed.''
 
''Turns from the north to west direction from any node on an odd column are not allowed.''
 
 
''Turns from the east to south direction from any node on an even column are not allowed.''
 
''Turns from the south to west direction from any node on an odd column are not allowed.''
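The four rules above can be encoded as a small lookup table. In this sketch a turn is written as (current direction of travel, new direction), columns are assumed to be numbered from 0, and the names `FORBIDDEN` and `turn_allowed` are hypothetical.

```python
# (direction travelling, direction after the turn) -> column parity where it is forbidden
FORBIDDEN = {
    ("E", "N"): "even",
    ("E", "S"): "even",
    ("N", "W"): "odd",
    ("S", "W"): "odd",
}

def turn_allowed(travelling, turning_to, column):
    """Return True if the turn is permitted at a node in the given column."""
    parity = "even" if column % 2 == 0 else "odd"
    return FORBIDDEN.get((travelling, turning_to)) != parity

print(turn_allowed("E", "N", 2))  # False: east-to-north turn in an even column
print(turn_allowed("E", "N", 3))  # True:  the same turn in an odd column
print(turn_allowed("N", "W", 3))  # False: north-to-west turn in an odd column
print(turn_allowed("W", "N", 2))  # True:  this turn is not restricted by the rules
```

Because each restricted turn is forbidden only in columns of one parity, a packet can usually still make it one column further on.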
 
 
 
The illustration below shows allowed routing for different source and destination nodes. Depending on which column the packet is in, only certain directions are allowed.
 
 
[[Image:Routing_odd_even.jpg]]
 
''Odd-Even turn restriction model proposed by Ge-Ming Chiu''3
 

 

<h1>Comparison of Turn Restriction Models</h1>

To compare the performance of various turn-restriction models, Chiu simulated a 15 x 15 mesh under various traffic patterns. Every channel has a bandwidth of 20 flits/usec and a buffer size of one flit. The dimension-ordered x-y routing, west-first, and negative-first models were compared against the odd-even model.

 

Uniform, transpose, and hot spot traffic patterns were simulated. Under uniform traffic, a node sends messages to every other node with equal probability. Transpose simulates two opposite nodes sending messages to their respective halves of the mesh. Hot spot simulates a few "hot spot" nodes that receive high traffic.



 
 
[[Image:Routing_uniform.jpg]]
 
''Uniform traffic simulation of various turn restriction models''3
 
 
The performance of the different routing algorithms is shown above for uniform traffic. For uniform traffic, the dimension-ordered x-y model outperforms the rest of the models. As the number of messages increases, the x-y model shows the slowest increase in average communication latency.



 
 
[[Image:Routing_transpose.jpg]]
 
''First transpose traffic simulation of various turn restriction models''3
 
 
The performance of the different routing algorithms is shown above for the first transpose traffic. The negative-first model has the best performance, while the odd-even model performs better than the west-first and x-y models.



 
 
[[Image:Routing_transpose2.jpg]]
 
''Second transpose traffic simulation of various turn restriction models''3
 
 
With the second transpose simulation, the odd-even model outperforms the rest.



 
 
[[Image:Routing_hotspot.jpg]]
 
''Hotspot traffic simulation of various turn restriction models''3
 
 
The performance of the different routing algorithms is shown above for hotspot traffic. Only one hotspot was simulated for this test. The odd-even model outperforms the other models when hotspot traffic is 10%.



 
 
[[Image:Routing_hotspot2.jpg]]
 
''Second hotspot traffic simulation of various turn restriction models''3
 
 
When the number of hotspots is increased to five, the odd-even model begins to shine: its latency is the lowest at both 6 and 8 percent hotspot traffic. Meanwhile, the x-y model performs very poorly.


 
 

While the x-y model performs well under uniform traffic, it lacks adaptiveness. When traffic concentrates on hotspots, the x-y model cannot adapt and re-route traffic around the resulting congestion. The odd-even model has superior adaptiveness under high congestion.


 

<h1>Router Architecture</h1>

The '''router''' is a device that routes incoming data to its destination. It has several input ports and several output ports; data arriving on one of the input ports is routed to one of the output ports. Which output port is chosen depends on the destination of the data and on the routing algorithm.

 
 

The internal architecture of a router consists of input and output ports and a '''crossbar switch'''. The crossbar switch connects each input to the output selected for it, acting essentially as a multiplexer.
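The behaviour of a crossbar can be sketched in a few lines. The fixed-priority arbitration used here (the lowest-numbered input wins when two inputs want the same output) is purely an assumption made for illustration, as are the function names; real routers use a variety of arbitration schemes.

```python
def arbitrate(requests):
    """requests[i] is the output port wanted by input i, or None.

    Returns {output: input}, granting each output to the lowest-numbered requester.
    """
    grants = {}
    for inp, out in enumerate(requests):
        if out is not None and out not in grants:
            grants[out] = inp
    return grants

def crossbar(inputs, grants, num_outputs):
    """Each output acts as a multiplexer that forwards the data of its granted input."""
    return [inputs[grants[o]] if o in grants else None
            for o in range(num_outputs)]

inputs = ["a", "b", "c", "d"]
requests = [2, 0, 2, None]         # inputs 0 and 2 both want output 2
grants = arbitrate(requests)
print(grants)                      # {2: 0, 0: 1}
print(crossbar(inputs, grants, 3)) # ['b', None, 'a']
```

Input 2 loses the arbitration in this cycle and would have to keep its data buffered until output 2 is free.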

 
 

Router technology has improved significantly over the years, which has made networks with high dimensionality feasible. As the real-world example above shows, high-dimensional tori and hypercubes are excellent choices of topology for high-performance networks. The cost of high-performance, high-radix routers has contributed to the viability of these high-dimensionality networks. As the graph below shows, the bandwidth of routers has improved tremendously over a period of 10 years [4].


 
[[Image:Router_bandwidth.jpg]]
 
''Bandwidth of various routers over 10 year period''4

 

Looking at the physical architecture and layout of routers, it is evident that the circuitry has become dramatically denser and more complex.


 
[[Image:Router_physical.jpg]]
 
''Router hardware over period of time''4
 

 
[[Image:Router_radix.jpg]]
 
''Radix and latency of routers over 10 year period''4
 
 

The '''radix''', or number of ports, of routers has also increased. Current technology not only has a high radix but also lower latency than the previous generation. As the radix increases, the latency remains steady.


 


With high-performance routers, complex topologies are possible, and as router technology continues to improve, even more complex, high-dimensionality topologies become feasible.


 

<h1>Fault Tolerant Routing</h1>

Fault-tolerant routing means the successful routing of messages between any pair of non-faulty nodes in the presence of faulty components [6]. With the growing number of processors in multiprocessor systems and high data rates, reliable transmission of data in the event of a network fault is of great concern, and hence fault-tolerant routing algorithms are important.


<h2>Fault Models</h2>

Faults in a network can be categorized in two types:
 
 
1. '''Transient Faults''' [5]: A transient fault is a temporary fault that lasts for a very short time. It can be caused, for example, by a change in the output of a flip-flop that leads to the generation of an invalid header. These faults can be minimized using error control coding and are generally evaluated in terms of bit error rate.
 
 
2. '''Permanent Faults''' [5]: A permanent fault does not go away and causes permanent damage to the network. It could be due to damaged wires and associated circuitry. These faults are generally evaluated in terms of mean time between failures.


<h2>Fault Tolerance Mechanisms (for permanent faults)</h2>

The permanent faults can be handled using one of the two mechanisms:
 
 
1. '''Static Mechanism''': In the static fault tolerance model, once a fault is detected, all processes running in the system are stopped and the routing tables are emptied. The routing tables are then recalculated from the fault information to provide fault-free paths.
 
 
2. '''Dynamic Mechanisms''': In the dynamic fault tolerance model, the processes in the network are not completely stalled, and only the affected regions are repaired. Some of the methods for doing this are:
 
 
a. '''Block Faults''': In this method, many of the healthy nodes in the vicinity of the faulty nodes are also marked as faulty, so that no routes are created close to the actual faulty nodes. The shape of the resulting region can be convex or non-convex, and care is taken that none of the new routes introduces a cycle in the channel dependency graph (CDG).
 
 
DISADVANTAGE: This method causes many healthy nodes to be declared faulty, which reduces system capacity.
 
 
[[Image:Fault_pic1.jpg]]
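The capacity cost of block faults can be illustrated with a simplified sketch that encloses all faulty nodes of a 2-D mesh in a single bounding rectangle (a convex region). Treating every fault as part of one region is an assumption made here for brevity, and the function name `fault_block` is hypothetical.

```python
def fault_block(faulty):
    """Return every (x, y) node inside the bounding rectangle of the faulty nodes."""
    xs = [x for x, _ in faulty]
    ys = [y for _, y in faulty]
    return {(x, y)
            for x in range(min(xs), max(xs) + 1)
            for y in range(min(ys), max(ys) + 1)}

faulty = {(1, 1), (3, 2)}
blocked = fault_block(faulty)
print(len(blocked))           # 6 nodes are treated as faulty
print(len(blocked - faulty))  # 4 of them are actually healthy
```

Here two real faults take six nodes out of service, which shows why the loss of healthy nodes is the main drawback of this method.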
 
 
b. '''Fault Rings''': This method was introduced by Chalasani and Boppana. A fault ring is a set of healthy nodes and links adjacent to the faulty nodes/links. This approach reduces the number of healthy nodes that must be marked as faulty and blocked.
 
 
[[Image:Fault_pic2.jpg]]
 


 

<h1>References</h1>


1 Solihin text
 
2 [http://www.ssrc.ucsc.edu/Papers/hospodor-mss04.pdf Interconnection Architectures for Petabyte-Scale High-Performance Storage Systems]
 
3 [http://www.diit.unict.it/~vcatania/COURSES/semm_05-06/DOWNLOAD/noc_routing02.pdf The Odd-Even Turn Model for Adaptive Routing]
 
4 [http://www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf Interconnection Topologies:(Historical Trends and Comparisons)]
This link is down at this time. [http://webcache.googleusercontent.com/search?q=cache:2f2KNFWJCsQJ:www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf+history+of+interconnection+topologies&cd=9&hl=en&ct=clnk&gl=us&client=ubuntu&source=www.google.com Google Cache]
 
5 [http://dspace.upv.es/xmlui/bitstream/handle/10251/2603/tesisUPV2824.pdf?sequence=1 Efficient mechanisms to provide fault tolerance in interconnection networks for PC clusters, José Miguel Montañana Aliaga.]
 
6 [http://web.ebscohost.com.www.lib.ncsu.edu:2048/ehost/pdfviewer/pdfviewer?vid=2&hid=15&sid=72e3828d-3cb1-42b9-8198-5c1e974ea53f@sessionmgr4 Adaptive Fault Tolerant Routing Algorithm for Tree-Hypercube Multicomputer, Qatawneh Mohammad]
 
7 [http://www.top500.org TOP500 Supercomputing Sites]
 
8 [http://www.redbooks.ibm.com/abstracts/sg245161.html?Open Understanding and Using the SP Switch]
 
9 [http://www.myri.com/myrinet/overview/ Myrinet Overview]
 
10 [http://en.wikipedia.org/wiki/QsNet QsNet (Quadrics' network)]
 
11 [http://www.google.com/research/pubs/pub35155.html Dragonfly Topology]
 
12 [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.95.573&rep=rep1&type=pdf Flattened Butterfly Topology]

CSC/ECE 506 Spring 2011/ch12 aj

2011-04-26T03:45:59Z

Asbransc:

<h1>Interconnection Network Architecture </h1>


In a multi-processor system, processors need to communicate with each other and access each other's resources. In order to route data and messages between processors, an interconnection architecture is needed.


 


Typically, in a multiprocessor system, messages passed between processors are frequent and short [1]. Therefore, the interconnection network must deliver each message quickly ('''low latency''') and carry many messages at a time ('''high bandwidth''').


 


In a network, a processor along with its cache and memory is considered a '''node'''. The physical wires that connect two nodes are called a '''link'''. The device that routes messages between nodes is called a '''router'''. The shape of the network, that is, the way its links and routers are arranged, is called the network '''topology'''.


<h1>History of Network Topologies</h1>

Hypercube topologies were invented in the 1980s. They had desirable characteristics when the number of nodes was small (about 1000 at most, often fewer than 100) and when every processor had to stop working to receive and forward each message [4]. The low-radix era began in 1985 and was defined by routers with between 4 and 8 ports, by toroidal, mesh, or fat-tree topologies, and by wormhole routing. This era lasted about 20 years, until it was determined that routers with dozens of ports offered superior performance. Two topologies were developed to take advantage of the new high-radix routers: the flattened butterfly and the dragonfly. Both sit somewhere between a mesh in which each point is a router (or a virtual router, in the case of the dragonfly) with dozens or hundreds of nodes attached, and a fat tree with arity high enough that it has only two levels.

<h2>Interconnection evolution in the Top500 List</h2>
[[Image:Top500interconnect.png|thumbnail|300px|left|]]

This chart shows how the dominance of different interconnect topologies in the Top500 list of supercomputers has changed over time [7]. Many technologies came into vogue briefly before losing performance share and disappearing. In the early days of the list, most of the computers report the interconnect type as not applicable. The trailing end of the hypercube phase is nevertheless clear in burnt orange. The dark blue at the top is "other" and the dark red in the middle is "proprietary", so we can only speculate about which topologies they employ. The toroidal mesh appears briefly at the start in a cream color and slightly outlasts the hypercube. The two crossbar technologies (blue and olive) followed the toroidal mesh: the fully distributed crossbar died out quickly, while the multi-stage crossbar lasted longer but was never dominant.

The 3-D torus (purple) dominates much of the 1990s, with hypercube topologies (dark pink) enjoying a short comeback late in the decade. SP Switch (light olive), an IBM interconnect technology that uses a multi-stage crossbar switch, replaced the 3-D torus [8]. Myrinet, Quadrics, and Federation shared the spotlight in the mid-2000s; each used a similar fat-tree topology [9, 10].

The current class of supercomputers is dominated by nodes connected with either InfiniBand or Gigabit Ethernet. Both can be connected in either a fat-tree or a 2-D mesh topology. The primary difference between them is speed: InfiniBand is considerably faster per link and allows links to be ganged into groups of 4 or 12. Gigabit Ethernet is far less expensive, however, and some supercomputer designers have apparently chosen to save money on the interconnect in order to afford faster nodes [11, 12].

 
 
 
 

<h1>Types of Network Topologies </h1>

Several metrics are commonly used to represent the cost and performance of a topology. In this section, the degree, number of links, diameter, and bisection bandwidth are calculated for each topology, where p is the number of nodes.

<h2>Linear Array</h2>

[[Image:Top_linear.jpg|thumbnail|frame|right|]]

The nodes are connected linearly, as in an array. This topology is simple, but it does not scale well.
 
* ''Diameter:'' p-1
* ''Bisection BW:'' 1
* ''# Links:'' p-1
* ''Degree:'' 2
 
A linear array is the cheapest way to connect a group of nodes. Its number of links and its degree are the smallest of any topology. The drawback is equally obvious: the two end nodes are the farthest apart, which makes the diameter p-1. The topology is also unreliable: because the bisection bandwidth is 1, a single failed link splits the network in two.

<h2>Ring</h2>
[[Image:Top_ring.jpg|thumbnail|right|]]
The ring has a structure similar to the linear array, except that the end nodes connect to each other, establishing a circular structure.
 
* ''Diameter:'' p/2
* ''Bisection BW:'' 2
* ''# Links:'' p
* ''Degree:'' 2
 
Compared with the linear array, the ring gains a relatively large improvement for very little effort (only one added link). The longest distance between two nodes is cut in half, and the bisection bandwidth increases to 2.
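The diameter and link counts listed for the linear array and the ring can be checked numerically. The following Python sketch is only an illustration and is not taken from the cited text: the function names (`linear_array`, `ring`, `hops_from`, `diameter`, `link_count`) are made up for this example, and distance is measured in hops with a breadth-first search.

```python
from collections import deque


def linear_array(p):
    """Adjacency list for p nodes connected in a line."""
    return {i: [j for j in (i - 1, i + 1) if 0 <= j < p] for i in range(p)}


def ring(p):
    """Adjacency list for p nodes in a ring (a line plus one wraparound link)."""
    return {i: sorted({(i - 1) % p, (i + 1) % p}) for i in range(p)}


def hops_from(adj, src):
    """Breadth-first search: hop count from src to every reachable node."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def diameter(adj):
    """Longest shortest path between any two nodes."""
    return max(max(hops_from(adj, s).values()) for s in adj)


def link_count(adj):
    """Each link appears in two adjacency lists, so halve the total."""
    return sum(len(neighbors) for neighbors in adj.values()) // 2


if __name__ == "__main__":
    p = 8
    for name, build in (("linear array", linear_array), ("ring", ring)):
        net = build(p)
        print(name, "diameter:", diameter(net), "links:", link_count(net))
```

For p = 8 the sketch reports a diameter of 7 with 7 links for the linear array and a diameter of 4 with 8 links for the ring, matching p-1, p-1, p/2 and p.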

<h2>2-D Mesh</h2>

[[Image:Top_2Dmesh.jpg|thumbnail|right|]]

The 2-D mesh can be thought of as several linear arrays put together to form a 2-dimensional structure. This topology is well suited to applications such as the ocean application and matrix calculations.
 
* ''Diameter:'' 2(sqrt(p)-1)
* ''Bisection BW:'' sqrt(p)
* ''# Links:'' 2sqrt(p)(sqrt(p)-1)
* ''Degree:'' 4
 
Nodes that are not on the edge have a '''degree''' of 4. To calculate the number of links, add the number of vertical links, sqrt(p)(sqrt(p)-1), to the number of horizontal links, also sqrt(p)(sqrt(p)-1), to get 2sqrt(p)(sqrt(p)-1). The diameter is calculated by the distance between two diagonal nodes which is the sum of 2 edges of length sqrt(p)-1.
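The counting argument above can be reproduced by enumerating a small mesh. This is a minimal sketch under stated assumptions: the mesh is square with side k = sqrt(p), k is even for the bisection cut, and the helper names (`mesh_links`, `mesh_diameter`, `mesh_bisection`) are hypothetical. The diameter uses the Manhattan distance, which equals the shortest hop count in a mesh.

```python
from itertools import product


def mesh_links(k):
    """All links of a k x k mesh, each stored once as (node, neighbor)."""
    links = set()
    for x, y in product(range(k), repeat=2):
        if x + 1 < k:
            links.add(((x, y), (x + 1, y)))  # horizontal link
        if y + 1 < k:
            links.add(((x, y), (x, y + 1)))  # vertical link
    return links


def mesh_diameter(k):
    """Largest Manhattan distance between any two nodes."""
    nodes = list(product(range(k), repeat=2))
    return max(abs(ax - bx) + abs(ay - by)
               for (ax, ay) in nodes for (bx, by) in nodes)


def mesh_bisection(k):
    """Links cut when the mesh is split into left and right halves (k even)."""
    half = k // 2
    return sum(1 for a, b in mesh_links(k) if (a[0] < half) != (b[0] < half))


if __name__ == "__main__":
    k = 4  # p = 16 nodes
    print("links:", len(mesh_links(k)), "formula:", 2 * k * (k - 1))
    print("diameter:", mesh_diameter(k), "formula:", 2 * (k - 1))
    print("bisection:", mesh_bisection(k), "formula:", k)
```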
 

<h2>2-D Torus</h2>

[[Image:Top_2Dtorus.jpg|thumbnail|right|]]

Just as the ring is formed from the linear array, the 2-D torus takes the structure of the 2-D mesh and connects the nodes on opposite edges.
 
* ''Diameter:'' sqrt(p)-1
* ''Bisection BW:'' 2sqrt(p)
* ''# Links:'' 2p
* ''Degree:'' 4
 
With the end-around connections, the longest distance is cut in half and the bisection bandwidth doubles. Of course, the cost also rises: going from the 2-D mesh to the 2-D torus adds 2sqrt(p) wraparound links.
 

<h2>Cube</h2>
[[Image:Top_cube.jpg|thumbnail|right|]]

If we add two more neighbors to each node, we get a cube. The cube can be thought of as a three-dimensional mesh.
 
* ''Diameter:'' 3(p^(1/3)-1)
* ''Bisection BW:'' p^(2/3)
* ''# Links:'' 3p^(2/3)(p^(1/3)-1)
* ''Degree:'' 6 (from the inside nodes)
 
We can also extend some of the metrics to the N-dimensional mesh, with p^(1/N) nodes along each dimension:
 
* ''Diameter:'' N(p^(1/N)-1)
* ''Bisection BW:'' p^((N-1)/N)
* ''# Links:'' N*p^((N-1)/N)*(p^(1/N)-1), which reduces to N*2^N/2 when there are two nodes per dimension
 
The degree depends on a node's position: interior nodes have degree 2N, while boundary nodes have fewer links. The interior degree of the cube is therefore 6, which is larger than the degree of the two-dimensional mesh, which is 4.

<h2>Hypercube</h2>

[[Image:Top_hypercube.jpg|thumbnail|right|Hypercube]]

In the N-dimensional cube, the boundary nodes are normally the ones that hurt the performance of the entire network. We can fix this by connecting those boundary nodes together. The hypercube is essentially multiple cubes put together.
 
* ''Diameter:'' log2(p)
* ''Bisection BW:'' p/2
* ''# Links:'' p/2 * log2(p)
* ''Degree:'' log2(p)
 
As the metrics show, the diameter and bisection bandwidth are significantly improved for the higher-order topologies.
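The hypercube metrics follow from a simple labelling: if each of the p = 2^N nodes gets an N-bit label, two nodes are linked when their labels differ in exactly one bit. The sketch below assumes that labelling, and its function names (`hypercube_neighbors`, `hypercube_links`, `hop_distance`, `hypercube_bisection`) are illustrative only.

```python
def hypercube_neighbors(node, dims):
    """Neighbors differ from node in exactly one of the dims label bits."""
    return [node ^ (1 << d) for d in range(dims)]


def hypercube_links(dims):
    """Every link once, stored as (smaller label, larger label)."""
    p = 1 << dims
    return {(n, m) for n in range(p)
            for m in hypercube_neighbors(n, dims) if n < m}


def hop_distance(a, b):
    """Shortest path length equals the number of differing label bits."""
    return bin(a ^ b).count("1")


def hypercube_bisection(dims):
    """Links cut when the nodes are split on the highest label bit."""
    top = 1 << (dims - 1)
    return sum(1 for a, b in hypercube_links(dims) if (a & top) != (b & top))


if __name__ == "__main__":
    dims = 4            # log2(p)
    p = 1 << dims       # 16 nodes
    print("degree:", len(hypercube_neighbors(0, dims)))
    print("links:", len(hypercube_links(dims)), "formula:", p // 2 * dims)
    print("diameter:", max(hop_distance(0, n) for n in range(p)))
    print("bisection:", hypercube_bisection(dims), "formula:", p // 2)
```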

<h2>Tree</h2>

[[Image:Top_tree.jpg|thumbnail|right|Tree]]

The tree is a hierarchical structure with nodes at the bottom and switching nodes at the upper levels.
 
* ''Diameter:'' 2log2(p) -- the path from a leaf through the root to the farthest leaf on the other side
* ''Bisection BW:'' 1 -- breaking either link to the root bisects the tree
* ''# Links:'' 2(p-1)
* ''Degree:'' 3
 
The tree experiences high traffic at the upper levels. Since almost half of the messages need to go through the root node, the root becomes the bottleneck of the tree topology. The other disadvantage of the tree topology is that the bisection bandwidth is only 1.

<h2>Fat Tree</h2>

[[Image:Top_fat_tree.jpg|thumbnail|right|Fat Tree]]
In order to improve the performance of the tree topology, the fat tree alleviates the traffic at upper levels by "fattening" up the links at the upper levels.
 
* ''Diameter:'' 2log2(p) -- same as a regular tree
* ''Bisection BW:'' p/2 -- all links to (one side of) the root must be cut to bisect the tree
* ''# Links:'' plog2(p) -- there are p links at each of log2(p) levels
* ''Degree:'' p -- the root node has p links through it.
 
The fat tree relieves the pressure on the root node, and the bisection bandwidth is also increased.

<h2>Butterfly</h2>

[[Image:Top_butterfly.jpg|thumbnail|right|Butterfly]]
 

The butterfly structure is similar to the tree structure, but it replicates the switching node structure of the tree topology and connects them together so that there are equal links and routers at all levels.
* ''Diameter:'' 2log2(p)
* ''Bisection BW:'' p/2
* ''# Links:'' 2p*log2(p)
* ''Degree:'' 4

 

The butterfly has performance comparable to the hypercube: both have a bisection bandwidth of p/2. In terms of cost, the butterfly has a lower degree, but the hypercube has fewer links.
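To compare the topologies side by side, the formulas listed in this section can be evaluated for a concrete node count. The sketch below simply transcribes those formulas. The function name `topology_metrics` is hypothetical, and it assumes that p is both a perfect square and a power of two so that sqrt(p) and log2(p) are whole numbers.

```python
import math


def topology_metrics(p):
    """Evaluate the formulas listed in this section for p nodes."""
    k = math.isqrt(p)          # side length of a square 2-D mesh
    n = p.bit_length() - 1     # log2(p) when p is a power of two
    if k * k != p or (1 << n) != p:
        raise ValueError("p must be a perfect square and a power of two")
    return {
        "linear array": {"diameter": p - 1, "bisection": 1,
                         "links": p - 1, "degree": 2},
        "ring": {"diameter": p // 2, "bisection": 2,
                 "links": p, "degree": 2},
        "2-D mesh": {"diameter": 2 * (k - 1), "bisection": k,
                     "links": 2 * k * (k - 1), "degree": 4},
        "hypercube": {"diameter": n, "bisection": p // 2,
                      "links": p // 2 * n, "degree": n},
        "tree": {"diameter": 2 * n, "bisection": 1,
                 "links": 2 * (p - 1), "degree": 3},
        "butterfly": {"diameter": 2 * n, "bisection": p // 2,
                      "links": 2 * p * n, "degree": 4},
    }


if __name__ == "__main__":
    for name, m in topology_metrics(64).items():
        print(f"{name:12s} diameter={m['diameter']:3d} "
              f"bisection={m['bisection']:3d} links={m['links']:4d} "
              f"degree={m['degree']}")
```

With p = 64, for example, the hypercube needs 192 links at degree 6 while the butterfly needs 768 links at degree 4, which illustrates the cost trade-off described above.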

<h1>Real-World Implementation of Network Topologies </h1>

In a research study by Andy Hospodor and Ethan Miller, several network topologies were investigated in a high-performance, high-traffic network2. The topologies investigated included the fat tree, butterfly, mesh, tori, and hypercube structures. Advantages and disadvantages including cost, performance, and reliability were discussed.

In this experiment, a petabyte-scale network with over 100 GB/s total aggregate bandwidth was investigated. The network consisted of 4096 disks and large servers, with routers and switches in between2.

[[Image:Disknet_network.jpg|frame|center|''Basic structure of Hospodor and Miller's experimental network''2]]

The overall structure of the network is shown above. Note that this structure is very susceptible to failure and congestion.

<h2>Fat Tree</h2>

In large-scale, high-performance applications, the fat tree can be a choice. However, in order to "fatten" up the links, redundant connections must be used. Instead of using one link between switching nodes, several must be used. The problem is that with more input and output links, one would need routers with more input and output ports. Routers with in excess of 100 ports are difficult to build and expensive, so multiple routers would have to be stacked together. Even then, the routers would be expensive, and several of them would be required2.
 
 
The Japan Agency for Marine-Earth Science and Technology supercomputing system uses the fat tree topology. The system connects 1280 NEC processors7.

 

<h2>Butterfly</h2>

In high-performance applications, the butterfly structure is a good choice. The butterfly topology uses fewer links than other topologies; however, each link carries traffic from the entire layer. Fault tolerance is poor: only a single path exists between any pair of nodes. Should a link break, data cannot be re-routed, and communication is broken2.

[[Image:Disknet_butterfly.jpg|frame|center|''Butterfly structure''2]]

<h2>Meshes and Tori</h2>

The mesh and torus structures used in this application would require a large number of links and a total of several thousand ports. However, since there are so many links, the mesh and torus structures provide alternate paths in case of failures2.

[[Image:Disknet_mesh.jpg|frame|center|''Mesh structure''2]]

[[Image:Disknet_torus.jpg|frame|center|''Torus structure''2]]

One example of current use of the torus structure is the QPACE SFB TR Cluster in Germany, which uses PowerXCell 8i processors. The system uses a 3-D torus topology with 4608 processors7.

<h2>Hypercube</h2>

Similar to the torus structures, the hypercube requires a larger number of links. However, its bandwidth scales better than that of the mesh and torus structures.

The CRAY T3E, CRAY XT3, and SGI Origin 2000 use k-ary n-cube topologies.

[[Image:Disknet_hypercube.jpg|frame|center|''Hypercube structure''2]]

It is worth mentioning that although many topologies perform much better than the 2-D mesh, these advanced topologies also cost more. Since most chips are laid out in 2-D space, it is very expensive to implement a high-dimensional topology on a 2-D chip. For the hypercube topology, increasing the number of nodes raises the degree of each node. For the butterfly topology, the degree grows relatively slowly, but the required number of links and switches increases rapidly.

<h1>Comparison of Network Topologies </h1>


The following figure shows the total number of ports required for each network topology.
 
 
[[Image:Disknet_ports.jpg]]
 
''Number of ports for each topology''2

 

As the figure above shows, the 6-D hypercube requires the largest number of ports, due to its relatively complex six-dimensional structure. In contrast, the fat tree requires the fewest ports, even though its links have been "fattened" up by using redundant links. The butterfly network requires more than twice as many ports as the fat tree, since it essentially replicates the switching layer of the fat tree. The number of ports for the mesh and torus structures increases as the dimensionality increases. However, with modern router technology, the number of ports is a less important consideration.

 
 
The average path length, or average number of hops, and the average link load (GB/s) are shown below.
 
 
[[Image:Disknet_load.jpg]]
 
''Average path length and link load for each topology''2

 

Looking at the trends, when the average path length is high, the average link load is also high. In other words, average link load rises with average path length. The graph shows that the 2-D mesh has, by far, the worst performance. In a large network such as this, the average path length is just too high, and the average link load suffers. For this type of high-performance network, the 2-D mesh does not scale well. The 2-D torus cuts the average path length and average link load in half by connecting the edge nodes together; however, its performance is still relatively poor compared to the other types. The butterfly and fat tree have the lowest average path length and average link load.

 
 
The figure below shows the cost of the network topologies.
 
 
[[Image:Disknet_cost.jpg]]
 
''Cost of each topology''2

 

Despite using the fewest ports, the fat tree topology has the highest cost, by far. Its ports are high-bandwidth ports of 10 GB/s, and over 2400 of them are required to have enough bandwidth at the upper levels of the tree. This pushes the cost up dramatically and makes the fat tree impractical from a cost standpoint. While the total cost of the fat tree is about 15 million dollars, the rest of the network topologies are clustered below 4 million dollars. As the dimensionality of the mesh and torus structures increases, the cost increases. The cost of the butterfly network falls between that of the 2-D mesh/torus and the 6-D hypercube.

 
 
When both cost and average link load are factored in, the following graph is produced.
 
 
[[Image:Disknet_overall.jpg]]
 
''Overall cost of each topology''2

 

From the figure above, the 6-D hypercube is the most cost-effective choice for this particular network setup. Although the 6-D hypercube costs more because it needs more links and ports, it provides higher bandwidth, which can offset the higher cost. The high-dimensional tori also perform well, but cannot provide as much bandwidth as the 6-D hypercube. For systems that do not need as much bandwidth, a high-dimensional torus is also a good choice. The butterfly topology is also an alternative, but has lower fault tolerance.

However, when the number of nodes increases, the relative cost of the higher-dimensional topologies increases far faster than their relative performance when compared to a 2-D mesh. This is because the 2-D mesh only uses low-cost, short links. The higher-dimensional structures must be projected onto our 3-dimensional world, and thus require many long, expensive links that wrap around the outside of the system like an impenetrable tangle of jungle vines. Maintaining such a network is also quite slow and tedious.


<h1>Routing</h1>


The '''routing''' algorithm determines what path a packet of data will take from source to destination. Routing can be '''deterministic''', where the path is always the same for a given source and destination, or '''adaptive''', where the path can change. The routing algorithm can also be '''partially adaptive''', where packets have multiple choices of path but not all packets are allowed to use the shortest path3.


<h2>Deadlock</h2>

Packets are in '''deadlock''' when they cannot continue to move through the nodes. The illustration below demonstrates this event.
 
 
[[Image:Routing_deadlock.jpg]]
 
''Example of deadlock''
 
 

Assume that all of the buffers are full at each node. The packet from Node 1 cannot continue to Node 2. The packet from Node 2 cannot continue to Node 3, and so on. Since no packet can move, the packets are deadlocked.

 
 

Deadlock arises from a cyclic pattern of routing. To avoid deadlock, circular routing patterns must be avoided.

 
 

To avoid circular patterns of routing, some routing patterns are disallowed. These are called '''turn restrictions''', where some turns are not allowed in order to avoid making a circular routing pattern. Some of these turn restrictions are mentioned below.
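One way to picture why removing a turn prevents deadlock is to treat each link as a node in a dependency graph, with an edge from one link to another whenever a packet holding the first may wait for the second. The following sketch is an illustration only: the link names and the `has_cycle` helper are hypothetical, and the check is an ordinary depth-first search for a cycle.

```python
def has_cycle(deps):
    """Return True if the dependency graph contains a cycle.

    deps maps each link to the links a packet holding it may wait for.
    """
    WHITE, GRAY, BLACK = 0, 1, 2   # unvisited, on current path, finished
    color = {}

    def visit(link):
        color[link] = GRAY
        for nxt in deps.get(link, ()):
            state = color.get(nxt, WHITE)
            if state == GRAY:          # reached a link already on this path
                return True
            if state == WHITE and visit(nxt):
                return True
        color[link] = BLACK
        return False

    return any(color.get(link, WHITE) == WHITE and visit(link)
               for link in list(deps))


# Four links around a square of nodes, each waiting for the next one.
circular = {
    "1->2": ["2->3"],
    "2->3": ["3->4"],
    "3->4": ["4->1"],
    "4->1": ["1->2"],
}

# Prohibiting one turn removes one dependency and breaks the cycle.
restricted = dict(circular)
restricted["4->1"] = []

if __name__ == "__main__":
    print("circular pattern can deadlock:", has_cycle(circular))
    print("with one turn prohibited:", has_cycle(restricted))
```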


 
<h2>Dimensional ordered (X-Y) routing</h2>

Turns from the y-dimension to the x-dimension are not allowed.
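A minimal sketch of this rule on a 2-D mesh is given below, assuming nodes are addressed by (x, y) coordinates. The function name `xy_route` is hypothetical. The packet first travels along the x-dimension until its column matches the destination and only then travels along the y-dimension, so it never turns from y back to x.

```python
def xy_route(src, dst):
    """Return the list of nodes visited from src to dst, x-dimension first."""
    x, y = src
    path = [src]
    while x != dst[0]:              # finish all x-dimension hops first
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:              # then move only in the y-dimension
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


if __name__ == "__main__":
    print(xy_route((0, 0), (2, 1)))
```

Because the path depends only on the source and destination, this scheme is deterministic.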


<h2>West First</h2>

Turns to the west are not allowed.


<h2>North Last</h2>

Turns after a north direction are not allowed.


<h2>Negative First</h2>

Turns in the negative direction (-x or -y) are not allowed, except on the first turn.


<h2>Odd-Even Turn Model</h2>


Unfortunately, the above turn-restriction models are only partially adaptive: they reduce the degree of adaptiveness. The models cause some packets to take different routes, and not necessarily the minimal paths. This may cause unfairness and reduces the ability of the system to relieve congestion. Overall performance could suffer3.


 


Ge-Ming Chiu introduces the Odd-Even turn model as an adaptive turn restriction, deadlock-free model that has better performance than the previously mentioned models3. The model is designed primarily for 2-D meshes.

 

''Turns from the east to north direction from any node on an even column are not allowed.''
 
''Turns from the north to west direction from any node on an odd column are not allowed.''
 
 
''Turns from the east to south direction from any node on an even column are not allowed.''
 
''Turns from the south to west direction from any node on an odd column are not allowed.''
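The four rules above can be written as a small lookup. This is a sketch under stated assumptions rather than Chiu's own formulation: a turn "from east to north" is read as a packet travelling east that then heads north, columns are numbered from 0, and the names `FORBIDDEN_EVEN`, `FORBIDDEN_ODD` and `turn_allowed` are made up for this example. Reversing direction (a 180-degree turn) is not modelled.

```python
# (current heading, new heading) pairs prohibited by the four rules.
FORBIDDEN_EVEN = {("E", "N"), ("E", "S")}   # applies in even columns
FORBIDDEN_ODD = {("N", "W"), ("S", "W")}    # applies in odd columns


def turn_allowed(column, heading, new_heading):
    """Return True if the odd-even rules permit this turn in this column."""
    if heading == new_heading:
        return True                 # continuing straight is not a turn
    forbidden = FORBIDDEN_EVEN if column % 2 == 0 else FORBIDDEN_ODD
    return (heading, new_heading) not in forbidden


if __name__ == "__main__":
    for col in (2, 3):
        for turn in (("E", "N"), ("E", "S"), ("N", "W"), ("S", "W")):
            verdict = "allowed" if turn_allowed(col, *turn) else "prohibited"
            print(f"column {col}: {turn[0]} to {turn[1]} is {verdict}")
```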
 
 
 
The illustration below shows allowed routing for different source and destination nodes. Depending on which column the packet is in, only certain directions are allowed.
 
 
[[Image:Routing_odd_even.jpg]]
 
''Odd-Even turn restriction model proposed by Ge-Ming Chiu''3
 

 

<h1>Comparison of Turn Restriction Models</h1>

To compare the performance of various turn restriction models, Chiu simulated a 15 x 15 mesh under various traffic patterns. All channels have a bandwidth of 20 flits/usec and a buffer size of one flit. The dimension-ordered x-y routing, west-first, and negative-first models were compared against the odd-even model.

 

Uniform, transpose, and hot spot traffic patterns were simulated. Uniform simulates each node sending messages to any other node with equal probability. Transpose simulates two opposite nodes sending messages to their respective halves of the mesh. Hot spot simulates a few "hot spot" nodes that receive high traffic.



 
 
[[Image:Routing_uniform.jpg]]
 
''Uniform traffic simulation of various turn restriction models''3
 
 
The performance of the different routing algorithms is shown above for the uniform traffic. For uniform traffic, the dimensional ordered x-y model outperforms the rest of the models. As the number of messages increase, the x-y model has the "slowest" increase in average communication latency.



 
 
[[Image:Routing_transpose.jpg]]
 
''First transpose traffic simulation of various turn restriction models''3
 
 
The performance of the different routing algorithms is shown above for the first transpose traffic. The negative-first model has the best performance, while the odd-even model performs better than the west-first and x-y models.



 
 
[[Image:Routing_transpose2.jpg]]
 
''Second transpose traffic simulation of various turn restriction models''3
 
 
With the second transpose simulation, the odd-even model outperforms the rest.



 
 
[[Image:Routing_hotspot.jpg]]
 
''Hotspot traffic simulation of various turn restriction models''3
 
 
The performance of the different routing algorithms is shown above for the hotspot traffic. Only one hotspot was simulated for this test. The odd-even model outperforms the other models when hotspot traffic is 10%.



 
 
[[Image:Routing_hotspot2.jpg]]
 
''Second hotspot traffic simulation of various turn restriction models''3
 
 
When the number of hotspots is increased to five, the performance of the odd-even begins to shine. The latency is lowest for both 6 and 8 percent hotspot. Meanwhile, the performance of x-y model is horrendous.


 
 

While the x-y model performs well in uniform traffic, it lacks adaptiveness. When traffic becomes hotspot, the x-y model suffers from the inability to adapt and re-route traffic to avoid the congestion caused by hotspots. The odd-even model has superior adaptiveness under high congestion.


 

<h1>Router Architecture</h1>

The '''router''' is a device that routes incoming data to its destination. It does this by having several input ports and several output ports. Data arriving on one of the input ports is routed to one of the output ports. Which output port is chosen depends on the destination of the data and on the routing algorithm.

 
 

The internal architecture of a router consists of input and output ports and a '''crossbar switch'''. The crossbar switch connects each input to the output selected for it, acting essentially as a multiplexer.

 
 

Router technology has improved significantly over the years. This has allowed networks with high dimensionality to become feasible. As shown in the real-world example above, high-dimensional tori and hypercubes are excellent choices of topology for high-performance networks. The cost of high-performance, high-radix routers has contributed to the viability of these types of high-dimensionality networks. As the graph below shows, the bandwidth of routers has improved tremendously over a period of 10 years4.


 
[[Image:Router_bandwidth.jpg]]
 
''Bandwidth of various routers over 10 year period''4

 

Looking at the physical architecture and layout of routers, it is evident that the circuitry has become dramatically more dense and complex.


 
[[Image:Router_physical.jpg]]
 
''Router hardware over period of time''4
 

 
[[Image:Router_radix.jpg]]
 
''Radix and latency of routers over 10 year period''4
 
 

The '''radix''', or the number of ports of routers has also increased. The current technology not only has high radix, but also low latency compared to last generation. As radix increases, the latency remains steady.


 


As router technology improves, more complex, high-dimensionality topologies become possible.


 

<h1>Fault Tolerant Routing</h1>

Fault-tolerant routing means the successful routing of messages between any pair of non-faulty nodes in the presence of faulty components6. With the increased number of processors in a multiprocessor system and high data rates, reliable transmission of data in the event of a network fault is of great concern, and hence fault-tolerant routing algorithms are important.


<h2>Fault Models</h2>

Faults in a network can be categorized in two types:
 
 
1. '''Transient Faults'''5: A transient fault is a temporary fault that lasts for a very short time. Such a fault can be caused by a change in the output of a flip-flop, leading to the generation of an invalid header. These faults can be minimized using error control coding. They are generally evaluated in terms of Bit Error Rate.
 
 
2. '''Permanent Faults'''5: A permanent fault is a fault that does not go away and causes permanent damage to the network. This fault could be due to damaged wires and associated circuitry. These faults are generally evaluated in terms of Mean Time Between Failures.


<h2>Fault Tolerance Mechanisms (for permanent faults)</h2>

The permanent faults can be handled using one of the two mechanisms:
 
 
1. '''Static Mechanism''': In the static fault tolerance model, once a fault is detected, all the processes running in the system are stopped and the routing tables are emptied. Based on the fault information, the routing tables are recalculated to provide a fault-free path.
 
 
2. '''Dynamic Mechanisms''': In the dynamic fault tolerance model, the processes in the network are not completely stalled, and only the affected regions are repaired. Some of the methods to do this are:
 
 
a. '''Block Faults''': In this method, many of the healthy nodes in the vicinity of the faulty nodes are also marked as faulty, so that no routes are created close to the actual faulty nodes. The shape of the region can be convex or non-convex, and care is taken that none of the new routes introduces a cyclic dependency in the channel dependency graph (CDG).
 
 
DISADVANTAGE: This method causes many healthy nodes to be declared faulty, which reduces system capacity.
 
 
[[Image:Fault_pic1.jpg]]
 
 
b. '''Fault Rings''': This method was introduced by Chalasani and Boppana. A fault ring is a set of nodes and links that are adjacent to faulty nodes/links. This approach reduces the number of healthy nodes that must be marked as faulty and blocked.
 
 
[[Image:Fault_pic2.jpg]]
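As a rough illustration of the static mechanism, a route can be recomputed after a fault by searching the mesh while skipping every node marked faulty. The sketch below is a simplification and not one of the algorithms cited above: the function name `route_around_faults` is hypothetical, it uses a plain breadth-first search on a k x k mesh, and it does not check that the new routes are deadlock-free.

```python
from collections import deque


def route_around_faults(k, src, dst, faulty):
    """Shortest path on a k x k mesh that avoids faulty nodes, or None."""
    if src in faulty or dst in faulty:
        return None
    prev = {src: None}              # remembers how each node was reached
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:  # walk back to the source
                path.append(node)
                node = prev[node]
            return path[::-1]
        x, y = node
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            inside = 0 <= nxt[0] < k and 0 <= nxt[1] < k
            if inside and nxt not in faulty and nxt not in prev:
                prev[nxt] = node
                queue.append(nxt)
    return None                      # the faults disconnect src from dst


if __name__ == "__main__":
    # Node (1, 0) has failed, so the direct path along the row is blocked.
    print(route_around_faults(3, (0, 0), (2, 0), faulty={(1, 0)}))
```

Marking additional healthy nodes as faulty, as in the block fault method, simply enlarges the `faulty` set, which shows directly how that method trades capacity for simplicity.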
 


 

<h1>References</h1>


1 Solihin text
 
2 [http://www.ssrc.ucsc.edu/Papers/hospodor-mss04.pdf Interconnection Architectures for Petabyte-Scale High-Performance Storage Systems]
 
3 [http://www.diit.unict.it/~vcatania/COURSES/semm_05-06/DOWNLOAD/noc_routing02.pdf The Odd-Even Turn Model for Adaptive Routing]
 
4 [http://www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf Interconnection Topologies:(Historical Trends and Comparisons)]
This link is down at this time. [http://webcache.googleusercontent.com/search?q=cache:2f2KNFWJCsQJ:www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf+history+of+interconnection+topologies&cd=9&hl=en&ct=clnk&gl=us&client=ubuntu&source=www.google.com Google Cache]
 
5 [http://dspace.upv.es/xmlui/bitstream/handle/10251/2603/tesisUPV2824.pdf?sequence=1 Efficient mechanisms to provide fault tolerance in interconnection networks for PC clusters, José Miguel Montañana Aliaga.]
 
6 [http://web.ebscohost.com.www.lib.ncsu.edu:2048/ehost/pdfviewer/pdfviewer?vid=2&hid=15&sid=72e3828d-3cb1-42b9-8198-5c1e974ea53f@sessionmgr4 Adaptive Fault Tolerant Routing Algorithm for Tree-Hypercube Multicomputer, Qatawneh Mohammad]
 
7 [http://www.top500.org TOP500 Supercomputing Sites]
 
8 [http://www.redbooks.ibm.com/abstracts/sg245161.html?Open Understanding and Using the SP Switch]
 
9 [http://www.myri.com/myrinet/overview/ Myrinet Overview]
 
10 [http://en.wikipedia.org/wiki/QsNet QsNet (Quadrics' network)]
 
11 [http://www.google.com/research/pubs/pub35155.html Dragonfly Topology]
 
12 [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.95.573&rep=rep1&type=pdf Flattened Butterfly Topology]

CSC/ECE 506 Spring 2011/ch12 aj

2011-04-26T03:43:54Z

Asbransc:

<h1>Interconnection Network Architecture </h1>


In a multi-processor system, processors need to communicate with each other and access each other's resources. In order to route data and messages between processors, an interconnection architecture is needed.


 


Typically, in a multiprocessor system, messages passed between processors are frequent and short1. Therefore, the interconnection network architecture must handle messages quickly by having '''low latency''', and must handle several messages at a time and have '''high bandwidth'''.


 


In a network, a processor along with its cache and memory is considered a '''node'''. The physical wires that connect nodes are called a '''link'''. The device that routes messages between nodes is called a router. The shape of the network, such as the number of links and routers, is called the network '''topology'''.


<h1>History of Network Topologies</h1>

Hypercube topologies were invented in the 80s and had desirable characteristics when the number of nodes is small (~1000 maximum, often <100) and every processor must stop working to receive and forward the message 4. The low-radix era began in 1985 and was defined by routers with between 4 and 8 ports using toroidal, mesh or fat-tree topologies and wormhole routing. This era lasted about 20 years until it was determined that routers with dozens of ports offered superior performance. Two topologies were developed to take advantage of the newly developed high-radix routers. These are flattened butterfly and dragonfly, which are somewhere between a mesh with each point on the mesh being a router (or virtual router in the case of dragonfly) with dozens or hundreds of nodes attached and a fat tree with sufficiently high arity as to only have two levels.

<h2>Interconnection evolution in the Top500 List</h2>
[[Image:Top500interconnect.png|thumbnail|300px|left|]]

This chart shows how the share of each interconnect topology in the Top500 list of supercomputers has changed over time 7. Many technologies came into vogue briefly before losing performance share and disappearing. In the early days of the list, most of the computers report the interconnect type as not applicable. However, the trailing end of the hypercube phase is clear in burnt orange. The dark blue at the top is "other" and the dark red in the middle is "proprietary", so we can only speculate about what topologies they might employ. The toroidal mesh appears briefly at the start in a cream color, and slightly outlasts the hypercube. The two crossbar technologies (blue and olive) followed the toroidal mesh. The fully-distributed crossbar died out quickly; the multi-stage crossbar lasted longer but was never dominant. The 3-D torus (purple) dominates much of the 90s, with hypercube topologies (dark pink) enjoying a short comeback in the later part of the decade. SP Switch (light olive), an IBM interconnect technology which uses a multi-stage crossbar switch, replaced the 3-D torus8. Myrinet, Quadrics, and Federation all shared the spotlight in the mid 00s; each used a similar fat-tree topology9,10. The current class of supercomputers is dominated by nodes connected with either InfiniBand or Gigabit Ethernet. Both can be connected in either a fat-tree or 2-D mesh topology. The primary difference between them is speed. InfiniBand is considerably faster per link and allows links to be ganged into groups of 4 or 12. Gigabit Ethernet is vastly less expensive, however, and some supercomputer designers have apparently chosen to save money on the interconnect technology in order to allow the use of faster nodes 11,12.

 
 
 
 

<h1>Types of Network Topologies </h1>

Several metrics are normally chosen to represent the cost and performance of a topology. In this section, the degree, number of links, diameter, and bisection bandwidth are given for each topology.

<h2>Linear Array</h2>

[[Image:Top_linear.jpg|thumbnail|frame|right|]]

The nodes are connected linearly as in an array. This type of topology is simple; however, it does not scale well.
 
* ''Diameter:'' p-1
* ''Bisection BW:'' 1
* ''# Links:'' p-1
* ''Degree:'' 2
 
A linear array is the cheapest way to connect a group of nodes together. Its number of links and its degree are the smallest of any topology. However, the drawback of this topology is also obvious: the two end points are the farthest apart, which makes the diameter p-1. This topology is also not reliable, since the bisection bandwidth is 1: cutting a single link splits the network in two.

<h2>Ring</h2>
[[Image:Top_ring.jpg|thumbnail|right|]]
The ring has a structure similar to the linear array, except that the end nodes connect to each other, establishing a circular structure.
 
* ''Diameter:'' p/2
* ''Bisection BW:'' 2
* ''# Links:'' p
* ''Degree:'' 2
 
Compared with the linear array, the ring gains a relatively large improvement for very little effort (only one added link). The longest distance between two nodes is cut in half, and the bisection bandwidth increases to 2.

<h2>2-D Mesh</h2>

[[Image:Top_2Dmesh.jpg|thumbnail|right|]]

The 2-D mesh can be thought of as several linear arrays put together to form a 2-dimensional structure. This topology is well suited to applications such as the ocean application and matrix calculations.
 
* ''Diameter:'' 2(sqrt(p)-1)
* ''Bisection BW:'' sqrt(p)
* ''# Links:'' 2sqrt(p)(sqrt(p)-1)
* ''Degree:'' 4
 
Nodes that are not on the edge have a '''degree''' of 4. To calculate the number of links, add the number of vertical links, sqrt(p)(sqrt(p)-1), to the number of horizontal links, also sqrt(p)(sqrt(p)-1), to get 2sqrt(p)(sqrt(p)-1). The diameter is calculated by the distance between two diagonal nodes which is the sum of 2 edges of length sqrt(p)-1.
 

<h2>2-D Torus</h2>

[[Image:Top_2Dtorus.jpg|thumbnail|right|]]

Just as the ring is formed from the linear array, the 2-D torus takes the structure of the 2-D mesh and connects the nodes on opposite edges.
 
* ''Diameter:'' sqrt(p)-1
* ''Bisection BW:'' 2sqrt(p)
* ''# Links:'' 2p
* ''Degree:'' 4
 
With the end-around connections, the longest distance is cut in half and the bisection bandwidth doubles. Of course, the cost also rises: going from the 2-D mesh to the 2-D torus adds 2sqrt(p) wraparound links.
 

<h2>Cube</h2>
[[Image:Top_cube.jpg|thumbnail|right|]]

If we add two more neighbors to each node, we get a cube. The cube can be thought of as a three-dimensional mesh.
 
* ''Diameter:'' 3(p^(1/3)-1)
* ''Bisection BW:'' p^(2/3)
* ''# Links:'' 3p^(2/3)(p^(1/3)-1)
* ''Degree:'' 6 (from the inside nodes)
 
We can also extend some of the metrics to the N-dimensional mesh, with p^(1/N) nodes along each dimension:
 
* ''Diameter:'' N(p^(1/N)-1)
* ''Bisection BW:'' p^((N-1)/N)
* ''# Links:'' N*p^((N-1)/N)*(p^(1/N)-1), which reduces to N*2^N/2 when there are two nodes per dimension
 
The degree depends on a node's position: interior nodes have degree 2N, while boundary nodes have fewer links. The interior degree of the cube is therefore 6, which is larger than the degree of the two-dimensional mesh, which is 4.

<h2>Hypercube</h2>

[[Image:Top_hypercube.jpg|thumbnail|right|Hypercube]]

In the N-dimensional cube, the boundary nodes are normally the ones that hurt the performance of the entire network. We can fix this by connecting those boundary nodes together. The hypercube is essentially multiple cubes put together.
 
* ''Diameter:'' log2(p)
* ''Bisection BW:'' p/2
* ''# Links:'' p/2 * log2(p)
* ''Degree:'' log2(p)
 
As the metrics show, the diameter and bisection bandwidth are significantly improved for the higher-order topologies.

<h2>Tree</h2>

[[Image:Top_tree.jpg|thumbnail|right|Tree]]

The tree is a hierarchical structure with nodes at the bottom and switching nodes at the upper levels.
 
* ''Diameter:'' 2log2(p) -- the path from a leaf through the root to the farthest leaf on the other side
* ''Bisection BW:'' 1 -- breaking either link to the root bisects the tree
* ''# Links:'' 2(p-1)
* ''Degree:'' 3
 
The tree experiences high traffic at the upper levels. Since almost half of the messages need to go through the root node, the root becomes the bottleneck of the tree topology. The other disadvantage of the tree topology is that the bisection bandwidth is only 1.

<h2>Fat Tree</h2>

[[Image:Top_fat_tree.jpg|thumbnail|right|Fat Tree]]
In order to improve the performance of the tree topology, the fat tree alleviates the traffic at upper levels by "fattening" up the links at the upper levels.
 
* ''Diameter:'' 2log2(p)
* ''Bisection BW:'' p/2
* ''# Links:'' 2(p-1)
* ''Degree:'' 3
 
The fat tree relieves the pressure on the root node, and the bisection bandwidth is also increased.

<h2>Butterfly</h2>

[[Image:Top_butterfly.jpg|thumbnail|right|Butterfly]]
 

The butterfly structure is similar to the tree structure, but it replicates the switching node structure of the tree topology and connects them together so that there are equal links and routers at all levels.
* ''Diameter:'' 2log2(p)
* ''Bisection BW:'' p/2
* ''# Links:'' 2p*log2(p)
* ''Degree:'' 4

 

The butterfly matches the hypercube's bisection bandwidth. In terms of cost, the butterfly has a lower degree, but the hypercube has fewer links.

<h1>Real-World Implementation of Network Topologies </h1>

In a research study by Andy Hospodor and Ethan Miller, several network topologies were investigated in a high-performance, high-traffic network<sup>2</sup>. The topologies included the fat tree, butterfly, mesh, torus, and hypercube structures. Advantages and disadvantages including cost, performance, and reliability were discussed.

In this experiment, a petabyte-scale network with over 100 GB/s total aggregate bandwidth was investigated. The network consisted of 4096 disks and large servers, with routers and switches in between<sup>2</sup>.

[[Image:Disknet_network.jpg|frame|center|''Basic structure of Hospodor and Miller's experimental network''2]]

The overall structure of the network is shown above. Note that this structure is very susceptible to failure and congestion.

<h2>Fat Tree</h2>

In large-scale, high-performance applications, the fat tree can be a choice. However, in order to "fatten" up the links, redundant connections must be used: instead of one link between switching nodes, several are needed. The problem is that more input and output links require routers with more input and output ports. Routers with in excess of 100 ports are difficult to build and expensive, so multiple routers would have to be stacked together. Even then, several routers would be required, and they would be expensive<sup>2</sup>.
 
 
The Japan Agency for Marine-Earth Science and Technology supercomputing system uses the fat tree topology. The system connects 1280 NEC processors<sup>7</sup>.

 

<h2>Butterfly</h2>

In high-performance applications, the butterfly structure is a good choice. The butterfly topology uses fewer links than other topologies; however, each link carries traffic from the entire layer. Fault tolerance is poor because only a single path exists between any pair of nodes. Should a link break, data cannot be re-routed, and communication is broken<sup>2</sup>.

[[Image:Disknet_butterfly.jpg|frame|center|''Butterfly structure''2]]

<h2>Meshes and Tori</h2>

The mesh and torus structures used in this application would require a large number of links and an aggregate of several thousand ports. However, since there are so many links, the mesh and torus structures provide alternate paths in case of failures<sup>2</sup>.

[[Image:Disknet_mesh.jpg|frame|center|''Mesh structure''2]]

[[Image:Disknet_torus.jpg|frame|center|''Torus structure''2]]

One current example of the torus structure is the QPACE SFB TR Cluster in Germany, which uses PowerXCell 8i processors. The system uses a 3-D torus topology with 4608 processors<sup>7</sup>.

<h2>Hypercube</h2>

Similar to the torus structures, the hypercube requires a larger number of links. However, its bandwidth scales better than that of the mesh and torus structures.

The CRAY T3E, CRAY XT3, and SGI Origin 2000 use k-ary n-cube topologies.

[[Image:Disknet_hypercube.jpg|frame|center|''Hypercube structure''2]]

It is worth mentioning that even though many topologies perform much better than the 2-D mesh, the cost of these advanced topologies is also high. Since most chips are laid out in 2-D space, it is very expensive to implement a high-dimensional topology on a 2-D chip. For the hypercube topology, increasing the number of nodes raises the degree of every node. For the butterfly topology, the degree grows relatively slowly, but the required number of links and switches increases rapidly.

<h1>Comparison of Network Topologies </h1>


The following figure shows the total number of ports required for each network topology.
 
 
[[Image:Disknet_ports.jpg]]
 
''Number of ports for each topology''2

 

As the figure above shows, the 6-D hypercube requires the largest number of ports, due to its relatively complex six-dimensional structure. In contrast, the fat tree requires the fewest ports, even though its links have been "fattened" up with redundant links. The butterfly network requires more than twice as many ports as the fat tree, since it essentially replicates the switching layer of the fat tree. The number of ports for the mesh and torus structures increases as the dimensionality increases. However, with modern router technology, the number of ports is a less important consideration.

 
 
The average path length, or average number of hops, and the average link load (GB/s) are shown below.
 
 
[[Image:Disknet_load.jpg]]
 
''Average path length and link load for each topology''2

 

Looking at the trends, when the average path length is high, the average link load is also high; in other words, the two rise together. It is obvious from the graph that the 2-D mesh has, by far, the worst performance. In a large network such as this, its average path length is simply too high, and the average link load suffers. For this type of high-performance network, the 2-D mesh does not scale well. The 2-D torus cuts the average path length and average link load in half by connecting the edge nodes together; however, its performance is still relatively poor compared to the other types. The butterfly and fat tree have the lowest average path length and average link load.

 
 
The figure below shows the cost of the network topologies.
 
 
[[Image:Disknet_cost.jpg]]
 
''Cost of each topology''2

 

Despite using the fewest ports, the fat tree topology has the highest cost by far, because its ports are high-bandwidth 10 GB/s ports. Over 2400 ports of 10 GB/s are required to provide enough bandwidth at the upper levels of the tree. This pushes the cost up dramatically and makes the fat tree impractical from a cost standpoint. While the total cost of the fat tree is about 15 million dollars, the rest of the network topologies are clustered below 4 million dollars. As the dimensionality of the mesh and torus structures increases, the cost increases. The cost of the butterfly network falls between that of the 2-D mesh/torus and the 6-D hypercube.

 
 
When the cost and average link load are combined, the following graph is produced.
 
 
[[Image:Disknet_overall.jpg]]
 
''Overall cost of each topology''2

 

From the figure above, the 6-D hypercube is the most cost-effective choice for this particular network setup. Although the 6-D hypercube costs more because it needs more links and ports, it provides higher bandwidth, which can offset the higher cost. The high-dimensional tori also perform well, but cannot provide as much bandwidth as the 6-D hypercube. For systems that do not need as much bandwidth, a high-dimensional torus is also a good choice. The butterfly topology is another alternative, but has lower fault tolerance.

However, when the number of nodes increases, the relative cost of the higher-dimensional topologies increases far faster than their relative performance when compared to a 2-D mesh. This is because the 2-D mesh only uses low-cost, short links. The higher-dimensional structures must be projected onto our 3-dimensional world, and thus require many long, expensive links that wrap around the outside of the system like an impenetrable tangle of jungle vines. Maintaining such a network is also quite slow and tedious.


<h1>Routing</h1>


The '''routing''' algorithm determines what path a packet of data will take from source to destination. Routing can be '''deterministic''', where the path is always the same for a given source and destination, or '''adaptive''', where the path can change. The routing algorithm can also be '''partially adaptive''', where packets have multiple choices but not all packets are allowed to use the shortest path<sup>3</sup>.


<h2>Deadlock</h2>

Packets are in '''deadlock''' when they cannot continue to move through the nodes. The illustration below demonstrates this event.
 
 
[[Image:Routing_deadlock.jpg]]
 
''Example of deadlock''
 
 

Assume that all of the buffers are full at each node. The packet from Node 1 cannot continue to Node 2, the packet from Node 2 cannot continue to Node 3, and so on. Since no packet can move, the packets are deadlocked.

 
 

Deadlock arises from a cyclic routing pattern. To avoid deadlock, circular routing patterns must be avoided.
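One way to make the cyclic condition concrete is to record, for each node, which nodes its packets are waiting on, and then look for a cycle. The sketch below is only an illustration under that assumption: the <code>waits_for</code> mapping and the function name are hypothetical, and a real network would track dependencies between channels or buffers rather than whole nodes.

```python
def has_cycle(waits_for: dict[int, list[int]]) -> bool:
    """Return True if the wait-for relation contains a cycle (possible deadlock)."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    color: dict[int, int] = {}

    def visit(node: int) -> bool:
        color[node] = GRAY
        for nxt in waits_for.get(node, []):
            state = color.get(nxt, WHITE)
            if state == GRAY:          # reached a node already on this path
                return True
            if state == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color.get(n, WHITE) == WHITE and visit(n) for n in waits_for)


if __name__ == "__main__":
    # Four nodes, each waiting on the next one around a ring
    ring = {1: [2], 2: [3], 3: [4], 4: [1]}
    # Same nodes, but node 4 can always drain its packet
    chain = {1: [2], 2: [3], 3: [4], 4: []}
    print(has_cycle(ring), has_cycle(chain))
```

Removing any one dependency from the ring, as in <code>chain</code>, breaks the cycle; turn restrictions achieve this by forbidding one of the turns that would close the loop.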

 
 

To avoid circular patterns of routing, some routing patterns are disallowed. These are called '''turn restrictions''', where some turns are not allowed in order to avoid making a circular routing pattern. Some of these turn restrictions are mentioned below.


 
<h2>Dimensional ordered (X-Y) routing</h2>

Packets travel along the x-dimension first and then along the y-dimension. Turns from the y-dimension to the x-dimension are not allowed.
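A minimal sketch of dimension-ordered routing on a 2-D mesh, assuming nodes are addressed by (x, y) coordinates; the function name and coordinate convention are illustrative. The packet finishes all of its x-dimension hops before making any y-dimension hop, so it never turns from y back to x.

```python
def xy_route(src: tuple[int, int], dst: tuple[int, int]) -> list[tuple[int, int]]:
    """Return the list of nodes visited from src to dst under X-Y routing."""
    x, y = src
    path = [(x, y)]
    # Phase 1: correct the x coordinate
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    # Phase 2: correct the y coordinate (no further x moves are made)
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


if __name__ == "__main__":
    print(xy_route((0, 0), (2, 1)))
```

Because the path depends only on the source and destination, this is a deterministic routing algorithm.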


<h2>West First</h2>

Turns to the west are not allowed, so a packet that needs to travel west must do so first.


<h2>North Last</h2>

Turns after travelling in the north direction are not allowed, so a packet that needs to travel north must do so last.


<h2>Negative First</h2>

Turns from a positive direction (+x or +y) to a negative direction (-x or -y) are not allowed, so a packet makes all of its negative-direction moves first.


<h2>Odd-Even Turn Model</h2>


Unfortunately, the above turn-restriction models reduce the degree of adaptiveness and are only partially adaptive. The models cause some packets to take different routes, and not necessarily the minimal paths. This may cause unfairness and reduces the ability of the system to relieve congestion. Overall performance could suffer<sup>3</sup>.


 


Ge-Ming Chiu introduces the Odd-Even turn model as an adaptive, deadlock-free turn-restriction model that has better performance than the previously mentioned models<sup>3</sup>. The model is designed primarily for 2-D meshes.

 

''Turns from the east to north direction from any node on an even column are not allowed.''
 
''Turns from the north to west direction from any node on an odd column are not allowed.''
 
 
''Turns from the east to south direction from any node on an even column are not allowed.''
 
''Turns from the south to west direction from any node on an odd column are not allowed.''
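The four rules above can be written as a small lookup. This is a sketch under stated assumptions: columns are numbered from 0 (so column 0 counts as even), and a turn is described by the direction the packet is currently travelling and the direction it wants to take next. The function name is illustrative.

```python
# Turns forbidden at nodes in even columns and in odd columns, written as
# (current heading, new heading) using E, W, N, S.
FORBIDDEN_EVEN = {("E", "N"), ("E", "S")}
FORBIDDEN_ODD = {("N", "W"), ("S", "W")}


def turn_allowed(column: int, heading: str, new_heading: str) -> bool:
    """Return True if the odd-even rules permit this turn at the given column."""
    turn = (heading, new_heading)
    if column % 2 == 0:
        return turn not in FORBIDDEN_EVEN
    return turn not in FORBIDDEN_ODD


if __name__ == "__main__":
    print(turn_allowed(2, "E", "N"))  # east-to-north in an even column
    print(turn_allowed(3, "E", "N"))  # same turn in an odd column
```

Unlike the west-first or north-last models, the restriction here depends on where the packet is, not only on which turn it makes.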
 
 
 
The illustration below shows allowed routing for different source and destination nodes. Depending on which column the packet is in, only certain directions are allowed.
 
 
[[Image:Routing_odd_even.jpg]]
 
''Odd-Even turn restriction model proposed by Ge-Ming Chiu''3
 

 

<h1>Comparison of Turn Restriction Models</h1>

To compare the performance of various turn restriction models, Chiu simulated a 15 x 15 mesh under various traffic patterns. All channels have a bandwidth of 20 flits/usec and a buffer size of one flit. The dimension-ordered x-y routing, west-first, and negative-first models were compared against the odd-even model.

 

The simulated traffic patterns were uniform, transpose, and hot spot. Uniform traffic simulates each node sending messages to any other node with equal probability. Transpose simulates two opposite nodes sending messages to their respective halves of the mesh. Hot spot simulates a few "hot spot" nodes that receive high traffic.



 
 
[[Image:Routing_uniform.jpg]]
 
''Uniform traffic simulation of various turn restriction models''3
 
 
The performance of the different routing algorithms is shown above for uniform traffic. For uniform traffic, the dimension-ordered x-y model outperforms the rest of the models. As the number of messages increases, the x-y model has the slowest increase in average communication latency.



 
 
[[Image:Routing_transpose.jpg]]
 
''First transpose traffic simulation of various turn restriction models''3
 
 
The performance of the different routing algorithms is shown above for the first transpose traffic. The negative-first model has the best performance, while the odd-even model performs better than the west-first and x-y models.



 
 
[[Image:Routing_transpose2.jpg]]
 
''Second transpose traffic simulation of various turn restriction models''3
 
 
With the second transpose simulation, the odd-even model outperforms the rest.



 
 
[[Image:Routing_hotspot.jpg]]
 
''Hotspot traffic simulation of various turn restriction models''3
 
 
The performance of the different routing algorithms is shown above for the hotspot traffic. Only one hotspot was simulated for this test. The odd-even model outperforms the other models when hotspot traffic is 10%.



 
 
[[Image:Routing_hotspot2.jpg]]
 
''Second hotspot traffic simulation of various turn restriction models''3
 
 
When the number of hotspots is increased to five, the odd-even model begins to shine. Its latency is the lowest at both 6 and 8 percent hotspot traffic. Meanwhile, the x-y model performs very poorly.


 
 

While the x-y model performs well in uniform traffic, it lacks adaptiveness. When traffic concentrates on hotspots, the x-y model suffers from its inability to adapt and re-route traffic around the resulting congestion. The odd-even model has superior adaptiveness under high congestion.


 

<h1>Router Architecture</h1>

The '''router''' is a device that routes incoming data to its destination. It does this by having several input ports and several output ports. Data arriving on one of the input ports is routed to one of the output ports. Which output port is chosen depends on the destination of the data and on the routing algorithm.

 
 

The internal architecture of a router consists of input and output ports and a '''crossbar switch'''. The crossbar switch selects which output each input is connected to, acting essentially as a multiplexer.
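The multiplexer view of the crossbar can be sketched as a mapping from output ports to the input each one is currently connected to. The arbitration rule used here (the lowest-numbered input wins when two inputs request the same output) is purely an assumption for illustration; the source does not describe how conflicts are resolved, and the function name is hypothetical.

```python
def crossbar_grant(requests: dict[int, int]) -> dict[int, int]:
    """Decide which input drives each output for one switching cycle.

    `requests` maps input port -> requested output port.
    Returns output port -> granted input port. Each output accepts
    at most one input; losing inputs must retry in a later cycle.
    """
    grants: dict[int, int] = {}
    for inp in sorted(requests):      # fixed priority: lowest input first
        out = requests[inp]
        grants.setdefault(out, inp)   # keep the first requester only
    return grants


if __name__ == "__main__":
    # Inputs 0 and 1 both want output 2; input 2 wants output 0
    print(crossbar_grant({0: 2, 1: 2, 2: 0}))
```

Inputs that request different outputs are all granted in the same cycle, which is what lets a crossbar carry several messages at once.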

 
 

Router technology has improved significantly over the years. This has allowed networks with high dimensionality to become feasible. As shown in the real-world example above, high-dimensional tori and hypercubes are excellent choices of topology for high-performance networks. The cost of high-performance, high-radix routers has contributed to the viability of these high-dimensionality networks. As the graph below shows, the bandwidth of routers has improved tremendously over a period of 10 years<sup>4</sup>.


 
[[Image:Router_bandwidth.jpg]]
 
''Bandwidth of various routers over 10 year period''4

 

Looking at the physical architecture and layout of routers, it is evident that the circuitry has become dramatically denser and more complex.


 
[[Image:Router_physical.jpg]]
 
''Router hardware over period of time''4
 

 
[[Image:Router_radix.jpg]]
 
''Radix and latency of routers over 10 year period''4
 
 

The '''radix''', or number of ports, of routers has also increased. Current technology offers not only high radix but also lower latency than the previous generation. As radix increases, the latency remains steady.


 


With high-performance routers, complex topologies are possible. As router technology improves, even more complex, higher-dimensionality topologies become feasible.


 

<h1>Fault Tolerant Routing</h1>

Fault-tolerant routing means the successful routing of messages between any pair of non-faulty nodes in the presence of faulty components<sup>6</sup>. With the increased number of processors in a multiprocessor system and high data rates, reliable transmission of data in the event of a network fault is of great concern, and hence fault-tolerant routing algorithms are important.


<h2>Fault Models</h2>

Faults in a network can be categorized into two types:
 
 
1. '''Transient Faults'''<sup>5</sup>: A transient fault is a temporary fault that occurs for a very short duration of time. It can be caused by a change in the output of a flip-flop that leads to the generation of an invalid header. These faults can be minimized using error control coding. They are generally evaluated in terms of bit error rate.
 
 
2. '''Permanent Faults'''<sup>5</sup>: A permanent fault is a fault that does not go away and causes permanent damage to the network. It could be due to damaged wires and associated circuitry. These faults are generally evaluated in terms of mean time between failures.


<h2>Fault Tolerance Mechanisms (for permanent faults)</h2>

Permanent faults can be handled using one of two mechanisms:
 
 
1. '''Static Mechanism''': In the static fault tolerance model, once a fault is detected, all the processes running in the system are stopped and the routing tables are emptied. Based on the fault information, the routing tables are re-calculated to provide a fault-free path.
 
 
2. '''Dynamic Mechanisms''': In the dynamic fault tolerance model, the processes in the network are not completely stalled, and only the affected regions are repaired. Some of the methods to do this are:
 
 
a. '''Block Faults''': In this method, many of the healthy nodes in the vicinity of the faulty nodes are also marked as faulty, so that no routes are created close to the actual faulty nodes. The shape of the region can be convex or non-convex, and care is taken that none of the new routes introduces a cyclic dependency in the channel dependency graph (CDG).
 
 
Disadvantage: This method causes many healthy nodes to be declared faulty, which reduces system capacity.
 
 
[[Image:Fault_pic1.jpg]]
 
 
b. '''Fault Rings''': This method was introduced by Chalasani and Boppana. A fault ring is a set of nodes and links that are adjacent to faulty nodes/links. This approach reduces the number of healthy nodes that must be marked as faulty and blocked.
 
 
[[Image:Fault_pic2.jpg]]
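As a rough illustration of the fault ring idea, the sketch below lists the healthy nodes that border a rectangular faulty region in a 2-D mesh. It is a simplified reading of the description above, not Chalasani and Boppana's algorithm: the rectangular region, the coordinate convention, and the function name are all assumptions made for the example.

```python
def fault_ring(x_min: int, y_min: int, x_max: int, y_max: int,
               width: int, height: int) -> set[tuple[int, int]]:
    """Healthy nodes adjacent to the faulty rectangle [x_min..x_max] x [y_min..y_max].

    The mesh is width x height with coordinates starting at 0. Nodes that
    would fall outside the mesh are skipped, so a fault on the mesh edge
    yields an open chain rather than a closed ring.
    """
    ring = set()
    for x in range(x_min - 1, x_max + 2):
        for y in range(y_min - 1, y_max + 2):
            inside_fault = x_min <= x <= x_max and y_min <= y <= y_max
            inside_mesh = 0 <= x < width and 0 <= y < height
            if inside_mesh and not inside_fault:
                ring.add((x, y))
    return ring


if __name__ == "__main__":
    # A single faulty node at (2, 2) in a 5 x 5 mesh
    print(sorted(fault_ring(2, 2, 2, 2, 5, 5)))
```

Only the nodes on the ring need to know about the fault; the rest of the mesh stays usable, in contrast to the block fault approach.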
 


 

<h1>References</h1>


1 Solihin text
 
2 [http://www.ssrc.ucsc.edu/Papers/hospodor-mss04.pdf Interconnection Architectures for Petabyte-Scale High-Performance Storage Systems]
 
3 [http://www.diit.unict.it/~vcatania/COURSES/semm_05-06/DOWNLOAD/noc_routing02.pdf The Odd-Even Turn Model for Adaptive Routing]
 
4 [http://www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf Interconnection Topologies:(Historical Trends and Comparisons)]
This link is down at this time. [http://webcache.googleusercontent.com/search?q=cache:2f2KNFWJCsQJ:www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf+history+of+interconnection+topologies&cd=9&hl=en&ct=clnk&gl=us&client=ubuntu&source=www.google.com Google Cache]
 
5 [http://dspace.upv.es/xmlui/bitstream/handle/10251/2603/tesisUPV2824.pdf?sequence=1 Efficient mechanisms to provide fault tolerance in interconnection networks for PC clusters, José Miguel Montañana Aliaga.]
 
6 [http://web.ebscohost.com.www.lib.ncsu.edu:2048/ehost/pdfviewer/pdfviewer?vid=2&hid=15&sid=72e3828d-3cb1-42b9-8198-5c1e974ea53f@sessionmgr4 Adaptive Fault Tolerant Routing Algorithm for Tree-Hypercube Multicomputer, Qatawneh Mohammad]
 
7 [http://www.top500.org TOP500 Supercomputing Sites]
 
8 [http://www.redbooks.ibm.com/abstracts/sg245161.html?Open Understanding and Using the SP Switch]
 
9 [http://www.myri.com/myrinet/overview/ Myrinet Overview]
 
10 [http://en.wikipedia.org/wiki/QsNet QsNet (Quadrics' network)]
 
11 [http://www.google.com/research/pubs/pub35155.html Dragonfly Topology]
 
12 [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.95.573&rep=rep1&type=pdf Flattened Butterfly Topology]

CSC/ECE 506 Spring 2011/ch12 aj

2011-04-26T03:41:08Z

Asbransc:

<h1>Interconnection Network Architecture </h1>


In a multi-processor system, processors need to communicate with each other and access each other's resources. In order to route data and messages between processors, an interconnection architecture is needed.


 


Typically, in a multiprocessor system, messages passed between processors are frequent and short<sup>1</sup>. Therefore, the interconnection network architecture must handle messages quickly by having '''low latency''', and must handle several messages at a time by having '''high bandwidth'''.


 


In a network, a processor along with its cache and memory is considered a '''node'''. The physical wires that connect nodes are called '''links'''. The device that routes messages between nodes is called a '''router'''. The shape of the network, such as the number of links and routers, is called the network '''topology'''.


<h1>History of Network Topologies</h1>

Hypercube topologies were invented in the 80s and had desirable characteristics when the number of nodes was small (~1000 maximum, often fewer than 100) and every processor had to stop working to receive and forward a message<sup>4</sup>. The low-radix era began in 1985 and was defined by routers with between 4 and 8 ports using toroidal, mesh or fat-tree topologies and wormhole routing. This era lasted about 20 years until it was determined that routers with dozens of ports offered superior performance. Two topologies were developed to take advantage of the newly developed high-radix routers: the flattened butterfly and the dragonfly. Both fall somewhere between a mesh in which each point is a router (or a virtual router in the case of dragonfly) with dozens or hundreds of nodes attached, and a fat tree with sufficiently high arity as to have only two levels.

<h2>Interconnection evolution in the Top500 List</h2>
[[Image:Top500interconnect.png|thumbnail|300px|left|]]

This chart shows the evolution over time of the different interconnect topologies by their dominance in the Top500 list of supercomputers<sup>7</sup>. As one can see, many technologies came into vogue briefly before losing performance share and disappearing. In the early days of the list, most of the computers report the interconnect type as not applicable. However, the trailing end of the hypercube phase is clear in burnt orange. The dark blue at the top is "other" and the dark red in the middle is "proprietary", so we can only speculate about what topologies they might employ. The toroidal mesh appears briefly at the start in a cream color, and slightly outlasts the hypercube. The two crossbar technologies (blue and olive) followed the toroidal mesh. The fully-distributed crossbar died out quickly, while the multi-stage crossbar lasted longer but was never dominant. The 3-D torus (purple) dominates much of the 90s, with hypercube topologies (dark pink) enjoying a short comeback in the later part of the decade. SP Switch (light olive), an IBM interconnect technology which uses a multi-stage crossbar switch, replaced the 3-D torus<sup>8</sup>. Myrinet, Quadrics, and Federation all shared the spotlight in the mid 00s, and each used a similar fat-tree topology<sup>9,10</sup>. The current class of supercomputers is dominated by nodes connected with either Infiniband or gigabit ethernet. Both can be connected in either a fat-tree or 2-D mesh topology. The primary difference between them is speed. Infiniband is considerably faster per link and allows links to be ganged into groups of 4 or 12. Gigabit ethernet is vastly less expensive, however, and some supercomputer designers have apparently chosen to save money on the interconnect technology in order to allow the use of faster nodes<sup>11,12</sup>.

 
 
 
 

<h1>Types of Network Topologies </h1>

Several metrics are normally chosen to represent the cost and performance of a given topology. In this section, the degree, number of links, diameter, and bisection bandwidth are calculated for each topology.

<h2>Linear Array</h2>

[[Image:Top_linear.jpg|thumbnail|frame|right|]]

The nodes are connected linearly as in an array. This type of topology is simple; however, it does not scale well.
 
* ''Diameter:'' p-1
* ''Bisection BW:'' 1
* ''# Links:'' p-1
* ''Degree:'' 2
 
A linear array is the cheapest way to connect a group of nodes together. Its number of links and its degree are the smallest of any topology. However, the drawback of this topology is also obvious: the two end points are the farthest apart, which makes the diameter p-1. This topology is also not reliable, since the bisection bandwidth is 1.

<h2>Ring</h2>
[[Image:Top_ring.jpg|thumbnail|right|]]
The ring has a similar structure to the linear array, except that the two end nodes connect to each other, establishing a circular structure.
 
* ''Diameter:'' p/2
* ''Bisection BW:'' 2
* ''# Links:'' p
* ''Degree:'' 2
 
Compared with the linear array, the ring topology gains a relatively big improvement for very little effort (only one added link). The longest distance between two nodes is cut in half, and the bisection bandwidth increases to 2.
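The effect of the single extra link can be checked by building both topologies as adjacency lists and measuring the diameter with a breadth-first search. This is a small sketch with illustrative function names; nodes are numbered 0 to p-1.

```python
from collections import deque


def linear_array(p: int) -> dict[int, list[int]]:
    """Adjacency list of a p-node linear array."""
    return {i: [j for j in (i - 1, i + 1) if 0 <= j < p] for i in range(p)}


def ring(p: int) -> dict[int, list[int]]:
    """Adjacency list of a p-node ring (assumes p >= 3)."""
    return {i: [(i - 1) % p, (i + 1) % p] for i in range(p)}


def diameter(adj: dict[int, list[int]]) -> int:
    """Longest shortest-path distance, found by a BFS from every node."""
    best = 0
    for start in adj:
        dist = {start: 0}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        best = max(best, max(dist.values()))
    return best


def link_count(adj: dict[int, list[int]]) -> int:
    """Each link appears in two adjacency lists."""
    return sum(len(neighbors) for neighbors in adj.values()) // 2


if __name__ == "__main__":
    for name, topo in (("linear", linear_array(8)), ("ring", ring(8))):
        print(name, link_count(topo), diameter(topo))
```

For p = 8 the linear array has 7 links and a diameter of 7, while the ring has 8 links and a diameter of 4.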

<h2>2-D Mesh</h2>

[[Image:Top_2Dmesh.jpg|thumbnail|right|]]

The 2-D mesh can be thought of as several linear arrays put together to form a 2-dimensional structure. This topology is well suited to applications such as ocean simulation and matrix calculations.
 
* ''Diameter:'' 2(sqrt(p)-1)
* ''Bisection BW:'' sqrt(p)
* ''# Links:'' 2sqrt(p)(sqrt(p)-1)
* ''Degree:'' 4
 
Nodes that are not on the edge have a '''degree''' of 4. To calculate the number of links, add the number of vertical links, sqrt(p)(sqrt(p)-1), to the number of horizontal links, also sqrt(p)(sqrt(p)-1), to get 2sqrt(p)(sqrt(p)-1). The diameter is calculated by the distance between two diagonal nodes which is the sum of 2 edges of length sqrt(p)-1.
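The counting argument above can be checked by enumerating the links of a small mesh directly. This is a minimal sketch, assuming a square mesh with k = sqrt(p) nodes per side and (x, y) node coordinates; the function names are illustrative.

```python
Node = tuple[int, int]


def mesh_links(k: int) -> list[tuple[Node, Node]]:
    """All links of a k x k mesh, each listed once."""
    links = []
    for x in range(k):
        for y in range(k):
            if x + 1 < k:
                links.append(((x, y), (x + 1, y)))  # horizontal link
            if y + 1 < k:
                links.append(((x, y), (x, y + 1)))  # vertical link
    return links


def degree_of(node: Node, links: list[tuple[Node, Node]]) -> int:
    """Number of links that touch the given node."""
    return sum(node in link for link in links)


def bisection_links(k: int, links: list[tuple[Node, Node]]) -> int:
    """Links cut when the mesh is split into left and right halves (k even)."""
    half = k // 2
    return sum(1 for a, b in links if a[0] < half <= b[0])


if __name__ == "__main__":
    k = 4  # p = 16
    links = mesh_links(k)
    print(len(links), 2 * k * (k - 1))             # enumerated vs. formula
    print(degree_of((1, 1), links), degree_of((0, 0), links))
    print(bisection_links(k, links))
```

For p = 16 the enumeration gives 24 links, a degree of 4 for interior nodes (2 at the corners), and 4 links across the bisection.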
 

<h2>2-D Torus</h2>

[[Image:Top_2Dtorus.jpg|thumbnail|right|]]

Using the same trick that turned the linear array into a ring, the 2-D torus takes the structure of the 2-D mesh and connects the nodes on opposite edges.
 
* ''Diameter:'' sqrt(p)-1
* ''Bisection BW:'' 2sqrt(p)
* ''# Links:'' 2p
* ''Degree:'' 4
 
With the end-around connections, the longest distance is cut and the bisection bandwidth is doubled. Of course, the cost also increases, since 2sqrt(p) wrap-around links are added to the 2-D mesh.
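The wrap-around links are easy to express with modular arithmetic. The sketch below assumes a square torus with k = sqrt(p) nodes per side and measures the diameter by breadth-first search; the function names are illustrative. For an odd side length the measured diameter equals sqrt(p)-1, while for an even side length the search gives sqrt(p).

```python
from collections import deque


def torus_neighbors(x: int, y: int, k: int) -> list[tuple[int, int]]:
    """The four neighbors of (x, y) in a k x k torus; edges wrap around."""
    return [((x + 1) % k, y), ((x - 1) % k, y),
            (x, (y + 1) % k), (x, (y - 1) % k)]


def torus_diameter(k: int) -> int:
    """Longest shortest path in a k x k torus.

    Every node of a torus looks the same, so a single BFS from (0, 0)
    is enough to find the largest distance.
    """
    dist = {(0, 0): 0}
    queue = deque([(0, 0)])
    while queue:
        x, y = queue.popleft()
        for node in torus_neighbors(x, y, k):
            if node not in dist:
                dist[node] = dist[(x, y)] + 1
                queue.append(node)
    return max(dist.values())


if __name__ == "__main__":
    print(torus_neighbors(0, 0, 4))            # corner node still has 4 neighbors
    print(torus_diameter(5), torus_diameter(4))
```

Because every node has exactly four links and each link is shared by two nodes, the link count is 4p/2 = 2p.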
 

<h2>Cube</h2>
[[Image:Top_cube.jpg|thumbnail|right|]]

If we add two more neighbors to each node, we get a cube. The cube can be thought of as a three-dimensional mesh.
 
* ''Diameter:'' 3(p<sup>1/3</sup>-1)
* ''Bisection BW:'' p<sup>2/3</sup>
* ''# Links:'' 3p<sup>2/3</sup>(p<sup>1/3</sup>-1)
* ''Degree:'' 6 (from the inside nodes)
 
We can also extend some of the metrics to the N-dimensional mesh:
 
* ''Diameter:'' N(p<sup>1/N</sup>-1)
* ''Bisection BW:'' p<sup>(N-1)/N</sup>
* ''# Links:'' Np<sup>(N-1)/N</sup>(p<sup>1/N</sup>-1)
 
Interior nodes have a degree of 2N, since each has two neighbors along every dimension; nodes on the boundary have fewer. The cube, for example, has a degree of 6, which is larger than the two-dimensional mesh's degree of 4.

<h2>Hypercube</h2>

[[Image:Top_hypercube.jpg|thumbnail|right|Hypercube]]

In the N-dimensional cube, the boundary nodes are normally the ones that hurt the performance of the entire network. We can fix this by connecting those boundary nodes together. The hypercube is essentially multiple cubes put together.
 
* ''Diameter:'' log2(p)
* ''Bisection BW:'' p/2
* ''# Links:'' p/2 * log2(p)
* ''Degree:'' log2(p)
 
The metrics show that the diameter and bisection bandwidth improve significantly for the higher-order topologies.

<h2>Tree</h2>

[[Image:Top_tree.jpg|thumbnail|right|Tree]]

The tree is a hierarchical structure with nodes at the bottom and switching nodes at the upper levels.
 
* ''Diameter:'' 2log2(p)
* ''Bisection BW:'' 1
* ''# Links:'' 2(p-1)
* ''Degree:'' 3
 
The tree experiences high traffic at the upper levels. Since almost half of the messages need to go through the root node, the root becomes the bottleneck of the tree topology. The other disadvantage of the tree topology is that the bisection bandwidth is only 1.

<h2>Fat Tree</h2>

[[Image:Top_fat_tree.jpg|thumbnail|right|Fat Tree]]
In order to improve the performance of the tree topology, the fat tree alleviates the traffic at upper levels by "fattening" up the links at the upper levels.
 
* ''Diameter:'' 2log2(p)
* ''Bisection BW:'' p/2
* ''# Links:'' 2(p-1)
* ''Degree:'' 3
 
The fat tree relieves the pressure on the root node, and the bisection bandwidth is also increased.

<h2>Butterfly</h2>

[[Image:Top_butterfly.jpg|thumbnail|right|Butterfly]]
 

The butterfly structure is similar to the tree structure, but it replicates the switching node structure of the tree topology and connects them together so that there are equal links and routers at all levels.
* ''Diameter:'' 2log2(p)
* ''Bisection BW:'' p/2
* ''# Links:'' 2p*log2(p)
* ''Degree:'' 4

 

The butterfly matches the hypercube's bisection bandwidth. In terms of cost, the butterfly has a lower degree, but the hypercube has fewer links.

<h1>Real-World Implementation of Network Topologies </h1>

In a research study by Andy Hospodor and Ethan Miller, several network topologies were investigated in a high-performance, high-traffic network<sup>2</sup>. The topologies included the fat tree, butterfly, mesh, torus, and hypercube structures. Advantages and disadvantages including cost, performance, and reliability were discussed.

In this experiment, a petabyte-scale network with over 100 GB/s total aggregate bandwidth was investigated. The network consisted of 4096 disks and large servers, with routers and switches in between<sup>2</sup>.

[[Image:Disknet_network.jpg|frame|center|''Basic structure of Hospodor and Miller's experimental network''2]]

The overall structure of the network is shown above. Note that this structure is very susceptible to failure and congestion.

<h2>Fat Tree</h2>

In large-scale, high-performance applications, the fat tree can be a choice. However, in order to "fatten" up the links, redundant connections must be used: instead of one link between switching nodes, several are needed. The problem is that more input and output links require routers with more input and output ports. Routers with in excess of 100 ports are difficult to build and expensive, so multiple routers would have to be stacked together. Even then, several routers would be required, and they would be expensive<sup>2</sup>.
 
 
The Japan Agency for Marine-Earth Science and Technology supercomputing system uses the fat tree topology. The system connects 1280 NEC processors<sup>7</sup>.

 

<h2>Butterfly</h2>

In high-performance applications, the butterfly structure is a good choice. The butterfly topology uses fewer links than other topologies; however, each link carries traffic from the entire layer. Fault tolerance is poor because only a single path exists between any pair of nodes. Should a link break, data cannot be re-routed, and communication is broken<sup>2</sup>.

[[Image:Disknet_butterfly.jpg|frame|center|''Butterfly structure''2]]

<h2>Meshes and Tori</h2>

The mesh and torus structures used in this application would require a large number of links and an aggregate of several thousand ports. However, since there are so many links, the mesh and torus structures provide alternate paths in case of failures<sup>2</sup>.

[[Image:Disknet_mesh.jpg|frame|center|''Mesh structure''2]]

[[Image:Disknet_torus.jpg|frame|center|''Torus structure''2]]

One current example of the torus structure is the QPACE SFB TR Cluster in Germany, which uses PowerXCell 8i processors. The system uses a 3-D torus topology with 4608 processors<sup>7</sup>.

<h2>Hypercube</h2>

Similar to the torus structures, the hypercube requires a larger number of links. However, its bandwidth scales better than that of the mesh and torus structures.

The CRAY T3E, CRAY XT3, and SGI Origin 2000 use k-ary n-cube topologies.

[[Image:Disknet_hypercube.jpg|frame|center|''Hypercube structure''2]]

It is worth mentioning that even though many topologies perform much better than the 2-D mesh, the cost of these advanced topologies is also high. Since most chips are laid out in 2-D space, it is very expensive to implement a high-dimensional topology on a 2-D chip. For the hypercube topology, increasing the number of nodes raises the degree of every node. For the butterfly topology, the degree grows relatively slowly, but the required number of links and switches increases rapidly.

<h1>Comparison of Network Topologies </h1>


The following figure shows the total number of ports required for each network topology.
 
 
[[Image:Disknet_ports.jpg]]
 
''Number of ports for each topology''2

 

As the figure above shows, the 6-D hypercube requires the largest number of ports, due to its relatively complex six-dimensional structure. In contrast, the fat tree requires the fewest ports, even though its links have been "fattened" up by using redundant links. The butterfly network requires more than twice as many ports as the fat tree, since it essentially replicates the switching layer of the fat tree. The number of ports for the mesh and torus structures increases as the dimensionality increases. However, with modern router technology, the number of ports is a less important consideration.

 
 
The average path length, or average number of hops, and the average link load (GB/s) are shown below.
 
 
[[Image:Disknet_load.jpg]]
 
''Average path length and link load for each topology''2

 

Looking at the trends, when the average path length is high, the average link load is also high. In other words, average path length and average link load are roughly proportional. It is obvious from the graph that the 2-D mesh has, by far, the worst performance. In a large network such as this, the average path length is just too high, and the average link load suffers. For this type of high-performance network, the 2-D mesh does not scale well. The 2-D torus cuts the average path length and average link load in half by connecting the edge nodes together; however, its performance is still relatively poor compared to the other types. The butterfly and fat tree have the lowest average path length and average link load.
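To make the notion of average path length concrete, the following Python sketch counts hops by breadth-first search in a small k x k mesh and a k x k torus. It is only an illustration under assumed conditions (a uniform all-to-all pattern and an 8 x 8 grid chosen for the example); it does not reproduce the disk network or the traffic model from the study, and the function names are hypothetical.

```python
from collections import deque

def neighbors(node, k, wrap):
    """Yield the neighbors of (x, y) in a k x k mesh (wrap=False) or torus (wrap=True)."""
    x, y = node
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nx, ny = x + dx, y + dy
        if wrap:
            yield nx % k, ny % k          # edge nodes connect around to the other side
        elif 0 <= nx < k and 0 <= ny < k:
            yield nx, ny                  # mesh: no link past the edge

def average_path_length(k, wrap):
    """Average hop count over all ordered pairs of distinct nodes, found by BFS."""
    nodes = [(x, y) for x in range(k) for y in range(k)]
    total = 0
    for src in nodes:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            cur = queue.popleft()
            for nxt in neighbors(cur, k, wrap):
                if nxt not in dist:
                    dist[nxt] = dist[cur] + 1
                    queue.append(nxt)
        total += sum(dist.values())
    return total / (len(nodes) * (len(nodes) - 1))

mesh_avg = average_path_length(8, wrap=False)
torus_avg = average_path_length(8, wrap=True)
print(f"8x8 mesh: {mesh_avg:.2f} hops, 8x8 torus: {torus_avg:.2f} hops")
```

In this small example the torus average is lower than the mesh average because the wrap-around links shorten the longest routes.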

 
 
The figure below shows the cost of the network topologies.
 
 
[[Image:Disknet_cost.jpg]]
 
''Cost of each topology''2

 

Despite using the fewest ports, the fat tree topology has the highest cost, by far. The ports it uses are high-bandwidth ports of 10 GB/s, and over 2400 of them are required to provide enough bandwidth at the upper levels of the tree. This pushes the cost up dramatically and makes the fat tree impractical from a cost standpoint. While the total cost of the fat tree is about 15 million dollars, the rest of the network topologies are clustered below 4 million dollars. As the dimensionality of the mesh and torus structures increases, the cost increases. The cost of the butterfly network falls between that of the 2-D mesh/torus and the 6-D hypercube.

 
 
When cost and average link load are factored together, the following graph is produced.
 
 
[[Image:Disknet_overall.jpg]]
 
''Overall cost of each topology''2

 

From the figure above, the 6-D hypercube is the most cost-effective choice for this particular network setup. Although the 6-D hypercube costs more because it needs more links and ports, it provides higher bandwidth, which can offset the higher cost. The high-dimensional tori also perform well, but cannot provide as much bandwidth as the 6-D hypercube. For systems that do not need as much bandwidth, a high-dimensional torus is also a good choice. The butterfly topology is another alternative, but it has lower fault tolerance.

However, when the number of nodes increases, the relative cost of the higher-dimensional topologies increases far faster than their relative performance when compared to a 2-D mesh. This is because the 2-D mesh only uses low-cost, short links. The higher-dimensional structures must be projected onto our 3-dimensional world, and thus require many long, expensive links that wrap around the outside of the system like an impenetrable tangle of jungle vines. Maintaining such a network is also quite slow and tedious.


<h1>Routing</h1>


The '''routing''' algorithm determines what path a packet of data will take from source to destination. Routing can be '''deterministic''', where the path is always the same for a given source and destination, or '''adaptive''', where the path can change. The routing algorithm can also be '''partially adaptive''', where packets have multiple choices, but not all packets are allowed to use the shortest path3.


<h2>Deadlock</h2>

Packets are in '''deadlock''' when they cannot continue to move through the nodes. The illustration below demonstrates this event.
 
 
[[Image:Routing_deadlock.jpg]]
 
''Example of deadlock''
 
 

Assume that all of the buffers are full at each node. The packet from Node 1 cannot continue to Node 2. The packet from Node 2 cannot continue to Node 3, and so on. Since none of the packets can move, they are deadlocked.
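The situation can be modelled as a "waits-for" graph in which each node points at the node whose buffer it needs next; a deadlock is possible exactly when that graph contains a cycle. The sketch below is a minimal illustration in Python, assuming four nodes arranged in a ring (the node count and the dictionary encoding are assumptions made for the example, not details taken from the illustration).

```python
def has_cycle(waits_for):
    """Return True if the waits-for graph contains a cycle (a possible deadlock).

    waits_for maps each node to the list of nodes whose buffers it is waiting on.
    """
    WHITE, GRAY, BLACK = 0, 1, 2          # unvisited, on the current DFS path, finished
    color = {n: WHITE for n in waits_for}

    def visit(n):
        color[n] = GRAY
        for m in waits_for.get(n, ()):
            state = color.get(m, WHITE)
            if state == GRAY:             # reached a node already on the path: cycle
                return True
            if state == WHITE and visit(m):
                return True
        color[n] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in waits_for)

# Every node waits on the next one around the ring: deadlock.
print(has_cycle({1: [2], 2: [3], 3: [4], 4: [1]}))
# Node 4 can drain its packet, so the chain eventually clears.
print(has_cycle({1: [2], 2: [3], 3: [4], 4: []}))
```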

 
 

Deadlock arises from a cyclic pattern of routing. To avoid deadlock, circular routing patterns must be avoided.

 
 

To avoid circular patterns of routing, some routing patterns are disallowed. These are called '''turn restrictions''', where some turns are not allowed in order to avoid making a circular routing pattern. Some of these turn restrictions are mentioned below.


 
<h2>Dimensional ordered (X-Y) routing</h2>

Turns from the y-dimension to the x-dimension are not allowed.
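As a rough sketch of this rule, the Python function below routes a packet on a 2-D mesh by first correcting the x coordinate and only then the y coordinate, so a turn from the y-dimension back to the x-dimension never occurs. The coordinate tuples and the function name are assumptions made for illustration.

```python
def xy_route(src, dst):
    """Return the list of nodes visited from src to dst, moving in x first, then in y."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:                    # finish all movement in the x-dimension
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:                    # only then move in the y-dimension
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path

print(xy_route((0, 0), (2, 1)))   # [(0, 0), (1, 0), (2, 0), (2, 1)]
```

Because the path is fully determined by the source and destination, this is a deterministic routing algorithm.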


<h2>West First</h2>

Turns to the west are not allowed.


<h2>North Last</h2>

Turns after a north direction are not allowed.


<h2>Negative First</h2>

Turns in the negative direction (-x or -y) are not allowed, except on the first turn.


<h2>Odd-Even Turn Model</h2>


Unfortunately, the above turn-restriction models reduce the degree of adaptiveness and are only partially adaptive. The models cause some packets to take different routes, and not necessarily the minimal paths. This may cause unfairness and reduces the ability of the system to relieve congestion. Overall performance could suffer3.


 


Ge-Ming Chiu introduces the Odd-Even turn model as an adaptive, deadlock-free turn-restriction model that has better performance than the previously mentioned models3. The model is designed primarily for 2-D meshes.

 

''Turns from the east to north direction from any node on an even column are not allowed.''
 
''Turns from the north to west direction from any node on an odd column are not allowed.''
 
 
''Turns from the east to south direction from any node on an even column are not allowed.''
 
''Turns from the south to west direction from any node on an odd column are not allowed.''
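The four rules above can be written as a small lookup. The following Python sketch is only an illustration of the rules as stated here: directions are the packet's direction of travel before and after the turn ("E", "W", "N", "S"), and columns are assumed to be numbered from 0, which is a convention chosen for the example.

```python
# Turns forbidden at nodes in even columns and in odd columns, as (incoming, outgoing).
PROHIBITED_EVEN = {("E", "N"), ("E", "S")}
PROHIBITED_ODD = {("N", "W"), ("S", "W")}

def turn_allowed(column, incoming, outgoing):
    """Return True if a packet travelling `incoming` may leave travelling `outgoing`."""
    if incoming == outgoing:              # going straight is not a turn
        return True
    banned = PROHIBITED_EVEN if column % 2 == 0 else PROHIBITED_ODD
    return (incoming, outgoing) not in banned

print(turn_allowed(2, "E", "N"))   # False: east-to-north turn in an even column
print(turn_allowed(3, "E", "N"))   # True: the same turn in an odd column
print(turn_allowed(3, "N", "W"))   # False: north-to-west turn in an odd column
```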
 
 
 
The illustration below shows allowed routing for different source and destination nodes. Depending on which column the packet is in, only certain directions are allowed.
 
 
[[Image:Routing_odd_even.jpg]]
 
''Odd-Even turn restriction model proposed by Ge-Ming Chiu''3
 

 

<h1>Comparison of Turn Restriction Models</h1>

To compare the performance of various turn restriction models, Chiu simulated a 15 x 15 mesh under various traffic patterns. All channels have a bandwidth of 20 flits/usec and a buffer size of one flit. The dimension-ordered x-y routing, west-first, and negative-first models were compared against the odd-even model.

 

Uniform, transpose, and hot spot traffic patterns were simulated. Uniform traffic simulates each node sending messages to any other node with equal probability. Transpose simulates two opposite nodes sending messages to their respective halves of the mesh. Hot spot simulates a few "hot spot" nodes that receive high traffic.



 
 
[[Image:Routing_uniform.jpg]]
 
''Uniform traffic simulation of various turn restriction models''3
 
 
The performance of the different routing algorithms is shown above for uniform traffic. For uniform traffic, the dimension-ordered x-y model outperforms the rest of the models. As the number of messages increases, the x-y model has the slowest increase in average communication latency.



 
 
[[Image:Routing_transpose.jpg]]
 
''First transpose traffic simulation of various turn restriction models''3
 
 
The performance of the different routing algorithms is shown above for the first transpose traffic. The negative-first model has the best performance, while the odd-even model performs better than the west-first and x-y models.



 
 
[[Image:Routing_transpose2.jpg]]
 
''Second transpose traffic simulation of various turn restriction models''3
 
 
With the second transpose simulation, the odd-even model outperforms the rest.



 
 
[[Image:Routing_hotspot.jpg]]
 
''Hotspot traffic simulation of various turn restriction models''3
 
 
The performance of the different routing algorithms is shown above for hotspot traffic. Only one hotspot was simulated for this test. The odd-even model outperforms the other models when hotspot traffic is 10%.



 
 
[[Image:Routing_hotspot2.jpg]]
 
''Second hotspot traffic simulation of various turn restriction models''3
 
 
When the number of hotspots is increased to five, the performance of the odd-even begins to shine. The latency is lowest for both 6 and 8 percent hotspot. Meanwhile, the performance of x-y model is horrendous.


 
 

While the x-y model performs well in uniform traffic, it lacks adaptiveness. When traffic concentrates on hotspots, the x-y model suffers from its inability to adapt and re-route traffic around the congestion the hotspots cause. The odd-even model has superior adaptiveness under high congestion.


 

<h1>Router Architecture</h1>

The '''router''' is a device that routes incoming data to its destination. It does this by having several input ports and several output ports. Data arriving on one of the input ports is routed to one of the output ports. Which output port is chosen depends on the destination of the data and on the routing algorithm.

 
 

The internal architecture of a router consists of input and output ports and a '''crossbar switch'''. The crossbar switch connects each input to the output selected for it, acting essentially as a multiplexer.
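The multiplexer view of a crossbar can be sketched in a few lines of Python. This is a toy model under assumed conventions: `select[o]` holds the index of the input port driving output port `o`, or `None` when that output is idle. Both the function name and this encoding are hypothetical and chosen only for illustration.

```python
def crossbar(inputs, select):
    """Drive each output port from the input port selected for it.

    inputs: values currently present on the input ports.
    select: select[o] is the input index routed to output o, or None if o is idle.
    """
    return [inputs[i] if i is not None else None for i in select]

# Input 2 is routed to output 0, output 1 is idle, and input 0 is routed to output 2.
print(crossbar(["a", "b", "c"], [2, None, 0]))   # ['c', None, 'a']
```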

 
 

Router technology has improved significantly over the years. This has allowed networks with high dimensionality to become feasible. As shown in the real-world example above, high-dimensional tori and hypercubes are excellent choices of topology for high-performance networks. The cost of high-performance, high-radix routers has contributed to the viability of these types of high-dimensionality networks. As the graph below shows, the bandwidth of routers has improved tremendously over a period of 10 years4.


 
[[Image:Router_bandwidth.jpg]]
 
''Bandwidth of various routers over 10 year period''4

 

Looking at the physical architecture and layout of routers, it is evident that the circuitry has become dramatically denser and more complex.


 
[[Image:Router_physical.jpg]]
 
''Router hardware over period of time''4
 

 
[[Image:Router_radix.jpg]]
 
''Radix and latency of routers over 10 year period''4
 
 

The '''radix''', or the number of ports of a router, has also increased. Current technology not only has high radix, but also lower latency than the previous generation. As radix increases, the latency remains steady.


 


With high-performance routers, complex topologies are possible, and as router technology improves, even more complex, high-dimensionality topologies become feasible.


 

<h1>Fault Tolerant Routing</h1>

Fault-tolerant routing means the successful routing of messages between any pair of non-faulty nodes in the presence of faulty components6. With the increased number of processors in a multiprocessor system and high data rates, reliable transmission of data in the event of a network fault is of great concern, and hence fault-tolerant routing algorithms are important.


<h2>Fault Models</h2>

Faults in a network can be categorized in two types:
 
 
1.'''Transient Faults'''5: A transient fault is a temporary fault that occurs for a very short duration of time. It can be caused by a change in the output of a flip-flop that leads to the generation of an invalid header. These faults can be minimized using error control coding and are generally evaluated in terms of Bit Error Rate.
 
 
2.'''Permanent Faults'''5: A permanent fault is a fault that does not go away and causes permanent damage to the network. It could be due to damaged wires and associated circuitry. These faults are generally evaluated in terms of Mean Time Between Failures.


<h2>Fault Tolerance Mechanisms (for permanent faults)</h2>

The permanent faults can be handled using one of the two mechanisms:
 
 
1.'''Static Mechanism''': In the static fault tolerance model, once a fault is detected, all the processes running in the system are stopped and the routing tables are emptied. Based on the fault information, the routing tables are recalculated to provide fault-free paths.
 
 
2.'''Dynamic Mechanisms''': In the dynamic fault tolerance model, the processes in the network are not completely stalled, and only the affected regions are repaired. Some of the methods for doing this are:
 
 
a.'''Block Faults''': In this method, many of the healthy nodes in the vicinity of the faulty nodes are also marked as faulty so that no routes are created close to the actual faulty nodes. The shape of the region can be convex or non-convex, and it is ensured that none of the new routes introduces a cyclic dependency in the channel dependency graph (CDG).
 
 
DISADVANTAGE: This method causes many healthy nodes to be declared faulty, which reduces system capacity.
 
 
[[Image:Fault_pic1.jpg]]
 
 
b.'''Fault Rings''': This method was introduced by Chalasani and Boppana. A fault ring is a set of nodes and links that are adjacent to faulty nodes/links. This approach reduces the number of healthy nodes that must be marked as faulty and blocked.
 
 
[[Image:Fault_pic2.jpg]]
 


 

<h1>References</h1>


1 Solihin text
 
2 [http://www.ssrc.ucsc.edu/Papers/hospodor-mss04.pdf Interconnection Architectures for Petabyte-Scale High-Performance Storage Systems]
 
3 [http://www.diit.unict.it/~vcatania/COURSES/semm_05-06/DOWNLOAD/noc_routing02.pdf The Odd-Even Turn Model for Adaptive Routing]
 
4 [http://www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf Interconnection Topologies:(Historical Trends and Comparisons)]
This link is down at this time. [http://webcache.googleusercontent.com/search?q=cache:2f2KNFWJCsQJ:www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf+history+of+interconnection+topologies&cd=9&hl=en&ct=clnk&gl=us&client=ubuntu&source=www.google.com Google Cache]
 
5 [http://dspace.upv.es/xmlui/bitstream/handle/10251/2603/tesisUPV2824.pdf?sequence=1 Efficient mechanisms to provide fault tolerance in interconnection networks for PC clusters, José Miguel Montañana Aliaga.]
 
6 [http://web.ebscohost.com.www.lib.ncsu.edu:2048/ehost/pdfviewer/pdfviewer?vid=2&hid=15&sid=72e3828d-3cb1-42b9-8198-5c1e974ea53f@sessionmgr4 Adaptive Fault Tolerant Routing Algorithm for Tree-Hypercube Multicomputer, Qatawneh Mohammad]
 
7 [http://www.top500.org TOP500 Supercomputing Sites]
 
8 [http://www.redbooks.ibm.com/abstracts/sg245161.html?Open Understanding and Using the SP Switch]
 
9 [http://www.myri.com/myrinet/overview/ Myrinet Overview]
 
10 [http://en.wikipedia.org/wiki/QsNet QsNet (Quadrics' network)]
 
11 [http://www.google.com/research/pubs/pub35155.html Dragonfly Topology]
 
12 [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.95.573&rep=rep1&type=pdf Flattened Butterfly Topology]

CSC/ECE 506 Spring 2011/ch12 aj

2011-04-26T03:40:17Z

Asbransc:

<h1>Interconnection Network Architecture </h1>


In a multi-processor system, processors need to communicate with each other and access each other's resources. In order to route data and messages between processors, an interconnection architecture is needed.


 


Typically, in a multiprocessor system, messages passed between processors are frequent and short1. Therefore, the interconnection network architecture must handle messages quickly by having '''low latency''', and must handle several messages at a time by having '''high bandwidth'''.


 


In a network, a processor along with its cache and memory is considered a '''node'''. The physical wires that connect nodes are called '''links'''. The device that routes messages between nodes is called a router. The shape of the network, such as the number of links and routers, is called the network '''topology'''.


<h1>History of Network Topologies</h1>

Hypercube topologies were invented in the 80s and had desirable characteristics when the number of nodes is small (~1000 maximum, often <100) and every processor must stop working to receive and forward the message 4. The low-radix era began in 1985 and was defined by routers with between 4 and 8 ports using toroidal, mesh or fat-tree topologies and wormhole routing. This era lasted about 20 years until it was determined that routers with dozens of ports offered superior performance. Two topologies were developed to take advantage of the newly developed high-radix routers. These are flattened butterfly and dragonfly, which are somewhere between a mesh with each point on the mesh being a router (or virtual router in the case of dragonfly) with dozens or hundreds of nodes attached and a fat tree with sufficiently high arity as to only have two levels.

<h2>Interconnection evolution in the Top500 List</h2>
[[Image:Top500interconnect.png|thumbnail|300px|left|]]

This chart shows the evolution over time of the different interconnect topologies by their dominance in the top500 list of supercomputers7. As one can see, many technologies came into vogue briefly before losing performance share and disappearing. In the early days of the list, most of the computers report the interconnect type as not applicable. However, the trailing end of the hypercube phase is clear in burnt orange. The dark blue at the top is "other" and the dark red in the middle is "proprietary", so we can only speculate about what topologies they might employ. The toroidal mesh appears briefly at the start in a cream color, and slightly outlasts the hypercube. The two crossbar technologies (blue and olive) followed the toroidal mesh. The fully-distributed crossbar died out quickly, while the multi-stage crossbar lasted longer but was never dominant. The 3-D torus (purple) dominates much of the 90s, with hypercube topologies (dark pink) enjoying a short comeback in the later part of the decade. SP Switch (light olive), an IBM interconnect technology which uses a multi-stage crossbar switch, replaced the 3-D torus8. Myrinet, Quadrics, and Federation all shared the spotlight in the mid 00s, and each used a similar fat-tree topology9,10. The current class of supercomputers is dominated by nodes connected with either Infiniband or gigabit ethernet. Both can be connected in either a fat-tree or 2-D mesh topology. The primary difference between them is speed. Infiniband is considerably faster per link and allows links to be ganged into groups of 4 or 12. Gigabit ethernet is vastly less expensive, however, and some supercomputer designers have apparently chosen to save money on the interconnect technology in order to allow the use of faster nodes11,12.

 
 
 
 

<h1>Types of Network Topologies </h1>

Several metrics are normally chosen to represent the cost and performance of a given topology. In this section, the degree, number of links, diameter, and bisection width are calculated for each topology.

<h2>Linear Array</h2>

[[Image:Top_linear.jpg|thumbnail|frame|right|]]

The nodes are connected linearly as in an array. This type of topology is simple, however, it does not scale well.
 
* ''Diameter:'' p-1
* ''Bisection BW:'' 1
* ''# Links:'' p-1
* ''Degree:'' 2
 
A linear array is the cheapest way to connect a group of nodes together. Its number of links and degree are the smallest of any topology. However, the drawback of this topology is also obvious: the two end points are the farthest apart, which makes the diameter p-1. This topology is also not reliable, since the bisection bandwidth is 1.

<h2>Ring</h2>
[[Image:Top_ring.jpg|thumbnail|right|]]
The ring has a similar structure to the linear array, except that the end nodes connect to each other, establishing a circular structure.
 
* ''Diameter:'' p/2
* ''Bisection BW:'' 2
* ''# Links:'' p
* ''Degree:'' 2
 
Compared with the linear array, the ring topology needs very little extra effort (only one additional link) to get a relatively big improvement. The longest distance between two nodes is cut in half, and the bisection bandwidth increases to 2.

<h2>2-D Mesh</h2>

[[Image:Top_2Dmesh.jpg|thumbnail|right|]]

The 2-D mesh can be thought of as several linear arrays put together to form a 2-dimensional structure. This topology is very suitable for some of the applications such as the ocean application and matrix calculation.
 
* ''Diameter:'' 2(sqrt(p)-1)
* ''Bisection BW:'' sqrt(p)
* ''# Links:'' 2sqrt(p)(sqrt(p)-1)
* ''Degree:'' 4
 
Nodes that are not on the edge have a '''degree''' of 4. To calculate the number of links, add the number of vertical links, sqrt(p)(sqrt(p)-1), to the number of horizontal links, also sqrt(p)(sqrt(p)-1), to get 2sqrt(p)(sqrt(p)-1). The diameter is calculated by the distance between two diagonal nodes which is the sum of 2 edges of length sqrt(p)-1.
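The formulas above can be checked numerically with a short Python sketch. It assumes p is a perfect square so that the mesh has sqrt(p) nodes on each side; the function name and the dictionary layout are illustrative only.

```python
import math

def mesh_2d_metrics(p):
    """Return the metrics listed above for a sqrt(p) x sqrt(p) 2-D mesh."""
    k = math.isqrt(p)                     # nodes per side
    if k * k != p:
        raise ValueError("p must be a perfect square")
    return {
        "diameter": 2 * (k - 1),          # corner to opposite corner
        "bisection_bw": k,                # links cut when the mesh is split in half
        "links": 2 * k * (k - 1),         # k*(k-1) vertical plus k*(k-1) horizontal
        "degree": 4,                      # for nodes that are not on the edge
    }

print(mesh_2d_metrics(16))
# {'diameter': 6, 'bisection_bw': 4, 'links': 24, 'degree': 4}
```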
 

<h2>2-D Torus</h2>

[[Image:Top_2Dtorus.jpg|thumbnail|right|]]

Using the same trick that turns a linear array into a ring, the 2-D torus takes the structure of the 2-D mesh and connects the nodes on the edges.
 
* ''Diameter:'' sqrt(p)-1
* ''Bisection BW:'' 2sqrt(p)
* ''# Links:'' 2p
* ''Degree:'' 4
 
With the end-around connections, the longest distance is cut in half and the bisection bandwidth doubles. The cost is higher as well: the number of links grows from 2sqrt(p)(sqrt(p)-1) in the 2-D mesh to 2p in the 2-D torus.
 

<h2>Cube</h2>
[[Image:Top_cube.jpg|thumbnail|right|]]

If we add two more neighbors to each node, we get a cube. The cube can be thought of as a three-dimensional mesh.
 
* ''Diameter:'' 3(p^(1/3)-1)
* ''Bisection BW:'' p^(2/3)
* ''# Links:'' 3p^(2/3)(p^(1/3)-1)
* ''Degree:'' 6 (for the inside nodes)
 
We can also extend these metrics to the N-dimensional mesh:
 
* ''Diameter:'' N(p^(1/N)-1)
* ''Bisection BW:'' p^((N-1)/N)
* ''# Links:'' N*p^((N-1)/N)*(p^(1/N)-1)
 
The degree grows with the number of dimensions: an inside node of an N-dimensional mesh has degree 2N, so the degree of the cube is 6, which is larger than the degree of the two-dimensional mesh, which is 4.

<h2>Hypercube</h2>

[[Image:Top_hypercube.jpg|thumbnail|right|Hypercube]]

In the N-dimensional cube, the boundary nodes are normally the ones that hurt the performance of the entire network. We can fix this by connecting those boundary nodes together. The hypercube is essentially multiple cubes put together.
 
* ''Diameter:'' log2(p)
* ''Bisection BW:'' p/2
* ''# Links:'' p/2 * log2(p)
* ''Degree:'' log2(p)
 
From the metrics, we can see that the diameter and bisection bandwidth are significantly improved in the higher-order topologies.
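The hypercube metrics can be checked with a small Python sketch. It assumes the usual labelling in which each of the p = 2^n nodes carries an n-bit number and two nodes are linked when their labels differ in exactly one bit; that labelling and the function names are assumptions made for this illustration.

```python
def hypercube_neighbors(node, dims):
    """Neighbors of `node`: flip each of the `dims` address bits in turn."""
    return [node ^ (1 << d) for d in range(dims)]

def hypercube_distance(a, b):
    """Hop count between two nodes: the number of address bits that differ."""
    return bin(a ^ b).count("1")

dims = 4
p = 1 << dims                                                        # 16 nodes
links = sum(len(hypercube_neighbors(n, dims)) for n in range(p)) // 2
diameter = max(hypercube_distance(0, n) for n in range(p))
print(links, diameter)   # 32 4, i.e. p/2 * log2(p) links and a diameter of log2(p)
```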

<h2>Tree</h2>

[[Image:Top_tree.jpg|thumbnail|right|Tree]]

The tree is a hierarchical structure with nodes at the bottom and switching nodes at the upper levels.
 
* ''Diameter:'' 2log2(p)
* ''Bisection BW:'' 1
* ''# Links:'' 2(p-1)
* ''Degree:'' 3
 
The tree experiences high traffic at the upper levels. Since almost half of the messages need to go through the root node, the root becomes the bottleneck of the tree topology. The other disadvantage of the tree topology is that the bisection bandwidth is only 1.

<h2>Fat Tree</h2>

[[Image:Top_fat_tree.jpg|thumbnail|right|Fat Tree]]
In order to improve the performance of the tree topology, the fat tree alleviates the traffic at upper levels by "fattening" up the links at the upper levels.
 
* ''Diameter:'' 2log2(p)
* ''Bisection BW:'' p/2
* ''# Links:'' 2(p-1)
* ''Degree:'' 3
 
The fat tree relieves the pressure on the root node, and the bisection bandwidth is also increased.

<h2>Butterfly</h2>

[[Image:Top_butterfly.jpg|thumbnail|right|Butterfly]]
 

The butterfly structure is similar to the tree structure, but it replicates the switching node structure of the tree topology and connects them together so that there are equal links and routers at all levels.
* ''Diameter:'' 2log2(p)
* ''Bisection BW:'' p/2
* ''# Links:'' 2p*log2(p)
* ''Degree:'' 4

 

The butterfly has the same performance as the hypercube. In terms of cost, the butterfly has a lower degree, but the hypercube has fewer links.

<h1>Real-World Implementation of Network Topologies </h1>

In a research study by Andy Hospodor and Ethan Miller, several network topologies were investigated in a high-performance, high-traffic network2. The topologies included the fat tree, butterfly, mesh, torus, and hypercube structures. Advantages and disadvantages including cost, performance, and reliability were discussed.

In this experiment, a petabyte-scale network with over 100 GB/s total aggregate bandwidth was investigated. The network consisted of 4096 disks and large servers, with routers and switches in between2.

[[Image:Disknet_network.jpg|frame|center|''Basic structure of Hospodor and Miller's experimental network''2]]

The overall structure of the network is shown above. Note that this structure is very susceptible to failure and congestion.

<h2>Fat Tree</h2>

In large-scale, high-performance applications, the fat tree can be a choice. However, in order to "fatten" up the links, redundant connections must be used. Instead of using one link between switching nodes, several must be used. The problem with this is that with more input and output links, one would need routers with more input and output ports. Routers with in excess of 100 ports are difficult to build and expensive, so multiple routers would have to be stacked together. Even so, the routers would be expensive, and several of them would be required2.
 
 
The Japan Agency for Marine-Earth Science and Technology supercomputing system uses the fat tree topology. The system connects 1280 NEC processors7.

 

<h2>Butterfly</h2>

In high-performance applications, the butterfly structure is a good choice. The butterfly topology uses fewer links than other topologies; however, each link carries traffic from the entire layer. Fault tolerance is poor because there is only a single path between any pair of nodes. Should a link break, data cannot be re-routed, and communication is broken2.

[[Image:Disknet_butterfly.jpg|frame|center|''Butterfly structure''2]]

<h2>Meshes and Tori</h2>

The mesh and torus structures used in this application would require a large number of links and a total of several thousand ports. However, because there are so many links, the mesh and torus structures provide alternate paths in case of failures2.

[[Image:Disknet_mesh.jpg|frame|center|''Mesh structure''2]]

[[Image:Disknet_torus.jpg|frame|center|''Torus structure''2]]

One example of current use of the torus structure is the QPACE SFB TR Cluster in Germany, which uses PowerXCell 8i processors. The system uses a 3-D torus topology with 4608 processors7.

<h2>Hypercube</h2>

Similar to the torus structures, the hypercube requires a larger number of links. However, its bandwidth scales better than that of the mesh and torus structures.

The CRAY T3E, CRAY XT3, and SGI Origin 2000 use k-ary n-cube topologies.

[[Image:Disknet_hypercube.jpg|frame|center|''Hypercube structure''2]]

It is worth mentioning that even though many topologies perform much better than the 2-D mesh, the cost of these advanced topologies is also high. Since most chips are laid out in 2-D space, it is very expensive to implement a high-dimensional topology on a 2-D chip. For the hypercube topology, increasing the number of nodes raises the degree of every node. For the butterfly topology, the degree grows relatively slowly, but the required number of links and switches increases rapidly.

<h1>Comparison of Network Topologies </h1>


The following figure shows the total number of ports required for each network topology.
 
 
[[Image:Disknet_ports.jpg]]
 
''Number of ports for each topology''2

 

As the figure above shows, the 6-D hypercube requires the largest number of ports, due to its relatively complex six-dimensional structure. In contrast, the fat tree requires the fewest ports, even though its links have been "fattened" up by using redundant links. The butterfly network requires more than twice as many ports as the fat tree, since it essentially replicates the switching layer of the fat tree. The number of ports for the mesh and torus structures increases as the dimensionality increases. However, with modern router technology, the number of ports is a less important consideration.

 
 
The average path length, or average number of hops, and the average link load (GB/s) are shown below.
 
 
[[Image:Disknet_load.jpg]]
 
''Average path length and link load for each topology''2

 

Looking at the trends, when the average path length is high, the average link load is also high. In other words, average path length and average link load are roughly proportional. It is obvious from the graph that the 2-D mesh has, by far, the worst performance. In a large network such as this, the average path length is just too high, and the average link load suffers. For this type of high-performance network, the 2-D mesh does not scale well. The 2-D torus cuts the average path length and average link load in half by connecting the edge nodes together; however, its performance is still relatively poor compared to the other types. The butterfly and fat tree have the lowest average path length and average link load.

 
 
The figure below shows the cost of the network topologies.
 
 
[[Image:Disknet_cost.jpg]]
 
''Cost of each topology''2

 

Despite using the fewest ports, the fat tree topology has the highest cost, by far. The ports it uses are high-bandwidth ports of 10 GB/s, and over 2400 of them are required to provide enough bandwidth at the upper levels of the tree. This pushes the cost up dramatically and makes the fat tree impractical from a cost standpoint. While the total cost of the fat tree is about 15 million dollars, the rest of the network topologies are clustered below 4 million dollars. As the dimensionality of the mesh and torus structures increases, the cost increases. The cost of the butterfly network falls between that of the 2-D mesh/torus and the 6-D hypercube.

 
 
When cost and average link load are factored together, the following graph is produced.
 
 
[[Image:Disknet_overall.jpg]]
 
''Overall cost of each topology''2

 

From the figure above, the 6-D hypercube is the most cost-effective choice for this particular network setup. Although the 6-D hypercube costs more because it needs more links and ports, it provides higher bandwidth, which can offset the higher cost. The high-dimensional tori also perform well, but cannot provide as much bandwidth as the 6-D hypercube. For systems that do not need as much bandwidth, a high-dimensional torus is also a good choice. The butterfly topology is another alternative, but it has lower fault tolerance.

However, when the number of nodes increases, the relative cost of the higher-dimensional topologies increases far faster than their relative performance when compared to a 2-D mesh. This is because the 2-D mesh only uses low-cost, short links. The higher-dimensional structures must be projected onto our 3-dimensional world, and thus require many long, expensive links that wrap around the outside of the system like an impenetrable tangle of jungle vines. Maintaining such a network is also quite slow and tedious.


<h1>Routing</h1>


The '''routing''' algorithm determines what path a packet of data will take from source to destination. Routing can be '''deterministic''', where the path is always the same for a given source and destination, or '''adaptive''', where the path can change. The routing algorithm can also be '''partially adaptive''', where packets have multiple choices, but not all packets are allowed to use the shortest path3.


<h2>Deadlock</h2>

Packets are in '''deadlock''' when they cannot continue to move through the nodes. The illustration below demonstrates this event.
 
 
[[Image:Routing_deadlock.jpg]]
 
''Example of deadlock''
 
 

Assume that all of the buffers are full at each node. The packet from Node 1 cannot continue to Node 2. The packet from Node 2 cannot continue to Node 3, and so on. Since none of the packets can move, they are deadlocked.

 
 

Deadlock arises from a cyclic pattern of routing. To avoid deadlock, circular routing patterns must be avoided.

 
 

To avoid circular patterns of routing, some routing patterns are disallowed. These are called '''turn restrictions''', where some turns are not allowed in order to avoid making a circular routing pattern. Some of these turn restrictions are mentioned below.


 
<h2>Dimension-ordered (X-Y) routing</h2>

Turns from the y-dimension to the x-dimension are not allowed.
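The restriction above can be illustrated with a minimal sketch, assuming nodes are addressed by hypothetical (x, y) coordinates on a 2-D mesh. The function name `xy_route` is an illustrative choice, not something taken from the cited sources. The packet finishes all of its x-dimension hops before making any y-dimension hop, so it never turns from y back to x.

```python
def xy_route(src, dst):
    """Return the nodes visited from src to dst, moving in x first, then in y.

    src and dst are (x, y) tuples; this addressing is an assumption of the sketch.
    """
    x, y = src
    dst_x, dst_y = dst
    path = [(x, y)]
    # Phase 1: travel along the x-dimension until the column matches.
    while x != dst_x:
        x += 1 if dst_x > x else -1
        path.append((x, y))
    # Phase 2: travel along the y-dimension; no turn back to x ever occurs.
    while y != dst_y:
        y += 1 if dst_y > y else -1
        path.append((x, y))
    return path


if __name__ == "__main__":
    print(xy_route((0, 0), (2, 1)))
```

Because the path depends only on the source and destination, this is a deterministic routing scheme.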


<h2>West First</h2>

Turns to the west are not allowed, so a packet that must travel west does so first.


<h2>North Last</h2>

Once a packet is traveling north, no further turns are allowed, so any northward hops must come last.


<h2>Negative First</h2>

Turns from a positive direction (+x or +y) to a negative direction (-x or -y) are not allowed, so a packet must make its negative-direction hops first.


<h2>Odd-Even Turn Model</h2>


Unfortunately, the above turn-restriction models reduce the degree of adaptiveness and are only partially adaptive. The models cause some packets to take different routes, and not necessarily the minimal paths. This may cause unfairness and reduces the ability of the system to relieve congestion. Overall performance could suffer3.


 


Ge-Ming Chiu introduces the Odd-Even turn model as an adaptive, deadlock-free turn-restriction model that has better performance than the previously mentioned models3. The model is designed primarily for 2-D meshes.

 

''Turns from the east to north direction from any node on an even column are not allowed.''
 
''Turns from the north to west direction from any node on an odd column are not allowed.''
 
 
''Turns from the east to south direction from any node on an even column are not allowed.''
 
''Turns from the south to west direction from any node on an odd column are not allowed.''
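The four rules above can be written as a small lookup. This is a minimal sketch, assuming columns are numbered with integers and headings are encoded as the letters "N", "E", "S" and "W"; the function name `turn_allowed` and the encoding are illustrative assumptions, and 180-degree reversals are not modeled.

```python
# Turns are (current heading, new heading) pairs.
FORBIDDEN_ON_EVEN_COLUMN = {("E", "N"), ("E", "S")}
FORBIDDEN_ON_ODD_COLUMN = {("N", "W"), ("S", "W")}


def turn_allowed(column, heading, new_heading):
    """Return True if a packet at `column` may change from heading to new_heading."""
    if heading == new_heading:
        return True  # continuing straight is not a turn
    if column % 2 == 0:
        forbidden = FORBIDDEN_ON_EVEN_COLUMN
    else:
        forbidden = FORBIDDEN_ON_ODD_COLUMN
    return (heading, new_heading) not in forbidden


if __name__ == "__main__":
    for col in (2, 3):
        for turn in (("E", "N"), ("E", "S"), ("N", "W"), ("S", "W")):
            print(col, turn, turn_allowed(col, *turn))
```

Since the forbidden set depends on the column, a turn that is blocked in one column may be taken one column later, which is where the model gets its adaptiveness.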
 
 
 
The illustration below shows allowed routing for different source and destination nodes. Depending on which column the packet is in, only certain directions are allowed.
 
 
[[Image:Routing_odd_even.jpg]]
 
''Odd-Even turn restriction model proposed by Ge-Ming Chiu''3
 

 

<h1>Comparison of Turn Restriction Models</h1>

To compare the performance of various turn restriction models, Chiu simulated a 15 x 15 mesh under various traffic patterns. All channels have a bandwidth of 20 flits/usec and a buffer size of one flit. The dimension-ordered x-y routing, west-first, and negative-first models were compared against the odd-even model.

 

Simulations were conducted with uniform, transpose, and hot spot traffic patterns. Uniform simulates each node sending messages to any other node with equal probability. Transpose simulates two opposite nodes sending messages to their respective halves of the mesh. Hot spot simulates a few "hot spot" nodes that receive high traffic.



 
 
[[Image:Routing_uniform.jpg]]
 
''Uniform traffic simulation of various turn restriction models''3
 
 
The performance of the different routing algorithms is shown above for the uniform traffic. For uniform traffic, the dimension-ordered x-y model outperforms the rest of the models. As the number of messages increases, the x-y model has the slowest increase in average communication latency.



 
 
[[Image:Routing_transpose.jpg]]
 
''First transpose traffic simulation of various turn restriction models''3
 
 
The performance of the different routing algorithms is shown above for the first transpose traffic. The negative-first model has the best performance, while the odd-even model performs better than the west-first and x-y models.



 
 
[[Image:Routing_transpose2.jpg]]
 
''Second transpose traffic simulation of various turn restriction models''3
 
 
With the second transpose simulation, the odd-even model outperforms the rest.



 
 
[[Image:Routing_hotspot.jpg]]
 
''Hotspot traffic simulation of various turn restriction models''3
 
 
The performance of the different routing algorithms is shown above for the hotspot traffic. Only one hotspot was simulated for this test. The odd-even model outperforms the other models when hotspot traffic is 10%.



 
 
[[Image:Routing_hotspot2.jpg]]
 
''Second hotspot traffic simulation of various turn restriction models''3
 
 
When the number of hotspots is increased to five, the odd-even model begins to shine. Its latency is the lowest for both 6 and 8 percent hotspot traffic. Meanwhile, the x-y model performs very poorly.


 
 

While the x-y model performs well in uniform traffic, it lacks adaptiveness. When traffic concentrates on hotspots, the x-y model suffers from its inability to adapt and re-route traffic around the congestion that the hotspots cause. The odd-even model has superior adaptiveness under high congestion.


 

<h1>Router Architecture</h1>

The '''router''' is a device that routes incoming data to its destination. It does this by having several input ports and several output ports. Data arriving on one of the input ports is routed to one of the output ports. Which output port is chosen depends on the destination of the data and on the routing algorithm.

 
 

The internal architecture of a router consists of input and output ports and a '''crossbar switch'''. The crossbar switch connects each input to the output selected for it, acting essentially as a multiplexer.

 
 

Router technology has improved significantly over the years. This has allowed networks with high dimensionality to become feasible. As shown in the real-world example above, high-dimensional tori and hypercubes are excellent choices of topology for high-performance networks. The cost of high-performance, high-radix routers has contributed to the viability of these types of high-dimensionality networks. As the graph below shows, the bandwidth of routers has improved tremendously over a period of 10 years4.


 
[[Image:Router_bandwidth.jpg]]
 
''Bandwidth of various routers over 10 year period''4

 

Looking at the physical architecture and layout of routers, it is evident that the circuitry has become dramatically denser and more complex.


 
[[Image:Router_physical.jpg]]
 
''Router hardware over period of time''4
 

 
[[Image:Router_radix.jpg]]
 
''Radix and latency of routers over 10 year period''4
 
 

The '''radix''', or the number of ports, of routers has also increased. Current technology not only has high radix, but also low latency compared to the previous generation. As radix increases, the latency remains steady.


 


As router technology improves, more complex, high-dimensionality topologies become possible.


 

<h1>Fault Tolerant Routing</h1>

Fault-tolerant routing means the successful routing of messages between any pair of non-faulty nodes in the presence of faulty components6. With the increased number of processors in a multiprocessor system and high data rates, reliable transmission of data in the event of a network fault is of great concern, and hence fault-tolerant routing algorithms are important.


<h2>Fault Models</h2>

Faults in a network can be categorized into two types:
 
 
1. '''Transient Faults'''5: A transient fault is a temporary fault that occurs for a very short duration of time. It can be caused by a change in the output of a flip-flop that leads to the generation of an invalid header. These faults can be minimized using error control coding. They are generally evaluated in terms of Bit Error Rate.
 
 
2. '''Permanent Faults'''5: A permanent fault is a fault that does not go away and causes permanent damage to the network. It could be due to damaged wires and associated circuitry. These faults are generally evaluated in terms of Mean Time Between Failures.


<h2>Fault Tolerance Mechanisms (for permanent faults)</h2>

Permanent faults can be handled using one of two mechanisms:
 
 
1. '''Static Mechanism''': In the static fault tolerance model, once a fault is detected, all the processes running in the system are stopped and the routing tables are emptied. Based on the fault information, the routing tables are recalculated to provide fault-free paths.
 
 
2. '''Dynamic Mechanisms''': In the dynamic fault tolerance model, the operation of the processes in the network is not completely stalled, and only the affected regions are handled. Some of the methods to do this are:
 
 
a. '''Block Faults''': In this method, many of the healthy nodes in the vicinity of the faulty nodes are marked as faulty so that no routes are created close to the actual faulty nodes. The shape of the region can be convex or non-convex, and care is taken that none of the new routes introduces a cyclic dependency in the channel dependency graph (CDG).
 
 
DISADVANTAGE: This method causes many healthy nodes to be declared faulty, which reduces system capacity.
 
 
[[Image:Fault_pic1.jpg]]
 
 
b. '''Fault Rings''': This method was introduced by Chalasani and Boppana. A fault ring is a set of nodes and links that are adjacent to faulty nodes/links. This approach reduces the number of healthy nodes that must be marked as faulty and blocked.
 
 
[[Image:Fault_pic2.jpg]]
 


 

<h1>References</h1>


1 Solihin text
 
2 [http://www.ssrc.ucsc.edu/Papers/hospodor-mss04.pdf Interconnection Architectures for Petabyte-Scale High-Performance Storage Systems]
 
3 [http://www.diit.unict.it/~vcatania/COURSES/semm_05-06/DOWNLOAD/noc_routing02.pdf The Odd-Even Turn Model for Adaptive Routing]
 
4 [http://www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf Interconnection Topologies:(Historical Trends and Comparisons)]
This link is down at this time. [http://webcache.googleusercontent.com/search?q=cache:2f2KNFWJCsQJ:www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf+history+of+interconnection+topologies&cd=9&hl=en&ct=clnk&gl=us&client=ubuntu&source=www.google.com Google Cache]
 
5 [http://dspace.upv.es/xmlui/bitstream/handle/10251/2603/tesisUPV2824.pdf?sequence=1 Efficient mechanisms to provide fault tolerance in interconnection networks for PC clusters, José Miguel Montañana Aliaga.]
 
6 [http://web.ebscohost.com.www.lib.ncsu.edu:2048/ehost/pdfviewer/pdfviewer?vid=2&hid=15&sid=72e3828d-3cb1-42b9-8198-5c1e974ea53f@sessionmgr4 Adaptive Fault Tolerant Routing Algorithm for Tree-Hypercube Multicomputer, Qatawneh Mohammad]
 
7 [http://www.top500.org TOP500 Supercomputing Sites]
 
8 [http://www.redbooks.ibm.com/abstracts/sg245161.html?Open Understanding and Using the SP Switch]
 
9 [http://www.myri.com/myrinet/overview/ Myrinet Overview]
 
10 [http://en.wikipedia.org/wiki/QsNet QsNet (Quadrics' network)]
 
11 [http://www.google.com/research/pubs/pub35155.html Dragonfly Topology]
 
12 [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.95.573&rep=rep1&type=pdf Flattened Butterfly Topology]

CSC/ECE 506 Spring 2011/ch12 aj

2011-04-26T03:34:24Z

Asbransc:

<h1>Interconnection Network Architecture </h1>


In a multi-processor system, processors need to communicate with each other and access each other's resources. In order to route data and messages between processors, an interconnection architecture is needed.


 


Typically, in a multiprocessor system, messages passed between processors are frequent and short1. Therefore, the interconnection network architecture must handle messages quickly by having '''low latency''', and must handle several messages at a time by having '''high bandwidth'''.


 


In a network, a processor along with its cache and memory is considered a '''node'''. The physical wires that connect nodes are called a '''link'''. The device that routes messages between nodes is called a router. The shape of the network, such as the number of links and routers, is called the network '''topology'''.


<h1>History of Network Topologies</h1>

Hypercube topologies were invented in the 80s and had desirable characteristics when the number of nodes is small (~1000 maximum, often <100) and every processor must stop working to receive and forward the message 4. The low-radix era began in 1985 and was defined by routers with between 4 and 8 ports using toroidal, mesh or fat-tree topologies and wormhole routing. This era lasted about 20 years until it was determined that routers with dozens of ports offered superior performance. Two topologies were developed to take advantage of the newly developed high-radix routers. These are flattened butterfly and dragonfly, which are somewhere between a mesh with each point on the mesh being a router (or virtual router in the case of dragonfly) with dozens or hundreds of nodes attached and a fat tree with sufficiently high arity as to only have two levels.

<h2>Interconnection evolution in the Top500 List</h2>
[[Image:Top500interconnect.png|thumbnail|300px|left|]]

This chart shows the evolution over time of the different interconnect topologies by their dominance in the Top500 list of supercomputers 7. As one can see, many technologies came into vogue briefly before losing performance share and disappearing. In the early days of the list, most of the computers report the interconnect type as not applicable. However, the trailing end of the hypercube phase is clear in burnt orange. The dark blue at the top is "other" and the dark red in the middle is "proprietary", so we can only speculate about what topologies they might employ. The toroidal mesh appears briefly at the start in a cream color, and slightly outlasts the hypercube. The two crossbar technologies (blue and olive) followed the toroidal mesh. The fully-distributed crossbar died out quickly, while the multi-stage crossbar lasted longer but was never dominant. The 3-D torus (purple) dominates much of the 90s, with hypercube topologies (dark pink) enjoying a short comeback in the later part of the decade. SP Switch (light olive), an IBM interconnect technology that uses a multi-stage crossbar switch, replaced the 3-D torus8. Myrinet, Quadrics, and Federation all shared the spotlight in the mid 00s, and each used a similar fat-tree topology9,10. The current class of supercomputers is dominated by nodes connected with either Infiniband or gigabit ethernet. Both can be connected in either a fat-tree or 2-D mesh topology. The primary difference between them is speed. Infiniband is considerably faster per link and allows links to be ganged into groups of 4 or 12. Gigabit ethernet is vastly less expensive, however, and some supercomputer designers have apparently chosen to save money on the interconnect technology in order to allow the use of faster nodes 11,12.

 
 
 
 

<h1>Types of Network Topologies </h1>

Several metrics are normally chosen to represent the cost and performance of a given topology. In this section, the degree, number of links, diameter, and bisection width are given for each topology.

<h2>Linear Array</h2>

[[Image:Top_linear.jpg|thumbnail|frame|right|]]

The nodes are connected linearly as in an array. This type of topology is simple; however, it does not scale well.
 
* ''Diameter:'' p-1
* ''Bisection BW:'' 1
* ''# Links:'' p-1
* ''Degree:'' 2
 
A linear array is the cheapest way to connect a group of nodes together. The number of links and the degree of a linear array are the smallest of any topology. However, the drawback of this topology is also obvious: the two end points are the farthest apart, which makes the diameter p-1. This topology is also not reliable since the bisection bandwidth is 1.

<h2>Ring</h2>
[[Image:Top_ring.jpg|thumbnail|right|]]
The ring has a similar structure to the linear array, except that the end nodes connect to each other, establishing a circular structure.
 
* ''Diameter:'' p/2
* ''Bisection BW:'' 2
* ''# Links:'' p
* ''Degree:'' 2
 
Compared with the linear array, the ring topology gains a relatively big improvement for the least effort (adding only one link). The longest distance between two nodes is cut in half, and the bisection bandwidth increases to 2.

<h2>2-D Mesh</h2>

[[Image:Top_2Dmesh.jpg|thumbnail|right|]]

The 2-D mesh can be thought of as several linear arrays put together to form a 2-dimensional structure. This topology is well suited to applications such as the ocean application and matrix calculation.
 
* ''Diameter:'' 2(sqrt(p)-1)
* ''Bisection BW:'' sqrt(p)
* ''# Links:'' 2sqrt(p)(sqrt(p)-1)
* ''Degree:'' 4
 
Nodes that are not on the edge have a '''degree''' of 4. To calculate the number of links, add the number of vertical links, sqrt(p)(sqrt(p)-1), to the number of horizontal links, which is also sqrt(p)(sqrt(p)-1). The diameter is the distance between two diagonally opposite corner nodes, which is the sum of 2 edges of length sqrt(p)-1.
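The 2-D mesh formulas can be checked by brute force. The following is a minimal sketch, assuming p is a perfect square and nodes are addressed by hypothetical (x, y) coordinates; the names `mesh_neighbors` and `mesh_metrics` are illustrative. It builds the mesh, counts the links, and measures the diameter with a breadth-first search from a corner node.

```python
from collections import deque
from math import isqrt


def mesh_neighbors(k, node):
    """Yield the neighbors of node in a k x k mesh (no wrap-around links)."""
    x, y = node
    for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
        if 0 <= nx < k and 0 <= ny < k:
            yield (nx, ny)


def mesh_metrics(p):
    """Measure diameter, link count and maximum degree of a sqrt(p) x sqrt(p) mesh."""
    k = isqrt(p)
    assert k * k == p, "p must be a perfect square"
    nodes = [(x, y) for x in range(k) for y in range(k)]
    degrees = [len(list(mesh_neighbors(k, n))) for n in nodes]
    links = sum(degrees) // 2  # every link is counted once from each end
    # Breadth-first search from the corner (0, 0); the farthest node is the
    # opposite corner, so the largest distance found is the diameter.
    dist = {(0, 0): 0}
    queue = deque([(0, 0)])
    while queue:
        cur = queue.popleft()
        for nb in mesh_neighbors(k, cur):
            if nb not in dist:
                dist[nb] = dist[cur] + 1
                queue.append(nb)
    return {"diameter": max(dist.values()), "links": links, "degree": max(degrees)}


if __name__ == "__main__":
    print(mesh_metrics(16))
```

For p = 16 the measured values agree with the formulas: 2(sqrt(16)-1) = 6 for the diameter and 2sqrt(16)(sqrt(16)-1) = 24 for the number of links.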
 

<h2>2-D Torus</h2>

[[Image:Top_2Dtorus.jpg|thumbnail|right|]]

Using the same trick that turns a linear array into a ring, the 2-D torus takes the structure of the 2-D mesh and connects the nodes on the edges.
 
* ''Diameter:'' sqrt(p)-1
* ''Bisection BW:'' 2sqrt(p)
* ''# Links:'' 2p
* ''Degree:'' 4
 
With the end-around connections, the longest distance is cut and the bisection bandwidth is increased. Of course, the wrap-around links add cost: the 2-D torus uses 2p links, 2sqrt(p) more than the 2-D mesh.
 

<h2>Cube</h2>
[[Image:Top_cube.jpg|thumbnail|right|]]

If we add two more neighbors to each node, we get a cube. The cube can be thought of as a three-dimensional mesh.
 
* ''Diameter:'' 3(p^(1/3)-1)
* ''Bisection BW:'' p^(2/3)
* ''# Links:'' 3p^(2/3)(p^(1/3)-1)
* ''Degree:'' 6 (from the inside nodes)
 
We can also extend some of the metrics to the N-dimensional mesh:
 
* ''Diameter:'' N(p^(1/N)-1)
* ''Bisection BW:'' p^((N-1)/N)
* ''# Links:'' N*p^((N-1)/N)*(p^(1/N)-1)
 
The degree of an inside node is 2N, since it has two neighbors along each dimension: 4 for the 2-D mesh and 6 for the cube.

<h2>Hypercube</h2>

[[Image:Top_hypercube.jpg|thumbnail|right|Hypercube]]

In the N-dimensional cube, the boundary nodes are normally the ones that hurt the performance of the entire network. We can fix this by connecting those boundary nodes together. The hypercube is essentially multiple cubes put together.
 
* ''Diameter:'' log2(p)
* ''Bisection BW:'' p/2
* ''# Links:'' p/2 * log2(p)
* ''Degree:'' log2(p)
 
From the metrics we can see that the diameter and bisection bandwidth are significantly improved for the high-order topologies.
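The hypercube metrics can also be verified on small instances. This is a minimal sketch, assuming the common convention (not stated in the text above) that the p = 2^dims nodes are numbered 0 to p-1 and that two nodes are linked when their binary labels differ in exactly one bit; the function names are illustrative.

```python
def hypercube_neighbors(node, dims):
    """Return the neighbors of node: flip each of the dims address bits once."""
    return [node ^ (1 << bit) for bit in range(dims)]


def hypercube_metrics(dims):
    """Measure the metrics of a hypercube with 2**dims nodes by enumeration."""
    p = 1 << dims
    links = sum(len(hypercube_neighbors(n, dims)) for n in range(p)) // 2
    # The distance from node 0 to node n is the number of 1 bits in n,
    # so the diameter is the largest such count.
    diameter = max(bin(n).count("1") for n in range(p))
    # Split the nodes by their highest address bit and count crossing links.
    bisection = sum(
        1
        for n in range(p)
        for m in hypercube_neighbors(n, dims)
        if n < m and (n >> (dims - 1)) != (m >> (dims - 1))
    )
    return {
        "nodes": p,
        "diameter": diameter,
        "bisection": bisection,
        "links": links,
        "degree": dims,
    }


if __name__ == "__main__":
    print(hypercube_metrics(3))
```

For dims = 3 (p = 8) this gives a diameter of log2(8) = 3, a bisection bandwidth of 8/2 = 4, and 8/2 * log2(8) = 12 links, matching the formulas listed above.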

<h2>Tree</h2>

[[Image:Top_tree.jpg|thumbnail|right|Tree]]

The tree is a hierarchical structure with nodes at the bottom and switching nodes at the upper levels.
 
* ''Diameter:'' 2log2(p)
* ''Bisection BW:'' 1
* ''# Links:'' 2(p-1)
* ''Degree:'' 3
 
The tree experiences high traffic at the upper levels. Since almost half of the messages need to go through the root node, the root becomes the bottleneck of the tree topology. The other disadvantage of the tree topology is that the bisection bandwidth is only 1.

<h2>Fat Tree</h2>

[[Image:Top_fat_tree.jpg|thumbnail|right|Fat Tree]]
In order to improve the performance of the tree topology, the fat tree alleviates the traffic at upper levels by "fattening" up the links at the upper levels.
 
* ''Diameter:'' 2log2(p)
* ''Bisection BW:'' p/2
* ''# Links:'' 2(p-1)
* ''Degree:'' 3
 
The fat tree relieves the pressure on the root node, and the bisection bandwidth is also increased.

<h2>Butterfly</h2>

[[Image:Top_butterfly.jpg|thumbnail|right|Butterfly]]
 

The butterfly structure is similar to the tree structure, but it replicates the switching node structure of the tree topology and connects them together so that there are equal links and routers at all levels.
* ''Diameter:'' 2log2(p)
* ''Bisection BW:'' p/2
* ''# Links:'' 2p*log2(p)
* ''Degree:'' 4

 

The butterfly has the same performance as the hypercube. In terms of cost, the butterfly has a lower degree, but the hypercube has fewer links.

<h1>Real-World Implementation of Network Topologies </h1>

In a research study, Andy Hospodor and Ethan Miller investigated several network topologies for a high-performance, high-traffic network2, including the fat tree, butterfly, mesh, torus, and hypercube structures. Advantages and disadvantages including cost, performance, and reliability were discussed.

In this experiment, a petabyte-scale network with over 100 GB/s total aggregate bandwidth was investigated. The network consisted of 4096 disks and large servers, with routers and switches in between2.

[[Image:Disknet_network.jpg|frame|center|''Basic structure of Hospodor and Miller's experimental network''2]]

The overall structure of the network is shown above. Note that this structure is very susceptible to failure and congestion.

<h2>Fat Tree</h2>

In large-scale, high-performance applications, the fat tree can be a choice. However, in order to "fatten" up the links, redundant connections must be used. Instead of using one link between switching nodes, several must be used. The problem is that with more input and output links, one needs routers with more input and output ports. Routers with in excess of 100 ports are difficult to build and expensive, so multiple routers would have to be stacked together. Even so, the routers would be expensive, and several of them would be required2.
 
 
The Japan Agency for Marine-Earth Science and Technology supercomputing system uses the fat tree topology. The system connects 1280 NEC processors7.

 

<h2>Butterfly</h2>

In high-performance applications, the butterfly structure is a good choice. The butterfly topology uses fewer links than other topologies; however, each link carries traffic from the entire layer. Fault tolerance is poor because there is only a single path between any pair of nodes. Should a link break, data cannot be re-routed, and communication is broken2.

[[Image:Disknet_butterfly.jpg|frame|center|''Butterfly structure''2]]

<h2>Meshes and Tori</h2>

The mesh and torus structures used in this application would require a large number of links and an aggregate of several thousand ports. However, since there are so many links, the mesh and torus structures provide alternate paths in case of failures2.

[[Image:Disknet_mesh.jpg|frame|center|''Mesh structure''2]]

[[Image:Disknet_torus.jpg|frame|center|''Torus structure''2]]

One example of current use of the torus structure is the QPACE SFB TR Cluster in Germany, which uses PowerXCell 8i processors. The system uses a 3-D torus topology with 4608 processors7.

<h2>Hypercube</h2>

Similar to the torus structures, the hypercube requires a larger number of links. However, its bandwidth scales better than that of the mesh and torus structures.

The CRAY T3E, CRAY XT3, and SGI Origin 2000 use k-ary n-cube topologies.

[[Image:Disknet_hypercube.jpg|frame|center|''Hypercube structure''2]]

It is worth mentioning that even though many topologies perform much better than the 2-D mesh, these advanced topologies are also costly. Since most chips are laid out in 2-D space, it is very expensive to implement a high-dimensional topology on a 2-D chip. For the hypercube topology, increasing the number of nodes raises the degree of every node. For the butterfly topology, the degree grows relatively slowly, but the required number of links and switches increases rapidly.

<h1>Comparison of Network Topologies </h1>


The following table shows the total number of ports required for each network topology.
 
 
[[Image:Disknet_ports.jpg]]
 
''Number of ports for each topology''2

 

As the figure above shows, the 6-D hypercube requires the largest number of ports, due to its relatively complex six-dimensional structure. In contrast, the fat tree requires the fewest ports, even though its links have been "fattened" up by using redundant links. The butterfly network requires more than twice as many ports as the fat tree, since it essentially replicates the switching layer of the fat tree. The number of ports for the mesh and torus structures increases as the dimensionality increases. However, with modern router technology, the number of ports is a less important consideration.

 
 
The average path length, or average number of hops, and the average link load (GB/s) are shown below.
 
 
[[Image:Disknet_load.jpg]]
 
''Average path length and link load for each topology''2

 

Looking at the trends, when the average path length is high, the average link load is also high. In other words, average path length and average link load are proportionally related. It is obvious from the graph that the 2-D mesh has, by far, the worst performance. In a large network such as this, the average path length is just too high, and the average link load suffers. For this type of high-performance network, the 2-D mesh does not scale well. The 2-D torus cuts the average path length and average link load in half by connecting the edge nodes together; however, its performance is still relatively poor compared to the other types. The butterfly and fat tree have the lowest average path length and average link load.

 
 
The figure below shows the cost of the network topologies.
 
 
[[Image:Disknet_cost.jpg]]
 
''Cost of each topology''2

 

Despite using the fewest ports, the fat tree topology has the highest cost, by far. Its ports are high-bandwidth ports of 10 GB/s, and over 2400 of them are required to provide enough bandwidth at the upper levels of the tree. This pushes the cost up dramatically and makes the fat tree impractical from a cost standpoint. While the total cost of the fat tree is about 15 million dollars, the rest of the network topologies are clustered below 4 million dollars. As the dimensionality of the mesh and torus structures increases, the cost increases. The cost of the butterfly network falls between that of the 2-D mesh/torus and the 6-D hypercube.

 
 
Factoring in both cost and average link load produces the following graph.
 
 
[[Image:Disknet_overall.jpg]]
 
''Overall cost of each topology''2

 

From the figure above, the 6-D hypercube is the most cost-effective choice for this particular network setup. Although the 6-D hypercube costs more because it needs more links and ports, it provides higher bandwidth, which can offset the higher cost. The high-dimensional tori also perform well, but cannot provide as much bandwidth as the 6-D hypercube. For systems that do not need as much bandwidth, a high-dimensional torus is also a good choice. The butterfly topology is another alternative, but it has lower fault tolerance.

However, when the number of nodes increases, the relative cost of the higher-dimensional topologies increases far faster than their relative performance when compared to a 2-D mesh. This is because the 2-D mesh only uses low-cost, short links. The higher-dimensional structures must be projected onto our 3-dimensional world, and thus require many long, expensive links that wrap around the outside of the system like an impenetrable tangle of jungle vines. Maintaining such a network is also quite slow and tedious.


<h1>Routing</h1>


The '''routing''' algorithm determines what path a packet of data will take from source to destination. Routing can be '''deterministic''', where the path is always the same for a given source and destination, or '''adaptive''', where the path can change. The routing algorithm can also be '''partially adaptive''', where packets have multiple choices but not all packets are allowed to use the shortest path3.


<h2>Deadlock</h2>

Packets are in '''deadlock''' when they cannot continue to move through the nodes. The illustration below demonstrates this situation.
 
 
[[Image:Routing_deadlock.jpg]]
 
''Example of deadlock''
 
 

Assume that all of the buffers are full at each node. The packet from Node 1 cannot continue to Node 2, the packet from Node 2 cannot continue to Node 3, and so on. Since no packet can move, the packets are deadlocked.

 
 

Deadlock arises from a cyclic pattern of routing. To avoid deadlock, circular routing patterns must be avoided.
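Whether a set of blocked packets forms such a cycle can be checked mechanically. The following is a minimal sketch, assuming the situation is described by a hypothetical `waits_for` dictionary that maps each node to the nodes whose buffers its packet is waiting on; the four-node ring in the example is an assumed reading of the illustration, not data from the source. A depth-first search reports a cycle when it reaches a node that is still on the current search path.

```python
def has_cycle(waits_for):
    """Return True if the wait-for relation contains a cycle (a deadlock)."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {}

    def visit(node):
        color[node] = GRAY
        for nxt in waits_for.get(node, ()):
            state = color.get(nxt, WHITE)
            if state == GRAY:
                return True  # reached a node already on the path: a cycle
            if state == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(
        color.get(node, WHITE) == WHITE and visit(node) for node in list(waits_for)
    )


if __name__ == "__main__":
    ring = {1: [2], 2: [3], 3: [4], 4: [1]}  # every packet waits on the next node
    chain = {1: [2], 2: [3], 3: [4]}         # node 4 can drain, so no deadlock
    print(has_cycle(ring), has_cycle(chain))
```

In the chain case the last node is not waiting on anyone, so its buffer eventually empties and the packets behind it can advance.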

 
 

To avoid circular patterns of routing, some routing patterns are disallowed. These are called '''turn restrictions''', where some turns are not allowed in order to avoid making a circular routing pattern. Some of these turn restrictions are mentioned below.


 
<h2>Dimension-ordered (X-Y) routing</h2>

Turns from the y-dimension to the x-dimension are not allowed.


<h2>West First</h2>

Turns to the west are not allowed, so a packet that must travel west does so first.


<h2>North Last</h2>

Once a packet is traveling north, no further turns are allowed, so any northward hops must come last.


<h2>Negative First</h2>

Turns from a positive direction (+x or +y) to a negative direction (-x or -y) are not allowed, so a packet must make its negative-direction hops first.


<h2>Odd-Even Turn Model</h2>


Unfortunately, the above turn-restriction models reduce the degree of adaptiveness and are only partially adaptive. The models cause some packets to take different routes, and not necessarily the minimal paths. This may cause unfairness and reduces the ability of the system to relieve congestion. Overall performance could suffer3.


 


Ge-Ming Chiu introduces the Odd-Even turn model as an adaptive, deadlock-free turn-restriction model that has better performance than the previously mentioned models3. The model is designed primarily for 2-D meshes.

 

''Turns from the east to north direction from any node on an even column are not allowed.''
 
''Turns from the north to west direction from any node on an odd column are not allowed.''
 
 
''Turns from the east to south direction from any node on an even column are not allowed.''
 
''Turns from the south to west direction from any node on an odd column are not allowed.''
 
 
 
The illustration below shows allowed routing for different source and destination nodes. Depending on which column the packet is in, only certain directions are allowed.
 
 
[[Image:Routing_odd_even.jpg]]
 
''Odd-Even turn restriction model proposed by Ge-Ming Chiu''3
 

 

<h1>Comparison of Turn Restriction Models</h1>

To compare the performance of various turn restriction models, Chiu simulated a 15 x 15 mesh under various traffic patterns. All channels have a bandwidth of 20 flits/usec and a buffer size of one flit. The dimension-ordered x-y routing, west-first, and negative-first models were compared against the odd-even model.

 

Simulations were conducted with uniform, transpose, and hot spot traffic patterns. Uniform simulates each node sending messages to any other node with equal probability. Transpose simulates two opposite nodes sending messages to their respective halves of the mesh. Hot spot simulates a few "hot spot" nodes that receive high traffic.



 
 
[[Image:Routing_uniform.jpg]]
 
''Uniform traffic simulation of various turn restriction models''3
 
 
The performance of the different routing algorithms is shown above for the uniform traffic. For uniform traffic, the dimension-ordered x-y model outperforms the rest of the models. As the number of messages increases, the x-y model has the slowest increase in average communication latency.



 
 
[[Image:Routing_transpose.jpg]]
 
''First transpose traffic simulation of various turn restriction models''3
 
 
The performance of the different routing algorithms is shown above for the first transpose traffic. The negative-first model has the best performance, while the odd-even model performs better than the west-first and x-y models.



 
 
[[Image:Routing_transpose2.jpg]]
 
''Second transpose traffic simulation of various turn restriction models''3
 
 
With the second transpose simulation, the odd-even model outperforms the rest.



 
 
[[Image:Routing_hotspot.jpg]]
 
''Hotspot traffic simulation of various turn restriction models''3
 
 
The performance of the different routing algorithms is shown above for the hotspot traffic. Only one hotspot was simulated for this test. The odd-even model outperforms the other models when hotspot traffic is 10%.



 
 
[[Image:Routing_hotspot2.jpg]]
 
''Second hotspot traffic simulation of various turn restriction models''3
 
 
When the number of hotspots is increased to five, the odd-even model begins to shine. Its latency is the lowest for both 6 and 8 percent hotspot traffic. Meanwhile, the x-y model performs very poorly.


 
 

While the x-y model performs well under uniform traffic, it lacks adaptiveness. Under hotspot traffic, the x-y model cannot re-route packets around the congestion caused by the hotspots. The odd-even model has superior adaptiveness under high congestion.


 

<h1>Router Architecture</h1>

The '''router''' is a device that routes incoming data to its destination. It has several input ports and several output ports. Data arriving on one of the input ports is routed to one of the output ports. Which output port is chosen depends on the destination of the data and on the routing algorithm.

 
 

The internal architecture of a router consists of input and output ports and a '''crossbar switch'''. The crossbar switch connects each input to the output selected for it, acting essentially as a multiplexer.
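The multiplexer-like behavior of the crossbar can be sketched for a single switching cycle. This is a simplified model under stated assumptions: the function name is hypothetical, each output can serve one input per cycle, and conflicts are resolved by a fixed priority (lowest-numbered input wins), which is an assumption rather than something the text specifies.

```python
def crossbar_cycle(requests, num_outputs):
    """Resolve one cycle of a crossbar.

    requests maps an input port to the output port it wants.
    Returns a dict mapping each granted output port to the input it serves.
    Assumed arbitration: the lowest-numbered input wins a contested output.
    """
    grants = {}
    for inp in sorted(requests):
        out = requests[inp]
        if not 0 <= out < num_outputs:
            raise ValueError(f"output port {out} does not exist")
        if out not in grants:  # output still free in this cycle
            grants[out] = inp
    return grants
```

With `crossbar_cycle({0: 2, 1: 2, 2: 0}, 4)`, inputs 0 and 1 contend for output 2; input 0 is granted and input 1 must retry in a later cycle.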

 
 

Router technology has improved significantly over the years. This has allowed networks with high dimensionality to become feasible. As shown in the real-world example above, high-dimensional tori and hypercubes are excellent choices of topology for high-performance networks. The cost of high-performance, high-radix routers has contributed to the viability of these types of high-dimensionality networks. As the graph below shows, the bandwidth of routers has improved tremendously over a period of 10 years4.


 
[[Image:Router_bandwidth.jpg]]
 
''Bandwidth of various routers over 10 year period''4

 

Looking at the physical architecture and layout of routers, it is evident that the circuitry has become dramatically more dense and complex.


 
[[Image:Router_physical.jpg]]
 
''Router hardware over period of time''4
 

 
[[Image:Router_radix.jpg]]
 
''Radix and latency of routers over 10 year period''4
 
 

The '''radix''', or the number of ports, of routers has also increased. Current technology has not only a high radix but also low latency compared to the last generation. As radix increases, the latency remains steady.


 


With high-performance routers, complex topologies become possible: as router technology improves, more complex, high-dimensionality topologies become practical.


 

<h1>Fault Tolerant Routing</h1>

Fault-tolerant routing means the successful routing of messages between any pair of non-faulty nodes in the presence of faulty components6. As the number of processors in a multiprocessor system and the data rates increase, reliable transmission of data in the event of a network fault becomes a great concern, and hence fault-tolerant routing algorithms are important.


<h2>Fault Models</h2>

Faults in a network can be categorized in two types:
 
 
1.'''Transient Faults'''5: A transient fault is a temporary fault that lasts for a very short time. It can be caused, for example, by a change in the output of a flip-flop that leads to the generation of an invalid header. These faults can be minimized using error control coding, and they are generally evaluated in terms of Bit Error Rate.
 
 
2.'''Permanent Faults'''5: A permanent fault is a fault that does not go away and causes permanent damage to the network. It could be due to damaged wires and associated circuitry. These faults are generally evaluated in terms of Mean Time Between Failures.


<h2>Fault Tolerance Mechanisms (for permanent faults)</h2>

The permanent faults can be handled using one of the two mechanisms:
 
 
1.'''Static Mechanism''': In the static fault tolerance model, once a fault is detected, all processes running in the system are stopped and the routing tables are emptied. The routing tables are then recalculated from the fault information to provide fault-free paths.
 
 
2.'''Dynamic Mechanisms''': In the dynamic fault tolerance model, the processes in the network are not completely stalled, and only the affected regions are repaired. Some of the methods to do this are:
 
 
a.'''Block Faults''': In this method, many of the healthy nodes in the vicinity of the faulty nodes are also marked as faulty, so that no routes are created close to the actual faulty nodes. The shape of the region can be convex or non-convex, and care is taken that none of the new routes introduces a cyclic dependency in the channel dependency graph (CDG).
 
 
DISADVANTAGE: This method causes many healthy nodes to be declared faulty, which reduces system capacity.
 
 
[[Image:Fault_pic1.jpg]]
 
 
b.'''Fault Rings''': This method was introduced by Chalasani and Boppana. A fault ring is a set of nodes and links that are adjacent to faulty nodes/links. This approach reduces the number of healthy nodes that must be marked as faulty and blocked.
 
 
[[Image:Fault_pic2.jpg]]
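The capacity cost of block faults, and the set of nodes that would surround such a block, can be illustrated with a small sketch. It rests on simplifying assumptions that are not taken from the cited work: the function names are hypothetical, the fault block is taken to be the bounding rectangle of the faulty nodes (one simple way to obtain a convex region), and the ring is taken to be the healthy nodes that touch the block, including diagonally.

```python
def block_fault_region(faulty):
    """Bounding rectangle of the faulty nodes (assumed convex fault block)."""
    xs = [x for x, _ in faulty]
    ys = [y for _, y in faulty]
    return {(x, y)
            for x in range(min(xs), max(xs) + 1)
            for y in range(min(ys), max(ys) + 1)}

def fault_ring(region, n):
    """Healthy nodes of the n x n mesh that border the region."""
    ring = set()
    for x, y in region:
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                nb = (x + dx, y + dy)
                inside_mesh = 0 <= nb[0] < n and 0 <= nb[1] < n
                if inside_mesh and nb not in region:
                    ring.add(nb)
    return ring
```

For faulty nodes (2, 2) and (3, 4), the block covers six nodes, so four healthy nodes are sacrificed; this is the loss of capacity noted as the disadvantage of block faults.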
 


 

<h1>References</h1>


1 Solihin text
 
2 [http://www.ssrc.ucsc.edu/Papers/hospodor-mss04.pdf Interconnection Architectures for Petabyte-Scale High-Performance Storage Systems]
 
3 [http://www.diit.unict.it/~vcatania/COURSES/semm_05-06/DOWNLOAD/noc_routing02.pdf The Odd-Even Turn Model for Adaptive Routing]
 
4 [http://www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf Interconnection Topologies:(Historical Trends and Comparisons)]
This link is down at this time. [http://webcache.googleusercontent.com/search?q=cache:2f2KNFWJCsQJ:www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf+history+of+interconnection+topologies&cd=9&hl=en&ct=clnk&gl=us&client=ubuntu&source=www.google.com Google Cache]
 
5 [http://dspace.upv.es/xmlui/bitstream/handle/10251/2603/tesisUPV2824.pdf?sequence=1 Efficient mechanisms to provide fault tolerance in interconnection networks for PC clusters, José Miguel Montañana Aliaga.]
 
6 [http://web.ebscohost.com.www.lib.ncsu.edu:2048/ehost/pdfviewer/pdfviewer?vid=2&hid=15&sid=72e3828d-3cb1-42b9-8198-5c1e974ea53f@sessionmgr4 Adaptive Fault Tolerant Routing Algorithm for Tree-Hypercube Multicomputer, Qatawneh Mohammad]
 
7 [http://www.top500.org TOP500 Supercomputing Sites]
 
8 [http://www.redbooks.ibm.com/abstracts/sg245161.html?Open Understanding and Using the SP Switch]
 
9 [http://www.myri.com/myrinet/overview/ Myrinet Overview]
 
10 [http://en.wikipedia.org/wiki/QsNet QsNet (Quadrics' network)]
 
11 [http://www.google.com/research/pubs/pub35155.html Dragonfly Topology]
 
12 [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.95.573&rep=rep1&type=pdf Flattened Butterfly Topology]

CSC/ECE 506 Spring 2011/ch12 aj

2011-04-26T03:31:12Z

Asbransc:

<h1>Interconnection Network Architecture </h1>


In a multi-processor system, processors need to communicate with each other and access each other's resources. In order to route data and messages between processors, an interconnection architecture is needed.


 


Typically, in a multiprocessor system, messages passed between processors are frequent and short1. Therefore, the interconnection network architecture must handle messages quickly by having '''low latency''', and must handle several messages at a time by having '''high bandwidth'''.


 


In a network, a processor along with its cache and memory is considered a '''node'''. The physical wires that connect nodes are called '''links'''. The device that routes messages between nodes is called a router. The shape of the network, such as the number of links and routers, is called the network '''topology'''.


<h1>History of Network Topologies</h1>

Hypercube topologies were invented in the 1980s and had desirable characteristics when the number of nodes is small (~1000 maximum, often <100) and every processor must stop working to receive and forward messages 4. The low-radix era began in 1985 and was defined by routers with between 4 and 8 ports using toroidal, mesh, or fat-tree topologies and wormhole routing. This era lasted about 20 years, until it was determined that routers with dozens of ports offered superior performance. Two topologies were developed to take advantage of the newly developed high-radix routers: the flattened butterfly and the dragonfly. Both fall somewhere between a mesh in which each point is a router (or a virtual router, in the case of dragonfly) with dozens or hundreds of nodes attached, and a fat tree with arity high enough to need only two levels.

<h2>Interconnection evolution in the Top500 List</h2>
[[Image:Top500interconnect.png|thumbnail|300px|left|]]

This chart shows the evolution over time of the different interconnect topologies by their dominance in the TOP500 list of supercomputers 7. As one can see, many technologies came into vogue briefly before losing performance share and disappearing. In the early days of the list, most entries list the interconnect type as not applicable. However, the trailing end of the hypercube phase is clear in burnt orange. The dark blue at the top is "other" and the dark red in the middle is "proprietary", so we can only speculate about what topologies they might employ. The toroidal mesh appears briefly at the start in a cream color, and slightly outlasts the hypercube. The two crossbar technologies (blue and olive) followed the toroidal mesh. The fully-distributed crossbar died out quickly, while the multi-stage crossbar lasted longer but was never dominant. The 3-D torus (purple) dominates much of the 90s, with hypercube topologies (dark pink) enjoying a short comeback in the later part of the decade. SP Switch (light olive), an IBM interconnect technology that uses a multi-stage crossbar switch, replaced the 3-D torus8. Myrinet, Quadrics, and Federation all shared the spotlight in the mid 00s, and each used a similar fat-tree topology9,10. The current class of supercomputers is dominated by nodes connected with either InfiniBand or Gigabit Ethernet. Both can be connected in either a fat-tree or 2-D mesh topology. The primary difference between them is speed. InfiniBand is considerably faster per link and allows links to be ganged into groups of 4 or 12. Gigabit Ethernet is vastly less expensive, however, and some supercomputer designers have apparently chosen to save money on the interconnect technology in order to allow the use of faster nodes 11,12.

 
 
 
 

<h1>Types of Network Topologies </h1>

Several metrics are normally chosen to represent the cost and performance of a topology. In this section, the degree, number of links, diameter, and bisection width are given for each topology.

<h2>Linear Array</h2>

[[Image:Top_linear.jpg|thumbnail|frame|right|]]

The nodes are connected linearly, as in an array. This type of topology is simple; however, it does not scale well.
 
* ''Diameter:'' p-1
* ''Bisection BW:'' 1
* ''# Links:'' p-1
* ''Degree:'' 2
 
A linear array is the cheapest way to connect a group of nodes. Its number of links and its degree are the smallest of any topology. The drawback is equally clear: the two end nodes are p-1 hops apart, which makes the diameter p-1. The topology is also unreliable because the bisection bandwidth is 1, so a single failed link splits the network in two.

<h2>Ring</h2>
[[Image:Top_ring.jpg|thumbnail|right|]]
The ring has a similar structure to the linear array, except that the two end nodes are connected to each other, forming a circular structure.
 
* ''Diameter:'' p/2
* ''Bisection BW:'' 2
* ''# Links:'' p
* ''Degree:'' 2
 
Compared with the linear array, the ring adds only one link yet gains a relatively large improvement: the longest distance between two nodes is cut in half, and the bisection bandwidth increases to 2.

<h2>2-D Mesh</h2>

[[Image:Top_2Dmesh.jpg|thumbnail|right|]]

The 2-D mesh can be thought of as several linear arrays put together to form a 2-dimensional structure. This topology is well suited to applications such as the ocean application and matrix calculations.
 
* ''Diameter:'' 2(sqrt(p)-1)
* ''Bisection BW:'' sqrt(p)
* ''# Links:'' 2sqrt(p)(sqrt(p)-1)
* ''Degree:'' 4
 
Nodes that are not on the edge have a '''degree''' of 4. The number of links is the sum of the vertical links, sqrt(p)(sqrt(p)-1), and the horizontal links, also sqrt(p)(sqrt(p)-1). The diameter is the distance between two diagonally opposite corner nodes.
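The 2-D mesh formulas can be checked by building a small mesh and measuring it directly. The sketch below is illustrative only; the function names are hypothetical, k stands for sqrt(p), and the bisection check counts the links cut by one particular split (left half versus right half, k even) rather than searching for a minimum cut.

```python
from collections import deque

def mesh_links(k):
    """Return the set of undirected links in a k x k mesh."""
    links = set()
    for x in range(k):
        for y in range(k):
            if x + 1 < k:
                links.add(((x, y), (x + 1, y)))  # horizontal link
            if y + 1 < k:
                links.add(((x, y), (x, y + 1)))  # vertical link
    return links

def diameter(links):
    """Longest shortest-path distance, found by a BFS from every node."""
    adj = {}
    for a, b in links:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    longest = 0
    for start in adj:
        dist = {start: 0}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        longest = max(longest, max(dist.values()))
    return longest

def left_right_cut(k):
    """Links cut when the mesh (k even) is split into left and right halves."""
    return sum(1 for a, b in mesh_links(k) if a[0] < k // 2 <= b[0])
```

For k = 4 (p = 16), the mesh has 24 links, a diameter of 6, and 4 links crossing the middle, matching 2sqrt(p)(sqrt(p)-1), 2(sqrt(p)-1), and sqrt(p).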
 

<h2>2-D Torus</h2>

[[Image:Top_2Dtorus.jpg|thumbnail|right|]]

Just as the ring closes the linear array, the 2-D torus takes the structure of the 2-D mesh and connects the nodes on opposite edges.
 
* ''Diameter:'' sqrt(p)-1
* ''Bisection BW:'' 2sqrt(p)
* ''# Links:'' 2p
* ''Degree:'' 4
 
With the end-around connections, the longest distance is shortened and the bisection bandwidth is doubled. Of course, the torus also costs more than the 2-D mesh, since it needs 2p links instead of 2sqrt(p)(sqrt(p)-1).
 

<h2>Cube</h2>
[[Image:Top_cube.jpg|thumbnail|right|]]

If we add two more neighbors to each node, we get a cube. The cube can be thought of as a three-dimensional mesh.
 
* ''Diameter:'' 3(p^(1/3)-1)
* ''Bisection BW:'' p^(2/3)
* ''# Links:'' 3p^(2/3)(p^(1/3)-1)
* ''Degree:'' 6 (for the inside nodes)
 
We can also extend some of the metrics to the N-dimensional mesh:
 
* ''Diameter:'' N(p^(1/N)-1)
* ''Bisection BW:'' p^((N-1)/N)
* ''# Links:'' N*p^((N-1)/N)*(p^(1/N)-1), which reduces to N*2^N/2 when each dimension has only two nodes
 
The degree is harder to state in general because it depends on the number of nodes per dimension. An inside node of a cube with at least three nodes per side has degree 6, whereas every node of a four-dimensional mesh with only two nodes per dimension has degree 4.

<h2>Hypercube</h2>

[[Image:Top_hypercube.jpg|thumbnail|right|Hypercube]]

In the N-dimensional cube, the boundary nodes are normally the ones that hurt the performance of the entire network. We can fix this by connecting those boundary nodes together. The hypercube is essentially multiple cubes put together.
 
* ''Diameter:'' log2(p)
* ''Bisection BW:'' p/2
* ''# Links:'' p/2 * log2(p)
* ''Degree:'' log2(p)
 
From the metrics we can see that the diameter and bisection bandwidth are significantly improved in the higher-order topologies.
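The hypercube metrics follow from a simple labeling, sketched below under the usual assumption (not stated in the text) that each node carries a binary label and is linked to the nodes whose labels differ in exactly one bit. The function names are hypothetical, and dims must be at least 1.

```python
from itertools import combinations

def hypercube_neighbors(node, dims):
    """Neighbors are the labels that differ from node in exactly one bit."""
    return [node ^ (1 << bit) for bit in range(dims)]

def hypercube_distance(a, b):
    """Hop count equals the number of differing bits (Hamming distance)."""
    return bin(a ^ b).count("1")

def hypercube_metrics(dims):
    """Measure a hypercube with p = 2**dims nodes."""
    p = 1 << dims
    links = {frozenset((n, m))
             for n in range(p)
             for m in hypercube_neighbors(n, dims)}
    diameter = max(hypercube_distance(a, b)
                   for a, b in combinations(range(p), 2))
    top = 1 << (dims - 1)
    # Links that cross between the two halves selected by the top bit.
    bisection = sum(1 for link in links if len({n & top for n in link}) == 2)
    return {"p": p, "degree": dims, "links": len(links),
            "diameter": diameter, "bisection": bisection}
```

For dims = 4 (p = 16) this measures a degree of 4, 32 links, a diameter of 4, and 8 links across the middle, in line with log2(p), p/2 * log2(p), log2(p), and p/2.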

<h2>Tree</h2>

[[Image:Top_tree.jpg|thumbnail|right|Tree]]

The tree is a hierarchical structure with nodes at the bottom and switching nodes at the upper levels.
 
* ''Diameter:'' 2log2(p)
* ''Bisection BW:'' 1
* ''# Links:'' 2(p-1)
* ''Degree:'' 3
 
The tree experiences high traffic at the upper levels. Since almost half of the messages need to go through the root node, the root becomes the bottleneck of the tree topology. The other disadvantage of the tree topology is that the bisection bandwidth is only 1.

<h2>Fat Tree</h2>

[[Image:Top_fat_tree.jpg|thumbnail|right|Fat Tree]]
In order to improve the performance of the tree topology, the fat tree alleviates the traffic at upper levels by "fattening" up the links at the upper levels.
 
* ''Diameter:'' 2log2(p)
* ''Bisection BW:'' p/2
* ''# Links:'' 2(p-1)
* ''Degree:'' 3
 
The fat tree relieves the pressure on the root node, and the bisection bandwidth is also increased.

<h2>Butterfly</h2>

[[Image:Top_butterfly.jpg|thumbnail|right|Butterfly]]
 

The butterfly structure is similar to the tree structure, but it replicates the switching node structure of the tree topology and connects the replicas together so that there are equal numbers of links and routers at all levels.
* ''Diameter:'' 2log2(p)
* ''Bisection BW:'' p/2
* ''# Links:'' 2p*log2(p)
* ''Degree:'' 4

 

The butterfly offers performance comparable to the hypercube. In terms of cost, the butterfly has a lower degree, but the hypercube has fewer links.

<h1>Real-World Implementation of Network Topologies </h1>

In a research study by Andy Hospodor and Ethan Miller, several network topologies were investigated for a high-performance, high-traffic network2. The topologies included the fat tree, butterfly, mesh, torus, and hypercube structures. Advantages and disadvantages including cost, performance, and reliability were discussed.

In this experiment, a petabyte-scale network with over 100 GB/s total aggregate bandwidth was investigated. The network consisted of 4096 disks and large servers, with routers and switches in between2.

[[Image:Disknet_network.jpg|frame|center|''Basic structure of Hospodor and Miller's experimental network''2]]

The overall structure of the network is shown above. Note that this structure is very susceptible to failure and congestion.

<h2>Fat Tree</h2>

In large-scale, high-performance applications, the fat tree can be a choice. However, in order to "fatten" up the links, redundant connections must be used: instead of one link between switching nodes, several must be used. With more input and output links, one needs routers with more input and output ports. Routers with in excess of 100 ports are difficult to build and expensive, so multiple routers would have to be stacked together. Even so, the routers would be expensive, and several of them would be required2.
 
 
The Japan Agency for Marine-Earth Science and Technology supercomputing system uses the fat tree topology. The system connects 1280 NEC processors7.

 

<h2>Butterfly</h2>

In high-performance applications, the butterfly structure is a good choice. The butterfly topology uses fewer links than other topologies; however, each link carries traffic from the entire layer. Fault tolerance is poor, because only a single path exists between any pair of nodes. Should a link break, data cannot be re-routed, and communication is broken2.

[[Image:Disknet_butterfly.jpg|frame|center|''Butterfly structure''2]]

<h2>Meshes and Tori</h2>

The mesh and torus structures used in this application would require a large number of links and a total of several thousand ports. However, since there are so many links, the mesh and torus structures provide alternate paths in case of failures2.

[[Image:Disknet_mesh.jpg|frame|center|''Mesh structure''2]]

[[Image:Disknet_torus.jpg|frame|center|''Torus structure''2]]

Current uses of the torus structure include the QPACE SFB TR Cluster in Germany, which uses PowerXCell 8i processors. The system uses a 3-D torus topology with 4608 processors7.

<h2>Hypercube</h2>

Similar to the torus structures, the hypercube requires a larger number of links. However, its bandwidth scales better than that of the mesh and torus structures.

The CRAY T3E, CRAY XT3, and SGI Origin 2000 use k-ary n-cube topologies.

[[Image:Disknet_hypercube.jpg|frame|center|''Hypercube structure''2]]

It is worth mentioning that even though many topologies perform much better than the 2-D mesh, these advanced topologies also cost more. Since most chips are laid out in 2-D space, it is very expensive to implement a high-dimensional topology on a 2-D chip. For the hypercube topology, increasing the number of nodes raises the degree of every node. For the butterfly topology, the degree grows relatively slowly, but the required number of links and switches increases rapidly.

<h1>Comparison of Network Topologies </h1>


The following table shows the total number of ports required for each network topology.
 
 
[[Image:Disknet_ports.jpg]]
 
''Number of ports for each topology''2

 

As the figure above shows, the 6-D hypercube requires the largest number of ports, due to its relatively complex six-dimensional structure. In contrast, the fat tree requires the fewest ports, even though its links have been "fattened" up by using redundant links. The butterfly network requires more than twice as many ports as the fat tree, since it essentially replicates the switching layer of the fat tree. The number of ports for the mesh and torus structures increases as the dimensionality increases. However, with modern router technology, the number of ports is a less important consideration.

 
 
The average path length, or average number of hops, and the average link load (GB/s) are shown below.
 
 
[[Image:Disknet_load.jpg]]
 
''Average path length and link load for each topology''2

 

Looking at the trends, when the average path length is high, the average link load is also high. In other words, average path length and average link load rise together. It is clear from the graph that the 2-D mesh has, by far, the worst performance. In a large network such as this, its average path length is simply too high, and the average link load suffers. For this type of high-performance network, the 2-D mesh does not scale well. The 2-D torus cuts the average path length and average link load in half by connecting the edge nodes together, but its performance is still relatively poor compared to the other types. The butterfly and fat tree have the lowest average path length and average link load.

 
 
The figure below shows the cost of the network topologies.
 
 
[[Image:Disknet_cost.jpg]]
 
''Cost of each topology''2

 

Despite using the fewest ports, the fat tree topology has by far the highest cost. Its ports are high-bandwidth ports of 10 GB/s, and over 2400 of them are required to have enough bandwidth at the upper levels of the tree. This pushes the cost up dramatically and makes the fat tree impractical from a cost standpoint. While the total cost of the fat tree is about 15 million dollars, the rest of the network topologies are clustered below 4 million dollars. As the dimensionality of the mesh and torus structures increases, the cost increases. The cost of the butterfly network falls between that of the 2-D mesh/torus and the 6-D hypercube.

 
 
When the cost and average link load are both factored in, the following graph is produced.
 
 
[[Image:Disknet_overall.jpg]]
 
''Overall cost of each topology''2

 

From the figure above, the 6-D hypercube is the most cost-effective choice for this particular network setup. Although the 6-D hypercube costs more because it needs more links and ports, it provides higher bandwidth, which can offset the higher cost. The high-dimensional tori also perform well, but cannot provide as much bandwidth as the 6-D hypercube. For systems that do not need as much bandwidth, a high-dimensional torus is also a good choice. The butterfly topology is another alternative, but it has lower fault tolerance.

However, when the number of nodes increases, the relative cost of the higher-dimensional topologies increases far faster than their relative performance when compared to a 2-D mesh. This is because the 2-D mesh only uses low-cost, short links. The higher-dimensional structures must be projected onto our 3-dimensional world, and thus require many long, expensive links that wrap around the outside of the system like an impenetrable tangle of jungle vines. Maintaining such a network is also quite slow and tedious.


<h1>Routing</h1>


The '''routing''' algorithm determines what path a packet of data will take from source to destination. Routing can be '''deterministic''', where the path is always the same for a given source and destination, or '''adaptive''', where the path can change. The routing algorithm can also be '''partially adaptive''', where packets have multiple path choices but not every packet is allowed to use a shortest path3.


<h2>Deadlock</h2>

Packets are in '''deadlock''' when they cannot continue to move through the nodes. The illustration below demonstrates this event.
 
 
[[Image:Routing_deadlock.jpg]]
 
''Example of deadlock''
 
 

Assume that all of the buffers at each node are full. The packet from Node 1 cannot continue to Node 2, the packet from Node 2 cannot continue to Node 3, and so on. Since no packet can move, the packets are deadlocked.

 
 

Deadlock arises from a cyclic pattern of routing. To avoid deadlock, circular routing patterns must be avoided.
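The link between deadlock and cycles can be made concrete by recording which node each blocked packet is waiting on and searching that graph for a cycle. This is a minimal sketch: the function name and the dictionary representation are hypothetical, and the four-node ring in the usage note is an assumed example, since the text does not state how many nodes the illustration contains.

```python
def has_cycle(waits_for):
    """Return True if the waits-for graph contains a cycle.

    waits_for maps a node to the list of nodes whose buffers it is
    waiting on. A depth-first search marks nodes as in progress (GRAY)
    or finished (BLACK); reaching a GRAY node again means a cycle.
    """
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {}

    def visit(u):
        color[u] = GRAY
        for v in waits_for.get(u, []):
            state = color.get(v, WHITE)
            if state == GRAY:
                return True
            if state == WHITE and visit(v):
                return True
        color[u] = BLACK
        return False

    return any(color.get(u, WHITE) == WHITE and visit(u)
               for u in list(waits_for))
```

With `{1: [2], 2: [3], 3: [4], 4: [1]}` every node waits on the next one around the ring and the function reports a cycle; removing the last entry breaks the cycle, so the packets could eventually drain.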

 
 

To avoid circular patterns of routing, some routing patterns are disallowed. These are called '''turn restrictions''', where some turns are not allowed in order to avoid making a circular routing pattern. Some of these turn restrictions are mentioned below.


 
<h2>Dimensional ordered (X-Y) routing</h2>

Turns from the y-dimension to the x-dimension are not allowed.
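Because no turn from y to x is allowed, a packet finishes all of its x-dimension hops before taking any y-dimension hop. A minimal sketch of the resulting route on a mesh, with a hypothetical function name and (x, y) tuples as assumed node coordinates:

```python
def xy_route(src, dst):
    """Return the list of nodes visited by x-y routing from src to dst."""
    x, y = src
    path = [src]
    while x != dst[0]:  # travel along the x-dimension first
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:  # then travel along the y-dimension
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path
```

The path between a given source and destination is always the same, which is why this scheme is deterministic.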


<h2>West First</h2>

Turns to the west are not allowed, so a packet that needs to travel west must do so first.


<h2>North Last</h2>

Turns away from the north direction are not allowed, so a packet may travel north only as the last leg of its route.


<h2>Negative First</h2>

Turns from a positive direction (+x or +y) to a negative direction (-x or -y) are not allowed, so a packet must make all of its moves in the negative directions first.
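The four turn restrictions above can be written as sets of forbidden 90-degree turns and used to check a candidate path. This sketch makes several assumptions that are not from the text: east is +x and north is +y, a turn is written as (current direction, next direction), the names are hypothetical, and negative-first is encoded in its usual form, in which turns from a positive direction to a negative direction are prohibited. Going straight is always allowed and 180-degree reversals are not modeled.

```python
FORBIDDEN_TURNS = {
    # y-dimension to x-dimension
    "xy": {("N", "E"), ("N", "W"), ("S", "E"), ("S", "W")},
    # any turn into the west direction
    "west_first": {("N", "W"), ("S", "W")},
    # any turn out of the north direction
    "north_last": {("N", "E"), ("N", "W")},
    # positive direction (E, N) to negative direction (W, S)
    "negative_first": {("E", "S"), ("N", "W")},
}

def path_is_legal(model, moves):
    """Check a sequence of travel directions against a turn model."""
    banned = FORBIDDEN_TURNS[model]
    return all((a, b) not in banned for a, b in zip(moves, moves[1:]))
```

For example, `path_is_legal("west_first", ["W", "W", "N", "E"])` is accepted because the westward hops come first, while `path_is_legal("west_first", ["N", "W"])` is rejected.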


<h2>Odd-Even Turn Model</h2>


Unfortunately, the above turn-restriction models reduce the degree of adaptiveness and are only partially adaptive. The models cause some packets to take different routes, and not necessarily the minimal paths. This may cause unfairness and reduces the ability of the system to relieve congestion. Overall performance could suffer3.


 


Ge-Ming Chiu introduces the Odd-Even turn model as an adaptive turn restriction, deadlock-free model that has better performance than the previously mentioned models3. The model is designed primarily for 2-D meshes.

 

''Turns from the east to north direction from any node on an even column are not allowed.''
 
''Turns from the north to west direction from any node on an odd column are not allowed.''
 
 
''Turns from the east to south direction from any node on an even column are not allowed.''
 
''Turns from the south to west direction from any node on an odd column are not allowed.''
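The four rules above translate directly into a check on the column in which a turn is made. In this sketch the function name is hypothetical, columns are assumed to be numbered from 0, and a turn is given as the current travel direction followed by the new one.

```python
def odd_even_turn_allowed(column, incoming, outgoing):
    """Return True if the odd-even model permits this turn in this column.

    incoming is the direction the packet is currently traveling and
    outgoing is the direction it wants to take next ("E", "W", "N", "S").
    """
    turn = (incoming, outgoing)
    if column % 2 == 0:
        # Even column: east-to-north and east-to-south are prohibited.
        return turn not in {("E", "N"), ("E", "S")}
    # Odd column: north-to-west and south-to-west are prohibited.
    return turn not in {("N", "W"), ("S", "W")}
```

A packet traveling east may therefore turn north in column 3 but not in column 2, which is how the set of permitted directions comes to depend on the column the packet is in.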
 
 
 
The illustration below shows allowed routing for different source and destination nodes. Depending on which column the packet is in, only certain directions are allowed.
 
 
[[Image:Routing_odd_even.jpg]]
 
''Odd-Even turn restriction model proposed by Ge-Ming Chiu''3
 

 

<h1>Comparison of Turn Restriction Models</h1>

To compare the performance of various turn restriction models, Chiu simulated a 15 x 15 mesh under various traffic patterns. Every channel has a bandwidth of 20 flits/usec and a buffer size of one flit. The dimension-ordered x-y routing, west-first, and negative-first models were compared against the odd-even model.

 

The simulated traffic patterns were uniform, transpose, and hot spot. In uniform traffic, each node sends messages to every other node with equal probability. Transpose simulates two opposite nodes sending messages to their respective halves of the mesh. Hot spot simulates a few "hot spot" nodes that receive a high share of the traffic.



 
 
[[Image:Routing_uniform.jpg]]
 
''Uniform traffic simulation of various turn restriction models''3
 
 
The performance of the different routing algorithms is shown above for the uniform traffic. For uniform traffic, the dimension-ordered x-y model outperforms the rest of the models. As the number of messages increases, the x-y model shows the slowest increase in average communication latency.



 
 
[[Image:Routing_transpose.jpg]]
 
''First transpose traffic simulation of various turn restriction models''3
 
 
The performance of the different routing algorithms is shown above for the first transpose traffic. The negative-first model has the best performance, while the odd-even model performs better than the west-first and x-y models.



 
 
[[Image:Routing_transpose2.jpg]]
 
''Second transpose traffic simulation of various turn restriction models''3
 
 
With the second transpose simulation, the odd-even model outperforms the rest.



 
 
[[Image:Routing_hotspot.jpg]]
 
''Hotspot traffic simulation of various turn restriction models''3
 
 
The performance of the different routing algorithms is shown above for the hotspot traffic. Only one hotspot was simulated for this test. The odd-even model outperforms the other models when hotspot traffic is 10%.



 
 
[[Image:Routing_hotspot2.jpg]]
 
''Second hotspot traffic simulation of various turn restriction models''3
 
 
When the number of hotspots is increased to five, the advantage of the odd-even model becomes clear: its latency is the lowest for both 6 and 8 percent hotspot traffic. Meanwhile, the x-y model performs very poorly.


 
 

While the x-y model performs well under uniform traffic, it lacks adaptiveness. Under hotspot traffic, the x-y model cannot re-route packets around the congestion caused by the hotspots. The odd-even model has superior adaptiveness under high congestion.


 

<h1>Router Architecture</h1>

The '''router''' is a device that routes incoming data to its destination. It has several input ports and several output ports. Data arriving on one of the input ports is routed to one of the output ports. Which output port is chosen depends on the destination of the data and on the routing algorithm.

 
 

The internal architecture of a router consists of input and output ports and a '''crossbar switch'''. The crossbar switch connects each input to the output selected for it, acting essentially as a multiplexer.

 
 

Router technology has improved significantly over the years. This has allowed networks with high dimensionality to become feasible. As shown in the real-world example above, high-dimensional tori and hypercubes are excellent choices of topology for high-performance networks. The cost of high-performance, high-radix routers has contributed to the viability of these types of high-dimensionality networks. As the graph below shows, the bandwidth of routers has improved tremendously over a period of 10 years4.


 
[[Image:Router_bandwidth.jpg]]
 
''Bandwidth of various routers over 10 year period''4

 

Looking at the physical architecture and layout of routers, it is evident that the circuitry has become dramatically more dense and complex.


 
[[Image:Router_physical.jpg]]
 
''Router hardware over period of time''4
 

 
[[Image:Router_radix.jpg]]
 
''Radix and latency of routers over 10 year period''4
 
 

The '''radix''', or the number of ports, of routers has also increased. Current technology has not only a high radix but also low latency compared to the last generation. As radix increases, the latency remains steady.


 


With high-performance routers, complex topologies become possible: as router technology improves, more complex, high-dimensionality topologies become practical.


 

<h1>Fault Tolerant Routing</h1>

Fault-tolerant routing means the successful routing of messages between any pair of non-faulty nodes in the presence of faulty components6. As the number of processors in a multiprocessor system and the data rates increase, reliable transmission of data in the event of a network fault becomes a great concern, and hence fault-tolerant routing algorithms are important.


<h2>Fault Models</h2>

Faults in a network can be categorized in two types:
 
 
1.'''Transient Faults'''5: A transient fault is a temporary fault that lasts for a very short time. It can be caused, for example, by a change in the output of a flip-flop that leads to the generation of an invalid header. These faults can be minimized using error control coding, and they are generally evaluated in terms of Bit Error Rate.
 
 
2.'''Permanent Faults'''5: A permanent fault is a fault that does not go away and causes permanent damage to the network. It could be due to damaged wires and associated circuitry. These faults are generally evaluated in terms of Mean Time Between Failures.


<h2>Fault Tolerance Mechanisms (for permanent faults)</h2>

The permanent faults can be handled using one of the two mechanisms:
 
 
1.'''Static Mechanism''': In the static fault tolerance model, once a fault is detected, all processes running in the system are stopped and the routing tables are emptied. The routing tables are then recalculated from the fault information to provide fault-free paths.
 
 
2.'''Dynamic Mechanisms''': In the dynamic fault tolerance model, the processes in the network are not completely stalled, and only the affected regions are repaired. Some of the methods to do this are:
 
 
a.'''Block Faults''': In this method, many of the healthy nodes in the vicinity of the faulty nodes are also marked as faulty, so that no routes are created close to the actual faulty nodes. The shape of the region can be convex or non-convex, and care is taken that none of the new routes introduces a cyclic dependency in the channel dependency graph (CDG).
 
 
DISADVANTAGE: This method causes a lot of healthy nodes to be declared faulty, which reduces system capacity.
 
 
[[Image:Fault_pic1.jpg]]
 
 
b. '''Fault Rings''': This method was introduced by Chalasani and Boppana. A fault ring is a set of nodes and links that are adjacent to faulty nodes or links. This approach reduces the number of healthy nodes that must be marked as faulty and blocked.
 
 
[[Image:Fault_pic2.jpg]]
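
As a rough illustration of the static mechanism, the sketch below recomputes a route on a 2-D mesh after some nodes have been marked faulty. It is only a sketch under assumptions that do not come from the cited sources: the mesh size, the (x, y) coordinate scheme, and the function name <code>reroute</code> are hypothetical, and a breadth-first search stands in for whatever table-recalculation method a real system would use. It does not check the new routes for deadlock freedom.

```python
from collections import deque

def reroute(width, height, faulty, src, dst):
    """Return a shortest path of (x, y) nodes from src to dst that avoids
    every node in `faulty`, or None if no such path exists."""
    if src in faulty or dst in faulty:
        return None
    prev = {src: None}          # predecessor of each visited node
    queue = deque([src])
    while queue:
        x, y = queue.popleft()
        if (x, y) == dst:
            # Walk the predecessor chain back to the source.
            path, node = [], dst
            while node is not None:
                path.append(node)
                node = prev[node]
            return path[::-1]
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            nx, ny = nxt
            inside = 0 <= nx < width and 0 <= ny < height
            if inside and nxt not in faulty and nxt not in prev:
                prev[nxt] = (x, y)
                queue.append(nxt)
    return None

# A 3 x 3 mesh in which the node between source and destination has failed.
print(reroute(3, 3, {(1, 0)}, (0, 0), (2, 0)))
```

In this example the direct two-hop route is blocked, so the recomputed route detours through the next row and takes four hops.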
 


 

<h1>References</h1>


1 Solihin text
 
2 [http://www.ssrc.ucsc.edu/Papers/hospodor-mss04.pdf Interconnection Architectures for Petabyte-Scale High-Performance Storage Systems]
 
3 [http://www.diit.unict.it/~vcatania/COURSES/semm_05-06/DOWNLOAD/noc_routing02.pdf The Odd-Even Turn Model for Adaptive Routing]
 
4 [http://www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf Interconnection Topologies:(Historical Trends and Comparisons)]
This link is down at this time. [http://webcache.googleusercontent.com/search?q=cache:2f2KNFWJCsQJ:www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf+history+of+interconnection+topologies&cd=9&hl=en&ct=clnk&gl=us&client=ubuntu&source=www.google.com Google Cache]
 
5 [http://dspace.upv.es/xmlui/bitstream/handle/10251/2603/tesisUPV2824.pdf?sequence=1 Efficient mechanisms to provide fault tolerance in interconnection networks for PC clusters, José Miguel Montañana Aliaga.]
 
6 [http://web.ebscohost.com.www.lib.ncsu.edu:2048/ehost/pdfviewer/pdfviewer?vid=2&hid=15&sid=72e3828d-3cb1-42b9-8198-5c1e974ea53f@sessionmgr4 Adaptive Fault Tolerant Routing Algorithm for Tree-Hypercube Multicomputer, Qatawneh Mohammad]
 
7 [http://www.top500.org TOP500 Supercomputing Sites]
 
8 [http://www.redbooks.ibm.com/abstracts/sg245161.html?Open Understanding and Using the SP Switch]
 
9 [http://www.myri.com/myrinet/overview/ Myrinet Overview]
 
10 [http://en.wikipedia.org/wiki/QsNet QsNet (Quadrics' network)]
 
11 [http://www.google.com/research/pubs/pub35155.html Dragonfly Topology]
 
12 [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.95.573&rep=rep1&type=pdf Flattened Butterfly Topology]

ECE506 Main Page

2011-04-18T06:37:00Z

Asbransc:

This page serves as a portal for all wiki material related to CSC506 and ECE506. Link to any new wiki pages from this page, and add links to any current pages.

=Supplements to Solihin Text=

Post links to the textbook supplements in this section.
*Chapter 2 [[CSC/ECE 506 Spring 2011/ch2 dm | CSC/ECE 506 Spring 2011/ch2 dm]]
*Chapter 2 [[Parallel_Programming_Models | Parallel Programming Models]]
*Chapter 2 (Still being revised) [[CSC/ECE 506 Spring 2011/ch2 cl | CSC/ECE 506 Spring 2011/ch2 cl]]
*Chapter 2a [[ CSC/ECE 506 Spring 2011/ch2a mc | Current Data-Parallel Architectures ]]
*Chapter 3 (Final Revision) [[ CSC/ECE 506 Spring 2011/ch3 ab | Parallel Architecture Mechanisms and Programming Models ]]
*Chapter 4a [[ CSC/ECE 506 Spring 2011/ch4a ob | Parallelization of Nelder Mead Algorithm ]]
*Chapter 4a (Under Construction) [[ CSC/ECE_506_Spring_2011/ch4a_bm | Parallelization of Algorithms ]]
*Chapter 4a [[ CSC/ECE 506 Spring 2011/ch4a zz | CSC/ECE 506 Spring 2011/ch4a zz ]]
*Chapter 6a (Under Construction) [[ CSC/ECE 506 Spring 2011/ch6a jp | CSC/ECE 506 Spring 2011/ch6a jp ]]
*Chapter 6a (Under Construction) [[ CSC/ECE 506 Spring 2011/ch6a ep | CSC/ECE 506 Spring 2011/ch6a ep ]]
*Chapter 6b (Ready for First Review) [[CSC/ECE 506 Spring 2011/ch6b ab | CSC/ECE 506 Spring 2011/ch6b ab]]
*Chapter 7 (Under Construction) [[CSC/ECE 506 Spring 2011/ch7 jp | CSC/ECE 506 Spring 2011/ch7 jp]]
*Chapter 8 [[CSC/ECE 506 Spring 2011/ch8 mc | CSC/ECE 506 Spring 2011/ch8 mc]]
*Chapter 10 (Under Construction) [[CSC/ECE 506 Spring 2011/ch10 sb | CSC/ECE 506 Spring 2011/ch10 sb]]
*Chapter 10a [[CSC/ECE_506_Spring_2011/ch10a_dc | CSC/ECE_506_Spring_2011/ch10a_dc]]
*Chapter 11 [[CSC/ECE_506_Spring_2011/ch11_BB_EP | Chapter 11 Supplement]]
*Chapter 11 [[Scalable_Coherent_Interface | SCI (Scalable Coherent Interface) ]]
*Chapter 12 (under construction) [[ CSC/ECE 506 Spring 2011/ch12 ob | Interconnection Network Topologies and Routing Algorithms]]
*Chapter 12 (under construction) [[ CSC/ECE 506 Spring 2011/ch12 aj | Interconnection Network Topologies and Routing Algorithms]]

CSC/ECE 506 Spring 2011/ch12 aj

2011-04-18T06:35:45Z

Asbransc:

<h1>Interconnection Network Architecture </h1>


In a multi-processor system, processors need to communicate with each other and access each other's resources. In order to route data and messages between processors, an interconnection architecture is needed.


 


Typically, in a multiprocessor system, messages passed between processors are frequent and short 1. Therefore, the interconnection network architecture must handle messages quickly by having '''low latency''', and must handle several messages at a time by having '''high bandwidth'''.


 


In a network, a processor along with its cache and memory is considered a '''node'''. The physical wires that connect nodes are called '''links'''. The device that routes messages between nodes is called a '''router'''. The shape of the network, including the number and arrangement of links and routers, is called the network '''topology'''.


<h1>History of Network Topologies</h1>

Hypercube topologies were invented in the 1980s and had desirable characteristics when the number of nodes was small (~1000 maximum, often <100) and every processor had to stop working to receive and forward a message 4. The low-radix era began in 1985 and was defined by routers with between 4 and 8 ports, using toroidal, mesh or fat-tree topologies and wormhole routing. This era lasted about 20 years, until it was determined that routers with dozens of ports offered superior performance. Two topologies were developed to take advantage of the newly developed high-radix routers: the flattened butterfly and the dragonfly. Both fall somewhere between a mesh in which each point is a router (or a virtual router, in the case of dragonfly) with dozens or hundreds of nodes attached, and a fat tree with arity high enough to need only two levels.

<h2>Interconnection evolution in the Top500 List</h2>
[[Image:Top500interconnect.png]]

This chart shows how the different interconnect topologies have evolved over time, measured by their dominance in the Top500 list of supercomputers 7. As one can see, many technologies came into vogue briefly before losing performance share and disappearing. In the early days of the list, most of the computers list the interconnect type as not applicable. However, the trailing end of the hypercube phase is clear in burnt orange. The dark blue at the top is "other" and the dark red in the middle is "proprietary", so we can only speculate about what topologies they might employ. The toroidal mesh appears briefly at the start in a cream color, and slightly outlasts the hypercube. The two crossbar technologies (blue and olive) followed the toroidal mesh. The fully-distributed crossbar died out quickly; the multi-stage crossbar lasted longer but was never dominant. The 3-D torus (purple) dominates much of the 1990s, with hypercube topologies (dark pink) enjoying a short comeback in the later part of the decade. SP Switch (light olive), an IBM interconnect technology that uses a multi-stage crossbar switch, replaced the 3-D torus 8. Myrinet, Quadrics, and Federation all shared the spotlight in the mid-2000s; each used a similar fat-tree topology 9,10. The current class of supercomputers is dominated by nodes connected with either InfiniBand or Gigabit Ethernet. Both can be connected in either a fat-tree or 2-D mesh topology. The primary difference between them is speed. InfiniBand is considerably faster per link and allows links to be ganged into groups of 4 or 12. Gigabit Ethernet is vastly less expensive, however, and some supercomputer designers have apparently chosen to save money on the interconnect technology in order to allow the use of faster nodes 11,12.

<h1>Types of Network Topologies </h1>

<h2>Linear Array</h2>

[[Image:Top_linear.jpg]]
 
 
The nodes are connected linearly, as in an array. This type of topology is simple, but it does not scale well. The longest distance between two nodes, or the '''diameter''', grows linearly with the number of nodes: a linear array of p nodes has a diameter of p - 1 hops.

 

<h2>Ring</h2>

[[Image:Top_ring.jpg]]
 
 
The ring has the same structure as the linear array, except that the two end nodes are connected to each other, forming a circle. This cuts the longest distance between two nodes roughly in half.
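
The diameters of the linear array and the ring can be checked with a short script. This is a minimal sketch, not taken from the textbook: distances are counted in hops (links traversed), and the helper names <code>linear_array</code>, <code>ring</code>, and <code>diameter</code> are made up for the illustration.

```python
from collections import deque

def linear_array(p):
    """Adjacency lists for p nodes connected in a line."""
    return {i: [j for j in (i - 1, i + 1) if 0 <= j < p] for i in range(p)}

def ring(p):
    """Adjacency lists for p nodes connected in a circle."""
    return {i: sorted({(i - 1) % p, (i + 1) % p}) for i in range(p)}

def diameter(adj):
    """Longest shortest-path distance, in hops, between any two nodes."""
    best = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:                      # breadth-first search from src
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        best = max(best, max(dist.values()))
    return best

for p in (8, 16, 64):
    print(p, diameter(linear_array(p)), diameter(ring(p)))
```

For p nodes the script reports p - 1 hops for the linear array and p // 2 hops for the ring.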

 

<h2>2-D Mesh</h2>

[[Image:Top_2Dmesh.jpg]]
 
 
The 2-D mesh can be thought of as several linear arrays put together to form a 2-dimensional structure. Nodes that are not on the edge have links to 4 neighboring nodes, that is, a '''degree''' of 4.

 

<h2>2-D Torus</h2>

[[Image:Top_2Dtorus.jpg]]
 
 
The 2-D torus takes the structure of the 2-D mesh and connects the nodes on opposite edges to each other. This decreases the diameter, but increases the number of links.
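
The trade-off between the 2-D mesh and the 2-D torus can be made concrete with closed-form counts for a square k x k network. The sketch below assumes bidirectional links counted once each and k of at least 3 for the torus; the function names are hypothetical and the formulas are standard results rather than figures from the sources cited here.

```python
def mesh_stats(k):
    """Diameter (hops) and link count of a k x k 2-D mesh."""
    return {"diameter": 2 * (k - 1), "links": 2 * k * (k - 1)}

def torus_stats(k):
    """Diameter (hops) and link count of a k x k 2-D torus (k >= 3)."""
    return {"diameter": 2 * (k // 2), "links": 2 * k * k}

for k in (4, 8, 16):
    print(k, mesh_stats(k), torus_stats(k))
```

For k = 4 the mesh has a diameter of 6 hops with 24 links, while the torus has a diameter of 4 hops with 32 links.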

 

<h2>Cube</h2>

[[Image:Top_cube.jpg]]
 
 
The cube can be thought of as a three-dimensional mesh.

 

<h2>Hypercube</h2>

[[Image:Top_hypercube.jpg]]
 
 
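The hypercube generalizes the cube to more dimensions: a hypercube of dimension d + 1 is built from two d-dimensional hypercubes by connecting their corresponding nodes.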
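
One common way to describe a hypercube in code is to give each node a binary label and link nodes whose labels differ in exactly one bit. The sketch below follows that convention; the labeling and the function name <code>hypercube_neighbors</code> are assumptions made for illustration and are not taken from the figure above.

```python
def hypercube_neighbors(node, dims):
    """Labels of the nodes linked to `node` in a dims-dimensional hypercube.

    Flipping bit d of the label gives the neighbor along dimension d.
    """
    return [node ^ (1 << d) for d in range(dims)]

# A 3-dimensional hypercube is the ordinary cube with 8 nodes of degree 3.
for node in range(8):
    print(format(node, "03b"), [format(n, "03b") for n in hypercube_neighbors(node, 3)])
```

Each node in a d-dimensional hypercube therefore has degree d, and the network has 2**d nodes.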

 

<h2>Tree</h2>

[[Image:Top_tree.jpg]]
 
 
The tree is a hierarchical structure with nodes at the bottom and switching nodes at the upper levels. The tree experiences high traffic at the upper levels.

 

<h2>Fat Tree</h2>

[[Image:Top_fat_tree.jpg]]
 
 
The fat tree alleviates the traffic at upper levels by "fattening" up the links at the upper levels.
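
To see what "fattening" means numerically, the sketch below models an idealized binary fat tree in which the bandwidth of each link doubles at every level toward the root, so the aggregate bandwidth stays the same at every level. This doubling rule, the power-of-two leaf count, and the function name <code>fat_tree_levels</code> are assumptions for the illustration; real fat trees often taper the bandwidth instead.

```python
def fat_tree_levels(leaves, leaf_bw=1.0):
    """List (uplinks, bandwidth per link, aggregate bandwidth) per level,
    starting at the leaves of an idealized binary fat tree.

    `leaves` is assumed to be a power of two.
    """
    levels = []
    links, bw = leaves, leaf_bw
    while links >= 2:
        levels.append((links, bw, links * bw))
        links //= 2      # half as many links one level up
        bw *= 2          # each of them twice as "fat"
    return levels

for level in fat_tree_levels(8):
    print(level)
```

With 8 leaves the three levels carry 8 links of bandwidth 1, 4 links of bandwidth 2, and 2 links of bandwidth 4, for an aggregate of 8 at every level.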

 

<h2>Butterfly</h2>

[[Image:Top_butterfly.jpg]]
 
 
The butterfly structure is similar to the tree structure, but it replicates the switching node structure of the tree topology and connects the replicas together so that there are equal numbers of links and routers at all levels.

 

<h1>Real-World Implementation of Network Topologies </h1>


In a research study, Andy Hospodor and Ethan Miller investigated several network topologies for a high-performance, high-traffic network 2. The topologies included fat tree, butterfly, mesh, torus, and hypercube structures. The study discussed their advantages and disadvantages, including cost, performance, and reliability.


 


In this experiment, a petabyte-scale network with over 100 GB/s of total aggregate bandwidth was investigated. The network consisted of 4096 disks and large servers, with routers and switches in between 2.


 


The overall structure of the network is shown below. Note that this structure is very susceptible to failure and congestion.
 
[[Image:Disknet_network.jpg]]
 
''Basic structure of Hospodor and Miller's experimental network''2


 

<h2>Fat Tree</h2>

In large-scale, high-performance applications, the fat tree can be a choice. However, in order to "fatten" the links, redundant connections must be used: instead of one link between switching nodes, several are needed. The problem is that more input and output links require routers with more input and output ports. Routers with in excess of 100 ports are difficult to build and expensive, so multiple routers would have to be stacked together. Even then, the routers would be expensive, and several of them would be required 2.
 
 
The Japan Agency for Marine-Earth Science and Technology supercomputing system uses the fat tree topology. The system connects 1280 NEC processors 7.

 

<h2>Butterfly</h2>

In high-performance applications, the butterfly structure is a good choice. The butterfly topology uses fewer links than other topologies; however, each link carries traffic from the entire layer. Fault tolerance is poor: there is only a single path between any pair of nodes, so if a link breaks, data cannot be re-routed and communication is broken 2.
 
[[Image:Disknet_butterfly.jpg]]
 
''Butterfly structure''2

 

<h2>Meshes and Tori</h2>

The mesh and torus structures used in this application would require a large number of links and an aggregate of several thousand ports. However, because there are so many links, the mesh and torus structures provide alternate paths in case of failures 2.
 
 
One example of current use of the torus structure is the QPACE SFB TR Cluster in Germany, which uses PowerXCell 8i processors. The system uses a 3-D torus topology with 4608 processors 7.
 
 
[[Image:Disknet_mesh.jpg]]
 
''Mesh structure''2
 
 
[[Image:Disknet_torus.jpg]]
 
''Torus structure''2

 

<h2>Hypercube</h2>

Similar to the torus structures, the hypercube requires a larger number of links. However, its bandwidth scales better than that of the mesh and torus structures.
 
 
The CRAY T3E, CRAY XT3, and SGI Origin 2000 use k-ary n-cube topologies.
 
 
[[Image:Disknet_hypercube.jpg]]
 
''Hypercube structure''2

 

<h1>Comparison of Network Topologies </h1>


The following table shows the total number of ports required for each network topology.
 
 
[[Image:Disknet_ports.jpg]]
 
''Number of ports for each topology''2

 

As the figure above shows, the 6-D hypercube requires the largest number of ports, due to its relatively complex six-dimensional structure. In contrast, the fat tree requires the fewest ports, even though its links have been "fattened" by using redundant links. The butterfly network requires more than twice as many ports as the fat tree, since it essentially replicates the switching layer of the fat tree. The number of ports for the mesh and torus structures increases as the dimensionality increases. However, with modern router technology, the number of ports is a less important consideration.

 
 
The figure below shows the average path length (the average number of hops) and the average link load (GB/s).
 
 
[[Image:Disknet_load.jpg]]
 
''Average path length and link load for each topology''2

 

Looking at the trends, when the average path length is high, the average link load is also high. In other words, average path length and average link load are roughly proportional. It is clear from the graph that the 2-D mesh has, by far, the worst performance. In a large network such as this, its average path length is simply too high, and the average link load suffers. For this type of high-performance network, the 2-D mesh does not scale well. The 2-D torus cuts the average path length and average link load in half by connecting the edge nodes together; however, its performance is still relatively poor compared to the other types. The butterfly and fat tree have the lowest average path length and average link load.

 
 
The figure below shows the cost of the network topologies.
 
 
[[Image:Disknet_cost.jpg]]
 
''Cost of each topology''2

 

Despite using the fewest ports, the fat tree topology has the highest cost, by far. The ports it does use are high-bandwidth ports of 10 GB/s, and over 2400 of them are required to provide enough bandwidth at the upper levels of the tree. This pushes the cost up dramatically, making the fat tree impractical from a cost standpoint. While the total cost of the fat tree is about 15 million dollars, the rest of the network topologies are clustered below 4 million dollars. As the dimensionality of the mesh and torus structures increases, the cost increases. The cost of the butterfly network falls between that of the 2-D mesh/torus and the 6-D hypercube.

 
 
When cost and average link load are both factored in, the following graph is produced.
 
 
[[Image:Disknet_overall.jpg]]
 
''Overall cost of each topology''2

 

From the figure above, the 6-D hypercube is the most cost-effective choice for this particular network setup. Although the 6-D hypercube costs more because it needs more links and ports, it provides higher bandwidth, which can offset the higher cost. The high-dimensional tori also perform well, but cannot provide as much bandwidth as the 6-D hypercube. For systems that do not need as much bandwidth, a high-dimensional torus is also a good choice. The butterfly topology is another alternative, but has lower fault tolerance.

However, when the number of nodes increases, the relative cost of the higher-dimensional topologies increases far faster than their relative performance when compared to a 2-D mesh. This is because the 2-D mesh only uses low-cost, short links. The higher-dimensional structures must be projected onto our 3-dimensional world, and thus require many long, expensive links that wrap around the outside of the system like an impenetrable tangle of jungle vines. Maintaining such a network is also quite slow and tedious.


<h1>Routing</h1>


The '''routing''' algorithm determines what path a packet of data will take from source to destination. Routing can be '''deterministic''', where the path is always the same for a given source and destination, or '''adaptive''', where the path can change. A routing algorithm can also be '''partially adaptive''', where packets have multiple choices, but not all packets are allowed to use any shortest path 3.


<h2>Deadlock</h2>

Packets are in '''deadlock''' when they cannot continue to move through the nodes. The illustration below demonstrates this event.
 
 
[[Image:Routing_deadlock.jpg]]
 
''Example of deadlock''
 
 

Assume that all of the buffers at each node are full. The packet from Node 1 cannot continue to Node 2, the packet from Node 2 cannot continue to Node 3, and so on. Since no packet can move, the packets are deadlocked.

 
 

Deadlock arises from a cyclic pattern of routing. To avoid deadlock, circular routing patterns must be avoided.
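
Whether a set of routes can deadlock in this way can be checked by looking for a cycle in a graph of "waits for" dependencies between channels. The sketch below is a generic depth-first cycle check; the dictionary representation, the channel names, and the function name <code>has_cycle</code> are hypothetical and only illustrate the idea.

```python
def has_cycle(deps):
    """Return True if the dependency graph `deps` contains a cycle.

    `deps` maps each channel to the channels it may have to wait for.
    """
    WHITE, GREY, BLACK = 0, 1, 2      # unvisited, on current path, finished
    color = {}

    def visit(u):
        color[u] = GREY
        for v in deps.get(u, ()):
            state = color.get(v, WHITE)
            if state == GREY:          # reached a channel already on the path
                return True
            if state == WHITE and visit(v):
                return True
        color[u] = BLACK
        return False

    return any(color.get(u, WHITE) == WHITE and visit(u) for u in deps)

# Four channels waiting on each other in a circle, as in the figure above.
circular = {"c1": ["c2"], "c2": ["c3"], "c3": ["c4"], "c4": ["c1"]}
# The same channels with one dependency removed.
broken = {"c1": ["c2"], "c2": ["c3"], "c3": ["c4"], "c4": []}
print(has_cycle(circular), has_cycle(broken))
```

Removing a single dependency from the circle is enough to make the check report no cycle, which is the intuition behind prohibiting selected turns.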

 
 

To avoid circular patterns of routing, some routing patterns are disallowed. These are called '''turn restrictions''', where some turns are not allowed in order to avoid making a circular routing pattern. Some of these turn restrictions are mentioned below.


 
<h2>Dimensional ordered (X-Y) routing</h2>

Turns from the y-dimension to the x-dimension are not allowed.
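
A minimal sketch of dimension-ordered routing on a 2-D mesh follows. It assumes nodes are addressed by (x, y) coordinates, and the function name <code>xy_route</code> is made up for the example. The packet finishes all of its movement in x before it moves in y, so it never turns from the y-dimension back to the x-dimension.

```python
def xy_route(src, dst):
    """Return the list of (x, y) nodes visited from src to dst,
    travelling along x first and then along y."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:                 # first correct the x coordinate
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:                 # then correct the y coordinate
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path

print(xy_route((0, 0), (2, 1)))
print(xy_route((3, 2), (1, 0)))
```

Because the path depends only on the source and destination, this is a deterministic routing algorithm.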


<h2>West First</h2>

Turns to the west are not allowed, so a packet that needs to travel west must do so first.


<h2>North Last</h2>

Turns away from the north direction are not allowed, so a packet that needs to travel north must do so last.


<h2>Negative First</h2>

Turns from a positive direction (+x or +y) to a negative direction (-x or -y) are not allowed, so a packet must make all of its hops in the negative directions first.
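
The four turn restrictions above can be written down as sets of prohibited turns. The sketch below uses the usual statement of each model; the encoding is an assumption made for illustration: a turn is a pair (current heading, new heading), east and north are taken as the positive x and y directions, and 180-degree reversals are not modeled.

```python
# Each turn is (heading before the turn, heading after the turn).
PROHIBITED_TURNS = {
    # No turn from the y-dimension to the x-dimension.
    "xy": {("N", "E"), ("N", "W"), ("S", "E"), ("S", "W")},
    # No turn to the west.
    "west-first": {("N", "W"), ("S", "W")},
    # No turn away from north.
    "north-last": {("N", "E"), ("N", "W")},
    # No turn from a positive direction to a negative direction.
    "negative-first": {("N", "W"), ("E", "S")},
}

def turn_allowed(model, heading, new_heading):
    """Return True if the given turn is permitted under `model`."""
    return (heading, new_heading) not in PROHIBITED_TURNS[model]

for model in PROHIBITED_TURNS:
    print(model, sorted(PROHIBITED_TURNS[model]))
```

Dimension-ordered routing prohibits four of the eight possible 90-degree turns, while the other three models prohibit only two each, which is why they leave some room for adaptive routing.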


<h2>Odd-Even Turn Model</h2>


Unfortunately, the turn-restriction models above are only partially adaptive, and their degree of adaptiveness is limited. The models cause some packets to take different routes, not necessarily minimal paths. This can cause unfairness and reduces the ability of the system to relieve congestion, so overall performance could suffer 3.


 


Ge-Ming Chiu introduced the Odd-Even turn model as an adaptive, deadlock-free turn-restriction model that performs better than the previously mentioned models 3. The model is designed primarily for 2-D meshes.

 

''Turns from the east to north direction from any node on an even column are not allowed.''
 
''Turns from the north to west direction from any node on an odd column are not allowed.''
 
 
''Turns from the east to south direction from any node on an even column are not allowed.''
 
''Turns from the south to west direction from any node on an odd column are not allowed.''
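
The four rules above translate directly into a small check. In this sketch a turn is described by the heading before and after the turn, and columns are assumed to be numbered from 0 along the x-axis; the function name <code>odd_even_turn_allowed</code> and this numbering are illustrative assumptions rather than part of Chiu's paper.

```python
def odd_even_turn_allowed(column, heading, new_heading):
    """Return True if a packet at a node in `column` may turn from
    `heading` to `new_heading` under the odd-even rules listed above."""
    even = column % 2 == 0
    # East-to-north and east-to-south turns are banned in even columns.
    if even and heading == "E" and new_heading in ("N", "S"):
        return False
    # North-to-west and south-to-west turns are banned in odd columns.
    if not even and heading in ("N", "S") and new_heading == "W":
        return False
    return True

for column in (2, 3):
    for turn in (("E", "N"), ("E", "S"), ("N", "W"), ("S", "W")):
        print(column, turn, odd_even_turn_allowed(column, *turn))
```

Unlike the earlier models, every turn remains available somewhere in the mesh; which turns are prohibited depends on the column the packet is in.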
 
 
 
The illustration below shows allowed routing for different source and destination nodes. Depending on which column the packet is in, only certain directions are allowed.
 
 
[[Image:Routing_odd_even.jpg]]
 
''Odd-Even turn restriction model proposed by Ge-Ming Chiu''3
 

 

<h1>Comparison of Turn Restriction Models</h1>

To compare the performance of various turn restriction models, Chiu simulated a 15 x 15 mesh under various traffic patterns. Each channel has a bandwidth of 20 flits/usec and a buffer size of one flit. The dimension-ordered x-y routing, west-first, and negative-first models were compared against the odd-even model.

 

Simulations were conducted with uniform, transpose, and hot spot traffic patterns. Uniform simulates each node sending messages to any other node with equal probability. Transpose simulates two opposite nodes sending messages to their respective halves of the mesh. Hot spot simulates a few "hot spot" nodes that receive high traffic.



 
 
[[Image:Routing_uniform.jpg]]
 
''Uniform traffic simulation of various turn restriction models''3
 
 
The performance of the different routing algorithms is shown above for the uniform traffic. For uniform traffic, the dimension-ordered x-y model outperforms the rest of the models. As the number of messages increases, the x-y model has the "slowest" increase in average communication latency.



 
 
[[Image:Routing_transpose.jpg]]
 
''First transpose traffic simulation of various turn restriction models''3
 
 
The performance of the different routing algorithms is shown above for the first transpose traffic. The negative-first model has the best performance, while the odd-even model performs better than the west-first and x-y models.



 
 
[[Image:Routing_transpose2.jpg]]
 
''Second transpose traffic simulation of various turn restriction models''3
 
 
With the second transpose simulation, the odd-even model outperforms the rest.



 
 
[[Image:Routing_hotspot.jpg]]
 
''Hotspot traffic simulation of various turn restriction models''3
 
 
The performance of the different routing algorithms is shown above for the hotspot traffic. Only one hotspot was simulated for this test. The odd-even model outperforms the other models when hotspot traffic is 10%.



 
 
[[Image:Routing_hotspot2.jpg]]
 
''Second hotspot traffic simulation of various turn restriction models''3
 
 
When the number of hotspots is increased to five, the odd-even model begins to shine. Its latency is the lowest for both 6 and 8 percent hotspot traffic. Meanwhile, the x-y model performs very poorly.


 
 

While the x-y model performs well in uniform traffic, it lacks adaptiveness. When traffic concentrates on hotspots, the x-y model suffers from its inability to adapt and re-route traffic around the congestion that the hotspots cause. The odd-even model has superior adaptiveness under high congestion.


 

<h1>Router Architecture</h1>

The '''router''' is a device that routes incoming data to its destination. It does this by having several input ports and several output ports. Data arriving on one of the input ports is routed to one of the output ports. Which output port is chosen depends on the destination of the data and on the routing algorithm.

 
 

The internal architecture of a router consists of input and output ports and a '''crossbar switch'''. The crossbar switch connects each input to the output that has been selected for it, acting essentially as a set of multiplexers.
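
A crossbar can be modeled very roughly as a mapping from input ports to requested output ports, where each output can serve only one input at a time. The sketch below is an illustrative assumption rather than a description of any particular router: the fixed-priority arbitration (lowest-numbered input wins) and the function name <code>crossbar_switch</code> are invented for the example.

```python
def crossbar_switch(requests):
    """Connect inputs to outputs for one cycle.

    `requests` maps input port -> requested output port. Returns the
    granted connections and the list of inputs that must wait because
    their output is already in use.
    """
    granted = {}
    blocked = []
    taken = set()
    for inp in sorted(requests):       # lowest-numbered input has priority
        out = requests[inp]
        if out in taken:
            blocked.append(inp)
        else:
            taken.add(out)
            granted[inp] = out
    return granted, blocked

# Inputs 0 and 1 both want output 2; input 2 wants output 0.
print(crossbar_switch({0: 2, 1: 2, 2: 0}))
```

Inputs that request different outputs are connected in the same cycle, while inputs that contend for the same output are serialized.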

 
 

Router technology has improved significantly over the years. This has allowed networks with high dimensionality to become feasible. As shown in the real-world example above, high-dimensional tori and hypercubes are excellent choices of topology for high-performance networks. The cost of high-performance, high-radix routers has contributed to the viability of these types of high-dimensionality networks. As the graph below shows, the bandwidth of routers has improved tremendously over a period of 10 years 4.


 
[[Image:Router_bandwidth.jpg]]
 
''Bandwidth of various routers over 10 year period''4

 

Looking at the physical architecture and layout of routers, it is evident that the circuitry has become dramatically denser and more complex.


 
[[Image:Router_physical.jpg]]
 
''Router hardware over period of time''4
 

 
[[Image:Router_radix.jpg]]
 
''Radix and latency of routers over 10 year period''4
 
 

The '''radix''', or the number of ports, of routers has also increased. Current technology not only has a high radix, but also lower latency than the previous generation. As radix increases, the latency remains steady.


 


With high-performance routers, complex topologies are possible. As router technology improves, even more complex, high-dimensionality topologies become possible.


 

<h1>Fault Tolerant Routing</h1>

Fault-tolerant routing means successfully routing messages between any pair of non-faulty nodes in the presence of faulty components 6. As the number of processors in a multiprocessor system and the data rates increase, reliable transmission of data in the event of a network fault becomes a major concern, which is why fault-tolerant routing algorithms are important.


<h2>Fault Models</h2>

Faults in a network can be categorized in two types:
 
 
1. '''Transient Faults''' 5: A transient fault is a temporary fault that lasts for a very short time. It can be caused, for example, by a change in the output of a flip-flop that leads to an invalid header. These faults can be minimized using error control coding, and are generally evaluated in terms of bit error rate.
 
 
2. '''Permanent Faults''' 5: A permanent fault does not go away and causes permanent damage to the network. It could be due to damaged wires and associated circuitry. These faults are generally evaluated in terms of mean time between failures.


<h2>Fault Tolerance Mechanisms (for permanent faults)</h2>

Permanent faults can be handled using one of two mechanisms:
 
 
1. '''Static Mechanism''': In the static fault tolerance model, once a fault is detected, all the processes running in the system are stopped and the routing tables are emptied. The routing tables are then recalculated from the fault information to provide fault-free paths.
 
 
2. '''Dynamic Mechanisms''': In the dynamic fault tolerance model, the processes in the network are not completely stalled, and only the affected regions are repaired. Some of the methods for doing this are:
 
 
a. '''Block Faults''': In this method, many of the healthy nodes in the vicinity of the faulty nodes are also marked as faulty, so that no routes are created close to the actual faulty nodes. The shape of the region can be convex or non-convex, and care is taken that none of the new routes introduces a cyclic dependency in the channel dependency graph (CDG).
 
 
DISADVANTAGE: This method causes a lot of healthy nodes to be declared faulty, which reduces system capacity.
 
 
[[Image:Fault_pic1.jpg]]
 
 
b. '''Fault Rings''': This method was introduced by Chalasani and Boppana. A fault ring is a set of nodes and links that are adjacent to faulty nodes or links. This approach reduces the number of healthy nodes that must be marked as faulty and blocked.
 
 
[[Image:Fault_pic2.jpg]]
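
The cost of the block fault method described above can be illustrated with a small sketch that grows a set of faulty nodes on a 2-D mesh into a rectangular (convex) region. The single bounding rectangle, the (x, y) coordinates, and the function name <code>block_fault_region</code> are simplifying assumptions made for this example; real schemes may form several smaller or non-convex regions.

```python
def block_fault_region(faulty):
    """Return every (x, y) node inside the smallest rectangle that
    contains all nodes in the non-empty set `faulty`."""
    xs = [x for x, _ in faulty]
    ys = [y for _, y in faulty]
    return {
        (x, y)
        for x in range(min(xs), max(xs) + 1)
        for y in range(min(ys), max(ys) + 1)
    }

faulty = {(1, 1), (3, 2)}
region = block_fault_region(faulty)
sacrificed = region - faulty           # healthy nodes disabled along with the faults
print(len(region), sorted(sacrificed))
```

Here two faulty nodes produce a six-node region, so four healthy nodes are taken out of service. Fault rings aim to avoid this kind of loss.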
 


 

<h1>References</h1>


1 Solihin text
 
2 [http://www.ssrc.ucsc.edu/Papers/hospodor-mss04.pdf Interconnection Architectures for Petabyte-Scale High-Performance Storage Systems]
 
3 [http://www.diit.unict.it/~vcatania/COURSES/semm_05-06/DOWNLOAD/noc_routing02.pdf The Odd-Even Turn Model for Adaptive Routing]
 
4 [http://www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf Interconnection Topologies:(Historical Trends and Comparisons)]
This link is down at this time. [http://webcache.googleusercontent.com/search?q=cache:2f2KNFWJCsQJ:www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf+history+of+interconnection+topologies&cd=9&hl=en&ct=clnk&gl=us&client=ubuntu&source=www.google.com Google Cache]
 
5 [http://dspace.upv.es/xmlui/bitstream/handle/10251/2603/tesisUPV2824.pdf?sequence=1 Efficient mechanisms to provide fault tolerance in interconnection networks for PC clusters, José Miguel Montañana Aliaga.]
 
6 [http://web.ebscohost.com.www.lib.ncsu.edu:2048/ehost/pdfviewer/pdfviewer?vid=2&hid=15&sid=72e3828d-3cb1-42b9-8198-5c1e974ea53f@sessionmgr4 Adaptive Fault Tolerant Routing Algorithm for Tree-Hypercube Multicomputer, Qatawneh Mohammad]
 
7 [http://www.top500.org TOP500 Supercomputing Sites]
 
8 [http://www.redbooks.ibm.com/abstracts/sg245161.html?Open Understanding and Using the SP Switch]
 
9 [http://www.myri.com/myrinet/overview/ Myrinet Overview]
 
10 [http://en.wikipedia.org/wiki/QsNet QsNet (Quadrics' network)]
 
11 [http://www.google.com/research/pubs/pub35155.html Dragonfly Topology]
 
12 [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.95.573&rep=rep1&type=pdf Flattened Butterfly Topology]

CSC/ECE 506 Spring 2011/ch12 aj

2011-04-18T06:32:54Z

Asbransc:

<h1>Interconnection Network Architecture </h1>


In a multi-processor system, processors need to communicate with each other and access each other's resources. In order to route data and messages between processors, an interconnection architecture is needed.


 


Typically, in a multiprocessor system, messages passed between processors are frequent and short 1. Therefore, the interconnection network architecture must handle messages quickly by having '''low latency''', and must handle several messages at a time by having '''high bandwidth'''.


 


In a network, a processor along with its cache and memory is considered a '''node'''. The physical wires that connect nodes are called '''links'''. The device that routes messages between nodes is called a '''router'''. The shape of the network, including the number and arrangement of links and routers, is called the network '''topology'''.


<h1>History of Network Topologies</h1>

Hypercube topologies were invented in the 1980s and had desirable characteristics when the number of nodes was small (~1000 maximum, often <100) and every processor had to stop working to receive and forward a message 4. The low-radix era began in 1985 and was defined by routers with between 4 and 8 ports, using toroidal, mesh or fat-tree topologies and wormhole routing. This era lasted about 20 years, until it was determined that routers with dozens of ports offered superior performance. Two topologies were developed to take advantage of the newly developed high-radix routers: the flattened butterfly and the dragonfly. Both fall somewhere between a mesh in which each point is a router (or a virtual router, in the case of dragonfly) with dozens or hundreds of nodes attached, and a fat tree with arity high enough to need only two levels.

<h2>Interconnection evolution in the Top500 List</h2>
[[Image:Top500interconnect.png]]

This chart shows how the different interconnect topologies have evolved over time, measured by their dominance in the Top500 list of supercomputers 7. As one can see, many technologies came into vogue briefly before losing performance share and disappearing. In the early days of the list, most of the computers list the interconnect type as not applicable. However, the trailing end of the hypercube phase is clear in burnt orange. The dark blue at the top is "other" and the dark red in the middle is "proprietary", so we can only speculate about what topologies they might employ. The toroidal mesh appears briefly at the start in a cream color, and slightly outlasts the hypercube. The two crossbar technologies (blue and olive) followed the toroidal mesh. The fully-distributed crossbar died out quickly; the multi-stage crossbar lasted longer but was never dominant. The 3-D torus (purple) dominates much of the 1990s, with hypercube topologies (dark pink) enjoying a short comeback in the later part of the decade. SP Switch (light olive), an IBM interconnect technology that uses a multi-stage crossbar switch, replaced the 3-D torus 8. Myrinet, Quadrics, and Federation all shared the spotlight in the mid-2000s; each used a similar fat-tree topology 9,10. The current class of supercomputers is dominated by nodes connected with either InfiniBand or Gigabit Ethernet. Both can be connected in either a fat-tree or 2-D mesh topology. The primary difference between them is speed. InfiniBand is considerably faster per link and allows links to be ganged into groups of 4 or 12. Gigabit Ethernet is vastly less expensive, however, and some supercomputer designers have apparently chosen to save money on the interconnect technology in order to allow the use of faster nodes 11,12.

<h1>Types of Network Topologies </h1>

<h2>Linear Array</h2>

[[Image:Top_linear.jpg]]
 
 
The nodes are connected linearly, as in an array. This type of topology is simple, but it does not scale well. The longest distance between two nodes, or the '''diameter''', grows linearly with the number of nodes: a linear array of p nodes has a diameter of p - 1 hops.

 

<h2>Ring</h2>

[[Image:Top_ring.jpg]]
 
 
The ring has a similar structure to the linear array, except that the two end nodes connect to each other, establishing a circular structure. The longest distance between two nodes is cut roughly in half.

 

<h2>2-D Mesh</h2>

[[Image:Top_2Dmesh.jpg]]
 
 
The 2-D mesh can be thought of as several linear arrays put together to form a 2-dimensional structure. Nodes that are not on the edge have 4 input or output links, or a '''degree''' of 4.

 

<h2>2-D Torus</h2>

[[Image:Top_2Dtorus.jpg]]
 
 
The 2-D torus takes the structure of the 2-D mesh and connects the nodes on the edges. This decreases the diameter, but the number of links is higher.
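
The diameter and link counts of the four topologies described so far can be checked with a small script. The following is only a minimal sketch: the function names (`linear_array`, `ring`, `mesh_2d`, `torus_2d`, `diameter`, `num_links`) are hypothetical, and it assumes bidirectional links and a square k x k layout for the 2-D structures.

```python
from collections import deque
from itertools import product


def linear_array(p):
    """Adjacency lists for p nodes connected in a line."""
    return {i: [j for j in (i - 1, i + 1) if 0 <= j < p] for i in range(p)}


def ring(p):
    """Linear array whose two end nodes are also connected (assumes p > 2)."""
    return {i: [(i - 1) % p, (i + 1) % p] for i in range(p)}


def mesh_2d(k):
    """k x k mesh: each node links to its north/south/east/west neighbors."""
    adj = {}
    for x, y in product(range(k), repeat=2):
        candidates = [(x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)]
        adj[(x, y)] = [(a, b) for a, b in candidates if 0 <= a < k and 0 <= b < k]
    return adj


def torus_2d(k):
    """k x k mesh with wrap-around links on the edges (assumes k > 2)."""
    adj = {}
    for x, y in product(range(k), repeat=2):
        adj[(x, y)] = [((x - 1) % k, y), ((x + 1) % k, y),
                       (x, (y - 1) % k), (x, (y + 1) % k)]
    return adj


def diameter(adj):
    """Longest shortest-path distance in hops, found by BFS from every node."""
    longest = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        longest = max(longest, max(dist.values()))
    return longest


def num_links(adj):
    """Each bidirectional link appears in two adjacency lists."""
    return sum(len(neighbors) for neighbors in adj.values()) // 2
```

With 8 nodes, the linear array has a diameter of 7 hops and the ring 4 hops. A 4 x 4 mesh has a diameter of 6 with 24 links, while the 4 x 4 torus has a diameter of 4 with 32 links, which illustrates the trade-off between diameter and link count.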

 

<h2>Cube</h2>

[[Image:Top_cube.jpg]]
 
 
The cube can be thought of as a three-dimensional mesh.

 

<h2>Hypercube</h2>

[[Image:Top_hypercube.jpg]]
 
 
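The hypercube is essentially multiple cubes put together. An n-dimensional hypercube has 2^n nodes, and each node is linked to the n nodes whose binary labels differ from its own in exactly one bit.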
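
A hypercube is conventionally described by giving each node an n-bit binary label and linking two nodes when their labels differ in exactly one bit. The sketch below assumes that labeling; the function names are hypothetical and chosen only for illustration.

```python
def hypercube_neighbors(node, n):
    """Neighbors of `node` in an n-dimensional hypercube: flip one bit at a time."""
    return [node ^ (1 << bit) for bit in range(n)]


def hypercube_distance(a, b):
    """Minimum hop count between two nodes: the number of differing label bits."""
    return bin(a ^ b).count("1")
```

Under this labeling, node 000 of a 3-D cube is adjacent to 001, 010, and 100, and the farthest node from 000 is 111, three hops away. In general the diameter of an n-dimensional hypercube is n.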

 

<h2>Tree</h2>

[[Image:Top_tree.jpg]]
 
 
The tree is a hierarchical structure with nodes at the bottom and switching nodes at the upper levels. The tree experiences high traffic at the upper levels.

 

<h2>Fat Tree</h2>

[[Image:Top_fat_tree.jpg]]
 
 
The fat tree alleviates the traffic at upper levels by "fattening" up the links at the upper levels.

 

<h2>Butterfly</h2>

[[Image:Top_butterfly.jpg]]
 
 
The butterfly structure is similar to the tree structure, but it replicates the switching node structure of the tree topology and connects the copies together so that there are equal numbers of links and routers at all levels.

 

<h1>Real-World Implementation of Network Topologies </h1>


In a research study by Andy Hospodor and Ethan Miller, several network topologies were investigated for a high-performance, high-traffic network2. The topologies included the fat tree, butterfly, mesh, tori, and hypercube structures. Advantages and disadvantages in terms of cost, performance, and reliability were discussed.


 


In this experiment, a petabyte-scale network with over 100 GB/s of total aggregate bandwidth was investigated. The network consisted of 4096 disks and large servers, with routers and switches in between2.


 


The overall structure of the network is shown below. Note that this structure is very susceptible to failure and congestion.
 
[[Image:Disknet_network.jpg]]
 
''Basic structure of Hospodor and Miller's experimental network''2


 

<h2>Fat Tree</h2>

In large-scale, high-performance applications, the fat tree can be a choice. However, in order to "fatten" up the links, redundant connections must be used: instead of one link between switching nodes, several are needed. With more input and output links, one needs routers with more input and output ports. Routers with in excess of 100 ports are difficult to build and expensive, so multiple routers would have to be stacked together. Even then, several routers would be required, and they would still be expensive2.
 
 
The Japan Agency for Marine-Earth Science and Technology supercomputing system uses the fat tree topology. The system connects 1280 NEC processors7.

 

<h2>Butterfly</h2>

In high-performance applications, the butterfly structure is a good choice. The butterfly topology uses fewer links than other topologies; however, each link carries traffic from the entire layer. Fault tolerance is poor because only a single path exists between any pair of nodes. Should a link on that path break, data cannot be re-routed, and communication is broken2.
 
[[Image:Disknet_butterfly.jpg]]
 
''Butterfly structure''2

 

<h2>Meshes and Tori</h2>

The mesh and torus structures used in this application would require a large number of links and an aggregate of several thousand ports. However, because there are so many links, the mesh and torus structures provide alternate paths in case of failures2.
 
 
One example of current use of the torus structure is the QPACE SFB TR Cluster in Germany, which uses PowerXCell 8i processors. The system uses a 3-D torus topology with 4608 processors7.
 
 
[[Image:Disknet_mesh.jpg]]
 
''Mesh structure''2
 
 
[[Image:Disknet_torus.jpg]]
 
''Torus structure''2

 

<h2>Hypercube</h2>

Similar to the tori, the hypercube requires a larger number of links. However, its bandwidth scales better than that of the mesh and torus structures.
 
 
The CRAY T3E, CRAY XT3, and SGI Origin 2000 use k-ary n-cube topologies.
 
 
[[Image:Disknet_hypercube.jpg]]
 
''Hypercube structure''2

 

<h1>Comparison of Network Topologies </h1>


The following table shows the total number of ports required for each network topology.
 
 
[[Image:Disknet_ports.jpg]]
 
''Number of ports for each topology''2

 

As the figure above shows, the 6-D hypercube requires the largest number of ports, due to its relatively complex six-dimensional structure. In contrast, the fat tree requires the fewest ports, even though its links have been "fattened" up by using redundant links. The butterfly network requires more than twice as many ports as the fat tree, since it essentially replicates the switching layer of the fat tree. The number of ports for the mesh and torus structures increases as the dimensionality increases. However, with modern router technology, the number of ports is a less important consideration.

 
 
The average path length, or average number of hops, and the average link load (GB/s) are shown below.
 
 
[[Image:Disknet_load.jpg]]
 
''Average path length and link load for each topology''2

 

Looking at the trends, when the average path length is high, the average link load is also high. In other words, average path length and average link load are roughly proportional. It is clear from the graph that the 2-D mesh has, by far, the worst performance. In a large network such as this, the average path length is simply too high, and the average link load suffers. For this type of high-performance network, the 2-D mesh does not scale well. The 2-D torus cuts the average path length and average link load in half by connecting the edge nodes together, but its performance is still relatively poor compared to the other types. The butterfly and fat tree have the lowest average path length and average link load.

 
 
The figure below shows the cost of the network topologies.
 
 
[[Image:Disknet_cost.jpg]]
 
''Cost of each topology''2

 

Despite using the fewest ports, the fat tree topology has the highest cost, by far. Its ports are high-bandwidth ports of 10 GB/s, and over 2400 of them are required to provide enough bandwidth at the upper levels of the tree. This pushes the cost up dramatically, making the fat tree impractical from a cost standpoint. While the total cost of the fat tree is about 15 million dollars, the rest of the network topologies are clustered below 4 million dollars. As the dimensionality of the mesh and torus structures increases, the cost increases. The cost of the butterfly network falls between that of the 2-D mesh/torus and the 6-D hypercube.

 
 
When cost and average link load are considered together, the following graph is produced.
 
 
[[Image:Disknet_overall.jpg]]
 
''Overall cost of each topology''2

 

From the figure above, the 6-D hypercube is the most cost-effective choice for this particular network setup. Although the 6-D hypercube costs more because it needs more links and ports, it provides higher bandwidth, which can offset the higher cost. The high-dimensional tori also perform well, but cannot provide as much bandwidth as the 6-D hypercube. For systems that do not need as much bandwidth, a high-dimensional torus is also a good choice. The butterfly topology is another alternative, but it has lower fault tolerance.

However, when the number of nodes increases, the relative cost of the higher-dimensional topologies increases far faster than their relative performance when compared to a 2-D mesh. This is because the 2-D mesh only uses low-cost, short links. The higher-dimensional structures must be projected onto our 3-dimensional world, and thus require many long, expensive links that wrap around the outside of the system like an impenetrable tangle of jungle vines. Maintaining such a network is also quite slow and tedious.


<h1>Routing</h1>


The '''routing''' algorithm determines the path a packet of data takes from source to destination. Routing can be '''deterministic''', where the path is always the same for a given source and destination, or '''adaptive''', where the path can change. A routing algorithm can also be '''partially adaptive''', where packets have multiple choices of path but not every packet is allowed to use every shortest path3.


<h2>Deadlock</h2>

Packets are in '''deadlock''' when they cannot continue to move through the nodes. The illustration below demonstrates this event.
 
 
[[Image:Routing_deadlock.jpg]]
 
''Example of deadlock''
 
 

Assume that all of the buffers are full at each node. The packet from Node 1 cannot continue to Node 2, the packet from Node 2 cannot continue to Node 3, and so on. Since no packet can move, the packets are deadlocked.

 
 

Deadlock arises from a cyclic pattern of routing. To avoid deadlock, circular routing patterns must be avoided.
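
One way to reason about this is to record, for each node, which node its blocked packet is waiting on, and then look for a cycle in that "waits-for" relation. The sketch below is an illustrative assumption rather than a method taken from the cited sources: the function name `has_cycle` and the four-node example are hypothetical, and the example assumes the figure's chain of nodes closes back on Node 1.

```python
def has_cycle(waits_for):
    """Return True if the directed waits-for graph contains a cycle.

    waits_for maps each node to the list of nodes it is waiting on.
    Uses depth-first search with the usual white/gray/black coloring.
    """
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {}

    def visit(node):
        color[node] = GRAY                    # node is on the current DFS path
        for nxt in waits_for.get(node, ()):
            state = color.get(nxt, WHITE)
            if state == GRAY:                 # reached a node already on the path
                return True
            if state == WHITE and visit(nxt):
                return True
        color[node] = BLACK                   # fully explored, no cycle through it
        return False

    for node in list(waits_for):
        if color.get(node, WHITE) == WHITE and visit(node):
            return True
    return False


# Hypothetical example: four nodes with full buffers, each waiting on the next.
deadlocked = {1: [2], 2: [3], 3: [4], 4: [1]}
# Same nodes, but Node 4 can deliver its packet, so the chain drains.
draining = {1: [2], 2: [3], 3: [4], 4: []}
```

Here `has_cycle(deadlocked)` is true and `has_cycle(draining)` is false. Breaking any one dependency in the cycle is enough to let the packets make progress.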

 
 

To avoid circular patterns of routing, some routing patterns are disallowed. These are called '''turn restrictions''', where some turns are not allowed in order to avoid making a circular routing pattern. Some of these turn restrictions are mentioned below.


 
<h2>Dimensional ordered (X-Y) routing</h2>

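A packet first travels along the x-dimension until it reaches the destination column, and then travels along the y-dimension. Turns from the y-dimension to the x-dimension are not allowed.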
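
Dimension-ordered routing is simple enough to write down directly. The following is a minimal sketch, assuming nodes are identified by (x, y) coordinates on a 2-D mesh; the function name `xy_route` is hypothetical.

```python
def xy_route(src, dst):
    """Return the list of nodes visited from src to dst under X-Y routing.

    The packet first corrects its x-coordinate, then its y-coordinate,
    so it never turns from the y-dimension back into the x-dimension.
    """
    x, y = src
    path = [(x, y)]
    while x != dst[0]:                 # travel along x first
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:                 # then travel along y
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path
```

For example, `xy_route((0, 0), (2, 1))` visits (0, 0), (1, 0), (2, 0), and (2, 1). Because the path is fully determined by the source and destination, this scheme is deterministic.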


<h2>West First</h2>

A packet that needs to travel west must do so at the start of its route, because turns to the west are not allowed.


<h2>North Last</h2>

Once a packet is heading north it may not turn, so any northward travel must come last.


<h2>Negative First</h2>

Turns from a positive direction (+x or +y) to a negative direction (-x or -y) are not allowed, so a packet must make all of its negative-direction hops first.
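
The four turn restrictions above can be collected into a small lookup table. This is a sketch of one common reading of these models, not code from the cited paper: it assumes a 2-D mesh with east = +x, west = -x, north = +y, and south = -y, and the names `PROHIBITED_TURNS` and `turn_allowed` are hypothetical.

```python
# A turn is written (heading, new_heading): the packet is travelling in
# `heading` and wants to continue in `new_heading`.
# Assumed convention: E = +x, W = -x, N = +y, S = -y.
PROHIBITED_TURNS = {
    # No turn from the y-dimension into the x-dimension.
    "xy": {("N", "E"), ("N", "W"), ("S", "E"), ("S", "W")},
    # No turn into the west direction.
    "west_first": {("N", "W"), ("S", "W")},
    # No turn once the packet is heading north.
    "north_last": {("N", "E"), ("N", "W")},
    # No turn from a positive direction into a negative one.
    "negative_first": {("N", "W"), ("E", "S")},
}


def turn_allowed(model, heading, new_heading):
    """Return True if the model permits turning from heading to new_heading."""
    return (heading, new_heading) not in PROHIBITED_TURNS[model]
```

Note that X-Y routing prohibits four turns, while each of the other three models prohibits only two, which is why they leave packets some choice of path.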


<h2>Odd-Even Turn Model</h2>


Unfortunately, the above turn-restriction models reduce the degree of adaptiveness and are only partially adaptive. They cause some packets to take different routes, which are not necessarily minimal paths. This may cause unfairness and reduces the ability of the system to relieve congestion, so overall performance could suffer3.


 


Ge-Ming Chiu introduced the Odd-Even turn model, a deadlock-free, adaptive turn-restriction model that performs better than the previously mentioned models3. The model is designed primarily for 2-D meshes.

 

''Turns from the east to north direction from any node on an even column are not allowed.''
 
''Turns from the north to west direction from any node on an odd column are not allowed.''
 
 
''Turns from the east to south direction from any node on an even column are not allowed.''
 
''Turns from the south to west direction from any node on an odd column are not allowed.''
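
The four rules above translate directly into a predicate. The sketch below is only an illustration: the function name is hypothetical, and it assumes that columns are numbered by their x-coordinate starting from 0 and that a turn "from the east to north direction" means a packet travelling east that turns to travel north.

```python
def odd_even_turn_allowed(heading, new_heading, column):
    """Return True if the odd-even rules permit this turn at a node in `column`.

    heading / new_heading are one of "N", "S", "E", "W".
    Assumption: a column is even when its x-coordinate is even.
    """
    even_column = column % 2 == 0
    # East-to-north and east-to-south turns are prohibited in even columns.
    if even_column and heading == "E" and new_heading in ("N", "S"):
        return False
    # North-to-west and south-to-west turns are prohibited in odd columns.
    if not even_column and heading in ("N", "S") and new_heading == "W":
        return False
    return True
```

For instance, an eastbound packet may turn north in column 3 but not in column 2, and a northbound packet may turn west in column 2 but not in column 3. Unlike the earlier models, the set of prohibited turns depends on where the packet is.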
 
 
 
The illustration below shows allowed routing for different source and destination nodes. Depending on which column the packet is in, only certain directions are allowed.
 
 
[[Image:Routing_odd_even.jpg]]
 
''Odd-Even turn restriction model proposed by Ge-Ming Chiu''3
 

 

<h1>Comparison of Turn Restriction Models</h1>

To compare the performance of various turn restriction models, Chiu simulated a 15 x 15 mesh under various traffic patterns. All channels have a bandwidth of 20 flits/usec and a buffer size of one flit. The dimension-ordered x-y routing, west-first, and negative-first models were compared against the odd-even model.

 

Simulations were conducted with uniform, transpose, and hot spot traffic patterns. Uniform simulates each node sending messages to any other node with equal probability. Transpose simulates two opposite nodes sending messages to their respective halves of the mesh. Hot spot simulates a few "hot spot" nodes that receive high traffic.



 
 
[[Image:Routing_uniform.jpg]]
 
''Uniform traffic simulation of various turn restriction models''3
 
 
The performance of the different routing algorithms is shown above for the uniform traffic. For uniform traffic, the dimension-ordered x-y model outperforms the rest of the models. As the number of messages increases, the x-y model has the "slowest" increase in average communication latency.



 
 
[[Image:Routing_transpose.jpg]]
 
''First transpose traffic simulation of various turn restriction models''3
 
 
The performance of the different routing algorithms is shown above for the first transpose traffic. The negative-first model has the best performance, while the odd-even model performs better than the west-first and x-y models.



 
 
[[Image:Routing_transpose2.jpg]]
 
''Second transpose traffic simulation of various turn restriction models''3
 
 
With the second transpose simulation, the odd-even model outperforms the rest.



 
 
[[Image:Routing_hotspot.jpg]]
 
''Hotspot traffic simulation of various turn restriction models''3
 
 
The performance of the different routing algorithms is shown above for the hotspot traffic. Only one hotspot was simulated for this test. The odd-even model outperforms the other models when hotspot traffic is 10%.



 
 
[[Image:Routing_hotspot2.jpg]]
 
''Second hotspot traffic simulation of various turn restriction models''3
 
 
When the number of hotspots is increased to five, the odd-even model begins to shine: its latency is the lowest at both 6 and 8 percent hotspot traffic. Meanwhile, the performance of the x-y model is horrendous.


 
 

While the x-y model performs well under uniform traffic, it lacks adaptiveness. When traffic concentrates on hotspots, the x-y model cannot adapt and re-route traffic to avoid the resulting congestion. The odd-even model has superior adaptiveness under high congestion.


 

<h1>Router Architecture</h1>

The '''router''' is a device that routes incoming data to its destination. It has several input ports and several output ports, and data arriving on one of the input ports is routed to one of the output ports. Which output port is chosen depends on the destination of the data and on the routing algorithm.

 
 

The internal architecture of a router consists of input and output ports and a '''crossbar switch'''. The crossbar switch connects each input to the output that has been selected for it, acting essentially as a multiplexer.
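
The behavior of the crossbar can be modeled as a mapping from input ports to output ports in which each output serves at most one input at a time. The sketch below is a simplified illustration, not a description of any particular router: the function name is hypothetical, and the fixed-priority arbitration (the lowest-numbered input wins a contested output) is an assumption made for brevity.

```python
def crossbar_arbitrate(requests):
    """Decide which input-to-output connections the crossbar makes this cycle.

    requests maps input port -> requested output port.
    Returns the granted subset; an output is granted to at most one input.
    Assumption: fixed priority, lowest-numbered input port wins a conflict.
    """
    granted = {}
    outputs_in_use = set()
    for input_port in sorted(requests):
        output_port = requests[input_port]
        if output_port not in outputs_in_use:
            granted[input_port] = output_port
            outputs_in_use.add(output_port)
    return granted
```

If inputs 0 and 1 both request output 2 while input 2 requests output 0, the call `crossbar_arbitrate({0: 2, 1: 2, 2: 0})` grants `{0: 2, 2: 0}`, and input 1 must wait for a later cycle.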

 
 

Router technology has improved significantly over the years. This has allowed networks with high dimensionality to become feasible. As shown in the real-world example above, high-dimensional tori and hypercubes are excellent choices of topology for high-performance networks. The falling cost of high-performance, high-radix routers has contributed to the viability of these high-dimensionality networks. As the graph below shows, the bandwidth of routers has improved tremendously over a period of 10 years4.


 
[[Image:Router_bandwidth.jpg]]
 
''Bandwidth of various routers over 10 year period''4

 

Looking at the physical architecture and layout of routers, it is evident that the circuitry has become dramatically denser and more complex.


 
[[Image:Router_physical.jpg]]
 
''Router hardware over period of time''4
 

 
[[Image:Router_radix.jpg]]
 
''Radix and latency of routers over 10 year period''4
 
 

The '''radix''', or number of ports, of routers has also increased. Current technology not only has high radix, but also lower latency than the previous generation. As radix increases, latency remains steady.


 


As router technology improves, more complex, higher-dimensional topologies become possible.


 

<h1>Fault Tolerant Routing</h1>

Fault-tolerant routing means the successful routing of messages between any pair of non-faulty nodes in the presence of faulty components6. With the increased number of processors in a multiprocessor system and high data rates, reliable transmission of data in the event of a network fault is of great concern, and hence fault-tolerant routing algorithms are important.


<h2>Fault Models</h2>

Faults in a network can be categorized into two types:
 
 
1. '''Transient Faults'''5: A transient fault is a temporary fault that lasts for a very short time. It can be caused, for example, by a change in the output of a flip-flop that leads to the generation of an invalid header. These faults can be minimized using error control coding. They are generally evaluated in terms of Bit Error Rate.
 
 
2. '''Permanent Faults'''5: A permanent fault is a fault that does not go away and causes permanent damage to the network. It could be due to damaged wires and associated circuitry. These faults are generally evaluated in terms of Mean Time Between Failures.


<h2>Fault Tolerance Mechanisms (for permanent faults)</h2>

Permanent faults can be handled using one of two mechanisms:
 
 
1. '''Static Mechanism''': In the static fault tolerance model, once a fault is detected, all the processes running in the system are stopped and the routing tables are emptied. Based on the fault information, the routing tables are recalculated to provide fault-free paths.
 
 
2. '''Dynamic Mechanisms''': In the dynamic fault tolerance model, the operation of the processes in the network is not completely stalled, and only the affected regions are repaired. Some of the methods to do this are:
 
 
a. '''Block Faults''': In this method, many of the healthy nodes in the vicinity of the faulty nodes are also marked as faulty, so that no routes are created close to the actual faulty nodes. The shape of the region can be convex or non-convex, and care is taken that none of the new routes introduces a cyclic dependency in the channel dependency graph (CDG).
 
 
DISADVANTAGE: This method causes many healthy nodes to be declared faulty, reducing system capacity.
 
 
[[Image:Fault_pic1.jpg]]
 
 
b. '''Fault Rings''': This method was introduced by Chalasani and Boppana. A fault ring is a set of nodes and links that are adjacent to faulty nodes/links. This approach reduces the number of healthy nodes that must be marked as faulty and blocked.
 
 
[[Image:Fault_pic2.jpg]]
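
The capacity cost of the block-fault method can be illustrated with a small calculation. The sketch below is an assumption-laden simplification: it uses a rectangular (convex) region on a 2-D mesh, namely the bounding box of the faulty nodes, and the function name `block_fault_region` is hypothetical.

```python
def block_fault_region(faulty):
    """Return the rectangular block of nodes that covers all faulty nodes.

    faulty is a non-empty set of (x, y) coordinates on a 2-D mesh.
    Every node inside the bounding box is treated as faulty by the router,
    including healthy ones.
    """
    xs = [x for x, _ in faulty]
    ys = [y for _, y in faulty]
    return {(x, y)
            for x in range(min(xs), max(xs) + 1)
            for y in range(min(ys), max(ys) + 1)}


# Hypothetical example: two faulty nodes on a mesh.
faulty = {(1, 1), (3, 2)}
region = block_fault_region(faulty)
sacrificed = region - faulty   # healthy nodes that are disabled anyway
```

In this example only two nodes have actually failed, but the block covers six nodes, so four healthy nodes are taken out of service.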
 


 

<h1>References</h1>


1 Solihin text
 
2 [http://www.ssrc.ucsc.edu/Papers/hospodor-mss04.pdf Interconnection Architectures for Petabyte-Scale High-Performance Storage Systems]
 
3 [http://www.diit.unict.it/~vcatania/COURSES/semm_05-06/DOWNLOAD/noc_routing02.pdf The Odd-Even Turn Model for Adaptive Routing]
 
4 [http://www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf Interconnection Topologies:(Historical Trends and Comparisons)]
 
5 [http://dspace.upv.es/xmlui/bitstream/handle/10251/2603/tesisUPV2824.pdf?sequence=1 Efficient mechanisms to provide fault tolerance in interconnection networks for PC clusters, José Miguel Montañana Aliaga.]
 
6 [http://web.ebscohost.com.www.lib.ncsu.edu:2048/ehost/pdfviewer/pdfviewer?vid=2&hid=15&sid=72e3828d-3cb1-42b9-8198-5c1e974ea53f@sessionmgr4 Adaptive Fault Tolerant Routing Algorithm for Tree-Hypercube Multicomputer, Qatawneh Mohammad]
 
7 [http://www.top500.org TOP500 Supercomputing Sites]
 
8 [http://www.redbooks.ibm.com/abstracts/sg245161.html?Open Understanding and Using the SP Switch]
 
9 [http://www.myri.com/myrinet/overview/ Myrinet Overview]
 
10 [http://en.wikipedia.org/wiki/QsNet QsNet (Quadrics' network)]
 
11 [http://www.google.com/research/pubs/pub35155.html Dragonfly Topology]
 
12 [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.95.573&rep=rep1&type=pdf Flattened Butterfly Topology]

CSC/ECE 506 Spring 2011/ch12 aj

2011-04-18T06:28:25Z

Asbransc:

<h1>Interconnection Network Architecture </h1>


In a multi-processor system, processors need to communicate with each other and access each other's resources. In order to route data and messages between processors, an interconnection architecture is needed.


 


Typically, in a multiprocessor system, messages passed between processors are frequent and short1. Therefore, the interconnection network architecture must handle messages quickly by having '''low latency''', and must handle several messages at a time by having '''high bandwidth'''.


 


In a network, a processor along with its cache and memory is considered a '''node'''. The physical wires that connect nodes are called '''links'''. The device that routes messages between nodes is called a router. The shape of the network, that is, the arrangement of its links and routers, is called the network '''topology'''.


<h1>History of Network Topologies</h1>

Hypercube topologies were invented in the 1980s and had desirable characteristics when the number of nodes was small (~1000 maximum, often <100) and every processor had to stop working to receive and forward each message 4. The low-radix era began in 1985 and was defined by routers with between 4 and 8 ports using toroidal, mesh, or fat-tree topologies and wormhole routing. This era lasted about 20 years, until it was determined that routers with dozens of ports offered superior performance. Two topologies were developed to take advantage of the newly developed high-radix routers: the flattened butterfly and the dragonfly. Both fall somewhere between a mesh in which each point is a router (or a virtual router, in the case of dragonfly) with dozens or hundreds of nodes attached, and a fat tree with arity high enough to need only two levels.

<h2>Interconnection evolution in the Top500 List</h2>
[[Image:Top500interconnect.png]]

This chart shows how the interconnect families represented in the TOP500 list of supercomputers have risen and fallen over time 7. As one can see, many technologies came into vogue briefly before losing performance share and disappearing. In the early days of the list, most systems listed the interconnect type as not applicable. Even so, the trailing end of the hypercube phase is clear in burnt orange. The dark blue at the top is "other" and the dark red in the middle is "proprietary", so we can only speculate about what topologies those systems employ. The toroidal mesh appears briefly at the start in a cream color and slightly outlasts the hypercube. The two crossbar technologies (blue and olive) followed the toroidal mesh. The fully-distributed crossbar died out quickly; the multi-stage crossbar lasted longer but was never dominant. The 3-D torus (purple) dominates much of the 1990s, with hypercube topologies (dark pink) enjoying a short comeback in the later part of the decade. SP Switch (light olive), an IBM interconnect technology that uses a multi-stage crossbar switch, replaced the 3-D torus. Myrinet, Quadrics, and Federation shared the spotlight in the mid-2000s, and each used a similar fat-tree topology. The current class of supercomputers is dominated by nodes connected with either InfiniBand or Gigabit Ethernet. Both can be connected in either a fat-tree or 2-D mesh topology. The primary difference between them is speed: InfiniBand is considerably faster per link and allows links to be ganged into groups of 4 or 12. Gigabit Ethernet is vastly less expensive, however, and some supercomputer designers have apparently chosen to save money on the interconnect technology in order to allow the use of faster nodes.

<h1>Types of Network Topologies </h1>

<h2>Linear Array</h2>

[[Image:Top_linear.jpg]]
 
 
The nodes are connected linearly as in an array. This type of topology is simple, but it does not scale well. The longest distance between two nodes, or the '''diameter''', is p - 1 hops for p nodes, so it grows linearly with the size of the network.

 

<h2>Ring</h2>

[[Image:Top_ring.jpg]]
 
 
The ring has a similar structure to the linear array, except that the two end nodes connect to each other, establishing a circular structure. The longest distance between two nodes is cut roughly in half.

 

<h2>2-D Mesh</h2>

[[Image:Top_2Dmesh.jpg]]
 
 
The 2-D mesh can be thought of as several linear arrays put together to form a 2-dimensional structure. Nodes that are not on the edge have 4 input or output links, or a '''degree''' of 4.

 

<h2>2-D Torus</h2>

[[Image:Top_2Dtorus.jpg]]
 
 
The 2-D torus takes the structure of the 2-D mesh and connects the nodes on the edges. This decreases the diameter, but the number of links is higher.

 

<h2>Cube</h2>

[[Image:Top_cube.jpg]]
 
 
The cube can be thought of as a three-dimensional mesh.

 

<h2>Hypercube</h2>

[[Image:Top_hypercube.jpg]]
 
 
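The hypercube is essentially multiple cubes put together. An n-dimensional hypercube has 2^n nodes, and each node is linked to the n nodes whose binary labels differ from its own in exactly one bit.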

 

<h2>Tree</h2>

[[Image:Top_tree.jpg]]
 
 
The tree is a hierarchical structure with nodes at the bottom and switching nodes at the upper levels. The tree experiences high traffic at the upper levels.

 

<h2>Fat Tree</h2>

[[Image:Top_fat_tree.jpg]]
 
 
The fat tree alleviates the traffic at upper levels by "fattening" up the links at the upper levels.

 

<h2>Butterfly</h2>

[[Image:Top_butterfly.jpg]]
 
 
The butterfly structure is similar to the tree structure, but it replicates the switching node structure of the tree topology and connects the copies together so that there are equal numbers of links and routers at all levels.

 

<h1>Real-World Implementation of Network Topologies </h1>


In a research study by Andy Hospodor and Ethan Miller, several network topologies were investigated for a high-performance, high-traffic network2. The topologies included the fat tree, butterfly, mesh, tori, and hypercube structures. Advantages and disadvantages in terms of cost, performance, and reliability were discussed.


 


In this experiment, a petabyte-scale network with over 100 GB/s of total aggregate bandwidth was investigated. The network consisted of 4096 disks and large servers, with routers and switches in between2.


 


The overall structure of the network is shown below. Note that this structure is very susceptible to failure and congestion.
 
[[Image:Disknet_network.jpg]]
 
''Basic structure of Hospodor and Miller's experimental network''2


 

<h2>Fat Tree</h2>

In large-scale, high-performance applications, the fat tree can be a choice. However, in order to "fatten" up the links, redundant connections must be used: instead of one link between switching nodes, several are needed. With more input and output links, one needs routers with more input and output ports. Routers with in excess of 100 ports are difficult to build and expensive, so multiple routers would have to be stacked together. Even then, several routers would be required, and they would still be expensive2.
 
 
The Japan Agency for Marine-Earth Science and Technology supercomputing system uses the fat tree topology. The system connects 1280 NEC processors7.

 

<h2>Butterfly</h2>

In high-performance applications, the butterfly structure is a good choice. The butterfly topology uses fewer links than other topologies; however, each link carries traffic from the entire layer. Fault tolerance is poor because only a single path exists between any pair of nodes. Should a link on that path break, data cannot be re-routed, and communication is broken2.
 
[[Image:Disknet_butterfly.jpg]]
 
''Butterfly structure''2

 

<h2>Meshes and Tori</h2>

The mesh and torus structures used in this application would require a large number of links and an aggregate of several thousand ports. However, because there are so many links, the mesh and torus structures provide alternate paths in case of failures2.
 
 
One example of current use of the torus structure is the QPACE SFB TR Cluster in Germany, which uses PowerXCell 8i processors. The system uses a 3-D torus topology with 4608 processors7.
 
 
[[Image:Disknet_mesh.jpg]]
 
''Mesh structure''2
 
 
[[Image:Disknet_torus.jpg]]
 
''Torus structure''2

 

<h2>Hypercube</h2>

Similar to the tori, the hypercube requires a larger number of links. However, its bandwidth scales better than that of the mesh and torus structures.
 
 
The CRAY T3E, CRAY XT3, and SGI Origin 2000 use k-ary n-cube topologies.
 
 
[[Image:Disknet_hypercube.jpg]]
 
''Hypercube structure''2

 

<h1>Comparison of Network Topologies </h1>


The following table shows the total number of ports required for each network topology.
 
 
[[Image:Disknet_ports.jpg]]
 
''Number of ports for each topology''2

 

As the figure above shows, the 6-D hypercube requires the largest number of ports, due to its relatively complex six-dimensional structure. In contrast, the fat tree requires the fewest ports, even though its links have been "fattened" up by using redundant links. The butterfly network requires more than twice as many ports as the fat tree, since it essentially replicates the switching layer of the fat tree. The number of ports for the mesh and torus structures increases as the dimensionality increases. However, with modern router technology, the number of ports is a less important consideration.

 
 
The average path length, or average number of hops, and the average link load (GB/s) are shown below.
 
 
[[Image:Disknet_load.jpg]]
 
''Average path length and link load for each topology''2

 

Looking at the trends, when the average path length is high, the average link load is also high. In other words, average path length and average link load are roughly proportional. It is clear from the graph that the 2-D mesh has, by far, the worst performance. In a large network such as this, the average path length is simply too high, and the average link load suffers. For this type of high-performance network, the 2-D mesh does not scale well. The 2-D torus cuts the average path length and average link load in half by connecting the edge nodes together, but its performance is still relatively poor compared to the other types. The butterfly and fat tree have the lowest average path length and average link load.

 
 
The figure below shows the cost of the network topologies.
 
 
[[Image:Disknet_cost.jpg]]
 
''Cost of each topology''2

 

Despite using the fewest ports, the fat tree topology has the highest cost, by far. Its ports are high-bandwidth ports of 10 GB/s, and over 2400 of them are required to provide enough bandwidth at the upper levels of the tree. This pushes the cost up dramatically, making the fat tree impractical from a cost standpoint. While the total cost of the fat tree is about 15 million dollars, the rest of the network topologies are clustered below 4 million dollars. As the dimensionality of the mesh and torus structures increases, the cost increases. The cost of the butterfly network falls between that of the 2-D mesh/torus and the 6-D hypercube.

 
 
When cost and average link load are considered together, the following graph is produced.
 
 
[[Image:Disknet_overall.jpg]]
 
''Overall cost of each topology''2

 

From the figure above, the 6-D hypercube is the most cost-effective choice for this particular network setup. Although the 6-D hypercube costs more because it needs more links and ports, it provides higher bandwidth, which can offset the higher cost. The high-dimensional tori also perform well, but cannot provide as much bandwidth as the 6-D hypercube. For systems that do not need as much bandwidth, a high-dimensional torus is also a good choice. The butterfly topology is another alternative, but it has lower fault tolerance.

However, when the number of nodes increases, the relative cost of the higher-dimensional topologies increases far faster than their relative performance when compared to a 2-D mesh. This is because the 2-D mesh only uses low-cost, short links. The higher-dimensional structures must be projected onto our 3-dimensional world, and thus require many long, expensive links that wrap around the outside of the system like an impenetrable tangle of jungle vines. Maintaining such a network is also quite slow and tedious.


<h1>Routing</h1>


The '''routing''' algorithm determines the path a packet of data takes from source to destination. Routing can be '''deterministic''', where the path is always the same for a given source and destination, or '''adaptive''', where the path can change. A routing algorithm can also be '''partially adaptive''', where packets have multiple choices of path but not every packet is allowed to use every shortest path3.


<h2>Deadlock</h2>

Packets are in '''deadlock''' when they cannot continue to move through the nodes. The illustration below demonstrates this situation.
 
 
[[Image:Routing_deadlock.jpg]]
 
''Example of deadlock''
 
 

Assume that all of the buffers are full at each node. The packet from Node 1 cannot continue to Node 2, the packet from Node 2 cannot continue to Node 3, and so on. Since no packet can move, the packets are deadlocked.

 
 

Deadlock arises from a cyclic pattern of routing, so avoiding deadlock means avoiding circular routing patterns.
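
The cyclic pattern can be checked mechanically. The following is a minimal sketch, assuming a hypothetical representation in which each buffer (or channel) maps to the buffers it is waiting on; the names <code>has_cycle</code> and <code>waits_on</code> are illustrative and do not come from the cited sources.

```python
def has_cycle(waits_on):
    """Return True if the wait-for graph contains a cycle (a potential deadlock).

    waits_on maps each buffer to the list of buffers it is waiting on
    (hypothetical representation chosen for this sketch).
    """
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, finished
    color = {}

    def visit(node):
        color[node] = GRAY
        for nxt in waits_on.get(node, ()):
            state = color.get(nxt, WHITE)
            if state == GRAY:
                return True  # reached a node already on the current path
            if state == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color.get(n, WHITE) == WHITE and visit(n) for n in list(waits_on))


# Four nodes whose full buffers wait on each other in a ring, as in the figure.
ring = {1: [2], 2: [3], 3: [4], 4: [1]}
# The same nodes with the last dependency removed.
chain = {1: [2], 2: [3], 3: [4], 4: []}

print(has_cycle(ring), has_cycle(chain))
```

Removing any single dependency from the ring breaks the cycle, which is the idea behind the turn restrictions described next.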

 
 

To avoid circular patterns of routing, some routing patterns are disallowed. These are called '''turn restrictions''', where some turns are not allowed in order to avoid making a circular routing pattern. Some of these turn restrictions are mentioned below.


 
<h2>Dimensional ordered (X-Y) routing</h2>

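Turns from the y-dimension to the x-dimension are not allowed, so a packet first travels the full distance in the x-dimension and only then travels in the y-dimension.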
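
A minimal sketch of this rule on a 2-D mesh, assuming nodes are addressed by hypothetical <code>(x, y)</code> coordinates; the function names are illustrative only.

```python
def xy_next_hop(current, dest):
    """Return the next node on a dimension-ordered (X-Y) route, or None on arrival."""
    x, y = current
    dest_x, dest_y = dest
    if x != dest_x:  # finish all movement in the x-dimension first
        return (x + (1 if dest_x > x else -1), y)
    if y != dest_y:  # only then move in the y-dimension
        return (x, y + (1 if dest_y > y else -1))
    return None


def xy_route(src, dest):
    """Return the full list of nodes visited from src to dest."""
    path = [src]
    nxt = xy_next_hop(src, dest)
    while nxt is not None:
        path.append(nxt)
        nxt = xy_next_hop(nxt, dest)
    return path


print(xy_route((0, 0), (2, 1)))
```

Because the route never turns from y back to x, the path for a given source and destination is always the same, which makes this scheme deterministic.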


<h2>West First</h2>

Turns to the west are not allowed, so a packet that needs to travel west must do so first.


<h2>North Last</h2>

Turns away from the north direction are not allowed, so a packet travels north only as the last leg of its route.


<h2>Negative First</h2>

Turns from a positive direction to a negative direction (-x or -y) are not allowed, so a packet must make all of its moves in the negative directions first.


<h2>Odd-Even Turn Model</h2>


Unfortunately, the above turn-restriction models reduce the degree of adaptiveness and are only partially adaptive. The models cause some packets to take different routes, and not necessarily the minimal paths. This may cause unfairness and limits the ability of the system to relieve congestion, so overall performance could suffer3.


 


Ge-Ming Chiu introduced the Odd-Even turn model as an adaptive, deadlock-free turn-restriction model that performs better than the previously mentioned models3. The model is designed primarily for 2-D meshes.

 

''Turns from the east to north direction from any node on an even column are not allowed.''
 
''Turns from the north to west direction from any node on an odd column are not allowed.''
 
 
''Turns from the east to south direction from any node on an even column are not allowed.''
 
''Turns from the south to west direction from any node on an odd column are not allowed.''
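
The four rules above can be encoded as a small lookup. This is a minimal sketch, assuming columns are numbered from 0 and that a turn is written as the heading before the turn followed by the heading after it; <code>FORBIDDEN_TURNS</code> and <code>turn_allowed</code> are hypothetical names, not taken from Chiu's paper.

```python
# Turns forbidden by the odd-even rules listed above, keyed by column parity.
# Each turn is (heading before the turn, heading after the turn).
FORBIDDEN_TURNS = {
    "even": {("E", "N"), ("E", "S")},
    "odd": {("N", "W"), ("S", "W")},
}


def turn_allowed(column, heading, new_heading):
    """Return True if a packet at a node in `column` may turn from heading to new_heading."""
    parity = "even" if column % 2 == 0 else "odd"
    return (heading, new_heading) not in FORBIDDEN_TURNS[parity]


print(turn_allowed(2, "E", "N"))  # east-to-north in an even column
print(turn_allowed(3, "E", "N"))  # the same turn in an odd column
```

The restriction depends on the column of the node where the turn is made, so the same turn can be forbidden at one node and allowed at its neighbor.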
 
 
 
The illustration below shows allowed routing for different source and destination nodes. Depending on which column the packet is in, only certain directions are allowed.
 
 
[[Image:Routing_odd_even.jpg]]
 
''Odd-Even turn restriction model proposed by Ge-Ming Chiu''3
 

 

<h1>Comparison of Turn Restriction Models</h1>

To compare the performance of various turn restriction models, Chiu simulated a 15 x 15 mesh under various traffic patterns. All channels have a bandwidth of 20 flits/usec and a buffer size of one flit. The dimension-ordered x-y routing, west-first, and negative-first models were compared against the odd-even model.

 

The traffic patterns simulated were uniform, transpose, and hot spot. Uniform simulates each node sending messages to any other node with equal probability. Transpose simulates two opposite nodes sending messages to their respective halves of the mesh. Hot spot simulates a few "hot spot" nodes that receive high traffic.
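
The uniform and hot spot patterns can be sketched as destination generators. This is a minimal sketch under stated assumptions: nodes are hypothetical <code>(x, y)</code> coordinates on an n x n mesh, and the hot spot percentage is interpreted as the probability that a message is redirected to a hot spot node, which may differ from the exact definition used in Chiu's simulator. The transpose pattern is omitted.

```python
import random


def uniform_dest(src, n, rng):
    """Pick any node other than src on an n x n mesh with equal probability."""
    while True:
        dest = (rng.randrange(n), rng.randrange(n))
        if dest != src:
            return dest


def hotspot_dest(src, n, hotspots, fraction, rng):
    """With probability `fraction` send to a hot spot node, otherwise pick uniformly.

    `fraction` is an assumed interpretation of the hot spot percentage.
    """
    candidates = [h for h in hotspots if h != src]
    if candidates and rng.random() < fraction:
        return rng.choice(candidates)
    return uniform_dest(src, n, rng)


rng = random.Random(0)  # seeded so a run is repeatable
print(uniform_dest((0, 0), 15, rng))
print(hotspot_dest((0, 0), 15, [(7, 7)], 0.10, rng))
```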



 
 
[[Image:Routing_uniform.jpg]]
 
''Uniform traffic simulation of various turn restriction models''3
 
 
The performance of the different routing algorithms is shown above for uniform traffic. For uniform traffic, the dimension-ordered x-y model outperforms the rest of the models. As the number of messages increases, the x-y model has the slowest increase in average communication latency.



 
 
[[Image:Routing_transpose.jpg]]
 
''First transpose traffic simulation of various turn restriction models''3
 
 
The performance of the different routing algorithms is shown above for the first transpose traffic. The negative-first model has the best performance, while the odd-even model performs better than the west-first and x-y models.



 
 
[[Image:Routing_transpose2.jpg]]
 
''Second transpose traffic simulation of various turn restriction models''3
 
 
With the second transpose simulation, the odd-even model outperforms the rest.



 
 
[[Image:Routing_hotspot.jpg]]
 
''Hotspot traffic simulation of various turn restriction models''3
 
 
The performance of the different routing algorithms is shown above for the hotspot traffic. Only one hotspot was simulated for this test. The odd-even model outperforms the other models when hotspot traffic is 10%.



 
 
[[Image:Routing_hotspot2.jpg]]
 
''Second hotspot traffic simulation of various turn restriction models''3
 
 
When the number of hotspots is increased to five, the odd-even model begins to shine: its latency is the lowest for both 6 and 8 percent hotspot traffic. Meanwhile, the x-y model performs very poorly.


 
 

While the x-y model performs well in uniform traffic, it lacks adaptiveness. When traffic concentrates on hotspots, the x-y model cannot adapt and re-route traffic to avoid the congestion they cause. The odd-even model has superior adaptiveness under high congestion.


 

<h1>Router Architecture</h1>

The '''router''' is a device that routes incoming data to its destination. It has several input ports and several output ports, and data arriving on one of the input ports is routed to one of the output ports. Which output port is chosen depends on the destination of the data and on the routing algorithm.

 
 

The internal architecture of a router consists of input and output ports and a '''crossbar switch'''. The crossbar switch connects each input to the output selected for it, acting essentially as a set of multiplexers.
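
A minimal sketch of one switching cycle of a crossbar, assuming each output port can accept data from only one input per cycle and that conflicts are resolved by a simple fixed priority (lowest input port number wins). The arbitration policy and the name <code>crossbar_cycle</code> are assumptions made for illustration; real routers use a variety of arbitration schemes.

```python
def crossbar_cycle(requests):
    """Resolve one cycle of a crossbar.

    requests maps input port -> requested output port.
    Returns the granted connections; an input that loses a conflict must retry later.
    """
    granted = {}
    taken = set()  # output ports already connected this cycle
    for inp in sorted(requests):  # fixed priority: lowest input port first
        out = requests[inp]
        if out not in taken:
            taken.add(out)
            granted[inp] = out
    return granted


# Inputs 0 and 1 both request output 2; input 2 requests output 0.
print(crossbar_cycle({0: 2, 1: 2, 2: 0}))
```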

 
 

Router technology has improved significantly over the years, which has made networks with high dimensionality feasible. As shown in the real-world example above, high-dimensional tori and hypercubes are excellent choices of topology for high-performance networks. The cost of high-performance, high-radix routers has contributed to the viability of these types of high-dimensionality networks. As the graph below shows, the bandwidth of routers has improved tremendously over a period of 10 years4.


 
[[Image:Router_bandwidth.jpg]]
 
''Bandwidth of various routers over 10 year period''4

 

Looking at the physical architecture and layout of routers, it is evident that the circuitry has become dramatically more dense and complex.


 
[[Image:Router_physical.jpg]]
 
''Router hardware over period of time''4
 

 
[[Image:Router_radix.jpg]]
 
''Radix and latency of routers over 10 year period''4
 
 

The '''radix''', or the number of ports of a router, has also increased. Current technology not only has a high radix, but also lower latency than the previous generation. As radix increases, the latency remains steady.


 


As router technology improves, more complex, high-dimensionality topologies become possible.


 

<h1>Fault Tolerant Routing</h1>

Fault-tolerant routing means the successful routing of messages between any pair of non-faulty nodes in the presence of faulty components6. With the increased number of processors in a multiprocessor system and high data rates, reliable transmission of data in the event of a network fault is of great concern, and hence fault-tolerant routing algorithms are important.


<h2>Fault Models</h2>

Faults in a network can be categorized in two types:
 
 
1.'''Transient Faults'''5: A transient fault is a temporary fault that occurs for a very short duration of time. It can be caused, for example, by a change in the output of a flip-flop that leads to the generation of an invalid header. These faults can be minimized using error control coding, and they are generally evaluated in terms of Bit Error Rate.
 
 
2.'''Permanent Faults'''5: A permanent fault is a fault that does not go away and causes permanent damage to the network. It could be due to damaged wires and associated circuitry. These faults are generally evaluated in terms of Mean Time Between Failures.


<h2>Fault Tolerance Mechanisms (for permanent faults)</h2>

Permanent faults can be handled using one of two mechanisms:
 
 
1.'''Static Mechanism''': In the static fault tolerance model, once a fault is detected, all the processes running in the system are stopped and the routing tables are emptied. The routing tables are then recalculated from the fault information to provide fault-free paths.
 
 
2.'''Dynamic Mechanisms''': In the dynamic fault tolerance model, the processes in the network are not completely stalled, and only the affected regions are repaired. Some of the methods to do this are:
 
 
a.'''Block Faults''': In this method, many of the healthy nodes in the vicinity of the faulty nodes are also marked as faulty, so that no routes are created close to the actual faulty nodes. The shape of the region can be convex or non-convex, and care is taken that none of the new routes introduces a cyclic dependency in the channel dependency graph (CDG).
 
 
DISADVANTAGE: This method causes many healthy nodes to be declared faulty, which reduces system capacity.
 
 
[[Image:Fault_pic1.jpg]]
 
 
b.'''Fault Rings''': This method was introduced by Chalasani and Boppana. A fault ring is a set of nodes and links that are adjacent to faulty nodes/links. This approach reduces the number of healthy nodes that must be marked as faulty and blocked.
 
 
[[Image:Fault_pic2.jpg]]
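
The static mechanism described above, in which routes are recalculated once the faulty components are known, can be sketched as follows. This is a minimal sketch, assuming a hypothetical n x n mesh with <code>(x, y)</code> node coordinates, a known set of faulty nodes, and a healthy destination; it uses a plain breadth-first search and does not attempt to guarantee deadlock freedom, which a real fault-tolerant routing algorithm must also address.

```python
from collections import deque


def rebuild_next_hops(n, faulty, dest):
    """Recompute, for every reachable healthy node, the next hop toward dest.

    Runs a breadth-first search outward from dest over healthy nodes of an
    n x n mesh, so each recorded next hop lies on a shortest fault-free path.
    """
    next_hop = {dest: None}
    queue = deque([dest])
    while queue:
        x, y = queue.popleft()
        for node in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            in_mesh = 0 <= node[0] < n and 0 <= node[1] < n
            if in_mesh and node not in faulty and node not in next_hop:
                next_hop[node] = (x, y)  # step back toward dest
                queue.append(node)
    return next_hop


def follow(table, src):
    """Walk the routing table from src until the destination is reached."""
    path = [src]
    while table[path[-1]] is not None:
        path.append(table[path[-1]])
    return path


# 3 x 3 mesh with node (1, 0) faulty; route from (0, 0) to (2, 0).
table = rebuild_next_hops(3, {(1, 0)}, (2, 0))
print(follow(table, (0, 0)))
```

Nodes that are cut off from the destination by faults simply do not appear in the table, which a caller can treat as "unreachable".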
 


 

<h1>References</h1>


1 Solihin text
 
2 [http://www.ssrc.ucsc.edu/Papers/hospodor-mss04.pdf Interconnection Architectures for Petabyte-Scale High-Performance Storage Systems]
 
3 [http://www.diit.unict.it/~vcatania/COURSES/semm_05-06/DOWNLOAD/noc_routing02.pdf The Odd-Even Turn Model for Adaptive Routing]
 
4 [http://www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf Interconnection Topologies:(Historical Trends and Comparisons)]
 
5 [http://dspace.upv.es/xmlui/bitstream/handle/10251/2603/tesisUPV2824.pdf?sequence=1 Efficient mechanisms to provide fault tolerance in interconnection networks for PC clusters, José Miguel Montañana Aliaga.]
 
6 [http://web.ebscohost.com.www.lib.ncsu.edu:2048/ehost/pdfviewer/pdfviewer?vid=2&hid=15&sid=72e3828d-3cb1-42b9-8198-5c1e974ea53f@sessionmgr4 Adaptive Fault Tolerant Routing Algorithm for Tree-Hypercube Multicomputer, Qatawneh Mohammad]
 
7 [http://www.top500.org TOP500 Supercomputing Sites]

CSC/ECE 506 Spring 2011/ch12 aj

2011-04-18T06:20:58Z

Asbransc:

<h1>Interconnection Network Architecture </h1>


In a multi-processor system, processors need to communicate with each other and access each other's resources. In order to route data and messages between processors, an interconnection architecture is needed.


 


Typically, in a multiprocessor system, messages passed between processors are frequent and short1. Therefore, the interconnection network architecture must handle messages quickly by having '''low latency''', and must handle several messages at a time by having '''high bandwidth'''.


 


In a network, a processor along with its cache and memory is considered a '''node'''. The physical wires that connect nodes are called a '''link'''. The device that routes messages between nodes is called a router. The shape of the network, such as the number of links and routers, is called the network '''topology'''.


<h1>History of Network Topologies</h1>

Hypercube topologies were invented in the 80s and had desirable characteristics when the number of nodes was small (~1000 maximum, often <100) and every processor had to stop working to receive and forward a message 4. The low-radix era began in 1985 and was defined by routers with between 4 and 8 ports using toroidal, mesh or fat-tree topologies and wormhole routing. This era lasted about 20 years until it was determined that routers with dozens of ports offered superior performance. Two topologies were developed to take advantage of the newly developed high-radix routers: the flattened butterfly and the dragonfly. Both are somewhere between a mesh in which each point is a router (or virtual router in the case of dragonfly) with dozens or hundreds of nodes attached, and a fat tree with sufficiently high arity as to only have two levels.

<h2>Interconnection evolution in the Top500 List</h2>
[[Image:Top500interconnect.png]]

This chart shows the evolution over time of the different interconnect topologies by their dominance in the Top500 list of supercomputers 7. As one can see, many technologies came into vogue briefly before losing performance share and disappearing. In the early days of the list, most of the computers list the interconnect type as not applicable. However, the trailing end of the hypercube phase is clear in burnt orange. The dark blue at the top is "other" and the dark red in the middle is "proprietary", so we can only speculate about what topologies they might employ. The toroidal mesh appears briefly at the start in a cream color, and slightly outlasts the hypercube. The two crossbar technologies (blue and olive) followed the toroidal mesh. The fully-distributed crossbar died out quickly, while the multi-stage crossbar lasted longer but was never dominant. The 3-D torus (purple) dominates much of the 90s, with hypercube topologies (dark pink) enjoying a short comeback in the later part of the decade. SP Switch (light olive), an IBM interconnect technology that uses a multi-stage crossbar switch, replaced the 3-D torus. Myrinet, Quadrics, and Federation shared the spotlight in the mid 00s, and each used a similar fat-tree topology. The current class of supercomputers is dominated by nodes connected with either InfiniBand or Gigabit Ethernet. Both can be connected in either a fat-tree or 2-D mesh topology. The primary difference between them is speed. InfiniBand is considerably faster per link and allows links to be ganged into groups of 4 or 12. Gigabit Ethernet is vastly less expensive, however, and some supercomputer designers have apparently chosen to save money on the interconnect technology in order to allow the use of faster nodes.

<h1>Types of Network Topologies </h1>

<h2>Linear Array</h2>

[[Image:Top_linear.jpg]]
 
 
The nodes are connected linearly as in an array. This type of topology is simple; however, it does not scale well. The longest distance between two nodes, or the '''diameter''', is one less than the number of nodes.

 

<h2>Ring</h2>

[[Image:Top_ring.jpg]]
 
 
The ring has a similar structure to the linear array, except that the two end nodes connect to each other, establishing a circular structure. The longest distance between two nodes is cut roughly in half.

 

<h2>2-D Mesh</h2>

[[Image:Top_2Dmesh.jpg]]
 
 
The 2-D mesh can be thought of as several linear arrays put together to form a 2-dimensional structure. Nodes that are not on the edge have 4 input or output links, or a '''degree''' of 4.

 

<h2>2-D Torus</h2>

[[Image:Top_2Dtorus.jpg]]
 
 
The 2-D torus takes the structure of the 2-D mesh and connects the nodes on the edges. This decreases the diameter, but the number of links is higher.
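
The effect of these topologies on diameter can be compared with the standard closed-form expressions. This is a minimal sketch, assuming p nodes for the linear array and ring, and a square k x k arrangement for the 2-D mesh and torus; the function names are illustrative.

```python
def diameter_linear_array(p):
    """End-to-end distance across p nodes in a line."""
    return p - 1


def diameter_ring(p):
    """Farthest node is halfway around the ring."""
    return p // 2


def diameter_mesh_2d(k):
    """Corner to opposite corner of a k x k mesh."""
    return 2 * (k - 1)


def diameter_torus_2d(k):
    """Each dimension of a k x k torus behaves like a ring of k nodes."""
    return 2 * (k // 2)


for k in (4, 8):
    p = k * k
    print(p, diameter_linear_array(p), diameter_ring(p),
          diameter_mesh_2d(k), diameter_torus_2d(k))
```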

 

<h2>Cube</h2>

[[Image:Top_cube.jpg]]
 
 
The cube can be thought of as a three-dimensional mesh.

 

<h2>Hypercube</h2>

[[Image:Top_hypercube.jpg]]
 
 
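The hypercube is essentially multiple cubes put together: a d-dimensional hypercube is built by joining two (d-1)-dimensional hypercubes at their corresponding nodes.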
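
A common way to work with hypercubes is to label each node with a binary number, so that neighbors differ in exactly one bit. The following is a minimal sketch under that assumed labeling; the function names are illustrative.

```python
def hypercube_neighbors(node, dims):
    """Return the neighbors of `node` in a dims-dimensional hypercube.

    Nodes are labeled 0 .. 2**dims - 1; flipping one bit gives one neighbor.
    """
    return [node ^ (1 << d) for d in range(dims)]


def hop_distance(a, b):
    """Minimum number of hops between two nodes: the number of differing bits."""
    return bin(a ^ b).count("1")


print(hypercube_neighbors(0b000, 3))
print(hop_distance(0b000, 0b111))
```

With this labeling every node has a degree equal to the number of dimensions, and the diameter also equals the number of dimensions.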

 

<h2>Tree</h2>

[[Image:Top_tree.jpg]]
 
 
The tree is a hierarchical structure with nodes on the bottom and switching nodes at the upper levels. The tree experiences high traffic at the upper levels.

 

<h2>Fat Tree</h2>

[[Image:Top_fat_tree.jpg]]
 
 
The fat tree alleviates the traffic at upper levels by "fattening" up the links at the upper levels.

 

<h2>Butterfly</h2>

[[Image:Top_butterfly.jpg]]
 
 
The butterfly structure is similar to the tree structure, but it replicates the switching node structure of the tree topology and connects them together so that there are equal links and routers at all levels.

 

<h1>Real-World Implementation of Network Topologies </h1>


In a research study by Andy Hospodor and Ethan Miller, several network topologies were investigated in a high-performance, high-traffic network2. The topologies investigated included the fat tree, butterfly, mesh, tori, and hypercube structures. Advantages and disadvantages including cost, performance, and reliability were discussed.


 


In this experiment, a petabyte-scale network with over 100 GB/s total aggregate bandwidth was investigated. The network consisted of 4096 disks with large servers with routers and switches in between2.


 


The overall structure of the network is shown below. Note that this structure is very susceptible to failure and congestion.
 
[[Image:Disknet_network.jpg]]
 
''Basic structure of Hospodor and Miller's experimental network''2


 

<h2>Fat Tree</h2>

In large-scale, high-performance applications, the fat tree can be a reasonable choice. However, in order to "fatten" up the links, redundant connections must be used: instead of one link between switching nodes, several must be used. The problem is that more input and output links require routers with more input and output ports. Routers with in excess of 100 ports are difficult to build and expensive, so multiple routers would have to be stacked together. Even then, the routers would be expensive and several of them would be required2.
 
 
The Japan Agency for Marine-Earth Science and Technology supercomputing system uses the fat tree topology. The system connects 1280 NEC processors7.

 

<h2>Butterfly</h2>

In high-performance applications, the butterfly structure is a good choice. The butterfly topology uses fewer links than other topologies; however, each link carries traffic from the entire layer. Fault tolerance is poor because there is only a single path between any pair of nodes. Should a link break, data cannot be re-routed, and communication is broken2.
 
[[Image:Disknet_butterfly.jpg]]
 
''Butterfly structure''2

 

<h2>Meshes and Tori</h2>

The mesh and torus structures used in this application would require a large number of links and a total of several thousand ports. However, because there are so many links, the mesh and torus structures provide alternate paths in case of failures2.
 
 
One current example of the torus structure is the QPACE SFB TR Cluster in Germany, which uses PowerXCell 8i processors. The system uses a 3-D torus topology with 4608 processors7.
 
 
[[Image:Disknet_mesh.jpg]]
 
''Mesh structure''2
 
 
[[Image:Disknet_torus.jpg]]
 
''Torus structure''2

 

<h2>Hypercube</h2>

Similar to the tori structures, the hypercube requires a larger number of links. However, its bandwidth scales better than that of the mesh and tori structures.
 
 
The CRAY T3E, CRAY XT3, and SGI Origin 2000 use k-ary n-cube topologies.
 
 
[[Image:Disknet_hypercube.jpg]]
 
''Hypercube structure''2

 

<h1>Comparison of Network Topologies </h1>


The following table shows the total number of ports required for each network topology.
 
 
[[Image:Disknet_ports.jpg]]
 
''Number of ports for each topology''2

 

As the figure above shows, the 6-D hypercube requires the largest number of ports, due to its relatively complex six-dimensional structure. In contrast, the fat tree requires the fewest ports, even though its links have been "fattened" up by using redundant links. The butterfly network requires more than twice as many ports as the fat tree, since it essentially replicates the switching layer of the fat tree. The number of ports for the mesh and tori structures increases as the dimensionality increases. However, with modern router technology, the number of ports is a less important consideration.

 
 
The average path length, or average number of hops, and the average link load (GB/s) are shown below.
 
 
[[Image:Disknet_load.jpg]]
 
''Average path length and link load for each topology''2

 

Looking at the trends, when the average path length is high, the average link load is also high. In other words, average path length and average link load are roughly proportional, because each message occupies a link on every hop it takes. It is clear from the graph that the 2-D mesh has, by far, the worst performance. In a large network such as this, the average path length is simply too high, and the average link load suffers. For this type of high-performance network, the 2-D mesh does not scale well. The 2-D torus cuts the average path length and average link load in half by connecting the edge nodes together, but its performance is still relatively poor compared to the other types. The butterfly and fat tree have the lowest average path length and average link load.

 
 
The figure below shows the cost of the network topologies.
 
 
[[Image:Disknet_cost.jpg]]
 
''Cost of each topology''2

 

Despite using the fewest ports, the fat tree topology has by far the highest cost, because its ports are high-bandwidth 10 GB/s ports. Over 2400 ports of 10 GB/s are required to have enough bandwidth at the upper levels of the tree. This pushes the cost up dramatically and makes the fat tree impractical from a cost standpoint. While the total cost of the fat tree is about 15 million dollars, the rest of the network topologies are clustered below 4 million dollars. As the dimensionality of the mesh and tori structures increases, the cost increases. The cost of the butterfly network falls between that of the 2-D mesh/tori and the 6-D hypercube.

 
 
When cost and average link load are both factored in, the following graph is produced.
 
 
[[Image:Disknet_overall.jpg]]
 
''Overall cost of each topology''2

 

From the figure above, the 6-D hypercube is the most cost-effective choice for this particular network setup. Although the 6-D hypercube costs more because it needs more links and ports, it provides higher bandwidth, which can offset the higher cost. The high-dimensional tori also perform well, but cannot provide as much bandwidth as the 6-D hypercube. For systems that do not need as much bandwidth, a high-dimensional torus is also a good choice. The butterfly topology is another alternative, but it has lower fault tolerance.


<h1>Routing</h1>


The '''routing''' algorithm determines what path a packet of data will take from source to destination. Routing can be '''deterministic''', where the path is always the same for a given source and destination, or '''adaptive''', where the path can change. The routing algorithm can also be '''partially adaptive''', where packets have multiple choices but not all packets are allowed to use the shortest path3.


<h2>Deadlock</h2>

Packets are in '''deadlock''' when they cannot continue to move through the nodes. The illustration below demonstrates this situation.
 
 
[[Image:Routing_deadlock.jpg]]
 
''Example of deadlock''
 
 

Assume that all of the buffers are full at each node. The packet from Node 1 cannot continue to Node 2, the packet from Node 2 cannot continue to Node 3, and so on. Since no packet can move, the packets are deadlocked.

 
 

Deadlock arises from a cyclic pattern of routing, so avoiding deadlock means avoiding circular routing patterns.

 
 

To avoid circular patterns of routing, some routing patterns are disallowed. These are called '''turn restrictions''', where some turns are not allowed in order to avoid making a circular routing pattern. Some of these turn restrictions are mentioned below.


 
<h2>Dimensional ordered (X-Y) routing</h2>

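Turns from the y-dimension to the x-dimension are not allowed, so a packet first travels the full distance in the x-dimension and only then travels in the y-dimension.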


<h2>West First</h2>

Turns to the west are not allowed, so a packet that needs to travel west must do so first.


<h2>North Last</h2>

Turns away from the north direction are not allowed, so a packet travels north only as the last leg of its route.


<h2>Negative First</h2>

Turns from a positive direction to a negative direction (-x or -y) are not allowed, so a packet must make all of its moves in the negative directions first.


<h2>Odd-Even Turn Model</h2>


Unfortunately, the above turn-restriction models reduce the degree of adaptiveness and are only partially adaptive. The models cause some packets to take different routes, and not necessarily the minimal paths. This may cause unfairness and limits the ability of the system to relieve congestion, so overall performance could suffer3.


 


Ge-Ming Chiu introduced the Odd-Even turn model as an adaptive, deadlock-free turn-restriction model that performs better than the previously mentioned models3. The model is designed primarily for 2-D meshes.

 

''Turns from the east to north direction from any node on an even column are not allowed.''
 
''Turns from the north to west direction from any node on an odd column are not allowed.''
 
 
''Turns from the east to south direction from any node on an even column are not allowed.''
 
''Turns from the south to west direction from any node on an odd column are not allowed.''
 
 
 
The illustration below shows allowed routing for different source and destination nodes. Depending on which column the packet is in, only certain directions are allowed.
 
 
[[Image:Routing_odd_even.jpg]]
 
''Odd-Even turn restriction model proposed by Ge-Ming Chiu''3
 

 

<h1>Comparison of Turn Restriction Models</h1>

To compare the performance of various turn restriction models, Chiu simulated a 15 x 15 mesh under various traffic patterns. All channels have a bandwidth of 20 flits/usec and a buffer size of one flit. The dimension-ordered x-y routing, west-first, and negative-first models were compared against the odd-even model.

 

The traffic patterns simulated were uniform, transpose, and hot spot. Uniform simulates each node sending messages to any other node with equal probability. Transpose simulates two opposite nodes sending messages to their respective halves of the mesh. Hot spot simulates a few "hot spot" nodes that receive high traffic.



 
 
[[Image:Routing_uniform.jpg]]
 
''Uniform traffic simulation of various turn restriction models''3
 
 
The performance of the different routing algorithms is shown above for uniform traffic. For uniform traffic, the dimension-ordered x-y model outperforms the rest of the models. As the number of messages increases, the x-y model has the slowest increase in average communication latency.



 
 
[[Image:Routing_transpose.jpg]]
 
''First transpose traffic simulation of various turn restriction models''3
 
 
The performance of the different routing algorithms is shown above for the first transpose traffic. The negative-first model has the best performance, while the odd-even model performs better than the west-first and x-y models.



 
 
[[Image:Routing_transpose2.jpg]]
 
''Second transpose traffic simulation of various turn restriction models''3
 
 
With the second transpose simulation, the odd-even model outperforms the rest.



 
 
[[Image:Routing_hotspot.jpg]]
 
''Hotspot traffic simulation of various turn restriction models''3
 
 
The performance of the different routing algorithms is shown above for the hotspot traffic. Only one hotspot was simulated for this test. The odd-even model outperforms the other models when hotspot traffic is 10%.



 
 
[[Image:Routing_hotspot2.jpg]]
 
''Second hotspot traffic simulation of various turn restriction models''3
 
 
When the number of hotspots is increased to five, the odd-even model begins to shine: its latency is the lowest for both 6 and 8 percent hotspot traffic. Meanwhile, the x-y model performs very poorly.


 
 

While the x-y model performs well in uniform traffic, it lacks adaptiveness. When traffic concentrates on hotspots, the x-y model cannot adapt and re-route traffic to avoid the congestion they cause. The odd-even model has superior adaptiveness under high congestion.


 

<h1>Router Architecture</h1>

The '''router''' is a device that routes incoming data to its destination. It has several input ports and several output ports, and data arriving on one of the input ports is routed to one of the output ports. Which output port is chosen depends on the destination of the data and on the routing algorithm.

 
 

The internal architecture of a router consists of input and output ports and a '''crossbar switch'''. The crossbar switch connects each input to the output selected for it, acting essentially as a set of multiplexers.

 
 

Router technology has improved significantly over the years, which has made networks with high dimensionality feasible. As shown in the real-world example above, high-dimensional tori and hypercubes are excellent choices of topology for high-performance networks. The cost of high-performance, high-radix routers has contributed to the viability of these types of high-dimensionality networks. As the graph below shows, the bandwidth of routers has improved tremendously over a period of 10 years4.


 
[[Image:Router_bandwidth.jpg]]
 
''Bandwidth of various routers over 10 year period''4

 

Looking at the physical architecture and layout of routers, it is evident that the circuitry has become dramatically more dense and complex.


 
[[Image:Router_physical.jpg]]
 
''Router hardware over period of time''4
 

 
[[Image:Router_radix.jpg]]
 
''Radix and latency of routers over 10 year period''4
 
 

The '''radix''', or the number of ports of a router, has also increased. Current technology not only has a high radix, but also lower latency than the previous generation. As radix increases, the latency remains steady.


 


As router technology improves, more complex, high-dimensionality topologies become possible.


 

<h1>Fault Tolerant Routing</h1>

Fault-tolerant routing means the successful routing of messages between any pair of non-faulty nodes in the presence of faulty components6. With the increased number of processors in a multiprocessor system and high data rates, reliable transmission of data in the event of a network fault is of great concern, and hence fault-tolerant routing algorithms are important.


<h2>Fault Models</h2>

Faults in a network can be categorized in two types:
 
 
1.'''Transient Faults'''5: A transient fault is a temporary fault that occurs for a very short duration of time. It can be caused, for example, by a change in the output of a flip-flop that leads to the generation of an invalid header. These faults can be minimized using error control coding, and they are generally evaluated in terms of Bit Error Rate.
 
 
2.'''Permanent Faults'''5: A permanent fault is a fault that does not go away and causes permanent damage to the network. It could be due to damaged wires and associated circuitry. These faults are generally evaluated in terms of Mean Time Between Failures.


<h2>Fault Tolerance Mechanisms (for permanent faults)</h2>

Permanent faults can be handled using one of two mechanisms:
 
 
1.'''Static Mechanism''': In the static fault tolerance model, once a fault is detected, all the processes running in the system are stopped and the routing tables are emptied. The routing tables are then recalculated from the fault information to provide fault-free paths.
 
 
2.'''Dynamic Mechanisms''': In the dynamic fault tolerance model, the processes in the network are not completely stalled, and only the affected regions are repaired. Some of the methods to do this are:
 
 
a.'''Block Faults''': In this method, many of the healthy nodes in the vicinity of the faulty nodes are also marked as faulty, so that no routes are created close to the actual faulty nodes. The shape of the region can be convex or non-convex, and care is taken that none of the new routes introduces a cyclic dependency in the channel dependency graph (CDG).
 
 
DISADVANTAGE: This method causes many healthy nodes to be declared faulty, which reduces system capacity.
 
 
[[Image:Fault_pic1.jpg]]
 
 
b.'''Fault Rings''': This method was introduced by Chalasani and Boppana. A fault ring is a set of nodes and links that are adjacent to faulty nodes/links. This approach reduces the number of healthy nodes that must be marked as faulty and blocked.
 
 
[[Image:Fault_pic2.jpg]]
 


 

<h1>References</h1>


1 Solihin text
 
2 [http://www.ssrc.ucsc.edu/Papers/hospodor-mss04.pdf Interconnection Architectures for Petabyte-Scale High-Performance Storage Systems]
 
3 [http://www.diit.unict.it/~vcatania/COURSES/semm_05-06/DOWNLOAD/noc_routing02.pdf The Odd-Even Turn Model for Adaptive Routing]
 
4 [http://www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf Interconnection Topologies:(Historical Trends and Comparisons)]
 
5 [http://dspace.upv.es/xmlui/bitstream/handle/10251/2603/tesisUPV2824.pdf?sequence=1 Efficient mechanisms to provide fault tolerance in interconnection networks for PC clusters, José Miguel Montañana Aliaga.]
 
6 [http://web.ebscohost.com.www.lib.ncsu.edu:2048/ehost/pdfviewer/pdfviewer?vid=2&hid=15&sid=72e3828d-3cb1-42b9-8198-5c1e974ea53f@sessionmgr4 Adaptive Fault Tolerant Routing Algorithm for Tree-Hypercube Multicomputer, Qatawneh Mohammad]
 
7 [http://www.top500.org TOP500 Supercomputing Sites]

CSC/ECE 506 Spring 2011/ch12 aj

2011-04-18T06:11:05Z

Asbransc: history

<h1>Interconnection Network Architecture </h1>


In a multi-processor system, processors need to communicate with each other and access each other's resources. In order to route data and messages between processors, an interconnection architecture is needed.


 


Typically, in a multiprocessor system, messages passed between processors are frequent and short1. Therefore, the interconnection network architecture must handle messages quickly by having '''low latency''', and must handle several messages at a time by having '''high bandwidth'''.


 


In a network, a processor along with its cache and memory is considered a '''node'''. The physical wires that connect two nodes are called a '''link'''. The device that routes messages between nodes is called a '''router'''. The shape of the network, such as the number of links and routers, is called the network '''topology'''.


<h1>History of Network Topologies</h1>

Hypercube topologies were invented in the 80s and had desirable characteristics when the number of nodes was small (~1000 maximum, often <100) and every processor had to stop working to receive and forward a message 4. The low-radix era began in 1985 and was defined by routers with between 4 and 8 ports using toroidal, mesh or fat-tree topologies and wormhole routing. This era lasted about 20 years, until it was determined that routers with dozens of ports offered superior performance. Two topologies were developed to take advantage of the newly developed high-radix routers: the flattened butterfly and the dragonfly. Both sit somewhere between a mesh in which each point is a router (or a virtual router in the case of dragonfly) with dozens or hundreds of nodes attached, and a fat tree with sufficiently high arity to have only two levels.

<h2>Interconnection evolution in the Top500 List</h2>
[[Image:Top500interconnect.png]]
This chart shows the evolution over time of the different interconnect families by their dominance in the Top500 list of supercomputers 7. Many technologies came into vogue briefly before losing performance share and disappearing. In the early days of the list, most of the computers report the interconnect type as not applicable. However, the trailing end of the hypercube phase is clear in burnt orange. The dark blue at the top is "other" and the dark red in the middle is "proprietary", so we can only speculate about which topologies they employ. The toroidal mesh appears briefly at the start in a cream color and slightly outlasts the hypercube. The two crossbar technologies (blue and olive) followed the toroidal mesh. The fully-distributed crossbar died out quickly, while the multi-stage crossbar lasted longer but was never dominant. The 3-D torus (purple) dominates much of the 90s, with hypercube topologies (dark pink) enjoying a short comeback in the later part of the decade. SP Switch (light olive), an IBM interconnect technology that uses a multi-stage crossbar switch, replaced the 3-D torus. Myrinet, Quadrics, and Federation shared the spotlight in the mid 00s, and each used a similar fat-tree topology. The current class of supercomputers is dominated by nodes connected with either Infiniband or gigabit ethernet. Both can be connected in either a fat-tree or 2-D mesh topology. The primary difference between them is speed. Infiniband is considerably faster per link and allows links to be ganged into groups of 4 or 12. Gigabit ethernet is vastly less expensive, however, and some supercomputer designers have apparently chosen to save money on the interconnect in order to allow the use of faster nodes.

<h1>Types of Network Topologies </h1>

<h2>Linear Array</h2>

[[Image:Top_linear.jpg]]
 
 
The nodes are connected linearly as in an array. This type of topology is simple; however, it does not scale well. The longest distance between two nodes, or the '''diameter''', is one less than the number of nodes.

 

<h2>Ring</h2>

[[Image:Top_ring.jpg]]
 
 
The ring has a structure similar to the linear array, except that the end nodes connect to each other, forming a circle. The longest distance between two nodes is cut roughly in half.
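
The effect of closing the array into a ring can be checked with a small sketch. The helper names (<code>linear_array</code>, <code>ring</code>, <code>diameter</code>) are made up for this illustration, and distance is counted in hops between neighbouring nodes.

```python
from collections import deque


def linear_array(p):
    """Adjacency lists for p nodes connected in a line."""
    return {i: [j for j in (i - 1, i + 1) if 0 <= j < p] for i in range(p)}


def ring(p):
    """Adjacency lists for p nodes connected in a circle."""
    return {i: [(i - 1) % p, (i + 1) % p] for i in range(p)}


def diameter(adjacency):
    """Longest shortest-path distance (in hops) over all node pairs."""
    def farthest(start):
        # Breadth-first search gives hop counts from `start`.
        dist = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in adjacency[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        return max(dist.values())

    return max(farthest(node) for node in adjacency)


print(diameter(linear_array(8)))  # 7
print(diameter(ring(8)))          # 4
```

For 8 nodes the linear array needs up to 7 hops, while the ring needs at most 4.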

 

<h2>2-D Mesh</h2>

[[Image:Top_2Dmesh.jpg]]
 
 
The 2-D mesh can be thought of as several linear arrays put together to form a 2-dimensional structure. Nodes that are not on the edge have 4 input or output links, or a '''degree''' of 4.

 

<h2>2-D Torus</h2>

[[Image:Top_2Dtorus.jpg]]
 
 
The 2-D torus takes the structure of the 2-D mesh and connects the nodes on the edges. This decreases the diameter, but the number of links is higher.
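
As a rough illustration of this trade-off, the sketch below counts links and computes the diameter for a k x k mesh and torus. The function names are hypothetical, and a square grid with k of at least 3 is assumed so that the wraparound links are distinct from the interior links.

```python
from itertools import product


def grid_links(k, wraparound):
    """Set of undirected links in a k x k mesh (or torus if wraparound)."""
    links = set()
    for x, y in product(range(k), repeat=2):
        for dx, dy in ((1, 0), (0, 1)):
            nx, ny = x + dx, y + dy
            if wraparound:
                nx, ny = nx % k, ny % k      # edge nodes connect back around
            elif nx >= k or ny >= k:
                continue                     # a mesh has no link past the edge
            links.add(frozenset([(x, y), (nx, ny)]))
    return links


def grid_diameter(k, wraparound):
    """Worst-case hop count: the per-dimension worst case, summed over 2 dimensions."""
    per_dimension = k // 2 if wraparound else k - 1
    return 2 * per_dimension


for name, wrap in (("mesh", False), ("torus", True)):
    print(name, len(grid_links(4, wrap)), grid_diameter(4, wrap))
```

For a 4 x 4 grid this gives 24 links and a diameter of 6 for the mesh, against 32 links and a diameter of 4 for the torus.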

 

<h2>Cube</h2>

[[Image:Top_cube.jpg]]
 
 
The cube can be thought of as a three-dimensional mesh.

 

<h2>Hypercube</h2>

[[Image:Top_hypercube.jpg]]
 
 
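The hypercube generalizes the cube to more dimensions. A d-dimensional hypercube has 2^d nodes, each connected to d neighbors, and can be built by connecting the corresponding nodes of two (d-1)-dimensional hypercubes.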
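
One common way to reason about a hypercube is to label each node with a binary number so that neighbours differ in exactly one bit. That labelling is an assumption of this sketch rather than something taken from the text above, and the function names are illustrative only.

```python
def hypercube_neighbors(node, dimensions):
    """Neighbours of `node`: flip each of its `dimensions` address bits in turn."""
    return [node ^ (1 << bit) for bit in range(dimensions)]


def hop_distance(a, b):
    """Minimum hops between two nodes: the number of address bits that differ."""
    return bin(a ^ b).count("1")


print(hypercube_neighbors(0b000, 3))  # [1, 2, 4]
print(hop_distance(0b000, 0b111))     # 3
```

With this labelling, a 3-dimensional hypercube is the ordinary cube: 8 nodes, each with 3 neighbours, and at most 3 hops between any pair.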

 

<h2>Tree</h2>

[[Image:Top_tree.jpg]]
 
 
The tree is a hierarchical structure with nodes at the bottom and switching nodes at the upper levels. The tree experiences high traffic at the upper levels.

 

<h2>Fat Tree</h2>

[[Image:Top_fat_tree.jpg]]
 
 
The fat tree alleviates the traffic at upper levels by "fattening" up the links at the upper levels.

 

<h2>Butterfly</h2>

[[Image:Top_butterfly.jpg]]
 
 
The butterfly structure is similar to the tree structure, but it replicates the switching node structure of the tree topology and connects them together so that there are equal links and routers at all levels.

 

<h1>Real-World Implementation of Network Topologies </h1>


In a research study by Andy Hospodor and Ethan Miller, several network topologies were investigated in a high-performance, high-traffic network2. The topologies included the fat tree, butterfly, mesh, torus, and hypercube structures. Advantages and disadvantages including cost, performance, and reliability were discussed.


 


In this experiment, a petabyte-scale network with over 100 GB/s total aggregate bandwidth was investigated. The network consisted of 4096 disks with large servers with routers and switches in between2.


 


The overall structure of the network is shown below. Note that this structure is very susceptible to failure and congestion.
 
[[Image:Disknet_network.jpg]]
 
''Basic structure of Hospodor and Miller's experimental network''2


 

<h2>Fat Tree</h2>

In large-scale, high-performance applications, the fat tree can be a choice. However, in order to "fatten" up the links, redundant connections must be used. Instead of using one link between switching nodes, several must be used. The problem with this is that more input and output links require routers with more input and output ports. Routers with more than 100 ports are difficult to build and expensive, so multiple routers would have to be stacked together. Even so, several of these expensive routers would be required2.
 
 
The Japan Agency for Marine-Earth Science and Technology supercomputing system uses the fat tree topology. The system connects 1280 NEC processors7.

 

<h2>Butterfly</h2>

In high-performance applications, the butterfly structure is a good choice. The butterfly topology uses fewer links than other topologies; however, each link carries traffic from the entire layer. Fault tolerance is poor because only a single path exists between any pair of nodes. Should a link break, data cannot be re-routed, and communication is broken2.
 
[[Image:Disknet_butterfly.jpg]]
 
''Butterfly structure''2

 

<h2>Meshes and Tori</h2>

The mesh and torus structures used in this application would require a large number of links and a total of several thousand ports. However, since there are so many links, the mesh and torus structures provide alternate paths in case of failures2.
 
 
One example of current use of the torus structure is the QPACE SFB TR Cluster in Germany, which uses PowerXCell 8i processors. The system uses a 3-D torus topology with 4608 processors7.
 
 
[[Image:Disknet_mesh.jpg]]
 
''Mesh structure''2
 
 
[[Image:Disknet_torus.jpg]]
 
''Torus structure''2

 

<h2>Hypercube</h2>

Similar to the torus structures, the hypercube requires a larger number of links. However, its bandwidth scales better than that of the mesh and torus structures.
 
 
The CRAY T3E, CRAY XT3, and SGI Origin 2000 use k-ary n-cube topologies.
 
 
[[Image:Disknet_hypercube.jpg]]
 
''Hypercube structure''2

 

<h1>Comparison of Network Topologies </h1>


The following table shows the total number of ports required for each network topology.
 
 
[[Image:Disknet_ports.jpg]]
 
''Number of ports for each topology''2

 

As the figure above shows, the 6-D hypercube requires the largest number of ports, due to its relatively complex six-dimensional structure. In contrast, the fat tree requires the fewest ports, even though its links have been "fattened" up by using redundant links. The butterfly network requires more than twice as many ports as the fat tree, since it essentially replicates the switching layer of the fat tree. The number of ports for the mesh and torus structures increases as the dimensionality increases.

 
 
The average path length, or average number of hops, and the average link load (GB/s) are shown below.
 
 
[[Image:Disknet_load.jpg]]
 
''Average path length and link load for each topology''2

 

Looking at the trends, when the average path length is high, the average link load is also high. In other words, average link load rises with average path length. The graph shows that the 2-D mesh has, by far, the worst performance. In a large network such as this, the average path length is too high, and the average link load suffers. For this type of high-performance network, the 2-D mesh does not scale well. The 2-D torus cuts the average path length and average link load in half by connecting the edge nodes together, but its performance is still relatively poor compared to the other topologies. The butterfly and fat tree have the lowest average path length and average link load.

 
 
The figure below shows the cost of the network topologies.
 
 
[[Image:Disknet_cost.jpg]]
 
''Cost of each topology''2

 

Despite using the fewest ports, the fat tree topology has by far the highest cost. Its ports are high-bandwidth ports of 10 GB/s, and over 2400 of them are required to provide enough bandwidth at the upper levels of the tree. This pushes the cost up dramatically and makes the fat tree impractical from a cost standpoint. While the total cost of the fat tree is about 15 million dollars, the rest of the network topologies are clustered below 4 million dollars. As the dimensionality of the mesh and torus structures increases, the cost increases. The cost of the butterfly network falls between that of the 2-D mesh/torus and the 6-D hypercube.

 
 
When both the cost and the average link load are factored in, the following graph is produced.
 
 
[[Image:Disknet_overall.jpg]]
 
''Overall cost of each topology''2

 

From the figure above, the 6-D hypercube is the most cost-effective choice for this particular network setup. Although the 6-D hypercube costs more because it needs more links and ports, it provides higher bandwidth, which can offset the higher cost. The high-dimensional tori also perform well, but cannot provide as much bandwidth as the 6-D hypercube. For systems that do not need as much bandwidth, a high-dimensional torus is also a good choice. The butterfly topology is another alternative, but it has lower fault tolerance.


<h1>Routing</h1>


The '''routing''' algorithm determines what path a packet of data will take from source to destination. Routing can be '''deterministic''', where the path is always the same for a given source and destination, or '''adaptive''', where the path can change. The routing algorithm can also be '''partially adaptive''', where packets have multiple choices, but not all packets are allowed to use the shortest path3.


<h2>Deadlock</h2>

Packets are in '''deadlock''' when they cannot continue to move through the nodes. The illustration below demonstrates this event.
 
 
[[Image:Routing_deadlock.jpg]]
 
''Example of deadlock''
 
 

Assume that all of the buffers are full at each node. The packet from Node 1 cannot continue to Node 2. The packet from Node 2 cannot continue to Node 3, and so on. Since none of the packets can move, they are deadlocked.

 
 

The deadlock arises from a cyclic pattern of routing. To avoid deadlock, circular routing patterns must be avoided.

 
 

To avoid circular patterns of routing, some routing patterns are disallowed. These are called '''turn restrictions''', where some turns are not allowed in order to avoid making a circular routing pattern. Some of these turn restrictions are mentioned below.
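
Whether a set of permitted routes can deadlock in this way can be checked by looking for a cycle among the channels that wait on each other. The sketch below is a generic depth-first cycle check, not a procedure taken from the cited papers. The <code>dependencies</code> mapping and the channel names are hypothetical: each channel maps to the channels a packet holding it may request next.

```python
def has_cycle(dependencies):
    """Return True if the directed dependency graph contains a cycle."""
    UNVISITED, IN_PROGRESS, DONE = 0, 1, 2
    state = {}

    def visit(channel):
        state[channel] = IN_PROGRESS
        for nxt in dependencies.get(channel, ()):
            seen = state.get(nxt, UNVISITED)
            if seen == IN_PROGRESS:
                return True          # reached a channel already on the current path
            if seen == UNVISITED and visit(nxt):
                return True
        state[channel] = DONE
        return False

    return any(
        state.get(channel, UNVISITED) == UNVISITED and visit(channel)
        for channel in dependencies
    )


# Four channels around a square, each waiting on the next one.
circular = {"1->2": ["2->3"], "2->3": ["3->4"], "3->4": ["4->1"], "4->1": ["1->2"]}
# The same channels with one turn disallowed, which breaks the cycle.
restricted = {"1->2": ["2->3"], "2->3": ["3->4"], "3->4": ["4->1"], "4->1": []}

print(has_cycle(circular))    # True
print(has_cycle(restricted))  # False
```

Removing a single dependency is enough to break the cycle in this example, which is the idea behind turn restrictions.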


 
<h2>Dimensional ordered (X-Y) routing</h2>

Turns from the y-dimension to the x-dimension are not allowed.
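
A minimal sketch of this rule on a 2-D mesh, assuming nodes are addressed by (x, y) coordinates and the function name <code>xy_route</code> is hypothetical: the packet finishes all of its x-dimension hops before taking any y-dimension hop, so it never turns from y back to x.

```python
def xy_route(src, dst):
    """Return the list of nodes visited, travelling along x first and then y."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:               # x-dimension hops first
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:               # then y-dimension hops only
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


print(xy_route((0, 0), (2, 1)))  # [(0, 0), (1, 0), (2, 0), (2, 1)]
```

Because the path depends only on the source and destination, this is a deterministic routing algorithm.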


<h2>West First</h2>

Turns to the west are not allowed.


<h2>North Last</h2>

Turns after a north direction are not allowed.


<h2>Negative First</h2>

Turns in the negative direction (-x or -y) are not allowed, except on the first turn.


<h2>Odd-Even Turn Model</h2>


Unfortunately, the above turn-restriction models reduce the degree of adaptiveness and are only partially adaptive. The models cause some packets to take different routes, and not necessarily the minimal paths. This may cause unfairness and also limits the ability of the system to reduce congestion. Overall performance could suffer3.


 


Ge-Ming Chiu introduces the Odd-Even turn model as an adaptive turn restriction, deadlock-free model that has better performance than the previously mentioned models3. The model is designed primarily for 2-D meshes.

 

''Turns from the east to north direction from any node on an even column are not allowed.''
 
''Turns from the north to west direction from any node on an odd column are not allowed.''
 
 
''Turns from the east to south direction from any node on an even column are not allowed.''
 
''Turns from the south to west direction from any node on an odd column are not allowed.''
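
The four rules above can be written as a small lookup. This sketch makes two assumptions that the text does not state: columns are numbered starting from 0, and a "turn from east to north" means a packet currently travelling east that changes to travelling north. The names <code>FORBIDDEN</code> and <code>turn_allowed</code> are made up for the example.

```python
# (current heading, new heading) pairs that are disallowed, by column parity.
FORBIDDEN = {
    "even": {("E", "N"), ("E", "S")},
    "odd": {("N", "W"), ("S", "W")},
}


def turn_allowed(column, heading, new_heading):
    """Check a single turn at a node in the given column against the rules."""
    parity = "even" if column % 2 == 0 else "odd"
    return (heading, new_heading) not in FORBIDDEN[parity]


print(turn_allowed(2, "E", "N"))  # False: east-to-north in an even column
print(turn_allowed(3, "E", "N"))  # True: the same turn in an odd column
print(turn_allowed(3, "S", "W"))  # False: south-to-west in an odd column
```

Unlike the earlier models, no turn is banned everywhere; each turn is only restricted in half of the columns.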
 
 
 
The illustration below shows allowed routing for different source and destination nodes. Depending on which column the packet is in, only certain directions are allowed.
 
 
[[Image:Routing_odd_even.jpg]]
 
''Odd-Even turn restriction model proposed by Ge-Ming Chiu''3
 

 

<h1>Comparison of Turn Restriction Models</h1>

To compare the performance of various turn restriction models, Chiu simulated a 15 x 15 mesh under various traffic patterns. All channels have a bandwidth of 20 flits/usec and a buffer size of one flit. The dimension-ordered x-y routing, west-first, and negative-first models were compared against the odd-even model.

 

Simulations were conducted with uniform, transpose, and hot spot traffic patterns. Uniform simulates a node sending messages to any other node with equal probability. Transpose simulates two opposite nodes sending messages to their respective halves of the mesh. Hot spot simulates a few "hot spot" nodes that receive high traffic.



 
 
[[Image:Routing_uniform.jpg]]
 
''Uniform traffic simulation of various turn restriction models''3
 
 
The performance of the different routing algorithms is shown above for the uniform traffic. For uniform traffic, the dimension-ordered x-y model outperforms the rest of the models. As the number of messages increases, the x-y model has the "slowest" increase in average communication latency.



 
 
[[Image:Routing_transpose.jpg]]
 
''First transpose traffic simulation of various turn restriction models''3
 
 
The performance of the different routing algorithms is shown above for the first transpose traffic. The negative-first model has the best performance, while the odd-even model performs better than the west-first and x-y models.



 
 
[[Image:Routing_transpose2.jpg]]
 
''Second transpose traffic simulation of various turn restriction models''3
 
 
With the second transpose simulation, the odd-even model outperforms the rest.



 
 
[[Image:Routing_hotspot.jpg]]
 
''Hotspot traffic simulation of various turn restriction models''3
 
 
The performance of the different routing algorithms is shown above for the hotspot traffic. Only one hotspot was simulated for this test. The odd-even model outperforms the other models when hotspot traffic is 10%.



 
 
[[Image:Routing_hotspot2.jpg]]
 
''Second hotspot traffic simulation of various turn restriction models''3
 
 
When the number of hotspots is increased to five, the odd-even model shows its advantage. Its latency is the lowest for both 6 and 8 percent hotspot traffic. Meanwhile, the x-y model performs very poorly.


 
 

While the x-y model performs well in uniform traffic, it lacks adaptiveness. When traffic is concentrated on hotspots, the x-y model cannot adapt and re-route traffic to avoid the congestion the hotspots cause. The odd-even model has superior adaptiveness under high congestion.


 

<h1>Router Architecture</h1>

The '''router''' is a device that routes incoming data to its destination. It does this by having several input ports and several output ports. Data arriving on one of the input ports is routed to one of the output ports. Which output port is chosen depends on the destination of the data and the routing algorithm.

 
 

The internal architecture of a router consists of input and output ports and a '''crossbar switch'''. The crossbar switch connects each input to the output selected for it, acting essentially as a multiplexer.
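
The multiplexer view of the crossbar can be sketched in a few lines. This is a toy model under stated assumptions, not a description of any real router: the class name <code>Crossbar</code> and its methods are hypothetical, and each output port is allowed to listen to at most one input port at a time.

```python
class Crossbar:
    """Toy crossbar: every output port selects at most one input port."""

    def __init__(self, num_inputs, num_outputs):
        self.num_inputs = num_inputs
        self.num_outputs = num_outputs
        self.select = {}  # output port -> input port currently driving it

    def connect(self, input_port, output_port):
        if not (0 <= input_port < self.num_inputs):
            raise ValueError("no such input port")
        if not (0 <= output_port < self.num_outputs):
            raise ValueError("no such output port")
        if output_port in self.select:
            raise ValueError("output port already in use")
        self.select[output_port] = input_port

    def forward(self, inputs):
        """Value appearing at each output port (None if nothing is connected)."""
        return [
            inputs[self.select[out]] if out in self.select else None
            for out in range(self.num_outputs)
        ]


xbar = Crossbar(3, 3)
xbar.connect(0, 2)   # input 0 drives output 2
xbar.connect(2, 0)   # input 2 drives output 0
print(xbar.forward(["a", "b", "c"]))  # ['c', None, 'a']
```

In a real router the routing algorithm decides which connections to request, and arbitration resolves the case where two inputs want the same output; here that conflict is simply reported as an error.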

 
 

Router technology has improved significantly over the years. This has allowed networks with high dimensionality to become feasible. As shown in the real-world example above, high-dimensional tori and hypercubes are excellent topology choices for high-performance networks. The cost of high-performance, high-radix routers has contributed to the viability of these types of high-dimensionality networks. As the graph below shows, the bandwidth of routers has improved tremendously over a period of 10 years4.


 
[[Image:Router_bandwidth.jpg]]
 
''Bandwidth of various routers over 10 year period''4

 

Looking at the physical architecture and layout of routers, it is evident that the circuitry has become dramatically more dense and complex.


 
[[Image:Router_physical.jpg]]
 
''Router hardware over period of time''4
 

 
[[Image:Router_radix.jpg]]
 
''Radix and latency of routers over 10 year period''4
 
 

The '''radix''', or the number of ports on a router, has also increased. Current technology not only has a high radix, but also lower latency than the last generation. As radix increases, the latency remains steady.


 


With high-performance routers, complex topologies are possible, and as router technology improves, even more complex, high-dimensionality topologies become feasible.


 

<h1>Fault Tolerant Routing</h1>

Fault-tolerant routing means the successful routing of messages between any pair of non-faulty nodes in the presence of faulty components6. With the increased number of processors in a multiprocessor system and high data rates, reliable transmission of data in the event of a network fault is of great concern, and hence fault-tolerant routing algorithms are important.


<h2>Fault Models</h2>

Faults in a network can be categorized into two types:
 
 
1.'''Transient Faults'''5: A transient fault is a temporary fault that lasts for a very short time. It can be caused, for example, by a change in the output of a flip-flop that leads to the generation of an invalid header. These faults can be minimized using error control coding and are generally evaluated in terms of bit error rate.
 
 
2.'''Permanent Faults'''5: A permanent fault is a fault that does not go away and causes permanent damage to the network. It can be caused by damaged wires and associated circuitry. These faults are generally evaluated in terms of mean time between failures.


<h2>Fault Tolerance Mechanisms (for permanent faults)</h2>

Permanent faults can be handled using one of two mechanisms:
 
 
1.'''Static Mechanism''': In the static fault tolerance model, once a fault is detected, all the processes running in the system are stopped and the routing tables are emptied. The routing tables are then recalculated from the fault information to provide fault-free paths.
 
 
2.'''Dynamic Mechanisms''': In the dynamic fault tolerance model, the processes in the network are not completely stalled, and only the affected regions are repaired. Some of the methods to do this are:
 
 
a.'''Block Faults''': In this method, many of the healthy nodes in the vicinity of the faulty nodes are also marked as faulty so that no routes are created close to the actual faulty nodes. The shape of the resulting region can be convex or non-convex, and care is taken that none of the new routes introduces a cycle in the channel dependency graph (CDG).
 
 
DISADVANTAGE: This method causes many healthy nodes to be declared faulty, which reduces system capacity.
 
 
[[Image:Fault_pic1.jpg]]
 
 
b.'''Fault Rings''': This method was introduced by Chalasani and Boppana. A fault ring is a set of nodes and links that are adjacent to faulty nodes/links. This approach reduces the number of healthy nodes that must be marked as faulty and blocked.
 
 
[[Image:Fault_pic2.jpg]]
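
The recalculation step of the static mechanism can be illustrated with a breadth-first search that skips faulty nodes. This is only a sketch under simplifying assumptions: a 2-D mesh addressed by (x, y), a hypothetical function name <code>recompute_route</code>, and no check that the new routes are deadlock-free, which a real implementation would also need.

```python
from collections import deque


def recompute_route(width, height, faulty, src, dst):
    """Shortest fault-free path on a width x height mesh, or None if cut off."""
    if src in faulty or dst in faulty:
        return None
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            # Walk the parent pointers back to the source.
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        x, y = node
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            nx, ny = nxt
            inside = 0 <= nx < width and 0 <= ny < height
            if inside and nxt not in faulty and nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None


route = recompute_route(3, 3, {(1, 0)}, (0, 0), (2, 0))
print(len(route) - 1)  # 4 hops instead of the 2 hops of the fault-free mesh
```

With node (1, 0) faulty, the route between (0, 0) and (2, 0) must detour through the next row, which doubles the hop count in this small example.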
 


 

<h1>References</h1>


1 Solihin text
 
2 [http://www.ssrc.ucsc.edu/Papers/hospodor-mss04.pdf Interconnection Architectures for Petabyte-Scale High-Performance Storage Systems]
 
3 [http://www.diit.unict.it/~vcatania/COURSES/semm_05-06/DOWNLOAD/noc_routing02.pdf The Odd-Even Turn Model for Adaptive Routing]
 
4 [http://www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf Interconnection Topologies:(Historical Trends and Comparisons)]
 
5 [http://dspace.upv.es/xmlui/bitstream/handle/10251/2603/tesisUPV2824.pdf?sequence=1 Efficient mechanisms to provide fault tolerance in interconnection networks for PC clusters, José Miguel Montañana Aliaga.]
 
6 [http://web.ebscohost.com.www.lib.ncsu.edu:2048/ehost/pdfviewer/pdfviewer?vid=2&hid=15&sid=72e3828d-3cb1-42b9-8198-5c1e974ea53f@sessionmgr4 Adaptive Fault Tolerant Routing Algorithm for Tree-Hypercube Multicomputer, Qatawneh Mohammad]
 
7 [http://www.top500.org TOP500 Supercomputing Sites]

File:Top500interconnect.png

2011-04-18T04:51:13Z

Asbransc: The evolution of interconnect families over time.

The evolution of interconnect families over time.

CSC/ECE 506 Spring 2011/ch12 aj

2011-04-18T04:40:17Z

Asbransc:

CSC/ECE 506 Spring 2011/ch6b ab

2011-03-01T04:37:17Z

Asbransc: fixed the format of the references

==Overview==
Cache addressing has a significant impact on the performance of the cache, determining cache latency and when the cache must be flushed. Since a cache is designed purely to improve performance, the addressing scheme must be a prime consideration.

==Cache Addressing==
The data in a CPU cache is addressed using an index and a tag. The index is used to find the cache line where the block containing the data being sought might be stored and the tag is used to determine if the data contained in any of the blocks at that line is indeed the data being sought. Each of these two lookup operations can proceed using either the physical or the virtual address. This leads to four possible schemes for cache addressing.
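
As a concrete illustration of the index and tag, the sketch below splits an address into tag, index, and block offset. The cache geometry (64-byte blocks, 256 lines) and the function name <code>split_address</code> are assumptions chosen for the example, and both sizes must be powers of two. The same split applies whether the address passed in is virtual or physical.

```python
def split_address(address, block_size=64, num_lines=256):
    """Split an address into (tag, index, offset) for the assumed geometry."""
    offset_bits = block_size.bit_length() - 1   # log2 of the block size
    index_bits = num_lines.bit_length() - 1     # log2 of the number of lines
    offset = address & (block_size - 1)
    index = (address >> offset_bits) & (num_lines - 1)
    tag = address >> (offset_bits + index_bits)
    return tag, index, offset


tag, index, offset = split_address(0x12345678)
print(hex(tag), hex(index), hex(offset))  # 0x48d1 0x59 0x38
```

The index picks the cache line to inspect, and the stored tag is then compared with the tag computed here to decide whether the access is a hit.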

===Virtually Indexed, Virtually Tagged===
In a cache that uses the virtual address for both the index and the tag, no address translation is required on a cache hit. Thus the TLB and page table are only used on a cache miss. This allows for expedient retrieval of the requested data from the cache since no translation lookup occurs and the operand of the load or store instruction can be used as-is. However, after a context switch, the same virtual addresses can refer to completely different data, so the cache must recognize this and flush on a context switch, or at the very least flush the lines that conflict. Another issue with a VIVT cache is that the same data may have different virtual addresses if it is shared among different threads/processes. This data would be stored at multiple places in the cache even though it originates from a single memory location. [[#References|[1]]]

===Physically Indexed, Physically Tagged===
A lookup in this type of cache requires an address translation be the first step of any memory access. Thus the TLB must be large enough to contain references for the data in the cache otherwise the address translation would require a main memory access even on a cache hit, defeating the purpose of caching the value in the first place. The time to translate the address through the TLB is still non-negligible and is added on the front of the latency incurred by the cache lookup itself. After the address translation is complete, the cache uses the resultant physical address to find the line and check the tag. No flushing is necessary on a context switch because there is only one line on which any cache-block-sized piece of memory can reside, and the physical address is compared with the tag to determine if the data associated with the requested memory location is indeed present on that cache line. If multiple virtual addresses correspond to a single physical address, they will all seek out the same cache block when they do a cache lookup. One small downside is that the tags must be longer because they must contain the entire physical address rather than the part of the address not used for indexing as in the previous examples. [[#References|[1]]]

===Virtually Indexed, Physically Tagged===
This cache type allows the virtual address to be used right away to begin the lookup of the cache line. While this is going on, the TLB lookup for the physical address can occur in parallel. When both lookups are complete, the physical address returned from the TLB is compared with the tag on the cache blocks to determine if the requested data is on this line. This hides the latency from address translation (assuming it takes approximately as long as retrieving the cache line) and obviates the need to flush the cache on a context switch because the physical address is used for the final check of the tag. [[#References|[1]]]

===Physically Indexed, Virtually Tagged===
This is basically a "worst of both worlds" approach. The address translation must still be performed in order to find the index, which increases latency, and the cache must still be flushed on a context switch since there is the potential for a tag conflict using virtual tagging. [[#References|[1]]]

==TLB Coherence==
There is the potential for the information in one processor's TLB to be made stale by another processor if the other processor changes the permissions on a page or handles the swapping of a page out to disk. Thus there must be some method of updating the TLB with fresher information when this (somewhat rare) scenario occurs. [[#References|[2]]]

===Virtually Addressed Caches===
If the cache is virtually addressed and the miss rate is sufficiently low, the TLB can be eschewed entirely without impacting performance too much because it would only be used when a memory access is already required. The TLB need not be kept coherent if it doesn't exist. [[#References|[2]]]

===TLB Shootdown===
The processor making changes to the page table sends an interrupt to other processors alerting them that there has been a change made. The other processors look at a shared memory to determine which page table entries have changed and either invalidate or update their TLBs accordingly.[[#References|[2]]]

===Address Space Identifiers===
This is a concept similar to process tagging in a virtually addressed cache. The software maintains control of the TLB and marks each of the entries with an address space identifier denoting which process the buffered translation belongs to. These identifiers can be used by the OS to manage TLB coherence by updating or invalidating other processors' TLBs or flushing only the entries corresponding to the process whose page table is changing. The MIPS architecture uses this strategy.[[#References|[2]]]
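
A toy model of a TLB tagged with address space identifiers might look like the following. The class <code>TaggedTLB</code> and its methods are hypothetical and ignore capacity and replacement; the point is only that translations for different processes can coexist and that one process's entries can be flushed without touching the rest.

```python
class TaggedTLB:
    """Toy TLB whose entries are keyed by (address space id, virtual page)."""

    def __init__(self):
        self.entries = {}  # (asid, virtual_page) -> physical_frame

    def insert(self, asid, virtual_page, physical_frame):
        self.entries[(asid, virtual_page)] = physical_frame

    def lookup(self, asid, virtual_page):
        """Return the cached frame, or None on a TLB miss."""
        return self.entries.get((asid, virtual_page))

    def flush_asid(self, asid):
        """Drop only the entries that belong to one address space."""
        self.entries = {
            key: frame for key, frame in self.entries.items() if key[0] != asid
        }


tlb = TaggedTLB()
tlb.insert(1, 0x10, 0xA)   # process 1 maps virtual page 0x10 to frame 0xA
tlb.insert(2, 0x10, 0xB)   # process 2 maps the same virtual page elsewhere
tlb.flush_asid(1)          # process 1's page table changed
print(tlb.lookup(1, 0x10), tlb.lookup(2, 0x10))  # None 11
```

After the flush, process 2's translation is still cached, so it does not pay for the change made to process 1's page table.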

===Write Invalidate===
This protocol uses the fact that other processors are already implementing a cache coherence protocol by snooping the bus and responding to the instructions and data that go across it. When a processor changes a page table entry it issues a command on the bus similar to a BusUpgr as used in cache coherence that tells the snooping processors to invalidate that entry in their TLBs. The PowerPC architecture uses this to maintain its TLB coherence.[[#References|[2]]]

==Other Contemporary Issues==
The increase in prevalence of virtualization has caused many architectural changes to newer x86 processors. The TLBs were formerly managed fully by the hardware, but in order to better cope with virtual machines, both Intel and AMD have added address space identifiers to the TLB so that it is not flushed entirely on every context switch. [[#References|[3]]]

==References==
[http://www.linuxjournal.com/article/7105?page=0,1 1: Linux Journal Article on Caching]

[http://books.google.com/books?id=g82fofiqa5IC&pg=PA440&lpg=PA440&dq=tlb+coherence&source=bl&ots=COtleqdaUp&sig=fCU_8vD9_PhadrY62lneWUMG57g&hl=en&ei=VWhsTfatNMH7lwek9e3-BA&sa=X&oi=book_result&ct=result&resnum=6&ved=0CDYQ6AEwBQ#v=onepage&q=tlb%20coherence&f=false 2: Parallel computer architecture: a hardware/software approach by David E. Culler]

[http://en.wikipedia.org/wiki/Translation_lookaside_buffer#Virtualization_and_x86_TLB 3: Wikipedia on the TLB]

ECE506 Main Page

2011-03-01T04:35:54Z

Asbransc: Added my chapter

CSC/ECE 506 Spring 2011/ch6b ab

2011-03-01T04:34:30Z

Asbransc: Done

==Overview==
Cache addressing has a significant impact on the performance of the cache, determining cache latency and when the cache must be flushed. Since a cache is designed purely to improve performance, the addressing scheme must be a prime consideration.

==Cache Addressing==
The data in a CPU cache is addressed using an index and a tag. The index is used to find the cache line where the block containing the data being sought might be stored and the tag is used to determine if the data contained in any of the blocks at that line is indeed the data being sought. Each of these two lookup operations can proceed using either the physical or the virtual address. This leads to four possible schemes for cache addressing.

===Virtually Indexed, Virtually Tagged===
In a cache that uses the virtual address for both the index and the tag, no address translation is required on a cache hit. Thus the TLB and page table are only used on a cache miss. This allows for expedient retrieval of the requested data from the cache since no translation lookup occurs and the operand of the load or store instruction can be used as-is. However, after a context switch, the same virtual addresses can refer to completely different data, so the cache must recognize this and flush on a context switch, or at the very least flush the lines that conflict. Another issue with a VIVT cache is that the same data may have different virtual addresses if it is shared among different threads/processes. This data would be stored at multiple places in the cache even though it originates from a single memory location. [[#References|[1]]]

===Physically Indexed, Physically Tagged===
A lookup in this type of cache requires address translation as the first step of every memory access. The TLB must therefore be large enough to hold translations for the data in the cache; otherwise a TLB miss would force a page-table access in main memory even on a cache hit, defeating the purpose of caching the value in the first place. The time to translate the address through the TLB is non-negligible and is added to the front of the latency of the cache lookup itself. After translation completes, the cache uses the resulting physical address to find the line and check the tag. No flushing is necessary on a context switch because the index and tag both come from the physical address, so a cached block can only match the physical memory location it was loaded from. If multiple virtual addresses correspond to a single physical address, they all resolve to the same cache block. The tag holds the physical-address bits not used for the index and block offset, so its width depends on the size of the physical address space rather than the virtual one. [[#References|[1]]]

===Virtually Indexed, Physically Tagged===
This cache type allows the virtual address to be used right away to begin the lookup of the cache line. While this is going on, the TLB lookup for the physical address proceeds in parallel. When both lookups are complete, the physical address returned by the TLB is compared with the tags of the cache blocks on the selected line to determine whether the requested data is present. This hides the latency of address translation (assuming it takes approximately as long as retrieving the cache line) and obviates the need to flush the cache on a context switch because the physical address is used for the final tag check. [[#References|[1]]]
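
The three steps of a VIPT lookup can be sketched as below. This is a sequential model of work that hardware performs in parallel, and the parameters are assumptions: 4 KiB pages, 64-byte blocks and 64 sets, chosen so that the index and offset bits fit inside the page offset and the physical tag is simply the physical page number. The function name <code>vipt_lookup</code> is hypothetical.

```python
PAGE_BITS = 12    # 4 KiB pages (assumption)
BLOCK_BITS = 6    # 64-byte blocks (assumption)
INDEX_BITS = 6    # 64 sets (assumption); BLOCK_BITS + INDEX_BITS == PAGE_BITS


def vipt_lookup(vaddr, tlb, sets):
    """Return the cached data on a hit, or None on a miss.

    tlb:  dict mapping virtual page number -> physical page number
    sets: list of dicts, each mapping physical tag -> data
    """
    # Step 1: select the set using bits of the virtual address.
    index = (vaddr >> BLOCK_BITS) & ((1 << INDEX_BITS) - 1)
    candidates = sets[index]
    # Step 2: translate the page number (overlaps with step 1 in hardware).
    physical_page = tlb[vaddr >> PAGE_BITS]
    # Step 3: compare the physical tag against the blocks in the set.
    return candidates.get(physical_page)


sets = [dict() for _ in range(1 << INDEX_BITS)]
sets[42][0x55] = "payload"                   # block from physical page 0x55
tlb_a = {0x70000: 0x55}                      # process A maps the page to 0x55
tlb_b = {0x70000: 0x99}                      # process B maps the same page elsewhere

print(vipt_lookup(0x70000ABC, tlb_a, sets))  # hit for process A
print(vipt_lookup(0x70000ABC, tlb_b, sets))  # miss for process B, no stale data
```

Because the final comparison uses the physical tag, process B misses rather than reading process A's block, which is why no flush is needed on the context switch.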

===Physically Indexed, Virtually Tagged===
This is basically a "worst of both worlds" approach. The address translation must still be performed in order to find the index, which increases latency, and the cache must still be flushed on a context switch since there is the potential for a tag conflict using virtual tagging. [[#References|[1]]]

==TLB Coherence==
There is the potential for the information in one processor's TLB to be made stale by another processor if the other processor changes the permissions on a page or handles the swapping of a page out to disk. Thus there must be some method of updating the TLB with fresher information when this (somewhat rare) scenario occurs. [[#References|[2]]]

===Virtually Addressed Caches===
If the cache is virtually addressed and the miss rate is sufficiently low, the TLB can be eschewed entirely without impacting performance too much because it would only be used when a memory access is already required. The TLB need not be kept coherent if it doesn't exist. [[#References|[2]]]

===TLB Shootdown===
The processor making changes to the page table sends an interrupt to the other processors, alerting them that a change has been made. The other processors consult a shared memory location to determine which page table entries have changed and either invalidate or update their TLBs accordingly. [[#References|[2]]]
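
A shootdown can be modelled in a few lines. The sketch below is only an illustration: the names <code>Processor</code> and <code>change_mapping</code> are hypothetical, the interrupt is modelled as a direct method call, and the invalidate option is shown rather than the update option.

```python
class Processor:
    def __init__(self, name):
        self.name = name
        self.tlb = {}                        # virtual page -> physical page

    def handle_shootdown_interrupt(self, changed_pages):
        # Consult the shared list and invalidate any matching TLB entries.
        for vpn in changed_pages:
            self.tlb.pop(vpn, None)


def change_mapping(initiator, others, page_table, vpn, new_ppn):
    """Update the page table, then notify every other processor."""
    page_table[vpn] = new_ppn
    shared_changed = [vpn]                   # stands in for the shared memory area
    initiator.tlb.pop(vpn, None)             # the initiator drops its own stale entry
    for cpu in others:
        cpu.handle_shootdown_interrupt(shared_changed)  # models the interrupt


page_table = {5: 100, 6: 200}
p0, p1 = Processor("P0"), Processor("P1")
p0.tlb[5] = 100
p1.tlb[5] = 100
p1.tlb[6] = 200

change_mapping(p0, [p1], page_table, vpn=5, new_ppn=300)
print(p1.tlb)                                # only the unaffected entry remains
```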

===Address Space Identifiers===
This is a concept similar to process tagging in a virtually addressed cache. The software maintains control of the TLB and marks each of the entries with an address space identifier denoting which process the buffered translation belongs to. These identifiers can be used by the OS to manage TLB coherence by updating or invalidating other processors' TLBs, or by flushing only the entries corresponding to the process whose page table is changing. The MIPS architecture uses this strategy. [[#References|[2]]]
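
The effect of tagging entries with an identifier can be sketched as follows. The class <code>AsidTLB</code> and its method names are hypothetical and do not correspond to any real architecture's interface; the point is only that entries are keyed by the pair (identifier, virtual page).

```python
class AsidTLB:
    """Toy TLB whose entries are tagged with an address space identifier."""

    def __init__(self):
        self.entries = {}                    # (asid, virtual page) -> physical page

    def insert(self, asid, vpn, ppn):
        self.entries[(asid, vpn)] = ppn

    def lookup(self, asid, vpn):
        return self.entries.get((asid, vpn))  # None means a TLB miss

    def flush_asid(self, asid):
        # Drop only the translations belonging to one address space.
        self.entries = {key: ppn for key, ppn in self.entries.items()
                        if key[0] != asid}


tlb = AsidTLB()
tlb.insert(asid=1, vpn=5, ppn=100)           # process with identifier 1
tlb.insert(asid=2, vpn=5, ppn=200)           # same virtual page, other process

tlb.flush_asid(1)                            # page table of process 1 changed
print(tlb.lookup(1, 5), tlb.lookup(2, 5))
```

Both processes can keep translations for the same virtual page in the TLB at once, and a change to one page table removes only that process's entries.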

===Write Invalidate===
This protocol takes advantage of the fact that the other processors already implement a cache coherence protocol by snooping the bus and responding to the transactions that cross it. When a processor changes a page table entry, it issues a command on the bus, similar to the BusUpgr used in cache coherence, that tells the snooping processors to invalidate that entry in their TLBs. The PowerPC architecture uses this to maintain its TLB coherence. [[#References|[2]]]

==Other Contemporary Issues==
The growing prevalence of virtualization has led to many architectural changes in newer x86 processors. The TLBs were formerly managed fully by the hardware, but in order to better cope with virtual machines, both Intel and AMD have added address space identifiers to the TLB so that the entire TLB need not be flushed on every context switch. [[#References|[3]]]

==References==
[http://www.linuxjournal.com/article/7105?page=0,1 1: Linux Journal Article on Caching]

[http://books.google.com/books?id=g82fofiqa5IC&pg=PA440&lpg=PA440&dq=tlb+coherence&source=bl&ots=COtleqdaUp&sig=fCU_8vD9_PhadrY62lneWUMG57g&hl=en&ei=VWhsTfatNMH7lwek9e3-BA&sa=X&oi=book_result&ct=result&resnum=6&ved=0CDYQ6AEwBQ#v=onepage&q=tlb%20coherence&f=false 2: Parallel computer architecture: a hardware/software approach by David E. Culler]

[http://en.wikipedia.org/wiki/Translation_lookaside_buffer#Virtualization_and_x86_TLB 3: Wikipedia on the TLB]

CSC/ECE 506 Spring 2011/ch6b ab

2011-03-01T03:29:19Z

Asbransc:


CSC/ECE 506 Spring 2011/ch6b ab

2011-03-01T03:26:46Z

Asbransc:


CSC/ECE 506 Spring 2011/ch6b ab

2011-03-01T02:45:52Z

Asbransc: finished section on the 4 addressing schemes


CSC/ECE 506 Spring 2011/ch6b ab

2011-03-01T02:35:55Z

Asbransc: About 1/3 done


CSC/ECE 506 Spring 2011/ch6b ab

2011-02-28T23:29:23Z

Asbransc: Put in the Section Headings

==Overview==

==Cache Addressing==

===Virtually Indexed, Virtually Tagged===

===Physically Indexed, Physically Tagged===

===Virtually Indexed, Physically Tagged===

==TLB Coherence==

===<Recent Processor>'s approach===

==Other Contemporary Issues==