Talk:ECE506 Main Page


Chapter 4b CSC/ECE 506 Spring 2011 / ch4b


Name : Deepika Bhargava

ID : 000171225

Paper 1


As a book on computer architecture, the text does not go into a great deal of detail on parallel programming techniques. Many of the topics that are mentioned in Chapter 4 can be explored in greater depth. The granularity of parallel threads depends quite closely on the architecture. The granularity of synchronization is a topic that has been studied for many years. There are times when, due to the overhead of locking, it does not make sense to invoke a lock, but rather just to wait in a loop until a specific predicate is satisfied. This can be considered in more detail, but be sure not to overlap with the discussion of lock implementations in Chapter 9 of Solihin.

The textbook considers scheduling and load balancing in somewhat greater detail, but much more can be said. For example, one might consider when [gang scheduling|http://en.wikipedia.org/wiki/Gang_scheduling] is useful; this depends on the overhead of communication and task management. The same is true of load balancing; discuss the tradeoff between static and dynamic scheduling, and different approaches to dynamic scheduling.


Index

  1. Introduction
  2. Parallel Thread Granularity
    1. Instruction Level Parallelism
    2. Loop Level Parallelism
    3. Procedure Level Parallelism
    4. Subprogram Level Parallelism
    5. Job Level Parallelism
  3. Scheduling
  4. Load Balancing
  5. Gang Scheduling

Introduction

The granularity of parallel threads depends quite closely on the architecture. To attain good parallel performance it is important to use the right granularity. Granularity can be defined as the amount of work done by a task between communications with other tasks. Choosing a granularity trades off communication overhead, load balance, and the extra work done by threads. If granularity is too fine, performance suffers from communication overhead and from the cost of creating and communicating with threads; if granularity is too coarse, performance suffers from load imbalance. There is a middle ground where communication overhead is reduced without sacrificing balance. The goal of design should be to determine the right granularity for parallel tasks, avoiding both load imbalance and communication overhead, to achieve the best performance.

Granularity can be quantified as the ratio of computation time to communication time, or described qualitatively as coarse, medium, or fine.

Granularity = Computation Time / Communication Time

This ratio is a measure of how much overhead is incurred per unit of computation. A large ratio implies that communication overhead is insignificant compared to computation time, while a low ratio implies that communication overhead dominates computation time, which results in poor performance. Message-passing multiprocessors usually employ medium or coarse granularity; fine granularity is used in dataflow systems. With coarse granularity, each process contains a large number of sequential instructions and takes a substantial time to execute. With fine granularity, a process might consist of a few instructions or just one instruction. As granularity is reduced, process communication overhead generally increases, and it is particularly desirable to reduce this overhead because of the significant time communication takes. The temptation, in pursuit of maximum performance, is to resort to the finest possible granularity, which gives the highest degree of parallelism; but the added parallelism is wasted if it comes at the cost of ever-increasing overhead.
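As an illustration of the ratio, consider two hypothetical decompositions of the same job. This is a sketch with invented numbers, not measurements from any machine:

```python
# Hypothetical sketch of the granularity ratio; all times are invented
# for illustration, not measured on real hardware.

def granularity(computation_time, communication_time):
    """Granularity = computation time / communication time."""
    return computation_time / communication_time

# Coarse decomposition: each task computes for 100 ms, communicates for 2 ms.
coarse = granularity(100.0, 2.0)    # ratio 50: communication is negligible

# Fine decomposition: each task computes for only 0.1 ms but still pays
# about 0.05 ms of per-message communication overhead.
fine = granularity(0.1, 0.05)       # ratio 2: communication dominates

print(coarse, fine)
```

With the coarse decomposition the overhead is insignificant, while with the fine one a third of each task's total time (0.05 of 0.15 ms) is spent communicating, even though the fine decomposition exposes far more parallelism.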


In message-passing systems, overhead can be reduced by mapping several processes onto one processor and switching from one process to another when a process is held up by message passing. The disadvantage of static assignment is that the load distribution is unclear before the program is actually executed. Code and data have to be physically transferred to the local memory of each node prior to execution, and this transfer can cause a significant amount of overhead. Similarly, results need to be transferred from the nodes to the host system. Clearly, the computation to be performed needs to be reasonably long to amortize the loading overhead.

Parallel Thread Granularity

Parallel thread granularity is an important factor affecting the scalability of a parallel program. It affects the relative cost of thread management, which involves creation, task assignment, and synchronization. The larger the task granularity, the smaller thread management overheads are relative to the execution time of the tasks. One way to increase parallel task granularity is to parallelize the outer loop rather than the inner loop. When this produces poor load balancing, we can consider parallelizing an inner loop instead. Load balancing can also be improved with a dynamic task-to-thread mapping scheme: instead of assigning tasks to threads statically, a central task queue is set up to hold tasks in the form of groups of loop iterations, and at run time, whenever a thread becomes idle, it grabs a task from the queue (Solihin, 2009: 90).
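The dynamic task-to-thread mapping just described can be sketched as follows. This is a hedged illustration, not Solihin's code: the chunk size, thread count, and loop body are all invented for the example.

```python
# Sketch of dynamic task-to-thread mapping: a central queue holds tasks,
# each a chunk of loop iterations; an idle worker grabs the next task.
import queue
import threading

N = 1000
CHUNK = 50                          # task granularity: 50 iterations per task
task_queue = queue.Queue()
for start in range(0, N, CHUNK):
    task_queue.put(range(start, min(start + CHUNK, N)))

partial_sums = []
lock = threading.Lock()

def worker():
    while True:
        try:
            iterations = task_queue.get_nowait()   # idle thread grabs a task
        except queue.Empty:
            return                                 # no tasks left
        s = sum(i * i for i in iterations)         # the (invented) loop body
        with lock:
            partial_sums.append(s)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()

total = sum(partial_sums)           # sum of i^2 for i in 0..999
print(total)
```

Shrinking CHUNK improves load balance but raises the number of queue operations, which is exactly the granularity trade-off discussed above.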


Parallelism can be exploited at various processing levels. Five processing levels are as follows:

  • Instruction Level
  • Loop Level
  • Procedure Level
  • Subprogram Level
  • Job Level


The execution of a program may involve a combination of these levels. The level at which parallelism is exploited dictates the granularity. The level actually used depends on the application, formulation, algorithm, language, program, compilation support, and hardware limitations.

Instruction Level Parallelism

At the instruction level, a grain contains fewer than 20 instructions and is called fine grain. Depending on the individual program, parallelism at this level may range from two-fold to thousands-fold. The advantage of fine-grained computation lies in the abundance of parallelism. The exploitation of fine-grained parallelism can be assisted by an optimizing compiler, which should be able to automatically detect parallelism and translate the source code into a parallel form that the run-time system can recognize (Patterson, 1995). Instruction-level parallelism is rather tedious for an ordinary programmer to detect in source code. The synchronization interval is fewer than 20 instructions; synchronization at this level can be achieved by means of conditional wait statements that automatically resume when their condition becomes true.

Loop Level Parallelism

Loop-level parallelism corresponds to iterative loop operations. A typical loop contains fewer than 500 instructions. Some loop operations are independent across successive iterations and can be vectorized for pipelined execution or for lock-step execution on SIMD machines; others can be self-scheduled for parallel execution on MIMD machines. Loop-level parallelism is the program construct most readily optimized for execution on a parallel or vector computer. However, recursive loops are rather difficult to parallelize. Loop level is still considered a fine grain of computation. The synchronization interval is 20 to 200 instructions (Patterson, 1995).

Procedure Level Parallelism

This level corresponds to medium grain size at the task, procedure, subroutine, and coroutine levels. A typical grain contains fewer than 2000 instructions. Detection of parallelism at this level is much more difficult than at the finer-grain levels. The communication requirement is often lower than that required in MIMD execution mode. SPMD execution is a special case at this level, and multitasking also belongs in this category. Significant effort by programmers may be needed to restructure a program at this level, and some compiler assistance is also needed. The synchronization interval is 200 to 2000 instructions (Patterson, 1995).

Subprogram Level Parallelism

This corresponds to the level of job steps and related subprograms. The grain size may typically contain thousands of instructions. Job steps can overlap across different jobs. Subprograms can be scheduled onto different processors in SPMD or MPMD mode, often on a message-passing multicomputer.


Multiprogramming on a uniprocessor or on a multiprocessor is conducted at this level. In the past, parallelism at this level has been exploited by algorithm designers or programmers rather than by compilers; we still do not have good compilers for exploiting medium- or coarse-grain parallelism. The synchronization interval is 2000 to 1M instructions (Patterson, 1995).

Job Level Parallelism

This corresponds to the parallel execution of essentially independent jobs on a parallel computer. The grain size can be as high as tens of thousands of instructions in a single program. For supercomputers with a small number of very powerful processors, such coarse-grain parallelism is practical. Job-level parallelism is handled by the program loader and by the operating system in general. Time-sharing and space-sharing multiprocessors exploit this level of parallelism; in fact, both time and space sharing are extensions of multiprogramming.


To summarize, fine-grain parallelism is often exploited at the instruction or loop level, preferably assisted by a parallelizing or vectorizing compiler. Medium-grain parallelism at the task or job-step level demands a significant role for the programmer as well as for compilers. Coarse-grain parallelism at the program level relies heavily on an effective OS and on the efficiency of the algorithm used. Shared-variable communication is often used to support fine-grain and medium-grain computation, while message-passing multicomputers have been used for medium-grain and coarse-grain computation. In general, the finer the grain size, the higher the potential for parallelism, but also the higher the communication and scheduling overhead: fine grain provides a higher degree of parallelism, but heavier communication overhead, compared with coarse-grain computation. Massive parallelism is often exploited at the fine-grain level, as in data parallelism on SIMD and MIMD computers.


To schedule parallel tasks on a multiprocessor we need synchronization of processes. The use of multiprogramming on individual processors depends on the synchronization granularity. An access control mechanism is needed to maintain orderly access to shared data, and a sequence control mechanism is needed to ensure the ordering of operations.


Processors also compete with each other to gain access to shared data items. Locks, test-and-set, barriers, point-to-point synchronization, critical sections, semaphores, and precedence constraints are some methods used to achieve synchronization. Point-to-point synchronization is cheaper than locks and barriers.


The amount of time spent in locks and critical sections increases with the number of processors, and this can increase overhead. Similarly, a simple spin-lock scheme will not scale well on large machines with all processors contending for the same lock. The bus, which acts as a point of serialization for all the processors, suffers a lot of contention as well as traffic. The fairness property of the bus makes things worse, since it delays the processor that holds the lock from releasing it. The root cause of the problem is contention and the fact that lock access is serialized. The key advantage of a spin lock is low overhead when a lock is reused by the same processor.
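The spin-lock behavior discussed above can be sketched as follows. This is a hedged illustration: Python has no test-and-set instruction, so a non-blocking `acquire` stands in for the atomic read-modify-write, and the `while` loop is the busy-wait; the class and thread counts are invented for the example.

```python
# Sketch of a test-and-set style spin lock. The non-blocking acquire
# plays the role of the atomic test-and-set; the loop is the "spin".
import threading

class SpinLock:
    def __init__(self):
        self._flag = threading.Lock()       # stands in for the lock variable

    def acquire(self):
        # Spin: repeatedly "test and set" until we win the lock.
        while not self._flag.acquire(blocking=False):
            pass                            # busy-wait; burns cycles and bus traffic

    def release(self):
        self._flag.release()

counter = 0
spin = SpinLock()

def increment(n):
    global counter
    for _ in range(n):
        spin.acquire()                      # enter critical section
        counter += 1                        # shared update, now race-free
        spin.release()

threads = [threading.Thread(target=increment, args=(10000,)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(counter)
```

The critical section here is tiny, which is the favorable case for spinning; as the text notes, with many processors contending, the serialized lock traffic, not the critical section, becomes the bottleneck.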

Parallel tasks can be synchronized through the use of synchronization points called barriers. No task may proceed beyond a barrier until all participating tasks have reached it: members of a group wait at the barrier until a specified number of group members have checked in. Barriers are commonly implemented with spin locks, and so suffer from the problems described above. Frequently a barrier is used within a loop, so that processes released from the barrier do some work and then reach the barrier again. One of the processes may be slow to leave the barrier, for example if the OS schedules another process in its place; it is then possible for a fast process to race ahead and reach the barrier again before the last process has left, leaving all the processes trapped. The solution, of course, is to count the number of processes that have entered and left, but this adds to the latency of the barrier and increases contention, so performance can be poor.
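One classic way around the re-entry problem just described, without counting processes out of the barrier, is a sense-reversing barrier: each barrier episode flips a global "sense" flag, so a fast process that reaches the barrier again cannot be confused with a slow one still leaving. The sketch below is a hedged illustration; the class name and structure are ours, not from the text.

```python
# Sketch of a sense-reversing centralized barrier.
import threading

class SenseBarrier:
    def __init__(self, n):
        self.n = n              # number of participating threads
        self.count = 0          # threads arrived in the current episode
        self.sense = False      # flips once per episode
        self.lock = threading.Lock()

    def wait(self):
        local_sense = not self.sense            # sense this episode will take
        with self.lock:
            self.count += 1
            if self.count == self.n:            # last arriver releases everyone
                self.count = 0
                self.sense = local_sense
                return
        while self.sense != local_sense:        # spin until the episode flips
            pass

order = []
barrier = SenseBarrier(3)

def phase_worker(tid):
    for phase in range(3):                      # barrier used inside a loop
        barrier.wait()
        order.append(phase)                     # record which phase we are in

threads = [threading.Thread(target=phase_worker, args=(i,)) for i in range(3)]
for t in threads: t.start()
for t in threads: t.join()
print(order)
```

Because no thread can pass barrier k+1 until all three have called `wait()` for it, every phase-0 append happens before any phase-1 append, and so on, even though the barrier never counts departing threads.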


A parallel program is a collection of tasks. Associated with each program is a flow of control among the subprogram units or tasks: certain tasks have to be completed before others can be initiated, and tasks may run serially or in parallel. An optimal schedule determines the allocation of tasks to the processors of a multiprocessor system and the execution order of the tasks, so as to achieve the shortest execution time.

Scheduling

Scheduling on a multiprocessor involves three interrelated issues: the assignment of processes to processors, the use of multiprogramming on individual processors, and the actual dispatching of a process. Scheduling depends on the degree of granularity and on the number of processors available. Scheduling techniques can be classified into two groups:

  • Static Scheduling
  • Dynamic Scheduling

In static scheduling, each task is allocated to a particular processor based on an analysis of the precedence constraints imposed by the tasks at hand. Each time the task is executed, it is allocated to that predetermined processor. Obviously, this method does not take into consideration the nondeterministic nature of tasks brought about by conditional branches and loops in the program: the targets of the conditional branches and the upper bounds of the loops are not known until program execution begins. Thus, static scheduling is not always optimal. Since assignment is predetermined, static techniques do not incur much task management overhead at run time; but to achieve good load balancing, they require that the relative amounts of work in different tasks be adequately predictable, or that enough tasks exist to ensure a balanced distribution.
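Static scheduling of loop iterations can be sketched as a block partition computed before execution. This is a hedged illustration; the function name and the even-split policy are assumptions for the example, and each processor's share is fixed regardless of how long its iterations actually take.

```python
# Sketch of static scheduling: iterations are partitioned among processors
# before execution, with no run-time task management.

def static_block_schedule(n_iterations, n_processors):
    """Return a list of (start, end) iteration ranges, one per processor."""
    base, extra = divmod(n_iterations, n_processors)
    ranges, start = [], 0
    for p in range(n_processors):
        size = base + (1 if p < extra else 0)   # spread the remainder evenly
        ranges.append((start, start + size))
        start += size
    return ranges

print(static_block_schedule(10, 4))
```

If iteration costs vary unpredictably, such a fixed partition is exactly where the load imbalance described above comes from: a processor that draws the expensive iterations finishes last while the others sit idle.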


In dynamic scheduling, tasks are allocated to processors based on their execution characteristics, usually with some load balancing employed in determining the optimal allocation. Because the scheduler has knowledge of only local information about the program at any time, finding the global optimum is difficult. Another disadvantage is the increased overhead, because the schedule has to be determined while the tasks are running. Dynamic scheduling can also use cache memory inefficiently, and it makes communication between processors more difficult.


Dynamic scheduling has many advantages too. It can handle cases in which dependences are unknown at compile time, and it simplifies the compiler. It also allows code that was compiled with one pipeline in mind to run efficiently on a different pipeline. However, these advantages are gained at the cost of a significant increase in hardware complexity: although dynamic scheduling techniques usually provide good load balancing despite unpredictable workloads or environmental conditions, they make the management of parallelism more expensive.


A dynamically scheduled processor cannot remove true data dependences; it tries to avoid stalling when dependences are present. In contrast, static pipeline scheduling, like that we have already seen, tries to minimize stalls by separating dependent instructions so that they will not lead to hazards. Static scheduling can also be used on code destined to run on a processor with a dynamically scheduled pipeline.

There are two basic scheduling models used by operating systems in multiprocessor systems. They differ essentially in the granularity of the unit of work scheduled as one entity. The two types of scheduling are:

  • Process Scheduling
  • Thread Scheduling


In the process model, scheduling is performed on a per-process basis: the smallest unit of work to be scheduled is a process. Process scheduling involves the declaration of the distinct process states connected with scheduling, the specification of the state transition diagram, and the statement of a scheduling policy.


Thread scheduling is a finer-grained scheduling model in which the smallest unit of work is a thread. A thread, like a process, is a sequence of instructions; threads are created within, and belong to, processes, and all the threads created within one process share the resources of that process, in particular the address space. Thread scheduling is more affordable and advantageous than the process model. With finer-grained entities, more parallelism can be exposed than with processes. In addition, creating threads and the necessary communication, synchronization, and switching are far less expensive operations than those for processes, since all threads belonging to the same process share the same resources.


General approaches to thread scheduling are:


  • Load Balancing
  • Gang Scheduling
  • Dedicated Processor Assignment
  • Dynamic Scheduling

Load Balancing

The key objective of load balancing is to minimize idle time on threads. Sharing the workload equally across all threads, with minimal work-sharing overheads, results in fewer cycles wasted on idle threads that are not advancing the computation, and thereby leads to improved performance. However, achieving perfect load balance is non-trivial; it depends on the parallelism within the application, the workload, the number of threads, the load-balancing policy, and the threading implementation. In load balancing, processes are not assigned to a particular processor: a global queue of ready threads is maintained, and each processor, when idle, selects a thread from the queue. Load sharing is one of the most common schemes used in multiprocessors.


Load balancing is achieved by distributing the load evenly across the processors, ensuring that no processor is idle while work is available to do. There is no requirement for a centralized scheduler; when a processor is available, the scheduling routine of the operating system is run on that processor to select the next thread. A global queue of ready threads is maintained for load balancing. It can be organized and accessed using any priority-based scheme, including schemes that consider execution history or anticipated processing demands. Dynamic task mapping achieves load balancing more easily than static mapping.
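One simple policy consistent with the description above is to hand each task to the currently least-loaded processor. The sketch below is hedged: the greedy policy, the function name, and the task costs are invented for illustration; it tracks processor loads with a min-heap.

```python
# Sketch of greedy load balancing: each task goes to the processor with
# the smallest accumulated load, found via a min-heap.
import heapq

def balance(task_costs, n_processors):
    """Greedily map tasks to processors; return per-processor total load."""
    heap = [(0, p) for p in range(n_processors)]   # (load, processor id)
    loads = [0] * n_processors
    for cost in task_costs:
        load, p = heapq.heappop(heap)              # least-loaded processor
        loads[p] = load + cost
        heapq.heappush(heap, (loads[p], p))
    return loads

print(balance([7, 5, 4, 3, 2, 2], 3))
```

With the sample costs, the three processors end up with loads of 9, 7, and 7, far better balanced than a naive round-robin assignment of the same tasks would be.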


Load sharing uses a global queue that is accessed by each processor, so it may need mutual exclusion to prevent the queue from being accessed by more than one processor at a time. This can become a bottleneck when more than one processor looks for work at the same time, especially in large machines. Preempted threads are unlikely to resume execution on the same processor, and if all threads are in the global queue, all the threads of a program will not gain access to the processors at the same time.

Gang Scheduling

Gang scheduling is a resource management scheme for parallel and distributed systems that combines time-sharing with space-sharing to ensure short response times for interactive tasks and high overall system throughput. In gang scheduling, a set of related threads is scheduled to run on a set of processors at the same time, on a one-to-one basis. Gang scheduling is useful for applications whose performance severely degrades when any part of the application is not running, since threads often need to synchronize with each other. It uses giant locks to separate the functions of subsystems, which can be continually expanded until a subsystem is fully multithreaded and can handle many lightweight processes simultaneously. In gang scheduling, independently schedulable threads, generally known as virtual processors, are formed from an application with the support of a user library or language directives; the OS kernel multiplexes these threads onto physical processors. Under gang scheduling, jobs with the same resource requirements are collected into groups, and each group is assigned to a partition of the system for a specific amount of time. When the time allocated to a group is up, a context switch occurs, resources are reallocated, and the jobs in the new groups are scheduled to execute.


The overhead involved in gang scheduling can be significant, so processors are dedicated for periods longer than the typical time-sharing quantum to reduce it. A processor allocation server is a facility for implementing an allocation policy; the server has direct control over the processors required by the applications. The set of available processors is assigned to multiple threads dynamically: the application creates a processor set, and a program will execute only threads belonging to the same processor set. Gang scheduling reduces synchronization blocking and requires less time for resource allocation. In particular, gang scheduling promotes efficient fine-grain interaction among threads in the same gang and provides interactive response time for short jobs.
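The all-or-nothing co-scheduling property of gang scheduling can be illustrated with a toy simulation. This is a hedged sketch: the round-robin policy, gang sizes, and slot count are invented, and real gang schedulers are far more sophisticated.

```python
# Toy simulation of gang scheduling: gangs of related threads are time-sliced
# onto a fixed set of processors, and a gang runs only when ALL of its threads
# can be given processors at once, one thread per processor.

def gang_schedule(gangs, n_processors, n_slots):
    """gangs: dict of gang name -> thread count. Returns per-slot allocation."""
    names = list(gangs)
    timeline, i = [], 0
    for _ in range(n_slots):
        slot, free = [], n_processors
        # Round-robin over gangs; co-schedule a gang only if it fits entirely.
        for k in range(len(names)):
            name = names[(i + k) % len(names)]
            if gangs[name] <= free:
                slot.append(name)
                free -= gangs[name]
        i += 1                          # rotate the starting gang each slot
        timeline.append(slot)
    return timeline

print(gang_schedule({"A": 4, "B": 2, "C": 2}, 4, 3))
```

Gang A, needing all four processors, can never share a time slot with B or C; the two small gangs space-share a slot between A's turns, which is exactly the time-sharing-plus-space-sharing combination the text describes.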

References

  1. Culler, David E., Singh, Jaswinder Pal, Gupta, Anoop, 2000. Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann Publishers, CA, USA.
  2. Solihin, Yan, 2009. Fundamentals of Parallel Computer Architecture: Multichip and Multicore Systems. Student edition, Solihin Publishing and Consulting LLC, USA.
  3. Hwang, Kai, 1993. Advanced Computer Architecture: Parallelism, Scalability, Programmability. McGraw-Hill, Inc., USA.
  4. Flynn, Michael J., 1995. Computer Architecture: Pipelined and Parallel Processor Design. Jones and Bartlett Publishers, Inc., London, UK.
  5. Patterson, David A., Hennessy, John L., 1995. Computer Architecture: A Quantitative Approach. 2nd Ed., Morgan Kaufmann Publishers, CA, USA.
  6. Sima, Dezso, Fountain, Terence, Kacsuk, Peter, 1997. Advanced Computer Architecture: A Design Space Approach. Addison Wesley Longman, Essex, England.
  7. Stone, Harold S., 1993. High-Performance Computer Architecture. 3rd Ed., Addison-Wesley Publishing Company Inc., MA, USA.
  8. Abd-El-Barr, Mostafa, El-Rewini, Hesham, 2005. Advanced Computer Architecture and Parallel Processing. John Wiley & Sons, Inc., Hoboken, New Jersey, USA.
  9. Briggs, Faye A., Hwang, Kai, 1984. Computer Architecture and Parallel Processing. McGraw-Hill, Inc., USA.
  10. Wilkinson, Barry, 1996. Computer Architecture: Design and Performance. 2nd Ed., Prentice Hall Europe, Englewood Cliffs, NJ.
  11. Shiva, Sajjan G., 1996. Pipelined and Parallel Computer Architectures. HarperCollins, NY, USA.
  12. Sinapova, Lydia. Operating Systems: Multiprocessor and Real-Time Scheduling. [1]
  13. Squillante, Mark S., Wang, Fang, Papaefthymiou, Marios. An Analysis of Gang Scheduling for Multiprogrammed Parallel Computing Environments. Yale University, New Haven, CT.
  14. [Gang Scheduling, Timesharing on Parallel Computers, SC98, November 1998 (a summary)|http://www.llnl.gov/asci/pse_trilab/sc98.summary.html]
  15. Intel ISN, 2010. Granularity and Parallel Performance. [2]
  16. Chen, Ding-Kai, et al. "The Impact of Synchronization and Granularity on Parallel Systems", Proceedings of the 17th Annual International Symposium on Computer Architecture, 1990, Seattle, Washington, USA.


Interconnection Network Topologies and Routing Algorithms

CHAPTER 11 and 12 Reviews

It does not discuss the evolution of interconnection networks, just gives a list of topologies. Which were developed first? What shortcomings did the newer network topologies attempt to address?

Although it has a couple of tables of interconnection-network characteristics, they are at the end, and not near the description of the networks themselves.

The textbook itself (p. 370) has a table of interconnection-network characteristics. In a supplement, it would be helpful to explain, clearly and concisely, how the expressions for diameter, bisection width, etc. have been derived.

To read the description, you would think that 2D meshes are uncompetitive. Yet they dominate--and for good reason. This needs to be explained.


Index
  1. [Interconnection Network Architecture|http://pg-server.csc.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Spring_2010/12_EC#Interconnection_Network_Architecture]
    1. Interconnection networks: What? Where? Why?
    2. History of Interconnected Networks
  2. [Types of Network Topologies|http://pg-server.csc.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Spring_2010/12_EC#Types_of_Network_Topologies]
    1. [Linear Array|http://pg-server.csc.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Spring_2010/12_EC#Linear_Array]
    2. [Ring|http://pg-server.csc.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Spring_2010/12_EC#Ring]
    3. [2-D Mesh|http://pg-server.csc.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Spring_2010/12_EC#2-D_Mesh]
    4. [2-D Torus|http://pg-server.csc.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Spring_2010/12_EC#2-D_Torus]
    5. [Cube|http://pg-server.csc.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Spring_2010/12_EC#Cube]
    6. [Hypercube|http://pg-server.csc.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Spring_2010/12_EC#Hypercube]
    7. [Tree|http://pg-server.csc.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Spring_2010/12_EC#Tree]
    8. [Fat Tree|http://pg-server.csc.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Spring_2010/12_EC#Fat_Tree]
    9. [Butterfly|http://pg-server.csc.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Spring_2010/12_EC#Butterfly]
  3. [Real-World Implementation of Network Topologies|http://pg-server.csc.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Spring_2010/12_EC#Real-World_Implementation_of_Network_Topologies]
    1. [Fat Tree|http://pg-server.csc.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Spring_2010/12_EC#Fat_Tree_2]
    2. [Butterfly|http://pg-server.csc.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Spring_2010/12_EC#Butterfly_2]
    3. [Meshes and Tori|http://pg-server.csc.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Spring_2010/12_EC#Meshes_and_Tori]
    4. [Hypercube|http://pg-server.csc.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Spring_2010/12_EC#Hypercube_2]
  4. [Comparison of Network Topologies|http://pg-server.csc.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Spring_2010/12_EC#Comparison_of_Network_Topologies]
  5. [Routing|http://pg-server.csc.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Spring_2010/12_EC#Routing]
    1. [Deadlock|http://pg-server.csc.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Spring_2010/12_EC#Deadlock]
    2. [Dimensional ordered (X-Y) routing|http://pg-server.csc.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Spring_2010/12_EC#Dimensional_ordered_.28X-Y.29_routing]
    3. [West First|http://pg-server.csc.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Spring_2010/12_EC#West_First]
    4. [North Last|http://pg-server.csc.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Spring_2010/12_EC#North_Last]
    5. [Negative First|http://pg-server.csc.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Spring_2010/12_EC#Negative_First]
    6. [Odd-Even Turn Model|http://pg-server.csc.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Spring_2010/12_EC#Odd-Even_Turn_Model]
  6. [Comparison of Turn Restriction Models|http://pg-server.csc.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Spring_2010/12_EC#Comparison_of_Turn_Restriction_Models]
  7. [Router Architecture|http://pg-server.csc.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Spring_2010/12_EC#Router_Architecture]
  8. [Fault Tolerant Routing|http://pg-server.csc.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Spring_2010/12_EC#Fault_Tolerant_Routing]
    1. [Fault Models|http://pg-server.csc.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Spring_2010/12_EC#Fault_Models]
    2. [Fault Tolerance Mechanisms (for permanent faults)|http://pg-server.csc.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Spring_2010/12_EC#Fault_Tolerance_Mechanisms_.28for_permanent_faults.29]
  9. Metrics for interconnection Network
  10. Characteristics of Networks with p Processors
  11. References


Interconnection Network Architecture

Interconnection networks: What? Where? Why?

An interconnection network is a programmable system that transports data between terminals. All the terminals are connected to the interconnection network. When one terminal wishes to send data to another, the sending terminal puts a message containing the data into the network, and the network delivers the message to the receiving terminal. The network is programmable because it makes different connections at different points in time. The network is a system because it is composed of many components:

  • Buffers
  • Channels
  • Switches
  • Controls

These components work together to deliver data.

Interconnection networks are used in most digital systems that are large enough to have two components to connect. The most common examples are computer systems and communication switches. In computer systems, they connect processors to memories and I/O devices to I/O controllers. In the 1980s, most of these applications were served by very simple interconnection networks, for example the multi-drop bus. However, this has changed: all high-performance interconnections are now performed by point-to-point interconnection networks rather than buses, and more systems that have historically been bus-based switch to networks every year. This is because of non-uniform performance scaling. The demand for interconnection performance is increasing with processor performance and network bandwidth. Buses are not able to keep up with this bandwidth demand, so designers turn to point-to-point interconnection networks, which both operate faster than buses and offer concurrency.

Interconnection networks are a limiting factor in many systems. The interconnection network between processor and memory largely determines the memory latency and memory bandwidth, and the performance of the interconnection network in a communication switch largely determines the capacity of the switch. Because the demand for interconnection has grown more rapidly than the capability of the underlying wires, interconnection has become a critical bottleneck in most systems.


History of Interconnection Networks

Networks developed along three main parallel threads:

  • Telephone switching networks
  • Inter-processor communication
  • Processor-memory interconnect

Early telephone networks were built from electro-mechanical crossbars or electro-mechanical step-by-step switches. As late as the 1980s, most local telephone switches were still built from electro-mechanical relays, although toll switches were completely electronic and digital by that time. The key developments in telephone switching were the non-blocking, multistage Clos network in 1953 and the Benes network in 1962. Many large telephone switches today are still built from Clos or Clos-like networks.

The first inter-processor interconnection networks were connections between the registers of neighboring processors connected in 2-D arrays; an example is the Solomon machine, developed in 1962. These early networks performed no routing, so a processor had to explicitly relay communications to non-neighbors, making for poor performance and considerable programming complexity. By the mid-1980s, router chips, such as the torus routing chip, were developed to forward messages through intermediate nodes without processor intervention.

Inter-processor interconnection networks have gone through a series of topology fads over the years, largely motivated by packaging and other technology constraints. The early machines, like the Solomon, Illiac, and MPP, were based on simple 2-D mesh or torus networks because of their physical regularity. Starting in the late 1970s, binary n-cube or hypercube networks, such as the Cosmic Cube and the nCUBE machines, became popular because of their low diameter. In the mid-1980s it was shown that, under realistic packaging constraints, low-dimensional networks outperformed hypercubes, and most machines returned to 2-D and 3-D mesh or torus networks. Consequently, most networks built over the last decade have returned to these topologies, including the J machine, the Cray T3D and T3E, and the Intel DELTA. Today, the high pin bandwidth of router chips relative to message length motivates the use of networks with much higher node degree, such as butterfly and Clos networks.

Processor-memory interconnection networks emerged in the late 1960s, when parallel processor systems incorporated alignment networks to allow any processor to access any memory bank without burdening the other processors. The smallest machines employed crossbar switches for this purpose, whereas larger machines used networks with a butterfly topology in a dance-hall arrangement. Variations on this theme were used through the 1980s for many shared-memory parallel processors.

These three threads of interconnection network evolution have recently merged. Since the 1990s, there has been little difference in the design of processor-memory and inter-processor interconnection networks; in fact, the same router chips have been used for both. A variant of the Clos and Benes networks of telephony has also emerged in multiprocessor networks in the form of the fat tree topology.

Interconnection Network Topology

In a multi-processor system, processors need to communicate with each other and access each other's resources. In order to route data and messages between processors, an interconnection architecture is needed.

Typically, in a multiprocessor system, messages passed between processors are frequent and short^1^. Therefore, the interconnection network architecture must handle messages quickly, by having low latency, and must handle several messages at a time, by having high bandwidth.

In a network, a processor along with its cache and memory is considered a node. The physical wires that connect nodes are called links. The device that routes messages between nodes is called a router. The shape of the network, such as the number of links and routers, is called the network topology. Topology is the pattern in which the individual switches of the network are connected to other switches and to processors and memories (nodes). Direct topologies connect each switch directly to a node, while in indirect topologies at least some of the switches connect only to other switches. Examples of direct networks include the 2-D mesh of the 1970s-era Illiac IV, the similar 1990s-era 2-D mesh of the Intel Paragon, and the toroidal 3-D mesh of the Cray T3D and T3E.

The indirect cube network, often referred to as a multistage interconnection network or MIN, has been used in a variety of machines. This topology was used in the Staran SIMD machine of the 1970s, in which data would traverse the network from one side to the other. Allowing data to reverse direction at any stage in a bidirectional MIN (or BMIN) leads to a variation sometimes called a fat tree. MINs and their fat-tree variants are used in MIMD machines of the 1990s such as the Meiko CS-2, the IBM SP2, and the Thinking Machines CM-5. Direct networks often excel at routing local traffic patterns, such as the passing of boundary data in grids. Indirect networks (such as k-node MINs) can provide a variety of global communication paths by passing through the multiple stages of switches.

Types of Network Topologies

Linear Array

Chapter%2012_html_1fb57514.png\\\\The nodes are connected linearly, as in an array. This type of topology is simple; however, it does not scale well. The longest distance between two nodes, or the diameter, grows linearly with the number of nodes. A linear bus makes it easy to connect a computer or peripheral and requires less cable length than a star topology. The disadvantages of a linear bus topology are that the entire network shuts down if there is a break in the main cable, that terminators are required at both ends of the backbone cable, and that it is difficult to identify the problem when the entire network shuts down. The linear topology is not meant to be used as a stand-alone solution in a large building.

Star

Chapter%2012_html_394ca1a2.png

Many home networks use the star topology. A star network features a central connection point called a "hub" (which may be a switch, router, or hub) with terminals connected to it. Devices typically connect to the hub with Unshielded Twisted Pair (UTP) Ethernet.

\\

Compared to the bus topology, a star network generally requires more cable, but a failure in any star network cable will only take down one computer's network access and not the entire LAN. (If the hub fails, however, the entire network goes down.) The star topology is easy to install and wire. There are no disruptions to the network when connecting or removing devices, and it is easy to detect faults and to remove parts.

Disadvantages of the star topology are that it requires more cable than a linear topology and is more expensive than linear bus topologies because of the cost of the hubs. If the hub, switch, or concentrator fails, the attached nodes are disabled.

\\

Ring

[Chapter%2012_html_f1d5830.jpg|http://pg-server.csc.ncsu.edu/mediawiki/index.php/Image:Top_ring.jpg]\\A network that uses a ring topology arranges for computers to be connected in a closed loop: a cable connects the first computer to a second computer, another cable connects the second computer to a third, and so on, until a cable connects the final computer back to the first.

The ring has a similar structure to the linear array, except that the end nodes connect to each other, establishing a circular structure. The longest distance between two nodes is cut in half. However, a failure in any cable or device breaks the loop and can take down the entire network.
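The effect of closing the loop on the diameter can be sketched in a few lines of Python (the helper names here are illustrative, not taken from any source):

```python
def linear_diameter(p):
    """Longest shortest-path distance in a linear array of p nodes."""
    return p - 1

def ring_distance(a, b, p):
    """Shortest hop count between nodes a and b on a p-node ring,
    taking the shorter of the clockwise and counter-clockwise paths."""
    d = abs(a - b)
    return min(d, p - d)

def ring_diameter(p):
    """The wrap-around link halves the worst-case distance."""
    return p // 2

p = 8
print(linear_diameter(p))      # 7
print(ring_diameter(p))        # 4
print(ring_distance(0, 7, p))  # 1 -- neighbors via the wrap-around link
```

For eight nodes, the end-to-end distance drops from 7 hops in the linear array to 4 on the ring, and the former end nodes become direct neighbors.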

Mesh

The mesh topology incorporates a unique network design in which each computer on the network connects to every other, creating a point-to-point connection between every device on the network. The purpose of the mesh design is to provide a high level of redundancy. If one network cable fails, the data always has an alternative path to get to its destination. Figure 6 shows the mesh topology.

Chapter%2012_html_m57a68344.png

The wiring for a mesh network can be very complicated. Further, the cabling costs associated with the mesh topology can be high, and troubleshooting a failed cable can be tricky. Because of this, the mesh topology is rarely used. A variation on a true mesh topology is the hybrid mesh, which creates a redundant point-to-point network connection between only specific network devices; the hybrid mesh is most often seen in WAN implementations. The advantage of the mesh topology is that it provides redundant paths between devices, and the network can be expanded without disruption to current users. Its disadvantages are that it requires more cable than the other LAN topologies and that it is complicated.

2-D Mesh

[Chapter%2012_html_450973b.jpg|http://pg-server.csc.ncsu.edu/mediawiki/index.php/Image:Top_2Dmesh.jpg]\\The 2-D mesh can be thought of as several linear arrays put together to form a 2-dimensional structure. Nodes that are not on the edge have 4 input or output links, or a degree of 4.

Designers like 2-D meshes due to their easy wiring layout; the 2-D mesh has uniform and low wiring density throughout. A low wiring density means that no stringent constraints are placed on channel width. 2-D mesh networks are very simple and lead to very short wires in the architecture. The 2-D mesh is a very popular topology for networks-on-chip due to its straightforward implementation, the simplicity of the XY routing strategy, and the network's scalability. On the other hand, the 2-D mesh has some disadvantages, such as a long network diameter and energy inefficiency because of the extra hops.

2-D Torus

[Chapter%2012_html_437ff2cb.jpg|http://pg-server.csc.ncsu.edu/mediawiki/index.php/Image:Top_2Dtorus.jpg]\\\\The 2-D torus takes the structure of the 2-D mesh and connects the nodes on the edges. This decreases the diameter, but the number of links is higher. Adding wrap-around links to a mesh creates a torus topology, which decreases the average and maximum hop counts and doubles the bisection bandwidth. The wrap-around links, however, also double the number of wiring channels per tile edge to two. The disadvantage of long wires spanning the length of the die is overcome by the technique of “folding”, which yields a maximum wire length spanning only two tiles.
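The hop-count benefit of the wrap-around links can be illustrated with a small Python sketch (function names are hypothetical; nodes are addressed by (x, y) coordinates):

```python
def mesh_hops(src, dst):
    """Minimal hop count on a 2-D mesh: plain Manhattan distance."""
    (x1, y1), (x2, y2) = src, dst
    return abs(x1 - x2) + abs(y1 - y2)

def torus_hops(src, dst, k):
    """Minimal hop count on a k x k torus: in each dimension the
    packet may take the direct path or wrap around the edge."""
    (x1, y1), (x2, y2) = src, dst
    dx = min(abs(x1 - x2), k - abs(x1 - x2))
    dy = min(abs(y1 - y2), k - abs(y1 - y2))
    return dx + dy

# Corner-to-corner traffic on a 4x4 network:
print(mesh_hops((0, 0), (3, 3)))      # 6
print(torus_hops((0, 0), (3, 3), 4))  # 2 -- both dimensions wrap
```

Corner-to-corner traffic, the worst case for the mesh, drops from 6 hops to 2 once the edges are connected.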

[] Cube

[Chapter%2012_html_61aeb58d.jpg|http://pg-server.csc.ncsu.edu/mediawiki/index.php/Image:Top_cube.jpg]\\The cube can be thought of as a three-dimensional mesh.

[] Hypercube

[Chapter%2012_html_m250307c.jpg|http://pg-server.csc.ncsu.edu/mediawiki/index.php/Image:Top_hypercube.jpg]\\The hypercube is essentially multiple cubes put together: an n-dimensional hypercube has 2^n nodes, each connected to n neighbors.
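A standard way to describe the hypercube (not specific to this text) is to label its 2^n nodes with n-bit addresses so that neighbors differ in exactly one bit; neighbor lists and minimal distances then reduce to bit operations, as this illustrative Python sketch shows:

```python
def hypercube_neighbors(node, n):
    """Neighbors of a node in an n-dimensional hypercube: flip each
    of the n address bits in turn."""
    return [node ^ (1 << d) for d in range(n)]

def hypercube_distance(a, b):
    """Minimal hop count between two nodes: the Hamming distance
    (number of differing bits) of their labels."""
    return bin(a ^ b).count("1")

print(hypercube_neighbors(0b000, 3))     # [1, 2, 4]
print(hypercube_distance(0b000, 0b111))  # 3 -- the diameter of a 3-cube
```

The diameter is therefore n = log2(p) hops, which is why hypercubes were attractive for their low diameter.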


Tree

The tree is a hierarchical structure with terminal nodes at the bottom and switching nodes at the upper levels. The tree experiences high traffic at the upper levels. Tree topologies integrate multiple star topologies together onto a bus. In its simplest form, only hub devices connect directly to the tree bus, and each hub functions as the "root" of a tree of devices. This bus/star hybrid approach supports future expandability of the network much better than a bus (limited in the number of devices due to the broadcast traffic it generates) or a star (limited by the number of hub connection points) alone.

[Chapter%2012_html_m48b3cb4.jpg|http://pg-server.csc.ncsu.edu/mediawiki/index.php/Image:Top_tree.jpg]\\In a tree network there is only one path between any two nodes. The taller the tree, the higher the communication bottleneck at the upper levels of the tree. Two remedies are possible:

Chapter%2012_html_m6d0d725c.png

We may have (a) static tree networks, or (b) dynamic tree networks. Alternatively, we may introduce fat tree networks (see below).

Fat Tree

In a fat-tree topology, the links become fatter as one moves up the tree towards the root. The fat-tree topology has many properties that make it attractive for large-scale interconnects and system-area networks. Most importantly, its bisection bandwidth scales linearly with the network size. The topology is also inherently highly resilient, with a large number of redundant paths between two processing nodes. The fat-tree topology is very popular for building medium and large system-area networks; in particular, it has been widely adopted by high-performance computing systems. The fat tree alleviates the traffic at upper levels by "fattening" up the links at the upper levels. [Chapter%2012_html_m3dc4e0d5.jpg|http://pg-server.csc.ncsu.edu/mediawiki/index.php/Image:Top_fat_tree.jpg]\\\\\\

  • Butterfly*

[Chapter%2012_html_m5d54c52c.jpg|http://pg-server.csc.ncsu.edu/mediawiki/index.php/Image:Top_butterfly.jpg]\\\\The butterfly structure is similar to the tree structure, but it replicates the switching-node structure of the tree topology and connects the copies together, so that there are equal numbers of links and routers at all levels.

  • MULTISTAGE NETWORKS*

A multistage network connects a number of processors to a number of memory banks via a number of switches organized in layers, as shown below:

\\

Chapter%2012_html_852fd46.png

Each switch can be in one of the following positions:

Chapter%2012_html_m2b1e2468.png

  • OMEGA*

Omega network connecting P processors to P memory banks (see below for P=8)

Chapter%2012_html_852fd46.png

An omega network has P/2 * log(P) switches, so its cost is lower than that of a crossbar network, which requires on the order of P^2 crosspoints.


The omega network belongs to the class of blocking networks:

Chapter%2012_html_m85a38d8.png

An omega network can be static: switches may remain in a fixed position\\(either straight-through or criss-cross). An omega network can also be used to connect processors to processors.

Real-World Implementation of Network Topologies

In a research study by Andy Hospodor and Ethan Miller, several network topologies were investigated in a high-performance, high-traffic network^2^, including the fat tree, butterfly, mesh, torus, and hypercube structures. Advantages and disadvantages, including cost, performance, and reliability, were discussed.

In this experiment, a petabyte-scale network with over 100 GB/s total aggregate bandwidth was investigated. The network consisted of 4096 disks and large servers, with routers and switches in between^2^.

The overall structure of the network is shown below. Note that this structure is very susceptible to failure and congestion. \\[Chapter%2012_html_m79c4c8c6.jpg|http://pg-server.csc.ncsu.edu/mediawiki/index.php/Image:Disknet_network.jpg]\\_Basic structure of Hospodor and Miller's experimental network_^2^

Fat Tree

In large-scale, high-performance applications, the fat tree can be a good choice. However, in order to "fatten" the links, redundant connections must be used: instead of one link between switching nodes, several must be used. The problem is that with more input and output links, one needs routers with more input and output ports. Routers with in excess of 100 ports are difficult to build and expensive, so multiple routers would have to be stacked together. Even so, the routers would be expensive, and several of them would be required^2^. \\\\The Japan Agency for Marine-Earth Science and Technology supercomputing system uses the fat tree topology. The system connects 1280 NEC processors^7^.

Butterfly

In high-performance applications, the butterfly structure is a good choice. The butterfly topology uses fewer links than other topologies; however, each link carries traffic from the entire layer. Fault tolerance is poor: there exists only a single path between any pair of nodes, so should a link break, data cannot be re-routed and communication is broken^2^. \\[Chapter%2012_html_6efad034.jpg|http://pg-server.csc.ncsu.edu/mediawiki/index.php/Image:Disknet_butterfly.jpg]\\_Butterfly structure_^2^

Meshes and Tori

The mesh and torus structures used in this application would require a large number of links and a total aggregate of several thousand ports. However, since there are so many links, the mesh and torus structures provide alternate paths in case of failures^2^. \\Some examples of current use of the torus structure include the QPACE SFB TR cluster in Germany, built from PowerXCell 8i processors; the system uses a 3-D torus topology with 4608 processors^7^. \\\\[Chapter%2012_html_3535d7da.jpg|http://pg-server.csc.ncsu.edu/mediawiki/index.php/Image:Disknet_mesh.jpg]\\_Mesh structure_^2^\\\\[Chapter%2012_html_m4158c08a.jpg|http://pg-server.csc.ncsu.edu/mediawiki/index.php/Image:Disknet_torus.jpg]\\_Torus structure_^2^

Hypercube

Similar to the torus structures, the hypercube requires a larger number of links. However, its bandwidth scales better than that of the mesh and torus structures. \\The Cray T3E, Cray XT3, and SGI Origin 2000 use k-ary n-cube topologies. \\\\[Chapter%2012_html_4f87ec3a.jpg|http://pg-server.csc.ncsu.edu/mediawiki/index.php/Image:Disknet_hypercube.jpg]\\_Hypercube structure_^2^

Comparison of Network Topologies

The following table shows the total number of ports required for each network topology. \\\\[Chapter%2012_html_meb70cf2.jpg|http://pg-server.csc.ncsu.edu/mediawiki/index.php/Image:Disknet_ports.jpg]\\_Number of ports for each topology_^2^

As the figure above shows, the 6-D hypercube requires the largest number of ports, due to its relatively complex six-dimensional structure. In contrast, the fat tree requires the fewest ports, even though its links have been "fattened" by using redundant links. The butterfly network requires more than twice the number of ports of the fat tree, since it essentially replicates the switching layer of the fat tree. The number of ports for the mesh and torus structures increases as the dimensionality increases. \\Below, the average path length (average number of hops) and the average link load (GB/s) are shown. \\\\[Chapter%2012_html_m4f2c2c2b.jpg|http://pg-server.csc.ncsu.edu/mediawiki/index.php/Image:Disknet_load.jpg]\\_Average path length and link load for each topology_^2^

Looking at the trends, when the average path length is high, the average link load is also high; in other words, average path length and average link load are proportionally related. It is obvious from the graph that the 2-D mesh has, by far, the worst performance. In a large network such as this, the average path length is simply too high, and the average link load suffers; for this type of high-performance network, the 2-D mesh does not scale well. The 2-D torus cuts the average path length and average link load in half by connecting the edge nodes together, but its performance is still relatively poor compared to the other types. The butterfly and fat tree have the lowest average path length and average link load. \\The figure below shows the cost of the network topologies. \\\\[Chapter%2012_html_a7666eb.jpg|http://pg-server.csc.ncsu.edu/mediawiki/index.php/Image:Disknet_cost.jpg]\\_Cost of each topology_^2^

Despite using the fewest ports, the fat tree topology has by far the highest cost. Although it uses the fewest ports, they are high-bandwidth 10 GB/s ports, and over 2400 of them are required to provide enough bandwidth at the upper levels of the tree. This pushes the cost up dramatically and makes the fat tree impractical from a cost standpoint: while its total cost is about 15 million dollars, the rest of the network topologies are clustered below 4 million dollars. As the dimensionality of the mesh and torus structures increases, the cost increases. The butterfly network's cost lies between the 2-D mesh/torus and the 6-D hypercube. \\When the cost and average link load are factored together, the following graph is produced. \\\\[Chapter%2012_html_2554459.jpg|http://pg-server.csc.ncsu.edu/mediawiki/index.php/Image:Disknet_overall.jpg]\\_Overall cost of each topology_^2^

From the figure above, the 6-D hypercube is the most cost-effective choice for this particular network setup. Although the 6-D hypercube costs more because it needs more links and ports, it provides higher bandwidth, which can offset the higher cost. The high-dimensional tori also perform well, but cannot provide as much bandwidth as the 6-D hypercube; for systems that do not need as much bandwidth, they are also a good choice. The butterfly topology is another alternative, but has lower fault tolerance.

Routing

The routing algorithm determines what path a packet of data will take from source to destination. Routing can be deterministic, where the path is always the same for a given source and destination, or adaptive, where the path can change. A routing algorithm can also be partially adaptive, where packets have multiple choices of path but not all packets are allowed to use the shortest path^3^.

Deadlock

Packets are in *deadlock* when they cannot continue to move through the nodes. The illustration below demonstrates this event. \\\\[Chapter%2012_html_75d0377.jpg|http://pg-server.csc.ncsu.edu/mediawiki/index.php/Image:Routing_deadlock.jpg]\\_Example of deadlock_\\\\Assume that all of the buffers are full at each node. The packet from Node 1 cannot continue to Node 2, the packet from Node 2 cannot continue to Node 3, and so on. Since no packet can move, the network is deadlocked. Deadlock arises from cyclic routing patterns, so to avoid deadlock, circular routing patterns must be avoided. To this end, some routing patterns are disallowed; these are called *turn restrictions*, where certain turns are forbidden in order to avoid creating a circular routing pattern. Some of these turn restrictions are described below.

Dimensional ordered (X-Y) routing

Turns from the Y dimension to the X dimension are not allowed. This scheme is used in 2-D meshes.
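Dimension-ordered routing can be sketched in a few lines of Python (an illustrative fragment; the function name and coordinate encoding are assumptions of this example). The packet fully corrects its X coordinate before touching Y, so a Y-to-X turn can never occur:

```python
def xy_route(src, dst):
    """Dimension-ordered (X-Y) routing on a 2-D mesh: travel fully
    along the X dimension first, then along Y. Returns the list of
    nodes visited, source and destination included."""
    (x, y), (dx, dy) = src, dst
    path = [(x, y)]
    while x != dx:                  # correct the X coordinate first
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:                  # then correct the Y coordinate
        y += 1 if dy > y else -1
        path.append((x, y))
    return path

print(xy_route((0, 0), (2, 2)))
# [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]
```

Because every packet follows this same X-then-Y discipline, the routing is deterministic and deadlock-free, but it cannot adapt around congestion.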

West First

Turns to the west are not allowed.

North Last

Turns after a north direction are not allowed.

Negative First

Turns in the negative direction (-x or -y) are not allowed, except on the first turn.

Odd-Even Turn Model

Unfortunately, the above turn-restriction models reduce the degree of adaptiveness and are only partially adaptive. They cause some packets to take different routes, and not necessarily minimal paths. This may cause unfairness and reduces the system's ability to relieve congestion, so overall performance can suffer^3^.

Ge-Ming Chiu introduced the odd-even turn model, an adaptive, deadlock-free turn-restriction model with better performance than the previously mentioned models^3^. The model is designed primarily for 2-D meshes.

_Turns from the east to north direction from any node on an even column are not allowed._\\_Turns from the north to west direction from any node on an odd column are not allowed._\\_Turns from the east to south direction from any node on an even column are not allowed._\\_Turns from the south to west direction from any node on an odd column are not allowed._\\The illustration below shows allowed routing for different source and destination nodes. Depending on which column the packet is in, only certain directions are allowed. \\\\[Chapter%2012_html_36dd4f08.jpg|http://pg-server.csc.ncsu.edu/mediawiki/index.php/Image:Routing_odd_even.jpg]\\_Odd-Even turn restriction model proposed by Ge-Ming Chiu_^3^
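The four odd-even restrictions can be captured in a small predicate (an illustrative sketch; the two-letter turn encoding, e.g. "EN" for an east-to-north turn, is an assumption of this example, not notation from Chiu's paper):

```python
def turn_allowed(turn, column):
    """Odd-even turn restrictions for a 2-D mesh:
    east-to-north (EN) and east-to-south (ES) turns are forbidden
    in even columns; north-to-west (NW) and south-to-west (SW)
    turns are forbidden in odd columns. All other turns are legal."""
    if column % 2 == 0:
        return turn not in ("EN", "ES")
    return turn not in ("NW", "SW")

print(turn_allowed("EN", 2))  # False -- even column
print(turn_allowed("EN", 3))  # True  -- odd column
print(turn_allowed("NW", 3))  # False -- odd column
```

A router would consult such a predicate when choosing among candidate output ports, keeping several legal choices per packet, which is what makes the model adaptive while still breaking all cyclic turn sequences.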


Comparison of Turn Restriction Models

To simulate the performance of various turn restriction models, Chiu simulated a 15 x 15 mesh under various traffic patterns. All channels have bandwidth of 20 flits/usec and have a buffer size of one flit. The dimension-ordered x-y routing, west-first, and negative-first models were compared against the odd-even model.

Traffic patterns including uniform, transpose, and hot spot were simulated. Uniform traffic has each node sending messages to every other node with equal probability. Transpose simulates two opposite nodes sending messages to their respective halves of the mesh. Hot spot simulates a few "hot spot" nodes that receive high traffic.

\\[Chapter%2012_html_61c3f857.jpg|http://pg-server.csc.ncsu.edu/mediawiki/index.php/Image:Routing_uniform.jpg]\\_Uniform traffic simulation of various turn restriction models_^3^\\\\The performance of the different routing algorithms is shown above for the uniform traffic. For uniform traffic, the dimensional ordered x-y model outperforms the rest of the models. As the number of messages increase, the x-y model has the "slowest" increase in average communication latency. \\\\[Chapter%2012_html_m621ec0cf.jpg|http://pg-server.csc.ncsu.edu/mediawiki/index.php/Image:Routing_transpose.jpg]\\_First transpose traffic simulation of various turn restriction models_^3^\\The performance of the different routing algorithms is shown above for the first transpose traffic. The negative-first model has the best performance, while the odd-even model performs better than the west-first and x-y models. [Chapter%2012_html_64e22c1c.jpg|http://pg-server.csc.ncsu.edu/mediawiki/index.php/Image:Routing_transpose2.jpg]\\_Second transpose traffic simulation of various turn restriction models_^3^

With the second transpose simulation, the odd-even model outperforms the rest.

\\\\[Chapter%2012_html_4b5295cc.jpg|http://pg-server.csc.ncsu.edu/mediawiki/index.php/Image:Routing_hotspot.jpg]\\_Hotspot traffic simulation of various turn restriction models_^3^\\The performance of the different routing algorithms is shown above for the hotspot traffic. Only one hotspot was simulated for this test. The odd-even model outperforms the other models when hotspot traffic is 10%. \\\\[Chapter%2012_html_5ab47e62.jpg|http://pg-server.csc.ncsu.edu/mediawiki/index.php/Image:Routing_hotspot2.jpg]\\_Second hotspot traffic simulation of various turn restriction models_^3^\\When the number of hotspots is increased to five, the odd-even model begins to shine. Its latency is lowest for both 6 and 8 percent hotspot traffic. Meanwhile, the performance of the x-y model is horrendous.

While the x-y model performs well in uniform traffic, it lacks adaptiveness. When traffic becomes hotspot, the x-y model suffers from the inability to adapt and re-route traffic to avoid the congestion caused by hotspots. The odd-even model has superior adaptiveness under high congestion.

Router Architecture

The router is a device that routes incoming data to its destination. It does this by having several input ports and several output ports; data incoming on one of the input ports is routed to one of the output ports. Which output port is chosen depends on the destination of the data and the routing algorithm. \\\\The internal architecture of a router consists of input and output ports and a crossbar switch. The crossbar switch selects which output port the data should be sent to, acting essentially as a multiplexer. Router technology has improved significantly over the years, which has allowed networks with high dimensionality to become feasible. As shown in the real-world example above, high-dimensional tori and hypercubes are excellent choices of topology for high-performance networks. The availability of high-performance, high-radix routers has contributed to the viability of these types of high-dimensionality networks. As the graph below shows, the bandwidth of routers has improved tremendously over a period of 10 years^4^.

\\[Chapter%2012_html_7a96163b.jpg|http://pg-server.csc.ncsu.edu/mediawiki/index.php/Image:Router_bandwidth.jpg]_Bandwidth of various routers over 10 year period_^4^

Looking at the physical architecture and layout of routers, it is evident that the circuitry has become dramatically more dense and complex.

\\[Chapter%2012_html_m7465c12f.jpg|http://pg-server.csc.ncsu.edu/mediawiki/index.php/Image:Router_physical.jpg]\\_Router hardware over period of time_^4^

\\[Chapter%2012_html_m57de6676.jpg|http://pg-server.csc.ncsu.edu/mediawiki/index.php/Image:Router_radix.jpg]\\_Radix and latency of routers over 10 year period_^4^

The *radix*, or number of ports, of routers has also increased. Current routers not only have high radix, but also lower latency than the last generation; as radix increases, the latency remains steady.

With high-performance routers, complex topologies are possible. As the router technology improves, more complex, high-dimensionality topologies are possible.

Fault Tolerant Routing

Fault-tolerant routing means the successful routing of messages between any pair of non-faulty nodes in the presence of faulty components^6^. With the increased number of processors in multiprocessor systems and high data rates, reliable transmission of data in the event of a network fault is of great concern; hence fault-tolerant routing algorithms are important.

Fault Models

Faults in a network can be categorized into two types: \\1. *Transient faults*^5^: A transient fault is a temporary fault that occurs for a very short duration of time. It can be caused by a change in the output of a flip-flop leading to the generation of an invalid header. These faults can be minimized using error-controlled coding, and are generally evaluated in terms of the bit error rate. \\2. *Permanent faults*^5^: A permanent fault is a fault that does not go away and causes permanent damage to the network. It can be due to damaged wires and associated circuitry. These faults are generally evaluated in terms of the mean time between failures.

Fault Tolerance Mechanisms (for permanent faults)

Permanent faults can be handled using one of two mechanisms: \\1. *Static mechanisms*: In the static fault tolerance model, once a fault is detected, all the processes running in the system are stopped and the routing tables are emptied. Based on the information about the faults, the routing tables are recalculated to provide a fault-free path. \\2. *Dynamic mechanisms*: In the dynamic fault tolerance model, the operation of the processes in the network is not completely stalled, and only the affected regions are cured. Some of the methods to do this are: \\a. *Block faults*: In this method, many of the healthy nodes in the vicinity of the faulty nodes are marked as faulty so that no routes are created close to the actual faulty nodes. The shape of the region can be convex or non-convex, and it is ensured that none of the new routes introduce a cyclic dependency in the channel dependency graph (CDG). \\*Disadvantage*: This method causes a lot of healthy nodes to be declared faulty, leading to a reduction in system capacity. \\[Chapter%2012_html_m634e3930.jpg|http://pg-server.csc.ncsu.edu/mediawiki/index.php/Image:Fault_pic1.jpg]\\b. *Fault rings*: This method was introduced by Chalasani and Boppana. A fault ring is the set of nodes and links adjacent to the faulty nodes/links. This approach reduces the number of healthy nodes that must be marked as faulty and blocked. \\\\[Chapter%2012_html_62ea40b2.jpg|http://pg-server.csc.ncsu.edu/mediawiki/index.php/Image:Fault_pic2.jpg]


  • Metrics for Interconnection Networks*

• *Diameter*: the longest distance between any two nodes in the network

• *Bisection Width*: the minimum number of wire cuts needed to divide the network into two halves

• *Cost*: the number of links or switches (whichever is asymptotically higher)
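These metrics can be computed for a few of the topologies discussed above with a short Python sketch (the function name and topology keywords are illustrative; the formulas follow the standard characteristics table for p-processor networks):

```python
import math

def metrics(topology, p):
    """Return (diameter, bisection width) for a p-node network."""
    if topology == "linear":
        return p - 1, 1
    if topology == "ring":
        return p // 2, 2
    if topology == "mesh2d":            # sqrt(p) x sqrt(p), no wraparound
        k = math.isqrt(p)
        return 2 * (k - 1), k
    if topology == "hypercube":
        n = int(math.log2(p))
        return n, p // 2                # diameter log2(p), bisection p/2
    raise ValueError(topology)

for t in ("linear", "ring", "mesh2d", "hypercube"):
    print(t, metrics(t, 16))
# linear (15, 1)  ring (8, 2)  mesh2d (6, 4)  hypercube (4, 8)
```

For 16 processors, the hypercube's diameter of 4 hops and bisection width of 8 illustrate why low diameter and high bisection bandwidth come at the cost of the highest link count.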

\\

  • Characteristics of Networks with p Processors *

Network                  | Diameter           | Bisection Width | Arc Connectivity | Number of Links
-------------------------|--------------------|-----------------|------------------|----------------
Completely-connected     | 1                  | p^2/4           | p-1              | p(p-1)/2
Star                     | 2                  | 1               | 1                | p-1
Complete binary tree     | 2 log((p+1)/2)     | 1               | 1                | p-1
Linear array             | p-1                | 1               | 1                | p-1
Ring                     | floor(p/2)         | 2               | 2                | p
2-D mesh, no wraparound  | 2(sqrt(p)-1)       | sqrt(p)         | 2                | 2(p-sqrt(p))
2-D mesh with wraparound | 2 floor(sqrt(p)/2) | 2 sqrt(p)       | 4                | 2p
Hypercube                | log p              | p/2             | log p            | (p log p)/2


References

  1. Yan Solihin. Fundamentals of Parallel Computer Architecture.
  2. Comer, Douglas E. Computer Networks and Internets.
  3. Comer, Douglas E. Internetworking with TCP/IP. Volume I.
  4. Dally, William James, Towles Brian. Principles and Practices of Interconnection Networks.
  5. [Interconnection Architectures for Petabyte-Scale High-Performance Storage Systems|http://www.ssrc.ucsc.edu/Papers/hospodor-mss04.pdf]
  6. [The Odd-Even Turn Model for Adaptive Routing|http://www.diit.unict.it/~vcatania/COURSES/semm_05-06/DOWNLOAD/noc_routing02.pdf]
  7. [Interconnection Topologies:(Historical Trends and Comparisons)|http://www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf]
  8. [Efficient mechanisms to provide fault tolerance in interconnection networks for PC clusters, José Miguel Montañana Aliaga|http://dspace.upv.es/xmlui/bitstream/handle/10251/2603/tesisUPV2824.pdf?sequence=1]
  9. [Adaptive Fault Tolerant Routing Algorithm for Tree-Hypercube Multicomputer, Qatawneh Mohammad|http://web.ebscohost.com.www.lib.ncsu.edu:2048/ehost/pdfviewer/pdfviewer?vid=2&hid=15&sid=72e3828d-3cb1-42b9-8198-5c1e974ea53f@sessionmgr4]
  10. [TOP500 Supercomputing Sites|http://www.top500.org/]


Interconnection Network Topologies and Routing Algorithms

  • CHAPTER 11 and 12 Reviews*

It does not discuss the evolution of interconnection networks, just gives a list of topologies. Which were developed first? What shortcomings did the newer network topologies attempt to address?

Although it has a couple of tables of interconnection-network characteristics, they are at the end, and not near the description of the networks themselves.

The textbook itself (p. 370) has a table of interconnection-network characteristics. In a supplement, it would be helpful to explain, clearly and concisely, how the expressions for diameter, bisection width, etc. have been derived.

To read the description, you would think that 2D meshes are uncompetitive. Yet they dominate, and for good reason. This needs to be explained.


  • Index*
  1. [Interconnection Network Architecture|http://pg-server.csc.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Spring_2010/12_EC#Interconnection_Network_Architecture]
    1. Interconnection networks: What? Where? Why?
    2. History of Interconnected Networks
  2. [Types of Network Topologies|http://pg-server.csc.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Spring_2010/12_EC#Types_of_Network_Topologies]
    1. [Linear Array|http://pg-server.csc.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Spring_2010/12_EC#Linear_Array]
    2. [Ring|http://pg-server.csc.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Spring_2010/12_EC#Ring]
    3. [2-D Mesh|http://pg-server.csc.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Spring_2010/12_EC#2-D_Mesh]
    4. [2-D Torus|http://pg-server.csc.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Spring_2010/12_EC#2-D_Torus]
    5. [Cube|http://pg-server.csc.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Spring_2010/12_EC#Cube]
    6. [Hypercube|http://pg-server.csc.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Spring_2010/12_EC#Hypercube]
    7. [Tree|http://pg-server.csc.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Spring_2010/12_EC#Tree]
    8. [Fat Tree|http://pg-server.csc.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Spring_2010/12_EC#Fat_Tree]
    9. [Butterfly|http://pg-server.csc.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Spring_2010/12_EC#Butterfly]
  3. [Real-World Implementation of Network Topologies|http://pg-server.csc.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Spring_2010/12_EC#Real-World_Implementation_of_Network_Topologies]
    1. [Fat Tree|http://pg-server.csc.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Spring_2010/12_EC#Fat_Tree_2]
    2. [Butterfly|http://pg-server.csc.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Spring_2010/12_EC#Butterfly_2]
    3. [Meshes and Tori|http://pg-server.csc.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Spring_2010/12_EC#Meshes_and_Tori]
    4. [Hypercube|http://pg-server.csc.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Spring_2010/12_EC#Hypercube_2]
  4. [Comparison of Network Topologies|http://pg-server.csc.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Spring_2010/12_EC#Comparison_of_Network_Topologies]
  5. [Routing|http://pg-server.csc.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Spring_2010/12_EC#Routing]
    1. [Deadlock|http://pg-server.csc.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Spring_2010/12_EC#Deadlock]
    2. [Dimensional ordered (X-Y) routing|http://pg-server.csc.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Spring_2010/12_EC#Dimensional_ordered_.28X-Y.29_routing]
    3. [West First|http://pg-server.csc.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Spring_2010/12_EC#West_First]
    4. [North Last|http://pg-server.csc.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Spring_2010/12_EC#North_Last]
    5. [Negative First|http://pg-server.csc.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Spring_2010/12_EC#Negative_First]
    6. [Odd-Even Turn Model|http://pg-server.csc.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Spring_2010/12_EC#Odd-Even_Turn_Model]
  6. [Comparison of Turn Restriction Models|http://pg-server.csc.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Spring_2010/12_EC#Comparison_of_Turn_Restriction_Models]
  7. [Router Architecture|http://pg-server.csc.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Spring_2010/12_EC#Router_Architecture]
  8. [Fault Tolerant Routing|http://pg-server.csc.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Spring_2010/12_EC#Fault_Tolerant_Routing]
    1. [Fault Models|http://pg-server.csc.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Spring_2010/12_EC#Fault_Models]
    2. [Fault Tolerance Mechanisms (for permanent faults)|http://pg-server.csc.ncsu.edu/mediawiki/index.php/CSC/ECE_506_Spring_2010/12_EC#Fault_Tolerance_Mechanisms_.28for_permanent_faults.29]
  9. Metrics for Interconnection Networks
  10. Characteristics of Networks with p Processors
  11. References


[] Interconnection Network Architecture

  • Interconnection networks: What? Where? Why?*

An interconnection network is a programmable system that transports data between terminals. All the terminals are connected to the interconnection network. When one terminal wishes to send data to another, it injects a message containing the data into the network, and the network delivers the message to the destination terminal. The network is programmable because it makes different connections at different points in time. It is a system because it is composed of many components:

  • Buffers
  • Channels
  • Switches
  • Controls

These components work together to deliver data.

Interconnection networks are used in most digital systems that are large enough to have two components to connect. The most common examples are computer systems and communication switches. In computer systems, they connect processors to memories and I/O devices to I/O controllers. In the 1980s, most applications were served by very simple interconnection networks, for example, the multi-drop bus. However, this has changed. Today all high-performance interconnections are performed by point-to-point interconnection networks rather than buses, and more systems that have historically been bus-based switch to networks every year. This is because of non-uniform performance scaling: the demand for interconnection performance increases with processor performance and network bandwidth. Buses are unable to keep up with this bandwidth demand, so designs have moved to point-to-point interconnection networks, which both operate faster than buses and offer concurrency.

Interconnection networks are limiting factors in many systems. The interconnection network between processor and memory largely determines the memory latency and memory bandwidth. The performance of the interconnection network in a communication switch largely determines the capacity of the switch. Because the demand for interconnection has grown more rapidly than the capability of the underlying wires, interconnection has become a critical bottleneck in most systems.


  • History of Interconnection Networks*

Networks developed along three main parallel threads:

  • Telephone switching networks
  • Inter-processor communication
  • Processor-memory interconnect

Early telephone networks were built from electro-mechanical crossbars or electro-mechanical step-by-step switches. As late as the 1980s, most local telephone switches were still built from electro-mechanical relays, although toll switches were completely electronic and digital by that time. Key developments in telephone switching were the non-blocking, multistage Clos network in 1953 and the Benes network in 1962. Many large telephone switches today are still built from Clos or Clos-like networks.

The first inter-processor interconnection networks were connections between the registers of neighboring processors connected in 2-D arrays, for example the Solomon machine developed in 1962. These early networks performed no routing; therefore, processors had to explicitly relay communications to non-neighbors, making for poor performance and considerable programming complexity. By the mid-1980s, router chips had been developed to forward messages through intermediate nodes without processor intervention, for example the Torus Routing Chip.

Inter-processor interconnection networks have gone through a series of topology fads over the years, largely motivated by packaging and other technology constraints. The early machines, like Solomon, Illiac, and MPP, were based on simple 2-D mesh or torus networks because of their physical regularity. Starting in the late 1970s, binary n-cube or hypercube networks became popular because of their low diameter, for example the Cosmic Cube and the nCUBE machines. In the mid-1980s it was shown that under realistic packaging constraints low-dimensional networks outperformed hypercubes, and most machines returned to 2-D and 3-D mesh or torus networks. Consequently most networks built over the last decade have returned to these topologies, including the J-machine, the Cray T3D and T3E, and the Intel DELTA. Today, the high pin bandwidth of router chips relative to message length motivates the use of networks with much higher node degree, such as butterfly and Clos networks.

Processor-memory interconnection networks emerged in the late 1960s, when parallel processor systems incorporated alignment networks to allow any processor to access any memory bank without burdening the other processors. The smallest machines employed crossbar switches for this purpose, whereas larger machines used networks with a butterfly topology in a dance-hall arrangement. Variations on this theme were used through the 1980s for many shared-memory parallel processors.

These three threads of interconnection network evolution have recently merged. Since the 1990s, there has been little difference in the design of processor-memory and inter-processor interconnection networks; in fact, the same router chips have been used for both. A variant of the Clos and Benes networks of telephony has also emerged in multiprocessor networks in the form of the fat-tree topology.

Interconnection Network Topology

In a multi-processor system, processors need to communicate with each other and access each other's resources. In order to route data and messages between processors, an interconnection architecture is needed.

Typically, in a multiprocessor system, messages passed between processors are frequent and short^1^. Therefore, the interconnection network architecture must handle messages quickly by having low latency, and must handle several messages at a time, i.e., have high bandwidth.

In a network, a processor along with its cache and memory is considered a node. A physical wire that connects two of them is called a link. The device that routes messages between nodes is called a router. The shape of the network, such as the arrangement of links and routers, is called the network topology. Topology is the pattern in which the individual switches of the network are connected to other switches and to processors and memories (nodes). Direct topologies connect each switch directly to a node, while in indirect topologies at least some of the switches connect only to other switches. Direct networks have changed little over time: the 2-D mesh of the 1970s-era Illiac IV is similar to the 1990s-era 2-D mesh of the Intel Paragon and the toroidal 3-D meshes of the Cray T3D and T3E.

The indirect cube network, often referred to as a multistage interconnection network or MIN, has been used in a variety of machines. This topology was used in the Staran SIMD machine of the 1970s, in which data would traverse the network from one side to the other. Allowing data to reverse direction at any stage in a bidirectional MIN (or BMIN) leads to a variation sometimes called a fat-tree. MINs and their fat-tree variants are used in MIMD machines of the 1990s such as the Meiko CS-2, the IBM SP2, and the Thinking Machines CM-5. Direct networks often excel at routing local traffic patterns, such as the passing of boundary data in grids. Indirect networks (such as MINs) can provide a variety of global communication paths by passing through the multiple stages of switches.

[] Types of Network Topologies

[] Linear Array

Chapter%2012_html_1fb57514.png\\\\The nodes are connected linearly as in an array. This type of topology is simple; however, it does not scale well. The longest distance between two nodes, or the diameter, is p-1 for p nodes, so it grows linearly with the network size. A linear bus makes it easy to connect a computer or peripheral and requires less cable length than a star topology. A disadvantage of the linear bus topology is that the entire network shuts down if there is a break in the main cable, and terminators are required at both ends of the backbone cable. It is also difficult to identify the problem when the entire network shuts down. Linear topology is not meant to be used as a stand-alone solution in a large building.

[] Star

Chapter%2012_html_394ca1a2.png

Many home networks use star topology. A star network features a central connection point called a "hub" that can be switch, router or hub and terminals connected to it. Devices typically connect to the hub with Unshielded Twisted Pair (UTP) Ethernet.

\\

Compared to the bus topology, a star network generally requires more cable, but a failure in any star network cable will only take down one computer's network access and not the entire LAN. (If the hub fails, however, the entire network goes down.) The star topology is easy to install and wire. There are no disruptions to the network when connecting or removing devices, and it is easy to detect faults and to remove parts.

Disadvantages of the star topology are that it requires more cable length than a linear topology and is more expensive than linear bus topologies because of the cost of the hubs. If the hub, switch, or concentrator fails, the nodes attached to it are disabled.

\\

Ring

[Chapter%2012_html_f1d5830.jpg|http://pg-server.csc.ncsu.edu/mediawiki/index.php/Image:Top_ring.jpg]\\A network that uses a ring topology arranges for computers to be connected in a closed loop: a cable connects the first computer to a second computer, another cable connects the second computer to a third, and so on, until a cable connects the final computer back to the first.

Similar structure as the linear array, except, the ending nodes connect to each other, establishing a circular structure. The longest distance between two nodes is cut in half. A failure in any cable or device breaks the loop and can take down the entire network.

Mesh

The mesh topology incorporates a unique network design in which each computer on the network connects to every other, creating a point-to-point connection between every device on the network. The purpose of the mesh design is to provide a high level of redundancy. If one network cable fails, the data always has an alternative path to get to its destination. Figure 6 shows the mesh topology.

Chapter%2012_html_m57a68344.png

The wiring for a mesh network can be very complicated. Further, the cabling costs associated with the mesh topology can be high, and troubleshooting a failed cable can be tricky. Because of this, the mesh topology is rarely used. A variation on a true mesh topology is the hybrid mesh, which creates a redundant point-to-point network connection between only specific network devices. The hybrid mesh is most often seen in WAN implementations. An advantage of the mesh topology is that it provides redundant paths between devices, and the network can be expanded without disruption to current users. Its disadvantages are that it requires more cable than the other LAN topologies and is complicated to manage.

[] 2-D Mesh

[Chapter%2012_html_450973b.jpg|http://pg-server.csc.ncsu.edu/mediawiki/index.php/Image:Top_2Dmesh.jpg]\\The 2-D mesh can be thought of as several linear arrays put together to form a 2-dimensional structure. Nodes that are not on the edge have 4 input or output links, or a degree of 4.

[] Designers like 2-D meshes due to the easy wiring layout; a 2-D mesh has uniform and low wiring density throughout. A low wiring density means that no stringent constraints are placed on channel width. 2-D mesh networks are very simple and lead to very short wires in the architecture. The 2-D mesh is a very popular topology in networks-on-chip due to its straightforward implementation, the simplicity of the XY routing strategy, and the network scalability. On the other hand, the 2-D mesh has some disadvantages, such as a long network diameter as well as energy inefficiency because of the extra hops.

2-D Torus

[Chapter%2012_html_437ff2cb.jpg|http://pg-server.csc.ncsu.edu/mediawiki/index.php/Image:Top_2Dtorus.jpg]\\\\The 2-D torus takes the structure of the 2-D mesh and connects the nodes on the edges. This decreases the diameter, but the number of links is higher. Adding wrap-around links to a mesh creates a torus topology, which decreases the average and maximum hop counts and doubles the bisection bandwidth. The wrap-around links, however, also double the number of wiring channels per tile edge to 2. The disadvantage of long wires that span the length of the die is overcome by the technique of "folding", which yields a maximum wire length spanning only 2 tiles.

[] Cube

[Chapter%2012_html_61aeb58d.jpg|http://pg-server.csc.ncsu.edu/mediawiki/index.php/Image:Top_cube.jpg]\\The cube can be thought of as a three-dimensional mesh.

[] Hypercube

[Chapter%2012_html_m250307c.jpg|http://pg-server.csc.ncsu.edu/mediawiki/index.php/Image:Top_hypercube.jpg]\\The hypercube is essentially multiple cubes put together.


Tree

The tree is a hierarchical structure with nodes at the bottom and switching nodes at the upper levels. The tree experiences high traffic at the upper levels. Tree topologies integrate multiple star topologies together onto a bus. In its simplest form, only hub devices connect directly to the tree bus, and each hub functions as the "root" of a tree of devices. This bus/star hybrid approach supports future expandability of the network much better than a bus (limited in the number of devices due to the broadcast traffic it generates) or a star (limited by the number of hub connection points) alone.

[Chapter%2012_html_m48b3cb4.jpg|http://pg-server.csc.ncsu.edu/mediawiki/index.php/Image:Top_tree.jpg]\\In a tree network there is only one path between any two nodes. The taller the tree, the worse the communication bottleneck at the upper levels of the tree. Two remedies are possible:

Chapter%2012_html_m6d0d725c.png

We may have (a) static tree networks, or (b) dynamic tree networks. Alternatively, we may introduce fat tree networks (see below).

[] Fat Tree

[] In the fat-tree topology, the tree becomes fatter as one moves up towards the root. The fat tree alleviates the traffic at the upper levels by "fattening" up the links there. This topology has many properties that make it attractive for large-scale interconnects and system-area networks. Most importantly, the bisection bandwidth of the fat-tree topology scales linearly with the network size. The topology is also inherently highly resilient, with a large number of redundant paths between two processing nodes. The fat-tree topology is very popular for building medium and large system-area networks; in particular, it has been widely adopted by high-performance computing systems. [Chapter%2012_html_m3dc4e0d5.jpg|http://pg-server.csc.ncsu.edu/mediawiki/index.php/Image:Top_fat_tree.jpg]\\\\\\

  • Butterfly*

[Chapter%2012_html_m5d54c52c.jpg|http://pg-server.csc.ncsu.edu/mediawiki/index.php/Image:Top_butterfly.jpg]\\\\The butterfly structure is similar to the tree structure, but it replicates the switching node structure of the tree topology and connects them together so that there are equal links and routers at all levels.

  • MULTISTAGE NETWORKS*

A multistage network connects a number of processors to a number of memory banks via a number of switches organized in layers, as shown below:

\\

Chapter%2012_html_852fd46.png

Each switch can be in one of the following positions:

Chapter%2012_html_m2b1e2468.png

  • OMEGA*

An Omega network connects P processors to P memory banks (shown below for P = 8).

Chapter%2012_html_852fd46.png

An Omega network has (P/2) * log2(P) switches, so the cost of this network is lower than that of a crossbar network.


The Omega network belongs to the class of blocking networks: Chapter%2012_html_m85a38d8.png

An Omega network can be static: switches may remain in a fixed position (either straight-through or criss-cross). An Omega network can also be used to connect processors to processors.
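A useful property of the Omega network, implied by its shuffle-exchange structure, is destination-tag self-routing: at each of the log2(P) stages the address is perfect-shuffled (rotated left one bit) and the switch exchange is set by the next destination bit, MSB first. A minimal sketch (the function name `omega_path` is my own) that traces the intermediate terminal labels:

```python
# Destination-tag routing through an Omega network with P = 2^n terminals.
# Each stage: perfect shuffle (rotate address left), then the switch
# exchange substitutes the next destination bit (MSB first) as the low bit.
def omega_path(src, dst, P):
    n = P.bit_length() - 1                   # number of stages; P must be 2^n
    path = [src]
    node = src
    for stage in range(n):
        shuffled = ((node << 1) | (node >> (n - 1))) & (P - 1)  # perfect shuffle
        bit = (dst >> (n - 1 - stage)) & 1                      # next dest bit
        node = (shuffled & ~1) | bit                            # switch exchange
        path.append(node)
    return path

print(omega_path(5, 3, 8))   # [5, 2, 5, 3]: after 3 stages we arrive at 3
```

After n stages every source bit has been rotated out and replaced by a destination bit, so the packet always arrives, with no routing tables needed; blocking arises only when two packets need the same switch output at the same time.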

[] Real-World Implementation of Network Topologies

In a research study by Andy Hospodor and Ethan Miller, several network topologies were investigated in a high-performance, high-traffic network^2^, including the fat tree, butterfly, mesh, torus, and hypercube structures. Advantages and disadvantages involving cost, performance, and reliability were discussed.

In this experiment, a petabyte-scale network with over 100 GB/s total aggregate bandwidth was investigated. The network consisted of 4096 disks and large servers, with routers and switches in between^2^.

The overall structure of the network is shown below. Note that this structure is very susceptible to failure and congestion. \\[Chapter%2012_html_m79c4c8c6.jpg|http://pg-server.csc.ncsu.edu/mediawiki/index.php/Image:Disknet_network.jpg]\\_Basic structure of Hospodor and Miller's experimental network_^2^

[] Fat Tree

In large-scale, high-performance applications, the fat tree can be a good choice. However, in order to "fatten" up the links, redundant connections must be used: instead of one link between switching nodes, several must be used. The problem is that with more input and output links, one needs routers with more input and output ports. Routers with in excess of 100 ports are difficult to build and expensive, so multiple routers would have to be stacked together. Still, the routers would be expensive, and several of them would be required^2^. \\\\The Japan Agency for Marine-Earth Science and Technology supercomputing system uses the fat-tree topology. The system connects 1280 processors using NEC processors^7^.

[] Butterfly

In high-performance applications, the butterfly structure is a good choice. The butterfly topology uses fewer links than other topologies; however, each link carries traffic from the entire layer. Fault tolerance is poor: there exists only a single path between each pair of nodes, so should a link break, data cannot be re-routed and communication is broken^2^. \\[Chapter%2012_html_6efad034.jpg|http://pg-server.csc.ncsu.edu/mediawiki/index.php/Image:Disknet_butterfly.jpg]\\_Butterfly structure_^2^

[] Meshes and Tori

The mesh and torus structures used in this application would require a large number of links and an aggregate of several thousand ports. However, since there are so many links, the mesh and torus structures provide alternate paths in case of failures^2^. \\An example of current use of the torus structure is the QPACE SFB TR cluster in Germany, using PowerXCell 8i processors. The system uses a 3-D torus topology with 4608 processors^7^. \\\\[Chapter%2012_html_3535d7da.jpg|http://pg-server.csc.ncsu.edu/mediawiki/index.php/Image:Disknet_mesh.jpg]\\_Mesh structure_^2^\\\\[Chapter%2012_html_m4158c08a.jpg|http://pg-server.csc.ncsu.edu/mediawiki/index.php/Image:Disknet_torus.jpg]\\_Torus structure_^2^

[] Hypercube

Similar to the torus structures, the hypercube requires a larger number of links. However, its bandwidth scales better than that of the mesh and torus structures. \\The Cray T3E, Cray XT3, and SGI Origin 2000 use k-ary n-cube topologies. \\\\[Chapter%2012_html_4f87ec3a.jpg|http://pg-server.csc.ncsu.edu/mediawiki/index.php/Image:Disknet_hypercube.jpg]\\_Hypercube structure_^2^

[] Comparison of Network Topologies

The following table shows the total number of ports required for each network topology. \\\\[Chapter%2012_html_meb70cf2.jpg|http://pg-server.csc.ncsu.edu/mediawiki/index.php/Image:Disknet_ports.jpg]\\_Number of ports for each topology_^2^

As the figure above shows, the 6-D hypercube requires the largest number of ports, due to its relatively complex six-dimensional structure. In contrast, the fat tree requires the fewest ports, even though its links have been "fattened" up with redundant links. The butterfly network requires more than twice the number of ports of the fat tree, since it essentially replicates the switching layer of the fat tree. The number of ports for the mesh and torus structures increases as the dimensionality increases. \\Below, the average path length, or average number of hops, and the average link load (GB/s) are shown. \\\\[Chapter%2012_html_m4f2c2c2b.jpg|http://pg-server.csc.ncsu.edu/mediawiki/index.php/Image:Disknet_load.jpg]\\_Average path length and link load for each topology_^2^

Looking at the trends, when the average path length is high, the average link load is also high; in other words, average path length and average link load are proportionally related. It is obvious from the graph that the 2-D mesh has by far the worst performance. In a large network such as this, the average path length is simply too high, and the average link load suffers. For this type of high-performance network, the 2-D mesh does not scale well. Likewise, the 2-D torus cuts the average path length and average link load in half by connecting the edge nodes together, but its performance is still relatively poor compared to the other types. The butterfly and fat tree have the lowest average path length and average link load. \\The figure below shows the cost of the network topologies. \\\\[Chapter%2012_html_a7666eb.jpg|http://pg-server.csc.ncsu.edu/mediawiki/index.php/Image:Disknet_cost.jpg]\\_Cost of each topology_^2^

Despite using the fewest ports, the fat-tree topology has by far the highest cost, because its ports are high-bandwidth ports of 10 GB/s. Over 2400 ports of 10 GB/s are required to have enough bandwidth at the upper levels of the tree. This pushes the cost up dramatically, and from a cost standpoint it is impractical. While the total cost of the fat tree is about 15 million dollars, the rest of the network topologies are clustered below 4 million dollars. As the dimensionality of the mesh and torus structures increases, the cost increases. The butterfly network costs between the 2-D mesh/torus and the 6-D hypercube. \\When the cost and average link load are factored together, the following graph is produced. \\\\[Chapter%2012_html_2554459.jpg|http://pg-server.csc.ncsu.edu/mediawiki/index.php/Image:Disknet_overall.jpg]\\_Overall cost of each topology_^2^

From the figure above, the 6-D hypercube is the most cost-effective choice for this particular network setup. Although the 6-D hypercube costs more because it needs more links and ports, it provides higher bandwidth, which offsets the higher cost. The high-dimensional tori also perform well, but cannot provide as much bandwidth as the 6-D hypercube; for systems that do not need as much bandwidth, they are also a good choice. The butterfly topology is an alternative as well, but has lower fault tolerance.
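The relationship between dimensionality and average path length discussed above can be checked by brute force on small instances. The sketch below (names are illustrative) runs all-pairs BFS over two 64-node topologies, an 8x8 mesh and a 6-cube, and confirms that the hypercube's average hop count is markedly lower:

```python
from collections import deque
from itertools import product

# Average hop count over all ordered node pairs, via BFS from every source.
def bfs_avg(neighbors, nodes):
    total = pairs = 0
    for s in nodes:
        dist = {s: 0}
        q = deque([s])
        while q:
            u = q.popleft()
            for v in neighbors(u):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1              # exclude the source itself
    return total / pairs

mesh_nodes = list(product(range(8), range(8)))
def mesh_nbrs(n):                           # 8x8 mesh, no wraparound
    x, y = n
    cand = [(x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)]
    return [(a, b) for a, b in cand if 0 <= a < 8 and 0 <= b < 8]

cube_nodes = list(range(64))
def cube_nbrs(n):                           # 6-cube: flip each address bit
    return [n ^ (1 << i) for i in range(6)]

print(bfs_avg(mesh_nbrs, mesh_nodes))       # ~5.33 hops for the 8x8 mesh
print(bfs_avg(cube_nbrs, cube_nodes))       # ~3.05 hops for the 6-cube
```

With the same per-link bandwidth, roughly 5.33/3.05 times as much traffic crosses each mesh link on average, which is the path-length/link-load proportionality the study observed.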

[] Routing

[] The routing algorithm determines the path a packet of data takes from source to destination. Routing can be deterministic, where the path is always the same for a given source and destination, or adaptive, where the path can change. A routing algorithm can also be partially adaptive, where packets have multiple route choices but not all packets are allowed to use the shortest path^3^.

Deadlock

[] Packets are in *deadlock* when they cannot continue to move through the nodes. The illustration below demonstrates this event. \\\\[Chapter%2012_html_75d0377.jpg|http://pg-server.csc.ncsu.edu/mediawiki/index.php/Image:Routing_deadlock.jpg]\\_Example of deadlock_\\\\Assume that all of the buffers are full at each node. The packet from Node 1 cannot continue to Node 2, the packet from Node 2 cannot continue to Node 3, and so on. Since no packet can move, they are deadlocked. Deadlock arises from cyclic routing patterns, so to avoid deadlock, circular routing patterns must be avoided. To this end, some routing patterns are disallowed; these are called *turn restrictions*, where certain turns are not allowed in order to avoid forming a circular routing pattern. Some of these turn restrictions are described below.
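The cyclic nature of deadlock can be made concrete with a channel-dependency view (this framing follows the classic Dally and Seitz result that a routing function is deadlock-free if its channel dependency graph is acyclic). A toy sketch, with hypothetical channel labels of my own choosing, checks a unidirectional 4-node ring:

```python
# Toy channel-dependency check: deadlock is possible iff the channel
# dependency graph contains a cycle.  Keys are channels; values are the
# channels a packet holding that channel may wait on next.
def has_cycle(deps):
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {v: WHITE for v in deps}
    def dfs(v):
        color[v] = GRAY
        for w in deps[v]:
            if color[w] == GRAY or (color[w] == WHITE and dfs(w)):
                return True
        color[v] = BLACK
        return False
    return any(color[v] == WHITE and dfs(v) for v in deps)

# Channels c0..c3 of a unidirectional 4-node ring; each may wait on the next.
ring = {0: [1], 1: [2], 2: [3], 3: [0]}
# A turn restriction that forbids the wrap dependency breaks the cycle.
restricted = {0: [1], 1: [2], 2: [3], 3: []}

print(has_cycle(ring))        # True  -> deadlock is possible
print(has_cycle(restricted))  # False -> deadlock-free
```

This is exactly what the turn-restriction models below do: they delete enough edges from the channel dependency graph that no cycle can form.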

Dimensional ordered (X-Y) routing

Turns from the y-dimension to the x-dimension are not allowed. This model is used in 2-D meshes.
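A minimal sketch of dimension-ordered (X-Y) routing on a 2-D mesh (the function name is my own): the packet first corrects its x offset completely, then its y offset, so a y-to-x turn can never occur and no dependency cycle can form.

```python
# Dimension-ordered (X-Y) routing: travel fully in x, then fully in y.
def xy_route(src, dst):
    (x, y), (dx, dy) = src, dst
    hops = [(x, y)]
    while x != dx:                   # x-dimension first
        x += 1 if dx > x else -1
        hops.append((x, y))
    while y != dy:                   # then y-dimension
        y += 1 if dy > y else -1
        hops.append((x, y))
    return hops

print(xy_route((0, 0), (2, 3)))
# [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (2, 3)]
```

The path is always minimal (its length is the Manhattan distance), but it is fully deterministic: the packet cannot detour around congestion, which is why X-Y routing fares poorly under hotspot traffic in the comparison below.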

[] West First

Turns to the west are not allowed.

[] North Last

Turns after a north direction are not allowed.

[] Negative First

Turns in the negative direction (-x or -y) are not allowed, except on the first turn.

[] Odd-Even Turn Model

Unfortunately, the above turn-restriction models reduce the degree of adaptiveness and are only partially adaptive. The models cause some packets to take different routes, and not necessarily the minimal paths. This may cause unfairness and reduces the ability of the system to avoid congestion. Overall performance can suffer^3^.

Ge-Ming Chiu introduces the Odd-Even turn model as an adaptive turn restriction, deadlock-free model that has better performance than the previously mentioned models^3^. The model is designed primarily for 2-D meshes.

The model restricts the columns in which certain turns may be taken:

  • Turns from the east to the north direction are not allowed at any node in an even column.
  • Turns from the north to the west direction are not allowed at any node in an odd column.
  • Turns from the east to the south direction are not allowed at any node in an even column.
  • Turns from the south to the west direction are not allowed at any node in an odd column.

The illustration below shows allowed routing for different source and destination nodes. Depending on which column the packet is in, only certain directions are allowed. \\\\[Chapter%2012_html_36dd4f08.jpg|http://pg-server.csc.ncsu.edu/mediawiki/index.php/Image:Routing_odd_even.jpg]\\_Odd-Even turn restriction model proposed by Ge-Ming Chiu_^3^
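The four odd-even rules can be captured in a small lookup. The sketch below (data layout and function name are my own; columns are 0-indexed, so "even" means column 0, 2, 4, ...) answers whether a given turn is permitted at a given column:

```python
# Odd-even turn model: which (incoming, outgoing) turns are banned, and in
# which column parity.  Turns not listed are never restricted.
FORBIDDEN = {
    ("E", "N"): "even",   # east-to-north banned in even columns
    ("E", "S"): "even",   # east-to-south banned in even columns
    ("N", "W"): "odd",    # north-to-west banned in odd columns
    ("S", "W"): "odd",    # south-to-west banned in odd columns
}

def turn_allowed(incoming, outgoing, column):
    parity = FORBIDDEN.get((incoming, outgoing))
    if parity is None:
        return True                           # this turn is never restricted
    column_is_even = column % 2 == 0
    forbidden_here = column_is_even if parity == "even" else not column_is_even
    return not forbidden_here

print(turn_allowed("E", "N", 4))   # False: EN turn in an even column
print(turn_allowed("E", "N", 5))   # True:  EN turn in an odd column
```

Because every turn stays legal in half of the columns, the model remains deadlock-free without a fixed dimension order, which is what preserves its adaptiveness under congestion.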


Comparison of Turn Restriction Models

To evaluate the performance of various turn restriction models, Chiu simulated a 15 x 15 mesh under various traffic patterns. All channels have a bandwidth of 20 flits/usec and a buffer size of one flit. The dimension-ordered x-y, west-first, and negative-first models were compared against the odd-even model.

Traffic patterns including uniform, transpose, and hot spot were simulated. Uniform traffic has each node sending messages to every other node with equal probability. Transpose simulates two opposite nodes sending messages to their respective halves of the mesh. Hot spot simulates a few "hot spot" nodes that receive high traffic.

\\[Chapter%2012_html_61c3f857.jpg|http://pg-server.csc.ncsu.edu/mediawiki/index.php/Image:Routing_uniform.jpg]\\_Uniform traffic simulation of various turn restriction models_^3^\\\\The performance of the different routing algorithms under uniform traffic is shown above. For uniform traffic, the dimension-ordered x-y model outperforms the rest of the models: as the number of messages increases, the x-y model has the slowest increase in average communication latency. \\\\[Chapter%2012_html_m621ec0cf.jpg|http://pg-server.csc.ncsu.edu/mediawiki/index.php/Image:Routing_transpose.jpg]\\_First transpose traffic simulation of various turn restriction models_^3^\\The performance under the first transpose traffic is shown above. The negative-first model has the best performance, while the odd-even model performs better than the west-first and x-y models. [Chapter%2012_html_64e22c1c.jpg|http://pg-server.csc.ncsu.edu/mediawiki/index.php/Image:Routing_transpose2.jpg]\\_Second transpose traffic simulation of various turn restriction models_^3^

With the second transpose simulation, the odd-even model outperforms the rest.

\\\\[Chapter%2012_html_4b5295cc.jpg|http://pg-server.csc.ncsu.edu/mediawiki/index.php/Image:Routing_hotspot.jpg]\\_Hotspot traffic simulation of various turn restriction models_^3^\\The performance of the different routing algorithms under hotspot traffic is shown above. Only one hotspot was simulated for this test. The odd-even model outperforms the other models when hotspot traffic is 10%. \\\\[Chapter%2012_html_5ab47e62.jpg|http://pg-server.csc.ncsu.edu/mediawiki/index.php/Image:Routing_hotspot2.jpg]\\_Second hotspot traffic simulation of various turn restriction models_^3^\\When the number of hotspots is increased to five, the odd-even model begins to shine: its latency is the lowest at both 6 and 8 percent hotspot traffic, while the performance of the x-y model degrades badly.

While the x-y model performs well under uniform traffic, it lacks adaptiveness. When hot spots appear, the x-y model cannot adapt and re-route traffic to avoid the congestion they cause. The odd-even model has superior adaptiveness under high congestion.
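The deterministic behavior that limits the x-y model can be seen in a short sketch of dimension-ordered routing on a 2-D mesh (an illustrative example, not from the text; the function name and coordinate convention are assumptions):

```python
def xy_route(src, dst):
    """Dimension-ordered (X-Y) routing on a 2-D mesh.

    The packet is first routed fully along the x dimension, then
    along y.  Returns the list of (x, y) nodes visited, including
    src and dst.  The path is fixed for a given src/dst pair, which
    is exactly why X-Y routing cannot steer around hot spots.
    """
    (x, y), (dx, dy) = src, dst
    path = [(x, y)]
    while x != dx:                  # correct the x coordinate first
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:                  # then correct the y coordinate
        y += 1 if dy > y else -1
        path.append((x, y))
    return path

print(xy_route((0, 0), (2, 2)))
```

An adaptive scheme such as odd-even would instead choose among several permitted turns at each hop, depending on local congestion.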

Router Architecture

The router is a device that routes incoming data to its destination. It has several input ports and several output ports; data arriving on one of the input ports is routed to one of the output ports. Which output port is chosen depends on the destination of the data and on the routing algorithm. \\\\The internal architecture of a router consists of input and output ports connected by a crossbar switch. The crossbar selects which output port each input is connected to, acting essentially as a multiplexer. Router technology has improved significantly over the years, which has made networks of high dimensionality feasible. As shown in the real-world example above, high-dimensional tori and hypercubes are excellent choices of topology for high-performance networks, and the falling cost of high-performance, high-radix routers has contributed to the viability of these networks. As the graph below shows, the bandwidth of routers has improved tremendously over a period of 10 years^4^.
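The crossbar-as-multiplexer idea can be modeled in a few lines (a simplified sketch, not from the text; the class and method names are invented for illustration):

```python
class Crossbar:
    """Minimal model of a router crossbar: any input port can be
    connected to any output port, with at most one input driving
    each output at a time."""

    def __init__(self, n_ports):
        self.n_ports = n_ports
        self.map = {}              # output port -> input port

    def connect(self, in_port, out_port):
        """Set up one input-to-output connection."""
        if out_port in self.map:
            raise ValueError("output %d already in use" % out_port)
        self.map[out_port] = in_port

    def forward(self, inputs):
        """Deliver one unit of data from each connected input port
        to its selected output port."""
        return {out: inputs[inp] for out, inp in self.map.items()}

xbar = Crossbar(4)
xbar.connect(0, 2)      # input 0 drives output 2
xbar.connect(3, 1)      # input 3 drives output 1
print(xbar.forward({0: "flit-A", 3: "flit-B"}))
```

A real router adds buffering, arbitration, and flow control around this switching core, but the routing decision reduces to choosing these input-output pairings.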

\\[Chapter%2012_html_7a96163b.jpg|http://pg-server.csc.ncsu.edu/mediawiki/index.php/Image:Router_bandwidth.jpg]_Bandwidth of various routers over 10 year period_^4^

Looking at the physical architecture and layout of routers, it is evident that the circuitry has become dramatically denser and more complex.

\\[Chapter%2012_html_m7465c12f.jpg|http://pg-server.csc.ncsu.edu/mediawiki/index.php/Image:Router_physical.jpg]\\_Router hardware over period of time_^4^

\\[Chapter%2012_html_m57de6676.jpg|http://pg-server.csc.ncsu.edu/mediawiki/index.php/Image:Router_radix.jpg]\\_Radix and latency of routers over 10 year period_^4^

The *radix*, or number of ports, of routers has also increased. Current routers not only have a higher radix but also lower latency than the last generation; as radix increases, the latency remains steady.

With high-performance routers, complex topologies become practical: as router technology improves, ever more complex, high-dimensionality topologies are possible.

Fault Tolerant Routing

Fault-tolerant routing means the successful routing of messages between any pair of non-faulty nodes in the presence of faulty components^6^. With the growing number of processors in multiprocessor systems and ever-higher data rates, reliable transmission of data in the event of a network fault is a major concern, and hence fault-tolerant routing algorithms are important.

Fault Models

Faults in a network can be categorized into two types:

1. *Transient Faults*^5^: A transient fault is a temporary fault that occurs for a very short duration of time, for example a glitched flip-flop output that generates an invalid header. These faults can be minimized using error-control coding and are generally evaluated in terms of the Bit Error Rate (BER).

2. *Permanent Faults*^5^: A permanent fault does not go away and causes permanent damage to the network, for example broken wires and their associated circuitry. These faults are generally evaluated in terms of the Mean Time Between Failures (MTBF).
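As a rough illustration of how the two metrics are used in practice (all numbers below are invented for the example, not taken from the text):

```python
# Transient faults: evaluated via Bit Error Rate (BER).
# Expected bit errors = BER * number of bits transferred.
ber = 1e-12                       # assumed bit error rate
bits_sent = 10e9 * 3600           # one hour of traffic at 10 Gb/s
expected_errors = ber * bits_sent
print("expected transient bit errors per hour:", expected_errors)

# Permanent faults: evaluated via Mean Time Between Failures (MTBF).
# A common derived figure is availability = MTBF / (MTBF + MTTR).
mtbf_hours = 50_000               # assumed mean time between failures
mttr_hours = 4                    # assumed mean time to repair
availability = mtbf_hours / (mtbf_hours + mttr_hours)
print("link availability: %.5f" % availability)
```

The contrast is visible in the units: BER characterizes how often individual bits are corrupted in flight, while MTBF characterizes how often a component drops out of the network entirely.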

Fault Tolerance Mechanisms (for permanent faults)

Permanent faults can be handled using one of two mechanisms:

1. *Static Mechanism*: Once a fault is detected, all processes running in the system are stopped and the routing tables are emptied. Based on the fault information, the routing tables are recalculated to provide fault-free paths.

2. *Dynamic Mechanisms*: The operation of processes in the network is not completely stalled; only the affected regions are repaired. Two methods of doing this are:

a. *Block Faults*: Many healthy nodes in the vicinity of the faulty nodes are marked as faulty, so that no routes are created close to the actual faulty nodes. The blocked region may be convex or non-convex, and it is ensured that none of the new routes introduces a cycle in the channel dependency graph (CDG). *Disadvantage*: this method declares many healthy nodes faulty, reducing system capacity.

[Chapter%2012_html_m634e3930.jpg|http://pg-server.csc.ncsu.edu/mediawiki/index.php/Image:Fault_pic1.jpg]

b. *Fault Rings*: This method was introduced by Chalasani and Boppana. A fault ring is the set of nodes and links adjacent to the faulty nodes/links. This approach reduces the number of healthy nodes that must be marked as faulty and blocked.

[Chapter%2012_html_62ea40b2.jpg|http://pg-server.csc.ncsu.edu/mediawiki/index.php/Image:Fault_pic2.jpg]
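The difference between blocking a whole region and ringing the faults can be sketched on a 2-D mesh (an illustrative model, not from the text; the mesh size and helper names are assumptions):

```python
def block_region(faulty):
    """Block-fault model (convex/rectangular variant): the smallest
    rectangle covering all faulty nodes.  Every node inside it is
    treated as faulty, healthy or not."""
    xs = [x for x, _ in faulty]
    ys = [y for _, y in faulty]
    return {(x, y)
            for x in range(min(xs), max(xs) + 1)
            for y in range(min(ys), max(ys) + 1)}

def fault_ring(region, width, height):
    """Fault-ring model: the healthy nodes adjacent (including
    diagonals) to the blocked region, around which traffic is
    steered."""
    ring = set()
    for (x, y) in region:
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                n = (x + dx, y + dy)
                if n not in region and 0 <= n[0] < width and 0 <= n[1] < height:
                    ring.add(n)
    return ring

faults = {(2, 2), (3, 3)}              # two faulty nodes on a 6x6 mesh
region = block_region(faults)          # also blocks healthy (2,3), (3,2)
print(sorted(fault_ring(region, 6, 6)))
```

Note how the block model sacrifices the healthy nodes (2,3) and (3,2) inside the rectangle, which is exactly the capacity loss the fault-ring approach reduces.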


*Metrics for Interconnection Networks*

• *Diameter*: the longest distance between any two nodes in the network

• *Bisection Width*: the minimum number of wire cuts needed to divide the network into two halves

• *Cost*: Number of links or switches (whichever is asymptotically higher)
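These metrics can be computed directly from a graph model of the network. The sketch below (illustrative, not from the text) measures the diameter by breadth-first search and checks it for a 3-dimensional hypercube:

```python
from collections import deque

def diameter(nodes, neighbors):
    """Longest shortest-path distance over all node pairs,
    computed by running BFS from every node."""
    def bfs(src):
        dist = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for v in neighbors(u):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        return dist
    return max(max(bfs(n).values()) for n in nodes)

# 3-dimensional hypercube: nodes are 3-bit labels, and two nodes
# are neighbors iff their labels differ in exactly one bit.
p, dim = 8, 3
nodes = range(p)
hyper = lambda u: [u ^ (1 << b) for b in range(dim)]
print(diameter(nodes, hyper))   # log2(8) = 3
```

Bisection width and cost can be checked the same way, by counting edges crossing a balanced partition and counting edges overall.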

\\

*Characteristics of Networks with p Processors*

Network | Diameter | Bisection Width | Arc Connectivity | Number of Links
Completely-connected | 1 | p²/4 | p−1 | p(p−1)/2
Star | 2 | 1 | 1 | p−1
Complete binary tree | 2 log((p+1)/2) | 1 | 1 | p−1
Linear array | p−1 | 1 | 1 | p−1
Ring | ⌊p/2⌋ | 2 | 2 | p
2-D mesh, no wraparound | 2(√p − 1) | √p | 2 | 2(p − √p)
2-D mesh, wraparound | 2⌊√p/2⌋ | 2√p | 4 | 2p
Hypercube | log p | p/2 | log p | (p log p)/2


References

  1. Yan Solihin. Fundamentals of Parallel Computer Architecture.
  2. Comer, Douglas E. Computer Networks and Internets.
  3. Comer, Douglas E. Internetworking with TCP/IP, Volume I.
  4. Dally, William James and Towles, Brian. Principles and Practices of Interconnection Networks.
  5. [Interconnection Architectures for Petabyte-Scale High-Performance Storage Systems|http://www.ssrc.ucsc.edu/Papers/hospodor-mss04.pdf]
  6. [The Odd-Even Turn Model for Adaptive Routing|http://www.diit.unict.it/~vcatania/COURSES/semm_05-06/DOWNLOAD/noc_routing02.pdf]
  7. [Interconnection Topologies (Historical Trends and Comparisons)|http://www.csm.ornl.gov/workshops/IAA-IC-Workshop-08/documents/wiki/dally_iaa_workshop_0708.pdf]
  8. [Efficient Mechanisms to Provide Fault Tolerance in Interconnection Networks for PC Clusters, José Miguel Montañana Aliaga|http://dspace.upv.es/xmlui/bitstream/handle/10251/2603/tesisUPV2824.pdf?sequence=1]
  9. [Adaptive Fault Tolerant Routing Algorithm for Tree-Hypercube Multicomputer, Qatawneh Mohammad|http://web.ebscohost.com.www.lib.ncsu.edu:2048/ehost/pdfviewer/pdfviewer?vid=2&hid=15&sid=72e3828d-3cb1-42b9-8198-5c1e974ea53f@sessionmgr4]
  10. [TOP500 Supercomputing Sites|http://www.top500.org/]
