Chapter 4b CSC/ECE 506 Spring 2011 / ch4b

From Expertiza_Wiki
Revision as of 19:43, 24 April 2011 by Dbharga (talk | contribs)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to navigation Jump to search

Chapter 4b CSC/ECE 506 Spring 2011 / ch4b

As a book on computer architecture, the text does not go into a great deal of detail on parallel programming techniques. Many of the topics that are mentioned in Chapter 4 can be explored in greater depth. The granularity of parallel threads depends quite closely on the architecture. The granularity of synchronization is a topic that has been studied for many years. There are times when, due to the overhead of locking, it does not make sense to invoke a lock, but rather, just to wait in a loop until a specific predicate is satisfied. This can be considered in more detail, but be sure not to overlap the discussion of lock implementations in Chapter 9 of Solihin.

The textbook considers scheduling and load balancing in somewhat greater detail, but much more can be said. For example, one might consider when [gang scheduling|http://en.wikipedia.org/wiki/Gang_scheduling] is useful; this depends on the overhead of communication and task management. The same is true of load balancing; discuss the tradeoff between static and dynamic scheduling, and different approaches to dynamic scheduling.


Index

  1. Introduction
  2. Parallel Thread Granularity
    1. Instruction Level Parallelism:
    2. Loop Level Parallelism
    3. Procedure Level Parallelism:
    4. Subprogram Level Parallelism
    5. Job level parallelism
  3. Scheduling
  4. Load Balancing
  5. Gang Scheduling


Introduction

The granularity of parallel threads depends quite closely on the architecture. For attaining good parallel performance it is important to use the right granularity. Granularity can be defined as the amount of work done by a task, between communications with other tasks. Granularity of parallel threads is achieved at the expense of communication overhead, load balance and extra work to be done by threads. If granularity is too fine, then performance can suffer from communication overhead. If granularity is too coarse, then performance can suffer from load imbalance. If granularity is too fine, the overhead of creating and communicating with threads is too great. There is a middle ground where communication overhead is reduced. The goal of design should to determine the right granularity for parallel tasks, while avoiding load imbalance and communication overhead to achieve the best performance.

Granularity is calculated in terms of computation time and communication time, or in terms of coarse, medium or fine granularity.


Granularity = Computation Time / Communication Time


This ratio is a measure of how much overhead is produced per unit of computation. A large granularity implies that the communication overhead is insignificant compared to computation time while a low ratio implies that the communication overhead dominates computation time, that which results in poor performance. Message-passing multiprocessors usually employ medium/coarse granularity. Fine granularity is used in dataflow systems. With coarse granularity, each process contains a large number of sequential instructions and takes a substantial time to execute. In fine granularity, a process might consist of a few instructions or just one instruction. As the granularity is reduced, the process communication overhead generally increases. It is particularly desirable to reduce the communication overhead because of the significant time for communication. The tendency to obtain maximum performance is to resort to finest possible granularity, which leads to highest degree of parallelism. But maximum parallelism should not lead to increasing overhead.


In message passing systems overhead can be reduced by mapping several processes onto one processor and switching from one process to another when a processes held up by message passing. The disadvantage of static assignment is that load sharing will be unclear before program is actually executed. Code and data have to be physically transferred to the local memory of each node prior to execution, and this transfer can cause a significant amount of overhead. Similarly results need to be transferred from the nodes to the host system. Clearly, the computation to be performed needs to be reasonably long to lessen the loading overhead.


Parallel Thread Granularity

Parallel thread granularity is a important factor affecting scalability of the parallel program. It affects the relative cost of thread management, which involves creation, task assignment, and synchronization. Larger the task granularity, the smaller thread management overheads are relative to the execution time of the tasks. One way to increase parallel task granularity is through choosing to parallelize the outer loop and not parallelize the inner loop. When there is poor load balancing we can consider parallelizing an inner loop rather than the outer loop. We can improve load balancing with the use of dynamic task to thread mapping scheme. Instead of assigning threads statically we can have dynamic assignment in which central task queue is up to keep tasks in the form of groups of loop iterations. During runtime, whenever a thread becomes idle it grabs a task. (Sollihin,2009:90)


Parallelism can be exploited at various processing levels. Five processing levels are as follows:

  1. Instruction Level
  2. Loop Level
  3. Procedure Level
  4. Subprogram Level
  5. Job Level


The execution of a program may involve a combination of these levels. Number of levels of parallelism dictates the granularity. The execution of a program may involve combination of these levels. The actual depends on application, formulation, algorithm, language, program, compilation support, and hardware limitations.


Instruction Level Parallelism

At instruction level parallelism a grain contains less than 20 instructions, called fine grain. Depending on individual programs, fine grain parallelism at this level may range from two to thousands. The advantage of fine grained computation lies in the abundance of parallelism. The exploitation of fine grained parallelism can be assisted by an optimizing compiler which should be able to automatically detect parallelism and translate the source code to a parallel form which can be recognized by the run-time system.( Patterson, 1995). Instruction-level parallelism is rather tedious for an ordinary programmer to detect in a source code. Synchronization interval should be less than 20. Synchronization at the instruction level by means of conditional wait statements that automatically resume when their condition becomes true.


Loop Level Parallelism

Loop level parallelism corresponds to the iterative loop operations. A typical loop contains less than 500 instructions. Some loop operations are independent in successive iterations, can be vectorized for a pipelined execution or for lock-step execution on SIMD machines. Some loop operations can be self-scheduled for parallel execution on MIMD machines. Loop level parallelism is most optimized program construct to execute on a parallel or vector computer. However, recursive loops are rather difficult to parallelize. Loop level is still considered a fine grain of computation. Synchronization interval is 20 to 200 (Patterson, 1995).


Procedure Level Parallelism

This level corresponds to medium grain size at the task, procedural, subroutine and co-routine levels. A typical grain contains less than 2000 instructions. Detection of parallelism at this level is much more difficult than at the finer-grain levels. The communication requirement is often less compared with that required in MIMD execution mode. SPMD execution mode is a special case at this level. Multitasking also belong in this category. Significant efforts by programmers may be needed to restructure a program at this level, and some compiler assistance is also needed. Synchronization interval is 200 to 2000 (Patterson, 1995).


Subprogram Level Parallelism

This corresponds to the level of job steps and related subprograms. The grain size may typically contain thousands of instructions. Job steps can overlap across different jobs. Subprograms can be scheduled for different processors in SPMD or MPMD mode, often in message passing multicomputer.


Multiprogramming on uniprocessors or on a multiprocessor is conducted at this level. In this past, parallelism at this level has been exploited by algorithm designers or programmers, rather than by compilers. We do not have good compilers for exploiting medium or coarse grain parallelism at present. Synchronization interval is 2000 to 1M (Patterson, 1995).


Job level parallelism

This corresponds to the parallel execution of essentially independent jobs on a parallel computer. The grain sizes can be as high as tens of thousands of instructions in a single program. For supercomputers with a small number of very powerful processors, such as coarse grain parallelism is practical. Job-level parallelism handled by the program loader and by the operating system in general. Time sharing and space sharing multiprocessors explore this level of parallelism. In fact, both time and space sharing are extensions of multiprogramming.


To summarize, fine grain parallelism is often exploited at instruction or loop levels, preferably assisted by a parallelizing or vectorizing compiler. Medium-grain parallelism at the task or job step demands significant roles for the programmer as well as compilers. Coarse-grain parallelism at the program level relies heavily on an effective OS and on the efficiency of the algorithm used. Shared variable communications is often used to support fine-grain and medium grain computations. Message passing multi-computers have been used for medium and coarse grain computations. In general, finer the grain size, the higher the potential for parallelism and the higher the communication and schedule overhead. Fine grain provides a higher degree if parallelism, but heavier communication overhead, as compared with coarse-grain computations. Massive parallelism is often explored at the fine-grained level, such as data parallelism in SIMD and MIMD computers. Synchronization interval is 2000 to 1M (Patterson, 1995).


To schedule parallel tasks in multiprocessors we need to use synchronization of processes. The use of multiprogramming on individual processors depends on the synchronization granularity. A access control mechanism is needed to maintain orderly access. Sequence control mechanism is needed to ensure the ordering of operations.


Processors also compete with each other to gain access to shared data items. Various synchronization techniques locks, test and set, barriers, point-to-point, critical section, semaphores and precedence are some methods used to achieve synchronization. Point-to-point synchronization is cheaper than locks and barrier.


Amount of time spent in locks and critical section increases with processors and this can increase overhead. Similarly simple spin-lock scheme will not scale well with large machines with all processors contending for the same lock. The bus that act as a point of serialization for all the processors will lead to lot of contention as well as traffic. The fairness property of bus makes things worse, since it delays the processor that claims lock from releasing it. The root cause of the problem is the contention and the fact that the lock access is serialized. The key advantage of spin and lock is low overhead when it is reused by same processor.

Parallel tasks can be synchronized through the use of synchronization points called barrier. No task may proceed beyond a barrier until all participating tasks have reached that barrier. Members of a group wait at a barrier until a specified number of group members check at the barrier. To implement barrier we use spin lock, which will suffer with the problems specified above. Frequently barrier is used within a loop, so that processes released from the barrier would do some work and then reach the barrier again. If one of the processes never leave barrier, this could happen if OS schedule another process. Now it is possible that one process races ahead of and gets to barrier again before the last process has left. All the processes are trapped now. Solution is of course to count number of processes entered and left. But this will add time and in turn latency of barrier and increase contention. So performance can be poor.


Parallel program is a collection of tasks. Associated with each program is a flow of control among the sub program units or tasks. Certain tasks have to be completed before others can be initiated. These tasks may run serially or in parallel. A optimal schedule determines the allocation of tasks to processors of multiprocessor system and the execution order of the tasks so as to achieve the shortest execution time.


Scheduling

Scheduling on a multiprocessor involves three interrelated issues. Firstly, the assignment of processes to processors, use of multiprogramming on individual processors and the actual dispatching of a process. Scheduling depends on degree of granularity and number of processors available. Scheduling techniques can be classified into two groups:


  1. Dynamic Scheduling
  2. Static Scheduling


In static scheduling , each task is allocated to a particular processor based in the analysis of the precedence constraints imposed by the tasks at hand. Each time the task is executed, it is allocated to that predetermined processor. Each task is allocated before hand so less overhead. Obviously, this method does not take into consideration the nondeterministic nature of tasks brought about by conditional branches and loops in the program. The target of the conditional branch and the upper bounds of the loops in the program. The target of the conditional branch and the upper branch of the loops are not known until the program execution begins. Thus, static scheduling is not always optimal. Since assignment is predetermined, static techniques do not incur much task management overhead at run time. To achieve good load balancing, they require that the relative amounts of work in different tasks be adequately predictable or that enough tasks exit to ensure a balanced distribution.


In dynamic scheduling, tasks are allocated to the processor based in the execution characteristics. Usually some load balancing is employed in determining optimal allocation. Because the scheduler has only the knowledge of local information about the program at any time, finding the global optimum is difficult. Another disadvantage is the increased overhead because the schedule has to be determined while the tasks are running. Another disadvantage is inefficient use of cache memory, and it is more difficult for the processors to communicate with dynamic scheduling.


Dynamic scheduling have lot of advantages too. It enables handling some cases when dependences are unknown at compile time. And it simplifies the compiler. It also allows code that was compiled with one pipeline in mind to run efficiently on a different pipeline. Although these advantages are gained at a cost of a significant increase in hardware complexity. In dynamic scheduling techniques usually provide good load balancing despite of unpredictability or environmental conditions, they make the management of parallelism more expensive.


Dynamic scheduled processor cannot remove true data dependences; it tries to avoid stalling when dependences are present. In contrast, static pipeline scheduling, like that we have already seen, tries to minimize stalls by separating dependent instructions so that they will not lead to hazards. Static scheduling can also be used on code destined to run on a processor with a dynamically scheduled pipeline.


There are two basic scheduling models used by operating systems in multiprocessor systems. These two systems differ essentially in the granularity of the units of work schedule as one entity. Two types of scheduling are:


  1. Process Scheduling
  2. Thread Scheduling


Process model scheduling is performed on a per process basis, smallest unit of work to be scheduled is a process. Process scheduling involves the declaration of distinct process states which are connected with scheduling, the specifications of the state transition diagram and the statement of a scheduling policy.


Thread scheduling is a fine grained scheduling model where a smaller unit of work is threads. Thread like process is a sequence of instructions. Threads are created within and belong to processes. All the threads created within one process share the resources of the process, in particular the address space. Thread scheduling is more affordable and advantageous than process model. With finer-grained entities more parallelism can be exposed than in the case of processes. In addition, the creation of threads and the necessary communication, synchronization and switching are far less expensive operations than these for processes, since all threads belonging to the same process share the same resources.


General approaches to thread scheduling are:

  1. Load Balancing
  2. Gang Scheduling
  3. Dedicated Processor Assignment
  4. Dynamic Scheduling


Load Balancing

The key objective for load balancing is to minimize idle time on threads. Sharing the workload equally across all threads with minimal work sharing overheads results in fewer cycles wasted with idle threads not advancing the computation, and thereby leads to improved performance. However, achieving perfect load balance is non-trivial, and it depends on the parallelism within the application, workload, the number of threads, load balancing policy, and the threading implementation. In load balancing processes are not assigned to a particular processor. A global queue of ready threads is maintained, and each processor, when idle, selects a thread from the queue. Load sharing is one of the most common schemes used in multiprocessors.


Load balancing is achieved by distributing load evenly across the processors, assuring that no processor is idle while work is available to do. There is no requirement of centralized scheduler. When a processor is available, the scheduling routine of the operating system is run on that processor to select the next thread. A global queue of ready threads is maintained for load balancing. It can be organized and accessed using any priority-based schemes and schemes that consider execution history or anticipated processing demands. Dynamic task mapping allows to achieve load balancing easily as compared to static load balancing.


Load Sharing uses a global queue, which is accessed by each processor. Therefore load sharing may need mutual exclusion so global queue is not accessed by more than one processor at a time. This may be a bottleneck when more than one processor looks for work at the same time especially in large machines. Preempted threads are unlikely to resume execution on the same processor. Also if all threads are in the global queue, all threads of a program will not gain access to the processors at the same time.


Gang Scheduling

Gang scheduling is a resource management scheme for parallel and distributed systems that combines time-sharing with space-sharing to ensure short response times for interactive tasks and high overall system throughput. In gang scheduling, a set of related threads is scheduled to run on a set of processors at the same time, on a one-to-one basis. Gang scheduling is useful for applications where performance severely degrades when any part of the application is not running, since threads may often need to synchronize with each other. It uses giant locks to separate the functions of subsystems, which can be continually expanded until subsystem is fully multithreaded to handle many lightweight processes simultaneously. In gang scheduling, independent schedulable threads are formed from an application supported by a user library or language directives. These threads are generally known as virtual processors. The OS kernel multiplexes the thread onto physical processors. Under gang scheduling, jobs with the same resource requirements are collected in groups. Each group is assigned to a partition of the system for a specific amount of time. When the time allocated to a group is up, a context switch occurs, resources are reallocated, and the jobs in new groups are scheduled to execute.


The overhead involved in the gang scheduling can be significant so processors are dedicated for periods longer than the typical time-sharing quantum to reduce this overhead. Processor allocation server is a facility for implementing an allocation policy. The server has direct control over the processors required by the applications. The available processors set is assigned to multiple threads dynamically. The application creates the processor set and a program will execute only threads belonging to the same processor set. Gang scheduling reduces synchronization blocking and require less time for resource allocation. In particular, gang scheduling promotes efficient fine grain interactions among threads in the same gang and provides interactive response time for short jobs.



References

Culler David E., Singh Jaswinder pal, Gupta Anoop 2000. Parallel Computer Architecture a Hardware / Software approach. Morgan Kaufmann Publishers, CA, USA.

Solhini Yan 2009. Fundamentals of Parallel Computer Architecture Multichip and Multicore systems. Student edition, Solihin Publishing and Consulting LLC, USA.

Hwang Kai 1993. Advanced Computer Architecture: Parallelism, Scalability, Programmability. McGraw-Hill, Inc. USA.

Flynn Micheal J. 1995. Computer Architecture : Pipelined and Parallel Processor Design. Jones and Barlette Publishers, Inc. London, UK.

Patterson David A., Hannessy John L. 1995. Computer Architecture A Quantitative Approach. 2^nd^ Ed., Morgan Kaufmann Publishers, CA, USA.

Sima Deszo, Fountain Terence, Kacsuk Peter, 1997. Advanced Computer Architecture A Design Space Approach. Addison Wesley Longman, Essex, England.

Stone Harold S., 1993. High-Performance Computer Architecture. 3^rd^ Ed., Addison-Wesley Publishing Company Inc., MA, USA.

Abd-El-Barr Mostafa, El-Rewini Hesham, 2005. Advanced Computer Architecture and Parallel Processing. John Wiley & Sons, Inc. Hoboken, New Jersey, USA.

Briggs Faye A., Hwang Kai, 1984. Computer Architecture And Parallel Processing. McGraw-Hill, Inc. USA.

Wilkinson Barry, 1996. Computer Architecture Design and Performance. 2^nd^ Ed., Prentice Hall Europe, Englewood Cliffs, NJ.

Shiva Sajjan G. 1996. Pipelined and Parallel Computer Architectures. HarperCollins, NY, USA.

Sinapova Lydia, Operating Systems: Multiprocessor and Real-Time Scheduling. [Load|http://faculty.simpson.edu/lydia.sinapova/www/cmsc335/cmsc335-01/CH10-RT.htm%20%5C%20Load

Squillante Mark S., Wang Fang , Papaefthymio Marios. An Analysis of Gang Scheduling for Multiprogrammed Parallel Computing Environments. Yale University, New Haven, CT.

Gang Scheduling, Timesharing on Parallel Computers, SC98, November 1998 (a summary)[1]

Intel ISN, 2010. Granularity and Parallel Performance. [2]

Ding-Kai Chen, et al. The Impact of Synchronization and Granularity on Parallel Systems”, _Proceedings of the 17th Annual International Symposium on Computer Architecture 1990_, Seattle, Washington, USA.