CSC/ECE 506 Spring 2014/4a ad



Automatic Parallelization

Automatic parallelization is the technique of automatically converting sequential code into multi-threaded or vectorized code (or both). The resulting code can take advantage of a shared-memory multiprocessor environment to achieve better performance, while relieving programmers of the tedious and error-prone manual parallelization process.

Introduction

With increasing transistor densities, multi-core systems have become the dominant means of delivering performance. However, an application cannot realize the potential of a multi-core environment unless it is parallelized efficiently, and parallelizing sequential programs correctly is a challenge. It is widely accepted that manual parallelization by expert programmers leads to the most streamlined parallel implementation, but it is also a costly and time-consuming process. If the compiler can instead perform the parallelization automatically while ensuring the formal correctness of the resulting parallel code, time-to-market and cost can be greatly reduced.

Approaches


Generally, a high fraction of a program's execution time is spent in some form of loop, so the initial emphasis of automatic parallelization was on loop-level parallelization. There are two main approaches to parallelizing loops: cyclic multi-threading and pipelined multi-threading.

Cyclic multithreading

A parallelizing compiler that uses cyclic multithreading tries to split a loop so that each iteration can be executed concurrently on a different processor. Before actually parallelizing the code, the compiler makes two passes over it, answering one of the following questions in each pass:

  • a) How safe is it to parallelize the loop?

Accurate dependence analysis and alias analysis are required to answer this question. The compiler first needs to determine whether each iteration of the loop can be executed independently of the others. Some forms of data dependence can be handled, but they may incur additional overhead in the form of message passing, synchronization of shared memory, or some other method of processor communication.
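To see why alias analysis matters, consider the sketch below (the function scale and its parameters are hypothetical, introduced only for illustration). The loop body looks independent across iterations, but the compiler may parallelize it only if it can prove the two pointers never refer to overlapping storage:

        /* Hypothetical example: safe to run iterations in parallel only
           if a and b never point to overlapping memory (no aliasing). */
        void scale(double *a, const double *b, int n)
        {
            for (int i = 0; i < n; i++)
                a[i] = b[i] * 2.0;
        }

Declaring the parameters with C99's restrict qualifier is one way the programmer can assert non-aliasing and simplify this analysis.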

  • b) Can we gain an appreciable performance enhancement if we parallelize the loop?

To estimate this, the compiler compares the theoretical execution time of the parallelized code to its sequential execution time. This comparison is necessary because the overhead of parallelization and of distributing work across multiple processors can sometimes dominate the gain obtained from parallelization; in such cases the actual speedup will be less than the potential speedup.
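As a sketch of that trade-off (this fragment is an illustration, not from the original writeup), the following loop is perfectly safe to parallelize, but with so few cheap iterations the cost of creating and synchronizing threads would almost certainly outweigh the work, so a reasonable cost model keeps it sequential:

        /* Legal to parallelize, but likely unprofitable: the per-iteration
           work is tiny compared with thread startup and synchronization. */
        for (i = 0; i < 8; i++)
            sum[i] = a[i] + 1;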

Example:

        for(i=0; i<n; i++)
            z[i] = x[i] + y[i];

The above code exhibits DOALL parallelism: each iteration is independent of the others, so the compiler can parallelize the loop automatically, and the final contents of z will be correct regardless of the order in which the iterations execute.
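For illustration, one way the parallel form of this loop could be expressed (by a compiler's OpenMP back end or by a programmer annotating the code by hand) is with an OpenMP pragma; this is just one possible rendering, assuming z, x, and y do not overlap:

        /* Each iteration writes a distinct z[i], so iterations may run
           in any order on any thread. */
        #pragma omp parallel for
        for(i=0; i<n; i++)
            z[i] = x[i] + y[i];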

        for(i=2; i<n; i++)
            z[i] = z[i-1]*2;

The above loop cannot be auto-parallelized as written, because z[i] depends on the value z[i-1] computed in the previous iteration. That does not mean it cannot be parallelized at all, however: the loop is equivalent to the following closed form.

        for(i=2; i<n; i++)
            z[i] = z[1] * pow(2, i-1);   /* 2^(i-1); in C, ^ is XOR, so use pow() from <math.h> */

However, current parallelizing compilers are usually not capable of discovering such rewritings automatically, and one cannot say for sure that this piece of code would benefit from parallelization in the first place.

Pipelined multithreading

A pipelined multi-threading parallelizing compiler attempts to break the sequence of operations inside a loop into a series of code blocks, each of which can be executed concurrently on a separate processor. The compiler assigns each block to a different processor and inserts the code needed to forward the output of one processor to the next.

Example:

        for(i=1; i<=n; i++) {
            x[i] = x[i-1] + y[i];
            z[i] = z[i] + x[i];
        }

The compiler can split the above loop into two loops (each of which can be executed on a different processor) and embed special instructions to forward the output of one processor to the other.

Loop 1:
        for(i=1; i<=n; i++) {
            x[i] = x[i-1] + y[i];
            send(x[i]);
        }

Loop 2:
        for(i=1; i<=n; i++) {
            receive(x[i]);
            z[i] = z[i] + x[i];
        }

Loop 1 can then run on one processor while its results are forwarded to a second processor, which executes loop 2.
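The send()/receive() calls above are abstract placeholders. As a purely illustrative sketch (not the output of any particular compiler), the same pipeline can be realized with C11 threads and a per-element ready flag standing in for the message passing; the names stage1, stage2, and ready are assumptions made for this example:

        /* Illustrative C11 sketch of the two-stage pipeline above.
           ready[i] is set once stage 1 has produced x[i]; stage 2
           waits on it, playing the role of send()/receive(). */
        #include <stdatomic.h>
        #include <threads.h>

        #define N 1000

        double x[N + 1], y[N + 1], z[N + 1];
        atomic_int ready[N + 1];

        int stage1(void *arg)                 /* Loop 1 */
        {
            (void)arg;
            for (int i = 1; i <= N; i++) {
                x[i] = x[i - 1] + y[i];
                atomic_store_explicit(&ready[i], 1, memory_order_release);
            }
            return 0;
        }

        int stage2(void *arg)                 /* Loop 2 */
        {
            (void)arg;
            for (int i = 1; i <= N; i++) {
                while (!atomic_load_explicit(&ready[i], memory_order_acquire))
                    ;                         /* spin until x[i] is available */
                z[i] = z[i] + x[i];
            }
            return 0;
        }

        int main(void)
        {
            thrd_t t1, t2;
            thrd_create(&t1, stage1, NULL);
            thrd_create(&t2, stage2, NULL);
            thrd_join(t1, NULL);
            thrd_join(t2, NULL);
            return 0;
        }

A production implementation would typically decouple the stages through a bounded queue rather than one flag per element; the flag array simply keeps the sketch short.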


Examples of Auto Parallel Compilers

  • SUIF compiler: SUIF (Stanford University Intermediate Format) is a free infrastructure designed to support collaborative research in optimizing and parallelizing compilers.

  • Vienna Fortran compiler: A source-to-source parallelization system for HPF+ (an optimized version of HPF) that addresses the requirements of irregular applications.

  • PLUTO: An automatic parallelization tool based on the polyhedral model. The polyhedral model for compiler optimization is a program representation that makes it convenient to perform high-level transformations such as loop nest optimizations and loop parallelization.

  • Par4All: An automatic parallelizing and optimizing compiler (workbench) for C and Fortran sequential programs.

  • Omni OpenMP Compiler: Translates C and Fortran programs with OpenMP pragmas into C code suitable for compiling with a native compiler linked with the Omni OpenMP runtime library.

Limitations

References