CSC/ECE 506 Spring 2014/4a ad: Difference between revisions
Line 62: | Line 62: | ||
<pre> | <pre> | ||
Loop 1 | Loop 1: | ||
for(i=1;i<=n;i++){ | for(i=1;i<=n;i++){ | ||
x[i] = x[i-1] + y[i]; | x[i] = x[i-1] + y[i]; | ||
Line 68: | Line 68: | ||
} | } | ||
Loop 2 | Loop 2: | ||
for(i=1;i<=n;i++){ | for(i=1;i<=n;i++){ | ||
receive(x[i]); | receive(x[i]); |
Revision as of 22:37, 25 February 2014
Automatic Parallelization
Auto parallelization is the technique of automatically converting a sequential code into multi-threaded or vectorized (or even both) code. Therefore one can take advantage of a shared memory multi processor environment to run the multi threaded code and obtain a better performance. This also serves the goal of relieving programmers from the tedious and error-prone manual parallelization process.
Introduction
With increasing transistor densities, multi-core computing systems are being touted as the most popular means of delivering performance. Unless the application is parallelized efficiently, one cannot realize the potential of multi core environment. It is a challenge to efficiently parallelize sequential programs without any errors. It is widely accepted that manual parallelization of code, by expert programmers, leads to the most streamlined parallel implementation. But it is also a costly and time consuming process. If one can parallelize the compiler technology to itself while ensuring the formal correctness of the resulting parallel code, time-to-market and cost can be greatly reduced.
Approaches
Abhishek
Generally, a high fraction of a program’s execution time takes place in one form of loop or the other. So the initial emphasis of auto parallelization was on loop level parallelization.There are two main approaches to parallelization of loops: pipelined multi-threading and cyclic multi-threading.
Cyclic multithreading
A parallelizing compiler using cyclic multithreading tries to split a loop in a such a way that each iteration of a loop can be executed concurrently on a different processor.Before actually parallelizing the code, the compiler does two passes analyzing the following in each pass the code respectively.
- a) How safe is it to parallelize the loop?
Accurate dependency analysis and alias analysis is required to answer the above question. The compiler needs to first determine whether each iteration of the loop can be executed independently of others. Some forms of data dependencies can be addressed but they might incur additional overhead in the form of message passing or synchronization of shared memory or some other method of processor communication.
- b) Can we gain an appreciable performance enhancement if we parallelize the loop?
In order to estimate it, the compiler compares the theoretical execution time of code after parallelization to the sequential execution time. This has to be done because sometimes the overhead of parallelization and execution over multiple processors might dominate the actual enhancement obtained due to parallelization. In such cases the actual speedup will be less than the potential speedup.
Example:
for(i=0;i<n;i++) z[i] = x[i] + y[i];
The above code exhibits DOALL parallelism because each iteration is independent of the others, and the final result of z[] will not depend on the execution order of the other iterations. Therefore it can be auto parallelized by the compiler because each iteration is independent of the others, and the final result of array z will be correct regardless of the execution order of the other iterations.
for(i=2;i<n;i++) z[i] = z[i-1]*2;
The above for loop cannot be auto-parallelized because z[i] depends on the previous iteration value z[i-1].But it doesn’t mean that it cannot be parallelized. The above loop is equal to the following
for(i=2;i<n;i++) z[i] = z[1]*2^(i-1);
But current parallelizing compilers are not usually capable of bringing out these parallelisms automatically, and one cannot say for sure that this piece of code will benefit from parallelization.
Pipelined multithreading
A pipelined multi-threading parallelizing compiler attempts to break up the sequence of operations inside a loop into a series of code block. Each of these code blocks can be executed by separate processors concurrently. The compiler could assign each of the code blocks to different processors and insert appropriate code to forward the output of one processor to another.
Example:
for(i=1;i<=n;i++){ x[i] = x[i-1] + y[i]; z[i] = z[i] + x[i]; }
The compiler can split the above loop into two loops(each can be executed on a different processor) and embed special instructions to forward output of one processor to another.
Loop 1: for(i=1;i<=n;i++){ x[i] = x[i-1] + y[i]; send(x[i]); } Loop 2: for(i=1;i<=n;i++){ receive(x[i]); z[i] = z[i] + x[i]; }