Chapter 2b: Data parallelism in GPUs

From Expertiza_Wiki
Revision as of 00:47, 31 January 2012 by Capsang (talk | contribs)
Jump to navigation Jump to search

Take a modern GPU architecture, and use it as an example in explaining how data-parallel programming is done. Do this in a discussion similar to the discussion of the hypothetical array processor in Lecture 3. That is, describe the problem, then describe the instructions of the GPU, and show code for how the problem can be solved efficiently using GPU instructions. You might want to use multiple examples to illustrate different facilities of a GPU instruction set.

Introduction

Terminology

Basics of CUDA GPU

Architecture overview

Instruction set overview

C Runtime overview

Problem

Consider the problem of computing addition of two vectors and storing the result into a third vector. A standard C program to achieve this would look like this:

int main()
{
	float A[1000],B[1000],C[1000];
	.....	
	//Some initializations
	.....

	for(int i = 0; i < 1000; i++)
		C[i] = A[i] + B[i];

	......
}

Here we are executing the for loop, once for each array index. On a sequential processor (Single core processor),the instructions similar to the following are executed for each run of the for loop.

Fetch i
Fetch A[i] into Register EAX
Fetch B[i] into Register EBX
Add EBX to EAX
Store EAX into Adress of C[i]

Hence the time taken to perform this simple computation is linear in the size of the input. This could manifest as a major bottleneck in real time applications such as 3D renderings, Face recognition, Monte Carlo simulation,etc where the input to be processed is embarrassingly large.

Solution

The pattern that exists in all of this problem is that, there's one instruction or a single set of instructions that act on a large amount of distinct data elements (SIMD - link). This is a typical pattern where data parallelism can be applied.

In a two processor (or two core) system, we could have both the processing units (cores) work on a part of the array simultaneously. Following snippet is a rough outline of how this could be achieved:

.....
    if CPU = "a"
       lower_limit := 0
       upper_limit := round(A.length/2)
    else if CPU = "b"
       lower_limit := round(A.length/2) + 1
       upper_limit := A.length

for i from lower_limit to upper_limit by 1
   C[i] = A[i] + B[i];
.....

This is typically done in high level programming languages such as C using the Thread(-link-) abstraction, where each thread runs simultaneously on a different processing unit.

CUDA solution

Other well known Applications

Password guessing 3D rendering,etc http://www.infoworld.com/d/developer-world/gpu-computing-about-massive-data-parallelism-263