Chapter 2b: Data parallelism in GPUs: Difference between revisions
No edit summary |
No edit summary |
||
(8 intermediate revisions by the same user not shown) | |||
Line 7: | Line 7: | ||
== Basics of CUDA GPU == | == Basics of CUDA GPU == | ||
=== Architecture overview === | === Architecture overview === | ||
=== C Runtime overview === | === C Runtime overview === | ||
Line 37: | Line 36: | ||
</pre> | </pre> | ||
Hence the time taken to perform this simple computation is linear in the size of the input. | |||
This could manifest as a major bottleneck in real time applications such as 3D renderings, Face recognition, Monte Carlo simulation,etc where the input to be processed is embarrassingly large. | |||
== Solution == | == Solution == | ||
The pattern that exists in all of these problems is that, there's one instruction or a single set of instructions that act on a large amount of distinct data elements (SIMD - link). This is a typical pattern where data parallelism can be applied. | |||
In a two processor (or two core) system, we could have both the processing units (cores) work on a part of the array simultaneously. | |||
Following snippet is a rough outline of how this could be achieved: | |||
<pre> | |||
..... | |||
if CPU = "a" | |||
lower_limit := 0 | |||
upper_limit := round(A.length/2) | |||
else if CPU = "b" | |||
lower_limit := round(A.length/2) + 1 | |||
upper_limit := A.length | |||
for i from lower_limit to upper_limit by 1 | |||
C[i] = A[i] + B[i]; | |||
..... | |||
</pre> | |||
This is typically done in high level programming languages such as C using the Thread(-link-) abstraction, where each thread runs simultaneously on a different processing unit. Even though this solution is much faster, it is still limited to just two parallel executions. | |||
=== CUDA solution === | === CUDA solution === | ||
Because the same program (set of instructions) is executed for each data element, there is a lower requirement for sophisticated flow control, and because it is executed on many data elements, it exhibits high arithmetic intensity. GPU is specialized for compute-intensive, highly parallel computation – exactly what graphics rendering is about – and therefore designed such that more transistors are devoted to data processing rather than data caching and flow control. Hence GPU (more specifically CUDA) would be the best answer to such a problem. | |||
Similar to what we did with the two processor, multi-threaded solution, CUDA enables us to create thousands of threads and run them parallely on thousands of Processing units that GPU provides. Following C program illustrates how this can be achieved: | |||
<pre> | |||
__global__ void VecAdd(float* A, float* B, float* C) | |||
{ | |||
int i = threadIdx.x; | |||
C[i] = A[i] + B[i]; | |||
} | |||
int main() | |||
{ | |||
... | |||
// Invocation with N threads | |||
VecAdd<<<1, 1000>>>(A, B, C); | |||
... | |||
} | |||
</pre> | |||
In this code, the main function simply calls a specialized routine (called Kernel (-link-)), and passes in the 3 arrays that need to be processed parallely. | |||
It also informs the CUDA runtime (-link-) to create 1000 threads, where each thread runs the same VecAdd routine, but on different data. As can be seen from the code, in the <code>VecAdd</code> routine, we are finding the thread index and acting only on that index of the input arrays. Hence each of the 1000 threads acts on a different location and performs the computation. This helps us achieve exponential performance improvements, which becomes evident when dealing with super-parallel applications such as those mentioned above. | |||
== Example2 illustrating more CUDA features == | |||
== Other well known Applications == | |||
Password guessing | |||
3D rendering,etc | |||
http://www.infoworld.com/d/developer-world/gpu-computing-about-massive-data-parallelism-263 |
Latest revision as of 01:20, 31 January 2012
Take a modern GPU architecture, and use it as an example in explaining how data-parallel programming is done. Do this in a discussion similar to the discussion of the hypothetical array processor in Lecture 3. That is, describe the problem, then describe the instructions of the GPU, and show code for how the problem can be solved efficiently using GPU instructions. You might want to use multiple examples to illustrate different facilities of a GPU instruction set.
Introduction
Terminology
Basics of CUDA GPU
Architecture overview
C Runtime overview
Problem
Consider the problem of computing addition of two vectors and storing the result into a third vector. A standard C program to achieve this would look like this:
int main() { float A[1000],B[1000],C[1000]; ..... //Some initializations ..... for(int i = 0; i < 1000; i++) C[i] = A[i] + B[i]; ...... }
Here we are executing the for
loop, once for each array index. On a sequential processor (Single core processor),the instructions similar to the following are executed for each run of the for
loop.
Fetch i Fetch A[i] into Register EAX Fetch B[i] into Register EBX Add EBX to EAX Store EAX into Adress of C[i]
Hence the time taken to perform this simple computation is linear in the size of the input. This could manifest as a major bottleneck in real time applications such as 3D renderings, Face recognition, Monte Carlo simulation,etc where the input to be processed is embarrassingly large.
Solution
The pattern that exists in all of these problems is that, there's one instruction or a single set of instructions that act on a large amount of distinct data elements (SIMD - link). This is a typical pattern where data parallelism can be applied.
In a two processor (or two core) system, we could have both the processing units (cores) work on a part of the array simultaneously. Following snippet is a rough outline of how this could be achieved:
..... if CPU = "a" lower_limit := 0 upper_limit := round(A.length/2) else if CPU = "b" lower_limit := round(A.length/2) + 1 upper_limit := A.length for i from lower_limit to upper_limit by 1 C[i] = A[i] + B[i]; .....
This is typically done in high level programming languages such as C using the Thread(-link-) abstraction, where each thread runs simultaneously on a different processing unit. Even though this solution is much faster, it is still limited to just two parallel executions.
CUDA solution
Because the same program (set of instructions) is executed for each data element, there is a lower requirement for sophisticated flow control, and because it is executed on many data elements, it exhibits high arithmetic intensity. GPU is specialized for compute-intensive, highly parallel computation – exactly what graphics rendering is about – and therefore designed such that more transistors are devoted to data processing rather than data caching and flow control. Hence GPU (more specifically CUDA) would be the best answer to such a problem.
Similar to what we did with the two processor, multi-threaded solution, CUDA enables us to create thousands of threads and run them parallely on thousands of Processing units that GPU provides. Following C program illustrates how this can be achieved:
__global__ void VecAdd(float* A, float* B, float* C) { int i = threadIdx.x; C[i] = A[i] + B[i]; } int main() { ... // Invocation with N threads VecAdd<<<1, 1000>>>(A, B, C); ... }
In this code, the main function simply calls a specialized routine (called Kernel (-link-)), and passes in the 3 arrays that need to be processed parallely.
It also informs the CUDA runtime (-link-) to create 1000 threads, where each thread runs the same VecAdd routine, but on different data. As can be seen from the code, in the VecAdd
routine, we are finding the thread index and acting only on that index of the input arrays. Hence each of the 1000 threads acts on a different location and performs the computation. This helps us achieve exponential performance improvements, which becomes evident when dealing with super-parallel applications such as those mentioned above.
Example2 illustrating more CUDA features
Other well known Applications
Password guessing 3D rendering,etc http://www.infoworld.com/d/developer-world/gpu-computing-about-massive-data-parallelism-263