CSC/ECE 506 Spring 2011/ch2a mc


Introduction

The Graphics Processing Unit (GPU) is a dedicated, super-threaded, massively data-parallel co-processor. Unlike CPUs, which are architected to perform well for single- and multi-threaded applications, GPUs are designed to be highly parallel and to deliver high computational and memory throughput. A number of programming APIs have been introduced to allow programmers to harness the power of the GPU to perform parallel tasks. Two such architectures are OpenCL and Compute Unified Device Architecture (CUDA).

OpenCL

OpenCL (Open Computing Language) is an open, royalty-free standard for cross-platform parallel programming of CPUs, GPUs, and other processors, maintained by the Khronos Group. It fills a role similar to CUDA's but is not tied to a single vendor's hardware.

Compute Unified Device Architecture (CUDA)

CUDA is one of the parallel architectures available on modern GPUs. It is a proprietary architecture developed by NVIDIA and was introduced in February 2007 for the GeForce 8 series of video cards. The architecture gives programmers access to the GPU's many cores for math-intensive operations such as physics modeling (PhysX), image processing, and matrix algebra. These GPUs are specifically designed to perform many floating-point and integer operations simultaneously, and CUDA can handle millions of threads with little overhead spent managing them.

CUDA Architecture

Figure 1 shows the typical arrangement for a GPU multiprocessor and the general flow of data through the GPU. Data flows from the host to the thread execution manager, which spawns threads and schedules them onto the stream processors (SPs). Each multiprocessor in this figure contains eight stream processors, each with its own memory and texture filter (TF), and each pair of processors shares an L1 cache. Global memory is shared among all the stream processors.
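
The distinction between the small, fast per-block memories and the large global memory shared by all stream processors shows up directly in kernel code. The following is a minimal hedged sketch (the kernel name reverse_block and the fixed 64-thread block size are illustrative assumptions, not taken from the figure): each block stages its chunk of data in on-chip __shared__ memory, synchronizes, and writes the chunk back to global memory in reverse order.

__global__ void reverse_block(float *d_data)
{
 __shared__ float tile[64];                    // On-chip memory, private to this block
 int t = threadIdx.x;
 int base = blockIdx.x * blockDim.x;           // Start of this block's chunk
 tile[t] = d_data[base + t];                   // Load from global memory
 __syncthreads();                              // Wait for every thread's load to finish
 d_data[base + t] = tile[blockDim.x - 1 - t];  // Write the chunk back, reversed
}

Launched as reverse_block <<< n_blocks, 64 >>> (d_data), each 64-thread block reverses its own 64-element slice; the __syncthreads() barrier is what makes the shared tile safe to read after all threads in the block have written to it.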

CUDA Threads

In CUDA programming, serial operations are still handled by the host CPU (the main processor), while parallelizable kernels are handed off to the GPU for processing. To use the GPU effectively, it is important to understand the layout of the CUDA architecture and its memory. Figure 2 shows a simplified block diagram of a typical CUDA thread model.

[Figure 2: simplified block diagram of a typical CUDA thread model]

Each kernel launch is assigned a grid, each grid contains a number of blocks, and each block contains threads (512 maximum per block).
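
To make the hierarchy concrete, here is a minimal hedged sketch (the kernel name where_am_i, the 4x4 grid, and the 16x16 block size are illustrative choices, not part of the original text). A launch configuration is expressed with dim3 values, and each thread locates itself within the grid using the built-in blockIdx, blockDim, and threadIdx variables.

__global__ void where_am_i(int *out, int width)
{
 int x = blockIdx.x * blockDim.x + threadIdx.x;  // Column across the whole grid
 int y = blockIdx.y * blockDim.y + threadIdx.y;  // Row across the whole grid
 out[y * width + x] = y * width + x;             // Record the flattened global ID
}

// Host-side launch (out is assumed to point to a 64x64 device allocation):
dim3 dimGrid(4, 4);     // The grid holds 4 x 4 = 16 blocks
dim3 dimBlock(16, 16);  // Each block holds 16 x 16 = 256 threads, under the 512 limit
where_am_i <<< dimGrid, dimBlock >>> (out, 64);

Multi-dimensional grids and blocks are purely a programming convenience: they let image- and matrix-style code address threads by row and column instead of by a single flat index.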


CUDA Programming

Programming with CUDA is accomplished via language extensions or wrappers, which are available for a number of common programming languages such as:

FORTRAN

Java

Ruby

Python

Perl

.NET

MATLAB

Mathematica

Coding with CUDA is fairly straightforward. Listing 1 shows a simple program that squares each value in an array.

#include "stdafx.h"


#include <stdio.h>

#include <cuda.h>


// Kernel that executes on the CUDA device

__global__ void square_array(float *a, int N)

{

 int idx = blockIdx.x * blockDim.x + threadIdx.x;
 if (idx<N) a[idx] = a[idx] * a[idx];

}


// main routine that executes on the host

int main(void)

{

 float *a_h, *a_d;  // Pointer to host & device arrays
 const int N = 10;  // Number of elements in arrays
 size_t size = N * sizeof(float);
 a_h = (float *)malloc(size);        // Allocate array on host
 cudaMalloc((void **) &a_d, size);   // Allocate array on device
 // Initialize host array and copy it to CUDA device
 for (int i=0; i<N; i++)
   a_h[i] = (float)i;


 cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);


 // Do calculation on device:
 int block_size = 4;
 int n_blocks = N/block_size + (N%block_size == 0 ? 0:1);


 square_array <<< n_blocks, block_size >>> (a_d, N);


 // Retrieve result from device and store it in host array
 cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);


 // Print results
 for (int i=0; i<N; i++)
   printf("%d %f\n", i, a_h[i]);


 // Cleanup
 free(a_h);
 cudaFree(a_d);

}
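
The listing can be built with NVIDIA's nvcc compiler. Assuming the source is saved as square.cu (a hypothetical file name), a typical build and run looks like:

 nvcc square.cu -o square
 ./square

The output pairs each index with its square (0 0.000000, 1 1.000000, 2 4.000000, ..., 9 81.000000). Note how the launch configuration is computed: with N = 10 elements and block_size = 4, n_blocks rounds up to 3, so 12 threads are launched; the two threads whose idx is 10 or 11 fall outside the array and are skipped by the if (idx < N) guard in the kernel.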


