ECE506 CSC/ECE 506 Spring 2012/2x az: Difference between revisions

Revision as of 06:53, 30 January 2012

Data Parallel Programming with the AMD 6900 Series Graphics Processing Unit

Please excuse the formatting! I am working on correcting this.

Graphics processing units (GPUs) are parallel architectures designed to perform fast computations for graphics applications, such as rasterizing, per pixel lighting calculations, and tessellation. Since their introduction, significant effort has been made in enabling general purpose computations such as fluid simulations, rigid and soft-body physics simulations, and particle simulations.

Reasons that GPUs are being tapped for general purpose computations include performance, cost, scalability, and ubiquity.

Modern GPU architectures are SIMD vector processors that can realize speedup in the linear algebra calculations involved in graphics and general purpose scientific and engineering applications.

This article presents an introduction to data-parallel programming on GPUs through exploration of the programming model, the architecture and instruction set of the AMD HD 6900 series GPUs, and finally an example involving vertex translation, rotation, and perspective projection transformations [4].

The examples are designed to explore different facets of the architecture by building upon one another, and include discussion of the AMD HD 6900 instruction set and OpenGL ES 2.0 vertex and fragment shader programs. OpenGL ES 2.0 is a subset of desktop OpenGL 4.2, so the examples can be applied to both mobile and desktop GPUs [5].

Architectural Overview

The AMD HD 6900 series, also known as Cayman, is a member of the Northern Islands family of graphics processors. It is a parallel architecture designed to interface with its own local device memory, the host system memory, and a host application. A block diagram of the processor architecture and its data exchanges with the device memory and host is shown in Fig. 1.

The processor consists of a memory controller for interfacing with device memory and the host memory, a command processor which parses commands placed in system memory by the host and shared by the GPU, an array of data-parallel processors (DPP) configured as compute pipelines, each with its own local shared memory, a combined shared memory for all pipelines, an L1 input cache for the entire DPP array, an L2 input cache, an output cache, a dispatch processor to issue commands to specific pipelines and DPPs, a set of program counters for the DPP array, and an instruction and constant cache. [1]

FIGURE 1. AMD HD 6900 Series Block Diagram [1]

Each compute pipeline is a SIMD engine with its own local data store, called a local data share (LDS). Each SIMD engine is made up of a number of thread processors (Streaming Processor Units, or SPUs). Each SPU contains 4 stream cores, one branch execution unit, and general purpose registers (GPRs). Each stream core is a simple ALU, chosen so that an SPU can operate on the r,g,b,a values of a pixel or the x,y,z,w coordinates of a vector. Because each thread processor has 4 stream cores, the SIMD instructions are known as VLIW4, or very long instruction word with 4 elements. [11]

Interestingly, the AMD HD 6800 series SPU (shown in Fig. 2) contained 5 stream cores, where the fifth was a transcendental core (sine, cosine, etc.) that allowed common per-vertex calculations, such as a 4-component dot product and a transcendental scalar for lighting, to be issued simultaneously. The t-core was introduced by AMD for DirectX 9 to improve vertex shader performance when lighting calculations were executed alongside the per-vertex operations. As DX9 became scarce and DX10 and DX11 became more ubiquitous (through Windows Vista and Windows 7), AMD found that on average only 3.4 of the 5 stream cores were being utilized, indicating that in many cases the t-core sat idle. This, combined with the growth of general purpose GPU computing that had no need for the t-core, led to the decision to drop the t-core and convert to a VLIW4 instruction set, which in turn allowed more SPUs to be included on the same size die. This positions AMD to increase hardware utilization in both games and general purpose computing applications. [12]
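The utilization argument can be made concrete with a little arithmetic. The C sketch below computes the fraction of stream cores doing useful work for a given average number of filled slots. The function name slot_utilization is hypothetical, and applying the same 3.4 average to a 4-wide unit is an assumption made purely for illustration; real VLIW4 code would be packed differently by the compiler.

```c
#include <assert.h>

/* Fraction of issue slots doing useful work, given the average number of
 * slots filled per instruction word and the width of the word.
 * Hypothetical helper, for illustration only. */
double slot_utilization(double avg_slots_filled, int slots_per_word)
{
    assert(slots_per_word > 0);
    assert(avg_slots_filled >= 0.0 &&
           avg_slots_filled <= (double)slots_per_word);
    return avg_slots_filled / (double)slots_per_word;
}
```

With 3.4 of 5 slots filled, slot_utilization(3.4, 5) is about 0.68. Under the assumption above, the same workload on a 4-wide unit gives slot_utilization(3.4, 4), about 0.85.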


FIGURE 2. SIMD Engines and Stream Processing Units [11]

Instruction and data flow - hardware perspective

This section gives a basic overview of the flow of instructions and data from the host to the GPU and its processing elements, then from the processing elements back through the GPU to the display and, if necessary, back to the host. The host places commands into the memory-mapped GPU registers in system memory. These commands can include requests for the GPU to copy data from system memory to device memory by message passing.

The command processor reads these instructions and data from system memory through the memory controller. Once a command is complete, the command processor sends a hardware-generated interrupt to the host [1].

The device memory is updated with instructions and data from system memory through the memory controller via message passing from the host.

The dispatch processor requests a pipeline compute unit to execute an instruction thread (a kernel) by passing it the location of the kernel in device memory, an identifying pair, and a conditional value [1]. The dispatch processor also attempts to balance the load across the compute units by leveling the thread load. [6]

The pipeline compute unit loads the instructions and data from the instruction and data caches (originally from device memory), and executes the kernel from beginning to end. Data and results can be read from and written to device memory through the memory controller and data caches. The compute unit can also “load data from off-chip memory into on-chip general-purpose registers (GPRs) and caches” [1]. Computation and writes to memory can be executed conditionally, allowing parallel algorithms to apply a condition per data element. Additionally, each compute unit contains its own sequencer, which optimizes its performance by rearranging instructions where possible [10].
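Conditional execution per data element can be pictured with a scalar C model. This is only a sketch of the idea, not of the hardware mechanism: every element is visited, a predicate is evaluated for each one, and the write takes place only where the predicate holds. The function name clamp_negative_to_zero and the choice of operation are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Clamp negative elements to zero. Every element is visited and a
 * per-element predicate decides whether the write takes place. There is
 * no early exit from the loop, which loosely mimics lanes that all step
 * through the same instructions together. */
void clamp_negative_to_zero(float *data, size_t count)
{
    assert(data != NULL || count == 0);
    for (size_t i = 0; i < count; ++i) {
        int predicate = data[i] < 0.0f;  /* condition evaluated per element */
        if (predicate)
            data[i] = 0.0f;              /* write only where it holds */
    }
}
```

Because each element is handled independently, the loop iterations could be distributed across processing elements without changing the result.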


Programming Model

Both task and data-parallel programming models are applicable to modern GPUs, but this article focuses on the data-parallel model.

The intended usage of the architecture is to provide batch instructions to the GPU through kernels (implemented as shader programs in OpenGL or compute programs in OpenCL) or load/store/execute/state configuration instructions (through OpenGL or OpenCL API commands) that are placed into memory on the host which is shared with the GPU through memory mapping. Additionally, host programs and kernel programs can be written for general purpose computing applications in language extensions such as nVidia’s CUDA [7], OpenGL and the shader language GLSL [8], and OpenCL [9].

The GPU then uses message passing from the driver application to localize the kernels and command instructions into device memory and cache for execution. The data upon which these kernels operate is also provided to the GPU through message passing via load/store commands in the shared memory space, and localized in the device memory.

The instructions and data on board the processor are streamed through instruction and data caches, and outputs written to output caches. The output caches normally place data back into device memory from which rasterization can be performed (in the case of vertex shaders, geometry shaders, or hull shaders), or, further in the rendering pipeline, back into framebuffers in the device memory for display (in the case of pixel shaders) after being placed into video RAM (VRAM). General purpose compute applications would output into device memory (to avoid PCIe bus transfer overhead for multi-stage calculations) or system memory directly.

Data-parallel programming with OpenGL

OpenGL is an industry standard graphics API, with support built into drivers for almost all graphics hardware (GPUs) and some CPUs directly. Newer versions of OpenGL support a programmable graphics pipeline designed to exploit the parallel architectures available in modern GPUs. This discussion is limited to the programmable pipeline available in OpenGL 4.2, of which OpenGL ES 2.0 (supported by iOS, Android, PlayStation 3, etc.) is a subset.

The programmable pipeline allows data to be placed into the graphics device memory, and instructs the GPU (through shader programs) to operate on the data in memory to perform the geometric transforms (translation, rotation, perspective projection, etc.), lighting transforms (vertex lighting, per pixel lighting, etc.) and any other desired calculations on data in device memory before it is placed into VRAM for display or put back in system memory. The flexibility of the programmable pipeline is powerful, making it a backbone of the support for other general purpose compute APIs, such as OpenCL. Fig. 3 illustrates the programmable pipeline.


FIGURE 3. OpenGL 4.2 Programmable Rendering Pipeline [14]

The programmable pipeline is in contrast with the fixed-function pipelines of the past in which the graphics processor always performed the same steps in order, and data had to be passed to the OpenGL state machine in the correct sequence. The output was the framebuffer, which was then put into VRAM for displaying. Any general purpose calculations were difficult to contrive into the pipeline, and framebuffer to system memory copy time was very slow, so it was not effective. Fig. 4 illustrates a fixed function pipeline.

FIGURE 4. OpenGL 1.1 Fixed Function Rendering Pipeline [13]

In the programmable pipeline, data-parallel programming is done by creating vertex and fragment (previously called pixel) shader programs. A vertex shader program is executed on all vertex elements in an array containing vertex data, and can thus be executed as a kernel in parallel across multiple SIMD engines, where each SPU operates on an individual vertex (1, 2, 3, or 4 floating point elements). Similarly, a fragment shader is executed on all pixels in an array containing pixel data (possibly interleaved with vertex data), and is also executed as a kernel in parallel across multiple SIMD engines, where each SPU operates on an individual pixel (1, 2, 3, or 4 floating point elements). The packing, interleaving, and aligning of data is left to the programmer, but must be considered carefully to reduce cache misses and ensure optimal performance on fetches. As mentioned before, conditional computations can be performed in shaders.

For general purpose compute applications, fragment shaders and vertex shaders can be used on any arbitrary data arrays that are placed in device memory.

In general, algorithms for data-parallel processing contain inner and outer loops. The inner loops, which operate on atomic elements in the data sets and perform the calculations to be vectorized, are placed in the kernels, processed by the GPU, and executed in the DPP array compute units. The outer loops, which traverse the data sets to identify the elements, belong to the host programs. OpenGL refers to the host application as the client and to the device as the server.

OpenGL’s programmable pipeline abstracts the traversal of data from the programmer, leaving the programmer to load the data to the server and implement the inner loops through shader programs while the outer loops are performed automatically.
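The split between inner and outer loops can be sketched in plain C. In this illustration the function scale_and_offset stands in for a shader kernel and run_kernel_over_array stands in for the traversal that the pipeline performs implicitly; both names and the arithmetic are assumptions chosen only to make the structure visible.

```c
#include <assert.h>
#include <stddef.h>

/* Inner loop body: the per-element work that a shader kernel would hold. */
static float scale_and_offset(float x)
{
    return 2.0f * x + 1.0f;
}

/* Outer loop: traversal of the data set. On a GPU this traversal is
 * implicit. Each iteration is independent of the others, so the
 * iterations may run in parallel. */
void run_kernel_over_array(const float *in, float *out, size_t count)
{
    assert((in != NULL && out != NULL) || count == 0);
    for (size_t i = 0; i < count; ++i)
        out[i] = scale_and_offset(in[i]);
}
```

Only the body of scale_and_offset would need to be expressed in a shader; the loop in run_kernel_over_array is the part the programmer no longer writes.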

Shaders for modern GPUs in OpenGL programs are written in GLSL, a C-like language that is compiled into instructions for the vendor-specific GPU instruction set by a shader compiler built into the GPU drivers. This process, and a fictional example of a compiled shader, will be shown in the example.

Data is loaded into device memory through the use of vertex buffer objects or vertex array objects.

Vertex buffers are arrays that store vertex data, such as coordinates, texture coordinates, normal vectors (for lighting), and possibly interleaved vertex color information. In the simplest method, the application programmer provides OpenGL with a pointer to the array in system memory, assigns it a unique id, and then requests OpenGL to enable the array for that unique id. Finally, OpenGL is instructed to draw the vertex data, and is given info on how to treat the vertices (as triangle strips, as points, etc.). This method is not preferred, however, since the programmer effectively controls where the data is stored in system memory and indirectly requests when to copy data from system to device memory through the draw function. The alternative method is to request that OpenGL generate a buffer in device memory, assign it a unique id itself, and have the programmer pass the vertex data into it once. Then, whenever a draw is requested, the data is bound from device memory and drawn without the overhead of the system to device transfer. [15]

The preferred alternative to vertex buffer objects is vertex array objects. The two are almost identical; the difference is that the state changes requested of OpenGL (number of elements per vertex in the vertex buffer, the stride for vertices in the array, etc.), which must be executed prior to each draw call with vertex buffer objects, are instead stored within the vertex array object. Thus, drawing with a vertex array object doesn’t require the device to incur the performance penalty of reconfiguring the pipeline each time. [15]
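The "number of elements per vertex" and "stride" mentioned above follow directly from how the vertex data is laid out. The C sketch below uses a hypothetical interleaved layout (the struct InterleavedVertex and the helper attribute_address are assumptions, not part of OpenGL) to show how the stride and per-attribute byte offsets are derived.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical interleaved vertex: position, normal, texture coordinate. */
typedef struct {
    float position[4];  /* x, y, z, w */
    float normal[3];    /* nx, ny, nz */
    float texcoord[2];  /* s, t */
} InterleavedVertex;

/* Address of one attribute of vertex `index` in an interleaved array.
 * The stride is the size of one whole vertex; the attribute offset is
 * the byte distance of the attribute from the start of its vertex. */
const float *attribute_address(const InterleavedVertex *vertices,
                               size_t index, size_t attribute_offset)
{
    assert(vertices != NULL);
    assert(attribute_offset < sizeof(InterleavedVertex));
    const unsigned char *base = (const unsigned char *)vertices;
    return (const float *)(base + index * sizeof(InterleavedVertex)
                                + attribute_offset);
}
```

For this layout the stride is sizeof(InterleavedVertex) and the normal starts at offsetof(InterleavedVertex, normal). These are the kinds of values that would be supplied as the stride and pointer (or offset) arguments of glVertexAttribPointer.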

Example - Translation, Rotation, and Perspective Projection Transformation of Vertices

This example covers the transformation of vertices in a scene using OpenGL with GLSL, and will be used to illustrate passing arrays of data to device memory, passing a kernel of per-element (per-vertex) instructions to device memory, and parallel execution of a kernel on a data array (vertices).

In a 3D scene, vertices exist as (x,y,z,w) elements in relation to a coordinate system. In graphics, the method used to position the camera, from which the scene is rendered, is to create a “model view” matrix and transform all the vertices in the scene. Next, the vertices are generally transformed by a perspective projection transformation matrix in order to add perspective to the scene.

The modelview matrix encodes the translation and rotation of the vertices in the scene relative to the position and view direction of the camera. For this example, the modelview matrix will be assumed to be the identity matrix, implying that the camera is facing the default direction in OpenGL coordinate space (towards the negative z direction), with its “up” vector pointing toward positive y, so that it can only view vertices in front of it, that is, vertices with negative z coordinates.

The perspective projection transformation matrix defines the volume, known as a frustum, that determines the field of view, the “near” plane (closest point to the camera which will be rendered), the “far” plane (furthest point which will be rendered), and the aspect ratio of the frustum (ratio of width to height). Figures 5 and 6 show the frustum with the coordinate system and the perspective projection matrix. For the purposes of this example, the perspective projection matrix will be assumed to be calculated separately.


FIGURE 5. OpenGL Coordinate System and Frustum [16]


FIGURE 6. Perspective Projection Matrix [16]
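Although the example assumes the projection matrix is calculated separately, a short C sketch shows what that calculation might look like. It fills a row-major matrix a[i][j], matching the indexing used in the rest of this example, using the standard OpenGL frustum form (the matrix produced by glFrustum). The function name frustum_matrix and its parameter names are assumptions; OpenGL itself expects column-major storage, so this array would need to be transposed before being uploaded.

```c
#include <assert.h>

/* Fill a row-major 4x4 perspective projection matrix from the bounds of
 * the near plane (left, right, bottom, top) and the near/far distances.
 * Same form as the matrix produced by glFrustum. */
void frustum_matrix(float a[4][4],
                    float left, float right,
                    float bottom, float top,
                    float near_z, float far_z)
{
    assert(right != left && top != bottom);
    assert(near_z > 0.0f && far_z > near_z);

    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            a[i][j] = 0.0f;

    a[0][0] = 2.0f * near_z / (right - left);
    a[0][2] = (right + left) / (right - left);
    a[1][1] = 2.0f * near_z / (top - bottom);
    a[1][2] = (top + bottom) / (top - bottom);
    a[2][2] = -(far_z + near_z) / (far_z - near_z);
    a[2][3] = -2.0f * far_z * near_z / (far_z - near_z);
    a[3][2] = -1.0f;  /* copies -z into w for the perspective divide */
}
```

For a symmetric frustum the a[0][2] and a[1][2] terms are zero, leaving only the diagonal scale terms, the depth terms in row 2, and the -1 in row 3.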


The transformation for each vertex is a matrix multiplication between the perspective projection matrix A and the vertex v (for N vertices), with the result stored in vertex p (for N vertices). The modelview matrix is not included in the calculation because it is assumed to be the identity matrix. The calculation is shown below:

p[n][i] := <math>\sum_{j=0}^{3} a_{ij} \, v_{nj}</math>

Pseudocode

Pseudocode to perform this calculation on vertex n (0 <= n <= N-1) is shown below.

 a := fetch(projectionMatrix);
 for i := 0 to 3 do
   p[n][i] := 0;
   for j := 0 to 3 do
     p[n][i] := p[n][i] + a[i][j] * v[n][j];
   end;
 end;

Since the GPU is a vector processor in which each SIMD engine contains multiple SPUs, each able to compute a 4-element dot product of two vectors, and many vectors can be processed at once, the pseudocode can be rewritten to more closely resemble a vertex shader program operating on vertex n (0 <= n <= N-1):

 a := LDS_fetch(projectionMatrix); // fetch from LDS of local SIMD engine
 p[n][i] := 0, (0 <= i <= 3);
 p[n][i] := p[n][i] + a[i][j] * v[n][j],
   (0 <= i <= 3) and (0 <= j <= 3) and (0 <= n <= N-1);

The vertex program can calculate every element of the resultant vertex simultaneously, provided enough parallel SPUs exist in the unutilized SIMD engines. Further, given enough free SPUs and SIMD engines, all elements of every transformed vertex can be computed simultaneously (but must be stored at the correct offset in memory to form vertex data arrays).
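The vectorized form can be modeled on a CPU with a C sketch in which one 4-element dot product produces one output element. The function names dot4 and transform_vertices are assumptions for illustration; on the GPU each (n, i) pair could be evaluated by a different SPU at the same time, whereas here they simply run in sequence.

```c
#include <assert.h>
#include <stddef.h>

/* Four-element dot product: the unit of work that the text assigns to
 * a single SPU. */
static float dot4(const float x[4], const float y[4])
{
    return x[0] * y[0] + x[1] * y[1] + x[2] * y[2] + x[3] * y[3];
}

/* p[n] = A * v[n] for every vertex n, with A stored row-major as a[i][j].
 * Every (n, i) pair is independent of the others, so all of them could
 * be computed in parallel. */
void transform_vertices(const float a[4][4], const float v[][4],
                        float p[][4], size_t count)
{
    assert((v != NULL && p != NULL) || count == 0);
    for (size_t n = 0; n < count; ++n)
        for (int i = 0; i < 4; ++i)
            p[n][i] = dot4(a[i], v[n]);
}
```

Each result is written to p[n][i], which corresponds to storing the element at the correct offset in the output vertex array.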

To accomplish the pseudocode above in GLSL, GLSL operations must exist that can access the model-view-projection matrix from device memory, perform vector dot products, and store results into an output buffer.


The GLSL code required for this vertex shader is given below:

 //
 // Shader.vsh
 //
 // Created by Adnan Zafar on 1/29/11.
 //

 attribute vec4 v; // vertex v[n]
 uniform mat4 A;   // model-view-projection matrix A

 void main() {
    gl_Position = A * v;   // p[n] = A x v
 }

The vertex coordinates v are passed in as an ‘attribute’, which is an indexed element of the (bound and enabled) vertex array or vertex buffer object. The model-view-projection matrix A is loaded by the host program into device memory as a ‘uniform’, instructing the vertex program compiler that it should be treated as a constant across all vertices.

The GPU driver compiles the shader program and produces code in the AMD HD 6900 series instruction set. To accomplish this efficiently on this architecture, instructions must exist that load data from the input cache into GPRs, perform a 4-element vector dot product in one instruction and store the result into a GPR, and move the result from GPRs to the output buffers.


Instruction Set

The instruction set for the AMD HD 6900 series GPU is broken into subsets for control flow and clause types. Clauses are homogeneous groups of instructions including ALU, fetch through texture cache (or memory), and global data share instructions. [1]

Each set of instructions grouped into a program has limitations. For example, a control flow program can be no larger than 2^28 bytes, with a maximum of 128 slots for ALU clauses taking 256 dwords, and 16 slots for texture cache and global data share clauses taking 64 dwords. Instructions in one clause are executed serially, but if the clauses themselves have no dependencies, then groups of clauses can be executed in parallel. [1]

Because instructions can be as large as a double word, they are encoded in microcode, and specific instructions use one or two words for the necessary parameters. A full description of the microcode formats can be found in [1]. A set of numbers after a microcode format, such as CF_WORD[0,1], lists the words that make up the microcode (here, words 0 and 1). For alignment purposes, the number of words is usually a power of 2, with unused words set to 0.

The instructions relevant to this example are fetch through texture cache and ALU instructions. The optimization of clauses is outside the scope of this example, as is the efficient loading of data and the vertex shader program from system memory into device memory.

Memory Access

Assuming that the matrix A, the vertex array, and the program are in device memory, then instructions must be used to load the data into the GPRs of each SPU.

Vertex data is loaded from device memory into the GPRs of the SPU using texture cache (TC) clauses. Although it sounds counterintuitive, the TC clauses are used to read both pixel and vertex data, since the vertex cache was removed from the HD 6900 series. A TC clause is initiated with the TC instruction and the vertex fetch is done with the FETCH microcode, repeated for each necessary fetch of the vertices and of the rows of the model-view-projection matrix. [1]

An EXPORT instruction is used to push the data from the GPRs after vertex shader execution back into the position buffer for use in the remainder of the programmable pipeline. Once the last vertex clause has completed, the EXPORT_DONE instruction should be issued to signal to the hardware that the position buffer should be finalized. Alternatively, the data could be pushed to another buffer and back to device memory if the shader is being used for general purpose compute applications. [1]

ALU Instructions

The DOT4 instruction can be utilized to perform a 4-element dot product between two vectors stored in GPRs. To accomplish this, the vertices and the rows of the matrix must be loaded into GPRs. The result of the DOT4 instruction is placed as a scalar in the x element of the previous vector (PV) register, so the scalar must afterwards be moved to a GPR before the end of the ALU clause. [1]

Shader Microcode

An example of abbreviated microcode for reading the vertex and model-view-projection matrix data into GPRs, calculating per vertex values, and writing back to a position buffer is given below. [1] Each entry lists the order, the instruction and parameter summary, the microcode format, and a description.

 Order 1: LOOP_START [...]
   Microcode format: CF_WORD[0,1]
   Description: Initiate a loop through all A matrix rows.

 Order 2: TC [...]
   Microcode format: CF_WORD[0,1]
   Description: Initiate a texture cache clause to read all A matrix rows into GPRs.

 Order 3,4,5,6 (per row): FETCH [VC_INST_FETCH | VTX_FETCH_VERTEX_DATA | 0 | 10 | GPRn | 0 | [0,1,2,3] | [0,1,2,3] | 0 | 1 | 0, GPRm | 0 | [0,1,2,3] | 0 | FMT_32_32_32_32_FLOAT | x | x | x, 0 | 0 | 1 | 0 | 0 | xxxxxxxxxx, 0]
   Microcode format: TEX_WORD[0,1,2]
   Description: Fetch A matrix row data from TC at coordinate (x,y) stored in elements x, y, z, and w of the nth GPR holding the address. Use a non-structured vertex fetch and write the data back to the LDS. Do not use a coalesced vertex fetch. Also store the result in GPR m as four 32-bit floats into x,y,z,w. Start reading from offset 0, with no endian swap, 0 stride, no constants from another thread or a constant buffer.

 Order 7: LOOP_END [...]
   Microcode format: CF_WORD[0,1]
   Description: End the loop over the rows.

 Order 8: LOOP_START [...]
   Microcode format: CF_WORD[0,1]
   Description: Initiate a loop over all vertices.

 Order 9: TC [...]
   Microcode format: CF_WORD[0,1]
   Description: Execute a texture cache clause to read all vertices into GPRs.

 Order 10: FETCH [...]
   Microcode format: TEX_WORD[0,1,2]
   Description: Fetch vertex data from TC into a GPR.

 Order 11: ALU [...]
   Microcode format: CF_WORD[0,1]
   Description: Initiate an ALU clause to calculate the dot product of each vertex with one row of A at a time. (This can be optimized by doing another loop to make this parallel, since the entire clause is executed serially.)

 Order 12,13,14,15: DOT4 [...]
   Microcode format: ALU_WORD[0,1]
   Description: Calculate the dot product of row 0,1,2,3 with the vertex in the GPR.

 Order 16: MOV [...]
   Microcode format: ALU_WORD[0,1]
   Description: Move the scalar result into a destination GPR.

 Order 17: LOOP_END [...]
   Microcode format: CF_WORD[0,1]
   Description: End the loop over vertices.

 Order 18: EXPORT / EXPORT_DONE [...]
   Microcode format: CF_WORD[0,1]
   Description: Export from GPRs to the position buffer.

 Order 19: END
   Microcode format: CF_WORD[0,1]
   Description: End the kernel.


Host Program

The host program reads the vertex shader from a file, compiles it, and binds it for execution through OpenGL. It also loads the vertex data into device memory through vertex buffer or vertex array objects, then binds and enables them. The host program then issues the draw call and finally, through a window manager, requests that the framebuffer be displayed.

Example code for an OpenGL 2.0+ host program is given below.

 #include <...>   // platform-specific OpenGL / OpenGL ES headers

 // attribute index used by the shader, and handles filled in by build()
 enum { ATTRIBUTE_POSITION = 1 };
 GLuint program_handle = 0;
 GLint uniform_modelviewproj = -1;

 GLfloat vertex_data[] = {...};             // x, y, z, w per vertex
 GLfloat model_view_projection[] = {...};   // 4x4 matrix, column-major

 // compile, link, and validate the shader program; returns 0 on success
 int build() {
   char shader_source[1024];
   const GLchar *source_ptr = shader_source;
   GLuint shader_handle = 0;
   GLint valid_status = 0;

   // read_file_to_string is a helper defined elsewhere; nonzero means failure
   if (read_file_to_string("Shader.vsh", shader_source))
     return 1;
   // create the vertex shader and load its source into the GLSL compiler
   shader_handle = glCreateShader(GL_VERTEX_SHADER);
   glShaderSource(shader_handle, 1, &source_ptr, NULL);
   // compile the GLSL shader (use error checking when debugging...)
   glCompileShader(shader_handle);
   // attach the shader to a program object and fix the attribute index
   // (OpenGL ES 2.0 also requires a fragment shader to be attached here)
   program_handle = glCreateProgram();
   glAttachShader(program_handle, shader_handle);
   glBindAttribLocation(program_handle, ATTRIBUTE_POSITION, "v");
   // link the GLSL program
   glLinkProgram(program_handle);
   // validate the program and get the validation status
   glValidateProgram(program_handle);
   glGetProgramiv(program_handle, GL_VALIDATE_STATUS, &valid_status);
   if (!valid_status)
     return 1;
   // make the program current and look up the uniform for matrix A
   glUseProgram(program_handle);
   uniform_modelviewproj = glGetUniformLocation(program_handle, "A");
   return 0;
 }

 // set the OpenGL configuration states
 int config() {
   // set up the model view projection matrix to pass into the shader
   glUniformMatrix4fv(uniform_modelviewproj, 1, GL_FALSE, model_view_projection);
   // configure and enable the vertex array for sending to the shader
   glVertexAttribPointer(ATTRIBUTE_POSITION, 4, GL_FLOAT, GL_FALSE, 0, vertex_data);
   glEnableVertexAttribArray(ATTRIBUTE_POSITION);
   return 0;
 }

 void display() {
   // clear the display
   glClearColor(0, 0, 0, 1);
   glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
   // draw the vertices (4 floats per vertex)
   glDrawArrays(GL_TRIANGLE_STRIP, 0, sizeof(vertex_data) / (4 * sizeof(GLfloat)));
   … // present the framebuffer to the window
 }

 int main() {
   … // set up the window and device context
   if (!build() && !config())
     while (1)
       display();
   // reached only if build() or config() failed
   return 1;
 }

Practicality

Note that due to the relative complexity of performing this computation through microcode, GLSL and other high level shading languages were developed. Such languages, as can be seen from the GLSL provided previously, provide a simplified means to execute data-parallel programming through applications. Graphics device driver developers are still required to explore the GPU instruction sets in order to implement the shading language compilers and to fully exploit the performance of their unique hardware.

Data-parallel programming with OpenCL

OpenCL is an API that adds an extension to the C99 language that provides a generic programming interface to general purpose parallel computing on heterogeneous processor architectures, including GPUs. OpenCL allows for data-parallel and task-parallel programming models.

OpenCL is an attempt at abstracting general purpose parallel programming, but underneath it resorts to sending commands through OpenGL graphics drivers and their shader compilers if available, rather than implementing its own microcode generator for each available GPU. This abstraction frees the application programmer from having to fit a task into the framework of a graphics pipeline with vertices or pixels.


References

[1] AMD HD 6900 Series Instruction Set Architecture pdf
[2] image thresholding
[3] per pixel lighting calculations for a rendered scene
[4] per vertex translation, rotation, and perspective transformations
[5] http://www.khronos.org/opengles/
[6] http://en.wikipedia.org/wiki/Radeon_R600
[7] nVidia’s CUDA
[8] OpenGL and the shader language GLSL
[9] OpenCL
[10] Wasson, Scott. AMD Radeon HD 2900 XT graphics processor: R600 revealed, Tech Report, May 14, 2007
[11] https://wiki.aalto.fi/display/GPGPUK2010/AMD+GPU
[12] http://www.anandtech.com/show/4061/amds-radeon-hd-6970-radeon-hd-6950/4
[13] OpenGL programming guide, v1.1
[14] The OpenGL Graphics System: A Specification (Version 4.2 (Core Profile) - August 8, 2011) Mark Segal, Kurt Akeley
[15] OpenGL ES programming guide for iOS http://developer.apple.com/library/ios/#documentation/3DDrawing/Conceptual/OpenGLES_ProgrammingGuide/TechniquesforWorkingwithVertexData/TechniquesforWorkingwithVertexData.html#//apple_ref/doc/uid/TP40008793-CH107-SW1
[16] http://www.songho.ca/opengl/gl_projectionmatrix.html