CSC/ECE 506 Spring 2012/ch2b cm - Revision history

Mkotyad: /* Architecture overview */

2012-02-04T00:41:23Z

Architecture overview

← Older revision		Revision as of 00:41, 4 February 2012
Line 27:		Line 27:
	In this section, we discuss the architecture of the GeForce GTX 580 GPU from NVIDIA <ref name=FermiArch> http://www.nvidia.com/content/PDF/fermi_white_papers/P.Glaskowsky_NVIDIA%27s_Fermi-The_First_Complete_GPU_Architecture.pdf </ref>. The GTX 400/500 series GPUs are based on NVIDIA’s Fermi architecture, which has been the most significant leap in architecture design since the advent of the unified processor design in the G80.		In this section, we discuss the architecture of the GeForce GTX 580 GPU from NVIDIA <ref name=FermiArch> http://www.nvidia.com/content/PDF/fermi_white_papers/P.Glaskowsky_NVIDIA%27s_Fermi-The_First_Complete_GPU_Architecture.pdf </ref>. The GTX 400/500 series GPUs are based on NVIDIA’s Fermi architecture, which has been the most significant leap in architecture design since the advent of the unified processor design in the G80.

	Fermi's GPU architecture consists of multiple Streaming Multiprocessors (also referred to as multiprocessors or SMs), each with its own 32 execution cores, as shown in the figure. A fully enabled Fermi chip such as the GTX 580 has 16 multiprocessors, hence a total of 512 cores. The multiprocessors share an L2 cache, host interface, GigaThread scheduler and multiple DRAM interfaces. ~~(-link- details?)~~		Fermi's GPU architecture consists of multiple Streaming Multiprocessors (also referred to as multiprocessors or SMs), each with its own 32 execution cores, as shown in the figure. A fully enabled Fermi chip such as the GTX 580 has 16 multiprocessors, hence a total of 512 cores. The multiprocessors share an L2 cache, host interface, GigaThread scheduler [http://cdn.arduer.com/wp-content/uploads/2009/10/fermi-17-20.pdf] and multiple DRAM [http://en.wikipedia.org/wiki/Dynamic_random-access_memory] interfaces. The L2 cache is located on the GPU chip itself and is shared by all multiprocessors.
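
The core count quoted above is simply the product of the two figures given in the text. A minimal sketch of that arithmetic follows; the class name <code>GpuConfig</code> and its fields are hypothetical names chosen for illustration and are not part of any NVIDIA API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuConfig:
    """Hypothetical container for the two numbers quoted in the text."""
    num_sms: int        # number of Streaming Multiprocessors on the chip
    cores_per_sm: int   # execution cores inside each SM

    @property
    def total_cores(self) -> int:
        # Every SM contributes the same number of cores.
        return self.num_sms * self.cores_per_sm


# Figures from the text: 16 SMs with 32 cores each.
full_fermi = GpuConfig(num_sms=16, cores_per_sm=32)
print(full_fermi.total_cores)
```

With the values from the text this prints 512. The same calculation applies to a part that ships with fewer enabled multiprocessors.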

	[[File:Fermi arch.jpg]]		[[File:Fermi arch.jpg]]

Capsang: /* Questions */

2012-02-03T22:23:49Z

Questions

← Older revision		Revision as of 22:23, 3 February 2012
Line 231:		Line 231:
	2.Name the Latest GPU whose Architecture is discussed in this article.		2.Name the Latest GPU whose Architecture is discussed in this article.

	3.How many CUDA Processors are present in the NVIDIA Fermi Architecture?		3.How many CUDA Processors/cores are present in the NVIDIA Fermi Architecture?

	4.How are kernels in CUDA Programming different from normal C functions?		4.How are kernels in CUDA Programming different from normal C functions?

	5.What are the different memory ~~spaces~~ provided by the CUDA Architecture?		5.What are the different memory components provided by the CUDA Architecture?

	6.How does SIMD work?		6.How does SIMD work?
Line 241:		Line 241:
	7.What is an Embarrassingly Parallel Problem?		7.What is an Embarrassingly Parallel Problem?

	8.Can a thread block be three dimensional?		8.Can a thread block be three dimensional? If yes, what purpose does it serve?

	9.~~If Yes,~~ How do you access a specific thread in a block?		9.How do you access a specific thread in a block?

	10.Name one interesting research happening in Parallel Computing.		10.Name one interesting research happening in Parallel Computing.

Capsang: /* Data parallelism */

2012-02-03T22:22:27Z

Data parallelism

← Older revision		Revision as of 22:22, 3 February 2012
Line 14:		Line 14:

	=== Data parallelism ===		=== Data parallelism ===
	Data parallelism refers to scenarios in which the same operation is performed concurrently (that is, in parallel) on elements in a source collection or array. In .NET, for example, data parallelism with imperative syntax is supported by several overloads of the For and ForEach methods in the System.Threading.Tasks.Parallel class. In data parallel operations, the source collection is partitioned so that multiple threads can operate on different segments concurrently.		[http://en.wikipedia.org/wiki/Data_parallelism Data parallelism] refers to scenarios in which the same operation is performed concurrently (that is, in parallel) on elements in a source collection or array. In .NET, for example, data parallelism with imperative syntax is supported by several overloads of the For and ForEach methods in the System.Threading.Tasks.Parallel class. In data parallel operations, the source collection is partitioned so that multiple threads can operate on different segments concurrently.
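
The partition-then-apply pattern described above can be sketched in a few lines. This is only an illustration under stated assumptions: it uses Python's standard <code>concurrent.futures</code> module rather than the .NET class named in the text, the helper names <code>partition</code>, <code>square_segment</code> and <code>parallel_square</code> are hypothetical, and CPython threads show the structure of the pattern rather than a real speedup for CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(items, num_parts):
    """Split items into num_parts contiguous segments of near-equal size."""
    size, extra = divmod(len(items), num_parts)
    segments, start = [], 0
    for i in range(num_parts):
        # The first `extra` segments take one additional element.
        end = start + size + (1 if i < extra else 0)
        segments.append(items[start:end])
        start = end
    return segments


def square_segment(segment):
    """The same operation applied to every element of one segment."""
    return [x * x for x in segment]


def parallel_square(items, num_workers=4):
    """Square every element, with each worker handling one segment."""
    segments = partition(items, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # map() returns results in the order the segments were submitted.
        results = list(pool.map(square_segment, segments))
    # Stitch the per-segment results back into one list.
    return [y for segment in results for y in segment]
```

Each worker touches only its own segment, so no coordination between workers is needed until the results are joined at the end.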

	=== Embarrassingly Parallel problems ===		=== Embarrassingly Parallel problems ===

Capsang: /* Memory Hierarchy */

2012-02-03T22:19:07Z

Memory Hierarchy

← Older revision		Revision as of 22:19, 3 February 2012
Line 88:		Line 88:
	[[File:CUDA_Memory.PNG]]		[[File:CUDA_Memory.PNG]]

	Kernels can operate only on device memory (they have no access to host memory), so the runtime provides functions to allocate, deallocate, and copy device memory, as well as to transfer data between host memory and device memory. Device memory can be allocated either as linear memory or as CUDA arrays (optimized for texture fetching ~~-link-~~). Linear memory on the device exists in a 32-bit address space on devices of compute capability 1.x and in a 40-bit address space on devices of compute capability 2.x such as Fermi, so separately allocated entities can reference one another via pointers, e.g. in a linked list. Linear memory is allocated using cudaMalloc() and freed using cudaFree(), and data transfers between host memory and device memory are performed using cudaMemcpy(). Linear memory can also be allocated through cudaMallocPitch() and cudaMalloc3D(). These functions are recommended for allocations of 2D or 3D arrays because they ensure that the allocation is appropriately padded to meet the alignment requirements, giving the best performance when accessing and copying 2D or 3D arrays.		Kernels can operate only on device memory (they have no access to host memory), so the runtime provides functions to allocate, deallocate, and copy device memory, as well as to transfer data between host memory and device memory. Device memory can be allocated either as linear memory or as CUDA arrays (optimized for texture fetching). Linear memory on the device exists in a 32-bit address space on devices of compute capability 1.x and in a 40-bit address space on devices of compute capability 2.x such as Fermi, so separately allocated entities can reference one another via pointers, e.g. in a linked list. Linear memory is allocated using cudaMalloc() and freed using cudaFree(), and data transfers between host memory and device memory are performed using cudaMemcpy(). Linear memory can also be allocated through cudaMallocPitch() and cudaMalloc3D(). These functions are recommended for allocations of 2D or 3D arrays because they ensure that the allocation is appropriately padded to meet the alignment requirements, giving the best performance when accessing and copying 2D or 3D arrays.
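
The padding that a pitched allocation introduces can be illustrated with a little arithmetic. The sketch below does not call the CUDA runtime; it only models the idea that each row is rounded up to an alignment boundary and that elements are then addressed by row pitch. The function names are hypothetical, and the 512-byte alignment is an assumed value for illustration only: the real pitch is chosen by the runtime and returned by cudaMallocPitch().

```python
def pitched_row_bytes(width_elems, elem_size, alignment=512):
    """Bytes reserved per row once the row is padded up to `alignment`.

    `alignment` is an assumed value; the actual pitch comes from the runtime.
    """
    row_bytes = width_elems * elem_size
    # Round up to the next multiple of the alignment.
    return ((row_bytes + alignment - 1) // alignment) * alignment


def element_offset(row, col, pitch, elem_size):
    """Byte offset of element (row, col) from the start of the allocation."""
    # Rows are `pitch` bytes apart, not `width * elem_size` bytes apart.
    return row * pitch + col * elem_size


# A row of 100 four-byte floats needs 400 bytes and is padded to one
# full alignment unit under the assumed 512-byte alignment.
pitch = pitched_row_bytes(width_elems=100, elem_size=4)
offset = element_offset(row=2, col=3, pitch=pitch, elem_size=4)
```

Because every row starts on an aligned boundary, code that walks a 2D array must step between rows by the pitch rather than by the logical row width.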

	== Traditional Problem ==		== Traditional Problem ==

Capsang: /* Terminology */

2012-02-03T22:17:42Z

Terminology

← Older revision		Revision as of 22:17, 3 February 2012
Line 18:		Line 18:
	=== Embarrassingly Parallel problems ===		=== Embarrassingly Parallel problems ===
	In parallel computing, an [http://en.wikipedia.org/wiki/Embarrassingly_parallel embarrassingly parallel] workload (or embarrassingly parallel problem) is one for which little or no effort is required to separate the problem into a number of parallel tasks. This is often the case where there exists no dependency (or communication) between those parallel tasks.		In parallel computing, an [http://en.wikipedia.org/wiki/Embarrassingly_parallel embarrassingly parallel] workload (or embarrassingly parallel problem) is one for which little or no effort is required to separate the problem into a number of parallel tasks. This is often the case where there exists no dependency (or communication) between those parallel tasks.

	~~=== Processing Entity ===~~

	=== Host vs Device ===		=== Host vs Device ===
			In GPU literature, ''host'' refers to the CPU side of the system and ''device'' refers to the GPU. Typically the ''device'' is physically separate from the ''host'' and operates as a co-processor. The host and the device also maintain their own separate memory spaces in DRAM, referred to as ''host memory'' and ''device memory'', respectively.

	== Some Basics ==		== Some Basics ==

Capsang: /* Cache and Memory Architecture */

2012-02-03T21:40:32Z

Cache and Memory Architecture

← Older revision		Revision as of 21:40, 3 February 2012
Line 49:		Line 49:

	====Cache and Memory Architecture====		====Cache and Memory Architecture====
	Fermi provides a fully cached memory access with a unified cache architecture that supports graphics and compute programs. Each thread that runs on a single core of a Multiprocessor has access to a super fast ~~texture cache <ref name=TextureCache>~~http://wiki.secondlife.com/wiki/Texture_Cache~~</ref>~~, an L1 cache and Shared memory. Each Fermi GPU is also equipped with an L2 cache (768KB in size for a 512-core chip). The L2 cache covers GPU local DRAM as well as system memory. The L2 cache subsystem also implements another feature not found on CPUs: a set of memory read-modify-write operations that are atomic, and thus ideal for managing access to data that must be shared across thread blocks or even kernels. L1 and L2 caches help in improving the random memory access performance while the texture cache enables faster texture filtering ~~-link-~~ .		Fermi provides a fully cached memory access with a unified cache architecture that supports graphics and compute programs. Each thread that runs on a single core of a Multiprocessor has access to a super fast [http://wiki.secondlife.com/wiki/Texture_Cache texture cache], an L1 cache and Shared memory. Each Fermi GPU is also equipped with an L2 cache (768KB in size for a 512-core chip). The L2 cache covers GPU local DRAM as well as system memory. The L2 cache subsystem also implements another feature not found on CPUs: a set of memory read-modify-write operations that are atomic, and thus ideal for managing access to data that must be shared across thread blocks or even kernels. L1 and L2 caches help in improving the random memory access performance while the texture cache enables faster [http://en.wikipedia.org/wiki/Texture_filtering texture filtering].
	The programs also have access to a dedicated Shared Memory which is a small software-managed data cache attached to each multiprocessor, shared among the cores. This is a low-latency, high-bandwidth, indexable memory which runs essentially at register speeds.		The programs also have access to a dedicated Shared Memory which is a small software-managed data cache attached to each multiprocessor, shared among the cores. This is a low-latency, high-bandwidth, indexable memory which runs essentially at register speeds.

Capsang: /* Architecture overview */

2012-02-03T21:38:56Z

Architecture overview

← Older revision		Revision as of 21:38, 3 February 2012
Line 28:		Line 28:
	In this section, we discuss the architecture of the GeForce GTX 580 GPU from NVIDIA <ref name=FermiArch> http://www.nvidia.com/content/PDF/fermi_white_papers/P.Glaskowsky_NVIDIA%27s_Fermi-The_First_Complete_GPU_Architecture.pdf </ref>. The GTX 400/500 series GPUs are based on NVIDIA’s Fermi architecture, which has been the most significant leap in architecture design since the advent of the unified processor design in the G80.		In this section, we discuss the architecture of the GeForce GTX 580 GPU from NVIDIA <ref name=FermiArch> http://www.nvidia.com/content/PDF/fermi_white_papers/P.Glaskowsky_NVIDIA%27s_Fermi-The_First_Complete_GPU_Architecture.pdf </ref>. The GTX 400/500 series GPUs are based on NVIDIA’s Fermi architecture, which has been the most significant leap in architecture design since the advent of the unified processor design in the G80.

	Fermi's GPU architecture consists of multiple Streaming Multiprocessors (also referred to as multiprocessors or SMs), each with its own 32 execution cores, as shown in the figure. A fully enabled Fermi chip such as the GTX 580 has 16 multiprocessors, hence a total of 512 cores ~~(Processing entities -link-)~~. The multiprocessors share an L2 cache, host interface, GigaThread scheduler and multiple DRAM interfaces. (-link- details?)		Fermi's GPU architecture consists of multiple Streaming Multiprocessors (also referred to as multiprocessors or SMs), each with its own 32 execution cores, as shown in the figure. A fully enabled Fermi chip such as the GTX 580 has 16 multiprocessors, hence a total of 512 cores. The multiprocessors share an L2 cache, host interface, GigaThread scheduler and multiple DRAM interfaces. (-link- details?)

	[[File:Fermi arch.jpg]]		[[File:Fermi arch.jpg]]
Line 49:		Line 49:

	====Cache and Memory Architecture====		====Cache and Memory Architecture====
	Fermi provides a fully cached memory access with a unified cache architecture that supports graphics and compute programs. Each thread that runs on a single core of a Multiprocessor has access to a super fast texture cache <ref name=TextureCache>http://wiki.secondlife.com/wiki/Texture_Cache</ref>, an L1 cache and Shared memory. Each Fermi GPU is also equipped with an L2 cache (768KB in size for a 512-core chip). The L2 cache covers GPU local DRAM as well as system memory. The L2 cache subsystem also implements another feature not found on CPUs: a set of memory read-modify-write operations that are atomic, and thus ideal for managing access to data that must be shared		Fermi provides a fully cached memory access with a unified cache architecture that supports graphics and compute programs. Each thread that runs on a single core of a Multiprocessor has access to a super fast texture cache <ref name=TextureCache>http://wiki.secondlife.com/wiki/Texture_Cache</ref>, an L1 cache and Shared memory. Each Fermi GPU is also equipped with an L2 cache (768KB in size for a 512-core chip). The L2 cache covers GPU local DRAM as well as system memory. The L2 cache subsystem also implements another feature not found on CPUs: a set of memory read-modify-write operations that are atomic, and thus ideal for managing access to data that must be shared across thread blocks or even kernels. L1 and L2 caches help in improving the random memory access performance while the texture cache enables faster texture filtering -link- .
	across thread blocks or even kernels. L1 and L2 caches help in improving the random memory access performance while the texture cache enables faster texture filtering -link- .
	The programs also have access to a dedicated Shared Memory which is a small software-managed data cache attached to each multiprocessor, shared among the cores. This is a low-latency, high-bandwidth, indexable memory which runs essentially at register speeds.		The programs also have access to a dedicated Shared Memory which is a small software-managed data cache attached to each multiprocessor, shared among the cores. This is a low-latency, high-bandwidth, indexable memory which runs essentially at register speeds.

Capsang at 21:38, 3 February 2012

2012-02-03T21:38:15Z

← Older revision		Revision as of 21:38, 3 February 2012
Line 60:		Line 60:
	Fermi's complexity is hidden by a multi-level programming model that allows programmers to focus on algorithm design rather than hardware-specific details. In NVIDIA’s CUDA software platform, as well as in the industry-standard OpenCL framework, the computational elements of algorithms are known as '''kernels'''. Kernels can be written in standard C, with additional keywords to express parallelism. They look similar to C functions, except that they are executed in parallel by multiple ''threads'' on multiple processing entities.		Fermi's complexity is hidden by a multi-level programming model that allows programmers to focus on algorithm design rather than hardware-specific details. In NVIDIA’s CUDA software platform, as well as in the industry-standard OpenCL framework, the computational elements of algorithms are known as '''kernels'''. Kernels can be written in standard C, with additional keywords to express parallelism. They look similar to C functions, except that they are executed in parallel by multiple ''threads'' on multiple processing entities.

	Threads within a kernel are grouped into ''blocks'' of up to 1024 threads. This hard limit is imposed by the Fermi hardware, which allows a maximum of 1024 threads, or 32 warps, per thread block. All of the threads in a block run on a single SM, so threads within a block can cooperate and share access to the shared memory. At the hardware level, the threads of a block are divided into warps of 32 threads (the warp is the fundamental unit of dispatch within a single SM). Two warps from different thread blocks may be executed on a single SM, increasing energy efficiency and hardware utilization. Fermi supports 48 active warps, that is, a total of 48 × 32 = 1536 threads active simultaneously on each multiprocessor.		Threads within a kernel are grouped into ''blocks'' of up to 1024 threads. This hard limit is imposed by the Fermi hardware, which allows a maximum of 1024 threads, or 32 warps, per thread block. All of the threads in a block run on a single SM, so threads within a block can cooperate and share access to the shared memory. At the hardware level, the threads of a block are divided into warps of 32 threads (the warp is the fundamental unit of dispatch within a single SM). Two warps from different thread blocks may be executed on a single SM, increasing energy efficiency and hardware utilization. Fermi supports 48 active warps, that is, a total of 48 × 32 = 1536 threads active simultaneously on each multiprocessor <ref>http://www.pgroup.com/lit/articles/insider/v2n1a5.htm</ref>.
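
The relationship between threads, warps and the per-multiprocessor limit can be worked through with the numbers quoted above. The following is a rough sketch, not an occupancy calculator: the constant and function names are hypothetical, and it considers only the warp limit stated in the text, ignoring other constraints such as registers, shared memory and any cap on resident blocks.

```python
WARP_SIZE = 32                # threads per warp (from the text)
MAX_THREADS_PER_BLOCK = 1024  # Fermi block-size limit (from the text)
MAX_WARPS_PER_SM = 48         # active warps per multiprocessor (from the text)


def warps_per_block(threads_per_block):
    """Number of warps needed to hold a block of the given size."""
    if not 1 <= threads_per_block <= MAX_THREADS_PER_BLOCK:
        raise ValueError("block size outside the 1..1024 range")
    # A partially filled warp still occupies a whole warp slot.
    return -(-threads_per_block // WARP_SIZE)


def blocks_per_sm_by_warp_limit(threads_per_block):
    """Blocks that fit on one SM if the warp limit were the only constraint."""
    return MAX_WARPS_PER_SM // warps_per_block(threads_per_block)


max_threads_per_sm = MAX_WARPS_PER_SM * WARP_SIZE
```

Under these assumptions a full 1024-thread block uses 32 of the 48 warp slots, so only one such block fits per multiprocessor, whereas three 512-thread blocks fill all 48 slots.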

	The entire Fermi hardware is available for execution to a single application at any point of time. But the context switch time between the applications is short enough, so that a Fermi GPU can still maintain high utilization even when running multiple applications. This fast switching is enabled by the chip-level GigaThread hardware thread scheduler, which manages 1,536 simultaneously active threads for each streaming multiprocessor across 16 kernels.		The entire Fermi hardware is available for execution to a single application at any point of time. But the context switch time between the applications is short enough, so that a Fermi GPU can still maintain high utilization even when running multiple applications. This fast switching is enabled by the chip-level GigaThread hardware thread scheduler, which manages 1,536 simultaneously active threads for each streaming multiprocessor across 16 kernels.

Capsang: /* Architecture overview */

2012-02-03T21:36:50Z

Architecture overview

← Older revision		Revision as of 21:36, 3 February 2012
Line 26:		Line 26:
	=== Architecture overview ===		=== Architecture overview ===

	In this section, we discuss the architecture of the GeForce GTX 580 GPU from NVIDIA. The GTX 400/500 series GPUs are based on NVIDIA’s Fermi architecture, which has been the most significant leap in architecture design since the advent of the unified processor design in the G80.		In this section, we discuss the architecture of the GeForce GTX 580 GPU from NVIDIA <ref name=FermiArch> http://www.nvidia.com/content/PDF/fermi_white_papers/P.Glaskowsky_NVIDIA%27s_Fermi-The_First_Complete_GPU_Architecture.pdf </ref>. The GTX 400/500 series GPUs are based on NVIDIA’s Fermi architecture, which has been the most significant leap in architecture design since the advent of the unified processor design in the G80.

	Fermi's GPU architecture consists of multiple Streaming Multiprocessors (also referred to as multiprocessors or SMs), each with its own 32 execution cores, as shown in the figure. A fully enabled Fermi chip such as the GTX 580 has 16 multiprocessors, hence a total of 512 cores (Processing entities -link-). The multiprocessors share an L2 cache, host interface, GigaThread scheduler and multiple DRAM interfaces. (-link- details?)		Fermi's GPU architecture consists of multiple Streaming Multiprocessors (also referred to as multiprocessors or SMs), each with its own 32 execution cores, as shown in the figure. A fully enabled Fermi chip such as the GTX 580 has 16 multiprocessors, hence a total of 512 cores (Processing entities -link-). The multiprocessors share an L2 cache, host interface, GigaThread scheduler and multiple DRAM interfaces. (-link- details?)

Capsang: /* Why GPUs */

2012-02-03T21:34:29Z

Why GPUs

← Older revision		Revision as of 21:34, 3 February 2012
Line 4:		Line 4:
	=== Why GPUs ===		=== Why GPUs ===

	If we look at modern CPUs and GPUs and compare their performance (in terms of floating point capability) <ref name=FloatingPoint> http://www.deskeng.com/articles/aaayet.htm </ref>, we notice that GPUs outclass CPUs by a wide margin. This is because the GPU is specialized for compute-intensive, highly parallel computation – which is what graphics rendering is all about – and is therefore designed such that more transistors are devoted to data processing rather than data caching and flow control (as the following diagram illustrates). Specifically, the GPU is best suited for ~~data parallel <ref name=DataParallel>~~http://en.wikipedia.org/wiki/Parallel_computing#Data_parallelism~~</ref>~~ applications with high arithmetic intensity – the ratio of arithmetic operations to memory operations. In such applications, where the same set of instructions is executed for each data element, there is minimal need for flow control. Hence the speedup achieved by fast parallel arithmetic computations outweighs the latency incurred by memory accesses and flow control. The following diagram illustrates, at a high level, the differences between CPU and GPU architectures.		If we look at modern CPUs and GPUs and compare their performance (in terms of floating point capability) <ref name=FloatingPoint> http://www.deskeng.com/articles/aaayet.htm </ref>, we notice that GPUs outclass CPUs by a wide margin. This is because the GPU is specialized for compute-intensive, highly parallel computation – which is what graphics rendering is all about – and is therefore designed such that more transistors are devoted to data processing rather than data caching and flow control (as the following diagram illustrates). 
Specifically, the GPU is best suited for [http://en.wikipedia.org/wiki/Parallel_computing#Data_parallelism data parallel] applications with high arithmetic intensity – the ratio of arithmetic operations to memory operations. In such applications, where the same set of instructions is executed for each data element, there is minimal need for flow control. Hence the speedup achieved by fast parallel arithmetic computations outweighs the latency incurred by memory accesses and flow control. The following diagram illustrates, at a high level, the differences between CPU and GPU architectures.