Parallel Programming Models - Revision history

Dmcarmon: added to summary

2011-03-23T23:41:39Z

added to summary

← Older revision		Revision as of 23:41, 23 March 2011
Line 150:		Line 150:
	=Summary=		=Summary=

	Through the history of parallel computing we may observe the tradeoffs between the shared memory and message passing models. Although the models may be combined into hybrid approaches, the two models are fundamentally different ways of addressing the same problem (of access control to common data). In contrast, the data parallel model is concerned with a fundamentally different problem (how to divide work into parallel tasks). As such, the data parallel model may be used in conjunction with either the shared memory or the message passing model without conflict. In fact, [[#References \| Klaiber (1994)]] compares the performance of a number of data parallel programs implemented with both shared memory and message passing models.		Through the history of parallel computing we may observe the tradeoffs between the shared memory and message passing models. In the beginning, because parallel programming was done almost exclusively on expensive, custom-made supercomputers, shared memory systems made sense. Later on, the invention of the personal computer and the wider availability of smaller, less expensive computers led more people to a message passing approach. Since cluster computers rely on message passing between separate processors, cluster computing has supported the move toward the message passing model. Although the models may be combined into hybrid approaches, the two models are fundamentally different ways of addressing the same problem (of access control to common data). In contrast, the data parallel model is concerned with a fundamentally different problem (how to divide work into parallel tasks). As such, the data parallel model may be used in conjunction with either the shared memory or the message passing model without conflict. In fact, [[#References \| Klaiber (1994)]] compares the performance of a number of data parallel programs implemented with both shared memory and message passing models.

	One of the major advantages of combining the data parallel and message passing models is a reduction in the amount and complexity of communication required relative to a task parallel approach. Similarly, combining the data parallel and shared memory models tends to simplify and reduce the amount of synchronization required. If the task parallel code given in the appendix were modified from a message passing model to a shared memory model, the two threads would require 8 signals be sent between the threads (instead of 8 messages). In contrast, the data parallel code would require a single barrier before the local sums are added to compute the full sum.		One of the major advantages of combining the data parallel and message passing models is a reduction in the amount and complexity of communication required relative to a task parallel approach. Similarly, combining the data parallel and shared memory models tends to simplify and reduce the amount of synchronization required. If the task parallel code given in the appendix were modified from a message passing model to a shared memory model, the two threads would require 8 signals be sent between the threads (instead of 8 messages). In contrast, the data parallel code would require a single barrier before the local sums are added to compute the full sum.
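
The single-barrier pattern described in the paragraph above can be sketched as follows. This is only an illustrative sketch, not the appendix code referred to in the article: it assumes a shared memory setting using Python's standard <code>threading</code> module, and the names <code>data_parallel_sum</code>, <code>worker</code>, and <code>local_sums</code> are hypothetical. Each thread runs the same task on its own chunk of the data, and one barrier separates the local sums from the final addition.

```python
import threading


def data_parallel_sum(data, num_threads=2):
    """Sum `data` by giving each thread one contiguous chunk (hypothetical helper)."""
    local_sums = [0] * num_threads          # one slot per thread, so no locking is needed
    total = [0]                             # written by thread 0 only
    barrier = threading.Barrier(num_threads)
    chunk = (len(data) + num_threads - 1) // num_threads  # ceiling division

    def worker(tid):
        start = tid * chunk
        # Identical task for every thread, applied to a different subset of the data.
        local_sums[tid] = sum(data[start:start + chunk])
        # The single synchronization point: wait until every local sum is written.
        barrier.wait()
        if tid == 0:
            total[0] = sum(local_sums)

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]


if __name__ == "__main__":
    print(data_parallel_sum(list(range(1, 9)), num_threads=2))
```

Because each thread writes only its own slot of <code>local_sums</code>, the barrier is the only synchronization the threads need before the local sums are combined.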

Dmcarmon: /* References */

2011-03-23T23:27:15Z

References

← Older revision		Revision as of 23:27, 23 March 2011
Line 210:		Line 210:

	* Philip J. Hatcher, Michael Jay Quinn, ''Data-Parallel Programming on MIMD Computers'', The MIT Press, 1991.		* Philip J. Hatcher, Michael Jay Quinn, ''Data-Parallel Programming on MIMD Computers'', The MIT Press, 1991.

			* Shared Memory and Message Passing http://www.cs.cf.ac.uk/Parallel/Year2/section4.html

	* ''The period 1989 - 1994: ETA and CONVEX: between -40 and +40 Centigrade'' http://www.museumwaalsdorp.nl/computer/en/comp891E.html		* ''The period 1989 - 1994: ETA and CONVEX: between -40 and +40 Centigrade'' http://www.museumwaalsdorp.nl/computer/en/comp891E.html

Dmcarmon: /* Examples with the Data Parallel Model, Shared Memory, and Message Passing */

2011-03-23T23:25:00Z

Examples with the Data Parallel Model, Shared Memory, and Message Passing

← Older revision		Revision as of 23:25, 23 March 2011
Line 11:		Line 11:
	=Examples with the Data Parallel Model, Shared Memory, and Message Passing=		=Examples with the Data Parallel Model, Shared Memory, and Message Passing=

	While the shared memory and message passing models focus on how parallel tasks access common data, the data parallel model focuses on how to divide up work into parallel tasks. Data parallel algorithms exploit parallelism by dividing a problem into a number of identical tasks which execute on different subsets of common data.		While the [http://www.cs.cf.ac.uk/Parallel/Year2/section4.html shared memory and message passing] models focus on how parallel tasks access common data, the data parallel model focuses on how to divide up work into parallel tasks. Data parallel algorithms exploit parallelism by dividing a problem into a number of identical tasks which execute on different subsets of common data.

	==Preliminary Example==		==Preliminary Example==

Dmcarmon: /* Supplement to Chapter 2: The Data Parallel Programming Model */

2011-03-23T23:20:26Z

Supplement to Chapter 2: The Data Parallel Programming Model

← Older revision		Revision as of 23:20, 23 March 2011
Line 1:		Line 1:
	=Supplement to Chapter 2: The Data Parallel Programming Model=		=Supplement to Chapter 2: The Data Parallel Programming Model=

	Chapter 2 of [[#References \| Solihin (2008)]] covers the shared memory and message passing parallel programming models. However, it does not address the [http://portal.acm.org/citation.cfm?id=1239917 \| data parallel] model (a model composed of a set of identical tasks which operate on different subsets of common data), another commonly recognized parallel programming model covered in other treatments like [http://www.mcs.anl.gov/~itf/dbpp/ Foster (1995)] and [[#References \| Culler (1999)]].		Chapter 2 of [[#References \| Solihin (2008)]] covers the shared memory and message passing parallel programming models. However, it does not address the [http://portal.acm.org/citation.cfm?id=1239917 data parallel] model (a model composed of a set of identical tasks which operate on different subsets of common data), another commonly recognized parallel programming model covered in other treatments like [http://www.mcs.anl.gov/~itf/dbpp/ Foster (1995)] and [[#References \| Culler (1999)]].

	==Introduction==		==Introduction==

Dmcarmon at 23:14, 23 March 2011

2011-03-23T23:14:51Z

Dmcarmon: improved links

2011-03-23T22:54:42Z

improved links

← Older revision		Revision as of 22:54, 23 March 2011
Line 1:		Line 1:
	=Supplement to Chapter 2: The Data Parallel Programming Model=		=Supplement to Chapter 2: The Data Parallel Programming Model=

	Chapter 2 of [[#References \| Solihin (2008)]] covers the shared memory and message passing parallel programming models. However, it does not address the [~~[#Definitions~~ \| ''data parallel~~'']~~] model, another commonly recognized parallel programming model covered in other treatments like [~~[#References~~ \| Foster (1995)]] and [[#References \| Culler (1999)]].		Chapter 2 of [[#References \| Solihin (2008)]] covers the shared memory and message passing parallel programming models. However, it does not address the [http://portal.acm.org/citation.cfm?id=1239917 \| data parallel] model (a model composed of a set of identical tasks which operate on different subsets of common data), another commonly recognized parallel programming model covered in other treatments like [http://www.mcs.anl.gov/~itf/dbpp/ \| Foster (1995)] and [[#References \| Culler (1999)]].

	==Introduction==		==Introduction==
Line 7:		Line 7:
	The data parallel programming model first appeared in the eighties as a programming model for SIMD (Single Instruction, Multiple Data) parallel machines. It's defined as multiple processing elements performing an action simultaneously on different parts of a data set and exchanging information globally before processing more code synchronously. Although the shared memory and message passing models are often presented as competing models, the data parallel model addresses fundamentally different programming concerns and can be used in conjunction with either. The main distinction between the data parallel model and the other two models has to do with the outcome of the individual steps instead of the method of communication. The data parallel model was developed for scientific calculations and is generally associated with applications that involve a data set which is typically organized into a common structure, such as an array or matrix. Data parallel processing has been found to be effective in situations where the computations allow the processing to be divided spatially over memories by involving every element of a matrix in a uniform way.		The data parallel programming model first appeared in the eighties as a programming model for SIMD (Single Instruction, Multiple Data) parallel machines. It's defined as multiple processing elements performing an action simultaneously on different parts of a data set and exchanging information globally before processing more code synchronously. Although the shared memory and message passing models are often presented as competing models, the data parallel model addresses fundamentally different programming concerns and can be used in conjunction with either. The main distinction between the data parallel model and the other two models has to do with the outcome of the individual steps instead of the method of communication. 
The data parallel model was developed for scientific calculations and is generally associated with applications that involve a data set which is typically organized into a common structure, such as an array or matrix. Data parallel processing has been found to be effective in situations where the computations allow the processing to be divided spatially over memories by involving every element of a matrix in a uniform way.

	In addition to the data parallel model, the ~~[[#Definitions \| ''~~task parallel~~'']]~~ model will also be introduced briefly in [[#Appendix_B:_Data_Parallel_Versus_Task-parallel \| Appendix B]] as a point of contrast with the data-parallel model. Furthermore, we will discuss the role of the shared memory and message passing models in the history of parallel computing. The goal of this supplement is to provide a treatment of the data parallel model which complements Chapter 2 of [[#References \| Solihin (2008)]].		In addition to the data parallel model, the task parallel model (a model composed of a set of differing tasks which operate on common data) will also be introduced briefly in [[#Appendix_B:_Data_Parallel_Versus_Task-parallel \| Appendix B]] as a point of contrast with the data-parallel model. Furthermore, we will discuss the role of the shared memory and message passing models in the history of parallel computing. The goal of this supplement is to provide a treatment of the data parallel model which complements Chapter 2 of [[#References \| Solihin (2008)]].

	=Examples with the Data Parallel Model, Shared Memory, and Message Passing=		=Examples with the Data Parallel Model, Shared Memory, and Message Passing=

	While the shared memory and message passing models focus on how parallel tasks access common data, the ~~[[#Definitions \| ''~~data parallel~~'']]~~ model focuses on how to divide up work into parallel tasks. Data parallel algorithms exploit parallelism by dividing a problem into a number of identical tasks which execute on different subsets of common data.		While the shared memory and message passing models focus on how parallel tasks access common data, the data parallel model focuses on how to divide up work into parallel tasks. Data parallel algorithms exploit parallelism by dividing a problem into a number of identical tasks which execute on different subsets of common data.

	==Preliminary Example==		==Preliminary Example==
Line 121:		Line 121:
	==Vector Machines==		==Vector Machines==

	First appearing in the 1970s, vector machines were able to apply a single instruction to multiple data values. This type of operation is used frequently in scientific fields or in multimedia.		First appearing in the 1970s, [http://en.wikipedia.org/wiki/Vector_processor
			\|vector machines] were able to apply a single instruction to multiple data values. This type of operation is used frequently in scientific fields or in multimedia.

	The Solomon project at Westinghouse was one of the first machines to use vector operations.		The Solomon project at Westinghouse was one of the first machines to use vector operations.
Line 127:		Line 128:
	Solomon was unsuccessful and was cancelled, eventually to be reborn as the ILLIAC IV at the University of Illinois. The ILLIAC IV showed great success at solving data-intensive problems, peaking at 150 MFLOPS under the right conditions.		Solomon was unsuccessful and was cancelled, eventually to be reborn as the ILLIAC IV at the University of Illinois. The ILLIAC IV showed great success at solving data-intensive problems, peaking at 150 MFLOPS under the right conditions.

	Also, C.mmp came out in 1971 and was actually a multiple instruction, multiple data (MIMD) architecture. It was composed of 16 PDP-11 minicomputers and had a 16x16 crossbar switch between the processors and 16 banks of shared memory.		Also, [http://research.microsoft.com/en-us/um/people/gbell/CGB%20Files/Cmmp%20Multi-Mini-Processor%20ComConference%201972%20c.pdf \| C.mmp] came out in 1971 and was actually a multiple instruction, multiple data (MIMD) architecture. It was composed of 16 PDP-11 minicomputers and had a 16x16 crossbar switch between the processors and 16 banks of shared memory.

	An innovation came with the Cray-1 supercomputer in 1976. It was realized that the large data sets are often manipulated by several instructions back-to-back, such as an addition followed by a multiplication.		An innovation came with the [http://en.wikipedia.org/w/index.php?title=Cray-1&oldid=409177730 \| Cray-1] supercomputer in 1976. It was realized that the large data sets are often manipulated by several instructions back-to-back, such as an addition followed by a multiplication. In the ILLIAC, up to 64 data points were loaded from memory with every instruction, but had to be stored back to manipulate the rest of the vector. The Cray computer was only able to load 12 data points, but by completing multiple instructions before continuing the total number of memory accesses decreased.
	In the ILLIAC, up to 64 data points were loaded from memory with every instruction, but had to be stored back to manipulate the rest of the vector. The Cray computer was only able to load 12 data points, but by completing multiple instructions before continuing the total number of memory accesses decreased.
	The Cray-1 could perform at 240 MFLOPS.		The Cray-1 could perform at 240 MFLOPS.

	One of the later vector machines was the ETA10. It had 4M words of shared memory and 8M words of common memory, where each word was 64 bits. It was clocked at 24ns, but had a theoretical peak speed of 146 MFLOPS.		One of the later vector machines was the [http://www.museumwaalsdorp.nl/computer/en/comp891E.html \| ETA10]. It had 4M words of shared memory and 8M words of common memory, where each word was 64 bits. It was clocked at 24ns, but had a theoretical peak speed of 146 MFLOPS.

	Many of these early machines were shared memory machines. This is likely because memory was very expensive and message passing requires multiple copies of data. However, in the eighties cluster computing began to emerge, and popularized the message passing model.		Many of these early machines were shared memory machines. This is likely because memory was very expensive and message passing requires multiple copies of data. However, in the eighties cluster computing began to emerge, and popularized the message passing model.
Line 139:		Line 139:
	==Cluster Computing==		==Cluster Computing==

	The introduction of the personal computer in 1981 by IBM made smaller, cheaper computers more available and fueled the growth of cluster computing. For companies that couldn't afford to purchase a supercomputer, connecting many small computers to create a computer cluster may have been a more feasible solution when they needed more computing power. This setup uses the message passing model.		The introduction of the personal computer in 1981 by IBM made smaller, cheaper computers more available and fueled the growth of cluster computing. For companies that couldn't afford to purchase a supercomputer, connecting many small computers to create a [http://en.wikipedia.org/wiki/Computer_cluster \| computer cluster] may have been a more feasible solution when they needed more computing power. This setup uses the message passing model.

	Furthermore, the internet was being developed and one of the first cluster systems, VMScluster (then known as VAXcluster), was released in 1983. Pivotal in the development of cluster computing was the Parallel Virtual Machine (PVM). PVM allowed you to create a computer cluster with any		Furthermore, the internet was being developed and one of the first cluster systems, [http://en.wikipedia.org/wiki/VMScluster \| VMScluster] (then known as VAXcluster), was released in 1983. Pivotal in the development of cluster computing was the Parallel Virtual Machine (PVM). PVM allowed you to create a computer cluster with any machine that implemented TCP/IP communication.
	machine that implemented TCP/IP communication.

	==Distributed Memory and Message Passing==		==Distributed Memory and Message Passing==

	In the 1980s, a manufacturing limit led to increased support for multiprocessor systems. The transputer architecture by Inmos was one of the first general-purpose microprocessors designed for parallel computing. The first transputers were released in 1984. Transputers were designed to be easily interlinkable; multiple processing chips could be easily combined into one system.		In the 1980s, a manufacturing limit led to increased support for multiprocessor systems. The [http://en.wikipedia.org/wiki/Transputer \| transputer] architecture by Inmos was one of the first general-purpose microprocessors designed for parallel computing. The first transputers were released in 1984. Transputers were designed to be easily interlinkable; multiple processing chips could be easily combined into one system.

	Each transputer processor could communicate with up to four other processors at up to 20 Mbps. Any number of processors could be combined into a massive processing farm. Of course, in large nets, the delay would be too great for any significant message passing.		Each transputer processor could communicate with up to four other processors at up to 20 Mbps. Any number of processors could be combined into a massive processing farm. Of course, in large nets, the delay would be too great for any significant message passing.
Line 156:		Line 155:
	One of the major advantages of combining the data parallel and message passing models is a reduction in the amount and complexity of communication required relative to a task parallel approach. Similarly, combining the data parallel and shared memory models tends to simplify and reduce the amount of synchronization required. If the task parallel code given in the appendix were modified from a message passing model to a shared memory model, the two threads would require 8 signals be sent between the threads (instead of 8 messages). In contrast, the data parallel code would require a single barrier before the local sums are added to compute the full sum.		One of the major advantages of combining the data parallel and message passing models is a reduction in the amount and complexity of communication required relative to a task parallel approach. Similarly, combining the data parallel and shared memory models tends to simplify and reduce the amount of synchronization required. If the task parallel code given in the appendix were modified from a message passing model to a shared memory model, the two threads would require 8 signals be sent between the threads (instead of 8 messages). In contrast, the data parallel code would require a single barrier before the local sums are added to compute the full sum.

	Much as the shared memory model can benefit from specialized hardware, the data parallel programming model can as well. [~~[#Definitions~~ \| ''SIMD (single-instruction-multiple-data)~~'']~~] processors are specifically designed to run data parallel algorithms. These processors perform a single instruction on many different data locations simultaneously. Modern examples include [http://en.wikipedia.org/wiki/CUDA CUDA processors] developed by nVidia and [http://en.wikipedia.org/wiki/Cell_%28microprocessor%29 Cell processors] developed by STI (Sony, Toshiba, and IBM). For the curious, example code for CUDA processors is provided in the [[#Appendix: C for CUDA Example Code \| Appendix]]. However, whereas the shared memory model can be a difficult and costly abstraction in the absence of hardware support, the data parallel model—like the message passing model—does not require hardware support.		Much as the shared memory model can benefit from specialized hardware, the data parallel programming model can as well. [http://en.wikipedia.org/wiki/SIMD \| SIMD (single-instruction-multiple-data)] processors are specifically designed to run data parallel algorithms. These processors perform a single instruction on many different data locations simultaneously. Modern examples include [http://en.wikipedia.org/wiki/CUDA CUDA processors] developed by nVidia and [http://en.wikipedia.org/wiki/Cell_%28microprocessor%29 Cell processors] developed by STI (Sony, Toshiba, and IBM). For the curious, example code for CUDA processors is provided in the [[#Appendix C: C for CUDA Example Code \| Appendix]]. However, whereas the shared memory model can be a difficult and costly abstraction in the absence of hardware support, the data parallel model—like the message passing model—does not require hardware support.

	Since data parallel code tends to simplify communication and synchronization, data parallel code may be easier to develop than a more task parallel approach. Once written, data parallel programs can scale easily to large numbers of processors. The data parallel model implicitly encourages data locality by having each thread work on a chunk of data, and the regular data chunks also make it easier to reason about where to locate data and how to organize it. On the other hand, it is possible that a problem may not decompose easily into subproblems relying on largely independent chunks of data. In this case, it may be impractical or impossible to apply the data parallel model.		Since data parallel code tends to simplify communication and synchronization, data parallel code may be easier to develop than a more task parallel approach. Once written, data parallel programs can scale easily to large numbers of processors. The data parallel model implicitly encourages data locality by having each thread work on a chunk of data, and the regular data chunks also make it easier to reason about where to locate data and how to organize it. On the other hand, it is possible that a problem may not decompose easily into subproblems relying on largely independent chunks of data. In this case, it may be impractical or impossible to apply the data parallel model.
Line 236:		Line 235:
	* Wikipedia, SPMD [http://en.wikipedia.org/wiki/SPMD http://en.wikipedia.org/wiki/SPMD].		* Wikipedia, SPMD [http://en.wikipedia.org/wiki/SPMD http://en.wikipedia.org/wiki/SPMD].

	* Wikipedia, Transputer [http://en.wikipedia.org/wiki/Transputer]		* Wikipedia, Transputer http://en.wikipedia.org/wiki/Transputer

	* Wikipedia, Vector processor http://en.wikipedia.org/w/~~index.php?title=~~Vector_processor~~&oldid=405209552~~		* Wikipedia, Vector processor http://en.wikipedia.org/wiki/Vector_processor

	* Wikipedia, VMScluster http://en.wikipedia.org/wiki/VMScluster		* Wikipedia, VMScluster http://en.wikipedia.org/wiki/VMScluster

Mpsenn: /* Superscalar Machines */ link for superscalar

2011-01-31T17:11:22Z

Superscalar Machines: link for superscalar

← Older revision		Revision as of 17:11, 31 January 2011
Line 115:		Line 115:
	==Superscalar Machines==		==Superscalar Machines==

	A superscalar processor is a pipelined processor able to retire multiple instructions in one cycle. The first superscalar machine was Cray's CDC 6600. It was released in 1964 and could execute 1 million floating point operations per second (1 MFLOPS).		A [http://en.wikipedia.org/wiki/Superscalar superscalar] processor is a pipelined processor able to retire multiple instructions in one cycle. The first superscalar machine was Cray's CDC 6600. It was released in 1964 and could execute 1 million floating point operations per second (1 MFLOPS).

	The CDC 6600 gained most of its speed through delegating memory access and I/O to other processors, handling only arithmetic and logic. These peripheral processors, and the main CPU, could be designed to be as simple as possible. The CDC 6600 remained the world's fastest computer until 1969, when it was replaced by the CDC 7600.		The CDC 6600 gained most of its speed through delegating memory access and I/O to other processors, handling only arithmetic and logic. These peripheral processors, and the main CPU, could be designed to be as simple as possible. The CDC 6600 remained the world's fastest computer until 1969, when it was replaced by the CDC 7600.

Dmcarmon at 17:06, 31 January 2011

2011-01-31T17:06:16Z

← Older revision		Revision as of 17:06, 31 January 2011
Line 152:		Line 152:
	=Summary=		=Summary=

	~~Although~~ the shared memory and message passing models may be combined into hybrid approaches, the two models are fundamentally different ways of addressing the same problem (of access control to common data). In contrast, the data parallel model is concerned with a fundamentally different problem (how to divide work into parallel tasks). As such, the data parallel model may be used in conjunction with either the shared memory or the message passing model without conflict. In fact, [[#References \| Klaiber (1994)]] compares the performance of a number of data parallel programs implemented with both shared memory and message passing models.		Through the history of parallel computing we may observe the tradeoffs between the shared memory and message passing models. Although the models may be combined into hybrid approaches, the two models are fundamentally different ways of addressing the same problem (of access control to common data). In contrast, the data parallel model is concerned with a fundamentally different problem (how to divide work into parallel tasks). As such, the data parallel model may be used in conjunction with either the shared memory or the message passing model without conflict. In fact, [[#References \| Klaiber (1994)]] compares the performance of a number of data parallel programs implemented with both shared memory and message passing models.

	One of the major advantages of combining the data parallel and message passing models is a reduction in the amount and complexity of communication required relative to a task parallel approach. Similarly, combining the data parallel and shared memory models tends to simplify and reduce the amount of synchronization required. If the task parallel code given in the appendix were modified from a message passing model to a shared memory model, the two threads would require 8 signals be sent between the threads (instead of 8 messages). In contrast, the data parallel code would require a single barrier before the local sums are added to compute the full sum.		One of the major advantages of combining the data parallel and message passing models is a reduction in the amount and complexity of communication required relative to a task parallel approach. Similarly, combining the data parallel and shared memory models tends to simplify and reduce the amount of synchronization required. If the task parallel code given in the appendix were modified from a message passing model to a shared memory model, the two threads would require 8 signals be sent between the threads (instead of 8 messages). In contrast, the data parallel code would require a single barrier before the local sums are added to compute the full sum.

Mpsenn: /* Introduction */

2011-01-31T17:04:50Z

Introduction

← Older revision		Revision as of 17:04, 31 January 2011
Line 7:		Line 7:
	The data parallel programming model first appeared in the eighties as a programming model for SIMD (Single Instruction, Multiple Data) parallel machines. It's defined as multiple processing elements performing an action simultaneously on different parts of a data set and exchanging information globally before processing more code synchronously. Although the shared memory and message passing models are often presented as competing models, the data parallel model addresses fundamentally different programming concerns and can be used in conjunction with either. The main distinction between the data parallel model and the other two models has to do with the outcome of the individual steps instead of the method of communication. The data parallel model was developed for scientific calculations and is generally associated with applications that involve a data set which is typically organized into a common structure, such as an array or matrix. Data parallel processing has been found to be effective in situations where the computations allow the processing to be divided spatially over memories by involving every element of a matrix in a uniform way.		The data parallel programming model first appeared in the eighties as a programming model for SIMD (Single Instruction, Multiple Data) parallel machines. It's defined as multiple processing elements performing an action simultaneously on different parts of a data set and exchanging information globally before processing more code synchronously. Although the shared memory and message passing models are often presented as competing models, the data parallel model addresses fundamentally different programming concerns and can be used in conjunction with either. The main distinction between the data parallel model and the other two models has to do with the outcome of the individual steps instead of the method of communication. 
The data parallel model was developed for scientific calculations and is generally associated with applications that involve a data set which is typically organized into a common structure, such as an array or matrix. Data parallel processing has been found to be effective in situations where the computations allow the processing to be divided spatially over memories by involving every element of a matrix in a uniform way.

	In addition to the data parallel model, the [[#Definitions \| ''task parallel'']] model will also be introduced briefly in [[#~~Appendix B~~ \| Appendix B]] as a point of contrast with the data-parallel model. Furthermore, we will discuss the role of the shared memory and message passing models in the history of parallel computing. The goal of this supplement is to provide a treatment of the data parallel model which complements Chapter 2 of [[#References \| Solihin (2008)]].		In addition to the data parallel model, the [[#Definitions \| ''task parallel'']] model will also be introduced briefly in [[#Appendix_B:_Data_Parallel_Versus_Task-parallel \| Appendix B]] as a point of contrast with the data-parallel model. Furthermore, we will discuss the role of the shared memory and message passing models in the history of parallel computing. The goal of this supplement is to provide a treatment of the data parallel model which complements Chapter 2 of [[#References \| Solihin (2008)]].

	=Examples with the Data Parallel Model, Shared Memory, and Message Passing=		=Examples with the Data Parallel Model, Shared Memory, and Message Passing=

Mpsenn: /* Introduction */ Adding link

2011-01-31T17:04:00Z

Introduction: Adding link

← Older revision		Revision as of 17:04, 31 January 2011
Line 7:		Line 7:
	The data parallel programming model first appeared in the eighties as a programming model for SIMD (Single Instruction, Multiple Data) parallel machines. It's defined as multiple processing elements performing an action simultaneously on different parts of a data set and exchanging information globally before processing more code synchronously. Although the shared memory and message passing models are often presented as competing models, the data parallel model addresses fundamentally different programming concerns and can be used in conjunction with either. The main distinction between the data parallel model and the other two models has to do with the outcome of the individual steps instead of the method of communication. The data parallel model was developed for scientific calculations and is generally associated with applications that involve a data set which is typically organized into a common structure, such as an array or matrix. Data parallel processing has been found to be effective in situations where the computations allow the processing to be divided spatially over memories by involving every element of a matrix in a uniform way.		The data parallel programming model first appeared in the eighties as a programming model for SIMD (Single Instruction, Multiple Data) parallel machines. It's defined as multiple processing elements performing an action simultaneously on different parts of a data set and exchanging information globally before processing more code synchronously. Although the shared memory and message passing models are often presented as competing models, the data parallel model addresses fundamentally different programming concerns and can be used in conjunction with either. The main distinction between the data parallel model and the other two models has to do with the outcome of the individual steps instead of the method of communication. 
The data parallel model was developed for scientific calculations and is generally associated with applications that involve a data set which is typically organized into a common structure, such as an array or matrix. Data parallel processing has been found to be effective in situations where the computations allow the processing to be divided spatially over memories by involving every element of a matrix in a uniform way.

	In addition to the data parallel model, the [[#Definitions \| ''task parallel'']] model will also be introduced briefly in Appendix B as a point of contrast with the data-parallel model. Furthermore, we will discuss the role of the shared memory and message passing models in the history of parallel computing. The goal of this supplement is to provide a treatment of the data parallel model which complements Chapter 2 of [[#References \| Solihin (2008)]].		In addition to the data parallel model, the [[#Definitions \| ''task parallel'']] model will also be introduced briefly in [[#Appendix B \| Appendix B]] as a point of contrast with the data-parallel model. Furthermore, we will discuss the role of the shared memory and message passing models in the history of parallel computing. The goal of this supplement is to provide a treatment of the data parallel model which complements Chapter 2 of [[#References \| Solihin (2008)]].

	=Examples with the Data Parallel Model, Shared Memory, and Message Passing=		=Examples with the Data Parallel Model, Shared Memory, and Message Passing=

← Older revision		Revision as of 23:14, 23 March 2011
Line 1:		Line 1:
	=Supplement to Chapter 2: The Data Parallel Programming Model=		=Supplement to Chapter 2: The Data Parallel Programming Model=

	Chapter 2 of [[#References \| Solihin (2008)]] covers the shared memory and message passing parallel programming models. However, it does not address the [http://portal.acm.org/citation.cfm?id=1239917 \| data parallel] model (a model composed of a set of identical tasks which operate on different subsets of common data), another commonly recognized parallel programming model covered in other treatments like [http://www.mcs.anl.gov/~itf/dbpp/ \| Foster (1995)] and [[#References \| Culler (1999)]].		Chapter 2 of [[#References \| Solihin (2008)]] covers the shared memory and message passing parallel programming models. However, it does not address the [http://portal.acm.org/citation.cfm?id=1239917 \| data parallel] model (a model composed of a set of identical tasks which operate on different subsets of common data), another commonly recognized parallel programming model covered in other treatments like [http://www.mcs.anl.gov/~itf/dbpp/ Foster (1995)] and [[#References \| Culler (1999)]].

	==Introduction==		==Introduction==
Line 121:		Line 121:
	==Vector Machines==		==Vector Machines==

	First appearing in the 1970s, [http://en.wikipedia.org/wiki/Vector_processor		First appearing in the 1970s, [http://en.wikipedia.org/wiki/Vector_processor vector machines] were able to apply a single instruction to multiple data values. This type of operation is used frequently in scientific fields or in multimedia.
	\|vector machines] were able to apply a single instruction to multiple data values. This type of operation is used frequently in scientific fields or in multimedia.

	The Solomon project at Westinghouse was one of the first machines to use vector operations.		The Solomon project at Westinghouse was one of the first machines to use vector operations.
Line 128:		Line 127:
	Solomon was unsuccessful and was cancelled, eventually to be reborn as the ILLIAC IV at the University of Illinois. The ILLIAC IV showed great success at solving data-intensive problems, peaking at 150 MFLOPS under the right conditions.		Solomon was unsuccessful and was cancelled, eventually to be reborn as the ILLIAC IV at the University of Illinois. The ILLIAC IV showed great success at solving data-intensive problems, peaking at 150 MFLOPS under the right conditions.

	Also, [http://research.microsoft.com/en-us/um/people/gbell/CGB%20Files/Cmmp%20Multi-Mini-Processor%20ComConference%201972%20c.pdf \| C.mmp] came out in 1971 and was actually a multiple instruction, multiple data (MIMD) architecture. It was composed of 16 PDP-11 minicomputers and had a 16x16 crossbar switch between the processors and 16 banks of shared memory.		Also, [http://research.microsoft.com/en-us/um/people/gbell/CGB%20Files/Cmmp%20Multi-Mini-Processor%20ComConference%201972%20c.pdf C.mmp] came out in 1971 and was actually a multiple instruction, multiple data (MIMD) architecture. It was composed of 16 PDP-11 minicomputers and had a 16x16 crossbar switch between the processors and 16 banks of shared memory.

	An innovation came with the [http://en.wikipedia.org/w/index.php?title=Cray-1&oldid=409177730 \| Cray-1] supercomputer in 1976. It was realized that large data sets are often manipulated by several instructions back-to-back, such as an addition followed by a multiplication. In the ILLIAC, up to 64 data points were loaded from memory with every instruction, but had to be stored back to manipulate the rest of the vector. The Cray computer was only able to load 12 data points, but by completing multiple instructions before continuing, the total number of memory accesses decreased.		An innovation came with the [http://en.wikipedia.org/w/index.php?title=Cray-1&oldid=409177730 Cray-1] supercomputer in 1976. It was realized that large data sets are often manipulated by several instructions back-to-back, such as an addition followed by a multiplication. In the ILLIAC, up to 64 data points were loaded from memory with every instruction, but had to be stored back to manipulate the rest of the vector. The Cray computer was only able to load 12 data points, but by completing multiple instructions before continuing, the total number of memory accesses decreased.
	The Cray-1 could perform at 240 MFLOPS.		The Cray-1 could perform at 240 MFLOPS.

	One of the later vector machines was the [http://www.museumwaalsdorp.nl/computer/en/comp891E.html \| ETA10]. It had 4M words of shared memory and 8M words of common memory, where each word was 64 bits. It had a clock cycle of 24 ns and a theoretical peak speed of 146 MFLOPS.		One of the later vector machines was the [http://www.museumwaalsdorp.nl/computer/en/comp891E.html ETA10]. It had 4M words of shared memory and 8M words of common memory, where each word was 64 bits. It had a clock cycle of 24 ns and a theoretical peak speed of 146 MFLOPS.

	Many of these early machines were shared memory machines. This is likely because memory was very expensive and message passing requires multiple copies of data. However, in the eighties cluster computing began to emerge, and popularized the message passing model.		Many of these early machines were shared memory machines. This is likely because memory was very expensive and message passing requires multiple copies of data. However, in the eighties cluster computing began to emerge, and popularized the message passing model.
Line 139:		Line 138:
	==Cluster Computing==		==Cluster Computing==

	The introduction of the personal computer in 1981 by IBM made smaller, cheaper computers more available and fueled the growth of cluster computing. For companies that couldn't afford to purchase a supercomputer, connecting many small computers to create a [http://en.wikipedia.org/wiki/Computer_cluster \| computer cluster] may have been a more feasible solution when they needed more computing power. This setup uses the message passing model.		The introduction of the personal computer in 1981 by IBM made smaller, cheaper computers more available and fueled the growth of cluster computing. For companies that couldn't afford to purchase a supercomputer, connecting many small computers to create a [http://en.wikipedia.org/wiki/Computer_cluster computer cluster] may have been a more feasible solution when they needed more computing power. This setup uses the message passing model.

	Furthermore, the internet was being developed and one of the first cluster systems, [http://en.wikipedia.org/wiki/VMScluster \| VMScluster] (then known as VAXcluster), was released in 1983. Pivotal in the development of cluster computing was the Parallel Virtual Machine (PVM). PVM allowed you to create a computer cluster with any machine that implemented TCP/IP communication.		Furthermore, the internet was being developed and one of the first cluster systems, [http://en.wikipedia.org/wiki/VMScluster VMScluster] (then known as VAXcluster), was released in 1983. Pivotal in the development of cluster computing was the Parallel Virtual Machine (PVM). PVM allowed you to create a computer cluster with any machine that implemented TCP/IP communication.

	==Distributed Memory and Message Passing==		==Distributed Memory and Message Passing==

	In the 1980s, a manufacturing limit led to increased support for multiprocessor systems. The [http://en.wikipedia.org/wiki/Transputer \| transputer] architecture by Inmos was one of the first general-purpose microprocessors designed for parallel computing. The first transputers were released in 1984. Transputers were designed to be easily interlinkable; multiple processing chips could be easily combined into one system.		In the 1980s, a manufacturing limit led to increased support for multiprocessor systems. The [http://en.wikipedia.org/wiki/Transputer transputer] architecture by Inmos was one of the first general-purpose microprocessors designed for parallel computing. The first transputers were released in 1984. Transputers were designed to be easily interlinkable; multiple processing chips could be easily combined into one system.

	Each transputer processor could communicate with up to four other processors at up to 20 Mbps. Any number of processors could be combined into a massive processing farm. Of course, in large nets, the delay would be too great for any significant message passing.		Each transputer processor could communicate with up to four other processors at up to 20 Mbps. Any number of processors could be combined into a massive processing farm. Of course, in large nets, the delay would be too great for any significant message passing.
Line 155:		Line 154:
	One of the major advantages of combining the data parallel and message passing models is a reduction in the amount and complexity of communication required relative to a task parallel approach. Similarly, combining the data parallel and shared memory models tends to simplify and reduce the amount of synchronization required. If the task parallel code given in the appendix were modified from a message passing model to a shared memory model, the two threads would require 8 signals be sent between the threads (instead of 8 messages). In contrast, the data parallel code would require a single barrier before the local sums are added to compute the full sum.		One of the major advantages of combining the data parallel and message passing models is a reduction in the amount and complexity of communication required relative to a task parallel approach. Similarly, combining the data parallel and shared memory models tends to simplify and reduce the amount of synchronization required. If the task parallel code given in the appendix were modified from a message passing model to a shared memory model, the two threads would require 8 signals be sent between the threads (instead of 8 messages). In contrast, the data parallel code would require a single barrier before the local sums are added to compute the full sum.
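
The single-barrier pattern described above for the data parallel sum can be sketched as follows. This is only an illustrative sketch in Python, not the appendix code: the function name `data_parallel_sum`, the default thread count, and the even chunking scheme are assumptions made for the example. Each thread sums its own chunk into a private slot, all threads meet at one barrier, and a single thread then combines the local sums.

```python
import threading


def data_parallel_sum(data, num_threads=2):
    """Sum `data` with identical workers, each handling one contiguous chunk.

    Hypothetical helper for illustration; names and chunking are assumptions.
    """
    local_sums = [0] * num_threads          # one private slot per thread
    total = [0]                             # written by thread 0 only
    barrier = threading.Barrier(num_threads)
    chunk = (len(data) + num_threads - 1) // num_threads  # ceiling division

    def worker(tid):
        start = tid * chunk
        end = min(start + chunk, len(data))
        # Same operation on a different part of the data: no locks needed,
        # because each thread writes only to its own slot.
        local_sums[tid] = sum(data[start:end])
        # The only synchronization point: wait until every local sum is ready.
        barrier.wait()
        if tid == 0:
            total[0] = sum(local_sums)

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]


if __name__ == "__main__":
    print(data_parallel_sum(list(range(1, 9)), num_threads=2))
```

In CPython the global interpreter lock means these threads do not actually run the additions in parallel, so the sketch shows the structure of the synchronization rather than any speedup.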

	Much as the shared memory model can benefit from specialized hardware, the data parallel programming model can as well. [http://en.wikipedia.org/wiki/SIMD \| SIMD (single-instruction-multiple-data)] processors are specifically designed to run data parallel algorithms. These processors perform a single instruction on many different data locations simultaneously. Modern examples include [http://en.wikipedia.org/wiki/CUDA CUDA processors] developed by nVidia and [http://en.wikipedia.org/wiki/Cell_%28microprocessor%29 Cell processors] developed by STI (Sony, Toshiba, and IBM). For the curious, example code for CUDA processors is provided in the [[#Appendix C: C for CUDA Example Code \| Appendix]]. However, whereas the shared memory model can be a difficult and costly abstraction in the absence of hardware support, the data parallel model—like the message passing model—does not require hardware support.		Much as the shared memory model can benefit from specialized hardware, the data parallel programming model can as well. [http://en.wikipedia.org/wiki/SIMD SIMD (single-instruction-multiple-data)] processors are specifically designed to run data parallel algorithms. These processors perform a single instruction on many different data locations simultaneously. Modern examples include [http://en.wikipedia.org/wiki/CUDA CUDA processors] developed by nVidia and [http://en.wikipedia.org/wiki/Cell_%28microprocessor%29 Cell processors] developed by STI (Sony, Toshiba, and IBM). For the curious, example code for CUDA processors is provided in the [[#Appendix C: C for CUDA Example Code \| Appendix]]. However, whereas the shared memory model can be a difficult and costly abstraction in the absence of hardware support, the data parallel model—like the message passing model—does not require hardware support.

	Since data parallel code tends to simplify communication and synchronization, data parallel code may be easier to develop than a more task parallel approach. Once written, data parallel programs can scale easily to large numbers of processors. The data parallel model implicitly encourages data locality by having each thread work on a chunk of data, and the regular data chunks also make it easier to reason about where to locate data and how to organize it. On the other hand, it is possible that a problem may not decompose easily into subproblems relying on largely independent chunks of data. In this case, it may be impractical or impossible to apply the data parallel model.		Since data parallel code tends to simplify communication and synchronization, data parallel code may be easier to develop than a more task parallel approach. Once written, data parallel programs can scale easily to large numbers of processors. The data parallel model implicitly encourages data locality by having each thread work on a chunk of data, and the regular data chunks also make it easier to reason about where to locate data and how to organize it. On the other hand, it is possible that a problem may not decompose easily into subproblems relying on largely independent chunks of data. In this case, it may be impractical or impossible to apply the data parallel model.