CSC/ECE 506 Fall 2007/wiki1 6 bn


Sections 1.2.1 and 1.2.4: Communication architecture.

Trends in last 10 years. How has data parallelism found its way into shared-memory and message-passing machines? An early example would be MMX. Would you change the number of layers in Fig. 1.13?

A parallel computer is “a collection of processing elements that communicate and cooperate to solve large problems fast” (quoted in Culler, Singh, and Gupta).

Essentially, parallel architecture extends the familiar concepts of computer architecture, with communication architecture as an additional building block.

Computer architecture has two distinct aspects:

  1. The definition of abstractions, such as the hardware/software boundary and the user/system boundary.
  2. The organizational structure that realizes these abstractions.

Communication architecture has these two aspects as well. It defines the basic communication and synchronization operations, and addresses the organizational structures.

Traditionally, software has been written for serial computation, to be executed on a single CPU (central processing unit): a problem is broken into a discrete series of instructions, which are executed one after another, and only one instruction can execute at a time. Parallel computing, on the other hand, is the simultaneous use of multiple compute resources to solve a computational problem.

Many micro-architectural techniques have been used within a single CPU to achieve parallelism, such as instruction pipelining, superscalar execution, and vector processing. Pipelining and superscalar execution exploit instruction-level parallelism (ILP), which allows the compiler and the processor to overlap the execution of multiple independent instructions; vector processing instead applies one operation to many data elements at once (a simple illustration of ILP follows).
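As a rough illustration (not from the text), consider the small C function below: the four multiplies have no data dependences on one another, so a pipelined, superscalar processor, or an optimizing compiler, is free to overlap or reorder them. This is ILP at work, with no programmer-visible parallelism at all.

 /* Illustrative only (not from the text): the four multiplies below have no
    data dependences on each other, so a pipelined, superscalar processor or
    an optimizing compiler is free to overlap or reorder them. */
 #include <stdio.h>

 static double dot4(const double *a, const double *b)
 {
     double p0 = a[0] * b[0];            /* these four products are */
     double p1 = a[1] * b[1];            /* mutually independent and can */
     double p2 = a[2] * b[2];            /* proceed in overlapping */
     double p3 = a[3] * b[3];            /* pipeline stages */
     return (p0 + p1) + (p2 + p3);       /* balanced tree shortens the dependence chain */
 }

 int main(void)
 {
     double a[4] = {1, 2, 3, 4}, b[4] = {5, 6, 7, 8};
     printf("%f\n", dot4(a, b));         /* prints 70.000000 */
     return 0;
 }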

The next step toward parallelism was multiprocessing machines, which have more than one CPU within a single computer system. In a multiprocessing system, all CPUs may be equal, or some may be reserved for special purposes. Systems that treat all CPUs equally are called symmetric multiprocessing (SMP) systems. In systems where the CPUs are not equal, system resources may be divided in a number of ways, including asymmetric multiprocessing (ASMP), non-uniform memory access (NUMA) multiprocessing, and clustered multiprocessing.

In multiprocessing, the processors can be used to execute a single sequence of instructions in multiple contexts (single-instruction, multiple-data or SIMD, often used in vector processing), multiple sequences of instructions in a single context (multiple-instruction, single-data or MISD, used for redundancy in fail-safe systems and sometimes applied to describe pipelined processors or hyper-threading), or multiple sequences of instructions in multiple contexts (multiple-instruction, multiple-data or MIMD).

http://www4.ncsu.edu/~sbhatia2/images/fig1_13.jpg

Layers of abstraction in parallel computer architecture (Fig. 1.13).

Instruction-level parallelism allows the compiler and the processor to overlap the execution of multiple instructions, or to change the order in which instructions may be executed, and it provides much of the opportunity for data parallelism. ILP therefore matters most in the data-parallel model and, to some extent, in the shared-address and message-passing models. Thread-level parallelism (TLP), however, plays the major role in the shared-address and message-passing models.

Communication in multiprocessing

In multiprocessing, communication between processes can be performed either via shared memory or via message passing, and either mechanism may be implemented in terms of the other.

Shared Memory

  1. All processors are connected to the same memory and can access the same memory locations.
  2. A shared-memory system is relatively easy to program, since all processors share a single view of the data, and communication between processors can be as fast as memory accesses to the same location (a minimal sketch follows this list).
  3. The CPU-to-memory connection can become a bottleneck.
  4. Cache coherence issues need to be resolved.
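A minimal shared-memory sketch, assuming POSIX threads (pthreads) as the programming interface; the variable names are purely illustrative. Both threads update the same memory location, and the mutex supplies the synchronization that the shared-address-space model leaves to the programmer.

 /* Two threads in one address space updating the same counter.
    Illustrative sketch, not from the text. */
 #include <pthread.h>
 #include <stdio.h>

 static long counter = 0;                      /* the shared memory location */
 static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

 static void *worker(void *arg)
 {
     for (int i = 0; i < 1000000; i++) {
         pthread_mutex_lock(&lock);            /* serialize access to the shared word */
         counter++;
         pthread_mutex_unlock(&lock);
     }
     return NULL;
 }

 int main(void)
 {
     pthread_t t1, t2;
     pthread_create(&t1, NULL, worker, NULL);
     pthread_create(&t2, NULL, worker, NULL);
     pthread_join(t1, NULL);
     pthread_join(t2, NULL);
     printf("counter = %ld\n", counter);       /* both threads saw the same memory */
     return 0;
 }

Compiled with something like gcc -pthread, this prints counter = 2000000; removing the mutex makes the result unpredictable, which hints at the synchronization and coherence burden noted above.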

Message Passing

  1. A set of processes, each having only local memory.
  2. Processes communicate by sending and receiving messages (see the sketch after this list).
  3. The transfer of data between processes requires cooperative operations to be performed by each process.
  4. Buffers must be managed for sending and receiving messages.
  5. The operating system takes care of copying data from one process's memory to another's.
  6. DMA can enable non-blocking operations.
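A minimal message-passing sketch, assuming MPI as the programming interface (the text does not name a particular library). Each process owns only its local variable; data moves between address spaces through explicit, cooperative send and receive calls.

 /* Two processes with private memories exchanging one integer.
    Illustrative sketch, not from the text; assumes an MPI installation. */
 #include <mpi.h>
 #include <stdio.h>

 int main(int argc, char **argv)
 {
     int rank, value;
     MPI_Init(&argc, &argv);
     MPI_Comm_rank(MPI_COMM_WORLD, &rank);

     if (rank == 0) {
         value = 42;                                        /* local to process 0 */
         MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
     } else if (rank == 1) {
         MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
         printf("process 1 received %d\n", value);          /* copied into local memory */
     }

     MPI_Finalize();
     return 0;
 }

Run with at least two processes (for example, mpirun -np 2 ./a.out), the value held by process 0 is copied into process 1's local memory, which is exactly the cooperative transfer described in items 3 and 5 above.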

Data Parallel Processing

Also known as processor arrays, SIMD machines, or data-parallel architectures.

  1. Operations are performed in parallel on each element of a data structure such as an array or matrix (a minimal sketch follows this list).
  2. Scientific computations frequently involve a uniform calculation on every element of an array.
  3. Parallel data is distributed over the memories of the data processors.
  4. Scalar data is retained in the control processor's memory.
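A minimal data-parallel sketch in C (illustrative only): the same scaling operation is applied uniformly to every element of an array, so every element can be computed independently. The OpenMP pragma shown is just one way to spread the iterations over the available processors; a vectorizing compiler could equally map the loop body onto SIMD instructions.

 /* Illustrative only: a uniform operation applied independently to every
    element of an array. The OpenMP pragma distributes the iterations over
    the available processors. */
 #include <stdio.h>
 #define N 1024

 int main(void)
 {
     static float a[N], b[N];
     float alpha = 2.0f;                 /* scalar data, kept in one place */

     for (int i = 0; i < N; i++)
         b[i] = (float)i;                /* parallel data spread over memory */

     #pragma omp parallel for            /* each element computed independently */
     for (int i = 0; i < N; i++)
         a[i] = alpha * b[i];

     printf("a[%d] = %f\n", N - 1, a[N - 1]);
     return 0;
 }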

On a multiprocessing machine, each program can run on a different processor, and a single program can also be divided into multiple sub-programs (threads), with each part running on a different processor. In either case, part of the program may consist of mathematical computation over different sets of data, so we can improve the program still further by exploiting ILP techniques on a multiprocessing machine that has multiple ALUs per CPU, which leads to data parallelism.

Levels of Parallelism

Levels of parallelism are decided based on the lump of code (grain size) that is a potential candidate for parallelism. All of these approaches share a common goal: to boost processor efficiency and to minimize latency. Dividing a program into multiple threads (control level) gives medium-level granularity, but to achieve finer granularity we have to go to data-level parallelism.


http://www4.ncsu.edu/~sbhatia2/images/task.jpg

Trends in last 10 years

There have been many trends over the last ten years showing how data parallelism has found its way into shared-memory and message-passing machines.

1997

  1. Intel introduced its Pentium line of microprocessors designated “Pentium with MMX Technology”. MMX operated on integer math only.
  2. Pentium 2 (Intel) Based on the Pentium Pro, and carrying the MMX features of the P55C. First x86 processor on a module, with L2 cache on the PC board
  3. K6 (AMD) First Pentium 2 competitor, based on a RISC design with an x86 translation layer.

1998

  1. Pentium 2 Deschutes (Intel) Process shrink to .25µm.
  2. PowerPC 750 (AKA G3) (Apple, IBM and Motorola)
  3. AMD launched 3DNow!, designed to improve a CPU's ability to meet the vector-processing requirements of many graphics-intensive applications. It extended SIMD support to floating-point as well as integer calculations performed in parallel.

1999

  1. Celeron (Intel) Bargain version of the Pentium 2.
  2. Pentium 3 (Intel) Based on the P2's design, new core. Substantially faster than P2. Adds additional SIMD extensions beyond MMX.
  3. PowerPC 7xxx line (AKA G4) (IBM and Motorola)
  4. K6-3 (AMD) Last revision in K6 line, improves speed of multimedia functions and makes new clock rates available.
  5. Intel introduced SSE (Streaming SIMD Extensions) in its Pentium III processors (as a reply to AMD's 3DNow!), adding floating-point support for vector processing (see the sketch after this list).
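A hedged sketch of the kind of packed single-precision arithmetic SSE made possible, using the standard <xmmintrin.h> compiler intrinsics; the example is illustrative and not taken from the text. One _mm_add_ps performs four float additions at once in a 128-bit XMM register.

 /* Illustrative only, using the standard <xmmintrin.h> intrinsics: one
    _mm_add_ps performs four single-precision additions at once in a
    128-bit XMM register. */
 #include <xmmintrin.h>
 #include <stdio.h>

 int main(void)
 {
     float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
     float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
     float c[4];

     __m128 va = _mm_loadu_ps(a);        /* load four packed floats */
     __m128 vb = _mm_loadu_ps(b);
     __m128 vc = _mm_add_ps(va, vb);     /* four additions, one instruction */
     _mm_storeu_ps(c, vc);

     printf("%f %f %f %f\n", c[0], c[1], c[2], c[3]);
     return 0;
 }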

2000

  1. Pentium 4 (Intel) Less efficient than the P3 cycle for cycle. SSE2, introduced with the Pentium 4, is a major enhancement to SSE (which some programmers retroactively renamed "SSE1"). SSE2 adds new math instructions for double-precision (64-bit) floating point and also extends MMX instructions to operate on 128-bit XMM registers. Until SSE4, SSE integer instructions introduced with later SSE extensions would still operate on 64-bit MMX registers, because the new XMM registers require operating-system support. SSE2 enables the programmer to perform SIMD math of virtually any type (from 8-bit integer to 64-bit float) entirely within the XMM vector-register file, without the need to touch the (legacy) MMX/FPU registers. Many programmers consider SSE2 to be "everything SSE should have been", as SSE2 offers an orthogonal set of instructions for dealing with common data types. Bus speeds increased to as much as 533 MHz in order to compete with the Athlon.
  2. Athlon XP and Athlon MP (AMD) Full speed L2 cache. MP is "designed" for multiprocessor use.
  3. Crusoe TM5400 and TM5600 (Transmeta). Crusoe is a "code-morphing" processor which uses dynamic JIT recompilation to run code designed for other processors.


2001

  1. Itanium (Intel) Intel's first 64-bit CPU. Low clock rates (through 2002) but true 64-bit. Explicitly Parallel Instruction Computing (EPIC). Uses a new instruction set, IA-64, which is not based on x86. Extremely poor at emulating x86.

2002

  1. Itanium 2 (Intel) Supports higher clock rates than Itanium and has a shorter pipeline to reduce the cost of a bad branch prediction.
  2. R16000 (SGI) MIPS 4 architecture, 64KB L1 and 4MB L2 cache, and with out-of-order execution (OoO.)

2003

  1. Opteron/Athlon 64 (AMD) AMD's x86-64 processors, collectively code-named "Hammer". Opteron has more cache and two HyperTransport (HT) links per CPU, allowing for glue-less SMP; Athlon 64 has one. A mobile (low-power) version is also available. There are a number of revisions, starting with "ClawHammer" (130 nm). The memory controller is on-die, so HyperTransport only has to handle communication with peripherals and with memory attached to other CPUs (a NUMA architecture).
  2. PowerPC 9xx/G5 (IBM) 64 bit PowerPC processor.

2004

  1. POWER5 (IBM) 64 bit POWER processor.

2005

  1. Athlon 64 X2 (AMD) First dual-core 64 bit desktop processor.


It is reasonably clear how data parallelism can be exploited on vector, SIMD, and MIMD machines. Dividing a program into multiple tasks achieves parallelism, but to get finer granularity we must divide the program at the data level. It is not only the hardware that boosts data parallelism; the software also has to be written in a way that exposes maximum parallelism. Recently, more research has gone into "automatic parallelization", i.e., the automatic transformation of data-parallel applications written in a standard sequential language such as Fortran into SPMD message-passing programs that can be executed on distributed-memory (DM) MIMD machines. It has become clear that this can be at least partly achieved: if the required data partitioning and distribution is prescribed, a compiler can automatically partition the data and computation according to this prescription and insert the necessary communications (a sketch of the resulting SPMD style follows).
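As a rough sketch of the SPMD style such a compiler generates (illustrative only; assuming MPI and a simple block distribution), every process runs the same program, but each one computes only the slice of the global index space that it owns:

 /* Illustrative only (assumes MPI and a simple block distribution): every
    process runs the same program but computes only the slice of the global
    index space that it owns ("owner computes"). */
 #include <mpi.h>
 #include <stdio.h>
 #define N 1000

 int main(int argc, char **argv)
 {
     int rank, size;
     MPI_Init(&argc, &argv);
     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
     MPI_Comm_size(MPI_COMM_WORLD, &size);

     int chunk = N / size;               /* assumes N is divisible by size, for brevity */
     int lo = rank * chunk;
     int hi = lo + chunk;

     double local[N];                    /* each process stores only its own block */
     for (int i = lo; i < hi; i++)
         local[i - lo] = (double)i * i;  /* owner computes its elements */

     printf("process %d computed elements %d..%d\n", rank, lo, hi - 1);
     MPI_Finalize();
     return 0;
 }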

References

Parallel Computer Architecture: A Hardware/Software Approach (The Morgan Kaufmann Series in Computer Architecture and Design) by David Culler, J.P. Singh, and Anoop Gupta

http://everything2.com/index.pl?node_id=1362904
http://en.wikipedia.org/wiki/Superscalar
http://en.wikipedia.org/wiki/MMX
http://en.wikipedia.org/wiki/Streaming_SIMD_Extensions
http://en.wikipedia.org/wiki/AltiVec
http://www.gridbus.org/~raj/microkernel/chap1.pdf
http://www.vcpc.univie.ac.at/activities/tutorials/HPF/lectures/html/jhm.2.html