CSC/ECE 506 Fall 2007/wiki1 9 vr

Definition

Array processing is a CPU design concept in which multiple interconnected processing elements execute the same instruction on different data items. A control processor dispatches a single instruction stream to the processing elements, each of which contains a processor and a memory. The processing elements communicate with one another through an interconnection network.

Vector processing can be described as an alternative model for exploiting instruction level parallelism (ILP), in which multiple data elements held in vector registers are processed by pipelined vector functional units. A vector register is a linear array of data elements, each of a definite width of n bits, and the pipelined units perform arithmetic operations on these elements in parallel.

History

Flynn’s Taxonomy classifies computers based on the number of concurrent instruction and data streams available for execution. Array and vector processing fall under the category called SIMD (Single Instruction Multiple Data), in which a single instruction is applied to multiple data streams.

During the 1960s, fear of performance stagnation pushed computer architects to look for smarter ways to increase throughput. Daniel Slotnick, a professor in the Computer Science Department of the University of Illinois, proposed a conceptual SIMD machine called “SOLOMON”, with 1024 1-bit processing elements, each with memory for 128 32-bit values. The machine was never built, but its design was the starting point for a more advanced computer, ILLIAC-IV. ILLIAC-IV had 64 processing elements, each with a memory of 2,048 64-bit words. These elements communicated with each other through an interconnect that resembled a ring. Each element had a direct data path to four other elements: its immediate right and left neighbors and the neighbors spaced eight elements away. The interconnection structure wrapped around, so that PE 63 (Processing Element 63) was directly connected to PE 0 (Processing Element 0).
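The neighbor rule described for the ILLIAC-IV interconnect can be sketched in a few lines. This is only an illustration of the wrap-around arithmetic: the function name `neighbors` and its parameters are hypothetical, and the sketch says nothing about how the hardware was actually built.

```python
NUM_PES = 64  # number of processing elements, as given in the text
STRIDE = 8    # distance to the far neighbors, as given in the text


def neighbors(pe, num_pes=NUM_PES, stride=STRIDE):
    """Return the four PEs with a direct data path to `pe`.

    The modulo operation models the wrap-around, so PE 63 and PE 0
    end up directly connected.
    """
    return [
        (pe - 1) % num_pes,       # immediate left neighbor
        (pe + 1) % num_pes,       # immediate right neighbor
        (pe - stride) % num_pes,  # neighbor eight elements back
        (pe + stride) % num_pes,  # neighbor eight elements ahead
    ]


print(neighbors(0))   # [63, 1, 56, 8]
print(neighbors(63))  # [62, 0, 55, 7]
```

Every connection in this model is symmetric: if PE a lists PE b as a neighbor, then PE b lists PE a.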

One of the first implementations of a vector processing architecture was the CDC STAR-100, from Control Data Corporation. It streamed its vector operands directly from memory, had a high startup time and was relatively slow; later machines such as the Cray-1 used vector registers to hold multiple data elements. Vector architectures exhibit SIMD behavior by providing operations that apply to all elements of a vector.

These parallel computing architectures tried to exploit inherent data parallelism in programs.

Description

An array processor usually has multiple processing elements, each capable of performing arithmetic and logical operations and storing the result in its own memory. Parallelism is achieved by operating on a stream of data rather than on a single element. A control processor is responsible for fetching the instruction and broadcasting it to the PEs (Processing Elements), which execute it. For data-parallel workloads, array processing provides higher performance than serial computing.

For example, a computation that takes an array of elements and performs some operation on each of them would require a serial computer to process one element at a time. An array processor instead distributes the array elements among the PEs. Each PE may be assigned a single element of the array or a set of rows. The instruction dispatched by the control unit is executed by all the PEs, which can communicate with each other. This computing technique is well suited to matrix multiplication and array operations, which are used extensively in statistical analyses, numerical linear algebra, the numerical solution of partial differential equations and digital signal processing calculations.
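The distribute-and-broadcast scheme described above can be modeled in software. The sketch below is a simplified simulation, not a description of any real machine: the names `ProcessingElement`, `distribute`, `broadcast` and `gather` are hypothetical, and the loop over PEs stands in for what the hardware would do simultaneously.

```python
class ProcessingElement:
    """A PE modeled as a processor with its own local memory."""

    def __init__(self, local_data):
        self.memory = list(local_data)

    def execute(self, instruction):
        # Apply the broadcast instruction to every locally stored element.
        self.memory = [instruction(x) for x in self.memory]


def distribute(array, num_pes):
    """Split the array into contiguous chunks, one chunk per PE."""
    chunk = -(-len(array) // num_pes)  # ceiling division
    return [
        ProcessingElement(array[i * chunk:(i + 1) * chunk])
        for i in range(num_pes)
    ]


def broadcast(pes, instruction):
    """Control unit: send the same instruction to every PE."""
    for pe in pes:  # sequential here; in lockstep on a real array processor
        pe.execute(instruction)


def gather(pes):
    """Collect the results from the local memories, in PE order."""
    return [x for pe in pes for x in pe.memory]


data = list(range(8))
pes = distribute(data, 4)            # two elements per PE
broadcast(pes, lambda x: x * 2 + 1)  # one instruction, many data items
result = gather(pes)
print(result)  # [1, 3, 5, 7, 9, 11, 13, 15]
```

A serial computer would apply the operation eight times in sequence; in this model each of the four PEs handles only its own two elements.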

More information on this topic is available at http://www.lib.ncsu.edu:2162/citation.cfm?id=808415&coll=portal&dl=ACM&CFID=28753767&CFTOKEN=45906295

Vector processing, as the name suggests, operates on vectors (a series of values or elements) rather than on scalars (a single value or element). Vector processors typically have:

• Vector Registers

• Vector Functional Units

• Scalar Units with registers and data paths

• Vector Load Store Units

• Interconnect which is used for communication

Each vector register is capable of holding multiple data elements of a definite width, and a typical system has a number of such registers. The load/store units are responsible for fetching operands from memory and writing results back to it. The pipelined functional units perform arithmetic and logical operations. All of these units operate on a series of values residing either in main memory or in the registers. The Cray Y-MP, a supercomputer built by Cray Research, used vector processing to increase performance.

Consider the element-wise multiplication of two arrays on a vector processor. The operands, which are elements of the input arrays, are loaded into two vector registers. The first elements of these two vector registers are fed into a pipelined multiplication unit, which performs the operation and stores the result in a third vector register; the remaining elements follow through the pipeline in order. All functional units are pipelined, so the overall execution time is low. The results are written back to main memory by the load/store unit.
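The load, multiply and store steps of this example can be sketched as follows. This is a functional model only, under stated assumptions: the register length `VLEN = 64` is an assumed value, the helper names are hypothetical, and processing longer arrays one register-sized strip at a time is an addition not described in the text. The model does not capture pipelining or timing.

```python
VLEN = 64  # assumed number of elements in one vector register


def vector_load(memory, base, n):
    """Load/store unit: copy n elements from memory into a vector register."""
    return memory[base:base + n]


def vector_multiply(v1, v2):
    """Multiplication unit: multiply two vector registers element by element."""
    return [a * b for a, b in zip(v1, v2)]


def vector_store(memory, base, vreg):
    """Load/store unit: write a vector register back to memory."""
    memory[base:base + len(vreg)] = vreg


def multiply_arrays(a, b):
    """Multiply two equal-length arrays, one register-sized strip at a time."""
    if len(a) != len(b):
        raise ValueError("input arrays must have the same length")
    c = [0] * len(a)
    for base in range(0, len(a), VLEN):
        n = min(VLEN, len(a) - base)  # the last strip may be shorter
        v1 = vector_load(a, base, n)
        v2 = vector_load(b, base, n)
        v3 = vector_multiply(v1, v2)  # result goes to a third register
        vector_store(c, base, v3)
    return c


a = list(range(1, 101))
b = [2] * 100
c = multiply_arrays(a, b)
print(c[:5], c[-1])  # [2, 4, 6, 8, 10] 200
```

With 100 elements and a 64-element register, the loop runs twice: one full strip of 64 and a final strip of 36.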

More information on this topic is available at http://www.pcc.qub.ac.uk/tec/courses/cray/ohp/CRAY-slides_3.html.

Importance and Trends

The National Energy Research Scientific Computing Center (NERSC) at Berkeley, California collaborates with the computer and computational science departments of several universities. Researchers associated with NERSC also help compile the TOP500 list, which ranks the most powerful supercomputers in the world by Rmax, the maximal performance achieved on the Linpack benchmark. The listings are available at http://www.top500.org.

Horst D. Simon, the Director of NERSC, gave a presentation on trends in supercomputing in December 2003. As indicated in his presentation, global climate modeling and earth simulation are a few of the computationally intensive applications that need supercomputers. The Earth Simulator, developed for the Japan Aerospace Exploration Agency, is a highly parallel vector supercomputer with 640 processor nodes connected by 640x640 single-stage crossbar switches. Each node consists of 8 vector-type arithmetic processors and 16 GB of memory, with a peak performance of 8 Gflops per vector processor. It could run holistic simulations of the atmosphere and oceans down to a resolution of 10 km.
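The per-node figures quoted for the Earth Simulator imply system-wide totals, which the short calculation below works out. It uses only the numbers given in the text; the variable names are illustrative, and the results are theoretical peaks, not measured performance.

```python
NODES = 640                    # processor nodes, from the text
PROCESSORS_PER_NODE = 8        # vector processors per node, from the text
PEAK_GFLOPS_PER_PROCESSOR = 8  # peak per vector processor, from the text
MEMORY_GB_PER_NODE = 16        # memory per node, from the text

total_processors = NODES * PROCESSORS_PER_NODE
peak_tflops = total_processors * PEAK_GFLOPS_PER_PROCESSOR / 1000
total_memory_tb = NODES * MEMORY_GB_PER_NODE / 1024

print(total_processors)  # 5120
print(peak_tflops)       # 40.96
print(total_memory_tb)   # 10.0
```

So the quoted configuration corresponds to 5,120 vector processors, a theoretical peak of about 41 Tflops and 10 TB of memory in total.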

Apart from supercomputers, vector processors find application in multimedia processing, which is computationally intensive and places large demands on portable devices. Multimedia functions operate on streaming data and perform the same operation on multiple elements. The functional units are deeply pipelined to exploit ILP (Instruction Level Parallelism), and multiple load/store units take advantage of the streaming nature of the inputs. Motorola has done considerable research on applying these techniques to multimedia processing.

More information can be obtained at http://www.lib.ncsu.edu:2162/citation.cfm?id=956540&coll=portal&dl=ACM&CFID=28753767&CFTOKEN=45906295

Future

Computational simulation is one of the areas that require supercomputing. Scientific challenges such as understanding, detecting and predicting the human influence on climate, and modeling the full earth system, including the atmosphere, ocean, land and their interactions, can only be addressed with supercomputers.

Several companies, such as IBM, Cray Inc. and SGI, are doing pioneering research in supercomputing, and the performance of these systems continues to scale.