CSC/ECE 506 Fall 2007/wiki2 3 pb

The Problem

MSIMD architectures have garnered quite a bit of attention recently. Read a few papers on these architectures and write a survey of applications for which they would be suitable. If possible, talk about the steps in parallelizing these applications (decomposition, assignment, orchestration, and mapping).

Introduction

The two major classes of parallel architectures, SIMD and MIMD, were once clearly separated. They have since been superseded by hybrid architectures that combine the features of both classes. One such form, Multiple SIMD (MSIMD), combines the generality of the control parallelism provided by MIMD with the efficiency of the data parallelism provided by SIMD systems. This combination of generality and efficiency makes MSIMD architectures strong candidates for the next generation of massively parallel supercomputers. The evolution of parallel computers is illustrated in the figure below.


http://www4.ncsu.edu/~sbhatia2/images/wiki/img1.jpg

Source: “The GPA Machine: A generally partitionable MSIMD Architecture”

The interesting parts of this diagram lie along the control-parallelism edges of the graph. The evolution from the von Neumann machine to the MIMD machine requires the introduction of an interconnection network, which extends the functionality of the RAM module from a single-ported to a multi-ported RAM device. Similarly, a single-port SIMD machine must be extended to a multi-port SIMD device if control parallelism is added.
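The sketch below is not from any of the surveyed papers; it is a minimal Python illustration (with invented names and values) of the difference just described: a pure SIMD machine broadcasts one instruction to every processing element, while an MSIMD machine lets several control units each drive their own partition of the PE array.

# Minimal sketch contrasting SIMD and MSIMD dispatch.  A "control unit" is
# modelled as an instruction stream applied to every PE in its partition.
import numpy as np

def simd_step(data, op):
    """One SIMD step: a single instruction `op` applied to every PE's datum."""
    return op(data)

def msimd_step(data, partitions, ops):
    """One MSIMD step: each partition of PEs executes its own instruction."""
    out = data.copy()
    for pes, op in zip(partitions, ops):
        out[pes] = op(out[pes])          # each control unit drives its own PEs
    return out

if __name__ == "__main__":
    pe_data = np.arange(8, dtype=float)          # one datum per PE

    # Pure SIMD: one control unit, one instruction for all eight PEs.
    print(simd_step(pe_data, lambda x: x * 2))

    # MSIMD: two control units, each owning four PEs and issuing a different
    # instruction stream (control parallelism on top of data parallelism).
    parts = [np.arange(0, 4), np.arange(4, 8)]
    print(msimd_step(pe_data, parts, [lambda x: x * 2, lambda x: x + 100]))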

3-D Wafer Stack Neurocomputing

Source: 3-D Wafer Stack Neurocomputing, Michael L. Campbell, Hughes Research Laboratories; Scott T. Toborg, Hughes Space and Communication Group

The authors introduce a family of massively parallel MSIMD architectures that can be configured to efficiently handle a variety of different neural network models. The underlying technology, three-dimensional wafer scale integration (3-D WSI), provides an ideal medium for constructing powerful, compact, and low-power hardware tailored to neural network processing. Many different types of parallel architectures have been proposed for implementing neural networks. While some of these approaches come close to meeting the computational requirements of neural networks, few meet the weight, size, and power constraints mandated by real-time embedded applications.

In the paper mentioned above, the authors have illustrated the utility of the 3-D WSI approach by describing the design of a family of massively parallel bit-serial Multiple-SIMD (MSIMD) array processors tailored for processing different neural network models.

Application: Neural Network Vision

Computer vision deals with the automatic construction of 3-D scene descriptions from sparse, noisy 2-D image data. Information contained in a 2-D image is generally insufficient to reconstruct a unique set of 3-D surface properties. Vision integration is one technique for combining constraints from different data sources, such as motion, texture, and color, to improve the computation of 3-D surface properties. In this approach, information from one vision module is used to guide the calculation of image properties in another module. The process can also be considered cooperative in the sense that vision modules can mutually and interactively affect each other's computations.

The algorithms to implement the above were initially developed on a massively parallel bit-serial SIMD array processor, the AMT DAP 610C. While the DAP typically provides a speed-up of over 300 times relative to a serial workstation (Sun 4), it is still too slow, by over an order of magnitude, for real-time processing.

The vision integration problem has three significant levels of parallelism (a minimal sketch of the spatial level appears after the benchmark results below):

  1. Spatial—Neurons representing the image estimates are processed in parallel.
  2. Procedural—Neuron update equations can be decomposed and parts can be computed in parallel.
  3. Modular—Each vision module may be processed independently of the other modules.

Benchmarks and timing estimates for various serial and parallel machines on 200 iterations of surface reconstruction:

  1. Sun 4: 171.34 s
  2. DAP 610C: 0.52 s
  3. 3D/16: 0.33 s
  4. 3D/26-par: 0.042 s

http://www4.ncsu.edu/~sbhatia2/images/wiki/img2.jpg

Benchmark and timing estimates for vision integration operations:

  1. DAP 610C: 1.58 s
  2. 3D/60-multi: 0.048 s

http://www4.ncsu.edu/~sbhatia2/images/wiki/img3.jpg

http://www4.ncsu.edu/~sbhatia2/images/wiki/img4.jpg
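The following sketch illustrates only the "spatial" level of parallelism noted above: every neuron (one per image location) is updated identically and simultaneously, which is exactly what a bit-serial SIMD or MSIMD array exploits. The update rule is a generic relaxation step (data term plus neighbour smoothing) with invented parameters, not the specific equations used in the paper.

# Data-parallel surface-reconstruction sketch: sparse noisy depth samples are
# propagated over a grid by repeated, identical per-pixel updates.
import numpy as np

def relax_step(estimate, observed, mask, lam=0.2):
    """One data-parallel update of all surface estimates at once."""
    # Average of the four neighbours (simple smoothness constraint).
    nb = (np.roll(estimate, 1, 0) + np.roll(estimate, -1, 0) +
          np.roll(estimate, 1, 1) + np.roll(estimate, -1, 1)) / 4.0
    # Pull toward the sparse observations where they exist, smooth elsewhere.
    return np.where(mask, (1 - lam) * nb + lam * observed, nb)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    truth = np.add.outer(np.linspace(0, 1, 64), np.linspace(0, 1, 64))
    mask = rng.random((64, 64)) < 0.05           # sparse, noisy depth samples
    observed = np.where(mask, truth + 0.01 * rng.standard_normal((64, 64)), 0)

    est = np.zeros((64, 64))
    for _ in range(200):                         # 200 iterations, as benchmarked
        est = relax_step(est, observed, mask)
    print("mean abs error:", float(np.abs(est - truth).mean()))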

In this paper, it has been shown how 3-D WSI can be used along with MSIMD to construct digital systems with very high throughput per unit of power and volume and extremely high communication bandwidth.

An MSIMD Architecture for Feature Tracking

D.J. Kerbyson, T.J. Atherton, G.R. Nudd, University of Warwick, England

In this paper the authors describe the use of a heterogeneous Multiple-SIMD (MSIMD) architecture for tracking features within an image. Features are detected within the image and tracked through subsequent images. Tracking filters with appropriate models run in parallel with the image processing on the architecture.

Application: Tracking features of images within cluttered environments is needed mostly for military and police use and sometimes for commercial use. This is best done by mapping these techniques onto parallel architectures. Such an architecture has to process different forms and structures of data, starting with the image data and progressing through numeric data to symbolic representations.

Typical architectures can be classified as either SIMD or MIMD. SIMD architectures are well suited to data-parallel problems where all elements in a large array of data are processed identically; raw image data is usually of this type. MIMD architectures are effective where a number of data elements are to be processed independently (control parallel), which is typical of object-based or symbolic representations. This difference between the two array types has therefore limited their application to a single level of the required processing. A pyramid structure, which is a fusion of both SIMD and MIMD processor arrays, is implemented to give enhanced processor performance.

This pyramid structure can be viewed as forming Multiple SIMD (MSIMD) arrays at the bottom level. Each of these smaller SIMD arrays is termed a cluster, and the clusters are connected together in a modular way, producing a machine that is scalable with the number of clusters.

http://www4.ncsu.edu/~sbhatia2/images/wiki/img5.jpg


Individual clusters can operate autonomously, in synchrony with others to form larger MSIMD patches, or, if all clusters are synchronised, as a single SIMD array. An additional advantage of the MSIMD architecture is its multiple vertical data paths, which give higher bandwidth between the SIMD array and the MIMD array. This can be seen below.

http://www4.ncsu.edu/~sbhatia2/images/wiki/img7.jpg



Multiple Feature Tracking
http://www4.ncsu.edu/~sbhatia2/images/wiki/img8.jpg


The input image undergoes some application-specific processing, such as enhancement. A data association process is then undertaken between the predicted feature positions and the measurements, and these measurements are integrated into the tracking filters to provide information about the movement of each feature through the image sequence. The number of features can range from tens to tens of thousands.

Each cluster within the machine is responsible for a region of the image, either as a one-to-one mapping between processing element and image pixel or with several pixels mapped to a single processing element. The validation and assignment calculations can be performed in parallel for all measurements on a conventional SIMD array for a particular track. However, each calculation is limited to a small area of the image plane, resulting in reduced utilization of the SIMD processor array. Using the MSIMD partitioning within the pyramid, it is possible to increase the utilization of the array by a factor linear in the number of clusters in the pyramid.

The number of these calculations operating in parallel across the MSIMD array also depends on the size of the validation gates and whether they fit fully within the boundaries of a cluster.
http://www4.ncsu.edu/~sbhatia2/images/wiki/img9.jpg


In the figure above, the gate falls across the cluster boundaries. In this case, four clusters may synchronize, operating as a larger SIMD patch, but the throughput of the calculations is reduced.
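The following sketch is not from the paper; it is an illustrative Python scheduling toy with an invented cluster size and gate layout. It shows the effect just described: gates that fit inside one cluster can be validated concurrently with gates in other clusters, while a gate that straddles a boundary forces the touched clusters to synchronise as one larger SIMD patch, reducing throughput.

# Map validation gates onto clusters and count how many can run per step.
CLUSTER = 16                                   # each cluster covers 16x16 pixels

def clusters_touched(gate):
    """Return the set of (row, col) cluster indices a validation gate covers."""
    x0, y0, x1, y1 = gate                      # gate as a bounding box in pixels
    return {(r, c)
            for r in range(y0 // CLUSTER, y1 // CLUSTER + 1)
            for c in range(x0 // CLUSTER, x1 // CLUSTER + 1)}

def schedule(gates):
    """Greedy pass: gates whose clusters are all free run in the same step."""
    steps, pending = [], list(gates)
    while pending:
        busy, this_step, rest = set(), [], []
        for g in pending:
            touched = clusters_touched(g)
            if touched & busy:
                rest.append(g)                 # conflicts, so wait for a later step
            else:
                busy |= touched
                this_step.append(g)
        steps.append(this_step)
        pending = rest
    return steps

if __name__ == "__main__":
    gates = [(2, 2, 10, 10),      # fits inside cluster (0, 0)
             (20, 4, 28, 12),     # fits inside cluster (0, 1)
             (12, 12, 20, 20)]    # straddles four clusters -> synchronisation
    for i, step in enumerate(schedule(gates)):
        print("step", i, "runs", len(step), "gate(s)")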

Calculations

Each cluster within the pyramid performs a separate validation operation in parallel. The MIMD processor within each cluster also updates the filter information for a separate track in parallel. The whole process can be divided into three phases:

Phase 1: prediction from the MIMD
Phase 2: use of predictions for validation (MSIMD)
Phase 3: resultant data association information used to update the filter (MIMD)

http://www4.ncsu.edu/~sbhatia2/images/wiki/img6.jpg

As a result, the MIMD processors and the MSIMD arrays operate concurrently. We can see from the table below that the processing times on the MIMD and the MSIMD arrays are efficiently overlapped with little wastage.
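The timing sketch below uses invented per-phase costs, not the paper's measurements; it simply illustrates how overlapping the phases hides most of the MIMD work once the pipeline is full: while the MSIMD array validates track i, the MIMD processors update the filter for track i-1 and predict track i+1.

# Rough pipeline-overlap model with hypothetical per-track phase times.
MIMD_PREDICT = 2      # time units (invented)
MSIMD_VALIDATE = 5
MIMD_UPDATE = 3

def sequential_time(n_tracks):
    """All three phases run back to back for every track."""
    return n_tracks * (MIMD_PREDICT + MSIMD_VALIDATE + MIMD_UPDATE)

def pipelined_time(n_tracks):
    """Per track, the MIMD work overlaps with MSIMD validation; each pipeline
    stage costs the slower of the two, plus fill and drain of the pipeline."""
    per_track = max(MIMD_PREDICT + MIMD_UPDATE, MSIMD_VALIDATE)
    return MIMD_PREDICT + n_tracks * per_track + MIMD_UPDATE

if __name__ == "__main__":
    for n in (1, 10, 100):
        print(n, "tracks: sequential", sequential_time(n),
              "vs pipelined", pipelined_time(n))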

Summary

The paper describes the use of a Pyramid Machine for tracking features through image sequences. The structure of the architecture integrates processor types appropriate to the differing parts of the overall process. The efficiency of the computations increases substantially with the MSIMD architecture. The data bandwidth required between the data in image form and the tracking filters is achieved by providing multiple data paths between the MSIMD and MIMD arrays.

Resource Optimization of a Parallel Computer for Multiple Vector Processing

Kai Hwang and Lionel M. Ni

The authors describe performance optimization of a shared-resource parallel computer containing multiple control units (CU's) sharing a resource pool of processing elements (PE's) and operating with multiple single-instruction-multiple-data (MSIMD) streams.

Application

The authors have proposed a queuing model for MSIMD machines used in multiple array processing.

In a single-instruction, multiple-data (SIMD) machine, all the processing elements (PE's) execute the same instruction over different data sets under the control of a control unit (CU). An MSIMD machine is composed of multiple CU's sharing a finite number of PE's through interconnection networks. The organizations of SIMD and MSIMD machines are compared in Fig. 1 below.

http://www4.ncsu.edu/~sbhatia2/images/wiki/img11.jpg


For the purpose of the analysis and simulation, the authors assume:

  1. A completely accessible switching network provides all possible interconnections between CU's and PE's.
  2. The subset of PE's allocated to a given CU varies in size according to the job.
  3. The time required to allocate PE's to a designated CU is considered part of the CU service time.

The analysis is based on modeling the multiprocessor structure as an M/M/K/l queueing family, where the two M's represent the exponential interarrival time and the exponential service time, respectively.

The Queueing Model for multiple-SIMD

The MSIMD organization can be modeled by a queueing network. The general characteristics of the model are:

  1. Servers (CU's) and shared-resource pool (PE's). Each of the CU's handles an independent instruction stream.
  2. Job arrival process and vector job characteristics. The arrival of vector jobs forms a Poisson process.
  3. Maximum queue capacity (l) and queueing discipline. For an infinite queue, every arriving vector job is allowed to wait in the queue until service can be provided. A first-come-first-served (FCFS) queueing discipline is assumed in the analysis.
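The following Monte Carlo sketch is not from the paper; it is a minimal Python illustration, with invented parameter values, of the shared-resource model just described: vector jobs arrive as a Poisson process, each demands a random number of PEs from a common pool, and is served FCFS by one of K CUs whenever a CU and enough PEs are free.

# Event-driven simulation of K CU servers sharing one PE pool, FCFS discipline.
import heapq
import random

def simulate(n_jobs=20000, arrival_rate=0.5, mean_service=4.0,
             n_cus=4, pool_size=128, max_demand=64, seed=1):
    """Return the mean queueing wait of vector jobs under strict FCFS."""
    random.seed(seed)
    free_cus, free_pes = n_cus, pool_size
    queue = []                      # FCFS queue of (arrival_time, pe_demand)
    running = []                    # min-heap of (finish_time, pe_demand)
    waits = []

    def start_ready(now):
        # Strict FCFS: only the head of the queue may start, and only if a
        # free CU and enough free PEs are available.
        nonlocal free_cus, free_pes
        while queue and free_cus > 0 and queue[0][1] <= free_pes:
            arrived, demand = queue.pop(0)
            free_cus -= 1
            free_pes -= demand
            waits.append(now - arrived)
            finish = now + random.expovariate(1.0 / mean_service)
            heapq.heappush(running, (finish, demand))

    def finish_until(now):
        # Release the CU and PEs of every job that completes before `now`.
        nonlocal free_cus, free_pes
        while running and running[0][0] <= now:
            done, demand = heapq.heappop(running)
            free_cus += 1
            free_pes += demand
            start_ready(done)

    t = 0.0
    for _ in range(n_jobs):
        t += random.expovariate(arrival_rate)             # Poisson arrivals
        finish_until(t)
        queue.append((t, random.randint(1, max_demand)))  # random PE-demand
        start_ready(t)
    finish_until(float("inf"))                            # drain remaining work
    return sum(waits) / len(waits)

if __name__ == "__main__":
    print("mean wait in queue (FCFS):", round(simulate(), 3))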

Results

The authors carried out a series of computer simulation experiments.

http://www4.ncsu.edu/~sbhatia2/images/wiki/img12.jpg

  1. When the mean PE demand is far from its optimal value relative to the PE pool size, the system utilization is degraded.
  2. The analytic results for the performance measures are shown as solid lines. These analytic results are very close to those obtained from the direct simulations, shown as dashed lines.
  3. With very large standard deviations (e.g., σ = 256 in Fig. 3), the enlarged differences between the analytic and simulation results are caused by the unbalanced demands for PE's.


Simulation of a MSIMD System with Resequencing

Helen D. Karatza

The author studies the performance of an MSIMD computer system in which resequencing of SIMD jobs after CU service ensures that jobs leave the CUs on a first-in-first-out basis.

http://www4.ncsu.edu/~sbhatia2/images/wiki/img10.jpg

The configuration of the queueing network model is shown above. The network consists of two units: the processing unit (CUs and PEs) and the I/O channel.

A fixed number N of vector jobs circulates alternately within the system between the processing unit and the I/O channel. The departure of jobs from the processing unit is assumed to be in the same order as their arrival. Therefore, jobs that complete out of order are forced to wait in the resequencing buffer until the overtaken jobs have finished.

The author further assumes that the number of PEs required by a vector job is a random variable, called the PE-demand. The PE-demand of a vector job is assumed to remain constant throughout the vector process, and vector jobs have independent and identically distributed PE-demands with distribution F. The author also considers two different queueing disciplines at the PU:

  1. First-Come-First-Served (FCFS)
  2. Least-PE-Demand-First-Served (LDFS)
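The small sketch below is illustrative only, with invented demands and a simplified resequencing rule; it shows the difference between the two disciplines and how resequencing makes jobs that overtake an earlier arrival wait before leaving the processing unit.

# FCFS vs LDFS job selection and a simple resequencing-buffer count.
def pick_fcfs(queue, free_pes):
    """Serve strictly in arrival order: the head of the queue or nothing."""
    return queue[0] if queue and queue[0][1] <= free_pes else None

def pick_ldfs(queue, free_pes):
    """Serve the queued job with the least PE-demand that fits."""
    fitting = [j for j in queue if j[1] <= free_pes]
    return min(fitting, key=lambda j: j[1]) if fitting else None

def resequencing_waits(finish_order):
    """Given job ids (0 = first arrival) in the order they finish service,
    count how many must wait in the resequencing buffer because an earlier
    arrival is still in service."""
    done, next_out, waited = set(), 0, 0
    for job in finish_order:
        done.add(job)
        if job != next_out:
            waited += 1                 # overtook an earlier arrival, so buffered
        while next_out in done:
            next_out += 1
    return waited

if __name__ == "__main__":
    queue = [(0, 100), (1, 20), (2, 30)]     # (arrival order, PE-demand)
    print("FCFS picks:", pick_fcfs(queue, free_pes=64))   # None: head does not fit
    print("LDFS picks:", pick_ldfs(queue, free_pes=64))   # the 20-PE job
    print("resequencing waits:", resequencing_waits([1, 2, 0]))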

Result of Simulation

This model contains two Control Units (CUs) which share a resource pool of 128 Processing Elements (PEs) and execute multiple SIMD jobs.

The simulation results show that LDFS significantly reduced the mean response time. The largest performance improvement due to the LDFS discipline was obtained for C = 2 at high N.

However, the advantages over FCFS were not completely exploited, due partly to resequencing delays of jobs and partly to subsequent queueing delays at the I/O unit. Because of the complexity of LDFS, the author concluded that FCFS is preferable.

References

1. Hwang, K.; Ni, L.M.; "Resource Optimization of a Parallel Computer for Multiple Vector Processing," IEEE Transactions on Computers, Volume C-29, Issue 9, Sept. 1980, pp. 831-836.

2. Karatza, H.D.; "Simulation of a MSIMD System with Resequencing," Proceedings of the First International Conference on Massively Parallel Computing Systems, 2-6 May 1994, pp. 339-342.

3. Ulm, D.; Baker, J.W.; "Simulating PRAM with a MSIMD Model (ASC)," Proceedings of the 1998 International Conference on Parallel Processing, 10-14 Aug. 1998, pp. 3-10.

4. Campbell, M.L.; Toborg, S.T.; Taylor, S.L.; "3D Wafer Stack Neurocomputing," Proceedings of the Fifth Annual IEEE International Conference on Wafer Scale Integration, 1993, pp. 67-74.

5. Kerbyson, D.J.; Atherton, T.J.; Nudd, G.R.; "An MSIMD Architecture for Feature Tracking," IEE Colloquium on Medium Grain Distributed Computing, 26 Mar 1992, pp. 7/1-7/6.