CSC/ECE 506 Spring 2012/1c cl

Introduction

In computing, MISD (multiple instruction, single data) is a type of parallel computing architecture in which multiple processing elements execute different instruction streams on the same data, which is passed from one processing element to the next. Because data must flow from one processing element to the next, the model is restricted to certain kinds of computation and is hard to apply in general. Pipeline architectures belong to this type, though a purist might argue that the data is different after processing by each stage of the pipeline. Fault-tolerant computers that execute the same instructions redundantly in order to detect and mask errors, a technique known as task replication, may also be considered to belong to this type.
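As a minimal illustration of the pipeline reading of MISD (a sketch in Python; the stage functions are purely illustrative and not taken from any real machine), a single data stream can be handed from one processing element to the next, with each element executing its own instruction:

def scale(x):    return x * 2          # PE 1: runs its own instruction stream
def offset(x):   return x + 3          # PE 2
def saturate(x): return min(x, 100)    # PE 3

def misd_pipeline(data_stream, stages=(scale, offset, saturate)):
    """Hand each datum from one processing element to the next."""
    for x in data_stream:
        for stage in stages:
            x = stage(x)               # same datum, a different instruction at each PE
        yield x

print(list(misd_pipeline([1, 40, 60])))  # -> [5, 83, 100]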

Examples of parallel computers with MISD architecture

Few actual examples of this class of parallel computer have ever existed. One example is the systolic array, such as the CMU iWarp [Borkar et al., 1990]. Another prominent example of MISD in computing is the set of Space Shuttle flight control computers. Research on MISD concentrates on a few conceivable uses: multiple frequency filters operating on a single signal stream, and multiple cryptography algorithms attempting to crack a single coded message.
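To make the first of these uses concrete, the toy Python sketch below (hypothetical; the two filters are arbitrary stand-ins) applies several instruction streams, one per filter, to the same signal stream:

# One shared data stream, several "instruction streams" (filters) over it.
signal = [0.0, 1.0, 4.0, 9.0, 16.0]

filters = {
    "moving_average": lambda prev, cur: (prev + cur) / 2,   # crude low-pass filter
    "difference":     lambda prev, cur: cur - prev,         # crude high-pass filter
}

outputs = {name: [] for name in filters}
for prev, cur in zip(signal, signal[1:]):    # single data stream
    for name, f in filters.items():          # multiple instruction streams
        outputs[name].append(f(prev, cur))

print(outputs)
# {'moving_average': [0.5, 2.5, 6.5, 12.5], 'difference': [1.0, 3.0, 5.0, 7.0]}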


Application

In the network stack

A number of important system services need strong scaling, where a fixed workload is executed more quickly as core counts increase. This is particularly true for data-intensive, network-oriented system services, because network interface card (NIC) line rates continue to increase while the individual cores that service them are not getting any faster. A replicated-work approach based on a multiple-instruction/single-data (MISD) execution model is a promising way to exploit parallelization opportunities that multi-core systems otherwise leave unused. The approach is particularly important in the network stack, where it is needed to enable strongly scalable network services.

A multiple instruction/single data (MISD) execution model (i.e. replicated work) is used to provide fine-grained data consistency between cores and to eliminate expensive inter-core interactions. In this approach, shown in Figure 1(a), state is replicated into domains on separate cores, and requests that modify shared state are broadcast to every domain using a ring-buffer-based channel abstraction. The first replica to dequeue a request becomes the primary replica for that request and is responsible for fully processing it, including any updates with globally visible side effects (e.g. data delivery, packet reconstruction, acknowledgment generation). The other replicas that process the request only partially process it, just enough to keep their state consistent. This approach is particularly appropriate for many network services because it parallelizes the expensive per-byte processing (e.g. data copies, encryption/decryption, and error correction) while replicating only the per-header processing costs, which are known to be small. It also retains the reduced locking and caching costs of systems like Barrelfish while adding the ability to perform fine-grained updates to logically shared state.
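The Python sketch below illustrates this replicated-work model under stated assumptions: threads stand in for cores, a queue.Queue stands in for the ring-buffer channel, and the names Request, Replica and broadcast are hypothetical rather than taken from any particular implementation. Every replica applies the cheap local state update; the first replica to claim a request becomes its primary and is the only one that would perform the expensive, globally visible work:

import itertools
import queue
import threading

class Request:
    """A state-modifying request that is broadcast to every replica."""
    _ids = itertools.count()
    def __init__(self, payload):
        self.id = next(self._ids)
        self.payload = payload
        self._claim = threading.Lock()     # first replica to acquire it becomes primary
    def claim_primary(self):
        return self._claim.acquire(blocking=False)

class Replica(threading.Thread):
    """One per-core domain holding a private copy of the logically shared state."""
    def __init__(self, core_id):
        super().__init__(daemon=True)
        self.core_id = core_id
        self.channel = queue.Queue()       # stands in for the ring-buffer channel
        self.state = {}                    # replicated per-core state

    def run(self):
        while True:
            req = self.channel.get()
            if req is None:                # shutdown sentinel
                break
            # Every replica performs the cheap per-header state update ...
            self.state[req.id] = req.payload
            # ... but only the primary would do the expensive per-byte work
            # (data delivery, copies, encryption/decryption, ack generation).
            if req.claim_primary():
                print(f"core {self.core_id} is primary for request {req.id}")

def broadcast(replicas, req):
    """Enqueue the same request on every replica's channel (one datum, many streams)."""
    for r in replicas:
        r.channel.put(req)

if __name__ == "__main__":
    replicas = [Replica(i) for i in range(4)]
    for r in replicas:
        r.start()
    for payload in (b"SYN", b"ACK", b"DATA"):
        broadcast(replicas, Request(payload))
    for r in replicas:
        r.channel.put(None)                # shut the replicas down
    for r in replicas:
        r.join()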

To examine the potential of this approach compared to lock-based and atomic-instruction-based MIMD approaches, a simple synthetic test was constructed. In this test, processing each request requires some work that can be done by one core without synchronization or replication (i.e. parallelizable work) and some work that must be synchronized or replicated. Specifically, the parallelizable work is a set of memory updates to per-core memory, while the synchronized work is the set of memory updates that must be performed either (a) on every replica, (b) while holding a shared lock, or (c) using atomic memory-update instructions. Figure 1(b) shows the potential performance advantage of this approach using a 10:1 ratio of parallelizable to replicated work, a ratio chosen based on studies of production TCP/IP stacks. As can be seen, the lock-based model scales remarkably poorly and the atomic-instruction approach is only slightly better.
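The structural sketch below mirrors that synthetic test. It is not a faithful benchmark (Python's global interpreter lock would mask the scaling effects being measured, and the atomic-instruction variant is omitted because Python has no atomic integer primitive); it only shows how each request's work splits into a parallelizable part and a part that is either performed under a shared lock (MIMD) or replayed on the local replica (MISD), using the 10:1 ratio quoted above. All names are illustrative:

import threading

RATIO = 10                        # parallelizable : synchronized work units (from the text)
per_core_scratch = {}             # per-core memory, updated without synchronization
shared_state = {"counter": 0}     # logically shared state for the lock-based variant
shared_lock = threading.Lock()

def handle_request_lock_based(core_id, req_id):
    """MIMD baseline: per-core work, then one update made under a shared lock."""
    for i in range(RATIO):                        # parallelizable work
        per_core_scratch[(core_id, req_id, i)] = i
    with shared_lock:                             # synchronized work; serializes the cores
        shared_state["counter"] += 1

def handle_request_replicated(core_id, req_id, my_replica):
    """MISD variant: the request was broadcast, so every core applies the 'shared'
    update to its own replica and no lock or atomic instruction is needed."""
    for i in range(RATIO):                        # parallelizable work
        per_core_scratch[(core_id, req_id, i)] = i
    my_replica["counter"] = my_replica.get("counter", 0) + 1   # local, unsynchronized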




iWarp project

The iWarp project was started in 1988 to investigate the issues involved in building and using high-performance computer systems with powerful communication support. The project led to the construction of the iWarp machines, jointly developed by Carnegie Mellon University and Intel Corporation. The basic building block of the iWarp system is a full-custom VLSI component that integrates a LIW microprocessor, a network interface and a switching node into a single 1.2 cm x 1.2 cm silicon chip.

Technical data of iWarp systems

Computation agent of single cell:

• Processor type: 32bit RISC with 96bit LIW

• Clock speed: 20MHz

• Integer performance: 20 Mips

• Floating point support: 32bit IEEE and 64bit IEEE

• Floating point performance: 20 MFlops and 10 MFlops (32bit and 64bit, respectively)

• Network access through gates in register file up to 4 x 40 MB/s

Communication agent of single cell:

• Switching performance in node: 160 MB/s to 320 MB/s

• Links: 4 x 40MB/s, full duplex

• Logical/virtual channels: 20, configurable freely in all 4 directions and 2 pools.

• Clock speed: 40MHz

Memory System of single cell:

• Memory configurations:

o 512 kBytes to 4 MBytes static RAM and/or 16 MBytes dynamic RAM.

• Memory access latency 100ns static, 200ns dynamic

• Memory access bandwidth 160 MB/s

• Direct memory access agents for communication (spools): 8

Parallel system configurations:

• 4 cells up to 1024 cells in n x m torus configurations.

• Typical system: 64 cells, 8x8 torus, 1.2 GFlop/s peak (see the check after this list).

• Sustained performance (64 cells):

o Dense matrix multiply: 1150 MFlop/s

o Sparse matrix multiply: 400 MFlop/s

o FFT: 700 MFlop/s
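These figures are internally consistent; a quick check in Python, using only the numbers quoted in the list above:

cells = 64
per_cell_mflops = 20                       # 32bit peak per cell, from the list above
peak = cells * per_cell_mflops             # 1280 MFlop/s, i.e. the quoted ~1.2 GFlop/s
for name, sustained in [("dense matrix multiply", 1150),
                        ("sparse matrix multiply", 400),
                        ("FFT", 700)]:
    print(f"{name}: {sustained} MFlop/s, about {sustained / peak:.0%} of peak")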

iWarp status

Intel Corporation announced the iWarp system as a product in 1989 and has built iWarp systems with over 1500 nodes since then. The first iWarp prototype system was delivered to Carnegie Mellon in Summer 1990, and in the Fall CMU received the first 64-cell systems. All three full-speed production systems were delivered in 1991. With the creation of the Intel Supercomputing Systems Division in Summer 1992, the iWarp know-how was merged into the iPSC product line. Intel kept iWarp as a product but stopped actively marketing it. As of the start of 1995, all three iWarp systems at CMU are still in daily use. Surprisingly, there are a few applications (e.g. in real-time vision) for which iWarp is still the best machine, three years after it was delivered. The high-speed static memory and the high-performance, low-latency communication system make iWarp a well-suited target for research efforts and many "proof of concept" applications.