CSC/ECE 506 Spring 2012/1c cl
==Introduction==
In computing, MISD (multiple instruction, single data) is a type of parallel computing architecture in which multiple processing elements execute different instruction streams on a single data stream, with the data passed from one processing element to the next. The requirement that data be passed from one processing element to the next restricts this model to certain kinds of computation, which makes it hard to apply in general.
Pipeline architectures belong to this type, though a purist might say that the data is different after processing by each stage in the pipeline. Fault-tolerant computers executing the same instructions redundantly in order to detect and mask errors, in a manner known as task replication, may be considered to belong to this type.
Few actual examples of this class of parallel computer have ever existed. One example is the systolic array, such as the CMU iWarp [Borkar et al., 1990]. Another prominent example of MISD in computing is the set of Space Shuttle flight control computers. Research on MISD concentrates on a few conceivable uses: multiple frequency filters operating on a single signal stream, and multiple cryptography algorithms attempting to crack a single coded message.
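To make this data flow concrete, the following is a minimal Go sketch (the stages and sample values are purely illustrative and not taken from any real MISD machine) in which three goroutines act as processing elements: each runs its own instruction stream (a different filter), while the same sample stream is handed from one element to the next, much like the multiple-frequency-filter example above.
<pre>
package main

import "fmt"

// stage launches one processing element: it applies its own operation
// (its private "instruction stream") to every sample it receives and
// passes the result on to the next element.
func stage(op func(float64) float64, in <-chan float64) <-chan float64 {
	out := make(chan float64)
	go func() {
		defer close(out)
		for x := range in {
			out <- op(x) // different instructions, same data stream
		}
	}()
	return out
}

func main() {
	// The single data stream: a few signal samples.
	src := make(chan float64)
	go func() {
		defer close(src)
		for _, x := range []float64{1, 2, 3, 4} {
			src <- x
		}
	}()

	// Three processing elements, each running a different "filter".
	scaled := stage(func(x float64) float64 { return 2 * x }, src)
	offset := stage(func(x float64) float64 { return x + 0.5 }, scaled)
	clipped := stage(func(x float64) float64 {
		if x > 6 {
			return 6
		}
		return x
	}, offset)

	for y := range clipped {
		fmt.Println(y)
	}
}
</pre>
As the purist's objection above notes, the data leaving each stage is no longer identical to the data entering it; the sketch shares that property with real pipeline architectures.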
==Application==
===Network application with MISD architecture===
A number of important system services need strong scaling, where a fixed workload is executed more quickly as core counts increase. This is particularly true for data-intensive, network-oriented system services, because network interface card (NIC) line rates continue to increase while the individual cores that service them are not getting any faster. A replicated-work approach based on a multiple-instruction/single-data (MISD) execution model is a promising way to exploit otherwise untapped parallelization opportunities in multi-core systems. This approach is particularly important in the network stack, where it is needed to enable strongly scalable network services.
A multiple instruction/single data (MISD) execution model (i.e. replicated work) is used to provide fine-grained data consistency between cores and to eliminate expensive inter-core interactions. In this approach, shown in Figure 1(a), state is replicated into domains on separate cores, and requests that modify shared state are broadcast to every domain using a ring-buffer-based channel abstraction.
The replica that dequeues a request becomes the primary replica for that request and is responsible for fully processing it, including any updates with globally visible side effects (e.g. data delivery, packet reconstruction, acknowledgment generation). The other replicas only partially process each request, doing just enough work to keep their copies of the state consistent.
This approach is particularly appropriate for many network services because it parallelizes expensive per-byte processing (e.g. data copies, encryption/decryption, and error correction) and replicates per-header processing costs that are known to be small. It also retains the reduced locking and caching costs of systems like Barrelfish while adding the ability to perform fine-grained updates to logically shared state.
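The Go sketch below captures the shape of this model under simplifying assumptions: the request type, the primary-selection rule, and the channel sizes are invented for illustration, and plain buffered Go channels stand in for the ring-buffer channel abstraction. Shared state is replicated into per-core domains, every state-modifying request is broadcast to all of them, one replica fully processes it, and the rest only apply the state update.
<pre>
package main

import (
	"fmt"
	"sync"
)

// request is a hypothetical update to logically shared state.
type request struct {
	id    int
	delta int
}

// replica models one per-core domain holding its own copy of the state.
type replica struct {
	id    int
	state int          // private copy of the logically shared state
	in    chan request // stands in for the ring-buffer-based channel
}

func main() {
	const nReplicas = 4
	var wg sync.WaitGroup

	replicas := make([]*replica, nReplicas)
	for i := range replicas {
		replicas[i] = &replica{id: i, in: make(chan request, 16)}
	}

	// Every replica consumes every broadcast request. The primary
	// (chosen here by id modulo nReplicas, an illustrative policy)
	// does the full work with globally visible side effects; the
	// others only apply the state update.
	for _, r := range replicas {
		wg.Add(1)
		go func(r *replica) {
			defer wg.Done()
			for req := range r.in {
				r.state += req.delta // partial processing: keep state consistent
				if req.id%nReplicas == r.id {
					// primary replica: data delivery, acks, per-byte work, ...
					fmt.Printf("replica %d fully processed request %d\n", r.id, req.id)
				}
			}
		}(r)
	}

	// Requests that modify shared state are broadcast to every domain.
	for id := 0; id < 8; id++ {
		for _, r := range replicas {
			r.in <- request{id: id, delta: 1}
		}
	}
	for _, r := range replicas {
		close(r.in)
	}
	wg.Wait()
}
</pre>
Note that no replica ever locks another replica's state; consistency comes entirely from every domain seeing the same sequence of broadcast updates.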
To examine the potential of this approach compared to lock-based and atomic-instruction-based MIMD approaches, a simple synthetic test was constructed. In this test, processing each request requires some amount of work that can be done by one core without synchronization or replication (i.e. parallelizable work) and some work that must be synchronized or replicated. Specifically, the parallelizable work is a set of memory updates done on per-core memory, while the synchronized work is the set of memory updates that must be performed (a) on every replica, (b) while holding a shared lock, or (c) by using atomic memory update instructions. Figure 1(b) shows the potential performance advantages of this approach using a 10:1 ratio of parallelizable to replicated work, a ratio chosen based on studies of production TCP/IP stacks. As can be seen, the lock-based model scales remarkably poorly, and the atomic-instruction approach is only slightly better.
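A rough Go analogue of such a synthetic test is sketched below. This is not the authors' benchmark: the core count, work amounts, and counters are invented to mirror the 10:1 ratio described above. Each simulated request does ten updates to per-core memory and then one "synchronized" update, performed either on the core's own replica, under a shared lock, or with an atomic instruction.
<pre>
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

const nCores = 4

// Per-core private memory touched by the parallelizable part of a request.
// (A real benchmark would pad these to cache-line size to avoid false sharing.)
var private [nCores][8]int64

var replicated [nCores]int64 // (a) each core applies the update to its own replica

var (
	mu     sync.Mutex
	locked int64 // (b) one shared counter guarded by a lock
)

var shared int64 // (c) one shared counter updated with atomic instructions

// handle processes one request on core c: ten parallelizable updates to
// per-core memory, then one synchronized update in the chosen style.
func handle(c int, style string) {
	for i := 0; i < 10; i++ { // 10:1 parallelizable-to-synchronized ratio
		private[c][i%8]++
	}
	switch style {
	case "replicated":
		replicated[c]++ // the broadcast update, applied locally; no contention
	case "lock":
		mu.Lock()
		locked++
		mu.Unlock()
	case "atomic":
		atomic.AddInt64(&shared, 1)
	}
}

func main() {
	for _, style := range []string{"replicated", "lock", "atomic"} {
		start := time.Now()
		var wg sync.WaitGroup
		for c := 0; c < nCores; c++ {
			wg.Add(1)
			go func(c int) {
				defer wg.Done()
				for r := 0; r < 1000000; r++ {
					handle(c, style)
				}
			}(c)
		}
		wg.Wait()
		fmt.Println(style, time.Since(start))
	}
}
</pre>
In this kind of test the lock and atomic variants pay a cross-core cache-line transfer on every request, while the replicated variant pays only a local update, which is the effect Figure 1(b) illustrates.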
 
Figure 1(a): Replicated-work parallelization architecture
Figure 1(b): Comparison of lock-based, atomic-instruction, and replicated-work parallelization approaches
Another class of applications requires searching for multiple patterns in a large data stream for which there is no preprocessed index to rely on for efficient lookups. A multiple instruction stream, single data stream (MISD) architecture based on a recursive divide-and-conquer approach to pattern matching is well suited to searching an online data stream using queries expressed in languages that support this kind of functionality.
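As a loose illustration of the MISD part of this idea (the recursive splitting of the query set is omitted, and the patterns and stream below are invented), the Go sketch runs one matcher goroutine per query: each matcher executes its own instruction stream while all of them scan the same single data stream.
<pre>
package main

import (
	"fmt"
	"strings"
	"sync"
)

// matcher runs one query (its own instruction stream) against the shared
// data stream it receives on its channel.
func matcher(pattern string, in <-chan string, out chan<- string, wg *sync.WaitGroup) {
	defer wg.Done()
	for chunk := range in {
		if strings.Contains(chunk, pattern) {
			out <- fmt.Sprintf("pattern %q found in %q", pattern, chunk)
		}
	}
}

func main() {
	patterns := []string{"GET", "POST", "DELETE"}                  // hypothetical queries
	stream := []string{"GET /index", "PUT /x", "POST /form", "GET /img"} // single data stream

	var wg sync.WaitGroup
	results := make(chan string, len(stream)*len(patterns))
	inputs := make([]chan string, len(patterns))

	// One matcher per pattern; all of them see the same data stream.
	for i, p := range patterns {
		inputs[i] = make(chan string, len(stream))
		wg.Add(1)
		go matcher(p, inputs[i], results, &wg)
	}

	// Broadcast every chunk of the single stream to every matcher.
	for _, chunk := range stream {
		for _, in := range inputs {
			in <- chunk
		}
	}
	for _, in := range inputs {
		close(in)
	}
	wg.Wait()
	close(results)

	for r := range results {
		fmt.Println(r)
	}
}
</pre>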
===iWarp project===
====Introduction====
The iWarp project was started in 1988 to investigate the issues involved in building and using high-performance computer systems with powerful communication support. The project led to the construction of the iWarp machines, jointly developed by Carnegie Mellon University and Intel Corporation. The basic building block of the iWarp system is a full-custom VLSI component integrating an LIW microprocessor, a network interface, and a switching node on a single 1.2 cm x 1.2 cm silicon chip.
====iWarp status====
Intel Corporation announced the iWarp systems as a product in 1989 and has built iWarp systems with over 1500 nodes since then. The first iWarp prototype system was delivered to Carnegie Mellon in the summer of 1990, and in the fall CMU received the first 64-cell systems. All three full-speed production systems were delivered in 1991. With the creation of the Intel Supercomputing Systems Division in the summer of 1992, the iWarp know-how was merged into the iPSC product line. Intel kept iWarp as a product but stopped actively marketing it. As of the start of 1995, all three iWarp systems at CMU are still in daily use. Surprisingly, there are a few applications (e.g. in real-time vision) for which iWarp is still the best machine, three years after it was delivered. The high-speed static memory and the high-performance, low-latency communication system make iWarp a well-suited target for research efforts and many "proof of concept" applications.
====Technical data of iWarp systems====
Computation agent of a single cell:
* Processor type: 32-bit RISC with 96-bit LIW
* Clock speed: 20 MHz
* Integer performance: 20 MIPS
* Floating-point support: 32-bit IEEE and 64-bit IEEE
* Floating-point performance: 20 MFlop/s (32-bit) and 10 MFlop/s (64-bit)
* Network access through gates in the register file: up to 4 x 40 MB/s

Communication agent of a single cell:
* Switching performance in the node: 160 MB/s to 320 MB/s
* Links: 4 x 40 MB/s, full duplex
* Logical/virtual channels: 20, freely configurable across all 4 directions and 2 pools
* Clock speed: 40 MHz

Memory system of a single cell:
* Memory configurations: 512 KB to 4 MB static RAM and/or 16 MB dynamic RAM
* Memory access latency: 100 ns (static), 200 ns (dynamic)
* Memory access bandwidth: 160 MB/s
* Direct memory access agents for communication (spools): 8

Parallel system configurations:
* 4 cells up to 1024 cells in n x m torus configurations
* Typical system: 64 cells, 8x8 torus, 1.2 GFlop/s peak
* Sustained performance (64 cells):
** Dense matrix multiply: 1150 MFlop/s
** Sparse matrix multiply: 400 MFlop/s
** FFT: 700 MFlop/s
Images of the iWarp hardware (captions only):
* The iWarp chip (VLSI component)
* The iWarp quad-cell board (4 processors + memory + communication)
* 64 iWarp cells mounted in a 19" rack (1.2 GFlop/s peak)
* iWarp system cabinet with up to 256 cells
