Introduction

This article addresses the following questions:

Have there been any changes in the organization of address spaces in the last 10 years?
Are the interconnection structures different in new computers now than they were 10 years ago?
What is the size and capacity of current SMPs?
How have supercomputers evolved since the Cray T3E?

Shared Address Space

In parallel computing, a shared address space is the range of memory addresses that multiple processors can access and share. "Shared-Memory Multiprocessors" are a class of parallel machines that use a shared address space for parallelisation.
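As a rough illustration of what sharing an address space means, the sketch below uses threads within one Python process as stand-ins for processors. This is an assumption made for illustration only; the function name run_workers and the use of a lock are hypothetical choices, not something the article prescribes.

```python
import threading


def run_workers(num_workers: int, increments: int) -> int:
    """Let several threads update one counter that lives in shared memory."""
    shared = {"counter": 0}   # a single object, visible to every thread
    lock = threading.Lock()   # serialises each read-modify-write update

    def worker() -> None:
        for _ in range(increments):
            with lock:
                shared["counter"] += 1

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared["counter"]


if __name__ == "__main__":
    print(run_workers(4, 1000))
```

Every worker reads and writes the same location, so no data is copied between workers; the lock is what keeps concurrent updates from being lost.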

Trends in Organisation of address spaces

An address space is typically used by a single processor to store data and instructions. As the need to process data and instructions faster grew, processors became more powerful and the available memory space became better organised. Soon, multiple processors sat on the same chip, with either dedicated or shared memories. To utilise memory better, two memory access schemes evolved: real memory and virtual memory. With real memory, programs address physical memory locations directly, so only the installed address space can be used. Virtual memory has also evolved over the years. Initially, memory access was done at two levels, one being RAM and the other either the hard disk or a tape drive, using a technique called overlaying. Overlaying replaces a block in RAM that is no longer needed with the block required for the current execution.

An improvement on the overlaying technique is what is today called paging. An intermediate technique called segmentation also existed, but it had major drawbacks, especially when segments were too large to handle.

All current virtual memory schemes are based on paging. Paging breaks data down into "pages" (4KB pages are typical, though page sizes of 4MB also exist: http://www.x86.org/articles/4mpages/4moverview.htm) and loads them into RAM from lower levels of the memory hierarchy as and when required. A page table holds the virtual-to-physical mappings, and a "Translation Lookaside Buffer" (TLB) caches the most recently used mappings so that most translations do not need a page table lookup. Using these mappings, the required data is transferred into main memory for access.
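The translation step can be sketched as follows, assuming 4KB pages and the usual arrangement in which a page table holds the mappings while the TLB caches recently used ones. The names translate, split_address and PageFault, and the dictionaries standing in for the page table and TLB, are hypothetical and chosen only for this illustration.

```python
PAGE_SIZE = 4096   # 4KB pages, the typical size mentioned above
OFFSET_BITS = 12   # log2(4096): low bits that select a byte within a page


class PageFault(Exception):
    """Raised when the requested page is not resident in main memory."""


def split_address(vaddr: int) -> tuple[int, int]:
    """Split a virtual address into (virtual page number, offset in page)."""
    return vaddr >> OFFSET_BITS, vaddr & (PAGE_SIZE - 1)


def translate(vaddr: int, page_table: dict[int, int], tlb: dict[int, int]) -> int:
    """Translate a virtual address to a physical one, consulting the TLB first."""
    vpn, offset = split_address(vaddr)
    if vpn in tlb:                 # TLB hit: no page table lookup needed
        frame = tlb[vpn]
    elif vpn in page_table:        # TLB miss: look up the table, then cache it
        frame = page_table[vpn]
        tlb[vpn] = frame
    else:                          # page must first be loaded from disk
        raise PageFault(f"virtual page {vpn} is not in main memory")
    return (frame << OFFSET_BITS) | offset


if __name__ == "__main__":
    table = {0: 5, 1: 9}           # virtual page -> physical frame
    cache = {}
    print(hex(translate(0x1ABC, table, cache)))
```

A real system would handle the page fault by loading the page from disk and retrying; the sketch only signals it.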

By combining main memory with lower levels of the memory hierarchy such as hard drives and flash drives, virtual memory gives the user the impression of having virtually "infinite" memory addresses, even though the primary memory is limited to a particular size.

The Pentium can access about 4GB of physical memory and about 64TB of virtual memory. Physical access has thus become virtual access: the user perceives the accessible memory as unlimited, because the virtual memory mechanism uses the lower levels of the memory hierarchy to extend the range of memory addresses.

If an application runs on a 32-bit OS, the address space available to a process is at most about 4GB (2^32 addresses). The physical memory in a typical desktop, however, might be only about 1GB of RAM. To make the full range of addresses usable, virtual memory is put to use: the remaining 3GB of address space is "borrowed" from one or more lower levels of the memory hierarchy, such as the hard disk.
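The arithmetic behind these figures can be checked with a short sketch. It assumes binary units (1GB taken as 2^30 bytes) and 4KB pages; the function names are hypothetical.

```python
GIB = 2 ** 30        # 1GB taken as 2^30 bytes (assumption: binary units)
PAGE_SIZE = 4096     # assumed 4KB pages


def address_space_bytes(address_bits: int) -> int:
    """Number of distinct byte addresses reachable with the given pointer width."""
    return 2 ** address_bits


def backing_store_bytes(address_bits: int, ram_bytes: int) -> int:
    """Address space that cannot be resident in RAM and must live on disk."""
    return max(address_space_bytes(address_bits) - ram_bytes, 0)


if __name__ == "__main__":
    total = address_space_bytes(32)
    print(total // GIB, "GB of address space")
    print(backing_store_bytes(32, 1 * GIB) // GIB, "GB backed by disk")
    print(total // PAGE_SIZE, "pages in total")
```

With 32-bit addresses and 1GB of RAM this gives a 4GB address space, of which 3GB must be backed by disk, spread over 1,048,576 pages.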

Interconnection structure

A single processor and a single memory have probably the simplest connection between them: a single dedicated line over which data and code pass between the two. Now imagine multiple processors and a single memory, or multiple processors and multiple memory chips, or a single processor and multiple memory chips. In each of these cases, a sensible connection structure is needed so that the processor(s) and the memory(ies) can "talk" to each other effectively, giving quick access and execution. The larger the number of processors or memory chips, the more difficult it is to design such a connection structure, considering cost, power, efficiency, etc. All these design considerations form part of what is called the "interconnection structure".

Interconnection structures for accessing memory have changed over the last 10 years. There are several kinds of structures available for accessing memory.

Bus-based
Cross-bar
Extended
Distributed Memory Access
- Hyper-cube interconnection
- Mesh interconnection

We will discuss a few here. A bus-type interconnect is a shared system interconnect in which multiple processors are connected to one side of a bus, with all the memories connected to the other side. Such an arrangement keeps the number of interconnects very small: in fact, there is just the bus. Such interconnects are typically high-speed connections limited to about 50-66MHz due to transmission line effects. For example, the Pentium Pro can reach 528 MB/sec with a 64-bit bus at 66 MHz. The cost of such an interconnect is of the order of O(p) switches, where 'p' is the number of processors.
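The Pentium Pro figure follows from multiplying the bus width by the clock rate. The sketch below assumes one transfer per clock cycle and 1 MB = 10^6 bytes; the function name is hypothetical.

```python
def peak_bus_bandwidth_mb_per_s(bus_width_bits: int, clock_mhz: int) -> int:
    """Peak bandwidth in MB/sec, assuming one bus transfer per clock cycle."""
    bytes_per_transfer = bus_width_bits // 8
    # (bytes per transfer) x (millions of transfers per second) = MB/sec
    return bytes_per_transfer * clock_mhz


if __name__ == "__main__":
    print(peak_bus_bandwidth_mb_per_s(64, 66))  # 8 bytes x 66 MHz
```

This is a peak figure shared by every processor on the bus, which is why a bus becomes a bottleneck as processors are added.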

A cross-bar is a typical way of giving each processor access to each of the available memory chips, and vice versa. Such an interconnect structure becomes difficult to implement as the number of processors goes up. The cost of such a system is of the order of O(p × m) switches, where 'p' is the number of processors and 'm' is the number of memories.

Crossbar memory architecture was originally used on mainframe computers to increase memory bandwidth in multi-processor systems, but the technology was later brought down to server and workstation platforms. Crossbar switches have also been designed to link entire computer systems.

A memory crossbar can eliminate bottlenecks associated with existing memory architecture, as it replaces the conventional system bus architecture. Instead of sharing a bus, communication between the processor and the memory uses dedicated connections.

A compromise between the above two types is the multi-stage interconnection network. Omega networks are a typical example; hypercube and mesh connections are other intermediate topologies. In such an interconnect, the processors are on one side and the memory and I/O on the other, with many stages of switches in between that pass the information between the processors and the memories. Such an interconnect structure has a cost of the order of O(p log p) switches, where 'p' is the number of processors.

Such a multi-stage interconnection network performs better than a bus-based interconnection structure, while costing less than a cross-bar interconnect.
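To give a feel for how the three cost orders compare, the sketch below counts switches for each structure. The constant factors are assumptions: the bus is counted as one connection point per processor, and the multi-stage network is modelled as an omega network built from 2x2 switches, which requires the processor count to be a power of two. The function names are hypothetical.

```python
def bus_cost(p: int) -> int:
    """O(p): one connection point per processor on the shared bus."""
    return p


def crossbar_cost(p: int, m: int) -> int:
    """O(p x m): one switch at every processor/memory crossing."""
    return p * m


def multistage_cost(p: int) -> int:
    """O(p log p): log2(p) stages of p/2 two-by-two switches (omega network)."""
    if p < 2 or p & (p - 1) != 0:
        raise ValueError("this sketch assumes p is a power of two, p >= 2")
    stages = p.bit_length() - 1      # exact log2 for a power of two
    return (p // 2) * stages


if __name__ == "__main__":
    for p in (8, 64, 1024):
        print(p, bus_cost(p), multistage_cost(p), crossbar_cost(p, p))
```

For 64 processors and 64 memories this gives 64, 192 and 4096 switches respectively, which shows why the multi-stage network sits between the other two.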

Current SMPs (Symmetric Multiprocessing)

A computer system containing two or more processors in the same box, with shared memory but just one OS running on them, is termed a symmetric multiprocessor system. The downtime of such a system depends on its weakest link, i.e. the individual processors; if one processor is down, the whole system is said to be down.

SMPs rely strongly on a well-designed operating system to take care of load balancing between the processors, since the assignment of tasks is done solely by the OS. The better the OS, the better the load balancing and the more efficient the operation of the SMP.

One big advantage of an SMP system is its scalability: additional processors can be added as needed. Having said that, the processors themselves are also the biggest disadvantage of such a system. If one of the many processors shuts down for some reason, the OS has no way of knowing about the failure; it continues to assign tasks to the "dead" processor, which slows down execution. This is where the current trend of strengthening the software to handle hardware failures comes in. With proper 'traps' to detect such hardware failures, the system can be put into a self-correcting mode, in which the OS isolates the inactive processors and shares the load among the remaining active processors, thus hiding the internal defects from the outside user.
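The self-correcting behaviour described above might be sketched as a scheduler that checks processor health before assigning work. This is only an illustration under assumptions: the is_alive callback stands in for whatever hardware 'trap' detects a failure, and round-robin is a hypothetical choice of load-balancing policy.

```python
from typing import Callable


def assign_tasks(
    tasks: list[str],
    processors: list[int],
    is_alive: Callable[[int], bool],
) -> dict[int, list[str]]:
    """Spread tasks round-robin over the processors that report as alive."""
    healthy = [cpu for cpu in processors if is_alive(cpu)]  # isolate dead CPUs
    if not healthy:
        raise RuntimeError("no active processors left to run tasks")
    assignment: dict[int, list[str]] = {cpu: [] for cpu in healthy}
    for i, task in enumerate(tasks):
        assignment[healthy[i % len(healthy)]].append(task)
    return assignment


if __name__ == "__main__":
    # Processor 1 is treated as failed; its share goes to processors 0 and 2.
    print(assign_tasks(["t0", "t1", "t2", "t3"], [0, 1, 2], lambda cpu: cpu != 1))
```

The caller sees all tasks completed by the remaining processors and never learns that one processor was skipped.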

Earlier computer motherboards had two separate sockets where two CPUs could sit, sharing a single main memory and forming an SMP system. Most of today's laptops and desktops, however, are based on a "multi-core" SMP system, where multiple processors reside in a single package. Each of these processors can implement optimisations such as pipelining and multi-threading, as any normal uni-processor would. What motivated the development of multi-core processors? Three "walls" have contributed:
1) The Memory Wall - the increasing gap between processor and memory speeds, which is pushing for larger cache sizes.
2) The ILP (Instruction Level Parallelism) Wall - the increasing difficulty of finding enough parallelism in the instruction stream of a process to keep cores busy.
3) The Power Wall - because of the above two 'walls', performance gains have diminished over time, and the power consumed in pursuing them could no longer be justified.

Current SMP systems have clock speeds exceeding 3.7 GHz, with bus speeds of about 1333MHz and an 8MB cache. Caches in the range of 24MB also exist, but clock speeds are kept lower at such cache sizes.

Supercomputer evolution since Cray T3E

The Cray T3E was the first supercomputer able to sustain 1 TFLOPS in a real-world application. It was designed and developed by Cray and launched in 1995. It is a "Massively Parallel Processing" (MPP) system.

The next models of Cray used the Direct Connected Processor (DCP) Architecture in which the processor is fused into the interconnect structure. This arrangement eliminated memory contention and also optimised message-passing applications by directly linking processors to each other through a high-performance interconnect fabric.

Then came a special category called clustered supercomputing, in which clusters of MIMD multiprocessors are connected together, with each processor in a cluster being a SIMD processor.

Current supercomputers can go beyond 300 TFLOPS, whereas the Cray T3E was the first to sustain 1 TFLOPS in a real application. The measure of computational speed has gone up to TFLOPS (Tera Floating Point Operations Per Second) and is moving towards PFLOPS (Peta Floating Point Operations Per Second).

Most modern supercomputers are highly tuned computer clusters using commodity processors with custom interconnects. The majority of supercomputers run some flavour of Unix or Linux, and Linux has predominated since 2004.

In the current scenario, special-purpose supercomputers are available to solve specific problems: astrophysics computation, codebreaking, molecular dynamics, Deep Blue for the game of chess, and so on.


CSC/ECE 506 Fall 2007/wiki1 7 2281
