CSC/ECE 506 Fall 2007/wiki1 1 11
Sections 1.1 and 1.1.2: Update performance trends in multiprocessors.
1.1 Why parallel architecture?
The role of a computer architect is to maximize productivity and performance. Here, productivity means programmability and reduced development time, and performance means reasonable throughput within given technology and cost limitations.
Parallelism is now the paradigm in all aspects of computing. The number of processors is a new dimension of the design space, and parallelism helps achieve performance at acceptable cost. Current systems are built on parallel concepts and designs (e.g., desktop systems are multi-threaded).
Microprocessor: A typical instruction in a processor like the 8088 took 15 clock cycles to execute. Because of the design of the multiplier, it took approximately 80 cycles just to do one 16-bit multiplication on the 8088. In a pipelined architecture, multiple instructions can be in various stages of execution simultaneously, so it looks as if one instruction completes every clock cycle. Modern processors have multiple instruction decoders, each with its own pipeline. This allows multiple instruction streams, so more than one instruction can complete during each clock cycle.
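The cycle counts described above can be compared in a small sketch. It assumes an idealized pipeline with no stalls or hazards, where every stage takes one clock cycle. The function names, the stage count of 15, and the issue width of 2 are illustrative assumptions, not properties of the 8088 or of any particular processor.

```python
def unpipelined_cycles(n_instructions, cycles_per_instruction):
    """Each instruction runs to completion before the next one starts."""
    return n_instructions * cycles_per_instruction


def pipelined_cycles(n_instructions, n_stages):
    """Ideal pipeline: fill it once, then one instruction completes per cycle."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)


def superscalar_cycles(n_instructions, n_stages, issue_width):
    """Ideal multi-issue pipeline: issue_width instructions enter per cycle."""
    if n_instructions == 0:
        return 0
    groups = -(-n_instructions // issue_width)  # ceiling division
    return n_stages + (groups - 1)


if __name__ == "__main__":
    n = 1000
    print("unpipelined:", unpipelined_cycles(n, 15))
    print("pipelined:  ", pipelined_cycles(n, 15))
    print("2-wide:     ", superscalar_cycles(n, 15, 2))
```

Under these assumptions the pipelined count approaches one cycle per instruction as the instruction count grows, and the multi-issue count drops below that, which is the effect the paragraph describes.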
Supercomputer: In the early and mid-1980s, machines with a modest number of vector processors working in parallel (typically 4 to 16) became the standard. In the later 1980s and 1990s, attention turned from vector processors to massively parallel processing systems with thousands of "ordinary," off-the-shelf CPUs. Today, parallel designs are based on off-the-shelf server-class microprocessors, such as the PowerPC, Itanium, or x86-64, and most modern supercomputers are highly tuned computer clusters that combine commodity processors with custom interconnects.
Parallel architecture, the use of multiple processors to solve computing problems, has been applied since the early days of computing. Until recently, however, its practical benefits had not been forthcoming. Parallel machines built from microprocessors did increase performance, but for a long time they could not match the fastest single-processor systems. That changed because the performance of individual microprocessors improved faster than the performance of the fastest processors (such as those used in supercomputers and mainframes). As a result, today's best-performing processors are low-power, easily manufactured, and effective to use in parallel systems.
Since the application of parallelization is no longer theoretical or academic, it must be studied and recognized as a useful branch of computer science and engineering. As with most branches of computer science, change is inevitable and expected.
1.1.2 Technology trends
Processors: It is difficult to wait for a single processor to get fast enough. The critical issues in parallel computer architecture are fundamentally similar to those in sequential computers: resource allocation among functional units, caches (locality), and wires (communication bandwidth).
Motivation for Parallelism:
- Reduction in the basic VLSI feature size makes transistors, gates, and circuits faster and smaller, so more of them fit in the same area
- Useful die size is growing, so there is more area to use
- Clock rate improves roughly in proportion to the reduction in feature size (1,2). Use of many transistors at once (parallelism) is expected.
The performance of microprocessors has been increasing at a much greater rate than clock frequency. Processors are getting faster in large part by making more effective use of an ever larger volume of computing resources. The basic single-chip building block had 100 million transistors by the year 2000, which raised the possibility of placing more of the computer system on the chip, including memory and I/O support.
Dual-core processors: System designers are moving toward multi-core processor architectures rather than higher-frequency devices to enable higher system performance while minimizing increases in power consumption. Dual-core microprocessors, originally conceived for computationally intensive applications such as servers, are now being designed and deployed across a range of embedded applications. Many applications are better suited to thread-level parallelism (TLP) methods, and using multiple independent CPUs is one common way to increase a system's overall TLP. The combination of more available die space due to refined manufacturing processes and the demand for increased TLP is the logic behind the creation of multi-core CPUs.
Cell processors: A surprising driving force in performance computing is the videogame industry. In the quest for more realistic graphics, more sophisticated non-player characters (i.e., artificial intelligence), and more immersive audio environments, Sony, Toshiba and IBM jointly created the Cell processor, which forms the basis of Sony's PlayStation 3 videogame console. A Cell processor combines a general-purpose core based on the Power architecture with several specialized vector processing elements that work in parallel, which makes it a natural fit for a videogame console. What was unexpected, however, was how well suited Cell processors, especially in parallel, were to scientific and academic tasks.
An example of this is the Folding@Home project, which simulates the folding of various proteins. Folding@Home had been running since 2000, mainly on Intel 80x86 PCs, which average around 3 MegaFLOPS per day. A Folding@Home client was subsequently created for the PS3, which showed that 18 MegaFLOPS per day could be crunched on the Cell processor architecture, even though most PCs are 1.5-3 times as expensive as a PS3 console. According to the Folding@Home project administrators, PS3s account for about 60% of the total output of the project.
IBM has repurposed the Cell processor from the PS3 into its own line of BladeCenter servers, the Q line. In addition to selling the Q line of Blades to end-users for their own application needs, IBM intends to use Q blades in Project Roadrunner [1].
Dr. Frank Mueller, a faculty member at NCSU, has created a cluster of PS3 consoles for academic use [2]. As a proof of concept, it shows how a very cheap supercomputer can now be built from off-the-shelf components.
Memory technology: Capacity and speed have diverged: capacity has increased about 1000 times, while cycle time has improved only by a factor of 2. The gap between processor cycle time and memory cycle time keeps widening, and the memory bandwidth demanded by processors is growing rapidly. To reduce latency (access time), processors use one or two levels of cache on chip plus an additional level of external cache. For multiprocessor design, the key question is how to organize the collection of caches.
DDR2: Like all SDRAM implementations, DDR2 stores data in memory cells that are activated with the use of a clock signal to synchronize their operation with an external data bus. Like DDR before it, DDR2 transfers data on both the rising and the falling edge of the clock (a technique called double pumping). The key difference between DDR and DDR2 is that in DDR2 the bus is clocked at twice the speed of the memory cells, so four words of data can be transferred per memory cell cycle. Thus, without speeding up the memory cells themselves, DDR2 can effectively operate at twice the bus speed of DDR. [3]
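The relationship between memory cell clock, bus clock, and transfer rate can be worked through in a short sketch. The function names are hypothetical, and the 64-bit bus width and the 200 MHz cell clock in the usage example are assumptions chosen for illustration (they correspond to common DDR-400 and DDR2-800 modules).

```python
def transfers_per_second(cell_clock_hz, bus_multiplier):
    """Data transfers per second on the external bus.

    bus_multiplier is the ratio of bus clock to memory cell clock:
    1 for DDR, 2 for DDR2. Double pumping moves data on both clock
    edges, so there are two transfers per bus cycle.
    """
    bus_clock_hz = cell_clock_hz * bus_multiplier
    return bus_clock_hz * 2


def peak_bandwidth_bytes(cell_clock_hz, bus_multiplier, bus_width_bits=64):
    """Theoretical peak bandwidth in bytes per second (assumed 64-bit bus)."""
    return transfers_per_second(cell_clock_hz, bus_multiplier) * bus_width_bits // 8


if __name__ == "__main__":
    cell_clock = 200_000_000  # assumed 200 MHz memory cells for both generations
    for name, multiplier in (("DDR", 1), ("DDR2", 2)):
        rate = transfers_per_second(cell_clock, multiplier)
        bandwidth = peak_bandwidth_bytes(cell_clock, multiplier)
        print(name, rate // 1_000_000, "MT/s,", bandwidth // 1_000_000, "MB/s peak")
```

With the same cell clock, the DDR2 figures come out at twice the DDR figures, matching the "four words per memory cell cycle" description. These are theoretical peaks and ignore latency and command overhead.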
On-chip memory controllers are reducing processor-to-memory latency by a factor of 3 to 4.
Disks: Parallel disk storage systems such as RAID are becoming the norm. RAID (Redundant Array of Independent Disks) combines physical hard disks into a single logical unit using either special hardware or software. The main aims of using RAID are to improve reliability and speed [4].
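As an illustration of how combining disks can improve both speed and reliability, the sketch below uses striping with XOR parity, one widely used RAID technique. The text above does not name a specific RAID level, so this choice, along with the function names and the tiny block size, is an assumption made for the example.

```python
from functools import reduce
from operator import xor


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(xor, column) for column in zip(*blocks))


def make_stripe(data_blocks):
    """Return the data blocks plus one parity block, one block per disk."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def rebuild(stripe, lost_index):
    """Recover the block of a single failed disk from the surviving blocks."""
    survivors = [block for i, block in enumerate(stripe) if i != lost_index]
    return xor_blocks(survivors)


if __name__ == "__main__":
    data = [b"abcd", b"efgh", b"ijkl"]  # three data disks, 4-byte blocks
    stripe = make_stripe(data)          # a fourth disk holds the parity
    print(rebuild(stripe, 1) == data[1])
```

Speed comes from the fact that the blocks of a stripe sit on different disks and can be read in parallel. Reliability comes from the parity block, which lets any single lost block be recomputed.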
Large multilevel caches for files / disk blocks are predominant.
DMA (Direct Memory Access): A DMA transfer essentially copies a block of memory from one device to another. While the CPU initiates the transfer, it does not execute it. For so-called "third party" DMA, as is normally used with the ISA bus, the transfer is performed by a DMA controller, which is typically part of the motherboard chipset. More advanced bus designs such as PCI typically use bus-mastering DMA, where the device takes control of the bus and performs the transfer itself. A typical use of DMA is copying a block of memory between system RAM and a buffer on the device. Such an operation does not stall the processor, which as a result can be scheduled to perform other tasks. [5]
Clusters: Similar to Dr. Mueller's aforementioned PS3 cluster, a growing trend in parallel architecture is to connect multiple off-the-shelf systems for the purposes of solving parallel tasks. Much like early parallel multiprocessor machines, clusters have been explored academically, but recent developments have made clusters more important for real-world applications.
High-speed interconnections: One of the barriers to useful parallel clusters is intracluster communication between processors in different physical systems. 10 Mbps Ethernet is fine for browsing the web, but it is often too slow to be useful in a cluster. With the growing popularity of Fibre Channel [6], InfiniBand [7] and 10 Gigabit Ethernet [8], communication between processors in different systems can approach the speed of communication between processors inside the same system.
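A back-of-the-envelope comparison shows why link speed matters for a cluster. The sketch below only divides payload size by raw link rate. It ignores latency, protocol overhead, and contention, and the 1 GB payload and the function name are assumptions chosen for illustration.

```python
def transfer_seconds(payload_bytes, link_bits_per_second):
    """Ideal time to move a payload over a link at its raw bit rate."""
    return payload_bytes * 8 / link_bits_per_second


if __name__ == "__main__":
    payload = 10**9  # assumed 1 GB of data exchanged between two nodes
    links = {
        "10 Mbps Ethernet": 10 * 10**6,
        "10 Gigabit Ethernet": 10 * 10**9,
    }
    for name, rate in links.items():
        print(name, transfer_seconds(payload, rate), "seconds")
```

Under these idealized assumptions, the same exchange is a thousand times faster on the 10 Gigabit link than on 10 Mbps Ethernet.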
Multinode/scalable systems (NUMA): Instead of using generic off-the-shelf components, some systems such as the IBM System x3950 (and its predecessors the IBM eServer xSeries 460, 455, 445, and 440) have specialized interconnects and processor chipsets that make several nodes behave as one large parallel system [9]. The specialized interconnects are typically faster than the general-purpose interconnects enumerated in the previous section. The specialized chipsets can also communicate explicitly over the interconnects to form an additional level of caching among the nodes and to implement a NUMA (Non-Uniform Memory Access) scheme. In a NUMA system, memory access time depends on where the memory is located relative to the processor, so processes are assigned to processors based on how long it takes to communicate between a processor and the memory containing the process. For example, communication from a processor to memory inside the same node is relatively quick compared to communication to memory in another node. By colocating the processor to which a process is assigned with the memory containing the process's code and data, communication overhead is minimized.
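The placement idea can be sketched as a simple cost comparison. The relative access costs (1 for local, 3 for remote), the function names, and the page counts below are hypothetical values chosen for illustration. Real operating systems use more elaborate policies.

```python
LOCAL_COST = 1   # assumed relative cost of accessing memory in the same node
REMOTE_COST = 3  # assumed relative cost of accessing memory in another node


def total_cost(cpu_node, pages_per_node):
    """Cost of touching every page once from a CPU located in cpu_node."""
    return sum(
        count * (LOCAL_COST if node == cpu_node else REMOTE_COST)
        for node, count in pages_per_node.items()
    )


def best_node(pages_per_node):
    """Pick the node whose CPUs can reach the process's pages most cheaply."""
    return min(pages_per_node, key=lambda node: total_cost(node, pages_per_node))


if __name__ == "__main__":
    # Hypothetical process with 10 pages in node 0 and 90 pages in node 1.
    pages = {0: 10, 1: 90}
    for node in pages:
        print("run on node", node, "-> cost", total_cost(node, pages))
    print("preferred node:", best_node(pages))
```

The scheduler in this sketch prefers the node that already holds most of the process's memory, which is the colocation described above.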
Chassis-based systems (blades): Chassis-based systems are those where system units are modules that plug into a chassis. System units are full-fledged computer systems in their own right, but have specialized connections to a backplane in the chassis. The backplane acts as a high-speed interconnect similar to those used in multinode/scalable systems.
Blade systems have changed several aspects of cluster computing. They have epitomized the idea of disposable computing. For example, if a blade system were to fail, most end users would rather replace the failing system with another blade, which can be swapped in within a few seconds, than repair it. Since blade systems can be packed more densely in racks than normal systems, they introduce cooling and power issues. For example, a 42U rack can be loaded with 42 1U 2-way servers, for a total of 84 processors. A 7U IBM BladeCenter E chassis can be loaded with 14 2-way blade systems. Six of these chassis in a 42U rack hold 168 processors, which need much more cooling and power.
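The density arithmetic in the example above can be checked with a few lines. The function and parameter names are hypothetical. The numbers are the ones given in the text.

```python
def processors_with_1u_servers(rack_units=42, sockets_per_server=2):
    """A rack filled with 1U servers holds one server per rack unit."""
    return rack_units * sockets_per_server


def processors_with_blades(rack_units=42, chassis_units=7,
                           blades_per_chassis=14, sockets_per_blade=2):
    """A rack filled with blade chassis of the given height."""
    chassis_count = rack_units // chassis_units
    return chassis_count * blades_per_chassis * sockets_per_blade


if __name__ == "__main__":
    print("1U servers:", processors_with_1u_servers())
    print("blades:    ", processors_with_blades())
```

With the figures from the text, the blade configuration holds twice as many processors in the same rack, which is why cooling and power become the limiting factors.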
References
- Parallel Computer Architecture: A Hardware/Software Approach - by David E. Culler, Jaswinder Pal Singh, Anoop Gupta
External links
- [1] http://www-03.ibm.com/press/us/en/pressrelease/20210.wss
- [2] http://www.physorg.com/news92674403.html
- [3] http://en.wikipedia.org/wiki/DDR2
- [4] http://en.wikipedia.org/wiki/RAID
- [5] http://en.wikipedia.org/wiki/Direct_memory_access
- [6] http://www.fibrechannel.org/
- [7] http://www.infinibandta.org/home
- [8] http://grouper.ieee.org/groups/802/3/ae/
- [9] http://www-03.ibm.com/systems/x/scalable/index.html