CSC/ECE 506 Fall 2007/wiki1 1 11

Sections 1.1 and 1.1.2: Update performance trends in multiprocessors.


1.1 Why parallel architecture?

The role of a computer architect is to maximize productivity and performance. Here, productivity means programmability and reduced development time, and performance means reasonable throughput within the given technology and cost limitations.

Parallelism is now a paradigm in all aspects of computing. The number of processors adds a new dimension to the design space, and parallelism helps achieve performance at acceptable cost.

Current systems are built on parallel concepts and designs (e.g., desktop systems are multi-threaded).

The performance trends of microprocessors, minicomputers, mainframes, and supercomputers over time show the single-chip microprocessor coming to dominate in the 1990s. Technological and architectural trends strive to meet application demand for increased performance.

Parallel architecture, the use of multiple processors to solve computing problems, has been applied since the early days of computing. Until recently, however, its practical benefits had not been forthcoming. Microprocessors, the processors used in parallel systems, did increase performance, but even in parallel they could not match the fastest single-processor systems. Over time, performance gains in microprocessors outpaced those of the fastest processors (such as those used in supercomputers and mainframes). As a result, today's best-performing processors are low-power, easily manufactured, and effective to use in parallel systems.

Since parallelization is no longer purely theoretical or academic, it must be studied and recognized as a useful branch of computer science and engineering. As with most branches of computer science, change is inevitable and expected.

1.1.2 Technology trends

Demand for performance comes from scientific and engineering computing as well as commercial computing.

Processors: It is difficult to wait for a single processor to get fast enough. The critical issues in parallel computer architecture are fundamentally similar to those in sequential computers: resource allocation among functional units, caches (locality), and wires (communication bandwidth).

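1. Reduction in the basic VLSI feature size makes transistors, gates, and circuits faster and smaller, so more fit in the same area.
2. Useful die size is growing, so there is more area to use.
3. Clock rate improves roughly in proportion to the reduction in feature size, while the number of transistors grows with the square of that reduction or faster (from 1 and 2). Using many transistors at once (parallelism) is therefore expected.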
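The scaling argument in the list above can be sketched as a small calculation. This is only a first-order model: the function name `scale` and its parameters are hypothetical, and the assumption is that clock rate tracks the linear shrink while transistor count tracks the square of the shrink times any growth in die area.

```python
def scale(feature_shrink, die_area_growth=1.0):
    """First-order scaling model (an assumption, not a measured law).

    feature_shrink:  old feature size / new feature size (2.0 = half the size)
    die_area_growth: new die area / old die area
    Returns (clock_gain, transistor_gain).
    """
    # Smaller, faster devices: clock improves roughly linearly with the shrink.
    clock_gain = feature_shrink
    # Devices shrink in two dimensions, and a larger die adds more area on top.
    transistor_gain = feature_shrink ** 2 * die_area_growth
    return clock_gain, transistor_gain


clock, transistors = scale(2.0, die_area_growth=1.5)
print(f"clock x{clock}, transistors x{transistors}")
```

Under these assumptions, halving the feature size on a die that is 1.5 times larger doubles the clock rate but gives six times as many transistors, which is why using them in parallel matters more than the clock alone.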

The performance of microprocessors has been increasing at a much greater rate than clock frequency. Common benchmarks for measuring workstation performance are SPEC and LINPACK. Processors are getting faster in large part by making more effective use of an ever larger volume of computing resources.

The basic single-chip building block was expected to reach 100 million transistors by the year 2000. This raises the possibility of placing more of the computer system on the chip, including memory and I/O support, and of placing multiple processors on one chip. The trend is already evident commercially in system-on-a-chip designs for embedded systems.

Dual-core processors: System designers are moving toward multi-core processor architectures rather than higher-frequency devices to enable higher system performance while minimizing increases in power consumption. Dual-core microprocessors, originally conceived for computationally intensive applications such as servers, are now being designed and deployed across a range of embedded applications. Many applications are better suited to thread-level parallelism (TLP) methods, and using multiple independent CPUs is one common way to increase a system's overall TLP. The combination of more available die space from refined manufacturing processes and the demand for increased TLP is the logic behind the creation of multi-core CPUs.
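As a minimal sketch of thread-level parallelism, the example below splits one computation into independent chunks and hands each chunk to its own worker, as a dual-core system would with two workers. The function names are hypothetical. Note one assumption about the runtime: in standard CPython the global interpreter lock keeps threads from running Python bytecode simultaneously, so this shows the structure of TLP rather than a guaranteed speedup; `concurrent.futures.ProcessPoolExecutor` offers the same interface with separate processes.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum(bounds):
    """Sum of squares over the half-open range [lo, hi)."""
    lo, hi = bounds
    return sum(i * i for i in range(lo, hi))


def parallel_sum_of_squares(n, workers=2):
    """Split range(n) into one chunk per worker and combine the results."""
    step = max(1, (n + workers - 1) // workers)  # ceiling division, never 0
    chunks = [(lo, min(lo + step, n)) for lo in range(0, n, step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each chunk is independent, so the workers need no coordination
        # until the partial results are added together.
        return sum(pool.map(partial_sum, chunks))


total = parallel_sum_of_squares(1000, workers=2)
```

The chunks share no data, which is the property that makes this kind of work a good fit for multiple independent cores.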

Cell processors: A surprising driving force in high-performance computing is the videogame industry. In the quest for more realistic graphics, more sophisticated non-player characters (i.e., artificial intelligence), and more immersive audio environments, Sony, Toshiba, and IBM jointly created the Cell processor, which forms the basis of Sony's PlayStation 3 videogame console. A Cell processor combines a general-purpose PowerPC-based core with several specialized vector cores that run in parallel, much as a conventional system pairs a CPU with a graphics processor (GPU), which makes it a natural fit for a videogame console. What was unexpected, however, was how well suited Cell processors, especially in parallel, are to scientific and academic tasks.

An example is the Folding@Home project, which simulates the folding of various proteins. Folding@Home had been running since 2000, mainly on Intel 80x86 PCs, which average around 3 MegaFLOPS per day. A Folding@Home client was subsequently created for the PS3 and showed that 18 MegaFLOPS per day could be crunched on the Cell processor architecture, even though most PCs are 1.5 to 3 times as expensive as a PS3 console. According to the Folding@Home project administrators, PS3s account for about 60% of the project's total output.
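The price/performance comparison implied by the figures above can be worked out directly. This sketch uses only the numbers quoted in the text (3 versus 18, and a PC costing 1.5 to 3 times as much as a PS3); the function name is hypothetical. Only ratios are used, so the units of the throughput figures cancel.

```python
def perf_per_cost_advantage(ps3_rate, pc_rate, pc_cost_ratio):
    """How many times more throughput per unit cost the PS3 delivers.

    pc_cost_ratio: PC price divided by PS3 price.
    """
    throughput_ratio = ps3_rate / pc_rate
    # The PC costs more, so its cost ratio multiplies the PS3's advantage.
    return throughput_ratio * pc_cost_ratio


low = perf_per_cost_advantage(18, 3, 1.5)   # PC 1.5x the price of a PS3
high = perf_per_cost_advantage(18, 3, 3.0)  # PC 3x the price of a PS3
print(f"PS3 advantage per unit cost: {low:.0f}x to {high:.0f}x")
```

On the quoted figures, the PS3 delivers six times the throughput and between 9 and 18 times the throughput per unit cost.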

IBM has repurposed the Cell processor from the PS3 into its own line of BladeCenter servers, the Q line. In addition to selling Q-line blades to end users for their own applications, IBM intends to use them in Project Roadrunner [1].

Dr. Frank Mueller, a faculty member at NCSU, has created a cluster of PS3 consoles for academic use [2]. As a proof of concept, it shows how a very cheap supercomputer can now be built from off-the-shelf components.

Memory technology: Capacity and speed have diverged. Capacity has increased about 1000 times, while cycle time has improved only by a factor of 2. The gap between processor cycle time and memory cycle time keeps widening, and the memory bandwidth demanded by processors is growing rapidly. To reduce latency (access time), processors use one or two levels of cache on chip plus an additional level of external cache. A key question in multiprocessor design is how to organize this collection of caches.
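To see why several cache levels help hide the processor-memory gap, the standard average memory access time formula (hit time plus miss rate times miss penalty, applied level by level) can be sketched as below. All latencies and miss rates here are hypothetical illustrative values, not measurements, and the function name `amat` is an assumption of this sketch.

```python
def amat(levels, memory_latency):
    """Average memory access time, in cycles.

    levels: list of (hit_time, miss_rate) pairs, ordered from the cache
            closest to the CPU outward.
    memory_latency: cycles to reach main memory after the last cache misses.
    """
    penalty = memory_latency
    # Work inward: the miss penalty of each level is the average access
    # time of everything behind it.
    for hit_time, miss_rate in reversed(levels):
        penalty = hit_time + miss_rate * penalty
    return penalty


no_cache = amat([], 200)                        # every access goes to memory
one_level = amat([(1, 0.05)], 200)              # on-chip L1 only
two_level = amat([(1, 0.05), (10, 0.2)], 200)   # L1 plus a second level
print(no_cache, one_level, two_level)
```

With these assumed numbers, each added level cuts the average access time substantially, which is the motivation for the multilevel organization described above.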

DDR2: Like all SDRAM implementations, DDR2 stores data in memory cells that are activated by a clock signal to synchronize their operation with an external data bus. Like DDR before it, DDR2 transfers data on both the rising and falling edges of the clock (a technique called double pumping). The key difference between DDR and DDR2 is that in DDR2 the bus is clocked at twice the speed of the memory cells, so four words of data can be transferred per memory cell cycle. Thus, without speeding up the memory cells themselves, DDR2 can effectively operate at twice the bus speed of DDR. http://en.wikipedia.org/wiki/DDR2
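The DDR versus DDR2 comparison above reduces to simple arithmetic. The sketch below assumes a 100 MHz memory cell clock and a 64-bit (8-byte) module data bus; both values and the function name are illustrative assumptions, not taken from the text.

```python
BUS_WIDTH_BYTES = 8  # assumption: 64-bit module data bus


def data_rate(cell_clock_mhz, bus_clock_multiplier):
    """Return (transfers per cell cycle, megatransfers/s, MB/s).

    bus_clock_multiplier: bus clock / memory cell clock (1 for DDR, 2 for DDR2).
    """
    bus_clock_mhz = cell_clock_mhz * bus_clock_multiplier
    # Double pumping: one transfer on each clock edge of the bus.
    mega_transfers = bus_clock_mhz * 2
    transfers_per_cell_cycle = 2 * bus_clock_multiplier
    bandwidth_mb_s = mega_transfers * BUS_WIDTH_BYTES
    return transfers_per_cell_cycle, mega_transfers, bandwidth_mb_s


ddr = data_rate(100, 1)   # same cell clock, bus at 1x
ddr2 = data_rate(100, 2)  # same cell clock, bus at 2x
print("DDR :", ddr)
print("DDR2:", ddr2)
```

With identical 100 MHz cells, DDR moves two words per cell cycle and DDR2 moves four, so the peak transfer rate doubles without the cells themselves running any faster.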

On-chip memory controllers are reducing processor-to-memory latency by a factor of 3 to 4.

Disks: Parallel disk storage systems are becoming the norm. RAID (Redundant Array of Independent Disks) combines physical hard disks into a single logical unit using either special hardware or software. The main aims of using RAID are to improve reliability and speed. http://en.wikipedia.org/wiki/RAID
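One way RAID obtains both speed and reliability is to stripe data across several disks and keep an XOR parity block alongside it. The sketch below illustrates that idea with a dedicated parity block in the style of RAID 4; the text above does not specify a RAID level, so this choice and all function names are assumptions made for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_stripe(data_blocks):
    """Return the data blocks followed by one parity block."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def rebuild(stripe, lost_index):
    """Recover the block at lost_index from all surviving blocks."""
    survivors = [b for i, b in enumerate(stripe) if i != lost_index]
    # XOR of everything except the lost block yields the lost block,
    # because each byte of the full stripe XORs to zero.
    return xor_blocks(survivors)


stripe = make_stripe([b"abcd", b"efgh", b"ijkl"])  # three data disks + parity
recovered = rebuild(stripe, lost_index=1)          # pretend disk 1 failed
```

The data blocks can be read from different disks at the same time (speed), and any single failed disk, including the parity disk, can be reconstructed from the others (reliability).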

Large multilevel caches for files / disk blocks are predominant.

DMA (Direct Memory Access): A DMA transfer essentially copies a block of memory from one device to another. While the CPU initiates the transfer, it does not execute it. For so-called "third party" DMA, as is normally used with the ISA bus, the transfer is performed by a DMA controller that is typically part of the motherboard chipset. More advanced bus designs such as PCI typically use bus-mastering DMA, where the device takes control of the bus and performs the transfer itself. A typical use of DMA is copying a block of memory between system RAM and a buffer on the device. Such an operation does not stall the processor, which can therefore be scheduled to perform other tasks. http://en.wikipedia.org/wiki/Direct_memory_access
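The division of labor described above (the CPU starts the transfer, something else performs it) can be mimicked with a toy simulation. Everything here is hypothetical: `ToyDmaController` is not a real driver API, a background thread stands in for the controller hardware, and the completion event stands in for whatever signal tells the CPU the transfer has finished.

```python
import threading


class ToyDmaController:
    """Hypothetical stand-in for a DMA controller; not a real device API."""

    def start_transfer(self, src, dst, dst_offset, length):
        """Begin copying `length` bytes and return a completion event."""
        done = threading.Event()

        def copy():
            # The "controller" moves the data; the CPU thread is not involved.
            dst[dst_offset:dst_offset + length] = src[:length]
            done.set()

        threading.Thread(target=copy).start()
        return done


ram = bytearray(b"payload from system RAM")
device_buffer = bytearray(32)

# The CPU only initiates the transfer...
done = ToyDmaController().start_transfer(ram, device_buffer, 0, len(ram))
# ...and is then free to do unrelated work while the copy proceeds.
other_work = sum(range(1000))
done.wait()  # block only when the transferred data is actually needed
```

The point of the sketch is the ordering: the call that starts the transfer returns immediately, and the CPU waits only at the moment it needs the result.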