CSC/ECE 506 Fall 2007/wiki1 1 11
Sections 1.1 and 1.1.2: Update performance trends in multiprocessors.
1.1 Why parallel architecture?
The role of a computer architect is to maximize productivity and performance. Here, productivity means programmability and reduced development time, and performance means reasonable throughput within given technology and cost limitations.
Parallelism is a paradigm that touches all aspects of computing. The number of processors adds a new dimension to the design space, and parallelism helps achieve performance at acceptable cost.
Current systems are built on parallel concepts and designs (e.g., desktop systems are multi-threaded).
Performance trends over time span microprocessors, minicomputers, mainframes, and supercomputers, with the single-chip microprocessor dominating in the 1990s. Technological and architectural trends strive to meet application demand for increased performance.
Parallel architecture, the use of multiple processors to solve computing problems, has been applied since the early days of computing. Until recently, however, its practical benefits had not been forthcoming. The microprocessors used in parallel systems did increase performance, but even in parallel they could not match the fastest single-processor systems. Over time, the performance of individual microprocessors improved faster than that of the fastest processors (such as those used in supercomputers and mainframes). As a result, today the best-performing processors are low-power, easily manufactured, and effective to use in parallel systems.
Since the application of parallelization is no longer theoretical or academic, it must be studied and recognized as a useful branch of computer science and engineering. As with most branches of computer science, change is inevitable and expected.
1.1.2 Technology trends
Application demand comes from scientific and engineering computing and from commercial computing.
Processors: It is difficult to wait for a single processor to get fast enough. The critical issues in parallel computer architecture are fundamentally similar to those in sequential computers: resource allocation among functional units, caches (locality), and wires (communication bandwidth).
1. Reduction in the basic VLSI feature size makes transistors, gates, and circuits faster and smaller, so more fit in the same area.
2. Useful die size is growing, so there is more area to use.
3. Clock rate improves in proportion to the size changes in items 1 and 2.
Use of many transistors at once (parallelism) is expected.
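As a rough illustration of the first point above, the sketch below assumes an idealized shrink in which both dimensions of a transistor scale with the feature size, so the number of transistors that fit in a fixed area grows with the square of the linear shrink. The function name and the example feature sizes are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def density_gain(old_feature_nm: float, new_feature_nm: float) -> float:
    """Idealized factor by which transistors per unit area increase.

    Assumption: transistor area scales with the square of the feature size.
    """
    if old_feature_nm <= 0 or new_feature_nm <= 0:
        raise ValueError("feature sizes must be positive")
    linear_shrink = old_feature_nm / new_feature_nm
    return linear_shrink ** 2


if __name__ == "__main__":
    # Halving the feature size quadruples the transistors per unit area.
    print(density_gain(180.0, 90.0))  # 4.0
```

Real processes deviate from this ideal, so the result is best read as an upper-bound intuition rather than a prediction.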
Microprocessor performance has been increasing at a much greater rate than clock frequency. Common benchmarks for measuring workstation performance include SPEC and LINPACK. Processors are getting faster in large part by making more effective use of an ever larger volume of computing resources.
With 100 million transistors on the basic single-chip building block by the year 2000, it becomes possible to place more of the computer system on the chip, including memory and I/O support, or to place multiple processors on a chip. This is already evident commercially in system-on-a-chip designs for embedded systems.
Dual-core processors: System designers are moving toward multi-core processor architectures rather than higher-frequency devices to enable higher system performance while minimizing increases in power consumption. Dual-core microprocessors, originally conceived for computationally intensive applications such as servers, are now being designed and deployed across a range of embedded applications. Many applications are better suited to thread-level parallelism (TLP) methods, and using multiple independent CPUs is one common way to increase a system's overall TLP. The combination of increased available space due to refined manufacturing processes and the demand for increased TLP is the logic behind the creation of multi-core CPUs.
Memory technology: Capacity and speed have diverged. Capacity has increased 1000 times, while cycle time has improved only by a factor of 2. The gap between processor cycle time and memory cycle time is growing wider, and the memory bandwidth demanded by processors is growing rapidly. To reduce latency (access time), processors use one or two levels of cache on chip and an additional level of external cache. For multiprocessor design, the question is how to organize the collection of caches.
DDR2 - Like all SDRAM implementations, DDR2 stores data in memory cells that are activated with the use of a clock signal to synchronize their operation with an external data bus. Like DDR before it, DDR2 transfers data on both the rising and falling edges of the clock (a technique called double pumping). The key difference between DDR and DDR2 is that in DDR2 the bus is clocked at twice the speed of the memory cells, so four words of data can be transferred per memory cell cycle. Thus, without speeding up the memory cells themselves, DDR2 can effectively operate at twice the bus speed of DDR. http://en.wikipedia.org/wiki/DDR2
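The arithmetic behind the DDR versus DDR2 comparison can be sketched as follows. This is a minimal illustration, assuming a 64-bit data bus (the usual module width) and a 200 MHz memory cell clock as the example value; the function and parameter names are made up for this sketch.

```python
from typing import Tuple


def peak_transfer_rate(cell_clock_mhz: float, bus_multiplier: int,
                       bus_width_bits: int = 64) -> Tuple[float, float]:
    """Return (mega-transfers per second, peak megabytes per second).

    bus_multiplier is the ratio of bus clock to memory cell clock:
    1 for DDR, 2 for DDR2, as described in the text above.
    """
    bus_clock_mhz = cell_clock_mhz * bus_multiplier
    transfers = bus_clock_mhz * 2  # double pumping: both clock edges
    megabytes_per_s = transfers * bus_width_bits / 8
    return transfers, megabytes_per_s


if __name__ == "__main__":
    # Same 200 MHz memory cells in both cases.
    print(peak_transfer_rate(200, 1))  # DDR:  (400, 3200.0)
    print(peak_transfer_rate(200, 2))  # DDR2: (800, 6400.0)
```

With the cell clock held constant, the DDR2 case moves four words per cell cycle instead of two, which is where the doubling of peak bandwidth comes from. These are peak figures and say nothing about access latency.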
On-chip memory controllers are reducing processor-to-memory latency by a factor of 3 to 4.
Disks: Parallel disk storage systems (RAID) are becoming the norm. RAID, a Redundant Array of Independent Drives (or Disks), combines physical hard disks into a single logical unit by using either special hardware or software. The main aims of using RAID are to improve reliability and speed. http://en.wikipedia.org/wiki/RAID
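To make the reliability aim concrete, here is a minimal sketch of one common idea behind parity-based RAID levels: data blocks are spread across disks and an extra XOR parity block lets any single lost block be rebuilt. RAID has several levels with different layouts; this example assumes one stripe with a single parity block and is not a description of any particular controller. All function names are hypothetical.

```python
from functools import reduce
from typing import List, Optional


def xor_blocks(blocks: List[bytes]) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_stripe(data_blocks: List[bytes]) -> List[bytes]:
    """Return the data blocks plus one parity block (one block per disk)."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def rebuild(stripe: List[Optional[bytes]]) -> List[bytes]:
    """Recover a single missing block (marked None) from the survivors."""
    missing = [i for i, block in enumerate(stripe) if block is None]
    if len(missing) != 1:
        raise ValueError("exactly one block must be missing")
    survivors = [block for block in stripe if block is not None]
    repaired = list(stripe)
    # XOR of all surviving blocks equals the missing block.
    repaired[missing[0]] = xor_blocks(survivors)
    return repaired


if __name__ == "__main__":
    stripe = make_stripe([b"ABCD", b"EFGH", b"IJKL"])
    damaged = list(stripe)
    damaged[1] = None  # simulate one failed disk
    print(rebuild(damaged) == stripe)
```

The speed aim comes from the same layout: because consecutive blocks live on different disks, they can be read or written in parallel.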
Large multilevel caches for files / disk blocks are predominant.
DMA – Direct Memory Access: A DMA transfer essentially copies a block of memory from one device to another. While the CPU initiates the transfer, it does not execute it. For so-called "third party" DMA, as is normally used with the ISA bus, the transfer is performed by a DMA controller which is typically part of the motherboard chipset. More advanced bus designs such as PCI typically use bus mastering DMA, where the device takes control of the bus and performs the transfer itself. A typical usage of DMA is copying a block of memory from system RAM to or from a buffer on the device. Such an operation does not stall the processor, which as a result can be scheduled to perform other tasks. http://en.wikipedia.org/wiki/Direct_memory_access
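The division of labor described above can be imitated in software. The sketch below is only an analogy, assuming a Python thread stands in for the DMA controller and a `threading.Event` stands in for the completion interrupt; real DMA is performed by hardware, and the function names here are invented for illustration.

```python
import threading
from typing import Tuple


def dma_transfer(source: bytes, dest: bytearray, done: threading.Event) -> None:
    """Stand-in for the DMA controller: copy the block, then signal completion."""
    dest[:len(source)] = source
    done.set()  # plays the role of the completion interrupt


def cpu_with_dma(source: bytes) -> Tuple[bytes, int]:
    """The 'CPU' starts a transfer, does other work, then collects the result."""
    dest = bytearray(len(source))
    done = threading.Event()
    # The CPU only initiates the transfer; it does not perform the copy itself.
    worker = threading.Thread(target=dma_transfer, args=(source, dest, done))
    worker.start()
    # Meanwhile the CPU is free to be scheduled on unrelated work.
    other_work = sum(i * i for i in range(1000))
    done.wait()  # block only when the transferred data is actually needed
    worker.join()
    return bytes(dest), other_work


if __name__ == "__main__":
    copied, result = cpu_with_dma(b"block of memory")
    print(copied, result)
```

The point of the pattern is the ordering: initiate, continue with other work, and synchronize only at the moment the data is required.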