CSC/ECE 506 Fall 2007/wiki1 1 11

1.1 Why parallel architecture?

The role of a computer architect is to maximize the productivity and performance of a computer system. Productivity can be thought of as programmability: systems that are easier to program reduce development time. Performance is the throughput the system achieves within the limits of a given technology and cost.

Parallel architecture, the use of multiple processors to solve computing problems, has been applied since the early days of computing. Until recently, however, its practical benefits had not been forthcoming. Microprocessors, the processors used in parallel architectures, were effective at increasing performance, but even in parallel they could not match the performance of the fastest single-processor systems. In the early and mid 1980s, machines with a modest number of vector processors working in parallel, typically 4 to 16, became the standard. In the late 1980s and 1990s, the performance of "off the shelf" microprocessors improved faster than that of the fastest processors (such as those used in supercomputers and mainframes). As a result, today's best-performing processors are low-power, easily manufactured, and effective to use in parallel systems. Commodity processors such as IBM's PowerPC, Intel's Itanium, and the x86-64 processors from Intel and AMD power most modern supercomputers.

Aside from the obvious parallelism of having multiple processors running side by side, microprocessor designs take advantage of another type of parallelism: pipelining. An example is the Intel 80x86 family. A typical instruction in a processor like the 8088 takes 15 clock cycles to execute. Because of the design of the multiplier, it takes approximately 80 cycles just to do one 16-bit multiplication. In a pipelined architecture, multiple instructions can be in various stages of execution simultaneously, so that in the ideal case one instruction completes every clock cycle. Modern processors have multiple instruction decoders, each with its own pipeline. This allows for multiple instruction streams, so that more than one instruction can be handled during each clock cycle.
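The effect of pipelining on throughput can be sketched with a small cycle count. This is a minimal illustration that assumes an ideal pipeline with one cycle per stage and no stalls or hazards; the function names and the five-stage depth are hypothetical and do not describe any particular 80x86 processor.

```python
def unpipelined_cycles(num_instructions: int, cycles_per_instruction: int) -> int:
    """Each instruction must finish completely before the next one starts."""
    return num_instructions * cycles_per_instruction


def pipelined_cycles(num_instructions: int, num_stages: int) -> int:
    """Ideal pipeline: fill it once, then complete one instruction per cycle."""
    if num_instructions == 0:
        return 0
    return num_stages + (num_instructions - 1)


# Hypothetical workload: 1000 instructions on a 5-stage design.
serial = unpipelined_cycles(1000, 5)
overlapped = pipelined_cycles(1000, 5)
print(f"unpipelined: {serial} cycles, pipelined: {overlapped} cycles")
print(f"speedup: {serial / overlapped:.2f}x")
```

For long instruction sequences the speedup approaches the number of stages, which is why the pipeline appears to complete one instruction per clock cycle.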

Since the application of parallelization is no longer theoretical or academic, it must be studied and recognized as a useful branch of computer science and engineering. As with most branches of computer science, change is inevitable and expected.

1.1.2 Technology trends

Processors

Increases in performance in single processors have led to diminishing returns, with Moore's law being unsustainable because of physical constraints. Fortunately, critical issues in parallel computer architecture are fundamentally similar to those in single-processor systems: resource allocation among functional units, caching, and communication bandwidth.

Motivation for Parallelism:

  1. Reduction in the basic VLSI feature size – makes transistors, gates, and circuits faster and smaller, so more fit in the same area
  2. Useful die size is growing – more area to use
  3. Clock rate improves roughly in proportion to the feature size reduction in (1), while (1) and (2) together increase the number of transistors much faster. Using many transistors at once (parallelism) is the expected way to exploit them.

Performance of microprocessors has been increasing at a much greater rate than clock frequency. Processors are getting faster in large part by making more effective use of an ever larger volume of computing resources. Basic single-chip building blocks had 100 million transistors in 2000. This has raised the possibility of placing more of the computer system in the processor, including memory and input/output support.

Multi Core Processors

Rather than pursuing higher frequency devices, system designers are moving toward multi core processor architectures, in which more than one microprocessor is placed in a single physical package. This enables higher system performance while minimizing increases in power consumption. Multi core microprocessors, originally conceived for computationally intensive applications such as servers, are now being designed and deployed across a range of embedded applications. Many applications are better suited to thread-level parallelism (TLP) methods, and the use of multiple independent CPUs is one common method used to increase a system's overall TLP. The combination of increased available space due to refined manufacturing processes and the demand for increased TLP is the logic behind the creation of multi core CPUs.

Cell Processors

A surprising driving force in performance computing is the videogame industry. In the quest for more realistic graphics, more sophisticated non-player characters (i.e., artificial intelligence), and more immersive audio environments, Sony, Toshiba and IBM jointly created the Cell processor. The Cell processor forms the basis for Sony's PlayStation 3 videogame console. A Cell combines a general-purpose PowerPC-based core with several specialized vector-oriented coprocessor cores, which makes it a natural fit for a videogame console. What was unexpected, however, was how well suited Cell processors, especially in parallel, were for scientific and academic tasks.

An example of this is the Folding@Home project, which simulates the folding of various proteins. Folding@Home had been running since 2000 with mainly Intel 80x86 PCs, which average around 3 MegaFLOPS per day. A Folding@Home client was subsequently created for the PS3, which showed that 18 MegaFLOPS per day could be crunched on the Cell processor architecture, despite the fact that most PCs are 1.5-3 times as expensive as a PS3 console. According to the Folding@Home project administrators, PS3s account for about 60% of the total output of the project.

IBM has repurposed the Cell processor from the PS3 into its own line of BladeCenter servers, the Q line. In addition to selling the Q line of Blades to end-users for their own application needs, IBM intends to use Q blades in Project Roadrunner [1].

Dr. Frank Mueller, a faculty member at NCSU, has created a cluster of PS3 consoles for academic use [2]. As a proof of concept, it shows how a very cheap supercomputer can now be built from off-the-shelf components.

Memory technology

Caches

The trend in memory technology in recent computing history has favored capacity over speed. In the time that capacity increased by a factor of 1000, the average read/write cycle was only sped up by a factor of 2. This has widened the gap between the time required for a processor cycle and that required for a memory cycle. To mitigate this, memory bandwidth, or the amount of data that can be read in a given amount of time, is growing rapidly. Another technique to address the disparity between processor and memory cycle times is to integrate larger caches and more levels of cache. One or two levels of cache in the processor package are now typical, with one or more additional levels of external cache. The closer a cache is to the processor, the faster, more expensive, and thus smaller it is. An important part of modern multiprocessor design is how to organize this collection of caches rather than just a single level of cache.
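The benefit of stacking cache levels can be estimated with the standard average memory access time calculation. The sketch below is illustrative only: the function name, hit times, and miss rates are assumed values and are not measurements from the source.

```python
def average_access_time(levels, memory_latency):
    """Estimate the average cycles per memory access.

    levels: list of (hit_time, miss_rate) pairs, closest to the processor first.
    memory_latency: cycles to reach main memory after missing in every cache.
    """
    total = 0.0
    reach_probability = 1.0  # fraction of accesses that get this far down
    for hit_time, miss_rate in levels:
        total += reach_probability * hit_time
        reach_probability *= miss_rate
    return total + reach_probability * memory_latency


# Hypothetical hierarchy: a small fast L1, a larger slower L2, then DRAM.
no_cache = average_access_time([], memory_latency=100)
two_levels = average_access_time([(1, 0.10), (10, 0.20)], memory_latency=100)
print(f"no cache: {no_cache:.1f} cycles, two levels: {two_levels:.1f} cycles")
```

With these assumed numbers only 2% of accesses reach main memory, which is how a small, fast cache near the processor hides most of the slow memory cycle.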

DDR and DDR2

Like all SDRAM implementations, DDR (Double Data Rate) stores data in memory cells that are activated with the use of a clock signal to synchronize their operation with an external data bus. What sets DDR apart from normal SDRAM is that DDR transfers data on both the rising and falling edges of the clock (a technique called double pumping). [3] DDR2 goes one step further by clocking the memory bus at twice the speed of the memory cells, allowing four words of data to be transferred per memory cell cycle. Thus, without speeding up the memory cells themselves, DDR2 can effectively operate at twice the bus speed of DDR. [4]
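The words-per-cycle arithmetic above can be turned into a peak bandwidth estimate. This is a rough sketch: the function name is hypothetical, and the 8-byte (64-bit) bus width and 200 MHz cell clock are assumptions chosen to match common DDR-400 and DDR2-800 modules.

```python
def peak_bandwidth_mb_s(cell_clock_mhz, bus_clock_multiplier, bus_width_bytes=8):
    """Peak transfer rate in MB/s for a double-pumped memory bus.

    bus_clock_multiplier: bus clock relative to the memory cell clock
    (1 for DDR, 2 for DDR2).
    """
    transfers_per_bus_cycle = 2  # double pumping: rising and falling edge
    words_per_cell_cycle = bus_clock_multiplier * transfers_per_bus_cycle
    return cell_clock_mhz * words_per_cell_cycle * bus_width_bytes


# Same 200 MHz memory cells, different bus clocks.
print("DDR :", peak_bandwidth_mb_s(200, bus_clock_multiplier=1), "MB/s")
print("DDR2:", peak_bandwidth_mb_s(200, bus_clock_multiplier=2), "MB/s")
```

Under these assumptions the memory cells run at the same speed in both cases; only the bus clock doubles, and the peak rate doubles with it.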

On-chip memory controllers

For much of the history of 80x86 processors and their contemporaries, memory control was housed in the North Bridge, which is a component of the system's core chipset. By moving memory control into the processor's physical package, processor-to-memory latency can be greatly reduced. This is technologically feasible for the same reason that multi core processors are: refined manufacturing processes leave more space available in the package.

Disks

RAID

Disks are the slowest part of a computer system (other than the user, of course). Parallelism saves the day here with RAID. [5] RAID, or Redundant Array of Independent Disks, combines physical hard disks into a single logical unit by using either special hardware or software. The main aims of using RAID are to improve reliability, performance, and capacity, with the various levels of RAID making tradeoffs among them. RAID 0 is a striped set of multiple drives, which increases capacity and performance but lowers reliability (because a failure of any one drive can destroy the set). RAID 1 is a mirrored set of disks. Performance and capacity are not increased, but the members of the set provide real-time backups for each other. RAID 5 stripes data across the drives as RAID 0 does, but also distributes parity information among them. The parity allows the contents of any one drive to be rebuilt from the others, so the failure of a single drive will not destroy the set.
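Two of the ideas behind these RAID levels, striping and surviving a single drive failure, can be sketched in a few lines. The example assumes the XOR parity scheme commonly used for single-failure protection; the function names and block contents are hypothetical, and a real array would also rotate parity across drives and handle partial blocks.

```python
from functools import reduce


def stripe(data: bytes, num_drives: int, block_size: int):
    """RAID 0 style striping: deal fixed-size blocks round-robin across drives."""
    drives = [[] for _ in range(num_drives)]
    for offset in range(0, len(data), block_size):
        drive_index = (offset // block_size) % num_drives
        drives[drive_index].append(data[offset:offset + block_size])
    return drives


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks (used both for parity and rebuild)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Striping spreads consecutive blocks over the drives.
print(stripe(b"abcdefgh", num_drives=2, block_size=2))

# Parity: store the XOR of the data blocks on an extra drive.
data_blocks = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data_blocks)

# If the second drive fails, XOR the survivors with the parity to rebuild it.
rebuilt = xor_blocks([data_blocks[0], data_blocks[2], parity])
print(rebuilt == data_blocks[1])
```

Because XOR is its own inverse, any one missing block can be recovered from the remaining blocks plus the parity, at the cost of one drive's worth of capacity.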

Caches

Caching techniques similar to those used for memory can be applied to disks. Good caching is even more important for disks, since the concern is not only their relatively slow speed but also wear and tear on their moving parts. The more often data can be served from caches (in memory), the less the read and write heads of the drives need to move. An added benefit of using the read and write heads less is reliability: with less head movement, a mechanical failure that destroys the disk is less likely.
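A disk block cache of this kind can be sketched as follows. The sketch assumes a least-recently-used replacement policy, which is one common choice and is not specified in the text; the `BlockCache` class and the `read_from_disk` callback are hypothetical names.

```python
from collections import OrderedDict


class BlockCache:
    """Keep recently used disk blocks in memory to avoid repeated disk reads."""

    def __init__(self, capacity, read_from_disk):
        self.capacity = capacity
        self.read_from_disk = read_from_disk  # slow path: touches the drive
        self.blocks = OrderedDict()           # oldest entry first
        self.disk_reads = 0

    def read(self, block_number):
        if block_number in self.blocks:
            # Cache hit: no head movement, just mark the block as recently used.
            self.blocks.move_to_end(block_number)
            return self.blocks[block_number]
        # Cache miss: go to the disk, then remember the block.
        self.disk_reads += 1
        data = self.read_from_disk(block_number)
        self.blocks[block_number] = data
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)   # evict the least recently used
        return data


# Stand-in for a real drive: returns a label instead of reading hardware.
cache = BlockCache(capacity=2, read_from_disk=lambda n: f"block-{n}")
for block in [1, 2, 1, 1, 3, 1, 2]:
    cache.read(block)
print("requests: 7, disk reads:", cache.disk_reads)
```

Every request served from the cache is one the drive heads never have to move for, which is the wear and latency saving described above.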

DMA

Direct Memory Access: A DMA transfer essentially copies a block of memory from one device to another. While the CPU initiates the transfer, it does not execute it. For so-called "third party" DMA, as is normally used with the ISA bus, the transfer is performed by a DMA controller which is typically part of the motherboard chipset. More advanced bus designs such as PCI typically use bus mastering DMA, where the device takes control of the bus and performs the transfer itself. A typical usage of DMA is copying a block of memory from system RAM to or from a buffer on the device. Such an operation does not stall the processor, which as a result can be scheduled to perform other tasks. [6]

Clusters

Similar to Dr. Mueller's aforementioned PS3 cluster, a growing trend in parallel architecture is to connect multiple off-the-shelf systems for the purposes of solving parallel tasks. Much like early parallel multiprocessor machines, clusters have been explored academically, but recent developments have made clusters more important for real-world applications.

High speed interconnections

One of the barriers to useful parallel clusters is intracluster communication between processors in different physical systems. 10 Mbps Ethernet is fine for browsing the web, but is often too slow to be useful in a cluster. With the growing popularity of Fibre Channel [7], InfiniBand [8] and 10 Gigabit Ethernet [9], processor communication between systems can approach the communication speeds of processors inside the same system.
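The difference between the two Ethernet speeds mentioned above is easy to quantify. The sketch below is a back-of-the-envelope estimate that considers raw link bandwidth only and ignores latency and protocol overhead; the 100 MB payload is an arbitrary assumed size.

```python
def transfer_seconds(payload_bytes, link_bits_per_second):
    """Time to push a payload through a link, counting bandwidth only."""
    return payload_bytes * 8 / link_bits_per_second


payload = 100 * 10**6  # hypothetical 100 MB exchanged between two nodes
slow = transfer_seconds(payload, 10 * 10**6)   # 10 Mbps Ethernet
fast = transfer_seconds(payload, 10 * 10**9)   # 10 Gigabit Ethernet
print(f"10 Mbps: {slow:.2f} s, 10 Gbps: {fast:.2f} s")
```

A thousandfold increase in link speed turns a transfer that would stall the computation for over a minute into one that takes a fraction of a second.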

Multinode/scalable systems (NUMA)

Instead of using generic off-the-shelf components, some systems such as the IBM System x3950 (and its predecessors the IBM eServer xSeries 460, 455, 445, and 440) have specialized interconnects and processor chipsets to simulate one large parallel system [10]. The specialized interconnects are typically faster than the general purpose interconnects enumerated in the previous section. Also, the specialized chipsets can communicate explicitly over the interconnects to form an additional level of caching among the nodes as well as to implement a NUMA (Non-Uniform Memory Access) scheme. In NUMA, processes are assigned to processors based on the length of time it takes to communicate between a processor and the memory containing a process. For example, communication from processors to the memory inside the same node is relatively quick compared to communication to memory in another node. By colocating the processor to which a process is assigned with the memory containing the process's code and data, communications overhead is minimized.
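The placement decision described above can be sketched as a small cost comparison. This is a simplified model, not how any particular operating system or the IBM chipsets make the decision; the function names and the local and remote access costs are hypothetical.

```python
def access_cost(cpu_node, memory_node, local_cost=1, remote_cost=3):
    """Relative cost of one memory access (assumed values, not measurements)."""
    return local_cost if cpu_node == memory_node else remote_cost


def best_node(pages_per_node):
    """Pick the node whose processors reach the process's memory most cheaply.

    pages_per_node: mapping of node id -> number of the process's pages there.
    """
    def total_cost(cpu_node):
        return sum(pages * access_cost(cpu_node, memory_node)
                   for memory_node, pages in pages_per_node.items())

    return min(pages_per_node, key=total_cost)


# A process with most of its code and data resident on node 1.
placement = {0: 10, 1: 90}
print("run the process on node", best_node(placement))
```

Running the process on the node that already holds most of its pages keeps the bulk of its accesses local, which is the overhead reduction the NUMA scheme aims for.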

Chassis-based systems (blades)

Chassis-based systems are those where system units are modules that plug into a chassis. System units are full-fledged computer systems in their own right, but have specialized connections to a backplane in a chassis. The backplane acts as a high-speed interconnect similar to those used in multinode/scalable systems.

Blade systems have changed several aspects of cluster computing. They have epitomized the idea of disposable computing. For example, if a blade system were to fail, most end users would rather replace the failing system with another blade system that can be swapped in within a few seconds. Since blade systems can be packed more densely in racks than normal systems, they introduce issues of cooling and powering. For example, a 42U rack can be loaded with 42 1U 2-way servers, for a total of 84 processors. A 7U IBM BladeCenter E chassis can be loaded with 14 2-way blade systems. Six of these chassis in a 42U rack hold 168 processors, which need much more cooling and power.
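The density comparison above follows from simple arithmetic, sketched here using the figures given in the text. The function and parameter names are hypothetical.

```python
def processors_per_rack(enclosure_height_u, systems_per_enclosure,
                        processors_per_system, rack_height_u=42):
    """Count processors when a rack is filled with one kind of enclosure."""
    enclosures = rack_height_u // enclosure_height_u
    return enclosures * systems_per_enclosure * processors_per_system


# 1U servers: one 2-way system per rack unit.
rack_servers = processors_per_rack(1, 1, 2)
# 7U BladeCenter E chassis: 14 2-way blades per chassis.
blades = processors_per_rack(7, 14, 2)
print(f"1U servers: {rack_servers} processors, blades: {blades} processors")
```

Filling the same rack with blades doubles the processor count, and the heat and power load rise with it.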

References

  1. Parallel Computer Architecture: A Hardware/Software Approach - by David E. Culler, Jaswinder Pal Singh, Anoop Gupta

External links

  1. http://www-03.ibm.com/press/us/en/pressrelease/20210.wss
  2. http://www.physorg.com/news92674403.html
  3. http://en.wikipedia.org/wiki/DDR2
  4. http://en.wikipedia.org/wiki/DDR2
  5. http://en.wikipedia.org/wiki/RAID
  6. http://en.wikipedia.org/wiki/Direct_memory_access
  7. http://www.fibrechannel.org/
  8. http://www.infinibandta.org/home
  9. http://grouper.ieee.org/groups/802/3/ae/
  10. http://www-03.ibm.com/systems/x/scalable/index.html