CSC/ECE 506 Fall 2007/wiki1 4 la


Update section 1.1.3: Architectural Trends

Microprocessor Design Trends

Until 1986, advances in microprocessors were dominated by bit-level parallelism: 4-bit datapaths gave way to 8-bit, 16-bit, and 32-bit wide datapaths. In server design, 64-bit datapaths have been the established norm since the start of the millennium, and 128-bit datapaths are rarely mentioned for general-purpose microprocessors. Graphics processors (GPUs), however, already use 128-bit and 256-bit wide datapaths, and an increase to 512-bit datapaths is plausible soon, driven by advances in computer graphics, animation, and gaming.

Instruction-level parallelism took off as advances in bit-level parallelism receded. After all, the benefits of a wider datapath are limited to addressing more storage and doing more work in a single cycle. The latter benefit has largely meant more precise floating-point calculation, although some microprocessors can also bundle a couple of instructions into one.

The 1980s and 1990s set the stage for the modern microprocessor. Superscalar microprocessors appeared, bringing branch prediction, out-of-order execution, deeper and larger caches, speculative execution, cache-coherence protocols, and the ability to communicate with other microprocessors on chip. Research in the 1990s and early 2000s then set the stage for the next level of parallelism to be exploited: thread-level parallelism.

Two technologies appeared in the 2000s that altered the microprocessor performance race: multi-core chips and Simultaneous Multi-Threading (SMT, also known as Hyper-Threading). At this point the industry stopped using clock speed as the primary performance metric, since a microprocessor now encompassed many intertwined technologies rather than merely a faster clock. First came two cores on a single chip; then cores that take advantage of SMT. The number of cores, and the number of threads each can support, keeps increasing: dual-core and dual-thread processors already exist, with the promise of merging the two so that each core supports two threads. Microprocessors with four and eight cores exist today (from AMD and Sun, respectively), and sixteen cores on a single chip may be only months away.

Clock Speed and Parallelism

In the PC world, throughout the 1990s and early 2000s, raising the clock speed was the standard way to increase system performance. Desktop processors topped 1 GHz in 2000, 2 GHz in 2001, and 3 GHz in 2002. Due to power demands and heat concerns, that trend has since ended. Design obstacles, especially in laptop computers, meant that other methods had to be pursued to increase processing power without losing efficiency, and the multi-core era arrived in the PC world: in the spring of 2005, dual-core chips were introduced first by Intel and then by AMD. Quad-core processors have since reached the market, and eight-core parts may arrive by 2009.

In 2001, Intel released the Itanium microprocessor, which exploits explicit instruction-level parallelism: the compiler decides which instructions to execute in parallel, allowing the processor to issue up to six instructions per clock cycle. Although the original (and several subsequent) Itanium processors contained a single core, Intel released a dual-core Itanium in 2006. The future of the Itanium family will follow the trend of most other microprocessors, exploiting thread-level parallelism via multiple cores.

Instruction Sets and Parallelism

Following the move away from ever-faster clocks, research in instruction sets took off again in the 1990s to exploit more parallelism through Explicitly Parallel Instruction Computing (EPIC). This technology, implemented in the Itanium processor, relies on software (the compiler) to expose parallelism among instructions. In the early 2000s, instruction sets gained support for "glueless" multiprocessing: multiprocessors are increasingly able to communicate point-to-point without extra hardware or software.

In 1999, Intel introduced the Streaming SIMD Extensions (SSE) instruction set, which added eight new 128-bit registers and 70 new instructions, most of them operating on single-precision floating-point data. In 2000, the SSE2 instruction set added a full complement of integer instructions and 64-bit (double-precision) SIMD floating-point instructions operating on the SSE registers. In 2004, a revision of Intel's Pentium 4 processor introduced SSE3, which added memory- and thread-handling instructions that improved the performance of Intel's Hyper-Threading technology.

To keep pace with Intel, AMD licensed the SSE3 instruction set and implemented most of its instructions in certain Athlon 64 processors. In the summer of 2007, AMD announced a new extension of the x86 instruction set, SSE5, designed to increase application efficiency and performance by letting software developers simplify code and by giving them additional capabilities.

Silicon Technologies

In 1998, IBM announced its first PowerPC microprocessor built with copper wiring, claiming a performance boost of up to a third from that technology. In 2004, it announced chips built with Silicon-On-Insulator (SOI) technology, which saved a significant amount of power. In 2007, Intel and IBM announced a high-k dielectric material and metal gate electrodes (replacing polysilicon) that will enable mass production of chips at 45 nm. Dual-core, dual-threaded microprocessors have already been designed at 65 nm; moving to 45 nm will allow more cores and more cache on the chip, among other features. Coupled with the technologies mentioned earlier, performance will keep increasing while power consumption is kept at bay, continuing the legacy of Moore's Law.

System Design Trends

System design has become a very diverse field. Some systems use a single backplane supporting a small number of microprocessors; although that number has slowly inched up, this approach has been limited to desktops and workstations. Larger workloads need more microprocessors, and vendors have taken different approaches to gathering them into a single system. Some companies took on the challenge of packing many microprocessors into a single system on a shared bus; the challenge is hard enough that only a few companies, such as IBM and HP, are pursuing it. Others pursued different technologies for tight clustering, such as ccNUMA and blade servers. Larger clusters use computer-to-computer links such as InfiniBand and enter the realm of supercomputing, which deserves its own topic.

PC Direction

The number of microprocessors supported in a computer keeps increasing. Since the mid-2000s, the norm has increasingly been to support more than one processor in a desktop computer, with laptops following closely behind. Intel and AMD are in a constant race to provide a stronger chip offering higher performance (through multiple cores) and higher bandwidth (through faster electrical signaling, wider datapaths, pipelined protocols, multiple paths, and software support).

Server Direction

Figure 1 shows the number of processors supported on a shared bus over the past decade. Throughout both this decade and the last, these servers have used single-core or dual-core microprocessors. The industry has been inching toward supporting 100 microprocessors on a single shared bus. Because a bus has fixed bandwidth, this approach was bound to hit a dead end unless new levels of indirection were exploited, and new technologies have indeed made it more feasible: multiple cores per chip, deeper levels of caching, and better addressing schemes. Treat a multi-core microprocessor as a node: nodes communicate over the bus, and each microprocessor arbitrates among its own cores, relieving the shared bus of that addressing strain. With continuing improvements in multi-core support, servers with over two hundred cores may appear within this decade.

http://upload.wikimedia.org/wikipedia/commons/3/32/Procs.JPG
Figure 1. Number of processors in fully configured commercial bus-based shared-memory multiprocessors.


A different class of servers is emerging that is neither an SMP nor a cluster: ccNUMA, or Cache-Coherent Non-Uniform Memory Access. Such servers give each processor faster access to its local memory, while cache-coherence protocols keep the different copies of the same data up to date. The technology is supported by Intel and AMD; another vendor supporting it is SGI, whose Origin 350 server scales up to 32 microprocessors.

Shared Memory Bus Direction

As microprocessors become faster, and more of them (all sharing a common bus) are added to a system, the bandwidth of the bus becomes ever more critical. As Figure 2 shows, the shared-bus bandwidth of commercial multiprocessors has increased over time. Various techniques have been used to increase bus bandwidth, such as faster electrical signaling, wider datapaths, pipelined protocols, and multiple paths. In 2001, HyperTransport (HT), a bidirectional serial/parallel high-bandwidth, low-latency point-to-point link, was introduced. HT runs at clock rates from 200 MHz to 2.6 GHz and is used in many processors and in high-performance computing; it has also served as an interconnect for NUMA multiprocessor systems (see above).

Techniques have also been implemented to relieve the strain on the bus. With the Pentium III, Intel introduced the PAUSE instruction, designed to reduce bus contention: it cuts down the bus transactions that occur when spin-lock code repeatedly tries to test-and-set a memory location.

http://upload.wikimedia.org/wikipedia/commons/5/5e/Bandwidth.JPG
Figure 2. Bandwidth of the shared-memory bus in commercial multiprocessors.

References

Culler DE, Singh JP, Gupta A. Parallel Computer Architecture: A Hardware/Software Approach. San Francisco, CA: Morgan Kaufmann Publishers, Inc., 1999.
http://compoundsemiconductor.net/articles/news/11/1/25
http://www-03.ibm.com/servers/eserver/pseries/hardware/whitepapers/power/ppc_arch.html
http://www-05.ibm.com/se/news/sv/2007/05/power-timeline.html
http://www.amd.com/us-en/assets/content_type/DownloadableAssets/MPF_Hammer_Presentation.PDF
http://www.demandtech.com/Resources/Papers/Multiprocessor%20scalability.pdf
http://www.endian.net/details.aspx?ItemNo=655
http://www.hpcwire.com/hpc/1754487.html
http://www.hypertransport.org/
http://www.mbipr.com/whitepaper5.pdf
http://www.sgi.com/products/remarketed/offering.html
http://www.sun.com/processors/
http://www.theinquirer.net/?article=9235