CSC/ECE 506 Fall 2007/wiki1 4 la
Latest revision as of 21:17, 10 September 2007
Update section 1.1.3: Architectural Trends
Microprocessor Design Trends
Up to 1986, advances in microprocessors were dominated by bit-level parallelism. Datapaths started at 4 bits wide and grew to 8, 16 and then 32 bits. In server design, 64-bit has been the established norm since the start of the millennium. A 128-bit datapath is rarely mentioned for use in microprocessors. Graphics processors (GPUs), however, have been using 128-bit and 256-bit wide datapaths, and 512-bit wide datapaths may follow soon, driven by advances in computer graphics, animation and gaming.
Instruction-level parallelism took off as advances in bit-level parallelism receded. After all, the benefits of a wider datapath are limited to addressing more storage space and doing more work in a single cycle. In practice the latter benefit has mostly meant more precise floating-point calculation, although some microprocessors can bundle a couple of instructions into one.
The 1980s and 1990s set the stage for the modern microprocessor. Superscalar microprocessors were created, which encompassed branch predictors, out-of-order execution, deeper and larger levels of cache, speculative execution, cache coherency protocols and the ability to communicate with other microprocessors on chip. Research done in the 1990s and early 2000s set the stage for the next level of parallelism to be exploited: thread-level parallelism.
Two technologies appeared in the 2000s that altered the microprocessor performance race. The first is Multi-Core on chip and the second is Simultaneous Multi-Threading (SMT), which Intel markets as Hyper-Threading. At this point the industry moved away from clock speed as the performance metric, since a microprocessor's performance now depended on many more intertwined technologies than a faster clock alone. The industry first saw two cores on a single chip, and then cores taking advantage of SMT. Both the number of cores per chip and the number of threads per core keep increasing. Dual-core processors and dual-thread processors already exist, with the promise of merging both technologies so that each core supports two threads. Microprocessors with four and eight cores exist today (from AMD and Sun respectively), and sixteen cores on a single chip are expected within a matter of months.
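To make the "more in a single cycle" benefit concrete, here is a minimal sketch of adding two unsigned integers on a datapath narrower than the operands. The function name and parameters are hypothetical, and one loop iteration stands in for one add-with-carry step; real hardware and instruction sets differ in detail.

```python
def add_on_narrow_datapath(a, b, word_bits, datapath_bits):
    """Add two word_bits-wide unsigned integers with a datapath_bits-wide adder.

    Returns (sum modulo 2**word_bits, number of add steps needed).
    Illustrative model only: each loop iteration represents one
    add-with-carry operation on the narrow datapath.
    """
    mask = (1 << datapath_bits) - 1
    result = 0
    carry = 0
    steps = 0
    for shift in range(0, word_bits, datapath_bits):
        # Add one datapath-wide slice of each operand plus the incoming carry.
        chunk = ((a >> shift) & mask) + ((b >> shift) & mask) + carry
        result |= (chunk & mask) << shift
        carry = chunk >> datapath_bits
        steps += 1
    return result & ((1 << word_bits) - 1), steps


if __name__ == "__main__":
    # A 32-bit addition needs four steps on an 8-bit datapath, one on a 32-bit one.
    print(add_on_narrow_datapath(0x12345678, 0x11111111, 32, 8))
    print(add_on_narrow_datapath(0x12345678, 0x11111111, 32, 32))
```

Under this model, widening the datapath from 8 to 32 bits cuts a 32-bit addition from four steps to one. Widening it further gives no gain for operands that are already no wider than the datapath, which is one way to read why bit-level parallelism stopped being the main source of speedup.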
Clock Speed and Parallelism
In the PC world, throughout the 1990s and early 2000s, increasing chip clock speed was the standard way to increase system performance. Desktop processors topped 1 GHz in 2000, 2 GHz in 2001 and 3 GHz in 2002. Power demands and heat concerns have since ended this trend. Design obstacles, especially in laptop computers, meant that other methods had to be pursued to increase processing power without losing efficiency. This brought the Multi-Core era to the PC world. In the spring of 2005, dual-core chips were introduced by Intel and then by AMD. Quad-core processors have reached the market, and eight-core processors may follow by 2009.
In 2001, Intel released the Itanium microprocessor, which takes advantage of explicit instruction-level parallelism. The compiler decides which instructions to execute in parallel, allowing the processor to execute up to six instructions per clock cycle. The original (and several subsequent) Itanium processors contained a single core; in 2006, Intel released a dual-core Itanium. The Itanium family is expected to follow the trend of most other microprocessors, exploiting thread-level parallelism via multi-core chips.
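The idea of a compiler choosing which instructions may issue together can be sketched with a toy scheduler. This is not Itanium's actual bundling scheme (which uses fixed bundle templates and more constraints). It is a minimal greedy model in which an instruction may join a group only if every instruction it depends on sits in an earlier group, and a group holds at most six instructions. The function name and the instruction names are hypothetical.

```python
def pack_issue_groups(instructions, max_width=6):
    """Greedily pack instructions into groups that could issue in parallel.

    instructions: list of (name, set_of_names_it_depends_on) in program order;
                  every dependency must appear earlier in the list.
    Returns a list of groups, each a list of instruction names.
    Toy model: only true (read-after-write) dependencies are considered.
    """
    groups = []
    placed = {}  # instruction name -> index of the group that holds it
    for name, deps in instructions:
        # The earliest legal group is the one right after the last producer.
        group = max((placed[d] + 1 for d in deps), default=0)
        # Skip groups that are already full.
        while group < len(groups) and len(groups[group]) >= max_width:
            group += 1
        if group == len(groups):
            groups.append([])
        groups[group].append(name)
        placed[name] = group
    return groups


if __name__ == "__main__":
    program = [
        ("load_a", set()),
        ("load_b", set()),
        ("add", {"load_a", "load_b"}),
        ("load_c", set()),
        ("mul", {"add", "load_c"}),
    ]
    for cycle, group in enumerate(pack_issue_groups(program)):
        print(cycle, group)
```

In this example the three independent loads share the first group, while the add and the multiply each wait for their inputs. The available parallelism is limited by the dependency chain rather than by the six-wide issue limit.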
Instruction Sets and Parallelism
As the focus shifted away from raising clock speed, research in instruction sets took off again in the 1990s to exploit more parallelism with Explicit Parallel Instruction Computing (EPIC). This technology was implemented in the Itanium processor. It relies on software, namely the compiler, to expose more parallelism among instructions. In the early 2000s, support for multiprocessors was added to instruction sets. This was done by allowing multiprocessors to communicate gluelessly, that is, without additional glue logic between the chips. Multiprocessors are increasingly able to communicate in a point-to-point fashion without the need for extra hardware or software.
In 1999, Intel introduced the Streaming SIMD Extensions (SSE) instruction set. This instruction set added eight new 128-bit registers and 70 floating point instructions. In 2000, Intel added a complete complement of integer instructions and 64-bit SIMD floating point instructions to the original SSE registers when it introduced the SSE2 instruction set. In 2004, a revision of Intel's Pentium 4 processor introduced the SSE3 instruction set. This instruction set added specific memory and thread-handling instructions, which improved the performance of Intel's Hyper-Threading technology.
In an attempt to keep pace with Intel, AMD licensed the SSE3 instruction set and implemented most of its instructions in certain Athlon 64 processors. In the summer of 2007, AMD introduced a new extension of the x86 instruction set: SSE5. This extension was designed to increase application efficiency and performance by allowing software developers to simplify code and by providing them with additional capabilities.
Silicon Technologies
In 1998, IBM announced its first PowerPC microprocessor designed using copper wiring. IBM claimed that this technology boosted performance by up to a third. In 2004, it announced that it was developing chips using Silicon-On-Insulator (SOI) technology, which saves a significant amount of power. Finally, in 2007, Intel and IBM announced that they were able to produce a high-K material and electrode metals (instead of polysilicon) that will enable the mass production of chips in 45nm technology. Dual-core and dual-threaded microprocessors have already been designed in 65nm technology. Designing microprocessors in 45nm technology will make room for more cores and cache on the chip, among other features. Coupled with the technologies mentioned earlier, performance will increase and power consumption will be kept at bay, thus continuing the legacy of Moore's Law.
System Design Trends
System design has become a very diverse field. Some systems use a single backplane that supports a small number of microprocessors. Although the number of microprocessors has slowly been inching up, this approach has been limited to desktops and workstations. Larger workloads need more microprocessors, and vendors have found several ways to gather them into a single system. Some companies took on the challenge of packing many microprocessors into a single system using a shared bus. That challenge has been so tough that only a couple of companies, such as IBM and HP, are pursuing it. Other companies pursued different technologies, such as ccNUMA and blade servers, for tight clustering. Larger clusters use computer-to-computer links, such as Infiniband. Such clusters enter the realm of supercomputing, which deserves its own topic.
PC Direction
The number of microprocessors supported in a computer is ever increasing. Since the mid 2000s, the norm has increasingly been to support more than one processor in a desktop computer (with laptops following closely behind). Intel and AMD are in a constant race to deliver chips with higher performance (through multiple cores) and higher bandwidth (through faster electrical signaling, wider datapaths, pipelined protocols, multiple paths and software support).
Server Direction
Figure 1 shows the number of processors that have been supported on a shared bus over the past decade. Servers of this decade and of the last have one thing in common: they supported either single-core or dual-core microprocessors. The industry has been inching towards supporting 100 microprocessors on a single shared bus. Because the bus has a fixed bandwidth, this approach was bound to reach a dead end unless new levels of indirection were exploited. Indeed, new technologies have made supporting more microprocessors on a shared bus more feasible. Among these technologies are multiple cores per chip, deeper levels of caching and better addressing schemes. Consider a microprocessor with multiple cores as a node. Nodes communicate with each other, and it is left to the microprocessor to arbitrate between its cores, thus relieving the shared bus of this addressing strain. With the constant improvements in multiple core support within a chip, servers with over two hundred cores are possible within this decade.
Figure 1. Number of processors in fully configured commercial bus-based shared memory multiprocessors.
A different class of servers is emerging which is neither an SMP nor a cluster. It is called ccNUMA, for Cache-Coherent Non-Uniform Memory Access. Such servers give each processor faster access to its local memory than to remote memory, while cache-coherency protocols keep the different copies of the same data up to date. Such technology is being supported by Intel and AMD. Another server manufacturer supporting this technology is SGI, with its Origin 350 server supporting up to 32 microprocessors.
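The fixed-bandwidth argument can be made concrete with a back-of-the-envelope model. In this sketch, each processor generates a certain memory demand and only cache misses reach the bus, so the bus saturates once the combined miss traffic equals its bandwidth. The function name, parameters and all numbers are hypothetical and not taken from any of the systems in Figure 1.

```python
def max_processors_on_bus(bus_bandwidth_mb_s, demand_per_processor_mb_s, miss_rate):
    """Estimate how many processors a shared bus can feed before it saturates.

    bus_bandwidth_mb_s:        total bus bandwidth (assumed fixed)
    demand_per_processor_mb_s: memory traffic each processor would generate
                               with no cache at all
    miss_rate:                 fraction of that traffic that misses in the
                               caches and therefore reaches the bus (0 < rate <= 1)
    Simplified model: ignores arbitration overhead and coherence traffic.
    """
    if not 0 < miss_rate <= 1:
        raise ValueError("miss_rate must be in (0, 1]")
    bus_traffic_per_processor = demand_per_processor_mb_s * miss_rate
    return int(bus_bandwidth_mb_s // bus_traffic_per_processor)


if __name__ == "__main__":
    # Hypothetical 1600 MB/s bus, 200 MB/s of demand per processor.
    for rate in (1.0, 0.5, 0.125):
        print(rate, max_processors_on_bus(1600, 200, rate))
```

With these made-up numbers the same bus feeds 8 processors without caches, 16 at a 50% miss rate and 64 at a 12.5% miss rate. This is the sense in which deeper caching makes more processors on a shared bus feasible.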
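The non-uniform part of ccNUMA can be illustrated with a simple weighted average. This is a sketch under assumed latencies; the function name and the nanosecond figures are hypothetical and do not describe any particular server.

```python
def average_access_ns(local_fraction, local_ns, remote_ns):
    """Average memory access time on a NUMA machine.

    local_fraction: share of accesses served by the processor's local memory
    local_ns:       assumed latency of a local access, in nanoseconds
    remote_ns:      assumed latency of a remote access, in nanoseconds
    """
    if not 0 <= local_fraction <= 1:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * local_ns + (1 - local_fraction) * remote_ns


if __name__ == "__main__":
    # Hypothetical latencies: 100 ns local, 300 ns remote.
    print(average_access_ns(0.75, 100, 300))
    print(average_access_ns(1.0, 100, 300))
```

The more of a program's data that lives in the memory local to the processor using it, the closer the average gets to the local latency, which is why data placement matters on such machines.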
Shared Memory Bus Direction
As microprocessors become faster, and more and more microprocessors (all sharing a common bus) are added to a system, the bandwidth of the bus becomes ever more critical. As shown in Figure 2, the shared bus bandwidth of commercial multiprocessors has increased with time. Various technologies and techniques have been implemented to increase bus bandwidth, such as faster electrical signaling, wider datapaths, pipelined protocols, and multiple paths. In 2001, a bidirectional serial/parallel high-bandwidth, low-latency point-to-point link called HyperTransport (HT) was introduced. HT runs from 200 MHz to 2.6 GHz. It is used in many processors and in high-performance computing. HT has also been used as an interconnect for NUMA multiprocessor systems (see above).
Techniques have also been implemented to alleviate the strain put on the bus. With the Pentium 4, Intel introduced an instruction designed to reduce contention in spin-wait loops: the PAUSE instruction. Placed inside a spin lock loop, it reduces the bus transactions that occur when the code repeatedly tries to test and set a memory location.
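How clock rate and link width combine into bandwidth can be sketched as a short calculation. The 200 MHz and 2.6 GHz clock figures come from the text. The double-data-rate transfer (two transfers per clock) and the link widths used below are assumptions added for illustration, as are the function and parameter names.

```python
def link_bandwidth_mb_s(clock_mhz, width_bits, transfers_per_clock=2):
    """Peak bandwidth of a link in one direction, in MB/s.

    clock_mhz:           link clock in MHz
    width_bits:          number of data bits per transfer (assumed link width)
    transfers_per_clock: 2 models double data rate (an assumption here)
    """
    return clock_mhz * transfers_per_clock * width_bits / 8


if __name__ == "__main__":
    print(link_bandwidth_mb_s(200, 2))    # slowest clock, narrow link
    print(link_bandwidth_mb_s(2600, 32))  # fastest clock, wide link
```

Under these assumptions the same formula spans from 100 MB/s to 20800 MB/s per direction. Faster signaling and wider datapaths multiply together rather than add.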
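The effect of inserting a delay into a spin loop can be approximated with a toy cycle-count model. This sketch does not execute or emulate the PAUSE instruction. It only counts how many failed test-and-set attempts a waiting processor makes while another processor holds the lock, with and without a delay between attempts. The function name, the parameters and the cycle counts are all hypothetical.

```python
def failed_attempts(hold_cycles, delay_cycles):
    """Count failed test-and-set attempts while a lock stays held.

    hold_cycles:  how long the lock holder keeps the lock
    delay_cycles: cycles the waiter idles after each failed attempt
                  (0 models a tight spin loop)
    Toy model: every attempt costs one cycle and one bus transaction.
    """
    period = 1 + delay_cycles  # one cycle for the attempt, then the delay
    attempts = 0
    cycle = 0
    while cycle < hold_cycles:
        attempts += 1
        cycle += period
    return attempts


if __name__ == "__main__":
    print(failed_attempts(100, 0))  # tight spin
    print(failed_attempts(100, 9))  # spin with a delay between attempts
```

In this model a tight loop makes 100 failed attempts during a 100-cycle hold, while a 9-cycle delay between attempts cuts that to 10. Each avoided attempt is one less transaction competing for the shared bus.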
References
Culler DE, Singh JP, Gupta A. Parallel Computer Architecture: A Hardware/Software Approach. San Francisco, CA: Morgan Kaufmann Publishers, Inc., 1999.
http://compoundsemiconductor.net/articles/news/11/1/25
http://www-03.ibm.com/servers/eserver/pseries/hardware/whitepapers/power/ppc_arch.html
http://www-05.ibm.com/se/news/sv/2007/05/power-timeline.html
http://www.amd.com/us-en/assets/content_type/DownloadableAssets/MPF_Hammer_Presentation.PDF
http://www.demandtech.com/Resources/Papers/Multiprocessor%20scalability.pdf
http://www.endian.net/details.aspx?ItemNo=655
http://www.hpcwire.com/hpc/1754487.html
http://www.hypertransport.org/
http://www.mbipr.com/whitepaper5.pdf
http://www.sgi.com/products/remarketed/offering.html
http://www.sun.com/processors/
http://www.theinquirer.net/?article=9235