Talk:CSC 456 Fall 2013/4a bc

2013-11-12T16:54:40Z

Cmbeverl:

CSC 456 Fall 2013/4a bc

2013-11-12T16:39:40Z

Cmbeverl:

=Load Balancing=
In multi-processor systems, load-balancing is used to break up and distribute the work load to individual processors in order to make effective use of processor time. When the work load is divided up at compile-time, the balance is said to be ''statically'' balanced. Dividing the work load up during run-time is ''dynamically'' balancing the load. Static load balancing has reduced overhead as the work is divided before run time. Dynamic load balancing assigns work as processors become idle, so there is greater overhead. However, dynamic balancing can lead to increased performance of load balancing due to being able to assign work to a processor when it does become idle, reducing the overall idle time of processors.

==Static vs. Dynamic Techniques==

==='''Static Load balancing'''===

====Round Robin====

Round robin is a load balancing technique which evenly distributes tasks across available processors. Each processor is lined up, and given a task one after the other until it loops around again back to the first processor. Visualize a dealer in a casino passing out cards to each player in a circle, one at a time. The advantage is that this is a very simple load balancing technique to implement, with very little overhead. A disadvantage is that there is no care given to the job size or performance. This can create problems if a processor is unlucky and is continually assigned large tasks, causing it to fall behind.

====Random====

Random load balancing relies on the hope that over the course of enough time, work loads are evenly spread by random chance. Random is fairly easy to implement with little overhead. Generating good "random" values is one challenge, because the function is called so many times that any bias will have a large affect. Random suffers from the same drawbacks as round robin though. There is always the chance that a certain processor is randomly picked in an unusually frequent fashion, leading to wait times for other processors. Random could also assign multiple large tasks to a single processor in a short period of time, which would also lead to uneven load balancing.

====Central Manager====

Central manager is a load balancing scheme which selects a certain processor to act as the "central node", which handles the balancing. The central node assigns each new task to the slave processor which currently has the least load. This method has a different overhead than usual. Before there would be intercommunication between all processors, where as with central load balancing, the communication exists solely between the central node and the other processors. A drawback of the Central Management is that it usually works best with smaller networks of processors. A hierarchy of master central nodes controlling lesser central nodes is possible, but adds more complexity. It is possible for a central control node to be inundated by messages from its children nodes, locking up the system and causing great drops in performance. The Central Manager policy has an advantage because it requires fewer messages to be sent in order to facilitate load balancing. This method also greatly reduces the chance that any one processor is overworked or left idle.

===Dynamic Load Balancing===

====Local Queue====
Under local queue work load management, also called distributed work load management, each processor is responsible for maintaining a sufficient work load. When a load drops below a threshold, the load manager for the processor fires off a request to another random processor work load manager to send work. The remote load manager receiving the request examines it's own work load and, if it has sufficient extra work load, will send work to the requesting load manager. This algorithm scheme is fault tolerant in that if any processor were to fail, the other nodes would be able to continue working as they still have their work load and can still manager work loads with other processors. Unfortunately, this scheme generally requires a relatively large amount of inter-processor communications to maintain a satisfactory work load at all processors.

====Central Queue====
A centralized work load manager is responsible for distributing work load to processors under the central queue algorithm. The central manager is aware of all work to be distributed to the processors. When a processor's load falls below a threshold, a request for more work is sent to the central load manager, which then distributes more work. If there is not enough work in the central queue to meet the demand, the request is buffered until there enough work is available to meet the request. In systems with large numbers of processors, clusters can be formed of groups of processors with each cluster have a centralized work load manager. One work load manager would be in charge of distributing work loads to each cluster work load manager. This scheme has a lower fault tolerance as the system can be at risk of being brought down if the central load manager were to stop working. Also, an entire cluster could stop producing of it's central load manager were to stop functioning.

==Real World applications of Load Balancing==
====Weather Modeling====

====Visible Human Project====

==Examples of Load Balancing in action==

Server Load balancing pseudocode

<code>
server_load_vec_desc = sort_descending(server_load_vec);
server_load_vec_asc = sort_ascending(server_load_vec);
while (server_load_vec_desc[0].deviation > DEVIATION_THRESHOLD) {
populate_range_load_vector(server_load_vec_desc[0].server_name);
sort descending range_load_vec;
i=0;
while (server_load_vec_desc[0].deviation > DEVIATION_THRESHOLD &&
i < range_load_vec.size()) {
if (moving range_load_vec[i] from server_load_vec_desc[0] to server_load_vec_asc[0] reduces deviation) {
add range_load_vec[i] to balance plan
partial_deviation = range_load_vec[i].loadestimate * loadavg_per_loadestimate;
server_load_vec_desc[0].loadavg -= partial_deviation;
server_load_vec_desc[0].deviation -= partial_deviation;
server_load_vec_asc[0].loadavg += partial_deviation;
server_load_vec_asc[0].deviation += partial_deviation;
server_load_vec_asc = sort_ascending(server_load_vec_asc);
}
i++;
}
if (i == range_load_vec.size())
remove server_load_vec_desc[0] and corresponding entry in server_load_vec_asc
server_load_vec_desc = sort_descending(server_load_vec_desc);
}

</code>

==Sources==

<ol>
<li>[http://code.google.com/p/hypertable/wiki/LoadBalancing Load Balancing PseudoCode and other information] </li>
<li>[http://paper.ijcsns.org/07_book/201006/20100619.pdf A Guide to Dynamic Load Balancing in Distributed Computer Systems] </li>
<li>[http://www.ics.uci.edu/~cs237/reading/parallel.pdf Strategies for Dynamic Load Balancing on Highly Parallel Computers] </li>
<li>[http://www.vsrdjournals.com/CSIT/Issue/2013_05_May/Web/1_Jagdeep_Singh_1670_Research_Article_VSRDIJCSIT_May_2013.docx SIMULATION OF STATIC LOAD BALANCING ALGORITHMS ON HOMOGENEOUS AND HETEROGENEOUS CPUs ] </li>
</ol>

2013-10-01T15:48:08Z

Cmbeverl:

Main Page/CSC 456 Fall 2013/1a bc

2013-09-24T22:31:55Z

Cmbeverl:

Edited from http://wiki.expertiza.ncsu.edu/index.php/Chapter_1:_Nick_Nicholls,_Albert_Chu

Since 2006, parallel computers have continued to evolve. Besides the increasing number of transistors (as predicted by Moore's law), other designs and architectures have increased in prominence. These include Chip Multi-Processors, cluster computing, and mobile processors.

==Transistor Count==
At the most fundamental level of parallel computing development is the transistor count. According to the text, since 1971 the number of transistors on a chip has increased from 2,300 to 167 million in 2006. By 2011, the transistor count had further increased to 2.6 billion, a 1,130,434x increase from 1971. The clock frequency has also continued to rise, if a bit slower since 2006. In 2006, the clock speed was around 2.4GHz, 3,200 times the speed of 750KHz in 1971. By 2011, the high end clock speed of a processor is in the 3.3GHz range.

====Evolution of Intel Processors====
{| class="wikitable"
|+ Table 1.1: Evolution of Intel Processors
|-
! From
! Procs
! Transistors
! Specifications
! New Features
|-
| 2000
| Pentium IV
| 55 Million
| 1.4-3GHz
| hyper-pipelining, SMT
|-
| 2006
| Xeon
| 167 Million
| 64-bit, 2GHz, 4MB L2 cache on chip
| Dual core, virtualization support
|-
| 2007
| Core 2 Allendale
| 167 Million
| 1.8-2.6 GHz, 2MB L2 cache
| 2 CPUs on one die, Trusted Execution Technology
|-
| 2008
| Xeon
| 820 Million
| 2.5-2.83 GHz, 6MB L3 cache
|
|-
| 2009
| Core i7 Lynnfield
| 774 Million
| 2.66-2.93 GHz, 8MB L3 cache
| 2-channel DDR3
|-
| 2010
| Core i7 Gulftown
| 1.17 Billion
| 3.2 GHz
| 32 nm
|-
| 2011
| Core i7 Sandy Bridge EP4
| 1.2 Billion
| 3.2-3.3 GHz, 32 KB L1 cache per core, 256 KB L2 cache, 20 MB L3 cache
| Up to 8 cores
|-
|2012
| Core i7 Ivy Bridge
| 1.2 Billion
| 2.5-3.7 GHz
| 22 nm, 3D Tri-gate transistors
|-
|2013
| Core Haswell
| 1.4 Billion
| 2.5-3.7 GHz
| Fully integrated voltage regulator
|}

==Chip Multi-Processors==

With the sophistication of processors and increasing clock speeds, effort was placed on parallelism. The high clock speed could be broken down into a large pipeline; this large pipeline allowed big performance gains with instruction level parallelism (ILP). Instruction level parallelism is the act of executing multiple instructions at the same time. This would be implemented in a single core, with each stage of the pipeline being executed in each clock cycle. By the 1970s the gains from ILP were significant enough to allow uni-processor systems to reach the level of performance in parallel computers after only a few years. This inhibited adoption of multi-processors as it was costly and not needed. Of course, the performance gains of ILP was soon limited. Once branch prediction had a success rate of 90%, there was little room for further improvement. At this point, the main way of increasing performance was to increase the clock speed. This also meant more power consumption.

As the diminishing returns and power inefficiencies of ILP progressed, manufacturers began to turn towards chip multi-processors (i.e. multicore architectures). These systems allowed task parallelism in addition to ILP. For example, one processor can execute multiple tasks simultaneously, and each core can use ILP with pipelining. Driven by the gains of multi-processors, the amount of cores on a chip has continued to increase since 2006. By 2011, Intel and IBM were producing 8-core processors. For servers, AMD was producing up to 16-core processors.
{| class="wikitable"
|+ Table 1.2: Examples of current multicore processors
|-
! Aspects
! Intel Sandy Bridge
! AMD Valencia
! IBM POWER7
|-
! # Cores
| 4
| 8
| 8
|-
! Clock Freq.
| 3.5GHz
| 3.3GHz
| 3.55GHz
|-
! Clock Type
| OOO Superscalar
| OOO Superscalar
| SIMD
|-
! Caches
| 8MB L3
| 8MB L3
| 32MB L3
|-
! Chip Power
| 95 Watts
| 95 Watts
| 650 Watts for the whole system
|}

==Cluster Computers==
The 1990s saw a rise in the use of cluster computers, or distributed super computers. These systems take advantage of the power of individual processors, and combine them to create a powerful unified system. Originally, cluster computers only used uniprocessors, but have since adopted the use of multi-processors. Unfortunately, the cost advantage mentioned by the book has largely dissipated, as many current implementations use expensive, high-end hardware.

One of the newer innovations in cluster computers is high-availability. These types of clusters operate with redundant nodes to minimize downtime when components fail. Such a system uses automated load-balancing algorithms to route traffic when a node fails. In order to function, high-availability clusters must be able to check and change the status of running applications. The applications must also use shared storage, while operating in a way such that its data is protected from corruption.

{| class="wikitable"
|+ Top500.org Cluster computers 2008 - 2013
|-
! Date of #1 Rank
! Name
! Number of Cores/Nodes
! Specifications
! Peak Performance
! Power Usage
! Information
|- valign="top"
| 2009 Jun
| Roadrunner
|
* 129,600 Cores
* 6,480 computing nodes
|
* AMD Opteron 2210 2-core
* IBM PowerXCell8i 8+1 cores
* 104 Terabytes RAM
* Infiniband interconnect
* OS - REHL and Fedora Linux
| 1.46 Petaflops
| 2.5 Megawatts
| Built by IBM, housed in NM, US
|- valign="top"
| 2010 Jun
| Jaguar
|
* 224,162 Cores
* 18,688 computing nodes
|
* AMD Opteron 2435 6-core
* AMD Opteron 1354 4-core
* 360 Terabytes RAM
* Cray Seastar2+, Infiniband interconnects
* OS - Cray Linux
| 2.33 Petaflops
| 7.0 Megawatts
| Built by Cray, housed in Tennessee, US
|- valign="top"
| 2010 Nov
| Tianhe-1A
|
* 186,368 Cores
* 7,168 computing nodes
|
* 2 Xeon X5670 6-core CPUs per node
* 1 Nvidia M2050 GPU per node
* 262 Terabytes RAM
* Arch interconnect (NUDT)
* OS - Linux variant
| 4.7 Petaflops
| 4.0 Megawatts
| Built by NUDT, China
|- valign="top"
| 2011 Nov
| K Computer
|
* 705,024 Cores
* 96 computing nodes
|
* 2.0GHz 8-core SPARC64 VIIIfx
* 6 I/O nodes
* Using Message Passing Interface
* Tofu 6-dimensional torus interconnect
* OS - Linux variant
| 11.28 Petaflops
| 9.89 Megawatts
| Built by Fujitsu, housed in Japan
|- valign="top"
| 2012 Jun
| Sequoia
|
* 1,572,864 Cores
* 98,304 computing nodes
|
* 16-core PowerPC A2, Blue Gene/Q
* 1.5 Petabytes RAM
* 5-dimensional torus interconnect
* OS - Linux variant
| 20.13 Petaflops
| 7.9 Megawatts
| Built by IBM, housed in California, US
|- valign="top"
| 2012 Nov
| Titan
|
* 560,640 computing cores
|
* AMD Opertons CPUs
* Nvidia Tesla GPUs
* 693 Terabytes RAM (CPU + GPU)
* Cray Gemini interconnect
* OS - Cray Linux
| 27.11 Petaflops
| 8.2 Megawatts
| Built by Cray, housed in California, US
|- valign="top"
| 2013 Jun
| Tianhe-2
|
* 3,120,000 Cores
* 16,000 nodes
|
* 2 Intel Xeon IvyBridge per node
* 3 Intel Xeon Phi per node
* 1.34 Petabytes RAM
* TH Express-2 fat tree topology (NUDT)
* OS - NUDT Kylin Linux
| 54.9 Petaflops
| 17.6 Megawatts
| Built by NUDT, China
|}

===Trends===
In 2011 the fastest super computer was Japan's K Computer, a cluster computer built by Fujitsu. Six months later, Sequoia replaced the K Computer as the top ranking cluster computer with a performance of 20.13 petaflops, a seventy-eight percent increase. Titan replaced the Sequoia as number in November 2012, with performance thirty-four percent greater than it's predecessor. The June 2013 top leader, Tianhe-2, displaced Titan with a one-hundred percent increase in performance.

Since 2008, super computers have trended towards using multi-core processors in the architecture. As of 2013, according to Top500.org data, trends have been to use processors with a high number of cores, eight or more. Most use computing nodes with multiple multi-core CPUs.

Graphical trends for super computers 2008-2013:
* [[Media:Top500_cores-per-socket.png|Top500.org Cores per socket]]
* [[Media:Top500_cores-per-socket-performance.png|Top500.org Performance for cores per socket]]
* [[Media:Top500 interconnect-family.png|Top500.org Interconnects used for super computers]]
* [[Media:Top500 vendors.png|Top500.org Vendor trends of super computers]]

==Mobile Processors==
Due to the popularity of smart phones, there has been significant development on mobile processors. This category of processors has been specifically designed for low power use. To conserve power, these types of processors use dynamic frequency scaling. This technology allows the processor to run at varying clock frequencies based on the current load.
{| class="wikitable"
|+ Examples of current mobile processors
|-
! Aspects
! Intel Atom N2800
! ARM Cortex-A9
|-
! # Cores
| 2
| 2
|-
! Clock Freq
| 1.86GHz
| 800MHz-2000MHz
|-
! Cache
| 1MB L2
| 4MB L2
|-
! Power
| 35 W
| .5W-1.9W
|}

==Sources==
<ol>
<li>http://en.wikipedia.org/wiki/Transistor_count</li>
<li>http://ark.intel.com/products/52220/Intel-Core-i3-2310M-Processor-%283M-Cache-2_10-GHz%29</li>
<li>http://www.tomshardware.com/news/intel-ivy-bridge-22nm-cpu-3d-transistor,14093.html</li>
<li>http://www.anandtech.com/show/5091/intel-core-i7-3960x-sandy-bridge-e-review-keeping-the-high-end-alive</li>
<li>http://www.chiplist.com/Intel_Core_2_Duo_E4xxx_series_processor_Allendale/tree3f-subsection--2249-/</li>
<li>http://www.pcper.com/reviews/Processors/Intel-Lynnfield-Core-i7-870-and-Core-i5-750-Processor-Review</li>
<li>http://www.intel.com/pressroom/kits/quickreffam.htm#Xeon</li>
<li>http://www.tomshardware.com/reviews/core-i7-980x-gulftown,2573-2.html</li>
<li>http://www.fujitsu.com/global/news/pr/archives/month/2011/20111102-02.html</li>
<li>http://ark.intel.com/products/61275</li>
<li>http://www.anandtech.com/show/5096/amd-releases-opteron-4200-valencia-and-6200-interlagos-series</li>
<li>http://www.arm.com/products/processors/cortex-a/cortex-a9.php</li>
<li>http://ark.intel.com/products/58917/Intel-Atom-Processor-N2800-(1M-Cache-1_86-GHz)</li>
<li>http://en.wikipedia.org/wiki/SPARC64_VI#SPARC64_VIIIfx</li>
<li>http://en.wikipedia.org/wiki/High-availability_cluster</li>
</ol>

Main Page/CSC 456 Fall 2013/1a bc

2013-09-24T22:26:16Z

Cmbeverl: /* Trends */

Since 2006, parallel computers have continued to evolve. Besides the increasing number of transistors (as predicted by Moore's law), other designs and architectures have increased in prominence. These include Chip Multi-Processors, cluster computing, and mobile processors.

==Transistor Count==
At the most fundamental level of parallel computing development is the transistor count. According to the text, since 1971 the number of transistors on a chip has increased from 2,300 to 167 million in 2006. By 2011, the transistor count had further increased to 2.6 billion, a 1,130,434x increase from 1971. The clock frequency has also continued to rise, if a bit slower since 2006. In 2006, the clock speed was around 2.4GHz, 3,200 times the speed of 750KHz in 1971. By 2011, the high end clock speed of a processor is in the 3.3GHz range.

====Evolution of Intel Processors====
{| class="wikitable"
|+ Table 1.1: Evolution of Intel Processors
|-
! From
! Procs
! Transistors
! Specifications
! New Features
|-
| 2000
| Pentium IV
| 55 Million
| 1.4-3GHz
| hyper-pipelining, SMT
|-
| 2006
| Xeon
| 167 Million
| 64-bit, 2GHz, 4MB L2 cache on chip
| Dual core, virtualization support
|-
| 2007
| Core 2 Allendale
| 167 Million
| 1.8-2.6 GHz, 2MB L2 cache
| 2 CPUs on one die, Trusted Execution Technology
|-
| 2008
| Xeon
| 820 Million
| 2.5-2.83 GHz, 6MB L3 cache
|
|-
| 2009
| Core i7 Lynnfield
| 774 Million
| 2.66-2.93 GHz, 8MB L3 cache
| 2-channel DDR3
|-
| 2010
| Core i7 Gulftown
| 1.17 Billion
| 3.2 GHz
| 32 nm
|-
| 2011
| Core i7 Sandy Bridge EP4
| 1.2 Billion
| 3.2-3.3 GHz, 32 KB L1 cache per core, 256 KB L2 cache, 20 MB L3 cache
| Up to 8 cores
|-
|2012
| Core i7 Ivy Bridge
| 1.2 Billion
| 2.5-3.7 GHz
| 22 nm, 3D Tri-gate transistors
|-
|2013
| Core Haswell
| 1.4 Billion
| 2.5-3.7 GHz
| Fully integrated voltage regulator
|}

==Chip Multi-Processors==

With the sophistication of processors and increasing clock speeds, effort was placed on parallelism. The high clock speed could be broken down into a large pipeline; this large pipeline allowed big performance gains with instruction level parallelism (ILP). Instruction level parallelism is the act of executing multiple instructions at the same time. This would be implemented in a single core, with each stage of the pipeline being executed in each clock cycle. By the 1970s the gains from ILP were significant enough to allow uni-processor systems to reach the level of performance in parallel computers after only a few years. This inhibited adoption of multi-processors as it was costly and not needed. Of course, the performance gains of ILP was soon limited. Once branch prediction had a success rate of 90%, there was little room for further improvement. At this point, the main way of increasing performance was to increase the clock speed. This also meant more power consumption.

As the diminishing returns and power inefficiencies of ILP progressed, manufacturers began to turn towards chip multi-processors (i.e. multicore architectures). These systems allowed task parallelism in addition to ILP. For example, one processor can execute multiple tasks simultaneously, and each core can use ILP with pipelining. Driven by the gains of multi-processors, the amount of cores on a chip has continued to increase since 2006. By 2011, Intel and IBM were producing 8-core processors. For servers, AMD was producing up to 16-core processors.
{| class="wikitable"
|+ Table 1.2: Examples of current multicore processors
|-
! Aspects
! Intel Sandy Bridge
! AMD Valencia
! IBM POWER7
|-
! # Cores
| 4
| 8
| 8
|-
! Clock Freq.
| 3.5GHz
| 3.3GHz
| 3.55GHz
|-
! Clock Type
| OOO Superscalar
| OOO Superscalar
| SIMD
|-
! Caches
| 8MB L3
| 8MB L3
| 32MB L3
|-
! Chip Power
| 95 Watts
| 95 Watts
| 650 Watts for the whole system
|}

==Cluster Computers==
The 1990s saw a rise in the use of cluster computers, or distributed super computers. These systems take advantage of the power of individual processors, and combine them to create a powerful unified system. Originally, cluster computers only used uniprocessors, but have since adopted the use of multi-processors. Unfortunately, the cost advantage mentioned by the book has largely dissipated, as many current implementations use expensive, high-end hardware.

One of the newer innovations in cluster computers is high-availability. These types of clusters operate with redundant nodes to minimize downtime when components fail. Such a system uses automated load-balancing algorithms to route traffic when a node fails. In order to function, high-availability clusters must be able to check and change the status of running applications. The applications must also use shared storage, while operating in a way such that its data is protected from corruption.

{| class="wikitable"
|+ Top500.org Cluster computers 2008 - 2013
|-
! Date of #1 Rank
! Name
! Number of Cores/Nodes
! Specifications
! Peak Performance
! Power Usage
! Information
|- valign="top"
| 2009 Jun
| Roadrunner
|
* 129,600 Cores
* 6,480 computing nodes
|
* AMD Opteron 2210 2-core
* IBM PowerXCell8i 8+1 cores
* 104 Terabytes RAM
* Infiniband interconnect
* OS - REHL and Fedora Linux
| 1.46 Petaflops
| 2.5 Megawatts
| Built by IBM, housed in NM, US
|- valign="top"
| 2010 Jun
| Jaguar
|
* 224,162 Cores
* 18,688 computing nodes
|
* AMD Opteron 2435 6-core
* AMD Opteron 1354 4-core
* 360 Terabytes RAM
* Cray Seastar2+, Infiniband interconnects
* OS - Cray Linux
| 2.33 Petaflops
| 7.0 Megawatts
| Built by Cray, housed in Tennessee, US
|- valign="top"
| 2010 Nov
| Tianhe-1A
|
* 186,368 Cores
* 7,168 computing nodes
|
* 2 Xeon X5670 6-core CPUs per node
* 1 Nvidia M2050 GPU per node
* 262 Terabytes RAM
* Arch interconnect (NUDT)
* OS - Linux variant
| 4.7 Petaflops
| 4.0 Megawatts
| Built by NUDT, China
|- valign="top"
| 2011 Nov
| K Computer
|
* 705,024 Cores
* 96 computing nodes
|
* 2.0GHz 8-core SPARC64 VIIIfx
* 6 I/O nodes
* Using Message Passing Interface
* Tofu 6-dimensional torus interconnect
* OS - Linux variant
| 11.28 Petaflops
| 9.89 Megawatts
| Built by Fujitsu, housed in Japan
|- valign="top"
| 2012 Jun
| Sequoia
|
* 1,572,864 Cores
* 98,304 computing nodes
|
* 16-core PowerPC A2, Blue Gene/Q
* 1.5 Petabytes RAM
* 5-dimensional torus interconnect
* OS - Linux variant
| 20.13 Petaflops
| 7.9 Megawatts
| Built by IBM, housed in California, US
|- valign="top"
| 2012 Nov
| Titan
|
* 560,640 computing cores
|
* AMD Opertons CPUs
* Nvidia Tesla GPUs
* 693 Terabytes RAM (CPU + GPU)
* Cray Gemini interconnect
* OS - Cray Linux
| 27.11 Petaflops
| 8.2 Megawatts
| Built by Cray, housed in California, US
|- valign="top"
| 2013 Jun
| Tianhe-2
|
* 3,120,000 Cores
* 16,000 nodes
|
* 2 Intel Xeon IvyBridge per node
* 3 Intel Xeon Phi per node
* 1.34 Petabytes RAM
* TH Express-2 fat tree topology (NUDT)
* OS - NUDT Kylin Linux
| 54.9 Petaflops
| 17.6 Megawatts
| Built by NUDT, China
|}

===Trends===
In 2011 the fastest super computer was Japan's K Computer, a cluster computer built by Fujitsu. Six months later, Sequoia replaced the K Computer as the top ranking cluster computer with a performance of 20.13 petaflops, a seventy-eight percent increase. Titan replaced the Sequoia as number in November 2012, with performance thirty-four percent greater than it's predecessor. The June 2013 top leader, Tianhe-2, displaced Titan with a one-hundred percent increase in performance.

Since 2008, super computers have trended towards using multi-core processors in the architecture. As of 2013, according to Top500.org data, trends have been to use processors with a high number of cores, eight or more. Most use computing nodes with multiple multi-core CPUs.

Graphical trends for super computers 2008-2013:
* [[Media:Top500_cores-per-socket.png|Top500.org Cores per socket]]
* [[Media:Top500_cores-per-socket-performance.png|Top500.org Performance for cores per socket]]
* [[Media:Top500 interconnect-family.png|Top500.org Interconnects used for super computers]]
* [[Media:Top500 vendors.png|Top500.org Vendor trends of super computers]]

==Mobile Processors==
Due to the popularity of smart phones, there has been significant development on mobile processors. This category of processors has been specifically designed for low power use. To conserve power, these types of processors use dynamic frequency scaling. This technology allows the processor to run at varying clock frequencies based on the current load.
{| class="wikitable"
|+ Examples of current mobile processors
|-
! Aspects
! Intel Atom N2800
! ARM Cortex-A9
|-
! # Cores
| 2
| 2
|-
! Clock Freq
| 1.86GHz
| 800MHz-2000MHz
|-
! Cache
| 1MB L2
| 4MB L2
|-
! Power
| 35 W
| .5W-1.9W
|}

==Sources==
<ol>
<li>http://en.wikipedia.org/wiki/Transistor_count</li>
<li>http://ark.intel.com/products/52220/Intel-Core-i3-2310M-Processor-%283M-Cache-2_10-GHz%29</li>
<li>http://www.tomshardware.com/news/intel-ivy-bridge-22nm-cpu-3d-transistor,14093.html</li>
<li>http://www.anandtech.com/show/5091/intel-core-i7-3960x-sandy-bridge-e-review-keeping-the-high-end-alive</li>
<li>http://www.chiplist.com/Intel_Core_2_Duo_E4xxx_series_processor_Allendale/tree3f-subsection--2249-/</li>
<li>http://www.pcper.com/reviews/Processors/Intel-Lynnfield-Core-i7-870-and-Core-i5-750-Processor-Review</li>
<li>http://www.intel.com/pressroom/kits/quickreffam.htm#Xeon</li>
<li>http://www.tomshardware.com/reviews/core-i7-980x-gulftown,2573-2.html</li>
<li>http://www.fujitsu.com/global/news/pr/archives/month/2011/20111102-02.html</li>
<li>http://ark.intel.com/products/61275</li>
<li>http://www.anandtech.com/show/5096/amd-releases-opteron-4200-valencia-and-6200-interlagos-series</li>
<li>http://www.arm.com/products/processors/cortex-a/cortex-a9.php</li>
<li>http://ark.intel.com/products/58917/Intel-Atom-Processor-N2800-(1M-Cache-1_86-GHz)</li>
<li>http://en.wikipedia.org/wiki/SPARC64_VI#SPARC64_VIIIfx</li>
<li>http://en.wikipedia.org/wiki/High-availability_cluster</li>
</ol>

File:Top500 vendors.png

2013-09-24T22:14:12Z

Cmbeverl: Top500.org Trend 2008-2013 Vendors of super computers

Top500.org Trend 2008-2013
Vendors of super computers

File:Top500 interconnect-family.png

2013-09-24T22:13:54Z

Cmbeverl: Top500.org Trend 2008-2013 Interconnects used

Top500.org Trend 2008-2013
Interconnects used

File:Top500 cores-per-socket-performance.png

2013-09-24T22:13:36Z

Cmbeverl: Top500.org Trend 2008-2013 Performance of cores per processor

Top500.org Trend 2008-2013
Performance of cores per processor

File:Top500 cores-per-socket.png

2013-09-24T22:12:51Z

Cmbeverl: Top500.org Trend 2008-2013 Number of cores per processor

Top500.org Trend 2008-2013
Number of cores per processor

Main Page/CSC 456 Fall 2013/1a bc

2013-09-24T21:49:00Z

Cmbeverl: /* Cluster Computers */

Since 2006, parallel computers have continued to evolve. Besides the increasing number of transistors (as predicted by Moore's law), other designs and architectures have increased in prominence. These include Chip Multi-Processors, cluster computing, and mobile processors.

==Transistor Count==
At the most fundamental level of parallel computing development is the transistor count. According to the text, since 1971 the number of transistors on a chip has increased from 2,300 to 167 million in 2006. By 2011, the transistor count had further increased to 2.6 billion, a 1,130,434x increase from 1971. The clock frequency has also continued to rise, if a bit slower since 2006. In 2006, the clock speed was around 2.4GHz, 3,200 times the speed of 750KHz in 1971. By 2011, the high end clock speed of a processor is in the 3.3GHz range.

====Evolution of Intel Processors====
{| class="wikitable"
|+ Table 1.1: Evolution of Intel Processors
|-
! From
! Procs
! Transistors
! Specifications
! New Features
|-
| 2000
| Pentium IV
| 55 Million
| 1.4-3GHz
| hyper-pipelining, SMT
|-
| 2006
| Xeon
| 167 Million
| 64-bit, 2GHz, 4MB L2 cache on chip
| Dual core, virtualization support
|-
| 2007
| Core 2 Allendale
| 167 Million
| 1.8-2.6 GHz, 2MB L2 cache
| 2 CPUs on one die, Trusted Execution Technology
|-
| 2008
| Xeon
| 820 Million
| 2.5-2.83 GHz, 6MB L3 cache
|
|-
| 2009
| Core i7 Lynnfield
| 774 Million
| 2.66-2.93 GHz, 8MB L3 cache
| 2-channel DDR3
|-
| 2010
| Core i7 Gulftown
| 1.17 Billion
| 3.2 GHz
| 32 nm
|-
| 2011
| Core i7 Sandy Bridge EP4
| 1.2 Billion
| 3.2-3.3 GHz, 32 KB L1 cache per core, 256 KB L2 cache, 20 MB L3 cache
| Up to 8 cores
|-
|2012
| Core i7 Ivy Bridge
| 1.2 Billion
| 2.5-3.7 GHz
| 22 nm, 3D Tri-gate transistors
|-
|2013
| Core Haswell
| 1.4 Billion
| 2.5-3.7 GHz
| Fully integrated voltage regulator
|}

==Chip Multi-Processors==

With the sophistication of processors and increasing clock speeds, effort was placed on parallelism. The high clock speed could be broken down into a large pipeline; this large pipeline allowed big performance gains with instruction level parallelism (ILP). Instruction level parallelism is the act of executing multiple instructions at the same time. This would be implemented in a single core, with each stage of the pipeline being executed in each clock cycle. By the 1970s the gains from ILP were significant enough to allow uni-processor systems to reach the level of performance in parallel computers after only a few years. This inhibited adoption of multi-processors as it was costly and not needed. Of course, the performance gains of ILP was soon limited. Once branch prediction had a success rate of 90%, there was little room for further improvement. At this point, the main way of increasing performance was to increase the clock speed. This also meant more power consumption.

As the diminishing returns and power inefficiencies of ILP progressed, manufacturers began to turn towards chip multi-processors (i.e. multicore architectures). These systems allowed task parallelism in addition to ILP. For example, one processor can execute multiple tasks simultaneously, and each core can use ILP with pipelining. Driven by the gains of multi-processors, the amount of cores on a chip has continued to increase since 2006. By 2011, Intel and IBM were producing 8-core processors. For servers, AMD was producing up to 16-core processors.
{| class="wikitable"
|+ Table 1.2: Examples of current multicore processors
|-
! Aspects
! Intel Sandy Bridge
! AMD Valencia
! IBM POWER7
|-
! # Cores
| 4
| 8
| 8
|-
! Clock Freq.
| 3.5GHz
| 3.3GHz
| 3.55GHz
|-
! Clock Type
| OOO Superscalar
| OOO Superscalar
| SIMD
|-
! Caches
| 8MB L3
| 8MB L3
| 32MB L3
|-
! Chip Power
| 95 Watts
| 95 Watts
| 650 Watts for the whole system
|}

==Cluster Computers==
The 1990s saw a rise in the use of cluster computers, or distributed super computers. These systems take advantage of the power of individual processors, and combine them to create a powerful unified system. Originally, cluster computers only used uniprocessors, but have since adopted the use of multi-processors. Unfortunately, the cost advantage mentioned by the book has largely dissipated, as many current implementations use expensive, high-end hardware.

One of the newer innovations in cluster computers is high-availability. These types of clusters operate with redundant nodes to minimize downtime when components fail. Such a system uses automated load-balancing algorithms to route traffic when a node fails. In order to function, high-availability clusters must be able to check and change the status of running applications. The applications must also use shared storage, while operating in a way such that its data is protected from corruption.

{| class="wikitable"
|+ Top500.org Cluster computers 2008 - 2013
|-
! Date of #1 Rank
! Name
! Number of Cores/Nodes
! Specifications
! Peak Performance
! Power Usage
! Information
|- valign="top"
| 2009 Jun
| Roadrunner
|
* 129,600 Cores
* 6,480 computing nodes
|
* AMD Opteron 2210 2-core
* IBM PowerXCell8i 8+1 cores
* 104 Terabytes RAM
* Infiniband interconnect
* OS - REHL and Fedora Linux
| 1.46 Petaflops
| 2.5 Megawatts
| Built by IBM, housed in NM, US
|- valign="top"
| 2010 Jun
| Jaguar
|
* 224,162 Cores
* 18,688 computing nodes
|
* AMD Opteron 2435 6-core
* AMD Opteron 1354 4-core
* 360 Terabytes RAM
* Cray Seastar2+, Infiniband interconnects
* OS - Cray Linux
| 2.33 Petaflops
| 7.0 Megawatts
| Built by Cray, housed in Tennessee, US
|- valign="top"
| 2010 Nov
| Tianhe-1A
|
* 186,368 Cores
* 7,168 computing nodes
|
* 2 Xeon X5670 6-core CPUs per node
* 1 Nvidia M2050 GPU per node
* 262 Terabytes RAM
* Arch interconnect (NUDT)
* OS - Linux variant
| 4.7 Petaflops
| 4.0 Megawatts
| Built by NUDT, China
|- valign="top"
| 2011 Nov
| K Computer
|
* 705,024 Cores
* 96 computing nodes
|
* 2.0GHz 8-core SPARC64 VIIIfx
* 6 I/O nodes
* Using Message Passing Interface
* Tofu 6-dimensional torus interconnect
* OS - Linux variant
| 11.28 Petaflops
| 9.89 Megawatts
| Built by Fujitsu, housed in Japan
|- valign="top"
| 2012 Jun
| Sequoia
|
* 1,572,864 Cores
* 98,304 computing nodes
|
* 16-core PowerPC A2, Blue Gene/Q
* 1.5 Petabytes RAM
* 5-dimensional torus interconnect
* OS - Linux variant
| 20.13 Petaflops
| 7.9 Megawatts
| Built by IBM, housed in California, US
|- valign="top"
| 2012 Nov
| Titan
|
* 560,640 computing cores
|
* AMD Opertons CPUs
* Nvidia Tesla GPUs
* 693 Terabytes RAM (CPU + GPU)
* Cray Gemini interconnect
* OS - Cray Linux
| 27.11 Petaflops
| 8.2 Megawatts
| Built by Cray, housed in California, US
|- valign="top"
| 2013 Jun
| Tianhe-2
|
* 3,120,000 Cores
* 16,000 nodes
|
* 2 Intel Xeon IvyBridge per node
* 3 Intel Xeon Phi per node
* 1.34 Petabytes RAM
* TH Express-2 fat tree topology (NUDT)
* OS - NUDT Kylin Linux
| 54.9 Petaflops
| 17.6 Megawatts
| Built by NUDT, China
|}

===Trends===
In 2011 the fastest super computer was Japan's K Computer, a cluster computer built by Fujitsu. Six months later, Sequoia replaced the K Computer as the top ranking cluster computer with a performance of 20.13 petaflops, a seventy-eight percent increase. Titan replaced the Sequoia as number in November 2012, with performance 34% greater than it's predecessor. The June 2013 top leader, Tianhe-2, displaced Titan with a one-hundred percent increase in performance.

==Mobile Processors==
Due to the popularity of smart phones, there has been significant development on mobile processors. This category of processors has been specifically designed for low power use. To conserve power, these types of processors use dynamic frequency scaling. This technology allows the processor to run at varying clock frequencies based on the current load.
{| class="wikitable"
|+ Examples of current mobile processors
|-
! Aspects
! Intel Atom N2800
! ARM Cortex-A9
|-
! # Cores
| 2
| 2
|-
! Clock Freq
| 1.86GHz
| 800MHz-2000MHz
|-
! Cache
| 1MB L2
| 4MB L2
|-
! Power
| 35 W
| .5W-1.9W
|}

==Sources==
<ol>
<li>http://en.wikipedia.org/wiki/Transistor_count</li>
<li>http://ark.intel.com/products/52220/Intel-Core-i3-2310M-Processor-%283M-Cache-2_10-GHz%29</li>
<li>http://www.tomshardware.com/news/intel-ivy-bridge-22nm-cpu-3d-transistor,14093.html</li>
<li>http://www.anandtech.com/show/5091/intel-core-i7-3960x-sandy-bridge-e-review-keeping-the-high-end-alive</li>
<li>http://www.chiplist.com/Intel_Core_2_Duo_E4xxx_series_processor_Allendale/tree3f-subsection--2249-/</li>
<li>http://www.pcper.com/reviews/Processors/Intel-Lynnfield-Core-i7-870-and-Core-i5-750-Processor-Review</li>
<li>http://www.intel.com/pressroom/kits/quickreffam.htm#Xeon</li>
<li>http://www.tomshardware.com/reviews/core-i7-980x-gulftown,2573-2.html</li>
<li>http://www.fujitsu.com/global/news/pr/archives/month/2011/20111102-02.html</li>
<li>http://ark.intel.com/products/61275</li>
<li>http://www.anandtech.com/show/5096/amd-releases-opteron-4200-valencia-and-6200-interlagos-series</li>
<li>http://www.arm.com/products/processors/cortex-a/cortex-a9.php</li>
<li>http://ark.intel.com/products/58917/Intel-Atom-Processor-N2800-(1M-Cache-1_86-GHz)</li>
<li>http://en.wikipedia.org/wiki/SPARC64_VI#SPARC64_VIIIfx</li>
<li>http://en.wikipedia.org/wiki/High-availability_cluster</li>
</ol>

Main Page/CSC 456 Fall 2013/1a bc

2013-09-24T21:46:23Z

Cmbeverl: /* Cluster Computers */

Since 2006, parallel computers have continued to evolve. Besides the increasing number of transistors (as predicted by Moore's law), other designs and architectures have increased in prominence. These include Chip Multi-Processors, cluster computing, and mobile processors.

==Transistor Count==
At the most fundamental level of parallel computing development is the transistor count. According to the text, since 1971 the number of transistors on a chip has increased from 2,300 to 167 million in 2006. By 2011, the transistor count had further increased to 2.6 billion, a 1,130,434x increase from 1971. The clock frequency has also continued to rise, if a bit slower since 2006. In 2006, the clock speed was around 2.4GHz, 3,200 times the speed of 750KHz in 1971. By 2011, the high end clock speed of a processor is in the 3.3GHz range.

====Evolution of Intel Processors====
{| class="wikitable"
|+ Table 1.1: Evolution of Intel Processors
|-
! From
! Procs
! Transistors
! Specifications
! New Features
|-
| 2000
| Pentium IV
| 55 Million
| 1.4-3GHz
| hyper-pipelining, SMT
|-
| 2006
| Xeon
| 167 Million
| 64-bit, 2GHz, 4MB L2 cache on chip
| Dual core, virtualization support
|-
| 2007
| Core 2 Allendale
| 167 Million
| 1.8-2.6 GHz, 2MB L2 cache
| 2 CPUs on one die, Trusted Execution Technology
|-
| 2008
| Xeon
| 820 Million
| 2.5-2.83 GHz, 6MB L3 cache
|
|-
| 2009
| Core i7 Lynnfield
| 774 Million
| 2.66-2.93 GHz, 8MB L3 cache
| 2-channel DDR3
|-
| 2010
| Core i7 Gulftown
| 1.17 Billion
| 3.2 GHz
| 32 nm
|-
| 2011
| Core i7 Sandy Bridge EP4
| 1.2 Billion
| 3.2-3.3 GHz, 32 KB L1 cache per core, 256 KB L2 cache, 20 MB L3 cache
| Up to 8 cores
|-
|2012
| Core i7 Ivy Bridge
| 1.2 Billion
| 2.5-3.7 GHz
| 22 nm, 3D Tri-gate transistors
|-
|2013
| Core Haswell
| 1.4 Billion
| 2.5-3.7 GHz
| Fully integrated voltage regulator
|}

==Chip Multi-Processors==

With the sophistication of processors and increasing clock speeds, effort was placed on parallelism. The high clock speed could be broken down into a large pipeline; this large pipeline allowed big performance gains with instruction level parallelism (ILP). Instruction level parallelism is the act of executing multiple instructions at the same time. This would be implemented in a single core, with each stage of the pipeline being executed in each clock cycle. By the 1970s the gains from ILP were significant enough to allow uni-processor systems to reach the level of performance in parallel computers after only a few years. This inhibited adoption of multi-processors as it was costly and not needed. Of course, the performance gains of ILP was soon limited. Once branch prediction had a success rate of 90%, there was little room for further improvement. At this point, the main way of increasing performance was to increase the clock speed. This also meant more power consumption.

As the diminishing returns and power inefficiencies of ILP progressed, manufacturers began to turn towards chip multi-processors (i.e. multicore architectures). These systems allowed task parallelism in addition to ILP. For example, one processor can execute multiple tasks simultaneously, and each core can use ILP with pipelining. Driven by the gains of multi-processors, the amount of cores on a chip has continued to increase since 2006. By 2011, Intel and IBM were producing 8-core processors. For servers, AMD was producing up to 16-core processors.
{| class="wikitable"
|+ Table 1.2: Examples of current multicore processors
|-
! Aspects
! Intel Sandy Bridge
! AMD Valencia
! IBM POWER7
|-
! # Cores
| 4
| 8
| 8
|-
! Clock Freq.
| 3.5GHz
| 3.3GHz
| 3.55GHz
|-
! Clock Type
| OOO Superscalar
| OOO Superscalar
| SIMD
|-
! Caches
| 8MB L3
| 8MB L3
| 32MB L3
|-
! Chip Power
| 95 Watts
| 95 Watts
| 650 Watts for the whole system
|}

==Cluster Computers==
The 1990s saw a rise in the use of cluster computers, or distributed super computers. These systems take advantage of the power of individual processors, and combine them to create a powerful unified system. Originally, cluster computers only used uniprocessors, but have since adopted the use of multi-processors. Unfortunately, the cost advantage mentioned by the book has largely dissipated, as many current implementations use expensive, high-end hardware.

One of the newer innovations in cluster computers is high-availability. These types of clusters operate with redundant nodes to minimize downtime when components fail. Such a system uses automated load-balancing algorithms to route traffic when a node fails. In order to function, high-availability clusters must be able to check and change the status of running applications. The applications must also use shared storage, while operating in a way such that its data is protected from corruption.

{| class="wikitable"
|+ Top500.org Cluster computers 2008 - 2013
|-
! Date of #1 Rank
! Name
! Number of Cores/Nodes
! Specifications
! Peak Performance
! Power Usage
! Information
|- valign="top"
| 2009 Jun
| Roadrunner
|
* 129,600 Cores
* 6,480 computing nodes
|
* AMD Opteron 2210 2-core
* IBM PowerXCell8i 8+1 cores
* 104 Terabytes RAM
* Infiniband interconnect
* OS - REHL and Fedora Linux
| 1.46 Petaflops
| 2.5 Megawatts
| Built by IBM, housed in NM, US
|- valign="top"
| 2010 Jun
| Jaguar
|
* 224,162 Cores
* 18,688 computing nodes
|
* AMD Opteron 2435 6-core
* AMD Opteron 1354 4-core
* 360 Terabytes RAM
* Cray Seastar2+, Infiniband interconnects
* OS - Cray Linux
| 2.33 Petaflops
| 7.0 Megawatts
| Built by Cray, house in Tennessee, US
|- valign="top"
| 2010 Nov
| Tianhe-1A
|
* 186,368 Cores
* 7,168 computing nodes
|
* 2 Xeon X5670 6-core CPUs per node
* 1 Nvidia M2050 GPU per node
* 262 Terabytes RAM
* Arch interconnect (NUDT)
* OS - Linux variant
| 4.7 Petaflops
| 4.0 Megawatts
| Built by NUDT, China
|- valign="top"
| 2011 Nov
| K Computer
|
* 705,024 Cores
* 96 computing nodes
|
* 2.0GHz 8-core SPARC64 VIIIfx
* 6 I/O nodes
* Using Message Passing Interface
* Tofu 6-dimensional torus interconnect
* OS - Linux variant
| 11.28 Petaflops
| 9.89 Megawatts
| Built by Fujitsu, Housed in Japan
|- valign="top"
| 2012 Jun
| Sequoia
|
* 1,572,864 Cores
* 98,304 computing nodes
|
* 16-core PowerPC A2, Blue Gene/Q
* 1.5 Petabytes RAM
* 5-dimensional torus interconnect
* OS - Linux variant
| 20.13 Petaflops
| 7.9 Megawatts
| Built by IBM, Housed in California, US
|- valign="top"
| 2012 Nov
| Titan
|
* 560,640 computing cores
|
* AMD Opertons CPUs
* Nvidia Tesla GPUs
* 693 Terabytes RAM (CPU + GPU)
* Cray Gemini interconnect
* OS - Cray Linux
| 27.11 Petaflops
| 8.2 Megawatts
| Built by Cray, housed in California, US
|- valign="top"
| 2013 Jun
| Tianhe-2
|
* 3,120,000 Cores
* 16,000 nodes
|
* 2 Intel Xeon IvyBridge per node
* 3 Intel Xeon Phi per node
* 1.34 Petabytes RAM
* TH Express-2 fat tree topology (NUDT)
* OS - NUDT Kylin Linux
| 54.9 Petaflops
| 17.6 Megawatts
| Built by NUDT, China
|}

===Trends===
In 2011 the fastest super computer was Japan's K Computer, a cluster computer built by Fujitsu. Six months later, Sequoia replaced the K Computer as the top ranking cluster computer with a performance of 20.13 petaflops, a seventy-eight percent increase. Titan replaced the Sequoia as number in November 2012, with performance 34% greater than it's predecessor. The June 2013 top leader, Tianhe-2, displaced Titan with a one-hundred percent increase in performance.

==Mobile Processors==
Due to the popularity of smart phones, there has been significant development on mobile processors. This category of processors has been specifically designed for low power use. To conserve power, these types of processors use dynamic frequency scaling. This technology allows the processor to run at varying clock frequencies based on the current load.
{| class="wikitable"
|+ Examples of current mobile processors
|-
! Aspects
! Intel Atom N2800
! ARM Cortex-A9
|-
! # Cores
| 2
| 2
|-
! Clock Freq
| 1.86GHz
| 800MHz-2000MHz
|-
! Cache
| 1MB L2
| 4MB L2
|-
! Power
| 35 W
| .5W-1.9W
|}

==Sources==
<ol>
<li>http://en.wikipedia.org/wiki/Transistor_count</li>
<li>http://ark.intel.com/products/52220/Intel-Core-i3-2310M-Processor-%283M-Cache-2_10-GHz%29</li>
<li>http://www.tomshardware.com/news/intel-ivy-bridge-22nm-cpu-3d-transistor,14093.html</li>
<li>http://www.anandtech.com/show/5091/intel-core-i7-3960x-sandy-bridge-e-review-keeping-the-high-end-alive</li>
<li>http://www.chiplist.com/Intel_Core_2_Duo_E4xxx_series_processor_Allendale/tree3f-subsection--2249-/</li>
<li>http://www.pcper.com/reviews/Processors/Intel-Lynnfield-Core-i7-870-and-Core-i5-750-Processor-Review</li>
<li>http://www.intel.com/pressroom/kits/quickreffam.htm#Xeon</li>
<li>http://www.tomshardware.com/reviews/core-i7-980x-gulftown,2573-2.html</li>
<li>http://www.fujitsu.com/global/news/pr/archives/month/2011/20111102-02.html</li>
<li>http://ark.intel.com/products/61275</li>
<li>http://www.anandtech.com/show/5096/amd-releases-opteron-4200-valencia-and-6200-interlagos-series</li>
<li>http://www.arm.com/products/processors/cortex-a/cortex-a9.php</li>
<li>http://ark.intel.com/products/58917/Intel-Atom-Processor-N2800-(1M-Cache-1_86-GHz)</li>
<li>http://en.wikipedia.org/wiki/SPARC64_VI#SPARC64_VIIIfx</li>
<li>http://en.wikipedia.org/wiki/High-availability_cluster</li>
</ol>