CSC/ECE 506 Fall 2007/wiki1 4 a1: Difference between revisions

Revision as of 23:22, 10 September 2007

Architectural Trends

Fig.1 MIPS R10000 Block Diagram (From Fig. 2 of [3])

Fig.2 Number of transistors on Intel chips

Feature size is the minimum size of a transistor, or the width of the wires that connect transistors and other circuit components. Feature sizes decreased from 10 microns in 1971 to 0.18 microns in 2001. These advances in integrated circuit processes make it feasible to integrate on the order of one billion transistors on a single chip, and they have enabled faster and more complex microprocessor architectures, which have evolved toward increasing parallelism: instruction-level parallelism (ILP) and thread-level parallelism (TLP). As superscalar processors came to prevail, several architectures were proposed to exploit this parallelism further: VLIW, superspeculative, simultaneous multithreading, chip multiprocessors, and so on. These techniques try to overcome the control and data hazards that grow more severe with deep pipelining and multiple issue, and to maximize computing throughput through TLP.

For example, the MIPS R10000 is a superscalar processor with out-of-order execution. It has 6.8 million transistors on a 16.64 mm x 17.934 mm (298 mm^2) die built in a 0.35 um process. It fetches 4 instructions per cycle and has 6 pipelines in total: 5 for execution and 1 for fetching and decoding. The execution pipelines fall into 3 categories: integer, floating-point, and load/store.
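The figures quoted above can be checked with a little arithmetic. The sketch below assumes an idealized scaling rule in which the area of one transistor shrinks with the square of the feature size; real processes deviate from this, and the function names are illustrative only.

```python
def density_gain(old_feature_um, new_feature_um):
    """Idealized gain in transistors per unit area when the feature size shrinks.

    Assumes the area of one transistor scales with the feature size squared.
    """
    return (old_feature_um / new_feature_um) ** 2


def die_area_mm2(width_mm, height_mm):
    """Area of a rectangular die in square millimetres."""
    return width_mm * height_mm


# Feature size: 10 microns (1971) down to 0.18 microns (2001).
gain = density_gain(10.0, 0.18)

# MIPS R10000: 6.8 million transistors on a 16.64 mm x 17.934 mm die.
r10000_area = die_area_mm2(16.64, 17.934)
r10000_density = 6.8e6 / r10000_area

print(f"idealized density gain 1971 -> 2001: about {gain:.0f}x")
print(f"R10000 die area: {r10000_area:.1f} mm^2")
print(f"R10000 average density: about {r10000_density:.0f} transistors per mm^2")
```

Under this assumption the shrink from 10 microns to 0.18 microns alone accounts for a density gain of roughly three thousand, and the die dimensions multiply out to the 298 mm^2 quoted for the R10000.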

VLIW

VLIW (Very Long Instruction Word) is one way to exploit ILP in multiple-issue processors. Multiple-issue processors come in two basic forms: superscalar and VLIW.

The main difference between superscalar and VLIW lies in how instructions are scheduled. Superscalar processors issue a varying number of instructions per clock, scheduled either statically or dynamically.

In contrast, VLIW relies on static scheduling performed by the compiler. The compiler analyzes the program's instructions and groups multiple independent instructions into one large packaged instruction. The first multiple-issue processors that required the instruction stream to be explicitly organized to avoid dependences used wide instructions with multiple operations per instruction. VLIW issues a fixed number of instructions, formatted either as one large instruction or as a fixed instruction packet, with the parallelism among instructions explicitly indicated by the instruction. For example, given a machine with 2 integer functional units like those of the MIPS R10000, a VLIW compiler can generate one long instruction that contains one integer operation per functional unit, each with its corresponding operands.
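The compiler's grouping step can be sketched as a simple in-order packer. This is a minimal illustration, not the algorithm of any real VLIW compiler: the slot layout (2 integer slots and 1 memory slot per long instruction), the `Op` record, and the sample program are all assumptions made for the example. The packer is conservative in that it refuses any register dependence inside a bundle, and it ignores operation latencies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    name: str     # label used only for printing
    unit: str     # functional unit class: "int" or "mem"
    dest: str     # destination register
    srcs: tuple   # source registers


# Hypothetical long-instruction format: 2 integer slots and 1 memory slot.
SLOTS = {"int": 2, "mem": 1}


def depends(op, earlier):
    """True if op has a register dependence on an earlier op."""
    return (earlier.dest in op.srcs        # read after write
            or op.dest == earlier.dest     # write after write
            or op.dest in earlier.srcs)    # write after read


def pack(ops, slots=SLOTS):
    """Greedily pack ops, in program order, into long instructions (bundles)."""
    bundles, current = [], []
    for op in ops:
        used = sum(1 for o in current if o.unit == op.unit)
        conflict = any(depends(op, o) for o in current)
        if used >= slots[op.unit] or conflict:
            bundles.append(current)   # close the bundle; empty slots become NOPs
            current = []
        current.append(op)
    if current:
        bundles.append(current)
    return bundles


program = [
    Op("add1", "int", "r1", ("r2", "r3")),
    Op("sub", "int", "r4", ("r5", "r6")),
    Op("lw1", "mem", "r7", ("r8",)),
    Op("add2", "int", "r9", ("r1", "r4")),   # needs the results of add1 and sub
    Op("lw2", "mem", "r10", ("r7",)),        # needs the result of lw1
]

bundles = pack(program)
for i, bundle in enumerate(bundles):
    print(i, [op.name for op in bundle])
```

For this sample program the packer produces two long instructions: the first holds `add1`, `sub`, and `lw1`, and the second holds `add2` and `lw2`, which depend on results from the first.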

E.g. Trimedia, i860

Multi-threading

Fig.3 Four different approaches to using the issue slots of a superscalar processor (Redrawn from Fig 6.44 of [1])

Multithreading exploits thread-level parallelism (TLP) within a single processor. It allows multiple threads to share the functional units of a single processor in an overlapping fashion. To support this sharing, the processor must keep a separate copy of each thread's state: register file, PC, page table, and so on. For example, when a thread must fetch data from slow memory, a multithreaded processor switches from the currently executing thread to another program or thread that is ready to execute, instead of stalling while it waits for the data. Although this does not speed up any particular program or thread, it increases the overall computing throughput of the system by reducing CPU idle time.
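The effect of switching threads on a memory stall can be illustrated with a toy cycle count. The model below is an assumption made for illustration: each thread is a string of steps, "C" for one cycle of computation and "M" for a memory access that takes one cycle to issue plus `miss_latency` cycles before the thread can continue, and thread switches are free. The function names and the sample workload are hypothetical.

```python
def cycles_without_switching(threads, miss_latency):
    """Run the threads one after another; the CPU idles during every miss."""
    return sum(1 + (miss_latency if step == "M" else 0)
               for thread in threads for step in thread)


def cycles_with_switching(threads, miss_latency):
    """Each cycle, run the lowest-numbered thread that is not waiting on memory."""
    n = len(threads)
    pc = [0] * n         # next step of each thread
    ready_at = [0] * n   # first cycle in which each thread may run again
    cycle = 0
    while (any(pc[i] < len(threads[i]) for i in range(n))
           or max(ready_at, default=0) > cycle):
        runnable = [i for i in range(n)
                    if pc[i] < len(threads[i]) and ready_at[i] <= cycle]
        if runnable:
            i = runnable[0]
            step = threads[i][pc[i]]
            pc[i] += 1
            if step == "M":
                # The thread waits for its data; others may use the CPU meanwhile.
                ready_at[i] = cycle + 1 + miss_latency
        cycle += 1   # an idle cycle if no thread was runnable
    return cycle


workload = ["CCMCC", "CCMCC"]   # two identical hypothetical threads
baseline = cycles_without_switching(workload, miss_latency=4)
overlapped = cycles_with_switching(workload, miss_latency=4)
print(baseline, overlapped)
```

In this model the two threads take 18 cycles when run back to back and 12 cycles when the processor switches on a miss. A single thread still needs 9 cycles either way, which matches the point that multithreading raises throughput without speeding up an individual thread.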

Simultaneous multithreading (SMT) is a form of multithreading that uses the resources of a multiple-issue, dynamically scheduled processor to exploit TLP. At the same time, it exploits ILP by filling the issue slots of a single clock cycle with instructions from several threads.
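The difference between issuing from one thread per cycle and issuing from several threads in the same cycle can be sketched with a toy issue-slot model. Everything here is an assumption for illustration: each thread is described by a pair (instruction count, maximum instructions it can issue per cycle), there are no stalls, and the function names are hypothetical. The model therefore captures only unused slots within a cycle, not fully idle cycles.

```python
def fine_grained_cycles(threads, width):
    """One thread issues per cycle, chosen round-robin among unfinished threads."""
    remaining = [count for count, _ in threads]
    cycle, turn = 0, 0
    while any(remaining):
        while remaining[turn] == 0:          # skip threads that have finished
            turn = (turn + 1) % len(threads)
        ilp = threads[turn][1]
        remaining[turn] -= min(ilp, width, remaining[turn])
        turn = (turn + 1) % len(threads)
        cycle += 1
    return cycle


def smt_cycles(threads, width):
    """All threads may issue in the same cycle until the issue slots run out."""
    remaining = [count for count, _ in threads]
    cycle = 0
    while any(remaining):
        free = width
        for t, (_, ilp) in enumerate(threads):
            issued = min(ilp, remaining[t], free)
            remaining[t] -= issued
            free -= issued
        cycle += 1
    return cycle


# Two hypothetical threads of 8 instructions each, able to issue at most
# 2 instructions per cycle, on a 4-issue processor.
threads = [(8, 2), (8, 2)]
fg = fine_grained_cycles(threads, width=4)
smt = smt_cycles(threads, width=4)
print(fg, smt)
```

With these numbers the one-thread-per-cycle scheme needs 8 cycles and leaves half of the issue slots empty, while the SMT scheme fills all 4 slots each cycle and finishes in 4 cycles.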

Multi-core

Fig.4 Intel® Pentium® Processor Extreme Edition die [7]

Multi-core CPUs have multiple CPU cores on a single die, connected to each other through a shared L2 or L3 cache, an on-die bus, or an on-die crossbar switch. All the CPU cores on the die share the interconnect components that interface to other processors and the rest of the system. These components include an FSB (Front Side Bus) interface, a memory controller, a cache-coherent link to other processors, and a non-coherent link to the southbridge and I/O devices. Multi-core chips do more work per clock cycle, and thus can be designed to operate at lower frequencies than their single-core counterparts. Since dynamic power consumption rises at least proportionally with frequency (faster still when the supply voltage must rise as well), multi-core architecture gives engineers a way to address the problems of power consumption and cooling requirements.
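The power argument can be made concrete with the standard first-order model of dynamic power, P = C * V^2 * f. The numbers below are assumptions chosen for illustration, not measurements: a dual-core chip whose cores each run at half the frequency and 80% of the supply voltage of a single-core chip, on a workload assumed to parallelize perfectly across the two cores. Static (leakage) power is ignored.

```python
def dynamic_power(capacitance, voltage, frequency):
    """First-order dynamic power of switching logic: P = C * V^2 * f."""
    return capacitance * voltage ** 2 * frequency


# Normalized units: the single-core chip is the reference point.
single_core = dynamic_power(capacitance=1.0, voltage=1.0, frequency=1.0)

# Hypothetical dual-core chip: each core at half the frequency and 0.8x the voltage.
dual_core = 2 * dynamic_power(capacitance=1.0, voltage=0.8, frequency=0.5)

print(f"dual-core power relative to single-core: {dual_core / single_core:.2f}")
```

Under these assumptions the two slower cores deliver the same aggregate clock rate for about 64% of the dynamic power, because the voltage reduction enters the formula squared.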

E.g. Intel Pentium Extreme Edition, Core Duo, Core 2 Duo

Speculative Execution

As processors try to extract more ILP, managing control dependences becomes both more important and more burdensome. To avoid pipeline stalls, branch prediction is applied at the instruction fetch stage. However, for a processor that executes multiple instructions per clock, accurate prediction alone is not sufficient: a wide-issue processor may need to execute a branch every clock cycle to attain maximum performance. Under speculative execution, instructions are fetched, issued, and executed as if branch predictions were always correct. When a misprediction occurs, a recovery mechanism squashes the speculative work and restores the correct state.
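As an illustration of the prediction step that speculation builds on, here is a sketch of a 2-bit saturating-counter predictor, a common textbook scheme. The text above does not say which predictor any of these processors uses, so this is a swapped-in example; the class name, the initial counter state, and the sample branch pattern are assumptions. Each misprediction counted below stands for a point where speculative work would have to be discarded.

```python
class TwoBitPredictor:
    """2-bit saturating counter: states 0 and 1 predict not taken, 2 and 3 predict taken."""

    def __init__(self):
        self.counter = 1   # assumed initial state: weakly not taken

    def predict(self):
        return self.counter >= 2

    def update(self, taken):
        if taken:
            self.counter = min(3, self.counter + 1)
        else:
            self.counter = max(0, self.counter - 1)


def count_mispredictions(outcomes):
    """Feed actual branch outcomes to the predictor and count wrong guesses."""
    predictor = TwoBitPredictor()
    wrong = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            wrong += 1   # speculative instructions past this branch are squashed
        predictor.update(taken)
    return wrong


# Hypothetical loop branch: taken 9 times, then not taken at loop exit, run 3 times.
outcomes = ([True] * 9 + [False]) * 3
mispredictions = count_mispredictions(outcomes)
print(f"{mispredictions} mispredictions out of {len(outcomes)} branches")
```

For this pattern the predictor is wrong once while warming up and once at each loop exit, 4 times out of 30 branches. The second bit of state keeps a single loop exit from flipping the prediction for the next run of the loop.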

E.g. PowerPC 603/604/G3/G4, MIPS R10000/R12000, Intel Pentium II/III/4, Alpha 21264, AMD K5/K6/Athlon

Updated Figure 1.8 & 1.9

References

[1] John L. Hennessy, David A. Patterson, "Computer Architecture: A Quantitative Approach" 3rd Ed., Morgan Kaufmann, CA, USA

[2] C.E. Kozyrakis, D.A. Patterson, "A New Direction for Computer Architecture Research", Computer Volume 31 Issue 11, IEEE, Nov. 1998, pp. 24-32

[3] K.C. Yeager, "The MIPS R10000 Superscalar Microprocessor", IEEE Micro Volume 16 Issue 2, Apr. 1996, pp. 28-41

[4] Geoff Koch, "Discovering Multi-Core: Extending the Benefits of Moore’s Law", Technology@Intel Magazine, Jul. 2005, pp. 1-6

[5] Richard Low, "Microprocessor trends: multicore, memory, and power developments", Embedded Computing Design, Sep. 2005

[6] Artur Klauser, "Trends in High-Performance Microprocessor Design", Telematik 1, 2001

[7] http://www.intel.com & http://www.intel.com/pressroom/kits/pentiumee

[8] http://www.alimartech.com/9000_servers.htm

[9] http://www.sun.com/servers/index.jsp?gr0=cpu&fl0=cpu4&gr1=

[10] http://www.sgi.com/pdfs/3867.pdf

[11] http://www-03.ibm.com/systems/p/hardware/highend/590/index.html

@@ Line 56: / Line 56: @@
 == Updated Figure 1.8 & 1.9 ==
-[[Image:fig18.jpg|800px|Figure 1.8 Number of processors in fully configured commercial bus-based shared memory multiprocessors]]
+[[Image:fig18.jpg|700px|Figure 1.8 Number of processors in fully configured commercial bus-based shared memory multiprocessors]]
 [[Image:fig19.jpg|700px|Figure 1.9 Bandwidth of the shared memory bus in commercial multiprocessors]]
