CSC/ECE 506 Fall 2007/wiki2 5 as: Difference between revisions


Revision as of 22:16, 24 September 2007

Cache size and characteristics of multi-core processors

The following lists show current trends in the caches of multicore processors.

Create a table of caches used in current multicore architectures, including such parameters as number of levels, line size, size and associativity of each level, latency of each level, whether each level is shared, and coherence protocol used. Compare this with two or three recent single-core designs.
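One way to keep the requested parameters consistent across entries is a small record type with one row per cache level. The sketch below is only an illustration: the class names `CacheLevel` and `Processor` and the helper `table_rows` are hypothetical, and the sample values are the SiByte SB1250 figures listed on this page. Parameters the page does not give are left as `None` and printed as `?`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CacheLevel:
    name: str                             # e.g. "L1D", "L1I", "L2"
    size_kb: int                          # capacity in KB
    line_bytes: Optional[int] = None      # line (block) size
    ways: Optional[int] = None            # associativity
    latency_cycles: Optional[int] = None  # access latency
    shared: Optional[bool] = None         # shared between cores?


@dataclass
class Processor:
    name: str
    cores: int
    coherence: Optional[str]              # coherence protocol, if listed
    levels: List[CacheLevel]


def fmt(value):
    """Render a missing parameter as '?' instead of guessing it."""
    return "?" if value is None else str(value)


def table_rows(p: Processor) -> List[str]:
    """Return one pipe-separated row per cache level."""
    rows = []
    for lv in p.levels:
        rows.append(" | ".join([
            p.name, str(p.cores), lv.name, f"{lv.size_kb} KB",
            fmt(lv.line_bytes), fmt(lv.ways), fmt(lv.latency_cycles),
            fmt(lv.shared), fmt(p.coherence),
        ]))
    return rows


# Values taken from the SB1250 entry on this page.
sb1250 = Processor("SiByte SB1250", 2, "MESI", [
    CacheLevel("L1D", 32, line_bytes=32),
    CacheLevel("L1I", 32, line_bytes=32),
    CacheLevel("L2", 512, line_bytes=32, ways=4, shared=True),
])

for row in table_rows(sb1250):
    print(row)
```

Leaving unknown fields explicit makes it easy to see which parameters still need to be looked up for each processor.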


AMD

  • Opteron (server/workstation)

2 cores, 4/22/2005
L1: 64 KB (data) + 64 KB (instruction) per core
2-way associative data cache, two 64-bit operations per cycle, 3-cycle latency
2-way associative instruction cache
L2: 1 MB per core, 16-way associative

  • Athlon 64 X2 family

2 cores, 5/13/2005
L1: 64 KB (data) + 64 KB (instruction) per core
2-way associative data cache, two 64-bit operations per cycle, 3-cycle latency
2-way associative instruction cache
L2: up to 1 MB per core, 16-way associative

  • Athlon 64 FX (high-performance desktop)

L1: 64 KB (data) + 64 KB (instruction) per core
2-way associative data cache, two 64-bit operations per cycle, 3-cycle latency
2-way associative instruction cache
L2: up to 1 MB per core, 16-way associative

  • Turion 64 X2 (laptop)

L1: 256 KB total (128 KB per core)
L2: 1 MB total (512 KB per core)

ARM MPCore: container for ARM9 and ARM11 cores, for high-performance embedded and entertainment applications

Broadcom

  • SiByte SB1250

Two scalable MIPS cores, MESI coherence protocol
L1: 32 KB data + 32 KB instruction
Cache line (block): 32 bytes
L1-to-L1 latency: 28 to 36 cycles
L2: 512 KB shared, ECC, 4-way associative, 32-byte cache line
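The size, associativity, and line size listed for the SB1250 L2 determine how many sets it has and how an address splits into offset and index bits. The following is a minimal sketch of that arithmetic, assuming a single cache array indexed by plain bit selection with power-of-two dimensions; `cache_geometry` is a hypothetical helper name, not anything from Broadcom documentation.

```python
def cache_geometry(size_bytes, ways, line_bytes):
    """Return (sets, offset_bits, index_bits) for a set-associative cache.

    Assumes size, associativity, and line size give a power-of-two
    number of sets, so the bit counts are exact.
    """
    sets = size_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1  # log2(line_bytes)
    index_bits = sets.bit_length() - 1         # log2(sets)
    return sets, offset_bits, index_bits


# SB1250 L2 as listed above: 512 KB, 4-way, 32-byte lines.
sets, offset_bits, index_bits = cache_geometry(512 * 1024, 4, 32)
print(sets, offset_bits, index_bits)  # 4096 5 12
```

Under these assumptions the L2 has 4096 sets, so the low 5 address bits select a byte within the line and the next 12 bits select the set.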

  • SB1255

L1: 32 KB data + 32 KB instruction
Cache line (block): 32 bytes
L1-to-L1 latency: 28 to 36 cycles
L2: 512 KB shared, ECC, 4-way associative, 32-byte cache line

  • SB1455

L1: 32 KB data + 32 KB instruction
L2: 1 MB shared, 8-way associative, ECC protected

Cradle Technology

CT3400, CT3600: multi-core DSPs

CT3400

8 32-bit DSPs, 6 RISC-like CPUs
32 KB instruction cache, 64 KB local data memory

CT3600

2 quad DSPs, 8 DSPs per quad, 32 KB instruction cache per quad, 125K data memory per quad

Cavium Networks

  • Octeon

16 MIPS cores
CN38XX/CN36XX: 4 to 16 MIPS64 cores
L2: 1 MB shared, ECC protected (CN38XX); 512 KB (CN36XX)
L1: 32 KB instruction cache, 8 KB data cache, and 2 KB write buffer per MIPS core

IBM

  • Cell

Used in the PlayStation 3
PowerPC based
8 cores optimized for vector operations

  • Power4

First dual-core processor, 2000
L1: 64 KB instruction cache per CPU (128-byte line, direct-mapped, LRU)
    32 KB data cache per CPU (2-way, 128-byte line, LRU)
L2: 1440 KB, 8-way, 128-byte line
L3: 128 MB, 8-way, 512-byte line

  • Power4+

L1: 32 KB data (2-way set associative), 64 KB instruction (direct-mapped)
L2: 3 x 0.5 MB shared by both cores, 40 bytes/cycle per port
L3: 32 MB

  • Power5

Dual core
L1: 64 KB instruction (2-way, LRU), 32 KB data (4-way, LRU)
L2: 3 x 0.625 MB, 10-way, LRU, 128-byte line
L3: 36 MB, 12-way
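The Power5 L2 is listed as three slices rather than one array, so the total capacity and the per-slice organization follow from simple arithmetic. This is a rough worked example, assuming 1 MB = 1024 KB (so 0.625 MB is 640 KB) and that each slice is an independent 10-way array. The variable names are illustrative only.

```python
KB = 1024

slices = 3
slice_bytes = 640 * KB  # 0.625 MB per slice, assuming 1 MB = 1024 KB
ways = 10
line_bytes = 128

total_l2_kb = slices * slice_bytes // KB     # total L2 capacity in KB
lines_per_slice = slice_bytes // line_bytes  # cache lines held by one slice
sets_per_slice = lines_per_slice // ways     # sets in one 10-way slice

print(total_l2_kb, lines_per_slice, sets_per_slice)  # 1920 5120 512
```

Under these assumptions the three slices total 1920 KB (1.875 MB), and the unusual 10-way associativity is what gives each slice a power-of-two set count of 512.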

  • PowerPC 970MP (dual core)

Used in the Apple PowerMac
L1: 32 KB data (2-way associative, parity protected), 64 KB instruction (direct-mapped)
L2: 1 MB per core, ECC
Cache-coherency snooping protocol

Intel

  • Core2 Quad
  • Xeon Quad core

12/13/2006

  • Xeon 5000

L1: 16 KB data cache per core + 12K µops trace cache per core
L2: 2 MB per core

  • Xeon MP7000

L1: 16 KB data cache per core + 12K µops trace cache per core
L2: 1 MB per core or 2 x 2 MB per core

  • Xeon MP7300

L1: 32 KB data per core, 32 KB instruction per core
L2: up to 8 MB, snoop filter, 4 MB shared L2 per die

  • Core Duo

  • Core2 Duo

Xeon (x1xx) Dual core

  • Xeon 5100

L1: 32 KB (data) + 32 KB (instruction) per core
L2: 4 MB (shared)

  • Xeon 7100

L1: 16 KB data
L2: 2 MB, 8-way, ECC
L3: up to 16 MB, 16-way, ECC

  • Itanium 2

Multi-core (Montecito)
L1: 32 KB
L2: 256 KB
L3: up to 9 MB

PA-RISC

  • PA8800

L1: 1.5 MB data (4-way set associative), 1.5 MB instruction (4-way set associative), 2-cycle latency
L2: 32 MB

Stream Processors

  • Storm-1 family

40 to 80 ALUs

SP16HP-G220, SP16-G160, SP8-G80
L1: 32 KB data, 16 KB instruction
96 KB VLIW instruction memory

Sun Microsystems

  • UltraSPARC IV

Dual core
L1: 64 KB data, 32 KB instruction
L2: up to 16 MB external, 8 MB 2-way set associative per core
Cache line size changed from 512 to 128 bytes to reduce the data contention associated with a sub-blocked cache
LRU replacement policy, ECC
Write cache: 2 KB, hash-indexed
Prefetch cache: 2 KB
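The note about shrinking the line size from 512 to 128 bytes can be illustrated with a generic example: two data items that fall in the same large line contend with each other, while a smaller line may separate them. The sketch below shows only this general mechanism, not the actual UltraSPARC IV sub-blocking logic. The addresses and the helper names `line_number` and `share_line` are made up for illustration.

```python
def line_number(addr, line_bytes):
    """Index of the cache line that contains a byte address."""
    return addr // line_bytes


def share_line(addr_a, addr_b, line_bytes):
    """True if both addresses fall in the same cache line."""
    return line_number(addr_a, line_bytes) == line_number(addr_b, line_bytes)


# Two hypothetical variables 200 bytes apart, touched by different cores.
a = 0x1000
b = 0x1000 + 200

print(share_line(a, b, 512))  # True: both land in one 512-byte line
print(share_line(a, b, 128))  # False: 128-byte lines keep them apart
```

With 512-byte lines a write to either variable affects the line holding both, whereas with 128-byte lines the two cores work on different lines.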

  • UltraSPARC IV+

L2: 2 MB on-chip
L3: 32 MB external

  • UltraSPARC T1

8 cores
L1: 16 KB instruction cache per core, 4-way set associative, parity protected
    8 KB data cache, 4-way set associative, parity protected
L2: 3 MB, 12-way, 4 banks, ECC

Cache size and characteristics of single-core processors

Conclusion
