CSC/ECE 506 Fall 2007/wiki2 5 as
Revision as of 22:16, 24 September 2007
Cache size and characteristics of multi-core processors
The following listing summarizes the cache configurations of current multicore processors.
Create a table of caches used in current multicore architectures, including such parameters as number of levels, line size, size and associativity of each level, latency of each level, whether each level is shared, and coherence protocol used. Compare this with two or three recent single-core designs.
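The parameters named above (number of levels, line size, size and associativity per level, latency, sharing, coherence protocol) can be collected in a small record type before being laid out as a table. The following is only a minimal sketch: the class names `CacheLevel` and `Processor` and the helpers `show` and `render` are hypothetical, the two sample entries use figures from the listing below, and any parameter the listing does not state is left as `None` rather than guessed.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CacheLevel:
    """One cache level; fields not given in the listing stay None."""
    name: str                              # e.g. "L1D", "L1I", "L2"
    size_kb: int
    ways: Optional[int] = None             # associativity
    line_bytes: Optional[int] = None
    latency_cycles: Optional[int] = None
    shared: Optional[bool] = None          # shared between cores?


@dataclass
class Processor:
    vendor: str
    model: str
    cores: Optional[int] = None
    coherence: Optional[str] = None
    levels: List[CacheLevel] = field(default_factory=list)


def show(value):
    """Render a field for the table; unknown values become '?'."""
    if value is None:
        return "?"
    if isinstance(value, bool):
        return "yes" if value else "no"
    return str(value)


def render(processors):
    """Return a plain-text table with one row per cache level."""
    header = ["Processor", "Cores", "Level", "Size (KB)", "Ways",
              "Line (B)", "Latency", "Shared", "Coherence"]
    rows = [header]
    for p in processors:
        for lv in p.levels:
            rows.append([
                f"{p.vendor} {p.model}", show(p.cores), lv.name,
                show(lv.size_kb), show(lv.ways), show(lv.line_bytes),
                show(lv.latency_cycles), show(lv.shared), show(p.coherence),
            ])
    widths = [max(len(row[i]) for row in rows) for i in range(len(header))]
    return "\n".join(
        "  ".join(cell.ljust(w) for cell, w in zip(row, widths)).rstrip()
        for row in rows
    )


# Sample entries taken from the listing; "per core" is read as "not shared".
opteron = Processor("AMD", "Opteron", cores=2, levels=[
    CacheLevel("L1D", 64, ways=2, latency_cycles=3, shared=False),
    CacheLevel("L1I", 64, ways=2, shared=False),
    CacheLevel("L2", 1024, ways=16, shared=False),
])
sb1250 = Processor("Broadcom", "SiByte SB1250", cores=2, coherence="MESI", levels=[
    CacheLevel("L1D", 32, line_bytes=32),
    CacheLevel("L1I", 32, line_bytes=32),
    CacheLevel("L2", 512, ways=4, line_bytes=32, shared=True),
])

print(render([opteron, sb1250]))
```

Keeping unknown fields as `None` makes the gaps in the collected data visible in the rendered table instead of hiding them behind default values.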
AMD
- Opteron (server/workstation)
2 cores, 4/22/2005
L1: 64KB (data) + 64KB (instruction) per core
2-way associative data cache, two 64-bit operations per cycle, 3-cycle latency
2-way associative instruction cache
L2: 1MB per core, 16-way associative
- Athlon 64 X2 family
2 cores, 5/13/2005
L1: 64KB (data) + 64KB (instruction) per core
2-way associative data cache, two 64-bit operations per cycle, 3-cycle latency
2-way associative instruction cache
L2: up to 1MB per core, 16-way associative
- Athlon 64 FX (high-performance desktop)
L1: 64KB (data) + 64KB (instruction) per core
2-way associative data cache, two 64-bit operations per cycle, 3-cycle latency
2-way associative instruction cache
L2: up to 1MB per core, 16-way associative
- Turion 64 X2 (laptop)
L1: 256KB total (128KB per core)
L2: 1MB total (512KB per core)
ARM
- MPCore
Container for ARM9 and ARM11; high-performance embedded and entertainment
Broadcom
- SiByte SB1250
Two scalable MIPS cores, MESI
L1: 32K data + 32K instruction
Cache block (line) size: 32 bytes
L1-to-L1 latency: 28 to 36 cycles
L2: 512K shared, ECC, 4-way associative, 32-byte cache line
- SB1255
L1: 32K data + 32K instruction
Cache block (line) size: 32 bytes
L1-to-L1 latency: 28 to 36 cycles
L2: 512K shared, ECC, 4-way associative, 32-byte cache line
- SB1455
L1: 32K data + 32K instruction
L2: 1MB shared, 8-way associative, ECC protected
Cradle Technology
- CT3400 (multi-core DSP)
8 32-bit DSPs, 6 RISC-like CPUs
32K instruction cache, 64K local data memory
- CT3600 (multi-core DSP)
2 quad DSPs, 8 DSPs per quad, 32KB instruction cache per quad, 125K data memory per quad
Cavium Networks
- Octeon
16 MIPS cores
CN38XX/CN36XX: 4 to 16 MIPS64 cores
L2: 1M ECC-protected shared (CN38XX), 512K (CN36XX)
32K instruction cache / 8K data cache / 2K write buffer per MIPS core
IBM
- Cell
In the PlayStation 3
PowerPC based
8 cores optimized for vector operations
- Power4
First dual-core processor, 2000
L1: 64K per CPU instruction cache (128-byte line, direct-mapped, LRU)
32K per CPU data cache (2-way, 128-byte line, LRU)
L2: 1440K, 8-way, 128-byte line
L3: 128M, 8-way, 512-byte line
- Power4+
L1: 32K data (2-way set associative), 64K instruction (direct-mapped)
L2: 3 x 0.5M shared by the two cores, 40B/cycle per port
L3: 32M
- Power5
Dual core
L1: 64K instruction (2-way, LRU), 32K data (4-way, LRU)
L2: three 0.625M slices, 10-way, LRU, 128-byte line
L3: 36M, 12-way
- PowerPC 970MP (dual core)
In the Apple PowerMac
L1: 32K data (2-way associative with parity protection), 64K instruction (direct-mapped)
L2: 1M per core, ECC
Cache-coherency snooping protocol
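The size, associativity, and line size listed for a cache level determine how many sets it has: sets = size / (ways x line size). The sketch below works this out for three of the IBM entries above. The helper names `num_sets` and `log2_exact` are hypothetical, and the calculation assumes that "K" and "M" in the listing mean 2^10 and 2^20 bytes and that a direct-mapped cache is treated as 1-way.

```python
KB = 1024


def num_sets(size_bytes, ways, line_bytes):
    """Number of sets = size / (ways * line size); reject uneven splits."""
    lines, rem = divmod(size_bytes, line_bytes)
    if rem:
        raise ValueError("size is not a multiple of the line size")
    sets, rem = divmod(lines, ways)
    if rem:
        raise ValueError("line count is not a multiple of the associativity")
    return sets


def log2_exact(n):
    """Return log2(n) for a power of two, else raise ValueError."""
    if n <= 0 or n & (n - 1):
        raise ValueError(f"{n} is not a power of two")
    return n.bit_length() - 1


# Power4 L1 data cache: 32K, 2-way, 128-byte line
power4_l1d = num_sets(32 * KB, ways=2, line_bytes=128)      # 128 sets

# Power4 L1 instruction cache: 64K, direct-mapped (1-way), 128-byte line
power4_l1i = num_sets(64 * KB, ways=1, line_bytes=128)      # 512 sets

# Power5 L2: one 0.625M (640K) slice, 10-way, 128-byte line
power5_l2_slice = num_sets(640 * KB, ways=10, line_bytes=128)  # 512 sets

for label, sets in [("Power4 L1D", power4_l1d),
                    ("Power4 L1I", power4_l1i),
                    ("Power5 L2 slice", power5_l2_slice)]:
    # 128-byte lines need 7 offset bits; the set count gives the index bits.
    print(f"{label}: {sets} sets, "
          f"{log2_exact(128)} offset bits, {log2_exact(sets)} index bits")
```

Under these assumptions the unusual 0.625M slice size of the Power5 L2 falls out naturally: 10 ways of 128-byte lines over 512 sets gives 640K per slice.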
Intel
- Core2 Quad
- Xeon Quad core
12/13/2006
- Xeon 5000
L1: 16KB data cache per core + 12K µops trace cache per core
L2: 2MB per core
- Xeon MP7000
L1: 16KB data cache per core + 12K µops trace cache per core
L2: 1MB per core or 2x2MB per core
- Xeon MP7300
L1: 32K data per core, 32K instruction per core
L2: up to 8MB, snoop filter, 4M shared L2 per die
- Core Duo
- Core2 Duo
- Xeon (x1xx) Dual core
- Xeon 5100
L1: 32KB (data) + 32KB (instruction) per core
L2: 4MB (shared)
- Xeon 7100
L1: 16K data
L2: 2M, 8-way, ECC
L3: up to 16M, 16-way, ECC
- Itanium 2
Multi-core (Montecito)
L1: 32KB
L2: 256KB
L3: up to 9MB
PA-RISC
- PA8800
L1: 1.5M data (4-way set associative), 1.5M instruction (4-way set associative), 2-cycle latency
L2: 32M
Stream Processors
- Storm-1 family
40 to 80 ALUs
SP16HP-G220, SP16-G160, SP8-G80
L1: 32K data, 16K instruction
96K VLIW instruction memory
Sun Microsystems
- UltraSPARC IV
Dual core
L1: 64K data, 32K instruction
L2: up to 16MB external, 8M 2-way set associative per core
Cache line size changed from 512 to 128 bytes to reduce the data contention associated with a sub-blocked cache; LRU replacement policy, ECC
Write cache: hash-indexed, 2K
Prefetch cache: 2K
- UltraSPARC IV+
L2: 2M on-chip L3: 32M external
- UltraSPARC T1
8 cores
L1: 16K instruction cache per core, 4-way set associative, parity protected
8K data cache, 4-way set associative, parity protected
L2: 3M, 12-way, 4 banks, ECC
Cache size and characteristics of single-core processors
Conclusion
References
[1] http://www.amd.com/us-en/Processors/ProductInformation
[2] http://www.broadcom.com/products/Enterprise-Networking/Communications-Processors/BCM1250
[3] http://www-01.ibm.com/chips/techlib/techlib.nsf/products/PowerPC_970MP_Microprocessor
[5] http://www.sun.com/processors/UltraSPARC-IV/
[6] http://www.sun.com/processors/UltraSPARC-IV+/
[7] http://www.sun.com/processors/UltraSPARC-T1/specs.xml
[8] http://www.streamprocessors.com/streamprocessors/Home/Products/Storm-1Family.html
[9] http://www.netlib.org/utk/papers/advanced-computers/pa-risc.html
[10] http://www.netlib.org/utk/papers/advanced-computers/power4.html
[11] http://www.netlib.org/utk/papers/advanced-computers/power5.html