CSC/ECE 506 Fall 2007/wiki2 5 as: Difference between revisions

Latest revision as of 02:01, 29 September 2007

= Introduction =

Modern processors use several cache technologies to speed up processing and bridge the gap between CPU and memory speed. Because the structure of a cache affects its performance, choosing an appropriate structure is both important and difficult. For example, a bigger cache generally performs better. However, due to effects such as cache pollution, performance shows diminishing returns as the cache grows, so an appropriate cache size has to be chosen. With this in mind, it is worth looking at the cache characteristics of modern processors.

A cache structure is determined by a few parameters, such as cache size, associativity, replacement algorithm, and cache line size. With the introduction of multi-core processors, cache coherency also becomes an issue, and the choice of coherence protocol, such as MESI or MOESI, affects performance. This page lists several cache parameters for modern multicore processors as well as for a few single-core processors.
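
To make the relationship between these parameters concrete, the following is a minimal sketch (not taken from any vendor documentation) of how cache size, line size, and associativity together fix the number of sets and the address bits used for the line offset and the set index. The function name `cache_geometry` is hypothetical, and the sketch assumes all three parameters are powers of two. The example values are the Athlon 64 X2 figures listed on this page.

```python
def cache_geometry(size_bytes, line_bytes, ways):
    """Return (sets, offset_bits, index_bits) for a set-associative cache.

    Assumes size_bytes, line_bytes and ways are all powers of two.
    """
    lines = size_bytes // line_bytes          # total number of cache lines
    sets = lines // ways                      # lines are grouped into sets of `ways`
    offset_bits = line_bytes.bit_length() - 1 # log2(line size): byte within a line
    index_bits = sets.bit_length() - 1        # log2(sets): which set an address maps to
    return sets, offset_bits, index_bits


KB = 1024
MB = 1024 * KB

# Athlon 64 X2 L1 data cache: 64 KB, 64-byte lines, 2-way
l1 = cache_geometry(64 * KB, 64, 2)
# Athlon 64 X2 L2 cache (largest variant): 1 MB, 64-byte lines, 16-way
l2 = cache_geometry(1 * MB, 64, 16)

print("L1 (sets, offset bits, index bits):", l1)
print("L2 (sets, offset bits, index bits):", l2)
```

Under these assumptions, the remaining high-order address bits form the tag. Raising associativity at a fixed size reduces the number of sets, which is one reason size, line size, and associativity cannot be chosen independently.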

= Cache sizes in multicore architectures =

''Topic'' - Create a table of caches used in current multicore architectures, including such parameters as number of levels, line size, size and associativity of each level, latency of each level, whether each level is shared, and coherence protocol used. Compare this with two or three recent single-core designs.


{| border="1" cellpadding="5" cellspacing="0" align="center"
|+'''Detail of Caches'''
|-
! colspan="6" style="background:#ffdead;" | Multicore Processors
|-
! Processor Name
! Number of Levels
! Line Size
! Cache Size
! Associativity
! Coherence Protocol
|-
| AMD Athlon 64 X2
| 2
| 64 bytes (for both L1 & L2)
| L1 - 64 KB (data) + 64 KB (instruction) per core<br/>L2 - 512 KB to 1 MB per core
| L1 - 2-way (data and instruction caches)<br/>L2 - 16-way
| Modified Owner Exclusive Shared Invalid (MOESI)
|-
| AMD Athlon 64 FX
| 2
| 64 bytes (for both L1 & L2)
| L1 - 64 KB (data) + 64 KB (instruction) per core<br/>L2 - 1 MB per core
| L1 - 2-way (data and instruction caches)<br/>L2 - 16-way
| Modified Owner Exclusive Shared Invalid (MOESI)
|-
| AMD Opteron<br/>(marketed for servers)
| 2
| 64 bytes (for both L1 & L2)
| L1 - 64 KB (data) + 64 KB (instruction) per core<br/>L2 - 1 MB per core
| L1 - 2-way (data and instruction caches)<br/>L2 - 16-way
| Modified Owner Exclusive Shared Invalid (MOESI)
|-
| Intel Pentium D
| 2
| L1 - 64 bytes<br/>L2 - 128 bytes
| L1 - 16 KB (data only; a "150KB trace cache" is used instead of an instruction cache)<br/>L2 - 1 MB or 2 MB per core
| L1 - 4-way<br/>L2 - 8-way
| Modified Exclusive Shared Invalid (MESI)
|-
| Intel Pentium Dual Core
| 2
| L1 - 64 bytes<br/>L2 - 64 bytes
| L1 - 32 KB each for data and instruction caches<br/>L2 - 1 MB or 2 MB per core
| L1 - 4-way<br/>L2 - 8-way
| Modified Exclusive Shared Invalid (MESI)
|-
| Intel Core 2 Duo
| 2
| L1 - 64 bytes<br/>L2 - 64 bytes
| L1 - 32 KB each for data and instruction caches<br/>L2 - 2 MB or 4 MB
| L1 - 4-way<br/>L2 - 8-way
| Modified Exclusive Shared Invalid (MESI)
|-
| Broadcom SiByte SB1250
| 2
| L1 - 32 bytes<br/>L2 - 32 bytes
| L1 - 32 KB each for data and instruction caches<br/>L2 - 512 KB
| L1 - 2-way<br/>L2 - 4-way
| Modified Exclusive Shared Invalid (MESI)
|-
| Sun Microsystems UltraSPARC IV
| 2
| L1 - 128 bytes<br/>L2 - 128 bytes
| L1 - 64 KB data, 32 KB instruction<br/>L2 - up to 16 MB
| L2 - 2-way
| Modified Owner Exclusive Shared Invalid (MOESI)
|-
| IBM Cell Processor
| 2
| Not available
| L1 - 32 KB each for data and instruction caches<br/>L2 - 512 KB
| L1 - 2-way instruction, 4-way data<br/>L2 - 8-way
| Modified Exclusive Shared Invalid (MESI)
|-
! colspan="6" style="background:#ffdead;" | Single-core Processors
|-
| AMD Athlon 64
| 2
| L1 - 64 bytes<br/>L2 - 64 bytes
| L1 - 64 KB each for data and instruction caches<br/>L2 - 512 KB
| L1 - 2-way<br/>L2 - 16-way
| Modified Owner Exclusive Shared Invalid (MOESI)
|-
| AMD K6 / K6 III
| 2
| L1 - 32 bytes<br/>L2 - 32 bytes
| L1 - 32 KB data, 32 KB instruction<br/>L2 - 256 KB
| L1 - 2-way<br/>L2 - 4-way
| Modified Exclusive Shared Invalid (MESI)
|-
| Intel Pentium 4
| 2
| L1 - 64 bytes<br/>L2 - 128 bytes
| L1 - 8 KB (data only; a "150KB trace cache" is used instead of an instruction cache)<br/>L2 - 256 KB, 512 KB or 1 MB
| L1 - 4-way<br/>L2 - 8-way
| Modified Exclusive Shared Invalid (MESI)
|-
| Intel Pentium III 600
| 2
| L1 - 32 bytes<br/>L2 - 32 bytes
| L1 - 16 KB data, 16 KB instruction<br/>L2 - 256 KB
| L1 - 4-way<br/>L2 - 8-way
| Modified Exclusive Shared Invalid (MESI)
|}

= Conclusion =

Most processors introduced recently have 32 KB or 64 KB L1 data and instruction caches with 2-, 4-, or 8-way set associativity, although other sizes and associativities are possible. Cache lines are typically 64 or 128 bytes, although the table also lists 32-byte lines for the SiByte SB1250 and for older single-core designs such as the K6 and Pentium III. MESI and MOESI are the prevailing cache coherence protocols.

The table above shows little difference between the cache specifications of multi-core and single-core processors.
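
The comparison can be checked mechanically. The following is a rough sketch that copies the line sizes and coherence protocols from the table above into a list and summarizes the distinct values per group. The tuple layout and the `summarize` helper are illustrative assumptions, and the IBM Cell row is left out because its line size is listed as not available.

```python
# (name, group, L1 line bytes, L2 line bytes, protocol), copied from the table on this page.
# The IBM Cell row is omitted because its line size is listed as not available.
ROWS = [
    ("AMD Athlon 64 X2", "multi", 64, 64, "MOESI"),
    ("AMD Athlon 64 FX", "multi", 64, 64, "MOESI"),
    ("AMD Opteron", "multi", 64, 64, "MOESI"),
    ("Intel Pentium D", "multi", 64, 128, "MESI"),
    ("Intel Pentium Dual Core", "multi", 64, 64, "MESI"),
    ("Intel Core 2 Duo", "multi", 64, 64, "MESI"),
    ("Broadcom SiByte SB1250", "multi", 32, 32, "MESI"),
    ("Sun UltraSPARC IV", "multi", 128, 128, "MOESI"),
    ("AMD Athlon 64", "single", 64, 64, "MOESI"),
    ("AMD K6 / K6 III", "single", 32, 32, "MESI"),
    ("Intel Pentium 4", "single", 64, 128, "MESI"),
    ("Intel Pentium III 600", "single", 32, 32, "MESI"),
]


def summarize(rows, group):
    """Return (sorted distinct line sizes, sorted distinct protocols) for one group."""
    selected = [r for r in rows if r[1] == group]
    line_sizes = sorted({size for r in selected for size in (r[2], r[3])})
    protocols = sorted({r[4] for r in selected})
    return line_sizes, protocols


for group in ("multi", "single"):
    line_sizes, protocols = summarize(ROWS, group)
    print(group, "line sizes:", line_sizes, "protocols:", protocols)
```

With the values as listed, both groups cover the same set of line sizes and the same two protocols, which is consistent with the observation that the two groups differ little on these parameters. The sketch says nothing about latency or sharing, which the table does not record.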
